Weirdest network symptom for me in a long time... An ssh -v
to one of our remote hosts would hang every time with:
debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
The only hint in the apparently-most-relevant online discussion
is that it might have something to do with some network
interface's MTU (Maximum Transmission Unit).
Lacking any better ideas, we changed the MTU on the SSH server
(with ifconfig eth0 mtu 512) -- note: a value wildly lower
than what should be required -- and it worked! SSH connections
could be established.
Except: only just barely. The first time the remote end tried to
send any amount of data -- e.g. a VNC screen refresh -- the
connection would wedge, and that would be that. (I cannot
explain this data point: I would expect the very low MTU setting
to ensure that no big packets ever got out, and so no wedging
could happen on that account.)
On the weak assumption that our problem was something to do with
MTUs, I proceeded to learn a fair bit more than I wanted to know.
Every network link has an MTU, and there are on-the-order-of a
dozen links between Here and There in any network interaction.
An MTU of 1500 is standard for Ethernet (modulo Jumbo Frames),
but you get MTUs in the high 1400s for various network
technologies (1500 minus "some bytes for our special stuff").
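To make the "1500 minus some bytes" arithmetic concrete, here is a
small Python sketch. The two encapsulations are only illustrative
examples (PPPoE and basic GRE over IPv4), the function names are
made up for this post, and the MSS figure assumes plain IPv4 and
TCP headers with no options.

```python
ETHERNET_MTU = 1500

# Per-packet encapsulation overhead in bytes (illustrative examples only).
OVERHEAD = {
    "pppoe": 8,   # 6-byte PPPoE header + 2-byte PPP protocol field
    "gre": 24,    # 20-byte outer IPv4 header + 4-byte basic GRE header
}


def effective_mtu(link_type, base_mtu=ETHERNET_MTU):
    """MTU left for the inner packet after the link's own overhead."""
    return base_mtu - OVERHEAD[link_type]


def tcp_mss(mtu):
    """Largest TCP payload per packet: MTU minus 20 (IPv4) minus 20 (TCP)."""
    return mtu - 40


for name in OVERHEAD:
    mtu = effective_mtu(name)
    print(f"{name}: MTU {mtu}, TCP MSS {tcp_mss(mtu)}")
```

Both results land in the high 1400s, which is why a healthy path
should never need an MTU anywhere near 512.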
Needing an MTU under 1000 means "something is not right".
What happens if you send a packet that is too big to make the
next network hop? In theory (for IPv4, at least), the router is
allowed to "fragment" the packet, and send on the pieces. This
sort of malarkey is frowned on, and any TCP stack doing Path MTU
Discovery (Linux does so by default) sends its packets with the
DF (Don't Fragment) flag set.
So what happens with a DF-flagged too-big packet? Well, the
device at the edge of the too-small-MTU chasm is supposed to drop
it and send back an ICMP "Fragmentation Needed" message (type 3,
code 4), which normally reports the MTU of the next hop, so that
the originator can re-send in smaller bundles.
A whole bunch of things can go wrong. The device might simply
fail to send the required ICMP packet. Any of the network
devices on the way back to the originator might choose to drop
the ICMP packet ("Security, you know."). The host firewall -- or
the one on the last-hop consumer router -- stands a high chance
of Doing The Wrong Thing.
Net effect: Big packets go, but then just drop off the end of the
world. Re-sending doesn't help because, odds are, they will
follow the same route and drop no less ignominiously. Eventually
the connection will time out.
Your packets are said to have fallen into an MTU black hole.
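The failure mode is easy to reproduce in a toy model. The Python
sketch below is a simulation only, not real networking code: the
path is just a list of made-up link MTUs, and the icmp_delivered
flag stands in for whether the "Fragmentation Needed" message
survives the trip back to the sender.

```python
def send(packet_size, link_mtus, icmp_delivered):
    """Push one DF-flagged packet down a simulated path.

    Returns (outcome, mtu_hint), where outcome is "delivered",
    "frag_needed" (the ICMP error came back) or "lost" (silence).
    """
    for mtu in link_mtus:
        if packet_size > mtu:
            if icmp_delivered:
                return ("frag_needed", mtu)
            return ("lost", None)
    return ("delivered", None)


def transfer(packet_size, link_mtus, icmp_delivered, max_tries=5):
    """Retry until delivered; returns (status, final_size, tries)."""
    size = packet_size
    for attempt in range(1, max_tries + 1):
        outcome, hint = send(size, link_mtus, icmp_delivered)
        if outcome == "delivered":
            return ("delivered", size, attempt)
        if outcome == "frag_needed":
            size = hint  # sender shrinks its packets to fit
        # On "lost" the sender learns nothing and retries at the same size.
    return ("timed out", size, max_tries)


path = [1500, 1492, 1500]  # hypothetical link MTUs between Here and There

print(transfer(1500, path, icmp_delivered=True))
print(transfer(1500, path, icmp_delivered=False))
print(transfer(512, path, icmp_delivered=False))
```

With the ICMP message getting through, the sender recovers on the
second try. Without it, big packets time out while small ones
still sail through, which matches the "connects, then wedges on
the first real data" flavor of symptom.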
Along the way, I learned that you can probe network MTUs by sending
ping packets of various sizes. (As with most networking problems,
ping and traceroute are your friends.) In Linux-speak, you
type...
ping -M do -c 2 -s <n> remote.example.com
...varying <n>. So, for instance, if
ping -M do -c 2 -s 1465 remote.example.com
balks with Frag needed and DF set, but
ping -M do -c 2 -s 1464 remote.example.com
does not, then 1464+28=1492 is the least MTU between here and there.
(The -s value counts only the ping payload; the 28 bytes are the
20-byte IPv4 header plus the 8-byte ICMP header, which you need
to add back in.)
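Varying the size by hand gets old quickly, so the search can be
automated. The Python sketch below is one way to do it, under
some assumptions: the function names are made up for this post,
ping_fits shells out to the Linux (iputils) ping with the same
flags as above plus a -W timeout, and the binary search assumes
that if a given size gets through then every smaller size does
too. The demo at the bottom uses a simulated probe so that it
runs without touching the network.

```python
import subprocess

PING_OVERHEAD = 28  # 20-byte IPv4 header + 8-byte ICMP header


def ping_fits(host, payload):
    """True if one DF-flagged ping with this payload size gets a reply.

    Assumes Linux iputils ping. A failure can also mean ordinary packet
    loss or blocked ICMP echo, so treat a single result with suspicion.
    """
    cmd = ["ping", "-M", "do", "-c", "1", "-W", "2", "-s", str(payload), host]
    return subprocess.run(cmd, capture_output=True).returncode == 0


def largest_payload(fits, lo=0, hi=1472):
    """Binary-search the largest payload in [lo, hi] for which fits() is True."""
    if not fits(lo):
        return None  # not even the smallest probe gets through
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


def simulated_probe(payload):
    """Stand-in for ping_fits: pretend the path MTU is 1492."""
    return payload + PING_OVERHEAD <= 1492


best = largest_payload(simulated_probe)
print("largest payload:", best, "-> path MTU:", best + PING_OVERHEAD)
```

For a real run, something like
largest_payload(lambda n: ping_fits("remote.example.com", n))
should do the same search against the actual path, at the cost
of about a dozen pings.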
Back to our own sad case. Besides trying our fabulously-low MTU
of 512, we did all the other typical things, namely power-cycling
all the networking doodahs around the office. In desperation, we
even did a Linux reboot (a mistake). Nothing helped.
But when we woke up next morning and tried our stuff, everything
was fine. Completely baffling.