Weirdest network symptom for me in a long time... An ssh -v
to one of our remote hosts would hang every time with:
debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
The only hint in the apparently-most-relevant online discussion is that it might have something to do with some network interface's MTU (Maximum Transmission Unit).
Lacking any better ideas, we changed the MTU on the SSH server
(with ifconfig eth0 mtu 512) -- note: a value wildly lower
than what should be required -- and it worked! SSH connections
went through.
Except: only just barely. The first time the remote end tried to send any amount of data -- e.g. a VNC screen refresh -- the connection would wedge, and that would be that. (I cannot explain this data point: I would expect the very low MTU setting to ensure that no big packets ever got out, and so no wedging could happen on that account.)
On the weak assumption that our problem was something to do with MTUs, I proceeded to learn a fair bit more than I wanted to know.
Every network link has an MTU, and there are on-the-order-of a dozen links between Here and There in any network interaction. An MTU of 1500 is standard for Ethernet (modulo Jumbo Frames), but you get MTUs in the high 1400s for various network technologies (1500 minus "some bytes for our special stuff"). Needing an MTU under 1000 means "something is not right".
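For what it's worth, here is a rough sketch of how one might eyeball the MTUs of the local interfaces on a Linux box, flagging anything under that 1000 rule of thumb. It assumes the kernel exposes each interface's MTU at /sys/class/net/<iface>/mtu; the function name list_mtus is made up for illustration. It only covers this machine's own interfaces, not the dozen links in between.

```shell
#!/usr/bin/env bash
# list_mtus [DIR]: print "<iface> <mtu>" for each interface found under DIR
# (default: /sys/class/net, where Linux exposes per-interface settings).
# Appends a note when the MTU is below 1000, the rule of thumb used above.
list_mtus() {
    local dir=${1:-/sys/class/net} path mtu note
    for path in "$dir"/*; do
        # Skip entries that have no readable mtu file.
        [ -r "$path/mtu" ] || continue
        mtu=$(cat "$path/mtu")
        note=""
        if [ "$mtu" -lt 1000 ]; then
            note=" (suspiciously low)"
        fi
        echo "${path##*/} $mtu$note"
    done
}
```

Calling list_mtus with no arguments reads the real interfaces; passing a directory is only there to make the function easy to try out against fake data.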
What happens if you send a packet that is too big to make the next network hop? In theory, the network device is allowed to "fragment" the packet, and send on the pieces. I believe this sort of malarkey is frowned on, and many (most? (all?)) TCP packets are sent with the DF (Don't Fragment) flag set.
So what happens with a DF-flagged too-big packet? Well, the device at the edge of the too-small-MTU chasm is supposed to drop it and send back an ICMP "Fragmentation Needed" packet telling of the problem, so that the originator can re-send in smaller bundles.
A whole bunch of things can go wrong. The device might simply fail to send the required ICMP packet. Any of the network devices on the way back to the originator might choose to drop the ICMP packet ("Security, you know."). The host firewall -- or the one on the last-hop consumer router -- stands a high chance of Doing The Wrong Thing.
Net effect: Big packets go, but then just drop off the end of the world. Re-sending doesn't help because, odds are, they will follow the same route and drop no less ignominiously. Eventually the connection will time out.
Your packets are said to have fallen into an MTU black hole.
Along the way, I learned that you can probe network MTUs by sending
various-sized ping packets. (As in most networking problems,
ping and traceroute are your friends.) In Linux-speak, you
type...
ping -M do -c 2 -s <n> remote.example.com
...varying <n>. So, for instance, if
ping -M do -c 2 -s 1465 remote.example.com
balks with Frag needed and DF set, but
ping -M do -c 2 -s 1464 remote.example.com
does not, then 1464+28=1492 is the smallest MTU on the path between here and there.
(The 28 bytes are the 20-byte IP header plus the 8-byte ICMP header, which ping's -s size does not count, so you need to add them back in.)
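Varying the size by hand gets old quickly, so here is a minimal sketch of automating the search with a binary search over the payload size. It assumes the Linux iputils ping (the -M do, -W and -s options) and IPv4; the function names probe, max_payload and path_mtu are made up for illustration. Note the obvious weakness: any lost ping counts as "too big", so a flaky link will give a pessimistic answer, and if even the smallest size fails the result is meaningless.

```shell
#!/usr/bin/env bash
# probe HOST SIZE: succeeds if one DF-flagged ping carrying SIZE payload
# bytes gets a reply within 2 seconds. Assumes Linux iputils ping.
probe() {
    ping -M do -c 1 -W 2 -s "$2" "$1" >/dev/null 2>&1
}

# max_payload HOST [LOW] [HIGH]: binary-search the largest payload size in
# LOW..HIGH (default 0..1472) for which probe succeeds. Assumes that LOW
# itself passes and that every size above the true limit fails.
max_payload() {
    local host=$1 lo=${2:-0} hi=${3:-1472} mid
    while [ "$lo" -lt "$hi" ]; do
        # Round up so the loop always makes progress when hi = lo + 1.
        mid=$(( (lo + hi + 1) / 2 ))
        if probe "$host" "$mid"; then
            lo=$mid
        else
            hi=$(( mid - 1 ))
        fi
    done
    echo "$lo"
}

# path_mtu PAYLOAD: add back the 20-byte IPv4 header and 8-byte ICMP header.
path_mtu() {
    echo $(( $1 + 28 ))
}
```

With those defined, something like path_mtu "$(max_payload remote.example.com)" would print the estimated path MTU. The default upper bound of 1472 corresponds to a plain 1500-byte Ethernet MTU; raise it if you have Jumbo Frames in play.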
Back to our own sad case. Besides trying our fabulously-low MTU of 512, we did all the other typical things, namely power-cycling all the networking doodahs around the office. In desperation, we even did a Linux reboot (a mistake). Nothing helped.
But when we woke up next morning and tried our stuff, everything was fine. Completely baffling.
