Intermittent loss of network connectivity

I’ve been pulling my hair out trying to debug this for several days now, and I’m just stumped. I’m running Arch Linux and have a pretty vanilla installation of both the OS and Docker. When I first boot the machine and start Docker, everything works. After some time, however, I lose networking within my containers. Eventually it comes back, then goes away, then a few hours later comes back again… and then, of course, goes away.

I’ve gone through all the usual suspects. I’m not even worrying about DNS yet, because when the networking fails I can’t even ping the bridge network interface on the host via its IP (172.17.0.1).

So to be clear:

  • When I boot my machine and start docker, then run a container, I can ping 172.17.0.1 from inside the container fine.

  • At some point the network fails and the same ping operation just hangs with no output (no failure; just hangs).

  • If I leave it running and come back to it later, I’ll see that at some point the ping operation started working again, and usually after another period of time, failed again.

Restarting the Docker daemon does nothing, even if I stop the daemon, remove the bridge interface (it usually doesn’t go away on its own), and then restart the daemon, which recreates the interface.

If the networking failed all the time, I’d at least have a place to start, but the fact that it works intermittently is what is throwing me for a loop. This tells me that the basics are in place, but that somehow they are being changed or corrupted (and eventually fixed for a time).

I’m not even sure where to go next. Any help would be massively appreciated. My job involves a ton of Docker work and right now I’m being completely blocked.

So, two days with no progress, and an hour after posting this I finally have a clue. After restarting my computer completely, the networking in the container usually works, so I tried that and, on a whim, viewed the routes:

$ ip route 
default via 192.168.0.1 dev enp5s0 proto static metric 100 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown 
192.168.0.0/24 dev enp5s0 proto kernel scope link src 192.168.0.15 metric 100

After a while when networking failed, here’s what the same command shows:

# ip route
default via 192.168.0.1 dev enp5s0 proto static metric 100 
192.168.0.0/24 dev enp5s0 proto kernel scope link src 192.168.0.15 metric 100

The docker0 route is missing. This makes sense given the symptoms I experience and I don’t know why it didn’t occur to me to check this before.

Unfortunately, this still doesn’t help me understand why this route is disappearing, but at least it’s something new I can search for. As before, any ideas would be appreciated.
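
For anyone chasing the same symptom, one way to catch the moment the route vanishes is to poll the routing table and check for the docker0 entry. The sketch below is a minimal helper, assuming the default 172.17.0.0/16 subnet and the `ip route` output format shown above; `has_docker_route` is a made-up name, not part of Docker or iproute2.

```shell
#!/bin/sh
# Hypothetical helper: reads `ip route` style text on stdin and returns 0
# if it contains a docker0 route for the given subnet, 1 otherwise.
has_docker_route() {
    awk -v s="$1" '
        $1 == s && $2 == "dev" && $3 == "docker0" { found = 1 }
        END { exit !found }
    '
}

# Example use against the live table (assumes iproute2 is installed):
#   ip route | has_docker_route 172.17.0.0/16 || echo "docker0 route missing"
```

Running `ip monitor route` in a second terminal prints route additions and deletions as they happen, which may help tie the disappearance to whatever else changed at that moment.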

Well alrighty then. Please ignore all of this. I finally figured it out. I work remotely from my company’s main office and only occasionally connect to the corporate VPN. Apparently they recently added a few new routes to the (huge) set of routes that are created when I establish the VPN connection. One of those was, you guessed it, 172.17.0.0/16. So yeah, that’s a problem.

There are a number of ways around this, including just having Docker create its local network on a different IP range.
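
As a rough sketch of that option, the daemon configuration file (typically /etc/docker/daemon.json) accepts a `bip` key for the docker0 bridge address and a `default-address-pools` key for user-defined networks. The 10.200.x and 10.201.x ranges here are placeholders I picked for illustration, not values from my setup; choose ranges that neither your LAN nor your VPN routes.

```json
{
  "bip": "10.200.0.1/24",
  "default-address-pools": [
    { "base": "10.201.0.0/16", "size": 24 }
  ]
}
```

The daemon needs a restart after the change (for example `systemctl restart docker`) before docker0 picks up the new address.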

Freaking VPN… This kind of stuff is WHY I only rarely make the connection.