DNS issue due to race condition between DHCP and Docker?

<Edited 5/26 11:31 with better steps to reproduce>
Description of issue
I’ve been trying to fix an issue where, on some devices in the field, DNS works on the host but not in the container. This only happens on a small percentage of the devices, all of which have local DNS servers on the LAN and 8.8.8.8/8.8.4.4 blocked. For example, I can curl something from the host successfully, but the same curl command fails in the container due to a DNS lookup failure.

From what I’ve been able to piece together, the host and container resolv.conf files don’t match when this failure shows up (I’ve seen this in 100% of the failures). When I run into the problem, any of the following gets the resolv.conf files back in sync and makes the DNS issue go away: a reboot, a Docker service restart, or a container restart. My pet theory is that DHCP is slow to set up a lease on these devices and Docker starts up before DHCP finishes setting up DNS, though I have been unable to confirm this. That would result in Docker using a default resolv.conf that then never catches up with the host.
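For what it’s worth, the Docker docs say that when the host’s /etc/resolv.conf has no usable nameservers (empty, or only localhost entries), the engine substitutes 8.8.8.8/8.8.4.4 as defaults, which would explain exactly the container output below. If my theory is right, something like this should demonstrate the fallback on a scratch box (busybox is just a stand-in image; don’t do this on a field device):

sudo cp /etc/resolv.conf /tmp/resolv.conf.bak    # save the real file
sudo sh -c ': > /etc/resolv.conf'                # simulate "DHCP hasn't written DNS yet"
docker run --rm busybox cat /etc/resolv.conf     # should show Docker's 8.8.8.8/8.8.4.4 defaults
sudo cp /tmp/resolv.conf.bak /etc/resolv.conf    # restore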

Any ideas on how to make sure that the container tracks DHCP changes in the OS?

Here is an example of a device in a problem state. Note that the first resolv.conf is from the host; the container’s copy that follows does not match.

[REMOVED]:~ $ cat /etc/resolv.conf
# Generated by resolvconf
nameserver 192.168.101.9
[REMOVED]:~ $ docker exec -it [REMOVED] cat /etc/resolv.conf
# Generated by resolvconf

nameserver 8.8.8.8
nameserver 8.8.4.4

Given that I can see the resolv.conf mismatch, I restart the container. After the restart, the resolv.conf files match.

[REMOVED]:~ $ docker restart [REMOVED]
[REMOVED]
[REMOVED]:~ $ cat /etc/resolv.conf
# Generated by resolvconf
nameserver 192.168.101.9
[REMOVED]:~ $ docker exec -it [REMOVED] cat /etc/resolv.conf
# Generated by resolvconf
nameserver 192.168.101.9

How do I ensure that Docker and the host keep in sync with regards to network setup?
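For anyone chasing the same thing, a quick one-liner along these lines will flag the mismatch on a device (container name is a placeholder):

diff /etc/resolv.conf <(docker exec mycontainer cat /etc/resolv.conf) && echo "in sync" || echo "MISMATCH"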

OS Version/build

[REMOVED]:~ $ cat /etc/os-release
PRETTY_NAME="Raspbian GNU/Linux 8 (jessie)"
NAME="Raspbian GNU/Linux"
VERSION_ID="8"
VERSION="8 (jessie)"
ID=raspbian
ID_LIKE=debian
HOME_URL="http://www.raspbian.org/"
SUPPORT_URL="http://www.raspbian.org/RaspbianForums"
BUG_REPORT_URL="http://www.raspbian.org/RaspbianBugs"

App version

[REMOVED]:~ $ docker version
Client:
Version: 18.06.3-ce
API version: 1.38
Go version: go1.10.3
Git commit: d7080c1
Built: Wed Feb 20 02:34:35 2019
OS/Arch: linux/arm
Experimental: false

Server:
Engine:
Version: 18.06.3-ce
API version: 1.38 (minimum version 1.12)
Go version: go1.10.3
Git commit: d7080c1
Built: Wed Feb 20 02:30:23 2019
OS/Arch: linux/arm
Experimental: false

Steps to reproduce
I don’t have a clean reproduction of this. Like I said above, I think this is a race condition between DHCP setting up DNS and Docker setting up the container version of resolv.conf.
Edit:
I’ve been able to reproduce something that looks extremely similar to what I see in the field.

  • Baseline: start with a working system where the OS and container resolv.conf are in sync
  • Edit /etc/resolv.conf to use another DNS server (8.8.8.8 on my network, vs. my default 192.168.1.1)
  • Unplug rpi power
  • Unplug rpi Ethernet
  • Plug in power; note the time
  • Wait a few minutes; this simulates a slow local DHCP server
  • Plug in Ethernet; note the time. Ideally, resolv.conf would be updated on the OS and then Docker would update the container’s resolv.conf. Docker does not appear to update the container’s resolv.conf in this case
  • Check the timestamps on the OS and container /etc/resolv.conf; the OS copy should be more recent, as seen with ls -l
  • Compare the OS and container /etc/resolv.conf contents. The container’s resolv.conf will still have the “modified” DNS server of 8.8.8.8, but the OS will have had its DNS updated back to 192.168.1.1 (a small check script is sketched below)
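To make that last comparison less error-prone, a small script along these lines works (container name is a placeholder):

#!/bin/sh
# Compare host and container resolv.conf and report sync state.
CONTAINER=mycontainer    # placeholder for the real container name
ls -l /etc/resolv.conf                              # host timestamp
docker exec "$CONTAINER" ls -l /etc/resolv.conf     # container timestamp
if docker exec "$CONTAINER" cat /etc/resolv.conf | diff -q - /etc/resolv.conf >/dev/null; then
    echo "resolv.conf IN SYNC"
else
    echo "resolv.conf MISMATCH"
fi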

How I start the container
docker run --name [REMOVED] -v /home/pi/certs:/usr/src/app/certs/ -p [REMOVED]:[REMOVED] -dit --log-opt mode=non-blocking --log-opt max-size=10M --log-opt max-file=5 --restart unless-stopped [REMOVED]

I found this link that made me hopeful that the problem I am seeing was fixed in a newer version. I upgraded a device to the latest Docker version and retested. Same problem.

docker version
Client: Docker Engine - Community
Version: 19.03.15
API version: 1.40
Go version: go1.13.15
Git commit: 99e3ed8919
Built: Sat Jan 30 03:18:42 2021
OS/Arch: linux/arm
Experimental: false

Server: Docker Engine - Community
Engine:
Version: 19.03.15
API version: 1.40 (minimum version 1.12)
Go version: go1.13.15
Git commit: 99e3ed8919
Built: Sat Jan 30 03:16:47 2021
OS/Arch: linux/arm
Experimental: false
containerd:
Version: 1.2.13
GitCommit: 7ad184331fa3e55e52b890ea95e65ba581ae3429
runc:
Version: 1.0.0-rc10
GitCommit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
docker-init:
Version: 0.18.0
GitCommit: fec3683

The possible fix above didn’t help. I did figure out that I could keep the container’s resolv.conf in sync by:

  • Adding a read-only volume mount for /etc: -v /etc:/hostetc:ro
  • Making a cron job in the container that runs cp -u /hostetc/resolv.conf /etc/resolv.conf

This will get the container in sync with the host every time the cron job runs.
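Put together, the workaround looks roughly like this (container/image names and the schedule are placeholders from my setup):

# Start the container with the host's /etc mounted read-only:
docker run --name mycontainer -v /etc:/hostetc:ro -dit myimage

# Crontab entry inside the container, copying the file over once a minute;
# cp -u only copies when the source is newer, so it's cheap when nothing changed:
* * * * * cp -u /hostetc/resolv.conf /etc/resolv.conf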

I found another solution that I like better than the cron job. I set up a user-defined Docker network first

docker network create mynetworkname

Then I do a docker run using that network

docker run --net mynetworkname
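With my original flags from above, the full command comes out to something like this (name, image, and ports are placeholders, as before):

docker network create mynetworkname
docker run --name mycontainer --net mynetworkname -v /home/pi/certs:/usr/src/app/certs/ -p HOSTPORT:CONTAINERPORT -dit --log-opt mode=non-blocking --log-opt max-size=10M --log-opt max-file=5 --restart unless-stopped myimage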

This runs, but I ran into another problem. On the system I was working on, dnsmasq was being used, with the service being started and stopped at runtime. This would get the Docker network “out of sync” and unable to do DNS resolution. So I worked with the team responsible for implementing dnsmasq this way, and we found a way to not start and stop it. Bottom line: the Docker network approach worked, aside from the bad interaction with dnsmasq; I had originally discarded the idea only because it did not cope with dnsmasq starting and stopping. I don’t expect many others to run into that, but I thought it was worth saying that the Docker network solved my problem here and would likely solve others’ problems in similar circumstances.
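As for why the user-defined network helps, my understanding from the Docker DNS docs (treat this as an assumption about these versions) is that on a user-defined network the container’s resolv.conf points at Docker’s embedded DNS server instead of a frozen copy of the host file, and the daemon forwards lookups to whatever upstream servers the host currently has, so DHCP changes get picked up without a container restart. This is easy to verify (placeholder container name again):

docker exec mycontainer cat /etc/resolv.conf
# expect: nameserver 127.0.0.11   (Docker's embedded DNS)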