Intermittent DNS Resolving Issues

Expected behavior

DNS should resolve consistently within a container every time.

Actual behavior

Occasionally DNS fails to resolve, with no apparent pattern, which causes failures during a docker build.

Information

OS X: version 10.11.4 (build: 15E65)
Docker.app: version v1.11.0-beta7
Running diagnostic tests:
[OK]      docker-cli
[OK]      Moby booted
[OK]      driver.amd64-linux
[OK]      vmnetd
[OK]      osxfs
[OK]      db
[OK]      slirp
[OK]      menubar
[OK]      environment
[OK]      Docker
[OK]      VT-x
Docker logs are being collected into /tmp/20160419-171424.tar.gz
Most specific failure is: No error was detected
Your unique id is: B252C346-DC86-40B3-AFE8-FF88F8B6571C
Please quote this in all correspondence.

I’ve created a repository with the Dockerfile and other files I’ve used to reproduce this issue.

Steps to reproduce the behavior

  1. git clone git@github.com:ecliptik/docker-beta-mac-ruby.git
  2. docker build --no-cache -t docker-beta-mac-ruby .
  3. Look for Gem::RemoteFetcher::UnknownHostError: no such name errors

More Information

This issue first appeared after turning off VPN Compatibility Mode from the Docker App settings because I wanted to use the docker.local DNS alias instead of finding out the IP of the docker VM to test the rails app on port 3000.

After I turned this setting off, all docker commands stopped working (see my related comment in Docker becomes unresponsive - #12 by steel). To try to bring Docker back up properly I’ve done the following:

  1. Rebooted the Mac
  2. Reset Docker to Factory Defaults
  3. Re-installed from scratch

Now Docker only works if the VPN Setting is checked.

I can also reproduce an intermittent DNS issue from a container shell with the same repository. The same dig command from OSX does not produce any timeouts.

  1. git clone git@github.com:ecliptik/docker-beta-mac-ruby.git
  2. docker build -f Dockerfile.shell --no-cache -t docker-beta-mac-ruby .
  3. docker run -it --rm docker-beta-mac-ruby
  4. while true; do dig rubygems.org +short; sleep 1; done

This does a lookup on rubygems.org every second, and occasionally a lookup times out. Here is some example output:

root@d3eede8e3d0a:/app# cat /etc/resolv.conf 
search local
nameserver 192.168.64.1

root@d3eede8e3d0a:/app# while true; do dig rubygems.org +short; sleep 1; done
54.186.104.15
54.186.104.15

; <<>> DiG 9.9.5-9+deb8u6-Debian <<>> rubygems.org +short
;; global options: +cmd
;; connection timed out; no servers could be reached
54.186.104.15
54.186.104.15
54.186.104.15
54.186.104.15
54.186.104.15
54.186.104.15
54.186.104.15
54.186.104.15
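
To put a number on how often lookups fail, the loop above can be wrapped in a small counter. This is only a sketch: the function name `count_failures` is made up for illustration, and it treats any non-zero exit status as a failure. With dig that should catch timeouts (dig exits non-zero when no server replies), but not NXDOMAIN answers, which dig reports with exit status 0.

```shell
# count_failures N CMD [ARGS...]
# Hypothetical helper: runs CMD N times and prints how many runs
# exited with a non-zero status. Output of CMD itself is discarded.
count_failures() {
    cf_total=$1
    shift
    cf_failed=0
    cf_i=0
    while [ "$cf_i" -lt "$cf_total" ]; do
        if ! "$@" >/dev/null 2>&1; then
            cf_failed=$((cf_failed + 1))
        fi
        cf_i=$((cf_i + 1))
    done
    echo "$cf_failed"
}
```

Inside the container, something like `count_failures 60 dig rubygems.org +short` would then print the number of failed lookups out of 60. Unlike the loop above, it does not sleep between attempts.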

Getting a similar issue when building an Erlang image.

Running mix local.hex will give the error,

** (Mix) httpc request failed with: {:failed_connect, [{:to_address, {'s3.amazonaws.com', 80}}, {:inet, [:inet], :nxdomain}]}

Could not install Hex because Mix could not download metadata at http://s3.amazonaws.com/s3.hex.pm/installs/hex-1.x.csv.

However, when running a shell in the failed container image and using wget to download the file from s3.amazonaws.com, it works.

root@cec59804718d:/usr/local/maru# wget http://s3.amazonaws.com/s3.hex.pm/installs/hex-1.x.csv
--2016-04-21 18:16:31--  http://s3.amazonaws.com/s3.hex.pm/installs/hex-1.x.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.33.202
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.33.202|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 426 [text/csv]
Saving to: ‘hex-1.x.csv’

hex-1.x.csv                             100%[===============================================================================>]     426  --.-KB/s   in 0s     

2016-04-21 18:16:32 (57.0 MB/s) - ‘hex-1.x.csv’ saved [426/426]

To make this even more frustrating, running these commands back to back shows wget succeeding and mix failing on the exact same DNS name. It appears that, depending on which resolver call is used, name resolution will or won’t succeed.

Here are some example commands showing inconsistent behavior when run within the container:

  • ping does not work
root@cec59804718d:/usr/local/maru# ping s3.amazonaws.com
ping: unknown host
  • getent does not work
root@cec59804718d:/usr/local/maru# getent hosts s3.amazonaws.com
root@cec59804718d:/usr/local/maru#
  • dig works
root@cec59804718d:/usr/local/maru# dig s3.amazonaws.com A +short
54.231.18.104
s3-1.amazonaws.com.
  • nslookup works
root@cec59804718d:/usr/local/maru# nslookup s3.amazonaws.com
Server:		192.168.65.1
Address:	192.168.65.1#53

Non-authoritative answer:
Name:	s3-1.amazonaws.com
Address: 54.231.112.155
s3.amazonaws.com	canonical name = s3-1.amazonaws.com.
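
One plausible reading of the split above is that ping and getent resolve through the C library (NSS, as configured in /etc/nsswitch.conf), while dig and nslookup send the query to the nameserver themselves. The sketch below checks both paths for a single name so the two results can be compared side by side. The function name `compare_resolvers` is hypothetical, and treating empty `dig +short` output as a failure is an assumption of this sketch.

```shell
# compare_resolvers HOST
# Hypothetical helper: prints "libc ok|fail" and "dns ok|fail" for HOST.
compare_resolvers() {
    cr_host=$1
    # getent resolves through the C library's NSS lookup.
    if getent hosts "$cr_host" >/dev/null 2>&1; then
        echo "libc ok"
    else
        echo "libc fail"
    fi
    # dig queries the nameserver directly; a non-zero exit status or
    # empty +short output is counted as a failure here.
    cr_answer=$(dig +short "$cr_host" A 2>/dev/null)
    if [ $? -eq 0 ] && [ -n "$cr_answer" ]; then
        echo "dns ok"
    else
        echo "dns fail"
    fi
}
```

Running `compare_resolvers s3.amazonaws.com` in the failing container would be expected to show the libc path failing while the direct DNS path succeeds, matching the output above.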

I too am seeing intermittent DNS issues from inside a container shell, though I’ve not seen it while building. When testing some updates to gems or node modules and running bundle install or npm install #### I will get errors about resolving the host. Adding the proper lines to /etc/hosts allows the commands to finish so it’s not networking in general, strictly DNS for some reason. I can resolve the names on my host with no issue.
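
For containers started with docker run, the /etc/hosts workaround does not have to be applied by hand: the `--add-host` flag writes an entry into the container's /etc/hosts at start-up. A minimal sketch follows. The wrapper name `run_with_pinned_host` is made up for illustration, and a pinned address goes stale if the service changes its IP.

```shell
# run_with_pinned_host NAME IP IMAGE [CMD...]
# Hypothetical wrapper: starts a container with NAME pinned to IP
# in /etc/hosts, so that name no longer depends on DNS.
run_with_pinned_host() {
    rp_name=$1
    rp_ip=$2
    shift 2
    docker run --rm -it --add-host "$rp_name:$rp_ip" "$@"
}
```

For example, `run_with_pinned_host rubygems.org 54.186.104.15 docker-beta-mac-ruby` uses the address and image name that appear earlier in this thread.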

Information

OS X: version 10.11.4 (build: 15E65)
Docker.app: version v1.11.0-beta8.2
Running diagnostic tests:
[OK]      docker-cli
[OK]      Moby booted
[OK]      driver.amd64-linux
[OK]      vmnetd
[OK]      osxfs
[OK]      db
[OK]      slirp
[OK]      menubar
[OK]      environment
[OK]      Docker
[OK]      VT-x
Docker logs are being collected into /tmp/20160428-111808.tar.gz
Most specific failure is: No error was detected
Would you like to upload log files? [Y/n]: Y

Your unique id is: 609D836B-2C26-4410-A777-8D58D6D5A557

#Dockerfile
FROM ruby:2.2

# Install apt based dependencies required to run Rails as
# well as RubyGems. As the Ruby image itself is based on a
# Debian image, we use apt-get to install those.
RUN apt-get update && apt-get install -y \
      build-essential \
      postgresql-client 

# Configure the main working directory. This is the base
# directory used in any further RUN, COPY, and ENTRYPOINT
# commands.
RUN mkdir -p /app
WORKDIR /app

# Copy the Gemfile as well as the Gemfile.lock and install
# the RubyGems. This is a separate step so the dependencies
# will be cached unless changes to one of those two files
# are made.
COPY Gemfile Gemfile.lock ./
RUN gem install bundler && bundle install --jobs 20 --retry 5

RUN wget http://nodejs.org/dist/v5.7.0/node-v5.7.0-linux-x64.tar.gz
RUN tar -C /usr/local --strip-components 1 -xzf node-v5.7.0-linux-x64.tar.gz

WORKDIR /tmp
COPY package.json ./
RUN npm install  && npm install -g gulp-cli

WORKDIR /app
# Copy the main application.
COPY . ./

# Expose port 3000 to the Docker host, so we can access it
# from the outside.
EXPOSE 3000

I am seeing something similar when running a process that reads from a DB and makes many external HTTP requests.

It works for 15-20 minutes, and then it can’t resolve hosts, which affects other containers thereafter.

Seems to happen both with & without vpn compatibility mode enabled.

Using 1.11.0-build8

I’m also running into strange DNS issues:

$ docker run --rm -ti ubuntu /bin/bash
root@1deb208b60c9:/# ping teste.database.windows.net
ping: unknown host teste.database.windows.net

I can resolve this on the host:

$ ping teste.database.windows.net
PING data.cq1-1.database.windows.net (23.97.112.44): 56 data bytes

Diagnostics messages:
OS X: version 10.11.4 (build: 15E65)
Docker.app: version v1.11.0-beta8.2
Running diagnostic tests:
[OK]      docker-cli
[OK]      Moby booted
[OK]      driver.amd64-linux
[OK]      vmnetd
[OK]      osxfs
[OK]      db
[OK]      slirp
[OK]      menubar
[OK]      environment
[OK]      Docker
[OK]      VT-x
Docker logs are being collected into /tmp/20160428-180228.tar.gz
Most specific failure is: No error was detected
Would you like to upload log files? [Y/n]: Y

Your unique id is: D449C40B-50D9-4341-9476-952831C23F72
Please quote this in all correspondence.

It seems to be something specific to resolving canonical names (CNAME records).

# ping andre.cabine.org
ping: unknown host andre.cabine.org

Beta 9 seems to have solved this issue.

I just updated to beta9 as well and things are running much more smoothly.

DNS within containers works 100% of the time
Exposed ports are available on localhost

I switched to multiple VPN configurations (no VPN, OSX VPN, and Tunnelblick) and everything seems to still work.

The only downside is that pulls seem much slower now. I’m not sure if it’s network or storage IO related, but I can now use Docker Beta and don’t have to fall back to docker-machine.

Version 1.11.1-beta10 (build: 6662)

While running multiple debian:wheezy Docker builds, I noticed that the wget in the Dockerfile started to hang. The embedded DNS server is no longer responding. Stopping and restarting Docker fixes it. Resolving against a public DNS service (e.g. 8.8.4.4) works.

A workaround for an image might then be to point resolv.conf at something public if this were a prod build, but it’s beta, so let’s keep letting it lock up :)

Session:

% screen ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/tty

docker login: root
Welcome to the Moby alpha, based on Alpine Linux.
docker:~# host www.google.com
^Cdocker:~# host www.google.com 8.8.8.8
Using domain server:
Name: 8.8.8.8
Address: 8.8.8.8#53
Aliases:

www.google.com has address 216.177.189.177
www.google.com has address 216.177.189.151

Unfortunately, I still see the exact same issue as @ecliptik, but not on all networks, strangely.

At work, things behave as expected; at home, they do not. No other application has any networking issues at home, but something seems to be different.

I’m having the issue with Rubygems as well, where it’s actually impossible to build an image because some gem download errors out during every single build I try.

I’m also seeing the same issue. It is literally making it impossible for me to build an image that requires a bundle install on some networks. It always fails when trying to fetch one of the gems from rubygems.org. Has anyone found a way to fix this without having to set a different DNS server on the container? I’m using Version 1.12.0-rc3-beta18 (build: 9996)

This is still an issue with beta19
Diagnostic ID: D4F0D98E-8F73-4F24-B486-7A43E779721D

My specific case is running a python ‘pip install’ command in a container, installing a series of packages read from a file (pip install -r). The packages are installed from github.com. When running pip install multiple times, at random points during the pip install run, installation of a package will fail due to a failure to resolve github.com. Successive runs of pip install will fail on different packages. I.e., the failure is not related to the installation of a specific package. Occasionally, the pip install will successfully make it through installing all packages specified in the file, but this is a rarity.

Another data point: Using ‘screen’ to access the Mac VM directly (screen ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/tty) and then attempting to ping github.com will produce occasional failures resolving github.com (bad address ‘github.com’)

Dig also intermittently fails from directly within the VM:

moby:~# dig github.com

; <<>> DiG 9.10.4-P1 <<>> github.com
;; global options: +cmd
;; connection timed out; no servers could be reached
moby:~#

This is definitely still an issue. I hoped moving to Docker for Mac would resolve it, but alas it’s still there. This particular bug makes working on any sort of Ruby project particularly painful.

Is it normal behavior to try to resolve the DNS name for every request? Why isn’t a cache being used? That seems likely to be the underlying issue here.

Also an issue in Docker for Windows beta30.

Step 5 : RUN mix local.hex --force
 ---> Running in 2e197ac4d8de
** (Mix) httpc request failed with: {:failed_connect, [{:to_address, {'repo.hex.pm', 443}}, {:inet, [:inet], :nxdomain}]}

Could not install Hex because Mix could not download metadata at https://repo.hex.pm/installs/hex-1.x.csv.signed.

I’m seeing this too but only after installing the latest beta (1.12.3-beta30).

+1 with @anthonysmith - I’ve upgraded to beta30 and now builds that worked on beta28 are failing with name resolution failures

I was able to get around this by manually setting my container’s DNS to Google’s DNS servers
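
The post does not say how the DNS servers were set. One way to do it for docker run is the `--dns` flag, which can be repeated and adds each address as a nameserver in the container's /etc/resolv.conf. The following is a sketch under that assumption; the wrapper name `run_with_google_dns` is hypothetical.

```shell
# run_with_google_dns IMAGE [CMD...]
# Hypothetical wrapper: starts a container that resolves through
# Google's public DNS servers instead of the default nameserver.
run_with_google_dns() {
    docker run --rm -it --dns 8.8.8.8 --dns 8.8.4.4 "$@"
}
```

For example, `run_with_google_dns ubuntu /bin/bash`. Names that only the local network's DNS server knows about will not resolve inside a container started this way.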

I also tried this by adding Google’s DNS to /etc/resolv.conf. This fixed DNS for external hosts, but then DNS for hosts on my local network (needed for accessing source code control) stopped working. As I understand it, the resolver only ever uses the first name server in resolv.conf (so long as that name server responds OK).

For now I’ve reverted to the stable channel of Docker for Mac and all my builds are working again.

My issue by the way was that pypi.python.org would not resolve, breaking all my Python installs.

I have had the same issue with rubygems, where the failure was on api.rubygems.org. This only seemed to occur when using an ubuntu:12.04 base image. If I use the ruby base image (based on debian iirc) it seems to work fine.

I’m also seeing this issue since installing 1.12.3-beta30, although this was a full reinstall rather than an upgrade.

I was unable to build a Rails application image that was installing gems from our private source. I was able to build the image once I changed the URL in the Gemfile to the EC2 server address.

However, when running the container I’m getting errors when making some HTTP calls. The errors appear as “Name or service not known”, so it seems like a similar DNS issue. The routes that I know are affected are services that are hosted on AWS and have DNS defined through Route 53.