Docker Community Forums

Share and learn in the Docker community.

Docker DNS resolution slow for HTTP calls - ipv6 bridge network issue?

Ive narrowed down an issue where it appears the DNS resolution in my docker containers is causing a delay of around 5seconds sending out HTTP post calls to an API.

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.1 LTS"

Client:
 Version:           18.09.0
 API version:       1.39
 Go version:        go1.10.4
 Git commit:        4d60db4
 Built:             Wed Nov  7 00:48:57 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.0
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.4
  Git commit:       4d60db4
  Built:            Wed Nov  7 00:16:44 2018
  OS/Arch:          linux/amd64
  Experimental:     false

My services are lifted with docker-compose 3.7 and a user defined bridge network.

version: '3.7'

networks:
  web:
    driver: bridge
    internal: false
  app:
    driver: bridge
    internal: true
  db:
    driver: bridge
    internal: true

With the particular container using all 3 networks and it is running “FROM node:11.10.1-stretch” so a debian based image, whilst the hosts are ubuntu 18.04.

Through testing, I have the following outputs from a simple ping within a container. As you can see the ping time is fine but its the dns resolution which takes up the time.

time docker run busybox ping -c 1 api.telegram.org
WARNING: Error loading config file: /home/xxx/.docker/config.json: stat /home/xxx/.docker/config.json: permission denied
PING api.telegram.org (149.154.167.220): 56 data bytes
64 bytes from 149.154.167.220: seq=0 ttl=49 time=122.568 ms

--- api.telegram.org ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 122.568/122.568/122.568 ms

real	0m9.498s
user	0m0.124s
sys	0m0.340s

Outside of container with the host it’s near instant.

time ping -c 1 api.telegram.org
PING api.telegram.org (149.154.167.220) 56(84) bytes of data.
64 bytes from 149.154.167.220: icmp_seq=1 ttl=50 time=109 ms

--- api.telegram.org ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 109.608/109.608/109.608/0.000 ms

real	0m0.136s
user	0m0.000s
sys	0m0.008s

Whilst the above outputs are just a ping test, the same issues hit on nslookup:

Within containers:

root@a25854d3e60e:/usr/src/app# /usr/bin/dig api.telegram.org

; <<>> DiG 9.10.3-P4-Debian <<>> api.telegram.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 63318
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;api.telegram.org.              IN      A

;; ANSWER SECTION:
api.telegram.org.       82      IN      A       149.154.167.220

;; Query time: 4027 msec
;; SERVER: 127.0.0.11#53(127.0.0.11)
;; WHEN: Mon Mar 04 18:50:19 UTC 2019
;; MSG SIZE  rcvd: 61

Outside containers:

dig api.telegram.org

; <<>> DiG 9.11.3-1ubuntu1.5-Ubuntu <<>> api.telegram.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 9385
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;api.telegram.org.		IN	A

;; ANSWER SECTION:
api.telegram.org.	216	IN	A	149.154.167.220

;; Query time: 25 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Mon Mar 04 18:52:52 GMT 2019
;; MSG SIZE  rcvd: 61

I have previously been using node-alpine image but after some research I found that the muscl lib has a history of poor performance for DNS resolution. I’ve switched to node-slim which contains more of the jessie base image. This hasn’t provided the improvements I was hoping for.

Today I spent some time narrowing down my API calls in the code to get some timings.

I noticed the following behaviour:

  • On first call to the API I hit around 4 seconds delay (not good)
  • On immediate 2nd call I have around 60ms delay (perfect!)
  • If I leave the application alone for a few minutes and then hit the API again I find the same 4 second delay.
  • If I keep hitting the api, after that initial 4 second delay I consistently receive the 60ms speeds I would expect.

The initial delay will be the resolution, but once this has been done I believe the agent keep alive (which appears to be 10s by default in underlying libs) connection is ensuring the ongoing hits are responding in good time.

Metrics:

I’m thinking that theres a couple of things here:

  1. DNS resolution is just damn slow to start with. I think this might be due to using my own bridge network in docker and some form of ipv6 failings.
  2. Cache isn’t lasting long (10 seconds at an application level), but this is fine if the resolution is quicker.

Listening on port 53 is systemd-resolve still so I went the route of symlink fix as per comment #2187 (comment) but this did not change anything.

I’ve checked strace on port 53:

sudo tcpdump -vvv -s 0 -l -n port 53
tcpdump: listening on wlp2s0, link-type EN10MB (Ethernet), capture size 262144 bytes
05:35:12.442852 IP (tos 0x0, ttl 64, id 48895, offset 0, flags [DF], proto UDP (17), length 60)
    192.168.1.116.59581 > 192.168.1.141.53: [bad udp cksum 0x848b -> 0x24d4!] 57480+ A? localtunnel.me. (32)
05:35:12.442904 IP (tos 0x0, ttl 64, id 48896, offset 0, flags [DF], proto UDP (17), length 60)
    192.168.1.116.59581 > 192.168.1.141.53: [bad udp cksum 0x848b -> 0xa6a5!] 24220+ AAAA? localtunnel.me. (32)
05:35:12.455301 IP (tos 0x0, ttl 64, id 47819, offset 0, flags [DF], proto UDP (17), length 76)
    192.168.1.141.53 > 192.168.1.116.59581: [udp sum ok] 57480 q: A? localtunnel.me. 1/0/0 localtunnel.me. [13m50s] A 138.197.63.247 (48)
05:35:12.465921 IP (tos 0x0, ttl 64, id 47820, offset 0, flags [DF], proto UDP (17), length 60)
    192.168.1.141.53 > 192.168.1.116.59581: [udp sum ok] 24220 q: AAAA? localtunnel.me. 0/0/0 (32)
05:35:12.522916 IP (tos 0x0, ttl 63, id 62222, offset 0, flags [DF], proto UDP (17), length 62)
    192.168.1.116.39734 > 10.0.0.2.53: [bad udp cksum 0xcc59 -> 0x9375!] 1710+ A? api.telegram.org. (34)
05:35:16.523303 IP (tos 0x0, ttl 63, id 16594, offset 0, flags [DF], proto UDP (17), length 62)
    192.168.1.116.53416 > 8.8.8.8.53: [bad udp cksum 0xd267 -> 0x57f5!] 1710+ A? api.telegram.org. (34)
05:35:16.543163 IP (tos 0x0, ttl 119, id 49810, offset 0, flags [none], proto UDP (17), length 78)
    8.8.8.8.53 > 192.168.1.116.53416: [udp sum ok] 1710 q: A? api.telegram.org. 1/0/0 api.telegram.org. [2m29s] A 149.154.167.220 (50)

Here I can see the AAAA? v6 request and then the timeout for the v4’s

Anyone come across this before and any ideas on how to fix? Im pretty stuck at this point with lots of conflicting info when searching.

Updates:

So I messed around with this some more today. I edited the containers resolv.conf to:

nameserver 8.8.4.4
options timeout:1

This allowed me to ping google.com and looks to be pretty instant. However, because the nameserver is not pointing to the default docker dns for my bridge network I cant get any connections to my other containers.

I’ve then made a few changes whilst the container was running to have the following resolv.conf

root@8f65fb4f4210:/etc# cat resolv.conf
nameserver 127.0.0.11
options timeout:1
nameserver 8.8.4.4

This results in a faster response (1.1second) instead of the 5, but its clear that this is due to the timeout I’ve added and also forcing the google dns to be ipv4 only.