Ive narrowed down an issue where it appears the DNS resolution in my docker containers is causing a delay of around 5seconds sending out HTTP post calls to an API.
DISTRIB_ID=Ubuntu DISTRIB_RELEASE=18.04 DISTRIB_CODENAME=bionic DISTRIB_DESCRIPTION="Ubuntu 18.04.1 LTS" Client: Version: 18.09.0 API version: 1.39 Go version: go1.10.4 Git commit: 4d60db4 Built: Wed Nov 7 00:48:57 2018 OS/Arch: linux/amd64 Experimental: false Server: Docker Engine - Community Engine: Version: 18.09.0 API version: 1.39 (minimum version 1.12) Go version: go1.10.4 Git commit: 4d60db4 Built: Wed Nov 7 00:16:44 2018 OS/Arch: linux/amd64 Experimental: false
My services are lifted with docker-compose 3.7 and a user defined bridge network.
version: '3.7' networks: web: driver: bridge internal: false app: driver: bridge internal: true db: driver: bridge internal: true
With the particular container using all 3 networks and it is running “FROM node:11.10.1-stretch” so a debian based image, whilst the hosts are ubuntu 18.04.
Through testing, I have the following outputs from a simple ping within a container. As you can see the ping time is fine but its the dns resolution which takes up the time.
time docker run busybox ping -c 1 api.telegram.org WARNING: Error loading config file: /home/xxx/.docker/config.json: stat /home/xxx/.docker/config.json: permission denied PING api.telegram.org (126.96.36.199): 56 data bytes 64 bytes from 188.8.131.52: seq=0 ttl=49 time=122.568 ms --- api.telegram.org ping statistics --- 1 packets transmitted, 1 packets received, 0% packet loss round-trip min/avg/max = 122.568/122.568/122.568 ms real 0m9.498s user 0m0.124s sys 0m0.340s
Outside of container with the host it’s near instant.
time ping -c 1 api.telegram.org PING api.telegram.org (184.108.40.206) 56(84) bytes of data. 64 bytes from 220.127.116.11: icmp_seq=1 ttl=50 time=109 ms --- api.telegram.org ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 109.608/109.608/109.608/0.000 ms real 0m0.136s user 0m0.000s sys 0m0.008s
Whilst the above outputs are just a ping test, the same issues hit on nslookup:
root@a25854d3e60e:/usr/src/app# /usr/bin/dig api.telegram.org ; <<>> DiG 9.10.3-P4-Debian <<>> api.telegram.org ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 63318 ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1 ;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 512 ;; QUESTION SECTION: ;api.telegram.org. IN A ;; ANSWER SECTION: api.telegram.org. 82 IN A 18.104.22.168 ;; Query time: 4027 msec ;; SERVER: 127.0.0.11#53(127.0.0.11) ;; WHEN: Mon Mar 04 18:50:19 UTC 2019 ;; MSG SIZE rcvd: 61
dig api.telegram.org ; <<>> DiG 9.11.3-1ubuntu1.5-Ubuntu <<>> api.telegram.org ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 9385 ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1 ;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 65494 ;; QUESTION SECTION: ;api.telegram.org. IN A ;; ANSWER SECTION: api.telegram.org. 216 IN A 22.214.171.124 ;; Query time: 25 msec ;; SERVER: 127.0.0.53#53(127.0.0.53) ;; WHEN: Mon Mar 04 18:52:52 GMT 2019 ;; MSG SIZE rcvd: 61
I have previously been using node-alpine image but after some research I found that the muscl lib has a history of poor performance for DNS resolution. I’ve switched to node-slim which contains more of the jessie base image. This hasn’t provided the improvements I was hoping for.
Today I spent some time narrowing down my API calls in the code to get some timings.
I noticed the following behaviour:
- On first call to the API I hit around 4 seconds delay (not good)
- On immediate 2nd call I have around 60ms delay (perfect!)
- If I leave the application alone for a few minutes and then hit the API again I find the same 4 second delay.
- If I keep hitting the api, after that initial 4 second delay I consistently receive the 60ms speeds I would expect.
The initial delay will be the resolution, but once this has been done I believe the agent keep alive (which appears to be 10s by default in underlying libs) connection is ensuring the ongoing hits are responding in good time.
I’m thinking that theres a couple of things here:
- DNS resolution is just damn slow to start with. I think this might be due to using my own bridge network in docker and some form of ipv6 failings.
- Cache isn’t lasting long (10 seconds at an application level), but this is fine if the resolution is quicker.
Listening on port 53 is systemd-resolve still so I went the route of symlink fix as per comment #2187 (comment) but this did not change anything.
I’ve checked strace on port 53:
sudo tcpdump -vvv -s 0 -l -n port 53 tcpdump: listening on wlp2s0, link-type EN10MB (Ethernet), capture size 262144 bytes 05:35:12.442852 IP (tos 0x0, ttl 64, id 48895, offset 0, flags [DF], proto UDP (17), length 60) 192.168.1.116.59581 > 192.168.1.141.53: [bad udp cksum 0x848b -> 0x24d4!] 57480+ A? localtunnel.me. (32) 05:35:12.442904 IP (tos 0x0, ttl 64, id 48896, offset 0, flags [DF], proto UDP (17), length 60) 192.168.1.116.59581 > 192.168.1.141.53: [bad udp cksum 0x848b -> 0xa6a5!] 24220+ AAAA? localtunnel.me. (32) 05:35:12.455301 IP (tos 0x0, ttl 64, id 47819, offset 0, flags [DF], proto UDP (17), length 76) 192.168.1.141.53 > 192.168.1.116.59581: [udp sum ok] 57480 q: A? localtunnel.me. 1/0/0 localtunnel.me. [13m50s] A 126.96.36.199 (48) 05:35:12.465921 IP (tos 0x0, ttl 64, id 47820, offset 0, flags [DF], proto UDP (17), length 60) 192.168.1.141.53 > 192.168.1.116.59581: [udp sum ok] 24220 q: AAAA? localtunnel.me. 0/0/0 (32) 05:35:12.522916 IP (tos 0x0, ttl 63, id 62222, offset 0, flags [DF], proto UDP (17), length 62) 192.168.1.116.39734 > 10.0.0.2.53: [bad udp cksum 0xcc59 -> 0x9375!] 1710+ A? api.telegram.org. (34) 05:35:16.523303 IP (tos 0x0, ttl 63, id 16594, offset 0, flags [DF], proto UDP (17), length 62) 192.168.1.116.53416 > 188.8.131.52.53: [bad udp cksum 0xd267 -> 0x57f5!] 1710+ A? api.telegram.org. (34) 05:35:16.543163 IP (tos 0x0, ttl 119, id 49810, offset 0, flags [none], proto UDP (17), length 78) 184.108.40.206.53 > 192.168.1.116.53416: [udp sum ok] 1710 q: A? api.telegram.org. 1/0/0 api.telegram.org. [2m29s] A 220.127.116.11 (50)
Here I can see the AAAA? v6 request and then the timeout for the v4’s
Anyone come across this before and any ideas on how to fix? Im pretty stuck at this point with lots of conflicting info when searching.
So I messed around with this some more today. I edited the containers resolv.conf to:
nameserver 18.104.22.168 options timeout:1
This allowed me to ping google.com and looks to be pretty instant. However, because the nameserver is not pointing to the default docker dns for my bridge network I cant get any connections to my other containers.
I’ve then made a few changes whilst the container was running to have the following resolv.conf
root@8f65fb4f4210:/etc# cat resolv.conf nameserver 127.0.0.11 options timeout:1 nameserver 22.214.171.124
This results in a faster response (1.1second) instead of the 5, but its clear that this is due to the timeout I’ve added and also forcing the google dns to be ipv4 only.