Dockerd not fully operational: Docker DNS service cannot resolve names on a Docker network, and several docker commands hang

  • This is a bug report
  • This is a feature request
  • I searched existing issues before opening this one

Expected behavior

Containers should keep running without interruption. The Docker daemon should be fully operational, and the Docker DNS service should resolve container names. Commands such as docker inspect, docker network ls, docker network inspect, docker pull, and docker run should work for all containers and networks.

Actual behavior

Containers show as up and running, but docker exec fails on a few of them. The Docker daemon is not fully operational. The Docker DNS service cannot resolve names on the Docker network, although the containers are reachable by IP. docker inspect, docker network ls, docker network inspect, and docker info hang. docker pull and docker run also get stuck.
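
To tell a hung command apart from one that merely fails, each command can be run under a time limit. The following is a minimal sketch, assuming GNU coreutils timeout is available on the host. The probe helper and the 10-second limit are illustrative choices, not part of Docker.

```shell
#!/usr/bin/env bash
# probe <seconds> <command...>
# Runs the command under a time limit and prints one status line.
# "probe" is a hypothetical helper name used only for this sketch.
probe() {
    local limit="$1"
    shift
    local rc=0
    timeout "$limit" "$@" >/dev/null 2>&1 || rc=$?
    case "$rc" in
        0)   echo "OK: $*" ;;
        124) echo "HUNG: $*" ;;          # GNU timeout exits with 124 when the limit is reached
        *)   echo "FAILED ($rc): $*" ;;
    esac
}

# Example: check the commands mentioned in this report.
# probe 10 docker info
# probe 10 docker network ls
# probe 10 docker inspect f2300b0687a7
```

A HUNG line means the command was still running when the limit expired, which matches the behavior described here. A FAILED line means the command returned on its own with an error.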

Steps to reproduce the behavior

Not known. A product based on a microservices architecture (using Docker containers) was deployed and left unattended for 4 days under more than medium load. After the 4th day, the product was observed behaving unusually.

Further analysis showed that the Docker daemon was not fully operational. The containers are attached to a custom network and address each other by private names, but the Docker DNS service could not resolve those names. Containers could be reached by private IP but not by name.

root@6f920e2b3a24:/# curl -k https://172.22.0.3:27017
curl: (52) Empty reply from server
root@6f920e2b3a24:/# curl -k https://<private_name>:27017
curl: (6) Could not resolve host: <private_name>
root@6f920e2b3a24:/# cat /etc/hosts
127.0.0.1    localhost
::1    localhost ip6-localhost ip6-loopback
fe00::0    ip6-localnet
ff00::0    ip6-mcastprefix
ff02::1    ip6-allnodes
ff02::2    ip6-allrouters
172.22.0.32    6f920e2b3a24
fd00::1:8:20    6f920e2b3a24
root@6f920e2b3a24:/# echo "172.22.0.3 <private_name>" >> /etc/hosts
root@6f920e2b3a24:/# cat /etc/hosts
127.0.0.1    localhost
::1    localhost ip6-localhost ip6-loopback
fe00::0    ip6-localnet
ff00::0    ip6-mcastprefix
ff02::1    ip6-allnodes
ff02::2    ip6-allrouters
172.22.0.32    6f920e2b3a24
fd00::1:8:20    6f920e2b3a24
172.22.0.3 <private_name>
root@6f920e2b3a24:/# curl -k https://<private_name>:27017
curl: (52) Empty reply from server
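
The session above can be condensed into two checks that run inside the container. This is a minimal sketch: it assumes the container is on a user-defined network, where Docker's embedded DNS server is normally listed as 127.0.0.11 in /etc/resolv.conf, and that getent is present in the image. The helper names uses_embedded_dns and check_name are hypothetical.

```shell
#!/usr/bin/env bash
# Succeeds if the resolver config points at Docker's embedded DNS server.
# The file argument is optional and defaults to /etc/resolv.conf.
uses_embedded_dns() {
    local conf="${1:-/etc/resolv.conf}"
    grep -q '^nameserver[[:space:]]\{1,\}127\.0\.0\.11' "$conf"
}

# Prints whether a name resolves through the normal resolver path.
# Note: getent consults /etc/hosts as well as DNS, so a manual /etc/hosts
# entry (like the one added in the session above) will also make this succeed.
check_name() {
    local name="$1" addr
    addr="$(getent hosts "$name" | awk '{print $1; exit}')"
    if [ -n "$addr" ]; then
        echo "resolved: $name -> $addr"
    else
        echo "unresolved: $name"
        return 1
    fi
}

# Example (run inside the container, before editing /etc/hosts):
# uses_embedded_dns && echo "embedded DNS configured"
# check_name <private_name>
```

If the resolver config points at 127.0.0.11 but the name stays unresolved while the IP is reachable, the failure is in name resolution rather than in the network path, which is what this report describes.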

Some containers show as up and running for 5 days, but the microservice inside them has actually terminated. docker exec does not work on them.

These are some logs and outputs:

Container:

f2300b0687a7   <image>:<version>    "/usr/bin/..…"   5 days ago     Up 5 days         

Syslog:

root@cp1:/var/log# grep -nri "f2300b0687a7" ./

./syslog.1:58724:May 27 01:45:26 cp1 containerd[4099]: time="2021-05-27T01:45:26.451414726Z" level=info msg="shim disconnected" id=f2300b0687a7606be6ce79b503a9d8e1b4d61d61db689c40b2e3483347dc1edd

./syslog.1:58726:May 27 01:45:26 cp1 dockerd[4259]: time="2021-05-27T01:45:26.451699426Z" level=info msg="ignoring event" container=f2300b0687a7606be6ce79b503a9d8e1b4d61d61db689c40b2e3483347dc1edd module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"

Binary file ./journal/e66f1b9049144bc88b3757a48c472d6a/system@7afd5ca7bca54575ac5260c8aef722ad-0000000000096ada-0005c3410ad12141.journal matches
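
Since docker inspect hangs, the "shim disconnected" line can be cross-checked on the host without going through the Docker API by looking for a shim process that carries the container ID. This is a rough sketch, assuming the runc v2 shim used with containerd 1.4, which receives the container ID through an -id argument on its command line. The helper name shim_status is hypothetical.

```shell
#!/usr/bin/env bash
# Prints whether a containerd shim process exists for the given container ID.
# A short ID prefix works too, because the pattern is not anchored at the end.
shim_status() {
    local id="$1"
    # pgrep -f matches the pattern against the full command line.
    if pgrep -f "containerd-shim.*-id ${id}" >/dev/null; then
        echo "shim present: ${id}"
    else
        echo "no shim found: ${id}"
    fi
}

# Example:
# shim_status f2300b0687a7
```

A container that is listed as up while no shim is found for it would be consistent with the daemon holding stale state for that container.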

Product logs to confirm the service was actually terminated:

<service-name>: "2021-05-27 01:45:26,400: __main__ - INFO - Service terminated ... /opt/.../bin/<service>.py (0)"

docker inspect hangs…

# docker inspect f2300b0687a7
^C
#

docker network ls and docker network inspect hang…

Output of strace -f docker network inspect my-network:

[pid  6966] mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0 <unfinished ...>
[pid  6964] <... read resumed> 0xc0005a9000, 4096) = -1 EAGAIN (Resource temporarily unavailable)
[pid  6966] <... mmap resumed> )        = 0x7f9c9cde8000
[pid  6962] futex(0x55ca72a83468, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=100000} <unfinished ...>
[pid  6964] write(3, "GET /v1.41/networks/my-netwo"..., 106 <unfinished ...>
[pid  6966] mprotect(0x7f9c9cde9000, 8388608, PROT_READ|PROT_WRITE <unfinished ...>
[pid  6964] <... write resumed> )       = 106
[pid  6962] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  6970] <... epoll_pwait resumed> [{EPOLLOUT, {u32=2964458120, u64=140310956083848}}], 128, -1, NULL, 3) = 1
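
The trace shows the client writing its GET request and then waiting for a reply. The same kind of request can be sent without the docker CLI to check whether the daemon's API answers at all. This is a minimal sketch, assuming a curl build with Unix socket support and the default socket path /var/run/docker.sock, neither of which is confirmed by this report. The api_get helper is hypothetical.

```shell
#!/usr/bin/env bash
# Socket path is an assumption; override by exporting DOCKER_SOCK.
DOCKER_SOCK="${DOCKER_SOCK:-/var/run/docker.sock}"

# api_get <path>
# Sends a GET request to the Docker Engine API over the Unix socket.
api_get() {
    local path="$1"
    # --max-time makes curl give up after 10 seconds instead of waiting forever.
    curl --silent --show-error --max-time 10 \
         --unix-socket "$DOCKER_SOCK" "http://localhost${path}"
}

# Example:
# api_get /_ping                        # basic liveness endpoint
# api_get /v1.41/networks/my-network    # the request issued by docker network inspect
```

curl exits with status 28 when the time limit is reached, so comparing the result for /_ping with the result for the network endpoint can indicate whether the whole API is stuck or only some handlers.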

No crash dump:

root@cp1:/var/crash# ls -lart
total 8
drwxrwxrwt  2 root root 4096 May  8 15:50 .
drwxr-xr-x 13 root root 4096 May  8 15:51 ..
root@cp1:/var/crash#

Output of docker version:

# docker version
Client: Docker Engine - Community
 Version:           20.10.6
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        370c289
 Built:             Fri Apr  9 22:46:01 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.6
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       8728dd2
  Built:            Fri Apr  9 22:44:13 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.4
  GitCommit:        05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 runc:
  Version:          1.0.0-rc93
  GitCommit:        12644e614e25b05da6fd08a38ffa0cfe1903fdec
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Output of docker info:

# docker info
^C

Command hangs.
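
For a hang like this, a goroutine stack dump from the daemon usually helps narrow down where it is stuck. dockerd writes such a dump when it receives SIGUSR1. The following is a minimal sketch to be run as root on the host. The helper name dump_dockerd_stacks is hypothetical, and the syslog path in the comment is an assumption based on the log locations shown earlier in this report.

```shell
#!/usr/bin/env bash
# Asks the running dockerd to dump its goroutine stacks.
dump_dockerd_stacks() {
    local pids
    pids="$(pidof dockerd)" || { echo "dockerd not found" >&2; return 1; }
    # Unquoted on purpose: pidof may print several PIDs separated by spaces.
    kill -USR1 $pids
    echo "sent SIGUSR1 to dockerd (pid ${pids})"
}

# Example:
# dump_dockerd_stacks
# The daemon log should then name the file the stacks were written to, e.g.:
# grep -i "goroutine stacks" /var/log/syslog
```

The signal does not stop the daemon, so the dump can be taken while the system is still in the broken state.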

Additional environment details (AWS, VirtualBox, physical, etc.)

Azure Cloud.

# cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.5 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic