Die hard containers

Hi,

I have been using Docker for two years. I have a small cluster for development purposes where I use Docker Swarm to coordinate and distribute my containers.
From time to time, I have problems killing containers.
They do not give any errors on docker stop or docker kill, but they are always shown as running. I cannot stat these containers, but I can inspect them. If I restart the computer, they continue to be listed as running containers. It seems that they do not consume any memory or CPU (at least no significant amount of it).

There is nothing special about the containers I create. I use a custom-built Python image in 100+ containers. Most of the time I have no problems stopping and restarting them. When I cannot kill them, I log into each node with an undead container and execute the following procedure:

I get the IDs of the die-hard containers, then disable Docker and reboot:

docker ps -f label=project -q --no-trunc
sudo -s
systemctl disable docker
reboot

When the machine restarts:

sudo -s
cd /var/lib/docker/containers
rm -rf \
065e505db408f835fef8f79e46078f1b357f573366335471f64f51cf5e29e64d
sudo systemctl enable docker

This simple procedure needs to be executed on every machine with undead containers :frowning:
It is time-consuming and it breaks my deployment scripts.
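
The removal step could probably be wrapped in a small helper so it is less error-prone to repeat on each node. This is only a sketch under assumptions: the function name `remove_container_dirs` and the `CONTAINERS_DIR` variable are made up for illustration, the Docker daemon is assumed to be stopped already, and on a real node it would have to run as root.

```shell
#!/bin/sh
# Hypothetical helper: remove the state directories of stuck containers.
# Assumes the Docker daemon is stopped and that the arguments are full
# 64-character IDs as printed by `docker ps -q --no-trunc`.
CONTAINERS_DIR="${CONTAINERS_DIR:-/var/lib/docker/containers}"

remove_container_dirs() {
    for cid in "$@"; do
        # Reject anything containing a character that is not lowercase hex.
        case "$cid" in
            *[!0-9a-f]*)
                echo "skipping invalid id: $cid" >&2
                continue
                ;;
        esac
        # Reject empty or truncated IDs so a bad argument can never
        # expand to a wider path than one container directory.
        if [ "${#cid}" -ne 64 ]; then
            echo "skipping invalid id: $cid" >&2
            continue
        fi
        if [ -d "$CONTAINERS_DIR/$cid" ]; then
            rm -rf -- "$CONTAINERS_DIR/$cid"
            echo "removed $cid"
        else
            echo "not found: $cid" >&2
        fi
    done
}
```

It would be called with the IDs collected before the reboot, for example `remove_container_dirs 065e505db408f835fef8f79e46078f1b357f573366335471f64f51cf5e29e64d`.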

My current setup uses 7 machines running Linux.

The containers are started with the following template:

docker create --name X \
        --restart=unless-stopped -d -t -i --network=host \
        --add-host=postgres:${POSTGRES_IP} \
        --add-host=redis:${REDIS_IP} \
        --add-host=rabbitmq:${RABBITMQ_IP} \
        --add-host=cassandra:${CASSANDRA_IP} \
        -e PROJECT_ENVIRONMENT=${ENVIRONMENT} \
        -v /export/resources:/sf/resources \
        -w ${DJANGO_HOME} -u service \
        ${IMAGE}

uname -a

Linux m1 4.4.0-78-generic #99-Ubuntu SMP Thu Apr 27 15:29:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

docker version
Client:
 Version:      17.04.0-ce
 API version:  1.24 (downgraded from 1.28)
 Go version:   go1.7.5
 Git commit:   4845c56
 Built:        Mon Apr  3 18:07:42 2017
 OS/Arch:      linux/amd64

Server:
 Version:      swarm/1.2.6
 API version:  1.22 (minimum version )
 Go version:   go1.7.1
 Git commit:   `git rev-parse --short HEAD`
 Built:        `date -u`
 OS/Arch:      linux/amd64
 Experimental: false

docker info

Containers: 12
 Running: 10
 Paused: 0
 Stopped: 2
Images: 113
Server Version: swarm/1.2.6
Role: primary
Strategy: spread
Filters: health, port, containerslots, dependency, affinity, constraint, whitelist
Nodes: 7
 m2: 192.168.1.2:2375
  └ ID: 74MO:2I2U:IHWY:SOLA:6NXF:WE5W:TKZF:ZVNB:IVC4:DXFN:ACGC:FCY6
  └ Status: Healthy
  └ Containers: 1 (1 Running, 0 Paused, 0 Stopped)
  └ Reserved CPUs: 0 / 12
  └ Reserved Memory: 0 B / 65.99 GiB
  └ Labels: kernelversion=4.4.0-72-generic, operatingsystem=Ubuntu 16.04.2 LTS, storagedriver=aufs
  └ UpdatedAt: 2017-05-19T08:43:55Z
  └ ServerVersion: 17.04.0-ce
 m3: 192.168.1.3:2375
  └ ID: ZTCL:O3JN:FPVN:IXBH:CVIJ:H5PX:YC6I:QUT5:LYIU:CIAC:RFR2:JQWI
  └ Status: Healthy
  └ Containers: 1 (1 Running, 0 Paused, 0 Stopped)
  └ Reserved CPUs: 0 / 8
  └ Reserved Memory: 0 B / 33.02 GiB
  └ Labels: kernelversion=4.4.0-75-generic, operatingsystem=Ubuntu 16.04.2 LTS, storagedriver=aufs
  └ UpdatedAt: 2017-05-19T08:43:37Z
  └ ServerVersion: 17.04.0-ce
 m4: 192.168.1.4:2375
  └ ID: XYCB:KIE3:YGUA:T5EX:SLC4:KQ4Y:HC2X:DNLH:WUDM:JDCP:S2HJ:BCTP
  └ Status: Healthy
  └ Containers: 2 (2 Running, 0 Paused, 0 Stopped)
  └ Reserved CPUs: 0 / 16
  └ Reserved Memory: 0 B / 99.05 GiB
  └ Labels: kernelversion=4.4.0-75-generic, operatingsystem=Ubuntu 16.04.2 LTS, storagedriver=aufs
  └ UpdatedAt: 2017-05-19T08:44:20Z
  └ ServerVersion: 17.04.0-ce
 m1: 192.168.1.1:2375
  └ ID: HPV4:JZPD:E43J:IUQF:FHCA:FAXU:O3MM:IUBH:DK2X:2W7G:EQEC:3UXJ
  └ Status: Healthy
  └ Containers: 4 (4 Running, 0 Paused, 0 Stopped)
  └ Reserved CPUs: 0 / 16
  └ Reserved Memory: 0 B / 99.05 GiB
  └ Labels: kernelversion=4.4.0-78-generic, operatingsystem=Ubuntu 16.04.2 LTS, storagedriver=aufs
  └ UpdatedAt: 2017-05-19T08:44:24Z
  └ ServerVersion: 17.04.0-ce
 m5: 192.168.1.5:2375
  └ ID: QBEK:YKU5:ZTER:SPUD:CAO3:KFNJ:QW22:NQUH:3AY5:MCQI:IQKA:YRF6
  └ Status: Healthy
  └ Containers: 1 (1 Running, 0 Paused, 0 Stopped)
  └ Reserved CPUs: 0 / 16
  └ Reserved Memory: 0 B / 99.05 GiB
  └ Labels: kernelversion=4.4.0-72-generic, operatingsystem=Ubuntu 16.04.2 LTS, storagedriver=aufs
  └ UpdatedAt: 2017-05-19T08:44:31Z
  └ ServerVersion: 17.04.0-ce
 m6: 192.168.1.6:2375
  └ ID: GINY:O4OX:IXRE:W3YU:ISJG:C36X:NNHN:5D44:3ZRE:4WZS:MR5B:75EO
  └ Status: Healthy
  └ Containers: 1 (1 Running, 0 Paused, 0 Stopped)
  └ Reserved CPUs: 0 / 21
  └ Reserved Memory: 0 B / 99.05 GiB
  └ Labels: kernelversion=4.4.0-78-generic, operatingsystem=Ubuntu 16.04.2 LTS, storagedriver=aufs
  └ UpdatedAt: 2017-05-19T08:44:10Z
  └ ServerVersion: 17.04.0-ce
 m7: 192.168.1.7:2375
  └ ID: 2YE6:7XTJ:IYCU:XXWD:S6F7:TPZQ:KY2O:3T7P:TXDT:ILKL:DSUI:QASK
  └ Status: Healthy
  └ Containers: 2 (0 Running, 0 Paused, 2 Stopped)
  └ Reserved CPUs: 0 / 21
  └ Reserved Memory: 0 B / 99.05 GiB
  └ Labels: kernelversion=4.4.0-78-generic, operatingsystem=Ubuntu 16.04.2 LTS, storagedriver=aufs
  └ UpdatedAt: 2017-05-19T08:44:14Z
  └ ServerVersion: 17.04.0-ce
Plugins: 
 Volume: 
 Network: 
Swarm: 
 NodeID: 
 Is Manager: false
 Node Address: 
Kernel Version: 4.4.0-78-generic
Operating System: linux
Architecture: amd64
CPUs: 110
Total Memory: 594.3GiB
Name: 0ba32ddaf9ef
Docker Root Dir: 
Debug Mode (client): false
Debug Mode (server): false
Experimental: false
Live Restore Enabled: false

WARNING: No kernel memory limit support

Can you post the docker info output from the individual nodes themselves?

This is just the classic swarm docker info output, so it includes neither the specific Docker version you are running on the server side nor the graph driver you might be using. I suspect that this has something to do with the underlying graph/storage driver, so that information is probably vital to proceed.
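
Something along these lines might collect it in one go. It is only a sketch: it assumes each engine is reachable directly over TCP at the addresses and port listed in the swarm output (192.168.1.1 to 192.168.1.7, port 2375), and the `DOCKER` and `NODES` variables and the `node_info` function are names made up for illustration.

```shell
#!/bin/sh
# Sketch: query each engine directly instead of going through the swarm manager.
# Node addresses are taken from the swarm `docker info` output in this thread.
DOCKER="${DOCKER:-docker}"
NODES="192.168.1.1 192.168.1.2 192.168.1.3 192.168.1.4 192.168.1.5 192.168.1.6 192.168.1.7"

node_info() {
    for node in $NODES; do
        echo "===== $node ====="
        # -H points the client at one specific daemon rather than the default socket.
        $DOCKER -H "tcp://$node:2375" info || echo "could not reach $node" >&2
    done
}
```

Running `node_info > nodes-info.txt` would then give one file to paste here. If the engines are not reachable from a single machine, running plain `docker info` on each node locally gives the same information.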