Docker container started with --init still leaves zombie processes

Problem
Docker containers started via the Jenkins pipeline command

docker.image(imageToStart).inside('--init')

cannot be stopped because of zombie processes left behind by the container.

Questions

  • How is it possible to get zombie processes from a Docker container when it was started with the ‘--init’ option?
  • Has anyone else encountered the same issue?

Used environment

  • Docker 18.03.1-ce
  • Jenkins 2.60.2
  • Docker Pipeline plugin 1.12

Details
When a container is started from a Jenkins pipeline with a command like:

docker.image('alpine').inside('--init') {
  sh ('ps -efa -o pid,ppid,user,comm')
}

there are several processes in the container with parent PID 0:

[Pipeline] withDockerContainer
testhost does not seem to be running inside a container
$ docker run -t -d -u 1001:1002 \
  --init \
  -w /lhome/testadmin/jenkins/workspace/bli-groovy-test \
  -v /lhome/testadmin/jenkins/workspace/bli-groovy-test:/lhome/testadmin/jenkins/workspace/bli-groovy-test:rw,z \
  -v /lhome/ciadmin/jenkins/workspace/bli-groovy-test-tmp:/lhome/testadmin/jenkins/workspace/bli-groovy-test-tmp:rw,z \
  -e ******** \
  --entrypoint cat alpine
[Pipeline] {
[Pipeline] sh

[bli-groovy-test] Running shell script
+ ps -efa -o pid,ppid,user,comm
PID   PPID  USER     COMMAND
    1     0 1001     init
    7     1 1001     cat
    8     0 1001     sh
   14     8 1001     script.sh
   15    14 1001     ps
[Pipeline] }
  • PID 1 / PPID 0 is the ‘init’ command used to start the container
  • PID 8 / PPID 0 is the ‘sh’ command from the closure that executes the ‘ps’ command

The ‘sh’ process does not reap its child processes. When the process itself exits, its descendants are not reparented to PID 1 (the ‘init’ process of the container) but to a PID from outside the container: the new parent is the ‘docker-containerd-shim’ process of the container.
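
To reproduce the observation outside Jenkins, here is a minimal sketch (the container name ‘init-test’ is arbitrary):

# start a container the same way the plugin does (‘cat’ keeps it alive)
docker run -t -d --init --name init-test --entrypoint cat alpine

# processes started via ‘docker exec’ are children of the shim, so inside
# the container’s PID namespace they appear with PPID 0, just like the
# ‘sh’ process in the log above
docker exec init-test ps -o pid,ppid,user,comm

# clean up
docker rm -f init-test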

With this small example I could not reproduce the zombie processes, but here is the situation from a more complex Jenkins job:

Docker command from Jenkins job

$ docker run -t -d -u 1001:1002 \
  --init \
  -w /lhome/testadmin/jenkins-coreloops/workspace/test-job/database \
  -v /lhome/testadmin/jenkins-coreloops/workspace/test-job/database:/lhome/testadmin/jenkins-coreloops/workspace/test-job/database:rw,z \
  -v /lhome/testadmin/jenkins-coreloops/workspace/test-job/database-tmp:/lhome/testadmin/jenkins-coreloops/workspace/test-job/database-tmp:rw,z \
  -e ******** \
  --entrypoint cat ait/mpde
[Pipeline] {
[Pipeline] sh
10:03:09 [database] Running shell script
10:03:09 + ./db-upgrade.sh

The ait/mpde image is based on Red Hat 7.2, and inside the container the following scripts are started (a simplified sketch of the pattern follows after the list):

  • shell scripts that set up the environment and call Perl scripts
  • the Perl scripts start SQL*Plus (one instance per database login)
  • the Perl scripts read SQL scripts and send the SQL commands to the SQL*Plus instances via STDIN
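
In simplified form the pattern looks like this (a shell sketch; the real job uses Perl, and the variable and script names are made up):

# one SQL*Plus instance per database login, fed with SQL commands via STDIN
sqlplus -s "${DB_USER}/${DB_PASSWORD}@${DB_ALIAS}" <<'EOF'
@upgrade_step.sql
exit
EOF

If such SQL*Plus children are not reaped by their parent before it exits, they end up as the ‘<defunct>’ entries shown below.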

When the closure ends and Jenkins tries to stop the container, the following processes are left:

[testadmin@testhost] ~  # ps -efa | grep -vw grep | grep -w 47077
root      1725 47077  0 10:03 ?        00:00:00 [ps] <defunct>
root      1732 47077  0 10:03 ?        00:00:00 [docker-runc] <defunct>
root      2887 47077  0 10:04 ?        00:00:00 [sqlplus] <defunct>
root      2915 47077  0 10:04 ?        00:00:00 [sqlplus] <defunct>
root     47077 17349  0 10:03 ?        00:00:00 docker-containerd-shim 
                                                -namespace moby 
                                                -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/1863503ca54f75168db8ce20c78b821c0e5280f07d59875e8f651db4f0b67d9f 
                                                -address /var/run/docker/containerd/docker-containerd.sock 
                                                -containerd-binary /usr/bin/docker-containerd 
                                                -runtime-root /var/run/docker/runtime-runc
root     47098 47077  0 10:03 pts/0    00:00:00 /dev/init -- cat
root     47506 47077  0 10:03 ?        00:00:00 [sh] <defunct>
[testadmin@testhost] ~  #

and the ‘docker stop’ command is aborted by Jenkins after a timeout of 180 seconds.

To clean up the remaining processes of the container, this docker-containerd-shim process has to be killed with SIGKILL, for example as sketched below.
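
A sketch of that cleanup (the shim’s command line contains the full container ID, as visible in the ps output above; this needs root, because the shim runs as root):

# find the docker-containerd-shim belonging to the hanging container and kill it
CONTAINER_ID=$(docker inspect --format '{{.Id}}' <container-name>)
SHIM_PID=$(pgrep -f "docker-containerd-shim.*${CONTAINER_ID}")
kill -KILL "${SHIM_PID}"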

Note
We observed this issue on our recently installed CentOS server:

  • CentOS release 7.5.1804

  • environment-related parts from ‘docker info’:

    Server Version: 18.03.1-ce
    Storage Driver: overlay2
     Backing Filesystem: extfs
     Supports d_type: true
     Native Overlay Diff: true
    Logging Driver: json-file
    Cgroup Driver: cgroupfs
    Plugins:
     Volume: local
     Network: bridge host macvlan null overlay
     Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
    Swarm: inactive
    Runtimes: runc
    Default Runtime: runc
    Init Binary: docker-init
    containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
    runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
    init version: 949e6fa
    Security Options:
     seccomp
      Profile: default
    Kernel Version: 3.10.0-862.6.3.el7.x86_64
    Operating System: CentOS Linux 7 (Core)
    OSType: linux
    Architecture: x86_64
    CPUs: 48
    Total Memory: 377.6GiB
    Docker Root Dir: /var/lib/docker
    Debug Mode (client): false
    Debug Mode (server): false
    Registry: https://index.docker.io/v1/
    Experimental: false
    Insecure Registries:
     127.0.0.0/8
    Live Restore Enabled: false
    

The behavior on other hosts is similar with respect to the multiple processes with parent PID 0, but on those hosts we did not observe containers hanging on shutdown or a comparable number of zombie processes.

For comparison, the corresponding ‘docker info’ extract from one of these other hosts:

  Server Version: 17.05.0-ce
  Storage Driver: devicemapper
   Pool Name: dock-thinpool
   Pool Blocksize: 524.3kB
   Base Device Size: 16.11GB
   Backing Filesystem: xfs
   Data file:
   Metadata file:
   Udev Sync Supported: true
   Deferred Removal Enabled: true
   Deferred Deletion Enabled: true
   Deferred Deleted Device Count: 0
   Library Version: 1.02.140-RHEL7 (2017-05-03)
  Logging Driver: json-file
  Cgroup Driver: cgroupfs
  Plugins:
   Volume: local
   Network: bridge host macvlan null overlay
  Swarm: inactive
  Runtimes: runc
  Default Runtime: runc
  Init Binary: docker-init
  containerd version: 9048e5e50717ea4497b757314bad98ea3763c145
  runc version: 9c2d8d184e5da67c95d601382adf14862e4f2228
  init version: 949e6fa
  Security Options:
   seccomp
    Profile: default
  Kernel Version: 3.10.0-693.11.6.el7.x86_64
  Operating System: CentOS Linux 7 (Core)
  OSType: linux
  Architecture: x86_64
  CPUs: 48
  Total Memory: 377.6GiB
  Docker Root Dir: /var/lib/docker
  Debug Mode (client): false
  Debug Mode (server): false
  Registry: https://index.docker.io/v1/
  Experimental: false
  Insecure Registries:
   127.0.0.0/8
  Live Restore Enabled: false

Any updates on this issue? I seem to have run into an issue running Docker that looks like this. Any workarounds? I am running Docker version 18.06.1-ce.

johnl

I didn’t receive any confirmation or update on that topic.

As a workaround I created images that automatically start an SSH daemon and modified the ‘docker exec’ call to start commands via “ssh 127.0.0.1 ${COMMAND}”.
This way the started processes are children of the sshd process and thus descendants of the container’s init process, which can reap them.
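
A rough sketch of this setup (image contents, key setup, and the wrapper are illustrative, not the exact implementation):

# in the image: install sshd, generate host keys, set up key-based
# authentication for 127.0.0.1, and start sshd under the container's
# init process, e.g. from the entrypoint

# on the Jenkins side, replace a plain ‘docker exec <id> <command>’ with:
docker exec <container-id> ssh -o StrictHostKeyChecking=no 127.0.0.1 "${COMMAND}"

The ssh client started by ‘docker exec’ is still a child of the shim, but it is short-lived; the actual work processes run under sshd and are therefore descendants of the container’s init process.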