Containers of a docker stack not coming up after host reboot

We need to deploy a Docker stack on a CentOS VM. We have a Docker Compose file that launches a stack with two services, each running one container. One of these containers connects to two external networks.

The docker-compose.yml looks like this:

version: '3'
services:
  GoOn_db:
    image: postgres
  GoOn_web:
    image: sshweb_5:new
    command: bash start.sh
    volumes:
      - .:/code
    ports:
      - "8000:8000"
      - "8022:22"
    networks:
      - external_oam_network
      - external_data_network
    depends_on:
      - GoOn_db
networks:
  external_oam_network:
    external:
      name: goon__oam
  external_data_network:
    external:
      name: goon__data

The external networks are swarm-scoped macvlan networks, created with the commands below (shown for goon__data):

docker network create --config-only --subnet 172.28.128.0/24 --gateway 172.28.128.1 -o parent=eth1 --ip-range 172.28.128.32/27 __goon__data
docker network create -d macvlan --scope swarm --config-from __goon__data goon__data
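Since the stack needs two such networks, the pair of commands above can be wrapped in a small helper. This is only a sketch: the function name create_swarm_macvlan and the DOCKER override variable are made up for illustration, and the only values taken from this post are the ones for goon__data.

```shell
#!/bin/sh
# Sketch: create a config-only network plus the swarm-scoped macvlan network
# that takes its settings from it, mirroring the two commands above.
# DOCKER can be overridden (e.g. DOCKER=echo) to preview the commands.
DOCKER="${DOCKER:-docker}"

# Arguments: name subnet gateway ip-range parent-interface
create_swarm_macvlan() {
    name="$1"; subnet="$2"; gateway="$3"; ip_range="$4"; parent="$5"

    # Config-only network: holds the addressing and parent interface settings.
    $DOCKER network create --config-only \
        --subnet "$subnet" --gateway "$gateway" \
        -o parent="$parent" --ip-range "$ip_range" \
        "__${name}" || return 1

    # Swarm-scoped macvlan network that reads its settings from the one above.
    $DOCKER network create -d macvlan --scope swarm \
        --config-from "__${name}" "$name"
}

# The data network from this post:
# create_swarm_macvlan goon__data 172.28.128.0/24 172.28.128.1 172.28.128.32/27 eth1
```

Running it with DOCKER=echo prints the two docker arguments lists without touching the daemon, which is a cheap way to check the values before creating anything.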

The stack is deployed with the command below:

docker stack deploy --compose-file docker-compose.yml app

The Issue:

With the above configuration, the stack comes up fine the first time. But if the host VM reboots (or crashes and comes back up), the container connected to the external networks (the GoOn_web service) fails to start. The following errors appear in the journal logs:

Jun 13 15:00:14 localhost.localdomain dockerd[21817]: time="2018-06-13T15:00:14.393543091+05:30" level=error msg="fatal task error" error="network dm-g3ovik5qx6br is already using parent interface goon__data" module=node/agent/taskmanager node.id=enqfccpf6sn28l01f6i6grq6h service.id=6w743aksizz5b6p7u3xgqpet8 task.id=wlxpngi6faw571nkgm4f9p0c9
Jun 13 15:00:14 localhost.localdomain dockerd[21817]: time="2018-06-13T15:00:14.824590521+05:30" level=warning msg="failed to deactivate service binding for container app_GoOn_web.1.y7l2c88wrhfq54f5af4d0qio7" error="No such container: app_GoOn_web.1.y7l2c88wrhfq54f5af4d0qio7" module=node/agent node.id=enqfccpf6sn28l01f6i6grq6h
Jun 13 15:00:16 localhost.localdomain dockerd[21817]: time="2018-06-13T15:00:16.827271962+05:30" level=error msg="network goon__data remove failed: network goon__data not found" module=node/agent node.id=enqfccpf6sn28l01f6i6grq6h
Jun 13 15:00:16 localhost.localdomain dockerd[21817]: time="2018-06-13T15:00:16.827406882+05:30" level=error msg="remove task failed" error="network goon__data not found" module=node/agent node.id=enqfccpf6sn28l01f6i6grq6h task.id=y7l2c88wrhfq54f5af4d0qio7
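To pull only the task failures out of a noisy journal, a small filter like the one below may help. It is a rough sketch: the function name filter_task_errors is made up, the two message strings are simply the ones from the log excerpt above, and the systemd unit name docker.service in the usage comment is an assumption about the host.

```shell
#!/bin/sh
# Sketch: keep only the dockerd error lines that report failing swarm tasks.
# Reads log lines on stdin and writes the matching lines to stdout.
filter_task_errors() {
    grep 'level=error' | grep -E 'fatal task error|remove task failed'
}

# Possible usage on the host (unit name docker.service is an assumption):
# journalctl -u docker.service --no-pager | filter_task_errors
```

The task.id in a matching line can then be compared with the task list of the affected service to see which task failed.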

The other issue we observed is that the network cannot be removed completely together with its config-only network. We tried the following commands:

[localhost config_drive]# docker stack rm app
Removing service app_GoOn_db
Removing service app_GoOn_web
Removing network app_default
[localhost config_drive]# docker network rm goon__data
goon__data
[localhost config_drive]# docker network rm __goon__data
Error response from daemon: configuration network "__goon__data" is in use

It seems like the network cleanup has some issue as well.
Please let us know if there are any workarounds for this issue or if our configuration needs some tweaking.

Possibly related issues on GitHub:
https://github.com/docker/libnetwork/issues/1743
https://github.com/moby/moby/issues/23302 (Cannot remove network due to active endpoint, but cannot stop/remove containers)

Output of the docker commands after the host VM reboot:

[localhost config_drive]# docker version
Client:
 Version:       17.12.0-ce
 API version:   1.35
 Go version:    go1.9.2
 Git commit:    c97c6d6
 Built:         Wed Dec 27 20:10:14 2017
 OS/Arch:       linux/amd64

Server:
 Engine:
  Version:      17.12.0-ce
  API version:  1.35 (minimum version 1.12)
  Go version:   go1.9.2
  Git commit:   c97c6d6
  Built:        Wed Dec 27 20:12:46 2017
  OS/Arch:      linux/amd64
  Experimental: false


[root@localhost config_drive]# docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
1456a3ec482a        __goon__data        null                local
c637316e8a95        __goon__oam         null                local
395a87391443        bridge              bridge              local
67f95713ee03        docker_gwbridge     bridge              local
ut4qii3qdzrs        goon__data          macvlan             swarm
88zfsq41n7xo        goon__oam           macvlan             swarm
803609448d35        host                host                local
wrypoj5x9fxx        ingress             overlay             swarm
095dc2ca9729        none                null                local


[localhost config_drive]# docker stack ls
NAME                SERVICES
app                 2

[localhost config_drive]# docker service ls
ID                  NAME                MODE                REPLICAS            IMAGE               PORTS
c4kpzmc26qgk        app_GoOn_db         replicated          1/1                 postgres:latest
6w743aksizz5        app_GoOn_web        replicated          0/1                 sshweb_5:new        *:8000->8000/tcp,*:8022->22/tcp

[localhost config_drive]# docker ps --all
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES
ce820c3596f0        postgres:latest     "docker-entrypoint.s…"   9 minutes ago       Up 9 minutes        5432/tcp            app_GoOn_db.1.o000vpvfer0moz9b1xnh05wp5

Update:

We found a workaround. Deleting "/var/lib/docker/network/files/local-kv.db" and restarting the docker service seems to work for us. This is mentioned in https://github.com/moby/moby/issues/17669 . We are in the process of testing this in CI. Please let us know if this workaround has any side effects.

Update:

The workaround above (deleting the "/var/lib/docker/network/files/local-kv.db" file and restarting docker) has now been tested for more than a fortnight, and we no longer see the original issue. So anyone facing this issue can probably use this workaround. This looks like a genuine bug in Docker that needs to be fixed.
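For anyone who wants to script the workaround, here is a minimal sketch. The function name reset_docker_network_state and the KV_DB and SYSTEMCTL override variables are made up for illustration. Stopping the daemon before deleting the file, rather than deleting first and restarting afterwards, is an assumption on my part and not something verified in this thread.

```shell
#!/bin/sh
# Sketch of the workaround: stop Docker, delete the local network state file,
# then start Docker again.
# KV_DB defaults to the path from this post; SYSTEMCTL can be overridden
# (e.g. SYSTEMCTL=echo) for a dry run of the service commands.
KV_DB="${KV_DB:-/var/lib/docker/network/files/local-kv.db}"
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

reset_docker_network_state() {
    # Stop the daemon first so it is not using the file while it is removed
    # (an assumption; the post only says to delete the file and restart docker).
    $SYSTEMCTL stop docker || return 1

    # Remove the local network state file; -f keeps this quiet if it is absent.
    rm -f "$KV_DB"

    $SYSTEMCTL start docker
}
```

The function has to run as root, for example by calling reset_docker_network_state from a boot-time script. Note that the file is deleted for real even when SYSTEMCTL is overridden, so point KV_DB at a scratch file when trying it out.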

Just wanted to mention that your workaround helped me too, @naanuswaroop.

What a pain that this isn't fixed yet. I solved it by adding the deletion of the file and a docker restart to /etc/rc.local.

Cheers,
Frank
