Cannot start containers, init-layer issue?

I have a very weird problem I can find nothing about… It’s hard to reproduce on other servers… 2 just work and 2 have the same problem. I am probably missing something so any pointers are appreciated…
When starting a container it fails with errors similar to

ERRO[2019-09-19T10:33:07.618605334+02:00] Handler for POST /v1.40/containers/create returned error: open /data/docker/overlay2/a96da70f876a18dc93ea677da85d7e02fa30cfcebde8b739cc3ae4e0803eddbb-init/merged/etc/hosts: invalid argument 
docker: Error response from daemon: open /data/docker/overlay2/a96da70f876a18dc93ea677da85d7e02fa30cfcebde8b739cc3ae4e0803eddbb-init/merged/etc/hosts: invalid argument.
See 'docker run --help'.

I know similar issues can happen if you use xfs without ftype support, but that isn’t it. Manual checks and docker info do report this is okay.
Also the OS/packages/kernel versions are all the same for these servers.

 Server Version: 19.03.1
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true

The error triggers when starting up the container. Docker tries to remove the /etc/hosts if present to make sure it isn’t… Which completes without an error… and after tries to create a new empty one but crashes with invalid argument.

This seems to happen because the /etc/hosts-file as present in the overlays is a character special device instead of a file. As far as I can see it’s created by Docker itself, but overlaying a character special device and recreating a regular file fails.

On another server where everything works I can see it being a regular file by default, but somehow this isn’t the case on another server.
How it looks on a server where it works:

find /data -wholename *-init/diff/etc/hosts -exec ls -l {} \;
-rwxr-xr-x 1 root root 0 Sep 17 14:25 /data/docker/overlay2/c178f63836c51d6fca868cc442f8ffeb0299826b665eeb9d3182e43d9435f489-init/diff/etc/hosts
-rwxr-xr-x 1 root root 0 Sep 17 14:58 /data/docker/overlay2/16353e66e3d96f40768d0bcb7c94351ca3b2661602e09110a10642eeca98ca03-init/diff/etc/hosts

How it looks on the server where it doesn’t:

 find /data -wholename *-init/diff/etc/hosts -exec ls -l {} \;
c--------- 1 root root 0, 0 Sep 19 10:19 /data/docker/overlay2/f593733309887fd011324e35bfa5e813b44b62b0a0937bd728a7d79ab0eb6e7c-init/diff/etc/hosts
c--------- 1 root root 0, 0 Sep 19 10:33 /data/docker/overlay2/a96da70f876a18dc93ea677da85d7e02fa30cfcebde8b739cc3ae4e0803eddbb-init/diff/etc/hosts

I looked through some of the source code and saw the /etc/hosts being bind mounted and some other magic. But I cannot explain why it becomes a character special device.

I gathered the information by using strace on dockerd and following the system calls it makes.

Manually doing the overlay and trying to remove /etc/hosts and recreating it triggers the same error. So I can reproduce that manually. When I remove the character special device and make it a regular file and overlay it again it works without issues.
This workaround doesn’t work for Docker as it will recreate the init layer when starting up a new
container every time.

Any ideas what I can check next? I am at a bit at a loss…

Command that I execute to test (with a clean /data/docker directory):
docker run --rm -it bash

Full docker info:

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 1
 Server Version: 19.03.1
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
 runc version: 425e105d5a03fabd737a126ad93d62a9eeede87f
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 3.10.0-957.21.3.el7.x86_64
 Operating System: OpenShift
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 31.41GiB
 Name: XXX
 ID: XXX
 Docker Root Dir: /data/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

/data/docker is a mounted xfs partition.

$ xfs_info /data/docker/
meta-data=/dev/mapper/vgdata-lvdocker isize=512    agcount=4, agsize=32767744 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=131070976, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=63999, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Hello!
Can you Try This
service docker status
After You give a error like this
docker daemon running
then check to dockerd
dockerd is running that’s why you face issues

The service is running without issues.

● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/docker.service.d
└─service-overrides.conf
Active: active (running) since wo 2019-09-25 08:27:05 CEST; 1 weeks 0 days ago
Docs: https://docs.docker.com
Main PID: 62244 (dockerd)
Tasks: 16
Memory: 47.6M
CGroup: /system.slice/docker.service
└─62244 /usr/bin/dockerd --data-root=/data/docker --storage-driver=overlay2 --group=docker

sep 25 08:27:05 hostname dockerd[62244]: time=“2019-09-25T08:27:05.411349721+02:00” level=info msg=“Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daem… IP address”
sep 25 08:27:05 hostname dockerd[62244]: time=“2019-09-25T08:27:05.453610215+02:00” level=info msg=“Loading containers: done.”
sep 25 08:27:05 hostname dockerd[62244]: time=“2019-09-25T08:27:05.483566528+02:00” level=info msg=“Docker daemon” commit=74b1e89 graphdriver(s)=overlay2 version=19.03.1
sep 25 08:27:05 hostname dockerd[62244]: time=“2019-09-25T08:27:05.483643247+02:00” level=info msg=“Daemon has completed initialization”
sep 25 08:27:05 hostname dockerd[62244]: time=“2019-09-25T08:27:05.497897015+02:00” level=info msg=“API listen on /var/run/docker.sock”
sep 25 08:27:05 hostname systemd[1]: Started Docker Application Container Engine.
sep 25 08:27:15 hostname dockerd[62244]: time=“2019-09-25T08:27:15.768286858+02:00” level=error msg=“Handler for POST /v1.40/containers/create returned error: open /data/docke…id argument”
sep 25 08:27:19 hostname dockerd[62244]: time=“2019-09-25T08:27:19.911985752+02:00” level=error msg=“Handler for POST /v1.40/containers/create returned error: open /data/docke…id argument”
sep 25 08:40:55 hostname systemd[1]: Current command vanished from the unit file, execution of the command list won’t be resumed.
sep 25 10:03:03 hostname dockerd[62244]: time=“2019-09-25T10:03:03.467430075+02:00” level=error msg=“Handler for POST /v1.40/containers/create returned error: open /data/docke…id argument”

this may be a little late… but did you find out what was the problem? i have the exact same problem.

thank you,
daniel.

Hi Daniel,

The problem still persists. You are trying to fix the issue for, well, me. :wink:
There is a GitHub issue (also made by me) which contains more info: https://github.com/moby/moby/issues/40072

Regards,
Serhat

serhat! hahaha!!

i fixed it. I removed the acl’s from /data/docker and now it works fine.

1 Like

That’s great… Weird, but great. Thanks!

yes this works for me. i was getting issue …/merged/etc/hostname: invalid argument and some time /merged/etc/resolv.conf: invalid argument.
setfacl -R --remove-all /data/docker
then restart docker service.

If you want to validate same without removing acl, just change docker data dir to some system default location
i.e file /etc/docker/daemon.json
{“graph”: “/data/docker”}