Could not pull image - caused by: failed to register layer: error creating overlay mount to var/lib/docker/overlay2: too many levels of symbolic links

Hi,
We are facing an issue with pulling IoT Edge modules on one of our hardware devices. The gateway is not able to pull all the edge modules listed in the deployment manifest. While some modules are being downloaded, the error below appears in the IoT Edge runtime logs. It points to an error in the Docker overlay storage.

**Could not pull image XXXXXX/YYYY:6.0.0-amd64** <4>2022-03-28T08:56:44Z [WARN] - caused by: failed to register layer: error creating overlay mount to var/lib/docker/overlay2/24593016a6b6bf0eaf6543d5ec82d94244d5fcb6d25e3be62ed0da70761daacd/merged: too many levels of symbolic links

IoT Edge runtime version: 1.1.6

Docker version: 20.10.12
OS: Yocto
Kernel version: 5.4.94

Could you please let us know the cause of this error and how to recover from it once the devices are installed at sites in production?

We are not expecting any errors while IoT Edge modules are being pulled as per the deployment manifest file. All the modules should be pulled and running.

Thanks and Regards
Pavan

I don’t know Yocto. It is not an officially supported platform, so I will not find the proper way to install Docker on it in the docs.

  • Can you show the output of docker info?
  • What does the Docker data folder look like? Is it itself a symbolic link, for example?
  • Have you ever edited/created/deleted files manually in that folder without docker commands?

Hi, here are the answers to your questions…
  • Can you show the output of docker info?

root@genericx86-64:~# docker info
Client:
Context:    default
Debug Mode: false

Server:
Containers: 15
 Running: 15
 Paused: 0
 Stopped: 0
Images: 15
Server Version: 20.10.12
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
 userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 7b11cfaabd73bb80907dd23182b9347b4245eb5d
runc version: v1.0.2-0-g52b36a2d
init version: de40ad0
Security Options:
 seccomp
  Profile: default
Kernel Version: 5.4.94-yocto-standard
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 3.675GiB
Name: genericx86-64
ID: A4B2:SQYC:AVIB:BL7B:P2XY:TGMO:TUPG:FGI4:UJHK:D5XV:G2XG:PBJG
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 localhost:5000
 127.0.0.0/8
Registry Mirrors:
 http://localhost:5000/
Live Restore Enabled: false
Product License: Community Engine

WARNING: No cpu cfs quota support
WARNING: No cpu cfs period support
WARNING: No blkio throttle.read_bps_device support
WARNING: No blkio throttle.write_bps_device support
WARNING: No blkio throttle.read_iops_device support
WARNING: No blkio throttle.write_iops_device support

  • What does the Docker data folder look like? Is it itself a symbolic link, for example?
ls /var/lib/docker/
buildkit/   containers/ network/    plugins/    swarm/      trust/
containerd/ image/      overlay2/   runtimes/   tmp/        volumes/

ls /var/lib/docker/ -lah
total 68K
drwx--x---  14 root root 4.0K Apr  7 10:44 .
drwxr-xr-x   9 root root 4.0K Jan  6 12:35 ..
drwx--x--x   4 root root 4.0K Apr  7 10:42 buildkit
drwx--x--x   3 root root 4.0K Apr  7 10:42 containerd
drwx--x---  17 root root 4.0K Apr  8 16:33 containers
drwx------   3 root root 4.0K Apr  7 10:42 image
drwxr-x---   3 root root 4.0K Apr  7 10:42 network
drwx--x--- 114 root root  16K Apr  8 16:33 overlay2
drwx------   4 root root 4.0K Apr  7 10:42 plugins
drwx------   2 root root 4.0K Apr  7 10:44 runtimes
drwx------   2 root root 4.0K Apr  7 10:42 swarm
drwx------   2 root root 4.0K Apr  8 11:47 tmp
drwx------   2 root root 4.0K Apr  7 10:42 trust
drwx-----x   2 root root 4.0K Apr  7 10:44 volumes
  • Have you ever edited/created/deleted files manually in that folder without docker commands?
    I have not edited/created/deleted files manually.

More importantly, this issue is observed when the network is slow. Could the slow network be the cause?

A slow network can be a problem if /var/lib/docker is on a network filesystem, but even then I don’t think the error would be “too many levels of symbolic links”.

Some people on the net have solved similar issues by cleaning up the overlay filesystem. You can try docker system prune or docker image prune. If it is not a problem for you, you could also try restarting the Docker daemon. I would not do that in production unless there is no other way.

The strange thing for me is that you get this error message when you try to pull an image. I know there are some symbolic links in /var/lib/docker/overlay2/l, but those links sit next to each other, so that should not be a problem. “Too many levels” can happen when you have a symbolic link pointing to itself, like this, and something tries to resolve it:

mkdir test
cd test
ln -s mylink mylink
find -L -xtype l

I didn’t know it, so I had to look for the message. Here is the source:

My example is a little shorter, but the point is the same. I don’t know why it happens when pulling images, but my guess is that at some point a pull was interrupted after a symbolic link had already been created, and now Docker tries to create the same symbolic link again while the filesystem state in /var/lib/docker is broken. Docker also tries to clean up after itself on start, if I am not mistaken, which is why I think that running docker system prune or docker image prune and restarting Docker could help.

Actually, if I think about it again, when the network is slow, it might cause Docker to fail and retry on a broken filesystem. Of course, I am just guessing like a gambler.
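
To illustrate the self-link example above, here is a minimal sketch that reproduces the error outside Docker and cleans up after itself. It assumes a POSIX shell with mktemp and cat available; the names demo_dir and eloop_msg are only illustrative. The exact wording of the message depends on the C library, so no particular output is shown here.

```shell
#!/bin/sh
# Create a throwaway directory containing a symlink that points to itself.
demo_dir=$(mktemp -d)
ln -s mylink "$demo_dir/mylink"      # mylink -> mylink

# Any program that has to resolve the link fails, not only Docker.
# LC_ALL=C keeps the error message in English; "|| true" ignores cat's failure.
eloop_msg=$(LC_ALL=C cat "$demo_dir/mylink" 2>&1) || true
echo "$eloop_msg"

# Remove the demo directory again.
rm -rf "$demo_dir"
```

The message comes from the kernel refusing to resolve the link, which is why the same text can show up in a Docker log when the overlay mount hits such a path.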

Okay, firstly, /var/lib/docker is not on a network filesystem.
Secondly, we ran the same command, docker system prune, then restarted Docker, and it recovered! But we have a couple of worries now:

  1. We have already deployed our gateways running dockerd in the field. How do we avoid this problem on these remote systems? It is very hard to connect to each gateway every time and clear the issue.

  2. Is there any way to monitor this issue and clear it periodically, or are there any options we can use when starting the dockerd service?

  3. Have any fixes been made in the latest dockerd release? We are already using the latest dockerd version, 20.10.12.

I have never had this issue myself; I have only read about it. I don’t know how frequently it can happen or exactly why, although I think there should be a way to handle it. Reproducing an issue intentionally is sometimes very hard, so the developers need people who ran into it and can describe their environment. You can do that and hope it gets resolved soon. I know finding the proper repository for an issue is not always obvious, but here are some links:

Since not everything is open source, there are some repos just for issues:

Sometimes you know you have a problem with a certain component:

If you have a problem on a specific operating system but you don’t know which component is at fault, I would try one of the “Docker for” links. Searching for existing issues will probably take time, but I believe this is how issues like this get resolved.

Of course, until it is fixed, you want answers to your questions :slight_smile:

  1. Since we don’t know (at least I don’t) exactly why it happens, I don’t know how we could avoid it. If you can reproduce it intentionally, maybe you can avoid it, or at least add the steps to your report. The other thing you can do is wait for the next occurrence and try to find the broken link the same way my example worked. If you find a broken symbolic link, you can share it on this forum if you are not sure how to handle it, or you could “unlink” it and check whether that fixes the problem. Be careful, though: if that link was not the cause of the issue, removing it can cause another one. If you run a cluster with enough redundancy rather than an individual Docker host, you can remove the broken cluster node and join another. I am not sure whether that is faster or easier than running docker system prune and restarting Docker. One thing you should always do is make sure your network and power supply are stable enough.
  2. You can periodically run the find -L -xtype l command mentioned above in /var/lib/docker to look for broken links, but that would help only if I was right and a broken link is what caused the issue. I don’t know of any dockerd option that can help you.
  3. Since I don’t have this issue, I don’t follow these fixes, but I found a similar issue in the moby project: “Fail to pull an image after it's been interrupted by a power cut” (moby/moby issue #42964).
    Since this issue is still open, I don’t think it has been fixed. It is a relatively old issue from 2021, so I think you should report your problem there as well. If enough people report it, it might get a higher priority.
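
A periodic check along the lines of answer 2 could look like the sketch below. It deliberately uses plain [ -L ] and [ -e ] tests instead of the find command mentioned above, and it only looks at the overlay2/l directory discussed earlier, because symlinks inside image layers often look broken from the host without being a problem. The function name find_unresolvable_links and the LINK_DIR variable are hypothetical, the default path assumes the Docker Root Dir from the docker info output, and whether an unresolvable link there is really what causes the pull error is an assumption, not something confirmed in this thread.

```shell
#!/bin/sh
# Print every symlink directly inside the given directory that cannot be
# resolved (dangling target or symlink loop), one path per line.
find_unresolvable_links() {
    dir=$1
    for link in "$dir"/*; do
        # -L: the entry itself is a symlink; ! -e: following it fails
        if [ -L "$link" ] && [ ! -e "$link" ]; then
            printf '%s\n' "$link"
        fi
    done
}

# Assumed location of the overlay2 link directory; override via LINK_DIR.
LINK_DIR=${LINK_DIR:-/var/lib/docker/overlay2/l}

if [ -d "$LINK_DIR" ]; then
    bad_links=$(find_unresolvable_links "$LINK_DIR")
    if [ -n "$bad_links" ]; then
        # Only report; deciding what to do with the links is left to you.
        echo "Unresolvable symlinks under $LINK_DIR:" >&2
        printf '%s\n' "$bad_links" >&2
    fi
fi
```

The script only reports and never deletes anything, in line with the warning in answer 1 about unlinking. It would need to run as root, for example from cron, and its output could be attached to a report on the moby issue.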