Hi,
We are facing an issue with IoT Edge module pulls on one of our hardware platforms. The gateway is not able to pull all of the edge modules listed in the deployment manifest file. While downloading some modules, the error below is observed in the IoT Edge runtime logs. It points to an error in the Docker overlay filesystem.
**Could not pull image XXXXXX/YYYY:6.0.0-amd64** <4>2022-03-28T08:56:44Z [WARN] - caused by: failed to register layer: error creating overlay mount to var/lib/docker/overlay2/24593016a6b6bf0eaf6543d5ec82d94244d5fcb6d25e3be62ed0da70761daacd/merged: too many levels of symbolic links
IoT Edge runtime version: 1.1.6
Docker version: 20.10.12
OS: Yocto
Kernel version : 5.4.94
Could you please let us know the cause for this error and how to recover from this error when the devices are installed at sites during production?
We are not expecting any errors while IoT Edge modules are being pulled as per the deployment manifest file. All the modules should be pulled and running.
It can be a problem if /var/lib/docker is on a network filesystem, but I don't think the error would be "too many symbolic links".
Some people online have solved similar issues by cleaning up the overlay filesystem. You can try docker system prune or docker image prune. If it is not a problem for you, you could also try restarting the Docker daemon. I would not do that in production unless there is no other way.
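The cleanup steps above could be wrapped in a small script so they can be rehearsed before touching a real device. This is only a sketch: it assumes the host uses systemd and that the unit is called docker, and the run helper and the DRY_RUN variable are names I made up for illustration.

```shell
#!/bin/sh
# Sketch only: prune unused Docker data, then restart the daemon.
# Assumption: systemd manages Docker and the unit is named "docker".
# DRY_RUN is a hypothetical variable: set it to 1 to print the commands
# without running them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

recover_docker() {
    # Remove stopped containers, unused networks, dangling images and build cache.
    run docker system prune -f || return 1
    # Remove dangling images (overlaps with the step above; kept to mirror the advice).
    run docker image prune -f || return 1
    # Restarting the daemon normally interrupts the running containers.
    run systemctl restart docker
}
```

Setting DRY_RUN=1 and then calling recover_docker only prints the three commands, which is a safe way to check the script on a production gateway before running it for real.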
The strange thing for me in this error message is that you get it when you try to pull an image. I know there are some symbolic links in /var/lib/docker/overlay2/l, but those links are next to each other, so it should not be a problem. "Too many levels" can happen when you have a symbolic link pointing to itself, like this, and something tries to resolve it:
mkdir test
cd test
ln -s mylink mylink
find -L -xtype l
I didn't know it, so I had to look for the message. Here is the source:
My example is a little shorter, but the point is the same. I don't know why it happens when pulling images, but my guess is that at some point you tried to pull an image and the pull was stopped after a symbolic link had already been created. Now Docker tries to create the same symbolic link again, but the filesystem in /var/lib/docker is broken. Docker also tries to clean up after itself on start, if I am not mistaken, so this is why I think that running docker system prune, docker image prune and restarting Docker could help.
Actually, if I think about it again, when the network is slow, it might cause Docker to fail and retry on a broken filesystem. Of course, I am just guessing like a gambler.
Okay, firstly, /var/lib/docker is not on a network filesystem.
Secondly, we ran the same command, docker system prune, then restarted Docker, and it recovered! But we have a couple of worries now.
We have already deployed our gateways running dockerd in the field. How do we avoid this problem on these remote systems? It is very hard to connect to each gateway every time and clear the issue.
Is there any way to monitor for this issue and clear it periodically, or are there any options to use when starting the dockerd service?
Has this been fixed in a recent dockerd release? We are already using version 20.10.12.
I have never had this issue, I just read about it. I don't know how frequently this can happen and I don't exactly know why it can happen, because I think there should be a way to handle this issue. But sometimes reproducing an issue intentionally is very hard, so the developers need people who ran into this issue and can tell about their environment. So you can do that and hope it can be resolved soon. I know, finding the proper repository for the issue is not always obvious but here are some links:
Since not everything is open source, there are some repos just for issues:
If you have a problem on a specific operating system but you don't know which component is causing it, I would try one of the "Docker for" links. It will probably take time to search for existing issues, but I believe this is how we can end issues like this.
Of course, until it can be fixed, you want answers to your questions.
Since we don't know (at least I don't) exactly why it happens, I don't know how we could avoid it. If you can reproduce it intentionally, maybe you can avoid it or at least add the steps to your report. The other thing you can do is wait for the next case and try to find the broken link the same way as in my example. If you find a broken symbolic link, you can share it on this forum if you are not sure how to handle it, or you could "unlink" that link and check whether that fixes the problem. You have to be careful, though, because if that link was not the cause of the issue, removing it can cause another one. If you run a cluster with enough redundancy and not just an individual Docker host, you can remove the broken cluster node and join another. I am not sure whether that is faster or easier than running docker system prune and restarting Docker. One thing that you should always do is make sure you have a stable enough network and power supply.
You can periodically run the find -L -xtype l command mentioned above in /var/lib/docker to find broken links, but that would only help if I was right about what caused the issue. I don't know of any dockerd option that can help you.
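A periodic check like the one described above could look roughly like this. It is only a sketch and rests on my unconfirmed guess that an unresolvable symbolic link is the cause. The function name is made up, and it avoids -xtype because that is a GNU find extension which may not be available on a minimal Yocto image.

```shell
#!/bin/sh
# Sketch only: list symbolic links that cannot be resolved, meaning links
# that are dangling or that loop back to themselves.
# Assumption (unconfirmed): such links are what triggers the overlay error.

find_unresolvable_links() {
    # Directory to scan; defaults to the Docker data directory from this thread.
    dir=${1:-/var/lib/docker}
    # "-type l" selects symbolic links without following them.
    # "test -e" follows the link and fails if the target is missing or the
    # link loops, so only unresolvable links are printed.
    find "$dir" -type l ! -exec test -e {} \; -print
}
```

Calling find_unresolvable_links with no argument scans /var/lib/docker, which normally requires root. It only reports the links and removes nothing, so the output can be reviewed or attached to a bug report before deciding what to do.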
Since I don't have this issue, I don't follow these fixes, but I found a similar issue in the moby project: Fail to pull an image after it's been interrupted by a power cut · Issue #42964 · moby/moby · GitHub
Since this issue is still open, I don't think it was fixed. And this is a relatively old issue from 2021, so I think you should report your problem there as well. If enough people report it, it might get a higher priority.