I was not sure where to report a bug in docker build. Feel free to point me to a more appropriate forum.
I need to run ssh-agent to execute a command during the build, and found that this breaks the next instruction in the Dockerfile (it fails with an “invalid argument” error).
This occurs whether I build the image with docker compose or docker buildx with or without --no-cache enabled.
It appears that ssh-agent leaves an orphaned folder, /tmp/ssh-HEXDEC, that somehow breaks caching of the intermediate image or its subsequent execution.
If I remove that temporary folder within the same RUN command, it fixes the problem and the docker build completes.
I have assembled a simple test case but cannot attach files as a new user.
The Dockerfile that works looks something like:
FROM ubuntu:focal-20221130
# Pre requisite tools
RUN apt-get update && \
    apt-get install -y \
    openssh-client
RUN ssh-agent sh -c 'uname -a' && \
    uname -a && \
    rm -rf /tmp/ssh-*
RUN uname -a
Remove the rm -rf /tmp/ssh-* part and the resulting Dockerfile build fails at the RUN uname -a command.
Based on your description, I am not sure I fully understand the issue, but there were multiple build-related fixes in Docker 23.0.1. I recommend upgrading from 23.0.0 to the fixed version and checking whether you get the same result.
For me, the error is unexpected. I figured out a workaround, but it took me quite a while to determine that an extra folder in /tmp seemed to be causing the problem. The error would seem to suggest a problem creating the intermediate image or subsequently loading the intermediate image for the next step in the Dockerfile.
Here is the Dockerfile that does not work. I include the error message below:
# Start with osimis orthanc with plugins
FROM ubuntu:focal-20221130
# Pre requisite tools
RUN apt-get update && \
    apt-get install -y \
    openssh-client
RUN ssh-agent sh -c 'uname -a' && \
    uname -a
RUN uname -a
Here is an example output:
docker buildx build --no-cache -f Dockerfile.bad -t sshagent:bad --progress=plain .
#1 [internal] load .dockerignore
#1 transferring context: 2B done
#1 DONE 0.0s
#2 [internal] load build definition from Dockerfile.bad
#2 transferring dockerfile: 373B done
#2 DONE 0.0s
#3 [internal] load metadata for docker.io/library/ubuntu:focal-20221130
#3 DONE 0.0s
#4 [1/4] FROM docker.io/library/ubuntu:focal-20221130
#4 CACHED
#5 [2/4] RUN apt-get update && apt-get install -y openssh-client
...
#5 DONE 9.1s
#6 [3/4] RUN ssh-agent sh -c 'uname -a' && uname -a
#6 0.546 Linux buildkitsandbox 3.10.0-1160.83.1.el7.x86_64 #1 SMP Wed Jan 25 16:41:43 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
#6 0.547 Linux buildkitsandbox 3.10.0-1160.83.1.el7.x86_64 #1 SMP Wed Jan 25 16:41:43 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
#6 DONE 0.9s
#7 [4/4] RUN uname -a
#7 ERROR: failed to prepare w03evjooaydfac3f503ffzpoo as 4bt19qedk554t7caq2u1zbgu7: invalid argument
------
> [4/4] RUN uname -a:
------
Dockerfile.bad:12
--------------------
10 | uname -a
11 |
12 | >>> RUN uname -a
13 |
14 |
--------------------
ERROR: failed to solve: failed to prepare w03evjooaydfac3f503ffzpoo as 4bt19qedk554t7caq2u1zbgu7: invalid argument
I do not know how to interpret the error message. The full message does not seem more informative than what I originally posted. Let me know if there are command line flags I could add to increase the debug info. If necessary, I can reset the docker daemon to produce more debug info.
Perhaps this is a version-dependent problem. I have provided the Dockerfile above in case anyone wants to test this on their system.
Have you upgraded Docker to 23.0.1?
Each instruction starts a new container, unless that has changed in the new version. That means nothing should fail because of a file in a container started by a previous instruction.
My apologies. I just upgraded this server to 23.0.0 last week. I wasn’t aware 23.0.1 came out so recently.
I just now upgraded to 23.0.1 and still encounter the same error.
As long as I am careful to remove that /tmp/ssh-SOMEHEXDEC folder after an ssh-agent call, the Docker build proceeds. If I do not, it fails.
I suspect that this file is somehow related to ssh maintaining state between boots of the machine running the ssh-agent (in this case, our container) and that this probably contradicts the assumptions about state in the docker build process.
I simply have not gotten into this level of detail thinking about the effects on state within intermediate images while building the final image.
For the moment, I will simply delete the folder in /tmp, which solves the ssh-agent problem for me. Eventually, I will figure out how to run ssh-agent in a way that avoids having to delete this temporary folder.
I have now tried your example Dockerfile, and I was clearly wrong about build containers not sharing the same /tmp folder. That is interesting, since no tmp is mounted, so I guess the content is copied from container to container. Still, the build worked on macOS (Docker Desktop) and Linux, both running Docker 23.0.1.
Even if the tmp folder is kept it should not break any instruction.
I agree with you that something in /tmp should not break the build process.
My best guess is that ssh-agent is setting up a process that survives reboot and that this interferes with the build. Deleting the folder from /tmp breaks whatever ssh-agent was trying to do, and so works as a hack solution to this problem.
A better approach would be for me to figure out what ssh-agent is doing well enough to stop it from setting itself up to do something after a reboot. I only need it for a one-off execution during build.
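One possible way to do that is sketched below. It assumes the leftover folder is simply the agent's socket directory, left behind because the build step ends before the agent cleans up after itself; that is a guess, not something confirmed in this thread. The helper name run_with_agent is hypothetical. It starts the agent on a socket path chosen with -a, runs the command, stops the agent with ssh-agent -k, and removes the socket directory.

```shell
#!/bin/sh
# Hypothetical helper: run a command under a short-lived ssh-agent and
# remove the agent's socket directory afterwards.
run_with_agent() {
    # Create a private directory so we know exactly what to delete later.
    sock_dir=$(mktemp -d) || return 1

    # -s prints Bourne-shell commands that set SSH_AUTH_SOCK and SSH_AGENT_PID;
    # -a binds the agent to the socket path we chose.
    eval "$(ssh-agent -s -a "$sock_dir/agent.sock")" >/dev/null

    # Run the requested command and remember its exit status.
    status=0
    "$@" || status=$?

    # -k stops the agent identified by SSH_AGENT_PID.
    ssh-agent -k >/dev/null 2>&1 || true
    unset SSH_AUTH_SOCK SSH_AGENT_PID

    # Remove the socket directory so nothing is left behind in the layer.
    rm -rf "$sock_dir"
    return "$status"
}

run_with_agent uname -a
```

Inside a Dockerfile, the same idea would have to live in a single RUN instruction, for example eval "$(ssh-agent -s)" && uname -a && ssh-agent -k, so that the agent is stopped before the layer is committed.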
I have used the same Dockerfile that you shared and it worked perfectly for me.
To me it looks like you had a non-printable character in the Dockerfile, which caused a problem only when it was the last thing before the next RUN instruction. Since the error message was “invalid argument” and it started with “failed to solve”, which is what Docker throws when it can’t find a file (sometimes the Dockerfile), it is not likely to be caused by a file inside a container.
Try to change this instruction
RUN ssh-agent sh -c 'uname -a' && \
    uname -a && \
    rm -rf /tmp/ssh-*
to this
RUN ssh-agent sh -c 'uname -a' && \
    uname -a && \
    whoami
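The non-printable-character hypothesis can also be checked directly, independent of the build. The sketch below is only an illustration: the function name find_nonprintable is hypothetical, and it assumes a grep that supports -a (GNU, BSD, and BusyBox grep do). It prints every line that contains a byte which is neither printable ASCII nor whitespace, such as a control character.

```shell
#!/bin/sh
# Hypothetical helper: list lines of a file that contain bytes which are
# neither printable nor whitespace in the C locale.
find_nonprintable() {
    # LC_ALL=C makes the character classes byte-based; -a treats the file
    # as text; -n prefixes each match with its line number.
    LC_ALL=C grep -an '[^[:print:][:space:]]' "$1"
}

# Example call (Dockerfile.bad is the file name used in this thread):
# find_nonprintable Dockerfile.bad || echo "no non-printable bytes found"
```

grep exits non-zero when nothing matches, so an empty result means the file is clean by this measure. A carriage return counts as whitespace here, so Windows line endings would not be reported.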
Since the container is not “booting”, and only one process starts in each build layer without the processes that ran in previous layers, ssh-agent could hardly cause the issue unless something is going on that I have never heard of. That wouldn’t be a miracle, and I almost hope it is the case so I can learn something new. However, if ssh-agent were the reason, I think I should have gotten the same error.
I tried building the image on a new docker installation running the default overlay2 storage driver.
The build succeeded.
When I switched the storage driver to the old overlay and restarted docker, the build failed.
I switched back to overlay2, restarted docker, and the build succeeded. I was careful to remove any residual containers and images before each attempt.
So it would seem to be some issue with the overlay driver. When I have a chance, I plan to migrate the older system, where I first hit the problem, to overlay2.
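For anyone who wants to check which driver their own daemon uses, here is a minimal sketch. docker info --format '{{.Driver}}' prints the active storage driver; the helper check_storage_driver and its messages are hypothetical and only restate what was observed in this thread.

```shell
#!/bin/sh
# Print the storage driver the local Docker daemon is using.
current_storage_driver() {
    docker info --format '{{.Driver}}'
}

# Hypothetical helper: map a driver name to what was observed in this thread.
check_storage_driver() {
    case "$1" in
        overlay)  echo "legacy overlay driver: the build failed here, consider overlay2" ;;
        overlay2) echo "overlay2: the build succeeded with this driver" ;;
        *)        echo "other driver: $1 (not tested in this thread)" ;;
    esac
}

# Example call, requires a running Docker daemon:
# check_storage_driver "$(current_storage_driver)"
```

On Linux, the driver is normally selected with the storage-driver key in /etc/docker/daemon.json, followed by a restart of the daemon. Images and containers created under one driver are not visible under another, so they have to be rebuilt or pulled again after switching.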