Ssh-agent use in Dockerfile breaks subsequent commands during build

I was not sure where to report a bug in docker build. Feel free to point me to a more appropriate forum.

I need to run ssh-agent to execute a command during the build process, and I found that this breaks any subsequent command in the Dockerfile (each produces an “invalid argument” error).

This occurs whether I build the image with docker compose or docker buildx, with or without --no-cache.

It appears that ssh-agent leaves an orphaned folder, /tmp/ssh-HEXDEC, in /tmp, which somehow breaks either caching of the intermediate image or its subsequent execution.
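For illustration, I believe the leftover directory can be reproduced in a single RUN step with something like this (the hex suffix is random, and this is just a sketch):

RUN ssh-agent true && ls /tmp
# expect a stray directory like ssh-XXXXXXXXXX containing the agent socket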

If I remove that temporary folder within the same RUN command, it fixes the problem and the docker build completes.

I have assembled a simple test case but cannot attach files as a new user.

The Dockerfile that works looks something like:

FROM ubuntu:focal-20221130

# Prerequisite tools
RUN apt-get update && \
    apt-get install -y \
        openssh-client

RUN ssh-agent sh -c 'uname -a' && \
    uname -a && \
    rm -rf /tmp/ssh-*

RUN uname -a

Remove the rm -rf /tmp/ssh-* line and the resulting build fails at the RUN uname -a command.

Here is my docker info:

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.10.2
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.15.1
    Path:     /usr/libexec/docker/cli-plugins/docker-compose
  scan: Docker Scan (Docker Inc.)
    Version:  v0.23.0
    Path:     /usr/libexec/docker/cli-plugins/docker-scan

Server:
 Containers: x
  Running: x
  Paused: x
  Stopped: x
 Images: x
 Server Version: 23.0.0
 Storage Driver: overlay
  Backing Filesystem: extfs
  Supports d_type: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
 Kernel Version: 3.10.0-1160.83.1.el7.x86_64
 Operating System: CentOS Linux 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 32
 Total Memory: 31.15GiB
 Name: xxxxxx.xxx.xxxx.xxx
 ID: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 Docker Root Dir: /xxxxxx/xxxxx/xxxxxx
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

I realize I am using an ancient storage driver, overlay, on this machine. If I get a chance, I will test on a system running a newer storage driver.
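For reference, I believe the active storage driver can be checked quickly with:

docker info --format '{{.Driver}}'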

Please share the error message.

Based on your description I am not sure I fully understand the issue, but there were multiple build-related fixes in Docker 23.0.1. I recommend upgrading from 23.0.0 to the fixed version and checking whether you get the same result.

For me, the error is unexpected. I figured out a workaround, but it took me quite a while to determine that an extra folder in /tmp seemed to be causing the problem. The error would seem to suggest a problem creating the intermediate image or subsequently loading the intermediate image for the next step in the Dockerfile.

Here is the Dockerfile that does not work. I include the error message below:

# Start from the Ubuntu focal base image
FROM ubuntu:focal-20221130

# Prerequisite tools
RUN apt-get update && \
    apt-get install -y \
        openssh-client

RUN ssh-agent sh -c 'uname -a' && \
    uname -a

RUN uname -a

Here is an example output:

docker buildx build --no-cache -f Dockerfile.bad -t sshagent:bad --progress=plain .
#1 [internal] load .dockerignore
#1 transferring context: 2B done
#1 DONE 0.0s

#2 [internal] load build definition from Dockerfile.bad
#2 transferring dockerfile: 373B done
#2 DONE 0.0s

#3 [internal] load metadata for docker.io/library/ubuntu:focal-20221130
#3 DONE 0.0s

#4 [1/4] FROM docker.io/library/ubuntu:focal-20221130
#4 CACHED

#5 [2/4] RUN apt-get update &&     apt-get install -y         openssh-client
...
#5 DONE 9.1s

#6 [3/4] RUN ssh-agent sh -c 'uname -a' &&     uname -a
#6 0.546 Linux buildkitsandbox 3.10.0-1160.83.1.el7.x86_64 #1 SMP Wed Jan 25 16:41:43 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
#6 0.547 Linux buildkitsandbox 3.10.0-1160.83.1.el7.x86_64 #1 SMP Wed Jan 25 16:41:43 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
#6 DONE 0.9s

#7 [4/4] RUN uname -a
#7 ERROR: failed to prepare w03evjooaydfac3f503ffzpoo as 4bt19qedk554t7caq2u1zbgu7: invalid argument
------
 > [4/4] RUN uname -a:
------
Dockerfile.bad:12
--------------------
  10 |         uname -a
  11 |     
  12 | >>> RUN uname -a
  13 |     
  14 |     
--------------------
ERROR: failed to solve: failed to prepare w03evjooaydfac3f503ffzpoo as 4bt19qedk554t7caq2u1zbgu7: invalid argument

I do not know how to interpret the error message. The full message does not seem any more informative than what I originally posted. Let me know if there are command-line flags I could add to increase the debug output. If necessary, I can restart the docker daemon with debugging enabled to produce more.
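For example, I believe daemon-side debugging can be enabled by adding this to /etc/docker/daemon.json and then restarting the daemon (systemctl restart docker), though I have not tried it yet:

{
  "debug": true
}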

Perhaps this is a version-dependent problem. I have provided the Dockerfile above in case anyone wants to test this on their system.

Have you upgraded Docker to 23.0.1?
Each instruction starts a new container (unless that behavior has changed in the new version). That means nothing should fail because of a file in a container started by a previous instruction.

My apologies. I just upgraded this server to 23.0.0 last week. I wasn’t aware 23.0.1 came out so recently.

I just now upgraded to 23.0.1 and still encounter the same error.

As long as I am careful to remove that /tmp/ssh-SOMEHEXDEC folder after an ssh-agent call, the Docker build proceeds. If I do not, it fails.

I suspect that this folder is somehow related to ssh maintaining state between boots of the machine running the ssh-agent (in this case, our container), and that this probably contradicts the assumptions about state in the docker build process.

I simply had not thought about the effects on state within intermediate images at this level of detail while building the final image.

For this ssh-agent problem, for the moment I will simply delete the folder in /tmp, which solves it for me. Eventually, I will figure out how to run ssh-agent in a way that avoids having to delete this temporary folder.
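One idea I have not tested yet: start and explicitly kill the agent within the same RUN, which I believe lets ssh-agent clean up its own socket directory on exit. A sketch only:

RUN eval "$(ssh-agent -s)" && \
    uname -a && \
    ssh-agent -k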

I tried your example Dockerfile now, and I was clearly wrong about build containers not using the same /tmp folder, which is interesting since there is no mounted tmp, so I guess the content is copied from container to container. In any case, it worked on macOS (Docker Desktop) and Linux. Both Docker versions were 23.0.1.

Even if the tmp folder is kept, it should not break any instruction.
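For example, I would expect leftover files in /tmp to carry over harmlessly between steps. A quick sketch of that expectation (not of your specific bug):

FROM ubuntu:focal-20221130
RUN mkdir -p /tmp/leftover && touch /tmp/leftover/marker
RUN ls /tmp/leftover && uname -a
# both steps normally succeed; files written in one RUN are visible in the next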

I agree with you that something in /tmp should not break the build process.

My best guess is that ssh-agent is setting up a process that survives reboot and that this interferes with the build. Deleting the folder from /tmp breaks whatever ssh-agent was trying to do, and so works as a hack solution to this problem.

A better approach would be for me to figure out what ssh-agent is doing well enough to stop it from setting itself up to do something after a reboot. I only need it for a one-off execution during build.
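I have also read that BuildKit can forward an agent from the host instead of running one inside the build, which might sidestep this entirely. This is only a sketch based on the documented --mount=type=ssh feature (git@github.com is just a placeholder target):

# syntax=docker/dockerfile:1
FROM ubuntu:focal-20221130
RUN apt-get update && apt-get install -y openssh-client
RUN --mount=type=ssh ssh -T git@github.com || true

built with something like:

docker buildx build --ssh default .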

Thanks for your feedback.

I have used the same Dockerfile that you shared and it worked perfectly for me.
To me it seems like you had a non-printable character in the Dockerfile which caused a problem only when it was the last character before the next RUN instruction. Since the error message was “invalid argument” and it started with “failed to solve”, which is what Docker throws when it can’t find a file (sometimes the Dockerfile itself), it is not likely to be caused by a file inside a container.

Try to change this instruction

RUN ssh-agent sh -c 'uname -a' && \
    uname -a && \
    rm -rf /tmp/ssh-*

to this

RUN ssh-agent sh -c 'uname -a' && \
    uname -a && \
    whoami

Since the container is not “booting” and only one process starts in each build layer, without the processes that ran in previous layers, ssh-agent could hardly cause the issue unless there is something going on that I have never heard of, which wouldn’t be impossible, and I almost hope that is the case so I could learn something new. However, if ssh-agent were the reason, I think I should have gotten the same error.
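If you want to convince yourself of that, here is a quick sketch (assuming procps for ps, since the ubuntu base image does not ship it):

FROM ubuntu:focal-20221130
RUN apt-get update && apt-get install -y procps
RUN sleep 60 & echo "background process started in this step"
RUN ps -e | grep sleep || echo "no sleep process carried over"

The last step should print the fallback message, because processes do not survive between build steps even though files do.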

I tried building the image on a new docker installation running the default overlay2 storage driver.

The build succeeded.

When I switched the storage driver to the old overlay and restarted docker, the build failed.

I switched back to overlay2, restarted docker, and the build succeeded. I was careful to remove any residual containers and images before each attempt.

So it would seem to be some issue with the overlay driver. When I have a chance, I plan to migrate the older system where I was working and first had problems to overlay2.
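For anyone who wants to reproduce the switch: I set the driver in /etc/docker/daemon.json and restarted the daemon each time (switching drivers hides existing images, which is why I cleaned up between attempts):

{
  "storage-driver": "overlay"
}

then systemctl restart docker, and "storage-driver": "overlay2" to switch back.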


Still weird, but at least you figured out that it is related to the storage driver. Thanks for the update!