We are building a scalable self-hosted GitHub Actions runner architecture on AWS, and we’re encountering a persistent concurrency issue. Our goal is to have a shared Docker image cache to accelerate our CI/CD pipelines by avoiding repeated image downloads.
Our Architecture:
We use AWS ECS to run our self-hosted runner tasks.
The ECS tasks run on a fleet of EC2 instances.
To share the Docker image cache, we have mounted an Amazon EBS volume to each EC2 instance at /opt/runners.
The Problem:
The setup works perfectly when a single CI job is running. However, when multiple jobs are triggered concurrently, we run into build failures.
This is happening even though the image exists locally on the EBS volume from a previous successful run.
Our Questions:
Is using a shared EBS volume for /opt/runners across multiple EC2 instances a viable or supported architecture for achieving a shared Docker cache? If so, how can we mitigate these race conditions?
What is the recommended architecture for achieving a shared Docker layer cache for concurrent, ephemeral GitHub Actions runners on AWS?
You really should not share /var/lib/docker across multiple virtual machines. It is not just an "image cache"; it is the entire Docker data root, and sharing it can cause serious issues. Every machine would see the metadata of all machines, including containers recorded in that metadata that are not actually running on the machine doing the reading.
I don't know if this is the cause of your issue, but each machine should have its own Docker data root, not used by anyone else.
You can run a local Docker registry somewhere in your environment if you want to speed up pulling images.
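For example, the official `registry:2` image can act as a pull-through cache (mirror) for Docker Hub, so each host keeps its own data root but slow upstream pulls happen only once. A minimal sketch; the hostname `registry-mirror.internal` and port 5000 are placeholder assumptions for your environment:

```shell
# Run a registry configured as a pull-through cache for Docker Hub.
docker run -d --name registry-mirror \
  -p 5000:5000 \
  -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io \
  registry:2

# On each runner host, point the Docker daemon at the mirror
# (assumed mirror address; adjust to your network):
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "registry-mirrors": ["http://registry-mirror.internal:5000"]
}
EOF
sudo systemctl restart docker
```

Note that a pull-through cache only mirrors public pulls from Docker Hub; pushing your own build images requires a regular (non-proxy) registry or ECR.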
I need to correct my previous statement about our architecture. My apologies for any confusion. To be clear, we are not sharing the Docker daemon; each runner has its own isolated environment.
An EBS volume is block storage bound to a single Availability Zone of a region, and a standard EBS volume cannot be attached to more than one compute node at a time (EBS Multi-Attach exists only for io1/io2 volumes and still requires a cluster-aware file system). Not a feasible solution.
The answer remains the same: use GitHub Actions caching. It should support S3 as a backend. I haven't configured it myself, but you should find pointers in the docs or in blog posts.
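Another option that fits ephemeral runners well is BuildKit's S3 cache backend, which stores and restores layer cache directly from a bucket instead of the local data root. A sketch, assuming a bucket named `my-ci-cache` in `us-east-1` and an image called `myapp` (all placeholders):

```shell
# Export the build cache to S3 and reuse it on the next runner.
# Requires docker buildx with a BuildKit builder and AWS credentials
# available to the build (e.g. via the instance role).
docker buildx build \
  --cache-from type=s3,region=us-east-1,bucket=my-ci-cache,name=myapp \
  --cache-to   type=s3,region=us-east-1,bucket=my-ci-cache,name=myapp,mode=max \
  -t myapp:ci .
```

Because each job reads from and writes to the bucket independently, concurrent runners never contend over a shared data root.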
Update: I accidentally wrote "I have configured it myself"; I meant to write that I haven't. Now it's fixed.