I am hosting a private Docker registry on AWS EC2 and want to push a 100 GB layer to it. The upload itself appears to succeed, meaning the progress bar on the client reaches 100%, but then there are multiple retries and the client eventually exits with:
received unexpected HTTP status: 500 Internal Server Error
Other, smaller layers within the same image push without problems, so the issue affects only this single large layer. That is why I suspect it is not a configuration problem on the registry side.
Registry version: 2.6.2
docker client version: 17.12.1-ce
I am running the registry image as a swarm service with 1 replica. docker version on the swarm node:
Client:
 Version:       18.03.0-ce
 API version:   1.37
 Go version:    go1.9.4
 Git commit:    0520e24
 Built:         Wed Mar 21 23:10:01 2018
 OS/Arch:       linux/amd64
 Experimental:  false
 Orchestrator:  swarm

Server:
 Engine:
  Version:      18.03.0-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.4
  Git commit:   0520e24
  Built:        Wed Mar 21 23:08:31 2018
  OS/Arch:      linux/amd64
  Experimental: false
The infrastructure between the client and the registry looks like this:
+--------+ +-------+ +-------+ +-------+ +----------+
| Client | --> | ELB 1 | --> | Nginx | --> | ELB 2 | --> | Registry |
+--------+ +-------+ +-------+ +-------+ +----------+
Based on other reports of similar issues, I have set the idle timeout on both ELBs to 3600 seconds.
The nginx configuration follows the official recipe.
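For completeness, the relevant part of my configuration mirrors that recipe. Roughly (the upstream name and server names here are placeholders, not my literal values):

```nginx
upstream docker-registry {
  server registry-elb2.internal:5000;  # placeholder for my ELB 2 endpoint
}

server {
  listen 443 ssl;
  server_name registry.*;

  # disable any limit on the client request body size
  # so large layers are accepted
  client_max_body_size 0;

  # required to avoid HTTP 411 errors on chunked uploads
  chunked_transfer_encoding on;

  location /v2/ {
    proxy_pass                          http://docker-registry;
    proxy_set_header  Host              $http_host;
    proxy_set_header  X-Real-IP         $remote_addr;
    proxy_set_header  X-Forwarded-For   $proxy_add_x_forwarded_for;
    proxy_set_header  X-Forwarded-Proto $scheme;
    proxy_read_timeout                  900;
  }
}
```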
I do not see any relevant error messages in the registry logs.
The nginx logs show 5 pairs of the following lines:
2018/06/12 18:02:29 [warn] 8#8: *108325 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000056, client: 185.242.113.117, server: registry.*, request: "PATCH /v2/nominatim/blobs/uploads/3efb293b-0f9e-4cf2-9124-6b6ea878559b?_state=LNzokuLcihlUHRyt3qILWr4RPhSZQEcigU-63oF7E2J7Ik5hbWUiOiJub21pbmF0aW0iLCJVVUlEIjoiM2VmYjI5M2ItMGY5ZS00Y2YyLTkxMjQtNmI2ZWE4Nzg1NTliIiwiT2Zmc2V0IjowLCJTdGFydGVkQXQiOiIyMDE4LTA2LTEyVDE4OjAyOjI5LjAxNTg3OTA2NFoifQ%3D%3D HTTP/1.1", host: "registry.*"
2018/06/12 19:09:46 [crit] 8#8: *108325 pwrite() "/var/cache/nginx/client_temp/0000000056" failed (28: No space left on device), client: 185.242.113.117, server: registry.*, request: "PATCH /v2/nominatim/blobs/uploads/3efb293b-0f9e-4cf2-9124-6b6ea878559b?_state=LNzokuLcihlUHRyt3qILWr4RPhSZQEcigU-63oF7E2J7Ik5hbWUiOiJub21pbmF0aW0iLCJVVUlEIjoiM2VmYjI5M2ItMGY5ZS00Y2YyLTkxMjQtNmI2ZWE4Nzg1NTliIiwiT2Zmc2V0IjowLCJTdGFydGVkQXQiOiIyMDE4LTA2LTEyVDE4OjAyOjI5LjAxNTg3OTA2NFoifQ%3D%3D HTTP/1.1", host: "registry.*"
I suspect that these pairs correspond to the client’s retry attempts. As you can see, roughly 1 hour passes between the warning and the error, which matches the idle timeout of 3600 seconds configured on my load balancers. However, the error message points to a disk space problem, not a timeout. I checked the relevant nodes running nginx after the push failed: there were 17 GB of free disk space.
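To show what I mean by "free disk space", here is a sketch of the check I ran (the temp path comes from the log lines above; the 100 GiB figure is the size of my layer):

```shell
#!/bin/sh
# Sketch of the free-space check. nginx buffers the entire request body
# to its client_temp path before forwarding it, so that filesystem must
# be able to hold the whole layer at once.
buffer_headroom() {
  dir="$1"      # directory on the filesystem to check
  need="$2"     # required free space in bytes
  need_kb=$(( need / 1024 ))
  # POSIX `df -Pk` prints available 1K-blocks in column 4 of line 2
  avail_kb=$(df -Pk "$dir" 2>/dev/null | awk 'NR==2 {print $4}')
  if [ "${avail_kb:-0}" -ge "$need_kb" ]; then
    echo "enough: ${avail_kb} KiB free, need ${need_kb} KiB"
  else
    echo "insufficient: ${avail_kb:-0} KiB free, need ${need_kb} KiB"
  fi
}

# Temp path taken from the nginx log above; 100 GiB is my layer size.
buffer_headroom /var/cache/nginx/client_temp $((100 * 1024 * 1024 * 1024))
```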
So I cannot make sense of this and would appreciate some help.