We run a private Docker registry and use a common procedure to handle builds and deployments. Lately, one project has been failing to deploy to our production environment because the contents of the prod manifest do not match the testing manifest, even after rebuilding and retagging the image. I have not been able to reproduce the issue with another repo (largely because I don’t know how we got into this scenario in the first place), but I can relay the specifics of the problem and what we’ve tried so far.
First, let me describe our process. Developers write code and merge it through a GitHub-driven peer review process. When that merge happens, it kicks off an automated CI process which builds a docker image from the code repository. That image then gets retagged twice. One tag is an internal dated version number, for example, “20230930.161912z”. That gets pushed to our private registry so we always have a unique tag pointing to each build. We then retag that with a simple docker tag command so that there is also an “environmental” tag associated with it. Initially, that’s “testing”, which gets deployed to the testing environment. Even in the odd case we’re experiencing now, we have no problems with deployments to the testing environment. Later, when we are prepared to deploy this image to our prod environment, we run another retag that looks something like this:
docker tag image-name:20230930.161912z image-name:prod
and push the prod tag to the registry.
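To make the promotion step concrete, here is a minimal sketch of the retag-and-push sequence as a list of docker CLI invocations. The function name `promotion_commands` is made up for illustration and is not part of our real automation; the sketch only builds the argument lists and does not run docker.

```python
from typing import List


def promotion_commands(image: str, version_tag: str, env_tag: str) -> List[List[str]]:
    """Return the docker commands that point an environment tag at a versioned build.

    Hypothetical helper: it only builds argument lists, it does not execute them.
    """
    source = f"{image}:{version_tag}"
    target = f"{image}:{env_tag}"
    return [
        ["docker", "tag", source, target],  # retag the versioned image locally
        ["docker", "push", target],         # publish the environment tag
    ]


if __name__ == "__main__":
    for cmd in promotion_commands("image-name", "20230930.161912z", "prod"):
        print(" ".join(cmd))
```

Each list could be handed to `subprocess.run(cmd, check=True)` on a host where the versioned image is already present locally.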
Immediately following the retag and prior to actual deployment, our automation tries to make sure that we are deploying the image we think we’re deploying. It does this by pulling the manifest for the versioned tag and for the “prod” tag and comparing the IDs in the history component. In the problematic case we’re having right now, the history of the “prod” tag does not match that of the versioned tag, and there is also (I think) a discrepancy in the “fsLayers” component. Neither mismatch exists between the versioned tag and the “testing” tag; both exist only between the testing/versioned tags and the “prod” tag.
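The manifest pull in that check can be sketched roughly as below. The registry host `registry.example.com` is a placeholder, authentication is left out, and I am assuming the check asks for the schema 1 manifest media type, since that is the format that carries the “fsLayers” and “history” fields.

```python
import json
import urllib.request

# Placeholder host, not our real registry.
REGISTRY = "https://registry.example.com"
# Schema 1 manifest media type (the format with "fsLayers" and "history").
SCHEMA1 = "application/vnd.docker.distribution.manifest.v1+json"


def manifest_request(image, tag, registry=REGISTRY):
    """Build the GET request for /v2/<image>/manifests/<tag>."""
    url = f"{registry}/v2/{image}/manifests/{tag}"
    return urllib.request.Request(url, headers={"Accept": SCHEMA1})


def fetch_manifest(image, tag, registry=REGISTRY):
    """Fetch and decode the manifest (assumes no auth is required)."""
    with urllib.request.urlopen(manifest_request(image, tag, registry)) as resp:
        return json.load(resp)
```

Calling `fetch_manifest("image-name", "prod")` and `fetch_manifest("image-name", "20230930.161912z")` would give the two documents to compare.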
I’ll provide some details of the problem as it appears now. At this point, the versioned image has been retagged as both “testing” and “prod” and all images have been pushed to the registry.
First, the fsLayers do not match. If I make a call to our registry to show the manifest for this image (/v2/image-name/manifests/20231001.155532z), we can see the fsLayers are as follows (it’s a lot of data, so forgive me for summarizing the output a bit):
[
"sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4",
"sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1",
"sha256:298badf3c0dd9c275ce1e9ee06e379ceb89ecd13547f1c0b92e270819099e66b",
"sha256:42606668e4d6dfdfac5684c2bdec7e312610237e6fbc28a10de96090add7f3b5",
"sha256:acec1d429b012a2737fc48ab5f5cbca6b0a166192dcf99c41c6a5025fba154cf",
...
]
If I run the same command against the “testing” tag, we get 100% identical results:
[
"sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4",
"sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1",
"sha256:298badf3c0dd9c275ce1e9ee06e379ceb89ecd13547f1c0b92e270819099e66b",
"sha256:42606668e4d6dfdfac5684c2bdec7e312610237e6fbc28a10de96090add7f3b5",
"sha256:acec1d429b012a2737fc48ab5f5cbca6b0a166192dcf99c41c6a5025fba154cf",
...
]
If I run it against “prod” I get somewhat different results:
[
"sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4",
"sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4",
"sha256:298badf3c0dd9c275ce1e9ee06e379ceb89ecd13547f1c0b92e270819099e66b",
"sha256:42606668e4d6dfdfac5684c2bdec7e312610237e6fbc28a10de96090add7f3b5",
"sha256:acec1d429b012a2737fc48ab5f5cbca6b0a166192dcf99c41c6a5025fba154cf",
...
]
These are mostly the same, but the prod image is missing the “4f4fb…” layer and instead has a duplicate of the “a3ed9…” layer. Maybe there’s something I don’t understand here, but shouldn’t the prod and versioned images look identical, since the testing and versioned images look identical and the procedure for creating the prod image is the same? This effect happens even if I retag to a brand new, never before used tag. It also happens if I delete the prod image (I have tried both a docker rmi command and DELETE calls to the manifest directly against the registry API) and then rebuild, tag, and push it.
The thing our automation is testing for is a piece of data in the manifest history. In the history block of the manifest API response, each item contains a stringified JSON object which, when decoded, contains an id field. When I assemble a list of these IDs, I find again that the testing tag and the versioned tag produce identical results. They look like this:
[
"85fe142aa189e237746021f1a01acff7df5fc1a7091338bb3a269cd1ec10afa1",
"c18f2966f0f88e8b050988a159d55f738c933752441ec17192383d1f8944ebd8",
"324d228d41eaa5d01c2d0e5c0ff1d4c591819e19f5a63db6410f46514506d2f3",
"4fe3f514f6f399f89a6dd5d39e899ac8c22803f43f4ae7adb513eada41ee79d7",
"749823c5a66363fbab330ec24b4041d2b69461ad6a26ce49367b8cdc7f776c7b",
...
]
The prod tag, however, produces slightly different results. Most of the history appears identical, but the two most recent entries differ:
[
"2bef38df31851e23dd6869567ae5e10ac4649e6126a2437942e871c83433d647",
"a9a5cd6d35eda86324b37f5385216834e732c68563c00ae34f042d5f99e08d14",
"324d228d41eaa5d01c2d0e5c0ff1d4c591819e19f5a63db6410f46514506d2f3",
"4fe3f514f6f399f89a6dd5d39e899ac8c22803f43f4ae7adb513eada41ee79d7",
"749823c5a66363fbab330ec24b4041d2b69461ad6a26ce49367b8cdc7f776c7b",
...
]
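The ID extraction described above can be sketched as follows. This is not our actual automation code; it assumes the schema 1 layout, where each history item stores its stringified JSON under the `v1Compatibility` key, and the `_manifest_from_ids` helper exists only to build sample input from the first three IDs listed above.

```python
import json


def history_ids(manifest):
    """Decode each history entry and collect its "id" field, in manifest order."""
    return [
        json.loads(entry["v1Compatibility"])["id"]
        for entry in manifest.get("history", [])
    ]


def histories_match(manifest_a, manifest_b):
    """True when both manifests list exactly the same history IDs in the same order."""
    return history_ids(manifest_a) == history_ids(manifest_b)


def _manifest_from_ids(ids):
    """Hypothetical helper: build a minimal manifest-shaped dict for demonstration."""
    return {"history": [{"v1Compatibility": json.dumps({"id": i})} for i in ids]}


versioned = _manifest_from_ids([
    "85fe142aa189e237746021f1a01acff7df5fc1a7091338bb3a269cd1ec10afa1",
    "c18f2966f0f88e8b050988a159d55f738c933752441ec17192383d1f8944ebd8",
    "324d228d41eaa5d01c2d0e5c0ff1d4c591819e19f5a63db6410f46514506d2f3",
])
prod = _manifest_from_ids([
    "2bef38df31851e23dd6869567ae5e10ac4649e6126a2437942e871c83433d647",
    "a9a5cd6d35eda86324b37f5385216834e732c68563c00ae34f042d5f99e08d14",
    "324d228d41eaa5d01c2d0e5c0ff1d4c591819e19f5a63db6410f46514506d2f3",
])

if __name__ == "__main__":
    print(histories_match(versioned, versioned))
    print(histories_match(versioned, prod))
```

With the sample IDs, the versioned manifest matches itself but not prod, which is the failure our deployment check reports.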
I’m wondering a few things:
- What could have happened here to create this scenario, and how can I reproduce it?
- I have heard that this could be related to an image being built against a different machine architecture. I only see “amd64” in any of my manifests, but could a dev have built an image on an ARM-based Mac and pushed it, creating the deviation?
- Why does rebuilding, retagging, and pushing the image not resolve this discrepancy?
- Why does tagging the image with a never-before-used tag have the same result?
Thanks in advance to anyone who can help enlighten me about whatever it is I’m not understanding here.