Image manifests and histories differ between tags that should be identical even after retagging and pushing images

We run a private docker registry and use a common procedure to handle builds and deployments. Lately, one project has been failing to deploy to our production environment because the contents of the prod manifest do not match the testing manifest, even after rebuilding and retagging the image. I have not been able to reproduce the issue with another repo (largely because I don’t know how we got into this scenario in the first place), but I can relay the specifics of the problem and what we’ve tried so far.

First, let me describe our process. Developers write code and merge it through a GitHub-driven peer review process. That merge kicks off an automated CI process which builds a docker image from the code repository. The image then gets retagged twice. The first tag is an internal dated version number, for example “20230930.161912z”, which gets pushed to our private registry so we always have a unique tag pointing to each build. The second is an “environment” tag applied with a simple docker tag command. Initially that’s “testing”, which gets deployed to the testing environment. Even in the odd case we’re experiencing now, we have no problems with deployments to the testing environment. Later, when we are prepared to deploy the image to our prod environment, we run another retag that looks something like docker tag image-name:20230930.161912z image-name:prod and push the prod tag to the registry.
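Condensed, the automation’s tag-and-push sequence looks roughly like this (image and registry names are simplified, as above):

# build once, then tag and push (registry prefix omitted for brevity)
docker build -t image-name:20230930.161912z .
docker push image-name:20230930.161912z
docker tag image-name:20230930.161912z image-name:testing
docker push image-name:testing
# later, at prod deploy time
docker tag image-name:20230930.161912z image-name:prod
docker push image-name:prod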

Immediately following the retag and prior to actual deployment, our automation tries to make sure that we are deploying the image we think we’re deploying. It does this by pulling the manifest history for the versioned tag and for the “prod” tag and comparing the IDs in the history component. In the problematic case we’re having right now, the history of the “prod” tag does not match that of the versioned tag, and there also appears to be a problem in the “fsLayers” component. Neither mismatch exists between the versioned tag and the “testing” tag; both exist between the testing/versioned tags and the “prod” tag.
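The check itself is essentially two manifest fetches against the registry API followed by a comparison of the decoded history entries, roughly like this (host name is illustrative, auth omitted):

curl -s https://my-registry/v2/image-name/manifests/20230930.161912z > versioned.json
curl -s https://my-registry/v2/image-name/manifests/prod > prod.json
# decode the JSON string embedded in each history entry and compare the id fields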

I’ll provide some details of the problem as it appears now. At this point, the versioned image has been retagged as both “testing” and “prod” and all images have been pushed to the registry.

First, the fsLayers do not match. If I make a call to our registry to show the manifest for this image (/v2/image-name/manifests/20231001.155532z), the fsLayers look like this (it’s a lot of data, so forgive me for summarizing the output a bit):

[
"sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4",
"sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1",
"sha256:298badf3c0dd9c275ce1e9ee06e379ceb89ecd13547f1c0b92e270819099e66b",
"sha256:42606668e4d6dfdfac5684c2bdec7e312610237e6fbc28a10de96090add7f3b5",
"sha256:acec1d429b012a2737fc48ab5f5cbca6b0a166192dcf99c41c6a5025fba154cf",
...
]

If I run the same command against the “testing” tag, we get 100% identical results:

[
"sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4",
"sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1",
"sha256:298badf3c0dd9c275ce1e9ee06e379ceb89ecd13547f1c0b92e270819099e66b",
"sha256:42606668e4d6dfdfac5684c2bdec7e312610237e6fbc28a10de96090add7f3b5",
"sha256:acec1d429b012a2737fc48ab5f5cbca6b0a166192dcf99c41c6a5025fba154cf",
...
]

If I run it against “prod” I get somewhat different results:

[
"sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4",
"sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4",
"sha256:298badf3c0dd9c275ce1e9ee06e379ceb89ecd13547f1c0b92e270819099e66b",
"sha256:42606668e4d6dfdfac5684c2bdec7e312610237e6fbc28a10de96090add7f3b5",
"sha256:acec1d429b012a2737fc48ab5f5cbca6b0a166192dcf99c41c6a5025fba154cf",
...
]

These are mostly the same, but the prod manifest is missing the “4f4fb…” layer and instead has a duplicate of the “a3ed9…” layer. Maybe there’s something I don’t understand here, but shouldn’t the prod and versioned manifests look identical, given that the testing and versioned manifests are identical and the procedure for creating the prod tag is the same? This happens even if I retag to a brand new, never-before-used tag. It also happens if I delete the prod image (I’ve tried docker rmi as well as DELETE calls to the manifest directly against the registry API) and rebuild/tag/push it.
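For what it’s worth, the summarized layer lists above were produced by filtering the manifest with something along these lines (jq filter shown purely for illustration):

# list the blobSum of each fsLayers entry for a given tag
curl -s https://my-registry/v2/image-name/manifests/20231001.155532z | jq -r '.fsLayers[].blobSum'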

What our automation actually checks is a piece of data in the manifest history. In the history block of the manifest API response, each item contains a stringified JSON object which, when decoded, contains an id field. When I assemble a list of these IDs, I find again that the testing tag and the versioned tag produce identical results. They look like this:

[
"85fe142aa189e237746021f1a01acff7df5fc1a7091338bb3a269cd1ec10afa1",
"c18f2966f0f88e8b050988a159d55f738c933752441ec17192383d1f8944ebd8",
"324d228d41eaa5d01c2d0e5c0ff1d4c591819e19f5a63db6410f46514506d2f3",
"4fe3f514f6f399f89a6dd5d39e899ac8c22803f43f4ae7adb513eada41ee79d7",
"749823c5a66363fbab330ec24b4041d2b69461ad6a26ce49367b8cdc7f776c7b",
...
]

The prod tag, however, produces slightly different results. It seems most of the history is identical, but the two most recent entries are different.

[
"2bef38df31851e23dd6869567ae5e10ac4649e6126a2437942e871c83433d647",
"a9a5cd6d35eda86324b37f5385216834e732c68563c00ae34f042d5f99e08d14",
"324d228d41eaa5d01c2d0e5c0ff1d4c591819e19f5a63db6410f46514506d2f3",
"4fe3f514f6f399f89a6dd5d39e899ac8c22803f43f4ae7adb513eada41ee79d7",
"749823c5a66363fbab330ec24b4041d2b69461ad6a26ce49367b8cdc7f776c7b",
...
]
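For reference, those ID lists come from decoding the JSON string embedded in each history entry, something like this (again, jq only for illustration):

# each history entry carries a stringified JSON blob; extract its id field
curl -s https://my-registry/v2/image-name/manifests/prod | jq -r '.history[].v1Compatibility | fromjson | .id'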

I’m wondering a few things:

  • What could have happened here to create this scenario, and how can I reproduce it?
  • I have heard this could be related to an image being built for a different machine architecture. I only see “amd64” in my manifests, but could a dev have built an image on an ARM-based Mac and pushed it, creating the deviation?
  • Why does rebuilding, retagging, and pushing the image not resolve this discrepancy?
  • Why does tagging the image with a never-before-used tag have the same result?

Thanks in advance to anyone who can help enlighten me about whatever it is I’m not understanding here.

Adding a new tag can’t change the filesystem layers; a tag is just a pointer to an image ID. Rebuilding an image, on the other hand, can change the filesystem layers even when you use the build cache, if something in the build can’t be cached. You mentioned the manifest history, but have you checked the image history as well?

docker image history image-name:20230930.161912z

I can’t explain the duplicate layer though. That should be impossible and doesn’t make sense. Maybe that is just a bug in the API.

Thanks for the quick reply. I ran two image history commands against the testing and prod images and redirected the output to two files. I then just ran diff against those text files and it found no difference between their histories.
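In other words, roughly:

docker image history image-name:testing > testing.hist
docker image history image-name:prod > prod.hist
diff testing.hist prod.hist    # produces no output, i.e. the histories match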

I am currently trying this container-diff tool to see if it can point me to any major differences, but I’ve never used it before. Not sure it’s going to reveal anything useful.

I’m glad you say the duplicate layer shouldn’t happen, at least so I don’t feel so crazy. If I have full access to the private registry and its storage, what are the odds I can do some kind of forbidden surgery to fix that?

I’m not sure I could do anything to fix something like this myself. At best I could try to understand it and report it if I found something.

Well, it could happen, but not because of retagging. It could happen when you have a Dockerfile like this:

FROM ubuntu:22.04

COPY . /app
COPY . /app

So you actually create the same layer twice. If it is not in a single Dockerfile, I can imagine creating an image with one COPY instruction, tagging it, and then using that tag as a base image to run the same COPY instruction, resulting in the same layer again. This is not likely to happen, and I didn’t even think of it before my previous post, only now. I tested it and managed to make a duplicate layer this way.

docker image inspect localhost/test --format '{{ json .RootFS.Layers }}' | jq
[
  "sha256:ab318285541e81ecd4e8822d6a86cbbaf339f7113cf43763f9b54d0cbc9ebcf1",
  "sha256:1f89c258504eda6722752efe52d7d5f3636b1c68e5422e8deaed07e8b06d2b85",
  "sha256:1f89c258504eda6722752efe52d7d5f3636b1c68e5422e8deaed07e8b06d2b85"
]
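The multi-image variant I mentioned would look roughly like this (names made up):

# Dockerfile.base: first image, a single COPY layer
FROM ubuntu:22.04
COPY . /app

# Dockerfile.child: built FROM the image above, repeating the identical COPY
FROM localhost/test-base
COPY . /app

# docker build -t localhost/test-base -f Dockerfile.base .
# docker build -t localhost/test -f Dockerfile.child .
# if the build context hasn't changed, the second COPY can produce the same layer again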

That is super interesting. I’m going to try to reproduce that myself and then see if I can find any example of this happening somewhere in my automation.

I think I just figured out how this happened, thanks largely to your comment (so thank you very much). Another CI job related to building base images (upon which application images are built) was run to test a change and some images were pushed which should not have been pushed. I think this explains why the duplicate layer is all the way at the bottom of the fsLayers stack.

So now the question is how do I fix it? I’m going to do some playing around with the base image builds and see what happens.

No dice…

Since I last posted, I have tried a number of things. First and foremost, I have used GCP’s container-diff program to ensure that my images are the same. No matter what diff types I ask it to find, it never finds anything. The contents of these images are completely identical, just as we would expect. If I do one build and then retag that build, nothing changes under the hood; we only create a new reference to the same manifest. This checks out.
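For the record, the container-diff invocations looked something like this (flags roughly from memory):

# compare the two local images across the diff types I tried
container-diff diff daemon://my-registry/my-image:testing daemon://my-registry/my-image:my-test --type=file --type=history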

But if I just do a series of docker pull/push commands, I get the same digest for the “testing” and versioned images, but a different digest for the newly pushed tag (“stable”, “my-test”, or anything else I assign to it).

To start fresh, I delete the manifest from the registry and then confirm it’s gone by pulling it and expecting a failure:

docker pull my-registry/my-image:my-test
Error response from daemon: manifest for my-registry/my-image:my-test not found: manifest unknown: manifest unknown
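(The delete itself was a DELETE against the manifest endpoint; the registry API only accepts a digest there, not a tag, so it was roughly:)

# delete the stored manifest by digest (host/name illustrative, digest is a placeholder)
curl -X DELETE https://my-registry/v2/my-image/manifests/sha256:<manifest-digest>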

Then I delete the local copy of the image just in case.

docker rmi my-registry/my-image:my-test

Now I pull the versioned image:

docker pull my-registry/my-image:versioned

This works, and reports the digest to be sha256:b90582ba9a74cb7295fc743192746cfc8fc9c659a29d78b00f27f7c78f55fcd4

My automation has already tagged it as “testing”, so I pull that next:

docker pull my-registry/my-image:testing

This works, and reports the digest to be the same thing: sha256:b90582ba9a74cb7295fc743192746cfc8fc9c659a29d78b00f27f7c78f55fcd4

Now I (or my automation) apply the third tag (here “my-test”, standing in for “stable”/“prod”) using the same command:

docker tag my-registry/my-image:versioned my-registry/my-image:my-test

And push it:

docker push my-registry/my-image:my-test

It produces a different digest: sha256:9fcbe24ed8c6e3bc240225c4b4c71631842b574dab978a2df0267906fc25c503

Again, using container-diff produces “no diff” results. If I docker save these images to tarballs, extract the tarballs, and manually diff the files (or, in the case of the embedded layer tarballs, compare their checksums), everything is 100% identical with the exception of the tag names themselves.
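The tarball comparison was roughly:

docker save my-registry/my-image:testing -o testing.tar
docker save my-registry/my-image:my-test -o stable.tar
mkdir testing-x stable-x
tar -xf testing.tar -C testing-x
tar -xf stable.tar -C stable-x
diff -r testing-x stable-x    # only the files carrying the tag names differ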

However, if I run docker images I can see all three images right at the top of the list, and they all have an identical “IMAGE ID”. I can’t find any reference to that image ID anywhere in the images themselves, though, or in the registry. The Docker-Content-Digest header, when I ask the registry for it, still matches between the versioned and testing tags, but not for the my-test/stable tag.
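That header check, for completeness (host and image names illustrative):

curl -sI https://my-registry/v2/my-image/manifests/versioned | grep -i docker-content-digest
curl -sI https://my-registry/v2/my-image/manifests/testing | grep -i docker-content-digest
curl -sI https://my-registry/v2/my-image/manifests/my-test | grep -i docker-content-digest
# the first two report the same digest; the third reports a different one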

I found another tool called regctl (https://github.com/regclient/regclient/blob/main/docs/regctl.md) that talks to my registry. It does some nice formatting, but otherwise appears to just make the registry API calls for me.

regctl manifest get my-registry/my-image:testing > testing.out
regctl manifest get my-registry/my-image:stable > stable.out
diff testing.out stable.out

This tells me there’s a difference between the two images. Excluding the tag information itself, which we always expect to differ, I see this:

3c3
< Digest:      sha256:b90582ba9a74cb7295fc743192746cfc8fc9c659a29d78b00f27f7c78f55fcd4
---
> Digest:      sha256:9fcbe24ed8c6e3bc240225c4b4c71631842b574dab978a2df0267906fc25c503
113c113
<   Digest:    sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1
---
>   Digest:    sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4

I guess I’m going to see if I can figure out what these different shas are and maybe pull them down individually? Maybe I can correct this with surgery done on the registry files?

This ended up being a problem with BuildKit, which produces images that are identical at the filesystem level but different at the manifest/metadata level. Our docker installation was upgraded along with the OS, and that broke builds (or rather, produced different forms of otherwise working builds, which broke our automation).

We should probably find a way to upgrade and move forward, but for now I have resolved the problem by setting DOCKER_BUILDKIT=0 in our tools, disabling the newer build process.
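In the build scripts that just means something like:

# force the classic builder until we work out the BuildKit-era metadata differences
export DOCKER_BUILDKIT=0
docker build -t image-name:20230930.161912z .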