Replacing Large Files Without Increasing Image Size

I have a situation where a large file in my Docker image gets replaced with each new version of the image. Say there is an image_a which comes from a base image, image_b is built from image_a, and image_c is built from image_b. Various changes are made in each image, but of particular interest is one large file that gets replaced.

What ultimately happens is that when I replace the file, the new file's size is added to the new image, and this happens with each successive image I create. The simplest way I can demonstrate it is with three Dockerfiles:

#Dockerfile_a
FROM ubuntu:latest
WORKDIR /mydata
RUN dd if=/dev/urandom of=data.dat bs=1M count=100

#Dockerfile_b
FROM image_a:latest
WORKDIR /mydata
RUN dd if=/dev/urandom of=data.dat bs=1M count=100

#Dockerfile_c
FROM image_b:latest
WORKDIR /mydata
RUN dd if=/dev/urandom of=data.dat bs=1M count=100

As you can see, each image builds on the previous one, and each effectively replaces the same large file. I don’t need to keep the previous version of the file. If you then run the following:

docker build -t image_a -f Dockerfile_a .
docker build -t image_b -f Dockerfile_b .
docker build -t image_c -f Dockerfile_c .

Then do a “docker images”, and you can see that each image is more than 100MB larger than the previous one. Clearly the filesystem overlay is just adding the new file (and its size) to each image. In many scenarios that is probably what you want, but in this case I don’t. I don’t want each version to grow in size like this; I really want to replace the file so that images stay roughly the same size.
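You can also confirm where the size goes with “docker history”, which lists how much each layer contributes; each of the three “RUN dd” layers should show up at roughly 100MB:

docker history image_c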

Is there a way to solve this problem? Each version of my image is growing much more than I want it to, and this isn’t going to be sustainable long-term. Maybe there is a better way to handle this. Any advice would be appreciated.

Hmm … can’t you just add a “RUN rm -f /mydata/data.dat” inside your Dockerfile to get rid of the old file?
Also, overwriting should not lead to “piling up” … if the same name is used …

Any luck with this one? I have the same issue. A docker novice here.

I assume you are familiar with the fact that images consist of a manifest and image layers. When you build a new image based on another base image, your image will reuse the existing layers from the base image. A file added in a layer cannot be physically modified or deleted in a later layer; it can only be marked as such. Image layers are what make container images so efficient.
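For example (a sketch building on the images above): deleting the file in a later layer hides it from the filesystem, but the layer that contains it still ships with the image, so no space is reclaimed:

#Dockerfile_d (sketch)
FROM image_c:latest
#removes the file only via a whiteout marker in the new layer;
#“docker images” still reports the full size of all underlying layers
RUN rm /mydata/data.dat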

Having to replace a big file in different image layers (regardless of whether they are in the same image or in images built from a base image) indicates a bad image design.

The example from the OP is exactly such a design. The same problem could have been shown with a single Dockerfile by repeating the RUN command three times (which would still be a terrible design).

The real use case was a database. Databases store their data in a datafile, so if you create an image of a database with a datafile, that layer is set in place. If at any point in a later layer you modify any data in that datafile, the entire datafile ends up as a new layer on top of the old one, even though the datafile is a replacement for the first.

This problem would manifest for any datafile that changes in any layer of the image. That was what my example was intended to show.

This use case matches what images for VM solutions are used for, rather than what layered container images are meant to be used for.

Just out of curiosity: why would it make sense to embed the database’s datafile into an image? How would a volume be mounted to persist the datafile outside the container when run in production? If no volumes are intended, the datafile will still exist with at least one additional copy in the copy-on-write layer of the container. However I look at it, it doesn’t feel like a good image design.

In the past we used official database images and used Flyway in our application images to apply DDL/DML operations for the schema they own before the main process was started. While we embedded Flyway (the tool itself) in a base image, the DDL/DML statements have always been in the final layer with the application itself.
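A minimal sketch of that pattern, assuming a hypothetical base image that already contains the Flyway CLI, and placeholder connection settings and paths:

#Dockerfile (sketch; base image and app path are hypothetical)
FROM my-base-with-flyway:latest
#the versioned DDL/DML scripts live in the final layer with the app
COPY sql/ /flyway/sql/
COPY entrypoint.sh /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
CMD ["/app/start"]

#entrypoint.sh (sketch)
#!/bin/sh
set -e
#apply pending migrations, then hand over to the main process
flyway -url="$DB_URL" -user="$DB_USER" -password="$DB_PASSWORD" migrate
exec "$@"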

In our case it was for software development purposes; it wasn’t used for production. The intent was that a developer could pull down a particular version of the code from source control, and then pull down a particular version of a database loaded with a pre-configured dataset that matched the codebase.

Creating the initial dataset was “hard”, which is why we had the initial image pre-built. Each version made changes to the dataset that were a tiny fraction of its overall size, but they became large changes to the Docker layers.

I have seen that approach a lot, but never in a situation where someone tried to encapsulate it in Docker images. If the migration scripts don’t take ages to apply, the Flyway migration approach is way cleaner. I strongly believe that database binary files do not belong inside an image - but that’s just my opinion…

There might be a way to mitigate the growing image size situation:
You might want to check how the linuxserver docker-mods work. Basically, they implemented a way to a) create single-layer images based on scratch for a special purpose (in your case this could be the database file) and b) pull that first layer and extract it into the container’s filesystem.

This is pretty much the same as downloading a zip in the entrypoint script, with the difference that it still pulls the mod image into the Docker engine’s image cache (at least this is what I remember from analyzing the method a while ago) and extracts the layer from the image cache instead of downloading it each time.

Though, why not simply provide a zip/tar/tar.gz in an artefact repository? Create a Makefile or bash script that takes care of downloading the archive and extracting it into a host folder that is bind-mounted into the container.
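A minimal sketch of that idea, with a placeholder artefact URL, image name, and paths:

#fetch_data.sh (sketch; URL, image name and paths are placeholders)
#!/bin/sh
set -e
#download the pre-built dataset and unpack it on the host
curl -fsSL https://artifacts.example.com/datasets/data-v3.tar.gz -o /tmp/data.tar.gz
mkdir -p "$HOME/mydata"
tar -xzf /tmp/data.tar.gz -C "$HOME/mydata"
#start the database container with the host folder bind-mounted over /mydata
docker run -d -v "$HOME/mydata:/mydata" mydatabase:latest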

Erratum:

I was mistaken about the linuxserver docker-mods: they do not leverage the local Docker cache. Their method pulls the first layer of a single-layer image from a container registry with curl and extracts its content into the container’s filesystem.
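Roughly, that mechanism looks like the following sketch against the Docker Hub registry API (the repository and tag are placeholders, jq is assumed to be available, and error handling is omitted):

#pull_layer.sh (sketch)
#!/bin/sh
set -e
REPO="someuser/somemod"  #placeholder repository
TAG="latest"             #placeholder tag
#fetch a pull token for the repository
TOKEN=$(curl -fsSL "https://auth.docker.io/token?service=registry.docker.io&scope=repository:${REPO}:pull" | jq -r '.token')
#read the manifest and take the digest of the first (and only) layer
DIGEST=$(curl -fsSL -H "Authorization: Bearer ${TOKEN}" \
  -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
  "https://registry-1.docker.io/v2/${REPO}/manifests/${TAG}" | jq -r '.layers[0].digest')
#download the layer blob and extract it into the container's filesystem
curl -fsSL -H "Authorization: Bearer ${TOKEN}" \
  "https://registry-1.docker.io/v2/${REPO}/blobs/${DIGEST}" | tar -xz -C /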