
Efficient Docker caching with python dependencies

I saw this very nice article from Tibor here - https://blog.docker.com/2019/07/intro-guide-to-dockerfile-best-practices/. The main concepts for using the Docker cache efficiently are stated very clearly.

I have a question about one pattern I see in the Python world - the use of requirements.txt to install all Python dependencies via pip.

Let’s say I am building an intermediate image like below:

FROM python as intermediate
COPY requirements.txt .
RUN pip install -r requirements.txt
ADD xxx /

vs.

FROM python as intermediate
RUN pip install aar
RUN pip install bar
..
RUN pip install zar
ADD xxx /

In my company I have been pitching the latter style as the more cache-friendly option. Obviously I value developer productivity as well as project maintainability.

While both work, the problem with the first approach is that whenever requirements.txt changes, the cache for that pip install layer (and everything after it) is invalidated, and you pay the price of building and installing every package listed in requirements.txt again. On the other hand, if we move the contents of requirements.txt into the Dockerfile itself, the cache is used more efficiently: only the layers from the changed line onward are rebuilt. Also note that the number of layers does not matter here because, in my case, this is just an intermediate image within my multi-stage Dockerfile.
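For example (a sketch - the package names are just placeholders carried over from the snippet above, and “newpkg” is invented), when a new dependency is appended in the second style, everything above it is still served from the cache and only the new line and anything after it is rebuilt:

FROM python as intermediate
RUN pip install aar
RUN pip install bar
RUN pip install zar
# only this new line and everything after it is rebuilt; the layers above come from the cache
RUN pip install newpkg
ADD xxx /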

But the argument against my approach is that requirements.txt is a small, “self-contained” file which lists all the requirements. However, that is not entirely true once you consider all the dependencies needed to build the packages listed in requirements.txt - such as the dev packages installed outside of pip. Another argument is that requirements.txt changes infrequently, so it's fine. But I say: why pay the price when we can do better? It's not much of a leap or a learning curve to put your requirements directly inside the Dockerfile.
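To be fair, those build-time OS packages can also be expressed in the Dockerfile itself, as their own cached layer ahead of the pip installs. A sketch - the apt packages shown are just examples:

FROM python as intermediate
# OS-level build dependencies, cached independently of the Python packages below
RUN apt-get update && apt-get install -y --no-install-recommends build-essential libpq-dev && rm -rf /var/lib/apt/lists/*
RUN pip install aar
RUN pip install bar
RUN pip install zar
ADD xxx /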

I have already seen other articles that touch on this subject - https://jpetazzo.github.io/2013/12/01/docker-python-pip-requirements/. But not enough articles talk about it.

What are the experiences and opinions of other folks in this forum?

First, we never leave it to pip to identify dependencies and keep quiet about it :slight_smile: We download all the packages we use to a local directory, and install from explicitly named and versioned package files on an otherwise empty Python installation, giving pip a dummy INDEX-URL. When pip barfs about a package not being found, we download it from pypi.org and add it to the requirements file, with an explicit version. Every dependency shall be explicit!
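A minimal sketch of that kind of workflow (the directory name and Python tag are just examples, and --no-index has the same effect as pointing pip at a dummy index URL):

# on the host, fetch the pinned packages into a local directory:
#   pip download -r requirements.txt -d wheels/

# in the Dockerfile, install only from that directory, never from an index:
FROM python:3.7.3
COPY wheels/ /wheels/
COPY requirements.txt .
RUN pip install --no-index --find-links=/wheels -r requirements.txt

If a package (or one of its dependencies) is missing from the local directory, pip fails loudly instead of silently fetching it, which is exactly the point.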

Furthermore, we practically never build Docker images from the OS layer up. First comes the base Ubuntu, and we add the tools used by “everyone” as a layer on top (typically for build management), to serve as an os-with-tools base layer for gcc images, doc-tool images and Python images. But even these are base layers common to, say, all users of Python 3.7.3. A number of packages are used by “everybody”, so we throw those in as a layer on top of Python.
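In Dockerfile terms the hierarchy looks roughly like this - three separate Dockerfiles, with image names, package choices and versions invented purely for illustration:

# 1) ubuntu -> os-with-tools, published as e.g. registry.example.com/base-tools
FROM ubuntu:18.04
RUN apt-get update && apt-get install -y --no-install-recommends git make curl && rm -rf /var/lib/apt/lists/*

# 2) os-with-tools -> python, published as e.g. registry.example.com/python-base:3.7.3
FROM registry.example.com/base-tools
RUN apt-get update && apt-get install -y --no-install-recommends python3.7 python3-pip && rm -rf /var/lib/apt/lists/*

# 3) python -> common python packages, published as e.g. registry.example.com/python-common:3.7.3
FROM registry.example.com/python-base:3.7.3
COPY wheels/ /wheels/
RUN pip3 install --no-index --find-links=/wheels numpy==1.16.4 requests==2.22.0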

If you build two images with lots of layers, the second image can reuse the layers made by the first as long as they are EXACTLY alike, perfectly identical by SHA. From the first tiny bit of difference, there is no more sharing further up the layer stack, even though the RUN commands in the Dockerfile are identical. Every layer's SHA depends on the SHAs of all the layers below it, so after one discrepancy, the “sibling” layers in the two images will never match. With our discipline, and by using “high level” images as base images as far as possible, we ensure that lots of layers are identical - not by good luck, but by strict management.
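As a sketch of the effect, compare two hypothetical Dockerfiles that differ only in their middle instruction:

# Dockerfile A
FROM ubuntu:18.04
RUN apt-get update && apt-get install -y --no-install-recommends git
RUN touch /marker

# Dockerfile B
FROM ubuntu:18.04
RUN apt-get update && apt-get install -y --no-install-recommends git curl
RUN touch /marker

The base layer is shared, but nothing above the differing RUN is: the final “RUN touch /marker” is textually identical in both files, yet it produces two distinct layers because its parent layers differ.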

When your Python project needs another four packages, you start with a base image that has the common Python packages, on top of Python, on top of the common tools, on top of Ubuntu. Everything is there already, so your Dockerfile is very simple and builds in a few seconds. Of course we gain in efficient cache use, but more importantly: we control the proliferation of tool version combinations. If some project asks for, say, a minor upgrade to gcc, they must demonstrate a real need for it, and “we just want to use the latest and greatest” is not a real need - they will be told to stick to the version all the others are using until they come up with a REAL need. Uniformity in tools is essential with approx. 150 developers.
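So a typical project Dockerfile in this scheme is little more than the following (the registry name, directory and package pins are all hypothetical):

FROM registry.example.com/python-common:3.7.3
# only the handful of project-specific packages, installed from the local package directory
COPY wheels/ /tmp/wheels/
RUN pip install --no-index --find-links=/tmp/wheels foo==1.2.3 bar==4.5.6 baz==0.9.1 qux==2.0.0
COPY . /app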

At startup, every layer needs to be handled. Some of our projects use more than 200 Python packages. Getting 200 layers in place for execution will certainly cost you some CPU/IO resources and time. I don't know the details of the file system, but I will assume that each layer takes an integral number of disk pages. On average, half a page is wasted in internal fragmentation - that would be about 2 KiB if the page size is 4 KiB (e.g. on an NTFS system). Individual Python packages are small, so proportionally even more space is wasted. For a 200-layer image, the fragmentation loss (200 × ~2 KiB ≈ 400 KiB) could easily run up to half a megabyte.

The metadata describing each layer probably also comes in page-sized chunks, one per layer, which adds to this.

All general-purpose systems (ignoring e.g. embedded code) run on machines with virtual memory. I would assume (without knowing the details for sure) that a layer is allocated RAM space in page-sized chunks as well. RAM may use a different page size from the file system, but 4 KiB is quite common (e.g. on x86/x64), so you waste another half megabyte of RAM at run time.

If a large layer contains sections of code that are not used in a specific run, they are simply not paged in; they remain untouched on disk. That is like another unused executable lying on the disk: it costs you nothing but the disk space.

In our quite strictly managed base image hierarchy, the lower layers are used by so many runs from all sorts of projects that one container or another is “always” using them. Since we bundle up “commonly used” utilities, libraries or Python packages, a given element may be “superfluous” for one particular run, but it will certainly be used by other runs. There is only one copy on disk, and at run time in RAM; it doesn't gobble up resources - quite the contrary! As you move from the root towards the leaves of our “base images tree”, there are fewer users of those top layers - but those layers are generally very thin compared to Ubuntu, gcc and Python, putting a small load on both the Docker engine's image cache and on runtime RAM.

Images are updated, and when we need to update a lower image (say, to a newer Python version), the higher layers must be rebuilt on the new Python layer. Then we look over the leaves from the old version: if three or four leaves have added the same package, we move that package down into the “commonly used packages” layer, so there is only a single copy of it (and the leaf layers become even thinner and simpler). Similarly with e.g. gcc tools, doc tools or whatever: if a number of branches have added the same element, at the same revision, it goes down the layers - provided it is sufficiently stable. (We do have projects that always require “the latest and greatest”, and insist on using pip at run time to download from pypi.org, or on running apt-get.)

Bottom line: We have selected a quite different layering strategy, with fewer, thicker layers, and a management structure for our images that ensures a high degree of common use of images/layers.