First, we never leave it to pip to identify dependencies and keep quiet about it. We download all the packages we use to a local directory, and install from explicitly named and versioned package files on an otherwise empty Python installation, giving pip a dummy index URL. When pip complains about a package not being found, we download it from pypi.org and add it to the requirements file, with an explicit version. Every dependency shall be explicit!
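A minimal sketch of that workflow using standard pip options (pip’s `--no-index` flag has the same effect as a dummy index URL; the directory name is hypothetical):

```shell
# One-time, on a machine with network access: download every pinned
# dependency named in the requirements file into a local directory.
pip download --dest ./pkgs -r requirements.txt

# Everywhere else: install strictly from that directory. --no-index
# makes pip fail loudly instead of silently resolving from pypi.org.
pip install --no-index --find-links ./pkgs -r requirements.txt
```

Any package missing from `./pkgs` now fails the build immediately, which is exactly the point: you notice, pin it, and add it deliberately.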
Furthermore, we practically never build Docker images from the OS layer up. We start with the base Ubuntu image and add the tools used by “everyone” (typically for build management) as a layer on top, to serve as an os-with-tools base layer for the gcc images, doc tool images and Python images. But even these are base layers common to, say, all users of Python 3.7.3. A number of packages are used by “everybody”, so we throw those in as a layer on top of Python.
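The hierarchy can be sketched as a chain of Dockerfiles; registry paths, tags and package pins below are hypothetical, the pattern is what matters:

```dockerfile
# --- os-with-tools/Dockerfile: Ubuntu plus the tools “everyone” uses
FROM ubuntu:20.04
RUN apt-get update && apt-get install -y --no-install-recommends \
    make git ca-certificates

# --- python-base/Dockerfile: one specific Python on top of the tools layer
FROM registry.internal/os-with-tools:1.0
RUN apt-get update && apt-get install -y --no-install-recommends python3.7

# --- python-common/Dockerfile: the packages “everybody” uses, pinned,
#     installed from the local package directory, never from pypi.org
FROM registry.internal/python-base:3.7.3
COPY pkgs/ /pkgs/
RUN pip install --no-index --find-links /pkgs numpy==1.16.4 requests==2.22.0
```

Each file builds FROM the previous image, so every project below it inherits bit-identical layers.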
If you build two images with lots of layers, the second image can make use of the layers made by the first as long as they are EXACTLY alike, perfectly identical by SHA. From the first tiny difference, there is no more sharing further up the layer stack, even though the RUN commands in the Dockerfile are identical. Every layer’s SHA depends on the SHAs of all its lower layers, so after one discrepancy, the “sibling” layers in the two images will never match. With our discipline, and using “high level” images as base images as far as possible, we ensure that lots of layers are identical, not by good luck, but by strict management.
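To illustrate with two hypothetical Dockerfiles: the first differing instruction breaks all sharing above it, even where the later instructions are textually identical:

```dockerfile
# --- image-a/Dockerfile
FROM ubuntu:20.04
# This step differs from image B:
RUN apt-get install -y gcc-8
# Textually identical to image B's last step, but its parent layer's SHA
# differs, so this layer is rebuilt and NOT shared:
RUN pip install requests==2.22.0

# --- image-b/Dockerfile
FROM ubuntu:20.04
# First divergence:
RUN apt-get install -y gcc-9
RUN pip install requests==2.22.0
```

Only the `ubuntu:20.04` layer is shared between the two; everything above the diverging `RUN` exists twice on disk.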
When your Python project needs another four packages, you start with a base image that has the common Python packages, on top of Python, on top of the common tools, on top of Ubuntu. Everything is there already, so your Dockerfile is very simple, and builds in a few seconds. Of course we gain in efficient cache use, but more importantly: we control the proliferation of tool version combinations. If some project asks for, say, a minor upgrade to gcc, they must demonstrate a real need for it, and “we just want to use the latest and greatest” is not a real need - they will be told to stick to the version all the others are using, until they come up with a REAL need. Uniformity in tools is essential with approx. 150 developers.
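Such a leaf Dockerfile could look like this (image name and the four package names are hypothetical):

```dockerfile
# Start from the highest shared base: common packages, Python, tools, Ubuntu.
FROM registry.internal/python-3.7.3-common:42

# Install only the project-specific additions, pinned, from the local
# package directory - so the build takes seconds and stays explicit.
COPY pkgs/ /pkgs/
RUN pip install --no-index --find-links /pkgs \
    foo==1.2.3 bar==2.0.1 baz==0.9.0 qux==3.1.4
```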
At startup, every layer needs to be handled. Some of our projects use more than 200 Python packages. Getting 200 layers in place for execution will certainly cost you some CPU/IO resources and time. I don’t know the details of the file system, but I will assume that each layer takes an integral number of disk pages. On average, half a page is wasted in internal fragmentation; that would, on average, be 2 KiB if the page size is 4 KiB (e.g. on an NTFS volume). Individual Python packages are small, so even more space is wasted. For a 200-layer image, the fragmentation loss could easily run up to half a megabyte.
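The back-of-envelope arithmetic, under those stated assumptions (4 KiB pages, half a page wasted per layer):

```shell
# 200 layers, each wasting half a 4096-byte page on average:
echo $(( 200 * (4096 / 2) ))   # 409600 bytes, i.e. 400 KiB
```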
Metadata describing each layer probably also comes in page-sized chunks, one per layer, which adds to this.
All general software (ignoring e.g. embedded code) runs on machines with virtual memory. I would assume (without knowing the details for sure) that a layer is allocated RAM in page-sized chunks as well. RAM may use a different page size from the file system, but 4 KiB is quite common (e.g. on x86/x64), so you waste another half a megabyte of RAM at run time.
If a large layer contains sections of code that are not used in a specific run, they are simply not paged in; they remain untouched on disk. That is like another unused executable lying on the disk: it costs you nothing but the disk space.
In our quite strictly managed base image hierarchy, all the lower layers are used by so many runs from all sorts of projects that one or another container is “always” using them. Since we bundle up “commonly used” utilities, libraries or Python packages, a given element may be “superfluous” for one given run, but it will certainly be used by other runs. There is only one copy on disk, and one at runtime in RAM; it doesn’t gobble up resources - quite the contrary! As you move from the root towards the leaves of our “base image tree”, there are fewer users of those top layers - but those layers are generally very thin compared to Ubuntu, gcc and Python, putting a small load on both the Docker engine’s image cache and on runtime RAM.
Images are updated, and when we need to update a lower image (say, to a newer Python version), the higher layers must be rebuilt on the new Python layer. Then we look over the leaves from the old version: if three or four leaves have added the same package, we move that package down to the “commonly used packages” layer, so there is only a single copy of it (and the leaf layers become even thinner and simpler). Similarly with e.g. gcc tools, doc tools or whatever: if a number of branches have added the same element at the same revision, it goes down the layers - provided that it is sufficiently stable. (We do have projects that always require “the latest and greatest”, and insist on using pip at run time to download from pypi.org, or on running apt-get.)
Bottom line: We have selected a quite different layering strategy, with fewer, thicker layers, and a management structure for our images that ensures a high degree of common use of images/layers.