Why do we want to use Docker?
The brief answer: to switch smoothly between tool versions on build systems that handle a large number of different jobs.
We have one huge Bamboo build system, and one not quite as huge (yet) Jenkins build system. Both are shared by a large number of projects with very different requirements for tool versions and environment setups.
Until now, for Windows jobs we have used a wizard, called at the start of each job, to ensure that the proper tool versions are active. In most cases this can be done by moving symbolic links. In several cases (and an increasing number of them) we must resort to uninstalling the unsuitable version and installing the version wanted by the current job. In a few cases it is not possible to switch between versions automatically, so we must install one version on one subset of nodes and the other version on another subset, and flag the nodes with the versions they carry. (That rather defeats the idea of a large pool of shared build nodes automatically adapting to needs.)
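The symbolic-link trick can be sketched like this (a minimal sketch with hypothetical tool names, versions and paths; on our Windows nodes the equivalent uses directory links rather than Unix symlinks):

```shell
# Hypothetical layout: tool versions installed side by side,
# plus one stable path that every job references.
mkdir -p /tmp/tooldemo/gcc-9.2 /tmp/tooldemo/gcc-12.1

# Point the stable path at the version the current job needs.
ln -sfn /tmp/tooldemo/gcc-9.2 /tmp/tooldemo/gcc-current

# Switching for the next job is a single relink -- no
# uninstall/reinstall cycle, as long as the tool tolerates
# being run from behind a symlink.
ln -sfn /tmp/tooldemo/gcc-12.1 /tmp/tooldemo/gcc-current
readlink /tmp/tooldemo/gcc-current
```

This is cheap and fast when it works; the trouble described below starts when an installer replaces the link with a plain directory, and the relink is no longer possible.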
This wizard started out as a simple and clean tool about seven years ago, but has grown into a messy, unstable monster as new tools have required yet another quirk to be handled. We have something like 20 different toolboxes active right now, and maintenance is a nightmare. The archive of retired toolboxes (which might have to be dug up for bugfixing in systems delivered to customers years ago) counts between 150 and 200: if debugging is required, we must rebuild a system exactly the way the customer’s system was built, to be sure it is bit-for-bit identical. So every single tool (or at least those affecting generated code) must be of exactly the same version.
One big problem is that some jobs do their own installation of unknown tool versions from the network. Some of these have “new and exciting” ways of uninstalling themselves. Or the installer breaks the symbolic links so they cannot be switched for the next job, because they are no longer links but plain files or directories. Another problem is that some of the tools cannot be installed or uninstalled in quiet mode but require manual intervention, and they do not lend themselves to symbolic-link switching. And finally: this wizard is a Windows-only system; it cannot be adapted to Linux, which is gradually coming in for our new development projects.
If we replace those twenty toolboxes with twenty Docker images, jobs are not delayed by a lot of uninstalling and installing of tool versions. Tools that cannot work with symbolic links (they exist!) will work fine. Any job ruining the execution environment (e.g. by installing unknown software from the network) will cause problems for no one but itself. The images can (and in most instances will) be available in a Linux environment, and the image itself will usually be Linux based.
One nice side effect: our desktop PCs are still Windows based, but they can run Linux containers. So by pulling the tool image to the PC and supplying the build script to it, developers can run the build locally, exactly the same way the Jenkins job does, before committing code changes.
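In practice this amounts to something like the following (a sketch only; the registry address, image name, tag and build script are hypothetical, and it assumes a Docker daemon on the developer PC):

```shell
# Pull the exact tool image the Jenkins job uses, then run the
# build script inside it, with the working copy mounted into
# the container.
docker pull registry.local/toolbox-c:2.4
docker run --rm -v "$PWD":/work -w /work \
    registry.local/toolbox-c:2.4 ./build.sh
```

Because the container is the same image the CI job pulls, the local build runs with exactly the same tool versions as the official one.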
We have had problems with Docker as well: our first images were preserved as Dockerfiles only. When we had to debug an old delivery and wanted to rebuild the image, some of the Linux tool versions were no longer available in the Linux apt software repository. Searching for something by the same name in other repos is not guaranteed to find a package that is 100% identical to the one we had before.
So now we are preserving images in binary form. We are also in the process of building a local repository of the packages we use in our environment (like the repository we have for the wizard handling Windows software), so that we can rebuild an image if someone erases the binary version by mistake. (Yes, we do make backups of the registry, but getting a mistakenly erased image back from the backup is such an involved business that we try to avoid it.)
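This is also why pinning exact package versions in the Dockerfile is only half a solution: a pin makes a rebuild either reproduce the original or fail loudly, but it still depends on the repository keeping the old package around, which is exactly what a local package mirror guarantees. A minimal sketch, with hypothetical package versions:

```dockerfile
FROM debian:12
# Pin exact versions so a rebuild can never silently pick up a
# newer package. (Versions here are made up; the pinned package
# must still exist in the repo -- hence the local mirror.)
RUN apt-get update && apt-get install -y --no-install-recommends \
        gcc-12=12.2.0-14 \
        make=4.3-4.1 \
    && rm -rf /var/lib/apt/lists/*
```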
From the maintenance point of view, Docker looks quite promising. Splitting jobs into separate tasks of different natures (compiling, log analysis, test monitoring, document generation, …) is one way to keep that maintenance manageable. If we must put all the tools into one huge image, then every time someone asks for one tiny little update to any one of the dozens of tools in the complete set, we have to generate a whole new huge image. We would soon have not two hundred but two thousand retired tool sets, and each of them would be large…
One of our main projects has a tool set filling 6.9 GB, of which 5 GB are project-specific tools. Below it is a nicely layered structure of one base layer on top of another, all cleanly created in multi-stage builds so they contain as few build leftovers as possible. Then this project requests one extra Python package. The Python layer is among the “general” layers, so we have two alternatives. The first: add a layer on top of those 6.9 GB containing nothing but a “pip install”. If this happens all the time, in multiple projects, we will end up with a mess of Python configurations. It would be so much cleaner to make a new version of the Python image, where this package could be made available to all future images built on the Python base. If Python tasks were handled by an “external” image, using SSH as a relay to another container, no update to the giga-image would be required (but the task requesting Python would have to update the tag value to the new image).
The second alternative is to update the Python layer below those 5 “private” GB of tools, so that we get another 5 GB layer: nothing above the updated Python layer can be reused. That doesn’t make me very happy, either.
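In Dockerfile terms, the two alternatives look roughly like this (all image names, tags and the package are hypothetical):

```dockerfile
# Alternative 1: a thin layer on top of the full 6.9 GB project
# image. Cheap to build (only the pip layer is new), but each
# project accumulates its own ad-hoc Python configuration.
FROM registry.local/project-tools:4.2
RUN pip install --no-cache-dir somepackage==1.0.4

# Alternative 2: bump the shared Python base instead, e.g.
#   FROM registry.local/python-base:3.11-r2
# and rebuild everything above it -- including the 5 GB of
# project tools, since no layer above the changed one is reused.
```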
Worst of all: I am in an ongoing battle with one manager of that project. He wants a facility to rebuild his 6.9 GB image automatically: if a job specifies a new Python package version, he wants that to be detected automatically and a new Python layer plus a new 5 GB “private” layer to be generated on the fly. He indicates that this might happen several times a day… We have no way of tracing which of these auto-generated images are no longer used; we must keep all of them indefinitely. They cannot be built locally on one Jenkins node and pruned locally: when the build is re-run, it might be placed on a different node, which will pull the image from the registry… (Note: this manager is a highly qualified software guy, not a technically ignorant administrator. Some excellent software developers are nevertheless completely ignorant of resource costs!)