I work at PythonAnywhere, a PaaS that lets people run Python code in the cloud. We’ve mostly rolled our own sandboxing solution, but we’ve been experimenting with docker for a few months. We’ve started running into some scaling issues that are making us think it’s not going to work for us, but we thought we’d reach out and ask for help first.
(here’s a cool thing you may not know: the interactive python consoles on python.org are supplied by us, and run in docker).
The two main problems we’re having are:
- On a given server, the time taken to start a docker container gets longer and longer. A fresh server takes 4-5 seconds to start a container, but over a matter of days it gets slower and slower, climbing to 30 seconds and beyond, at which point we generally reboot it.
- Rebooting a server takes ages – upwards of 20 minutes. We assume this is something to do with docker or aufs cleanup.
- Rebooting the server does fix the slow-startup issue (but just restarting the docker daemon does not).
Some additional info:
- We start around two or three thousand docker containers per day on any given server.
- We’re running mostly interactive processes (so we use docker run -ti).
- Each container mounts about 10 directories/files.
- We use --rm=true to try to tidy up containers on exit.
- We’re using --net=host, --log-opt=max-size=100k, and the DOCKER_FIX env var.
- We also have a manual cleanup script that kills old containers (and shuts down any that have run past the allowed max time).
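Putting those flags together, a typical launch looks roughly like this. This is a sketch, not our exact invocation: the mount path is an illustrative placeholder (there are ~10 -v flags in reality), the trailing /bin/bash is just an example command, and the DOCKER_FIX variable is set in the environment separately, so it's omitted here.

```shell
#!/bin/sh
# Roughly what one container launch looks like, assembled from the flags
# listed above. Mount path and command are illustrative placeholders.
IMAGE=pythonanywhere/user_execution_environment
RUN_CMD="docker run -ti --rm=true --net=host --log-opt=max-size=100k \
-v /host/user_storage:/home/user $IMAGE /bin/bash"
echo "$RUN_CMD"
# eval "$RUN_CMD"   # uncomment on a box with the docker daemon running
```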
That last cleanup script often throws up a bunch of errors of the type:
Driver aufs failed to remove root filesystem xxx rename /var/lib/docker/aufs/diff/xxx /var/lib/docker/aufs/diff/xxx-removing: device or resource busy
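That “device or resource busy” usually means something still holds the container’s aufs mount. One quick check – hedged, this is just what we’d look at, not something we’ve confirmed – is whether stale aufs mounts accumulate on the box over days:

```shell
#!/bin/sh
# Count aufs mounts currently held. If this number climbs over days while
# containers are supposedly being removed via --rm, mounts are leaking.
LEFTOVER=$(awk '$3 == "aufs"' /proc/mounts 2>/dev/null | wc -l | tr -d ' ')
echo "aufs mounts currently held: $LEFTOVER"
```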
Is anyone else using docker to launch thousands of ephemeral containers per day? In other words, are these scalability problems unexpected, or is it just that we’re using it beyond the normal/tested boundaries?
Any tips on where to look for what’s causing the problem, or pinning down what’s being slow?
PPS. Re: the very long reboots – it’s hard to be sure, but our current belief is that the time is lost after the reboot, i.e. it’s something in the process of starting up the server that takes ages, not shutting down. We “know” this from comparing a normal reboot against echo b > /proc/sysrq-trigger (which reboots immediately, skipping the shutdown sequence entirely) and finding no significant difference in the slow reboot time (but small n here).
Cool, thanks. Here is some of the info you requested. In collecting it I discovered something that may be interesting: all of the reboots happened pretty quickly, and the docker daemon was contactable fairly quickly too. However, the machine I was testing on has only been running since Friday, whereas the machine we had the trouble with had been running for weeks. It therefore looks as if this is an issue that gets worse slowly, and may be a function of how many containers have ever run on the machine. We’ll continue to monitor the new machine to see whether it degrades like the previous ones.
# docker version
Client:
 Version:      1.11.2
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   b9f10c9
 Built:        Wed Jun 1 21:47:50 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.2
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   b9f10c9
 Built:        Wed Jun 1 21:47:50 2016
 OS/Arch:      linux/amd64
If you can run docker version and docker info on the affected machine, that could also be interesting. I’m wondering in particular about the number of containers and images lying around.
I also wonder what kind of filesystem is backing /var/lib/docker (I suppose the default ext4 but confirmation would be useful) and wondering if you have any kind of system metrics showing if there is an increase in memory use, CPU utilization, or disk I/O on the slower machine.
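Concretely, something like the following (run on the affected box) would answer those questions. The df line is generic; the docker lines are commented out because they need the daemon, and the grep fields are as printed by the 1.11-era CLI:

```shell
#!/bin/sh
# Report the filesystem type backing /var/lib/docker, then (commented)
# the docker-side counts we're curious about.
FS_TYPE=$(df --output=fstype /var/lib/docker 2>/dev/null | tail -n 1)
REPORT="backing filesystem: ${FS_TYPE:-unknown}"
echo "$REPORT"
# docker info | grep -E 'Containers|Images|Storage Driver'
# docker ps -aq | wc -l     # all containers, including exited ones
# docker images -q | wc -l  # images lying around
```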
Unfortunately, the machine that was affected was an AWS instance that has now been consigned to the great bit bucket in the sky. We’ll leave the current machine until it’s exhibiting the behaviour and then get back to you with the information you requested. From previous experience, that takes about a week or so.
OK, the machine with the problem has been up for long enough to start showing the problem again. Sorry for the delay in the response! It takes a week or so for a new instance to start showing the problem.
Here’s the output of docker version:
# docker version
Client:
 Version:      1.11.2
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   b9f10c9
 Built:        Wed Jun 1 21:47:50 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.2
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   b9f10c9
 Built:        Wed Jun 1 21:47:50 2016
 OS/Arch:      linux/amd64
So not a lot of docker containers at all, but the docker daemon has been running for a long time (more than two weeks), which is when we start to see performance degradation. The machine is a large 4-core machine, with load average below 2 throughout.
Running time docker run pythonanywhere/user_execution_environment echo hi takes 1m 30s.
Monitoring ps, we see that for the first 20s or so auplink is spinning like crazy:
root 27876 98.5 0.0 4332 464 ? R 11:02 0:05 \_ auplink /var/lib/docker/aufs/mnt/6799d62
Then there is ~30s of ~30-70% cpu usage that alternates between /usr/bin/docker-runc init and docker-runc --log /run/containerd/c9d82.
Finally, there is another 10s or so of auplink taking up 100%+ CPU.
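For reference, the sampling behind that breakdown was nothing fancier than a loop like this (a sketch: three iterations here for illustration, whereas we actually watched for the full ~90s and looked for the auplink and docker-runc lines):

```shell
#!/bin/sh
# Sample the top CPU consumers once a second, with timestamps, so the
# phases of a slow container start show up as distinct processes.
SAMPLES=""
i=0
while [ "$i" -lt 3 ]; do
    SNAP=$(ps -eo pcpu,pid,comm | sort -rn | head -n 4)
    SAMPLES="${SAMPLES}$(date '+%T')
${SNAP}
"
    sleep 1
    i=$((i + 1))
done
echo "$SAMPLES"
```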
I wonder if it is because we are using a large system image. The size of /var/lib/docker/aufs is 39G, and the image we are using is 19G; it’s the only image in the repository other than ubuntu.
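Some quick checks for the large-image theory: how big the aufs tree actually is, and which layers dominate. The paths follow the aufs layout discussed above; the docker lines are commented out because they need the daemon on the affected box.

```shell
#!/bin/sh
# Measure the aufs layer store; du is skipped harmlessly if the
# directory doesn't exist on this machine.
AUFS_DIR=/var/lib/docker/aufs
if [ -d "$AUFS_DIR" ]; then
    du -sh "$AUFS_DIR"/diff "$AUFS_DIR"/mnt 2>/dev/null
fi
# docker images                                              # expect ubuntu + the 19G image
# docker history pythonanywhere/user_execution_environment   # per-layer sizes
RESULT="checked $AUFS_DIR"
echo "$RESULT"
```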