Docker scalability issues -- major problems when running hundreds/thousands of interactive sessions

Hi everyone!

I work at PythonAnywhere, a PaaS that lets people run Python code in the cloud. We’ve mostly rolled our own sandboxing solution, but we’ve been experimenting with docker for a few months. We’ve started running into some scaling issues that are starting to make us think it’s not going to work for us, but we thought we’d reach out and ask for help first.

(here’s a cool thing you may not know: the interactive python consoles on python.org are supplied by us, and run in docker).

The main problems we’re having are:

  • on a given server, the time taken to start a docker container gets longer and longer. A fresh server takes 4-5 seconds to start a container, but over a matter of days that creeps up towards and past 30 seconds, at which point we generally reboot the machine (a rough version of the check is sketched after this list)

  • rebooting a server takes ages – upwards of 20 minutes. We assume this is something to do with docker or aufs cleanup.

  • rebooting the server does fix the slow-process-startup issue (but just restarting the docker daemon does not)
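
A rough version of the startup-time check (the echo is just a stand-in for a real console process):

time docker run --rm pythonanywhere/user_execution_environment echo hi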

Some additional info:

  • we start around 2 or 3 thousand docker containers per day on any given server
  • we’re running mostly interactive processes (so we use docker run -ti)
  • each container mounts about 10 directories/files
  • we use --rm=true to try to tidy up containers on exit
  • we are using --net=host, --log-opt=max-size=100k, and the DOCKER_FIX env var (a full invocation is sketched after this list)
  • we also have a manual cleanup script that kills old containers (and shuts down any that have run past the allowed maximum time)
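
Putting those flags together, a typical launch looks roughly like the following. The mounted paths, the DOCKER_FIX value, and the trailing command are placeholders, not our exact invocation:

# illustrative only: paths, env value, and final command are placeholders
docker run -ti --rm=true --net=host \
    --log-opt=max-size=100k \
    -e DOCKER_FIX="$DOCKER_FIX" \
    -v /host/path/one:/sandbox/path/one \
    -v /host/path/two:/sandbox/path/two \
    pythonanywhere/user_execution_environment /bin/bash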

That last cleanup script often throws up a bunch of errors of the type:

Driver aufs failed to remove root filesystem xxx rename /var/lib/docker/aufs/diff/xxx /var/lib/docker/aufs/diff/xxx-removing: device or resource busy
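
When that happens, something presumably still holds a reference to the container’s aufs mount. A check that might narrow it down (a sketch only; xxx stands for the layer id from the error above) is to look for processes whose mount table still references the layer:

# prints /proc/<pid>/mountinfo for any process still holding the mount
grep -l 'aufs/mnt/xxx' /proc/*/mountinfo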

Is anyone else using docker to launch thousands of ephemeral containers per day? In other words, are these scalability problems unexpected, or is it just that we’re using it beyond the normal/tested boundaries?

Any tips on where to look for what’s causing the problem, or how to pin down what’s being slow?

PS - just checked the logs; it’s actually more like 6,000 containers/day.

PPS. Re: the very long reboots, it’s hard to be sure, but our current belief is that the time goes after the reboot, i.e. it’s something in the process of starting up the server that takes ages, not shutting it down. We “know” this from comparing a normal reboot with echo b > /proc/sysrq-trigger (which reboots immediately, skipping the shutdown sequence) and finding no significant difference in the overall reboot time (but small n here).

We’d need some more info to figure out what’s going on here.

  • Can you post the output of docker version and docker info?
  • What distro are you using?
  • How long does it take to restart Docker?
  • If you restart Docker twice, does it take as long the second time?
  • If it takes as long the second time, can you enable debug logs, restart it, and attach the logs? (There’s a rough sketch of how after this list.)
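
For reference, if this is the stock docker-engine package on Ubuntu 14.04 (upstart), enabling debug logging would look something like this; adjust for your init system if not:

# add the debug flag to the options the upstart job passes to the daemon
echo 'DOCKER_OPTS="-D"' | sudo tee -a /etc/default/docker

# restart the daemon and grab the log it writes under upstart
sudo service docker restart
sudo tail -n 200 /var/log/upstart/docker.log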

Thank you!

Cool, thanks. Here is some of the info you requested. In collecting it I discovered something that may be interesting: all of the reboots happened pretty quickly, and the docker daemon was contactable fairly quickly too. However, the machine I was testing on has only been running since Friday, whereas the machine we had the trouble with had been running for weeks. It therefore looks to me as if this is an issue that gets worse slowly, and may be a function of how many containers have ever run on the machine. We’ll continue to monitor the new machine to see whether it also degrades like the previous ones.

docker version

Client:
 Version:      1.11.2
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   b9f10c9
 Built:        Wed Jun  1 21:47:50 2016
 OS/Arch:      linux/amd64
Server:
 Version:      1.11.2
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   b9f10c9
 Built:        Wed Jun  1 21:47:50 2016
 OS/Arch:      linux/amd64

docker info

Containers: 178
 Running: 178
 Paused: 0
 Stopped: 0
Images: 2
Server Version: 1.11.2
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 412
 Dirperm1 Supported: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: host bridge null
Kernel Version: 3.13.0-88-generic
Operating System: Ubuntu 14.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 14.69 GiB
Name: harry-liveconsole5
ID: XMJF:CKTA:3JKW:ZI6P:KIT7:2L6I:EJQY:FEHE:UDUK:VTVZ:QB2C:JA7Y
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/

OK!

If you can run docker version and docker info on the affected machine, that could also be interesting. I’m wondering in particular about the number of containers and images lying around.

I also wonder what kind of filesystem is backing /var/lib/docker (I suppose the default ext4, but confirmation would be useful), and whether you have any system metrics showing an increase in memory use, CPU utilization, or disk I/O on the slower machine.
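
In case it’s useful, that information can be gathered with something like the following (iostat comes from the sysstat package):

# filesystem backing /var/lib/docker
df -T /var/lib/docker

# five one-second samples of memory/CPU and of per-device disk I/O
vmstat 1 5
iostat -x 1 5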

Unfortunately, the machine that was affected was an AWS instance that has now been consigned to the great bit bucket in the sky. We’ll leave the current machine until it’s exhibiting the behaviour and then get back to you with the information you requested. From previous experience, that takes about a week or so.

OK, the machine has now been up for long enough to start showing the problem again. Sorry for the delay in responding; as mentioned, it takes a week or so for a new instance to start degrading.

Here’s the output of docker version:

# docker version
Client:
 Version:      1.11.2
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   b9f10c9
 Built:        Wed Jun  1 21:47:50 2016
 OS/Arch:      linux/amd64
Server:
 Version:      1.11.2
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   b9f10c9
 Built:        Wed Jun  1 21:47:50 2016
 OS/Arch:      linux/amd64

…and here are the results from docker info:

# docker info
Containers: 226
 Running: 197
 Paused: 0
 Stopped: 29
Images: 2
Server Version: 1.11.2
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 4813
 Dirperm1 Supported: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge null host
Kernel Version: 3.13.0-88-generic
Operating System: Ubuntu 14.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 14.69 GiB
Name: harry-liveconsole5
ID: XMJF:CKTA:3JKW:ZI6P:KIT7:2L6I:EJQY:FEHE:UDUK:VTVZ:QB2C:JA7Y
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): true
 File Descriptors: 1301
 Goroutines: 2909
 System Time: 2016-07-27T11:15:15.882539137Z
 EventsListeners: 0
Registry: https://index.docker.io/v1/

As you expected, /var/lib/docker is on ext4.

CPU load is high; the graph is noisy, but shows a general upward trend over the last two weeks. Disk I/O doesn’t look particularly high.

Right now, load average via uptime is pretty consistent at around 8. The processes that are currently using the bulk of CPU (via top) are:

/usr/bin/docker daemon -D --raw-logs 
docker-containerd -l /var/run/docker/libcontainerd/docker-containerd
docker-containerd-shim ....
auplink /var/lib/docker/aufs/mnt/.....

The first of those has accumulated the most CPU time overall; the auplink process disappeared while I was watching it.
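
(In case anyone wants the same numbers without top, cumulative CPU time per process can be pulled with something along these lines.)

# processes ranked by cumulative CPU time
ps -eo pid,etime,time,pcpu,comm --sort=-time | head -15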

Is there any more information we could usefully gather?

Hi there,

Just wanted to update with one more test we have done.

This is on a machine with just the following docker info:

Containers: 4
 Running: 1
 Paused: 0
 Stopped: 3
Images: 2
Server Version: 1.11.2
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 20
 Dirperm1 Supported: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: null host bridge
Kernel Version: 3.13.0-88-generic
Operating System: Ubuntu 14.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 14.69 GiB
Name: harry-liveconsole1
ID: XMJF:CKTA:3JKW:ZI6P:KIT7:2L6I:EJQY:FEHE:UDUK:VTVZ:QB2C:JA7Y
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/

So not a lot of docker containers at all, but the docker daemon has been running for a long time (more than two weeks), which is about when we usually start seeing the performance degradation. The machine is a large 4-core machine, with a load average below 2 throughout.

Running time docker run pythonanywhere/user_execution_environment echo hi takes 1m30s.

Monitoring ps, we see that for the first 20s or so auplink is spinning like crazy:
root 27876 98.5 0.0 4332 464 ? R 11:02 0:05 \_ auplink /var/lib/docker/aufs/mnt/6799d62

Then there is ~30s of 30-70% CPU usage, alternating between /usr/bin/docker-runc init and docker-runc --log /run/containerd/c9d82.

Finally, there is another 10s or so of auplink taking up 100%+ CPU.
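
(The monitoring above is just repeated sampling of ps while the docker run is in flight; a minimal version of that kind of loop is below.)

# print the busiest processes once a second; stop with Ctrl-C
while sleep 1; do
    date +%T
    ps -eo pcpu,pid,args --sort=-pcpu | head -5
done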

I wonder if it is because we are using a large system image. The size of /var/lib/docker/aufs is 39G, and the image we are using is 19G; it is the only image on the machine other than ubuntu.
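
For reference, checks along these lines show where that space is going and whether any single layer dominates (paths are the defaults; adjust if yours differ):

# image sizes as docker reports them
docker images

# on-disk size of the aufs layer directories, largest last
du -sh /var/lib/docker/aufs/diff/* | sort -h | tail -5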

Comments/suggestions?

Have you found the reason for the service degradation? We are experiencing similar behavior in our system, but we are puzzled about what’s going on.

No, we never got to the bottom of it :frowning: