Extreme slowness launching thousands of containers on AL2023

Hi all, I’m simulating very large fleets of low-resource IoT devices using containers. I’m doing this on an AWS EC2 r7i.16xlarge instance (64 vCPUs, 512 GB memory) running AL2023 with Docker client/server version 24.0.5 and systemd v252. I’m launching 4,096 containers serially.
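
The launcher itself is trivial; a minimal sketch of what I’m doing (the image name and per-container options are placeholders for my actual simulator):

```bash
# Simplified launcher; "iot-sim" is a placeholder for my real image/options
for i in $(seq 1 4096); do
  docker run -d --name "dev-$i" iot-sim
done
```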

When I first start this process, each container takes about half a second to start, but as the number of running containers grows, the time to start the next one climbs steeply: once about 3,000 containers are running, it takes roughly 20 seconds to start the next one. The same slowdown happens in reverse when removing the containers. The machine itself is lightly loaded during the slow phase: overall CPU utilization is under 10%, there is over 400 GB of free memory, no swapping, and only light disk IO. Once the containers are up and running, everything functions as expected and the containers themselves are fast and responsive.
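
For the record, this is roughly how I checked the load during the slow phase (standard tools; iostat comes from the sysstat package on AL2023):

```bash
vmstat 5      # CPU mostly idle; si/so stay at 0, i.e. no swapping
free -g       # 400+ GB free
iostat -x 5   # light disk IO (install sysstat if iostat is missing)
```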

Digging deeper, I noticed that systemd and systemd-journald are each consuming almost a full core during the launch phase, while the various docker* processes use minimal CPU and memory. Profiling systemd with strace and perf shows that almost all of its time is spent in mem_cgroup_wb_stats via the mem_cgroup_css_rstat_flush path. I presume this is systemd collecting cgroup-related stats on every cgroup change. I tried, somewhat blindly, to work around this by setting native.cgroupdriver=cgroupfs, but it had no effect; presumably systemd receives cgroup update events independently of Docker.
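
In case anyone wants to reproduce: this is roughly how I profiled systemd (PID 1), plus the cgroup-driver change I tried in /etc/docker/daemon.json (note that the tee overwrites any existing daemon.json, so merge by hand if you already have one):

```bash
# Sample systemd's stacks for 30s, then inspect the hot path
sudo perf record -F 99 -g -p 1 -- sleep 30
sudo perf report --stdio   # hot: mem_cgroup_wb_stats <- mem_cgroup_css_rstat_flush

# The (ineffective) cgroup-driver workaround
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
EOF
sudo systemctl restart docker
```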

I’m wondering if anyone has suggestions for easy workarounds. I’ve calculated that this machine should be able to run about 20,000 containers, but launching them efficiently is the bottleneck. Workarounds I can imagine include running Docker in multiple VMs, striping the containers across a number of smaller instances, modifying systemd, and so on, but each of those brings its own complexity. So if possible, I’d rather make efficient use of a single larger instance.

Thank you for any assistance!

  • Eric

Maybe run netdata to get a broader view of the server’s stats; it has a simple graph for just about every metric available.
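
Off the top of my head, the stock netdata container should be enough to get started; check the netdata docs for the currently recommended flags:

```bash
docker run -d --name=netdata \
  -p 19999:19999 \
  -v /proc:/host/proc:ro \
  -v /sys:/host/sys:ro \
  --cap-add SYS_PTRACE \
  netdata/netdata
# then browse to http://<host>:19999
```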

From what I remember, the slowness was related either to the network or to published ports, but I don’t recall exactly. Or maybe that was only for Docker Swarm; I’m not sure about that either.

You could also raise a support ticket with AWS: since the Docker engine you run on AL2023 is maintained and supported by AWS, and AWS has experience with high-scale workloads, you could ask them for assistance. If you have an AWS TAM, talk to them about it.