By default, Docker starts containers with a restricted set of capabilities. What does that mean?
Capabilities turn the binary “root/non-root” dichotomy into a fine-grained access control system. Processes (like web servers) that just need to bind on a port below 1024 do not need to run as root: they can just be granted the net_bind_service capability instead. And there are many other capabilities, for almost all the specific areas where root privileges are usually needed.
This means a lot for container security; let’s see why!
Typical servers run several processes as root, including the SSH daemon, cron daemon, logging daemons, kernel modules, network configuration tools, and more. A container is different, because almost all of those tasks are handled by the infrastructure around the container:
SSH access is typically managed by a single server running on the Docker host;
cron, when necessary, should run as a user process, dedicated and tailored for the app that needs its scheduling service, rather than as a platform-wide facility;
log management is also typically handed to Docker, or to third-party services like Loggly or Splunk;
hardware management is irrelevant, meaning that you never need to run udevd or equivalent daemons within containers;
network management happens outside of the containers, enforcing separation of concerns as much as possible, meaning that a container should never need to perform ifconfig, route, or ip commands (except when a container is specifically engineered to behave like a router or firewall, of course).
This means that in most cases, containers do not need “real” root privileges at all. Containers can therefore run with a reduced capability set, so that “root” within a container has far fewer privileges than the real “root”. For instance, it is possible to:
deny all “mount” operations;
deny access to raw sockets (to prevent packet spoofing);
deny access to some filesystem operations, like creating new device nodes, changing the owner of files, or altering attributes (including the immutable flag);
deny module loading;
and many others.
This means that even if an intruder manages to escalate to root within a container, it is much harder to do serious damage, or to escalate to the host.
This doesn’t affect regular web apps, but it considerably reduces the attack vectors available to malicious users. By default, Docker drops all capabilities except those needed, taking an allowlist rather than a denylist approach. You can see a full list of available capabilities in the Linux capabilities(7) man page.
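To see which capabilities a process actually holds, one option is to decode the CapEff bitmask that the kernel reports in /proc/<pid>/status. The sketch below is illustrative only: the names CAP_NAMES, parse_cap_eff, decode_caps, and current_caps are made up for this example, and the table assumes the bit numbering from linux/capability.h.

```python
# Capability names indexed by bit number (assumed to follow linux/capability.h).
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID",
    "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE",
    "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK",
    "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT",
    "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT",
    "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME",
    "CAP_SYS_TTY_CONFIG", "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE",
    "CAP_AUDIT_CONTROL", "CAP_SETFCAP", "CAP_MAC_OVERRIDE",
    "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM", "CAP_BLOCK_SUSPEND",
    "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF", "CAP_CHECKPOINT_RESTORE",
]


def parse_cap_eff(status_text):
    """Return the effective capability mask found in /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            # The value is a hexadecimal bitmask, one bit per capability.
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def decode_caps(mask):
    """List the names of the capabilities whose bits are set in mask."""
    names = []
    bit = 0
    while mask >> bit:
        if (mask >> bit) & 1:
            if bit < len(CAP_NAMES):
                names.append(CAP_NAMES[bit])
            else:
                # Bit not covered by the table above (for example, a newer kernel).
                names.append("CAP_%d" % bit)
        bit += 1
    return names


def current_caps():
    """Decode the effective capabilities of the calling process (Linux only)."""
    with open("/proc/self/status") as f:
        return decode_caps(parse_cap_eff(f.read()))
```

Calling current_caps() as root inside a container and again as root on the host should make the difference visible: the container list is expected to be much shorter and to lack entries such as CAP_SYS_ADMIN and CAP_SYS_MODULE. A container started with default settings often reports a mask such as 00000000a80425fb, although the exact value depends on the Docker version and configuration.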
One primary risk with running Docker containers is that the default set of capabilities and mounts given to a container may provide incomplete isolation, either independently, or when used in combination with kernel vulnerabilities.
Docker supports the addition and removal of capabilities, allowing use of a non-default profile. This may make Docker more secure through capability removal, or less secure through the addition of capabilities. The best practice is to remove all capabilities except those explicitly required by the container’s processes.
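As a rough sketch of that practice, the helper below builds a docker run command line that drops every capability with --cap-drop=ALL and then adds back only the ones named with --cap-add. The function name build_docker_run and the image name example/web:latest are hypothetical and chosen for illustration; which capabilities a given application really needs has to be determined case by case.

```python
def build_docker_run(image, required_caps=(), command=()):
    """Build a `docker run` argv that drops all capabilities, then
    re-adds only those listed in required_caps."""
    argv = ["docker", "run", "--rm", "--cap-drop=ALL"]
    for cap in required_caps:
        # Docker accepts capability names such as NET_BIND_SERVICE here.
        argv.append("--cap-add=" + cap)
    argv.append(image)
    argv.extend(command)
    return argv


# Hypothetical image that only needs to bind a port below 1024.
example_argv = build_docker_run("example/web:latest", ["NET_BIND_SERVICE"])
```

The resulting list can be passed to subprocess.run(example_argv, check=True) on a host where Docker is installed. Building the list explicitly, rather than a shell string, keeps each capability name visible and avoids quoting issues.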