We are running into an issue that seems to be a trivial one given the maturity of the Docker framework. I could not find a reliable solution from digging around, so this is my last hope.
We are running our production systems in DCOS, with Docker version 1.12 (commit id: d5236f0).
Goal:
We want to set the value of the tcp_keepalive_time parameter in the container.
Approach 1:
Modified the docker-compose.yml.tmpl and set the value using sysctl.
I am doing this at the moment.
For the first approach, it is working: I checked /proc/sys/net/ipv4/tcp_keepalive_intvl and the related files in the container, and the values are good. In the docker-compose.yml file, you have to be careful not to add any spaces between the key and the value.
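As a rough sketch of what the first approach might look like, the snippet below writes a small compose file with a sysctls list. The file name, the service name "app", the image, and the numeric values are placeholders chosen for illustration, not taken from the original setup, and the thread itself edits docker-compose.yml.tmpl rather than a plain file.

```shell
# Hypothetical output path, used only for this illustration.
COMPOSE_FILE="${COMPOSE_FILE:-./docker-compose.keepalive.yml}"

# Service name "app" and the image are placeholders.
cat > "$COMPOSE_FILE" <<'EOF'
version: "2.1"
services:
  app:
    image: example/app:latest
    sysctls:
      # key=value, with no spaces around "="
      - net.ipv4.tcp_keepalive_time=600
      - net.ipv4.tcp_keepalive_intvl=60
EOF

# Once the service is up, the value can be read from inside the container:
#   docker-compose -f "$COMPOSE_FILE" exec app cat /proc/sys/net/ipv4/tcp_keepalive_time
```

The list form shown here is where the "no spaces" rule matters, since each entry is passed along as a single key=value string.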
For your second approach, you must use the privileged flag in your docker-compose.yml, because you are modifying kernel settings.
I was trying to solve a similar problem and came across this page. It is missing some important pieces of information, which motivates my response.
Setting the tcp_keepalive parameters within a container requires kernel 4.13 or later on the base host. If you try this on an earlier kernel, like the 3.10 kernel of CentOS 7.x, these parameters will be missing from /proc inside the container and the command will fail with either approach. In our case, we were running an older kernel, and the only way to accomplish this was to set the parameter on the base host. You can do this with the sysctl -w command, but that only lasts until the next reboot. If you add the setting to /etc/sysctl.conf or a file under /etc/sysctl.d/, it is applied automatically when the system comes up.
Please note that you’ll need to restart your containers after making this change on the base host.
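The host-level steps above could be sketched roughly as follows. The helper name write_keepalive_conf and the file name 99-tcp-keepalive.conf are made up for this example, and 600 is only an illustrative value. The commands that need root are left as comments so the snippet itself changes nothing.

```shell
# Write a sysctl drop-in so the setting survives reboots.
# $1: target directory, $2: keepalive time in seconds.
write_keepalive_conf() {
  conf_dir="$1"
  seconds="$2"
  # The file name is an arbitrary choice for this sketch.
  printf 'net.ipv4.tcp_keepalive_time = %s\n' "$seconds" \
    > "$conf_dir/99-tcp-keepalive.conf"
}

# On the base host, as root (not executed here):
#   sysctl -w net.ipv4.tcp_keepalive_time=600   # immediate, lost at reboot
#   write_keepalive_conf /etc/sysctl.d 600      # persistent across reboots
#   sysctl --system                             # reload the sysctl config files
```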
I haven’t yet tried Fedora 27 or Ubuntu 17.10, both of which ship the kernel required for this feature, but I suspect from the previous response that you’ll be able to set this on a per-container basis with that kernel version.
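To decide which route applies on a given host, a small check against the kernel release may help. This is only a sketch: the function name kernel_at_least is invented here, and the 4.13 threshold is the one reported in this thread rather than something verified independently.

```shell
# Succeed if kernel release $1 (e.g. "3.10.0-693.el7.x86_64")
# is at least major $2, minor $3.
kernel_at_least() {
  release="$1"; want_major="$2"; want_minor="$3"
  major="${release%%.*}"        # text before the first dot
  rest="${release#*.}"          # text after the first dot
  minor="${rest%%[!0-9]*}"      # leading digits of the remainder
  [ "$major" -gt "$want_major" ] ||
    { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}

if kernel_at_least "$(uname -r)" 4 13; then
  echo "kernel is 4.13 or later: per-container keepalive settings may be available"
else
  echo "older kernel: set the keepalive values on the base host"
fi
```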
We came across a very similar issue: a Spring Boot application worked fine when running as a service, but when deployed as a Docker container it became unresponsive after some time, especially when left idle, i.e. when no API calls were made for a while.
What solved it was tuning the TCP keepalive kernel settings on Linux. The internal swarm load balancer purges all idle connections after 900 seconds.
So if you set the keepalive time to something less than 900 seconds, the unresponsiveness goes away.
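One way to confirm that the value a container actually sees is below the 900-second limit is a check like the one sketched here. The function name keepalive_ok is hypothetical, and the 900-second default simply mirrors the figure quoted above.

```shell
# Compare a keepalive time against the idle timeout.
# $1: file holding the value (defaults to the live kernel setting)
# $2: idle timeout in seconds (defaults to 900)
# Returns 0 if the value is below the timeout, 1 if not, 2 if unreadable.
keepalive_ok() {
  file="${1:-/proc/sys/net/ipv4/tcp_keepalive_time}"
  limit="${2:-900}"
  value="$(cat "$file")" || return 2
  [ "$value" -lt "$limit" ]
}

# Inside a container this could be run as:
#   keepalive_ok && echo "below the idle timeout" || echo "too high or unreadable"
```

Running the check inside the container rather than on the host matters, because the two can report different values.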
I am still having an issue where the keepalive time is set to 600 on the host (it was 7200, and I changed it via sysctl.conf), but running containers still get a value of 7200. This causes containers running in a swarm to get connection timeouts, for the reasons previously explained here.