Socket connections from one container to another fail after a while (errno 99), but work fine before. How to fix?

Hello.

I have a simple docker-compose setup with two containers: the “client” container contains the application that needs to perform TPM operations, and the “server” container contains the TPM simulator software that provides (emulates) TPM functionality. Every time the application performs a TPM operation, it creates a TCP socket connection to the TPM simulator running in the other container, sends the command and receives the response. The connection is closed after the operation is completed. The application performs the TPM operations sequentially, so there is at most one open connection at a time.

Now, this all works fine, but only for a limited amount of time! After roughly 1,500 TPM operations, the next operation suddenly fails with the following error:

socket_connect() Failed to connect to host 10.0.0.20, port 2321:
errno 99: Cannot assign requested address

And once an operation has failed with this error, all subsequent operations fail too, with the same error!

When I shut down and restart the complete docker-compose setup, then it will work again. But, again, only for ~1500 TPM operations, before the errors are back :slightly_frowning_face:


I’m confident that this is some weird Docker issue (not a problem with the application), because if I start a shell in the “client” container after the errors have started, even a simple nc -zv 10.0.0.20 2321 fails with “errno 99: Cannot assign requested address” too! If I run that very same command, also from within the “client” container, before the errors have started, all is good (nc returns “open”). So, obviously, new socket connections from the “client” container to the “server” container stop being possible after a while, for some weird reason.
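For anyone who wants to reproduce this without the TPM stack, here is a minimal Python sketch of the same connect-and-close pattern. It is only an illustration: the function name count_connects_until_failure is made up, and it sends no TPM commands at all, it just opens and closes TCP connections back to back and reports the first failure.

```python
import socket


def count_connects_until_failure(host, port, max_attempts, timeout=2.0):
    """Open and close one TCP connection per attempt, one after another.

    Returns (number_of_successful_connects, first_OSError_or_None).
    """
    for attempt in range(max_attempts):
        try:
            # The kernel assigns a local (ephemeral) port for each new connection.
            with socket.create_connection((host, port), timeout=timeout):
                pass  # the socket is closed when the with-block exits
        except OSError as exc:
            # exc.errno would be 99 for "Cannot assign requested address".
            return attempt, exc
    return max_attempts, None
```

Run from inside the “client” container, count_connects_until_failure("10.0.0.20", 2321, 5000) should show whether plain connects alone (address and port taken from the error message above) are enough to trigger the error, and after how many attempts.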

I also started a shell in the “server” container and verified that the TPM simulator software is still up and running, and is ready to accept incoming connections. So, not a problem with the simulator software!

Is there any way to fix this?

Thank you :sweat_smile:

I noticed that when the “errno 99: Cannot assign requested address” errors start and I then wait for a few minutes (without restarting the containers!), new socket connections become possible again!

I also found out, after a lot of testing, that, apparently, the problem can be worked around by adding an artificial delay after each connection (TPM) operation, e.g. 100 milliseconds. Of course, this will greatly slow things down, but I’m now at 100,000+ operations in a straight unbroken run, and still going…
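The delay workaround can be sketched roughly as follows. This is not the actual application code; MinInterval is a hypothetical helper that simply enforces a minimum pause between consecutive operations.

```python
import time


class MinInterval:
    """Enforce a minimum pause between consecutive calls to wait()."""

    def __init__(self, seconds):
        self.seconds = seconds
        self._last = None  # monotonic timestamp of the previous call

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.seconds - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

Calling wait() on a MinInterval(0.1) instance before each TPM operation caps the rate at about 10 connections per second. If each closed connection keeps its local port reserved for about 60 seconds (the Linux TIME_WAIT duration, which is only assumed to be the relevant mechanism here), that would bound the number of lingering sockets at roughly 600.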

So, this is strong evidence that when a large number of connections are opened and closed in a short time, Docker can’t keep up. The workaround of adding “artificial” delays isn’t good, though. Is there a better solution?

The operating system takes some time to establish and tear down TCP connections. Docker is just a thin isolation layer on top.

The solution is to re-use the TCP connections, as done with http/2 and database connection pools.
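As a rough illustration of connection reuse, the sketch below sends several commands over a single TCP connection instead of reconnecting each time. It assumes a made-up protocol in which every reply has a fixed size; it is not the TPM simulator protocol, and the names recv_exact and run_commands are hypothetical.

```python
import socket


def recv_exact(sock, n):
    """Read exactly n bytes, or raise if the peer closes early."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def run_commands(host, port, commands, reply_size):
    """Send all commands over one connection and collect the replies.

    Assumes (for illustration only) that each reply is reply_size bytes.
    """
    replies = []
    with socket.create_connection((host, port), timeout=5.0) as sock:
        for cmd in commands:
            sock.sendall(cmd)
            replies.append(recv_exact(sock, reply_size))
    return replies
```

With this pattern only one local port is used for the whole batch, no matter how many commands are sent. Whether the server side supports more than one command per connection is a separate question.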

Well, this may be right, in general. But the application has to use the system-provided library (libtss2) to access the TPM. The address of the TPM simulator that is to be used is set up via an environment variable (TSS2_TCTI). So, the details of how the connections are managed are inside the system-provided TPM library and therefore cannot be changed. Also, using one TCP connection per command probably is how the SWTPM network protocol works. I don’t think that is going to change anytime soon…

Furthermore, even if the OS takes a while to establish and tear down TCP connections, why should it, all of a sudden, refuse to create any new connections? And why with that weird “errno 99” error?

I can’t give you a solution, just an idea. You could search for kernel parameters setting connection related limits. If it happens only in containers, there must be something different inside the container than outside. Network and kernel parameters are not my strongest skills so I can’t tell you which parameter you should look for. It is also possible that it is caused by something else, but for now I don’t know what.
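One connection-related limit that may be worth checking, as a hypothesis and not a confirmed diagnosis: the Linux connect(2) man page lists EADDRNOTAVAIL (errno 99) for the case where all port numbers in the ephemeral port range are in use, and closed connections can keep their local port occupied in the TIME_WAIT state for a while. The sketch below reads the port range and counts TIME_WAIT sockets from procfs. It assumes Linux, IPv4 only (/proc/net/tcp), and has to be run inside the “client” container to see that container's network namespace; the function names are made up.

```python
from pathlib import Path

TIME_WAIT = "06"  # state code for TIME_WAIT in the "st" column of /proc/net/tcp


def parse_port_range(text):
    """Parse the contents of ip_local_port_range into (low, high)."""
    low, high = text.split()
    return int(low), int(high)


def count_time_wait(proc_net_tcp_text):
    """Count TIME_WAIT sockets in a /proc/net/tcp style table."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) > 3 and fields[3] == TIME_WAIT:
            count += 1
    return count


def report():
    """Print the ephemeral port budget and the current TIME_WAIT count."""
    low, high = parse_port_range(
        Path("/proc/sys/net/ipv4/ip_local_port_range").read_text())
    waiting = count_time_wait(Path("/proc/net/tcp").read_text())
    print(f"ephemeral ports: {low}-{high} ({high - low + 1} total)")
    print(f"sockets in TIME_WAIT: {waiting}")
```

If report() shows the TIME_WAIT count approaching the size of the port range right when the errors start, that would support the hypothesis. If the count stays low, the cause is something else.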

You probably searched for the error message and found discussions like this

So people are talking about similar issues, but I have only seen this error message when a process wanted to listen on a port, not when a client wanted to connect to a service on a port.

Yes, I already searched for this error. But when people discuss “errno 99”, it is always about a general inability to establish socket connections. The problem that I am facing is different though: connects work just fine for a while, but then, all of a sudden, no further connections are possible, with the exact same connection parameters that still worked milliseconds ago! :face_with_raised_eyebrow:

Furthermore, the usual suggestion to fix “errno 99” problems is to connect directly to an IP address instead of using a hostname. I’m already connecting directly to the IP address of the container, as specified in my docker-compose.yaml, so the suggested fix does not apply here…

I tried with all the parameters that are mentioned here:
https://enterprise-support.nvidia.com/s/article/linux-sysctl-tuning

Did not make a difference, though :neutral_face:

Using libtss2 (Github)? Check with them.

Furthermore, are you sure you are closing all connections correctly? Did you update your library for potential bug fixes?

I’m using the TPM library that is part of the operating system (libtss2), so I’m not dealing with the connections directly; I just set the destination server (TPM simulator) with the environment variable TSS2_TCTI. Handling of the connections happens inside the library. But I’m pretty sure they’re closed correctly, because with the artificial delay added, I have now reached ~10,000,000 operations in a row! Also, I tried monitoring the connections with ss, but most of the time when I run ss there are no active connections at all. Very rarely, when I happen to run ss at just the right moment, I briefly see one “established” connection, which is gone as soon as I retry. So these connections obviously are very short-lived.

I use Debian stable “Bookworm”. I also gave the latest Debian Testing a try, but it makes no difference. At this point, it seems that I have to live with the artificial delay, until Docker gets improved… :face_with_raised_eyebrow: