We run Docker and Kubernetes on EC2 through the EKS service, and this cluster is used mainly for CI/CD.
On top of that, we are trying to set up TestContainers to run in our pipelines. The process works as follows: every time we start a new pipeline, a new POD is started with docker.sock mounted into it, so that TestContainers can start containers on Docker to run the tests. The main drawback of this approach is that the containers started through TestContainers aren’t managed by Kubernetes and run completely detached.
Everything works fine right after we start the EC2 instance, with freshly started Docker and Kubernetes services. But after some time, for a reason we have not identified, TestContainers running in the POD can no longer communicate with the containers it started.
So far we have run some tests to understand why the communication stops. I’ll walk through two experiments to show what is happening.
Scenario 1: Docker to Docker communication.
First, I start an NGINX server:
> docker run --rm -it -p 8081:80 nginx
Then I look for the new virtual network interface created for the container on the EC2 machine:
> ifconfig
docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 172.17.0.1 netmask 255.255.0.0 broadcast 172.17.255.255
        inet6 fe80::42:b9ff:feaa:cb3b prefixlen 64 scopeid 0x20<link>
        ether 02:42:b9:aa:cb:3b txqueuelen 0 (Ethernet)
        RX packets 179450 bytes 15754424 (15.0 MiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 331573 bytes 1308507471 (1.2 GiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
        inet 10.3.7.83 netmask 255.255.255.192 broadcast 10.3.7.127
        inet6 fe80::491:5fff:feed:c4fc prefixlen 64 scopeid 0x20<link>
        ether 06:91:5f:ed:c4:fc txqueuelen 1000 (Ethernet)
        RX packets 125291544 bytes 127131602158 (118.4 GiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 88867600 bytes 22189402440 (20.6 GiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

veth940712e: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet6 fe80::e0c9:45ff:fe7e:3bbf prefixlen 64 scopeid 0x20<link>
        ether e2:c9:45:7e:3b:bf txqueuelen 0 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 8 bytes 720 (720.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
Second, I start an Ubuntu container on the same EC2 host in order to curl NGINX. Before running curl, I start tcpdump on the NGINX interface with sudo tcpdump -n -v -i veth940712e.
> docker run --rm -it ubuntu sh
# apt-get update
# apt-get install -y curl
# curl http://172.17.0.1:8081
The dump below shows what was captured for the request made from Ubuntu. Note that the TCP handshake completed successfully: SYN from the client (Flags [S]), SYN/ACK from the server (Flags [S.]), and ACK from the client (Flags [.]).
> sudo tcpdump -n -v -i veth940712e
14:55:37.312810 IP (tos 0x0, ttl 255, id 21290, offset 0, flags [DF], proto TCP (6), length 60)
10.3.7.83.17503 > 172.17.0.2.http: Flags [S], cksum 0xbd97 (incorrect -> 0x0aa0), seq 4090060033, win 64240, options [mss 1460,sackOK,TS val 2467323003 ecr 0,nop,wscale 7], length 0
14:55:37.312829 IP (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto TCP (6), length 60)
172.17.0.2.http > 10.3.7.83.17503: Flags [S.], cksum 0xbd97 (incorrect -> 0x748a), seq 2527613674, ack 4090060034, win 65160, options [mss 1460,sackOK,TS val 2009280790 ecr 2467323003,nop,wscale 7], length 0
14:55:37.312843 IP (tos 0x0, ttl 255, id 21291, offset 0, flags [DF], proto TCP (6), length 52)
10.3.7.83.17503 > 172.17.0.2.http: Flags [.], cksum 0xbd8f (incorrect -> 0x9fe9), ack 1, win 502, options [nop,nop,TS val 2467323003 ecr 2009280790], length 0
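To make a dump like the one above easier to read, the packet lines can be reduced to just the endpoints and the TCP flags. This is only a sketch: flags_only is a hypothetical helper name (not part of tcpdump), and the sed pattern assumes the tcpdump -n -v line layout shown in this post.

```shell
# Hypothetical helper (not part of tcpdump): reads `tcpdump -n -v` text on
# stdin and prints "src > dst [flags]" for each TCP packet line.
# Assumption: packet lines look like "A > B: Flags [X], ..." as in this post.
# The per-packet header lines (timestamp, tos, ttl, ...) are dropped.
flags_only() {
  sed -n 's/^ *\([0-9a-z.]* > [0-9a-z.]*\): Flags \(\[[^]]*]\).*/\1 \2/p'
}
```

It can be used live with sudo tcpdump -l -n -v -i veth940712e | flags_only (the -l option makes tcpdump line-buffer its output). For the three packets above it would print the [S], [S.], [.] sequence, one packet per line.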
Scenario 2: Kubernetes to Docker communication.
For the second scenario I repeat the same steps, except that instead of curling NGINX from Ubuntu, I run curl from inside a POD container started through Kubernetes (again capturing on the NGINX interface with sudo tcpdump -n -v -i veth940712e).
> kubectl exec -i -t -n cicd docker-maven-agent -- sh
# curl http://172.17.0.1:8081
This time tcpdump shows that the connection is never established, apparently because the packets from NGINX are not making it back to the POD. In the dump below you can see that the SYN/ACK (Flags [S.]) is sent toward the POD twice, but since no ACK comes back from the POD the handshake does not complete, and the POD eventually retransmits its SYN (same sequence number), which triggers further SYN/ACKs.
14:54:07.401408 IP (tos 0x0, ttl 254, id 145, offset 0, flags [DF], proto TCP (6), length 60)
10.3.7.83.35639 > 172.17.0.2.http: Flags [S], cksum 0xbd97 (incorrect -> 0x2ba8), seq 3109992598, win 62727, options [mss 8961,sackOK,TS val 535965483 ecr 0,nop,wscale 7], length 0
14:54:07.401426 IP (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto TCP (6), length 60)
172.17.0.2.http > 10.3.7.83.35639: Flags [S.], cksum 0xbd97 (incorrect -> 0x1190), seq 1217174438, ack 3109992599, win 65160, options [mss 1460,sackOK,TS val 2009190878 ecr 535965483,nop,wscale 7], length 0
14:54:08.425371 IP (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto TCP (6), length 60)
172.17.0.2.http > 10.3.7.83.35639: Flags [S.], cksum 0xbd97 (incorrect -> 0x0d90), seq 1217174438, ack 3109992599, win 65160, options [mss 1460,sackOK,TS val 2009191902 ecr 535965483,nop,wscale 7], length 0
14:54:09.421394 IP (tos 0x0, ttl 254, id 146, offset 0, flags [DF], proto TCP (6), length 60)
10.3.7.83.35639 > 172.17.0.2.http: Flags [S], cksum 0xbd97 (incorrect -> 0x23c4), seq 3109992598, win 62727, options [mss 8961,sackOK,TS val 535967503 ecr 0,nop,wscale 7], length 0
14:54:09.421412 IP (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto TCP (6), length 60)
172.17.0.2.http > 10.3.7.83.35639: Flags [S.], cksum 0xbd97 (incorrect -> 0x09ac), seq 1217174438, ack 3109992599, win 65160, options [mss 1460,sackOK,TS val 2009192898 ecr 535965483,nop,wscale 7], length 0
14:54:11.433366 IP (tos 0x0, ttl 255, id 0, offset 0, flags [DF], proto TCP (6), length 60)
172.17.0.2.http > 10.3.7.83.35639: Flags [S.], cksum 0xbd97 (incorrect -> 0x01d0), seq 1217174438, ack 3109992599, win 65160, options [mss 1460,sackOK,TS val 2009194910 ecr 535965483,nop,wscale 7], length 0
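The reasoning above (SYN seen, SYN/ACK sent, no final ACK) can also be checked mechanically against a saved capture. The sketch below is an illustration only: handshake_state is a hypothetical helper name, it assumes the tcpdump -n -v text layout shown in this post, and it assumes the capture contains a single connection attempt.

```shell
# Hypothetical helper: reads `tcpdump -n -v` text on stdin and reports how far
# the TCP three-way handshake got, based only on the flag fields.
# Assumption: the input covers one connection attempt, as in the dumps above.
handshake_state() {
  awk '
    index($0, "Flags [S],")  { syn++ }      # client SYN (including retransmits)
    index($0, "Flags [S.],") { synack++ }   # server SYN/ACK
    index($0, "Flags [.],")  { ack++ }      # pure ACK
    END {
      if (syn && synack && ack) print "established"
      else if (syn && synack)   print "syn-ack unanswered (" synack " sent)"
      else if (syn)             print "syn unanswered"
      else                      print "no handshake seen"
    }'
}
```

With the Scenario 2 dump saved to a file (say capture.txt, a name chosen here for illustration), handshake_state < capture.txt would report the SYN/ACK as unanswered, whereas the Scenario 1 dump would be reported as established.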
Does anybody have an idea of what could be causing this, or any hints on other troubleshooting strategies I could use to gather more information about this problem?