We have a strange problem with a Docker production setup. We have a RHEL 7.5 (Maipo) production server with various Docker containers, among which an nginx proxy, PHP/Apache and MariaDB. After a seemingly random amount of time the web application becomes unreachable from remote servers, but it can still be reached from the server itself: a wget or curl call from within the host gets the expected response, while the same call from another server returns "connection reset by peer". The only way to solve the problem is to restart the MariaDB container (and only that one); after that the web application is accessible again from remote servers.
It seems to have something to do with the Docker network becoming unavailable, but we cannot figure out what the problem is exactly, or why restarting only the db container solves it. The db container logs show no errors whatsoever and, as stated, the entire application IS up and responding, just only from the server itself. For example, wget urlname.com from within the server (which resolves to the public IP address, by the way) returns HTTP 200 with the correct HTML response, while the same request from another server gets a "connection reset by peer" error.
In the journal I did notice that after a restart of the container (using docker-compose stop and up) there are some messages regarding the virtual Ethernet devices of Docker:
Dec 05 12:28:43 nmb_live_new dockerd[16890]: time="2019-12-05T12:28:43+03:00" level=info msg="shim reaped" id=9e65f712c2d53611fd1a2c38c586667c12828c92fa754189f5f53e50f72b90a5 module="containerd/tasks"
Dec 05 12:28:43 nmb_live_new dockerd[16890]: time="2019-12-05T12:28:43.069192811+03:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Dec 05 12:28:43 nmb_live_new kernel: br-7b06b33213df: port 1(vethc6c8034) entered disabled state
Dec 05 12:28:43 nmb_live_new NetworkManager[1065]: [1575538123.2357] manager: (vetha1a7e02): new Veth device (/org/freedesktop/NetworkManager/Devices/2443)
Dec 05 12:28:43 nmb_live_new kernel: br-7b06b33213df: port 1(vethc6c8034) entered disabled state
Dec 05 12:28:43 nmb_live_new kernel: device vethc6c8034 left promiscuous mode
Dec 05 12:28:43 nmb_live_new kernel: br-7b06b33213df: port 1(vethc6c8034) entered disabled state
Dec 05 12:28:43 nmb_live_new NetworkManager[1065]: [1575538123.2606] device (vethc6c8034): released from master device br-7b06b33213df
Dec 05 12:28:43 nmb_live_new libvirtd[1553]: 2019-12-05 09:28:43.274+0000: 1800: error : virFileReadAll:1420 : Failed to open file '/sys/class/net/vetha1a7e02/operstate': No such file or directory
Dec 05 12:28:43 nmb_live_new libvirtd[1553]: 2019-12-05 09:28:43.275+0000: 1800: error : virNetDevGetLinkInfo:2509 : unable to read: /sys/class/net/vetha1a7e02/operstate: No such file or directory
Dec 05 12:28:45 nmb_live_new dockerd[16890]: time="2019-12-05T12:28:45.302275980+03:00" level=warning msg="IPv4 forwarding is disabled. Networking will not work"
Dec 05 12:28:45 nmb_live_new kernel: br-7b06b33213df: port 1(vethd406c36) entered blocking state
Dec 05 12:28:45 nmb_live_new kernel: br-7b06b33213df: port 1(vethd406c36) entered disabled state
Dec 05 12:28:45 nmb_live_new kernel: device vethd406c36 entered promiscuous mode
Dec 05 12:28:45 nmb_live_new kernel: IPv6: ADDRCONF(NETDEV_UP): vethd406c36: link is not ready
Dec 05 12:28:45 nmb_live_new NetworkManager[1065]: [1575538125.3712] manager: (veth8eb51a4): new Veth device (/org/freedesktop/NetworkManager/Devices/2444)
Dec 05 12:28:45 nmb_live_new NetworkManager[1065]: [1575538125.3744] manager: (vethd406c36): new Veth device (/org/freedesktop/NetworkManager/Devices/2445)
Dec 05 12:28:45 nmb_live_new firewalld[62837]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w2 -t nat -C DOCKER -p tcp -d 0/0 --dport 3306 -j DNAT --to-destination 172.18.0.2:3306 ! -i br-7b06b33213df' failed: iptables: No chain/target/match by that name.
Dec 05 12:28:45 nmb_live_new firewalld[62837]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w2 -t filter -C DOCKER ! -i br-7b06b33213df -o br-7b06b33213df -p tcp -d 172.18.0.2 --dport 3306 -j ACCEPT' failed: iptables: Bad rule (does a matching rule exist in that chain?).
Dec 05 12:28:45 nmb_live_new firewalld[62837]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w2 -t nat -C POSTROUTING -p tcp -s 172.18.0.2 -d 172.18.0.2 --dport 3306 -j MASQUERADE' failed: iptables: No chain/target/match by that name.
Dec 05 12:28:45 nmb_live_new dockerd[16890]: time="2019-12-05T12:28:45+03:00" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/9e65f712c2d53611fd1a2c38c586667c12828c92fa754189f5f53e50f72b90a5/shim.sock" debug=false module="containerd/tasks" pid=49100
Dec 05 12:28:46 nmb_live_new kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Dec 05 12:28:46 nmb_live_new kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethd406c36: link becomes ready
Dec 05 12:28:46 nmb_live_new kernel: br-7b06b33213df: port 1(vethd406c36) entered blocking state
Dec 05 12:28:46 nmb_live_new kernel: br-7b06b33213df: port 1(vethd406c36) entered forwarding state
Dec 05 12:28:46 nmb_live_new NetworkManager[1065]: [1575538126.2935] device (vethd406c36): carrier: link connected
Dec 05 12:28:50 nmb_live_new sshd[49440]: Did not receive identification string from 10.200.219.25 port 49807
This seems to indicate Docker resetting its virtual network devices and reconfiguring its iptables rules.
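The "IPv4 forwarding is disabled" warning is part of why we suspect the host networking rather than the application itself. For reference, these are roughly the checks we run when the problem occurs (illustrative commands; urlname.com stands in for our real domain):

# on the host: check that kernel IP forwarding is still on (we expect net.ipv4.ip_forward = 1)
sysctl net.ipv4.ip_forward

# on the host: this returns HTTP 200 even while the problem is occurring
curl -v http://urlname.com -o /dev/null

# on any other server: this is where we get "connection reset by peer"
curl -v http://urlname.com -o /dev/null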
firewalld is running and active. The host itself runs in a vSphere virtual environment (not managed by us). This is the Networks section from docker inspect for the db container:
"Networks": {
    "webdev_default": {
        "IPAMConfig": null,
        "Links": null,
        "Aliases": [
            "9e65f712c2d5",
            "mysql"
        ],
        "NetworkID": "7b06b33213df6a8aab5852f2da0a365fc46b621651a1b12fcf07b4d8722474aa",
        "EndpointID": "4a0b628d590daacb3a7a2d176c42de783c0c4a0d88c111bcb8d97091eb348710",
        "Gateway": "172.18.0.1",
        "IPAddress": "172.18.0.2",
        "IPPrefixLen": 16,
        "IPv6Gateway": "",
        "GlobalIPv6Address": "",
        "GlobalIPv6PrefixLen": 0,
        "MacAddress": "02:42:ac:12:00:02",
        "DriverOpts": null
    }
}
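For completeness, the snippet above can be reproduced with something like the following (illustrative; 9e65f712c2d5 is the db container ID that also appears in the journal above, and python is only used for pretty-printing):

# dump only the Networks section of the db container's inspect output
docker inspect --format '{{ json .NetworkSettings.Networks }}' 9e65f712c2d5 | python -m json.tool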
The db container is part of the same Docker network the other containers are also using: webdev_default. Output of docker network ls:
NETWORK ID          NAME                DRIVER              SCOPE
4718a5ca6935        bridge              bridge              local
73dfc726bef5        host                host                local
579cc4c85276        none                null                local
7b06b33213df        webdev_default      bridge              local
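If it helps, the next time the problem occurs we can also capture the state of the custom network itself, which shows the subnet, gateway, options and which containers are attached at that moment (illustrative commands):

# full details of the bridge network used by the application stack
docker network inspect webdev_default

# just the attached containers and their addresses
docker network inspect webdev_default --format '{{ json .Containers }}' | python -m json.tool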
Output of ip addr:
1792: veth94e2735@if1791: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 700 qdisc noqueue master br-7b06b33213df state UP group default
link/ether 22:b0:c7:3f:68:c2 brd ff:ff:ff:ff:ff:ff link-netnsid 0
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
1794: veth9cd5244@if1793: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 700 qdisc noqueue master br-7b06b33213df state UP group default
link/ether 4a:81:83:4b:7f:d6 brd ff:ff:ff:ff:ff:ff link-netnsid 3
2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 700 qdisc mq state UP group default qlen 1000
link/ether 00:50:56:96:26:be brd ff:ff:ff:ff:ff:ff
inet 10.200.221.26/24 brd 10.200.221.255 scope global noprefixroute ens192
valid_lft forever preferred_lft forever
3: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
link/ether 52:54:00:5e:02:40 brd ff:ff:ff:ff:ff:ff
inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
valid_lft forever preferred_lft forever
1796: vethd406c36@if1795: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 700 qdisc noqueue master br-7b06b33213df state UP group default
link/ether 26:7e:e7:7c:ea:b0 brd ff:ff:ff:ff:ff:ff link-netnsid 1
4: virbr0-nic: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast master virbr0 state DOWN group default qlen 1000
link/ether 52:54:00:5e:02:40 brd ff:ff:ff:ff:ff:ff
6: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:79:c9:da:c5 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
inet6 fe80::42:79ff:fec9:dac5/64 scope link
valid_lft forever preferred_lft forever
1738: br-7b06b33213df: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 700 qdisc noqueue state UP group default
link/ether 02:42:ba:cc:39:4e brd ff:ff:ff:ff:ff:ff
inet 172.18.0.1/16 brd 172.18.255.255 scope global br-7b06b33213df
valid_lft forever preferred_lft forever
1776: veth29161ee@if1775: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 700 qdisc noqueue master br-7b06b33213df state UP group default
link/ether aa:97:92:7b:f9:f7 brd ff:ff:ff:ff:ff:ff link-netnsid 5
1778: veth5c89d1e@if1777: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 700 qdisc noqueue master br-7b06b33213df state UP group default
link/ether 56:08:e0:aa:fc:63 brd ff:ff:ff:ff:ff:ff link-netnsid 2
1782: vethc8d18bf@if1781: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 700 qdisc noqueue master br-7b06b33213df state UP group default
link/ether ee:af:09:10:8b:ea brd ff:ff:ff:ff:ff:ff link-netnsid 4
1784: veth8e08e7a@if1783: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 700 qdisc noqueue master br-7b06b33213df state UP group default
link/ether da:47:cc:fb:db:4a brd ff:ff:ff:ff:ff:ff link-netnsid 7
1786: veth70bb06c@if1785: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 700 qdisc noqueue master br-7b06b33213df state UP group default
link/ether 8e:00:80:5e:df:a5 brd ff:ff:ff:ff:ff:ff link-netnsid 8
1790: vethfd99569@if1789: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
link/ether 3a:b7:87:98:15:46 brd ff:ff:ff:ff:ff:ff link-netnsid 6
inet6 fe80::38b7:87ff:fe98:1546/64 scope link
valid_lft forever preferred_lft forever
We had to change the MTU of the network to 700 because of earlier network problems, but that is the only customization we made and it has not caused problems in the past.
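Since the MTU was lowered on purpose, here is how we can double-check that the value is actually applied, assuming the override was done via the network's com.docker.network.driver.mtu driver option (we are not completely sure of the exact mechanism that was used at the time; commands are illustrative):

# MTU as seen by the kernel on the Docker bridge
ip link show br-7b06b33213df | grep -o 'mtu [0-9]*'

# driver options stored on the Docker network (the MTU override should show up here if it was set on the network)
docker network inspect webdev_default --format '{{ json .Options }}'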
Output of firewall-cmd --list-all:
public (active)
  target: default
  icmp-block-inversion: no
  interfaces: ens192
  sources:
  services: dhcpv6-client ssh mysql http https mysql-replication
  ports: 8025/tcp 80/tcp 3306/tcp 4444/tcp 4567/tcp 4568/tcp 5601/tcp 22/tcp
  protocols:
  masquerade: no
  forward-ports: port=8025:proto=tcp:toport=8025:toaddr=
        port=80:proto=tcp:toport=80:toaddr=
        port=3306:proto=tcp:toport=3306:toaddr=
        port=4444:proto=tcp:toport=4444:toaddr=
        port=4567:proto=tcp:toport=4567:toaddr=
        port=22:proto=tcp:toport=22:toaddr=
        port=4568:proto=udp:toport=4568:toaddr=
  source-ports:
  icmp-blocks:
  rich rules:
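Given the COMMAND_FAILED warnings from firewalld about the DOCKER chain in the journal above, the next time this happens we intend to capture the Docker-related iptables state before restarting anything, roughly like this (illustrative; the DOCKER-USER chain only exists on reasonably recent Docker versions):

# NAT rules Docker installed for published ports (the DNAT rule for 172.18.0.2:3306 should be listed here)
iptables -t nat -nvL DOCKER

# filter rules for traffic going into the containers
iptables -nvL DOCKER
iptables -nvL DOCKER-USER

# zones and interfaces firewalld currently considers active
firewall-cmd --get-active-zones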
Does anybody have any clue as to why the networking seems to fail randomly after some period of time? I'm happy to provide any kind of logging, but my networking knowledge is limited.