Very large number of `veth` interfaces

I think I’ve come across another major bug in the “docker-for-azure” project. This bug, combined with the broken logging implementation, means that personally I don’t think this project has been tested enough to be out of beta yet. One day it will be really useful, but in its current form it’s turning out to be more hassle than just building a swarm manually.

Expected behavior

Docker should clean up veth interfaces when containers stop.

Actual behavior

Interfaces are left behind.

swarm-manager000000:~$ ifconfig | grep veth | wc -l
572
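
For what it’s worth, the same count can be taken with iproute2, which filters on device type rather than grepping interface names; this is just an alternative sketch, not part of the original report:

# Count veth devices directly, one device per line of output
ip -o link show type veth | wc -l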

Additional Information

Rebooting the host clears the interfaces, but that isn’t really a solution as it’s disruptive to other running containers.
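
As a hedged stop-gap (not something from the original report), individual leaked interfaces can in principle be removed with iproute2 instead of rebooting, assuming they really are orphaned; deleting the host end of a veth pair that still belongs to a running container will break that container’s networking.

# List veth devices to identify orphans
ip -o link show type veth
# Delete a specific orphaned interface; "veth1234abc" is a hypothetical example name
sudo ip link delete veth1234abc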

Steps to reproduce the behavior

  1. Start a service where the containers fail and keep getting restarted by the swarm manager. In my case this was because of port bindings that use --publish mode=host.
  2. Watch the number of interfaces rise (a monitoring sketch follows this list).
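
For step 2, something like the following can be used to watch the count over time (a minimal sketch; the 5-second interval is arbitrary):

# Re-run the count every 5 seconds to see whether it keeps climbing
watch -n 5 'ifconfig | grep -c veth'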

Having said that, I am running the “edge” channel because I wanted to test the “Cloudstor” plugin. I don’t know whether this bug also exists on the “stable” channel.

@markvr Thanks for the feedback so far. The veth leak is most likely a bug in the docker engine rather than something specific to Docker4Azure. I am still investigating it with other members of the team and will get back to you with our findings.

The logger issue should be fixed in the recent patch release for the stable channel (17.03.1) (https://download.docker.com/azure/stable/17.03.1/Docker.tmpl). The docs are in the process of being updated to announce the release. In the edge channel, the fix will be available as part of the April (17.04) release.

@markvr Can you please indicate the exact command line you used to spin up the service where the tasks kept failing?

I tried something along the lines of the following, pinning everything to a single (the current) node:

docker service create --replicas 20 --constraint 'node.id == 9n2vp2nwq4yho642w5eib42h6' --publish mode=host,target=80,published=8080,protocol=tcp --name ping2 ddebroy/print:ddr1

What I find is that while the number of veths does rise (to around 30), it stays within a bounded range (between 20 and 30) and interfaces do get cleaned up (dropping back to 20). So the count does not rise monotonically, which would be the symptom of a leak. This makes sense, since docker service keeps trying to respawn the failed tasks (which you can see under docker service ps) because it isn’t aware that the task specification is invalid. If you remove the service (docker service rm), all of the veths are cleared and the count falls back to what it was before the service was kicked off.
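
For anyone following along, the checks described above look roughly like this, assuming the service is named ping2 as in the earlier command:

# Show the task history, including failed tasks being respawned
docker service ps ping2
# Removing the service stops the respawn loop and releases the veths
docker service rm ping2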

I think this behavior is by design and not a bug, assuming what I tried above is similar to what you were trying. Please let us know what you think.

Hmmm, it sounds like the interfaces were cleaned up as expected in your test.

I’m afraid I can’t remember the exact command; the interfaces might have built up over a period of time and I only happened to notice them. This has happened before, and I raised it on the GitHub tracker (https://github.com/docker/for-azure/issues/11) with a diagnostic session, which might help.

I’ve decided not to use “docker-for-azure” for our project. IMO there are still too many bugs and issues with it, and there has been no response to any of the bugs raised by me or others on the issue tracker. Combined with it not being open source, this means I’ve spent significant time trying to reverse-engineer what is going on, e.g. why bind mounts don’t work, and how the logging behaves.

We are also trying to migrate an existing legacy system to it, and we have too many edge cases (e.g. existing network topologies) that we need to work with. I have had to customise the Azure template in various ways to fit these requirements, and in the end decided it would be easier to simply build a swarm from scratch, so that I know exactly how it is configured.