I have a 20-node swarm setup (3 managers, 17 workers) on 64-bit Ubuntu 14.04, with Docker Engine 17.04.0-ce on the nodes. I'm having connectivity issues between nodes with --endpoint-mode vip, and I see it on all the services I'm deploying. This is a new swarm that we're starting QA testing in.
A few thoughts first:
- We use docker deploy with compose files, which is "experimental", so we have --experimental turned on in /etc/default/docker. I am about to test deploying a service with experimental features turned off on some nodes to see if the same behavior is exhibited. We use docker deploy -c docker-compose.yaml dev --with-registry-auth to deploy the individual services.
- I can’t find any way to debug the gossip protocol that the swarm nodes use to communicate with each other. It sure would be cool if I could see any errors the swarm networking is seeing.
- This behavior seems to exist only with the default --endpoint-mode vip, but since VIP mode is REQUIRED for ingress routing, that presents a problem for our ingress into the swarm from haproxy.
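There's no dedicated gossip debugger that I know of, but the swarm control and data planes use well-known ports that can at least be probed between nodes by hand (a sketch; ports assume swarm defaults, and <other-node> is a placeholder for another node's address):

```shell
# Swarm node-to-node ports (defaults):
#   TCP 2377      cluster management (workers -> managers)
#   TCP/UDP 7946  gossip between all nodes
#   UDP 4789      VXLAN overlay data plane
nc -zv  <other-node> 7946    # TCP gossip reachable?
nc -zvu <other-node> 7946    # UDP gossip reachable?
# Watch overlay traffic on the wire while pinging across an overlay network:
tcpdump -ni eth0 'udp port 4789'
```

Running the daemon with debug logging (-D, or "debug": true in /etc/docker/daemon.json) may also surface overlay/gossip errors in the engine logs.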
Our DOCKER_OPTS from /etc/default/docker:
```shell
DOCKER_OPTS="-H unix:///var/run/docker.sock --ip-forward=true --iptables=true --ip-masq=true --bip=172.31.0.1/16 --experimental"
```
Elasticsearch runs across 3 worker nodes, using constraints to ensure each service "sticks" to its node. Our docker-compose.yaml (only showing the first elasticsearch service; I omitted the other two for brevity):
```yaml
version: '3'
services:
  elasticsearch1:
    image: mirror/elasticsearch:5.2.2
    command: elasticsearch -Enetwork.host=0.0.0.0 -Ediscovery.zen.ping.unicast.hosts=dev_elasticsearch1,dev_elasticsearch2,dev_elasticsearch3 -Ecluster.name=dev-cluster -Expack.security.enabled=false -Ediscovery.zen.minimum_master_nodes=2 -Enode.master=true -Enode.name=dev_elasticsearch1
    deploy:
      mode: replicated
      replicas: 1
      labels: [APP=elasticsearch]
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
      placement:
        constraints: [ 'engine.labels.stage==dev', 'engine.labels.node==01' ]
      resources:
        limits:
          memory: 1G
    environment:
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - VIRTUAL_HOST=esproxy
      - VIRTUAL_PORT=9200
    ulimits:
      memlock: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - esdata1:/usr/share/elasticsearch/data
    networks:
      - elasticsearch-backend
    logging:
      driver: "gelf"
      options:
        gelf-address: "udp://graylog:12201"
        tag: 'elasticsearch'
  esproxy:
    image: mirror/jwilder/nginx-proxy
    networks:
      - elasticsearch-backend
      - elasticsearch
    volumes:
      - /var/run/docker.sock:/tmp/docker.sock:ro
    deploy:
      mode: replicated
      replicas: 3
      labels: [APP=esproxy]
      placement:
        constraints: [ 'engine.labels.stage==dev' ]
    logging:
      driver: "gelf"
      options:
        gelf-address: "udp://graylog:12201"
        tag: 'esproxy'
volumes:
  esdata1:
  esdata2:
  esdata3:
networks:
  elasticsearch:
    driver: overlay
  elasticsearch-backend:
    driver: overlay
```
What happens is that the initial zen discovery succeeds, but at some point the nodes start getting timeouts, zen pings fail, and in general the containers lose connectivity to each other. I've gone into the running containers and run connectivity tests, which show intermittent connectivity problems.
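For the in-container connectivity tests, comparing the service VIP against the individual task IPs helps isolate whether the VIP load balancing is the problem (a sketch; service names come from the stack above, and tasks.<service> is swarm's built-in DNS name that returns task IPs instead of the VIP):

```shell
# Run inside one of the task containers on the overlay network:
nslookup dev_elasticsearch2          # VIP mode: resolves to the service's virtual IP
nslookup tasks.dev_elasticsearch2    # resolves to the individual task IPs, bypassing the VIP
ping -c 3 dev_elasticsearch2         # pings go through the VIP
ping -c 3 <task-ip>                  # <task-ip> = one address from the tasks. lookup;
                                     # if this stays stable while the VIP is flaky,
                                     # the VIP path is the suspect
```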
The fix for elasticsearch was to update each elasticsearch service with --endpoint-mode dnsrr. Elasticsearch works fine after that, so we figured it was just an odd behavior.
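The update itself was a one-liner per service, for example:

```shell
# Switch each ES service from the default VIP load balancing
# to DNS round-robin, so DNS returns task IPs directly.
docker service update --endpoint-mode dnsrr dev_elasticsearch1
docker service update --endpoint-mode dnsrr dev_elasticsearch2
docker service update --endpoint-mode dnsrr dev_elasticsearch3
```

Note this only works here because the elasticsearch services don't publish ports through the ingress network; as mentioned above, VIP mode is required for ingress routing.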
Then when we started testing the microservices inside the swarm, we ran into nearly identical issues. When hitting any swarm service, we would see intermittent connectivity issues.
```yaml
version: '3'
services:
  web:
    logging:
      driver: gelf
      options:
        gelf-address: 'udp://graylog:12201'
        tag: btw-web
    image: mirror/web:develop
    ports: ['8080:80']
    networks:
      frontend:
    deploy:
      mode: global
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: continue
        monitor: 60s
        max_failure_ratio: 0.3
      placement:
        constraints:
          - 'engine.labels.stage == dev'
          - 'engine.labels.role == web'
  api:
    logging:
      driver: gelf
      options:
        gelf-address: 'udp://graylog:12201'
        tag: btw-buyer-api
    image: 'mirror/api:develop'
    networks:
      frontend:
      elasticsearch:
    deploy:
      mode: replicated
      replicas: 2
      restart_policy:
        condition: on-failure
        delay: 5s
        window: 30s
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: continue
        monitor: 60s
        max_failure_ratio: 0.3
      placement:
        constraints:
          - 'engine.labels.stage == dev'
          - 'engine.labels.role == web'
networks:
  frontend:
  elasticsearch:
```
The web container above is our internal router to multiple backend APIs (only one of which is shown). There's an haproxy in front of the web router that we use for ingress balancing across swarm nodes. The haproxy health check hits a static page on the webserver, and we see continuous random health-check timeouts to the web router service: basically the same behavior as the elasticsearch cluster. Changing the web container to a static "answer with a 200" container made no difference.
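The intermittent timeouts can also be reproduced with haproxy taken out of the equation, by looping curl against the published port (a sketch; <node> is a placeholder for any swarm node's address):

```shell
# Hit the published port once a second and log status code + latency;
# curl prints status 000 when a request fails or times out.
while true; do
  curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' \
    --max-time 2 "http://<node>:8080/"
  sleep 1
done
```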