Hi all.
I have a 20-node swarm (3 managers, 17 workers) on 64-bit Ubuntu 14.04, with Docker engine 17.04.0-ce on all nodes. I'm having connectivity issues between nodes whenever services run with the default --endpoint-mode vip. I see it on every service I deploy. This is a new swarm that we're just starting QA testing on.
A few thoughts first:
- We use docker deploy with compose files, which is "experimental", so we have --experimental turned on in /etc/default/docker. I'm about to test deploying a service with experimental features turned off on some nodes to see if the same behavior shows up. We deploy the individual services with docker deploy -c docker-compose.yaml dev --with-registry-auth.
- I can't find any way to debug the gossip protocol that the swarm nodes use to communicate with each other. It sure would be cool if I could see any errors the swarm networking layer is hitting; the best idea I have so far is daemon debug logging (see the DOCKER_OPTS sketch below).
- This behavior only seems to occur with the default --endpoint-mode vip, but since VIP mode is required for ingress routing, that presents a problem for our ingress into the swarm from haproxy.
Our DOCKER_OPTS from /etc/default/docker:
DOCKER_OPTS="\
-H unix:///var/run/docker.sock --ip-forward=true --iptables=true --ip-masq=true --bip=172.31.0.1/16 --experimental"
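On the gossip/overlay visibility point above: the only thing I've come up with so far is turning on daemon debug logging and watching the engine log while the drops happen. This is just a sketch of what I plan to try, not something I've confirmed helps (-D is the daemon debug flag; on Ubuntu 14.04/upstart the log should end up in /var/log/upstart/docker.log):

DOCKER_OPTS="\
-H unix:///var/run/docker.sock --ip-forward=true --iptables=true --ip-masq=true --bip=172.31.0.1/16 --experimental -D"

# then tail the engine log on the affected nodes while reproducing the drops
tail -f /var/log/upstart/docker.log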
Example 1:
Elasticsearch across 3 worker nodes, using placement constraints so each service "sticks" to its node. Our docker-compose.yaml (showing only the first elasticsearch service; there are two more that I omitted for brevity):
version: '3'
services:
  elasticsearch1:
    image: mirror/elasticsearch:5.2.2
    command: elasticsearch -Enetwork.host=0.0.0.0 -Ediscovery.zen.ping.unicast.hosts=dev_elasticsearch1,dev_elasticsearch2,dev_elasticsearch3 -Ecluster.name=dev-cluster -Expack.security.enabled=false -Ediscovery.zen.minimum_master_nodes=2 -Enode.master=true -Enode.name=dev_elasticsearch1
    deploy:
      mode: replicated
      replicas: 1
      labels: [APP=elasticsearch]
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
      placement:
        constraints: [ 'engine.labels.stage==dev', 'engine.labels.node==01' ]
      resources:
        limits:
          memory: 1G
    environment:
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - VIRTUAL_HOST=esproxy
      - VIRTUAL_PORT=9200
    ulimits:
      memlock: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - esdata1:/usr/share/elasticsearch/data
    networks:
      - elasticsearch-backend
    logging:
      driver: "gelf"
      options:
        gelf-address: "udp://graylog:12201"
        tag: 'elasticsearch'
  esproxy:
    image: mirror/jwilder/nginx-proxy
    networks:
      - elasticsearch-backend
      - elasticsearch
    volumes:
      - /var/run/docker.sock:/tmp/docker.sock:ro
    deploy:
      mode: replicated
      replicas: 3
      labels: [APP=esproxy]
      placement:
        constraints: [ 'engine.labels.stage==dev' ]
    logging:
      driver: "gelf"
      options:
        gelf-address: "udp://graylog:12201"
        tag: 'esproxy'
volumes:
  esdata1:
  esdata2:
  esdata3:
networks:
  elasticsearch:
    driver: overlay
  elasticsearch-backend:
    driver: overlay
What happens is that the initial zen discovery succeeds, but at some point the nodes start getting timeouts, zen pings fail, and in general the containers lose connectivity to each other. I've gone into the running containers and run connectivity tests, which show intermittent connectivity problems.
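Roughly the kind of checks I ran from inside one of the elasticsearch containers (service names are from our dev stack, the container ID is a placeholder, and the exact tools depend on what's in the image):

# open a shell in one of the running elasticsearch tasks
docker exec -it <elasticsearch1-container-id> bash
# the service name should resolve to the VIP in vip mode
getent hosts dev_elasticsearch2
# tasks.<service> should resolve to the individual task IPs
getent hosts tasks.dev_elasticsearch2
# repeated probes of the HTTP port fail intermittently
curl -s -m 2 http://dev_elasticsearch2:9200/_cat/health || echo "timed out"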
The fix for elasticsearch was to update each elasticsearch service to --endpoint-mode dnsrr. Elasticsearch has worked fine since, so we figured it was just an odd one-off behavior.
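For reference, this is roughly what we ran for each of the three elasticsearch services, plus the inspect we used to confirm the mode actually changed (service names are from our dev stack):

docker service update --endpoint-mode dnsrr dev_elasticsearch1
docker service inspect --format '{{ .Spec.EndpointSpec.Mode }}' dev_elasticsearch1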
Then, when we started testing the microservices inside the swarm, we ran into nearly identical problems: hitting any swarm service gave the same kind of intermittent connectivity failures.
Example 2:
version: '3'
services:
  web:
    logging:
      driver: gelf
      options:
        gelf-address: 'udp://graylog:12201'
        tag: btw-web
    image: mirror/web:develop
    ports: ['8080:80']
    networks:
      frontend:
    deploy:
      mode: global
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: continue
        monitor: 60s
        max_failure_ratio: 0.3
      placement:
        constraints:
          - 'engine.labels.stage == dev'
          - 'engine.labels.role == web'
  api:
    logging:
      driver: gelf
      options:
        gelf-address: 'udp://graylog:12201'
        tag: btw-buyer-api
    image: 'mirror/api:develop'
    networks:
      frontend:
      elasticsearch:
    deploy:
      mode: replicated
      replicas: 2
      restart_policy:
        condition: on-failure
        delay: 5s
        window: 30s
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: continue
        monitor: 60s
        max_failure_ratio: 0.3
      placement:
        constraints:
          - 'engine.labels.stage == dev'
          - 'engine.labels.role == web'
networks:
  frontend:
  elasticsearch:
The web service above is our internal router to multiple backend APIs (only one of which I'm showing). There's an haproxy in front of the web router that we use for ingress balancing across the swarm nodes. The haproxy health check hits a static page on the web server, and we see continuous, random health-check timeouts against the web router service - basically the same behavior as the elasticsearch cluster. Swapping the web container for one that simply answers every request with a 200 changed nothing.
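To make the flakiness concrete, this is roughly what the haproxy health check amounts to, run by hand from the haproxy host against the port published by the web service (node IP and page path are placeholders for our real values):

# probe the ingress-published port on one swarm node once a second;
# the intermittent failures show up as 000 / hitting the 2s timeout
while true; do
  curl -s -m 2 -o /dev/null -w '%{http_code} %{time_total}s\n' http://<node-ip>:8080/<static-page>
  sleep 1
done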
Thoughts?