Connectivity problems with swarm mode service endpoint-mode vip

Hi all.

I have a 20-node swarm setup (3 managers, 17 workers) on 64-bit Ubuntu 14.04, with Docker Engine 17.04.0-ce on all nodes. I’m having connectivity issues between nodes with --endpoint-mode vip, and I see it on every service I deploy. This is a new swarm that we’re starting QA testing in.

A few thoughts first:

  1. We use docker deploy with compose files, which is “experimental”, so we have --experimental turned on in /etc/default/docker. I am about to test deploying a service with experimental features turned off on some nodes to see if the same behavior is exhibited. We use docker deploy -c docker-compose.yaml dev --with-registry-auth to deploy the individual services.
  2. I can’t find any way to debug the gossip protocol that the swarm nodes use to communicate with each other. It sure would be cool if I could see any errors the swarm networking is hitting - the best I’ve come up with is daemon debug logging (see the sketch after this list).
  3. This behavior only seems to happen with the default --endpoint-mode vip, but seeing as vip mode is REQUIRED for ingress routing, that presents a problem for our ingress into the swarm from haproxy.
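
For what it’s worth, the only lever I know of on point 2 is turning on the daemon’s debug logging and grepping the engine log for overlay/gossip messages. A rough sketch of what I plan to try (assuming the stock upstart setup on Ubuntu 14.04, so the log path may differ on other distros):

# Append -D (debug) to DOCKER_OPTS in /etc/default/docker, then:
sudo service docker restart

# On upstart the engine log lands here; look for overlay / networkdb /
# memberlist messages around the time connectivity drops.
sudo grep -iE 'overlay|networkdb|memberlist|gossip' /var/log/upstart/docker.log | tail -n 50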

Our DOCKER_OPTS from /etc/default/docker:

DOCKER_OPTS="\
 -H unix:///var/run/docker.sock --ip-forward=true --iptables=true --ip-masq=true --bip=172.31.0.1/16 --experimental"

Example 1:

Elasticsearch across 3 worker nodes, using constraints to ensure each service “sticks” to its node - our docker-compose.yaml (only showing the first elasticsearch service; there are two more I omitted for brevity).

version: '3'
services:
  elasticsearch1:
    image: mirror/elasticsearch:5.2.2
    command: elasticsearch -Enetwork.host=0.0.0.0 -Ediscovery.zen.ping.unicast.hosts=dev_elasticsearch1,dev_elasticsearch2,dev_elasticsearch3 -Ecluster.name=dev-cluster -Expack.security.enabled=false -Ediscovery.zen.minimum_master_nodes=2 -Enode.master=true -Enode.name=dev_elasticsearch1
    deploy:
      mode: replicated
      replicas: 1
      labels: [APP=elasticsearch]
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
      placement:
        constraints: [ 'engine.labels.stage==dev','engine.labels.node==01' ]
      resources:
        limits:
          memory: 1G
    environment:
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - VIRTUAL_HOST=esproxy
      - VIRTUAL_PORT=9200
    ulimits:
      memlock: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - esdata1:/usr/share/elasticsearch/data
    networks:
      - elasticsearch-backend
    logging:
      driver: "gelf"
      options:
        gelf-address: "udp://graylog:12201"
        tag: 'elasticsearch'
  esproxy:
    image: mirror/jwilder/nginx-proxy
    networks:
      - elasticsearch-backend
      - elasticsearch
    volumes:
      - /var/run/docker.sock:/tmp/docker.sock:ro
    deploy:
      mode: replicated
      replicas: 3
      labels: [APP=esproxy]
      placement:
        constraints: [ 'engine.labels.stage==dev' ]
    logging:
      driver: "gelf"
      options:
        gelf-address: "udp://graylog:12201"
        tag: 'esproxy'
volumes:
  esdata1:
  esdata2:
  esdata3:
networks:
  elasticsearch:
    driver: overlay
  elasticsearch-backend:
    driver: overlay

What happens is that the initial zen discovery works, but at some point the nodes start getting timeouts, zen pings fail, and in general the containers lose connectivity to each other. I’ve gone into the running containers and done connectivity tests, which show intermittent connectivity problems.
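
The checks I’ve been running from inside one of the elasticsearch tasks look roughly like this (nslookup/curl availability and the exact names depend on the image and stack, so treat it as a sketch):

# the service name resolves to the VIP under the default endpoint-mode vip
nslookup dev_elasticsearch2

# tasks.<service> resolves to the individual task IPs, bypassing the VIP
nslookup tasks.dev_elasticsearch2

# compare going through the VIP vs hitting a task IP from the tasks. lookup
curl -sS --max-time 5 http://dev_elasticsearch2:9200/
curl -sS --max-time 5 http://<task-ip>:9200/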

The fix for elasticsearch was to update each elasticsearch service to --endpoint-mode dnsrr. Elasticsearch works fine after that, so we figured it was just an odd behavior.
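
Concretely that was just a service update per elasticsearch service, roughly as follows (or re-creating the services with --endpoint-mode dnsrr if your engine version refuses to change it on update):

docker service update --endpoint-mode dnsrr dev_elasticsearch1
docker service update --endpoint-mode dnsrr dev_elasticsearch2
docker service update --endpoint-mode dnsrr dev_elasticsearch3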

Then when we started testing the microservices inside the swarm, we ran into nearly identical issues. When hitting any swarm service, we would see intermittent connection failures.
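
The intermittency is easy to reproduce with a simple loop from another container on the same overlay network (the service name, port, and path here are placeholders for one of our APIs):

# tally HTTP status codes over 100 requests to the service VIP;
# "000" entries are curl timeouts / failed connections
for i in $(seq 1 100); do
  curl -s -o /dev/null -w '%{http_code}\n' --max-time 3 http://dev_api:8080/health
done | sort | uniq -c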

Example 2:

version: '3'
services:
  web:
    logging:
      driver: gelf
      options:
        gelf-address: 'udp://graylog:12201'
        tag: btw-web
    image: mirror/web:develop
    ports: ['8080:80']
    networks:
      frontend:
    deploy:
      mode: global
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: continue
        monitor: 60s
        max_failure_ratio: 0.3
      placement:
        constraints:
          - 'engine.labels.stage == dev'
          - 'engine.labels.role == web'
  api:
    logging:
      driver: gelf
      options:
        gelf-address: 'udp://graylog:12201'
        tag: btw-buyer-api
    image: 'mirror/api:develop'
    networks:
      frontend:
      elasticsearch:
    deploy:
      mode: replicated
      replicas: 2
      restart_policy:
        condition: on-failure
        delay: 5s
        window: 30s
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: continue
        monitor: 60s
        max_failure_ratio: 0.3
      placement:
        constraints:
          - 'engine.labels.stage == dev'
          - 'engine.labels.role == web'
networks:
  frontend:
  elasticsearch:

The web container above is our internal router to multiple backend APIs (only one of which I show). There’s an haproxy in front of the web router that we use for ingress balancing across the swarm nodes. The haproxy health check is an HTTP check against a static page on the webserver, and we see continuous random health-check timeouts to the web router service - basically the same behavior as the elasticsearch cluster. Changing the web container to a static “answer with a 200” container made no difference.
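
For context, the relevant haproxy backend looks roughly like this (node IPs, server names, and the check path are placeholders, not our exact config):

backend web_swarm
    balance roundrobin
    option httpchk GET /index.html
    # each server line points at the published port 8080 on a swarm node,
    # so traffic enters through the ingress routing mesh
    server swarm-node-01 10.0.0.11:8080 check inter 5s fall 3 rise 2
    server swarm-node-02 10.0.0.12:8080 check inter 5s fall 3 rise 2
    server swarm-node-03 10.0.0.13:8080 check inter 5s fall 3 rise 2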

Thoughts?


We have the same issue. At first we thought it was an Elasticsearch issue, so we upgraded from 2.4 to 5.4, but we see the same thing.