Docker swarm nodes unable to access network

I have a swarm with two nodes on it. I’m able to deploy my stack to the swarm, but I’m having trouble with one of my services that requires network access.

The compose section for the service looks like this:

  # Icecream daemon 
  # Allows the host to be used as a build node for the scheduler
  icecc-daemon:
    image: git.example.com/devops/docker-services/icecc-daemon
    build: 
      context: ./
      dockerfile: services/icecc-daemon.dockerfile    
    restart: unless-stopped
    ports:
      - "8766:8766"
      - "10245:10245"
    depends_on:
      - "icecc-scheduler"
    deploy:
      mode: global

and its Dockerfile looks like this:

# Use dev container because it already has cross compilers
FROM ubuntu:focal

RUN apt-get update \
    && apt-get install -y \
        icecc \
        build-essential \
        libncurses-dev \
        libssl-dev \
        libelf-dev \
        libudev-dev \
        libpci-dev \
        libiberty-dev \
    && apt-get autoclean \
    && rm -rf \
        /var/lib/apt/lists/* \
        /var/tmp/* \
        /tmp/*

EXPOSE 10245
EXPOSE 8766

ENTRYPOINT [ "iceccd", "-vvv", "-n", "focal" ] 

After deploying the stack, two replicas of this service are created, as expected, yet only the replica on the manager machine is able to connect to the scheduler (another service in the stack).

The service running on the worker node gives this error in its logs:

build-farm_icecc-daemon.0.ow5u9mxeqqsr@st12873    | [1] 2022-07-20 15:06:24: open_send_broadcast sendto(Error: Operation not permitted)
build-farm_icecc-daemon.0.ow5u9mxeqqsr@st12873    | [1] 2022-07-20 15:06:24: broadcast eth2 172.31.255.255
build-farm_icecc-daemon.0.ow5u9mxeqqsr@st12873    | [1] 2022-07-20 15:06:24: broadcast eth0 10.0.4.255
build-farm_icecc-daemon.0.ow5u9mxeqqsr@st12873    | [1] 2022-07-20 15:06:24: scheduler not yet found/selected.

It’s acting like it can’t get access to the host machine’s network, which doesn’t make sense. How can the replica on the manager access the network while the one on the worker can’t?

It seems like some functionality of the application inside the container needs one or more capabilities that are not available to the container. Usually they can be found in the documentation of the application (or identified with tools) and need to be added to the compose file using “cap_add”.
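One way to narrow the required capabilities down experimentally is to drop everything and add candidates back one at a time. A sketch using your service (the capability here is just a starting guess based on the broadcast error in your log):

```yaml
  icecc-daemon:
    image: git.example.com/devops/docker-services/icecc-daemon
    cap_drop:
      - ALL            # start from an empty capability set
    cap_add:
      - NET_BROADCAST  # candidate: the log shows a failing UDP broadcast
```

If the service works with a given set, remove entries until it stops working; the last set that still works is your minimal set.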

Since docker-ce 20.10.0, capabilities are supported with swarm deployments:

Please ignore that the compose file version 3 reference claims cap_add is not available for swarm deployments, it is just an inconsistency where the documentation did not catch up with the implementation :slight_smile:

I guess it might work with these capabilities added:

  # Icecream daemon 
  # Allows the host to be used as a build node for the scheduler
  icecc-daemon:
    image: git.example.com/devops/docker-services/icecc-daemon
    build: 
      context: ./
      dockerfile: services/icecc-daemon.dockerfile    
    restart: unless-stopped
    ports:
      - "8766:8766"
      - "10245:10245"
    depends_on:
      - "icecc-scheduler"
    deploy:
      mode: global
    cap_add:
      - NET_ADMIN
      - NET_BROADCAST

If it works, try removing one of the two capabilities to see whether it still works with just one of them, or whether it requires both. It might also require additional capabilities.

Make sure you use version “3.8” or “3.9” for your compose file, as the compose file version 3 reference does not indicate for which schema versions this configuration element is valid in swarm stack deployments.

I applied those capabilities, and now when I deploy, the daemon only deploys as 1/1 rather than 2/2 (there are still two nodes in the swarm). I have to make the worker node leave and rejoin for it to be detected and deployed to, but after a few seconds it goes back to 1/1.

It would be weird if cap_add would influence the scheduler…

You might want to share your full compose file so that we can see if there is a configuration that might explain it.
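In the meantime, the docker CLI can usually tell you why a task is not being scheduled on the worker (the stack and service names below are taken from your log output):

```shell
# List all tasks of the daemon service, including failed ones,
# with untruncated error messages in the ERROR column:
docker service ps --no-trunc build-farm_icecc-daemon

# Confirm the worker node is still Ready and Active:
docker node ls
```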

My full compose file

version: "3.9" 

services:
  # Icecream build acceleration scheduler
  icecc-scheduler:
    image: git.example.com:8444/devops/docker-services/icecc-scheduler
    build: services/icecc-scheduler 
    ports:
      - "8765:8765"
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints: [node.role == manager]

  # Icecream daemon 
  # Allows the host to be used as a build node for the scheduler
  icecc-daemon:
    image: git.example.com:8444/devops/docker-services/icecc-daemon
    build: services/icecc-daemon    
    ports:
      - "8766:8766"
      - "10245:10245"
    depends_on:
      - "icecc-scheduler"
    deploy:
      mode: global
    cap_add:
      - NET_RAW
      - NET_ADMIN
      - NET_BROADCAST
      - NET_BIND_SERVICE

Since you didn’t specify a network, swarm should create a swarm-scoped default network (by default they are called {stack name}_default).
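If you prefer the network to be explicit rather than implicit, you can declare an overlay network yourself and attach both services to it; a sketch (the network name is arbitrary):

```yaml
services:
  icecc-scheduler:
    networks:
      - buildnet
  icecc-daemon:
    networks:
      - buildnet

networks:
  buildnet:
    driver: overlay
```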

I have no idea which capabilities you use and how each of them affects the host kernel, but you will want to apply least privilege here and only use the capabilities that are necessary.

I could understand it if you used the host network for the services and the capabilities resulted in the host interface being modified from the container… but it doesn’t seem like this is the case.

That said: I have no idea why you experience what you experience; it doesn’t make sense to me that you experience it at all.

The first time I deployed I used the --with-registry-auth flag, but I had forgotten it this time. After adding that flag back I’m still getting the same error of it not being able to connect to the scheduler.

The recommended way to use the container is to enable host mode networking (network_mode: host), but as far as I can tell swarm doesn’t support that.

Indeed, not like that. You need to declare a network like this and use it in your service:

...
services:
  myservice:
    networks:
       hostnet: {}  
    ...
...
networks:
  hostnet:
    name: host
    external: true

Though, swarm containers do not support privileged mode, and they never will. Even

...
    cap_add:
      - ALL
...

will not result in the same capabilities a privileged container has.

I can’t really help you with your problem other than pointing you in the direction that it is caused by missing capabilities, and that you need to add them to the container.


I fixed the issue by using the long syntax for ports and adding the NET_BROADCAST and NET_ADMIN capabilities

    ports:
      - target: 10245
        published: 10245
        protocol: tcp
        mode: host
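For anyone landing here later, the full working service definition would then look roughly like this (assembled from the snippets in this thread; the second port is treated the same way by assumption):

```yaml
  icecc-daemon:
    image: git.example.com:8444/devops/docker-services/icecc-daemon
    build: services/icecc-daemon
    ports:
      - target: 10245
        published: 10245
        protocol: tcp
        mode: host   # publish on the node's own interface, not the ingress mesh
      - target: 8766
        published: 8766
        protocol: tcp
        mode: host
    depends_on:
      - "icecc-scheduler"
    deploy:
      mode: global
    cap_add:
      - NET_ADMIN
      - NET_BROADCAST
```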

It’s kind of odd that publishing the port as a host port helps with broadcast messages, but hey: whatever works :slight_smile:

Thank you for reporting back how you finally configured it.