Docker swarm nodes unable to access network

I have a swarm with two nodes on it. I’m able to deploy my stack to the swarm, but I’m having trouble with one of my services that requires network access.

The compose section for the service looks like this:

  # Icecream daemon 
  # Allows the host to be used as a build node for the scheduler
  icecc-daemon:
    image: git.example.com/devops/docker-services/icecc-daemon
    build: 
      context: ./
      dockerfile: services/icecc-daemon.dockerfile    
    restart: unless-stopped
    ports:
      - "8766:8766"
      - "10245:10245"
    depends_on:
      - "icecc-scheduler"
    deploy:
      mode: global

and its Dockerfile looks like this:

# Use dev container because it already has cross compilers
FROM ubuntu:focal

RUN apt-get update \
    && apt-get install -y \
        icecc \
        build-essential \
        libncurses-dev \
        libssl-dev \
        libelf-dev \
        libudev-dev \
        libpci-dev \
        libiberty-dev \
    && apt-get autoclean \
    && rm -rf \
        /var/lib/apt/lists/* \
        /var/tmp/* \
        /tmp/*

EXPOSE 10245
EXPOSE 8766

ENTRYPOINT [ "iceccd", "-vvv", "-n", "focal" ] 

After deploying the stack, two replicas of this service are created, as expected, yet only the replica on the manager machine is able to connect to the scheduler (another service in the stack).

The service running on the worker node gives this error in its logs:

build-farm_icecc-daemon.0.ow5u9mxeqqsr@st12873    | [1] 2022-07-20 15:06:24: open_send_broadcast sendto(Error: Operation not permitted)
build-farm_icecc-daemon.0.ow5u9mxeqqsr@st12873    | [1] 2022-07-20 15:06:24: broadcast eth2 172.31.255.255
build-farm_icecc-daemon.0.ow5u9mxeqqsr@st12873    | [1] 2022-07-20 15:06:24: broadcast eth0 10.0.4.255
build-farm_icecc-daemon.0.ow5u9mxeqqsr@st12873    | [1] 2022-07-20 15:06:24: scheduler not yet found/selected.

It’s acting like it can’t get access to the host machine’s network, which doesn’t make sense. How can the replica on the manager access the network while the one on the worker can’t?

It seems like some functionality of the application inside the container needs one or more capabilities that are not available to the container. Usually they can be found in the documentation of the application (or identified with tools) and need to be added to the compose file using “cap_add”.
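One way to narrow the required capabilities down experimentally is to drop everything and add candidates back one at a time. A sketch using your service (the capability here is just a starting guess based on the broadcast error in your log):

```yaml
  icecc-daemon:
    image: git.example.com/devops/docker-services/icecc-daemon
    cap_drop:
      - ALL            # start from an empty capability set
    cap_add:
      - NET_BROADCAST  # candidate: the log shows a failing UDP broadcast
```

If the service works with a given set, remove entries until it stops working; the last set that still works is your minimal set.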

Since docker-ce 20.10.0, capabilities are supported with swarm deployments:

Please ignore that the compose file version 3 reference claims cap_add is not available for swarm deployments, it is just an inconsistency where the documentation did not catch up with the implementation :slight_smile:

I guess it might work with these capabilities added:

  # Icecream daemon 
  # Allows the host to be used as a build node for the scheduler
  icecc-daemon:
    image: git.example.com/devops/docker-services/icecc-daemon
    build: 
      context: ./
      dockerfile: services/icecc-daemon.dockerfile    
    restart: unless-stopped
    ports:
      - "8766:8766"
      - "10245:10245"
    depends_on:
      - "icecc-scheduler"
    deploy:
      mode: global
    cap_add:
      - NET_ADMIN
      - NET_BROADCAST

If it works, try removing one of the two capabilities to see whether it still works with just one of them, or whether it requires both. It might also require additional capabilities.

Make sure you use version “3.8” or “3.9” for your compose file, as the compose file version 3 reference does not indicate for which schema versions this configuration element is valid in swarm stack deployments.

I applied those capabilities, and now when I deploy, the daemon only deploys as 1/1 rather than 2/2 (there are still two nodes in the swarm). I have to make the worker node leave and rejoin for it to be detected and deployed to, but after a few seconds it goes back to 1/1.

It would be weird if cap_add would influence the scheduler…

You might want to share your full compose file so that we can see if there is a configuration that might explain it.
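In the meantime, the docker CLI can usually tell you why a task is not being scheduled on the worker (the stack and service names below are taken from your log output):

```shell
# List all tasks of the daemon service, including failed ones,
# with untruncated error messages in the ERROR column:
docker service ps --no-trunc build-farm_icecc-daemon

# Confirm the worker node is still Ready and Active:
docker node ls
```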

My full compose file

version: "3.9" 

services:
  # Icecream build acceleration scheduler
  icecc-scheduler:
    image: git.example.com:8444/devops/docker-services/icecc-scheduler
    build: services/icecc-scheduler 
    ports:
      - "8765:8765"
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints: [node.role == manager]

  # Icecream daemon 
  # Allows the host to be used as a build node for the scheduler
  icecc-daemon:
    image: git.example.com:8444/devops/docker-services/icecc-daemon
    build: services/icecc-daemon    
    ports:
      - "8766:8766"
      - "10245:10245"
    depends_on:
      - "icecc-scheduler"
    deploy:
      mode: global
    cap_add:
      - NET_RAW
      - NET_ADMIN
      - NET_BROADCAST
      - NET_BIND_SERVICE

Since you didn’t specify a network, swarm should create a swarm-scoped default network (by default they are called {stack name}_default).
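If you prefer the network to be explicit rather than implicit, you can declare an overlay network yourself and attach both services to it; a sketch (the network name is arbitrary):

```yaml
services:
  icecc-scheduler:
    networks:
      - buildnet
  icecc-daemon:
    networks:
      - buildnet

networks:
  buildnet:
    driver: overlay
```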

I have no idea which capabilities you use and how each of them affects the host kernel, but you will want to apply least privilege here and only use the capabilities that are necessary.

I could understand it if you used the host network for the services and the capabilities resulted in the host interface being modified from the container… but it doesn’t seem like this is the case.

That said: I have no idea why you experience what you experience; it doesn’t make sense to me that you experience it at all.

The first time I deployed I used the --with-registry-auth flag, but I had forgotten it this time. After adding that flag back I’m still getting the same error of it not being able to connect to the scheduler.

The recommended way to use the container is to enable host mode networking (network_mode: host), but as far as I can tell swarm doesn’t support that.

Indeed, not like that. You need to declare a network like this and use it in your service:

...
services:
  myservice:
    networks:
       hostnet: {}  
    ...
...
networks:
  hostnet:
    name: host
    external: true

Though, swarm containers do not support privileged mode, and they never will. Even

...
    cap_add:
      - ALL
...

will not result in the same capabilities a privileged container has.

I can’t really help you with your problem other than pointing you in the direction that it is caused by missing capabilities, and that you need to add them to the container.


I fixed the issue by using the long syntax for ports and adding the NET_BROADCAST and NET_ADMIN capabilities

    ports:
      - target: 10245
        published: 10245
        protocol: tcp
        mode: host
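For anyone landing here later, the full working service definition would then look roughly like this (assembled from the snippets in this thread; the second port is treated the same way by assumption):

```yaml
  icecc-daemon:
    image: git.example.com:8444/devops/docker-services/icecc-daemon
    build: services/icecc-daemon
    ports:
      - target: 10245
        published: 10245
        protocol: tcp
        mode: host   # publish on the node's own interface, not the ingress mesh
      - target: 8766
        published: 8766
        protocol: tcp
        mode: host
    depends_on:
      - "icecc-scheduler"
    deploy:
      mode: global
    cap_add:
      - NET_ADMIN
      - NET_BROADCAST
```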

It’s kind of odd that publishing the port as a host port helps with broadcast messages, but hey: whatever works :slight_smile:

Thank you for reporting back how you finally configured it.