Docker Swarm issues on one node

I have a couple of issues using Docker Swarm.
I am new to the Docker world, and even more so to Swarm, so the issues may just come down to my insufficient knowledge.

I am using Portainer as the front end, i.e. all configuration is done via Portainer, mostly via Stacks.

Setup is:

  1. Dell Precision 5820 server with Proxmox. Two Ubuntu Server (22.04) VMs: one VM is the Swarm manager (and runs Portainer), and the second Ubuntu Server is a worker running multiple interconnected containers, i.e. a cluster of containers. This one is called Worker1.
  2. Dell PowerEdge R720 server with Proxmox. It also has one Ubuntu Server VM running as a worker (Worker2).

Initially I made a standalone Docker setup and everything worked fine, so everything ran on the Precision server.
Most of the containers depend on hardware, i.e. use disks and such, but the one container which does all the heavy processing is not hardware dependent (stateless?).

I thought that Docker Swarm mode would be good for load balancing and for sharing the load with the second server (Dell PowerEdge).
So I created a Swarm and an overlay network across these two servers (VMs / workers).
I eventually got the setup to 'work', but I am having a couple of issues that I have not been able to solve.

Issue 1: All instances of this stateless container go only to Worker1. If I try to force the container onto Worker2 (Dell PowerEdge), it crashes with exit code 132. I set the CPU to host mode for both Proxmox / Ubuntu VMs. I also tried the 'x86-64-v2-AES' CPU type, but that was even worse (even the Portainer Agent was crashing).
So now all instances of this stateless container are running on Worker1 (Dell Precision).
I could not see anything wrong when running 'docker node inspect Worker2'.
Any ideas on how to get the second worker working?
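For reference, "forcing" the container onto Worker2 means a Swarm placement constraint in the stack file; a minimal sketch of what that looks like (the service name, image, and replica count are placeholders, not from my real stack):

```yaml
services:
  processor:                           # placeholder service name
    image: example/processor:latest    # placeholder image
    deploy:
      replicas: 2
      placement:
        constraints:
          - node.hostname == Worker2   # pin all replicas to Worker2
```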

Issue 2: This is a more cosmetic issue. When I restart the workers (VMs), a lot of containers fail: status Failed or Shutdown (when checking Services) on both workers. Checking the containers directly, the statuses are Exited and Created. But eventually all containers (and the right number of them) are started and working fine (only on Worker1).
I have defined 'depends_on' in the Stacks for the services/containers. Is there anything else that could be done to make the startup cleaner?

So basically I am in the same state as with the standalone configuration, except for the multiple instances of the one 'processing' container, but those are still on the same VM (so not really more processing capacity).

I don't think depends_on works for stacks. In a container and microservice world, each container should be able to deal with failure on its own, like reconnecting to a database.
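If the application can't reconnect on its own, Swarm's restart policy can at least keep retrying the container until its dependency is up. A minimal sketch of the relevant stack-file fragment (service and image names are placeholders):

```yaml
services:
  app:                          # placeholder service name
    image: example/app:latest   # placeholder image
    deploy:
      restart_policy:
        condition: on-failure   # restart whenever the container exits non-zero
        delay: 10s              # pause between attempts so dependencies can come up
        window: 30s             # how long the container must run to count as started
```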

Interesting that this comes up twice in one day. @bluepuma77 already answered one, and the other was about Compose, but that doesn't matter, so I'll leave it here. So you have to implement the waiting and reconnecting, or just let the container restart until it can finally connect.
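The "waiting" part can be as simple as an entrypoint wrapper that polls the dependency's TCP port before starting the real process. A sketch in bash (the host, port, and the use of bash's `/dev/tcp` are assumptions; a real image might need `nc` or similar instead):

```shell
#!/bin/bash
# Return success once host:port accepts a TCP connection, or fail after N tries.
wait_for() {
  local host="$1" port="$2" retries="${3:-30}" i=0
  until (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; do
    i=$((i + 1))
    if [ "$i" -ge "$retries" ]; then
      echo "gave up waiting for $host:$port" >&2
      return 1
    fi
    sleep 2
  done
}

# When used as an entrypoint: wait for the dependency (names are examples),
# then replace this shell with the real application so signals reach it.
if [ "$#" -gt 0 ]; then
  wait_for "${WAIT_HOST:-nats}" "${WAIT_PORT:-4222}" || exit 1
  exec "$@"
fi
```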

Regarding the first question

That is not really what stateless means. The hardware is not the state of the application; the handled data is.
Everything is hardware dependent, by the way. The question is just how much, but you will not run an amd64 application on an ARM CPU without emulation.

I could return that exit code any time from a shell script, so it depends on what application you run, but it looks like it is usually a CPU compatibility issue, which you already suspected, as you tried different CPU settings for the Proxmox VMs. It is still not something we can solve without knowing the application, and even then there is no guarantee we will have an idea. Maybe you can contact the developers of the application, or a community that supports it, and ask what CPU requirements it has.
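To illustrate the exit-code arithmetic: shells report "killed by signal N" as exit status 128 + N, and 132 − 128 = 4, which is SIGILL (illegal instruction), the typical symptom of a binary using CPU instructions the host does not provide. A quick demonstration:

```shell
# A child process that kills itself with SIGILL...
sh -c 'kill -s ILL $$'
status=$?

# ...is reported by the parent shell as 128 + 4 = 132.
echo "exit status: $status"
```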

On the other hand, the fact that the container is not scheduled automatically on that node could mean that even Docker can tell it wouldn't work there. So I would check the difference between the architectures of the VMs and any other hardware differences.
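One way to do that comparison (assuming Linux guests; the node names and file names are just examples):

```shell
# On the Swarm manager, what Docker itself reports per node:
#   docker node inspect Worker2 \
#     --format '{{.Description.Platform.OS}}/{{.Description.Platform.Architecture}}'

# On each worker VM, dump the CPU feature flags the guest actually sees:
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | sort -u > "flags.$(hostname)"

# Copy both files to one machine; flags present on Worker1 but missing on
# Worker2 (e.g. avx, sse4_2) are prime suspects for SIGILL / exit code 132:
#   comm -23 flags.Worker1 flags.Worker2
```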

This topic might also be related to the issue:

Thanks for the responses!
I am just an end user of these Docker containers/images, so I have no control over how they work.

This one service (container) that I tried to spread over two nodes does not use any volumes.
Communication between the containers happens mostly via NATS. I can sometimes see NATS-related errors:
2024-06-05T21:10:13.955099Z INFO async_nats: event: disconnected
2024-06-05T21:10:13.967721Z INFO async_nats: event: client error: nats: IO error
2024-06-05T21:10:18.978260Z INFO async_nats: event: client error: nats: timed out
2024-06-05T21:10:18.980557Z INFO async_nats: event: client error: nats: timed out
Those did not happen in the standalone deployment.

I’ll try to contact developers of this solution.

If I cannot get this Swarm deployment working better, I will go back to the standalone setup.

You can always add your own entrypoint by creating a new image based on the original image. Just a short Dockerfile which copies an entrypoint script and sets the ENTRYPOINT instruction.
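A minimal sketch of that wrapper image (the base image and script name are placeholders):

```dockerfile
# Hypothetical wrapper around the original image.
FROM example/original-image:latest

COPY docker-entrypoint.sh /usr/local/bin/docker-entrypoint.sh
RUN chmod +x /usr/local/bin/docker-entrypoint.sh

# The script should end with `exec "$@"` (or exec of the original command)
# so the application, not the shell, receives stop signals from Docker.
ENTRYPOINT ["/usr/local/bin/docker-entrypoint.sh"]
```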

I was able to solve the issues. Rebuilding the crashing container/image solved the problem.
I also found instructions on how to make NATS work better in a Swarm deployment.

The only setback is that my Portainer is a bit messed up. I made so many changes / reinstallations that it won't show information correctly. Even reinstalling Portainer did not help.

Edit: I also got Portainer to work by cleaning out everything possible and then installing everything from scratch.
Now during startup there are only a few failures, so I won't try to optimize the startup sequence of the containers.
