Monitoring Docker service states via Prometheus

ajnelson938 · August 28, 2024, 3:07pm

We’ve occasionally had a situation where instances of a service are unable to start due to a “no suitable node…” error. In some cases we’re alerted to this via some service-specific metric, but I’m hoping there’s a general metric we could use to alert on any service instance that has a desired state of X and a current state of Y.
I suppose another way would be to have a metric for the number of running instances of a service in the swarm vs. the number of desired service instances.
I don’t see anything like this baked into cAdvisor (also it wouldn’t really make sense since this is a service which doesn’t have a container - this would be more of a swarm-based metric I guess) so I’m wondering if it exists elsewhere.
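
The running-vs-desired comparison described above can be sketched outside of Docker's built-in metrics. This is a minimal sketch, assuming it runs on a manager node and that the REPLICAS column of `docker service ls` (for example `3/5`) is an acceptable source of truth. The metric name `swarm_service_replicas` and the helper function names are hypothetical and are not something Docker or cAdvisor exports.

```python
import re
import subprocess

# Matches the start of the REPLICAS column, e.g. "3/5" or "1/1 (max 1 per node)".
_REPLICAS_RE = re.compile(r"^(\d+)/(\d+)")


def parse_service_replicas(ls_output):
    """Parse lines of '<name> <running>/<desired>' into {name: (running, desired)}."""
    result = {}
    for line in ls_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, replicas = parts
        match = _REPLICAS_RE.match(replicas)
        if match:
            result[name] = (int(match.group(1)), int(match.group(2)))
    return result


def to_prometheus_text(replicas):
    """Render the counts in the Prometheus text exposition format.

    The metric name 'swarm_service_replicas' is a hypothetical choice.
    """
    lines = []
    for name, (running, desired) in sorted(replicas.items()):
        lines.append(
            'swarm_service_replicas{service="%s",state="running"} %d' % (name, running)
        )
        lines.append(
            'swarm_service_replicas{service="%s",state="desired"} %d' % (name, desired)
        )
    return "\n".join(lines) + "\n"


def collect():
    """Query the local Docker CLI; must run on a swarm manager node."""
    out = subprocess.run(
        ["docker", "service", "ls", "--format", "{{.Name}} {{.Replicas}}"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return to_prometheus_text(parse_service_replicas(out))
```

One possible way to use this is to write the output of `collect()` to a `.prom` file on a schedule and have node_exporter's textfile collector pick it up. An alert could then compare the two series, for instance with an expression along the lines of `swarm_service_replicas{state="running"} < ignoring(state) swarm_service_replicas{state="desired"}`.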

rimelek · August 28, 2024, 5:33pm

Note that I don’t use Swarm, but if that message appears as a log entry somewhere, you could create an alert based on that log entry.

ajnelson938 · August 29, 2024, 1:05pm

Thanks, but as I mentioned, I’m looking for Prometheus metrics specifically. I’ve since found out that Docker has its own metrics, and I’ve managed to set it up per this Prometheus doc: Docker Swarm | Prometheus
However I’m still looking for the metrics that would be useful to determine a service’s current state vs. its desired state.

meyay · August 29, 2024, 6:22pm

According to the <dockerswarm_sd_config> documentation there is no meta label for the metrics you are looking for.

The answer of @rimelek might not be what you want, but I am afraid it is what you need.

ajnelson938 · September 6, 2024, 3:47pm

I’m looking at using the swarm_manager_services_total metric, which keeps count of services by state (running, stopped, etc…) over time. The “running” state seems key: if it drops, there’s likely a problem. I’m still trying to figure out how useful other states like “orphaned” and “rejected” are as indications of problems.

rimelek · September 6, 2024, 8:20pm

Isn’t swarm_manager_services_total for just the number of services and not the number of replicas of a service?

I recommend joining this conversation on GitHub

github.com/moby/moby

Feature: Expose more service/replica level metrics

opened 05:28PM - 11 Nov 21 UTC

heapdavid

kind/enhancement area/swarm

Hi, We're currently monitoring docker swarm with prometheus and we're looking for a way to check for services not being fulfilled/replicas not running. We can see metrics such as `swarm_manager_services_total` and failed/rejected containers but would ideally love something like `swarm_manager_replicas_total` with e.g. `state=running` and `state=desired` labels for comparison/alerting. This could be a summation of the data presented under the REPLICAS column when one runs `docker service list` on the manager node.

It was not rejected, but there is no activity there, so maybe you could show that you are interested in it too.

ajnelson938 · September 6, 2024, 9:26pm

From my count it’s including all service instances (including replicas). That FR on GH is from 2021. Perhaps they updated the metric since then?

rimelek · September 6, 2024, 9:37pm

The request is open. Not all requests are implemented. And if you want the best source to know whether it is implemented, it is the feature request, where a staff member already replied.
