Monitoring Docker service states via Prometheus

We’ve occasionally had instances of a service fail to start due to a “no suitable node…” error. In some cases we’re alerted to this via a service-specific metric, but I’m hoping there’s a general metric we could use to alert on any service instance that has a desired state of X and a current state of Y.
I suppose another way would be to have a metric for the number of running instances of a service in the swarm vs. the number of desired service instances.
I don’t see anything like this baked into cAdvisor (which makes sense, since a service instance that never started has no container to report on; this would be more of a swarm-level metric, I guess), so I’m wondering if it exists elsewhere.
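
To make the "running vs. desired" comparison concrete, here is a minimal Python sketch. It assumes the inputs are lists of dicts shaped like the JSON the Docker Engine API returns for services and tasks (fields such as `Spec.Mode.Replicated.Replicas`, `ServiceID`, `DesiredState` and `Status.State`). The function name `replica_gaps` and the sample data are made up for illustration; this is not an existing metric or exporter.

```python
from collections import Counter


def replica_gaps(services, tasks):
    """Return {service_name: (running, desired)} for replicated services
    that have fewer running tasks than desired replicas."""
    # Count tasks per service that are meant to run and actually are running.
    running = Counter(
        t["ServiceID"]
        for t in tasks
        if t.get("DesiredState") == "running"
        and t.get("Status", {}).get("State") == "running"
    )
    gaps = {}
    for svc in services:
        replicated = svc["Spec"].get("Mode", {}).get("Replicated")
        if replicated is None:
            continue  # global-mode services have no fixed replica count
        desired = replicated.get("Replicas", 0)
        actual = running[svc["ID"]]
        if actual < desired:
            gaps[svc["Spec"]["Name"]] = (actual, desired)
    return gaps


# Hypothetical sample data: "web" wants 3 replicas but one task is stuck pending.
services = [
    {"ID": "s1", "Spec": {"Name": "web", "Mode": {"Replicated": {"Replicas": 3}}}},
    {"ID": "s2", "Spec": {"Name": "db", "Mode": {"Replicated": {"Replicas": 1}}}},
]
tasks = [
    {"ServiceID": "s1", "DesiredState": "running", "Status": {"State": "running"}},
    {"ServiceID": "s1", "DesiredState": "running", "Status": {"State": "running"}},
    {"ServiceID": "s1", "DesiredState": "running", "Status": {"State": "pending"}},
    {"ServiceID": "s2", "DesiredState": "running", "Status": {"State": "running"}},
]

print(replica_gaps(services, tasks))  # {'web': (2, 3)}
```

With the Docker SDK for Python, the raw dicts could presumably come from the low-level client's `services()` and `tasks()` calls on a manager node; the result would then still need to be exposed as a gauge for Prometheus to scrape.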

Note that I don’t use Swarm, but if that message appears as a log entry somewhere, you could create an alert based on that log entry.

Thanks, but as I mentioned, I’m looking for Prometheus metrics specifically. I’ve since found out that Docker exposes its own metrics, and I’ve managed to set that up per this Prometheus doc: Docker Swarm | Prometheus
However I’m still looking for the metrics that would be useful to determine a service’s current state vs. its desired state.

According to the <dockerswarm_sd_config> documentation there is no meta label for the metrics you are looking for.

The answer of @rimelek might not be what you want, but I am afraid it is what you need.

I’m looking at using the swarm_manager_services_total metric, which keeps count of services by state (running, stopped, etc.) over time. The “running” state seems key: if it drops, there’s likely a problem. I’m still trying to figure out how useful other states like “orphaned” and “rejected” are as indications of problems.
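
For anyone who wants to check what a per-state metric actually reports, here is a rough Python sketch that sums samples of one metric by its `state` label from Prometheus text-format output. The metric name is taken from the post above; that it carries a `state` label is an assumption, and the sample payload below is invented to mirror that description, so the real daemon output may look different. The helper name `counts_by_state` is hypothetical.

```python
import re

# One sample line: name, optional {labels}, then the value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def counts_by_state(text, metric):
    """Sum the values of `metric` per `state` label in text-format output."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        if m is None or m.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        state = labels.get("state")
        if state is not None:
            counts[state] = counts.get(state, 0.0) + float(m.group("value"))
    return counts


# Invented sample, NOT captured from a real Docker daemon.
SAMPLE = """\
# TYPE swarm_manager_services_total gauge
swarm_manager_services_total{state="running"} 5
swarm_manager_services_total{state="rejected"} 1
swarm_manager_services_total{state="orphaned"} 0
other_metric 7
"""

print(counts_by_state(SAMPLE, "swarm_manager_services_total"))
# {'running': 5.0, 'rejected': 1.0, 'orphaned': 0.0}
```

Running this against the real metrics output and comparing the "running" count with the number of services and replicas you expect is one way to see whether the metric counts services or individual instances.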

Isn’t swarm_manager_services_total for just the number of services and not the number of replicas of a service?

I recommend joining this conversation on GitHub

It was not rejected, but there is no activity there, so maybe you could show that you are interested in it too.

From my count it’s including all service instances (including replicas). That FR on GH is from 2021. Perhaps they updated the metric since then?

The request is open. Not all requests are implemented. And if you want the best source to know if it is implemented, it is the feature request where a staff member already replied :)