Why does the Prometheus container exit after configuring a rule and restarting?

khopithan · August 5, 2021, 5:03am

Here are the logs from the exited container:

level=info ts=2021-08-05T04:58:10.744Z caller=main.go:450 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2021-08-05T04:58:10.745Z caller=main.go:451 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2021-08-05T04:58:10.754Z caller=web.go:541 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2021-08-05T04:58:10.757Z caller=main.go:824 msg="Starting TSDB ..."
level=info ts=2021-08-05T04:58:10.763Z caller=tls_config.go:191 component=web msg="TLS is disabled." http2=false
level=info ts=2021-08-05T04:58:10.779Z caller=head.go:780 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
level=info ts=2021-08-05T04:58:10.779Z caller=head.go:794 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=23.487µs
level=info ts=2021-08-05T04:58:10.779Z caller=head.go:800 component=tsdb msg="Replaying WAL, this may take a while"
level=info ts=2021-08-05T04:58:10.789Z caller=head.go:854 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=1
level=info ts=2021-08-05T04:58:10.789Z caller=head.go:854 component=tsdb msg="WAL segment loaded" segment=1 maxSegment=1
level=info ts=2021-08-05T04:58:10.789Z caller=head.go:860 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=179.416µs wal_replay_duration=10.469308ms total_replay_duration=10.738798ms
level=info ts=2021-08-05T04:58:10.794Z caller=main.go:851 fs_type=XFS_SUPER_MAGIC
level=info ts=2021-08-05T04:58:10.794Z caller=main.go:854 msg="TSDB started"
level=info ts=2021-08-05T04:58:10.794Z caller=main.go:981 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=error ts=2021-08-05T04:58:10.799Z caller=manager.go:956 component="rule manager" msg="loading groups failed" err="/etc/prometheus/instance_up_rule.yml: group "instance_down_alert_rule", rule 1, "InstanceDown": invalid annotation name: d:wqescription"
level=error ts=2021-08-05T04:58:10.799Z caller=main.go:1001 msg="Failed to apply configuration" err="error loading rules, previous rule set restored"
level=info ts=2021-08-05T04:58:10.799Z caller=main.go:697 msg="Stopping scrape discovery manager..."
level=info ts=2021-08-05T04:58:10.799Z caller=main.go:711 msg="Stopping notify discovery manager..."
level=info ts=2021-08-05T04:58:10.799Z caller=main.go:733 msg="Stopping scrape manager..."
level=info ts=2021-08-05T04:58:10.799Z caller=main.go:707 msg="Notify discovery manager stopped"
level=info ts=2021-08-05T04:58:10.799Z caller=main.go:693 msg="Scrape discovery manager stopped"
level=info ts=2021-08-05T04:58:10.799Z caller=manager.go:934 component="rule manager" msg="Stopping rule manager..."
level=info ts=2021-08-05T04:58:10.799Z caller=manager.go:944 component="rule manager" msg="Rule manager stopped"
level=info ts=2021-08-05T04:58:10.799Z caller=main.go:727 msg="Scrape manager stopped"
level=info ts=2021-08-05T04:58:10.800Z caller=notifier.go:601 component=notifier msg="Stopping notification manager..."
level=info ts=2021-08-05T04:58:10.800Z caller=main.go:908 msg="Notifier manager stopped"
level=error ts=2021-08-05T04:58:10.800Z caller=main.go:917 err="error loading config from "/etc/prometheus/prometheus.yml": one or more errors occurred while applying the new configuration (--config.file="/etc/prometheus/prometheus.yml")"

Regards,
Khopi

khopithan · August 5, 2021, 5:06am

This is my prometheus.yml file

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - '192.168.43.20:9093'

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - instance_up_rule.yml

ahokanson · August 5, 2021, 12:54pm

The problem is actually with the rules file, as these two log lines suggest:

level=error ts=2021-08-05T04:58:10.799Z caller=manager.go:956 component="rule manager" msg="loading groups failed" err="/etc/prometheus/instance_up_rule.yml: group "instance_down_alert_rule", rule 1, "InstanceDown": invalid annotation name: d:wqescription"
level=error ts=2021-08-05T04:58:10.799Z caller=main.go:1001 msg="Failed to apply configuration" err="error loading rules, previous rule set restored"

Please paste your instance_up_rule.yml file as well, and please put it in a code block to make it more legible.

khopithan · August 6, 2021, 4:15am

Ohh sry…

This is my instance_up_rule.yml file

groups:
  - name: instance_down_alert_rule
    rules:
      # To give a warning when all Vms and application services are down for 1 min
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance [{{$labels.instance}}] down"
          d:wqescription: "[{{$labels.instance}}] of job [{{$labels.job}}] has been down for more than 1 minute."

ahokanson · December 2, 2021, 1:48pm

Ah, sorry for such a late reply. If you haven’t figured it out, I believe it has to do with the last line. It reads

d:wqescription: "[{{$labels.instance}}] of job [{{$labels.job}}] has been down for more than 1 minute."

I think you want

description: "[{{$labels.instance}}] of job [{{$labels.job}}] has been down for more than 1 minute."

Nitty Gritty

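The problem lies with the extra :wq characters in there. YAML treats a colon as the delimiter between a key and its value only when the colon is followed by a space or a line break, so the colon inside d:wq is simply part of the key. The parser therefore sees a single annotation named d:wqescription, and Prometheus rejects it because, in this version, annotation names must be valid label names (letters, digits, and underscores only). That is exactly what the log reports: invalid annotation name: d:wqescription.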

If I had to guess, you use vim/neovim and you tried to save and quit with :wq while you were still in INSERT mode. Make sure you get back into NORMAL mode before running ex commands (I’ve done this a lot myself).
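
If you want to see why that particular name is rejected, here is a minimal Python sketch of the check. It assumes the classic Prometheus label-name pattern (`[a-zA-Z_][a-zA-Z0-9_]*`) is what gets applied to annotation names, which matches the error in the log above; newer Prometheus releases may be more permissive. The helper name `invalid_annotation_names` is made up for this illustration and is not part of Prometheus.

```python
import re

# Assumption: annotation names are checked against the classic
# Prometheus label-name pattern (letters, digits, underscores,
# and no leading digit).
LABEL_NAME_RE = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")


def invalid_annotation_names(annotations):
    """Return the annotation names that do not match the pattern.

    Hypothetical helper for illustration only.
    """
    return [name for name in annotations if not LABEL_NAME_RE.match(name)]


# Annotations as they appear in the posted rules file.
annotations = {
    "summary": "Instance [{{$labels.instance}}] down",
    "d:wqescription": "[{{$labels.instance}}] of job [{{$labels.job}}] "
                      "has been down for more than 1 minute.",
}

print(invalid_annotation_names(annotations))  # prints ['d:wqescription']
```

To catch this kind of typo before restarting the container, running `promtool check rules` against the rules file should report the same error without taking Prometheus down.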

khopithan · December 22, 2021, 11:21am

Thank you. I fixed the issue.
