Why does the Prometheus container exit after I configure a rule and restart?

Here are the logs from the exited container:

level=info ts=2021-08-05T04:58:10.744Z caller=main.go:450 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2021-08-05T04:58:10.745Z caller=main.go:451 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2021-08-05T04:58:10.754Z caller=web.go:541 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2021-08-05T04:58:10.757Z caller=main.go:824 msg="Starting TSDB ..."
level=info ts=2021-08-05T04:58:10.763Z caller=tls_config.go:191 component=web msg="TLS is disabled." http2=false
level=info ts=2021-08-05T04:58:10.779Z caller=head.go:780 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
level=info ts=2021-08-05T04:58:10.779Z caller=head.go:794 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=23.487µs
level=info ts=2021-08-05T04:58:10.779Z caller=head.go:800 component=tsdb msg="Replaying WAL, this may take a while"
level=info ts=2021-08-05T04:58:10.789Z caller=head.go:854 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=1
level=info ts=2021-08-05T04:58:10.789Z caller=head.go:854 component=tsdb msg="WAL segment loaded" segment=1 maxSegment=1
level=info ts=2021-08-05T04:58:10.789Z caller=head.go:860 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=179.416µs wal_replay_duration=10.469308ms total_replay_duration=10.738798ms
level=info ts=2021-08-05T04:58:10.794Z caller=main.go:851 fs_type=XFS_SUPER_MAGIC
level=info ts=2021-08-05T04:58:10.794Z caller=main.go:854 msg="TSDB started"
level=info ts=2021-08-05T04:58:10.794Z caller=main.go:981 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=error ts=2021-08-05T04:58:10.799Z caller=manager.go:956 component="rule manager" msg="loading groups failed" err="/etc/prometheus/instance_up_rule.yml: group \"instance_down_alert_rule\", rule 1, \"InstanceDown\": invalid annotation name: d:wqescription"
level=error ts=2021-08-05T04:58:10.799Z caller=main.go:1001 msg="Failed to apply configuration" err="error loading rules, previous rule set restored"
level=info ts=2021-08-05T04:58:10.799Z caller=main.go:697 msg="Stopping scrape discovery manager..."
level=info ts=2021-08-05T04:58:10.799Z caller=main.go:711 msg="Stopping notify discovery manager..."
level=info ts=2021-08-05T04:58:10.799Z caller=main.go:733 msg="Stopping scrape manager..."
level=info ts=2021-08-05T04:58:10.799Z caller=main.go:707 msg="Notify discovery manager stopped"
level=info ts=2021-08-05T04:58:10.799Z caller=main.go:693 msg="Scrape discovery manager stopped"
level=info ts=2021-08-05T04:58:10.799Z caller=manager.go:934 component="rule manager" msg="Stopping rule manager..."
level=info ts=2021-08-05T04:58:10.799Z caller=manager.go:944 component="rule manager" msg="Rule manager stopped"
level=info ts=2021-08-05T04:58:10.799Z caller=main.go:727 msg="Scrape manager stopped"
level=info ts=2021-08-05T04:58:10.800Z caller=notifier.go:601 component=notifier msg="Stopping notification manager..."
level=info ts=2021-08-05T04:58:10.800Z caller=main.go:908 msg="Notifier manager stopped"
level=error ts=2021-08-05T04:58:10.800Z caller=main.go:917 err="error loading config from \"/etc/prometheus/prometheus.yml\": one or more errors occurred while applying the new configuration (--config.file=\"/etc/prometheus/prometheus.yml\")"

Regards,
Khopi

This is my prometheus.yml file:

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - '192.168.43.20:9093'

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - instance_up_rule.yml

The problem is actually with the rules file, as these two log lines suggest:

level=error ts=2021-08-05T04:58:10.799Z caller=manager.go:956 component="rule manager" msg="loading groups failed" err="/etc/prometheus/instance_up_rule.yml: group \"instance_down_alert_rule\", rule 1, \"InstanceDown\": invalid annotation name: d:wqescription"
level=error ts=2021-08-05T04:58:10.799Z caller=main.go:1001 msg="Failed to apply configuration" err="error loading rules, previous rule set restored"
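
When the startup log is this long, it can help to filter it down to the error-level lines first. Here is a minimal Python sketch of that idea. The names `error_lines` and `sample` are hypothetical and only for illustration (they are not part of Prometheus), and the sketch assumes every log line begins with the `level=` field, as in the output in this thread.

```python
# Hypothetical helper, not part of Prometheus: keep only error-level lines.
# Assumes each line starts with the "level=" field, as in the logs above.
def error_lines(log_text):
    return [line for line in log_text.splitlines()
            if line.startswith("level=error")]


# Two lines taken from the log in this thread, used as sample input.
sample = (
    'level=info ts=2021-08-05T04:58:10.794Z caller=main.go:854 msg="TSDB started"\n'
    'level=error ts=2021-08-05T04:58:10.799Z caller=main.go:1001 '
    'msg="Failed to apply configuration" '
    'err="error loading rules, previous rule set restored"\n'
)

for line in error_lines(sample):
    print(line)
```

Only the `level=error` line is printed, which is usually where the actual cause shows up.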

Please paste your instance_up_rule.yml file as well and please put it in a code block to make it more legible :smiley:

Ohh sry…

This is my instance_up_rule.yml file:

groups:
  - name: instance_down_alert_rule
    rules:
      # To give a warning when all Vms and application services are down for 1 min
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance [{{$labels.instance}}] down"
          d:wqescription: "[{{$labels.instance}}] of job [{{$labels.job}}] has been down for more than 1 minute."

Ah, sorry for such a late reply. If you haven’t figured it out, I believe it has to do with the last line. It reads

d:wqescription: "[{{$labels.instance}}] of job [{{$labels.job}}] has been down for more than 1 minute."

I think you want

description: "[{{$labels.instance}}] of job [{{$labels.job}}] has been down for more than 1 minute."

Nitty Gritty

The problem lies with the extra :wq characters in there. YAML only treats a colon as the delimiter between a key and its value when the colon is followed by a space, so d:wqescription was read as one single key rather than as a nested mapping. Prometheus then rejected that key because annotation names follow the same rules as label names (letters, digits, and underscores only), which is why the log says invalid annotation name: d:wqescription. :grinning_face_with_smiling_eyes:
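
To illustrate why the name in the error message is rejected, here is a small Python sketch. It assumes annotation names are checked against the classic Prometheus label-name pattern, `[a-zA-Z_][a-zA-Z0-9_]*`, which I believe applies to the 2.x release used in this thread (later releases may be more permissive). `LABEL_NAME_RE` and `is_valid_annotation_name` are hypothetical names for this example, not Prometheus APIs.

```python
import re

# Assumption: annotation names must match the classic label-name pattern.
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_annotation_name(name):
    """Return True if the whole name matches the assumed pattern."""
    return LABEL_NAME_RE.fullmatch(name) is not None


# The colon in "d:wqescription" is not an allowed character, so it fails.
for name in ("summary", "description", "d:wqescription"):
    print(name, is_valid_annotation_name(name))
```

If you have promtool available, running its rules check against the file before restarting the container should surface the same kind of error without taking Prometheus down.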

If I had to guess, you use vim/neovim and tried to save and quit with :wq while you were still in INSERT mode. Make sure you are back in NORMAL mode (press Esc) before running ex commands :grinning_face_with_smiling_eyes: (I’ve done this a lot myself :man_facepalming:)


Thank you. I fixed the issue.