Prometheus itself does not deliver notifications; alerting is implemented through a separate component, Alertmanager. Alertmanager receives the alerts that Prometheus fires, runs them through a processing pipeline, and sends them to the configured recipients.
The path an alert takes from trigger to notification:
Prometheus ---> threshold exceeded ---> held for the configured duration ---> Alertmanager ---> grouping | inhibition | silencing ---> media type ---> email | DingTalk | WeChat, etc.
Installing Alertmanager
1. Download Alertmanager
[root@localhost ~]# wget https://github.com/prometheus/alertmanager/releases/download/v0.20.0/alertmanager-0.20.0.linux-amd64.tar.gz
[root@localhost ~]# tar xf alertmanager-0.20.0.linux-amd64.tar.gz
[root@localhost ~]# mv alertmanager-0.20.0.linux-amd64 /usr/local/alertmanager
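A quick sanity check, assuming the install path above, is to print the version of the binary:
[root@localhost ~]# /usr/local/alertmanager/alertmanager --version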
2. Create the systemd unit file
[root@localhost ~]# vim /usr/lib/systemd/system/alertmanager.service
[Unit]
Description=alertmanager
Documentation=https://github.com/prometheus/alertmanager
After=network.target

[Service]
Type=simple
User=root
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target
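The original steps only start the service later (step 6); if Alertmanager should also come back after a reboot, the standard systemd step is to enable the unit as well:
[root@localhost ~]# systemctl daemon-reload
[root@localhost ~]# systemctl enable alertmanager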
3. Configure alertmanager.yml
The Alertmanager installation directory ships with a default alertmanager.yml; you can also create a new configuration file and point the service at it on startup with --config.file.
[root@localhost ~]# cd /usr/local/alertmanager
[root@localhost alertmanager]# vim alertmanager.yml
global:
  resolve_timeout: 5m
  # Email (SMTP) settings
  smtp_smarthost: 'smtp.exmail.qq.com:25'
  smtp_from: 'service@yangxingzhen.com'
  smtp_auth_username: 'service@yangxingzhen.com'
  smtp_auth_password: '123456'
  smtp_require_tls: false
# route defines how alerts are dispatched
route:
  # label used to group alerts into one notification
  group_by: ['alertname']
  # how long to wait after the first alert in a group arrives, so alerts in the same group can be sent together
  group_wait: 10s
  # minimum interval between two batches of notifications for the same group
  group_interval: 10s
  # minimum interval before a notification for the same alert is repeated, to reduce duplicate mail
  repeat_interval: 5m
  # default receiver
  receiver: 'default-receiver'
  routes:  # child routes can direct specific alerts to specific receivers
    - receiver: 'default-receiver'
      continue: true
      group_wait: 10s
receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'xingzhen.yang@yangxingzhen.com'
        headers: { Subject: "[WARN] Alert mail" }
- smtp_smarthost: the SMTP server address and port used to send mail;
- smtp_auth_password: the sending mailbox's authorization code, not its login password;
- smtp_require_tls: defaults to true when unset; with this SMTP setup true triggers a starttls error, so it is set to false here for simplicity;
- headers: the subject line of the alert mail.
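Before starting the service, the configuration can be validated with amtool, which ships in the same tarball (path assumed from the install step above):
[root@localhost alertmanager]# ./amtool check-config alertmanager.yml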
4. Configure alert rules
[root@localhost alertmanager]# mkdir -p /usr/local/prometheus/rules
[root@localhost alertmanager]# cd /usr/local/prometheus/rules
[root@localhost rules]# vim node.yml
groups:
  - name: Node_Down
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 30s
        labels:
          user: root
          severity: Warning
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 30 seconds."
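The rule file can be checked for syntax errors with promtool from the Prometheus distribution; the path below assumes Prometheus is installed under /usr/local/prometheus, as the rest of this article does:
[root@localhost rules]# /usr/local/prometheus/promtool check rules node.yml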
Point prometheus.yml at the node.yml rule file:
[root@localhost rules]# vim /usr/local/prometheus/prometheus.yml
global:
  scrape_interval: 15s      # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s  # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - 'rules/*.yml'
  # - "first_rules.yml"
  # - "second_rules.yml"

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
  # node_exporter must be scraped so the up == 0 rule above can watch it
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']
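promtool can likewise validate the main configuration, including the rule_files globs, before the restart in the next step:
[root@localhost rules]# /usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml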
5. Restart the Prometheus service
[root@localhost rules]# systemctl restart prometheus
6. Start Alertmanager
[root@localhost rules]# systemctl daemon-reload
[root@localhost rules]# systemctl start alertmanager
7. Verify the result
The Alertmanager web UI should now be reachable on port 9093.
Next, stop the node_exporter service and watch what happens:
[root@localhost rules]# systemctl stop node_exporter
The Alerts page in the Prometheus web UI shows the alert's state:
- Green means the rule is healthy and nothing is firing.
- A red PENDING state means the alert has not yet been sent to Alertmanager, because the rule specifies for: 30s.
- After 30 seconds the state changes from PENDING to FIRING; only then does Prometheus send the alert to Alertmanager, where a single alert appears.
Shortly afterwards, the configured mailbox should receive the alert email.
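Besides the web UIs, the firing alert can also be inspected through Alertmanager's HTTP API (v0.20 exposes an api/v2 endpoint); this check is an addition, not part of the original walkthrough:
[root@localhost rules]# curl -s http://localhost:9093/api/v2/alerts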
Appendix:
CPU usage alert rule:
groups:
  - name: CPU
    rules:
      - alert: HighCpuUsage
        expr: (100 - (avg by (instance) (irate(node_cpu{mode="idle"}[5m])) * 100)) > 80
        for: 1m
        labels:
          severity: Warning
        annotations:
          summary: "{{ $labels.instance }}: high CPU usage"
          description: "{{ $labels.instance }}: CPU usage is above 80% (current value: {{ $value }})."
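Note that the expression above uses the pre-0.16 node_exporter metric name node_cpu. On node_exporter 0.16 and later the metric is called node_cpu_seconds_total, so the equivalent expression would be:
(100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80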
Memory usage alert rule:
groups:
  - name: Memory
    rules:
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal - (node_memory_MemFree + node_memory_Buffers + node_memory_Cached)) / node_memory_MemTotal * 100 > 80
        for: 1m  # the condition must hold this long before the alert is sent to Alertmanager
        labels:
          severity: Warning
        annotations:
          summary: "{{ $labels.instance }}: high memory usage"
          description: "{{ $labels.instance }}: memory usage is above 80% (current value: {{ $value }})."
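As with the CPU rule, node_exporter 0.16 and later renamed these memory metrics with a _bytes suffix; the equivalent expression there would be:
(node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 80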