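Configuring Prometheus Email Alerts

Prometheus evaluates alerting rules but does not send notifications itself; that job belongs to a separate component, Alertmanager. Alertmanager receives the alerts that Prometheus sends, processes them, and delivers them to the designated recipients.

How Prometheus triggers an alert:

prometheus ---> threshold crossed ---> duration exceeded ---> alertmanager ---> grouping | inhibition | silencing ---> notification channel ---> email | DingTalk | WeChat, etc.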
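
The grouping step in this flow can be pictured with a small sketch. It is only a conceptual illustration in Python, not Alertmanager's actual implementation; the function name group_alerts and the sample alerts are made up for the example.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the labels listed in group_by."""
    groups = defaultdict(list)
    for alert in alerts:
        # A missing label is treated as an empty string in the group key.
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts, each represented only by its label set.
alerts = [
    {"alertname": "NodeDown", "instance": "10.0.0.1:9100"},
    {"alertname": "NodeDown", "instance": "10.0.0.2:9100"},
    {"alertname": "HighCPU", "instance": "10.0.0.1:9100"},
]

groups = group_alerts(alerts, ["alertname"])
for key, members in groups.items():
    print(key, [a["instance"] for a in members])
```

When grouping by alertname, the two NodeDown alerts end up in one group and would be delivered as a single notification instead of two separate emails.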


Installing Alertmanager

1. Download Alertmanager

[root@localhost ~]# wget https://github.com/prometheus/alertmanager/releases/download/v0.20.0/alertmanager-0.20.0.linux-amd64.tar.gz

[root@localhost ~]# tar xf alertmanager-0.20.0.linux-amd64.tar.gz

[root@localhost ~]# mv alertmanager-0.20.0.linux-amd64 /usr/local/alertmanager

2. Create the systemd unit file

[root@localhost ~]# vim /usr/lib/systemd/system/alertmanager.service

    [Unit]
    Description=alertmanager
    Documentation=https://github.com/prometheus/alertmanager
    After=network.target
    [Service]
    Type=simple
    User=root
    ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml
    Restart=on-failure
    [Install]
    WantedBy=multi-user.target

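3. Configure alertmanager.yml

The Alertmanager installation directory ships with a default alertmanager.yml. You can also create a new configuration file and point to it at startup with --config.file.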

[root@localhost ~]# cd /usr/local/alertmanager

[root@localhost alertmanager]# vim alertmanager.yml

    global:
      resolve_timeout: 5m
      # Email settings
      smtp_smarthost: 'smtp.exmail.qq.com:25'
      smtp_from: 'service@yangxingzhen.com'
      smtp_auth_username: 'service@yangxingzhen.com'
      smtp_auth_password: '123456'
      smtp_require_tls: false
    # route defines how alerts are dispatched
    route:
      # Labels used as the grouping key
      group_by: ['alertname']
      # How long to wait after the first alert of a new group, so that
      # other alerts of the same group can be sent together with it
      group_wait: 10s
      # How long to wait before notifying about new alerts added to a
      # group that has already been notified
      group_interval: 10s
      # How long to wait before re-sending a notification for alerts that
      # are still firing; reduces the number of identical emails
      repeat_interval: 5m
      # Default receiver
      receiver: 'default-receiver'
      routes: # child routes decide which receiver handles which alerts
      - receiver: 'default-receiver'
        continue: true
        group_wait: 10s
    receivers:
    - name: 'default-receiver'
      email_configs:
      - to: 'xingzhen.yang@yangxingzhen.com'
        headers: { Subject: "[WARN] Alert email" }
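
  • smtp_smarthost: the address and port of the SMTP server of the mailbox used to send mail;
  • smtp_auth_password: the authorization code of the sending mailbox, not its login password;
  • smtp_require_tls: defaults to true when unset. Leaving it at true produced a STARTTLS error in this setup, so it is set to false here for simplicity;
  • headers: extra email headers; here it sets the Subject of the email.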
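
Before relying on Alertmanager for delivery, it can help to confirm that the mailbox settings work on their own. The following is a minimal sketch using Python's standard smtplib, assuming the same server, account, and recipient as the configuration above; the helper names and the test subject line are hypothetical, and the actual send is left commented out.

```python
import smtplib
from email.message import EmailMessage


def build_test_message(sender, recipient, subject="[TEST] SMTP check"):
    """Build a plain-text message used to confirm the mailbox settings."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content("Test message sent before configuring Alertmanager.")
    return msg


def send_test_message(smarthost, username, password, msg, require_tls=False):
    """Send msg through the SMTP server given as 'host:port'."""
    host, port = smarthost.rsplit(":", 1)
    with smtplib.SMTP(host, int(port), timeout=10) as server:
        if require_tls:
            server.starttls()  # corresponds to smtp_require_tls: true
        server.login(username, password)
        server.send_message(msg)


msg = build_test_message("service@yangxingzhen.com",
                         "xingzhen.yang@yangxingzhen.com")
# Uncomment to send; replace '123456' with the real authorization code.
# send_test_message("smtp.exmail.qq.com:25", "service@yangxingzhen.com",
#                   "123456", msg)
```

If the login step fails here, the same credentials will most likely fail in Alertmanager as well, which narrows the problem down to the mailbox rather than the alerting setup.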

4. Configure alerting rules

[root@localhost alertmanager]# mkdir -p /usr/local/prometheus/rules

[root@localhost alertmanager]# cd /usr/local/prometheus/rules

[root@localhost rules]# vim node.yml

    groups:
    - name: Node_Down
      rules:
      - alert: NodeInstanceDown
        expr: up == 0
        for: 30s
        labels:
          user: root
          severity: Warning
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 30 seconds."

Reference node.yml in prometheus.yml through the rule_files setting:

[root@localhost rules]# vim /usr/local/prometheus/prometheus.yml

    global:
      scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
      # scrape_timeout is set to the global default (10s).

    # Alertmanager configuration
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ['localhost:9093']
          # - localhost:9093

    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      - 'rules/*.yml'
      # - "first_rules.yml"
      # - "second_rules.yml"

    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it is node_exporter on port 9100.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'prometheus'
        # metrics_path defaults to '/metrics'
        # scheme defaults to 'http'.

        static_configs:
        - targets: ['localhost:9100']
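
Indentation mistakes in these YAML files are easy to make, so it may be worth validating them before restarting anything. The sketch below wraps the promtool check config and amtool check-config commands that ship with the Prometheus and Alertmanager releases. The binary locations are assumptions based on the install paths used in this article, and run_checks is a hypothetical helper name.

```python
import shutil
import subprocess

# Assumed install locations; adjust them to your own layout.
CHECKS = [
    ["/usr/local/prometheus/promtool", "check", "config",
     "/usr/local/prometheus/prometheus.yml"],
    ["/usr/local/alertmanager/amtool", "check-config",
     "/usr/local/alertmanager/alertmanager.yml"],
]


def run_checks(checks=CHECKS):
    """Run each check and map the command line to True, False, or None."""
    results = {}
    for cmd in checks:
        key = " ".join(cmd)
        if shutil.which(cmd[0]) is None:
            results[key] = None  # tool not found at the assumed path
            continue
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[key] = proc.returncode == 0
    return results


for command, ok in run_checks().items():
    status = {True: "OK", False: "FAILED", None: "tool not found"}[ok]
    print(f"{status}: {command}")
```

A FAILED line means the corresponding tool rejected the file; running that command by hand shows the detailed error message.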

5. Restart the Prometheus service

[root@localhost rules]# systemctl restart prometheus

6. Start Alertmanager

[root@localhost rules]# systemctl daemon-reload

[root@localhost rules]# systemctl start alertmanager
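
Once Alertmanager is running, one way to exercise the email path without waiting for a real outage is to push a hand-made alert to its HTTP API. The sketch below assumes Alertmanager listens on localhost:9093 and exposes the v2 API at /api/v2/alerts; the alert name ManualTest and both helper functions are hypothetical, and the POST itself is left commented out.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(alertname="ManualTest", minutes=5, now=None):
    """Build a one-element alert list in the shape the v2 API accepts."""
    now = now or datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": alertname, "severity": "Warning"},
        "annotations": {"summary": "Manual test alert"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def post_alert(payload, base_url="http://localhost:9093"):
    """POST the alerts to Alertmanager and return the HTTP status code."""
    req = urllib.request.Request(
        base_url + "/api/v2/alerts",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


payload = build_test_alert()
print(json.dumps(payload, indent=2))
# Uncomment once Alertmanager is up:
# print(post_alert(payload))
```

If the route and SMTP settings are correct, the test alert should produce an email after the configured group_wait has passed.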

7. Verify the result

At this point, open the management UI to check the current state.

Next, stop the node_exporter service and observe the result.

[root@localhost rules]# systemctl stop node_exporter

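The Alerts page of the Prometheus UI shows the alert state.

  • Green means normal.
  • PENDING means the rule expression is true but the alert has not yet been sent to Alertmanager, because the rule sets for: 30s.
  • After 30 seconds the state changes from PENDING to FIRING. Only then does Prometheus send the alert to Alertmanager, where one alert now appears.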
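
The transition described above can be sketched as a small simulation. This is a simplified model, not Prometheus code: it assumes one evaluation every 15 seconds (the evaluation_interval used in this article), and the function name alert_states is made up for the example.

```python
def alert_states(samples, for_seconds=30, interval=15):
    """Return the alert state at each evaluation of the rule `up == 0`."""
    states = []
    active_since = None  # time at which the condition first became true
    for i, up in enumerate(samples):
        now = i * interval
        if up != 0:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held = now - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


# The target is up, then down for four evaluations, then up again.
states = alert_states([1, 0, 0, 0, 0, 1])
print(states)
# ['inactive', 'pending', 'pending', 'firing', 'firing', 'inactive']
```

The alert stays pending for two evaluations and fires on the third, once the condition has held for the full 30 seconds.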


Shortly afterwards, the mailbox should receive the alert email.

Appendix:

CPU usage alert rule:

    groups:
    - name: CPU
      rules:
      - alert: HighCPUUsage
        # node_exporter 0.16 and later expose node_cpu_seconds_total;
        # older versions use node_cpu instead
        expr: (100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
        for: 1m
        labels:
          severity: Warning
        annotations:
          summary: "{{ $labels.instance }} CPU usage too high"
          description: "{{ $labels.instance }}: CPU usage is above 80%, current value ({{ $value }})."
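
To make the arithmetic of the CPU expression concrete, here is a small worked example in Python. The per-core idle rates are hypothetical values standing in for what irate() would return for mode="idle", and cpu_usage_percent is an illustrative name.

```python
def cpu_usage_percent(idle_rates):
    """100 minus the average idle fraction across cores, as a percentage."""
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Two cores that spend 12.5% and 25% of their time idle.
usage = cpu_usage_percent([0.125, 0.25])
print(usage, usage > 80)  # 81.25 True
```

With an average idle fraction of 0.1875, usage works out to 81.25%, which crosses the 80 threshold; the alert would fire if that held for the 1m duration.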

Memory usage alert rule:

    groups:
    - name: Memory
      rules:
      - alert: HighMemoryUsage
        # node_exporter 0.16 and later add the _bytes suffix;
        # older versions use node_memory_MemTotal and so on
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 80
        for: 1m # how long the condition must hold before the alert is sent to Alertmanager
        labels:
          severity: Warning
        annotations:
          summary: "{{ $labels.instance }} memory usage too high"
          description: "{{ $labels.instance }}: memory usage is above 80%, current value ({{ $value }})."
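
The memory expression treats free, buffer, and cache memory as available and everything else as used. A short worked example with made-up numbers (in MiB) shows the calculation; memory_usage_percent is an illustrative name.

```python
def memory_usage_percent(total, free, buffers, cached):
    """Share of memory that is neither free, buffers, nor page cache."""
    used = total - (free + buffers + cached)
    return used / total * 100


# Hypothetical host with 8192 MiB of RAM.
usage = memory_usage_percent(total=8192, free=512, buffers=256, cached=256)
print(usage, usage > 80)  # 87.5 True
```

Here 7168 of 8192 MiB count as used, so the rule's condition is true and the alert would fire after one minute.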