Prometheus alerting via a WeChat Work (企业微信) group robot

1. Prometheus alerting logic


The main Prometheus server evaluates alerting rules (rules) and pushes alerts to Alertmanager. These rules use the metrics we collect and fire when a configured threshold or condition is met. When Alertmanager receives an alert, it processes it and routes it according to its labels. Once a route is matched, Alertmanager calls the configured webhook, which delivers the message to the WeChat Work group.
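To see each hop of this chain, you can query the HTTP APIs on both sides. A minimal check, assuming the default ports (Prometheus on 9090, Alertmanager on 9093) on localhost; adjust the addresses to your environment:

# alerting rules and currently firing alerts as seen by Prometheus
curl -s http://localhost:9090/api/v1/rules
curl -s http://localhost:9090/api/v1/alerts
# alerts currently held by Alertmanager after routing
curl -s http://localhost:9093/api/v2/alerts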

2. Main Prometheus server configuration

prometheus.yml

# my global config
global:                      # global configuration
  scrape_interval: 5s        # Scrape targets every 5 seconds. Default is every 1 minute.
  evaluation_interval: 15s   # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:                    # where alerts are sent
  alertmanagers:
    - static_configs:
        - targets: ['127.0.0.1:9093']   # 9093 is the Alertmanager port

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - /export/prometheus/rules/*.yml      # load the alerting rules
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  # - job_name: 'prometheus'

  # metrics_path defaults to '/metrics'
  # scheme defaults to 'http'.

  - job_name: 'qinghotel_report'                  # name of this scrape job
    metrics_path: '/report/actuator/prometheus'   # metrics endpoint of the service (can be checked in nacos)
    file_sd_configs:
      - files:
          - /export/prometheus/conf/report.json   # targets for this job are loaded from a JSON file
        refresh_interval: 10s
  - job_name: 'qinghotel-erp-server'              # every job_name must be indented consistently, otherwise Prometheus reports an error
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['10.11.0.10:19008']
  - job_name: 'push-metrics'
    honor_labels: true
    static_configs:
      - targets: ['127.0.0.1:9099']
  - job_name: 'qinghotel-hotel-member'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['10.0.0.3:9099','10.11.0.36:9099']

Here is the JSON file; when there are many monitored hosts, the target list is usually kept in a separate file like this:

/export/prometheus/conf/report.json

[
  {
    "targets": ["10.11.0.8:8900","10.11.0.29:8900"]
  }
]
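After Prometheus picks up the file (it is re-read every refresh_interval), you can confirm the targets are being scraped through the targets API. A quick check, assuming Prometheus listens on localhost:9090 and jq is installed:

curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, instance: .labels.instance, health: .health}'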


Validating the Prometheus configuration

Run the promtool binary to check the specified configuration file:

./promtool check config prometheus.yml
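promtool can also validate the rule files referenced by rule_files before Prometheus loads them, for example:

./promtool check rules /export/prometheus/rules/*.yml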

3. Alertmanager configuration

global:
  resolve_timeout: 5m

templates:
  - '/export/alertmanager/template/*.tmpl'   # load the notification template files

# routing tree
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 30m
  receiver: 'prometheus'            # must match a receiver name defined below
  routes:
    - receiver: 'prometheus'        # same receiver as above
      group_wait: 60s
      match:
        level: '1'

receivers:
  - name: 'prometheus'              # must match the receiver referenced in the route
    webhook_configs:
      # this URL calls the adapter service, which forwards to the WeChat Work "prometheus" robot
      - url: 'http://10.11.0.16:8089/adapter/wx'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
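To confirm that an alert carrying level=1 really ends up at the 'prometheus' receiver, amtool (shipped with Alertmanager) can walk the routing tree. A small sketch, assuming amtool sits next to the config file:

./amtool config routes test --config.file=/export/alertmanager/alertmanager.yml level=1
# prints the receiver this label set would be routed to (here: prometheus)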

Template configuration file

wechat.tmpl

{{ define "wechat.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts.Firing -}}
{{- if eq $index 0 }}
==========异常告警==========
告警类型: {{ $alert.Labels.alertname }}
告警级别: {{ $alert.Labels.severity }}
告警详情: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};{{$alert.Annotations.summary}}
故障时间: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
实例信息: {{ $alert.Labels.instance }}
{{- end }}
{{- if gt (len $alert.Labels.namespace) 0 }}
命名空间: {{ $alert.Labels.namespace }}
{{- end }}
{{- if gt (len $alert.Labels.node) 0 }}
节点信息: {{ $alert.Labels.node }}
{{- end }}
{{- if gt (len $alert.Labels.pod) 0 }}
实例名称: {{ $alert.Labels.pod }}
{{- end }}
============END============
{{- end }}
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts.Resolved -}}
{{- if eq $index 0 }}
==========异常恢复==========
告警类型: {{ $alert.Labels.alertname }}
告警级别: {{ $alert.Labels.severity }}
告警详情: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};{{$alert.Annotations.summary}}
故障时间: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
恢复时间: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
实例信息: {{ $alert.Labels.instance }}
{{- end }}
{{- if gt (len $alert.Labels.namespace) 0 }}
命名空间: {{ $alert.Labels.namespace }}
{{- end }}
{{- if gt (len $alert.Labels.node) 0 }}
节点信息: {{ $alert.Labels.node }}
{{- end }}
{{- if gt (len $alert.Labels.pod) 0 }}
实例名称: {{ $alert.Labels.pod }}
{{- end }}
============END============
{{- end }}
{{- end }}
{{- end }}
{{- end }}

The adapter service and Alertmanager are started together with Docker Compose.

docker-compose.yml

version: '3'
services:
  webhook-adapter:
    image: guyongquan/webhook-adapter:latest
    container_name: webhook-adapter
    hostname: webhook-adapter
    ports:
      - "8089:80"
    restart: always
    command:
      - "--adapter=/app/prometheusalert/wx.js=/wx=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=ddcebdbc-*******"
      # everything after /wx= is the webhook address of the WeChat Work robot
  alertmanager:
    image: prom/alertmanager
    container_name: alertmanager
    hostname: alertmanager
    restart: always
    volumes:
      - /export/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      # mount the absolute path of your alertmanager.yml; change this to your own path
      - /etc/localtime:/etc/localtime:ro
    ports:
      - "9093:9093"

Start them with: docker-compose up -d   # run from the directory containing docker-compose.yml
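A quick way to verify that both containers came up, run from the same directory (container names as defined above):

docker-compose ps                       # both services should show State "Up"
docker logs --tail 20 webhook-adapter   # recent adapter log output
docker logs --tail 20 alertmanager      # recent Alertmanager log output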

Copy the WeChat Work robot's webhook address and paste it into the configuration above.


To check whether the webhook adapter started successfully, open port 8089 on the public address and make sure the security group allows it.


Check that the Alertmanager service is running properly.


Validating alertmanager.yml syntax

./amtool check-config  /export/alertmanager/alertmanager.yml
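Besides checking the syntax, amtool can also talk to the running instance, which is handy for confirming that alerts actually reach Alertmanager, for example:

./amtool alert query --alertmanager.url=http://127.0.0.1:9093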


Here are a few alerting rule (rules) configurations. They live under the Prometheus directory, in the /export/prometheus/rules/ path loaded by rule_files in prometheus.yml.

1) hoststats-alert.yml

groups:
- name: hostStatsAlert
  rules:
  - alert: hostCpuUsageAlert
    expr: sum(avg without (cpu)(irate(node_cpu{mode!='idle'}[5m]))) by (instance) > 0.85
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} CPU usage high"
      description: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }})"

  - alert: hostMemUsageAlert
    expr: (node_memory_MemTotal - node_memory_MemAvailable)/node_memory_MemTotal > 0.85
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} MEM usage high"
      description: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }})"
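Note that node_cpu and node_memory_* are the metric names used by node_exporter releases before 0.16; newer versions expose node_cpu_seconds_total and node_memory_MemTotal_bytes instead. Before relying on a rule, it helps to evaluate its expression once through the query API. A sketch, assuming Prometheus on localhost:9090:

curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=(node_memory_MemTotal - node_memory_MemAvailable)/node_memory_MemTotal > 0.85'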


2) jvm_alert.yml

groups:
- name: jvm-alerting
  rules:

  # heap usage above 80%
  - alert: heap-usage-too-much
    # assumed expression; adjust to your own JVM metric names
    expr: jvm_memory_bytes_used{area="heap"} / jvm_memory_bytes_max{area="heap"} > 0.8
    for: 60m
    labels:
      level: '3'                  # alert level: 0 info, 1 warning, 2 major, 3 critical, 4 disaster
      name: prometheusalertcenter
    annotations:
      summary: "JVM Instance {{ $labels.instance }} memory usage > 80%"
      runbook: "For details see: http://1.1.1.1:9093/#/alerts"   # public URL of the Alertmanager UI

  # Old GC took more than 50% of the last 5 minutes
  - alert: old-gc-time-too-much-50   # assumed alert name; expr mirrors the 80% rule below with a 0.5 threshold
    expr: increase(jvm_gc_collection_seconds_sum{gc="PS MarkSweep"}[5m]) > 5 * 60 * 0.5
    for: 5m
    labels:
      level: '3'                  # alert level: 0 info, 1 warning, 2 major, 3 critical, 4 disaster
      name: prometheusalertcenter
    annotations:
      summary: "JVM Instance {{ $labels.instance }} Old GC time > 50% running time"
      description: "{{ $labels.instance }} of application {{ $labels.application }} has been in status [Old GC time > 50% running time] for more than 5 minutes. current seconds ({{ $value }}%)"
      runbook: "For details see: http://1.1.1.1:9093/#/alerts"

  # Old GC took more than 80% of the last 5 minutes
  - alert: old-gc-time-too-much
    expr: increase(jvm_gc_collection_seconds_sum{gc="PS MarkSweep"}[5m]) > 5 * 60 * 0.8
    for: 5m
    labels:
      level: '3'                  # alert level: 0 info, 1 warning, 2 major, 3 critical, 4 disaster
      name: prometheusalertcenter
    annotations:
      summary: "JVM Instance {{ $labels.instance }} Old GC time > 80% running time"
      description: "{{ $labels.instance }} of application {{ $labels.application }} has been in status [Old GC time > 80% running time] for more than 5 minutes. current seconds ({{ $value }}%)"
      runbook: "For details see: http://1.1.1.1:9093/#/alerts"

3) service_status.yml

groups:
- name: 实例存活告警规则
  rules:
  - alert: 实例存活告警
    expr: up == 0
    for: 1m
    labels:
      user: prometheus
      severity: warning
    annotations:
      summary: "主机宕机 !!!"
      description: "该实例已经宕机超过一分钟了"

- name: 内存报警规则
  rules:
  # assumed alert name and expression matching the description below; adjust to your node_exporter metrics
  - alert: 内存使用率告警
    expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "服务器可用内存不足"
      description: "内存使用率已超过80%(当前值:{{ $value }}%)"

- name: CPU报警规则
  rules:
  - alert: CPU使用率告警
    expr: 100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[1m]))) * 100 > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "CPU使用率正在飙升。"
      description: "CPU使用率超过80%(当前值:{{ $value }}%)"

- name: 磁盘使用率报警规则
  rules:
  - alert: 磁盘使用率告警
    expr: 100 - node_filesystem_free_bytes{fstype=~"xfs|ext4"} / node_filesystem_size_bytes{fstype=~"xfs|ext4"} * 100 > 80
    for: 80m
    labels:
      severity: warning
    annotations:
      summary: "硬盘分区使用率过高"
      description: "分区使用大于80%(当前值:{{ $value }}%)"

4. Starting the services

How to start the main Prometheus server:

./prometheus --config.file=prometheus.yml --web.enable-lifecycle 2> /dev/null &

Prometheus hot reload; use this after changing the Prometheus configuration:

curl -XPOST http://localhost:9090/-/reload
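To confirm the reload actually took effect, read back the configuration Prometheus is currently running with:

curl -s http://localhost:9090/api/v1/status/config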

Once all the services are up, ports 8089, 9090, and 9093 should all be listening.
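A quick listening-port check (assuming ss is available; netstat -lntp works the same way):

ss -lntp | grep -E ':(8089|9090|9093)'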

5. Testing the WeChat Work push with a shell script

#!/usr/bin/env bash
alerts_message='[
  {
    "labels": {
      "alertname": "磁盘已满",
      "dev": "sda1",
      "instance": "实例sda1",
      "msgtype": "testing"
    },
    "annotations": {
      "info": "程序员小王提示您:这是测试消息",
      "summary": "testing"
    }
  }
]'

# post the test alert defined in alerts_message above to the Alertmanager API
curl -XPOST -d "$alerts_message" http://127.0.0.1:9093/api/v1/alerts

The resulting alert notification in the WeChat Work group looks like this:

[Screenshot: alert message posted by the WeChat Work robot]


That's all for this article. Comments and feedback are welcome.
