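1. Create object storage
- Create the object storage bucket
This example uses UCloud's managed S3 environment; other public clouds and self-hosted S3 work the same way.
Region: the same region as the cluster.
Bucket: xxxxx-prometheus-thanos-ucloud-huabei
Format: <company>-<service>-<cluster>
The company and service names are fixed; only the cluster name changes.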
- Create a token
Token names may be reused across regions, for example prometheus-thanos.
After creating the token, record:
- Public key: TOKEN_7baab610-b900-xxxx
- Private key: 5c100495-47f2-xxxx
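- Save the S3 storage details
Create the object storage configuration file that Thanos uses for storage.
For endpoint, use UCloud US3's AWS S3-compatible protocol and fill in the value from the bucket's access domain.
The US3 internal network uses HTTP, so set insecure: true.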
type: s3
config:
  bucket: xxxxx-prometheus-thanos-ucloud-huabei
  endpoint: internal.s3-cn-bj.ufileos.com
  access_key: TOKEN_7baab610-b900-xxxx
  secret_key: 5c100495-47f2-xxxx
  insecure: true
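As a rough sketch, the configuration file described above can be written from the shell with a heredoc. The file name objstore.yml follows the secret key referenced in section 3, the variable names are illustrative, and the endpoint is the internal one listed for this bucket in section 2.2. All values are the placeholders from this document and need to be replaced with real ones.

```shell
#!/bin/sh
# Sketch: write the Thanos object storage config to objstore.yml.
# Every value below is a placeholder taken from this document.
BUCKET="xxxxx-prometheus-thanos-ucloud-huabei"
ENDPOINT="internal.s3-cn-bj.ufileos.com"   # US3 internal access domain
ACCESS_KEY="TOKEN_7baab610-b900-xxxx"      # token public key
SECRET_KEY="5c100495-47f2-xxxx"            # token private key

cat > objstore.yml <<EOF
type: s3
config:
  bucket: ${BUCKET}
  endpoint: ${ENDPOINT}
  access_key: ${ACCESS_KEY}
  secret_key: ${SECRET_KEY}
  insecure: true
EOF

# The file holds credentials, so restrict its permissions.
chmod 600 objstore.yml
```

The same content is what gets pasted into the objstoreConfig parameter in section 2.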
2. Deploy Thanos with ArgoCD
Create the ArgoCD project before creating the application.
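One possible way to create the project is a declarative AppProject manifest. The sketch below assumes ArgoCD runs in the argocd namespace and that the project only needs the chart repository and the monitoring namespace on the target cluster; the file name appproject.yaml and the variable names are illustrative.

```shell
#!/bin/sh
# Sketch: generate an ArgoCD AppProject manifest for the monitoring stack.
PROJECT="ucloud-public-monitoring"
REPO_URL="https://xxxxxx.com/chartrepo/public"   # placeholder chart repository
DEST_SERVER="https://xxxxxxx:6443"               # placeholder cluster API server

cat > appproject.yaml <<EOF
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: ${PROJECT}
  namespace: argocd          # assumption: namespace where ArgoCD is installed
spec:
  sourceRepos:
    - ${REPO_URL}
  destinations:
    - server: ${DEST_SERVER}
      namespace: monitoring
EOF

# Review the file, then apply it (requires cluster access):
#   kubectl apply -f appproject.yaml
```

The project can also be created in the ArgoCD UI; the manifest form is shown only because it is easy to keep under version control.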
2.1 Server-side configuration
project: ucloud-public-monitoring
source:
  repoURL: 'https://xxxxxx.com/chartrepo/public'
  targetRevision: 9.0.8
  helm:
    valueFiles:
      - values.yaml
    parameters:
      - name: bucketweb.enabled # enable the optional components
        value: 'true'
      - name: compactor.enabled
        value: 'true'
      - name: compactor.persistence.storageClass
        value: ssd-csi-udisk
      - name: storegateway.enabled
        value: 'true'
      - name: storegateway.persistence.storageClass
        value: ssd-csi-udisk
      - name: objstoreConfig
        value: |-
          type: s3
          config:
            bucket: xxxxx-prometheus-thanos-ucloud-public
            endpoint: internal.s3-cn-sh2.ufileos.com
            access_key: TOKEN_931a52e6-xxxxx
            secret_key: 4d8502f3-6115-xxxxx
            insecure: true
    values: |-
      query:
        stores:
          - dnssrv+_grpc._tcp.prometheus-operated:10901
          - xxx
          - xxx # append one entry per additional cluster
  chart: thanos
destination:
  server: 'https://xxxxxxx:6443'
  namespace: monitoring
syncPolicy: {}
2.2 Configuration for an added cluster
project: ucloud-huabei-monitoring
source:
  repoURL: 'https://xxxxxxxx.com/chartrepo/public'
  targetRevision: 9.0.8
  helm:
    valueFiles:
      - values.yaml
    parameters:
      - name: bucketweb.enabled
        value: 'true'
      - name: compactor.enabled
        value: 'true'
      - name: objstoreConfig
        value: |-
          type: s3
          config:
            bucket: xxxxxx-prometheus-thanos-ucloud-huabei
            endpoint: internal.s3-cn-bj.ufileos.com
            access_key: TOKEN_7baab610-xxxxx
            secret_key: 5c100495-47f2-xxxxx
            insecure: true
      - name: query.service.type
        value: NodePort
      - name: queryFrontend.enabled
        value: 'false'
      - name: compactor.persistence.storageClass
        value: ssd-csi-udisk
      - name: storegateway.enabled
        value: 'true'
      - name: storegateway.persistence.storageClass
        value: ssd-csi-udisk
    values: |-
      query:
        stores:
          - dnssrv+_grpc._tcp.prometheus-operated:10901
  chart: thanos
destination:
  server: 'https://xxxxxxx:6443'
  namespace: monitoring
syncPolicy: {}
3. Prometheus Operator
- Choose the Prometheus Operator version that matches the Kubernetes version of the current cluster.
- Deploy Prometheus Operator
kubectl create -f manifests/setup
kubectl create -f manifests/
- Modify the Prometheus configuration
# prometheus-prometheus.yaml
spec:
  thanos:
    baseImage: quay.io/thanos/thanos
    version: v0.8.1
    objectStorageConfig:
      key: objstore.yml
      name: thanos-huabei-objstore-secret # must match the name used in ArgoCD
  externalLabels:
    alertmanager_url: http://xxxxxxx:32368 # these labels identify the cluster
    origin_prometheus: ucloud-huabei
    prometheus_url: http://xxxxxxx:30535
  replicaExternalLabelName: "" # drop the prometheus_replica label
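Before applying this change, it may help to confirm that the referenced Secret exists in the monitoring namespace and carries an objstore.yml key. The helper below is only a sketch: the function name is hypothetical, and it assumes kubectl is already pointed at the target cluster.

```shell
#!/bin/sh
# Sketch: print the objstore.yml stored in a Secret (hypothetical helper).
# Usage: show_objstore_secret <secret-name> [namespace]
show_objstore_secret() {
  secret="$1"
  namespace="${2:-monitoring}"
  # The dot in the key name has to be escaped inside the jsonpath expression.
  kubectl -n "$namespace" get secret "$secret" \
    -o 'jsonpath={.data.objstore\.yml}' | base64 --decode
}

# Example (requires cluster access; output contains credentials):
#   show_objstore_secret thanos-huabei-objstore-secret
```

If the command prints nothing, the Secret name or key probably differs from what the Prometheus spec references.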
4. Alerting
Remove the alerting rules that are not needed from the PrometheusRule CRD.
Modify the Alertmanager config:
global:
  resolve_timeout: 5m
  http_config:
    follow_redirects: true
  smtp_hello: localhost
  smtp_require_tls: true
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
route:
  receiver: default
  group_by:
    - alertname
  continue: false
  routes:
    - receiver: critical_alerts
      match:
        severity: critical
      continue: false
      group_wait: 1m
      group_interval: 1m
      repeat_interval: 5m
    - receiver: warning_alerts
      match:
        severity: warning
      continue: false
      group_wait: 30m
      group_interval: 30m
      repeat_interval: 2h
    - receiver: info_alerts
      match:
        severity: info
      continue: false
      group_wait: 3h
      group_interval: 3h
      repeat_interval: 1d
  group_wait: 30s
  group_interval: 30s
  repeat_interval: 10m
inhibit_rules:
  - source_match:
      severity: critical
    target_match_re:
      severity: warning|info
    equal:
      - origin_prometheus
      - namespace
      - alertname
  - source_match:
      severity: warning
    target_match_re:
      severity: info
    equal:
      - origin_prometheus
      - namespace
      - alertname
receivers:
  - name: default
    webhook_configs:
      - send_resolved: true
        http_config:
          follow_redirects: true
        url: http://xxxxxxxx:32555/prometheusalert?type=fs&tpl=prometheus-fs-wraning&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/4197322c-3c93-4d8a-xxxxx
        max_alerts: 0
  - name: warning_alerts
    webhook_configs:
      - send_resolved: true
        http_config:
          follow_redirects: true
        url: http://xxxxx:32555/prometheusalert?type=fs&tpl=prometheus-fs-wraning&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/4197322c-3c93-4d8a-xxxxx
        max_alerts: 0
  - name: critical_alerts
    webhook_configs:
      - send_resolved: true
        http_config:
          follow_redirects: true
        url: http://xxxxxx:32555/prometheusalert?type=fs&tpl=prometheus-fs-critical&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/1d422c40-2b88-4ead-xxxxx
        max_alerts: 0
  - name: info_alerts
    webhook_configs:
      - send_resolved: true
        http_config:
          follow_redirects: true
        url: http://xxxxxx:32555/prometheusalert?type=fs&tpl=prometheus-fs-info&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/a1c21ffc-6413-4bab-xxxx
        max_alerts: 0
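To avoid a flood of notifications, resolve the alerts currently firing in Prometheus first, and only then hook up the Feishu push notifications.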
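The routing tree can be checked offline with amtool, which ships with Alertmanager, before any notification is sent. The sketch below assumes the config has been saved locally as alertmanager.yaml (a hypothetical file name) and that amtool is on PATH; the function name is illustrative.

```shell
#!/bin/sh
# Sketch: validate the config, then show which receiver each severity reaches.
# Usage: check_alert_routes [config-file]
check_alert_routes() {
  config="${1:-alertmanager.yaml}"
  # Stop early if the file does not parse as a valid Alertmanager config.
  amtool check-config "$config" || return 1
  for severity in critical warning info; do
    printf '%s -> ' "$severity"
    amtool config routes test --config.file="$config" severity="$severity"
  done
}

# Example:
#   check_alert_routes alertmanager.yaml
```

With the routes in this section, each severity should resolve to its own receiver (critical_alerts, warning_alerts, info_alerts), and an alert without a severity label should fall through to default.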
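5. Workflow for adding a cluster
- Create the object storage bucket for the cluster
Bucket: xxxxxx-prometheus-thanos-ucloud-huabei
Format: <company>-<service>-<cluster>
The company and service names are fixed; only the cluster name changes.
- If a token already exists, edit its permissions, select the bucket scope, and add the newly created bucket.
- Deploy Thanos with ArgoCD; see section 2.2.
- Deploy Prometheus Operator; see section 3.
- Configure alerting; see section 4.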
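The names that change per cluster can be derived from the cluster name alone. The sketch below follows the bucket format given above; the <cluster>-monitoring project pattern is an assumption inferred from the two examples in section 2, and the variable names are illustrative.

```shell
#!/bin/sh
# Sketch: derive the per-cluster names used throughout this guide.
COMPANY="xxxxx"                                  # placeholder company name (fixed)
SERVICE="prometheus-thanos"                      # fixed service name
CLUSTER_NAME="${CLUSTER_NAME:-ucloud-huabei}"    # the only part that changes

BUCKET="${COMPANY}-${SERVICE}-${CLUSTER_NAME}"
PROJECT="${CLUSTER_NAME}-monitoring"             # assumption: inferred naming pattern

echo "bucket:            ${BUCKET}"
echo "argocd project:    ${PROJECT}"
echo "origin_prometheus: ${CLUSTER_NAME}"
```

Setting CLUSTER_NAME in the environment before running the script yields the names for a different cluster.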