Deploying Prometheus Operator and Thanos


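1. Create object storage

  1. Create the bucket

This walkthrough uses UCloud's managed S3-compatible storage (US3) as the example; the same steps apply to other public clouds and to self-hosted S3.


Region: the same region as the cluster.

Bucket: xxxxx-prometheus-thanos-ucloud-huabei

Format: <company>-<service>-<cluster>

The company and service names are fixed; only the cluster name changes from one bucket to the next.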


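  2. Create a token

Token names only need to be unique within a region, so the same name (for example prometheus-thanos) can be reused in different regions.

After creating the token, record its key pair:

  • Public key: TOKEN_7baab610-b900-xxxx
  • Private key: 5c100495-47f2-xxxx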



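  3. Save the S3 storage settings

Create a config file that describes the bucket; Thanos uses it as its object storage backend.

For endpoint, use the UCloud US3 endpoint that speaks the AWS S3 protocol, and fill it in according to the access domain for your region.

The US3 intranet endpoint is served over plain HTTP, so set insecure: true.

type: s3
config:
  bucket: xxxxx-prometheus-thanos-ucloud-huabei
  endpoint: internal.s3-cn-bj.ufileos.com
  access_key: TOKEN_7baab610-b900-xxxx
  secret_key: 5c100495-47f2-xxxx
  insecure: true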
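Since every cluster needs the same file with only the bucket, endpoint, and keys changed, the config can be rendered from a small shell helper instead of being edited by hand. The following is a minimal sketch, not part of the original setup: the function name render_objstore, the environment variable names US3_ACCESS_KEY and US3_SECRET_KEY, and the company name acme in the usage line are all hypothetical.

```shell
# Hypothetical helper: render a Thanos objstore config for one cluster.
# Arguments: <company> <cluster> <endpoint>
# Credentials are read from US3_ACCESS_KEY / US3_SECRET_KEY (assumed names)
# so that they never have to be typed into the script itself.
render_objstore() {
  company="$1"
  cluster="$2"
  endpoint="$3"
  cat <<EOF
type: s3
config:
  bucket: ${company}-prometheus-thanos-${cluster}
  endpoint: ${endpoint}
  access_key: ${US3_ACCESS_KEY:?set US3_ACCESS_KEY}
  secret_key: ${US3_SECRET_KEY:?set US3_SECRET_KEY}
  insecure: true
EOF
}
```

With both variables exported, `render_objstore acme ucloud-huabei internal.s3-cn-bj.ufileos.com > objstore.yml` writes a file shaped like the one above. The bucket line follows the <company>-<service>-<cluster> naming rule with the service fixed to prometheus-thanos.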



2. Deploy Thanos with ArgoCD

Before creating the application, create the ArgoCD project it belongs to.
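The project can be created in the ArgoCD UI or from the command line. As a rough sketch of the CLI route, assuming the argocd CLI is installed and already logged in, the wrapper below calls `argocd proj create` with one permitted destination and one permitted source repository. The function name create_monitoring_project is hypothetical, and the server and repository URLs must be supplied by you.

```shell
# Hypothetical wrapper around `argocd proj create`.
# Arguments: <project> <cluster API server URL> <chart repo URL>
# The destination namespace is fixed to "monitoring", matching the
# application specs in this article.
create_monitoring_project() {
  project="$1"
  server="$2"
  repo="$3"
  argocd proj create "$project" \
    --dest "${server},monitoring" \
    --src "$repo"
}
```

For the added cluster in this article the first argument would be ucloud-huabei-monitoring; the other two are the cluster API server and the chart repository used in the application spec.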


2.1 Server-side configuration

project: ucloud-public-monitoring
source:
  repoURL: 'https://xxxxxx.com/chartrepo/public'
  targetRevision: 9.0.8
  helm:
    valueFiles:
      - values.yaml
    parameters:
      - name: bucketweb.enabled # enable the optional components
        value: 'true'
      - name: compactor.enabled
        value: 'true'
      - name: compactor.persistence.storageClass
        value: ssd-csi-udisk
      - name: storegateway.enabled
        value: 'true'
      - name: storegateway.persistence.storageClass
        value: ssd-csi-udisk
      - name: objstoreConfig
        value: |-
          type: s3
          config:
            bucket: xxxxx-prometheus-thanos-ucloud-public
            endpoint: internal.s3-cn-sh2.ufileos.com
            access_key: TOKEN_931a52e6-xxxxx
            secret_key: 4d8502f3-6115-xxxxx
            insecure: true
    values: |-
      query:
        stores: 
        - dnssrv+_grpc._tcp.prometheus-operated:10901
        - xxx
        - xxx # append one entry per additional cluster
  chart: thanos
destination:
  server: 'https://xxxxxxx:6443'
  namespace: monitoring
syncPolicy: {}


2.2 Configuration for an added cluster

project: ucloud-huabei-monitoring
source:
  repoURL: 'https://xxxxxxxx.com/chartrepo/public'
  targetRevision: 9.0.8
  helm:
    valueFiles:
      - values.yaml
    parameters:
      - name: bucketweb.enabled
        value: 'true'
      - name: compactor.enabled
        value: 'true'
      - name: objstoreConfig
        value: |-
          type: s3
          config:
            bucket: xxxxxx-prometheus-thanos-ucloud-huabei
            endpoint: internal.s3-cn-bj.ufileos.com
            access_key: TOKEN_7baab610-xxxxx
            secret_key: 5c100495-47f2-xxxxx
            insecure: true
      - name: query.service.type
        value: NodePort
      - name: queryFrontend.enabled
        value: 'false'
      - name: compactor.persistence.storageClass
        value: ssd-csi-udisk
      - name: storegateway.enabled
        value: 'true'
      - name: storegateway.persistence.storageClass
        value: ssd-csi-udisk
    values: |-
      query:
        stores: 
        - dnssrv+_grpc._tcp.prometheus-operated:10901
  chart: thanos
destination:
  server: 'https://xxxxxxx:6443'
  namespace: monitoring
syncPolicy: {}



3. Prometheus Operator

  1. Choose the Prometheus Operator release that is compatible with the current Kubernetes cluster version.
  2. Deploy Prometheus Operator

kubectl create -f manifests/setup
kubectl create -f manifests/

  3. Modify the Prometheus configuration

# prometheus-prometheus.yaml 
spec:
  thanos:
    baseImage: quay.io/thanos/thanos
    version: v0.8.1
    objectStorageConfig:
      key: objstore.yml
      name: thanos-huabei-objstore-secret    # must match the name used in ArgoCD
  externalLabels:
    alertmanager_url: http://xxxxxxx:32368    # these labels identify the cluster
    origin_prometheus: ucloud-huabei
    prometheus_url: http://xxxxxxx:30535
  replicaExternalLabelName: ""  # drop the prometheus_replica label
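The steps above can be sketched as two small shell helpers. The first creates the CRDs, waits until the API server reports them as established, and only then creates the remaining manifests, which avoids "no matches for kind" errors on a fresh cluster. The second prints the decoded objstore.yml from the secret that spec.thanos.objectStorageConfig points at, so you can confirm the sidecar will read the bucket you expect. Both function names are hypothetical, and the manifests/ layout is assumed to be a kube-prometheus checkout.

```shell
# Hypothetical helpers; run from the root of the manifests checkout.

# Create the CRDs first, wait until they are registered,
# then create everything else.
deploy_operator() {
  kubectl create -f manifests/setup &&
  kubectl wait --for=condition=Established --all crd --timeout=120s &&
  kubectl create -f manifests/
}

# Print the decoded objstore.yml stored in a secret.
# Arguments: <namespace> <secret name>
show_objstore_secret() {
  kubectl -n "$1" get secret "$2" -o 'jsonpath={.data.objstore\.yml}' | base64 -d
}
```

For the configuration shown here the check would be `show_objstore_secret monitoring thanos-huabei-objstore-secret`; the output should match the objstoreConfig value passed to the Thanos chart.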



4. Alerting

Delete the alerting rules you do not need from the PrometheusRule custom resources.


Then modify the Alertmanager config:

global:
  resolve_timeout: 5m
  http_config:
    follow_redirects: true
  smtp_hello: localhost
  smtp_require_tls: true
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
route:
  receiver: default
  group_by:
  - alertname
  continue: false
  routes:
  - receiver: critical_alerts
    match:
      severity: critical
    continue: false
    group_wait: 1m
    group_interval: 1m
    repeat_interval: 5m
  - receiver: warning_alerts
    match:
      severity: warning
    continue: false
    group_wait: 30m
    group_interval: 30m
    repeat_interval: 2h
  - receiver: info_alerts
    match:
      severity: info
    continue: false
    group_wait: 3h
    group_interval: 3h
    repeat_interval: 1d
  group_wait: 30s
  group_interval: 30s
  repeat_interval: 10m
inhibit_rules:
- source_match:
    severity: critical
  target_match_re:
    severity: warning|info
  equal:
  - origin_prometheus
  - namespace
  - alertname
- source_match:
    severity: warning
  target_match_re:
    severity: info
  equal:
  - origin_prometheus
  - namespace
  - alertname
receivers:
- name: default
  webhook_configs:
  - send_resolved: true
    http_config:
      follow_redirects: true
    url: http://xxxxxxxx:32555/prometheusalert?type=fs&tpl=prometheus-fs-wraning&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/4197322c-3c93-4d8a-xxxxx
    max_alerts: 0
- name: warning_alerts
  webhook_configs:
  - send_resolved: true
    http_config:
      follow_redirects: true
    url: http://xxxxx:32555/prometheusalert?type=fs&tpl=prometheus-fs-wraning&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/4197322c-3c93-4d8a-xxxxx
    max_alerts: 0
- name: critical_alerts
  webhook_configs:
  - send_resolved: true
    http_config:
      follow_redirects: true
    url: http://xxxxxx:32555/prometheusalert?type=fs&tpl=prometheus-fs-critical&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/1d422c40-2b88-4ead-xxxxx
    max_alerts: 0
- name: info_alerts
  webhook_configs:
  - send_resolved: true
    http_config:
      follow_redirects: true
    url: http://xxxxxx:32555/prometheusalert?type=fs&tpl=prometheus-fs-info&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/a1c21ffc-6413-4bab-xxxx
    max_alerts: 0

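To avoid a flood of notifications, first resolve the alerts that are already firing in Prometheus, and only then hook up the Feishu push.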
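Before loading a config like the one above, it can help to validate it offline. The sketch below assumes amtool (the CLI that ships with Alertmanager) is on PATH; the function names and the file name alertmanager.yaml are hypothetical. The first helper syntax-checks the file, and the second uses amtool's routing test to confirm that an alert with a given severity label reaches the receiver you expect.

```shell
# Hypothetical pre-flight helpers; both assume amtool is on PATH.

# Syntax-check an Alertmanager config file.
check_alertmanager_config() {
  if amtool check-config "$1"; then
    echo "OK: $1"
  else
    echo "INVALID: $1" >&2
    return 1
  fi
}

# Succeeds only if an alert carrying the given severity label is routed
# to the expected receiver.
# Arguments: <config file> <severity> <expected receiver>
route_goes_to() {
  amtool config routes test --config.file="$1" --verify.receivers="$3" "severity=$2"
}
```

For example, `route_goes_to alertmanager.yaml critical critical_alerts` should succeed for the routing tree in this section, since alerts with severity: critical match the first child route.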


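5. Procedure for adding a cluster

  1. Create the object storage bucket for the cluster

Bucket: xxxxxx-prometheus-thanos-ucloud-huabei
Format: <company>-<service>-<cluster>
The company and service names are fixed; only the cluster name changes.

  2. If a token already exists, edit its permissions: in the bucket selection, add the newly created bucket.
  3. Deploy Thanos with ArgoCD; see section 2.2 of this article.
  4. Deploy Prometheus Operator; see section 3 of this article.
  5. Configure alerting; see section 4 of this article.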


