1. Introduction
This post covers monitoring an etcd cluster in Kubernetes, once again based on Kube-Prometheus. Without further ado, let's get started!
2. Monitoring with Prometheus
Creating the etcd Service
Normally an Endpoints object is a dynamic list of IPs that changes as the Pods matched by the Service's selector come and go. When monitoring etcd, however, etcd runs as a host-level component rather than as Pods, so its IPs are fixed. We therefore create a Service without a selector and hand-write a matching Endpoints object that lists the fixed node IPs:
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    app: etcd-prom
  name: etcd-prom
  namespace: kube-system
subsets:
- addresses:
  - ip: x.x.x.x # IP of an etcd node in the cluster
  - ip: x.x.x.x # IP of an etcd node in the cluster
  - ip: x.x.x.x # IP of an etcd node in the cluster
  ports:
  - name: https-metrics
    port: 2379 # etcd client port serving /metrics
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: etcd-prom
  name: etcd-prom
  namespace: kube-system
spec:
  ports:
  - name: https-metrics
    port: 2379
    protocol: TCP
    targetPort: 2379
  type: ClusterIP
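Save both objects to a file and apply them (the filename etcd-svc.yaml is just an example):
# kubectl apply -f etcd-svc.yaml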
Check that the etcd Service was created correctly:
# kubectl get svc -n kube-system
Then access the metrics endpoint through the ClusterIP and confirm that data comes back (note: etcd serves HTTPS, so the client certificate and key must be supplied, and -k skips server certificate verification):
# curl -s --cert /etc/kubernetes/pki/etcd/etcd.pem --key /etc/kubernetes/pki/etcd/etcd-key.pem https://192.168.59.114:2379/metrics -k | tail
Creating the etcd Secret
Package the etcd certificates into a Secret resource:
# kubectl create secret generic etcd-ssl --from-file=/etc/kubernetes/pki/etcd/etcd-ca.pem --from-file=/etc/kubernetes/pki/etcd/etcd.pem --from-file=/etc/kubernetes/pki/etcd/etcd-key.pem -n monitoring
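A quick sanity check that the Secret exists and holds all three files:
# kubectl describe secret etcd-ssl -n monitoring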
Mounting the etcd certificates
The Secret created above must be mounted into the Prometheus containers. Because Prometheus is deployed by the Operator, there is a Prometheus custom resource named k8s that we can edit directly:
# kubectl get prometheus -n monitoring
# kubectl edit prometheus k8s -n monitoring
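Inside the editor, add the Secret name under spec.secrets; the Operator mounts each listed Secret into the Prometheus Pods at /etc/prometheus/secrets/<secret-name> and rolls the StatefulSet to pick up the change. A minimal sketch of the relevant fragment (the rest of the spec is left as it was):
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  # ...existing fields unchanged...
  secrets:
  - etcd-ssl # mounted at /etc/prometheus/secrets/etcd-ssl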
Once the restart completes, verify that the certificates from the Secret are mounted (any one of the Prometheus Pods will do):
# kubectl -n monitoring exec prometheus-k8s-0 -c prometheus -- ls /etc/prometheus/secrets/etcd-ssl
etcd-ca.pem
etcd-key.pem
etcd.pem
Creating the etcd ServiceMonitor
With the groundwork in place, who actually discovers and scrapes etcd? This is where the ServiceMonitor comes in: its job is to dynamically discover Services that match its selector and turn them into scrape targets.
# vim kube-etcd-serviceMonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd
  namespace: monitoring
  labels:
    app: etcd
spec:
  jobLabel: k8s-app
  endpoints:
  - interval: 30s
    port: https-metrics # must match Service.spec.ports.name
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-ssl/etcd-ca.pem # certificate paths inside the Prometheus Pod
      certFile: /etc/prometheus/secrets/etcd-ssl/etcd.pem
      keyFile: /etc/prometheus/secrets/etcd-ssl/etcd-key.pem
      insecureSkipVerify: true # skip server certificate verification
  selector:
    matchLabels:
      app: etcd-prom # must match the Service's labels
  namespaceSelector:
    matchNames:
    - kube-system
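Apply the manifest, then confirm the ServiceMonitor exists:
# kubectl apply -f kube-etcd-serviceMonitor.yaml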
# kubectl get servicemonitor -n monitoring -l app=etcd
NAME   AGE
etcd   78s
At this point Prometheus lists etcd among its scrape targets.
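To confirm this from the command line instead of the web UI, one option is to port-forward the Prometheus Service (prometheus-k8s is the kube-prometheus default name) and count the etcd targets returned by the targets API; this is a quick sketch, not part of the original setup:
# kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090 &
# curl -s http://127.0.0.1:9090/api/v1/targets | grep -o '"job":"etcd"' | wc -l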
Creating the Grafana dashboard
Here we import dashboard 3070 for etcd (https://grafana.com/grafana/dashboards/3070). If you prefer a different dashboard, browse the Grafana website and pick whichever you like.
3. Writing PrometheusRule alerting rules
So far we have monitoring for etcd, but monitoring alone is far from enough: we also need alert notifications on the critical metrics.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: etcd-rules
  namespace: monitoring
spec:
  groups:
  - name: etcd-exporter.rules
    rules:
    - alert: EtcdClusterUnavailable
      annotations:
        summary: etcd cluster small
        description: If one more etcd peer goes down the cluster will be unavailable
      expr: |
        count(up{job="etcd"} == 0) > (count(up{job="etcd"}) / 2 - 1)
      for: 1m
      labels:
        severity: critical
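The expression counts the members that are down and fires once losing one more member would cost quorum: with N members, quorum requires floor(N/2)+1 of them, so anything above N/2 - 1 members down means a single further failure makes the cluster unavailable. Apply the rule (the filename etcd-rules.yaml is just an example):
# kubectl apply -f etcd-rules.yaml
# kubectl get prometheusrule etcd-rules -n monitoring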
With that, the etcd alerting rule is in effect and has been loaded into Prometheus; it shows up on the Alerts page of the Prometheus UI.