Prometheus GPU Monitoring

whiteMu 2023-09-13

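The steps are as follows:

1, Prometheus GPU monitoring

2, Install gpu-monitoring-tools

2.1, Start `dcgm-exporter` at boot

3, Modify the Prometheus configuration

4, Grafana

5, Dashboard `9957` supports switching between nodes

6, Grafana settings

7, Use template `12027`

8, Use GPU-Nodes-Metrics-Nvidia 12639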

1, Prometheus GPU monitoring

Install DCGM

datacenter-gpu-manager_1.7.2_amd64.deb

# dcgmi --version

dcgmi  version: 1.7.2
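The installation command itself is not shown above. The following is a minimal sketch, assuming a Debian/Ubuntu host with the `.deb` file in the current directory. The helper name `dcgm_version` is hypothetical; it only extracts the version number from the `dcgmi --version` output shown above.

```shell
# Sketch: install the DCGM package, then confirm which version is installed.
# Assumption: the .deb file is in the current directory (Debian/Ubuntu host).
#   dpkg -i datacenter-gpu-manager_1.7.2_amd64.deb

# dcgm_version (hypothetical helper): read `dcgmi --version` output on stdin
# and print only the version number (last field of the "version:" line).
dcgm_version() {
  awk '/version:/ { print $NF }'
}

# Usage:
#   dcgmi --version | dcgm_version
```

Printing just the version number makes it easy to compare against the package you meant to install.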

2, Install gpu-monitoring-tools

# git clone https://github.com/NVIDIA/gpu-monitoring-tools.git
# cd gpu-monitoring-tools/
# make binary
go build -o dcgm-exporter github.com/NVIDIA/gpu-monitoring-tools/pkg
# make install
go build -o dcgm-exporter github.com/NVIDIA/gpu-monitoring-tools/pkg
install -m 557 dcgm-exporter /usr/bin/dcgm-exporter
install -m 557 -D ./etc/dcgm-exporter/default-counters.csv /etc/dcgm-exporter/default-counters.csv
install -m 557 -D ./etc/dcgm-exporter/dcp-metrics-included.csv /etc/dcgm-exporter/dcp-metrics-included.csv

  • Run dcgm-exporter

# which dcgm-exporter
/usr/bin/dcgm-exporter
# dcgm-exporter
INFO[0000] Starting dcgm-exporter
INFO[0000] DCGM successfully initialized!
INFO[0000] Pipeline starting
INFO[0000] Starting webserver

  • Test: the monitoring data should be visible

# curl 192.168.1.2:9400/metrics
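The `/metrics` page is long, so it can help to look at one metric at a time. The following is a small sketch; the helper name `metric_lines` is hypothetical, and it only filters the Prometheus text format that the exporter serves.

```shell
# metric_lines (hypothetical helper): read Prometheus text-format metrics on
# stdin and print only the sample lines for one metric name. The "# HELP" and
# "# TYPE" comment lines are skipped because they do not start with the name.
metric_lines() {
  # $1: metric name, e.g. DCGM_FI_DEV_GPU_UTIL
  # The name must be followed by "{" (labels) or a space (no labels), so
  # longer metric names that share the same prefix are not matched.
  grep -E "^$1(\{| )"
}

# Usage (address taken from the test above):
#   curl -s 192.168.1.2:9400/metrics | metric_lines DCGM_FI_DEV_GPU_UTIL
```

If nothing is printed, either the exporter is not reachable or that metric is not enabled in the counters CSV file.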

2.1, Start dcgm-exporter at boot

# Create a new service file
vim /lib/systemd/system/dcgm-exporter.service

# With the following content

[Unit]
Description=dcgm-exporter service

[Service]
User=root
ExecStart=/usr/bin/dcgm-exporter

TimeoutStopSec=10
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

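Save and exit.

Commands to reload systemd, enable the service at boot, start it, and check its status:

 1. Reload
 systemctl daemon-reload
 2. Enable at boot
 systemctl enable dcgm-exporter.service
 3. Start
 systemctl start dcgm-exporter.service
 4. Check status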
 systemctl status dcgm-exporter.service
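For scripts, a yes/no answer is easier to use than the `status` output. The following is a minimal sketch using the standard `systemctl is-active` and `systemctl is-enabled` queries; the helper name `exporter_ready` is hypothetical.

```shell
# exporter_ready (hypothetical helper): succeed only if the unit is both
# running now (is-active) and set to start at boot (is-enabled).
exporter_ready() {
  # $1: unit name; defaults to the service created above
  unit="${1:-dcgm-exporter.service}"
  systemctl is-active --quiet "$unit" && systemctl is-enabled --quiet "$unit"
}

# Usage:
#   exporter_ready && echo "dcgm-exporter is running and enabled at boot"
```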

3, Modify the Prometheus configuration

  • Add dcgm-exporter (edit the Prometheus configuration file)

    # dcgm-exporter
  - job_name: 'gpu'
    static_configs:
    - targets: ['192.168.1.2:9400']

Below is my configuration file as an example; the `# dcgm-exporter` part is the newly added job.

# cat prometheus.yml
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']


    # node_exporter
  - job_name: 'node'
    static_configs:
    - targets: ['127.0.0.1:9100','192.168.1.2:9100']

    # dcgm-exporter
  - job_name: 'gpu'
    static_configs:
    - targets: ['192.168.1.2:9400']
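A YAML indentation mistake in this file will stop Prometheus from starting, so it is worth validating the file first. The following is a sketch using `promtool`, which ships with Prometheus. The helper name `check_prom_config` is hypothetical, and the path to `prometheus.yml` depends on your installation.

```shell
# check_prom_config (hypothetical helper): validate a Prometheus config file
# with promtool. Returns non-zero if the file has syntax or schema errors.
check_prom_config() {
  # $1: path to the config file (assumption: defaults to ./prometheus.yml)
  promtool check config "${1:-prometheus.yml}"
}

# Usage: only restart when the config is valid
#   check_prom_config prometheus.yml && systemctl restart prometheus.service
```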

  • Restart Prometheus

systemctl restart prometheus.service

Open your Prometheus instance in a browser.

For example: http://10.10.201.86:9090/targets

The newly added target should show as UP.

[Image 1]
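The same check can be made from the command line through the Prometheus HTTP API by querying the built-in `up` metric. The following is a minimal sketch; the helper name `prom_up` is hypothetical, and the job name defaults to the `gpu` job configured above.

```shell
# prom_up (hypothetical helper): ask Prometheus for the `up` metric of one
# job via the /api/v1/query endpoint. A value of "1" in the JSON response
# means the target was scraped successfully.
prom_up() {
  # $1: Prometheus base URL, $2: job name (defaults to "gpu")
  curl -sG "$1/api/v1/query" --data-urlencode "query=up{job=\"${2:-gpu}\"}"
}

# Usage (address taken from the example above):
#   prom_up http://10.10.201.86:9090 gpu
```

`-G` sends the URL-encoded query as a GET parameter, which avoids quoting problems with the braces and quotes in PromQL.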

4, Grafana

[Image 2]

5, Dashboard 9957 supports switching between nodes

[Image 3]

[Image 4]

6, Grafana settings

  • Power usage; `instance` is the node's IP address with the exporter port

DCGM_FI_DEV_POWER_USAGE{instance="192.168.1.101:9400"}

  • GPU utilization

DCGM_FI_DEV_GPU_UTIL{instance="192.168.1.101:9400"}

7, Use template 12027

[Image 5]

    # dcgm-exporter
  - job_name: 'gpu-metrics'
    static_configs:
    - targets: ['127.0.0.1:9400','192.168.1.101:9400','192.168.1.102:9400']
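With several GPU nodes, the target list can be generated instead of typed by hand. The following is a sketch that prints a scrape job in the same layout as the snippet above. The helper name `gpu_scrape_job` is hypothetical, and the output still has to be pasted under `scrape_configs:` in `prometheus.yml`.

```shell
# gpu_scrape_job (hypothetical helper): print a scrape job for the given
# job name, with one quoted target per remaining "host:port" argument.
gpu_scrape_job() {
  job="$1"; shift
  targets=""
  for t in "$@"; do
    # Append 'host:port', adding a comma only between entries
    targets="${targets:+$targets,}'$t'"
  done
  printf "  - job_name: '%s'\n    static_configs:\n    - targets: [%s]\n" "$job" "$targets"
}

# Usage:
#   gpu_scrape_job gpu-metrics 127.0.0.1:9400 192.168.1.101:9400 192.168.1.102:9400
```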

[Image 6]


Set up monitoring manually

[Image 7]

  • View the GPU metrics

curl http://127.0.0.1:9400/metrics

  • Power usage

DCGM_FI_DEV_POWER_USAGE{instance="127.0.0.1:9400"}

  • GPU memory (framebuffer) used

DCGM_FI_DEV_FB_USED{instance="127.0.0.1:9400"}

  • Total GPU memory (used + free)

DCGM_FI_DEV_FB_USED{instance="127.0.0.1:9400"}+DCGM_FI_DEV_FB_FREE{instance="127.0.0.1:9400"}

  • GPU utilization

DCGM_FI_DEV_GPU_UTIL{instance="127.0.0.1:9400"}
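The used and total memory queries can be combined into a usage percentage for a Grafana panel. The following is a sketch that only builds the PromQL text; the helper name `fb_used_percent_expr` is hypothetical, and it assumes, as in the total-memory query above, that used plus free is the total.

```shell
# fb_used_percent_expr (hypothetical helper): print a PromQL expression for
# GPU memory usage in percent, for one exporter instance.
# Assumption: total memory = DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE.
fb_used_percent_expr() {
  # $1: instance label value, e.g. 127.0.0.1:9400
  sel="{instance=\"$1\"}"
  printf '100 * DCGM_FI_DEV_FB_USED%s / (DCGM_FI_DEV_FB_USED%s + DCGM_FI_DEV_FB_FREE%s)\n' \
    "$sel" "$sel" "$sel"
}

# Usage: paste the printed expression into a Grafana panel query
#   fb_used_percent_expr 127.0.0.1:9400
```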

8, Use GPU-Nodes-Metrics-Nvidia 12639



