
GPU Monitoring

Published: 2024/4/14


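1. Prerequisites

NVIDIA Tesla driver R384 or later (download it from the NVIDIA driver download page)
nvidia-docker version > 2.0 (see how to install it and its prerequisites)
Optionally, configure Docker to use nvidia as the default runtime
NVIDIA device plugin for Kubernetes (see how to install it)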

2. Create the PVC

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-gpu-pvc
  namespace: kube-system
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi

3. Run a DaemonSet so that an exporter pod runs on each GPU node

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: prometheus-gpu
  namespace: kube-system
spec:
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      k8s-app: prometheus-gpu
  template:
    metadata:
      labels:
        k8s-app: prometheus-gpu
    spec:
      nodeSelector:
        kubernetes.io/hostname: gpu
      volumes:
        # Shared volume: dcgm-exporter writes here, node-exporter reads from here
        - name: prometheus
          persistentVolumeClaim:
            claimName: prometheus-gpu-pvc
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
      serviceAccountName: admin-user
      containers:
        - name: dcgm-exporter
          image: "nvidia/dcgm-exporter"
          volumeMounts:
            - name: prometheus
              mountPath: /run/prometheus/
          imagePullPolicy: Always
          securityContext:
            runAsNonRoot: false
            runAsUser: 0
          env:
            # Ansible template variable; quoted so the rendered value stays a string
            - name: DEPLOY_TIME
              value: "{{ ansible_date_time.iso8601 }}"
        - name: node-exporter
          image: "quay.io/prometheus/node-exporter"
          args:
            - "--web.listen-address=0.0.0.0:9100"
            - "--path.procfs=/host/proc"
            - "--path.sysfs=/host/sys"
            # The textfile collector exposes the metrics written to the shared volume
            - "--collector.textfile.directory=/run/prometheus"
            - "--no-collector.arp"
            - "--no-collector.bcache"
            - "--no-collector.bonding"
            - "--no-collector.conntrack"
            - "--no-collector.cpu"
            - "--no-collector.diskstats"
            - "--no-collector.edac"
            - "--no-collector.entropy"
            - "--no-collector.filefd"
            - "--no-collector.filesystem"
            - "--no-collector.hwmon"
            - "--no-collector.infiniband"
            - "--no-collector.ipvs"
            - "--no-collector.loadavg"
            - "--no-collector.mdadm"
            - "--no-collector.meminfo"
            - "--no-collector.netdev"
            - "--no-collector.netstat"
            - "--no-collector.nfs"
            - "--no-collector.nfsd"
            - "--no-collector.sockstat"
            - "--no-collector.stat"
            - "--no-collector.time"
            - "--no-collector.timex"
            - "--no-collector.uname"
            - "--no-collector.vmstat"
            - "--no-collector.wifi"
            - "--no-collector.xfs"
            - "--no-collector.zfs"
          volumeMounts:
            - name: prometheus
              mountPath: /run/prometheus/
            - name: proc
              readOnly: true
              mountPath: /host/proc
            - name: sys
              readOnly: true
              mountPath: /host/sys
          imagePullPolicy: Always
          env:
            - name: DEPLOY_TIME
              value: "{{ ansible_date_time.iso8601 }}"
          ports:
            - containerPort: 9100

For more information, see https://github.com/NVIDIA/gpu-monitoring-tools
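The two containers in the DaemonSet share the /run/prometheus volume, and node-exporter is started with most built-in collectors disabled and its textfile collector pointed at that directory. The textfile collector exposes the contents of any *.prom file it finds there. The Python sketch below illustrates that handoff, assuming the GPU exporter drops its samples into the shared directory as a text file. The file name gpu.prom and the helpers render_samples and write_prom_file are hypothetical and are not taken from dcgm-exporter.

```python
import os
import tempfile


def render_samples(metric, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format."""
    lines = [f"# HELP {metric} {help_text}", f"# TYPE {metric} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(f'{key}="{val}"' for key, val in labels.items())
        lines.append(f"{metric}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


def write_prom_file(directory, filename, content):
    """Write the file atomically so a reader never sees a half-written file.

    The temporary file ends in .tmp, so a collector that only reads
    *.prom files ignores it until the rename completes.
    """
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    final_path = os.path.join(directory, filename)
    os.replace(tmp_path, final_path)  # atomic rename within one filesystem
    return final_path
```

Writing to a temporary file and renaming it is a common precaution with the textfile collector, because a scrape can happen at any moment while the file is being written.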

4. Create the Service

kind: Service
apiVersion: v1
metadata:
  labels:
    k8s-app: prometheus-gpu
  name: prometheus-gpu-service
  namespace: kube-system
spec:
  ports:
    - port: 9100
      targetPort: 9100
  selector:
    k8s-app: prometheus-gpu

5. Test the metrics

curl prometheus-gpu-service.kube-system:9100/metrics

You should then see metrics like these:

# HELP dcgm_board_limit_violation Throttling duration due to board limit constraints (in us).
# TYPE dcgm_board_limit_violation counter
dcgm_board_limit_violation{gpu="0",uuid="GPU-a47ee51a-000c-0a26-77cb-6153ec8687b7"} 0
dcgm_board_limit_violation{gpu="1",uuid="GPU-0edfde45-1181-dc4f-947c-eab7c58c10d2"} 0
dcgm_board_limit_violation{gpu="2",uuid="GPU-973ac166-2c6a-12e1-d14d-968237a88104"} 0
dcgm_board_limit_violation{gpu="3",uuid="GPU-1a55c23a-b7d0-e93f-fea6-39c586c9e47b"} 0
# HELP dcgm_dec_utilization Decoder utilization (in %).
# TYPE dcgm_dec_utilization gauge
dcgm_dec_utilization{gpu="0",uuid="GPU-a47ee51a-000c-0a26-77cb-6153ec8687b7"} 0
dcgm_dec_utilization{gpu="1",uuid="GPU-0edfde45-1181-dc4f-947c-eab7c58c10d2"} 0
dcgm_dec_utilization{gpu="2",uuid="GPU-973ac166-2c6a-12e1-d14d-968237a88104"} 0
dcgm_dec_utilization{gpu="3",uuid="GPU-1a55c23a-b7d0-e93f-fea6-39c586c9e47b"} 0
.....
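To check this output programmatically rather than by eye, the sample lines can be split into a metric name, a label set, and a value. The following Python sketch is a deliberately simplified parser for output shaped like the sample above. It is not a complete implementation of the Prometheus exposition format: it skips comment lines and lines carrying a timestamp, and it does not handle escaped quotes or braces inside label values. The function name parse_metrics is hypothetical.

```python
import re

# name, optional {labels}, then a single value token
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples for the sample lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified pattern does not cover
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples
```

Feeding it the body returned by the curl command above should yield one tuple per GPU for each metric, which makes it easy to assert, for instance, that all four GPUs are reporting.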

Collecting the metrics with Prometheus

1. Create the ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-system
data:
  prometheus.yml: |
    scrape_configs:
      - job_name: 'gpu'
        honor_labels: true
        static_configs:
          - targets: ['prometheus-gpu-service.kube-system:9100']

2. Create the Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: kube-system
spec:
  replicas: 1
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      k8s-app: prometheus
  template:
    metadata:
      labels:
        k8s-app: prometheus
    spec:
      volumes:
        - name: prometheus
          configMap:
            name: prometheus-config
      serviceAccountName: admin-user
      containers:
        - name: prometheus
          image: "prom/prometheus:latest"
          volumeMounts:
            - name: prometheus
              mountPath: /etc/prometheus/
          imagePullPolicy: Always
          ports:
            - containerPort: 9090
              protocol: TCP

3. Create the Service

kind: Service
apiVersion: v1
metadata:
  labels:
    k8s-app: prometheus
  name: prometheus-service
  namespace: kube-system
spec:
  ports:
    - port: 9090
      targetPort: 9090
  selector:
    k8s-app: prometheus

Grafana dashboard

1. Deploy Grafana in the Kubernetes cluster

kind: Deployment
apiVersion: apps/v1
metadata:
  name: grafana
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: grafana
  template:
    metadata:
      labels:
        k8s-app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:6.2.5
          env:
            - name: GF_SECURITY_ADMIN_PASSWORD
              value: <your-password>
            - name: GF_SECURITY_ADMIN_USER
              value: <your-username>
          ports:
            - containerPort: 3000
              protocol: TCP

2. Create a Service to expose Grafana

kind: Service
apiVersion: v1
metadata:
  labels:
    k8s-app: grafana
  name: grafana-service
  namespace: kube-system
spec:
  ports:
    - port: 3000
      targetPort: 3000
      nodePort: 31111
  selector:
    k8s-app: grafana
  type: NodePort

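3. Access Grafana

Grafana is exposed on NodePort 31111 of the cluster nodes, so its address has the form http://:31111/ with a node address as the host. The username and password are the ones you configured in step 1.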

4. Add a new data source

Click Settings -> Data Sources -> Add data source -> Prometheus. Example configuration:

Name: Prometheus
Default: Yes
URL: http://prometheus-service:9090
Access: Server
HTTP Method: GET

Then click Save & Test. Grafana can now query the data in Prometheus.
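The same data source can also be created without the UI through Grafana's HTTP API (POST /api/datasources). The Python sketch below only builds the request and does not send it. It assumes basic authentication with the admin credentials from step 1, and it assumes that the "Server" access mode in the form corresponds to "access": "proxy" in the API and that the HTTP method is carried in jsonData.httpMethod. The helper names and the example host are hypothetical.

```python
import base64
import json
import urllib.request


def build_datasource_payload(name="Prometheus", url="http://prometheus-service:9090"):
    """Mirror the form values from the step above as an API payload."""
    return {
        "name": name,
        "type": "prometheus",
        "url": url,
        "access": "proxy",  # assumed equivalent of "Server" in the UI
        "isDefault": True,
        "jsonData": {"httpMethod": "GET"},
    }


def build_request(grafana_base_url, username, password, payload):
    """Build (but do not send) the POST request that creates the data source."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url=grafana_base_url.rstrip("/") + "/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it. Keeping construction separate from sending makes the payload easy to inspect before anything changes in Grafana.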

5. Customize the GPU monitoring dashboard

For example, to display the GPU temperature, start from this metric:

# HELP dcgm_gpu_temp GPU temperature (in C).
# TYPE dcgm_gpu_temp gauge
dcgm_gpu_temp{gpu="0",uuid="GPU-a47ee51a-000c-0a26-77cb-6153ec8687b7"} 29
dcgm_gpu_temp{gpu="1",uuid="GPU-0edfde45-1181-dc4f-947c-eab7c58c10d2"} 27
dcgm_gpu_temp{gpu="2",uuid="GPU-973ac166-2c6a-12e1-d14d-968237a88104"} 27
dcgm_gpu_temp{gpu="3",uuid="GPU-1a55c23a-b7d0-e93f-fea6-39c586c9e47b"} 28

Get the temperature of each GPU with this query:

sum(dcgm_gpu_temp{gpu=~".*"}) by (gpu)
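Before building a panel, it can help to run the query directly against the Prometheus HTTP API (GET /api/v1/query) and confirm that it returns one series per GPU. The Python sketch below builds the query URL and extracts the per-GPU values from an instant-query response body. It performs no network call itself. The helper names are hypothetical, and the response body used to exercise it would be whatever your Prometheus returns.

```python
import json
import urllib.parse


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    query_string = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query_string


def values_by_gpu(response_body):
    """Map each 'gpu' label to its value from an instant-query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    result = {}
    for item in payload["data"]["result"]:
        # Each vector element is {"metric": {...labels}, "value": [timestamp, "string"]}
        result[item["metric"].get("gpu", "")] = float(item["value"][1])
    return result
```

For example, build_query_url("http://prometheus-service:9090", 'sum(dcgm_gpu_temp{gpu=~".*"}) by (gpu)') produces a URL that can be fetched from inside the cluster with curl or urllib.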

Additional queries:

Number of GPUs: count(dcgm_board_limit_violation)
Total memory utilization: sum(dcgm_fb_used) / sum(sum(dcgm_fb_free) + sum(dcgm_fb_used))
Power draw: sum(dcgm_power_usage{gpu=~".*"}) by (gpu)
Memory temperature: sum(dcgm_memory_temp{gpu=~".*"}) by (gpu)
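The total memory utilization query divides the framebuffer memory in use across all GPUs by the total framebuffer memory (free plus used). The short Python sketch below reproduces that arithmetic on made-up per-GPU readings, which can be handy for sanity-checking a panel value. The function name and the numbers in the usage note are hypothetical.

```python
def memory_utilization(fb_used, fb_free):
    """Mirror sum(dcgm_fb_used) / (sum(dcgm_fb_free) + sum(dcgm_fb_used)).

    Both arguments are per-GPU readings in the same unit.
    """
    total_used = sum(fb_used)
    total = sum(fb_free) + total_used
    if total == 0:
        raise ValueError("no framebuffer samples to aggregate")
    return total_used / total
```

With two GPUs using 1024 and 2048 units and having 3072 and 2048 units free, the ratio is 3072 / 8192 = 0.375, which is 37.5% when the panel unit is set to percent.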

Author: Anoyi
Link: https://www.imooc.com/article/294338
Source: 慕課網 (imooc.com)
