GPU Monitoring

Published: 2024/4/14

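1. Prerequisites

  • NVIDIA Tesla drivers = R384+ (download from the NVIDIA Driver Downloads page)
  • nvidia-docker version > 2.0 (see how to install it and its prerequisites)
  • Optionally, configure Docker to use nvidia as the default runtime
  • NVIDIA device plugin for Kubernetes (see how to install it)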

2. Create a PVC

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-gpu-pvc
  namespace: kube-system
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi

3. Run a DaemonSet so that the exporter pod runs on the GPU nodes

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: prometheus-gpu
  namespace: kube-system
spec:
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      k8s-app: prometheus-gpu
  template:
    metadata:
      labels:
        k8s-app: prometheus-gpu
    spec:
      nodeSelector:
        kubernetes.io/hostname: gpu
      volumes:
        - name: prometheus
          persistentVolumeClaim:
            claimName: prometheus-gpu-pvc
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
      serviceAccountName: admin-user
      containers:
        - name: dcgm-exporter
          image: "nvidia/dcgm-exporter"
          volumeMounts:
            - name: prometheus
              mountPath: /run/prometheus/
          imagePullPolicy: Always
          securityContext:
            runAsNonRoot: false
            runAsUser: 0
          env:
            - name: DEPLOY_TIME
              # Ansible template variable; it must be rendered before the manifest is applied
              value: {{ ansible_date_time.iso8601 }}
        - name: node-exporter
          image: "quay.io/prometheus/node-exporter"
          args:
            - "--web.listen-address=0.0.0.0:9100"
            - "--path.procfs=/host/proc"
            - "--path.sysfs=/host/sys"
            - "--collector.textfile.directory=/run/prometheus"
            - "--no-collector.arp"
            - "--no-collector.bcache"
            - "--no-collector.bonding"
            - "--no-collector.conntrack"
            - "--no-collector.cpu"
            - "--no-collector.diskstats"
            - "--no-collector.edac"
            - "--no-collector.entropy"
            - "--no-collector.filefd"
            - "--no-collector.filesystem"
            - "--no-collector.hwmon"
            - "--no-collector.infiniband"
            - "--no-collector.ipvs"
            - "--no-collector.loadavg"
            - "--no-collector.mdadm"
            - "--no-collector.meminfo"
            - "--no-collector.netdev"
            - "--no-collector.netstat"
            - "--no-collector.nfs"
            - "--no-collector.nfsd"
            - "--no-collector.sockstat"
            - "--no-collector.stat"
            - "--no-collector.time"
            - "--no-collector.timex"
            - "--no-collector.uname"
            - "--no-collector.vmstat"
            - "--no-collector.wifi"
            - "--no-collector.xfs"
            - "--no-collector.zfs"
          volumeMounts:
            - name: prometheus
              mountPath: /run/prometheus/
            - name: proc
              readOnly: true
              mountPath: /host/proc
            - name: sys
              readOnly: true
              mountPath: /host/sys
          imagePullPolicy: Always
          env:
            - name: DEPLOY_TIME
              value: {{ ansible_date_time.iso8601 }}
          ports:
            - containerPort: 9100
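The node-exporter argument list follows a simple pattern: every listed collector is switched off, which appears intended to leave the textfile collector reading whatever dcgm-exporter writes into the shared /run/prometheus volume (that reading is an assumption based on the shared mount, not something the article states). A minimal Python sketch that regenerates the same list; the helper name build_node_exporter_args is hypothetical:

```python
# Collectors disabled in the DaemonSet manifest, in the same order.
DISABLED_COLLECTORS = [
    "arp", "bcache", "bonding", "conntrack", "cpu", "diskstats", "edac",
    "entropy", "filefd", "filesystem", "hwmon", "infiniband", "ipvs",
    "loadavg", "mdadm", "meminfo", "netdev", "netstat", "nfs", "nfsd",
    "sockstat", "stat", "time", "timex", "uname", "vmstat", "wifi",
    "xfs", "zfs",
]


def build_node_exporter_args(textfile_dir="/run/prometheus", port=9100):
    """Return the node-exporter args used in the manifest as a list of strings."""
    args = [
        f"--web.listen-address=0.0.0.0:{port}",
        "--path.procfs=/host/proc",
        "--path.sysfs=/host/sys",
        f"--collector.textfile.directory={textfile_dir}",
    ]
    # One --no-collector.<name> flag per disabled collector.
    args += [f"--no-collector.{name}" for name in DISABLED_COLLECTORS]
    return args


if __name__ == "__main__":
    for arg in build_node_exporter_args():
        print(f'- "{arg}"')
```

Keeping the collector names in one list makes it easier to re-enable a single collector later without hand-editing the manifest.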

For more information, see https://github.com/NVIDIA/gpu-monitoring-tools

4. Create a Service

kind: Service
apiVersion: v1
metadata:
  labels:
    k8s-app: prometheus-gpu
  name: prometheus-gpu-service
  namespace: kube-system
spec:
  ports:
    - port: 9100
      targetPort: 9100
  selector:
    k8s-app: prometheus-gpu
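The Service finds the exporter pods purely through its selector: a pod is a backend when every key/value pair in the selector also appears in the pod's labels. A small Python sketch of that equality-based matching rule, handy for checking manifests before applying them; the function name selector_matches is hypothetical:

```python
def selector_matches(selector, pod_labels):
    """True when every selector key/value pair is present in the pod labels."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


# Selector from the Service and labels from the DaemonSet pod template.
service_selector = {"k8s-app": "prometheus-gpu"}
daemonset_pod_labels = {"k8s-app": "prometheus-gpu"}

if __name__ == "__main__":
    print(selector_matches(service_selector, daemonset_pod_labels))
```

Extra labels on the pod do not prevent a match; a missing or different value for any selector key does.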

5. Test the metrics

curl prometheus-gpu-service.kube-system:9100/metrics

You should then see metrics like these:

# HELP dcgm_board_limit_violation Throttling duration due to board limit constraints (in us).
# TYPE dcgm_board_limit_violation counter
dcgm_board_limit_violation{gpu="0",uuid="GPU-a47ee51a-000c-0a26-77cb-6153ec8687b7"} 0
dcgm_board_limit_violation{gpu="1",uuid="GPU-0edfde45-1181-dc4f-947c-eab7c58c10d2"} 0
dcgm_board_limit_violation{gpu="2",uuid="GPU-973ac166-2c6a-12e1-d14d-968237a88104"} 0
dcgm_board_limit_violation{gpu="3",uuid="GPU-1a55c23a-b7d0-e93f-fea6-39c586c9e47b"} 0
# HELP dcgm_dec_utilization Decoder utilization (in %).
# TYPE dcgm_dec_utilization gauge
dcgm_dec_utilization{gpu="0",uuid="GPU-a47ee51a-000c-0a26-77cb-6153ec8687b7"} 0
dcgm_dec_utilization{gpu="1",uuid="GPU-0edfde45-1181-dc4f-947c-eab7c58c10d2"} 0
dcgm_dec_utilization{gpu="2",uuid="GPU-973ac166-2c6a-12e1-d14d-968237a88104"} 0
dcgm_dec_utilization{gpu="3",uuid="GPU-1a55c23a-b7d0-e93f-fea6-39c586c9e47b"} 0
.....
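To check the output programmatically rather than by eye, the text can be split into (name, labels, value) samples. The sketch below is a deliberately simplified parser for the Prometheus text format: it skips comment lines and does not handle escaped quotes inside label values. The names parse_metrics and SAMPLE are hypothetical.

```python
import re

# metric_name{label="value",...} value   (the label block is optional)
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything that is not a sample line
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


SAMPLE = """\
# HELP dcgm_dec_utilization Decoder utilization (in %).
# TYPE dcgm_dec_utilization gauge
dcgm_dec_utilization{gpu="0",uuid="GPU-a47ee51a-000c-0a26-77cb-6153ec8687b7"} 0
dcgm_dec_utilization{gpu="1",uuid="GPU-0edfde45-1181-dc4f-947c-eab7c58c10d2"} 0
"""

if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels["gpu"], value)
```

For production use, a maintained parser such as the one in the official Prometheus Python client is likely a better choice than this regex-based sketch.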

Collecting metrics with Prometheus

1. Create a ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-system
data:
  prometheus.yml: |
    scrape_configs:
      - job_name: 'gpu'
        honor_labels: true
        static_configs:
          - targets: ['prometheus-gpu-service.kube-system:9100']

2. Create a Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: kube-system
spec:
  replicas: 1
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      k8s-app: prometheus
  template:
    metadata:
      labels:
        k8s-app: prometheus
    spec:
      volumes:
        - name: prometheus
          configMap:
            name: prometheus-config
      serviceAccountName: admin-user
      containers:
        - name: prometheus
          image: "prom/prometheus:latest"
          volumeMounts:
            - name: prometheus
              mountPath: /etc/prometheus/
          imagePullPolicy: Always
          ports:
            - containerPort: 9090
              protocol: TCP

3. Create a Service

kind: Service
apiVersion: v1
metadata:
  labels:
    k8s-app: prometheus
  name: prometheus-service
  namespace: kube-system
spec:
  ports:
    - port: 9090
      targetPort: 9090
  selector:
    k8s-app: prometheus
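Once the Prometheus Service is up, one way to confirm that GPU metrics are being scraped is an instant query against the Prometheus HTTP API (/api/v1/query). The sketch below assumes it runs somewhere that can resolve prometheus-service.kube-system, for example inside the cluster; the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, promql):
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def extract_vector(response):
    """Turn a decoded instant-vector response into (labels, value) pairs."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response}")
    data = response["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each result carries "value": [unix_timestamp, "value-as-string"].
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


def instant_query(base_url, promql, timeout=5.0):
    """Run the query over HTTP; needs network access to Prometheus."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return extract_vector(json.load(resp))


if __name__ == "__main__":
    # Assumed in-cluster address, derived from the Service name and namespace.
    base = "http://prometheus-service.kube-system:9090"
    print(instant_query(base, "count(dcgm_board_limit_violation)"))
```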

Grafana dashboard

1. Deploy Grafana in the Kubernetes cluster

kind: Deployment
apiVersion: apps/v1
metadata:
  name: grafana
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: grafana
  template:
    metadata:
      labels:
        k8s-app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:6.2.5
          env:
            - name: GF_SECURITY_ADMIN_PASSWORD
              value: <your-password>
            - name: GF_SECURITY_ADMIN_USER
              value: <your-username>
          ports:
            - containerPort: 3000
              protocol: TCP

2. Create a Service to expose Grafana

kind: Service
apiVersion: v1
metadata:
  labels:
    k8s-app: grafana
  name: grafana-service
  namespace: kube-system
spec:
  ports:
    - port: 3000
      targetPort: 3000
      nodePort: 31111
  selector:
    k8s-app: grafana
  type: NodePort

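3. Access Grafana

The Grafana address is likely http://:31111/, with the address of a cluster node as the host, since the Service above is a NodePort on 31111. The username and password are the ones you configured in step 1.

4. Add a new data source

Click setting -> DataSource -> Add data source -> Prometheus. Example configuration:

  • Name: Prometheus
  • Default: Yes
  • URL: http://prometheus-service:9090
  • Access: Server
  • HTTP Method: GET

Then click Save & Test. You can now access the Prometheus data.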
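The same data source can also be created without the UI through Grafana's HTTP API (POST /api/datasources). The Python sketch below only builds the JSON body for the settings listed above; it assumes that the "Server" access mode corresponds to "proxy" in the API, and the function name prometheus_datasource_payload is hypothetical.

```python
import json


def prometheus_datasource_payload(name="Prometheus",
                                  url="http://prometheus-service:9090"):
    """Build the request body for creating a Prometheus data source."""
    return {
        "name": name,
        "type": "prometheus",
        "url": url,
        "access": "proxy",      # assumed API equivalent of "Access: Server"
        "isDefault": True,      # "Default: Yes"
        "jsonData": {"httpMethod": "GET"},
    }


body = json.dumps(prometheus_datasource_payload(), indent=2)

if __name__ == "__main__":
    print(body)
```

The printed body would be sent with the admin credentials from step 1; sending the request is left out here so the sketch stays independent of any running Grafana instance.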

5. Customize the GPU monitoring dashboard

For example, to display the GPU temperature:

# HELP dcgm_gpu_temp GPU temperature (in C).
# TYPE dcgm_gpu_temp gauge
dcgm_gpu_temp{gpu="0",uuid="GPU-a47ee51a-000c-0a26-77cb-6153ec8687b7"} 29
dcgm_gpu_temp{gpu="1",uuid="GPU-0edfde45-1181-dc4f-947c-eab7c58c10d2"} 27
dcgm_gpu_temp{gpu="2",uuid="GPU-973ac166-2c6a-12e1-d14d-968237a88104"} 27
dcgm_gpu_temp{gpu="3",uuid="GPU-1a55c23a-b7d0-e93f-fea6-39c586c9e47b"} 28

Get the temperature of each GPU with the query sum(dcgm_gpu_temp{gpu=~".*"}) by (gpu)
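To make the effect of the by (gpu) aggregation concrete, here is a small Python sketch that groups the four temperature samples shown above by their gpu label and sums each group. It only imitates what the PromQL query does and is not how Prometheus evaluates it; the name sum_by is hypothetical.

```python
from collections import defaultdict


def sum_by(samples, label):
    """Sum sample values grouped by one label, like PromQL's sum(...) by (label)."""
    totals = defaultdict(float)
    for labels, value in samples:
        # A missing label is grouped under the empty string.
        totals[labels.get(label, "")] += value
    return dict(totals)


# The dcgm_gpu_temp samples from the output above.
GPU_TEMPS = [
    ({"gpu": "0", "uuid": "GPU-a47ee51a-000c-0a26-77cb-6153ec8687b7"}, 29.0),
    ({"gpu": "1", "uuid": "GPU-0edfde45-1181-dc4f-947c-eab7c58c10d2"}, 27.0),
    ({"gpu": "2", "uuid": "GPU-973ac166-2c6a-12e1-d14d-968237a88104"}, 27.0),
    ({"gpu": "3", "uuid": "GPU-1a55c23a-b7d0-e93f-fea6-39c586c9e47b"}, 28.0),
]

if __name__ == "__main__":
    print(sum_by(GPU_TEMPS, "gpu"))
```

With a single GPU node, each gpu value has exactly one series, so the sum simply drops the uuid label. If several nodes were scraped, series sharing the same gpu index would be added together, which is rarely what you want for a temperature.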

Additional queries:

  • Number of GPUs: count(dcgm_board_limit_violation)
  • Total memory usage ratio: sum(dcgm_fb_used) / sum(sum(dcgm_fb_free) + sum(dcgm_fb_used))
  • Power draw: sum(dcgm_power_usage{gpu=~".*"}) by (gpu)
  • Memory temperature: sum(dcgm_memory_temp{gpu=~".*"}) by (gpu)
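The memory usage query divides the used frame buffer summed over all GPUs by the total of used plus free. A worked Python example of the same arithmetic; the per-GPU numbers are invented purely for illustration and the function name memory_usage_ratio is hypothetical.

```python
import math


def memory_usage_ratio(fb_used, fb_free):
    """Return sum(used) / (sum(free) + sum(used)) across all GPUs."""
    total_used = sum(fb_used)
    total = sum(fb_free) + total_used
    if total == 0:
        return math.nan  # mirrors PromQL, where 0 / 0 evaluates to NaN
    return total_used / total


# Hypothetical values for four GPUs; one GPU is partly in use.
FB_USED = [4000, 0, 0, 0]
FB_FREE = [12000, 16000, 16000, 16000]

if __name__ == "__main__":
    print(memory_usage_ratio(FB_USED, FB_FREE))
```

The outer sum(...) around the denominator in the query wraps a value that is already a single aggregated number, so it does not change the result.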


Author: Anoyi
Link: https://www.imooc.com/article/294338
Source: imooc (慕課網)
