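GPU Monitoring
1. Prerequisites
- NVIDIA Tesla drivers = R384+ (download from the NVIDIA Driver Downloads page)
- nvidia-docker version > 2.0 (see how to install it and its prerequisites)
- Optionally, configure Docker to set the default runtime to nvidia
- NVIDIA device plugin for Kubernetes (see how to install)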
2. Create the PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-gpu-pvc
  namespace: kube-system
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi
3. Run the DaemonSet so that a pod runs on the GPU nodes
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: prometheus-gpu
  namespace: kube-system
spec:
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      k8s-app: prometheus-gpu
  template:
    metadata:
      labels:
        k8s-app: prometheus-gpu
    spec:
      nodeSelector:
        kubernetes.io/hostname: gpu
      volumes:
        - name: prometheus
          persistentVolumeClaim:
            claimName: prometheus-gpu-pvc
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
      serviceAccountName: admin-user
      containers:
        - name: dcgm-exporter
          image: "nvidia/dcgm-exporter"
          volumeMounts:
            - name: prometheus
              mountPath: /run/prometheus/
          imagePullPolicy: Always
          securityContext:
            runAsNonRoot: false
            runAsUser: 0
          env:
            - name: DEPLOY_TIME
              # Ansible template variable; quoted so the rendered timestamp stays a string
              value: "{{ ansible_date_time.iso8601 }}"
        - name: node-exporter
          image: "quay.io/prometheus/node-exporter"
          args:
            - "--web.listen-address=0.0.0.0:9100"
            - "--path.procfs=/host/proc"
            - "--path.sysfs=/host/sys"
            - "--collector.textfile.directory=/run/prometheus"
            - "--no-collector.arp"
            - "--no-collector.bcache"
            - "--no-collector.bonding"
            - "--no-collector.conntrack"
            - "--no-collector.cpu"
            - "--no-collector.diskstats"
            - "--no-collector.edac"
            - "--no-collector.entropy"
            - "--no-collector.filefd"
            - "--no-collector.filesystem"
            - "--no-collector.hwmon"
            - "--no-collector.infiniband"
            - "--no-collector.ipvs"
            - "--no-collector.loadavg"
            - "--no-collector.mdadm"
            - "--no-collector.meminfo"
            - "--no-collector.netdev"
            - "--no-collector.netstat"
            - "--no-collector.nfs"
            - "--no-collector.nfsd"
            - "--no-collector.sockstat"
            - "--no-collector.stat"
            - "--no-collector.time"
            - "--no-collector.timex"
            - "--no-collector.uname"
            - "--no-collector.vmstat"
            - "--no-collector.wifi"
            - "--no-collector.xfs"
            - "--no-collector.zfs"
          volumeMounts:
            - name: prometheus
              mountPath: /run/prometheus/
            - name: proc
              readOnly: true
              mountPath: /host/proc
            - name: sys
              readOnly: true
              mountPath: /host/sys
          imagePullPolicy: Always
          env:
            - name: DEPLOY_TIME
              value: "{{ ansible_date_time.iso8601 }}"
          ports:
            - containerPort: 9100
For more information, see https://github.com/NVIDIA/gpu-monitoring-tools
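In the DaemonSet above, both containers mount the same volume at /run/prometheus, and node-exporter is started with --collector.textfile.directory=/run/prometheus, so metric files placed in that directory are served on port 9100. The textfile collector reads files ending in .prom, and a producer is generally expected to write them atomically so that a scrape never sees a half-written file. The following is a minimal Python sketch of that write pattern only. The function name, the file name example.prom, and the metric example_metric are hypothetical and are not taken from dcgm-exporter.

```python
import os
import tempfile


def write_prom_file(directory, filename, lines):
    """Write metric lines to directory/filename using a temp file plus rename.

    The temp file ends in ".tmp", so a collector that only reads "*.prom"
    files ignores it until the final rename makes the complete file visible.
    """
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    final_path = os.path.join(directory, filename)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write("\n".join(lines) + "\n")
        # mkstemp creates the file as owner-only; relax it so that a reader
        # running as a different user can open the file.
        os.chmod(tmp_path, 0o644)
        # os.replace is atomic when source and target are on one filesystem.
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

A call such as write_prom_file("/run/prometheus", "example.prom", ['example_metric{gpu="0"} 1']) would leave a single complete file in the shared directory, assuming the caller has write access to it.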
4. Create the Service
kind: Service
apiVersion: v1
metadata:
  labels:
    k8s-app: prometheus-gpu
  name: prometheus-gpu-service
  namespace: kube-system
spec:
  ports:
    - port: 9100
      targetPort: 9100
  selector:
    k8s-app: prometheus-gpu
5. Test the metrics
curl prometheus-gpu-service.kube-system:9100/metrics
You should then see metrics like these:
# HELP dcgm_board_limit_violation Throttling duration due to board limit constraints (in us).
# TYPE dcgm_board_limit_violation counter
dcgm_board_limit_violation{gpu="0",uuid="GPU-a47ee51a-000c-0a26-77cb-6153ec8687b7"} 0
dcgm_board_limit_violation{gpu="1",uuid="GPU-0edfde45-1181-dc4f-947c-eab7c58c10d2"} 0
dcgm_board_limit_violation{gpu="2",uuid="GPU-973ac166-2c6a-12e1-d14d-968237a88104"} 0
dcgm_board_limit_violation{gpu="3",uuid="GPU-1a55c23a-b7d0-e93f-fea6-39c586c9e47b"} 0
# HELP dcgm_dec_utilization Decoder utilization (in %).
# TYPE dcgm_dec_utilization gauge
dcgm_dec_utilization{gpu="0",uuid="GPU-a47ee51a-000c-0a26-77cb-6153ec8687b7"} 0
dcgm_dec_utilization{gpu="1",uuid="GPU-0edfde45-1181-dc4f-947c-eab7c58c10d2"} 0
dcgm_dec_utilization{gpu="2",uuid="GPU-973ac166-2c6a-12e1-d14d-968237a88104"} 0
dcgm_dec_utilization{gpu="3",uuid="GPU-1a55c23a-b7d0-e93f-fea6-39c586c9e47b"} 0
.....
Collecting metrics with Prometheus
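What Prometheus scrapes from the exporter endpoint is the text exposition format shown in the sample output: comment lines starting with #, then one sample per line in the form name{labels} value. As a rough illustration of how those lines break down, here is a simplified Python parser. It is a sketch for inspecting curl output by hand, not a full implementation of the format: it skips HELP/TYPE comments and ignores any line it does not recognize, including samples that carry an optional timestamp.

```python
import re

# metric_name{label="value",...} number   (the label block is optional)
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or HELP/TYPE comment
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not understood by this simplified parser
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


sample_text = """\
# HELP dcgm_dec_utilization Decoder utilization (in %).
# TYPE dcgm_dec_utilization gauge
dcgm_dec_utilization{gpu="0",uuid="GPU-a47ee51a-000c-0a26-77cb-6153ec8687b7"} 0
dcgm_dec_utilization{gpu="1",uuid="GPU-0edfde45-1181-dc4f-947c-eab7c58c10d2"} 0
"""

for name, labels, value in parse_metrics(sample_text):
    print(name, labels["gpu"], value)
```

Each sample keeps its gpu and uuid labels, which is what later lets queries group results by GPU.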
1. Create the ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-system
data:
  prometheus.yml: |
    scrape_configs:
      - job_name: 'gpu'
        honor_labels: true
        static_configs:
          - targets: ['prometheus-gpu-service.kube-system:9100']
2. Create the Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: kube-system
spec:
  replicas: 1
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      k8s-app: prometheus
  template:
    metadata:
      labels:
        k8s-app: prometheus
    spec:
      volumes:
        - name: prometheus
          configMap:
            name: prometheus-config
      serviceAccountName: admin-user
      containers:
        - name: prometheus
          image: "prom/prometheus:latest"
          volumeMounts:
            - name: prometheus
              mountPath: /etc/prometheus/
          imagePullPolicy: Always
          ports:
            - containerPort: 9090
              protocol: TCP
3. Create the Service
kind: Service
apiVersion: v1
metadata:
  labels:
    k8s-app: prometheus
  name: prometheus-service
  namespace: kube-system
spec:
  ports:
    - port: 9090
      targetPort: 9090
  selector:
    k8s-app: prometheus
Grafana dashboard
1. Deploy Grafana in the Kubernetes cluster
kind: Deployment
apiVersion: apps/v1
metadata:
  name: grafana
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: grafana
  template:
    metadata:
      labels:
        k8s-app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:6.2.5
          env:
            - name: GF_SECURITY_ADMIN_PASSWORD
              value: <your-password>
            - name: GF_SECURITY_ADMIN_USER
              value: <your-username>
          ports:
            - containerPort: 3000
              protocol: TCP
2. Create a Service to expose Grafana
kind: Service
apiVersion: v1
metadata:
  labels:
    k8s-app: grafana
  name: grafana-service
  namespace: kube-system
spec:
  ports:
    - port: 3000
      targetPort: 3000
      nodePort: 31111
  selector:
    k8s-app: grafana
  type: NodePort
3. Access Grafana
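Grafana should be reachable at an address like http://:31111/ (put the address of a cluster node before the port, since the Service is of type NodePort). The username and password are the ones you configured in step 1.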
4. Add a new data source
Click setting -> Data Sources -> Add data source -> Prometheus. Example configuration:
- Name: Prometheus
- Default: Yes
- URL: http://prometheus-service:9090
- Access: Server
- HTTP Method: GET
Then click Save & Test. Grafana can now read the data stored in Prometheus.
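Outside Grafana, the same data source can be checked directly against the Prometheus HTTP API, whose /api/v1/query endpoint evaluates an instant query and returns JSON. The sketch below builds the request URL and unpacks an instant-vector response. The helper names are hypothetical, and the base URL http://prometheus-service:9090 is the one from the data source settings, which is assumed to resolve only from inside the cluster.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, expr):
    """Return the instant-query URL for a PromQL expression."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": expr})


def parse_instant_vector(payload):
    """Turn a decoded /api/v1/query response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise ValueError("query failed: %s" % payload.get("error", "unknown error"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected a vector, got %s" % data["resultType"])
    # Each result carries "value": [unix_timestamp, "value as a string"].
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


def query(base_url, expr, timeout=5.0):
    """Run an instant query over HTTP and return (labels, value) pairs."""
    with urlopen(build_query_url(base_url, expr), timeout=timeout) as response:
        return parse_instant_vector(json.load(response))
```

For example, query("http://prometheus-service:9090", "dcgm_gpu_temp") would be expected to return one pair per GPU once the gpu scrape job has collected data.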
5. Customize the GPU monitoring dashboard
For example, to display GPU temperature:
# HELP dcgm_gpu_temp GPU temperature (in C).
# TYPE dcgm_gpu_temp gauge
dcgm_gpu_temp{gpu="0",uuid="GPU-a47ee51a-000c-0a26-77cb-6153ec8687b7"} 29
dcgm_gpu_temp{gpu="1",uuid="GPU-0edfde45-1181-dc4f-947c-eab7c58c10d2"} 27
dcgm_gpu_temp{gpu="2",uuid="GPU-973ac166-2c6a-12e1-d14d-968237a88104"} 27
dcgm_gpu_temp{gpu="3",uuid="GPU-1a55c23a-b7d0-e93f-fea6-39c586c9e47b"} 28
Get the temperature of each GPU with the query:
sum(dcgm_gpu_temp{gpu=~".*"}) by (gpu)
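To make the effect of sum(...) by (gpu) concrete, the sketch below reproduces the grouping in plain Python on the four dcgm_gpu_temp samples shown above. The sum_by helper is hypothetical and only mimics the aggregation; Prometheus itself evaluates the query. Because each gpu label value appears on exactly one series here, the sum leaves the readings unchanged and simply drops the uuid label.

```python
from collections import defaultdict


def sum_by(samples, label):
    """Group (labels, value) samples by one label and sum each group."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels.get(label, "")] += value
    return dict(totals)


gpu_temps = [
    ({"gpu": "0", "uuid": "GPU-a47ee51a-000c-0a26-77cb-6153ec8687b7"}, 29),
    ({"gpu": "1", "uuid": "GPU-0edfde45-1181-dc4f-947c-eab7c58c10d2"}, 27),
    ({"gpu": "2", "uuid": "GPU-973ac166-2c6a-12e1-d14d-968237a88104"}, 27),
    ({"gpu": "3", "uuid": "GPU-1a55c23a-b7d0-e93f-fea6-39c586c9e47b"}, 28),
]

print(sum_by(gpu_temps, "gpu"))
# {'0': 29.0, '1': 27.0, '2': 27.0, '3': 28.0}
```

If the metrics came from several nodes, GPUs with the same index on different nodes would be added together by this grouping, so a node label would presumably need to be included in the by clause.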
Additional queries:
- Number of GPUs: count(dcgm_board_limit_violation)
- Total memory usage ratio: sum(dcgm_fb_used) / sum(sum(dcgm_fb_free) + sum(dcgm_fb_used))
- Power draw: sum(dcgm_power_usage{gpu=~".*"}) by (gpu)
- Memory temperature: sum(dcgm_memory_temp{gpu=~".*"}) by (gpu)
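The GPU count and memory usage queries in the list above reduce to simple arithmetic: count one series per GPU, and divide the used frame buffer summed over all GPUs by used plus free. A small worked example in Python follows; the per-GPU numbers are made up for illustration and are not readings from a real device.

```python
def memory_usage_ratio(fb_used, fb_free):
    """Mirror sum(dcgm_fb_used) / (sum(dcgm_fb_free) + sum(dcgm_fb_used))."""
    used = sum(fb_used)
    total = used + sum(fb_free)
    if total == 0:
        raise ValueError("no frame buffer data")
    return used / total


# Hypothetical per-GPU values for four GPUs, all in the same unit.
fb_used = [4000, 4000, 0, 0]
fb_free = [12000, 12000, 16000, 16000]

print(len(fb_used))                          # GPU count, as count(...) would return: 4
print(memory_usage_ratio(fb_used, fb_free))  # 8000 / 64000 = 0.125
```

In a Grafana panel the ratio would typically be shown as a percentage, so 0.125 corresponds to 12.5% of total GPU memory in use.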
Author: Anoyi
Link: https://www.imooc.com/article/294338
Source: imooc (慕課網)