目录

k8s集群中部署dcgm-exporter收集GPU指标

k8s集群中部署dcgm-exporter收集GPU指标

总体步骤:

  1. 部署dcgm-exporter的DaemonSet和Service,确保Service有正确的标签和端口。
  2. 创建ServiceMonitor,选择dcgm-exporter的Service,并指定端口。
  3. 检查Prometheus的targets页面,确认dcgm-exporter是否被正确发现和抓取。
  4. 可能需要调整Prometheus的RBAC或网络策略,确保访问权限。

1,部署dcgm-exporter

创建dcgm-exporter.yaml文件,包含DaemonSet和Service:

# dcgm-exporter.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring  # 假设与kube-prometheus同命名空间
  labels:
    app: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"  # 仅在有GPU的节点运行
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: dcgm-exporter
        image: nvidia/dcgm-exporter:3.3.4-3.6.12-ubuntu22.04
        ports:
        - containerPort: 9400
        resources:
          limits:
            nvidia.com/gpu: 1  # 分配1个GPU
---
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app: dcgm-exporter  # ServiceMonitor通过此标签选择
spec:
  selector:
    app: dcgm-exporter
  ports:
  - name: metrics
    port: 9400
    targetPort: 9400

应用配置:

kubectl apply -f dcgm-exporter.yaml

2. 创建ServiceMonitor资源

创建dcgm-service-monitor.yaml文件,定义如何抓取指标:

# dcgm-service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring  # 必须与Prometheus Operator监控的命名空间匹配
  labels:
    release: kube-prometheus  # 根据实际kube-prometheus部署的标签调整
spec:
  jobLabel: dcgm-exporter
  endpoints:
  - port: metrics  # 对应Service的端口名称
    interval: 15s
    path: /metrics  # 指标路径
  selector:
    matchLabels:
      app: dcgm-exporter  # 匹配Service的标签
  namespaceSelector:
    matchNames:
    - monitoring  # 指定Service所在的命名空间

应用配置:

kubectl apply -f dcgm-service-monitor.yaml

3,验证配置

检查Pod状态

kubectl get pods -n monitoring -l app=dcgm-exporter

查看Service和Endpoints:

kubectl get svc,ep -n monitoring dcgm-exporter

访问Prometheus UI:

1,端口转发Prometheus服务:
kubectl --namespace monitoring port-forward svc/prometheus-operated 9090
2,打开浏览器访问 http://localhost:9090/targets ,确认dcgm-exporter目标状态为UP。

查询GPU指标:

在Prometheus中输入 DCGM_FI_DEV_GPU_UTIL 验证指标是否存在。