如何监控-Pod-的-CPU内存使用率,prometheusgrafana

2025-03-07 约 1036 字预计阅读 3 分钟

https://bing.ee123.net/img/rand?artid=146088233

如何监控 Pod 的 CPU/内存使用率，prometheus+grafana

一、监控 Pod 的 CPU/内存使用率的方法

1. 使用 `kubectl top` 命令（临时检查）

# 查看所有 Pod 的资源使用率（需安装 Metrics Server）
kubectl top pods --all-namespaces

# 查看指定命名空间的 Pod
kubectl top pods -n <namespace>

# 查看单个 Pod 的详细指标
kubectl top pod <pod-name> -n <namespace>

2. 通过 Metrics Server 获取数据

• 安装 Metrics Server （集群级监控核心组件）：

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

• 查询 Pod 资源使用率 ：

  # 查看 Pod 列表并按 CPU 排序
  kubectl get pods --sort-by=cpu

  # 获取指定 Pod 的详细资源使用率
  kubectl describe pod <pod-name> -n <namespace> | grep -E "^Resource|cpu|memory"

二、配置 Prometheus + Grafana 监控（长期可视化方案）

1. 部署 Prometheus（数据采集）

# 创建 Prometheus 配置文件 `prometheus.yaml`
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  serviceAccountName: prometheus
  storage:
    configMap:
      name: prometheus-storage
  scrape_configs:
    - jobName: 'kubernetes-pods'
      kubernetes_sd_configs:
        - role: pod
      relabel_configs:
        - source_labels: [__meta_kubernetes_pod_label_app]
          action: keep
          regex: my-app.*

2. 部署 Grafana（可视化界面）

# 创建 Grafana 配置文件 `grafana.yaml`
apiVersion: 1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
data:
  grafana.ini: |
    [datasources]
    [datasources.prometheus]
      name = Prometheus
      type = prometheus
      url = http://prometheus-server.monitoring.svc.cluster.local:9090

# 部署 Grafana
kubectl apply -f https://raw.githubusercontent.com/grafana/grafana/master/k8s/deployments.yaml

3. 访问 Grafana 并配置监控面板

获取 Grafana 服务地址：

kubectl get svc -n monitoring grafana --output=jsonpath='{.status.loadBalancer.ingress[0].hostname}'

创建 Pod 监控仪表盘 ： • 添加新面板，选择 Prometheus 作为数据源。 • 查询语句：

  # CPU 使用率（按 Pod 名称分组）
  sum by (pod_name) (container_cpu_usage_seconds_total{container="app"} / 10^9)

  # 内存使用率（按 Pod 名称分组）
  sum by (pod_name) (container_memory_usage_bytes_total{container="app"} / 1024^3)

三、关键配置与优化

1. Prometheus 抓取 Pod 指标

• 启用 Pod 级别监控 ：

# 在 Prometheus 配置中添加以下内容
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod

• 通过标签过滤特定 Pod ：

# 监控名称包含 "my-app" 的 Pod
sum by (pod_name) (container_cpu_usage_seconds_total{container="app", pod_name=~"my-app.*"})

2. Grafana 仪表盘优化

• 自动刷新 ：设置面板刷新间隔为 10s 。

• 预警规则 ：

• CPU 高负载 （示例）： promql rate(container_cpu_usage_seconds_total{container="app"}[5m]) > 0.8

• 内存不足 （示例）： promql container_memory_usage_bytes_total{container="app"} > 1024*1024*512 # 512MB

3. 资源限制与成本控制

• 为 Prometheus 设置资源限制 ：

limits:
  cpu: '1'
  memory: '2Gi'

• 启用持久化存储 （根据需求选择）：

storage:
  persistentVolumeClaim:
    claimName: prometheus-pvc

四、验证监控效果

检查 Prometheus 数据 ：

curl http://prometheus-server.monitoring.svc.cluster.local:9090/api/v1/query?query=sum(container_cpu_usage_seconds_total%7Bcontainer%3D%22app%22%7D)

在 Grafana 中验证面板 ：

• 确保 Pod 的 CPU/内存曲线随负载变化实时更新。

• 测试预警规则是否触发。

五、常见问题排查

现象	解决方案
Prometheus 无数据	1. 检查 Metrics Server 是否正常运行 2. 确认 Prometheus 配置中的 `kubernetes_sd_configs` 正确指向 Pod
Grafana 无法连接 Prometheus	1. 检查防火墙规则 2. 确认 Prometheus 服务端口 `9090` 开放 3. 验证 RBAC 权限（Grafana 需要访问 Prometheus）
数据延迟	调整 Prometheus 抓取间隔（默认 `10s` ）或增加历史数据保留时间。

总结

通过 Prometheus + Grafana 可以实现：

• 实时监控 ：Pod 级 CPU/内存使用率可视化。

• 智能告警 ：基于阈值自动触发通知（集成 Alertmanager）。

• 历史分析 ：长期资源消耗趋势分析。

• 成本优化 ：根据监控数据调整 Pod 数量和资源配额。