目录

使用AOP-Prometheus-node-exporter-grafana-实现Java系统的接口监控

使用AOP + Prometheus + node-exporter + grafana 实现Java系统的接口监控

使用AOP + Prometheus + node-exporter + grafana 实现Java系统的接口监控

本笔记适用于想要搭建业务系统监控能力的工程狮。此笔记适用的架构为单机部署方式,更倾向于一个小demo,本笔记需要用到cocker。请提前准备!!

1.监控的目的以及监控哪些内容?

在平常的业务需求中,我们通常需要一下两个方面的监控来全方位了解我们的系统:

1.1 机器健康度监控

  • CPU使用率内存使用率磁盘IO网络流量
  • 系统负载 (Load Average)
  • 磁盘使用情况
  • 进程存活状态
  • Docker容器状态 (如果使用容器化)
  • 系统日志异常

1.2 业务监控

  • 请求QPS、TPS (吞吐量)
  • 请求响应时间(P99, P95, P50)
  • HTTP状态码统计
  • 数据库查询耗时
  • 缓存命中率
  • 异常日志(如超时、错误)

2. 我们要用到哪些软件来实现呢?

2.1 Prometheus 主要用来存储和查询监控数据。

2.2 node Exporter 这个主要是用来采集服务器的CPU、内存、磁盘、网络指标等。(值得注意的是他基本不占什么内存)

2.3 grafana 通过可视化的图标来展示服务的运行情况

2.4 Java spring boot + Micrometer 实现业务监控

3. 开始搭建Prometheus + Grafana

3.1 创建docker-compose.yml

在本地创建一个文件夹

mkdir monitoring-demo && cd monitoring-demo

3.2 创建docker-compose.yml

version: '3'
services:
  prometheus:
    image: prom/prometheus
    container_name: prometheus
    volumes:
    - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana
    container_name: grafana
    ports:
      - "3000:3000"

  node-exporter:
    image: prom/node-exporter
    container_name: node-exporter
    ports:
      - "9100:9100"

在上面的文件中共创建了三个容器,并为他们开放了不同的端口分别是 9090、3000以及9100 其中node-exporter用来采集docker容器的资源。

volumes:
    - ./prometheus.yml:/etc/prometheus/prometheus.yml

这个主要是用来配置普罗米修斯的采集规则,在docker-compose.yml的平级目录中添加一个新的yml用来进行数据采集。

global:
  scrape_interval: 15s  # 全局抓取间隔

scrape_configs:
  - job_name: 'prometheus' # 监控 Prometheus 自己
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter' # 监控 Node Exporter
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'java-demo-app' # Java 业务系统
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['host.docker.internal:8080']

3.3 启动docker

完成上面的操作后,就可以通过下面的命令来启动容器了

docker-compose up -d # -d 不展示启动log,在后台启动

启动后:

Prometheus 访问地址:http://localhost:9090

Grafana 访问地址:http://localhost:3000(默认用户 admin/admin ) Ps🗡 第一次访问会让你改默认密码

一些知识点🛩 :

  1. 在普罗米修斯中可以通过简单的sql来查询已经上报的数据。可以通过 http://localhost:9090/query 这个路径来查询。如果你想要看到一些数据可以尝试查一下 node_cpu_seconds_total 这个信息,你大概可以看到一堆数据
  2. 如果你没看到任何信息或者你想要看普罗米修斯正在从收集哪些应用的信息,可以通过 http://localhost:9090/targets 来进行查看,这里可以看到哪些应用在线或者离线

4. 在Grafana中配置Prometheus数据源

访问Grafana http://localshot:3000

Grafana 左侧菜单 → 齿轮图标(Configuration) → Data Sources

点击 Add data source

选择 Prometheus

填写 URL :http://prometheus:9090 这里为什么写prometheus:9090呢,是以为Grafana 和 Prometheus 在同一个docker网络下,所以可以直呼其名

点击 Save & Test → 显示 Data source is working ,表示连接成功!

5. 导入监控面板(Dashboard)

导入现成的服务器监控 Dashboard(Node Exporter)
  1. 左侧 → Dashboard(四个方块图标) → Import
  2. 输入 Dashboard ID1860
  3. 点击 Load
  4. 选择 Prometheus 数据源
  5. 点击 Import

效果 :马上看到服务器 CPU、内存、磁盘、网络图表!

✅ 到目前为止你所完成事情 每个组件的职责和意义

组件作用你做了什么
Docker快速构建环境装了 Docker,部署了容器
Prometheus抓取指标,保存数据配了 prometheus.yml,启动服务
Node Exporter提供主机指标启动了 node-exporter,暴露 9100
Grafana数据展示和分析配了数据源、导入 Dashboard

6. 启动Java DEMO

  1. 创建一个maven项目,pom文件如下
<dependencies>
		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot-starter-web</artifactId>
		</dependency>
		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot-starter-actuator</artifactId>
		</dependency>
		<dependency>
			<groupId>io.pivotal.spring.cloud</groupId>
			<artifactId>spring-cloud-services-starter-service-registry</artifactId>
		</dependency>

		<dependency>
			<groupId>io.micrometer</groupId>
			<artifactId>micrometer-registry-prometheus</artifactId>
			<scope>runtime</scope>
		</dependency>
		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot-starter-test</artifactId>
			<scope>test</scope>
		</dependency>
	</dependencies>
  1. 配置application.yml
  spring:
    application:
      name: demo
  server:
    port: 8080
  # 以下内容主要是上报jvm的健康度
  management:
    endpoints:
      web:
        exposure:
          include: health,info,prometheus
    metrics:
      tags:
        application: demo-app
  1. 启动Javademo项目,这个Java项目的jvm信息将会被上报给普罗米修斯。
  2. 想要看看 jvm GC的情况?可以在仪表盘中导入这个json模板
     {
       "annotations": {
         "list": []
       },
       "panels": [
         {
           "type": "stat",
           "title": "GC 次数 (1m)",
           "datasource": "prometheus-1",
           "targets": [
             {
               "expr": "increase(jvm_gc_pause_seconds_count{application=\"demo-app\"}[1m])",
               "interval": "",
               "legendFormat": "GC 次数",
               "refId": "A"
             }
           ],
           "fieldConfig": {
             "defaults": {
               "unit": "none",
               "thresholds": {
                 "mode": "absolute",
                 "steps": [
                   {
                     "color": "green",
                     "value": null
                   },
                   {
                     "color": "yellow",
                     "value": 5
                   },
                   {
                     "color": "red",
                     "value": 10
                   }
                 ]
               }
             },
             "overrides": []
           },
           "options": {
             "reduceOptions": {
               "calcs": ["lastNotNull"],
               "fields": "",
               "values": false
             },
             "orientation": "horizontal",
             "colorMode": "value",
             "graphMode": "none",
             "justifyMode": "auto"
           },
           "gridPos": {
             "h": 6,
             "w": 6,
             "x": 0,
             "y": 0
           }
         },
         {
           "type": "stat",
           "title": "GC 耗时 (1m)",
           "datasource": "prometheus-1",
           "targets": [
             {
               "expr": "increase(jvm_gc_pause_seconds_sum{application=\"demo-app\"}[1m])",
               "interval": "",
               "legendFormat": "GC 耗时 (秒)",
               "refId": "B"
             }
           ],
           "fieldConfig": {
             "defaults": {
               "unit": "s",
               "thresholds": {
                 "mode": "absolute",
                 "steps": [
                   {
                     "color": "green",
                     "value": null
                   },
                   {
                     "color": "yellow",
                     "value": 0.5
                   },
                   {
                     "color": "red",
                     "value": 1
                   }
                 ]
               }
             },
             "overrides": []
           },
           "options": {
             "reduceOptions": {
               "calcs": ["lastNotNull"],
               "fields": "",
               "values": false
             },
             "orientation": "horizontal",
             "colorMode": "value",
             "graphMode": "none",
             "justifyMode": "auto"
           },
           "gridPos": {
             "h": 6,
             "w": 6,
             "x": 6,
             "y": 0
           }
         },
         {
           "type": "timeseries",
           "title": "GC 次数趋势",
           "datasource": "prometheus-1",
           "targets": [
             {
               "expr": "rate(jvm_gc_pause_seconds_count{application=\"demo-app\"}[1m])",
               "interval": "",
               "legendFormat": "{{gc}}",
               "refId": "C"
             }
           ],
           "fieldConfig": {
             "defaults": {
               "unit": "none"
             },
             "overrides": []
           },
           "options": {
             "legend": {
               "displayMode": "table",
               "placement": "bottom"
             },
             "tooltip": {
               "mode": "single"
             }
           },
           "gridPos": {
             "h": 8,
             "w": 12,
             "x": 0,
             "y": 6
           }
         },
         {
           "type": "timeseries",
           "title": "GC 耗时趋势",
           "datasource": "prometheus-1",
           "targets": [
             {
               "expr": "rate(jvm_gc_pause_seconds_sum{application=\"demo-app\"}[1m])",
               "interval": "",
               "legendFormat": "{{gc}}",
               "refId": "D"
             }
           ],
           "fieldConfig": {
             "defaults": {
               "unit": "s"
             },
             "overrides": []
           },
           "options": {
             "legend": {
               "displayMode": "table",
               "placement": "bottom"
             },
             "tooltip": {
               "mode": "single"
             }
           },
           "gridPos": {
             "h": 8,
             "w": 12,
             "x": 12,
             "y": 6
           }
         }
       ],
       "schemaVersion": 34,
       "title": "JVM GC 仪表盘",
       "timezone": "browser",
       "version": 1
     }
     
  1. 想要实现controller和service接口的执行情况? 通过aop来实现 这个aop可以上报所有接口 tp 999 tp 99 tp 50 的情况
     import io.micrometer.core.instrument.MeterRegistry;
     import io.micrometer.core.instrument.Timer;
     import org.aspectj.lang.ProceedingJoinPoint;
     import org.aspectj.lang.annotation.Around;
     import org.aspectj.lang.annotation.Aspect;
     import org.springframework.stereotype.Component;
     
     @Aspect
     @Component
     public class MethodTimerAspect {
     
         private final MeterRegistry meterRegistry;
     
         public MethodTimerAspect(MeterRegistry meterRegistry) {
             this.meterRegistry = meterRegistry;
         }
     
         @Around("execution(* com.example..controller..*(..)) || execution(* com.example..service..*(..))")
         public Object timeMethods(ProceedingJoinPoint pjp) throws Throwable {
             String className = pjp.getTarget().getClass().getSimpleName();
             String methodName = pjp.getSignature().getName();
             String timerName = "method_execution_timer";
     
             Timer.Sample sample = Timer.start(meterRegistry);
     
             try {
                 return pjp.proceed();
             } finally {
                 sample.stop(
                         Timer.builder(timerName)
                                 .description("方法执行耗时统计")
                                 .tags("class", className, "method", methodName)
                                 .publishPercentiles(0.5, 0.9, 0.99, 0.999)
                                 .publishPercentileHistogram()
                                 .register(meterRegistry)
                 );
             }
         }
     
     }
     ```

     6. 如果你想要看看各个接口的执行情况导入下面这个json模板可以看到

        ```json
        {
          "annotations": {
            "list": []
          },
          "templating": {
            "list": [
              {
                "name": "datasource",
                "type": "datasource",
                "query": "prometheus",
                "label": "Prometheus 数据源",
                "hide": 0
              }
            ]
          },
          "panels": [
            {
              "type": "stat",
              "title": "接口 TP999 延迟 (ms)",
              "datasource": "prometheus-1",
              "targets": [
                {
                  "expr": "histogram_quantile(0.999, sum(rate(method_execution_time_bucket[1m])) by (le, method))",
                  "legendFormat": "{{method}}",
                  "refId": "A"
                }
              ],
              "fieldConfig": {
                "defaults": {
                  "unit": "ms",
                  "thresholds": {
                    "mode": "absolute",
                    "steps": [
                      { "color": "green", "value": null },
                      { "color": "yellow", "value": 1000 },
                      { "color": "red", "value": 2000 }
                    ]
                  }
                },
                "overrides": []
              },
              "options": {
                "reduceOptions": {
                  "calcs": ["max"],
                  "fields": "",
                  "values": false
                },
                "orientation": "horizontal",
                "colorMode": "value",
                "graphMode": "none",
                "justifyMode": "auto"
              },
              "gridPos": { "h": 6, "w": 12, "x": 0, "y": 0 }
            },
            {
              "type": "timeseries",
              "title": "接口 TP999 延迟趋势",
              "datasource": "prometheus-1",
              "targets": [
                {
                  "expr": "histogram_quantile(0.999, sum(rate(method_execution_time_bucket[1m])) by (le, method))",
                  "legendFormat": "{{method}}",
                  "refId": "B"
                }
              ],
              "fieldConfig": {
                "defaults": {
                  "unit": "ms"
                },
                "overrides": []
              },
              "options": {
                "legend": {
                  "displayMode": "table",
                  "placement": "bottom"
                },
                "tooltip": {
                  "mode": "single"
                }
              },
              "gridPos": { "h": 8, "w": 24, "x": 0, "y": 6 }
            },
            {
              "type": "stat",
              "title": "接口 QPS",
              "datasource": "${datasource}",
              "targets": [
                {
                  "expr": "sum(rate(http_server_requests_seconds_count[1m])) by (uri, method)",
                  "legendFormat": "{{uri}} {{method}}",
                  "refId": "C"
                }
              ],
              "fieldConfig": {
                "defaults": {
                  "unit": "ops",
                  "thresholds": {
                    "mode": "absolute",
                    "steps": [
                      { "color": "red", "value": null },
                      { "color": "yellow", "value": 5 },
                      { "color": "green", "value": 20 }
                    ]
                  }
                },
                "overrides": []
              },
              "options": {
                "reduceOptions": {
                  "calcs": ["max"],
                  "fields": "",
                  "values": false
                },
                "orientation": "horizontal",
                "colorMode": "value",
                "graphMode": "none",
                "justifyMode": "auto"
              },
              "gridPos": { "h": 6, "w": 12, "x": 0, "y": 14 }
            },
            {
              "type": "timeseries",
              "title": "接口 QPS 趋势",
              "datasource": "prometheus-1",
              "targets": [
                {
                  "expr": "sum(rate(http_server_requests_seconds_count[1m])) by (uri, method)",
                  "legendFormat": "{{uri}} {{method}}",
                  "refId": "D"
                }
              ],
              "fieldConfig": {
                "defaults": {
                  "unit": "ops"
                },
                "overrides": []
              },
              "options": {
                "legend": {
                  "displayMode": "table",
                  "placement": "bottom"
                },
                "tooltip": {
                  "mode": "single"
                }
              },
              "gridPos": { "h": 8, "w": 24, "x": 0, "y": 20 }
            },
            {
              "type": "stat",
              "title": "接口 5xx 错误数",
              "datasource": "prometheus-1",
              "targets": [
                {
                  "expr": "sum(rate(http_server_requests_seconds_count{status=~\"5..\"}[1m])) by (uri, method)",
                  "legendFormat": "{{uri}} {{method}}",
                  "refId": "E"
                }
              ],
              "fieldConfig": {
                "defaults": {
                  "unit": "short",
                  "thresholds": {
                    "mode": "absolute",
                    "steps": [
                      { "color": "green", "value": null },
                      { "color": "yellow", "value": 1 },
                      { "color": "red", "value": 5 }
                    ]
                  }
                },
                "overrides": []
              },
              "options": {
                "reduceOptions": {
                  "calcs": ["max"],
                  "fields": "",
                  "values": false
                },
                "orientation": "horizontal",
                "colorMode": "value",
                "graphMode": "none",
                "justifyMode": "auto"
              },
              "gridPos": { "h": 6, "w": 12, "x": 0, "y": 28 }
            },
            {
              "type": "timeseries",
              "title": "接口 5xx 错误数趋势",
              "datasource": "prometheus-1",
              "targets": [
                {
                  "expr": "sum(rate(http_server_requests_seconds_count{status=~\"5..\"}[1m])) by (uri, method)",
                  "legendFormat": "{{uri}} {{method}}",
                  "refId": "F"
                }
              ],
              "fieldConfig": {
                "defaults": {
                  "unit": "short"
                },
                "overrides": []
              },
              "options": {
                "legend": {
                  "displayMode": "table",
                  "placement": "bottom"
                },
                "tooltip": {
                  "mode": "single"
                }
              },
              "gridPos": { "h": 8, "w": 24, "x": 0, "y": 34 }
            }
          ],
          "title": "接口 TP999 + QPS + 错误监控",
          "timezone": "browser",
          "schemaVersion": 34,
          "version": 1
        }
        

✅通过上面的步骤你应该就能实现业务接口的监控,后续微服务式监控可能还会更新 敬请期待…