Monitoring Solutions

Monitoring Metrics

Container monitoring should cover the following dimensions:

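Dimension          Metrics
Resource usage     CPU, memory, disk I/O, network I/O
Container state    Running/stopped, restart count, health status
Application        Request volume, response time, error rate
Infrastructure     Host CPU, memory, disk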

cAdvisor + Prometheus + Grafana

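This is one of the most widely used container monitoring stacks.

Architecture

Containers → cAdvisor (collects container metrics) → Prometheus (scrapes and stores metrics) → Grafana (visualization)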

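Quick Deployment

yaml
# docker-compose.monitoring.yml
services:
  # Container metrics collection
  cadvisor:
    # For reproducible deployments, consider pinning a release tag instead of :latest
    image: gcr.io/cadvisor/cadvisor:latest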
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true

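  # Metrics storage
  prometheus:
    image: prom/prometheus:latest
    restart: unless-stopped
    ports:
      - "9090:9090"
    # On Linux, host.docker.internal resolves only when mapped explicitly
    # (needed by the docker-daemon scrape job in prometheus.yml)
    extra_hosts:
      - "host.docker.internal:host-gateway"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'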

  # Visualization
  grafana:
    image: grafana/grafana:latest
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      # Example password only; change it before exposing Grafana
      GF_SECURITY_ADMIN_PASSWORD: admin123
      GF_USERS_ALLOW_SIGN_UP: "false"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards:ro
      - ./grafana/datasources:/etc/grafana/provisioning/datasources:ro

  # Host metrics collection
  node-exporter:
    image: prom/node-exporter:latest
    restart: unless-stopped
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'

volumes:
  prometheus-data:
  grafana-data:

prometheus.yml:

yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

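  # Requires the Docker daemon metrics endpoint to be enabled ("metrics-addr" in
  # /etc/docker/daemon.json) on port 9323, bound to an address the Prometheus
  # container can reach (a 127.0.0.1 binding is not reachable from a container)
  - job_name: 'docker-daemon'
    static_configs:
      - targets: ['host.docker.internal:9323']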

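Grafana Configuration

bash
# Start the monitoring stack
docker compose -f docker-compose.monitoring.yml up -d

# Open Grafana
# http://localhost:3000 (admin/admin123)

# Add the Prometheus data source:
# Configuration → Data Sources → Add data source → Prometheus
# (menu names vary by Grafana version; newer releases list this under Connections)
# URL: http://prometheus:9090

# Import a Docker monitoring dashboard:
# Dashboards → Import → enter ID: 193
# (a community Docker dashboard; confirm the ID and title on grafana.com before importing)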
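
Once the stack is running, it helps to confirm that Prometheus is actually scraping the three jobs from prometheus.yml. The following is a minimal sketch, assuming Prometheus is published on localhost:9090 as in the compose file; `summarize_targets` is a hypothetical helper name. It counts the `health` values in the response of the Prometheus `/api/v1/targets` HTTP API with plain `grep`, so it needs no JSON tooling.

```shell
# summarize_targets: read a /api/v1/targets JSON response on stdin and
# print how many active targets report health "up" and how many "down".
# The function body runs in a subshell, so its variables stay local.
summarize_targets() (
  json=$(cat)
  up=$(printf '%s' "$json" \
    | grep -oE '"health"[[:space:]]*:[[:space:]]*"up"' | wc -l | tr -d '[:space:]')
  down=$(printf '%s' "$json" \
    | grep -oE '"health"[[:space:]]*:[[:space:]]*"down"' | wc -l | tr -d '[:space:]')
  echo "up=${up} down=${down}"
)

# Usage (assumes the port mapping from docker-compose.monitoring.yml):
#   curl -fsS http://localhost:9090/api/v1/targets | summarize_targets
```

A non-zero `down` count usually points at the docker-daemon job, since it depends on daemon-side configuration outside the compose file.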

Using Docker's Built-in Health Checks

dockerfile
# Define a health check in the Dockerfile (curl must be installed in the image)
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \
  CMD curl -f http://localhost:3000/health || exit 1

bash
# Check health status
docker ps  # The STATUS column shows (healthy) or (unhealthy)

# View the health check history
docker inspect mycontainer --format \
  '{{range .State.Health.Log}}{{.Start}}: {{.ExitCode}} - {{.Output}}{{end}}'
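
Deployment scripts often need to block until a container reports healthy. The sketch below polls the same `.State.Health.Status` field that `docker inspect` exposes; `wait_healthy` and its default timeout and interval are hypothetical choices, not part of Docker itself. A container without a HEALTHCHECK has no health status, so the function simply runs until the timeout in that case.

```shell
# wait_healthy CONTAINER [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Exit status: 0 = healthy, 1 = unhealthy, 2 = timed out.
# The body runs in a subshell, so "exit" only ends the function.
wait_healthy() (
  container=$1
  timeout=${2:-120}
  interval=${3:-5}
  elapsed=0
  while [ "$elapsed" -le "$timeout" ]; do
    # Status is "starting", "healthy" or "unhealthy"; empty if inspect fails
    status=$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null)
    case "$status" in
      healthy)   echo "$container is healthy"; exit 0 ;;
      unhealthy) echo "$container is unhealthy" >&2; exit 1 ;;
    esac
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "timed out waiting for $container" >&2
  exit 2
)

# Usage:
#   wait_healthy mycontainer 60 2 && echo "safe to route traffic"
```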

Alerting (Alertmanager)

yaml
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alert@example.com'

route:
  group_by: ['alertname']
  receiver: 'team-email'

receivers:
  - name: 'team-email'
    email_configs:
      - to: 'ops@example.com'
  
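  # DingTalk notification (defined but not referenced by the route above;
  # add a child route or change the default receiver to use it)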
  - name: 'dingtalk'
    webhook_configs:
      - url: 'http://dingtalk-hook:8060/dingtalk/webhook'

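Prometheus alert rules. For these to take effect, prometheus.yml must list this file under rule_files and point to Alertmanager under alerting; the prometheus.yml and compose file shown earlier include neither of these, nor an Alertmanager service: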

yaml
# alert.rules.yml
groups:
  - name: container_alerts
    rules:
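      - alert: ContainerDown
        # absent() fires only when no series match at all and carries no name label,
        # so compare each container's last-seen timestamp instead.
        # The 60s threshold already provides the delay.
        expr: time() - container_last_seen{name=~".+"} > 60
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} is down"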

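      - alert: HighMemoryUsage
        # Containers without a memory limit report a limit of 0;
        # filter them out so the division does not yield +Inf
        expr: (container_memory_usage_bytes{name=~".+"} / (container_spec_memory_limit_bytes{name=~".+"} > 0)) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} memory usage > 90%"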

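      - alert: HighCPUUsage
        # rate() of CPU seconds is measured in cores, not percent:
        # 0.8 means 0.8 of one core, summed across the container's CPUs
        expr: sum by (name) (rate(container_cpu_usage_seconds_total{name=~".+"}[5m])) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} is using more than 0.8 CPU cores"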
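
A syntax error in a rule file or in alertmanager.yml prevents the configuration from loading, so it is worth validating both before a reload. The following sketch runs `promtool check rules` and `amtool check-config` from the official images. It assumes the binaries live at /bin/promtool and /bin/amtool inside those images; the helper names and the /tmp mount paths are hypothetical.

```shell
# abs_path FILE: print an absolute path for FILE (bind mounts need one).
abs_path() (
  cd "$(dirname "$1")" && printf '%s/%s\n' "$(pwd)" "$(basename "$1")"
)

# check_rules FILE: validate a Prometheus rule file with promtool.
check_rules() {
  docker run --rm \
    -v "$(abs_path "$1"):/tmp/rules.yml:ro" \
    --entrypoint /bin/promtool \
    prom/prometheus:latest check rules /tmp/rules.yml
}

# check_alertmanager_config FILE: validate an Alertmanager config with amtool.
check_alertmanager_config() {
  docker run --rm \
    -v "$(abs_path "$1"):/tmp/alertmanager.yml:ro" \
    --entrypoint /bin/amtool \
    prom/alertmanager:latest check-config /tmp/alertmanager.yml
}

# Usage:
#   check_rules alert.rules.yml && check_alertmanager_config alertmanager.yml
```

Both tools exit with a non-zero status on invalid input, so the two calls can be chained in a CI step.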

Simple Monitoring (Watch Script)

For small projects, a simple script is enough:

bash
#!/bin/bash
# monitor.sh

while true; do
  echo "=== $(date) ==="
  
  # Check container status
  docker ps --format "table {{.Names}}\t{{.Status}}"

  # Check resource usage
  docker stats --no-stream --format \
    "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"

  # Check for unhealthy containers
  UNHEALTHY=$(docker ps --filter health=unhealthy -q)
  if [ -n "$UNHEALTHY" ]; then
    echo "WARNING: Unhealthy containers found!"
    docker ps --filter health=unhealthy --format "{{.Names}}"
  fi
  
  sleep 60
done
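
The script above only prints resource usage; a threshold check turns it into a basic alert. As a minimal sketch, the hypothetical helper `flag_high_cpu` reads "name cpu%" pairs, as produced by a space-separated `docker stats` format string, and prints a warning for each container above the given percentage. Note that `docker stats` can report more than 100% on multi-core hosts.

```shell
# flag_high_cpu THRESHOLD: read lines of "NAME CPU%" on stdin and print a
# warning for every container whose CPU percentage exceeds THRESHOLD.
flag_high_cpu() {
  awk -v limit="$1" '
    {
      cpu = $2
      sub(/%$/, "", cpu)            # "85.20%" -> "85.20"
      if (cpu + 0 > limit + 0) {    # force numeric comparison
        printf "WARNING: %s CPU at %s%% (threshold %s%%)\n", $1, cpu, limit
      }
    }'
}

# Usage inside the monitoring loop:
#   docker stats --no-stream --format '{{.Name}} {{.CPUPerc}}' | flag_high_cpu 80
```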

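Summary

Recommendations for choosing a monitoring solution:

Scale                         Recommended solution
Single-host small project     Docker health checks + script-based alerts
Small to mid-sized team       cAdvisor + Prometheus + Grafana
Large-scale production        Prometheus Operator + Kubernetes
Cloud environment             Managed monitoring from the cloud provider (CloudWatch / Alibaba Cloud Monitor)

Three core steps: collect metrics (cAdvisor) → store and analyze (Prometheus) → visualize and alert (Grafana + Alertmanager).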