Monitoring Solutions
Metrics to Monitor
Container monitoring should cover the following dimensions:
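| Dimension | Metrics |
|---|---|
| Resource usage | CPU, memory, disk I/O, network I/O |
| Container state | Running/stopped, restart count, health status |
| Application metrics | Request volume, response time, error rate |
| Infrastructure | Host CPU, memory, disk |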
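Before a full stack is in place, the resource-usage dimension can be sampled directly from `docker stats`. The following is a minimal sketch, not a complete monitor: the function name `flag_high_usage` and the thresholds are illustrative assumptions, and the input is expected as `name cpu% mem%` lines, which is what `docker stats --no-stream --format '{{.Name}} {{.CPUPerc}} {{.MemPerc}}'` prints.

```shell
#!/bin/sh
# flag_high_usage CPU_LIMIT MEM_LIMIT   (hypothetical helper)
# Reads "name cpu% mem%" lines on stdin and prints every container whose
# CPU or memory percentage exceeds the given limit.
flag_high_usage() {
    awk -v cpu_limit="$1" -v mem_limit="$2" '
        {
            cpu = $2; mem = $3
            sub(/%$/, "", cpu); sub(/%$/, "", mem)   # strip the trailing percent sign
            if (cpu + 0 > cpu_limit + 0) print $1 ": CPU " cpu "%"
            if (mem + 0 > mem_limit + 0) print $1 ": memory " mem "%"
        }'
}

# Usage against a live daemon (requires Docker):
#   docker stats --no-stream --format '{{.Name}} {{.CPUPerc}} {{.MemPerc}}' | flag_high_usage 80 90
```

Keeping the parsing separate from the `docker stats` call makes the threshold logic easy to try out with canned input.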
cAdvisor + Prometheus + Grafana
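This is one of the most widely used container monitoring stacks.
Architecture
Containers → cAdvisor (collects container metrics) → Prometheus (stores metrics) → Grafana (visualization)
Quick Deployment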
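One prerequisite is easy to miss: the `docker-daemon` scrape job in `prometheus.yml` further down only returns data if the Docker daemon exposes its own metrics endpoint, which is configured through the `metrics-addr` key in `daemon.json`. The sketch below checks for that key. The function name `has_metrics_addr` is hypothetical, and the default path `/etc/docker/daemon.json` assumes a typical Linux install.

```shell
#!/bin/sh
# has_metrics_addr [FILE]   (hypothetical helper)
# Succeeds if FILE (default: /etc/docker/daemon.json, an assumed location)
# contains a "metrics-addr" key.
has_metrics_addr() {
    file="${1:-/etc/docker/daemon.json}"
    [ -r "$file" ] && grep -q '"metrics-addr"[[:space:]]*:' "$file"
}

if has_metrics_addr; then
    echo "Docker daemon metrics endpoint is configured"
else
    echo "metrics-addr not found: add e.g. \"metrics-addr\": \"0.0.0.0:9323\" to daemon.json and restart Docker"
fi
```

Binding to `0.0.0.0` makes the endpoint reachable from the Prometheus container but also from the network, so the port should be firewalled or bound to a narrower address where possible.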
yaml
# docker-compose.monitoring.yml
services:
  # Container metrics collection
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true
  # Metrics storage
  prometheus:
    image: prom/prometheus:latest
    restart: unless-stopped
    ports:
      - "9090:9090"
    # Lets Prometheus resolve host.docker.internal on Linux (Docker 20.10+)
    extra_hosts:
      - "host.docker.internal:host-gateway"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'
  # Visualization
  grafana:
    image: grafana/grafana:latest
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      # Example password only; change it before exposing Grafana
      GF_SECURITY_ADMIN_PASSWORD: admin123
      GF_USERS_ALLOW_SIGN_UP: "false"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards:ro
      - ./grafana/datasources:/etc/grafana/provisioning/datasources:ro
  # Host metrics collection
  node-exporter:
    image: prom/node-exporter:latest
    restart: unless-stopped
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
volumes:
  prometheus-data:
  grafana-data:
prometheus.yml:
yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  # Requires the Docker daemon to expose metrics on port 9323 (metrics-addr in daemon.json)
  - job_name: 'docker-daemon'
    static_configs:
      - targets: ['host.docker.internal:9323']
Grafana Configuration
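The Compose file mounts `./grafana/datasources` into Grafana's provisioning directory, so the Prometheus data source can also be declared as a file rather than added by hand. Here is a minimal sketch, assuming Grafana's standard data source provisioning format; the function name `write_datasource` and the file name `prometheus.yml` inside that directory are illustrative choices.

```shell
#!/bin/sh
# write_datasource DIR   (hypothetical helper)
# Writes a Grafana data source provisioning file into DIR that points at the
# "prometheus" service defined in the Compose file.
write_datasource() {
    dir="$1"
    mkdir -p "$dir"
    cat > "$dir/prometheus.yml" <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
EOF
}

# Usage, run from the directory that holds docker-compose.monitoring.yml:
#   write_datasource ./grafana/datasources
```

With such a file in place before Grafana starts, the manual data source step should no longer be needed.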
bash
# Start the monitoring stack
docker compose -f docker-compose.monitoring.yml up -d
# Open Grafana
# http://localhost:3000 (admin/admin123)
# Add the Prometheus data source:
# Configuration → Data Sources → Add data source → Prometheus
# URL: http://prometheus:9090
# Import a Docker monitoring dashboard:
# Dashboards → Import → enter ID: 193 (Docker and system monitoring)
Using Docker's Built-in Health Checks
dockerfile
# Define a health check in the Dockerfile (curl must be installed in the image)
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \
  CMD curl -f http://localhost:3000/health || exit 1
bash
# Check health status
docker ps # The STATUS column shows (healthy) or (unhealthy)
# View the health check history
docker inspect mycontainer --format \
  '{{range .State.Health.Log}}{{.Start}}: {{.ExitCode}} - {{.Output}}{{end}}'
Alerting Configuration (Alertmanager)
yaml
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alert@example.com'
route:
  group_by: ['alertname']
  receiver: 'team-email'
receivers:
  - name: 'team-email'
    email_configs:
      - to: 'ops@example.com'
  # DingTalk notifications (unused until a route references this receiver)
  - name: 'dingtalk'
    webhook_configs:
      - url: 'http://dingtalk-hook:8060/dingtalk/webhook'
Prometheus alerting rules:
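yaml
# alert.rules.yml (load it through rule_files in prometheus.yml)
groups:
  - name: container_alerts
    rules:
      # absent() only fires when no series match at all, so compare the last-seen timestamp instead
      - alert: ContainerDown
        expr: time() - container_last_seen{name=~".+"} > 60
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} is down"
      # Containers without a memory limit report a limit of 0 and are skipped
      - alert: HighMemoryUsage
        expr: (container_memory_usage_bytes{name=~".+"} / (container_spec_memory_limit_bytes > 0)) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} memory usage > 90%"
      # rate() of CPU seconds is measured in cores, so 0.8 means 80% of one core
      - alert: HighCPUUsage
        expr: sum by (name) (rate(container_cpu_usage_seconds_total{name=~".+"}[5m])) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} CPU usage > 80% of one core"
Simple Monitoring (Watch Script)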
For small projects, a simple script is often enough:
bash
#!/bin/bash
# monitor.sh
while true; do
  echo "=== $(date) ==="
  # Check container status
  docker ps --format "table {{.Names}}\t{{.Status}}"
  # Check resource usage
  docker stats --no-stream --format \
    "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"
  # Check for unhealthy containers
  UNHEALTHY=$(docker ps --filter health=unhealthy -q)
  if [ -n "$UNHEALTHY" ]; then
    echo "WARNING: Unhealthy containers found!"
    docker ps --filter health=unhealthy --format "{{.Names}}"
  fi
  sleep 60
done
Summary
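Recommendations for choosing a monitoring solution:
| Scale | Recommended solution |
|---|---|
| Small single-host project | Docker health checks + script-based alerting |
| Small to mid-sized team | cAdvisor + Prometheus + Grafana |
| Large production | Prometheus Operator + Kubernetes |
| Cloud environment | Managed monitoring from the cloud provider (CloudWatch / Alibaba Cloud CloudMonitor) |
The three core steps: collect metrics (cAdvisor) → store and analyze (Prometheus) → visualize and alert (Grafana + Alertmanager).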
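For the single-host option in the table, the script-based alerting part could look like the sketch below. It is an illustration under stated assumptions: `unhealthy_summary` and `notify` are hypothetical helper names, and `ALERT_WEBHOOK_URL` stands for whatever webhook endpoint accepts a JSON body with a `text` field.

```shell
#!/bin/sh
# unhealthy_summary   (hypothetical helper)
# Reads container names (one per line) on stdin and prints a one-line alert
# message, or nothing when the input is empty.
unhealthy_summary() {
    awk 'NF { names = names (names == "" ? "" : ", ") $1 }
         END { if (names != "") print "Unhealthy containers: " names }'
}

# notify MESSAGE   (hypothetical helper)
# Posts the message to the webhook in ALERT_WEBHOOK_URL (an assumed endpoint
# that accepts {"text": "..."}); adapt the payload to the actual receiver.
notify() {
    curl -fsS -X POST -H 'Content-Type: application/json' \
        -d "{\"text\": \"$1\"}" "$ALERT_WEBHOOK_URL"
}

# Usage (requires Docker and a reachable webhook):
#   msg=$(docker ps --filter health=unhealthy --format '{{.Names}}' | unhealthy_summary)
#   [ -n "$msg" ] && notify "$msg"
```

Run from cron or from the loop in `monitor.sh`, this turns the health check status into a notification instead of a log line.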