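Health Checks

Health checks monitor the state of backend servers and automatically take failed servers out of rotation, which improves system availability.

Passive Health Checks

Basic Configuration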

nginx
upstream backend {
    server 192.168.1.10:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.11:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.12:8080 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name lb.example.com;

    location / {
        proxy_pass http://backend;
    }
}

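Parameters

max_fails

  • Number of failed attempts that must occur within fail_timeout before the server is marked unavailable
  • Default: 1
  • Once this count is reached, the server is marked unavailable

fail_timeout

  • The window in which max_fails failed attempts must occur, and also how long the server then stays marked unavailable
  • Default: 10s
  • After this period, nginx tries the server again

How It Works

1. A request is sent to server A.
2. Requests to server A fail max_fails times within fail_timeout.
3. Server A is marked unavailable.
4. Requests are sent to the other servers.
5. After fail_timeout has elapsed, nginx tries server A again.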
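
The steps above can be compared across servers with different settings. The following is a minimal sketch that reuses the addresses from the basic configuration; the mix of values is an illustrative assumption, not a recommendation. What counts as a failed attempt is determined by proxy_next_upstream.

```nginx
upstream backend {
    # Marked unavailable for 30s after 3 failed attempts within 30s.
    server 192.168.1.10:8080 max_fails=3 fail_timeout=30s;

    # No parameters: the defaults apply (max_fails=1, fail_timeout=10s),
    # so a single failed attempt takes this server out for 10s.
    server 192.168.1.11:8080;

    # max_fails=0 disables failure accounting, so this server is
    # never marked unavailable by passive checks.
    server 192.168.1.12:8080 max_fails=0;
}
```

Passive checks only learn about a failure from real client traffic, so at least one request has to fail before a server is taken out of rotation.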

Active Health Checks

Using a Third-Party Module

This requires nginx_upstream_check_module, a third-party module that has to be compiled into nginx.

nginx
upstream backend {
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    server 192.168.1.12:8080;

    check interval=3000 rise=2 fall=3 timeout=1000 type=http;
    check_http_send "HEAD /health HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
}

server {
    listen 80;
    server_name lb.example.com;

    location / {
        proxy_pass http://backend;
    }

    location /status {
        check_status;
        access_log off;
    }
}

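Parameters

interval

  • Interval between checks
  • Unit: milliseconds

rise

  • Success threshold
  • The server is marked available after this many consecutive successful checks

fall

  • Failure threshold
  • The server is marked unavailable after this many consecutive failed checks

timeout

  • Timeout for a single check
  • Unit: milliseconds

type

  • Check type
  • One of http, tcp, ssl_hello, mysql, ajp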
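
For backends that do not expose an HTTP health endpoint, the same parameters can be combined with a plain TCP check. This is a sketch that assumes nginx_upstream_check_module is compiled in and reuses the example addresses; the threshold values are illustrative.

```nginx
upstream backend {
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;

    # Probe every 3000 ms; a probe succeeds if the TCP connection
    # opens within 1000 ms. 2 consecutive successes mark a server
    # available, 3 consecutive failures mark it unavailable.
    check interval=3000 rise=2 fall=3 timeout=1000 type=tcp;
}
```

A TCP check only proves that the port accepts connections. The http type with check_http_send and check_http_expect_alive is the better choice when the application can report its own health.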

Complete Configuration

Passive Health Checks

nginx
upstream backend {
    server 192.168.1.10:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.11:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.12:8080 max_fails=3 fail_timeout=30s;

    keepalive 32;
}

server {
    listen 80;
    server_name lb.example.com;

    access_log /var/log/nginx/lb.access.log;
    error_log /var/log/nginx/lb.error.log;

    location / {
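        proxy_pass http://backend;

        # HTTP/1.1 with an empty Connection header lets nginx reuse
        # the upstream connections kept open by "keepalive 32"
        proxy_http_version 1.1;
        proxy_set_header Connection "";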

        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        proxy_connect_timeout 5s;
        proxy_send_timeout 5s;
        proxy_read_timeout 5s;

        proxy_next_upstream error timeout http_502 http_503 http_504;
        proxy_next_upstream_tries 2;
    }
}

Active Health Checks

nginx
upstream backend {
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    server 192.168.1.12:8080;

    check interval=5000 rise=2 fall=3 timeout=2000 type=http;
    check_http_send "GET /health HTTP/1.0\r\nHost: backend\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;

    keepalive 32;
}

server {
    listen 80;
    server_name lb.example.com;

    location / {
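        proxy_pass http://backend;

        # HTTP/1.1 with an empty Connection header lets nginx reuse
        # the upstream connections kept open by "keepalive 32"
        proxy_http_version 1.1;
        proxy_set_header Connection "";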

        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    location /status {
        check_status;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
}

Health Check Endpoints

Backend Health Endpoint

nginx
server {
    listen 8080;
    server_name backend.example.com;

    location /health {
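        access_log off;
        # default_type sets the Content-Type of the returned body;
        # add_header would emit a second Content-Type header
        default_type text/plain;
        return 200 "OK\n";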
    }
}

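Database Check

The configuration below only proxies /health to the upstream. Any database check has to be implemented in the backend's /health handler.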

nginx
upstream backend {
    server 192.168.1.10:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.11:8080 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name lb.example.com;

    location /health {
        proxy_pass http://backend/health;
        proxy_next_upstream error timeout http_502 http_503 http_504;
    }
}

Monitoring and Logging

Logging Health Check Results

nginx
log_format health '$remote_addr - $remote_user [$time_local] "$request" '
                  '$status $body_bytes_sent "$http_referer" '
                  '"$http_user_agent" "$upstream_addr" '
                  '"$upstream_status"';

access_log /var/log/nginx/health.log health;
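
To see how slow an upstream was, and not only which one answered, the format can be extended with nginx's timing variables. This is a sketch; the format name health_timing and the log path are hypothetical.

```nginx
# Hypothetical format name and log path, shown for illustration only.
log_format health_timing '$remote_addr [$time_local] "$request" $status '
                         'upstream=$upstream_addr '
                         'upstream_status=$upstream_status '
                         'connect=$upstream_connect_time '
                         'response=$upstream_response_time '
                         'total=$request_time';

access_log /var/log/nginx/health_timing.log health_timing;
```

When a request is retried on another server, $upstream_addr and $upstream_status list every server that was tried, separated by commas, which makes failovers visible in the log.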

Status Monitoring

nginx
server {
    listen 80;
    server_name status.example.com;

    location /nginx_status {
        stub_status;  # the "on" argument is only required before nginx 1.7.5
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
}

Failover

Automatic Failover

nginx
upstream backend {
    server 192.168.1.10:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.11:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.12:8080 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name lb.example.com;

    location / {
        proxy_pass http://backend;

        proxy_next_upstream error timeout http_502 http_503 http_504;
        proxy_next_upstream_tries 2;
    }
}
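
Two related settings are worth knowing when tuning failover. The sketch below belongs inside the server block; the /api/idempotent/ path and the 10s limit are hypothetical. By default nginx does not pass POST, LOCK, or PATCH requests to the next server once the request has been sent to an upstream; the non_idempotent parameter lifts that restriction.

```nginx
location / {
    proxy_pass http://backend;

    proxy_next_upstream error timeout http_502 http_503 http_504;
    proxy_next_upstream_tries 2;
    # Stop passing the request to further servers once 10s
    # have been spent on it in total.
    proxy_next_upstream_timeout 10s;
}

# Hypothetical location whose requests are safe to repeat.
location /api/idempotent/ {
    proxy_pass http://backend;
    # Also retry POST, LOCK, and PATCH requests on the next server.
    proxy_next_upstream error timeout non_idempotent;
}
```

Enable non_idempotent only where repeating a request cannot cause duplicate side effects, such as a double charge or a double insert.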

Backup Servers

nginx
upstream backend {
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    server 192.168.1.12:8080 backup;
}

server {
    listen 80;
    server_name lb.example.com;

    location / {
        proxy_pass http://backend;
    }
}
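
A server marked backup receives requests only when the primary servers are unavailable. For planned maintenance, the down parameter takes a server out of rotation by hand. This is a sketch using the same example addresses; which server is marked down is an arbitrary choice for illustration.

```nginx
upstream backend {
    server 192.168.1.10:8080;
    # Manually removed from rotation, for example during maintenance.
    server 192.168.1.11:8080 down;
    # Receives traffic only when the primary servers are unavailable.
    server 192.168.1.12:8080 backup;
}
```

Note that backup cannot be combined with the hash, ip_hash, or random load-balancing methods.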

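Common Problems

Servers Flap Between Available and Unavailable

Cause: max_fails and fail_timeout are set too aggressively, so a few transient errors are enough to mark a server unavailable.

Fix: raise max_fails and lengthen fail_timeout.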

nginx
upstream backend {
    server 192.168.1.10:8080 max_fails=5 fail_timeout=60s;
    server 192.168.1.11:8080 max_fails=5 fail_timeout=60s;
}

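Health Checks Fail

Cause: the health check endpoint is misconfigured.

Fix: verify that the backend actually serves the health check endpoint.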

nginx
server {
    listen 8080;

    location /health {
        access_log off;
        return 200 "OK\n";
    }
}

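Summary

Key points for health checks:

  • Passive checks: max_fails and fail_timeout
  • Active checks: provided by a third-party module
  • Health endpoint: expose a /health endpoint
  • Failover: traffic switches automatically to available servers
  • Monitoring and logging: record health check status

Configured sensibly, health checks improve system availability and stability.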