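Health Checks
Health checks monitor the state of backend servers and automatically take failed servers out of rotation, which improves system availability.
Passive health checks
Basic configuration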
nginx
upstream backend {
    # A server is skipped for 30s after 3 failed attempts within 30s
    server 192.168.1.10:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.11:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.12:8080 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name lb.example.com;

    location / {
        proxy_pass http://backend;
    }
}
Parameters
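max_fails
- Maximum number of failed attempts, counted within fail_timeout
- Default: 1 (0 disables the check for that server)
- On reaching this count the server is marked unavailable
fail_timeout
- Window in which max_fails failures must occur, and also how long the server then stays marked unavailable
- Default: 10s
- Once it has elapsed, nginx tries the server again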
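How the two parameters interact is easier to see in a small model. The following Python sketch is an illustration under simplifying assumptions, not nginx's actual implementation: the class name `PassiveState` and its methods are hypothetical, and time is passed in explicitly (in seconds) so the behaviour is deterministic.

```python
from dataclasses import dataclass


@dataclass
class PassiveState:
    """Simplified per-server model of max_fails / fail_timeout (illustration only)."""
    max_fails: int = 1           # nginx default
    fail_timeout: float = 10.0   # seconds, nginx default
    fails: int = 0               # failures counted in the current window
    window_start: float = 0.0    # time of the first failure in the window
    down_until: float = 0.0      # server is skipped until this time

    def available(self, now: float) -> bool:
        return now >= self.down_until

    def report_failure(self, now: float) -> None:
        # Failures older than fail_timeout no longer count.
        if self.fails == 0 or now - self.window_start > self.fail_timeout:
            self.fails = 0
            self.window_start = now
        self.fails += 1
        # max_fails=0 disables the check entirely.
        if self.max_fails > 0 and self.fails >= self.max_fails:
            self.down_until = now + self.fail_timeout
            self.fails = 0

    def report_success(self, now: float) -> None:
        self.fails = 0


state = PassiveState(max_fails=3, fail_timeout=30)
for t in (0, 1, 2):              # three failures within the 30s window
    state.report_failure(t)
print(state.available(5))        # False: marked unavailable until t=32
print(state.available(32))       # True: fail_timeout has elapsed
```

In this model a server that fails only occasionally never reaches max_fails, because failures outside the window are discarded.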
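How it works
1. Requests are sent to server A.
2. Server A fails max_fails times within fail_timeout.
3. Server A is marked unavailable.
4. Requests are sent to the other servers.
5. After fail_timeout has elapsed, server A is tried again.
Active health checks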
Using a third-party module
Active checks as shown here require the third-party nginx_upstream_check_module, which has to be built into nginx.
nginx
upstream backend {
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    server 192.168.1.12:8080;

    # Probe every 3s; 2 successes mark a server up, 3 failures mark it down
    check interval=3000 rise=2 fall=3 timeout=1000 type=http;
    check_http_send "HEAD /health HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
}

server {
    listen 80;
    server_name lb.example.com;

    location / {
        proxy_pass http://backend;
    }

    # Status page provided by the check module
    location /status {
        check_status;
        access_log off;
    }
}
Parameters
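interval
- Interval between checks
- Unit: milliseconds
rise
- Success count
- The server is marked available after this many consecutive successful checks
fall
- Failure count
- The server is marked unavailable after this many consecutive failed checks
timeout
- Timeout for a single check
- Unit: milliseconds
type
- Check type
- http, tcp, ssl_hello, mysql, ajp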
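The effect of rise and fall can be sketched as a small state machine. This is a simplified Python model, not the module's code: `ActiveCheck`, `is_alive`, and the `initially_up` parameter are names invented for the illustration. `is_alive` mirrors an expectation of `http_2xx http_3xx`.

```python
def is_alive(status: int) -> bool:
    """Treat 2xx and 3xx responses as healthy."""
    return 200 <= status < 400


class ActiveCheck:
    """Tracks one server's state from a stream of probe results."""

    def __init__(self, rise: int = 2, fall: int = 3, initially_up: bool = True):
        self.rise = rise
        self.fall = fall
        self.up = initially_up
        self.streak = 0  # consecutive results that contradict the current state

    def record(self, ok: bool) -> bool:
        if ok == self.up:
            # Result agrees with the current state: the streak is broken.
            self.streak = 0
        else:
            self.streak += 1
            threshold = self.rise if ok else self.fall
            if self.streak >= threshold:
                self.up = ok
                self.streak = 0
        return self.up


check = ActiveCheck(rise=2, fall=3)
statuses = [200, 502, 502, 502, 200, 200]   # made-up probe results
print([check.record(is_alive(s)) for s in statuses])
# [True, True, True, False, False, True]
```

Requiring several consecutive results in each direction keeps a single slow or failed probe from changing the server's state.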
Complete configuration
Passive health checks
nginx
upstream backend {
    server 192.168.1.10:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.11:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.12:8080 max_fails=3 fail_timeout=30s;
    keepalive 32;
}

server {
    listen 80;
    server_name lb.example.com;
    access_log /var/log/nginx/lb.access.log;
    error_log /var/log/nginx/lb.error.log;

    location / {
        proxy_pass http://backend;
        # Upstream keepalive needs HTTP/1.1 and an empty Connection header
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_connect_timeout 5s;
        proxy_send_timeout 5s;
        proxy_read_timeout 5s;
        # Retry on another server for these failures, at most 2 tries in total
        proxy_next_upstream error timeout http_502 http_503 http_504;
        proxy_next_upstream_tries 2;
    }
}
Active health checks
nginx
upstream backend {
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    server 192.168.1.12:8080;

    check interval=5000 rise=2 fall=3 timeout=2000 type=http;
    check_http_send "GET /health HTTP/1.0\r\nHost: backend\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
    keepalive 32;
}

server {
    listen 80;
    server_name lb.example.com;

    location / {
        proxy_pass http://backend;
        # Upstream keepalive needs HTTP/1.1 and an empty Connection header
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    # Status page, reachable from localhost only
    location /status {
        check_status;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
}
Health check endpoints
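A backend that is not itself nginx has to answer the probe path on its own. As a minimal sketch using only the Python standard library, the handler below returns 200 for GET and HEAD requests to /health. The port 8080 and the /health path follow the configurations in this document; `HealthHandler` and `make_server` are hypothetical names, and a real service would usually check its own dependencies before answering.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers /health with 200 and everything else with 404."""

    def _respond(self, send_body: bool) -> None:
        if self.path == "/health":
            status, body = 200, b"OK\n"
        else:
            status, body = 404, b"not found\n"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        if send_body:
            self.wfile.write(body)

    def do_GET(self) -> None:
        self._respond(send_body=True)

    def do_HEAD(self) -> None:
        # HEAD probes get the same status and headers, without a body.
        self._respond(send_body=False)

    def log_message(self, format, *args) -> None:
        pass  # keep probe traffic out of the log, similar to access_log off


def make_server(host: str = "0.0.0.0", port: int = 8080) -> HTTPServer:
    """Create the server; call serve_forever() on the result to start it."""
    return HTTPServer((host, port), HealthHandler)
```

Calling `make_server().serve_forever()` starts the endpoint on port 8080.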
Backend health check
nginx
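server {
    listen 8080;
    server_name backend.example.com;

    location /health {
        access_log off;
        # default_type sets the Content-Type of a "return" body;
        # add_header would emit a second Content-Type header
        default_type text/plain;
        return 200 "OK\n";
    }
}
Database check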
nginx
upstream backend {
    server 192.168.1.10:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.11:8080 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name lb.example.com;

    # Assumption: the backend application's /health handler performs the
    # database check; nginx only forwards the request
    location /health {
        proxy_pass http://backend/health;
        proxy_next_upstream error timeout http_502 http_503 http_504;
    }
}
Monitoring and logging
Logging health checks
nginx
# Standard access log fields plus the upstream address and status
log_format health '$remote_addr - $remote_user [$time_local] "$request" '
                  '$status $body_bytes_sent "$http_referer" '
                  '"$http_user_agent" "$upstream_addr" '
                  '"$upstream_status"';
access_log /var/log/nginx/health.log health;
Status monitoring
nginx
server {
    listen 80;
    server_name status.example.com;

    # Basic connection counters, reachable from localhost only
    location /nginx_status {
        stub_status on;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
}
Failover
Automatic failover
nginx
upstream backend {
    server 192.168.1.10:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.11:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.12:8080 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name lb.example.com;

    location / {
        proxy_pass http://backend;
        # Pass the request to another server on these failures, at most 2 tries
        proxy_next_upstream error timeout http_502 http_503 http_504;
        proxy_next_upstream_tries 2;
    }
}
Backup servers
nginx
upstream backend {
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    # Receives traffic only when the primary servers are unavailable
    server 192.168.1.12:8080 backup;
}

server {
    listen 80;
    server_name lb.example.com;

    location / {
        proxy_pass http://backend;
    }
}
Common problems
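Servers flap between available and unavailable
Cause: max_fails and fail_timeout are not suited to the backend's normal error rate.
Fix: adjust both parameters, for example by raising them.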
nginx
upstream backend {
    server 192.168.1.10:8080 max_fails=5 fail_timeout=60s;
    server 192.168.1.11:8080 max_fails=5 fail_timeout=60s;
}
Health check fails
Cause: the health check endpoint is misconfigured.
Fix: verify the health check endpoint on the backend.
nginx
server {
    listen 8080;

    location /health {
        access_log off;
        return 200 "OK\n";
    }
}
Summary
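Key points for health checks:
- Passive checks: max_fails and fail_timeout
- Active checks: provided by a third-party module
- Health endpoint: expose a /health path
- Failover: requests switch automatically to an available server
- Monitoring and logging: record health check status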
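The monitoring point can be made concrete with a short script. As a sketch, assuming log lines in the `health` format from this document (where `"$upstream_addr"` and `"$upstream_status"` are the last two quoted fields), the following Python function counts 5xx responses per upstream address. The function name `upstream_failures` and the sample line are invented for the illustration.

```python
import re
from collections import Counter

# Last two quoted fields of the `health` format: "$upstream_addr" "$upstream_status"
TAIL = re.compile(r'"([^"]*)" "([^"]*)"\s*$')
# nginx separates multiple attempts with ", " and server groups with " : "
SEP = re.compile(r", | : ")


def upstream_failures(lines):
    """Count 5xx responses per upstream address."""
    failures = Counter()
    for line in lines:
        match = TAIL.search(line)
        if match is None:
            continue  # line is not in the expected format
        addrs = SEP.split(match.group(1))
        statuses = SEP.split(match.group(2))
        for addr, status in zip(addrs, statuses):
            if status.startswith("5"):
                failures[addr] += 1
    return failures


# Invented sample: the first server answered 502, the retry on the second succeeded.
sample = [
    '10.0.0.5 - - [01/Jan/2024:00:00:00 +0000] "GET / HTTP/1.1" 200 2 '
    '"-" "curl/8.0" "192.168.1.10:8080, 192.168.1.11:8080" "502, 200"',
]
print(upstream_failures(sample))  # Counter({'192.168.1.10:8080': 1})
```

A server whose count keeps growing is a candidate for closer inspection, even if retries hide the errors from clients.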
Well-tuned health checks improve both the availability and the stability of the system.