API 网关的日志与监控功能:构建可观测的微服务系统
2025/8/30大约 8 分钟
在现代分布式系统中,可观测性是确保系统稳定性和可维护性的关键因素。API 网关作为系统的统一入口,承担着收集和分析系统运行数据的重要职责。通过完善的日志记录和监控机制,运维团队能够实时了解系统状态,快速定位和解决问题。本文将深入探讨 API 网关的日志与监控功能及其最佳实践。
日志记录机制
日志的重要性
日志是系统运行状态的重要反映,对于 API 网关而言,日志记录具有以下重要意义:
- 问题诊断:帮助快速定位和解决系统问题
- 安全审计:记录安全相关事件,支持安全审计
- 性能分析:分析系统性能瓶颈和优化点
- 业务分析:了解用户行为和业务趋势
- 合规要求:满足行业和法规的合规要求
日志分类
访问日志(Access Log)
访问日志记录所有通过 API 网关的请求信息:
// 示例:访问日志记录实现
@Component
public class AccessLogFilter implements GlobalFilter {
private static final Logger accessLogger = LoggerFactory.getLogger("access-log");
@Override
public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
long startTime = System.currentTimeMillis();
ServerHttpRequest request = exchange.getRequest();
ServerHttpResponse response = exchange.getResponse();
return chain.filter(exchange).then(
Mono.fromRunnable(() -> {
long duration = System.currentTimeMillis() - startTime;
String logMessage = String.format(
"%s %s %s %d %dms %s %s",
getClientIp(request),
request.getMethod(),
request.getPath(),
response.getStatusCode().value(),
duration,
request.getHeaders().getFirst("User-Agent"),
getUserId(request)
);
accessLogger.info(logMessage);
})
);
}
private String getClientIp(ServerHttpRequest request) {
String xForwardedFor = request.getHeaders().getFirst("X-Forwarded-For");
if (xForwardedFor != null && !xForwardedFor.isEmpty()) {
return xForwardedFor.split(",")[0].trim();
}
return request.getRemoteAddress().getAddress().getHostAddress();
}
private String getUserId(ServerHttpRequest request) {
// 从请求中提取用户ID
return request.getHeaders().getFirst("X-User-ID");
}
}
错误日志(Error Log)
错误日志记录系统运行过程中的异常和错误信息:
// 示例:错误日志记录实现
@Component
public class ErrorLogFilter implements GlobalFilter {
private static final Logger errorLogger = LoggerFactory.getLogger("error-log");
@Override
public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
return chain.filter(exchange).onErrorResume(throwable -> {
ServerHttpRequest request = exchange.getRequest();
errorLogger.error(
"Request processing failed: {} {} from {} - Error: {}",
request.getMethod(),
request.getPath(),
getClientIp(request),
throwable.getMessage(),
throwable
);
// 返回错误响应
ServerHttpResponse response = exchange.getResponse();
response.setStatusCode(HttpStatus.INTERNAL_SERVER_ERROR);
return response.writeWith(Mono.just(response.bufferFactory()
.wrap("Internal Server Error".getBytes())));
});
}
}
审计日志(Audit Log)
审计日志记录敏感操作和安全相关事件:
// 示例:审计日志记录实现
@Component
public class AuditLogFilter implements GlobalFilter {
private static final Logger auditLogger = LoggerFactory.getLogger("audit-log");
@Override
public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
ServerHttpRequest request = exchange.getRequest();
// 检查是否需要审计
if (requiresAudit(request)) {
String auditMessage = String.format(
"AUDIT: User %s performed %s on %s at %s",
getUserId(request),
request.getMethod(),
request.getPath(),
Instant.now()
);
auditLogger.info(auditMessage);
}
return chain.filter(exchange);
}
private boolean requiresAudit(ServerHttpRequest request) {
// 定义需要审计的操作
String path = request.getPath().value();
return path.startsWith("/api/admin/") ||
path.startsWith("/api/users/") ||
request.getMethod() == HttpMethod.POST ||
request.getMethod() == HttpMethod.PUT ||
request.getMethod() == HttpMethod.DELETE;
}
}
日志格式标准化
采用标准化的日志格式便于日志分析和处理:
// 示例:标准化日志格式
public class LogFormatter {
public static String formatAccessLog(ServerHttpRequest request,
ServerHttpResponse response,
long duration) {
JsonObject logObject = new JsonObject();
logObject.addProperty("timestamp", Instant.now().toString());
logObject.addProperty("client_ip", getClientIp(request));
logObject.addProperty("method", request.getMethod().name());
logObject.addProperty("path", request.getPath().value());
logObject.addProperty("status_code", response.getStatusCode().value());
logObject.addProperty("duration_ms", duration);
logObject.addProperty("user_agent", request.getHeaders().getFirst("User-Agent"));
logObject.addProperty("user_id", getUserId(request));
logObject.addProperty("request_id", getRequestID(request));
return logObject.toString();
}
}
日志配置管理
# 示例:日志配置
logging:
level:
root: INFO
access-log: INFO
error-log: ERROR
audit-log: INFO
pattern:
console: "%d{yyyy-MM-dd HH:mm:ss} [%thread] %-5level %logger{36} - %msg%n"
file:
name: logs/api-gateway.log
logback:
rollingpolicy:
max-file-size: 100MB
total-size-cap: 10GB
max-history: 30
监控指标收集
核心监控指标
性能指标
// 示例:性能指标收集
@Component
public class PerformanceMetrics {
private final MeterRegistry meterRegistry;
private final Timer requestTimer;
private final DistributionSummary responseSizeSummary;
public PerformanceMetrics(MeterRegistry meterRegistry) {
this.meterRegistry = meterRegistry;
this.requestTimer = Timer.builder("api.gateway.requests")
.description("Request processing time")
.register(meterRegistry);
this.responseSizeSummary = DistributionSummary.builder("api.gateway.response.size")
.description("Response size distribution")
.register(meterRegistry);
}
public <T> T recordRequestTime(Supplier<T> operation) {
return requestTimer.record(operation);
}
public void recordResponseSize(int size) {
responseSizeSummary.record(size);
}
}
可用性指标
// 示例:可用性指标收集
@Component
public class AvailabilityMetrics {
private final MeterRegistry meterRegistry;
private final Counter successCounter;
private final Counter errorCounter;
private final Gauge uptimeGauge;
private final AtomicLong uptime = new AtomicLong(System.currentTimeMillis());
public AvailabilityMetrics(MeterRegistry meterRegistry) {
this.meterRegistry = meterRegistry;
this.successCounter = Counter.builder("api.gateway.requests.success")
.description("Successful requests count")
.register(meterRegistry);
this.errorCounter = Counter.builder("api.gateway.requests.error")
.description("Failed requests count")
.register(meterRegistry);
this.uptimeGauge = Gauge.builder("api.gateway.uptime")
.description("Gateway uptime in milliseconds")
.register(meterRegistry, uptime, value ->
System.currentTimeMillis() - value.get());
}
public void recordSuccess() {
successCounter.increment();
}
public void recordError() {
errorCounter.increment();
}
}
流量指标
// 示例:流量指标收集
@Component
public class TrafficMetrics {
private final MeterRegistry meterRegistry;
private final Counter requestCounter;
private final Counter responseCounter;
private final FunctionCounter activeConnectionsCounter;
private final AtomicLong activeConnections = new AtomicLong(0);
public TrafficMetrics(MeterRegistry meterRegistry) {
this.meterRegistry = meterRegistry;
this.requestCounter = Counter.builder("api.gateway.traffic.requests")
.description("Total requests count")
.register(meterRegistry);
this.responseCounter = Counter.builder("api.gateway.traffic.responses")
.description("Total responses count")
.register(meterRegistry);
this.activeConnectionsCounter = FunctionCounter.builder("api.gateway.connections.active",
this, t -> t.activeConnections.get())
.description("Active connections count")
.register(meterRegistry);
}
public void recordRequest() {
requestCounter.increment();
activeConnections.incrementAndGet();
}
public void recordResponse() {
responseCounter.increment();
activeConnections.decrementAndGet();
}
}
自定义监控指标
// 示例:自定义业务指标
@Component
public class BusinessMetrics {
private final MeterRegistry meterRegistry;
private final Counter userLoginCounter;
private final Counter apiCallCounter;
private final Timer paymentProcessingTimer;
public BusinessMetrics(MeterRegistry meterRegistry) {
this.meterRegistry = meterRegistry;
this.userLoginCounter = Counter.builder("business.user.logins")
.description("User login count")
.register(meterRegistry);
this.apiCallCounter = Counter.builder("business.api.calls")
.description("API call count by service")
.tag("service", "unknown")
.register(meterRegistry);
this.paymentProcessingTimer = Timer.builder("business.payment.processing")
.description("Payment processing time")
.register(meterRegistry);
}
public void recordUserLogin() {
userLoginCounter.increment();
}
public void recordApiCall(String service) {
apiCallCounter.tag("service", service).increment();
}
public <T> T recordPaymentProcessing(Supplier<T> operation) {
return paymentProcessingTimer.record(operation);
}
}
分布式追踪
追踪机制实现
// 示例:分布式追踪实现
@Component
public class TracingFilter implements GlobalFilter {
private final Tracer tracer;
private final SpanReporter spanReporter;
@Override
public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
ServerHttpRequest request = exchange.getRequest();
// 创建根 Span
Span rootSpan = tracer.nextSpan().name("api-gateway-request")
.tag("http.method", request.getMethod().name())
.tag("http.path", request.getPath().value())
.tag("client.ip", getClientIp(request))
.start();
try (Tracer.SpanInScope ws = tracer.withSpanInScope(rootSpan)) {
// 将追踪信息传递给下游服务
ServerHttpRequest modifiedRequest = request.mutate()
.header("X-B3-TraceId", rootSpan.context().traceId())
.header("X-B3-SpanId", rootSpan.context().spanId())
.header("X-B3-Sampled", "1")
.build();
exchange.mutate().request(modifiedRequest).build();
return chain.filter(exchange).then(
Mono.fromRunnable(() -> {
// 记录响应信息
rootSpan.tag("http.status_code",
String.valueOf(exchange.getResponse().getStatusCode().value()));
rootSpan.finish();
})
).doOnError(throwable -> {
// 记录错误信息
rootSpan.tag("error", throwable.getMessage());
rootSpan.finish();
});
}
}
}
追踪数据收集
# 示例:追踪配置
spring:
sleuth:
enabled: true
sampler:
probability: 1.0
zipkin:
base-url: http://zipkin-server:9411
sender:
type: web
告警机制
告警规则配置
# 示例:告警规则配置
management:
endpoint:
metrics:
enabled: true
endpoints:
web:
exposure:
include: metrics,health
metrics:
export:
prometheus:
enabled: true
distribution:
percentiles-histogram:
api.gateway.requests: true
slo:
api.gateway.requests: 100ms, 500ms, 1000ms
# Prometheus 告警规则示例
groups:
- name: api-gateway-alerts
rules:
- alert: HighErrorRate
expr: rate(api_gateway_requests_error_total[5m]) / rate(api_gateway_requests_total[5m]) > 0.05
for: 2m
labels:
severity: warning
annotations:
summary: "High error rate on API Gateway"
description: "Error rate is above 5% for more than 2 minutes"
- alert: HighLatency
expr: histogram_quantile(0.95, rate(api_gateway_requests_duration_seconds_bucket[5m])) > 1
for: 2m
labels:
severity: warning
annotations:
summary: "High latency on API Gateway"
description: "95th percentile latency is above 1 second for more than 2 minutes"
告警通知
// 示例:告警通知实现
@Component
public class AlertNotifier {
private final WebClient webClient;
private final List<String> alertEndpoints;
public void sendAlert(String alertName, String message, String severity) {
AlertPayload payload = new AlertPayload(alertName, message, severity, Instant.now());
alertEndpoints.forEach(endpoint -> {
webClient.post()
.uri(endpoint)
.bodyValue(payload)
.retrieve()
.bodyToMono(String.class)
.subscribe(
response -> log.info("Alert sent successfully to {}", endpoint),
error -> log.error("Failed to send alert to {}: {}", endpoint, error.getMessage())
);
});
}
private static class AlertPayload {
private final String alertName;
private final String message;
private final String severity;
private final Instant timestamp;
// 构造函数和 getter 方法
}
}
日志与监控集成
与 ELK 栈集成
# 示例:Logback 配置与 Logstash 集成
<configuration>
<appender name="LOGSTASH" class="net.logstash.logback.appender.LogstashTcpSocketAppender">
<destination>logstash-server:5000</destination>
<encoder class="net.logstash.logback.encoder.LoggingEventCompositeJsonEncoder">
<providers>
<timestamp/>
<logLevel/>
<loggerName/>
<message/>
<mdc/>
<stackTrace/>
</providers>
</encoder>
</appender>
<root level="INFO">
<appender-ref ref="LOGSTASH"/>
</root>
</configuration>
与 Prometheus 集成
// 示例:自定义指标端点
@RestController
public class MetricsController {
private final MeterRegistry meterRegistry;
@GetMapping("/actuator/prometheus")
public String prometheusMetrics() {
return meterRegistry.scrape();
}
}
性能优化
异步日志处理
// 示例:异步日志处理
@Component
public class AsyncLogProcessor {
private final ExecutorService executorService = Executors.newFixedThreadPool(10);
private final BlockingQueue<LogEvent> logQueue = new LinkedBlockingQueue<>(10000);
@PostConstruct
public void startProcessing() {
for (int i = 0; i < 5; i++) {
executorService.submit(this::processLogs);
}
}
public void enqueueLog(LogEvent logEvent) {
try {
logQueue.offer(logEvent, 1, TimeUnit.SECONDS);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
}
private void processLogs() {
while (!Thread.currentThread().isInterrupted()) {
try {
LogEvent logEvent = logQueue.take();
// 异步处理日志事件
processLogEvent(logEvent);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
break;
}
}
}
}
批量指标上报
// 示例:批量指标上报
@Component
public class BatchMetricsReporter {
private final MeterRegistry meterRegistry;
private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
private final List<MetricEvent> metricBuffer = new CopyOnWriteArrayList<>();
private final int batchSize = 100;
@PostConstruct
public void startBatchReporting() {
scheduler.scheduleAtFixedRate(this::reportBatch, 0, 10, TimeUnit.SECONDS);
}
public void addMetric(MetricEvent metricEvent) {
metricBuffer.add(metricEvent);
if (metricBuffer.size() >= batchSize) {
reportBatch();
}
}
private void reportBatch() {
if (metricBuffer.isEmpty()) {
return;
}
List<MetricEvent> batch = new ArrayList<>(metricBuffer);
metricBuffer.clear();
// 批量上报指标
batch.forEach(this::reportMetric);
}
}
最佳实践
日志管理最佳实践
- 结构化日志:使用 JSON 格式记录结构化日志,便于分析和查询
- 日志级别控制:合理设置日志级别,避免生产环境产生过多日志
- 日志轮转:配置日志轮转策略,防止日志文件过大
- 敏感信息过滤:过滤日志中的敏感信息,如密码、密钥等
- 日志保留策略:制定合理的日志保留策略,平衡存储成本和审计需求
监控最佳实践
- 关键指标监控:重点关注响应时间、错误率、吞吐量等关键指标
- 多维度监控:从不同维度(服务、API、用户等)进行监控
- 实时告警:设置实时告警机制,及时发现和处理问题
- 历史数据分析:定期分析历史数据,发现潜在问题和优化点
- 可视化展示:使用仪表板展示监控数据,便于直观了解系统状态
故障排查最佳实践
- 追踪链路:使用分布式追踪定位问题根源
- 日志关联:通过请求ID关联不同服务的日志
- 指标分析:通过指标变化趋势分析问题原因
- 快速响应:建立快速响应机制,缩短故障处理时间
- 经验总结:总结故障处理经验,完善防护机制
总结
日志与监控是构建可观测微服务系统的重要组成部分。通过完善的日志记录、指标收集、分布式追踪和告警机制,API 网关能够为运维团队提供全面的系统运行信息,帮助快速发现和解决问题,确保系统的稳定性和可维护性。
在实际应用中,需要根据具体的业务需求和技术架构选择合适的日志和监控策略,并持续优化配置以达到最佳的可观测性效果。同时,建立完善的故障排查和应急响应机制也是确保系统高可用的重要保障。
在后续章节中,我们将继续探讨 API 网关的其他核心功能,帮助读者全面掌握这一关键技术组件。