Alert Chain E2E Verification Runbook
Phase O-5 Wave 5.4 (2026-04-02 ogt) ADR-025 / ADR-035 / ADR-037
Architecture Overview
Prometheus → Alertmanager → AWOOOI API → Telegram
↓
SigNoz Trace
↓
Langfuse (AI analysis)
Endpoints:
- Prometheus: 192.168.0.110:9090
- Alertmanager: 192.168.0.110:9093
- AWOOOI API: https://awoooi.wooo.work / 192.168.0.120:32334 (K8s)
- Webhook: POST /api/v1/webhooks/alertmanager
Quick Smoke Test
# Run the full Wave A acceptance suite (8 checks)
python3 scripts/alert_chain_smoke_test.py
# Verify monitoring coverage
python3 scripts/generate_monitoring.py
# JSON output (for CI)
python3 scripts/generate_monitoring.py --json
python3 scripts/generate_monitoring.py --check # exits 1 if coverage < 70%
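The `--check` invocation above exits with status 1 when coverage is below 70%. A minimal sketch of such a gate follows, assuming coverage is the share of monitored services among all services. The function name and its inputs are hypothetical and are not taken from `generate_monitoring.py`.

```python
def coverage_exit_code(monitored: int, total: int, threshold: float = 70.0) -> int:
    """Return 0 when coverage meets the threshold, 1 otherwise.

    Hypothetical helper: the real script may compute coverage differently.
    """
    if total <= 0:
        # Nothing to evaluate: fail instead of dividing by zero.
        return 1
    coverage = 100.0 * monitored / total
    return 0 if coverage >= threshold else 1
```

A CI job could pass the returned value to `sys.exit()` so that the pipeline fails whenever coverage drops below the threshold.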
Manual E2E Test Steps
Step 1: Fire a test alert
curl -X POST http://192.168.0.110:9093/api/v1/alerts \
-H "Content-Type: application/json" \
-d '[{
"labels": {
"alertname": "TestAlert",
"severity": "warning",
"service": "test"
},
"annotations": {
"summary": "E2E test alert",
"description": "Manually triggered to verify the alert chain"
}
}]'
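The same test alert can be sent from Python, which is convenient when the step is scripted. This is a sketch using only the standard library: `build_test_alert` and `post_alert` are hypothetical helper names, the labels mirror the curl command, and the annotation text is illustrative. The URL is left as a parameter because the alerts path depends on the Alertmanager version deployed (some releases serve it under `/api/v2/alerts`).

```python
import json
import urllib.request


def build_test_alert(alertname="TestAlert", severity="warning", service="test"):
    """Build a one-alert payload with the same labels as the curl command."""
    return [{
        "labels": {
            "alertname": alertname,
            "severity": severity,
            "service": service,
        },
        "annotations": {
            "summary": "E2E test alert",
            "description": "Manually triggered to verify the alert chain",
        },
    }]


def post_alert(url, payload, timeout=10):
    """POST the payload as JSON and return the HTTP status code."""
    body = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Usage, with the endpoint from Step 1: `post_alert("http://192.168.0.110:9093/api/v1/alerts", build_test_alert())`.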
Step 2: Verify that the AWOOOI API received the alert
# Check the recent webhook logs
kubectl logs -n awoooi-prod deployment/awoooi-api --since=5m | grep -i "alertmanager\|webhook"
Step 3: Verify that the alert_chain metric was updated
# Query Prometheus
curl -s "http://192.168.0.110:9090/api/v1/query?query=alert_chain_last_success_timestamp" \
| python3 -c "import json,sys,datetime; d=json.load(sys.stdin); ts=float(d['data']['result'][0]['value'][1]); print(datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S'))"
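The one-liner above raises an `IndexError` when the query returns no series. The sketch below parses the same instant-query response more defensively and compares the age of the last success against the 60-minute limit used by the smoke test. Both function names are hypothetical; the response shape (`data.result[].value` as `[time, "value"]`) is the standard Prometheus instant-query format.

```python
import time


def last_success_age_seconds(response, now=None):
    """Return seconds since the newest sample value, or None if no series came back."""
    results = response.get("data", {}).get("result", [])
    if not results:
        return None
    # Each instant-query sample is [evaluation_time, "value-as-string"];
    # the metric value itself is the Unix timestamp of the last success.
    newest = max(float(series["value"][1]) for series in results)
    current = time.time() if now is None else now
    return current - newest


def is_fresh(response, max_age_seconds=3600, now=None):
    """True when the last success is no older than max_age_seconds."""
    age = last_success_age_seconds(response, now)
    return age is not None and age <= max_age_seconds
```

Feeding it the decoded JSON from the curl command (for example via `json.load(sys.stdin)`) gives a boolean that can drive an exit code, and an empty result is reported as stale instead of crashing.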
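Step 4: Verify that Telegram received the notification
Check the "AwoooI SRE戰情室" Telegram group; a formatted alert message should arrive there. Production alerts no longer use the @tsenyangbot personal DM as a delivery channel.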
Smoke Test Checklist (Wave A 8/8)
| # | Check | Command | Expected |
|---|---|---|---|
| 1 | API Health (6 components) | GET /health | 200 + all healthy |
| 2 | Alert Chain Metric | Prometheus query | timestamp no more than 60 minutes old |
| 3 | Alertmanager Webhook | GET /api/v1/webhooks/health | {"status":"ok"} |
| 4 | SigNoz Webhook | GET /api/v1/webhooks/signoz/health | {"status":"ok"} |
| 5 | Sentry Webhook | GET /api/v1/webhooks/sentry/health | {"status":"ok"} |
| 6 | SigNoz | GET http://192.168.0.188:3301 | HTTP 200 |
| 7 | OTEL Collector | kubectl get pods -n otel | 2 Running |
| 8 | Event Exporter | kubectl get pods -n monitoring | 1 Running |
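Checks 7 and 8 compare the number of Running pods against an expected count. A small sketch of that comparison, assuming the default `kubectl get pods` table output with a header row and STATUS as the third column; `count_running` is a hypothetical helper and the pod names in the usage note are made up.

```python
def count_running(kubectl_output: str) -> int:
    """Count rows whose STATUS column is 'Running' in default `kubectl get pods` output."""
    lines = kubectl_output.strip().splitlines()
    running = 0
    # Skip the header row (NAME READY STATUS RESTARTS AGE).
    for line in lines[1:]:
        columns = line.split()
        if len(columns) >= 3 and columns[2] == "Running":
            running += 1
    return running
```

For example, output listing `otel-collector-a` and `otel-collector-b` as Running and a third pod as Pending yields 2, which matches the expectation for check 7. Empty output (no pods in the namespace) yields 0.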
Known Issues and Exemptions
| Target | Status | Reason |
|---|---|---|
| federation-k8s | down | SigNoz internal Prometheus, not exposed externally |
| kube-state-metrics | down | Accessible only from inside the OTEL Collector |
| node-exporter-120/121 | down | K8s node firewall rules |
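When checking target health, the exempted targets above should not count as failures. The sketch below filters a Prometheus `/api/v1/targets` response accordingly. It assumes the Target column corresponds to the `job` label and that `node-exporter-120/121` stands for two jobs named `node-exporter-120` and `node-exporter-121`; both points are assumptions, as is the helper name.

```python
# Assumed job names, derived from the exemption table.
EXEMPT_JOBS = frozenset({
    "federation-k8s",
    "kube-state-metrics",
    "node-exporter-120",
    "node-exporter-121",
})


def unexpected_down_targets(targets_response, exempt_jobs=EXEMPT_JOBS):
    """Return sorted (job, instance) pairs that are not 'up' and not exempt."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    down = []
    for target in active:
        labels = target.get("labels", {})
        job = labels.get("job", "")
        if target.get("health") != "up" and job not in exempt_jobs:
            down.append((job, labels.get("instance", "")))
    return sorted(down)
```

An empty return value means every target that is not up is covered by a documented exemption.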
Rollback Commands
# Roll back the Phase 24 AI Router
kubectl set env deployment/awoooi-api -n awoooi-prod USE_AI_ROUTER=false
# Restart the API pods
kubectl rollout restart deployment/awoooi-api -n awoooi-prod
# Verify the rollout
kubectl rollout status deployment/awoooi-api -n awoooi-prod
Acceptance History
| Date | Result | Notes |
|---|---|---|
| 2026-04-02 21:22 | 8/8 ✅ | First full Wave A pass |