From 08f73dfce80f729888b1c5528f6c397ea5f7fae9 Mon Sep 17 00:00:00 2001
From: OG T
Date: Thu, 2 Apr 2026 21:34:43 +0800
Subject: [PATCH] docs: Phase O-5 Wave 5.4 alert chain E2E validation runbook
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Architecture diagram, manual test steps, smoke test checklist
- generate_monitoring.py usage notes
- Known-issue exemption list, rollback commands
- First acceptance record: 2026-04-02, 8/8

Co-Authored-By: Claude Sonnet 4.6
---
 docs/runbooks/ALERT-CHAIN-E2E-VALIDATION.md | 127 ++++++++++++++++++++
 1 file changed, 127 insertions(+)
 create mode 100644 docs/runbooks/ALERT-CHAIN-E2E-VALIDATION.md

diff --git a/docs/runbooks/ALERT-CHAIN-E2E-VALIDATION.md b/docs/runbooks/ALERT-CHAIN-E2E-VALIDATION.md
new file mode 100644
index 00000000..8d71f08a
--- /dev/null
+++ b/docs/runbooks/ALERT-CHAIN-E2E-VALIDATION.md
@@ -0,0 +1,127 @@
+# Alert Chain E2E Validation Runbook
+
+> Phase O-5 Wave 5.4 (2026-04-02 ogt)
+> ADR-025 / ADR-035 / ADR-037
+
+---
+
+## Architecture Overview
+
+```
+Prometheus → Alertmanager → AWOOOI API → Telegram
+                                 ↓
+                           SigNoz Trace
+                                 ↓
+                        Langfuse (AI analysis)
+```
+
+**Endpoints**:
+- Prometheus: `192.168.0.110:9090`
+- Alertmanager: `192.168.0.110:9093`
+- AWOOOI API: `https://awoooi.wooo.work` / `192.168.0.120:32334` (K8s)
+- Webhook: `POST /api/v1/webhooks/alertmanager`
+
+---
+
+## Quick Smoke Test
+
+```bash
+# Run the full Wave A acceptance suite (8 checks)
+python3 scripts/alert_chain_smoke_test.py
+
+# Verify monitoring coverage
+python3 scripts/generate_monitoring.py
+
+# JSON output (for CI)
+python3 scripts/generate_monitoring.py --json
+python3 scripts/generate_monitoring.py --check  # exit 1 if coverage < 70%
+```
+
+---
+
+## Manual E2E Test Steps
+
+### Step 1: Fire a test alert
+
+```bash
+curl -X POST http://192.168.0.110:9093/api/v1/alerts \
+  -H "Content-Type: application/json" \
+  -d '[{
+    "labels": {
+      "alertname": "TestAlert",
+      "severity": "warning",
+      "service": "test"
+    },
+    "annotations": {
+      "summary": "E2E test alert",
+      "description": "Fired manually to validate the alert chain"
+    }
+  }]'
+```
+
+### Step 2: Verify the AWOOOI API received the alert
+
+```bash
+# Inspect recent webhook logs
+kubectl logs -n awoooi-prod deployment/awoooi-api --since=5m | grep -i "alertmanager\|webhook"
+```
+
+### Step 3: Verify the alert_chain metric was updated
+
+```bash
+# Query Prometheus and print the last-success time in local time
+curl -s "http://192.168.0.110:9090/api/v1/query?query=alert_chain_last_success_timestamp" \
+  | python3 -c "import json,sys,datetime; d=json.load(sys.stdin); ts=float(d['data']['result'][0]['value'][1]); print(datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S'))"
+```
+
+### Step 4: Verify Telegram received the notification
+
+Open the @tsenyangbot chat; a formatted alert message should have arrived.
+
+---
+
+## Smoke Test Checklist (Wave A 8/8)
+
+| # | Item | Command | Expected |
+|---|------|---------|----------|
+| 1 | API Health (6 components) | `GET /health` | 200 + all healthy |
+| 2 | Alert Chain Metric | Prometheus query | timestamp no more than 60 minutes old |
+| 3 | Alertmanager Webhook | `GET /api/v1/webhooks/health` | `{"status":"ok"}` |
+| 4 | SigNoz Webhook | `GET /api/v1/webhooks/signoz/health` | `{"status":"ok"}` |
+| 5 | Sentry Webhook | `GET /api/v1/webhooks/sentry/health` | `{"status":"ok"}` |
+| 6 | SigNoz | `GET http://192.168.0.188:3301` | HTTP 200 |
+| 7 | OTEL Collector | `kubectl get pods -n otel` | 2 Running |
+| 8 | Event Exporter | `kubectl get pods -n monitoring` | 1 Running |
+
+---
+
+## Known Issues and Exemptions
+
+| Target | Status | Reason |
+|--------|--------|--------|
+| federation-k8s | down | Internal SigNoz Prometheus, not exposed externally |
+| kube-state-metrics | down | Reachable only from inside the OTEL Collector |
+| node-exporter-120/121 | down | Blocked by K8s node firewall rules |
+
+---
+
+## Rollback Commands
+
+```bash
+# Roll back the Phase 24 AI Router
+kubectl set env deployment/awoooi-api -n awoooi-prod USE_AI_ROUTER=false
+
+# Restart the API pods
+kubectl rollout restart deployment/awoooi-api -n awoooi-prod
+
+# Verify the rollout completed
+kubectl rollout status deployment/awoooi-api -n awoooi-prod
+```
+
+---
+
+## Acceptance History
+
+| Date | Result | Notes |
+|------|--------|-------|
+| 2026-04-02 21:22 | 8/8 ✅ | First full Wave A pass |
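
Step 3 and smoke-test item 2 both depend on how old `alert_chain_last_success_timestamp` is. The following is a minimal Python sketch of that freshness check. It assumes the metric value is a Unix timestamp in seconds (as the Step 3 one-liner implies) and that the response has the standard Prometheus instant-query JSON shape. The names `last_success_time` and `is_fresh` are hypothetical and are not taken from `scripts/alert_chain_smoke_test.py`.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Optional

# Threshold from smoke-test item 2: the last success must be at most 60 minutes old.
MAX_AGE = timedelta(minutes=60)


def last_success_time(response_text: str) -> Optional[datetime]:
    """Return the newest last-success time in a Prometheus instant-query response.

    Returns None when the query failed or returned no series.
    """
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        return None
    results = payload.get("data", {}).get("result", [])
    if not results:
        return None
    # Each sample is [evaluation_time, "value"]; the value itself is assumed
    # to be a Unix timestamp in seconds, so take the newest across all series.
    newest = max(float(sample["value"][1]) for sample in results)
    return datetime.fromtimestamp(newest, tz=timezone.utc)


def is_fresh(
    response_text: str,
    now: Optional[datetime] = None,
    max_age: timedelta = MAX_AGE,
) -> bool:
    """True when the alert chain last succeeded within max_age of now."""
    last = last_success_time(response_text)
    if last is None:
        return False
    now = now or datetime.now(timezone.utc)
    return now - last <= max_age


if __name__ == "__main__":
    # Illustrative response body; a real one would come from the Step 3 curl call.
    five_min_ago = (datetime.now(timezone.utc) - timedelta(minutes=5)).timestamp()
    sample = json.dumps({
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [{"metric": {}, "value": [five_min_ago, str(five_min_ago)]}],
        },
    })
    print(is_fresh(sample))
```

Unlike the one-liner in Step 3, which raises `IndexError` when Prometheus returns no series, this sketch treats an empty result as a failed check. Piping the Step 3 `curl` output into such a function would give a pass/fail signal rather than a timestamp to inspect by eye.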