docs: Phase O-5 Wave 5.4 alert chain E2E verification runbook

- Architecture diagram, manual test steps, smoke test checklist
- Usage notes for generate_monitoring.py
- Known-issue exemption list, rollback commands
- First acceptance record: 2026-04-02, 8/8

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OG T
2026-04-02 21:34:43 +08:00
parent 234f7febd0
commit 08f73dfce8


@@ -0,0 +1,127 @@
# Alert Chain E2E Verification Runbook
> Phase O-5 Wave 5.4 (2026-04-02 ogt)
> ADR-025 / ADR-035 / ADR-037
---
## Architecture Overview
```
Prometheus → Alertmanager → AWOOOI API → Telegram
SigNoz Trace
Langfuse (AI analysis)
```
**Endpoints**:
- Prometheus: `192.168.0.110:9090`
- Alertmanager: `192.168.0.110:9093`
- AWOOOI API: `https://awoooi.wooo.work` / `192.168.0.120:32334` (K8s)
- Webhook: `POST /api/v1/webhooks/alertmanager`
---
## Quick Smoke Test
```bash
# Run the full Wave A acceptance suite (8 checks)
python3 scripts/alert_chain_smoke_test.py
# Verify monitoring coverage
python3 scripts/generate_monitoring.py
# JSON output (for CI)
python3 scripts/generate_monitoring.py --json
python3 scripts/generate_monitoring.py --check # exit 1 if coverage < 70%
```
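The `--check` flag is described here only as exiting with status 1 below 70%. The sketch below shows one way such a gate could work. It assumes coverage means monitored services divided by total services. The function names and sample counts are hypothetical and are not taken from `generate_monitoring.py`.

```python
THRESHOLD = 70.0  # percent, taken from the "--check" comment in this runbook


def coverage_percent(monitored: int, total: int) -> float:
    """Return coverage as a percentage; an empty inventory counts as 0%."""
    if total <= 0:
        return 0.0
    return 100.0 * monitored / total


def check_exit_code(monitored: int, total: int, threshold: float = THRESHOLD) -> int:
    """Return 1 when coverage is below the threshold, otherwise 0."""
    return 1 if coverage_percent(monitored, total) < threshold else 0


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    raise SystemExit(check_exit_code(monitored=7, total=10))
```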
---
## Manual E2E Test Steps
### Step 1: Trigger a test alert
```bash
# Note: Alertmanager 0.27+ removed API v1; on those versions post to /api/v2/alerts instead.
curl -X POST http://192.168.0.110:9093/api/v1/alerts \
-H "Content-Type: application/json" \
-d '[{
"labels": {
"alertname": "TestAlert",
"severity": "warning",
"service": "test"
},
"annotations": {
"summary": "E2E test alert",
"description": "Manually triggered to verify the chain"
}
}]'
```
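When the test alert has to be sent from a script rather than a shell, the same request can be sketched in Python with only the standard library. The labels and URL mirror the curl command. The function names, the default annotation text, and the timeout are assumptions made for illustration.

```python
import json
import urllib.request

# Endpoint copied from the curl command in Step 1.
ALERTMANAGER_URL = "http://192.168.0.110:9093/api/v1/alerts"


def build_test_alert(summary: str = "E2E test alert",
                     description: str = "Manually triggered to verify the chain") -> list:
    """Build a one-element alert list with the labels used in Step 1."""
    return [{
        "labels": {
            "alertname": "TestAlert",
            "severity": "warning",
            "service": "test",
        },
        "annotations": {"summary": summary, "description": description},
    }]


def post_alerts(alerts: list, url: str = ALERTMANAGER_URL, timeout: float = 10.0) -> int:
    """POST the alerts as JSON and return the HTTP status code."""
    body = json.dumps(alerts).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    print(post_alerts(build_test_alert()))
```

`post_alerts` raises `urllib.error.HTTPError` on a non-2xx response, so a failed submission surfaces as an exception, not a silent status code.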
### Step 2: Verify the AWOOOI API received it
```bash
# Check recent webhook logs
kubectl logs -n awoooi-prod deployment/awoooi-api --since=5m | grep -i "alertmanager\|webhook"
```
### Step 3: Verify the alert_chain metric updated
```bash
# Query Prometheus
curl -s "http://192.168.0.110:9090/api/v1/query?query=alert_chain_last_success_timestamp" \
| python3 -c "import json,sys,datetime; d=json.load(sys.stdin); ts=float(d['data']['result'][0]['value'][1]); print(datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S'))"
```
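The one-liner above prints the timestamp but does not decide whether it is recent enough, and it raises an `IndexError` when the query returns no series. The sketch below handles both cases. It assumes the 60-minute limit from the smoke test checklist applies here as well. The function names are hypothetical.

```python
from typing import Optional

MAX_AGE_SECONDS = 60 * 60  # 60 minutes, per smoke test item 2


def last_success_timestamp(response: dict) -> Optional[float]:
    """Extract the first sample value from a Prometheus instant-query response."""
    results = response.get("data", {}).get("result", [])
    if not results:
        return None  # metric absent: no successful run has been recorded
    # Each instant-query sample is [evaluation_time, "value-as-string"].
    return float(results[0]["value"][1])


def is_fresh(ts: Optional[float], now: float, max_age: float = MAX_AGE_SECONDS) -> bool:
    """Return True when the timestamp exists and is at most max_age seconds old."""
    return ts is not None and (now - ts) <= max_age


if __name__ == "__main__":
    import json
    import sys
    import time

    # Usage: curl -s "<prometheus query URL>" | python3 this_script.py
    ts = last_success_timestamp(json.load(sys.stdin))
    sys.exit(0 if is_fresh(ts, time.time()) else 1)
```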
### Step 4: Verify Telegram received the notification
Check the @tsenyangbot conversation; a formatted alert message should have arrived.
---
## Smoke Test Checklist (Wave A 8/8)
| # | Item | Command | Expected |
|---|------|------|------|
| 1 | API Health (6 components) | `GET /health` | 200 + all healthy |
| 2 | Alert Chain Metric | Prometheus query | timestamp at most 60 minutes old |
| 3 | Alertmanager Webhook | `GET /api/v1/webhooks/health` | `{"status":"ok"}` |
| 4 | SigNoz Webhook | `GET /api/v1/webhooks/signoz/health` | `{"status":"ok"}` |
| 5 | Sentry Webhook | `GET /api/v1/webhooks/sentry/health` | `{"status":"ok"}` |
| 6 | SigNoz | `GET http://192.168.0.188:3301` | HTTP 200 |
| 7 | OTEL Collector | `kubectl get pods -n otel` | 2 Running |
| 8 | Event Exporter | `kubectl get pods -n monitoring` | 1 Running |
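
Two of the expectations in this table can be checked mechanically: the `{"status":"ok"}` body of the webhook health endpoints and the "N Running" pod counts. The sketch below is hypothetical and is not the logic of `alert_chain_smoke_test.py`. It assumes the default `kubectl get pods` column layout, which includes a `STATUS` header.

```python
import json


def webhook_ok(body: str) -> bool:
    """Return True when a health response body is a JSON object with status "ok"."""
    try:
        return json.loads(body).get("status") == "ok"
    except (ValueError, AttributeError):
        # Not valid JSON, or valid JSON that is not an object.
        return False


def count_running(kubectl_output: str) -> int:
    """Count rows whose STATUS column is Running in `kubectl get pods` text output."""
    lines = [ln for ln in kubectl_output.strip().splitlines() if ln.strip()]
    if not lines:
        return 0
    header = lines[0].split()
    if "STATUS" not in header:
        return 0  # e.g. "No resources found in ... namespace."
    idx = header.index("STATUS")
    count = 0
    for ln in lines[1:]:
        cols = ln.split()
        if len(cols) > idx and cols[idx] == "Running":
            count += 1
    return count
```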
---
## Known Issues and Exemptions
| Target | Status | Reason |
|--------|------|------|
| federation-k8s | down | SigNoz internal Prometheus, not exposed externally |
| kube-state-metrics | down | Accessible only from inside the OTEL Collector |
| node-exporter-120/121 | down | K8s node firewall rules |
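
An exemption list like this one is typically used to separate expected `down` targets from new failures. The sketch below filters a Prometheus `/api/v1/targets` response. It assumes the names in the Target column are Prometheus `job` labels and that `node-exporter-120/121` stands for two jobs. Neither assumption is confirmed by this runbook.

```python
# Assumed job names, derived from the Target column of the exemption table.
EXEMPT_JOBS = frozenset({
    "federation-k8s",
    "kube-state-metrics",
    "node-exporter-120",
    "node-exporter-121",
})


def unexpected_down(targets_response: dict, exempt=EXEMPT_JOBS) -> list:
    """Return sorted job names that are down and not on the exemption list."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    down = {
        t.get("labels", {}).get("job", "")
        for t in active
        if t.get("health") == "down"
    }
    return sorted(down - set(exempt))
```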
---
## Rollback Commands
```bash
# Roll back the Phase 24 AI Router
kubectl set env deployment/awoooi-api -n awoooi-prod USE_AI_ROUTER=false
# Restart the API pods
kubectl rollout restart deployment/awoooi-api -n awoooi-prod
# Verify the rollout
kubectl rollout status deployment/awoooi-api -n awoooi-prod
```
---
## Acceptance History
| Date | Result | Notes |
|------|------|------|
| 2026-04-02 21:22 | 8/8 ✅ | First full Wave A pass |