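docs: Phase O-5 Wave 5.4 alert chain E2E validation runbook

- Architecture diagram, manual test steps, smoke test checklist
- generate_monitoring.py usage notes
- Known-issue exemption list, rollback commands
- First acceptance record: 2026-04-02, 8/8

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>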
docs/runbooks/ALERT-CHAIN-E2E-VALIDATION.md
# Alert Chain E2E Validation Runbook

> Phase O-5 Wave 5.4 (2026-04-02 ogt)
> ADR-025 / ADR-035 / ADR-037

---

## Architecture Overview

```
Prometheus → Alertmanager → AWOOOI API → Telegram
↓
SigNoz Trace
↓
Langfuse (AI analysis)
```

**Endpoints:**
- Prometheus: `192.168.0.110:9090`
- Alertmanager: `192.168.0.110:9093`
- AWOOOI API: `https://awoooi.wooo.work` / `192.168.0.120:32334` (K8s)
- Webhook: `POST /api/v1/webhooks/alertmanager`

---

## Quick Smoke Test

```bash
# Run the full Wave A acceptance suite (8 checks)
python3 scripts/alert_chain_smoke_test.py

# Verify monitoring coverage
python3 scripts/generate_monitoring.py

# JSON output (for CI)
python3 scripts/generate_monitoring.py --json
python3 scripts/generate_monitoring.py --check  # exit 1 if coverage < 70%
```
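
The `--check` comment implies a threshold gate that CI can act on through the exit code. The sketch below shows one way such a gate could be computed. The function `coverage_exit_code` and its inputs are illustrative only; this is not the actual logic of `generate_monitoring.py`, which this runbook does not show. The 70% default mirrors the comment above.

```python
def coverage_exit_code(monitored: int, total: int, threshold: float = 70.0) -> int:
    """Return 0 when coverage meets the threshold, 1 otherwise."""
    if total <= 0:
        return 1  # nothing to measure counts as a failure in this sketch
    coverage = 100.0 * monitored / total
    return 0 if coverage >= threshold else 1


print(coverage_exit_code(7, 10))  # 70% meets the threshold
print(coverage_exit_code(6, 10))  # 60% falls below it
```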

---

## Manual E2E Test Steps

### Step 1: Trigger a test alert

```bash
# Note: Alertmanager API v1 is deprecated and was removed in 0.27.
# If this call fails on a newer version, POST the same body to /api/v2/alerts.
curl -X POST http://192.168.0.110:9093/api/v1/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning",
      "service": "test"
    },
    "annotations": {
      "summary": "E2E test alert",
      "description": "Manually triggered to validate the alert chain"
    }
  }]'
```
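
For scripted runs, the same request can be assembled in Python with only the standard library. This is a sketch that mirrors the curl command above; the helper name `build_test_alert_request` is hypothetical, and the request is built but not sent so it can be inspected first.

```python
import json
import urllib.request

# Same endpoint as the curl command in Step 1.
ALERTMANAGER_URL = "http://192.168.0.110:9093/api/v1/alerts"


def build_test_alert_request(url: str = ALERTMANAGER_URL) -> urllib.request.Request:
    """Build (but do not send) the POST request for the test alert."""
    alerts = [{
        "labels": {"alertname": "TestAlert", "severity": "warning", "service": "test"},
        "annotations": {
            "summary": "E2E test alert",
            "description": "Manually triggered to validate the alert chain",
        },
    }]
    body = json.dumps(alerts).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_test_alert_request()
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request, timeout=10)
```

Keeping construction separate from sending makes it easy to point the request at a different URL when testing.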

### Step 2: Verify the AWOOOI API received it

```bash
# Check recent webhook logs
kubectl logs -n awoooi-prod deployment/awoooi-api --since=5m | grep -i "alertmanager\|webhook"
```

### Step 3: Verify the alert_chain metric updated

```bash
# Query Prometheus and print the last success time (local time zone)
curl -s "http://192.168.0.110:9090/api/v1/query?query=alert_chain_last_success_timestamp" \
  | python3 -c "import json,sys,datetime; d=json.load(sys.stdin); ts=float(d['data']['result'][0]['value'][1]); print(datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S'))"
```
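
The one-liner above prints the timestamp but leaves the freshness judgment to the reader, and it raises `IndexError` when the query returns no series. The sketch below applies the 60-minute window from the smoke test checklist to a Prometheus instant-query response. The helper names `last_success_age` and `is_fresh` are hypothetical, and the sample response uses made-up timestamps.

```python
import time
from typing import Optional

MAX_AGE_SECONDS = 60 * 60  # 60-minute freshness window from the smoke test checklist


def last_success_age(response: dict, now: float) -> float:
    """Return seconds elapsed since the newest sample value in the response."""
    results = response.get("data", {}).get("result", [])
    if not results:
        raise ValueError("alert_chain_last_success_timestamp returned no series")
    # Each instant-vector sample is [evaluation_time, "value-as-string"].
    newest = max(float(series["value"][1]) for series in results)
    return now - newest


def is_fresh(response: dict, now: Optional[float] = None) -> bool:
    """True when the last success is no older than MAX_AGE_SECONDS."""
    current = time.time() if now is None else now
    return last_success_age(response, current) <= MAX_AGE_SECONDS


# Made-up response: last success recorded at t=1000.
sample = {
    "status": "success",
    "data": {"resultType": "vector", "result": [{"metric": {}, "value": [1010.0, "1000"]}]},
}

print(is_fresh(sample, now=1000 + 25 * 60))  # 25 minutes old
print(is_fresh(sample, now=1000 + 61 * 60))  # 61 minutes old
```

Passing `now` explicitly keeps the check deterministic in tests; in a live run it defaults to the current time.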

### Step 4: Verify Telegram received the notification

Check the @tsenyangbot chat; a formatted alert message should have arrived.

---

## Smoke Test Checklist (Wave A 8/8)

| # | Check | Command | Expected |
|---|-------|---------|----------|
| 1 | API Health (6 components) | `GET /health` | 200 + all healthy |
| 2 | Alert Chain Metric | Prometheus query | timestamp no more than 60 minutes old |
| 3 | Alertmanager Webhook | `GET /api/v1/webhooks/health` | `{"status":"ok"}` |
| 4 | SigNoz Webhook | `GET /api/v1/webhooks/signoz/health` | `{"status":"ok"}` |
| 5 | Sentry Webhook | `GET /api/v1/webhooks/sentry/health` | `{"status":"ok"}` |
| 6 | SigNoz | `GET http://192.168.0.188:3301` | HTTP 200 |
| 7 | OTEL Collector | `kubectl get pods -n otel` | 2 Running |
| 8 | Event Exporter | `kubectl get pods -n monitoring` | 1 Running |

---

## Known Issues and Exemptions

| Target | Status | Reason |
|--------|--------|--------|
| federation-k8s | down | SigNoz-internal Prometheus, not exposed externally |
| kube-state-metrics | down | Reachable only from inside the OTEL Collector |
| node-exporter-120/121 | down | K8s node firewall rules |
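
When reviewing scrape targets, the exempted entries above can be filtered out so that only unexpected failures remain. The sketch below assumes the Target names in the table are Prometheus `job` labels, and that `node-exporter-120/121` stands for two jobs; the runbook does not confirm either. The input shape follows the Prometheus `/api/v1/targets` response, the helper name `unexpected_down_targets` is hypothetical, and the sample data (including the `awoooi-api` job) is made up.

```python
# Assumed job names, taken from the exemption table above.
EXEMPT_JOBS = frozenset({
    "federation-k8s",
    "kube-state-metrics",
    "node-exporter-120",
    "node-exporter-121",
})


def unexpected_down_targets(targets_response: dict, exempt=EXEMPT_JOBS) -> list:
    """List job names that are not 'up' and are not on the exemption list."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    down = {
        target.get("labels", {}).get("job", "unknown")
        for target in active
        if target.get("health") != "up"
    }
    return sorted(down - set(exempt))


# Made-up sample in the shape of GET /api/v1/targets.
sample = {
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"job": "prometheus"}, "health": "up"},
        {"labels": {"job": "kube-state-metrics"}, "health": "down"},
        {"labels": {"job": "awoooi-api"}, "health": "down"},
    ]},
}

print(unexpected_down_targets(sample))  # ['awoooi-api']
```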

---

## Rollback Commands

```bash
# Roll back the Phase 24 AI Router
kubectl set env deployment/awoooi-api -n awoooi-prod USE_AI_ROUTER=false

# Restart the API pods
kubectl rollout restart deployment/awoooi-api -n awoooi-prod

# Verify the rollout completed
kubectl rollout status deployment/awoooi-api -n awoooi-prod
```

---

## Acceptance History

| Date | Result | Notes |
|------|--------|-------|
| 2026-04-02 21:22 | 8/8 ✅ | First full Wave A pass |