# Alert Chain E2E Validation Runbook
> Phase O-5 Wave 5.4 (2026-04-02 ogt)
> ADR-025 / ADR-035 / ADR-037
---
## Architecture Overview
```
Prometheus → Alertmanager → AWOOOI API → Telegram
SigNoz Trace
Langfuse (AI analysis)
```
**Endpoints**:
- Prometheus: `192.168.0.110:9090`
- Alertmanager: `192.168.0.110:9093`
- AWOOOI API: `https://awoooi.wooo.work` / `192.168.0.120:32334` (K8s)
- Webhook: `POST /api/v1/webhooks/alertmanager`
---
## Quick Smoke Test
```bash
# Run the full Wave A acceptance suite (8 checks)
python3 scripts/alert_chain_smoke_test.py
# Verify monitoring coverage
python3 scripts/generate_monitoring.py
# JSON output (for CI)
python3 scripts/generate_monitoring.py --json
python3 scripts/generate_monitoring.py --check # exits 1 if coverage < 70%
```
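The `--check` gate can be pictured with the sketch below. It is not the code in `scripts/generate_monitoring.py`; the function names and the monitored/total inputs are hypothetical, and only the 70% threshold and the exit code of 1 come from the comment above.

```python
COVERAGE_THRESHOLD = 70.0  # percent, from the "--check" comment in this runbook


def coverage_percent(monitored: int, total: int) -> float:
    """Return monitored/total as a percentage; an empty inventory counts as 0%."""
    if total <= 0:
        return 0.0
    return 100.0 * monitored / total


def check_exit_code(monitored: int, total: int,
                    threshold: float = COVERAGE_THRESHOLD) -> int:
    """Return 0 when coverage meets the threshold, 1 otherwise (hypothetical helper)."""
    return 0 if coverage_percent(monitored, total) >= threshold else 1
```

A CI job would pass the result to `sys.exit()` so that a coverage drop fails the pipeline.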
---
## Manual E2E Test Steps
### Step 1: Trigger a test alert
```bash
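# Alertmanager 0.27 and later removed API v1; on those versions
# post the same body to /api/v2/alerts instead.
curl -X POST http://192.168.0.110:9093/api/v1/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning",
      "service": "test"
    },
    "annotations": {
      "summary": "E2E test alert",
      "description": "Manually triggered to validate the alert chain"
    }
  }]'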
```
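Where `curl` is not available, the same request can be sketched in Python with only the standard library. The helper names are hypothetical, and the payload mirrors the Step 1 body with English annotation text; the URL is the Alertmanager endpoint used above.

```python
import json
import urllib.request

ALERTMANAGER_URL = "http://192.168.0.110:9093/api/v1/alerts"  # same URL as the curl call


def build_test_alert(alertname="TestAlert", severity="warning", service="test"):
    """Build the one-element alert list that Alertmanager expects."""
    return [{
        "labels": {"alertname": alertname, "severity": severity, "service": service},
        "annotations": {
            "summary": "E2E test alert",
            "description": "Manually triggered to validate the alert chain",
        },
    }]


def build_request(alerts, url=ALERTMANAGER_URL):
    """Wrap the alerts in a JSON POST request without sending it."""
    body = json.dumps(alerts).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def send_test_alert(timeout=10.0):
    """Send the test alert and return the HTTP status code."""
    with urllib.request.urlopen(build_request(build_test_alert()), timeout=timeout) as resp:
        return resp.status
```

Calling `send_test_alert()` from a host that can reach Alertmanager should have the same effect as the `curl` command.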
### Step 2: Verify the AWOOOI API received the alert
```bash
# Show webhook log lines from the last 5 minutes
kubectl logs -n awoooi-prod deployment/awoooi-api --since=5m | grep -i "alertmanager\|webhook"
```
### Step 3: Verify the alert_chain metric was updated
```bash
# Query Prometheus and print the last-success timestamp in local time
curl -s "http://192.168.0.110:9090/api/v1/query?query=alert_chain_last_success_timestamp" \
| python3 -c "import json,sys,datetime; d=json.load(sys.stdin); ts=float(d['data']['result'][0]['value'][1]); print(datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S'))"
```
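The one-liner above raises `IndexError` when the query returns no series. A slightly more defensive sketch follows, assuming the standard Prometheus instant-query response shape; the function names are hypothetical, and the 60-minute limit is taken from item 2 of the smoke test checklist.

```python
import time
from typing import Optional


def last_success_age_seconds(response: dict, now: Optional[float] = None) -> Optional[float]:
    """Return seconds since the newest sample, or None if the query returned no series."""
    results = response.get("data", {}).get("result", [])
    if response.get("status") != "success" or not results:
        return None
    # Each instant-vector sample is [scrape_time, "value"]; the value is the metric itself.
    newest = max(float(series["value"][1]) for series in results)
    current = time.time() if now is None else now
    return current - newest


def is_fresh(response: dict, max_age_seconds: float = 3600.0,
             now: Optional[float] = None) -> bool:
    """True when the metric exists and was updated within max_age_seconds."""
    age = last_success_age_seconds(response, now)
    return age is not None and age <= max_age_seconds
```

Feed it the decoded JSON from the `curl` call; `is_fresh(...)` returning `False` means either the metric is missing or the chain has not succeeded in the last hour.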
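### Step 4: Verify Telegram received the notification
Check the "AwoooI SRE戰情室" (SRE war room) Telegram group; a formatted alert message should arrive there. Production alerts are no longer delivered through personal DMs with @tsenyangbot.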
---
## Smoke Test Checklist (Wave A 8/8)
| # | Check | Command | Expected |
|---|-------|---------|----------|
| 1 | API Health (6 components) | `GET /health` | 200 + all healthy |
| 2 | Alert Chain Metric | Prometheus query | timestamp ≤ 60 minutes old |
| 3 | Alertmanager Webhook | `GET /api/v1/webhooks/health` | `{"status":"ok"}` |
| 4 | SigNoz Webhook | `GET /api/v1/webhooks/signoz/health` | `{"status":"ok"}` |
| 5 | Sentry Webhook | `GET /api/v1/webhooks/sentry/health` | `{"status":"ok"}` |
| 6 | SigNoz | `GET http://192.168.0.188:3301` | HTTP 200 |
| 7 | OTEL Collector | `kubectl get pods -n otel` | 2 Running |
| 8 | Event Exporter | `kubectl get pods -n monitoring` | 1 Running |
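Two of the expectations in the checklist lend themselves to small pure functions: the `{"status":"ok"}` webhook bodies and the "N Running" pod counts. The sketch below is not taken from `scripts/alert_chain_smoke_test.py`; the function names are hypothetical, and requiring HTTP 200 for the webhook checks is an assumption.

```python
import json


def webhook_ok(status_code: int, body: str) -> bool:
    """True when the response is HTTP 200 (assumed) and the JSON body has status == "ok"."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def count_running(kubectl_output: str) -> int:
    """Count rows whose STATUS column is Running in `kubectl get pods` text output."""
    lines = kubectl_output.strip().splitlines()
    if not lines:
        return 0
    header = lines[0].split()
    if "STATUS" not in header:
        return 0
    status_idx = header.index("STATUS")
    running = 0
    for line in lines[1:]:
        cols = line.split()
        if len(cols) > status_idx and cols[status_idx] == "Running":
            running += 1
    return running
```

For item 7, for example, the check would pass when `count_running(...)` on the output of `kubectl get pods -n otel` equals 2.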
---
## Known Issues and Exemptions
| Target | Status | Reason |
|--------|--------|--------|
| federation-k8s | down | SigNoz-internal Prometheus, not exposed externally |
| kube-state-metrics | down | Reachable only from inside the OTEL Collector |
| node-exporter-120/121 | down | K8s node firewall rules |
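When reviewing scrape targets, the exemptions above can be applied as a filter so that only unexpected outages are reported. This sketch assumes the target names in the table are Prometheus `job` labels and that `node-exporter-120/121` stands for two jobs; both are assumptions, as is the function name. The input is the decoded JSON from Prometheus `GET /api/v1/targets`.

```python
from typing import Iterable, List

# Assumed job names, expanded from the exemption table in this runbook.
EXEMPT_JOBS = frozenset({
    "federation-k8s",
    "kube-state-metrics",
    "node-exporter-120",
    "node-exporter-121",
})


def unexpected_down(targets_response: dict,
                    exempt: Iterable[str] = EXEMPT_JOBS) -> List[str]:
    """Return sorted job names that are down and not covered by an exemption."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    down_jobs = {
        target.get("labels", {}).get("job", "")
        for target in active
        if target.get("health") == "down"
    }
    return sorted(down_jobs - set(exempt))
```

An empty list means every down target is already accounted for in the table.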
---
## Rollback Commands
```bash
# Roll back the Phase 24 AI Router
kubectl set env deployment/awoooi-api -n awoooi-prod USE_AI_ROUTER=false
# Restart the API pods
kubectl rollout restart deployment/awoooi-api -n awoooi-prod
# Verify the rollout completed
kubectl rollout status deployment/awoooi-api -n awoooi-prod
```
---
## Acceptance History
| Date | Result | Notes |
|------|--------|-------|
| 2026-04-02 21:22 | 8/8 ✅ | First full Wave A pass |