# Alert Chain E2E Verification Runbook

> Phase O-5 Wave 5.4 (2026-04-02 ogt)
> ADR-025 / ADR-035 / ADR-037

---

## Architecture Overview

```
Prometheus → Alertmanager → AWOOOI API → Telegram
↓
SigNoz Trace
↓
Langfuse (AI analysis)
```

**Endpoints**:
- Prometheus: `192.168.0.110:9090`
- Alertmanager: `192.168.0.110:9093`
- AWOOOI API: `https://awoooi.wooo.work` / `192.168.0.120:32334` (K8s)
- Webhook: `POST /api/v1/webhooks/alertmanager`

---

## Quick Smoke Test

```bash
# Run the full Wave A acceptance suite (8 checks)
python3 scripts/alert_chain_smoke_test.py

# Verify monitoring coverage
python3 scripts/generate_monitoring.py

# JSON output (for CI)
python3 scripts/generate_monitoring.py --json
python3 scripts/generate_monitoring.py --check  # exits 1 if coverage < 70%
```

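For CI, the smoke-test commands can be chained so that the first failing check stops the pipeline. The following is a minimal sketch, not part of the repository: `run_checks` and `CHECKS` are hypothetical names, and it assumes that `alert_chain_smoke_test.py` returns a non-zero exit code on failure, which this runbook does not state. The only documented exit behaviour is that `--check` exits 1 below the 70% threshold.

```python
import subprocess
import sys

# Commands taken from this runbook; run from the repository root.
CHECKS = [
    ["python3", "scripts/alert_chain_smoke_test.py"],
    ["python3", "scripts/generate_monitoring.py", "--check"],
]


def run_checks(commands):
    """Run each command in order; return the exit code of the first failure, or 0."""
    for cmd in commands:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"FAILED ({result.returncode}): {' '.join(cmd)}", file=sys.stderr)
            return result.returncode
    return 0


# Example (hypothetical CI entry point):
# sys.exit(run_checks(CHECKS))
```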
---

## Manual E2E Test Steps

### Step 1: Trigger a test alert

```bash
# Note: Alertmanager 0.27 and later removed the v1 API; on those versions, post to /api/v2/alerts instead.
curl -X POST http://192.168.0.110:9093/api/v1/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning",
      "service": "test"
    },
    "annotations": {
      "summary": "E2E test alert",
      "description": "Manually triggered to verify the chain"
    }
  }]'
```

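The same test alert can be sent from Python when the trigger needs to live inside a script. This is a sketch under stated assumptions: `build_test_alert` and `post_alerts` are hypothetical helpers, and the `/api/v2/alerts` path assumes an Alertmanager release that serves the v2 API. The labels match the curl command in Step 1.

```python
import json
import urllib.request

# Host and port come from this runbook; the v2 path is an assumption
# (see the lead-in) and should match the deployed Alertmanager version.
ALERTMANAGER_URL = "http://192.168.0.110:9093/api/v2/alerts"


def build_test_alert(alertname="TestAlert", severity="warning", service="test"):
    """Return a one-element alert list with the same labels as the curl command in Step 1."""
    return [{
        "labels": {"alertname": alertname, "severity": severity, "service": service},
        "annotations": {
            "summary": "E2E test alert",
            "description": "Manually triggered to verify the chain",
        },
    }]


def post_alerts(url, alerts, timeout=10):
    """POST the alerts as JSON and return the HTTP status code."""
    body = json.dumps(alerts).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Example (requires network access to Alertmanager):
# print(post_alerts(ALERTMANAGER_URL, build_test_alert()))
```

Keeping the payload builder separate from the HTTP call makes it possible to reuse the same alert with a different severity or service label without touching the transport code.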
### Step 2: Verify that the AWOOOI API received the alert

```bash
# Inspect recent webhook logs
kubectl logs -n awoooi-prod deployment/awoooi-api --since=5m | grep -i "alertmanager\|webhook"
```

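If this check needs to run from a script rather than a shell, the log filter can be expressed in Python. The sketch below reuses the kubectl command from Step 2 and applies the same case-insensitive match; `filter_webhook_lines` and `fetch_api_logs` are hypothetical helper names.

```python
import re
import subprocess

# Same match as the grep in Step 2: case-insensitive "alertmanager" or "webhook".
WEBHOOK_PATTERN = re.compile(r"alertmanager|webhook", re.IGNORECASE)


def filter_webhook_lines(log_text):
    """Return the log lines that mention Alertmanager or a webhook."""
    return [line for line in log_text.splitlines() if WEBHOOK_PATTERN.search(line)]


def fetch_api_logs(since="5m"):
    """Fetch recent awoooi-api logs using the kubectl command from Step 2."""
    result = subprocess.run(
        ["kubectl", "logs", "-n", "awoooi-prod", "deployment/awoooi-api", f"--since={since}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Example (requires cluster access):
# for line in filter_webhook_lines(fetch_api_logs()):
#     print(line)
```

An empty result after triggering the test alert suggests the webhook never reached the API, which narrows the fault to the Alertmanager side of the chain.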
### Step 3: Verify that the alert_chain metric was updated

```bash
# Query Prometheus and print the last-success timestamp in local time
curl -s "http://192.168.0.110:9090/api/v1/query?query=alert_chain_last_success_timestamp" \
  | python3 -c "import json,sys,datetime; d=json.load(sys.stdin); ts=float(d['data']['result'][0]['value'][1]); print(datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S'))"
```

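The one-liner in Step 3 raises an `IndexError` when the query returns no series, and it leaves the freshness comparison to the reader. A slightly more defensive sketch is shown here; `last_success_age` and `query_prometheus` are hypothetical helpers, and the 60-minute limit is taken from item 2 of the smoke-test checklist.

```python
import json
import time
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://192.168.0.110:9090"  # endpoint from this runbook
MAX_AGE_SECONDS = 60 * 60  # checklist item 2: no more than 60 minutes old


def last_success_age(response, now=None):
    """Return the age in seconds of the newest sample, or None if no series came back."""
    results = response.get("data", {}).get("result", [])
    if not results:
        return None
    # Each instant-vector sample is [scrape_time, "value"]; the value is a Unix timestamp.
    newest = max(float(series["value"][1]) for series in results)
    now = time.time() if now is None else now
    return now - newest


def query_prometheus(expr, base_url=PROMETHEUS_URL, timeout=10):
    """Run an instant query against the Prometheus HTTP API and return the decoded JSON."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Example (requires network access to Prometheus):
# age = last_success_age(query_prometheus("alert_chain_last_success_timestamp"))
# print("no data" if age is None else f"{age:.0f}s old, fresh={age <= MAX_AGE_SECONDS}")
```

Returning `None` for an empty result keeps "metric missing" distinct from "metric stale", which usually point to different failures.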
### Step 4: Verify that Telegram received the notification

Check the "AwoooI SRE戰情室" Telegram group; a formatted alert message should arrive there. Production alerts no longer use the @tsenyangbot personal DM as a delivery channel.

---

## Smoke Test Checklist (Wave A 8/8)

| # | Check | Command | Expected |
|---|-------|---------|----------|
| 1 | API Health (6 components) | `GET /health` | 200 + all healthy |
| 2 | Alert Chain Metric | Prometheus query | timestamp no more than 60 minutes old |
| 3 | Alertmanager Webhook | `GET /api/v1/webhooks/health` | `{"status":"ok"}` |
| 4 | SigNoz Webhook | `GET /api/v1/webhooks/signoz/health` | `{"status":"ok"}` |
| 5 | Sentry Webhook | `GET /api/v1/webhooks/sentry/health` | `{"status":"ok"}` |
| 6 | SigNoz | `GET http://192.168.0.188:3301` | HTTP 200 |
| 7 | OTEL Collector | `kubectl get pods -n otel` | 2 Running |
| 8 | Event Exporter | `kubectl get pods -n monitoring` | 1 Running |

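Two kinds of expectation recur in the checklist: a `{"status":"ok"}` body (items 3 to 5) and a count of Running pods (items 7 and 8). The sketch below shows one way to evaluate both. It is not the logic of `alert_chain_smoke_test.py`, whose implementation is not shown in this runbook; `is_status_ok` and `count_running` are hypothetical names, and the pod parser assumes the default column layout of `kubectl get pods`.

```python
import json


def is_status_ok(body):
    """Return True if a health response body is a JSON object with status "ok"."""
    try:
        return json.loads(body).get("status") == "ok"
    except (json.JSONDecodeError, AttributeError, TypeError):
        return False


def count_running(kubectl_output):
    """Count pods whose STATUS column reads Running in default `kubectl get pods` output."""
    lines = kubectl_output.strip().splitlines()
    if not lines or "STATUS" not in lines[0].split():
        return 0  # e.g. "No resources found in <namespace> namespace."
    status_index = lines[0].split().index("STATUS")
    count = 0
    for line in lines[1:]:
        columns = line.split()
        if len(columns) > status_index and columns[status_index] == "Running":
            count += 1
    return count


# Example: item 7 expects count_running(<output of `kubectl get pods -n otel`>) == 2
```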
---

## Known Issues and Exemptions

| Target | Status | Reason |
|--------|--------|--------|
| federation-k8s | down | Prometheus internal to SigNoz; not exposed externally |
| kube-state-metrics | down | Accessed only internally by the OTEL Collector |
| node-exporter-120/121 | down | K8s node firewall rules |

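When reviewing target health, it helps to separate the exempt targets in the table above from unexpected failures. The following sketch filters a decoded response from the Prometheus `GET /api/v1/targets` endpoint. Treating the table's target names as `job` labels, and expanding `node-exporter-120/121` into two jobs, are assumptions; `unexpected_down_jobs` is a hypothetical helper.

```python
# Exempt targets from the table above. Mapping these names to Prometheus
# `job` labels is an assumption; adjust to match the real scrape config.
EXEMPT_JOBS = {
    "federation-k8s",
    "kube-state-metrics",
    "node-exporter-120",
    "node-exporter-121",
}


def unexpected_down_jobs(targets_response, exempt=EXEMPT_JOBS):
    """Return the sorted job names of down targets that are not on the exemption list."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    down = {
        target.get("labels", {}).get("job", "<unknown>")
        for target in active
        if target.get("health") == "down"
    }
    return sorted(down - set(exempt))


# Example: an empty list means every down target is a known exemption.
```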
---

## Rollback Commands

```bash
# Roll back the Phase 24 AI Router
kubectl set env deployment/awoooi-api -n awoooi-prod USE_AI_ROUTER=false

# Restart the API pods
kubectl rollout restart deployment/awoooi-api -n awoooi-prod

# Verify that the rollout completed
kubectl rollout status deployment/awoooi-api -n awoooi-prod
```

---

## Acceptance History

| Date | Result | Notes |
|------|--------|-------|
| 2026-04-02 21:22 | 8/8 ✅ | First full pass of Wave A |