awoooi/docs/runbooks/ALERT-CHAIN-E2E-VALIDATION.md
Your Name 61f5a6a419
fix(telegram): route alerts to SRE war room
2026-04-30 15:01:23 +08:00


Alert Chain E2E Validation Runbook

Phase O-5 Wave 5.4 (2026-04-02 ogt) ADR-025 / ADR-035 / ADR-037


Architecture Overview

Prometheus → Alertmanager → AWOOOI API → Telegram
                              ↓
                          SigNoz Trace
                              ↓
                          Langfuse (AI analysis)

Endpoints:

  • Prometheus: 192.168.0.110:9090
  • Alertmanager: 192.168.0.110:9093
  • AWOOOI API: https://awoooi.wooo.work / 192.168.0.120:32334 (K8s)
  • Webhook: POST /api/v1/webhooks/alertmanager

Quick Smoke Test

# Run the full Wave A acceptance suite (8 checks)
python3 scripts/alert_chain_smoke_test.py

# Verify monitoring coverage
python3 scripts/generate_monitoring.py

# JSON output (for CI)
python3 scripts/generate_monitoring.py --json
python3 scripts/generate_monitoring.py --check  # exits 1 if coverage < 70%
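
The `--check` mode above acts as a CI gate with a 70% threshold. This runbook does not show the internals of `generate_monitoring.py`, so the following is only a sketch of how such a gate could be computed. The function names and the covered/total inputs are hypothetical and are not taken from the script.

```python
def coverage_percent(covered: int, total: int) -> float:
    """Return coverage as a percentage; 0.0 when there is nothing to cover."""
    if total <= 0:
        return 0.0
    return 100.0 * covered / total


def check_exit_code(covered: int, total: int, threshold: float = 70.0) -> int:
    """Return 0 when coverage meets the threshold, 1 otherwise.

    Hypothetical stand-in for the `--check` behaviour (exit 1 if < 70%).
    """
    return 0 if coverage_percent(covered, total) >= threshold else 1
```

A CI step could pass the result to `sys.exit()`, so that exactly 70% passes and anything lower fails the job.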

Manual E2E Test Steps

Step 1: Fire a test alert

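# Note: Alertmanager 0.27+ removed the v1 API; on those versions post the
# same payload to /api/v2/alerts instead.
curl -X POST http://192.168.0.110:9093/api/v1/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning",
      "service": "test"
    },
    "annotations": {
      "summary": "E2E test alert",
      "description": "Manually triggered to validate the alert chain"
    }
  }]'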
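
For scripted runs, the same request can be sketched in Python using only the standard library. This is an illustration and not part of the project's scripts: `build_test_alert` and `post_alerts` are hypothetical helper names, and the optional `startsAt`/`endsAt` fields are an addition so the test alert resolves on its own after a few minutes.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

# Endpoint from Step 1; switch to /api/v2/alerts on Alertmanager 0.27+.
ALERTMANAGER_URL = "http://192.168.0.110:9093/api/v1/alerts"


def build_test_alert(now=None, ttl_minutes=5):
    """Build the Step 1 payload, plus RFC 3339 start/end times (an assumption)."""
    if now is None:
        now = datetime.now(timezone.utc)
    return [{
        "labels": {
            "alertname": "TestAlert",
            "severity": "warning",
            "service": "test",
        },
        "annotations": {
            "summary": "E2E test alert",
            "description": "Manually triggered to validate the alert chain",
        },
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=ttl_minutes)).isoformat(),
    }]


def post_alerts(alerts, url=ALERTMANAGER_URL, timeout=10):
    """POST the alert list as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `post_alerts(build_test_alert())` from a host that can reach Alertmanager should return 200 when the alert is accepted.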

Step 2: Verify that the AWOOOI API received the alert

# Check the webhook logs from the last 5 minutes
kubectl logs -n awoooi-prod deployment/awoooi-api --since=5m | grep -i "alertmanager\|webhook"

Step 3: Verify that the alert_chain metric was updated

# Query Prometheus and print the last-success timestamp in local time
curl -s "http://192.168.0.110:9090/api/v1/query?query=alert_chain_last_success_timestamp" \
  | python3 -c "import json,sys,datetime; d=json.load(sys.stdin); ts=float(d['data']['result'][0]['value'][1]); print(datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S'))"
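
The one-liner above raises an `IndexError` when the query returns no series, and it leaves the freshness judgement to the reader. A slightly more defensive sketch follows, assuming the standard Prometheus instant-query response shape. The function names are hypothetical, and the 60-minute limit is taken from the checklist in this runbook.

```python
import time

MAX_AGE_SECONDS = 60 * 60  # checklist expectation: at most 60 minutes old


def last_success_age(response, now=None):
    """Return the age in seconds of the newest sample, or None if no series."""
    if response.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {response.get('status')!r}")
    results = response.get("data", {}).get("result", [])
    if not results:
        return None
    if now is None:
        now = time.time()
    # Each series carries value = [sample_time, "<metric value as string>"];
    # the metric value itself is the last-success Unix timestamp.
    newest = max(float(series["value"][1]) for series in results)
    return now - newest


def is_fresh(response, now=None, max_age=MAX_AGE_SECONDS):
    """True when the metric exists and is no older than max_age seconds."""
    age = last_success_age(response, now)
    return age is not None and age <= max_age
```

Feeding it the parsed JSON from the `curl` call (for example via `json.load(sys.stdin)`) gives a boolean that a script can turn into an exit code.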

Step 4: Verify that Telegram received the notification

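Check the "AwoooI SRE戰情室" (SRE war room) Telegram group; a formatted alert message should appear there. Production alerts no longer use a personal DM from @tsenyangbot as the delivery channel.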


Smoke Test Checklist (Wave A 8/8)

| # | Item | Command | Expected |
|---|------|---------|----------|
| 1 | API Health (6 components) | GET /health | 200 + all healthy |
| 2 | Alert Chain Metric | Prometheus query | timestamp no more than 60 minutes old |
| 3 | Alertmanager Webhook | GET /api/v1/webhooks/health | {"status":"ok"} |
| 4 | SigNoz Webhook | GET /api/v1/webhooks/signoz/health | {"status":"ok"} |
| 5 | Sentry Webhook | GET /api/v1/webhooks/sentry/health | {"status":"ok"} |
| 6 | SigNoz | GET http://192.168.0.188:3301 | HTTP 200 |
| 7 | OTEL Collector | kubectl get pods -n otel | 2 Running |
| 8 | Event Exporter | kubectl get pods -n monitoring | 1 Running |
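
Checks 7 and 8 come down to counting pods in the `Running` state. As a rough sketch, the default `kubectl get pods` table can be parsed as shown here. `count_running` and its `name_contains` filter are hypothetical helpers, and the real smoke test script may do this differently.

```python
def count_running(kubectl_output, name_contains=""):
    """Count pods whose STATUS column is Running.

    Expects the default `kubectl get pods` table
    (NAME READY STATUS RESTARTS AGE). `name_contains` optionally
    restricts the count to pods whose name includes that substring.
    """
    running = 0
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        if name_contains in fields[0] and fields[2] == "Running":
            running += 1
    return running
```

For check 7 the expectation would then read `count_running(output) == 2`. For check 8 a name filter is probably needed, since the `monitoring` namespace may hold other pods.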

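Known Issues and Exemptions

| Target | Status | Reason |
|--------|--------|--------|
| federation-k8s | down | SigNoz-internal Prometheus, not exposed externally |
| kube-state-metrics | down | Accessible only internally by the OTEL Collector |
| node-exporter-120/121 | down | K8s node firewall rules |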
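
When reviewing target health, the exemptions above can be applied automatically so that only unexpected failures surface. The sketch below assumes the response shape of the Prometheus `/api/v1/targets` endpoint. It also assumes that the target names listed above match the `job` label, which this runbook does not confirm.

```python
# Assumption: the exempt target names correspond to Prometheus `job` labels.
EXEMPT_JOBS = frozenset({
    "federation-k8s",
    "kube-state-metrics",
    "node-exporter-120",
    "node-exporter-121",
})


def unexpected_down(targets_response, exempt=EXEMPT_JOBS):
    """Return the sorted job names that are down and not on the exemption list."""
    down = set()
    active = targets_response.get("data", {}).get("activeTargets", [])
    for target in active:
        job = target.get("labels", {}).get("job", "")
        if target.get("health") == "down" and job not in exempt:
            down.add(job)
    return sorted(down)
```

An empty list means every down target is a known, exempted one.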

Rollback Commands

# Roll back the Phase 24 AI Router
kubectl set env deployment/awoooi-api -n awoooi-prod USE_AI_ROUTER=false

# Restart the API pods
kubectl rollout restart deployment/awoooi-api -n awoooi-prod

# Verify the rollout
kubectl rollout status deployment/awoooi-api -n awoooi-prod

Acceptance History

| Date | Result | Notes |
|------|--------|-------|
| 2026-04-02 21:22 | 8/8 | First full pass of Wave A |