docs(metrics): record alert chain durable evidence rollout [skip ci]
@@ -1,3 +1,102 @@
## 2026-05-19|T85 Alert Chain durable metric evidence

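**Background**:

- Technical debt left over from T84: `awoooi_alert_chain_last_success_timestamp` previously relied solely on a process-local Prometheus gauge in the API.
- When an API pod restarts or a deployment switches over, Prometheus may briefly lose sight of the "last successful alert chain" time, and the post-deploy smoke check may wrongly report missing data before a new webhook arrives.
- This causes unnecessary doubt in the Operator Console / Telegram about whether the alert chain actually ran to completion.
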
**Completed changes**:

- Added `apps/api/src/services/alert_chain_metrics_service.py`.
- Before emitting output, the API `/metrics` endpoint now backfills `awoooi_alert_chain_last_success_timestamp` from durable DB evidence:
  - `awooop_conversation_event`: internal timeline evidence for `alertmanager / sentry / signoz`.
  - `alert_operation_log`: fallback to legacy Alertmanager receive evidence.
- Only `last_success_timestamp` is backfilled, not `awoooi_alert_chain_healthy`, so that old success evidence cannot mask a newer runtime failure.
- A 15-second in-process cache keeps `/metrics` scrapes from hitting the DB on every request.

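The backfill described above can be sketched roughly as follows. This is a minimal illustration, not the contents of `alert_chain_metrics_service.py`: the names `merge_durable_evidence`, `refresh_timestamps`, and `TtlCache` are hypothetical, evidence rows are modeled as plain `(source, unix_seconds)` tuples rather than real DB rows, and taking the newest timestamp across both tables is an assumption about how the fallback combines.

```python
import time
from typing import Callable, Dict, Iterable, Optional, Tuple

# Hypothetical row shape: (source, unix_timestamp_seconds).
EvidenceRow = Tuple[str, float]

KNOWN_SOURCES = ("alertmanager", "sentry", "signoz")


def merge_durable_evidence(
    timeline_rows: Iterable[EvidenceRow],
    legacy_rows: Iterable[EvidenceRow],
) -> Dict[str, float]:
    """Return the newest success timestamp per known source.

    timeline_rows stands in for awooop_conversation_event and legacy_rows
    for alert_operation_log. Unknown sources are ignored.
    """
    latest: Dict[str, float] = {}
    for source, ts in list(timeline_rows) + list(legacy_rows):
        if source not in KNOWN_SOURCES:
            continue
        if ts > latest.get(source, float("-inf")):
            latest[source] = ts
    return latest


def refresh_timestamps(
    gauge_values: Dict[str, float],
    durable: Dict[str, float],
) -> Dict[str, float]:
    """Backfill timestamps without ever moving a source backwards.

    Only the timestamp map is touched; a separate "healthy" gauge is
    deliberately left alone (assumed to be driven by runtime results only).
    """
    merged = dict(gauge_values)
    for source, ts in durable.items():
        if ts > merged.get(source, float("-inf")):
            merged[source] = ts
    return merged


class TtlCache:
    """Cache a loader's result for ttl_seconds (15 s in the change above)."""

    def __init__(
        self,
        loader: Callable[[], Dict[str, float]],
        ttl_seconds: float = 15.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[Dict[str, float]] = None
        self._expires_at: Optional[float] = None

    def get(self) -> Dict[str, float]:
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Cache is empty or stale: call the loader (the DB query) once.
            self._value = self._loader()
            self._expires_at = now + self._ttl
        return self._value
```

In this sketch a `/metrics` handler would call `cache.get()` and pass the result to `refresh_timestamps` before rendering, so the loader runs at most once per 15 seconds per process regardless of scrape frequency.
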
**Verification**:

```text
python3 -m py_compile apps/api/src/services/alert_chain_metrics_service.py apps/api/src/main.py apps/api/tests/test_alert_chain_metrics_service.py
-> OK

PYTHONPATH=apps/api DATABASE_URL='postgresql+asyncpg://test:test@localhost/test' apps/api/.venv/bin/python -m pytest apps/api/tests/test_alert_chain_metrics_service.py -q
-> 2 passed

PYTHONPATH=apps/api DATABASE_URL='postgresql+asyncpg://test:test@localhost/test' apps/api/.venv/bin/python -m pytest apps/api/tests/test_alert_chain_metrics_service.py apps/api/tests/test_adr100_slo_metrics_service.py -q
-> 5 passed

apps/api/.venv/bin/python -m ruff check --select F,E9 apps/api/src/services/alert_chain_metrics_service.py apps/api/src/main.py apps/api/tests/test_alert_chain_metrics_service.py
-> All checks passed

git diff --check
-> OK

Pre-deploy production DB evidence:
-> alertmanager from awooop_conversation_event
-> alertmanager from alert_operation_log
-> sentry from awooop_conversation_event
-> signoz from awooop_conversation_event

Commit:
c516f9fc fix(metrics): refresh alert chain timestamp from durable evidence

Gitea Actions:
2472 Code Review for c516f9fc -> success
2471 CD for c516f9fc -> success
tests -> success
build-and-deploy -> success
post-deploy-checks -> success

Deploy marker:
6f6cf90 chore(cd): deploy c516f9f [skip ci]

Post-deploy evidence:
-> Alert Chain Metric last successful alert: 1 minute ago
-> Alert Chain Smoke 8/8 checks passed
-> OTEL Collector 2 Pod(s) Running
-> Event Exporter 1 Pod(s) Running
-> Playwright smoke 5 passed
-> CI/CD success notification mirrored through AWOOI API
```

**Production smoke**:

```text
K8s image:
awoooi-api 192.168.0.110:5000/awoooi/api:c516f9fc71f358de46d566625ef9c1eb164c102d
awoooi-worker 192.168.0.110:5000/awoooi/api:c516f9fc71f358de46d566625ef9c1eb164c102d
awoooi-web 192.168.0.110:5000/awoooi/web:c516f9fc71f358de46d566625ef9c1eb164c102d

GET https://awoooi.wooo.work/api/v1/health
-> healthy, prod, mock_mode=false

API pod internal GET /metrics:
-> awoooi_alert_chain_last_success_timestamp{source="alertmanager"} 1.779185257861278e+09
-> awoooi_alert_chain_last_success_timestamp{source="sentry"} 1.778679960070588e+09
-> awoooi_alert_chain_last_success_timestamp{source="signoz"} 1.778679960111112e+09
```

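The gauge values above are Unix timestamps in seconds, so a smoke check can turn them into a per-source age. The following is a minimal sketch of that step under stated assumptions: `parse_last_success` and `ages_in_seconds` are hypothetical helpers, and the regular expression assumes each sample carries only a `source` label, as in the output shown here.

```python
import re
from typing import Dict

# Assumes a single `source` label per sample, as in the smoke output above.
SAMPLE_LINE = re.compile(
    r'^awoooi_alert_chain_last_success_timestamp'
    r'\{source="(?P<source>[^"]+)"\}\s+(?P<value>\S+)$'
)


def parse_last_success(metrics_text: str) -> Dict[str, float]:
    """Extract {source: unix_seconds} from Prometheus text exposition output."""
    result: Dict[str, float] = {}
    for line in metrics_text.splitlines():
        match = SAMPLE_LINE.match(line.strip())
        if match:
            result[match.group("source")] = float(match.group("value"))
    return result


def ages_in_seconds(timestamps: Dict[str, float], now: float) -> Dict[str, float]:
    """Return how long ago each source last succeeded, relative to `now`."""
    return {source: now - ts for source, ts in timestamps.items()}


SAMPLE = """\
awoooi_alert_chain_last_success_timestamp{source="alertmanager"} 1.779185257861278e+09
awoooi_alert_chain_last_success_timestamp{source="sentry"} 1.778679960070588e+09
awoooi_alert_chain_last_success_timestamp{source="signoz"} 1.778679960111112e+09
"""

if __name__ == "__main__":
    import time

    for source, age in sorted(ages_in_seconds(parse_last_success(SAMPLE), time.time()).items()):
        print(f"{source}: last success {age:.0f} s ago")
```

In the sample values, `sentry` and `signoz` are several days older than `alertmanager`, which is the kind of gap a per-source age makes visible.
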
**Notes / technical debt**:

- The public `https://awoooi.wooo.work/metrics` is currently redirected by the Web locale middleware to `/zh-TW/metrics`; the real API metrics should be read from inside the pod/service or from the Prometheus scrape.
- The post-deploy log still prints the existing `errSymlink cleanup noise` comment, but the job succeeds, and the runner panic seen before T82 no longer occurs.
- Next step: map Alert Chain durable evidence to the "evidence source" field in the frontend AwoooP Runs / Monitoring views, so operators can see that this metric comes from DB evidence and not only from the Prometheus scrape.

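A check that fetches `/metrics` over the public hostname could guard against the locale redirect noted above by comparing the requested URL with the final URL after redirects. This is only a sketch: `redirected_to_locale_page` is a hypothetical helper, and the locale pattern (two lowercase letters with an optional region such as `-TW`) is an assumption about how the Web middleware names its prefixes.

```python
import re
from urllib.parse import urlsplit

# Matches a leading locale segment such as /zh-TW or /en (assumed naming).
LOCALE_PREFIX = re.compile(r"^/[a-z]{2}(?:-[A-Z]{2})?(?=/)")


def redirected_to_locale_page(requested_url: str, final_url: str) -> bool:
    """True when final_url is requested_url with a locale segment prepended."""
    requested_path = urlsplit(requested_url).path
    final_path = urlsplit(final_url).path
    if final_path == requested_path:
        return False
    stripped = LOCALE_PREFIX.sub("", final_path, count=1)
    return stripped == requested_path
```

A check that sees `True` here is looking at a Web page, not API metrics, and should fall back to the pod/service internal endpoint.
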
**Current overall progress**:

- AwoooP alert observability chain: about 97.8%.
- Low-risk auto-remediation closed loop: about 95%.
- Frontend AI automation management UI sync: about 92.5%.
- CI/CD notification AwoooP main path: about 99%.
- CI/CD runner hygiene: about 96%.
- Governance alert readability / actionability: about 90%.
- Alert Chain durable metric evidence: about 92%.
- Full AI automation management productization: about 90.0%.

---

## 2026-05-19|T82 Gitea CD E2E smoke symlink cleanup hygiene

**Background**: