docs(metrics): record alert chain durable evidence rollout [skip ci]

Your Name
2026-05-19 18:09:47 +08:00
parent 6f6cf90a17
commit 6ea041d463


@@ -1,3 +1,102 @@
## 2026-05-19T85 Alert Chain durable metric evidence
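**Background**
- Technical debt left over from T84: `awoooi_alert_chain_last_success_timestamp` was backed only by a process-local Prometheus gauge in the API.
- When an API pod restarts or a deployment switches over, Prometheus may briefly lose sight of the "last alert chain success time", and the post-deploy smoke check may wrongly report missing data before a new webhook arrives.
- This causes unnecessary doubt in Operator Console / Telegram about whether the alert chain actually ran to completion.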
**Changes completed**
- Added `apps/api/src/services/alert_chain_metrics_service.py`.
- Before rendering its output, the API `/metrics` endpoint backfills `awoooi_alert_chain_last_success_timestamp` from durable DB evidence:
  - `awooop_conversation_event`: internal timeline evidence for `alertmanager / sentry / signoz`.
  - `alert_operation_log`: fallback to legacy Alertmanager receive evidence.
- Only `last_success_timestamp` is backfilled; `awoooi_alert_chain_healthy` is not, so that old success evidence cannot mask a new runtime failure.
- A 15-second in-process cache keeps `/metrics` scrapes from hitting the DB every time.
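
The backfill described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the contents of `alert_chain_metrics_service.py`: the names `DurableTimestampCache`, `merge_latest`, and `backfill` are hypothetical, and the DB query is stood in for by a plain callable that returns epoch seconds per source.

```python
import time
from typing import Callable, Dict, Optional

# Hypothetical shape: alert source ("alertmanager", "sentry", "signoz")
# mapped to the epoch-seconds timestamp of its latest durable success evidence.
Evidence = Dict[str, float]


class DurableTimestampCache:
    """Caches the result of a durable-evidence lookup for ttl_seconds."""

    def __init__(
        self,
        loader: Callable[[], Evidence],
        ttl_seconds: float = 15.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader  # assumed to wrap the real DB query
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Evidence = {}
        self._loaded_at: Optional[float] = None

    def get(self) -> Evidence:
        now = self._clock()
        # Reload only on first use or once the cached value is ttl_seconds old.
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._value = dict(self._loader())
            self._loaded_at = now
        return dict(self._value)


def merge_latest(*tables: Evidence) -> Evidence:
    """Keeps the newest timestamp per source across several evidence tables."""
    merged: Evidence = {}
    for table in tables:
        for source, ts in table.items():
            merged[source] = max(ts, merged.get(source, ts))
    return merged


def backfill(gauge_values: Evidence, durable: Evidence) -> Evidence:
    """Raises each in-process timestamp to the durable one, never lowers it."""
    result = dict(gauge_values)
    for source, ts in durable.items():
        result[source] = max(ts, result.get(source, 0.0))
    return result
```

In this sketch only timestamps are touched; a health gauge would stay under the control of the runtime code path, which matches the intent of not letting old success evidence mask a new failure.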
**Verification**
```text
python3 -m py_compile apps/api/src/services/alert_chain_metrics_service.py apps/api/src/main.py apps/api/tests/test_alert_chain_metrics_service.py
-> OK
PYTHONPATH=apps/api DATABASE_URL='postgresql+asyncpg://test:test@localhost/test' apps/api/.venv/bin/python -m pytest apps/api/tests/test_alert_chain_metrics_service.py -q
-> 2 passed
PYTHONPATH=apps/api DATABASE_URL='postgresql+asyncpg://test:test@localhost/test' apps/api/.venv/bin/python -m pytest apps/api/tests/test_alert_chain_metrics_service.py apps/api/tests/test_adr100_slo_metrics_service.py -q
-> 5 passed
apps/api/.venv/bin/python -m ruff check --select F,E9 apps/api/src/services/alert_chain_metrics_service.py apps/api/src/main.py apps/api/tests/test_alert_chain_metrics_service.py
-> All checks passed
git diff --check
-> OK
Pre-deploy production DB evidence:
-> alertmanager from awooop_conversation_event
-> alertmanager from alert_operation_log
-> sentry from awooop_conversation_event
-> signoz from awooop_conversation_event
Commit:
c516f9fc fix(metrics): refresh alert chain timestamp from durable evidence
Gitea Actions:
2472 Code Review for c516f9fc -> success
2471 CD for c516f9fc -> success
tests -> success
build-and-deploy -> success
post-deploy-checks -> success
Deploy marker:
6f6cf90 chore(cd): deploy c516f9f [skip ci]
Post-deploy evidence:
-> Alert Chain Metric last alert success: 1 minute ago
-> Alert Chain Smoke 8/8 checks passed
-> OTEL Collector 2 Pod(s) Running
-> Event Exporter 1 Pod(s) Running
-> Playwright smoke 5 passed
-> CI/CD success notification mirrored through AWOOI API
```
**Production smoke**
```text
K8s image:
awoooi-api 192.168.0.110:5000/awoooi/api:c516f9fc71f358de46d566625ef9c1eb164c102d
awoooi-worker 192.168.0.110:5000/awoooi/api:c516f9fc71f358de46d566625ef9c1eb164c102d
awoooi-web 192.168.0.110:5000/awoooi/web:c516f9fc71f358de46d566625ef9c1eb164c102d
GET https://awoooi.wooo.work/api/v1/health
-> healthy, prod, mock_mode=false
API pod internal GET /metrics:
-> awoooi_alert_chain_last_success_timestamp{source="alertmanager"} 1.779185257861278e+09
-> awoooi_alert_chain_last_success_timestamp{source="sentry"} 1.778679960070588e+09
-> awoooi_alert_chain_last_success_timestamp{source="signoz"} 1.778679960111112e+09
```
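
To read the `/metrics` lines above as ages rather than raw epoch values, a small parser along these lines may help. It is a sketch only: `parse_last_success` and `stale_sources` are hypothetical helpers, the regular expression assumes the metric carries exactly one `source` label as in the sample, and the staleness threshold is an illustrative parameter, not a value taken from the smoke checks.

```python
import re
from typing import Dict, List

METRIC = "awoooi_alert_chain_last_success_timestamp"
# Assumes a single source="..." label, as in the sample output.
_LINE = re.compile(r"^" + re.escape(METRIC) + r'\{source="([^"]+)"\}\s+(\S+)\s*$')


def parse_last_success(text: str) -> Dict[str, float]:
    """Extracts {source: epoch_seconds} from Prometheus exposition text."""
    result: Dict[str, float] = {}
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if match:
            result[match.group(1)] = float(match.group(2))
    return result


def stale_sources(timestamps: Dict[str, float], now: float, max_age_seconds: float) -> List[str]:
    """Returns, sorted by name, the sources whose last success is older than max_age_seconds."""
    return sorted(
        source for source, ts in timestamps.items() if now - ts > max_age_seconds
    )
```

Called with the body of an internal `/metrics` response and `time.time()` as `now`, this returns the sources that have not reported a success within the chosen window.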
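**Notes / technical debt**
- The public `https://awoooi.wooo.work/metrics` is currently redirected by the Web locale middleware to `/zh-TW/metrics`; the real API metrics should be read from the pod/service internally or via the Prometheus scrape.
- The post-deploy log still prints the existing `errSymlink cleanup noise` annotation, but the job succeeds and the runner panic seen before T82 no longer occurs.
- A next step could map Alert Chain durable evidence to the "evidence source" field in the frontend AwoooP Runs / Monitoring views, so operators can see that this metric comes from DB evidence and not only from a Prometheus scrape.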
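**Current overall progress**
- AwoooP alert observability chain: about 97.8%.
- Low-risk auto-remediation closed loop: about 95%.
- Frontend AI automation management UI sync: about 92.5%.
- CI/CD notification AwoooP main path: about 99%.
- CI/CD runner hygiene: about 96%.
- Governance alert readability / actionability: about 90%.
- Alert Chain durable metric evidence: about 92%.
- Full AI automation management productization: about 90.0%.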
---
## 2026-05-19T82 Gitea CD E2E smoke symlink cleanup hygiene
**Background**