diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 871c96c3..0de584ba 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -1,3 +1,102 @@
+## 2026-05-19|T85 Alert Chain durable metric evidence
+
+**Background**:
+
+- Technical debt left after T84: `awoooi_alert_chain_last_success_timestamp` was backed only by a process-local Prometheus gauge in the API.
+- When an API pod restarts or a deployment switches over, Prometheus may briefly lose sight of the "last successful alert chain time", and the post-deploy smoke check may wrongly report missing data before a new webhook arrives.
+- This causes unnecessary doubt in the Operator Console / Telegram about whether the alert chain actually ran to completion.
+
+**Changes completed**:
+
+- Added `apps/api/src/services/alert_chain_metrics_service.py`.
+- Before emitting its output, the API `/metrics` endpoint backfills `awoooi_alert_chain_last_success_timestamp` from durable DB evidence:
+  - `awooop_conversation_event`: internal timeline evidence for `alertmanager / sentry / signoz`.
+  - `alert_operation_log`: legacy Alertmanager receive evidence, used as a fallback.
+- Only `last_success_timestamp` is backfilled; `awoooi_alert_chain_healthy` is not, so that stale success evidence cannot mask a new runtime failure.
+- A 15-second in-process cache keeps `/metrics` scrapes from hitting the DB every time.
+
+**Verification**:
+
+```text
+python3 -m py_compile apps/api/src/services/alert_chain_metrics_service.py apps/api/src/main.py apps/api/tests/test_alert_chain_metrics_service.py
+  -> OK
+
+PYTHONPATH=apps/api DATABASE_URL='postgresql+asyncpg://test:test@localhost/test' apps/api/.venv/bin/python -m pytest apps/api/tests/test_alert_chain_metrics_service.py -q
+  -> 2 passed
+
+PYTHONPATH=apps/api DATABASE_URL='postgresql+asyncpg://test:test@localhost/test' apps/api/.venv/bin/python -m pytest apps/api/tests/test_alert_chain_metrics_service.py apps/api/tests/test_adr100_slo_metrics_service.py -q
+  -> 5 passed
+
+apps/api/.venv/bin/python -m ruff check --select F,E9 apps/api/src/services/alert_chain_metrics_service.py apps/api/src/main.py apps/api/tests/test_alert_chain_metrics_service.py
+  -> All checks passed
+
+git diff --check
+  -> OK
+
+Pre-deploy production DB evidence:
+  -> alertmanager from awooop_conversation_event
+  -> alertmanager from alert_operation_log
+  -> sentry from awooop_conversation_event
+  -> signoz from awooop_conversation_event
+
+Commit:
+c516f9fc fix(metrics): refresh alert chain timestamp from durable evidence
+
+Gitea Actions:
+2472 Code Review for c516f9fc -> success
+2471 CD for c516f9fc -> success
+  tests -> success
+  build-and-deploy -> success
+  post-deploy-checks -> success
+
+Deploy marker:
+6f6cf90 chore(cd): deploy c516f9f [skip ci]
+
+Post-deploy evidence:
+  -> Alert Chain Metric last alert success: 1 minute ago
+  -> Alert Chain Smoke 8/8 checks passed
+  -> OTEL Collector 2 Pod(s) Running
+  -> Event Exporter 1 Pod(s) Running
+  -> Playwright smoke 5 passed
+  -> CI/CD success notification mirrored through AWOOI API
+```
+
+**Production smoke**:
+
+```text
+K8s image:
+awoooi-api 192.168.0.110:5000/awoooi/api:c516f9fc71f358de46d566625ef9c1eb164c102d
+awoooi-worker 192.168.0.110:5000/awoooi/api:c516f9fc71f358de46d566625ef9c1eb164c102d
+awoooi-web 192.168.0.110:5000/awoooi/web:c516f9fc71f358de46d566625ef9c1eb164c102d
+
+GET https://awoooi.wooo.work/api/v1/health
+  -> healthy, prod, mock_mode=false
+
+API pod internal GET /metrics:
+  -> awoooi_alert_chain_last_success_timestamp{source="alertmanager"} 1.779185257861278e+09
+  -> awoooi_alert_chain_last_success_timestamp{source="sentry"} 1.778679960070588e+09
+  -> awoooi_alert_chain_last_success_timestamp{source="signoz"} 1.778679960111112e+09
+```
+
+**Notes / technical debt**:
+
+- The public `https://awoooi.wooo.work/metrics` is currently redirected by the Web locale middleware to `/zh-TW/metrics`; the real API metrics should be read from inside the pod/service or from the Prometheus scrape.
+- The post-deploy log still prints the existing `errSymlink cleanup noise` comment, but the job succeeded and the runner panic seen before T82 no longer occurs.
+- A next step could map Alert Chain durable evidence to the "evidence source" field in the frontend AwoooP Runs / Monitoring views, so operators can see that this metric comes from DB evidence and not only from the Prometheus scrape.
+
+**Current overall progress**:
+
+- AwoooP alert observability chain: about 97.8%.
+- Low-risk auto-remediation closed loop: about 95%.
+- Frontend AI automation management UI sync: about 92.5%.
+- CI/CD notification AwoooP main path: about 99%.
+- CI/CD runner hygiene: about 96%.
+- Governance alert readability / actionability: about 90%.
+- Alert Chain durable metric evidence: about 92%.
+- Full AI automation management productization: about 90.0%.
+
+---
+
 ## 2026-05-19|T82 Gitea CD E2E smoke symlink cleanup hygiene
 
 **Background**: