diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 1a378af0..871c96c3 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -11736,3 +11736,112 @@ Traceback / ERROR / CRITICAL matches in recent API logs.
 - Low-risk auto-remediation closed loop: about 95%.
 - Frontend AI automation management UI sync: about 91%.
 - Full AI automation management productization: about 86%.
+
+### 2026-05-19: T83/T84 governance alert readability and Alert Chain smoke convergence
+
+**Trigger**:
+
+- Telegram `knowledge_degradation` alerts showed only raw fields and an abstract next step, so the on-call engineer could not easily tell whether this was a service incident or a governance-quality issue, what the AI had already done, and who had to do what next.
+- The post-deploy Alert Chain Smoke had been unable to check OTEL/Event Exporter because the runner container has no `kubectl`; this left the deployment evidence for the alert chain incomplete.
+
+**Fix**:
+
+- `scripts/alert_chain_smoke_test.py` now accepts OTEL Collector / Event Exporter Pod status supplied in advance by CI, and keeps the local `kubectl` fallback.
+- `.gitea/workflows/cd.yaml` now queries the observability Pods from the host runner over SSH to the K3s node and injects their status into the smoke container, so post-deploy can directly prove that OTEL/Event Exporter are running.
+- `apps/api/src/services/failover_alerter.py` changes the display name of `knowledge_degradation` to "KM 需要更新(影響 AI 判斷)" ("KM needs updating (affects AI judgment)") and adds:
+  - `💬 白話說明` (plain-language explanation)
+  - `🧩 AI 流程狀態` (AI workflow status)
+  - `✅ 現在要做` (what to do now)
+- Payloads for this alert class now accept both `impact.total_count` and the legacy/external-style `total`, `next_step`, `automatable_work`; this keeps Telegram from showing `? / ?` or dumping raw fields into `補充欄位` (supplementary fields).
+
+**local verification**:
+
+```text
+python3 -m py_compile scripts/alert_chain_smoke_test.py OK
+ruby -e 'require "yaml"; YAML.load_file(".gitea/workflows/cd.yaml"); puts "yaml ok"' -> yaml ok
+python3 -m py_compile apps/api/src/services/failover_alerter.py apps/api/src/services/governance_agent.py apps/api/tests/test_failover_alerter.py OK
+DATABASE_URL='postgresql+asyncpg://test:test@localhost/test' pytest apps/api/tests/test_failover_alerter.py apps/api/tests/test_governance_agent.py -q
+  -> 35 passed
+git diff --check OK
+```
+
+**production deploy / smoke (done)**:
+
+```text
+T83 code commit:
+d6c941ea fix(ci): feed observability pod status into alert smoke
+
+T83 deploy marker:
+038f1a0d chore(cd): deploy d6c941e [skip ci]
+
+T83 Gitea Actions:
+2461 Code Review -> success
+2462 CD -> success
+  tests 3078 -> success
+  build-and-deploy 3079 -> success
+  post-deploy-checks 3080 -> success
+
+T83 post-deploy evidence:
+OTEL Collector -> 2 Pod(s) Running
+Event Exporter -> 1 Pod(s) Running
+Alert Chain Smoke -> 8/8 checks passed
+
+T84 code commits:
+795c9a4e fix(governance): clarify knowledge degradation alerts
+bf8974be fix(governance): normalize knowledge degradation payloads
+
+T84 deploy markers:
+81ac1f0f chore(cd): deploy 795c9a4 [skip ci]
+477a7d46 chore(cd): deploy bf8974b [skip ci]
+
+T84 Gitea Actions:
+2464 Code Review -> success
+2463 CD -> success
+2468 Code Review -> success
+2467 CD -> success
+  tests 3087 -> success
+  build-and-deploy 3088 -> success
+  post-deploy-checks 3089 -> success
+
+K8s image:
+awoooi-api 192.168.0.110:5000/awoooi/api:bf8974be0355cdfdcabcb127547c046353f8e34d
+awoooi-web 192.168.0.110:5000/awoooi/web:bf8974be0355cdfdcabcb127547c046353f8e34d
+awoooi-worker 192.168.0.110:5000/awoooi/api:bf8974be0355cdfdcabcb127547c046353f8e34d
+
+health:
+GET https://awoooi.wooo.work/api/v1/health -> healthy, prod, mock_mode=false
+
+post-deploy evidence:
+Alert Chain Metric -> last successful alert 20 minutes ago
+OTEL Collector -> 2 Pod(s) Running
+Event Exporter -> 1 Pod(s) Running
+Alert Chain Smoke -> 8/8 checks passed
+E2E smoke -> 5 passed
+CI/CD success notification -> mirrored through AWOOI API
+
+production card smoke:
+format_governance_alert_card("knowledge_degradation", legacy payload)
+  -> contains "1425 / 1856 筆 KM" (1425 / 1856 KM entries)
+  -> contains "陳舊比例:76.8%" (stale ratio: 76.8%)
+  -> contains "白話說明 / AI 流程狀態 / 現在要做" (plain-language explanation / AI workflow status / what to do now)
+  -> no "補充欄位" (supplementary fields)
+  -> no "? / ?"
+```
+
+**Interpretation**:
+
+- `knowledge_degradation` is an AI self-governance alert, not a service failure; services should not be restarted in response.
+- The correct response is to confirm or trigger `run_kb_growth_healthcheck`, let the AI cross-check Incident / Sentry / SigNoz / PlayBook to produce KM update drafts, and have the owner review the high-impact KM entries recently cited by alerts.
+- The alert should close once `stale_ratio < 20%` or the corresponding governance threshold has recovered; Telegram shows only "what to do", while the full fields remain authoritative in the AwoooP audit data.
+- The first T84 production smoke caught legacy payloads rendering `? / ?`; T84b added a compatibility layer, which was then redeployed and verified.
+- Follow-up tech debt: `awoooi_alert_chain_last_success_timestamp` is still a process-local Prometheus gauge, so a non-critical warning may appear briefly after a deploy if Prometheus has not yet scraped a new webhook; in the long term this should become DB-backed / timeline-backed evidence.
+
+**Current overall progress**:
+
+- AwoooP alert observability chain: about 97%.
+- Low-risk auto-remediation closed loop: about 95%.
+- Frontend AI automation management UI sync: about 92.5%.
+- CI/CD notification and deployment evidence chain: about 99%.
+- CI/CD runner hygiene: about 96%.
+- Governance alert readability / actionability: about 90%.
+- Full AI automation management productization: about 89.6%.
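
Reviewer note: the payload compatibility and close condition described in this entry can be illustrated with a minimal sketch. It is not the code in `failover_alerter.py`. The function names, the `stale_count` / `stale` field names, and the English summary text are hypothetical; only `impact.total_count`, the legacy `total`, `next_step`, `automatable_work`, and the `stale_ratio < 20%` threshold come from the log entry.

```python
from typing import Any, Optional

# From the log entry: the alert should close once stale_ratio < 20%.
STALE_RATIO_CLOSE_THRESHOLD = 0.20


def _first_present(*values: Any) -> Optional[Any]:
    """Return the first value that is not None, or None if all are missing."""
    for value in values:
        if value is not None:
            return value
    return None


def normalize_knowledge_degradation(payload: dict) -> dict:
    """Accept both the nested `impact.*` shape and the legacy flat shape."""
    impact = payload.get("impact") or {}
    total = _first_present(impact.get("total_count"), payload.get("total"))
    # `stale_count` / `stale` are hypothetical field names used for illustration.
    stale = _first_present(impact.get("stale_count"), payload.get("stale"))
    ratio = payload.get("stale_ratio")
    if ratio is None and total and stale is not None:
        ratio = stale / total  # derive the ratio when the sender omitted it
    return {
        "total": total,
        "stale": stale,
        "stale_ratio": ratio,
        "next_step": payload.get("next_step"),
        "automatable_work": payload.get("automatable_work"),
    }


def summary_line(normalized: dict) -> str:
    """Render a readable count line instead of a '? / ?' placeholder."""
    if normalized["total"] is None or normalized["stale"] is None:
        return "KM counts unavailable"
    line = f'{normalized["stale"]} / {normalized["total"]} KM'
    if normalized["stale_ratio"] is not None:
        line += f' (stale ratio {normalized["stale_ratio"]:.1%})'
    return line


def can_close(normalized: dict) -> bool:
    """True only when a known stale ratio is below the close threshold."""
    ratio = normalized["stale_ratio"]
    return ratio is not None and ratio < STALE_RATIO_CLOSE_THRESHOLD
```

With the figures from the production card smoke (1425 stale out of 1856), the derived ratio formats as 76.8%, matching the logged output, and `can_close` stays false until the ratio drops below 20%.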