docs(governance): record knowledge alert clarity rollout [skip ci]

Your Name
2026-05-19 15:50:20 +08:00
parent 477a7d46a8
commit f0a9b1e00a


@@ -11736,3 +11736,112 @@ Traceback / ERROR / CRITICAL matches in recent API logs.
- Low-risk auto-remediation closed loop: about 95%.
- Frontend AI automation management UI sync: about 91%.
- Full AI automation management productization: about 86%.
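### 2026-05-19 - T83/T84 governance alert readability and Alert Chain smoke convergence
**Trigger**
- The Telegram `knowledge_degradation` alert showed only raw fields and an abstract next step, so the on-call engineer could not easily tell whether this was a service incident or a governance-quality problem, what the AI had already done, and who needed to do what next.
- The post-deploy Alert Chain Smoke previously could not check OTEL/Event Exporter because the runner container had no `kubectl`, which left the deployment evidence for the alert chain incomplete.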
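**Fix**
- `scripts/alert_chain_smoke_test.py` now accepts OTEL Collector / Event Exporter Pod status supplied in advance by CI, and keeps the local `kubectl` fallback.
- `.gitea/workflows/cd.yaml` now queries the observability Pods from the host runner over SSH to the K3s node and injects their status into the smoke container, so post-deploy can directly prove that OTEL/Event Exporter are running.
- `apps/api/src/services/failover_alerter.py`: the `knowledge_degradation` display name is changed to 「KM 需要更新(影響 AI 判斷)」 ("KM needs updating (affects AI judgment)"), and three sections are added:
  - `💬 白話說明` (plain-language explanation)
  - `🧩 AI 流程狀態` (AI workflow status)
  - `✅ 現在要做` (what to do now)
- Payloads for this alert class now accept both `impact.total_count` and the legacy/external-style `total`, `next_step`, and `automatable_work`. This keeps Telegram from showing `? / ?` or dumping raw fields into `補充欄位` (supplementary fields).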
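The payload compatibility described in the last item could look roughly like the sketch below. It is not the code in `failover_alerter.py`: the function name `normalize_knowledge_degradation` is hypothetical, and only the key names `impact.total_count`, `total`, `next_step`, and `automatable_work` come from this entry. It assumes `next_step` and `automatable_work` sit at the top level of the payload.

```python
from typing import Any, Dict


def normalize_knowledge_degradation(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Collapse new-style and legacy payload shapes into one flat dict.

    Hypothetical helper for illustration only.
    """
    impact = payload.get("impact")
    if not isinstance(impact, dict):
        impact = {}

    # Prefer the nested new-style field, then fall back to the legacy key.
    total = impact.get("total_count")
    if total is None:
        total = payload.get("total")

    return {
        "total": total,  # None means "unknown"; the caller decides how to render it
        "next_step": payload.get("next_step"),
        "automatable_work": payload.get("automatable_work"),
    }
```

Returning `None` for a missing total, rather than a placeholder string, leaves the decision about how to render unknown values to the card formatter.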
**local verification**
```text
python3 -m py_compile scripts/alert_chain_smoke_test.py OK
ruby -e 'require "yaml"; YAML.load_file(".gitea/workflows/cd.yaml"); puts "yaml ok"' -> yaml ok
python3 -m py_compile apps/api/src/services/failover_alerter.py apps/api/src/services/governance_agent.py apps/api/tests/test_failover_alerter.py OK
DATABASE_URL='postgresql+asyncpg://test:test@localhost/test' pytest apps/api/tests/test_failover_alerter.py apps/api/tests/test_governance_agent.py -q
-> 35 passed
git diff --check OK
```
**production deploy / smoke (completed)**
```text
T83 code commit:
  d6c941ea fix(ci): feed observability pod status into alert smoke
T83 deploy marker:
  038f1a0d chore(cd): deploy d6c941e [skip ci]
T83 Gitea Actions:
  2461 Code Review -> success
  2462 CD -> success
    tests 3078 -> success
    build-and-deploy 3079 -> success
    post-deploy-checks 3080 -> success
T83 post-deploy evidence:
  OTEL Collector -> 2 Pod(s) Running
  Event Exporter -> 1 Pod(s) Running
  Alert Chain Smoke -> 8/8 checks passed
T84 code commits:
  795c9a4e fix(governance): clarify knowledge degradation alerts
  bf8974be fix(governance): normalize knowledge degradation payloads
T84 deploy markers:
  81ac1f0f chore(cd): deploy 795c9a4 [skip ci]
  477a7d46 chore(cd): deploy bf8974b [skip ci]
T84 Gitea Actions:
  2464 Code Review -> success
  2463 CD -> success
  2468 Code Review -> success
  2467 CD -> success
    tests 3087 -> success
    build-and-deploy 3088 -> success
    post-deploy-checks 3089 -> success
K8s image:
  awoooi-api 192.168.0.110:5000/awoooi/api:bf8974be0355cdfdcabcb127547c046353f8e34d
  awoooi-web 192.168.0.110:5000/awoooi/web:bf8974be0355cdfdcabcb127547c046353f8e34d
  awoooi-worker 192.168.0.110:5000/awoooi/api:bf8974be0355cdfdcabcb127547c046353f8e34d
health:
  GET https://awoooi.wooo.work/api/v1/health -> healthy, prod, mock_mode=false
post-deploy evidence:
  Alert Chain Metric -> last successful alert 20 minutes ago
  OTEL Collector -> 2 Pod(s) Running
  Event Exporter -> 1 Pod(s) Running
  Alert Chain Smoke -> 8/8 checks passed
  E2E smoke -> 5 passed
  CI/CD success notification -> mirrored through AWOOI API
production card smoke:
  format_governance_alert_card("knowledge_degradation", legacy payload)
    -> contains "1425 / 1856 筆 KM"
    -> contains "陳舊比例76.8%"
    -> contains "白話說明 / AI 流程狀態 / 現在要做"
    -> no "補充欄位"
    -> no "? / ?"
```
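The `OTEL Collector` / `Event Exporter` evidence lines above rely on CI handing Pod status to the smoke container, with `kubectl` as a local fallback. A minimal sketch of that lookup order follows. It is an illustration, not the contents of `scripts/alert_chain_smoke_test.py`: the function name, the environment variable name, the namespace, and the label selector are all assumptions.

```python
import os
import shutil
import subprocess
from typing import Optional


def running_pod_count(env_var: str, namespace: str, selector: str) -> Optional[int]:
    """Return the number of Running pods, or None when it cannot be determined."""
    # 1) Prefer a count injected by CI (hypothetical variable name chosen by the caller).
    injected = os.environ.get(env_var)
    if injected is not None and injected.strip().isdigit():
        return int(injected)

    # 2) Fall back to a local kubectl, if one is installed.
    if shutil.which("kubectl") is None:
        return None

    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-l", selector,
         "--field-selector=status.phase=Running", "--no-headers"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return None
    # With --no-headers, each non-empty stdout line is one pod.
    return sum(1 for line in result.stdout.splitlines() if line.strip())
```

For example, the workflow might export `OTEL_COLLECTOR_RUNNING_PODS=2` before starting the container, and the smoke test would call `running_pod_count("OTEL_COLLECTOR_RUNNING_PODS", "observability", "app=otel-collector")`. Returning `None` rather than `0` keeps "could not check" distinct from "no pods running".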
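**Interpretation**
- `knowledge_degradation` is an AI self-governance alert, not a service failure; do not restart services in response to it.
- The correct response is to confirm or trigger `run_kb_growth_healthcheck`, so that the AI cross-checks Incident / Sentry / SigNoz / PlayBook to produce KM update drafts, and to have the owner review the high-impact KM entries most recently cited by alerts.
- The alert should be closed when `stale_ratio < 20%` or when the corresponding governance threshold has recovered. Telegram shows only "what to do"; the full set of fields remains in the AwoooP audit data, which is authoritative.
- The first T84 production smoke caught that legacy payloads rendered as `? / ?`; T84b added a compatibility layer, and it was redeployed and verified.
- Follow-up tech debt: `awoooi_alert_chain_last_success_timestamp` is still a process-local Prometheus gauge. After a deploy, if no new webhook has been recorded and scraped by Prometheus yet, a non-critical warning may appear briefly. Long term this should move to DB-backed / timeline-backed evidence.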
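The close condition above can be checked with simple arithmetic. The sketch below assumes that the card's `1425 / 1856` means stale entries over total entries, which is consistent with the reported 76.8%. The function names are hypothetical, and the 20% threshold is the one quoted in this entry.

```python
def stale_ratio(stale_count: int, total_count: int) -> float:
    """Fraction of KM entries considered stale."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return stale_count / total_count


def can_close_alert(stale_count: int, total_count: int, threshold: float = 0.20) -> bool:
    """True once the stale ratio has dropped below the governance threshold."""
    return stale_ratio(stale_count, total_count) < threshold


print(f"{stale_ratio(1425, 1856):.1%}")  # 76.8%
print(can_close_alert(1425, 1856))       # False
```

Under these assumptions, fewer than 372 of the 1856 entries may remain stale before the alert can close, since 372 / 1856 is just above 20%.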
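**Current overall progress**
- AwoooP alert observability chain: about 97%.
- Low-risk auto-remediation closed loop: about 95%.
- Frontend AI automation management UI sync: about 92.5%.
- CI/CD notification and deployment evidence chain: about 99%.
- CI/CD runner hygiene: about 96%.
- Governance alert readability / actionability: about 90%.
- Full AI automation management productization: about 89.6%.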