docs(governance): record t21 verifier coverage rollout
@@ -1,3 +1,50 @@
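## 2026-05-14 | T21 Post-execution verifier coverage surfaced in the frontend; SLO no longer looks only at the Prometheus denominator

**Background**: T20 already lets the four ADR-100 SLOs render as `ok / skipped_low_volume / no_data` on `/governance`, but the Operator still cannot directly tell whether every auto-repair in the last 24h has a verifier write-back, whether there is an unverified backlog, or whether the verification result is success or degraded/failed. As a result, Telegram / SLO only tell people that an alert exists, without explaining at which node the AI automation flow is stuck.

**Fix**: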
- The read-only `adr100` payload of `/api/v1/ai/slo` gains `verification_coverage`, which queries PostgreSQL to join `auto_repair_executions` with the latest `incident_evidence.verification_result`.
- The coverage payload returns, for the last 24h, `total_auto / verified_auto / unverified_auto / verified_success / verified_non_success / coverage_rate / verification_success_rate / last_verified_auto_at / recent_unverified`.
- Coverage status semantics: no auto-repair samples → `skipped_low_volume`; unverified backlog present → `warning: verification_backlog_present`; every auto-repair has a verifier result but some are degraded/failed/timeout → `warning: non_success_verification_present`; all verified success → `ok`.
- The `/governance` SLO tab gains a "Verification coverage" (「驗證覆蓋率」) panel showing last-24h auto-repairs, verified, pending verification, coverage rate, verification success rate, last verified execution, and reason; copy has been added for `zh-TW` / `en` i18n.
- `evaluated_at` now uses the Taipei timezone utility, which also clears the UTC timestamp tech debt that was in the touched service.
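
The status semantics listed above can be sketched as a small pure function. This is a minimal illustration, not the code in `adr100_slo_status_service.py`: `CoverageCounts`, `summarize_coverage`, the fixed UTC+8 offset standing in for the Taipei timezone utility, the four-decimal rounding, and the `None` reason for `ok` / `skipped_low_volume` are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: a fixed UTC+8 offset stands in for the project's Taipei timezone helper.
TAIPEI = timezone(timedelta(hours=8), "Asia/Taipei")


@dataclass(frozen=True)
class CoverageCounts:
    """Hypothetical container for the 24h counters named in the payload."""
    total_auto: int
    verified_auto: int
    verified_success: int


def summarize_coverage(counts: CoverageCounts, now: Optional[datetime] = None) -> dict:
    """Derive status, reason, and rates from raw counters.

    `now`, when given, must be timezone-aware.
    """
    unverified = counts.total_auto - counts.verified_auto
    non_success = counts.verified_auto - counts.verified_success

    # Order matters: a backlog is reported before non-success results.
    if counts.total_auto == 0:
        status, reason = "skipped_low_volume", None
    elif unverified > 0:
        status, reason = "warning", "verification_backlog_present"
    elif non_success > 0:
        status, reason = "warning", "non_success_verification_present"
    else:
        status, reason = "ok", None

    def rate(numerator: int, denominator: int) -> Optional[float]:
        # Assumption: rates are rounded to four decimals; None when undefined.
        return round(numerator / denominator, 4) if denominator else None

    moment = now or datetime.now(TAIPEI)
    return {
        "status": status,
        "reason": reason,
        "total_auto": counts.total_auto,
        "verified_auto": counts.verified_auto,
        "unverified_auto": unverified,
        "verified_success": counts.verified_success,
        "verified_non_success": non_success,
        "coverage_rate": rate(counts.verified_auto, counts.total_auto),
        "verification_success_rate": rate(counts.verified_success, counts.verified_auto),
        "evaluated_at": moment.astimezone(TAIPEI).isoformat(),
    }
```

Checking the backlog before the non-success case follows the wording of the semantics: `non_success_verification_present` is only described for the situation where every auto-repair already has a verifier result.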
|
||||
|
||||
**本地驗證**:
|
||||
- `python3 -m py_compile apps/api/src/services/adr100_slo_status_service.py apps/api/src/api/v1/ai_slo.py apps/api/tests/test_adr100_slo_status_service.py`:pass。
|
||||
- `DATABASE_URL=postgresql+asyncpg://test:test@localhost:5432/test pytest tests/test_adr100_slo_status_service.py tests/test_adr100_slo_metrics_service.py tests/test_governance_agent.py tests/test_ai_governance_endpoints.py -q`:53 passed。
|
||||
- `ruff check --select F,E9 ...`:pass。
|
||||
- `pnpm --filter @awoooi/web typecheck`:pass。
|
||||
- `pnpm --dir apps/web exec next lint --file src/app/[locale]/governance/tabs/slo-tab.tsx`:pass。
|
||||
- `NEXT_PUBLIC_API_URL=https://awoooi.wooo.work pnpm --filter @awoooi/web build`:pass。
|
||||
- i18n JSON parse / `git diff --check`:pass。

**Rollout and production verification**:
- `485c58d0 feat(governance): surface verification coverage` pushed to Gitea main.
- Gitea Code Review run `2169`: success; CD run `2168`: tests / build-and-deploy / post-deploy-checks all success.
- Latest deploy marker: `b1893395 chore(cd): deploy 485c58d [skip ci]`.
- Production image: API / Worker `192.168.0.110:5000/awoooi/api:485c58d0852dd308f15da9259ae453d3dbf0b28e`, Web `192.168.0.110:5000/awoooi/web:485c58d0852dd308f15da9259ae453d3dbf0b28e`.
- K8s rollout: `awoooi-api` / `awoooi-worker` / `awoooi-web` in `awoooi-prod` all successfully rolled out.
- `https://awoooi.wooo.work/api/v1/health`: 200 healthy.
- `https://awoooi.wooo.work/api/v1/ai/slo` production payload:
  - `adr100.overall_status=warning`
  - `adr100.overall_compliance=1.0`
  - `verification_coverage.status=warning`
  - `reason=non_success_verification_present`
  - `total_auto=14`
  - `verified_auto=14`
  - `unverified_auto=0`
  - `verified_success=5`
  - `verified_non_success=9`
  - `coverage_rate=1.0`
  - `verification_success_rate=0.3571`
- `https://awoooi.wooo.work/zh-TW/governance`: 200.
- Playwright production render check: `驗證覆蓋率`, `覆蓋率`, and the `需追蹤 / degraded` reason are visible; console error = 0.
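
The production payload values above are internally consistent: 14 = 14 + 0, 14 = 5 + 9, 14/14 = 1.0, and 5/14 ≈ 0.3571. A post-deploy check along these lines could assert that automatically. The function name `check_coverage_invariants` and the tolerance are hypothetical; only the field names and numbers come from the payload recorded in this log.

```python
def check_coverage_invariants(cov: dict, tol: float = 1e-4) -> list:
    """Return a list of human-readable problems; empty means consistent."""
    problems = []

    if cov["verified_auto"] + cov["unverified_auto"] != cov["total_auto"]:
        problems.append("verified_auto + unverified_auto != total_auto")

    if cov["verified_success"] + cov["verified_non_success"] != cov["verified_auto"]:
        problems.append("verified_success + verified_non_success != verified_auto")

    # Rates are only defined when their denominator is non-zero.
    if cov["total_auto"]:
        expected = cov["verified_auto"] / cov["total_auto"]
        if abs(cov["coverage_rate"] - expected) > tol:
            problems.append("coverage_rate does not match the counters")

    if cov["verified_auto"]:
        expected = cov["verified_success"] / cov["verified_auto"]
        if abs(cov["verification_success_rate"] - expected) > tol:
            problems.append("verification_success_rate does not match the counters")

    return problems


# Values copied from the production payload recorded in this entry.
production = {
    "total_auto": 14,
    "verified_auto": 14,
    "unverified_auto": 0,
    "verified_success": 5,
    "verified_non_success": 9,
    "coverage_rate": 1.0,
    "verification_success_rate": 0.3571,
}
problems = check_coverage_invariants(production)
```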
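
**Current overall progress**:
- Alertmanager low-risk auto-repair mainline: about 96%.
- Full AI automation management productization: about 86%.
- T21 pushed verifier coverage / freshness from the backend truth chain to the frontend. The suggested next step, T22, is to break down the causes of the 9 non-success verifications and route degraded/failed/timeout into the work pipeline and into Ticket / PlayBook / KM remediation items.

## 2026-05-14 | T20 Governance SLO frontend status semantics wired in; low sample volume no longer masquerades as a red light

**Background**: T19 fixed the KM growth false-red, but the `/governance` frontend SLO tab still consumed the old three-metric shape of `/api/v1/ai/slo` and could not represent ADR-100's `skipped / no_data / low volume` semantics. As a result, an Operator looking at Telegram or the frontend could still mistake "the 5m/1h denominator has no valid events yet" for a real red light, or fail to see that KM had already met its target.
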
@@ -2151,6 +2151,15 @@ After Phase 6 completes
- Production evidence: API / Worker / Web images are all `809bc9670b9fde034bc1fc0cd6bc5575c1bea8f0`; `/api/v1/ai/slo` returns `overall_status=partial`, `overall_compliance=1.0`, three ratio SLOs at `skipped_low_volume`, and `km_growth_rate=ok,value=24`; `/zh-TW/governance` 200.
- Progress update: Alertmanager low-risk auto-repair mainline about 96%; full AI automation management productization about 84%. Next step: add verifier coverage / post-execution verification freshness so that decision accuracy can quickly produce evaluable samples whenever auto-execution events occur.

**T21 Post-execution verifier coverage frontend rollout (2026-05-14 Taipei)**:
- Trigger: T20 can already present the low-sample semantics of the Prometheus SLOs, but the Operator still cannot directly see whether every auto-repair in the last 24h has an `incident_evidence.verification_result`, whether there is an unverified backlog, or whether most verification results are degraded/failed/timeout.
- Fix: `adr100_slo_status_service.py` adds a read-only PostgreSQL coverage read model; `adr100.verification_coverage` in `/api/v1/ai/slo` returns `total_auto / verified_auto / unverified_auto / verified_success / verified_non_success / coverage_rate / verification_success_rate / last_verified_auto_at / recent_unverified`.
- UI: the `/governance` SLO tab gains a "Verification coverage" (「驗證覆蓋率」) panel showing auto-repairs, verified, pending verification, coverage rate, verification success rate, and reason; `zh-TW` / `en` i18n is complete.
- Verification: py_compile pass; `pytest tests/test_adr100_slo_status_service.py tests/test_adr100_slo_metrics_service.py tests/test_governance_agent.py tests/test_ai_governance_endpoints.py -q` 53 passed; ruff F/E9 pass; web typecheck / target lint / production build pass; i18n JSON parse / diff check pass; Playwright production render check console error = 0.
- Production deploy: `485c58d0 feat(governance): surface verification coverage` pushed to Gitea main; Code Review run `2169` success; CD run `2168` tests / build-and-deploy / post-deploy-checks success; deploy marker `b1893395 chore(cd): deploy 485c58d [skip ci]`.
- Production evidence: API / Worker / Web images are all `485c58d0852dd308f15da9259ae453d3dbf0b28e`; `/api/v1/ai/slo` returns `adr100.overall_status=warning`, `overall_compliance=1.0`, `verification_coverage.status=warning`, `reason=non_success_verification_present`, `total_auto=14`, `verified_auto=14`, `unverified_auto=0`, `verified_success=5`, `verified_non_success=9`, `coverage_rate=1.0`; `/zh-TW/governance` 200.
- Progress update: Alertmanager low-risk auto-repair mainline about 96%; full AI automation management productization about 86%. The next step, T22, should break down the root causes of the 9 non-success verifications so that degraded/failed/timeout results flow into the work pipeline, Ticket, and PlayBook/KM remediation items instead of stopping at a warning.
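
The coverage read model described in the fix above pairs each auto-repair execution with its latest verification result. The in-memory sketch below only illustrates that idea; the real implementation reads PostgreSQL, and its schema is not shown in this log. The row shapes, the `incident_id` join key, the `executed_at` / `recorded_at` columns, the literal `"success"` value, and the function name `build_coverage` are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def build_coverage(executions, evidence, now, window=timedelta(hours=24)):
    """Count verified / unverified auto-repairs inside the time window."""
    # Keep only the most recent non-null verification result per incident.
    latest = {}
    for row in evidence:
        if row["verification_result"] is None:
            continue
        current = latest.get(row["incident_id"])
        if current is None or row["recorded_at"] > current["recorded_at"]:
            latest[row["incident_id"]] = row

    recent = [e for e in executions if now - e["executed_at"] <= window]
    verified = [e for e in recent if e["incident_id"] in latest]
    unverified = [e for e in recent if e["incident_id"] not in latest]
    success = sum(
        1 for e in verified
        if latest[e["incident_id"]]["verification_result"] == "success"
    )

    return {
        "total_auto": len(recent),
        "verified_auto": len(verified),
        "unverified_auto": len(unverified),
        "verified_success": success,
        "verified_non_success": len(verified) - success,
        # Execution time of the newest verified auto-repair, or None.
        "last_verified_auto_at": max((e["executed_at"] for e in verified), default=None),
        "recent_unverified": [e["incident_id"] for e in unverified],
    }


now = datetime(2026, 5, 14, 12, 0, tzinfo=timezone.utc)
executions = [
    {"incident_id": "a", "executed_at": now - timedelta(hours=1)},
    {"incident_id": "b", "executed_at": now - timedelta(hours=2)},
    {"incident_id": "c", "executed_at": now - timedelta(hours=30)},  # outside the window
]
evidence = [
    {"incident_id": "a", "verification_result": "failed", "recorded_at": now - timedelta(minutes=50)},
    {"incident_id": "a", "verification_result": "success", "recorded_at": now - timedelta(minutes=40)},
]
coverage = build_coverage(executions, evidence, now)
```

Taking only the latest result per incident means a retry that eventually succeeds counts as a success, which is one plausible reading of "latest `incident_evidence.verification_result`".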

---

### 2026-04-20 evening (Taipei) - C1-C4 end-to-end integration - Playbook chain protection (commit de2d34d)