docs(governance): record adr100 slo emitter rollout
@@ -6,19 +6,35 @@
- Added `adr100_slo_metrics_service.py`, which derives Prometheus metrics from PostgreSQL as the source of truth: `automation_operation_log_total`, `post_execution_verification_total`, `knowledge_entries_total`, `approval_records_high_confidence_total`, `approval_records_high_confidence_success_total`.
- `/metrics` now includes the ADR-100 SLO emitter. No DB schema was added and no Prometheus scrape target was changed, so the existing `awoooi-api` scrape job picks up the underlying series directly.
- The `GovernanceAgent` SLO no-data hint now covers three stages (emitter, recording rule, multiprocess mount) and no longer treats `PROMETHEUS_MULTIPROC_DIR`, which has been verified to exist, as the sole cause.
- `GovernanceAgent` now marks Prometheus `NaN` / `Inf` values as `skipped`, so cases such as confidence calibration with no events in the denominator yet are not misreported as ok.
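- `automation_operation_log_total` is narrowed to actual remediation / PlayBook / auto-repair operations. Background governance work such as the asset scanner and rule updater is excluded so it does not pollute the denominator of the AI auto-repair SLO.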
- Removed two pre-existing unused imports in `main.py` (`aiops_flags`, `_dt`) so the files touched by this change stop carrying F401 technical debt.

**Local verification**:
- `python3 -m py_compile apps/api/src/services/adr100_slo_metrics_service.py apps/api/src/services/governance_agent.py apps/api/src/main.py apps/api/tests/test_adr100_slo_metrics_service.py`: pass.
- `pytest tests/test_adr100_slo_metrics_service.py tests/test_governance_agent.py tests/test_ai_governance_endpoints.py -q`: 47 passed.
- `pytest tests/test_adr100_slo_metrics_service.py tests/test_governance_agent.py tests/test_ai_governance_endpoints.py -q`: 48 passed.
- `ruff check --select F,E9 src/services/adr100_slo_metrics_service.py src/services/governance_agent.py src/main.py tests/test_adr100_slo_metrics_service.py`: pass.
- `git diff --check`: pass.
- Production SQL dry-run: the automation, verification, knowledge, and high-confidence approval queries all run against the existing schema.

**Rollout and production verification**:
- `13cf02b7 feat(governance): emit adr100 slo metrics`, `368386ab fix(governance): skip non-finite slo values`, and `b92c9e28 fix(governance): scope adr100 automation metrics` have been pushed to Gitea main.
- Gitea Code Review runs `2155` / `2157` / `2159` succeeded; CD runs `2154` / `2156` / `2158` succeeded in every stage (tests / build-and-deploy / post-deploy-checks).
- Latest deploy marker: `80a05653 chore(cd): deploy b92c9e2 [skip ci]`. Production image: API / Worker / Web are all at `b92c9e285f880c50893adeac9f55ab7b5170e303`.
- Health: `/api/v1/health` returns 200 healthy; PostgreSQL / Redis / Ollama / OpenClaw / SignOz are up.
- `/metrics` now emits the ADR-100 series, and the automation scope no longer includes background governance work:
  - `automation_operation_log_total{outcome="auto_executed",operation_type="auto_repair_executed"} 246`
  - `automation_operation_log_total{outcome="human_required",operation_type="playbook_executed"} 234`
  - `post_execution_verification_total{outcome="success"} 5`
  - `knowledge_entries_total 2161`
  - `approval_records_high_confidence_total 31`
- Prometheus queries: all underlying metrics have data; `sli:autonomy_rate:5m` / `sli:decision_accuracy:5m` / `sli:confidence_calibration:1h` / `sli:km_growth_rate:24h` no longer return an empty result.
- Current real SLO status: `decision_accuracy=0` and `km_growth_rate=0` remain open governance red flags; `confidence_calibration=NaN` is now marked `skipped` by GovernanceAgent instead of showing a false green.
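The `NaN` / `Inf` handling mentioned above can be illustrated with a short sketch. It is not the `GovernanceAgent` code: the function name `classify_slo`, the status strings `no_data` and `violated`, and the "higher is better" comparison against a target are assumptions; only `skipped` and `ok` appear in the notes above.

```python
import math
from typing import Optional, Union


def classify_slo(raw_value: Optional[Union[str, float]], target: float) -> str:
    """Classify one SLI sample against a target (higher is assumed to be better)."""
    if raw_value is None:
        # The query returned an empty result: a data gap, not a verdict.
        return "no_data"
    # float() accepts the strings "NaN", "+Inf" and "-Inf" as well as numbers.
    value = float(raw_value)
    if not math.isfinite(value):
        # A ratio with an empty denominator yields NaN; do not report it as ok.
        return "skipped"
    return "ok" if value >= target else "violated"


if __name__ == "__main__":
    for sample in ("NaN", "+Inf", "0", "0.97", None):
        print(sample, "->", classify_slo(sample, target=0.9))
```

Checking finiteness before comparing matters because any ordering comparison with `NaN` evaluates to false, so the outcome would otherwise depend on which way the comparison happens to be written.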

**Current overall progress**:
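- Alertmanager low-risk auto-repair track: about 96%.
- Full AI automation management productization: about 78%.
- T18 is being rolled out; after the rollout, wait for Prometheus scrape / recording rule evaluation, then confirm that `sli:*` is no longer entirely empty and watch whether `governance_slo_data_gap` stops sending repeated notifications.
- Full AI automation management productization: about 79%.
- T18 has turned the "SLO data gap" into queryable metrics; the next step addresses the actual SLO red flags: post-execution verification coverage / KM growth refresh / governance alert dedupe, plus front-end stage presentation.

## 2026-05-14 | T17b: fixed governance event / dispatch API queries, clearing the front-end work-pipeline red flag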
@@ -158,7 +158,7 @@ increase(knowledge_entries_total[24h])
| `ops/monitoring/tests/test_slo_rules.yaml` | promtool unit tests |
| `ops/monitoring/grafana/dashboards/ai-slo-dashboard.json` | Grafana SLO Dashboard |
| `apps/api/src/services/governance_agent.py` | `check_slo_compliance()` integration |
| `apps/api/src/services/adr100_slo_metrics_service.py` | 2026-05-14 T18: emits the underlying ADR-100 Prometheus series from PostgreSQL as the source of truth |
| `apps/api/src/services/adr100_slo_metrics_service.py` | 2026-05-14 T18: emits the underlying ADR-100 Prometheus series from PostgreSQL as the source of truth; `automation_operation_log_total` covers only remediation / PlayBook / auto-repair operations, and background governance work stays out of the AI auto-repair SLO denominator |
| `apps/api/src/main.py` `/metrics` | 2026-05-14 T18: adds the DB-derived SLO emitter so the existing `awoooi-api` scrape job can fetch the underlying series |
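The scoping rule for `automation_operation_log_total` in the table above amounts to an allow-list applied before aggregation. The sketch below is illustrative only: `REMEDIATION_OPERATION_TYPES` and `count_in_scope` are hypothetical names, the allow-list contains just the two operation types visible in the sample `/metrics` output, and `asset_scan_completed` / `rule_updated` are made-up stand-ins for background governance work.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Operation types that count toward the AI auto-repair SLO (assumed allow-list).
REMEDIATION_OPERATION_TYPES = frozenset({"auto_repair_executed", "playbook_executed"})


def count_in_scope(rows: Iterable[Tuple[str, str]]) -> Dict[Tuple[str, str], int]:
    """Count (operation_type, outcome) pairs, ignoring out-of-scope operations."""
    counts: Counter = Counter()
    for operation_type, outcome in rows:
        if operation_type in REMEDIATION_OPERATION_TYPES:
            counts[(operation_type, outcome)] += 1
    return dict(counts)


if __name__ == "__main__":
    sample_rows = [
        ("auto_repair_executed", "auto_executed"),
        ("playbook_executed", "human_required"),
        ("asset_scan_completed", "auto_executed"),  # hypothetical background job
        ("rule_updated", "auto_executed"),  # hypothetical background job
    ]
    print(count_in_scope(sample_rows))
```

An allow-list keeps newly introduced background operation types out of the denominator by default, whereas a deny-list would have to be updated every time one is added.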

## Decision rationale
@@ -2127,10 +2127,11 @@ After Phase 6 completes

**T18 ADR-100 SLO emitter integration (2026-05-14, Taipei)**:
- Trigger: the governance alert `governance_slo_data_gap` kept pushing to Telegram, but a production check showed that the API Pod already had `PROMETHEUS_MULTIPROC_DIR` and an `emptyDir`. The real gap was that `/metrics` did not emit the underlying series required by the ADR-100 recording rules, so every `sli:*` query returned an empty result.
- Fix: added a DB-derived `/metrics` emitter that exposes `automation_operation_log_total`, `post_execution_verification_total`, `knowledge_entries_total`, `approval_records_high_confidence_total`, and `approval_records_high_confidence_success_total` from `automation_operation_log`, `incident_evidence`, `knowledge_entries`, and `approval_records`; no new schema and no scrape target change.
- Message governance: the `GovernanceAgent` no-data hint now covers three stages (emitter / recording rule / multiprocess mount), so operators are not misled into thinking the only cause is an unset `PROMETHEUS_MULTIPROC_DIR`.
- Verification: `py_compile` pass; `pytest tests/test_adr100_slo_metrics_service.py tests/test_governance_agent.py tests/test_ai_governance_endpoints.py -q` 47 passed; ruff F/E9 pass; diff check pass; production SQL dry-run passed.
- Progress update: Alertmanager low-risk auto-repair track about 96%; full AI automation management productization about 78%. After the rollout, wait for Prometheus scrape / recording rule evaluation, then confirm that `sli:*` is no longer entirely empty.
- Fix: added a DB-derived `/metrics` emitter that exposes `automation_operation_log_total`, `post_execution_verification_total`, `knowledge_entries_total`, `approval_records_high_confidence_total`, and `approval_records_high_confidence_success_total` from `auto_repair_executions`, the remediation/PlayBook-scoped `automation_operation_log`, `incident_evidence`, `knowledge_entries`, and `approval_records`; no new schema and no scrape target change. Background governance work (asset scanner, rule updater, and similar) stays out of the AI auto-repair SLO denominator.
- Message governance: the `GovernanceAgent` no-data hint now covers three stages (emitter / recording rule / multiprocess mount), and Prometheus `NaN` / `Inf` values are marked `skipped`, so operators are neither misled into thinking the only cause is an unset `PROMETHEUS_MULTIPROC_DIR` nor shown a false green for data with no denominator.
- Verification: `py_compile` pass; `pytest tests/test_adr100_slo_metrics_service.py tests/test_governance_agent.py tests/test_ai_governance_endpoints.py -q` 48 passed; ruff F/E9 pass; diff check pass; production SQL dry-run passed.
- Production deploy: `13cf02b7`, `368386ab`, and `b92c9e28` pushed to Gitea main; Code Review runs `2155` / `2157` / `2159` succeeded; CD runs `2154` / `2156` / `2158` succeeded; latest deploy marker `80a05653 chore(cd): deploy b92c9e2 [skip ci]`; API / Worker / Web images are all at `b92c9e285f880c50893adeac9f55ab7b5170e303`; health is healthy; `/metrics` emits the scoped ADR-100 series, and neither the underlying Prometheus metrics nor `sli:*` return an empty result any longer.
- Progress update: Alertmanager low-risk auto-repair track about 96%; full AI automation management productization about 79%. T18 has closed the SLO data gap; the next step addresses the real SLO red flags (decision accuracy / KM growth) and the front-end presentation of governance stages.
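The three-stage no-data hint (emitter / recording rule / multiprocess mount) described in the list above can be sketched as an ordered checklist. Everything here is hypothetical: the function name `diagnose_no_data`, its boolean inputs, and the hint wording are illustrative and are not taken from `GovernanceAgent`.

```python
from typing import List


def diagnose_no_data(
    emitter_has_series: bool,
    recording_rule_has_data: bool,
    multiproc_dir_mounted: bool,
) -> List[str]:
    """Return one hint per failing stage, ordered from the data source outward."""
    hints: List[str] = []
    if not emitter_has_series:
        hints.append("emitter: /metrics does not expose the underlying ADR-100 series")
    if not recording_rule_has_data:
        hints.append("recording rule: sli:* has not been evaluated or returns an empty result")
    if not multiproc_dir_mounted:
        hints.append("multiprocess mount: PROMETHEUS_MULTIPROC_DIR is unset or not writable")
    return hints


if __name__ == "__main__":
    # The T18 situation: the mount existed, but the emitter and rules had no data.
    for hint in diagnose_no_data(False, False, True):
        print("-", hint)
```

Reporting each stage separately avoids steering the operator toward a single cause when that cause has already been ruled out.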
---