docs(governance): record t23 auto repair gateway rollout
@@ -2170,6 +2170,17 @@ After Phase 6 completes
- Production evidence: API / Worker / Web images are all `bad48dee0424656e01e3ae232acba0423ae0c1e1`; `/api/v1/ai/slo` returns `non_success_breakdown.by_verification_result=[degraded:8]` and `by_failure_class=[unsupported_action_scheme:7, verifier_missing_promql:1]`; `/zh-TW/governance` returns 200; the Playwright render check reports 0 console errors.
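- Progress update: the Alertmanager low-risk auto-repair mainline is about 96% complete; full productization of AI-automated management is about 87%. The next step, T23, should fix the unsupported `ssh {host}` repair step in `PB-20260505-F4197B` so that low-risk Docker memory pressure alerts go through a supported executor / MCP envelope, and should add the verifier PromQL template.

**T23 Auto-repair SSH diagnostics moved behind the MCP Gateway (2026-05-14, Taipei)**: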
- Trigger: the largest non-success class in T22 was the legacy `ssh {host} ...` repair step in `PB-20260505-F4197B`, which the AutoRepair executor could not parse into a supported action envelope; the other class was a verifier metric tool with no PromQL query.
- Fix: `AutoRepairService` routes read-only legacy SSH diagnostics as `auto_repair_executor -> AwoooP MCP Gateway -> ssh_diagnose`, keeping the `required_scope=read`, `policy_enforced=true`, and `flywheel_node=execute` audit fields; SSH commands that write, restart, or delete still fail closed and are not smuggled through as read-only MCP calls.
- MCP/contract: added an active agent contract for `auto_repair_executor` and read-only SSH grants; `SSHProvider.ssh_diagnose` supports short host mapping and `container_name` evidence.
- Verifier: for Prometheus-type metric tools, `PostExecutionVerifier` now supplies PromQL templates for Docker memory/restart/cpu plus generic K8s/host fallbacks, which avoids `verifier_missing_promql`.
- Validation: py_compile pass; `git diff --check` pass; migration static check pass; `pytest tests/test_auto_repair_service.py tests/test_ssh_provider_tools.py tests/test_post_execution_verifier.py -q` 67 passed; `pytest tests/test_mcp_gateway_audit.py tests/test_adr100_slo_status_service.py tests/test_adr100_slo_metrics_service.py tests/test_governance_agent.py tests/test_ai_governance_endpoints.py -q` 55 passed; ruff F/E9 pass.
- Production deploy: `813d0883 feat(auto-repair): route ssh diagnostics through mcp gateway` pushed to Gitea main; Code Review run `2174` success; Run Migration run `2175` success; CD run `2173` success; deploy marker `33e4c923 chore(cd): deploy 813d088 [skip ci]`.
- Production evidence: API / Worker / Web images are all `813d088339d05c1e902ffbe84ce07e1ce80343bb`; health 200; rollout success; the production grant `auto_repair_executor -> ssh_diagnose` has `read` scope; the read-only smoke test produced real `mcp:ssh_diagnose` SSHProvider evidence; the MCP audit row shows `result_status=success`, `required_scope=read`, `gateway_path=awooop_mcp_gateway`, `policy_enforced=true`; the PromQL smoke test produced a Docker memory ratio query and a canary fallback query.
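- Boundary: T23 fixes the break in the executor / evidence collection path; it does not mean that every Docker memory pressure alert has actually been repaired. After a read-only diagnostic succeeds, downstream PlayBook/action semantics must still decide whether to restart, scale, roll back, or hand off to a human; the 24h window of `/api/v1/ai/slo` will still show old historical degraded rows until new samples displace them or they slide out of the window.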
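- Progress update: the Alertmanager low-risk auto-repair mainline is about 97% complete; full productization of AI-automated management is about 88%. The next step should turn the historical degraded rows into a replay / closure / ticket work chain, and fully wire the AwoooP Event Dossier frontend to MCP Gateway, PlayBook, KM, and Ansible/Sentry/SignOz evidence.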
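
The fail-closed routing described in the "Fix" item above can be sketched roughly as follows. This is a minimal illustration, not the actual `AutoRepairService` code: the function name `route_legacy_ssh_step`, the `READ_ONLY_COMMANDS` allowlist, and the envelope keys `agent`, `tool`, and `arguments` are assumptions made for the example; only the `required_scope`, `policy_enforced`, `flywheel_node`, and `gateway_path` values are taken from this log.

```python
import shlex
from typing import Optional

# Hypothetical allowlist of read-only diagnostics; the real rules are not shown in this log.
READ_ONLY_COMMANDS = {
    ("docker", "stats"),
    ("docker", "ps"),
    ("docker", "inspect"),
    ("docker", "logs"),
    ("free",),
    ("uptime",),
    ("df",),
}
# Shell metacharacters that could chain or redirect into a mutating command.
SHELL_META = set(";|&><`$")


def route_legacy_ssh_step(step: str) -> Optional[dict]:
    """Return a read-only MCP envelope for `ssh <host> <cmd>`, or None to fail closed."""
    if any(ch in SHELL_META for ch in step):
        return None
    try:
        tokens = shlex.split(step)
    except ValueError:  # unbalanced quotes and similar parse failures
        return None
    if len(tokens) < 3 or tokens[0] != "ssh":
        return None
    host, command = tokens[1], tokens[2:]
    if host.startswith("-"):  # refuse anything that looks like an ssh option
        return None
    allowed = (
        tuple(command[:2]) in READ_ONLY_COMMANDS
        or (command[0],) in READ_ONLY_COMMANDS
    )
    if not allowed:
        return None  # write / restart / delete commands are never downgraded to read
    return {
        "agent": "auto_repair_executor",       # assumed key name
        "tool": "ssh_diagnose",                # assumed key name
        "gateway_path": "awooop_mcp_gateway",
        "required_scope": "read",
        "policy_enforced": True,
        "flywheel_node": "execute",
        "arguments": {"host": host, "command": " ".join(command)},
    }
```

In this sketch anything that is not explicitly allowlisted returns `None`, so a step such as `ssh web01 docker restart api` is rejected rather than reinterpreted as a diagnostic.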
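
The verifier change in the "Verifier" item above amounts to a template lookup with a generic fallback, which might look something like the sketch below. The template keys, the function name `build_verifier_promql`, and the choice of cAdvisor-style metric names and the `up` fallback are all assumptions for illustration; the log does not show which queries `PostExecutionVerifier` actually emits.

```python
# Hypothetical templates keyed by metric kind; metric names follow cAdvisor conventions.
PROMQL_TEMPLATES = {
    "docker_memory": (
        'container_memory_working_set_bytes{{name="{container}"}}'
        ' / container_spec_memory_limit_bytes{{name="{container}"}}'
    ),
    "docker_restart": 'changes(container_start_time_seconds{{name="{container}"}}[15m])',
    "docker_cpu": 'rate(container_cpu_usage_seconds_total{{name="{container}"}}[5m])',
}
# Assumed generic host-level fallback so the verifier always has some query to run.
FALLBACK_TEMPLATE = 'up{{instance="{host}"}}'


def _escape(value: str) -> str:
    """Escape backslashes and double quotes for use inside a PromQL label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def build_verifier_promql(kind: str, *, container: str = "", host: str = "") -> str:
    """Pick a template for the metric kind, falling back to a host-level query."""
    template = PROMQL_TEMPLATES.get(kind)
    if template is not None and container:
        return template.format(container=_escape(container))
    if host:
        return FALLBACK_TEMPLATE.format(host=_escape(host))
    return "up"  # last resort: unfiltered liveness query


print(build_verifier_promql("docker_memory", container="api"))
print(build_verifier_promql("unknown_kind", host="web01"))
```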
---
### 2026-04-20 evening (Taipei): C1-C4 end-to-end integration, Playbook chain protection (commit de2d34d)