docs(governance): record t23 auto repair gateway rollout

Your Name
2026-05-14 21:24:55 +08:00
parent 33e4c9231e
commit f97127f704
2 changed files with 48 additions and 0 deletions

@@ -2170,6 +2170,17 @@ After Phase 6 completes
- Production evidence: API / Worker / Web images are all `bad48dee0424656e01e3ae232acba0423ae0c1e1`; `/api/v1/ai/slo` reports `non_success_breakdown.by_verification_result=[degraded:8]` and `by_failure_class=[unsupported_action_scheme:7, verifier_missing_promql:1]`; `/zh-TW/governance` returns 200; Playwright render check console error = 0.
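- Progress update: the Alertmanager low-risk auto-repair mainline is about 96% complete; full AI automated-management productization is about 87%. The next step, T23, should fix the unsupported `ssh {host}` repair step in `PB-20260505-F4197B` so that low-risk Docker memory pressure alerts go through a supported executor / MCP envelope, and should add a verifier PromQL template.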
**T23 Auto-repair SSH diagnostic moved onto the MCP Gateway (2026-05-14, Taipei)**
- Trigger: the largest non-success class in T22 was the legacy `ssh {host} ...` repair step in `PB-20260505-F4197B`, which the AutoRepair executor could not parse into a supported action envelope; the other class was a verifier metric tool that lacked a PromQL query.
- Fix: `AutoRepairService` routes read-only legacy SSH diagnostics through `auto_repair_executor -> AwoooP MCP Gateway -> ssh_diagnose`, keeping the `required_scope=read`, `policy_enforced=true`, and `flywheel_node=execute` audit fields; SSH commands that write, restart, or delete still fail closed and are not smuggled through as read-only MCP calls.
- MCP / contract: added an active agent contract for `auto_repair_executor` and read-only SSH grants; `SSHProvider.ssh_diagnose` now supports short host mapping and `container_name` evidence.
- Verifier: for Prometheus-style metric tools, `PostExecutionVerifier` now supplies PromQL templates for Docker memory / restart / CPU plus a generic K8s / host fallback, which avoids `verifier_missing_promql`.
- Verification: py_compile pass; `git diff --check` pass; migration static check pass; `pytest tests/test_auto_repair_service.py tests/test_ssh_provider_tools.py tests/test_post_execution_verifier.py -q` 67 passed; `pytest tests/test_mcp_gateway_audit.py tests/test_adr100_slo_status_service.py tests/test_adr100_slo_metrics_service.py tests/test_governance_agent.py tests/test_ai_governance_endpoints.py -q` 55 passed; ruff F/E9 pass.
- Production deploy: `813d0883 feat(auto-repair): route ssh diagnostics through mcp gateway` pushed to Gitea main; Code Review run `2174` success; Run Migration run `2175` success; CD run `2173` success; deploy marker `33e4c923 chore(cd): deploy 813d088 [skip ci]`.
- Production evidence: API / Worker / Web images are all `813d088339d05c1e902ffbe84ce07e1ce80343bb`; health 200; rollout success; the production grant `auto_repair_executor -> ssh_diagnose` has `read` scope; the read-only smoke test actually produced `mcp:ssh_diagnose` SSHProvider evidence; the MCP audit row shows `result_status=success`, `required_scope=read`, `gateway_path=awooop_mcp_gateway`, and `policy_enforced=true`; the PromQL smoke test produced a Docker memory ratio query and a canary fallback query.
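- Boundary: T23 fixes the break in the executor / evidence-collection path; it does not mean that every Docker memory pressure alert has completed an "actual repair". After a successful read-only diagnosis, downstream PlayBook / action semantics must still decide whether to restart, scale, roll back, or hand off to a human. The `/api/v1/ai/slo` 24h window will still show old historical degraded rows until new samples displace them or they slide out of the window.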
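- Progress update: the Alertmanager low-risk auto-repair mainline is about 97% complete; full AI automated-management productization is about 88%. The next step should turn historical degraded rows into a replay / closure / ticket work chain, and fully wire the AwoooP Event Dossier frontend to MCP Gateway, PlayBook, KM, and Ansible / Sentry / SignOz evidence.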
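
The fail-closed routing described in the T23 fix can be sketched roughly as follows. This is a minimal illustration, not the actual `AutoRepairService` code: the function `route_legacy_ssh_step`, the `McpEnvelope` dataclass, and the `READ_ONLY_PREFIXES` allowlist are all hypothetical names, and the real allowlist and envelope schema may differ. Only the field values (`auto_repair_executor`, `ssh_diagnose`, `required_scope=read`, `policy_enforced=true`, `flywheel_node=execute`) come from this log.

```python
import shlex
from dataclasses import dataclass
from typing import Optional

# Hypothetical allowlist of read-only diagnostics; the real rules are not shown in this log.
READ_ONLY_PREFIXES = (
    ("docker", "stats"),
    ("docker", "inspect"),
    ("docker", "logs"),
    ("docker", "ps"),
    ("free",),
    ("df",),
    ("uptime",),
)

# Any shell metacharacter means we cannot prove the step is read-only.
SHELL_METACHARS = set(";|&><`$")


@dataclass(frozen=True)
class McpEnvelope:
    """Assumed shape of the action envelope sent to the MCP Gateway."""
    agent: str
    tool: str
    required_scope: str
    policy_enforced: bool
    flywheel_node: str
    host: str
    command: str


def route_legacy_ssh_step(step: str) -> Optional[McpEnvelope]:
    """Return a read-only envelope for `ssh <host> <cmd>`, or None to fail closed."""
    if any(ch in SHELL_METACHARS for ch in step):
        return None
    try:
        tokens = shlex.split(step)
    except ValueError:
        return None  # unbalanced quotes: refuse rather than guess
    if len(tokens) < 3 or tokens[0] != "ssh" or tokens[1].startswith("-"):
        return None
    remote = tuple(tokens[2:])
    if not any(remote[: len(prefix)] == prefix for prefix in READ_ONLY_PREFIXES):
        return None  # restart / rm / anything unknown is never downgraded to read-only
    return McpEnvelope(
        agent="auto_repair_executor",
        tool="ssh_diagnose",
        required_scope="read",
        policy_enforced=True,
        flywheel_node="execute",
        host=tokens[1],
        command=" ".join(remote),
    )
```

Under these assumptions, `route_legacy_ssh_step("ssh node-1 docker stats --no-stream")` yields an envelope with `required_scope="read"`, while `ssh node-1 docker restart web` or any step containing shell metacharacters returns `None`, so the caller keeps failing closed. An allowlist is used rather than a blocklist so that unrecognized commands are rejected by default.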
---
### 2026-04-20 evening (Taipei) - C1-C4 end-to-end integration - Playbook chain protection (commit de2d34d)