docs(governance): record t23 auto repair gateway rollout

Your Name
2026-05-14 21:24:55 +08:00
parent 33e4c9231e
commit f97127f704
2 changed files with 48 additions and 0 deletions


@@ -1,3 +1,40 @@
## 2026-05-14 | T23 Auto-repair SSH diagnostics routed through AwoooP MCP Gateway; verifier PromQL templates added
**Background:** T22 broke down the root causes of non-success verifier results over the last 24h. Most `DockerContainerMemoryLimitPressure` cases were stuck on the legacy `ssh {host} ...` action in `PB-20260505-F4197B`. The AutoRepair executor accepts only `scheme://host/payload`, so these actions failed immediately with `unsupported_action_scheme`. In addition, canary and host verifiers lacked a PromQL query template, so post-execution verification could only report degraded.
**Fixes:**
- `AutoRepairService` normalizes read-only legacy `ssh {host} '...'` diagnostic commands into `auto_repair_executor -> AwoooP MCP Gateway -> ssh_diagnose`, preserving the `required_scope=read`, `is_shadow=false`, `flywheel_node=execute` audit context.
- SSH commands that write, restart, or delete are never auto-converted to read-only MCP. For example, `docker restart`, `systemctl restart`, `prune`, `rm`, and `bash` remain fail-closed and require an explicit PlayBook executor.
- `SSHProvider.ssh_diagnose` supports short host mappings (for example `110` / `188`) and `container_name`, so it can collect read-only evidence for both host and container.
- A new migration creates the `auto_repair_executor` active agent contract and grants only read-only SSH tools, notably `ssh_diagnose`; no write/admin MCP is granted.
- `PostExecutionVerifier` fills in `query` for Prometheus-type metric tools, covering Docker memory / restart / CPU plus a generic K8s/host fallback, to avoid `verifier_missing_promql`.
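
The normalization and fail-closed rules in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual `AutoRepairService` code: the function name `normalize_legacy_ssh`, the regex, the deny-token set, and the envelope field layout are all hypothetical. Only the `ssh {host} '...'` shape, the `ssh_diagnose` tool name, and the audit context values are taken from this entry.

```python
import re
from typing import Optional

# Hypothetical pattern for a rendered legacy action such as: ssh 110 'docker stats --no-stream'
LEGACY_SSH = re.compile(
    r"""^ssh\s+(?P<host>\S+)\s+(?P<quote>['"])(?P<cmd>.*)(?P=quote)$""", re.DOTALL
)

# Illustrative deny list built from the examples in this entry; the real list may differ.
MUTATING_TOKENS = {"restart", "prune", "rm", "bash"}


def normalize_legacy_ssh(action: str) -> Optional[dict]:
    """Return an MCP envelope for a read-only diagnostic, or None to fail closed."""
    match = LEGACY_SSH.match(action.strip())
    if match is None:
        return None  # not a legacy ssh action: leave it to other handlers
    cmd = match.group("cmd")
    # Split on anything that is not a word-like character, so "hi;rm" yields "rm".
    tokens = set(re.findall(r"[A-Za-z0-9_.-]+", cmd))
    if tokens & MUTATING_TOKENS:
        return None  # potentially mutating command: never downgrade to read-only MCP
    return {
        "agent_id": "auto_repair_executor",
        "tool_name": "ssh_diagnose",
        "arguments": {"host": match.group("host"), "command": cmd},
        "audit": {
            "required_scope": "read",
            "is_shadow": False,
            "flywheel_node": "execute",
        },
    }
```

A token-based deny list is deliberately conservative: a harmless command such as `grep restart /var/log/x` would also be rejected, which is the safer failure mode for a path that must never mutate state.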
**Local verification:**
- `python3 -m py_compile apps/api/src/services/auto_repair_service.py apps/api/src/plugins/mcp/providers/ssh_provider.py apps/api/src/services/post_execution_verifier.py apps/api/tests/test_auto_repair_service.py apps/api/tests/test_ssh_provider_tools.py apps/api/tests/test_post_execution_verifier.py`: pass.
- `git diff --check`: pass.
- Migration static check: the `auto_repair_executor` grant exists, and the `:'var'` syntax that would cause psql to fail is not used.
- `DATABASE_URL=postgresql+asyncpg://test:test@localhost:5432/test pytest tests/test_auto_repair_service.py tests/test_ssh_provider_tools.py tests/test_post_execution_verifier.py -q`: 67 passed.
- `ruff check --select F,E9 ...`: pass.
- `pytest tests/test_mcp_gateway_audit.py tests/test_adr100_slo_status_service.py tests/test_adr100_slo_metrics_service.py tests/test_governance_agent.py tests/test_ai_governance_endpoints.py -q`: 55 passed.
**Release and production verification:**
- `813d0883 feat(auto-repair): route ssh diagnostics through mcp gateway` has been pushed to Gitea main.
- Gitea Code Review run `2174`: success; Run Migration run `2175`: success; CD run `2173`: tests / build-and-deploy / post-deploy-checks all success.
- Latest deploy marker: `33e4c923 chore(cd): deploy 813d088 [skip ci]`.
- Production images: API / Worker `192.168.0.110:5000/awoooi/api:813d088339d05c1e902ffbe84ce07e1ce80343bb`; Web `192.168.0.110:5000/awoooi/web:813d088339d05c1e902ffbe84ce07e1ce80343bb`.
- K8s rollout: `awoooi-api` / `awoooi-worker` / `awoooi-web` in `awoooi-prod` all successfully rolled out.
- `https://awoooi.wooo.work/api/v1/health`: 200 healthy.
- Production grant check: `auto_repair_executor` holds the `ssh_diagnose` grant with `granted_scopes=["read"]` and `is_revoked=false`.
- Production read-only smoke: the legacy `ssh {host} 'echo ... docker stats ...'` action executed successfully via `auto_repair_executor -> AwoooP MCP Gateway -> ssh_diagnose`; SSHProvider resolved `host=110` to `192.168.0.110` and produced host/container read-only evidence.
- MCP audit check: `trace_id=codex-t23-auto-repair-executor-smoke-provider`, `agent_id=auto_repair_executor`, `tool_name=ssh_diagnose`, `result_status=success`, `required_scope=read`, `gateway_path=awooop_mcp_gateway`, `policy_enforced=true`.
- PromQL smoke: `DockerContainerMemoryLimitPressure` produces `docker_container_memory_usage_bytes{host="110",container_name="momo-scheduler"} / docker_container_memory_limit_bytes{host="110",container_name="momo-scheduler"}`; the canary fallback produces `up{namespace="awoooi-prod"}`.
- `/api/v1/ai/slo` will still show old `unsupported_action_scheme` historical degraded rows within the 24h window. These rows are evidence of past executions and must not be made to disappear through data cleansing; they clear only when new alert samples supersede them or the window slides past.
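
The PromQL smoke result above suggests a template lookup with a generic fallback. The sketch below shows one way that could work; it is an assumption-laden illustration, not the actual `PostExecutionVerifier` code. The two query shapes come from the smoke check, while the `PROMQL_TEMPLATES` registry, the `FALLBACK_TEMPLATE` name, the `build_query` function, and the label key names are hypothetical.

```python
from string import Template

# Hypothetical registry keyed by alert name; the query text mirrors the smoke output.
PROMQL_TEMPLATES = {
    "DockerContainerMemoryLimitPressure": Template(
        'docker_container_memory_usage_bytes{host="$host",container_name="$container"}'
        ' / docker_container_memory_limit_bytes{host="$host",container_name="$container"}'
    ),
}

# Generic fallback used when no alert-specific template is registered.
FALLBACK_TEMPLATE = Template('up{namespace="$namespace"}')


def build_query(alert_name: str, labels: dict) -> str:
    """Render a PromQL query for the alert, falling back to a generic liveness query.

    Raises KeyError when a required label is missing, so the caller can record
    the gap explicitly instead of sending an empty query.
    """
    template = PROMQL_TEMPLATES.get(alert_name, FALLBACK_TEMPLATE)
    return template.substitute(labels)
```

For example, `build_query("DockerContainerMemoryLimitPressure", {"host": "110", "container": "momo-scheduler"})` renders the memory ratio query shown in the smoke check, and an unregistered alert with `{"namespace": "awoooi-prod"}` renders `up{namespace="awoooi-prod"}`.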
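**Overall progress:**
- Alertmanager low-risk auto-repair mainline: about 97%.
- Full AI automation management productization: about 88%.
- T23 fixed the largest executor format break found in T22 and ensures verifier metric queries are no longer empty. The next step should address how existing historical degraded incidents are turned into a replay / closure / ticket workflow, and give the frontend AwoooP Runs / Event Dossier a more complete timeline view of MCP Gateway, PlayBook, KM, and Ansible evidence.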
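## 2026-05-14 | T22 Verifier non-success breakdown surfaced in the frontend, turning a warning into actionable root causes
**Background:** T21 confirmed that every auto-repair in the last 24h has verifier evidence (`coverage_rate=1.0`), but `verified_non_success=9` appeared only as a single warning. Operators still could not tell whether the cause was a malformed PlayBook action, a missing verifier target, a missing PromQL template, or a genuinely failed repair.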


@@ -2170,6 +2170,17 @@ After Phase 6 completes
- Production evidence: API / Worker / Web images are all `bad48dee0424656e01e3ae232acba0423ae0c1e1`; `/api/v1/ai/slo` reports `non_success_breakdown.by_verification_result=[degraded:8]` and `by_failure_class=[unsupported_action_scheme:7, verifier_missing_promql:1]`; `/zh-TW/governance` returns 200; Playwright render check console error = 0.
- Progress update: Alertmanager low-risk auto-repair mainline about 96%; full AI automation management productization about 87%. The next step, T23, should fix the unsupported `ssh {host}` repair step in `PB-20260505-F4197B` so that low-risk Docker memory pressure alerts go through a supported executor / MCP envelope, and should add verifier PromQL templates.
**T23 Auto-repair SSH diagnostics moved to MCP Gateway (2026-05-14, Taipei)**
- Trigger: the largest non-success class in T22 was the legacy `ssh {host} ...` repair step in `PB-20260505-F4197B`, which the AutoRepair executor could not parse into a supported action envelope. The other class was verifier metric tools lacking a PromQL query.
- Fix: `AutoRepairService` routes read-only legacy SSH diagnostics to `auto_repair_executor -> AwoooP MCP Gateway -> ssh_diagnose`, preserving the `required_scope=read`, `policy_enforced=true`, `flywheel_node=execute` audit context. SSH commands that write, restart, or delete remain fail-closed and are never smuggled through as read-only MCP.
- MCP/contract: added the `auto_repair_executor` active agent contract and read-only SSH grants; `SSHProvider.ssh_diagnose` supports short host mappings and `container_name` evidence.
- Verifier: `PostExecutionVerifier` adds Docker memory/restart/cpu and generic K8s/host fallback PromQL templates for Prometheus-type metric tools, to avoid `verifier_missing_promql`.
- Verification: py_compile pass; `git diff --check` pass; migration static check pass; `pytest tests/test_auto_repair_service.py tests/test_ssh_provider_tools.py tests/test_post_execution_verifier.py -q` 67 passed; `pytest tests/test_mcp_gateway_audit.py tests/test_adr100_slo_status_service.py tests/test_adr100_slo_metrics_service.py tests/test_governance_agent.py tests/test_ai_governance_endpoints.py -q` 55 passed; ruff F/E9 pass.
- Production deploy: `813d0883 feat(auto-repair): route ssh diagnostics through mcp gateway` pushed to Gitea main; Code Review run `2174` success; Run Migration run `2175` success; CD run `2173` success; deploy marker `33e4c923 chore(cd): deploy 813d088 [skip ci]`.
- Production evidence: API / Worker / Web images are all `813d088339d05c1e902ffbe84ce07e1ce80343bb`; health 200; rollout success; the production grant `auto_repair_executor -> ssh_diagnose` carries `read` scope; the read-only smoke actually produced `mcp:ssh_diagnose` SSHProvider evidence; the MCP audit row shows `result_status=success`, `required_scope=read`, `gateway_path=awooop_mcp_gateway`, `policy_enforced=true`; the PromQL smoke produced the Docker memory ratio query and the canary fallback query.
- Boundary: T23 fixes the executor / evidence-collection break. It does not mean every Docker memory pressure alert has been "actually repaired". After a successful read diagnostic, subsequent PlayBook/action semantics must still decide whether a restart, scale, rollback, or human handoff is needed. The `/api/v1/ai/slo` 24h window will still show old historical degraded rows until new samples supersede them or the window slides past.
- Progress update: Alertmanager low-risk auto-repair mainline about 97%; full AI automation management productization about 88%. The next step should turn historical degraded rows into a replay / closure / ticket workflow and fully wire the AwoooP Event Dossier frontend to MCP Gateway, PlayBook, KM, and Ansible/Sentry/SigNoz evidence.
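
The grant evidence above (`granted_scopes=["read"]`, `is_revoked=false`) implies a scope check before a tool call is let through. A minimal sketch of such a check follows. The `ToolGrant` class and `is_allowed` function are hypothetical names for illustration; the real gateway policy code is not shown in this log and may work differently.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class ToolGrant:
    """Hypothetical in-memory view of one agent-to-tool grant row."""
    agent_id: str
    tool_name: str
    granted_scopes: Tuple[str, ...] = ("read",)
    is_revoked: bool = False


def is_allowed(grants: Iterable[ToolGrant], agent_id: str,
               tool_name: str, required_scope: str) -> bool:
    """Allow the call only if a live grant covers the tool and the required scope."""
    return any(
        g.agent_id == agent_id
        and g.tool_name == tool_name
        and not g.is_revoked
        and required_scope in g.granted_scopes
        for g in grants
    )
```

With a single grant `ToolGrant("auto_repair_executor", "ssh_diagnose")`, a `read` call to `ssh_diagnose` is allowed, while a `write` call or a call to any other tool is denied by default.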
---
### 2026-04-20 evening (Taipei) - C1-C4 end-to-end integration - Playbook chain protection (commit de2d34d)