docs(logbook): record t16 auto repair live fire

Your Name
2026-05-14 01:13:52 +08:00
parent a9b846c82a
commit b5288d4b7d

@@ -8164,3 +8164,80 @@ outbound_messages_visible=2
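- A complete AI auto-repair closed loop still cannot be claimed: `automation_quality` still marks NO_ACTION / unverified executions as manual_required or unverified, and `verified_auto_repair_total` still needs follow-up live-fire verification.
- Still pending in T15b: preserving the redacted full inbound envelope (conversation_event currently keeps only hash / preview), Sentry / SignOz explicit source refs, Ansible executor/diff/apply/rollback decision audit, and routing all write/admin MCP through the Gateway.
- Overall progress update: about 89%.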
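### 2026-05-14 - T16 Alertmanager low-risk auto-repair closed loop: production acceptance complete
**Trigger**
- The Telegram alert card could display AI suggestions, but it could not prove whether a low-risk alert had actually gone through AI auto-judgment, auto-repair, MCP verification, KM write-back, and Incident closure.
- T15a completed the inbound truth-chain mirror. This stage adds a production live-fire for the low-risk canary, so that auto-repair can no longer stop at an approval card or a manual process.
**Fixes**
- PlayBook recommendation retains exact/Jaccard candidates, so the vector/RAG top-k cannot crowd out the exact canary PlayBook.
- Alertmanager background AI analysis gains a timeout fallback: when OpenClaw/LLM hangs, auto-repair still proceeds using PlayBook + labels + risk gate.
- After a fallback auto-repair completes, `approval_records.status = EXECUTION_SUCCESS` is synced, so the frontend no longer looks like it is waiting for manual action after the Incident has closed.
- MCP Gateway adds a read-only `k8s_watch_rollout` grant for `pre_decision_investigator` / `post_execution_verifier`, so the verifier can judge success from rollout evidence.
- PostExecutionVerifier recognizes `successfully rolled out` and successful tool output, and writes the verification result to `incident_evidence`.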
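
The first fix in the list above (keeping exact/Jaccard candidates so vector top-k cannot evict them) can be illustrated with a minimal sketch. This is not the code in `playbook_service.py` or `playbook_match_resolver.py`; the `Candidate` type, the `merge_candidates` function, and the `limit` parameter are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    playbook_id: str
    score: float
    source: str  # "exact", "jaccard", or "vector" (hypothetical labels)


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_candidates(
    lexical: list[Candidate],
    vector_top_k: list[Candidate],
    limit: int = 5,
) -> list[Candidate]:
    """Keep every exact/Jaccard candidate, then fill free slots from vector hits."""
    kept: list[Candidate] = []
    seen: set[str] = set()
    # Lexical candidates are placed first, so vector results can never evict them.
    for cand in sorted(lexical, key=lambda c: c.score, reverse=True):
        if cand.playbook_id not in seen:
            seen.add(cand.playbook_id)
            kept.append(cand)
    # Vector hits only fill whatever room is left under the limit.
    for cand in vector_top_k:
        if len(kept) >= limit:
            break
        if cand.playbook_id not in seen:
            seen.add(cand.playbook_id)
            kept.append(cand)
    return kept
```

In this sketch the lexical candidates are never truncated, even when they alone exceed `limit`. Whether the real resolver caps them is not stated in this log.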
**local verification**
```text
ruff check --select F,E9 apps/api/src/api/v1/webhooks.py apps/api/src/services/playbook_service.py apps/api/src/services/playbook_match_resolver.py apps/api/src/services/mcp_tool_registry.py apps/api/src/services/post_execution_verifier.py
OK
python3 -m py_compile apps/api/src/api/v1/webhooks.py apps/api/src/services/playbook_service.py apps/api/src/services/playbook_match_resolver.py apps/api/src/services/mcp_tool_registry.py apps/api/src/services/post_execution_verifier.py
OK
pytest apps/api/tests/test_alertmanager_rule_bypass.py apps/api/tests/test_playbook_service.py apps/api/tests/test_mcp_tool_registry.py apps/api/tests/test_post_execution_verifier.py -q
90 passed
```
**production deploy / smoke: complete**
```text
Commits:
  a0a0731c fix(auto-repair): preserve exact playbook candidates
  d835b666 fix(alertmanager): keep auto repair moving on ai fallback
  b1ecb55b fix(verification): align playbook and mcp evidence for canary alerts
  5fb73a56 fix(verifier): recognize rollout success evidence
  6f6d032c fix(mcp): grant rollout verifier read tool
  5604dd02 fix(auto-repair): mark approval execution status
Latest deployed image:
  awoooi-api 192.168.0.110:5000/awoooi/api:5604dd02562368a5ad7c194c050c59a2e8fd2b96
  awoooi-worker 192.168.0.110:5000/awoooi/api:5604dd02562368a5ad7c194c050c59a2e8fd2b96
health:
  GET https://awoooi.wooo.work/api/v1/health -> healthy
Live-fire:
  alertname=AwoooPT16J170843
  alert_id=alert-20260514010908
  fingerprint=9b5bab07e191b17228366b373e33a195
  incident=INC-20260513-0B357C
  approval=8b5392dc-d0b4-4990-be7e-b8f61fa3f776
  playbook=PB-AWOOOP-CANARY-AWOOOPT16J17084
  auto_repair_execution=8eddd1d2-8756-4755-8e0e-5d9c9955f958
DB result:
  incidents.status=RESOLVED
  approval_records.status=EXECUTION_SUCCESS
  approval_records.matched_playbook_id=PB-AWOOOP-CANARY-AWOOOPT16J17084
  auto_repair_executions.success=true
  auto_repair_executions.execution_time_ms=265
  incident_evidence.verification_result=success
  knowledge_entries created=2
  awooop_conversation_event stages=received,incident_linked
K8s verifier:
  deployment/awoooi-auto-repair-canary successfully rolled out
  generation=23 observed=23 ready=1/1 restartedAt=2026-05-13T17:10:43Z
```
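
The K8s verifier lines above suggest how rollout output can be turned into a verification result. The following is a rough sketch only, not the actual `post_execution_verifier.py`. The function name `verify_rollout`, the returned dict shape, and the `failed` value are assumptions. The `generation=… observed=… ready=…` format is taken from the log above.

```python
import re

# Matches both "deployment/NAME successfully rolled out" (as in the log above)
# and kubectl's quoted form: deployment "NAME" successfully rolled out.
ROLLOUT_OK = re.compile(
    r'deployment[/ ]"?(?P<name>[\w.-]+)"? successfully rolled out'
)
# Optional status line in the format shown in the log above.
STATUS = re.compile(
    r"generation=(?P<generation>\d+) observed=(?P<observed>\d+) "
    r"ready=(?P<ready>\d+)/(?P<desired>\d+)"
)


def verify_rollout(output: str) -> dict:
    """Classify rollout tool output as success, failed, or unverified."""
    result = {"verification_result": "unverified", "deployment": None}
    ok = ROLLOUT_OK.search(output)
    if not ok:
        # No success marker: do not guess, leave the result unverified.
        return result
    result["deployment"] = ok.group("name")
    status = STATUS.search(output)
    if status:
        generation = int(status.group("generation"))
        observed = int(status.group("observed"))
        ready = int(status.group("ready"))
        desired = int(status.group("desired"))
        # A stale observed generation or missing replicas contradicts success.
        if observed < generation or ready < desired:
            result["verification_result"] = "failed"
            return result
    result["verification_result"] = "success"
    return result
```

Treating missing evidence as `unverified` rather than `failed` keeps the sketch in line with the distinction the logbook draws between verified and unverified executions.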
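**Assessment**
- T16 has shown that Alertmanager alerts that are low-risk, precisely matched by a PlayBook, and of controlled blast radius can run from alert receipt through auto-repair, MCP/rollout verification, KM creation, and Incident closure.
- This does not mean full automation is complete: governance alerts (for example `knowledge_degradation` / `governance_slo_data_gap`) still trigger repeated Telegram pushes, and there is currently no visibility into a corresponding `governance_remediation_dispatch` stage.
- Next stage, T17: governance alert leader/dedupe, ADR-100 SLO emitter fix, KM stale refresh task, governance PlayBook seed, and an AwoooP frontend Timeline that shows per-stage status and MCP usage evidence.
- Overall progress update: about 92%.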
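
The timeout fallback described in this entry (AI analysis hangs, so the flow proceeds on PlayBook + labels + risk gate) might look roughly like the sketch below. It is an illustration under assumptions, not the code in `webhooks.py`. The function `analyze_with_fallback`, the `risk` label key, the 30-second default, and the returned dict shape are all hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def analyze_with_fallback(
    ai_analyze: Callable[[dict], Awaitable[dict]],
    alert: dict,
    playbook_id: Optional[str],
    timeout_s: float = 30.0,  # hypothetical default, not taken from the source
) -> dict:
    """Run AI analysis with a deadline; on timeout, decide from PlayBook + labels."""
    try:
        # wait_for cancels the pending analysis when the deadline passes.
        decision = await asyncio.wait_for(ai_analyze(alert), timeout=timeout_s)
        return {**decision, "source": "ai"}
    except asyncio.TimeoutError:
        pass  # fall through to the deterministic risk gate below

    labels = alert.get("labels", {})
    # Risk gate (assumed shape): require a matched PlayBook and a low-risk label.
    if playbook_id and labels.get("risk") == "low":
        return {
            "action": "auto_repair",
            "playbook_id": playbook_id,
            "source": "fallback",
        }
    return {
        "action": "manual_required",
        "playbook_id": playbook_id,
        "source": "fallback",
    }
```

Recording `source` in the result is one way to let later stages tell a fallback decision from an AI decision, for example when syncing the approval status.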