From b5288d4b7d9ddded497fbb26af1660ce0b58b915 Mon Sep 17 00:00:00 2001
From: Your Name
Date: Thu, 14 May 2026 01:13:52 +0800
Subject: [PATCH] docs(logbook): record t16 auto repair live fire

---
 docs/LOGBOOK.md | 77 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+)

diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 5525ff2b..50439e86 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -8164,3 +8164,80 @@ outbound_messages_visible=2
 - A complete AI auto-repair closed loop still cannot be claimed: `automation_quality` still marks NO_ACTION / unverified executions as manual_required or unverified; `verified_auto_repair_total` still needs follow-up live-fire verification.
 - Still pending for T15b: persisting the redacted full inbound envelope (conversation_event currently keeps only hash / preview), Sentry / SignOz explicit source refs, Ansible executor/diff/apply/rollback decision audit, and routing all write/admin MCP tools through the Gateway.
 - Overall progress update: about 89%.
+
+### 2026-05-14 - T16 Alertmanager low-risk auto-repair closed loop: production acceptance complete
+
+**Trigger**:
+
+- Telegram alert cards can show AI suggestions, but they could not prove whether a low-risk alert actually went through AI auto-triage, auto-repair, MCP verification, KM write-back, and Incident closure.
+- T15a completed the inbound truth-chain mirror; this stage adds a production live-fire for the low-risk canary, so auto-repair can no longer stop at the approval card or a manual workflow.
+
+**Fixes**:
+
+- PlayBook recommendation keeps exact/Jaccard candidates, so vector/RAG top-k cannot crowd out the exact canary PlayBook.
+- Alertmanager background AI analysis now has a timeout fallback; when OpenClaw/LLM hangs, auto-repair still proceeds on PlayBook + labels + risk gate.
+- After a fallback auto-repair completes, `approval_records.status = EXECUTION_SUCCESS` is synced, so the frontend no longer looks like it is waiting for manual action after the Incident is already closed.
+- MCP Gateway grants the read-only `k8s_watch_rollout` tool to `pre_decision_investigator` / `post_execution_verifier`, so the verifier can use rollout evidence to determine success.
+- PostExecutionVerifier recognizes `successfully rolled out` and successful tool output, and writes the verification result to `incident_evidence`.
+
+**local verification**:
+
+```text
+ruff check --select F,E9 apps/api/src/api/v1/webhooks.py apps/api/src/services/playbook_service.py apps/api/src/services/playbook_match_resolver.py apps/api/src/services/mcp_tool_registry.py apps/api/src/services/post_execution_verifier.py
+OK
+
+python3 -m py_compile apps/api/src/api/v1/webhooks.py apps/api/src/services/playbook_service.py apps/api/src/services/playbook_match_resolver.py apps/api/src/services/mcp_tool_registry.py apps/api/src/services/post_execution_verifier.py
+OK
+
+pytest apps/api/tests/test_alertmanager_rule_bypass.py apps/api/tests/test_playbook_service.py apps/api/tests/test_mcp_tool_registry.py apps/api/tests/test_post_execution_verifier.py -q
+90 passed
+```
+
+**production deploy / smoke (done)**:
+
+```text
+Commits:
+a0a0731c fix(auto-repair): preserve exact playbook candidates
+d835b666 fix(alertmanager): keep auto repair moving on ai fallback
+b1ecb55b fix(verification): align playbook and mcp evidence for canary alerts
+5fb73a56 fix(verifier): recognize rollout success evidence
+6f6d032c fix(mcp): grant rollout verifier read tool
+5604dd02 fix(auto-repair): mark approval execution status
+
+Latest deployed image:
+awoooi-api 192.168.0.110:5000/awoooi/api:5604dd02562368a5ad7c194c050c59a2e8fd2b96
+awoooi-worker 192.168.0.110:5000/awoooi/api:5604dd02562368a5ad7c194c050c59a2e8fd2b96
+
+health:
+GET https://awoooi.wooo.work/api/v1/health -> healthy
+
+Live-fire:
+alertname=AwoooPT16J170843
+alert_id=alert-20260514010908
+fingerprint=9b5bab07e191b17228366b373e33a195
+incident=INC-20260513-0B357C
+approval=8b5392dc-d0b4-4990-be7e-b8f61fa3f776
+playbook=PB-AWOOOP-CANARY-AWOOOPT16J17084
+auto_repair_execution=8eddd1d2-8756-4755-8e0e-5d9c9955f958
+
+DB result:
+incidents.status=RESOLVED
+approval_records.status=EXECUTION_SUCCESS
+approval_records.matched_playbook_id=PB-AWOOOP-CANARY-AWOOOPT16J17084
+auto_repair_executions.success=true
+auto_repair_executions.execution_time_ms=265
+incident_evidence.verification_result=success
+knowledge_entries created=2
+awooop_conversation_event stages=received,incident_linked
+
+K8s verifier:
+deployment/awoooi-auto-repair-canary successfully rolled out
+generation=23 observed=23 ready=1/1 restartedAt=2026-05-13T17:10:43Z
+```
+
+**Interpretation**:
+
+- T16 demonstrates that an Alertmanager alert that is low-risk, exactly matchable to a PlayBook, and has a controlled blast radius can run end to end: from alert receipt through auto-repair, MCP/rollout verification, KM creation, and Incident closure.
+- This does not mean full automation is done: governance alerts (for example `knowledge_degradation` / `governance_slo_data_gap`) still produce repeated Telegram pushes, and there is currently no stage visibility for the corresponding `governance_remediation_dispatch`.
+- Next stage, T17: governance alert leader/dedupe, ADR-100 SLO emitter fix, KM stale refresh task, governance PlayBook seed, and an AwoooP frontend Timeline that shows per-stage status and MCP usage evidence.
+- Overall progress update: about 92%.
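
The first fix in this entry (keeping exact/Jaccard candidates so a vector/RAG top-k cannot push out the exact canary PlayBook) might look roughly like the sketch below. This is only an illustration under assumptions: `Candidate`, `jaccard`, and `merge_candidates` are hypothetical names, not the actual code in `playbook_service.py` or `playbook_match_resolver.py`, and the real scoring and ordering may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical PlayBook candidate; not the project's real model."""
    playbook_id: str
    score: float
    source: str  # "exact", "jaccard", or "vector"


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_candidates(
    lexical: list[Candidate], vector: list[Candidate], top_k: int
) -> list[Candidate]:
    """Keep every lexical (exact/Jaccard) candidate, then fill the
    remaining slots up to top_k with the best-scoring vector hits."""
    merged = sorted(lexical, key=lambda c: c.score, reverse=True)
    seen = {c.playbook_id for c in merged}
    for cand in sorted(vector, key=lambda c: c.score, reverse=True):
        if len(merged) >= top_k:
            break
        if cand.playbook_id not in seen:  # skip PlayBooks already kept
            merged.append(cand)
            seen.add(cand.playbook_id)
    return merged
```

The design choice in this sketch is that lexical matches are never truncated by `top_k`; only the vector results compete for the remaining slots, so an exact canary match survives however many near neighbours the vector search returns.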