docs(logbook): record t16 auto repair live fire
@@ -8164,3 +8164,80 @@ outbound_messages_visible=2
- A complete AI auto-repair closed loop still cannot be claimed: `automation_quality` still marks NO_ACTION / unverified executions as manual_required or unverified, and `verified_auto_repair_total` still needs follow-up live-fire verification.
- Still pending in T15b: persisting the redacted full inbound envelope (conversation_event currently keeps only hash / preview), Sentry / SigNoz explicit source refs, Ansible executor/diff/apply/rollback decision audit, and routing all write/admin MCP tools through the Gateway.
- Overall progress update: about 89%.

### 2026-05-14 - T16 Alertmanager low-risk auto-repair closed loop: production acceptance complete

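**Trigger**:

- The Telegram alert card can display the AI recommendation, but it cannot prove whether a low-risk alert actually went through automatic AI triage, auto-repair, MCP verification, KM write-back, and Incident closure.
- T15a completed the inbound truth-chain mirror. This phase adds a production live-fire run of the low-risk canary, so that auto-repair no longer stops at an approval card or a manual process.
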
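**Fixes**:

- PlayBook recommendation keeps exact/Jaccard candidates, so that vector/RAG top-k results cannot crowd out the exact canary PlayBook.
- Alertmanager background AI analysis gains a timeout fallback: when OpenClaw/LLM hangs, auto-repair still proceeds using PlayBook + labels + risk gate.
- After a fallback auto-repair completes, `approval_records.status = EXECUTION_SUCCESS` is synced as well, so the frontend no longer looks like it is waiting for a human after the Incident has closed.
- MCP Gateway grants the read-only `k8s_watch_rollout` tool to `pre_decision_investigator` / `post_execution_verifier`, so the verifier can judge success from rollout evidence.
- PostExecutionVerifier now recognizes `successfully rolled out` and successful tool output, and writes the verification result to `incident_evidence`.
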
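The candidate-retention fix in the first item can be pictured with a minimal sketch. It assumes PlayBooks are matched on alert label tokens and that a vector/RAG stage returns its own top-k list; `Candidate`, `merge_candidates`, and the 0.5 threshold are hypothetical names and values for illustration, not the code in `playbook_service.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    playbook_id: str
    tokens: frozenset  # label tokens the PlayBook declares (assumed shape)


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a & b| / |a | b|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_candidates(alert_tokens, playbooks, vector_top_k, threshold=0.5):
    """Keep exact and high-Jaccard matches first, then append vector hits.

    Lexical matches are never dropped, so a vector top-k list cannot
    push an exact canary PlayBook out of the result.
    """
    alert = frozenset(alert_tokens)
    scored = [(jaccard(alert, pb.tokens), pb) for pb in playbooks]
    # Highest lexical score first; an exact match scores 1.0.
    lexical = [pb for score, pb in sorted(scored, key=lambda s: -s[0])
               if score >= threshold]
    seen = {pb.playbook_id for pb in lexical}
    merged = list(lexical)
    for pb in vector_top_k:
        if pb.playbook_id not in seen:  # de-duplicate against lexical hits
            seen.add(pb.playbook_id)
            merged.append(pb)
    return merged
```

Under these assumptions the exact match always sorts ahead of anything the vector stage contributes, whatever that stage returns.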
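The timeout fallback listed among the fixes could look roughly like the following. This is a sketch only: `analyze_with_fallback`, the two callables, and the 30-second default are assumptions for illustration, and the real handler in `webhooks.py` may be structured differently.

```python
import asyncio


async def analyze_with_fallback(ai_analyze, rule_based_plan, alert, timeout_s=30.0):
    """Return the AI analysis, or a rule-based plan if the AI call stalls.

    ai_analyze: async callable (hypothetical) wrapping the OpenClaw/LLM call.
    rule_based_plan: sync callable (hypothetical) that builds a plan from
    PlayBook + labels + risk gate only.
    """
    try:
        result = await asyncio.wait_for(ai_analyze(alert), timeout=timeout_s)
        return {"source": "ai", "plan": result}
    except asyncio.TimeoutError:
        # The LLM is stuck: keep the repair moving without it.
        return {"source": "fallback", "plan": rule_based_plan(alert)}
```

Recording the `source` alongside the plan keeps it visible afterwards whether a given repair was driven by the AI analysis or by the fallback path.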
**Local verification**:

```text
ruff check --select F,E9 apps/api/src/api/v1/webhooks.py apps/api/src/services/playbook_service.py apps/api/src/services/playbook_match_resolver.py apps/api/src/services/mcp_tool_registry.py apps/api/src/services/post_execution_verifier.py
OK

python3 -m py_compile apps/api/src/api/v1/webhooks.py apps/api/src/services/playbook_service.py apps/api/src/services/playbook_match_resolver.py apps/api/src/services/mcp_tool_registry.py apps/api/src/services/post_execution_verifier.py
OK

pytest apps/api/tests/test_alertmanager_rule_bypass.py apps/api/tests/test_playbook_service.py apps/api/tests/test_mcp_tool_registry.py apps/api/tests/test_post_execution_verifier.py -q
90 passed
```

**Production deploy / smoke (done)**:

```text
Commits:
a0a0731c fix(auto-repair): preserve exact playbook candidates
d835b666 fix(alertmanager): keep auto repair moving on ai fallback
b1ecb55b fix(verification): align playbook and mcp evidence for canary alerts
5fb73a56 fix(verifier): recognize rollout success evidence
6f6d032c fix(mcp): grant rollout verifier read tool
5604dd02 fix(auto-repair): mark approval execution status

Latest deployed image:
awoooi-api 192.168.0.110:5000/awoooi/api:5604dd02562368a5ad7c194c050c59a2e8fd2b96
awoooi-worker 192.168.0.110:5000/awoooi/api:5604dd02562368a5ad7c194c050c59a2e8fd2b96

health:
GET https://awoooi.wooo.work/api/v1/health -> healthy

Live-fire:
alertname=AwoooPT16J170843
alert_id=alert-20260514010908
fingerprint=9b5bab07e191b17228366b373e33a195
incident=INC-20260513-0B357C
approval=8b5392dc-d0b4-4990-be7e-b8f61fa3f776
playbook=PB-AWOOOP-CANARY-AWOOOPT16J17084
auto_repair_execution=8eddd1d2-8756-4755-8e0e-5d9c9955f958

DB result:
incidents.status=RESOLVED
approval_records.status=EXECUTION_SUCCESS
approval_records.matched_playbook_id=PB-AWOOOP-CANARY-AWOOOPT16J17084
auto_repair_executions.success=true
auto_repair_executions.execution_time_ms=265
incident_evidence.verification_result=success
knowledge_entries created=2
awooop_conversation_event stages=received,incident_linked

K8s verifier:
deployment/awoooi-auto-repair-canary successfully rolled out
generation=23 observed=23 ready=1/1 restartedAt=2026-05-13T17:10:43Z
```

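The K8s verifier evidence recorded above lends itself to a small check. The following is a minimal sketch, assuming the verifier receives the rollout text in exactly the format logged here; `verify_rollout` and both regular expressions are hypothetical and not taken from `post_execution_verifier.py`.

```python
import re

# Matches the success line in the format recorded in this logbook entry.
ROLLOUT_OK = re.compile(
    r"^deployment/(?P<name>\S+) successfully rolled out$", re.MULTILINE
)
# Matches the generation/readiness summary line recorded alongside it.
STATUS = re.compile(
    r"generation=(?P<gen>\d+) observed=(?P<obs>\d+) "
    r"ready=(?P<ready>\d+)/(?P<want>\d+)"
)


def verify_rollout(output: str, deployment: str) -> bool:
    """True only if the named deployment rolled out and is fully ready."""
    ok = ROLLOUT_OK.search(output)
    status = STATUS.search(output)
    if ok is None or status is None or ok.group("name") != deployment:
        return False
    # The controller must have observed the latest generation,
    # and every desired replica must be ready.
    return (int(status.group("gen")) == int(status.group("obs"))
            and int(status.group("ready")) == int(status.group("want")))
```

Requiring both the success line and matching generation/readiness counts is a deliberately strict choice in this sketch: a rollout message for a stale generation would not count as verified.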
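**Interpretation**:

- T16 proves that an Alertmanager alert that is low-risk, exactly matchable to a PlayBook, and bounded in blast radius can run end to end: alert received, auto-repair, MCP/rollout verification, KM entry creation, and Incident closure.
- This does not mean automation is complete across the board: governance alerts (for example `knowledge_degradation` / `governance_slo_data_gap`) still trigger repeated Telegram pushes, and there is currently no visibility into a corresponding `governance_remediation_dispatch` stage.
- Next phase, T17: governance alert leader/dedupe, ADR-100 SLO emitter fixes, a KM stale-refresh task, governance PlayBook seeds, and an AwoooP frontend Timeline showing per-stage status and MCP usage evidence.
- Overall progress update: about 92%.
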
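The three conditions T16 validated (low risk, exact PlayBook match, bounded blast radius) can be read as a gate in front of auto-repair. A minimal sketch under assumed names: `RepairContext`, its fields, the string vocabularies, and the one-workload limit are all hypothetical, chosen only to make the reading concrete.

```python
from dataclasses import dataclass


@dataclass
class RepairContext:
    risk_level: str          # "low" / "medium" / "high" (assumed vocabulary)
    playbook_match: str      # "exact", "jaccard", "vector" or "none" (assumed)
    affected_workloads: int  # assumed proxy for blast radius


def blocking_reasons(ctx: RepairContext, max_workloads: int = 1) -> list:
    """Return the reasons an alert may NOT be auto-repaired; empty means go."""
    reasons = []
    if ctx.risk_level != "low":
        reasons.append("risk_not_low")
    if ctx.playbook_match != "exact":
        reasons.append("playbook_not_exact")
    if ctx.affected_workloads > max_workloads:
        reasons.append("blast_radius_exceeded")
    return reasons
```

Returning reasons rather than a bare boolean would let an alert that fails the gate carry an explanation into the manual path, for example a governance alert with no seeded PlayBook.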