diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index f46cdcbf..793cd229 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -44,6 +44,68 @@ local:
 
 - After the release is pushed, add the official deployment marker and production smoke results.
 
+## 2026-05-31 | MOMO backup controlled apply: verification closure and Incident resolution
+
+**Background**:
+
+- In the previous round, `INC-20260531-D6A3C4` already had `ansible_apply_executed success`, and the playbook pointed to `ansible:188-momo-backup-user`, but the status-chain was still stuck at `auto_repaired_verification_degraded`.
+- The blocker was not a failed Ansible apply. After the on-site repair on 188, one durable post-verification evidence record was missing, and the Incident was still `investigating`.
+
+**Production closure**:
+
+- Readback on the 188 host:
+  - `/home/ollama/momo-pro/scripts/pg_backup.sh` is executable.
+  - The `ollama` crontab has an Ansible-managed cron entry that runs `/home/ollama/momo-pro/scripts/pg_backup.sh` daily at `02:00`.
+  - The scheduled backup succeeded at `2026-05-31 02:00:38`, producing `momo_analytics_20260531_020001.sql.gz` (about `177M`).
+  - The controlled manual proof succeeded at `2026-05-31 15:50:05`, producing `momo_analytics_20260531_154931.sql.gz` (about `177M`), with `backup_log insert success`.
+- Appended an `incident_evidence` snapshot:
+  - snapshot id: `9ac0a11d-5af0-4080-a657-59e73241a328`
+  - `verification_result=success`
+  - `matched_playbook_id=ansible:188-momo-backup-user`
+  - sensors `4/4` success; the post state records the executable / cron / scheduled backup / manual backup / backup_log evidence.
+- Called `POST /api/v1/incidents/INC-20260531-D6A3C4/resolve` through the existing API lifecycle:
+  - `before_status=investigating`
+  - `after_status=resolved`
+  - `updated=true`
+- The resolve path wrote the KM / Postmortem records:
+  - incident case KM: `9207f7e1-ee6f-4a4d-981f-4676c04a5d61`
+  - postmortem KM: `c28c5f56-d4b3-4314-b961-31d0b69a9c05`
+  - Telegram postmortem sent, thread message id `19374`, outbound shadow run `ae7848f5-5175-5e4b-ad96-c291ea2c2a10`.
+
+**Verification**:
+
+```text
+production status-chain:
+  /api/v1/platform/status-chain?project_id=awoooi&incident_id=INC-20260531-D6A3C4
+  repair_state=auto_repaired_verified
+  verdict=auto_repaired_verified
+  operator_outcome.state=completed_verified
+  operator_outcome.needs_human=false
+  operator_outcome.execution_result.completion_status=completed_verified
+  operator_outcome.execution_result.command_status=succeeded
+  operator_outcome.execution_result.repair_status=verified_repaired
+  operator_outcome.execution_result.failure_status=no_failure
+  operator_outcome.execution_result.summary_zh=已完成:修復指令成功,且驗證通過
+  execution.ansible.latest_operation_type=ansible_apply_executed
+  execution.ansible.latest_status=success
+  execution.ansible.latest_catalog_id=ansible:188-momo-backup-user
+  execution.ansible.latest_returncode=0
+  execution.ansible.controlled_apply=true
+
+quality summary:
+  ansible_runtime.can_run_check_mode=true
+  ansible_runtime.blockers=[]
+  execution_backend_summary.auto_repair_execution_records_total=10
+  ansible_check_mode_total=14
+  ansible_apply_total=1
+  ansible_pending_check_mode_total=0
+```
+
+**Interpretation**:
+
+- `INC-20260531-D6A3C4` has moved from "executed but verification degraded, needs human confirmation" to "verified complete, no human intervention needed".
+- This is a user-approved controlled apply closed out by manual verification, not a claim of 24h fully automatic repair; a full autonomous repair claim still needs more naturally occurring alert samples and closed-loop statistics.
+
 ## 2026-05-31 | AwoooP Run details: single-Incident handling flow drill-down
 
 **Background**:
diff --git a/docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md b/docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md
index 7eb4c193..0cf6fb59 100644
--- a/docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md
+++ b/docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md
@@ -2710,6 +2710,12 @@ After Phase 6 completes
 - Verification: API `py_compile` pass; targeted `ruff --select E9,F401,F821,F841` pass; `test_operator_outcome.py` + `test_awooop_truth_chain_service.py` + `test_awooop_operator_timeline_labels.py` + `test_telegram_message_templates.py` -> 164 passed; i18n JSON parse pass; web `tsc --noEmit` pass; production API URL build pass; `git diff --check` pass. Sample formatting for `INC-20260531-432726` confirms result card includes `completed_no_repair`, `skipped_no_action`, `not_executed`, `no_command_failed`.
 - Interpretation: T154f does not widen auto-repair permissions and does not re-run old incidents; it only separates "approval handling completed" from "repair command succeeded / failed / not executed", so that OBSERVE is not misread as repair success or repair failure.
 
+**T154g MOMO backup controlled apply verification closure (2026-05-31, Taipei)**:
+- Trigger: `INC-20260531-D6A3C4` already had `ansible_apply_executed success` and the `ansible:188-momo-backup-user` controlled apply, but the status-chain was still `auto_repaired_verification_degraded` because the Incident had not been closed and durable `verification_result=success` evidence was missing.
+- Closure: Readback on the 188 host confirmed that `pg_backup.sh` is executable, that the Ansible-managed `02:00` cron entry exists, and that both the `2026-05-31 02:00:38` scheduled backup and the `15:50:05` controlled manual proof succeeded, each producing a backup of about `177M` and writing to `backup_log`. Appended `incident_evidence` snapshot `9ac0a11d-5af0-4080-a657-59e73241a328` with `verification_result=success`, `matched_playbook_id=ansible:188-momo-backup-user`, and sensors `4/4`; then resolved through the existing incident API, with `before_status=investigating` and `after_status=resolved`.
+- Verification: the production status-chain returns `repair_state=auto_repaired_verified`, `operator_outcome.state=completed_verified`, `needs_human=false`, `execution_result.summary_zh=已完成:修復指令成功,且驗證通過` ("Completed: the repair command succeeded and verification passed"), `execution.ansible.latest_operation_type=ansible_apply_executed`, `latest_returncode=0`, and `controlled_apply=true`. The resolve path wrote incident case KM `9207f7e1-ee6f-4a4d-981f-4676c04a5d61` and postmortem KM `c28c5f56-d4b3-4314-b961-31d0b69a9c05`, and the Telegram postmortem was sent.
+- Interpretation: T154g is a user-approved controlled apply closed out by manual verification, not a claim of 24h fully automatic repair. This incident no longer needs human intervention; the only follow-up is to monitor for regression.
+
 **T152 Ansible runtime readiness surfaced (2026-05-24, Taipei)**:
 - Trigger: T151 already made the execution backend / Ansible attribution visible on the home page, but operators still could not see what the runtime side was missing, so "Ansible has a candidate" was easily misread as "Ansible can already auto-repair".
 - Fix: The API image copies `infra/ansible/` as a read-only catalog; `truth-chain/quality/summary` gains `ansible_runtime`, which reports playbook binary, catalog, inventory, playbook_count, can_run_check_mode, and blockers. The home page execution evidence also shows the runtime status; production currently shows `runtime 未就緒:ansible_playbook_binary_missing` (runtime not ready). `ansible-core` is not installed, and check-mode / apply are not enabled.
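
The status-chain values recorded in the T154g verification entry lend themselves to a mechanical check. The following is a minimal sketch, not the project's actual implementation: it assumes the status-chain endpoint returns nested JSON whose keys mirror the dotted paths in the log, and the names `get_path`, `EXPECTED`, and `closure_mismatches` are hypothetical helpers introduced only for illustration.

```python
from typing import Any


def get_path(payload: dict[str, Any], dotted: str, default: Any = None) -> Any:
    """Walk a nested dict using a dotted path such as 'operator_outcome.state'."""
    node: Any = payload
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


# Expected values copied from the T154g verification record.
# Assumption: the API nests these fields exactly as the dotted paths suggest.
EXPECTED: dict[str, Any] = {
    "repair_state": "auto_repaired_verified",
    "operator_outcome.state": "completed_verified",
    "operator_outcome.needs_human": False,
    "execution.ansible.latest_operation_type": "ansible_apply_executed",
    "execution.ansible.latest_returncode": 0,
    "execution.ansible.controlled_apply": True,
}


def closure_mismatches(status_chain: dict[str, Any]) -> list[str]:
    """Return dotted paths whose values differ from EXPECTED (empty list means verified)."""
    mismatches = []
    for path, want in EXPECTED.items():
        got = get_path(status_chain, path)
        # Compare the type as well, so that False and 0 are not treated as equal.
        if type(got) is not type(want) or got != want:
            mismatches.append(path)
    return mismatches


# Hand-written sample payload shaped like the logged values (not real API output).
sample = {
    "repair_state": "auto_repaired_verified",
    "operator_outcome": {"state": "completed_verified", "needs_human": False},
    "execution": {
        "ansible": {
            "latest_operation_type": "ansible_apply_executed",
            "latest_returncode": 0,
            "controlled_apply": True,
        }
    },
}
degraded = dict(sample, repair_state="auto_repaired_verification_degraded")

print(closure_mismatches(sample))    # []
print(closure_mismatches(degraded))  # ['repair_state']
```

Returning the list of mismatched paths, rather than a bare boolean, keeps the degraded case readable: an operator can see at a glance which field blocked the verified state.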