docs(logbook): record momo backup verification closure [skip ci]

Your Name
2026-05-31 17:58:50 +08:00
parent bdcb059444
commit 165abaeae7
2 changed files with 68 additions and 0 deletions

@@ -44,6 +44,68 @@ local:
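- After the release is pushed, add the formal deployment marker and the production smoke result.
## 2026-05-31: MOMO backup controlled apply verification closure and Incident resolution
**Background**
- In the previous round, `INC-20260531-D6A3C4` already had `ansible_apply_executed success`, and the PlayBook pointed to `ansible:188-momo-backup-user`, but the status-chain was still stuck at `auto_repaired_verification_degraded`.
- The blocker was not a failed Ansible apply. After the on-site repair on host 188, a durable post-verification evidence record was missing, and the Incident was still `investigating`.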
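The distinction described above can be sketched as a small decision function. This is an illustrative sketch only, not the platform's actual implementation: the function name `derive_repair_state` and its parameters are hypothetical, and only the two state strings, `verification_result=success`, and the `resolved` status come from this log.

```python
from typing import Optional


def derive_repair_state(
    apply_succeeded: bool,
    verification_result: Optional[str],
    incident_status: str,
) -> Optional[str]:
    """Hypothetical rule: a successful apply only counts as verified once
    durable verification evidence exists and the incident is resolved."""
    if not apply_succeeded:
        # States for failed or missing applies are not described in this log.
        return None
    if verification_result == "success" and incident_status == "resolved":
        return "auto_repaired_verified"
    return "auto_repaired_verification_degraded"


# Before closure: apply succeeded, no durable evidence, still investigating.
print(derive_repair_state(True, None, "investigating"))
# After closure: evidence snapshot recorded and incident resolved.
print(derive_repair_state(True, "success", "resolved"))
```

Keeping the apply result separate from the verification evidence is what lets a successful command still be reported as degraded until proof is recorded.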
**Production closure**
- Readback on the live host 188:
  - `/home/ollama/momo-pro/scripts/pg_backup.sh` is executable.
  - The `ollama` crontab has an Ansible-managed cron entry that runs `/home/ollama/momo-pro/scripts/pg_backup.sh` daily at `02:00`.
  - The scheduled backup at `2026-05-31 02:00:38` succeeded, producing `momo_analytics_20260531_020001.sql.gz` (`177M`).
  - The controlled manual proof at `2026-05-31 15:50:05` succeeded, producing `momo_analytics_20260531_154931.sql.gz` (`177M`), with `backup_log insert success`.
- Appended an `incident_evidence` snapshot:
  - snapshot id: `9ac0a11d-5af0-4080-a657-59e73241a328`
  - `verification_result=success`
  - `matched_playbook_id=ansible:188-momo-backup-user`
  - sensors `4/4` success; the post state records the executable / cron / scheduled backup / manual backup / backup_log evidence.
- Called `POST /api/v1/incidents/INC-20260531-D6A3C4/resolve` through the existing API lifecycle:
  - `before_status=investigating`
  - `after_status=resolved`
  - `updated=true`
- The resolve path wrote the KM / Postmortem records:
  - incident case KM: `9207f7e1-ee6f-4a4d-981f-4676c04a5d61`
  - postmortem KM: `c28c5f56-d4b3-4314-b961-31d0b69a9c05`
  - Telegram postmortem sent: thread message id `19374`, outbound shadow run `ae7848f5-5175-5e4b-ad96-c291ea2c2a10`
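As one way to turn the backup readback above into a repeatable check, the sketch below parses the timestamp embedded in a backup file name and tests whether it falls inside a freshness window. The helper names, the regular expression, and the 26-hour window are assumptions made for illustration; only the file-name pattern is taken from the artifacts listed above, and timestamps are treated as naive host-local times.

```python
import re
from datetime import datetime, timedelta

# Pattern inferred from the artifact names in this log (an assumption).
BACKUP_NAME = re.compile(r"^momo_analytics_(\d{8})_(\d{6})\.sql\.gz$")


def parse_backup_time(filename: str) -> datetime:
    """Return the naive timestamp encoded in a backup file name."""
    match = BACKUP_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected backup file name: {filename}")
    return datetime.strptime(match.group(1) + match.group(2), "%Y%m%d%H%M%S")


def is_fresh(
    filename: str,
    now: datetime,
    max_age: timedelta = timedelta(hours=26),  # hypothetical window
) -> bool:
    """True if the backup started no later than `now` and within `max_age`."""
    age = now - parse_backup_time(filename)
    return timedelta(0) <= age <= max_age


started = parse_backup_time("momo_analytics_20260531_020001.sql.gz")
print(started.isoformat())
```

A window slightly longer than 24 hours tolerates small variations in when the daily run starts; the right value depends on the actual schedule and alerting policy.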
**Verification**
```text
production status-chain:
/api/v1/platform/status-chain?project_id=awoooi&incident_id=INC-20260531-D6A3C4
repair_state=auto_repaired_verified
verdict=auto_repaired_verified
operator_outcome.state=completed_verified
operator_outcome.needs_human=false
operator_outcome.execution_result.completion_status=completed_verified
operator_outcome.execution_result.command_status=succeeded
operator_outcome.execution_result.repair_status=verified_repaired
operator_outcome.execution_result.failure_status=no_failure
operator_outcome.execution_result.summary_zh=已完成:修復指令成功,且驗證通過
execution.ansible.latest_operation_type=ansible_apply_executed
execution.ansible.latest_status=success
execution.ansible.latest_catalog_id=ansible:188-momo-backup-user
execution.ansible.latest_returncode=0
execution.ansible.controlled_apply=true
quality summary:
ansible_runtime.can_run_check_mode=true
ansible_runtime.blockers=[]
execution_backend_summary.auto_repair_execution_records_total=10
ansible_check_mode_total=14
ansible_apply_total=1
ansible_pending_check_mode_total=0
```
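A readback in this flat `key=value` form can be compared against the expected closure values mechanically. The sketch below is a hedged illustration: `parse_flat`, `mismatches`, and the choice of which keys to check are hypothetical, and it works on the text form shown above rather than on the real API response, whose JSON shape is not given in this log.

```python
from typing import Dict, Tuple

# Expected values copied from the readback above; the key selection is an assumption.
EXPECTED: Dict[str, str] = {
    "repair_state": "auto_repaired_verified",
    "operator_outcome.state": "completed_verified",
    "operator_outcome.needs_human": "false",
    "execution.ansible.latest_status": "success",
    "execution.ansible.latest_returncode": "0",
}


def parse_flat(text: str) -> Dict[str, str]:
    """Parse `key=value` lines, skipping headings and URL paths."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key and not key.startswith("/"):
            fields[key] = value
    return fields


def mismatches(fields: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return {key: (expected, actual)} for every expected key that differs."""
    return {
        key: (want, fields.get(key, "<missing>"))
        for key, want in EXPECTED.items()
        if fields.get(key) != want
    }


sample = """production status-chain:
/api/v1/platform/status-chain?project_id=awoooi&incident_id=INC-20260531-D6A3C4
repair_state=auto_repaired_verified
operator_outcome.state=completed_verified
operator_outcome.needs_human=false
execution.ansible.latest_status=success
execution.ansible.latest_returncode=0
"""
print(mismatches(parse_flat(sample)))
```

An empty result means every checked field matches; a degraded readback would list `repair_state` with its actual value.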
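**Assessment**
- `INC-20260531-D6A3C4` has converged from "executed but verification degraded, needs human confirmation" to "verified complete, no human intervention needed".
- This is a user-approved controlled apply plus manual verification closure, not a claim of 24h fully autonomous repair; a full autonomous repair claim still needs more natural alert samples and closed-loop statistics.
## 2026-05-31: AwoooP Run detail drill-down for the single-Incident handling flow
**Background**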

@@ -2710,6 +2710,12 @@ After Phase 6 is complete
- Verification: API `py_compile` pass; targeted `ruff --select E9,F401,F821,F841` pass; `test_operator_outcome.py` + `test_awooop_truth_chain_service.py` + `test_awooop_operator_timeline_labels.py` + `test_telegram_message_templates.py` -> 164 passed; i18n JSON parse pass; web `tsc --noEmit` pass; production API URL build pass; `git diff --check` pass. Sample formatting for `INC-20260531-432726` confirms the result card includes `completed_no_repair`, `skipped_no_action`, `not_executed`, `no_command_failed`.
- Assessment: T154f does not expand auto-repair privileges and does not re-run old incidents. It only separates "approval handling completed" from "repair command succeeded / failed / not executed", so that OBSERVE is not misread as repair success or repair failure.
**T154g MOMO backup controlled apply verification closure (2026-05-31, Taipei)**
- Trigger: `INC-20260531-D6A3C4` already had `ansible_apply_executed success` and the `ansible:188-momo-backup-user` controlled apply, but the status-chain was still `auto_repaired_verification_degraded` because the Incident had not been closed and durable `verification_result=success` evidence was missing.
- Closure: readback on the live host 188 confirmed that `pg_backup.sh` is executable, that the Ansible-managed `02:00` cron exists, and that both the `2026-05-31 02:00:38` scheduled backup and the `15:50:05` controlled manual proof succeeded, each producing a backup of about `177M` and writing to `backup_log`. Appended `incident_evidence` snapshot `9ac0a11d-5af0-4080-a657-59e73241a328` with `verification_result=success`, `matched_playbook_id=ansible:188-momo-backup-user`, and sensors `4/4`; then resolved through the existing incident API, with `before_status=investigating` and `after_status=resolved`.
- Verification: the production status-chain returns `repair_state=auto_repaired_verified`, `operator_outcome.state=completed_verified`, `needs_human=false`, `execution_result.summary_zh=已完成:修復指令成功,且驗證通過` ("Completed: the repair command succeeded and verification passed"), `execution.ansible.latest_operation_type=ansible_apply_executed`, `latest_returncode=0`, `controlled_apply=true`. The resolve path wrote incident case KM `9207f7e1-ee6f-4a4d-981f-4676c04a5d61` and postmortem KM `c28c5f56-d4b3-4314-b961-31d0b69a9c05`; Telegram postmortem sent.
- Assessment: T154g is a user-approved controlled apply plus manual verification closure, not a claim of 24h fully autonomous repair. This incident no longer needs human intervention; follow-up only needs to monitor for regression.
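**T152 Ansible runtime readiness surfaced (2026-05-24, Taipei)**
- Trigger: T151 already let the home page show execution backend / Ansible attribution, but operators still could not see what the runtime side was missing, which made it easy to misread "Ansible has a candidate" as "Ansible can already auto-repair".
- Fix: the API image copies `infra/ansible/` as a read-only catalog; `truth-chain/quality/summary` adds `ansible_runtime`, which reports playbook binary, catalog, inventory, playbook_count, can_run_check_mode, and blockers. The home page execution evidence shows the runtime status as well; production currently shows `runtime 未就緒ansible_playbook_binary_missing` ("runtime not ready", with blocker `ansible_playbook_binary_missing`). `ansible-core` is not installed, and check-mode / apply are not enabled.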
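A runtime readiness summary of the kind described above could be assembled roughly as follows. This is a minimal sketch under stated assumptions, not the project's code: the function name, the `which` injection point, counting every `*.yml` file under the catalog as a playbook, and the blocker names other than `ansible_playbook_binary_missing` are all hypothetical.

```python
import shutil
from pathlib import Path
from typing import Callable, Dict, List, Optional


def ansible_runtime_summary(
    catalog_dir: Path,
    inventory: Path,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, object]:
    """Report whether the local runtime could run Ansible in check mode."""
    blockers: List[str] = []

    binary = which("ansible-playbook")
    if binary is None:
        blockers.append("ansible_playbook_binary_missing")  # name from this log

    catalog_ok = catalog_dir.is_dir()
    if not catalog_ok:
        blockers.append("ansible_catalog_missing")  # hypothetical name

    inventory_ok = inventory.exists()
    if not inventory_ok:
        blockers.append("ansible_inventory_missing")  # hypothetical name

    # Assumption: every *.yml file under the catalog counts as a playbook.
    playbook_count = len(list(catalog_dir.glob("**/*.yml"))) if catalog_ok else 0

    return {
        "playbook_binary": binary,
        "catalog": catalog_ok,
        "inventory": inventory_ok,
        "playbook_count": playbook_count,
        "can_run_check_mode": not blockers,
        "blockers": blockers,
    }
```

Passing `which` as a parameter keeps the check testable on machines where `ansible-playbook` is not installed, for example `ansible_runtime_summary(Path("infra/ansible"), Path("infra/ansible/inventory"), which=lambda _: None)`; the inventory path in that call is likewise only a placeholder.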