docs(logbook): record momo backup verification closure [skip ci]
@@ -44,6 +44,68 @@ local:
- After the release is pushed, add the official deployment marker and the production smoke result.

## 2026-05-31|MOMO backup controlled apply: verification convergence and Incident closure

**Background**:

- The previous round for `INC-20260531-D6A3C4` already recorded `ansible_apply_executed success`, and the PlayBook pointed to `ansible:188-momo-backup-user`, but the status-chain was still stuck at `auto_repaired_verification_degraded`.
- The blocker was not a failed Ansible apply. After the on-host repair on 188, a durable post-verification evidence record was missing, and the Incident was still `investigating`.

**Production convergence**:

- On-host readback on 188:
  - `/home/ollama/momo-pro/scripts/pg_backup.sh` is executable.
  - The `ollama` crontab has an Ansible-managed cron entry that runs `/home/ollama/momo-pro/scripts/pg_backup.sh` daily at `02:00`.
  - The `2026-05-31 02:00:38` scheduled backup succeeded, producing `momo_analytics_20260531_020001.sql.gz` (about `177M`).
  - The `2026-05-31 15:50:05` controlled manual proof succeeded, producing `momo_analytics_20260531_154931.sql.gz` (about `177M`), with `backup_log insert success`.
- Appended an `incident_evidence` snapshot:
  - snapshot id: `9ac0a11d-5af0-4080-a657-59e73241a328`
  - `verification_result=success`
  - `matched_playbook_id=ansible:188-momo-backup-user`
  - sensors `4/4` success; the post state records the executable / cron / scheduled backup / manual backup / backup_log evidence.
- Called `POST /api/v1/incidents/INC-20260531-D6A3C4/resolve` through the existing API lifecycle:
  - `before_status=investigating`
  - `after_status=resolved`
  - `updated=true`
- The resolve path wrote the KM / Postmortem entries:
  - incident case KM: `9207f7e1-ee6f-4a4d-981f-4676c04a5d61`
  - postmortem KM: `c28c5f56-d4b3-4314-b961-31d0b69a9c05`
  - Telegram postmortem sent, thread message id `19374`, outbound shadow run `ae7848f5-5175-5e4b-ad96-c291ea2c2a10`.

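The resolve transition recorded above can be checked mechanically. The following is a minimal sketch, assuming the resolve endpoint returns a flat JSON object with the three fields listed in this entry; the function name `check_resolve_transition` is hypothetical and not part of the project API.

```python
from typing import Any, List, Mapping


def check_resolve_transition(payload: Mapping[str, Any]) -> List[str]:
    """Return the problems found in a resolve response (empty list means OK).

    Assumption: the response carries `before_status`, `after_status`, and
    `updated` at the top level, as recorded in this log entry.
    """
    problems: List[str] = []
    if payload.get("before_status") != "investigating":
        problems.append(f"unexpected before_status: {payload.get('before_status')!r}")
    if payload.get("after_status") != "resolved":
        problems.append(f"unexpected after_status: {payload.get('after_status')!r}")
    if payload.get("updated") is not True:
        problems.append("incident was not updated")
    return problems


# Values copied from the logbook entry above.
recorded = {"before_status": "investigating", "after_status": "resolved", "updated": True}
print(check_resolve_transition(recorded))
```

Returning a list of problems rather than a boolean keeps every failed expectation visible in one pass.
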

**Verification**:

```text
production status-chain:
/api/v1/platform/status-chain?project_id=awoooi&incident_id=INC-20260531-D6A3C4
repair_state=auto_repaired_verified
verdict=auto_repaired_verified
operator_outcome.state=completed_verified
operator_outcome.needs_human=false
operator_outcome.execution_result.completion_status=completed_verified
operator_outcome.execution_result.command_status=succeeded
operator_outcome.execution_result.repair_status=verified_repaired
operator_outcome.execution_result.failure_status=no_failure
operator_outcome.execution_result.summary_zh=已完成:修復指令成功,且驗證通過
execution.ansible.latest_operation_type=ansible_apply_executed
execution.ansible.latest_status=success
execution.ansible.latest_catalog_id=ansible:188-momo-backup-user
execution.ansible.latest_returncode=0
execution.ansible.controlled_apply=true

quality summary:
ansible_runtime.can_run_check_mode=true
ansible_runtime.blockers=[]
execution_backend_summary.auto_repair_execution_records_total=10
ansible_check_mode_total=14
ansible_apply_total=1
ansible_pending_check_mode_total=0
```
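
The dotted keys in the readback above lend themselves to an automated comparison. The following is a sketch only: it assumes the status-chain endpoint returns nested JSON whose structure mirrors the dotted paths (the log shows only the flattened form), and the helper names `flatten` and `mismatches` are hypothetical.

```python
from typing import Any, Dict, Iterator, Mapping, Tuple

_MISSING = object()  # sentinel for keys absent from the payload


def flatten(obj: Mapping[str, Any], prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (dotted_path, value) pairs for every leaf of a nested mapping."""
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, Mapping):
            yield from flatten(value, path)
        else:
            yield path, value


# A subset of the values recorded in the status-chain readback.
EXPECTED: Dict[str, Any] = {
    "repair_state": "auto_repaired_verified",
    "verdict": "auto_repaired_verified",
    "operator_outcome.state": "completed_verified",
    "operator_outcome.needs_human": False,
    "operator_outcome.execution_result.command_status": "succeeded",
    "execution.ansible.latest_status": "success",
    "execution.ansible.latest_returncode": 0,
    "execution.ansible.controlled_apply": True,
}


def mismatches(payload: Mapping[str, Any],
               expected: Mapping[str, Any] = EXPECTED) -> Dict[str, Tuple[Any, Any]]:
    """Map each differing path to (expected, actual); an empty dict means all match."""
    flat = dict(flatten(payload))
    diff: Dict[str, Tuple[Any, Any]] = {}
    for path, want in expected.items():
        got = flat.get(path, _MISSING)
        # Compare types as well, so that 0 and False are not treated as equal.
        if got is _MISSING or type(got) is not type(want) or got != want:
            diff[path] = (want, None if got is _MISSING else got)
    return diff


# Hypothetical payload shaped like the assumed API response.
sample = {
    "repair_state": "auto_repaired_verified",
    "verdict": "auto_repaired_verified",
    "operator_outcome": {
        "state": "completed_verified",
        "needs_human": False,
        "execution_result": {"command_status": "succeeded"},
    },
    "execution": {"ansible": {"latest_status": "success",
                              "latest_returncode": 0,
                              "controlled_apply": True}},
}
print(mismatches(sample))
```
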
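
**Interpretation**:

- `INC-20260531-D6A3C4` has converged from "executed, but verification degraded; needs manual confirmation" to "verified complete; no human intervention needed".
- This is a user-approved controlled apply plus manual verification convergence, not a claim of 24h fully autonomous repair; a full autonomous repair claim still requires more natural alert samples and closed-loop statistics.
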
## 2026-05-31|AwoooP Run details: single-Incident handling-flow drill-down

**Background**:

@@ -2710,6 +2710,12 @@ After Phase 6 completes
- Verification: API `py_compile` pass; targeted `ruff --select E9,F401,F821,F841` pass; `test_operator_outcome.py` + `test_awooop_truth_chain_service.py` + `test_awooop_operator_timeline_labels.py` + `test_telegram_message_templates.py` -> 164 passed; i18n JSON parse pass; web `tsc --noEmit` pass; production API URL build pass; `git diff --check` pass. Sample formatting for `INC-20260531-432726` confirms the result card includes `completed_no_repair`, `skipped_no_action`, `not_executed`, `no_command_failed`.
- Interpretation: T154f does not expand auto-repair permissions and does not rerun old incidents; it only separates "approval handling completed" from "repair command succeeded / failed / not executed", so that OBSERVE is not misread as repair success or repair failure.

**T154g MOMO backup controlled apply verification closure (2026-05-31 Taipei)**:
- Trigger: `INC-20260531-D6A3C4` already had `ansible_apply_executed success` and the `ansible:188-momo-backup-user` controlled apply, but the status-chain was still `auto_repaired_verification_degraded` because the Incident had not been closed and durable `verification_result=success` evidence was missing.
- Convergence: on-host readback on 188 confirmed that `pg_backup.sh` is executable, that the Ansible-managed `02:00` cron exists, and that both the `2026-05-31 02:00:38` scheduled backup and the `15:50:05` controlled manual proof succeeded, each producing a backup of about `177M` and writing to `backup_log`. Appended `incident_evidence` snapshot `9ac0a11d-5af0-4080-a657-59e73241a328` with `verification_result=success`, `matched_playbook_id=ansible:188-momo-backup-user`, and sensors `4/4`; then resolved through the existing incident API, `before_status=investigating`, `after_status=resolved`.
- Verification: production status-chain returns `repair_state=auto_repaired_verified`, `operator_outcome.state=completed_verified`, `needs_human=false`, `execution_result.summary_zh=已完成:修復指令成功,且驗證通過`, `execution.ansible.latest_operation_type=ansible_apply_executed`, `latest_returncode=0`, `controlled_apply=true`. The resolve path wrote incident case KM `9207f7e1-ee6f-4a4d-981f-4676c04a5d61` and postmortem KM `c28c5f56-d4b3-4314-b961-31d0b69a9c05`, and the Telegram postmortem was sent.
- Interpretation: T154g is a user-approved controlled apply plus manual verification convergence, not a claim of 24h fully autonomous repair; however, this incident no longer needs human intervention, and the only follow-up is monitoring for regression.

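The trigger and convergence described in T154g suggest a simple decision rule for the repair state. The sketch below is an illustration under stated assumptions, not the project's implementation: the function `classify_repair_state` and the label `not_auto_repaired` are hypothetical, and requiring both a resolved Incident and durable success evidence is inferred from the trigger description only.

```python
from typing import Optional


def classify_repair_state(apply_status: str,
                          incident_status: str,
                          verification_result: Optional[str]) -> str:
    """Classify an incident after a controlled Ansible apply.

    Assumed rule: a successful apply counts as verified only when the
    Incident is resolved AND durable verification evidence says success.
    """
    if apply_status != "success":
        return "not_auto_repaired"  # hypothetical label, not from the log
    if incident_status == "resolved" and verification_result == "success":
        return "auto_repaired_verified"
    return "auto_repaired_verification_degraded"


# State before T154g: apply succeeded, Incident open, no durable evidence.
print(classify_repair_state("success", "investigating", None))
# State after T154g: evidence snapshot appended and Incident resolved.
print(classify_repair_state("success", "resolved", "success"))
```

Under this rule a successful apply alone never yields a verified state, which matches the distinction the entry draws between command success and verified repair.
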

**T152 Ansible runtime readiness surfaced (2026-05-24 Taipei)**:
- Trigger: T151 already let the home page show execution backend / Ansible attribution, but the operator still could not see what the runtime side was missing, which made it easy to misread "Ansible has a candidate" as "Ansible can already auto-repair".
- Fix: the API image copies `infra/ansible/` as a read-only catalog; `truth-chain/quality/summary` adds `ansible_runtime`, which reports the playbook binary, catalog, inventory, playbook_count, can_run_check_mode, and blockers. The home page execution evidence shows the runtime status as well; production currently shows `runtime 未就緒:ansible_playbook_binary_missing` ("runtime not ready"). `ansible-core` is not installed, and check-mode / apply are not enabled.

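A readiness report of the kind T152 describes could be assembled roughly as follows. This is a sketch under assumptions: only the blocker name `ansible_playbook_binary_missing` and the reported field names come from the log; the function `ansible_runtime_report`, the blockers `catalog_missing` and `inventory_missing`, and counting every `*.yml` file as a playbook are hypothetical.

```python
import shutil
from pathlib import Path
from typing import Any, Dict, List, Optional


def ansible_runtime_report(catalog_dir: str,
                           inventory: Optional[str] = None) -> Dict[str, Any]:
    """Probe the local environment and report what blocks an Ansible check-mode run."""
    binary = shutil.which("ansible-playbook")  # None when not on PATH
    catalog = Path(catalog_dir)
    # Assumption: every *.yml file under the catalog is a playbook.
    playbooks = sorted(catalog.glob("**/*.yml")) if catalog.is_dir() else []

    blockers: List[str] = []
    if binary is None:
        blockers.append("ansible_playbook_binary_missing")  # name from the log
    if not catalog.is_dir():
        blockers.append("catalog_missing")  # hypothetical blocker name
    if inventory is None or not Path(inventory).exists():
        blockers.append("inventory_missing")  # hypothetical blocker name

    return {
        "playbook_binary": binary,
        "catalog": str(catalog),
        "inventory": inventory,
        "playbook_count": len(playbooks),
        "can_run_check_mode": not blockers,
        "blockers": blockers,
    }


print(ansible_runtime_report("infra/ansible"))
```

Reporting blockers as a list lets the operator see every missing prerequisite at once instead of only the first failure.
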