diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 1dd28b02..ab0300bb 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -84,6 +84,96 @@ git diff --check
 - Historical incidents are not repaired by batch re-runs; where old data lacks the original Telegram message id, the new version sends a standalone result notification going forward, and a full historical resend needs a separate backfill strategy to avoid flooding the channel.
 - For automation to claim repair completed, there must still be valid repair execution or auto-repair execution evidence; diagnosis, observation, dry-run, and degraded states are no longer displayed as repair completed.
 
+## 2026-05-31|MOMO PostgreSQL backup wired into AwoooP failure notifications and controlled Ansible repair
+
+**Background**:
+
+- The user requires that AI automation, monitoring, alerting, matching rules, playbooks, auto-repair, and KM show from Telegram / AwoooP / DB which stage was actually reached, rather than sending a single message from which the real handling status cannot be determined.
+- The previous round enabled `188-ai-web-readonly.yml` to run read-only Ansible check-mode through the production API pod using `ssh_mcp`; this round addresses the real drift in the MOMO PostgreSQL backup: on 188, `pg_backup.sh` is not executable and the crontab still points to the old `/home/ollama/bin/momo-pg-backup.sh`.
+- This round is a "controlled non-root Ansible apply after user approval", not a fully automatic repair; this evidence must not be claimed as 24h autonomous auto-repair closure.
+
+**Changes in this round**:
+
+- Added `infra/ansible/playbooks/188-momo-backup-user.yml`:
+  - `become: false`; manages only the `ollama`-owned `/home/ollama/momo-pro/scripts/*`, `/home/ollama/momo_backups`, and the `ollama` crontab.
+  - Copies `scripts/backup/backup-momo-188-pg.sh` and `scripts/ops/notify-awoooi-ops.sh` from the API image controller to 188.
+  - Removes the old unmanaged cron `/home/ollama/bin/momo-pg-backup.sh` and adds an Ansible-managed cron that runs `/home/ollama/momo-pro/scripts/pg_backup.sh` daily at 02:00.
+- `apps/api/src/services/awooop_ansible_audit_service.py` adds a dedicated low-risk catalog entry:
+  - `catalog_id=ansible:188-momo-backup-user`
+  - `risk_level=low`
+  - `auto_apply_enabled=false`
+  - `approval_required=true`
+- The truth-chain tests now have MOMO backup match the dedicated playbook first, instead of falling back loosely to the large host-wide `188-ai-web` playbook.
+- Fixed the broken chain in the API image build context:
+  - `.dockerignore` now whitelists `scripts/backup/backup-momo-188-pg.sh`.
+  - `.gitea/workflows/cd.yaml` adds `scripts/backup/backup-momo-188-pg.sh` and `scripts/ops/notify-awoooi-ops.sh` as CD triggers, so a future script update in the repo cannot leave the production image stale.
+
+**Verification**:
+
+```text
+local:
+  ruby YAML parse .gitea/workflows/cd.yaml + 188-momo-backup-user.yml -> yaml ok
+  python3 -m py_compile awooop_ansible_audit_service.py / awooop_ansible_check_mode_service.py / test_awooop_truth_chain_service.py -> pass
+  ruff check ... --select E9,F401,F821 -> pass
+  pytest apps/api/tests/test_awooop_truth_chain_service.py -q -> 44 passed
+  git diff --check -> pass
+
+Gitea / production deploy:
+  75f6929b fix(awooop): add momo backup user ansible repair -> pushed to gitea main
+  a1696695 fix(ansible): satisfy momo backup playbook lint -> ansible-lint run 3346 success
+  ebd9ca86 fix(api): include momo backup script in runtime image -> pushed to gitea main
+  run 3347 tests: 2291 passed, 23 skipped
+  run 3347 build-and-deploy: success
+  run 3347 post-deploy-checks: cancelled by later push, so final proof uses K8s image + live host + DB evidence
+  run 3348 ai-code-review: success
+  production image at verification:
+    awoooi-api = 192.168.0.110:5000/awoooi/api:ebd9ca865fa9a0af4ebc3470458f94b935805849
+    awoooi-web = 192.168.0.110:5000/awoooi/web:ebd9ca865fa9a0af4ebc3470458f94b935805849
+
+production Ansible:
+  API pod ansible-playbook --syntax-check 188-momo-backup-user.yml -> pass
+  API pod ansible-playbook --check --diff via ssh_mcp -> success, changed=3
+  controlled apply via API pod -> success
+  automation_operation_log:
+    ansible_check_mode_executed op_id=1430b250-16fa-485b-bb7b-f18c829ff673 status=success returncode=0
+    ansible_apply_executed op_id=08f52074-7ac6-4eb7-affd-a85f1f8eb0be status=success returncode=0 parent=1430b250-16fa-485b-bb7b-f18c829ff673
+  ansible AOL aggregate after apply:
+    ansible_candidate_total=167
+    ansible_check_mode_total=14
+    ansible_apply_total=1
+    ansible_apply_success_total=1
+    momo_apply_total=1
+
+188 host proof after apply:
+  /home/ollama/momo-pro/scripts/pg_backup.sh owner=ollama:ollama mode=755 size=5982
+  /home/ollama/momo-pro/scripts/notify-awoooi-ops.sh owner=ollama:ollama mode=755
+  /home/ollama/momo_backups owner=ollama:ollama mode=755
+  bash -n pg_backup.sh notify-awoooi-ops.sh -> pass
+  crontab has only Ansible managed MOMO backup cron:
+    #Ansible: AWOOOI momo PostgreSQL daily backup
+    0 2 * * * PATH=... /home/ollama/momo-pro/scripts/pg_backup.sh >> /home/ollama/momo_backups/backup.log 2>&1
+
+end-to-end backup proof:
+  AWOOI_BACKUP_LOG_STDOUT=1 AWOOI_BACKUP_NOTIFY_SUCCESS=0 /home/ollama/momo-pro/scripts/pg_backup.sh
+  Backup success: momo_analytics_20260531_154931.sql.gz (177M, 33s)
+  backup_log insert success
+  Deleted old backups: 0
+  Skipped AwoooP success notification; backup-health exporter serves as the health status source
+  file:
+    /home/ollama/momo_backups/momo_analytics_20260531_154931.sql.gz owner=ollama:ollama mode=640 size=177M
+  momo backup_log latest:
+    momo_analytics_20260531_154931.sql.gz | 185144359 bytes | 33s | success
+```
+
+**Progress boundaries**:
+
+- MOMO PostgreSQL backup wired into AwoooP failure notifications / controlled Ansible repair: 100%.
+- AwoooP truth-chain / DB traceability: about 96.5%; this round adds successful `ansible_apply_executed` evidence, but the frontend timeline / Telegram details still need a more complete drill-down into the apply row.
+- Ansible check-mode runtime: about 88%; `ssh_mcp`, the production pod, the playbook catalog, and DB write-back all work end to end, but only a few playbooks have a dedicated check-mode.
+- Controlled Ansible apply: about 10%; the first low-risk, user-approved, non-root apply has succeeded, but autonomous apply is not yet enabled.
+- 24h full auto-repair production claim: 0%; this round is a user-approved controlled apply, not an AI-autonomous apply.
+- Full AI automation flywheel: about 65%; the next step is to tie `ansible_apply_executed`, the backup result, the AwoooP timeline, Telegram details/history, and the frontend AwoooP Runs drill-down into a single operator truth view.
+
 ## 2026-05-31|Sidebar nav converged to Traditional Chinese across all locales
 
 **Background**: