From 943a6feacf4cb2c00614cdc462febebec32ee670 Mon Sep 17 00:00:00 2001
From: Your Name
Date: Sun, 31 May 2026 13:58:15 +0800
Subject: [PATCH] docs(logbook): record ansible check-mode truth chain blocker [skip ci]

---
 docs/LOGBOOK.md | 95 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 95 insertions(+)

diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 3a3deace..fd9ace5b 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -95,6 +95,101 @@ Browser local verification, /zh-TW/iwooos:
   -> stage board visible, deploy marker wording present, CD 3261 absent, 61% visible, active gate remains 0
 ```
 
+## 2026-05-31|AwoooP Ansible check-mode evidence-chain fix and cooldown noise cleanup
+
+**Background**:
+
+- After the safe check-mode worker went live, the production DB already contained `ansible_check_mode_executed` failure evidence, but the quality summary initially still reported `ansible_check_mode_total=0` and `blockers=[]`.
+- A direct query of `automation_operation_log` confirmed 6 `ansible_check_mode_executed failed` rows with `returncode=4`. The cause is that the `repair-bot-*` forced-command security wrapper on 110/188 rejects the Ansible bootstrap shell; the output contains `REPAIR_DENIED:invalid_command`.
+- This shows that the Ansible check-mode execution evidence chain really does reach the DB, but it also proves that automatic repair cannot yet be claimed as successful. The correct status is "blocked by the security boundary"; a dedicated Ansible dry-run transport or a controlled subcommand still has to be built.
+
+**Changes in this round**:
+
+- New writes of `ansible_candidate_matched`, `ansible_check_mode_executed`, and `ansible_execution_skipped` now also populate `automation_operation_log.incident_id`, so aggregation no longer misses records whose incident is stored only in the JSON input.
+- The 24h window of `fetch_automation_quality_summary` now also includes older incidents that have recent Ansible automation evidence, instead of filtering only on the incident's own creation time.
+- The Ansible audit contract now lists `incident_id` as a required field, so Telegram / AwoooP / the frontend can all trace the complete flow from the incident.
+- Fixed the asyncpg interval parameter type in the worker cooldown query: it now uses `:cooldown_seconds * INTERVAL '1 second'`, which stops production from raising a SQL DataError on every tick.
+
+**Deployment and verification**:
+
+```text
+local targeted pytest:
+  DATABASE_URL=postgresql+asyncpg://test:test@localhost:5432/test pytest apps/api/tests/test_awooop_truth_chain_service.py -q
+  -> 39 passed
+
+local static checks:
+  py_compile modified services/tests -> pass
+  ruff E9,F401,F821 modified services/tests -> pass
+  git diff --check -> pass
+
+Gitea / production:
+  dad8c0fb fix(awooop): link ansible evidence to incidents
+  e1355c8e chore(cd): deploy dad8c0f [skip ci]
+  CD run 3315 -> tests/build-and-deploy/post-deploy-checks success
+
+  126316a4 fix(awooop): make ansible cooldown query asyncpg safe
+  7b2efc14 chore(cd): deploy 126316a [skip ci]
+  CD run 3317 -> tests/build-and-deploy/post-deploy-checks success
+
+  production API image -> 192.168.0.110:5000/awoooi/api:126316a414df7cb0f117602c90ea573424ec9a84
+  rollout status awoooi-api/awoooi-worker -> success
+  /api/v1/health -> healthy
+```
+
+**Live truth-chain summary**:
+
+```text
+incident_total=27
+evaluated_total=27
+verified_auto_repair_total=0
+ansible_considered_total=13
+ansible_audit_record_total=19
+ansible_candidate_total=48
+ansible_check_mode_total=6
+ansible_apply_total=0
+ansible_pending_check_mode_total=7
+ansible_runtime.can_run_check_mode=false
+ansible_runtime.blockers=["ansible_repair_ssh_forced_command_denies_ansible_bootstrap"]
+production_claim.can_claim_full_auto_repair=false
+```
+
+**Worker cooldown verification**:
+
+```text
+awooop_ansible_check_mode_worker_tick:
+  claimed=0
+  completed=0
+  failed=0
+  blockers=["ansible_repair_ssh_forced_command_denies_ansible_bootstrap"]
+
+automation_operation_log counts:
+  ansible_candidate_matched dry_run = 166
+  ansible_check_mode_executed failed = 6
+```
+
+**Current overall progress**:
+
+```text
+AwoooP truth-chain observability and linkage: 92% -> 94%
+PlayBook / Ansible runtime readiness: 65%
+PlayBook check-mode evidence chain: 0% -> 35%
+PlayBook check-mode sustained operation: 35% -> 45% (cooldown noise cleaned up, but transport still blocked)
+PlayBook apply auto-repair: 0% (still locked, not enabled)
+AI automation management product overall: 99.45% -> 99.55%
+full auto-repair production claim: false
+```
+
+**Next steps**:
+
+- Do not relax the existing repair-bot forced-command security boundary; do not allow arbitrary shell just to let Ansible through.
+- Build a dedicated Ansible check-mode transport (separate key/account, minimal sudo, known_hosts, dry-run catalog only), or add an explicit, controlled `ansible-check:` subcommand to repair-bot.
+- The frontend AwoooP Runs / Work Items / AI governance pages should display:
+  - `ansible_check_mode_total=6`
+  - `ansible_apply_total=0`
+  - `can_run_check_mode=false`
+  - forced-command blocker
+  - the next step is transport remediation, not manual guesswork.
+
 ## 2026-05-31|CD source-link gate expiry and pipefail fix
 
 **Background**: