docs(logbook): record ansible check-mode truth chain blocker [skip ci]

Your Name
2026-05-31 13:58:15 +08:00
parent 7b2efc14c4
commit 943a6feacf


@@ -95,6 +95,101 @@ Browser local verification, /zh-TW/iwooos:
-> stage board visible, deploy marker wording present, CD 3261 absent, 61% visible, active gate remains 0
```
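## 2026-05-31: AwoooP Ansible check-mode evidence-chain correction and cooldown noise cleanup
**Background**
- After the safe check-mode worker went live, the production DB already held `ansible_check_mode_executed` failure evidence, but the quality summary initially still showed `ansible_check_mode_total=0` and `blockers=[]`.
- Querying `automation_operation_log` directly confirmed 6 `ansible_check_mode_executed failed` rows with `returncode=4`. The cause: the `repair-bot-*` forced-command security wrapper on 110/188 rejects the Ansible bootstrap shell, and the output contains `REPAIR_DENIED:invalid_command`.
- This shows that the Ansible check-mode execution evidence chain really does reach the DB, but it also proves that auto-repair success cannot be claimed yet. The correct status is "blocked by the security boundary"; a dedicated Ansible dry-run transport or a controlled subcommand still has to be built.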
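
The distinction drawn above, between a run that failed on its own and a run that the transport refused, can be sketched as a small classifier. This is only an illustration: the marker string and the blocker name are taken from this log entry, while `CheckModeResult` and `classify_check_mode` are hypothetical names, not the project's actual code.

```python
from dataclasses import dataclass
from typing import Optional

# Both strings appear in this log entry; all other names are hypothetical.
DENIED_MARKER = "REPAIR_DENIED:invalid_command"
BLOCKER = "ansible_repair_ssh_forced_command_denies_ansible_bootstrap"


@dataclass(frozen=True)
class CheckModeResult:
    returncode: int
    output: str


def classify_check_mode(result: CheckModeResult) -> dict:
    """Map one check-mode run to a status plus an optional blocker."""
    blocker: Optional[str] = None
    if result.returncode == 0:
        return {"status": "success", "blocker": blocker}
    if DENIED_MARKER in result.output:
        # The wrapper refused the command, so this is a transport blocker,
        # not a playbook failure.
        blocker = BLOCKER
    return {"status": "failed", "blocker": blocker}
```

Under these assumptions, the 6 recorded rows (`returncode=4` with the denial marker in the output) would all map to `failed` with the forced-command blocker attached.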
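**Changes in this round**
- New `ansible_candidate_matched`, `ansible_check_mode_executed`, and `ansible_execution_skipped` records now also write `automation_operation_log.incident_id`, so aggregation no longer misses rows whose incident was stored only in the JSON input.
- The 24h window of `fetch_automation_quality_summary` now also includes older incidents that have recent Ansible automation evidence, instead of looking only at the incident's own creation time.
- The Ansible audit contract now lists `incident_id` as a required field, so Telegram / AwoooP / the frontend can all follow an incident through the complete flow.
- Fixed the asyncpg interval parameter type in the worker cooldown query: it now uses `:cooldown_seconds * INTERVAL '1 second'`, which stops production from raising an SQL DataError on every tick.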
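
The widened 24h window can be pictured as a union of two sets. The sketch below is a pure-Python stand-in for whatever `fetch_automation_quality_summary` does in SQL; the function name `incidents_in_window` and the tuple shapes are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def incidents_in_window(incidents, evidence_rows, now, window=timedelta(hours=24)):
    """Select incidents created in the window OR with recent Ansible evidence.

    incidents:     iterable of (incident_id, created_at)
    evidence_rows: iterable of (incident_id, logged_at); incident_id may be None
    """
    incidents = list(incidents)
    cutoff = now - window
    known = {iid for iid, _ in incidents}
    selected = {iid for iid, created_at in incidents if created_at >= cutoff}
    # Older incidents are pulled back in when their evidence is recent.
    # Rows without an incident_id cannot be linked, so they are skipped.
    selected |= {
        iid for iid, logged_at in evidence_rows
        if iid in known and logged_at >= cutoff
    }
    return selected


now = datetime(2026, 5, 31, 12, 0, tzinfo=timezone.utc)
incidents = [("new", now - timedelta(hours=1)), ("old", now - timedelta(days=3))]
evidence = [("old", now - timedelta(hours=2)), (None, now - timedelta(hours=1))]
selected = incidents_in_window(incidents, evidence, now)
```

The evidence row with no `incident_id` illustrates why the first change matters: without the column being populated, the row cannot pull its incident into the window.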
**Deployment and verification**
```text
local targeted pytest:
DATABASE_URL=postgresql+asyncpg://test:test@localhost:5432/test pytest apps/api/tests/test_awooop_truth_chain_service.py -q
-> 39 passed
local static checks:
py_compile modified services/tests -> pass
ruff E9,F401,F821 modified services/tests -> pass
git diff --check -> pass
Gitea / production:
dad8c0fb fix(awooop): link ansible evidence to incidents
e1355c8e chore(cd): deploy dad8c0f [skip ci]
CD run 3315 -> tests/build-and-deploy/post-deploy-checks success
126316a4 fix(awooop): make ansible cooldown query asyncpg safe
7b2efc14 chore(cd): deploy 126316a [skip ci]
CD run 3317 -> tests/build-and-deploy/post-deploy-checks success
production API image -> 192.168.0.110:5000/awoooi/api:126316a414df7cb0f117602c90ea573424ec9a84
rollout status awoooi-api/awoooi-worker -> success
/api/v1/health -> healthy
```
**Live truth-chain summary**
```text
incident_total=27
evaluated_total=27
verified_auto_repair_total=0
ansible_considered_total=13
ansible_audit_record_total=19
ansible_candidate_total=48
ansible_check_mode_total=6
ansible_apply_total=0
ansible_pending_check_mode_total=7
ansible_runtime.can_run_check_mode=false
ansible_runtime.blockers=["ansible_repair_ssh_forced_command_denies_ansible_bootstrap"]
production_claim.can_claim_full_auto_repair=false
```
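
A consumer of this summary could sanity-check it before displaying it. The sketch below parses `key=value` lines and tests two relations. Both relations are assumptions read off the numbers above (13 considered minus 6 executed equals 7 pending), not a confirmed definition of the fields, and `parse_summary` is a hypothetical helper.

```python
def parse_summary(text: str) -> dict:
    """Parse `key=value` lines into a dict of raw strings."""
    out = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:
            out[key] = value
    return out


SUMMARY = """\
verified_auto_repair_total=0
ansible_considered_total=13
ansible_check_mode_total=6
ansible_apply_total=0
ansible_pending_check_mode_total=7
production_claim.can_claim_full_auto_repair=false
"""

s = parse_summary(SUMMARY)

# Assumed relation: pending = considered - already executed in check mode.
pending = int(s["ansible_considered_total"]) - int(s["ansible_check_mode_total"])
consistent = pending == int(s["ansible_pending_check_mode_total"])

# Assumed guard: with zero verified repairs the claim must not be "true".
claim_ok = not (
    s["production_claim.can_claim_full_auto_repair"] == "true"
    and int(s["verified_auto_repair_total"]) == 0
)
```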
**Worker cooldown verification**
```text
awooop_ansible_check_mode_worker_tick:
claimed=0
completed=0
failed=0
blockers=["ansible_repair_ssh_forced_command_denies_ansible_bootstrap"]
automation_operation_log counts:
ansible_candidate_matched dry_run = 166
ansible_check_mode_executed failed = 6
```
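
For context on the cooldown fix: asyncpg expects a `datetime.timedelta` for an `interval` parameter, so binding a plain integer where an interval is expected is rejected. Multiplying an integer bind by a literal interval keeps the parameter numeric. The sketch below shows the SQL shape and an equivalent pure-Python check; only the `:cooldown_seconds * INTERVAL '1 second'` expression and the table name come from this entry, while the `created_at` column and the helper `in_cooldown` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Only the interval expression and table name come from the log entry;
# the created_at column is a hypothetical placeholder.
COOLDOWN_SQL = """
SELECT count(*)
FROM automation_operation_log
WHERE incident_id = :incident_id
  AND created_at >= now() - (:cooldown_seconds * INTERVAL '1 second')
"""


def in_cooldown(last_attempt_at: datetime, now: datetime, cooldown_seconds: int) -> bool:
    """Pure-Python equivalent of the SQL predicate above."""
    cutoff = now - timedelta(seconds=cooldown_seconds)
    return last_attempt_at >= cutoff


now = datetime(2026, 5, 31, 12, 0, tzinfo=timezone.utc)
recent = in_cooldown(now - timedelta(seconds=30), now, cooldown_seconds=60)
expired = in_cooldown(now - timedelta(seconds=120), now, cooldown_seconds=60)
```

A tick that finds every candidate still inside its cooldown claims nothing, which matches the `claimed=0` line in the verification output.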
**Current overall progress**
```text
AwoooP truth-chain observability and truth chain: 92% -> 94%
PlayBook / Ansible runtime readiness: 65%
PlayBook check-mode evidence chain: 0% -> 35%
PlayBook check-mode sustained operation: 35% -> 45% (cooldown noise cleaned up, but transport still blocked)
PlayBook apply auto-repair: 0% (still locked, not enabled)
AI automation management product overall: 99.45% -> 99.55%
full auto-repair production claim: false
```
**Next steps**
- Do not loosen the existing repair-bot forced-command security boundary; do not allow arbitrary shell just to let Ansible through.
- Build a dedicated Ansible check-mode transport (separate key/account, minimal sudo, known_hosts, dry-run catalog only), or add an explicit `ansible-check:<catalog_id>` controlled subcommand to repair-bot.
- The frontend AwoooP Runs / Work Items / AI governance pages must show:
  - `ansible_check_mode_total=6`
  - `ansible_apply_total=0`
  - `can_run_check_mode=false`
  - the forced-command blocker
  - that the next step is transport remediation, not manual guesswork.
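
One way the `ansible-check:<catalog_id>` subcommand could be validated without opening arbitrary shell is a strict full-string match plus an allow-list. This is a minimal sketch under stated assumptions: the catalog ids, the id character set, and the function `parse_forced_command` are all hypothetical, and the real repair-bot wrapper is not shown in this log.

```python
import re
from typing import Optional

# Hypothetical allow-list; real catalog ids are not given in this log.
ALLOWED_CATALOG_IDS = frozenset({"example_restart_service", "example_disk_cleanup"})

# The whole request must match; no whitespace, quoting, or shell metacharacters.
_COMMAND_RE = re.compile(r"ansible-check:([A-Za-z0-9_-]{1,64})")


def parse_forced_command(original_command: str) -> Optional[str]:
    """Return the catalog id for a valid request, else None (deny)."""
    match = _COMMAND_RE.fullmatch(original_command)
    if match is None:
        return None
    catalog_id = match.group(1)
    return catalog_id if catalog_id in ALLOWED_CATALOG_IDS else None
```

With OpenSSH, a key restricted by `command=` in `authorized_keys` receives the client's requested command in the `SSH_ORIGINAL_COMMAND` environment variable, so a wrapper could pass that value to a function like this and run only the matching dry-run catalog entry. Anything else, including the Ansible bootstrap shell, would still be denied.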
## 2026-05-31: CD source-link gate expiry and pipefail fix
**Background**