docs(logbook): record ansible check-mode truth chain blocker [skip ci]
@@ -95,6 +95,101 @@ Browser local verification, /zh-TW/iwooos:
-> stage board visible, deploy marker wording present, CD 3261 absent, 61% visible, active gate remains 0
```

## 2026-05-31|AwoooP Ansible check-mode evidence-chain correction and cooldown noise cleanup

**Background**:

- After the safe check-mode worker went live, the production DB already contained failed `ansible_check_mode_executed` evidence, but the quality summary initially still reported `ansible_check_mode_total=0` and `blockers=[]`.
- A direct query of `automation_operation_log` confirmed 6 `ansible_check_mode_executed failed` rows with `returncode=4`. The cause is that the `repair-bot-*` forced-command security wrapper on 110/188 rejects the Ansible bootstrap shell; the output contains `REPAIR_DENIED:invalid_command`.
- This shows that the Ansible check-mode execution evidence chain really does reach the DB, but it also proves that auto-repair success cannot be claimed yet. The correct status is "blocked by the security boundary"; a dedicated Ansible dry-run transport or a controlled subcommand still has to be built.

**Changes in this update**:

- New `ansible_candidate_matched`, `ansible_check_mode_executed`, and `ansible_execution_skipped` records now also write `automation_operation_log.incident_id`, so aggregation no longer misses evidence whose incident is recorded only in the JSON input.
- The 24h window in `fetch_automation_quality_summary` now also includes older incidents that have recent Ansible automation evidence, instead of filtering only on the incident's own creation time.
- The Ansible audit contract now requires `incident_id`, so Telegram, AwoooP, and the frontend can all trace the full flow from an incident.
- Fixed the asyncpg interval parameter type in the worker cooldown query: it now uses `:cooldown_seconds * INTERVAL '1 second'`, which stops production from raising a SQL DataError on every tick.
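
The widened 24h window can be illustrated with a minimal in-memory sketch. It is not the code of `fetch_automation_quality_summary`; the `Incident` and `Evidence` classes, their field names, and `incidents_in_window` are hypothetical stand-ins for the incident table and `automation_operation_log`. It only shows the selection rule described above: an incident counts if it was created inside the window, or if it has recent Ansible evidence linked through the `incident_id` column.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Incident:
    # Hypothetical stand-in for an incident row.
    incident_id: str
    created_at: datetime


@dataclass
class Evidence:
    # Hypothetical stand-in for an automation_operation_log row.
    operation: str               # e.g. "ansible_check_mode_executed"
    created_at: datetime
    incident_id: Optional[str]   # None models a row whose incident lives only in JSON input


def incidents_in_window(incidents, evidence, now, window=timedelta(hours=24)):
    """Select incidents created in the window OR with recent Ansible evidence."""
    cutoff = now - window
    # Incident ids that have at least one recent Ansible evidence row.
    recently_evidenced = {
        e.incident_id
        for e in evidence
        if e.incident_id is not None
        and e.operation.startswith("ansible_")
        and e.created_at >= cutoff
    }
    return [
        i for i in incidents
        if i.created_at >= cutoff or i.incident_id in recently_evidenced
    ]


if __name__ == "__main__":
    now = datetime(2026, 5, 31, 12, 0, tzinfo=timezone.utc)
    incidents = [
        Incident("INC-1", now - timedelta(days=3)),   # old, but has fresh evidence
        Incident("INC-2", now - timedelta(hours=1)),  # created inside the window
        Incident("INC-3", now - timedelta(days=3)),   # old, no linked evidence
    ]
    evidence = [
        Evidence("ansible_check_mode_executed", now - timedelta(hours=2), "INC-1"),
        # Evidence without the column cannot be attributed to any incident.
        Evidence("ansible_candidate_matched", now - timedelta(hours=2), None),
    ]
    for incident in incidents_in_window(incidents, evidence, now):
        print(incident.incident_id)
```

The second evidence row has no `incident_id`, so it cannot pull any incident into the window. That is the aggregation gap the new column is meant to close.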

**Deployment and verification**:

```text
local targeted pytest:
DATABASE_URL=postgresql+asyncpg://test:test@localhost:5432/test pytest apps/api/tests/test_awooop_truth_chain_service.py -q
-> 39 passed

local static checks:
py_compile modified services/tests -> pass
ruff E9,F401,F821 modified services/tests -> pass
git diff --check -> pass

Gitea / production:
dad8c0fb fix(awooop): link ansible evidence to incidents
e1355c8e chore(cd): deploy dad8c0f [skip ci]
CD run 3315 -> tests/build-and-deploy/post-deploy-checks success

126316a4 fix(awooop): make ansible cooldown query asyncpg safe
7b2efc14 chore(cd): deploy 126316a [skip ci]
CD run 3317 -> tests/build-and-deploy/post-deploy-checks success

production API image -> 192.168.0.110:5000/awoooi/api:126316a414df7cb0f117602c90ea573424ec9a84
rollout status awoooi-api/awoooi-worker -> success
/api/v1/health -> healthy
```

**Live truth-chain summary**:

```text
incident_total=27
evaluated_total=27
verified_auto_repair_total=0
ansible_considered_total=13
ansible_audit_record_total=19
ansible_candidate_total=48
ansible_check_mode_total=6
ansible_apply_total=0
ansible_pending_check_mode_total=7
ansible_runtime.can_run_check_mode=false
ansible_runtime.blockers=["ansible_repair_ssh_forced_command_denies_ansible_bootstrap"]
production_claim.can_claim_full_auto_repair=false
```

**Worker cooldown verification**:

```text
awooop_ansible_check_mode_worker_tick:
claimed=0
completed=0
failed=0
blockers=["ansible_repair_ssh_forced_command_denies_ansible_bootstrap"]

automation_operation_log counts:
ansible_candidate_matched dry_run = 166
ansible_check_mode_executed failed = 6
```
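
The tick output above is consistent with a worker that claims nothing while a blocker is known or a cooldown is active. The following is a minimal sketch of that gating idea, not the actual worker: `TickResult`, `worker_tick`, and their parameters are hypothetical, and only the blocker string is taken from the log.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Blocker name as reported in the tick output above.
FORCED_COMMAND_BLOCKER = "ansible_repair_ssh_forced_command_denies_ansible_bootstrap"


@dataclass
class TickResult:
    # Hypothetical shape mirroring the fields shown in the tick log.
    claimed: int = 0
    completed: int = 0
    failed: int = 0
    blockers: List[str] = field(default_factory=list)


def worker_tick(
    pending: List[str],
    last_failure_age_s: Optional[float],
    cooldown_seconds: float,
    blockers: List[str],
) -> TickResult:
    """One hypothetical tick: claim work only if nothing blocks check-mode."""
    result = TickResult(blockers=list(blockers))
    in_cooldown = (
        last_failure_age_s is not None and last_failure_age_s < cooldown_seconds
    )
    if blockers or in_cooldown:
        # Skip claiming so the same failure is not re-recorded every tick.
        return result
    result.claimed = len(pending)
    return result


if __name__ == "__main__":
    tick = worker_tick(["cand-1", "cand-2"], 30.0, 3600.0, [FORCED_COMMAND_BLOCKER])
    print(tick)
```

Under these assumptions, a blocked tick reports `claimed=0`, `completed=0`, `failed=0` and carries the blocker forward, so the failed-evidence count stays at its current value instead of growing on every tick.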

**Current overall progress**:

```text
AwoooP truth-chain observability and truth linkage: 92% -> 94%
PlayBook / Ansible runtime readiness: 65%
PlayBook check-mode evidence chain: 0% -> 35%
PlayBook check-mode sustained operation: 35% -> 45% (cooldown noise cleaned up, but transport still blocked)
PlayBook apply auto-repair: 0% (still locked, not enabled)
AI automation management product overall: 99.45% -> 99.55%
full auto-repair production claim: false
```

**Next steps**:

- Do not relax the existing repair-bot forced-command security boundary; do not allow arbitrary shell just to let Ansible through.
- Build a dedicated Ansible check-mode transport (separate key/account, minimal sudo, known_hosts, dry-run catalog only), or add an explicit, controlled `ansible-check:<catalog_id>` subcommand to repair-bot.
- The frontend AwoooP Runs / Work Items / AI governance pages should display:
  - `ansible_check_mode_total=6`
  - `ansible_apply_total=0`
  - `can_run_check_mode=false`
  - the forced-command blocker
  - that the next step is transport remediation, not manual guesswork.
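
The controlled-subcommand option could look roughly like the sketch below. It assumes the wrapper receives the client's requested command as a string (OpenSSH exposes it to forced commands in the `SSH_ORIGINAL_COMMAND` environment variable). The catalog entries, the id pattern, the `authorize` function, and the `REPAIR_DENIED:unknown_catalog_id` code are hypothetical; only `REPAIR_DENIED:invalid_command` and the `ansible-check:<catalog_id>` shape come from this entry.

```python
import re

# Hypothetical dry-run catalog; the real catalog ids are not listed in this entry.
DRY_RUN_CATALOG = {"disk-cleanup", "restart-nginx"}

# Accept exactly "ansible-check:<catalog_id>" and nothing else.
_SUBCOMMAND = re.compile(r"ansible-check:([a-z0-9][a-z0-9_-]{0,63})")


def authorize(original_command: str):
    """Return (allowed, detail) for a command seen by the forced-command wrapper."""
    match = _SUBCOMMAND.fullmatch(original_command.strip())
    if match is None:
        # Arbitrary shell, including the Ansible bootstrap shell, stays denied.
        return False, "REPAIR_DENIED:invalid_command"
    catalog_id = match.group(1)
    if catalog_id not in DRY_RUN_CATALOG:
        # Hypothetical denial code for ids outside the dry-run catalog.
        return False, "REPAIR_DENIED:unknown_catalog_id"
    return True, catalog_id


if __name__ == "__main__":
    for command in (
        "ansible-check:disk-cleanup",
        "/bin/sh -c 'echo ~'",
        "ansible-check:disk-cleanup; rm -rf /",
    ):
        print(command, "->", authorize(command))
```

Matching the whole string against a fixed pattern, rather than passing it to a shell, keeps the existing boundary intact: anything that is not a known catalog id is rejected before any command is built.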

## 2026-05-31|CD source-link gate expiry and pipefail fix

**Background**: