docs(logbook): record ansible check-mode ssh mcp proof [skip ci]
@@ -1,3 +1,72 @@
## 2026-05-31 | AwoooP Ansible check-mode truth-chain connected; the 188 sudo boundary becomes the new red light

**Background**:

- The previous round confirmed that the repair-bot forced-command rejects the Ansible bootstrap shell; the ADR-058 repair-bot security boundary must not be relaxed for the sake of check-mode.
- The production API pod contains both `/run/secrets/ssh_mcp_key` and `/etc/ssh-mcp/known_hosts`; the `ssh_mcp@awoooi-api` key can SSH to 110/188, but the `ollama` account on 188 cannot run `sudo -n` (no passwordless sudo).
- `automation_operation_log.incident_id` is the ADR-090 `BIGINT` column, whereas AWOOOI external incident ids are `INC-...` strings; writing them directly into the column raises asyncpg `DatatypeMismatchError` and blocks the check-mode claim.

**Changes in this round**:

- Ansible check-mode transport switched to `ssh_mcp`:
  - `AWOOOP_ANSIBLE_CHECK_MODE_TRANSPORT_PROFILE=ssh_mcp`
  - `AWOOOP_ANSIBLE_CHECK_MODE_SSH_KEY_PATH=/run/secrets/ssh_mcp_key`
  - `AWOOOP_ANSIBLE_CHECK_MODE_KNOWN_HOSTS_PATH=/etc/ssh-mcp/known_hosts`
- The repair-bot forced-command boundary is retained; the old `REPAIR_DENIED:invalid_command` is listed only under `historical_transport_blockers` and no longer blocks the new ssh-mcp check-mode.
- check-mode candidates are narrowed to the most recent 24h, to avoid draining the historical backlog in one pass.
- `automation_operation_log` write fix: `INC-...` is kept in `input.incident_id`; only purely numeric ids are written to the `incident_id BIGINT` column.
- truth-chain runtime readiness now exposes the check-mode key / known_hosts / transport profile, so the frontend and Telegram can judge whether check-mode can run rather than looking only at repair-bot.

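The `incident_id` routing rule from the write fix can be illustrated with a minimal sketch. The function name `split_incident_id` and the shape of the `input` payload are hypothetical, not taken from the AwoooP code; the sketch also assumes that the external id is always kept in `input.incident_id`, even when it is numeric.

```python
import re
from typing import Optional, Tuple

# ASCII digits only; str.isdigit() would also accept other Unicode digits.
_NUMERIC_ID = re.compile(r"[0-9]+")
_BIGINT_MAX = 2**63 - 1  # upper bound of a PostgreSQL BIGINT


def split_incident_id(
    raw: Optional[str], payload: Optional[dict] = None
) -> Tuple[Optional[int], dict]:
    """Return (value for the BIGINT column, input payload).

    Hypothetical helper: external ids such as "INC-..." stay in the
    payload only; purely numeric ids are also written to the column.
    """
    payload = dict(payload or {})
    if raw is None:
        return None, payload
    text = str(raw).strip()
    # Assumption: the id as received is always preserved in the payload.
    payload["incident_id"] = text
    if _NUMERIC_ID.fullmatch(text):
        value = int(text)
        if value <= _BIGINT_MAX:
            return value, payload
    # Non-numeric (or out-of-range) ids must not reach the BIGINT column.
    return None, payload
```

With this split, a caller would pass the first element as the `incident_id` query parameter and serialize the second into the `input` field, so a string id never reaches the `BIGINT` column.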

**Production verification**:

```text
Gitea:
  3327 build-and-deploy job -> success
  3327 run final status -> cancelled (the deploy marker push split the run and post-deploy status; to be tracked separately as CD governance debt)
  3328 code-review -> success
  deploy marker -> 4744670e chore(cd): deploy 8c40621 [skip ci]

K8s:
  awoooi-api -> 192.168.0.110:5000/awoooi/api:8c40621d...
  awoooi-worker -> 192.168.0.110:5000/awoooi/api:8c40621d...
  awoooi-web -> 192.168.0.110:5000/awoooi/web:8c40621d...
  rollout api/worker/web -> success
  /api/v1/health -> healthy, prod, mock_mode=false
  ollama_route_order -> GCP-A, GCP-B, local

truth-chain summary:
  ansible_runtime.check_mode_transport_profile=ssh_mcp
  check_mode_ssh_key_readable=true
  check_mode_known_hosts_readable=true
  can_run_check_mode=true
  blockers=[]
  historical_transport_blockers=[ansible_repair_ssh_forced_command_denies_ansible_bootstrap]
  production_claim.can_claim_full_auto_repair=false

DB / worker evidence:
  ansible_candidate_matched dry_run=166
  ansible_check_mode_executed failed=8
  latest 2 rows:
    INC-20260530-0E5C5C -> ssh_mcp, ansible:188-ai-web, check_mode_executed=true, apply_executed=false, rc=2
    INC-20260530-B37FB4 -> ssh_mcp, ansible:188-ai-web, check_mode_executed=true, apply_executed=false, rc=2
  failure reason:
    host_188 Gathering Facts -> Incorrect sudo password
```

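As an illustration of how the readiness fields in the truth-chain summary could be derived from the three environment variables, here is a minimal sketch. The class `CheckModeReadiness`, the function `check_mode_readiness`, and the blocker strings are hypothetical names chosen for this example; only the environment variable names and the reported field names come from the entry above.

```python
import os
from dataclasses import dataclass, field
from typing import List, Mapping


@dataclass
class CheckModeReadiness:
    check_mode_transport_profile: str
    check_mode_ssh_key_readable: bool
    check_mode_known_hosts_readable: bool
    blockers: List[str] = field(default_factory=list)

    @property
    def can_run_check_mode(self) -> bool:
        # Check-mode is runnable only when no current blocker is recorded.
        return not self.blockers


def _readable(path: str) -> bool:
    """True if the path names an existing file this process can read."""
    return bool(path) and os.path.isfile(path) and os.access(path, os.R_OK)


def check_mode_readiness(env: Mapping[str, str]) -> CheckModeReadiness:
    profile = env.get("AWOOOP_ANSIBLE_CHECK_MODE_TRANSPORT_PROFILE", "")
    key_ok = _readable(env.get("AWOOOP_ANSIBLE_CHECK_MODE_SSH_KEY_PATH", ""))
    hosts_ok = _readable(env.get("AWOOOP_ANSIBLE_CHECK_MODE_KNOWN_HOSTS_PATH", ""))

    blockers: List[str] = []  # blocker names below are illustrative only
    if profile != "ssh_mcp":
        blockers.append("check_mode_transport_profile_not_ssh_mcp")
    if not key_ok:
        blockers.append("check_mode_ssh_key_unreadable")
    if not hosts_ok:
        blockers.append("check_mode_known_hosts_unreadable")
    return CheckModeReadiness(profile, key_ok, hosts_ok, blockers)
```

Calling `check_mode_readiness(os.environ)` inside the API pod would then yield an object whose fields mirror the summary. Note that such a check only says the transport is usable; it cannot detect the 188 sudo failure, which surfaces only when the playbook runs.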
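
**Assessment / next steps**:

- This is not a completed auto-repair; apply remains locked, `ansible_apply_total=0`, and the production full auto-repair claim is still false.
- What is done: "AwoooP can route AI candidate repairs into Ansible check-mode and write the evidence to the DB"; the next real blocker is a controlled sudo / become policy for 188.
- Granting `ollama` unrestricted NOPASSWD is not recommended; the next step should be one of:
  - Create a dedicated Ansible check-mode account with a minimal sudoers entry that allows only the read/check operations the catalog needs.
  - Or split out a read-only check-mode playbook for 188 that first covers Docker / app-layer observation without sudo, with root-owned drift still routed to manual approval.
- Progress:
  - AwoooP truth-chain visibility: 95%
  - Ansible check-mode wiring: 70% (110 runs; 188 is blocked on sudo)
  - Telegram / frontend truth semantics: 90%
  - Automatic apply / auto-repair closed loop: 0% (deliberately kept locked; the safe-release gate has not been reached)
  - Overall AI automation flywheel: 60% (monitoring, classification, approval, and evidence chain much improved; auto-repair and KM owner review are still not closed-loop)
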
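For the second option above (a read-only check-mode playbook for 188 that needs no sudo), the invocation might look like the following sketch. The function name, the playbook path, and the inventory path are hypothetical. The flags themselves (`--check`, `--inventory`, `--limit`, `--private-key`, `--ssh-common-args`) are standard `ansible-playbook` options, and the sketch assumes the playbook does not declare `become`, so no privilege escalation is requested.

```python
from typing import List


def build_check_mode_argv(
    playbook: str,
    inventory: str,
    limit: str,
    ssh_key_path: str,
    known_hosts_path: str,
) -> List[str]:
    """Build an ansible-playbook argv for a check-mode run without become."""
    # Pin host keys to the mounted known_hosts file and never prompt.
    ssh_args = (
        f"-o UserKnownHostsFile={known_hosts_path} "
        "-o StrictHostKeyChecking=yes "
        "-o BatchMode=yes"
    )
    return [
        "ansible-playbook",
        playbook,
        "--check",  # dry run: report changes without applying them
        "--inventory", inventory,
        "--limit", limit,
        "--private-key", ssh_key_path,
        "--ssh-common-args", ssh_args,
        # Deliberately no --become: this run must not need sudo.
    ]


# Hypothetical paths, for illustration only.
argv = build_check_mode_argv(
    playbook="playbooks/188-readonly-check.yml",
    inventory="inventory/production.ini",
    limit="host_188",
    ssh_key_path="/run/secrets/ssh_mcp_key",
    known_hosts_path="/etc/ssh-mcp/known_hosts",
)
```

Passing the list to `subprocess.run(argv, ...)` rather than a shell string avoids quoting problems in the SSH options. Any task in the catalog that genuinely needs root would still fail under this profile, which matches the intent of routing root-owned drift to manual approval.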

## 2026-05-31 | Telegram alert frontend truth display and legacy data backfill

**Background**: