fix(autoheal): preflight + shell || true associativity, fixing 24h of 100% no-op
All checks were successful
CD Pipeline / deploy (push) Successful in 2m23s
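Root cause from the Debugger five-phase methodology (2026-05-03): the ADR-020 policy layer is 100% in place (all reasoning results are auto_fix=enabled), but the AiderHeal execution layer had 0 successful pushes in 24 hours, all of them silent failures.

Two root causes compound:
#1 (config) AIDER_REPO_PATH=/home/wooo/ewoooc does not exist on host 110
   → wrote the SOP docs/runbooks/aider-heal-110-setup-sop.md for the Commander to run by hand
#2 (code) setup_cmds ends with `git stash 2>&1 || true`; because `&&` and `||` have equal precedence and group left to right, the chain is equivalent to `(A && B && C && D) || true`, so a failed cd makes the whole chain return rc=0, the `if rc != 0` at line 261 never fires, and setup_failed was never logged
#4 (code) there was no preflight, so with a broken environment the full pipeline ran silently and printed no_diff

Code fixes in this commit:
• Add an L0 preflight (test -d $REPO/.git) at the start of execute_code_fix; on failure it fails fast, sends a critical Telegram alert, and points to the SOP
• Change setup_cmds to `A && B && C && (D || true)`, using a subshell to limit the scope of || true
• Change all 5 occurrences of `cd $REPO_PATH` in the file to `cd shlex.quote(REPO_PATH)`, so the next person who copies a cd chain does not hit the same class of shell quoting bug

SOP changes addressing critic findings High-2 + Medium-6:
• Step 2 now uses an SSH clone (git clone gitea-autoheal:...), which avoids an HTTP clone stalling on a credential prompt for the private repo and bypassing the key deployed in step 1
• Step 4b fixes the nested quoting (heredoc + single-quote protection); the original version always produced a false positive

The critic reviewed this and gave Approve to commit; Medium-2/3/4 (rate limiting / adding stderr to logs / adding a preflight unit test) are scheduled as follow-ups and do not block this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>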
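The grouping problem in root cause #2 can be reproduced locally. The following is a minimal sketch, not the project's code: it assumes a POSIX `sh` is on PATH, uses `true` as a stand-in for the git commands, and uses a made-up directory name that is assumed not to exist.

```python
import subprocess


def run(chain: str) -> int:
    """Run a shell command chain with `sh -c` and return its exit status."""
    return subprocess.run(
        ["sh", "-c", chain],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    ).returncode


# Hypothetical path, assumed not to exist on the machine running this sketch.
MISSING = "/nonexistent/aiderheal-demo"

# Old shape: the shell groups this as (cd ... && true && true) || true,
# so the failed cd is masked and the chain reports success.
old_rc = run(f"cd {MISSING} && true && true || true")

# New shape: only the last command is guarded by the subshell,
# so the failed cd still decides the exit status of the chain.
new_rc = run(f"cd {MISSING} && true && (true || true)")

print("old chain rc:", old_rc)
print("new chain rc:", new_rc)
```

The old chain reports 0 even though `cd` failed, while the new chain reports a non-zero status that an `if rc != 0` check can act on.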
202
docs/runbooks/aider-heal-110-setup-sop.md
Normal file
@@ -0,0 +1,202 @@
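# AiderHeal Deployment SOP for Host 110 (ADR-020)

> Fixes the root cause of the 100% AiderHeal no-op found on 2026-05-03: on host 110, `AIDER_REPO_PATH` (`/home/wooo/ewoooc`) **does not exist**, so every `cd` fails immediately, `|| true` swallows the error, and the whole pipeline runs to completion with 0 pushes.
>
> This SOP takes effect permanently once it has been completed. The Commander must run it by hand, because it involves deploying an SSH key and verifying Gitea push permission.

---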
## Prechecks

| # | Check | Command |
|---|-------|---------|
| 1 | Host 110 is reachable | `ssh wooo@192.168.0.110 hostname` |
| 2 | Whether `~/.ssh/autoheal_id_ed25519` already exists on 110 | `ssh wooo@192.168.0.110 'ls -la ~/.ssh/autoheal*'` |
| 3 | Whether `config/autoheal_id_ed25519` exists in the container on 188 | `ssh ollama@192.168.0.188 'ls -la /home/ollama/momo-pro-system/config/autoheal*'` |
| 4 | Whether that SSH key has been added on Gitea as a deploy key (write access) | Gitea → wooo/ewoooc → Settings → Deploy Keys |

**If #2 and #3 are OK and #4 is already added** → skip straight to "Step 2: clone repo"
**If #2 is missing the key** → go to "Step 1: deploy the SSH key"
**If #4 has not been added** → go to "Step 3: add a Gitea deploy key"

---
## Step 1: Deploy the SSH key to host 110

Sync the key that already exists in the container to 110, where it becomes the identity that 110 uses to push back to Gitea.

```bash
# Read the private key from 188 (the container mount point)
ssh ollama@192.168.0.188 'cat /home/ollama/momo-pro-system/config/autoheal_id_ed25519' \
  | ssh wooo@192.168.0.110 'umask 077 && cat > ~/.ssh/autoheal_id_ed25519'

# Copy the public key
ssh ollama@192.168.0.188 'cat /home/ollama/momo-pro-system/config/autoheal_id_ed25519.pub' \
  | ssh wooo@192.168.0.110 'cat > ~/.ssh/autoheal_id_ed25519.pub'

# Set permissions
ssh wooo@192.168.0.110 'chmod 600 ~/.ssh/autoheal_id_ed25519 && chmod 644 ~/.ssh/autoheal_id_ed25519.pub'

# Add a host alias to ~/.ssh/config so that git picks this key automatically
ssh wooo@192.168.0.110 'cat >> ~/.ssh/config << "EOF"

Host gitea-autoheal
  HostName 192.168.0.110
  Port 3022
  User git
  IdentityFile ~/.ssh/autoheal_id_ed25519
  IdentitiesOnly yes
EOF
chmod 600 ~/.ssh/config'
```

> **Confirm port 3022**: run `ssh wooo@192.168.0.110 'docker ps | grep gitea'` to see the Gitea SSH port. The default is 3022, but it may differ.

---
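## Step 2: Clone the repo to `/home/wooo/ewoooc` on 110 (direct SSH clone)

> Note: **use an SSH clone from the start**. An HTTP clone of a private repo stalls on a credential prompt and does not use the key deployed in step 1. First confirm the Gitea SSH port (default 3022, but it may have been changed). The command filters by container name, because the `{{.Ports}}` output alone does not contain the word `gitea` for `grep` to match:
>
> ```bash
> ssh wooo@192.168.0.110 'docker ps --filter name=gitea --format "{{.Ports}}"'
> ```
>
> In the output, find `0.0.0.0:NNN->22/tcp`; NNN is the Gitea SSH port. The examples below use 3022; **replace it with the actual value**.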
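Picking NNN out of the port list by eye is easy to get wrong when several ports are published. The following is a minimal sketch of the lookup; the function name and the sample string are made up for illustration, and the sample only imitates the shape of `{{.Ports}}` output.

```python
import re
from typing import Optional


def find_ssh_host_port(ports: str) -> Optional[int]:
    """Return the host port published for container port 22/tcp, or None."""
    match = re.search(r":(\d+)->22/tcp", ports)
    return int(match.group(1)) if match else None


# Made-up sample in the shape printed by `docker ps --format "{{.Ports}}"`.
sample = "0.0.0.0:3001->3000/tcp, 0.0.0.0:3022->22/tcp"
print(find_ssh_host_port(sample))
```

If the function returns `None`, the container does not publish port 22 at all, and the `Port` line in `~/.ssh/config` cannot be right.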
```bash
ssh wooo@192.168.0.110 << 'EOF'
set -e
cd ~

# Safety net: if ewoooc exists but is not a git repo (possibly old junk), back it up first
if [ -d ewoooc ] && [ ! -d ewoooc/.git ]; then
  mv ewoooc ewoooc.bak.$(date +%s)
fi

# Clone directly over SSH (reuses the key deployed in step 1 and the gitea-autoheal alias in ~/.ssh/config)
if [ ! -d ewoooc/.git ]; then
  git clone gitea-autoheal:wooo/ewoooc.git ewoooc
fi

cd ewoooc

# Set a git identity so that AiderHeal commits have a recognizable author
git config user.name "AiderHeal"
git config user.email "autoheal@wooo.work"

# Confirm the remote uses SSH (the gitea-autoheal alias carries the right port and key)
git remote -v
git log --oneline -3
EOF
```

Verify: the output should show `origin gitea-autoheal:wooo/ewoooc.git` (two lines, fetch and push) and the 3 most recent commits.

> **If the clone fails with `Permission denied (publickey)`**: the Gitea deploy key from step 3 has not been added yet, or write access was not checked. Go back and complete step 3 first.

---
## Step 3: Add a deploy key in Gitea (if #4 has not been added)

1. Get the public key:
   ```bash
   ssh wooo@192.168.0.110 'cat ~/.ssh/autoheal_id_ed25519.pub'
   ```
2. Gitea web UI:
   - Open `http://192.168.0.110:3001/wooo/ewoooc/settings/keys`
   - Add Deploy Key
   - Title: `AiderHeal 110 host`
   - Key: paste the public key from item 1
   - **Check `Allow write access`** (required; otherwise the key can only fetch, not push)
   - Add Key

---
## Step 4: End-to-end verification

### 4a. Manually test the push path on 110

```bash
ssh wooo@192.168.0.110 << 'EOF'
cd ~/ewoooc
git fetch origin main
git status
EOF
```

Expected: `fetch` reports no error and `status` shows `Your branch is up to date`.

### 4b. Test the SSH path from the container on 188 (simulating the AiderHeal preflight)

> An earlier version used three layers of quoting, `docker exec ... bash -c "ssh ... \"...\""`. The middle layer consumed the inner double quotes, so `&& echo PREFLIGHT_OK` ran as a local echo instead of a remote one: **always a false positive**. Use a heredoc plus nested single quotes instead:

```bash
ssh ollama@192.168.0.188 << 'OUTER'
docker exec momo-pro-system bash -c '
  ssh -i /app/config/autoheal_id_ed25519 \
      -o StrictHostKeyChecking=no \
      wooo@192.168.0.110 "test -d /home/wooo/ewoooc/.git && echo PREFLIGHT_OK"
'
OUTER
```

Expected output: `PREFLIGHT_OK` (**printed by remote host 110**, not locally).

Sanity check: deliberately misspell one letter of the path. There should be **no output** (`PREFLIGHT_OK` must not be printed).
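The same check can be tried without SSH to see how the quoted command behaves. This is a minimal sketch under stated assumptions: `build_preflight_cmd` and `preflight_ok` are hypothetical helper names, and a local `sh -c` stands in for the remote call that the executor makes through `_ssh_exec`.

```python
import shlex
import subprocess
import tempfile
from pathlib import Path


def build_preflight_cmd(repo_path: str) -> str:
    """Build the shell test for the preflight, quoting the repo path."""
    return f"test -d {shlex.quote(repo_path)}/.git"


def preflight_ok(repo_path: str) -> bool:
    """Run the check in a local shell (stand-in for the remote SSH call)."""
    result = subprocess.run(["sh", "-c", build_preflight_cmd(repo_path)])
    return result.returncode == 0


with tempfile.TemporaryDirectory() as tmp:
    # A path containing a space shows why the quoting matters.
    good = Path(tmp) / "repo with space"
    (good / ".git").mkdir(parents=True)
    bad = Path(tmp) / "missing"
    good_result = preflight_ok(str(good))
    bad_result = preflight_ok(str(bad))

print(build_preflight_cmd("/home/wooo/ewoooc"))
print(good_result, bad_result)
```

One caveat: `shlex.quote` wraps a path that starts with `~` in single quotes, which stops the shell from expanding it, so this shape suits absolute paths such as `/home/wooo/ewoooc`.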
### 4c. Trigger the AiderHeal pipeline and observe

Push any commit in which Hermes will report a finding (or have the Commander push a commit that will naturally produce one), wait 2 minutes, then check:

```bash
# Check for a new commit signed by AiderHeal
git fetch origin main && git log --pretty='%h | %an | %s' origin/main -5
```

Expected: a new commit whose author is `AiderHeal` or whose message starts with `fix(autoheal):`.
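The 4c check can be automated by parsing the `%h | %an | %s` lines. The following is a minimal sketch; the function name and the sample log lines are invented for illustration and are not real commits from the repo.

```python
from typing import List


def autoheal_commits(log_output: str) -> List[str]:
    """Return short hashes of commits that look like AiderHeal fixes."""
    hashes = []
    for line in log_output.splitlines():
        # Split into at most three fields so a " | " inside the subject survives.
        parts = [part.strip() for part in line.split(" | ", 2)]
        if len(parts) != 3:
            continue
        short_hash, author, subject = parts
        if author == "AiderHeal" or subject.startswith("fix(autoheal):"):
            hashes.append(short_hash)
    return hashes


# Invented sample in the format produced by --pretty='%h | %an | %s'.
sample_log = (
    "a1b2c3d | AiderHeal | fix(autoheal): guard None in services/foo.py\n"
    "e4f5a6b | wooo | docs: update runbook\n"
)
print(autoheal_commits(sample_log))
```

Matching on the subject prefix alone can also catch human commits that use the same prefix, so the author field is the stronger signal.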
### 4d. Check the container log

```bash
ssh ollama@192.168.0.188 'docker logs momo-pro-system --since 10m 2>&1 | grep -E "event=(heal_start|aider_exec|push_ok|preflight_failed|setup_failed)"'
```

Expected: `event=heal_start` → `event=aider_exec` (pauses for 10 to 60 s) → `event=push_ok` appear in sequence, and `event=preflight_failed` should **not** appear.

---
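The event order expected in 4d can also be checked mechanically. This is a minimal sketch that assumes the `event=<name>` token format shown in the grep pattern; the helper names and the sample log text are made up for illustration.

```python
import re
from typing import List

EXPECTED = ["heal_start", "aider_exec", "push_ok"]
FAILURES = {"preflight_failed", "setup_failed"}


def extract_events(log_text: str) -> List[str]:
    """Collect event names in the order they appear in the log."""
    return re.findall(r"event=(\w+)", log_text)


def healthy_run(events: List[str]) -> bool:
    """True if EXPECTED occurs in order and no failure event is present."""
    if FAILURES & set(events):
        return False
    remaining = iter(events)
    # `step in remaining` consumes the iterator, which enforces the ordering.
    return all(step in remaining for step in EXPECTED)


# Invented log excerpts for illustration only.
good_log = (
    "INFO event=heal_start error_type=KeyError file=services/foo.py\n"
    "INFO event=aider_exec\n"
    "INFO event=push_ok\n"
)
bad_log = "ERROR event=preflight_failed path=/home/wooo/ewoooc\n"

print(healthy_run(extract_events(good_log)))
print(healthy_run(extract_events(bad_log)))
```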
## Troubleshooting

| Symptom | Likely cause | What to do |
|---------|--------------|------------|
| `event=preflight_failed` | `~/ewoooc` does not exist on 110 or is not a git repo | Rerun step 2 |
| `event=setup_failed` showing `Permission denied (publickey)` | Gitea deploy key not added, or write access not checked | Check step 3 |
| `event=push_failed` showing `remote: hook declined` | Gitea has main set as a protected branch | Remove main from the protected branches in Gitea (or add the deploy key as an exception) |
| `event=no_diff` although aider clearly saw the problem | The aider model is too weak (qwen2.5-coder:7b is too small) | Change the `AIDER_MODEL` env, for example to `ollama/deepseek-coder-v2:16b`; the model must be available on 110 |
| `event=diff_too_large` keeps occurring | The finding requires more than 50 changed lines | Adjust the `AIDER_MAX_DIFF_LINES` env, although keeping 50 is recommended as the ADR-020 safety net |

---
## Safety guardrails recap (ADR-020)

| L | Mechanism | Where it triggers |
|---|-----------|-------------------|
| L0 | Preflight path check | Start of `aider_heal_executor.py:execute_code_fix` |
| L1 | File whitelist `^(services\|routes\|database)/[a-zA-Z0-9_]+\.py$` | `ALLOWED_FILE_PATTERN` |
| L2 | Refuse to push when the diff exceeds 50 lines | `AIDER_MAX_DIFF_LINES` |
| L3 | At most 5 CODE_FIX runs per hour | `_enforce_rate_limit` |
| L4 | Automatic git revert when the health check fails | `_revert_last_commit` |
| L5 | Telegram notification (success / failure / rollback) | `_notify_telegram` → EventRouter |
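Guardrails L1 and L2 are simple enough to try in isolation. The following is a minimal sketch, not the executor's code: the pattern is copied from the table with the Markdown `\|` escapes removed, `is_allowed` and `changed_lines` are hypothetical helpers, and the numstat sample is made up.

```python
import re

ALLOWED_FILE_PATTERN = re.compile(r"^(services|routes|database)/[a-zA-Z0-9_]+\.py$")
MAX_DIFF_LINES = 50  # value stated in the table for AIDER_MAX_DIFF_LINES


def is_allowed(path: str) -> bool:
    """L1: only top-level .py files under the three listed directories pass."""
    return bool(ALLOWED_FILE_PATTERN.match(path))


def changed_lines(numstat: str) -> int:
    """L2: sum added + deleted columns of `git diff --numstat` output."""
    total = 0
    for line in numstat.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        added, deleted = fields[0], fields[1]
        # Binary files are reported as "-" in both columns; skip them.
        if added.isdigit() and deleted.isdigit():
            total += int(added) + int(deleted)
    return total


# Made-up numstat output: added<TAB>deleted<TAB>path.
sample_numstat = "3\t2\tservices/a.py\n-\t-\tlogo.png\n10\t0\troutes/b.py"

print(is_allowed("services/order_service.py"), is_allowed("services/sub/x.py"))
print(changed_lines(sample_numstat), changed_lines(sample_numstat) > MAX_DIFF_LINES)
```

Because the pattern allows no `/` after the first directory, files in nested packages are rejected, which keeps automated edits confined to a narrow set of files.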
Master switch: setting `CODE_REVIEW_AUTO_FIX_ENABLED=false` (docker-compose env) cuts off the whole chain immediately.

---
## Updates after completion

- [ ] `~/ewoooc` exists on 110 and `git remote -v` shows pushes going over SSH
- [ ] The Gitea deploy key is added with write access checked
- [ ] Step 4b prints `PREFLIGHT_OK`
- [ ] After at least one natural AiderHeal trigger, a `fix(autoheal):` commit appears
- [ ] Tell Claude to delete the "pending observation" section from the memory file `feedback_code_review_autoheal.md` and mark the AiderHeal execution layer as verified too
@@ -218,6 +218,32 @@ def execute_code_fix(
     ctx: Dict[str, Any] = context or {}
     repo = Path(REPO_PATH_110).expanduser()

+    # L0: preflight. Confirm that the repo path on 110 really exists and is a git repo.
+    # Without this check, a later failing `cd $REPO_PATH` is swallowed by the shell `|| true`,
+    # so the whole pipeline runs to completion with 0 pushes: a silent 100% no-op (observed 2026-05-03).
+    rc_pre, _, _ = _ssh_exec(
+        f"test -d {shlex.quote(REPO_PATH_110)}/.git", timeout=10
+    )
+    if rc_pre != 0:
+        msg = (
+            f"[AiderHeal] preflight failed: {REPO_PATH_110} does not exist on host 110 or is not a git repo. "
+            f"Check the AIDER_REPO_PATH env / git clone the repo on 110 (see the ADR-020 SOP)"
+        )
+        logger.error("event=preflight_failed path=%s", REPO_PATH_110)
+        _notify_telegram(
+            f"🚨 <b>AiderHeal preflight failed</b>\n"
+            f"├ Path: <code>{REPO_PATH_110}</code>\n"
+            f"├ Host: <code>{HEAL_SSH_HOST}</code>\n"
+            f"└ Action: follow the ADR-020 SOP to clone the repo on 110 and set up push permission"
+        )
+        return {
+            "success": False,
+            "action": "CODE_FIX",
+            "message": msg,
+            "commit_sha": None,
+            "reverted": False,
+        }
+
     # L1: file whitelist
     if not ALLOWED_FILE_PATTERN.match(target_file):
         reason = f"[AiderHeal] file is not in the whitelist: {target_file}"
@@ -251,11 +277,14 @@ def execute_code_fix(
     logger.info("event=heal_start error_type=%s file=%s", error_type, target_file)

     # ── Step 1: prepare the repo (on 110) ────────────────────────────────────
+    # Note: only `A && B && C && (D || true)` swallows a stash failure while keeping rc when any other step fails.
+    # The earlier version was `A && B && C && D || true`, which the shell groups as
+    # `(A && B && C && D) || true`: a failed cd made the whole chain return rc=0, so line 261 never fired.
     setup_cmds = (
-        f"cd {REPO_PATH_110} && "
+        f"cd {shlex.quote(REPO_PATH_110)} && "
         f"git fetch {GITEA_REMOTE} main 2>&1 && "
         f"git reset --hard {GITEA_REMOTE}/main 2>&1 && "
-        f"git stash 2>&1 || true"
+        f"(git stash 2>&1 || true)"
     )
     rc, out, err = _ssh_exec(setup_cmds, timeout=30)
     if rc != 0:
@@ -279,7 +308,7 @@ def execute_code_fix(
     )

     aider_cmd = (
-        f"cd {REPO_PATH_110} && "
+        f"cd {shlex.quote(REPO_PATH_110)} && "
         f"PATH=/home/wooo/.local/bin:$PATH OLLAMA_API_BASE={OLLAMA_API_BASE} "
         f"aider --model {AIDER_MODEL} "
         f"--yes-always --no-git "
@@ -293,7 +322,7 @@ def execute_code_fix(
     # ── Step 3: diff evaluation (L2 guardrail) ───────────────────────────────
     # Use git diff --numstat to get the meaningful changed-line count (added + deleted)
     numstat_cmd = (
-        f"cd {REPO_PATH_110} && "
+        f"cd {shlex.quote(REPO_PATH_110)} && "
         f"git diff --numstat HEAD 2>&1 | awk '{{added+=$1; deleted+=$2}} END{{print added+deleted}}'"
     )
     rc2, diff_lines_str, _ = _ssh_exec(numstat_cmd, timeout=10)
@@ -314,7 +343,7 @@ def execute_code_fix(
     if diff_lines > MAX_DIFF_LINES:
         # The change is too large: discard it and raise an alert
         _, _, _ = _ssh_exec(
-            f"cd {REPO_PATH_110} && git checkout -- . 2>&1", timeout=10
+            f"cd {shlex.quote(REPO_PATH_110)} && git checkout -- . 2>&1", timeout=10
         )
         msg = (
             f"[AiderHeal] diff exceeds the limit: {diff_lines} > {MAX_DIFF_LINES} lines, "
@@ -342,7 +371,7 @@ def execute_code_fix(
         f"Error: {safe_error[:200]}"
     )
     commit_cmd = (
-        f"cd {REPO_PATH_110} && "
+        f"cd {shlex.quote(REPO_PATH_110)} && "
         f'git add {shlex.quote(target_file)} && '
         f'git commit -m {shlex.quote(fix_msg)} 2>&1 && '
         f"git push {GITEA_REMOTE} main 2>&1"
@@ -398,8 +427,8 @@ def execute_code_fix(
     # ── Step 6: health check failed → automatic revert (L4 guardrail) ────────
     logger.error("event=health_check_failed commit=%s", commit_sha)
     _, revert_out, revert_err = _ssh_exec(
-        f"cd {REPO_PATH_110} && "
-        f"git revert --no-edit {commit_sha} 2>&1 && "
+        f"cd {shlex.quote(REPO_PATH_110)} && "
+        f"git revert --no-edit {shlex.quote(commit_sha)} 2>&1 && "
         f"git push {GITEA_REMOTE} main 2>&1",
         timeout=30,
     )