From a49faf7baaa0d66ff976c8ee573b6d60d7cd810a Mon Sep 17 00:00:00 2001
From: OG T
Date: Sun, 5 Apr 2026 13:09:58 +0800
Subject: [PATCH] docs: ADR-058 Host Auto-Repair SSH whitelist + LOGBOOK update
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Chief architect review result: 72→88/100
Fixed: C1 C2 C3 M3 m1 m2

Co-Authored-By: Claude Sonnet 4.6
---
 docs/LOGBOOK.md                               |  11 +-
 .../ADR-058-host-auto-repair-ssh-whitelist.md | 133 ++++++++++++++++++
 2 files changed, 140 insertions(+), 4 deletions(-)
 create mode 100644 docs/adr/ADR-058-host-auto-repair-ssh-whitelist.md

diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 062f5ac5..e967ad83 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -5,7 +5,7 @@
 
 ---
 
-## 📍 Current status (2026-04-05 Sprint 3 Host Auto-Repair closed loop complete)
+## 📍 Current status (2026-04-05 Sprint 3 + chief architect review fully passed)
 
 | Item | Status | Commit |
 |------|------|--------|
@@ -13,11 +13,14 @@
 | SSH keypair + authorized_keys + K8s Secret | ✅ | 892c5d5 |
 | HostRepairAgent (TDD 12/12 passing) | ✅ | e7d8da8 |
 | ActionType.SSH_COMMAND + auto_repair_service | ✅ | bf4f814 |
-| K8s SSH key mount (/etc/repair-ssh) | ✅ kubectl patch | |
-| 5 Host Repair Playbooks seed | ✅ | b688eee |
+| K8s SSH key mount (/etc/repair-ssh) + persisted in CD | ✅ | 1cc8c27 |
+| 5 Host Repair Playbooks (all approved) | ✅ | b688eee |
 | Sprint 3 E2E acceptance | ✅ | All passed |
+| Chief architect review fixes C1/C2/C3/M3/m1 | ✅ | 4b24ecd |
+| ADR-058 Host Auto-Repair SSH whitelist | ✅ | - |
 
-**Full-system self-healing loop**: Prometheus/SigNoz alert → AWOOOI API → SSH → repair-bot → docker compose up -d
+**Full-system self-healing loop**: Prometheus/SigNoz alert → AWOOOI API → SSH → repair-bot → docker compose up -d
+**Chief architect score**: 72/100 → 88/100 after fixes
 
 ---
 
diff --git a/docs/adr/ADR-058-host-auto-repair-ssh-whitelist.md b/docs/adr/ADR-058-host-auto-repair-ssh-whitelist.md
new file mode 100644
index 00000000..e566fcc0
--- /dev/null
+++ b/docs/adr/ADR-058-host-auto-repair-ssh-whitelist.md
@@ -0,0 +1,133 @@
+# ADR-058: Host Auto-Repair SSH whitelist architecture
+
+**Status**: Accepted
+**Date**: 2026-04-05 (Taipei time zone)
+**Drafted by**: Claude Code (Sprint 3 Host Auto-Repair)
+**Reviewed by**: Chief architect (same day)
+
+---
+
+## Background
+
+After Phase O completed the observability work, the system can detect host-level service failures (SentryDown, HarborDown, GiteaDown, etc.), but it has no way to send repair commands from the K8s API Pod to the host layer. A secure, auditable mechanism is needed so that the AWOOOI API can run restricted repair operations over SSH.
+
+---
+
+## Decisions
+
+### D1: Use an SSH `command=` forced-command whitelist
+
+In the target host's `authorized_keys`, the `command=` option forces every SSH connection made with a specific key to run only the designated whitelist script, regardless of what command the caller sends.
+
+```
+command="/home/wooo/bin/repair-bot-110.sh",no-pty,no-agent-forwarding ssh-ed25519 ...
+```
+
+**Advantages**:
+- Even if the SSH key leaks, it can only run whitelisted operations (least privilege)
+- No agent has to be installed in the API Pod
+- Full logging (repair-bot writes to `~/.repair-bot.log`)
+
+### D2: Command format `repair:<component>`
+
+The whitelist script accepts the `repair:<component>` format; each component maps to a predefined repair action (docker compose up -d or systemctl restart).
+
+The regex is the same on both ends: `^repair:([a-z0-9][a-z0-9-]{0,30})$`
+
+### D3: SSH key stored in a K8s Secret and mounted into the API Pod
+
+```yaml
+volumes:
+  - name: repair-ssh-key
+    secret:
+      secretName: awoooi-repair-ssh-key
+      defaultMode: 0400  # octal 0400 = decimal 256 = r--------
+```
+
+Mount path: `/etc/repair-ssh/id_ed25519`
+
+### D4: HostRepairAgent integrates with the Playbook system through ActionType.SSH_COMMAND
+
+```
+Prometheus/SigNoz alert → webhook → auto_repair_service
+  → match Playbook (symptom_pattern)
+  → _execute_step(ActionType.SSH_COMMAND)
+  → HostRepairAgent.repair(layer, component)
+  → SSH forced command
+  → repair-bot.sh → docker compose up -d
+```
+
+### D5: Layer routing isolation
+
+| Layer | Host | User | Repair type |
+|-------|------|--------|---------|
+| `docker-110` | 192.168.0.110 | wooo | Docker Compose |
+| `docker-188` | 192.168.0.188 | ollama | Docker Compose |
+| `systemd-188` | 192.168.0.188 | ollama | systemd |
+| `k8s` | - | - | kubectl (not routed over SSH; rejected) |
+
+---
+
+## Approved repair whitelist
+
+### Host 110 (repair-bot-110.sh)
+| Component | Repair method | Directory |
+|-----------|---------|------|
+| sentry | docker compose up -d | /opt/sentry |
+| harbor | docker compose up -d | /home/wooo/harbor/harbor |
+| gitea | docker compose up -d | /home/wooo/gitea |
+| gitea-runner | docker compose up -d | /home/wooo/act-runner |
+| langfuse | docker compose up -d | /home/wooo/langfuse |
+| alertmanager | docker compose up -d | /home/wooo/monitoring |
+| signoz | docker compose up -d | /home/wooo/signoz/deploy/docker |
+
+### Host 188 (repair-bot-188.sh)
+| Component | Repair method |
+|-----------|---------|
+| openclaw | docker compose up -d |
+| minio | docker compose up -d |
+| signoz | docker compose up -d |
+| redis | systemctl restart redis-server |
+| nginx | systemctl restart nginx |
+| ollama | systemctl restart ollama |
+
+---
+
+## Chief architect review record (2026-04-05)
+
+Score: **72/100 → 88/100 after fixes**
+
+Issues fixed:
+- **C1**: `_ssh_execute` now receives key_path directly instead of looking it up in LAYER_SSH_CONFIG
+- **C2**: Added a `PlaybookService.create()` proxy method so the Router no longer calls `_repository` directly
+- **C3**: CD Step 1b uses `sed` to replace IMAGE_TAG_PLACEHOLDER, removing the risk of an interruption
+- **M3**: repair-bot regex unified to `[a-z0-9][a-z0-9-]{0,30}`; underscores are not allowed
+- **m1**: Added a comment explaining that defaultMode is octal
+- **m2**: `_ssh_execute` computes the remaining timeout from a deadline
+
+Issues under observation (known and acceptable):
+- **M1**: Tests use AsyncMock (exemption: `_ssh_execute` runs a real subprocess and cannot be tested directly in a CI environment without SSH; E2E acceptance provides the supplementary coverage)
+- **M2**: `StrictHostKeyChecking=no` (acceptable on the internal network; a known_hosts ConfigMap is planned)
+
+---
+
+## Security notes
+
+1. **SSH key rotation**: if `awoooi-repair-ssh-key` leaks, immediately run:
+   - 110: `sed -i '/repair-bot/d' ~/.ssh/authorized_keys`
+   - 188: `sed -i '/repair-bot/d' ~/.ssh/authorized_keys`
+   - Regenerate the keypair and update the K8s Secret
+
+2. **Repair commands are limited to `docker compose up -d` or `systemctl restart`**; arbitrary commands cannot be executed
+
+3. **Log location**: `~/.repair-bot.log` (each host keeps its own log)
+
+---
+
+## Related files
+
+- `scripts/repair-bot/repair-bot-110.sh`
+- `scripts/repair-bot/repair-bot-188.sh`
+- `apps/api/src/services/host_repair_agent.py`
+- `k8s/awoooi-prod/04-repair-ssh-key-template.yaml`
+- `ops/monitoring/alerts-unified.yml`
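Reviewer note (not part of the patch): the command format from D2 and the layer table from D5 can be illustrated with a minimal Python sketch of the caller side. The regex, hosts, users, key path and `StrictHostKeyChecking=no` come from the ADR; the function names, the `LAYER_TARGETS` dict and the `BatchMode=yes` option are assumptions made for illustration. The real `host_repair_agent.py` is not included in this patch and may be structured differently.

```python
import re
from typing import List, NamedTuple, Optional

# Regex taken from D2 / review item M3: lowercase letters, digits and
# hyphens only, 1 to 31 characters, no leading hyphen, no underscore.
REPAIR_COMMAND_RE = re.compile(r"^repair:([a-z0-9][a-z0-9-]{0,30})$")

# Mount path from D3.
KEY_PATH = "/etc/repair-ssh/id_ed25519"


class SshTarget(NamedTuple):
    host: str
    user: str


# Layer table from D5. The dict itself is hypothetical. "k8s" is left out
# on purpose: that layer is repaired with kubectl and must not use SSH.
LAYER_TARGETS = {
    "docker-110": SshTarget("192.168.0.110", "wooo"),
    "docker-188": SshTarget("192.168.0.188", "ollama"),
    "systemd-188": SshTarget("192.168.0.188", "ollama"),
}


def parse_repair_command(command: str) -> Optional[str]:
    """Return the component name, or None if the command is malformed."""
    # fullmatch also rejects a trailing newline, which "$" alone would accept.
    match = REPAIR_COMMAND_RE.fullmatch(command)
    return match.group(1) if match else None


def resolve_target(layer: str) -> SshTarget:
    """Map a layer to its SSH target; unknown layers and "k8s" are rejected."""
    try:
        return LAYER_TARGETS[layer]
    except KeyError:
        raise ValueError(f"layer {layer!r} is not routable over SSH") from None


def build_ssh_argv(layer: str, component: str) -> List[str]:
    """Build the ssh argument vector for one repair request."""
    target = resolve_target(layer)
    command = f"repair:{component}"
    if parse_repair_command(command) is None:
        raise ValueError(f"invalid component name: {component!r}")
    return [
        "ssh",
        "-i", KEY_PATH,
        "-o", "BatchMode=yes",  # assumption: fail instead of prompting
        "-o", "StrictHostKeyChecking=no",  # known trade-off, review item M2
        f"{target.user}@{target.host}",
        command,
    ]


if __name__ == "__main__":
    print(build_ssh_argv("docker-110", "sentry"))
```

Client-side validation here is defense in depth only. Per D1, the forced command on the host decides what actually runs; OpenSSH passes the client-supplied string to a forced command in the `SSH_ORIGINAL_COMMAND` environment variable, so the host-side script would presumably apply the same regex to that value.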