docs: ADR-058 Host Auto-Repair SSH whitelist + LOGBOOK update

Chief architect review result: 72→88/100
Fixed: C1 C2 C3 M3 m1 m2

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -5,7 +5,7 @@

---

-## 📍 Current status (2026-04-05 Sprint 3 Host Auto-Repair closed loop complete)
+## 📍 Current status (2026-04-05 Sprint 3 + chief architect review fully passed)

| Item | Status | Commit |
|------|------|--------|
@@ -13,11 +13,14 @@
| SSH keypair + authorized_keys + K8s Secret | ✅ | 892c5d5 |
| HostRepairAgent (TDD 12/12 passing) | ✅ | e7d8da8 |
| ActionType.SSH_COMMAND + auto_repair_service | ✅ | bf4f814 |
-| K8s SSH key mount (/etc/repair-ssh) | ✅ kubectl patch | |
-| 5 Host Repair Playbooks seed | ✅ | b688eee |
+| K8s SSH key mount (/etc/repair-ssh) + CD persistence | ✅ | 1cc8c27 |
+| 5 Host Repair Playbooks (all approved) | ✅ | b688eee |
| Sprint 3 E2E acceptance | ✅ | All passed |
+| Chief architect review fixes C1/C2/C3/M3/m1 | ✅ | 4b24ecd |
+| ADR-058 Host Auto-Repair SSH whitelist | ✅ | n/a |

-**System-wide self-healing loop**: Prometheus/SigNoz alert → AWOOOI API → SSH → repair-bot → docker compose up -d
+**System-wide self-healing loop**: Prometheus/SigNoz alert → AWOOOI API → SSH → repair-bot → docker compose up -d
+**Chief architect score**: 72/100 → 88/100 after fixes

---

docs/adr/ADR-058-host-auto-repair-ssh-whitelist.md (normal file, 133 lines)
@@ -0,0 +1,133 @@
# ADR-058: Host Auto-Repair SSH Whitelist Architecture

**Status**: Accepted
**Date**: 2026-04-05 (Taipei time)
**Drafted by**: Claude Code (Sprint 3 Host Auto-Repair)
**Reviewed by**: Chief architect (same day)

---

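## Background

With the Phase O observability work complete, the system can detect host-level service failures (SentryDown, HarborDown, GiteaDown, and others), but it has no way to send repair commands from the K8s API Pod to the host layer. A secure, auditable mechanism is needed so that the AWOOOI API can run restricted repair operations over SSH.

---
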
## Decision

### D1: Use an SSH `command=` forced-command whitelist

In the target host's `authorized_keys`, the `command=` option forces every SSH connection made with a given key to run only the designated whitelist script, regardless of the command the caller sends.

```
command="/home/wooo/bin/repair-bot-110.sh",no-pty,no-agent-forwarding ssh-ed25519 ...
```

**Advantages**:
- Even if the SSH key leaks, it can only trigger whitelisted operations (least privilege)
- No agent needs to be installed in the API Pod
- Full logging (repair-bot writes to `~/.repair-bot.log`)

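
With a forced command, sshd runs the configured script and exposes whatever the client asked for in the `SSH_ORIGINAL_COMMAND` environment variable, so the script alone decides what to do with it. The decision can be sketched in Python as follows. This is only an illustration: the real handler is the shell script `repair-bot-110.sh`, and `ALLOWED` and `decide` are hypothetical names.

```python
# Hypothetical allowlist for illustration; not the real script's contents.
ALLOWED = {"repair:sentry", "repair:harbor", "repair:gitea"}


def decide(environ):
    """Return a decision string for the command the SSH client requested.

    When `command=` is set, sshd does not run the client's command; it
    passes it to the forced script via SSH_ORIGINAL_COMMAND instead.
    """
    requested = environ.get("SSH_ORIGINAL_COMMAND", "")
    if not requested:
        return "reject: no command supplied"
    if requested not in ALLOWED:
        return "reject: not on the whitelist"
    return f"accept: {requested}"
```

Calling `decide({"SSH_ORIGINAL_COMMAND": "rm -rf /"})` yields a rejection, which is the property the ADR relies on: a leaked key cannot widen what the script is willing to run.
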
### D2: Command format `repair:<component>`

The whitelist script accepts commands of the form `repair:<component>`, where each component maps to a predefined repair action (`docker compose up -d` or `systemctl restart`).

The same regex is used on both ends: `^repair:([a-z0-9][a-z0-9-]{0,30})$`

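
A minimal sketch of how the API side might apply this pattern in Python. `parse_component` is a hypothetical helper name, and the use of `fullmatch` is a choice made here, not something the ADR specifies: in Python a bare `$` also matches just before a trailing newline, and `fullmatch` closes that gap.

```python
import re

# Pattern copied from the ADR. fullmatch also rejects a trailing newline,
# which `$` combined with re.match would let through.
REPAIR_RE = re.compile(r"^repair:([a-z0-9][a-z0-9-]{0,30})$")


def parse_component(command):
    """Return the component name, or None if the command is malformed."""
    match = REPAIR_RE.fullmatch(command)
    return match.group(1) if match else None
```

The component must start with a lowercase letter or digit and may be at most 31 characters long, so inputs such as `repair:my_service` or `repair:sentry; rm -rf /` are rejected before any SSH connection is attempted.
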
### D3: Store the SSH key in a K8s Secret and mount it into the API Pod

```yaml
volumes:
  - name: repair-ssh-key
    secret:
      secretName: awoooi-repair-ssh-key
      defaultMode: 0400  # octal 0400 = decimal 256 = r--------
```

Mount path: `/etc/repair-ssh/id_ed25519`

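
The comment on `defaultMode` matters because a YAML manifest can use octal notation while JSON has no octal literals, so the same setting in a JSON manifest would be written as 256. The arithmetic can be checked with a short Python sketch; `describe_mode` is a hypothetical helper.

```python
import stat


def describe_mode(mode):
    """Render permission bits the way `ls -l` does, minus the file-type char."""
    return stat.filemode(stat.S_IFREG | mode)[1:]


print(0o400)               # decimal value of octal 0400
print(describe_mode(256))  # permission string for that value
```
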
### D4: Integrate HostRepairAgent into the Playbook system via ActionType.SSH_COMMAND

```
Prometheus/SigNoz alert → webhook → auto_repair_service
  → match Playbook (symptom_pattern)
  → _execute_step(ActionType.SSH_COMMAND)
  → HostRepairAgent.repair(layer, component)
  → SSH forced command
  → repair-bot.sh → docker compose up -d
```

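
The dispatch in the middle of this flow can be sketched as below. Only the names `ActionType.SSH_COMMAND` and `repair(layer, component)` come from the ADR; the `Step` dataclass, the enum value, `RecordingAgent`, and the synchronous style are assumptions for illustration (the real service is probably asynchronous).

```python
from dataclasses import dataclass
from enum import Enum


class ActionType(Enum):
    SSH_COMMAND = "ssh_command"  # assumed value; only the member name is in the ADR


@dataclass
class Step:
    """Hypothetical playbook step carrying the SSH repair target."""
    action: ActionType
    layer: str
    component: str


class RecordingAgent:
    """Stand-in for HostRepairAgent that records calls instead of using SSH."""

    def __init__(self):
        self.calls = []

    def repair(self, layer, component):
        self.calls.append((layer, component))
        return True


def execute_step(step, agent):
    """Dispatch one step; only SSH_COMMAND is handled in this sketch."""
    if step.action is ActionType.SSH_COMMAND:
        return agent.repair(step.layer, step.component)
    raise ValueError(f"unsupported action: {step.action}")


agent = RecordingAgent()
ok = execute_step(Step(ActionType.SSH_COMMAND, "docker-110", "sentry"), agent)
```

Injecting the agent keeps the dispatch testable without an SSH server, which is the same reason the review accepts mocked tests under M1.
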
### D5: Route isolation by layer

| Layer | Host | User | Repair type |
|-------|------|--------|---------|
| `docker-110` | 192.168.0.110 | wooo | Docker Compose |
| `docker-188` | 192.168.0.188 | ollama | Docker Compose |
| `systemd-188` | 192.168.0.188 | ollama | systemd |
| `k8s` | n/a | n/a | kubectl (not routed over SSH; rejected) |

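
The routing table might be expressed in Python roughly as follows. The hosts and users are taken from the table; the name `LAYER_SSH_CONFIG` appears in the review notes, but its shape here, along with `SshTarget` and `resolve_layer`, is an assumption.

```python
from typing import NamedTuple


class SshTarget(NamedTuple):
    host: str
    user: str


# Values from the routing table; the dict shape is an assumption.
LAYER_SSH_CONFIG = {
    "docker-110": SshTarget("192.168.0.110", "wooo"),
    "docker-188": SshTarget("192.168.0.188", "ollama"),
    "systemd-188": SshTarget("192.168.0.188", "ollama"),
}


def resolve_layer(layer):
    """Return the SSH target for a layer, refusing anything not SSH-routed."""
    if layer == "k8s":
        raise ValueError("k8s repairs go through kubectl, not SSH")
    if layer not in LAYER_SSH_CONFIG:
        raise ValueError(f"unknown layer: {layer!r}")
    return LAYER_SSH_CONFIG[layer]
```
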
---

## Approved repair whitelist

### Host 110 (repair-bot-110.sh)
| Component | Repair method | Directory |
|-----------|---------|------|
| sentry | docker compose up -d | /opt/sentry |
| harbor | docker compose up -d | /home/wooo/harbor/harbor |
| gitea | docker compose up -d | /home/wooo/gitea |
| gitea-runner | docker compose up -d | /home/wooo/act-runner |
| langfuse | docker compose up -d | /home/wooo/langfuse |
| alertmanager | docker compose up -d | /home/wooo/monitoring |
| signoz | docker compose up -d | /home/wooo/signoz/deploy/docker |

### Host 188 (repair-bot-188.sh)
| Component | Repair method |
|-----------|---------|
| openclaw | docker compose up -d |
| minio | docker compose up -d |
| signoz | docker compose up -d |
| redis | systemctl restart redis-server |
| nginx | systemctl restart nginx |
| ollama | systemctl restart ollama |

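
As an illustration of how a component name resolves to one fixed command, the whitelist can be modelled as plain lookup tables. This is a Python sketch of the idea, not the contents of the shell scripts; `plan_110` and `plan_188_systemd` are hypothetical names, and the Docker Compose components on host 188 are left out because the ADR lists no directories for them.

```python
COMPOSE_UP = ["docker", "compose", "up", "-d"]

# Host 110: component -> compose directory, as listed in the whitelist table.
HOST_110 = {
    "sentry": "/opt/sentry",
    "harbor": "/home/wooo/harbor/harbor",
    "gitea": "/home/wooo/gitea",
    "gitea-runner": "/home/wooo/act-runner",
    "langfuse": "/home/wooo/langfuse",
    "alertmanager": "/home/wooo/monitoring",
    "signoz": "/home/wooo/signoz/deploy/docker",
}

# Host 188: component -> systemd unit, as listed in the whitelist table.
HOST_188_UNITS = {"redis": "redis-server", "nginx": "nginx", "ollama": "ollama"}


def plan_110(component):
    """Return (argv, working directory) for a whitelisted host-110 component."""
    if component not in HOST_110:
        raise ValueError(f"not whitelisted on host 110: {component!r}")
    return list(COMPOSE_UP), HOST_110[component]


def plan_188_systemd(component):
    """Return the argv for a whitelisted systemd restart on host 188."""
    if component not in HOST_188_UNITS:
        raise ValueError(f"not a whitelisted systemd unit: {component!r}")
    return ["systemctl", "restart", HOST_188_UNITS[component]]
```
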
---

## Chief architect review record (2026-04-05)

Score: **72/100 → 88/100 after fixes**

Issues fixed:
- **C1**: `_ssh_execute` now receives `key_path` directly instead of looking it up in LAYER_SSH_CONFIG
- **C2**: Added a `PlaybookService.create()` proxy method so the Router no longer calls `_repository` directly
- **C3**: CD Step 1b uses `sed` to replace IMAGE_TAG_PLACEHOLDER, removing the risk of interruption
- **M3**: repair-bot regex unified as `[a-z0-9][a-z0-9-]{0,30}`; underscores are not allowed
- **m1**: Added a comment on defaultMode explaining the octal value
- **m2**: `_ssh_execute` computes the remaining timeout from a deadline

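
The deadline approach in m2 can be sketched as follows: fix one absolute deadline up front, then give each blocking call only the time that is still left, so several calls cannot add up to more than the overall timeout. `remaining` and `run_steps` are hypothetical names; in real code the budget would typically be passed as the `timeout` argument of `subprocess.run` or `asyncio.wait_for`.

```python
import time


def remaining(deadline, now=None):
    """Seconds left until `deadline` (a time.monotonic() value), floored at 0."""
    current = time.monotonic() if now is None else now
    return max(0.0, deadline - current)


def run_steps(steps, total_timeout, clock=time.monotonic):
    """Run callables in order, handing each only the time still available."""
    deadline = clock() + total_timeout
    results = []
    for step in steps:
        budget = remaining(deadline, clock())
        if budget <= 0:
            raise TimeoutError("deadline exceeded before step started")
        results.append(step(budget))
    return results
```
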
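Issues under observation (known and accepted):
- **M1**: Tests use AsyncMock (exemption: `_ssh_execute` spawns a real subprocess and cannot be tested directly in a CI environment without SSH; E2E acceptance provides the supplementary coverage)
- **M2**: `StrictHostKeyChecking=no` (acceptable on the internal network; a known_hosts ConfigMap is to be added later)
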
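
To make M2 concrete, here is a sketch of how the SSH command line could change once a known_hosts file is mounted. `build_ssh_argv` and its parameters are hypothetical; `-i`, `StrictHostKeyChecking`, and `UserKnownHostsFile` are standard OpenSSH client options, but the actual arguments used by `_ssh_execute` are not shown in the ADR.

```python
def build_ssh_argv(user, host, key_path, command, known_hosts=None):
    """Build an ssh argv; verify host keys only when known_hosts is given."""
    argv = ["ssh", "-i", key_path]
    if known_hosts is None:
        # Current behaviour per M2: host keys are not verified.
        argv += ["-o", "StrictHostKeyChecking=no"]
    else:
        # Possible future behaviour with a mounted known_hosts file.
        argv += [
            "-o", "StrictHostKeyChecking=yes",
            "-o", f"UserKnownHostsFile={known_hosts}",
        ]
    argv += [f"{user}@{host}", command]
    return argv
```
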
---

## Security notes

1. **SSH key rotation**: if `awoooi-repair-ssh-key` leaks, immediately run:
   - 110: `sed -i '/repair-bot/d' ~/.ssh/authorized_keys`
   - 188: `sed -i '/repair-bot/d' ~/.ssh/authorized_keys`
   - Regenerate the keypair and update the K8s Secret

2. **Repair commands are limited to `docker compose up -d` or `systemctl restart`**; arbitrary commands cannot be executed

3. **Log location**: `~/.repair-bot.log` (each host keeps its own log)

---

## Related files

- `scripts/repair-bot/repair-bot-110.sh`
- `scripts/repair-bot/repair-bot-188.sh`
- `apps/api/src/services/host_repair_agent.py`
- `k8s/awoooi-prod/04-repair-ssh-key-template.yaml`
- `ops/monitoring/alerts-unified.yml`