docs(skills+adr): update auto-repair full-chain knowledge (ADR-058 Appendix A + Skills v2.5)
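ADR-058: completed the 188 whitelist + Appendix A (12 bug-fix records + E2E verification + Playbook coverage matrix)
Skill-04 DevOps v2.5: SSH auto-repair architecture section (whitelist/SOP/pitfalls)
Skill-05 SRE: auto-repair E2E acceptance spec + diagnosis table

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>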
@@ -35,6 +35,7 @@
| v2.2 | 2026-03-31 | Claude Code | **📊 K3s optimization results (alerts -100%, Pod restarts -100%, stable for 48h+)** |
| v2.3 | 2026-03-31 | Claude Code | **📅 Phase 21 periodic report mechanism planning (Weekly/Daily E2E/K3s Report)** |
| v2.4 | 2026-03-31 | Claude Code | **🔧 OTEL gRPC vs HTTP endpoint distinction (K8s:24317, CI/CD:24318)** |
| v2.5 | 2026-04-09 | Claude Sonnet 4.6 | **🔴 SSH auto-repair full chain: dual-host E2E closed loop + 12 bug fixes** |

---

@@ -1197,3 +1198,81 @@ links = DeepLinking.get_all_links(
- `memory/project_phase15_langfuse.md`: **📊 Phase 15 fully completed**
- `memory/project_phase17_tech_debt.md`: **🔧 Phase 17 technical debt**
- `src/core/deep_linking.py`: **👁️ Deep Linking URL generator**
- `docs/adr/ADR-058-host-auto-repair-ssh-whitelist.md`: **🔴 SSH auto-repair architecture + bug-fix records**
- `ops/config/service-registry.yaml`: **Service tier list (BLOCK/CRITICAL_HITL/STANDARD_HITL/AUTO)**

---

## 🔴 SSH Auto-Repair Architecture (Sprint 3 + 2026-04-09 Bug Fixes)

> **ADR**: ADR-058 (approved; Appendix A records the bug fixes)
> **Status**: ✅ Dual-host E2E verification passed

### Key Infrastructure Requirements

| Item | Setting | Notes |
|------|---------|-------|
| Dockerfile | `openssh-client` | Must be installed in the production stage; otherwise the `ssh` binary does not exist |
| K8s Pod securityContext | `fsGroup: 1000` | Gives appuser group read on the 0400 Secret |
| NetworkPolicy egress | port 22 → 110 + 188 | Denied by default; must be opened explicitly |
| Secret defaultMode | `0400` (octal) | SSH requires owner-only; group read comes from fsGroup |
| known_hosts Secret | `awoooi-repair-known-hosts` | `optional: true`; contains hashed fingerprints for 110 + 188 |

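As a quick sanity check on the mode arithmetic in the table, the sketch below uses only Python's standard library to show what `0400` means and what the key file should look like once `fsGroup` adds group read. The `0440` effective mode is an assumption inferred from the `-r--r-----` listing this document expects in its E2E checklist; it was not read from the cluster. The helper names `describe` and `group_can_read` are illustrative only.

```python
import stat

# Secret defaultMode as written in YAML (octal) and its decimal equivalent,
# which is what a JSON manifest would need.
OWNER_ONLY = 0o400
assert OWNER_ONLY == 256

def describe(mode: int) -> str:
    """Render permission bits the way `ls -l` shows a regular file."""
    return stat.filemode(stat.S_IFREG | mode)

def group_can_read(mode: int) -> bool:
    """True when the group-read bit is set."""
    return bool(mode & stat.S_IRGRP)

print(describe(0o400))  # owner-only, what defaultMode requests
print(describe(0o440))  # assumed effective mode once fsGroup applies
```

If `group_can_read` is false for the mode you observe in the Pod, appuser (UID 1000) cannot load the key, which matches the fsGroup requirement in the table.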
### repair-bot Whitelist (Current Complete List)

**Host 110 (wooo@192.168.0.110)**

| Component | Directory |
|-----------|-----------|
| sentry | /opt/sentry |
| harbor | /home/wooo/harbor/harbor |
| gitea | /home/wooo/gitea |
| gitea-runner | /home/wooo/act-runner |
| langfuse | /home/wooo/langfuse |
| alertmanager | /home/wooo/monitoring |
| signoz | /home/wooo/signoz/deploy/docker |
| stock-platform | /home/wooo/stockPlatform |

**Host 188 (ollama@192.168.0.188)**

| Component | Directory |
|-----------|-----------|
| openclaw | /home/ollama/clawbot-v5 |
| minio | /home/ollama/minio |
| signoz | /home/ollama/signoz/deploy/docker |
| momo-app | /home/ollama/momo-pro |
| tsenyang-website | /home/ollama/services/tsenyang |
| bitan-app | /home/ollama/services/bitan |

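To make the whitelist semantics concrete, here is a minimal Python model of the dispatch a forced-command script performs on `SSH_ORIGINAL_COMMAND`. The real `repair-bot-110.sh` is a shell script whose source is not shown here, so the function name `dispatch`, the `REPAIR_DENIED` reply, and the shortened whitelist are assumptions for illustration. Only `health`, `repair:<component>`, `REPAIR_BOT_HEALTHY:110`, `REPAIR_OK:<component>`, and the directories come from this document.

```python
from typing import Optional, Tuple

# Subset of the host 110 whitelist listed in this section.
WHITELIST_110 = {
    "sentry": "/opt/sentry",
    "harbor": "/home/wooo/harbor/harbor",
    "gitea": "/home/wooo/gitea",
}

def dispatch(original_command: str, host: str = "110") -> Tuple[str, Optional[str]]:
    """Return (reply, compose_dir); compose_dir is None unless a repair is allowed."""
    if original_command == "health":
        return f"REPAIR_BOT_HEALTHY:{host}", None
    verb, _, component = original_command.partition(":")
    if verb == "repair" and component in WHITELIST_110:
        # The real script would run `docker compose up -d` in this directory
        # and reply REPAIR_OK only after that succeeds.
        return f"REPAIR_OK:{component}", WHITELIST_110[component]
    # Hypothetical rejection message: anything off the whitelist is refused.
    return f"REPAIR_DENIED:{original_command}", None
```

The point of the model is that the component name is only ever used as a dictionary key, so a caller cannot smuggle an arbitrary path or command through the SSH channel.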
### SOP: Changing the repair-bot Whitelist

1. Confirm the compose dir exists on the target host
2. SSH to the target host and edit `~/bin/repair-bot-{110|188}.sh` with `sed -i`
3. Verify with `SSH_ORIGINAL_COMMAND=health ~/bin/repair-bot-xxx.sh`
4. Update `ops/config/service-registry.yaml` to match
5. commit + push to gitea

### SOP: Adding a Repair Host

1. Create `~/bin/repair-bot-{host}.sh` on the target host (copy the template)
2. Add `awoooi-repair-ssh-key.pub` to `~/.ssh/authorized_keys` (with a `command=` restriction)
3. Run `ssh-keyscan -H {host_ip}` → update the `awoooi-repair-known-hosts` Secret
4. Add `{host_ip}:22` egress to the NetworkPolicy
5. Add the layer settings to `LAYER_SSH_CONFIG` (`host_repair_agent.py`)
6. Add the service tier to service-registry.yaml

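Step 2 is the one that actually confines the key, so a sketch of the resulting `authorized_keys` line may help. The helper name `authorized_keys_entry`, the absolute script path (expanded from `~/bin/` on the assumption that wooo's home is `/home/wooo`), and the placeholder key text are illustrative. The `command=` and `no-*` options are standard OpenSSH `authorized_keys` options, but which of them the project really sets is not stated in this document.

```python
def authorized_keys_entry(script_path: str, public_key: str) -> str:
    """Build one authorized_keys line that forces every login through script_path."""
    options = [
        # Forced command: whatever the client asked for is exposed to the
        # script as SSH_ORIGINAL_COMMAND instead of being executed.
        f'command="{script_path}"',
        "no-port-forwarding",
        "no-agent-forwarding",
        "no-X11-forwarding",
        "no-pty",
    ]
    return ",".join(options) + " " + public_key.strip()

# Placeholder key material, not a real key.
print(authorized_keys_entry("/home/wooo/bin/repair-bot-110.sh",
                            "ssh-ed25519 AAAA...placeholder awoooi-repair"))
```

With an entry of this shape, `ssh ... wooo@192.168.0.110 repair:sentry` never gets a shell; the script sees `repair:sentry` and decides what to do.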
### Common Pitfalls (Hard-Won Lessons)

```
❌ target_resource uses instance (IP:port) → Jaccard service match is 0
✅ Always prefer labels.component, then fall back to pod, then instance

❌ kubectl apply 06-deployment-api.yaml → IMAGE_TAG_PLACEHOLDER overwrites the real SHA → ImagePullBackOff
✅ Change K8s Deployment config with kubectl patch, not kubectl apply

❌ known_hosts is in hashed format, so grep for the IP returns 0 → looks like nothing was written
✅ Verify with wc -l or a real ssh test; the hashed format is normal

❌ StrictHostKeyChecking=no (old setting)
✅ The known_hosts Secret now exists; use StrictHostKeyChecking=yes
```

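The first pitfall comes down to a label fallback order, which can be sketched as follows. The function name `pick_target_resource` is hypothetical; the real logic lives in `webhooks.py`, which is not shown here. Only the order (component, then pod, then instance) is taken from the list above.

```python
from typing import Mapping

def pick_target_resource(labels: Mapping[str, str]) -> str:
    """Prefer labels.component, then pod, then instance; empty string if none is set."""
    for key in ("component", "pod", "instance"):
        value = labels.get(key, "").strip()
        if value:
            return value
    return ""

labels = {"alertname": "SentryDown", "component": "sentry",
          "instance": "192.168.0.110:9000"}
print(pick_target_resource(labels))
```

Returning the component name rather than `IP:port` is what lets the service comparison against a Playbook's `affected_services` produce a non-zero score.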
@@ -708,6 +708,87 @@ def validate_traditional_chinese(response: str) -> bool:

---

## 🔴 Auto-Repair E2E Acceptance Spec (2026-04-09)

> **Background**: The system had an auto-repair mechanism that had never executed successfully (every success_count was 0); a full audit led to fixing 12 blocking bugs
> **Lesson**: A Playbook match ≠ a successful SSH execution; acceptance must be end to end

### Full Auto-Repair Chain

```
Alertmanager → POST /api/v1/webhooks/alertmanager
  → LLM analysis (Nemotron) + _extract_symptoms()
      → {alert_names, affected_services, keywords}
        ⚠️ affected_services must come from labels.component, never labels.instance (IP:port)
  → playbook_service.get_recommendations(): Jaccard similarity
      → alert_exact_match bypass: when alert_names match exactly, the 0.4 threshold is ignored
  → evaluate_auto_repair(): looks up the service-registry tier
      → BLOCK → alert only; AUTO → execute directly
  → HostRepairAgent.repair(layer, component)
      → SSH: ssh -i /etc/repair-ssh/id_ed25519 wooo@192.168.0.110 repair:sentry
      → repair-bot.sh → docker compose up -d → REPAIR_OK:sentry
```

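The matching step in the chain can be illustrated with a small Jaccard sketch. This is not the code of `playbook_service.get_recommendations()`; how the real service weights alert names, services, and keywords is not described here, so `is_match` below compares services only and treats "exact match" as set equality of `alert_names`. Both choices are assumptions. The 0.4 threshold and the bypass rule come from the chain above.

```python
from typing import Iterable

THRESHOLD = 0.4  # similarity threshold quoted in the chain above

def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def is_match(symptoms: dict, pattern: dict) -> bool:
    """Exact alert-name match bypasses the threshold; otherwise compare services."""
    if set(symptoms["alert_names"]) == set(pattern["alert_names"]):
        return True  # alert_exact_match bypass
    score = jaccard(symptoms["affected_services"], pattern["affected_services"])
    return score >= THRESHOLD

pattern = {"alert_names": ["SentryDown"], "affected_services": ["sentry"]}
# The instance-vs-component pitfall: the two sets share nothing.
print(jaccard(["192.168.0.110:9000"], pattern["affected_services"]))
```

The example shows why an `IP:port` value can never match a component name on service similarity alone: the intersection is empty, so the score is zero regardless of the threshold.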
### E2E Acceptance Checklist

```bash
# Step 1: confirm the ssh binary exists
POD=$(kubectl -n awoooi-prod get pod -l app=awoooi-api -o jsonpath='{.items[0].metadata.name}')
kubectl -n awoooi-prod exec "$POD" -- which ssh  # must print a path

# Step 2: confirm the SSH key is readable
kubectl -n awoooi-prod exec "$POD" -- ls -la /etc/repair-ssh/id_ed25519
# Expected: -r--r----- 1 root appuser ... (fsGroup=1000 in effect)

# Step 3: confirm known_hosts has content
kubectl -n awoooi-prod exec "$POD" -- wc -l /etc/repair-known-hosts/known_hosts
# Expected: 9 (hashed format; grep for the IP returns 0, which is normal)

# Step 4: SSH health check
kubectl -n awoooi-prod exec "$POD" -- sh -c \
  "ssh -i /etc/repair-ssh/id_ed25519 \
   -o UserKnownHostsFile=/etc/repair-known-hosts/known_hosts \
   -o StrictHostKeyChecking=yes -o BatchMode=yes -o ConnectTimeout=10 \
   wooo@192.168.0.110 health"
# Expected: REPAIR_BOT_HEALTHY:110

# Step 5: trigger the webhook (use a new fingerprint to avoid deduplication)
curl -X POST http://192.168.0.120:32334/api/v1/webhooks/alertmanager \
  -H "Content-Type: application/json" \
  -d '{"alerts":[{"labels":{"alertname":"SentryDown","component":"sentry",
       "severity":"critical"},"fingerprint":"e2e-test-001","status":"firing",
       "startsAt":"2026-04-09T00:00:00Z","endsAt":"0001-01-01T00:00:00Z"}]}'

# Step 6: check the logs
kubectl -n awoooi-prod logs -l app=awoooi-api --tail=50 | \
  grep -E "REPAIR_OK|auto_repair_execute_success|auto_repair_approved"
```

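Step 5 depends on a fingerprint the service has not seen before, otherwise the alert is deduplicated and nothing runs. A small helper can generate the payload with a fresh fingerprint on each call. The function name `build_test_alert` and the `e2e-test-<random>` fingerprint scheme are assumptions for illustration; the field names and values mirror the curl example in the checklist.

```python
import json
import uuid

def build_test_alert(alertname: str, component: str) -> dict:
    """Alertmanager-style payload with a fresh fingerprint on every call."""
    return {
        "alerts": [{
            "labels": {"alertname": alertname, "component": component,
                       "severity": "critical"},
            # Random suffix so repeated test runs are not deduplicated.
            "fingerprint": f"e2e-test-{uuid.uuid4().hex[:12]}",
            "status": "firing",
            "startsAt": "2026-04-09T00:00:00Z",
            "endsAt": "0001-01-01T00:00:00Z",
        }]
    }

payload = build_test_alert("SentryDown", "sentry")
print(json.dumps(payload))
```

The printed JSON can be piped into the same `curl` call with `-d @-` in place of the inline body.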
### Playbook symptom_pattern Requirements

```json
{
  "alert_names": ["SentryDown"],          // ← alert_exact_match key; only an exact match triggers the bypass
  "affected_services": ["sentry"],        // ← must equal labels.component, not instance
  "severity_range": ["P1", "P2"],
  "label_patterns": {"component": "sentry"},
  "keywords": ["sentry", "9000"]
}
```

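The two annotated requirements can be checked mechanically before a Playbook is registered. The sketch below is a hypothetical linter, not part of the project: the name `lint_symptom_pattern`, the IPv4 regex, and the wording of the messages are all assumptions. It only encodes the rules stated in this section.

```python
import re
from typing import List

# Matches values such as 192.168.0.110 or 192.168.0.110:9000.
_IP_PORT = re.compile(r"^\d{1,3}(\.\d{1,3}){3}(:\d+)?$")

def lint_symptom_pattern(pattern: dict) -> List[str]:
    """Return human-readable problems; an empty list means the pattern looks sane."""
    problems = []
    if not pattern.get("alert_names"):
        problems.append("alert_names is empty, so alert_exact_match can never fire")
    services = pattern.get("affected_services", [])
    for svc in services:
        if _IP_PORT.match(svc):
            problems.append(f"affected_services entry {svc!r} looks like an instance, not a component")
    component = pattern.get("label_patterns", {}).get("component")
    if component and component not in services:
        problems.append(f"label_patterns.component {component!r} missing from affected_services")
    return problems
```

Running it over every Playbook in a test is one way to catch the instance-versus-component mistake before it reaches production.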
### Diagnosing a Blocked Auto-Repair

| Symptom | Probable cause | Diagnostic |
|---------|----------------|------------|
| `auto_repair_approved` never appears | Jaccard score < 0.4 | Check the `similarity` field in the logs |
| `can_auto_repair: false` | service-registry tier is BLOCK/HITL | Check the `blocked_by` field |
| `ssh: command not found` | Dockerfile is missing openssh-client | Pod exec `which ssh` |
| `Permission denied (publickey)` | known_hosts is missing the host, or the key is not in the target's `authorized_keys` | Pod exec SSH and read the error message |
| `Load key ... Permission denied` | fsGroup is not set | Pod exec `ls -la /etc/repair-ssh/` |
| `Connection refused/timeout` | NetworkPolicy blocks port 22 | Pod exec `ssh -v` and watch the connection attempt |

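The SSH rows of the table can be turned into a first-pass triage helper. This is a sketch, not project code: the name `diagnose`, the substring matching, and the sample stderr strings in the usage are assumptions. The causes are the ones listed in the table above.

```python
# (substring to look for, probable cause) pairs from the diagnosis table.
# Order matters: an unreadable key prints "Load key ..." and is usually
# followed by "Permission denied (publickey)", so "Load key" is checked first.
KNOWN_FAILURES = [
    ("ssh: command not found", "Dockerfile is missing openssh-client"),
    ("Load key", "fsGroup is not set, so the key file is unreadable"),
    ("Permission denied (publickey)", "key rejected; check known_hosts and authorized_keys"),
    ("Connection refused", "NetworkPolicy is blocking port 22"),
    ("timed out", "NetworkPolicy is blocking port 22"),
]

def diagnose(stderr: str) -> str:
    """Map captured ssh stderr to the first matching probable cause."""
    for needle, cause in KNOWN_FAILURES:
        if needle in stderr:
            return cause
    return "unknown; rerun with ssh -v"

print(diagnose("ssh: connect to host 192.168.0.110 port 22: Connection refused"))
```

Anything that falls through to the last branch still needs the manual `ssh -v` step from the table.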
---

## References

- `apps/web/playwright.config.ts`: Playwright configuration
@@ -720,5 +801,6 @@ def validate_traditional_chinese(response: str) -> bool:
- `memory/feedback_runner_zombie_process.md`: **🚨 Runner zombie process fix**
- `docs/adr/ADR-018-llm-testing-strategy.md`: **🧠 LLM testing strategy (Deferred)**
- `docs/adr/ADR-019-system-prompt-management.md`: **📝 Centralized System Prompt management**
- `docs/adr/ADR-058-host-auto-repair-ssh-whitelist.md`: **🔴 SSH auto-repair + bug-fix records**
- `.github/workflows/nightly-llm.yaml`: **🌙 Nightly LLM tests**
- `.github/workflows/daily-e2e-health.yaml`: **🏥 Daily E2E health check**

@@ -80,16 +80,63 @@ Prometheus/SigNoz alerts → webhook → auto_repair_service
| langfuse | docker compose up -d | /home/wooo/langfuse |
| alertmanager | docker compose up -d | /home/wooo/monitoring |
| signoz | docker compose up -d | /home/wooo/signoz/deploy/docker |
| stock-platform | docker compose up -d | /home/wooo/stockPlatform *(added 2026-04-09)* |

### Host 188 (repair-bot-188.sh)
| Component | Repair method |
|-----------|---------------|
| openclaw | docker compose up -d |
| minio | docker compose up -d |
| signoz | docker compose up -d |
| redis | systemctl restart redis-server |
| nginx | systemctl restart nginx |
| ollama | systemctl restart ollama |

| Component | Repair method | Directory |
|-----------|---------------|-----------|
| openclaw | docker compose up -d | /home/ollama/clawbot-v5 |
| minio | docker compose up -d | /home/ollama/minio |
| signoz | docker compose up -d | /home/ollama/signoz/deploy/docker |
| momo-app | docker compose up -d | /home/ollama/momo-pro *(added 2026-04-09)* |
| tsenyang-website | docker compose up -d | /home/ollama/services/tsenyang *(added 2026-04-09)* |
| bitan-app | docker compose up -d | /home/ollama/services/bitan *(added 2026-04-09)* |

---

## Appendix A: Bug-Fix Records (2026-04-09)

> **Executed by**: Claude Sonnet 4.6 (Asia/Taipei)
> **Background**: The auto-repair mechanism had never executed successfully (every success_count was 0); a full audit found the blocking issues below

| Bug | Problem | Fix | Commit |
|-----|---------|-----|--------|
| #5 | `target_resource` used `instance` (IP:port) instead of the `component` label → Jaccard service similarity was 0 | `webhooks.py` now prefers the `component` label | 1fb0c0c |
| #6 | `python:3.11-slim` has no `openssh-client`, so the `ssh` binary was missing | Added `openssh-client` to the Dockerfile production stage | 1fb0c0c |
| #11 | NetworkPolicy did not open port 22 (SSH); every unlisted port is denied by default | Added 110:22 + 188:22 egress to `02-network-policy.yaml` | 07a097c |
| #12 | Secret `defaultMode=0400` (root-only), so appuser (UID 1000) could not read the SSH key | Pod `securityContext.fsGroup: 1000` | 77f2da9 |

### E2E Verification Results

```
# docker-110
SentryDown → Jaccard match sentry-down-repair → SSH wooo@192.168.0.110 repair:sentry
  → REPAIR_OK:sentry (6208ms) ✅

# docker-188
MoWoooWorkDown → Jaccard match momo-app-down-repair → SSH ollama@192.168.0.188 repair:momo-app
  → REPAIR_OK:momo-app (3791ms) ✅
```

### Playbook Coverage Matrix (2026-04-09, 20 Playbooks)

| Alert | Playbook | Layer | success_count |
|-------|----------|-------|---------------|
| SentryDown | sentry-down-repair | docker-110 | 1 ✅ |
| HarborDown | harbor-down-repair | docker-110 | 0 |
| GiteaDown | gitea-down-repair | docker-110 | 0 |
| AlertmanagerDown | alertmanager-down-repair | docker-110 | 0 |
| OpenClawDown | openclaw-down-repair | **docker-188** | 0 |
| MoWoooWorkDown | momo-app-down-repair | docker-188 | 2 ✅ |
| TsenyangWebsiteDown | tsenyang-website-down-repair | docker-188 | 0 |
| BitanWoooWorkDown | bitan-app-down-repair | docker-188 | 0 |
| StockWoooWorkDown | stock-platform-down-repair | docker-110 | 0 |
| SignOzDown | signoz-down-repair | docker-188 | 0 |
| DockerContainerExited | docker-container-exited-repair | dynamic | 0 |
| DockerContainerUnhealthy | docker-container-unhealthy-repair | dynamic | 0 |
| KubePodNotReady | k8s-pod-not-ready-restart | k8s | 0 |

> **Note**: `openclaw-down-repair` originally pointed at `docker-110` by mistake; it was corrected to `docker-188` on 2026-04-09

---