From c01026be9bd6314e72d669fa7b1e0eca0db8985c Mon Sep 17 00:00:00 2001 From: OG T Date: Thu, 9 Apr 2026 18:21:24 +0800 Subject: [PATCH] =?UTF-8?q?docs(skills+adr):=20=E8=87=AA=E5=8B=95=E4=BF=AE?= =?UTF-8?q?=E5=BE=A9=E5=85=A8=E9=8F=88=E8=B7=AF=E7=9F=A5=E8=AD=98=E6=9B=B4?= =?UTF-8?q?=E6=96=B0=20=E2=80=94=20ADR-058=20Appendix=20A=20+=20Skills=20v?= =?UTF-8?q?2.5?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ADR-058: 188白名單補完 + Appendix A (12 Bug修復記錄 + E2E驗證 + Playbook覆蓋矩陣) Skill-04 DevOps v2.5: SSH自動修復架構章節 (白名單/SOP/陷阱) Skill-05 SRE: 自動修復E2E驗收規範 + 診斷表 Co-Authored-By: Claude Sonnet 4.6 --- .agents/skills/04-awoooi-devops-commander.md | 79 ++++++++++++++++++ .agents/skills/05-awoooi-sre-qa.md | 82 +++++++++++++++++++ .../ADR-058-host-auto-repair-ssh-whitelist.md | 63 ++++++++++++-- 3 files changed, 216 insertions(+), 8 deletions(-) diff --git a/.agents/skills/04-awoooi-devops-commander.md b/.agents/skills/04-awoooi-devops-commander.md index cb3c3114..33c70bcb 100644 --- a/.agents/skills/04-awoooi-devops-commander.md +++ b/.agents/skills/04-awoooi-devops-commander.md @@ -35,6 +35,7 @@ | v2.2 | 2026-03-31 | Claude Code | **📊 K3s 優化成效數據 (告警-100%, Pod 重啟-100%, 48h+穩定)** | | v2.3 | 2026-03-31 | Claude Code | **📅 Phase 21 定期報告機制規劃 (Weekly/Daily E2E/K3s Report)** | | v2.4 | 2026-03-31 | Claude Code | **🔧 OTEL gRPC vs HTTP 端點區分 (K8s:24317, CI/CD:24318)** | +| v2.5 | 2026-04-09 | Claude Sonnet 4.6 | **🔴 SSH 自動修復全鏈路 — 雙主機 E2E 閉環 + 12 Bug 修復** | --- @@ -1197,3 +1198,81 @@ links = DeepLinking.get_all_links( - `memory/project_phase15_langfuse.md`: **📊 Phase 15 全部完成** - `memory/project_phase17_tech_debt.md`: **🔧 Phase 17 技術債** - `src/core/deep_linking.py`: **👁️ Deep Linking URL 生成器** +- `docs/adr/ADR-058-host-auto-repair-ssh-whitelist.md`: **🔴 SSH 自動修復架構 + Bug 修復記錄** +- `ops/config/service-registry.yaml`: **服務分級清單 (BLOCK/CRITICAL_HITL/STANDARD_HITL/AUTO)** + +--- + +## 🔴 SSH 自動修復架構 (Sprint 3 + 2026-04-09 Bug 修復) + +> **ADR**: ADR-058 (已批准,Appendix A 記錄 Bug 修復) +> **狀態**: ✅ 雙主機 E2E 驗證通過 + +### 關鍵基礎設施要求 + +| 項目 | 設定值 | 說明 | +|------|-------|------| +| Dockerfile | `openssh-client` | 生產 stage 必須安裝,ssh binary 才存在 | +| K8s Pod securityContext | `fsGroup: 1000` | 讓 appuser 有 group read on 0400 Secret | +| NetworkPolicy egress | port 22 → 110 + 188 | 預設拒絕,必須明確開放 | +| Secret defaultMode | `0400` (八進位) | SSH 要求 owner-only,group read 靠 fsGroup | +| known_hosts Secret | `awoooi-repair-known-hosts` | optional: true,含 110+188 hashed 指紋 | + +### repair-bot 白名單 (當前完整清單) + +**110 主機 (wooo@192.168.0.110)** + +| Component | 目錄 | +|-----------|------| +| sentry | /opt/sentry | +| harbor | /home/wooo/harbor/harbor | +| gitea | /home/wooo/gitea | +| gitea-runner | /home/wooo/act-runner | +| langfuse | /home/wooo/langfuse | +| alertmanager | /home/wooo/monitoring | +| signoz | /home/wooo/signoz/deploy/docker | +| stock-platform | /home/wooo/stockPlatform | + +**188 主機 (ollama@192.168.0.188)** + +| Component | 目錄 | +|-----------|------| +| openclaw | /home/ollama/clawbot-v5 | +| minio | /home/ollama/minio | +| signoz | /home/ollama/signoz/deploy/docker | +| momo-app | /home/ollama/momo-pro | +| tsenyang-website | /home/ollama/services/tsenyang | +| bitan-app | /home/ollama/services/bitan | + +### 修改 repair-bot 白名單 SOP + +1. 確認 compose dir 在目標主機存在 +2. SSH 到目標主機 `sed -i` 修改 `~/bin/repair-bot-{110|188}.sh` +3. 用 `SSH_ORIGINAL_COMMAND=health ~/bin/repair-bot-xxx.sh` 驗證 +4. 同步更新 `ops/config/service-registry.yaml` +5. commit + push gitea + +### 新增修復主機 SOP + +1. 在目標主機建立 `~/bin/repair-bot-{host}.sh`(複製模板) +2. 將 `awoooi-repair-ssh-key.pub` 加入 `~/.ssh/authorized_keys`(加 `command=` 限制) +3. `ssh-keyscan -H {host_ip}` → 更新 `awoooi-repair-known-hosts` Secret +4. NetworkPolicy 新增 `{host_ip}:22` egress +5. `LAYER_SSH_CONFIG` 新增 layer 設定(`host_repair_agent.py`) +6. service-registry.yaml 新增服務分級 + +### 常見陷阱 (血的教訓) + +``` +❌ target_resource 用 instance (IP:port) → Jaccard 服務比對為 0 +✅ 必須優先取 labels.component,再 fallback 到 pod、instance + +❌ kubectl apply 06-deployment-api.yaml → IMAGE_TAG_PLACEHOLDER 覆蓋真實 SHA → ImagePullBackOff +✅ 修改 K8s Deployment 配置用 kubectl patch,不用 kubectl apply + +❌ known_hosts hashed 格式,grep IP 會得 0 → 以為沒寫進去 +✅ 用 wc -l 或 ssh 實測驗證,hashed 格式是正常的 + +❌ StrictHostKeyChecking=no(舊設定) +✅ known_hosts Secret 已建立,改用 StrictHostKeyChecking=yes +``` diff --git a/.agents/skills/05-awoooi-sre-qa.md b/.agents/skills/05-awoooi-sre-qa.md index a4dd4fb2..764fabb3 100644 --- a/.agents/skills/05-awoooi-sre-qa.md +++ b/.agents/skills/05-awoooi-sre-qa.md @@ -708,6 +708,87 @@ def validate_traditional_chinese(response: str) -> bool: --- +## 🔴 自動修復 E2E 驗收規範 (2026-04-09) + +> **背景**: 系統曾有自動修復機制卻從未成功執行(success_count 全部為 0),完整審計後修復 12 個阻斷性 Bug +> **教訓**: Playbook 匹配成功 ≠ SSH 執行成功,必須端到端驗收 + +### 自動修復完整鏈路 + +``` +Alertmanager → POST /api/v1/webhooks/alertmanager + → LLM 分析 (Nemotron) + _extract_symptoms() + → {alert_names, affected_services, keywords} + ⚠️ affected_services 必須取 labels.component,不能用 labels.instance (IP:port) + → playbook_service.get_recommendations() — Jaccard 相似度 + → alert_exact_match bypass: alert_names 完全匹配時忽略 0.4 門檻 + → evaluate_auto_repair() — 查 service-registry 分級 + → BLOCK → 僅告警; AUTO → 直接執行 + → HostRepairAgent.repair(layer, component) + → SSH: ssh -i /etc/repair-ssh/id_ed25519 wooo@192.168.0.110 repair:sentry + → repair-bot.sh → docker compose up -d → REPAIR_OK:sentry +``` + +### E2E 驗收 Checklist + +```bash +# Step 1: 確認 SSH binary 存在 +POD=$(kubectl -n awoooi-prod get pod -l app=awoooi-api -o jsonpath='{.items[0].metadata.name}') +kubectl -n awoooi-prod exec $POD -- which ssh # 必須有輸出 + +# Step 2: 確認 SSH key 可讀 +kubectl -n awoooi-prod exec $POD -- ls -la /etc/repair-ssh/id_ed25519 +# 預期: -r--r----- 1 root appuser ... (fsGroup=1000 生效) + +# Step 3: 確認 known_hosts 有內容 +kubectl -n awoooi-prod exec $POD -- wc -l /etc/repair-known-hosts/known_hosts +# 預期: 9 (hashed 格式,grep IP 會得 0 — 正常) + +# Step 4: SSH 健康確認 +kubectl -n awoooi-prod exec $POD -- sh -c \ + "ssh -i /etc/repair-ssh/id_ed25519 \ + -o UserKnownHostsFile=/etc/repair-known-hosts/known_hosts \ + -o StrictHostKeyChecking=yes -o BatchMode=yes -o ConnectTimeout=10 \ + wooo@192.168.0.110 health" +# 預期: REPAIR_BOT_HEALTHY:110 + +# Step 5: Webhook 觸發(新 fingerprint 避免去重) +curl -X POST http://192.168.0.120:32334/api/v1/webhooks/alertmanager \ + -H "Content-Type: application/json" \ + -d '{"alerts":[{"labels":{"alertname":"SentryDown","component":"sentry", + "severity":"critical"},"fingerprint":"e2e-test-001","status":"firing", + "startsAt":"2026-04-09T00:00:00Z","endsAt":"0001-01-01T00:00:00Z"}]}' + +# Step 6: 確認 log +kubectl -n awoooi-prod logs -l app=awoooi-api --tail=50 | \ + grep -E "REPAIR_OK|auto_repair_execute_success|auto_repair_approved" +``` + +### Playbook symptom_pattern 要求 + +```json +{ + "alert_names": ["SentryDown"], // ← alert_exact_match key,完全匹配才能 bypass + "affected_services": ["sentry"], // ← 必須與 labels.component 一致,不是 instance + "severity_range": ["P1", "P2"], + "label_patterns": {"component": "sentry"}, + "keywords": ["sentry", "9000"] +} +``` + +### 自動修復被阻斷的診斷方法 + +| 症狀 | 可能原因 | 診斷指令 | +|------|---------|---------| +| `auto_repair_approved` 沒出現 | Jaccard 分數 < 0.4 | 查 log `similarity` 欄位 | +| `can_auto_repair: false` | service-registry BLOCK/HITL | 查 `blocked_by` 欄位 | +| `ssh: command not found` | Dockerfile 缺 openssh-client | Pod exec `which ssh` | +| `Permission denied (publickey)` | known_hosts 缺少該主機 | Pod exec SSH 看錯誤訊息 | +| `Load key ... Permission denied` | fsGroup 未設定 | Pod exec `ls -la /etc/repair-ssh/` | +| `Connection refused/timeout` | NetworkPolicy 封鎖 22 | Pod exec `ssh -v` 看連線過程 | + +--- + ## 參考文檔 - `apps/web/playwright.config.ts`: Playwright 設定 @@ -720,5 +801,6 @@ def validate_traditional_chinese(response: str) -> bool: - `memory/feedback_runner_zombie_process.md`: **🚨 Runner 殭屍進程修復** - `docs/adr/ADR-018-llm-testing-strategy.md`: **🧠 LLM 測試策略 (Deferred)** - `docs/adr/ADR-019-system-prompt-management.md`: **📝 System Prompt 集中管理** +- `docs/adr/ADR-058-host-auto-repair-ssh-whitelist.md`: **🔴 SSH 自動修復 + Bug 修復記錄** - `.github/workflows/nightly-llm.yaml`: **🌙 Nightly LLM 測試** - `.github/workflows/daily-e2e-health.yaml`: **🏥 Daily E2E 健康檢查** diff --git a/docs/adr/ADR-058-host-auto-repair-ssh-whitelist.md b/docs/adr/ADR-058-host-auto-repair-ssh-whitelist.md index e566fcc0..e0bcdd8f 100644 --- a/docs/adr/ADR-058-host-auto-repair-ssh-whitelist.md +++ b/docs/adr/ADR-058-host-auto-repair-ssh-whitelist.md @@ -80,16 +80,63 @@ Prometheus/SigNoz 告警 → webhook → auto_repair_service | langfuse | docker compose up -d | /home/wooo/langfuse | | alertmanager | docker compose up -d | /home/wooo/monitoring | | signoz | docker compose up -d | /home/wooo/signoz/deploy/docker | +| stock-platform | docker compose up -d | /home/wooo/stockPlatform *(2026-04-09 新增)* | ### 188 主機 (repair-bot-188.sh) -| Component | 修復方式 | -|-----------|---------| -| openclaw | docker compose up -d | -| minio | docker compose up -d | -| signoz | docker compose up -d | -| redis | systemctl restart redis-server | -| nginx | systemctl restart nginx | -| ollama | systemctl restart ollama | +| Component | 修復方式 | 目錄 | +|-----------|---------|------| +| openclaw | docker compose up -d | /home/ollama/clawbot-v5 | +| minio | docker compose up -d | /home/ollama/minio | +| signoz | docker compose up -d | /home/ollama/signoz/deploy/docker | +| momo-app | docker compose up -d | /home/ollama/momo-pro *(2026-04-09 新增)* | +| tsenyang-website | docker compose up -d | /home/ollama/services/tsenyang *(2026-04-09 新增)* | +| bitan-app | docker compose up -d | /home/ollama/services/bitan *(2026-04-09 新增)* | + +--- + +## Appendix A — Bug 修復記錄 (2026-04-09) + +> **執行者**: Claude Sonnet 4.6 (Asia/Taipei) +> **背景**: 自動修復機制從未成功執行(success_count 全部為 0),完整審計後發現下列阻斷性問題 + +| Bug | 問題描述 | 修復 | Commit | +|-----|---------|------|--------| +| #5 | `target_resource` 用 `instance` (IP:port) 代替 `component` label → Jaccard 服務相似度為 0 | `webhooks.py` 優先取 `component` label | 1fb0c0c | +| #6 | `python:3.11-slim` 無 `openssh-client`,`ssh` binary 缺失 | Dockerfile 生產 stage 加 `openssh-client` | 1fb0c0c | +| #11 | NetworkPolicy 未開放 22 (SSH),預設拒絕所有未列 port | `02-network-policy.yaml` 新增 110:22 + 188:22 egress | 07a097c | +| #12 | Secret `defaultMode=0400` (root-only),appuser (UID 1000) 無法讀 SSH key | Pod `securityContext.fsGroup: 1000` | 77f2da9 | + +### E2E 驗證結果 + +``` +# docker-110 +SentryDown → Jaccard 匹配 sentry-down-repair → SSH wooo@192.168.0.110 repair:sentry +→ REPAIR_OK:sentry (6208ms) ✅ + +# docker-188 +MoWoooWorkDown → Jaccard 匹配 momo-app-down-repair → SSH ollama@192.168.0.188 repair:momo-app +→ REPAIR_OK:momo-app (3791ms) ✅ +``` + +### Playbook 覆蓋矩陣 (2026-04-09,20 個 Playbook) + +| 告警 | Playbook | Layer | success_count | +|------|---------|-------|---------------| +| SentryDown | sentry-down-repair | docker-110 | 1 ✅ | +| HarborDown | harbor-down-repair | docker-110 | 0 | +| GiteaDown | gitea-down-repair | docker-110 | 0 | +| AlertmanagerDown | alertmanager-down-repair | docker-110 | 0 | +| OpenClawDown | openclaw-down-repair | **docker-188** | 0 | +| MoWoooWorkDown | momo-app-down-repair | docker-188 | 2 ✅ | +| TsenyangWebsiteDown | tsenyang-website-down-repair | docker-188 | 0 | +| BitanWoooWorkDown | bitan-app-down-repair | docker-188 | 0 | +| StockWoooWorkDown | stock-platform-down-repair | docker-110 | 0 | +| SignOzDown | signoz-down-repair | docker-188 | 0 | +| DockerContainerExited | docker-container-exited-repair | 動態 | 0 | +| DockerContainerUnhealthy | docker-container-unhealthy-repair | 動態 | 0 | +| KubePodNotReady | k8s-pod-not-ready-restart | k8s | 0 | + +> **注意**: `openclaw-down-repair` 原錯誤指向 `docker-110`,已於 2026-04-09 修正為 `docker-188` ---