diff --git a/docs/superpowers/specs/2026-04-05-observability-autohealing-design.md b/docs/superpowers/specs/2026-04-05-observability-autohealing-design.md
new file mode 100644
index 00000000..da4cbe7d
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-05-observability-autohealing-design.md
@@ -0,0 +1,585 @@
+# AWOOOI Full-System Self-Healing Closed-Loop Design Spec
+
+> **Version**: v1.0
+> **Date**: 2026-04-05 (Taipei time)
+> **Drafted by**: Claude Code (Chief Architect)
+> **Trigger**: after a reboot, services failed silently, monitoring alerts did not fire, and there was no automatic repair capability
+> **Related documents**: REBOOT-RECOVERY-SOP.md v3.0, ADR-030, Phase O
+
+---
+
+## Table of Contents
+
+1. [Root Cause Summary](#root-cause-summary)
+2. [Overall Architecture Goals](#overall-architecture-goals)
+3. [Unified Label Convention (S0)](#s0-unified-label-convention)
+4. [Sprint 1: Fill In Prometheus Rules + Start Sentry](#sprint-1)
+5. [Sprint 2: Log-based Alerting](#sprint-2)
+6. [Sprint 3: Host Auto-Repair Agent](#sprint-3)
+7. [Reboot SOP Integration Updates](#reboot-sop-integration-updates)
+8. [Closed-Loop Acceptance Criteria](#closed-loop-acceptance-criteria)
+9. [Implementation Schedule](#implementation-schedule)
+
+---
+
+## Root Cause Summary
+
+### Four Systemic Gaps
+
+#### Gap 1: Alert rules exist but were never deployed
+
+| Item | Status |
+|------|------|
+| `k8s/monitoring/*.yaml` | ✅ 40+ rules exist in version control (including SentryDown, AlertChain, etc.) |
+| `110:/home/wooo/monitoring/alerts.yml` | ❌ Only 13 legacy rules, never synchronized |
+| Root cause | `k8s/monitoring/` uses the PrometheusRule CRD format (Operator), while 110 runs Prometheus under Docker Compose. The two are never synchronized, and there is no CD mechanism that deploys the rules automatically |
+| Consequence | Rules such as SentryDown, HarborDown, GiteaDown, and AlertChainBroken never took effect; `ClawBotDown` still uses a deprecated name |
+
+#### Gap 2: Logs are collected but never alert
+
+| Item | Status |
+|------|------|
+| OTEL DaemonSet | ✅ Pod logs → SigNoz ClickHouse |
+| SigNoz log alert rules | ❌ None at all; ops/signoz/alerting/rules.yaml only contains trace/metric rules |
+| Root cause | Phase O completed the log collection pipeline but did not create log-based alert rules |
+| Consequence | API ERROR logs and Pod crash logs are all silent and trigger no alerts |
+
+#### Gap 3: Auto-repair only handles the K8s layer
+
+| Item | Status |
+|------|------|
+| auto_repair_service.py | ✅ Can run kubectl (K8s layer) |
+| ActionType | ❌ Only KUBECTL/SCRIPT/MANUAL, no SSH_COMMAND |
+| Host-layer services | ❌ SentryDown/HarborDown → no matching Playbook, no repair capability |
+| Sentry | ✅ Installed at `/opt/sentry/`, but not started automatically after reboot (not added to startup-110.sh) |
+| SSH key | ❌ No SSH key in any K8s Secret |
+| Root cause | ADR-030 designed an SSH layer that was never implemented; Sentry is installed but has no startup management |
+
+#### Gap 4: No unified labeling convention
+
+Alert labels have no `layer` field, so auto_repair_service cannot determine the repair path:
+
+```
+SentryDown alert labels: {severity: warning, team: ops, component: sentry}
+                          ↑
+                          no layer: docker-110
+                          → cannot decide between repairing over SSH to 110
+                            and repairing with kubectl
+```
+
+---
+
+## Overall Architecture Goals
+
+```
+[Layer 1: Collection]
+  K8s Pod Logs    → OTEL DaemonSet   → SigNoz ClickHouse   ✅ (in place)
+  K8s Metrics     → kube-state/node  → Prometheus (110)    ✅ (in place)
+  App Exceptions  → Sentry SDK       → Sentry (110:9000)   ❌ Sentry not running (installed, but not started after reboot)
+  App Traces      → OTEL SDK         → SigNoz              ✅ (in place)
+
+[Layer 2: Detection]
+  Prometheus Rules (110)  ← deploy all 40+ rules              Sprint 1
+  SigNoz Log Alerts       ← ERROR/crash logs trigger alerts   Sprint 2
+  Sentry Error Alerts     ← exception thresholds trigger      Sprint 2
+
+[Layer 3: Alert Routing]
+  Alertmanager → AWOOOI API /webhooks/alertmanager → Telegram   ✅ (fixed)
+  SigNoz       → AWOOOI API /webhooks/signoz       → Telegram   ✅ (in place)
+  Sentry       → AWOOOI API /webhooks/sentry       → Telegram   ✅ (handler in place)
+
+[Layer 4: Auto-Repair]
+  K8s-layer alerts    → auto_repair_service → kubectl            ✅ (in place)
+  Docker-layer alerts → HostRepairAgent → SSH → docker compose   Sprint 3
+  Systemd layer       → HostRepairAgent → SSH → systemctl        Sprint 3
+
+[Layer 5: Closed-Loop Verification]
+  Repair done → service health confirmed → alert resolved → Telegram notification   Sprint 3
+```
+
+---
+
+## S0: Unified Label Convention
+
+**Labels from every alert source (Prometheus/SigNoz/Sentry) must include the following fields:**
+
+```yaml
+labels:
+  # Required
+  severity: critical | warning | info
+  layer: k8s | docker-110 | docker-188 | systemd-188  # ← determines the repair path
+  component: sentry | harbor | gitea | openclaw | api | worker | web | postgres | redis | ollama | nginx | alertmanager | gitea-runner | minio | signoz | langfuse
+  team: ops | backend | ai | platform
+
+  # Optional (helps auto-repair)
+  host: "110" | "188" | "120" | "121"   # affected host
+  source: prometheus | signoz | sentry  # alert source
+  auto_repair: "true" | "false"         # whether auto-repair is allowed
+```
+
+**Repair mechanism per layer:**
+
+| Layer | Example services | Repair mechanism |
+|-------|---------|---------|
+| `k8s` | API/Web/Worker pods | `kubectl rollout restart` |
+| `docker-110` | Sentry/Harbor/Gitea/Alertmanager/Langfuse/Gitea-Runner | SSH 110 → `docker compose up -d` |
+| `docker-188` | OpenClaw/MinIO/SigNoz | SSH 188 → `docker compose up -d` |
+| `systemd-188` | PostgreSQL/Redis/Ollama/Nginx | SSH 188 → `systemctl restart` |
+
+**auto_repair_service routing logic:**
+
+```python
+layer = incident.labels.get("layer")
+if layer == "k8s":
+    ...  # use the existing kubectl path
+elif layer in ("docker-110", "docker-188"):
+    ...  # use the new HostRepairAgent (SSH + docker compose)
+elif layer == "systemd-188":
+    ...  # use the new HostRepairAgent (SSH + systemctl)
+```
+
+---
+
+## Sprint 1
+
+**Goal**: any service outage immediately raises a Telegram alert; no more silent failures
+**Effort**: about 4h
+**Priority**: P0
+
+### S1-0: Complete the Unified Labels (All Existing Rules)
+
+In the new `alerts.yml`, add the `layer` / `host` / `auto_repair` labels to every rule.
+
+### S1-1: Merge All Rules into the New alerts.yml
+
+**Convert the rules in the following files to the Docker Prometheus format** and merge them:
+
+| Source file | Rule count | Notes |
+|---------|-------|------|
+| `k8s/monitoring/k3s-alerts.yaml` | 20+ | SentryDown, SignOzDown, HarborDown, etc. |
+| `k8s/monitoring/alert-chain-monitor.yaml` | 9 | AlertChainBroken, AutoRepair, etc. |
+| `k8s/monitoring/database-alerts.yaml` | 10+ | Detailed PG/Redis rules |
+| `k8s/monitoring/minio-kali-alerts.yaml` | 5 | MinIO/Kali |
+| `k8s/monitoring/k3s-alerts-supplemental.yaml` | 10+ | Supplemental rules |
+| Existing `alerts.yml` | 13 | Keep the valid rules, delete ClawBotDown (old name) |
+
+**Key fixes**:
+- `ClawBotDown` → `OpenClawDown`
+- Add the `layer` label to every rule
+- Confirm that the target IPs of the Blackbox probe rules are correct
+
+**Output**: `ops/monitoring/alerts-unified.yml` (version-controlled)
+
+### S1-2: Deploy to Prometheus on 110
+
+```bash
+# deploy-alerts.sh (new)
+scp ops/monitoring/alerts-unified.yml wooo@192.168.0.110:/home/wooo/monitoring/alerts.yml
+ssh wooo@192.168.0.110 "curl -s -X POST http://localhost:9090/-/reload"
+ssh wooo@192.168.0.110 "curl -s http://localhost:9090/api/v1/rules | python3 -c \"import sys,json; r=json.load(sys.stdin); print('Rules loaded:', sum(len(g['rules']) for g in r['data']['groups']))\""
+```
+
+### S1-3: Automatic CD Sync (No More Manual Deployment)
+
+Add a `deploy-alerts` job to `.gitea/workflows/cd.yaml`:
+
+```yaml
+deploy-alerts:
+  name: "Deploy Prometheus Alert Rules"
+  runs-on: [self-hosted]
+  needs: []  # independent job, does not depend on the API build
+  steps:
+    - name: Deploy alerts.yml to Prometheus
+      run: |
+        scp ops/monitoring/alerts-unified.yml wooo@192.168.0.110:/home/wooo/monitoring/alerts.yml
+        ssh wooo@192.168.0.110 "curl -X POST http://localhost:9090/-/reload"
+```
+
+**Trigger**: when `ops/monitoring/alerts-unified.yml` changes
+
+### S1-4: Start Sentry (110)
+
+**Sentry is already installed at `/opt/sentry/`** (installed 2026-03-24, Phases 1-5 complete).
+Current problem: all containers stay stopped after a reboot, and Sentry was never added to startup-110.sh.
+
+```bash
+# Confirm and start
+ssh wooo@192.168.0.110 "cd /opt/sentry && docker compose up -d"
+# Verify
+curl -s http://192.168.0.110:9000/api/0/projects/
+```
+
+**Add Step 7 to startup-110.sh**:
+```bash
+# STEP 7: Sentry (Error Tracking)
+# 2026-04-05 Claude Code: added to fix Sentry not starting automatically after reboot
+SENTRY_DIR="/opt/sentry"
+if [ -d "$SENTRY_DIR" ]; then
+    cd "$SENTRY_DIR"
+    docker compose up -d 2>&1 | tail -5
+    sleep 20
+    if curl -sf --max-time 15 http://localhost:9000/api/0/projects/ >/dev/null 2>&1; then
+        log "✅ Sentry healthy"
+    else
+        log "⚠️ Sentry not ready yet; wait a bit (Sentry takes about 2-3 minutes to start)"
+    fi
+else
+    log "⚠️ Sentry directory not found: $SENTRY_DIR"
+fi
+```
+
+### S1-5: Verify Sprint 1
+
+```bash
+# Verify the rule count
+ssh wooo@192.168.0.110 "curl -s http://localhost:9090/api/v1/rules | python3 -c \"import sys,json; r=json.load(sys.stdin); print('Rules:', sum(len(g['rules']) for g in r['data']['groups']))\""
+# Expected: > 40 (the old file had only 13)
+
+# Verify the SentryDown rule exists
+ssh wooo@192.168.0.110 "curl -s http://localhost:9090/api/v1/rules | python3 -c \"import sys,json; r=json.load(sys.stdin); names=[rule['name'] for g in r['data']['groups'] for rule in g['rules']]; print('SentryDown' in names, 'AlertChainBroken_Alertmanager' in names)\""
+# Expected: True True
+
+# Verify Sentry is up
+curl -s --max-time 10 http://192.168.0.110:9000/api/0/projects/ | head -5
+```
+
+---
+
+## Sprint 2
+
+**Goal**: ERRORs in the API logs are no longer silent and trigger Telegram alerts
+**Effort**: about 2h
+**Priority**: P1
+
+### S2-1: SigNoz Log Alert Rules
+
+Create the following log-based alert rules in SigNoz, with notifications routed through the `/webhooks/signoz` path:
+
+| Alert name | Trigger condition | Severity |
+|-----------|---------|---------|
+| `APIHighErrorLogRate` | API logs: `level=ERROR` > 10/min, sustained for 5min | warning |
+| `WorkerTaskFailed` | Worker logs: `"task_failed"\|"unhandled exception"` > 5/min | warning |
+| `PodOOMKilled` | K8s logs: `"OOMKilled"\|"OutOfMemory"` | critical |
+| `TelegramPollingFailed` | API logs: `"telegram_polling_error"` > 3/5min | critical |
+| `NemotronAllTimeout` | API logs: `"nemotron_tool_call_timeout"` > 5/5min | warning |
+
+**Implementation**:
+- SigNoz UI → Alerts → New Alert → Logs Based Alert
+- Set the webhook: `http://192.168.0.121:32334/api/v1/webhooks/signoz`
+- Version-control the rule configuration in `ops/signoz/alerting/log-rules.md` (SigNoz has no YAML import, so the rules are recorded as documentation)
+
+**Unified labels** (attached to SigNoz alerts):
+```json
+{
+  "layer": "k8s",
+  "source": "signoz",
+  "component": "api",
+  "team": "backend"
+}
+```
+
+### S2-2: Integrate Sentry with the AWOOOI API
+
+Once Sentry is up and running (S1-4):
+
+1. Create the AWOOOI projects (Python + Next.js)
+2. Configure the Webhook Integration → `http://192.168.0.121:32334/api/v1/webhooks/sentry`
+3. Verify that the `/webhooks/sentry` handler receives events correctly and converts them into Incidents
+
+**Add tags in the API-side Sentry SDK**:
+```python
+# apps/api/src/main.py: add tags after sentry init
+import sentry_sdk
+
+sentry_sdk.set_tag("layer", "k8s")
+sentry_sdk.set_tag("component", "api")
+sentry_sdk.set_tag("host", "k8s-awoooi-prod")
+```
+
+### S2-3: Verify Sprint 2
+
+```bash
+# Trigger a test log alert (produce an ERROR log inside the API pod)
+# Note: a single line stays below the "> 10/min for 5min" threshold, and output from
+# `kubectl exec` goes to the exec session rather than the container log stream, so this
+# step may need to be replaced by errors emitted by the API process itself.
+kubectl exec -n awoooi-prod deploy/awoooi-api -- python3 -c "
+import logging; logging.error('TEST_ERROR: Sprint2 validation')
+"
+# Expected: Telegram receives an APIHighErrorLogRate alert within 5min
+
+# Verify the Sentry webhook
+curl -X POST http://192.168.0.121:32334/api/v1/webhooks/sentry \
+  -H 'Content-Type: application/json' \
+  -d '{"action":"triggered","data":{"issue":{"title":"Test Sentry Event","level":"error"}}}'
+# Expected: {"success":true}
+```
+
+---
+
+## Sprint 3
+
+**Goal**: SentryDown/HarborDown/GiteaDown are repaired automatically once they fire, with no human intervention
+**Effort**: about 5h
+**Priority**: P2
+
+### S3-1: SSH Key Infrastructure
+
+**Generate a dedicated repair key** (on the local Mac):
+```bash
+ssh-keygen -t ed25519 -C "awoooi-repair-bot-2026" -f ~/.ssh/awoooi_repair_bot -N ""
+```
+
+**Configure a restricted authorized_keys entry on 110**:
+```bash
+# Add to /home/wooo/.ssh/authorized_keys (restricted to running the repair script only)
+command="/home/wooo/bin/repair-bot-110.sh",no-port-forwarding,no-X11-forwarding,no-agent-forwarding ssh-ed25519 AAAA... awoooi-repair-bot-2026
+```
+
+**Configure a restricted authorized_keys entry on 188 (ollama user)**:
+```bash
+command="/home/ollama/bin/repair-bot-188.sh",no-port-forwarding,no-X11-forwarding,no-agent-forwarding ssh-ed25519 AAAA... awoooi-repair-bot-2026
+```
+
+**Store it in a K8s Secret**:
+```bash
+# "$HOME" is used because the shell does not expand ~ inside an option value
+kubectl create secret generic awoooi-repair-ssh-key \
+  -n awoooi-prod \
+  --from-file=id_ed25519="$HOME/.ssh/awoooi_repair_bot" \
+  --from-file=id_ed25519.pub="$HOME/.ssh/awoooi_repair_bot.pub"
+```
+
+### S3-2: Repair Script Whitelist
+
+**`/home/wooo/bin/repair-bot-110.sh`** (host 110):
+```bash
+#!/bin/bash
+# Repair-bot whitelist script: only known repair commands are allowed
+# 2026-04-05 Claude Code: Sprint 3 Host Auto-Repair
+
+ALLOWED_COMMANDS=(
+    "sentry:docker compose up -d:/opt/sentry"   # Sentry is installed in /opt/sentry (see S1-4)
+    "harbor:docker compose up -d:/home/wooo/harbor/harbor"
+    "gitea:docker compose up -d:/home/wooo/gitea"
+    "gitea-runner:docker compose up -d:/home/wooo/act-runner"
+    "langfuse:docker compose up -d:/home/wooo/langfuse"
+    "alertmanager:docker compose up -d:/home/wooo/monitoring"
+)
+
+CMD="${SSH_ORIGINAL_COMMAND:-}"
+for entry in "${ALLOWED_COMMANDS[@]}"; do
+    # Entry format: name:command:working-directory
+    name="${entry%%:*}"; rest="${entry#*:}"; command="${rest%%:*}"; dir="${rest#*:}"
+    if [ "$CMD" = "repair:$name" ]; then
+        # Report the repair command's own exit status, not tail's
+        cd "$dir" && $command 2>&1 | tail -5; rc=${PIPESTATUS[0]}
+        if [ "$rc" -eq 0 ]; then echo "REPAIR_OK:$name"; else echo "REPAIR_FAIL:$name"; fi
+        exit "$rc"
+    fi
+done
+
+echo "REPAIR_DENIED:$CMD"
+exit 1
+```
+
+**`/home/ollama/bin/repair-bot-188.sh`** (host 188):
+```bash
+#!/bin/bash
+ALLOWED_COMMANDS=(
+    "openclaw:docker compose up -d:/home/ollama/clawbot-v5"
+    "minio:docker compose up -d:/home/ollama/minio"
+    "signoz:docker compose up -d:/home/ollama/signoz/deploy/docker"
+    "redis:systemctl restart redis-server"
+    "nginx:systemctl restart nginx"
+)
+
+CMD="${SSH_ORIGINAL_COMMAND:-}"
+for entry in "${ALLOWED_COMMANDS[@]}"; do
+    # Entry format: name:command[:working-directory] (systemctl entries have no directory)
+    name="${entry%%:*}"; rest="${entry#*:}"; command="${rest%%:*}"; dir="${rest#*:}"
+    if [ "$CMD" = "repair:$name" ]; then
+        if [[ "$command" == systemctl* ]]; then
+            sudo $command; rc=$?
+        else
+            # Report the repair command's own exit status, not tail's
+            cd "$dir" && $command 2>&1 | tail -5; rc=${PIPESTATUS[0]}
+        fi
+        if [ "$rc" -eq 0 ]; then echo "REPAIR_OK:$name"; else echo "REPAIR_FAIL:$name"; fi
+        exit "$rc"
+    fi
+done
+
+echo "REPAIR_DENIED:$CMD"
+exit 1
+```
+
+### S3-3: HostRepairAgent (New Python Module)
+
+**New file** `apps/api/src/services/host_repair_agent.py`:
+
+```python
+"""
+Host Repair Agent
+=================
+Performs host-layer (Docker/systemd) repair actions over SSH.
+Security design: restricted by command=, so only whitelisted repair commands are allowed.
+
+Layer mapping:
+  docker-110  → SSH wooo@192.168.0.110 repair:<component>
+  docker-188  → SSH ollama@192.168.0.188 repair:<component>
+  systemd-188 → SSH ollama@192.168.0.188 repair:<component>
+"""
+
+LAYER_SSH_CONFIG = {
+    "docker-110": {"user": "wooo", "host": "192.168.0.110"},
+    "docker-188": {"user": "ollama", "host": "192.168.0.188"},
+    "systemd-188": {"user": "ollama", "host": "192.168.0.188"},
+}
+```
+
+**Add an ActionType** (`apps/api/src/models/playbook.py`):
+```python
+from enum import Enum
+
+
+class ActionType(str, Enum):
+    KUBECTL = "kubectl"
+    SCRIPT = "script"
+    MANUAL = "manual"
+    SSH_COMMAND = "ssh_command"  # ← new: host-layer repair
+```
+
+**auto_repair_service.py routing logic**:
+```python
+layer = incident.labels.get("layer", "k8s")
+if layer == "k8s":
+    # existing kubectl path
+    ...
+elif layer in ("docker-110", "docker-188", "systemd-188"):
+    component = incident.labels.get("component")
+    result = await host_repair_agent.repair(layer=layer, component=component)
+```
+
+### S3-4: Create Playbooks
+
+Create the following Playbooks (stored in PostgreSQL, created through the API):
+
+| Playbook | Triggering alert | Layer | Repair command |
+|---------|---------|-------|---------|
+| `sentry-down-repair` | SentryDown | docker-110 | `repair:sentry` |
+| `harbor-down-repair` | HarborDown | docker-110 | `repair:harbor` |
+| `gitea-down-repair` | GiteaDown | docker-110 | `repair:gitea` |
+| `gitea-runner-repair` | GiteaRunnerOffline | docker-110 | `repair:gitea-runner` |
+| `openclaw-down-repair` | OpenClawDown | docker-188 | `repair:openclaw` |
+| `alertmanager-down-repair` | AlertmanagerDown | docker-110 | `repair:alertmanager` |
+
+### S3-5: E2E Closed-Loop Verification
+
+```bash
+# Manually trigger a SentryDown test
+# 1. Stop Sentry (installed in /opt/sentry, see S1-4)
+ssh wooo@192.168.0.110 "cd /opt/sentry && docker compose stop"
+
+# 2. Wait for Prometheus to detect it (the rule uses for: 2m)
+sleep 130
+
+# 3. Verify Telegram received the alert
+# (manual confirmation)
+
+# 4. Verify auto-repair was triggered
+# auto_repair_service should run SSH repair:sentry within 3min
+
+# 5. Verify Sentry recovered
+curl -s http://192.168.0.110:9000/api/0/projects/
+# Expected: Sentry responds normally
+
+# 6. Verify the alert resolved
+# Telegram should receive a "Sentry auto-repaired" notification
+```
+
+---
+
+## Reboot SOP Integration Updates
+
+REBOOT-RECOVERY-SOP.md v4.0 needs the following updates:
+
+### Update 1: Add Step 7 (Sentry) to startup-110.sh
+
+```bash
+# STEP 7: Sentry (Error Tracking)
+log "[7/7] Starting Sentry..."
+SENTRY_DIR="/opt/sentry"
+if [ -d "$SENTRY_DIR" ]; then
+    cd "$SENTRY_DIR"
+    docker compose up -d 2>&1 | tail -5
+    sleep 15
+    if curl -sf --max-time 10 http://localhost:9000/api/0/projects/ >/dev/null 2>&1; then
+        log "✅ Sentry healthy"
+    else
+        log "⚠️ Sentry not ready yet, waiting 30 seconds..."
+        sleep 30
+        curl -sf --max-time 10 http://localhost:9000/api/0/projects/ >/dev/null 2>&1 && log "✅ Sentry ready" || log "❌ Sentry not ready, manual check required"
+    fi
+else
+    log "⚠️ Sentry directory not found: $SENTRY_DIR"
+fi
+```
+
+### Update 2: Add Sentry + Rule Count Checks to the E2E Verification Script
+
+```bash
+# Add to the E2E verification script
+check "110 Sentry" "curl -s -o /dev/null -w '%{http_code}' http://192.168.0.110:9000/api/0/projects/" "200"
+# The command prints the bare count when it is > 40 and "LOW:<n>" otherwise, so the
+# expected pattern accepts digits only ("[^LOW]" would also match "LOW:13").
+check "Prometheus rule count >40" "ssh wooo@192.168.0.110 'curl -s http://localhost:9090/api/v1/rules' | python3 -c \"import sys,json; r=json.load(sys.stdin); n=sum(len(g['rules']) for g in r['data']['groups']); print(n if n > 40 else 'LOW:'+str(n))\"" "^[0-9][0-9]*$"
+```
+
+### Update 3: Add a "Rules Not Deployed" Diagnosis to Alert-Chain Troubleshooting
+
+```
+Silent-alert diagnosis tree (addition):
+  Alertmanager healthy ✅
+  webhook URL correct ✅
+  curl POST from 110 succeeds ✅
+  But a service is down with no alert?
+      ↓
+  Do the Prometheus rules include that service?
+  curl http://192.168.0.110:9090/api/v1/rules | grep -i "<service>"
+  ├── Not found → rules not deployed
+  │   cd /path/to/awoooi && bash scripts/deploy-alerts.sh
+  └── Found but inactive → rule condition not met, check expr
+```
+
+### Update 4: Add Sentry to the Host Dependency Graph
+
+```
+192.168.0.110 (DevOps host)
+├── ...
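+└── Sentry  :9000  ← Error Tracking (added in Sprint 1)
+```
+
+---
+
+## Closed-Loop Acceptance Criteria
+
+**Acceptance criteria that define "full-system self-healing closed loop complete"**:
+
+| Acceptance item | Test method | Expected result |
+|---------|---------|---------|
+| More than 40 Prometheus rules | API query | ✅ |
+| SentryDown rule exists | Prometheus API query | ✅ |
+| Sentry service healthy | HTTP 200 | ✅ |
+| CD auto-syncs alerts.yml | git push → verify that the rule count on 110 Prometheus updates | ✅ |
+| SigNoz log alert fires | Inject ERROR logs → Telegram alert | ✅ |
+| Sentry webhook works | Send a test event → Telegram alert | ✅ |
+| SSH repair key configured | Run SSH repair:test from a K8s pod (`test` is not whitelisted, so a `REPAIR_DENIED:repair:test` reply shows that the key and forced command work) | ✅ |
+| SentryDown auto-repair E2E | Stop Sentry → wait 5min → Sentry recovers automatically | ✅ |
+| HarborDown auto-repair E2E | Stop Harbor → wait 5min → Harbor recovers automatically | ✅ |
+
+---
+
+## Implementation Schedule
+
+| Sprint | Scope | Effort | Priority | Depends on |
+|--------|------|------|--------|------|
+| **S0** | Establish the label convention (section S0 of this document) | 0h (design complete) | P0 | - |
+| **S1** | Prometheus rule deployment + Sentry startup + CD sync | 4h | P0 | S0 |
+| **S2** | SigNoz log alerts + Sentry integration | 2h | P1 | S1 |
+| **S3** | SSH repair infrastructure + HostRepairAgent + Playbooks | 5h | P2 | S1 |
+| **SOP** | SOP v4.0 update integration (Sentry + verification script + diagnosis tree) | 1h | P1 | S1 |
+| **Acceptance** | Full-system E2E closed-loop verification | 2h | - | S1+S2+S3 |
+
+**Estimated total effort**: ~14h
+
+---
+
+## Version History
+
+| Version | Date | Notes |
+|------|------|------|
+| v1.0 | 2026-04-05 | Initial version: integrates the three systems of reboot recovery, monitoring/alerting, and auto-repair |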
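
Review note on S3-3: the spec calls `host_repair_agent.repair(layer=layer, component=component)` but only shows the module docstring and `LAYER_SSH_CONFIG`. A minimal sketch of what that coroutine might look like follows. It is an illustration under stated assumptions, not the actual implementation: `RepairResult`, `build_ssh_command`, `parse_result`, the component-name pattern, the timeout values, and `DEFAULT_KEY_PATH` (a guessed mount point for the `awoooi-repair-ssh-key` Secret) are all hypothetical. Only `LAYER_SSH_CONFIG`, the `repair:<component>` request format, and the `REPAIR_OK:<name>` marker come from the spec.

```python
from __future__ import annotations

import asyncio
import re
from dataclasses import dataclass

# Copied from the spec (S3-3).
LAYER_SSH_CONFIG = {
    "docker-110": {"user": "wooo", "host": "192.168.0.110"},
    "docker-188": {"user": "ollama", "host": "192.168.0.188"},
    "systemd-188": {"user": "ollama", "host": "192.168.0.188"},
}

# Assumption: where the awoooi-repair-ssh-key Secret is mounted in the API pod.
DEFAULT_KEY_PATH = "/etc/awoooi/repair-ssh/id_ed25519"

# Assumption: component names are lowercase letters, digits and hyphens,
# which matches every component listed in S0.
_COMPONENT_RE = re.compile(r"[a-z0-9][a-z0-9-]*")


@dataclass
class RepairResult:
    """Hypothetical result type: success flag plus the raw script output."""
    success: bool
    output: str


def build_ssh_command(layer: str, component: str,
                      key_path: str = DEFAULT_KEY_PATH) -> list[str]:
    """Build the ssh argv that sends `repair:<component>` to the host for `layer`."""
    if layer not in LAYER_SSH_CONFIG:
        raise ValueError(f"unsupported layer: {layer!r}")
    if not component or not _COMPONENT_RE.fullmatch(component):
        raise ValueError(f"invalid component: {component!r}")
    target = LAYER_SSH_CONFIG[layer]
    return [
        "ssh", "-i", key_path,
        "-o", "BatchMode=yes",       # never prompt for a password
        "-o", "ConnectTimeout=10",
        f"{target['user']}@{target['host']}",
        f"repair:{component}",       # becomes SSH_ORIGINAL_COMMAND on the host
    ]


def parse_result(component: str, output: str) -> RepairResult:
    """Treat the run as successful only if the script printed REPAIR_OK:<component>."""
    marker = f"REPAIR_OK:{component}"
    ok = any(line.strip() == marker for line in output.splitlines())
    return RepairResult(success=ok, output=output)


async def repair(layer: str, component: str, timeout: float = 120.0) -> RepairResult:
    """Run the whitelisted repair over SSH and report whether it succeeded."""
    argv = build_ssh_command(layer, component)
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )
    try:
        stdout, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
        return RepairResult(success=False, output="timeout")
    return parse_result(component, stdout.decode(errors="replace"))
```

Validating `component` before it reaches the command line is a second guard on top of the `command=` restriction in `authorized_keys`. Keying success on the `REPAIR_OK` marker rather than only on the SSH exit code keeps the agent aligned with what the whitelist scripts in S3-2 print.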