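docs(spec): system-wide self-healing closed-loop design spec v1.0

A complete solution addressing three major problems:
1. Prometheus rules were never deployed (13 rules → 40+, including SentryDown/AlertChain)
2. Logs are collected but there is no log-based alerting
3. Auto-repair covers only the K8s layer, with no repair capability for host Docker/systemd services

Includes:
- Unified label convention (layer/component/team/host)
- Sprint 1: rule deployment + Sentry startup + CD sync
- Sprint 2: SigNoz log alerts + Sentry integration
- Sprint 3: SSH HostRepairAgent + Playbooks
- SOP v4.0 integration updates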

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OG T
2026-04-05 02:14:01 +08:00
parent 8fdd159e6b
commit de33abe0e3


@@ -0,0 +1,585 @@
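# AWOOOI System-Wide Self-Healing Closed-Loop Design Spec
> **Version**: v1.0
> **Date**: 2026-04-05 (Taipei time)
> **Drafted by**: Claude Code (Chief Architect)
> **Trigger**: after a reboot, services failed silently, monitoring alerts did not fire, and there was no auto-repair capability
> **Related documents**: REBOOT-RECOVERY-SOP.md v3.0, ADR-030, Phase O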
---
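## Table of Contents
1. [Root Cause Summary](#root-cause-summary)
2. [Target Architecture](#target-architecture)
3. [Unified Label Convention (S0)](#s0-unified-label-convention)
4. [Sprint 1: Deploy Prometheus Rules + Start Sentry](#sprint-1)
5. [Sprint 2: Log-based Alerting](#sprint-2)
6. [Sprint 3: Host Auto-Repair Agent](#sprint-3)
7. [Reboot SOP Integration Updates](#reboot-sop-integration-updates)
8. [Closed-Loop Acceptance Criteria](#closed-loop-acceptance-criteria)
9. [Implementation Schedule](#implementation-schedule)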
---
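## Root Cause Summary
### Systemic Gaps
#### Gap 1: Alert rules exist but were never deployed
| Item | Status and description |
|------|------|
| `k8s/monitoring/*.yaml` | ✅ 40+ rules exist in version control (including SentryDown, AlertChain, etc.) |
| `110:/home/wooo/monitoring/alerts.yml` | ❌ Only 13 legacy rules; never synced |
| Root cause | `k8s/monitoring/` uses the PrometheusRule CRD format (Operator), while 110 runs Prometheus under Docker Compose. The two are never synced, and there is no CD job to deploy rules automatically |
| Consequence | Rules such as SentryDown, HarborDown, GiteaDown, and AlertChainBroken never took effect; `ClawBotDown` still uses a deprecated name |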
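#### Gap 2: Logs are collected but never alerted on
| Item | Status and description |
|------|------|
| OTEL DaemonSet | ✅ Pod logs → SigNoz ClickHouse |
| SigNoz log alert rules | ❌ None at all; `ops/signoz/alerting/rules.yaml` contains only trace/metric rules |
| Root cause | Phase O completed the log collection pipeline but never created log-based alert rules |
| Consequence | API ERROR logs and Pod crash logs are completely silent and trigger no alerts |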
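#### Gap 3: Auto-repair only handles the K8s layer
| Item | Status and description |
|------|------|
| auto_repair_service.py | ✅ Can run kubectl (K8s layer) |
| ActionType | ❌ Only KUBECTL/SCRIPT/MANUAL; no SSH_COMMAND |
| Host-level services | ❌ SentryDown/HarborDown have no matching Playbook and cannot be repaired |
| Sentry | ✅ Installed at `/opt/sentry/`, but not started automatically after a reboot (not included in startup-110.sh) |
| SSH key | ❌ No SSH key exists in any K8s Secret |
| Root cause | ADR-030 designed an SSH layer that was never implemented; Sentry is installed but has no startup management |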
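#### Gap 4: No unified labeling convention
Alert labels have no `layer` field, so auto_repair_service cannot determine the repair path:
```
SentryDown alert labels: {severity: warning, team: ops, component: sentry}
No layer: docker-110
→ cannot decide between repairing over SSH on 110
  and repairing with kubectl
```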
---
## Target Architecture
```
[Layer 1: Collection]
  K8s Pod Logs    → OTEL DaemonSet  → SigNoz ClickHouse   ✅ (in place)
  K8s Metrics     → kube-state/node → Prometheus (110)    ✅ (in place)
  App Exceptions  → Sentry SDK      → Sentry (110:9000)   ❌ Sentry installed but not running (Sprint 1)
  App Traces      → OTEL SDK        → SigNoz              ✅ (in place)
[Layer 2: Detection]
  Prometheus Rules (110)  ← all 40+ rules deployed             Sprint 1
  SigNoz Log Alerts       ← ERROR/crash logs trigger alerts    Sprint 2
  Sentry Error Alerts     ← exception thresholds trigger       Sprint 2
[Layer 3: Alert Routing]
  Alertmanager → AWOOOI API /webhooks/alertmanager → Telegram  ✅ (fixed)
  SigNoz       → AWOOOI API /webhooks/signoz       → Telegram  ✅ (in place)
  Sentry       → AWOOOI API /webhooks/sentry       → Telegram  ✅ (handler in place)
[Layer 4: Auto-Repair]
  K8s-layer alerts    → auto_repair_service → kubectl              ✅ (in place)
  Docker-layer alerts → HostRepairAgent → SSH → docker compose     Sprint 3
  Systemd-layer       → HostRepairAgent → SSH → systemctl          Sprint 3
[Layer 5: Closed-Loop Verification]
  Repair done → service health confirmed → alert resolved → Telegram notification   Sprint 3
```
---
## S0: Unified Label Convention
**Labels from every alert source (Prometheus/SigNoz/Sentry) must include the following fields:**
```yaml
labels:
  # Required
  severity: critical | warning | info
  layer: k8s | docker-110 | docker-188 | systemd-188   # ← determines the repair path
  component: sentry | harbor | gitea | openclaw | api | worker | web | postgres | redis | ollama | nginx | alertmanager | gitea-runner | minio | signoz | langfuse
  team: ops | backend | ai | platform
  # Optional (helps auto-repair)
  host: "110" | "188" | "120" | "121"    # affected host
  source: prometheus | signoz | sentry   # alert source
  auto_repair: "true" | "false"          # whether auto-repair is allowed
```
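The convention above could be enforced with a small check before an alert is routed. The following is a minimal sketch, assuming labels arrive as a flat string mapping; `validate_labels` is a hypothetical helper name, not an existing function in the codebase, and the `component` list is deliberately left unchecked here.

```python
from typing import Mapping

# Allowed values copied from the label convention above.
SEVERITIES = {"critical", "warning", "info"}
LAYERS = {"k8s", "docker-110", "docker-188", "systemd-188"}
TEAMS = {"ops", "backend", "ai", "platform"}
REQUIRED = ("severity", "layer", "component", "team")


def validate_labels(labels: Mapping[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the labels conform."""
    problems = [f"missing required label: {key}" for key in REQUIRED if not labels.get(key)]
    allowed = {"severity": SEVERITIES, "layer": LAYERS, "team": TEAMS}
    for key, values in allowed.items():
        value = labels.get(key)
        if value and value not in values:
            problems.append(f"invalid {key}: {value!r}")
    # auto_repair is optional, but when present it must be the string "true" or "false".
    auto_repair = labels.get("auto_repair")
    if auto_repair is not None and auto_repair not in ("true", "false"):
        problems.append(f"invalid auto_repair: {auto_repair!r}")
    return problems
```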
**Layer-to-repair-mechanism mapping:**
| Layer | Example services | Repair mechanism |
|-------|---------|---------|
| `k8s` | API/Web/Worker pods | `kubectl rollout restart` |
| `docker-110` | Sentry/Harbor/Gitea/Alertmanager/Langfuse/Gitea-Runner | SSH 110 → `docker compose up -d` |
| `docker-188` | OpenClaw/MinIO/SigNoz | SSH 188 → `docker compose up -d` |
| `systemd-188` | PostgreSQL/Redis/Ollama/Nginx | SSH 188 → `systemctl restart` |
**auto_repair_service routing logic:**
```python
layer = incident.labels.get("layer")
if layer == "k8s":
    ...  # use the existing kubectl path
elif layer in ("docker-110", "docker-188"):
    ...  # use the new HostRepairAgent (SSH + docker compose)
elif layer == "systemd-188":
    ...  # use the new HostRepairAgent (SSH + systemctl)
```
---
## Sprint 1
**Goal**: every service outage triggers an immediate Telegram alert; no more silent failures
**Effort**: about 4h
**Priority**: P0
### S1-0: Complete the unified labels (all existing rules)
In the new `alerts.yml`, add the `layer` / `host` / `auto_repair` labels to every rule.
### S1-1: Merge all rules into a new alerts.yml
**Convert the rules in the following files to the Docker Prometheus format** and merge them:
| Source file | Rules | Notes |
|---------|-------|------|
| `k8s/monitoring/k3s-alerts.yaml` | 20+ | SentryDown, SignOzDown, HarborDown, etc. |
| `k8s/monitoring/alert-chain-monitor.yaml` | 9 | AlertChainBroken, AutoRepair, etc. |
| `k8s/monitoring/database-alerts.yaml` | 10+ | Detailed PG/Redis rules |
| `k8s/monitoring/minio-kali-alerts.yaml` | 5 | MinIO/Kali |
| `k8s/monitoring/k3s-alerts-supplemental.yaml` | 10+ | Supplemental rules |
| Existing `alerts.yml` | 13 | Keep the rules that are still valid; delete ClawBotDown (old name) |
**Key corrections**:
- `ClawBotDown` → `OpenClawDown`
- Add the `layer` label to every rule
- Confirm that Blackbox probe rules use the correct target IPs
**Output**: `ops/monitoring/alerts-unified.yml` (version-controlled)
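
The conversion step can be sketched as follows. A PrometheusRule manifest keeps its rule groups under `spec.groups`, and a standalone Prometheus rule file is that same `groups` list at the top level, so the conversion is mostly unwrapping. This sketch assumes each manifest has already been parsed into a dict (for example with a YAML loader); `crd_to_rule_file`, `merge_rule_files`, and the `default_layer` parameter are hypothetical names, and a single default layer is only a placeholder because the real `layer` differs per rule.

```python
from copy import deepcopy
from typing import Optional

RENAMES = {"ClawBotDown": "OpenClawDown"}  # rename listed under "Key corrections"


def crd_to_rule_file(crd: dict, default_layer: Optional[str] = None) -> dict:
    """Convert one parsed PrometheusRule manifest into a plain rule-file mapping."""
    if crd.get("kind") != "PrometheusRule":
        raise ValueError(f"unexpected kind: {crd.get('kind')!r}")
    # Copy so the caller's manifest is left untouched.
    groups = deepcopy(crd.get("spec", {}).get("groups", []))
    for group in groups:
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules pass through unchanged
            rule["alert"] = RENAMES.get(rule["alert"], rule["alert"])
            if default_layer is not None:
                # Only fills in a missing layer; an explicit label wins.
                rule.setdefault("labels", {}).setdefault("layer", default_layer)
    return {"groups": groups}


def merge_rule_files(*files: dict) -> dict:
    """Concatenate groups from several rule files, rejecting duplicate group names."""
    merged, seen = [], set()
    for rule_file in files:
        for group in rule_file["groups"]:
            if group["name"] in seen:
                raise ValueError(f"duplicate group name: {group['name']}")
            seen.add(group["name"])
            merged.append(group)
    return {"groups": merged}
```

The duplicate check is there because Prometheus requires group names to be unique within one rule file, which becomes relevant once several source files are merged into `alerts-unified.yml`.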
### S1-2: Deploy to Prometheus on 110
```bash
# scripts/deploy-alerts.sh (new)
# The reload endpoint requires Prometheus to run with --web.enable-lifecycle
set -euo pipefail
scp ops/monitoring/alerts-unified.yml wooo@192.168.0.110:/home/wooo/monitoring/alerts.yml
ssh wooo@192.168.0.110 "curl -sf -X POST http://localhost:9090/-/reload"
ssh wooo@192.168.0.110 "curl -s http://localhost:9090/api/v1/rules | python3 -c \"import sys,json; r=json.load(sys.stdin); print('Rules loaded:', sum(len(g['rules']) for g in r['data']['groups']))\""
```
### S1-3: Automatic sync via CD (no more manual deploys)
Add a `deploy-alerts` job to `.gitea/workflows/cd.yaml`:
```yaml
deploy-alerts:
  name: "Deploy Prometheus Alert Rules"
  runs-on: [self-hosted]
  needs: []  # independent job; does not depend on the API build
  steps:
    - name: Deploy alerts.yml to Prometheus
      run: |
        scp ops/monitoring/alerts-unified.yml wooo@192.168.0.110:/home/wooo/monitoring/alerts.yml
        ssh wooo@192.168.0.110 "curl -sf -X POST http://localhost:9090/-/reload"
```
**Trigger**: when `ops/monitoring/alerts-unified.yml` changes
### S1-4: Start Sentry (110)
**Sentry is already installed at `/opt/sentry/`** (installed 2026-03-24; Phases 1-5 complete).
Current problem: all containers stay stopped after a reboot, and Sentry was never added to startup-110.sh.
```bash
# Confirm and start
ssh wooo@192.168.0.110 "cd /opt/sentry && docker compose up -d"
# Verify
curl -s http://192.168.0.110:9000/api/0/projects/
```
**Add as Step 7 of startup-110.sh**:
```bash
# STEP 7: Sentry (Error Tracking)
# 2026-04-05 Claude Code: added to fix Sentry not starting automatically after a reboot
SENTRY_DIR="/opt/sentry"
if [ -d "$SENTRY_DIR" ]; then
    cd "$SENTRY_DIR"
    docker compose up -d 2>&1 | tail -5
    sleep 20
    if curl -sf --max-time 15 http://localhost:9000/api/0/projects/ >/dev/null 2>&1; then
        log "✅ Sentry healthy"
    else
        log "⚠️ Sentry not ready yet; wait a moment (Sentry takes about 2-3 minutes to start)"
    fi
else
    log "⚠️ Sentry directory not found: $SENTRY_DIR"
fi
```
### S1-5: Verify Sprint 1
```bash
# Verify the rule count
ssh wooo@192.168.0.110 "curl -s http://localhost:9090/api/v1/rules | python3 -c \"import sys,json; r=json.load(sys.stdin); print('Rules:', sum(len(g['rules']) for g in r['data']['groups']))\""
# Expected: > 40 (the old file had only 13)
# Verify that the SentryDown rule exists
ssh wooo@192.168.0.110 "curl -s http://localhost:9090/api/v1/rules | python3 -c \"import sys,json; r=json.load(sys.stdin); names=[rule['name'] for g in r['data']['groups'] for rule in g['rules']]; print('SentryDown' in names, 'AlertChainBroken_Alertmanager' in names)\""
# Expected: True True
# Verify that Sentry is running
curl -s --max-time 10 http://192.168.0.110:9000/api/0/projects/ | head -5
```
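The two inline one-liners above could be folded into one reusable check. This is a sketch under the assumption that the `/api/v1/rules` response has already been fetched and parsed as JSON; `summarize_rules` and `main` are hypothetical names, and the threshold of 40 comes from the expectation stated above.

```python
import json
import sys


def summarize_rules(payload: dict, required: tuple, min_count: int = 40) -> dict:
    """Summarize a /api/v1/rules response: total rule count and missing rule names."""
    groups = payload["data"]["groups"]
    names = {rule["name"] for group in groups for rule in group["rules"]}
    total = sum(len(group["rules"]) for group in groups)
    missing = sorted(set(required) - names)
    return {"total": total, "missing": missing, "ok": total > min_count and not missing}


def main() -> int:
    # Intended use: curl -s http://localhost:9090/api/v1/rules | python3 check_rules.py
    # (check_rules.py is a hypothetical file name; call sys.exit(main()) at the bottom of it)
    report = summarize_rules(json.load(sys.stdin), ("SentryDown", "AlertChainBroken_Alertmanager"))
    print(json.dumps(report))
    return 0 if report["ok"] else 1
```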
---
## Sprint 2
**Goal**: ERROR entries in API logs are no longer silent and trigger Telegram alerts
**Effort**: about 2h
**Priority**: P1
### S2-1: SigNoz Log Alert Rules
Create the following log-based alert rules in SigNoz; notifications go through the `/webhooks/signoz` path:
| Alert name | Trigger condition | Severity |
|-----------|---------|---------|
| `APIHighErrorLogRate` | API logs: `level=ERROR` > 10/min, sustained for 5 min | warning |
| `WorkerTaskFailed` | Worker logs: `"task_failed"\|"unhandled exception"` > 5/min | warning |
| `PodOOMKilled` | K8s logs: `"OOMKilled"\|"OutOfMemory"` | critical |
| `TelegramPollingFailed` | API logs: `"telegram_polling_error"` > 3/5min | critical |
| `NemotronAllTimeout` | API logs: `"nemotron_tool_call_timeout"` > 5/5min | warning |
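
To make the "more than N per minute, sustained for M minutes" wording in the table concrete, here is a small illustration of one way to read that condition. It is not how SigNoz evaluates alerts internally; `sustained_rate_exceeded` is a hypothetical helper that takes event timestamps in seconds and uses fixed one-minute buckets.

```python
from collections import Counter
from typing import Iterable


def sustained_rate_exceeded(timestamps: Iterable[float], now: float,
                            per_minute: int = 10, minutes: int = 5) -> bool:
    """True if every one of the last `minutes` one-minute buckets before `now`
    contains more than `per_minute` events."""
    start = now - minutes * 60
    # Bucket index 0 is the oldest minute in the window, minutes-1 the newest.
    buckets = Counter(int((ts - start) // 60) for ts in timestamps if start <= ts < now)
    return all(buckets[i] > per_minute for i in range(minutes))
```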
**Implementation**:
- SigNoz UI → Alerts → New Alert → Logs Based Alert
- Set the webhook to `http://192.168.0.121:32334/api/v1/webhooks/signoz`
- Keep the rule configuration under version control in `ops/signoz/alerting/log-rules.md` (SigNoz has no YAML import, so the rules are recorded as documentation)
**Unified labels** (attached to SigNoz alerts):
```json
{
  "layer": "k8s",
  "source": "signoz",
  "component": "api",
  "team": "backend"
}
```
### S2-2: Integrate Sentry with the AWOOOI API
Once Sentry is running:
1. Create the AWOOOI projects (Python + Next.js)
2. Configure a Webhook Integration → `http://192.168.0.121:32334/api/v1/webhooks/sentry`
3. Verify that the `/webhooks/sentry` handler receives events correctly and converts them into Incidents
**Add tags in the API-side Sentry SDK**:
```python
# apps/api/src/main.py: add tags after sentry init
sentry_sdk.set_tag("layer", "k8s")
sentry_sdk.set_tag("component", "api")
sentry_sdk.set_tag("host", "k8s-awoooi-prod")
```
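### S2-3: Verify Sprint 2
```bash
# Trigger a test log alert (emit an ERROR log inside the API pod)
# Note: output of a `kubectl exec` process goes to the exec session rather than the container log,
# and a single line stays below the 10/min-for-5-min threshold, so this command alone may not fire
# APIHighErrorLogRate; lower the threshold for the test or emit errors through the API itself.
kubectl exec -n awoooi-prod deploy/awoooi-api -- python3 -c "
import logging; logging.error('TEST_ERROR: Sprint2 validation')
"
# Expected: Telegram receives the APIHighErrorLogRate alert within 5 min
# Verify the Sentry webhook
curl -X POST http://192.168.0.121:32334/api/v1/webhooks/sentry \
  -H 'Content-Type: application/json' \
  -d '{"action":"triggered","data":{"issue":{"title":"Test Sentry Event","level":"error"}}}'
# Expected: {"success":true}
```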
---
## Sprint 3
**Goal**: SentryDown/HarborDown/GiteaDown are repaired automatically after firing, with no manual intervention
**Effort**: about 5h
**Priority**: P2
### S3-1: SSH Key Infrastructure
**Generate a dedicated repair key** (on the local Mac):
```bash
ssh-keygen -t ed25519 -C "awoooi-repair-bot-2026" -f ~/.ssh/awoooi_repair_bot -N ""
```
**Configure a restricted authorized_keys entry on 110**:
```bash
# Add to /home/wooo/.ssh/authorized_keys (restricts the key to running the repair script)
command="/home/wooo/bin/repair-bot-110.sh",no-port-forwarding,no-X11-forwarding,no-agent-forwarding ssh-ed25519 AAAA... awoooi-repair-bot-2026
```
**Configure a restricted authorized_keys entry on 188 (ollama user)**:
```bash
command="/home/ollama/bin/repair-bot-188.sh",no-port-forwarding,no-X11-forwarding,no-agent-forwarding ssh-ed25519 AAAA... awoooi-repair-bot-2026
```
**Store in a K8s Secret**:
```bash
# Use $HOME rather than ~: the shell does not expand a tilde that follows "=" inside an argument
kubectl create secret generic awoooi-repair-ssh-key \
  -n awoooi-prod \
  --from-file=id_ed25519="$HOME/.ssh/awoooi_repair_bot" \
  --from-file=id_ed25519.pub="$HOME/.ssh/awoooi_repair_bot.pub"
```
### S3-2: Repair Script Allowlist
**`/home/wooo/bin/repair-bot-110.sh`** (host 110):
```bash
#!/bin/bash
# Repair-bot allowlist script: only known repair commands are permitted
# 2026-04-05 Claude Code: Sprint 3 Host Auto-Repair
set -o pipefail  # so a failing docker compose is not masked by `tail`
ALLOWED_COMMANDS=(
    "sentry:docker compose up -d:/opt/sentry"
    "harbor:docker compose up -d:/home/wooo/harbor/harbor"
    "gitea:docker compose up -d:/home/wooo/gitea"
    "gitea-runner:docker compose up -d:/home/wooo/act-runner"
    "langfuse:docker compose up -d:/home/wooo/langfuse"
    "alertmanager:docker compose up -d:/home/wooo/monitoring"
)
CMD="${SSH_ORIGINAL_COMMAND:-}"
for entry in "${ALLOWED_COMMANDS[@]}"; do
    # Entry format: name:command:directory
    name="${entry%%:*}"; rest="${entry#*:}"; command="${rest%%:*}"; dir="${rest#*:}"
    if [ "$CMD" = "repair:$name" ]; then
        if cd "$dir" && $command 2>&1 | tail -5; then
            echo "REPAIR_OK:$name"
            exit 0
        fi
        echo "REPAIR_FAIL:$name"
        exit 1
    fi
done
echo "REPAIR_DENIED:$CMD"
exit 1
```
**`/home/ollama/bin/repair-bot-188.sh`** (host 188):
```bash
#!/bin/bash
set -o pipefail  # so a failing docker compose is not masked by `tail`
ALLOWED_COMMANDS=(
    "openclaw:docker compose up -d:/home/ollama/clawbot-v5"
    "minio:docker compose up -d:/home/ollama/minio"
    "signoz:docker compose up -d:/home/ollama/signoz/deploy/docker"
    "redis:systemctl restart redis-server"
    "nginx:systemctl restart nginx"
)
CMD="${SSH_ORIGINAL_COMMAND:-}"
for entry in "${ALLOWED_COMMANDS[@]}"; do
    name="${entry%%:*}"; rest="${entry#*:}"; command="${rest%%:*}"; dir="${rest#*:}"
    if [ "$CMD" = "repair:$name" ]; then
        if [[ "$command" == systemctl* ]]; then
            # systemctl entries have no directory field; -n fails instead of prompting for a password
            sudo -n $command
        else
            cd "$dir" && $command 2>&1 | tail -5
        fi
        status=$?
        if [ "$status" -eq 0 ]; then echo "REPAIR_OK:$name"; else echo "REPAIR_FAIL:$name"; fi
        exit "$status"
    fi
done
echo "REPAIR_DENIED:$CMD"
exit 1
```
### S3-3: HostRepairAgent (new Python module)
**Create** `apps/api/src/services/host_repair_agent.py`:
```python
"""
Host Repair Agent
=================
Runs host-level (Docker/systemd) repair actions over SSH.
Security design: the authorized_keys command= restriction permits only allowlisted repair commands.
Layer mapping:
  docker-110  → SSH wooo@192.168.0.110 repair:<component>
  docker-188  → SSH ollama@192.168.0.188 repair:<component>
  systemd-188 → SSH ollama@192.168.0.188 repair:<component>
"""
LAYER_SSH_CONFIG = {
    "docker-110": {"user": "wooo", "host": "192.168.0.110"},
    "docker-188": {"user": "ollama", "host": "192.168.0.188"},
    "systemd-188": {"user": "ollama", "host": "192.168.0.188"},
}
```
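The module above only defines the layer mapping. A minimal sketch of how the rest of the agent might look follows, assuming the repair key from the K8s Secret is mounted into the API pod at a path like `KEY_PATH` (a hypothetical location) and that the remote script prints `REPAIR_OK:<name>` as its last line on success. `RepairResult`, `build_ssh_argv`, and `parse_repair_output` are illustrative names; only `repair(layer=..., component=...)` mirrors the call used in the routing logic. Host key verification for the pod is not covered here.

```python
import asyncio
import re
from dataclasses import dataclass

LAYER_SSH_CONFIG = {
    "docker-110": {"user": "wooo", "host": "192.168.0.110"},
    "docker-188": {"user": "ollama", "host": "192.168.0.188"},
    "systemd-188": {"user": "ollama", "host": "192.168.0.188"},
}
KEY_PATH = "/etc/repair-ssh/id_ed25519"  # assumption: where the Secret is mounted


@dataclass
class RepairResult:
    ok: bool
    detail: str


def build_ssh_argv(layer: str, component: str, key_path: str = KEY_PATH) -> list:
    """Build the ssh command line; the remote side sees 'repair:<component>'."""
    if not re.fullmatch(r"[a-z0-9][a-z0-9-]*", component):
        raise ValueError(f"unsafe component name: {component!r}")
    target = LAYER_SSH_CONFIG[layer]  # KeyError for an unknown layer is intentional
    return [
        "ssh", "-i", key_path,
        "-o", "BatchMode=yes",       # never fall back to interactive prompts
        "-o", "ConnectTimeout=10",
        f"{target['user']}@{target['host']}",
        f"repair:{component}",
    ]


def parse_repair_output(output: str, component: str) -> RepairResult:
    """The allowlist script reports its verdict on the last output line."""
    lines = output.strip().splitlines()
    last = lines[-1] if lines else ""
    return RepairResult(ok=(last == f"REPAIR_OK:{component}"), detail=last)


async def repair(layer: str, component: str, timeout: float = 120.0) -> RepairResult:
    """Run the remote repair command and interpret its output."""
    proc = await asyncio.create_subprocess_exec(
        *build_ssh_argv(layer, component),
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )
    try:
        stdout, _ = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        proc.kill()
        return RepairResult(ok=False, detail="timeout")
    return parse_repair_output(stdout.decode(errors="replace"), component)
```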
**Add an ActionType** (`apps/api/src/models/playbook.py`):
```python
from enum import Enum


class ActionType(str, Enum):
    KUBECTL = "kubectl"
    SCRIPT = "script"
    MANUAL = "manual"
    SSH_COMMAND = "ssh_command"  # ← new: host-level repair
```
**auto_repair_service.py routing logic**:
```python
layer = incident.labels.get("layer", "k8s")
if layer == "k8s":
    # existing kubectl path
    ...
elif layer in ("docker-110", "docker-188", "systemd-188"):
    component = incident.labels.get("component")
    result = await host_repair_agent.repair(layer=layer, component=component)
```
### S3-4: Create Playbooks
Create the following Playbooks (stored in PostgreSQL, created through the API):
| Playbook | Triggering alert | Layer | Repair command |
|---------|---------|-------|---------|
| `sentry-down-repair` | SentryDown | docker-110 | `repair:sentry` |
| `harbor-down-repair` | HarborDown | docker-110 | `repair:harbor` |
| `gitea-down-repair` | GiteaDown | docker-110 | `repair:gitea` |
| `gitea-runner-repair` | GiteaRunnerOffline | docker-110 | `repair:gitea-runner` |
| `openclaw-down-repair` | OpenClawDown | docker-188 | `repair:openclaw` |
| `alertmanager-down-repair` | AlertmanagerDown | docker-110 | `repair:alertmanager` |
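
The table can be read as a lookup from alert name and layer to a repair command. The sketch below holds the same rows in memory purely for illustration; the real Playbooks live in PostgreSQL, and `HostPlaybook` and `find_playbook` are hypothetical names rather than the existing Playbook model.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class HostPlaybook:
    name: str
    alert: str
    layer: str
    command: str


# Rows copied from the Playbook table above.
PLAYBOOKS = (
    HostPlaybook("sentry-down-repair", "SentryDown", "docker-110", "repair:sentry"),
    HostPlaybook("harbor-down-repair", "HarborDown", "docker-110", "repair:harbor"),
    HostPlaybook("gitea-down-repair", "GiteaDown", "docker-110", "repair:gitea"),
    HostPlaybook("gitea-runner-repair", "GiteaRunnerOffline", "docker-110", "repair:gitea-runner"),
    HostPlaybook("openclaw-down-repair", "OpenClawDown", "docker-188", "repair:openclaw"),
    HostPlaybook("alertmanager-down-repair", "AlertmanagerDown", "docker-110", "repair:alertmanager"),
)


def find_playbook(alert: str, labels: Mapping[str, str]) -> Optional[HostPlaybook]:
    """Return the matching playbook, or None if auto-repair is disabled or nothing matches."""
    if labels.get("auto_repair") == "false":
        return None
    for playbook in PLAYBOOKS:
        # The layer must agree so a mislabeled alert is not repaired on the wrong host.
        if playbook.alert == alert and playbook.layer == labels.get("layer"):
            return playbook
    return None
```

Requiring the layer to match is a design choice in this sketch: it keeps an alert with a wrong or missing `layer` label from triggering an SSH repair at all.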
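### S3-5: E2E Closed-Loop Verification
```bash
# Manually trigger a SentryDown test
# 1. Stop Sentry
ssh wooo@192.168.0.110 "cd /opt/sentry && docker compose stop"
# 2. Wait for Prometheus to detect it (the rule uses for: 2m)
sleep 130
# 3. Verify that Telegram received the alert
# (manual check)
# 4. Verify that auto-repair was triggered
# auto_repair_service should run SSH repair:sentry within 3 min
# 5. Verify that Sentry recovered
curl -s http://192.168.0.110:9000/api/0/projects/
# Expected: Sentry responds normally
# 6. Verify that the alert resolved
# Telegram should receive a "Sentry auto-repaired" notification
```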
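Steps 2 and 5 rely on fixed waits. A polling helper with a deadline is one way to make such checks less timing-sensitive; the sketch below is generic and assumes the caller supplies the health check as a function. `wait_until` is a hypothetical name, and the `sleep` and `clock` parameters exist only so the loop can be exercised without real delays.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float, interval: float = 10.0,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Call `check` until it returns True or `timeout` seconds have elapsed."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False  # give up; the caller decides how to report the failure
        sleep(interval)
```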
---
## Reboot SOP Integration Updates
REBOOT-RECOVERY-SOP.md v4.0 needs the following updates:
### Update 1: Add Step 7 (Sentry) to startup-110.sh
```bash
# STEP 7: Sentry (Error Tracking)
log "[7/7] Starting Sentry..."
SENTRY_DIR="/opt/sentry"
if [ -d "$SENTRY_DIR" ]; then
    cd "$SENTRY_DIR"
    docker compose up -d 2>&1 | tail -5
    sleep 15
    if curl -sf --max-time 10 http://localhost:9000/api/0/projects/ >/dev/null 2>&1; then
        log "✅ Sentry healthy"
    else
        log "⚠️ Sentry not ready yet, waiting 30 seconds..."
        sleep 30
        curl -sf --max-time 10 http://localhost:9000/api/0/projects/ >/dev/null 2>&1 && log "✅ Sentry ready" || log "❌ Sentry not ready; manual check required"
    fi
else
    log "⚠️ Sentry directory not found: $SENTRY_DIR"
fi
```
### Update 2: Add Sentry and rule-count checks to the E2E verification script
```bash
# Add to the E2E verification script
check "110 Sentry" "curl -s -o /dev/null -w '%{http_code}' http://192.168.0.110:9000/api/0/projects/" "200"
# Prints OK:<n> when more than 40 rules are loaded, LOW:<n> otherwise
check "Prometheus rule count >40" "ssh wooo@192.168.0.110 'curl -s http://localhost:9090/api/v1/rules' | python3 -c \"import sys,json; r=json.load(sys.stdin); n=sum(len(g['rules']) for g in r['data']['groups']); print('OK:'+str(n) if n > 40 else 'LOW:'+str(n))\"" "OK:"
```
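### Update 3: Add a "rules not deployed" diagnosis to alert-chain troubleshooting
```
Silent-alert diagnosis tree (addition):
Alertmanager healthy ✅
Webhook URL correct ✅
curl POST from 110 succeeds ✅
But a service is down and no alert fired?
  Do the Prometheus rules cover that service?
  curl http://192.168.0.110:9090/api/v1/rules | grep -i "<component>"
  ├── Not found → rules not deployed
  │     cd /path/to/awoooi && bash scripts/deploy-alerts.sh
  └── Found but inactive → the rule condition has not triggered; check expr
```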
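The last branch of the tree can also be evaluated from the parsed `/api/v1/rules` response instead of `grep`. This sketch relies on the `type`, `name`, `state`, and `labels` fields that Prometheus returns for each rule; `diagnose_component` is a hypothetical helper, and the `active` outcome is an addition that the tree above does not list.

```python
def diagnose_component(payload: dict, component: str) -> str:
    """Classify why a component produced no alert, from a /api/v1/rules response."""
    needle = component.lower()
    states = [
        rule.get("state", "unknown")
        for group in payload["data"]["groups"]
        for rule in group["rules"]
        if rule.get("type") == "alerting"
        and (needle in rule["name"].lower()
             or rule.get("labels", {}).get("component", "").lower() == needle)
    ]
    if not states:
        return "not-deployed"  # no rule mentions the component: run scripts/deploy-alerts.sh
    if any(state in ("pending", "firing") for state in states):
        return "active"        # a rule is pending or firing: the problem is further downstream
    return "inactive"          # rule exists but its condition is not met: inspect expr
```

For example, `diagnose_component(rules_json, "sentry")` would return `"not-deployed"` against the old 13-rule file described in Gap 1, since no SentryDown rule is loaded there.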
### Update 4: Add Sentry to the host dependency map
```
192.168.0.110 (DevOps host)
├── ...
└── Sentry :9000 ← Error Tracking (added in Sprint 1)
```
---
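## Closed-Loop Acceptance Criteria
**Acceptance criteria that define "system-wide self-healing closed loop complete"**
| Acceptance item | Test method | Expected result |
|---------|---------|---------|
| More than 40 Prometheus rules | API query | ✅ |
| SentryDown rule exists | Prometheus API query | ✅ |
| Sentry service healthy | HTTP 200 | ✅ |
| CD syncs alerts.yml automatically | git push → verify the rule count on 110 Prometheus updates | ✅ |
| SigNoz log alert fires | Inject ERROR logs → Telegram alert | ✅ |
| Sentry webhook works | Send a test event → Telegram alert | ✅ |
| SSH repair key configured | Run SSH repair:test from a K8s pod | ✅ |
| SentryDown auto-repair E2E | Stop Sentry → wait 5 min → Sentry recovers automatically | ✅ |
| HarborDown auto-repair E2E | Stop Harbor → wait 5 min → Harbor recovers automatically | ✅ |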
---
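## Implementation Schedule
| Sprint | Scope | Effort | Priority | Depends on |
|--------|------|------|--------|------|
| **S0** | Establish the label convention (section S0 of this document) | 0h (design complete) | P0 | - |
| **S1** | Prometheus rule deployment + Sentry startup + CD sync | 4h | P0 | S0 |
| **S2** | SigNoz log alerts + Sentry integration | 2h | P1 | S1 |
| **S3** | SSH repair infrastructure + HostRepairAgent + Playbooks | 5h | P2 | S1 |
| **SOI** | SOP v4.0 integration updates (Sentry + verification script + diagnosis tree) | 1h | P1 | S1 |
| **Acceptance** | System-wide E2E closed-loop verification | 2h | - | S1+S2+S3 |
**Estimated total effort**: ~14h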
---
## Version History
| Version | Date | Notes |
|------|------|------|
| v1.0 | 2026-04-05 | Initial version: integrates the three systems of reboot recovery, monitoring/alerting, and auto-repair |