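docs(spec): system-wide self-healing closed-loop design specification v1.0

A complete solution that addresses three major problems:
1. Prometheus rules were never deployed (13 rules → 40+ rules, including SentryDown/AlertChain)
2. Logs are collected but there is no log-based alerting
3. Auto-repair is limited to the K8s layer, with no repair capability for host Docker/systemd services

Includes:
- Unified label convention (layer/component/team/host)
- Sprint 1: rule deployment + Sentry startup + CD sync
- Sprint 2: SigNoz log alerts + Sentry integration
- Sprint 3: SSH HostRepairAgent + Playbooks
- SOP v4.0 integration update points

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>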
@@ -0,0 +1,585 @@
# AWOOOI System-Wide Self-Healing Closed-Loop Design Specification

> **Version**: v1.0
> **Date**: 2026-04-05 (Taipei time)
> **Drafted by**: Claude Code (chief architect)
> **Trigger**: after a reboot, services failed silently, monitoring alerts did not fire, and there was no automatic repair capability
> **Related documents**: REBOOT-RECOVERY-SOP.md v3.0, ADR-030, Phase O

---

## Table of contents

1. [Root-cause summary](#root-cause-summary)
2. [Overall architecture goals](#overall-architecture-goals)
3. [Unified label convention (S0)](#s0-unified-label-convention)
4. [Sprint 1: complete the Prometheus rules + start Sentry](#sprint-1)
5. [Sprint 2: log-based alerting](#sprint-2)
6. [Sprint 3: Host Auto-Repair Agent](#sprint-3)
7. [Reboot SOP integration updates](#reboot-sop-integration-updates)
8. [Closed-loop acceptance criteria](#closed-loop-acceptance-criteria)
9. [Implementation schedule](#implementation-schedule)

---

## Root-cause summary

### Four systemic gaps

#### Gap 1: alert rules exist but were never deployed

| Item | Status and description |
|------|------|
| `k8s/monitoring/*.yaml` | ✅ 40+ rules exist in version control (including SentryDown, AlertChain, and others) |
| `110:/home/wooo/monitoring/alerts.yml` | ❌ Only 13 old rules; never synced |
| Root cause | `k8s/monitoring/` uses the PrometheusRule CRD format (Operator), while 110 runs Prometheus under Docker Compose. The two are never synced, and no CD mechanism deploys the rules automatically |
| Consequence | Rules such as SentryDown, HarborDown, GiteaDown, and AlertChainBroken never took effect; `ClawBotDown` still uses a deprecated name |

#### Gap 2: logs are collected but never alerted on

| Item | Status and description |
|------|------|
| OTEL DaemonSet | ✅ Pod logs → SigNoz ClickHouse |
| SigNoz log alert rules | ❌ None at all; `ops/signoz/alerting/rules.yaml` contains only trace/metric rules |
| Root cause | Phase O completed the log collection pipeline but never created log-based alert rules |
| Consequence | API ERROR logs and pod crash logs are completely silent and trigger no alerts |

#### Gap 3: auto-repair only handles the K8s layer

| Item | Status and description |
|------|------|
| `auto_repair_service.py` | ✅ Can run kubectl (K8s layer) |
| `ActionType` | ❌ Only KUBECTL/SCRIPT/MANUAL; no SSH_COMMAND |
| Host-layer services | ❌ SentryDown/HarborDown have no matching Playbook and no repair capability |
| Sentry | ✅ Installed at `/opt/sentry/`, but not started automatically after a reboot (not added to startup-110.sh) |
| SSH key | ❌ No SSH key exists in any K8s Secret |
| Root cause | ADR-030 designed an SSH layer that was never implemented; Sentry is installed but has no startup management |

#### Gap 4: no unified labeling convention

Alert labels have no `layer` field, so auto_repair_service cannot determine the repair path:

```
SentryDown alert labels: {severity: warning, team: ops, component: sentry}
                          ↑
                          no layer: docker-110
                          → cannot decide between repairing via SSH to 110
                            and repairing via kubectl
```

---

## Overall architecture goals

```
[Layer 1: Collection]
  K8s Pod Logs    → OTEL DaemonSet   → SigNoz ClickHouse   ✅ (in place)
  K8s Metrics     → kube-state/node  → Prometheus (110)    ✅ (in place)
  App Exceptions  → Sentry SDK       → Sentry (110:9000)   ❌ Sentry installed but not running after reboot
  App Traces      → OTEL SDK         → SigNoz              ✅ (in place)

[Layer 2: Detection]
  Prometheus Rules (110)  ← deploy all 40+ rules                   Sprint 1
  SigNoz Log Alerts       ← ERROR/crash logs trigger alerts        Sprint 2
  Sentry Error Alerts     ← exception thresholds trigger alerts    Sprint 2

[Layer 3: Alert routing]
  Alertmanager → AWOOOI API /webhooks/alertmanager → Telegram   ✅ (fixed)
  SigNoz       → AWOOOI API /webhooks/signoz       → Telegram   ✅ (in place)
  Sentry       → AWOOOI API /webhooks/sentry       → Telegram   ✅ (handler in place)

[Layer 4: Auto-repair]
  K8s-layer alerts    → auto_repair_service → kubectl            ✅ (in place)
  Docker-layer alerts → HostRepairAgent → SSH → docker compose   Sprint 3
  Systemd layer       → HostRepairAgent → SSH → systemctl        Sprint 3

[Layer 5: Closed-loop verification]
  Repair done → service health confirmed → alert resolved → Telegram notification   Sprint 3
```

---

## S0: Unified label convention

**Labels from every alert source (Prometheus/SigNoz/Sentry) must include the following fields:**

```yaml
labels:
  # Required
  severity: critical | warning | info
  layer: k8s | docker-110 | docker-188 | systemd-188  # ← determines the repair path
  component: sentry | harbor | gitea | openclaw | api | worker | web | postgres | redis | ollama | nginx | alertmanager | gitea-runner | minio | signoz | langfuse
  team: ops | backend | ai | platform

  # Optional (helps auto-repair)
  host: "110" | "188" | "120" | "121"  # affected host
  source: prometheus | signoz | sentry  # alert source
  auto_repair: "true" | "false"  # whether auto-repair is allowed
```

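
The convention above lends itself to a mechanical check before rules are deployed. The following is a minimal Python sketch and not part of the existing codebase: the helper name `validate_labels` and the choice to return a list of problems are assumptions made for illustration, while the allowed values are copied from the convention.

```python
"""Sketch: validate alert labels against the S0 convention (hypothetical helper)."""

# Required labels and their allowed values, copied from the S0 convention.
REQUIRED = {
    "severity": {"critical", "warning", "info"},
    "layer": {"k8s", "docker-110", "docker-188", "systemd-188"},
    "component": {
        "sentry", "harbor", "gitea", "openclaw", "api", "worker", "web",
        "postgres", "redis", "ollama", "nginx", "alertmanager",
        "gitea-runner", "minio", "signoz", "langfuse",
    },
    "team": {"ops", "backend", "ai", "platform"},
}
# Optional labels are only checked when present.
OPTIONAL = {
    "host": {"110", "188", "120", "121"},
    "source": {"prometheus", "signoz", "sentry"},
    "auto_repair": {"true", "false"},
}


def validate_labels(labels: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for key, allowed in REQUIRED.items():
        if key not in labels:
            problems.append(f"missing required label: {key}")
        elif labels[key] not in allowed:
            problems.append(f"invalid {key}: {labels[key]!r}")
    for key, allowed in OPTIONAL.items():
        if key in labels and labels[key] not in allowed:
            problems.append(f"invalid {key}: {labels[key]!r}")
    return problems
```

Calling `validate_labels({"severity": "warning", "team": "ops", "component": "sentry"})`, which are the SentryDown labels quoted in the gap analysis, reports the missing `layer` label.
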
**Repair mechanism per layer:**

| Layer | Example services | Repair mechanism |
|-------|---------|---------|
| `k8s` | API/Web/Worker pods | `kubectl rollout restart` |
| `docker-110` | Sentry/Harbor/Gitea/Alertmanager/Langfuse/Gitea-Runner | SSH 110 → `docker compose up -d` |
| `docker-188` | OpenClaw/MinIO/SigNoz | SSH 188 → `docker compose up -d` |
| `systemd-188` | PostgreSQL/Redis/Ollama/Nginx | SSH 188 → `systemctl restart` |

**auto_repair_service routing logic:**

```python
# Outline only: `incident` comes from the alert webhook; `...` marks each branch body.
layer = incident.labels.get("layer")
if layer == "k8s":
    ...  # use the existing kubectl path
elif layer in ("docker-110", "docker-188"):
    ...  # use the new HostRepairAgent (SSH + docker compose)
elif layer == "systemd-188":
    ...  # use the new HostRepairAgent (SSH + systemctl)
```

---

## Sprint 1

**Goal**: every service outage triggers a Telegram alert immediately; no more silent failures
**Effort**: about 4h
**Priority**: P0

### S1-0: complete the unified labels (all existing rules)

In the new `alerts.yml`, add the `layer` / `host` / `auto_repair` labels to every rule.

### S1-1: merge all rules into the new alerts.yml

Convert the rules in the following files **to the Docker Prometheus format** and merge them:

| Source file | Rule count | Notes |
|---------|-------|------|
| `k8s/monitoring/k3s-alerts.yaml` | 20+ | SentryDown, SignOzDown, HarborDown, and others |
| `k8s/monitoring/alert-chain-monitor.yaml` | 9 | AlertChainBroken, AutoRepair, and others |
| `k8s/monitoring/database-alerts.yaml` | 10+ | Detailed PG/Redis rules |
| `k8s/monitoring/minio-kali-alerts.yaml` | 5 | MinIO/Kali |
| `k8s/monitoring/k3s-alerts-supplemental.yaml` | 10+ | Supplemental rules |
| Existing `alerts.yml` | 13 | Keep the valid rules; drop `ClawBotDown` (old name, replaced by `OpenClawDown`) |

**Key fixes**:
- `ClawBotDown` → `OpenClawDown`
- Add the `layer` label to every rule
- Confirm that the target IPs in the Blackbox probe rules are correct

**Output**: `ops/monitoring/alerts-unified.yml` (version-controlled)

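
A PrometheusRule CRD keeps its rule groups under `spec.groups`, in the same shape as the `groups` key of a plain Prometheus rules file, so the conversion is mostly a matter of unwrapping and merging. The sketch below illustrates that step on already-parsed manifests; loading and dumping YAML (for example with PyYAML) is deliberately left out, and the function name `crd_to_rule_file` and the `default_labels` parameter are assumptions, not existing project code.

```python
"""Sketch: unwrap PrometheusRule CRDs into one plain Prometheus rules document."""
import copy


def crd_to_rule_file(crds: list[dict], default_labels: dict[str, str]) -> dict:
    """Merge the `spec.groups` of several parsed CRDs into {"groups": [...]}.

    `default_labels` (for example {"layer": "k8s"}) fills in labels a rule
    lacks; labels already set on a rule take precedence.
    """
    groups = []
    for crd in crds:
        if crd.get("kind") != "PrometheusRule":
            continue  # ignore other manifests that share the directory
        for group in crd.get("spec", {}).get("groups", []):
            group = copy.deepcopy(group)  # never mutate the caller's data
            for rule in group.get("rules", []):
                if "alert" in rule:  # recording rules are left untouched
                    rule["labels"] = {**default_labels, **rule.get("labels", {})}
            groups.append(group)
    return {"groups": groups}
```

Because `layer` differs per service, one call per source file with its own `default_labels` is probably the most practical use; rules that need a different layer can set the label explicitly in the source.
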
### S1-2: deploy to the 110 Prometheus

```bash
# deploy-alerts.sh (new)
set -euo pipefail  # stop at the first failing step

scp ops/monitoring/alerts-unified.yml wooo@192.168.0.110:/home/wooo/monitoring/alerts.yml
# Hot reload needs Prometheus started with --web.enable-lifecycle; -f makes curl fail on HTTP errors
ssh wooo@192.168.0.110 "curl -sf -X POST http://localhost:9090/-/reload"
# Print the number of loaded rules
ssh wooo@192.168.0.110 "curl -s http://localhost:9090/api/v1/rules | python3 -c \"import sys,json; r=json.load(sys.stdin); print('Rules loaded:', sum(len(g['rules']) for g in r['data']['groups']))\""
```

### S1-3: automatic CD sync (no more manual deployment)

Add a `deploy-alerts` job to `.gitea/workflows/cd.yaml`:

```yaml
deploy-alerts:
  name: "Deploy Prometheus Alert Rules"
  runs-on: [self-hosted]
  needs: []  # independent job; does not depend on the API build
  steps:
    - name: Deploy alerts.yml to Prometheus
      run: |
        scp ops/monitoring/alerts-unified.yml wooo@192.168.0.110:/home/wooo/monitoring/alerts.yml
        ssh wooo@192.168.0.110 "curl -sf -X POST http://localhost:9090/-/reload"
```

**Trigger**: when the file `ops/monitoring/alerts-unified.yml` changes

### S1-4: start Sentry (110)

**Sentry is already installed at `/opt/sentry/`** (installed 2026-03-24; Phases 1-5 complete).
Current problem: all containers stay stopped after a reboot, and Sentry was never added to startup-110.sh.

```bash
# Confirm and start
ssh wooo@192.168.0.110 "cd /opt/sentry && docker compose up -d"
# Verify
curl -s http://192.168.0.110:9000/api/0/projects/
```

**Add Step 7 to startup-110.sh**:
```bash
# STEP 7: Sentry (Error Tracking)
# 2026-04-05 Claude Code: added so that Sentry starts automatically after a reboot
SENTRY_DIR="/opt/sentry"
if [ -d "$SENTRY_DIR" ]; then
  cd "$SENTRY_DIR"
  docker compose up -d 2>&1 | tail -5
  sleep 20
  if curl -sf --max-time 15 http://localhost:9000/api/0/projects/ >/dev/null 2>&1; then
    log "✅ Sentry healthy"
  else
    log "⚠️ Sentry not ready yet; please wait (Sentry takes about 2-3 minutes to start)"
  fi
else
  log "⚠️ Sentry directory not found: $SENTRY_DIR"
fi
```

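
Because Sentry can take 2-3 minutes to come up, a single check after a fixed sleep will often report "not ready". A bounded polling loop is one way to handle that. The Python sketch below shows the idea; `http_ok`, `wait_until_healthy`, and the default of 18 attempts at 10-second intervals (about 3 minutes) are illustrative assumptions, not part of startup-110.sh.

```python
"""Sketch: poll a health endpoint until it answers or the attempts run out."""
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 10.0) -> bool:
    """True when the URL answers with a 2xx status, similar to `curl -sf`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers connection errors, timeouts, and HTTP 4xx/5xx responses.
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    attempts: int = 18,
    delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no sleep after the final attempt
    return False
```

A caller on the 110 host might use `wait_until_healthy(lambda: http_ok("http://localhost:9000/api/0/projects/"))`, reusing the URL from the startup step. The `sleep` parameter exists only so the loop can be tested without real waiting.
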
### S1-5: verify Sprint 1

```bash
# Verify the rule count
ssh wooo@192.168.0.110 "curl -s http://localhost:9090/api/v1/rules | python3 -c \"import sys,json; r=json.load(sys.stdin); print('Rules:', sum(len(g['rules']) for g in r['data']['groups']))\""
# Expected: > 40 (the old file had only 13)

# Verify that the SentryDown rule exists
ssh wooo@192.168.0.110 "curl -s http://localhost:9090/api/v1/rules | python3 -c \"import sys,json; r=json.load(sys.stdin); names=[rule['name'] for g in r['data']['groups'] for rule in g['rules']]; print('SentryDown' in names, 'AlertChainBroken_Alertmanager' in names)\""
# Expected: True True

# Verify that Sentry is up
curl -s --max-time 10 http://192.168.0.110:9000/api/0/projects/ | head -5
```

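
The two rule checks above can be folded into one function that works on the JSON returned by Prometheus' `/api/v1/rules` endpoint. This is a hedged sketch: `summarize_rules` and its return shape are made up for illustration, and the only fields it relies on are `data.groups[].rules[].name`, the same ones the one-liners above already use.

```python
"""Sketch: check a Prometheus /api/v1/rules response for count and required alerts."""


def summarize_rules(payload: dict, required: set[str], minimum: int = 40) -> dict:
    """Return the rule count, whether it exceeds `minimum`, and missing rule names."""
    rules = [rule for group in payload["data"]["groups"] for rule in group["rules"]]
    names = {rule["name"] for rule in rules}
    return {
        "count": len(rules),
        "enough": len(rules) > minimum,       # the acceptance bar is "more than 40"
        "missing": sorted(required - names),  # empty list means every required rule is loaded
    }
```

Passing `{"SentryDown", "AlertChainBroken_Alertmanager"}` as `required` reproduces the second check; a non-empty `missing` list points at rules that were not deployed.
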
---

## Sprint 2

**Goal**: ERROR entries in the API logs no longer go unnoticed; they trigger Telegram alerts
**Effort**: about 2h
**Priority**: P1

### S2-1: SigNoz log alert rules

Create the following log-based alert rules in SigNoz; notifications go out through the `/webhooks/signoz` path:

| Alert name | Trigger condition | Severity |
|-----------|---------|---------|
| `APIHighErrorLogRate` | API logs: `level=ERROR` > 10/min, sustained for 5 min | warning |
| `WorkerTaskFailed` | Worker logs: `"task_failed"\|"unhandled exception"` > 5/min | warning |
| `PodOOMKilled` | K8s logs: `"OOMKilled"\|"OutOfMemory"` | critical |
| `TelegramPollingFailed` | API logs: `"telegram_polling_error"` > 3 per 5 min | critical |
| `NemotronAllTimeout` | API logs: `"nemotron_tool_call_timeout"` > 5 per 5 min | warning |

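
To make the "more than N per minute, sustained for M minutes" wording concrete, here is a small Python sketch of that condition over per-minute match counts. It only illustrates the intended semantics of a rule such as `APIHighErrorLogRate`; it is not how SigNoz evaluates alerts internally, and `sustained_breach` is a hypothetical name.

```python
"""Sketch: "more than N matches per minute, sustained for M minutes"."""


def sustained_breach(counts_per_minute: list[int], threshold: int, for_minutes: int) -> bool:
    """True when each of the last `for_minutes` one-minute buckets exceeds `threshold`."""
    if for_minutes <= 0 or len(counts_per_minute) < for_minutes:
        return False  # not enough history yet to call it sustained
    return all(count > threshold for count in counts_per_minute[-for_minutes:])
```

With `threshold=10` and `for_minutes=5`, a single minute at or below 10 inside the window resets the condition, which is why a one-off ERROR line would never fire this alert.
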
**Implementation**:
- SigNoz UI → Alerts → New Alert → Logs Based Alert
- Set the webhook: `http://192.168.0.121:32334/api/v1/webhooks/signoz`
- Keep the rule configuration under version control in `ops/signoz/alerting/log-rules.md` (SigNoz has no YAML import, so the rules are recorded as documentation)

**Unified labels** (attached to SigNoz alerts):
```json
{
  "layer": "k8s",
  "source": "signoz",
  "component": "api",
  "team": "backend"
}
```

### S2-2: integrate Sentry with the AWOOOI API

Once Sentry is running:

1. Create the AWOOOI projects (Python + Next.js)
2. Configure the Webhook Integration → `http://192.168.0.121:32334/api/v1/webhooks/sentry`
3. Verify that the `/webhooks/sentry` handler receives events correctly and converts them into Incidents

**Add tags in the API-side Sentry SDK**:
```python
# apps/api/src/main.py: set these tags right after the Sentry init call
import sentry_sdk

sentry_sdk.set_tag("layer", "k8s")
sentry_sdk.set_tag("component", "api")
sentry_sdk.set_tag("host", "k8s-awoooi-prod")
```

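
Step 3 mentions converting Sentry events into Incidents. The existing handler and the Incident model are not shown in this spec, so the following Python sketch is purely illustrative: the function name, the output dictionary, and the level-to-severity mapping are assumptions, and the input shape is taken from the test request used in the Sprint 2 verification (`data.issue.title` and `data.issue.level`).

```python
"""Sketch: turn a Sentry webhook payload into an incident dict (hypothetical shape)."""

# Assumed mapping from Sentry issue levels to the S0 severity values.
LEVEL_TO_SEVERITY = {
    "fatal": "critical",
    "error": "warning",
    "warning": "warning",
    "info": "info",
}


def sentry_payload_to_incident(payload: dict) -> dict:
    """Extract the title and attach the unified labels for an API-side event."""
    issue = payload.get("data", {}).get("issue", {})
    level = issue.get("level", "error")
    return {
        "title": issue.get("title", "Sentry event"),
        "labels": {
            "severity": LEVEL_TO_SEVERITY.get(level, "warning"),
            "layer": "k8s",        # matches the tag set in the SDK above
            "component": "api",
            "team": "backend",
            "source": "sentry",
        },
    }
```

Attaching `layer` and `source` here keeps Sentry-originated incidents routable by the same label-based logic as Prometheus and SigNoz alerts.
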
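### S2-3: verify Sprint 2

```bash
# Trigger a test log alert from inside the API pod.
# Output of `kubectl exec` is not part of the container log stream, so write to
# PID 1's stderr instead. The rule needs > 10 ERROR lines/min for 5 min, so emit
# one line every 2 s for 6 min. Assumption: the pipeline parses `level=ERROR`;
# adjust the line to the API's real log format.
kubectl exec -n awoooi-prod deploy/awoooi-api -- python3 -c "
import time
with open('/proc/1/fd/2', 'w') as log:
    for _ in range(180):
        print('level=ERROR TEST_ERROR: Sprint2 validation', file=log, flush=True)
        time.sleep(2)
"
# Expected: Telegram receives the APIHighErrorLogRate alert once the 5 min window has elapsed

# Verify the Sentry webhook
curl -X POST http://192.168.0.121:32334/api/v1/webhooks/sentry \
  -H 'Content-Type: application/json' \
  -d '{"action":"triggered","data":{"issue":{"title":"Test Sentry Event","level":"error"}}}'
# Expected: {"success":true}
```
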
---

## Sprint 3

**Goal**: SentryDown/HarborDown/GiteaDown alerts trigger automatic repair with no human intervention
**Effort**: about 5h
**Priority**: P2

### S3-1: SSH key infrastructure

**Generate a dedicated repair key** (on the local Mac):
```bash
ssh-keygen -t ed25519 -C "awoooi-repair-bot-2026" -f ~/.ssh/awoooi_repair_bot -N ""
```

**Configure a restricted authorized_keys entry on 110**:
```bash
# Add to /home/wooo/.ssh/authorized_keys (the key can only run the repair script)
command="/home/wooo/bin/repair-bot-110.sh",no-port-forwarding,no-X11-forwarding,no-agent-forwarding ssh-ed25519 AAAA... awoooi-repair-bot-2026
```

**Configure a restricted authorized_keys entry on 188 (ollama user)**:
```bash
command="/home/ollama/bin/repair-bot-188.sh",no-port-forwarding,no-X11-forwarding,no-agent-forwarding ssh-ed25519 AAAA... awoooi-repair-bot-2026
```

**Store the key in a K8s Secret**:
```bash
# Use $HOME: the shell does not expand "~" in the middle of a --from-file=key=path argument
kubectl create secret generic awoooi-repair-ssh-key \
  -n awoooi-prod \
  --from-file=id_ed25519="$HOME/.ssh/awoooi_repair_bot" \
  --from-file=id_ed25519.pub="$HOME/.ssh/awoooi_repair_bot.pub"
```

### S3-2: repair script whitelist

**`/home/wooo/bin/repair-bot-110.sh`** (110 host):
```bash
#!/bin/bash
# Repair-bot whitelist script: only known repair commands are allowed
# 2026-04-05 Claude Code: Sprint 3 Host Auto-Repair
set -o pipefail  # a failing docker compose must fail the pipeline, not be masked by tail

# Format: name:command:working directory
ALLOWED_COMMANDS=(
  "sentry:docker compose up -d:/opt/sentry"  # install path stated in S1-4
  "harbor:docker compose up -d:/home/wooo/harbor/harbor"
  "gitea:docker compose up -d:/home/wooo/gitea"
  "gitea-runner:docker compose up -d:/home/wooo/act-runner"
  "langfuse:docker compose up -d:/home/wooo/langfuse"
  "alertmanager:docker compose up -d:/home/wooo/monitoring"
)

# sshd puts the command requested by the client into SSH_ORIGINAL_COMMAND
CMD="${SSH_ORIGINAL_COMMAND:-}"
for entry in "${ALLOWED_COMMANDS[@]}"; do
  name="${entry%%:*}"; rest="${entry#*:}"; command="${rest%%:*}"; dir="${rest#*:}"
  if [ "$CMD" = "repair:$name" ]; then
    if cd "$dir" && $command 2>&1 | tail -5; then
      echo "REPAIR_OK:$name"
      exit 0
    fi
    echo "REPAIR_FAIL:$name"
    exit 1
  fi
done

echo "REPAIR_DENIED:$CMD"
exit 1
```

**`/home/ollama/bin/repair-bot-188.sh`** (188 host):
```bash
#!/bin/bash
set -o pipefail  # a failing docker compose must fail the pipeline, not be masked by tail

# Format: name:command:working directory (systemctl entries have no directory)
ALLOWED_COMMANDS=(
  "openclaw:docker compose up -d:/home/ollama/clawbot-v5"
  "minio:docker compose up -d:/home/ollama/minio"
  "signoz:docker compose up -d:/home/ollama/signoz/deploy/docker"
  "redis:systemctl restart redis-server"
  "nginx:systemctl restart nginx"
)

CMD="${SSH_ORIGINAL_COMMAND:-}"
for entry in "${ALLOWED_COMMANDS[@]}"; do
  name="${entry%%:*}"; rest="${entry#*:}"; command="${rest%%:*}"; dir="${rest#*:}"
  if [ "$CMD" = "repair:$name" ]; then
    if [[ "$command" == systemctl* ]]; then
      # -n: fail instead of prompting; requires passwordless sudo for these commands
      sudo -n $command; status=$?
    else
      cd "$dir" && $command 2>&1 | tail -5; status=$?
    fi
    [ "$status" -eq 0 ] && echo "REPAIR_OK:$name" || echo "REPAIR_FAIL:$name"
    exit "$status"  # propagate failure so the caller can tell repair did not work
  fi
done

echo "REPAIR_DENIED:$CMD"
exit 1
```

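
The core of both scripts is a lookup from the client-supplied string `repair:<name>` to a fixed command, with everything else denied. The Python sketch below restates that lookup so the deny-by-default behaviour is easy to test in isolation. It is an illustration, not a replacement for the shell scripts; the `Repair` tuple, `WHITELIST_188`, and `resolve` are hypothetical names, and the `sudo -n` prefix is an assumption about how the systemd entries would be run.

```python
"""Sketch: the whitelist lookup of the repair-bot scripts, expressed in Python."""
from typing import NamedTuple, Optional


class Repair(NamedTuple):
    command: list[str]
    workdir: Optional[str]  # None for systemctl entries, which need no directory


# Entries mirror the 188 whitelist.
WHITELIST_188 = {
    "openclaw": Repair(["docker", "compose", "up", "-d"], "/home/ollama/clawbot-v5"),
    "minio": Repair(["docker", "compose", "up", "-d"], "/home/ollama/minio"),
    "signoz": Repair(["docker", "compose", "up", "-d"], "/home/ollama/signoz/deploy/docker"),
    "redis": Repair(["sudo", "-n", "systemctl", "restart", "redis-server"], None),
    "nginx": Repair(["sudo", "-n", "systemctl", "restart", "nginx"], None),
}


def resolve(original_command: str, whitelist: dict[str, Repair]) -> Optional[Repair]:
    """Map `repair:<name>` to a whitelisted action; anything else is denied (None)."""
    prefix = "repair:"
    if not original_command.startswith(prefix):
        return None
    # Exact dictionary lookup: extra text after the name never matches an entry.
    return whitelist.get(original_command[len(prefix):])
```

Because the match is an exact lookup and the command is a fixed argument list, input such as `repair:redis; rm -rf /` resolves to `None` rather than being passed to a shell.
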
### S3-3: HostRepairAgent (new Python module)

**Create** `apps/api/src/services/host_repair_agent.py`:

```python
"""
Host Repair Agent
=================
Runs host-layer (Docker/systemd) repair actions over SSH.
Security design: the authorized_keys command= restriction allows only whitelisted repair commands.

Layer mapping:
  docker-110  → SSH wooo@192.168.0.110 repair:<component>
  docker-188  → SSH ollama@192.168.0.188 repair:<component>
  systemd-188 → SSH ollama@192.168.0.188 repair:<component>
"""

LAYER_SSH_CONFIG = {
    "docker-110": {"user": "wooo", "host": "192.168.0.110"},
    "docker-188": {"user": "ollama", "host": "192.168.0.188"},
    "systemd-188": {"user": "ollama", "host": "192.168.0.188"},
}
```

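
The spec defines the SSH target per layer but not the `repair` coroutine itself. The following is one possible shape, offered as a sketch under stated assumptions rather than the final module: `KEY_PATH` is an assumed mount point for the K8s Secret, `RepairResult` and `build_ssh_argv` are hypothetical helpers, and success is judged by the exit code plus the `REPAIR_OK:<component>` marker that the whitelist scripts print.

```python
"""Sketch: one possible shape for HostRepairAgent.repair (not the final module)."""
import asyncio
import re
from dataclasses import dataclass

LAYER_SSH_CONFIG = {
    "docker-110": {"user": "wooo", "host": "192.168.0.110"},
    "docker-188": {"user": "ollama", "host": "192.168.0.188"},
    "systemd-188": {"user": "ollama", "host": "192.168.0.188"},
}
KEY_PATH = "/etc/awoooi/repair-ssh/id_ed25519"  # assumed mount point of the K8s Secret
COMPONENT_RE = re.compile(r"[a-z0-9-]+")


@dataclass
class RepairResult:
    ok: bool
    output: str


def build_ssh_argv(layer: str, component: str) -> list[str]:
    """Build the ssh command line; raises on unknown layers or odd component names."""
    if layer not in LAYER_SSH_CONFIG:
        raise ValueError(f"unsupported layer: {layer!r}")
    if not COMPONENT_RE.fullmatch(component or ""):
        raise ValueError(f"invalid component: {component!r}")
    cfg = LAYER_SSH_CONFIG[layer]
    return [
        "ssh", "-i", KEY_PATH,
        "-o", "BatchMode=yes",       # never prompt for a password
        "-o", "ConnectTimeout=10",
        f"{cfg['user']}@{cfg['host']}",
        f"repair:{component}",       # arrives on the host as SSH_ORIGINAL_COMMAND
    ]


async def repair(layer: str, component: str, timeout: float = 120.0) -> RepairResult:
    """Run the whitelisted repair over SSH and report whether it succeeded."""
    proc = await asyncio.create_subprocess_exec(
        *build_ssh_argv(layer, component),
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )
    try:
        raw, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
        return RepairResult(ok=False, output="timeout")
    text = raw.decode(errors="replace")
    ok = proc.returncode == 0 and f"REPAIR_OK:{component}" in text
    return RepairResult(ok=ok, output=text)
```

Passing the command as an argument list (no shell) and validating `component` keeps label values from alerts out of any shell context. In a pod, the host keys of 110 and 188 would also need to be provisioned in `known_hosts`, which this sketch does not cover.
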
**Add an ActionType** (`apps/api/src/models/playbook.py`):
```python
from enum import Enum


class ActionType(str, Enum):
    KUBECTL = "kubectl"
    SCRIPT = "script"
    MANUAL = "manual"
    SSH_COMMAND = "ssh_command"  # ← new: host-layer repair
```

**Routing logic in auto_repair_service.py**:
```python
layer = incident.labels.get("layer", "k8s")
if layer == "k8s":
    # existing kubectl path
    ...
elif layer in ("docker-110", "docker-188", "systemd-188"):
    component = incident.labels.get("component")
    result = await host_repair_agent.repair(layer=layer, component=component)
```

### S3-4: create the Playbooks

Create the following Playbooks (stored in PostgreSQL, created through the API):

| Playbook | Triggering alert | Layer | Repair command |
|---------|---------|-------|---------|
| `sentry-down-repair` | SentryDown | docker-110 | `repair:sentry` |
| `harbor-down-repair` | HarborDown | docker-110 | `repair:harbor` |
| `gitea-down-repair` | GiteaDown | docker-110 | `repair:gitea` |
| `gitea-runner-repair` | GiteaRunnerOffline | docker-110 | `repair:gitea-runner` |
| `openclaw-down-repair` | OpenClawDown | docker-188 | `repair:openclaw` |
| `alertmanager-down-repair` | AlertmanagerDown | docker-110 | `repair:alertmanager` |

### S3-5: E2E closed-loop verification

```bash
# Manually trigger a SentryDown test
# 1. Stop Sentry (installed at /opt/sentry, see S1-4)
ssh wooo@192.168.0.110 "cd /opt/sentry && docker compose stop"

# 2. Wait for Prometheus to detect the outage (the rule uses for: 2m)
sleep 130

# 3. Verify that Telegram received the alert
# (manual confirmation)

# 4. Verify that auto-repair was triggered
# auto_repair_service should run SSH repair:sentry within 3 min

# 5. Verify that Sentry recovered
curl -s http://192.168.0.110:9000/api/0/projects/
# Expected: Sentry responds normally

# 6. Verify that the alert resolved
# Telegram should receive a "Sentry was repaired automatically" notification
```

---

## Reboot SOP integration updates

REBOOT-RECOVERY-SOP.md v4.0 must include the following updates:

### Update 1: add Step 7 (Sentry) to startup-110.sh

```bash
# STEP 7: Sentry (Error Tracking)
log "[7/7] Starting Sentry..."
SENTRY_DIR="/opt/sentry"
if [ -d "$SENTRY_DIR" ]; then
  cd "$SENTRY_DIR"
  docker compose up -d 2>&1 | tail -5
  sleep 15
  if curl -sf --max-time 10 http://localhost:9000/api/0/projects/ >/dev/null 2>&1; then
    log "✅ Sentry healthy"
  else
    log "⚠️ Sentry not ready yet; waiting 30 seconds..."
    sleep 30
    curl -sf --max-time 10 http://localhost:9000/api/0/projects/ >/dev/null 2>&1 && log "✅ Sentry ready" || log "❌ Sentry not ready; manual check required"
  fi
else
  log "⚠️ Sentry directory not found: $SENTRY_DIR"
fi
```

### Update 2: add Sentry and rule-count checks to the E2E verification script

```bash
# Add to the E2E verification script
check "110 Sentry" "curl -s -o /dev/null -w '%{http_code}' http://192.168.0.110:9000/api/0/projects/" "200"
# Prints OK or LOW:<n>. The earlier pattern "[^LOW]" matched any single character
# other than L, O, or W, so a low count such as "LOW:13" still passed.
check "Prometheus rule count >40" "ssh wooo@192.168.0.110 'curl -s http://localhost:9090/api/v1/rules' | python3 -c \"import sys,json; r=json.load(sys.stdin); n=sum(len(g['rules']) for g in r['data']['groups']); print('OK' if n > 40 else 'LOW:'+str(n))\"" "OK"
```

### Update 3: add a "rules not deployed" diagnosis to alert-chain troubleshooting

```
Silent-alert diagnosis tree (addition):
  Alertmanager healthy ✅
  webhook URL correct ✅
  curl POST from 110 succeeds ✅
  but a service is down and no alert fired?
    ↓
  Do the Prometheus rules cover that service?
  curl http://192.168.0.110:9090/api/v1/rules | grep -i "<component>"
    ├── not found → rules not deployed
    │     cd /path/to/awoooi && bash scripts/deploy-alerts.sh
    └── found but inactive → rule condition not met; check expr
```

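
The last branch of the tree can also be evaluated programmatically from the `/api/v1/rules` response, where each alerting rule carries a `state` of `inactive`, `pending`, or `firing`. The sketch below is illustrative only: `diagnose` and its return strings are made-up names, and unlike the `grep` above it matches on the rule name alone rather than on any text in the response.

```python
"""Sketch: the 'rules not deployed' branch of the diagnosis tree as a function."""


def diagnose(payload: dict, component: str) -> str:
    """Classify why an outage of `component` produced no alert."""
    needle = component.lower()
    states = [
        rule.get("state", "unknown")
        for group in payload["data"]["groups"]
        for rule in group["rules"]
        if rule.get("type") == "alerting" and needle in rule["name"].lower()
    ]
    if not states:
        return "not-deployed"   # no matching rule: sync the rules with the deploy script
    if "firing" in states or "pending" in states:
        return "rule-active"    # the rule works; look downstream (Alertmanager, webhook)
    return "inactive"           # rule exists but its condition is not met; check expr
```

For example, a response that contains `SentryDown` in state `inactive` while Sentry is actually down points at the rule expression (target address, `for:` duration) rather than at deployment.
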
### Update 4: add Sentry to the host dependency graph

```
192.168.0.110 (DevOps host)
├── ...
└── Sentry :9000 ← Error Tracking (added in Sprint 1)
```

---

## Closed-loop acceptance criteria

**Acceptance criteria that define "system-wide self-healing closed loop complete"**:

| Acceptance item | Test method | Expected result |
|---------|---------|---------|
| More than 40 Prometheus rules | API query | ✅ |
| SentryDown rule exists | Prometheus API query | ✅ |
| Sentry service healthy | HTTP 200 | ✅ |
| CD syncs alerts.yml automatically | git push → verify that the rule count on the 110 Prometheus updates | ✅ |
| SigNoz log alert fires | Inject ERROR logs → Telegram alert | ✅ |
| Sentry webhook works | Send a test event → Telegram alert | ✅ |
| SSH repair key configured | Run SSH `repair:test` from a K8s pod | ✅ |
| SentryDown auto-repair E2E | Stop Sentry → wait 5 min → Sentry recovers automatically | ✅ |
| HarborDown auto-repair E2E | Stop Harbor → wait 5 min → Harbor recovers automatically | ✅ |

---

## Implementation schedule

| Sprint | Scope | Effort | Priority | Depends on |
|--------|------|------|--------|------|
| **S0** | Establish the label convention (section S0 of this document) | 0h (design complete) | P0 | - |
| **S1** | Prometheus rule deployment + Sentry startup + CD sync | 4h | P0 | S0 |
| **S2** | SigNoz log alerts + Sentry integration | 2h | P1 | S1 |
| **S3** | SSH repair infrastructure + HostRepairAgent + Playbooks | 5h | P2 | S1 |
| **SOP** | SOP v4.0 integration updates (Sentry + verification script + diagnosis tree) | 1h | P1 | S1 |
| **Acceptance** | System-wide E2E closed-loop verification | 2h | - | S1+S2+S3 |

**Estimated total effort**: ~14h

---

## Version history

| Version | Date | Notes |
|------|------|------|
| v1.0 | 2026-04-05 | Initial version: integrates the reboot recovery, monitoring and alerting, and auto-repair systems |