docs(sop): REBOOT-RECOVERY-SOP.md v4.0

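Updates:
- Add Sentry (/opt/sentry) startup instructions (110 Step 7/9)
- Add a chapter on repairing Sentry corruption after a reboot (PostgreSQL WAL / Redis RDB / ClickHouse parts)
- Extend the silent-alert diagnosis tree with a "rules not deployed" diagnosis and the deploy-alerts.sh fix command
- Add Sentry and Prometheus rule-count (≥25) checks to the E2E verification script
- Add Sentry :9000 to the architecture diagram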

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OG T
2026-04-05 03:11:27 +08:00
parent 4ba62132e2
commit 91564c6ea3


@@ -1,9 +1,9 @@
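# AWOOOI Reboot Recovery SOP
> **Version**: v3.0
> **Last updated**: 2026-04-05 (Taipei time)
> **Version**: v4.0
> **Last updated**: 2026-04-05 afternoon (Taipei time)
> **Updated by**: Claude Code (Chief Architect)
> **Trigger**: full inventory after two reboot incidents + root-cause fix of the alert chain + Gitea Runner automation
> **Trigger**: unified Prometheus rule deployment + Sentry startup automation + Sentry corruption repair SOP + system-wide self-healing loop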
---
@@ -44,6 +44,7 @@
├── Prometheus :9090 ← Metrics collection
├── Alertmanager :9093 ← Alert routing
├── Grafana :3002 ← Monitoring dashboards
├── Sentry :9000 ← Error Tracking (2026-03-24; added to startup 2026-04-05)
└── SignOz ← Observability
192.168.0.120 (K3s Master - mon)
@@ -141,13 +142,14 @@ Git push → Gitea(110:3001)
- Nginx startup
- SignOz + MinIO + aiops-network + OpenClaw
**110 (6 steps)**:
**110 (7 steps)**:
- Docker BoltDB corruption detection and repair
- Orphan container cleanup (Exited 128)
- Harbor (all components start once harbor-log is healthy)
- Gitea + Langfuse + Monitoring (including Alertmanager health verification)
- SignOz
- **Gitea Act Runner** (automatically clears stale .runner configuration)
- **Sentry** (/opt/sentry, including automatic repair of PostgreSQL WAL and Redis RDB corruption)
---
@@ -317,6 +319,7 @@ systemctl start postgresql@14-main
| 6a | Alert chain verification | Send a test webhook | Wrong webhook URL / NetworkPolicy |
| 7 | SignOz | `curl http://localhost:3301/-/healthy` | ClickHouse starts slowly |
| 8 | Gitea Runner | `docker logs gitea-runner \| grep SUCCESS` | Stale .runner configuration |
| 9 | Sentry | `curl -o /dev/null -w '%{http_code}' http://localhost:9000/` | PostgreSQL WAL / Redis RDB corruption (see below) |
**Harbor Exited 128 repair**:
```bash
@@ -382,6 +385,28 @@ cd /home/wooo/act-runner && docker compose up -d
└── {"success":true} → Check the Telegram Bot Token
```
**Additional diagnosis: a specific service is failing but no alert fires (Alertmanager is healthy)**:
```bash
# Check the number of Prometheus rules and whether the key rules exist
ssh wooo@192.168.0.110 "curl -s http://localhost:9090/api/v1/rules | python3 -c \"
import sys,json
r=json.load(sys.stdin)
names=[x['name'] for g in r['data']['groups'] for x in g['rules']]
total=len(names)
key=['SentryDown','HarborDown','GiteaDown','OpenClawDown','AlertmanagerDown','AlertChainUnhealthy']
for k in key:
    print(('OK' if k in names else 'MISSING') + ': ' + k)
print(f'Total rules: {total}')
\""
# If a rule is missing or total < 25 → the rules were not deployed
# Fix: run from the project root
bash scripts/ops/deploy-alerts.sh
# If the rules exist but are inactive → check the blackbox probe targets
curl -s 'http://192.168.0.110:9090/api/v1/query?query=probe_success'
```
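The `total < 25` threshold above can also be tested numerically, which avoids reading the count by eye or matching it against a pattern. The following is a minimal sketch, assuming `python3` is available on the host as in the command above; the helper names `count_rules` and `rules_deployed` are hypothetical and are not part of the repository.

```shell
# Hypothetical helpers (not part of scripts/ops): count Prometheus rules numerically.

# Read the JSON returned by /api/v1/rules on stdin and print the total rule count.
count_rules() {
  python3 -c '
import sys, json
data = json.load(sys.stdin)
print(sum(len(g["rules"]) for g in data["data"]["groups"]))
'
}

# Succeed (exit 0) when the rule count read from stdin is at least $1.
rules_deployed() (
  min="$1"
  total="$(count_rules)" || exit 2
  [ "$total" -ge "$min" ]
)
```

With these defined, `curl -s http://localhost:9090/api/v1/rules | rules_deployed 25 || bash scripts/ops/deploy-alerts.sh` would redeploy only when fewer than 25 rules are loaded.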
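**Root lesson (2026-04-05)**:
- Old architecture: Alertmanager → OpenClaw 8088 (an intermediate layer; if OpenClaw is unhealthy after a reboot, alerts go silent)
- New architecture: Alertmanager → AWOOOI API directly (more stable, fewer single points of failure)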
@@ -436,7 +461,53 @@ cd /home/wooo/act-runner && docker compose up -d
└── Rules exist → NetworkPolicy or firewall
```
### 5. Harbor ImagePullBackOff
### 5. Repairing Sentry corruption after a reboot (incident record, 2026-04-05)
**Symptom**: sentry-self-hosted-postgres-1 / redis-1 / clickhouse-1 stay in a `Restarting` loop
**Cause**: the databases were not shut down cleanly during the reboot, which corrupted their data:
- PostgreSQL: corrupted WAL (`could not locate a valid checkpoint record`)
- Redis: corrupted dump.rdb (`Wrong signature trying to load DB from file`)
- ClickHouse: corrupted system table store parts (`broken and needs manual correction`)
**Repair steps**:
```bash
# Step 1: Reset the PostgreSQL WAL (the database may contain inconsistent data afterwards)
docker run --rm -u 999 -v sentry-postgres:/var/lib/postgresql/data \
  postgres:14 pg_resetwal -f /var/lib/postgresql/data
echo "PostgreSQL WAL reset complete"
# Step 2: Repair Redis (session cache, safe to lose)
docker run --rm --user root -v sentry-redis:/data alpine \
  sh -c 'rm -f /data/dump.rdb && echo cleared'
echo "Redis dump.rdb cleared"
# Step 3: Find the corrupted ClickHouse store directories
docker logs sentry-self-hosted-clickhouse-1 2>&1 \
| grep -oE 'store/[a-f0-9]{3}/[a-f0-9-]+' \
| sort -u
# Step 4: Delete the corrupted directories listed above (use the actual UUIDs; the values below are examples)
for uuid in "469/469957c9-..." "57d/57d1d567-..."; do
docker run --rm --user root -v sentry-clickhouse:/data alpine \
sh -c "rm -rf /data/store/$uuid && echo OK:$uuid"
done
# Step 5: Restart Sentry
cd /opt/sentry && docker compose up -d
# Step 6: Wait 2-3 minutes, then verify
sleep 120
curl -o /dev/null -w "%{http_code}" http://192.168.0.110:9000/
# Expected: 200 / 302 / 400 (login required)
```
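Step 6 waits a fixed 120 seconds; polling until Sentry answers is one alternative. The sketch below is an assumption rather than what startup-110.sh actually does, and the names `sentry_code_ok` and `wait_for_sentry`, the 10-second interval, and the 18-attempt default are all hypothetical. It treats any 2xx or 3xx status, or 400, as success, which is the same set the E2E check accepts.

```shell
# Hypothetical helpers: poll Sentry instead of sleeping for a fixed time.

# Succeed when an HTTP status code counts as "Sentry is answering":
# any 2xx or 3xx, or 400.
sentry_code_ok() {
  case "$1" in
    2[0-9][0-9]|3[0-9][0-9]|400) return 0 ;;
    *) return 1 ;;
  esac
}

# Poll URL ($1) every 10 seconds, up to $2 attempts (default 18, about 3 minutes).
wait_for_sentry() (
  url="$1"
  attempts="${2:-18}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # curl reports 000 when the connection fails, which sentry_code_ok rejects.
    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")"
    if sentry_code_ok "$code"; then
      echo "Sentry answered with HTTP $code"
      exit 0
    fi
    i=$((i + 1))
    sleep 10
  done
  echo "Sentry did not answer after $attempts attempts" >&2
  exit 1
)
```

A call such as `wait_for_sentry http://192.168.0.110:9000/` returns as soon as Sentry responds and fails after roughly three minutes otherwise.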
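**Automated in startup-110.sh**: PostgreSQL WAL and Redis RDB corruption is detected and repaired automatically. Corrupted ClickHouse parts still require identifying the UUIDs by hand, because the UUIDs differ each time.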
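Because the ClickHouse step stays manual, it may help to validate each path before it reaches `rm -rf`: an empty or truncated value would otherwise expand to `/data/store/` itself. The following is a minimal sketch, assuming the `store/<first 3 hex chars>/<UUID>` layout shown in the examples above; `list_store_paths` and `is_store_path` are hypothetical names, not existing scripts.

```shell
# Hypothetical helpers: list broken ClickHouse store paths and validate them
# before anything is deleted.

# Read container logs on stdin and print unique "<3 hex>/<uuid>" paths.
list_store_paths() {
  grep -oE 'store/[a-f0-9]{3}/[a-f0-9-]+' | sed 's|^store/||' | sort -u
}

# Succeed only for "<prefix>/<full UUID>" where the prefix equals
# the first three characters of the UUID.
is_store_path() (
  p="$1"
  printf '%s\n' "$p" \
    | grep -Eq '^[a-f0-9]{3}/[a-f0-9]{8}(-[a-f0-9]{4}){3}-[a-f0-9]{12}$' || exit 1
  prefix="${p%%/*}"
  uuid="${p#*/}"
  case "$uuid" in
    "$prefix"*) exit 0 ;;
    *) exit 1 ;;
  esac
)
```

`docker logs sentry-self-hosted-clickhouse-1 2>&1 | list_store_paths` prints the candidates, and calling `is_store_path "$uuid"` inside the Step 4 loop would skip any value that is not a complete store path.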
---
### 6. Harbor ImagePullBackOff
```
Diagnosis:
@@ -492,6 +563,8 @@ check "110 Harbor" "curl -s -o /dev/null -w '%{http_code}' http://192.168.0.110:
check "110 Gitea" "curl -s -o /dev/null -w '%{http_code}' http://192.168.0.110:3001" "200"
check "110 Alertmanager" "curl -s http://192.168.0.110:9093/-/healthy" "OK"
check "110 Gitea Runner" "curl -s -u 'wooo:TOKEN' 'http://192.168.0.110:3001/api/v1/repos/wooo/awoooi/actions/runners' | python3 -c 'import sys,json; d=json.load(sys.stdin); print(d.get(\"total_count\",0))'" "1"
check "110 Sentry" "curl -s -o /dev/null -w '%{http_code}' --max-time 10 http://192.168.0.110:9000/" "[23][0-9][0-9]\|400"
check "Prometheus rules ≥25" "ssh wooo@192.168.0.110 'curl -s http://localhost:9090/api/v1/rules' | python3 -c \"import sys,json; r=json.load(sys.stdin); print(sum(len(g['rules']) for g in r['data']['groups']))\"" "[2-9][5-9]\|[3-9][0-9]"
# K3s
check "K3s Nodes Ready" "ssh wooo@192.168.0.120 'kubectl get nodes --no-headers' | awk '{print \$2}'" "Ready"
@@ -519,3 +592,4 @@ echo "=== 結果: ${PASS} 通過, ${FAIL} 失敗 ==="
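| v1.0 | 2026-04-04 | Initial version: containerd + Docker + PostgreSQL WAL repair procedure |
| v2.0 | 2026-04-05 morning | 188 startup script + 110 startup script + Harbor race condition fix |
| v3.0 | 2026-04-05 afternoon | Full architecture re-inventory + Gitea Runner automation + alert chain root-cause fix + NetworkPolicy correction |
| v4.0 | 2026-04-05 afternoon | Unified Prometheus rule deployment (28 rules) + Sentry startup (Step 7) + Sentry corruption repair SOP + "rules not deployed" diagnosis tree + Sentry/Prometheus checks added to the E2E script |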