docs(sop): REBOOT-RECOVERY-SOP.md v4.0
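Updates:
- Add Sentry /opt/sentry startup instructions (110 Step 7/9)
- Add a section on repairing Sentry corruption after a reboot (PostgreSQL WAL / Redis RDB / ClickHouse parts)
- Extend the silent-alert diagnosis tree with a "rules not deployed" check and the deploy-alerts.sh fix command
- Add Sentry and Prometheus rule-count (≥25) checks to the E2E verification script
- Add Sentry :9000 to the architecture diagram

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>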
@@ -1,9 +1,9 @@
# AWOOOI Reboot Recovery SOP

> **Version**: v3.0
> **Last updated**: 2026-04-05 (Taipei time)
> **Version**: v4.0
> **Last updated**: 2026-04-05 afternoon (Taipei time)
> **Updated by**: Claude Code (Chief Architect)
> **Trigger**: Full inventory after two reboot incidents + alert chain root-cause fix + Gitea Runner automation
> **Trigger**: Unified Prometheus rule deployment + Sentry startup automation + Sentry corruption repair SOP + system-wide self-healing loop

---

@@ -44,6 +44,7 @@
├── Prometheus :9090 ← metrics collection
├── Alertmanager :9093 ← alert routing
├── Grafana :3002 ← monitoring dashboards
├── Sentry :9000 ← Error Tracking (2026-03-24; added to startup on 2026-04-05)
└── SignOz ← observability

192.168.0.120 (K3s Master - mon)
@@ -141,13 +142,14 @@ Git push → Gitea(110:3001)
- Nginx startup
- SignOz + MinIO + aiops-network + OpenClaw

**110 (6 steps)**:
**110 (7 steps)**:
- Docker BoltDB corruption detection and repair
- Orphaned container cleanup (Exited 128)
- Harbor (all components start once harbor-log is healthy)
- Gitea + Langfuse + Monitoring (including Alertmanager health verification)
- SignOz
- **Gitea Act Runner** (automatically clears stale .runner configuration)
- **Sentry** (/opt/sentry, including automatic repair of PostgreSQL WAL and Redis RDB corruption)

---

@@ -317,6 +319,7 @@ systemctl start postgresql@14-main
| 6a | Alert chain verification | Send a test webhook | Wrong webhook URL / NetworkPolicy |
| 7 | SignOz | `curl http://localhost:3301/-/healthy` | ClickHouse is slow to start |
| 8 | Gitea Runner | `docker logs gitea-runner \| grep SUCCESS` | Stale .runner configuration |
| 9 | Sentry | `curl -o /dev/null -w '%{http_code}' http://localhost:9000/` | PostgreSQL WAL / Redis RDB corruption (see below) |
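
The Sentry check in row 9 only prints a status code, and this SOP elsewhere treats 200, 302 and 400 as healthy responses. A minimal sketch of that decision as a reusable function, with a polling wrapper that retries rather than sleeping for a fixed time. Both `sentry_code_ok` and `wait_for_sentry` are hypothetical names, and the retry count and delay are assumed defaults, none of which come from startup-110.sh.

```bash
# Hypothetical helper: succeed when an HTTP status code counts as "Sentry is up".
# Assumption: 200, 302 and 400 are the acceptable codes, as listed in this SOP.
sentry_code_ok() {
  case "$1" in
    200|302|400) return 0 ;;
    *)           return 1 ;;
  esac
}

# Hypothetical polling wrapper: retry until the code is acceptable or tries run out.
# Arguments: URL, number of tries (default 18), delay in seconds (default 10).
wait_for_sentry() {
  local url="$1" tries="${2:-18}" delay="${3:-10}" code="" i=0
  while [ "$i" -lt "$tries" ]; do
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")
    if sentry_code_ok "$code"; then
      echo "up ($code)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "down (last code: $code)"
  return 1
}
```

For example, `wait_for_sentry http://localhost:9000/` would poll for up to about three minutes with the assumed defaults.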

**Harbor Exited 128 fix**:
```bash
@@ -382,6 +385,28 @@ cd /home/wooo/act-runner && docker compose up -d
└── {"success":true} → check the Telegram Bot Token
```

**Supplementary diagnosis: a specific service is failing but no alert fires (Alertmanager is healthy)**:
```bash
# Confirm the Prometheus rule count and that the key rules exist
ssh wooo@192.168.0.110 "curl -s http://localhost:9090/api/v1/rules | python3 -c \"
import sys,json
r=json.load(sys.stdin)
names=[x['name'] for g in r['data']['groups'] for x in g['rules']]
total=len(names)
key=['SentryDown','HarborDown','GiteaDown','OpenClawDown','AlertmanagerDown','AlertChainUnhealthy']
for k in key:
    # Computed outside the f-string: nested identical quotes fail before Python 3.12
    status='OK' if k in names else 'MISSING'
    print(f'{status}: {k}')
print(f'Total rules: {total}')
\""

# If a rule is missing or total < 25 → the rules have not been deployed
# Fix: run from the project root
bash scripts/ops/deploy-alerts.sh

# If the rules exist but are inactive → check the blackbox probe target
# (URL quoted so the shell does not interpret the '?')
curl 'http://192.168.0.110:9090/api/v1/query?query=probe_success'
```
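
The decision described in the comments above (a key rule is missing, or the total is below 25, so the rules count as not deployed) can be sketched as two small shell functions. This is only an illustration under stated assumptions: `missing_rules` and `rules_deployed` are hypothetical names, and the loaded rule names are assumed to arrive on stdin, one per line.

```bash
# Hypothetical helper: print every expected rule name (given as arguments)
# that is absent from the loaded rule names read from stdin, one per line.
missing_rules() {
  local loaded name
  loaded=$(cat)
  for name in "$@"; do
    # -x matches the whole line, -F treats the name as a fixed string
    printf '%s\n' "$loaded" | grep -qxF -- "$name" || printf '%s\n' "$name"
  done
}

# Hypothetical decision wrapper: rules count as deployed only when nothing is
# missing and the total reaches the minimum (25, the threshold used in this SOP).
rules_deployed() {
  local total="$1" missing="$2" min="${3:-25}"
  [ -z "$missing" ] && [ "$total" -ge "$min" ]
}
```

If `rules_deployed` fails, the fix is the `deploy-alerts.sh` run shown above.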
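
**Root lesson (2026-04-05)**:
- Old architecture: Alertmanager → OpenClaw 8088 (an intermediate layer; if OpenClaw was unhealthy after a reboot, alerts went silent)
- New architecture: Alertmanager → AWOOOI API directly (more stable, with fewer single points of failure)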
@@ -436,7 +461,53 @@ cd /home/wooo/act-runner && docker compose up -d
└── Rules present → NetworkPolicy or firewall
```

### 5. Harbor ImagePullBackOff
### 5. Repairing Sentry corruption after a reboot (2026-04-05 incident record)

**Symptom**: sentry-self-hosted-postgres-1 / redis-1 / clickhouse-1 stuck in Restarting

**Cause**: The databases were not shut down cleanly during the reboot, which corrupted their data:
- PostgreSQL: corrupted WAL (`could not locate a valid checkpoint record`)
- Redis: corrupted dump.rdb (`Wrong signature trying to load DB from file`)
- ClickHouse: corrupted system table store parts (`broken and needs manual correction`)

**Repair steps**:

```bash
# Step 1: Repair the PostgreSQL WAL
# (pg_resetwal is a last resort; recently committed transactions may be lost)
docker run --rm -u 999 -v sentry-postgres:/var/lib/postgresql/data \
  postgres:14 pg_resetwal -f /var/lib/postgresql/data
echo "PostgreSQL WAL reset complete"

# Step 2: Repair Redis (session cache, safe to lose)
docker run --rm --user root -v sentry-redis:/data alpine \
  sh -c 'rm -f /data/dump.rdb && echo cleared'
echo "Redis dump.rdb cleared"

# Step 3: Find the corrupted ClickHouse store directories
docker logs sentry-self-hosted-clickhouse-1 2>&1 \
  | grep -oE 'store/[a-f0-9]{3}/[a-f0-9-]+' \
  | sort -u

# Step 4: Delete the corrupted directories reported above (go by the UUIDs; the values below are examples)
for uuid in "469/469957c9-..." "57d/57d1d567-..."; do
  docker run --rm --user root -v sentry-clickhouse:/data alpine \
    sh -c "rm -rf /data/store/$uuid && echo OK:$uuid"
done

# Step 5: Restart Sentry
cd /opt/sentry && docker compose up -d

# Step 6: Wait 2-3 minutes, then verify
sleep 120
curl -s -o /dev/null -w "%{http_code}" http://192.168.0.110:9000/
# Expected: 200 / 302 / 400 (login required)
```
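
**Automated in startup-110.sh**: PostgreSQL WAL and Redis RDB corruption are detected and repaired automatically. Corrupted ClickHouse parts still require identifying the UUIDs by hand, because the UUIDs differ each time.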
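
As an illustration of what such detection could look like, the sketch below maps container log text to the repairs it calls for, using the three error strings quoted in this section. It is not the logic in startup-110.sh, whose implementation is not shown here; the function name `classify_sentry_damage` and the labels it prints are made up for this example.

```bash
# Hypothetical helper: read container logs on stdin and print one label per
# kind of corruption found. The signatures are the error strings quoted in this
# SOP; the labels (postgres-wal, redis-rdb, clickhouse-parts) are assumptions.
classify_sentry_damage() {
  local logs
  logs=$(cat)
  case "$logs" in
    *"could not locate a valid checkpoint record"*) echo postgres-wal ;;
  esac
  case "$logs" in
    *"Wrong signature trying to load DB from file"*) echo redis-rdb ;;
  esac
  case "$logs" in
    *"broken and needs manual correction"*) echo clickhouse-parts ;;
  esac
}
```

It could be fed with, for example, `docker logs sentry-self-hosted-postgres-1 2>&1 | classify_sentry_damage`. A `clickhouse-parts` result still leads to the manual Steps 3 and 4.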

---

### 6. Harbor ImagePullBackOff

```
Diagnosis:
@@ -492,6 +563,8 @@ check "110 Harbor" "curl -s -o /dev/null -w '%{http_code}' http://192.168.0.110:
check "110 Gitea" "curl -s -o /dev/null -w '%{http_code}' http://192.168.0.110:3001" "200"
check "110 Alertmanager" "curl -s http://192.168.0.110:9093/-/healthy" "OK"
check "110 Gitea Runner" "curl -s -u 'wooo:TOKEN' 'http://192.168.0.110:3001/api/v1/repos/wooo/awoooi/actions/runners' | python3 -c 'import sys,json; d=json.load(sys.stdin); print(d.get(\"total_count\",0))'" "1"
check "110 Sentry" "curl -s -o /dev/null -w '%{http_code}' --max-time 10 http://192.168.0.110:9000/" "[23][0-9][0-9]\|400"
check "Prometheus rules ≥25" "ssh wooo@192.168.0.110 'curl -s http://localhost:9090/api/v1/rules' | python3 -c \"import sys,json; r=json.load(sys.stdin); print(sum(len(g['rules']) for g in r['data']['groups']))\"" "[2-9][5-9]\|[3-9][0-9]"

# K3s
check "K3s Nodes Ready" "ssh wooo@192.168.0.120 'kubectl get nodes --no-headers' | awk '{print \$2}'" "Ready"
@@ -519,3 +592,4 @@ echo "=== 結果: ${PASS} 通過, ${FAIL} 失敗 ==="
| v1.0 | 2026-04-04 | Initial version: containerd + Docker + PostgreSQL WAL repair procedure |
| v2.0 | 2026-04-05 morning | 188 startup script + 110 startup script + Harbor race condition fix |
| v3.0 | 2026-04-05 afternoon | Full architecture re-inventory + Gitea Runner automation + alert chain root-cause fix + NetworkPolicy correction |
| v4.0 | 2026-04-05 afternoon | Unified Prometheus rule deployment (28 rules) + Sentry startup (Step 7) + Sentry corruption repair SOP + "rules not deployed" diagnosis tree + Sentry/Prometheus checks added to the E2E script |