From 91564c6ea30da218990f2fea1b49d8abb6c51c0c Mon Sep 17 00:00:00 2001
From: OG T
Date: Sun, 5 Apr 2026 03:11:27 +0800
Subject: [PATCH] docs(sop): REBOOT-RECOVERY-SOP.md v4.0
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Updates:
- Add Sentry /opt/sentry startup instructions (110 Step 7/9)
- Add a section on repairing Sentry corruption after a reboot (PostgreSQL WAL/Redis RDB/ClickHouse parts)
- Extend the silent-alert diagnosis tree with a "rules not deployed" check and the deploy-alerts.sh fix command
- Add Sentry and Prometheus rule-count (≥25) checks to the E2E verification script
- Add Sentry :9000 to the architecture diagram

Co-Authored-By: Claude Sonnet 4.6
---
 docs/runbooks/REBOOT-RECOVERY-SOP.md | 84 ++++++++++++++++++++++++++--
 1 file changed, 79 insertions(+), 5 deletions(-)

diff --git a/docs/runbooks/REBOOT-RECOVERY-SOP.md b/docs/runbooks/REBOOT-RECOVERY-SOP.md
index f01e32ee..5f72cadf 100644
--- a/docs/runbooks/REBOOT-RECOVERY-SOP.md
+++ b/docs/runbooks/REBOOT-RECOVERY-SOP.md
@@ -1,9 +1,9 @@
 # AWOOOI Reboot Recovery SOP
 
-> **Version**: v3.0
-> **Last updated**: 2026-04-05 (Taipei time)
+> **Version**: v4.0
+> **Last updated**: 2026-04-05 afternoon (Taipei time)
 > **Updated by**: Claude Code (Chief Architect)
-> **Trigger**: Full inventory after two reboot incidents + alert chain root-cause fix + Gitea Runner automation
+> **Trigger**: Unified Prometheus rule deployment + Sentry startup automation + Sentry corruption repair SOP + system-wide self-healing loop
 
 ---
 
@@ -44,6 +44,7 @@
 ├── Prometheus :9090 ← metrics collection
 ├── Alertmanager :9093 ← alert routing
 ├── Grafana :3002 ← monitoring dashboards
+├── Sentry :9000 ← Error Tracking (2026-03-24; added to startup 2026-04-05)
 └── SignOz ← observability
 
 192.168.0.120 (K3s Master - mon)
@@ -141,13 +142,14 @@ Git push → Gitea(110:3001)
 - Nginx startup
 - SignOz + MinIO + aiops-network + OpenClaw
 
-**110 (6 steps)**:
+**110 (7 steps)**:
 - Docker BoltDB corruption detection and repair
 - Orphaned container cleanup (Exited 128)
 - Harbor (all components start once harbor-log is healthy)
 - Gitea + Langfuse + Monitoring (including the Alertmanager health check)
 - SignOz
 - **Gitea Act Runner** (automatically clears a stale .runner config)
+- **Sentry** (/opt/sentry, with automatic repair of PostgreSQL WAL and Redis RDB corruption)
 
 ---
 
@@ -317,6 +319,7 @@ systemctl start postgresql@14-main
 | 6a | Alert chain verification | Send a test webhook | Wrong webhook URL / NetworkPolicy |
 | 7 | SignOz | `curl http://localhost:3301/-/healthy` | ClickHouse slow to start |
 | 8 | Gitea Runner | `docker logs gitea-runner \| grep SUCCESS` | Stale .runner config |
+| 9 | Sentry | `curl -s -o /dev/null -w '%{http_code}' http://localhost:9000/` | PostgreSQL WAL/Redis RDB corruption (see below) |
 
 **Harbor Exited 128 fix**:
 ```bash
@@ -382,6 +385,28 @@ cd /home/wooo/act-runner && docker compose up -d
 └── {"success":true} → check the Telegram Bot Token
 ```
 
+**Additional diagnosis: a specific service is unhealthy but no alert fires (Alertmanager is healthy)**:
+```bash
+# Check the Prometheus rule count and whether the key rules exist
+ssh wooo@192.168.0.110 "curl -s http://localhost:9090/api/v1/rules | python3 -c \"
+import sys,json
+r=json.load(sys.stdin)
+names=[x['name'] for g in r['data']['groups'] for x in g['rules']]
+total=len(names)
+key=['SentryDown','HarborDown','GiteaDown','OpenClawDown','AlertmanagerDown','AlertChainUnhealthy']
+for k in key:
+    print(('OK' if k in names else 'MISSING') + ': ' + k)
+print(f'Total rules: {total}')
+\""
+
+# If a rule is missing or total < 25 → the rules were not deployed
+# Fix: run from the project root
+bash scripts/ops/deploy-alerts.sh
+
+# If the rules exist but are inactive → check the blackbox probe targets
+curl -s 'http://192.168.0.110:9090/api/v1/query?query=probe_success'
+```
+
 **Key lessons (2026-04-05)**:
 - Old architecture: Alertmanager → OpenClaw 8088 (a middle layer; if OpenClaw was unhealthy after a reboot, alerts went silent)
 - New architecture: Alertmanager → AWOOOI API directly (more stable, fewer single points of failure)
@@ -436,7 +461,53 @@ cd /home/wooo/act-runner && docker compose up -d
 └── Rules present → NetworkPolicy or firewall
 ```
 
-### 5. Harbor ImagePullBackOff
+### 5. Sentry corruption repair after reboot (2026-04-05 incident record)
+
+**Symptoms**: sentry-self-hosted-postgres-1 / redis-1 / clickhouse-1 stay in Restarting
+
+**Cause**: the databases were not shut down cleanly during the reboot, which corrupted their data:
+- PostgreSQL: corrupted WAL (`could not locate a valid checkpoint record`)
+- Redis: corrupted dump.rdb (`Wrong signature trying to load DB from file`)
+- ClickHouse: corrupted system table store parts (`broken and needs manual correction`)
+
+**Repair steps**:
+
+```bash
+# Step 1: Repair the PostgreSQL WAL
+docker run --rm -u 999 -v sentry-postgres:/var/lib/postgresql/data \
+  postgres:14 pg_resetwal -f /var/lib/postgresql/data
+echo "PostgreSQL WAL reset complete"
+
+# Step 2: Repair Redis (session cache, safe to lose)
+docker run --rm --user root -v sentry-redis:/data alpine \
+  sh -c 'rm -f /data/dump.rdb && echo cleared'
+echo "Redis dump.rdb cleared"
+
+# Step 3: Find the corrupted ClickHouse store directories
+docker logs sentry-self-hosted-clickhouse-1 2>&1 \
+  | grep -oE 'store/[a-f0-9]{3}/[a-f0-9-]+' \
+  | sort -u
+
+# Step 4: Delete the corrupted directories reported above (use the actual UUIDs; the ones below are examples)
+for uuid in "469/469957c9-..." "57d/57d1d567-..."; do
+  docker run --rm --user root -v sentry-clickhouse:/data alpine \
+    sh -c "rm -rf /data/store/$uuid && echo OK:$uuid"
+done
+
+# Step 5: Restart Sentry
+cd /opt/sentry && docker compose up -d
+
+# Step 6: Wait 2-3 minutes, then verify
+sleep 120
+curl -s -o /dev/null -w "%{http_code}" http://192.168.0.110:9000/
+# Expected: 200 / 302 / 400 (login required)
+```
+
+**Automated in startup-110.sh**: PostgreSQL WAL and Redis RDB corruption are detected and repaired automatically. Corrupted ClickHouse parts still require identifying the UUIDs by hand, because the UUIDs differ each time.
+
+---
+
+### 6. Harbor ImagePullBackOff
 
 ```
 Diagnosis:
@@ -492,6 +563,8 @@ check "110 Harbor" "curl -s -o /dev/null -w '%{http_code}' http://192.168.0.110:
 check "110 Gitea" "curl -s -o /dev/null -w '%{http_code}' http://192.168.0.110:3001" "200"
 check "110 Alertmanager" "curl -s http://192.168.0.110:9093/-/healthy" "OK"
 check "110 Gitea Runner" "curl -s -u 'wooo:TOKEN' 'http://192.168.0.110:3001/api/v1/repos/wooo/awoooi/actions/runners' | python3 -c 'import sys,json; d=json.load(sys.stdin); print(d.get(\"total_count\",0))'" "1"
+check "110 Sentry" "curl -s -o /dev/null -w '%{http_code}' --max-time 10 http://192.168.0.110:9000/" "[23][0-9][0-9]\|400"
+check "Prometheus rules ≥25" "ssh wooo@192.168.0.110 'curl -s http://localhost:9090/api/v1/rules' | python3 -c \"import sys,json; r=json.load(sys.stdin); print(sum(len(g['rules']) for g in r['data']['groups']))\"" "2[5-9]\|[3-9][0-9]\|[0-9]\{3,\}"
 
 # K3s
 check "K3s Nodes Ready" "ssh wooo@192.168.0.120 'kubectl get nodes --no-headers' | awk '{print \$2}'" "Ready"
@@ -519,3 +592,4 @@ echo "=== Result: ${PASS} passed, ${FAIL} failed ==="
 | v1.0 | 2026-04-04 | Initial version: containerd + Docker + PostgreSQL WAL repair procedure |
 | v2.0 | 2026-04-05 morning | 188 startup script + 110 startup script + Harbor race condition fix |
 | v3.0 | 2026-04-05 afternoon | Full architecture re-inventory + Gitea Runner automation + alert chain root-cause fix + NetworkPolicy correction |
+| v4.0 | 2026-04-05 afternoon | Unified Prometheus rule deployment (28 rules) + Sentry startup (Step 7) + Sentry corruption repair SOP + "rules not deployed" diagnosis tree + Sentry/Prometheus checks added to the E2E script |
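
The `Prometheus rules ≥25` and `110 Sentry` checks in this patch match their output against regular expressions. As a possible alternative, the same two conditions can be written as small shell predicates that compare numerically. This is a minimal sketch only: `rule_count_ok` and `http_code_ok` are hypothetical helper names and are not part of the repository's `check` function, whose implementation is not shown in this patch.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not in the repository): numeric versions of the
# two regex-based E2E checks added in this patch.

# rule_count_ok COUNT [MIN]
# Succeeds when COUNT is a non-negative integer >= MIN (default 25).
# Returns 2 when COUNT is empty or not a plain integer.
rule_count_ok() {
  local count="$1" min="${2:-25}"
  case "$count" in
    ''|*[!0-9]*) return 2 ;;   # empty or non-numeric input
  esac
  [ "$count" -ge "$min" ]
}

# http_code_ok CODE
# Accepts the status codes the SOP treats as "Sentry is up":
# any 2xx, any 3xx, or 400 (login required).
http_code_ok() {
  case "$1" in
    2[0-9][0-9]|3[0-9][0-9]|400) return 0 ;;
    *) return 1 ;;
  esac
}
```

Under these assumptions, a caller would pass in the rule count printed by the Python one-liner and the status code printed by `curl -w '%{http_code}'`, for example `rule_count_ok "$count" 25 && echo PASS`. A numeric comparison keeps working if the rule set grows past two digits, which a digit-pattern regex has to be written carefully to handle.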