diff --git a/docs/runbooks/REBOOT-RECOVERY-SOP.md b/docs/runbooks/REBOOT-RECOVERY-SOP.md
index f7b7d2f9..f01e32ee 100644
--- a/docs/runbooks/REBOOT-RECOVERY-SOP.md
+++ b/docs/runbooks/REBOOT-RECOVERY-SOP.md
@@ -1,498 +1,521 @@
-# Reboot Recovery SOP
+# AWOOOI Reboot Recovery SOP

-> Last updated: 2026-04-05 ogt. Fully revised after the second reboot incident; automation scripts added
-> Applies to: AWOOOI five-host architecture
+> **Version**: v3.0
+> **Last updated**: 2026-04-05 (Taipei time)
+> **Updated by**: Claude Code (Chief Architect)
+> **Trigger**: full inventory after two reboot incidents + root-cause fix of the alert pipeline + Gitea Runner automation

 ---

-## 🤖 Automation status (check this first)
+## Table of Contents

-| Host | systemd service | Status | Notes |
-|------|----------------|------|------|
-| 192.168.0.188 | `awoooi-startup.service` | ✅ enabled | Auto-repairs BoltDB + starts all services |
-| 192.168.0.110 | `awoooi-startup-110.service` | ✅ enabled | Auto-repairs BoltDB + starts all services |
-| 192.168.0.120 | `k3s.service` | managed by systemd | Depends on PG being ready, starts automatically |
-| 192.168.0.121 | `k3s.service` | managed by systemd | Starts automatically |
-
-**Under normal conditions, all services should recover on their own 3-5 minutes after a reboot.**
-
-How to confirm (run after the reboot):
-```bash
-# Check the result of the automation scripts
-ssh ollama@192.168.0.188 "sudo journalctl -u awoooi-startup.service -n 30 --no-pager"
-ssh wooo@192.168.0.110 "echo '0936223270' | sudo -S journalctl -u awoooi-startup-110.service -n 30 --no-pager"
-```
+1. [Architecture Overview and Dependency Graph](#architecture-overview-and-dependency-graph)
+2. [Automation Script Status](#automation-script-status)
+3. [Normal Reboot Procedure (planned maintenance)](#normal-reboot-procedure)
+4. [Abnormal Reboot Procedure (emergency recovery)](#abnormal-reboot-procedure)
+5. [Detailed Startup Sequence per Host](#detailed-startup-sequence-per-host)
+6. [Common Troubleshooting Handbook](#common-troubleshooting-handbook)
+7. [E2E Verification Script](#e2e-verification-script)

 ---

-## ⚡ System-wide startup order (dependencies)
+## Architecture Overview and Dependency Graph
+
+### The five hosts at a glance

 ```
-Startup order after a reboot (mandatory, must not be reversed):
+192.168.0.188 (OLLAMA / main service host)
+├── containerd               ← foundation, must start first
+├── Docker                   ← depends on containerd
+├── PostgreSQL :5432         ← K3s kine backend, API DB
+├── Redis :6380              ← Working Memory (emptied on restart, needs warm-up)
+├── Ollama :11434            ← LLM inference engine
+├── Nginx :80/:443           ← reverse proxy
+├── OpenClaw (clawbot) :8088 ← AI core (Docker Compose)
+├── MinIO                    ← Velero backup storage (Docker Compose)
+└── SignOz                   ← observability (Docker Compose)

-┌──────────────────────────────────────────────────┐
-│ Phase 1: 192.168.0.188 infrastructure            │ ← first
-│   ├─ containerd (BoltDB repair)                  │
-│   ├─ Docker (BoltDB repair)                      │
-│   ├─ PostgreSQL (WAL repair + kine VACUUM)       │ ← K3s Kine Datastore
-│   ├─ Redis (0.0.0.0:6380)                        │ ← required by API/Worker
-│   ├─ Ollama                                      │
-│   ├─ Nginx                                       │
-│   ├─ SignOz (docker compose)                     │
-│   └─ ClawBot (docker compose)                    │
-└──────────────────────────────────────────────────┘
-              ↓ must complete first
-┌──────────────────────────────────────────────────┐
-│ Phase 2: 192.168.0.110 DevOps vault              │ ← in parallel or slightly later
-│   ├─ Docker (BoltDB repair)                      │
-│   ├─ Remove orphan containers (missing network)  │
-│   ├─ harbor-log (must be healthy first!)         │ ← other Harbor components depend on it
-│   ├─ Harbor other components (nginx/core/db/...) │ ← required for K3s imagePull
-│   ├─ Gitea                                       │ ← required by CI/CD
-│   ├─ Langfuse                                    │
-│   ├─ Monitoring (Prometheus/Grafana/AM)          │
-│   └─ SignOz                                      │
-└──────────────────────────────────────────────────┘
-              ↓ must complete first (PostgreSQL + Harbor)
-┌──────────────────────────────────────────────────┐
-│ Phase 3: K3s Control-Plane                       │ ← last
-│   ├─ 120 k3s.service → Ready                     │
-│   ├─ 121 k3s.service → Ready                     │
-│   └─ awoooi-prod Pods Running                    │
-└──────────────────────────────────────────────────┘
+192.168.0.110 (DevOps host)
+├── Docker
+├── Harbor :5000             ← Container Registry (K3s pulls images from it)
+├── Gitea :3001              ← primary Git / CI management UI
+├── Gitea Act Runner         ← runs the CD pipeline (docker: gitea-runner)
+├── Langfuse :3100           ← LLMOps tracing
+├── Prometheus :9090         ← metrics collection
+├── Alertmanager :9093       ← alert routing
+├── Grafana :3002            ← monitoring dashboards
+└── SignOz                   ← observability
+
+192.168.0.120 (K3s Master - mon)
+├── K3s server (control plane)
+├── kube-proxy               ← NodePort forwarding 32334/32335
+└── keepalived               ← VIP 192.168.0.125 (secondary)
+
+192.168.0.121 (K3s Worker - mon1)
+├── K3s agent (worker)
+└── kube-proxy
+
+192.168.0.125 (VIP, managed by keepalived)
+├── → :32334 AWOOOI API
+└── → :32335 AWOOOI Web
 ```

+### Service dependencies
+
+```
+Strict startup order:
+
+[Host 188 layer]
+containerd
+  └── Docker
+       ├── PostgreSQL (kine DB, API DB)
+       │     └── Redis (Working Memory)
+       │           └── Ollama (LLM)
+       │                 └── Nginx
+       │                       ├── SignOz
+       │                       ├── MinIO
+       │                       └── OpenClaw (depends on aiops-network)
+
+[Host 110 layer]
+Docker
+  └── harbor-log (wait until healthy)
+       └── Harbor, all components (:5000)
+            └── Gitea (:3001)
+                 ├── Langfuse
+                 ├── Monitoring (Prometheus + Alertmanager + Grafana)
+                 │     [Alertmanager → AWOOOI API alert pipeline]
+                 ├── SignOz
+                 └── Gitea Act Runner (started only after Gitea is ready)
+
+[Host 120/121 layer]
+K3s (depends on PostgreSQL@188)
+  └── kube-proxy (NodePort rules)
+       └── Pods: API / Web / Worker
+            └── keepalived (VIP)
+
+[Alert pipeline]
+Prometheus → Alertmanager(110)
+  → AWOOOI API(121:32334/api/v1/webhooks/alertmanager) ← direct, does not go through OpenClaw
+  → TelegramGateway → Telegram
+
+[CD pipeline]
+Git push → Gitea(110:3001)
+  → Act Runner (gitea-runner container)
+  → Build Docker image → Harbor(:5000) → kubectl → K3s pods
+```
+
+### Key dependencies
+
+| Service | Key dependency | If the dependency fails |
+|------|---------|-----------|
+| K3s | PostgreSQL@188 ready | K3s cannot start and no pod can be scheduled |
+| OpenClaw | Docker + aiops-network | Telegram AI conversations stop working |
+| Alertmanager → API | NetworkPolicy allows 110/32 | All alerts go silent |
+| CD pipeline | Gitea Act Runner online | All deployments fail |
+| Harbor | harbor-log healthy | All other Harbor containers end up Exited 128 |
+| K8s pods | Harbor (image registry) | ImagePullBackOff |
+
 ---

-## 🤖 Automation scripts
+## Automation Script Status

-### Script locations
+### Deployed (2026-04-05)

-```
-scripts/reboot-recovery/
-├── awoooi-startup.sh           # startup script for 188 (deployed to /usr/local/bin/)
-├── awoooi-startup.service      # systemd unit for 188
-├── awoooi-startup-110.sh       # startup script for 110 (deployed to /usr/local/bin/)
-├── awoooi-startup-110.service  # systemd unit for 110
-├── deploy-to-188.sh            # one-step deploy to 188
-└── deploy-to-110.sh            # one-step deploy to 110
-```
+| Host | Script | Deployed to | systemd service | Status |
+|------|------|---------|----------------|------|
+| **188** | `awoooi-startup.sh` | `/usr/local/bin/` | `awoooi-startup.service` | ✅ enabled |
+| **110** | `awoooi-startup-110.sh` | `/usr/local/bin/` | `awoooi-startup-110.service` | ✅ enabled |
+| **120** | K3s native | built into the system | `k3s.service` | ✅ enabled |
+| **121** | K3s native | built into the system | `k3s-agent.service` | ✅ enabled |

-### Steps of the 188 script (`awoooi-startup.sh`)
+**Local source**: `scripts/reboot-recovery/`

-| Step | Description | Failure handling |
-|------|------|---------|
-| 1/7 | containerd health check | BoltDB corrupted → automatically delete `meta.db` |
-| 2/7 | Docker health check | BoltDB corrupted → automatically delete `local-kv.db` |
-| 3/7 | PostgreSQL health check | WAL corrupted → automatically run `pg_resetwal` + VACUUM ANALYZE kine |
-| 4/7 | Start Redis | - |
-| 5/7 | Start Ollama | - |
-| 6/7 | Start Nginx | - |
-| 7/7 | SignOz + ClawBot compose up | ClawBot fails → attempt a rebuild, continue even if that fails |
+### What each script covers

-### Steps of the 110 script (`awoooi-startup-110.sh`)
+**188 (7 steps)**:
+- containerd BoltDB corruption detection and repair
+- Docker BoltDB corruption detection and repair
+- PostgreSQL WAL corruption detection + pg_resetwal + kine VACUUM
+- Start Redis
+- Start Ollama
+- Start Nginx
+- SignOz + MinIO + aiops-network + OpenClaw

-| Step | Description | Failure handling |
-|------|------|---------|
-| 1/5 | Docker health check | BoltDB corrupted → automatically delete every corrupted `.db` |
-| 2/5 | Remove orphan containers | `Exited (128)/(137)` → docker rm + network prune |
-| 3/5 | Start Harbor | Wait for harbor-log healthy (max 60s) before starting the other components |
-| 4/5 | Gitea/Langfuse/Monitoring compose up | - |
-| 5/5 | SignOz compose up | - |
-
-### Redeploying the scripts
-
-```bash
-# Run from the Mac (awoooi repo directory)
-cd scripts/reboot-recovery
-
-bash deploy-to-188.sh  # update the script on 188
-bash deploy-to-110.sh  # update the script on 110
-```
+**110 (6 steps)**:
+- Docker BoltDB corruption detection and repair
+- Orphan container removal (Exited 128)
+- Harbor (all components started after harbor-log is healthy)
+- Gitea + Langfuse + Monitoring (including an Alertmanager health check)
+- SignOz
+- **Gitea Act Runner** (stale .runner config is cleared automatically)

 ---

-## Manual recovery procedure (when automation fails)
+## Normal Reboot Procedure

-### Phase 1: 192.168.0.188
+> Applies to **expected** reboots: planned maintenance, security updates, OS upgrades, and so on.
+
+### Pre-reboot preparation (T-30 minutes)

 ```bash
-ssh ollama@192.168.0.188
+# 1. Confirm there are no INVESTIGATING/MITIGATING incidents
+curl -s http://192.168.0.121:32334/api/v1/incidents | python3 -c 'import sys,json; d=json.load(sys.stdin); print("Active incidents:", d.get("count",0))'
+
+# 2. Confirm K3s is healthy
+ssh wooo@192.168.0.120 "kubectl get nodes && kubectl get pods -n awoooi-prod"
+
+# 3. Confirm backups (MinIO Velero)
+ssh wooo@192.168.0.120 "kubectl get backup -n velero 2>/dev/null | head -5"
 ```

-#### 1.1 containerd (if not running)
+### Recommended reboot order

-```bash
-sudo systemctl status containerd
-# If BoltDB is corrupted (panic: freepages):
-sudo systemctl stop containerd
-sudo rm -f /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db
-sudo systemctl start containerd
-systemctl is-active containerd
+```
+Step 1: reboot 188 + 110 at the same time
+  (they are independent and can run in parallel)
+
+Wait 5 minutes and confirm the base services are ready:
+  - PostgreSQL accepting connections
+  - Harbor returning HTTP 200
+  - Gitea accessible
+
+Step 2: reboot 120 + 121 at the same time
+  (K3s cluster)
+
+Wait 10 minutes:
+  - Nodes Ready
+  - Pods Running
 ```

-#### 1.2 Docker (if not running)
+> **Important**: 120/121 must be rebooted only after PostgreSQL on 188 is fully ready;
+> otherwise K3s kine cannot reach the DB and K3s will not start.
+
+### Post-reboot verification (T+15 minutes)
+
+Run the E2E verification script at the bottom of this document. Every item must be ✅.
+
+---
+
+## Abnormal Reboot Procedure
+
+> Applies to **unexpected** reboots: power loss, OOM, system crash, forced reboot, and so on.
+
+### Step 1: quick diagnosis (T+2 minutes)

 ```bash
-sudo systemctl status docker
-# If BoltDB is corrupted (panic: page already freed / invalid freelist page):
-sudo systemctl stop docker
-sudo rm -f /var/lib/docker/network/files/local-kv.db
-sudo rm -f /var/lib/docker/volumes/metadata.db
-find /var/lib/docker/buildkit -name "*.db" -delete 2>/dev/null || true
-sudo systemctl start docker
-systemctl is-active docker
-```
-
-#### 1.3 PostgreSQL (most critical)
-
-```bash
-sudo systemctl start postgresql@14-main
-sleep 8
-systemctl is-active postgresql@14-main  # if not active → see Troubleshooting A
-pg_isready -h localhost -p 5432         # should report accepting connections
-
-# Clean up orphaned kine connections (mandatory after a WAL reset)
-sudo -u postgres psql -d k3s_datastore -c "
-  SELECT pg_terminate_backend(pid)
-  FROM pg_stat_activity
-  WHERE datname='k3s_datastore' AND pid!=pg_backend_pid()
-  AND query_start < now() - interval '5 minutes';
+# 188 status
+ssh ollama@192.168.0.188 "
+systemctl is-active containerd docker postgresql@14-main redis-server ollama nginx
+docker ps --format '{{.Names}}: {{.Status}}' | grep -E 'clawbot|minio|signoz'
 "
-sudo -u postgres psql -d k3s_datastore -c "VACUUM ANALYZE kine;"
+
+# 110 status
+ssh wooo@192.168.0.110 "
+docker ps --format '{{.Names}}: {{.Status}}' | grep -E 'harbor-log|gitea$|alertmanager|gitea-runner'
+"
+
+# K3s status
+ssh wooo@192.168.0.120 "
+kubectl get nodes 2>/dev/null || echo 'K3s not ready'
+kubectl get pods -n awoooi-prod 2>/dev/null | grep -v Running | head -10
+"
+
+# API health
+curl -s --max-time 5 http://192.168.0.121:32334/api/v1/health 2>/dev/null | python3 -c 'import sys,json; d=json.load(sys.stdin); print("API:", d["status"])' 2>/dev/null || echo 'API unreachable'
 ```

-#### 1.4 Redis
+### Step 2: recover by priority

+**P0 (T+0~5 minutes): infrastructure**
 ```bash
-sudo systemctl start redis-server
-redis-cli -p 6380 ping  # should return PONG
-# Redis config: 0.0.0.0:6380 (bind 0.0.0.0, port 6380)
+# The automation scripts usually handle this already; if they did not start automatically:
+ssh ollama@192.168.0.188 "sudo /usr/local/bin/awoooi-startup.sh"
+ssh wooo@192.168.0.110 "echo '0936223270' | sudo -S /usr/local/bin/awoooi-startup-110.sh"
 ```

-#### 1.5 Ollama
-
+**P1 (T+5~15 minutes): K3s cluster**
 ```bash
-sudo systemctl start ollama
-sleep 5
-curl -sf http://localhost:11434/ | grep running
+# Once PostgreSQL@188 is ready
+ssh ollama@192.168.0.188 "pg_isready -h localhost -p 5432"
+
+# If K3s did not recover on its own
+ssh wooo@192.168.0.120 "sudo systemctl restart k3s"
+ssh wooo@192.168.0.121 "sudo systemctl restart k3s-agent"
+
+# Wait for Nodes Ready
+ssh wooo@192.168.0.120 "kubectl wait --for=condition=Ready nodes --all --timeout=120s"
 ```

-#### 1.6 Nginx + SignOz + ClawBot
-
+**P2 (T+15~30 minutes): business verification**
 ```bash
-sudo systemctl start nginx
+# Are all Pods Running?
+ssh wooo@192.168.0.120 "kubectl get pods -n awoooi-prod"

-cd /home/ollama/signoz/deploy/docker && docker compose up -d
-cd /home/ollama/clawbot-v5 && docker compose up -d
-```
+# Is the API healthy (including all components)?
+curl -s http://192.168.0.121:32334/api/v1/health | python3 -m json.tool

-#### 1.7 Phase 1 acceptance check
-
-```bash
-pg_isready -h localhost -p 5432 && echo "✅ PostgreSQL"
-redis-cli -p 6380 ping | grep -q PONG && echo "✅ Redis :6380"
-curl -sf http://localhost:11434/ | grep -q running && echo "✅ Ollama"
-systemctl is-active nginx | grep -q active && echo "✅ Nginx"
+# Alert pipeline E2E test
+curl -X POST http://192.168.0.121:32334/api/v1/webhooks/alertmanager \
+  -H 'Content-Type: application/json' \
+  -d '{"receiver":"test","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"RebootRecoveryTest","severity":"info"},"annotations":{"summary":"Reboot recovery test, please ignore"},"startsAt":"2026-04-05T00:00:00Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":""}],"groupLabels":{},"commonLabels":{},"commonAnnotations":{},"externalURL":"","version":"4","groupKey":"test"}'
+# Expected: {"success":true,...} and the test alert arrives in Telegram
 ```

 ---

-### Phase 2: 192.168.0.110
+## Detailed Startup Sequence per Host

+### 192.168.0.188: main service host
+
+| Step | Service | How to verify | Common failure cause |
+|------|------|---------|------------|
+| 1 | containerd | `systemctl is-active containerd` | BoltDB meta.db corrupted |
+| 2 | Docker | `systemctl is-active docker` | BoltDB local-kv.db corrupted |
+| 3 | PostgreSQL | `pg_isready -h localhost -p 5432` | WAL checkpoint corrupted |
+| 3a | kine VACUUM | `psql -d k3s_datastore -c "VACUUM ANALYZE kine;"` | Slow queries make K3s time out |
+| 4 | Redis | `redis-cli -p 6380 ping` | Wrong bind config (must be 0.0.0.0:6380) |
+| 5 | Ollama | `systemctl is-active ollama` | Not enough GPU memory |
+| 6 | Nginx | `systemctl is-active nginx` | Port conflict |
+| 7a | aiops-network | `docker network ls \| grep aiops-network` | The external network disappears after a reboot |
+| 7b | OpenClaw | `curl http://localhost:8088/health` | Broken pip dependencies, needs a rebuild |
+| 7c | MinIO | `docker ps \| grep minio` | Stateless service, a restart is enough |
+| 7d | SignOz | `curl http://localhost:3301/-/healthy` | ClickHouse is slow to start |
+
+**BoltDB corruption repair**:
 ```bash
-ssh wooo@192.168.0.110
+# containerd meta.db
+BOLT=/var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db
+cp $BOLT ${BOLT}.bak.$(date +%s) && rm $BOLT && systemctl restart containerd
+
+# Docker network BoltDB
+BOLT=/var/lib/docker/network/files/local-kv.db
+cp $BOLT ${BOLT}.bak.$(date +%s) && rm $BOLT && systemctl restart docker
 ```

-#### 2.1 Docker repair
-
+**PostgreSQL WAL repair**:
 ```bash
-systemctl is-active docker || {
-  echo "0936223270" | sudo -S bash -c "
-    rm -f /var/lib/docker/network/files/local-kv.db
-    rm -f /var/lib/docker/volumes/metadata.db
-    find /var/lib/docker/buildkit -name '*.db' -delete 2>/dev/null
-    systemctl start docker
-  "
-}
+# Confirm it really is WAL corruption
+journalctl -u postgresql@14-main -n 30 | grep "could not locate a valid checkpoint"
+
+# Repair
+systemctl stop postgresql@14-main
+/usr/lib/postgresql/14/bin/pg_resetwal -f /var/lib/postgresql/14/main
+systemctl start postgresql@14-main
 ```

-#### 2.2 Remove orphan containers (critical!)
+### 192.168.0.110: DevOps host

+| Step | Service | How to verify | Common failure cause |
+|------|------|---------|------------|
+| 1 | Docker | `systemctl is-active docker` | BoltDB corrupted |
+| 2 | Orphan container removal | `docker ps -a \| grep "Exited (128)"` | Old containers reference a network that no longer exists |
+| 3 | harbor-log | `docker inspect --format='{{.State.Health.Status}}' harbor-log` | syslog :1514 not ready |
+| 3 | Harbor, all components | `curl http://localhost:5000/v2/` | harbor-log not healthy yet |
+| 4 | Gitea | `curl http://localhost:3001` | No known issues |
+| 5 | Langfuse | `curl http://localhost:3100` | DB not ready |
+| 6 | Alertmanager | `curl http://localhost:9093/-/healthy` | docker compose not started |
+| 6a | Alert pipeline check | Send a test webhook | Wrong webhook URL / NetworkPolicy |
+| 7 | SignOz | `curl http://localhost:3301/-/healthy` | ClickHouse is slow to start |
+| 8 | Gitea Runner | `docker logs gitea-runner \| grep SUCCESS` | Stale .runner config |
+
+**Harbor Exited 128 repair**:
 ```bash
-# After a reboot the Docker network used by the old containers no longer exists; they must be removed
-docker rm -f $(docker ps -aq 2>/dev/null) 2>/dev/null || true
-docker network prune -f 2>/dev/null || true
-```
-
-#### 2.3 Harbor (must wait for harbor-log healthy first)
-
-```bash
-cd /home/wooo/harbor/harbor
-docker compose up -d
-
-# Wait for harbor-log healthy (up to 60 seconds)
-for i in $(seq 1 12); do
-  STATUS=$(docker inspect --format='{{.State.Health.Status}}' harbor-log 2>/dev/null)
-  echo "[$i] harbor-log: $STATUS"
-  [ "$STATUS" = "healthy" ] && break
-  sleep 5
+# Wait for harbor-log healthy
+until [ "$(docker inspect --format='{{.State.Health.Status}}' harbor-log)" = "healthy" ]; do
+  echo "waiting..."; sleep 5
 done

-# Once harbor-log is healthy, restart the other components (they depend on :1514 syslog)
-docker compose up -d
+# Remove the Exited 128 containers
+docker rm -f $(docker ps -a --filter status=exited --format "{{.Names}}" | grep -v harbor-log)
+
+# Start again
+cd /home/wooo/harbor/harbor && docker compose up -d
 ```

-#### 2.4 Other services
-
+**Re-registering the Gitea Runner**:
 ```bash
-cd /home/wooo/gitea && docker compose up -d
-cd /home/wooo/langfuse && docker compose up -d
-cd /home/wooo/monitoring && docker compose up -d
-cd /home/wooo/signoz/deploy/docker && docker compose up -d
+# Remove the stale config (the one pointing at the wrong hostname)
+sudo rm /home/wooo/act-runner/data/.runner
+
+# Get a new registration token
+curl -s -u 'wooo:TOKEN' 'http://192.168.0.110:3001/api/v1/repos/wooo/awoooi/actions/runners/registration-token'
+
+# Update GITEA_RUNNER_REGISTRATION_TOKEN in docker-compose.yml
+# then start the runner
+cd /home/wooo/act-runner && docker compose up -d
 ```

-#### 2.5 Phase 2 acceptance check
+### 192.168.0.120/121: K3s cluster

-```bash
-curl -s -o /dev/null -w "%{http_code}" http://localhost:5000/v2/ | grep -q 401 && echo "✅ Harbor :5000"
-curl -s -o /dev/null -w "%{http_code}" http://localhost:3001/ | grep -q 200 && echo "✅ Gitea :3001"
-curl -s -o /dev/null -w "%{http_code}" http://localhost:3100/ | grep -q 200 && echo "✅ Langfuse :3100"
-curl -s -o /dev/null -w "%{http_code}" http://localhost:9093/ | grep -q 200 && echo "✅ Alertmanager :9093"
-curl -s -o /dev/null -w "%{http_code}" http://localhost:3002/ | grep -q 302 && echo "✅ Grafana :3002"
+| Step | Item | How to verify |
+|------|------|---------|
+| Prerequisite | PostgreSQL@188 ready | `pg_isready -h 192.168.0.188 -p 5432` |
+| 1 | K3s service | `systemctl is-active k3s` (120) |
+| 2 | K3s agent | `systemctl is-active k3s-agent` (121) |
+| 3 | Nodes Ready | `kubectl get nodes` |
+| 4 | Pods Running | `kubectl get pods -n awoooi-prod` |
+| 5 | keepalived VIP | `ip addr show \| grep 192.168.0.125` (120) |
+| 6 | NodePort | `nc -zv 192.168.0.121 32334` |
+
+---
+
+## Common Troubleshooting Handbook
+
+### 1. Alerts are silent (nothing arrives in Telegram)
+
+```
+Decision tree:
+  Alertmanager healthy? (http://192.168.0.110:9093/-/healthy)
+  ├── NO → docker compose up -d (on 110, in /home/wooo/monitoring/)
+  └── YES
+       ↓
+       Is the webhook URL correct?
+       grep 'url:' /home/wooo/monitoring/alertmanager.yml
+       Must be: http://192.168.0.121:32334/api/v1/webhooks/alertmanager
+       ├── NO → fix the URL, then curl -X POST http://localhost:9093/-/reload
+       └── YES
+            ↓
+            Does a curl POST to the webhook from 110 succeed?
+            curl -X POST http://192.168.0.121:32334/api/v1/webhooks/alertmanager ...
+            ├── timeout → NetworkPolicy does not allow 110
+            │     kubectl apply -f k8s/awoooi-prod/02-network-policy.yaml
+            └── {"success":true} → check the Telegram Bot Token
+```
+
+**Root lessons (2026-04-05)**:
+- Old architecture: Alertmanager → OpenClaw 8088 (a middle layer; if OpenClaw was unhealthy after a reboot, alerts went silent)
+- New architecture: Alertmanager → AWOOOI API directly (more stable, fewer single points of failure)
+- The NetworkPolicy must allow 110/32 into the pod
+
+### 2. Gitea CD cannot deploy
+
+```
+Decision tree:
+  How many Gitea runners?
+  curl -u 'wooo:TOKEN' 'http://192.168.0.110:3001/api/v1/repos/wooo/awoooi/actions/runners'
+  ├── total_count: 0 → runner offline
+  │     docker ps | grep gitea-runner
+  │     ├── missing → cd /home/wooo/act-runner && docker compose up -d
+  │     └── present but Restarting → docker logs gitea-runner
+  │           "lookup gitea" → remove data/.runner → docker compose up -d
+  └── total_count: 1 → runner online but CD fails
+        Check the Gitea Actions logs: http://192.168.0.110:3001/wooo/awoooi/actions
+```
+
+### 3. The website shows 0 / incidents have disappeared
+
+```
+Cause: Redis Working Memory is emptied on restart
+  Newer API builds (f4f454f+) warm up automatically at startup
+  Older API builds do not have this logic
+
+Diagnosis:
+  curl http://192.168.0.121:32334/api/v1/incidents
+  → count: 0 but Telegram has historical alerts?
+  → check the API pod image version
+
+Fix (if the image version is correct):
+  kubectl rollout restart deployment/awoooi-api -n awoooi-prod
+
+Fix (if the image version is too old):
+  wait for CD to finish deploying the latest version
+```
+
+### 4. NodePort 32334 is unreachable from outside
+
+```
+Diagnosis:
+  nc -zv 192.168.0.121 32334 ← does this connect?
+  ├── yes → connect directly via 192.168.0.121:32334
+  └── no
+       kubectl get pods -n awoooi-prod
+       ├── no Running pods → K3s pod problem
+       └── Running pods exist
+            iptables -t nat -L KUBE-NODEPORTS -n | grep 32334
+            ├── no rule → K3s kube-proxy problem, restart k3s
+            └── rule present → NetworkPolicy or firewall
+```
+
+### 5. Harbor ImagePullBackOff
+
+```
+Diagnosis:
+  kubectl describe pod -n awoooi-prod | grep -A 5 Events
+  → "ImagePullBackOff" or "ErrImagePull"
+
+Confirm Harbor is healthy:
+  curl http://192.168.0.110:5000/v2/ → should return {}
+
+If Harbor is unhealthy:
+  - Wait for harbor-log healthy, then restart: see the Harbor Exited 128 repair
+  - Confirm the Docker daemon on 120/121 has the insecure registry configured:
+    cat /etc/docker/daemon.json | grep insecure
 ```

 ---

-### Phase 3: K3s Control-Plane
+## E2E Verification Script

-> ⚠️ Only run this after confirming that Phase 1 PostgreSQL and Phase 2 Harbor are fully ready!
-
-```bash
-# Check the prerequisites first
-ssh ollama@192.168.0.188 "pg_isready -h localhost -p 5432"
-ssh wooo@192.168.0.110 "curl -s -o /dev/null -w '%{http_code}' http://localhost:5000/v2/"
-# PostgreSQL: accepting connections
-# Harbor: 401 (authentication required, this is normal)
-```
-
-#### 3.1 K3s nodes (usually start automatically)
-
-```bash
-# If k3s did not start
-ssh wooo@192.168.0.120 "echo '0936223270' | sudo -S systemctl start k3s"
-ssh wooo@192.168.0.121 "echo '0936223270' | sudo -S systemctl start k3s"
-
-# Check node status
-ssh wooo@192.168.0.120 "echo '0936223270' | sudo -S k3s kubectl get nodes"
-```
-
-#### 3.2 If Pods are in ImagePullBackOff
-
-```bash
-# Harbor has only just come up; force a rollout
-ssh wooo@192.168.0.120 "echo '0936223270' | sudo -S bash -c '
-  k3s kubectl delete pod -l app=awoooi-api -n awoooi-prod
-  k3s kubectl delete pod -l app=awoooi-web -n awoooi-prod
-  k3s kubectl delete pod -l app=awoooi-worker -n awoooi-prod
-'"
-```
-
-#### 3.3 Phase 3 acceptance check
-
-```bash
-ssh wooo@192.168.0.120 "echo '0936223270' | sudo -S k3s kubectl get pods -n awoooi-prod"
-# All Pods should be Running
-
-curl http://192.168.0.125:32334/api/v1/health
-# Expected: status=healthy or degraded (openclaw down is acceptable)
-```
-
----
-
-## Full automated acceptance script
+Run this script to confirm the whole system is healthy after a reboot.

 ```bash
 #!/bin/bash
-# Run from: Mac (awoooi repo)
-echo "=== AWOOOI full reboot acceptance check $(date '+%Y-%m-%d %H:%M:%S') ==="
+# Post-reboot E2E verification
+# Usage: bash docs/runbooks/scripts/e2e-verify.sh
+# Run from: any machine with SSH access

-# 188 base services
-echo "--- 192.168.0.188 ---"
-ssh ollama@192.168.0.188 "
-pg_isready -h localhost -p 5432 >/dev/null 2>&1 && echo '✅ PostgreSQL :5432' || echo '❌ PostgreSQL DOWN'
-redis-cli -p 6380 ping 2>/dev/null | grep -q PONG && echo '✅ Redis :6380' || echo '❌ Redis DOWN'
-curl -sf http://localhost:11434/ 2>/dev/null | grep -q running && echo '✅ Ollama :11434' || echo '❌ Ollama DOWN'
-systemctl is-active nginx 2>/dev/null | grep -q active && echo '✅ Nginx' || echo '❌ Nginx DOWN'
-docker ps --format '{{.Names}}\t{{.Status}}' 2>/dev/null | grep -E '^signoz|^clawbot' | head -5
-"
+API="http://192.168.0.121:32334"
+GREEN='\033[0;32m'; RED='\033[0;31m'; NC='\033[0m'
+PASS=0; FAIL=0

-# 110 DevOps vault
-echo "--- 192.168.0.110 ---"
-ssh wooo@192.168.0.110 "
-curl -s -o /dev/null -w '%{http_code}' http://localhost:5000/v2/ 2>/dev/null | grep -q 401 && echo '✅ Harbor :5000' || echo '❌ Harbor DOWN'
-curl -s -o /dev/null -w '%{http_code}' http://localhost:3001/ 2>/dev/null | grep -q 200 && echo '✅ Gitea :3001' || echo '❌ Gitea DOWN'
-curl -s -o /dev/null -w '%{http_code}' http://localhost:3100/ 2>/dev/null | grep -q 200 && echo '✅ Langfuse :3100' || echo '❌ Langfuse DOWN'
-curl -s -o /dev/null -w '%{http_code}' http://localhost:9093/ 2>/dev/null | grep -q 200 && echo '✅ Alertmanager :9093' || echo '❌ Alertmanager DOWN'
-curl -s -o /dev/null -w '%{http_code}' http://localhost:3002/ 2>/dev/null | grep -qE '200|302' && echo '✅ Grafana :3002' || echo '❌ Grafana DOWN'
-"
+check() {
+  local name="$1"; local cmd="$2"; local expect="$3"
+  result=$(eval "$cmd" 2>/dev/null)
+  if echo "$result" | grep -q "$expect"; then
+    echo -e "${GREEN}✅ $name${NC}"
+    ((PASS++))
+  else
+    echo -e "${RED}❌ $name${NC} (got: ${result:0:80})"
+    ((FAIL++))
+  fi
+}

-# K3s and Pods
-echo "--- K3s (via 120) ---"
-ssh wooo@192.168.0.120 "
-echo '0936223270' | sudo -S bash -c '
-k3s kubectl get nodes 2>/dev/null
-k3s kubectl get pods -n awoooi-prod 2>/dev/null
-'
-"
+echo "=== AWOOOI reboot E2E verification $(date '+%Y-%m-%d %H:%M:%S') ==="

-# API E2E
-echo "--- API E2E ---"
-curl -s http://192.168.0.125:32334/api/v1/health 2>/dev/null | \
-  python3 -c "import sys,json; d=json.load(sys.stdin); [print('✅' if v['status']=='up' else '⚠️', k, v['status']) for k,v in d['components'].items()]" || \
-  echo '❌ API E2E FAILED'
+# 188 services
+check "188 PostgreSQL" "ssh ollama@192.168.0.188 'pg_isready -h localhost -p 5432'" "accepting"
+check "188 Redis" "ssh ollama@192.168.0.188 'redis-cli -p 6380 ping'" "PONG"
+check "188 OpenClaw" "ssh ollama@192.168.0.188 'curl -s http://localhost:8088/health'" "healthy"

-echo "=== Acceptance check complete ==="
+# 110 services
+check "110 Harbor" "curl -s -o /dev/null -w '%{http_code}' http://192.168.0.110:5000/v2/" "200"
+check "110 Gitea" "curl -s -o /dev/null -w '%{http_code}' http://192.168.0.110:3001" "200"
+check "110 Alertmanager" "curl -s http://192.168.0.110:9093/-/healthy" "OK"
+check "110 Gitea Runner" "curl -s -u 'wooo:TOKEN' 'http://192.168.0.110:3001/api/v1/repos/wooo/awoooi/actions/runners' | python3 -c 'import sys,json; d=json.load(sys.stdin); print(d.get(\"total_count\",0))'" "1"
+
+# K3s
+check "K3s Nodes Ready" "ssh wooo@192.168.0.120 'kubectl get nodes --no-headers' | awk '{print \$2}'" "Ready"
+check "K3s Pods Running" "ssh wooo@192.168.0.120 'kubectl get pods -n awoooi-prod --no-headers' | grep -c Running" "3"
+
+# AWOOOI API
+check "API Health" "curl -s $API/api/v1/health" '"status":"healthy"'
+check "API openclaw" "curl -s $API/api/v1/health" '"openclaw":{"status":"up"'
+
+# Alert pipeline E2E
+PAYLOAD='{"receiver":"e2e-test","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"RebootE2ETest","severity":"info"},"annotations":{"summary":"Reboot E2E verification, please ignore"},"startsAt":"2026-04-05T00:00:00Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":""}],"groupLabels":{},"commonLabels":{},"commonAnnotations":{},"externalURL":"","version":"4","groupKey":"e2e-test"}'
+check "Alert pipeline E2E" "curl -s -X POST $API/api/v1/webhooks/alertmanager -H 'Content-Type: application/json' -d '$PAYLOAD'" '"success":true'
+
+echo ""
+echo "=== Result: ${PASS} passed, ${FAIL} failed ==="
+[ $FAIL -eq 0 ] && echo -e "${GREEN}🎉 All checks passed! System is healthy${NC}" || echo -e "${RED}⚠️ ${FAIL} check(s) failed, please investigate${NC}"
 ```

 ---

-## Troubleshooting
+## Version History

-### A. PostgreSQL WAL corruption
-
-**Symptom**:
-```
-PANIC: could not locate a valid checkpoint record
-```
-
-**Fix** (requires the commander's authorization):
-
-```bash
-ssh ollama@192.168.0.188
-
-# 1. Confirm the error
-sudo journalctl -u postgresql@14-main -n 20 | grep -E 'PANIC|checkpoint'
-
-# 2. Force a WAL reset (loses the last few transactions, irreversible)
-sudo /usr/lib/postgresql/14/bin/pg_resetwal -f /var/lib/postgresql/14/main
-
-# 3. Restart and verify
-sudo systemctl start postgresql@14-main
-sleep 8
-pg_isready -h localhost -p 5432
-
-# 4. Kill stale connections + rebuild kine (mandatory!)
-sudo -u postgres psql -d k3s_datastore -c "
-  SELECT pg_terminate_backend(pid)
-  FROM pg_stat_activity
-  WHERE datname='k3s_datastore' AND pid!=pg_backend_pid();
-"
-sudo -u postgres psql -d k3s_datastore -c "REINDEX TABLE kine; VACUUM ANALYZE kine;"
-# ⚠️ If a stale connection cannot be terminated, use an OS kill: sudo kill -9
-```
-
----
-
-### B. Docker daemon corruption (BoltDB)
-
-**Symptom**:
-```
-panic: freepages: failed to get all reachable pages (containerd)
-panic: page already freed (Docker network)
-failed to create task: failed to initialize logging (Harbor containers)
-```
-
-**Fix (188)**:
-```bash
-sudo systemctl stop docker containerd
-sudo rm -f /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db
-sudo rm -f /var/lib/docker/network/files/local-kv.db
-sudo systemctl start containerd && sleep 5
-sudo systemctl start docker
-```
-
-**Fix (110, container state must also be cleared)**:
-```bash
-sudo systemctl stop docker
-sudo rm -f /var/lib/docker/network/files/local-kv.db
-sudo rm -f /var/lib/docker/volumes/metadata.db
-find /var/lib/docker/buildkit -name "*.db" -delete 2>/dev/null
-sudo rm -rf /var/lib/docker/containers/*  # clear orphan container records
-sudo systemctl start docker
-sleep 5
-docker rm -f $(docker ps -aq) 2>/dev/null || true  # remove leftovers
-docker network prune -f
-```
-
----
-
-### C. K3s kine slow queries
-
-**Symptom**: K3s stays in `activating` for more than 3-5 minutes, with this in the log:
-```
-Slow SQL (total time: 1m3.889s): SELECT ... FROM kine AS kv WHERE kv.name LIKE $1 ...
-```
-
-**Fix**:
-```bash
-# 1. Stop K3s (releases the PG connections)
-ssh wooo@192.168.0.120 "echo '0936223270' | sudo -S systemctl stop k3s"
-ssh wooo@192.168.0.121 "echo '0936223270' | sudo -S systemctl stop k3s"
-
-# 2. Kill stale connections (if pg_terminate has no effect, use an OS kill directly)
-ssh ollama@192.168.0.188 "echo '0936223270' | sudo -S -u postgres psql -d k3s_datastore \
-  -c \"SELECT pid, query_start, state FROM pg_stat_activity WHERE datname='k3s_datastore';\""
-# For a stale PID: sudo kill -9
-
-# 3. Rebuild indexes and statistics
-ssh ollama@192.168.0.188 "echo '0936223270' | sudo -S -u postgres psql -d k3s_datastore \
-  -c 'REINDEX TABLE kine; VACUUM ANALYZE kine;'"
-
-# 4. Restart K3s
-ssh wooo@192.168.0.120 "echo '0936223270' | sudo -S systemctl start k3s"
-ssh wooo@192.168.0.121 "echo '0936223270' | sudo -S systemctl start k3s"
-```
-
----
-
-### D. All Harbor containers Exited (128)
-
-**Symptom**:
-```
-Error: failed to create task for container: failed to initialize logging driver: dial tcp 127.0.0.1:1514: connect: connection refused
-```
-
-**Cause**: the log driver of every Harbor container points at the harbor-log syslog (:1514). If the other containers are started before harbor-log is healthy, they all fail.
-
-**Fix**:
-```bash
-# 1. Remove all failed containers
-docker rm -f $(docker ps -aq) 2>/dev/null
-docker network prune -f
-
-# 2. Restart Harbor (harbor-log comes up first)
-cd /home/wooo/harbor/harbor && docker compose up -d
-
-# 3. Wait for harbor-log healthy (about 30 seconds)
-until [ "$(docker inspect --format='{{.State.Health.Status}}' harbor-log)" = "healthy" ]; do
-  echo "waiting harbor-log..."; sleep 5
-done
-
-# 4. Now restart the other components
-docker compose up -d
-```
-
----
-
-## Known limitations
-
-| Item | Description | Suggested follow-up |
-|------|------|---------|
-| ClawBot build | pip wheels are corrupted; non-critical while STANDBY_MODE=true | Clear the wheel cache periodically |
-| K3s has no automatic dependency on PG | k3s.service has no After=postgresql@14-main.service | Consider adding a systemd dependency |
-| Redis requires manual configuration | /etc/redis/redis.conf already has bind 0.0.0.0 port 6380, but it must be set again after a reinstall | Add a self-check to awoooi-startup.sh |
-
----
-
-*Document fully revised by Claude Code on 2026-04-05 after the second reboot incident*
+| Version | Date | Notes |
+|------|------|------|
+| v1.0 | 2026-04-04 | Initial version: containerd + Docker + PostgreSQL WAL repair procedure |
+| v2.0 | 2026-04-05 morning | 188 startup script + 110 startup script + Harbor race condition fix |
+| v3.0 | 2026-04-05 afternoon | Full architecture re-inventory + Gitea Runner automation + alert pipeline root-cause fix + NetworkPolicy correction |
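
One detail in the new E2E script may be worth a follow-up: the `K3s Nodes Ready` check passes its output to `grep -q "Ready"`, and that pattern also matches `NotReady`. The sketch below shows a stricter check. It is not part of this patch; `all_nodes_ready` is a hypothetical helper name, and it assumes the default column layout of `kubectl get nodes --no-headers` (node name first, status second).

```bash
#!/bin/bash
# Hypothetical helper (not part of the runbook or this patch).
# Reads `kubectl get nodes --no-headers` output on stdin and succeeds only if
# at least one node is listed and every node's STATUS column is exactly "Ready".
# A status such as "NotReady" or "Ready,SchedulingDisabled" counts as a failure.
all_nodes_ready() {
  awk '
    NF >= 2 { total++; if ($2 != "Ready") bad++ }
    END     { exit ((total > 0 && bad == 0) ? 0 : 1) }
  '
}
```

Assuming the same SSH access as the script above, it could be used as `ssh wooo@192.168.0.120 'kubectl get nodes --no-headers' | all_nodes_ready && echo "all nodes Ready"`. Whether a cordoned node should count as a failure is a judgment call; this sketch treats it as one.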