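docs(runbooks): REBOOT-RECOVERY-SOP v3.0 complete reboot automation plan

## Contents

A complete inventory of every host, service, tool, and monitor, covering:
- Startup order and dependency graph
- Planned reboot vs. unplanned reboot procedures
- Detailed startup sequence for each host (188/110/120/121)
- Troubleshooting handbook for common failures (silent alerts / broken CD / missing data / NodePort)
- E2E verification script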

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OG T
2026-04-05 01:48:29 +08:00
parent ad4abefcd9
commit 8f64affbdb


@@ -1,498 +1,521 @@
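# Reboot Recovery SOP
# AWOOOI Reboot Recovery SOP
> Last updated: 2026-04-05 ogt. Fully revised after the second reboot incident, with automation scripts added
> Applies to: AWOOOI five-host architecture
> **Version**: v3.0
> **Last updated**: 2026-04-05 (Taipei time)
> **Updated by**: Claude Code (Chief Architect)
> **Trigger**: full inventory after two reboot incidents + root-cause fix for the alert pipeline + Gitea Runner automation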
---
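## 🤖 Automation Status (check this first)
## Table of Contents
| Host | systemd service | Status | Notes |
|------|----------------|------|------|
| 192.168.0.188 | `awoooi-startup.service` | ✅ enabled | Auto-repairs BoltDB and starts all services |
| 192.168.0.110 | `awoooi-startup-110.service` | ✅ enabled | Auto-repairs BoltDB and starts all services |
| 192.168.0.120 | `k3s.service` | managed by systemd | Starts automatically; depends on PostgreSQL being ready |
| 192.168.0.121 | `k3s.service` | managed by systemd | Starts automatically |
**Under normal conditions, wait 3-5 minutes after a reboot and all services should recover on their own.**
To confirm (run after the reboot):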
```bash
# Check the results of the automation scripts
ssh ollama@192.168.0.188 "sudo journalctl -u awoooi-startup.service -n 30 --no-pager"
ssh wooo@192.168.0.110 "echo '0936223270' | sudo -S journalctl -u awoooi-startup-110.service -n 30 --no-pager"
```
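Rather than waiting a fixed 3-5 minutes, the recovery can be polled. The following is a minimal sketch, not part of the deployed scripts: `retry_until` is a hypothetical helper name, and the try count and delay in the usage note are assumptions chosen to cover roughly five minutes.

```shell
#!/usr/bin/env bash
# retry_until MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or MAX_TRIES is reached.
# Returns 0 on the first success, 1 if every attempt failed.
retry_until() {
  local max_tries="$1" delay="$2"
  shift 2
  local attempt
  for ((attempt = 1; attempt <= max_tries; attempt++)); do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if ((attempt < max_tries)); then
      sleep "$delay"
    fi
  done
  return 1
}
```

For example, `retry_until 30 10 ssh ollama@192.168.0.188 "pg_isready -h localhost -p 5432"` would poll PostgreSQL on 188 every 10 seconds for up to about five minutes.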
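1. [Architecture Overview and Dependency Graph](#架構概覽與依賴圖)
2. [Automation Script Status](#自動化腳本狀態)
3. [Planned Reboot Procedure (scheduled maintenance)](#正常重啟流程)
4. [Unplanned Reboot Procedure (emergency recovery)](#異常重啟流程)
5. [Detailed Startup Sequence per Host](#各主機詳細啟動序列)
6. [Troubleshooting Handbook](#常見故障排查手冊)
7. [E2E Verification Script](#e2e-驗證腳本)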
---
## ⚡ System-Wide Startup Order (Dependencies)
## Architecture Overview and Dependency Graph
### The Five Hosts at a Glance
```
Startup order after a reboot (mandatory, must not be reversed):
192.168.0.188 (OLLAMA / main service host)
├── containerd ← foundation, must start first
├── Docker ← depends on containerd
├── PostgreSQL :5432 ← K3s kine backend, API DB
├── Redis :6380 ← Working Memory (emptied on restart, needs warm-up)
├── Ollama :11434 ← LLM inference engine
├── Nginx :80/:443 ← reverse proxy
├── OpenClaw (clawbot) :8088 ← AI core (Docker Compose)
├── MinIO ← Velero backup storage (Docker Compose)
└── SignOz ← observability (Docker Compose)
┌─────────────────────────────────────────┐
│ Phase 1: 192.168.0.188 infrastructure │ ← first
│ ├─ containerd (BoltDB repair) │
│ ├─ Docker (BoltDB repair) │
│ ├─ PostgreSQL (WAL repair + kine VACUUM) │ ← K3s Kine Datastore
│ ├─ Redis (0.0.0.0:6380) │ ← API/Worker depend on it
│ ├─ Ollama │
│ ├─ Nginx │
│ ├─ SignOz (docker compose) │
│ └─ ClawBot (docker compose) │
└─────────────────────────────────────────┘
↓ must complete first
┌─────────────────────────────────────────┐
│ Phase 2: 192.168.0.110 DevOps vault │ ← in parallel or slightly later
│ ├─ Docker (BoltDB repair) │
│ ├─ Remove orphaned containers (missing network) │
│ ├─ harbor-log (must be healthy first) │ ← the rest of Harbor depends on it
│ ├─ Other Harbor components (nginx/core/db/...) │ ← K3s imagePull depends on it
│ ├─ Gitea │ ← CI/CD depends on it
│ ├─ Langfuse │
│ ├─ Monitoring (Prometheus/Grafana/AM) │
│ └─ SignOz │
└─────────────────────────────────────────┘
↓ must complete first (PostgreSQL + Harbor)
┌─────────────────────────────────────────┐
│ Phase 3: K3s Control-Plane │ ← last
│ ├─ 120 k3s.service → Ready │
│ ├─ 121 k3s.service → Ready │
│ └─ awoooi-prod Pods Running │
└─────────────────────────────────────────┘
192.168.0.110 (DevOps host)
├── Docker
├── Harbor :5000 ← Container Registry (K3s pulls images from it)
├── Gitea :3001 ← main Git / CI management UI
├── Gitea Act Runner ← runs the CD pipeline (docker: gitea-runner)
├── Langfuse :3100 ← LLMOps tracing
├── Prometheus :9090 ← metrics collection
├── Alertmanager :9093 ← alert routing
├── Grafana :3002 ← monitoring dashboards
└── SignOz ← observability
192.168.0.120 (K3s Master - mon)
├── K3s server (control plane)
├── kube-proxy ← NodePort forwarding 32334/32335
└── keepalived ← VIP 192.168.0.125 (secondary)
192.168.0.121 (K3s Worker - mon1)
├── K3s agent (worker)
└── kube-proxy
192.168.0.125 (VIP, managed by keepalived)
├── → :32334 AWOOOI API
└── → :32335 AWOOOI Web
```
### Service Dependencies
```
Strict startup order:
[188 tier]
containerd
└── Docker
    ├── PostgreSQL (kine DB, API DB)
    │   └── Redis (Working Memory)
    │       └── Ollama (LLM)
    │           └── Nginx
    │               ├── SignOz
    │               ├── MinIO
    │               └── OpenClaw (depends on aiops-network)
[110 tier]
Docker
└── harbor-log (wait for healthy)
    └── Harbor, all components (:5000)
        └── Gitea (:3001)
            ├── Langfuse
            ├── Monitoring (Prometheus + Alertmanager + Grafana)
            │   [Alertmanager → AWOOOI API alert pipeline]
            ├── SignOz
            └── Gitea Act Runner (starts only after Gitea is ready)
[120/121 tier]
K3s (depends on PostgreSQL@188)
└── kube-proxy (NodePort rules)
    └── Pods: API / Web / Worker
        └── keepalived (VIP)
[Alert pipeline]
Prometheus → Alertmanager(110)
  → AWOOOI API(121:32334/api/v1/webhooks/alertmanager) ← direct, bypasses OpenClaw
  → TelegramGateway → Telegram
[CD pipeline]
Git push → Gitea(110:3001)
  → Act Runner (gitea-runner container)
  → Build Docker image → Harbor(:5000) → kubectl → K3s pods
```
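Because the order is strict, the first service that is not ready is usually the one to investigate. The sketch below walks an ordered list and stops at the first failure. `first_unready` and the `systemd_active` checker in the usage note are hypothetical names, not functions in the deployed startup scripts.

```shell
#!/usr/bin/env bash
# first_unready CHECKER NAME...
# Hypothetical helper: calls CHECKER with each NAME in the given order.
# Prints the first NAME for which CHECKER fails and returns 1;
# returns 0 without output when every NAME passes.
first_unready() {
  local checker="$1"
  shift
  local name
  for name in "$@"; do
    if ! "$checker" "$name"; then
      printf '%s\n' "$name"
      return 1
    fi
  done
  return 0
}
```

On 188 this could be driven by a checker such as `systemd_active() { systemctl is-active --quiet "$1"; }` and called as `first_unready systemd_active containerd docker postgresql@14-main redis-server ollama nginx`.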
### Key Dependencies
| Service | Key dependency | If the dependency fails |
|------|---------|-----------|
| K3s | PostgreSQL@188 ready | K3s cannot start, so no pod can be scheduled |
| OpenClaw | Docker + aiops-network | Telegram AI chat stops working |
| Alertmanager → API | NetworkPolicy allows 192.168.0.110/32 | All alerts go silent |
| CD pipeline | Gitea Act Runner online | All deployments fail |
| Harbor | harbor-log healthy | Every other Harbor container exits with code 128 |
| K8s pods | Harbor (image registry) | ImagePullBackOff |
---
## 🤖 Automation Scripts
## Automation Script Status
### Script Locations
### Deployed (2026-04-05)
```
scripts/reboot-recovery/
├── awoooi-startup.sh # 188 startup script (deployed to /usr/local/bin/)
├── awoooi-startup.service # 188 systemd unit
├── awoooi-startup-110.sh # 110 startup script (deployed to /usr/local/bin/)
├── awoooi-startup-110.service # 110 systemd unit
├── deploy-to-188.sh # one-step deploy to 188
└── deploy-to-110.sh # one-step deploy to 110
```
| Host | Script | Deployed to | systemd service | Status |
|------|------|---------|----------------|------|
| **188** | `awoooi-startup.sh` | `/usr/local/bin/` | `awoooi-startup.service` | ✅ enabled |
| **110** | `awoooi-startup-110.sh` | `/usr/local/bin/` | `awoooi-startup-110.service` | ✅ enabled |
| **120** | K3s native | built into the system | `k3s.service` | ✅ enabled |
| **121** | K3s native | built into the system | `k3s-agent.service` | ✅ enabled |
### 188 Script (`awoooi-startup.sh`) Steps
**Local source**: `scripts/reboot-recovery/`
| Step | Description | Failure handling |
|------|------|---------|
| 1/7 | containerd health check | BoltDB corrupted → auto-delete `meta.db` |
| 2/7 | Docker health check | BoltDB corrupted → auto-delete `local-kv.db` |
| 3/7 | PostgreSQL health check | WAL corrupted → auto-run `pg_resetwal` + VACUUM ANALYZE kine |
| 4/7 | Start Redis | - |
| 5/7 | Start Ollama | - |
| 6/7 | Start Nginx | - |
| 7/7 | SignOz + ClawBot compose up | ClawBot fails → attempt a rebuild; continue even if that fails |
### What Each Script Covers
### 110 Script (`awoooi-startup-110.sh`) Steps
**188 (7 steps)**:
- containerd BoltDB corruption detection and repair
- Docker BoltDB corruption detection and repair
- PostgreSQL WAL corruption detection + pg_resetwal + kine VACUUM
- Start Redis
- Start Ollama
- Start Nginx
- SignOz + MinIO + aiops-network + OpenClaw
| Step | Description | Failure handling |
|------|------|---------|
| 1/5 | Docker health check | BoltDB corrupted → auto-delete every corrupted `.db` |
| 2/5 | Remove orphaned containers | `Exited (128)/(137)` → docker rm + network prune |
| 3/5 | Start Harbor | Wait for harbor-log to be healthy (max 60s) before starting the other components |
| 4/5 | Gitea/Langfuse/Monitoring compose up | - |
| 5/5 | SignOz compose up | - |
### Redeploying the Scripts
```bash
# Run from the Mac (in the awoooi repo directory)
cd scripts/reboot-recovery
bash deploy-to-188.sh # update the scripts on 188
bash deploy-to-110.sh # update the scripts on 110
```
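**110 (6 steps)**:
- Docker BoltDB corruption detection and repair
- Orphaned container removal (Exited 128)
- Harbor (all components start once harbor-log is healthy)
- Gitea + Langfuse + Monitoring (including an Alertmanager health check)
- SignOz
- **Gitea Act Runner** (automatically removes a stale .runner config)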
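The orphaned-container step can target only containers that exited with code 128 or 137, instead of removing every container. This is a minimal sketch under that assumption; `list_orphans` is a hypothetical name, and the deployed `awoooi-startup-110.sh` may select containers differently.

```shell
#!/usr/bin/env bash
# list_orphans
# Hypothetical helper: reads "name<TAB>status" lines on stdin (the shape
# produced by: docker ps -a --format '{{.Names}}\t{{.Status}}') and prints
# the names whose status starts with "Exited (128)" or "Exited (137)".
list_orphans() {
  awk -F '\t' '$2 ~ /^Exited \((128|137)\)/ { print $1 }'
}
```

A possible invocation is `docker ps -a --format '{{.Names}}\t{{.Status}}' | list_orphans`, with the resulting names passed to `docker rm -f` after review.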
---
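## Manual Recovery (when automation fails)
## Planned Reboot Procedure
### Phase 1: 192.168.0.188
> For **expected** reboots such as planned maintenance, security updates, and OS upgrades.
### Pre-Reboot Preparation (T-30 minutes)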
```bash
ssh ollama@192.168.0.188
# 1. Confirm there are no incidents in INVESTIGATING/MITIGATING state
curl -s http://192.168.0.121:32334/api/v1/incidents | python3 -c 'import sys,json; d=json.load(sys.stdin); print("Active incidents:", d.get("count",0))'
# 2. Confirm K3s is healthy
ssh wooo@192.168.0.120 "kubectl get nodes && kubectl get pods -n awoooi-prod"
# 3. Confirm backups (MinIO Velero)
ssh wooo@192.168.0.120 "kubectl get backup -n velero 2>/dev/null | head -5"
```
#### 1.1 containerd (if not running)
### Recommended Reboot Order
```bash
sudo systemctl status containerd
# If BoltDB is corrupted (panic: freepages):
sudo systemctl stop containerd
sudo rm -f /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db
sudo systemctl start containerd
systemctl is-active containerd
```
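Step 1: reboot 188 and 110 together
(they are independent and can run in parallel)
Wait 5 minutes and confirm the base services are ready:
- PostgreSQL accepting connections
- Harbor returning HTTP 200
- Gitea accessible
Step 2: reboot 120 and 121 together
(K3s cluster)
Wait 10 minutes:
- Nodes Ready
- Pods Running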
```
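#### 1.2 Docker (if not running)
> **Important**: reboot 120/121 only after PostgreSQL on 188 is fully ready;
> otherwise K3s kine cannot reach its database and K3s will not start.
### Post-Reboot Verification (T+15 minutes)
Run the E2E verification script at the end of this document. Every item must show ✅.
---
## Unplanned Reboot Procedure
> For **unexpected** reboots such as power loss, OOM, system crashes, and forced restarts.
### Step 1: Quick Diagnosis (T+2 minutes)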
```bash
sudo systemctl status docker
# If BoltDB is corrupted (panic: page already freed / invalid freelist page):
sudo systemctl stop docker
sudo rm -f /var/lib/docker/network/files/local-kv.db
sudo rm -f /var/lib/docker/volumes/metadata.db
find /var/lib/docker/buildkit -name "*.db" -delete 2>/dev/null || true
sudo systemctl start docker
systemctl is-active docker
```
#### 1.3 PostgreSQL (most critical)
```bash
sudo systemctl start postgresql@14-main
sleep 8
systemctl is-active postgresql@14-main # if not active → see Troubleshooting A
pg_isready -h localhost -p 5432 # should report "accepting connections"
# Clean up orphaned kine connections (required after a WAL reset)
sudo -u postgres psql -d k3s_datastore -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname='k3s_datastore' AND pid!=pg_backend_pid()
AND query_start < now() - interval '5 minutes';
# 188 status
ssh ollama@192.168.0.188 "
systemctl is-active containerd docker postgresql@14-main redis-server ollama nginx
docker ps --format '{{.Names}}: {{.Status}}' | grep -E 'clawbot|minio|signoz'
"
sudo -u postgres psql -d k3s_datastore -c "VACUUM ANALYZE kine;"
# 110 status
ssh wooo@192.168.0.110 "
docker ps --format '{{.Names}}: {{.Status}}' | grep -E 'harbor-log|gitea$|alertmanager|gitea-runner'
"
# K3s status
ssh wooo@192.168.0.120 "
kubectl get nodes 2>/dev/null || echo 'K3s not ready'
kubectl get pods -n awoooi-prod 2>/dev/null | grep -v Running | head -10
"
# API health
curl -s --max-time 5 http://192.168.0.121:32334/api/v1/health 2>/dev/null | python3 -c 'import sys,json; d=json.load(sys.stdin); print("API:", d["status"])' 2>/dev/null || echo 'API unreachable'
```
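`systemctl is-active` prints one state per unit, in the order the units were given, so its output can be paired back with the unit names to show only what needs attention. A minimal sketch follows; `report_inactive` is a hypothetical helper, not part of the runbook's scripts.

```shell
#!/usr/bin/env bash
# report_inactive UNIT...
# Hypothetical helper: reads one state per line on stdin (the output of
# `systemctl is-active UNIT...`) and prints "unit: state" for every unit
# that is not "active". Returns 1 if any unit was not active, else 0.
report_inactive() {
  local unit state rc=0
  for unit in "$@"; do
    # A missing line means systemctl printed fewer states than expected.
    IFS= read -r state || state="unknown"
    if [ "$state" != "active" ]; then
      printf '%s: %s\n' "$unit" "$state"
      rc=1
    fi
  done
  return "$rc"
}
```

Usage on 188 might look like `units=(containerd docker postgresql@14-main redis-server ollama nginx); systemctl is-active "${units[@]}" | report_inactive "${units[@]}"`.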
#### 1.4 Redis
### Step 2: Recover by Priority
**P0 (T+0~5 minutes): infrastructure**
```bash
sudo systemctl start redis-server
redis-cli -p 6380 ping # should reply PONG
# Redis config: 0.0.0.0:6380 (bind 0.0.0.0, port 6380)
# The automation scripts normally handle this; if they did not run automatically:
ssh ollama@192.168.0.188 "sudo /usr/local/bin/awoooi-startup.sh"
ssh wooo@192.168.0.110 "echo '0936223270' | sudo -S /usr/local/bin/awoooi-startup-110.sh"
```
#### 1.5 Ollama
**P1 (T+5~15 minutes): K3s cluster**
```bash
sudo systemctl start ollama
sleep 5
curl -sf http://localhost:11434/ | grep running
# Wait until PostgreSQL@188 is ready
ssh ollama@192.168.0.188 "pg_isready -h localhost -p 5432"
# If K3s did not recover on its own
ssh wooo@192.168.0.120 "sudo systemctl restart k3s"
ssh wooo@192.168.0.121 "sudo systemctl restart k3s-agent"
# Wait for Nodes Ready
ssh wooo@192.168.0.120 "kubectl wait --for=condition=Ready nodes --all --timeout=120s"
```
#### 1.6 Nginx + SignOz + ClawBot
**P2 (T+15~30 minutes): business verification**
```bash
sudo systemctl start nginx
# Are all Pods Running?
ssh wooo@192.168.0.120 "kubectl get pods -n awoooi-prod"
cd /home/ollama/signoz/deploy/docker && docker compose up -d
cd /home/ollama/clawbot-v5 && docker compose up -d
```
# API health (including all components)?
curl -s http://192.168.0.121:32334/api/v1/health | python3 -m json.tool
#### 1.7 Phase 1 Acceptance
```bash
pg_isready -h localhost -p 5432 && echo "✅ PostgreSQL"
redis-cli -p 6380 ping | grep -q PONG && echo "✅ Redis :6380"
curl -sf http://localhost:11434/ | grep -q running && echo "✅ Ollama"
systemctl is-active nginx | grep -q active && echo "✅ Nginx"
# Alert pipeline E2E test
curl -X POST http://192.168.0.121:32334/api/v1/webhooks/alertmanager \
-H 'Content-Type: application/json' \
-d '{"receiver":"test","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"RebootRecoveryTest","severity":"info"},"annotations":{"summary":"Reboot recovery test, please ignore"},"startsAt":"2026-04-05T00:00:00Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":""}],"groupLabels":{},"commonLabels":{},"commonAnnotations":{},"externalURL":"","version":"4","groupKey":"test"}'
# Expected: {"success":true,...} and Telegram receives the test alert
```
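The test payload above carries a fixed `startsAt` date. The sketch below builds the same payload with the current UTC time and checks the reply for the `"success":true` field shown in the expected output. Both function names are hypothetical; the payload fields are copied from the example above, and nothing beyond `"success":true` is assumed about the API response.

```shell
#!/usr/bin/env bash
# make_test_payload ALERTNAME
# Hypothetical helper: prints an Alertmanager-style webhook body whose
# startsAt is the current UTC time instead of a hard-coded date.
make_test_payload() {
  local alertname="$1" now
  now="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  printf '{"receiver":"test","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"%s","severity":"info"},"annotations":{"summary":"Reboot recovery test, please ignore"},"startsAt":"%s","endsAt":"0001-01-01T00:00:00Z","generatorURL":""}],"groupLabels":{},"commonLabels":{},"commonAnnotations":{},"externalURL":"","version":"4","groupKey":"test"}' "$alertname" "$now"
}

# webhook_ok
# Hypothetical helper: reads the API response on stdin and succeeds only
# if it contains "success":true.
webhook_ok() {
  grep -q '"success":[[:space:]]*true'
}
```

A call could then be written as `curl -s -X POST http://192.168.0.121:32334/api/v1/webhooks/alertmanager -H 'Content-Type: application/json' -d "$(make_test_payload RebootRecoveryTest)" | webhook_ok`.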
---
### Phase 2: 192.168.0.110
## Detailed Startup Sequence per Host
### 192.168.0.188: Main Service Host
| Step | Service | How to verify | Common failure cause |
|------|------|---------|------------|
| 1 | containerd | `systemctl is-active containerd` | BoltDB meta.db corrupted |
| 2 | Docker | `systemctl is-active docker` | BoltDB local-kv.db corrupted |
| 3 | PostgreSQL | `pg_isready -h localhost -p 5432` | WAL checkpoint corrupted |
| 3a | kine VACUUM | `psql -d k3s_datastore -c "VACUUM ANALYZE kine;"` | Slow queries cause K3s to time out |
| 4 | Redis | `redis-cli -p 6380 ping` | Wrong bind config (must be 0.0.0.0:6380) |
| 5 | Ollama | `systemctl is-active ollama` | Not enough GPU memory |
| 6 | Nginx | `systemctl is-active nginx` | Port conflict |
| 7a | aiops-network | `docker network ls \| grep aiops-network` | External network disappears after a reboot |
| 7b | OpenClaw | `curl http://localhost:8088/health` | Corrupted pip dependencies require a rebuild |
| 7c | MinIO | `docker ps \| grep minio` | Stateless service; a restart is enough |
| 7d | SignOz | `curl http://localhost:3301/-/healthy` | ClickHouse is slow to start |
**BoltDB corruption repair**:
```bash
ssh wooo@192.168.0.110
# containerd meta.db
BOLT=/var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db
cp $BOLT ${BOLT}.bak.$(date +%s) && rm $BOLT && systemctl restart containerd
# Docker network BoltDB
BOLT=/var/lib/docker/network/files/local-kv.db
cp $BOLT ${BOLT}.bak.$(date +%s) && rm $BOLT && systemctl restart docker
```
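The repair above keeps a timestamped copy before removing the database. A small guard can make the step safe to re-run when the file is already gone. This is a sketch only: `quarantine_db` is a hypothetical name, and it uses `mv` in place of the `cp` + `rm` pair shown above.

```shell
#!/usr/bin/env bash
# quarantine_db PATH
# Hypothetical helper: moves a BoltDB file aside as PATH.bak.<epoch seconds>
# so the daemon recreates it on restart. Prints the backup path on success;
# returns 1 without touching anything if PATH is not a regular file.
quarantine_db() {
  local db="$1" backup
  [ -f "$db" ] || return 1
  backup="${db}.bak.$(date +%s)"
  mv -- "$db" "$backup" && printf '%s\n' "$backup"
}
```

For example, `sudo bash -c "$(declare -f quarantine_db); quarantine_db /var/lib/docker/network/files/local-kv.db"` followed by `sudo systemctl restart docker`, with the daemon stopped first if it is still running.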
#### 2.1 Docker Repair
**PostgreSQL WAL repair**:
```bash
systemctl is-active docker || {
echo "0936223270" | sudo -S bash -c "
rm -f /var/lib/docker/network/files/local-kv.db
rm -f /var/lib/docker/volumes/metadata.db
find /var/lib/docker/buildkit -name '*.db' -delete 2>/dev/null
systemctl start docker
"
}
# Confirm it is WAL corruption
journalctl -u postgresql@14-main -n 30 | grep "could not locate a valid checkpoint"
# Repair
systemctl stop postgresql@14-main
/usr/lib/postgresql/14/bin/pg_resetwal -f /var/lib/postgresql/14/main
systemctl start postgresql@14-main
```
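Since `pg_resetwal -f` discards WAL and cannot be undone, it helps to make the confirmation step an explicit gate. The sketch below succeeds only when the log contains the checkpoint error quoted above; `needs_resetwal` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# needs_resetwal
# Hypothetical helper: reads PostgreSQL journal output on stdin and
# succeeds only if it contains the WAL checkpoint error that this
# runbook treats as the signal for pg_resetwal.
needs_resetwal() {
  grep -q 'could not locate a valid checkpoint'
}
```

A guarded call might read `journalctl -u postgresql@14-main -n 30 | needs_resetwal && echo "WAL corruption confirmed"`, leaving the actual reset as a deliberate manual step.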
#### 2.2 Remove Orphaned Containers (critical!)
### 192.168.0.110: DevOps Host
| Step | Service | How to verify | Common failure cause |
|------|------|---------|------------|
| 1 | Docker | `systemctl is-active docker` | BoltDB corrupted |
| 2 | Orphaned container removal | `docker ps -a \| grep "Exited (128)"` | Old containers reference a network that no longer exists |
| 3 | harbor-log | `docker inspect --format='{{.State.Health.Status}}' harbor-log` | syslog :1514 not ready |
| 3 | Harbor, all components | `curl http://localhost:5000/v2/` | harbor-log not yet healthy |
| 4 | Gitea | `curl http://localhost:3001` | No particular issues |
| 5 | Langfuse | `curl http://localhost:3100` | DB not ready |
| 6 | Alertmanager | `curl http://localhost:9093/-/healthy` | docker compose not started |
| 6a | Alert pipeline check | Send a test webhook | Wrong webhook URL / NetworkPolicy |
| 7 | SignOz | `curl http://localhost:3301/-/healthy` | ClickHouse is slow to start |
| 8 | Gitea Runner | `docker logs gitea-runner \| grep SUCCESS` | Stale .runner config |
**Harbor Exited 128 repair**:
```bash
# After a reboot the Docker networks used by the old containers no longer exist, so the containers must be removed
docker rm -f $(docker ps -aq 2>/dev/null) 2>/dev/null || true
docker network prune -f 2>/dev/null || true
```
#### 2.3 Harbor (wait for harbor-log to be healthy first)
```bash
cd /home/wooo/harbor/harbor
docker compose up -d
# Wait for harbor-log to be healthy (up to 60 seconds)
for i in $(seq 1 12); do
  STATUS=$(docker inspect --format='{{.State.Health.Status}}' harbor-log 2>/dev/null)
  echo "[$i] harbor-log: $STATUS"
  [ "$STATUS" = "healthy" ] && break
  sleep 5
# Wait for harbor-log to be healthy
until [ "$(docker inspect --format='{{.State.Health.Status}}' harbor-log)" = "healthy" ]; do
  echo "waiting..."; sleep 5
done
# Once harbor-log is healthy, restart the other components (they depend on the :1514 syslog)
docker compose up -d
# Remove containers that exited with code 128
docker rm -f $(docker ps -a --filter status=exited --format "{{.Names}}" | grep -v harbor-log)
# Start again
cd /home/wooo/harbor/harbor && docker compose up -d
```
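An `until` loop with no upper bound waits forever if harbor-log never becomes healthy, whereas the bounded variant stops after 60 seconds. The sketch below shows one way to keep the bound while still reporting failure; `wait_for_status` is a hypothetical helper, not taken from the deployed script.

```shell
#!/usr/bin/env bash
# wait_for_status EXPECTED TRIES DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND up to TRIES times and succeeds as soon
# as its stdout equals EXPECTED. Returns 1 if the status never matched.
wait_for_status() {
  local expected="$1" tries="$2" delay="$3"
  shift 3
  local i status
  for ((i = 1; i <= tries; i++)); do
    # A failing probe simply yields an empty status and is retried.
    status="$("$@" 2>/dev/null || true)"
    if [ "$status" = "$expected" ]; then
      return 0
    fi
    if ((i < tries)); then
      sleep "$delay"
    fi
  done
  return 1
}
```

With 12 tries and a 5 second delay this matches the 60 second budget: `wait_for_status healthy 12 5 docker inspect --format='{{.State.Health.Status}}' harbor-log || echo "harbor-log did not become healthy"`.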
#### 2.4 Other Services
**Re-registering the Gitea Runner**:
```bash
cd /home/wooo/gitea && docker compose up -d
cd /home/wooo/langfuse && docker compose up -d
cd /home/wooo/monitoring && docker compose up -d
cd /home/wooo/signoz/deploy/docker && docker compose up -d
# Remove the stale config (it points at the wrong hostname)
sudo rm /home/wooo/act-runner/data/.runner
# Get a new registration token
curl -s -u 'wooo:TOKEN' 'http://192.168.0.110:3001/api/v1/repos/wooo/awoooi/actions/runners/registration-token'
# Update GITEA_RUNNER_REGISTRATION_TOKEN in docker-compose.yml
# then start the runner
cd /home/wooo/act-runner && docker compose up -d
```
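Copying the token by hand is easy to get wrong. The sketch below pulls it out of the API reply, assuming the registration-token endpoint answers with a JSON object containing a `token` string; that response shape and the `extract_token` name are assumptions and should be checked against the actual reply.

```shell
#!/usr/bin/env bash
# extract_token
# Hypothetical helper: reads a JSON reply on stdin and prints the value of
# its "token" field. Assumes a flat object such as {"token":"abc123"}.
# Prints nothing if no such field is present.
extract_token() {
  sed -n 's/.*"token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}
```

For example, `TOKEN_VALUE="$(curl -s -u 'wooo:TOKEN' 'http://192.168.0.110:3001/api/v1/repos/wooo/awoooi/actions/runners/registration-token' | extract_token)"`, after which the value can be written into `docker-compose.yml`.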
#### 2.5 Phase 2 Acceptance
### 192.168.0.120/121: K3s Cluster
```bash
curl -s -o /dev/null -w "%{http_code}" http://localhost:5000/v2/ | grep -q 401 && echo "✅ Harbor :5000"
curl -s -o /dev/null -w "%{http_code}" http://localhost:3001/ | grep -q 200 && echo "✅ Gitea :3001"
curl -s -o /dev/null -w "%{http_code}" http://localhost:3100/ | grep -q 200 && echo "✅ Langfuse :3100"
curl -s -o /dev/null -w "%{http_code}" http://localhost:9093/ | grep -q 200 && echo "✅ Alertmanager :9093"
curl -s -o /dev/null -w "%{http_code}" http://localhost:3002/ | grep -q 302 && echo "✅ Grafana :3002"
| Step | Item | How to verify |
|------|------|---------|
| Prerequisite | PostgreSQL@188 ready | `pg_isready -h 192.168.0.188 -p 5432` |
| 1 | K3s service | `systemctl is-active k3s` (120) |
| 2 | K3s agent | `systemctl is-active k3s-agent` (121) |
| 3 | Nodes Ready | `kubectl get nodes` |
| 4 | Pods Running | `kubectl get pods -n awoooi-prod` |
| 5 | keepalived VIP | `ip addr show \| grep 192.168.0.125` (120) |
| 6 | NodePort | `nc -zv 192.168.0.121 32334` |
---
## Troubleshooting Handbook
### 1. Silent Alerts (Telegram receives no alerts)
```
Diagnosis tree:
Alertmanager healthy? (http://192.168.0.110:9093/-/healthy)
├── NO → docker compose up -d (on 110, in /home/wooo/monitoring/)
└── YES
    Webhook URL correct?
    grep 'url:' /home/wooo/monitoring/alertmanager.yml
    Must be: http://192.168.0.121:32334/api/v1/webhooks/alertmanager
    ├── NO → fix the URL, then curl -X POST http://localhost:9093/-/reload
    └── YES
        Does a curl POST to the webhook from 110 succeed?
        curl -X POST http://192.168.0.121:32334/api/v1/webhooks/alertmanager ...
        ├── timeout → NetworkPolicy does not allow 110
        │   kubectl apply -f k8s/awoooi-prod/02-network-policy.yaml
        └── {"success":true} → check the Telegram Bot Token
```
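**Key lessons (2026-04-05)**:
- Old architecture: Alertmanager → OpenClaw 8088 (an intermediate hop; if OpenClaw was unhealthy after a reboot, alerts went silent)
- New architecture: Alertmanager → AWOOOI API directly (more stable, fewer single points of failure)
- NetworkPolicy must allow 192.168.0.110/32 into the pod
### 2. Gitea CD Cannot Deploy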
```
Diagnosis tree:
How many Gitea runners?
curl -u 'wooo:TOKEN' 'http://192.168.0.110:3001/api/v1/repos/wooo/awoooi/actions/runners'
├── total_count: 0 → runner offline
│   docker ps | grep gitea-runner
│   ├── not present → cd /home/wooo/act-runner && docker compose up -d
│   └── present but Restarting → docker logs gitea-runner
│       "lookup gitea" → delete data/.runner → docker compose up -d
└── total_count: 1 → runner online but CD fails
    Check the Gitea Actions logs: http://192.168.0.110:3001/wooo/awoooi/actions
```
### 3. Site Shows 0 / Incidents Disappear
```
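Cause: Redis Working Memory is emptied on restart
  Newer API builds (f4f454f+) warm up automatically at startup
  Older API builds lack this logic
Diagnosis:
  curl http://192.168.0.121:32334/api/v1/incidents
  → count: 0 although Telegram has historical alerts?
  → check the API pod image version
Fix (if the image version is current):
  kubectl rollout restart deployment/awoooi-api -n awoooi-prod
Fix (if the image version is too old):
  Wait for CD to finish deploying the latest version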
```
### 4. NodePort 32334 Unreachable from Outside
```
Diagnosis:
nc -zv 192.168.0.121 32334 ← does this connect?
├── yes → connect directly to 192.168.0.121:32334
└── no
    kubectl get pods -n awoooi-prod
    ├── no Running pods → problem with the K3s pods
    └── Running pods exist
        iptables -t nat -L KUBE-NODEPORTS -n | grep 32334
        ├── no rule → K3s kube-proxy problem; restart k3s
        └── rule exists → NetworkPolicy or firewall
### 5. Harbor ImagePullBackOff
```
Diagnosis:
kubectl describe pod <pod> -n awoooi-prod | grep -A 5 Events
→ "ImagePullBackOff" or "ErrImagePull"
Confirm Harbor is healthy:
curl http://192.168.0.110:5000/v2/ → should return {}
If Harbor is unhealthy:
- Wait for harbor-log to be healthy, then restart: see the Harbor Exited 128 repair
- Confirm the Docker daemon on 120/121 has the insecure registry configured:
cat /etc/docker/daemon.json | grep insecure
```
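The registry's `/v2/` endpoint answers 200 when it is reachable without credentials and 401 when it is reachable but requires authentication, so either code indicates a live registry. The sketch below encodes that rule; `registry_code_ok` is a hypothetical helper, and which of the two codes this Harbor instance returns is not assumed.

```shell
#!/usr/bin/env bash
# registry_code_ok HTTP_CODE
# Hypothetical helper: succeeds for the two status codes that indicate a
# reachable registry on /v2/ (200 = open, 401 = authentication required).
registry_code_ok() {
  case "$1" in
    200 | 401) return 0 ;;
    *) return 1 ;;
  esac
}
```

Usage: `registry_code_ok "$(curl -s -o /dev/null -w '%{http_code}' http://192.168.0.110:5000/v2/)" && echo "Harbor reachable"`. When curl cannot connect it reports `000`, which this check rejects.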
---
### Phase 3: K3s Control-Plane
## E2E Verification Script
> ⚠️ Run this only after Phase 1 PostgreSQL and Phase 2 Harbor are fully ready!
```bash
# Confirm the prerequisites first
ssh ollama@192.168.0.188 "pg_isready -h localhost -p 5432"
ssh wooo@192.168.0.110 "curl -s -o /dev/null -w '%{http_code}' http://localhost:5000/v2/"
# PostgreSQL: accepting connections
# Harbor: 401 (authentication required, which is normal)
```
#### 3.1 K3s Nodes (normally start automatically)
```bash
# If k3s has not started
ssh wooo@192.168.0.120 "echo '0936223270' | sudo -S systemctl start k3s"
ssh wooo@192.168.0.121 "echo '0936223270' | sudo -S systemctl start k3s"
# Confirm node status
ssh wooo@192.168.0.120 "echo '0936223270' | sudo -S k3s kubectl get nodes"
```
#### 3.2 If Pods Are in ImagePullBackOff
```bash
# Harbor has only just come up; force a rollout
ssh wooo@192.168.0.120 "echo '0936223270' | sudo -S bash -c '
k3s kubectl delete pod -l app=awoooi-api -n awoooi-prod
k3s kubectl delete pod -l app=awoooi-web -n awoooi-prod
k3s kubectl delete pod -l app=awoooi-worker -n awoooi-prod
'"
```
#### 3.3 Phase 3 Acceptance
```bash
ssh wooo@192.168.0.120 "echo '0936223270' | sudo -S k3s kubectl get pods -n awoooi-prod"
# All Pods should be Running
curl http://192.168.0.125:32334/api/v1/health
# Expected: status=healthy, or degraded (openclaw down is acceptable)
```
---
## Full Automated Acceptance Script
Run this script to confirm the whole system is healthy after a reboot.
```bash
#!/bin/bash
# Run from: Mac (awoooi repo)
echo "=== AWOOOI post-reboot full acceptance $(date '+%Y-%m-%d %H:%M:%S') ==="
# Post-reboot E2E verification
# Usage: bash docs/runbooks/scripts/e2e-verify.sh
# Run from: any machine with SSH access
# 188 base services
echo "--- 192.168.0.188 ---"
ssh ollama@192.168.0.188 "
pg_isready -h localhost -p 5432 >/dev/null 2>&1 && echo '✅ PostgreSQL :5432' || echo '❌ PostgreSQL DOWN'
redis-cli -p 6380 ping 2>/dev/null | grep -q PONG && echo '✅ Redis :6380' || echo '❌ Redis DOWN'
curl -sf http://localhost:11434/ 2>/dev/null | grep -q running && echo '✅ Ollama :11434' || echo '❌ Ollama DOWN'
systemctl is-active nginx 2>/dev/null | grep -q active && echo '✅ Nginx' || echo '❌ Nginx DOWN'
docker ps --format '{{.Names}}\t{{.Status}}' 2>/dev/null | grep -E '^signoz|^clawbot' | head -5
"
API="http://192.168.0.121:32334"
GREEN='\033[0;32m'; RED='\033[0;31m'; NC='\033[0m'
PASS=0; FAIL=0
# 110 DevOps vault
echo "--- 192.168.0.110 ---"
ssh wooo@192.168.0.110 "
curl -s -o /dev/null -w '%{http_code}' http://localhost:5000/v2/ 2>/dev/null | grep -q 401 && echo '✅ Harbor :5000' || echo '❌ Harbor DOWN'
curl -s -o /dev/null -w '%{http_code}' http://localhost:3001/ 2>/dev/null | grep -q 200 && echo '✅ Gitea :3001' || echo '❌ Gitea DOWN'
curl -s -o /dev/null -w '%{http_code}' http://localhost:3100/ 2>/dev/null | grep -q 200 && echo '✅ Langfuse :3100' || echo '❌ Langfuse DOWN'
curl -s -o /dev/null -w '%{http_code}' http://localhost:9093/ 2>/dev/null | grep -q 200 && echo '✅ Alertmanager :9093' || echo '❌ Alertmanager DOWN'
curl -s -o /dev/null -w '%{http_code}' http://localhost:3002/ 2>/dev/null | grep -qE '200|302' && echo '✅ Grafana :3002' || echo '❌ Grafana DOWN'
"
check() {
  local name="$1"; local cmd="$2"; local expect="$3"
  result=$(eval "$cmd" 2>/dev/null)
  if echo "$result" | grep -q "$expect"; then
    echo -e "${GREEN}$name${NC}"
    ((PASS++))
  else
    echo -e "${RED}$name${NC} (got: ${result:0:80})"
    ((FAIL++))
  fi
}
# K3s and Pods
echo "--- K3s (via 120) ---"
ssh wooo@192.168.0.120 "
echo '0936223270' | sudo -S bash -c '
k3s kubectl get nodes 2>/dev/null
k3s kubectl get pods -n awoooi-prod 2>/dev/null
'
"
echo "=== AWOOOI post-reboot E2E verification $(date '+%Y-%m-%d %H:%M:%S') ==="
# API E2E
echo "--- API E2E ---"
curl -s http://192.168.0.125:32334/api/v1/health 2>/dev/null | \
python3 -c "import sys,json; d=json.load(sys.stdin); [print('✅' if v['status']=='up' else '⚠️', k, v['status']) for k,v in d['components'].items()]" || \
echo '❌ API E2E FAILED'
# 188 services
check "188 PostgreSQL" "ssh ollama@192.168.0.188 'pg_isready -h localhost -p 5432'" "accepting"
check "188 Redis" "ssh ollama@192.168.0.188 'redis-cli -p 6380 ping'" "PONG"
check "188 OpenClaw" "ssh ollama@192.168.0.188 'curl -s http://localhost:8088/health'" "healthy"
echo "=== Acceptance complete ==="
# 110 services
check "110 Harbor" "curl -s -o /dev/null -w '%{http_code}' http://192.168.0.110:5000/v2/" "200"
check "110 Gitea" "curl -s -o /dev/null -w '%{http_code}' http://192.168.0.110:3001" "200"
check "110 Alertmanager" "curl -s http://192.168.0.110:9093/-/healthy" "OK"
check "110 Gitea Runner" "curl -s -u 'wooo:TOKEN' 'http://192.168.0.110:3001/api/v1/repos/wooo/awoooi/actions/runners' | python3 -c 'import sys,json; d=json.load(sys.stdin); print(d.get(\"total_count\",0))'" "1"
# K3s
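# check() matches substrings, so "NotReady" would pass as "Ready" and "13" as "3"; test every node and anchor the count
check "K3s Nodes Ready" "ssh wooo@192.168.0.120 'kubectl get nodes --no-headers' | awk '\$2 == \"Ready\" {r++} END {print ((NR > 0 && r == NR) ? \"ALL_READY\" : \"NODE_NOT_READY\")}'" "ALL_READY"
check "K3s Pods Running" "ssh wooo@192.168.0.120 'kubectl get pods -n awoooi-prod --no-headers' | grep -c Running" "^3$"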
# AWOOOI API
check "API Health" "curl -s $API/api/v1/health" '"status":"healthy"'
check "API openclaw" "curl -s $API/api/v1/health" '"openclaw":{"status":"up"'
# Alert pipeline E2E
PAYLOAD='{"receiver":"e2e-test","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"RebootE2ETest","severity":"info"},"annotations":{"summary":"Reboot E2E verification, please ignore"},"startsAt":"2026-04-05T00:00:00Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":""}],"groupLabels":{},"commonLabels":{},"commonAnnotations":{},"externalURL":"","version":"4","groupKey":"e2e-test"}'
check "Alert pipeline E2E" "curl -s -X POST $API/api/v1/webhooks/alertmanager -H 'Content-Type: application/json' -d '$PAYLOAD'" '"success":true'
echo ""
echo "=== Result: ${PASS} passed, ${FAIL} failed ==="
[ $FAIL -eq 0 ] && echo -e "${GREEN}🎉 All checks passed! System is healthy${NC}" || echo -e "${RED}⚠️ ${FAIL} checks failed, please investigate${NC}"
```
---
## Troubleshooting
## Version History
### A. PostgreSQL WAL Corruption
**Symptom**:
```
PANIC: could not locate a valid checkpoint record
```
**Fix** (requires the commander's authorization):
```bash
ssh ollama@192.168.0.188
# 1. Confirm the error
sudo journalctl -u postgresql@14-main -n 20 | grep -E 'PANIC|checkpoint'
# 2. Force-reset the WAL (loses the last few transactions; irreversible)
sudo /usr/lib/postgresql/14/bin/pg_resetwal -f /var/lib/postgresql/14/main
# 3. Restart and verify
sudo systemctl start postgresql@14-main
sleep 8
pg_isready -h localhost -p 5432
# 4. Kill stale connections and rebuild kine (required)
sudo -u postgres psql -d k3s_datastore -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname='k3s_datastore' AND pid!=pg_backend_pid();
"
sudo -u postgres psql -d k3s_datastore -c "REINDEX TABLE kine; VACUUM ANALYZE kine;"
# ⚠️ If a stale connection cannot be terminated, kill it at the OS level: sudo kill -9 <pid>
```
---
### B. Docker Daemon Corruption (BoltDB)
**Symptom**:
```
panic: freepages: failed to get all reachable pages (containerd)
panic: page already freed (Docker network)
failed to create task: failed to initialize logging (Harbor containers)
```
**Fix (188)**:
```bash
sudo systemctl stop docker containerd
sudo rm -f /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db
sudo rm -f /var/lib/docker/network/files/local-kv.db
sudo systemctl start containerd && sleep 5
sudo systemctl start docker
```
**Fix (110, where container state must also be cleared)**:
```bash
sudo systemctl stop docker
sudo rm -f /var/lib/docker/network/files/local-kv.db
sudo rm -f /var/lib/docker/volumes/metadata.db
find /var/lib/docker/buildkit -name "*.db" -delete 2>/dev/null
sudo rm -rf /var/lib/docker/containers/* # clear orphaned container records
sudo systemctl start docker
sleep 5
docker rm -f $(docker ps -aq) 2>/dev/null || true # clear leftovers
docker network prune -f
```
---
### C. K3s Kine Slow Queries
**Symptom**: K3s stays in `activating` for more than 3-5 minutes, and the log shows:
```
Slow SQL (total time: 1m3.889s): SELECT ... FROM kine AS kv WHERE kv.name LIKE $1 ...
```
**Fix**:
```bash
# 1. Stop K3s (releases the PG connections)
ssh wooo@192.168.0.120 "echo '0936223270' | sudo -S systemctl stop k3s"
ssh wooo@192.168.0.121 "echo '0936223270' | sudo -S systemctl stop k3s"
# 2. Kill stale connections (if pg_terminate has no effect, kill at the OS level)
ssh ollama@192.168.0.188 "echo '0936223270' | sudo -S -u postgres psql -d k3s_datastore \
-c \"SELECT pid, query_start, state FROM pg_stat_activity WHERE datname='k3s_datastore';\""
# For each stale PID: sudo kill -9 <pid>
# 3. Rebuild indexes and statistics
ssh ollama@192.168.0.188 "echo '0936223270' | sudo -S -u postgres psql -d k3s_datastore \
-c 'REINDEX TABLE kine; VACUUM ANALYZE kine;'"
# 4. Restart K3s
ssh wooo@192.168.0.120 "echo '0936223270' | sudo -S systemctl start k3s"
ssh wooo@192.168.0.121 "echo '0936223270' | sudo -S systemctl start k3s"
```
---
### D. All Harbor Containers Exited (128)
**Symptom**:
```
Error: failed to create task for container: failed to initialize logging driver: dial tcp 127.0.0.1:1514: connect: connection refused
```
**Cause**: every Harbor container's log driver points at harbor-log's syslog (:1514). If the other containers start before harbor-log is healthy, they all fail.
**Fix**:
```bash
# 1. Remove all failed containers
docker rm -f $(docker ps -aq) 2>/dev/null
docker network prune -f
# 2. Restart Harbor (harbor-log comes up first)
cd /home/wooo/harbor/harbor && docker compose up -d
# 3. Wait for harbor-log to be healthy (about 30 seconds)
until [ "$(docker inspect --format='{{.State.Health.Status}}' harbor-log)" = "healthy" ]; do
echo "waiting harbor-log..."; sleep 5
done
# 4. Now restart the other components
docker compose up -d
```
---
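## Known Limitations
| Item | Description | Suggested follow-up |
|------|------|---------|
| ClawBot build | pip wheels corrupted; non-critical while STANDBY_MODE=true | Clear the wheel cache periodically |
| K3s has no automatic dependency on PG | k3s.service has no After=postgresql@14-main.service | Consider adding a systemd dependency |
| Redis needs manual configuration | /etc/redis/redis.conf already sets bind 0.0.0.0 and port 6380; must be reapplied after a reinstall | Add a self-check to awoooi-startup.sh |
---
*Document fully revised by Claude Code on 2026-04-05 after the second reboot incident*
| Version | Date | Description |
|------|------|------|
| v1.0 | 2026-04-04 | Initial version: containerd + Docker + PostgreSQL WAL repair procedures |
| v2.0 | 2026-04-05 AM | 188 startup script + 110 startup script + Harbor race condition fix |
| v3.0 | 2026-04-05 PM | Full architecture re-inventory + Gitea Runner automation + alert pipeline root-cause fix + NetworkPolicy correction |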