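docs(runbooks): REBOOT-RECOVERY-SOP v3.0, complete reboot automation plan
## Contents
A full inventory of every host, service, tool, and monitoring component, covering:
- Startup order and dependency graph
- Planned vs. unplanned reboot procedures
- Detailed startup sequence per host (188/110/120/121)
- Troubleshooting handbook for common failures (silent alerts / broken CD / missing data / NodePort)
- E2E verification script

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>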
@@ -1,498 +1,521 @@
|
||||
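# Reboot Recovery SOP
# AWOOOI Reboot Recovery SOP

> Last updated: 2026-04-05 ogt. Fully revised after the second reboot incident; automation scripts added
> Applies to: AWOOOI five-host architecture
> **Version**: v3.0
> **Last updated**: 2026-04-05 (Taipei time)
> **Updated by**: Claude Code (chief architect)
> **Trigger**: full inventory after two reboot incidents + root-cause fix of the alert pipeline + Gitea Runner automation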
|
||||
|
||||
---
|
||||
|
||||
## 🤖 自動化狀態(最優先確認)
|
||||
## 目錄
|
||||
|
||||
| 主機 | systemd service | 狀態 | 說明 |
|
||||
|------|----------------|------|------|
|
||||
| 192.168.0.188 | `awoooi-startup.service` | ✅ enabled | 自動修復 BoltDB + 啟動所有服務 |
|
||||
| 192.168.0.110 | `awoooi-startup-110.service` | ✅ enabled | 自動修復 BoltDB + 啟動所有服務 |
|
||||
| 192.168.0.120 | `k3s.service` | systemd 管理 | 依賴 PG 就緒,自動啟動 |
|
||||
| 192.168.0.121 | `k3s.service` | systemd 管理 | 自動啟動 |
|
||||
|
||||
**Under normal conditions, all services should recover automatically within 3-5 minutes after a reboot.**
|
||||
|
||||
確認方式(重開機後執行):
|
||||
```bash
|
||||
# 確認自動化腳本執行結果
|
||||
ssh ollama@192.168.0.188 "sudo journalctl -u awoooi-startup.service -n 30 --no-pager"
|
||||
ssh wooo@192.168.0.110 "echo '0936223270' | sudo -S journalctl -u awoooi-startup-110.service -n 30 --no-pager"
|
||||
```
|
||||
1. [架構概覽與依賴圖](#架構概覽與依賴圖)
|
||||
2. [自動化腳本狀態](#自動化腳本狀態)
|
||||
3. [正常重啟流程 (計劃性維護)](#正常重啟流程)
|
||||
4. [異常重啟流程 (緊急恢復)](#異常重啟流程)
|
||||
5. [各主機詳細啟動序列](#各主機詳細啟動序列)
|
||||
6. [常見故障排查手冊](#常見故障排查手冊)
|
||||
7. [E2E 驗證腳本](#e2e-驗證腳本)
|
||||
|
||||
---
|
||||
|
||||
## ⚡ 全系統啟動順序(依賴關係)
|
||||
## 架構概覽與依賴圖
|
||||
|
||||
### 五主機全貌
|
||||
|
||||
```
|
||||
重開機後啟動順序(強制順序,不可逆轉):
|
||||
192.168.0.188 (OLLAMA / 主服務主機)
|
||||
├── containerd ← 基礎,必須最先啟動
|
||||
├── Docker ← 依賴 containerd
|
||||
├── PostgreSQL :5432 ← K3s kine 後端,API DB
|
||||
├── Redis :6380 ← Working Memory (重啟後清空,需 warm-up)
|
||||
├── Ollama :11434 ← LLM 推論引擎
|
||||
├── Nginx :80/:443 ← 反向代理
|
||||
├── OpenClaw (clawbot) :8088 ← AI 核心 (Docker Compose)
|
||||
├── MinIO ← Velero 備份存儲 (Docker Compose)
|
||||
└── SignOz ← 可觀測性 (Docker Compose)
|
||||
|
||||
┌─────────────────────────────────────────┐
|
||||
│ Phase 1: 192.168.0.188 基礎設施 │ ← 最先
|
||||
│ ├─ containerd (BoltDB 修復) │
|
||||
│ ├─ Docker (BoltDB 修復) │
|
||||
│ ├─ PostgreSQL (WAL 修復 + kine VACUUM) │ ← K3s Kine Datastore
|
||||
│ ├─ Redis (0.0.0.0:6380) │ ← API/Worker 依賴
|
||||
│ ├─ Ollama │
|
||||
│ ├─ Nginx │
|
||||
│ ├─ SignOz (docker compose) │
|
||||
│ └─ ClawBot (docker compose) │
|
||||
└─────────────────────────────────────────┘
|
||||
↓ 必須先完成
|
||||
┌─────────────────────────────────────────┐
|
||||
│ Phase 2: 192.168.0.110 DevOps 金庫 │ ← 同步進行 or 稍後
|
||||
│ ├─ Docker (BoltDB 修復) │
|
||||
│ ├─ 清除孤兒容器 (network 不存在問題) │
|
||||
│ ├─ harbor-log (先 healthy!) │ ← 其他 Harbor 依賴它
|
||||
│ ├─ Harbor 其他元件 (nginx/core/db/...) │ ← K3s imagePull 依賴
|
||||
│ ├─ Gitea │ ← CI/CD 依賴
|
||||
│ ├─ Langfuse │
|
||||
│ ├─ Monitoring (Prometheus/Grafana/AM) │
|
||||
│ └─ SignOz │
|
||||
└─────────────────────────────────────────┘
|
||||
↓ 必須先完成 (PostgreSQL + Harbor)
|
||||
┌─────────────────────────────────────────┐
|
||||
│ Phase 3: K3s Control-Plane │ ← 最後
|
||||
│ ├─ 120 k3s.service → Ready │
|
||||
│ ├─ 121 k3s.service → Ready │
|
||||
│ └─ awoooi-prod Pods Running │
|
||||
└─────────────────────────────────────────┘
|
||||
192.168.0.110 (DevOps 主機)
|
||||
├── Docker
|
||||
├── Harbor :5000 ← Container Registry (K3s 拉映像用)
|
||||
├── Gitea :3001 ← 主要 Git / CI 管理介面
|
||||
├── Gitea Act Runner ← CD pipeline 執行者 (docker: gitea-runner)
|
||||
├── Langfuse :3100 ← LLMOps 追蹤
|
||||
├── Prometheus :9090 ← 指標收集
|
||||
├── Alertmanager :9093 ← 告警路由
|
||||
├── Grafana :3002 ← 監控儀表板
|
||||
└── SignOz ← 可觀測性
|
||||
|
||||
192.168.0.120 (K3s Master - mon)
|
||||
├── K3s server (control plane)
|
||||
├── kube-proxy ← NodePort 轉發 32334/32335
|
||||
└── keepalived ← VIP 192.168.0.125 (secondary)
|
||||
|
||||
192.168.0.121 (K3s Worker - mon1)
|
||||
├── K3s agent (worker)
|
||||
└── kube-proxy
|
||||
|
||||
192.168.0.125 (VIP, keepalived 管理)
|
||||
├── → :32334 AWOOOI API
|
||||
└── → :32335 AWOOOI Web
|
||||
```
|
||||
|
||||
### 服務依賴關係
|
||||
|
||||
```
|
||||
嚴格啟動順序:
|
||||
|
||||
【188 層】
|
||||
containerd
|
||||
└── Docker
|
||||
├── PostgreSQL (kine DB, API DB)
|
||||
│ └── Redis (Working Memory)
|
||||
│ └── Ollama (LLM)
|
||||
│ └── Nginx
|
||||
│ ├── SignOz
|
||||
│ ├── MinIO
|
||||
│ └── OpenClaw (依賴 aiops-network)
|
||||
|
||||
【110 層】
|
||||
Docker
|
||||
└── harbor-log (等 healthy)
|
||||
└── Harbor 全組件 (:5000)
|
||||
└── Gitea (:3001)
|
||||
├── Langfuse
|
||||
├── Monitoring (Prometheus + Alertmanager + Grafana)
|
||||
│ [Alertmanager → AWOOOI API 告警鏈路]
|
||||
├── SignOz
|
||||
└── Gitea Act Runner (等 Gitea 就緒後才啟動)
|
||||
|
||||
【120/121 層】
|
||||
K3s (依賴 PostgreSQL@188)
|
||||
└── kube-proxy (NodePort 規則)
|
||||
└── Pods: API / Web / Worker
|
||||
└── keepalived (VIP)
|
||||
|
||||
【告警鏈路】
|
||||
Prometheus → Alertmanager(110)
|
||||
→ AWOOOI API(121:32334/api/v1/webhooks/alertmanager) ← 直接,不走 OpenClaw
|
||||
→ TelegramGateway → Telegram
|
||||
|
||||
【CD 鏈路】
|
||||
Git push → Gitea(110:3001)
|
||||
→ Act Runner (gitea-runner container)
|
||||
→ Build Docker image → Harbor(:5000) → kubectl → K3s pods
|
||||
```
|
||||
|
||||
### 關鍵依賴說明
|
||||
|
||||
| Service | Critical dependency | If the dependency fails |
|------|---------|-----------|
| K3s | PostgreSQL@188 ready | K3s cannot start; no pod can be scheduled |
| OpenClaw | Docker + aiops-network | Telegram AI chat stops working |
| Alertmanager → API | NetworkPolicy allows 192.168.0.110/32 | All alerts go silent |
| CD pipeline | Gitea Act Runner online | All deployments fail |
| Harbor | harbor-log healthy | All other Harbor containers exit with code 128 |
| K8s pods | Harbor (image registry) | ImagePullBackOff |
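
The dependency table above can be read as an ordered gate: check each dependency in turn and stop at the first one that is down, because everything after it will fail anyway. The following is a minimal sketch of that idea, not part of the deployed scripts. The helper name `first_failed_dependency` and the `name=command` argument convention are assumptions made for illustration; the probe commands in the usage comment are the ones this runbook already uses.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of awoooi-startup.sh): run dependency probes in
# order and print the name of the first one that fails, or "none".
# Each argument has the form "name=command"; the command is run with bash -c.
first_failed_dependency() {
  local pair name cmd
  for pair in "$@"; do
    name=${pair%%=*}   # text before the first '='
    cmd=${pair#*=}     # text after the first '='
    if ! bash -c "$cmd" >/dev/null 2>&1; then
      echo "$name"
      return 1
    fi
  done
  echo "none"
}

# Example wiring (run from a machine that can reach both hosts):
# first_failed_dependency \
#   "postgresql=pg_isready -h 192.168.0.188 -p 5432" \
#   "harbor=curl -s -o /dev/null http://192.168.0.110:5000/v2/" \
#   "gitea=curl -sf -o /dev/null http://192.168.0.110:3001/"
```

Passing the probes in as arguments keeps the ordering logic separate from the host-specific commands, so the same function can be dry-run with `true` and `false`.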
|
||||
|
||||
---
|
||||
|
||||
## 🤖 自動化腳本說明
|
||||
## 自動化腳本狀態
|
||||
|
||||
### 腳本位置
|
||||
### 已部署 (2026-04-05)
|
||||
|
||||
```
|
||||
scripts/reboot-recovery/
|
||||
├── awoooi-startup.sh # 188 啟動腳本(部署到 /usr/local/bin/)
|
||||
├── awoooi-startup.service # 188 systemd unit
|
||||
├── awoooi-startup-110.sh # 110 啟動腳本(部署到 /usr/local/bin/)
|
||||
├── awoooi-startup-110.service # 110 systemd unit
|
||||
├── deploy-to-188.sh # 一鍵部署到 188
|
||||
└── deploy-to-110.sh # 一鍵部署到 110
|
||||
```
|
||||
| Host | Script | Deployed to | systemd service | Status |
|------|------|---------|----------------|------|
| **188** | `awoooi-startup.sh` | `/usr/local/bin/` | `awoooi-startup.service` | ✅ enabled |
| **110** | `awoooi-startup-110.sh` | `/usr/local/bin/` | `awoooi-startup-110.service` | ✅ enabled |
| **120** | K3s native | Built into the system | `k3s.service` | ✅ enabled |
| **121** | K3s native | Built into the system | `k3s-agent.service` | ✅ enabled |
|
||||
|
||||
### 188 腳本(`awoooi-startup.sh`)步驟
|
||||
**本地原始碼**: `scripts/reboot-recovery/`
|
||||
|
||||
| Step | Description | Failure handling |
|------|------|---------|
| 1/7 | containerd health check | BoltDB corrupted → automatically delete `meta.db` |
| 2/7 | Docker health check | BoltDB corrupted → automatically delete `local-kv.db` |
| 3/7 | PostgreSQL health check | WAL corrupted → automatically run `pg_resetwal` + VACUUM ANALYZE kine |
| 4/7 | Start Redis | none |
| 5/7 | Start Ollama | none |
| 6/7 | Start Nginx | none |
| 7/7 | SignOz + ClawBot compose up | ClawBot fails → attempt a rebuild; continue even if that fails |
|
||||
### 各腳本覆蓋清單
|
||||
|
||||
### 110 腳本(`awoooi-startup-110.sh`)步驟
|
||||
**188 (7 步驟)**:
|
||||
- containerd BoltDB 損壞偵測與修復
|
||||
- Docker BoltDB 損壞偵測與修復
|
||||
- PostgreSQL WAL 損壞偵測 + pg_resetwal + kine VACUUM
|
||||
- Redis 啟動
|
||||
- Ollama 啟動
|
||||
- Nginx 啟動
|
||||
- SignOz + MinIO + aiops-network + OpenClaw
|
||||
|
||||
| Step | Description | Failure handling |
|------|------|---------|
| 1/5 | Docker health check | BoltDB corrupted → automatically delete all corrupted `.db` files |
| 2/5 | Remove orphaned containers | `Exited (128)/(137)` → docker rm + network prune |
| 3/5 | Start Harbor | Wait for harbor-log to be healthy (max 60s) before starting the other components |
| 4/5 | Gitea/Langfuse/Monitoring compose up | none |
| 5/5 | SignOz compose up | none |
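
Step 3/5 waits for harbor-log with an upper bound of 60 seconds rather than looping forever. A bounded wait of that kind can be sketched as below. The helper names `wait_until` and `harbor_log_healthy` are hypothetical and are not taken from `awoooi-startup-110.sh`; only the `docker inspect` probe comes from this runbook.

```bash
#!/usr/bin/env bash
# Hypothetical helper: poll a probe command until it succeeds or the attempt
# budget runs out. Usage: wait_until ATTEMPTS INTERVAL_SECONDS COMMAND [ARGS...]
wait_until() {
  local attempts=$1 interval=$2 i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    # Sleep only between attempts, not after the last one.
    if ((i < attempts)); then
      sleep "$interval"
    fi
  done
  return 1
}

# Probe matching the health check used in this runbook.
harbor_log_healthy() {
  [ "$(docker inspect --format='{{.State.Health.Status}}' harbor-log 2>/dev/null)" = "healthy" ]
}

# 12 attempts x 5 s gives roughly the 60 s budget from step 3/5:
# wait_until 12 5 harbor_log_healthy && docker compose up -d
```

Returning a non-zero status on timeout lets the caller decide whether to abort or continue, instead of hanging the boot sequence.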
|
||||
|
||||
### 重新部署腳本
|
||||
|
||||
```bash
|
||||
# 從 Mac 執行(awoooi repo 目錄)
|
||||
cd scripts/reboot-recovery
|
||||
|
||||
bash deploy-to-188.sh # 更新 188 的腳本
|
||||
bash deploy-to-110.sh # 更新 110 的腳本
|
||||
```
|
||||
**110 (6 步驟)**:
|
||||
- Docker BoltDB 損壞偵測與修復
|
||||
- 孤兒容器清除 (Exited 128)
|
||||
- Harbor (harbor-log healthy 後啟動全組件)
|
||||
- Gitea + Langfuse + Monitoring (含 Alertmanager 健康驗證)
|
||||
- SignOz
|
||||
- **Gitea Act Runner** (自動清除過期 .runner 配置)
|
||||
|
||||
---
|
||||
|
||||
## 手動恢復流程(自動化失敗時)
|
||||
## 正常重啟流程
|
||||
|
||||
### Phase 1:192.168.0.188
|
||||
> Applies to: planned maintenance, security updates, OS upgrades, and other **expected** reboots.
|
||||
|
||||
### 重啟前準備 (T-30 分鐘)
|
||||
|
||||
```bash
|
||||
ssh ollama@192.168.0.188
|
||||
# 1. 確認沒有 INVESTIGATING/MITIGATING 的 incident
|
||||
curl -s http://192.168.0.121:32334/api/v1/incidents | python3 -c 'import sys,json; d=json.load(sys.stdin); print("Active incidents:", d.get("count",0))'
|
||||
|
||||
# 2. 確認 K3s 健康
|
||||
ssh wooo@192.168.0.120 "kubectl get nodes && kubectl get pods -n awoooi-prod"
|
||||
|
||||
# 3. 備份確認 (MinIO Velero)
|
||||
ssh wooo@192.168.0.120 "kubectl get backup -n velero 2>/dev/null | head -5"
|
||||
```
|
||||
|
||||
#### 1.1 containerd(若未起)
|
||||
### 建議重啟順序
|
||||
|
||||
```bash
|
||||
sudo systemctl status containerd
|
||||
# 若 BoltDB 損壞 (panic: freepages):
|
||||
sudo systemctl stop containerd
|
||||
sudo rm -f /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db
|
||||
sudo systemctl start containerd
|
||||
systemctl is-active containerd
|
||||
```
|
||||
第一步: 188 + 110 同時重啟
|
||||
(兩者獨立,可並行)
|
||||
|
||||
等待 5 分鐘確認基礎服務就緒:
|
||||
- PostgreSQL accepting connections
|
||||
- Harbor returning HTTP 200
|
||||
- Gitea accessible
|
||||
|
||||
第二步: 120 + 121 同時重啟
|
||||
(K3s cluster)
|
||||
|
||||
等待 10 分鐘:
|
||||
- Nodes Ready
|
||||
- Pods Running
|
||||
```
|
||||
|
||||
#### 1.2 Docker(若未起)
|
||||
> **Important**: 120/121 must be rebooted only after PostgreSQL on 188 is fully ready,
> otherwise K3s kine cannot reach the DB and K3s will not start.
|
||||
|
||||
### 重啟後驗證 (T+15 分鐘)
|
||||
|
||||
執行底部 E2E 驗證腳本。所有項目必須 ✅。
|
||||
|
||||
---
|
||||
|
||||
## 異常重啟流程
|
||||
|
||||
> Applies to: power loss, OOM, system crashes, forced reboots, and other **unexpected** reboots.
|
||||
|
||||
### 第一步:快速診斷 (T+2 分鐘)
|
||||
|
||||
```bash
|
||||
sudo systemctl status docker
|
||||
# 若 BoltDB 損壞 (panic: page already freed / invalid freelist page):
|
||||
sudo systemctl stop docker
|
||||
sudo rm -f /var/lib/docker/network/files/local-kv.db
|
||||
sudo rm -f /var/lib/docker/volumes/metadata.db
|
||||
find /var/lib/docker/buildkit -name "*.db" -delete 2>/dev/null || true
|
||||
sudo systemctl start docker
|
||||
systemctl is-active docker
|
||||
```
|
||||
|
||||
#### 1.3 PostgreSQL(最關鍵)
|
||||
|
||||
```bash
|
||||
sudo systemctl start postgresql@14-main
|
||||
sleep 8
|
||||
systemctl is-active postgresql@14-main # 若非 active → 見故障排除 A
|
||||
pg_isready -h localhost -p 5432 # 應為 accepting connections
|
||||
|
||||
# 清理 kine 孤立連線(WAL 重置後必做)
|
||||
sudo -u postgres psql -d k3s_datastore -c "
|
||||
SELECT pg_terminate_backend(pid)
|
||||
FROM pg_stat_activity
|
||||
WHERE datname='k3s_datastore' AND pid!=pg_backend_pid()
|
||||
AND query_start < now() - interval '5 minutes';
|
||||
# 188 狀態
|
||||
ssh ollama@192.168.0.188 "
|
||||
systemctl is-active containerd docker postgresql@14-main redis-server ollama nginx
|
||||
docker ps --format '{{.Names}}: {{.Status}}' | grep -E 'clawbot|minio|signoz'
|
||||
"
|
||||
sudo -u postgres psql -d k3s_datastore -c "VACUUM ANALYZE kine;"
|
||||
|
||||
# 110 狀態
|
||||
ssh wooo@192.168.0.110 "
|
||||
docker ps --format '{{.Names}}: {{.Status}}' | grep -E 'harbor-log|gitea$|alertmanager|gitea-runner'
|
||||
"
|
||||
|
||||
# K3s 狀態
|
||||
ssh wooo@192.168.0.120 "
|
||||
kubectl get nodes 2>/dev/null || echo 'K3s not ready'
|
||||
kubectl get pods -n awoooi-prod 2>/dev/null | grep -v Running | head -10
|
||||
"
|
||||
|
||||
# API 健康
|
||||
curl -s --max-time 5 http://192.168.0.121:32334/api/v1/health 2>/dev/null | python3 -c 'import sys,json; d=json.load(sys.stdin); print("API:", d["status"])' 2>/dev/null || echo 'API unreachable'
|
||||
```
|
||||
|
||||
#### 1.4 Redis
|
||||
### 第二步:按優先級恢復
|
||||
|
||||
**P0 (T+0~5 分鐘): 基礎設施**
|
||||
```bash
|
||||
sudo systemctl start redis-server
|
||||
redis-cli -p 6380 ping # 應回 PONG
|
||||
# Redis 設定: 0.0.0.0:6380 (bind 0.0.0.0, port 6380)
|
||||
# 自動化腳本通常已處理,但若沒有自動啟動:
|
||||
ssh ollama@192.168.0.188 "sudo /usr/local/bin/awoooi-startup.sh"
|
||||
ssh wooo@192.168.0.110 "echo '0936223270' | sudo -S /usr/local/bin/awoooi-startup-110.sh"
|
||||
```
|
||||
|
||||
#### 1.5 Ollama
|
||||
|
||||
**P1 (T+5~15 分鐘): K3s 叢集**
|
||||
```bash
|
||||
sudo systemctl start ollama
|
||||
sleep 5
|
||||
curl -sf http://localhost:11434/ | grep running
|
||||
# 等 PostgreSQL@188 就緒後
|
||||
ssh ollama@192.168.0.188 "pg_isready -h localhost -p 5432"
|
||||
|
||||
# K3s 若未自動恢復
|
||||
ssh wooo@192.168.0.120 "sudo systemctl restart k3s"
|
||||
ssh wooo@192.168.0.121 "sudo systemctl restart k3s-agent"
|
||||
|
||||
# 等待 Nodes Ready
|
||||
ssh wooo@192.168.0.120 "kubectl wait --for=condition=Ready nodes --all --timeout=120s"
|
||||
```
|
||||
|
||||
#### 1.6 Nginx + SignOz + ClawBot
|
||||
|
||||
**P2 (T+15~30 分鐘): 業務驗證**
|
||||
```bash
|
||||
sudo systemctl start nginx
|
||||
# Pods 全部 Running?
|
||||
ssh wooo@192.168.0.120 "kubectl get pods -n awoooi-prod"
|
||||
|
||||
cd /home/ollama/signoz/deploy/docker && docker compose up -d
|
||||
cd /home/ollama/clawbot-v5 && docker compose up -d
|
||||
```
|
||||
# API 健康 (含所有 components)?
|
||||
curl -s http://192.168.0.121:32334/api/v1/health | python3 -m json.tool
|
||||
|
||||
#### 1.7 Phase 1 驗收
|
||||
|
||||
```bash
|
||||
pg_isready -h localhost -p 5432 && echo "✅ PostgreSQL"
|
||||
redis-cli -p 6380 ping | grep -q PONG && echo "✅ Redis :6380"
|
||||
curl -sf http://localhost:11434/ | grep -q running && echo "✅ Ollama"
|
||||
systemctl is-active nginx | grep -q active && echo "✅ Nginx"
|
||||
# 告警鏈路 E2E 測試
|
||||
curl -X POST http://192.168.0.121:32334/api/v1/webhooks/alertmanager \
|
||||
-H 'Content-Type: application/json' \
|
||||
-d '{"receiver":"test","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"RebootRecoveryTest","severity":"info"},"annotations":{"summary":"重開機恢復測試,請忽略"},"startsAt":"2026-04-05T00:00:00Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":""}],"groupLabels":{},"commonLabels":{},"commonAnnotations":{},"externalURL":"","version":"4","groupKey":"test"}'
|
||||
# 預期: {"success":true,...} 且 Telegram 收到測試告警
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Phase 2:192.168.0.110
|
||||
## 各主機詳細啟動序列
|
||||
|
||||
### 192.168.0.188 — 主服務主機
|
||||
|
||||
| Step | Service | Verification | Common failure cause |
|------|------|---------|------------|
| 1 | containerd | `systemctl is-active containerd` | BoltDB meta.db corrupted |
| 2 | Docker | `systemctl is-active docker` | BoltDB local-kv.db corrupted |
| 3 | PostgreSQL | `pg_isready -h localhost -p 5432` | WAL checkpoint corrupted |
| 3a | kine VACUUM | `psql -d k3s_datastore -c "VACUUM ANALYZE kine;"` | Slow queries make K3s time out |
| 4 | Redis | `redis-cli -p 6380 ping` | Wrong bind setting (must be 0.0.0.0:6380) |
| 5 | Ollama | `systemctl is-active ollama` | Insufficient GPU memory |
| 6 | Nginx | `systemctl is-active nginx` | Port conflict |
| 7a | aiops-network | `docker network ls \| grep aiops-network` | External network disappears after reboot |
| 7b | OpenClaw | `curl http://localhost:8088/health` | Broken pip dependencies, rebuild required |
| 7c | MinIO | `docker ps \| grep minio` | No special recovery step; a restart is enough |
| 7d | SignOz | `curl http://localhost:3301/-/healthy` | ClickHouse is slow to start |
|
||||
|
||||
**BoltDB 損壞修復**:
|
||||
```bash
|
||||
ssh wooo@192.168.0.110
|
||||
# containerd meta.db
|
||||
BOLT=/var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db
|
||||
cp $BOLT ${BOLT}.bak.$(date +%s) && rm $BOLT && systemctl restart containerd
|
||||
|
||||
# Docker network BoltDB
|
||||
BOLT=/var/lib/docker/network/files/local-kv.db
|
||||
cp $BOLT ${BOLT}.bak.$(date +%s) && rm $BOLT && systemctl restart docker
|
||||
```
|
||||
|
||||
#### 2.1 Docker 修復
|
||||
|
||||
**PostgreSQL WAL 修復**:
|
||||
```bash
|
||||
systemctl is-active docker || {
|
||||
echo "0936223270" | sudo -S bash -c "
|
||||
rm -f /var/lib/docker/network/files/local-kv.db
|
||||
rm -f /var/lib/docker/volumes/metadata.db
|
||||
find /var/lib/docker/buildkit -name '*.db' -delete 2>/dev/null
|
||||
systemctl start docker
|
||||
"
|
||||
}
|
||||
# 確認是 WAL 損壞
|
||||
journalctl -u postgresql@14-main -n 30 | grep "could not locate a valid checkpoint"
|
||||
|
||||
# 修復
|
||||
systemctl stop postgresql@14-main
|
||||
/usr/lib/postgresql/14/bin/pg_resetwal -f /var/lib/postgresql/14/main
|
||||
systemctl start postgresql@14-main
|
||||
```
|
||||
|
||||
#### 2.2 清除孤兒容器(關鍵!)
|
||||
### 192.168.0.110 — DevOps 主機
|
||||
|
||||
| Step | Service | Verification | Common failure cause |
|------|------|---------|------------|
| 1 | Docker | `systemctl is-active docker` | BoltDB corrupted |
| 2 | Orphaned container cleanup | `docker ps -a \| grep "Exited (128)"` | Old containers reference a network that no longer exists |
| 3a | harbor-log | `docker inspect --format='{{.State.Health.Status}}' harbor-log` | syslog :1514 not ready |
| 3b | Harbor (all components) | `curl http://localhost:5000/v2/` | harbor-log not yet healthy |
| 4 | Gitea | `curl http://localhost:3001` | No known issues |
| 5 | Langfuse | `curl http://localhost:3100` | DB not ready |
| 6 | Alertmanager | `curl http://localhost:9093/-/healthy` | docker compose not started |
| 6a | Alert pipeline check | Send a test webhook | Wrong webhook URL / NetworkPolicy |
| 7 | SignOz | `curl http://localhost:3301/-/healthy` | ClickHouse is slow to start |
| 8 | Gitea Runner | `docker logs gitea-runner \| grep SUCCESS` | Stale .runner config |
|
||||
|
||||
**Harbor Exited 128 修復**:
|
||||
```bash
|
||||
# 重開機後舊容器使用的 Docker network 已不存在,必須清除
|
||||
docker rm -f $(docker ps -aq 2>/dev/null) 2>/dev/null || true
|
||||
docker network prune -f 2>/dev/null || true
|
||||
```
|
||||
|
||||
#### 2.3 Harbor(必須先等 harbor-log healthy)
|
||||
|
||||
```bash
|
||||
cd /home/wooo/harbor/harbor
|
||||
docker compose up -d
|
||||
|
||||
# 等 harbor-log healthy(最多 60 秒)
|
||||
for i in $(seq 1 12); do
|
||||
STATUS=$(docker inspect --format='{{.State.Health.Status}}' harbor-log 2>/dev/null)
|
||||
echo "[$i] harbor-log: $STATUS"
|
||||
[ "$STATUS" = "healthy" ] && break
|
||||
sleep 5
|
||||
# 等 harbor-log healthy
|
||||
until [ "$(docker inspect --format='{{.State.Health.Status}}' harbor-log)" = "healthy" ]; do
|
||||
echo "waiting..."; sleep 5
|
||||
done
|
||||
|
||||
# harbor-log healthy 後,重啟其他元件(它們依賴 :1514 syslog)
|
||||
docker compose up -d
|
||||
# 清除 Exited 128 容器
|
||||
docker rm -f $(docker ps -a --filter status=exited --format "{{.Names}}" | grep -v harbor-log)
|
||||
|
||||
# 重新啟動
|
||||
cd /home/wooo/harbor/harbor && docker compose up -d
|
||||
```
|
||||
|
||||
#### 2.4 其他服務
|
||||
|
||||
**Gitea Runner 重新註冊**:
|
||||
```bash
|
||||
cd /home/wooo/gitea && docker compose up -d
|
||||
cd /home/wooo/langfuse && docker compose up -d
|
||||
cd /home/wooo/monitoring && docker compose up -d
|
||||
cd /home/wooo/signoz/deploy/docker && docker compose up -d
|
||||
# 清除過期配置 (指向錯誤 hostname 的)
|
||||
sudo rm /home/wooo/act-runner/data/.runner
|
||||
|
||||
# 獲取新的 registration token
|
||||
curl -s -u 'wooo:TOKEN' 'http://192.168.0.110:3001/api/v1/repos/wooo/awoooi/actions/runners/registration-token'
|
||||
|
||||
# 更新 docker-compose.yml 的 GITEA_RUNNER_REGISTRATION_TOKEN
|
||||
# 然後啟動
|
||||
cd /home/wooo/act-runner && docker compose up -d
|
||||
```
|
||||
|
||||
#### 2.5 Phase 2 驗收
|
||||
### 192.168.0.120/121 — K3s 叢集
|
||||
|
||||
```bash
|
||||
curl -s -o /dev/null -w "%{http_code}" http://localhost:5000/v2/ | grep -q 401 && echo "✅ Harbor :5000"
|
||||
curl -s -o /dev/null -w "%{http_code}" http://localhost:3001/ | grep -q 200 && echo "✅ Gitea :3001"
|
||||
curl -s -o /dev/null -w "%{http_code}" http://localhost:3100/ | grep -q 200 && echo "✅ Langfuse :3100"
|
||||
curl -s -o /dev/null -w "%{http_code}" http://localhost:9093/ | grep -q 200 && echo "✅ Alertmanager :9093"
|
||||
curl -s -o /dev/null -w "%{http_code}" http://localhost:3002/ | grep -q 302 && echo "✅ Grafana :3002"
|
||||
| Step | Item | Verification |
|------|------|---------|
| Prerequisite | PostgreSQL@188 ready | `pg_isready -h 192.168.0.188 -p 5432` |
| 1 | K3s service | `systemctl is-active k3s` (120) |
| 2 | K3s agent | `systemctl is-active k3s-agent` (121) |
| 3 | Nodes Ready | `kubectl get nodes` |
| 4 | Pods Running | `kubectl get pods -n awoooi-prod` |
| 5 | keepalived VIP | `ip addr show \| grep 192.168.0.125` (120) |
| 6 | NodePort | `nc -zv 192.168.0.121 32334` |
|
||||
|
||||
---
|
||||
|
||||
## 常見故障排查手冊
|
||||
|
||||
### 1. 告警沉默 (Telegram 沒收到告警)
|
||||
|
||||
```
|
||||
診斷樹:
|
||||
Alertmanager healthy? (http://192.168.0.110:9093/-/healthy)
|
||||
├── NO → docker compose up -d (在 110 /home/wooo/monitoring/)
|
||||
└── YES
|
||||
↓
|
||||
Webhook URL 正確?
|
||||
grep 'url:' /home/wooo/monitoring/alertmanager.yml
|
||||
必須是: http://192.168.0.121:32334/api/v1/webhooks/alertmanager
|
||||
├── NO → 修正 URL 並 curl http://localhost:9093/-/reload
|
||||
└── YES
|
||||
↓
|
||||
從 110 curl POST webhook 成功?
|
||||
curl -X POST http://192.168.0.121:32334/api/v1/webhooks/alertmanager ...
|
||||
├── timeout → NetworkPolicy 未允許 110
|
||||
│ kubectl apply -f k8s/awoooi-prod/02-network-policy.yaml
|
||||
└── {"success":true} → 檢查 Telegram Bot Token
|
||||
```
|
||||
|
||||
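**Root lessons (2026-04-05)**:

- Old architecture: Alertmanager → OpenClaw 8088 (an intermediate layer; if OpenClaw was unhealthy after a reboot, alerts went silent)
- New architecture: Alertmanager → AWOOOI API directly (more stable, fewer single points of failure)
- NetworkPolicy must allow 192.168.0.110/32 into the pod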
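
The last branch of the diagnostic tree (POST the webhook from 110 and look at the result) can be turned into a small classifier. This is a sketch under stated assumptions: the function name `classify_webhook_result` is hypothetical, and the "unreachable" and "api-error" outcomes are additions that the tree itself does not list. It relies on curl exit code 28, which means the operation timed out.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map the outcome of the webhook POST onto the branches of
# the diagnostic tree. Arguments: curl exit code, response body.
classify_webhook_result() {
  local curl_exit=$1 body=$2
  if [ "$curl_exit" -eq 28 ]; then
    # curl exit code 28: operation timed out.
    echo "timeout: NetworkPolicy probably does not allow 192.168.0.110"
  elif [ "$curl_exit" -ne 0 ]; then
    # Not in the original tree: any other transport failure.
    echo "unreachable: check pods and NodePort 32334"
  elif printf '%s' "$body" | grep -q '"success":[[:space:]]*true'; then
    echo "api-ok: check the Telegram bot token"
  else
    # Not in the original tree: the API answered but did not report success.
    echo "api-error: inspect the API response"
  fi
}

# Example (run on 192.168.0.110; PAYLOAD as in the E2E script):
# body=$(curl -s --max-time 5 -X POST -H 'Content-Type: application/json' \
#   -d "$PAYLOAD" http://192.168.0.121:32334/api/v1/webhooks/alertmanager)
# classify_webhook_result "$?" "$body"
```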
|
||||
|
||||
### 2. Gitea CD 無法部署
|
||||
|
||||
```
|
||||
診斷樹:
|
||||
Gitea runner 數量?
|
||||
curl -u 'wooo:TOKEN' 'http://192.168.0.110:3001/api/v1/repos/wooo/awoooi/actions/runners'
|
||||
├── total_count: 0 → Runner 離線
|
||||
│ docker ps | grep gitea-runner
|
||||
│ ├── 不存在 → cd /home/wooo/act-runner && docker compose up -d
|
||||
│ └── 存在但 Restarting → docker logs gitea-runner
|
||||
│ "lookup gitea" → 清除 data/.runner → docker compose up -d
|
||||
└── total_count: 1 → Runner 在線但 CD 失敗
|
||||
查看 Gitea Actions 日誌: http://192.168.0.110:3001/wooo/awoooi/actions
|
||||
```
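
The first decision in the tree above depends only on `total_count` in the runners API response. A minimal sketch of extracting it without a JSON parser follows. The helper names `runner_count` and `runner_next_step` are hypothetical, and the grep-based extraction assumes the field appears as a plain integer, as it does in the `total_count: 0` / `total_count: 1` cases shown in the tree.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the first branch of the Gitea CD diagnostic tree.

# Read the runners API response on stdin and print total_count (0 if absent).
runner_count() {
  local n
  n=$(grep -o '"total_count":[[:space:]]*[0-9][0-9]*' | grep -o '[0-9][0-9]*$')
  echo "${n:-0}"
}

# Turn the count into the next action suggested by the tree.
runner_next_step() {
  if [ "$1" -eq 0 ]; then
    echo "runner offline: check 'docker ps | grep gitea-runner'"
  else
    echo "runner online: read the Gitea Actions logs"
  fi
}

# Example (TOKEN is the same placeholder used elsewhere in this runbook):
# count=$(curl -s -u 'wooo:TOKEN' \
#   'http://192.168.0.110:3001/api/v1/repos/wooo/awoooi/actions/runners' | runner_count)
# runner_next_step "$count"
```

Treating a missing or unparsable response as 0 sends the operator down the "runner offline" branch, which is the safer default when Gitea itself is not answering.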
|
||||
|
||||
### 3. 網站數據顯示 0 / incidents 消失
|
||||
|
||||
```
|
||||
原因: Redis Working Memory 重啟後清空
|
||||
新版 API (f4f454f+) 啟動時自動 warm-up
|
||||
舊版 API 沒有此邏輯
|
||||
|
||||
診斷:
|
||||
curl http://192.168.0.121:32334/api/v1/incidents
|
||||
→ count: 0 但 Telegram 有歷史告警?
|
||||
→ 確認 API pod image 版本
|
||||
|
||||
修復 (若 image 版本正確):
|
||||
kubectl rollout restart deployment/awoooi-api -n awoooi-prod
|
||||
|
||||
修復 (若 image 版本過舊):
|
||||
等待 CD 完成部署最新版本
|
||||
```
|
||||
|
||||
### 4. NodePort 32334 從外部無法訪問
|
||||
|
||||
```
|
||||
診斷:
|
||||
nc -zv 192.168.0.121 32334 ← 這個通嗎?
|
||||
├── 通 → 用 192.168.0.121:32334 直連
|
||||
└── 不通
|
||||
kubectl get pods -n awoooi-prod
|
||||
├── 沒有 Running pods → K3s pods 問題
|
||||
└── 有 Running pods
|
||||
iptables -t nat -L KUBE-NODEPORTS -n | grep 32334
|
||||
├── 沒有規則 → K3s kube-proxy 問題,重啟 k3s
|
||||
└── 有規則 → NetworkPolicy 或防火牆
|
||||
```
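
The innermost check in the tree above greps the `KUBE-NODEPORTS` chain for the port number. A plain `grep 32334` would also match a longer number that happens to contain those digits, so a slightly stricter variant can be sketched as below. The helper name `has_nodeport_rule` is hypothetical; it assumes the usual `iptables -L -n` listing format, in which the destination port appears as `dpt:<port>`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `iptables -t nat -L KUBE-NODEPORTS -n` output on
# stdin and succeed only if a rule targets exactly the given destination port.
# -w keeps 32334 from matching a longer number such as 323340.
has_nodeport_rule() {
  grep -qw "dpt:$1"
}

# Example (run on a K3s node; iptables needs root):
# sudo iptables -t nat -L KUBE-NODEPORTS -n | has_nodeport_rule 32334 \
#   || echo "no rule for 32334: kube-proxy problem, restart k3s"
```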
|
||||
|
||||
### 5. Harbor ImagePullBackOff
|
||||
|
||||
```
|
||||
診斷:
|
||||
kubectl describe pod <pod> -n awoooi-prod | grep -A 5 Events
|
||||
→ "ImagePullBackOff" or "ErrImagePull"
|
||||
|
||||
確認 Harbor 健康:
|
||||
curl http://192.168.0.110:5000/v2/ → 應該回 {}
|
||||
|
||||
若 Harbor 不健康:
|
||||
- 等 harbor-log healthy 後重啟: 見 Harbor Exited 128 修復
|
||||
- 確認 Docker daemon 在 120/121 有設定 insecure registry:
|
||||
cat /etc/docker/daemon.json | grep insecure
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Phase 3:K3s Control-Plane
|
||||
## E2E 驗證腳本
|
||||
|
||||
> ⚠️ 必須確認 Phase 1 PostgreSQL + Phase 2 Harbor 完全就緒才執行!
|
||||
|
||||
```bash
|
||||
# 先確認前置條件
|
||||
ssh ollama@192.168.0.188 "pg_isready -h localhost -p 5432"
|
||||
ssh wooo@192.168.0.110 "curl -s -o /dev/null -w '%{http_code}' http://localhost:5000/v2/"
|
||||
# PostgreSQL: accepting connections
|
||||
# Harbor: 401 (需認證,正常)
|
||||
```
|
||||
|
||||
#### 3.1 K3s 節點(通常自動啟動)
|
||||
|
||||
```bash
|
||||
# 若 k3s 未啟動
|
||||
ssh wooo@192.168.0.120 "echo '0936223270' | sudo -S systemctl start k3s"
|
||||
ssh wooo@192.168.0.121 "echo '0936223270' | sudo -S systemctl start k3s"
|
||||
|
||||
# 確認節點狀態
|
||||
ssh wooo@192.168.0.120 "echo '0936223270' | sudo -S k3s kubectl get nodes"
|
||||
```
|
||||
|
||||
#### 3.2 若 Pods ImagePullBackOff
|
||||
|
||||
```bash
|
||||
# Harbor 剛起來,強制 rollout
|
||||
ssh wooo@192.168.0.120 "echo '0936223270' | sudo -S bash -c '
|
||||
k3s kubectl delete pod -l app=awoooi-api -n awoooi-prod
|
||||
k3s kubectl delete pod -l app=awoooi-web -n awoooi-prod
|
||||
k3s kubectl delete pod -l app=awoooi-worker -n awoooi-prod
|
||||
'"
|
||||
```
|
||||
|
||||
#### 3.3 Phase 3 驗收
|
||||
|
||||
```bash
|
||||
ssh wooo@192.168.0.120 "echo '0936223270' | sudo -S k3s kubectl get pods -n awoooi-prod"
|
||||
# 所有 Pod 應為 Running
|
||||
|
||||
curl http://192.168.0.125:32334/api/v1/health
|
||||
# 預期: status=healthy 或 degraded (openclaw down 可接受)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 完整自動化驗收腳本
|
||||
執行此腳本確認重開機後全系統正常。
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# 執行位置: Mac (awoooi repo)
|
||||
echo "=== AWOOOI 重開機完整驗收 $(date '+%Y-%m-%d %H:%M:%S') ==="
|
||||
# 重開機後 E2E 驗證
|
||||
# 使用: bash docs/runbooks/scripts/e2e-verify.sh
|
||||
# 執行位置: 任意可 SSH 的機器
|
||||
|
||||
# 188 基礎服務
|
||||
echo "--- 192.168.0.188 ---"
|
||||
ssh ollama@192.168.0.188 "
|
||||
pg_isready -h localhost -p 5432 >/dev/null 2>&1 && echo '✅ PostgreSQL :5432' || echo '❌ PostgreSQL DOWN'
|
||||
redis-cli -p 6380 ping 2>/dev/null | grep -q PONG && echo '✅ Redis :6380' || echo '❌ Redis DOWN'
|
||||
curl -sf http://localhost:11434/ 2>/dev/null | grep -q running && echo '✅ Ollama :11434' || echo '❌ Ollama DOWN'
|
||||
systemctl is-active nginx 2>/dev/null | grep -q active && echo '✅ Nginx' || echo '❌ Nginx DOWN'
|
||||
docker ps --format '{{.Names}}\t{{.Status}}' 2>/dev/null | grep -E '^signoz|^clawbot' | head -5
|
||||
"
|
||||
API="http://192.168.0.121:32334"
|
||||
GREEN='\033[0;32m'; RED='\033[0;31m'; NC='\033[0m'
|
||||
PASS=0; FAIL=0
|
||||
|
||||
# 110 DevOps 金庫
|
||||
echo "--- 192.168.0.110 ---"
|
||||
ssh wooo@192.168.0.110 "
|
||||
curl -s -o /dev/null -w '%{http_code}' http://localhost:5000/v2/ 2>/dev/null | grep -q 401 && echo '✅ Harbor :5000' || echo '❌ Harbor DOWN'
|
||||
curl -s -o /dev/null -w '%{http_code}' http://localhost:3001/ 2>/dev/null | grep -q 200 && echo '✅ Gitea :3001' || echo '❌ Gitea DOWN'
|
||||
curl -s -o /dev/null -w '%{http_code}' http://localhost:3100/ 2>/dev/null | grep -q 200 && echo '✅ Langfuse :3100' || echo '❌ Langfuse DOWN'
|
||||
curl -s -o /dev/null -w '%{http_code}' http://localhost:9093/ 2>/dev/null | grep -q 200 && echo '✅ Alertmanager :9093' || echo '❌ Alertmanager DOWN'
|
||||
curl -s -o /dev/null -w '%{http_code}' http://localhost:3002/ 2>/dev/null | grep -qE '200|302' && echo '✅ Grafana :3002' || echo '❌ Grafana DOWN'
|
||||
"
|
||||
check() {
|
||||
local name="$1"; local cmd="$2"; local expect="$3"
|
||||
result=$(eval "$cmd" 2>/dev/null)
|
||||
if echo "$result" | grep -q "$expect"; then
|
||||
echo -e "${GREEN}✅ $name${NC}"
|
||||
((PASS++))
|
||||
else
|
||||
echo -e "${RED}❌ $name${NC} (got: ${result:0:80})"
|
||||
((FAIL++))
|
||||
fi
|
||||
}
|
||||
|
||||
# K3s 和 Pods
|
||||
echo "--- K3s (via 120) ---"
|
||||
ssh wooo@192.168.0.120 "
|
||||
echo '0936223270' | sudo -S bash -c '
|
||||
k3s kubectl get nodes 2>/dev/null
|
||||
k3s kubectl get pods -n awoooi-prod 2>/dev/null
|
||||
'
|
||||
"
|
||||
echo "=== AWOOOI 重開機 E2E 驗證 $(date '+%Y-%m-%d %H:%M:%S') ==="
|
||||
|
||||
# API E2E
|
||||
echo "--- API E2E ---"
|
||||
curl -s http://192.168.0.125:32334/api/v1/health 2>/dev/null | \
|
||||
python3 -c "import sys,json; d=json.load(sys.stdin); [print('✅' if v['status']=='up' else '⚠️', k, v['status']) for k,v in d['components'].items()]" || \
|
||||
echo '❌ API E2E FAILED'
|
||||
# 188 服務
|
||||
check "188 PostgreSQL" "ssh ollama@192.168.0.188 'pg_isready -h localhost -p 5432'" "accepting"
|
||||
check "188 Redis" "ssh ollama@192.168.0.188 'redis-cli -p 6380 ping'" "PONG"
|
||||
check "188 OpenClaw" "ssh ollama@192.168.0.188 'curl -s http://localhost:8088/health'" "healthy"
|
||||
|
||||
echo "=== 驗收完成 ==="
|
||||
# 110 服務
|
||||
check "110 Harbor" "curl -s -o /dev/null -w '%{http_code}' http://192.168.0.110:5000/v2/" "401"  # unauthenticated /v2/ answers 401, as in the Phase 2 checks
|
||||
check "110 Gitea" "curl -s -o /dev/null -w '%{http_code}' http://192.168.0.110:3001" "200"
|
||||
check "110 Alertmanager" "curl -s http://192.168.0.110:9093/-/healthy" "OK"
|
||||
check "110 Gitea Runner" "curl -s -u 'wooo:TOKEN' 'http://192.168.0.110:3001/api/v1/repos/wooo/awoooi/actions/runners' | python3 -c 'import sys,json; d=json.load(sys.stdin); print(d.get(\"total_count\",0))'" "1"
|
||||
|
||||
# K3s
|
||||
check "K3s Nodes Ready" "ssh wooo@192.168.0.120 'kubectl get nodes --no-headers' | awk '\$2 !~ /^Ready/ {bad=1} END {print ((bad || NR==0) ? \"NotReady\" : \"AllReady\")}'" "AllReady"
|
||||
check "K3s Pods Running" "ssh wooo@192.168.0.120 'kubectl get pods -n awoooi-prod --no-headers' | grep -c Running" "3"
|
||||
|
||||
# AWOOOI API
|
||||
check "API Health" "curl -s $API/api/v1/health" '"status":"healthy"'
|
||||
check "API openclaw" "curl -s $API/api/v1/health" '"openclaw":{"status":"up"'
|
||||
|
||||
# 告警鏈路 E2E
|
||||
PAYLOAD='{"receiver":"e2e-test","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"RebootE2ETest","severity":"info"},"annotations":{"summary":"重開機 E2E 驗證,請忽略"},"startsAt":"2026-04-05T00:00:00Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":""}],"groupLabels":{},"commonLabels":{},"commonAnnotations":{},"externalURL":"","version":"4","groupKey":"e2e-test"}'
|
||||
check "告警鏈路 E2E" "curl -s -X POST $API/api/v1/webhooks/alertmanager -H 'Content-Type: application/json' -d '$PAYLOAD'" '"success":true'
|
||||
|
||||
echo ""
|
||||
echo "=== 結果: ${PASS} 通過, ${FAIL} 失敗 ==="
|
||||
[ $FAIL -eq 0 ] && echo -e "${GREEN}🎉 全部通過!系統正常${NC}" || echo -e "${RED}⚠️ 有 ${FAIL} 項失敗,請排查${NC}"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 故障排除
|
||||
## 版本歷史
|
||||
|
||||
### A. PostgreSQL WAL 損壞
|
||||
|
||||
**症狀**:
|
||||
```
|
||||
PANIC: could not locate a valid checkpoint record
|
||||
```
|
||||
|
||||
**修復**(需統帥授權):
|
||||
|
||||
```bash
|
||||
ssh ollama@192.168.0.188
|
||||
|
||||
# 1. 確認錯誤
|
||||
sudo journalctl -u postgresql@14-main -n 20 | grep -E 'PANIC|checkpoint'
|
||||
|
||||
# 2. 強制重置 WAL(會丟失最後幾個 transaction,不可逆)
|
||||
sudo /usr/lib/postgresql/14/bin/pg_resetwal -f /var/lib/postgresql/14/main
|
||||
|
||||
# 3. 重啟並驗證
|
||||
sudo systemctl start postgresql@14-main
|
||||
sleep 8
|
||||
pg_isready -h localhost -p 5432
|
||||
|
||||
# 4. 殺掉 stale 連線 + 重建 kine(必做!)
|
||||
sudo -u postgres psql -d k3s_datastore -c "
|
||||
SELECT pg_terminate_backend(pid)
|
||||
FROM pg_stat_activity
|
||||
WHERE datname='k3s_datastore' AND pid!=pg_backend_pid();
|
||||
"
|
||||
sudo -u postgres psql -d k3s_datastore -c "REINDEX TABLE kine; VACUUM ANALYZE kine;"
|
||||
# ⚠️ 若 stale 連線殺不掉,用 OS kill: sudo kill -9 <pid>
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### B. Docker Daemon 損壞 (BoltDB)
|
||||
|
||||
**症狀**:
|
||||
```
|
||||
panic: freepages: failed to get all reachable pages (containerd)
|
||||
panic: page already freed (Docker network)
|
||||
failed to create task: failed to initialize logging (Harbor 容器)
|
||||
```
|
||||
|
||||
**修復(188)**:
|
||||
```bash
|
||||
sudo systemctl stop docker containerd
|
||||
sudo rm -f /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db
|
||||
sudo rm -f /var/lib/docker/network/files/local-kv.db
|
||||
sudo systemctl start containerd && sleep 5
|
||||
sudo systemctl start docker
|
||||
```
|
||||
|
||||
**修復(110,額外需清除容器狀態)**:
|
||||
```bash
|
||||
sudo systemctl stop docker
|
||||
sudo rm -f /var/lib/docker/network/files/local-kv.db
|
||||
sudo rm -f /var/lib/docker/volumes/metadata.db
|
||||
find /var/lib/docker/buildkit -name "*.db" -delete 2>/dev/null
|
||||
sudo rm -rf /var/lib/docker/containers/* # 清除孤兒容器記錄
|
||||
sudo systemctl start docker
|
||||
sleep 5
|
||||
docker rm -f $(docker ps -aq) 2>/dev/null || true # 清除殘留
|
||||
docker network prune -f
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### C. K3s Kine 慢查詢
|
||||
|
||||
**症狀**:K3s `activating` 超過 3-5 分鐘,log:
|
||||
```
|
||||
Slow SQL (total time: 1m3.889s): SELECT ... FROM kine AS kv WHERE kv.name LIKE $1 ...
|
||||
```
|
||||
|
||||
**修復**:
|
||||
```bash
|
||||
# 1. 停 K3s(釋放 PG 連線)
|
||||
ssh wooo@192.168.0.120 "echo '0936223270' | sudo -S systemctl stop k3s"
|
||||
ssh wooo@192.168.0.121 "echo '0936223270' | sudo -S systemctl stop k3s"
|
||||
|
||||
# 2. 殺掉 stale 連線(若 pg_terminate 無效,直接 OS kill)
|
||||
ssh ollama@192.168.0.188 "echo '0936223270' | sudo -S -u postgres psql -d k3s_datastore \
|
||||
-c \"SELECT pid, query_start, state FROM pg_stat_activity WHERE datname='k3s_datastore';\""
|
||||
# 對 stale PID: sudo kill -9 <pid>
|
||||
|
||||
# 3. 重建索引和統計
|
||||
ssh ollama@192.168.0.188 "echo '0936223270' | sudo -S -u postgres psql -d k3s_datastore \
|
||||
-c 'REINDEX TABLE kine; VACUUM ANALYZE kine;'"
|
||||
|
||||
# 4. 重啟 K3s
|
||||
ssh wooo@192.168.0.120 "echo '0936223270' | sudo -S systemctl start k3s"
|
||||
ssh wooo@192.168.0.121 "echo '0936223270' | sudo -S systemctl start k3s"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### D. Harbor 容器全部 Exited (128)
|
||||
|
||||
**症狀**:
|
||||
```
|
||||
Error: failed to create task for container: failed to initialize logging driver: dial tcp 127.0.0.1:1514: connect: connection refused
|
||||
```
|
||||
|
||||
**Cause**: every Harbor container's log driver points at harbor-log's syslog (:1514). If the other containers are started before harbor-log is healthy, they all fail.
|
||||
|
||||
**修復**:
|
||||
```bash
|
||||
# 1. 清除所有失敗的容器
|
||||
docker rm -f $(docker ps -aq) 2>/dev/null
|
||||
docker network prune -f
|
||||
|
||||
# 2. 重啟 Harbor(harbor-log 會先起)
|
||||
cd /home/wooo/harbor/harbor && docker compose up -d
|
||||
|
||||
# 3. 等 harbor-log healthy(約 30 秒)
|
||||
until [ "$(docker inspect --format='{{.State.Health.Status}}' harbor-log)" = "healthy" ]; do
|
||||
echo "waiting harbor-log..."; sleep 5
|
||||
done
|
||||
|
||||
# 4. 現在重啟其他元件
|
||||
docker compose up -d
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 已知限制
|
||||
|
||||
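| Item | Description | Suggested follow-up |
|------|------|---------|
| ClawBot build | pip wheels are corrupted; non-critical while STANDBY_MODE=true | Clear the wheel cache periodically |
| K3s has no automatic dependency on PG | k3s.service has no After=postgresql@14-main.service | Consider adding a systemd dependency |
| Redis needs manual configuration | /etc/redis/redis.conf is already set to bind 0.0.0.0 port 6380; it must be set again after a reinstall | Add a self-check to awoooi-startup.sh |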
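
On the K3s limitation above: systemd ordering such as `After=` only applies to units managed on the same host, while PostgreSQL runs on 188 and k3s on 120. One possible alternative is a drop-in whose `ExecStartPre` polls the remote database before k3s starts. The sketch below only writes such a drop-in file. The helper name `write_k3s_pg_dropin` and the file name `10-wait-for-postgres.conf` are hypothetical, and it assumes `pg_isready` is installed on the K3s host; none of this is confirmed by the runbook.

```bash
#!/usr/bin/env bash
# Hypothetical helper: write a systemd drop-in that makes k3s wait for the
# remote PostgreSQL before starting. Usage: write_k3s_pg_dropin TARGET_DIR
# Assumes pg_isready (PostgreSQL client tools) is installed on the K3s host.
write_k3s_pg_dropin() {
  local dir=$1
  mkdir -p "$dir"
  cat > "$dir/10-wait-for-postgres.conf" <<'EOF'
[Service]
# Block startup until PostgreSQL on 188 accepts connections.
ExecStartPre=/bin/sh -c 'until pg_isready -q -h 192.168.0.188 -p 5432; do sleep 5; done'
EOF
}

# Example (as root on 192.168.0.120):
# write_k3s_pg_dropin /etc/systemd/system/k3s.service.d
# systemctl daemon-reload
```

The polling loop is still subject to the unit's start timeout, if one is set, so a database that never comes up results in a failed start rather than an indefinite hang.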
|
||||
|
||||
---
|
||||
|
||||
*Document fully revised by Claude Code on 2026-04-05 after the second reboot incident*

| Version | Date | Notes |
|------|------|------|
| v1.0 | 2026-04-04 | Initial version: containerd + Docker + PostgreSQL WAL repair procedure |
| v2.0 | 2026-04-05 AM | 188 startup script + 110 startup script + Harbor race condition fix |
| v3.0 | 2026-04-05 PM | Full architecture re-inventory + Gitea Runner automation + alert pipeline root-cause fix + NetworkPolicy correction |
|
||||
|
||||