# K3s Optimization Implementation Runbook

> **Version**: 2.0
> **Created**: 2026-03-28 (Taipei time)
> **Created by**: Claude Code (Chief Architect)
> **Last modified**: 2026-03-28 20:45 (Taipei time)
> **Modified by**: Claude Code
> **Status**: ✅ K0/K-NET/K-HA/K-CLEAN complete | 📋 K1-K4 pending (~36h)

---

## Changelog

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2026-03-28 | Claude Code | Initial version |
| 1.1 | 2026-03-28 | Claude Code | Added safe execution order, Alertmanager silencing, stability verification |
| 1.2 | 2026-03-28 | Claude Code | **Marked completion status**: K0 ✅ + K-NET ✅ + K-CLEAN ✅ |
| 1.3 | 2026-03-28 | Claude Code | **K-HA complete**: dual control plane + PostgreSQL datastore |
| 2.0 | 2026-03-28 | Claude Code | **Added full K1-K4 implementation steps** + records of 15 anomaly fixes |

---

## Table of Contents

0. [🔴 Safe Execution Order (must read)](#0-safe-execution-order)
1. [Environment Baseline](#1-environment-baseline)
2. [Phase K0: Emergency Fixes](#2-phase-k0-emergency-fixes) ✅ **Done**
3. [Phase K-NET: Network Architecture (keepalived)](#3-phase-k-net-network-architecture) ✅ **Done**
4. [Phase K-HA: HA Upgrade](#4-phase-k-ha-ha-upgrade) ✅ **Done**
5. [Phase K-CLEAN: Cleanup Script](#5-phase-k-clean-cleanup-script) ✅ **Done**
6. [Phase K1: Disaster Recovery (Velero)](#6-phase-k1-disaster-recovery) ❌ **Pending**
7. [Phase K2: Automated Operations (ArgoCD/VPA/NPD)](#7-phase-k2-automated-operations) ❌ **Pending**
8. [Phase K3: Storage and HPA](#8-phase-k3-storage-and-hpa) ❌ **Pending**
9. [Phase K4: Advanced Optimization (Kured/Descheduler)](#9-phase-k4-advanced-optimization) ❌ **Pending**
10. [Verification Checklist](#10-verification-checklist)
11. [Rollback Procedure](#11-rollback-procedure)

---

## 0. Safe Execution Order

### 🔴 Important: run the steps in this order, or you may cause a service outage!

```
Phase K0 safe execution order

Step 1
┌─────────────────────────────────────────┐
│ K0.1 Disable swap (120 + 121)           │  🟢 Low risk
│ └── No dependencies; run independently  │  ⏱️ 15 min
└─────────────────────────────────────────┘
                     │
                     ▼
Step 2
┌─────────────────────────────────────────┐
│ K0.3 etcd backup + rsync to 188         │  🟢 Low risk
│ └── Back up before any change!          │  ⏱️ 20 min
└─────────────────────────────────────────┘
                     │
                     ▼
Step 3
┌─────────────────────────────────────────┐
│ K0.4 Apply PDBs                         │  🟢 Low risk
│ └── Pod protection before any restart   │  ⏱️ 10 min
└─────────────────────────────────────────┘
                     │
                     ▼
Step 4 ⚠️ added
┌─────────────────────────────────────────┐
│ Silence Alertmanager                    │  🟡 Prerequisite
│ └── Avoid false alerts on K3s restart   │  ⏱️ 2 min
└─────────────────────────────────────────┘
                     │
                     ▼
Step 5
┌─────────────────────────────────────────┐
│ K0.2 K3s config.yaml + restart          │  🟡 Medium risk
│ └── PDBs in place, alerts silenced      │  ⏱️ 30 min
│ └── API server down for ~30 seconds     │
└─────────────────────────────────────────┘
                     │
                     ▼
Step 6 ⚠️ added
┌─────────────────────────────────────────┐
│ Stability verification (checkpoint)     │  🟢 Checkpoint
│ └── kubectl get nodes = Ready           │  ⏱️ 5 min
│ └── kubectl get pods = Running          │
│ └── curl health check = OK              │
└─────────────────────────────────────────┘
                     │
            ┌────────┴────────┐
            │ Checks passed?  │
            └────────┬────────┘
                     │
         Yes ┌───────┴───────────────┐ No
             ▼                       ▼
Step 7                     ┌───────────────────┐
  ┌─────────────────────┐  │ 🔴 Run rollback    │
  │ K0.5 Startup Probe  │  │ See Section 11    │
  │ 🟡 Medium risk      │  └───────────────────┘
  │ ⏱️ 30 min           │
  └──────────┬──────────┘
             │
             ▼
Step 8
┌─────────────────────────────────────────┐
│ K0.6-7 Clean up abnormal resources      │  🟢 Low risk
│ └── ImagePullBackOff + orphaned RS      │  ⏱️ 15 min
└─────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────┐
│ ✅ Phase K0 complete                    │
│ Total time: ~2 hours                    │
└─────────────────────────────────────────┘
```

### Conflict Prevention Checklist

Confirm all of the following before you start:

| # | Check | Command | Expected result |
|---|-------|---------|-----------------|
| 1 | No deployment in progress | `gh run list --workflow=cd.yaml --limit 1` | Not `in_progress` |
| 2 | PDBs applied | `kubectl get pdb -n awoooi-prod` | 3 PDBs |
| 3 | Alertmanager silenced | Alertmanager UI or `amtool` | An active silence exists |
| 4 | etcd backup completed | `ls /var/lib/rancher/k3s/server/db/snapshots-backup/` | Latest backup present |
| 5 | rsync to 188 completed | `ssh 188 "ls /backup/k3s_etcd/"` | Synced files present |

### Risk Levels

| Level | Description | Rollback time |
|-------|-------------|---------------|
| 🟢 Low risk | Can run independently; no service impact | < 1 min |
| 🟡 Medium risk | May briefly affect service; safeguards in place | < 5 min |
| 🔴 High risk | Requires the full rollback procedure | < 30 min |

---

## 1. Environment Baseline

### 1.1 Node Information

| Item | mon (120) | mon1 (121) |
|------|-----------|------------|
| **IP** | 192.168.0.120 | 192.168.0.121 |
| **Network interface** | ens192 | ens160 |
| **Role** | K3s Server (Master) | K3s Agent (Worker) |
| **SSH user** | wooo | - |
| **K3s version** | v1.34.5+k3s1 | v1.34.5+k3s1 |

### 1.2 Target VIP

| Purpose | IP | Notes |
|---------|-----|-------|
| **K3s API VIP** | 192.168.0.125 | Single entry point for kubectl and CI/CD |

### 1.3 Issue Counts (before and after K-CLEAN)

| Issue | Before cleanup | After cleanup |
|-------|----------------|---------------|
| Orphaned ReplicaSets | 29 | ✅ 0 |
| ImagePullBackOff Pods | 1 | ✅ 0 |
| Failed Jobs | 1 | ✅ 0 |
| revisionHistoryLimit | 10 | 🟡 3 (changed in Git) |

---

## 2. Phase K0: Emergency Fixes ✅ Done

> **Executed**: 2026-03-28 09:30-12:30 (Taipei time)
> **Result**: All checks passed
> **Review score**: Chief Architect 9.0/10

### 2.1 K0.1 - Disable Swap

#### 2.1.1 mon (192.168.0.120)

```bash
# Step 1: SSH in
ssh wooo@192.168.0.120

# Step 2: Check the current swap state
free -h
swapon --show
# Expected output: /swap.img file 8G ...

# Step 3: Turn swap off
sudo swapoff -a

# Step 4: Verify swap is off
free -h
# Expected: the Swap row is all zeros

# Step 5: Disable permanently (edit fstab)
sudo cp /etc/fstab /etc/fstab.backup.$(date +%Y%m%d)
# Comment out active swap entries only; matches tab- or space-separated
# fields and skips lines that are already commented
sudo sed -i -E '/^[^#].*[[:space:]]swap[[:space:]]/ s/^/#/' /etc/fstab

# Step 6: Confirm the change
grep swap /etc/fstab
# Expected: #/swap.img ... (commented out)

# Step 7: Record completion
echo "$(date): Swap disabled on mon (120)" | sudo tee -a /var/log/k3s-optimization.log
```

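Before editing the real `/etc/fstab`, the same kind of substitution can be previewed on a scratch file. The following is a minimal sketch: the sample entries and the helper name `comment_out_swap` are made up for illustration, and the pattern assumes swap entries have `swap` as a whitespace-separated field.

```bash
#!/usr/bin/env bash
# Sketch: comment out active swap entries in a scratch copy of fstab.
# The sample content is hypothetical, not taken from the real nodes.
set -eu

comment_out_swap() {
  # Prefix "#" to every non-comment line that has "swap" as a field.
  sed -E '/^[^#].*[[:space:]]swap[[:space:]]/ s/^/#/' "$1"
}

sample=$(mktemp)
cat > "$sample" <<'EOF'
UUID=1111-aaaa / ext4 defaults 0 1
/swap.img none swap sw 0 0
#/old.swap none swap sw 0 0
EOF

result=$(comment_out_swap "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Only the active `/swap.img` line gains a leading `#`; the root filesystem entry and the already-commented line are left alone, so re-running the edit does not stack extra `#` characters.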
#### 2.1.2 mon1 (192.168.0.121)

```bash
# Step 1: SSH in (hop through 120)
ssh wooo@192.168.0.120
ssh mon1  # or: ssh 192.168.0.121

# Steps 2-7: repeat the steps above

# Or as a one-liner:
sudo swapoff -a && sudo sed -i -E '/^[^#].*[[:space:]]swap[[:space:]]/ s/^/#/' /etc/fstab && free -h
```

#### 2.1.3 Verification Script

```bash
# Run on 120 to verify both nodes
for host in 192.168.0.120 192.168.0.121; do
  echo "=== $host ==="
  ssh wooo@$host "free -h | grep Swap && grep swap /etc/fstab"
done
```

---

### 2.2 K0.2 - K3s Configuration (config.yaml)

#### ⚠️ 2.2.0 Prerequisite: Silence Alertmanager (must be done first!)

```bash
# ═══════════════════════════════════════════════════════════════════════════
# 🔴 Important: silence Alertmanager before restarting K3s,
#    otherwise the restart triggers false alerts
# ═══════════════════════════════════════════════════════════════════════════

# Option 1: amtool (recommended)
# Install amtool: https://prometheus.io/download/#alertmanager
amtool --alertmanager.url=http://192.168.0.188:9093 silence add \
  alertname=~".+" \
  --comment="K3s K0.2 maintenance" \
  --author="Claude Code" \
  --duration=30m

# Option 2: call the API directly with curl
# Use ".+" rather than ".*": Alertmanager rejects a silence whose only
# matcher can match the empty string
curl -X POST http://192.168.0.188:9093/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {"name": "alertname", "value": ".+", "isRegex": true}
    ],
    "startsAt": "'$(date -u +%Y-%m-%dT%H:%M:%S.000Z)'",
    "endsAt": "'$(date -u -d "+30 minutes" +%Y-%m-%dT%H:%M:%S.000Z)'",
    "createdBy": "Claude Code",
    "comment": "K3s K0.2 maintenance"
  }'

# Option 3: Alertmanager web UI
# 1. Open http://192.168.0.188:9093/#/silences
# 2. Click "New Silence"
# 3. Matchers: alertname =~ .+
# 4. Duration: 30m
# 5. Comment: K3s K0.2 maintenance

# Verify the silence is active
amtool --alertmanager.url=http://192.168.0.188:9093 silence query
# or
curl -s http://192.168.0.188:9093/api/v2/silences | jq '.[] | select(.status.state=="active")'
```

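The JSON body used by the curl option can also be assembled by a small helper so that the timestamps are passed in explicitly instead of depending on GNU `date -d`. This is a sketch only: `build_silence_payload` is a hypothetical helper name, the timestamps are example values, and the function does not escape quotes inside its arguments.

```bash
#!/usr/bin/env bash
# Sketch: build the JSON body for POST /api/v2/silences.
# build_silence_payload is a hypothetical helper, not part of amtool.
set -eu

build_silence_payload() {
  local starts_at=$1 ends_at=$2 author=$3 comment=$4
  cat <<EOF
{
  "matchers": [
    {"name": "alertname", "value": ".+", "isRegex": true}
  ],
  "startsAt": "${starts_at}",
  "endsAt": "${ends_at}",
  "createdBy": "${author}",
  "comment": "${comment}"
}
EOF
}

# Example values only; a real run would use the current UTC time.
payload=$(build_silence_payload \
  "2026-03-28T01:30:00.000Z" "2026-03-28T02:00:00.000Z" \
  "Claude Code" "K3s K0.2 maintenance")
printf '%s\n' "$payload"
```

The result could then be sent with `curl -X POST ... -d "$payload"` against the same Alertmanager URL shown above.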
#### 2.2.1 Create config.yaml

```bash
# SSH to 120
ssh wooo@192.168.0.120

# Step 1: Back up the existing config (if any)
sudo cp /etc/rancher/k3s/config.yaml /etc/rancher/k3s/config.yaml.backup 2>/dev/null || echo "No existing config"

# Step 2: Write the new config
sudo tee /etc/rancher/k3s/config.yaml << 'EOF'
# K3s server configuration
# Created by: Claude Code (Chief Architect)
# Date: 2026-03-28 (Taipei time)
# Version: v1.0

# ============================================================================
# Kubelet resource reservations (keep Pods from starving the system)
# ============================================================================
kubelet-arg:
  # Reserved for K3s system components
  - "kube-reserved=cpu=200m,memory=512Mi"
  # Reserved for the OS
  - "system-reserved=cpu=200m,memory=512Mi"
  # Eviction thresholds (Pods are evicted once a resource falls below these)
  - "eviction-hard=memory.available<256Mi,nodefs.available<10%,imagefs.available<15%"

# ============================================================================
# etcd monitoring (expose metrics for Prometheus)
# ============================================================================
etcd-expose-metrics: true

# ============================================================================
# Node labels
# ============================================================================
node-label:
  - "environment=production"
  - "system=awoooi"
  - "topology.kubernetes.io/zone=zone-a"

# ============================================================================
# TLS SANs (HA preparation: allow API server access via multiple IPs)
# ============================================================================
tls-san:
  - "192.168.0.120"
  - "192.168.0.121"
  - "192.168.0.125"  # future VIP
  - "k3s.awoooi.local"
  - "localhost"
  - "127.0.0.1"

# ============================================================================
# Cluster networking (matches the existing setup)
# ============================================================================
cluster-cidr: "10.42.0.0/16"
service-cidr: "10.43.0.0/16"
cluster-dns: "10.43.0.10"

# ============================================================================
# Notes
# ============================================================================
# 1. This config has no datastore-endpoint (added later for HA)
# 2. Restart K3s after changes: sudo systemctl restart k3s
# 3. A restart briefly interrupts the API server (~30 seconds)
EOF

# Step 3: Review the config (cat only prints the file; it does not validate YAML)
sudo cat /etc/rancher/k3s/config.yaml

# Step 4: Set permissions
sudo chmod 600 /etc/rancher/k3s/config.yaml
sudo chown root:root /etc/rancher/k3s/config.yaml
```

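The effect of the reservations on schedulable memory follows the standard kubelet formula: allocatable = capacity - kube-reserved - system-reserved - hard eviction threshold. A minimal sketch of that arithmetic is below; the 8192 MiB node capacity is a hypothetical figure, since the source does not state how much memory the nodes have, and `allocatable_mib` is an illustrative helper name.

```bash
#!/usr/bin/env bash
# Sketch: memory left for Pods after the reservations in config.yaml.
# All values are in MiB; the 8192 MiB capacity is an assumed example.
set -eu

allocatable_mib() {
  local capacity=$1 kube_reserved=$2 system_reserved=$3 eviction_hard=$4
  echo $(( capacity - kube_reserved - system_reserved - eviction_hard ))
}

# kube-reserved=512Mi, system-reserved=512Mi, eviction-hard=256Mi
allocatable_mib 8192 512 512 256   # prints 6912
```

Whatever the real capacity is, the three settings together take 1280 MiB of memory (and 400m of CPU) away from what `kubectl describe node` reports under `Allocatable`.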
#### 2.2.2 Create registries.yaml

```bash
# Step 1: Write the Harbor configuration
sudo tee /etc/rancher/k3s/registries.yaml << 'EOF'
# K3s private registry configuration
# Harbor: 192.168.0.110:5000
# Created by: Claude Code (Chief Architect)
# Date: 2026-03-28 (Taipei time)

mirrors:
  # Harbor mirror
  "192.168.0.110:5000":
    endpoint:
      - "http://192.168.0.110:5000"

  # Docker Hub pull-through (optional, via a Harbor proxy project)
  # "docker.io":
  #   endpoint:
  #     - "http://192.168.0.110:5000/v2/proxy/library"

configs:
  "192.168.0.110:5000":
    tls:
      # Harbor is served over HTTP; skip TLS verification
      insecure_skip_verify: true
    # Uncomment if authentication is required
    # auth:
    #   username: admin
    #   password: Harbor12345
EOF

# Step 2: Set permissions
sudo chmod 600 /etc/rancher/k3s/registries.yaml
```

#### 2.2.3 Restart K3s (proceed with care)

```bash
# ⚠️ Warning: this interrupts the API server for about 30 seconds

# Step 1: Check Pod status
kubectl get pods -n awoooi-prod

# Step 2: Notify (optional, via Telegram)
# curl -X POST "https://api.telegram.org/bot<TOKEN>/sendMessage" \
#   -d "chat_id=<CHAT_ID>" \
#   -d "text=🔧 K3s maintenance in progress, expected back within 30 seconds"

# Step 3: Restart K3s
sudo systemctl restart k3s

# Step 4: Wait for recovery
sleep 10
kubectl wait --for=condition=Ready node/mon --timeout=120s
kubectl get nodes
# Expected: mon is Ready

# Step 5: Verify the config took effect
kubectl describe node mon | grep -A 5 "Allocatable:"
# Allocatable should reflect the reserved resources

# Step 6: Verify etcd metrics are exposed
# etcd serves metrics over plain HTTP on port 2381; the client port 2379 requires TLS
curl -s http://127.0.0.1:2381/metrics | head -5
# or
kubectl get --raw /metrics | head -10
```

#### ✅ 2.2.4 Stability Verification (checkpoint: must pass before continuing!)

```bash
# ═══════════════════════════════════════════════════════════════════════════
# 🔴 Important: every check must pass before running K0.5 (Startup Probe)
# ═══════════════════════════════════════════════════════════════════════════

echo "========== K0.2 stability verification =========="

# Check 1: node status
echo "[1/5] Checking node status..."
kubectl get nodes
# Expected: mon and mon1 are both Ready
# If a node is NotReady, wait 30 seconds and retry

# Check 2: awoooi-prod Pod status
echo "[2/5] Checking Pod status..."
kubectl get pods -n awoooi-prod
# Expected: all Pods Running (2/2 or 1/1)
# If you see CrashLoopBackOff or ImagePullBackOff, check the logs

# Check 3: API health check
echo "[3/5] Checking API health..."
curl -sf http://192.168.0.120:32334/api/v1/health && echo " ✅ API OK" || echo " ❌ API FAILED"
# Expected: {"status":"healthy"} or a similar response

# Check 4: web health check
echo "[4/5] Checking web health..."
curl -sf http://192.168.0.120:32335/ -o /dev/null && echo " ✅ Web OK" || echo " ❌ Web FAILED"
# Expected: HTTP 200

# Check 5: PDB status
echo "[5/5] Checking PDB status..."
kubectl get pdb -n awoooi-prod
# Expected: ALLOWED DISRUPTIONS >= 1

echo "========== Verification finished =========="

# ═══════════════════════════════════════════════════════════════════════════
# Interpreting the result
# ═══════════════════════════════════════════════════════════════════════════
# ✅ All passed: continue with K0.5 (Startup Probe)
# ❌ Any failure:
#    1. Run kubectl describe pod <pod-name> -n awoooi-prod
#    2. Run kubectl logs <pod-name> -n awoooi-prod
#    3. If it keeps failing, run the rollback procedure (Section 11)
# ═══════════════════════════════════════════════════════════════════════════
```

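The checkpoint above prints a result for each check but leaves the overall pass/fail decision to the reader. One way to make the gate explicit is a small wrapper that counts failures. This is a sketch under stated assumptions: `run_check` is a hypothetical helper, and `true` / `false` stand in for the real `kubectl` and `curl` commands so the example runs anywhere.

```bash
#!/usr/bin/env bash
# Sketch: run named checks and count how many failed.
# run_check is a hypothetical helper; the commands below are placeholders.
failures=0

run_check() {
  local name=$1
  shift
  # Run the remaining arguments as a command, hiding its output.
  if "$@" >/dev/null 2>&1; then
    echo "✅ ${name}"
  else
    echo "❌ ${name}"
    failures=$((failures + 1))
  fi
}

run_check "placeholder that passes" true
run_check "placeholder that fails" false

echo "failed checks: ${failures}"
```

In a real run the placeholders would be replaced with calls such as `run_check "API health" curl -sf http://192.168.0.120:32334/api/v1/health`, and the script would stop before K0.5 whenever `failures` is greater than zero.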
---

### 2.3 K0.3 - Automatic etcd Backup

#### 2.3.0 Prerequisite: Prepare the Remote Backup Directory on 188

```bash
# Create the backup directory on 188 (one-time)
ssh wooo@192.168.0.188 << 'EOF'
sudo mkdir -p /backup/k3s_etcd
sudo chown wooo:wooo /backup/k3s_etcd
ls -la /backup/
EOF

# Verify 120 can SSH to 188 without a password
ssh wooo@192.168.0.188 "echo 'SSH OK'"

# To set up passwordless login if needed:
# ssh-copy-id wooo@192.168.0.188
```

#### 2.3.1 Create the Backup Script

```bash
# Step 1: Create the backup directories
sudo mkdir -p /var/lib/rancher/k3s/server/db/snapshots-backup
sudo mkdir -p /var/log/k3s-backup

# Step 2: Create the backup script
sudo tee /usr/local/bin/k3s-etcd-backup.sh << 'EOF'
#!/bin/bash
# K3s etcd automatic backup script
# Created by: Claude Code (Chief Architect)
# Date: 2026-03-28
# Note: `k3s etcd-snapshot` only works with the embedded etcd datastore.

# pipefail makes "cmd | tee" report cmd's failure instead of tee's success
set -eo pipefail

# Configuration
BACKUP_DIR="/var/lib/rancher/k3s/server/db/snapshots-backup"
LOG_FILE="/var/log/k3s-backup/backup.log"
RETENTION_DAYS=7
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
BACKUP_NAME="awoooi-etcd-${TIMESTAMP}"

# Helper: write a timestamped log line
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

# Start the backup
log "Starting etcd backup: ${BACKUP_NAME}"

# Run the built-in K3s snapshot
if /usr/local/bin/k3s etcd-snapshot save --name "${BACKUP_NAME}"; then
    log "K3s etcd-snapshot completed"
else
    log "ERROR: K3s etcd-snapshot failed"
    exit 1
fi

# Copy to the backup directory.
# K3s appends the node name and a timestamp to the snapshot name,
# so match by prefix rather than by exact file name.
SNAPSHOT_PATH="/var/lib/rancher/k3s/server/db/snapshots/${BACKUP_NAME}"
if compgen -G "${SNAPSHOT_PATH}*" > /dev/null; then
    cp "${SNAPSHOT_PATH}"* "${BACKUP_DIR}/" || log "WARNING: Copy to ${BACKUP_DIR} failed"
    log "Copied to backup directory: ${BACKUP_DIR}"
else
    log "WARNING: Snapshot file not found at ${SNAPSHOT_PATH}*"
fi

# Remove old backups (keep 7 days)
log "Cleaning backups older than ${RETENTION_DAYS} days"
find "${BACKUP_DIR}" -name "awoooi-etcd-*" -mtime +${RETENTION_DAYS} -delete 2>/dev/null || true

# Summary
BACKUP_COUNT=$(find "${BACKUP_DIR}" -name "awoooi-etcd-*" | wc -l)
log "Backup completed. Total backups in directory: ${BACKUP_COUNT}"

# ═══════════════════════════════════════════════════════════════════════════
# 🔴 Remote backup: rsync to 188 (removes the single point of failure)
# ═══════════════════════════════════════════════════════════════════════════
REMOTE_HOST="192.168.0.188"
REMOTE_DIR="/backup/k3s_etcd"
# Section 2.3.0 creates the directory as user wooo; make sure this user can write to it
REMOTE_USER="ollama"

log "Starting rsync to ${REMOTE_HOST}:${REMOTE_DIR}"

# Make sure the remote directory exists
ssh "${REMOTE_USER}@${REMOTE_HOST}" "mkdir -p ${REMOTE_DIR}" 2>/dev/null || {
    log "WARNING: Cannot create remote directory, rsync may fail"
}

# Run rsync
if rsync -avz --progress \
    "${BACKUP_DIR}/" \
    "${REMOTE_USER}@${REMOTE_HOST}:${REMOTE_DIR}/" \
    --delete-after \
    --exclude="*.tmp" 2>&1 | tee -a "$LOG_FILE"; then
    log "rsync to ${REMOTE_HOST} completed successfully"
else
    log "ERROR: rsync to ${REMOTE_HOST} failed"
    # Do not exit: the local backup already succeeded
fi

# Verify the remote backup
REMOTE_COUNT=$(ssh "${REMOTE_USER}@${REMOTE_HOST}" "ls -1 ${REMOTE_DIR}/awoooi-etcd-* 2>/dev/null | wc -l" || echo "unknown")
log "Remote backup count: ${REMOTE_COUNT}"

# Optional: upload to S3/MinIO
# aws s3 cp "${BACKUP_DIR}/${BACKUP_NAME}"* s3://awoooi-backup/etcd/

exit 0
EOF

# Step 3: Make it executable
sudo chmod +x /usr/local/bin/k3s-etcd-backup.sh

# Step 4: Test run
sudo /usr/local/bin/k3s-etcd-backup.sh

# Step 5: Confirm the backup
ls -la /var/lib/rancher/k3s/server/db/snapshots-backup/
```

#### 2.3.2 Set Up the Crontab

```bash
# Step 1: Edit root's crontab
sudo crontab -e

# Step 2: Add the following entry in the editor (backup every 6 hours)
# K3s etcd automatic backup - every 6 hours
0 */6 * * * /usr/local/bin/k3s-etcd-backup.sh >> /var/log/k3s-backup/cron.log 2>&1

# Step 3: Verify the crontab
sudo crontab -l | grep k3s

# Step 4: Confirm the cron service is running
systemctl status cron
```

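With a backup every 6 hours and `find -mtime +7` as the cleanup rule, the number of files in the backup directory settles at a predictable value. `find -mtime +N` discards the fractional part of a file's age in days and only matches ages strictly greater than N, so a file is deleted once it is at least N+1 full days old. The sketch below works through that arithmetic; `expected_backups` is an illustrative helper name, not part of the backup script.

```bash
#!/usr/bin/env bash
# Sketch: steady-state number of backups kept by "find -mtime +N -delete".
# Files survive until they are N+1 full days old, hence (retention_days + 1).
set -eu

expected_backups() {
  local interval_hours=$1 retention_days=$2
  echo $(( 24 / interval_hours * (retention_days + 1) ))
}

expected_backups 6 7   # prints 32
```

So the `Total backups in directory` line in the log should level off at around 32 rather than 28; a much lower number suggests missed cron runs, and a steadily growing one suggests the cleanup step is not matching the file names.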
---

### 2.4 K0.4 - PodDisruptionBudget

#### 2.4.1 Create the PDB YAML

```bash
# Create locally (or directly on 120)
cat > /tmp/09-pdb.yaml << 'EOF'
# AWOOOI PodDisruptionBudget
# Keeps services available during rolling updates and node maintenance
# Created by: Claude Code (Chief Architect)
# Date: 2026-03-28 (Taipei time)

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: awoooi-api-pdb
  namespace: awoooi-prod
  labels:
    app: awoooi-api
    system: awoooi
spec:
  # Keep at least 1 Pod running
  minAvailable: 1
  selector:
    matchLabels:
      app: awoooi-api

---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: awoooi-web-pdb
  namespace: awoooi-prod
  labels:
    app: awoooi-web
    system: awoooi
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: awoooi-web

---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: awoooi-worker-pdb
  namespace: awoooi-prod
  labels:
    app: awoooi-worker
    system: awoooi
spec:
  # The worker has a single replica, so allow 0 available;
  # this lets maintenance stop it completely
  maxUnavailable: 1
  selector:
    matchLabels:
      app: awoooi-worker
EOF

# Copy into the project directory
cp /tmp/09-pdb.yaml k8s/awoooi-prod/09-pdb.yaml
```

#### 2.4.2 Apply the PDBs

```bash
# Step 1: Apply
kubectl apply -f k8s/awoooi-prod/09-pdb.yaml

# Step 2: Verify
kubectl get pdb -n awoooi-prod

# Expected output:
# NAME                MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
# awoooi-api-pdb      1               N/A               1                     5s
# awoooi-web-pdb      1               N/A               1                     5s
# awoooi-worker-pdb   N/A             1                 1                     5s

# Step 3: Inspect in detail
kubectl describe pdb awoooi-api-pdb -n awoooi-prod
```

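The `ALLOWED DISRUPTIONS` column in the expected output can be derived by hand, which helps when the value is not what you expect. The sketch below assumes the API and web Deployments run 2 healthy replicas each (inferred from the expected value of 1, not stated in the source) and that the worker runs 1; the helper names are illustrative.

```bash
#!/usr/bin/env bash
# Sketch: how ALLOWED DISRUPTIONS follows from a PDB and the healthy Pod count.
# Replica counts used below are assumptions, not values read from the cluster.
set -eu

# For a PDB that sets minAvailable.
allowed_with_min_available() {
  local healthy=$1 min_available=$2
  local n=$(( healthy - min_available ))
  echo $(( n > 0 ? n : 0 ))
}

# For a PDB that sets maxUnavailable.
allowed_with_max_unavailable() {
  local healthy=$1 desired=$2 max_unavailable=$3
  local n=$(( healthy - (desired - max_unavailable) ))
  echo $(( n > 0 ? n : 0 ))
}

allowed_with_min_available 2 1       # api / web with 2 replicas
allowed_with_max_unavailable 1 1 1   # worker with 1 replica
```

Note the edge case this exposes: if the API or web Deployment ever drops to a single replica, `minAvailable: 1` yields 0 allowed disruptions and a node drain will block on that Pod.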
---

### 2.5 K0.5 - Startup Probe

#### 2.5.1 Modify the API Deployment

```bash
# Edit the deployment
kubectl edit deploy awoooi-api -n awoooi-prod

# Or use a patch:
kubectl patch deployment awoooi-api -n awoooi-prod --type='json' -p='[
  {
    "op": "add",
    "path": "/spec/template/spec/containers/0/startupProbe",
    "value": {
      "httpGet": {
        "path": "/api/v1/health",
        "port": 8000
      },
      "initialDelaySeconds": 5,
      "periodSeconds": 5,
      "timeoutSeconds": 3,
      "failureThreshold": 30
    }
  }
]'
```

#### 2.5.2 Full Deployment Patch File

```yaml
# k8s/awoooi-prod/patches/startup-probe-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: awoooi-api
  namespace: awoooi-prod
spec:
  # Lower revisionHistoryLimit (suggested by Gemini)
  revisionHistoryLimit: 3
  template:
    spec:
      containers:
        - name: api
          startupProbe:
            httpGet:
              path: /api/v1/health
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            # 5s x 30 = 150s probing window, after the 5s initial delay
            failureThreshold: 30
          # Adjust the readiness wait time
          readinessProbe:
            httpGet:
              path: /api/v1/health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
```

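The startup budget implied by these probe settings is the initial delay plus one period per allowed failure. A short worked calculation, as a sketch with an illustrative helper name:

```bash
#!/usr/bin/env bash
# Sketch: approximate time a container has to start before the kubelet
# restarts it: initialDelaySeconds + periodSeconds * failureThreshold.
set -eu

startup_budget_seconds() {
  local initial_delay=$1 period=$2 failure_threshold=$3
  echo $(( initial_delay + period * failure_threshold ))
}

startup_budget_seconds 5 5 30   # prints 155
```

If the API regularly needs longer than roughly 155 seconds to come up (for example during database migrations), raise `failureThreshold` rather than `periodSeconds`, so that a fast start is still detected quickly.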
---

### 2.6 K0.6 & K0.7 - Clean Up Abnormal Resources

#### 2.6.1 Clean Up ImagePullBackOff Pods

```bash
# Step 1: Identify the failing Pods
kubectl get pods -n awoooi-prod | grep -E 'ImagePull|ErrImage'

# Step 2: Delete the failing Pod
kubectl delete pod awoooi-web-99cf84f9c-mxd7l -n awoooi-prod

# Step 3: Delete the matching ReplicaSet
kubectl delete rs awoooi-web-99cf84f9c -n awoooi-prod

# Step 4: Verify
kubectl get pods -n awoooi-prod
```

#### 2.6.2 Clean Up Orphaned ReplicaSets

```bash
# Step 1: List orphaned ReplicaSets (desired, current, and ready all 0)
kubectl get rs -n awoooi-prod --no-headers | awk '$2==0 && $3==0 && $4==0 {print $1}'

# Step 2: Dry run of the bulk delete
# Uses the same filter as step 3, so it previews exactly what will be deleted.
# Zero-replica ReplicaSets are also the rollback history for `kubectl rollout undo`.
kubectl get rs -n awoooi-prod --no-headers | \
  awk '$2==0 && $3==0 && $4==0 {print $1}' | \
  xargs -r kubectl delete rs -n awoooi-prod --dry-run=client

# Step 3: Run for real
kubectl get rs -n awoooi-prod --no-headers | \
  awk '$2==0 && $3==0 && $4==0 {print $1}' | \
  xargs -r kubectl delete rs -n awoooi-prod

# Step 4: Verify
kubectl get rs -n awoooi-prod
# Each Deployment should have only 1-3 ReplicaSets left
```

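The `awk` filter used above can be tried against canned output before pointing it at the cluster. In this sketch the sample rows imitate the column layout of `kubectl get rs --no-headers` (NAME, DESIRED, CURRENT, READY, AGE); apart from `awoooi-web-99cf84f9c`, the ReplicaSet names are invented for illustration.

```bash
#!/usr/bin/env bash
# Sketch: apply the orphan filter to sample "kubectl get rs" output.
# Most names below are hypothetical; no cluster access is needed.
set -eu

sample_rs_output() {
  cat <<'EOF'
awoooi-api-5d8f7c6b9d   2   2   2   3d
awoooi-api-7b4c9d6f5    0   0   0   10d
awoooi-web-99cf84f9c    0   0   0   12d
awoooi-web-6c7d8e9f0    2   2   2   1d
EOF
}

# Keep only rows where DESIRED, CURRENT, and READY are all zero.
orphans=$(sample_rs_output | awk '$2==0 && $3==0 && $4==0 {print $1}')
printf '%s\n' "$orphans"
```

Only the two zero-replica rows are selected; the ReplicaSets that currently back running Pods never reach `kubectl delete`.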
---

## 3. Phase K-NET: Network Architecture ✅ Done

> **Executed**: 2026-03-28 14:00-16:00 (Taipei time)
> **VIP**: 192.168.0.125 (keepalived MASTER: 120, BACKUP: 121)
> **CI/CD integration**: GitHub Secret `KUBE_CONFIG_PROD` updated

### 3.1 K-NET.1 - Install Keepalived

#### 3.1.1 Install on mon (120)

```bash
# SSH to 120
ssh wooo@192.168.0.120

# Step 1: Install keepalived
sudo apt update
sudo apt install -y keepalived

# Step 2: Create the health check script
sudo tee /etc/keepalived/check_k3s_api.sh << 'EOF'
#!/bin/bash
# K3s API server health check
# Checks whether port 6443 responds.
# Assumes /healthz answers 200 to an unauthenticated request; if your API
# server rejects anonymous requests (401), adjust the expected code.

curl -sk https://127.0.0.1:6443/healthz -o /dev/null -w "%{http_code}" | grep -q 200
exit $?
EOF

sudo chmod +x /etc/keepalived/check_k3s_api.sh

# Step 3: Create the keepalived config (MASTER)
sudo tee /etc/keepalived/keepalived.conf << 'EOF'
# Keepalived config - K3s API server HA
# Node: mon (192.168.0.120) - MASTER
# Created by: Claude Code (Chief Architect)
# Date: 2026-03-28 (Taipei time)

global_defs {
    router_id K3S_MASTER_120
    script_user root
    enable_script_security
}

vrrp_script check_k3s_api {
    script "/etc/keepalived/check_k3s_api.sh"
    interval 2      # check every 2 seconds
    weight -20      # lower priority by 20 on failure
    fall 3          # 3 consecutive failures mark the check as down
    rise 2          # 2 consecutive successes mark it as recovered
}

vrrp_instance K3S_VIP {
    state MASTER
    interface ens192            # network interface on mon
    virtual_router_id 51        # VRID (must match on both nodes)
    priority 100                # priority (higher on MASTER)
    advert_int 1                # advertisement interval (seconds)

    # Unicast mode (more reliable than multicast)
    unicast_src_ip 192.168.0.120
    unicast_peer {
        192.168.0.121
    }

    authentication {
        auth_type PASS
        auth_pass awoooi_k3s_vip    # authentication password
    }

    virtual_ipaddress {
        192.168.0.125/24            # VIP
    }

    track_script {
        check_k3s_api
    }

    # State change notifications (optional)
    notify_master "/etc/keepalived/notify.sh MASTER"
    notify_backup "/etc/keepalived/notify.sh BACKUP"
    notify_fault "/etc/keepalived/notify.sh FAULT"
}
EOF

# Step 4: Create the notification script (optional)
sudo tee /etc/keepalived/notify.sh << 'EOF'
#!/bin/bash
# Keepalived state change notification
STATE=$1
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')

echo "[$TIMESTAMP] Keepalived state changed to: $STATE" >> /var/log/keepalived-notify.log

# Optional: send a Telegram notification
# curl -s -X POST "https://api.telegram.org/bot${BOT_TOKEN}/sendMessage" \
#   -d "chat_id=${CHAT_ID}" \
#   -d "text=🔄 K3s VIP state change: $STATE on $(hostname)"
EOF

sudo chmod +x /etc/keepalived/notify.sh

# Step 5: Start keepalived
sudo systemctl enable keepalived
sudo systemctl start keepalived

# Step 6: Verify
ip addr show ens192 | grep 192.168.0.125
# The VIP should be listed
```

#### 3.1.2 Install on mon1 (121)

```bash
# SSH to 121
ssh wooo@192.168.0.121

# Steps 1-2: same as above (install and health check script)

# Step 3: Create the keepalived config (BACKUP)
sudo tee /etc/keepalived/keepalived.conf << 'EOF'
# Keepalived config - K3s API server HA
# Node: mon1 (192.168.0.121) - BACKUP
# Created by: Claude Code (Chief Architect)
# Date: 2026-03-28 (Taipei time)

global_defs {
    router_id K3S_BACKUP_121
    script_user root
    enable_script_security
}

vrrp_script check_k3s_api {
    script "/etc/keepalived/check_k3s_api.sh"
    interval 2
    weight -20
    fall 3
    rise 2
}

vrrp_instance K3S_VIP {
    state BACKUP
    interface ens160            # network interface on mon1 (note: different!)
    virtual_router_id 51
    priority 90                 # priority (lower than MASTER)
    advert_int 1

    unicast_src_ip 192.168.0.121
    unicast_peer {
        192.168.0.120
    }

    authentication {
        auth_type PASS
        auth_pass awoooi_k3s_vip
    }

    virtual_ipaddress {
        192.168.0.125/24
    }

    track_script {
        check_k3s_api
    }

    notify_master "/etc/keepalived/notify.sh MASTER"
    notify_backup "/etc/keepalived/notify.sh BACKUP"
    notify_fault "/etc/keepalived/notify.sh FAULT"
}
EOF

# Note: 121 is currently an Agent and has no API server.
# The health check script will fail there, which is expected.
# The VIP can only fail over correctly once 121 is upgraded to a Server.
```

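The failover decision comes down to comparing effective priorities: with a negative `weight`, a failing `track_script` adds that weight to the node's base priority. The sketch below works through the numbers from the two configs; `effective_priority` is an illustrative helper, not a keepalived command.

```bash
#!/usr/bin/env bash
# Sketch: effective VRRP priority with a negative track_script weight.
# A failing check adds the (negative) weight to the base priority.
set -eu

effective_priority() {
  local base=$1 weight=$2 check_ok=$3
  if [ "$check_ok" = "yes" ]; then
    echo "$base"
  else
    echo $(( base + weight ))
  fi
}

effective_priority 100 -20 no    # MASTER with a failing API check: 80
effective_priority 90 -20 yes    # BACKUP with a healthy API check: 90
effective_priority 90 -20 no     # BACKUP while 121 is still an Agent: 70
```

Because 80 is below 90, the VIP moves to 121 when only 120's API check fails. While 121 is still an Agent its own check fails too (70), so 120 keeps the VIP even when its API server is down, which is why failover only becomes meaningful after the K-HA upgrade.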
### 3.2 K-NET.2 - Verify the VIP

```bash
# Run from any node
ping -c 3 192.168.0.125

# Confirm the VIP is on 120
ssh wooo@192.168.0.120 "ip addr show ens192 | grep 192.168.0.125"

# Test API server access through the VIP
curl -sk https://192.168.0.125:6443/healthz
# Expected: ok

# Failover test (careful: this causes a service interruption)
# ssh wooo@192.168.0.120 "sudo systemctl stop k3s"
# Wait for the VIP to move to 121 (121 must already be a Server)
```

---

## 4. Phase K-HA: HA Upgrade ✅ Done

> **Status**: ✅ Completed 2026-03-28 19:35
> **Executed by**: Claude Code + Commander
> **Downtime**: ~20 minutes

### K-HA Execution Summary

| Step | Description | Status |
|------|-------------|--------|
| K-HA.1 | PostgreSQL setup (188:5432) | ✅ k3s_datastore + k3s_admin |
| K-HA.2 | SQLite backup | ✅ state.db.pre-ha-* |
| K-HA.3 | datastore-endpoint configured on 120 | ✅ Restart succeeded |
| K-HA.4 | 121 upgraded from Agent to Server | ✅ Dual control plane |
| K-HA.5 | Services redeployed | ✅ CD force_deploy |
| K-HA.6 | Health verification | ✅ All components up |

### Architecture Change

```
Before migration:
  120 (Server/Control-Plane) ─── SQLite (Kine)
  121 (Agent/Worker)

After migration:
  120 (Server/Control-Plane) ─┬─ PostgreSQL (188:5432/k3s_datastore)
  121 (Server/Control-Plane) ─┘
  VIP: 192.168.0.125 (keepalived)
```

### Issues Encountered During the Migration

| Issue | Resolution |
|-------|------------|
| SENTRY_DSN set to "CHANGE_ME" | Changed to an empty string |
| DATABASE_URL missing `+asyncpg` | Corrected to `postgresql+asyncpg://` |
| awoooi DB password mismatch | Updated the PostgreSQL user's password |
| kubeconfig permissions | Must be reset after a K3s restart |

---

### K-HA Original Plan (executed)

> **Prerequisites**: PostgreSQL ready, full backup taken, users notified
> **Risk level**: 🔴 High risk

### 4.1 K-HA.1 - Prepare PostgreSQL (188)

```bash
# SSH to 188
ssh ollama@192.168.0.188

# Step 1: Connect to PostgreSQL
sudo -u postgres psql

-- Steps 2-5 are typed inside the psql session (SQL comments use "--")

-- Step 2: Create a dedicated user
CREATE USER k3s_admin WITH PASSWORD 'K3s_Secure_P@ss_2026!';

-- Step 3: Create a dedicated database
CREATE DATABASE k3s_datastore
  OWNER k3s_admin
  ENCODING 'UTF8'
  LC_COLLATE 'en_US.UTF-8'
  LC_CTYPE 'en_US.UTF-8';

-- Step 4: Grant privileges
GRANT ALL PRIVILEGES ON DATABASE k3s_datastore TO k3s_admin;

-- Step 5: Exit psql
\q

# Step 6: Edit pg_hba.conf to allow connections from the K3s nodes
sudo tee -a /etc/postgresql/*/main/pg_hba.conf << 'EOF'
# K3s node connections
host    k3s_datastore    k3s_admin    192.168.0.120/32    scram-sha-256
host    k3s_datastore    k3s_admin    192.168.0.121/32    scram-sha-256
host    k3s_datastore    k3s_admin    192.168.0.125/32    scram-sha-256
EOF

# Step 7: Reload PostgreSQL
sudo systemctl reload postgresql

# Step 8: Test the connection (from 120)
# ssh wooo@192.168.0.120
# psql -h 192.168.0.188 -U k3s_admin -d k3s_datastore -c "SELECT 1;"
```

### 4.2 K-HA.2 - Back Up Existing Data

```bash
# Run on 120
ssh wooo@192.168.0.120

# Step 1: Take a full snapshot
sudo k3s etcd-snapshot save --name pre-ha-migration-$(date +%Y%m%d%H%M)

# Step 2: Back up the K3s configuration
sudo tar -czvf /tmp/k3s-server-backup-$(date +%Y%m%d).tar.gz \
  /etc/rancher/k3s/ \
  /var/lib/rancher/k3s/server/tls/ \
  /var/lib/rancher/k3s/server/token

# Step 3: Export all resources
kubectl get all -A -o yaml > /tmp/all-resources-backup-$(date +%Y%m%d).yaml
kubectl get configmap,secret -A -o yaml > /tmp/configmap-secret-backup-$(date +%Y%m%d).yaml

# Step 4: Verify the backups
ls -la /tmp/*backup*
ls -la /var/lib/rancher/k3s/server/db/snapshots/
```

### 4.3 K-HA.3 - Migrate 120 to PostgreSQL

```bash
# ⚠️⚠️⚠️ High-risk operation: run during a maintenance window ⚠️⚠️⚠️

# Step 1: Append the datastore setting to config.yaml
sudo tee -a /etc/rancher/k3s/config.yaml << 'EOF'

# ============================================================================
# External datastore (HA mode)
# ============================================================================
datastore-endpoint: "postgres://k3s_admin:K3s_Secure_P@ss_2026!@192.168.0.188:5432/k3s_datastore"
EOF

# Step 2: Stop K3s
sudo systemctl stop k3s

# Step 3: Back up and move the old local datastore out of the way
sudo mv /var/lib/rancher/k3s/server/db /var/lib/rancher/k3s/server/db.backup-$(date +%Y%m%d)

# Step 4: Start K3s again (now using PostgreSQL)
sudo systemctl start k3s

# Step 5: Wait for startup
sleep 30
kubectl get nodes

# Step 6: To restore data if needed, use:
# sudo k3s server \
#   --cluster-reset \
#   --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/pre-ha-migration-xxx

# Step 7: Verify K3s is writing to PostgreSQL (run this on 188, where PostgreSQL lives)
sudo -u postgres psql -d k3s_datastore -c "SELECT count(*) FROM kine;"
```

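The `datastore-endpoint` value is a URL, and passwords containing characters such as `@` or `!` can be ambiguous inside one. A common precaution is to percent-encode the password before embedding it. The helper below is a sketch: `urlencode` is a hypothetical function, the password is a made-up example, and whether the K3s datastore driver requires encoding for a particular character is an assumption to verify against your K3s version.

```bash
#!/usr/bin/env bash
# Sketch: percent-encode a password before placing it in a connection URL.
# urlencode is a hypothetical helper; it handles ASCII input only.
set -eu

urlencode() {
  local s=$1 out="" c i
  local LC_ALL=C
  for (( i = 0; i < ${#s}; i++ )); do
    c=${s:i:1}
    case "$c" in
      # Unreserved URL characters pass through unchanged.
      [A-Za-z0-9._~-]) out+="$c" ;;
      # Everything else becomes %XX using the character's byte value.
      *) out+=$(printf '%%%02X' "'$c") ;;
    esac
  done
  printf '%s\n' "$out"
}

urlencode 'Example_P@ss!'   # prints Example_P%40ss%21
```

The encoded value would replace the raw password between `k3s_admin:` and the final `@` of the endpoint, leaving exactly one unencoded `@` to separate the credentials from the host.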
### 4.4 K-HA.4 - 121 升級為 Server
|
||
|
||
```bash
|
||
# SSH 到 121
|
||
ssh wooo@192.168.0.121
|
||
|
||
# 步驟 1: 取得 token (從 120)
|
||
TOKEN=$(ssh wooo@192.168.0.120 "sudo cat /var/lib/rancher/k3s/server/token")
|
||
|
||
# 步驟 2: 停止 Agent
|
||
sudo systemctl stop k3s-agent
|
||
sudo systemctl disable k3s-agent
|
||
|
||
# 步驟 3: 安裝為 Server
|
||
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server" sh -s - \
|
||
--server https://192.168.0.120:6443 \
|
||
--token ${TOKEN} \
|
||
--datastore-endpoint="postgres://k3s_admin:K3s_Secure_P@ss_2026!@192.168.0.188:5432/k3s_datastore" \
|
||
--tls-san 192.168.0.120 \
|
||
--tls-san 192.168.0.121 \
|
||
--tls-san 192.168.0.125
|
||
|
||
# 步驟 4: 等待加入
|
||
sleep 60
|
||
kubectl get nodes
|
||
|
||
# 預期輸出:
|
||
# NAME STATUS ROLES AGE VERSION
|
||
# mon Ready control-plane,master ... v1.34.5+k3s1
|
||
# mon1 Ready control-plane,master ... v1.34.5+k3s1
|
||
```
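
The fixed `sleep 30` / `sleep 60` waits above either waste time or end too early. A minimal alternative, assuming a POSIX shell, is to poll until the check succeeds. `retry` is a hypothetical helper name, and the node name `mon1` is taken from the expected output above.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry() {
  retry_max=$1
  retry_delay=$2
  shift 2
  retry_n=1
  until "$@"; do
    if [ "$retry_n" -ge "$retry_max" ]; then
      echo "retry: giving up after $retry_n attempts: $*" >&2
      return 1
    fi
    retry_n=$((retry_n + 1))
    sleep "$retry_delay"
  done
}

# Usage: poll for roughly 5 minutes instead of `sleep 60`
#   retry 30 10 kubectl wait --for=condition=Ready node/mon1 --timeout=5s
```

Wrapping `kubectl wait` in a retry loop matters here because `kubectl wait` fails immediately while the node object does not exist yet.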

---

## 5. Phase K-CLEAN: Cleanup Script ✅ Completed

> **Executed**: 2026-03-28 18:00-18:20 (Taipei time)
> **Cleanup result**: 9 orphaned ReplicaSets + 1 failed Job

### 5.1 One-shot cleanup script

```bash
#!/bin/bash
# K3s environment cleanup script
# Author: Claude Code (Chief Architect)
# Date: 2026-03-28

# Abort on errors, unset variables, and failures anywhere in a pipeline
set -euo pipefail

echo "=== K3s environment cleanup script ==="
echo "Run at: $(date)"

# 1. Clean up ImagePullBackOff Pods (checks every container, not only the first)
echo "[1/5] Cleaning up ImagePullBackOff Pods..."
kubectl get pods -A -o json | jq -r '.items[] | select(any(.status.containerStatuses[]?; .state.waiting.reason == "ImagePullBackOff")) | "\(.metadata.namespace) \(.metadata.name)"' | \
while read -r ns name; do
    echo "  Deleting: $ns/$name"
    kubectl delete pod -n "$ns" "$name" --grace-period=0
done

# 2. Clean up orphaned ReplicaSets
echo "[2/5] Cleaning up orphaned ReplicaSets (replicas=0)..."
for ns in awoooi-prod; do
    kubectl get rs -n "$ns" --no-headers | awk '$2==0 && $3==0 && $4==0 {print $1}' | \
    while read -r rs; do
        echo "  Deleting: $ns/$rs"
        kubectl delete rs -n "$ns" "$rs"
    done
done

# 3. Clean up completed Jobs
echo "[3/5] Cleaning up completed Jobs..."
kubectl delete jobs -A --field-selector status.successful=1

# 4. Clean up failed Pods
echo "[4/5] Cleaning up Failed/Evicted Pods..."
kubectl get pods -A --field-selector status.phase=Failed -o json | jq -r '.items[] | "\(.metadata.namespace) \(.metadata.name)"' | \
while read -r ns name; do
    echo "  Deleting: $ns/$name"
    kubectl delete pod -n "$ns" "$name"
done

# 5. Summary
echo "[5/5] Cleanup finished, summary:"
echo "  - Pods: $(kubectl get pods -A --no-headers | wc -l)"
echo "  - RS: $(kubectl get rs -A --no-headers | wc -l)"
echo "  - Jobs: $(kubectl get jobs -A --no-headers | wc -l)"

echo "=== Cleanup complete ==="
```
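
Because the script above deletes resources as soon as it runs, it can help to preview its effect first. The sketch below is an assumption, not part of the executed script: `run` and `orphan_rs` are hypothetical helper names. `run` prints a command instead of executing it unless `DRY_RUN=0` is set, and `orphan_rs` isolates the ReplicaSet filter so it can be checked against sample `kubectl get rs --no-headers` output.

```shell
# Print the command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  else
    echo "DRY-RUN: $*"
  fi
}

# Read `kubectl get rs --no-headers` rows on stdin and print the names of
# ReplicaSets whose DESIRED, CURRENT and READY columns are all 0.
orphan_rs() {
  awk '$2 == 0 && $3 == 0 && $4 == 0 { print $1 }'
}

# Usage:
#   kubectl get rs -n awoooi-prod --no-headers | orphan_rs |
#     while read -r rs; do run kubectl delete rs -n awoooi-prod "$rs"; done
```

Defaulting to dry-run means a destructive run has to be requested explicitly with `DRY_RUN=0`.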

---

## 6. Phase K1: Disaster Recovery (P1) - 8h ❌ Not started

> **Status**: 📋 Pending
> **Prerequisite**: K0 complete ✅
> **Estimated time**: 8 hours

### 6.1 K1.1 - Deploy the Velero backup system

#### 6.1.1 Set up MinIO storage (188) - 30m

```bash
# SSH to 188
ssh ollama@192.168.0.188

# Create the MinIO data directory and the Compose project directory
sudo mkdir -p /data/minio
sudo chown -R 1000:1000 /data/minio
mkdir -p ~/minio

# Deploy with Docker Compose
cat > ~/minio/docker-compose.yml << 'EOF'
version: '3.8'
services:
  minio:
    image: minio/minio:RELEASE.2024-03-26T22-10-45Z
    container_name: minio
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: minio_admin
      MINIO_ROOT_PASSWORD: "Minio_Secure_2026!"
    volumes:
      - /data/minio:/data
    ports:
      - "9000:9000"
      - "9001:9001"
    restart: unless-stopped
EOF

cd ~/minio && docker-compose up -d

# Verify
curl http://192.168.0.188:9000/minio/health/live
```
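
Velero writes to a bucket named `velero-backups` in the next step, and it expects that bucket to exist already. The sketch below creates it with the MinIO client. It assumes `mc` is installed on the machine running it; `ensure_bucket` and the alias `k3sminio` are hypothetical names chosen for illustration.

```shell
# Create the bucket Velero will write to, if it does not exist yet.
# Assumes the MinIO client `mc` is installed; the alias name is arbitrary.
ensure_bucket() {
  bucket=$1
  mc alias set k3sminio http://192.168.0.188:9000 "$MINIO_ROOT_USER" "$MINIO_ROOT_PASSWORD" &&
    mc mb --ignore-existing "k3sminio/$bucket"
}

# Usage (credentials as configured in docker-compose.yml above):
#   export MINIO_ROOT_USER=minio_admin MINIO_ROOT_PASSWORD='...'
#   ensure_bucket velero-backups
```

Reading the credentials from the environment keeps the password out of the shell history.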

#### 6.1.2 Install the Velero CLI - 10m

```bash
# Run on 120 (master)
wget https://github.com/vmware-tanzu/velero/releases/download/v1.13.0/velero-v1.13.0-linux-amd64.tar.gz
tar -xvf velero-v1.13.0-linux-amd64.tar.gz
sudo mv velero-v1.13.0-linux-amd64/velero /usr/local/bin/
velero version
```

#### 6.1.3 Deploy Velero to K3s - 20m

```bash
# Create the credentials file (readable by the current user only)
cat > /tmp/credentials-velero << 'EOF'
[default]
aws_access_key_id=minio_admin
aws_secret_access_key=Minio_Secure_2026!
EOF
chmod 600 /tmp/credentials-velero

# Install Velero (the velero-backups bucket must already exist in MinIO)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket velero-backups \
  --secret-file /tmp/credentials-velero \
  --backup-location-config region=minio,s3ForcePathStyle=true,s3Url=http://192.168.0.188:9000 \
  --use-volume-snapshots=false

# Verify
kubectl get pods -n velero
velero backup-location get
```

#### 6.1.4 Configure the backup schedule - 15m

```bash
# Back up awoooi-prod daily at 03:00 Taipei time.
# Velero evaluates cron expressions in UTC, so 03:00 (UTC+8) is written as 19:00 UTC.
velero schedule create awoooi-daily \
  --schedule="0 19 * * *" \
  --include-namespaces awoooi-prod \
  --ttl 168h

# Verify
velero schedule get
```

#### 6.1.5 Test backup and restore - 30m

```bash
# Pin the backup name once so every command below refers to the same backup
BACKUP="awoooi-test-$(date +%Y%m%d)"

# Trigger a manual backup
velero backup create "$BACKUP" --include-namespaces awoooi-prod

# Check the backup status
velero backup describe "$BACKUP"

# Simulate a disaster: list the ConfigMaps, then delete one (test only)
kubectl get cm -n awoooi-prod
# kubectl delete cm <configmap-name> -n awoooi-prod

# Restore
velero restore create --from-backup "$BACKUP"

# Verify the restore
velero restore describe <restore-name>
```
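
The backup and restore commands above return before the work has finished, so the restore can start against an incomplete backup. One way to guard against that, sketched here under the assumption that Velero runs in its default `velero` namespace, is to poll the object's `.status.phase` through `kubectl`. `wait_phase` is a hypothetical helper name.

```shell
# wait_phase KIND NAME WANTED [ATTEMPTS] [DELAY_SECONDS]
# Polls .status.phase of a Velero object in the velero namespace.
wait_phase() {
  kind=$1
  name=$2
  wanted=$3
  attempts=${4:-60}
  delay=${5:-5}
  i=0
  phase=
  while [ "$i" -lt "$attempts" ]; do
    phase=$(kubectl -n velero get "$kind" "$name" -o jsonpath='{.status.phase}' 2>/dev/null)
    case $phase in
      "$wanted")
        echo "$kind/$name: $phase"
        return 0 ;;
      Failed|PartiallyFailed|FailedValidation)
        echo "$kind/$name: $phase" >&2
        return 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "$kind/$name: timed out (last phase: ${phase:-unknown})" >&2
  return 1
}

# Usage:
#   wait_phase backupstoragelocation default Available
#   wait_phase backup "awoooi-test-$(date +%Y%m%d)" Completed
```

The same helper covers the backup storage location check after `velero install` and the backup itself, since both expose a `.status.phase` field.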

---

## 7. Phase K2: Automated Operations (P2) - 12h ❌ Not started

> **Status**: 📋 Pending
> **Prerequisite**: K0 complete ✅
> **Estimated time**: 12 hours

### 7.1 K2.1 - Deploy ArgoCD (GitOps)

#### 7.1.1 Install ArgoCD - 15m

```bash
# Create the namespace
kubectl create namespace argocd

# Install ArgoCD
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Expose it as a NodePort
kubectl patch svc argocd-server -n argocd -p '{"spec": {"type": "NodePort", "ports": [{"port": 443, "nodePort": 30443}]}}'

# Retrieve the initial admin password
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

# Access the UI: https://192.168.0.125:30443
# (if the VIP does not answer on NodePorts, see Appendix C, use a node IP such as 192.168.0.120:30443)
```

#### 7.1.2 Connect the Git repo - 30m

```bash
# Download the ArgoCD CLI (writing to /usr/local/bin requires root)
sudo curl -sSL -o /usr/local/bin/argocd https://github.com/argoproj/argo-cd/releases/latest/download/argocd-linux-amd64
sudo chmod +x /usr/local/bin/argocd

# Log in (--insecure accepts the default self-signed certificate)
argocd login 192.168.0.125:30443 --username admin --insecure

# Add the AWOOOI repo
argocd repo add https://github.com/your-org/awoooi.git \
  --username <github-user> \
  --password <github-token>
```

#### 7.1.3 Create the Application - 30m

```yaml
# argocd-awoooi-prod.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: awoooi-prod
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/awoooi.git
    targetRevision: main
    path: k8s/awoooi-prod
  destination:
    server: https://kubernetes.default.svc
    namespace: awoooi-prod
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

### 7.2 K2.2 - Deploy Sealed Secrets

```bash
# Install the kubeseal CLI
wget https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.26.0/kubeseal-0.26.0-linux-amd64.tar.gz
tar -xvf kubeseal-0.26.0-linux-amd64.tar.gz
sudo mv kubeseal /usr/local/bin/

# Deploy the controller
kubectl apply -f https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.26.0/controller.yaml

# Encrypt the existing Secret (commit only the sealed file to Git)
kubeseal --format yaml < k8s/awoooi-prod/03-secrets.yaml > k8s/awoooi-prod/03-sealed-secrets.yaml
```
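
Sealing a Secret only helps if the plaintext manifest does not stay in the repository next to it. As a rough check, assuming manifests declare `kind:` at the start of a line, the sketch below lists files that still contain a plain `kind: Secret`. `find_plain_secrets` is a hypothetical helper name.

```shell
# List manifest files under DIR that still declare a plain `kind: Secret`.
# SealedSecret manifests do not match because the pattern is anchored.
find_plain_secrets() {
  grep -rlE '^kind:[[:space:]]*Secret[[:space:]]*$' "$1" || true
}

# Usage:
#   find_plain_secrets k8s/awoooi-prod
```

An empty result means no plaintext Secret manifests were found under the directory.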

### 7.3 K2.3 - Deploy VPA

```bash
# Install VPA (in a subshell, so the relative paths below still resolve from the repo root)
git clone https://github.com/kubernetes/autoscaler.git
(cd autoscaler/vertical-pod-autoscaler && ./hack/vpa-up.sh)

# Create the VPA resource (Off mode: recommendations only)
cat > k8s/awoooi-prod/11-vpa.yaml << 'EOF'
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: awoooi-api-vpa
  namespace: awoooi-prod
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: awoooi-api
  updatePolicy:
    updateMode: "Off"
EOF

kubectl apply -f k8s/awoooi-prod/11-vpa.yaml
```
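
### 7.4 K2.4 - Deploy Node Problem Detector

```bash
# Install NPD (the DaemonSet mounts the node-problem-detector-config ConfigMap, so apply it first)
kubectl apply -f https://raw.githubusercontent.com/kubernetes/node-problem-detector/master/deployment/node-problem-detector-config.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/node-problem-detector/master/deployment/node-problem-detector.yaml

# Verify
kubectl get pods -n kube-system -l app=node-problem-detector
```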

---

## 8. Phase K3: Storage and HPA (P3) - 10h ❌ Not started

> **Status**: 📋 Pending
> **Prerequisite**: K0 complete ✅
> **Estimated time**: 10 hours

### 8.1 K3.1 - Longhorn evaluation

| Item | Current state | Decision |
|------|---------------|----------|
| Stateful workloads | AWOOOI is fully stateless | Deferred |
| Migrating OpenClaw in | Stays deployed at the container layer | Not needed for now |

**Conclusion**: Longhorn will not be deployed for now; re-evaluate once a stateful workload appears.

### 8.2 K3.2 - Configure HPA

```yaml
# k8s/awoooi-prod/10-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: awoooi-api-hpa
  namespace: awoooi-prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: awoooi-api
  minReplicas: 2
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: awoooi-web-hpa
  namespace: awoooi-prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: awoooi-web
  minReplicas: 2
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

```bash
kubectl apply -f k8s/awoooi-prod/10-hpa.yaml
kubectl get hpa -n awoooi-prod
```
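
To get a feel for how the targets above translate into replica counts, the HPA's core rule can be worked through by hand: desired = ceil(currentReplicas × currentUtilization / targetUtilization), clamped to `minReplicas` and `maxReplicas`. The sketch below uses integer arithmetic and deliberately ignores the controller's tolerance band and stabilization window, so treat it as an approximation. `hpa_desired` is a hypothetical helper name.

```shell
# hpa_desired CURRENT_REPLICAS CURRENT_UTIL TARGET_UTIL MIN MAX
# desired = ceil(current * currentUtil / targetUtil), clamped to [MIN, MAX].
# Ignores the controller's tolerance band and stabilization window.
hpa_desired() {
  d=$(( ($1 * $2 + $3 - 1) / $3 ))   # integer ceiling division
  if [ "$d" -lt "$4" ]; then d=$4; fi
  if [ "$d" -gt "$5" ]; then d=$5; fi
  echo "$d"
}

# Usage with the awoooi-api settings (CPU target 70%, 2 to 4 replicas):
#   hpa_desired 2 105 70 2 4    # 2 replicas at 105% CPU
#   hpa_desired 2 200 70 2 4    # capped by maxReplicas
```

When several metrics are configured, as with CPU and memory on `awoooi-api-hpa`, the controller computes a value per metric and uses the largest.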

---

## 9. Phase K4: Advanced Optimization (P4) - 6h ❌ Not started

> **Status**: 📋 Pending
> **Prerequisite**: K2 complete
> **Estimated time**: 6 hours

### 9.1 K4.1 - Deploy Kured

```bash
# Install Kured (pinned release; the project now lives under kubereboot/kured)
kubectl apply -f https://github.com/kubereboot/kured/releases/download/1.14.1/kured-1.14.1-dockerhub.yaml

# Configure the maintenance window (03:00-05:00)
# Note: if the manifest lists its flags under `command` rather than `args`, adjust the path below
kubectl patch daemonset kured -n kube-system --type=json -p='[
  {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--start-time=03:00"},
  {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--end-time=05:00"},
  {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--time-zone=Asia/Taipei"}
]'
```

### 9.2 K4.2 - Deploy Descheduler

```yaml
# descheduler-policy.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler-policy
  namespace: kube-system
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha2"
    kind: "DeschedulerPolicy"
    profiles:
      - name: default
        pluginConfig:
          - name: "RemoveDuplicates"
            args:
              excludeOwnerKinds:
                - "DaemonSet"
          - name: "LowNodeUtilization"
            args:
              thresholds:
                cpu: 20
                memory: 20
              targetThresholds:
                cpu: 50
                memory: 50
        # Plugins must also be enabled at an extension point,
        # otherwise the pluginConfig entries above have no effect
        plugins:
          balance:
            enabled:
              - "RemoveDuplicates"
              - "LowNodeUtilization"
```

### 9.3 K4.3 - Clean up legacy namespaces

```bash
# Pin the backup name once so Steps 1 and 2 refer to the same backup
LEGACY_BACKUP="legacy-wooo-aiops-$(date +%Y%m%d)"

# Step 1: Back up with Velero first
velero backup create "$LEGACY_BACKUP" \
  --include-namespaces wooo-aiops-uat,wooo-aiops-prod

# Step 2: Confirm the backup succeeded
velero backup describe "$LEGACY_BACKUP"

# Step 3: Scale the services down (observe for 1 week)
kubectl scale deploy --all -n wooo-aiops-uat --replicas=0
kubectl scale deploy --all -n wooo-aiops-prod --replicas=0

# Step 4: Delete once no dependencies remain
kubectl delete ns wooo-aiops-uat
kubectl delete ns wooo-aiops-prod
```

---

## 10. Verification Checklist

### 10.1 Phase K0 verification

| # | Check | Command | Expected result |
|---|-------|---------|-----------------|
| 1 | Swap is off | `free -h \| grep Swap` | Swap: 0B 0B 0B |
| 2 | config.yaml exists | `cat /etc/rancher/k3s/config.yaml` | Has content |
| 3 | registries.yaml exists | `cat /etc/rancher/k3s/registries.yaml` | Harbor configuration |
| 4 | etcd backup exists | `ls /var/lib/rancher/k3s/server/db/snapshots/` | Snapshot present |
| 5 | crontab configured | `sudo crontab -l \| grep k3s` | Backup job present |
| 6 | PDBs exist | `kubectl get pdb -n awoooi-prod` | 3 PDBs |
| 7 | No abnormal Pods | `kubectl get pods -n awoooi-prod` | All Running |
| 8 | Orphaned RS cleaned up | `kubectl get rs -n awoooi-prod \| wc -l` | ≤ 9 |

### 10.2 Phase K-HA verification

| # | Check | Command | Expected result |
|---|-------|---------|-----------------|
| 1 | VIP is alive | `ping -c 1 192.168.0.125` | 0% loss |
| 2 | Two servers | `kubectl get nodes` | 2 control-plane nodes |
| 3 | PostgreSQL connection | `psql -h 192.168.0.188 -U k3s_admin -d k3s_datastore` | Connection succeeds |
| 4 | API through the VIP | `kubectl get --raw /healthz --server https://192.168.0.125:6443` | ok (an unauthenticated `curl` returns 401, see Appendix C) |
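
The checks in these tables can also be run as one script instead of by hand. The sketch below is one possible shape, not the procedure that was actually used: `check` and `CHECK_FAILED` are hypothetical names, and a check passes simply when its command exits with status 0, so rows that need their output inspected (such as the PDB count) still require a manual look.

```shell
# check DESCRIPTION COMMAND [ARGS...]
# Runs COMMAND quietly and prints PASS or FAIL; failures are counted in CHECK_FAILED.
CHECK_FAILED=0
check() {
  desc=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $desc"
  else
    echo "FAIL  $desc"
    CHECK_FAILED=$((CHECK_FAILED + 1))
  fi
}

# Usage (commands taken from the tables above):
#   check "VIP is alive"        ping -c 1 192.168.0.125
#   check "PDBs exist"          kubectl get pdb -n awoooi-prod
#   check "etcd backup exists"  ls /var/lib/rancher/k3s/server/db/snapshots/
#   [ "$CHECK_FAILED" -eq 0 ] || echo "$CHECK_FAILED check(s) failed"
```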

---

## 11. Rollback Procedures

### 11.1 K0 rollback (low risk)

```bash
# Restore fstab (re-enable swap); use the newest backup in case several exist
sudo cp "$(ls -t /etc/fstab.backup.* | head -n 1)" /etc/fstab
sudo swapon -a

# Restore config.yaml
sudo mv /etc/rancher/k3s/config.yaml.backup /etc/rancher/k3s/config.yaml
sudo systemctl restart k3s
```

### 11.2 K-HA rollback (high risk)

```bash
# Step 1: Stop K3s (both nodes)
sudo systemctl stop k3s

# Step 2: Restore the etcd data (120); use the newest backup in case several exist
# The glob runs inside a root shell because the server directory is not world-readable
sudo rm -rf /var/lib/rancher/k3s/server/db
sudo sh -c 'mv "$(ls -dt /var/lib/rancher/k3s/server/db.backup-* | head -n 1)" /var/lib/rancher/k3s/server/db'

# Step 3: Remove datastore-endpoint
sudo sed -i '/datastore-endpoint/d' /etc/rancher/k3s/config.yaml

# Step 4: Start K3s (120 first)
sudo systemctl start k3s

# Step 5: Demote 121 to agent (run on 121)
# /tmp/token must contain the server token from 120 (/var/lib/rancher/k3s/server/token)
sudo systemctl stop k3s
sudo systemctl disable k3s
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="agent" sh -s - \
  --server https://192.168.0.120:6443 \
  --token "$(cat /tmp/token)"
```

---

## Appendix A: Execution Timeline

### ✅ Actual execution record (2026-03-28)

```
Day 0 (2026-03-28) - Phase K0 + K-NET + K-CLEAN + K-HA ✅ All completed
├── 09:00 - Commander approved Phase K0 ✅
├── 09:30 - K0.1 Disable swap (120 + 121) ✅
├── 10:00 - K0.2 config.yaml + registries.yaml ✅
├── 10:30 - K0.3 etcd backup + rsync to 188 ✅
├── 11:00 - K0.4 PDBs (3) ✅
├── 11:30 - K0.5 Startup Probe (Git change) ✅
├── 12:00 - K0.6/K0.7 Clean up ImagePullBackOff ✅
├── 12:30 - K0 Chief Architect review 9.0/10 ✅
├── 14:00 - K-NET.1-2 Install keepalived (120 + 121) ✅
├── 15:00 - K-NET.3 VIP 192.168.0.125 enabled ✅
├── 16:00 - K-VIP CI/CD integration (GitHub Secret) ✅
├── 18:00 - K-CLEAN cleanup (9 RS + 1 Job) ✅
├── 19:00 - K-HA.1 PostgreSQL k3s_datastore created ✅
├── 19:15 - K-HA.2 Full SQLite backup ✅
├── 19:25 - K-HA.3 120 migrated to PostgreSQL ✅
├── 19:35 - K-HA.4 121 promoted to control-plane ✅
├── 19:45 - K-HA verification (two nodes + Kine 552 records) ✅
├── 20:00 - Services redeployed (CD force_deploy) ✅
├── 20:30 - Long-term fixes for 15 anomalies completed ✅
└── 21:00 - Runbook + Memory status updated ✅
```

### ✅ K-HA completed (2026-03-28 19:35)

```
2026-03-28 19:00-20:00 - Phase K-HA (external PostgreSQL) ✅ Completed
├── 19:00 - K-HA.1 PostgreSQL preparation (188) ✅
├── 19:15 - K-HA.2 Full backup ✅
├── 19:25 - K-HA.3 120 migrated to PostgreSQL ✅
├── 19:35 - K-HA.4 121 promoted to server ✅
├── 19:45 - Verification (Kine 552 records) ✅
└── 20:00 - Services redeployed successfully ✅
```

### 📋 To be scheduled (meeting goals not yet completed)

```
Week 2 (2026-03-31 ~ 04-04)
├── Phase K1: Velero backup system (P1 - 8h)
│   ├── K1.1 Deploy MinIO (188)
│   ├── K1.2 Install Velero
│   ├── K1.3 Configure the backup schedule
│   └── K1.4 Test backup and restore
│
├── Phase K2: Automated operations (P2 - 12h)
│   ├── K2.1 Deploy ArgoCD (GitOps)
│   ├── K2.2 Deploy Sealed Secrets
│   ├── K2.3 Deploy VPA (Off mode)
│   └── K2.4 Deploy NPD
│
└── Phase K3: Storage and HPA (P3 - 10h)
    ├── K3.1 Evaluate Longhorn (can be deferred)
    └── K3.2 Configure HPA (API/Web 2-4)

Week 3 (2026-04-07 ~ 04-11)
└── Phase K4: Advanced optimization (P4 - 6h)
    ├── K4.1 Deploy Kured
    ├── K4.2 Deploy Descheduler
    └── K4.3 Clean up legacy namespaces
```

---

## Appendix B: Post-K-HA Anomaly Fixes (15 permanent fixes)

> **Executed**: 2026-03-28 20:30 (Taipei time)
> **Scope**: Full log review after the K-HA migration

| # | Anomaly | Permanent fix | Status |
|---|---------|---------------|--------|
| 1 | ContainerHighMemory +Inf | Fixed the alerts.yml formula (division-by-zero guard) | ✅ |
| 2 | WebsiteDown false alarm | Removed the incorrect probe target | ✅ |
| 3 | Blackbox TCP probe pointed at the old IP | Updated to VIP 192.168.0.125 | ✅ |
| 4 | ClawBot on legacy port 8088 | Switched to OpenClaw 8089 | ✅ |
| 5 | lewooogo-core exports order | Moved `types` first | ✅ |
| 6 | ConfigMap OTEL gRPC | Switched to HTTP 24318 | ✅ |
| 7 | Web Deployment had no secretRef | Added access to the Sentry DSN | ✅ |
| 8 | stats.py violated the Router-layer rules | Fully refactored into the Service layer | ✅ |
| 9 | Leftover mock tests | Deleted the non-compliant tests | ✅ |
| 10 | StatsService had no interface | Added the IStatsService Protocol | ✅ |
| 11 | Worker heartbeat mechanism | Touch a file every 30 seconds | ✅ |
| 12 | Worker liveness probe | Switched to an mtime check | ✅ |
| 13 | Wrong AI_FALLBACK_ORDER | Reverted to Ollama first | ✅ |
| 14 | Outdated architecture docs | Fully updated the K-HA chapters | ✅ |
| 15 | Service endpoints scattered | Created SERVICE-ENDPOINTS.md | ✅ |

---

## Appendix C: Known Limitations

| Limitation | Explanation | Resolution |
|------------|-------------|------------|
| VIP:32334 does not respond | keepalived only provides L3 VIP failover | Requires MetalLB or HAProxy |
| 120:32334 is intermittent | The Pod may be running only on 121 | AntiAffinity already ensures Pods are spread out |
| K3s API returns 401 | `/healthz` requiring authentication is normal behavior | Use a kubeconfig |

**Conclusion**: The current architecture meets its design goals; the VIP mainly provides high availability for the K3s API (6443).

---

**Runbook completed**: 2026-03-28 21:00 (Taipei time)
**Author**: Claude Code (Chief Architect)
**Status**: ✅ Phase K0 + K-NET + K-HA + K-CLEAN executed