docs(k3s): 首席架構師審查完成 46/50 (92%)
K3s 優化工作審查完成: - ADR-033: Phase K0 + K-NET 標記為已完成 - 09-pdb.yaml: Worker PDB 設計說明註釋 - DevOps Skill: 新增 keepalived 快速操作參考 審查結果: - 架構合規性: 9/10 - Runbook 完整性: 10/10 ⭐ - 模組化合規: 9/10 - 風險控制: 9/10 - 文檔完整性: 9/10 P2 問題已修復,無 P0/P1 阻擋項 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
@@ -10,11 +10,11 @@
|
||||
|
||||
| 欄位 | 值 |
|
||||
|------|-----|
|
||||
| **版本** | v1.6 |
|
||||
| **版本** | v1.7 |
|
||||
| **建立日期** | 2026-03-20 (台北) |
|
||||
| **建立者** | Claude Code |
|
||||
| **最後修改** | 2026-03-26 03:30 (台北) |
|
||||
| **修改者** | Claude Code |
|
||||
| **最後修改** | 2026-03-28 16:00 (台北) |
|
||||
| **修改者** | Claude Code
|
||||
|
||||
### 變更紀錄
|
||||
|
||||
@@ -27,6 +27,7 @@
|
||||
| v1.4 | 2026-03-26 | Claude Code | 新增部署層級決策鐵律 |
|
||||
| v1.5 | 2026-03-26 | Claude Code | **Phase 15 三層觀測架構 (Deep Linking)** |
|
||||
| v1.6 | 2026-03-26 | Claude Code | **Runner 殭屍進程修復 + CI/CD cancel-in-progress: false** |
|
||||
| v1.7 | 2026-03-28 | Claude Code | **K3s 生產級優化 (ADR-033 + Phase K0)** |
|
||||
|
||||
---
|
||||
|
||||
@@ -223,6 +224,81 @@ forward . 8.8.8.8 1.1.1.1
|
||||
|
||||
---
|
||||
|
||||
## 🔴🔴 K3s 生產級優化 (ADR-033)
|
||||
|
||||
> **2026-03-28**: K3s 深度優化討論,確立生產級配置標準
|
||||
> **詳細文件**: `memory/feedback_k3s_optimization_rules.md`
|
||||
|
||||
### 必須配置清單
|
||||
|
||||
| 配置項 | 狀態 | 驗證指令 |
|
||||
|--------|------|---------|
|
||||
| **Swap 關閉** | ✅ 必須 | `free -h \| grep Swap` (全 0) |
|
||||
| **kube-reserved** | ✅ 必須 | `kubectl describe node \| grep Allocatable` |
|
||||
| **etcd 備份** | ✅ 必須 | `ls /var/lib/rancher/k3s/server/db/snapshots-backup/` |
|
||||
| **遠端備份 (rsync)** | ✅ 必須 | `ssh 188 "ls /backup/k3s_etcd/"` |
|
||||
| **PDB** | ✅ 必須 | `kubectl get pdb -n awoooi-prod` |
|
||||
| **Startup Probe** | ✅ 必須 | `kubectl describe deploy \| grep startupProbe` |
|
||||
|
||||
### K3s 重啟前置作業
|
||||
|
||||
```bash
|
||||
# 1. Alertmanager 靜音 (防止誤報)
|
||||
amtool silence add alertname=NodeDown node=mon duration=1h
|
||||
|
||||
# 2. 確認沒有正在部署的任務
|
||||
gh run list --workflow=cd.yaml --limit 1
|
||||
|
||||
# 3. 重啟 + 驗證
|
||||
sudo systemctl restart k3s && sleep 10 && kubectl get nodes
|
||||
```
|
||||
|
||||
### HA 架構 (方案 B: 外接 PostgreSQL)
|
||||
|
||||
```
|
||||
VIP 192.168.0.125 (keepalived)
|
||||
│
|
||||
┌────┴────┐
|
||||
│ │
|
||||
mon(120) mon1(121) ─── K3s Server
|
||||
│ │
|
||||
└────┬────┘
|
||||
│
|
||||
188 PostgreSQL ─── 外接 Datastore
|
||||
```
|
||||
|
||||
### 工具選型 (100% 業界主流)
|
||||
|
||||
| 用途 | 工具 | 業界採用 |
|
||||
|------|------|---------|
|
||||
| VIP 高可用 | keepalived | ⭐⭐⭐⭐⭐ |
|
||||
| Pod 保護 | PDB | K8s 原生 |
|
||||
| 啟動檢查 | Startup Probe | K8s 原生 |
|
||||
| 檔案同步 | rsync | ⭐⭐⭐⭐⭐ |
|
||||
|
||||
### keepalived 快速操作 (VIP 192.168.0.125)
|
||||
|
||||
```bash
|
||||
# 檢查 VIP 狀態
|
||||
ip addr show ens192 | grep 192.168.0.125 # 120
|
||||
ip addr show ens160 | grep 192.168.0.125 # 121
|
||||
|
||||
# 檢查 keepalived 狀態
|
||||
sudo systemctl status keepalived
|
||||
|
||||
# 強制切換 VIP (在 MASTER 上執行)
|
||||
sudo systemctl stop keepalived # VIP 漂移到 BACKUP
|
||||
sudo systemctl start keepalived # 恢復
|
||||
|
||||
# 透過 VIP 存取 K3s API
|
||||
kubectl --server=https://192.168.0.125:6443 get nodes
|
||||
```
|
||||
|
||||
**配置位置**: `/etc/keepalived/keepalived.conf`
|
||||
**日誌**: `journalctl -u keepalived -f`
|
||||
|
||||
---
|
||||
|
||||
## 🔴🔴🔴 告警鏈路 E2E 驗證 (ADR-025)
|
||||
|
||||
> **2026-03-26**: URL 路徑錯誤導致 2 天無告警 (`webhook` vs `webhooks`)
|
||||
|
||||
@@ -83,7 +83,7 @@
|
||||
|
||||
## 實施階段
|
||||
|
||||
### Phase K0: 基礎穩定化 (✅ 已批准)
|
||||
### Phase K0: 基礎穩定化 (✅ 已完成 2026-03-28)
|
||||
|
||||
| 任務 | 內容 | 風險 |
|
||||
|------|------|------|
|
||||
@@ -93,7 +93,7 @@
|
||||
| K0.4 | PodDisruptionBudget | 🟢 低 |
|
||||
| K0.5 | Startup Probe | 🟡 Pod 重啟 |
|
||||
|
||||
### Phase K-NET: 網路架構 (⏳ 待 K0 完成)
|
||||
### Phase K-NET: 網路架構 (✅ 已完成 2026-03-28)
|
||||
|
||||
| 任務 | 內容 | 風險 |
|
||||
|------|------|------|
|
||||
@@ -202,3 +202,4 @@ sudo systemctl disable keepalived
|
||||
| 版本 | 日期 | 執行者 | 變更內容 |
|
||||
|------|------|--------|----------|
|
||||
| 1.0 | 2026-03-28 | Claude Code | 初始建立,Phase K0 批准 |
|
||||
| 1.1 | 2026-03-28 | Claude Code | Phase K0 + K-NET 完成,首席架構師審查 46/50 |
|
||||
|
||||
@@ -44,8 +44,9 @@ metadata:
|
||||
app: awoooi-worker
|
||||
system: awoooi
|
||||
spec:
|
||||
# Worker 只有 1 個 replica,允許 1 個不可用
|
||||
# 這樣維護時可以完全停止
|
||||
# 設計選擇: replica=1 + maxUnavailable=1 允許維護時完全停止 Worker
|
||||
# Worker 處理 Redis Streams,短暫中斷可接受 (訊息會在 Stream 中等待)
|
||||
# 首席架構師審查 2026-03-28: 確認此為預期設計
|
||||
maxUnavailable: 1
|
||||
selector:
|
||||
matchLabels:
|
||||
|
||||
Reference in New Issue
Block a user