docs(skills): add 2026-03-23 production incident learnings
New sections added:

01-frontend-aesthetics:
- Polling + Operation Race Condition pattern

02-lewooogo-backend-core:
- Worker Redis dedicated connection pool (socket_timeout=None)
- SQLite prohibition decree
- Function rename global search requirement

04-awoooi-devops-commander:
- NetworkPolicy Pod Selector (system label)
- Zombie consumer group cleanup
- PostgreSQL initialization checklist

05-awoooi-sre-qa (updated earlier):
- CrashLoopBackOff diagnosis
- Telegram health check
- Frontend race condition diagnosis

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -82,6 +82,60 @@ pnpm run dev & sleep 5 && curl -s http://localhost:3000 | grep -i "hydration"
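---

## 🚨 Polling + Operation Race Condition (lesson from 2026-03-23)

> **Incident**: After the sign-off button on an approval card was pressed, that card disappeared, then every card disappeared, then all of them reappeared. The cause was a race between Zustand polling and the `signApproval` API call.

### Symptoms

- UI elements flicker, disappearing and then reappearing
- State is overwritten after an operation
- The console shows no errors, yet the behavior is abnormal

### Root cause
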
```typescript
// ❌ Problem pattern: polling runs concurrently with the operation
const signApproval = async () => {
  await fetch('/api/sign') // operation API call
  // ⚠️ By this point polling (every 5 seconds) may already have overwritten state with stale data
}
```

### ✅ Correct pattern: pause polling during the operation

```typescript
// apps/web/src/stores/approval.store.ts
signApproval: async (id, signerId, signerName, comment) => {
  // 🔧 Pause polling
  const wasPolling = pollingTimer !== null
  if (wasPolling) {
    clearInterval(pollingTimer!)
    pollingTimer = null
  }

  try {
    const response = await fetch(`/api/v1/approvals/${id}/sign`, {...})
    // ... handle the response ...
  } finally {
    // 🔧 Resume polling after a 1-second delay (gives the backend time to update)
    if (wasPolling) {
      setTimeout(() => {
        get().startPolling(get().pollingInterval || 5000)
      }, 1000)
    }
  }
}
```

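Clearing the interval stops new polls from being scheduled, but a poll request that is already in flight can still resolve during the operation and overwrite state. The sketch below illustrates one possible complementary guard: dropping poll snapshots that arrive while a mutation is in progress. It is a language-neutral simulation written in Python, not the store's actual code; `ApprovalStore`, `mutation`, and `apply_poll_snapshot` are hypothetical names.

```python
from contextlib import contextmanager


class ApprovalStore:
    """Hypothetical stand-in for the approval store, for illustration only."""

    def __init__(self, approvals: dict[str, str]) -> None:
        self.approvals = dict(approvals)
        self._mutations_in_flight = 0

    @contextmanager
    def mutation(self):
        """Mark a state-changing operation as in progress."""
        self._mutations_in_flight += 1
        try:
            yield
        finally:
            self._mutations_in_flight -= 1

    def apply_poll_snapshot(self, snapshot: dict[str, str]) -> bool:
        """Apply a polled snapshot unless a mutation is in progress.

        Returns True if the snapshot was applied, False if it was dropped.
        """
        if self._mutations_in_flight > 0:
            return False
        self.approvals = dict(snapshot)
        return True


store = ApprovalStore({"a1": "pending", "a2": "pending"})
stale_snapshot = dict(store.approvals)  # what a poll started earlier would return

with store.mutation():
    del store.approvals["a1"]  # optimistic update: the signed card is removed
    dropped = not store.apply_poll_snapshot(stale_snapshot)  # stale poll arrives mid-operation

applied = store.apply_poll_snapshot({"a2": "pending"})  # next poll, fresh data
```

A counter is used rather than a boolean so that overlapping operations do not re-enable snapshots too early.
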
### Applicable scenarios

- Approval sign-off / rejection
- Any operation that modifies backend state
- Optimistic update patterns

---

## References

- ADR-002: Nothing.tech design system

@@ -127,6 +127,90 @@ curl -sf http://localhost:8888/api/v1/health
---

## 🚨 Dedicated Redis Connection for Workers (lesson from 2026-03-23)

> **Incident**: The Worker used the API's shared Redis connection pool (`socket_timeout=5s`), which conflicted with `XREADGROUP BLOCK 5000ms` and raised a TimeoutError every 5 seconds.

### Iron rule: Workers must have their own connection pool

```python
# ✅ Correct: dedicated long-lived connection for the Worker
async def init_worker_redis_pool() -> redis.Redis:
    return redis.from_url(
        settings.REDIS_URL,
        socket_timeout=None,  # wait indefinitely (XREADGROUP blocks)
        socket_connect_timeout=10.0,
    )

# ❌ Forbidden: Worker sharing the API's short-timeout connection
redis_client = get_redis()  # socket_timeout=5s
await redis_client.xreadgroup(block=5000)  # bound to time out
```

### Checklist

| Item | Correct value |
|------|---------------|
| API Redis `socket_timeout` | 5.0 seconds |
| Worker Redis `socket_timeout` | `None` (no limit) |
| `XREADGROUP` block | 5000 ms |

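The checklist values could also be enforced when the Worker starts, so a misconfigured pool fails immediately instead of timing out every few seconds. A minimal sketch, assuming the timeout and block values are available as plain numbers; `check_blocking_read_timeouts` and `is_safe` are hypothetical helpers, not part of the project or of any Redis client.

```python
from typing import Optional


def check_blocking_read_timeouts(socket_timeout: Optional[float], block_ms: int) -> None:
    """Raise ValueError if a blocking read could outlive the socket timeout.

    socket_timeout is in seconds (None means no limit); block_ms is the
    XREADGROUP BLOCK value in milliseconds (0 means block forever).
    """
    if socket_timeout is None:
        return  # no read timeout, so a blocking read can never trip it
    if block_ms == 0:
        raise ValueError("BLOCK 0 waits forever; socket_timeout must be None")
    block_seconds = block_ms / 1000.0
    if socket_timeout <= block_seconds:
        raise ValueError(
            f"socket_timeout={socket_timeout}s must exceed block={block_seconds}s "
            "or be None for connections used with XREADGROUP BLOCK"
        )


def is_safe(socket_timeout: Optional[float], block_ms: int) -> bool:
    """Convenience wrapper that returns a boolean instead of raising."""
    try:
        check_blocking_read_timeouts(socket_timeout, block_ms)
    except ValueError:
        return False
    return True
```

With the incident's values, `is_safe(5.0, 5000)` is false because the server replies only after the full block period, while `is_safe(None, 5000)` is true.
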
---

## 🚨 SQLite Ban (lesson from 2026-03-23)

> **Incident**: The Worker and API Pods each used their own local SQLite file, causing a "split brain" and `no such table: incidents` errors.

### Iron rule: SQLite is strictly forbidden; PostgreSQL is mandatory

```python
# ✅ Correct: use DATABASE_URL (PostgreSQL)
def get_engine() -> AsyncEngine:
    database_url = settings.DATABASE_URL  # postgresql+asyncpg://...

    # Guard against SQLite
    if "sqlite" in database_url.lower():
        logger.error("sqlite_forbidden")
        database_url = "postgresql+asyncpg://awoooi:xxx@192.168.0.188:5432/awoooi_prod"

    return create_async_engine(database_url, ...)

# ❌ Forbidden: using SQLITE_DATABASE_URL
settings.SQLITE_DATABASE_URL  # deprecated, do not use
```

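The guard above logs an error and substitutes a hard-coded URL. An alternative worth considering is to fail fast, so a misconfigured Pod refuses to start rather than quietly connecting somewhere else. A minimal sketch, assuming the URL is checked once at startup; `require_postgres_url` is a hypothetical helper and the URLs in the usage lines are placeholders.

```python
def require_postgres_url(database_url: str) -> str:
    """Return database_url unchanged if its dialect is PostgreSQL, else raise."""
    scheme, separator, _rest = database_url.partition("://")
    if not separator:
        raise ValueError("DATABASE_URL has no scheme")
    # SQLAlchemy-style schemes look like "dialect+driver"; only the dialect matters here.
    dialect = scheme.split("+", 1)[0].lower()
    if dialect != "postgresql":
        raise ValueError(f"unsupported database dialect {dialect!r}; PostgreSQL is required")
    return database_url


ok_url = require_postgres_url("postgresql+asyncpg://user:secret@db.example:5432/app")

try:
    require_postgres_url("sqlite+aiosqlite:///./local.db")
    sqlite_rejected = False
except ValueError:
    sqlite_rejected = True
```

Checking the parsed dialect instead of searching for the substring `sqlite` avoids false positives when that word happens to appear in a host name or database name.
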
### Production connection

| Item | Value |
|------|-------|
| Host | 192.168.0.188 |
| Port | 5432 |
| Database | awoooi_prod |
| Connection string prefix | `postgresql+asyncpg://` |

---

## 🚨 Global Search Before Renaming a Function (lesson from 2026-03-23)

> **Incident**: `init_redis()` was renamed to `init_redis_pool()`, but the Worker's standalone mode was not updated, which caused an ImportError.

### Iron rule: always run a global search before renaming

```bash
# Before changing a function name, you must run:
grep -rn "old_function_name" apps/api/src/

# Commit only after confirming that every call site has been updated
```

### High-risk areas

- `workers/signal_worker.py`: the standalone `_main()` function
- `main.py`: lifespan initialization
- Test files under `tests/`

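A plain-string `grep` also matches longer names that contain the old one: searching for `init_redis` after the rename still hits every `init_redis_pool` call, which makes leftover references easy to overlook. The sketch below shows a whole-word search across several roots; `find_references` is a hypothetical helper, not an existing project script.

```python
import re
from pathlib import Path


def find_references(name: str, roots: list[str], suffix: str = ".py") -> list[tuple[str, int, str]]:
    """Return (path, line number, line) for each whole-word occurrence of name."""
    pattern = re.compile(rf"\b{re.escape(name)}\b")
    hits = []
    for root in roots:
        for path in sorted(Path(root).rglob(f"*{suffix}")):
            if not path.is_file():
                continue
            text = path.read_text(encoding="utf-8", errors="replace")
            for lineno, line in enumerate(text.splitlines(), start=1):
                if pattern.search(line):
                    hits.append((str(path), lineno, line.strip()))
    return hits
```

Passing every root that may import the function, source and tests alike, covers the high-risk areas listed above. An empty result for the old name is the signal that the rename is complete.
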
---

## References

- `apps/api/src/core/config.py`: configuration center

@@ -135,6 +135,102 @@ kubectl rollout status deployment/awoooi-api -n awoooi-prod
---

## 🚨 NetworkPolicy Pod Selector (lesson from 2026-03-23)

> **Incident**: `allow-required-egress` matched only `app=awoooi-api`. The Worker, labeled `app=awoooi-worker`, was blocked and could not reach Redis.

### Iron rule: select on a shared label, not an individual `app` label

```yaml
# ❌ Wrong: matches only one kind of Pod
podSelector:
  matchLabels:
    app: awoooi-api  # Worker is excluded!

# ✅ Correct: matches all AWOOOI Pods
podSelector:
  matchLabels:
    system: awoooi  # matches API + Worker + Web
```

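A `matchLabels` selector matches a Pod only when every listed key/value pair is present in that Pod's labels. The sketch below reproduces that rule to show why the Worker was excluded. The label sets are assumptions for illustration (in particular that every AWOOOI Pod carries `system: awoooi`, and the `awoooi-web` label value), and `selector_matches` is a hypothetical helper, not a Kubernetes API.

```python
def selector_matches(match_labels: dict[str, str], pod_labels: dict[str, str]) -> bool:
    """True if every key/value pair in match_labels is present in pod_labels."""
    return all(pod_labels.get(key) == value for key, value in match_labels.items())


# Assumed labels for the three workloads; not taken from the real manifests.
pods = {
    "awoooi-api": {"app": "awoooi-api", "system": "awoooi"},
    "awoooi-worker": {"app": "awoooi-worker", "system": "awoooi"},
    "awoooi-web": {"app": "awoooi-web", "system": "awoooi"},
}


def selected(match_labels: dict[str, str]) -> list[str]:
    """Names of the Pods a policy with this selector would apply to."""
    return sorted(name for name, labels in pods.items() if selector_matches(match_labels, labels))


narrow = selected({"app": "awoooi-api"})   # the selector from the incident
shared = selected({"system": "awoooi"})    # the shared-label selector
```

Here `narrow` contains only the API Pod, while `shared` contains all three. A new workload is covered by the policy as soon as it carries the shared label.
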
### Verification commands

```bash
# Check Pod labels
kubectl get pods -n awoooi-prod --show-labels

# Check the NetworkPolicy selector
kubectl get networkpolicy allow-required-egress -n awoooi-prod \
  -o jsonpath='{.spec.podSelector}'
```

---

## 🚨 Zombie Consumer Group Cleanup (lesson from 2026-03-23)

> **Incident**: 92 Worker CrashLoopBackOff restarts left behind 50 zombie consumers.

### Cleanup commands

```bash
# Inspect the consumer group status
ssh ollama@192.168.0.188 \
  "docker exec clawbot-redis redis-cli -n 10 XINFO GROUPS stream:awoooi_signals"

# Warning sign: consumers > 5 and many of them belong to dead Pods

# Destroy and recreate (Tier 3, requires Commander authorization)
ssh ollama@192.168.0.188 \
  "docker exec clawbot-redis redis-cli -n 10 XGROUP DESTROY stream:awoooi_signals awoooi_workers"
```

### Verification after recreation

```bash
# The consumer count should equal the number of running Worker Pods
kubectl get pods -n awoooi-prod | grep worker
# consumers should equal the count above
```

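Destroying the group also discards its pending-entries list. A less destructive option, offered here as a sketch rather than as the team's procedure, is to remove only the consumers that no longer correspond to a running Pod with `XGROUP DELCONSUMER`. The function below selects those consumers from data shaped like `XINFO CONSUMERS` output. It assumes each Worker registers its Pod name as its consumer name, which the incident notes do not confirm; `find_zombie_consumers` and the sample data are hypothetical.

```python
def find_zombie_consumers(consumers: list[dict], live_pods: set[str]) -> list[str]:
    """Return names of consumers with no matching live Pod and nothing pending."""
    zombies = []
    for consumer in consumers:
        if consumer["name"] in live_pods:
            continue  # still backed by a running Pod
        if consumer.get("pending", 0) > 0:
            # Pending entries should be claimed by a live consumer
            # (XAUTOCLAIM / XCLAIM) before this consumer is deleted.
            continue
        zombies.append(consumer["name"])
    return sorted(zombies)


# Sample data in the shape of XINFO CONSUMERS (name, pending, idle in ms).
sample_consumers = [
    {"name": "awoooi-worker-abc", "pending": 0, "idle": 1200},
    {"name": "awoooi-worker-old1", "pending": 0, "idle": 86_400_000},
    {"name": "awoooi-worker-old2", "pending": 3, "idle": 86_400_000},
]
zombies = find_zombie_consumers(sample_consumers, live_pods={"awoooi-worker-abc"})
```

Consumers that still hold pending entries are deliberately skipped, since deleting them would leave those entries unclaimable.
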
---

## 🚨 PostgreSQL Initialization Checklist (lesson from 2026-03-23)

> **Incident**: The K8s Secret contained `DATABASE_URL`, but PostgreSQL had no matching user, resulting in `InvalidPasswordError`.

### Confirm before deploying a new environment

| Item | Check |
|------|-------|
| PostgreSQL is running | `pg_isready -h 192.168.0.188 -p 5432` |
| User exists | `\du awoooi` (inside `psql`) |
| Database exists | `\l awoooi_prod` (inside `psql`) |
| Network connectivity | Test the connection from a K3s Pod |

### Create the user and database (Tier 3)

```bash
# Run on .188 as the postgres user.
# The statements are fed through stdin because CREATE DATABASE cannot be
# executed from a multi-statement `psql -c` string.
sudo -u postgres psql <<'SQL'
CREATE USER awoooi WITH PASSWORD 'xxx';
CREATE DATABASE awoooi_prod OWNER awoooi;
GRANT ALL PRIVILEGES ON DATABASE awoooi_prod TO awoooi;
SQL
```

### K8s Secret format

```bash
# Encode the connection string
echo -n "postgresql+asyncpg://awoooi:<password>@192.168.0.188:5432/awoooi_prod" | base64 -w0

# Update the Secret
kubectl patch secret awoooi-secrets -n awoooi-prod --type='json' \
  -p='[{"op":"replace","path":"/data/DATABASE_URL","value":"<base64-encoded value>"}]'
```

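A password containing characters such as `@`, `:` or `/` breaks the connection URL unless it is percent-encoded first, and the resulting error can look just like a wrong password. The sketch below builds the connection string and its base64 form with the Python standard library; `build_database_url_secret` is a hypothetical helper and the password shown is a placeholder.

```python
import base64
from urllib.parse import quote


def build_database_url_secret(
    user: str, password: str, host: str, port: int, database: str
) -> tuple[str, str]:
    """Return (connection URL, base64 of that URL) for use in a Secret's data field."""
    url = (
        f"postgresql+asyncpg://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )
    encoded = base64.b64encode(url.encode("utf-8")).decode("ascii")
    return url, encoded


# Placeholder password chosen to exercise the characters that need escaping.
url, encoded = build_database_url_secret(
    "awoooi", "p@ss:w/rd", "192.168.0.188", 5432, "awoooi_prod"
)
```

`base64.b64encode` emits a single line with no trailing newline, which corresponds to what `echo -n ... | base64 -w0` produces in the shell.
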
---

## References

- `k8s/awoooi-prod/`: K8s manifests
