# AWOOOI Architectural Risk Full-Spectrum War Game

> **Document type**: Architecture decision baseline (required reading; takes precedence over all runbooks)
> **Created**: 2026-03-29 13:37 (Taipei)
> **Role**: This document sits above every execution plan; all implementation steps must conform to it

---

## Chapter 0: The Real State of the Code

Before producing any plan, we have to face the actual state of the code honestly:

| Risk item | Code location | Actual state |
|---------|---------|---------|
| Worker SIGTERM | `signal_worker.py:450-455` | ✅ SIGTERM interception is implemented |
| Worker stop() timeout | `signal_worker.py:147` | ❌ **Only 5 seconds**; AI tasks can run up to 60 seconds |
| XCLAIM (orphaned-task recovery) | all of `signal_worker.py` | ❌ **Missing entirely**; nothing handles PEL orphans |
| StatefulSet hard block | all of `auto_repair_service.py` | ❌ **Missing entirely**; only severity and risk are checked |
| TIER_ACTIONS Tier 1 | `auto_repair_service.py:411` | ❌ `restart_pod` is in Tier 1, with no exception for DB/Redis |
| OpenClaw Circuit Breaker | `sentry_webhook.py:289` | ❌ **Only a 60s httpx timeout; no circuit breaking** |
| ESLint i18n plugin | `.eslintrc.js:20-22` | ❌ **Only a TODO comment; not installed** |
| Redis AOF | n/a | ⚠️ Unverified; must be checked immediately |
| Visual Regression baseline | n/a | ❌ Not established; Mac environment ≠ CI environment |

---

## Chapter 1: The 6 Fatal Conflicts Already Identified (Chief Architect's Version)

The chief architect correctly identified 6 conflicts. Checking them against the code makes each one more severe:

### Conflict 1: The CI/CD Monitoring Iron Curtain Paralyzes Deployment

**Severity**: 🔴 P0 | **Type**: Process conflict

**Code check**: `service-registry.yaml` already lists 60+ services, but `validate_coverage.py` has not yet been integrated into `cd.yaml`.

**Correct approach (staged Soft Launch)**:

```yaml
# .github/workflows/cd.yaml
# Stage 1 (immediately): Warn-Only
- name: Service Registry Coverage Check
  run: |
    python ops/scripts/validate_coverage.py --warn-only
    # exits 0 even when coverage is missing; only sends a Telegram warning
  continue-on-error: true  # ← key point: does not block the deploy

# Stage 2 (once the Registry is complete): hard Block
# Change continue-on-error to false
```

---

### Conflict 2: Worker HPA + Orphaned Tasks in the Redis PEL

**Severity**: 🔴 P0 | **Type**: Logic conflict

**Code check**: `signal_worker.py` contains no XCLAIM logic at all. `start()` only calls `_ensure_consumer_group()` and never recovers pending tasks.

**Minimum viable fix (added to `start()`)**:

```python
# signal_worker.py
async def start(self) -> None:
    if self._running:
        return

    await self._ensure_consumer_group()

    # 🆕 Key step: on startup, take over orphaned tasks left by dead Workers
    await self._claim_orphaned_tasks()

    self._running = True
    self._task = asyncio.create_task(self._consume_loop())

async def _claim_orphaned_tasks(self, idle_ms: int = 60000) -> int:
    """
    XCLAIM: take over Pending tasks that have gone un-ACKed for more than idle_ms.

    Scenario: the previous Worker Pod was killed by K8s in the middle of a task.
    The task is stuck in the PEL, and a newly started Worker must take it over.

    idle_ms: only claim tasks idle for longer than this many milliseconds (default 60 s)
    """
    redis_client = get_redis()
    claimed_count = 0

    try:
        # List Pending entries in the PEL (at most 100 per call)
        pending = await redis_client.xpending_range(
            STREAM_KEY, CONSUMER_GROUP,
            min='-', max='+', count=100
        )

        for entry in pending:
            # Only claim tasks that have been idle for longer than idle_ms
            if entry['time_since_delivered'] > idle_ms:
                claimed = await redis_client.xclaim(
                    STREAM_KEY, CONSUMER_GROUP, CONSUMER_NAME,
                    min_idle_time=idle_ms,
                    message_ids=[entry['message_id']]
                )
                # Note: XCLAIM only transfers ownership. The messages in `claimed`
                # still have to be processed and ACKed, because XREADGROUP with '>'
                # does not redeliver entries that are already pending.
                if claimed:
                    claimed_count += len(claimed)
                    logger.info(
                        "orphaned_task_claimed",
                        message_id=entry['message_id'],
                        original_consumer=entry['consumer'],
                        idle_ms=entry['time_since_delivered'],
                    )

    except Exception as e:
        # An XCLAIM failure must not block Worker startup
        logger.warning("xclaim_failed", error=str(e))

    if claimed_count > 0:
        logger.info("orphaned_tasks_recovered", count=claimed_count)

    return claimed_count
```

**Deployment order relative to HPA**: XCLAIM must be merged into main before the Worker HPA is deployed.

---

### Conflict 3: Overlapping Alert Storms (Cross-Source Incident Explosion)

**Severity**: 🔴 P0 | **Type**: Data-flow conflict

**Code check**: the aggregation logic in `incident_service.py` matches on the `fingerprint` string. Sentry and Alertmanager use different fingerprint formats, so alerts from the two sources can never be aggregated together.

**Root-cause scenario**:

```
PostgreSQL goes down →
  Alertmanager: 1 PostgreSQLDown alert
  Sentry: 200 ConnectionRefused alerts (every API request)
  → 201 independent Incidents
  → 201 OpenClaw analyses (token explosion)
```

**Global disaster cooldown implementation**:

```python
# apps/api/src/services/incident_service.py (additions)
import uuid

GLOBAL_INCIDENT_DEBOUNCE_TTL = 300  # 5-minute global cooldown
P0_INFRASTRUCTURE_SERVICES = {
    "postgres", "postgresql", "redis", "k8s-api", "etcd"
}

async def _check_global_incident_storm(self, signal_data: dict) -> str | None:
    """
    Check whether a P0 infrastructure disaster is currently active.

    If so  → return the main Incident ID (correlate this event to it)
    If not → return None (create a new Incident as usual)
    """
    redis = get_redis()
    storm_key = "global:incident_storm:active"

    # Is this itself a P0 infrastructure alert? (handled first, never correlated)
    alert_name = signal_data.get("alert_name", "")
    service = signal_data.get("target", "")

    is_infra_p0 = any(svc in service.lower() for svc in P0_INFRASTRUCTURE_SERVICES)

    if is_infra_p0 and signal_data.get("severity") == "critical":
        # Raise the global storm flag. NX keeps the first storm ID stable;
        # a repeated P0 alert only extends the cooldown instead of replacing the ID.
        main_incident_id = f"storm-{uuid.uuid4().hex[:8]}"
        created = await redis.set(
            storm_key, main_incident_id, ex=GLOBAL_INCIDENT_DEBOUNCE_TTL, nx=True
        )
        if created:
            logger.warning("global_incident_storm_detected", main_id=main_incident_id)
        else:
            await redis.expire(storm_key, GLOBAL_INCIDENT_DEBOUNCE_TTL)
        return None  # the P0 alert itself creates an Incident normally

    # Non-P0 alert: check whether a storm is in progress
    active_storm_id = await redis.get(storm_key)
    if active_storm_id:
        # redis-py returns bytes unless the client uses decode_responses=True
        if isinstance(active_storm_id, bytes):
            active_storm_id = active_storm_id.decode()
        logger.info(
            "alert_correlated_to_storm",
            storm_id=active_storm_id,
            alert=alert_name,
        )
        return active_storm_id  # attach to the main Incident; no separate analysis

    return None
```
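
To make the 201 → 1 reduction concrete, the following is a minimal in-memory sketch of the same routing decision. It is not the service code: `FakeStormStore`, `route_signal`, and the sample service names are hypothetical stand-ins for Redis and `_check_global_incident_storm()`, and it assumes the P0 alert arrives before the dependent errors. In practice an Alertmanager `for:` delay may let some Sentry errors through first, and those would still open their own Incidents.

```python
import time

GLOBAL_INCIDENT_DEBOUNCE_TTL = 300
P0_INFRASTRUCTURE_SERVICES = {"postgres", "postgresql", "redis", "k8s-api", "etcd"}


class FakeStormStore:
    """In-memory stand-in for the Redis storm key (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._value = None
        self._expires_at = 0.0

    def set_if_absent(self, value, ttl_s):
        # Mirrors SET key value EX ttl NX: only the first writer wins.
        if self.get() is None:
            self._value = value
            self._expires_at = self._clock() + ttl_s
            return True
        return False

    def get(self):
        # Expire lazily, the way a TTL would.
        if self._value is not None and self._clock() >= self._expires_at:
            self._value = None
        return self._value


def route_signal(store, signal, storm_id):
    """Return the storm ID to attach to, or None to open a new Incident."""
    target = signal.get("target", "").lower()
    is_infra_p0 = any(svc in target for svc in P0_INFRASTRUCTURE_SERVICES)
    if is_infra_p0 and signal.get("severity") == "critical":
        store.set_if_absent(storm_id, GLOBAL_INCIDENT_DEBOUNCE_TTL)
        return None
    return store.get()


store = FakeStormStore()
signals = [{"target": "awoooi-postgres", "severity": "critical"}]
signals += [{"target": "awoooi-api", "severity": "error"}] * 200

results = [route_signal(store, s, "storm-demo0001") for s in signals]
new_incidents = sum(1 for r in results if r is None)
correlated = sum(1 for r in results if r is not None)
print(new_incidents, correlated)
```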

---

### Conflict 4: Auto-Repair Kills Stateful Services by Mistake

**Severity**: 🔴 P0 | **Type**: Architecture conflict

**Code check**: at `auto_repair_service.py:411`, `TIER_ACTIONS[1]` includes `restart_pod` and `restart_container`. The evaluation logic only checks `severity <= P2` and `RiskLevel <= MEDIUM`; **there is no service-type check of any kind**.

**Minimum viable fix (add a service blacklist)**:

```python
# auto_repair_service.py: new constant and guard

# 🚨 Services that must never be restarted automatically (stateful services)
STATEFUL_SERVICE_BLACKLIST = frozenset({
    "postgres", "postgresql", "awoooi-postgres",
    "redis", "awoooi-redis", "redis-stack",
    "clickhouse", "signoz-clickhouse",
    "elasticsearch", "etcd",
    "minio", "awoooi-minio",
})

async def evaluate_auto_repair(self, incident: Incident) -> AutoRepairDecision:
    # ... existing checks ...

    # 🆕 New: hard block for stateful services (must run before Playbook matching)
    affected_services = incident.affected_services or []
    for service in affected_services:
        if any(bl in service.lower() for bl in STATEFUL_SERVICE_BLACKLIST):
            logger.warning(
                "auto_repair_blocked_stateful_service",
                incident_id=incident.incident_id,
                service=service,
            )
            return AutoRepairDecision(
                can_auto_repair=False,
                reason=f"Service {service} is stateful; automatic restart is forbidden, the Commander must intervene manually",
                blocked_by="STATEFUL_SERVICE_GUARDRAIL",
            )

    # ... remaining existing logic ...
```
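
Because the guard uses substring matching, it errs on the side of blocking. The sketch below isolates that check in a hypothetical helper, `is_stateful()`, and runs it against a few illustrative service names (the names are assumptions, not taken from the registry). A name such as `predispatch-worker` is blocked only because it happens to contain `redis`, which is a safe failure mode but worth knowing before the blacklist ships.

```python
STATEFUL_SERVICE_BLACKLIST = frozenset({
    "postgres", "postgresql", "awoooi-postgres",
    "redis", "awoooi-redis", "redis-stack",
    "clickhouse", "signoz-clickhouse",
    "elasticsearch", "etcd",
    "minio", "awoooi-minio",
})


def is_stateful(service: str) -> bool:
    """Same substring test as the guard in evaluate_auto_repair()."""
    name = service.lower()
    return any(entry in name for entry in STATEFUL_SERVICE_BLACKLIST)


# Hypothetical service names, chosen to show hits, misses and one false positive.
checks = {
    "awoooi-postgres-0": is_stateful("awoooi-postgres-0"),        # real stateful Pod
    "awoooi-api": is_stateful("awoooi-api"),                      # stateless, allowed
    "awoooi-redis-exporter": is_stateful("awoooi-redis-exporter"),  # sidecar, also blocked
    "predispatch-worker": is_stateful("predispatch-worker"),      # contains "redis" by accident
}

for name, blocked in checks.items():
    print(f"{name}: {'blocked' if blocked else 'auto-repair allowed'}")
```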

---

### Conflict 5: Frontend "Merge Hell" (Timing Conflict)

**Severity**: 🟠 P1 | **Type**: Timing conflict

**Correct trunk-locking strategy**:

```
Frontend ownership plan (git flow):

Week 1: Feature Freeze
  main branch locked (no frontend PR unrelated to i18n may be merged)

Week 1: i18n zero-out PR (the only frontend PR allowed)
  branch: fix/i18n-zero-violation
  → fix all 40+ violations in one pass
  → install eslint-plugin-i18next at the same time (warn mode first)
  → merge to main

After Week 1: Feature Unfreeze
  → start the Storybook PR
  → start the Omni-Terminal SSE Event Sourcing PR
  → switch ESLint to error mode
```

---

### Conflict 6: Playwright HTTPS Certificates and Network Blind Spots

**Severity**: 🟠 P1 | **Type**: Infrastructure conflict

**Code check**: the current settings in `playwright.config.ts` still need to be verified.

```typescript
// playwright.config.ts: required changes
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    baseURL: process.env.BASE_URL || 'http://192.168.0.120:32335',
    ignoreHTTPSErrors: true,  // 🆕 self-signed certificates must be ignored
    viewport: { width: 1280, height: 720 },
    deviceScaleFactor: 1,     // avoids Retina rendering differences
  },
  expect: {
    toHaveScreenshot: {
      threshold: 0.05,
      maxDiffPixelRatio: 0.05,
    },
  },
});
```

---

## Chapter 2: 6 Deeper Fatal Risks That Were Missed

The chief architect's 6 conflicts are correct, but the following 6 deeper risks exist in the system as well:

### Deep Risk A: The .188 Node Is a Single Point of Failure (SPOF): The System's Brain Loses Its Memory

**Location**: 192.168.0.188 (AI + Web center)
**Impact**: if .188 goes down, Ollama + OpenClaw + Redis + PostgreSQL + SigNoz **all fail at the same time**

This is the most fatal single point in the whole system:

```
Cascading collapse when .188 goes down:
  Redis fails → Signal Worker cannot consume → alerts pile up
  PostgreSQL fails → the K3s control plane loses its datastore → K3s may crash
  OpenClaw fails → all AI analysis stops → Sentry/Alertmanager webhooks queue up
  SigNoz fails → observability blind spot
      ↓
  K3s crashes → every AWOOOI API/Web Pod dies
      ↓
  No AWOOOI → no alerts received → the Commander cannot act → total loss of contact
```

**Mitigation (not a cure)**:

```yaml
# OpenClaw and the Worker must have a degraded mode for when .188 is down.
# Minimum bar: a Telegram Bot sends a ".188 suspected down" alert directly
# (bypassing the AWOOOI API and calling the Telegram API with curl).

# k8s/monitoring/alert-rules.yaml (addition)
- alert: AIWebCenterDown
  expr: probe_success{job="blackbox", target="http://192.168.0.188:8089/health"} == 0
  for: 2m
  annotations:
    summary: ".188 AI center unreachable; system entering degraded mode"
    runbook: "docs/runbooks/RUNBOOK-188-FAILOVER.md"
```

---

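### Deep Risk B: Circular Dependency in Observability (The Observer's Blind Spot)

**This is the most ironic problem in the architecture**:

```
When the AWOOOI API itself crashes:
  Alertmanager tries to send a webhook to AWOOOI → AWOOOI is down, webhook fails
  Sentry tries to send a webhook to AWOOOI → same
  Telegram notifications go through AWOOOI → same

The last mile of the alert chain (the Telegram notification) depends on the very thing being monitored!
```

**Existing protection** (already covered by ADR-035):

```yaml
# Alertmanager has a direct Telegram notification path (bypassing AWOOOI).
# Still to confirm: does alertmanager.yml actually define a backup receiver?
receivers:
  - name: 'openclaw-api'      # primary path (through AWOOOI)
    ...
  - name: 'direct-telegram'   # backup path (straight to Telegram)
    # Note: a generic webhook POST carries Alertmanager's own JSON body, which
    # sendMessage is unlikely to accept as-is. Alertmanager 0.24+ also offers a
    # native telegram_configs receiver; verify which one is actually deployed.
    webhook_configs:
      - url: 'https://api.telegram.org/bot{TOKEN}/sendMessage'
```

**To be verified**: whether the three-layer protection in ADR-035 really covers the "AWOOOI API itself is down" scenario.

---
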
### Deep Risk C: OpenClaw Calls Have No Circuit Breaker (Backend AI Failure Propagates)

**Code check**: `call_openclaw_analyzer()` at `sentry_webhook.py:289` only uses `httpx.AsyncClient(timeout=60.0)`; **there is no Circuit Breaker**.

**Scenario**: OpenClaw is under heavy load (GPU overheating, memory pressure), so every Sentry/Alertmanager call waits the full 60 seconds before timing out. FastAPI background tasks pile up until the API Pod runs out of memory and is OOM-killed.

**Minimum viable fix**:

```python
# apps/api/src/core/circuit_breaker.py (new file)
import time
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"        # normal
    OPEN = "open"            # tripped (fail fast)
    HALF_OPEN = "half_open"  # probing for recovery


class SimpleCircuitBreaker:
    """
    Simple Circuit Breaker (does not depend on NVIDIA's implementation).

    State machine:
      CLOSED → OPEN (5 consecutive failures)
      OPEN → HALF_OPEN (after a 60-second cooldown)
      HALF_OPEN → CLOSED (1 success)
      HALF_OPEN → OPEN (1 failure)
    """

    def __init__(self, failure_threshold=5, timeout_s=60):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.threshold = failure_threshold
        self.timeout_s = timeout_s
        self._opened_at: float | None = None

    def is_open(self) -> bool:
        if self.state == CircuitState.OPEN:
            # monotonic() is immune to wall-clock adjustments
            if time.monotonic() - self._opened_at > self.timeout_s:
                self.state = CircuitState.HALF_OPEN
                return False
            return True
        return False

    def record_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def record_failure(self):
        # failure_count is not reset on OPEN, so a single failure while
        # HALF_OPEN re-opens the breaker immediately.
        self.failure_count += 1
        if self.failure_count >= self.threshold:
            self.state = CircuitState.OPEN
            self._opened_at = time.monotonic()


# Global OpenClaw Circuit Breaker
_openclaw_cb = SimpleCircuitBreaker(failure_threshold=5, timeout_s=60)


def get_openclaw_circuit_breaker() -> SimpleCircuitBreaker:
    return _openclaw_cb
```

```python
# sentry_webhook.py: modified call_openclaw_analyzer()
async def call_openclaw_analyzer(error_context: dict) -> ErrorAnalysisResult | None:
    cb = get_openclaw_circuit_breaker()

    # Circuit breaking: fail immediately instead of waiting
    if cb.is_open():
        logger.warning("openclaw_circuit_open_skip_analysis")
        return None

    try:
        async with httpx.AsyncClient(timeout=60.0) as client:
            response = await client.post(...)
            if response.status_code == 200:
                cb.record_success()
                return ErrorAnalysisResult(**response.json())
            else:
                cb.record_failure()
                return None
    except Exception as e:
        cb.record_failure()
        logger.exception("openclaw_call_failed", error=str(e))
        return None
```
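
The state machine in the docstring can be walked through without waiting for real cooldowns. The sketch below is a compact re-statement of the same logic with an injectable clock; `Breaker` and `FakeClock` are hypothetical names used only for this illustration, not part of the proposed module.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class FakeClock:
    """Hypothetical manual clock so the cooldown can be advanced instantly."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


class Breaker:
    """Same transitions as SimpleCircuitBreaker, with an injectable clock."""

    def __init__(self, threshold=5, cooldown_s=60, clock=time.monotonic):
        self.state = State.CLOSED
        self.failures = 0
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.opened_at = None

    def is_open(self):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at > self.cooldown_s:
                self.state = State.HALF_OPEN
                return False
            return True
        return False

    def record_success(self):
        self.failures = 0
        self.state = State.CLOSED

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()


clock = FakeClock()
cb = Breaker(clock=clock)
trace = []

for _ in range(5):            # five consecutive failures trip the breaker
    cb.record_failure()
trace.append(cb.state.value)

clock.now += 61               # cooldown elapses; next check allows one probe
cb.is_open()
trace.append(cb.state.value)

cb.record_failure()           # the probe fails, so the breaker re-opens
trace.append(cb.state.value)

clock.now += 61               # second cooldown, then a successful probe
cb.is_open()
cb.record_success()
trace.append(cb.state.value)

print(trace)
```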
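
---

### Deep Risk D: Database Migration Conflicts During a K8s Rolling Update

**Scenario**: CD performs a rolling update, so old and new API Pods run side by side (Kubernetes rolling strategy). If the new version ships an `ALTER TABLE` migration, whichever side sees a schema it does not expect will fail: new Pods if the migration has not run yet, old Pods if the migration has already changed or removed a column they use.

**Still to confirm**: the Alembic migration strategy.

**Safeguard (the existing CD must be checked)**:

```yaml
# .github/workflows/cd.yaml: check whether a migration step exists
# If not, one must be added:
- name: Run DB Migration
  run: |
    # Make sure the selected Pod runs the new image; an old Pod does not
    # contain the new migration scripts.
    kubectl exec -n awoooi-prod \
      $(kubectl get pod -n awoooi-prod -l app=awoooi-api -o name | head -1) \
      -- python -m alembic upgrade head
  # Migrations must be backward compatible (add columns only, never drop them)
```

---
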
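### Deep Risk E: Silent Data Loss Under Redis Memory Pressure

AnomalyCounter's sliding window is built on Redis Sorted Sets / Counters. If Redis runs short of memory and `maxmemory-policy` kicks in, these counters can be evicted silently.

**Must be checked immediately**:

```bash
ssh root@192.168.0.188 'docker exec awoooi-redis redis-cli CONFIG GET maxmemory-policy'
# If this is allkeys-lru or allkeys-lfu, AnomalyCounter's counters will be evicted!
# Acceptable settings: volatile-ttl or noeviction (paired with a memory alert)
```

**Fix**: AnomalyCounter's Redis keys already carry a TTL. That means `volatile-ttl` protects keys without a TTL but does not protect the counters themselves: it evicts only keys that have a TTL, starting with the shortest one. The counters are safe only if Redis has enough memory, or if eviction is disabled and memory is monitored actively:

```bash
# Set volatile-ttl (evicts only keys that have a TTL, shortest TTL first)
docker exec awoooi-redis redis-cli CONFIG SET maxmemory-policy volatile-ttl
# AnomalyCounter's counters have a TTL → they can still be evicted
# Remedy: raise the Redis maxmemory setting, or switch to noeviction + active monitoring
```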
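
The policy discussion above reduces to a small decision table. The sketch below encodes it as a hypothetical helper, `counter_eviction_risk()`, using the eight `maxmemory-policy` values Redis documents; the function name and the wording of its results are illustrative assumptions, not part of the codebase.

```python
ALLKEYS_POLICIES = {"allkeys-lru", "allkeys-lfu", "allkeys-random"}
VOLATILE_POLICIES = {"volatile-lru", "volatile-lfu", "volatile-random", "volatile-ttl"}


def counter_eviction_risk(policy: str, key_has_ttl: bool) -> str:
    """Classify whether a key can be silently evicted under `policy`."""
    if policy == "noeviction":
        # Redis rejects writes once maxmemory is reached instead of evicting.
        return "safe: writes fail instead of evicting"
    if policy in ALLKEYS_POLICIES:
        return "at risk: any key may be evicted"
    if policy in VOLATILE_POLICIES:
        if key_has_ttl:
            return "at risk: keys with a TTL are eviction candidates"
        return "safe: only keys with a TTL are evicted"
    raise ValueError(f"unknown maxmemory-policy: {policy}")


# AnomalyCounter keys carry a TTL, so key_has_ttl=True is the relevant column.
for policy in ("allkeys-lru", "volatile-ttl", "noeviction"):
    print(policy, "->", counter_eviction_risk(policy, key_has_ttl=True))
```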

---

### Deep Risk F: GitHub Actions Runner Security Isolation

The self-hosted runner on `.110` runs `kubectl patch secret` on every CD run, which means:
- The runner must hold admin rights on the K8s cluster
- Anyone who can merge a PR into main can trigger a Job that runs with K8s admin rights

**Minimum protection**:

```yaml
# .github/workflows/cd.yaml
# Every Job that runs kubectl must be placed behind environment protection
jobs:
  deploy:
    environment: production  # ← GitHub Environment review must be configured
```

---

## Chapter 3: The Final Safe Execution Order (Waves 0-12)

After combining the analysis of all 6+6 conflicts, the following is the **only correct execution order**:

```
🛡️ Wave 0: Immediate triage (same day, no deployment, configuration only)
  0.1 Check Redis maxmemory-policy (5 min)
  0.2 Check Redis appendonly (5 min)
  0.3 Check the Alertmanager backup Telegram path (10 min)

🔴 Wave 1: Foundational safety net (Week 1, must run serially)
  In order:
  1.1 Build the XCLAIM mechanism (2h)
  1.2 Build the StatefulSet Guardrail (1h)
  1.3 Build the OpenClaw Circuit Breaker (2h)
  1.4 Build the Global Incident Debounce (2h)
  1.5 Combine the four items above into a single PR; merge after testing (1h)

  → This PR must never be split! The four fixes depend on one another.

🔴 Wave 2: Worker upgrade (after Wave 1)
  2.1 Worker terminationGracePeriodSeconds 90s (30 min)
  2.2 Worker stop() timeout 75s (30 min)
  2.3 Deploy the Worker HPA (30 min)

🟠 Wave 3: Frontend trunk lock (starts together with Wave 1, on a separate branch)
  3.1 Announce the Frontend Feature Freeze
  3.2 i18n blitz zero-out (4h)
  3.3 Install eslint-plugin-i18next (Warn mode) (1h)
  3.4 Merge the i18n PR → lift the Frontend Freeze
  3.5 Switch ESLint to Error mode

🟠 Wave 4: CI infrastructure (after Wave 3)
  4.1 playwright.config.ts (ignoreHTTPSErrors + threshold)
  4.2 Create the initial Docker Visual Baseline
  4.3 E2E Weekly Schedule YAML (Warn-Only)
  4.4 CD validate_coverage.py (Warn-Only)

🟠 Wave 5: Complete the alerting backend (after Wave 1)
  5.1 Configure Sentry SENTRY_AUTH_TOKEN (Phase D)
  5.2 Deploy SigNoz alert rules to .188 (Phase E)

🟡 Wave 6: Observability consolidation (after Wave 5)
  6.1 Prometheus Federation (.110 → .188)
  6.2 Build the AI Autonomy Index metrics
  6.3 Evaluate and enable Redis AOF + Sentinel

🟡 Wave 7: Frontend capability expansion (after Wave 4)
  7.1 Storybook for the 10 core components
  7.2 Omni-Terminal SSE Event Sourcing
  7.3 Monitoring GenUI cards (7 cards)
  7.4 Nexus AI autonomy-rate UI

⚪ Wave 8: Root-cause fix for DB HA
  8.1 CloudNativePG evaluation report
  8.2 Execute after the decision (Patroni / CloudNativePG / status quo + backups)

⚪ Wave 9: Business metrics layer
  9.1 FinOps Dashboard API + UI
  9.2 SLO / MTTR API endpoints

⚪ Wave 10: Security ownership
  10.1 Kali → MCP Tool → SecurityAgent
  10.2 SBOM generation integration

⚪ Wave 11: Switch CI to hard blocking
  11.1 Visual Regression CI: warn → block
  11.2 Coverage validation: warn → block
  11.3 ESLint: confirm it is already in error mode

⚪ Wave 12: Phase 4 visual soul injection
  12.1 Brand 3D assets + chibi-style OpenClaw
  12.2 Site-wide micro-animation upgrade
```

---

## Chapter 4: Mandatory Pre-Execution Checklist

Before any Wave 1 work begins, the following checks must be completed:

```bash
#!/bin/bash
# ops/scripts/pre-execution-checklist.sh

echo "=== AWOOOI mandatory pre-execution checklist ==="

# 1. Redis AOF
APPENDONLY=$(docker exec awoooi-redis redis-cli CONFIG GET appendonly | tail -1)
echo "Redis AOF: $APPENDONLY"  # must be yes

# 2. Redis maxmemory-policy
POLICY=$(docker exec awoooi-redis redis-cli CONFIG GET maxmemory-policy | tail -1)
echo "Redis eviction policy: $POLICY"  # must not be allkeys-lru

# 3. K3s cluster state (nodes are cluster-scoped, so no namespace flag)
kubectl get nodes
kubectl get pod -n awoooi-prod

# 4. Alertmanager backup Telegram path
# Note: Alertmanager 0.27+ removed API v1; use /api/v2/receivers there.
curl -s http://192.168.0.120:30093/api/v1/receivers | python3 -m json.tool | grep name

# 5. Network route .110 → .120 (needed by Playwright E2E)
ping -c 3 192.168.0.120

echo "=== Checks complete; execution may begin ==="
```

---

## Appendix: Open Questions Still Unresolved (Commander's Decision Required)

| Question | Option A | Option B | Impact |
|------|--------|--------|------|
| PostgreSQL HA | CloudNativePG (K8s native) | Patroni + keepalived (VM layer) | Major Q2 decision |
| Redis HA level | Sentinel (active failover) | AOF + manual recovery (conservative) | Monthly decision |
| .188 standby node | Purchase a second AI host | Cloud GPU hot standby | Quarterly budget |
| GitHub Runner security isolation | GitHub Environments review | Split CI (read-only) from CD (needs K8s admin) | Security policy |

---

## Chapter 5: Four Final Deep-Water Zones (Verified Against the Code)

### 5.1 Redis Crash → AnomalyCounter Chain Failure → Alerts Permanently Lost

**Code check**: `anomaly_counter.py:147`

```python
# Current state (no try/except at all):
await self.redis.zadd(timeline_key, {str(timestamp): timestamp})  # ← Redis down = throws immediately
await self.redis.zremrangebyscore(...)
await self.redis.zcount(...)  # ← all of these blow up
# → caller sentry_webhook.py → the whole background task fails → the alert is lost!
```

**Fix: a defensive wrapper for graceful degradation**

```python
# anomaly_counter.py: modified record_anomaly()

async def record_anomaly(self, anomaly_signature: dict) -> AnomalyFrequency:
    """Record an anomaly; degrade gracefully (never raise) when Redis fails."""
    try:
        return await self._record_anomaly_impl(anomaly_signature)
    except Exception as e:
        # Redis connection failed → degrade: return a minimal frequency object
        # so the main flow can continue
        logger.warning(
            "anomaly_counter_redis_degraded",
            error=str(e),
            reason="Returning default frequency to allow alert chain to continue"
        )
        # Do not raise! The alert chain must keep going.
        return AnomalyFrequency(
            anomaly_key=self.hash_signature(anomaly_signature),
            count_1h=1, count_24h=1, count_7d=1, count_30d=1,
            first_seen=datetime.now(), last_seen=datetime.now(),
            auto_repair_count=0, permanent_fix_applied=False,
            escalation_level=None,  # escalation cannot be judged; stay conservative
        )

async def _record_anomaly_impl(self, anomaly_signature: dict) -> AnomalyFrequency:
    """Original implementation (extracted from record_anomaly)."""
    # ... all of the original Redis operations ...
```

**Principle**: failing to remember must never cause failing to notify. Redis is an auxiliary system here, not part of the critical path.

---

### 5.2 Ungraceful Worker Crash → Orphaned PEL Tasks Stuck Forever

**Code check**: nowhere in `signal_worker.py` is there a `reclaim_loop` or a periodic XPENDING scan.

The `_claim_orphaned_tasks()` proposed in Conflict 2 runs only once, inside `start()`, so it cannot handle a **Pod that crashes while the others keep running**:

```
Scenario: 2 Workers running steadily
  Worker A is mid-task → Segfault / OOM Kill (ungraceful shutdown)
  Worker B is running → start() never fires again → XCLAIM never runs
  The orphaned task stays in the PEL → rescued only when HPA next starts a new Pod
  Possible wait of 600+ seconds (HPA stabilizationWindowSeconds)
```

**Fix: Active Sweeper Loop (runs in parallel with the heartbeat loop)**

```python
# signal_worker.py: new _reclaim_loop()

async def start(self) -> None:
    if self._running:
        return
    await self._ensure_consumer_group()
    await self._claim_orphaned_tasks()  # once at startup
    self._running = True
    self._task = asyncio.create_task(self._consume_loop())
    self._reclaim_task = asyncio.create_task(self._reclaim_loop())  # 🆕 continuous scan

async def _reclaim_loop(self, interval_s: int = 300) -> None:
    """
    Active Sweeper: every 5 minutes, scan the PEL and take over orphaned
    tasks that have been idle for more than 5 minutes.
    Runs in parallel with _consume_loop and does not block normal consumption.
    """
    while self._running:
        try:
            await asyncio.sleep(interval_s)
            if not self._running:
                break
            claimed = await self._claim_orphaned_tasks(idle_ms=300_000)  # 5 minutes
            if claimed > 0:
                logger.info("active_sweeper_claimed", count=claimed)
        except asyncio.CancelledError:
            break
        except Exception as e:
            logger.warning("active_sweeper_error", error=str(e))

async def stop(self) -> None:
    self._running = False
    # Cancel the reclaim loop as well
    if hasattr(self, '_reclaim_task') and self._reclaim_task:
        self._reclaim_task.cancel()
    if self._task:
        try:
            await asyncio.wait_for(self._task, timeout=75.0)  # corrected value
        # asyncio.TimeoutError is only an alias of TimeoutError from Python 3.11 on
        except (asyncio.TimeoutError, asyncio.CancelledError):
            pass
```
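
The timing values scattered across this document have to nest correctly: the longest task must fit inside the `stop()` timeout, which must fit inside the Pod's grace period, and the reclaim idle threshold must exceed the longest task so that a live Worker's task is not taken over. The sketch below checks that ordering with a hypothetical helper, `shutdown_budget_violations()`. The "current" grace period of 30 s is an assumption (the Kubernetes default), since the document does not state the deployed value.

```python
def shutdown_budget_violations(max_task_s, stop_timeout_s, grace_period_s, reclaim_idle_ms):
    """Return a list of timing rules the given configuration breaks."""
    problems = []
    if stop_timeout_s <= max_task_s:
        problems.append("stop() timeout does not cover the longest task")
    if grace_period_s <= stop_timeout_s:
        problems.append("terminationGracePeriodSeconds does not cover stop()")
    if reclaim_idle_ms <= max_task_s * 1000:
        problems.append("reclaim idle threshold can take over a task that is still running")
    return problems


# Current state: 5 s stop() timeout (Chapter 0), assumed 30 s grace period,
# and the 60 000 ms default idle_ms used by the startup claim.
current = shutdown_budget_violations(
    max_task_s=60, stop_timeout_s=5, grace_period_s=30, reclaim_idle_ms=60_000
)

# Target state: Wave 2 values plus the 300 000 ms sweeper threshold.
target = shutdown_budget_violations(
    max_task_s=60, stop_timeout_s=75, grace_period_s=90, reclaim_idle_ms=300_000
)

print("current:", current)
print("target:", target)
```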

---

### 5.3 SSE Event Store: A Redis Memory Bomb

**Code check**: `terminal.py:114` uses the SSE Publisher/Subscribe pattern (`publisher.subscribe(topics)`), **not a Redis List**.

This is an **important correction to the assumed state of the architecture**:

- The existing terminal.py uses `src.core.sse.SSEPublisher` as its event distribution mechanism
- **Event Sourcing (Redis RPUSH) has not been implemented yet**; it is a future feature
- The Redis memory-bomb risk therefore applies **when that feature is built**, and must be prevented at design time

**Mandatory specification for the future Event Sourcing implementation**:

```python
# terminal.py: future stream_with_persistence()
import json

MAX_PAYLOAD_BYTES = 50 * 1024   # 50KB cap (tool_result is truncated beyond this)
MAX_EVENTS_PER_SESSION = 50     # at most 50 events per session (LTRIM)
SESSION_TTL_SECONDS = 3600      # 1-hour TTL

async def stream_with_persistence(command_id: str, event_type: str, data: dict):
    redis = get_redis()
    key = f"terminal:events:{command_id}"

    # 🚨 Required payload protection
    # (json.dumps escapes non-ASCII by default, so len() equals the byte size)
    payload_json = json.dumps(data)
    if len(payload_json) > MAX_PAYLOAD_BYTES:
        payload_json = json.dumps({
            "truncated": True,
            "original_size": len(payload_json),
            "preview": payload_json[:1024],
            "message": f"Payload of {len(payload_json)//1024}KB is too large and was truncated"
        })

    event = {"type": event_type, "data": json.loads(payload_json)}
    await redis.rpush(key, json.dumps(event))
    await redis.ltrim(key, -MAX_EVENTS_PER_SESSION, -1)  # keep only the last 50
    await redis.expire(key, SESSION_TTL_SECONDS)
```
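
With both caps in place, the worst case per session is bounded at roughly 50 events × 50 KiB, ignoring the small JSON envelope around each event. The sketch below works through that arithmetic and mimics `RPUSH` followed by `LTRIM key -50 -1` on a plain Python list; `push_capped()` is a hypothetical helper for illustration only.

```python
MAX_PAYLOAD_BYTES = 50 * 1024
MAX_EVENTS_PER_SESSION = 50


def push_capped(events, event, cap=MAX_EVENTS_PER_SESSION):
    """List equivalent of RPUSH followed by LTRIM key -cap -1."""
    events.append(event)
    del events[:-cap]  # drop everything except the newest `cap` items
    return events


events = []
for i in range(120):  # push far more events than the cap allows
    push_capped(events, i)

# Upper bound on stored payload bytes for one session.
worst_case_bytes = MAX_PAYLOAD_BYTES * MAX_EVENTS_PER_SESSION

print(len(events), events[0], events[-1])
print(f"{worst_case_bytes} bytes ~= {worst_case_bytes / (1024 * 1024):.2f} MiB per session")
```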

---

### 5.4 Frontend Feature Freeze → Hotfix Deadlock

**Code check**: there is no `release/` branch strategy; main is the only long-lived branch.

**Fix: a three-branch Git Flow strategy (must be in place before the Freeze starts)**

```bash
# Before the i18n zero-out begins in Week 1, run:

# Step 1: create a stable release baseline from the current main
git checkout main
git pull origin main
git checkout -b release/v1.x
git push origin release/v1.x

# Step 2: mark release/v1.x as a Protected Branch on GitHub
# → only Hotfix PRs may be merged into it

# Step 3: start the i18n zero-out (on main/develop)
git checkout main
git checkout -b fix/i18n-zero-violation
# ... perform the i18n zero-out ...
git push origin fix/i18n-zero-violation
# → merge the PR into main
```

**Emergency Hotfix flow** (when production blows up during the Freeze):

```bash
# Branch the hotfix from the release branch
git checkout release/v1.x
git checkout -b hotfix/critical-approval-button-fix
# ... minimal fix ...
git push origin hotfix/critical-approval-button-fix

# Merge the PR into release/v1.x → deploy immediately
# Then cherry-pick onto main (where the i18n refactor is in progress)
git checkout main
git cherry-pick <hotfix-commit-hash>
```

**Hotfix trigger conditions** (to be written into HARD_RULES.md):
- The Commander cannot use a core feature (approval button broken, login unusable)
- P0-level Sentry errors exceed 10 per minute
- Service availability < 99%

---

## Chapter 6: Final Wave 1 Implementation List (Ready for Immediate Authorization)

After a full check against the code, the Wave 1 fixes (the four from Chapter 3 plus the AnomalyCounter degradation from 5.1) touch the following exact locations:

| Fix | File(s) to change | Location | Estimated effort |
|---------|---------|------|---------|
| XCLAIM + Active Sweeper | `signal_worker.py` | `start()`, `stop()`, new `_reclaim_loop()`, `_claim_orphaned_tasks()` | 2h |
| StatefulSet Guardrail | `auto_repair_service.py` | service blacklist added at the top of `evaluate_auto_repair()` | 1h |
| AnomalyCounter Redis degradation | `anomaly_counter.py` | wrap `record_anomaly()` in try/except with a degraded return value | 1h |
| OpenClaw Circuit Breaker | `core/circuit_breaker.py` (new) → `sentry_webhook.py`, `signoz_webhook.py` | wrap `call_openclaw_analyzer()` in circuit-breaker protection | 2h |
| Global Incident Debounce | `services/incident_service.py` | global cooldown check added before `process_signal()` | 1.5h |

**Wave 1 execution conditions**:
1. Wave 0.1-0.3 manual checks completed (Redis AOF/eviction, Alertmanager backup path)
2. Git Flow: the `release/v1.x` stable branch exists (prevents a Hotfix deadlock during the Freeze)
3. All changes bundled into a single PR (atomic deployment; must not be split)

*"A real architect does not design a perfect system, but one that degrades gracefully under any extreme condition."* 🦞