docs: ADR-037 + monitoring architecture proposals + Runbooks
- ADR-037 monitoring enhancement architecture
- MONITORING_MASTER_PLAN master plan
- MASTER_EXECUTION_SCHEDULE execution schedule
- Phase D/E/Worker HPA Runbooks

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
199
docs/adr/ADR-037-monitoring-enhancement-architecture.md
Normal file
@@ -0,0 +1,199 @@
# ADR-037: Monitoring Enhancement Architecture - Anomaly Frequency Statistics and Root-Cause Repair

> **Status**: ✅ Wave A+B complete
> **Date**: 2026-03-29
> **Decision maker**: Commander (approved)
> **Reviewer**: Chief Architect (Wave A: 91/100, Wave B: 92/100 OUTSTANDING)
> **Estimated effort**: 17h (7 phases)
> **Integrated plan**: [MONITORING_MASTER_PLAN.md](../proposals/MONITORING_MASTER_PLAN.md) (Wave A-D / 10.75h)

---

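## Background

On 2026-03-29 the Commander gave explicit instructions:
> "Restarting only treats the symptom, not the cause! Anomalies that recur too often must be resolved thoroughly."
> "We need statistics and counts! Users must be told!!"
> "Anomalies that recur too often must have an automatic repair mechanism that resolves them thoroughly!!!"

### Current problems

| Problem | Impact | Root cause |
|------|------|------|
| Recurring anomalies are not counted | The same problem alerts over and over | No frequency tracking |
| Repair relies on restarts only | Treats symptoms, not causes | No tiered repair strategy |
| No database monitoring | PostgreSQL/Redis are blind spots | Missing exporters |
| No Sentry write-back | Repair results are cut off from the issue | Missing Comment API integration |
| No learning mechanism | Cannot improve automatically | Missing success-rate statistics |

---
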
## Decision

### 1. AnomalyCounter service (Phase A)

**Technology choice**: Redis Sorted Set + sliding window

```python
import hashlib


# Generate the hash for an anomaly signature
def generate_anomaly_key(signature: dict) -> str:
    """Generate a unique key for an anomaly signature."""
    key_parts = [
        signature.get("alert_name", ""),
        signature.get("service", ""),
        signature.get("namespace", ""),
        signature.get("error_type", ""),
    ]
    # First 16 hex characters of the SHA-256 over the joined fields
    return hashlib.sha256("|".join(key_parts).encode()).hexdigest()[:16]
```

**Statistics dimensions**:
- `count_1h`: occurrences within the last hour
- `count_24h`: occurrences within the last 24 hours
- `count_7d`: occurrences within the last 7 days
- `count_30d`: occurrences within the last 30 days

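The sliding-window counting behind these four dimensions can be sketched without Redis. The block below is a minimal in-memory stand-in, not the project's implementation: `InMemorySlidingCounter` and `WINDOWS_S` are hypothetical names, and a sorted Python list plays the role of the Sorted Set (insert, trim, count map loosely to ZADD, ZREMRANGEBYSCORE, ZCOUNT).

```python
import bisect

# Window lengths in seconds for the four statistics dimensions
WINDOWS_S = {
    "count_1h": 3600,
    "count_24h": 86400,
    "count_7d": 7 * 86400,
    "count_30d": 30 * 86400,
}


class InMemorySlidingCounter:
    """Hypothetical in-memory analog of one anomaly's timeline Sorted Set."""

    def __init__(self) -> None:
        self._timestamps: list[float] = []  # always kept sorted

    def record(self, now: float) -> dict[str, int]:
        # Analog of ZADD: insert the timestamp in sorted position
        bisect.insort(self._timestamps, now)
        # Analog of ZREMRANGEBYSCORE: drop entries older than the largest window
        cutoff = now - max(WINDOWS_S.values())
        del self._timestamps[: bisect.bisect_left(self._timestamps, cutoff)]
        return self.counts(now)

    def counts(self, now: float) -> dict[str, int]:
        # Analog of ZCOUNT: entries with now - window <= ts <= now
        upper = bisect.bisect_right(self._timestamps, now)
        return {
            name: upper - bisect.bisect_left(self._timestamps, now - window)
            for name, window in WINDOWS_S.items()
        }
```

Trimming on every write bounds the timeline to 30 days of entries, which is the same idea the risk table expresses as "30-day TTL + ZREMRANGEBYSCORE".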
**Escalation thresholds**:
| Level | Threshold (24h) | Action |
|------|-----------|------|
| NORMAL | < 3 | Handle normally |
| REPEAT | ≥ 3 | Mark as recurring, notify the user |
| ESCALATE | ≥ 5 | Escalate the tier, notify the owner |
| PERMANENT_FIX | ≥ 10 | Force a root-cause fix |

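As a sketch of how the threshold table could be applied, the function below maps a 24-hour count to a level. The function name is hypothetical, and returning the string `"NORMAL"` for the lowest level is an assumption (the ADR's data model may use `None` there instead).

```python
def escalation_level(count_24h: int) -> str:
    """Map a 24-hour occurrence count to a level from the threshold table."""
    # Check the highest threshold first so each count gets exactly one level
    if count_24h >= 10:
        return "PERMANENT_FIX"
    if count_24h >= 5:
        return "ESCALATE"
    if count_24h >= 3:
        return "REPEAT"
    return "NORMAL"
```

Checking from the top down keeps the ranges non-overlapping: a count of 7 is ESCALATE only, never also REPEAT.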
### 2. Tiered repair strategy (Phase A)

| Tier | Name | Repair actions | When it applies |
|------|------|---------|---------|
| 1 | Restart | restart_pod, restart_container | First occurrence |
| 2 | Mitigate | scale_up, flush_cache, kill_queries | 2 repeats |
| 3 | Root cause | apply_fix, update_config, run_migration | ≥ 5 times in 24h |
| 4 | Architecture | failover, switch_backend, permanent_fix | ≥ 10 times in 24h |

**Repair attempt record**:
```python
# Redis Hash: anomaly:{key}:repair_stats
{
    "restart_pod": {"total": 3, "success": 1, "rate": 0.33},
    "scale_up": {"total": 1, "success": 1, "rate": 1.0},
}
```

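The `rate` field above is simply `success / total`. A minimal sketch of keeping those numbers consistent as attempts are recorded, using a hypothetical `RepairStats` dataclass and an in-memory dict in place of the Redis Hash:

```python
from dataclasses import dataclass


@dataclass
class RepairStats:
    """Hypothetical per-action record mirroring the hash fields above."""

    total: int = 0
    success: int = 0

    @property
    def rate(self) -> float:
        # Derived on read, so it can never drift from total/success
        return self.success / self.total if self.total else 0.0


def record_attempt(stats: dict[str, RepairStats], action: str, succeeded: bool) -> RepairStats:
    """Count one repair attempt for `action` and return its updated stats."""
    entry = stats.setdefault(action, RepairStats())
    entry.total += 1
    entry.success += int(succeeded)
    return entry
```

Three `restart_pod` attempts with one success give a rate of about 0.33, matching the example record.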
### 3. Database Exporters (Phase B)

| Service | Exporter | Port | Key metrics |
|------|----------|------|---------|
| PostgreSQL | postgres_exporter | 9187 | Connection pool, slow queries, lock waits, table bloat |
| Redis | redis_exporter | 9121 | Memory, hit rate, eviction rate, latency |

**Deployment location**: 192.168.0.188 (Docker Compose)

### 4. Incident frequency fields (Phase C)

```python
from pydantic import BaseModel


class IncidentFrequencyStats(BaseModel):
    anomaly_key: str
    count_1h: int = 0
    count_24h: int = 0
    count_7d: int = 0
    count_30d: int = 0
    escalation_level: str | None = None  # REPEAT, ESCALATE, PERMANENT_FIX
```

**Aggregation window**: the same problem recurring within 10 minutes does not create a new Incident

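The aggregation rule can be illustrated with a small in-memory sketch. `IncidentAggregator` and the `INC-` ID format are hypothetical, and measuring the 10 minutes from when the Incident was opened (rather than from the last occurrence) is an assumption the ADR does not settle.

```python
from datetime import datetime, timedelta

AGGREGATION_WINDOW = timedelta(minutes=10)


class IncidentAggregator:
    """Hypothetical in-memory aggregator keyed by anomaly_key."""

    def __init__(self) -> None:
        # anomaly_key -> (incident_id, time the incident was opened)
        self._open: dict[str, tuple[str, datetime]] = {}
        self._sequence = 0

    def incident_for(self, anomaly_key: str, now: datetime) -> tuple[str, bool]:
        """Return (incident_id, created_new) for an occurrence at `now`."""
        current = self._open.get(anomaly_key)
        if current is not None and now - current[1] <= AGGREGATION_WINDOW:
            return current[0], False  # inside the window: reuse the Incident
        self._sequence += 1
        incident_id = f"INC-{self._sequence:04d}"
        self._open[anomaly_key] = (incident_id, now)
        return incident_id, True
```

With a fixed window, a problem that fires every minute still opens a fresh Incident every 10 minutes; a sliding window (refreshing the timestamp on each hit) would keep one Incident open for as long as the problem persists.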
### 5. Sentry Comment write-back (Phase D)

```
POST /api/0/issues/{issue_id}/comments/
Authorization: Bearer {SENTRY_API_TOKEN}
{"text": "🤖 OpenClaw auto-repair: restart_pod succeeded"}
```

### 6. SigNoz alert rules (Phase E)

| Metric | Threshold | Severity |
|------|------|--------|
| Error Rate | > 5% (5m) | warning |
| P95 Latency | > 2s | warning |
| Trace anomaly | duration > 10s | critical |

### 7. Learning Service (Phase G)

```python
class LearningService:
    async def should_skip_action(self, anomaly_key: str, action: str) -> bool:
        """Skip this action when its success rate is below 20%."""
        stats = await self.counter.get_repair_stats(anomaly_key, action)
        # bool(...) so the method returns False (not None or {}) when no stats exist
        return bool(stats) and stats["rate"] < 0.2
```

---

## Modularity compliance

| Check | Status | Notes |
|--------|------|------|
| Interface defined first | ✅ | AnomalyCounter Protocol |
| Routers must not access the DB directly | ✅ | Goes through the Service layer |
| Repository single responsibility | ✅ | AnomalyCounterRepository |
| No circular dependencies | ✅ | counter → repair → learning |
| Testable (no mocks) | ✅ | Redis Testcontainer |

---

## Implementation plan

| Phase | Task | Effort | Depends on | Status |
|-------|------|------|------|------|
| A | AnomalyCounter + tiered repair | 4h | - | ✅ Already exists |
| B | Database Exporters | 3h | - | ⏳ Pending |
| C | Incident frequency fields | 2h | Phase A | ⏳ Pending |
| D | Sentry Comment write-back | 1h | Phase A | ✅ **Wave A.4** |
| E | SigNoz alert rules | 2h | - | ✅ **Wave A.2+A.3** |
| F | Alert Chain E2E verification | 2h | Phase A-E | ✅ **Wave A.6+B.1+B.2** |
| G | Learning Service | 3h | Phase A | ✅ Already exists |

**Total effort**: 17h (about 10h completed)

### Wave A+B completion details (2026-03-29)

| Wave | Task | Status | Chief Architect review |
|------|------|------|---------------|
| A.1 | Sentry API Token | ✅ | 91/100 |
| A.2 | SigNoz alert rules | ✅ | OUTSTANDING |
| A.3 | SigNoz Webhook | ✅ | - |
| A.4 | Sentry Comment | ✅ | - |
| A.5 | Alert Chain Metrics | ✅ | - |
| A.6 | Smoke Test | ✅ | - |
| B.1 | PrometheusRule | ✅ | - |
| B.2 | CD Pipeline | ✅ | - |

---

## Risks and mitigations

| Risk | Impact | Mitigation |
|------|------|------|
| Redis memory growth | Statistics data bloat | 30-day TTL + ZREMRANGEBYSCORE |
| Tier misjudgment | Inappropriate repair | Manual approval for Tier 3/4 |
| Sentry API rate limiting | Comment fails | Rate limiter + retry |

---

## References

- [MONITORING_STRATEGIC_PLANNING.md](../proposals/MONITORING_STRATEGIC_PLANNING.md)
- [IMPLEMENTATION_STEPS_ANOMALY_COUNTER.md](../proposals/IMPLEMENTATION_STEPS_ANOMALY_COUNTER.md)
- [IMPLEMENTATION_STEPS_DATABASE_EXPORTERS.md](../proposals/IMPLEMENTATION_STEPS_DATABASE_EXPORTERS.md)
- [IMPLEMENTATION_STEPS_INCIDENT_FREQUENCY.md](../proposals/IMPLEMENTATION_STEPS_INCIDENT_FREQUENCY.md)
- [IMPLEMENTATION_STEPS_REMAINING_PHASES.md](../proposals/IMPLEMENTATION_STEPS_REMAINING_PHASES.md)

---

## Change log

| Date | Version | Change | Author |
|------|------|------|------|
| 2026-03-29 | 1.0 | Initial version | Claude (Chief Architect) |
| 2026-03-29 | 1.1 | Wave A+B complete (91/100 OUTSTANDING) | Claude Code |
840
docs/proposals/ARCHITECTURAL_RISK_WAR_GAME.md
Normal file
@@ -0,0 +1,840 @@
# AWOOOI Architectural Risk Full-Spectrum War Game

> **Document type**: Basis for architectural decisions (required reading; takes precedence over all Runbooks)
> **Created**: 2026-03-29 13:37 (Taipei)
> **Role**: This document sits above all execution plans; every implementation step must conform to it

---

## Chapter 0: The actual state confirmed in code

Before producing any plan, we must first face the real state of the code honestly:

| Risk item | Code location | Actual state |
|---------|---------|---------|
| Worker SIGTERM | `signal_worker.py:450-455` | ✅ SIGTERM interception is implemented |
| Worker stop() timeout | `signal_worker.py:147` | ❌ **Only 5 seconds**, while AI tasks take up to 60 seconds |
| XCLAIM (orphaned-task recovery) | `signal_worker.py` (whole file) | ❌ **Completely missing**; nobody handles PEL orphans |
| StatefulSet hard block | `auto_repair_service.py` (whole file) | ❌ **Completely missing**; only severity and risk are checked |
| TIER_ACTIONS Tier 1 | `auto_repair_service.py:411` | ❌ `restart_pod` is in Tier 1, with no exception for DB/Redis |
| OpenClaw Circuit Breaker | `sentry_webhook.py:289` | ❌ **Only a 60s httpx timeout, no circuit breaking** |
| ESLint i18n plugin | `.eslintrc.js:20-22` | ❌ **Only a TODO comment; not installed** |
| Redis AOF check | - | ⚠️ Unconfirmed; must be verified immediately |
| Visual Regression baseline | - | ❌ Not established; Mac environment ≠ CI environment |

---

## Chapter 1: The 6 fatal conflicts already identified (Chief Architect's version)

The Chief Architect accurately identified these 6 conflicts; code confirmation makes them more severe:

### Conflict 1: The CI/CD monitoring iron curtain paralyzes deployment

**Severity**: 🔴 P0 | **Type**: Process conflict

**Code confirmation**: `service-registry.yaml` already lists 60+ services, but `validate_coverage.py` has not yet been integrated into `cd.yaml`.

**Correct approach (three-stage soft launch)**:

```yaml
# .github/workflows/cd.yaml
# Stage 1 (immediately): warn-only
- name: Service Registry Coverage Check
  run: |
    python ops/scripts/validate_coverage.py --warn-only
    # exits 0 even when coverage is missing; only sends a Telegram warning
  continue-on-error: true  # ← key point: does not block the deployment

# Stage 2 (once the registry is complete): enforce blocking
# change continue-on-error to false
```

---

### Conflict 2: Worker HPA + orphaned tasks in the Redis PEL

**Severity**: 🔴 P0 | **Type**: Logic conflict

**Code confirmation**: `signal_worker.py` has no XCLAIM logic at all. The `start()` method only calls `_ensure_consumer_group()` and never recovers pending tasks.

**Minimum viable fix (added to the `start()` method)**:

```python
# signal_worker.py
async def start(self) -> None:
    if self._running:
        return

    await self._ensure_consumer_group()

    # 🆕 Key step: on startup, first take over orphaned tasks left by dead Workers
    await self._claim_orphaned_tasks()

    self._running = True
    self._task = asyncio.create_task(self._consume_loop())

async def _claim_orphaned_tasks(self, idle_ms: int = 60000) -> int:
    """
    XCLAIM mechanism: take over Pending tasks that have gone un-ACKed for more than idle_ms.

    Scenario: the previous Worker Pod was killed by K8s while processing a task,
    so the task is stuck in the PEL and a newly started Worker must take it over.

    idle_ms: a task is claimed only after being idle for this many milliseconds (default 60 seconds)
    """
    redis_client = get_redis()
    claimed_count = 0

    try:
        # Query Pending tasks in the PEL (at most 100 per call)
        pending = await redis_client.xpending_range(
            STREAM_KEY, CONSUMER_GROUP,
            min='-', max='+', count=100
        )

        for entry in pending:
            # Only claim tasks that have been unprocessed for more than idle_ms
            if entry['time_since_delivered'] > idle_ms:
                # XCLAIM only transfers ownership: the returned entries still
                # have to be processed and ACKed (reading with '>' will not
                # redeliver them)
                claimed = await redis_client.xclaim(
                    STREAM_KEY, CONSUMER_GROUP, CONSUMER_NAME,
                    min_idle_time=idle_ms,
                    message_ids=[entry['message_id']]
                )
                if claimed:
                    claimed_count += len(claimed)
                    logger.info(
                        "orphaned_task_claimed",
                        message_id=entry['message_id'],
                        original_consumer=entry['consumer'],
                        idle_ms=entry['time_since_delivered'],
                    )

    except Exception as e:
        # An XCLAIM failure must not block Worker startup
        logger.warning("xclaim_failed", error=str(e))

    if claimed_count > 0:
        logger.info("orphaned_tasks_recovered", count=claimed_count)

    return claimed_count
```

**Deployment order relative to HPA**: XCLAIM must be merged to main before the Worker HPA is deployed.

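To make the claim semantics concrete, here is a minimal in-memory model of a PEL. It is a sketch under stated assumptions, not Redis itself: `PendingEntry` and `claim_orphans` are hypothetical names, and the model only mirrors two documented XCLAIM behaviors (a claim succeeds only when the entry's idle time exceeds the minimum, and a successful claim transfers ownership and resets the idle time). The reset is what keeps a second Worker from claiming the same entry right after the first.

```python
from dataclasses import dataclass


@dataclass
class PendingEntry:
    """Hypothetical model of one PEL entry."""

    message_id: str
    consumer: str
    idle_ms: int


def claim_orphans(pel: list[PendingEntry], new_consumer: str, min_idle_ms: int) -> list[str]:
    """Claim every entry idle longer than min_idle_ms; return the claimed IDs."""
    claimed: list[str] = []
    for entry in pel:
        if entry.idle_ms > min_idle_ms:
            entry.consumer = new_consumer  # ownership moves to the claimer
            entry.idle_ms = 0              # idle time restarts after a claim
            claimed.append(entry.message_id)
    return claimed
```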

---

### Conflict 3: Overlapping alert storms (cross-source Incident explosion)

**Severity**: 🔴 P0 | **Type**: Data-flow conflict

**Code confirmation**: The aggregation logic in `incident_service.py` matches on the `fingerprint` string. Sentry and Alertmanager use different fingerprint formats, so alerts cannot be aggregated across sources.

**Root-cause scenario**:
```
PostgreSQL goes down →
  Alertmanager: 1 PostgreSQLDown alert
  Sentry: 200 ConnectionRefused alerts (every API request)
  → 201 separate Incidents
  → 201 OpenClaw analyses (token explosion)
```

**Global disaster cooldown implementation**:

```python
# apps/api/src/services/incident_service.py (addition)
import uuid

GLOBAL_INCIDENT_DEBOUNCE_TTL = 300  # 5-minute global cooldown
P0_INFRASTRUCTURE_SERVICES = {
    "postgres", "postgresql", "redis", "k8s-api", "etcd"
}

async def _check_global_incident_storm(self, signal_data: dict) -> str | None:
    """
    Check whether a P0 infrastructure disaster is currently active.

    If so → return the main Incident ID (correlate the event)
    If not → return None (create a new Incident as usual)
    """
    redis = get_redis()
    storm_key = "global:incident_storm:active"

    # Decide whether this is a P0 infrastructure alert (handled first, never correlated)
    alert_name = signal_data.get("alert_name", "")
    service = signal_data.get("target", "")

    is_infra_p0 = any(svc in service.lower() for svc in P0_INFRASTRUCTURE_SERVICES)

    if is_infra_p0 and signal_data.get("severity") == "critical":
        # Raise the global storm flag
        main_incident_id = f"storm-{uuid.uuid4().hex[:8]}"
        await redis.setex(storm_key, GLOBAL_INCIDENT_DEBOUNCE_TTL, main_incident_id)
        logger.warning("global_incident_storm_detected", main_id=main_incident_id)
        return None  # the P0 alert itself creates an Incident as usual

    # Non-P0 alert: check whether a storm is in progress
    active_storm_id = await redis.get(storm_key)
    if active_storm_id:
        # The client returns bytes unless decode_responses=True; normalize to str
        if isinstance(active_storm_id, bytes):
            active_storm_id = active_storm_id.decode()
        logger.info(
            "alert_correlated_to_storm",
            storm_id=active_storm_id,
            alert=alert_name,
        )
        return active_storm_id  # correlate with the main Incident; no separate analysis

    return None
```

---

### Conflict 4: Auto-Repair kills stateful services by mistake

**Severity**: 🔴 P0 | **Type**: Architecture conflict

**Code confirmation**: At `auto_repair_service.py:411`, `TIER_ACTIONS[1]` contains `restart_pod` and `restart_container`. The evaluation logic only checks `severity <= P2` and `RiskLevel <= MEDIUM`; **there is no service-type allowlist at all**.

**Minimum viable fix (add a service blacklist)**:

```python
# auto_repair_service.py: new constant and guard

# 🚨 Services that must never be restarted automatically (stateful services)
STATEFUL_SERVICE_BLACKLIST = frozenset({
    "postgres", "postgresql", "awoooi-postgres",
    "redis", "awoooi-redis", "redis-stack",
    "clickhouse", "signoz-clickhouse",
    "elasticsearch", "etcd",
    "minio", "awoooi-minio",
})

async def evaluate_auto_repair(self, incident: Incident) -> AutoRepairDecision:
    # ... existing checks ...

    # 🆕 New: hard block for stateful services (must run before Playbook matching)
    affected_services = incident.affected_services or []
    for service in affected_services:
        if any(bl in service.lower() for bl in STATEFUL_SERVICE_BLACKLIST):
            logger.warning(
                "auto_repair_blocked_stateful_service",
                incident_id=incident.incident_id,
                service=service,
            )
            return AutoRepairDecision(
                can_auto_repair=False,
                reason=f"Service {service} is stateful; automatic restart is forbidden, the Commander must intervene manually",
                blocked_by="STATEFUL_SERVICE_GUARDRAIL",
            )

    # ... remaining existing logic ...
```

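The guard above matches by substring, which has two consequences worth seeing in isolation: prefixed names such as `awoooi-postgres` are already covered by the shorter entry `postgres`, and any stateless service whose name merely contains a blacklisted word is blocked as well. The sketch below uses a shortened, hypothetical copy of the set and a hypothetical `is_stateful` helper; the service names in the comments are illustrative only.

```python
# Shortened illustrative copy of the blacklist: with substring matching,
# "postgres" also covers "postgresql" and "awoooi-postgres"
STATEFUL_SERVICE_BLACKLIST = frozenset({
    "postgres", "redis", "clickhouse", "elasticsearch", "etcd", "minio",
})


def is_stateful(service: str) -> bool:
    """Return True when the service name contains any blacklisted substring."""
    name = service.lower()
    return any(entry in name for entry in STATEFUL_SERVICE_BLACKLIST)


blocked = is_stateful("awoooi-postgres")      # prefixed name is matched
allowed = is_stateful("awoooi-api")           # unrelated name passes through
side_effect = is_stateful("redis-exporter")   # a stateless exporter is blocked too
```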

---

### Conflict 5: Frontend "merge hell" (timing conflict)

**Severity**: 🟠 P1 | **Type**: Timing conflict

**Correct trunk-locking strategy**:

```
Git flow plan for frontend sovereignty:

Week 1: Feature Freeze
  main branch locked (no frontend code may be merged except through i18n-related PRs)

Week 1: i18n zero-violation PR (the only frontend PR allowed)
  branch: fix/i18n-zero-violation
  → fix all 40+ violations in one pass
  → install eslint-plugin-i18next at the same time (warn mode first)
  → Merge to main

After Week 1: Feature Unfreeze
  → start the Storybook PR
  → start the Omni-Terminal SSE Event Sourcing PR
  → switch ESLint to error mode
```

---

### Conflict 6: Playwright HTTPS certificates and network blind spots

**Severity**: 🟠 P1 | **Type**: Infrastructure conflict

**Code confirmation**: the current settings in `playwright.config.ts` still need to be checked.

```typescript
// playwright.config.ts: required changes
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    baseURL: process.env.BASE_URL || 'http://192.168.0.120:32335',
    ignoreHTTPSErrors: true,  // 🆕 self-signed certificates must be ignored
    viewport: { width: 1280, height: 720 },
    deviceScaleFactor: 1,  // avoids Retina differences
  },
  expect: {
    toHaveScreenshot: {
      threshold: 0.05,
      maxDiffPixelRatio: 0.05,
    },
  },
});
```

---

## Chapter 2: 6 deeper fatal risks that were missed

The Chief Architect's 6 conflicts are correct, but the following 6 deeper risks also exist in the system:

### Deep risk A: The .188 node is a single point of failure (SPOF), so the system's brain can lose its memory

**Location**: 192.168.0.188 (AI + Web center)
**Impact**: if .188 goes down, Ollama + OpenClaw + Redis + PostgreSQL + SigNoz **all fail at the same time**

This is the most fatal single point in the whole system:

```
Cascading collapse when .188 goes down:
  Redis fails → Signal Worker cannot consume → all alerts pile up
  PostgreSQL fails → the K3s control plane loses its datastore → K3s may crash
  OpenClaw fails → all AI analysis stops → Sentry/Alertmanager webhooks queue up
  SigNoz fails → observability blind spot
      ↓
  K3s crashes → all AWOOOI API/Web Pods die
      ↓
  No AWOOOI → alerts cannot be received → the Commander cannot operate → total loss of contact
```

**Mitigation strategy (not a cure)**:
```yaml
# OpenClaw and the Worker must have a degraded mode for when .188 fails
# Minimum standard: a Telegram Bot directly sends a ".188 suspected down" alert
# (bypassing the AWOOOI API; curl the Telegram API directly)

# k8s/monitoring/alert-rules.yaml (addition)
- alert: AIWebCenterDown
  expr: probe_success{job="blackbox", target="http://192.168.0.188:8089/health"} == 0
  for: 2m
  annotations:
    summary: ".188 AI center unreachable; system entering degraded mode"
    runbook: "docs/runbooks/RUNBOOK-188-FAILOVER.md"
```

---

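### Deep risk B: Circular dependency in observability (the observer's blind spot)

**This is the most ironic problem in the architecture**:

```
When the AWOOOI API itself crashes:
  Alertmanager tries to send a webhook to AWOOOI → AWOOOI is down, the webhook fails
  Sentry tries to send a webhook to AWOOOI → same as above
  Telegram notifications go through AWOOOI → same as above

The last mile of the alert chain (Telegram notification) depends on the very thing being monitored!
```

**Existing protection** (already in ADR-035!):
```yaml
# Alertmanager has a direct Telegram notification (bypassing AWOOOI)
# Still to confirm: whether alertmanager.yml has a backup receiver
receivers:
  - name: 'openclaw-api'      # primary path (through AWOOOI)
    ...
  - name: 'direct-telegram'   # backup path (calls Telegram directly)
    # Note: Alertmanager's generic webhook body carries no chat_id/text fields,
    # so sendMessage will likely reject it; also check the native
    # telegram_configs receiver when verifying this path
    webhook_configs:
      - url: 'https://api.telegram.org/bot{TOKEN}/sendMessage'
```

**Needs verification**: whether ADR-035's three-layer protection really covers the scenario where the AWOOOI API itself is down.

---
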
### Deep risk C: OpenClaw calls have no Circuit Breaker (backend AI paralysis spreads)

**Code confirmation**: `call_openclaw_analyzer()` at `sentry_webhook.py:289` only uses `httpx.AsyncClient(timeout=60.0)`; **there is no Circuit Breaker**.

**Scenario**: When OpenClaw is under heavy load (GPU overheating, memory pressure), every Sentry/Alertmanager call waits the full 60 seconds before timing out. FastAPI background tasks pile up until the API Pod runs out of memory and is OOM-killed.

**Minimum viable fix**:

```python
# apps/api/src/core/circuit_breaker.py (new file)
import time
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"        # normal
    OPEN = "open"            # tripped (fail fast)
    HALF_OPEN = "half_open"  # tentative recovery


class SimpleCircuitBreaker:
    """
    Simple Circuit Breaker (does not depend on NVIDIA's implementation)

    State machine:
    CLOSED → OPEN (5 consecutive failures)
    OPEN → HALF_OPEN (after a 60-second cooldown)
    HALF_OPEN → CLOSED (1 success)
    HALF_OPEN → OPEN (1 failure)
    """

    def __init__(self, failure_threshold=5, timeout_s=60):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.threshold = failure_threshold
        self.timeout_s = timeout_s
        self._opened_at: float | None = None

    def is_open(self) -> bool:
        if self.state == CircuitState.OPEN:
            # Monotonic clock: unaffected by wall-clock adjustments
            if time.monotonic() - self._opened_at > self.timeout_s:
                self.state = CircuitState.HALF_OPEN
                return False
            return True
        return False

    def record_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def record_failure(self):
        # failure_count stays at or above the threshold while HALF_OPEN,
        # so a single failed probe reopens the circuit
        self.failure_count += 1
        if self.failure_count >= self.threshold:
            self.state = CircuitState.OPEN
            self._opened_at = time.monotonic()


# Global OpenClaw Circuit Breaker
_openclaw_cb = SimpleCircuitBreaker(failure_threshold=5, timeout_s=60)


def get_openclaw_circuit_breaker() -> SimpleCircuitBreaker:
    return _openclaw_cb
```

```python
# sentry_webhook.py: modify call_openclaw_analyzer()
async def call_openclaw_analyzer(error_context: dict) -> ErrorAnalysisResult | None:
    cb = get_openclaw_circuit_breaker()

    # Circuit protection: fail immediately instead of waiting
    if cb.is_open():
        logger.warning("openclaw_circuit_open_skip_analysis")
        return None

    try:
        async with httpx.AsyncClient(timeout=60.0) as client:
            response = await client.post(...)
            if response.status_code == 200:
                cb.record_success()
                return ErrorAnalysisResult(**response.json())
            else:
                cb.record_failure()
                return None
    except Exception as e:
        cb.record_failure()
        logger.exception("openclaw_call_failed", error=str(e))
        return None
```

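The state machine described in the docstring above can be exercised deterministically if the clock is injectable. The block below is a self-contained sketch, not the project's class: `ClockedCircuitBreaker` and its `clock` parameter are hypothetical additions made only so the transitions (CLOSED → OPEN → HALF_OPEN → OPEN or CLOSED) can be stepped through without waiting 60 real seconds.

```python
import time
from enum import Enum
from typing import Callable


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class ClockedCircuitBreaker:
    """Hypothetical breaker with an injectable clock for deterministic tests."""

    def __init__(
        self,
        failure_threshold: int = 5,
        timeout_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.threshold = failure_threshold
        self.timeout_s = timeout_s
        self._clock = clock
        self._opened_at = 0.0

    def is_open(self) -> bool:
        if self.state is CircuitState.OPEN:
            if self._clock() - self._opened_at > self.timeout_s:
                self.state = CircuitState.HALF_OPEN  # cooldown over: allow a probe
                return False
            return True
        return False

    def record_success(self) -> None:
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def record_failure(self) -> None:
        self.failure_count += 1
        # A failed probe in HALF_OPEN reopens immediately
        if self.state is CircuitState.HALF_OPEN or self.failure_count >= self.threshold:
            self.state = CircuitState.OPEN
            self._opened_at = self._clock()
```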

---

### Deep risk D: K8s rolling updates conflict with database migrations

**Scenario**: CD performs a rolling update, so old and new API Pods coexist (Kubernetes rolling strategy). If the new version ships an `ALTER TABLE` migration, old Pods raise errors because a column does not exist; if the migration runs first, the old code and the new schema conflict.

**Needs confirmation**: the Alembic migration strategy.

**Safeguard (the existing CD must be checked)**:

```yaml
# .github/workflows/cd.yaml: confirm whether a migration step exists
# If not, one must be added:
- name: Run DB Migration
  run: |
    kubectl exec -n awoooi-prod \
      $(kubectl get pod -n awoooi-prod -l app=awoooi-api -o name | head -1) \
      -- python -m alembic upgrade head
  # Migrations must be backward compatible (never drop columns, only add)
```

---

### Deep risk E: Silent data loss under Redis memory pressure

AnomalyCounter's sliding window uses Redis Sorted Sets / counters. If Redis runs short of memory and `maxmemory-policy` kicks in, these counters may be evicted silently.

**Must be confirmed immediately**:
```bash
ssh root@192.168.0.188 'docker exec awoooi-redis redis-cli CONFIG GET maxmemory-policy'
# With allkeys-lru or allkeys-lfu, AnomalyCounter's counters can be evicted!
# The correct setting is volatile-ttl or noeviction (paired with a memory alert)
```

**Fix**: AnomalyCounter's Redis keys must carry a TTL (they already do), and the eviction policy must not kill those keys by mistake. `volatile-ttl` alone does not guarantee this, because it evicts exactly the keys that have a TTL:
```bash
# Set volatile-ttl (evicts only keys that have a TTL, shortest remaining TTL first)
docker exec awoooi-redis redis-cli CONFIG SET maxmemory-policy volatile-ttl
# AnomalyCounter counters have a TTL → they can still be evicted
# Remedy: raise Redis maxmemory, or switch to noeviction plus active monitoring
```

---

### Deep risk F: GitHub Actions Runner security isolation

The self-hosted runner on `.110` runs `kubectl patch secret` on every CD run, which means:
- The runner must hold admin privileges on the K8s cluster
- Anyone who can merge a PR into main can trigger a Job with K8s admin privileges

**Minimum protection**:

```yaml
# .github/workflows/cd.yaml
# Every Job that runs kubectl must add environment protection
jobs:
  deploy:
    environment: production  # ← a GitHub Environment with required reviewers must be configured
```

---

## Chapter 3: The ultimate safe execution order (12 waves)

After integrating all 6+6 conflict analyses, this is the **only correct execution order**:

```
🛡️ Wave 0: Immediate damage control (same day, no deployment, configuration only)
  0.1 Check Redis maxmemory-policy (5min)
  0.2 Check Redis appendonly (5min)
  0.3 Check the Alertmanager backup Telegram path (10min)

🔴 Wave 1: Underlying safety net (Week 1, must run serially)
  In order:
  1.1 Develop the XCLAIM mechanism (2h)
  1.2 Develop the StatefulSet Guardrail (1h)
  1.3 Develop the OpenClaw Circuit Breaker (2h)
  1.4 Develop the Global Incident Debounce (2h)
  1.5 Combine the four items above into a single PR; merge after testing (1h)

  → This PR must never be split! The four fixes depend on each other.

🔴 Wave 2: Worker upgrade (after Wave 1)
  2.1 Worker terminationGracePeriodSeconds 90s (30min)
  2.2 Worker stop() timeout 75s (30min)
  2.3 Deploy the Worker HPA (30min)

🟠 Wave 3: Frontend trunk lock (starts together with Wave 1, on a separate branch)
  3.1 Announce the Frontend Feature Freeze
  3.2 i18n blitz to zero violations (4h)
  3.3 Install eslint-plugin-i18next (warn mode) (1h)
  3.4 Merge the i18n PR → lift the Frontend Freeze
  3.5 Switch ESLint to error mode

🟠 Wave 4: CI infrastructure (after Wave 3)
  4.1 playwright.config.ts (ignoreHTTPSErrors + threshold)
  4.2 Create the initial Docker visual baseline
  4.3 E2E weekly schedule YAML (warn-only)
  4.4 CD validate_coverage.py (warn-only)

🟠 Wave 5: Complete the alerting backend (after Wave 1)
  5.1 Configure Sentry SENTRY_AUTH_TOKEN (Phase D)
  5.2 Deploy SigNoz alert rules to .188 (Phase E)

🟡 Wave 6: Observability consolidation (after Wave 5)
  6.1 Prometheus Federation (.110 → .188)
  6.2 Build AI Autonomy Index metrics
  6.3 Evaluate and enable Redis AOF + Sentinel

🟡 Wave 7: Frontend capability expansion (after Wave 4)
  7.1 Storybook: 10 core components
  7.2 Omni-Terminal SSE Event Sourcing
  7.3 Monitoring GenUI cards (7 cards)
  7.4 Nexus AI autonomy-rate UI

⚪ Wave 8: Root-cause fix for DB HA
  8.1 CloudNativePG evaluation report
  8.2 Execute after the decision (Patroni / CloudNativePG / status quo + backups)

⚪ Wave 9: Business metrics layer
  9.1 FinOps Dashboard API + UI
  9.2 SLO / MTTR API endpoints

⚪ Wave 10: Security sovereignty
  10.1 Kali → MCP Tool → SecurityAgent
  10.2 SBOM generation integration

⚪ Wave 11: Switch CI to hard blocking
  11.1 Visual Regression CI: warn → block
  11.2 Coverage validation: warn → block
  11.3 ESLint: confirm it is already in error mode

⚪ Wave 12: Phase 4 visual soul injection
  12.1 Brand 3D assets + chibi OpenClaw
  12.2 Site-wide micro-animation upgrade
```

---

## Chapter 4: Mandatory pre-execution checklist

Before any Wave 1 work begins, the following checks must be completed:

```bash
#!/bin/bash
# ops/scripts/pre-execution-checklist.sh

echo "=== AWOOOI mandatory pre-execution checklist ==="

# 1. Redis AOF check
APPENDONLY=$(docker exec awoooi-redis redis-cli CONFIG GET appendonly | tail -1)
echo "Redis AOF: $APPENDONLY"  # must be yes

# 2. Redis maxmemory-policy check
POLICY=$(docker exec awoooi-redis redis-cli CONFIG GET maxmemory-policy | tail -1)
echo "Redis eviction policy: $POLICY"  # must not be allkeys-lru

# 3. K3s cluster status check
kubectl get nodes  # nodes are cluster-scoped, so no -n flag is needed
kubectl get pod -n awoooi-prod

# 4. Alertmanager backup Telegram path check
curl -s http://192.168.0.120:30093/api/v1/receivers | python3 -m json.tool | grep name

# 5. Check the .110 → .120 network route (needed by Playwright E2E)
ping -c 3 192.168.0.120

echo "=== Checks complete; execution may begin ==="
```

---

## Appendix: Open questions still unresolved (Commander's decision needed)

| Question | Option A | Option B | Impact |
|------|--------|--------|------|
| PostgreSQL HA | CloudNativePG (K8s native) | Patroni+keepalived (VM level) | Major Q2 decision |
| Redis HA level | Sentinel (automatic failover) | AOF + manual recovery (conservative) | Monthly decision |
| .188 backup node | Buy a second AI host | Cloud GPU hot standby | Quarterly budget |
| GitHub Runner security isolation | GitHub Environments review | Split CI (read-only) from CD (needs K8s admin) | Security policy |

---

## Chapter 5: The four final deep-water zones (code-confirmed)

### 5.1 Redis crash → AnomalyCounter blows up in a chain → alerts lost permanently

**Code confirmation**: `anomaly_counter.py:147`

```python
# Current state (no try/except at all):
await self.redis.zadd(timeline_key, {str(timestamp): timestamp})  # ← Redis down = raises immediately
await self.redis.zremrangebyscore(...)
await self.redis.zcount(...)  # ← everything blows up
# → caller sentry_webhook.py → the whole background task fails → the alert is lost!
```

**Fix: defensive wrapper for graceful degradation**

```python
# anomaly_counter.py: modify record_anomaly()

async def record_anomaly(self, anomaly_signature: dict) -> AnomalyFrequency:
    """Record an anomaly; degrade gracefully (no exception) when Redis fails."""
    try:
        return await self._record_anomaly_impl(anomaly_signature)
    except Exception as e:
        # Redis connection failed → degrade: return a minimal frequency object
        # so the main flow can continue
        logger.warning(
            "anomaly_counter_redis_degraded",
            error=str(e),
            reason="Returning default frequency to allow alert chain to continue"
        )
        # Do not raise! The alert chain must continue!
        now = datetime.now()
        return AnomalyFrequency(
            anomaly_key=self.hash_signature(anomaly_signature),
            count_1h=1, count_24h=1, count_7d=1, count_30d=1,
            first_seen=now, last_seen=now,
            auto_repair_count=0, permanent_fix_applied=False,
            escalation_level=None,  # escalation cannot be judged; handle conservatively
        )

async def _record_anomaly_impl(self, anomaly_signature: dict) -> AnomalyFrequency:
    """Original implementation logic (extracted from record_anomaly)."""
    # ... all the existing Redis operations ...
```

**Principle**: "cannot remember" must not lead to "cannot send". Redis is an auxiliary system, not the critical path.

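The principle can be demonstrated end to end with a small runnable sketch. Everything here is hypothetical scaffolding (`FailingCounter`, `WorkingCounter`, `record_or_degrade`, `handle_alert`); the point is only that the notification step still runs when the counting step raises.

```python
import asyncio


class FailingCounter:
    """Hypothetical counter whose backing store is unavailable."""

    async def record(self, key: str) -> int:
        raise ConnectionError("redis unavailable")


class WorkingCounter:
    """Hypothetical counter that reports 7 occurrences."""

    async def record(self, key: str) -> int:
        return 7


async def record_or_degrade(counter, key: str) -> dict:
    try:
        return {"count_24h": await counter.record(key), "degraded": False}
    except Exception:
        # Counting failed: fall back to a minimal result instead of raising
        return {"count_24h": 1, "degraded": True}


async def handle_alert(counter, key: str) -> str:
    freq = await record_or_degrade(counter, key)
    # The notification step runs whether or not counting succeeded
    return f"notified count_24h={freq['count_24h']} degraded={freq['degraded']}"
```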

---

### 5.2 Non-graceful Worker crash → orphaned PEL tasks stuck forever

**Code confirmation**: `signal_worker.py` contains no `reclaim_loop` and no periodic XPENDING scan anywhere.

The `_claim_orphaned_tasks()` from Conflict 2 runs only once, inside `start()`, so it cannot handle a **Pod crashing while the others keep running**:

```
Scenario: 2 Workers running steadily
  Worker A is mid-task → Segfault / OOM Kill (non-graceful shutdown)
  Worker B is already running → start() never fires again → XCLAIM never runs
  The orphaned task stays stuck in the PEL → rescued only when HPA next starts a new Pod
  This may take 600+ seconds (HPA stabilizationWindowSeconds)
```

**Fix: Active Sweeper Loop (runs in parallel with the heartbeat loop)**

```python
# signal_worker.py: add _reclaim_loop()

async def start(self) -> None:
    await self._ensure_consumer_group()
    await self._claim_orphaned_tasks()  # once at startup
    self._running = True
    self._task = asyncio.create_task(self._consume_loop())
    self._reclaim_task = asyncio.create_task(self._reclaim_loop())  # 🆕 continuous scanning

async def _reclaim_loop(self, interval_s: int = 300) -> None:
    """
    Active Sweeper: every 5 minutes, scan the PEL and take over orphaned tasks idle for more than 5 minutes.
    Runs in parallel with _consume_loop and does not block normal consumption.
    """
    while self._running:
        try:
            await asyncio.sleep(interval_s)
            if not self._running:
                break
            claimed = await self._claim_orphaned_tasks(idle_ms=300_000)  # 5 minutes
            if claimed > 0:
                logger.info("active_sweeper_claimed", count=claimed)
        except asyncio.CancelledError:
            break
        except Exception as e:
            logger.warning("active_sweeper_error", error=str(e))

async def stop(self) -> None:
    self._running = False
    # Cancel the reclaim loop as well
    reclaim_task = getattr(self, "_reclaim_task", None)
    if reclaim_task:
        reclaim_task.cancel()
    if self._task:
        try:
            await asyncio.wait_for(self._task, timeout=75.0)  # corrected value
        # asyncio.TimeoutError is only an alias of the builtin TimeoutError from Python 3.11
        except (asyncio.TimeoutError, asyncio.CancelledError):
            pass
```

---

### 5.3 SSE Event Store: a Redis memory bomb

**Code confirmation**: `terminal.py:114` uses the SSE Publisher/Subscribe pattern (`publisher.subscribe(topics)`), **not the Redis List pattern**.

This is an **important correction to the picture of the current architecture**:

- The existing terminal.py uses `src.core.sse.SSEPublisher` as its event distribution mechanism
- **Event Sourcing (Redis RPUSH) is not implemented yet**; it is a feature to be added later
- The Redis memory-bomb risk therefore arises **when that feature is implemented**, and must be prevented at design time

**Mandatory specification for the future Event Sourcing implementation**:

```python
# terminal.py: future stream_with_persistence()
import json

MAX_PAYLOAD_BYTES = 50 * 1024   # 50KB cap (a tool_result beyond this is truncated)
MAX_EVENTS_PER_SESSION = 50     # at most 50 events per session (LTRIM)
SESSION_TTL_SECONDS = 3600      # 1-hour TTL

async def stream_with_persistence(command_id: str, event_type: str, data: dict):
    redis = get_redis()
    key = f"terminal:events:{command_id}"

    # 🚨 Required payload protection
    # json.dumps escapes non-ASCII by default, so len() equals the byte size
    payload_json = json.dumps(data)
    if len(payload_json) > MAX_PAYLOAD_BYTES:
        payload_json = json.dumps({
            "truncated": True,
            "original_size": len(payload_json),
            "preview": payload_json[:1024],
            "message": f"Payload of {len(payload_json)//1024}KB is too large and was truncated"
        })

    event = {"type": event_type, "data": json.loads(payload_json)}
    await redis.rpush(key, json.dumps(event))
    await redis.ltrim(key, -MAX_EVENTS_PER_SESSION, -1)  # keep only the last 50
    await redis.expire(key, SESSION_TTL_SECONDS)
```

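The two bounds in this specification (payload size and events per session) can be checked in isolation. The sketch below is an in-memory illustration, not the future implementation: `bounded_payload` and `append_event` are hypothetical helpers, and a `deque` with `maxlen` stands in for the RPUSH + LTRIM pair, since both keep only the newest N items.

```python
import json
from collections import deque

MAX_PAYLOAD_BYTES = 50 * 1024
MAX_EVENTS_PER_SESSION = 50


def bounded_payload(data: dict) -> dict:
    """Return data unchanged if small enough, else a truncated placeholder."""
    raw = json.dumps(data)  # ASCII-only by default, so len(raw) is the byte size
    if len(raw) <= MAX_PAYLOAD_BYTES:
        return data
    return {"truncated": True, "original_size": len(raw), "preview": raw[:1024]}


def append_event(log: deque, event_type: str, data: dict) -> None:
    """Append one event; a deque with maxlen silently drops the oldest entry."""
    log.append({"type": event_type, "data": bounded_payload(data)})


session_log: deque = deque(maxlen=MAX_EVENTS_PER_SESSION)
for i in range(120):
    append_event(session_log, "tool_result", {"seq": i})
```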

---

### 5.4 Frontend Feature Freeze → Hotfix deadlock

**Code confirmation**: there is no `release/` branch strategy; main is the only long-lived branch.

**Fix: three-branch Git Flow strategy (must be set up before the Freeze starts)**

```bash
# Before the Week 1 i18n cleanup starts, run:

# Step 1: create a stable release baseline from the current main
git checkout main
git pull origin main
git checkout -b release/v1.x
git push origin release/v1.x

# Step 2: on GitHub, mark release/v1.x as a Protected Branch
# → only Hotfix PRs may be merged into this branch

# Step 3: start the i18n cleanup (done on main/develop)
git checkout main
git checkout -b fix/i18n-zero-violation
# ... perform the i18n cleanup ...
git push origin fix/i18n-zero-violation
# → PR merged into main
```

**Emergency Hotfix flow** (when production blows up during the Freeze):

```bash
# Cut the hotfix from the release branch
git checkout release/v1.x
git checkout -b hotfix/critical-approval-button-fix
# ... minimal fix ...
git push origin hotfix/critical-approval-button-fix

# PR merged into release/v1.x → deploy immediately
# Then cherry-pick onto main (the branch where the i18n refactor is in progress)
git checkout main
git cherry-pick <hotfix-commit-hash>
```

**Hotfix trigger conditions** (must be written into HARD_RULES.md):
- The Commander cannot use core functions (approval button broken, login unusable)
- P0-level Sentry errors exceed 10 per minute
- Service availability < 99%

---

## Chapter 6: Wave 1 Final Implementation Checklist (ready for immediate authorization)

After a full code review, the five Wave 1 fixes require changes at the following exact locations:

| Fix | File to modify | Location | Estimated effort |
|-----|----------------|----------|------------------|
| XCLAIM + Active Sweeper | `signal_worker.py` | `start()`, `stop()`, new `_reclaim_loop()`, `_claim_orphaned_tasks()` | 2h |
| StatefulSet Guardrail | `auto_repair_service.py` | service blacklist added at the top of `evaluate_auto_repair()` | 1h |
| AnomalyCounter Redis degradation | `anomaly_counter.py` | wrap `record_anomaly()` in try/except and return a degraded result | 1h |
| OpenClaw Circuit Breaker | `core/circuit_breaker.py` (new) → `sentry_webhook.py`, `signoz_webhook.py` | wrap `call_openclaw_analyzer()` in circuit-breaker protection | 2h |
| Global Incident Debounce | `services/incident_service.py` | global cool-down check before `process_signal()` | 1.5h |
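As an illustration of the AnomalyCounter row, graceful degradation means a Redis failure must not take the alert pipeline down with it: the counter returns a clearly marked fallback instead of raising. The sketch below is an assumption about the shape of that change, not the project's code. `AnomalyRecord` is a hypothetical return type, and `impl` stands in for the extracted `_record_anomaly_impl()`; real code would catch the Redis client's exception classes rather than a bare `Exception`.

```python
import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger("anomaly_counter")


@dataclass(frozen=True)
class AnomalyRecord:
    count_24h: int
    degraded: bool  # True when the count could not be read from Redis


def record_anomaly(key: str, impl: Callable[[str], int]) -> AnomalyRecord:
    """Call the real recorder; on failure, degrade instead of raising."""
    try:
        return AnomalyRecord(count_24h=impl(key), degraded=False)
    except Exception:  # narrow this to the Redis client's errors in real code
        logger.warning("anomaly_counter_degraded key=%s", key, exc_info=True)
        return AnomalyRecord(count_24h=0, degraded=True)
```

Callers can then skip frequency-based escalation when `degraded` is true, rather than treating a count of zero as "first occurrence".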
|
||||
|
||||
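**Wave 1 execution conditions**:
1. Wave 0.1-0.3 confirmed manually (Redis AOF/eviction, Alertmanager fallback)
2. Git Flow: create the stable `release/v1.x` branch (prevents the Hotfix deadlock during the Freeze)
3. All changes bundled into a single PR (atomic deployment, must not be split)

*"A real architect does not design a perfect system, but one that degrades gracefully under any extreme condition."* 🦞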
|
||||
|
||||
570
docs/proposals/INTEGRATION_ARCHITECTURE_MASTER.md
Normal file
@@ -0,0 +1,570 @@
|
||||
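# AWOOOI Unified Integration Architecture Design

> **Document type**: Unified architecture design (Single Source of Truth for Integration)
> **Priority**: 🔴 Commander's top directive
> **Created**: 2026-03-29 13:27 (Taipei)
> **Core thesis**: every node must work together inside the neural network of a single brain; no islands are allowed.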
|
||||
|
||||
---
|
||||
|
||||
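## Part 1: Honest Inventory of the Current State (precise)

### Confirmed: areas better than the audit report claimed

| Item | Audit report's misjudgment | Actual state |
|------|----------------------------|--------------|
| Worker SIGTERM handling | Report said "missing" | ✅ **Implemented** (`signal_worker.py:450-455`): `signal.signal(SIGTERM)` + `shutdown_event` |
| Worker graceful shutdown flow | Report said "needs implementing" | ✅ **Implemented** (`stop()` method, with a heartbeat mechanism) |
| SignOz Webhook route | Report said "not deployed" | ✅ **Routed** (`main.py:419`) |

### Confirmed: areas worse than the audit report claimed

| Item | Audit report version | Actual gap |
|------|----------------------|------------|
| Worker stop() timeout | Not mentioned | ❌ **Only 5 seconds**; an AI analysis taking 30-60 seconds gets force-killed |
| K8s terminationGracePeriodSeconds | Not mentioned | ❌ **Not set**; the K8s default of 30 seconds is not enough |
| ESLint i18n enforcement | Said "blocked in CI" | ❌ **Only a TODO comment** (`.eslintrc.js:20-22`); the plugin is not actually installed |
| Visual Regression across platforms | Said "screenshot comparison" | ❌ **Mac and CI Linux render fonts differently**; baselines must not be generated on a Mac |
| PostgreSQL HA | Said "Streaming Replication" | ❌ **No failover mechanism**; a primary failure needs manual intervention |
| Redis HA | Not mentioned at all | ❌ **No Sentinel**; Redis is a single point of failure |
| SSE Event Sourcing | Only the event types were designed | ❌ **All GenUI cards disappear after an F5 refresh** |
| Kali integration | Passive cronjob | 🟡 Too low-level; should be promoted to a SecurityAgent |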
|
||||
|
||||
---
|
||||
|
||||
## 第二部分:完整整合地圖
|
||||
|
||||
### 2.1 系統神經網路拓撲
|
||||
|
||||
```
|
||||
外部事件輸入
|
||||
┌─────────────────────────────────────────────────────────────────┐
|
||||
│ Alertmanager :9093 Sentry :9000 SignOz :3301 │
|
||||
│ GitHub Actions Kali Scanner K8s Events │
|
||||
└──────────┬──────────────┬──────────────┬──────────────────────┘
|
||||
│ │ │
|
||||
▼ ▼ ▼
|
||||
┌────────────────────────────────────────────────────────────────┐
|
||||
│ AWOOOI API (K3s :32334) │
|
||||
│ ┌──────────┐ ┌──────────┐ ┌───────────┐ ┌─────────────┐ │
|
||||
│ │Alertmanager│ │Sentry │ │SignOz │ │Kali(未來) │ │
|
||||
│ │Webhook │ │Webhook │ │Webhook │ │Webhook │ │
|
||||
│ └────┬─────┘ └────┬─────┘ └─────┬─────┘ └──────┬──────┘ │
|
||||
│ │ │ │ │ │
|
||||
│ └──────────────┴──────────────┴────────────────┘ │
|
||||
│ │ │
|
||||
│ ▼ │
|
||||
│ ┌──────────────────────────────────────────────────────────┐ │
|
||||
│ │ Signal Worker (Redis XREADGROUP) │ │
|
||||
│ │ ← 消費 awoooi:signals stream │ │
|
||||
│ │ → IncidentEngine (聚合 / GraphRAG / 持久化) │ │
|
||||
│ └──────────────────────────┬───────────────────────────────┘ │
|
||||
│ │ │
|
||||
│ ▼ │
|
||||
│ ┌──────────────────────────────────────────────────────────┐ │
|
||||
│ │ OpenClaw (192.168.0.188:8089) │ │
|
||||
│ │ 決策引擎: RCA → Blast Radius → Risk → Action │ │
|
||||
│ │ 工具: kubectl / SSH / Prometheus / SigNoz │ │
|
||||
│ └──────┬─────────────────────┬────────────────────────────┘ │
|
||||
│ │ │ │
|
||||
│ ▼ ▼ │
|
||||
│ Telegram Bot Approval DB (PostgreSQL) │
|
||||
│ 統帥通知 審核佇列 │
|
||||
└──────────────────────────────────────────────────────────────────┘
|
||||
│
|
||||
統帥批准 / 拒絕
|
||||
│
|
||||
┌────────────▼────────────┐
|
||||
│ Auto-Repair Actions │
|
||||
│ restart/scale/rollback │
|
||||
└─────────────────────────┘
|
||||
```
|
||||
|
||||
### 2.2 前端整合地圖
|
||||
|
||||
```
|
||||
AWOOOI Web (K3s :32335 / Next.js)
|
||||
┌─────────────────────────────────────────────────────────────────┐
|
||||
│ │
|
||||
│ ┌── Dashboard (/) ─────────────────────────────────────────┐ │
|
||||
│ │ AutonomyIndexPanel ← GET /api/v1/stats/autonomy │ │
|
||||
│ │ SystemPulseRow ← GET /api/v1/stats/overview │ │
|
||||
│ │ DecisionZone ← SSE /api/v1/approvals/stream │ │
|
||||
│ └──────────────────────────────────────────────────────────┘ │
|
||||
│ │
|
||||
│ ┌── Omni-Terminal ─────────────────────────────────────────┐ │
|
||||
│ │ Input Area → POST /api/v1/terminal/command │ │
|
||||
│ │ ThinkingStream ← SSE /api/v1/terminal/stream/{id} │ │
|
||||
│ │ GenUI Renderer ← event: render_ui │ │
|
||||
│ │ Event Replay ← Redis List (Last-Event-ID) │ ← ❌ 缺失
|
||||
│ └──────────────────────────────────────────────────────────┘ │
|
||||
│ │
|
||||
│ ┌── Knowledge Base (/knowledge-base) ─────────────────────┐ │
|
||||
│ │ ❌ 空白頁面,缺後端 API │ │
|
||||
│ └──────────────────────────────────────────────────────────┘ │
|
||||
│ │
|
||||
│ [深度調查跳脫入口] │
|
||||
│ → SigNoz: http://192.168.0.188:3301 (新分頁) │
|
||||
│ → Grafana: http://192.168.0.188:3000 (新分頁) │
|
||||
│ → Sentry: http://192.168.0.110:9000 (新分頁) │
|
||||
│ │
|
||||
└─────────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
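## Part 3: System-Integrated Fixes for the Six Major Gaps

### Gap 1: Worker terminationGracePeriodSeconds is too short

**Root cause**: `stop()` in `signal_worker.py` waits 5 seconds, but an AI analysis task can take up to 60 seconds. K8s defaults to `terminationGracePeriodSeconds: 30`. Neither value is large enough, and the two are not aligned with each other.

**Dependencies on the wider integration**:
- Worker scale-down: HPA scales down → K8s sends SIGTERM → `stop()` is called → force-killed after 5 seconds
- Upstream: Sentry and Alertmanager analysis tasks both run as Worker background tasks
- Downstream: PostgreSQL writes may be incomplete (Incident left in a dirty state)

**Fix: align the values across three layers**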
|
||||
|
||||
```yaml
# k8s/awoooi-prod/08-deployment-worker.yaml
# Set terminationGracePeriodSeconds
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 90  # give the Worker enough time to finish its current task
      containers:
        - name: awoooi-worker
          lifecycle:
            preStop:
              exec:
                # Runs before SIGTERM reaches the container; these 5 s count against the 90 s grace period
                command: ["/bin/sh", "-c", "sleep 5"]
```
|
||||
|
||||
```python
# apps/api/src/workers/signal_worker.py
# Align the stop() timeout with the K8s terminationGracePeriodSeconds
import asyncio


async def stop(self) -> None:
    if not self._running:
        return
    self._running = False
    if self._task:
        try:
            # Raised from 5 s to 75 s (15 s of headroom below terminationGracePeriodSeconds=90)
            await asyncio.wait_for(self._task, timeout=75.0)
        except asyncio.TimeoutError:  # only an alias of the builtin TimeoutError on Python 3.11+
            logger.warning("signal_worker_stop_timeout_forcekill")
            # wait_for() has already cancelled the task; this call is a harmless safeguard
            self._task.cancel()
        except asyncio.CancelledError:
            pass
    logger.info("signal_worker_stopped")
```
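The alignment between the three layers is plain arithmetic: the longest task must fit inside the `stop()` timeout, and the `preStop` sleep plus the `stop()` timeout must fit inside the Kubernetes grace period, because the grace-period countdown starts before the `preStop` hook runs. A small check like the following could guard the numbers in CI. It is a sketch only, and `shutdown_budget_ok` is a hypothetical helper.

```python
def shutdown_budget_ok(
    grace_period_s: float,
    pre_stop_s: float,
    stop_timeout_s: float,
    longest_task_s: float,
) -> bool:
    """Check that the shutdown timeouts nest inside each other.

    longest_task_s <= stop_timeout_s                 (the task can finish)
    pre_stop_s + stop_timeout_s <= grace_period_s    (K8s does not SIGKILL first)
    """
    task_fits = longest_task_s <= stop_timeout_s
    stop_fits = pre_stop_s + stop_timeout_s <= grace_period_s
    return task_fits and stop_fits
```

With the proposed values (90 s grace period, 5 s `preStop`, 75 s `stop()` timeout, 60 s longest task) both inequalities hold with 10 s to spare on the Kubernetes side; with the current values (30 s default grace period, 5 s timeout) the first inequality already fails.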
|
||||
|
||||
**Integration verification commands**:
```bash
# Simulate a K8s scale-down and confirm the Worker shuts down gracefully.
# Follow the logs first: once the Pod is removed its logs can no longer be fetched.
kubectl logs -n awoooi-prod -l app=awoooi-worker -f --tail=20 &
kubectl scale deployment awoooi-worker -n awoooi-prod --replicas=0
wait

# Expected output:
# shutdown_signal_received signal=15
# signal_worker_shutting_down
# signal_worker_shutdown_complete
# (the whole sequence finishes within 90 seconds)
```
|
||||
|
||||
---
|
||||
|
||||
### 缺口 2:ESLint i18n 強制攔截(eslint-plugin-i18next)
|
||||
|
||||
**問題根源**:`.eslintrc.js:20-22` 只有 TODO 注解,未安裝 `eslint-plugin-i18next`。
|
||||
|
||||
**與整體整合的依賴**:
|
||||
- 這是 i18n 清零後的**防護層**:清零是清過去的債,ESLint 是防未來的債
|
||||
- 需在 `pnpm build` 和 `pnpm lint` CI 步驟中阻擋
|
||||
- 影響:所有前端開發流程(AI 生成代碼也必須通過)
|
||||
|
||||
**修復:安裝並啟用 Plugin**
|
||||
|
||||
```bash
|
||||
# Step 1: 安裝
|
||||
cd apps/web
|
||||
pnpm add -D eslint-plugin-i18next
|
||||
```
|
||||
|
||||
```javascript
|
||||
// apps/web/.eslintrc.js 修改後
|
||||
module.exports = {
|
||||
extends: [
|
||||
'@awoooi/eslint-config/react',
|
||||
'next/core-web-vitals',
|
||||
'plugin:i18next/recommended', // ← 新增
|
||||
],
|
||||
plugins: ['i18next'], // ← 新增
|
||||
parserOptions: {
|
||||
project: './tsconfig.json',
|
||||
tsconfigRootDir: __dirname,
|
||||
},
|
||||
rules: {
|
||||
'@next/next/no-html-link-for-pages': 'off',
|
||||
'no-console': 'off',
|
||||
|
||||
// 🚨 i18n 鐵律:所有 JSX 文字必須透過 t() 函式
|
||||
// 違反此規則 = PR 阻擋(error 級別)
|
||||
'i18next/no-literal-string': ['error', {
|
||||
markupOnly: true, // 只攔截 JSX 文字節點(非 JS 字串)
|
||||
ignoreAttribute: [ // 技術屬性不攔截
|
||||
'className', 'id', 'href', 'src', 'type', 'key',
|
||||
'data-testid', 'aria-label', 'placeholder'
|
||||
],
|
||||
}],
|
||||
|
||||
'@typescript-eslint/no-explicit-any': 'warn',
|
||||
'@typescript-eslint/no-unused-vars': ['warn', { argsIgnorePattern: '^_', varsIgnorePattern: '^_' }],
|
||||
'@typescript-eslint/consistent-type-imports': 'warn',
|
||||
'no-constant-condition': 'warn',
|
||||
},
|
||||
ignorePatterns: [
|
||||
'node_modules', '.next', 'out', 'dist', 'test-results',
|
||||
'*.config.js', '*.config.ts',
|
||||
],
|
||||
}
|
||||
```
|
||||
|
||||
**與 CI 整合(必須加入 cd.yaml)**:
|
||||
|
||||
```yaml
|
||||
# .github/workflows/cd.yaml 在 Build 之前加入 Lint 步驟
|
||||
- name: 🔍 ESLint i18n 強制檢查
|
||||
run: |
|
||||
cd apps/web
|
||||
pnpm lint
|
||||
# 失敗 = 有硬編碼字串 = 直接阻擋部署
|
||||
```
|
||||
|
||||
**⚠️ 重要**:第一次啟用 `eslint-plugin-i18next` 後,**現有的 40+ 違規會立刻全部報錯**。因此必須先完成 i18n 清零,再啟用此 Rule。**正確順序**:
|
||||
1. i18n 清零(一次性修復 40+ 違規)
|
||||
2. 安裝 eslint-plugin-i18next(啟用防護)
|
||||
3. 加入 CI Lint 步驟
|
||||
|
||||
---
|
||||
|
||||
### 缺口 3:Visual Regression 跨平台渲染問題
|
||||
|
||||
**問題根源**:Mac(M1/M2/M3)vs GitHub Actions(Linux Ubuntu)的字體渲染引擎不同(CoreText vs FreeType),導致截圖像素不吻合。
|
||||
|
||||
**與整體整合的依賴**:
|
||||
- Baseline 快照必須統一來源(CI Docker 環境)
|
||||
- 每次更新 Baseline 必須是可審計的(透過 PR),不能在本機靜默更新
|
||||
|
||||
**修復:Docker 強制基準線更新 + threshold 調整**
|
||||
|
||||
```json
|
||||
// apps/web/package.json 新增 scripts
|
||||
{
|
||||
"scripts": {
|
||||
"test:visual:update": "docker run --rm -v $(pwd):/work -w /work -p 3000:3000 mcr.microsoft.com/playwright:v1.44.0-jammy pnpm exec playwright test --update-snapshots --project=chromium --grep @visual",
|
||||
"test:visual": "pnpm exec playwright test --project=chromium --grep @visual"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
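```typescript
// apps/web/playwright.config.ts: screenshot comparison settings
import { defineConfig } from '@playwright/test';

export default defineConfig({
  expect: {
    toHaveScreenshot: {
      // Per-pixel color tolerance on a 0-1 scale, not a percentage of pixels
      // (Playwright's default is 0.2, so 0.05 is stricter)
      threshold: 0.05,
      // Allow up to 5% of pixels to differ (absorbs small cross-platform differences)
      maxDiffPixelRatio: 0.05,
      // Force the CI environment's font settings
    },
  },
  use: {
    // Fix the viewport size for screenshots to avoid differences between screen DPIs
    viewport: { width: 1280, height: 720 },
    deviceScaleFactor: 1,  // force 1x to avoid Retina differences
  },
});
```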
|
||||
|
||||
**強制規範(加入 .awoooi-agent-rules.md 條款)**:
|
||||
|
||||
```markdown
|
||||
## 條款 21:Visual Regression Baseline 更新規範
|
||||
|
||||
🚨 絕對禁止在本機 Mac 環境執行 `--update-snapshots`
|
||||
✅ 更新 Baseline 必須透過以下流程:
|
||||
|
||||
1. 在本機執行:`pnpm test:visual:update`(Docker 環境)
|
||||
2. Docker 生成的 .png 截圖自動存入 tests/e2e/__snapshots__/
|
||||
3. 提 PR,標注 📸 VISUAL_UPDATE
|
||||
4. 統帥視覺審核截圖後方可合併
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 缺口 4:PostgreSQL HA(Patroni / CloudNativePG)
|
||||
|
||||
**問題根源**:PostgreSQL 在 .188 上是單一 Docker 容器,K3s 的 Datastore 也依賴它(ADR-033)。資料庫掛掉 = K3s 控制面 + AWOOOI 資料同時失效。
|
||||
|
||||
**與整體整合的依賴**:
|
||||
- PostgreSQL 是 AWOOOI 的 Episodic Memory(Incidents、Approvals、Audit Logs)
|
||||
- PostgreSQL 也是 K3s 的 HA Datastore(120/121 節點的 K3s 元數據)
|
||||
- Auto-Repair 對 PostgreSQL 執行 `docker restart` 是**危險的**(可能 Dirty Page)
|
||||
|
||||
**修復路線圖(三階段)**:
|
||||
|
||||
```
|
||||
Phase DB-A(1週,低風險):
|
||||
監控補強
|
||||
├── 啟用 PG slow query log (log_min_duration_statement = 2000ms)
|
||||
├── 加入 pg_stat_statements extension 並接入 Prometheus
|
||||
└── 關閉 auto_repair.postgres.restart(防止 Dirty Page)
|
||||
|
||||
Phase DB-B(1月,中風險):
|
||||
備份策略
|
||||
├── Velero + PostgreSQL Volume Snapshot(已有 Velero,需加 Volume 備份)
|
||||
└── 確認 WAL archiving 到 MinIO(WAL-E/WAL-G)
|
||||
|
||||
Phase DB-C(Q2,需評估):
|
||||
HA 策略評估
|
||||
├── 方案 A:CloudNativePG(K8s 原生 PostgreSQL Operator)
|
||||
│ → 在 K3s 中部署 CloudNativePG,主從自動切換
|
||||
├── 方案 B:Patroni + keepalived(VM 層 HA)
|
||||
│ → 在 .188 和備用機上部署 Patroni
|
||||
└── 方案 C:Citus(分片,過於複雜,暫不考慮)
|
||||
|
||||
推薦:方案 A (CloudNativePG),與 K3s 最整合
|
||||
```
|
||||
|
||||
**Protective measures that can be applied immediately**:

```yaml
# k8s/awoooi-prod/manual-remediation/postgres-recovery.yaml
# PostgreSQL emergency recovery playbook (manual operation)

# Event: PostgreSQL is down
# Actions:
# 1. OpenClaw raises an alert + Telegram
# 2. Alertmanager generates a CRITICAL Approval (no automatic repair)
# 3. After the Commander approves, run:
#    ssh root@192.168.0.188 'docker restart awoooi-postgres'
#    kubectl rollout restart deployment/awoooi-api -n awoooi-prod
#    kubectl rollout restart deployment/awoooi-worker -n awoooi-prod
```
|
||||
|
||||
---
|
||||
|
||||
### 缺口 5:Redis HA(Sentinel 模式)
|
||||
|
||||
**問題根源**:Redis 在 .188 上是單一容器(port 6380),無備援。Redis 同時承載:
|
||||
- Working Memory(Incident 聚合狀態)
|
||||
- SSE Terminal Event Store(未來的 Event Source)
|
||||
- Sentry Dedup Cache(10分鐘去重 TTL)
|
||||
- Anomaly Counter(ADR-037 核心數據)
|
||||
|
||||
**與整體整合的依賴**:
|
||||
- Redis 掛掉 = Signal Worker 無法消費事件 = 整個告警鏈路中斷
|
||||
- AOF 啟用對性能有影響,需要評估
|
||||
|
||||
**修復路線圖**:
|
||||
|
||||
```
|
||||
Phase Redis-A(立即,0風險):
|
||||
確認 AOF 配置
|
||||
├── ssh root@192.168.0.188 'docker exec awoooi-redis redis-cli CONFIG GET appendonly'
|
||||
└── 確認 appendonly yes(否則重啟後 Working Memory 歸零)
|
||||
|
||||
Phase Redis-B(1月,中等工程量):
|
||||
Redis Sentinel 部署(在 .110 上部署 Sentinel + Replica)
|
||||
├── .188:Master(現有)
|
||||
├── .110:Replica + Sentinel
|
||||
└── OpenClaw 使用 redis-sentinel:// URI,自動發現 Master
|
||||
|
||||
配置變更:
|
||||
# AWOOOI API 連線改用 Sentinel
|
||||
REDIS_URL=redis-sentinel://sentinel1:26379/awoooi-master
|
||||
```
|
||||
|
||||
**Protective measures that can be applied immediately**:

```bash
# Check the Redis AOF status
ssh root@192.168.0.188 'docker exec awoooi-redis redis-cli CONFIG GET appendonly; \
  docker exec awoooi-redis redis-cli CONFIG GET appendfsync; \
  docker exec awoooi-redis redis-cli INFO persistence | grep aof'

# If appendonly = no, enable it right away (no Redis restart needed)
ssh root@192.168.0.188 'docker exec awoooi-redis redis-cli CONFIG SET appendonly yes'
# Note: CONFIG SET takes effect immediately but is not persisted. Also set
# "appendonly yes" in redis.conf (or run CONFIG REWRITE) so it survives a restart.
```
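The same check-then-enable step can be scripted so that Wave 0 does not depend on someone reading `redis-cli` output by eye. The sketch below assumes a redis-py style client object exposing `config_get`, `config_set`, and `config_rewrite`; `ensure_aof` itself is a hypothetical helper. Note that `CONFIG REWRITE` fails when the server was started without a config file, which is common for containers, so in that case the setting has to be added to the container's start-up arguments instead.

```python
def ensure_aof(client) -> bool:
    """Enable AOF at runtime if it is off. Returns True if a change was made.

    `client` is assumed to behave like a redis-py `Redis` instance.
    """
    current = client.config_get("appendonly").get("appendonly")
    if current in ("yes", b"yes"):
        return False  # already enabled, nothing to do
    client.config_set("appendonly", "yes")  # effective immediately, no restart
    client.config_rewrite()  # persist to redis.conf so it survives a restart
    return True
```

Because the function only touches three duck-typed methods, it can be exercised in tests with a small fake client instead of a live Redis.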
|
||||
|
||||
---
|
||||
|
||||
### 缺口 6:SSE Event Sourcing(Terminal 狀態不丟失)
|
||||
|
||||
**問題根源**:Omni-Terminal 的 SSE 串流是無狀態的,F5 刷新後所有 `render_ui` GenUI 卡片消失。
|
||||
|
||||
**與整體整合的依賴**:
|
||||
- 這是 Agentic Workspace 用戶體驗的底層設施
|
||||
- 依賴 Redis List 作為 Event Store(如果 Redis 無 AOF,重啟後也丟)
|
||||
- 必須與 SSE 三種事件類型設計同步建立
|
||||
|
||||
**修復:三層機制**
|
||||
|
||||
```
|
||||
Layer 1: 後端 Event Store(Redis List)
|
||||
terminal.py → 每個 SSE 事件同步寫入 Redis List
|
||||
Key: terminal:events:{command_id}
|
||||
TTL: 3600 秒(1小時)
|
||||
|
||||
Layer 2: 前端 Reconnect(Last-Event-ID)
|
||||
useTerminalSSE → EventSource 自動帶 Last-Event-ID
|
||||
後端收到後:從 Redis 撈出錯過的事件 → Replay → 接上即時 Stream
|
||||
|
||||
Layer 3: 本地 Zustand 持久化
|
||||
useTerminalStore → 用 zustand/middleware/persist 持久化到 sessionStorage
|
||||
F5 刷新 → 從 sessionStorage 恢復 GenUI 卡片(UI 層快速恢復)
|
||||
同時 → SSE 重連補齊 Server 端新事件
|
||||
```
|
||||
|
||||
**Key backend implementation code**:

```python
# apps/api/src/api/v1/terminal.py: adds the Event Store mechanism

import json

from fastapi import APIRouter
from fastapi.responses import StreamingResponse

from src.core.redis_client import get_redis

router = APIRouter()  # in the real module, reuse the existing terminal router


async def stream_with_persistence(command_id: str, event_type: str, data: dict):
    """
    Emit an SSE event and write it to the Redis Event Store at the same time,
    so that it can be replayed after an F5 refresh.
    """
    redis = get_redis()

    event_payload = {
        "type": event_type,
        "data": data,
        "timestamp": now_taipei_iso()  # project helper; its import is not shown in the original
    }

    # Append to the Redis List; RPUSH returns the new list length, used here as the event id
    key = f"terminal:events:{command_id}"
    event_id = await redis.rpush(key, json.dumps(event_payload))
    await redis.expire(key, 3600)  # cleaned up automatically after 1 hour

    # Return the SSE-formatted string
    return f"id: {event_id}\nevent: {event_type}\ndata: {json.dumps(data)}\n\n"


@router.get("/stream/{command_id}/replay")
async def replay_terminal_events(command_id: str, last_event_id: int = 0):
    """
    Replay missed events starting after the given id (used when reconnecting after F5).
    """
    redis = get_redis()
    key = f"terminal:events:{command_id}"

    # Fetch every event after last_event_id (ids are 1-based list positions)
    events = await redis.lrange(key, last_event_id, -1)

    async def generate():
        for i, event_json in enumerate(events):
            event = json.loads(event_json)
            yield f"id: {last_event_id + i + 1}\nevent: {event['type']}\ndata: {json.dumps(event['data'])}\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")
```
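One caveat with list-position ids: if the list is ever trimmed to a fixed length (for example with `LTRIM` to the most recent 50 events), positions shift and a replay by index returns the wrong events. Storing a monotonically increasing id with each event avoids this. The following in-memory sketch illustrates the idea only; `TerminalEventStore` and `format_sse` are hypothetical names, and a Redis-backed version would keep the counter in Redis (for instance via `INCR`) rather than in process memory.

```python
import json
from collections import deque


def format_sse(event_id: int, event_type: str, data: dict) -> str:
    """Render one Server-Sent Events frame."""
    return f"id: {event_id}\nevent: {event_type}\ndata: {json.dumps(data)}\n\n"


class TerminalEventStore:
    """Bounded event log whose ids stay stable when old events are dropped."""

    def __init__(self, max_events: int = 50):
        self._events = deque(maxlen=max_events)  # oldest entries fall off automatically
        self._next_id = 1

    def append(self, event_type: str, data: dict) -> int:
        event_id = self._next_id
        self._next_id += 1
        self._events.append((event_id, event_type, data))  # the id travels with the event
        return event_id

    def replay_after(self, last_event_id: int) -> list[str]:
        """Return SSE frames for every retained event newer than last_event_id."""
        return [
            format_sse(event_id, event_type, data)
            for event_id, event_type, data in self._events
            if event_id > last_event_id
        ]
```

A client that reconnects with `Last-Event-ID: 55` after 60 events receives exactly events 56 to 60, regardless of how many older events were trimmed in the meantime.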
|
||||
|
||||
---
|
||||
|
||||
## 第四部分:整合依賴關係圖(執行順序)
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────────┐
|
||||
│ 整合執行優先序 │
|
||||
├──────────────────────────────────────────────────────────────────────┤
|
||||
│ │
|
||||
│ 🔴 P0 立即執行(本週,阻塞後續工作) │
|
||||
│ ┌─────────────────────────────────────────────────────────┐ │
|
||||
│ │ 1. 確認 Redis AOF 狀態(5min) │ │
|
||||
│ │ 2. Worker terminationGracePeriodSeconds 修正(1h) │ │
|
||||
│ │ 3. i18n 清零(4h)← 必須先於 ESLint Plugin 安裝 │ │
|
||||
│ │ 4. ESLint i18n Plugin 安裝並啟用(1h) │ │
|
||||
│ └─────────────────────────────────────────────────────────┘ │
|
||||
│ ↓(完成後解鎖) │
|
||||
│ │
|
||||
│ 🟠 P1 短期(2-3週) │
|
||||
│ ┌─────────────────────────────────────────────────────────┐ │
|
||||
│ │ 5. Sentry Comment Token 配置(2h) │ │
|
||||
│ │ 6. SignOz 告警規則部署到 .188(2h) │ │
|
||||
│ │ 7. Worker HPA YAML 部署(30min) │ │
|
||||
│ │ 8. E2E CI Weekly 排程(30min) │ │
|
||||
│ │ 9. Visual Regression Docker 基準線建立(2h) │ │
|
||||
│ └─────────────────────────────────────────────────────────┘ │
|
||||
│ ↓(完成後解鎖) │
|
||||
│ │
|
||||
│ 🟡 P2 中期(Month 2) │
|
||||
│ ┌─────────────────────────────────────────────────────────┐ │
|
||||
│ │ 10. Omni-Terminal SSE Event Sourcing(8h) │ │
|
||||
│ │ 11. Storybook 10 核心組件(8h) │ │
|
||||
│ │ 12. Nexus AI 自治率 UI(8h) │ │
|
||||
│ │ 13. FinOps Dashboard UI(8h) │ │
|
||||
│ │ 14. Redis Sentinel 部署(1天) │ │
|
||||
│ └─────────────────────────────────────────────────────────┘ │
|
||||
│ ↓(完成後解鎖) │
|
||||
│ │
|
||||
│ ⚪ P3 長期(Q2-Q3) │
|
||||
│ ┌─────────────────────────────────────────────────────────┐ │
|
||||
│ │ 15. CloudNativePG 評估與導入(2天+) │ │
|
||||
│ │ 16. Kali SecurityAgent(MCP Tool 化) │ │
|
||||
│ │ 17. Knowledge Base 後端全建 │ │
|
||||
│ │ 18. Phase 4 視覺靈魂注入 │ │
|
||||
│ └─────────────────────────────────────────────────────────┘ │
|
||||
└─────────────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 第五部分:整合風險矩陣
|
||||
|
||||
| 風險 | 可能性 | 影響 | 緩解措施 |
|
||||
|------|--------|------|---------|
|
||||
| **ESLint 啟用後大量報錯** | 100%(有 40+ 違規) | CI 完全阻塞 | 先清零再啟用,正確順序執行 |
|
||||
| **Worker timeout 修改引發 Pod 啟動異常** | 低 | 服務中斷 | 先在 Dev namespace 測試 |
|
||||
| **Redis AOF 啟用影響性能** | 中 | 延遲微增 | 使用 `appendfsync everysec`(非 `always`)|
|
||||
| **Visual Regression Docker 第一次 Baseline** | — | 需要 1-2h 產生基準線 | 排在非尖峰時段執行 |
|
||||
| **PostgreSQL 無 HA 期間主庫故障** | 低 | 完全停機 | 備份策略(Velero)+ Playbook 就位 |
|
||||
| **SSE Event Sourcing Redis 依賴** | — | Redis 故障時 Event 丟失 | 先解決 Redis AOF,再實作 Event Sourcing |
|
||||
|
||||
---
|
||||
|
||||
## 第六部分:監控機制與前端的整合設計原則
|
||||
|
||||
(承接 `MONITORING_ARCHITECTURE_DEEP_DIVE.md` 的「三義分離原則」)
|
||||
|
||||
**整合設計的核心約束**:
|
||||
|
||||
```
|
||||
1. 監控數據 → 後台靜默消化,不直接呈現給統帥
|
||||
└─ 99%:Prometheus/SigNoz/Sentry 原始數據
|
||||
└─ 1%:AI 無法自動處理 → 浮現為 ApprovalCard
|
||||
|
||||
2. 前端不直接查詢 Prometheus/SigNoz
|
||||
└─ 所有監控數據透過 AWOOOI API 統一封裝
|
||||
└─ API 層:/api/v1/stats/overview, /api/v1/slo, /api/v1/finops
|
||||
|
||||
3. 深度調查只能透過「智能跳脫」(新分頁)
|
||||
└─ GenUI 卡片提供 ExternalLinks 按鈕
|
||||
└─ 絕對禁止 iframe 嵌入 Grafana/SigNoz
|
||||
|
||||
4. AI 自治率指數是前端唯一的「監控摘要」入口
|
||||
└─ Dashboard (/) 最頂部:AutonomyIndexPanel
|
||||
└─ 用一個數字替代所有圖表
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
*「整合不是把所有工具接在一起,而是讓所有工具服務同一個大腦。」* 🦞
|
||||
402
docs/proposals/MASTER_EXECUTION_SCHEDULE.md
Normal file
@@ -0,0 +1,402 @@
|
||||
# AWOOOI 最終執行主排程
|
||||
# Master Execution Schedule — 統帥審核版
|
||||
|
||||
> **文件類型**: 最終執行授權書
|
||||
> **建立**: 2026-03-29 14:05 (台北)
|
||||
> **狀態**: 🔴 待統帥審核與授權
|
||||
> **本文件為所有 RunBook 和 ADR 的執行定序總綱**
|
||||
|
||||
---
|
||||
|
||||
## Chapter 1: Final Confirmed Gap List (verified against code)

### 1.1 Problems confirmed to exist (16 items)

| # | Problem | Code location | Severity | Wave |
|---|---------|---------------|----------|------|
| 1 | Worker stop() timeout is only 5 seconds | `signal_worker.py:147` | 🔴 P0 | Wave 1 |
| 2 | XCLAIM / Active Sweeper entirely missing | `signal_worker.py` (whole file) | 🔴 P0 | Wave 1 |
| 3 | StatefulSet auto-repair has no hard block | `auto_repair_service.py:159` | 🔴 P0 | Wave 1 |
| 4 | AnomalyCounter Redis calls have no try/except | `anomaly_counter.py:147` | 🔴 P0 | Wave 1 |
| 5 | OpenClaw has no Circuit Breaker | `sentry_webhook.py:289` | 🔴 P0 | Wave 1 |
| 6 | OpenClaw has no Concurrency Semaphore | `sentry_webhook.py` | 🔴 P0 | Wave 1 |
| 7 | Global Incident Debounce missing | `incident_service.py` | 🔴 P0 | Wave 1 |
| 8 | Global Auto-Repair Cooldown missing | `auto_repair_service.py` | 🔴 P0 | Wave 1 |
| 9 | terminationGracePeriodSeconds not set | `08-deployment-worker.yaml` | 🔴 P0 | Wave 1 |
| 10 | ESLint i18n Plugin is only a TODO | `.eslintrc.js:20-22` | 🟠 P1 | Wave 3 |
| 11 | 40+ i18n violations | `TECHNICAL_DEBT_PHASE2.md` | 🟠 P1 | Wave 3 |
| 12 | Playwright ignoreHTTPSErrors not set | `playwright.config.ts` | 🟠 P1 | Wave 4 |
| 13 | No Docker rule for the Visual Baseline | (design gap) | 🟠 P1 | Wave 4 |
| 14 | E2E has no Auth Bypass (global.setup.ts missing) | `tests/e2e/` | 🟠 P1 | Wave 4 |
| 15 | SSE Event Sourcing not yet implemented | `terminal.py` | 🟡 P2 | Wave 7 |
| 16 | Redis AOF/eviction unconfirmed | `.188 host` | 🔴 P0 | Wave 0 |
|
||||
|
||||
### 1.2 已確認不存在(稽核報告誤判)
|
||||
|
||||
| 項目 | 誤判內容 | 真實現況 |
|
||||
|------|---------|---------|
|
||||
| Worker SIGTERM 缺失 | 報告說「需要實作」| ✅ 已實作(line 450-455)|
|
||||
| ADR-035 備援路徑 | 說「可觀測循環依賴」| ✅ Layer 3 繞過 AWOOOI 直接打 Telegram |
|
||||
| Sentry Comment 缺失 | 說「尚未整合」| ✅ LOGBOOK 確認 Wave A.4 已整合 |
|
||||
| CD Secret 注入 | 說「缺失」| ✅ ADR-035 確認 CD 已有自動注入 |
|
||||
|
||||
---
|
||||
|
||||
## 第二章:ADR 評估
|
||||
|
||||
### 2.1 本次新建 ADR(已建立)
|
||||
|
||||
| ADR | 標題 | 說明 |
|
||||
|-----|------|------|
|
||||
| [ADR-038](file:///Users/ogt/awoooi/docs/adr/ADR-038-openclaw-concurrency-governance.md) | OpenClaw 推理引擎併發治理 | Semaphore + Circuit Breaker 雙層保護 |
|
||||
| [ADR-039](file:///Users/ogt/awoooi/docs/adr/ADR-039-global-autorepair-governance.md) | 全域自動修復熔斷機制 | Global Cooldown + StatefulSet 黑名單 |
|
||||
|
||||
### 2.2 現有 ADR 需更新(次要)
|
||||
|
||||
| ADR | 需追加內容 | 優先度 |
|
||||
|-----|---------|--------|
|
||||
| ADR-020 E2E Verification | 加入 global.setup.ts 和 Auth Bypass 規範 | 🟡 Wave 4 前 |
|
||||
| ADR-028 Failure Auto-Repair | 加入對 ADR-039 的引用說明 | 🟡 Wave 1 後 |
|
||||
|
||||
### 2.3 不需要新建 ADR 的項目
|
||||
|
||||
| 項目 | 理由 |
|
||||
|------|------|
|
||||
| XCLAIM / Active Sweeper | 屬於 ADR-037 Signal Worker 實作細節 |
|
||||
| terminationGracePeriodSeconds | 屬於 K8s 操作規範,不是架構決策 |
|
||||
| ESLint i18n | 屬於 ADR-002 設計系統的工具鏈細節 |
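Since XCLAIM / Active Sweeper is treated here as an implementation detail, a short sketch may help show what that detail amounts to: list the consumer group's pending entries, pick those that have been idle too long (their consumer probably died), and claim them for a live consumer. The function below assumes a redis-py style client exposing `xpending_range` and `xclaim`; `claim_orphaned_tasks`, the 60-second idle threshold, and the group and consumer names are illustrative assumptions. With `redis.asyncio` the two calls would be awaited.

```python
def claim_orphaned_tasks(
    client,
    stream: str,
    group: str,
    consumer: str,
    min_idle_ms: int = 60_000,
    batch: int = 100,
) -> list:
    """Take over stream entries whose original consumer stopped acknowledging them."""
    # Each pending entry is a dict with message_id, consumer,
    # time_since_delivered (ms) and times_delivered.
    pending = client.xpending_range(stream, group, min="-", max="+", count=batch)
    stale_ids = [
        entry["message_id"]
        for entry in pending
        if entry["time_since_delivered"] >= min_idle_ms
    ]
    if not stale_ids:
        return []
    # XCLAIM re-checks the idle time server-side, so an entry that was
    # acknowledged or re-delivered in the meantime is not stolen.
    return client.xclaim(
        stream, group, consumer, min_idle_time=min_idle_ms, message_ids=stale_ids
    )
```

An Active Sweeper is then just this function called on a timer (the schedule elsewhere in this document suggests every 5 minutes), with the returned entries fed back into the normal processing path.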
|
||||
|
||||
---
|
||||
|
||||
## 第三章:Skills 更新評估
|
||||
|
||||
### 3.1 必須更新的 Skills
|
||||
|
||||
#### Skill 02: leWOOOgo Backend Core(新增章節)
|
||||
|
||||
```markdown
|
||||
## 🛡️ OpenClaw 推理保護模式 (ADR-038, 2026-03-29)
|
||||
|
||||
### 鐵律:所有 OpenClaw 呼叫必須雙層保護
|
||||
|
||||
from src.core.circuit_breaker import get_openclaw_guard
|
||||
|
||||
async def call_openclaw_analyzer(...):
|
||||
guard = get_openclaw_guard()
|
||||
if guard.is_circuit_open(): # Layer 1: 斷路
|
||||
return None
|
||||
async with guard.semaphore: # Layer 2: 限流(最多 3 並發)
|
||||
# ... httpx 請求 ...
|
||||
|
||||
## 🔴 全域修復冷卻 (ADR-039, 2026-03-29)
|
||||
|
||||
### 鐵律:任何自動修復前必須呼叫 check_global_repair_cooldown()
|
||||
|
||||
可以修復的服務: 僅無狀態服務(awoooi-api, awoooi-web, awoooi-worker)
|
||||
絕對禁止修復: postgres, redis, clickhouse, minio, etcd
|
||||
```
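To make the two-layer protection in the Skill 02 excerpt concrete, here is a minimal sketch of a guard object with the same surface (`is_circuit_open()` and a `semaphore` attribute). It is not the ADR-038 implementation: the failure threshold, the reset interval, the injectable `clock`, and the `guarded_call` wrapper are assumptions made for illustration.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional


class OpenClawGuard:
    """Circuit breaker (layer 1) plus concurrency limit (layer 2)."""

    def __init__(self, max_concurrency: int = 3, failure_threshold: int = 5,
                 reset_after_s: float = 60.0, clock: Callable[[], float] = time.monotonic):
        self.semaphore = asyncio.Semaphore(max_concurrency)
        self._failure_threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_circuit_open(self) -> bool:
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self._reset_after_s:
            # Cool-down elapsed: close the circuit and allow traffic again.
            self._opened_at = None
            self._failures = 0
            return False
        return True

    def record_success(self) -> None:
        self._failures = 0

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._failure_threshold:
            self._opened_at = self._clock()


async def guarded_call(guard: OpenClawGuard, request: Callable[[], Awaitable]):
    """Run request() under both protections; return None when skipped or failed."""
    if guard.is_circuit_open():      # layer 1: fail fast while the backend is down
        return None
    async with guard.semaphore:      # layer 2: cap concurrent in-flight requests
        try:
            result = await request()
        except Exception:
            guard.record_failure()
            return None
        guard.record_success()
        return result
```

Checking the circuit before acquiring the semaphore means that, during an outage, callers return immediately instead of queueing behind three requests that are each waiting for a timeout.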
|
||||
|
||||
#### Skill 05: AWOOOI SRE & QA(新增章節)
|
||||
|
||||
```markdown
|
||||
## 🎭 E2E Auth Bypass 鐵律 (2026-03-29)
|
||||
|
||||
### 必須在 tests/e2e/global.setup.ts 實作登入態
|
||||
|
||||
// global.setup.ts
// E2E_BASE_URL is an assumed variable name: the relative login URL needs a baseURL to resolve against.
import { chromium } from '@playwright/test';

export default async function setup() {
  const browser = await chromium.launch();
  const page = await browser.newPage({ baseURL: process.env.E2E_BASE_URL });
  await page.request.post('/api/v1/auth/login', { data: { username: 'demo', password: process.env.E2E_PASSWORD } });
  await page.context().storageState({ path: 'e2e-auth-state.json' });
  await browser.close();
}
|
||||
|
||||
### 絕對禁止在 Mac 本機產生 Visual Baseline
|
||||
使用: pnpm test:visual:update(Docker 環境)
|
||||
禁止: pnpm exec playwright test --update-snapshots(本機)
|
||||
```
|
||||
|
||||
> **注意**:Skills 更新將在 Wave 1 代碼合併後,由後續 Session 執行。本次記錄評估結論即可。
|
||||
|
||||
---
|
||||
|
||||
## 第四章:模組化合規驗證
|
||||
|
||||
### 4.1 Wave 1 新代碼的合規性
|
||||
|
||||
| New code | Layer | Dependencies | Interface | Compliance |
|----------|-------|--------------|-----------|------------|
| `core/circuit_breaker.py` | core infrastructure | stdlib + structlog | plain class (no Protocol needed) | ✅ Compliant |
| `services/global_repair_cooldown.py` | Service layer | Redis (via get_redis()) | functional API | ✅ Compliant |
| `signal_worker.py` XCLAIM addition | Worker (existing) | Redis Stream | no new dependencies | ✅ Compliant |
| `anomaly_counter.py` Degradation | Service layer | no new dependencies | existing Protocol | ✅ Compliant |
|
||||
|
||||
### 4.2 違規預防
|
||||
|
||||
| 規則 | 驗證方式 |
|
||||
|------|---------|
|
||||
| Router 不直接存取 Redis | Code Review:所有 Webhook router 只呼叫 Service |
|
||||
| Semaphore 在 core/ | `circuit_breaker.py` 放在 `src/core/`,非 `src/services/` |
|
||||
| Singleton 透過工廠函數 | `get_openclaw_guard()`、`get_global_repair_cooldown()` |
|
||||
|
||||
---
|
||||
|
||||
## 第五章:整合工作衝突分析
|
||||
|
||||
### 5.1 必須串行的依賴關係
|
||||
|
||||
```
|
||||
XCLAIM 代碼合併 → 才能部署 Worker HPA
|
||||
i18n 清零完成 → 才能啟用 ESLint i18n Plugin(error 模式)
|
||||
release/v1.x 建立 → 才能宣佈 Frontend Feature Freeze
|
||||
global.setup.ts 實作 → 才能有效執行 E2E Visual 測試
|
||||
Redis AOF 確認 → 才能實作 SSE Event Sourcing(依賴 Redis 持久化)
|
||||
```
|
||||
|
||||
### 5.2 可以並行的工作
|
||||
|
||||
```
|
||||
Wave 1 後端代碼修改(5 項)可以並行開發,捆綁一個 PR
|
||||
Sentry Token 配置 與 SignOz 告警規則部署 可以並行
|
||||
Storybook 建置 與 Omni-Terminal SSE Event Sourcing 可以並行(不同分支)
|
||||
```
|
||||
|
||||
### 5.3 確認無衝突的部分
|
||||
|
||||
| 項目 | 衝突評估 | 結論 |
|
||||
|------|---------|------|
|
||||
| terminationGracePeriodSeconds | 需與 XCLAIM 同一 PR(相互依賴)| ✅ Wave 1 捆綁 |
|
||||
| Global Cooldown + Global Debounce | 同為 Wave 1,無衝突 | ✅ 同一 PR |
|
||||
| ADR-038 + ADR-039 同時生效 | 需確保 auto_repair_service 引用兩者 | ✅ 已在 ADR 中指定 |
|
||||
|
||||
---
|
||||
|
||||
## 第六章:詳細實施步驟
|
||||
|
||||
### Wave 0: 即時止血(統帥手動確認,當天)
|
||||
|
||||
```bash
|
||||
# 0.1 確認 Redis AOF 狀態(5 分鐘)
|
||||
ssh root@192.168.0.188 \
|
||||
'docker exec awoooi-redis redis-cli CONFIG GET appendonly; \
|
||||
docker exec awoooi-redis redis-cli CONFIG GET maxmemory-policy'
|
||||
# 預期:appendonly=yes, policy=volatile-ttl 或 noeviction
|
||||
# 若非 yes 則立即啟用:docker exec awoooi-redis redis-cli CONFIG SET appendonly yes
|
||||
|
||||
# 0.2 確認 Alertmanager 備援 Telegram(10 分鐘)
|
||||
ssh root@192.168.0.188 'docker exec alertmanager cat /etc/alertmanager/alertmanager.yml' | grep -A 5 receiver
|
||||
|
||||
# 0.3 建立 release/v1.x 穩定分支(5 分鐘)
|
||||
git checkout main && git pull
|
||||
git checkout -b release/v1.x && git push origin release/v1.x
|
||||
# 在 GitHub 設定 Protected Branch
|
||||
|
||||
# 0.4 確認 SENTRY_AUTH_TOKEN(3 分鐘)
|
||||
kubectl get secret awoooi-secrets -n awoooi-prod \
|
||||
-o jsonpath='{.data.SENTRY_AUTH_TOKEN}' | base64 -d | wc -c
|
||||
# > 0 = Phase D 已完成;= 0 = 需執行 RunBook Phase D
|
||||
```
|
||||
|
||||
### Wave 1: 底層安全網(代碼 PR,7.5h,原子性)
|
||||
|
||||
> **所有以下修改必須在同一個 PR 中,不可拆分!**
|
||||
|
||||
```
|
||||
PR 標題: feat(safety): Wave 1 底層安全網(ADR-038 + ADR-039)
|
||||
|
||||
檔案清單(9 個):
|
||||
新建:
|
||||
apps/api/src/core/circuit_breaker.py (ADR-038)
|
||||
apps/api/src/services/global_repair_cooldown.py (ADR-039)
|
||||
|
||||
修改:
|
||||
apps/api/src/services/anomaly_counter.py
|
||||
→ record_anomaly() 加 try/except + graceful degrade
|
||||
→ 提取 _record_anomaly_impl()
|
||||
|
||||
apps/api/src/workers/signal_worker.py
|
||||
→ start() 加 _reclaim_task = asyncio.create_task(_reclaim_loop())
|
||||
→ 新增 _claim_orphaned_tasks()(XCLAIM 實作)
|
||||
→ 新增 _reclaim_loop()(Active Sweeper,每 5 分鐘)
|
||||
→ stop() timeout 從 5s 改為 75s,同步取消 _reclaim_task
|
||||
|
||||
apps/api/src/services/auto_repair_service.py
|
||||
→ evaluate_auto_repair() 開頭加 check_global_repair_cooldown()
|
||||
→ execute_auto_repair() 成功後加 record_global_repair_action()
|
||||
→ 常數區加 STATEFUL_SERVICE_BLACKLIST
|
||||
|
||||
apps/api/src/api/v1/sentry_webhook.py
|
||||
→ call_openclaw_analyzer() 加雙層保護(Circuit Breaker + Semaphore)
|
||||
|
||||
apps/api/src/api/v1/signoz_webhook.py
|
||||
→ 同上
|
||||
|
||||
apps/api/src/services/incident_service.py
|
||||
→ process_signal() 前加 _check_global_incident_storm()
|
||||
|
||||
k8s/awoooi-prod/08-deployment-worker.yaml
|
||||
→ 加入 terminationGracePeriodSeconds: 90
|
||||
→ 加入 preStop: sleep 5
|
||||
```
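The `auto_repair_service.py` changes in the file list combine two independent gates: a hard block on stateful services and a global cool-down between repairs. The sketch below shows how the two could compose. It is an in-memory illustration under stated assumptions: the proposal stores the cool-down in Redis, the 300-second window and the substring match on service names are guesses, and `GlobalRepairCooldown` is a hypothetical class. The blacklist entries themselves are the ones listed in the Skill 02 excerpt.

```python
import time
from typing import Callable, Optional

# Services that must never be auto-repaired (from the Skill 02 excerpt).
STATEFUL_SERVICE_BLACKLIST = frozenset({"postgres", "redis", "clickhouse", "minio", "etcd"})


class GlobalRepairCooldown:
    """Allow at most one automatic repair per cool-down window, cluster-wide."""

    def __init__(self, cooldown_s: float = 300.0, clock: Callable[[], float] = time.monotonic):
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._last_repair_at: Optional[float] = None

    def check(self, service: str) -> tuple[bool, str]:
        """Return (allowed, reason). The blacklist is evaluated before the cool-down."""
        if any(name in service.lower() for name in STATEFUL_SERVICE_BLACKLIST):
            return False, f"{service} is stateful: manual approval required"
        if self._last_repair_at is not None:
            elapsed = self._clock() - self._last_repair_at
            if elapsed < self._cooldown_s:
                return False, "global repair cooldown active"
        return True, "ok"

    def record(self) -> None:
        """Call after a repair action succeeds to start the cool-down window."""
        self._last_repair_at = self._clock()
```

Evaluating the blacklist first means a stateful service is refused for the right reason even when no cool-down is active, which keeps the audit trail unambiguous.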
|
||||
|
||||
**Wave 1 PR 審核要點**:
|
||||
- `circuit_breaker.py` 在 `src/core/`(非 services)
|
||||
- `global_repair_cooldown.py` 在 `src/services/`(非 core)
|
||||
- 所有新代碼有 Google Style Docstring
|
||||
- 所有 Singleton 透過工廠函數暴露
|
||||
|
||||
**Wave 1 驗收指令**:
|
||||
```bash
|
||||
# 部署後驗證 terminationGracePeriodSeconds
|
||||
kubectl get deployment awoooi-worker -n awoooi-prod \
|
||||
-o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}'
|
||||
# 預期:90
|
||||
|
||||
# Verify Circuit Breaker + Semaphore (alpha test)
curl -X POST http://192.168.0.125:32334/api/v1/webhooks/alertmanager \
  -H 'Content-Type: application/json' \
  -d '{"alerts": [{"labels": {"alertname": "TestAlert"}}]}'
# Watch the API log for: openclaw_semaphore_acquired / openclaw_circuit_open_skip
|
||||
```
|
||||
|
||||
### Wave 2: Worker 擴縮容部署(30 分鐘,Wave 1 完成後)
|
||||
|
||||
```bash
|
||||
# 部署 Worker HPA
|
||||
kubectl apply -f k8s/awoooi-prod/12-hpa.yaml
|
||||
|
||||
# 驗證 HPA 建立
|
||||
kubectl get hpa awoooi-worker-hpa -n awoooi-prod
|
||||
# 預期:TARGETS 顯示實際 CPU%,MINPODS=1, MAXPODS=3
|
||||
```
|
||||
|
||||
### Wave 3: 前端主幹手術(4 天窗口,Week 1)
|
||||
|
||||
```bash
|
||||
# Day 1 準備:宣佈 Feature Freeze
|
||||
# Day 1-3:i18n 閃電清零(依 TECHNICAL_DEBT_PHASE2.md 清單)
|
||||
cd apps/web
|
||||
git checkout -b fix/i18n-zero-violation
|
||||
|
||||
# 執行清零(依序修復 9 個違規組)
|
||||
# 完成後驗證:
|
||||
pnpm exec tsc --noEmit
|
||||
|
||||
# 安裝 ESLint Plugin(先 Warn 模式)
|
||||
pnpm add -D eslint-plugin-i18next
|
||||
# 修改 .eslintrc.js(markupOnly: true,先 warn)
|
||||
|
||||
# Day 4:PR 合併 → 解除 Freeze
|
||||
git push origin fix/i18n-zero-violation
|
||||
# PR 審核後合併到 main
|
||||
# ESLint 切換為 error 模式
|
||||
```
|
||||
|
||||
### Wave 4: CI 基礎設施(1 天,Wave 3 後)
|
||||
|
||||
```bash
|
||||
# 修改 playwright.config.ts
|
||||
# 加入 ignoreHTTPSErrors: true + deviceScaleFactor: 1 + threshold: 0.05
|
||||
|
||||
# 建立 tests/e2e/global.setup.ts(E2E Auth Bypass)
|
||||
# 建立初始 Docker Visual Baseline
|
||||
cd apps/web
|
||||
pnpm test:visual:update # Docker 環境
|
||||
|
||||
# 部署 Weekly E2E Workflow
|
||||
git add .github/workflows/e2e-weekly.yaml
|
||||
git commit -m "feat(ci): weekly E2E + visual regression + Docker baseline"
|
||||
```
|
||||
|
||||
### Wave 5: Completing the alerting backend (2 hours, can start right after Wave 1)

```bash
# Deploy SENTRY_AUTH_TOKEN (if Wave 0.4 confirms it is missing)
kubectl patch secret awoooi-secrets -n awoooi-prod \
  --type=merge -p='{"stringData": {"SENTRY_AUTH_TOKEN": "YOUR_TOKEN"}}'

# Deploy the SignOz alert rules to .188
ssh root@192.168.0.188 'cat > /tmp/signoz-rules.yaml' < ops/signoz/alerting/rules.yaml
# Apply them via the SignOz API

# Verify the alert chain
python ops/scripts/alert_chain_smoke_test.py
```

> **Wave 6-12**: medium- and long-term work, carried out per INTEGRATION_ARCHITECTURE_MASTER.md.

---

## Chapter 7: Final work schedule (for the Commander's review)

```
📅 2026-03-29 (today) - Wave 0
  □ 0.1 Confirm Redis AOF + eviction policy (hands-on, 5 min)
  □ 0.2 Confirm the Alertmanager fallback path (hands-on, 10 min)
  □ 0.3 Create the release/v1.x branch + mark it Protected on GitHub (hands-on, 5 min)
  □ 0.4 Confirm SENTRY_AUTH_TOKEN exists (hands-on, 3 min)

📅 2026-03-30~04-01 (3 days) - Wave 1
  □ core/circuit_breaker.py development (2h)
  □ services/global_repair_cooldown.py development (1h)
  □ anomaly_counter.py Graceful Degrade (1h)
  □ signal_worker.py XCLAIM + Active Sweeper (2h)
  □ auto_repair_service.py Guardrail integration (1h)
  □ sentry_webhook.py + signoz_webhook.py two-layer protection (0.5h)
  □ incident_service.py Global Debounce (1.5h)
  □ 08-deployment-worker.yaml terminationGracePeriodSeconds (0.5h)
  □ Wave 1 PR submitted → reviewed → merged → CD deployment → acceptance

📅 2026-04-01 (after Wave 1 completes) - Wave 2
  □ Worker HPA YAML deployment (30 min)

📅 2026-04-02~04-05 (4 days) - Wave 3 (Feature Freeze)
  □ Announce the Frontend Feature Freeze
  □ i18n rapid zero-out (40+ violations, 4h of effort)
  □ Install the ESLint i18n plugin (warn mode)
  □ Merge the PR → Feature Unfreeze
  □ Switch ESLint to error mode

📅 2026-04-06 (1 day) - Wave 4 / Wave 5
  □ Edit playwright.config.ts (ignoreHTTPSErrors + threshold)
  □ Create global.setup.ts E2E Auth Bypass
  □ Create the initial Docker Visual Baseline
  □ Deploy the E2E Weekly Workflow
  □ Deploy the SignOz alert rules to .188

📅 2026-04 Week 3 - Wave 5 + 6 (adjustable depending on Wave 1-4 progress)
  □ Prometheus Federation (.110 → .188)
  □ Redis AOF + Sentinel evaluation
  □ Create the AI Autonomy Index metrics

📅 2026-04 Week 4+ - Wave 7-9 (Month 2)
  □ Storybook for 10 core components
  □ Omni-Terminal SSE Event Sourcing
  □ Monitoring GenUI cards (7 cards)
  □ Nexus AI autonomy rate UI

📅 Q2 - Wave 10-12 (long term)
  □ CloudNativePG HA evaluation
  □ Kali SecurityAgent (as an MCP Tool)
  □ Phase 4 visual soul injection
  □ CI hard blocking officially enabled (Warn → Block)
```

---

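## Chapter 8: Key points for the Commander's review

Please review the following decision points in particular:

| # | Decision | Options | Recommendation |
|---|---------|------|------|
| D1 | Should the Wave 1 PR be atomic (all 9 files at once)? | A. Atomic / B. Batched | A (must be atomic; the files depend on each other) |
| D2 | Feature Freeze duration | A. 3 days / B. 5 days | B (40+ violations call for caution) |
| D3 | When to enable ESLint | A. Immediately after the i18n cleanup / B. One-week buffer | A (immediately after the cleanup, to prevent new debt) |
| D4 | Semaphore max_concurrent | A. 2 / B. 3 / C. 5 | B (3 = 60% of .188 CPU) |
| D5 | Global Cooldown Threshold | A. 3 per 15 minutes / B. 5 per 15 minutes | B (3 is too strict; 5 is reasonable) |
| D6 | Redis HA strategy | A. Sentinel / B. AOF + manual / C. Defer | C (defer; evaluate in Month 2) |
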
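Decision D5 (at most 5 repairs per 15 minutes) can be pictured as a sliding-window limiter. The sketch below is an in-memory illustration only: the class name `GlobalRepairCooldown`, the `try_acquire` method, and the exact window semantics are assumptions, not the contents of `services/global_repair_cooldown.py`, which would more plausibly keep its state in Redis.

```python
from collections import deque


class GlobalRepairCooldown:
    """Allow at most `max_repairs` repairs inside a sliding time window."""

    def __init__(self, max_repairs: int = 5, window_seconds: float = 15 * 60):
        self.max_repairs = max_repairs
        self.window_seconds = window_seconds
        self._events = deque()  # timestamps of accepted repairs, oldest first

    def try_acquire(self, now: float) -> bool:
        # Forget repairs that have aged out of the window.
        while self._events and now - self._events[0] >= self.window_seconds:
            self._events.popleft()
        if len(self._events) >= self.max_repairs:
            return False  # cooldown active: skip the automatic repair
        self._events.append(now)
        return True
```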
---

*This schedule is based on the full code audit of 2026-03-29 and 16 rounds of tabletop simulation, and closes all known gaps.* 🦞
653
docs/proposals/MONITORING_ARCHITECTURE_DEEP_DIVE.md
Normal file
@@ -0,0 +1,653 @@
# AWOOOI Monitoring Master Design: Make Monitoring the AI's Sensory Nerves, Not Its Shackles

> **Document type**: Architecture design + implementation runbook
> **Priority**: 🔴 Top priority
> **Created**: 2026-03-29 12:38 (Taipei)
> **Author**: Antigravity
> **Core thesis**: Monitoring is not the goal; it is the nerve ending of AI decision-making.

---

## 1. Core philosophy: positioning the relationship between monitoring and AI

### 1.1 "We must not become a monitoring product": where does this fear come from?

The underlying logic of traditional monitoring products (Grafana / Prometheus / Datadog) is:
> "The system lays out the raw data, and **humans** are responsible for understanding it and deciding."

This turns users into **data porters** rather than **decision makers**.

AWOOOI must instead be positioned as:
> "The AI digests all the data and **proactively brings its analysis to the Commander**: 'Here is what I recommend. Do you approve?'"

---

### 1.2 Golden rule: which monitoring data "should disappear into the background" and which "must surface to the foreground"

**The two fates of monitoring information**

| 🔒 Silently digested (handled in the background) | 🚨 Proactively surfaced (pushed to the Commander) |
|------|------|
| Raw Prometheus metric values | Approval cards generated after the AI judges "this is an anomaly" |
| SigNoz trace details | Alerts raised when the Anomaly Counter escalates to ESCALATE |
| Full Sentry error-log stack traces | Result summaries after Auto-Repair runs |
| Grafana dashboard charts | Urgent queue-jumping for P0 incidents (Priority Preemption) |
| Alertmanager rule configuration | Daily AI health digest (pushed proactively) |
| K8s Pod status details | FinOps cost-anomaly alerts |

**Core conclusion**: 99% of monitoring data should be "silently digested"; only the 1% that the AI cannot handle automatically surfaces as "cards that need the Commander's decision".

---

## 2. Full inventory of monitoring nodes (current state vs. target)

### 2.1 Five-layer monitoring architecture

```
Layer 0: Physical sensing layer (hosts/nodes)
    ↓
Layer 1: Service sensing layer (containers/Pods)
    ↓
Layer 2: Application sensing layer (API/frontend/Worker)
    ↓
Layer 3: AI intelligence layer (LLM inference/tool calls)
    ↓
Layer 4: Business sensing layer (user behavior/cost/SLO)
```

---

### 2.2 Full node inventory by layer

#### Layer 0: Physical sensing layer

| Node | Tool | Data | Status | Gap |
|------|------|------|------|------|
| .110 CPU/Memory/Disk | Node Exporter | System metrics | ✅ Prometheus | - |
| .112 CPU/Memory/Disk | Node Exporter | System metrics | 🟡 Isolated, no Webhook | No alert integration |
| .188 CPU/Memory/Disk | Node Exporter | System metrics + GPU | ✅ Prometheus | - |
| .120 K3s Master | Node Exporter + kube-state | K3s node metrics | ✅ | - |
| .121 K3s Worker | Node Exporter + kube-state | K3s node metrics | ✅ | - |
| VIP .125 | Blackbox Exporter | TCP health | ✅ Configured | - |

#### Layer 1: Service sensing layer (Docker/K8s)

| Service | Prometheus | Sentry | OTEL | Alerts | Auto-repair | Gap |
|------|-----------|-------|------|------|---------|------|
| awoooi-api | ✅ | ✅ | ✅ | ✅ Complete | ✅ | - |
| awoooi-web | ✅ | ✅ | ✅ | ✅ Complete | ✅ | - |
| awoooi-worker | ✅ | ✅ | ✅ | 🟡 | ✅ | HPA missing |
| Ollama | ✅ | - | - | ✅ | ✅ Restart | - |
| OpenClaw | ✅ | ✅ | ✅ | ✅ | ✅ Restart | - |
| Redis | ✅ | - | - | ✅ | ❌ (cautious) | Auto-repair too conservative |
| PostgreSQL | ✅ | - | - | ✅ | ❌ (cautious) | Same as above |
| Harbor | ✅ | - | - | ✅ | - | - |
| Sentry | ✅ | - | - | ✅ | - | - |
| Langfuse | ✅ | - | - | ✅ | - | - |
| **MinIO** | ❌ | - | - | ❌ | ❌ | Not monitored at all |
| **Kali Scanner** | ❌ | - | - | ❌ | ❌ | Isolated node |

#### Layer 2: Application sensing layer

| Data type | Tool | Status | Gap |
|---------|------|------|------|
| API Error Rate | Prometheus + SigNoz | ✅ | - |
| API Latency P50/P95/P99 | SigNoz OTEL | ✅ | - |
| Distributed Traces | SigNoz | ✅ | - |
| Frontend Web Vitals (LCP/FID/CLS) | Sentry | ✅ | - |
| Frontend JS Errors | Sentry | ✅ | - |
| Frontend Session Replay | Sentry | ✅ | - |
| **Frontend Rage Click** | Sentry | ✅ | **Not integrated into AI analysis** |
| **API Slow Query** | Sentry + structlog | ✅ | **No automatic AI optimization suggestions** |
| **K8s Resource Quota** | kube-state-metrics | ✅ | - |
| Alert Chain E2E | Prometheus Counter | ✅ ADR-037 | - |

#### Layer 3: AI intelligence layer

| Data type | Tool | Status | Gap |
|---------|------|------|------|
| LLM request/response traces | Langfuse | ✅ | - |
| LLM token usage/cost | Langfuse | ✅ | **No AWOOOI Dashboard** |
| Ollama inference latency | Prometheus | ✅ | - |
| AI Fallback trigger count | Prometheus | ✅ (ADR-006) | - |
| NVIDIA Circuit Breaker | Prometheus | ✅ (ADR-036) | - |
| **AI autonomy index** | - | ❌ Missing entirely | Core metric not yet built |
| **Anomaly Counter statistics** | Redis counter | ✅ ADR-037 | **No frontend display** |
| **Approval decision analysis** | PostgreSQL | ✅ | **Raw CRUD only, no analysis** |

#### Layer 4: Business sensing layer

| Data type | Tool | Status | Gap |
|---------|------|------|------|
| SLO attainment | Prometheus + rules | ✅ Defined | **No visualization** |
| Incident MTTR (mean time to resolution) | PostgreSQL | ✅ Raw data | **Not computed or displayed** |
| **FinOps cost tracking** | cost_analyzer.py | ✅ Logic | **No UI, completely idle** |
| **User action audit** | audit_logs.py | ✅ | - |
| **Knowledge base query statistics** | - | ❌ | No knowledge base backend |

---

## 3. Integration gap analysis

### 3.1 "Last mile" gaps (tools exist, not yet integrated)

| Gap | Tool already in place | What is missing | Effort |
|------|-----------|-------|------|
| MinIO monitoring | Prometheus | MinIO Exporter not deployed | 1h |
| Kali security scanning | Nmap/ZAP on .112 | No AWOOOI Webhook integration | 2h |
| FinOps frontend | cost_analyzer.py | No API endpoint + no UI | 8h |
| AI autonomy index | Prometheus Counter can be built | Metric definition + Dashboard | 4h |
| Rage Click → AI analysis | Sentry `get_ux_audit_summary()` | No trigger, not invoked periodically | 2h |
| Anomaly Counter frontend display | Redis + anomaly_counter.py | No GenUI card | 4h |
| SLO visualization | Prometheus rules already defined | No Grafana/frontend display | 3h |
| MTTR computation | PostgreSQL has incidents data | No computation API endpoint | 2h |
| Dual Prometheus federation | One each on 188/110 | No Federation config | 2h |

**Estimated total effort for the integration gaps: ~28 hours**

---

## 4. Monitoring UI presentation strategy (the core design for not becoming a monitoring product)

### 4.1 Three monitoring UI anti-patterns (strictly forbidden)

```
❌ Anti-pattern A: Grafana embedded in an iframe
   → The whole page is Grafana; users feel they are using Grafana

❌ Anti-pattern B: a top-level "Monitoring" menu item
   → Demotes AWOOOI to "Grafana with AI assistance"

❌ Anti-pattern C: displaying raw Prometheus metrics directly
   → Users see syntax like rate(http_requests_total[5m]), which violates the AI-native experience
```

### 4.2 The correct monitoring UI architecture: the "three-way separation principle"

**AWOOOI monitoring UI three-way separation**

| Mode 1: AI proactively surfaces (AWOOOI frontend) | Mode 2: Ask and answer (Omni-Terminal) | Mode 3: Deep-investigation jump-out (direct links to external tools) |
|------|------|------|
| **Nexus page**: AI health pulse, autonomy index, anomaly trend chart | `/status awoooi-api` → GenUI health card | 🔗 Grafana (new tab) |
| | `/cost this-month` → FinOps cost card | 🔗 SigNoz (new tab) |
| **War Room page**: pending Approvals | `/trace xxx` → trace summary card | 🔗 Sentry (new tab) |

External tools are never embedded directly, which keeps the AWOOOI interface clean.

### 4.3 Monitoring presentation spec for the Nexus page

This is the home page (The Nexus / global mind hub) and the only entry point for monitoring summaries:

```tsx
// Nexus page structure (Nothing.tech pure-white industrial style)

// Block A: AI autonomy index (largest and most important)
<AutonomyIndexPanel>
  AI intercepted and auto-repaired today: 7/10 incidents (70% autonomy rate)
  ↗ Up 12% from yesterday
</AutonomyIndexPanel>

// Block B: system pulse (3 numbers, no charts)
<SystemPulseRow>
  <PulseMetric label="Healthy services" value="24/25" status="healthy" />
  <PulseMetric label="Active alerts" value="0" status="healthy" />
  <PulseMetric label="Pending decisions" value="2" status="warning" />
</SystemPulseRow>

// Block C: AI thinking stream (ambient background, not the focus)
<ThinkingStream>
  [Investigator] Redis latency: 2ms ... OK
  [Investigator] API error rate: 0.1% ... OK
  [Investigator] cert://*.wooo.work: 42 days ... OK
</ThinkingStream>

// Block D: cards that need the Commander's decision (the only interactive block)
// → Appears only when a decision is pending; otherwise this area stays "silent"
<DecisionZone>
  <ApprovalCard urgency="CRITICAL" ... />
</DecisionZone>
```

---

### 4.4 Presenting monitoring data in the Omni-Terminal (ask-and-answer mode)

Terminal input → AI digests the raw metrics → returns a GenUI card (not raw numbers)

| User input | AI behavior | GenUI card type |
|---------|---------|--------------|
| `/status all` | Query the health of all services | `SystemHealthCard` |
| `/status awoooi-api` | Query API P95 latency + error rate | `ServiceDetailCard` |
| `/cost` | Call cost_analyzer.py | `FinOpsCard` |
| `/trace last 5 minutes` | Query SigNoz slow traces | `TraceListCard` |
| `/incident today` | Query today's incidents + AI summary | `IncidentSummaryCard` |
| `/alert status` | Check the alert chain end to end | `AlertChainStatusCard` |
| `/slo` | Compute API/Web SLO attainment | `SLODashboardCard` |

**🎯 This is the AI-native experience**: users are always talking to the AI, never operating charts directly.

---

### 4.5 "Deep investigation" mode: smart jump-out

When users need raw Grafana / SigNoz data, AWOOOI provides a **smart jump-out** rather than an embed:

```tsx
// Inside a GenUI card, offer "deep investigation" buttons
<ServiceDetailCard service="awoooi-api">
  <MetricRow label="P95 latency" value="124ms" status="healthy" />
  <MetricRow label="Error rate" value="0.1%" status="healthy" />

  {/* Smart jump-out buttons */}
  <ExternalLinks>
    <SmartLink
      icon="📊"
      label="SigNoz detailed traces"
      href="http://192.168.0.188:3301/traces?service=awoooi-api&from=now-1h"
      target="_blank" // ← opens in a new tab, keeping the AWOOOI interface clean
    />
    <SmartLink
      icon="📈"
      label="Grafana live charts"
      href="http://192.168.0.188:3000/d/awoooi-api"
      target="_blank"
    />
    <SmartLink
      icon="🐛"
      label="Sentry Issues"
      href="http://192.168.0.110:9000/organizations/sentry/issues/?project=awoooi-api"
      target="_blank"
    />
  </ExternalLinks>
</ServiceDetailCard>
```

---

## 5. Implementation steps for monitoring integration

### Wave M-1: Start immediately (1 week)

#### M-1.1 MinIO monitoring integration (1h)

```bash
# Deploy the MinIO Exporter on 192.168.0.188
# Export MINIO_ACCESS_KEY / MINIO_SECRET_KEY in the shell first;
# do not commit real credentials to this document.
docker run -d \
  --name minio-exporter \
  --network momo-pro-network \
  -e MINIO_URL=http://minio:9000 \
  -e MINIO_ACCESS_KEY="$MINIO_ACCESS_KEY" \
  -e MINIO_SECRET_KEY="$MINIO_SECRET_KEY" \
  -p 9290:9290 \
  bitnami/minio-exporter:latest

# Add a scrape target to 188:/momo-pro/monitoring/prometheus.yml:
# - job_name: 'minio'
#   static_configs:
#     - targets: ['localhost:9290']
```

#### M-1.2 Unify via Prometheus Federation (2h)

```yaml
# Add a federation job to the Prometheus on 188 (scrapes data from the .110 Prometheus)
# Append to 188:/momo-pro/monitoring/prometheus.yml, under scrape_configs:

  - job_name: 'federate-110'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~".+"}'  # scrape every job from .110
    static_configs:
      - targets: ['192.168.0.110:9090']
```

#### M-1.3 Create the AI autonomy index Prometheus metrics (2h)

```python
# Add to apps/api/src/core/metrics.py:
from prometheus_client import Counter

# === AI autonomy tracking (The Autonomy Index) ===
AUTONOMY_INCIDENTS_TOTAL = Counter(
    'awoooi_incidents_total',
    'Total number of incidents received',
    ['source', 'severity']
)

AUTONOMY_AUTO_RESOLVED = Counter(
    'awoooi_incidents_auto_resolved_total',
    'Incidents resolved automatically by AI without human intervention',
    ['source', 'action_type']
)

AUTONOMY_HUMAN_RESOLVED = Counter(
    'awoooi_incidents_human_resolved_total',
    'Incidents requiring human approval',
    ['source', 'risk_level']
)

def record_incident_created(source: str, severity: str):
    AUTONOMY_INCIDENTS_TOTAL.labels(source=source, severity=severity).inc()

def record_auto_resolution(source: str, action_type: str):
    AUTONOMY_AUTO_RESOLVED.labels(source=source, action_type=action_type).inc()

def record_human_decision(source: str, risk_level: str):
    AUTONOMY_HUMAN_RESOLVED.labels(source=source, risk_level=risk_level).inc()

# Autonomy rate formula:
# autonomy_rate = auto_resolved / (auto_resolved + human_decisions) * 100
```

Grafana Dashboard formula:
```
# AI autonomy rate (24h)
sum(increase(awoooi_incidents_auto_resolved_total[24h]))
/
(
  sum(increase(awoooi_incidents_auto_resolved_total[24h])) +
  sum(increase(awoooi_incidents_human_resolved_total[24h]))
) * 100
```

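The autonomy rate formula above divides by the number of resolved incidents, so a quiet day with zero incidents needs explicit handling (the PromQL expression yields NaN in that case). A minimal sketch of the same arithmetic on the API side; the function name `autonomy_rate` and the choice to return `None` when there is no data are assumptions for illustration:

```python
from typing import Optional


def autonomy_rate(auto_resolved: float, human_resolved: float) -> Optional[float]:
    """Share of incidents resolved by the AI, as a percentage.

    Returns None when nothing was resolved in the period, so the UI can
    show "no data" instead of a misleading 0% or a division error.
    """
    total = auto_resolved + human_resolved
    if total <= 0:
        return None
    return auto_resolved / total * 100.0
```

With 7 auto-resolved and 3 human-approved incidents this gives the 70% figure used in the Nexus page example.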
### Wave M-2: Short-term start (2 weeks)

#### M-2.1 Create the FinOps API endpoint (4h)

```python
# apps/api/src/api/v1/finops.py (new file)
# Exposes the results computed by cost_analyzer.py
from fastapi import APIRouter

router = APIRouter()


@router.get("/finops/summary")
async def get_finops_summary():
    """
    FinOps cost summary

    Returns:
        {
            "period": "2026-03",
            "total_cost_usd": 12.50,
            "ollama_cost": 0.0,           # local, zero cost
            "gemini_cost": 1.20,
            "claude_cost": 11.30,
            "realizable_savings": 3.50,   # money that can actually be saved
            "freed_capacity": 8.00,       # freed capacity (not real savings)
            "top_cost_drivers": [...],
            "recommendations": [...]
        }
    """
    # get_cost_analyzer() is expected to come from cost_analyzer.py (import omitted here)
    cost_analyzer = get_cost_analyzer()
    return await cost_analyzer.monthly_summary()
```

#### M-2.2 Create the SLO API endpoint (2h)

```python
# apps/api/src/api/v1/slo.py (new file)
from fastapi import APIRouter

router = APIRouter()


@router.get("/slo/status")
async def get_slo_status():
    """
    SLO attainment status

    Returns:
        {
            "api": {
                "availability_7d": 99.97,      # %
                "latency_p95_7d": 124,         # ms
                "target_availability": 99.9,
                "target_latency_p95": 500,
                "status": "healthy"            # healthy/at_risk/breached
            },
            "web": {...},
            "overall": "healthy"
        }
    """
```

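The response above names three states (`healthy`, `at_risk`, `breached`) without saying how they are derived. One possible rule, sketched below, treats a missed target as `breached` and flags `at_risk` once 80% of the error budget or of the latency target is used. The function name and the 80% ratio are assumptions for illustration, not values from this plan.

```python
def slo_status(availability: float, target_availability: float,
               latency_p95_ms: float, target_latency_p95_ms: float,
               at_risk_ratio: float = 0.8) -> str:
    """Classify one service's SLO as healthy, at_risk, or breached."""
    if availability < target_availability or latency_p95_ms > target_latency_p95_ms:
        return "breached"
    # Error budget: the unavailability the target still allows, in percentage points.
    error_budget = 100.0 - target_availability
    consumed = (100.0 - availability) / error_budget if error_budget > 0 else 0.0
    if consumed >= at_risk_ratio or latency_p95_ms >= at_risk_ratio * target_latency_p95_ms:
        return "at_risk"
    return "healthy"
```

With the sample numbers in the docstring (99.97% against a 99.9% target, 124 ms against 500 ms) this rule reports `healthy`, since only about 30% of the error budget is consumed.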
#### M-2.3 Create the MTTR API endpoint (2h)

```python
# Add an endpoint to apps/api/src/api/v1/stats.py

@router.get("/stats/mttr")
async def get_mttr_stats():
    """
    Mean Time To Resolution (MTTR)

    Computation:
    - MTTR = avg(resolved_at - created_at) for resolved incidents
    - Computed separately for AI auto-repair vs. human review
    """
```

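The MTTR logic in the docstring can be sketched in plain Python before it is wired to PostgreSQL. The incident shape used here (dicts with `created_at`, `resolved_at`, and an `auto_resolved` flag) is an assumption for illustration; only `created_at` and `resolved_at` appear in the plan itself.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import Iterable


def mttr_seconds(incidents: Iterable[dict]) -> dict:
    """Average resolution time in seconds, split by who resolved the incident."""
    buckets = {"auto": [], "human": []}
    for inc in incidents:
        if inc.get("resolved_at") is None:
            continue  # still open: not part of MTTR
        duration = (inc["resolved_at"] - inc["created_at"]).total_seconds()
        buckets["auto" if inc["auto_resolved"] else "human"].append(duration)
    # None marks a bucket with no resolved incidents.
    return {kind: (mean(vals) if vals else None) for kind, vals in buckets.items()}


t0 = datetime(2026, 3, 29, 12, 0)
sample = [
    {"created_at": t0, "resolved_at": t0 + timedelta(minutes=2), "auto_resolved": True},
    {"created_at": t0, "resolved_at": t0 + timedelta(minutes=4), "auto_resolved": True},
    {"created_at": t0, "resolved_at": t0 + timedelta(minutes=30), "auto_resolved": False},
    {"created_at": t0, "resolved_at": None, "auto_resolved": False},
]
result = mttr_seconds(sample)
```

In production the same aggregation would more likely be a single SQL query with `AVG(resolved_at - created_at)` grouped by resolution type.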
#### M-2.4 Kali Scanner integration (2h)

```python
# Add a Kali Scanner webhook to apps/api/src/api/v1/webhooks.py

@router.post("/webhooks/kali/scan-result")
async def handle_kali_scan_result(request: Request):
    """
    Receive security scan results from the .112 Kali host

    The Kali scan script runs once a week and posts its results to this webhook.
    High-severity vulnerability → automatically create a CRITICAL Approval
    """
```

Kali-side configuration (`192.168.0.112`):
```bash
# Create the weekly scan script on .112
cat > /opt/awoooi-scanner/weekly-scan.sh << 'EOF'
#!/bin/bash
set -euo pipefail
HOST="192.168.0.120"   # AWOOOI API host
PORT="32334"
# nmap has no JSON output mode, so emit XML to stdout (-oX -)
RESULT=$(nmap -sV --script vuln -p "$PORT" "$HOST" -oX -)

# jq (must be installed) wraps the XML as a JSON string so the payload stays valid
jq -n --arg scan_result "$RESULT" --arg target "$HOST:$PORT" \
  '{scan_result: $scan_result, target: $target}' |
curl -X POST "http://$HOST:$PORT/api/v1/webhooks/kali/scan-result" \
  -H "Content-Type: application/json" \
  -d @-
EOF
chmod +x /opt/awoooi-scanner/weekly-scan.sh

# Append to crontab (Mondays 02:00) without overwriting existing entries
(crontab -l 2>/dev/null; echo "0 2 * * 1 /opt/awoooi-scanner/weekly-scan.sh") | crontab -
```

### Wave M-3: Mid-term start (3-4 weeks)

#### M-3.1 Nexus page AI autonomy index UI (8h)

```tsx
// apps/web/src/app/[locale]/(dashboard)/page.tsx
// Add the AutonomyIndex component
// Note: `t` is the page's i18n translate function (its setup is not shown here).

interface AutonomyData {
  rate: number;              // 70.0
  daily_trend: number;       // +12.0 vs yesterday
  auto_resolved_24h: number;
  human_resolved_24h: number;
}

const AutonomyIndexPanel = ({ data }: { data: AutonomyData }) => (
  <div className="bg-white/70 backdrop-blur-[20px] border border-black/[0.06] rounded-xl p-6">
    {/* Big number: autonomy rate */}
    <div className="flex items-end gap-3">
      <span className="font-mono text-6xl font-bold text-nothing-ink">
        {data.rate.toFixed(0)}
        <span className="text-2xl text-nothing-gray">%</span>
      </span>
      <div className="mb-2">
        <span className="text-sm text-status-success">
          {/* Show the sign that matches the trend instead of always "+" */}
          {data.daily_trend >= 0 ? '↗ +' : '↘ '}{data.daily_trend.toFixed(0)}% {t('nexus.vs_yesterday')}
        </span>
      </div>
    </div>

    {/* AI autonomy rate description */}
    <p className="font-mono text-xs tracking-widest text-nothing-gray-600 mt-2">
      [AI_AUTONOMY_INDEX] {t('nexus.autonomy_description')}
    </p>

    {/* Breakdown: resolved automatically today vs. needed a human */}
    <div className="flex gap-6 mt-4 border-t border-black/[0.04] pt-4">
      <div>
        <p className="text-2xl font-bold text-status-success">{data.auto_resolved_24h}</p>
        <p className="text-xs text-nothing-gray">{t('nexus.ai_auto_resolved')}</p>
      </div>
      <div>
        <p className="text-2xl font-bold text-status-warning">{data.human_resolved_24h}</p>
        <p className="text-xs text-nothing-gray">{t('nexus.required_approval')}</p>
      </div>
    </div>
  </div>
);
```

#### M-3.2 Omni-Terminal monitoring command integration (8h)

The backend needs handlers for the following Terminal commands:

```python
# apps/api/src/services/terminal_service.py (extension)

class TerminalCommandRouter:

    async def route(self, intent: str, context: dict) -> TerminalResponse:
        """Route monitoring-related commands."""
        if intent == "/status":
            return await self._handle_status(context)       # service health
        elif intent == "/cost":
            return await self._handle_cost(context)         # FinOps cost
        elif intent == "/slo":
            return await self._handle_slo(context)          # SLO attainment
        elif intent == "/trace":
            return await self._handle_trace(context)        # SigNoz traces
        elif intent == "/alert":
            return await self._handle_alert_chain(context)  # alert chain status
        elif intent == "/incident":
            return await self._handle_incident(context)     # incident lookup
        elif intent == "/mttr":
            return await self._handle_mttr(context)         # mean time to resolution
        # Fail loudly instead of silently returning None for unknown commands
        raise ValueError(f"Unsupported terminal intent: {intent}")
```

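Before the router can dispatch on `intent`, the raw terminal line has to be split into a command and its arguments. A minimal sketch of that step follows; the `parse_command` helper and the `KNOWN_INTENTS` set are hypothetical names introduced here, with the set mirroring the commands routed above.

```python
import shlex
from typing import List, Tuple

# Commands handled by the router sketch above.
KNOWN_INTENTS = {"/status", "/cost", "/slo", "/trace", "/alert", "/incident", "/mttr"}


def parse_command(raw: str) -> Tuple[str, List[str]]:
    """Split a terminal line such as '/status awoooi-api' into (intent, args)."""
    tokens = shlex.split(raw.strip())
    if not tokens or tokens[0] not in KNOWN_INTENTS:
        raise ValueError(f"unknown command: {raw!r}")
    return tokens[0], tokens[1:]
```

Rejecting unknown commands at parse time keeps free-form questions out of the command router, so they can be handed to the AI conversation path instead.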
#### M-3.3 Monitoring GenUI card expansion (8h)

```typescript
// Add to apps/web/src/components/genui/registry.ts:

export const GENUI_COMPONENTS = {
  // Existing components...

  // New monitoring components:
  'SystemHealthCard': () => import('./monitoring/SystemHealthCard'),
  'ServiceDetailCard': () => import('./monitoring/ServiceDetailCard'),
  'FinOpsCard': () => import('./monitoring/FinOpsCard'),
  'SLODashboardCard': () => import('./monitoring/SLODashboardCard'),
  'AlertChainStatusCard': () => import('./monitoring/AlertChainStatusCard'),
  'AnomalyFrequencyCard': () => import('./monitoring/AnomalyFrequencyCard'),
  'MTTRCard': () => import('./monitoring/MTTRCard'),
  'KaliScanResultCard': () => import('./monitoring/KaliScanResultCard'),
}
```

**`SystemHealthCard` spec** (the most central monitoring GenUI card):

```tsx
// SystemHealthCard rendering logic:
// - The 25 services are shown as a "status light matrix", not as charts
// - Hovering a light shows the service name
// - Lights for unhealthy services blink (animate-ping)
// - The "deep investigation" button at the bottom right opens Grafana/SigNoz in a new tab
// Note: `services` and the *_count values come from the card's data source (not shown).

const SystemHealthCard = () => (
  <GenUICard title="System health matrix">
    <div className="grid grid-cols-5 gap-2">
      {services.map(svc => (
        <ServiceOrb
          key={svc.name}
          name={svc.name}
          status={svc.status}
          // healthy: steady green light
          // warning: slowly blinking yellow light
          // critical: red light with animate-ping
          externalLink={svc.grafana_url}
        />
      ))}
    </div>

    {/* Summary row */}
    <p className="font-mono text-xs mt-3">
      25 SERVICES | {healthy_count} HEALTHY | {warning_count} WARNING | {critical_count} CRITICAL
    </p>

    {/* Smart jump-out */}
    <ExternalLinks grafana sentry signoz />
  </GenUICard>
);
```

---

## 6. Monitoring integration roadmap and priorities

```
📅 Week 1 (immediate, ~7h):
├── M-1.1 MinIO Exporter deployment (1h)
├── M-1.2 Prometheus Federation (2h)
└── M-1.3 AI autonomy index metrics (2h + 2h config)

📅 Week 2-3 (short term, ~10h):
├── M-2.1 FinOps API endpoint (4h)
├── M-2.2 SLO API endpoint (2h)
├── M-2.3 MTTR API endpoint (2h)
└── M-2.4 Kali Scanner Webhook integration (2h)

📅 Month 2 (mid term, ~24h):
├── M-3.1 Nexus page AI autonomy rate UI (8h)
├── M-3.2 Omni-Terminal monitoring commands (8h)
└── M-3.3 Monitoring GenUI card expansion (8h)
```

**The end result once monitoring integration is complete**:

```
The Commander opens AWOOOI and sees:
  ✦ AI autonomy rate: 72% today (↗ 8% higher than yesterday)
  ✦ System health: 25/25 services healthy
  ✦ Pending decisions: 0 (no incidents need human intervention)
  ✦ The AI thinking stream patrols silently in the background...

This is an AI-native platform, not a monitoring tool.
SREs are woken only "when the AI cannot handle it"; the rest of the time, humans can do more valuable work.
```

---

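## 7. ADR recommendations

This plan recommends adding the following ADRs:

| ADR | Topic | Core decision |
|-----|------|---------|
| ADR-038 | Monitoring UI three-way separation principle | Silent digestion vs. proactive surfacing vs. external jump-out |
| ADR-039 | AI autonomy index (Autonomy Index) | Metric definition and formula |
| ADR-040 | Kali security scan integration architecture | .112 → Webhook → AI analysis |
| ADR-041 | SLO and MTTR business metric architecture | Computation method and display standard |

---

*"Monitoring is the nerve ending; AI is the brain. Nerves do not think, and the brain does not sense directly. That is AWOOOI's monitoring philosophy."* 🦞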
219
docs/proposals/MONITORING_MASTER_PLAN.md
Normal file
@@ -0,0 +1,219 @@
# AWOOOI Monitoring Integration Master Plan

> **Version**: v1.0
> **Created**: 2026-03-29
> **Status**: ✅ Approved by the Commander
> **Total effort**: 10.75h
> **Consolidated from**: `MONITORING_INTEGRATION_ARCHITECTURE.md` + `IMPLEMENTATION_STEPS_REMAINING_PHASES.md`

---

## 1. Executive summary

This plan consolidates the following two documents:
1. **MONITORING_INTEGRATION_ARCHITECTURE.md** - monitoring-as-code architecture
2. **IMPLEMENTATION_STEPS_REMAINING_PHASES.md** - Phase D-G implementation steps

### Key findings

| Category | Done | Remaining |
|------|--------|--------|
| **Service Registry** | ✅ Includes NVIDIA | - |
| **Coverage validation** | ✅ validate_coverage.py | generate_monitoring.py |
| **NVIDIA alerts** | ✅ 5 rules | Grafana Dashboard |
| **Sentry integration** | ✅ Webhook Handler | Comment write-back (TODO) |
| **SignOz integration** | ❌ No Webhook | Handler + Rules |
| **Alert chain verification** | ❌ None | Smoke Test + CD integration |

---

## 2. Task dependencies

```
Layer 0: Infrastructure (no dependencies)
├── L0.1 Sentry API Token ────────────────────┐
└── L0.2 SignOz alert rules ──────────────────┼──▶ Layer 1
                                              │
Layer 1: Webhook chain                        │
├── L1.1 SignOz Webhook (←L0.2) ──────────────┤
├── L1.2 Sentry Comment (←L0.1) ──────────────┼──▶ Layer 2
└── L1.3 Alert Chain Metrics ─────────────────┘
                                              │
Layer 2: Alert chain verification             │
├── L2.1 Smoke Test (←L1.1,L1.2) ─────────────┤
├── L2.2 Alert Chain Rules (←L1.3) ───────────┼──▶ Layer 3
└── L2.3 CD Pipeline (←L2.1) ─────────────────┘
                                              │
Layer 3: Monitoring automation (independent)  │
├── L3.1 generate_monitoring.py ──────────────┤
├── L3.2 CI coverage check (←L3.1) ───────────┼──▶ Layer 4
└── L3.3 Docker auto-discovery ───────────────┘
                                              │
Layer 4: Visualization                        │
├── L4.1 NVIDIA Grafana Dashboard ────────────┤
└── L4.2 Monitoring coverage report (←L3.1) ──┘
```

---

## 3. Wave execution plan

### Wave A: Completing the alert chain (P0 - 3.5h)

| # | Task | Effort | Depends on | Parallelizable | File |
|---|------|------|------|--------|------|
| A.1 | Sentry API Token setup | 15min | - | ✅ | GitHub Secret + K8s |
| A.2 | SignOz alert rule deployment | 30min | - | ✅ | `signoz/alerting/rules.yaml` |
| A.3 | SignOz Webhook Handler | 45min | A.2 | - | `signoz_webhook.py` |
| A.4 | Sentry Comment write-back | 30min | A.1 | - | `sentry_webhook.py` |
| A.5 | Alert Chain Metrics | 30min | - | ✅ | `core/metrics.py` |
| A.6 | Smoke Test script | 45min | A.3,A.4 | - | `alert_chain_smoke_test.py` |

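The dependency column of the Wave A table determines how much of the wave can overlap. As a rough illustration, the sketch below computes each task's earliest finish time with the standard library's `graphlib`, assuming unlimited parallel capacity (an idealization; a single engineer would still work the tasks one after another).

```python
from graphlib import TopologicalSorter

# Durations (minutes) and dependencies copied from the Wave A table.
DURATION_MIN = {"A.1": 15, "A.2": 30, "A.3": 45, "A.4": 30, "A.5": 30, "A.6": 45}
DEPENDS_ON = {
    "A.1": [], "A.2": [], "A.3": ["A.2"],
    "A.4": ["A.1"], "A.5": [], "A.6": ["A.3", "A.4"],
}


def earliest_finish(durations: dict, depends_on: dict) -> dict:
    """Earliest finish time per task if independent tasks run in parallel."""
    finish = {}
    # static_order() yields every task after all of its dependencies.
    for task in TopologicalSorter(depends_on).static_order():
        start = max((finish[dep] for dep in depends_on[task]), default=0)
        finish[task] = start + durations[task]
    return finish


finish = earliest_finish(DURATION_MIN, DEPENDS_ON)
critical_path_minutes = max(finish.values())  # A.2 → A.3 → A.6
serial_minutes = sum(DURATION_MIN.values())
```

Under that assumption the critical path is A.2 → A.3 → A.6, so the wave's lower bound is 120 minutes against 195 minutes of serial task time.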
### Wave B: Chain protection (P1 - 1.5h)

| # | Task | Effort | Depends on | File |
|---|------|------|------|------|
| B.1 | Alert Chain PrometheusRule | 30min | A.5 | `k8s/monitoring/alert-chain-monitor.yaml` |
| B.2 | CD Pipeline integration | 30min | A.6 | `.github/workflows/cd.yaml` |
| B.3 | Deployment verification + documentation update | 30min | B.1,B.2 | ADR + Memory |

### Wave C: Monitoring automation (P2 - 2.75h)

| # | Task | Effort | Depends on | File |
|---|------|------|------|------|
| C.1 | generate_monitoring.py | 1.5h | - | `ops/monitoring/generate_monitoring.py` |
| C.2 | CI monitoring coverage check | 30min | C.1 | `.github/workflows/cd.yaml` |
| C.3 | Docker container auto-discovery | 45min | - | `ops/monitoring/discover_docker.py` |

### Wave D: Visualization (P3 - 3h)

| # | Task | Effort | Depends on | File |
|---|------|------|------|------|
| D.1 | NVIDIA Grafana Dashboard | 2h | - | `ops/grafana/nvidia-dashboard.json` |
| D.2 | Monitoring coverage report | 1h | C.1 | `ops/monitoring/coverage_report.py` |

---

## 4. Existing integration points

### Phase 20 (NVIDIA Nemotron) ✅

```yaml
Done:
  - nvidia_provider.py P3 Prometheus Metrics
  - k8s/monitoring/nvidia-alerts.yaml (5 rules)
  - ops/monitoring/service-registry.yaml (NVIDIA entry)

Integration:
  - Wave A.5 Alert Chain Metrics covers NVIDIA alert monitoring
```

### K-MON (K3s monitoring) ✅

```yaml
Done:
  - k8s/monitoring/k3s-alerts.yaml (20+ rules)
  - Blackbox Exporter
  - kube-state-metrics

Integration:
  - Wave B.1 extends the existing PrometheusRule
```

### ADR-034 (Telegram Secrets injection) ✅

```yaml
Done:
  - Pre-flight check
  - Automatic injection in CD

Integration:
  - Wave A.1 injects SENTRY_API_TOKEN using the same pattern
```

---

## 5. Phase D-G task mapping

| Original Phase | Task | Mapped Wave | Status |
|----------|------|-----------|------|
| **Phase D** | Sentry Comment write-back | Wave A.4 | Pending |
| **Phase E** | SignOz alert rules | Wave A.2 + A.3 | Pending |
| **Phase F** | Alert chain E2E verification | Wave A.6 + B.1 + B.2 | Pending |
| **Phase G** | Learning Service | ✅ Already exists | Integration only |

---

## 6. Execution timeline

```
Day 1 (4h):
├── [parallel] A.1 Sentry Token (15min)
├── [parallel] A.2 SignOz Rules (30min)
├── [parallel] A.5 Alert Chain Metrics (30min)
├── A.3 SignOz Webhook (45min)
├── A.4 Sentry Comment (30min)
└── A.6 Smoke Test (45min)

Day 2 (2h):
├── B.1 Alert Chain Rules (30min)
├── B.2 CD Pipeline (30min)
└── B.3 Verification + documentation (30min)

Day 3+ (can be deferred):
├── Wave C: Monitoring automation (2.75h)
└── Wave D: Visualization (3h)
```

---

## 7. Acceptance criteria

### Wave A completion criteria

- [ ] Sentry Issues automatically receive an AI analysis Comment
- [ ] SignOz alerts can trigger a Telegram notification
- [ ] `alert_chain_smoke_test.py` passes in full

### Wave B completion criteria

- [ ] The Smoke Test runs automatically after each CD deployment
- [ ] A broken alert chain triggers an alert within 2 hours
- [ ] ADR-037 is written and passes review

### Wave C completion criteria

- [ ] CI fails when a new service is not registered
- [ ] New Docker containers are scanned for automatically every hour
- [ ] The generated configuration matches the existing configuration

### Wave D completion criteria

- [ ] The NVIDIA Grafana Dashboard is reachable
- [ ] The daily coverage report is sent automatically

---

## 8. Risk assessment

| Risk | Likelihood | Impact | Mitigation |
|------|------|------|----------|
| Sentry API Token lacks permissions | Low | Medium | Test the API call before deploying |
| SignOz version does not support alerting | Low | High | Confirm the SignOz version |
| Smoke Test false positives | Medium | Low | Set reasonable timeouts + retries |
| CD Pipeline slows down | Medium | Low | Run the Smoke Test in parallel |

---

## 9. Related documents

- [MONITORING_INTEGRATION_ARCHITECTURE.md](MONITORING_INTEGRATION_ARCHITECTURE.md)
- [IMPLEMENTATION_STEPS_REMAINING_PHASES.md](IMPLEMENTATION_STEPS_REMAINING_PHASES.md)
- [ADR-034 Telegram Secrets injection](../adr/ADR-034-telegram-secrets-injection.md)
- [ADR-036 Nemotron Tool Calling](../adr/ADR-036-nemotron-tool-calling-integration.md)

---

**End of document**

*Approved by the Commander, 2026-03-29*
873
docs/proposals/NEMOTRON-INTEGRATION-PROPOSAL.md
Normal file
@@ -0,0 +1,873 @@
# Nemotron Integration Proposal

> **Version**: 1.1
> **Created**: 2026-03-28 (Taipei time)
> **Author**: Claude Code
> **Status**: ✅ **Testing complete, awaiting the Commander's approval**

---

## 🔥 Test results summary (2026-03-28)

| Metric | Nemotron (NIM) | Ollama (CPU) | Verdict |
|------|----------------|--------------|------|
| **Tool Calling accuracy** | 83.3% (5/6) | ~50% | **Nemotron wins** |
| **Average latency** | 11-23 s | 100+ s | **Nemotron is 5-10x faster** |
| **Traditional Chinese support** | ✅ Good | ✅ Good | Tie |
| **Cost** | Free tier | Free | Tie |

**Recommendation**: make Nemotron the preferred route for Tool Calling tasks

---

## Table of contents

1. [NIM API integration spec](#1-nim-api-integration-spec)
2. [Architecture design](#2-架構設計)
3. [Test scripts](#3-測試腳本)
4. [Implementation plan](#4-實作計畫)

---

## 1. NIM API Integration Spec

### 1.1 Endpoint information

| Item | Value |
|------|-----|
| **Base URL** | `https://integrate.api.nvidia.com/v1` |
| **Chat Completions** | `/chat/completions` |
| **Compatibility** | ✅ Fully compatible with the OpenAI API format |

### 1.2 Authentication

```bash
# Environment variable
export NVIDIA_API_KEY="nvapi-xxxx"

# HTTP header
Authorization: Bearer $NVIDIA_API_KEY
```

### 1.3 Available models

| Model ID | Size | Highlights | Recommended use |
|---------|------|------|----------|
| `nvidia/nemotron-mini-4b-instruct` | 4B | Lightweight, Tool Calling | Fast classification, simple decisions |
| `nvidia/llama-3.1-nemotron-70b-instruct` | 70B | Strong reasoning | Complex incident analysis |
| `nvidia/nemotron-3-super` | 120B (MoE) | Most capable, 1M tokens | Multi-agent collaboration |

### 1.4 Request Format (OpenAI-compatible)

```python
import os

import httpx

# Read the key from the environment (see 1.2) rather than hard-coding it.
NVIDIA_API_KEY = os.environ["NVIDIA_API_KEY"]

response = httpx.post(
    "https://integrate.api.nvidia.com/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {NVIDIA_API_KEY}",
    },
    json={
        "model": "nvidia/nemotron-mini-4b-instruct",
        "messages": [
            {"role": "system", "content": "You are an SRE assistant."},
            {"role": "user", "content": "Analyze this K8s error..."},
        ],
        "temperature": 0.2,
        "max_tokens": 1024,
        # For tool calling, add a "tools" list using the definitions in 1.5.
    },
    timeout=30,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"])
```
|
||||
|
||||
### 1.5 Tool Calling Format
|
||||
|
||||
```python
|
||||
tools = [
|
||||
{
|
||||
"type": "function",
|
||||
"function": {
|
||||
"name": "kubectl_execute",
|
||||
"description": "Execute kubectl command on K8s cluster",
|
||||
"parameters": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"command": {
|
||||
"type": "string",
|
||||
"description": "kubectl command (e.g., 'get pods -n awoooi-prod')"
|
||||
},
|
||||
"namespace": {
|
||||
"type": "string",
|
||||
"description": "Target namespace"
|
||||
}
|
||||
},
|
||||
"required": ["command"]
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"type": "function",
|
||||
"function": {
|
||||
"name": "restart_deployment",
|
||||
"description": "Restart a Kubernetes deployment",
|
||||
"parameters": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"deployment": {"type": "string"},
|
||||
"namespace": {"type": "string"}
|
||||
},
|
||||
"required": ["deployment", "namespace"]
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
### 1.6 Response Format (Tool Call)
|
||||
|
||||
```json
|
||||
{
|
||||
"choices": [{
|
||||
"message": {
|
||||
"role": "assistant",
|
||||
"content": null,
|
||||
"tool_calls": [{
|
||||
"id": "call_abc123",
|
||||
"type": "function",
|
||||
"function": {
|
||||
"name": "restart_deployment",
|
||||
"arguments": "{\"deployment\": \"awoooi-api\", \"namespace\": \"awoooi-prod\"}"
|
||||
}
|
||||
}]
|
||||
},
|
||||
"finish_reason": "tool_calls"
|
||||
}]
|
||||
}
|
||||
```
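
The response shape above can be consumed with a small helper. The following is a minimal sketch rather than code from the project: `extract_tool_calls` is a hypothetical name, and it assumes the OpenAI-style `choices[0].message.tool_calls` layout shown above, where `arguments` arrives as a JSON-encoded string.

```python
import json


def extract_tool_calls(response: dict) -> list:
    """Return (tool_name, arguments) pairs from a chat-completions response."""
    choices = response.get("choices") or []
    if not choices:
        return []
    message = choices[0].get("message") or {}
    calls = []
    for call in message.get("tool_calls") or []:
        function = call.get("function") or {}
        raw_args = function.get("arguments") or "{}"
        # `arguments` is a JSON string, so it must be decoded before use.
        calls.append((function.get("name"), json.loads(raw_args)))
    return calls


# Sample payload mirroring the response format documented above.
sample = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_abc123",
                "type": "function",
                "function": {
                    "name": "restart_deployment",
                    "arguments": "{\"deployment\": \"awoooi-api\", \"namespace\": \"awoooi-prod\"}",
                },
            }],
        },
        "finish_reason": "tool_calls",
    }]
}

print(extract_tool_calls(sample))
```

Returning an empty list when `tool_calls` is absent lets the caller treat a plain-text reply as "no action requested" instead of raising.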
|
||||
|
||||
---
|
||||
|
||||
## 2. 架構設計
|
||||
|
||||
### 2.1 Fallback Tier Changes

```
┌───────────────────────────────────────────────────────────────┐
│                     Current architecture                      │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│      Tier 1              Tier 2              Tier 3           │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐      │
│  │   Ollama    │ ──▶ │   Gemini    │ ──▶ │   Claude    │      │
│  │    (188)    │     │    (API)    │     │    (API)    │      │
│  │    local    │     │  free tier  │     │    paid     │      │
│  └─────────────┘     └─────────────┘     └─────────────┘      │
│                                                               │
└───────────────────────────────────────────────────────────────┘

┌───────────────────────────────────────────────────────────────┐
│               New architecture (with Nemotron)                │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│             ┌──────────────────────────────────┐              │
│             │        Smart Model Router        │              │
│             │      (routes by task type)       │              │
│             └──────────────────────────────────┘              │
│                               │                               │
│          ┌────────────────────┼────────────────────┐          │
│          │                    │                    │          │
│          ▼                    ▼                    ▼          │
│ ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐ │
│ │  Tool calling   │  │  General chat   │  │Complex reasoning│ │
│ │      path       │  │      path       │  │      path       │ │
│ └────────┬────────┘  └────────┬────────┘  └────────┬────────┘ │
│          │                    │                    │          │
│          ▼                    ▼                    ▼          │
│ ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐ │
│ │ Nemotron (NIM)  │  │     Ollama      │  │  Nemotron-70B   │ │
│ │  nemotron-mini  │  │     qwen2.5     │  │    or Claude    │ │
│ │4B, tool-focused │  │      local      │  │  high quality   │ │
│ └────────┬────────┘  └────────┬────────┘  └────────┬────────┘ │
│          │                    │                    │          │
│          └────────────────────┼────────────────────┘          │
│                               │                               │
│                               ▼                               │
│                      ┌─────────────────┐                      │
│                      │ Fallback chain  │                      │
│                      │ Gemini → Claude │                      │
│                      └─────────────────┘                      │
│                                                               │
└───────────────────────────────────────────────────────────────┘
```
|
||||
|
||||
### 2.2 Task Routing Rules

```python
# apps/api/src/services/ai/model_router.py

ROUTING_RULES = {
    # Tool-calling tasks → Nemotron first
    "tool_calling": {
        "primary": "nvidia/nemotron-mini-4b-instruct",
        "fallback": ["gemini-1.5-flash", "claude-3-haiku"],
    },

    # K8s operation decisions → Nemotron first
    "k8s_operation": {
        "primary": "nvidia/nemotron-mini-4b-instruct",
        "fallback": ["ollama/qwen2.5:7b", "gemini-1.5-flash"],
    },

    # Incident analysis (complex reasoning) → Nemotron-70B or Claude
    "incident_analysis": {
        "primary": "nvidia/llama-3.1-nemotron-70b-instruct",
        "fallback": ["claude-3-sonnet", "gemini-1.5-pro"],
    },

    # General chat → local Ollama first
    "general_chat": {
        "primary": "ollama/qwen2.5:7b",
        "fallback": ["gemini-1.5-flash", "claude-3-haiku"],
    },

    # Playbook generation → Nemotron (strong at code)
    "code_generation": {
        "primary": "nvidia/nemotron-mini-4b-instruct",
        "fallback": ["ollama/qwen2.5-coder:7b", "claude-3-sonnet"],
    },
}
```
|
||||
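
To illustrate how a router might walk such a table, here is a minimal sketch. It is not the project's `model_router.py`: `candidate_models` and `first_success` are hypothetical helpers, the table is a trimmed copy of the rules above, and treating unknown task types as general chat is an assumption.

```python
# Trimmed copy of the routing table, for illustration only.
ROUTING_RULES = {
    "tool_calling": {
        "primary": "nvidia/nemotron-mini-4b-instruct",
        "fallback": ["gemini-1.5-flash", "claude-3-haiku"],
    },
    "general_chat": {
        "primary": "ollama/qwen2.5:7b",
        "fallback": ["gemini-1.5-flash", "claude-3-haiku"],
    },
}


def candidate_models(task_type: str, rules: dict = ROUTING_RULES) -> list:
    """Return the models to try in order: primary first, then fallbacks."""
    # Assumption: unknown task types are handled like general chat.
    rule = rules.get(task_type, rules["general_chat"])
    return [rule["primary"], *rule["fallback"]]


def first_success(task_type: str, call_model) -> tuple:
    """Try each candidate until `call_model(model)` returns without raising."""
    errors = []
    for model in candidate_models(task_type):
        try:
            return model, call_model(model)
        except Exception as exc:  # sketch only; real code would catch narrower errors
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))


print(candidate_models("tool_calling"))
```

Keeping the ordering logic separate from the call loop makes the fallback order easy to unit-test without any network access.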
|
||||
### 2.3 Where It Plugs Into OpenClaw

```
┌───────────────────────────────────────────────────────────────┐
│                    OpenClaw Decision Flow                     │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  1. Incident arrives                                          │
│     │                                                         │
│     ▼                                                         │
│  2. Intent Classifier                                         │
│     │   └── Ollama qwen2.5 (local, fast)                      │
│     │                                                         │
│     ▼                                                         │
│  3. Complexity Analyzer                                       │
│     │   └── Ollama qwen2.5 (local, fast)                      │
│     │                                                         │
│     ▼                                                         │
│  4. Decision Manager  ← Nemotron is used here                 │
│     │   ├── Tool-calling decisions → Nemotron-mini (NIM)      │
│     │   ├── Complex reasoning → Nemotron-70B (NIM)            │
│     │   └── General replies → Ollama/Gemini                   │
│     │                                                         │
│     ▼                                                         │
│  5. Trust Engine                                              │
│     │                                                         │
│     ▼                                                         │
│  6. Multi-Sig (when required)                                 │
│     │                                                         │
│     ▼                                                         │
│  7. K8s Executor                                              │
│                                                               │
└───────────────────────────────────────────────────────────────┘
```
|
||||
|
||||
### 2.4 Environment Variables

```bash
# Additions to .env.production

# NVIDIA NIM API
NVIDIA_API_KEY=nvapi-xxxx
NVIDIA_API_BASE_URL=https://integrate.api.nvidia.com/v1

# Model selection
NEMOTRON_TOOL_MODEL=nvidia/nemotron-mini-4b-instruct
NEMOTRON_REASONING_MODEL=nvidia/llama-3.1-nemotron-70b-instruct

# Rate limiting (protects the free tier)
NEMOTRON_RPM_LIMIT=60
NEMOTRON_TPM_LIMIT=100000
```
|
||||
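
One way a limit such as `NEMOTRON_RPM_LIMIT` could be enforced on the client side is a sliding-window counter. The sketch below is an illustration under assumptions, not the project's `rate_limiter.py`: the class name `SlidingWindowLimiter` is hypothetical, and it covers only requests per minute, not the token budget in `NEMOTRON_TPM_LIMIT`.

```python
import os
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` calls within any `window_s`-second window."""

    def __init__(self, limit: int, window_s: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock  # injectable so tests can control time
        self._stamps = deque()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Drop timestamps that have left the window.
        while self._stamps and now - self._stamps[0] >= self.window_s:
            self._stamps.popleft()
        if len(self._stamps) >= self.limit:
            return False
        self._stamps.append(now)
        return True


# Read the limit from the environment variable defined above (default 60).
limiter = SlidingWindowLimiter(limit=int(os.getenv("NEMOTRON_RPM_LIMIT", "60")))
if limiter.try_acquire():
    print("request allowed")
```

When `try_acquire()` returns `False`, a caller could skip Nemotron for that request and move to the next model in the fallback list rather than wait.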
|
||||
---
|
||||
|
||||
## 3. 測試腳本
|
||||
|
||||
### 3.1 Tool-Calling Accuracy Test
|
||||
|
||||
```python
|
||||
#!/usr/bin/env python3
"""
Nemotron tool-calling accuracy test.
Compares the tool-calling ability of Nemotron (NIM) with a local Ollama Qwen model.

Usage:
    export NVIDIA_API_KEY=nvapi-xxxx
    export GEMINI_API_KEY=xxxx   # optional, currently unused
    python test_nemotron_tool_calling.py
"""

import asyncio
import json
import os
import time
from dataclasses import dataclass
from typing import Optional

import httpx

# ============================================================================
# Configuration
# ============================================================================

NVIDIA_API_KEY = os.getenv("NVIDIA_API_KEY")
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")  # reserved: no Gemini client is implemented below
OLLAMA_BASE_URL = "http://192.168.0.188:11434"
|
||||
|
||||
# ============================================================================
|
||||
# Tool 定義 (K8s SRE 場景)
|
||||
# ============================================================================
|
||||
|
||||
TOOLS = [
|
||||
{
|
||||
"type": "function",
|
||||
"function": {
|
||||
"name": "kubectl_get",
|
||||
"description": "Get Kubernetes resources (pods, deployments, services, etc.)",
|
||||
"parameters": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"resource": {
|
||||
"type": "string",
|
||||
"enum": ["pods", "deployments", "services", "nodes", "events"],
|
||||
"description": "Resource type to query"
|
||||
},
|
||||
"namespace": {
|
||||
"type": "string",
|
||||
"description": "Kubernetes namespace (default: awoooi-prod)"
|
||||
},
|
||||
"name": {
|
||||
"type": "string",
|
||||
"description": "Specific resource name (optional)"
|
||||
}
|
||||
},
|
||||
"required": ["resource"]
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"type": "function",
|
||||
"function": {
|
||||
"name": "restart_deployment",
|
||||
"description": "Restart a Kubernetes deployment by rolling restart",
|
||||
"parameters": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"deployment": {
|
||||
"type": "string",
|
||||
"description": "Deployment name"
|
||||
},
|
||||
"namespace": {
|
||||
"type": "string",
|
||||
"description": "Kubernetes namespace"
|
||||
}
|
||||
},
|
||||
"required": ["deployment", "namespace"]
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"type": "function",
|
||||
"function": {
|
||||
"name": "scale_deployment",
|
||||
"description": "Scale a Kubernetes deployment to specified replicas",
|
||||
"parameters": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"deployment": {"type": "string"},
|
||||
"namespace": {"type": "string"},
|
||||
"replicas": {"type": "integer", "minimum": 0, "maximum": 10}
|
||||
},
|
||||
"required": ["deployment", "namespace", "replicas"]
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"type": "function",
|
||||
"function": {
|
||||
"name": "get_logs",
|
||||
"description": "Get logs from a Kubernetes pod",
|
||||
"parameters": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"pod": {"type": "string"},
|
||||
"namespace": {"type": "string"},
|
||||
"tail": {"type": "integer", "description": "Number of lines (default: 100)"},
|
||||
"container": {"type": "string", "description": "Container name (optional)"}
|
||||
},
|
||||
"required": ["pod", "namespace"]
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"type": "function",
|
||||
"function": {
|
||||
"name": "send_alert",
|
||||
"description": "Send alert notification via Telegram",
|
||||
"parameters": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"severity": {"type": "string", "enum": ["info", "warning", "critical"]},
|
||||
"message": {"type": "string"},
|
||||
"incident_id": {"type": "string"}
|
||||
},
|
||||
"required": ["severity", "message"]
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
|
||||
# ============================================================================
|
||||
# 測試案例
|
||||
# ============================================================================
|
||||
|
||||
TEST_CASES = [
|
||||
{
|
||||
"id": "TC001",
|
||||
"description": "簡單查詢 - 列出所有 pods",
|
||||
"prompt": "Show me all pods in awoooi-prod namespace",
|
||||
"expected_tool": "kubectl_get",
|
||||
"expected_params": {"resource": "pods", "namespace": "awoooi-prod"}
|
||||
},
|
||||
{
|
||||
"id": "TC002",
|
||||
"description": "重啟服務",
|
||||
"prompt": "The API is not responding, please restart the awoooi-api deployment",
|
||||
"expected_tool": "restart_deployment",
|
||||
"expected_params": {"deployment": "awoooi-api", "namespace": "awoooi-prod"}
|
||||
},
|
||||
{
|
||||
"id": "TC003",
|
||||
"description": "擴展副本",
|
||||
"prompt": "We're getting high traffic, scale awoooi-web to 3 replicas",
|
||||
"expected_tool": "scale_deployment",
|
||||
"expected_params": {"deployment": "awoooi-web", "replicas": 3}
|
||||
},
|
||||
{
|
||||
"id": "TC004",
|
||||
"description": "查看日誌",
|
||||
"prompt": "Get the last 50 lines of logs from awoooi-api-xxx pod",
|
||||
"expected_tool": "get_logs",
|
||||
"expected_params": {"tail": 50}
|
||||
},
|
||||
{
|
||||
"id": "TC005",
|
||||
"description": "發送告警",
|
||||
"prompt": "Send a critical alert: Database connection failed for incident INC-2026-001",
|
||||
"expected_tool": "send_alert",
|
||||
"expected_params": {"severity": "critical"}
|
||||
},
|
||||
{
|
||||
"id": "TC006",
|
||||
"description": "複合理解 - 需要推理",
|
||||
"prompt": "The web frontend is showing 502 errors. Check if the API pods are running.",
|
||||
"expected_tool": "kubectl_get",
|
||||
"expected_params": {"resource": "pods"}
|
||||
},
|
||||
{
|
||||
"id": "TC007",
|
||||
"description": "繁體中文指令",
|
||||
"prompt": "請重啟 awoooi-worker 這個 deployment",
|
||||
"expected_tool": "restart_deployment",
|
||||
"expected_params": {"deployment": "awoooi-worker"}
|
||||
},
|
||||
{
|
||||
"id": "TC008",
|
||||
"description": "模糊指令 - 需要推理",
|
||||
"prompt": "Something is wrong with the worker, it keeps crashing. Fix it.",
|
||||
"expected_tool": "restart_deployment", # 或 get_logs
|
||||
"expected_params": {} # 接受多種合理回應
|
||||
}
|
||||
]
|
||||
|
||||
# ============================================================================
# API clients
# ============================================================================

@dataclass
class ToolCallResult:
    model: str
    test_id: str
    success: bool
    tool_called: Optional[str]
    params: Optional[dict]
    latency_ms: float
    error: Optional[str] = None


async def call_nemotron(prompt: str, model: str = "nvidia/nemotron-mini-4b-instruct") -> dict:
    """Call the NVIDIA NIM API and return the JSON body plus latency in ms."""
    async with httpx.AsyncClient(timeout=30) as client:
        start = time.time()
        response = await client.post(
            "https://integrate.api.nvidia.com/v1/chat/completions",
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {NVIDIA_API_KEY}",
            },
            json={
                "model": model,
                "messages": [
                    {"role": "system", "content": "You are an SRE assistant for AWOOOI AIOps platform. Use the provided tools to help with Kubernetes operations."},
                    {"role": "user", "content": prompt},
                ],
                "tools": TOOLS,
                "tool_choice": "auto",
                "temperature": 0.1,
                "max_tokens": 512,
            },
        )
        # Surface HTTP errors instead of parsing an error body as a result.
        response.raise_for_status()
        latency = (time.time() - start) * 1000
        return {"data": response.json(), "latency_ms": latency}
|
||||
|
||||
async def call_ollama(prompt: str, model: str = "qwen2.5:7b") -> dict:
    """Call the local Ollama server and return the JSON body plus latency in ms."""
    # Note: the benchmark summary reports 100+ s on CPU, so 60 s may be too short.
    async with httpx.AsyncClient(timeout=60) as client:
        start = time.time()
        response = await client.post(
            f"{OLLAMA_BASE_URL}/api/chat",
            json={
                "model": model,
                "messages": [
                    {"role": "system", "content": "You are an SRE assistant. Respond with JSON indicating which tool to call and parameters."},
                    {"role": "user", "content": f"Based on this request, which tool should be called and with what parameters? Request: {prompt}\n\nAvailable tools: kubectl_get, restart_deployment, scale_deployment, get_logs, send_alert\n\nRespond in JSON format: {{\"tool\": \"tool_name\", \"params\": {{...}}}}"},
                ],
                "stream": False,
                "format": "json",
            },
        )
        latency = (time.time() - start) * 1000
        return {"data": response.json(), "latency_ms": latency}
|
||||
|
||||
# ============================================================================
# Test execution
# ============================================================================

def parse_tool_call(response: dict, model_type: str) -> tuple:
    """Parse the tool call out of a model-specific response."""
    try:
        if model_type == "nemotron":
            choices = response.get("choices", [])
            if not choices:
                return (None, {})
            message = choices[0].get("message", {})
            if message.get("tool_calls"):
                tool_call = message["tool_calls"][0]
                return (
                    tool_call["function"]["name"],
                    json.loads(tool_call["function"]["arguments"]),
                )
            # No tool_calls: fall back to the plain-text content
            return (None, {"content": message.get("content", "")})

        elif model_type == "ollama":
            content = response.get("message", {}).get("content", "{}")
            parsed = json.loads(content)
            return (parsed.get("tool"), parsed.get("params", {}))

    except Exception as e:
        return (None, {"error": str(e)})

    return (None, {})
|
||||
|
||||
async def run_test(test_case: dict) -> list:
    """Run a single test case against every configured model."""
    results = []
    prompt = test_case["prompt"]

    # Test Nemotron (skipped when no API key is set)
    if NVIDIA_API_KEY:
        try:
            resp = await call_nemotron(prompt)
            tool, params = parse_tool_call(resp["data"], "nemotron")
            # Only the tool name is compared; expected_params is not checked here.
            success = tool == test_case["expected_tool"]
            results.append(ToolCallResult(
                model="Nemotron-mini-4B",
                test_id=test_case["id"],
                success=success,
                tool_called=tool,
                params=params,
                latency_ms=resp["latency_ms"],
            ))
        except Exception as e:
            results.append(ToolCallResult(
                model="Nemotron-mini-4B",
                test_id=test_case["id"],
                success=False,
                tool_called=None,
                params=None,
                latency_ms=0,
                error=str(e),
            ))

    # Test Ollama
    try:
        resp = await call_ollama(prompt)
        tool, params = parse_tool_call(resp["data"], "ollama")
        success = tool == test_case["expected_tool"]
        results.append(ToolCallResult(
            model="Ollama-Qwen2.5-7B",
            test_id=test_case["id"],
            success=success,
            tool_called=tool,
            params=params,
            latency_ms=resp["latency_ms"],
        ))
    except Exception as e:
        results.append(ToolCallResult(
            model="Ollama-Qwen2.5-7B",
            test_id=test_case["id"],
            success=False,
            tool_called=None,
            params=None,
            latency_ms=0,
            error=str(e),
        ))

    return results
|
||||
|
||||
async def main():
    """Main test flow: run every case, then print per-model statistics."""
    print("=" * 70)
    print("Nemotron vs Ollama tool-calling accuracy test")
    print("=" * 70)
    print()

    all_results = []

    for tc in TEST_CASES:
        print(f"[{tc['id']}] {tc['description']}")
        print(f"  Prompt: {tc['prompt'][:50]}...")
        print(f"  Expected: {tc['expected_tool']}")

        results = await run_test(tc)
        all_results.extend(results)

        for r in results:
            status = "✅" if r.success else "❌"
            print(f"  {r.model}: {status} → {r.tool_called} ({r.latency_ms:.0f}ms)")
            if r.error:
                print(f"    Error: {r.error}")
        print()

    # Aggregate results per model
    print("=" * 70)
    print("Summary")
    print("=" * 70)

    models = {}
    for r in all_results:
        if r.model not in models:
            models[r.model] = {"success": 0, "total": 0, "latency": []}
        models[r.model]["total"] += 1
        if r.success:
            models[r.model]["success"] += 1
        if r.latency_ms > 0:
            models[r.model]["latency"].append(r.latency_ms)

    print(f"{'Model':<25} {'Accuracy':<15} {'Avg Latency':<15}")
    print("-" * 55)
    for model, stats in models.items():
        acc = stats["success"] / stats["total"] * 100 if stats["total"] > 0 else 0
        avg_lat = sum(stats["latency"]) / len(stats["latency"]) if stats["latency"] else 0
        print(f"{model:<25} {acc:>6.1f}%        {avg_lat:>8.0f}ms")

    print()
    print("Tests complete!")


if __name__ == "__main__":
    asyncio.run(main())
|
||||
```
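
The script above scores a case as passed when the tool name matches, and it does not compare `expected_params`. If parameter accuracy should count as well, a check along the following lines could be added. This is a hedged sketch: `params_match`, `is_pass`, and the `accepted_tools` key are hypothetical and do not exist in the script.

```python
from typing import Optional


def params_match(expected: dict, actual: Optional[dict]) -> bool:
    """True when every expected key is present in `actual` with an equal value.

    Extra keys in `actual` are tolerated, so partial expectations still pass.
    """
    if actual is None:
        return not expected
    return all(actual.get(key) == value for key, value in expected.items())


def is_pass(test_case: dict, tool: Optional[str], params: Optional[dict]) -> bool:
    """Score one result on both the tool name and the expected parameters."""
    # `accepted_tools` is a hypothetical optional key for cases with several valid answers.
    accepted = test_case.get("accepted_tools") or [test_case["expected_tool"]]
    return tool in accepted and params_match(test_case["expected_params"], params)


case = {
    "expected_tool": "scale_deployment",
    "expected_params": {"deployment": "awoooi-web", "replicas": 3},
}
print(is_pass(case, "scale_deployment",
              {"deployment": "awoooi-web", "namespace": "awoooi-prod", "replicas": 3}))
```

A subset comparison fits test cases that specify only the parameters they care about, and the optional tool list gives ambiguous prompts more than one acceptable answer.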
|
||||
|
||||
### 3.2 Quick Verification Script (curl)
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# quick_test_nemotron.sh
|
||||
# 快速驗證 Nemotron API 連線
|
||||
|
||||
set -e
|
||||
|
||||
echo "=== Nemotron API 快速測試 ==="
|
||||
echo ""
|
||||
|
||||
# 檢查 API Key
|
||||
if [ -z "$NVIDIA_API_KEY" ]; then
|
||||
echo "❌ 請設定 NVIDIA_API_KEY"
|
||||
echo " export NVIDIA_API_KEY=nvapi-xxxx"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "✅ API Key 已設定"
|
||||
echo ""
|
||||
|
||||
# 測試簡單請求
|
||||
echo "測試 1: 簡單對話..."
|
||||
curl -s -X POST "https://integrate.api.nvidia.com/v1/chat/completions" \
|
||||
-H "Content-Type: application/json" \
|
||||
-H "Authorization: Bearer $NVIDIA_API_KEY" \
|
||||
-d '{
|
||||
"model": "nvidia/nemotron-mini-4b-instruct",
|
||||
"messages": [{"role": "user", "content": "Say hello in JSON format"}],
|
||||
"max_tokens": 50
|
||||
}' | jq '.choices[0].message.content'
|
||||
|
||||
echo ""
|
||||
echo "測試 2: Tool Calling..."
|
||||
curl -s -X POST "https://integrate.api.nvidia.com/v1/chat/completions" \
|
||||
-H "Content-Type: application/json" \
|
||||
-H "Authorization: Bearer $NVIDIA_API_KEY" \
|
||||
-d '{
|
||||
"model": "nvidia/nemotron-mini-4b-instruct",
|
||||
"messages": [
|
||||
{"role": "system", "content": "You are a K8s assistant."},
|
||||
{"role": "user", "content": "Restart the nginx deployment in production namespace"}
|
||||
],
|
||||
"tools": [{
|
||||
"type": "function",
|
||||
"function": {
|
||||
"name": "restart_deployment",
|
||||
"description": "Restart a K8s deployment",
|
||||
"parameters": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"deployment": {"type": "string"},
|
||||
"namespace": {"type": "string"}
|
||||
},
|
||||
"required": ["deployment", "namespace"]
|
||||
}
|
||||
}
|
||||
}],
|
||||
"tool_choice": "auto",
|
||||
"max_tokens": 200
|
||||
}' | jq '.choices[0].message'
|
||||
|
||||
echo ""
|
||||
echo "=== 測試完成 ==="
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4. 實作計畫

### 4.1 Phases

```
Phase N.1: Validation (1-2 days)
──────────────────────────
├── Register at build.nvidia.com
├── Obtain NVIDIA_API_KEY
├── Run quick_test_nemotron.sh
├── Run the full tool-calling test
└── Analyze the results and decide whether to continue

Phase N.2: Integration (2-3 days)
──────────────────────────
├── Build NvidiaAIProvider (modeled on the existing GeminiProvider)
├── Add routing rules to the Model Router
├── Configure environment variables + K8s Secrets
├── Integrate Langfuse tracing
└── Unit tests

Phase N.3: Acceptance (1 day)
──────────────────────────
├── E2E tests (real incident scenarios)
├── Latency + cost analysis
├── Chief Architect review
└── Commander approves go-live
```

### 4.2 File Layout

```
apps/api/src/
├── services/
│   └── ai/
│       ├── providers/
│       │   ├── ollama_provider.py    # existing
│       │   ├── gemini_provider.py    # existing
│       │   ├── claude_provider.py    # existing
│       │   └── nvidia_provider.py    # 🆕 new
│       │
│       ├── model_router.py           # modified: add Nemotron routes
│       └── rate_limiter.py           # modified: add Nemotron rate limiting
```

### 4.3 New GitHub Secrets

```bash
# Add to GitHub Secrets (repository settings):
#   NVIDIA_API_KEY = nvapi-xxxx

# Add to K8s Secrets
kubectl create secret generic nvidia-api \
  --from-literal=NVIDIA_API_KEY=nvapi-xxxx \
  -n awoooi-prod
```
|
||||
|
||||
---
|
||||
|
||||
## 5. 成本估算

### 5.1 Free Tier

| Item | Estimate |
|------|----------|
| **Development and testing** | Free (build.nvidia.com) |
| **Rate limit** | To be confirmed (possibly 60 RPM) |

### 5.2 Production (if paid usage is needed)

| Model | Price (estimated) | Monthly usage | Monthly cost |
|-------|-------------------|---------------|--------------|
| nemotron-mini-4b | ~$0.1 / 1M tokens | ~5M tokens | ~$0.5 |
| nemotron-70b | ~$1.0 / 1M tokens | ~1M tokens | ~$1.0 |

**Conclusion**: at these estimated prices the combined cost is about $1.5 per month, far cheaper than the Claude API.
|
||||
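
The monthly figures in 5.2 are price per million tokens multiplied by monthly usage in millions of tokens. The sketch below reproduces that arithmetic from the table's estimates; the prices are the proposal's own estimates rather than confirmed NVIDIA pricing, and the function name is illustrative.

```python
# Estimated prices (USD per 1M tokens) and monthly usage (millions of tokens) from 5.2.
PRICE_PER_M_TOKENS = {"nemotron-mini-4b": 0.1, "nemotron-70b": 1.0}
MONTHLY_M_TOKENS = {"nemotron-mini-4b": 5, "nemotron-70b": 1}


def monthly_cost_usd(prices: dict, usage: dict) -> dict:
    """Cost per model = price per 1M tokens x millions of tokens used."""
    return {model: round(prices[model] * usage[model], 2) for model in usage}


costs = monthly_cost_usd(PRICE_PER_M_TOKENS, MONTHLY_M_TOKENS)
total = round(sum(costs.values()), 2)
print(costs, total)
```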
|
||||
---
|
||||
|
||||
## 6. 風險評估

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Free tier runs out | Medium | Low | Fall back to Gemini |
| High API latency | Low | Medium | Local cache + timeout |
| Poor tool-calling accuracy | Low | High | Validate during the test phase |
| Service instability | Low | Medium | Multi-tier fallback |
|
||||
|
||||
---
|
||||
|
||||
## 附錄: 下一步行動

Once the Commander approves, run the following immediately:

```bash
# Step 1: Obtain an API key
# Register at https://build.nvidia.com and create a key

# Step 2: Set the environment variable
export NVIDIA_API_KEY=nvapi-xxxx

# Step 3: Quick verification
cd apps/api
./scripts/quick_test_nemotron.sh

# Step 4: Full test
python scripts/test_nemotron_tool_calling.py
```

---

**Author**: Claude Code
**Date**: 2026-03-28 (Taipei time)
**Status**: Pending review
|
||||
1061
docs/proposals/NEMOTRON-INTEGRATION-SOLUTION.md
Normal file
File diff suppressed because it is too large
336
docs/runbooks/RUNBOOK-E2E-CI-SCHEDULE.md
Normal file
@@ -0,0 +1,336 @@
# Runbook: Scheduling Periodic E2E Playwright Runs in CI

> **Type**: Operational runbook
> **Priority**: 🔴 P0
> **Created**: 2026-03-29 12:38 (Taipei)
> **Author**: Antigravity
> **Estimated effort**: 30 minutes
> **Prerequisite**: the Playwright tests pass locally with `pnpm test:e2e`

---

## Background and Current State

### 🔍 Current-State Diagnosis

**The 12 existing E2E test files** (`apps/web/tests/e2e/`):

| Test file | Scope |
|-----------|-------|
| `dashboard-acceptance.spec.ts` | Home dashboard acceptance |
| `multisig-security.spec.ts` | Multi-signature approval security |
| `approval-card-verify.spec.ts` | Approval card verification |
| `phase11-conversational.spec.ts` | Phase 11 conversational features |
| `phase19-production-verification.spec.ts` | Phase 19 production verification |
| `action-log.spec.ts` | Action log |
| `cpo102-visual.spec.ts` | Visual screenshot tests |
| `visual-armor-upgrade.spec.ts` | Visual upgrade verification |
| `debug-error.spec.ts` | Error page |
| `rbac-screenshot.spec.ts` | RBAC screenshot verification |
| `phase4-final-demo.spec.ts` | Phase 4 demo |
| `phase4-timeline.spec.ts` | Phase 4 timeline |

**Gap**: `playwright.config.ts` is already configured, but `.github/workflows/` contains **no scheduled workflow** that runs these tests periodically.

---

## Step 1: Create the Scheduled E2E Workflow

Create `.github/workflows/e2e-weekly.yaml`:

```yaml
name: 🎭 E2E Playwright weekly acceptance

on:
  # Runs every Monday at 02:30 Taipei time (Sunday 18:30 UTC)
  schedule:
    - cron: '30 18 * * 0'

  # Allow manual runs
  workflow_dispatch:
    inputs:
      environment:
        description: 'Test environment URL'
        required: false
        # NOTE: this default uses https while the BASE_URL fallback below uses http;
        # confirm which scheme the NodePort actually serves.
        default: 'https://192.168.0.120:32335'
      test_suite:
        description: 'Test suite (all / smoke / visual)'
        required: false
        default: 'all'

jobs:
  e2e-test:
    name: 🎭 E2E Playwright (${{ matrix.browser }})
    runs-on: [self-hosted, harbor]  # GitHub runner on the .110 host

    strategy:
      fail-fast: false  # one browser failing does not cancel the others
      matrix:
        browser: [chromium, firefox]

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Install pnpm
        uses: pnpm/action-setup@v4
        with:
          version: 9

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'pnpm'

      - name: Install dependencies
        run: pnpm install --frozen-lockfile

      - name: Install Playwright Browsers
        run: pnpm --filter web exec playwright install --with-deps ${{ matrix.browser }}

      - name: 🎭 Run E2E Tests
        run: |
          cd apps/web
          pnpm exec playwright test \
            --project=${{ matrix.browser }} \
            --reporter=html \
            ${SUITE_FILTER}
        env:
          # Map the test_suite input to a tag filter; 'all' and scheduled runs use no filter.
          SUITE_FILTER: ${{ github.event.inputs.test_suite == 'smoke' && '--grep @smoke' || github.event.inputs.test_suite == 'visual' && '--grep @visual' || '' }}
          BASE_URL: ${{ github.event.inputs.environment || 'http://192.168.0.120:32335' }}
          CI: true

      - name: 📁 Upload Test Report
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: playwright-report-${{ matrix.browser }}-${{ github.run_id }}
          path: apps/web/playwright-report/
          retention-days: 14

      - name: 📸 Upload Screenshots on Failure
        uses: actions/upload-artifact@v4
        if: failure()
        with:
          name: e2e-screenshots-${{ matrix.browser }}-${{ github.run_id }}
          path: apps/web/test-results/
          retention-days: 7

      - name: 🚨 Notify Telegram on Failure
        if: failure()
        run: |
          # --data-urlencode keeps newlines and special characters in the message intact
          curl -s -X POST "https://api.telegram.org/bot${TG_BOT_TOKEN}/sendMessage" \
            -d chat_id="${TG_CHAT_ID}" \
            -d parse_mode="HTML" \
            --data-urlencode "text=🎭 <b>E2E weekly test failed</b>

          Browser: ${{ matrix.browser }}
          Trigger: ${{ github.event_name }}
          Time: $(TZ='Asia/Taipei' date '+%Y-%m-%d %H:%M:%S')

          Report: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}

          🔍 Please investigate the UI regression immediately"
        env:
          TG_BOT_TOKEN: ${{ secrets.OPENCLAW_TG_BOT_TOKEN }}
          TG_CHAT_ID: ${{ secrets.OPENCLAW_TG_CHAT_ID }}

  # Visual snapshot comparison (separate job, scheduled runs only)
  visual-regression:
    name: 📸 Visual regression comparison
    runs-on: [self-hosted, harbor]
    if: github.event_name == 'schedule'  # only on scheduled runs
    needs: e2e-test

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      # Same toolchain setup as the e2e-test job, so pnpm is on PATH here too
      - name: Install pnpm
        uses: pnpm/action-setup@v4
        with:
          version: 9

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'pnpm'

      - name: Install deps
        run: |
          pnpm install --frozen-lockfile
          pnpm --filter web exec playwright install --with-deps chromium

      - name: 📸 Generate Visual Snapshots
        run: |
          cd apps/web
          # New screenshots create a baseline automatically. The `=missing` mode needs a
          # Playwright release whose CLI accepts a mode argument for --update-snapshots.
          pnpm exec playwright test \
            --project=chromium \
            --grep @visual \
            --reporter=html \
            --update-snapshots=missing
        env:
          BASE_URL: 'http://192.168.0.120:32335'
          CI: true

      - name: 📁 Upload Visual Snapshots
        uses: actions/upload-artifact@v4
        with:
          name: visual-snapshots-${{ github.run_id }}
          path: apps/web/tests/e2e/__snapshots__/
          retention-days: 30

      - name: 🚨 Notify Telegram on Visual Regression
        if: failure()
        run: |
          curl -s -X POST "https://api.telegram.org/bot${TG_BOT_TOKEN}/sendMessage" \
            -d chat_id="${TG_CHAT_ID}" \
            -d parse_mode="HTML" \
            --data-urlencode "text=📸 <b>Visual regression test failed</b>

          The AWOOOI frontend shows visual changes!
          Time: $(TZ='Asia/Taipei' date '+%Y-%m-%d %H:%M:%S')

          Commander, please review the screenshot comparison report:
          ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
        env:
          TG_BOT_TOKEN: ${{ secrets.OPENCLAW_TG_BOT_TOKEN }}
          TG_CHAT_ID: ${{ secrets.OPENCLAW_TG_CHAT_ID }}
```
|
||||
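
GitHub Actions evaluates `schedule` cron expressions in UTC, so the Taipei-time target has to be shifted by eight hours, and here the shift also moves the weekday back to Sunday. The sketch below double-checks that conversion. `taipei_to_utc_cron` is an illustrative helper, not part of the workflow, and it relies on Taiwan having a fixed UTC+8 offset with no daylight saving time.

```python
from datetime import datetime, timedelta, timezone

TAIPEI = timezone(timedelta(hours=8))  # fixed offset: Taiwan observes no DST


def taipei_to_utc_cron(weekday: int, hour: int, minute: int) -> str:
    """Convert a weekly Taipei-time slot to a UTC cron expression.

    `weekday` uses cron numbering: 0 = Sunday ... 6 = Saturday.
    """
    # 2024-01-07 was a Sunday, so adding `weekday` days lands on the requested day.
    local = datetime(2024, 1, 7, hour, minute, tzinfo=TAIPEI) + timedelta(days=weekday)
    utc = local.astimezone(timezone.utc)
    cron_weekday = (utc.weekday() + 1) % 7  # Python: Monday = 0; cron: Sunday = 0
    return f"{utc.minute} {utc.hour} * * {cron_weekday}"


# Monday 02:30 in Taipei
print(taipei_to_utc_cron(weekday=1, hour=2, minute=30))  # → 30 18 * * 0
```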
|
||||
---
|
||||
|
||||
## Step 2: Tag Existing Tests as Smoke / Visual

```typescript
// Add a tag to the top-level describe title in each spec file:
import { test, expect } from '@playwright/test';

// dashboard-acceptance.spec.ts (core functionality, tagged @smoke)
test.describe('Dashboard Acceptance @smoke', () => { /* ... */ });

// cpo102-visual.spec.ts (visual screenshots, tagged @visual)
test.describe('Visual Regression @visual', () => {
  test('CPO-102 home page visual', async ({ page }) => {
    await page.goto('/');
    await expect(page).toHaveScreenshot('dashboard-baseline.png', {
      fullPage: true,
      // Allow up to 2% of pixels to differ. `threshold` is a per-pixel color
      // tolerance rather than a pixel percentage, so maxDiffPixelRatio is used here.
      maxDiffPixelRatio: 0.02,
    });
  });
});
```

**Suggested tagging**:

| Tag | Tests included | When it runs |
|-----|----------------|--------------|
| `@smoke` | dashboard, approval, action-log | After every CD run + weekly |
| `@visual` | cpo102-visual, visual-armor | Weekly schedule only |
| (no tag) | All other tests | Weekly schedule only |
|
||||
|
||||
---
|
||||
|
||||
## Step 3: Deploy and Verify

```bash
# 1. Commit the workflow file
git add .github/workflows/e2e-weekly.yaml
git commit -m "feat(ci): add weekly E2E Playwright schedule with Telegram failure notification"
git push origin main

# 2. Trigger a manual run (confirms the workflow works)
gh workflow run e2e-weekly.yaml \
  -f environment="http://192.168.0.120:32335" \
  -f test_suite="smoke"

# 3. Watch the workflow run
gh run watch

# 4. Confirm the Telegram bot receives a failure notification (deliberately make one test fail)
```

---

## Acceptance Criteria

| Item | Pass condition |
|------|----------------|
| Workflow exists | `.github/workflows/e2e-weekly.yaml` pushed successfully |
| Manual trigger works | `gh workflow run` starts a run and it completes |
| Smoke tests pass | Every test tagged `@smoke` passes |
| Failure notification works | The Telegram bot receives the failure message |
| Report uploaded | GitHub Actions artifacts include `playwright-report-*` |
|
||||
|
||||
---
|
||||
|
||||
## ⚠️ 架構安全補丁(2026-03-29 更新,部署前必讀)
|
||||
|
||||
> 來源:`ARCHITECTURAL_RISK_WAR_GAME.md` 深度沙盤推演,代碼確認級別
|
||||
|
||||
### 補丁 1:playwright.config.ts 必須加入 ignoreHTTPSErrors
|
||||
|
||||
**問題**:內網 K3s 使用自簽憑證(Self-signed cert),Playwright 連接 `https://192.168.0.120:32335` 時會遭遇 `NET::ERR_CERT_AUTHORITY_INVALID`,導致**所有測試在第一步就失敗**。
|
||||
|
||||
**修復**(`apps/web/playwright.config.ts`):
|
||||
|
||||
```typescript
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    baseURL: process.env.BASE_URL || 'http://192.168.0.120:32335',
    ignoreHTTPSErrors: true, // 🆕 required, otherwise the self-signed certificate blocks every test
    viewport: { width: 1280, height: 720 },
    deviceScaleFactor: 1, // 🆕 keeps Retina DPI differences out of screenshot comparisons
  },
  expect: {
    toHaveScreenshot: {
      threshold: 0.05, // 🆕 per-pixel color tolerance (0 to 1); absorbs small cross-platform font-rendering differences
      maxDiffPixelRatio: 0.05, // at most 5% of pixels may differ
    },
  },
});
```
|
||||
|
||||
---
|
||||
|
||||
### 補丁 2:Visual Baseline 必須在 Docker(Linux 環境)中產生
|
||||
|
||||
**問題**:Mac(CoreText 渲染)與 GitHub Actions CI(Linux FreeType 渲染)的字體像素不同。若在 Mac 本機產生 baseline,CI 比對時**100% 誤報失敗**。
|
||||
|
||||
> 🔴 **絕對禁止在本機 Mac 環境執行 `--update-snapshots`!**
|
||||
|
||||
**正確的 Baseline 更新流程**:
|
||||
|
||||
```bash
|
||||
# 在 Mac 本機執行,但透過 Docker(Linux 環境)產生截圖
|
||||
cd apps/web
|
||||
docker run --rm \
|
||||
-v $(pwd):/work -w /work \
|
||||
-p 3000:3000 \
|
||||
mcr.microsoft.com/playwright:v1.44.0-jammy \
|
||||
pnpm exec playwright test \
|
||||
--update-snapshots \
|
||||
--project=chromium \
|
||||
--grep @visual
|
||||
|
||||
# Docker 產生的 .png 自動存入 tests/e2e/__snapshots__/
|
||||
# 提 PR,標注 📸 VISUAL_UPDATE
|
||||
# 統帥視覺審核截圖後方可合併
|
||||
```
|
||||
|
||||
**加入 package.json scripts**:
|
||||
|
||||
```json
|
||||
{
|
||||
"scripts": {
|
||||
"test:visual": "playwright test --project=chromium --grep @visual",
|
||||
"test:visual:update": "docker run --rm -v $(pwd):/work -w /work mcr.microsoft.com/playwright:v1.44.0-jammy pnpm exec playwright test --update-snapshots --project=chromium --grep @visual"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 補丁 3:CI Threshold 需與本 RunBook Step 2 標籤保持一致
|
||||
|
||||
Visual 測試中 `threshold: 0.02`(Step 2 示例代碼)與 `playwright.config.ts` 全局設定 `0.05` 會以**個別設定優先**。建議統一為:
|
||||
|
||||
```typescript
|
||||
// 全局(playwright.config.ts)
|
||||
threshold: 0.05 // 寬鬆(跨平台環境差異)
|
||||
|
||||
// 個別敏感組件(*.spec.ts)
|
||||
await expect(page).toHaveScreenshot('component.png', {
|
||||
threshold: 0.02 // 嚴格(關鍵組件精確比對)
|
||||
});
|
||||
```
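The two screenshot settings are easy to conflate: `threshold` is a per-pixel color tolerance, while `maxDiffPixelRatio` limits the share of pixels that may exceed that tolerance. The Python sketch below is a simplified model and not Playwright's actual algorithm (Playwright compares perceived color; a plain grayscale difference stands in for it here). The function names and the sample pixel lists are hypothetical.

```python
def differing_pixel_ratio(baseline, actual, threshold):
    """Fraction of pixels whose normalized difference exceeds `threshold`."""
    if len(baseline) != len(actual):
        raise ValueError("images must have the same number of pixels")
    differing = sum(
        1 for b, a in zip(baseline, actual)
        if abs(b - a) / 255 > threshold  # per-pixel tolerance
    )
    return differing / len(baseline)


def screenshots_match(baseline, actual, threshold=0.05, max_diff_pixel_ratio=0.05):
    """A comparison passes when few enough pixels exceed the per-pixel tolerance."""
    ratio = differing_pixel_ratio(baseline, actual, threshold)
    return ratio <= max_diff_pixel_ratio


# Grayscale "images" as flat lists of 0-255 values.
baseline = [100] * 100
slightly_off = [105] * 100         # every pixel is off by a small amount
five_bad = [100] * 95 + [255] * 5  # 5% of pixels are badly off
six_bad = [100] * 94 + [255] * 6   # 6% of pixels are badly off
```

In this model, a stricter `threshold` makes more pixels count as different, whereas a stricter `max_diff_pixel_ratio` allows fewer such pixels before the comparison fails.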
|
||||
499
docs/runbooks/RUNBOOK-FRONTEND-UIUX-SOVEREIGNTY.md
Normal file
@@ -0,0 +1,499 @@
|
||||
# RunBook: 前端 UI/UX 核心痛點徹底解決方案
|
||||
|
||||
> **類型**: 架構設計 + 實施 RunBook
|
||||
> **優先級**: 🔴 P0 (核心競爭力)
|
||||
> **建立**: 2026-03-29 12:38 (台北)
|
||||
> **建立者**: Antigravity
|
||||
> **核心命題**: 讓 AWOOOI 的前端「讓人看一眼就驚艷,用一次就上癮」
|
||||
|
||||
---
|
||||
|
||||
## 一、Claude Code 前端弱點診斷
|
||||
|
||||
### 1.1 五大結構性弱點
|
||||
|
||||
| 弱點 | 後果 | 解法 |
|
||||
|------|------|------|
|
||||
| **視覺感知盲區** | 代碼語法正確,但視覺效果差 | Playwright 截圖 + 統帥視覺主控 |
|
||||
| **CSS 語境失憶** | 改 A 壞 B,不知道全局 CSS 影響 | Storybook 隔離組件 + TypeCheck |
|
||||
| **動畫設計無感** | 150ms 快閃 vs 300ms 遲滯無法感知 | 在 Storybook Story 中定義動畫標準 |
|
||||
| **設計意圖推斷不足** | Nothing.tech 審美需要視覺參照 | 每個組件附帶規格截圖 |
|
||||
| **Safari 兼容性盲點** | `backdrop-blur` 在 Safari 有 bug | Playwright multi-browser E2E |
|
||||
|
||||
### 1.2 技術債現況(截至 2026-03-29)
|
||||
|
||||
```
|
||||
i18n 違規:40+ 處(TECHNICAL_DEBT_PHASE2.md)
|
||||
shadcn/ui 殘留:已部分廢除,但需全面確認
|
||||
GenUI Registry:只有基礎 5 張卡片,缺少監控類
|
||||
Knowledge Base:頁面空白,無後端
|
||||
Omni-Terminal:外殼完整,SSE 事件類型未完全對接
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 二、九大行動詳細實施計畫
|
||||
|
||||
### 🔴 行動 1: i18n 閃電清零(工時:4h)
|
||||
|
||||
**策略:一次性掃描→批量修復,非逐一處理**
|
||||
|
||||
```bash
|
||||
# Step 1.1: 自動掃描所有硬編碼字串
|
||||
cd /Users/ogt/awoooi/apps/web
|
||||
|
||||
# 掃描 TSX 中的英文硬編碼(排除技術識別符)
|
||||
grep -rn '"[A-Z][A-Z]' src/ --include="*.tsx" --include="*.ts" | \
|
||||
grep -v "//.*\"" | \
|
||||
grep -v "className=" | \
|
||||
grep -v "href=" | \
|
||||
grep -v "id=" | \
|
||||
grep -v "import " > /tmp/en_violations.txt
|
||||
|
||||
# 掃描中文硬編碼
|
||||
grep -rn "\"[^\x00-\x7F]" src/ --include="*.tsx" | \
|
||||
grep -v "//.*\"" > /tmp/zh_violations.txt
|
||||
|
||||
echo "英文違規:$(wc -l < /tmp/en_violations.txt) 處"
|
||||
echo "中文違規:$(wc -l < /tmp/zh_violations.txt) 處"
|
||||
```
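Whether `\x00-\x7F` inside a bracket expression is interpreted as a byte range depends on the grep flavor (POSIX bracket expressions treat the backslash literally; PCRE mode is a GNU grep option), so the non-ASCII scan above may behave differently on macOS and Linux. A minimal Python sketch of the same check follows. It assumes a simplified rule (a double quote directly followed by a non-ASCII character, whole-line `//` comments skipped), and the function names are hypothetical.

```python
import re
from pathlib import Path

# A double quote immediately followed by a non-ASCII character, e.g. "待機".
NON_ASCII_LITERAL = re.compile(r'"[^\x00-\x7F]')


def find_hardcoded_non_ascii(source: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) for each suspicious line."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("//"):
            continue  # skip whole-line comments
        if NON_ASCII_LITERAL.search(line):
            hits.append((lineno, stripped))
    return hits


def scan_tree(root: str) -> dict[str, list[tuple[int, str]]]:
    """Scan every .tsx file under `root` and collect the hits per file."""
    report = {}
    for path in Path(root).rglob("*.tsx"):
        hits = find_hardcoded_non_ascii(path.read_text(encoding="utf-8"))
        if hits:
            report[str(path)] = hits
    return report


sample = '''const a = "待機";
// "註解" is ignored
const b = "Idle";
const c = t("status.idle");
'''
```

Calling `scan_tree("src")` from `apps/web` would produce a per-file report that can be compared against the grep output.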
|
||||
|
||||
**已知 P0 違規快速修復清單**(根據 TECHNICAL_DEBT_PHASE2.md):
|
||||
|
||||
```typescript
|
||||
// ❌ agent/data-pincer.tsx:50-78(需修復)
|
||||
// 現在:
|
||||
const statuses = {
|
||||
standby: 'STANDBY',
|
||||
analyzing: 'ANALYZING',
|
||||
executing: 'EXECUTING',
|
||||
awaiting: 'AWAITING APPROVAL',
|
||||
error: 'ERROR'
|
||||
}
|
||||
|
||||
// ✅ 修復後:
|
||||
const statuses = {
|
||||
standby: t('status.standby'),
|
||||
analyzing: t('status.analyzing'),
|
||||
executing: t('status.executing'),
|
||||
awaiting: t('status.awaitingApproval'),
|
||||
error: t('status.error')
|
||||
}
|
||||
```
|
||||
|
||||
```typescript
|
||||
// ❌ status-orb.tsx:16-31
|
||||
// 現在:
|
||||
const STATUS_TEXT = {
|
||||
idle: 'Idle',
|
||||
thinking: 'Thinking',
|
||||
executing: 'Executing',
|
||||
awaiting: 'Awaiting Approval'
|
||||
}
|
||||
|
||||
// ✅ 修復後:
|
||||
const STATUS_TEXT = {
|
||||
idle: t('status.idle'),
|
||||
thinking: t('status.thinking'),
|
||||
executing: t('status.executing'),
|
||||
awaiting: t('status.awaitingApproval')
|
||||
}
|
||||
```
|
||||
|
||||
**需要同步更新的字典檔**:
|
||||
|
||||
```json
|
||||
// apps/web/messages/zh-TW.json(追加)
|
||||
{
|
||||
"status": {
|
||||
"idle": "待機",
|
||||
"thinking": "分析中",
|
||||
"executing": "執行中",
|
||||
"awaitingApproval": "等待核准",
|
||||
"error": "錯誤",
|
||||
"standby": "待機",
|
||||
"analyzing": "分析中"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
```json
|
||||
// apps/web/messages/en.json(追加)
|
||||
{
|
||||
"status": {
|
||||
"idle": "Idle",
|
||||
"thinking": "Thinking",
|
||||
"executing": "Executing",
|
||||
"awaitingApproval": "Awaiting Approval",
|
||||
"error": "Error",
|
||||
"standby": "Standby",
|
||||
"analyzing": "Analyzing"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**驗收指令(必須通過)**:
|
||||
|
||||
```bash
|
||||
cd apps/web
|
||||
|
||||
# 確認無中文硬編碼
|
||||
if grep -rn '"[^\x00-\x7F]' src/ --include="*.tsx" | grep -v "//"; then
|
||||
echo "❌ 仍有中文硬編碼!"
|
||||
exit 1
|
||||
else
|
||||
echo "✅ 中文硬編碼清零"
|
||||
fi
|
||||
|
||||
# TypeScript 編譯驗證(workflow 硬規則)
|
||||
pnpm exec tsc --noEmit
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 🔴 行動 2: Storybook 組件庫建立(工時:8h)
|
||||
|
||||
**這是解決 AI 視覺感知盲區的根本方案**。
|
||||
|
||||
```bash
|
||||
# Step 2.1: 安裝 Storybook
|
||||
cd apps/web
|
||||
pnpm add -D storybook@latest @storybook/nextjs @storybook/addon-essentials \
|
||||
@storybook/addon-interactions @storybook/test
|
||||
npx storybook@latest init --builder webpack5
|
||||
|
||||
# Step 2.2: 配置 Nothing.tech 主題
|
||||
```
|
||||
|
||||
**必須上架的 10 個核心組件 Story**:
|
||||
|
||||
| 組件 | Nothing.tech 規格 | Story 狀態 |
|
||||
|------|-----------------|-----------|
|
||||
| `GlassCard` | `bg-white/70 backdrop-blur-[20px] border border-black/[0.06]` | Loading / Content / Error |
|
||||
| `StatusOrb` | 燈號 + `animate-ping`(critical 時)| idle / thinking / executing / critical |
|
||||
| `ApprovalCard` | 1.0 ConversationalView 風格 | LOW / MEDIUM / HIGH / CRITICAL |
|
||||
| `OmniTerminal` | VT323 字體 + 綠色游標閃爍 | empty / thinking / streaming / error |
|
||||
| `HostCard` | CPU/Memory 橫條 + 脈搏點 | healthy / warning / critical |
|
||||
| `MetricsCard` | 數字大字 + 趨勢箭頭 | up / down / stable |
|
||||
| `SystemHealthCard` | 燈號矩陣 5x5 | all-healthy / some-warning / critical |
|
||||
| `FinOpsCard` | 成本分解 + 可省金額 | monthly / quarterly |
|
||||
| `SLOCard` | 達成率 + 趨勢 | healthy / at-risk / breached |
|
||||
| `AnomalyFrequencyCard` | 頻率統計 + 升級建議 | normal / repeat / escalate |
|
||||
|
||||
**Storybook Story 範例**(GlassCard):
|
||||
|
||||
```typescript
|
||||
// apps/web/src/components/ui/glass-card.stories.ts
|
||||
import type { Meta, StoryObj } from '@storybook/react';
|
||||
import { GlassCard } from './glass-card';
|
||||
|
||||
const meta: Meta<typeof GlassCard> = {
|
||||
title: 'AWOOOI/UI/GlassCard',
|
||||
component: GlassCard,
|
||||
parameters: {
|
||||
// Nothing.tech 白底背景
|
||||
backgrounds: {
|
||||
default: 'nothing-white',
|
||||
values: [{ name: 'nothing-white', value: '#F5F5F0' }],
|
||||
},
|
||||
// 規格文件截圖
|
||||
docs: {
|
||||
description: {
|
||||
component: `
|
||||
Nothing.tech 白玻璃卡片。固定規格:
|
||||
- bg: bg-white/70
|
||||
- blur: backdrop-blur-[20px]
|
||||
- border: border border-black/[0.06]
|
||||
- radius: rounded-xl
|
||||
`,
|
||||
},
|
||||
},
|
||||
},
|
||||
tags: ['autodocs'],
|
||||
};
|
||||
export default meta;
|
||||
|
||||
type Story = StoryObj<typeof GlassCard>;
|
||||
|
||||
export const Default: Story = {
|
||||
args: { children: '玻璃卡片內容' }
|
||||
};
|
||||
|
||||
export const WithCriticalBorder: Story = {
|
||||
args: {
|
||||
children: '緊急狀態',
|
||||
className: 'border-status-critical border-2'
|
||||
}
|
||||
};
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 🔴 行動 3: AI 視覺審查 SOP(工時:2h 建立,後續 0 工時)
|
||||
|
||||
**建立標準作業程序,讓 AI 自主截圖並等待統帥審核:**
|
||||
|
||||
```markdown
|
||||
## AI 前端修改 SOP(強制執行)
|
||||
|
||||
### 修改前
|
||||
1. 查閱 Storybook 對應組件的規格 Story
|
||||
2. 確認 Nothing.tech 視覺 Token(詳見 tailwind.config.ts)
|
||||
|
||||
### 修改中
|
||||
3. 修改代碼
|
||||
4. 執行 `pnpm exec tsc --noEmit`(語法驗證)
|
||||
|
||||
### 修改後(🆕 新增強制步驟)
|
||||
5. 啟動 Dev Server:`pnpm dev`
|
||||
6. 執行截圖腳本:
|
||||
```bash
|
||||
cd apps/web
|
||||
pnpm exec playwright screenshot \
|
||||
--browser chromium \
|
||||
http://localhost:3000 \
|
||||
docs/screenshots/$(date +%Y%m%d-%H%M)/homepage.png
|
||||
```
|
||||
7. 截圖存至 `docs/screenshots/{date}/{component}.png`
|
||||
8. 在 LOGBOOK 記錄:「已截圖,視覺存檔 docs/screenshots/xxx/yyy.png」
|
||||
9. 等待統帥視覺審批後方可 commit
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 🟠 行動 4: Omni-Terminal 後端全接通(工時:8h)
|
||||
|
||||
根據 `ADR-031` 和 `AWOOOI_AGENTIC_WORKSPACE_ROADMAP.md` 的神經連接藍圖:
|
||||
|
||||
#### 4.1 SSE event type definitions

```python
# apps/api/src/api/v1/terminal.py
# Extends the existing SSE endpoint
import json
from enum import Enum
from typing import Any, AsyncIterator


class SSEEventType(str, Enum):
    THOUGHT = "thought"          # agent thought stream
    TOOL_CALL = "tool_call"      # tool invocation (also the micro-animation trigger)
    TOOL_RESULT = "tool_result"  # tool result
    RENDER_UI = "render_ui"      # dynamically render a GenUI component
    STREAM_END = "stream_end"    # end of the stream


def sse(event: SSEEventType, data: dict[str, Any]) -> str:
    """Serialize one SSE frame: event name, JSON payload, blank-line terminator."""
    return f"event: {event.value}\ndata: {json.dumps(data)}\n\n"


# Example event sequence:
async def stream_terminal_response(command: str) -> AsyncIterator[str]:
    # 1. Thought stream
    yield sse(SSEEventType.THOUGHT, {"text": "[Investigator] Analyzing command..."})

    # 2. Tool call (triggers the front-end CSS animation)
    yield sse(SSEEventType.TOOL_CALL, {"tool": "kubectl_get", "args": {"resource": "pod", "namespace": "awoooi-prod"}})

    # 3. Tool result (placeholder: the real pod list goes here)
    yield sse(SSEEventType.TOOL_RESULT, {"pods": []})

    # 4. Render a GenUI card (the front end loads the component dynamically; props are a placeholder)
    yield sse(SSEEventType.RENDER_UI, {"component": "SystemHealthCard", "props": {}})

    # 5. End
    yield sse(SSEEventType.STREAM_END, {"success": True})
```
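On the wire, each event above is a block of `event:` and `data:` lines terminated by a blank line. The sketch below shows roughly how a consumer could split such a stream back into typed events, which is handy for backend tests. It is a minimal parser written under assumptions: it handles only single-line `event:` and `data:` fields, and the helper name is hypothetical.

```python
import json


def parse_sse_frames(raw: str) -> list[tuple[str, dict]]:
    """Split a raw SSE stream into (event name, JSON payload) pairs."""
    frames = []
    for block in raw.split("\n\n"):  # frames are separated by a blank line
        event, data = "message", None  # "message" is the SSE default event name
        for line in block.splitlines():
            if line.startswith("event:"):
                event = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data = json.loads(line[len("data:"):].strip())
        if data is not None:
            frames.append((event, data))
    return frames


raw = (
    'event: thought\ndata: {"text": "analyzing"}\n\n'
    'event: tool_call\ndata: {"tool": "kubectl_get"}\n\n'
    'event: stream_end\ndata: {"success": true}\n\n'
)
frames = parse_sse_frames(raw)
```

Asserting on the parsed event order gives a cheap check that a stream always finishes with `stream_end`.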
|
||||
|
||||
#### 4.2 前端 SSE 事件處理器(關鍵)
|
||||
|
||||
```typescript
|
||||
// apps/web/src/hooks/useTerminalSSE.ts
|
||||
// 修改現有 SSE hook,增加 tool_call 動畫觸發
|
||||
|
||||
const useTerminalSSE = (commandId: string) => {
|
||||
const [state, setState] = useTerminalStore();
|
||||
|
||||
useEffect(() => {
|
||||
const es = new EventSource(`/api/v1/terminal/stream/${commandId}`);
|
||||
|
||||
// 思考流
|
||||
es.addEventListener('thought', (e) => {
|
||||
const data = JSON.parse(e.data);
|
||||
setState(s => ({ ...s, thoughts: [...s.thoughts, data.text] }));
|
||||
});
|
||||
|
||||
// 工具調用 → 觸發微動畫
|
||||
es.addEventListener('tool_call', (e) => {
|
||||
const data = JSON.parse(e.data);
|
||||
setState(s => ({
|
||||
...s,
|
||||
activeToolCall: data.tool, // UI 顯示「正在執行 kubectl_get...」
|
||||
isAnimating: true
|
||||
}));
|
||||
});
|
||||
|
||||
// GenUI 動態渲染(核心功能!)
|
||||
es.addEventListener('render_ui', (e) => {
|
||||
const { component, props } = JSON.parse(e.data);
|
||||
setState(s => ({
|
||||
...s,
|
||||
renderedCards: [...s.renderedCards, { component, props }]
|
||||
}));
|
||||
});
|
||||
|
||||
es.addEventListener('stream_end', () => {
|
||||
setState(s => ({ ...s, isStreaming: false, isAnimating: false }));
|
||||
es.close();
|
||||
});
|
||||
|
||||
return () => es.close();
|
||||
}, [commandId]);
|
||||
};
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 🟠 行動 5: GenUI Registry 擴充(工時:8h)
|
||||
|
||||
Add 7 monitoring GenUI cards (detailed specs in `MONITORING_ARCHITECTURE_DEEP_DIVE.md`):
|
||||
|
||||
```typescript
|
||||
// apps/web/src/components/genui/registry.ts
|
||||
export const GENUI_COMPONENTS = {
|
||||
// 現有:
|
||||
'MetricsCard': () => import('./cards/MetricsCard'),
|
||||
'K8sPodCard': () => import('./cards/K8sPodCard'),
|
||||
|
||||
// 🆕 監控類(Wave M-3):
|
||||
'SystemHealthCard': () => import('./monitoring/SystemHealthCard'),
|
||||
'ServiceDetailCard': () => import('./monitoring/ServiceDetailCard'),
|
||||
'FinOpsCard': () => import('./monitoring/FinOpsCard'),
|
||||
'SLODashboardCard': () => import('./monitoring/SLODashboardCard'),
|
||||
'AlertChainStatusCard': () => import('./monitoring/AlertChainStatusCard'),
|
||||
'AnomalyFrequencyCard': () => import('./monitoring/AnomalyFrequencyCard'),
|
||||
'MTTRCard': () => import('./monitoring/MTTRCard'),
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 三、前端改善路線圖時間表
|
||||
|
||||
```
|
||||
📅 本週(立即):
|
||||
[4h] i18n 閃電清零(一次性全修)
|
||||
[2h] AI 視覺審查 SOP 建立(.awoooi-agent-rules.md 追加)
|
||||
|
||||
📅 Week 2-3:
|
||||
[8h] Storybook 10 個核心組件 Story
|
||||
[8h] Omni-Terminal 後端全接通(三種 SSE 事件)
|
||||
|
||||
📅 Week 4-5:
|
||||
[8h] 監控 GenUI 卡片擴充(7 張新卡片)
|
||||
[8h] Nexus 頁面 AI 自治率 UI 組件
|
||||
|
||||
📅 Month 2:
|
||||
[16h] Knowledge Base 後端 + 前端完整建設
|
||||
[8h] Visual Regression Testing CI 整合
|
||||
|
||||
📅 Month 3(Phase 4 視覺靈魂注入):
|
||||
[?h] 品牌 3D 資產 + Q 版 OpenClaw
|
||||
[?h] 全站微動畫升級(150ms 快閃標準)
|
||||
[?h] Nothing.tech 認證級別的設計審計
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 四、強制驗收標準
|
||||
|
||||
每次前端 PR 合併前,必須通過以下全部驗收:
|
||||
|
||||
```bash
|
||||
# 1. TypeScript 無錯誤(前端美學 Workflow 硬規則)
|
||||
cd apps/web && pnpm exec tsc --noEmit
|
||||
|
||||
# 2. i18n 無硬編碼(CI 攔截)
|
||||
[ -z "$(grep -rn '"[^\x00-\x7F]' src/ --include='*.tsx')" ] || exit 1
|
||||
|
||||
# 3. Storybook 可正常 build(確保組件獨立可用)
|
||||
pnpm storybook build
|
||||
|
||||
# 4. E2E Smoke 測試通過
|
||||
pnpm exec playwright test --grep @smoke
|
||||
|
||||
# 5. 截圖存檔(AI 執行,統帥視覺審批)
|
||||
pnpm exec playwright screenshot http://localhost:3000 docs/screenshots/pr-{NUMBER}/homepage.png
|
||||
|
||||
# 6. Build 成功(生產環境兼容)
|
||||
pnpm run build
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## ⚠️ 架構安全補丁(2026-03-29 更新,行動 1 開始前必讀)
|
||||
|
||||
> 來源:`ARCHITECTURAL_RISK_WAR_GAME.md` 深度沙盤推演,代碼確認級別
|
||||
|
||||
### 補丁:Feature Freeze 前必須建立 release/v1.x 穩定分支
|
||||
|
||||
**問題**:行動 1(i18n 閃電清零)需要全域替換 `src/` 底下數千行代碼,`main` 分支會進入持續 3-5 天的「混沌狀態」。若此期間生產爆發 P0 Bug(如簽核按鈕失效),**無法切出乾淨的 Hotfix 分支**,導致修復與重構互相衝突。
|
||||
|
||||
> 🔴 **行動 1 開始前,必須先建立 `release/v1.x` 穩定分支!**
|
||||
|
||||
#### 必須在行動 1 之前執行
|
||||
|
||||
```bash
|
||||
# Step 0(行動 1 前置):建立穩定基準分支
|
||||
git checkout main
|
||||
git pull origin main
|
||||
git checkout -b release/v1.x
|
||||
git push origin release/v1.x
|
||||
|
||||
# 在 GitHub 設定 release/v1.x 為 Protected Branch
|
||||
# Settings → Branches → Add branch protection rule → release/v1.x
|
||||
# ✅ Require pull request reviews (1 approver)
|
||||
# ✅ Do not allow bypassing the above settings
|
||||
```
|
||||
|
||||
#### 正確的行動順序
|
||||
|
||||
```
|
||||
Step 0:建立 release/v1.x(🆕 前置步驟,必須先做)
|
||||
|
||||
Step 1:宣佈 Frontend Feature Freeze(禁止非 i18n PR 合併前端代碼)
|
||||
|
||||
Step 2:i18n 閃電清零(在 fix/i18n-zero-violation 分支進行)
|
||||
|
||||
Step 3:PR 合併到 main → 解除 Frontend Freeze
|
||||
|
||||
Step 4:ESLint i18n Plugin 切換為 error 模式
|
||||
|
||||
Step 5:繼續行動 2-9(Storybook、Terminal 等)
|
||||
```
|
||||
|
||||
#### Freeze 期間 P0 Hotfix 緊急流程
|
||||
|
||||
```bash
|
||||
# 情境:簽核按鈕在生產失效,i18n 清零正在進行中
|
||||
|
||||
# 1. 從穩定基準切出 hotfix(不影響 i18n 重構)
|
||||
git checkout release/v1.x
|
||||
git checkout -b hotfix/fix-approval-button
|
||||
|
||||
# 2. 最小化修復(只改 Bug,不動其他代碼)
|
||||
# ... 修復代碼 ...
|
||||
git add . && git commit -m "fix: approval button not responding on mobile"
|
||||
|
||||
# 3. PR 到 release/v1.x → CD 直接部署 release 分支
|
||||
git push origin hotfix/fix-approval-button
|
||||
# → PR 合併到 release/v1.x
|
||||
# → 觸發 CD 部署(CD 配置需支援 release/* 分支)
|
||||
|
||||
# 4. Cherry-pick 到 main(不中斷 i18n 重構)
|
||||
git checkout main
|
||||
git cherry-pick <hotfix-commit-hash>
|
||||
# 若有衝突:手動解決後繼續
|
||||
```
|
||||
|
||||
#### Hotfix 觸發條件(須加入 HARD_RULES.md)
|
||||
|
||||
```
|
||||
P0 Hotfix 判定標準(任一條件符合即觸發):
|
||||
□ 統帥無法使用核心功能(簽核按鈕、登入、Telegram 通知)
|
||||
□ Sentry P0 Error 每分鐘 > 10 次
|
||||
□ 服務 availability < 99%(監控頁面顯示)
|
||||
□ OpenClaw 決策鏈完全中斷超過 5 分鐘
|
||||
```
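The four criteria above combine with a plain OR, so any single one is enough to trigger a hotfix. A minimal sketch of that rule follows. The field names are hypothetical, and the runbook does not specify how each signal is measured.

```python
from dataclasses import dataclass


@dataclass
class ProductionSignals:
    core_feature_unusable: bool         # approval button, login, or Telegram notifications broken
    sentry_p0_errors_per_minute: float
    availability_percent: float
    decision_chain_down_minutes: float  # how long the OpenClaw decision chain has been fully down


def is_p0_hotfix(s: ProductionSignals) -> bool:
    """True when at least one of the four P0 criteria is met."""
    return (
        s.core_feature_unusable
        or s.sentry_p0_errors_per_minute > 10
        or s.availability_percent < 99
        or s.decision_chain_down_minutes > 5
    )


healthy = ProductionSignals(False, 2, 99.9, 0)
error_storm = ProductionSignals(False, 25, 99.9, 0)
degraded = ProductionSignals(False, 0, 98.5, 0)
```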
|
||||
264
docs/runbooks/RUNBOOK-PHASE-D-SENTRY-COMMENT.md
Normal file
@@ -0,0 +1,264 @@
|
||||
# RunBook: Phase D — Sentry Comment 回寫啟動指南
|
||||
|
||||
> **類型**: 操作型 RunBook
|
||||
> **優先級**: 🔴 P0(功能框架已建,只缺 Token 配置)
|
||||
> **建立**: 2026-03-29 12:35 (台北)
|
||||
> **建立者**: Antigravity
|
||||
> **工時預估**: 1.5–2 小時
|
||||
> **前置條件**: AWOOOI API 正常運行 (`/api/v1/health` 返回 200)
|
||||
|
||||
---
|
||||
|
||||
## 背景與現況
|
||||
|
||||
### 🔍 精確現況診斷
|
||||
|
||||
`sentry_webhook.py` 的 `post_sentry_comment()` 函式已實作完整邏輯:
|
||||
|
||||
```
|
||||
sentry_webhook.py:251 → 呼叫 post_sentry_comment()
|
||||
sentry_service.py:206 → post_issue_comment() 已實作 POST /api/0/issues/{id}/comments/
|
||||
sentry_service.py:223 → 若 SENTRY_AUTH_TOKEN 為空,直接 return None 並 warning
|
||||
```
|
||||
|
||||
**唯一阻塞點**:`settings.SENTRY_AUTH_TOKEN` 環境變數未設定,導致 comment 靜默跳過。
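The silent skip can be pictured with a small sketch. This is not the project's `post_issue_comment()`; it is a hypothetical stand-in that only builds the HTTP request, assuming Bearer authentication and a JSON body of the form `{"text": ...}` for the comments endpoint named above.

```python
import json
import urllib.request
from typing import Optional


def build_comment_request(
    base_url: str, auth_token: str, issue_id: str, text: str
) -> Optional[urllib.request.Request]:
    """Build the comment POST request, or return None when no token is configured."""
    if not auth_token:
        # Mirrors the described behaviour: no token means the comment is skipped.
        return None
    url = f"{base_url.rstrip('/')}/api/0/issues/{issue_id}/comments/"
    body = json.dumps({"text": text}).encode("utf-8")  # assumed payload shape
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {auth_token}",
            "Content-Type": "application/json",
        },
    )


skipped = build_comment_request("http://192.168.0.110:9000", "", "42", "hello")
request = build_comment_request("http://192.168.0.110:9000/", "tok", "42", "hello")
```

Because an empty token yields `None` rather than an error, the only visible symptom is the warning log, which is why the missing comment is easy to overlook.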
|
||||
|
||||
### 資料流確認
|
||||
|
||||
```
|
||||
Sentry Issue 觸發
|
||||
↓
|
||||
/api/v1/webhooks/sentry/error (sentry_webhook.py)
|
||||
↓
|
||||
analyze_and_comment() [Background Task]
|
||||
↓
|
||||
call_openclaw_analyzer() → OpenClaw AI 分析
|
||||
↓
|
||||
create_sentry_approval() → 建立 Approval ✅ 已運作
|
||||
↓
|
||||
send_sentry_telegram_alert()→ Telegram 通知 ✅ 已運作
|
||||
↓
|
||||
post_sentry_comment() → ❌ SENTRY_AUTH_TOKEN 缺失,靜默跳過
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Step 1: 取得 Sentry API Token
|
||||
|
||||
### 1.1 登入 Sentry 後台
|
||||
|
||||
```
|
||||
瀏覽器開啟:http://192.168.0.110:9000
|
||||
帳號:參見 docs/security/SECRETS_REFERENCE.md
|
||||
```
|
||||
|
||||
### 1.2 建立 API Token
|
||||
|
||||
```
|
||||
路徑:設定 → API → Auth Tokens → Create New Token
|
||||
|
||||
權限設定:
|
||||
☑ project:read
|
||||
☑ project:write
|
||||
☑ event:write       ← required for writing comments (in Sentry, issue permissions fall under the event:* scopes)
|
||||
☑ event:read
|
||||
|
||||
Token 名稱建議:awoooi-openclaw-comment-writer
|
||||
```
|
||||
|
||||
### 1.3 記錄 Token(請勿存入代碼庫!)
|
||||
|
||||
```bash
|
||||
# 暫存到環境變數(本機測試用)
|
||||
export SENTRY_AUTH_TOKEN="sentry_xxx..."
|
||||
|
||||
# 驗證 Token 有效性
|
||||
curl -s http://192.168.0.110:9000/api/0/organizations/ \
|
||||
-H "Authorization: Bearer $SENTRY_AUTH_TOKEN" | python3 -m json.tool | head -20
|
||||
# 預期看到 organization 列表,無 401
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Step 2: 注入 GitHub Secrets(CD 自動化)
|
||||
|
||||
### 2.1 加入 GitHub Repository Secrets
|
||||
|
||||
```
|
||||
路徑:GitHub → owenhytsai/awoooi → Settings → Secrets → Actions
|
||||
|
||||
新增以下 Secrets:
|
||||
名稱:SENTRY_AUTH_TOKEN
|
||||
值:步驟 1.2 取得的 Token
|
||||
```
|
||||
|
||||
### 2.2 更新 K8s Secret(手動注入生產環境)
|
||||
|
||||
```bash
# Run on 192.168.0.120 (K3s master)
# tr -d '\n' strips the line wrap that GNU base64 inserts after 76 characters
kubectl patch secret awoooi-secrets -n awoooi-prod \
  --patch="{\"data\":{\"SENTRY_AUTH_TOKEN\":\"$(echo -n 'YOUR_TOKEN' | base64 | tr -d '\n')\"}}"

# Verify (prints the token in clear text; avoid on shared terminals)
kubectl get secret awoooi-secrets -n awoooi-prod -o jsonpath='{.data.SENTRY_AUTH_TOKEN}' | base64 -d
```
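Values under a Secret's `data` field must be base64-encoded, and the encoded string must not contain a stray newline. A minimal sketch of building the patch document in Python follows, which sidesteps shell quoting and line wrapping. The helper name is hypothetical, and `YOUR_TOKEN` is the same placeholder used above.

```python
import base64
import json


def build_secret_patch(values: dict[str, str]) -> str:
    """Return a JSON patch whose `data` values are base64-encoded without line breaks."""
    data = {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in values.items()
    }
    return json.dumps({"data": data})


patch = build_secret_patch({"SENTRY_AUTH_TOKEN": "YOUR_TOKEN"})
encoded = json.loads(patch)["data"]["SENTRY_AUTH_TOKEN"]
```

The resulting string is what the `--patch` argument expects. `base64.b64encode` never inserts line breaks, so long tokens stay on one line.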
|
||||
|
||||
### 2.3 更新 k8s/awoooi-prod/03-secrets.yaml(模板)
|
||||
|
||||
```yaml
|
||||
# k8s/awoooi-prod/03-secrets.yaml
|
||||
# 新增以下欄位(使用 CD 自動注入,非硬編碼)
|
||||
stringData:
|
||||
# ... 現有欄位 ...
|
||||
SENTRY_AUTH_TOKEN: "${SENTRY_AUTH_TOKEN}" # CD 自動注入
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Step 3: 更新 CD Workflow 自動注入
|
||||
|
||||
```yaml
# .github/workflows/cd.yaml
# Add SENTRY_AUTH_TOKEN to the "Inject K8s Secrets" step

- name: Inject K8s Secrets
  run: |
    # tr -d '\n' strips the line wrap that GNU base64 inserts after 76 characters
    kubectl patch secret awoooi-secrets -n awoooi-prod \
      --patch="{\"data\":{
        \"OPENCLAW_TG_BOT_TOKEN\":\"$(echo -n '${{ secrets.OPENCLAW_TG_BOT_TOKEN }}' | base64 | tr -d '\n')\",
        \"OPENCLAW_TG_CHAT_ID\":\"$(echo -n '${{ secrets.OPENCLAW_TG_CHAT_ID }}' | base64 | tr -d '\n')\",
        \"SENTRY_AUTH_TOKEN\":\"$(echo -n '${{ secrets.SENTRY_AUTH_TOKEN }}' | base64 | tr -d '\n')\"
      }}"
```
|
||||
|
||||
---
|
||||
|
||||
## Step 4: 驗證 Sentry Comment 功能
|
||||
|
||||
### 4.1 本地單元測試(快速驗證)
|
||||
|
||||
```bash
|
||||
cd /Users/ogt/awoooi
|
||||
source .env
|
||||
|
||||
# 設定測試 Token
|
||||
export SENTRY_AUTH_TOKEN="你的真實 Token"
|
||||
export SENTRY_SELF_HOSTED_URL="http://192.168.0.110:9000"
|
||||
|
||||
# 用 Python 直接測試 SentryService
|
||||
python3 -c "
|
||||
import asyncio
|
||||
import sys
|
||||
sys.path.insert(0, 'apps/api/src')
|
||||
from services.sentry_service import SentryService
|
||||
|
||||
async def test():
|
||||
svc = SentryService(
|
||||
base_url='http://192.168.0.110:9000',
|
||||
auth_token='$SENTRY_AUTH_TOKEN'
|
||||
)
|
||||
# 先列出 Issues 找一個真實 ID
|
||||
issues = await svc.list_issues(project='awoooi-api', limit=3)
|
||||
if issues:
|
||||
issue_id = issues[0]['id']
|
||||
print(f'找到 Issue: {issue_id}')
|
||||
result = await svc.post_issue_comment(
|
||||
issue_id=issue_id,
|
||||
text='🤖 **AWOOOI 測試** - Sentry Comment 回寫功能正常運作。'
|
||||
)
|
||||
print(f'Comment 結果: {result}')
|
||||
else:
|
||||
print('無 Issue 可測試')
|
||||
|
||||
asyncio.run(test())
|
||||
"
|
||||
```
|
||||
|
||||
### 4.2 E2E 端對端驗證(生產環境)
|
||||
|
||||
```bash
|
||||
# 1. 在 Sentry 手動觸發一個測試 Issue
|
||||
curl -X POST http://localhost:8000/api/v1/webhooks/sentry/error \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"action": "triggered",
|
||||
"data": {
|
||||
"issue": {
|
||||
"id": "TEST-001",
|
||||
"title": "AWOOOI Comment 功能測試",
|
||||
"level": "error",
|
||||
"culprit": "test.py:1",
|
||||
"firstSeen": "2026-03-29T12:00:00Z",
|
||||
"count": 1,
|
||||
"project": {"slug": "awoooi-api"}
|
||||
},
|
||||
"event": {
|
||||
"message": "這是一個測試錯誤",
|
||||
"platform": "python"
|
||||
}
|
||||
}
|
||||
}'
|
||||
|
||||
# 預期回應
|
||||
# {"status": "accepted", "issue_id": "TEST-001", "message": "Analysis scheduled"}
|
||||
|
||||
# 2. 等待 60 秒後(OpenClaw 分析需時)
|
||||
sleep 60
|
||||
|
||||
# 3. 到 Sentry UI 的 Issue 中確認是否有 AI 分析 Comment
|
||||
# http://192.168.0.110:9000/organizations/sentry/issues/TEST-001/
|
||||
```
|
||||
|
||||
### 4.3 確認日誌
|
||||
|
||||
```bash
|
||||
# 查看 API Pod 日誌
|
||||
kubectl logs -n awoooi-prod \
|
||||
$(kubectl get pod -n awoooi-prod -l app=awoooi-api -o name | head -1) \
|
||||
--tail=50 | grep -i sentry_comment
|
||||
|
||||
# 預期看到
|
||||
# sentry_comment_posted issue_id=xxx comment_id=12345
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Step 5: 部署與驗收
|
||||
|
||||
```bash
|
||||
# 觸發 CD
|
||||
git add k8s/awoooi-prod/03-secrets.yaml .github/workflows/cd.yaml
|
||||
git commit -m "feat(sentry): enable comment write-back via SENTRY_AUTH_TOKEN injection"
|
||||
git push origin main
|
||||
|
||||
# 確認 CD 成功
|
||||
gh run list --workflow=cd.yaml --limit 1
|
||||
|
||||
# 確認 Pod 有新 Token
|
||||
kubectl exec -n awoooi-prod \
|
||||
$(kubectl get pod -n awoooi-prod -l app=awoooi-api -o name | head -1) \
|
||||
-- env | grep SENTRY_AUTH_TOKEN
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 驗收標準
|
||||
|
||||
| 項目 | 通過條件 |
|
||||
|------|---------|
|
||||
| K8s Secret 已注入 | `kubectl get secret` 確認 `SENTRY_AUTH_TOKEN` 不為空 |
|
||||
| Token 有效 | Sentry API `/api/0/organizations/` 返回 200 |
|
||||
| Comment 回寫 | Sentry Issue 中有「AI 錯誤分析」Comment |
|
||||
| 日誌正常 | `sentry_comment_posted` 日誌出現,無 `sentry_comment_failed` |
|
||||
| 頻率統計 | Comment 含「頻率統計」表格(24h 次數 > 1 時顯示)|
|
||||
|
||||
---
|
||||
|
||||
## 常見問題排除
|
||||
|
||||
| 症狀 | 診斷指令 | 解法 |
|
||||
|------|---------|------|
|
||||
| `sentry_comment_skipped` 日誌 | `env \| grep SENTRY_AUTH_TOKEN` | Secret 未注入,重跑 Step 3 |
|
||||
| `sentry_api_unauthorized` | 手動 curl Sentry API | Token 權限不足,重新建立 |
|
||||
| `sentry_api_timeout` | `curl -v http://192.168.0.110:9000/` | Sentry 服務本身異常 |
|
||||
| OpenClaw 分析失敗 | `curl http://192.168.0.188:8089/health` | OpenClaw 服務需重啟 |
|
||||
233
docs/runbooks/RUNBOOK-PHASE-E-SIGNOZ-WEBHOOK.md
Normal file
@@ -0,0 +1,233 @@
|
||||
# RunBook: Phase E — SignOz Webhook Handler 生產部署
|
||||
|
||||
> **類型**: 操作型 RunBook
|
||||
> **優先級**: 🔴 P0
|
||||
> **建立**: 2026-03-29 12:35 (台北)
|
||||
> **建立者**: Antigravity
|
||||
> **工時預估**: 1.5–2 小時
|
||||
> **前置條件**: SignOz UI 可在 http://192.168.0.188:3301 正常訪問
|
||||
|
||||
---
|
||||
|
||||
## 背景與現況
|
||||
|
||||
### 🔍 精確現況診斷
|
||||
|
||||
```
|
||||
signoz_webhook.py → 完整實作 (363 行,含 4 步驟完整流程)
|
||||
main.py:419 → 已正確路由 include_router(signoz_webhook_v1.router)
|
||||
端點:POST /api/v1/webhooks/signoz/alert ✅ 已可接收
|
||||
問題:SignOz 告警規則未指向此 Webhook
|
||||
```
|
||||
|
||||
**唯一阻塞點**:SignOz 告警規則 (`ops/signoz/alerting/rules.yaml`) 的 `webhook` 欄位尚未設定或未部署到 SignOz 主機。
|
||||
|
||||
### 完整資料流
|
||||
|
||||
```
|
||||
SignOz 偵測到異常 (Error Rate / Latency / No Traces)
|
||||
↓
|
||||
SignOz Alert Manager 觸發告警
|
||||
↓
|
||||
POST http://192.168.0.120:32334/api/v1/webhooks/signoz/alert ← 需要配置
|
||||
↓
|
||||
process_signoz_alert() [Background Task]
|
||||
↓
|
||||
├── AnomalyCounter 記錄頻率 (ADR-037) ✅
|
||||
├── IncidentService 建立事件 ✅
|
||||
├── ApprovalService 建立簽核 ✅
|
||||
└── TelegramGateway 發送通知 ✅
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Step 1: 確認 API 端點可達
|
||||
|
||||
```bash
|
||||
# 從 188 主機測試 SignOz Webhook 端點
|
||||
curl -s http://192.168.0.120:32334/api/v1/webhooks/signoz/health
|
||||
# 預期:{"status": "ok", "service": "signoz-webhook", "timestamp": "..."}
|
||||
|
||||
# 如端點不通,確認 Pod 狀態
|
||||
kubectl get pod -n awoooi-prod -l app=awoooi-api
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Step 2: 設定 SignOz 告警規則
|
||||
|
||||
### 2.1 確認 ops/signoz/alerting/rules.yaml 已建立
|
||||
|
||||
```bash
|
||||
# 確認檔案存在
|
||||
ls /Users/ogt/awoooi/ops/signoz/alerting/
|
||||
# 如不存在,從 IMPLEMENTATION_STEPS_REMAINING_PHASES.md 的 Phase E 代碼複製
|
||||
```
|
||||
|
||||
### 2.2 部署告警規則到 SignOz 主機
|
||||
|
||||
```bash
|
||||
# 登入 SignOz 主機
|
||||
ssh root@192.168.0.188
|
||||
|
||||
# 確認 SignOz 告警配置目錄
|
||||
docker inspect signoz-query-service | grep -A5 "Mounts"
|
||||
# 常見路徑:/opt/signoz/config/ 或 /data/signoz/
|
||||
|
||||
# 複製告警規則(從本機)
|
||||
# 先在本機執行:
|
||||
scp /Users/ogt/awoooi/ops/signoz/alerting/rules.yaml \
|
||||
root@192.168.0.188:/opt/signoz/config/alerting/
|
||||
|
||||
# 在 188 主機重啟 SignOz Alert Manager(不重啟整個 SignOz)
|
||||
docker restart signoz-alert-manager 2>/dev/null || \
|
||||
docker restart signoz # 若是單容器部署
|
||||
```
|
||||
|
||||
### 2.3 透過 SignOz API 驗證規則載入
|
||||
|
||||
```bash
|
||||
# 在 188 主機執行
|
||||
curl -s http://localhost:3301/api/v3/alerts/rules | python3 -m json.tool | head -40
|
||||
# 預期看到 APIHighErrorRate, APIHighLatencyP99 等規則名稱
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Step 3: 設定 SignOz Webhook Channel
|
||||
|
||||
SignOz 告警通知支援 Webhook Channel,需要透過 SignOz Web UI 或 API 設定。
|
||||
|
||||
### 3.1 透過 SignOz UI 設定(推薦)
|
||||
|
||||
```
|
||||
瀏覽器開啟:http://192.168.0.188:3301
|
||||
路徑:Settings → Alert Channels → New Channel
|
||||
|
||||
類型:Webhook
|
||||
名稱:AWOOOI-API
|
||||
URL:http://192.168.0.120:32334/api/v1/webhooks/signoz/alert
|
||||
Send resolved notifications:☑ (可選)
|
||||
```
|
||||
|
||||
### 3.2 透過 API 設定(腳本化)
|
||||
|
||||
```bash
|
||||
# 在 188 主機執行
|
||||
curl -s -X POST http://localhost:3301/api/v1/channels \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"name": "AWOOOI-API",
|
||||
"type": "webhook",
|
||||
"data": {
|
||||
"webhook_url": "http://192.168.0.120:32334/api/v1/webhooks/signoz/alert"
|
||||
}
|
||||
}'
|
||||
# 預期返回 channel ID
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Step 4: 建立測試告警規則
|
||||
|
||||
為了驗證整個鏈路,建立一個低閾值測試規則:
|
||||
|
||||
```bash
|
||||
# 在 188 主機的 SignOz 目錄建立測試規則
|
||||
cat > /tmp/test-alert.yaml << 'EOF'
|
||||
groups:
|
||||
- name: e2e_test
|
||||
rules:
|
||||
- alert: AWOOOI_E2E_SMOKE_TEST
|
||||
expr: up{job="awoooi-api"} == 1 # 永遠觸發(API 存活時)
|
||||
for: 1m
|
||||
labels:
|
||||
severity: info
|
||||
source: signoz
|
||||
test: "true"
|
||||
annotations:
|
||||
summary: "E2E Smoke Test - 請忽略"
|
||||
description: "這是 AWOOOI 告警鏈路的自動測試"
|
||||
webhook: "http://192.168.0.120:32334/api/v1/webhooks/signoz/alert"
|
||||
EOF
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Step 5: 端到端驗證
|
||||
|
||||
### 5.1 手動觸發測試
|
||||
|
||||
```bash
|
||||
# 直接向 AWOOOI API 發送模擬 SignOz 告警
|
||||
curl -s -X POST http://192.168.0.120:32334/api/v1/webhooks/signoz/alert \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"alertname": "APIHighErrorRate",
|
||||
"status": "firing",
|
||||
"labels": {
|
||||
"alertname": "APIHighErrorRate",
|
||||
"severity": "critical",
|
||||
"service_name": "awoooi-api",
|
||||
"source": "signoz"
|
||||
},
|
||||
"annotations": {
|
||||
"summary": "API 錯誤率 > 5%",
|
||||
"description": "服務 awoooi-api 錯誤率超標,這是一個測試告警"
|
||||
},
|
||||
"startsAt": "2026-03-29T12:00:00Z"
|
||||
}'
|
||||
|
||||
# 預期回應
|
||||
# {"status": "ok", "processed": 1, "results": [{"status": "accepted", "alert_name": "APIHighErrorRate"}]}
|
||||
```
|
||||
|
||||
### 5.2 確認 Telegram 收到告警
|
||||
|
||||
```
|
||||
預期在 Telegram Bot 中收到:
|
||||
═══════════════════════════
|
||||
📊 SignOz: APIHighErrorRate
|
||||
═══════════════════════════
|
||||
服務:awoooi-api
|
||||
摘要:API 錯誤率 > 5%
|
||||
[ Y 確認 ] [ N 忽略 ]
|
||||
```
|
||||
|
||||
### 5.3 確認 API 日誌
|
||||
|
||||
```bash
|
||||
kubectl logs -n awoooi-prod \
|
||||
$(kubectl get pod -n awoooi-prod -l app=awoooi-api -o name | head -1) \
|
||||
--tail=30 | grep -i signoz
|
||||
|
||||
# 預期看到:
|
||||
# signoz_alert_received payload=...
|
||||
# signoz_anomaly_recorded alert_name=APIHighErrorRate
|
||||
# signoz_alert_processed alert_name=APIHighErrorRate incident_id=xxx
|
||||
# signoz_telegram_sent approval_id=xxx
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 驗收標準
|
||||
|
||||
| 項目 | 通過條件 |
|
||||
|------|---------|
|
||||
| Webhook 端點可達 | `curl .../signoz/health` 返回 200 |
|
||||
| SignOz 規則載入 | `/api/v3/alerts/rules` 包含 `APIHighErrorRate` |
|
||||
| 手動測試正常 | 回應 `{"status": "ok"}` |
|
||||
| Telegram 通知 | 成功收到告警卡片 |
|
||||
| Incident 建立 | DB 中可查到 `source=signoz` 的 Incident |
|
||||
| Approval 建立 | `GET /api/v1/approvals` 顯示新 Approval |
|
||||
|
||||
---
|
||||
|
||||
## 常見問題排除
|
||||
|
||||
| 症狀 | 診斷 | 解法 |
|
||||
|------|------|------|
|
||||
| Webhook 404 | `curl .../signoz/health` | Confirm the port is 32334, not 8089 |
|
||||
| SignOz 規則不觸發 | SignOz UI → Alerts 頁 | 確認 Prometheus 端點可抓到 awoooi-api metrics |
|
||||
| Telegram 未收到 | 查 API 日誌 | 確認 `OPENCLAW_TG_BOT_TOKEN` Secret 已注入 |
|
||||
| Incident 建立失敗 | 查 API 日誌 `incident_creation_failed` | 確認 PostgreSQL 連線正常 |
|
||||
313
docs/runbooks/RUNBOOK-WORKER-HPA.md
Normal file
@@ -0,0 +1,313 @@
# RunBook: Worker HPA (Horizontal Autoscaling Setup)

> **Type**: Operational RunBook
> **Priority**: 🔴 P0 (the Worker is currently a single point of failure)
> **Created**: 2026-03-29 12:35 (Taipei)
> **Author**: Antigravity
> **Estimated effort**: 30–60 minutes
> **Prerequisite**: K3s cluster healthy (nodes 120/121 both Ready)

---

## Background and Current State

### 🔍 Current-State Diagnosis

**Existing HPA configuration (`12-hpa.yaml`)**:

| Deployment | Min | Max | CPU threshold | Memory threshold |
|-----------|-----|-----|---------|------------|
| awoooi-api | 2 | 6 | 70% | 80% |
| awoooi-web | 2 | 6 | 70% | 80% |
| **awoooi-worker** | ❌ None | ❌ None | n/a | n/a |

**What makes the Worker different**:
- The Worker consumes Redis Streams (the Event Bus).
- Unlike API/Web, its scaling should be driven by **queue length** rather than CPU/memory.
- K3s does not ship with KEDA (Kubernetes Event-driven Autoscaling), however.
- **Most conservative option**: set min:1 max:3 and scale on CPU.

---

## Options Compared

| Option | Pros | Cons | Fit |
|------|------|------|-------|
| **A: CPU HPA (available now)** | No extra dependencies; deployable immediately | Does not directly reflect queue length | ✅ Recommended (short term) |
| B: KEDA Redis Stream HPA | Most precise; scales on queue length | Requires installing the KEDA operator | 🟡 Mid-term plan |
| C: Fixed 2 replicas (no HPA) | Simple and stable | Wastes resources | ❌ Not recommended |

**Decision**: adopt option A (CPU HPA) and record option B as the future path.

---

## Step 1: Confirm the Worker's Resource Settings

```bash
# Show the current Worker Deployment resource settings
kubectl get deployment awoooi-worker -n awoooi-prod -o yaml | grep -A 20 resources

# Expected output:
# resources:
#   requests:
#     cpu: "100m"
#     memory: "256Mi"
#   limits:
#     cpu: "500m"
#     memory: "512Mi"
```

**Without resource settings, the HPA cannot work!** Utilization targets are computed as a percentage of each container's `requests`, so add the resource settings to `08-deployment-worker.yaml` first.

---

## Step 2: Update k8s/awoooi-prod/12-hpa.yaml

Append the Worker HPA to the end of the existing file:

```yaml
# =============================================================================
# Worker HPA (append to the end of 12-hpa.yaml)
# =============================================================================
# K-Worker 2026-03-29: Worker HPA (CPU 70% plus memory 80%, min:1 max:3)
# Note: the Worker consumes Redis Streams; a KEDA Redis Stream metric is a possible future upgrade
# Author: Antigravity
# =============================================================================
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: awoooi-worker-hpa
  namespace: awoooi-prod
  labels:
    app.kubernetes.io/name: awoooi
    app.kubernetes.io/component: worker
  annotations:
    description: "Worker horizontal autoscaling (1-3 replicas, 70% CPU)"
    note: "Can later be upgraded to a KEDA Redis Stream metric that scales on queue length"
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: awoooi-worker
  minReplicas: 1   # keep at least 1 replica processing events
  maxReplicas: 3   # a reasonable ceiling for a 2-node cluster
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120   # Worker scales up more conservatively than the API (120s vs 60s)
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
    scaleDown:
      stabilizationWindowSeconds: 600   # very conservative scale-down, to avoid interrupting event processing
      policies:
        - type: Pods
          value: 1
          periodSeconds: 300
```
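
To make the 70% / 80% targets concrete: per the Kubernetes HPA documentation, each metric proposes `ceil(currentReplicas × currentValue / targetValue)`, no change is proposed while that ratio is within a 10% tolerance of 1.0, the largest proposal wins, and the result is clamped to `minReplicas` / `maxReplicas`. The sketch below applies that rule to the values in this manifest. The function name and the sample utilizations are illustrative only, and the sketch deliberately ignores the `behavior` stabilization windows and rate policies.

```python
import math


def desired_replicas(current_replicas, utilization, targets,
                     min_replicas=1, max_replicas=3, tolerance=0.1):
    """Approximate the HPA replica calculation for Utilization metrics.

    utilization and targets map a metric name to a percentage of the
    container's resource request (for example {"cpu": 140}).
    """
    proposals = []
    for metric, target in targets.items():
        ratio = utilization[metric] / target
        if abs(ratio - 1.0) <= tolerance:
            # Close enough to the target: this metric proposes no change.
            proposals.append(current_replicas)
        else:
            proposals.append(math.ceil(current_replicas * ratio))
    # The largest proposal wins, then min/max bounds are applied.
    return max(min_replicas, min(max_replicas, max(proposals)))


TARGETS = {"cpu": 70, "memory": 80}  # values from the manifest above

# One replica at 140% CPU: ceil(1 * 140 / 70) = 2
print(desired_replicas(1, {"cpu": 140, "memory": 40}, TARGETS))
# Two replicas at 300% CPU would need 9, but maxReplicas caps it at 3
print(desired_replicas(2, {"cpu": 300, "memory": 40}, TARGETS))
```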

---

## Step 3: Confirm the Worker Deployment Has Resource Settings

```bash
# Show the current settings
kubectl get deployment awoooi-worker -n awoooi-prod -o jsonpath='{.spec.template.spec.containers[0].resources}'
```

If no resources are set, add the following to `08-deployment-worker.yaml`:

```yaml
# K8s Deployment corresponding to apps/api/src/workers
# Add under the container spec:
resources:
  requests:
    cpu: "100m"       # estimate of normal Worker load
    memory: "256Mi"
  limits:
    cpu: "500m"       # keep a single Worker from consuming all the CPU
    memory: "512Mi"
```
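
The resource check from Steps 1 and 3 can also be scripted. This is a minimal sketch, assuming the Deployment has been exported with `kubectl get deployment awoooi-worker -n awoooi-prod -o json`; the helper name `containers_missing_requests` and the inline sample object are hypothetical and not part of the repository.

```python
import json


def containers_missing_requests(deployment, required=("cpu", "memory")):
    """Return {container_name: [missing request names]} for a Deployment dict.

    An empty result means every container declares the requests that the
    HPA needs in order to compute utilization percentages.
    """
    containers = deployment["spec"]["template"]["spec"]["containers"]
    missing = {}
    for container in containers:
        requests = (container.get("resources") or {}).get("requests") or {}
        absent = [name for name in required if name not in requests]
        if absent:
            missing[container["name"]] = absent
    return missing


# Hypothetical sample shaped like `kubectl ... -o json` output,
# using the resource values recommended in this step.
sample = json.loads("""
{
  "spec": {"template": {"spec": {"containers": [
    {"name": "awoooi-worker",
     "resources": {
       "requests": {"cpu": "100m", "memory": "256Mi"},
       "limits": {"cpu": "500m", "memory": "512Mi"}}}
  ]}}}
}
""")

print(containers_missing_requests(sample))
```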

---

## Step 4: Deploy

```bash
# Method A: apply directly (recommended; only the HPA changes)
kubectl apply -f k8s/awoooi-prod/12-hpa.yaml

# Confirm the HPA was created
kubectl get hpa -n awoooi-prod

# Expected output:
# NAME                REFERENCE                  TARGETS   MINPODS   MAXPODS   REPLICAS
# awoooi-api-hpa      Deployment/awoooi-api      5%/70%    2         6         2
# awoooi-web-hpa      Deployment/awoooi-web      3%/70%    2         6         2
# awoooi-worker-hpa   Deployment/awoooi-worker   8%/70%    1         3         1   ← new

# Method B: trigger through CD (standard flow)
git add k8s/awoooi-prod/12-hpa.yaml
git commit -m "feat(k8s): add Worker HPA (min:1 max:3 CPU 70%)"
git push origin main
```

---

## Step 5: Load-Test to Verify the HPA Triggers

```bash
# Simulate a burst of incoming events (use caution; run during off-peak hours)
for i in {1..100}; do
  curl -s -X POST http://192.168.0.120:32334/api/v1/webhooks/alertmanager \
    -H "Content-Type: application/json" \
    -d '{
      "version": "4",
      "status": "firing",
      "alerts": [{"status": "firing", "labels": {"alertname": "LoadTest", "severity": "info"}, "annotations": {}}]
    }' &
done
wait   # let all background requests finish

# Watch the HPA react (refresh every 15 seconds)
watch -n 15 'kubectl get hpa awoooi-worker-hpa -n awoooi-prod'
```

---

## Mid-Term Roadmap: Upgrade to a KEDA Redis Stream HPA

```yaml
# Once KEDA is installed, the CPU HPA can be replaced with a more precise scaler.
# Remove awoooi-worker-hpa first: KEDA creates and manages its own HPA for the target.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: awoooi-worker-scaledobject
  namespace: awoooi-prod
spec:
  scaleTargetRef:
    name: awoooi-worker
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
    - type: redis-streams                   # KEDA's plain `redis` scaler reads Lists, not Streams
      metadata:
        address: "192.168.0.188:6380"
        stream: "awoooi:events"             # Redis Stream key
        consumerGroup: "<consumer-group>"   # placeholder: set to the Worker's actual consumer group
        pendingEntriesCount: "20"           # target of at most 20 pending events per Pod
```
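
As a rough illustration of what queue-length scaling would do with the per-Pod target of 20 pending events and the 1 to 5 replica bounds above, the sketch below divides the backlog by the target and rounds up. The function name is hypothetical, and the real decision is made by KEDA together with the HPA controller, which also applies tolerances and polling intervals not modelled here.

```python
import math


def replicas_for_backlog(pending_entries, per_pod_target=20,
                         min_replicas=1, max_replicas=5):
    """Estimate the replica count for a given number of pending events."""
    wanted = math.ceil(pending_entries / per_pod_target)
    # Never go below the minimum or above the maximum replica count.
    return max(min_replicas, min(max_replicas, wanted))


for backlog in (0, 20, 21, 95, 1000):
    print(backlog, replicas_for_backlog(backlog))
```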

KEDA install command (to be run in the future):
```bash
kubectl apply -f https://github.com/kedacore/keda/releases/download/v2.13.1/keda-2.13.1.yaml
```

---

## Acceptance Criteria

| Item | Pass condition |
|------|---------|
| HPA created | `kubectl get hpa -n awoooi-prod` lists `awoooi-worker-hpa` |
| Metrics healthy | TARGETS shows an actual CPU %, not `<unknown>` |
| Worker running | `kubectl get pod -n awoooi-prod -l app=awoooi-worker` shows Running |
| Minimum replicas | Worker desired replica count = 1 |

---

## ⚠️ Architecture Safety Patches (updated 2026-03-29; read before deploying)

> Source: the in-depth war-game exercise in `ARCHITECTURAL_RISK_WAR_GAME.md`, confirmed at code level

### Patch 1: XCLAIM + Active Sweeper (prerequisite for deploying the HPA)

**❌ Current state**: `signal_worker.py` has no mechanism at all for reclaiming orphaned tasks from the Redis PEL.

**Impact**: when a Worker Pod is scaled down by the HPA (or crashes ungracefully), the tasks it was processing stay in the Redis PEL (Pending Entries List) and are never picked up again.

> 🔴 **The HPA must not be deployed until the XCLAIM mechanism has been merged into main!**

Two mechanisms need to be added to `signal_worker.py`:

```python
# 1. Take over orphans at startup (_claim_orphaned_tasks, called from start())
# 2. Keep scanning while running (_reclaim_loop, runs alongside _consume_loop)
async def _reclaim_loop(self, interval_s: int = 300) -> None:
    """Every 5 minutes, scan the PEL and claim orphaned tasks idle for more than 5 minutes."""
    while self._running:
        await asyncio.sleep(interval_s)
        claimed = await self._claim_orphaned_tasks(idle_ms=300_000)
        if claimed > 0:
            logger.info("active_sweeper_claimed", count=claimed)
```
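
The runbook names `_claim_orphaned_tasks` without showing it. The sketch below covers only the selection step, under the assumption that pending entries are shaped like the dicts redis-py's `xpending_range` returns (keys `message_id`, `consumer`, `time_since_delivered`, `times_delivered`). The function name `select_orphans` and the sample data are hypothetical. A real implementation would then pass the selected IDs to `XCLAIM` with the same minimum idle time, or use `XAUTOCLAIM` (Redis 6.2+), so that two sweepers cannot both claim the same entry.

```python
def select_orphans(pending, idle_ms=300_000):
    """Return IDs of PEL entries that have been idle for at least idle_ms."""
    return [
        entry["message_id"]
        for entry in pending
        if entry["time_since_delivered"] >= idle_ms
    ]


# Hypothetical PEL snapshot: one entry from a Pod that disappeared 10 minutes
# ago, and one that a live Pod picked up 2 seconds ago.
pending = [
    {"message_id": "1711700000000-0", "consumer": "worker-old",
     "time_since_delivered": 600_000, "times_delivered": 1},
    {"message_id": "1711700590000-0", "consumer": "worker-live",
     "time_since_delivered": 2_000, "times_delivered": 1},
]

print(select_orphans(pending))
```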

---

### Patch 2: Three-Layer Alignment of terminationGracePeriodSeconds

**❌ Current state**: `stop()` in `signal_worker.py` uses a timeout of **5 seconds**, while an AI analysis task can take up to 60 seconds. K8s `terminationGracePeriodSeconds` is unset (default 30 seconds). Neither value is long enough, and the two are not aligned with each other.

**Both places must be changed together**:

```yaml
# k8s/awoooi-prod/08-deployment-worker.yaml
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 90   # 🆕 required (15 s more than the Python timeout)
      containers:
        - name: awoooi-worker
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"]   # let K8s update Endpoints before SIGTERM is sent
```

```python
# apps/api/src/workers/signal_worker.py
async def stop(self) -> None:
    self._running = False
    if self._task:
        try:
            await asyncio.wait_for(self._task, timeout=75.0)  # 🆕 raised from 5 s to 75 s
        except (asyncio.TimeoutError, asyncio.CancelledError):  # asyncio.TimeoutError also works before Python 3.11
            self._task.cancel()
    logger.info("signal_worker_stopped")
```

**How the three values relate**:
```
preStop sleep:     5s
Python timeout:   75s  ← 15s below the K8s grace period
K8s grace period: 90s  ← terminationGracePeriodSeconds
```
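
One detail worth checking when tuning these numbers: Kubernetes starts the grace-period countdown before the preStop hook runs, so the hook's 5 s is spent from the same 90 s budget. The margin left before SIGKILL is therefore 90 − (5 + 75) = 10 s. The sketch below encodes that arithmetic together with the requirement that the Python timeout covers the longest task (60 s per the text above); the function name and the messages are illustrative.

```python
def check_shutdown_budget(pre_stop_s, worker_timeout_s, grace_period_s,
                          longest_task_s):
    """Return (margin_seconds, problems) for a graceful-shutdown configuration."""
    problems = []
    if worker_timeout_s < longest_task_s:
        problems.append("worker timeout is shorter than the longest task")
    # preStop time and the worker's own timeout both count against the grace period.
    margin = grace_period_s - (pre_stop_s + worker_timeout_s)
    if margin <= 0:
        problems.append("grace period does not cover preStop plus worker timeout")
    return margin, problems


# Proposed values from this patch
print(check_shutdown_budget(5, 75, 90, 60))
# Values described as the current state (no preStop, 5 s timeout, 30 s default)
print(check_shutdown_budget(0, 5, 30, 60))
```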

---

### Compliance Checks (must be run after deployment)

```bash
# Confirm terminationGracePeriodSeconds has taken effect
kubectl get deployment awoooi-worker -n awoooi-prod \
  -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}'
# Expected: 90

# Simulate a scale-down and confirm graceful shutdown
kubectl scale deployment awoooi-worker -n awoooi-prod --replicas=0
kubectl logs -n awoooi-prod -l app=awoooi-worker --tail=20
# Expected sequence: shutdown_signal_received → signal_worker_shutting_down → signal_worker_stopped
# The whole sequence completes within 90 seconds

# Restore the Worker afterwards (the HPA does not scale a target that sits at 0 replicas)
kubectl scale deployment awoooi-worker -n awoooi-prod --replicas=1
```