docs: ADR-037 + monitoring architecture proposals + Runbooks

- ADR-037 monitoring enhancement architecture
- MONITORING_MASTER_PLAN master plan
- MASTER_EXECUTION_SCHEDULE execution schedule
- Phase D/E/Worker HPA Runbooks

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 16:04:08 +08:00
parent 237fb64a81
commit 89e05e6ea2
13 changed files with 6462 additions and 0 deletions
--- a/docs/adr/ADR-037-monitoring-enhancement-architecture.md
+++ b/docs/adr/ADR-037-monitoring-enhancement-architecture.md
@@ -0,0 +1,199 @@
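+# ADR-037: Monitoring Enhancement Architecture - Anomaly Frequency Statistics and Root-Cause Repair
+
+> **Status**: ✅ Wave A+B complete
+> **Date**: 2026-03-29
+> **Decision maker**: Commander (approved)
+> **Reviewer**: Chief Architect (Wave A: 91/100, Wave B: 92/100 OUTSTANDING)
+> **Estimated effort**: 17h (7 phases)
+> **Integration plan**: [MONITORING_MASTER_PLAN.md](../proposals/MONITORING_MASTER_PLAN.md) (Wave A-D / 10.75h)
+
+---
+
+## Background
+
+On 2026-03-29 the Commander gave explicit instructions:
+> "A restart only treats the symptom, not the root cause! Anomalies that occur too often must be resolved for good."
+> "We need statistics and counts! Users must be told!!"
+> "Anomalies that occur too often must have an automatic repair mechanism that resolves them for good!!!"
+
+### Current problems
+
+| Problem | Impact | Root cause |
+|------|------|------|
+| No statistics on recurring anomalies | The same problem keeps alerting | No frequency tracking |
+| Repair relies on restarts only | Treats symptoms, not causes | No tiered repair strategy |
+| No database monitoring | PostgreSQL/Redis blind spot | Missing exporters |
+| No write-back to Sentry | Repair results are cut off from the issue | Missing Comment API integration |
+| No learning mechanism | Cannot improve automatically | Missing success-rate statistics |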
+
+---
+
+## Decisions
+
+### 1. AnomalyCounter service (Phase A)
+
+**Technology choice**: Redis Sorted Set + sliding window
+
+```python
+import hashlib
+
+# Hash generation for an anomaly signature
+def generate_anomaly_key(signature: dict) -> str:
+    """Generate a unique key for an anomaly signature."""
+    key_parts = [
+        signature.get("alert_name", ""),
+        signature.get("service", ""),
+        signature.get("namespace", ""),
+        signature.get("error_type", ""),
+    ]
+    # First 16 hex characters of the SHA-256 digest of the "|"-joined fields
+    return hashlib.sha256("|".join(key_parts).encode()).hexdigest()[:16]
+```
+
+**Statistics dimensions**:
+- `count_1h`: occurrences within 1 hour
+- `count_24h`: occurrences within 24 hours
+- `count_7d`: occurrences within 7 days
+- `count_30d`: occurrences within 30 days
+
+**Escalation thresholds**:
+| Level | Threshold (24h) | Action |
+|------|-----------|------|
+| NORMAL | < 3 | Handle normally |
+| REPEAT | ≥ 3 | Mark as recurring, notify the user |
+| ESCALATE | ≥ 5 | Raise the Tier, notify the Owner |
+| PERMANENT_FIX | ≥ 10 | Force a root-cause fix |
+
+### 2. Tiered repair strategy (Phase A)
+
+| Tier | Name | Repair actions | When it applies |
+|------|------|---------|---------|
+| 1 | Restart | restart_pod, restart_container | First occurrence |
+| 2 | Mitigate | scale_up, flush_cache, kill_queries | Repeated 2 times |
+| 3 | Root cause | apply_fix, update_config, run_migration | ≥ 5 times in 24h |
+| 4 | Architecture | failover, switch_backend, permanent_fix | ≥ 10 times in 24h |
+
+**Repair attempt records**:
+```python
+# Redis Hash: anomaly:{key}:repair_stats
+{
+    "restart_pod": {"total": 3, "success": 1, "rate": 0.33},
+    "scale_up": {"total": 1, "success": 1, "rate": 1.0},
+}
+```
+
+### 3. Database Exporters (Phase B)
+
+| Service | Exporter | Port | Key metrics |
+|------|----------|------|---------|
+| PostgreSQL | postgres_exporter | 9187 | Connection pool, slow queries, lock waits, table bloat |
+| Redis | redis_exporter | 9121 | Memory, hit rate, eviction rate, latency |
+
+**Deployment location**: 192.168.0.188 (Docker Compose)
+
+### 4. Incident frequency fields (Phase C)
+
+```python
+class IncidentFrequencyStats(BaseModel):
+    anomaly_key: str
+    count_1h: int = 0
+    count_24h: int = 0
+    count_7d: int = 0
+    count_30d: int = 0
+    escalation_level: str | None = None  # REPEAT, ESCALATE, PERMANENT_FIX
+```
+
+**Aggregation window**: the same problem recurring within 10 minutes does not create a new Incident
+
+### 5. Sentry Comment write-back (Phase D)
+
+```
+POST /api/0/issues/{issue_id}/comments/
+Authorization: Bearer {SENTRY_API_TOKEN}
+{"text": "🤖 OpenClaw auto-repair: restart_pod succeeded"}
+```
+
+### 6. SigNoz alert rules (Phase E)
+
+| Metric | Threshold | Severity |
+|------|------|--------|
+| Error Rate | > 5% (5m) | warning |
+| P95 Latency | > 2s | warning |
+| Trace anomaly | duration > 10s | critical |
+
+### 7. Learning Service (Phase G)
+
+```python
+class LearningService:
+    async def should_skip_action(self, anomaly_key: str, action: str) -> bool:
+        """Skip this action when its success rate is below 20%."""
+        stats = await self.counter.get_repair_stats(anomaly_key, action)
+        # bool(...) keeps the declared return type when no stats exist yet
+        return bool(stats) and stats["rate"] < 0.2
+```
+
+---
+
+## Modularity compliance
+
+| Check | Status | Notes |
+|--------|------|------|
+| Interface defined first | ✅ | AnomalyCounter Protocol |
+| Routers must not touch the DB directly | ✅ | Goes through the Service layer |
+| Repository has a single responsibility | ✅ | AnomalyCounterRepository |
+| No circular dependencies | ✅ | counter → repair → learning |
+| Testable (without mocks) | ✅ | Redis Testcontainer |
+
+---
+
+## Implementation plan
+
+| Phase | Task | Effort | Depends on | Status |
+|-------|------|------|------|------|
+| A | AnomalyCounter + tiered repair | 4h | - | ✅ Already exists |
+| B | Database Exporters | 3h | - | ⏳ Pending |
+| C | Incident frequency fields | 2h | Phase A | ⏳ Pending |
+| D | Sentry Comment write-back | 1h | Phase A | ✅ **Wave A.4** |
+| E | SigNoz alert rules | 2h | - | ✅ **Wave A.2+A.3** |
+| F | Alert Chain E2E verification | 2h | Phase A-E | ✅ **Wave A.6+B.1+B.2** |
+| G | Learning Service | 3h | Phase A | ✅ Already exists |
+
+**Total effort**: 17h (about 10h completed)
+
+### Wave A+B completion details (2026-03-29)
+
+| Wave | Task | Status | Chief Architect review |
+|------|------|------|---------------|
+| A.1 | Sentry API Token | ✅ | 91/100 |
+| A.2 | SigNoz alert rules | ✅ | OUTSTANDING |
+| A.3 | SigNoz Webhook | ✅ | - |
+| A.4 | Sentry Comment | ✅ | - |
+| A.5 | Alert Chain Metrics | ✅ | - |
+| A.6 | Smoke Test | ✅ | - |
+| B.1 | PrometheusRule | ✅ | - |
+| B.2 | CD Pipeline | ✅ | - |
+
+---
+
+## Risks and mitigations
+
+| Risk | Impact | Mitigation |
+|------|------|------|
+| Redis memory growth | Statistics data bloat | 30-day TTL + ZREMRANGEBYSCORE |
+| Tier misjudgment | Inappropriate repair | Manual approval for Tier 3/4 |
+| Sentry API rate limiting | Comment fails | Rate limiter + retry |
+
+---
+
+## References
+
+- [MONITORING_STRATEGIC_PLANNING.md](../proposals/MONITORING_STRATEGIC_PLANNING.md)
+- [IMPLEMENTATION_STEPS_ANOMALY_COUNTER.md](../proposals/IMPLEMENTATION_STEPS_ANOMALY_COUNTER.md)
+- [IMPLEMENTATION_STEPS_DATABASE_EXPORTERS.md](../proposals/IMPLEMENTATION_STEPS_DATABASE_EXPORTERS.md)
+- [IMPLEMENTATION_STEPS_INCIDENT_FREQUENCY.md](../proposals/IMPLEMENTATION_STEPS_INCIDENT_FREQUENCY.md)
+- [IMPLEMENTATION_STEPS_REMAINING_PHASES.md](../proposals/IMPLEMENTATION_STEPS_REMAINING_PHASES.md)
+
+---
+
+## Change log
+
+| Date | Version | Change | Author |
+|------|------|------|------|
+| 2026-03-29 | 1.0 | Initial version | Claude (Chief Architect) |
+| 2026-03-29 | 1.1 | Wave A+B complete (91/100 OUTSTANDING) | Claude Code |
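
Reviewer sketch for ADR-037, decision 1: the ADR chooses a Redis Sorted Set with a sliding window and lists the 24h escalation thresholds, but the diff only shows key generation. The block below is a minimal illustration of how the window counts and the escalation level could be derived. It is an assumption-laden stand-in, not the project's implementation: `InMemoryAnomalyWindow`, `escalation_level`, `WINDOWS` and `RETENTION_S` are hypothetical names, and a sorted Python list replaces the Sorted Set so the example runs without Redis.

```python
import bisect
from dataclasses import dataclass, field

# Window lengths in seconds, named after the ADR's statistics dimensions.
WINDOWS = {
    "count_1h": 3600,
    "count_24h": 24 * 3600,
    "count_7d": 7 * 24 * 3600,
    "count_30d": 30 * 24 * 3600,
}
RETENTION_S = WINDOWS["count_30d"]  # mirrors the 30-day TTL in the risk table


def escalation_level(count_24h: int) -> str:
    """Map a 24h count to a level from the ADR's escalation threshold table."""
    if count_24h >= 10:
        return "PERMANENT_FIX"
    if count_24h >= 5:
        return "ESCALATE"
    if count_24h >= 3:
        return "REPEAT"
    return "NORMAL"


@dataclass
class InMemoryAnomalyWindow:
    """Hypothetical in-memory stand-in for one Sorted Set (score = timestamp)."""

    timestamps: list = field(default_factory=list)

    def record(self, now: float) -> dict:
        bisect.insort(self.timestamps, now)  # plays the role of ZADD
        # Drop entries older than the retention period, the job that
        # ZREMRANGEBYSCORE would do against Redis.
        cutoff = bisect.bisect_left(self.timestamps, now - RETENTION_S)
        del self.timestamps[:cutoff]
        return self.counts(now)

    def counts(self, now: float) -> dict:
        # Number of entries with score >= now - span, the job ZCOUNT would do.
        return {
            name: len(self.timestamps) - bisect.bisect_left(self.timestamps, now - span)
            for name, span in WINDOWS.items()
        }
```

Calling `record()` once per anomaly and passing the resulting `count_24h` to `escalation_level()` reproduces the NORMAL / REPEAT / ESCALATE / PERMANENT_FIX ladder. In a real deployment the three list operations would become Redis calls, and their failure handling would matter (see section 5.1 of the war game document).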
--- a/docs/proposals/ARCHITECTURAL_RISK_WAR_GAME.md
+++ b/docs/proposals/ARCHITECTURAL_RISK_WAR_GAME.md
@@ -0,0 +1,840 @@
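+# AWOOOI Architectural Risk Full-Spectrum War Game
+
+> **Document type**: Foundation for architecture decisions (required reading, takes precedence over all Runbooks)
+> **Created**: 2026-03-29 13:37 (Taipei)
+> **Role**: This document is the parent of every execution plan; any implementation step must conform to it
+
+---
+
+## Chapter 0: The actual state, confirmed in code
+
+Before producing any plan, the real state of the code has to be faced honestly:
+
+| Risk item | Code location | Actual state |
+|---------|---------|---------|
+| Worker SIGTERM | `signal_worker.py:450-455` | ✅ SIGTERM interception implemented |
+| Worker stop() timeout | `signal_worker.py:147` | ❌ **Only 5 seconds**, while AI tasks run up to 60 seconds |
+| XCLAIM (orphaned task recovery) | `signal_worker.py`, entire file | ❌ **Completely missing**; nobody handles PEL orphans |
+| StatefulSet hard block | `auto_repair_service.py`, entire file | ❌ **Completely missing**; only severity and risk are checked |
+| TIER_ACTIONS Tier 1 | `auto_repair_service.py:411` | ❌ `restart_pod` is in Tier 1 with no exception for DB/Redis |
+| OpenClaw Circuit Breaker | `sentry_webhook.py:289` | ❌ **Only a 60s httpx timeout, no circuit breaking** |
+| ESLint i18n plugin | `.eslintrc.js:20-22` | ❌ **Only a TODO comment, not installed** |
+| Redis AOF check | n/a | ⚠️ Unverified, must be checked immediately |
+| Visual Regression baseline | n/a | ❌ Not created; Mac environment ≠ CI environment |
+
+---
+
+## Chapter 1: The 6 fatal conflicts already identified (Chief Architect's version)
+
+The Chief Architect identified these 6 conflicts accurately; checking the code makes them more severe:
+
+### Conflict 1: The CI/CD monitoring iron curtain paralyzes deployment
+
+**Severity**: 🔴 P0 | **Type**: Process conflict
+
+**Code confirmation**: `service-registry.yaml` already lists 60+ services, but `validate_coverage.py` is not yet integrated into `cd.yaml`.
+
+**Correct approach (three-stage Soft Launch)**:
+
+```yaml
+# .github/workflows/cd.yaml
+# Stage 1 (immediately): Warn-Only
+- name: Service Registry Coverage Check
+  run: |
+    python ops/scripts/validate_coverage.py --warn-only
+    # exits 0 even when coverage is missing; only sends a Telegram warning
+  continue-on-error: true  # ← key point: does not block deployment
+
+# Stage 2 (once the Registry is complete): real Block
+# change continue-on-error to false
+```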
+
+---
+
+### Conflict 2: Worker HPA + orphaned tasks in the Redis PEL
+
+**Severity**: 🔴 P0 | **Type**: Logic conflict
+
+**Code confirmation**: `signal_worker.py` has no XCLAIM logic at all. The `start()` method only calls `_ensure_consumer_group()` and never recovers pending tasks.
+
+**Minimum viable fix (added to `start()`)**:
+
+```python
+# signal_worker.py
+async def start(self) -> None:
+    if self._running:
+        return
+
+    await self._ensure_consumer_group()
+
+    # 🆕 Key step: on startup, take over orphaned tasks left by dead Workers
+    await self._claim_orphaned_tasks()
+
+    self._running = True
+    self._task = asyncio.create_task(self._consume_loop())
+
+async def _claim_orphaned_tasks(self, idle_ms: int = 60000) -> int:
+    """
+    XCLAIM mechanism: take over Pending tasks left un-ACKed for more than idle_ms.
+
+    Scenario: the previous Worker Pod was killed by K8s in the middle of a task,
+    the task is stuck in the PEL, and a newly started Worker must take it over.
+
+    idle_ms: a task is claimed only after being idle this many milliseconds (default 60 s)
+    """
+    redis_client = get_redis()
+    claimed_count = 0
+
+    try:
+        # List Pending tasks in the PEL (at most 100 per call)
+        pending = await redis_client.xpending_range(
+            STREAM_KEY, CONSUMER_GROUP,
+            min='-', max='+', count=100
+        )
+
+        for entry in pending:
+            # Only claim tasks that have gone unprocessed for more than idle_ms
+            if entry['time_since_delivered'] > idle_ms:
+                claimed = await redis_client.xclaim(
+                    STREAM_KEY, CONSUMER_GROUP, CONSUMER_NAME,
+                    min_idle_time=idle_ms,
+                    message_ids=[entry['message_id']]
+                )
+                if claimed:
+                    claimed_count += len(claimed)
+                    logger.info(
+                        "orphaned_task_claimed",
+                        message_id=entry['message_id'],
+                        original_consumer=entry['consumer'],
+                        idle_ms=entry['time_since_delivered'],
+                    )
+
+    except Exception as e:
+        # An XCLAIM failure must not block Worker startup
+        logger.warning("xclaim_failed", error=str(e))
+
+    if claimed_count > 0:
+        logger.info("orphaned_tasks_recovered", count=claimed_count)
+
+    return claimed_count
+```
+
+**Deployment order relative to HPA**: XCLAIM must be merged into main before the Worker HPA is deployed.
+
+---
+
+### Conflict 3: Overlapping alert storms (cross-source Incident explosion)
+
+**Severity**: 🔴 P0 | **Type**: Data-flow conflict
+
+**Code confirmation**: the aggregation logic in `incident_service.py` matches on the `fingerprint` string. Sentry and Alertmanager use different fingerprint formats, so alerts cannot be aggregated across sources.
+
+**Root-cause scenario**:
+```
+PostgreSQL goes down →
+  Alertmanager: 1 PostgreSQLDown alert
+  Sentry: 200 ConnectionRefused alerts (every API request)
+  → 201 separate Incidents
+  → 201 OpenClaw analyses (token explosion)
+```
+
+**Global disaster cooldown implementation**:
+
+```python
+# apps/api/src/services/incident_service.py (addition)
+import uuid
+
+GLOBAL_INCIDENT_DEBOUNCE_TTL = 300  # 5-minute global cooldown
+P0_INFRASTRUCTURE_SERVICES = {
+    "postgres", "postgresql", "redis", "k8s-api", "etcd"
+}
+
+async def _check_global_incident_storm(self, signal_data: dict) -> str | None:
+    """
+    Check whether a P0 infrastructure disaster is currently active.
+
+    If so → return the main Incident ID (correlate this event to it)
+    If not → return None (create a new Incident as usual)
+    """
+    redis = get_redis()
+    storm_key = "global:incident_storm:active"
+
+    # Is this itself a P0 infrastructure alert? (handled first, never correlated)
+    alert_name = signal_data.get("alert_name", "")
+    service = signal_data.get("target", "")
+
+    is_infra_p0 = any(svc in service.lower() for svc in P0_INFRASTRUCTURE_SERVICES)
+
+    if is_infra_p0 and signal_data.get("severity") == "critical":
+        # Raise the global storm flag
+        main_incident_id = f"storm-{uuid.uuid4().hex[:8]}"
+        await redis.setex(storm_key, GLOBAL_INCIDENT_DEBOUNCE_TTL, main_incident_id)
+        logger.warning("global_incident_storm_detected", main_id=main_incident_id)
+        return None  # the P0 alert itself creates an Incident normally
+
+    # Non-P0 alert: check whether a storm is in progress
+    active_storm_id = await redis.get(storm_key)
+    if active_storm_id:
+        # redis returns bytes unless the client was created with decode_responses=True
+        if isinstance(active_storm_id, bytes):
+            active_storm_id = active_storm_id.decode()
+        logger.info(
+            "alert_correlated_to_storm",
+            storm_id=active_storm_id,
+            alert=alert_name,
+        )
+        return active_storm_id  # correlate to the main Incident, no separate analysis
+
+    return None
+```
+
+---
+
+### Conflict 4: Auto-Repair kills stateful services by mistake
+
+**Severity**: 🔴 P0 | **Type**: Architecture conflict
+
+**Code confirmation**: at `auto_repair_service.py:411`, `TIER_ACTIONS[1]` contains `restart_pod` and `restart_container`. The evaluation logic only checks `severity <= P2` and `RiskLevel <= MEDIUM`; **there is no check on service type at all**.
+
+**Minimum viable fix (add a service blacklist)**:
+
+```python
+# auto_repair_service.py: new constant and guard
+
+# 🚨 Services that must never be restarted automatically (stateful services)
+STATEFUL_SERVICE_BLACKLIST = frozenset({
+    "postgres", "postgresql", "awoooi-postgres",
+    "redis", "awoooi-redis", "redis-stack",
+    "clickhouse", "signoz-clickhouse",
+    "elasticsearch", "etcd",
+    "minio", "awoooi-minio",
+})
+
+async def evaluate_auto_repair(self, incident: Incident) -> AutoRepairDecision:
+    # ... existing checks ...
+
+    # 🆕 New: hard block for stateful services (must run before Playbook matching)
+    affected_services = incident.affected_services or []
+    for service in affected_services:
+        # Substring match: deliberately conservative, also blocks names that merely contain an entry
+        if any(bl in service.lower() for bl in STATEFUL_SERVICE_BLACKLIST):
+            logger.warning(
+                "auto_repair_blocked_stateful_service",
+                incident_id=incident.incident_id,
+                service=service,
+            )
+            return AutoRepairDecision(
+                can_auto_repair=False,
+                reason=f"Service {service} is stateful; automatic restart is forbidden, the Commander must intervene manually",
+                blocked_by="STATEFUL_SERVICE_GUARDRAIL",
+            )
+
+    # ... remaining existing logic ...
+```
+
+---
+
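+### Conflict 5: Frontend "merge hell" (timing conflict)
+
+**Severity**: 🟠 P1 | **Type**: Timing conflict
+
+**Correct trunk-locking strategy**:
+
+```
+git flow frontend sovereignty plan:
+
+Week 1: Feature Freeze
+  main branch locked (no frontend PRs may merge unless they are i18n-related)
+
+Week 1: i18n zero-violation PR (the only frontend PR allowed)
+  branch: fix/i18n-zero-violation
+  → fix all 40+ violations in one pass
+  → install eslint-plugin-i18next at the same time (warn mode first)
+  → Merge to main
+
+After Week 1: Feature Unfreeze
+  → start the Storybook PR
+  → start the Omni-Terminal SSE Event Sourcing PR
+  → switch ESLint to error mode
+```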
+
+---
+
+### Conflict 6: Playwright HTTPS certificates and network blind spots
+
+**Severity**: 🟠 P1 | **Type**: Infrastructure conflict
+
+**Code confirmation**: the current settings in `playwright.config.ts` still need to be checked.
+
+```typescript
+// playwright.config.ts: required changes
+import { defineConfig } from '@playwright/test';
+
+export default defineConfig({
+  use: {
+    baseURL: process.env.BASE_URL || 'http://192.168.0.120:32335',
+    ignoreHTTPSErrors: true,  // 🆕 self-signed certificates must be ignored
+    viewport: { width: 1280, height: 720 },
+    deviceScaleFactor: 1,    // avoids Retina rendering differences
+  },
+  expect: {
+    toHaveScreenshot: {
+      threshold: 0.05,
+      maxDiffPixelRatio: 0.05,
+    },
+  },
+});
+```
+
+---
+
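+## Chapter 2: 6 deeper fatal risks that were missed
+
+The Chief Architect's 6 conflicts are correct, but the following 6 deeper risks also exist in the system:
+
+### Deep risk A: The .188 node is a single point of failure (SPOF), and the system's brain loses its memory
+
+**Location**: 192.168.0.188 (AI+Web center)
+**Impact**: if .188 goes down, Ollama + OpenClaw + Redis + PostgreSQL + SigNoz **all fail at the same time**
+
+This is the most dangerous single point in the whole system:
+
+```
+Cascading collapse when .188 goes down:
+  Redis fails → Signal Worker cannot consume → all alerts pile up
+  PostgreSQL fails → K3s control plane loses its datastore → K3s may crash
+  OpenClaw fails → all AI analysis stops → Sentry/Alertmanager webhooks queue up
+  SigNoz fails → observability blind spot
+  ↓
+  K3s crashes → every AWOOOI API/Web Pod is gone
+  ↓
+  No AWOOOI → no alerts received → the Commander cannot act → total loss of contact
+```
+
+**Mitigation strategy (not a cure)**:
+```yaml
+# OpenClaw and the Worker must have a degraded mode for when .188 is down
+# Minimum bar: a Telegram Bot directly sends a ".188 appears to be down" alert
+# (bypassing the AWOOOI API and calling the Telegram API directly with curl)
+
+# k8s/monitoring/alert-rules.yaml (addition)
+- alert: AIWebCenterDown
+  expr: probe_success{job="blackbox", target="http://192.168.0.188:8089/health"} == 0
+  for: 2m
+  annotations:
+    summary: ".188 AI center unreachable, system entering degraded mode"
+    runbook: "docs/runbooks/RUNBOOK-188-FAILOVER.md"
+```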
+
+---
+
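+### Deep risk B: Circular dependency in observability (the observer's blind spot)
+
+**This is the most ironic problem in the architecture**:
+
+```
+When the AWOOOI API itself crashes:
+  Alertmanager tries to send a webhook to AWOOOI → AWOOOI is down, the webhook fails
+  Sentry tries to send a webhook to AWOOOI → same
+  Telegram notifications go through AWOOOI → same
+
+  The last mile of the alert chain (Telegram notification) depends on the very thing being monitored!
+```
+
+**Existing protection** (ADR-035 already covers this!):
+```yaml
+# Alertmanager has a direct Telegram notification (bypassing AWOOOI)
+# Still to confirm: does alertmanager.yml actually define a backup receiver?
+receivers:
+  - name: 'openclaw-api'           # primary path (through AWOOOI)
+    ...
+  - name: 'direct-telegram'        # backup path (straight to Telegram)
+    # A generic webhook_configs URL would POST Alertmanager's own JSON, which
+    # the Telegram sendMessage endpoint does not accept, so the native
+    # telegram_configs receiver is used here instead.
+    telegram_configs:
+      - bot_token: '{TOKEN}'
+        chat_id: <CHAT_ID>         # placeholder: numeric chat ID
+```
+
+**To be verified**: whether ADR-035's three-layer protection really covers the "AWOOOI API itself is down" scenario.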
+
+---
+
+### Deep risk C: OpenClaw calls have no Circuit Breaker (backend AI paralysis propagates)
+
+**Code confirmation**: `call_openclaw_analyzer()` at `sentry_webhook.py:289` only uses `httpx.AsyncClient(timeout=60.0)` and has **no Circuit Breaker**.
+
+**Scenario**: OpenClaw is under heavy load (GPU overheating, memory pressure), so every Sentry/Alertmanager call waits the full 60 seconds before timing out. FastAPI background tasks pile up until the API Pod runs out of memory and is OOM-killed.
+
+**Minimum viable fix**:
+
+```python
+# apps/api/src/core/circuit_breaker.py (new file)
+import time
+from enum import Enum
+
+class CircuitState(Enum):
+    CLOSED = "closed"        # normal
+    OPEN = "open"            # tripped (fail fast)
+    HALF_OPEN = "half_open"  # probing for recovery
+
+class SimpleCircuitBreaker:
+    """
+    Simple Circuit Breaker (does not depend on NVIDIA's implementation)
+
+    State machine:
+    CLOSED → OPEN (5 consecutive failures)
+    OPEN → HALF_OPEN (after a 60-second cooldown)
+    HALF_OPEN → CLOSED (1 success)
+    HALF_OPEN → OPEN (1 failure)
+    """
+    def __init__(self, failure_threshold=5, timeout_s=60):
+        self.state = CircuitState.CLOSED
+        self.failure_count = 0
+        self.threshold = failure_threshold
+        self.timeout_s = timeout_s
+        self._opened_at: float | None = None
+
+    def is_open(self) -> bool:
+        if self.state == CircuitState.OPEN:
+            # time.monotonic() is unaffected by wall-clock adjustments
+            if self._opened_at is not None and time.monotonic() - self._opened_at > self.timeout_s:
+                self.state = CircuitState.HALF_OPEN
+                return False
+            return True
+        return False
+
+    def record_success(self):
+        self.failure_count = 0
+        self.state = CircuitState.CLOSED
+
+    def record_failure(self):
+        # failure_count is only reset on success, so one failure in HALF_OPEN re-opens the circuit
+        self.failure_count += 1
+        if self.failure_count >= self.threshold:
+            self.state = CircuitState.OPEN
+            self._opened_at = time.monotonic()
+
+# Global OpenClaw Circuit Breaker
+_openclaw_cb = SimpleCircuitBreaker(failure_threshold=5, timeout_s=60)
+
+def get_openclaw_circuit_breaker() -> SimpleCircuitBreaker:
+    return _openclaw_cb
+```
+
+```python
+# sentry_webhook.py: modified call_openclaw_analyzer()
+async def call_openclaw_analyzer(error_context: dict) -> ErrorAnalysisResult | None:
+    cb = get_openclaw_circuit_breaker()
+
+    # Circuit protection: fail immediately instead of waiting
+    if cb.is_open():
+        logger.warning("openclaw_circuit_open_skip_analysis")
+        return None
+
+    try:
+        async with httpx.AsyncClient(timeout=60.0) as client:
+            response = await client.post(...)
+            if response.status_code == 200:
+                cb.record_success()
+                return ErrorAnalysisResult(**response.json())
+            else:
+                cb.record_failure()
+                return None
+    except Exception as e:
+        cb.record_failure()
+        logger.exception("openclaw_call_failed", error=str(e))
+        return None
+```
+
+---
+
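+### Deep risk D: Database migration conflicts during a K8s Rolling Update
+
+**Scenario**: CD performs a rolling update, so old and new API Pods run side by side (Kubernetes rolling strategy). If the new version ships an `ALTER TABLE` migration that has not run yet, the Pods that expect the new column fail because it does not exist; if the migration runs first, the old Pods face a schema they were not written for.
+
+**Current state to confirm**: the Alembic migration strategy.
+
+**Safeguard (the existing CD must be checked)**:
+
+```yaml
+# .github/workflows/cd.yaml: check whether a migration step exists
+# If not, one must be added:
+- name: Run DB Migration
+  run: |
+    kubectl exec -n awoooi-prod \
+      $(kubectl get pod -n awoooi-prod -l app=awoooi-api -o name | head -1) \
+      -- python -m alembic upgrade head
+  # Migrations must be backward compatible (columns may be added, never dropped)
+```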
+
+---
+
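+### Deep risk E: Silent data loss under Redis memory pressure
+
+AnomalyCounter's sliding window uses Redis Sorted Sets / counters. If Redis runs short of memory and `maxmemory-policy` kicks in, these counters may be evicted silently.
+
+**Confirm immediately**:
+```bash
+ssh root@192.168.0.188 'docker exec awoooi-redis redis-cli CONFIG GET maxmemory-policy'
+# With allkeys-lru or allkeys-lfu, AnomalyCounter's counters will be evicted!
+# The setting should be volatile-ttl or noeviction (paired with a memory alert)
+```
+
+**Fix**: AnomalyCounter keys already carry a TTL. `volatile-ttl` evicts only keys that have a TTL, shortest remaining TTL first, so it protects keys without a TTL but leaves the counters themselves as eviction candidates:
+```bash
+# Set volatile-ttl (evicts only keys with a TTL, starting with the shortest TTL)
+docker exec awoooi-redis redis-cli CONFIG SET maxmemory-policy volatile-ttl
+# AnomalyCounter counters have a TTL → they can still be evicted
+# Remedy: raise Redis maxmemory, or switch to noeviction with active monitoring
+```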
+
+---
+
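+### Deep risk F: GitHub Actions Runner security isolation
+
+The self-hosted runner on `.110` runs `kubectl patch secret` on every CD run, which means:
+- The runner must hold admin rights on the K8s cluster
+- Anyone who can merge a PR into main can trigger a Job that has K8s admin rights
+
+**Minimum protection**:
+
+```yaml
+# .github/workflows/cd.yaml
+# Every Job that runs kubectl must be placed under environment protection
+jobs:
+  deploy:
+    environment: production  # ← a GitHub Environment with required review must be configured
+```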
+
+---
+
+## Chapter 3: Final safe execution sequence (Wave 0 + 12 waves)
+
+After integrating all 6+6 conflict analyses, the following is the **only correct execution order**:
+
+```
+🛡️ Wave 0: Immediate damage control (same day, no deployment, configuration only)
+  0.1  Check Redis maxmemory-policy (5min)
+  0.2  Check Redis appendonly (5min)
+  0.3  Check the Alertmanager backup Telegram path (10min)
+
+🔴 Wave 1: Underlying safety net (Week 1, must run serially)
+  In order:
+  1.1  Develop the XCLAIM mechanism (2h)
+  1.2  Develop the StatefulSet Guardrail (1h)
+  1.3  Develop the OpenClaw Circuit Breaker (2h)
+  1.4  Develop the Global Incident Debounce (2h)
+  1.5  Combine the four items above into a single PR, test, then merge (1h)
+
+  → This PR must never be split! The four fixes depend on each other.
+
+🔴 Wave 2: Worker upgrade (after Wave 1)
+  2.1  Worker terminationGracePeriodSeconds 90s (30min)
+  2.2  Worker stop() timeout 75s (30min)
+  2.3  Deploy the Worker HPA (30min)
+
+🟠 Wave 3: Frontend trunk lock (starts together with Wave 1, on a separate branch)
+  3.1  Announce the Frontend Feature Freeze
+  3.2  Rapid i18n cleanup to zero violations (4h)
+  3.3  Install eslint-plugin-i18next (warn mode) (1h)
+  3.4  Merge the i18n PR → lift the Frontend Freeze
+  3.5  Switch ESLint to error mode
+
+🟠 Wave 4: CI infrastructure (after Wave 3)
+  4.1  playwright.config.ts (ignoreHTTPSErrors + threshold)
+  4.2  Create the initial Docker Visual Baseline
+  4.3  E2E Weekly Schedule YAML (Warn-Only)
+  4.4  CD validate_coverage.py (Warn-Only)
+
+🟠 Wave 5: Complete the alerting backend (after Wave 1)
+  5.1  Configure Sentry SENTRY_AUTH_TOKEN (Phase D)
+  5.2  Deploy SigNoz alert rules to .188 (Phase E)
+
+🟡 Wave 6: Observability consolidation (after Wave 5)
+  6.1  Prometheus Federation (.110 → .188)
+  6.2  Build the AI Autonomy Index Metrics
+  6.3  Evaluate and enable Redis AOF + Sentinel
+
+🟡 Wave 7: Frontend capability expansion (after Wave 4)
+  7.1  Storybook for 10 core components
+  7.2  Omni-Terminal SSE Event Sourcing
+  7.3  Monitoring GenUI cards (7 cards)
+  7.4  Nexus AI autonomy-rate UI
+
+⚪ Wave 8: Root-level DB HA solution
+  8.1  CloudNativePG evaluation report
+  8.2  Execute after the decision (Patroni / CloudNativePG / keep the status quo + backups)
+
+⚪ Wave 9: Business metrics layer
+  9.1  FinOps Dashboard API + UI
+  9.2  SLO / MTTR API endpoints
+
+⚪ Wave 10: Security sovereignty
+  10.1  Kali → MCP Tool → SecurityAgent
+  10.2  SBOM generation integration
+
+⚪ Wave 11: Switch CI to hard blocking
+  11.1  Visual Regression CI: from warn → block
+  11.2  Coverage validation: from warn → block
+  11.3  ESLint: confirm it is already in error mode
+
+⚪ Wave 12: Phase 4 visual soul injection
+  12.1  Brand 3D assets + chibi-style OpenClaw
+  12.2  Site-wide micro-animation upgrade
+```
+
+---
+
+## Chapter 4: Mandatory pre-execution checklist
+
+Before any Wave 1 work starts, the following checks must be completed:
+
+```bash
+#!/bin/bash
+# ops/scripts/pre-execution-checklist.sh
+
+echo "=== AWOOOI mandatory pre-execution checklist ==="
+
+# 1. Redis AOF check
+APPENDONLY=$(docker exec awoooi-redis redis-cli CONFIG GET appendonly | tail -1)
+echo "Redis AOF: $APPENDONLY"  # must be yes
+
+# 2. Redis maxmemory-policy check
+POLICY=$(docker exec awoooi-redis redis-cli CONFIG GET maxmemory-policy | tail -1)
+echo "Redis eviction policy: $POLICY"  # must not be allkeys-lru
+
+# 3. K3s cluster status check
+kubectl get nodes   # nodes are cluster-scoped, so no namespace flag is needed
+kubectl get pod -n awoooi-prod
+
+# 4. Alertmanager backup Telegram path check (v2 API)
+curl -s http://192.168.0.120:30093/api/v2/receivers | python3 -m json.tool | grep name
+
+# 5. Check the .110 → .120 network route (needed by Playwright E2E)
+ping -c 3 192.168.0.120
+
+echo "=== Checks finished, review the output above before starting ==="
+```
+
+---
+
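+## Appendix: Open questions still unresolved (Commander decision required)
+
+| Question | Option A | Option B | Impact |
+|------|--------|--------|------|
+| PostgreSQL HA | CloudNativePG (K8s native) | Patroni+keepalived (VM layer) | Major Q2 decision |
+| Redis HA level | Sentinel (automatic failover) | AOF + manual recovery (conservative) | Monthly decision |
+| .188 standby node | Buy a second AI host | Cloud GPU hot standby | Quarterly budget |
+| GitHub Runner security isolation | GitHub Environments approval | Split CI (read-only) from CD (needs K8s admin) | Security policy |
+
+---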
+
+## Chapter 5: Four final deep-water zones (confirmed at code level)
+
+### 5.1 Redis crash → AnomalyCounter blows up in a chain → alerts are lost for good
+
+**Code confirmation**: `anomaly_counter.py:147`
+
+```python
+# Current state (no try/except at all):
+await self.redis.zadd(timeline_key, {str(timestamp): timestamp})  # ← Redis down = raises immediately
+await self.redis.zremrangebyscore(...)
+await self.redis.zcount(...)   # ← everything fails
+# → caller sentry_webhook.py → the whole background task fails → the alert is lost!
+```
+
+**Fix: defensive wrapper for graceful degradation**
+
+```python
+# anomaly_counter.py: modified record_anomaly()
+
+async def record_anomaly(self, anomaly_signature: dict) -> AnomalyFrequency:
+    """Record an anomaly; degrade gracefully (no exception) when Redis fails."""
+    try:
+        return await self._record_anomaly_impl(anomaly_signature)
+    except Exception as e:
+        # Redis connection failed → degrade: return a minimal frequency object so the main flow continues
+        logger.warning(
+            "anomaly_counter_redis_degraded",
+            error=str(e),
+            reason="Returning default frequency to allow alert chain to continue"
+        )
+        # Do not raise! The alert chain must keep going!
+        now = datetime.now()
+        return AnomalyFrequency(
+            anomaly_key=self.hash_signature(anomaly_signature),
+            count_1h=1, count_24h=1, count_7d=1, count_30d=1,
+            first_seen=now, last_seen=now,
+            auto_repair_count=0, permanent_fix_applied=False,
+            escalation_level=None,  # escalation cannot be judged; handle conservatively
+        )
+
+async def _record_anomaly_impl(self, anomaly_signature: dict) -> AnomalyFrequency:
+    """Original implementation (extracted from record_anomaly)."""
+    # ... all of the original Redis operations ...
+```
+
+**Principle**: "cannot remember" must never turn into "cannot send". Redis is a supporting system, not part of the critical path.
+
+---
+
+### 5.2 Worker crashes ungracefully → orphaned PEL tasks are stuck for good
+
+**Code confirmation**: nowhere in `signal_worker.py` is there a `reclaim_loop` or a periodic XPENDING scan.
+
+The existing `_claim_orphaned_tasks()` runs only once, in `start()`, so it cannot handle a **Pod crashing while the system is running**:
+
+```
+Scenario: 2 Workers running steadily
+  Worker A is mid-task → Segfault / OOM Kill (ungraceful shutdown)
+  Worker B is running → start() is not triggered again → XCLAIM never runs
+  The orphaned task stays in the PEL → rescued only when HPA next starts a new Pod
+  The wait may exceed 600 seconds (HPA stabilizationWindowSeconds)
+```
+
+**Fix: Active Sweeper Loop (runs in parallel with the heartbeat loop)**
+
+```python
+# signal_worker.py: new _reclaim_loop()
+
+async def start(self) -> None:
+    if self._running:
+        return
+    await self._ensure_consumer_group()
+    await self._claim_orphaned_tasks()   # once at startup
+    self._running = True
+    self._task = asyncio.create_task(self._consume_loop())
+    self._reclaim_task = asyncio.create_task(self._reclaim_loop())  # 🆕 continuous scan
+
+async def _reclaim_loop(self, interval_s: int = 300) -> None:
+    """
+    Active Sweeper: every 5 minutes, scan the PEL and claim orphaned tasks idle for more than 5 minutes.
+    Runs in parallel with _consume_loop and does not block normal consumption.
+    """
+    while self._running:
+        try:
+            await asyncio.sleep(interval_s)
+            if not self._running:
+                break
+            claimed = await self._claim_orphaned_tasks(idle_ms=300_000)  # 5 minutes
+            if claimed > 0:
+                logger.info("active_sweeper_claimed", count=claimed)
+        except asyncio.CancelledError:
+            break
+        except Exception as e:
+            logger.warning("active_sweeper_error", error=str(e))
+
+async def stop(self) -> None:
+    self._running = False
+    # Cancel the reclaim loop as well
+    if hasattr(self, '_reclaim_task') and self._reclaim_task:
+        self._reclaim_task.cancel()
+    if self._task:
+        try:
+            await asyncio.wait_for(self._task, timeout=75.0)  # corrected value
+        # asyncio.TimeoutError is what wait_for raises on every supported Python version
+        except (asyncio.TimeoutError, asyncio.CancelledError):
+            pass
+```
+
+---
+
+### 5.3 SSE Event Store: a Redis memory bomb
+
+**Code confirmation**: `terminal.py:114` uses the SSE Publisher/Subscribe pattern (`publisher.subscribe(topics)`), **not a Redis List**.
+
+This is an **important correction to the described architecture**:
+
+- The existing terminal.py uses `src.core.sse.SSEPublisher` as its event distribution mechanism
+- **Event Sourcing (Redis RPUSH) is not implemented yet**; it is a future feature
+- The Redis memory-bomb risk therefore lies in the **future implementation** and has to be prevented at design time
+
+**Mandatory specification for the future Event Sourcing implementation**:
+
+```python
+# terminal.py: future stream_with_persistence()
+import json
+
+MAX_PAYLOAD_BYTES = 50 * 1024          # 50KB cap (a larger tool_result is truncated)
+MAX_EVENTS_PER_SESSION = 50            # at most 50 events per session (LTRIM)
+SESSION_TTL_SECONDS = 3600             # 1-hour TTL
+
+async def stream_with_persistence(command_id: str, event_type: str, data: dict):
+    redis = get_redis()
+    key = f"terminal:events:{command_id}"
+
+    # 🚨 Required payload protection
+    # json.dumps escapes non-ASCII by default, so len() here equals the byte size
+    payload_json = json.dumps(data)
+    if len(payload_json) > MAX_PAYLOAD_BYTES:
+        payload_json = json.dumps({
+            "truncated": True,
+            "original_size": len(payload_json),
+            "preview": payload_json[:1024],
+            "message": f"Payload of {len(payload_json)//1024}KB is too large and was truncated"
+        })
+
+    event = {"type": event_type, "data": json.loads(payload_json)}
+    await redis.rpush(key, json.dumps(event))
+    await redis.ltrim(key, -MAX_EVENTS_PER_SESSION, -1)  # keep only the last 50
+    await redis.expire(key, SESSION_TTL_SECONDS)
+```
+
+---
+
+### 5.4 Frontend Feature Freeze → Hotfix deadlock
+
+**Code confirmation**: there is no `release/` branch strategy; main is the only long-lived branch.
+
+**Fix: three-branch Git Flow strategy (must be in place before the Freeze starts)**
+
+```bash
+# Before the Week 1 i18n cleanup begins, run:
+
+# Step 1: create a stable release baseline from the current main
+git checkout main
+git pull origin main
+git checkout -b release/v1.x
+git push origin release/v1.x
+
+# Step 2: on GitHub, make release/v1.x a Protected Branch
+# → only Hotfix PRs may merge into this branch
+
+# Step 3: start the i18n cleanup (on main/develop)
+git checkout main
+git checkout -b fix/i18n-zero-violation
+# ... perform the i18n cleanup ...
+git push origin fix/i18n-zero-violation
+# → PR merges into main
+```
+
+**Emergency Hotfix flow** (when production blows up during the Freeze):
+
+```bash
+# Cut the hotfix from the release branch
+git checkout release/v1.x
+git checkout -b hotfix/critical-approval-button-fix
+# ... minimal fix ...
+git push origin hotfix/critical-approval-button-fix
+
+# PR merges into release/v1.x → deploy immediately
+# Then cherry-pick onto main (the branch where the i18n refactor is in progress)
+git checkout main
+git cherry-pick <hotfix-commit-hash>
+```
+
+**Hotfix trigger conditions** (to be written into HARD_RULES.md):
+- The Commander cannot use core functionality (approval button broken, login unavailable)
+- P0-level Sentry errors exceed 10 per minute
+- Service availability < 99%
+
+---
+
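+## Chapter 6: Final Wave 1 implementation list (ready to be authorized)
+
+After a full code review, the Wave 1 fixes (the four items from Chapter 3 plus the AnomalyCounter degradation from 5.1) touch the following exact locations:
+
+| Fix | Files changed | Location | Estimated effort |
+|---------|---------|------|---------|
+| XCLAIM + Active Sweeper | `signal_worker.py` | `start()`, `stop()`, new `_reclaim_loop()`, `_claim_orphaned_tasks()` | 2h |
+| StatefulSet Guardrail | `auto_repair_service.py` | Service blacklist added at the start of `evaluate_auto_repair()` | 1h |
+| AnomalyCounter Redis degradation | `anomaly_counter.py` | `record_anomaly()` wrapped in try/except with a degraded return value | 1h |
+| OpenClaw Circuit Breaker | `core/circuit_breaker.py` (new) → `sentry_webhook.py`, `signoz_webhook.py` | `call_openclaw_analyzer()` wrapped with circuit protection | 2h |
+| Global Incident Debounce | `services/incident_service.py` | Global cooldown check added before `process_signal()` | 1.5h |
+
+**Wave 1 execution conditions**:
+1. Wave 0.1-0.3 manual checks completed (Redis AOF/eviction, Alertmanager backup path)
+2. Git Flow: the stable `release/v1.x` branch exists (prevents a Hotfix deadlock during the Freeze)
+3. All changes bundled into one PR (atomic deployment, must not be split)
+
+*"A real architect does not design a perfect system, but a system that degrades gracefully under any extreme condition."* 🦞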
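
Reviewer sketch for Chapter 3 of the war game: the wave list states its ordering constraints in prose ("after Wave 1", "after Wave 3", and so on), which makes them easy to violate when schedules shift. The block below shows one way such constraints could be checked mechanically with a topological sort. It is a hedged illustration only: `WAVE_DEPENDENCIES`, `execution_order` and `violates_order` are hypothetical names, and only prerequisites written in the document are encoded (Waves 8-12 state none and are left out).

```python
from graphlib import TopologicalSorter

# Maps each wave to the waves that must finish first (hypothetical structure).
# Wave 1 requires the Wave 0 checks; Wave 3 starts alongside Wave 1 and has
# no stated prerequisite.
WAVE_DEPENDENCIES = {
    0: set(),
    1: {0},
    2: {1},
    3: set(),
    4: {3},
    5: {1},
    6: {5},
    7: {4},
}


def execution_order(deps: dict) -> list:
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(deps).static_order())


def violates_order(order: list, deps: dict) -> list:
    """List (wave, prerequisite) pairs where the prerequisite is scheduled later."""
    position = {wave: index for index, wave in enumerate(order)}
    return [
        (wave, pre)
        for wave, prerequisites in deps.items()
        for pre in sorted(prerequisites)
        if position[pre] > position[wave]
    ]
```

A proposed schedule can be passed to `violates_order()` before work starts; an empty list means every stated prerequisite comes first. Several valid orders exist because independent waves (for example Wave 1 and Wave 3) may be interleaved.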
+
--- a/docs/proposals/INTEGRATION_ARCHITECTURE_MASTER.md
+++ b/docs/proposals/INTEGRATION_ARCHITECTURE_MASTER.md
@@ -0,0 +1,570 @@
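+# AWOOOI Unified Integration Architecture Design
+
+> **Document type**: Unified architecture design (Single Source of Truth for Integration)
+> **Priority**: 🔴 Commander's highest directive
+> **Created**: 2026-03-29 13:27 (Taipei)
+> **Core proposition**: Every node must operate in concert within the neural network of a single brain; no islands are allowed.
+
+---
+
+## Part 1: Honest inventory of the current state (precise)
+
+### Confirmed: better than the audit report claimed
+
+| Item | Audit report's misjudgment | Actual state |
+|------|------------|---------|
+| Worker SIGTERM handling | Report said "missing" | ✅ **Implemented** (`signal_worker.py:450-455`): `signal.signal(SIGTERM)` + `shutdown_event` |
+| Worker graceful shutdown flow | Report said "needs to be implemented" | ✅ **Implemented** (`stop()` method, with a heartbeat mechanism) |
+| SigNoz Webhook route | Report said "not deployed" | ✅ **Routed** (`main.py:419`) |
+
+### Confirmed: worse than the audit report claimed
+
+| Item | Audit report's version | Actual gap |
+|------|------------|---------|
+| Worker stop() timeout | Not mentioned | ❌ **Only 5 seconds**; a 30-60 second AI analysis gets killed |
+| K8s terminationGracePeriodSeconds | Not mentioned | ❌ **Not set**; the K8s default of 30 seconds is not enough |
+| ESLint i18n enforcement | Said "blocked in CI" | ❌ **Only a TODO comment** (`.eslintrc.js:20-22`); the plugin is not actually installed |
+| Cross-platform Visual Regression | Said "screenshot comparison" | ❌ **Font rendering differs between Mac and CI Linux**; the baseline cannot be generated on a Mac |
+| PostgreSQL HA | Said "Streaming Replication" | ❌ **No failover mechanism**; a primary outage needs manual intervention |
+| Redis HA | Not mentioned at all | ❌ **No Sentinel**; Redis is a single point of failure |
+| SSE Event Sourcing | Only the event types were designed | ❌ **All GenUI cards vanish after an F5 refresh** |
+| Kali integration | Passive cronjob | 🟡 Too low-level; should be promoted to a SecurityAgent |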
+
+---
+
+## Part 2: Complete integration map
+
+### 2.1 System neural network topology
+
+```
+External event inputs
+┌─────────────────────────────────────────────────────────────────┐
+│  Alertmanager :9093   Sentry :9000   SigNoz :3301               │
+│  GitHub Actions       Kali Scanner   K8s Events                 │
+└──────────┬──────────────┬──────────────┬────────────────────────┘
+           │              │              │
+           ▼              ▼              ▼
+┌─────────────────────────────────────────────────────────────────┐
+│                     AWOOOI API (K3s :32334)                     │
+│  ┌────────────┐  ┌──────────┐  ┌──────────┐  ┌──────────────┐   │
+│  │Alertmanager│  │Sentry    │  │SigNoz    │  │Kali (future) │   │
+│  │Webhook     │  │Webhook   │  │Webhook   │  │Webhook       │   │
+│  └─────┬──────┘  └────┬─────┘  └────┬─────┘  └──────┬───────┘   │
+│        │              │             │               │           │
+│        └──────────────┴─────────────┴───────────────┘           │
+│                              │                                  │
+│                              ▼                                  │
+│  ┌──────────────────────────────────────────────────────────┐   │
+│  │             Signal Worker (Redis XREADGROUP)             │   │
+│  │  ← consumes the awoooi:signals stream                    │   │
+│  │  → IncidentEngine (aggregation / GraphRAG / persistence) │   │
+│  └───────────────────────────┬──────────────────────────────┘   │
+│                              │                                  │
+│                              ▼                                  │
+│  ┌──────────────────────────────────────────────────────────┐   │
+│  │              OpenClaw (192.168.0.188:8089)               │   │
+│  │  Decision engine: RCA → Blast Radius → Risk → Action     │   │
+│  │  Tools: kubectl / SSH / Prometheus / SigNoz              │   │
+│  └──────┬─────────────────────┬─────────────────────────────┘   │
+│         │                     │                                 │
+│         ▼                     ▼                                 │
+│  Telegram Bot          Approval DB (PostgreSQL)                 │
+│  Commander alerts      Approval queue                           │
+└─────────────────────────────────────────────────────────────────┘
+                              │
+                Commander approves / rejects
+                              │
+                 ┌────────────▼────────────┐
+                 │   Auto-Repair Actions   │
+                 │ restart/scale/rollback  │
+                 └─────────────────────────┘
+```
+
+### 2.2 Frontend integration map
+
+```
+AWOOOI Web (K3s :32335 / Next.js)
+┌─────────────────────────────────────────────────────────────────┐
+│                                                                   │
+│  ┌── Dashboard (/) ─────────────────────────────────────────┐   │
+│  │  AutonomyIndexPanel ← GET /api/v1/stats/autonomy         │   │
+│  │  SystemPulseRow     ← GET /api/v1/stats/overview          │   │
+│  │  DecisionZone       ← SSE /api/v1/approvals/stream        │   │
+│  └──────────────────────────────────────────────────────────┘   │
+│                                                                   │
+│  ┌── Omni-Terminal ─────────────────────────────────────────┐   │
+│  │  Input Area         → POST /api/v1/terminal/command       │   │
+│  │  ThinkingStream     ← SSE /api/v1/terminal/stream/{id}    │   │
+│  │  GenUI Renderer     ← event: render_ui                    │   │
+│  │  Event Replay       ← Redis List (Last-Event-ID)          │   ← ❌ 缺失
+│  └──────────────────────────────────────────────────────────┘   │
+│                                                                   │
+│  ┌── Knowledge Base (/knowledge-base) ─────────────────────┐   │
+│  │  ❌ 空白頁面，缺後端 API                                  │   │
+│  └──────────────────────────────────────────────────────────┘   │
+│                                                                   │
+│  [深度調查跳脫入口]                                               │
+│  → SigNoz: http://192.168.0.188:3301 (新分頁)                   │
+│  → Grafana: http://192.168.0.188:3000 (新分頁)                  │
+│  → Sentry:  http://192.168.0.110:9000 (新分頁)                  │
+│                                                                   │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## 第三部分：六大缺口的系統整合修復方案
+
+### 缺口 1：Worker terminationGracePeriodSeconds 不足
+
+**問題根源**：`signal_worker.py` 的 `stop()` 等 5 秒，但 AI 分析任務最長 60 秒。K8s 預設 `terminationGracePeriodSeconds: 30`。兩個值都不夠，且彼此沒有對齊。
+
+**與整體整合的依賴**：
+- Worker 縮容影響：HPA 縮容時 → K8s 發 SIGTERM → `stop()` 被調用 → 5 秒後強殺
+- 上游依賴：Sentry 分析任務、Alertmanager 分析任務都在 Worker background task 中執行
+- 下游影響：PostgreSQL 寫入可能不完整（Incident 狀態 Dirty）
+
+**修復：三層數值對齊**
+
+```yaml
+# k8s/awoooi-prod/08-deployment-worker.yaml
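+# Set terminationGracePeriodSeconds explicitly (the K8s default is 30 s)
+spec:
+  template:
+    spec:
+      terminationGracePeriodSeconds: 90  # Must cover preStop (5 s) + Worker stop() timeout (75 s) + buffer
+      containers:
+        - name: awoooi-worker
+          lifecycle:
+            preStop:
+              exec:
+                # preStop runs before SIGTERM is delivered and counts against the grace period,
+                # so this sleep delays the start of stop() by 5 s
+                command: ["/bin/sh", "-c", "sleep 5"]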
+```
+
+```python
+# apps/api/src/workers/signal_worker.py
+# Align the stop() timeout with the K8s terminationGracePeriodSeconds
+
+async def stop(self) -> None:
+    if not self._running:
+        return
+    self._running = False
+    if self._task:
+        try:
+            # Raised from 5 s to 75 s: preStop (5 s) + 75 s leaves a 10 s buffer
+            # inside terminationGracePeriodSeconds=90
+            await asyncio.wait_for(self._task, timeout=75.0)
+        except asyncio.TimeoutError:
+            # asyncio.TimeoutError is only an alias of the builtin TimeoutError on Python 3.11+
+            logger.warning("signal_worker_stop_timeout_forcekill")
+            self._task.cancel()  # wait_for() has already requested cancellation; kept as a safeguard
+        except asyncio.CancelledError:
+            pass
+    logger.info("signal_worker_stopped")
+```
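The alignment between these values can be checked mechanically. The following is a minimal sketch, assuming the numbers used in this section (preStop 5 s, `stop()` timeout 75 s, grace period 90 s, longest task 60 s). `ShutdownBudget` is a hypothetical helper for illustration and is not part of the codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShutdownBudget:
    """Timing values that must stay aligned across K8s and the Worker."""
    grace_period_s: float   # terminationGracePeriodSeconds
    pre_stop_s: float       # preStop sleep, counted inside the grace period
    stop_timeout_s: float   # asyncio.wait_for timeout in stop()
    longest_task_s: float   # longest expected AI analysis task

    def buffer_s(self) -> float:
        """Seconds left before SIGKILL once preStop and stop() have both run."""
        return self.grace_period_s - self.pre_stop_s - self.stop_timeout_s

    def problems(self) -> list[str]:
        """Return human-readable misalignments; an empty list means aligned."""
        issues = []
        if self.stop_timeout_s < self.longest_task_s:
            issues.append("stop() timeout is shorter than the longest task")
        if self.buffer_s() <= 0:
            issues.append("grace period does not cover preStop + stop() timeout")
        return issues


# Values before the fix (K8s default grace period, 5 s stop timeout) and after it.
before = ShutdownBudget(grace_period_s=30, pre_stop_s=0, stop_timeout_s=5, longest_task_s=60)
after = ShutdownBudget(grace_period_s=90, pre_stop_s=5, stop_timeout_s=75, longest_task_s=60)
print(before.problems())
print(after.problems(), after.buffer_s())
```

A check like this could run in CI so that a later change to one value cannot silently break the other two.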
+
+**整合驗證指令**：
+```bash
+# Simulate a K8s scale-down and confirm the Worker shuts down gracefully.
+# Start streaming logs first (in a separate terminal), because the Pod is removed once it terminates:
+kubectl logs -f -n awoooi-prod -l app=awoooi-worker --tail=20
+kubectl scale deployment awoooi-worker -n awoooi-prod --replicas=0
+
+# Expected log lines:
+# shutdown_signal_received signal=15
+# signal_worker_shutting_down
+# signal_worker_shutdown_complete
+# (the whole sequence finishes within 90 seconds)
+```
+
+---
+
+### 缺口 2：ESLint i18n 強制攔截（eslint-plugin-i18next）
+
+**問題根源**：`.eslintrc.js:20-22` 只有 TODO 注解，未安裝 `eslint-plugin-i18next`。
+
+**與整體整合的依賴**：
+- 這是 i18n 清零後的**防護層**：清零是清過去的債，ESLint 是防未來的債
+- 需在 `pnpm build` 和 `pnpm lint` CI 步驟中阻擋
+- 影響：所有前端開發流程（AI 生成代碼也必須通過）
+
+**修復：安裝並啟用 Plugin**
+
+```bash
+# Step 1: 安裝
+cd apps/web
+pnpm add -D eslint-plugin-i18next
+```
+
+```javascript
+// apps/web/.eslintrc.js 修改後
+module.exports = {
+  extends: [
+    '@awoooi/eslint-config/react',
+    'next/core-web-vitals',
+    'plugin:i18next/recommended',   // ← 新增
+  ],
+  plugins: ['i18next'],              // ← 新增
+  parserOptions: {
+    project: './tsconfig.json',
+    tsconfigRootDir: __dirname,
+  },
+  rules: {
+    '@next/next/no-html-link-for-pages': 'off',
+    'no-console': 'off',
+
+    // 🚨 i18n 鐵律：所有 JSX 文字必須透過 t() 函式
+    // 違反此規則 = PR 阻擋（error 級別）
+    'i18next/no-literal-string': ['error', {
+      markupOnly: true,     // 只攔截 JSX 文字節點（非 JS 字串）
+      ignoreAttribute: [    // 技術屬性不攔截
+        'className', 'id', 'href', 'src', 'type', 'key',
+        'data-testid', 'aria-label', 'placeholder'
+      ],
+    }],
+
+    '@typescript-eslint/no-explicit-any': 'warn',
+    '@typescript-eslint/no-unused-vars': ['warn', { argsIgnorePattern: '^_', varsIgnorePattern: '^_' }],
+    '@typescript-eslint/consistent-type-imports': 'warn',
+    'no-constant-condition': 'warn',
+  },
+  ignorePatterns: [
+    'node_modules', '.next', 'out', 'dist', 'test-results',
+    '*.config.js', '*.config.ts',
+  ],
+}
+```
+
+**與 CI 整合（必須加入 cd.yaml）**：
+
+```yaml
+# .github/workflows/cd.yaml 在 Build 之前加入 Lint 步驟
+- name: 🔍 ESLint i18n 強制檢查
+  run: |
+    cd apps/web
+    pnpm lint
+  # 失敗 = 有硬編碼字串 = 直接阻擋部署
+```
+
+**⚠️ 重要**：第一次啟用 `eslint-plugin-i18next` 後，**現有的 40+ 違規會立刻全部報錯**。因此必須先完成 i18n 清零，再啟用此 Rule。**正確順序**：
+1. i18n 清零（一次性修復 40+ 違規）
+2. 安裝 eslint-plugin-i18next（啟用防護）
+3. 加入 CI Lint 步驟
+
+---
+
+### 缺口 3：Visual Regression 跨平台渲染問題
+
+**問題根源**：Mac（M1/M2/M3）vs GitHub Actions（Linux Ubuntu）的字體渲染引擎不同（CoreText vs FreeType），導致截圖像素不吻合。
+
+**與整體整合的依賴**：
+- Baseline 快照必須統一來源（CI Docker 環境）
+- 每次更新 Baseline 必須是可審計的（透過 PR），不能在本機靜默更新
+
+**修復：Docker 強制基準線更新 + threshold 調整**
+
+```json
+// apps/web/package.json 新增 scripts
+{
+  "scripts": {
+    "test:visual:update": "docker run --rm -v $(pwd):/work -w /work -p 3000:3000 mcr.microsoft.com/playwright:v1.44.0-jammy pnpm exec playwright test --update-snapshots --project=chromium --grep @visual",
+    "test:visual": "pnpm exec playwright test --project=chromium --grep @visual"
+  }
+}
+```
+
+```typescript
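+// apps/web/playwright.config.ts: screenshot comparison settings
+import { defineConfig } from '@playwright/test';
+
+export default defineConfig({
+  expect: {
+    toHaveScreenshot: {
+      threshold: 0.05,          // Per-pixel color tolerance (YIQ, 0-1); stricter than Playwright's default of 0.2
+      maxDiffPixelRatio: 0.05,  // Allow up to 5% of pixels to differ (absorbs small rendering differences)
+    },
+  },
+  use: {
+    // Fix the viewport so screen size and DPI differences do not affect screenshots
+    viewport: { width: 1280, height: 720 },
+    deviceScaleFactor: 1,    // Force 1x to avoid Retina differences
+  },
+});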
+```
+
+**強制規範（加入 .awoooi-agent-rules.md 條款）**：
+
+```markdown
+## 條款 21：Visual Regression Baseline 更新規範
+
+🚨 絕對禁止在本機 Mac 環境執行 `--update-snapshots`
+✅ 更新 Baseline 必須透過以下流程：
+
+1. 在本機執行：`pnpm test:visual:update`（Docker 環境）
+2. Docker 生成的 .png 截圖自動存入 tests/e2e/__snapshots__/
+3. 提 PR，標注 📸 VISUAL_UPDATE
+4. 統帥視覺審核截圖後方可合併
+```
+
+---
+
+### 缺口 4：PostgreSQL HA（Patroni / CloudNativePG）
+
+**問題根源**：PostgreSQL 在 .188 上是單一 Docker 容器，K3s 的 Datastore 也依賴它（ADR-033）。資料庫掛掉 = K3s 控制面 + AWOOOI 資料同時失效。
+
+**與整體整合的依賴**：
+- PostgreSQL 是 AWOOOI 的 Episodic Memory（Incidents、Approvals、Audit Logs）
+- PostgreSQL 也是 K3s 的 HA Datastore（120/121 節點的 K3s 元數據）
+- Auto-Repair 對 PostgreSQL 執行 `docker restart` 是**危險的**（可能 Dirty Page）
+
+**修復路線圖（三階段）**：
+
+```
+Phase DB-A（1週，低風險）：
+  監控補強
+  ├── 啟用 PG slow query log (log_min_duration_statement = 2000ms)
+  ├── 加入 pg_stat_statements extension 並接入 Prometheus
+  └── 關閉 auto_repair.postgres.restart（防止 Dirty Page）
+
+Phase DB-B（1月，中風險）：
+  備份策略
+  ├── Velero + PostgreSQL Volume Snapshot（已有 Velero，需加 Volume 備份）
+  └── 確認 WAL archiving 到 MinIO（WAL-E/WAL-G）
+
+Phase DB-C（Q2，需評估）：
+  HA 策略評估
+  ├── 方案 A：CloudNativePG（K8s 原生 PostgreSQL Operator）
+  │     → 在 K3s 中部署 CloudNativePG，主從自動切換
+  ├── 方案 B：Patroni + keepalived（VM 層 HA）
+  │     → 在 .188 和備用機上部署 Patroni
+  └── 方案 C：Citus（分片，過於複雜，暫不考慮）
+
+  推薦：方案 A (CloudNativePG)，與 K3s 最整合
+```
+
+**立即可執行的防護措施**：
+
+```yaml
+# k8s/awoooi-prod/manual-remediation/postgres-recovery.yaml
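+# PostgreSQL emergency recovery playbook (manual operation)
+
+# Event: PostgreSQL is down
+# Actions:
+# 1. OpenClaw raises an alert + Telegram notification
+# 2. Alertmanager generates a CRITICAL Approval (no auto-repair)
+# 3. After the Commander approves, run the following commands: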
+#    ssh root@192.168.0.188 'docker restart awoooi-postgres'
+#    kubectl rollout restart deployment/awoooi-api -n awoooi-prod
+#    kubectl rollout restart deployment/awoooi-worker -n awoooi-prod
+```
+
+---
+
+### 缺口 5：Redis HA（Sentinel 模式）
+
+**問題根源**：Redis 在 .188 上是單一容器（port 6380），無備援。Redis 同時承載：
+- Working Memory（Incident 聚合狀態）
+- SSE Terminal Event Store（未來的 Event Source）
+- Sentry Dedup Cache（10分鐘去重 TTL）
+- Anomaly Counter（ADR-037 核心數據）
+
+**與整體整合的依賴**：
+- Redis 掛掉 = Signal Worker 無法消費事件 = 整個告警鏈路中斷
+- AOF 啟用對性能有影響，需要評估
+
+**修復路線圖**：
+
+```
+Phase Redis-A（立即，0風險）：
+  確認 AOF 配置
+  ├── ssh root@192.168.0.188 'docker exec awoooi-redis redis-cli CONFIG GET appendonly'
+  └── 確認 appendonly yes（否則重啟後 Working Memory 歸零）
+
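+Phase Redis-B (1 month, medium engineering effort):
+  Redis Sentinel deployment (Sentinel + Replica on .110)
+  ├── .188: Master (existing)
+  ├── .110: Replica + Sentinel
+  └── OpenClaw discovers the current Master through Sentinel
+
+  Configuration change:
+  # AWOOOI API connects through Sentinel. Note: redis-py's from_url() only accepts
+  # redis://, rediss:// and unix://, so a redis-sentinel:// value has to be parsed by
+  # the application and handed to a Sentinel-aware client.
+  REDIS_URL=redis-sentinel://sentinel1:26379/awoooi-master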
+```
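A `redis-sentinel://` value is not something redis-py's `from_url()` understands, so the application would need to split it and connect through a Sentinel client. The sketch below assumes the API uses redis-py's asyncio client and that the URL follows the `redis-sentinel://host:port[,host:port]/service` shape shown above. `parse_sentinel_url` and `connect_master` are hypothetical helper names.

```python
from urllib.parse import urlparse


def parse_sentinel_url(url: str) -> tuple[list[tuple[str, int]], str]:
    """Split redis-sentinel://host1:port1,host2:port2/service into its parts."""
    parsed = urlparse(url)
    if parsed.scheme != "redis-sentinel":
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    service_name = parsed.path.lstrip("/")
    if not service_name:
        raise ValueError("missing Sentinel service name in URL path")
    sentinels = []
    for entry in parsed.netloc.split(","):
        host, _, port = entry.partition(":")
        sentinels.append((host, int(port or 26379)))  # 26379 is the default Sentinel port
    return sentinels, service_name


def connect_master(url: str):
    """Return an async client bound to whichever node Sentinel reports as master."""
    # Imported lazily so the parser above works without redis-py installed.
    from redis.asyncio.sentinel import Sentinel

    sentinels, service_name = parse_sentinel_url(url)
    sentinel = Sentinel(sentinels, socket_timeout=0.5)
    return sentinel.master_for(service_name, decode_responses=True)


print(parse_sentinel_url("redis-sentinel://sentinel1:26379/awoooi-master"))
```

Keeping the parsing separate from the connection makes the URL handling testable without a running Sentinel.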
+
+**立即可執行的防護措施**：
+
+```bash
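+# Check the Redis AOF status
+ssh root@192.168.0.188 'docker exec awoooi-redis redis-cli CONFIG GET appendonly; \
+  docker exec awoooi-redis redis-cli CONFIG GET appendfsync; \
+  docker exec awoooi-redis redis-cli INFO persistence | grep aof'
+
+# If appendonly = no, enable it right away. CONFIG SET takes effect immediately, no restart needed.
+# The setting is lost on the next restart unless it is also persisted
+# (CONFIG REWRITE when Redis runs from a config file, or `--appendonly yes` in the container command).
+ssh root@192.168.0.188 'docker exec awoooi-redis redis-cli CONFIG SET appendonly yes'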
+```
+
+---
+
+### 缺口 6：SSE Event Sourcing（Terminal 狀態不丟失）
+
+**問題根源**：Omni-Terminal 的 SSE 串流是無狀態的，F5 刷新後所有 `render_ui` GenUI 卡片消失。
+
+**與整體整合的依賴**：
+- 這是 Agentic Workspace 用戶體驗的底層設施
+- 依賴 Redis List 作為 Event Store（如果 Redis 無 AOF，重啟後也丟）
+- 必須與 SSE 三種事件類型設計同步建立
+
+**修復：三層機制**
+
+```
+Layer 1: 後端 Event Store（Redis List）
+  terminal.py → 每個 SSE 事件同步寫入 Redis List
+  Key: terminal:events:{command_id}
+  TTL: 3600 秒（1小時）
+
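+Layer 2: Frontend reconnect (Last-Event-ID)
+  useTerminalSSE → EventSource sends Last-Event-ID automatically only when it reconnects a dropped connection;
+  after an F5 refresh the client has to pass the last seen ID itself (for example as a query parameter)
+  Backend: fetch the missed events from Redis → Replay → resume the live stream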
+
+Layer 3: 本地 Zustand 持久化
+  useTerminalStore → 用 zustand/middleware/persist 持久化到 sessionStorage
+  F5 刷新 → 從 sessionStorage 恢復 GenUI 卡片（UI 層快速恢復）
+  同時 → SSE 重連補齊 Server 端新事件
+```
+
+**後端實作關鍵代碼**：
+
+```python
+# apps/api/src/api/v1/terminal.py: Event Store additions
+
+import json
+
+from fastapi.responses import StreamingResponse
+
+from src.core.redis_client import get_redis
+
+async def stream_with_persistence(command_id: str, event_type: str, data: dict):
+    """
+    Emit an SSE event and write it to the Redis Event Store in the same step,
+    so the stream can be replayed after an F5 refresh.
+    """
+    redis = get_redis()
+
+    event_payload = {
+        "type": event_type,
+        "data": data,
+        "timestamp": now_taipei_iso()
+    }
+
+    # Append to the Redis List. RPUSH returns the new list length,
+    # which serves as the 1-based event ID (no separate LLEN call needed).
+    key = f"terminal:events:{command_id}"
+    event_id = await redis.rpush(key, json.dumps(event_payload))
+    await redis.expire(key, 3600)  # Cleaned up automatically after 1 hour
+
+    # Return the SSE-formatted string
+    return f"id: {event_id}\nevent: {event_type}\ndata: {json.dumps(data)}\n\n"
+
+
+@router.get("/stream/{command_id}/replay")
+async def replay_terminal_events(command_id: str, last_event_id: int = 0):
+    """
+    Replay the events after the given ID (used when reconnecting after F5).
+    """
+    redis = get_redis()
+    key = f"terminal:events:{command_id}"
+
+    # Event IDs are 1-based, so list index last_event_id is the first missed event
+    events = await redis.lrange(key, last_event_id, -1)
+
+    async def generate():
+        for i, event_json in enumerate(events):
+            event = json.loads(event_json)
+            yield f"id: {last_event_id + i + 1}\nevent: {event['type']}\ndata: {json.dumps(event['data'])}\n\n"
+
+    return StreamingResponse(generate(), media_type="text/event-stream")
+```
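The ID arithmetic behind replay is easy to get wrong by one, so it may help to see it in isolation. This is a minimal in-memory sketch, not the Redis implementation: `InMemoryEventStore` is a hypothetical stand-in whose `append()` mirrors RPUSH (returning the new length as a 1-based ID) and whose `after()` mirrors `LRANGE key last_event_id -1`. The `thinking` and `done` event names are assumptions; only `render_ui` comes from the design above.

```python
import json


class InMemoryEventStore:
    """Stand-in for the Redis List: append() mirrors RPUSH, after() mirrors LRANGE."""

    def __init__(self) -> None:
        self._events: dict[str, list[str]] = {}

    def append(self, command_id: str, event_type: str, data: dict) -> int:
        """Store one event and return its 1-based ID (the new list length)."""
        events = self._events.setdefault(command_id, [])
        events.append(json.dumps({"type": event_type, "data": data}))
        return len(events)

    def after(self, command_id: str, last_event_id: int) -> list[tuple[int, dict]]:
        """Return (id, event) pairs with id > last_event_id, oldest first."""
        events = self._events.get(command_id, [])
        return [
            (last_event_id + offset + 1, json.loads(raw))
            for offset, raw in enumerate(events[last_event_id:])
        ]


def to_sse(event_id: int, event: dict) -> str:
    """Format one stored event as an SSE frame."""
    return f"id: {event_id}\nevent: {event['type']}\ndata: {json.dumps(event['data'])}\n\n"


store = InMemoryEventStore()
store.append("cmd-1", "thinking", {"text": "analysing"})
store.append("cmd-1", "render_ui", {"card": "approval"})
last_id = store.append("cmd-1", "done", {})

# A client that saw event 1 before F5 asks for everything after it.
missed = store.after("cmd-1", last_event_id=1)
print([event_id for event_id, _ in missed])
print(to_sse(*missed[0]), end="")
```

Because the ID equals the list length at write time, the last seen ID doubles as the 0-based index of the first missed event, which is why the replay handler can pass it straight to `LRANGE`.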
+
+---
+
+## 第四部分：整合依賴關係圖（執行順序）
+
+```
+┌─────────────────────────────────────────────────────────────────────┐
+│                    整合執行優先序                                     │
+├──────────────────────────────────────────────────────────────────────┤
+│                                                                       │
+│  🔴 P0 立即執行（本週，阻塞後續工作）                                  │
+│  ┌─────────────────────────────────────────────────────────┐         │
+│  │  1. 確認 Redis AOF 狀態（5min）                          │         │
+│  │  2. Worker terminationGracePeriodSeconds 修正（1h）      │         │
+│  │  3. i18n 清零（4h）← 必須先於 ESLint Plugin 安裝        │         │
+│  │  4. ESLint i18n Plugin 安裝並啟用（1h）                  │         │
+│  └─────────────────────────────────────────────────────────┘         │
+│                              ↓（完成後解鎖）                           │
+│                                                                       │
+│  🟠 P1 短期（2-3週）                                                  │
+│  ┌─────────────────────────────────────────────────────────┐         │
+│  │  5. Sentry Comment Token 配置（2h）                      │         │
+│  │  6. SigNoz 告警規則部署到 .188（2h）                      │         │
+│  │  7. Worker HPA YAML 部署（30min）                        │         │
+│  │  8. E2E CI Weekly 排程（30min）                          │         │
+│  │  9. Visual Regression Docker 基準線建立（2h）            │         │
+│  └─────────────────────────────────────────────────────────┘         │
+│                              ↓（完成後解鎖）                           │
+│                                                                       │
+│  🟡 P2 中期（Month 2）                                                │
+│  ┌─────────────────────────────────────────────────────────┐         │
+│  │  10. Omni-Terminal SSE Event Sourcing（8h）              │         │
+│  │  11. Storybook 10 核心組件（8h）                         │         │
+│  │  12. Nexus AI 自治率 UI（8h）                            │         │
+│  │  13. FinOps Dashboard UI（8h）                          │         │
+│  │  14. Redis Sentinel 部署（1天）                          │         │
+│  └─────────────────────────────────────────────────────────┘         │
+│                              ↓（完成後解鎖）                           │
+│                                                                       │
+│  ⚪ P3 長期（Q2-Q3）                                                   │
+│  ┌─────────────────────────────────────────────────────────┐         │
+│  │  15. CloudNativePG 評估與導入（2天+）                    │         │
+│  │  16. Kali SecurityAgent（MCP Tool 化）                   │         │
+│  │  17. Knowledge Base 後端全建                             │         │
+│  │  18. Phase 4 視覺靈魂注入                               │         │
+│  └─────────────────────────────────────────────────────────┘         │
+└─────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## 第五部分：整合風險矩陣
+
+| 風險 | 可能性 | 影響 | 緩解措施 |
+|------|--------|------|---------|
+| **ESLint 啟用後大量報錯** | 100%（有 40+ 違規） | CI 完全阻塞 | 先清零再啟用，正確順序執行 |
+| **Worker timeout 修改引發 Pod 啟動異常** | 低 | 服務中斷 | 先在 Dev namespace 測試 |
+| **Redis AOF 啟用影響性能** | 中 | 延遲微增 | 使用 `appendfsync everysec`（非 `always`）|
+| **Visual Regression Docker 第一次 Baseline** | — | 需要 1-2h 產生基準線 | 排在非尖峰時段執行 |
+| **PostgreSQL 無 HA 期間主庫故障** | 低 | 完全停機 | 備份策略（Velero）+ Playbook 就位 |
+| **SSE Event Sourcing Redis 依賴** | — | Redis 故障時 Event 丟失 | 先解決 Redis AOF，再實作 Event Sourcing |
+
+---
+
+## 第六部分：監控機制與前端的整合設計原則
+
+（承接 `MONITORING_ARCHITECTURE_DEEP_DIVE.md` 的「三義分離原則」）
+
+**整合設計的核心約束**：
+
+```
+1. 監控數據 → 後台靜默消化，不直接呈現給統帥
+   └─ 99%：Prometheus/SigNoz/Sentry 原始數據
+   └─ 1%：AI 無法自動處理 → 浮現為 ApprovalCard
+
+2. 前端不直接查詢 Prometheus/SigNoz
+   └─ 所有監控數據透過 AWOOOI API 統一封裝
+   └─ API 層：/api/v1/stats/overview, /api/v1/slo, /api/v1/finops
+
+3. 深度調查只能透過「智能跳脫」（新分頁）
+   └─ GenUI 卡片提供 ExternalLinks 按鈕
+   └─ 絕對禁止 iframe 嵌入 Grafana/SigNoz
+
+4. AI 自治率指數是前端唯一的「監控摘要」入口
+   └─ Dashboard (/) 最頂部：AutonomyIndexPanel
+   └─ 用一個數字替代所有圖表
+```
+
+---
+
+*「整合不是把所有工具接在一起，而是讓所有工具服務同一個大腦。」* 🦞
--- a/docs/proposals/MASTER_EXECUTION_SCHEDULE.md
+++ b/docs/proposals/MASTER_EXECUTION_SCHEDULE.md
@@ -0,0 +1,402 @@
+# AWOOOI 最終執行主排程
+# Master Execution Schedule — 統帥審核版
+
+> **文件類型**: 最終執行授權書
+> **建立**: 2026-03-29 14:05 (台北)
+> **狀態**: 🔴 待統帥審核與授權
+> **本文件為所有 RunBook 和 ADR 的執行定序總綱**
+
+---
+
+## 第一章：最終確認的缺口清單（代碼確認級別）
+
+### 1.1 已確認真實存在的問題（共 16 項）
+
+| # | 問題 | 代碼位置 | 嚴重度 | 波次 |
+|---|------|---------|--------|------|
+| 1 | Worker stop() timeout 只有 5 秒 | `signal_worker.py:147` | 🔴 P0 | Wave 1 |
+| 2 | XCLAIM / Active Sweeper 完全缺失 | `signal_worker.py` 全文 | 🔴 P0 | Wave 1 |
+| 3 | StatefulSet 自動修復無硬阻斷 | `auto_repair_service.py:159` | 🔴 P0 | Wave 1 |
+| 4 | AnomalyCounter Redis 無 try/except | `anomaly_counter.py:147` | 🔴 P0 | Wave 1 |
+| 5 | OpenClaw 無 Circuit Breaker | `sentry_webhook.py:289` | 🔴 P0 | Wave 1 |
+| 6 | OpenClaw 無 Concurrency Semaphore | `sentry_webhook.py` | 🔴 P0 | Wave 1 |
+| 7 | Global Incident Debounce 缺失 | `incident_service.py` | 🔴 P0 | Wave 1 |
+| 8 | Global Auto-Repair Cooldown 缺失 | `auto_repair_service.py` | 🔴 P0 | Wave 1 |
+| 9 | terminationGracePeriodSeconds 未設定 | `08-deployment-worker.yaml` | 🔴 P0 | Wave 1 |
+| 10 | ESLint i18n Plugin 只有 TODO | `.eslintrc.js:20-22` | 🟠 P1 | Wave 3 |
+| 11 | i18n 40+ 違規 | `TECHNICAL_DEBT_PHASE2.md` | 🟠 P1 | Wave 3 |
+| 12 | Playwright ignoreHTTPSErrors 未設定 | `playwright.config.ts` | 🟠 P1 | Wave 4 |
+| 13 | Visual Baseline 無 Docker 規範 | (設計空白) | 🟠 P1 | Wave 4 |
+| 14 | E2E 無 Auth Bypass（global.setup.ts 缺失）| `tests/e2e/` | 🟠 P1 | Wave 4 |
+| 15 | SSE Event Sourcing 尚未實作 | `terminal.py` | 🟡 P2 | Wave 7 |
+| 16 | Redis AOF/eviction 未確認 | `.188 主機` | 🔴 P0 | Wave 0 |
+
+### 1.2 已確認不存在（稽核報告誤判）
+
+| 項目 | 誤判內容 | 真實現況 |
+|------|---------|---------|
+| Worker SIGTERM 缺失 | 報告說「需要實作」| ✅ 已實作（line 450-455）|
+| ADR-035 備援路徑 | 說「可觀測循環依賴」| ✅ Layer 3 繞過 AWOOOI 直接打 Telegram |
+| Sentry Comment 缺失 | 說「尚未整合」| ✅ LOGBOOK 確認 Wave A.4 已整合 |
+| CD Secret 注入 | 說「缺失」| ✅ ADR-035 確認 CD 已有自動注入 |
+
+---
+
+## 第二章：ADR 評估
+
+### 2.1 本次新建 ADR（已建立）
+
+| ADR | 標題 | 說明 |
+|-----|------|------|
+| [ADR-038](../adr/ADR-038-openclaw-concurrency-governance.md) | OpenClaw 推理引擎併發治理 | Semaphore + Circuit Breaker 雙層保護 |
+| [ADR-039](../adr/ADR-039-global-autorepair-governance.md) | 全域自動修復熔斷機制 | Global Cooldown + StatefulSet 黑名單 |
+
+### 2.2 現有 ADR 需更新（次要）
+
+| ADR | 需追加內容 | 優先度 |
+|-----|---------|--------|
+| ADR-020 E2E Verification | 加入 global.setup.ts 和 Auth Bypass 規範 | 🟡 Wave 4 前 |
+| ADR-028 Failure Auto-Repair | 加入對 ADR-039 的引用說明 | 🟡 Wave 1 後 |
+
+### 2.3 不需要新建 ADR 的項目
+
+| 項目 | 理由 |
+|------|------|
+| XCLAIM / Active Sweeper | 屬於 ADR-037 Signal Worker 實作細節 |
+| terminationGracePeriodSeconds | 屬於 K8s 操作規範，不是架構決策 |
+| ESLint i18n | 屬於 ADR-002 設計系統的工具鏈細節 |
+
+---
+
+## 第三章：Skills 更新評估
+
+### 3.1 必須更新的 Skills
+
+#### Skill 02: leWOOOgo Backend Core（新增章節）
+
+```markdown
+## 🛡️ OpenClaw 推理保護模式 (ADR-038, 2026-03-29)
+
+### 鐵律：所有 OpenClaw 呼叫必須雙層保護
+
+from src.core.circuit_breaker import get_openclaw_guard
+
+async def call_openclaw_analyzer(...):
+    guard = get_openclaw_guard()
+    if guard.is_circuit_open():   # Layer 1: 斷路
+        return None
+    async with guard.semaphore:   # Layer 2: 限流（最多 3 並發）
+        # ... httpx 請求 ...
+
+## 🔴 全域修復冷卻 (ADR-039, 2026-03-29)
+
+### 鐵律：任何自動修復前必須呼叫 check_global_repair_cooldown()
+
+可以修復的服務: 僅無狀態服務（awoooi-api, awoooi-web, awoooi-worker）
+絕對禁止修復: postgres, redis, clickhouse, minio, etcd
+```
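The two-layer protection named in ADR-038 can be sketched as follows. This is an illustration under stated assumptions, not the contents of `circuit_breaker.py`: the class name `OpenClawGuard`, the `failure_threshold` and `reset_after_s` parameters, and the `guarded_call` wrapper are all hypothetical. Only `is_circuit_open()`, the `semaphore` attribute and the limit of 3 concurrent calls come from the text above.

```python
import asyncio
import time
from typing import Optional


class OpenClawGuard:
    """Circuit breaker plus concurrency limit for calls to a slow backend."""

    def __init__(self, max_concurrent: int = 3, failure_threshold: int = 3,
                 reset_after_s: float = 30.0) -> None:
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self._failure_threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_circuit_open(self) -> bool:
        """True while calls should be skipped; closes again after the reset window."""
        if self._opened_at is None:
            return False
        if time.monotonic() - self._opened_at >= self._reset_after_s:
            self._opened_at = None
            self._failures = 0
            return False
        return True

    def record_success(self) -> None:
        self._failures = 0

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._failure_threshold:
            self._opened_at = time.monotonic()


async def guarded_call(guard: OpenClawGuard, request):
    """Run `request()` under both layers; return None when the circuit is open."""
    if guard.is_circuit_open():          # Layer 1: circuit breaker
        return None
    async with guard.semaphore:          # Layer 2: concurrency limit
        try:
            result = await request()
        except Exception:
            guard.record_failure()
            raise
        guard.record_success()
        return result


async def main() -> list:
    # The guard is created inside the running loop so the semaphore binds to it.
    guard = OpenClawGuard(failure_threshold=2)

    async def failing():
        raise RuntimeError("backend down")

    outcomes = []
    for _ in range(3):
        try:
            outcomes.append(await guarded_call(guard, failing))
        except RuntimeError:
            outcomes.append("error")
    return outcomes


print(asyncio.run(main()))
```

In the demo the first two calls fail and trip the breaker, so the third call is skipped without reaching the backend. Checking the breaker before acquiring the semaphore means a tripped circuit never holds a concurrency slot.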
+
+#### Skill 05: AWOOOI SRE & QA（新增章節）
+
+```markdown
+## 🎭 E2E Auth Bypass 鐵律 (2026-03-29)
+
+### 必須在 tests/e2e/global.setup.ts 實作登入態
+
+// global.setup.ts
+import { chromium } from '@playwright/test';
+
+export default async function globalSetup() {
+  const browser = await chromium.launch();
+  // E2E_BASE_URL is an assumed env var name; relative request URLs need a baseURL
+  const context = await browser.newContext({ baseURL: process.env.E2E_BASE_URL });
+  await context.request.post('/api/v1/auth/login', { data: { username: 'demo', password: process.env.E2E_PASSWORD } });
+  await context.storageState({ path: 'e2e-auth-state.json' });
+  await browser.close();
+}
+
+### 絕對禁止在 Mac 本機產生 Visual Baseline
+使用: pnpm test:visual:update（Docker 環境）
+禁止: pnpm exec playwright test --update-snapshots（本機）
+```
+
+> **注意**：Skills 更新將在 Wave 1 代碼合併後，由後續 Session 執行。本次記錄評估結論即可。
+
+---
+
+## 第四章：模組化合規驗證
+
+### 4.1 Wave 1 新代碼的合規性
+
+| 新代碼 | 層次 | 依賴 | 介面 | 合規狀態 |
+|--------|------|------|------|---------|
+| `core/circuit_breaker.py` | core 基礎設施 | stdlib + structlog | 直接類（無需 Protocol）| ✅ 合規 |
+| `services/global_repair_cooldown.py` | Service 層 | Redis（透過 get_redis()）| 函數式 API | ✅ 合規 |
+| `signal_worker.py` XCLAIM 補充 | Worker（現有）| Redis Stream | 無新依賴 | ✅ 合規 |
+| `anomaly_counter.py` Degradation | Service 層 | 無新依賴 | 現有 Protocol | ✅ 合規 |
+
+### 4.2 違規預防
+
+| 規則 | 驗證方式 |
+|------|---------|
+| Router 不直接存取 Redis | Code Review：所有 Webhook router 只呼叫 Service |
+| Semaphore 在 core/ | `circuit_breaker.py` 放在 `src/core/`，非 `src/services/` |
+| Singleton 透過工廠函數 | `get_openclaw_guard()`、`get_global_repair_cooldown()` |
+
+---
+
+## 第五章：整合工作衝突分析
+
+### 5.1 必須串行的依賴關係
+
+```
+XCLAIM 代碼合併 → 才能部署 Worker HPA
+i18n 清零完成 → 才能啟用 ESLint i18n Plugin（error 模式）
+release/v1.x 建立 → 才能宣佈 Frontend Feature Freeze
+global.setup.ts 實作 → 才能有效執行 E2E Visual 測試
+Redis AOF 確認 → 才能實作 SSE Event Sourcing（依賴 Redis 持久化）
+```
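The serial dependencies listed above form a small directed graph, and a valid execution order can be derived rather than maintained by hand. The sketch below uses Python's standard `graphlib` (3.9+); the step labels are paraphrases of the five lines above, written for illustration.

```python
from graphlib import TopologicalSorter

# Each key can only start after every item in its value set has finished.
depends_on = {
    "deploy Worker HPA": {"merge XCLAIM code"},
    "enable ESLint i18n plugin (error mode)": {"finish i18n cleanup"},
    "announce Frontend Feature Freeze": {"create release/v1.x"},
    "run E2E visual tests": {"implement global.setup.ts"},
    "implement SSE Event Sourcing": {"confirm Redis AOF"},
}

# static_order() yields every prerequisite before the step that needs it
# and raises graphlib.CycleError if the dependencies contradict each other.
order = list(TopologicalSorter(depends_on).static_order())
for step in order:
    print(step)
```

If a new dependency is added later (for example between two waves), rerunning the sort shows immediately whether the schedule is still consistent.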
+
+### 5.2 可以並行的工作
+
+```
+Wave 1 後端代碼修改（5 項）可以並行開發，捆綁一個 PR
+Sentry Token 配置 與 SigNoz 告警規則部署 可以並行
+Storybook 建置 與 Omni-Terminal SSE Event Sourcing 可以並行（不同分支）
+```
+
+### 5.3 確認無衝突的部分
+
+| 項目 | 衝突評估 | 結論 |
+|------|---------|------|
+| terminationGracePeriodSeconds | 需與 XCLAIM 同一 PR（相互依賴）| ✅ Wave 1 捆綁 |
+| Global Cooldown + Global Debounce | 同為 Wave 1，無衝突 | ✅ 同一 PR |
+| ADR-038 + ADR-039 同時生效 | 需確保 auto_repair_service 引用兩者 | ✅ 已在 ADR 中指定 |
+
+---
+
+## 第六章：詳細實施步驟
+
+### Wave 0: 即時止血（統帥手動確認，當天）
+
+```bash
+# 0.1 確認 Redis AOF 狀態（5 分鐘）
+ssh root@192.168.0.188 \
+  'docker exec awoooi-redis redis-cli CONFIG GET appendonly; \
+   docker exec awoooi-redis redis-cli CONFIG GET maxmemory-policy'
+# 預期：appendonly=yes, policy=volatile-ttl 或 noeviction
+# 若非 yes 則立即啟用：docker exec awoooi-redis redis-cli CONFIG SET appendonly yes
+
+# 0.2 確認 Alertmanager 備援 Telegram（10 分鐘）
+ssh root@192.168.0.188 'docker exec alertmanager cat /etc/alertmanager/alertmanager.yml' | grep -A 5 receiver
+
+# 0.3 建立 release/v1.x 穩定分支（5 分鐘）
+git checkout main && git pull
+git checkout -b release/v1.x && git push origin release/v1.x
+# 在 GitHub 設定 Protected Branch
+
+# 0.4 確認 SENTRY_AUTH_TOKEN（3 分鐘）
+kubectl get secret awoooi-secrets -n awoooi-prod \
+  -o jsonpath='{.data.SENTRY_AUTH_TOKEN}' | base64 -d | wc -c
+# > 0 = Phase D 已完成；= 0 = 需執行 RunBook Phase D
+```
+
+### Wave 1: 底層安全網（代碼 PR，9.5h，原子性）
+
+> **所有以下修改必須在同一個 PR 中，不可拆分！**
+
+```
+PR 標題: feat(safety): Wave 1 底層安全網（ADR-038 + ADR-039）
+
+檔案清單（9 個）：
+  新建：
+    apps/api/src/core/circuit_breaker.py         (ADR-038)
+    apps/api/src/services/global_repair_cooldown.py (ADR-039)
+  
+  修改：
+    apps/api/src/services/anomaly_counter.py
+      → record_anomaly() 加 try/except + graceful degrade
+      → 提取 _record_anomaly_impl()
+    
+    apps/api/src/workers/signal_worker.py
+      → start() 加 _reclaim_task = asyncio.create_task(_reclaim_loop())
+      → 新增 _claim_orphaned_tasks()（XCLAIM 實作）
+      → 新增 _reclaim_loop()（Active Sweeper，每 5 分鐘）
+      → stop() timeout 從 5s 改為 75s，同步取消 _reclaim_task
+    
+    apps/api/src/services/auto_repair_service.py
+      → evaluate_auto_repair() 開頭加 check_global_repair_cooldown()
+      → execute_auto_repair() 成功後加 record_global_repair_action()
+      → 常數區加 STATEFUL_SERVICE_BLACKLIST
+    
+    apps/api/src/api/v1/sentry_webhook.py
+      → call_openclaw_analyzer() 加雙層保護（Circuit Breaker + Semaphore）
+    
+    apps/api/src/api/v1/signoz_webhook.py
+      → 同上
+    
+    apps/api/src/services/incident_service.py
+      → process_signal() 前加 _check_global_incident_storm()
+    
+    k8s/awoooi-prod/08-deployment-worker.yaml
+      → 加入 terminationGracePeriodSeconds: 90
+      → 加入 preStop: sleep 5
+```
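The global cooldown and stateful-service blacklist from the file list above can be illustrated with a sliding-window limiter. This is a minimal sketch under assumptions: the real `global_repair_cooldown.py` is described as Redis-backed, whereas this version keeps its history in memory; the threshold of 5 repairs per 15 minutes follows option B of decision D5; and `GlobalRepairCooldown`, `allows()` and `record()` are hypothetical names.

```python
import time
from collections import deque

# Assumed values: threshold and window follow option B of decision D5;
# the blacklist mirrors the stateful services named in Skill 02.
MAX_REPAIRS = 5
WINDOW_S = 15 * 60
STATEFUL_SERVICE_BLACKLIST = frozenset({"postgres", "redis", "clickhouse", "minio", "etcd"})


class GlobalRepairCooldown:
    """Sliding-window limiter shared by every auto-repair action."""

    def __init__(self, max_repairs: int = MAX_REPAIRS, window_s: float = WINDOW_S,
                 clock=time.monotonic) -> None:
        self._max = max_repairs
        self._window = window_s
        self._clock = clock
        self._history: deque = deque()

    def _prune(self) -> None:
        """Drop repair timestamps that have left the window."""
        cutoff = self._clock() - self._window
        while self._history and self._history[0] <= cutoff:
            self._history.popleft()

    def allows(self, service: str) -> bool:
        """False for blacklisted services or once the window is full."""
        if service in STATEFUL_SERVICE_BLACKLIST:
            return False
        self._prune()
        return len(self._history) < self._max

    def record(self) -> None:
        """Call after a repair has actually been executed."""
        self._history.append(self._clock())


# Demo with a fake clock so the window can be advanced instantly.
now = [0.0]
cooldown = GlobalRepairCooldown(clock=lambda: now[0])
for _ in range(MAX_REPAIRS):
    cooldown.record()
print(cooldown.allows("awoooi-api"))   # window is full
now[0] = WINDOW_S + 1
print(cooldown.allows("awoooi-api"))   # old entries have expired
print(cooldown.allows("postgres"))     # stateful service, never auto-repaired
```

Injecting the clock keeps the limiter deterministic in tests; a Redis-backed version would replace the deque with a sorted set keyed by timestamp.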
+
+**Wave 1 PR 審核要點**：
+- `circuit_breaker.py` 在 `src/core/`（非 services）
+- `global_repair_cooldown.py` 在 `src/services/`（非 core）
+- 所有新代碼有 Google Style Docstring
+- 所有 Singleton 透過工廠函數暴露
+
+**Wave 1 驗收指令**：
+```bash
+# After deployment, verify terminationGracePeriodSeconds
+kubectl get deployment awoooi-worker -n awoooi-prod \
+  -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}'
+# Expected: 90
+
+# Verify Circuit Breaker + Semaphore (Alpha test)
+# curl -d defaults to a form content type, so the JSON header must be set explicitly
+curl -X POST http://192.168.0.125:32334/api/v1/webhooks/alertmanager \
+  -H 'Content-Type: application/json' \
+  -d '{"alerts": [{"labels": {"alertname": "TestAlert"}}]}'
+# Watch the API log for: openclaw_semaphore_acquired / openclaw_circuit_open_skip
+```
+
+### Wave 2: Worker 擴縮容部署（30 分鐘，Wave 1 完成後）
+
+```bash
+# 部署 Worker HPA
+kubectl apply -f k8s/awoooi-prod/12-hpa.yaml
+
+# 驗證 HPA 建立
+kubectl get hpa awoooi-worker-hpa -n awoooi-prod
+# 預期：TARGETS 顯示實際 CPU%，MINPODS=1, MAXPODS=3
+```
+
+### Wave 3: 前端主幹手術（4 天窗口，Week 1）
+
+```bash
+# Day 1 準備：宣佈 Feature Freeze
+# Day 1-3：i18n 閃電清零（依 TECHNICAL_DEBT_PHASE2.md 清單）
+cd apps/web
+git checkout -b fix/i18n-zero-violation
+
+# 執行清零（依序修復 9 個違規組）
+# 完成後驗證：
+pnpm exec tsc --noEmit
+
+# 安裝 ESLint Plugin（先 Warn 模式）
+pnpm add -D eslint-plugin-i18next
+# 修改 .eslintrc.js（markupOnly: true，先 warn）
+
+# Day 4：PR 合併 → 解除 Freeze
+git push origin fix/i18n-zero-violation
+# PR 審核後合併到 main
+# ESLint 切換為 error 模式
+```
+
+### Wave 4: CI 基礎設施（1 天，Wave 3 後）
+
+```bash
+# 修改 playwright.config.ts
+# 加入 ignoreHTTPSErrors: true + deviceScaleFactor: 1 + threshold: 0.05
+
+# 建立 tests/e2e/global.setup.ts（E2E Auth Bypass）
+# 建立初始 Docker Visual Baseline
+cd apps/web
+pnpm test:visual:update  # Docker 環境
+
+# 部署 Weekly E2E Workflow
+git add .github/workflows/e2e-weekly.yaml
+git commit -m "feat(ci): weekly E2E + visual regression + Docker baseline"
+```
+
+### Wave 5: 告警後端完整化（2 小時，Wave 1 後即可）
+
+```bash
+# Deploy SENTRY_AUTH_TOKEN (only if Wave 0.4 found it missing)
+kubectl patch secret awoooi-secrets -n awoooi-prod \
+  --type=merge -p='{"stringData": {"SENTRY_AUTH_TOKEN": "YOUR_TOKEN"}}'
+
+# Deploy the SigNoz alert rules to .188
+ssh root@192.168.0.188 'cat > /tmp/signoz-rules.yaml' < ops/signoz/alerting/rules.yaml
+# Apply them through the SigNoz API
+
+# Verify the alert chain
+python ops/scripts/alert_chain_smoke_test.py
+```
+
+> **Wave 6-12**: 中長期工作，依 INTEGRATION_ARCHITECTURE_MASTER.md 執行。
+
+---
+
+## 第七章：最終工作排程（供統帥審核）
+
+```
+📅 2026-03-29（今天）- Wave 0
+  □ 0.1 Redis AOF + eviction policy 確認（hands-on，5 min）
+  □ 0.2 Alertmanager 備援路徑確認（hands-on，10 min）
+  □ 0.3 release/v1.x 分支建立 + GitHub 設為 Protected（hands-on，5 min）
+  □ 0.4 SENTRY_AUTH_TOKEN 存在確認（hands-on，3 min）
+
+📅 2026-03-30~04-01（3 天）- Wave 1
+  □ core/circuit_breaker.py 開發（2h）
+  □ services/global_repair_cooldown.py 開發（1h）
+  □ anomaly_counter.py Graceful Degrade（1h）
+  □ signal_worker.py XCLAIM + Active Sweeper（2h）
+  □ auto_repair_service.py Guardrail 整合（1h）
+  □ sentry_webhook.py + signoz_webhook.py 雙層保護（0.5h）
+  □ incident_service.py Global Debounce（1.5h）
+  □ 08-deployment-worker.yaml terminationGracePeriodSeconds（0.5h）
+  □ Wave 1 PR 提交 → 審核 → 合併 → CD 部署 → 驗收
+
+📅 2026-04-01（Wave 1 完成後）- Wave 2
+  □ Worker HPA YAML 部署（30 min）
+
+📅 2026-04-02~04-05（4 天）- Wave 3（Feature Freeze）
+  □ Frontend Feature Freeze 宣佈
+  □ i18n 閃電清零（40+ 違規，4h 工時）
+  □ ESLint i18n Plugin 安裝（Warn 模式）
+  □ PR 合併 → Feature Unfreeze
+  □ ESLint 切換為 Error 模式
+
+📅 2026-04-06（1 天）- Wave 4 / Wave 5
+  □ playwright.config.ts 修改（ignoreHTTPSErrors + threshold）
+  □ global.setup.ts E2E Auth Bypass 建立
+  □ Docker Visual Baseline 初始建立
+  □ E2E Weekly Workflow 部署
+  □ SigNoz 告警規則部署到 .188
+
+📅 2026-04 Week 3 - Wave 5 + 6（可視 Wave 1-4 完成情況調整）
+  □ Prometheus Federation（.110 → .188）
+  □ Redis AOF + Sentinel 評估
+  □ AI Autonomy Index Metrics 建立
+
+📅 2026-04 Week 4+ - Wave 7-9（Month 2）
+  □ Storybook 10 核心組件
+  □ Omni-Terminal SSE Event Sourcing
+  □ 監控 GenUI 卡片（7 張）
+  □ Nexus AI 自治率 UI
+
+📅 Q2 - Wave 10-12（長期）
+  □ CloudNativePG HA 評估
+  □ Kali SecurityAgent（MCP Tool 化）
+  □ Phase 4 視覺靈魂注入
+  □ CI 硬阻擋正式啟用（Warn → Block）
+```
+
+---
+
+## 第八章：統帥審核要點
+
+請特別審核以下決策點：
+
+| # | 決策內容 | 選項 | 推薦 |
+|---|---------|------|------|
+| D1 | Wave 1 PR 是否原子性（9 個文件一次）| A. 原子 / B. 分批 | A（必須原子，互相依賴）|
+| D2 | Feature Freeze 時長 | A. 3天 / B. 5天 | B（40+ 違規需要謹慎）|
+| D3 | ESLint 啟用時機 | A. i18n清零後立即 / B. 一週緩衝期 | A（清零後立即，防新債）|
+| D4 | Semaphore max_concurrent | A. 2 / B. 3 / C. 5 | B（3 = 60% .188 CPU）|
+| D5 | Global Cooldown Threshold | A. 3次/15分鐘 / B. 5次/15分鐘 | B（3 次太嚴，5 次合理）|
+| D6 | Redis HA 策略 | A. Sentinel / B. AOF+手動 / C. 暫緩 | C（暫緩，Month 2 評估）|
+
+---
+
+*此排程基於 2026-03-29 的完整代碼審計與 16 輪沙盤推演，已閉合所有已知漏洞。* 🦞
--- a/docs/proposals/MONITORING_ARCHITECTURE_DEEP_DIVE.md
+++ b/docs/proposals/MONITORING_ARCHITECTURE_DEEP_DIVE.md
@@ -0,0 +1,653 @@
+# AWOOOI 監控機制完整規劃：讓監控成為 AI 智慧的感知神經，而非束縛
+
+> **文件類型**: 架構設計 + 實施 RunBook  
+> **優先級**: 🔴 重中之重  
+> **建立**: 2026-03-29 12:38 (台北)  
+> **建立者**: Antigravity  
+> **核心命題**: 監控不是目的，而是 AI 決策的神經末梢。
+
+---
+
+## 一、核心哲學：監控與 AI 的關係定位
+
+### 1.1 「不能淪為監控產品」— 這個恐懼從哪裡來？
+
+傳統監控產品（Grafana / Prometheus / Datadog）的底層邏輯是：
+> 「系統把原始數據攤開，**人類**負責看懂並做決定。」
+
+這讓使用者變成**數據的搬運工**，而非**決策者**。
+
+AWOOOI 的定位必須是：
+> 「AI 消化所有數據，**主動帶著分析結論來問統帥**：『這裡我建議這樣做，您核准嗎？』」
+
+---
+
+### 1.2 黃金法則：哪些監控數據「應該消失在後台」，哪些「必須浮現到前台」
+
+```
+┌────────────────────────────────────────────────────────────────────────┐
+│  監控資訊的兩個命運                                                      │
+├───────────────────────────┬────────────────────────────────────────────┤
+│  🔒 靜默消化（後台進行）     │  🚨 主動浮現（推送給統帥）                  │
+├───────────────────────────┼────────────────────────────────────────────┤
+│  Prometheus Metrics 原始值 │  AI 判斷「這是異常」後產生的 Approval 卡片   │
+│  SigNoz Trace 詳情         │  Anomaly Counter 升級到 ESCALATE 時的警示   │
+│  Sentry Error Log 完整堆疊 │  Auto-Repair 執行後的結果摘要               │
+│  Grafana 儀表板圖表         │  P0 事件的緊急插隊（Priority Preemption）   │
+│  Alertmanager 規則配置     │  每日 AI 健康摘要（主動推送）                │
+│  K8s Pod 狀態明細          │  FinOps 成本異常告警                        │
+└───────────────────────────┴────────────────────────────────────────────┘
+```
+
+**核心結論**：監控數據 99% 應該「靜默消化」，只有 AI 無法自動處理的 1% 才浮現為「需要統帥決策的卡片」。
+
+---
+
+## 二、監控完整節點盤點（現況 vs 目標）
+
+### 2.1 五層監控架構
+
+```
+Layer 0: 物理感知層（主機/節點）
+    ↓
+Layer 1: 服務感知層（容器/Pod）
+    ↓
+Layer 2: 應用感知層（API/前端/Worker）
+    ↓
+Layer 3: AI 智慧層（LLM 推理/工具調用）
+    ↓
+Layer 4: 業務感知層（用戶行為/成本/SLO）
+```
+
+---
+
+### 2.2 各層完整節點盤點
+
+#### Layer 0: 物理感知層
+
+| 節點 | 工具 | 數據 | 現況 | 缺口 |
+|------|------|------|------|------|
+| .110 CPU/Memory/Disk | Node Exporter | 系統指標 | ✅ Prometheus | — |
+| .112 CPU/Memory/Disk | Node Exporter | 系統指標 | 🟡 孤立，無 Webhook | 無告警整合 |
+| .188 CPU/Memory/Disk | Node Exporter | 系統指標 + GPU | ✅ Prometheus | — |
+| .120 K3s Master | Node Exporter + kube-state | K3s 節點指標 | ✅ | — |
+| .121 K3s Worker | Node Exporter + kube-state | K3s 節點指標 | ✅ | — |
+| VIP .125 | Blackbox Exporter | TCP 健康 | ✅ 已配置 | — |
+
+#### Layer 1: 服務感知層（Docker/K8s）
+
+| 服務 | Prometheus | Sentry | OTEL | 告警 | 自動修復 | 缺口 |
+|------|-----------|-------|------|------|---------|------|
+| awoooi-api | ✅ | ✅ | ✅ | ✅ 完整 | ✅ | — |
+| awoooi-web | ✅ | ✅ | ✅ | ✅ 完整 | ✅ | — |
+| awoooi-worker | ✅ | ✅ | ✅ | 🟡 | ✅ | HPA 缺失 |
+| Ollama | ✅ | — | — | ✅ | ✅ 重啟 | — |
+| OpenClaw | ✅ | ✅ | ✅ | ✅ | ✅ 重啟 | — |
+| Redis | ✅ | — | — | ✅ | ❌ (cautious) | Auto-repair policy is too conservative |
+| PostgreSQL | ✅ | — | — | ✅ | ❌ (cautious) | Same as Redis |
+| Harbor | ✅ | — | — | ✅ | — | — |
+| Sentry | ✅ | — | — | ✅ | — | — |
+| Langfuse | ✅ | — | — | ✅ | — | — |
+| **MinIO** | ❌ | — | — | ❌ | ❌ | 完全未監控 |
+| **Kali Scanner** | ❌ | — | — | ❌ | ❌ | 孤立節點 |
+
+#### Layer 2: 應用感知層
+
+| 數據類型 | 工具 | 現況 | 缺口 |
+|---------|------|------|------|
+| API Error Rate | Prometheus + SigNoz | ✅ | — |
+| API Latency P50/P95/P99 | SigNoz OTEL | ✅ | — |
+| Distributed Traces | SigNoz | ✅ | — |
+| Frontend Web Vitals (LCP/FID/CLS) | Sentry | ✅ | — |
+| Frontend JS Errors | Sentry | ✅ | — |
+| Frontend Session Replay | Sentry | ✅ | — |
+| **Frontend Rage Click** | Sentry | ✅ | **未整合進 AI 分析** |
+| **API Slow Query** | Sentry + structlog | ✅ | **無 AI 自動優化建議** |
+| **K8s Resource Quota** | kube-state-metrics | ✅ | — |
+| Alert Chain E2E | Prometheus Counter | ✅ ADR-037 | — |
+
+#### Layer 3: AI 智慧層
+
+| 數據類型 | 工具 | 現況 | 缺口 |
+|---------|------|------|------|
+| LLM 請求/回應 Traces | Langfuse | ✅ | — |
+| LLM Token 用量/成本 | Langfuse | ✅ | **無 AWOOOI Dashboard** |
+| Ollama 推理延遲 | Prometheus | ✅ | — |
+| AI Fallback 觸發次數 | Prometheus | ✅（ADR-006）| — |
+| NVIDIA Circuit Breaker | Prometheus | ✅（ADR-036）| — |
+| **AI 自治率指數** | — | ❌ 完全缺失 | 核心指標未建立 |
+| **Anomaly Counter 統計** | Redis 計數器 | ✅ ADR-037 | **無前端展示** |
+| **Approval 決策分析** | PostgreSQL | ✅ | **只有原始 CRUD，無分析** |
+
+#### Layer 4: 業務感知層
+
+| 數據類型 | 工具 | 現況 | 缺口 |
+|---------|------|------|------|
+| SLO 達成率 | Prometheus + rules | ✅ 定義 | **無可視化** |
+| 事件 MTTR（平均修復時間）| PostgreSQL | ✅ 原始資料 | **無計算與展示** |
+| **FinOps 成本追蹤** | cost_analyzer.py | ✅ 邏輯 | **無 UI，完全閒置** |
+| **用戶操作審計** | audit_logs.py | ✅ | — |
+| **知識庫查詢統計** | — | ❌ | 無知識庫後端 |
+
+---
+
+## 三、整合缺口分析
+
+### 3.1 「最後一哩路」缺口（已有工具，未整合）
+
+| 缺口 | 工具已準備 | 缺什麼 | 工時 |
+|------|-----------|-------|------|
+| MinIO 監控 | Prometheus | MinIO Exporter 未部署 | 1h |
+| Kali 安全掃描 | Nmap/ZAP on .112 | 無 AWOOOI Webhook 整合 | 2h |
+| FinOps 前端 | cost_analyzer.py | 無 API 端點 + 無 UI | 8h |
+| AI 自治率指數 | Prometheus Counter 可建 | 指標定義 + Dashboard | 4h |
+| Rage Click → AI 分析 | Sentry `get_ux_audit_summary()` | 無觸發器，未週期調用 | 2h |
+| Anomaly Counter 前端展示 | Redis + anomaly_counter.py | 無 GenUI 卡片 | 4h |
+| SLO 可視化 | Prometheus rules 已定義 | 無 Grafana/前端展示 | 3h |
+| MTTR 計算 | PostgreSQL 有 incidents 資料 | 無計算 API 端點 | 2h |
+| 雙 Prometheus 聯邦 | 188/110 各一個 | 無 Federation 配置 | 2h |
+
+**整合缺口總工時估算：~28 小時**
+
+---
+
+## 四、監控 UI 呈現戰略（避免淪為監控產品的核心設計）
+
+### 4.1 三種監控 UI 反模式（絕對禁止）
+
+```
+❌ 反模式 A：Grafana 嵌入 iframe
+   → 整個頁面都是 Grafana，用戶感覺在用 Grafana
+   
+❌ 反模式 B：「監控頁面」頂級選單項目
+   → 將 AWOOOI 降格為「有 AI 輔助的 Grafana」
+   
+❌ 反模式 C：Prometheus 原始指標直接展示
+   → 用戶看到 rate(http_requests_total[5m]) 這種語法，違反 AI 原生體驗
+```
+
+### 4.2 正確的監控 UI 架構：「三義分離原則」
+
+```
+┌─────────────────────────────────────────────────────────────────────┐
+│  AWOOOI 監控 UI 三義分離                                              │
+├────────────────────┬──────────────────┬─────────────────────────────┤
+│  義 1: AI 主動浮現  │  義 2: 問即答      │  義 3: 深度調查跳脫           │
+│  （AWOOOI 前端）    │  （Omni-Terminal） │  （外部工具直連）             │
+├────────────────────┼──────────────────┼─────────────────────────────┤
+│ Nexus 頁面         │ /status awoooi-api│ 🔗 Grafana (新分頁)           │
+│ → AI 健康脈搏       │ → GenUI 健康卡     │ 🔗 SigNoz (新分頁)           │
+│ → 自治率指數        │ /cost this-month  │ 🔗 Sentry (新分頁)           │
+│ → 異常趨勢圖        │ → FinOps 成本卡    │                             │
+│ War Room 頁面      │ /trace xxx        │ 不直接嵌入，保持 AWOOOI       │
+│ → 待決策 Approval  │ → Trace 彙整卡     │ 界面純淨性                   │
+└────────────────────┴──────────────────┴─────────────────────────────┘
+```
+
+### 4.3 Nexus 頁面的監控呈現規格
+
+這是首頁（The Nexus / 全局心智樞紐），呈現監控摘要的唯一入口：
+
+```tsx
+// Nexus 頁面結構（Nothing.tech 純白工業風）
+
+// 區塊 A：AI 自治率指數（最大最重要）
+<AutonomyIndexPanel>
+  今日 AI 成功攔截並自動修復：7/10 事件（70% 自治率）
+  ↗ 比昨日提升 12%
+</AutonomyIndexPanel>
+
+// 區塊 B：系統脈搏（3 個數字，非圖表）
+<SystemPulseRow>
+  <PulseMetric label="正常服務" value="24/25" status="healthy" />
+  <PulseMetric label="活躍告警" value="0" status="healthy" />
+  <PulseMetric label="待決策" value="2" status="warning" />
+</SystemPulseRow>
+
+// 區塊 C：AI 思考流（背景動態，非重點）
+<ThinkingStream>
+  [Investigator] Redis latency: 2ms ... OK
+  [Investigator] API error rate: 0.1% ... OK
+  [Investigator] cert://*.wooo.work: 42 days ... OK
+</ThinkingStream>
+
+// 區塊 D：需要統帥決策的卡片（只有這個需要互動）
+// → 有待決策才出現，平時此區域「靜默」
+<DecisionZone>
+  <ApprovalCard urgency="CRITICAL" ... />
+</DecisionZone>
+```
+
+---
+
+### 4.4 監控數據在 Omni-Terminal 的呈現（問即答模式）
+
+Terminal 輸入 → AI 消化原始指標 → 回傳 GenUI 卡片（非原始數字）
+
+| 使用者輸入 | AI 行為 | GenUI 卡片類型 |
+|---------|---------|--------------|
+| `/status all` | 查詢所有服務健康 | `SystemHealthCard` |
+| `/status awoooi-api` | 查 API P95 延遲 + 錯誤率 | `ServiceDetailCard` |
+| `/cost` | 呼叫 cost_analyzer.py | `FinOpsCard` |
+| `/trace 最近5分鐘` | 查詢 SigNoz slow traces | `TraceListCard` |
+| `/incident 今天` | 查詢今日事件 + AI 摘要 | `IncidentSummaryCard` |
+| `/alert 狀態` | 檢查告警鏈路 E2E | `AlertChainStatusCard` |
+| `/slo` | 計算 API/Web SLO 達成率 | `SLODashboardCard` |
+
+**🎯 這才是 AI 原生體驗**：使用者永遠都在跟 AI 對話，而非直接操作圖表。
+
+---
+
+### 4.5 「深度調查」模式：智能跳脫
+
+當用戶需要原始 Grafana / SigNoz 數據時，AWOOOI 提供**智能跳脫**，而非嵌入：
+
+```tsx
+// 在 GenUI 卡片中，提供「深度調查」按鈕
+<ServiceDetailCard service="awoooi-api">
+  <MetricRow label="P95 延遲" value="124ms" status="healthy" />
+  <MetricRow label="錯誤率" value="0.1%" status="healthy" />
+  
+  {/* 智能跳脫按鈕 */}
+  <ExternalLinks>
+    <SmartLink 
+      icon="📊" 
+      label="SigNoz 詳細追蹤"
+      href="http://192.168.0.188:3301/traces?service=awoooi-api&from=now-1h"
+      target="_blank"   // ← 新分頁開啟，不汙染 AWOOOI 界面
+    />
+    <SmartLink
+      icon="📈"
+      label="Grafana 即時圖表"
+      href="http://192.168.0.188:3000/d/awoooi-api"
+      target="_blank"
+    />
+    <SmartLink
+      icon="🐛"
+      label="Sentry Issues"
+      href="http://192.168.0.110:9000/organizations/sentry/issues/?project=awoooi-api"
+      target="_blank"
+    />
+  </ExternalLinks>
+</ServiceDetailCard>
+```
+
+---
+
+## 五、監控機制整合實施步驟
+
+### Wave M-1: 立即啟動（1 週）
+
+#### M-1.1 MinIO 監控整合（1h）
+
+```bash
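+# Deploy the MinIO Exporter on 192.168.0.188.
+# Credentials are read from the shell environment instead of being hard-coded here;
+# export MINIO_ACCESS_KEY and MINIO_SECRET_KEY before running.
+# NOTE: confirm the exporter image exists in your registry before deploying. MinIO also
+# serves Prometheus metrics natively at /minio/v2/metrics/cluster.
+docker run -d \
+  --name minio-exporter \
+  --network momo-pro-network \
+  -e MINIO_URL=http://minio:9000 \
+  -e MINIO_ACCESS_KEY="${MINIO_ACCESS_KEY:?set MINIO_ACCESS_KEY}" \
+  -e MINIO_SECRET_KEY="${MINIO_SECRET_KEY:?set MINIO_SECRET_KEY}" \
+  -p 9290:9290 \
+  bitnami/minio-exporter:latest
+
+# Add a scrape target to 188:/momo-pro/monitoring/prometheus.yml:
+# - job_name: 'minio'
+#   static_configs:
+#     - targets: ['localhost:9290']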
+```
+
+#### M-1.2 Prometheus Federation 統一（2h）
+
+```yaml
+# 在 188 的 Prometheus 加入聯邦查詢（抓取 .110 的 Prometheus 數據）
+# 188:/momo-pro/monitoring/prometheus.yml 追加：
+
+- job_name: 'federate-110'
+  scrape_interval: 30s
+  honor_labels: true
+  metrics_path: '/federate'
+  params:
+    'match[]':
+      - '{job=~".+"}'         # 抓取所有 .110 的 job
+  static_configs:
+    - targets: ['192.168.0.110:9090']
+```
+
+#### M-1.3 建立 AI 自治率指數 Prometheus 指標（2h）
+
+```python
+# apps/api/src/core/metrics.py 新增：
+
+# === AI 自治率追蹤 (The Autonomy Index) ===
+AUTONOMY_INCIDENTS_TOTAL = Counter(
+    'awoooi_incidents_total',
+    'Total number of incidents received',
+    ['source', 'severity']
+)
+
+AUTONOMY_AUTO_RESOLVED = Counter(
+    'awoooi_incidents_auto_resolved_total',
+    'Incidents resolved automatically by AI without human intervention',
+    ['source', 'action_type']
+)
+
+AUTONOMY_HUMAN_RESOLVED = Counter(
+    'awoooi_incidents_human_resolved_total',
+    'Incidents requiring human approval',
+    ['source', 'risk_level']
+)
+
+def record_incident_created(source: str, severity: str):
+    AUTONOMY_INCIDENTS_TOTAL.labels(source=source, severity=severity).inc()
+
+def record_auto_resolution(source: str, action_type: str):
+    AUTONOMY_AUTO_RESOLVED.labels(source=source, action_type=action_type).inc()
+
+def record_human_decision(source: str, risk_level: str):
+    AUTONOMY_HUMAN_RESOLVED.labels(source=source, risk_level=risk_level).inc()
+
+# 自治率計算公式：
+# autonomy_rate = auto_resolved / (auto_resolved + human_decisions) * 100
+```
+
+Grafana Dashboard 公式：
+```
+# AI 自治率（24h）
+sum(increase(awoooi_incidents_auto_resolved_total[24h]))
+/
+(
+  sum(increase(awoooi_incidents_auto_resolved_total[24h])) +
+  sum(increase(awoooi_incidents_human_resolved_total[24h]))
+) * 100
+```
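+
+The same calculation can be mirrored in application code, for example to feed the Nexus panel. A minimal sketch, assuming the two 24h counts have already been queried; `autonomy_rate` is a hypothetical helper, not an existing function in `metrics.py`. It returns `None` instead of dividing by zero when no incidents were recorded:
+
+```python
+from typing import Optional
+
+
+def autonomy_rate(auto_resolved: int, human_resolved: int) -> Optional[float]:
+    """Return the autonomy rate in percent, or None when there were no incidents."""
+    total = auto_resolved + human_resolved
+    if total == 0:
+        # The PromQL expression yields NaN or an empty result in this case;
+        # callers should render "no data" rather than 0%.
+        return None
+    return 100 * auto_resolved / total
+
+
+print(autonomy_rate(7, 3))  # 70.0
+```
+
+With 7 auto-resolved and 3 human-resolved incidents this gives the 70% figure used in the Nexus mock-up.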
+
+### Wave M-2: 短期啟動（2 週）
+
+#### M-2.1 FinOps API 端點建立（4h）
+
+```python
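+# apps/api/src/api/v1/finops.py (new file)
+# Exposes the results computed by cost_analyzer.py
+from fastapi import APIRouter
+
+# NOTE: get_cost_analyzer must also be imported; its module path is not shown in this plan.
+router = APIRouter()
+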
+@router.get("/finops/summary")
+async def get_finops_summary():
+    """
+    FinOps 成本摘要
+    
+    Returns:
+        {
+            "period": "2026-03",
+            "total_cost_usd": 12.50,
+            "ollama_cost": 0.0,         # 本地，零成本
+            "gemini_cost": 1.20,
+            "claude_cost": 11.30,
+            "realizable_savings": 3.50, # 真實可省
+            "freed_capacity": 8.00,     # 釋放容量（非真實省錢）
+            "top_cost_drivers": [...],
+            "recommendations": [...]
+        }
+    """
+    cost_analyzer = get_cost_analyzer()
+    return await cost_analyzer.monthly_summary()
+```
+
+#### M-2.2 SLO API 端點建立（2h）
+
+```python
+# apps/api/src/api/v1/slo.py（新建）
+
+@router.get("/slo/status")
+async def get_slo_status():
+    """
+    SLO 達成狀況
+    
+    Returns:
+        {
+            "api": {
+                "availability_7d": 99.97,         # %
+                "latency_p95_7d": 124,            # ms
+                "target_availability": 99.9,
+                "target_latency_p95": 500,
+                "status": "healthy"               # healthy/at_risk/breached
+            },
+            "web": {...},
+            "overall": "healthy"
+        }
+    """
+```
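+
+How `status` is derived is not specified above. One possible rule, sketched under the assumption that the status depends on how much of the availability error budget has been consumed; the function name and the 50% `at_risk` threshold are illustrative and not part of the plan:
+
+```python
+def slo_status(availability: float, target: float, at_risk_ratio: float = 0.5) -> str:
+    """Classify an availability SLO as healthy, at_risk, or breached.
+
+    availability and target are percentages, e.g. 99.97 and 99.9.
+    at_risk_ratio is the share of the error budget that triggers "at_risk" (assumed).
+    """
+    budget = 100.0 - target       # allowed unavailability, in percentage points
+    used = 100.0 - availability   # observed unavailability
+    if used > budget:
+        return "breached"
+    if used >= budget * at_risk_ratio:
+        return "at_risk"
+    return "healthy"
+
+
+print(slo_status(99.97, 99.9))  # healthy
+```
+
+A latency SLO (`latency_p95_7d` against `target_latency_p95`) would need its own rule; this sketch covers availability only.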
+
+#### M-2.3 MTTR API 端點建立（2h）
+
+```python
+# apps/api/src/api/v1/stats.py 新增端點
+
+@router.get("/stats/mttr")
+async def get_mttr_stats():
+    """
+    平均修復時間 (Mean Time To Resolution)
+    
+    計算邏輯：
+    - MTTR = avg(resolved_at - created_at) for resolved incidents
+    - 分 AI 自動修復 vs 人工審核分別計算
+    """
+```
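+
+The calculation described in the docstring could look like the sketch below. The `Incident` dataclass and its `auto_resolved` flag are assumptions made for illustration; the real `incidents` table schema is not shown in this document:
+
+```python
+from dataclasses import dataclass
+from datetime import datetime, timedelta
+from statistics import mean
+from typing import List, Optional
+
+
+@dataclass
+class Incident:
+    created_at: datetime
+    resolved_at: Optional[datetime]  # None while the incident is still open
+    auto_resolved: bool              # hypothetical flag: True if the AI fixed it unaided
+
+
+def mttr_seconds(incidents: List[Incident], auto_resolved: bool) -> Optional[float]:
+    """Mean of (resolved_at - created_at) for resolved incidents in one group."""
+    durations = [
+        (i.resolved_at - i.created_at).total_seconds()
+        for i in incidents
+        if i.resolved_at is not None and i.auto_resolved == auto_resolved
+    ]
+    return mean(durations) if durations else None
+
+
+t0 = datetime(2026, 3, 29, 12, 0, 0)
+sample = [
+    Incident(t0, t0 + timedelta(minutes=10), auto_resolved=True),
+    Incident(t0, t0 + timedelta(minutes=20), auto_resolved=True),
+    Incident(t0, t0 + timedelta(minutes=60), auto_resolved=False),
+    Incident(t0, None, auto_resolved=False),  # open incidents are excluded
+]
+print(mttr_seconds(sample, auto_resolved=True))   # 900.0
+print(mttr_seconds(sample, auto_resolved=False))  # 3600.0
+```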
+
+#### M-2.4 Kali Scanner 整合（2h）
+
+```python
+# apps/api/src/api/v1/webhooks.py 新增 Kali Scanner Webhook
+
+@router.post("/webhooks/kali/scan-result")
+async def handle_kali_scan_result(request: Request):
+    """
+    接收 .112 Kali 安全掃描結果
+    
+    Kali 掃描腳本每週執行一次，結果發送至此 Webhook
+    高危漏洞 → 自動建立 CRITICAL Approval
+    """
+```
+
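+Kali-side configuration (`192.168.0.112`):
+```bash
+# Create the weekly scan script on .112 (requires nmap, jq and curl)
+cat > /opt/awoooi-scanner/weekly-scan.sh << 'EOF'
+#!/bin/bash
+set -euo pipefail
+HOST="192.168.0.120"   # AWOOOI API host
+PORT="32334"           # AWOOOI API port
+
+# nmap takes the port via -p and has no JSON output mode; -oX - writes XML to stdout.
+# jq -Rs wraps that XML in a JSON string, so the webhook receives XML in scan_result.
+nmap -sV --script vuln -p "$PORT" "$HOST" -oX - |
+  jq -Rs --arg target "$HOST:$PORT" '{scan_result: ., target: $target}' |
+  curl -fsS -X POST "http://$HOST:$PORT/api/v1/webhooks/kali/scan-result" \
+    -H "Content-Type: application/json" \
+    --data-binary @-
+EOF
+chmod +x /opt/awoooi-scanner/weekly-scan.sh
+
+# Append to the existing crontab (a bare `echo ... | crontab -` would replace every entry)
+(crontab -l 2>/dev/null; echo "0 2 * * 1 /opt/awoooi-scanner/weekly-scan.sh") | crontab -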
+```
+
+### Wave M-3: 中期啟動（3–4 週）
+
+#### M-3.1 Nexus 頁面 AI 自治率指數 UI（8h）
+
+```tsx
+// apps/web/src/app/[locale]/(dashboard)/page.tsx
+// 新增 AutonomyIndex 組件
+
+interface AutonomyData {
+  rate: number;           // 70.0
+  daily_trend: number;   // +12.0 vs yesterday
+  auto_resolved_24h: number;
+  human_resolved_24h: number;
+}
+
+const AutonomyIndexPanel = ({ data }: { data: AutonomyData }) => (
+  <div className="bg-white/70 backdrop-blur-[20px] border border-black/[0.06] rounded-xl p-6">
+    {/* 大數字：自治率 */}
+    <div className="flex items-end gap-3">
+      <span className="font-mono text-6xl font-bold text-nothing-ink">
+        {data.rate.toFixed(0)}
+        <span className="text-2xl text-nothing-gray">%</span>
+      </span>
+      <div className="mb-2">
+        <span className={`text-sm ${data.daily_trend >= 0 ? 'text-status-success' : 'text-status-warning'}`}>
+          {data.daily_trend >= 0 ? '↗ +' : '↘ '}{data.daily_trend.toFixed(0)}% {t('nexus.vs_yesterday')}
+        </span>
+      </div>
+    </div>
+    
+    {/* AI 自治率說明 */}
+    <p className="font-mono text-xs tracking-widest text-nothing-gray-600 mt-2">
+      [AI_AUTONOMY_INDEX] {t('nexus.autonomy_description')}
+    </p>
+    
+    {/* 細分：今日自動 vs 需要人工 */}
+    <div className="flex gap-6 mt-4 border-t border-black/[0.04] pt-4">
+      <div>
+        <p className="text-2xl font-bold text-status-success">{data.auto_resolved_24h}</p>
+        <p className="text-xs text-nothing-gray">{t('nexus.ai_auto_resolved')}</p>
+      </div>
+      <div>
+        <p className="text-2xl font-bold text-status-warning">{data.human_resolved_24h}</p>
+        <p className="text-xs text-nothing-gray">{t('nexus.required_approval')}</p>
+      </div>
+    </div>
+  </div>
+);
+```
+
+#### M-3.2 Omni-Terminal 監控指令整合（8h）
+
+後端需要新增以下 Terminal 指令的處理器：
+
+```python
+# apps/api/src/services/terminal_service.py 擴充
+
+class TerminalCommandRouter:
+    
+    async def route(self, intent: str, context: dict) -> TerminalResponse:
+        """
+        監控相關指令路由
+        """
+        if intent == "/status":
+            return await self._handle_status(context)      # 服務健康狀態
+        elif intent == "/cost":
+            return await self._handle_cost(context)         # FinOps 成本
+        elif intent == "/slo":
+            return await self._handle_slo(context)          # SLO 達成率
+        elif intent == "/trace":
+            return await self._handle_trace(context)        # SigNoz Traces
+        elif intent == "/alert":
+            return await self._handle_alert_chain(context)  # 告警鏈路狀態
+        elif intent == "/incident":
+            return await self._handle_incident(context)     # 事件查詢
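+        elif intent == "/mttr":
+            return await self._handle_mttr(context)         # Mean time to resolution
+        else:
+            # Unknown command: fail loudly instead of implicitly returning None
+            raise ValueError(f"Unsupported terminal command: {intent}")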
+```
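+
+The router above expects `intent` to be the bare command, while the terminal receives input such as `/status awoooi-api`. A minimal sketch of the splitting step, assuming whitespace-separated arguments; `parse_command` is a hypothetical helper and not part of `terminal_service.py`:
+
+```python
+from typing import List, Tuple
+
+
+def parse_command(raw: str) -> Tuple[str, List[str]]:
+    """Split terminal input into (intent, args); the intent keeps its leading slash."""
+    parts = raw.strip().split()
+    if not parts or not parts[0].startswith("/"):
+        raise ValueError(f"Not a slash command: {raw!r}")
+    return parts[0], parts[1:]
+
+
+print(parse_command("/status awoooi-api"))  # ('/status', ['awoooi-api'])
+print(parse_command("/slo"))                # ('/slo', [])
+```
+
+The returned `args` list could be placed in `context` before calling `route`, so each handler sees the service name or time range the user typed.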
+
+#### M-3.3 監控相關 GenUI 卡片擴充（8h）
+
+```typescript
+// apps/web/src/components/genui/registry.ts 新增：
+
+export const GENUI_COMPONENTS = {
+  // 現有組件...
+  
+  // 新增監控類組件：
+  'SystemHealthCard': () => import('./monitoring/SystemHealthCard'),
+  'ServiceDetailCard': () => import('./monitoring/ServiceDetailCard'),
+  'FinOpsCard': () => import('./monitoring/FinOpsCard'),
+  'SLODashboardCard': () => import('./monitoring/SLODashboardCard'),
+  'AlertChainStatusCard': () => import('./monitoring/AlertChainStatusCard'),
+  'AnomalyFrequencyCard': () => import('./monitoring/AnomalyFrequencyCard'),
+  'MTTRCard': () => import('./monitoring/MTTRCard'),
+  'KaliScanResultCard': () => import('./monitoring/KaliScanResultCard'),
+}
+```
+
+**`SystemHealthCard` 規格**（最核心的監控 GenUI 卡片）：
+
+```tsx
+// SystemHealthCard 呈現邏輯：
+// - 25 個服務用「燈號矩陣」呈現，非圖表
+// - 每個燈號 hover 顯示服務名稱
+// - 有異常的燈號閃爍（animate-ping）
+// - 右下角「深度調查」按鈕連至 Grafana/SigNoz 新分頁
+
+const SystemHealthCard = () => (
+  <GenUICard title="系統健康矩陣">
+    <div className="grid grid-cols-5 gap-2">
+      {services.map(svc => (
+        <ServiceOrb 
+          key={svc.name}
+          name={svc.name}
+          status={svc.status}
+          // healthy: 靜態綠燈
+          // warning: 黃燈慢速閃爍
+          // critical: 紅燈 animate-ping
+          externalLink={svc.grafana_url}
+        />
+      ))}
+    </div>
+    
+    {/* 摘要行 */}
+    <p className="font-mono text-xs mt-3">
+      25 SERVICES | {healthy_count} HEALTHY | {warning_count} WARNING | {critical_count} CRITICAL
+    </p>
+    
+    {/* 智能跳脫 */}
+    <ExternalLinks grafana sentry signoz />
+  </GenUICard>
+);
+```
+
+---
+
+## 六、監控整合路線圖與優先級
+
+```
+📅 Week 1 (立即，~7h):
+  ├── M-1.1 MinIO Exporter 部署 (1h)
+  ├── M-1.2 Prometheus Federation (2h)
+  └── M-1.3 AI 自治率指數 Metrics 建立 (2h + 2h Config)
+
+📅 Week 2-3 (短期，~10h):
+  ├── M-2.1 FinOps API 端點 (4h)
+  ├── M-2.2 SLO API 端點 (2h)
+  ├── M-2.3 MTTR API 端點 (2h)
+  └── M-2.4 Kali Scanner Webhook 整合 (2h)
+
+📅 Month 2 (中期，~24h):
+  ├── M-3.1 Nexus 頁面 AI 自治率 UI (8h)
+  ├── M-3.2 Omni-Terminal 監控指令 (8h)
+  └── M-3.3 監控 GenUI 卡片擴充 (8h)
+```
+
+**End state once the monitoring integration is complete**:
+
+```
+統帥打開 AWOOOI，看到：
+  ✦ AI 自治率：今日 72%（↗ 比昨日高 8%）
+  ✦ 系統健康：25/25 服務正常
+  ✦ 待決策：0（系統無需要人工干預的事件）
+  ✦ AI 思考流在背後靜默巡邏...
+
+這才是 AI 原生平台，不是監控工具。
+SRE 只在「AI 搞不定的時候」被喚醒，其餘時間人類可以去做更有價值的事。
+```
+
+---
+
+## 七、ADR 建議
+
+本規劃建議新增以下 ADR：
+
+| ADR | 主題 | 核心決策 |
+|-----|------|---------|
+| ADR-038 | 監控 UI 三義分離原則 | 靜默消化 vs 主動浮現 vs 外部跳脫 |
+| ADR-039 | AI 自治率指數 (Autonomy Index) | 指標定義與計算公式 |
+| ADR-040 | Kali 安全掃描整合架構 | .112 → Webhook → AI 分析 |
+| ADR-041 | SLO 與 MTTR 業務指標架構 | 計算方法與展示標準 |
+
+---
+
+*「監控是神經末梢，AI 是大腦。神經不思考，大腦不直接感知。這就是 AWOOOI 的監控哲學。」* 🦞
--- a/docs/proposals/MONITORING_MASTER_PLAN.md
+++ b/docs/proposals/MONITORING_MASTER_PLAN.md
@@ -0,0 +1,219 @@
+# AWOOOI 監控整合主計畫
+
+> **版本**: v1.0
+> **建立日期**: 2026-03-29
+> **狀態**: ✅ 統帥批准
+> **總工時**: 10.75h
+> **整合自**: `MONITORING_INTEGRATION_ARCHITECTURE.md` + `IMPLEMENTATION_STEPS_REMAINING_PHASES.md`
+
+---
+
+## 一、執行摘要
+
+本計畫整合以下兩份文件：
+1. **MONITORING_INTEGRATION_ARCHITECTURE.md** - 監控即代碼架構
+2. **IMPLEMENTATION_STEPS_REMAINING_PHASES.md** - Phase D-G 實施步驟
+
+### 核心發現
+
+| Category | Done | Remaining |
+|------|--------|--------|
+| **Service Registry** | ✅ Includes NVIDIA | - |
+| **Coverage validation** | ✅ validate_coverage.py | generate_monitoring.py |
+| **NVIDIA alerts** | ✅ 5 rules | Grafana Dashboard |
+| **Sentry integration** | ✅ Webhook Handler | Comment write-back (TODO) |
+| **SigNoz integration** | ❌ No Webhook | Handler + Rules |
+| **Alert chain verification** | ❌ None | Smoke Test + CD integration |
+
+---
+
+## 二、工作依賴關係
+
+```
+Layer 0: Infrastructure (no dependencies)
+├── L0.1 Sentry API Token ─────────┐
+└── L0.2 SigNoz alert rules ───────┼──▶ Layer 1
+                                   │
+Layer 1: Webhook chain             │
+├── L1.1 SigNoz Webhook (←L0.2) ───┤
+├── L1.2 Sentry Comment (←L0.1) ───┼──▶ Layer 2
+└── L1.3 Alert Chain Metrics ──────┘
+                                   │
+Layer 2: Alert chain verification  │
+├── L2.1 Smoke Test (←L1.1,L1.2) ──┤
+├── L2.2 Alert Chain Rules (←L1.3) ┼──▶ Layer 3
+└── L2.3 CD Pipeline (←L2.1) ──────┘
+                                   │
+Layer 3: Automation (independent)  │
+├── L3.1 generate_monitoring.py ───┤
+├── L3.2 CI coverage check (←L3.1) ┼──▶ Layer 4
+└── L3.3 Docker auto-discovery ────┘
+                                   │
+Layer 4: Visualization             │
+├── L4.1 NVIDIA Grafana Dashboard ─┤
+└── L4.2 Coverage report (←L3.1) ──┘
+```
+
+---
+
+## 三、Wave 執行計畫
+
+### Wave A: 告警鏈路完善 (P0 - 3.5h)
+
+| # | Task | Effort | Depends on | Parallelizable | File |
+|---|------|------|------|--------|------|
+| A.1 | Configure Sentry API Token | 15min | - | ✅ | GitHub Secret + K8s |
+| A.2 | Deploy SigNoz alert rules | 30min | - | ✅ | `signoz/alerting/rules.yaml` |
+| A.3 | SigNoz Webhook Handler | 45min | A.2 | - | `signoz_webhook.py` |
+| A.4 | Sentry Comment write-back | 30min | A.1 | - | `sentry_webhook.py` |
+| A.5 | Alert Chain Metrics | 30min | - | ✅ | `core/metrics.py` |
+| A.6 | Smoke Test script | 45min | A.3,A.4 | - | `alert_chain_smoke_test.py` |
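+
+The dependency column determines how much of Wave A can actually run in parallel. The sketch below computes the earliest finish time of each task with the standard-library `graphlib`; the task IDs, durations, and dependencies are copied from the table above, while the function name is an assumption for illustration:
+
+```python
+from graphlib import TopologicalSorter
+
+# Minutes per task and predecessor lists, taken from the Wave A table.
+DURATION_MIN = {"A.1": 15, "A.2": 30, "A.3": 45, "A.4": 30, "A.5": 30, "A.6": 45}
+DEPENDS_ON = {
+    "A.1": [], "A.2": [], "A.5": [],
+    "A.3": ["A.2"],
+    "A.4": ["A.1"],
+    "A.6": ["A.3", "A.4"],
+}
+
+
+def earliest_finish(durations: dict, deps: dict) -> dict:
+    """Earliest finish time per task, assuming unlimited parallel workers."""
+    finish = {}
+    # static_order() yields every task after all of its predecessors.
+    for task in TopologicalSorter(deps).static_order():
+        start = max((finish[d] for d in deps[task]), default=0)
+        finish[task] = start + durations[task]
+    return finish
+
+
+finish = earliest_finish(DURATION_MIN, DEPENDS_ON)
+print(sum(DURATION_MIN.values()))  # 195 (serial effort in minutes)
+print(max(finish.values()))        # 120 (critical path A.2 -> A.3 -> A.6)
+```
+
+Run serially the tasks add up to 195 minutes (3.25h), slightly under the 3.5h stated in the heading; with full parallelism the critical path is 2h.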
+
+### Wave B: 鏈路防護 (P1 - 1.5h)
+
+| # | 任務 | 工時 | 依賴 | 檔案 |
+|---|------|------|------|------|
+| B.1 | Alert Chain PrometheusRule | 30min | A.5 | `k8s/monitoring/alert-chain-monitor.yaml` |
+| B.2 | CD Pipeline 整合 | 30min | A.6 | `.github/workflows/cd.yaml` |
+| B.3 | 部署驗證 + 文檔更新 | 30min | B.1,B.2 | ADR + Memory |
+
+### Wave C: 監控自動化 (P2 - 2.75h)
+
+| # | 任務 | 工時 | 依賴 | 檔案 |
+|---|------|------|------|------|
+| C.1 | generate_monitoring.py | 1.5h | - | `ops/monitoring/generate_monitoring.py` |
+| C.2 | CI 監控覆蓋率檢查 | 30min | C.1 | `.github/workflows/cd.yaml` |
+| C.3 | Docker 容器自動發現 | 45min | - | `ops/monitoring/discover_docker.py` |
+
+### Wave D: 可視化 (P3 - 3h)
+
+| # | 任務 | 工時 | 依賴 | 檔案 |
+|---|------|------|------|------|
+| D.1 | NVIDIA Grafana Dashboard | 2h | - | `ops/grafana/nvidia-dashboard.json` |
+| D.2 | 監控覆蓋率報告 | 1h | C.1 | `ops/monitoring/coverage_report.py` |
+
+---
+
+## 四、現有整合點
+
+### Phase 20 (NVIDIA Nemotron) ✅
+
+```yaml
+已完成:
+  - nvidia_provider.py P3 Prometheus Metrics
+  - k8s/monitoring/nvidia-alerts.yaml (5 規則)
+  - ops/monitoring/service-registry.yaml (NVIDIA 條目)
+
+整合:
+  - Wave A.5 Alert Chain Metrics 納入 NVIDIA 告警監控
+```
+
+### K-MON (K3s 監控) ✅
+
+```yaml
+已完成:
+  - k8s/monitoring/k3s-alerts.yaml (20+ 規則)
+  - Blackbox Exporter
+  - kube-state-metrics
+
+整合:
+  - Wave B.1 擴充現有 PrometheusRule
+```
+
+### ADR-034 (Telegram Secrets 注入) ✅
+
+```yaml
+已完成:
+  - Pre-flight 檢查
+  - CD 自動注入
+
+整合:
+  - Wave A.1 使用同樣模式注入 SENTRY_API_TOKEN
+```
+
+---
+
+## 五、Phase D-G 任務對應
+
+| Original Phase | Task | Mapped Wave | Status |
+|----------|------|-----------|------|
+| **Phase D** | Sentry Comment write-back | Wave A.4 | Pending |
+| **Phase E** | SigNoz alert rules | Wave A.2 + A.3 | Pending |
+| **Phase F** | Alert chain E2E verification | Wave A.6 + B.1 + B.2 | Pending |
+| **Phase G** | Learning Service | ✅ Already exists | Integration only |
+
+---
+
+## 六、執行時程
+
+```
+Day 1 (4h):
+├── [parallel] A.1 Sentry Token (15min)
+├── [parallel] A.2 SigNoz Rules (30min)
+├── [parallel] A.5 Alert Chain Metrics (30min)
+├── A.3 SigNoz Webhook (45min)
+├── A.4 Sentry Comment (30min)
+└── A.6 Smoke Test (45min)
+
+Day 2 (2h):
+├── B.1 Alert Chain Rules (30min)
+├── B.2 CD Pipeline (30min)
+└── B.3 Verification + docs (30min)
+
+Day 3+ (can be deferred):
+├── Wave C: Monitoring automation (2.75h)
+└── Wave D: Visualization (3h)
+```
+
+---
+
+## 七、驗收標準
+
+### Wave A 完成條件
+
+- [ ] Sentry Issues automatically receive an AI analysis Comment
+- [ ] SigNoz alerts can trigger Telegram notifications
+- [ ] `alert_chain_smoke_test.py` passes in full
+
+### Wave B 完成條件
+
+- [ ] CD 部署後自動執行 Smoke Test
+- [ ] 告警鏈路斷裂 2 小時內觸發告警
+- [ ] ADR-037 建立並通過審查
+
+### Wave C 完成條件
+
+- [ ] 新服務未註冊時 CI 失敗
+- [ ] 每小時自動掃描新 Docker 容器
+- [ ] 生成的配置與現有配置一致
+
+### Wave D 完成條件
+
+- [ ] NVIDIA Grafana Dashboard 可訪問
+- [ ] 每日覆蓋率報告自動發送
+
+---
+
+## 八、風險評估
+
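+| Risk | Likelihood | Impact | Mitigation |
+|------|------|------|----------|
+| Sentry API Token lacks required permissions | Low | Medium | Test the API call before deploying |
+| SigNoz version does not support alerting | Low | High | Confirm the SigNoz version |
+| Smoke Test false positives | Medium | Low | Set reasonable timeouts + retries |
+| CD Pipeline slows down | Medium | Low | Run the Smoke Test in parallel |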
+
+---
+
+## 九、相關文件
+
+- [MONITORING_INTEGRATION_ARCHITECTURE.md](MONITORING_INTEGRATION_ARCHITECTURE.md)
+- [IMPLEMENTATION_STEPS_REMAINING_PHASES.md](IMPLEMENTATION_STEPS_REMAINING_PHASES.md)
+- [ADR-034 Telegram Secrets 注入](../adr/ADR-034-telegram-secrets-injection.md)
+- [ADR-036 Nemotron Tool Calling](../adr/ADR-036-nemotron-tool-calling-integration.md)
+
+---
+
+**文件結束**
+
+*2026-03-29 統帥批准*
--- a/docs/proposals/NEMOTRON-INTEGRATION-PROPOSAL.md
+++ b/docs/proposals/NEMOTRON-INTEGRATION-PROPOSAL.md
@@ -0,0 +1,873 @@
+# Nemotron 整合提案
+
+> **版本**: 1.1
+> **建立日期**: 2026-03-28 (台北時間)
+> **建立者**: Claude Code
+> **狀態**: ✅ **實測完成，待統帥批准**
+
+---
+
+## 🔥 實測結果摘要 (2026-03-28)
+
+| 指標 | Nemotron (NIM) | Ollama (CPU) | 結論 |
+|------|----------------|--------------|------|
+| **Tool Calling 精準度** | 83.3% (5/6) | ~50% | **Nemotron 勝** |
+| **平均延遲** | 11-23 秒 | 100+ 秒 | **Nemotron 快 5-10x** |
+| **繁中支援** | ✅ 良好 | ✅ 良好 | 平手 |
+| **成本** | 免費 tier | 免費 | 平手 |
+
+**建議**: 將 Nemotron 加入 Tool Calling 任務的首選路由
+
+---
+
+## 目錄
+
+1. [NIM API 整合規格](#1-nim-api-整合規格)
+2. [架構設計](#2-架構設計)
+3. [測試腳本](#3-測試腳本)
+4. [實作計畫](#4-實作計畫)
+
+---
+
+## 1. NIM API 整合規格
+
+### 1.1 Endpoint 資訊
+
+| 項目 | 值 |
+|------|-----|
+| **Base URL** | `https://integrate.api.nvidia.com/v1` |
+| **Chat Completions** | `/chat/completions` |
+| **相容性** | ✅ OpenAI API 格式完全相容 |
+
+### 1.2 認證方式
+
+```bash
+# 環境變數
+export NVIDIA_API_KEY="nvapi-xxxx"
+
+# HTTP Header
+Authorization: Bearer $NVIDIA_API_KEY
+```
+
+### 1.3 可用模型
+
+| 模型 ID | 大小 | 特色 | 建議用途 |
+|---------|------|------|----------|
+| `nvidia/nemotron-mini-4b-instruct` | 4B | 輕量、Tool Calling | 快速分類、簡單決策 |
+| `nvidia/llama-3.1-nemotron-70b-instruct` | 70B | 強推理 | 複雜 Incident 分析 |
+| `nvidia/nemotron-3-super` | 120B (MoE) | 最強、100萬 Token | 多代理協作 |
+
+### 1.4 請求格式 (OpenAI 相容)
+
+```python
+import os
+
+import httpx
+
+NVIDIA_API_KEY = os.environ["NVIDIA_API_KEY"]
+
+response = httpx.post(
+    "https://integrate.api.nvidia.com/v1/chat/completions",
+    headers={
+        "Content-Type": "application/json",
+        "Authorization": f"Bearer {NVIDIA_API_KEY}"
+    },
+    json={
+        "model": "nvidia/nemotron-mini-4b-instruct",
+        "messages": [
+            {"role": "system", "content": "You are an SRE assistant."},
+            {"role": "user", "content": "Analyze this K8s error..."}
+        ],
+        "temperature": 0.2,
+        "max_tokens": 1024,
+        "tools": tools,  # Tool Calling definitions, see section 1.5
+    },
+    # httpx defaults to a 5 s timeout; the measured NIM latency is 11-23 s
+    timeout=60.0,
+)
+response.raise_for_status()
+
+### 1.5 Tool Calling 格式
+
+```python
+tools = [
+    {
+        "type": "function",
+        "function": {
+            "name": "kubectl_execute",
+            "description": "Execute kubectl command on K8s cluster",
+            "parameters": {
+                "type": "object",
+                "properties": {
+                    "command": {
+                        "type": "string",
+                        "description": "kubectl command (e.g., 'get pods -n awoooi-prod')"
+                    },
+                    "namespace": {
+                        "type": "string",
+                        "description": "Target namespace"
+                    }
+                },
+                "required": ["command"]
+            }
+        }
+    },
+    {
+        "type": "function",
+        "function": {
+            "name": "restart_deployment",
+            "description": "Restart a Kubernetes deployment",
+            "parameters": {
+                "type": "object",
+                "properties": {
+                    "deployment": {"type": "string"},
+                    "namespace": {"type": "string"}
+                },
+                "required": ["deployment", "namespace"]
+            }
+        }
+    }
+]
+```
+
+### 1.6 回應格式 (Tool Call)
+
+```json
+{
+  "choices": [{
+    "message": {
+      "role": "assistant",
+      "content": null,
+      "tool_calls": [{
+        "id": "call_abc123",
+        "type": "function",
+        "function": {
+          "name": "restart_deployment",
+          "arguments": "{\"deployment\": \"awoooi-api\", \"namespace\": \"awoooi-prod\"}"
+        }
+      }]
+    },
+    "finish_reason": "tool_calls"
+  }]
+}
+```
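+
+Because `arguments` arrives as a JSON-encoded string rather than an object, it has to be decoded before the call can be dispatched. A minimal sketch; `extract_tool_calls` is a hypothetical helper, and the sample payload is the response shown above:
+
+```python
+import json
+from typing import List, Tuple
+
+
+def extract_tool_calls(response: dict) -> List[Tuple[str, dict]]:
+    """Return (function_name, decoded_arguments) for each tool call in the first choice."""
+    message = response["choices"][0]["message"]
+    calls = []
+    # tool_calls is absent or null when the model answered with plain text
+    for call in message.get("tool_calls") or []:
+        fn = call["function"]
+        calls.append((fn["name"], json.loads(fn["arguments"])))
+    return calls
+
+
+sample_response = {
+    "choices": [{
+        "message": {
+            "role": "assistant",
+            "content": None,
+            "tool_calls": [{
+                "id": "call_abc123",
+                "type": "function",
+                "function": {
+                    "name": "restart_deployment",
+                    "arguments": "{\"deployment\": \"awoooi-api\", \"namespace\": \"awoooi-prod\"}",
+                },
+            }],
+        },
+        "finish_reason": "tool_calls",
+    }]
+}
+
+print(extract_tool_calls(sample_response))
+# [('restart_deployment', {'deployment': 'awoooi-api', 'namespace': 'awoooi-prod'})]
+```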
+
+---
+
+## 2. 架構設計
+
+### 2.1 Fallback 層級調整
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│  現有架構                                                        │
+├─────────────────────────────────────────────────────────────────┤
+│                                                                  │
+│   Tier 1          Tier 2          Tier 3                        │
+│   ┌─────────┐     ┌─────────┐     ┌─────────┐                   │
+│   │ Ollama  │ ──▶ │ Gemini  │ ──▶ │ Claude  │                   │
+│   │ (188)   │     │ (API)   │     │ (API)   │                   │
+│   │ 本地    │     │ 免費額度 │     │ 付費    │                   │
+│   └─────────┘     └─────────┘     └─────────┘                   │
+│                                                                  │
+└─────────────────────────────────────────────────────────────────┘
+
+┌─────────────────────────────────────────────────────────────────┐
+│  新架構 (加入 Nemotron)                                          │
+├─────────────────────────────────────────────────────────────────┤
+│                                                                  │
+│                    ┌──────────────────────────────────┐         │
+│                    │      Smart Model Router          │         │
+│                    │      (任務類型路由)               │         │
+│                    └──────────────────────────────────┘         │
+│                              │                                   │
+│            ┌─────────────────┼─────────────────┐                │
+│            │                 │                 │                │
+│            ▼                 ▼                 ▼                │
+│   ┌─────────────────┐ ┌───────────┐ ┌─────────────────┐        │
+│   │ Tool Calling    │ │ 一般對話  │ │ 複雜推理        │        │
+│   │ 路徑            │ │ 路徑      │ │ 路徑            │        │
+│   └────────┬────────┘ └─────┬─────┘ └────────┬────────┘        │
+│            │                │                │                  │
+│            ▼                ▼                ▼                  │
+│   ┌─────────────────┐ ┌───────────┐ ┌─────────────────┐        │
+│   │ Nemotron (NIM)  │ │ Ollama    │ │ Nemotron-70B    │        │
+│   │ nemotron-mini   │ │ qwen2.5   │ │ 或 Claude       │        │
+│   │ 4B, Tool專用    │ │ 本地      │ │ 高品質          │        │
+│   └────────┬────────┘ └─────┬─────┘ └────────┬────────┘        │
+│            │                │                │                  │
+│            └────────────────┼────────────────┘                  │
+│                             │                                   │
+│                             ▼                                   │
+│                    ┌─────────────────┐                          │
+│                    │ Fallback Chain  │                          │
+│                    │ Gemini → Claude │                          │
+│                    └─────────────────┘                          │
+│                                                                  │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+### 2.2 任務路由規則
+
+```python
+# apps/api/src/services/ai/model_router.py
+
+ROUTING_RULES = {
+    # Tool Calling 任務 → Nemotron 優先
+    "tool_calling": {
+        "primary": "nvidia/nemotron-mini-4b-instruct",
+        "fallback": ["gemini-1.5-flash", "claude-3-haiku"]
+    },
+
+    # K8s 操作決策 → Nemotron 優先
+    "k8s_operation": {
+        "primary": "nvidia/nemotron-mini-4b-instruct",
+        "fallback": ["ollama/qwen2.5:7b", "gemini-1.5-flash"]
+    },
+
+    # Incident 分析 (複雜推理) → Nemotron-70B 或 Claude
+    "incident_analysis": {
+        "primary": "nvidia/llama-3.1-nemotron-70b-instruct",
+        "fallback": ["claude-3-sonnet", "gemini-1.5-pro"]
+    },
+
+    # 一般對話 → 本地 Ollama 優先
+    "general_chat": {
+        "primary": "ollama/qwen2.5:7b",
+        "fallback": ["gemini-1.5-flash", "claude-3-haiku"]
+    },
+
+    # Playbook 生成 → Nemotron (程式碼能力強)
+    "code_generation": {
+        "primary": "nvidia/nemotron-mini-4b-instruct",
+        "fallback": ["ollama/qwen2.5-coder:7b", "claude-3-sonnet"]
+    }
+}
+```
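+
+One way to consume these rules is to try the primary model first and walk the fallback list on failure. A sketch under stated assumptions: `candidate_models`, `call_with_fallback`, and the `invoke` callback are hypothetical names, and only one rule is copied here to keep the example self-contained:
+
+```python
+from typing import Callable, List, Tuple
+
+ROUTING_RULES = {
+    "tool_calling": {
+        "primary": "nvidia/nemotron-mini-4b-instruct",
+        "fallback": ["gemini-1.5-flash", "claude-3-haiku"],
+    },
+}
+
+
+def candidate_models(task_type: str, rules: dict) -> List[str]:
+    """Primary model first, then the fallbacks in their declared order."""
+    rule = rules[task_type]
+    return [rule["primary"], *rule["fallback"]]
+
+
+def call_with_fallback(
+    task_type: str, rules: dict, invoke: Callable[[str], str]
+) -> Tuple[str, str]:
+    """Try each candidate until one succeeds; return (model_used, result)."""
+    last_error = None
+    for model in candidate_models(task_type, rules):
+        try:
+            return model, invoke(model)
+        except Exception as exc:  # a real router would catch provider-specific errors
+            last_error = exc
+    raise RuntimeError(f"all models failed for task type {task_type!r}") from last_error
+
+
+def flaky_invoke(model: str) -> str:
+    """Stand-in for a provider call: the NIM endpoint is simulated as timing out."""
+    if model.startswith("nvidia/"):
+        raise TimeoutError("simulated NIM timeout")
+    return "ok"
+
+
+print(call_with_fallback("tool_calling", ROUTING_RULES, flaky_invoke))
+# ('gemini-1.5-flash', 'ok')
+```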
+
+### 2.3 OpenClaw 整合位置
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│  OpenClaw Decision Flow                                         │
+├─────────────────────────────────────────────────────────────────┤
+│                                                                  │
+│  1. Incident 進入                                                │
+│     │                                                            │
+│     ▼                                                            │
+│  2. Intent Classifier (意圖分類)                                 │
+│     │  └── Ollama qwen2.5 (本地、快速)                           │
+│     │                                                            │
+│     ▼                                                            │
+│  3. Complexity Analyzer (複雜度評估)                             │
+│     │  └── Ollama qwen2.5 (本地、快速)                           │
+│     │                                                            │
+│     ▼                                                            │
+│  4. Decision Manager (決策生成) ← 🔴 Nemotron 在這裡！           │
+│     │  ├── Tool Calling 決策 → Nemotron-mini (NIM)               │
+│     │  ├── 複雜推理 → Nemotron-70B (NIM)                         │
+│     │  └── 一般回覆 → Ollama/Gemini                              │
+│     │                                                            │
+│     ▼                                                            │
+│  5. Trust Engine (信任驗證)                                      │
+│     │                                                            │
+│     ▼                                                            │
+│  6. Multi-Sig (需要時)                                           │
+│     │                                                            │
+│     ▼                                                            │
+│  7. K8s Executor (執行)                                          │
+│                                                                  │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+### 2.4 Environment Variable Configuration
+
+```bash
+# Add to .env.production
+
+# NVIDIA NIM API
+NVIDIA_API_KEY=nvapi-xxxx
+NVIDIA_API_BASE_URL=https://integrate.api.nvidia.com/v1
+
+# Model selection
+NEMOTRON_TOOL_MODEL=nvidia/nemotron-mini-4b-instruct
+NEMOTRON_REASONING_MODEL=nvidia/llama-3.1-nemotron-70b-instruct
+
+# Rate limiting (protects the free-tier quota)
+NEMOTRON_RPM_LIMIT=60
+NEMOTRON_TPM_LIMIT=100000
+```
+
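The `NEMOTRON_RPM_LIMIT` value above only has an effect if the API client enforces it. A minimal sketch of one way to do that, assuming a sliding 60-second window; `SlidingWindowLimiter` and its `clock` parameter are hypothetical names for illustration, not part of the existing `rate_limiter.py`:

```python
import os
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` events per `window_s` seconds (hypothetical helper)."""

    def __init__(self, limit: int, window_s: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock          # injectable so the limiter can be tested without sleeping
        self._events = deque()      # timestamps of accepted events, oldest first

    def try_acquire(self) -> bool:
        """Record one request if the window still has room; return whether it was accepted."""
        now = self.clock()
        # Drop timestamps that have left the window.
        while self._events and now - self._events[0] >= self.window_s:
            self._events.popleft()
        if len(self._events) < self.limit:
            self._events.append(now)
            return True
        return False


# Read the limit from the environment, falling back to the value shown above.
rpm_limit = int(os.getenv("NEMOTRON_RPM_LIMIT", "60"))
limiter = SlidingWindowLimiter(limit=rpm_limit)
```

A caller would check `limiter.try_acquire()` before each NIM request and wait or fall back when it returns `False`. A token (TPM) budget could be tracked in a similar way by recording token counts instead of one event per call.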
+---
+
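+## 3. Test Scripts
+
+### 3.1 Tool Calling Accuracy Test
+
+```python
+#!/usr/bin/env python3
+"""
+Nemotron tool calling accuracy test.
+Compares the tool calling ability of Nemotron (NVIDIA NIM) and Qwen (local Ollama).
+Gemini is not called yet: GEMINI_API_KEY is read but unused by this script.
+
+Usage:
+    export NVIDIA_API_KEY=nvapi-xxxx
+    export GEMINI_API_KEY=xxxx
+    python test_nemotron_tool_calling.py
+"""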
+
+import os
+import json
+import httpx
+import asyncio
+from dataclasses import dataclass
+from typing import Optional
+import time
+
+# ============================================================================
+# Configuration
+# ============================================================================
+
+NVIDIA_API_KEY = os.getenv("NVIDIA_API_KEY")
+GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")  # read but currently unused
+OLLAMA_BASE_URL = "http://192.168.0.188:11434"
+
+# ============================================================================
+# Tool definitions (K8s SRE scenario)
+# ============================================================================
+
+TOOLS = [
+    {
+        "type": "function",
+        "function": {
+            "name": "kubectl_get",
+            "description": "Get Kubernetes resources (pods, deployments, services, etc.)",
+            "parameters": {
+                "type": "object",
+                "properties": {
+                    "resource": {
+                        "type": "string",
+                        "enum": ["pods", "deployments", "services", "nodes", "events"],
+                        "description": "Resource type to query"
+                    },
+                    "namespace": {
+                        "type": "string",
+                        "description": "Kubernetes namespace (default: awoooi-prod)"
+                    },
+                    "name": {
+                        "type": "string",
+                        "description": "Specific resource name (optional)"
+                    }
+                },
+                "required": ["resource"]
+            }
+        }
+    },
+    {
+        "type": "function",
+        "function": {
+            "name": "restart_deployment",
+            "description": "Restart a Kubernetes deployment by rolling restart",
+            "parameters": {
+                "type": "object",
+                "properties": {
+                    "deployment": {
+                        "type": "string",
+                        "description": "Deployment name"
+                    },
+                    "namespace": {
+                        "type": "string",
+                        "description": "Kubernetes namespace"
+                    }
+                },
+                "required": ["deployment", "namespace"]
+            }
+        }
+    },
+    {
+        "type": "function",
+        "function": {
+            "name": "scale_deployment",
+            "description": "Scale a Kubernetes deployment to specified replicas",
+            "parameters": {
+                "type": "object",
+                "properties": {
+                    "deployment": {"type": "string"},
+                    "namespace": {"type": "string"},
+                    "replicas": {"type": "integer", "minimum": 0, "maximum": 10}
+                },
+                "required": ["deployment", "namespace", "replicas"]
+            }
+        }
+    },
+    {
+        "type": "function",
+        "function": {
+            "name": "get_logs",
+            "description": "Get logs from a Kubernetes pod",
+            "parameters": {
+                "type": "object",
+                "properties": {
+                    "pod": {"type": "string"},
+                    "namespace": {"type": "string"},
+                    "tail": {"type": "integer", "description": "Number of lines (default: 100)"},
+                    "container": {"type": "string", "description": "Container name (optional)"}
+                },
+                "required": ["pod", "namespace"]
+            }
+        }
+    },
+    {
+        "type": "function",
+        "function": {
+            "name": "send_alert",
+            "description": "Send alert notification via Telegram",
+            "parameters": {
+                "type": "object",
+                "properties": {
+                    "severity": {"type": "string", "enum": ["info", "warning", "critical"]},
+                    "message": {"type": "string"},
+                    "incident_id": {"type": "string"}
+                },
+                "required": ["severity", "message"]
+            }
+        }
+    }
+]
+
+# ============================================================================
+# Test cases
+# ============================================================================
+
+TEST_CASES = [
+    {
+        "id": "TC001",
+        "description": "Simple query: list all pods",
+        "prompt": "Show me all pods in awoooi-prod namespace",
+        "expected_tool": "kubectl_get",
+        "expected_params": {"resource": "pods", "namespace": "awoooi-prod"}
+    },
+    {
+        "id": "TC002",
+        "description": "Restart a service",
+        "prompt": "The API is not responding, please restart the awoooi-api deployment",
+        "expected_tool": "restart_deployment",
+        "expected_params": {"deployment": "awoooi-api", "namespace": "awoooi-prod"}
+    },
+    {
+        "id": "TC003",
+        "description": "Scale replicas",
+        "prompt": "We're getting high traffic, scale awoooi-web to 3 replicas",
+        "expected_tool": "scale_deployment",
+        "expected_params": {"deployment": "awoooi-web", "replicas": 3}
+    },
+    {
+        "id": "TC004",
+        "description": "View logs",
+        "prompt": "Get the last 50 lines of logs from awoooi-api-xxx pod",
+        "expected_tool": "get_logs",
+        "expected_params": {"tail": 50}
+    },
+    {
+        "id": "TC005",
+        "description": "Send an alert",
+        "prompt": "Send a critical alert: Database connection failed for incident INC-2026-001",
+        "expected_tool": "send_alert",
+        "expected_params": {"severity": "critical"}
+    },
+    {
+        "id": "TC006",
+        "description": "Compound understanding: requires reasoning",
+        "prompt": "The web frontend is showing 502 errors. Check if the API pods are running.",
+        "expected_tool": "kubectl_get",
+        "expected_params": {"resource": "pods"}
+    },
+    {
+        "id": "TC007",
+        "description": "Traditional Chinese instruction",
+        "prompt": "請重啟 awoooi-worker 這個 deployment",
+        "expected_tool": "restart_deployment",
+        "expected_params": {"deployment": "awoooi-worker"}
+    },
+    {
+        "id": "TC008",
+        "description": "Ambiguous instruction: requires reasoning",
+        "prompt": "Something is wrong with the worker, it keeps crashing. Fix it.",
+        "expected_tool": "restart_deployment",  # or get_logs (the scoring below only accepts expected_tool)
+        "expected_params": {}  # several reasonable responses are acceptable
+    }
+]
+
+# ============================================================================
+# API clients
+# ============================================================================
+
+@dataclass
+class ToolCallResult:
+    model: str
+    test_id: str
+    success: bool
+    tool_called: Optional[str]
+    params: Optional[dict]
+    latency_ms: float
+    error: Optional[str] = None
+
+async def call_nemotron(prompt: str, model: str = "nvidia/nemotron-mini-4b-instruct") -> dict:
+    """Call the NVIDIA NIM API."""
+    async with httpx.AsyncClient(timeout=30) as client:
+        start = time.time()
+        response = await client.post(
+            "https://integrate.api.nvidia.com/v1/chat/completions",
+            headers={
+                "Content-Type": "application/json",
+                "Authorization": f"Bearer {NVIDIA_API_KEY}"
+            },
+            json={
+                "model": model,
+                "messages": [
+                    {"role": "system", "content": "You are an SRE assistant for AWOOOI AIOps platform. Use the provided tools to help with Kubernetes operations."},
+                    {"role": "user", "content": prompt}
+                ],
+                "tools": TOOLS,
+                "tool_choice": "auto",
+                "temperature": 0.1,
+                "max_tokens": 512
+            }
+        )
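+        latency = (time.time() - start) * 1000
+        response.raise_for_status()  # surface HTTP errors (401, 429, ...) instead of parsing an error payload
+        return {"data": response.json(), "latency_ms": latency}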
+
+async def call_ollama(prompt: str, model: str = "qwen2.5:7b") -> dict:
+    """Call the local Ollama server."""
+    async with httpx.AsyncClient(timeout=60) as client:
+        start = time.time()
+        response = await client.post(
+            f"{OLLAMA_BASE_URL}/api/chat",
+            json={
+                "model": model,
+                "messages": [
+                    {"role": "system", "content": "You are an SRE assistant. Respond with JSON indicating which tool to call and parameters."},
+                    {"role": "user", "content": f"Based on this request, which tool should be called and with what parameters? Request: {prompt}\n\nAvailable tools: kubectl_get, restart_deployment, scale_deployment, get_logs, send_alert\n\nRespond in JSON format: {{\"tool\": \"tool_name\", \"params\": {{...}}}}"}
+                ],
+                "stream": False,
+                "format": "json"
+            }
+        )
+        latency = (time.time() - start) * 1000
+        response.raise_for_status()  # surface HTTP errors instead of parsing an error payload
+        return {"data": response.json(), "latency_ms": latency}
+
+# ============================================================================
+# Test execution
+# ============================================================================
+
+def parse_tool_call(response: dict, model_type: str) -> tuple:
+    """Parse the tool call out of each model's response format."""
+    try:
+        if model_type == "nemotron":
+            choices = response.get("choices", [])
+            if not choices:
+                # No choices usually means the API returned an error payload
+                return (None, {"error": "no choices in response"})
+            message = choices[0].get("message", {})
+            if message.get("tool_calls"):
+                tool_call = message["tool_calls"][0]
+                return (
+                    tool_call["function"]["name"],
+                    json.loads(tool_call["function"]["arguments"])
+                )
+            # No tool_calls: fall back to the plain-text content
+            return (None, {"content": message.get("content", "")})
+
+        elif model_type == "ollama":
+            content = response.get("message", {}).get("content", "{}")
+            parsed = json.loads(content)
+            return (parsed.get("tool"), parsed.get("params", {}))
+
+    except Exception as e:
+        return (None, {"error": str(e)})
+
+    return (None, {})
+
+async def run_test(test_case: dict) -> list:
+    """Run a single test case against every available model."""
+    results = []
+    prompt = test_case["prompt"]
+
+    # Test Nemotron (skipped when NVIDIA_API_KEY is not set)
+    if NVIDIA_API_KEY:
+        try:
+            resp = await call_nemotron(prompt)
+            tool, params = parse_tool_call(resp["data"], "nemotron")
+            success = tool == test_case["expected_tool"]
+            results.append(ToolCallResult(
+                model="Nemotron-mini-4B",
+                test_id=test_case["id"],
+                success=success,
+                tool_called=tool,
+                params=params,
+                latency_ms=resp["latency_ms"]
+            ))
+        except Exception as e:
+            results.append(ToolCallResult(
+                model="Nemotron-mini-4B",
+                test_id=test_case["id"],
+                success=False,
+                tool_called=None,
+                params=None,
+                latency_ms=0,
+                error=str(e)
+            ))
+
+    # Test Ollama
+    try:
+        resp = await call_ollama(prompt)
+        tool, params = parse_tool_call(resp["data"], "ollama")
+        success = tool == test_case["expected_tool"]
+        results.append(ToolCallResult(
+            model="Ollama-Qwen2.5-7B",
+            test_id=test_case["id"],
+            success=success,
+            tool_called=tool,
+            params=params,
+            latency_ms=resp["latency_ms"]
+        ))
+    except Exception as e:
+        results.append(ToolCallResult(
+            model="Ollama-Qwen2.5-7B",
+            test_id=test_case["id"],
+            success=False,
+            tool_called=None,
+            params=None,
+            latency_ms=0,
+            error=str(e)
+        ))
+
+    return results
+
+async def main():
+    """Main test flow."""
+    print("=" * 70)
+    print("Nemotron vs Ollama tool calling accuracy test")
+    print("=" * 70)
+    print()
+
+    all_results = []
+
+    for tc in TEST_CASES:
+        print(f"[{tc['id']}] {tc['description']}")
+        print(f"    Prompt: {tc['prompt'][:50]}...")
+        print(f"    Expected: {tc['expected_tool']}")
+
+        results = await run_test(tc)
+        all_results.extend(results)
+
+        for r in results:
+            status = "✅" if r.success else "❌"
+            print(f"    {r.model}: {status} → {r.tool_called} ({r.latency_ms:.0f}ms)")
+            if r.error:
+                print(f"        Error: {r.error}")
+        print()
+
+    # Summary statistics
+    print("=" * 70)
+    print("Summary")
+    print("=" * 70)
+
+    models = {}
+    for r in all_results:
+        if r.model not in models:
+            models[r.model] = {"success": 0, "total": 0, "latency": []}
+        models[r.model]["total"] += 1
+        if r.success:
+            models[r.model]["success"] += 1
+        if r.latency_ms > 0:
+            models[r.model]["latency"].append(r.latency_ms)
+
+    print(f"{'Model':<25} {'Accuracy':<15} {'Avg Latency':<15}")
+    print("-" * 55)
+    for model, stats in models.items():
+        acc = stats["success"] / stats["total"] * 100 if stats["total"] > 0 else 0
+        avg_lat = sum(stats["latency"]) / len(stats["latency"]) if stats["latency"] else 0
+        print(f"{model:<25} {acc:>6.1f}%        {avg_lat:>8.0f}ms")
+
+    print()
+    print("Test complete!")
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
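The script above counts a call as successful when the tool name matches, without comparing the arguments against `expected_params`. The following sketch shows one way the check could be tightened. `params_match`, `score_call`, and the optional `acceptable_tools` field are hypothetical additions for illustration and are not part of the original script:

```python
from typing import Optional


def params_match(expected: dict, actual: Optional[dict]) -> bool:
    """True when every expected key/value pair appears in the actual arguments."""
    if actual is None:
        return False
    # Extra arguments from the model are tolerated; only the expected ones are checked.
    return all(actual.get(key) == value for key, value in expected.items())


def score_call(test_case: dict, tool: Optional[str], params: Optional[dict]) -> bool:
    """Pass only if the tool is acceptable and the expected params are present.

    `acceptable_tools` is a hypothetical optional field; when absent, only
    `expected_tool` is accepted, which matches the original behavior.
    """
    acceptable = test_case.get("acceptable_tools", [test_case["expected_tool"]])
    return tool in acceptable and params_match(test_case["expected_params"], params)
```

With this helper, a case such as TC008 could list both `restart_deployment` and `get_logs` under `acceptable_tools`, and TC003 would fail if the model scaled to the wrong replica count.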
+### 3.2 Quick Verification Script (curl)
+
+```bash
+#!/bin/bash
+# quick_test_nemotron.sh
+# Quick check of Nemotron API connectivity
+
+set -e
+
+echo "=== Nemotron API quick test ==="
+echo ""
+
+# Check the API key
+if [ -z "$NVIDIA_API_KEY" ]; then
+    echo "❌ NVIDIA_API_KEY is not set"
+    echo "   export NVIDIA_API_KEY=nvapi-xxxx"
+    exit 1
+fi
+
+echo "✅ API key is set"
+echo ""
+
+# Test a simple request
+echo "Test 1: simple chat..."
+curl -s -X POST "https://integrate.api.nvidia.com/v1/chat/completions" \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer $NVIDIA_API_KEY" \
+  -d '{
+    "model": "nvidia/nemotron-mini-4b-instruct",
+    "messages": [{"role": "user", "content": "Say hello in JSON format"}],
+    "max_tokens": 50
+  }' | jq '.choices[0].message.content'
+
+echo ""
+echo "Test 2: Tool Calling..."
+curl -s -X POST "https://integrate.api.nvidia.com/v1/chat/completions" \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer $NVIDIA_API_KEY" \
+  -d '{
+    "model": "nvidia/nemotron-mini-4b-instruct",
+    "messages": [
+      {"role": "system", "content": "You are a K8s assistant."},
+      {"role": "user", "content": "Restart the nginx deployment in production namespace"}
+    ],
+    "tools": [{
+      "type": "function",
+      "function": {
+        "name": "restart_deployment",
+        "description": "Restart a K8s deployment",
+        "parameters": {
+          "type": "object",
+          "properties": {
+            "deployment": {"type": "string"},
+            "namespace": {"type": "string"}
+          },
+          "required": ["deployment", "namespace"]
+        }
+      }
+    }],
+    "tool_choice": "auto",
+    "max_tokens": 200
+  }' | jq '.choices[0].message'
+
+echo ""
+echo "=== Test complete ==="
+```
+
+---
+
+## 4. Implementation Plan
+
+### 4.1 Phases
+
+```
+Phase N.1: Validation (1-2 days)
+──────────────────────────
+├── Register at build.nvidia.com
+├── Obtain NVIDIA_API_KEY
+├── Run quick_test_nemotron.sh
+├── Run the full tool calling test
+└── Analyze the results and decide whether to continue
+
+Phase N.2: Integration (2-3 days)
+──────────────────────────
+├── Create NvidiaAIProvider (modeled on the existing GeminiProvider)
+├── Add routing rules to the Model Router
+├── Configure environment variables + K8s Secrets
+├── Integrate Langfuse tracing
+└── Unit tests
+
+Phase N.3: Acceptance (1 day)
+──────────────────────────
+├── E2E tests (real incident scenarios)
+├── Latency + cost analysis
+├── Chief Architect review
+└── Commander approval for go-live
+```
+
+### 4.2 File Structure
+
+```
+apps/api/src/
+├── services/
+│   └── ai/
+│       ├── providers/
+│       │   ├── ollama_provider.py    # existing
+│       │   ├── gemini_provider.py    # existing
+│       │   ├── claude_provider.py    # existing
+│       │   └── nvidia_provider.py    # 🆕 new
+│       │
+│       ├── model_router.py           # modify: add Nemotron routing
+│       └── rate_limiter.py           # modify: add Nemotron rate limiting
+```
+
+### 4.3 New GitHub and K8s Secrets
+
+```bash
+# Add to GitHub Secrets (in the repository settings, or with the gh CLI as shown)
+gh secret set NVIDIA_API_KEY --body "nvapi-xxxx"
+
+# Add to K8s Secrets
+kubectl create secret generic nvidia-api \
+  --from-literal=NVIDIA_API_KEY=nvapi-xxxx \
+  -n awoooi-prod
+```
+
+---
+
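+## 5. Cost Estimate
+
+### 5.1 Free Tier
+
+| Item | Estimate |
+|------|------|
+| **Development and testing** | Free (build.nvidia.com) |
+| **Rate limit** | To be confirmed (possibly 60 RPM) |
+
+### 5.2 Production (if paid usage is needed)
+
+| Model | Price (estimated) | Monthly usage | Monthly cost |
+|------|-------------|--------|--------|
+| nemotron-mini-4b | ~$0.1/1M tokens | ~5M tokens | ~$0.5 |
+| nemotron-70b | ~$1.0/1M tokens | ~1M tokens | ~$1.0 |
+
+**Conclusion**: At these estimated prices the total is about $1.5 per month, which is very low and expected to be well below the cost of the Claude API (no Claude figures are listed here for comparison).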
+
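The monthly figures in section 5.2 follow from multiplying the estimated price per million tokens by the estimated monthly volume. A small sketch of that arithmetic, using only the estimates from the table (they are placeholders, not confirmed NVIDIA pricing):

```python
# Estimated prices (USD per 1M tokens) and monthly volumes, copied from the table above.
ESTIMATES = {
    "nemotron-mini-4b": {"usd_per_million": 0.1, "million_tokens": 5},
    "nemotron-70b": {"usd_per_million": 1.0, "million_tokens": 1},
}


def monthly_cost(estimates: dict) -> dict:
    """Return the estimated monthly cost in USD for each model."""
    return {
        name: round(item["usd_per_million"] * item["million_tokens"], 2)
        for name, item in estimates.items()
    }


costs = monthly_cost(ESTIMATES)
total = round(sum(costs.values()), 2)
print(costs)   # per-model monthly cost
print(total)   # combined monthly cost
```

Changing the volumes in `ESTIMATES` gives a quick way to see how the total scales once real usage data from Langfuse or the NIM dashboard is available.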
+---
+
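+## 6. Risk Assessment
+
+| Risk | Probability | Impact | Mitigation |
+|------|------|------|----------|
+| Free quota runs out | Medium | Low | Fall back to Gemini |
+| High API latency | Low | Medium | Local cache + timeout |
+| Poor tool calling accuracy | Low | High | Verify during the test phase |
+| Unstable service | Low | Medium | Multi-layer fallback |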
+
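Several mitigations in the table rely on a timeout plus a fallback chain. A minimal sketch of that pattern, assuming each provider is an async callable that takes a prompt and returns a reply; `call_with_fallback` and the stub providers are hypothetical and do not reflect the actual Model Router code:

```python
import asyncio
from typing import Awaitable, Callable, Sequence, Tuple

Provider = Callable[[str], Awaitable[str]]


async def call_with_fallback(
    prompt: str,
    providers: Sequence[Tuple[str, Provider]],
    timeout_s: float = 10.0,
) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, reply) from the first success."""
    errors = []
    for name, provider in providers:
        try:
            # A slow provider is treated like a failed one.
            reply = await asyncio.wait_for(provider(prompt), timeout=timeout_s)
            return name, reply
        except Exception as exc:  # includes asyncio.TimeoutError
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stub providers for illustration only; real ones would call NIM, Gemini, or Ollama.
async def failing_nemotron(prompt: str) -> str:
    raise ConnectionError("quota exceeded")


async def stub_gemini(prompt: str) -> str:
    return f"gemini: {prompt}"


if __name__ == "__main__":
    chain = [("nemotron", failing_nemotron), ("gemini", stub_gemini)]
    print(asyncio.run(call_with_fallback("ping", chain)))
```

Ordering the chain as Nemotron, then Gemini, then local Ollama would match the fallbacks named in the table; collecting the errors makes it possible to log why each layer was skipped.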
+---
+
+## Appendix: Next Steps
+
+Once the Commander approves, run the following immediately:
+
+```bash
+# Step 1: Obtain an API key
+# Go to https://build.nvidia.com, register, and obtain a key
+
+# Step 2: Set the environment variable
+export NVIDIA_API_KEY=nvapi-xxxx
+
+# Step 3: Quick verification
+cd apps/api
+./scripts/quick_test_nemotron.sh
+
+# Step 4: Full test
+python scripts/test_nemotron_tool_calling.py
+```
+
+---
+
+**Author**: Claude Code
+**Date**: 2026-03-28 (Taipei time)
+**Status**: Pending review
--- a/docs/proposals/NEMOTRON-INTEGRATION-SOLUTION.md
+++ b/docs/proposals/NEMOTRON-INTEGRATION-SOLUTION.md
--- a/docs/runbooks/RUNBOOK-E2E-CI-SCHEDULE.md
+++ b/docs/runbooks/RUNBOOK-E2E-CI-SCHEDULE.md
@@ -0,0 +1,336 @@
+# RunBook: Scheduling Periodic E2E Playwright CI Runs
+
+> **Type**: Operational RunBook  
+> **Priority**: 🔴 P0  
+> **Created**: 2026-03-29 12:38 (Taipei)  
+> **Author**: Antigravity  
+> **Estimated effort**: 30 minutes  
+> **Prerequisite**: The Playwright tests run successfully on a local machine with `pnpm test:e2e`
+
+---
+
+## Background and Current State
+
+### 🔍 Precise Diagnosis of the Current State
+
+**The 12 existing E2E test files** (`apps/web/tests/e2e/`):
+
+| Test file | Scope |
+|---------|---------|
+| `dashboard-acceptance.spec.ts` | Home dashboard acceptance |
+| `multisig-security.spec.ts` | Multi-signature approval security |
+| `approval-card-verify.spec.ts` | Approval card verification |
+| `phase11-conversational.spec.ts` | Phase 11 conversational features |
+| `phase19-production-verification.spec.ts` | Phase 19 production verification |
+| `action-log.spec.ts` | Action log |
+| `cpo102-visual.spec.ts` | Visual screenshot tests |
+| `visual-armor-upgrade.spec.ts` | Visual upgrade verification |
+| `debug-error.spec.ts` | Error page |
+| `rbac-screenshot.spec.ts` | RBAC screenshot verification |
+| `phase4-final-demo.spec.ts` | Phase 4 demo |
+| `phase4-timeline.spec.ts` | Phase 4 timeline |
+
+**Gap**: `playwright.config.ts` is already configured, but `.github/workflows/` contains **no scheduled run**.
+
+---
+
+## Step 1: Create the Scheduled E2E Workflow
+
+Create `.github/workflows/e2e-weekly.yaml`:
+
+```yaml
+name: 🎭 E2E Playwright Weekly Acceptance
+
+on:
+  # Runs every Monday at 02:30 Taipei time (18:30 UTC on Sunday)
+  schedule:
+    - cron: '30 18 * * 0'
+  
+  # Allow manual triggering
+  workflow_dispatch:
+    inputs:
+      environment:
+        description: 'Test environment URL'
+        required: false
+        default: 'https://192.168.0.120:32335'
+      test_suite:
+        description: 'Test suite (all / smoke / visual)'
+        required: false
+        default: 'all'
+
+jobs:
+  e2e-test:
+    name: 🎭 E2E Playwright (${{ matrix.browser }})
+    runs-on: [self-hosted, harbor]  # uses the GitHub runner on .110
+    
+    strategy:
+      fail-fast: false              # one browser failing does not cancel the others
+      matrix:
+        browser: [chromium, firefox]
+    
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+      
+      - name: Install pnpm
+        uses: pnpm/action-setup@v4
+        with:
+          version: 9
+
+      - name: Setup Node.js
+        uses: actions/setup-node@v4
+        with:
+          node-version: '20'
+          cache: 'pnpm'
+
+      - name: Install dependencies
+        run: pnpm install --frozen-lockfile
+
+      - name: Install Playwright Browsers
+        run: pnpm --filter web exec playwright install --with-deps ${{ matrix.browser }}
+
+      - name: 🎭 Run E2E Tests
+        run: |
+          cd apps/web
+          pnpm exec playwright test \
+            --project=${{ matrix.browser }} \
+            --reporter=html \
+            ${SUITE_FILTER}
+        env:
+          # Map the test_suite input to a --grep filter; 'all' (or a scheduled run) applies no filter
+          SUITE_FILTER: ${{ github.event.inputs.test_suite == 'smoke' && '--grep @smoke' || github.event.inputs.test_suite == 'visual' && '--grep @visual' || '' }}
+          BASE_URL: ${{ github.event.inputs.environment || 'http://192.168.0.120:32335' }}
+          CI: true
+
+      - name: 📁 Upload Test Report
+        uses: actions/upload-artifact@v4
+        if: always()
+        with:
+          name: playwright-report-${{ matrix.browser }}-${{ github.run_id }}
+          path: apps/web/playwright-report/
+          retention-days: 14
+
+      - name: 📸 Upload Screenshots on Failure
+        uses: actions/upload-artifact@v4
+        if: failure()
+        with:
+          name: e2e-screenshots-${{ matrix.browser }}-${{ github.run_id }}
+          path: apps/web/test-results/
+          retention-days: 7
+
+      - name: 🚨 Notify Telegram on Failure
+        if: failure()
+        run: |
+          # Message lines keep the block's indentation so the YAML block scalar does not end early
+          curl -s -X POST "https://api.telegram.org/bot${TG_BOT_TOKEN}/sendMessage" \
+            -d chat_id="${TG_CHAT_ID}" \
+            -d parse_mode="HTML" \
+            -d text="🎭 <b>E2E weekly test failed</b>
+
+          Browser: ${{ matrix.browser }}
+          Trigger: ${{ github.event_name }}
+          Time: $(TZ='Asia/Taipei' date '+%Y-%m-%d %H:%M:%S')
+
+          Report: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
+
+          🔍 Please investigate the UI regression immediately"
+        env:
+          TG_BOT_TOKEN: ${{ secrets.OPENCLAW_TG_BOT_TOKEN }}
+          TG_CHAT_ID: ${{ secrets.OPENCLAW_TG_CHAT_ID }}
+
+  # Visual screenshot comparison (separate job, runs only on the schedule)
+  visual-regression:
+    name: 📸 Visual Regression Comparison
+    runs-on: [self-hosted, harbor]
+    if: github.event_name == 'schedule'  # run only on the periodic schedule
+    needs: e2e-test
+    
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+      
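+      - name: Install pnpm
+        uses: pnpm/action-setup@v4
+        with:
+          version: 9
+
+      - name: Setup Node.js
+        uses: actions/setup-node@v4
+        with:
+          node-version: '20'
+          cache: 'pnpm'
+
+      - name: Install deps & Playwright browser
+        run: |
+          pnpm install --frozen-lockfile
+          pnpm --filter web exec playwright install --with-deps chromium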
+
+      - name: 📸 Generate Visual Snapshots
+        run: |
+          cd apps/web
+          pnpm exec playwright test \
+            --project=chromium \
+            --grep @visual \
+            --reporter=html \
+            --update-snapshots=missing   # create a baseline automatically for new screenshots
+        env:
+          BASE_URL: 'http://192.168.0.120:32335'
+          CI: true
+
+      - name: 📁 Upload Visual Snapshots
+        uses: actions/upload-artifact@v4
+        with:
+          name: visual-snapshots-${{ github.run_id }}
+          path: apps/web/tests/e2e/__snapshots__/
+          retention-days: 30
+
+      - name: 🚨 Notify Telegram on Visual Regression
+        if: failure()
+        run: |
+          # Message lines keep the block's indentation so the YAML block scalar does not end early
+          curl -s -X POST "https://api.telegram.org/bot${TG_BOT_TOKEN}/sendMessage" \
+            -d chat_id="${TG_CHAT_ID}" \
+            -d parse_mode="HTML" \
+            -d text="📸 <b>Visual regression test failed</b>
+
+          The AWOOOI frontend has changed visually!
+          Time: $(TZ='Asia/Taipei' date '+%Y-%m-%d %H:%M:%S')
+
+          Commander, please review the screenshot comparison report:
+          ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
+        env:
+          TG_BOT_TOKEN: ${{ secrets.OPENCLAW_TG_BOT_TOKEN }}
+          TG_CHAT_ID: ${{ secrets.OPENCLAW_TG_CHAT_ID }}
+```
+
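GitHub Actions evaluates `schedule` cron expressions in UTC, so the comment in the workflow depends on the conversion to Taipei time being right. A short sketch that checks the conversion for a weekly slot; the helper name `cron_utc_to_taipei` is made up for this example:

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

# Taiwan does not observe daylight saving time, so a fixed UTC+8 offset is enough.
TAIPEI = timezone(timedelta(hours=8))


def cron_utc_to_taipei(minute: int, hour: int, cron_dow: int) -> Tuple[str, str]:
    """Convert a weekly UTC cron slot (cron_dow: 0 = Sunday) to (weekday, HH:MM) in Taipei."""
    # 2026-03-29 is a Sunday, so it serves as the reference for cron day-of-week 0.
    base_sunday = datetime(2026, 3, 29, tzinfo=timezone.utc)
    utc_time = base_sunday + timedelta(days=cron_dow, hours=hour, minutes=minute)
    local = utc_time.astimezone(TAIPEI)
    return local.strftime("%A"), local.strftime("%H:%M")


# '30 18 * * 0' in the workflow above
print(cron_utc_to_taipei(30, 18, 0))  # ('Monday', '02:30')
```

The same helper can be used when the schedule is moved, for example to confirm that a slot still falls outside Taipei working hours.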
+---
+
+## Step 2: Tag Existing Tests as Smoke / Visual
+
+```typescript
+// Add a tag to the top-level describe block of each spec file:
+
+// dashboard-acceptance.spec.ts (core functionality, tagged @smoke)
+test.describe('Dashboard Acceptance @smoke', () => { ... });
+
+// cpo102-visual.spec.ts (visual screenshots, tagged @visual)
+test.describe('Visual Regression @visual', () => {
+  test('CPO-102 homepage visual', async ({ page }) => {
+    await page.goto('/');
+    await expect(page).toHaveScreenshot('dashboard-baseline.png', {
+      fullPage: true,
+      threshold: 0.02  // per-pixel color tolerance (0-1); maxDiffPixelRatio is what caps the share of differing pixels
+    });
+  });
+});
+```
+
+**Suggested tagging**:
+
+| Tag | Tests included | Run frequency |
+|------|---------|---------|
+| `@smoke` | dashboard, approval, action-log | After every CD + weekly |
+| `@visual` | cpo102-visual, visual-armor | Weekly schedule only |
+| (no tag) | All other tests | Weekly schedule only |
+
+---
+
+## Step 3: Deploy and Verify
+
+```bash
+# 1. Commit the workflow file
+git add .github/workflows/e2e-weekly.yaml
+git commit -m "feat(ci): add weekly E2E Playwright schedule with Telegram failure notification"
+git push origin main
+
+# 2. Trigger a manual run (to confirm the workflow works)
+gh workflow run e2e-weekly.yaml \
+  -f environment="http://192.168.0.120:32335" \
+  -f test_suite="smoke"
+
+# 3. Watch the workflow run
+gh run watch
+
+# 4. Confirm the Telegram bot receives a failure notification (deliberately make one test fail)
+```
+
+---
+
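+## Acceptance Criteria
+
+| Item | Pass condition |
+|------|---------|
+| Workflow exists | `.github/workflows/e2e-weekly.yaml` pushed successfully |
+| Manual trigger works | `gh workflow run` starts and completes |
+| Smoke tests pass | All tests tagged `@smoke` PASS |
+| Failure notification works | The Telegram bot receives the failure message |
+| Report uploaded | `playwright-report-*` appears in the GitHub Actions artifacts |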
+
+---
+
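+## ⚠️ Architecture Safety Patches (updated 2026-03-29, must read before deployment)
+
+> Source: in-depth war-gaming in `ARCHITECTURAL_RISK_WAR_GAME.md`, confirmed at code level
+
+### Patch 1: playwright.config.ts Must Set ignoreHTTPSErrors
+
+**Problem**: The internal K3s cluster uses a self-signed certificate. When Playwright connects to `https://192.168.0.120:32335` it hits `NET::ERR_CERT_AUTHORITY_INVALID`, so **every test fails at its first step**.
+
+**Fix** (`apps/web/playwright.config.ts`):
+
+```typescript
+export default defineConfig({
+  use: {
+    baseURL: process.env.BASE_URL || 'http://192.168.0.120:32335',
+    ignoreHTTPSErrors: true,   // 🆕 required, otherwise the self-signed certificate blocks everything
+    viewport: { width: 1280, height: 720 },
+    deviceScaleFactor: 1,      // 🆕 keeps Retina DPI differences out of the screenshot comparison
+  },
+  expect: {
+    toHaveScreenshot: {
+      threshold: 0.05,          // 🆕 per-pixel color tolerance (0-1; Playwright's default is 0.2)
+      maxDiffPixelRatio: 0.05,  // allow up to 5% of pixels to differ (absorbs cross-platform font rendering differences)
+    },
+  },
+});
+```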
+
+---
+
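+### Patch 2: Visual Baselines Must Be Generated in Docker (Linux)
+
+**Problem**: Font rasterization differs at the pixel level between Mac (CoreText) and GitHub Actions CI (Linux, FreeType). If the baseline is generated on a local Mac, the CI comparison will **fail with false positives every time**.
+
+> 🔴 **Never run `--update-snapshots` directly in the local Mac environment!**
+
+**Correct baseline update procedure**:
+
+```bash
+# Run on the local Mac, but generate the screenshots through Docker (Linux)
+# Assumption: pnpm is available inside the image; if it is not, enable it first (for example with corepack)
+cd apps/web
+docker run --rm \
+  -v $(pwd):/work -w /work \
+  -p 3000:3000 \
+  mcr.microsoft.com/playwright:v1.44.0-jammy \
+  pnpm exec playwright test \
+    --update-snapshots \
+    --project=chromium \
+    --grep @visual
+
+# The .png files produced in Docker are saved to tests/e2e/__snapshots__/
+# Open a PR labeled 📸 VISUAL_UPDATE
+# Merge only after the Commander has visually reviewed the screenshots
+```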
+
+**Add to the package.json scripts**:
+
+```json
+{
+  "scripts": {
+    "test:visual": "playwright test --project=chromium --grep @visual",
+    "test:visual:update": "docker run --rm -v $(pwd):/work -w /work mcr.microsoft.com/playwright:v1.44.0-jammy pnpm exec playwright test --update-snapshots --project=chromium --grep @visual"
+  }
+}
+```
+
+---
+
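+### Patch 3: Keep CI Threshold Settings Consistent with the Step 2 Example
+
+The visual test in the Step 2 example sets `threshold: 0.02`, while the global setting in `playwright.config.ts` is `0.05`; **the per-call setting takes precedence**. Because `threshold` is a per-pixel color tolerance, a lower value is stricter. The recommended convention is:
+
+```typescript
+// Global (playwright.config.ts)
+threshold: 0.05  // looser (tolerates cross-platform rendering differences)
+
+// Individual sensitive components (*.spec.ts)
+await expect(page).toHaveScreenshot('component.png', {
+  threshold: 0.02  // stricter (precise comparison for key components)
+});
+```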
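
The two screenshot options are easy to confuse: `threshold` bounds how different a single pixel's color may be, while `maxDiffPixelRatio` bounds how many pixels may differ at all. The sketch below only illustrates the arithmetic and the override order described above. It is written in Python for quick checking, and the merge in `effective_options` is a simplified model, not Playwright's actual implementation:

```python
# Global defaults, mirroring the expect.toHaveScreenshot block in playwright.config.ts.
GLOBAL_SCREENSHOT_OPTIONS = {"threshold": 0.05, "maxDiffPixelRatio": 0.05}


def effective_options(per_call: dict) -> dict:
    """Per-call options override the global defaults key by key (simplified model)."""
    return {**GLOBAL_SCREENSHOT_OPTIONS, **per_call}


def max_diff_pixels(width: int, height: int, ratio: float) -> int:
    """Number of pixels allowed to differ for a given viewport and ratio."""
    return round(width * height * ratio)


# A spec that only tightens the color tolerance keeps the global pixel ratio.
print(effective_options({"threshold": 0.02}))
# With the 1280x720 viewport from Patch 1, a 5% ratio allows this many differing pixels.
print(max_diff_pixels(1280, 720, 0.05))
```

For the 1280x720 viewport this means a comparison can still pass with tens of thousands of differing pixels, which is worth keeping in mind when deciding whether 5% is too generous for small components.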
--- a/docs/runbooks/RUNBOOK-FRONTEND-UIUX-SOVEREIGNTY.md
+++ b/docs/runbooks/RUNBOOK-FRONTEND-UIUX-SOVEREIGNTY.md
@@ -0,0 +1,499 @@
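+# RunBook: A Thorough Fix for the Core Frontend UI/UX Pain Points
+
+> **Type**: Architecture design + implementation RunBook  
+> **Priority**: 🔴 P0 (core competitiveness)  
+> **Created**: 2026-03-29 12:38 (Taipei)  
+> **Author**: Antigravity  
+> **Core proposition**: Make the AWOOOI frontend "stunning at first glance and addictive after one use"
+
+---
+
+## 1. Diagnosis of Claude Code's Frontend Weaknesses
+
+### 1.1 Five Structural Weaknesses
+
+| Weakness | Consequence | Remedy |
+|------|------|------|
+| **Visual perception blind spot** | Code is syntactically correct but looks poor | Playwright screenshots + Commander-led visual review |
+| **CSS context amnesia** | Changing A breaks B; the global CSS impact is unknown | Storybook-isolated components + TypeCheck |
+| **No feel for animation design** | Cannot perceive a 150ms flash versus a 300ms lag | Define animation standards in Storybook stories |
+| **Weak inference of design intent** | The Nothing.tech aesthetic needs visual references | Attach a spec screenshot to every component |
+| **Safari compatibility blind spot** | `backdrop-blur` has bugs in Safari | Playwright multi-browser E2E |
+
+### 1.2 Current Technical Debt (as of 2026-03-29)
+
+```
+i18n violations: 40+ (TECHNICAL_DEBT_PHASE2.md)
+shadcn/ui remnants: partly removed, full confirmation still needed
+GenUI Registry: only 5 basic cards, no monitoring cards
+Knowledge Base: blank page, no backend
+Omni-Terminal: shell complete, SSE event types not fully wired up
+```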
+
+---
+
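+## 2. Detailed Implementation Plan for the Nine Actions
+
+### 🔴 Action 1: Clear All i18n Violations in One Pass (effort: 4h)
+
+**Strategy: scan once, then fix in bulk, rather than handling violations one by one**
+
+```bash
+# Step 1.1: Automatically scan for all hard-coded strings
+cd /Users/ogt/awoooi/apps/web
+
+# Scan TS/TSX for hard-coded English (excluding technical identifiers)
+grep -rn '"[A-Z][A-Z]' src/ --include="*.tsx" --include="*.ts" | \
+  grep -v "//.*\"" | \
+  grep -v "className=" | \
+  grep -v "href=" | \
+  grep -v "id=" | \
+  grep -v "import " > /tmp/en_violations.txt
+
+# Scan for hard-coded Chinese (string literals that start with a non-ASCII character).
+# The \x escapes need PCRE, so this uses grep -P, which requires GNU grep
+# (the BSD grep shipped with macOS does not support -P).
+grep -rnP '"[^\x00-\x7F]' src/ --include="*.tsx" | \
+  grep -v "//.*\"" > /tmp/zh_violations.txt
+
+echo "English violations: $(wc -l < /tmp/en_violations.txt)"
+echo "Chinese violations: $(wc -l < /tmp/zh_violations.txt)"
+```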
+
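Where GNU grep is not available, the same Chinese-literal scan can be expressed in a few lines of Python. This is a rough sketch under the same heuristic as the grep above (a double-quoted literal whose first character is non-ASCII); `find_violations` and `scan_tree` are hypothetical helper names, and the comment filter only skips whole-line `//` comments:

```python
import re
from pathlib import Path

# A double-quoted literal whose first character is non-ASCII (same idea as the grep above).
NON_ASCII_LITERAL = re.compile(r'"[^\x00-\x7F]')


def find_violations(text: str) -> list:
    """Return (line_number, line) pairs that contain a hard-coded non-ASCII string literal."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line.lstrip().startswith("//"):
            continue  # skip whole-line comments
        if NON_ASCII_LITERAL.search(line):
            hits.append((number, line.strip()))
    return hits


def scan_tree(root: str = "src") -> dict:
    """Scan every .tsx file under `root` and map each file path to its violations."""
    report = {}
    for path in Path(root).rglob("*.tsx"):
        hits = find_violations(path.read_text(encoding="utf-8"))
        if hits:
            report[str(path)] = hits
    return report


if __name__ == "__main__":
    for file_path, hits in scan_tree().items():
        for number, line in hits:
            print(f"{file_path}:{number}: {line}")
```

Like the grep version, this misses single-quoted literals and JSX text nodes, so it is a first-pass filter rather than a complete i18n lint.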
+**Quick-fix list of known P0 violations** (per TECHNICAL_DEBT_PHASE2.md):
+
+```typescript
+// ❌ agent/data-pincer.tsx:50-78 (needs fixing)
+// Current:
+const statuses = {
+  standby: 'STANDBY',
+  analyzing: 'ANALYZING',
+  executing: 'EXECUTING',
+  awaiting: 'AWAITING APPROVAL',
+  error: 'ERROR'
+}
+
+// ✅ After the fix:
+const statuses = {
+  standby: t('status.standby'),
+  analyzing: t('status.analyzing'),
+  executing: t('status.executing'),
+  awaiting: t('status.awaitingApproval'),
+  error: t('status.error')
+}
+```
+
+```typescript
+// ❌ status-orb.tsx:16-31
+// Current:
+const STATUS_TEXT = {
+  idle: 'Idle',
+  thinking: 'Thinking',
+  executing: 'Executing',
+  awaiting: 'Awaiting Approval'
+}
+
+// ✅ After the fix:
+const STATUS_TEXT = {
+  idle: t('status.idle'),
+  thinking: t('status.thinking'),
+  executing: t('status.executing'),
+  awaiting: t('status.awaitingApproval')
+}
+```
+
+**Dictionary files that must be updated in sync**:
+
+```json
+// apps/web/messages/zh-TW.json (append)
+{
+  "status": {
+    "idle": "待機",
+    "thinking": "分析中",
+    "executing": "執行中",
+    "awaitingApproval": "等待核准",
+    "error": "錯誤",
+    "standby": "待機",
+    "analyzing": "分析中"
+  }
+}
+```
+
+```json
+// apps/web/messages/en.json (append)
+{
+  "status": {
+    "idle": "Idle",
+    "thinking": "Thinking",
+    "executing": "Executing",
+    "awaitingApproval": "Awaiting Approval",
+    "error": "Error",
+    "standby": "Standby",
+    "analyzing": "Analyzing"
+  }
+}
+```
+
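+**Acceptance commands (must pass)**:
+
+```bash
+cd apps/web
+
+# Confirm there is no hard-coded Chinese (grep -P requires GNU grep)
+if grep -rnP '"[^\x00-\x7F]' src/ --include="*.tsx" | grep -v "//"; then
+  echo "❌ Hard-coded Chinese strings remain!"
+  exit 1
+else
+  echo "✅ No hard-coded Chinese strings"
+fi
+
+# TypeScript compile check (hard workflow rule)
+pnpm exec tsc --noEmit
+```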
+
+---
+
+### 🔴 Action 2: Build the Storybook Component Library (effort: 8h)
+
+**This is the root fix for the AI's visual perception blind spot**.
+
+```bash
+# Step 2.1: Install Storybook
+cd apps/web
+pnpm add -D storybook@latest @storybook/nextjs @storybook/addon-essentials \
+           @storybook/addon-interactions @storybook/test
+npx storybook@latest init --builder webpack5
+
+# Step 2.2: Configure the Nothing.tech theme
+```
+
+**The 10 core component stories that must be published**:
+
+| Component | Nothing.tech spec | Story states |
+|------|-----------------|-----------|
+| `GlassCard` | `bg-white/70 backdrop-blur-[20px] border border-black/[0.06]` | Loading / Content / Error |
+| `StatusOrb` | Status light + `animate-ping` (when critical) | idle / thinking / executing / critical |
+| `ApprovalCard` | 1.0 ConversationalView style | LOW / MEDIUM / HIGH / CRITICAL |
+| `OmniTerminal` | VT323 font + blinking green cursor | empty / thinking / streaming / error |
+| `HostCard` | CPU/Memory bars + pulse dot | healthy / warning / critical |
+| `MetricsCard` | Large numerals + trend arrow | up / down / stable |
+| `SystemHealthCard` | 5x5 status light matrix | all-healthy / some-warning / critical |
+| `FinOpsCard` | Cost breakdown + potential savings | monthly / quarterly |
+| `SLOCard` | Attainment rate + trend | healthy / at-risk / breached |
+| `AnomalyFrequencyCard` | Frequency statistics + escalation advice | normal / repeat / escalate |
+
+**Example Storybook story** (GlassCard):
+
+```typescript
+// apps/web/src/components/ui/glass-card.stories.ts
+import type { Meta, StoryObj } from '@storybook/react';
+import { GlassCard } from './glass-card';
+
+const meta: Meta<typeof GlassCard> = {
+  title: 'AWOOOI/UI/GlassCard',
+  component: GlassCard,
+  parameters: {
+    // Nothing.tech off-white background
+    backgrounds: {
+      default: 'nothing-white',
+      values: [{ name: 'nothing-white', value: '#F5F5F0' }],
+    },
+    // Spec documentation shown in the docs tab
+    docs: {
+      description: {
+        component: `
+          Nothing.tech white glass card. Fixed spec:
+          - bg: bg-white/70
+          - blur: backdrop-blur-[20px]
+          - border: border border-black/[0.06]
+          - radius: rounded-xl
+        `,
+      },
+    },
+  },
+  tags: ['autodocs'],
+};
+export default meta;
+
+type Story = StoryObj<typeof GlassCard>;
+
+export const Default: Story = {
+  args: { children: 'Glass card content' }
+};
+
+export const WithCriticalBorder: Story = {
+  args: {
+    children: 'Critical state',
+    className: 'border-status-critical border-2'
+  }
+};
+```
+
+---
+
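+### 🔴 Action 3: AI visual review SOP (effort: 2h to set up, 0h afterwards)
+
+**Establish a standard operating procedure so the AI takes screenshots on its own and waits for the Commander's review:**
+
+````markdown
+## AI frontend change SOP (mandatory)
+
+### Before the change
+1. Look up the spec Story for the component in Storybook
+2. Confirm the Nothing.tech visual tokens (see tailwind.config.ts)
+
+### During the change
+3. Modify the code
+4. Run `pnpm exec tsc --noEmit` (type check)
+
+### After the change (🆕 new mandatory steps)
+5. Start the dev server: `pnpm dev`
+6. Run the screenshot script:
+   ```bash
+   cd apps/web
+   pnpm exec playwright screenshot \
+     --browser chromium \
+     http://localhost:3000 \
+     docs/screenshots/$(date +%Y%m%d-%H%M)/homepage.png
+   ```
+7. Save the screenshot to `docs/screenshots/{date}/{component}.png`
+8. Record in the LOGBOOK: "Screenshot taken, archived at docs/screenshots/xxx/yyy.png"
+9. Wait for the Commander's visual approval before committing
+````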
+
+---
+
+### 🟠 Action 4: Wire up the Omni-Terminal backend end to end (effort: 8h)
+
+Following the neural-link blueprint in `ADR-031` and `AWOOOI_AGENTIC_WORKSPACE_ROADMAP.md`:
+
+#### 4.1 SSE event type definitions (three main types plus result and end markers)
+
+```python
+# apps/api/src/api/v1/terminal.py
+# Extends the existing SSE endpoint
+import json
+from enum import Enum
+
+
+class SSEEventType(str, Enum):
+    THOUGHT = "thought"          # Agent thought stream
+    TOOL_CALL = "tool_call"      # Tool invocation (also the micro-animation trigger)
+    TOOL_RESULT = "tool_result"  # Tool result
+    RENDER_UI = "render_ui"      # Dynamically render a GenUI component
+    STREAM_END = "stream_end"    # End of the thought stream
+
+
+# Example events (the empty list and dict are placeholders for real payloads):
+async def stream_terminal_response(command: str):
+    # 1. Thought stream
+    yield f"event: thought\ndata: {json.dumps({'text': '[Investigator] Analyzing command...'})}\n\n"
+
+    # 2. Tool call (triggers the frontend CSS animation)
+    yield f"event: tool_call\ndata: {json.dumps({'tool': 'kubectl_get', 'args': {'resource': 'pod', 'namespace': 'awoooi-prod'}})}\n\n"
+
+    # 3. Tool result
+    yield f"event: tool_result\ndata: {json.dumps({'pods': []})}\n\n"
+
+    # 4. Render a GenUI card (the frontend loads the component dynamically)
+    yield f"event: render_ui\ndata: {json.dumps({'component': 'SystemHealthCard', 'props': {}})}\n\n"
+
+    # 5. End
+    yield f"event: stream_end\ndata: {json.dumps({'success': True})}\n\n"
+```
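+
+The wire format used by these events can be captured in two small helpers. This is a minimal sketch rather than the project's code: `format_sse` and `parse_sse` are hypothetical names, and the parser handles only the `event:` and `data:` fields that the frames in this section use.
+
+```python
+import json
+
+
+def format_sse(event: str, data: dict) -> str:
+    """Serialize one Server-Sent Event frame: event line, data line, blank line."""
+    payload = json.dumps(data, ensure_ascii=False)
+    return f"event: {event}\ndata: {payload}\n\n"
+
+
+def parse_sse(stream: str) -> list[tuple[str, dict]]:
+    """Split a raw SSE stream into (event, data) pairs, roughly as EventSource does."""
+    events = []
+    for frame in stream.split("\n\n"):
+        event, data_lines = "message", []
+        for line in frame.splitlines():
+            if line.startswith("event:"):
+                event = line[len("event:"):].strip()
+            elif line.startswith("data:"):
+                data_lines.append(line[len("data:"):].strip())
+        if data_lines:
+            events.append((event, json.loads("\n".join(data_lines))))
+    return events
+```
+
+Building every frame through one formatter keeps the blank-line terminator consistent, which is what lets the browser dispatch each frame to the matching `addEventListener` handler.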
+
+#### 4.2 Frontend SSE event handler (critical)
+
+```typescript
+// apps/web/src/hooks/useTerminalSSE.ts
+// Modify the existing SSE hook so that tool_call triggers an animation
+import { useEffect } from 'react';
+
+const useTerminalSSE = (commandId: string) => {
+  // useTerminalStore is the project's existing store hook (its import is omitted here)
+  const [state, setState] = useTerminalStore();
+  
+  useEffect(() => {
+    const es = new EventSource(`/api/v1/terminal/stream/${commandId}`);
+    
+    // Thought stream
+    es.addEventListener('thought', (e) => {
+      const data = JSON.parse(e.data);
+      setState(s => ({ ...s, thoughts: [...s.thoughts, data.text] }));
+    });
+    
+    // Tool call → trigger the micro-animation
+    es.addEventListener('tool_call', (e) => {
+      const data = JSON.parse(e.data);
+      setState(s => ({
+        ...s,
+        activeToolCall: data.tool,  // UI shows "Running kubectl_get..."
+        isAnimating: true
+      }));
+    });
+    
+    // Dynamic GenUI rendering (core feature!)
+    es.addEventListener('render_ui', (e) => {
+      const { component, props } = JSON.parse(e.data);
+      setState(s => ({
+        ...s,
+        renderedCards: [...s.renderedCards, { component, props }]
+      }));
+    });
+    
+    es.addEventListener('stream_end', () => {
+      setState(s => ({ ...s, isStreaming: false, isAnimating: false }));
+      es.close();
+    });
+    
+    return () => es.close();
+  }, [commandId, setState]);
+};
+```
+
+---
+
+### 🟠 Action 5: Extend the GenUI Registry (effort: 8h)
+
+Add 7 monitoring GenUI cards (detailed specs in `MONITORING_ARCHITECTURE_DEEP_DIVE.md`):
+
+```typescript
+// apps/web/src/components/genui/registry.ts
+export const GENUI_COMPONENTS = {
+  // Existing:
+  'MetricsCard': () => import('./cards/MetricsCard'),
+  'K8sPodCard': () => import('./cards/K8sPodCard'),
+  
+  // 🆕 Monitoring (Wave M-3):
+  'SystemHealthCard': () => import('./monitoring/SystemHealthCard'),
+  'ServiceDetailCard': () => import('./monitoring/ServiceDetailCard'),
+  'FinOpsCard': () => import('./monitoring/FinOpsCard'),
+  'SLODashboardCard': () => import('./monitoring/SLODashboardCard'),
+  'AlertChainStatusCard': () => import('./monitoring/AlertChainStatusCard'),
+  'AnomalyFrequencyCard': () => import('./monitoring/AnomalyFrequencyCard'),
+  'MTTRCard': () => import('./monitoring/MTTRCard'),
+}
+```
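+
+Because the backend names the component in each `render_ui` event, it may help to reject names the registry cannot load before they reach the browser. The sketch below assumes a hand-maintained allowlist that mirrors the keys of `GENUI_COMPONENTS`; `ALLOWED_GENUI_COMPONENTS` and `build_render_ui_event` are hypothetical names, not existing project code.
+
+```python
+import json
+
+# Mirrors the keys of GENUI_COMPONENTS in registry.ts. Keeping the two lists
+# in sync is an assumption of this sketch, not something the source enforces.
+ALLOWED_GENUI_COMPONENTS = frozenset({
+    "MetricsCard", "K8sPodCard", "SystemHealthCard", "ServiceDetailCard",
+    "FinOpsCard", "SLODashboardCard", "AlertChainStatusCard",
+    "AnomalyFrequencyCard", "MTTRCard",
+})
+
+
+def build_render_ui_event(component: str, props: dict) -> str:
+    """Return a render_ui SSE frame, rejecting components the frontend cannot load."""
+    if component not in ALLOWED_GENUI_COMPONENTS:
+        raise ValueError(f"unknown GenUI component: {component!r}")
+    payload = json.dumps({"component": component, "props": props})
+    return f"event: render_ui\ndata: {payload}\n\n"
+```
+
+Failing on the server side turns a silent blank card in the terminal into an error that shows up in the API logs.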
+
+---
+
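+## 3. Frontend improvement roadmap and timeline
+
+```
+📅 This week (immediately):
+  [4h] i18n blitz cleanup (one-shot full fix)
+  [2h] Set up the AI visual review SOP (append to .awoooi-agent-rules.md)
+
+📅 Week 2-3:
+  [8h] Storybook Stories for the 10 core components
+  [8h] Wire up the Omni-Terminal backend end to end (three SSE event types)
+
+📅 Week 4-5:
+  [8h] Extend the monitoring GenUI cards (7 new cards)
+  [8h] AI autonomy-rate UI component for the Nexus page
+
+📅 Month 2:
+  [16h] Build out the Knowledge Base backend and frontend
+  [8h]  Integrate Visual Regression Testing into CI
+
+📅 Month 3 (Phase 4: injecting the visual soul):
+  [?h]  Brand 3D assets + chibi-style OpenClaw
+  [?h]  Site-wide micro-animation upgrade (150ms quick-flash standard)
+  [?h]  Nothing.tech certification-grade design audit
+```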
+
+---
+
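+## 4. Mandatory acceptance criteria
+
+Every frontend PR must pass all of the following checks before it is merged:
+
+```bash
+# 1. No TypeScript errors (hard rule of the frontend aesthetics workflow)
+cd apps/web && pnpm exec tsc --noEmit
+
+# 2. No hard-coded i18n strings (blocked in CI); -P requires GNU grep
+[ -z "$(grep -rnP '"[^\x00-\x7F]' src/ --include='*.tsx')" ] || exit 1
+
+# 3. Storybook builds cleanly (components work in isolation)
+pnpm exec storybook build
+
+# 4. E2E smoke tests pass
+pnpm exec playwright test --grep @smoke
+
+# 5. Archive a screenshot (taken by the AI, visually approved by the Commander)
+pnpm exec playwright screenshot http://localhost:3000 docs/screenshots/pr-{NUMBER}/homepage.png
+
+# 6. Build succeeds (production compatibility)
+pnpm run build
+```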
+
+---
+
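+## ⚠️ Architecture safety patch (updated 2026-03-29; read before starting Action 1)
+
+> Source: the in-depth war-game in `ARCHITECTURAL_RISK_WAR_GAME.md`, confirmed against the code
+
+### Patch: create the release/v1.x stable branch before the Feature Freeze
+
+**Problem**: Action 1 (the i18n blitz cleanup) is a global replacement across thousands of lines under `src/`, so `main` will be in flux for 3-5 days. If a P0 bug hits production during that window (for example, the approval button stops working), **no clean hotfix branch can be cut**, and the fix collides with the refactor.
+
+> 🔴 **Create the `release/v1.x` stable branch before Action 1 starts!**
+
+#### Must run before Action 1
+
+```bash
+# Step 0 (prerequisite for Action 1): create the stable baseline branch
+git checkout main
+git pull origin main
+git checkout -b release/v1.x
+git push origin release/v1.x
+
+# In GitHub, make release/v1.x a protected branch
+# Settings → Branches → Add branch protection rule → release/v1.x
+# ✅ Require pull request reviews (1 approver)
+# ✅ Do not allow bypassing the above settings
+```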
+
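+#### Correct order of actions
+
+```
+Step 0: Create release/v1.x (🆕 prerequisite, must come first)
+
+Step 1: Announce the Frontend Feature Freeze (no non-i18n PR may merge frontend code)
+
+Step 2: i18n blitz cleanup (on the fix/i18n-zero-violation branch)
+
+Step 3: Merge the PR into main → lift the Frontend Freeze
+
+Step 4: Switch the ESLint i18n plugin to error mode
+
+Step 5: Continue with Actions 2-9 (Storybook, Terminal, etc.)
+```
+
+#### P0 hotfix emergency flow during the freeze
+
+```bash
+# Scenario: the approval button fails in production while the i18n cleanup is in progress
+
+# 1. Cut the hotfix from the stable baseline (does not touch the i18n refactor)
+git checkout release/v1.x
+git checkout -b hotfix/fix-approval-button
+
+# 2. Minimal fix (change only the bug, nothing else)
+# ... fix the code ...
+git add . && git commit -m "fix: approval button not responding on mobile"
+
+# 3. PR into release/v1.x → CD deploys the release branch directly
+git push origin hotfix/fix-approval-button
+# → merge the PR into release/v1.x
+# → CD deployment is triggered (the CD config must support release/* branches)
+
+# 4. Cherry-pick into main (without interrupting the i18n refactor)
+git checkout main
+git cherry-pick <hotfix-commit-hash>
+# On conflict: resolve manually, then run `git cherry-pick --continue`
+```
+
+#### Hotfix trigger conditions (to be added to HARD_RULES.md)
+
+```
+P0 hotfix criteria (any one condition triggers it):
+□ The Commander cannot use a core feature (approval button, login, Telegram notifications)
+□ Sentry P0 errors exceed 10 per minute
+□ Service availability < 99% (as shown on the monitoring page)
+□ The OpenClaw decision chain is completely down for more than 5 minutes
+```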
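+
+The four criteria combine with a logical OR, which can be made explicit in code. This is only an illustrative sketch: `ProdSnapshot` and `is_p0_hotfix` are hypothetical names, and how each measurement is collected is left open.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class ProdSnapshot:
+    core_feature_down: bool          # approval button, login, or Telegram notifications unusable
+    sentry_p0_errors_per_min: float  # Sentry P0 error rate
+    availability_pct: float          # availability shown on the monitoring page
+    decision_chain_down_min: float   # minutes the OpenClaw decision chain has been fully down
+
+
+def is_p0_hotfix(s: ProdSnapshot) -> bool:
+    """True when any one of the four documented criteria is met."""
+    return (
+        s.core_feature_down
+        or s.sentry_p0_errors_per_min > 10
+        or s.availability_pct < 99.0
+        or s.decision_chain_down_min > 5
+    )
+```
+
+The comparisons are strict, matching the wording of the list: exactly 10 errors per minute, exactly 99% availability, or exactly 5 minutes of downtime does not trigger a hotfix on its own.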
--- a/docs/runbooks/RUNBOOK-PHASE-D-SENTRY-COMMENT.md
+++ b/docs/runbooks/RUNBOOK-PHASE-D-SENTRY-COMMENT.md
@@ -0,0 +1,264 @@
+# RunBook: Phase D - Enabling Sentry comment write-back
+
+> **Type**: Operational RunBook  
+> **Priority**: 🔴 P0 (the feature scaffold exists; only the token configuration is missing)  
+> **Created**: 2026-03-29 12:35 (Taipei)  
+> **Author**: Antigravity  
+> **Estimated effort**: 1.5-2 hours  
+> **Prerequisite**: the AWOOOI API is running normally (`/api/v1/health` returns 200)
+
+---
+
+## Background and current state
+
+### 🔍 Current-state diagnosis
+
+`post_sentry_comment()` in `sentry_webhook.py` already implements the full logic:
+
+```
+sentry_webhook.py:251  → calls post_sentry_comment()
+sentry_service.py:206  → post_issue_comment() already implements POST /api/0/issues/{id}/comments/
+sentry_service.py:223  → if SENTRY_AUTH_TOKEN is empty, logs a warning and returns None
+```
+
+**The only blocker**: the `settings.SENTRY_AUTH_TOKEN` environment variable is not set, so the comment is silently skipped.
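+
+The skip-when-empty behaviour can be pictured with a small request builder. This is a sketch under stated assumptions, not the code in `sentry_service.py`: `build_comment_request` is a hypothetical helper, and the `{"text": ...}` body is assumed from the `text` argument used later in this runbook.
+
+```python
+import json
+import urllib.request
+from typing import Optional
+
+
+def build_comment_request(base_url: str, issue_id: str, token: str,
+                          text: str) -> Optional[urllib.request.Request]:
+    """Build the POST for /api/0/issues/{id}/comments/; return None without a token."""
+    if not token:
+        # Mirrors the documented behaviour: skip instead of raising.
+        return None
+    url = f"{base_url.rstrip('/')}/api/0/issues/{issue_id}/comments/"
+    return urllib.request.Request(
+        url,
+        data=json.dumps({"text": text}).encode("utf-8"),
+        headers={
+            "Authorization": f"Bearer {token}",
+            "Content-Type": "application/json",
+        },
+        method="POST",
+    )
+```
+
+Returning `None` keeps the webhook flow alive when the token is missing, which is also why the failure is easy to overlook until someone checks the logs.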
+
+### Data flow
+
+```
+Sentry issue fires
+    ↓
+/api/v1/webhooks/sentry/error  (sentry_webhook.py)
+    ↓
+analyze_and_comment() [Background Task]
+    ↓
+call_openclaw_analyzer()    → OpenClaw AI analysis
+    ↓
+create_sentry_approval()    → creates the Approval ✅ working
+    ↓
+send_sentry_telegram_alert()→ Telegram notification ✅ working
+    ↓
+post_sentry_comment()       → ❌ SENTRY_AUTH_TOKEN missing, silently skipped
+```
+
+---
+
+## Step 1: Obtain a Sentry API token
+
+### 1.1 Log in to the Sentry admin console
+
+```
+Open in a browser: http://192.168.0.110:9000
+Account: see docs/security/SECRETS_REFERENCE.md
+```
+
+### 1.2 Create an API token
+
+```
+Path: Settings → API → Auth Tokens → Create New Token
+
+Scopes:
+☑ project:read
+☑ project:write
+☑ event:write     ← required, otherwise comments cannot be written back
+                    (Sentry has no issues:write scope; issue permissions fall under event:*)
+☑ event:read
+
+Suggested token name: awoooi-openclaw-comment-writer
+```
+
+### 1.3 Record the token (never commit it to the repository!)
+
+```bash
+# Keep it in an environment variable (local testing only)
+export SENTRY_AUTH_TOKEN="sentry_xxx..."
+
+# Verify the token
+curl -s http://192.168.0.110:9000/api/0/organizations/ \
+  -H "Authorization: Bearer $SENTRY_AUTH_TOKEN" | python3 -m json.tool | head -20
+# Expected: a list of organizations and no 401
+```
+
+---
+
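+## Step 2: Inject into GitHub Secrets (CD automation)
+
+### 2.1 Add to the GitHub repository secrets
+
+```
+Path: GitHub → owenhytsai/awoooi → Settings → Secrets and variables → Actions
+
+Add the following secret:
+Name: SENTRY_AUTH_TOKEN
+Value: the token obtained in step 1.2
+```
+
+### 2.2 Update the K8s Secret (manual injection into production)
+
+```bash
+# Run on 192.168.0.120 (K3s master)
+# tr -d '\n' strips the line wrap that GNU base64 inserts after 76 characters,
+# which would otherwise produce an invalid JSON patch for long tokens.
+kubectl patch secret awoooi-secrets -n awoooi-prod \
+  --patch="{\"data\":{\"SENTRY_AUTH_TOKEN\":\"$(echo -n 'YOUR_TOKEN' | base64 | tr -d '\n')\"}}"
+
+# Verify (this prints the token in clear text; clear the terminal afterwards)
+kubectl get secret awoooi-secrets -n awoooi-prod -o jsonpath='{.data.SENTRY_AUTH_TOKEN}' | base64 -d
+```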
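+
+The patch payload can also be generated without any shell quoting. The sketch below is a hedged alternative, not part of the existing tooling: `secret_patch` is a hypothetical helper whose output would be passed to `kubectl patch secret ... --patch=`.
+
+```python
+import base64
+import json
+
+
+def secret_patch(key: str, value: str) -> str:
+    """Return a JSON merge patch that sets one key of a Kubernetes Secret."""
+    # b64encode never inserts line breaks, unlike GNU `base64`,
+    # which wraps its output at 76 columns by default.
+    encoded = base64.b64encode(value.encode("utf-8")).decode("ascii")
+    return json.dumps({"data": {key: encoded}})
+```
+
+Using `json.dumps` for the patch also removes the need for the escaped quotes in the shell one-liner.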
+
+### 2.3 Update k8s/awoooi-prod/03-secrets.yaml (template)
+
+```yaml
+# k8s/awoooi-prod/03-secrets.yaml
+# Add the field below (injected by CD, never hard-coded)
+stringData:
+  # ... existing fields ...
+  SENTRY_AUTH_TOKEN: "${SENTRY_AUTH_TOKEN}"  # injected by CD
+```
+
+---
+
+## Step 3: Update the CD workflow to inject the token automatically
+
+```yaml
+# .github/workflows/cd.yaml
+# Add SENTRY_AUTH_TOKEN to the "Inject K8s Secrets" step
+
+- name: Inject K8s Secrets
+  run: |
+    # tr -d '\n' removes the line wrap GNU base64 adds to long values
+    kubectl patch secret awoooi-secrets -n awoooi-prod \
+      --patch="{\"data\":{
+        \"OPENCLAW_TG_BOT_TOKEN\":\"$(echo -n '${{ secrets.OPENCLAW_TG_BOT_TOKEN }}' | base64 | tr -d '\n')\",
+        \"OPENCLAW_TG_CHAT_ID\":\"$(echo -n '${{ secrets.OPENCLAW_TG_CHAT_ID }}' | base64 | tr -d '\n')\",
+        \"SENTRY_AUTH_TOKEN\":\"$(echo -n '${{ secrets.SENTRY_AUTH_TOKEN }}' | base64 | tr -d '\n')\"
+      }}"
+```
+
+---
+
+## Step 4: Verify the Sentry comment feature
+
+### 4.1 Local smoke test (quick check against the real Sentry API)
+
+```bash
+cd /Users/ogt/awoooi
+source .env
+
+# Set the test token
+export SENTRY_AUTH_TOKEN="your real token"
+export SENTRY_SELF_HOSTED_URL="http://192.168.0.110:9000"
+
+# Exercise SentryService directly from Python
+python3 -c "
+import asyncio
+import os
+import sys
+sys.path.insert(0, 'apps/api/src')
+from services.sentry_service import SentryService
+
+async def test():
+    # Read the values from the environment instead of splicing them into the source
+    svc = SentryService(
+        base_url=os.environ['SENTRY_SELF_HOSTED_URL'],
+        auth_token=os.environ['SENTRY_AUTH_TOKEN'],
+    )
+    # List issues first to find a real ID
+    issues = await svc.list_issues(project='awoooi-api', limit=3)
+    if issues:
+        issue_id = issues[0]['id']
+        print(f'Found issue: {issue_id}')
+        result = await svc.post_issue_comment(
+            issue_id=issue_id,
+            text='🤖 **AWOOOI test** - Sentry comment write-back is working.'
+        )
+        print(f'Comment result: {result}')
+    else:
+        print('No issue available for testing')
+
+asyncio.run(test())
+"
+```
+
+### 4.2 End-to-end verification
+
+```bash
+# 1. Post a simulated Sentry webhook to the AWOOOI API
+#    (localhost:8000 is a local instance; point the URL at production to test production)
+curl -X POST http://localhost:8000/api/v1/webhooks/sentry/error \
+  -H "Content-Type: application/json" \
+  -d '{
+    "action": "triggered",
+    "data": {
+      "issue": {
+        "id": "TEST-001",
+        "title": "AWOOOI comment feature test",
+        "level": "error",
+        "culprit": "test.py:1",
+        "firstSeen": "2026-03-29T12:00:00Z",
+        "count": 1,
+        "project": {"slug": "awoooi-api"}
+      },
+      "event": {
+        "message": "This is a test error",
+        "platform": "python"
+      }
+    }
+  }'
+
+# Expected response
+# {"status": "accepted", "issue_id": "TEST-001", "message": "Analysis scheduled"}
+
+# 2. Wait 60 seconds (the OpenClaw analysis takes time)
+sleep 60
+
+# 3. Check the issue in the Sentry UI for the AI analysis comment.
+#    TEST-001 is not a real Sentry issue, so the write-back for this payload is
+#    expected to fail; to see a comment, use a real issue ID (e.g. one from 4.1).
+# http://192.168.0.110:9000/organizations/sentry/issues/TEST-001/
+```
+
+### 4.3 Check the logs
+
+```bash
+# View the API Pod logs
+kubectl logs -n awoooi-prod \
+  $(kubectl get pod -n awoooi-prod -l app=awoooi-api -o name | head -1) \
+  --tail=50 | grep -i sentry_comment
+
+# Expected
+# sentry_comment_posted  issue_id=xxx  comment_id=12345
+```
+
+---
+
+## Step 5: Deploy and accept
+
+```bash
+# Trigger CD
+git add k8s/awoooi-prod/03-secrets.yaml .github/workflows/cd.yaml
+git commit -m "feat(sentry): enable comment write-back via SENTRY_AUTH_TOKEN injection"
+git push origin main
+
+# Confirm CD succeeded
+gh run list --workflow=cd.yaml --limit 1
+
+# If CD did not roll the Deployment, restart it: environment variables sourced
+# from a Secret are only read when the container starts
+kubectl rollout restart deployment/awoooi-api -n awoooi-prod
+
+# Confirm the Pod has the token (prints only the variable name, not the value)
+kubectl exec -n awoooi-prod \
+  $(kubectl get pod -n awoooi-prod -l app=awoooi-api -o name | head -1) \
+  -- env | grep -o '^SENTRY_AUTH_TOKEN='
+```
+
+---
+
+## Acceptance criteria
+
+| Item | Pass condition |
+|------|---------|
+| K8s Secret injected | `kubectl get secret` shows a non-empty `SENTRY_AUTH_TOKEN` |
+| Token valid | Sentry API `/api/0/organizations/` returns 200 |
+| Comment write-back | The Sentry issue contains an "AI error analysis" comment |
+| Logs healthy | `sentry_comment_posted` appears in the logs and `sentry_comment_failed` does not |
+| Frequency statistics | The comment includes the "frequency statistics" table (shown when the 24h count > 1) |
+
+---
+
+## Troubleshooting
+
+| Symptom | Diagnostic command | Fix |
+|------|---------|------|
+| `sentry_comment_skipped` in the logs | `env \| grep SENTRY_AUTH_TOKEN` | Secret not injected; rerun Step 3 |
+| `sentry_api_unauthorized` | Call the Sentry API manually with curl | Token lacks permissions; create a new one |
+| `sentry_api_timeout` | `curl -v http://192.168.0.110:9000/` | The Sentry service itself is unhealthy |
+| OpenClaw analysis fails | `curl http://192.168.0.188:8089/health` | The OpenClaw service needs a restart |
--- a/docs/runbooks/RUNBOOK-PHASE-E-SIGNOZ-WEBHOOK.md
+++ b/docs/runbooks/RUNBOOK-PHASE-E-SIGNOZ-WEBHOOK.md
@@ -0,0 +1,233 @@
+# RunBook: Phase E - SigNoz Webhook Handler production deployment
+
+> **Type**: Operational RunBook  
+> **Priority**: 🔴 P0  
+> **Created**: 2026-03-29 12:35 (Taipei)  
+> **Author**: Antigravity  
+> **Estimated effort**: 1.5-2 hours  
+> **Prerequisite**: the SigNoz UI is reachable at http://192.168.0.188:3301
+
+---
+
+## Background and current state
+
+### 🔍 Current-state diagnosis
+
+```
+signoz_webhook.py    → fully implemented (363 lines, complete 4-step flow)
+main.py:419          → already routed via include_router(signoz_webhook_v1.router)
+Endpoint: POST /api/v1/webhooks/signoz/alert   ✅ ready to receive
+Problem: the SigNoz alert rules do not point at this webhook
+```
+
+**The only blocker**: the `webhook` field in the SigNoz alert rules (`ops/signoz/alerting/rules.yaml`) is not set, or the rules have not been deployed to the SigNoz host.
+
+### Full data flow
+
+```
+SigNoz detects an anomaly (Error Rate / Latency / No Traces)
+    ↓
+SigNoz Alert Manager fires the alert
+    ↓
+POST http://192.168.0.120:32334/api/v1/webhooks/signoz/alert   ← needs configuring
+    ↓
+process_signoz_alert() [Background Task]
+    ↓
+├── AnomalyCounter records the frequency (ADR-037) ✅
+├── IncidentService creates the incident           ✅
+├── ApprovalService creates the approval           ✅
+└── TelegramGateway sends the notification         ✅
+```
+
+---
+
+## Step 1: Confirm the API endpoint is reachable
+
+```bash
+# Test the SigNoz webhook endpoint from host 188
+curl -s http://192.168.0.120:32334/api/v1/webhooks/signoz/health
+# Expected: {"status": "ok", "service": "signoz-webhook", "timestamp": "..."}
+
+# If the endpoint is unreachable, check the Pod status
+kubectl get pod -n awoooi-prod -l app=awoooi-api
+```
+
+---
+
+## Step 2: Configure the SigNoz alert rules
+
+### 2.1 Confirm ops/signoz/alerting/rules.yaml exists
+
+```bash
+# Confirm the file exists
+ls /Users/ogt/awoooi/ops/signoz/alerting/
+# If it does not, copy it from the Phase E code in IMPLEMENTATION_STEPS_REMAINING_PHASES.md
+```
+
+### 2.2 Deploy the alert rules to the SigNoz host
+
+```bash
+# Log in to the SigNoz host
+ssh root@192.168.0.188
+
+# Find the SigNoz alert configuration directory
+docker inspect signoz-query-service | grep -A5 "Mounts"
+# Common paths: /opt/signoz/config/ or /data/signoz/
+
+# Copy the alert rules. Run this on the local machine, not on host 188:
+scp /Users/ogt/awoooi/ops/signoz/alerting/rules.yaml \
+    root@192.168.0.188:/opt/signoz/config/alerting/
+
+# On host 188, restart the SigNoz Alert Manager (not the whole SigNoz stack)
+docker restart signoz-alert-manager 2>/dev/null || \
+  docker restart signoz   # single-container deployment
+```
+
+### 2.3 Verify through the SigNoz API that the rules are loaded
+
+```bash
+# Run on host 188
+curl -s http://localhost:3301/api/v3/alerts/rules | python3 -m json.tool | head -40
+# Expected: rule names such as APIHighErrorRate and APIHighLatencyP99
+```
+
+---
+
+## Step 3: Configure the SigNoz Webhook Channel
+
+SigNoz alert notifications support a Webhook Channel, which is configured through the SigNoz web UI or its API.
+
+### 3.1 Configure through the SigNoz UI (recommended)
+
+```
+Open in a browser: http://192.168.0.188:3301
+Path: Settings → Alert Channels → New Channel
+
+Type: Webhook
+Name: AWOOOI-API
+URL: http://192.168.0.120:32334/api/v1/webhooks/signoz/alert
+Send resolved notifications: ☑ (optional)
+```
+
+### 3.2 Configure through the API (scriptable)
+
+```bash
+# Run on host 188
+curl -s -X POST http://localhost:3301/api/v1/channels \
+  -H "Content-Type: application/json" \
+  -d '{
+    "name": "AWOOOI-API",
+    "type": "webhook",
+    "data": {
+      "webhook_url": "http://192.168.0.120:32334/api/v1/webhooks/signoz/alert"
+    }
+  }'
+# Expected: the channel ID is returned
+```
+
+---
+
+## Step 4: Create a test alert rule
+
+To verify the whole chain, create a low-threshold test rule:
+
+```bash
+# Create the test rule on host 188 (written to /tmp here; load it the same way as rules.yaml in step 2.2)
+cat > /tmp/test-alert.yaml << 'EOF'
+groups:
+  - name: e2e_test
+    rules:
+      - alert: AWOOOI_E2E_SMOKE_TEST
+        expr: up{job="awoooi-api"} == 1  # always fires while the API is up
+        for: 1m
+        labels:
+          severity: info
+          source: signoz
+          test: "true"
+        annotations:
+          summary: "E2E Smoke Test - please ignore"
+          description: "Automated test of the AWOOOI alert chain"
+          webhook: "http://192.168.0.120:32334/api/v1/webhooks/signoz/alert"
+EOF
+```
+
+---
+
+## Step 5: End-to-end verification
+
+### 5.1 Trigger a test manually
+
+```bash
+# Send a simulated SigNoz alert straight to the AWOOOI API
+curl -s -X POST http://192.168.0.120:32334/api/v1/webhooks/signoz/alert \
+  -H "Content-Type: application/json" \
+  -d '{
+    "alertname": "APIHighErrorRate",
+    "status": "firing",
+    "labels": {
+      "alertname": "APIHighErrorRate",
+      "severity": "critical",
+      "service_name": "awoooi-api",
+      "source": "signoz"
+    },
+    "annotations": {
+      "summary": "API error rate > 5%",
+      "description": "Service awoooi-api error rate is over the threshold; this is a test alert"
+    },
+    "startsAt": "2026-03-29T12:00:00Z"
+  }'
+
+# Expected response
+# {"status": "ok", "processed": 1, "results": [{"status": "accepted", "alert_name": "APIHighErrorRate"}]}
+```
+
+### 5.2 Confirm Telegram received the alert
+
+```
+Expected message in the Telegram bot:
+═══════════════════════════
+📊 SignOz: APIHighErrorRate
+═══════════════════════════
+Service: awoooi-api
+Summary: API error rate > 5%
+[ Y Confirm ] [ N Ignore ]
+```
+
+### 5.3 Check the API logs
+
+```bash
+kubectl logs -n awoooi-prod \
+  $(kubectl get pod -n awoooi-prod -l app=awoooi-api -o name | head -1) \
+  --tail=30 | grep -i signoz
+
+# Expected:
+# signoz_alert_received   payload=...
+# signoz_anomaly_recorded alert_name=APIHighErrorRate
+# signoz_alert_processed  alert_name=APIHighErrorRate  incident_id=xxx
+# signoz_telegram_sent    approval_id=xxx
+```
+
+---
+
+## Acceptance criteria
+
+| Item | Pass condition |
+|------|---------|
+| Webhook endpoint reachable | `curl .../signoz/health` returns 200 |
+| SigNoz rules loaded | `/api/v3/alerts/rules` includes `APIHighErrorRate` |
+| Manual test passes | Response is `{"status": "ok"}` |
+| Telegram notification | The alert card is received |
+| Incident created | An Incident with `source=signoz` can be found in the DB |
+| Approval created | `GET /api/v1/approvals` shows the new Approval |
+
+---
+
+## Troubleshooting
+
+| Symptom | Diagnosis | Fix |
+|------|------|------|
+| Webhook returns 404 | `curl .../signoz/health` | Confirm the port is 32334, not 8089 |
+| SigNoz rule does not fire | SigNoz UI → Alerts page | Confirm the Prometheus endpoint can scrape awoooi-api metrics |
+| Telegram message not received | Check the API logs | Confirm the `OPENCLAW_TG_BOT_TOKEN` Secret has been injected |
+| Incident creation fails | Look for `incident_creation_failed` in the API logs | Confirm the PostgreSQL connection is healthy |
--- a/docs/runbooks/RUNBOOK-WORKER-HPA.md
+++ b/docs/runbooks/RUNBOOK-WORKER-HPA.md
@@ -0,0 +1,313 @@
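+# RunBook: Worker HPA - Horizontal autoscaling setup
+
+> **Type**: Operational RunBook  
+> **Priority**: 🔴 P0 (the Worker is currently a single point of failure)  
+> **Created**: 2026-03-29 12:35 (Taipei)  
+> **Author**: Antigravity  
+> **Estimated effort**: 30-60 minutes  
+> **Prerequisite**: the K3s cluster is healthy (nodes 120 and 121 both Ready)
+
+---
+
+## Background and current state
+
+### 🔍 Current-state diagnosis
+
+**Existing HPA configuration (`12-hpa.yaml`)**:
+
+| Deployment | Min | Max | CPU threshold | Memory threshold |
+|-----------|-----|-----|---------|------------|
+| awoooi-api | 2 | 6 | 70% | 80% |
+| awoooi-web | 2 | 6 | 70% | 80% |
+| **awoooi-worker** | ❌ none | ❌ none | n/a | n/a |
+
+**What makes the Worker different**:
+- The Worker consumes Redis Streams (the Event Bus)
+- Unlike API/Web, its scaling should be driven by **queue length** rather than CPU/memory
+- K3s does not ship with KEDA (Kubernetes Event-driven Autoscaling) by default
+- **Most conservative option**: min 1, max 3, scaled on CPU
+
+---
+
+## Option comparison
+
+| Option | Pros | Cons | Fit |
+|------|------|------|-------|
+| **A: CPU HPA (available now)** | No dependencies, deployable immediately | Does not track queue length directly | ✅ Recommended (short term) |
+| B: KEDA Redis Stream HPA | Most precise; scales on queue length | Requires installing the KEDA operator | 🟡 Mid-term plan |
+| C: Fixed 2 replicas (no HPA) | Simple and stable | Wastes resources | ❌ Not recommended |
+
+**Decision**: adopt option A (CPU HPA) and document option B as the future path.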
+
+---
+
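+## Step 1: Check the Worker resource settings
+
+```bash
+# Show the current resource settings of the Worker Deployment
+kubectl get deployment awoooi-worker -n awoooi-prod -o yaml | grep -A 20 resources
+
+# Expected:
+# resources:
+#   requests:
+#     cpu: "100m"
+#     memory: "256Mi"
+#   limits:
+#     cpu: "500m"
+#     memory: "512Mi"
+```
+
+**Without resource requests the HPA cannot work!** `Utilization` targets are computed as a percentage of each container's `requests`, so add the resource settings to `08-deployment-worker.yaml` first.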
+
+---
+
+## Step 2: Update k8s/awoooi-prod/12-hpa.yaml
+
+Append the Worker HPA to the end of the existing file:
+
+```yaml
+# =============================================================================
+# Worker HPA (append to the end of 12-hpa.yaml)
+# =============================================================================
+# K-Worker 2026-03-29: Worker HPA (CPU and memory metrics, min:1 max:3)
+# Note: the Worker consumes Redis Streams; it can later move to a KEDA Redis Streams metric
+# Author: Antigravity
+# =============================================================================
+---
+apiVersion: autoscaling/v2
+kind: HorizontalPodAutoscaler
+metadata:
+  name: awoooi-worker-hpa
+  namespace: awoooi-prod
+  labels:
+    app.kubernetes.io/name: awoooi
+    app.kubernetes.io/component: worker
+  annotations:
+    description: "Worker horizontal autoscaling (1-3 replicas, 70% CPU)"
+    note: "Can later move to a KEDA Redis Streams metric that scales on queue length"
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: awoooi-worker
+  minReplicas: 1   # always keep at least 1 replica processing events
+  maxReplicas: 3   # reasonable ceiling for a 2-node cluster
+  metrics:
+    - type: Resource
+      resource:
+        name: cpu
+        target:
+          type: Utilization
+          averageUtilization: 70
+    - type: Resource
+      resource:
+        name: memory
+        target:
+          type: Utilization
+          averageUtilization: 80
+  behavior:
+    scaleUp:
+      stabilizationWindowSeconds: 120  # Worker scales up more conservatively than the API (120s vs 60s)
+      policies:
+        - type: Pods
+          value: 1
+          periodSeconds: 120
+    scaleDown:
+      stabilizationWindowSeconds: 600  # very conservative scale-down, to avoid interrupting event processing
+      policies:
+        - type: Pods
+          value: 1
+          periodSeconds: 300
+```
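+
+To reason about when this HPA adds a replica, the standard Kubernetes rule `desired = ceil(current × currentMetric / targetMetric)` can be worked through in a few lines. This is a simplified sketch (the function name is illustrative): it applies the default 10% tolerance and the min/max clamp, and ignores the stabilization windows and rate policies configured above.
+
+```python
+import math
+
+
+def desired_replicas(current: int, utilization_pct: float, target_pct: float,
+                     min_replicas: int = 1, max_replicas: int = 3,
+                     tolerance: float = 0.1) -> int:
+    """Kubernetes HPA rule: ceil(current * metric / target), clamped to [min, max]."""
+    ratio = utilization_pct / target_pct
+    if abs(ratio - 1.0) <= tolerance:   # within tolerance: no scaling
+        return current
+    wanted = math.ceil(current * ratio)
+    return max(min_replicas, min(max_replicas, wanted))
+```
+
+With one Worker at 140% of its CPU request and a 70% target, the result is 2 replicas. When several metrics are configured, as with CPU and memory here, the HPA uses the largest of the per-metric results.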
+
+---
+
+## Step 3: Make sure the Worker Deployment has resource settings
+
+```bash
+# Show the current settings
+kubectl get deployment awoooi-worker -n awoooi-prod -o jsonpath='{.spec.template.spec.containers[0].resources}'
+```
+
+If there are none, add the following to `08-deployment-worker.yaml`:
+
+```yaml
+# K8s Deployment for apps/api/src/workers
+# Add to the container spec:
+resources:
+  requests:
+    cpu: "100m"     # estimate for normal Worker load
+    memory: "256Mi"
+  limits:
+    cpu: "500m"     # stops a single Worker from consuming all CPU
+    memory: "512Mi"
+```
+
+---
+
+## Step 4: Deploy
+
+```bash
+# Method A: apply directly (recommended; only the HPA changes)
+kubectl apply -f k8s/awoooi-prod/12-hpa.yaml
+
+# Confirm the HPA was created
+kubectl get hpa -n awoooi-prod
+
+# Expected output:
+# NAME               REFERENCE             TARGETS   MINPODS   MAXPODS   REPLICAS
+# awoooi-api-hpa     Deployment/api        5%/70%    2         6         2
+# awoooi-web-hpa     Deployment/web        3%/70%    2         6         2
+# awoooi-worker-hpa  Deployment/worker     8%/70%    1         3         1   ← new
+
+# Method B: trigger through CD (standard flow)
+git add k8s/awoooi-prod/12-hpa.yaml
+git commit -m "feat(k8s): add Worker HPA (min:1 max:3 CPU 70%)"
+git push origin main
+```
+
+---
+
+## Step 5: Load test to verify the HPA triggers
+
+```bash
+# Simulate a burst of events (use with care, outside peak hours)
+for i in {1..100}; do
+  curl -s -X POST http://192.168.0.120:32334/api/v1/webhooks/alertmanager \
+    -H "Content-Type: application/json" \
+    -d '{
+      "version": "4",
+      "status": "firing",
+      "alerts": [{"status": "firing", "labels": {"alertname": "LoadTest", "severity": "info"}, "annotations": {}}]
+    }' &
+done
+wait  # let all background requests finish before watching
+
+# Watch the HPA react (refresh every 15 seconds)
+watch -n 15 'kubectl get hpa awoooi-worker-hpa -n awoooi-prod'
+```
+
+---
+
+## Mid-term roadmap: upgrade to a KEDA Redis Streams scaler
+
+```yaml
+# Once KEDA is installed, this more precise scaler can replace the HPA:
+apiVersion: keda.sh/v1alpha1
+kind: ScaledObject
+metadata:
+  name: awoooi-worker-scaledobject
+  namespace: awoooi-prod
+spec:
+  scaleTargetRef:
+    name: awoooi-worker
+  minReplicaCount: 1
+  maxReplicaCount: 5
+  triggers:
+    - type: redis-streams                  # the plain `redis` scaler only measures Redis Lists
+      metadata:
+        address: "192.168.0.188:6380"
+        stream: "awoooi:events"            # Redis Stream key
+        consumerGroup: "<consumer-group>"  # placeholder: the group signal_worker.py reads with
+        pendingEntriesCount: "20"          # target of at most 20 pending entries per Pod
+```
+
+KEDA install command (to run later):
+```bash
+kubectl apply --server-side -f https://github.com/kedacore/keda/releases/download/v2.13.1/keda-2.13.1.yaml
+```
+
+---
+
+## Acceptance criteria
+
+| Item | Pass condition |
+|------|---------|
+| HPA created | `kubectl get hpa -n awoooi-prod` lists `awoooi-worker-hpa` |
+| Metrics healthy | TARGETS shows an actual CPU percentage, not `<unknown>` |
+| Worker running | `kubectl get pod -n awoooi-prod -l app=awoooi-worker` shows Running |
+| Minimum replicas | Desired Worker replica count = 1 |
+
+---
+
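+## ⚠️ Architecture safety patch (updated 2026-03-29; read before deploying)
+
+> Source: the in-depth war-game in `ARCHITECTURAL_RISK_WAR_GAME.md`, confirmed against the code
+
+### Patch 1: XCLAIM + Active Sweeper (prerequisite for deploying the HPA)
+
+**❌ Current state**: `signal_worker.py` has no mechanism at all for reclaiming orphaned tasks from the Redis PEL.
+
+**Impact**: when a Worker Pod is scaled down by the HPA (or crashes ungracefully), the tasks it was processing stay in the Redis PEL (Pending Entries List) and are never picked up again.
+
+> 🔴 **The HPA must not be deployed until the XCLAIM mechanism has been merged into main!**
+
+The two mechanisms to add to `signal_worker.py`:
+
+```python
+# 1. Take over orphans at startup (_claim_orphaned_tasks, called from start())
+# 2. Keep scanning while running (_reclaim_loop, alongside _consume_loop)
+async def _reclaim_loop(self, interval_s: int = 300) -> None:
+    """Scan the PEL every 5 minutes and take over orphaned tasks idle for more than 5 minutes."""
+    while self._running:
+        await asyncio.sleep(interval_s)
+        claimed = await self._claim_orphaned_tasks(idle_ms=300_000)
+        if claimed > 0:
+            logger.info("active_sweeper_claimed", count=claimed)
+```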
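+
+The claim step itself is not shown in this runbook. One way it could look, as a sketch under stated assumptions: it uses `XAUTOCLAIM` (Redis 6.2+, exposed as `xautoclaim` in redis-py's asyncio client) instead of a manual `XPENDING` plus `XCLAIM` pair, and the group and consumer names are hypothetical placeholders.
+
+```python
+import asyncio
+
+STREAM = "awoooi:events"      # stream key as used in the KEDA example of this runbook
+GROUP = "signal-workers"      # hypothetical consumer-group name
+CONSUMER = "worker-1"         # hypothetical consumer name
+
+
+async def claim_orphaned_tasks(client, idle_ms: int = 300_000, batch: int = 50) -> int:
+    """Take over PEL entries idle for at least idle_ms via XAUTOCLAIM; return the count."""
+    claimed, cursor = 0, "0-0"
+    while True:
+        reply = await client.xautoclaim(
+            STREAM, GROUP, CONSUMER, min_idle_time=idle_ms,
+            start_id=cursor, count=batch,
+        )
+        cursor, messages = reply[0], reply[1]   # Redis 7 appends a third element
+        claimed += len(messages)
+        if cursor in ("0-0", b"0-0"):           # cursor wrapped: the whole PEL was scanned
+            return claimed
+
+
+def claim_once(client) -> int:
+    """Synchronous convenience wrapper for scripts and tests."""
+    return asyncio.run(claim_orphaned_tasks(client))
+```
+
+Claimed entries still have to be processed and acknowledged with `XACK`; the sketch only transfers ownership so that they stop being orphaned.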
+
+---
+
+### Patch 2: Align the three terminationGracePeriodSeconds layers
+
+**❌ Current state**: `stop()` in `signal_worker.py` has a timeout of **5 seconds**, while an AI analysis task can run for up to 60 seconds. The K8s `terminationGracePeriodSeconds` is unset (default 30 seconds). Neither value is long enough, and the two are not aligned with each other.
+
+**Both places must be changed together**:
+
+```yaml
+# k8s/awoooi-prod/08-deployment-worker.yaml
+spec:
+  template:
+    spec:
+      terminationGracePeriodSeconds: 90  # 🆕 required: 5s preStop + 75s Python timeout + 10s slack
+      containers:
+        - name: awoooi-worker
+          lifecycle:
+            preStop:
+              exec:
+                command: ["/bin/sh", "-c", "sleep 5"]  # lets K8s update Endpoints before SIGTERM (counts against the grace period)
+```
+
+```python
+# apps/api/src/workers/signal_worker.py
+async def stop(self) -> None:
+    self._running = False
+    if self._task:
+        try:
+            await asyncio.wait_for(self._task, timeout=75.0)  # 🆕 raised from 5s to 75s
+        except (asyncio.TimeoutError, asyncio.CancelledError):
+            # asyncio.TimeoutError is the builtin TimeoutError only on Python 3.11+
+            self._task.cancel()
+    logger.info("signal_worker_stopped")
+```
+
+**How the three values relate**:
+```
+preStop sleep:     5s
+Python timeout:   75s  ← 15s below the K8s grace period (10s slack once the 5s preStop is counted)
+K8s grace period: 90s  ← terminationGracePeriodSeconds
+```
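+
+The arithmetic behind these values is easy to get wrong, because the preStop hook runs inside the grace period rather than before it. The sketch below makes the budget explicit; the function names are illustrative only.
+
+```python
+def shutdown_slack(grace_s: int, pre_stop_s: int, app_timeout_s: int) -> int:
+    """Seconds left before SIGKILL once preStop and the app's own timeout have run."""
+    # The preStop hook runs inside terminationGracePeriodSeconds, so it is
+    # subtracted from the budget together with the application's timeout.
+    return grace_s - pre_stop_s - app_timeout_s
+
+
+def check_alignment(grace_s: int, pre_stop_s: int, app_timeout_s: int,
+                    longest_task_s: int) -> bool:
+    """True if the longest task fits in the app timeout and some slack remains."""
+    return app_timeout_s >= longest_task_s and shutdown_slack(
+        grace_s, pre_stop_s, app_timeout_s) > 0
+```
+
+For the values above, `shutdown_slack(90, 5, 75)` leaves 10 seconds, while the current settings (30 seconds of grace, 5 second timeout, 60 second tasks) fail the check.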
+
+---
+
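+### Compliance check commands (must run after deployment)
+
+```bash
+# Confirm terminationGracePeriodSeconds has taken effect
+kubectl get deployment awoooi-worker -n awoooi-prod \
+  -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}'
+# Expected: 90
+
+# Simulate a scale-down and confirm graceful shutdown.
+# Start following the logs in a second terminal first, because the logs of a
+# terminated Pod may no longer be available afterwards:
+#   kubectl logs -f -n awoooi-prod -l app=awoooi-worker
+kubectl scale deployment awoooi-worker -n awoooi-prod --replicas=0
+# Expected in the logs: shutdown_signal_received → signal_worker_shutting_down → signal_worker_stopped
+# The whole sequence completes within 90 seconds
+
+# Restore the Worker afterwards (the HPA does not scale a Deployment up from 0 replicas)
+kubectl scale deployment awoooi-worker -n awoooi-prod --replicas=1
+```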