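# AWOOOI Unified Integration Architecture Design

> **Document type**: Unified architecture design (Single Source of Truth for Integration)
> **Priority**: 🔴 Commander's top directive
> **Created**: 2026-03-29 13:27 (Taipei)
> **Core proposition**: Every node must operate inside the neural network of one shared brain; isolated islands are not allowed.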

---

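## Part 1: Honest Assessment of the Current State (Precise)

### Confirmed: better than the audit report claimed

| Item | Audit report's misjudgment | Actual state |
|------|------------|---------|
| Worker SIGTERM handling | Reported as "missing" | ✅ **Implemented** (`signal_worker.py:450-455`): `signal.signal(SIGTERM)` + `shutdown_event` |
| Worker graceful shutdown flow | Reported as "needs implementation" | ✅ **Implemented** (`stop()` method, with a heartbeat mechanism) |
| SigNoz webhook route | Reported as "not deployed" | ✅ **Routed** (`main.py:419`) |

### Confirmed: worse than the audit report claimed

| Item | Audit report version | Actual gap |
|------|------------|---------|
| Worker `stop()` timeout | Not mentioned | ❌ **Only 5 seconds**; an AI analysis takes 30-60 seconds and gets force-killed |
| K8s `terminationGracePeriodSeconds` | Not mentioned | ❌ **Not set**; the K8s default of 30 seconds is not enough |
| ESLint i18n enforcement | Said "blocked in CI" | ❌ **Only a TODO comment** (`.eslintrc.js:20-22`); the plugin is not installed |
| Visual Regression across platforms | Said "screenshot comparison" | ❌ **Font rendering differs between Mac and CI Linux**; baselines must not be generated on a Mac |
| PostgreSQL HA | Said "Streaming Replication" | ❌ **No failover mechanism**; a primary failure requires manual intervention |
| Redis HA | Not mentioned at all | ❌ **No Sentinel**; Redis is a single point of failure |
| SSE Event Sourcing | Only the event types were designed | ❌ **All GenUI cards disappear after an F5 refresh** |
| Kali integration | Passive cronjob | 🟡 Too low-level; should be promoted to a SecurityAgent |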

---

## Part 2: Complete Integration Map

### 2.1 System Neural-Network Topology

```
External event inputs
┌──────────────────────────────────────────────────────────────────┐
│  Alertmanager :9093   Sentry :9000   SigNoz :3301                │
│  GitHub Actions       Kali Scanner   K8s Events                  │
└──────────┬──────────────┬──────────────┬─────────────────────────┘
           │              │              │
           ▼              ▼              ▼
┌──────────────────────────────────────────────────────────────────┐
│                     AWOOOI API (K3s :32334)                      │
│  ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌──────────────┐   │
│  │Alertmanager│ │Sentry      │ │SigNoz      │ │Kali (future) │   │
│  │Webhook     │ │Webhook     │ │Webhook     │ │Webhook       │   │
│  └─────┬──────┘ └─────┬──────┘ └─────┬──────┘ └──────┬───────┘   │
│        │              │              │               │           │
│        └──────────────┴──────────────┴───────────────┘           │
│                               │                                  │
│                               ▼                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │              Signal Worker (Redis XREADGROUP)              │  │
│  │  ← consumes the awoooi:signals stream                      │  │
│  │  → IncidentEngine (aggregation / GraphRAG / persistence)   │  │
│  └────────────────────────────┬───────────────────────────────┘  │
│                               │                                  │
│                               ▼                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │               OpenClaw (192.168.0.188:8089)                │  │
│  │  Decision engine: RCA → Blast Radius → Risk → Action       │  │
│  │  Tools: kubectl / SSH / Prometheus / SigNoz                │  │
│  └──────┬─────────────────────┬───────────────────────────────┘  │
│         │                     │                                  │
│         ▼                     ▼                                  │
│  Telegram Bot          Approval DB (PostgreSQL)                  │
│  Commander alerts      Approval queue                            │
└──────────────────────────────────────────────────────────────────┘
                              │
                Commander approves / rejects
                              │
                 ┌────────────▼────────────┐
                 │   Auto-Repair Actions   │
                 │ restart/scale/rollback  │
                 └─────────────────────────┘
```
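
The Signal Worker box above consumes `awoooi:signals` through `XREADGROUP`. The loop below is a minimal sketch of that pattern, assuming a redis-py style asyncio client is passed in and the consumer group already exists (`XGROUP CREATE`). The group name `signal-workers`, the consumer name `worker-1`, and the `handle` callback are hypothetical; the real `signal_worker.py` may be structured differently.

```python
import asyncio
from typing import Any, Awaitable, Callable

STREAM = "awoooi:signals"   # stream name taken from the topology diagram
GROUP = "signal-workers"    # hypothetical consumer group name
CONSUMER = "worker-1"       # hypothetical consumer name


async def consume_signals(
    client: Any,
    stop_event: asyncio.Event,
    handle: Callable[[str, dict], Awaitable[None]],
) -> int:
    """Read entries via XREADGROUP until stop_event is set; return the count handled."""
    handled = 0
    while not stop_event.is_set():
        # ">" asks for entries that were never delivered to this group.
        batches = await client.xreadgroup(
            groupname=GROUP,
            consumername=CONSUMER,
            streams={STREAM: ">"},
            count=10,
            block=1000,  # milliseconds; keeps the loop responsive to stop_event
        )
        for _stream, entries in batches or []:
            for entry_id, fields in entries:
                await handle(entry_id, fields)
                # Acknowledge only after the handler succeeded, so a crash
                # leaves the entry pending for redelivery.
                await client.xack(STREAM, GROUP, entry_id)
                handled += 1
    return handled
```

Because the stop flag is only checked between batches, an in-flight batch always finishes before the loop exits, which is why the shutdown timeouts discussed in Gap 1 matter.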

### 2.2 Frontend Integration Map

```
AWOOOI Web (K3s :32335 / Next.js)
┌──────────────────────────────────────────────────────────────────┐
│                                                                  │
│  ┌── Dashboard (/) ───────────────────────────────────────────┐  │
│  │  AutonomyIndexPanel ← GET /api/v1/stats/autonomy           │  │
│  │  SystemPulseRow     ← GET /api/v1/stats/overview           │  │
│  │  DecisionZone       ← SSE /api/v1/approvals/stream         │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
│  ┌── Omni-Terminal ───────────────────────────────────────────┐  │
│  │  Input Area         → POST /api/v1/terminal/command        │  │
│  │  ThinkingStream     ← SSE /api/v1/terminal/stream/{id}     │  │
│  │  GenUI Renderer     ← event: render_ui                     │  │
│  │  Event Replay       ← Redis List (Last-Event-ID)           │  │ ← ❌ missing
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
│  ┌── Knowledge Base (/knowledge-base) ────────────────────────┐  │
│  │  ❌ Blank page; backend API missing                        │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
│  [Deep-dive jump-out links]                                      │
│  → SigNoz:  http://192.168.0.188:3301 (new tab)                  │
│  → Grafana: http://192.168.0.188:3000 (new tab)                  │
│  → Sentry:  http://192.168.0.110:9000 (new tab)                  │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
```

---

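## Part 3: System-Integrated Fixes for the Six Major Gaps

### Gap 1: Worker terminationGracePeriodSeconds is too short

**Root cause**: `stop()` in `signal_worker.py` waits 5 seconds, but an AI analysis task can run for up to 60 seconds. The K8s default is `terminationGracePeriodSeconds: 30`. Neither value is large enough, and the two are not aligned with each other.

**Dependencies on the wider integration**:
- Scale-down impact: when the HPA scales down, K8s sends SIGTERM, `stop()` is called, and the task is force-killed after 5 seconds
- Upstream dependency: Sentry and Alertmanager analysis jobs both run as Worker background tasks
- Downstream impact: PostgreSQL writes may be left incomplete (Incident state becomes dirty)

**Fix: align the three timing values**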

```yaml
# k8s/awoooi-prod/08-deployment-worker.yaml
# Set terminationGracePeriodSeconds explicitly
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 90  # enough time for the Worker to finish its in-flight task
      containers:
        - name: awoooi-worker
          lifecycle:
            preStop:
              exec:
                # preStop runs BEFORE SIGTERM is sent and counts against the 90 s budget
                command: ["/bin/sh", "-c", "sleep 5"]
```

```python
# apps/api/src/workers/signal_worker.py
# Align the stop() timeout with the K8s terminationGracePeriodSeconds

async def stop(self) -> None:
    if not self._running:
        return
    self._running = False
    if self._task:
        try:
            # Raised from 5 s to 75 s. With terminationGracePeriodSeconds=90 and a
            # 5 s preStop sleep, 85 s remain after SIGTERM, leaving a 10 s buffer.
            await asyncio.wait_for(self._task, timeout=75.0)
        except asyncio.TimeoutError:  # same class as TimeoutError on Python 3.11+
            logger.warning("signal_worker_stop_timeout_forcekill")
            # Redundant safety net: wait_for() has already cancelled the task
            self._task.cancel()
        except asyncio.CancelledError:
            pass
    logger.info("signal_worker_stopped")
```
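
The three values have to be read together: the preStop sleep is spent inside the grace period, so the time available to `stop()` is the grace period minus the preStop duration. The check below is a small sketch of that arithmetic; the `ShutdownBudget` class and its field names are illustrative only and do not exist in the codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShutdownBudget:
    grace_period_s: float   # terminationGracePeriodSeconds
    pre_stop_s: float       # duration of the preStop sleep
    stop_timeout_s: float   # timeout passed to asyncio.wait_for() in stop()
    longest_task_s: float   # longest expected AI analysis task

    @property
    def headroom_s(self) -> float:
        """Seconds left before SIGKILL once stop() has timed out."""
        return self.grace_period_s - self.pre_stop_s - self.stop_timeout_s

    def problems(self) -> list[str]:
        issues = []
        if self.stop_timeout_s < self.longest_task_s:
            issues.append("stop() timeout is shorter than the longest task")
        if self.headroom_s <= 0:
            issues.append("no headroom left before SIGKILL")
        return issues


# Values proposed in this section versus the values found in the audit.
proposed = ShutdownBudget(grace_period_s=90, pre_stop_s=5, stop_timeout_s=75, longest_task_s=60)
current = ShutdownBudget(grace_period_s=30, pre_stop_s=0, stop_timeout_s=5, longest_task_s=60)

print(proposed.headroom_s, proposed.problems())
print(current.headroom_s, current.problems())
```

A check like this could run in CI against the deployment manifest so the two files cannot drift apart silently.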

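**Integration verification commands**:
```bash
# Simulate a K8s scale-down and confirm the Worker shuts down gracefully.
# Start streaming logs first: once the pods are gone they can no longer be looked up.
kubectl logs -n awoooi-prod -l app=awoooi-worker --tail=20 -f &
kubectl scale deployment awoooi-worker -n awoooi-prod --replicas=0
wait  # returns when the pods have terminated and the log streams close

# Expected output:
# shutdown_signal_received signal=15
# signal_worker_shutting_down
# signal_worker_shutdown_complete
# (the whole sequence finishes within 90 seconds)
```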

---

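### Gap 2: Enforce i18n with ESLint (eslint-plugin-i18next)

**Root cause**: `.eslintrc.js:20-22` contains only a TODO comment; `eslint-plugin-i18next` is not installed.

**Dependencies on the wider integration**:
- This is the **guard layer** that follows the i18n cleanup: the cleanup pays off past debt, ESLint prevents future debt
- It must block in the `pnpm build` and `pnpm lint` CI steps
- Scope: every frontend development workflow (AI-generated code must pass too)

**Fix: install and enable the plugin**

```bash
# Step 1: install
cd apps/web
pnpm add -D eslint-plugin-i18next
```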

```javascript
// apps/web/.eslintrc.js after the change
module.exports = {
  extends: [
    '@awoooi/eslint-config/react',
    'next/core-web-vitals',
    'plugin:i18next/recommended',   // ← added
  ],
  plugins: ['i18next'],              // ← added
  parserOptions: {
    project: './tsconfig.json',
    tsconfigRootDir: __dirname,
  },
  rules: {
    '@next/next/no-html-link-for-pages': 'off',
    'no-console': 'off',

    // 🚨 i18n iron rule: all JSX text must go through t()
    // A violation blocks the PR (error level)
    // Option names below follow the eslint-plugin-i18next v5 schema; check the installed version
    'i18next/no-literal-string': ['error', {
      markupOnly: true,     // check JSX markup only, not plain JS strings
      ignoreAttribute: [    // attributes exempt from the check
        'className', 'id', 'href', 'src', 'type', 'key',
        'data-testid', 'aria-label', 'placeholder'
      ],
    }],

    '@typescript-eslint/no-explicit-any': 'warn',
    '@typescript-eslint/no-unused-vars': ['warn', { argsIgnorePattern: '^_', varsIgnorePattern: '^_' }],
    '@typescript-eslint/consistent-type-imports': 'warn',
    'no-constant-condition': 'warn',
  },
  ignorePatterns: [
    'node_modules', '.next', 'out', 'dist', 'test-results',
    '*.config.js', '*.config.ts',
  ],
}
```

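**CI integration (must be added to cd.yaml)**:

```yaml
# .github/workflows/cd.yaml: add a Lint step before Build
- name: 🔍 ESLint i18n enforcement check
  run: |
    cd apps/web
    pnpm lint
  # Failure = hard-coded strings present = deployment blocked
```

**⚠️ Important**: the first time `eslint-plugin-i18next` is enabled, **all 40+ existing violations are reported at once**. The i18n cleanup must therefore be finished before the rule is switched on. **Correct order**:
1. i18n cleanup (one-off fix of the 40+ violations)
2. Install eslint-plugin-i18next (enable the guard)
3. Add the CI Lint step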

---

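### Gap 3: Cross-platform rendering in Visual Regression

**Root cause**: Mac (M1/M2/M3) and GitHub Actions (Ubuntu Linux) use different font rendering engines (CoreText vs FreeType), so screenshots do not match pixel for pixel.

**Dependencies on the wider integration**:
- Baseline snapshots must come from a single source (the CI Docker environment)
- Every baseline update must be auditable (via a PR); silent local updates are not allowed

**Fix: force baseline updates through Docker + tune the thresholds**

```json
// apps/web/package.json: new scripts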
{
  "scripts": {
    "test:visual:update": "docker run --rm -v $(pwd):/work -w /work -p 3000:3000 mcr.microsoft.com/playwright:v1.44.0-jammy pnpm exec playwright test --update-snapshots --project=chromium --grep @visual",
    "test:visual": "pnpm exec playwright test --project=chromium --grep @visual"
  }
}
```

```typescript
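// apps/web/playwright.config.ts: screenshot comparison settings
import { defineConfig } from '@playwright/test';

export default defineConfig({
  expect: {
    toHaveScreenshot: {
      // Per-pixel color tolerance (0-1, Playwright default 0.2); not a share of pixels
      threshold: 0.05,
      // At most 5% of the pixels may differ (absorbs small rendering differences)
      maxDiffPixelRatio: 0.05,
      // Fonts come from the CI Docker image, which is the only baseline source
    },
  },
  use: {
    // Fixed viewport so screen DPI differences do not affect screenshots
    viewport: { width: 1280, height: 720 },
    deviceScaleFactor: 1,    // force 1x to avoid Retina differences
  },
});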
```
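
For a sense of scale, `maxDiffPixelRatio` bounds the fraction of pixels that may differ, while `threshold` only controls how different a single pixel's color must be before it counts. The arithmetic below is a sketch for the viewport configured above; the helper name is illustrative.

```python
def max_differing_pixels(width: int, height: int, max_diff_pixel_ratio: float) -> int:
    """Number of pixels allowed to differ before the screenshot assertion fails."""
    return round(width * height * max_diff_pixel_ratio)


total_pixels = 1280 * 720
allowed = max_differing_pixels(1280, 720, 0.05)
print(f"{allowed} of {total_pixels} pixels may differ")
```

A 5% budget is generous enough to hide a changed button or badge, so it may be worth tightening once the Docker baseline removes the cross-platform noise.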

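**Mandatory rule (add as a clause to .awoooi-agent-rules.md)**:

```markdown
## Clause 21: Visual Regression Baseline Update Rules

🚨 Never run `--update-snapshots` directly in the local Mac environment
✅ Baselines must be updated through the following flow:

1. Run locally: `pnpm test:visual:update` (Docker environment)
2. The .png screenshots generated in Docker are saved to tests/e2e/__snapshots__/
3. Open a PR tagged 📸 VISUAL_UPDATE
4. Merge only after the Commander has visually reviewed the screenshots
```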

---

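### Gap 4: PostgreSQL HA (Patroni / CloudNativePG)

**Root cause**: PostgreSQL runs as a single Docker container on .188, and the K3s datastore depends on it as well (ADR-033). If the database goes down, the K3s control plane and the AWOOOI data fail together.

**Dependencies on the wider integration**:
- PostgreSQL is AWOOOI's Episodic Memory (Incidents, Approvals, Audit Logs)
- PostgreSQL is also the K3s HA datastore (K3s metadata for nodes 120/121)
- Letting Auto-Repair run `docker restart` against PostgreSQL is **dangerous** (possible dirty pages)

**Fix roadmap (three phases)**:

```
Phase DB-A (1 week, low risk):
  Monitoring hardening
  ├── Enable the PG slow query log (log_min_duration_statement = 2000ms)
  ├── Add the pg_stat_statements extension and export it to Prometheus
  └── Disable auto_repair.postgres.restart (avoid dirty pages)

Phase DB-B (1 month, medium risk):
  Backup strategy
  ├── Velero + PostgreSQL volume snapshots (Velero already exists; volume backup must be added)
  └── Confirm WAL archiving to MinIO (WAL-E/WAL-G)

Phase DB-C (Q2, needs evaluation):
  HA strategy evaluation
  ├── Option A: CloudNativePG (Kubernetes-native PostgreSQL operator)
  │     → Deploy CloudNativePG in K3s; automatic primary/replica failover
  ├── Option B: Patroni + keepalived (VM-level HA)
  │     → Deploy Patroni on .188 and a standby machine
  └── Option C: Citus (sharding; too complex, not considered for now)

  Recommendation: Option A (CloudNativePG), the best fit with K3s
  (the K3s datastore itself cannot be hosted by an operator running on that same cluster,
   so it needs a separate path during the evaluation)
```

**Protective measures that can be applied immediately**:

```yaml
# k8s/awoooi-prod/manual-remediation/postgres-recovery.yaml
# PostgreSQL emergency recovery playbook (manual operation)

# Event: PostgreSQL is down
# Actions:
# 1. OpenClaw raises an alert + Telegram notification
# 2. Alertmanager generates a CRITICAL Approval (no automatic repair)
# 3. After the Commander approves, run:
#    ssh root@192.168.0.188 'docker restart awoooi-postgres'
#    kubectl rollout restart deployment/awoooi-api -n awoooi-prod
#    kubectl rollout restart deployment/awoooi-worker -n awoooi-prod
```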

---

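### Gap 5: Redis HA (Sentinel mode)

**Root cause**: Redis runs as a single container on .188 (port 6380) with no standby. Redis simultaneously carries:
- Working Memory (Incident aggregation state)
- SSE Terminal Event Store (the future event source)
- Sentry Dedup Cache (10-minute dedup TTL)
- Anomaly Counter (core data for ADR-037)

**Dependencies on the wider integration**:
- Redis down = the Signal Worker cannot consume events = the entire alert pipeline is broken
- Enabling AOF has a performance cost that needs to be evaluated

**Fix roadmap**: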

```
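Phase Redis-A (immediate, zero risk):
  Confirm the AOF configuration
  ├── ssh root@192.168.0.188 'docker exec awoooi-redis redis-cli CONFIG GET appendonly'
  └── Confirm appendonly yes (otherwise Working Memory is wiped on restart)

Phase Redis-B (1 month, moderate engineering effort):
  Deploy Redis Sentinel (Sentinel + Replica on .110)
  ├── .188: Master (existing)
  ├── .110: Replica + Sentinel (Redis recommends at least three Sentinels for a robust quorum)
  └── OpenClaw uses a redis-sentinel:// URI to discover the Master automatically

  Configuration change:
  # AWOOOI API connects through Sentinel
  # (redis-sentinel:// is not a scheme redis-py parses natively; the client code must handle it)
  REDIS_URL=redis-sentinel://sentinel1:26379/awoooi-master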
```
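
If the API uses redis-py, `Redis.from_url()` does not understand a `redis-sentinel://` scheme, so the connection would go through the library's Sentinel helper instead. The sketch below shows one way to do that with `redis.asyncio.sentinel.Sentinel`; the comma-separated host format and both function names are assumptions for illustration, while the service name `awoooi-master` comes from the configuration above.

```python
from typing import List, Tuple


def parse_sentinel_hosts(spec: str, default_port: int = 26379) -> List[Tuple[str, int]]:
    """Turn 'host1:26379,host2' into [('host1', 26379), ('host2', 26379)]."""
    hosts = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        host, _, port = item.partition(":")
        hosts.append((host, int(port) if port else default_port))
    return hosts


def master_client(sentinel_spec: str, service_name: str = "awoooi-master"):
    """Return an asyncio Redis client that follows whichever node is the current master."""
    # Imported lazily so the parsing helper stays usable without redis-py installed.
    from redis.asyncio.sentinel import Sentinel

    sentinel = Sentinel(parse_sentinel_hosts(sentinel_spec), socket_timeout=0.5)
    return sentinel.master_for(service_name, decode_responses=True)
```

Usage would look like `redis = master_client("sentinel1:26379")`; after a failover the same client object resolves the new master on its next connection.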

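**Protective measures that can be applied immediately**:

```bash
# Check the Redis AOF status
ssh root@192.168.0.188 'docker exec awoooi-redis redis-cli CONFIG GET appendonly; \
  docker exec awoooi-redis redis-cli CONFIG GET appendfsync; \
  docker exec awoooi-redis redis-cli INFO persistence | grep aof'

# If appendonly = no, enable it at runtime (no restart needed)
ssh root@192.168.0.188 'docker exec awoooi-redis redis-cli CONFIG SET appendonly yes'
# Note: CONFIG SET takes effect immediately but is lost on the next restart unless it is
# also persisted (CONFIG REWRITE with a config file, or --appendonly yes in the container command)
```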

---

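### Gap 6: SSE Event Sourcing (Terminal state survives a refresh)

**Root cause**: the Omni-Terminal SSE stream is stateless; after an F5 refresh every `render_ui` GenUI card disappears.

**Dependencies on the wider integration**:
- This is foundational infrastructure for the Agentic Workspace user experience
- It relies on a Redis List as the Event Store (without AOF, events are also lost when Redis restarts)
- It must be built in step with the design of the three SSE event types

**Fix: a three-layer mechanism**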

```
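Layer 1: Backend Event Store (Redis List)
  terminal.py → writes every SSE event to a Redis List as it is sent
  Key: terminal:events:{command_id}
  TTL: 3600 seconds (1 hour)

Layer 2: Frontend reconnect (Last-Event-ID)
  useTerminalSSE → on a dropped connection, EventSource resends Last-Event-ID automatically;
                   after F5 the EventSource is new, so the client passes the stored ID itself
  Backend then: fetches the missed events from Redis → Replay → resumes the live stream

Layer 3: Local Zustand persistence
  useTerminalStore → persisted to sessionStorage with the persist middleware (zustand/middleware)
  F5 refresh → GenUI cards are restored from sessionStorage (fast UI-level recovery)
  Meanwhile → the SSE reconnect fills in newer server-side events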
```

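**Key backend implementation code**:

```python
# apps/api/src/api/v1/terminal.py: adds the Event Store mechanism
# (`router` is the APIRouter already defined in this module)

import json
from datetime import datetime
from zoneinfo import ZoneInfo

from fastapi.responses import StreamingResponse

from src.core.redis_client import get_redis


def now_taipei_iso() -> str:
    # Stand-in in case the project does not already provide this helper
    return datetime.now(ZoneInfo("Asia/Taipei")).isoformat()


async def stream_with_persistence(command_id: str, event_type: str, data: dict) -> str:
    """
    Emit an SSE event and write it to the Redis Event Store at the same time,
    so it can be replayed after an F5 refresh.
    """
    redis = get_redis()

    event_payload = {
        "type": event_type,
        "data": data,
        "timestamp": now_taipei_iso(),
    }

    # RPUSH appends on the right and returns the new list length,
    # which doubles as the 1-based SSE event id
    key = f"terminal:events:{command_id}"
    event_id = await redis.rpush(key, json.dumps(event_payload))
    await redis.expire(key, 3600)  # cleaned up automatically after 1 hour

    # Return the SSE-formatted string
    return f"id: {event_id}\nevent: {event_type}\ndata: {json.dumps(data)}\n\n"


@router.get("/stream/{command_id}/replay")
async def replay_terminal_events(command_id: str, last_event_id: int = 0):
    """
    Replay the events missed after the given ID (used when reconnecting after F5).
    """
    redis = get_redis()
    key = f"terminal:events:{command_id}"

    # Event N sits at 0-based index N-1, so index last_event_id is the first unseen event
    events = await redis.lrange(key, last_event_id, -1)

    async def generate():
        for i, event_json in enumerate(events):
            event = json.loads(event_json)
            yield f"id: {last_event_id + i + 1}\nevent: {event['type']}\ndata: {json.dumps(event['data'])}\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")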
```
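
The replay logic depends on one convention: the SSE `id` is the 1-based position of the event in the list, so the last ID a client saw is also the 0-based index of the first event it missed. The in-memory stand-in below sketches that arithmetic without Redis; the class name and the `thinking` and `done` event types are illustrative (only `render_ui` is named in this document).

```python
import json
from typing import Dict, List


class InMemoryEventStore:
    """Stand-in for the Redis List; mimics RPUSH and LRANGE semantics only."""

    def __init__(self) -> None:
        self._lists: Dict[str, List[str]] = {}

    def append(self, command_id: str, event_type: str, data: dict) -> int:
        events = self._lists.setdefault(command_id, [])
        events.append(json.dumps({"type": event_type, "data": data}))
        return len(events)  # like RPUSH: the new length is the 1-based event id

    def replay(self, command_id: str, last_event_id: int = 0) -> List[str]:
        events = self._lists.get(command_id, [])
        frames = []
        # Event N lives at index N-1, so slicing from last_event_id
        # yields exactly the events the client has not seen yet.
        for offset, raw in enumerate(events[last_event_id:]):
            event = json.loads(raw)
            frames.append(
                f"id: {last_event_id + offset + 1}\n"
                f"event: {event['type']}\n"
                f"data: {json.dumps(event['data'])}\n\n"
            )
        return frames


store = InMemoryEventStore()
store.append("cmd-1", "thinking", {"text": "analysing"})
store.append("cmd-1", "render_ui", {"card": "ApprovalCard"})
store.append("cmd-1", "done", {})

# A client that saw event 1 before pressing F5 asks for everything after it.
missed = store.replay("cmd-1", last_event_id=1)
print(len(missed))
```

Passing `last_event_id=0` replays the full history, which is what a client with an empty sessionStorage would need.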

---

## Part 4: Integration Dependency Graph (Execution Order)

```
┌──────────────────────────────────────────────────────────────────────┐
│                    Integration execution priority                    │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  🔴 P0 Immediate (this week; blocks later work)                      │
│  ┌────────────────────────────────────────────────────────┐          │
│  │  1. Check Redis AOF status (5 min)                     │          │
│  │  2. Fix Worker terminationGracePeriodSeconds (1 h)     │          │
│  │  3. i18n cleanup (4 h) ← must precede ESLint plugin    │          │
│  │  4. Install and enable the ESLint i18n plugin (1 h)    │          │
│  └────────────────────────────────────────────────────────┘          │
│                              ↓ (unlocks next tier when done)         │
│                                                                      │
│  🟠 P1 Short term (2-3 weeks)                                        │
│  ┌────────────────────────────────────────────────────────┐          │
│  │  5. Configure Sentry Comment Token (2 h)               │          │
│  │  6. Deploy SigNoz alert rules to .188 (2 h)            │          │
│  │  7. Deploy Worker HPA YAML (30 min)                    │          │
│  │  8. Weekly E2E CI schedule (30 min)                    │          │
│  │  9. Build Visual Regression Docker baseline (2 h)      │          │
│  └────────────────────────────────────────────────────────┘          │
│                              ↓ (unlocks next tier when done)         │
│                                                                      │
│  🟡 P2 Mid term (Month 2)                                            │
│  ┌────────────────────────────────────────────────────────┐          │
│  │  10. Omni-Terminal SSE Event Sourcing (8 h)            │          │
│  │  11. Storybook: 10 core components (8 h)               │          │
│  │  12. Nexus AI autonomy-rate UI (8 h)                   │          │
│  │  13. FinOps Dashboard UI (8 h)                         │          │
│  │  14. Deploy Redis Sentinel (1 day)                     │          │
│  └────────────────────────────────────────────────────────┘          │
│                              ↓ (unlocks next tier when done)         │
│                                                                      │
│  ⚪ P3 Long term (Q2-Q3)                                             │
│  ┌────────────────────────────────────────────────────────┐          │
│  │  15. Evaluate and adopt CloudNativePG (2+ days)        │          │
│  │  16. Kali SecurityAgent (as an MCP tool)               │          │
│  │  17. Build the full Knowledge Base backend             │          │
│  │  18. Phase 4: visual soul injection                    │          │
│  └────────────────────────────────────────────────────────┘          │
└──────────────────────────────────────────────────────────────────────┘
```
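
The hard ordering constraints stated in this document (i18n cleanup before the ESLint plugin, the plugin before the CI lint step, Redis AOF before SSE Event Sourcing) can be checked mechanically. The sketch below uses the standard library's `graphlib`; the task keys are shorthand labels invented for this example.

```python
from graphlib import TopologicalSorter

# task -> tasks that must finish first (only dependencies stated in this document)
dependencies = {
    "i18n_cleanup": set(),
    "eslint_i18n_plugin": {"i18n_cleanup"},
    "ci_lint_step": {"eslint_i18n_plugin"},
    "redis_aof_check": set(),
    "sse_event_sourcing": {"redis_aof_check"},
    "worker_grace_period": set(),
}

# static_order() yields every task after all of its prerequisites
# and raises graphlib.CycleError if the constraints contradict each other.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Tasks without stated prerequisites may appear in any relative position; only the listed edges are guaranteed.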

---

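## Part 5: Integration Risk Matrix

| Risk | Likelihood | Impact | Mitigation |
|------|--------|------|---------|
| **Mass errors once ESLint is enabled** | 100% (40+ violations exist) | CI fully blocked | Clean up first, then enable; follow the correct order |
| **Worker timeout change causes Pod startup problems** | Low | Service interruption | Test in the Dev namespace first |
| **Redis AOF affects performance** | Medium | Slight latency increase | Use `appendfsync everysec` (not `always`) |
| **First Visual Regression Docker baseline** | N/A | Takes 1-2 h to generate the baseline | Schedule during off-peak hours |
| **Primary failure while PostgreSQL has no HA** | Low | Full outage | Backup strategy (Velero) + playbook in place |
| **SSE Event Sourcing depends on Redis** | N/A | Events are lost when Redis fails | Fix Redis AOF first, then implement Event Sourcing |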

---

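## Part 6: Design Principles for Integrating Monitoring with the Frontend

(Continues the "Three-Way Separation Principle" (三義分離原則) from `MONITORING_ARCHITECTURE_DEEP_DIVE.md`)

**Core constraints of the integration design**:

```
1. Monitoring data → digested silently in the backend, not shown to the Commander directly
   └─ 99%: raw Prometheus/SigNoz/Sentry data
   └─ 1%: what the AI cannot handle automatically → surfaces as an ApprovalCard

2. The frontend never queries Prometheus/SigNoz directly
   └─ All monitoring data is wrapped by the AWOOOI API
   └─ API layer: /api/v1/stats/overview, /api/v1/slo, /api/v1/finops

3. Deep investigation happens only through "smart jump-out" links (new tab)
   └─ GenUI cards provide ExternalLinks buttons
   └─ Embedding Grafana/SigNoz in an iframe is strictly forbidden

4. The AI autonomy index is the frontend's only "monitoring summary" entry point
   └─ Top of the Dashboard (/): AutonomyIndexPanel
   └─ One number replaces all the charts
```

---

*"Integration is not wiring every tool together; it is making every tool serve the same brain."* 🦞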