awoooi/docs/phase5_telemetry_architecture.md

# Phase 5: Real Telemetry Integration Architecture Proposal

**Document version**: 1.0.0
**Proposal date**: 2026-03-21
**Proposed by**: Claude (AI Chief Architect)
**Reviewers**: Commander, Chief Architect
**Status**: Pending review

---

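## 1. Executive Summary

Phase 5 upgrades AWOOOI from manually triggered alerts to real telemetry integration, establishing a complete automated **sense → analyze → decide** pipeline.

| Layer | Responsibility | Core technology |
|------|------|----------|
| **Ingestion layer** | Receives Prometheus/SigNoz alerts | Webhook Receiver |
| **Brain layer** | AI root cause analysis + risk assessment | Ollama (llama3.2) |
| **Decision layer** | State machine trigger + SSE push | TrustEngine + SSE Publisher |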

---

## 2. System Architecture Overview

```mermaid
flowchart TB
    subgraph External["External Monitoring Sources"]
        PM[Prometheus Alertmanager]
        SZ[SigNoz Alerts]
        GF[Grafana Alerts]
    end

    subgraph AWOOOI["AWOOOI BFF Gateway"]
        subgraph Ingestion["Ingestion Layer"]
            WH["/api/v1/webhooks/alerts<br/>Webhook Receiver"]
            NM[Alert Normalizer]
            FP[Fingerprint Generator]
        end

        subgraph Brain["Brain Layer (AI RCA)"]
            OL[Ollama Service<br/>llama3.2:8b]
            PC[Prompt Composer<br/>Context Assembly]
            RP[Response Parser<br/>Structured Parsing]
        end

        subgraph Decision["Decision Layer (State Machine)"]
            TE[TrustEngine<br/>Risk Classifier]
            DB[(SQLite/PostgreSQL<br/>ApprovalRecord)]
            SSE[SSE Publisher<br/>Real-Time Push]
        end
    end

    subgraph Frontend["War Room Frontend"]
        WR[War Room Dashboard]
        AC[ApprovalCard<br/>Pending Approval Card]
    end

    PM -->|POST /webhooks/alerts| WH
    SZ -->|POST /webhooks/alerts| WH
    GF -->|POST /webhooks/alerts| WH

    WH --> NM
    NM --> FP
    FP -->|Dedup/aggregate| DB

    FP -->|New alert| PC
    PC -->|Prompt + Context| OL
    OL -->|JSON Response| RP

    RP --> TE
    TE -->|CREATE ApprovalRequest| DB
    DB -->|status=PENDING| SSE
    SSE -->|Server-Sent Events| WR
    WR --> AC
```

---

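## 3. Ingestion Layer: Alert Reception

### 3.1 Webhook Endpoint Specification

**Endpoint**: `POST /api/v1/webhooks/alerts`
**Authentication**: Bearer token (environment variable `WEBHOOK_SECRET`)
**Content-Type**: `application/json`

### 3.2 Normalized Webhook Payload Specification

AWOOOI defines a single alert ingestion format; payloads from each monitoring source are converted into it: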

```typescript
interface AWOOOIAlertPayload {
  // ========== Required fields ==========
  alert_type: string;           // Alert type (e.g., "CPUThrottlingHigh", "PodCrashLoopBackOff")
  severity: "info" | "warning" | "critical";  // Severity
  source: string;               // Monitoring source (e.g., "prometheus", "signoz", "grafana")

  // ========== K8s resource location ==========
  namespace?: string;           // Kubernetes namespace
  target_resource?: string;     // Target resource name (e.g., "nginx-frontend-7d4b8c9f5-xk2m3")
  resource_type?: "pod" | "deployment" | "service" | "node";

  // ========== Alert content ==========
  message: string;              // Human-readable alert message (required)
  description?: string;         // Detailed description

  // ========== Metrics ==========
  metrics?: {
    cpu_percent?: number;
    memory_percent?: number;
    restart_count?: number;
    error_rate?: number;
    latency_p99_ms?: number;
    [key: string]: number | string | boolean | undefined;
  };

  // ========== Timestamp ==========
  fired_at?: string;            // ISO 8601 format

  // ========== Raw data (for debugging) ==========
  raw_payload?: Record<string, unknown>;
}
```

### 3.3 Prometheus Alertmanager Converter

Prometheus Alertmanager sends its own native format, which must be converted:

```python
# Example native Alertmanager payload
{
  "receiver": "awoooi-webhook",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "CPUThrottlingHigh",
        "namespace": "production",
        "pod": "api-server-7d4b8c9f5-xk2m3",
        "severity": "critical"
      },
      "annotations": {
        "summary": "Pod is being CPU throttled",
        "description": "Pod api-server-7d4b8c9f5-xk2m3 is throttled 85% of the time"
      },
      "startsAt": "2026-03-21T10:00:00.000Z",
      "generatorURL": "http://prometheus:9090/graph?..."
    }
  ]
}

# Converted AWOOOI normalized format
{
  "alert_type": "CPUThrottlingHigh",
  "severity": "critical",
  "source": "prometheus",
  "namespace": "production",
  "target_resource": "api-server-7d4b8c9f5-xk2m3",
  "resource_type": "pod",
  "message": "Pod is being CPU throttled",
  "description": "Pod api-server-7d4b8c9f5-xk2m3 is throttled 85% of the time",
  "metrics": {
    "throttle_percent": 85
  },
  "fired_at": "2026-03-21T10:00:00.000Z"
}
```
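
The mapping above can be sketched as a small converter. This is a minimal illustration, not the project's normalizer: the `NormalizedAlert` dataclass, the `normalize_alertmanager_payload` name, the decision to skip non-firing alerts, and the fallback to `warning` for unknown severities are all assumptions. It returns a list because one Alertmanager webhook body can carry several alerts, and it does not attempt to extract `metrics` from free-text annotations.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

# Assumption: the first of these labels present on the alert names the resource.
_RESOURCE_LABELS = ("pod", "deployment", "service", "node")
_VALID_SEVERITIES = {"info", "warning", "critical"}


@dataclass
class NormalizedAlert:
    """Hypothetical Python mirror of the AWOOOIAlertPayload fields used here."""
    alert_type: str
    severity: str
    source: str
    message: str
    namespace: Optional[str] = None
    target_resource: Optional[str] = None
    resource_type: Optional[str] = None
    description: Optional[str] = None
    fired_at: Optional[str] = None
    raw_payload: dict[str, Any] = field(default_factory=dict)


def normalize_alertmanager_payload(raw: dict[str, Any]) -> list[NormalizedAlert]:
    """Convert one Alertmanager webhook body into zero or more normalized alerts."""
    normalized = []
    for item in raw.get("alerts", []):
        if item.get("status") != "firing":
            continue  # assumption: resolved notifications are ignored in this sketch
        labels = item.get("labels", {})
        annotations = item.get("annotations", {})
        resource_type = next((k for k in _RESOURCE_LABELS if k in labels), None)
        severity = labels.get("severity", "warning")
        if severity not in _VALID_SEVERITIES:
            severity = "warning"  # assumption: unknown severities degrade to warning
        normalized.append(NormalizedAlert(
            alert_type=labels.get("alertname", "Unknown"),
            severity=severity,
            source="prometheus",
            message=annotations.get("summary", labels.get("alertname", "")),
            namespace=labels.get("namespace"),
            target_resource=labels.get(resource_type) if resource_type else None,
            resource_type=resource_type,
            description=annotations.get("description"),
            fired_at=item.get("startsAt"),
            raw_payload=item,
        ))
    return normalized
```

Keeping the untouched alert in `raw_payload` lets a reviewer compare the normalized fields against what Alertmanager actually sent.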

### 3.4 Alert Fingerprinting and Storm Convergence

To prevent alert storms, a fingerprint mechanism aggregates duplicate alerts:

```python
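import hashlib


def generate_fingerprint(alert: AWOOOIAlertPayload) -> str:
    """
    Generate an alert fingerprint for deduplication and aggregation.

    Fingerprint components: alert_type + namespace + target_resource
    """
    # namespace and target_resource are optional, so treat None as empty
    identity = f"{alert.alert_type}:{alert.namespace or ''}:{alert.target_resource or ''}"
    return hashlib.sha256(identity.encode()).hexdigest()[:16]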
```

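**Convergence logic**:
1. Compute the fingerprint when an alert arrives
2. Query the database for a record with the same fingerprint and status `PENDING`
3. If one exists: `hit_count += 1`, update `last_seen_at`, and **skip LLM analysis**
4. Otherwise: send the alert into the AI analysis pipeline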
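
The four steps above can be sketched with an in-memory store standing in for the `ApprovalRecord` table. The `InMemoryDeduplicator` and `PendingRecord` names are hypothetical, and a real implementation would run the lookup and update inside a database transaction so that concurrent webhooks cannot both see "no record".

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class PendingRecord:
    fingerprint: str
    hit_count: int = 1
    last_seen_at: datetime = field(default_factory=_now)


class InMemoryDeduplicator:
    """Hypothetical stand-in for the database; holds PENDING records only."""

    def __init__(self) -> None:
        self.pending: dict[str, PendingRecord] = {}

    @staticmethod
    def fingerprint(alert_type: str, namespace: Optional[str],
                    target_resource: Optional[str]) -> str:
        identity = f"{alert_type}:{namespace or ''}:{target_resource or ''}"
        return hashlib.sha256(identity.encode()).hexdigest()[:16]

    def observe(self, alert_type: str, namespace: Optional[str] = None,
                target_resource: Optional[str] = None) -> bool:
        """Return True when the alert is new and should go to the AI pipeline."""
        fp = self.fingerprint(alert_type, namespace, target_resource)
        record = self.pending.get(fp)
        if record is not None:
            # Step 3: aggregate the duplicate and skip LLM analysis.
            record.hit_count += 1
            record.last_seen_at = _now()
            return False
        # Step 4: first sighting, so the caller proceeds to AI analysis.
        self.pending[fp] = PendingRecord(fingerprint=fp)
        return True
```

Because only `PENDING` records are matched, an alert that fires again after its approval has been resolved is treated as new and analyzed again.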

---

## 4. Brain Layer: AI Root Cause Analysis (AI RCA Pipeline)

### 4.1 Ollama Service Architecture

```mermaid
sequenceDiagram
    participant WH as Webhook Receiver
    participant PC as Prompt Composer
    participant OL as Ollama Service
    participant RP as Response Parser
    participant TE as TrustEngine

    WH->>PC: Normalized alert
    PC->>PC: Assemble context (K8s state, alert history, SOP)
    PC->>OL: POST /api/generate
    OL-->>RP: JSON Response
    RP->>RP: Parse + validate
    RP->>TE: RCA result
```

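### 4.2 Ollama Connection Configuration

```python
# src/core/config.py
OLLAMA_URL: str = "http://192.168.0.188:11434"  # AI+Web center
OLLAMA_MODEL: str = "llama3.2:8b"  # verify this tag before rollout; llama3.2 text models are published as 1b and 3b
OLLAMA_TIMEOUT: int = 60  # seconds
OLLAMA_MAX_RETRIES: int = 2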
```
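
To show how these settings might be used, here is a hedged sketch of a non-streaming call to Ollama's `/api/generate` endpoint with retries. The `generate` helper, the injectable `post` parameter, the exponential backoff, and reading `OLLAMA_MAX_RETRIES` as "retries after the first attempt" are assumptions. The request fields `model`, `system`, `prompt`, `stream` and `format`, and the `response` field in the reply, come from the Ollama generate API.

```python
import json
import time
import urllib.request
from typing import Callable

OLLAMA_URL = "http://192.168.0.188:11434"
OLLAMA_MODEL = "llama3.2:8b"
OLLAMA_TIMEOUT = 60  # seconds
OLLAMA_MAX_RETRIES = 2


def _http_post(url: str, body: bytes, timeout: int) -> bytes:
    """Default transport: a plain HTTP POST using the standard library."""
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


def generate(system: str, prompt: str,
             post: Callable[[str, bytes, int], bytes] = _http_post,
             backoff_seconds: float = 1.0) -> str:
    """Return the model's text output, retrying on transport or decoding errors."""
    body = json.dumps({
        "model": OLLAMA_MODEL,
        "system": system,
        "prompt": prompt,
        "stream": False,   # one JSON object instead of a chunk stream
        "format": "json",  # ask the model to emit valid JSON
    }).encode()
    last_error = None
    for attempt in range(OLLAMA_MAX_RETRIES + 1):
        try:
            raw = post(f"{OLLAMA_URL}/api/generate", body, OLLAMA_TIMEOUT)
            return json.loads(raw)["response"]
        except (OSError, ValueError, KeyError) as exc:
            last_error = exc
            if attempt < OLLAMA_MAX_RETRIES:
                time.sleep(backoff_seconds * (2 ** attempt))
    raise RuntimeError(
        f"Ollama request failed after {OLLAMA_MAX_RETRIES + 1} attempts"
    ) from last_error
```

Injecting the transport keeps the retry logic testable without a running Ollama instance; an async service would likely use an async HTTP client instead.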

### 4.3 Prompt Context Design

Ollama needs sufficient context to make an accurate assessment. The prompt is structured as follows:

````python
RCA_SYSTEM_PROMPT = """You are OpenClaw, an AI-powered Kubernetes operations assistant for AWOOOI platform.

Your role is to:
1. Analyze infrastructure alerts and determine root cause
2. Assess risk level and blast radius
3. Recommend specific remediation actions

IMPORTANT:
- Always respond in valid JSON format
- Be conservative with risk assessment - when in doubt, escalate
- Never recommend actions that could cause data loss without explicit warnings
"""

# Assuming the template is filled with str.format(): literal JSON braces are
# doubled so that only the named placeholders are substituted.
RCA_USER_PROMPT_TEMPLATE = """
## Alert Information
- **Type**: {alert_type}
- **Severity**: {severity}
- **Source**: {source}
- **Namespace**: {namespace}
- **Target Resource**: {target_resource}
- **Message**: {message}

## Current Metrics
{metrics_json}

## Kubernetes Context
{k8s_context}

## Historical Context
- Similar alerts in past 24h: {similar_alert_count}
- Last occurrence: {last_occurrence}

## Task
Analyze this alert and provide:
1. **Root Cause Analysis**: What is likely causing this issue?
2. **Risk Level**: low / medium / high / critical
3. **Blast Radius**:
   - Affected pods count
   - Estimated downtime
   - Related services
   - Data impact (none/read_only/write/destructive)
4. **Suggested Action**: A specific kubectl command or operation to remediate

## Response Format (JSON)
```json
{{
  "root_cause": "string - brief explanation",
  "risk_level": "low|medium|high|critical",
  "blast_radius": {{
    "affected_pods": number,
    "estimated_downtime": "string (e.g., '30s', '5m')",
    "related_services": ["service1", "service2"],
    "data_impact": "none|read_only|write|destructive"
  }},
  "suggested_action": {{
    "operation": "DELETE_POD|RESTART_DEPLOYMENT|SCALE_DEPLOYMENT",
    "command": "kubectl delete pod xxx -n namespace",
    "description": "Human readable description"
  }},
  "confidence": number (0-100),
  "dry_run_checks": [
    {{"name": "RBAC Check", "passed": true, "message": "cluster-admin"}},
    {{"name": "Resource Exists", "passed": true, "message": "Pod found"}}
  ]
}}
```
"""
````

### 4.4 K8s Context Enrichment

To improve analysis accuracy, relevant K8s state is collected before calling Ollama:

```python
from datetime import datetime, timezone


async def gather_k8s_context(
    namespace: str,
    resource_name: str,
    resource_type: str,
) -> dict:
    """
    Collect K8s context for the Ollama analysis.
    """
    context = {}

    # 1. Pod status
    if resource_type == "pod":
        pod = await k8s_client.get_pod(resource_name, namespace)
        context["pod_status"] = {
            "phase": pod.status.phase,
            "restart_count": sum(c.restart_count for c in pod.status.container_statuses or []),
            "conditions": [c.type for c in (pod.status.conditions or []) if c.status == "True"],
            # The Kubernetes Python client returns timezone-aware timestamps,
            # so subtract from an aware "now" to avoid a TypeError.
            "age_seconds": (datetime.now(timezone.utc) - pod.metadata.creation_timestamp).total_seconds(),
        }

    # 2. Related Deployment
    if resource_type in ("pod", "deployment"):
        # Pod names look like <deployment>-<replicaset-hash>-<suffix>; this
        # heuristic only holds for Deployment-managed pods. A Deployment
        # name is used unchanged.
        deploy_name = resource_name.rsplit("-", 2)[0] if resource_type == "pod" else resource_name
        deploy = await k8s_client.get_deployment(deploy_name, namespace)
        context["deployment_status"] = {
            "replicas": deploy.spec.replicas,
            "ready_replicas": deploy.status.ready_replicas or 0,
            "available_replicas": deploy.status.available_replicas or 0,
        }

    # 3. Recent events (message may be empty, so guard before truncating)
    events = await k8s_client.list_events(namespace, resource_name)
    context["recent_events"] = [
        {"type": e.type, "reason": e.reason, "message": (e.message or "")[:100]}
        for e in events[:5]
    ]

    return context
```

### 4.5 Response Parser and Validation

```python
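import json
import re
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class OllamaRCAResponse(BaseModel):
    """Schema used to validate the Ollama RCA response."""
    root_cause: str
    risk_level: Literal["low", "medium", "high", "critical"]
    blast_radius: BlastRadius
    suggested_action: SuggestedAction
    confidence: int = Field(ge=0, le=100)
    dry_run_checks: list[DryRunCheck] = Field(default_factory=list)


async def parse_ollama_response(raw_response: str) -> OllamaRCAResponse:
    """
    Parse and validate the Ollama response.

    - Extract the JSON block
    - Validate the structure with Pydantic
    - Missing optional fields fall back to the model's defaults
    """
    # Prefer a fenced JSON block; otherwise try to parse the whole response
    json_match = re.search(r'```json\s*(.*?)\s*```', raw_response, re.DOTALL)
    json_str = json_match.group(1) if json_match else raw_response

    try:
        data = json.loads(json_str)
        return OllamaRCAResponse(**data)
    except (json.JSONDecodeError, ValidationError, TypeError) as exc:
        # Raise a single error type so the caller can move to the next provider
        raise ValueError(f"Invalid RCA response: {exc}") from exc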
```

### 4.6 AI Fallback Strategy

Per ADR-006, a fallback is used when Ollama is unavailable:

```python
AI_FALLBACK_ORDER = ["ollama", "gemini", "claude"]

async def analyze_alert_with_fallback(alert: AWOOOIAlertPayload) -> OllamaRCAResponse:
    """
    Try each AI provider in order until one succeeds.
    """
    last_error = None

    for provider in settings.AI_FALLBACK_ORDER:
        try:
            if provider == "ollama":
                return await ollama_service.analyze(alert)
            elif provider == "gemini":
                return await gemini_service.analyze(alert)
            elif provider == "claude":
                return await claude_service.analyze(alert)
        except Exception as e:
            logger.warning("ai_fallback", provider=provider, error=str(e))
            last_error = e
            continue

    # Every AI provider failed; record the last error and use the rule engine
    logger.error("ai_all_providers_failed", error=str(last_error))
    return rule_based_fallback(alert)
```
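
`rule_based_fallback` is referenced above but not defined in this proposal. The sketch below shows one possible shape, returning a plain dict in the RCA response format. The function name, the rule table, the `MANUAL_INVESTIGATION` operation (which is outside the three operations listed in the prompt), and the severity-to-risk mapping are all hypothetical. The mapping deliberately never yields `low`, so nothing is auto-executed without an AI analysis.

```python
from typing import Any, Optional

# Hypothetical mapping; "low" is intentionally absent so a human always signs off.
_SEVERITY_TO_RISK = {"info": "medium", "warning": "high", "critical": "critical"}

# Hypothetical rule table: alert_type -> (operation, command template, description).
_KNOWN_REMEDIATIONS = {
    "PodCrashLoopBackOff": (
        "DELETE_POD",
        "kubectl delete pod {target} -n {namespace}",
        "Delete the pod so that its controller recreates it",
    ),
}


def rule_based_fallback_sketch(alert_type: str, severity: str,
                               namespace: Optional[str],
                               target_resource: Optional[str]) -> dict[str, Any]:
    """Build an RCA-shaped result without calling any AI provider."""
    rule = _KNOWN_REMEDIATIONS.get(alert_type)
    if rule and namespace and target_resource:
        operation, template, description = rule
        command = template.format(target=target_resource, namespace=namespace)
    else:
        operation = "MANUAL_INVESTIGATION"
        command = ""
        description = "No rule matched; manual investigation required"
    return {
        "root_cause": f"AI analysis unavailable; rule-based result for {alert_type}",
        "risk_level": _SEVERITY_TO_RISK.get(severity, "high"),
        "blast_radius": {
            "affected_pods": 1 if target_resource else 0,
            "estimated_downtime": "unknown",
            "related_services": [],
            "data_impact": "none",  # placeholder, not measured
        },
        "suggested_action": {
            "operation": operation,
            "command": command,
            "description": description,
        },
        "confidence": 0,  # signals that no model produced this result
        "dry_run_checks": [],
    }
```

A confidence of 0 gives reviewers an immediate cue in the approval card that the recommendation came from static rules.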

---

## 5. Decision Layer: State Machine Bridge (State Machine Trigger)

### 5.1 From AI Analysis to ApprovalRequest

```mermaid
stateDiagram-v2
    [*] --> AlertReceived: Webhook receives alert
    AlertReceived --> CheckFingerprint: Compute fingerprint

    CheckFingerprint --> IncrementHit: Fingerprint exists (aggregate)
    CheckFingerprint --> AIAnalysis: New fingerprint

    IncrementHit --> UpdateLastSeen: hit_count++
    UpdateLastSeen --> [*]: Skip LLM

    AIAnalysis --> OllamaRCA: Call Ollama
    OllamaRCA --> ParseResponse: Parse JSON
    ParseResponse --> ClassifyRisk: TrustEngine risk classification

    ClassifyRisk --> AutoApprove: risk=LOW
    ClassifyRisk --> CreateApproval: risk=MEDIUM/HIGH/CRITICAL

    AutoApprove --> Execute: Auto-execute
    CreateApproval --> SaveToDB: Write ApprovalRecord
    SaveToDB --> PushSSE: SSE push
    PushSSE --> WaitSignature: Await sign-off

    WaitSignature --> Execute: Sign-off complete
    Execute --> AuditLog: Write audit log
    AuditLog --> [*]
```
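
The diagram can also be written down as a transition table, which makes it easy to check in tests that no code path skips the sign-off step. This is only an illustrative encoding of the states shown above; the `TRANSITIONS` table and both helper functions are hypothetical and are not part of TrustEngine.

```python
# Each state maps to the set of states the diagram allows next.
TRANSITIONS: dict[str, set[str]] = {
    "AlertReceived": {"CheckFingerprint"},
    "CheckFingerprint": {"IncrementHit", "AIAnalysis"},
    "IncrementHit": {"UpdateLastSeen"},
    "UpdateLastSeen": set(),
    "AIAnalysis": {"OllamaRCA"},
    "OllamaRCA": {"ParseResponse"},
    "ParseResponse": {"ClassifyRisk"},
    "ClassifyRisk": {"AutoApprove", "CreateApproval"},
    "AutoApprove": {"Execute"},
    "CreateApproval": {"SaveToDB"},
    "SaveToDB": {"PushSSE"},
    "PushSSE": {"WaitSignature"},
    "WaitSignature": {"Execute"},
    "Execute": {"AuditLog"},
    "AuditLog": set(),
}
INITIAL = "AlertReceived"
TERMINAL = {"UpdateLastSeen", "AuditLog"}


def next_after_classification(risk_level: str) -> str:
    """Only LOW risk is auto-approved; every other level needs sign-off."""
    return "AutoApprove" if risk_level.lower() == "low" else "CreateApproval"


def is_complete_path(path: list[str]) -> bool:
    """True if the path starts at the initial state, ends in a terminal
    state, and every step is a transition present in the diagram."""
    if not path or path[0] != INITIAL or path[-1] not in TERMINAL:
        return False
    return all(b in TRANSITIONS.get(a, set()) for a, b in zip(path, path[1:]))
```

For example, a path that jumps from `CreateApproval` straight to `Execute` is rejected, because the diagram requires `SaveToDB`, `PushSSE` and `WaitSignature` in between.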

### 5.2 ApprovalRequest Creation Flow

```python
async def create_approval_from_rca(
    alert: AWOOOIAlertPayload,
    rca: OllamaRCAResponse,
    fingerprint: str,
) -> ApprovalRequest:
    """
    Convert the AI analysis result into an ApprovalRequest.
    """
    # 1. Assemble the action description
    action = rca.suggested_action.command
    description = f"""
**Root cause analysis**: {rca.root_cause}

**Suggested action**: {rca.suggested_action.description}

**Confidence**: {rca.confidence}%
""".strip()

    # 2. Build the request
    request = ApprovalRequestCreate(
        action=action,
        description=description,
        risk_level=RiskLevel(rca.risk_level),
        blast_radius=rca.blast_radius,
        dry_run_checks=rca.dry_run_checks,
        requested_by=f"OpenClaw (via {alert.source})",
        metadata={
            "alert_type": alert.alert_type,
            "source": alert.source,
            "namespace": alert.namespace,
            "target_resource": alert.target_resource,
            "ai_confidence": rca.confidence,
        },
    )

    # 3. Create with the fingerprint (supports aggregation)
    service = get_approval_service()
    approval = await service.create_approval_with_fingerprint(request, fingerprint)

    return approval
```

### 5.3 Real-Time SSE Push

When a new ApprovalRequest with status `PENDING` is created, it is pushed to the frontend over SSE:

```python
async def notify_new_approval(approval: ApprovalRequest) -> None:
    """
    Push a new pending approval item over SSE.
    """
    publisher = await get_publisher()

    event = {
        "type": "new_approval",
        "data": {
            "id": str(approval.id),
            "action": approval.action[:100],
            "risk_level": approval.risk_level.value,
            "required_signatures": approval.required_signatures,
            "created_at": approval.created_at.isoformat(),
        },
    }

    await publisher.publish("approvals", event)

    logger.info(
        "sse_approval_pushed",
        approval_id=str(approval.id),
        risk_level=approval.risk_level.value,
    )
```
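
For the frontend's `addEventListener('new_approval', ...)` handler to fire, the stream has to carry a named event, meaning an `event:` line followed by a `data:` line and a blank line. The sketch below shows that wire format. `format_sse` is a hypothetical helper, and whether the publisher returned by `get_publisher()` serializes events this way is not stated in this proposal.

```python
import json
from typing import Any, Optional


def format_sse(event_type: str, data: dict[str, Any],
               event_id: Optional[str] = None) -> str:
    """Serialize one Server-Sent Events frame.

    json.dumps with its default settings emits a single ASCII line, so the
    payload fits in one "data:" field.
    """
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event_type}")
    lines.append(f"data: {json.dumps(data)}")
    # A blank line terminates the frame.
    return "\n".join(lines) + "\n\n"
```

Without the `event:` line the browser dispatches the frame as a generic `message` event, and a listener registered for `new_approval` never runs.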

### 5.4 Frontend Subscription

```typescript
// Frontend SSE subscription (already implemented in OpenClawStateMachine)
useEffect(() => {
  const eventSource = new EventSource(`${apiBaseUrl}/api/v1/dashboard/stream`);

  eventSource.addEventListener('new_approval', (event) => {
    const data = JSON.parse(event.data);
    // Trigger a refetch of the pending approval list
    fetchPendingApprovals();
    // Play the notification sound
    playNotificationSound();
  });

  return () => eventSource.close();
}, []);
```

---

## 6. Security Considerations

### 6.1 Webhook Authentication

```python
# Webhook endpoint authentication
@router.post("/webhooks/alerts")
async def receive_alert(
    request: Request,
    authorization: str = Header(...),
):
    # Verify the Bearer token in constant time. Compare bytes, because
    # secrets.compare_digest raises TypeError for non-ASCII str inputs.
    expected_token = f"Bearer {settings.WEBHOOK_SECRET}"
    if not secrets.compare_digest(authorization.encode(), expected_token.encode()):
        raise HTTPException(status_code=401, detail="Invalid webhook token")

    # Process the alert...
```

### 6.2 Prompt Injection Protection

```python
def sanitize_alert_content(alert: AWOOOIAlertPayload) -> AWOOOIAlertPayload:
    """
    Sanitize alert content to guard against prompt injection.
    """
    # Redact likely instruction-injection patterns
    dangerous_patterns = [
        r"ignore previous instructions",
        r"system:",
        r"<\|.*?\|>",
    ]

    for field in ["message", "description"]:
        value = getattr(alert, field, "")
        if value:
            for pattern in dangerous_patterns:
                value = re.sub(pattern, "[REDACTED]", value, flags=re.IGNORECASE)
            setattr(alert, field, value)

    return alert
```
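
Pattern redaction on the input side is easy to evade, which is presumably why Section 11 pairs it with allowlist validation. One way to apply an allowlist on the output side is to accept a suggested command only if it exactly matches a known shape for its declared operation. The patterns and the `is_allowed_action` helper below are assumptions for illustration. The three operation names come from the response format in Section 4.3.

```python
import re

# Conservative Kubernetes-style name: lowercase alphanumerics, "-" and ".".
_NAME = r"[a-z0-9](?:[-a-z0-9.]*[a-z0-9])?"

# Hypothetical allowlist: one exact command shape per permitted operation.
_ALLOWED_COMMANDS = {
    "DELETE_POD": re.compile(
        rf"kubectl delete pod {_NAME} -n {_NAME}"
    ),
    "RESTART_DEPLOYMENT": re.compile(
        rf"kubectl rollout restart deployment/{_NAME} -n {_NAME}"
    ),
    "SCALE_DEPLOYMENT": re.compile(
        rf"kubectl scale deployment/{_NAME} --replicas=\d{{1,3}} -n {_NAME}"
    ),
}


def is_allowed_action(operation: str, command: str) -> bool:
    """Accept a command only if it fully matches the shape for its operation."""
    pattern = _ALLOWED_COMMANDS.get(operation)
    if pattern is None:
        return False
    # fullmatch rejects trailing text such as "; rm -rf /".
    return pattern.fullmatch(command.strip()) is not None
```

A check like this would run after `parse_ollama_response` and before an ApprovalRequest is created, so a hallucinated or injected command never reaches a signer as an executable action.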

### 6.3 Rate Limiting

```python
# Alert ingestion rate limit
ALERT_RATE_LIMIT = "100/minute"  # at most 100 alerts per minute
```
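
The proposal does not say how the `"100/minute"` limit is enforced. A minimal in-process sliding-window limiter is sketched below under that assumption. The `SlidingWindowLimiter` class and its injectable `clock` are hypothetical, and a multi-replica deployment would need a shared store instead of process memory.

```python
import time
from collections import deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `limit` events in any window of `window_seconds`."""

    def __init__(self, limit: int = 100, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._limit = limit
        self._window = window_seconds
        self._clock = clock
        self._hits: deque[float] = deque()

    def allow(self) -> bool:
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._hits and now - self._hits[0] >= self._window:
            self._hits.popleft()
        if len(self._hits) >= self._limit:
            return False
        self._hits.append(now)
        return True
```

In the webhook handler, a rejected call would typically map to HTTP 429 so that the sender can retry later.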

---

## 7. Implementation Milestones

| Phase | Task | Estimated effort | Depends on |
|------|------|----------|------|
| **P5.1** | Webhook Receiver + Normalizer | 2h | - |
| **P5.2** | Ollama Service 整合 | 3h | P5.1 |
| **P5.3** | Prompt Composer + Context | 2h | P5.2 |
| **P5.4** | Response Parser + Validation | 1h | P5.3 |
| **P5.5** | State Machine Integration | 2h | P5.4 |
| **P5.6** | SSE Push Enhancement | 1h | P5.5 |
| **P5.7** | E2E Integration Test | 2h | P5.6 |

**Total estimate**: 13 hours

---

## 8. Testing Strategy

### 8.1 Unit Tests

```python
# tests/test_alert_normalizer.py
def test_prometheus_alert_conversion():
    raw = {...}  # Prometheus native format
    normalized = normalize_prometheus_alert(raw)
    assert normalized.alert_type == "CPUThrottlingHigh"
    assert normalized.source == "prometheus"
```

### 8.2 Integration Tests

```bash
# Simulate an incoming alert (the payload is already in the AWOOOI normalized format)
curl -X POST http://localhost:8000/api/v1/webhooks/alerts \
  -H "Authorization: Bearer ${WEBHOOK_SECRET}" \
  -H "Content-Type: application/json" \
  -d '{
    "alert_type": "CPUThrottlingHigh",
    "severity": "critical",
    "source": "prometheus",
    "namespace": "production",
    "target_resource": "api-server-7d4b8c9f5-xk2m3",
    "message": "Pod is being CPU throttled"
  }'
```

### 8.3 Automated E2E Tests

```javascript
// scripts/test-telemetry-e2e.js
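// 1. Send a simulated alert
// 2. Wait for the Ollama analysis
// 3. Verify that the ApprovalRequest was created
// 4. Simulate sign-off
// 5. Verify execution on K8s
// 6. Verify that the AuditLog entry was written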
```

---

## 9. Monitoring and Observability

### 9.1 Key Metrics

| Metric | Type | Description |
|----------|------|------|
| `awoooi_alerts_received_total` | Counter | Total alerts received |
| `awoooi_alerts_deduplicated_total` | Counter | Alerts aggregated by fingerprint |
| `awoooi_ollama_analysis_duration_seconds` | Histogram | Ollama analysis duration |
| `awoooi_ollama_fallback_total` | Counter | AI fallback activations |
| `awoooi_approvals_created_total` | Counter | Pending approvals created |

### 9.2 Alerting Rules

```yaml
# Prometheus alerting rules
groups:
  - name: awoooi
    rules:
      - alert: OllamaHighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(awoooi_ollama_analysis_duration_seconds_bucket[5m]))) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ollama analysis latency is high"
```

---

## 10. Appendix: Data Flow Sequence Diagram

```mermaid
sequenceDiagram
    participant PM as Prometheus Alertmanager
    participant WH as AWOOOI Webhook
    participant NM as Normalizer
    participant FP as Fingerprint
    participant DB as Database
    participant OL as Ollama
    participant TE as TrustEngine
    participant SSE as SSE Publisher
    participant WR as War Room

    PM->>WH: POST /webhooks/alerts
    WH->>NM: Raw alert
    NM->>FP: Normalized alert
    FP->>DB: Look up fingerprint

    alt Fingerprint exists
        DB-->>FP: PENDING record found
        FP->>DB: hit_count++, last_seen_at=now
        FP-->>WH: 200 OK (aggregated)
    else New fingerprint
        DB-->>FP: No record
        FP->>OL: Request AI analysis
        OL-->>FP: RCA JSON
        FP->>TE: Risk assessment
        TE->>DB: CREATE ApprovalRequest
        DB->>SSE: New pending approval
        SSE->>WR: SSE Event
        WR->>WR: Show ApprovalCard
        FP-->>WH: 201 Created
    end
```

---

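## 11. Risks and Mitigations

| Risk | Impact | Mitigation |
|------|------|----------|
| Ollama response timeout | Delayed alert processing | 60s timeout + AI fallback |
| Prompt injection | Security vulnerability | Input sanitization + allowlist validation |
| Alert storm | System overload | Fingerprint aggregation + rate limit |
| LLM hallucination | Incorrect recommendations | Human sign-off + dry-run verification |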

---

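**Reporting to the Commander: the Phase 5 architecture blueprint is complete. This proposal details how Prometheus/SigNoz alerts pass through the Ollama brain and end up as pending approval cards on your desk. The Commander and the Chief Architect are invited to review it.**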