# Phase 5: Real Telemetry Integration Architecture Proposal

**Document version**: 1.0.0
**Proposal date**: 2026-03-21
**Proposed by**: Claude (AI Chief Architect)
**Reviewers**: Commander, Chief Architect
**Status**: Pending review

---

## 1. Executive Summary

Phase 5 upgrades AWOOOI from manually triggered alerts to real telemetry integration, building a complete automated **perception → analysis → decision** pipeline.

| Layer | Responsibility | Core technology |
|------|------|----------|
| **Ingestion layer** | Receives Prometheus/SigNoz alerts | Webhook Receiver |
| **Brain layer** | AI root cause analysis + risk assessment | Ollama (llama3.2) |
| **Decision layer** | State machine trigger + SSE push | TrustEngine + SSE Publisher |

---

## 2. System Architecture Overview

```mermaid
flowchart TB
    subgraph External["External monitoring sources"]
        PM[Prometheus Alertmanager]
        SZ[SigNoz Alerts]
        GF[Grafana Alerts]
    end

    subgraph AWOOOI["AWOOOI BFF Gateway"]
        subgraph Ingestion["Ingestion layer"]
            WH["/api/v1/webhooks/alerts<br/>Webhook Receiver"]
            NM[Alert Normalizer]
            FP[Fingerprint Generator]
        end

        subgraph Brain["Brain layer (AI RCA)"]
            OL[Ollama Service<br/>llama3.2:8b]
            PC[Prompt Composer<br/>context assembly]
            RP[Response Parser<br/>structured parsing]
        end

        subgraph Decision["Decision layer (State Machine)"]
            TE[TrustEngine<br/>risk classifier]
            DB[(SQLite/PostgreSQL<br/>ApprovalRecord)]
            SSE[SSE Publisher<br/>real-time push]
        end
    end

    subgraph Frontend["War Room frontend"]
        WR[War Room Dashboard]
        AC[ApprovalCard<br/>pending approval card]
    end

    PM -->|POST /webhooks/alerts| WH
    SZ -->|POST /webhooks/alerts| WH
    GF -->|POST /webhooks/alerts| WH

    WH --> NM
    NM --> FP
    FP -->|dedup / aggregate| DB

    FP -->|new alert| PC
    PC -->|Prompt + Context| OL
    OL -->|JSON Response| RP

    RP --> TE
    TE -->|CREATE ApprovalRequest| DB
    DB -->|status=PENDING| SSE
    SSE -->|Server-Sent Events| WR
    WR --> AC
```

---

## 3. Ingestion Layer: Alert Reception

### 3.1 Webhook Endpoint Specification

**Endpoint**: `POST /api/v1/webhooks/alerts`
**Authentication**: Bearer token (environment variable `WEBHOOK_SECRET`)
**Content-Type**: `application/json`

### 3.2 Normalized Webhook Payload Specification

AWOOOI defines a single alert intake format; payloads from each monitoring source are converted into it:

```typescript
interface AWOOOIAlertPayload {
  // ========== Required fields ==========
  alert_type: string;           // Alert type (e.g., "CPUThrottlingHigh", "PodCrashLoopBackOff")
  severity: "info" | "warning" | "critical";  // Severity
  source: string;               // Monitoring source (e.g., "prometheus", "signoz", "grafana")

  // ========== K8s resource location ==========
  namespace?: string;           // Kubernetes namespace
  target_resource?: string;     // Target resource name (e.g., "nginx-frontend-7d4b8c9f5-xk2m3")
  resource_type?: "pod" | "deployment" | "service" | "node";

  // ========== Alert content ==========
  message: string;              // Alert message (human readable)
  description?: string;         // Detailed description

  // ========== Metric data ==========
  metrics?: {
    cpu_percent?: number;
    memory_percent?: number;
    restart_count?: number;
    error_rate?: number;
    latency_p99_ms?: number;
    [key: string]: number | string | boolean | undefined;
  };

  // ========== Timestamp ==========
  fired_at?: string;            // ISO 8601 format

  // ========== Raw data (for debugging) ==========
  raw_payload?: Record<string, unknown>;
}
```

### 3.3 Prometheus Alertmanager Converter

The native format sent by Prometheus Alertmanager has to be converted:

```python
# Example native Alertmanager payload
{
    "receiver": "awoooi-webhook",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "CPUThrottlingHigh",
                "namespace": "production",
                "pod": "api-server-7d4b8c9f5-xk2m3",
                "severity": "critical"
            },
            "annotations": {
                "summary": "Pod is being CPU throttled",
                "description": "Pod api-server-7d4b8c9f5-xk2m3 is throttled 85% of the time"
            },
            "startsAt": "2026-03-21T10:00:00.000Z",
            "generatorURL": "http://prometheus:9090/graph?..."
        }
    ]
}

# Converted AWOOOI standard format
{
    "alert_type": "CPUThrottlingHigh",
    "severity": "critical",
    "source": "prometheus",
    "namespace": "production",
    "target_resource": "api-server-7d4b8c9f5-xk2m3",
    "resource_type": "pod",
    "message": "Pod is being CPU throttled",
    "description": "Pod api-server-7d4b8c9f5-xk2m3 is throttled 85% of the time",
    "metrics": {
        "throttle_percent": 85
    },
    "fired_at": "2026-03-21T10:00:00.000Z"
}
```

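The mapping above could be sketched roughly as follows. This is a minimal illustration, not the project's converter: the function name `normalize_alertmanager_payload`, the label lookup order, the choice to skip non-firing alerts, and the fallback severity are all assumptions. It works on plain dictionaries and does not try to derive `metrics` from the description text.

```python
from typing import Any

# Label keys checked in order to locate the affected Kubernetes resource (assumed order).
_RESOURCE_LABELS = ("pod", "deployment", "service", "node")
_SEVERITIES = {"info", "warning", "critical"}


def normalize_alertmanager_payload(raw: dict[str, Any]) -> list[dict[str, Any]]:
    """Convert every firing alert in an Alertmanager webhook body to the AWOOOI format."""
    normalized = []
    for item in raw.get("alerts", []):
        if item.get("status") != "firing":
            continue  # assumption: resolved notifications are not analyzed
        labels = item.get("labels", {})
        annotations = item.get("annotations", {})
        severity = labels.get("severity", "warning")
        alert = {
            "alert_type": labels.get("alertname", "Unknown"),
            # Unknown severities fall back to "warning" (assumption)
            "severity": severity if severity in _SEVERITIES else "warning",
            "source": "prometheus",
            "namespace": labels.get("namespace"),
            "message": annotations.get("summary", ""),
            "description": annotations.get("description"),
            "fired_at": item.get("startsAt"),
            "raw_payload": item,
        }
        # The first matching label decides both the resource name and its type
        for resource_type in _RESOURCE_LABELS:
            if resource_type in labels:
                alert["target_resource"] = labels[resource_type]
                alert["resource_type"] = resource_type
                break
        normalized.append(alert)
    return normalized
```

One Alertmanager request can carry several alerts, which is why this sketch returns a list rather than a single payload.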
### 3.4 Alert Fingerprinting and Storm Convergence

To avoid alert storms, a fingerprint mechanism aggregates duplicate alerts:

```python
import hashlib


def generate_fingerprint(alert: AWOOOIAlertPayload) -> str:
    """
    Generate an alert fingerprint, used for deduplication and aggregation.

    Fingerprint inputs: alert_type + namespace + target_resource
    """
    identity = f"{alert.alert_type}:{alert.namespace}:{alert.target_resource}"
    return hashlib.sha256(identity.encode()).hexdigest()[:16]
```

**Convergence logic**:
1. Compute the fingerprint when an alert arrives
2. Query the database for a record with the same fingerprint whose status is `PENDING`
3. If one exists: `hit_count += 1`, update `last_seen_at`, and **skip LLM analysis**
4. If none exists: enter the AI analysis pipeline

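The four convergence steps can be illustrated with an in-memory stand-in for the database. This is only a sketch: `PendingRecord`, `fingerprint_of`, and `converge` are hypothetical names, and the dictionary replaces the real `PENDING` lookup. For brevity the sketch registers a new fingerprint immediately, whereas the pipeline in this proposal creates the record only after the AI analysis.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class PendingRecord:
    """Hypothetical stand-in for a PENDING ApprovalRecord row."""
    fingerprint: str
    hit_count: int = 1
    last_seen_at: Optional[datetime] = None


def fingerprint_of(alert_type: str, namespace: Optional[str], target: Optional[str]) -> str:
    """Same recipe as generate_fingerprint: sha256 of type:namespace:resource, first 16 hex chars."""
    identity = f"{alert_type}:{namespace}:{target}"
    return hashlib.sha256(identity.encode()).hexdigest()[:16]


def converge(store: dict[str, PendingRecord], fp: str) -> bool:
    """Return True if the alert should go on to AI analysis, False if it was aggregated."""
    now = datetime.now(timezone.utc)
    record = store.get(fp)
    if record is not None:
        # Steps 2-3: a pending record exists, so count the hit and skip the LLM
        record.hit_count += 1
        record.last_seen_at = now
        return False
    # Step 4: first sighting, so the caller continues into the analysis pipeline
    store[fp] = PendingRecord(fingerprint=fp, last_seen_at=now)
    return True
```

Calling `converge` twice with the same fingerprint returns `True` and then `False`, leaving `hit_count` at 2.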

---

## 4. Brain Layer: AI Root Cause Analysis (AI RCA Pipeline)

### 4.1 Ollama Service Architecture

```mermaid
sequenceDiagram
    participant WH as Webhook Receiver
    participant PC as Prompt Composer
    participant OL as Ollama Service
    participant RP as Response Parser
    participant TE as TrustEngine

    WH->>PC: Normalized alert
    PC->>PC: Assemble context (K8s state, alert history, SOP)
    PC->>OL: POST /api/generate
    OL-->>RP: JSON Response
    RP->>RP: Parse + validate
    RP->>TE: RCA result
```

### 4.2 Ollama Connection Configuration

```python
# src/core/config.py
OLLAMA_URL: str = "http://192.168.0.188:11434"  # AI + Web center
# NOTE: confirm this tag exists on the Ollama host; llama3.2 is published
# in 1b and 3b sizes, while 8b is a llama3.1 size.
OLLAMA_MODEL: str = "llama3.2:8b"
OLLAMA_TIMEOUT: int = 60  # seconds
OLLAMA_MAX_RETRIES: int = 2
```

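As a rough illustration of how these settings might be used, the sketch below calls Ollama's `/api/generate` endpoint with a timeout and a bounded retry loop, using only the standard library. The helper names `http_post_json` and `generate` are hypothetical, and reading `OLLAMA_MAX_RETRIES = 2` as "two retries after the first attempt" is an assumption. The HTTP function is injectable so the retry logic can be exercised without a running server.

```python
import json
import urllib.request
from typing import Callable

OLLAMA_URL = "http://192.168.0.188:11434"
OLLAMA_MODEL = "llama3.2:8b"
OLLAMA_TIMEOUT = 60  # seconds
OLLAMA_MAX_RETRIES = 2

PostFn = Callable[[str, dict, float], dict]


def http_post_json(url: str, payload: dict, timeout: float) -> dict:
    """POST a JSON body and decode the JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def generate(prompt: str, system: str, post: PostFn = http_post_json) -> str:
    """Return the model's text output, retrying on network or decoding errors."""
    payload = {
        "model": OLLAMA_MODEL,
        "system": system,
        "prompt": prompt,
        "stream": False,   # one JSON object instead of a token stream
        "format": "json",  # ask Ollama to constrain the output to valid JSON
    }
    last_error = None
    for _attempt in range(OLLAMA_MAX_RETRIES + 1):
        try:
            body = post(f"{OLLAMA_URL}/api/generate", payload, OLLAMA_TIMEOUT)
            return body["response"]  # the generated text lives in "response"
        except (OSError, KeyError, ValueError) as exc:
            # OSError covers URLError and timeouts; ValueError covers bad JSON
            last_error = exc
    raise RuntimeError(
        f"Ollama failed after {OLLAMA_MAX_RETRIES + 1} attempts"
    ) from last_error
```

Raising after the last attempt lets a caller such as the fallback chain move on to the next provider.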
### 4.3 Prompt Context Design

Ollama needs sufficient context to make an accurate judgment. The prompt is structured as follows:

````python
RCA_SYSTEM_PROMPT = """You are OpenClaw, an AI-powered Kubernetes operations assistant for AWOOOI platform.

Your role is to:
1. Analyze infrastructure alerts and determine root cause
2. Assess risk level and blast radius
3. Recommend specific remediation actions

IMPORTANT:
- Always respond in valid JSON format
- Be conservative with risk assessment - when in doubt, escalate
- Never recommend actions that could cause data loss without explicit warnings
"""

RCA_USER_PROMPT_TEMPLATE = """
## Alert Information
- **Type**: {alert_type}
- **Severity**: {severity}
- **Source**: {source}
- **Namespace**: {namespace}
- **Target Resource**: {target_resource}
- **Message**: {message}

## Current Metrics
{metrics_json}

## Kubernetes Context
{k8s_context}

## Historical Context
- Similar alerts in past 24h: {similar_alert_count}
- Last occurrence: {last_occurrence}

## Task
Analyze this alert and provide:
1. **Root Cause Analysis**: What is likely causing this issue?
2. **Risk Level**: low / medium / high / critical
3. **Blast Radius**:
   - Affected pods count
   - Estimated downtime
   - Related services
   - Data impact (none/read_only/write/destructive)
4. **Suggested Action**: A specific kubectl command or operation to remediate

## Response Format (JSON)
```json
{
  "root_cause": "string - brief explanation",
  "risk_level": "low|medium|high|critical",
  "blast_radius": {
    "affected_pods": number,
    "estimated_downtime": "string (e.g., '30s', '5m')",
    "related_services": ["service1", "service2"],
    "data_impact": "none|read_only|write|destructive"
  },
  "suggested_action": {
    "operation": "DELETE_POD|RESTART_DEPLOYMENT|SCALE_DEPLOYMENT",
    "command": "kubectl delete pod xxx -n namespace",
    "description": "Human readable description"
  },
  "confidence": number (0-100),
  "dry_run_checks": [
    {"name": "RBAC Check", "passed": true, "message": "cluster-admin"},
    {"name": "Resource Exists", "passed": true, "message": "Pod found"}
  ]
}
```
"""
````

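One detail worth noting when filling in `RCA_USER_PROMPT_TEMPLATE`: the template mixes `{placeholder}` fields with literal JSON braces, so a plain `str.format()` call would fail on the JSON example. A minimal sketch of one way around this, assuming a hypothetical helper `render_prompt`, is to substitute only identifier-shaped placeholders that the caller actually supplies and leave every other brace untouched:

```python
import json
import re

# Matches {name} where name looks like a lowercase identifier, e.g. {alert_type}.
# JSON braces such as {"risk_level": ...} or a brace followed by a newline do not match.
_PLACEHOLDER = re.compile(r"\{([a-z_][a-z0-9_]*)\}")


def render_prompt(template: str, values: dict[str, object]) -> str:
    """Fill known placeholders in one pass; unknown ones are left as-is."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in values:
            return match.group(0)  # keep the placeholder visible for debugging
        return str(values[key])

    # A single pass means substituted text (e.g. a metrics JSON) is never re-scanned
    return _PLACEHOLDER.sub(substitute, template)


example = render_prompt(
    'Type: {alert_type}\nMetrics: {metrics_json}\nFormat: {"risk_level": "low"}',
    {
        "alert_type": "CPUThrottlingHigh",
        "metrics_json": json.dumps({"cpu_percent": 97}),
    },
)
```

An alternative with the same effect would be to double every literal brace in the template (`{{` and `}}`) and keep using `str.format()`.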
### 4.4 K8s Context Enrichment

To improve analysis accuracy, the relevant K8s state is collected before Ollama is called:

```python
from datetime import datetime, timezone


async def gather_k8s_context(
    namespace: str,
    resource_name: str,
    resource_type: str,
) -> dict:
    """
    Collect K8s context for Ollama to analyze.
    """
    context = {}

    # 1. Pod status
    if resource_type == "pod":
        pod = await k8s_client.get_pod(resource_name, namespace)
        context["pod_status"] = {
            "phase": pod.status.phase,
            "restart_count": sum(c.restart_count for c in pod.status.container_statuses or []),
            "conditions": [c.type for c in pod.status.conditions or [] if c.status == "True"],
            # creation_timestamp is normally timezone-aware, so compare against an aware "now"
            "age_seconds": (datetime.now(timezone.utc) - pod.metadata.creation_timestamp).total_seconds(),
        }

    # 2. Related Deployment
    if resource_type in ["pod", "deployment"]:
        # Pod names look like <deployment>-<replicaset-hash>-<suffix>;
        # a Deployment name is used unchanged.
        deploy_name = resource_name.rsplit("-", 2)[0] if resource_type == "pod" else resource_name
        deploy = await k8s_client.get_deployment(deploy_name, namespace)
        context["deployment_status"] = {
            "replicas": deploy.spec.replicas,
            "ready_replicas": deploy.status.ready_replicas or 0,
            "available_replicas": deploy.status.available_replicas or 0,
        }

    # 3. Recent events
    events = await k8s_client.list_events(namespace, resource_name)
    context["recent_events"] = [
        {"type": e.type, "reason": e.reason, "message": (e.message or "")[:100]}
        for e in events[:5]
    ]

    return context
```

### 4.5 Response Parser and Validation

```python
import json
import re
from typing import Literal

from pydantic import BaseModel, Field


# BlastRadius, SuggestedAction, and DryRunCheck are assumed to be
# Pydantic models defined elsewhere in the codebase.
class OllamaRCAResponse(BaseModel):
    """Schema validation for the Ollama RCA response."""
    root_cause: str
    risk_level: Literal["low", "medium", "high", "critical"]
    blast_radius: BlastRadius
    suggested_action: SuggestedAction
    confidence: int = Field(ge=0, le=100)
    dry_run_checks: list[DryRunCheck] = []

async def parse_ollama_response(raw_response: str) -> OllamaRCAResponse:
    """
    Parse and validate the Ollama response.

    - Extract the JSON block
    - Validate the structure with Pydantic
    - Fill in defaults for missing fields
    """
    # Try to extract a fenced JSON block
    json_match = re.search(r'```json\s*(.*?)\s*```', raw_response, re.DOTALL)
    if json_match:
        json_str = json_match.group(1)
    else:
        # Otherwise try to parse the whole response
        json_str = raw_response

    data = json.loads(json_str)
    return OllamaRCAResponse(**data)
```

### 4.6 AI Fallback Strategy

Per ADR-006, a fallback chain is used when Ollama is unavailable:

```python
AI_FALLBACK_ORDER = ["ollama", "gemini", "claude"]  # read below as settings.AI_FALLBACK_ORDER

async def analyze_alert_with_fallback(alert: AWOOOIAlertPayload) -> OllamaRCAResponse:
    """
    Try each AI provider in order until one succeeds.
    """
    last_error = None

    for provider in settings.AI_FALLBACK_ORDER:
        try:
            if provider == "ollama":
                return await ollama_service.analyze(alert)
            elif provider == "gemini":
                return await gemini_service.analyze(alert)
            elif provider == "claude":
                return await claude_service.analyze(alert)
        except Exception as e:
            logger.warning("ai_fallback", provider=provider, error=str(e))
            last_error = e
            continue

    # Every AI provider failed; fall back to the rule engine
    logger.error("ai_all_providers_failed", error=str(last_error))
    return rule_based_fallback(alert)
```

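The proposal does not define `rule_based_fallback`, so the following is only a guess at its shape. It assumes a static table keyed by `alert_type`, returns a plain dictionary with the same fields as the RCA response, and always reports `risk_level` as `"high"` with `confidence` 0 so that a rule-based result never qualifies for auto-approval. The function name `rule_based_fallback_sketch`, the single rule, and the `"NONE"` operation are all hypothetical.

```python
from typing import Optional

# Illustrative rule table: alert_type -> (operation, command template).
_RULES = {
    "PodCrashLoopBackOff": ("DELETE_POD", "kubectl delete pod {target} -n {namespace}"),
}


def rule_based_fallback_sketch(
    alert_type: str,
    namespace: Optional[str],
    target: Optional[str],
) -> dict:
    """Build a conservative RCA-shaped result without calling any LLM."""
    rule = _RULES.get(alert_type)
    if rule and namespace and target:
        operation, template = rule
        action = {
            "operation": operation,
            "command": template.format(target=target, namespace=namespace),
            "description": f"Rule-based suggestion for {alert_type}",
        }
    else:
        # No rule matched: suggest nothing and leave the decision to a human
        action = {
            "operation": "NONE",
            "command": "",
            "description": "No matching rule; manual investigation required",
        }
    return {
        "root_cause": "AI analysis unavailable; result produced by static rules",
        "risk_level": "high",  # always escalate to human sign-off
        "blast_radius": {
            "affected_pods": 1 if target else 0,
            "estimated_downtime": "unknown",
            "related_services": [],
            "data_impact": "none",
        },
        "suggested_action": action,
        "confidence": 0,  # signals that no model assessed this alert
        "dry_run_checks": [],
    }
```

Keeping the output in the same shape as the LLM response would let the decision layer treat both paths identically.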

---

## 5. Decision Layer: State Machine Trigger

### 5.1 From AI Analysis to ApprovalRequest

```mermaid
stateDiagram-v2
    [*] --> AlertReceived: Webhook receives alert
    AlertReceived --> CheckFingerprint: Compute fingerprint

    CheckFingerprint --> IncrementHit: Fingerprint exists (aggregate)
    CheckFingerprint --> AIAnalysis: New fingerprint

    IncrementHit --> UpdateLastSeen: hit_count++
    UpdateLastSeen --> [*]: Skip LLM

    AIAnalysis --> OllamaRCA: Call Ollama
    OllamaRCA --> ParseResponse: Parse JSON
    ParseResponse --> ClassifyRisk: TrustEngine risk classification

    ClassifyRisk --> AutoApprove: risk=LOW
    ClassifyRisk --> CreateApproval: risk=MEDIUM/HIGH/CRITICAL

    AutoApprove --> Execute: Execute automatically
    CreateApproval --> SaveToDB: Write ApprovalRecord
    SaveToDB --> PushSSE: SSE push
    PushSSE --> WaitSignature: Wait for sign-off

    WaitSignature --> Execute: Sign-off complete
    Execute --> AuditLog: Write audit log
    AuditLog --> [*]
```

### 5.2 ApprovalRequest Creation Flow

```python
async def create_approval_from_rca(
    alert: AWOOOIAlertPayload,
    rca: OllamaRCAResponse,
    fingerprint: str,
) -> ApprovalRequest:
    """
    Convert the AI analysis result into an ApprovalRequest.
    """
    # 1. Assemble the action description
    action = rca.suggested_action.command
    description = f"""
**Root cause analysis**: {rca.root_cause}

**Suggested action**: {rca.suggested_action.description}

**Confidence**: {rca.confidence}%
""".strip()

    # 2. Build the request
    request = ApprovalRequestCreate(
        action=action,
        description=description,
        risk_level=RiskLevel(rca.risk_level),
        blast_radius=rca.blast_radius,
        dry_run_checks=rca.dry_run_checks,
        requested_by=f"OpenClaw (via {alert.source})",
        metadata={
            "alert_type": alert.alert_type,
            "source": alert.source,
            "namespace": alert.namespace,
            "target_resource": alert.target_resource,
            "ai_confidence": rca.confidence,
        },
    )

    # 3. Create it with the fingerprint (supports aggregation)
    service = get_approval_service()
    approval = await service.create_approval_with_fingerprint(request, fingerprint)

    return approval
```

### 5.3 Real-Time SSE Push

When a new ApprovalRequest in the `PENDING` state is created, it is pushed to the frontend over SSE:

```python
async def notify_new_approval(approval: ApprovalRequest) -> None:
    """
    Push a new pending approval item over SSE.
    """
    publisher = await get_publisher()

    event = {
        "type": "new_approval",
        "data": {
            "id": str(approval.id),
            "action": approval.action[:100],
            "risk_level": approval.risk_level.value,
            "required_signatures": approval.required_signatures,
            "created_at": approval.created_at.isoformat(),
        },
    }

    await publisher.publish("approvals", event)

    logger.info(
        "sse_approval_pushed",
        approval_id=str(approval.id),
        risk_level=approval.risk_level.value,
    )
```

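For a browser listener registered on the `new_approval` event name to fire, the stream has to carry that name in an `event:` field. The sketch below shows what the serialized message could look like on the wire. The helper name `format_sse` is hypothetical, and whether the project's publisher formats events this way is an assumption.

```python
import json


def format_sse(event: dict) -> str:
    """Serialize {"type": ..., "data": ...} into a single Server-Sent Events message."""
    # json.dumps without indent yields one line, so a single data: field is enough
    payload = json.dumps(event["data"], ensure_ascii=False)
    # A blank line terminates the message
    return f"event: {event['type']}\ndata: {payload}\n\n"


message = format_sse(
    {"type": "new_approval", "data": {"id": "42", "risk_level": "high"}}
)
```

Messages sent without an `event:` field are delivered to `onmessage` instead of to named listeners.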
### 5.4 Frontend Subscription

```typescript
// Frontend SSE subscription (already implemented in OpenClawStateMachine)
useEffect(() => {
  const eventSource = new EventSource(`${apiBaseUrl}/api/v1/dashboard/stream`);

  eventSource.addEventListener('new_approval', (event) => {
    const data = JSON.parse(event.data);
    // Trigger a refetch of the pending approval list
    fetchPendingApprovals();
    // Play the notification sound
    playNotificationSound();
  });

  return () => eventSource.close();
}, [apiBaseUrl]);
```


---

## 6. Security Considerations

### 6.1 Webhook Authentication

```python
import secrets

from fastapi import Header, HTTPException, Request


# Webhook endpoint authentication
@router.post("/webhooks/alerts")
async def receive_alert(
    request: Request,
    authorization: str = Header(...),
):
    # Verify the Bearer token in constant time.
    # compare_digest only accepts ASCII str, so compare bytes instead.
    expected_token = f"Bearer {settings.WEBHOOK_SECRET}"
    if not secrets.compare_digest(authorization.encode(), expected_token.encode()):
        raise HTTPException(status_code=401, detail="Invalid webhook token")

    # Process the alert...
```

### 6.2 Prompt Injection Protection

```python
import re


def sanitize_alert_content(alert: AWOOOIAlertPayload) -> AWOOOIAlertPayload:
    """
    Sanitize alert content to guard against prompt injection.
    """
    # Strip likely instruction-injection patterns
    dangerous_patterns = [
        r"ignore previous instructions",
        r"system:",
        r"<\|.*?\|>",
    ]

    for field in ["message", "description"]:
        value = getattr(alert, field, "")
        if value:
            for pattern in dangerous_patterns:
                value = re.sub(pattern, "[REDACTED]", value, flags=re.IGNORECASE)
            setattr(alert, field, value)

    return alert
```

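Input sanitization on its own is a denylist, so it seems worth pairing it with an allowlist check on what the model proposes before the action reaches a reviewer. The sketch below is an assumption about how such a check might look: the operation names come from the response format in this proposal, while the helper `is_action_allowed` and the exact command patterns are hypothetical.

```python
import re

# Kubernetes-style names: lowercase alphanumerics, dashes, and dots
_NAME = r"[a-z0-9][a-z0-9.-]*"
_NS = r"[a-z0-9][a-z0-9-]*"

# One strict command shape per permitted operation (illustrative patterns)
COMMAND_PATTERNS = {
    "DELETE_POD": re.compile(rf"kubectl delete pod {_NAME} -n {_NS}"),
    "RESTART_DEPLOYMENT": re.compile(rf"kubectl rollout restart deployment/{_NAME} -n {_NS}"),
    "SCALE_DEPLOYMENT": re.compile(rf"kubectl scale deployment/{_NAME} --replicas=\d+ -n {_NS}"),
}


def is_action_allowed(operation: str, command: str) -> bool:
    """Accept only known operations whose command matches the expected shape exactly."""
    pattern = COMMAND_PATTERNS.get(operation)
    # fullmatch rejects trailing text such as "; rm -rf /" appended to a valid command
    return pattern is not None and pattern.fullmatch(command) is not None
```

A suggestion that fails this check could be downgraded to a manual investigation item instead of being shown as an executable command.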
### 6.3 Rate Limiting

```python
# Alert intake rate limit
ALERT_RATE_LIMIT = "100/minute"  # at most 100 alerts per minute
```

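The proposal does not say which limiter enforces `ALERT_RATE_LIMIT`. As one possible reading, the sketch below parses the `"count/unit"` string and applies a sliding-window check in memory. `parse_rate` and `SlidingWindowLimiter` are hypothetical names, and a multi-process deployment would need shared state instead of a local deque.

```python
import time
from collections import deque
from typing import Callable

_UNIT_SECONDS = {"second": 1.0, "minute": 60.0, "hour": 3600.0}


def parse_rate(spec: str) -> tuple[int, float]:
    """Turn a spec such as "100/minute" into (limit, window_seconds)."""
    count, _, unit = spec.partition("/")
    return int(count), _UNIT_SECONDS[unit]


class SlidingWindowLimiter:
    """Allow at most `limit` calls within any `window_seconds` span."""

    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._limit = limit
        self._window = window_seconds
        self._clock = clock
        self._hits: deque[float] = deque()

    def allow(self) -> bool:
        now = self._clock()
        # Drop timestamps that have aged out of the window
        while self._hits and now - self._hits[0] >= self._window:
            self._hits.popleft()
        if len(self._hits) >= self._limit:
            return False
        self._hits.append(now)
        return True


limiter = SlidingWindowLimiter(*parse_rate("100/minute"))
```

A webhook handler could call `limiter.allow()` first and answer with HTTP 429 when it returns `False`.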

---

## 7. Implementation Milestones

| Phase | Task | Estimate | Depends on |
|------|------|----------|------|
| **P5.1** | Webhook Receiver + Normalizer | 2h | - |
| **P5.2** | Ollama Service integration | 3h | P5.1 |
| **P5.3** | Prompt Composer + Context | 2h | P5.2 |
| **P5.4** | Response Parser + Validation | 1h | P5.3 |
| **P5.5** | State Machine Integration | 2h | P5.4 |
| **P5.6** | SSE Push Enhancement | 1h | P5.5 |
| **P5.7** | E2E Integration Test | 2h | P5.6 |

**Total estimate**: 13 hours

---

## 8. Testing Strategy

### 8.1 Unit Tests

```python
# tests/test_alert_normalizer.py
def test_prometheus_alert_conversion():
    raw = {...}  # native Prometheus format
    normalized = normalize_prometheus_alert(raw)
    assert normalized.alert_type == "CPUThrottlingHigh"
    assert normalized.source == "prometheus"
```

### 8.2 Integration Tests

```bash
# Simulate Prometheus Alertmanager sending an alert
curl -X POST http://localhost:8000/api/v1/webhooks/alerts \
  -H "Authorization: Bearer ${WEBHOOK_SECRET}" \
  -H "Content-Type: application/json" \
  -d '{
    "alert_type": "CPUThrottlingHigh",
    "severity": "critical",
    "source": "prometheus",
    "namespace": "production",
    "target_resource": "api-server-7d4b8c9f5-xk2m3",
    "message": "Pod is being CPU throttled"
  }'
```

### 8.3 Automated E2E Tests

```javascript
// scripts/test-telemetry-e2e.js
// 1. Send a simulated alert
// 2. Wait for the Ollama analysis
// 3. Verify that an ApprovalRequest was created
// 4. Simulate sign-off
// 5. Verify the K8s execution
// 6. Verify that the AuditLog entry was written
```

---

## 9. Monitoring and Observability

### 9.1 Key Metrics

| Metric name | Type | Description |
|----------|------|------|
| `awoooi_alerts_received_total` | Counter | Total alerts received |
| `awoooi_alerts_deduplicated_total` | Counter | Alerts aggregated by fingerprint |
| `awoooi_ollama_analysis_duration_seconds` | Histogram | Ollama analysis duration |
| `awoooi_ollama_fallback_total` | Counter | Number of AI fallback activations |
| `awoooi_approvals_created_total` | Counter | Pending approvals created |

### 9.2 Alerting Rules

```yaml
# Prometheus alerting rules
groups:
  - name: awoooi
    rules:
      - alert: OllamaHighLatency
        # histogram_quantile needs the per-bucket rate of the _bucket series, grouped by le
        expr: histogram_quantile(0.95, sum by (le) (rate(awoooi_ollama_analysis_duration_seconds_bucket[5m]))) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ollama analysis latency is high"
```

---

## 10. Appendix: Data Flow Sequence Diagram

```mermaid
sequenceDiagram
    participant PM as Prometheus Alertmanager
    participant WH as AWOOOI Webhook
    participant NM as Normalizer
    participant FP as Fingerprint
    participant DB as Database
    participant OL as Ollama
    participant TE as TrustEngine
    participant SSE as SSE Publisher
    participant WR as War Room

    PM->>WH: POST /webhooks/alerts
    WH->>NM: Raw alert
    NM->>FP: Normalized alert
    FP->>DB: Look up fingerprint

    alt Fingerprint exists
        DB-->>FP: PENDING record found
        FP->>DB: hit_count++, last_seen_at=now
        FP-->>WH: 200 OK (aggregated)
    else New fingerprint
        DB-->>FP: No record
        FP->>OL: Call AI analysis
        OL-->>FP: RCA JSON
        FP->>TE: Risk assessment
        TE->>DB: CREATE ApprovalRequest
        DB->>SSE: New pending approval
        SSE->>WR: SSE Event
        WR->>WR: Show ApprovalCard
        FP-->>WH: 201 Created
    end
```

---

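## 11. Risks and Mitigations

| Risk | Impact | Mitigation |
|------|------|----------|
| Ollama response timeout | Delayed alert processing | 60s timeout + AI fallback |
| Prompt injection | Security vulnerability | Input sanitization + allowlist validation |
| Alert storm | System overload | Fingerprint aggregation + rate limit |
| LLM hallucination | Incorrect recommendations | Human sign-off + dry-run validation |

---

**Reporting to the Commander: the Phase 5 architecture blueprint is complete. This proposal details how Prometheus/SigNoz alerts pass through Ollama's brain and end up as pending approval cards on your desk. The Commander and the Chief Architect are requested to review it.**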