OG T 89e05e6ea2 docs: ADR-037 + monitoring architecture proposal + Runbooks

- ADR-037 monitoring enhancement architecture
- MONITORING_MASTER_PLAN master plan
- MASTER_EXECUTION_SCHEDULE execution schedule
- Phase D/E/Worker HPA Runbooks

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-03-29 16:04:08 +08:00

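AWOOOI Monitoring, the Complete Plan: Make Monitoring the Sensory Nerves of AI Intelligence, Not a Constraint

Document type: Architecture design + implementation runbook
Priority: 🔴 Top priority
Created: 2026-03-29 12:38 (Taipei)
Author: Antigravity
Core thesis: Monitoring is not the goal; it is the nerve endings of AI decision-making.

1. Core Philosophy: Positioning Monitoring Relative to AI

1.1 "It must not degrade into a monitoring product": where does this fear come from?

The underlying logic of traditional monitoring products (Grafana / Prometheus / Datadog) is:

"The system lays out the raw data; humans are responsible for reading it and making the decision."

This turns users into data porters rather than decision-makers.

AWOOOI's positioning must be:

"The AI digests all the data and proactively brings its conclusions to the commander: 'Here is what I recommend. Do you approve?'"

1.2 The golden rule: which monitoring data should "disappear into the backend" and which "must surface to the frontend"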

┌────────────────────────────────────────────────────────────────────────────────────────┐
│  The two fates of monitoring information                                               │
├───────────────────────────────────┬────────────────────────────────────────────────────┤
│ 🔒 Silent digest (backend)        │ 🚨 Proactive surfacing (pushed to commander)       │
├───────────────────────────────────┼────────────────────────────────────────────────────┤
│ Raw Prometheus metric values      │ Approval cards created after AI flags an anomaly   │
│ SigNoz trace details              │ Alerts when Anomaly Counter reaches ESCALATE       │
│ Full Sentry error stack traces    │ Result summaries after Auto-Repair runs            │
│ Grafana dashboard charts          │ P0 incident queue-jumping (Priority Preemption)    │
│ Alertmanager rule configuration   │ Daily AI health summary (proactively pushed)       │
│ K8s Pod status details            │ FinOps cost anomaly alerts                         │
└───────────────────────────────────┴────────────────────────────────────────────────────┘

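Core conclusion: 99% of monitoring data should be digested silently. Only the 1% that the AI cannot handle automatically should surface as a card requiring the commander's decision.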
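
The split described above can be sketched as a small routing rule. This is only an illustration: the `Signal` fields, the `Fate` enum, and the `ALWAYS_SURFACE` set are hypothetical names chosen for this example (the kinds are loosely taken from the right-hand column of the table), not part of the AWOOOI codebase.

```python
from dataclasses import dataclass
from enum import Enum


class Fate(Enum):
    SILENT = "silent_digest"   # stays in the backend
    SURFACE = "surface"        # pushed to the commander as a card


@dataclass
class Signal:
    kind: str            # e.g. "prometheus_metric", "p0_incident" (hypothetical labels)
    ai_can_handle: bool  # True when the AI can settle it without a human


# Hypothetical set of signal kinds that always reach the commander.
ALWAYS_SURFACE = {"p0_incident", "anomaly_escalation", "finops_cost_anomaly"}


def decide_fate(signal: Signal) -> Fate:
    """Surface only what the AI cannot settle on its own."""
    if signal.kind in ALWAYS_SURFACE or not signal.ai_can_handle:
        return Fate.SURFACE
    return Fate.SILENT
```

Under these assumptions, a raw Prometheus sample that the AI can interpret stays silent, while a P0 incident surfaces even if an automatic repair is available.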

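2. Complete Monitoring Node Inventory (Current State vs. Target)

2.1 Five-layer monitoring architecture

Layer 0: Physical sensing layer (hosts/nodes)
    ↓
Layer 1: Service sensing layer (containers/Pods)
    ↓
Layer 2: Application sensing layer (API/frontend/Worker)
    ↓
Layer 3: AI intelligence layer (LLM inference/tool calls)
    ↓
Layer 4: Business sensing layer (user behavior/cost/SLO)

2.2 Full node inventory by layer

Layer 0: Physical sensing layer

Node	Tool	Data	Status	Gap
.110 CPU/Memory/Disk	Node Exporter	System metrics	✅ Prometheus	—
.112 CPU/Memory/Disk	Node Exporter	System metrics	🟡 Isolated, no Webhook	No alert integration
.188 CPU/Memory/Disk	Node Exporter	System metrics + GPU	✅ Prometheus	—
.120 K3s Master	Node Exporter + kube-state	K3s node metrics	✅	—
.121 K3s Worker	Node Exporter + kube-state	K3s node metrics	✅	—
VIP .125	Blackbox Exporter	TCP health	✅ Configured	—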

Layer 1: Service sensing layer (Docker/K8s)

Service	Prometheus	Sentry	OTEL	Alerting	Auto-repair	Gap
awoooi-api	✅	✅	✅	✅ Complete	✅	—
awoooi-web	✅	✅	✅	✅ Complete	✅	—
awoooi-worker	✅	✅	✅	🟡	✅	HPA missing
Ollama	✅	—	—	✅	✅ Restart	—
OpenClaw	✅	✅	✅	✅	✅ Restart	—
Redis	✅	—	—	✅	❌ (cautious)	Auto-repair too conservative
PostgreSQL	✅	—	—	✅	❌ (cautious)	Same as above
Harbor	✅	—	—	✅	—	—
Sentry	✅	—	—	✅	—	—
Langfuse	✅	—	—	✅	—	—
MinIO	❌	—	—	❌	❌	Not monitored at all
Kali Scanner	❌	—	—	❌	❌	Isolated node

Layer 2: Application sensing layer

Data type	Tool	Status	Gap
API Error Rate	Prometheus + SigNoz	✅	—
API Latency P50/P95/P99	SigNoz OTEL	✅	—
Distributed Traces	SigNoz	✅	—
Frontend Web Vitals (LCP/FID/CLS)	Sentry	✅	—
Frontend JS Errors	Sentry	✅	—
Frontend Session Replay	Sentry	✅	—
Frontend Rage Click	Sentry	✅	Not integrated into AI analysis
API Slow Query	Sentry + structlog	✅	No AI-generated optimization suggestions
K8s Resource Quota	kube-state-metrics	✅	—
Alert Chain E2E	Prometheus Counter	✅ ADR-037	—

Layer 3: AI intelligence layer

Data type	Tool	Status	Gap
LLM request/response traces	Langfuse	✅	—
LLM token usage/cost	Langfuse	✅	No AWOOOI dashboard
Ollama inference latency	Prometheus	✅	—
AI fallback trigger count	Prometheus	✅ (ADR-006)	—
NVIDIA Circuit Breaker	Prometheus	✅ (ADR-036)	—
AI Autonomy Index	—	❌ Entirely missing	Core metric not yet defined
Anomaly Counter statistics	Redis counter	✅ ADR-037	No frontend display
Approval decision analytics	PostgreSQL	✅	Raw CRUD only, no analytics

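Layer 4: Business sensing layer

Data type	Tool	Status	Gap
SLO attainment	Prometheus + rules	✅ Defined	No visualization
Incident MTTR (mean time to resolution)	PostgreSQL	✅ Raw data	No calculation or display
FinOps cost tracking	cost_analyzer.py	✅ Logic	No UI, completely idle
User action audit	audit_logs.py	✅	—
Knowledge base query statistics	—	❌	No knowledge base backend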

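3. Integration Gap Analysis

3.1 "Last mile" gaps (tooling exists but is not integrated)

Gap	Tooling ready	What is missing	Effort
MinIO monitoring	Prometheus	MinIO Exporter not deployed	1h
Kali security scanning	Nmap/ZAP on .112	No AWOOOI Webhook integration	2h
FinOps frontend	cost_analyzer.py	No API endpoint + no UI	8h
AI Autonomy Index	Prometheus Counter can be built	Metric definition + dashboard	4h
Rage Click → AI analysis	Sentry `get_ux_audit_summary()`	No trigger, not called periodically	2h
Anomaly Counter frontend display	Redis + anomaly_counter.py	No GenUI card	4h
SLO visualization	Prometheus rules already defined	No Grafana/frontend display	3h
MTTR calculation	PostgreSQL holds incidents data	No calculation API endpoint	2h
Dual-Prometheus federation	One instance each on .188/.110	No federation config	2h

Estimated total effort to close the integration gaps: ~28 hours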
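
The ~28 hour figure can be checked by summing the Effort column of the table above. The dictionary keys below are just labels for this example.

```python
# Effort per gap in hours, copied from the gap table above.
gap_hours = {
    "MinIO monitoring": 1,
    "Kali security scanning": 2,
    "FinOps frontend": 8,
    "AI Autonomy Index": 4,
    "Rage Click to AI analysis": 2,
    "Anomaly Counter frontend display": 4,
    "SLO visualization": 3,
    "MTTR calculation": 2,
    "Dual-Prometheus federation": 2,
}

total_hours = sum(gap_hours.values())
print(f"Total integration effort: ~{total_hours} hours")
```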

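4. Monitoring UI Presentation Strategy (the core design that keeps AWOOOI from becoming a monitoring product)

4.1 Three monitoring UI anti-patterns (strictly forbidden)

❌ Anti-pattern A: Embedding Grafana in an iframe
   → The whole page is Grafana, so users feel they are using Grafana

❌ Anti-pattern B: A top-level "Monitoring" menu item
   → Demotes AWOOOI to "Grafana with AI assistance"

❌ Anti-pattern C: Displaying raw Prometheus metrics directly
   → Users see syntax like rate(http_requests_total[5m]), which violates the AI-native experience

4.2 The correct monitoring UI architecture: the "Three-Mode Separation Principle"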

┌──────────────────────────────────────────────────────────────────────────────────────────┐
│  AWOOOI monitoring UI: three-mode separation                                             │
├──────────────────────────┬──────────────────────────┬────────────────────────────────────┤
│ Mode 1: AI surfaces      │ Mode 2: Ask and answer   │ Mode 3: Deep-dive escape           │
│ (AWOOOI frontend)        │ (Omni-Terminal)          │ (direct links to external tools)   │
├──────────────────────────┼──────────────────────────┼────────────────────────────────────┤
│ Nexus page               │ /status awoooi-api       │ 🔗 Grafana (new tab)               │
│ → AI health pulse        │ → GenUI health card      │ 🔗 SigNoz (new tab)                │
│ → Autonomy index         │ /cost this-month         │ 🔗 Sentry (new tab)                │
│ → Anomaly trend chart    │ → FinOps cost card       │                                    │
│ War Room page            │ /trace xxx               │ Never embedded; keeps the AWOOOI   │
│ → Pending Approvals      │ → Trace summary card     │ interface clean                    │
└──────────────────────────┴──────────────────────────┴────────────────────────────────────┘

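4.3 Monitoring presentation spec for the Nexus page

This is the home page (The Nexus, the global mind hub) and the single entry point for monitoring summaries:

// Nexus page structure (Nothing.tech pure-white industrial style)

// Block A: AI Autonomy Index (largest and most important)
<AutonomyIndexPanel>
  AI intercepted and auto-repaired today: 7/10 incidents (70% autonomy rate)
  ↗ Up 12% from yesterday
</AutonomyIndexPanel>

// Block B: System pulse (3 numbers, no charts)
<SystemPulseRow>
  <PulseMetric label="Healthy services" value="24/25" status="healthy" />
  <PulseMetric label="Active alerts" value="0" status="healthy" />
  <PulseMetric label="Pending decisions" value="2" status="warning" />
</SystemPulseRow>

// Block C: AI thinking stream (background animation, not the focus)
<ThinkingStream>
  [Investigator] Redis latency: 2ms ... OK
  [Investigator] API error rate: 0.1% ... OK
  [Investigator] cert://*.wooo.work: 42 days ... OK
</ThinkingStream>

// Block D: Cards that need the commander's decision (the only interactive block)
// → Appears only when a decision is pending; otherwise this area stays silent
<DecisionZone>
  <ApprovalCard urgency="CRITICAL" ... />
</DecisionZone>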

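4.4 Presenting monitoring data in the Omni-Terminal (ask-and-answer mode)

Terminal input → AI digests the raw metrics → returns a GenUI card (not raw numbers)

User input	AI behavior	GenUI card type
`/status all`	Query the health of all services	`SystemHealthCard`
`/status awoooi-api`	Query API P95 latency + error rate	`ServiceDetailCard`
`/cost`	Call cost_analyzer.py	`FinOpsCard`
`/trace last 5 minutes`	Query SigNoz slow traces	`TraceListCard`
`/incident today`	Query today's incidents + AI summary	`IncidentSummaryCard`
`/alert status`	Check the alert chain end to end	`AlertChainStatusCard`
`/slo`	Compute API/Web SLO attainment	`SLODashboardCard`

🎯 This is the AI-native experience: the user is always talking to the AI, never operating charts directly.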
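
The command table above could be resolved to a card type with a lookup like the one below. This is a minimal sketch: `route_terminal_input` and `Routed` are hypothetical names, and the rule that `/status` with no argument behaves like `/status all` is an assumption.

```python
from typing import NamedTuple, Optional


class Routed(NamedTuple):
    command: str   # e.g. "/status"
    argument: str  # free text after the command, possibly empty
    card: str      # GenUI card type to render


# Card types taken from the table above; /status is handled separately
# because its card depends on the argument.
_CARDS = {
    "/cost": "FinOpsCard",
    "/trace": "TraceListCard",
    "/incident": "IncidentSummaryCard",
    "/alert": "AlertChainStatusCard",
    "/slo": "SLODashboardCard",
}


def route_terminal_input(text: str) -> Optional[Routed]:
    """Split raw terminal input into command + argument and pick a card."""
    parts = text.strip().split(maxsplit=1)
    if not parts:
        return None
    command = parts[0]
    argument = parts[1] if len(parts) > 1 else ""
    if command == "/status":
        # Assumption: a bare "/status" is treated like "/status all".
        card = "SystemHealthCard" if argument in ("", "all") else "ServiceDetailCard"
    else:
        card = _CARDS.get(command)
        if card is None:
            return None  # not a monitoring command
    return Routed(command, argument, card)
```

For example, `route_terminal_input("/status awoooi-api")` yields a `ServiceDetailCard` with `awoooi-api` as the argument, while an unknown command returns `None` so the caller can fall back to free-form AI handling.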

4.5 "Deep-dive" mode: smart escape

When a user needs raw Grafana / SigNoz data, AWOOOI offers a smart escape link instead of an embed:

// Inside a GenUI card, offer "deep-dive" buttons
<ServiceDetailCard service="awoooi-api">
  <MetricRow label="P95 latency" value="124ms" status="healthy" />
  <MetricRow label="Error rate" value="0.1%" status="healthy" />
  
  {/* Smart escape buttons */}
  <ExternalLinks>
    <SmartLink 
      icon="📊" 
      label="SigNoz detailed traces"
      href="http://192.168.0.188:3301/traces?service=awoooi-api&from=now-1h"
      target="_blank"   // ← opens in a new tab so the AWOOOI interface stays clean
    />
    <SmartLink
      icon="📈"
      label="Grafana live charts"
      href="http://192.168.0.188:3000/d/awoooi-api"
      target="_blank"
    />
    <SmartLink
      icon="🐛"
      label="Sentry Issues"
      href="http://192.168.0.110:9000/organizations/sentry/issues/?project=awoooi-api"
      target="_blank"
    />
  </ExternalLinks>
</ServiceDetailCard>

5. Implementation Steps for Monitoring Integration

Wave M-1: Start immediately (1 week)

M-1.1 MinIO monitoring integration (1h)

# Deploy the MinIO Exporter on 192.168.0.188
docker run -d \
  --name minio-exporter \
  --network momo-pro-network \
  -e MINIO_URL=http://minio:9000 \
  -e MINIO_ACCESS_KEY=minio_admin \
  -e MINIO_SECRET_KEY=Minio_Velero_2026! \
  -p 9290:9290 \
  bitnami/minio-exporter:latest

# Add a scrape target to 188:/momo-pro/monitoring/prometheus.yml:
# - job_name: 'minio'
#   static_configs:
#     - targets: ['localhost:9290']

M-1.2 Unify Prometheus through federation (2h)

# Add a federation job to the Prometheus on .188 (it scrapes data from the Prometheus on .110)
# Append to 188:/momo-pro/monitoring/prometheus.yml:

- job_name: 'federate-110'
  scrape_interval: 30s
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{job=~".+"}'         # scrape every job on .110
  static_configs:
    - targets: ['192.168.0.110:9090']
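
Before adding the job, it may help to check what the `/federate` endpoint on .110 returns for a given matcher. The helper below only builds the request URL; `federate_url` is a hypothetical name, and the default port 9090 is taken from the target in the config above.

```python
from typing import Sequence
from urllib.parse import urlencode


def federate_url(host: str, matchers: Sequence[str], port: int = 9090) -> str:
    """Build a /federate URL with one match[] parameter per series selector."""
    query = urlencode([("match[]", m) for m in matchers])
    return f"http://{host}:{port}/federate?{query}"


# Same selector as the scrape config: every job on .110.
print(federate_url("192.168.0.110", ['{job=~".+"}']))
```

Opening the printed URL in a browser or with curl should show the series the federation job would pull. Matching every job can be heavy on larger setups, so a narrower selector may be worth considering.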

M-1.3 Create the AI Autonomy Index Prometheus metrics (2h)

# Add to apps/api/src/core/metrics.py:

from prometheus_client import Counter

# === AI autonomy tracking (The Autonomy Index) ===
AUTONOMY_INCIDENTS_TOTAL = Counter(
    'awoooi_incidents_total',
    'Total number of incidents received',
    ['source', 'severity']
)

AUTONOMY_AUTO_RESOLVED = Counter(
    'awoooi_incidents_auto_resolved_total',
    'Incidents resolved automatically by AI without human intervention',
    ['source', 'action_type']
)

AUTONOMY_HUMAN_RESOLVED = Counter(
    'awoooi_incidents_human_resolved_total',
    'Incidents requiring human approval',
    ['source', 'risk_level']
)

def record_incident_created(source: str, severity: str):
    AUTONOMY_INCIDENTS_TOTAL.labels(source=source, severity=severity).inc()

def record_auto_resolution(source: str, action_type: str):
    AUTONOMY_AUTO_RESOLVED.labels(source=source, action_type=action_type).inc()

def record_human_decision(source: str, risk_level: str):
    AUTONOMY_HUMAN_RESOLVED.labels(source=source, risk_level=risk_level).inc()

# Autonomy rate formula:
# autonomy_rate = auto_resolved / (auto_resolved + human_resolved) * 100

Grafana dashboard formula:

# AI autonomy rate (24h)
sum(increase(awoooi_incidents_auto_resolved_total[24h]))
/
(
  sum(increase(awoooi_incidents_auto_resolved_total[24h])) +
  sum(increase(awoooi_incidents_human_resolved_total[24h]))
) * 100
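
The same formula can be written as a plain function, which also makes the empty-window case explicit. This is a sketch; `autonomy_rate` is a hypothetical helper name, and returning `None` when no incidents were resolved is an assumption about how the UI would want to treat an empty window.

```python
from typing import Optional


def autonomy_rate(auto_resolved: float, human_resolved: float) -> Optional[float]:
    """Percentage of resolved incidents that the AI handled without a human.

    Returns None when nothing was resolved in the window, so the caller can
    show "no data" instead of dividing by zero.
    """
    total = auto_resolved + human_resolved
    if total == 0:
        return None
    return 100 * auto_resolved / total


# Worked example from the Nexus mock-up: 7 of 10 incidents auto-resolved.
print(autonomy_rate(7, 3))
```

With 7 auto-resolved and 3 human-resolved incidents the function returns 70.0, matching the "7/10 incidents (70% autonomy rate)" figure used in the Nexus page spec.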

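Wave M-2: Short term (2 weeks)

M-2.1 Create the FinOps API endpoint (4h)

# apps/api/src/api/v1/finops.py (new file)
# Exposes the results computed by cost_analyzer.py

from fastapi import APIRouter  # assuming the API is built on FastAPI

router = APIRouter()

@router.get("/finops/summary")
async def get_finops_summary():
    """
    FinOps cost summary
    
    Returns:
        {
            "period": "2026-03",
            "total_cost_usd": 12.50,
            "ollama_cost": 0.0,         # local, zero cost
            "gemini_cost": 1.20,
            "claude_cost": 11.30,
            "realizable_savings": 3.50, # savings that can actually be realized
            "freed_capacity": 8.00,     # freed capacity (not real money saved)
            "top_cost_drivers": [...],
            "recommendations": [...]
        }
    """
    # get_cost_analyzer() is expected to come from the existing cost_analyzer module
    cost_analyzer = get_cost_analyzer()
    return await cost_analyzer.monthly_summary()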
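
To make the response shape concrete, here is a sketch of how the totals in the docstring could be derived from per-provider costs. `build_cost_summary` is a hypothetical helper; the real `cost_analyzer.py` logic is not shown in this document and may differ.

```python
from typing import Dict, List


def build_cost_summary(period: str, provider_costs: Dict[str, float]) -> Dict:
    """Aggregate per-provider USD costs into a summary dict.

    Field names follow the docstring above; providers with zero cost
    (for example local Ollama) are excluded from the cost drivers.
    """
    ranked = sorted(provider_costs.items(), key=lambda kv: kv[1], reverse=True)
    drivers: List[str] = [name for name, cost in ranked if cost > 0]
    return {
        "period": period,
        "total_cost_usd": round(sum(provider_costs.values()), 2),
        "top_cost_drivers": drivers,
    }


summary = build_cost_summary(
    "2026-03", {"ollama": 0.0, "gemini": 1.20, "claude": 11.30}
)
print(summary)
```

With the sample figures from the docstring, the three provider costs add up to the stated total of 12.50 USD.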

M-2.2 Create the SLO API endpoint (2h)

# apps/api/src/api/v1/slo.py (new file)

@router.get("/slo/status")
async def get_slo_status():
    """
    SLO attainment status
    
    Returns:
        {
            "api": {
                "availability_7d": 99.97,         # %
                "latency_p95_7d": 124,            # ms
                "target_availability": 99.9,
                "target_latency_p95": 500,
                "status": "healthy"               # healthy/at_risk/breached
            },
            "web": {...},
            "overall": "healthy"
        }
    """

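One way to derive the `status` field is from error-budget consumption. The sketch below is an assumption, not the documented rule: the source only names the three states, so the 50% budget threshold for `at_risk` and the function name `slo_status` are illustrative.

```python
def slo_status(
    availability: float,
    latency_p95_ms: float,
    target_availability: float,
    target_latency_p95_ms: float,
    at_risk_fraction: float = 0.5,  # assumption: warn at 50% of error budget
) -> str:
    """Classify an SLO as healthy / at_risk / breached."""
    if availability < target_availability or latency_p95_ms > target_latency_p95_ms:
        return "breached"
    error_budget = 100.0 - target_availability  # allowed unavailability, in %
    consumed = 100.0 - availability             # observed unavailability, in %
    if error_budget > 0 and consumed / error_budget >= at_risk_fraction:
        return "at_risk"
    return "healthy"


# Sample values from the docstring above: 99.97% vs a 99.9% target, 124 ms vs 500 ms.
print(slo_status(99.97, 124, 99.9, 500))
```

With the docstring's sample values, roughly 30% of the error budget is consumed, so the result is `healthy` under the assumed threshold.
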
M-2.3 Create the MTTR API endpoint (2h)

# Add an endpoint to apps/api/src/api/v1/stats.py

@router.get("/stats/mttr")
async def get_mttr_stats():
    """
    Mean Time To Resolution (MTTR)
    
    Calculation:
    - MTTR = avg(resolved_at - created_at) for resolved incidents
    - Computed separately for AI auto-repair vs. human review
    """

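The calculation in the docstring can be sketched in memory as below. The record fields `created_at`, `resolved_at`, and `resolved_by` are assumptions for this example; the actual incidents schema in PostgreSQL is not shown in this document.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, Iterable, List


def mttr_by_resolver(incidents: Iterable[dict]) -> Dict[str, float]:
    """Mean time to resolution in seconds, grouped by who resolved the incident.

    Unresolved incidents (resolved_at is None) are skipped.
    """
    durations: Dict[str, List[float]] = {}
    for incident in incidents:
        if incident.get("resolved_at") is None:
            continue
        elapsed = incident["resolved_at"] - incident["created_at"]
        durations.setdefault(incident["resolved_by"], []).append(elapsed.total_seconds())
    return {resolver: mean(values) for resolver, values in durations.items()}


t0 = datetime(2026, 3, 29, 12, 0, 0)
sample = [
    {"created_at": t0, "resolved_at": t0 + timedelta(seconds=60), "resolved_by": "ai"},
    {"created_at": t0, "resolved_at": t0 + timedelta(seconds=120), "resolved_by": "ai"},
    {"created_at": t0, "resolved_at": t0 + timedelta(minutes=10), "resolved_by": "human"},
    {"created_at": t0, "resolved_at": None, "resolved_by": None},  # still open
]
print(mttr_by_resolver(sample))
```

In production the same grouping would more likely be done in SQL; the in-memory version is only meant to show the arithmetic.
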
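M-2.4 Kali Scanner integration (2h)

# Add a Kali Scanner webhook to apps/api/src/api/v1/webhooks.py

@router.post("/webhooks/kali/scan-result")
async def handle_kali_scan_result(request: Request):
    """
    Receives security scan results from the Kali host at .112
    
    The Kali scan script runs once a week and posts its results to this webhook.
    High-severity vulnerabilities → automatically create a CRITICAL Approval
    """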

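The rule in the docstring (high-severity findings become CRITICAL Approvals) could look like the sketch below. The payload shape (a list of findings with `id` and `cvss`), the 7.0 threshold (the lower bound of the CVSS v3 "High" band), and the returned fields are all assumptions; the source does not define them.

```python
from typing import Dict, List

HIGH_SEVERITY_CVSS = 7.0  # assumption: treat CVSS v3 "High" and above as high-severity


def approvals_for_findings(findings: List[Dict], target: str) -> List[Dict]:
    """Turn high-severity scan findings into CRITICAL approval requests."""
    approvals = []
    for finding in findings:
        score = float(finding.get("cvss", 0.0))
        if score >= HIGH_SEVERITY_CVSS:
            approvals.append({
                "urgency": "CRITICAL",
                "title": f"{finding['id']} on {target}",
                "cvss": score,
            })
    return approvals


# Placeholder finding IDs, for illustration only.
sample_findings = [
    {"id": "FINDING-A", "cvss": 9.8},
    {"id": "FINDING-B", "cvss": 4.3},
]
print(approvals_for_findings(sample_findings, "192.168.0.120:32334"))
```

Lower-severity findings are simply dropped here; a real handler might instead pass them to the AI for silent digestion.
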
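Kali-side configuration (192.168.0.112):

# Create the weekly scan script on .112
cat > /opt/awoooi-scanner/weekly-scan.sh << 'EOF'
#!/bin/bash
set -euo pipefail
TARGET_HOST="192.168.0.120"
TARGET_PORT="32334"  # AWOOOI API
# nmap has no JSON output mode, so emit XML on stdout (-oX -);
# the port must be passed with -p rather than as host:port
RESULT=$(nmap -sV --script vuln -p "$TARGET_PORT" "$TARGET_HOST" -oX -)

# Wrap the XML in a JSON string with jq (assumed to be installed) and post it
printf '%s' "$RESULT" \
  | jq -Rs --arg target "$TARGET_HOST:$TARGET_PORT" '{scan_result: ., target: $target}' \
  | curl -X POST http://192.168.0.120:32334/api/v1/webhooks/kali/scan-result \
      -H "Content-Type: application/json" \
      --data-binary @-
EOF
chmod +x /opt/awoooi-scanner/weekly-scan.sh

# Append to the crontab (keeps existing entries instead of overwriting them)
(crontab -l 2>/dev/null; echo "0 2 * * 1 /opt/awoooi-scanner/weekly-scan.sh") | crontab -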

Wave M-3: Medium term (3–4 weeks)

M-3.1 AI Autonomy Index UI on the Nexus page (8h)

// apps/web/src/app/[locale]/(dashboard)/page.tsx
// Add the AutonomyIndex component
// t(): i18n translation function, assumed to be in scope

interface AutonomyData {
  rate: number;           // 70.0
  daily_trend: number;   // +12.0 vs yesterday
  auto_resolved_24h: number;
  human_resolved_24h: number;
}

const AutonomyIndexPanel = ({ data }: { data: AutonomyData }) => (
  <div className="bg-white/70 backdrop-blur-[20px] border border-black/[0.06] rounded-xl p-6">
    {/* Big number: autonomy rate */}
    <div className="flex items-end gap-3">
      <span className="font-mono text-6xl font-bold text-nothing-ink">
        {data.rate.toFixed(0)}
        <span className="text-2xl text-nothing-gray">%</span>
      </span>
      <div className="mb-2">
        <span className="text-sm text-status-success">
          {/* Pick arrow and sign from the trend so a drop does not render as "+-5%" */}
          {data.daily_trend >= 0 ? '↗ +' : '↘ '}{data.daily_trend.toFixed(0)}% {t('nexus.vs_yesterday')}
        </span>
      </div>
    </div>
    
    {/* Autonomy rate description */}
    <p className="font-mono text-xs tracking-widest text-nothing-gray-600 mt-2">
      [AI_AUTONOMY_INDEX] {t('nexus.autonomy_description')}
    </p>
    
    {/* Breakdown: auto-resolved today vs. needed a human */}
    <div className="flex gap-6 mt-4 border-t border-black/[0.04] pt-4">
      <div>
        <p className="text-2xl font-bold text-status-success">{data.auto_resolved_24h}</p>
        <p className="text-xs text-nothing-gray">{t('nexus.ai_auto_resolved')}</p>
      </div>
      <div>
        <p className="text-2xl font-bold text-status-warning">{data.human_resolved_24h}</p>
        <p className="text-xs text-nothing-gray">{t('nexus.required_approval')}</p>
      </div>
    </div>
  </div>
);

M-3.2 Omni-Terminal monitoring command integration (8h)

The backend needs handlers for the following Terminal commands:

# Extend apps/api/src/services/terminal_service.py

class TerminalCommandRouter:
    
    async def route(self, intent: str, context: dict) -> TerminalResponse:
        """
        Routing for monitoring-related commands
        """
        if intent == "/status":
            return await self._handle_status(context)      # service health
        elif intent == "/cost":
            return await self._handle_cost(context)         # FinOps cost
        elif intent == "/slo":
            return await self._handle_slo(context)          # SLO attainment
        elif intent == "/trace":
            return await self._handle_trace(context)        # SigNoz traces
        elif intent == "/alert":
            return await self._handle_alert_chain(context)  # alert chain status
        elif intent == "/incident":
            return await self._handle_incident(context)     # incident lookup
        elif intent == "/mttr":
            return await self._handle_mttr(context)         # mean time to resolution
        # Fail loudly on unrecognized intents instead of implicitly returning None
        raise ValueError(f"Unsupported monitoring intent: {intent}")

M-3.3 Monitoring GenUI card expansion (8h)

// Add to apps/web/src/components/genui/registry.ts:

export const GENUI_COMPONENTS = {
  // Existing components...
  
  // New monitoring components:
  'SystemHealthCard': () => import('./monitoring/SystemHealthCard'),
  'ServiceDetailCard': () => import('./monitoring/ServiceDetailCard'),
  'FinOpsCard': () => import('./monitoring/FinOpsCard'),
  'SLODashboardCard': () => import('./monitoring/SLODashboardCard'),
  'AlertChainStatusCard': () => import('./monitoring/AlertChainStatusCard'),
  'AnomalyFrequencyCard': () => import('./monitoring/AnomalyFrequencyCard'),
  'MTTRCard': () => import('./monitoring/MTTRCard'),
  'KaliScanResultCard': () => import('./monitoring/KaliScanResultCard'),
}

SystemHealthCard spec (the most central monitoring GenUI card):

// SystemHealthCard rendering rules:
// - The 25 services are shown as a "status light matrix", not a chart
// - Hovering a light shows the service name
// - Lights with anomalies blink (animate-ping)
// - A "deep-dive" button at the bottom right opens Grafana/SigNoz in a new tab

const SystemHealthCard = () => (
  <GenUICard title="System Health Matrix">
    <div className="grid grid-cols-5 gap-2">
      {services.map(svc => (
        <ServiceOrb 
          key={svc.name}
          name={svc.name}
          status={svc.status}
          // healthy: steady green light
          // warning: slowly blinking yellow light
          // critical: red light with animate-ping
          externalLink={svc.grafana_url}
        />
      ))}
    </div>
    
    {/* Summary line */}
    <p className="font-mono text-xs mt-3">
      25 SERVICES | {healthy_count} HEALTHY | {warning_count} WARNING | {critical_count} CRITICAL
    </p>
    
    {/* Smart escape */}
    <ExternalLinks grafana sentry signoz />
  </GenUICard>
);
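
The summary line under the light matrix needs per-status counts, which the backend could precompute before sending the card payload. The sketch below assumes each service reports one of `healthy`, `warning`, or `critical`; `health_summary_line` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable


def health_summary_line(statuses: Iterable[str]) -> str:
    """Format the one-line summary shown under the status light matrix."""
    status_list = list(statuses)
    counts = Counter(status_list)  # missing keys count as 0
    return (
        f"{len(status_list)} SERVICES | {counts['healthy']} HEALTHY | "
        f"{counts['warning']} WARNING | {counts['critical']} CRITICAL"
    )


sample_statuses = ["healthy"] * 23 + ["warning", "critical"]
print(health_summary_line(sample_statuses))
```

Deriving the service total from the list, rather than hard-coding 25, keeps the line correct when services are added or removed.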

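6. Monitoring Integration Roadmap and Priorities

📅 Week 1 (immediate, ~7h):
  ├── M-1.1 MinIO Exporter deployment (1h)
  ├── M-1.2 Prometheus Federation (2h)
  └── M-1.3 AI Autonomy Index metrics (2h + 2h config)

📅 Week 2-3 (short term, ~10h):
  ├── M-2.1 FinOps API endpoint (4h)
  ├── M-2.2 SLO API endpoint (2h)
  ├── M-2.3 MTTR API endpoint (2h)
  └── M-2.4 Kali Scanner Webhook integration (2h)

📅 Month 2 (medium term, ~24h):
  ├── M-3.1 AI autonomy rate UI on the Nexus page (8h)
  ├── M-3.2 Omni-Terminal monitoring commands (8h)
  └── M-3.3 Monitoring GenUI card expansion (8h)

The end state once monitoring integration is complete:

The commander opens AWOOOI and sees:
  ✦ AI autonomy rate: 72% today (↗ 8% higher than yesterday)
  ✦ System health: 25/25 services healthy
  ✦ Pending decisions: 0 (no incidents need human intervention)
  ✦ The AI thinking stream patrolling silently in the background...

This is an AI-native platform, not a monitoring tool.
SREs are woken only when the AI cannot handle something; the rest of the time, humans can do more valuable work.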

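7. Recommended ADRs

This plan recommends adding the following ADRs:

ADR	Topic	Core decision
ADR-038	Monitoring UI Three-Mode Separation Principle	Silent digest vs. proactive surfacing vs. external escape
ADR-039	AI Autonomy Index	Metric definition and calculation formula
ADR-040	Kali security scan integration architecture	.112 → Webhook → AI analysis
ADR-041	SLO and MTTR business metrics architecture	Calculation method and display standard

"Monitoring is the nerve endings; AI is the brain. Nerves do not think, and the brain does not sense directly. That is AWOOOI's monitoring philosophy." 🦞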
