awoooi/docs/MONITORING_COMPLETE_STRATEGY.md
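OG T 40163a51b5 feat(monitoring): complete monitoring strategy and automated integration architecture
Added:
1. MONITORING_COMPLETE_STRATEGY.md - complete monitoring strategy
   - Monitoring matrix for 5 hosts × 60+ services
   - P0/P1/P2 alert rule list
   - AI auto-repair closed-loop workflow
   - Safety guardrail configuration

2. MONITORING_INTEGRATION_ARCHITECTURE.md - automated integration architecture
   - Service registry (Single Source of Truth)
   - CI/CD automatically validates monitoring coverage
   - New services get monitoring automatically

3. ops/monitoring/service-registry.yaml - service inventory
   - K8s workloads (API/Web/Worker/ArgoCD)
   - Docker containers (Ollama/OpenClaw/Redis/Postgres)
   - Frontend page SLOs
   - API endpoint SLOs
   - Alert templates and auto-repair actions

4. ops/monitoring/validate_coverage.py - coverage validation
   - Runs in the CI stage
   - Detects unmonitored services
   - Generates a coverage report

Design principles:
- Monitoring as Code
- A new service must be registered in the registry before it can be deployed
- Auto-discovery prevents omissions
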
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 01:52:08 +08:00


AWOOOI Complete Monitoring and AI Auto-Repair Strategy

Version: v1.0
Created: 2026-03-29
Owner: Chief Architect (Claude Code)
Goal: 100% monitoring coverage + AI-driven auto-repair


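Executive Summary

This document defines the full-stack monitoring strategy for AWOOOI. It covers:

  • 5 hosts × 60+ services × 4 monitoring layers
  • Three-tier observability: Sentry (errors) + SignOz (traces) + Prometheus (metrics)
  • AI auto-repair closed loop: anomaly → OpenClaw analysis → automatic execution or human review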

1. Monitoring Coverage Matrix

1.1 Host layer (Infrastructure)

| Host | IP | Role | Monitored items | Alert rules |
|---|---|---|---|---|
| mon (K3s Master) | 192.168.0.120 | K3s Server + keepalived | CPU/MEM/Disk/etcd | NodeDown, etcdHighLatency |
| mon1 (K3s Worker) | 192.168.0.121 | K3s Worker + keepalived | CPU/MEM/Disk/kubelet | NodeNotReady, DiskPressure |
| harbor (DevOps) | 192.168.0.110 | Harbor/Sentry/Langfuse/Runner | CPU/MEM/Disk/Docker | HarborDown, RunnerOffline |
| pg (AI/Web) | 192.168.0.188 | PostgreSQL/Redis/Ollama/SignOz | CPU/MEM/Disk/GPU | DBConnectionFailed, OllamaTimeout |
| kali (Security) | 192.168.0.112 | Kali Scanner | CPU/MEM/Disk | ScannerOffline |

1.2 Service layer (Services)

A. Kubernetes workloads (K3s)

| Namespace | Deployment/StatefulSet | Replicas | Health check | Alert conditions |
|---|---|---|---|---|
| awoooi-prod | awoooi-api | 2 | HTTP /api/v1/health | PodCrashLoopBackOff, ReplicasUnavailable |
| awoooi-prod | awoooi-web | 2 | HTTP / | HighErrorRate, SlowResponse |
| awoooi-prod | awoooi-worker | 1 | Exec mtime | WorkerStuck, QueueBacklog |
| argocd | argocd-server | 1 | HTTP /healthz | ArgoCDDown |
| monitoring | prometheus-server | 1 | HTTP /-/ready | PrometheusDown |
| monitoring | alertmanager | 1 | HTTP /-/ready | AlertmanagerDown |
| velero | velero | 1 | - | BackupFailed |

B. Container services (Docker on 188/110)

| Host | Container | Port | Health check | Alert conditions |
|---|---|---|---|---|
| 188 | ollama | 11434 | GET /api/tags | OllamaUnresponsive, ModelLoadFailed |
| 188 | openclaw | 8089 | GET /health | OpenClawDown, AnalysisTimeout |
| 188 | signoz-collector | 24317/24318 | gRPC health | TraceDropped |
| 188 | signoz-ui | 3301 | HTTP / | SignOzUIDown |
| 188 | redis-stack | 6380 | redis-cli ping | RedisDown, MemoryExhausted |
| 188 | postgres | 5432 | pg_isready | PostgresDown, ConnectionPoolExhausted |
| 110 | harbor-core | 5000 | GET /api/v2.0/health | HarborDown |
| 110 | sentry-web | 9000 | GET /_health/ | SentryDown |
| 110 | langfuse | 3100 | GET /api/public/health | LangfuseDown |
| 110 | actions-runner | - | systemctl status | RunnerOffline |

1.3 Application layer (Application)

A. API endpoint monitoring

| Endpoint | Method | Expected response | SLO | Alert |
|---|---|---|---|---|
| /api/v1/health | GET | 200 | 99.9% | APIHealthCheckFailed |
| /api/v1/approvals/pending | GET | 200 | 99% | ApprovalsAPIError |
| /api/v1/incidents | GET | 200 | 99% | IncidentsAPIError |
| /api/v1/analyze | POST | 200/202 | 95% | AnalysisTimeout (>30s) |
| /api/v1/execute | POST | 200 | 99% | ExecutionFailed |
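
To make the SLO column concrete, the sketch below converts a target percentage into an error budget. It assumes the percentages are availability targets measured over a 30-day window; the document does not state the window, and `error_budget_minutes` is a hypothetical helper name.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


if __name__ == "__main__":
    # SLO targets taken from the endpoint table above.
    for slo in (99.9, 99.0, 95.0):
        print(f"{slo}% -> {error_budget_minutes(slo):.1f} min per 30 days")
```

Under these assumptions a 99.9% target leaves well under an hour of downtime per month, which is why the health endpoint is paired with a P0 alert.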

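B. Error rate monitoring (Sentry)

| Type | Threshold | Alert | Auto-repair |
|---|---|---|---|
| Unhandled Exception | >0 in 5min | SentryNewError | AI analysis + Playbook matching |
| HTTP 5xx | >1% | HighErrorRate | Pod restart |
| HTTP 4xx | >10% | ClientErrorSpike | Alert + log analysis |
| Slow Transaction | P95 >2s | SlowTransaction | Resource scaling recommendation |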

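C. Frontend monitoring

| Metric | Source | Threshold | Alert |
|---|---|---|---|
| Page Load Time | Sentry Performance | >3s | SlowPageLoad |
| JS Error Rate | Sentry Issues | >0.1% | FrontendError |
| API Call Failures | Sentry Breadcrumbs | >1% | APICallFailed |
| Web Vitals (LCP/FID/CLS) | Sentry | Google's recommended thresholds | PoorWebVitals |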

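1.4 Data layer (Data)

| Database | Monitored items | Alert conditions |
|---|---|---|
| PostgreSQL | Connection count, QPS, slow queries, WAL lag, disk I/O | ConnectionPoolExhausted (>90%), SlowQuery (>5s), ReplicationLag (>30s) |
| Redis | Memory usage, hit rate, latency, key count | MemoryHigh (>80%), HitRateLow (<90%), SlowCommands |
| ClickHouse | Disk usage, query latency, insert rate | DiskFull (>85%), QueryTimeout |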

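1.5 AI/LLM layer

| Service | Monitored items | Alert conditions | Auto-repair |
|---|---|---|---|
| Ollama | Inference latency, model load status, GPU usage | InferenceTimeout (>60s), ModelLoadFailed | Container restart |
| OpenClaw | Analysis success rate, response time, token usage | AnalysisFailed (>10%), HighTokenCost | Fallback to Gemini |
| Gemini API | Rate limit, error rate, cost | RateLimitHit, BudgetExceeded | Degrade to Ollama |
| Claude API | Rate limit, error rate, cost | RateLimitHit, BudgetExceeded | Degrade to Gemini |
| Langfuse | Trace recording success rate | TraceLost (>1%) | Reconnect |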
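
The degradation path in this table (Claude → Gemini → Ollama) can be read as a simple fallback chain. The following is a minimal sketch under that reading. `ProviderError`, `call_with_fallback`, and the `clients` mapping are hypothetical names for illustration, not part of the AWOOOI codebase.

```python
from typing import Callable, Dict, Tuple

# Degradation order taken from the table: Claude -> Gemini -> Ollama.
FALLBACKS: Dict[str, str] = {"claude": "gemini", "gemini": "ollama"}


class ProviderError(Exception):
    """Hypothetical error standing in for RateLimitHit / BudgetExceeded."""


def call_with_fallback(
    provider: str,
    prompt: str,
    clients: Dict[str, Callable[[str], str]],
) -> Tuple[str, str]:
    """Try `provider`, then walk the fallback chain until one call succeeds.

    Returns (provider_used, response). Re-raises the last ProviderError
    when the end of the chain is reached.
    """
    current = provider
    while True:
        try:
            return current, clients[current](prompt)
        except ProviderError:
            if current not in FALLBACKS:
                raise  # no further fallback available
            current = FALLBACKS[current]
```

Because the mapping is acyclic, the loop always terminates: it either returns a response or raises at the last provider.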

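1.6 CI/CD layer

| Component | Monitored items | Alert conditions |
|---|---|---|
| GitHub Actions | Workflow status, runner health, job latency | WorkflowFailed, RunnerOffline, JobStuck (>30min) |
| Harbor | Image push/pull success rate, storage space | PushFailed, PullFailed, StorageFull |
| ArgoCD | Sync status, Application health | SyncFailed, AppDegraded |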

2. Complete Alert Rule List

2.1 P0 - Critical (respond within 5 minutes)

# === Infrastructure layer ===
- alert: NodeDown
  expr: up{job="node-exporter"} == 0
  for: 1m
  severity: critical
  auto_repair: false  # requires human intervention

- alert: K3sAPIServerDown
  expr: up{job="kubernetes-apiservers"} == 0
  for: 1m
  severity: critical
  auto_repair: false

- alert: PostgreSQLDown
  expr: pg_up == 0
  for: 30s
  severity: critical
  auto_repair: restart_container

- alert: RedisDown
  expr: redis_up == 0
  for: 30s
  severity: critical
  auto_repair: restart_container

# === Application layer ===
- alert: AWOOOIAPIDown
  expr: probe_success{job="awoooi-api"} == 0
  for: 1m
  severity: critical
  auto_repair: restart_pod

- alert: OpenClawDown
  expr: probe_success{job="openclaw"} == 0
  for: 2m
  severity: critical
  auto_repair: restart_container

- alert: PodCrashLoopBackOff
  expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0
  for: 2m
  severity: critical
  auto_repair: collect_logs_and_rollback

# === CI/CD layer ===
- alert: GitHubRunnerOffline
  expr: github_runner_status == 0
  for: 5m
  severity: critical
  auto_repair: restart_runner_service

2.2 P1 - High (respond within 15 minutes)

# === Performance alerts ===
- alert: HighCPUUsage
  expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
  for: 5m
  severity: high
  auto_repair: scale_up_if_possible

- alert: HighMemoryUsage
  expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
  for: 5m
  severity: high
  auto_repair: investigate_memory_leak

- alert: APIHighLatency
  expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
  for: 5m
  severity: high
  auto_repair: analyze_slow_endpoints

- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m
  severity: high
  auto_repair: restart_pod

- alert: OllamaSlowInference
  expr: ollama_inference_duration_seconds > 60
  for: 3m
  severity: high
  auto_repair: switch_to_smaller_model

# === Resource alerts ===
- alert: DiskSpaceLow
  expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.15
  for: 10m
  severity: high
  auto_repair: cleanup_old_logs

- alert: PostgreSQLConnectionPoolHigh
  expr: sum by (instance) (pg_stat_activity_count) / on (instance) pg_settings_max_connections > 0.8
  for: 5m
  severity: high
  auto_repair: analyze_connection_leaks

2.3 P2 - Medium (respond within 1 hour)

- alert: CertificateExpiringSoon
  expr: ssl_cert_not_after - time() < 14 * 24 * 3600
  severity: medium
  auto_repair: renew_certificate

- alert: BackupNotSuccessful
  expr: increase(velero_backup_success_total[24h]) < 1
  severity: medium
  auto_repair: trigger_backup

- alert: LangfuseTraceLoss
  expr: langfuse_trace_drop_rate > 0.01
  severity: medium
  auto_repair: reconnect_langfuse
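
The three priority tiers each carry a response window (P0: 5 minutes, P1: 15 minutes, P2: 1 hour). A minimal sketch of how an escalation check could use those windows follows. `RESPONSE_SLA`, `response_deadline`, and `is_overdue` are hypothetical names, and keying the table by the rules' `severity` value is an assumption.

```python
from datetime import datetime, timedelta, timezone

# Response windows from the section headings, keyed by the rules' severity value.
RESPONSE_SLA = {
    "critical": timedelta(minutes=5),   # P0
    "high": timedelta(minutes=15),      # P1
    "medium": timedelta(hours=1),       # P2
}


def response_deadline(severity: str, fired_at: datetime) -> datetime:
    """Latest time by which an alert of this severity should be acknowledged."""
    return fired_at + RESPONSE_SLA[severity]


def is_overdue(severity: str, fired_at: datetime, now: datetime) -> bool:
    """True when the response window has elapsed without acknowledgement."""
    return now > response_deadline(severity, fired_at)


if __name__ == "__main__":
    fired = datetime(2026, 3, 29, 1, 0, tzinfo=timezone.utc)
    print(response_deadline("critical", fired).isoformat())
```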

3. AI Auto-Repair Closed Loop

3.1 Repair flow diagram

┌─────────────────────────────────────────────────────────────────────┐
│                           Anomaly occurs                            │
│  (Prometheus Alert / Sentry Issue / SignOz Anomaly)                 │
└────────────────────────────┬────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     Alertmanager routing                            │
│  ┌───────────────┬───────────────┬───────────────┐                  │
│  │ route: awoooi │ route: infra  │ route: aiops  │                  │
│  └───────┬───────┴───────┬───────┴───────┬───────┘                  │
└──────────┼───────────────┼───────────────┼──────────────────────────┘
           │               │               │
           └───────────────┼───────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────────┐
│              AWOOOI API: /api/v1/webhooks/alertmanager              │
│  1. Receive → 2. Dedup (10min fingerprint) → 3. Create Incident     │
└────────────────────────────┬────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     OpenClaw AI analysis engine                     │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │ Input:                                                        │  │
│  │ - Alert content (labels, annotations)                         │  │
│  │ - K8s context (Pod logs, events, metrics)                     │  │
│  │ - Historical Playbooks (similar cases)                        │  │
│  │ - SignOz Traces (related Spans)                               │  │
│  │ - Sentry Issues (related errors)                              │  │
│  ├───────────────────────────────────────────────────────────────┤  │
│  │ Output:                                                       │  │
│  │ - suggested_action: RESTART_POD | DELETE_POD | SCALE_UP | ... │  │
│  │ - confidence: 0.0-1.0                                         │  │
│  │ - risk_level: LOW | MEDIUM | CRITICAL                         │  │
│  │ - blast_radius: {affected_pods, estimated_downtime}           │  │
│  │ - kubectl_command: the exact command                          │  │
│  │ - reasoning: decision rationale (Traditional Chinese)         │  │
│  └───────────────────────────────────────────────────────────────┘  │
└────────────────────────────┬────────────────────────────────────────┘
                             │
              ┌──────────────┴──────────────┐
              │                             │
              ▼                             ▼
┌─────────────────────────┐   ┌─────────────────────────┐
│   confidence >= 0.85    │   │   confidence < 0.85     │
│   AND risk_level = LOW  │   │   OR risk_level != LOW  │
│   ↓                     │   │   ↓                     │
│   Auto-execute          │   │   Human review          │
└───────────┬─────────────┘   └───────────┬─────────────┘
            │                             │
            │                             ▼
            │               ┌─────────────────────────┐
            │               │   Telegram review card  │
            │               │ [✅ Approve] [❌ Reject]│
            │               │ [⏰ Later] [🔕 Silence] │
            │               └───────────┬─────────────┘
            │                           │
            │               ┌───────────┴───────────┐
            │               │                       │
            │               ▼                       ▼
            │        ┌────────────┐          ┌────────────┐
            │        │  Approved  │          │  Rejected  │
            │        └─────┬──────┘          └─────┬──────┘
            │              │                       │
            └──────────────┼───────────────────────┤
                           │                       │
                           ▼                       ▼
              ┌─────────────────────────┐  ┌────────────────┐
              │   K8s Executor runs     │  │ Log the reason │
              │   kubectl $command      │  │ Update Playbook│
              └───────────┬─────────────┘  └────────────────┘
                          │
                          ▼
              ┌─────────────────────────┐
              │   Verify the result     │
              │   - Health check OK?    │
              │   - Error rate down?    │
              │   - Latency normal?     │
              └───────────┬─────────────┘
                          │
              ┌───────────┴───────────┐
              │                       │
              ▼                       ▼
  ┌─────────────────────┐   ┌─────────────────────┐
  │ Repair succeeded    │   │ Repair failed       │
  │ - Close Incident    │   │ - Escalate alert    │
  │ - Update Playbook   │   │ - Record failure    │
  │ - Notify Telegram   │   │ - Manual takeover   │
  └─────────────────────┘   └─────────────────────┘
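
The branch in the middle of the diagram can be sketched as a small routing function. This is a minimal illustration, assuming that any risk level other than LOW goes to human review (consistent with the guardrail setting `auto_approve_max_risk: "LOW"`). `Analysis` and `route` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Analysis:
    """Subset of the OpenClaw output fields listed in the diagram."""
    suggested_action: str
    confidence: float   # 0.0-1.0
    risk_level: str     # "LOW" | "MEDIUM" | "CRITICAL"


def route(analysis: Analysis, min_confidence: float = 0.85) -> str:
    """Return "auto_execute" only for confident, low-risk suggestions."""
    if analysis.confidence >= min_confidence and analysis.risk_level == "LOW":
        return "auto_execute"
    return "human_review"
```

Defaulting to `human_review` means an unexpected `risk_level` value fails safe rather than triggering an automatic action.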

3.2 Auto-repair action list

| Action | Trigger | Command | Risk level | Auto-execute? |
|---|---|---|---|---|
| RESTART_POD | PodCrashLoop, HighErrorRate | kubectl rollout restart deployment/{name} | LOW | Yes |
| DELETE_POD | PodStuck, OOMKilled | kubectl delete pod {name} --grace-period=30 | LOW | Yes |
| SCALE_UP | HighCPU, HighMemory, SlowResponse | kubectl scale deployment/{name} --replicas={current+1} | LOW | Yes |
| SCALE_DOWN | ResourceWaste | kubectl scale deployment/{name} --replicas={current-1} | MEDIUM | Needs review |
| ROLLBACK | DeploymentFailed, VersionDrift | kubectl rollout undo deployment/{name} | MEDIUM | Needs review |
| RESTART_CONTAINER | ContainerUnhealthy | docker restart {container} | LOW | Yes |
| CLEAR_CACHE | RedisMemoryHigh, StaleCache | redis-cli FLUSHDB | MEDIUM | Needs review |
| VACUUM_DB | TableBloat, SlowQuery | VACUUM ANALYZE {table} | MEDIUM | Needs review |
| RENEW_CERT | CertExpiring | certbot renew | LOW | Yes |
| CLEANUP_LOGS | DiskSpaceLow | find /var/log -type f -mtime +7 -delete | LOW | Yes |
| SWITCH_MODEL | OllamaTimeout | Switch to a smaller model | LOW | Yes |
| FALLBACK_AI | GeminiRateLimit | Gemini → Ollama | LOW | Yes |
3.3 Safety guardrails

# === Auto-repair safety limits ===
SAFETY_GUARDRAILS = {
    # Rate limits
    "max_repairs_per_hour": 5,          # at most 5 auto-repairs per hour
    "max_repairs_per_resource": 3,      # at most 3 per hour for the same resource
    "cooldown_after_failure": 600,      # cool down for 10 minutes after a failure (seconds)

    # Risk limits
    "auto_approve_max_risk": "LOW",     # auto-approve LOW risk only
    "auto_approve_min_confidence": 0.85, # minimum confidence of 85%

    # Blast-radius limits
    "max_affected_pods": 3,             # affect at most 3 Pods
    "min_healthy_replicas": 1,          # keep at least 1 healthy replica

    # Actions that are never executed automatically
    "blacklist_actions": [
        "DROP_DATABASE",
        "DELETE_NAMESPACE",
        "FORCE_DELETE_PVC",
        "DELETE_SECRET",
    ],

    # Allowlisted namespaces
    "allowed_namespaces": [
        "awoooi-prod",
        "monitoring",
    ],
}
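
The rate-limit entries above could be enforced with a sliding one-hour window. The following is a sketch under stated assumptions: `RepairLimiter` is a hypothetical class, state is kept in memory (a real deployment would likely need shared storage such as Redis), and the failure cooldown is applied globally because the document does not say whether it is per resource.

```python
import time
from collections import defaultdict, deque


class RepairLimiter:
    """Sliding-window limiter for auto-repairs (in-memory sketch)."""

    WINDOW = 3600  # seconds in the "per hour" window

    def __init__(self, max_per_hour=5, max_per_resource=3, cooldown=600,
                 clock=time.monotonic):
        self.max_per_hour = max_per_hour
        self.max_per_resource = max_per_resource
        self.cooldown = cooldown
        self.clock = clock
        self._all = deque()                      # timestamps of every repair
        self._by_resource = defaultdict(deque)   # timestamps per resource
        self._blocked_until = 0.0                # end of the failure cooldown

    def _prune(self, stamps, now):
        # Drop timestamps that have left the one-hour window.
        while stamps and now - stamps[0] >= self.WINDOW:
            stamps.popleft()

    def allow(self, resource: str) -> bool:
        """True when another auto-repair on `resource` is within limits."""
        now = self.clock()
        if now < self._blocked_until:
            return False
        self._prune(self._all, now)
        self._prune(self._by_resource[resource], now)
        return (len(self._all) < self.max_per_hour
                and len(self._by_resource[resource]) < self.max_per_resource)

    def record(self, resource: str, succeeded: bool) -> None:
        """Record an executed repair; a failure starts the cooldown."""
        now = self.clock()
        self._all.append(now)
        self._by_resource[resource].append(now)
        if not succeeded:
            self._blocked_until = now + self.cooldown
```

Injecting `clock` keeps the limiter testable without sleeping; the caller checks `allow()` before executing and calls `record()` afterwards.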

4. Monitoring Data Flow Integration

4.1 Sentry → OpenClaw

# /api/v1/webhooks/sentry - Sentry Issue Alert Webhook
async def handle_sentry_webhook(payload: dict):
    """
    1. Parse the Sentry Issue
    2. Dedup check (10-minute TTL)
    3. Create an Incident
    4. Trigger OpenClaw analysis
    5. Push to Telegram
    """
    issue = payload["data"]["issue"]
    issue_id = issue["id"]

    # Dedup: SET NX EX is atomic, so concurrent deliveries cannot both pass
    if not await redis.set(f"sentry_dedup:{issue_id}", "1", nx=True, ex=600):
        return {"status": "deduplicated"}

    # Create the Incident
    incident = await incident_service.create_from_sentry(payload)

    # AI analysis
    analysis = await openclaw.analyze_error(
        error_title=issue["title"],
        stack_trace=issue["culprit"],
        sentry_url=issue["web_url"],
        trace_id=extract_trace_id(payload),
    )

    # Telegram notification
    await telegram.send_error_alert(
        incident_id=incident.id,
        analysis=analysis,
        sentry_url=issue["web_url"],
    )

    return {"status": "processed", "incident_id": incident.id}
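
Before a handler like this trusts the payload, the request should be authenticated with the `SENTRY_WEBHOOK_SECRET` listed in the appendix. The sketch below assumes Sentry signs the raw request body with HMAC-SHA256 and sends the hex digest in a signature header; the header name and signing scheme should be confirmed against the Sentry version in use. `verify_signature` is a hypothetical helper.

```python
import hashlib
import hmac


def verify_signature(body: bytes, signature: str, secret: str) -> bool:
    """Check an HMAC-SHA256 hex signature over the raw request body.

    Assumption: the sender computes HMAC-SHA256(secret, body) and sends
    the hex digest in a request header.
    """
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(expected, signature)
```

The check has to run on the raw bytes as received, before JSON parsing, because re-serializing the payload can change the byte sequence and invalidate the signature.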

4.2 Alertmanager → OpenClaw

# alertmanager.yml
route:
  receiver: awoooi-api
  routes:
    - match:
        namespace: awoooi-prod
      receiver: awoooi-api
    - match:
        severity: critical
      receiver: awoooi-api

receivers:
  - name: awoooi-api
    webhook_configs:
      - url: http://192.168.0.125:32334/api/v1/webhooks/alertmanager
        send_resolved: true
        http_config:
          basic_auth:
            username: alertmanager
            password_file: /etc/alertmanager/secrets/webhook-password
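
On the receiving side, the webhook body contains an `alerts` list whose entries carry `status`, `labels`, and a `fingerprint`. A minimal sketch of the 10-minute fingerprint dedup described in section 3.1 follows. It keeps state in a plain dict for illustration (the real service would more likely use Redis, as the Sentry handler does), and `new_firing_alerts` is a hypothetical name.

```python
from typing import Dict, List


def new_firing_alerts(
    payload: dict,
    seen: Dict[str, float],
    now: float,
    ttl: float = 600.0,
) -> List[dict]:
    """Return firing alerts whose fingerprint was not seen within `ttl` seconds.

    `seen` maps fingerprint -> last accepted timestamp and is updated in place.
    """
    fresh = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved notifications do not open new Incidents
        fingerprint = alert["fingerprint"]
        last = seen.get(fingerprint)
        if last is not None and now - last < ttl:
            continue  # duplicate inside the dedup window
        seen[fingerprint] = now
        fresh.append(alert)
    return fresh
```

Since `send_resolved: true` is set above, the same endpoint also receives resolved notifications; this sketch simply skips them, while a fuller handler would use them to close the matching Incident.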

4.3 SignOz → OpenClaw

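# Query ClickHouse for anomalous Spans
async def detect_signoz_anomalies():
    """
    Periodically query the SignOz ClickHouse store to detect:
    - An abnormal rise in error rate
    - Abnormal P99 latency
    - A sharp drop in trace volume (the service may be down)
    """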
    anomalies = await clickhouse.query("""
        SELECT
            serviceName,
            count(*) as error_count,
            avg(durationNano) / 1e6 as avg_latency_ms
        FROM signoz_traces.signoz_index_v2
        WHERE timestamp > now() - INTERVAL 5 MINUTE
          AND statusCode = 'STATUS_CODE_ERROR'
        GROUP BY serviceName
        HAVING error_count > 10
    """)

    for anomaly in anomalies:
        await openclaw.analyze_trace_anomaly(
            service=anomaly["serviceName"],
            error_count=anomaly["error_count"],
            avg_latency=anomaly["avg_latency_ms"],
        )
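
The query above covers the error-count case only. The third item in the docstring, a sharp drop in trace volume, needs a comparison against a baseline. A minimal sketch of that check follows; `trace_volume_dropped`, `drop_ratio`, and `min_baseline` are hypothetical, and the default thresholds are placeholders rather than tuned values.

```python
def trace_volume_dropped(
    current: int,
    baseline: float,
    drop_ratio: float = 0.5,
    min_baseline: int = 20,
) -> bool:
    """True when the current window's trace count fell sharply below baseline.

    current:      trace count in the latest window (e.g. last 5 minutes)
    baseline:     average count over comparable earlier windows
    drop_ratio:   fraction of the baseline that must be lost to flag a drop
    min_baseline: ignore low-traffic services, where counts are too noisy
    """
    if baseline < min_baseline:
        return False
    return current < baseline * (1 - drop_ratio)
```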

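5. Implementation Priorities

Phase 1 (this week - P0)

| Item | Status | Owner | Notes |
|---|---|---|---|
| Alertmanager → AWOOOI Webhook | TODO | Claude Code | Configure the webhook + test alerts |
| Sentry Webhook → Telegram | TODO | Claude Code | Push errors directly + AI analysis |
| Automatic secrets injection (CD) | TODO | Claude Code | kubectl patch secret |
| Alert dedup verification | TODO | Claude Code | 10min fingerprint test |

Phase 2 (next week - P1)

| Item | Status | Owner | Notes |
|---|---|---|---|
| SignOz alert rules | TODO | Claude Code | Error Rate, Latency P99 |
| Auto-repair action expansion | TODO | Claude Code | SCALE_UP, ROLLBACK |
| Automatic Playbook extraction | TODO | Claude Code | Successful repair → Playbook |
| Alert escalation mechanism | TODO | Claude Code | SLA Engine |

Phase 3 (in two weeks - P2)

| Item | Status | Owner | Notes |
|---|---|---|---|
| Grafana dashboards | TODO | Claude Code | Monitoring overview |
| SLO/SLI definitions | TODO | Claude Code | 99.9% availability target |
| Alert noise suppression | TODO | Claude Code | ML anomaly detection |
| Capacity forecasting | TODO | Claude Code | Resource trend prediction |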

6. Appendix

A. Environment variables

# === Alertmanager ===
ALERTMANAGER_WEBHOOK_URL=http://192.168.0.125:32334/api/v1/webhooks/alertmanager
ALERTMANAGER_WEBHOOK_SECRET=<secret>

# === Sentry ===
SENTRY_DSN=http://<key>@192.168.0.110:9000/<project>
SENTRY_WEBHOOK_SECRET=<secret>
SENTRY_DEDUP_TTL=600

# === SignOz ===
SIGNOZ_CLICKHOUSE_URL=http://192.168.0.188:8123
SIGNOZ_ANOMALY_THRESHOLD_ERROR_COUNT=10

# === Auto-repair ===
AUTO_REPAIR_ENABLED=true
AUTO_REPAIR_MAX_PER_HOUR=5
AUTO_REPAIR_MIN_CONFIDENCE=0.85
AUTO_REPAIR_DRY_RUN=false
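
A service could load the auto-repair variables into a typed object at startup so that a malformed value fails fast. This is a minimal sketch: `AutoRepairConfig` and `load_auto_repair_config` are hypothetical names, and the fallback defaults (disabled, dry-run on) are a conservative assumption, not values from this document.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class AutoRepairConfig:
    enabled: bool
    max_per_hour: int
    min_confidence: float
    dry_run: bool


def load_auto_repair_config(env: Optional[Mapping[str, str]] = None) -> AutoRepairConfig:
    """Parse the AUTO_REPAIR_* variables; raises ValueError on bad numbers."""
    env = os.environ if env is None else env
    return AutoRepairConfig(
        # Assumed safe defaults when a variable is missing.
        enabled=_as_bool(env.get("AUTO_REPAIR_ENABLED", "false")),
        max_per_hour=int(env.get("AUTO_REPAIR_MAX_PER_HOUR", "5")),
        min_confidence=float(env.get("AUTO_REPAIR_MIN_CONFIDENCE", "0.85")),
        dry_run=_as_bool(env.get("AUTO_REPAIR_DRY_RUN", "true")),
    )
```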

B. Alert template

🚨 **CRITICAL | awoooi-api**
━━━━━━━━━━━━━━━━━━━
📋 INC-20260329-0001
🎯 Pod: awoooi-api-7d4b8c9f5-abc12
━━━━━━━━━━━━━━━━━━━
🤖 **AI Analysis**
👥 Owner: BE (backend)
📊 Confidence: 🟢 92%
💡 Cause: OOM Killed - Memory limit exceeded
━━━━━━━━━━━━━━━━━━━
🔧 Recommendation: DELETE_POD + SCALE_UP
⏱️ Downtime: ~30s
💰 Tokens: 1,234 / $0.0012
━━━━━━━━━━━━━━━━━━━
🔗 [SignOz Trace](http://192.168.0.188:3301/trace/abc123)
🔗 [Sentry Issue](http://192.168.0.110:9000/issues/456)

[✅ Approve] [❌ Reject] [⏰ Later] [🔕 Silence]

End of document. Next step: execute the Phase 1 tasks.