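OG T 40163a51b5 feat(monitoring): complete monitoring strategy and automated integration architecture

Added:
1. MONITORING_COMPLETE_STRATEGY.md - complete monitoring strategy
   - Monitoring matrix: 5 hosts × 60+ services
   - P0/P1/P2 alert rule list
   - AI auto-repair closed-loop flow
   - Safety guardrail configuration

2. MONITORING_INTEGRATION_ARCHITECTURE.md - automated integration architecture
   - Service registry (Single Source of Truth)
   - CI/CD automatically validates monitoring coverage
   - New services get monitoring automatically

3. ops/monitoring/service-registry.yaml - service inventory
   - K8s workloads (API/Web/Worker/ArgoCD)
   - Docker containers (Ollama/OpenClaw/Redis/Postgres)
   - Frontend page SLOs
   - API endpoint SLOs
   - Alert templates and auto-repair actions

4. ops/monitoring/validate_coverage.py - coverage validation
   - Runs in the CI stage
   - Detects unmonitored services
   - Generates a coverage report

Design principles:
- Monitoring as Code
- A new service must be registered in the registry before it can be deployed
- An auto-discovery mechanism prevents omissions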

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-03-29 01:52:08 +08:00

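AWOOOI Complete Monitoring and AI Auto-Repair Strategy

Version: v1.0 Created: 2026-03-29 Owner: Chief Architect (Claude Code) Goal: 100% monitoring coverage + AI-driven auto-repair

Executive Summary

This document defines the AWOOOI full-stack monitoring strategy, covering:

5 hosts × 60+ services × 4 monitoring layers
Three-layer observability: Sentry (errors) + SignOz (traces) + Prometheus (metrics)
AI auto-repair closed loop: anomaly → OpenClaw analysis → automatic execution or human review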

1. Monitoring Coverage Matrix

1.1 Host Layer (Infrastructure)

Host	IP	Role	Monitored items	Alert rules
mon (K3s Master)	192.168.0.120	K3s Server + keepalived	CPU/MEM/Disk/etcd	NodeDown, etcdHighLatency
mon1 (K3s Worker)	192.168.0.121	K3s Worker + keepalived	CPU/MEM/Disk/kubelet	NodeNotReady, DiskPressure
harbor (DevOps)	192.168.0.110	Harbor/Sentry/Langfuse/Runner	CPU/MEM/Disk/Docker	HarborDown, RunnerOffline
pg (AI/Web)	192.168.0.188	PostgreSQL/Redis/Ollama/SignOz	CPU/MEM/Disk/GPU	DBConnectionFailed, OllamaTimeout
kali (Security)	192.168.0.112	Kali Scanner	CPU/MEM/Disk	ScannerOffline

1.2 Service Layer (Services)

A. Kubernetes Workloads (K3s)

Namespace	Deployment/StatefulSet	Replicas	Health check	Alert conditions
awoooi-prod	awoooi-api	2	HTTP /api/v1/health	PodCrashLoopBackOff, ReplicasUnavailable
awoooi-prod	awoooi-web	2	HTTP /	HighErrorRate, SlowResponse
awoooi-prod	awoooi-worker	1	Exec mtime	WorkerStuck, QueueBacklog
argocd	argocd-server	1	HTTP /healthz	ArgoCDDown
monitoring	prometheus-server	1	HTTP /-/ready	PrometheusDown
monitoring	alertmanager	1	HTTP /-/ready	AlertmanagerDown
velero	velero	1	-	BackupFailed

B. Container Services (Docker on 188/110)

Host	Container	Port	Health check	Alert conditions
188	ollama	11434	GET /api/tags	OllamaUnresponsive, ModelLoadFailed
188	openclaw	8089	GET /health	OpenClawDown, AnalysisTimeout
188	signoz-collector	24317/24318	gRPC health	TraceDropped
188	signoz-ui	3301	HTTP /	SignOzUIDown
188	redis-stack	6380	redis-cli ping	RedisDown, MemoryExhausted
188	postgres	5432	pg_isready	PostgresDown, ConnectionPoolExhausted
110	harbor-core	5000	GET /api/v2.0/health	HarborDown
110	sentry-web	9000	GET /_health/	SentryDown
110	langfuse	3100	GET /api/public/health	LangfuseDown
110	actions-runner	-	systemctl status	RunnerOffline

1.3 Application Layer (Application)

A. API Endpoint Monitoring

Endpoint	Method	Expected response	SLO	Alert
/api/v1/health	GET	200	99.9%	APIHealthCheckFailed
/api/v1/approvals/pending	GET	200	99%	ApprovalsAPIError
/api/v1/incidents	GET	200	99%	IncidentsAPIError
/api/v1/analyze	POST	200/202	95%	AnalysisTimeout (>30s)
/api/v1/execute	POST	200	99%	ExecutionFailed
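
To make the SLO column concrete, the sketch below converts a target into the downtime it would allow if the SLO is read as time-based availability. The 30-day window and the function name `allowed_downtime_minutes` are assumptions for illustration; the document does not state a measurement window.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime budget, in minutes, for an availability SLO."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


# SLO targets copied from the endpoint table above.
for endpoint, slo in [
    ("/api/v1/health", 99.9),
    ("/api/v1/incidents", 99.0),
    ("/api/v1/analyze", 95.0),
]:
    budget = allowed_downtime_minutes(slo)
    print(f"{endpoint}: about {budget:.1f} minutes per 30 days")
```

Under these assumptions a 99.9% target leaves roughly 43 minutes per 30 days, while a 95% target leaves 36 hours, which is why the analysis endpoint can tolerate a slower alerting path.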

B. Error Rate Monitoring (Sentry)

Type	Threshold	Alert	Auto-repair
Unhandled Exception	>0 in 5min	SentryNewError	AI analysis + Playbook matching
HTTP 5xx	>1%	HighErrorRate	Pod restart
HTTP 4xx	>10%	ClientErrorSpike	Alert + log analysis
Slow Transaction	P95 >2s	SlowTransaction	Resource scaling recommendation

C. Frontend Monitoring

Metric	Source	Threshold	Alert
Page Load Time	Sentry Performance	>3s	SlowPageLoad
JS Error Rate	Sentry Issues	>0.1%	FrontendError
API Call Failures	Sentry Breadcrumbs	>1%	APICallFailed
Web Vitals (LCP/FID/CLS)	Sentry	Google's recommended thresholds	PoorWebVitals

1.4 Data Layer (Data)

Database	Monitored items	Alert conditions
PostgreSQL	Connections, QPS, slow queries, WAL lag, disk I/O	ConnectionPoolExhausted (>90%), SlowQuery (>5s), ReplicationLag (>30s)
Redis	Memory usage, hit rate, latency, key count	MemoryHigh (>80%), HitRateLow (<90%), SlowCommands
ClickHouse	Disk usage, query latency, insert rate	DiskFull (>85%), QueryTimeout

1.5 AI/LLM Layer

Service	Monitored items	Alert conditions	Auto-repair
Ollama	Inference latency, model load state, GPU usage	InferenceTimeout (>60s), ModelLoadFailed	Container restart
OpenClaw	Analysis success rate, response time, token usage	AnalysisFailed (>10%), HighTokenCost	Fallback to Gemini
Gemini API	Rate limit, error rate, cost	RateLimitHit, BudgetExceeded	Degrade to Ollama
Claude API	Rate limit, error rate, cost	RateLimitHit, BudgetExceeded	Degrade to Gemini
Langfuse	Trace recording success rate	TraceLost (>1%)	Reconnect
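
The degradation column implies a chain Claude → Gemini → Ollama. A minimal sketch of walking that chain follows; `ProviderError`, `complete_with_fallback`, and the callable-per-provider shape are hypothetical names for illustration, not the project's actual client code.

```python
from typing import Callable, Dict, Optional, Tuple

# Degradation order taken from the table above: Claude -> Gemini -> Ollama.
FALLBACK: Dict[str, Optional[str]] = {
    "claude": "gemini",
    "gemini": "ollama",
    "ollama": None,
}


class ProviderError(Exception):
    """Raised by a provider on rate limit, budget overrun, or timeout."""


def complete_with_fallback(
    prompt: str,
    providers: Dict[str, Callable[[str], str]],
    start: str = "claude",
) -> Tuple[str, str]:
    """Try `start`, then walk the fallback chain until one provider answers.

    Returns (provider_name, answer) so the caller can record which tier served.
    """
    name: Optional[str] = start
    while name is not None:
        try:
            return name, providers[name](prompt)
        except ProviderError:
            name = FALLBACK[name]  # degrade to the next provider in the chain
    raise ProviderError("all providers failed")
```

Returning the provider name alongside the answer makes it easy to count how often each tier is actually used, which feeds the cost and rate-limit alerts listed above.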

1.6 CI/CD Layer

Component	Monitored items	Alert conditions
GitHub Actions	Workflow status, runner health, job latency	WorkflowFailed, RunnerOffline, JobStuck (>30min)
Harbor	Image push/pull success rate, storage space	PushFailed, PullFailed, StorageFull
ArgoCD	Sync status, Application health	SyncFailed, AppDegraded

2. Complete Alert Rule List

2.1 P0 - Critical (respond within 5 minutes)

# === Infrastructure layer ===
- alert: NodeDown
  expr: up{job="node-exporter"} == 0
  for: 1m
  severity: critical
  auto_repair: false  # requires manual intervention

- alert: K3sAPIServerDown
  expr: up{job="kubernetes-apiservers"} == 0
  for: 1m
  severity: critical
  auto_repair: false

- alert: PostgreSQLDown
  expr: pg_up == 0
  for: 30s
  severity: critical
  auto_repair: restart_container

- alert: RedisDown
  expr: redis_up == 0
  for: 30s
  severity: critical
  auto_repair: restart_container

# === Application layer ===
- alert: AWOOOIAPIDown
  expr: probe_success{job="awoooi-api"} == 0
  for: 1m
  severity: critical
  auto_repair: restart_pod

- alert: OpenClawDown
  expr: probe_success{job="openclaw"} == 0
  for: 2m
  severity: critical
  auto_repair: restart_container

- alert: PodCrashLoopBackOff
  expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0
  for: 2m
  severity: critical
  auto_repair: collect_logs_and_rollback

# === CI/CD layer ===
- alert: GitHubRunnerOffline
  expr: github_runner_status == 0
  for: 5m
  severity: critical
  auto_repair: restart_runner_service

2.2 P1 - High (respond within 15 minutes)

# === Performance alerts ===
- alert: HighCPUUsage
  expr: node_cpu_usage_percent > 90
  for: 5m
  severity: high
  auto_repair: scale_up_if_possible

- alert: HighMemoryUsage
  expr: node_memory_usage_percent > 90
  for: 5m
  severity: high
  auto_repair: investigate_memory_leak

- alert: APIHighLatency
  expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
  for: 5m
  severity: high
  auto_repair: analyze_slow_endpoints

- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m
  severity: high
  auto_repair: restart_pod

- alert: OllamaSlowInference
  expr: ollama_inference_duration_seconds > 60
  for: 3m
  severity: high
  auto_repair: switch_to_smaller_model

# === Resource alerts ===
- alert: DiskSpaceLow
  expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.15
  for: 10m
  severity: high
  auto_repair: cleanup_old_logs

- alert: PostgreSQLConnectionPoolHigh
  expr: sum by (instance) (pg_stat_activity_count) / on (instance) pg_settings_max_connections > 0.8
  for: 5m
  severity: high
  auto_repair: analyze_connection_leaks

2.3 P2 - Medium (respond within 1 hour)

- alert: CertificateExpiringSoon
  expr: ssl_cert_not_after - time() < 14 * 24 * 3600
  severity: medium
  auto_repair: renew_certificate

- alert: BackupNotSuccessful
  expr: increase(velero_backup_success_total[24h]) < 1
  severity: medium
  auto_repair: trigger_backup

- alert: LangfuseTraceLoss
  expr: langfuse_trace_drop_rate > 0.01
  severity: medium
  auto_repair: reconnect_langfuse
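
Because every rule in the lists above uses the same four keys, a CI check can catch incomplete entries before they ship. The sketch below assumes the rules have already been parsed into plain dictionaries; `validate_rules` and the response-time mapping are illustrative and are not the contents of `validate_coverage.py`.

```python
# Keys that every rule in sections 2.1-2.3 carries.
REQUIRED_KEYS = {"alert", "expr", "severity", "auto_repair"}

# Response targets, in minutes, taken from the P0/P1/P2 headings.
RESPONSE_MINUTES = {"critical": 5, "high": 15, "medium": 60}


def validate_rules(rules):
    """Return a list of human-readable problems; an empty list means all rules pass."""
    problems = []
    seen = set()
    for rule in rules:
        name = rule.get("alert", "<unnamed>")
        missing = REQUIRED_KEYS - rule.keys()
        if missing:
            problems.append(f"{name}: missing {sorted(missing)}")
        severity = rule.get("severity")
        if severity is not None and severity not in RESPONSE_MINUTES:
            problems.append(f"{name}: unknown severity {severity!r}")
        if name in seen:
            problems.append(f"{name}: duplicate alert name")
        seen.add(name)
    return problems


rules = [
    {"alert": "RedisDown", "expr": "redis_up == 0", "severity": "critical",
     "auto_repair": "restart_container"},
    {"alert": "DiskSpaceLow", "expr": "...", "severity": "urgent"},
]
for problem in validate_rules(rules):
    print(problem)
```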

3. AI Auto-Repair Closed Loop

3.1 Repair Flow Chart

┌─────────────────────────────────────────────────────────────────────┐
│                        Anomaly occurs                               │
│  (Prometheus Alert / Sentry Issue / SignOz Anomaly)                 │
└────────────────────────────┬────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     Alertmanager routing                            │
│  ┌───────────────┬───────────────┬───────────────┐                  │
│  │ route: awoooi │ route: infra  │ route: aiops  │                  │
│  └───────┬───────┴───────┬───────┴───────┬───────┘                  │
└──────────┼───────────────┼───────────────┼──────────────────────────┘
           │               │               │
           └───────────────┼───────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────────┐
│              AWOOOI API: /api/v1/webhooks/alertmanager              │
│  1. Receive alert → 2. Dedup (10min fingerprint) → 3. Open Incident │
└────────────────────────────┬────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     OpenClaw AI analysis engine                     │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │ Input:                                                        │  │
│  │ - Alert content (labels, annotations)                         │  │
│  │ - K8s context (Pod logs, events, metrics)                     │  │
│  │ - Historical Playbooks (similar cases)                        │  │
│  │ - SignOz Traces (related Spans)                               │  │
│  │ - Sentry Issues (related errors)                              │  │
│  ├───────────────────────────────────────────────────────────────┤  │
│  │ Output:                                                       │  │
│  │ - suggested_action: RESTART_POD | DELETE_POD | SCALE_UP | ... │  │
│  │ - confidence: 0.0-1.0                                         │  │
│  │ - risk_level: LOW | MEDIUM | CRITICAL                         │  │
│  │ - blast_radius: {affected_pods, estimated_downtime}           │  │
│  │ - kubectl_command: the concrete command                       │  │
│  │ - reasoning: decision rationale (Traditional Chinese)         │  │
│  └───────────────────────────────────────────────────────────────┘  │
└────────────────────────────┬────────────────────────────────────────┘
                             │
              ┌──────────────┴──────────────┐
              │                             │
              ▼                             ▼
┌─────────────────────────┐   ┌─────────────────────────┐
│   confidence >= 0.85    │   │   confidence < 0.85     │
│   risk_level = LOW      │   │   OR risk = CRITICAL    │
│   ↓                     │   │   ↓                     │
│   Auto-execute          │   │   Human review          │
└───────────┬─────────────┘   └───────────┬─────────────┘
            │                             │
            │                             ▼
            │               ┌─────────────────────────┐
            │               │   Telegram review card  │
            │               │ [✅ Approve] [❌ Reject]│
            │               │ [⏰ Later] [🔕 Silence] │
            │               └───────────┬─────────────┘
            │                           │
            │               ┌───────────┴───────────┐
            │               │                       │
            │               ▼                       ▼
            │        ┌────────────┐          ┌────────────┐
            │        │  Approved  │          │  Rejected  │
            │        └─────┬──────┘          └─────┬──────┘
            │              │                       │
            └──────────────┼───────────────────────┤
                           │                       │
                           ▼                       ▼
              ┌─────────────────────────┐  ┌────────────────┐
              │   K8s Executor runs     │  │ Record reason  │
              │   kubectl $command      │  │ Update Playbook│
              └───────────┬─────────────┘  └────────────────┘
                          │
                          ▼
              ┌─────────────────────────┐
              │   Verify the result     │
              │   - Health check passes?│
              │   - Error rate dropping?│
              │   - Latency recovered?  │
              └───────────┬─────────────┘
                          │
              ┌───────────┴───────────┐
              │                       │
              ▼                       ▼
   ┌───────────────────┐   ┌───────────────────┐
   │ Repair succeeded  │   │ Repair failed     │
   │ - Close Incident  │   │ - Escalate alert  │
   │ - Update Playbook │   │ - Record failure  │
   │ - Notify Telegram │   │ - Manual handling │
   └───────────────────┘   └───────────────────┘
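
The branch in the middle of the chart reduces to one predicate. The sketch below is a hedged reading of it: the chart does not draw the MEDIUM-risk case, so this version sends anything other than LOW risk to human review, which matches the `auto_approve_max_risk` guardrail in section 3.3. `Analysis` and `route` are illustrative names.

```python
from dataclasses import dataclass

AUTO_MIN_CONFIDENCE = 0.85  # threshold shown in the flow chart
AUTO_MAX_RISK = "LOW"       # only LOW-risk actions run unattended


@dataclass
class Analysis:
    """Subset of the OpenClaw output fields used for routing."""
    suggested_action: str
    confidence: float  # 0.0-1.0
    risk_level: str    # LOW | MEDIUM | CRITICAL


def route(analysis: Analysis) -> str:
    """Return "auto_execute" or "human_review" for an OpenClaw analysis."""
    if analysis.confidence >= AUTO_MIN_CONFIDENCE and analysis.risk_level == AUTO_MAX_RISK:
        return "auto_execute"
    return "human_review"


print(route(Analysis("RESTART_POD", 0.92, "LOW")))     # auto_execute
print(route(Analysis("ROLLBACK", 0.95, "MEDIUM")))     # human_review
```

Defaulting to review whenever either condition fails keeps unknown or misspelled risk levels on the safe side.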

3.2 Auto-Repair Action List

Action	Trigger	Command	Risk level	Auto-execute?
`RESTART_POD`	PodCrashLoop, HighErrorRate	`kubectl rollout restart deployment/{name}`	LOW	✅ Automatic
`DELETE_POD`	PodStuck, OOMKilled	`kubectl delete pod {name} --grace-period=30`	LOW	✅ Automatic
`SCALE_UP`	HighCPU, HighMemory, SlowResponse	`kubectl scale deployment/{name} --replicas={current+1}`	LOW	✅ Automatic
`SCALE_DOWN`	ResourceWaste	`kubectl scale deployment/{name} --replicas={current-1}`	MEDIUM	❌ Review required
`ROLLBACK`	DeploymentFailed, VersionDrift	`kubectl rollout undo deployment/{name}`	MEDIUM	❌ Review required
`RESTART_CONTAINER`	ContainerUnhealthy	`docker restart {container}`	LOW	✅ Automatic
`CLEAR_CACHE`	RedisMemoryHigh, StaleCache	`redis-cli FLUSHDB`	MEDIUM	❌ Review required
`VACUUM_DB`	TableBloat, SlowQuery	`VACUUM ANALYZE {table}`	MEDIUM	❌ Review required
`RENEW_CERT`	CertExpiring	`certbot renew`	LOW	✅ Automatic
`CLEANUP_LOGS`	DiskSpaceLow	`find /var/log -mtime +7 -delete`	LOW	✅ Automatic
`SWITCH_MODEL`	OllamaTimeout	Switch to a smaller model	LOW	✅ Automatic
`FALLBACK_AI`	GeminiRateLimit	Gemini → Ollama	LOW	✅ Automatic
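
`kubectl scale` takes an absolute replica count, so a relative step such as "one more replica" has to be resolved against the current count before the command is built. A minimal sketch, where `scale_command` and its `min_replicas` floor are assumptions for illustration:

```python
def scale_command(name: str, current_replicas: int, delta: int, min_replicas: int = 1) -> str:
    """Build a kubectl scale command with an absolute replica count.

    `delta` is +1 for SCALE_UP and -1 for SCALE_DOWN; the result never
    drops below `min_replicas`.
    """
    target = max(current_replicas + delta, min_replicas)
    return f"kubectl scale deployment/{name} --replicas={target}"


print(scale_command("awoooi-api", current_replicas=2, delta=+1))
print(scale_command("awoooi-worker", current_replicas=1, delta=-1))
```

The floor mirrors the `min_healthy_replicas` guardrail, so a scale-down request against a single-replica workload becomes a no-op instead of an outage.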

3.3 Safety Guardrails

# === Safety limits for auto-repair ===
SAFETY_GUARDRAILS = {
    # Rate limits
    "max_repairs_per_hour": 5,          # at most 5 auto-repairs per hour
    "max_repairs_per_resource": 3,      # at most 3 per hour for the same resource
    "cooldown_after_failure": 600,      # 10-minute cooldown after a failure

    # Risk limits
    "auto_approve_max_risk": "LOW",     # auto-approval is limited to LOW risk
    "auto_approve_min_confidence": 0.85, # minimum confidence of 85%

    # Blast-radius limits
    "max_affected_pods": 3,             # affect at most 3 Pods
    "min_healthy_replicas": 1,          # keep at least 1 healthy replica

    # Never executed automatically
    "blacklist_actions": [
        "DROP_DATABASE",
        "DELETE_NAMESPACE",
        "FORCE_DELETE_PVC",
        "DELETE_SECRET",
    ],

    # Allowlisted namespaces
    "allowed_namespaces": [
        "awoooi-prod",
        "monitoring",
    ],
}
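
One way the two rate limits could be enforced is a sliding one-hour window kept per resource and globally. The sketch below is an in-memory illustration only; `RepairLimiter` is a hypothetical name, and a real deployment would likely keep this state in Redis so it survives restarts.

```python
import time
from collections import defaultdict, deque


class RepairLimiter:
    """Sliding-window limiter for max_repairs_per_hour and max_repairs_per_resource."""

    def __init__(self, max_per_hour=5, max_per_resource=3, window_seconds=3600):
        self.max_per_hour = max_per_hour
        self.max_per_resource = max_per_resource
        self.window = window_seconds
        self._all = deque()                      # timestamps of every repair
        self._by_resource = defaultdict(deque)   # timestamps per resource

    def allow(self, resource, now=None):
        """Record and permit a repair, or return False if a limit is reached."""
        now = time.time() if now is None else now
        cutoff = now - self.window
        for queue in (self._all, self._by_resource[resource]):
            while queue and queue[0] <= cutoff:
                queue.popleft()  # drop repairs that fell out of the window
        if len(self._all) >= self.max_per_hour:
            return False
        if len(self._by_resource[resource]) >= self.max_per_resource:
            return False
        self._all.append(now)
        self._by_resource[resource].append(now)
        return True
```

A denied call records nothing, so repeated attempts during a lockout do not extend it.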

4. Monitoring Data Flow Integration

4.1 Sentry → OpenClaw

# /api/v1/webhooks/sentry - Sentry Issue Alert Webhook
async def handle_sentry_webhook(payload: dict):
    """
    1. Parse the Sentry issue
    2. Deduplicate (10-minute TTL)
    3. Create an Incident
    4. Trigger OpenClaw analysis
    5. Push to Telegram
    """
    issue = payload["data"]["issue"]
    issue_id = issue["id"]

    # Deduplicate. SET with NX + EX is atomic, so two concurrent deliveries
    # of the same issue cannot both pass the check
    # (assumes a redis-py style async client).
    is_first = await redis.set(f"sentry_dedup:{issue_id}", "1", ex=600, nx=True)
    if not is_first:
        return {"status": "deduplicated"}

    # Create the Incident
    incident = await incident_service.create_from_sentry(payload)

    # AI analysis
    analysis = await openclaw.analyze_error(
        error_title=issue["title"],
        stack_trace=issue["culprit"],
        sentry_url=issue["web_url"],
        trace_id=extract_trace_id(payload),
    )

    # Telegram notification
    await telegram.send_error_alert(
        incident_id=incident.id,
        analysis=analysis,
        sentry_url=issue["web_url"],
    )
    return {"status": "processed", "incident_id": incident.id}
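
Since the appendix defines `SENTRY_WEBHOOK_SECRET`, the handler presumably authenticates incoming requests before parsing them. The sketch below assumes Sentry signs the raw request body with HMAC-SHA256 and sends the hex digest in a `Sentry-Hook-Signature` header; that scheme should be confirmed against the Sentry version in use. `verify_sentry_signature` is an illustrative name.

```python
import hashlib
import hmac


def verify_sentry_signature(body: bytes, signature: str, secret: str) -> bool:
    """Return True if `signature` is the HMAC-SHA256 hex digest of `body`.

    `body` must be the raw request bytes, not a re-serialized JSON object,
    because any change in whitespace or key order changes the digest.
    """
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

Verification would run before the deduplication step, so unauthenticated requests never touch Redis or create incidents.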

4.2 Alertmanager → OpenClaw

# alertmanager.yml
route:
  receiver: awoooi-api
  routes:
    - match:
        namespace: awoooi-prod
      receiver: awoooi-api
    - match:
        severity: critical
      receiver: awoooi-api

receivers:
  - name: awoooi-api
    webhook_configs:
      - url: http://192.168.0.125:32334/api/v1/webhooks/alertmanager
        send_resolved: true
        http_config:
          basic_auth:
            username: alertmanager
            password_file: /etc/alertmanager/secrets/webhook-password
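
On the receiving side, the webhook handler has to pick the firing alerts out of the Alertmanager payload and drop repeats inside the 10-minute window. A minimal sketch follows, using the `alerts`, `status`, and `fingerprint` fields of the Alertmanager webhook format; `FingerprintDeduper` is an in-memory stand-in for the Redis TTL key and is a hypothetical name.

```python
class FingerprintDeduper:
    """In-memory stand-in for a Redis key with a 10-minute TTL."""

    def __init__(self, ttl_seconds=600):
        self.ttl = ttl_seconds
        self._seen = {}  # fingerprint -> time it was first accepted

    def is_new(self, fingerprint, now):
        last = self._seen.get(fingerprint)
        if last is not None and now - last < self.ttl:
            return False
        self._seen[fingerprint] = now
        return True


def new_firing_alerts(payload, deduper, now):
    """Return firing alerts from a webhook payload that were not seen within the TTL."""
    fresh = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved notifications do not open new incidents
        if deduper.is_new(alert["fingerprint"], now):
            fresh.append(alert)
    return fresh
```

Because `send_resolved: true` is set in the receiver, the same endpoint also receives resolved notifications; this sketch simply skips them, and a fuller handler would use them to close the matching Incident.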

4.3 SignOz → OpenClaw

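# Query anomalous spans through ClickHouse
async def detect_signoz_anomalies():
    """
    Periodically query the SignOz ClickHouse store to detect:
    - An abnormal rise in error rate
    - Abnormal P99 latency
    - A sudden drop in trace volume (the service may be down)
    """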
    anomalies = await clickhouse.query("""
        SELECT
            serviceName,
            count(*) as error_count,
            avg(durationNano) / 1e6 as avg_latency_ms
        FROM signoz_traces.signoz_index_v2
        WHERE timestamp > now() - INTERVAL 5 MINUTE
          AND statusCode = 'STATUS_CODE_ERROR'
        GROUP BY serviceName
        HAVING error_count > 10
    """)

    for anomaly in anomalies:
        await openclaw.analyze_trace_anomaly(
            service=anomaly["serviceName"],
            error_count=anomaly["error_count"],
            avg_latency=anomaly["avg_latency_ms"],
        )
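
The query above covers the error-count case. The "sudden drop in trace volume" case from the docstring needs a comparison against a baseline instead of a fixed threshold. The sketch below shows one simple form of that check; the 50% ratio and the function name `volume_dropped` are assumptions, not values from the source.

```python
def volume_dropped(current_count, baseline_counts, drop_ratio=0.5):
    """Return True if the current window's trace count fell below
    `drop_ratio` times the average of the previous windows."""
    if not baseline_counts:
        return False  # no history yet, so nothing to compare against
    baseline = sum(baseline_counts) / len(baseline_counts)
    if baseline == 0:
        return False  # an idle service cannot drop any further
    return current_count < baseline * drop_ratio


# Previous six 5-minute windows averaged 200 traces; the latest has 40.
print(volume_dropped(40, [210, 190, 205, 195, 200, 200]))
```

Returning False when there is no history avoids paging on a service that has just been deployed.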

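5. Implementation Priorities

Phase 1 (this week - P0)

Item	Status	Owner	Notes
Alertmanager → AWOOOI Webhook	⬜ TODO	Claude Code	Configure the webhook + test alerts
Sentry Webhook → Telegram	⬜ TODO	Claude Code	Push errors directly + AI analysis
Automatic Secrets injection (CD)	⬜ TODO	Claude Code	kubectl patch secret
Alert deduplication verification	⬜ TODO	Claude Code	10min fingerprint test

Phase 2 (next week - P1)

Item	Status	Owner	Notes
SignOz alert rules	⬜ TODO	Claude Code	Error Rate, Latency P99
Auto-repair action expansion	⬜ TODO	Claude Code	SCALE_UP, ROLLBACK
Automatic Playbook extraction	⬜ TODO	Claude Code	Successful repair → Playbook
Alert escalation mechanism	⬜ TODO	Claude Code	SLA Engine

Phase 3 (in two weeks - P2)

Item	Status	Owner	Notes
Grafana dashboards	⬜ TODO	Claude Code	Monitoring overview
SLO/SLI definitions	⬜ TODO	Claude Code	99.9% availability target
Alert noise suppression	⬜ TODO	Claude Code	ML anomaly detection
Capacity forecasting	⬜ TODO	Claude Code	Resource trend forecasting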

6. Appendix

A. Environment Variables

# === Alertmanager ===
ALERTMANAGER_WEBHOOK_URL=http://192.168.0.125:32334/api/v1/webhooks/alertmanager
ALERTMANAGER_WEBHOOK_SECRET=<secret>

# === Sentry ===
SENTRY_DSN=http://<key>@192.168.0.110:9000/<project>
SENTRY_WEBHOOK_SECRET=<secret>
SENTRY_DEDUP_TTL=600

# === SignOz ===
SIGNOZ_CLICKHOUSE_URL=http://192.168.0.188:8123
SIGNOZ_ANOMALY_THRESHOLD_ERROR_COUNT=10

# === Auto-repair ===
AUTO_REPAIR_ENABLED=true
AUTO_REPAIR_MAX_PER_HOUR=5
AUTO_REPAIR_MIN_CONFIDENCE=0.85
AUTO_REPAIR_DRY_RUN=false
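
A minimal sketch of loading the auto-repair variables into a typed object follows. The fallback defaults used when a variable is missing (disabled, dry-run on) are a deliberately conservative assumption and differ from the values listed above; `AutoRepairConfig` and `load_auto_repair_config` are illustrative names.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoRepairConfig:
    enabled: bool
    max_per_hour: int
    min_confidence: float
    dry_run: bool


def _flag(value: str) -> bool:
    """Interpret common truthy spellings; anything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_auto_repair_config(env=None) -> AutoRepairConfig:
    """Read the AUTO_REPAIR_* variables, failing safe when one is unset."""
    env = os.environ if env is None else env
    return AutoRepairConfig(
        enabled=_flag(env.get("AUTO_REPAIR_ENABLED", "false")),
        max_per_hour=int(env.get("AUTO_REPAIR_MAX_PER_HOUR", "5")),
        min_confidence=float(env.get("AUTO_REPAIR_MIN_CONFIDENCE", "0.85")),
        dry_run=_flag(env.get("AUTO_REPAIR_DRY_RUN", "true")),
    )
```

Passing a plain dictionary as `env` keeps the loader easy to test without touching the process environment.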

B. Alert Template

🚨 **CRITICAL | awoooi-api**
━━━━━━━━━━━━━━━━━━━
📋 INC-20260329-0001
🎯 Pod: awoooi-api-7d4b8c9f5-abc12
━━━━━━━━━━━━━━━━━━━
🤖 **AI Analysis**
👥 Owner: BE (backend)
📊 Confidence: 🟢 92%
💡 Cause: OOM Killed - Memory limit exceeded
━━━━━━━━━━━━━━━━━━━
🔧 Recommendation: DELETE_POD + SCALE_UP
⏱️ Downtime: ~30s
💰 Tokens: 1,234 / $0.0012
━━━━━━━━━━━━━━━━━━━
🔗 [SignOz Trace](http://192.168.0.188:3301/trace/abc123)
🔗 [Sentry Issue](http://192.168.0.110:9000/issues/456)

[✅ Approve] [❌ Reject] [⏰ Later] [🔕 Silence]
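
The card above can be produced from incident data with a small formatter. The sketch below covers only the header, analysis, and recommendation lines; the English labels, the yellow marker below the 0.85 threshold, the separator length, and the function name `render_alert` are all assumptions for illustration.

```python
def render_alert(severity, service, incident_id, pod, confidence, cause, actions, downtime_s):
    """Render the text portion of a Telegram review card."""
    bar = "━" * 19
    # Green at or above the auto-approval threshold, yellow below it.
    light = "🟢" if confidence >= 0.85 else "🟡"
    lines = [
        f"🚨 **{severity.upper()} | {service}**",
        bar,
        f"📋 {incident_id}",
        f"🎯 Pod: {pod}",
        bar,
        "🤖 **AI Analysis**",
        f"📊 Confidence: {light} {confidence:.0%}",
        f"💡 Cause: {cause}",
        bar,
        f"🔧 Recommendation: {' + '.join(actions)}",
        f"⏱️ Downtime: ~{downtime_s}s",
    ]
    return "\n".join(lines)


print(render_alert(
    "critical", "awoooi-api", "INC-20260329-0001", "awoooi-api-7d4b8c9f5-abc12",
    0.92, "OOM Killed - Memory limit exceeded", ["DELETE_POD", "SCALE_UP"], 30,
))
```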

End of document. Next step: carry out the Phase 1 tasks.
