# AWOOOI Full-Stack Monitoring and AI Auto-Repair Strategy

> **Version**: v1.0
> **Created**: 2026-03-29
> **Owner**: Chief Architect (Claude Code)
> **Goal**: 100% monitoring coverage + AI-driven auto-repair

---

## Executive Summary

This document defines the AWOOOI full-stack monitoring strategy, covering:

- **5 hosts** × **60+ services** × **4 monitoring layers**
- **Three observability pillars**: Sentry (errors) + SigNoz (traces) + Prometheus (metrics)
- **AI auto-repair loop**: anomaly → OpenClaw analysis → automatic execution or human review

---

## 1. Monitoring Coverage Matrix

### 1.1 Host Layer (Infrastructure)

| Host | IP | Role | Monitored items | Alert rules |
|------|----|------|-----------------|-------------|
| **mon (K3s Master)** | 192.168.0.120 | K3s Server + keepalived | CPU/MEM/Disk/etcd | NodeDown, etcdHighLatency |
| **mon1 (K3s Worker)** | 192.168.0.121 | K3s Worker + keepalived | CPU/MEM/Disk/kubelet | NodeNotReady, DiskPressure |
| **harbor (DevOps)** | 192.168.0.110 | Harbor/Sentry/Langfuse/Runner | CPU/MEM/Disk/Docker | HarborDown, RunnerOffline |
| **pg (AI/Web)** | 192.168.0.188 | PostgreSQL/Redis/Ollama/SigNoz | CPU/MEM/Disk/GPU | DBConnectionFailed, OllamaTimeout |
| **kali (Security)** | 192.168.0.112 | Kali Scanner | CPU/MEM/Disk | ScannerOffline |

### 1.2 Service Layer (Services)

#### A. Kubernetes Workloads (K3s)

| Namespace | Deployment/StatefulSet | Replicas | Health check | Alert conditions |
|-----------|------------------------|----------|--------------|------------------|
| **awoooi-prod** | awoooi-api | 2 | HTTP /api/v1/health | PodCrashLoopBackOff, ReplicasUnavailable |
| **awoooi-prod** | awoooi-web | 2 | HTTP / | HighErrorRate, SlowResponse |
| **awoooi-prod** | awoooi-worker | 1 | Exec mtime | WorkerStuck, QueueBacklog |
| **argocd** | argocd-server | 1 | HTTP /healthz | ArgoCDDown |
| **monitoring** | prometheus-server | 1 | HTTP /-/ready | PrometheusDown |
| **monitoring** | alertmanager | 1 | HTTP /-/ready | AlertmanagerDown |
| **velero** | velero | 1 | - | BackupFailed |

#### B. Container Services (Docker on 188/110)

| Host | Container | Port | Health check | Alert conditions |
|------|-----------|------|--------------|------------------|
| 188 | ollama | 11434 | GET /api/tags | OllamaUnresponsive, ModelLoadFailed |
| 188 | openclaw | 8089 | GET /health | OpenClawDown, AnalysisTimeout |
| 188 | signoz-collector | 24317/24318 | gRPC health | TraceDropped |
| 188 | signoz-ui | 3301 | HTTP / | SignOzUIDown |
| 188 | redis-stack | 6380 | redis-cli ping | RedisDown, MemoryExhausted |
| 188 | postgres | 5432 | pg_isready | PostgresDown, ConnectionPoolExhausted |
| 110 | harbor-core | 5000 | GET /api/v2.0/health | HarborDown |
| 110 | sentry-web | 9000 | GET /_health/ | SentryDown |
| 110 | langfuse | 3100 | GET /api/public/health | LangfuseDown |
| 110 | actions-runner | - | systemctl status | RunnerOffline |

### 1.3 Application Layer (Application)

#### A. API Endpoint Monitoring

| Endpoint | Method | Expected response | SLO | Alert |
|----------|--------|-------------------|-----|-------|
| /api/v1/health | GET | 200 | 99.9% | APIHealthCheckFailed |
| /api/v1/approvals/pending | GET | 200 | 99% | ApprovalsAPIError |
| /api/v1/incidents | GET | 200 | 99% | IncidentsAPIError |
| /api/v1/analyze | POST | 200/202 | 95% | AnalysisTimeout (>30s) |
| /api/v1/execute | POST | 200 | 99% | ExecutionFailed |

#### B. Error Rate Monitoring (Sentry)

| Type | Threshold | Alert | Auto-repair |
|------|-----------|-------|-------------|
| Unhandled Exception | >0 in 5min | SentryNewError | AI analysis + Playbook matching |
| HTTP 5xx | >1% | HighErrorRate | Pod restart |
| HTTP 4xx | >10% | ClientErrorSpike | Alert + log analysis |
| Slow Transaction | P95 >2s | SlowTransaction | Scaling recommendation |

#### C. Frontend Monitoring

| Metric | Source | Threshold | Alert |
|--------|--------|-----------|-------|
| Page Load Time | Sentry Performance | >3s | SlowPageLoad |
| JS Error Rate | Sentry Issues | >0.1% | FrontendError |
| API Call Failures | Sentry Breadcrumbs | >1% | APICallFailed |
| Web Vitals (LCP/FID/CLS) | Sentry | Google thresholds | PoorWebVitals |

### 1.4 Data Layer (Data)

| Database | Monitored items | Alert conditions |
|----------|-----------------|------------------|
| **PostgreSQL** | Connections, QPS, slow queries, WAL lag, disk I/O | ConnectionPoolExhausted (>90%), SlowQuery (>5s), ReplicationLag (>30s) |
| **Redis** | Memory usage, hit rate, latency, key count | MemoryHigh (>80%), HitRateLow (<90%), SlowCommands |
| **ClickHouse** | Disk usage, query latency, insert rate | DiskFull (>85%), QueryTimeout |

### 1.5 AI/LLM Layer

| Service | Monitored items | Alert conditions | Auto-repair |
|---------|-----------------|------------------|-------------|
| **Ollama** | Inference latency, model load state, GPU usage | InferenceTimeout (>60s), ModelLoadFailed | Container restart |
| **OpenClaw** | Analysis success rate, response time, token usage | AnalysisFailed (>10%), HighTokenCost | Fallback to Gemini |
| **Gemini API** | Rate limit, error rate, cost | RateLimitHit, BudgetExceeded | Degrade to Ollama |
| **Claude API** | Rate limit, error rate, cost | RateLimitHit, BudgetExceeded | Degrade to Gemini |
| **Langfuse** | Trace recording success rate | TraceLost (>1%) | Reconnect |

### 1.6 CI/CD Layer

| Component | Monitored items | Alert conditions |
|-----------|-----------------|------------------|
| **GitHub Actions** | Workflow status, runner health, job latency | WorkflowFailed, RunnerOffline, JobStuck (>30min) |
| **Harbor** | Image push/pull success rate, storage space | PushFailed, PullFailed, StorageFull |
| **ArgoCD** | Sync status, Application health | SyncFailed, AppDegraded |

---

## 2. Complete Alert Rule List

### 2.1 P0 - Critical (5-minute response)

```yaml
# severity and auto_repair are nested under labels because a Prometheus
# rule file rejects unknown top-level fields on a rule.

# === Infrastructure layer ===
- alert: NodeDown
  expr: up{job="node-exporter"} == 0
  for: 1m
  labels:
    severity: critical
    auto_repair: "false"  # requires human intervention

- alert: K3sAPIServerDown
  expr: up{job="kubernetes-apiservers"} == 0
  for: 1m
  labels:
    severity: critical
    auto_repair: "false"

- alert: PostgreSQLDown
  expr: pg_up == 0
  for: 30s
  labels:
    severity: critical
    auto_repair: restart_container

- alert: RedisDown
  expr: redis_up == 0
  for: 30s
  labels:
    severity: critical
    auto_repair: restart_container

# === Application layer ===
- alert: AWOOOIAPIDown
  expr: probe_success{job="awoooi-api"} == 0
  for: 1m
  labels:
    severity: critical
    auto_repair: restart_pod

- alert: OpenClawDown
  expr: probe_success{job="openclaw"} == 0
  for: 2m
  labels:
    severity: critical
    auto_repair: restart_container

- alert: PodCrashLoopBackOff
  expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0
  for: 2m
  labels:
    severity: critical
    auto_repair: collect_logs_and_rollback

# === CI/CD layer ===
- alert: GitHubRunnerOffline
  expr: github_runner_status == 0
  for: 5m
  labels:
    severity: critical
    auto_repair: restart_runner_service
```

### 2.2 P1 - High (15-minute response)

```yaml
# === Performance alerts ===
# node_cpu_usage_percent and node_memory_usage_percent are not exported by
# node_exporter itself; these rules assume recording rules with those names.
- alert: HighCPUUsage
  expr: node_cpu_usage_percent > 90
  for: 5m
  labels:
    severity: high
    auto_repair: scale_up_if_possible

- alert: HighMemoryUsage
  expr: node_memory_usage_percent > 90
  for: 5m
  labels:
    severity: high
    auto_repair: investigate_memory_leak

# histogram_quantile needs per-second bucket rates aggregated by le
- alert: APIHighLatency
  expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
  for: 5m
  labels:
    severity: high
    auto_repair: analyze_slow_endpoints

# sum() both sides so the 5xx series divide by the total, not by themselves
- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m
  labels:
    severity: high
    auto_repair: restart_pod

- alert: OllamaSlowInference
  expr: ollama_inference_duration_seconds > 60
  for: 3m
  labels:
    severity: high
    auto_repair: switch_to_smaller_model

# === Resource alerts ===
- alert: DiskSpaceLow
  expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.15
  for: 10m
  labels:
    severity: high
    auto_repair: cleanup_old_logs

# Aggregated per instance so both sides of the division share labels
- alert: PostgreSQLConnectionPoolHigh
  expr: sum by (instance) (pg_stat_activity_count) / on (instance) pg_settings_max_connections > 0.8
  for: 5m
  labels:
    severity: high
    auto_repair: analyze_connection_leaks
```

### 2.3 P2 - Medium (1-hour response)

```yaml
- alert: CertificateExpiringSoon
  expr: ssl_cert_not_after - time() < 14 * 24 * 3600
  labels:
    severity: medium
    auto_repair: renew_certificate

# No successful backup in the last 24 hours
- alert: BackupNotSuccessful
  expr: increase(velero_backup_success_total[24h]) < 1
  labels:
    severity: medium
    auto_repair: trigger_backup

- alert: LangfuseTraceLoss
  expr: langfuse_trace_drop_rate > 0.01
  labels:
    severity: medium
    auto_repair: reconnect_langfuse
```

---

## 3. AI Auto-Repair Loop

### 3.1 Repair Flowchart

```
┌────────────────────────────────────────────────────────────────────┐
│  Anomaly occurs                                                    │
│  (Prometheus Alert / Sentry Issue / SigNoz Anomaly)                │
└─────────────────────────────────┬──────────────────────────────────┘
                                  │
                                  ▼
┌────────────────────────────────────────────────────────────────────┐
│  Alertmanager routing                                              │
│  route: awoooi | route: infra | route: aiops                       │
└─────────────────────────────────┬──────────────────────────────────┘
                                  │
                                  ▼
┌────────────────────────────────────────────────────────────────────┐
│  AWOOOI API: /api/v1/webhooks/alertmanager                         │
│  1. Receive alert → 2. Dedupe (10-min fingerprint)                 │
│  → 3. Create Incident                                              │
└─────────────────────────────────┬──────────────────────────────────┘
                                  │
                                  ▼
┌────────────────────────────────────────────────────────────────────┐
│  OpenClaw AI analysis engine                                       │
│  Inputs:                                                           │
│    - Alert content (labels, annotations)                           │
│    - K8s context (Pod logs, events, metrics)                       │
│    - Historical Playbooks (similar cases)                          │
│    - SigNoz traces (related spans)                                 │
│    - Sentry issues (related errors)                                │
│  Outputs:                                                          │
│    - suggested_action: RESTART_POD | DELETE_POD | SCALE_UP | ...   │
│    - confidence: 0.0-1.0                                           │
│    - risk_level: LOW | MEDIUM | CRITICAL                           │
│    - blast_radius: {affected_pods, estimated_downtime}             │
│    - kubectl_command: the concrete command                         │
│    - reasoning: rationale (in Traditional Chinese)                 │
└─────────────────────────────────┬──────────────────────────────────┘
                                  │
                ┌─────────────────┴─────────────────┐
                │                                   │
                ▼                                   ▼
  ┌───────────────────────────┐       ┌───────────────────────────┐
  │ confidence >= 0.85        │       │ confidence < 0.85         │
  │ AND risk_level = LOW      │       │ OR risk_level != LOW      │
  │ → auto-execute            │       │ → human review            │
  └─────────────┬─────────────┘       └─────────────┬─────────────┘
                │                                   │
                │                                   ▼
                │                     ┌───────────────────────────┐
                │                     │ Telegram review card      │
                │                     │ [✅ Approve] [❌ Reject]  │
                │                     │ [⏰ Later] [🔕 Mute]      │
                │                     └─────────────┬─────────────┘
                │◄────────── approved ──────────────┤
                │                                   │ rejected
                ▼                                   ▼
  ┌───────────────────────────┐       ┌───────────────────────────┐
  │ K8s Executor runs         │       │ Record rejection reason   │
  │ kubectl $command          │       │ Update Playbook           │
  └─────────────┬─────────────┘       └───────────────────────────┘
                │
                ▼
  ┌───────────────────────────┐
  │ Verify the result         │
  │ - Health check passing?   │
  │ - Error rate dropping?    │
  │ - Latency back to normal? │
  └─────────────┬─────────────┘
                ├───────────────────────────────────┐
                │                                   │
                ▼                                   ▼
  ┌───────────────────────────┐       ┌───────────────────────────┐
  │ Repair succeeded          │       │ Repair failed             │
  │ - Close Incident          │       │ - Escalate alert          │
  │ - Update Playbook         │       │ - Record failure          │
  │ - Notify via Telegram     │       │ - Human intervention      │
  └───────────────────────────┘       └───────────────────────────┘
```

### 3.2 Auto-Repair Action List

| Action | Trigger | Command | Risk level | Auto-execute? |
|--------|---------|---------|------------|---------------|
| `RESTART_POD` | PodCrashLoop, HighErrorRate | `kubectl rollout restart deployment/{name}` | LOW | ✅ Automatic |
| `DELETE_POD` | PodStuck, OOMKilled | `kubectl delete pod {name} --grace-period=30` | LOW | ✅ Automatic |
| `SCALE_UP` | HighCPU, HighMemory, SlowResponse | `kubectl scale deployment/{name} --replicas={current+1}` | LOW | ✅ Automatic |
| `SCALE_DOWN` | ResourceWaste | `kubectl scale deployment/{name} --replicas={current-1}` | MEDIUM | ❌ Needs review |
| `ROLLBACK` | DeploymentFailed, VersionDrift | `kubectl rollout undo deployment/{name}` | MEDIUM | ❌ Needs review |
| `RESTART_CONTAINER` | ContainerUnhealthy | `docker restart {container}` | LOW | ✅ Automatic |
| `CLEAR_CACHE` | RedisMemoryHigh, StaleCache | `redis-cli FLUSHDB` | MEDIUM | ❌ Needs review |
| `VACUUM_DB` | TableBloat, SlowQuery | `VACUUM ANALYZE {table}` | MEDIUM | ❌ Needs review |
| `RENEW_CERT` | CertExpiring | `certbot renew` | LOW | ✅ Automatic |
| `CLEANUP_LOGS` | DiskSpaceLow | `find /var/log -type f -mtime +7 -delete` | LOW | ✅ Automatic |
| `SWITCH_MODEL` | OllamaTimeout | Switch to a smaller model | LOW | ✅ Automatic |
| `FALLBACK_AI` | GeminiRateLimit | Gemini → Ollama | LOW | ✅ Automatic |

`kubectl scale --replicas` takes an absolute count, so the executor must read the current replica count and pass `current+1` or `current-1` as a concrete number.

### 3.3 Safety Guardrails

```python
# === Auto-repair safety limits ===
SAFETY_GUARDRAILS = {
    # Rate limits
    "max_repairs_per_hour": 5,            # at most 5 auto-repairs per hour
    "max_repairs_per_resource": 3,        # at most 3 per hour for the same resource
    "cooldown_after_failure": 600,        # 10-minute cooldown after a failure (seconds)

    # Risk limits
    "auto_approve_max_risk": "LOW",       # auto-approve LOW risk only
    "auto_approve_min_confidence": 0.85,  # minimum confidence 85%

    # Blast-radius limits
    "max_affected_pods": 3,               # affect at most 3 Pods
    "min_healthy_replicas": 1,            # keep at least 1 healthy replica

    # Never executed automatically
    "blacklist_actions": [
        "DROP_DATABASE",
        "DELETE_NAMESPACE",
        "FORCE_DELETE_PVC",
        "DELETE_SECRET",
    ],

    # Namespace allowlist
    "allowed_namespaces": [
        "awoooi-prod",
        "monitoring",
    ],
}
```

---

## 4. Monitoring Data Flow Integration

### 4.1 Sentry → OpenClaw

```python
# /api/v1/webhooks/sentry - Sentry Issue Alert Webhook
# redis (asyncio client), incident_service, openclaw, telegram and
# extract_trace_id() are assumed to be provided by the application.
async def handle_sentry_webhook(payload: dict):
    """
    1. Parse the Sentry issue
    2. Dedupe check (10-minute TTL)
    3. Create an Incident
    4. Trigger OpenClaw analysis
    5. Push to Telegram
    """
    issue = payload["data"]["issue"]
    issue_id = issue["id"]

    # Dedupe: SET NX EX is atomic, so two concurrent deliveries cannot both pass
    is_new = await redis.set(f"sentry_dedup:{issue_id}", "1", ex=600, nx=True)
    if not is_new:
        return {"status": "deduplicated"}

    # Create Incident
    incident = await incident_service.create_from_sentry(payload)

    # AI analysis
    analysis = await openclaw.analyze_error(
        error_title=issue["title"],
        stack_trace=issue["culprit"],  # culprit is the failing location, not a full stack trace
        sentry_url=issue["web_url"],
        trace_id=extract_trace_id(payload),
    )

    # Telegram notification
    await telegram.send_error_alert(
        incident_id=incident.id,
        analysis=analysis,
        sentry_url=issue["web_url"],
    )
    return {"status": "processed"}
```

### 4.2 Alertmanager → OpenClaw

```yaml
# alertmanager.yml
route:
  receiver: awoooi-api
  routes:
    - match:
        namespace: awoooi-prod
      receiver: awoooi-api
    - match:
        severity: critical
      receiver: awoooi-api

receivers:
  - name: awoooi-api
    webhook_configs:
      - url: http://192.168.0.125:32334/api/v1/webhooks/alertmanager
        send_resolved: true
        http_config:
          basic_auth:
            username: alertmanager
            password_file: /etc/alertmanager/secrets/webhook-password
```

### 4.3 SigNoz → OpenClaw

```python
# Query ClickHouse for anomalous spans
# clickhouse and openclaw are assumed to be application-provided async clients.
async def detect_signoz_anomalies():
    """
    Periodically query SigNoz ClickHouse to detect:
    - Abnormal rise in error rate
    - Abnormal P99 latency
    - Sudden drop in trace volume (the service may be down)

    The query below covers only the error-count check.
    """
    anomalies = await clickhouse.query("""
        SELECT
            serviceName,
            count(*) as error_count,
            avg(durationNano) / 1e6 as avg_latency_ms
        FROM signoz_traces.signoz_index_v2
        WHERE timestamp > now() - INTERVAL 5 MINUTE
          AND statusCode = 'STATUS_CODE_ERROR'
        GROUP BY serviceName
        HAVING error_count > 10
    """)

    for anomaly in anomalies:
        await openclaw.analyze_trace_anomaly(
            service=anomaly["serviceName"],
            error_count=anomaly["error_count"],
            avg_latency=anomaly["avg_latency_ms"],
        )
```

---

## 5. Implementation Priorities

### Phase 1 (this week - P0)

| Item | Status | Owner | Notes |
|------|--------|-------|-------|
| Alertmanager → AWOOOI Webhook | ⬜ TODO | Claude Code | Configure webhook + test alerts |
| Sentry Webhook → Telegram | ⬜ TODO | Claude Code | Direct error push + AI analysis |
| Automatic Secrets injection (CD) | ⬜ TODO | Claude Code | kubectl patch secret |
| Alert dedupe verification | ⬜ TODO | Claude Code | 10-min fingerprint test |

### Phase 2 (next week - P1)

| Item | Status | Owner | Notes |
|------|--------|-------|-------|
| SigNoz alert rules | ⬜ TODO | Claude Code | Error rate, latency P99 |
| Auto-repair action expansion | ⬜ TODO | Claude Code | SCALE_UP, ROLLBACK |
| Automatic Playbook extraction | ⬜ TODO | Claude Code | Successful repair → Playbook |
| Alert escalation mechanism | ⬜ TODO | Claude Code | SLA Engine |

### Phase 3 (in two weeks - P2)

| Item | Status | Owner | Notes |
|------|--------|-------|-------|
| Grafana dashboards | ⬜ TODO | Claude Code | Monitoring overview |
| SLO/SLI definition | ⬜ TODO | Claude Code | 99.9% availability target |
| Alert noise suppression | ⬜ TODO | Claude Code | ML anomaly detection |
| Capacity forecasting | ⬜ TODO | Claude Code | Resource trend forecasting |

---

## 6. Appendix

### A. Environment Variables

```bash
# === Alertmanager ===
ALERTMANAGER_WEBHOOK_URL=http://192.168.0.125:32334/api/v1/webhooks/alertmanager
ALERTMANAGER_WEBHOOK_SECRET=

# === Sentry ===
SENTRY_DSN=http://@192.168.0.110:9000/
SENTRY_WEBHOOK_SECRET=
SENTRY_DEDUP_TTL=600

# === SigNoz ===
SIGNOZ_CLICKHOUSE_URL=http://192.168.0.188:8123
SIGNOZ_ANOMALY_THRESHOLD_ERROR_COUNT=10

# === Auto-repair ===
AUTO_REPAIR_ENABLED=true
AUTO_REPAIR_MAX_PER_HOUR=5
AUTO_REPAIR_MIN_CONFIDENCE=0.85
AUTO_REPAIR_DRY_RUN=false
```

### B. Alert Template

```markdown
🚨 **CRITICAL | awoooi-api**
━━━━━━━━━━━━━━━━━━━
📋 INC-20260329-0001
🎯 Pod: awoooi-api-7d4b8c9f5-abc12
━━━━━━━━━━━━━━━━━━━
🤖 **AI Analysis**
👥 Owner: BE (backend)
📊 Confidence: 🟢 92%
💡 Cause: OOM Killed - Memory limit exceeded
━━━━━━━━━━━━━━━━━━━
🔧 Suggested: DELETE_POD + SCALE_UP
⏱️ Downtime: ~30s
💰 Tokens: 1,234 / $0.0012
━━━━━━━━━━━━━━━━━━━
🔗 [SigNoz Trace](http://192.168.0.188:3301/trace/abc123)
🔗 [Sentry Issue](http://192.168.0.110:9000/issues/456)

[✅ Approve] [❌ Reject] [⏰ Later] [🔕 Mute]
```

---

**End of document**

**Next step**: Execute the Phase 1 tasks
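
### C. Auto-Execution Gate (Sketch)

The branch in section 3.1 and the limits in section 3.3 together decide whether an OpenClaw suggestion runs automatically or goes to human review. The sketch below shows one way that gate could be expressed. It is a minimal illustration, not the AWOOOI implementation: the `Analysis` dataclass, the `review_reasons` and `can_auto_execute` functions, and the `RISK_ORDER` ranking are hypothetical names introduced here. It copies only a subset of the section 3.3 values and leaves out the hourly rate limits and the failure cooldown.

```python
from __future__ import annotations

from dataclasses import dataclass

# Subset of SAFETY_GUARDRAILS from section 3.3 (rate limits are omitted here).
GUARDRAILS = {
    "auto_approve_max_risk": "LOW",
    "auto_approve_min_confidence": 0.85,
    "max_affected_pods": 3,
    "blacklist_actions": {
        "DROP_DATABASE", "DELETE_NAMESPACE", "FORCE_DELETE_PVC", "DELETE_SECRET",
    },
    "allowed_namespaces": {"awoooi-prod", "monitoring"},
}

# Assumed ordering of the risk levels named in section 3.1.
RISK_ORDER = {"LOW": 0, "MEDIUM": 1, "CRITICAL": 2}


@dataclass(frozen=True)
class Analysis:
    """Hypothetical shape of an OpenClaw result; field names follow section 3.1."""
    suggested_action: str
    confidence: float
    risk_level: str
    affected_pods: int
    namespace: str


def review_reasons(a: Analysis, g: dict = GUARDRAILS) -> list[str]:
    """Return every reason the action needs human review (empty list = auto-execute)."""
    reasons = []
    if a.suggested_action in g["blacklist_actions"]:
        reasons.append("action is blacklisted")
    if a.namespace not in g["allowed_namespaces"]:
        reasons.append("namespace is not allowed")
    if a.confidence < g["auto_approve_min_confidence"]:
        reasons.append("confidence below threshold")
    # Unknown risk labels are treated as the highest risk.
    risk = RISK_ORDER.get(a.risk_level, max(RISK_ORDER.values()))
    if risk > RISK_ORDER[g["auto_approve_max_risk"]]:
        reasons.append("risk level too high")
    if a.affected_pods > g["max_affected_pods"]:
        reasons.append("blast radius too large")
    return reasons


def can_auto_execute(a: Analysis) -> bool:
    """True only when no guardrail objects to the suggested action."""
    return not review_reasons(a)


if __name__ == "__main__":
    print(can_auto_execute(Analysis("RESTART_POD", 0.92, "LOW", 1, "awoooi-prod")))
    print(review_reasons(Analysis("ROLLBACK", 0.95, "MEDIUM", 1, "awoooi-prod")))
```

Returning the list of reasons rather than a bare boolean lets the Telegram review card explain why a suggestion was not executed automatically. Treating unknown risk labels as the highest risk is a conservative choice made for this sketch.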