awoooi/docs/MONITORING_COMPLETE_STRATEGY.md
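OG T 40163a51b5 feat(monitoring): complete monitoring strategy and automated integration architecture
Added:
1. MONITORING_COMPLETE_STRATEGY.md - complete monitoring strategy
   - Monitoring matrix: 5 hosts × 60+ services
   - P0/P1/P2 alert rule list
   - AI auto-remediation closed loop
   - Safety guardrail configuration

2. MONITORING_INTEGRATION_ARCHITECTURE.md - automated integration architecture
   - Service registry (Single Source of Truth)
   - CI/CD automatically validates monitoring coverage
   - New services get monitoring automatically

3. ops/monitoring/service-registry.yaml - service inventory
   - K8s workloads (API/Web/Worker/ArgoCD)
   - Docker containers (Ollama/OpenClaw/Redis/Postgres)
   - Frontend page SLOs
   - API endpoint SLOs
   - Alert templates and auto-remediation actions

4. ops/monitoring/validate_coverage.py - coverage validation
   - Runs in the CI stage
   - Detects unmonitored services
   - Generates a coverage report

Design principles:
- Monitoring as Code
- A new service must be registered in the registry before it can be deployed
- Automatic discovery prevents omissions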
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 01:52:08 +08:00

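# AWOOOI Complete Monitoring and AI Auto-Remediation Strategy
> **Version**: v1.0
> **Created**: 2026-03-29
> **Owner**: Chief Architect (Claude Code)
> **Goal**: 100% monitoring coverage + AI-driven auto-remediation
---
## Executive Summary
This document defines the AWOOOI full-stack monitoring strategy, covering:
- **5 hosts** × **60+ services** × **4 monitoring layers**
- **Three-layer observability**: Sentry (errors) + SignOz (traces) + Prometheus (metrics)
- **AI auto-remediation closed loop**: anomaly → OpenClaw analysis → automatic execution or human review
---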
## 1. Monitoring Coverage Matrix
### 1.1 Host Layer (Infrastructure)
| Host | IP | Role | Monitored Items | Alert Rules |
|------|----|----|----------|----------|
| **mon (K3s Master)** | 192.168.0.120 | K3s Server + keepalived | CPU/MEM/Disk/etcd | NodeDown, etcdHighLatency |
| **mon1 (K3s Worker)** | 192.168.0.121 | K3s Worker + keepalived | CPU/MEM/Disk/kubelet | NodeNotReady, DiskPressure |
| **harbor (DevOps)** | 192.168.0.110 | Harbor/Sentry/Langfuse/Runner | CPU/MEM/Disk/Docker | HarborDown, RunnerOffline |
| **pg (AI/Web)** | 192.168.0.188 | PostgreSQL/Redis/Ollama/SignOz | CPU/MEM/Disk/GPU | DBConnectionFailed, OllamaTimeout |
| **kali (Security)** | 192.168.0.112 | Kali Scanner | CPU/MEM/Disk | ScannerOffline |
### 1.2 Service Layer (Services)
#### A. Kubernetes Workloads (K3s)
| Namespace | Deployment/StatefulSet | Replicas | Health Check | Alert Conditions |
|----------|------------------------|--------|----------|----------|
| **awoooi-prod** | awoooi-api | 2 | HTTP /api/v1/health | PodCrashLoopBackOff, ReplicasUnavailable |
| **awoooi-prod** | awoooi-web | 2 | HTTP / | HighErrorRate, SlowResponse |
| **awoooi-prod** | awoooi-worker | 1 | Exec mtime | WorkerStuck, QueueBacklog |
| **argocd** | argocd-server | 1 | HTTP /healthz | ArgoCDDown |
| **monitoring** | prometheus-server | 1 | HTTP /-/ready | PrometheusDown |
| **monitoring** | alertmanager | 1 | HTTP /-/ready | AlertmanagerDown |
| **velero** | velero | 1 | - | BackupFailed |
#### B. Container Services (Docker on 188/110)
| Host | Container | Port | Health Check | Alert Conditions |
|------|------|------|----------|----------|
| 188 | ollama | 11434 | GET /api/tags | OllamaUnresponsive, ModelLoadFailed |
| 188 | openclaw | 8089 | GET /health | OpenClawDown, AnalysisTimeout |
| 188 | signoz-collector | 24317/24318 | gRPC health | TraceDropped |
| 188 | signoz-ui | 3301 | HTTP / | SignOzUIDown |
| 188 | redis-stack | 6380 | redis-cli ping | RedisDown, MemoryExhausted |
| 188 | postgres | 5432 | pg_isready | PostgresDown, ConnectionPoolExhausted |
| 110 | harbor-core | 5000 | GET /api/v2.0/health | HarborDown |
| 110 | sentry-web | 9000 | GET /_health/ | SentryDown |
| 110 | langfuse | 3100 | GET /api/public/health | LangfuseDown |
| 110 | actions-runner | - | systemctl status | RunnerOffline |
### 1.3 Application Layer (Application)
#### A. API Endpoint Monitoring
| Endpoint | Method | Expected Response | SLO | Alert |
|------|------|----------|-----|------|
| /api/v1/health | GET | 200 | 99.9% | APIHealthCheckFailed |
| /api/v1/approvals/pending | GET | 200 | 99% | ApprovalsAPIError |
| /api/v1/incidents | GET | 200 | 99% | IncidentsAPIError |
| /api/v1/analyze | POST | 200/202 | 95% | AnalysisTimeout (>30s) |
| /api/v1/execute | POST | 200 | 99% | ExecutionFailed |
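
To make the SLO column concrete, each target can be turned into an error budget. The following is a minimal sketch that treats the SLO as time-based availability over a 30-day window; both the window length and the time-based reading are assumptions, since the table does not say how the SLO is measured.

```python
# Convert an availability SLO into an allowed-downtime budget.
# Assumption: a 30-day rolling window; the document does not specify one.

WINDOW_MINUTES = 30 * 24 * 60  # minutes in 30 days


def error_budget_minutes(slo_percent: float, window_minutes: int = WINDOW_MINUTES) -> float:
    """Minutes of unavailability permitted by the SLO within the window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    return window_minutes * (1 - slo_percent / 100)


# SLO targets taken from the endpoint table above.
for endpoint, slo in [
    ("/api/v1/health", 99.9),
    ("/api/v1/incidents", 99.0),
    ("/api/v1/analyze", 95.0),
]:
    print(f"{endpoint}: {error_budget_minutes(slo):.1f} min per 30 days")
```
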
#### B. Error Rate Monitoring (Sentry)
| Type | Threshold | Alert | Auto-Remediation |
|------|------|------|----------|
| Unhandled Exception | >0 in 5min | SentryNewError | AI analysis + Playbook matching |
| HTTP 5xx | >1% | HighErrorRate | Pod restart |
| HTTP 4xx | >10% | ClientErrorSpike | Alert + log analysis |
| Slow Transaction | P95 >2s | SlowTransaction | Resource scaling recommendation |
#### C. Frontend Monitoring
| Metric | Source | Threshold | Alert |
|------|------|------|------|
| Page Load Time | Sentry Performance | >3s | SlowPageLoad |
| JS Error Rate | Sentry Issues | >0.1% | FrontendError |
| API Call Failures | Sentry Breadcrumbs | >1% | APICallFailed |
| Web Vitals (LCP/FID/CLS) | Sentry | Google's recommended thresholds | PoorWebVitals |
### 1.4 Data Layer (Data)
| Database | Monitored Items | Alert Conditions |
|--------|----------|----------|
| **PostgreSQL** | Connection count, QPS, slow queries, WAL lag, disk I/O | ConnectionPoolExhausted (>90%), SlowQuery (>5s), ReplicationLag (>30s) |
| **Redis** | Memory usage, hit rate, latency, key count | MemoryHigh (>80%), HitRateLow (<90%), SlowCommands |
| **ClickHouse** | Disk usage, query latency, insert rate | DiskFull (>85%), QueryTimeout |
### 1.5 AI/LLM Layer
| Service | Monitored Items | Alert Conditions | Auto-Remediation |
|------|----------|----------|----------|
| **Ollama** | Inference latency, model load status, GPU usage | InferenceTimeout (>60s), ModelLoadFailed | Container restart |
| **OpenClaw** | Analysis success rate, response time, token usage | AnalysisFailed (>10%), HighTokenCost | Fallback to Gemini |
| **Gemini API** | Rate limit, error rate, cost | RateLimitHit, BudgetExceeded | Degrade to Ollama |
| **Claude API** | Rate limit, error rate, cost | RateLimitHit, BudgetExceeded | Degrade to Gemini |
| **Langfuse** | Trace recording success rate | TraceLost (>1%) | Reconnect |
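
The degrade column implies an ordered chain: Claude, then Gemini, then Ollama. A minimal sketch of that chain follows. `ProviderError`, `analyze_with_fallback`, and the three stub providers are hypothetical names used for illustration; real clients would wrap the vendor SDKs.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Hypothetical error type standing in for rate-limit or budget failures."""


def analyze_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each provider in order; return (provider_name, result) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stub providers for illustration only.
def claude(prompt: str) -> str:
    raise ProviderError("RateLimitHit")


def gemini(prompt: str) -> str:
    raise ProviderError("BudgetExceeded")


def ollama(prompt: str) -> str:
    return f"local analysis of: {prompt}"


chain = [("claude", claude), ("gemini", gemini), ("ollama", ollama)]
provider, result = analyze_with_fallback("pod OOMKilled", chain)
print(provider, "->", result)
```
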
### 1.6 CI/CD Layer
| Component | Monitored Items | Alert Conditions |
|------|----------|----------|
| **GitHub Actions** | Workflow status, runner health, job latency | WorkflowFailed, RunnerOffline, JobStuck (>30min) |
| **Harbor** | Image push/pull success rate, storage space | PushFailed, PullFailed, StorageFull |
| **ArgoCD** | Sync status, application health | SyncFailed, AppDegraded |
---
## 2. Complete Alert Rule List
### 2.1 P0 - Critical (5-minute response)
```yaml
# === Infrastructure layer ===
- alert: NodeDown
  expr: up{job="node-exporter"} == 0
  for: 1m
  severity: critical
  auto_repair: false  # requires manual intervention
- alert: K3sAPIServerDown
  expr: up{job="kubernetes-apiservers"} == 0
  for: 1m
  severity: critical
  auto_repair: false
- alert: PostgreSQLDown
  expr: pg_up == 0
  for: 30s
  severity: critical
  auto_repair: restart_container
- alert: RedisDown
  expr: redis_up == 0
  for: 30s
  severity: critical
  auto_repair: restart_container
# === Application layer ===
- alert: AWOOOIAPIDown
  expr: probe_success{job="awoooi-api"} == 0
  for: 1m
  severity: critical
  auto_repair: restart_pod
- alert: OpenClawDown
  expr: probe_success{job="openclaw"} == 0
  for: 2m
  severity: critical
  auto_repair: restart_container
- alert: PodCrashLoopBackOff
  expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0
  for: 2m
  severity: critical
  auto_repair: collect_logs_and_rollback
# === CI/CD layer ===
- alert: GitHubRunnerOffline
  expr: github_runner_status == 0
  for: 5m
  severity: critical
  auto_repair: restart_runner_service
```
### 2.2 P1 - High (15-minute response)
```yaml
# === Performance alerts ===
- alert: HighCPUUsage
  expr: node_cpu_usage_percent > 90
  for: 5m
  severity: high
  auto_repair: scale_up_if_possible
- alert: HighMemoryUsage
  expr: node_memory_usage_percent > 90
  for: 5m
  severity: high
  auto_repair: investigate_memory_leak
- alert: APIHighLatency
  # histogram_quantile needs per-bucket rates aggregated by le
  expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
  for: 5m
  severity: high
  auto_repair: analyze_slow_endpoints
- alert: HighErrorRate
  # aggregate both sides so the ratio is 5xx requests over all requests
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m
  severity: high
  auto_repair: restart_pod
- alert: OllamaSlowInference
  expr: ollama_inference_duration_seconds > 60
  for: 3m
  severity: high
  auto_repair: switch_to_smaller_model
# === Resource alerts ===
- alert: DiskSpaceLow
  expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.15
  for: 10m
  severity: high
  auto_repair: cleanup_old_logs
- alert: PostgreSQLConnectionPoolHigh
  # sum connections per instance so the label sets match on both sides
  expr: sum by (instance) (pg_stat_activity_count) / on (instance) pg_settings_max_connections > 0.8
  for: 5m
  severity: high
  auto_repair: analyze_connection_leaks
```
### 2.3 P2 - Medium (1-hour response)
```yaml
- alert: CertificateExpiringSoon
  expr: ssl_cert_not_after - time() < 14 * 24 * 3600
  severity: medium
  auto_repair: renew_certificate
- alert: BackupNotSuccessful
  # no successful backup in the last 24 hours
  expr: increase(velero_backup_success_total[24h]) < 1
  severity: medium
  auto_repair: trigger_backup
- alert: LangfuseTraceLoss
  expr: langfuse_trace_drop_rate > 0.01
  severity: medium
  auto_repair: reconnect_langfuse
```
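The response windows in the three headings (5 minutes, 15 minutes, 1 hour) can drive a simple escalation check. The following is a minimal sketch; `response_deadline` and `is_overdue` are hypothetical helper names, and mapping the `severity` values to P0/P1/P2 is an assumption based on the rule blocks above.

```python
from datetime import datetime, timedelta, timezone

# Response windows taken from the P0/P1/P2 section headings.
RESPONSE_WINDOWS = {
    "critical": timedelta(minutes=5),   # P0
    "high": timedelta(minutes=15),      # P1
    "medium": timedelta(hours=1),       # P2
}


def response_deadline(severity: str, fired_at: datetime) -> datetime:
    """Latest time by which someone (or the AI loop) must have responded."""
    return fired_at + RESPONSE_WINDOWS[severity]


def is_overdue(severity: str, fired_at: datetime, now: datetime, acknowledged: bool) -> bool:
    """True when an unacknowledged alert has passed its response window."""
    return not acknowledged and now > response_deadline(severity, fired_at)


fired = datetime(2026, 3, 29, 1, 0, tzinfo=timezone.utc)
print(response_deadline("critical", fired).isoformat())
```
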
---
## 3. AI Auto-Remediation Closed Loop
### 3.1 Remediation Flowchart
```
┌─────────────────────────────────────────────────────────────────────┐
│                           Anomaly occurs                            │
│         (Prometheus Alert / Sentry Issue / SignOz Anomaly)          │
└────────────────────────────┬────────────────────────────────────────┘
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                        Alertmanager routing                         │
│  ┌───────────────┬───────────────┬───────────────┐                  │
│  │ route: awoooi │ route: infra  │ route: aiops  │                  │
│  └───────┬───────┴───────┬───────┴───────┬───────┘                  │
└──────────┼───────────────┼───────────────┼──────────────────────────┘
           │               │               │
           └───────────────┼───────────────┘
                           ▼
┌─────────────────────────────────────────────────────────────────────┐
│              AWOOOI API: /api/v1/webhooks/alertmanager              │
│   1. Receive → 2. Dedup (10min fingerprint) → 3. Create Incident    │
└────────────────────────────┬────────────────────────────────────────┘
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     OpenClaw AI analysis engine                     │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │ Input:                                                        │  │
│  │  - Alert content (labels, annotations)                        │  │
│  │  - K8s context (Pod logs, events, metrics)                    │  │
│  │  - Historical Playbooks (similar cases)                       │  │
│  │  - SignOz Traces (related spans)                              │  │
│  │  - Sentry Issues (related errors)                             │  │
│  ├───────────────────────────────────────────────────────────────┤  │
│  │ Output:                                                       │  │
│  │  - suggested_action: RESTART_POD | DELETE_POD | SCALE_UP | ...│  │
│  │  - confidence: 0.0-1.0                                        │  │
│  │  - risk_level: LOW | MEDIUM | CRITICAL                        │  │
│  │  - blast_radius: {affected_pods, estimated_downtime}          │  │
│  │  - kubectl_command: the concrete command                      │  │
│  │  - reasoning: decision rationale (Traditional Chinese)        │  │
│  └───────────────────────────────────────────────────────────────┘  │
└────────────────────────────┬────────────────────────────────────────┘
              ┌──────────────┴──────────────┐
              │                             │
              ▼                             ▼
  ┌─────────────────────────┐   ┌─────────────────────────┐
  │ confidence >= 0.85      │   │ confidence < 0.85       │
  │ risk_level = LOW        │   │ OR risk_level != LOW    │
  │            ↓            │   │            ↓            │
  │      Auto-execute       │   │      Human review       │
  └───────────┬─────────────┘   └───────────┬─────────────┘
              │                             │
              │                             ▼
              │                 ┌─────────────────────────┐
              │                 │ Telegram approval card  │
              │                 │ [✅ Approve] [❌ Reject]│
              │                 │ [⏰ Later] [🔕 Silence] │
              │                 └───────────┬─────────────┘
              │                             │
              │                 ┌───────────┴───────────┐
              │                 │                       │
              │                 ▼                       ▼
              │           ┌────────────┐          ┌────────────┐
              │           │  Approved  │          │  Rejected  │
              │           └─────┬──────┘          └─────┬──────┘
              │                 │                       │
              └─────────────────┤                       │
                                │                       │
                                ▼                       ▼
                   ┌─────────────────────────┐ ┌──────────────────┐
                   │ K8s Executor executes   │ │ Record reason    │
                   │ kubectl $command        │ │ Update Playbook  │
                   └────────────┬────────────┘ └──────────────────┘
                                ▼
                   ┌─────────────────────────┐
                   │ Verify execution result │
                   │ - Health check passes?  │
                   │ - Error rate dropped?   │
                   │ - Latency recovered?    │
                   └────────────┬────────────┘
                    ┌───────────┴───────────┐
                    │                       │
                    ▼                       ▼
          ┌───────────────────┐   ┌───────────────────┐
          │ Repair succeeded  │   │ Repair failed     │
          │ - Close Incident  │   │ - Escalate alert  │
          │ - Update Playbook │   │ - Record failure  │
          │ - Notify Telegram │   │ - Human steps in  │
          └───────────────────┘   └───────────────────┘
```
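The branch in the middle of the flowchart reduces to one predicate. Below is a minimal sketch; `Analysis` and `route` are hypothetical names modeled on the OpenClaw output fields. It sends every analysis that is not LOW risk to review, which matches the `auto_approve_max_risk` guardrail in section 3.3.

```python
from dataclasses import dataclass

AUTO_MIN_CONFIDENCE = 0.85  # threshold shown in the flowchart


@dataclass
class Analysis:
    """Subset of the OpenClaw output fields needed for routing."""
    suggested_action: str
    confidence: float   # 0.0-1.0
    risk_level: str     # "LOW" | "MEDIUM" | "CRITICAL"


def route(analysis: Analysis) -> str:
    """Return 'auto_execute' only for confident, low-risk suggestions."""
    if analysis.confidence >= AUTO_MIN_CONFIDENCE and analysis.risk_level == "LOW":
        return "auto_execute"
    return "human_review"


print(route(Analysis("RESTART_POD", 0.92, "LOW")))
print(route(Analysis("ROLLBACK", 0.95, "MEDIUM")))
```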
### 3.2 Auto-Remediation Action List
| Action | Trigger Conditions | Command | Risk Level | Auto-Execute? |
|------|----------|----------|----------|-----------|
| `RESTART_POD` | PodCrashLoop, HighErrorRate | `kubectl rollout restart deployment/{name}` | LOW | ✅ Automatic |
| `DELETE_POD` | PodStuck, OOMKilled | `kubectl delete pod {name} --grace-period=30` | LOW | ✅ Automatic |
| `SCALE_UP` | HighCPU, HighMemory, SlowResponse | `kubectl scale deployment/{name} --replicas={current+1}` | LOW | ✅ Automatic |
| `SCALE_DOWN` | ResourceWaste | `kubectl scale deployment/{name} --replicas={current-1}` | MEDIUM | ❌ Needs review |
| `ROLLBACK` | DeploymentFailed, VersionDrift | `kubectl rollout undo deployment/{name}` | MEDIUM | ❌ Needs review |
| `RESTART_CONTAINER` | ContainerUnhealthy | `docker restart {container}` | LOW | ✅ Automatic |
| `CLEAR_CACHE` | RedisMemoryHigh, StaleCache | `redis-cli FLUSHDB` | MEDIUM | ❌ Needs review |
| `VACUUM_DB` | TableBloat, SlowQuery | `VACUUM ANALYZE {table}` | MEDIUM | ❌ Needs review |
| `RENEW_CERT` | CertExpiring | `certbot renew` | LOW | ✅ Automatic |
| `CLEANUP_LOGS` | DiskSpaceLow | `find /var/log -mtime +7 -delete` | LOW | ✅ Automatic |
| `SWITCH_MODEL` | OllamaTimeout | Switch to a smaller model | LOW | ✅ Automatic |
| `FALLBACK_AI` | GeminiRateLimit | Gemini → Ollama | LOW | ✅ Automatic |
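
`kubectl scale` sets an absolute replica count, so the `SCALE_UP` and `SCALE_DOWN` rows need the current count before a command can be built. A minimal sketch of that step follows; `scale_command` is a hypothetical helper, and the caller is assumed to have read the current replica count already.

```python
import shlex


def scale_command(deployment: str, namespace: str, current_replicas: int,
                  delta: int, min_replicas: int = 1) -> list[str]:
    """Build a kubectl scale command targeting current_replicas + delta."""
    target = current_replicas + delta
    if target < min_replicas:
        raise ValueError(
            f"refusing to scale {deployment} below {min_replicas} replica(s)"
        )
    return ["kubectl", "scale", f"deployment/{deployment}",
            "-n", namespace, f"--replicas={target}"]


cmd = scale_command("awoooi-api", "awoooi-prod", current_replicas=2, delta=+1)
print(shlex.join(cmd))
```

Returning an argument list rather than a shell string keeps the command safe to pass to `subprocess.run` without shell quoting issues.
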
### 3.3 Safety Guardrails
```python
# === Auto-remediation safety limits ===
SAFETY_GUARDRAILS = {
    # Rate limits
    "max_repairs_per_hour": 5,            # at most 5 auto-repairs per hour
    "max_repairs_per_resource": 3,        # at most 3 per hour for the same resource
    "cooldown_after_failure": 600,        # 10-minute cooldown after a failure (seconds)
    # Risk limits
    "auto_approve_max_risk": "LOW",       # auto-approve LOW risk only
    "auto_approve_min_confidence": 0.85,  # minimum confidence of 85%
    # Blast-radius limits
    "max_affected_pods": 3,               # affect at most 3 Pods
    "min_healthy_replicas": 1,            # keep at least 1 healthy replica
    # Actions that must never run automatically
    "blacklist_actions": [
        "DROP_DATABASE",
        "DELETE_NAMESPACE",
        "FORCE_DELETE_PVC",
        "DELETE_SECRET",
    ],
    # Namespace allowlist
    "allowed_namespaces": [
        "awoooi-prod",
        "monitoring",
    ],
}
```
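The guardrails only matter once something enforces them. Below is a minimal sketch of such a gate, using an in-memory sliding one-hour window. `RepairGate`, the `RISK_ORDER` ranking, and the injected `clock` are assumptions for illustration; a multi-replica API would need shared state such as Redis instead.

```python
import time
from collections import defaultdict, deque

# Subset of SAFETY_GUARDRAILS from the block above.
GUARDRAILS = {
    "max_repairs_per_hour": 5,
    "max_repairs_per_resource": 3,
    "auto_approve_max_risk": "LOW",
    "auto_approve_min_confidence": 0.85,
    "max_affected_pods": 3,
    "blacklist_actions": ["DROP_DATABASE", "DELETE_NAMESPACE",
                          "FORCE_DELETE_PVC", "DELETE_SECRET"],
    "allowed_namespaces": ["awoooi-prod", "monitoring"],
}
RISK_ORDER = {"LOW": 0, "MEDIUM": 1, "CRITICAL": 2}  # assumed ordering
HOUR = 3600


class RepairGate:
    """Hypothetical helper: decides whether a repair may run without review."""

    def __init__(self, guardrails=GUARDRAILS, clock=time.monotonic):
        self.g = guardrails
        self.clock = clock
        self.all_repairs = deque()             # timestamps of all auto-repairs
        self.by_resource = defaultdict(deque)  # timestamps per resource

    def _recent(self, history):
        """Drop entries older than one hour and return how many remain."""
        now = self.clock()
        while history and now - history[0] >= HOUR:
            history.popleft()
        return len(history)

    def check(self, action, namespace, resource, risk, confidence, affected_pods):
        """Return (allowed, reason) for a proposed automatic repair."""
        g = self.g
        if action in g["blacklist_actions"]:
            return False, "action is blacklisted"
        if namespace not in g["allowed_namespaces"]:
            return False, "namespace not allowed"
        if RISK_ORDER[risk] > RISK_ORDER[g["auto_approve_max_risk"]]:
            return False, "risk too high"
        if confidence < g["auto_approve_min_confidence"]:
            return False, "confidence too low"
        if affected_pods > g["max_affected_pods"]:
            return False, "blast radius too large"
        if self._recent(self.all_repairs) >= g["max_repairs_per_hour"]:
            return False, "hourly limit reached"
        if self._recent(self.by_resource[resource]) >= g["max_repairs_per_resource"]:
            return False, "per-resource limit reached"
        return True, "ok"

    def record(self, resource):
        """Register an executed repair so later checks count it."""
        now = self.clock()
        self.all_repairs.append(now)
        self.by_resource[resource].append(now)


gate = RepairGate()
print(gate.check("RESTART_POD", "awoooi-prod", "deployment/awoooi-api", "LOW", 0.92, 2))
```
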
---
## 4. Monitoring Data Flow Integration
### 4.1 Sentry → OpenClaw
```python
# /api/v1/webhooks/sentry - Sentry Issue Alert Webhook
# Note: redis, incident_service, openclaw, telegram and extract_trace_id are
# application-level objects assumed to be defined elsewhere in the service.
SENTRY_DEDUP_TTL = 600  # seconds (10 minutes)


async def handle_sentry_webhook(payload: dict) -> dict:
    """
    1. Parse the Sentry Issue
    2. Deduplicate (10-minute TTL)
    3. Create an Incident
    4. Trigger OpenClaw analysis
    5. Push to Telegram
    """
    issue = payload["data"]["issue"]
    issue_id = issue["id"]

    # Deduplicate atomically: SET with NX succeeds only for the first delivery,
    # which avoids the race between a separate GET and SETEX.
    is_first = await redis.set(
        f"sentry_dedup:{issue_id}", "1", ex=SENTRY_DEDUP_TTL, nx=True
    )
    if not is_first:
        return {"status": "deduplicated"}

    # Create the Incident
    incident = await incident_service.create_from_sentry(payload)

    # AI analysis
    analysis = await openclaw.analyze_error(
        error_title=issue["title"],
        stack_trace=issue["culprit"],
        sentry_url=issue["web_url"],
        trace_id=extract_trace_id(payload),
    )

    # Telegram notification
    await telegram.send_error_alert(
        incident_id=incident.id,
        analysis=analysis,
        sentry_url=issue["web_url"],
    )
    return {"status": "processed", "incident_id": incident.id}
```
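The appendix lists `SENTRY_WEBHOOK_SECRET`, but the handler above never checks it. Below is a minimal sketch of HMAC-SHA256 verification over the raw request body. The signing scheme and the header that carries the signature depend on how the Sentry integration is configured, so both are assumptions here; `sign` and `verify_signature` are hypothetical helper names.

```python
import hashlib
import hmac


def sign(secret: str, body: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw request body."""
    return hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()


def verify_signature(secret: str, body: bytes, received_signature: str) -> bool:
    """Constant-time comparison of the expected and received signatures."""
    expected = sign(secret, body).encode("utf-8")
    return hmac.compare_digest(expected, received_signature.encode("utf-8"))


# Illustration with a made-up secret and body.
raw_body = b'{"action": "created"}'
signature = sign("example-secret", raw_body)
print(verify_signature("example-secret", raw_body, signature))
```

The check has to run on the exact bytes received, before any JSON parsing, because re-serialising the payload can change whitespace and break the digest.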
### 4.2 Alertmanager → OpenClaw
```yaml
# alertmanager.yml
route:
  receiver: awoooi-api
  routes:
    - match:
        namespace: awoooi-prod
      receiver: awoooi-api
    - match:
        severity: critical
      receiver: awoooi-api
receivers:
  - name: awoooi-api
    webhook_configs:
      - url: http://192.168.0.125:32334/api/v1/webhooks/alertmanager
        send_resolved: true
        http_config:
          basic_auth:
            username: alertmanager
            password_file: /etc/alertmanager/secrets/webhook-password
```
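On the receiving side, the flowchart calls for a 10-minute fingerprint deduplication before an Incident is created. Below is a minimal sketch using the `fingerprint` field that Alertmanager includes with each alert in its webhook payload. `TTLDeduplicator` and `new_firing_alerts` are hypothetical names, and the in-memory store is an assumption; the Sentry handler above suggests Redis would be used in practice.

```python
import time


class TTLDeduplicator:
    """Remembers keys for ttl_seconds; in-memory stand-in for a Redis SET NX EX."""

    def __init__(self, ttl_seconds: float = 600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._expires = {}  # key -> expiry timestamp

    def is_new(self, key: str) -> bool:
        now = self.clock()
        # Drop expired entries so the table does not grow without bound.
        self._expires = {k: t for k, t in self._expires.items() if t > now}
        if key in self._expires:
            return False
        self._expires[key] = now + self.ttl
        return True


def new_firing_alerts(payload: dict, dedup: TTLDeduplicator) -> list[dict]:
    """Return firing alerts whose fingerprint was not seen within the TTL."""
    fresh = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved notifications do not open new incidents
        if dedup.is_new(alert["fingerprint"]):
            fresh.append(alert)
    return fresh


sample = {
    "status": "firing",
    "alerts": [
        {"status": "firing", "fingerprint": "abc123",
         "labels": {"alertname": "AWOOOIAPIDown", "severity": "critical"}},
    ],
}
dedup = TTLDeduplicator()
print(len(new_firing_alerts(sample, dedup)), len(new_firing_alerts(sample, dedup)))
```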
### 4.3 SignOz → OpenClaw
```python
# Query ClickHouse for anomalous spans.
# Note: clickhouse and openclaw are application-level clients assumed to be
# defined elsewhere in the service.
async def detect_signoz_anomalies():
    """
    Periodically query the SignOz ClickHouse store to detect:
    - An abnormal rise in error rate
    - Abnormal P99 latency
    - A sudden drop in trace volume (the service may be down)
    """
    anomalies = await clickhouse.query("""
        SELECT
            serviceName,
            count(*) AS error_count,
            avg(durationNano) / 1e6 AS avg_latency_ms
        FROM signoz_traces.signoz_index_v2
        WHERE timestamp > now() - INTERVAL 5 MINUTE
          AND statusCode = 'STATUS_CODE_ERROR'
        GROUP BY serviceName
        HAVING error_count > 10
    """)
    for anomaly in anomalies:
        await openclaw.analyze_trace_anomaly(
            service=anomaly["serviceName"],
            error_count=anomaly["error_count"],
            avg_latency=anomaly["avg_latency_ms"],
        )
```
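The docstring above also lists a sudden drop in trace volume, which the error-count query does not cover. One possible approach, sketched below, compares per-service span counts in the current window against a baseline window. `detect_volume_drops`, the 50% drop ratio, and the minimum baseline of 20 spans are all assumptions for illustration; the counts would come from two ClickHouse queries not shown here.

```python
def detect_volume_drops(
    baseline: dict[str, int],
    current: dict[str, int],
    drop_ratio: float = 0.5,
    min_baseline: int = 20,
) -> list[str]:
    """Return services whose span count fell below drop_ratio of the baseline.

    Services with fewer than min_baseline spans in the baseline window are
    skipped, because small counts fluctuate too much to be meaningful.
    A service missing from `current` counts as zero spans.
    """
    dropped = []
    for service, base_count in baseline.items():
        if base_count < min_baseline:
            continue
        if current.get(service, 0) < base_count * drop_ratio:
            dropped.append(service)
    return sorted(dropped)


# Made-up counts for illustration.
previous_window = {"awoooi-api": 100, "awoooi-web": 50, "awoooi-worker": 5}
latest_window = {"awoooi-api": 30, "awoooi-web": 45}
print(detect_volume_drops(previous_window, latest_window))
```
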
---
## 5. Implementation Priorities
### Phase 1 (this week - P0)
| Item | Status | Owner | Notes |
|------|------|------|------|
| Alertmanager → AWOOOI Webhook | ⬜ TODO | Claude Code | Configure webhook + test alerts |
| Sentry Webhook → Telegram | ⬜ TODO | Claude Code | Direct error push + AI analysis |
| Automatic Secrets injection (CD) | ⬜ TODO | Claude Code | kubectl patch secret |
| Alert deduplication verification | ⬜ TODO | Claude Code | 10min fingerprint test |
### Phase 2 (next week - P1)
| Item | Status | Owner | Notes |
|------|------|------|------|
| SignOz alert rules | ⬜ TODO | Claude Code | Error Rate, Latency P99 |
| Expand auto-remediation actions | ⬜ TODO | Claude Code | SCALE_UP, ROLLBACK |
| Automatic Playbook extraction | ⬜ TODO | Claude Code | Successful repair → Playbook |
| Alert escalation mechanism | ⬜ TODO | Claude Code | SLA Engine |
### Phase 3 (in two weeks - P2)
| Item | Status | Owner | Notes |
|------|------|------|------|
| Grafana dashboards | ⬜ TODO | Claude Code | Monitoring overview |
| SLO/SLI definitions | ⬜ TODO | Claude Code | 99.9% availability target |
| Alert noise suppression | ⬜ TODO | Claude Code | ML anomaly detection |
| Capacity forecasting | ⬜ TODO | Claude Code | Resource trend forecasting |
---
## 6. Appendix
### A. Environment Variables
```bash
# === Alertmanager ===
ALERTMANAGER_WEBHOOK_URL=http://192.168.0.125:32334/api/v1/webhooks/alertmanager
ALERTMANAGER_WEBHOOK_SECRET=<secret>
# === Sentry ===
SENTRY_DSN=http://<key>@192.168.0.110:9000/<project>
SENTRY_WEBHOOK_SECRET=<secret>
SENTRY_DEDUP_TTL=600
# === SignOz ===
SIGNOZ_CLICKHOUSE_URL=http://192.168.0.188:8123
SIGNOZ_ANOMALY_THRESHOLD_ERROR_COUNT=10
# === Auto-remediation ===
AUTO_REPAIR_ENABLED=true
AUTO_REPAIR_MAX_PER_HOUR=5
AUTO_REPAIR_MIN_CONFIDENCE=0.85
AUTO_REPAIR_DRY_RUN=false
```
### B. Alert Template
```markdown
🚨 **CRITICAL | awoooi-api**
━━━━━━━━━━━━━━━━━━━
📋 INC-20260329-0001
🎯 Pod: awoooi-api-7d4b8c9f5-abc12
━━━━━━━━━━━━━━━━━━━
🤖 **AI Analysis**
👥 Owner: BE (backend)
📊 Confidence: 🟢 92%
💡 Cause: OOM Killed - Memory limit exceeded
━━━━━━━━━━━━━━━━━━━
🔧 Recommendation: DELETE_POD + SCALE_UP
⏱️ Downtime: ~30s
💰 Tokens: 1,234 / $0.0012
━━━━━━━━━━━━━━━━━━━
🔗 [SignOz Trace](http://192.168.0.188:3301/trace/abc123)
🔗 [Sentry Issue](http://192.168.0.110:9000/issues/456)
[✅ Approve] [❌ Reject] [⏰ Later] [🔕 Silence]
```
---
**End of document**
**Next step**: carry out the Phase 1 tasks