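feat(monitoring): complete monitoring strategy and automated integration architecture

Added:

1. MONITORING_COMPLETE_STRATEGY.md - complete monitoring strategy
   - Monitoring matrix covering 5 hosts × 60+ services
   - P0/P1/P2 alert rule lists
   - AI auto-repair closed-loop workflow
   - Safety guardrail configuration

2. MONITORING_INTEGRATION_ARCHITECTURE.md - automated integration architecture
   - Service registry (Single Source of Truth)
   - CI/CD automatically validates monitoring coverage
   - New services get monitoring automatically

3. ops/monitoring/service-registry.yaml - service inventory
   - K8s workloads (API/Web/Worker/ArgoCD)
   - Docker containers (Ollama/OpenClaw/Redis/Postgres)
   - Frontend page SLOs
   - API endpoint SLOs
   - Alert templates and auto-repair actions

4. ops/monitoring/validate_coverage.py - coverage validation
   - Runs in the CI stage
   - Detects unmonitored services
   - Generates a coverage report

Design principles:
- Monitoring as Code
- A new service must be registered in the registry before it can be deployed
- Auto-discovery prevents omissions

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>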
docs/MONITORING_COMPLETE_STRATEGY.md
@@ -0,0 +1,575 @@
# AWOOOI Complete Monitoring and AI Auto-Repair Strategy

> **Version**: v1.0
> **Created**: 2026-03-29
> **Owner**: Chief Architect (Claude Code)
> **Goal**: 100% monitoring coverage + AI-driven auto-repair

---

## Executive Summary

This document defines the AWOOOI full-stack monitoring strategy, covering:

- **5 hosts** × **60+ services** × **4 monitoring layers**
- **Three observability pillars**: Sentry (errors) + SignOz (traces) + Prometheus (metrics)
- **AI auto-repair closed loop**: anomaly → OpenClaw analysis → automatic execution or human review

---

## 1. Monitoring Coverage Matrix

### 1.1 Host Layer (Infrastructure)

| Host | IP | Role | Monitored items | Alert rules |
|------|----|------|-----------------|-------------|
| **mon (K3s Master)** | 192.168.0.120 | K3s Server + keepalived | CPU/MEM/Disk/etcd | NodeDown, etcdHighLatency |
| **mon1 (K3s Worker)** | 192.168.0.121 | K3s Worker + keepalived | CPU/MEM/Disk/kubelet | NodeNotReady, DiskPressure |
| **harbor (DevOps)** | 192.168.0.110 | Harbor/Sentry/Langfuse/Runner | CPU/MEM/Disk/Docker | HarborDown, RunnerOffline |
| **pg (AI/Web)** | 192.168.0.188 | PostgreSQL/Redis/Ollama/SignOz | CPU/MEM/Disk/GPU | DBConnectionFailed, OllamaTimeout |
| **kali (Security)** | 192.168.0.112 | Kali Scanner | CPU/MEM/Disk | ScannerOffline |

### 1.2 Service Layer (Services)

#### A. Kubernetes Workloads (K3s)

| Namespace | Deployment/StatefulSet | Replicas | Health check | Alert conditions |
|-----------|------------------------|----------|--------------|------------------|
| **awoooi-prod** | awoooi-api | 2 | HTTP /api/v1/health | PodCrashLoopBackOff, ReplicasUnavailable |
| **awoooi-prod** | awoooi-web | 2 | HTTP / | HighErrorRate, SlowResponse |
| **awoooi-prod** | awoooi-worker | 1 | Exec mtime | WorkerStuck, QueueBacklog |
| **argocd** | argocd-server | 1 | HTTP /healthz | ArgoCDDown |
| **monitoring** | prometheus-server | 1 | HTTP /-/ready | PrometheusDown |
| **monitoring** | alertmanager | 1 | HTTP /-/ready | AlertmanagerDown |
| **velero** | velero | 1 | - | BackupFailed |

#### B. Container Services (Docker on 188/110)

| Host | Container | Port | Health check | Alert conditions |
|------|-----------|------|--------------|------------------|
| 188 | ollama | 11434 | GET /api/tags | OllamaUnresponsive, ModelLoadFailed |
| 188 | openclaw | 8089 | GET /health | OpenClawDown, AnalysisTimeout |
| 188 | signoz-collector | 24317/24318 | gRPC health | TraceDropped |
| 188 | signoz-ui | 3301 | HTTP / | SignOzUIDown |
| 188 | redis-stack | 6380 | redis-cli ping | RedisDown, MemoryExhausted |
| 188 | postgres | 5432 | pg_isready | PostgresDown, ConnectionPoolExhausted |
| 110 | harbor-core | 5000 | GET /api/v2.0/health | HarborDown |
| 110 | sentry-web | 9000 | GET /_health/ | SentryDown |
| 110 | langfuse | 3100 | GET /api/public/health | LangfuseDown |
| 110 | actions-runner | - | systemctl status | RunnerOffline |

### 1.3 Application Layer (Application)

#### A. API Endpoint Monitoring

| Endpoint | Method | Expected response | SLO | Alert |
|----------|--------|-------------------|-----|-------|
| /api/v1/health | GET | 200 | 99.9% | APIHealthCheckFailed |
| /api/v1/approvals/pending | GET | 200 | 99% | ApprovalsAPIError |
| /api/v1/incidents | GET | 200 | 99% | IncidentsAPIError |
| /api/v1/analyze | POST | 200/202 | 95% | AnalysisTimeout (>30s) |
| /api/v1/execute | POST | 200 | 99% | ExecutionFailed |
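
To make the SLO column concrete, the sketch below converts an availability target into an error budget. It assumes the SLO is read as time-based availability over a 30-day window; the document does not state the window or whether the targets are time-based or request-based, so both are assumptions, and `error_budget_minutes` is a hypothetical helper name.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO.

    Assumption: the SLO is time-based and measured over `window_days`.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


# SLO targets taken from the endpoint table above.
for slo in (99.9, 99.0, 95.0):
    print(f"{slo}% -> {error_budget_minutes(slo):.1f} min per 30 days")
```

Under these assumptions a 99.9% target leaves roughly 43 minutes of budget per month, while the 95% target on `/api/v1/analyze` leaves about a day and a half.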

#### B. Error Rate Monitoring (Sentry)

| Type | Threshold | Alert | Auto-repair |
|------|-----------|-------|-------------|
| Unhandled Exception | >0 in 5min | SentryNewError | AI analysis + Playbook matching |
| HTTP 5xx | >1% | HighErrorRate | Pod restart |
| HTTP 4xx | >10% | ClientErrorSpike | Alert + log analysis |
| Slow Transaction | P95 >2s | SlowTransaction | Scaling recommendation |

#### C. Frontend Monitoring

| Metric | Source | Threshold | Alert |
|--------|--------|-----------|-------|
| Page Load Time | Sentry Performance | >3s | SlowPageLoad |
| JS Error Rate | Sentry Issues | >0.1% | FrontendError |
| API Call Failures | Sentry Breadcrumbs | >1% | APICallFailed |
| Web Vitals (LCP/FID/CLS) | Sentry | Google thresholds | PoorWebVitals |

### 1.4 Data Layer (Data)

| Database | Monitored items | Alert conditions |
|----------|-----------------|------------------|
| **PostgreSQL** | Connections, QPS, slow queries, WAL lag, disk I/O | ConnectionPoolExhausted (>90%), SlowQuery (>5s), ReplicationLag (>30s) |
| **Redis** | Memory usage, hit rate, latency, key count | MemoryHigh (>80%), HitRateLow (<90%), SlowCommands |
| **ClickHouse** | Disk usage, query latency, insert rate | DiskFull (>85%), QueryTimeout |

### 1.5 AI/LLM Layer

| Service | Monitored items | Alert conditions | Auto-repair |
|---------|-----------------|------------------|-------------|
| **Ollama** | Inference latency, model load status, GPU usage | InferenceTimeout (>60s), ModelLoadFailed | Container restart |
| **OpenClaw** | Analysis success rate, response time, token usage | AnalysisFailed (>10%), HighTokenCost | Fallback to Gemini |
| **Gemini API** | Rate limit, error rate, cost | RateLimitHit, BudgetExceeded | Degrade to Ollama |
| **Claude API** | Rate limit, error rate, cost | RateLimitHit, BudgetExceeded | Degrade to Gemini |
| **Langfuse** | Trace recording success rate | TraceLost (>1%) | Reconnect |

### 1.6 CI/CD Layer

| Component | Monitored items | Alert conditions |
|-----------|-----------------|------------------|
| **GitHub Actions** | Workflow status, runner health, job latency | WorkflowFailed, RunnerOffline, JobStuck (>30min) |
| **Harbor** | Image push/pull success rate, storage space | PushFailed, PullFailed, StorageFull |
| **ArgoCD** | Sync status, application health | SyncFailed, AppDegraded |

---

## 2. Complete Alert Rule List

### 2.1 P0 - Critical (respond within 5 minutes)

```yaml
# === Infrastructure layer ===
- alert: NodeDown
  expr: up{job="node-exporter"} == 0
  for: 1m
  severity: critical
  auto_repair: false  # requires human intervention

- alert: K3sAPIServerDown
  expr: up{job="kubernetes-apiservers"} == 0
  for: 1m
  severity: critical
  auto_repair: false

- alert: PostgreSQLDown
  expr: pg_up == 0
  for: 30s
  severity: critical
  auto_repair: restart_container

- alert: RedisDown
  expr: redis_up == 0
  for: 30s
  severity: critical
  auto_repair: restart_container

# === Application layer ===
- alert: AWOOOIAPIDown
  expr: probe_success{job="awoooi-api"} == 0
  for: 1m
  severity: critical
  auto_repair: restart_pod

- alert: OpenClawDown
  expr: probe_success{job="openclaw"} == 0
  for: 2m
  severity: critical
  auto_repair: restart_container

- alert: PodCrashLoopBackOff
  expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0
  for: 2m
  severity: critical
  auto_repair: collect_logs_and_rollback

# === CI/CD layer ===
- alert: GitHubRunnerOffline
  expr: github_runner_status == 0
  for: 5m
  severity: critical
  auto_repair: restart_runner_service
```
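
The list above uses a shorthand in which `severity` and `auto_repair` sit next to `expr`. A Prometheus rule file only accepts custom fields under `labels` or `annotations`, so the shorthand needs a conversion step before it can be loaded. The following is a minimal sketch of one possible conversion; the function names, the group name, and the choice to store `auto_repair` as an annotation are assumptions, not something this document specifies.

```python
def to_prometheus_rule(shorthand: dict) -> dict:
    """Convert one shorthand entry into a Prometheus alerting-rule mapping."""
    rule = {"alert": shorthand["alert"], "expr": shorthand["expr"]}
    if "for" in shorthand:
        rule["for"] = shorthand["for"]
    # Custom fields are only valid under labels/annotations in a rule file.
    rule["labels"] = {"severity": shorthand.get("severity", "warning")}
    auto_repair = shorthand.get("auto_repair", False)
    rule["annotations"] = {
        # `auto_repair: false` in the shorthand means "no automatic action".
        "auto_repair": str(auto_repair) if auto_repair else "none"
    }
    return rule


def to_rule_group(name: str, shorthand_rules: list) -> dict:
    """Wrap converted rules in the `groups:` structure of a rule file."""
    return {
        "groups": [
            {"name": name, "rules": [to_prometheus_rule(r) for r in shorthand_rules]}
        ]
    }


example = to_rule_group(
    "p0-critical",  # hypothetical group name
    [
        {"alert": "NodeDown", "expr": 'up{job="node-exporter"} == 0',
         "for": "1m", "severity": "critical", "auto_repair": False},
        {"alert": "PostgreSQLDown", "expr": "pg_up == 0",
         "for": "30s", "severity": "critical", "auto_repair": "restart_container"},
    ],
)
print(example["groups"][0]["rules"][1]["annotations"])
```

Keeping `severity` as a label lets Alertmanager route on it, while an annotation is enough for `auto_repair` because only the webhook consumer reads it.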

### 2.2 P1 - High (respond within 15 minutes)

```yaml
# === Performance alerts ===
- alert: HighCPUUsage
  expr: node_cpu_usage_percent > 90  # assumes a recording rule; not a stock node_exporter metric
  for: 5m
  severity: high
  auto_repair: scale_up_if_possible

- alert: HighMemoryUsage
  expr: node_memory_usage_percent > 90  # assumes a recording rule, as above
  for: 5m
  severity: high
  auto_repair: investigate_memory_leak

- alert: APIHighLatency
  # histogram_quantile needs per-bucket rates aggregated by `le`
  expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
  for: 5m
  severity: high
  auto_repair: analyze_slow_endpoints

- alert: HighErrorRate
  # aggregate both sides so the 5xx rate is divided by the total rate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m
  severity: high
  auto_repair: restart_pod

- alert: OllamaSlowInference
  expr: ollama_inference_duration_seconds > 60
  for: 3m
  severity: high
  auto_repair: switch_to_smaller_model

# === Resource alerts ===
- alert: DiskSpaceLow
  expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.15
  for: 10m
  severity: high
  auto_repair: cleanup_old_logs

- alert: PostgreSQLConnectionPoolHigh
  # sum connections per instance before dividing by the configured maximum
  expr: sum by (instance) (pg_stat_activity_count) / on (instance) pg_settings_max_connections > 0.8
  for: 5m
  severity: high
  auto_repair: analyze_connection_leaks
```

### 2.3 P2 - Medium (respond within 1 hour)

```yaml
- alert: CertificateExpiringSoon
  expr: ssl_cert_not_after - time() < 14 * 24 * 3600
  severity: medium
  auto_repair: renew_certificate

- alert: BackupNotSuccessful
  # no successful backup in the last 24 hours
  expr: increase(velero_backup_success_total[24h]) < 1
  severity: medium
  auto_repair: trigger_backup

- alert: LangfuseTraceLoss
  expr: langfuse_trace_drop_rate > 0.01
  severity: medium
  auto_repair: reconnect_langfuse
```

---

## 3. AI Auto-Repair Closed Loop

### 3.1 Repair Flow

```
┌─────────────────────────────────────────────────────────────────────┐
│  Anomaly occurs                                                     │
│  (Prometheus Alert / Sentry Issue / SignOz Anomaly)                 │
└────────────────────────────┬────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│  Alertmanager routing                                               │
│  ┌───────────────┬───────────────┬───────────────┐                  │
│  │ route: awoooi │ route: infra  │ route: aiops  │                  │
│  └───────┬───────┴───────┬───────┴───────┬───────┘                  │
└──────────┼───────────────┼───────────────┼──────────────────────────┘
           │               │               │
           └───────────────┼───────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────────┐
│  AWOOOI API: /api/v1/webhooks/alertmanager                          │
│  1. Receive → 2. Dedup (10 min fingerprint) → 3. Create Incident    │
└────────────────────────────┬────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│  OpenClaw AI analysis engine                                        │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │ Inputs:                                                       │  │
│  │ - Alert content (labels, annotations)                         │  │
│  │ - K8s context (Pod logs, events, metrics)                     │  │
│  │ - Historical Playbooks (similar cases)                        │  │
│  │ - SignOz traces (related spans)                               │  │
│  │ - Sentry issues (related errors)                              │  │
│  ├───────────────────────────────────────────────────────────────┤  │
│  │ Outputs:                                                      │  │
│  │ - suggested_action: RESTART_POD | DELETE_POD | SCALE_UP | ... │  │
│  │ - confidence: 0.0-1.0                                         │  │
│  │ - risk_level: LOW | MEDIUM | CRITICAL                         │  │
│  │ - blast_radius: {affected_pods, estimated_downtime}           │  │
│  │ - kubectl_command: the concrete command                       │  │
│  │ - reasoning: rationale (Traditional Chinese)                  │  │
│  └───────────────────────────────────────────────────────────────┘  │
└────────────────────────────┬────────────────────────────────────────┘
                             │
              ┌──────────────┴──────────────┐
              │                             │
              ▼                             ▼
  ┌─────────────────────────┐   ┌─────────────────────────┐
  │ confidence >= 0.85      │   │ confidence < 0.85       │
  │ risk_level = LOW        │   │ OR risk = CRITICAL      │
  │           ↓             │   │           ↓             │
  │ Execute automatically   │   │ Human review            │
  └───────────┬─────────────┘   └───────────┬─────────────┘
              │                             │
              │                             ▼
              │                 ┌─────────────────────────┐
              │                 │ Telegram review card    │
              │                 │ [✅ Approve] [❌ Reject]│
              │                 │ [⏰ Later] [🔕 Mute]    │
              │                 └───────────┬─────────────┘
              │                             │
              │                 ┌───────────┴───────────┐
              │                 │                       │
              │                 ▼                       ▼
              │           ┌────────────┐          ┌────────────┐
              │           │  Approved  │          │  Rejected  │
              │           └─────┬──────┘          └─────┬──────┘
              │                 │                       │
              └─────────────────┤                       │
                                │                       │
                                ▼                       ▼
                    ┌─────────────────────────┐  ┌────────────────┐
                    │ K8s Executor runs       │  │ Log the reason │
                    │ kubectl $command        │  │ Update Playbook│
                    └───────────┬─────────────┘  └────────────────┘
                                │
                                ▼
                    ┌─────────────────────────┐
                    │ Verify the result       │
                    │ - Health check passes?  │
                    │ - Error rate dropping?  │
                    │ - Latency recovered?    │
                    └───────────┬─────────────┘
                                │
                  ┌─────────────┴─────────────┐
                  │                           │
                  ▼                           ▼
       ┌─────────────────────┐     ┌─────────────────────┐
       │ Repair succeeded    │     │ Repair failed       │
       │ - Close Incident    │     │ - Escalate alert    │
       │ - Update Playbook   │     │ - Record failure    │
       │ - Notify Telegram   │     │ - Human intervention│
       └─────────────────────┘     └─────────────────────┘
```

### 3.2 Auto-Repair Actions

| Action | Trigger | Command | Risk level | Auto-execute? |
|--------|---------|---------|------------|---------------|
| `RESTART_POD` | PodCrashLoop, HighErrorRate | `kubectl rollout restart deployment/{name}` | LOW | ✅ Yes |
| `DELETE_POD` | PodStuck, OOMKilled | `kubectl delete pod {name} --grace-period=30` | LOW | ✅ Yes |
| `SCALE_UP` | HighCPU, HighMemory, SlowResponse | `kubectl scale deployment/{name} --replicas={current+1}` | LOW | ✅ Yes |
| `SCALE_DOWN` | ResourceWaste | `kubectl scale deployment/{name} --replicas={current-1}` | MEDIUM | ❌ Review required |
| `ROLLBACK` | DeploymentFailed, VersionDrift | `kubectl rollout undo deployment/{name}` | MEDIUM | ❌ Review required |
| `RESTART_CONTAINER` | ContainerUnhealthy | `docker restart {container}` | LOW | ✅ Yes |
| `CLEAR_CACHE` | RedisMemoryHigh, StaleCache | `redis-cli FLUSHDB` | MEDIUM | ❌ Review required |
| `VACUUM_DB` | TableBloat, SlowQuery | `VACUUM ANALYZE {table}` | MEDIUM | ❌ Review required |
| `RENEW_CERT` | CertExpiring | `certbot renew` | LOW | ✅ Yes |
| `CLEANUP_LOGS` | DiskSpaceLow | `find /var/log -mtime +7 -delete` | LOW | ✅ Yes |
| `SWITCH_MODEL` | OllamaTimeout | Switch to a smaller model | LOW | ✅ Yes |
| `FALLBACK_AI` | GeminiRateLimit | Gemini → Ollama | LOW | ✅ Yes |
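
`kubectl scale` takes an absolute replica count, so a `SCALE_UP` step has to read the current count and compute the target itself. The sketch below shows one way to build that command; `scale_up_command`, the `step` default, and the `max_replicas` ceiling are hypothetical and not defined anywhere in this document.

```python
def scale_up_command(
    name: str,
    current_replicas: int,
    namespace: str,
    step: int = 1,
    max_replicas: int = 10,  # assumed ceiling; not specified in this document
) -> list:
    """Build the argv for scaling a Deployment up by `step` replicas."""
    if current_replicas < 0 or step < 1:
        raise ValueError("current_replicas must be >= 0 and step >= 1")
    # Clamp so repeated auto-repairs cannot grow the Deployment without bound.
    target = min(current_replicas + step, max_replicas)
    return [
        "kubectl", "scale", f"deployment/{name}",
        f"--replicas={target}", "-n", namespace,
    ]


print(" ".join(scale_up_command("awoooi-api", 2, "awoooi-prod")))
```

Returning an argument list rather than a shell string means the command can be passed to `subprocess.run` without shell quoting concerns.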

### 3.3 Safety Guardrails

```python
# === Auto-repair safety limits ===
SAFETY_GUARDRAILS = {
    # Rate limits
    "max_repairs_per_hour": 5,            # at most 5 automatic repairs per hour
    "max_repairs_per_resource": 3,        # at most 3 per hour for the same resource
    "cooldown_after_failure": 600,        # 10-minute cooldown after a failure

    # Risk limits
    "auto_approve_max_risk": "LOW",       # auto-approve LOW risk only
    "auto_approve_min_confidence": 0.85,  # minimum confidence of 85%

    # Blast-radius limits
    "max_affected_pods": 3,               # affect at most 3 Pods
    "min_healthy_replicas": 1,            # keep at least 1 healthy replica

    # Never executed automatically
    "blacklist_actions": [
        "DROP_DATABASE",
        "DELETE_NAMESPACE",
        "FORCE_DELETE_PVC",
        "DELETE_SECRET",
    ],

    # Allow-listed namespaces
    "allowed_namespaces": [
        "awoooi-prod",
        "monitoring",
    ],
}
```
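
The guardrails only take effect once something evaluates them against an analysis result. Below is a minimal sketch of such a gate, assuming the OpenClaw output is flattened into a dict with `suggested_action`, `namespace`, `risk_level`, `confidence`, and `affected_pods` keys; that shape, the `can_auto_execute` name, and the risk ordering are assumptions. Anything that fails the gate would fall through to human review.

```python
RISK_ORDER = {"LOW": 0, "MEDIUM": 1, "CRITICAL": 2}

# Subset of the guardrail values listed above, repeated so the sketch runs alone.
GUARDRAILS = {
    "auto_approve_max_risk": "LOW",
    "auto_approve_min_confidence": 0.85,
    "max_repairs_per_hour": 5,
    "max_affected_pods": 3,
    "blacklist_actions": ["DROP_DATABASE", "DELETE_NAMESPACE",
                          "FORCE_DELETE_PVC", "DELETE_SECRET"],
    "allowed_namespaces": ["awoooi-prod", "monitoring"],
}


def can_auto_execute(analysis: dict, repairs_last_hour: int,
                     guardrails: dict = GUARDRAILS) -> tuple:
    """Return (allowed, reason); the first failing guardrail wins."""
    max_risk = RISK_ORDER[guardrails["auto_approve_max_risk"]]
    checks = [
        (analysis["suggested_action"] not in guardrails["blacklist_actions"],
         "action is blacklisted"),
        (analysis["namespace"] in guardrails["allowed_namespaces"],
         "namespace is not allow-listed"),
        (RISK_ORDER[analysis["risk_level"]] <= max_risk, "risk level too high"),
        (analysis["confidence"] >= guardrails["auto_approve_min_confidence"],
         "confidence too low"),
        (analysis["affected_pods"] <= guardrails["max_affected_pods"],
         "blast radius too large"),
        (repairs_last_hour < guardrails["max_repairs_per_hour"],
         "hourly repair budget exhausted"),
    ]
    for passed, reason in checks:
        if not passed:
            return False, reason
    return True, "all guardrails passed"


sample = {"suggested_action": "RESTART_POD", "namespace": "awoooi-prod",
          "risk_level": "LOW", "confidence": 0.92, "affected_pods": 1}
print(can_auto_execute(sample, repairs_last_hour=0))
```

Returning the reason alongside the decision gives the review card something concrete to show when a repair is held back.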

---

## 4. Monitoring Data Flow Integration

### 4.1 Sentry → OpenClaw

```python
# /api/v1/webhooks/sentry - Sentry Issue Alert Webhook
async def handle_sentry_webhook(payload: dict):
    """
    1. Parse the Sentry issue
    2. Deduplicate (10-minute TTL)
    3. Create an Incident
    4. Trigger OpenClaw analysis
    5. Push to Telegram
    """
    issue = payload["data"]["issue"]
    issue_id = issue["id"]

    # Deduplicate atomically (assumes a redis-py asyncio client): SET NX EX
    # succeeds only for the first caller, so two concurrent webhooks for the
    # same issue cannot both pass the check.
    is_new = await redis.set(f"sentry_dedup:{issue_id}", "1", ex=600, nx=True)
    if not is_new:
        return {"status": "deduplicated"}

    # Create the Incident
    incident = await incident_service.create_from_sentry(payload)

    # AI analysis
    analysis = await openclaw.analyze_error(
        error_title=issue["title"],
        stack_trace=issue["culprit"],  # culprit is a location summary, not a full stack trace
        sentry_url=issue["web_url"],
        trace_id=extract_trace_id(payload),
    )

    # Telegram notification
    await telegram.send_error_alert(
        incident_id=incident.id,
        analysis=analysis,
        sentry_url=issue["web_url"],
    )
    return {"status": "processed", "incident_id": incident.id}
```
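
The 10-minute dedup window used by the webhook can be exercised without a Redis server. The sketch below is an in-process stand-in with the same "first caller wins until the key expires" semantics; `TTLDeduplicator` and its injectable clock are hypothetical names introduced only for illustration, and a single-process dict is not a replacement for Redis when several API replicas are running.

```python
import time


class TTLDeduplicator:
    """Remember keys for `ttl_seconds`; report whether a key is new."""

    def __init__(self, ttl_seconds: float = 600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so tests can control time
        self._expires_at = {}    # key -> expiry timestamp

    def first_seen(self, key: str) -> bool:
        """Return True the first time `key` is seen within the TTL window."""
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False         # still inside the dedup window
        self._expires_at[key] = now + self._ttl
        return True


dedup = TTLDeduplicator(ttl_seconds=600)
print(dedup.first_seen("sentry_dedup:456"))  # first delivery
print(dedup.first_seen("sentry_dedup:456"))  # repeat within 10 minutes
```

Injecting the clock keeps the expiry logic testable without sleeping for ten minutes.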

### 4.2 Alertmanager → OpenClaw

```yaml
# alertmanager.yml
route:
  receiver: awoooi-api
  routes:
    - match:
        namespace: awoooi-prod
      receiver: awoooi-api
    - match:
        severity: critical
      receiver: awoooi-api

receivers:
  - name: awoooi-api
    webhook_configs:
      - url: http://192.168.0.125:32334/api/v1/webhooks/alertmanager
        send_resolved: true
        http_config:
          basic_auth:
            username: alertmanager
            password_file: /etc/alertmanager/secrets/webhook-password
```
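
On the receiving side, the webhook handler has to pick the firing alerts out of the Alertmanager payload and derive a dedup key for each. The sketch below relies on the standard webhook payload fields (`alerts`, `status`, `labels`, `fingerprint`); the `extract_firing_alerts` name, the `alert_dedup:` key prefix, and the returned dict shape are assumptions.

```python
def extract_firing_alerts(payload: dict) -> list:
    """Return one summary dict per firing alert in an Alertmanager webhook payload."""
    summaries = []
    for alert in payload.get("alerts", []):
        # send_resolved: true means resolved alerts arrive too; skip them here.
        if alert.get("status") != "firing":
            continue
        labels = alert.get("labels", {})
        summaries.append({
            # Alertmanager computes `fingerprint` from the alert's label set.
            "dedup_key": f"alert_dedup:{alert.get('fingerprint', '')}",
            "alertname": labels.get("alertname", "unknown"),
            "severity": labels.get("severity", "unknown"),
            "namespace": labels.get("namespace"),
        })
    return summaries


sample_payload = {
    "status": "firing",
    "alerts": [
        {"status": "firing", "fingerprint": "abc123",
         "labels": {"alertname": "AWOOOIAPIDown", "severity": "critical",
                    "namespace": "awoooi-prod"}},
        {"status": "resolved", "fingerprint": "def456",
         "labels": {"alertname": "RedisDown", "severity": "critical"}},
    ],
}
print(extract_firing_alerts(sample_payload))
```

Resolved alerts are dropped here for brevity; a fuller handler would probably use them to close the matching Incident.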

### 4.3 SignOz → OpenClaw

```python
# Query anomalous spans through ClickHouse
async def detect_signoz_anomalies():
    """
    Periodically query SignOz ClickHouse to detect:
    - An abnormal rise in error rate
    - Abnormal P99 latency
    - A sudden drop in trace volume (the service may be down)
    """
    # Only the error-count check is shown here.
    anomalies = await clickhouse.query("""
        SELECT
            serviceName,
            count(*) AS error_count,
            avg(durationNano) / 1e6 AS avg_latency_ms
        FROM signoz_traces.signoz_index_v2
        WHERE timestamp > now() - INTERVAL 5 MINUTE
          AND statusCode = 'STATUS_CODE_ERROR'
        GROUP BY serviceName
        HAVING error_count > 10
    """)

    for anomaly in anomalies:
        await openclaw.analyze_trace_anomaly(
            service=anomaly["serviceName"],
            error_count=anomaly["error_count"],
            avg_latency=anomaly["avg_latency_ms"],
        )
```

---

## 5. Implementation Priorities

### Phase 1 (this week - P0)

| Item | Status | Owner | Notes |
|------|--------|-------|-------|
| Alertmanager → AWOOOI Webhook | ⬜ TODO | Claude Code | Configure webhook + test alerts |
| Sentry Webhook → Telegram | ⬜ TODO | Claude Code | Push errors directly + AI analysis |
| Automatic secrets injection (CD) | ⬜ TODO | Claude Code | kubectl patch secret |
| Alert dedup verification | ⬜ TODO | Claude Code | 10min fingerprint test |

### Phase 2 (next week - P1)

| Item | Status | Owner | Notes |
|------|--------|-------|-------|
| SignOz alert rules | ⬜ TODO | Claude Code | Error Rate, Latency P99 |
| More auto-repair actions | ⬜ TODO | Claude Code | SCALE_UP, ROLLBACK |
| Automatic Playbook extraction | ⬜ TODO | Claude Code | Successful repair → Playbook |
| Alert escalation | ⬜ TODO | Claude Code | SLA Engine |

### Phase 3 (in two weeks - P2)

| Item | Status | Owner | Notes |
|------|--------|-------|-------|
| Grafana dashboards | ⬜ TODO | Claude Code | Monitoring overview |
| SLO/SLI definitions | ⬜ TODO | Claude Code | 99.9% availability target |
| Alert noise suppression | ⬜ TODO | Claude Code | ML anomaly detection |
| Capacity forecasting | ⬜ TODO | Claude Code | Resource trend forecasting |

---

## 6. Appendix

### A. Environment Variables

```bash
# === Alertmanager ===
ALERTMANAGER_WEBHOOK_URL=http://192.168.0.125:32334/api/v1/webhooks/alertmanager
ALERTMANAGER_WEBHOOK_SECRET=<secret>

# === Sentry ===
SENTRY_DSN=http://<key>@192.168.0.110:9000/<project>
SENTRY_WEBHOOK_SECRET=<secret>
SENTRY_DEDUP_TTL=600

# === SignOz ===
SIGNOZ_CLICKHOUSE_URL=http://192.168.0.188:8123
SIGNOZ_ANOMALY_THRESHOLD_ERROR_COUNT=10

# === Auto-repair ===
AUTO_REPAIR_ENABLED=true
AUTO_REPAIR_MAX_PER_HOUR=5
AUTO_REPAIR_MIN_CONFIDENCE=0.85
AUTO_REPAIR_DRY_RUN=false
```

### B. Alert Template

```markdown
🚨 **CRITICAL | awoooi-api**
━━━━━━━━━━━━━━━━━━━
📋 INC-20260329-0001
🎯 Pod: awoooi-api-7d4b8c9f5-abc12
━━━━━━━━━━━━━━━━━━━
🤖 **AI Analysis**
👥 Owner: BE (backend)
📊 Confidence: 🟢 92%
💡 Cause: OOM Killed - Memory limit exceeded
━━━━━━━━━━━━━━━━━━━
🔧 Suggested: DELETE_POD + SCALE_UP
⏱️ Downtime: ~30s
💰 Tokens: 1,234 / $0.0012
━━━━━━━━━━━━━━━━━━━
🔗 [SignOz Trace](http://192.168.0.188:3301/trace/abc123)
🔗 [Sentry Issue](http://192.168.0.110:9000/issues/456)

[✅ Approve] [❌ Reject] [⏰ Later] [🔕 Mute]
```

---

**End of document**
**Next step**: carry out the Phase 1 tasks

docs/MONITORING_INTEGRATION_ARCHITECTURE.md
@@ -0,0 +1,977 @@
# AWOOOI Automated Monitoring Integration Architecture

> **Version**: v1.0
> **Created**: 2026-03-29
> **Goal**: every new service or feature gets monitoring automatically, with nothing missed

---

## Core Principles

```
┌─────────────────────────────────────────────────────────────────┐
│  🎯 Monitoring as Code                                          │
│                                                                 │
│  • All monitoring configuration lives in Git                    │
│  • CI/CD automatically validates monitoring coverage            │
│  • New services must pass the monitoring check before deploying │
│  • The service registry automatically syncs monitoring rules    │
└─────────────────────────────────────────────────────────────────┘
```

---

## 1. Service Registry

### 1.1 Registry Structure

```yaml
# /ops/monitoring/service-registry.yaml
# Single Source of Truth for all monitored services

services:
  # === K8s workloads ===
  - name: awoooi-api
    type: k8s-deployment
    namespace: awoooi-prod
    port: 8000
    health_endpoint: /api/v1/health
    monitoring:
      prometheus: true
      sentry: true
      otel: true
      langfuse: true  # makes LLM calls
    alerts:
      - pod_crash
      - high_error_rate
      - slow_response
    owner: backend-team

  - name: awoooi-web
    type: k8s-deployment
    namespace: awoooi-prod
    port: 3000
    health_endpoint: /
    monitoring:
      prometheus: true
      sentry: true
      otel: true
      langfuse: false
    alerts:
      - pod_crash
      - slow_page_load
    owner: frontend-team

  - name: awoooi-worker
    type: k8s-deployment
    namespace: awoooi-prod
    health_endpoint: /tmp/worker-healthy  # exec probe
    monitoring:
      prometheus: true
      sentry: true
      otel: true
      langfuse: true
    alerts:
      - worker_stuck
      - queue_backlog
    owner: backend-team

  # === Docker containers (188) ===
  - name: ollama
    type: docker
    host: 192.168.0.188
    port: 11434
    health_endpoint: /api/tags
    monitoring:
      prometheus: true
      sentry: false  # external service
      otel: false
    alerts:
      - service_down
      - inference_timeout
    owner: ai-team

  - name: openclaw
    type: docker
    host: 192.168.0.188
    port: 8089
    health_endpoint: /health
    monitoring:
      prometheus: true
      sentry: true
      otel: true
      langfuse: true
    alerts:
      - service_down
      - analysis_timeout
    owner: ai-team

  - name: redis
    type: docker
    host: 192.168.0.188
    port: 6380
    health_endpoint: redis-cli ping
    monitoring:
      prometheus: true
    alerts:
      - service_down
      - memory_high
    owner: infra-team

  - name: postgres
    type: docker
    host: 192.168.0.188
    port: 5432
    health_endpoint: pg_isready
    monitoring:
      prometheus: true
    alerts:
      - service_down
      - connection_pool_exhausted
      - slow_query
    owner: infra-team

  # === Docker containers (110) ===
  - name: harbor
    type: docker
    host: 192.168.0.110
    port: 5000
    health_endpoint: /api/v2.0/health
    monitoring:
      prometheus: true
    alerts:
      - service_down
      - storage_full
    owner: devops-team

  - name: sentry
    type: docker
    host: 192.168.0.110
    port: 9000
    health_endpoint: /_health/
    monitoring:
      prometheus: true
    alerts:
      - service_down
    owner: devops-team

  - name: langfuse
    type: docker
    host: 192.168.0.110
    port: 3100
    health_endpoint: /api/public/health
    monitoring:
      prometheus: true
    alerts:
      - service_down
    owner: ai-team

  - name: github-runner
    type: systemd
    host: 192.168.0.110
    service_name: actions.runner.owenhytsai-awoooi.awoooi-110.service
    monitoring:
      prometheus: true
    alerts:
      - runner_offline
    owner: devops-team

# === Frontend pages ===
pages:
  - path: /
    name: Dashboard
    monitoring:
      sentry_session: true
      web_vitals: true
    alerts:
      - slow_page_load
      - js_error

  - path: /authorizations
    name: Authorizations
    monitoring:
      sentry_session: true
      web_vitals: true
    alerts:
      - slow_page_load
      - api_error

  - path: /action-logs
    name: Action Logs
    monitoring:
      sentry_session: true
    alerts:
      - slow_page_load

  - path: /errors
    name: Error Tracking
    monitoring:
      sentry_session: true
    alerts:
      - slow_page_load

# === API endpoints ===
api_endpoints:
  - path: /api/v1/health
    method: GET
    critical: true
    slo_latency_ms: 100
    slo_availability: 99.99

  - path: /api/v1/approvals
    method: GET
    critical: true
    slo_latency_ms: 500
    slo_availability: 99.9

  - path: /api/v1/analyze
    method: POST
    critical: true
    slo_latency_ms: 30000  # 30s (LLM)
    slo_availability: 95

  - path: /api/v1/webhooks/alertmanager
    method: POST
    critical: true
    slo_latency_ms: 5000
    slo_availability: 99.9
```
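
Because every downstream generator reads this registry, it helps to reject malformed entries before anything is generated. The sketch below checks each service entry for the fields its `type` appears to need, inferred from the entries above; the per-type field sets and the `validate_service` name are assumptions, not a schema defined by this document.

```python
# Fields every service entry carries in the registry above.
COMMON_FIELDS = {"name", "type", "monitoring", "alerts", "owner"}

# Extra fields per service type, inferred from the example entries (assumption).
FIELDS_BY_TYPE = {
    "k8s-deployment": {"namespace"},
    "docker": {"host", "port"},
    "systemd": {"host", "service_name"},
}


def validate_service(entry: dict) -> list:
    """Return a sorted list of problems for one registry entry (empty if valid)."""
    problems = [f"missing field: {f}" for f in COMMON_FIELDS - entry.keys()]
    service_type = entry.get("type")
    if service_type is not None:
        if service_type not in FIELDS_BY_TYPE:
            problems.append(f"unknown type: {service_type}")
        else:
            required = FIELDS_BY_TYPE[service_type]
            problems += [f"missing field: {f}" for f in required - entry.keys()]
    return sorted(problems)


good = {"name": "ollama", "type": "docker", "host": "192.168.0.188", "port": 11434,
        "monitoring": {"prometheus": True}, "alerts": ["service_down"],
        "owner": "ai-team"}
bad = {"name": "mystery", "type": "docker", "monitoring": {}, "alerts": []}
print(validate_service(good))
print(validate_service(bad))
```

A check like this could run as the first CI step so that a missing `host` or `owner` fails fast with a readable message instead of surfacing later as a `KeyError` in a generator.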

### 1.2 Generator Tool

```python
# /ops/monitoring/generate_monitoring.py
"""
Generate the following from service-registry.yaml:
1. Prometheus scrape configs
2. Prometheus alert rules
3. Grafana dashboards
4. Blackbox exporter targets
"""

import yaml
from pathlib import Path


def generate_prometheus_config(registry: dict) -> str:
    """Generate Prometheus scrape_configs."""
    scrape_configs = []

    for service in registry['services']:
        if not service.get('monitoring', {}).get('prometheus'):
            continue

        config = {
            'job_name': service['name'],
            'metrics_path': '/metrics',
            'scrape_interval': '15s',
        }

        # Pick the discovery mechanism by service type
        if service['type'] == 'k8s-deployment':
            # K8s workloads have no fixed host; discover pods in the namespace
            config['kubernetes_sd_configs'] = [{
                'role': 'pod',
                'namespaces': {'names': [service['namespace']]}
            }]
        elif 'host' in service and 'port' in service:
            config['static_configs'] = [{
                'targets': [f"{service['host']}:{service['port']}"]
            }]
        else:
            # e.g. systemd units without a port: nothing to scrape directly
            continue

        scrape_configs.append(config)

    return yaml.dump({'scrape_configs': scrape_configs})


def generate_alert_rules(registry: dict) -> str:
    """Generate Prometheus alert rules."""
    groups = []

    for service in registry['services']:
        rules = []

        for alert in service.get('alerts', []):
            template = ALERT_TEMPLATES.get(alert)
            if template is None:
                # No template yet (e.g. worker_stuck): skip instead of
                # emitting a rule that has labels but no alert/expr
                print(f"Warning: no alert template for '{alert}' ({service['name']})")
                continue
            rule = template.copy()
            rule['labels'] = {
                'service': service['name'],
                'owner': service['owner'],
                'severity': 'critical' if alert in CRITICAL_ALERTS else 'warning',
            }
            rules.append(rule)

        if rules:
            groups.append({
                'name': f"{service['name']}_alerts",
                'rules': rules,
            })

    return yaml.dump({'groups': groups})


def generate_blackbox_targets(registry: dict) -> list:
    """Generate Blackbox Exporter health-check targets."""
    targets = []

    for service in registry['services']:
        endpoint = service.get('health_endpoint', '')
        # Only HTTP paths on a known host:port can be probed. This skips exec
        # probes, CLI checks such as `pg_isready`, and K8s entries without a host.
        if not endpoint.startswith('/') or 'host' not in service or 'port' not in service:
            continue
        url = f"http://{service['host']}:{service['port']}{endpoint}"
        targets.append({
            'targets': [url],
            'labels': {
                'service': service['name'],
                'type': service['type'],
            }
        })

    return targets


# Alert templates
ALERT_TEMPLATES = {
    'pod_crash': {
        'alert': 'PodCrashLoopBackOff',
        'expr': 'kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0',
        'for': '2m',
        'annotations': {
            'summary': 'Pod {{ $labels.pod }} is crash looping',
            'auto_repair': 'restart_pod',
        }
    },
    'high_error_rate': {
        'alert': 'HighErrorRate',
        # Aggregate both sides so the 5xx rate is divided by the total rate
        'expr': 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01',
        'for': '5m',
        'annotations': {
            'summary': 'High error rate on {{ $labels.service }}',
            'auto_repair': 'restart_pod',
        }
    },
    'service_down': {
        'alert': 'ServiceDown',
        'expr': 'probe_success == 0',
        'for': '1m',
        'annotations': {
            'summary': '{{ $labels.service }} is down',
            'auto_repair': 'restart_container',
        }
    },
    'slow_response': {
        'alert': 'SlowResponse',
        # histogram_quantile needs per-bucket rates aggregated by `le`
        'expr': 'histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2',
        'for': '5m',
        'annotations': {
            'summary': 'Slow response on {{ $labels.service }}',
            'auto_repair': 'scale_up',
        }
    },
    'memory_high': {
        'alert': 'MemoryHigh',
        'expr': 'container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9',
        'for': '5m',
        'annotations': {
            'summary': 'High memory usage on {{ $labels.service }}',
            'auto_repair': 'analyze_memory_leak',
        }
    },
    'runner_offline': {
        'alert': 'GitHubRunnerOffline',
        'expr': 'github_runner_status == 0',
        'for': '5m',
        'annotations': {
            'summary': 'GitHub Runner is offline',
            'auto_repair': 'restart_runner_service',
        }
    },
}

CRITICAL_ALERTS = {'pod_crash', 'service_down', 'runner_offline'}
```

---

## 2. CI/CD Integration (Automatic Monitoring)

### 2.1 Automatic Monitoring Flow for New Services

```yaml
# .github/workflows/cd.yaml (added steps)

jobs:
  monitoring-validation:
    name: "🔍 Monitoring Coverage Check"
    runs-on: self-hosted
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Validate Service Registry
        run: |
          # Check that every K8s Deployment is present in the registry
          python ops/monitoring/validate_coverage.py

      - name: Generate Monitoring Configs
        run: |
          # Generate the latest monitoring configs from the registry
          python ops/monitoring/generate_monitoring.py

      - name: Check for Drift
        run: |
          # Check that the generated configs match the active ones
          diff -r ops/monitoring/generated/ ops/monitoring/active/

      - name: Apply Monitoring Configs
        if: github.ref == 'refs/heads/main'
        run: |
          # Deploy the monitoring configs
          kubectl apply -f ops/monitoring/generated/prometheus-rules.yaml
          kubectl apply -f ops/monitoring/generated/alertmanager-config.yaml
```

### 2.2 新服務檢測腳本
|
||||
|
||||
```python
|
||||
# /ops/monitoring/validate_coverage.py
|
||||
"""
|
||||
CI 檢查: 確保所有服務都有監控配置
|
||||
"""
|
||||
|
||||
import yaml
|
||||
import subprocess
|
||||
import sys
|
||||
|
||||
def get_k8s_deployments() -> list[str]:
|
||||
"""取得所有 K8s Deployments"""
|
||||
result = subprocess.run(
|
||||
['kubectl', 'get', 'deployments', '-A', '-o', 'jsonpath={.items[*].metadata.name}'],
|
||||
capture_output=True, text=True
|
||||
)
|
||||
return result.stdout.split()
|
||||
|
||||
def get_docker_containers(host: str) -> list[str]:
|
||||
"""取得主機上的 Docker 容器"""
|
||||
result = subprocess.run(
|
||||
['ssh', host, 'docker', 'ps', '--format', '{{.Names}}'],
|
||||
capture_output=True, text=True
|
||||
)
|
||||
return result.stdout.strip().split('\n')
|
||||
|
||||
def load_registry() -> dict:
|
||||
"""載入服務註冊表"""
|
||||
with open('ops/monitoring/service-registry.yaml') as f:
|
||||
return yaml.safe_load(f)
|
||||
|
||||
def main():
|
||||
registry = load_registry()
|
||||
registered_services = {s['name'] for s in registry['services']}
|
||||
|
||||
errors = []
|
||||
|
||||
# 檢查 K8s Deployments
|
||||
k8s_deployments = get_k8s_deployments()
|
||||
for deploy in k8s_deployments:
|
||||
if deploy not in registered_services and not deploy.startswith('kube-'):
|
||||
errors.append(f"❌ K8s Deployment '{deploy}' 未在 service-registry.yaml 中註冊")
|
||||
|
||||
# 檢查 Docker 容器 (188, 110)
|
||||
for host in ['192.168.0.188', '192.168.0.110']:
|
||||
try:
|
||||
containers = get_docker_containers(host)
|
||||
for container in containers:
|
||||
if container and container not in registered_services:
|
||||
# 忽略系統容器
|
||||
if not any(x in container for x in ['k3s', 'pause', 'coredns']):
|
||||
errors.append(f"⚠️ Docker 容器 '{container}' on {host} 未在 registry 中")
|
||||
except Exception as e:
|
||||
print(f"Warning: Cannot check {host}: {e}")
|
||||
|
||||
if errors:
|
||||
print("\n".join(errors))
|
||||
print(f"\n❌ 發現 {len(errors)} 個未監控的服務!")
|
||||
print("請更新 ops/monitoring/service-registry.yaml")
|
||||
sys.exit(1)
|
||||
|
||||
print("✅ 所有服務都已註冊監控")
|
||||
sys.exit(0)
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
```
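
The comparison at the core of the script above (discovered names versus registered names, minus ignored system workloads) can be isolated as a pure function so it can be unit-tested without `kubectl` or `ssh`. This is a sketch under that assumption; `find_unregistered` and its parameters are hypothetical names, not part of the committed script.

```python
from typing import Iterable


def find_unregistered(
    discovered: Iterable[str],
    registered: set[str],
    ignored_prefixes: tuple[str, ...] = ("kube-",),
) -> list[str]:
    """Return discovered names that are neither registered nor ignored."""
    missing = set()
    for name in discovered:
        if not name or name in registered:
            continue  # skip blank lines and services already in the registry
        if name.startswith(ignored_prefixes):
            continue  # skip system workloads such as kube-proxy
        missing.add(name)
    return sorted(missing)


if __name__ == "__main__":
    # Illustrative names only
    print(find_unregistered(
        ["awoooi-api", "kube-proxy", "new-service", ""],
        {"awoooi-api"},
    ))
```

Keeping the set logic separate from the subprocess calls means the CI check can be exercised with fixture data, while the discovery functions stay thin wrappers around the command-line tools.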

### 2.3 Automatic Monitoring for New API Endpoints

```python
# /apps/api/src/core/auto_monitoring.py
"""
Automatic monitoring for FastAPI routes
- Registers every endpoint with Prometheus automatically
- Sets up Sentry tracing automatically
- Creates health checks automatically
"""

import inspect
import time
from functools import wraps

from fastapi import APIRouter
from prometheus_client import Counter, Histogram

# Prometheus metrics (created automatically)
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'path', 'status']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'path'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)


def auto_monitor(router: APIRouter):
    """
    Add monitoring to every endpoint registered on the router.
    Call it before any routes are defined.

    Usage:
        router = APIRouter(prefix="/api/v1/new-feature")
        auto_monitor(router)  # adds monitoring automatically
    """
    original_add_api_route = router.add_api_route

    def monitored_add_api_route(path, endpoint, **kwargs):
        # The HTTP methods are known at registration time, so the wrapper does
        # not need a Request parameter that the endpoint may not declare
        method = ",".join(sorted(kwargs.get('methods') or ['GET']))

        @wraps(endpoint)
        async def monitored_endpoint(*args, **inner_kwargs):
            start_time = time.time()
            status = 500  # assume failure until the endpoint returns

            try:
                response = endpoint(*args, **inner_kwargs)
                if inspect.isawaitable(response):  # supports sync and async endpoints
                    response = await response
                status = getattr(response, 'status_code', 200)
                return response
            finally:
                # Record metrics
                REQUEST_COUNT.labels(
                    method=method,
                    path=path,
                    status=status
                ).inc()

                REQUEST_LATENCY.labels(
                    method=method,
                    path=path
                ).observe(time.time() - start_time)

        return original_add_api_route(path, monitored_endpoint, **kwargs)

    router.add_api_route = monitored_add_api_route
    return router
```

---

## 3. Automatic Frontend Monitoring

### 3.1 Automatic Page Instrumentation

```typescript
// /apps/web/src/lib/auto-monitoring.ts
/**
 * Automatic frontend page monitoring
 * - Collects Web Vitals automatically
 * - Reports page errors automatically
 * - Traces API calls automatically
 */

import * as Sentry from '@sentry/nextjs';
import type { Metric } from 'web-vitals';

// Automatic initialization (call from _app.tsx)
export function initAutoMonitoring() {
  // Everything below relies on browser APIs, so do nothing during server-side rendering
  if (typeof window === 'undefined') return;

  // 1. Web Vitals
  // Note: onFID is deprecated in favor of onINP and is absent from recent web-vitals releases
  import('web-vitals').then(({ onCLS, onFID, onLCP, onTTFB, onINP }) => {
    onCLS(sendToAnalytics);
    onFID(sendToAnalytics);
    onLCP(sendToAnalytics);
    onTTFB(sendToAnalytics);
    onINP(sendToAnalytics);
  });

  // 2. Global error handling
  window.addEventListener('error', (event) => {
    Sentry.captureException(event.error);
  });

  window.addEventListener('unhandledrejection', (event) => {
    Sentry.captureException(event.reason);
  });

  // 3. Automatic API call tracing
  patchFetch();
}

function sendToAnalytics(metric: Metric) {
  // Send to Sentry Performance
  Sentry.metrics.distribution(
    `web_vitals.${metric.name}`,
    metric.value,
    {
      tags: {
        page: window.location.pathname,
      },
    }
  );
}

function patchFetch() {
  const originalFetch = window.fetch;

  window.fetch = async function (input, init) {
    // input may be a string, a URL object, or a Request
    const url = input instanceof Request ? input.url : String(input);
    const method = init?.method || (input instanceof Request ? input.method : 'GET');

    // startInactiveSpan returns a span that is ended manually in `finally`
    const span = Sentry.startInactiveSpan({
      name: `${method} ${url}`,
      op: 'http.client',
    });

    try {
      const response = await originalFetch(input, init);

      // Record API errors
      if (!response.ok) {
        Sentry.captureMessage(`API Error: ${method} ${url} - ${response.status}`, {
          level: response.status >= 500 ? 'error' : 'warning',
          extra: {
            status: response.status,
            statusText: response.statusText,
          },
        });
      }

      return response;
    } catch (error) {
      Sentry.captureException(error);
      throw error;
    } finally {
      span?.end();
    }
  };
}
```

### 3.2 Automatic Detection of New Pages

```typescript
// /apps/web/src/middleware.ts
import { NextResponse } from 'next/server';
import type { NextRequest } from 'next/server';

// Known pages (kept in sync with service-registry)
const KNOWN_PAGES = new Set([
  '/',
  '/authorizations',
  '/action-logs',
  '/errors',
  '/settings',
  '/knowledge-base',
]);

export function middleware(request: NextRequest) {
  const path = request.nextUrl.pathname;

  // Detect new pages
  if (!KNOWN_PAGES.has(path) && !path.startsWith('/api') && !path.startsWith('/_next')) {
    // Report to the monitoring system
    console.warn(`[MONITORING] Unregistered page was visited: ${path}`);

    // TODO: send to Sentry or a backend API
  }

  return NextResponse.next();
}
```
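
The middleware comment says the known-page list is synced from the service registry, but the document does not show how. One possible approach, sketched here as an assumption, is to render the list from the registry's `pages` entries at build time. The function name `render_known_pages` and the generated-file header are hypothetical.

```python
import json


def render_known_pages(registry: dict) -> str:
    """Render the registry's page paths as a TypeScript Set literal."""
    paths = sorted({page["path"] for page in registry.get("pages", [])})
    lines = [
        "// Generated from service-registry.yaml; do not edit by hand",
        "export const KNOWN_PAGES = new Set([",
    ]
    # json.dumps yields a double-quoted, escaped string that is valid TypeScript
    lines += [f"  {json.dumps(path)}," for path in paths]
    lines.append("]);")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative fragment of the registry's `pages` section
    print(render_known_pages({"pages": [{"path": "/"}, {"path": "/settings"}]}))
```

Generating the list avoids the case where a page is added to the registry but the hard-coded set in the middleware is forgotten.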

---

## 4. Automatic Discovery

### 4.1 K8s Service Discovery

```yaml
# /ops/monitoring/prometheus/kubernetes-sd.yaml
# Prometheus service discovery for K8s workloads

scrape_configs:
  # Discover all Pods automatically
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only Pods carrying the annotation prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use the Pod's `app` label as the job name
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: job
      # Expose the namespace as a label
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
```

### 4.2 Docker Container Discovery

```python
# /ops/monitoring/discover_docker.py
"""
Periodically scan Docker containers to find new, unmonitored services.
"""

import json
import subprocess

import yaml

HOSTS = ['192.168.0.188', '192.168.0.110']


def discover_containers():
    """Discover all Docker containers on the configured hosts."""
    all_containers = []

    for host in HOSTS:
        try:
            result = subprocess.run(
                ['ssh', host, 'docker', 'ps', '--format', '{{json .}}'],
                capture_output=True, text=True, timeout=10
            )

            for line in result.stdout.strip().split('\n'):
                if line:
                    container = json.loads(line)
                    container['host'] = host
                    all_containers.append(container)
        except Exception as e:
            print(f"Error scanning {host}: {e}")

    return all_containers


def check_new_containers(containers: list, registry: dict):
    """Return the containers that are missing from the registry."""
    registered = {s['name'] for s in registry['services']}

    new_containers = []
    for c in containers:
        name = c['Names']
        if name not in registered:
            new_containers.append({
                'name': name,
                'host': c['host'],
                'image': c['Image'],
                'created': c['CreatedAt'],
            })

    return new_containers


def alert_new_containers(new_containers: list):
    """Send an alert about new containers."""
    if new_containers:
        message = f"🆕 Found {len(new_containers)} unmonitored container(s):\n\n"
        for c in new_containers:
            message += f"• {c['name']} on {c['host']} ({c['image']})\n"

        message += "\nPlease update service-registry.yaml"

        # TODO: send a Telegram alert
        print(message)


if __name__ == '__main__':
    # Run hourly as a cron job
    with open('ops/monitoring/service-registry.yaml') as f:
        registry = yaml.safe_load(f)

    containers = discover_containers()
    new_containers = check_new_containers(containers, registry)

    if new_containers:
        alert_new_containers(new_containers)
```
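
The discovery script relies on `docker ps --format '{{json .}}'` emitting one JSON object per line. As a sketch of that parsing step in isolation, the helper below (a hypothetical name, `parse_docker_ps`) tolerates blank lines and stray non-JSON output such as SSH banners. The sample line is made up for illustration.

```python
import json


def parse_docker_ps(output: str, host: str) -> list[dict]:
    """Parse line-delimited JSON from `docker ps --format '{{json .}}'`."""
    containers = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            container = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise such as login banners
        if not isinstance(container, dict):
            continue  # a bare number or string is not a container record
        container["host"] = host  # remember where the container was seen
        containers.append(container)
    return containers


if __name__ == "__main__":
    # Illustrative output; real records carry more fields
    sample = (
        'Welcome to the host\n'
        '{"Names": "ollama", "Image": "ollama/ollama:latest"}\n'
        '\n'
    )
    print(parse_docker_ps(sample, "192.168.0.188"))
```

Separating parsing from the `ssh` call makes it possible to test the discovery logic against captured output instead of live hosts.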

---

## 5. Monitoring Coverage Dashboard

### 5.1 Coverage Calculation

```python
# /ops/monitoring/coverage_report.py
"""
Compute monitoring coverage and generate a report.
"""

from datetime import datetime


def calculate_coverage(registry: dict) -> dict:
    """Compute monitoring coverage along each dimension."""
    services = registry['services']

    total = len(services)

    coverage = {
        'prometheus': sum(1 for s in services if s['monitoring'].get('prometheus')) / total,
        'sentry': sum(1 for s in services if s['monitoring'].get('sentry')) / total,
        'otel': sum(1 for s in services if s['monitoring'].get('otel')) / total,
        'langfuse': sum(1 for s in services if s['monitoring'].get('langfuse')) / total,
        'alerts': sum(1 for s in services if s.get('alerts')) / total,
    }

    # Page coverage
    pages = registry.get('pages', [])
    if pages:
        coverage['page_sentry'] = sum(1 for p in pages if p['monitoring'].get('sentry_session')) / len(pages)
        coverage['page_vitals'] = sum(1 for p in pages if p['monitoring'].get('web_vitals')) / len(pages)

    # API SLO coverage (the registry nests latency targets under `slo`)
    endpoints = registry.get('api_endpoints', [])
    if endpoints:
        coverage['api_slo'] = sum(1 for e in endpoints if e.get('slo', {}).get('latency_p95_ms')) / len(endpoints)

    return coverage


def generate_report(coverage: dict) -> str:
    """Generate the coverage report."""
    report = """
# AWOOOI Monitoring Coverage Report
Generated at: {timestamp}

## Service Monitoring Coverage

| Monitoring Type | Coverage | Status |
|----------|--------|------|
| Prometheus Metrics | {prometheus:.0%} | {prometheus_status} |
| Sentry Error Tracking | {sentry:.0%} | {sentry_status} |
| OTEL Traces | {otel:.0%} | {otel_status} |
| Langfuse LLM | {langfuse:.0%} | {langfuse_status} |
| Alert Rules | {alerts:.0%} | {alerts_status} |

## Frontend Monitoring Coverage

| Monitoring Type | Coverage | Status |
|----------|--------|------|
| Sentry Session | {page_sentry:.0%} | {page_sentry_status} |
| Web Vitals | {page_vitals:.0%} | {page_vitals_status} |

## API SLO Coverage

| Type | Coverage | Status |
|------|--------|------|
| SLO Defined | {api_slo:.0%} | {api_slo_status} |

---
Overall health: **{overall:.0%}**
""".format(
        timestamp=datetime.now().isoformat(),
        **coverage,
        **{f"{k}_status": "✅" if v >= 0.9 else "⚠️" if v >= 0.7 else "❌" for k, v in coverage.items()},
        overall=sum(coverage.values()) / len(coverage),
    )

    return report
```
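
The report grades each coverage ratio with two thresholds: 90% and above passes, 70% and above is a warning, and anything lower fails. As a small sketch of that rule on its own, the helpers below use plain-text labels in place of the report's emoji. The names `coverage_status` and `summarize` are hypothetical, and the sample ratios are made up.

```python
def coverage_status(ratio: float) -> str:
    """Map a coverage ratio (0.0-1.0) to one of the report's three levels."""
    if ratio >= 0.9:
        return "ok"
    if ratio >= 0.7:
        return "warn"
    return "fail"


def summarize(coverage: dict[str, float]) -> dict:
    """Return per-dimension statuses plus the unweighted overall average."""
    statuses = {name: coverage_status(ratio) for name, ratio in coverage.items()}
    overall = sum(coverage.values()) / len(coverage) if coverage else 0.0
    return {"statuses": statuses, "overall": overall}


if __name__ == "__main__":
    sample = {"prometheus": 0.95, "sentry": 0.75, "otel": 0.5}  # made-up ratios
    print(summarize(sample))
```

Note that the overall figure is an unweighted mean, so a dimension with few applicable services (for example Langfuse) pulls the average as strongly as Prometheus does.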

---

## 6. Integration Flow Diagram

```
┌────────────────────────────────────────────────────────────────────────────┐
│  Developer adds a new service                                              │
│  (K8s Deployment / Docker container / API endpoint / frontend page)        │
└────────────────────────────┬───────────────────────────────────────────────┘
                             │
                             ▼
┌────────────────────────────────────────────────────────────────────────────┐
│  Step 1: Update service-registry.yaml                                      │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │ services:                                                            │  │
│  │   - name: new-service                                                │  │
│  │     type: k8s-deployment                                             │  │
│  │     monitoring:                                                      │  │
│  │       prometheus: true                                               │  │
│  │       sentry: true                                                   │  │
│  │     alerts:                                                          │  │
│  │       - pod_crash                                                    │  │
│  │       - high_error_rate                                              │  │
│  └──────────────────────────────────────────────────────────────────────┘  │
└────────────────────────────┬───────────────────────────────────────────────┘
                             │
                             ▼
┌────────────────────────────────────────────────────────────────────────────┐
│  Step 2: git push → CI/CD pipeline                                         │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │ 1. validate_coverage.py → verify every service is registered         │  │
│  │ 2. generate_monitoring.py → generate Prometheus/Alertmanager configs │  │
│  │ 3. kubectl apply → deploy the monitoring configs                     │  │
│  │ 4. Deploy the new service                                            │  │
│  └──────────────────────────────────────────────────────────────────────┘  │
└────────────────────────────┬───────────────────────────────────────────────┘
                             │
                             ▼
┌────────────────────────────────────────────────────────────────────────────┐
│  Step 3: Monitoring takes effect automatically                             │
│ ┌───────────────┐  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐ │
│ │ Prometheus    │  │ Alertmanager  │  │ Sentry        │  │ SignOz        │ │
│ │ scrape metrics│  │ route alerts  │  │ track errors  │  │ collect traces│ │
│ └───────────────┘  └───────────────┘  └───────────────┘  └───────────────┘ │
└────────────────────────────┬───────────────────────────────────────────────┘
                             │
                             ▼
┌────────────────────────────────────────────────────────────────────────────┐
│  Step 4: Anomaly occurs → AI auto-repair                                   │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │ 1. Prometheus fires an alert                                         │  │
│  │ 2. Alertmanager → AWOOOI webhook                                     │  │
│  │ 3. OpenClaw AI analysis                                              │  │
│  │ 4. Automatic or manual repair                                        │  │
│  │ 5. Result feedback → Playbook update                                 │  │
│  └──────────────────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────────────────┘
```

---

## 7. Implementation Checklist

### 7.1 Files to Create

| File | Purpose | Priority |
|------|------|--------|
| `ops/monitoring/service-registry.yaml` | Service registry | P0 |
| `ops/monitoring/generate_monitoring.py` | Config generator | P0 |
| `ops/monitoring/validate_coverage.py` | Coverage check | P0 |
| `ops/monitoring/discover_docker.py` | Container discovery | P1 |
| `apps/api/src/core/auto_monitoring.py` | Automatic API monitoring | P1 |
| `apps/web/src/lib/auto-monitoring.ts` | Automatic frontend monitoring | P1 |

### 7.2 CI/CD Changes

| Change | Purpose | Priority |
|------|------|--------|
| Add `monitoring-validation` job | Check coverage | P0 |
| Add `monitoring-deploy` job | Deploy configs | P0 |

### 7.3 Cron Jobs

| Task | Frequency | Purpose |
|------|------|------|
| `discover_docker.py` | Hourly | Discover new containers |
| `coverage_report.py` | Daily | Generate report |

---

**End of document**
626
ops/monitoring/service-registry.yaml
Normal file
@@ -0,0 +1,626 @@
# AWOOOI Service Registry (Single Source of Truth)
# ===========================================
# Version: v1.0
# Created: 2026-03-29
# Purpose: unified list of all monitored services
#
# When adding a service:
# 1. Add an entry to this file
# 2. CI/CD generates the matching monitoring config automatically
# 3. Monitoring takes effect automatically after deployment

# =============================================================================
# K8s workloads (awoooi-prod namespace)
# =============================================================================
services:
  # --- API backend ---
  - name: awoooi-api
    type: k8s-deployment
    namespace: awoooi-prod
    replicas: 2
    port: 8000
    health_endpoint: /api/v1/health
    monitoring:
      prometheus: true
      sentry: true
      otel: true
      langfuse: true
    alerts:
      - pod_crash
      - high_error_rate
      - slow_response
      - memory_high
    auto_repair:
      enabled: true
      actions:
        - restart_pod
        - scale_up
    owner: backend-team
    criticality: P0

  # --- Web frontend ---
  - name: awoooi-web
    type: k8s-deployment
    namespace: awoooi-prod
    replicas: 2
    port: 3000
    health_endpoint: /
    monitoring:
      prometheus: true
      sentry: true
      otel: true
      langfuse: false
    alerts:
      - pod_crash
      - slow_page_load
    auto_repair:
      enabled: true
      actions:
        - restart_pod
    owner: frontend-team
    criticality: P0

  # --- Signal Worker ---
  - name: awoooi-worker
    type: k8s-deployment
    namespace: awoooi-prod
    replicas: 1
    health_endpoint: /tmp/worker-healthy
    health_type: exec_mtime
    monitoring:
      prometheus: true
      sentry: true
      otel: true
      langfuse: true
    alerts:
      - worker_stuck
      - queue_backlog
    auto_repair:
      enabled: true
      actions:
        - restart_pod
    owner: backend-team
    criticality: P1

  # --- ArgoCD ---
  - name: argocd-server
    type: k8s-deployment
    namespace: argocd
    port: 443
    health_endpoint: /healthz
    monitoring:
      prometheus: true
      sentry: false
      otel: false
    alerts:
      - service_down
      - sync_failed
    owner: devops-team
    criticality: P1

  # --- Prometheus ---
  - name: prometheus
    type: k8s-deployment
    namespace: monitoring
    port: 9090
    health_endpoint: /-/ready
    monitoring:
      prometheus: false  # self-monitoring would be circular
      sentry: false
    alerts:
      - service_down
    owner: devops-team
    criticality: P0

  # --- Alertmanager ---
  - name: alertmanager
    type: k8s-deployment
    namespace: monitoring
    port: 9093
    health_endpoint: /-/ready
    monitoring:
      prometheus: true
      sentry: false
    alerts:
      - service_down
    owner: devops-team
    criticality: P0

  # =============================================================================
  # Docker containers (192.168.0.188 - AI/Web hub)
  # =============================================================================

  # --- Ollama LLM ---
  - name: ollama
    type: docker
    host: 192.168.0.188
    port: 11434
    health_endpoint: /api/tags
    monitoring:
      prometheus: true
      sentry: false
      otel: false
    alerts:
      - service_down
      - inference_timeout
      - model_load_failed
    auto_repair:
      enabled: true
      actions:
        - restart_container
    owner: ai-team
    criticality: P0

  # --- OpenClaw AI decision center ---
  - name: openclaw
    type: docker
    host: 192.168.0.188
    port: 8089
    health_endpoint: /health
    monitoring:
      prometheus: true
      sentry: true
      otel: true
      langfuse: true
    alerts:
      - service_down
      - analysis_timeout
      - high_error_rate
    auto_repair:
      enabled: true
      actions:
        - restart_container
    owner: ai-team
    criticality: P0

  # --- Redis Stack ---
  - name: redis
    type: docker
    host: 192.168.0.188
    port: 6380
    health_endpoint: redis-cli ping
    health_type: exec
    monitoring:
      prometheus: true
      sentry: false
    alerts:
      - service_down
      - memory_high
      - connection_rejected
    auto_repair:
      enabled: false  # handle data stores with caution
    owner: infra-team
    criticality: P0

  # --- PostgreSQL ---
  - name: postgres
    type: docker
    host: 192.168.0.188
    port: 5432
    health_endpoint: pg_isready
    health_type: exec
    monitoring:
      prometheus: true
      sentry: false
    alerts:
      - service_down
      - connection_pool_exhausted
      - slow_query
      - replication_lag
    auto_repair:
      enabled: false  # handle databases with caution
    owner: infra-team
    criticality: P0

  # --- SignOz OTEL Collector ---
  - name: signoz-collector
    type: docker
    host: 192.168.0.188
    port: 24317
    health_endpoint: grpc_health
    health_type: grpc
    monitoring:
      prometheus: true
      sentry: false
    alerts:
      - service_down
      - trace_dropped
    owner: devops-team
    criticality: P1

  # --- SignOz UI ---
  - name: signoz-ui
    type: docker
    host: 192.168.0.188
    port: 3301
    health_endpoint: /
    monitoring:
      prometheus: true
      sentry: false
    alerts:
      - service_down
    owner: devops-team
    criticality: P2

  # --- ClickHouse (SignOz backend) ---
  - name: clickhouse
    type: docker
    host: 192.168.0.188
    port: 8123
    health_endpoint: /ping
    monitoring:
      prometheus: true
      sentry: false
    alerts:
      - service_down
      - disk_space_low
      - query_timeout
    owner: devops-team
    criticality: P1

  # =============================================================================
  # Docker containers (192.168.0.110 - DevOps hub)
  # =============================================================================

  # --- Harbor Registry ---
  - name: harbor
    type: docker
    host: 192.168.0.110
    port: 5000
    health_endpoint: /api/v2.0/health
    monitoring:
      prometheus: true
      sentry: false
    alerts:
      - service_down
      - storage_full
      - push_failed
    owner: devops-team
    criticality: P0

  # --- Sentry ---
  - name: sentry
    type: docker
    host: 192.168.0.110
    port: 9000
    health_endpoint: /_health/
    monitoring:
      prometheus: true
      sentry: false  # self-monitoring would be circular
    alerts:
      - service_down
    owner: devops-team
    criticality: P1

  # --- Langfuse LLMOps ---
  - name: langfuse
    type: docker
    host: 192.168.0.110
    port: 3100
    health_endpoint: /api/public/health
    monitoring:
      prometheus: true
      sentry: false
    alerts:
      - service_down
      - trace_lost
    owner: ai-team
    criticality: P2

  # --- GitHub Actions Runner ---
  - name: github-runner
    type: systemd
    host: 192.168.0.110
    service_name: actions.runner.owenhytsai-awoooi.awoooi-110.service
    monitoring:
      prometheus: true
      sentry: false
    alerts:
      - runner_offline
      - job_stuck
    auto_repair:
      enabled: true
      actions:
        - restart_service
    owner: devops-team
    criticality: P0

# =============================================================================
# Host nodes
# =============================================================================
nodes:
  - name: mon
    ip: 192.168.0.120
    role: k3s-master
    alerts:
      - node_down
      - cpu_high
      - memory_high
      - disk_space_low
      - etcd_latency_high
    owner: infra-team

  - name: mon1
    ip: 192.168.0.121
    role: k3s-worker
    alerts:
      - node_down
      - node_not_ready
      - cpu_high
      - memory_high
      - disk_space_low
    owner: infra-team

  - name: harbor
    ip: 192.168.0.110
    role: devops
    alerts:
      - node_down
      - cpu_high
      - memory_high
      - disk_space_low
    owner: devops-team

  - name: pg
    ip: 192.168.0.188
    role: ai-web
    alerts:
      - node_down
      - cpu_high
      - memory_high
      - disk_space_low
      - gpu_utilization_high
    owner: ai-team

  - name: kali
    ip: 192.168.0.112
    role: security
    alerts:
      - node_down
    owner: security-team

# =============================================================================
# Frontend pages
# =============================================================================
pages:
  - path: /
    name: Dashboard
    monitoring:
      sentry_session: true
      web_vitals: true
    alerts:
      - slow_page_load
      - js_error
    slo:
      lcp_ms: 2500
      fid_ms: 100
      cls: 0.1

  - path: /authorizations
    name: Authorization Management
    monitoring:
      sentry_session: true
      web_vitals: true
    alerts:
      - slow_page_load
      - api_error
    slo:
      lcp_ms: 2000

  - path: /action-logs
    name: Action Logs
    monitoring:
      sentry_session: true
      web_vitals: true
    alerts:
      - slow_page_load

  - path: /errors
    name: Error Tracking
    monitoring:
      sentry_session: true
      web_vitals: true
    alerts:
      - slow_page_load

  - path: /settings
    name: Settings
    monitoring:
      sentry_session: true
    alerts:
      - slow_page_load

  - path: /knowledge-base
    name: Knowledge Base
    monitoring:
      sentry_session: true
    alerts:
      - slow_page_load

# =============================================================================
# API endpoints (critical)
# =============================================================================
api_endpoints:
  - path: /api/v1/health
    method: GET
    critical: true
    slo:
      latency_p95_ms: 100
      availability: 99.99

  - path: /api/v1/approvals
    method: GET
    critical: true
    slo:
      latency_p95_ms: 500
      availability: 99.9

  - path: /api/v1/approvals/{id}/sign
    method: POST
    critical: true
    slo:
      latency_p95_ms: 1000
      availability: 99.9

  - path: /api/v1/incidents
    method: GET
    critical: true
    slo:
      latency_p95_ms: 500
      availability: 99.9

  - path: /api/v1/analyze
    method: POST
    critical: true
    slo:
      latency_p95_ms: 30000  # 30s (LLM analysis)
      availability: 95

  - path: /api/v1/webhooks/alertmanager
    method: POST
    critical: true
    slo:
      latency_p95_ms: 5000
      availability: 99.9

  - path: /api/v1/webhooks/sentry/error
    method: POST
    critical: true
    slo:
      latency_p95_ms: 5000
      availability: 99.9

  - path: /api/v1/execute
    method: POST
    critical: true
    slo:
      latency_p95_ms: 10000
      availability: 99

# =============================================================================
# AI services (special monitoring)
# =============================================================================
ai_services:
  - name: gemini-api
    type: external
    rate_limit:
      requests_per_minute: 60
      tokens_per_minute: 100000
    alerts:
      - rate_limit_hit
      - budget_exceeded
    fallback: ollama
    cost_tracking: true

  - name: claude-api
    type: external
    rate_limit:
      requests_per_minute: 50
      tokens_per_minute: 100000
    alerts:
      - rate_limit_hit
      - budget_exceeded
    fallback: gemini
    cost_tracking: true

  - name: ollama-local
    type: local
    models:
      - qwen2.5:7b
      - llama3.2:3b
    alerts:
      - model_load_failed
      - inference_timeout
    cost_tracking: false

# =============================================================================
# Alert templates
# =============================================================================
alert_templates:
  pod_crash:
    expr: 'kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0'
    for: 2m
    severity: critical
    auto_repair: restart_pod

  high_error_rate:
    # Aggregate both sides with sum(); otherwise the division only matches 5xx series against themselves
    expr: 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01'
    for: 5m
    severity: critical
    auto_repair: restart_pod

  slow_response:
    # histogram_quantile needs per-bucket rates aggregated by `le`, not raw cumulative counters
    expr: 'histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2'
    for: 5m
    severity: warning
    auto_repair: scale_up

  service_down:
    expr: 'probe_success == 0'
    for: 1m
    severity: critical
    auto_repair: restart_container

  memory_high:
    expr: 'container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9'
    for: 5m
    severity: warning
    auto_repair: analyze_memory_leak

  disk_space_low:
    expr: 'node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.15'
    for: 10m
    severity: warning
    auto_repair: cleanup_logs

  inference_timeout:
    expr: 'ollama_inference_duration_seconds > 60'
    for: 3m
    severity: warning
    auto_repair: switch_model

  runner_offline:
    expr: 'github_runner_status == 0'
    for: 5m
    severity: critical
    auto_repair: restart_service

# =============================================================================
# Auto-repair actions
# =============================================================================
auto_repair_actions:
  restart_pod:
    command: 'kubectl rollout restart deployment/{name} -n {namespace}'
    risk: low
    cooldown_minutes: 10

  scale_up:
    # kubectl scale only accepts an absolute replica count, so the executor must
    # compute the target (current + 1, capped at max_replicas) before rendering.
    # {target_replicas} is an assumed placeholder, not part of the original design.
    command: 'kubectl scale deployment/{name} --replicas={target_replicas} -n {namespace}'
    risk: low
    max_replicas: 5
    cooldown_minutes: 15

  restart_container:
    command: 'ssh {host} docker restart {container}'
    risk: low
    cooldown_minutes: 10

  restart_service:
    command: 'ssh {host} sudo systemctl restart {service_name}'
    risk: low
    cooldown_minutes: 10

  switch_model:
    command: 'internal:switch_to_smaller_model'
    risk: low
    cooldown_minutes: 5

  cleanup_logs:
    command: 'ssh {host} find /var/log -name "*.log" -mtime +7 -delete'
    risk: low
    cooldown_minutes: 60

  analyze_memory_leak:
    command: 'internal:trigger_memory_analysis'
    risk: low
    cooldown_minutes: 30
247
ops/monitoring/validate_coverage.py
Executable file
@@ -0,0 +1,247 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
AWOOOI 監控覆蓋率驗證
|
||||
====================
|
||||
CI/CD 階段執行,確保所有服務都有對應的監控配置
|
||||
|
||||
用法:
|
||||
python ops/monitoring/validate_coverage.py
|
||||
|
||||
退出碼:
|
||||
0 - 所有服務都已註冊
|
||||
1 - 發現未監控的服務
|
||||
"""
|
||||
|
||||
import subprocess
|
||||
import sys
|
||||
import yaml
|
||||
from pathlib import Path
|
||||
from typing import NamedTuple
|
||||
|
||||
|
||||
class ValidationResult(NamedTuple):
    """Validation result."""
    passed: bool
    errors: list[str]
    warnings: list[str]
    coverage: dict[str, float]


def load_registry() -> dict:
    """Load the service registry."""
    registry_path = Path(__file__).parent / 'service-registry.yaml'
    # The registry may contain non-ASCII text, so read it as UTF-8 regardless of locale.
    with open(registry_path, encoding='utf-8') as f:
        return yaml.safe_load(f)


def get_k8s_deployments() -> list[dict]:
    """Return all K8s Deployments as namespace/name pairs."""
    try:
        result = subprocess.run(
            [
                'kubectl', 'get', 'deployments', '-A',
                '-o', 'jsonpath={range .items[*]}{.metadata.namespace}/{.metadata.name}{"\\n"}{end}'
            ],
            # check=True turns a failing kubectl call into the warning below
            # instead of an empty result that looks like "no deployments".
            capture_output=True, text=True, timeout=30, check=True
        )
        deployments = []
        for line in result.stdout.strip().split('\n'):
            if line and '/' in line:
                ns, name = line.split('/', 1)
                deployments.append({'namespace': ns, 'name': name})
        return deployments
    except Exception as e:
        print(f"Warning: Cannot get K8s deployments: {e}")
        return []


def get_k8s_services() -> list[dict]:
    """Return all K8s Services as namespace/name pairs."""
    try:
        result = subprocess.run(
            [
                'kubectl', 'get', 'services', '-A',
                '-o', 'jsonpath={range .items[*]}{.metadata.namespace}/{.metadata.name}{"\\n"}{end}'
            ],
            # check=True turns a failing kubectl call into the warning below
            # instead of an empty result that looks like "no services".
            capture_output=True, text=True, timeout=30, check=True
        )
        services = []
        for line in result.stdout.strip().split('\n'):
            if line and '/' in line:
                ns, name = line.split('/', 1)
                services.append({'namespace': ns, 'name': name})
        return services
    except Exception as e:
        print(f"Warning: Cannot get K8s services: {e}")
        return []


def check_docker_containers(host: str) -> list[str]:
    """List the running Docker containers on a host."""
    try:
        result = subprocess.run(
            ['ssh', '-o', 'ConnectTimeout=5', host, 'docker', 'ps', '--format', '{{.Names}}'],
            # check=True reports an unreachable host or failing docker call as a warning.
            capture_output=True, text=True, timeout=10, check=True
        )
        return [c for c in result.stdout.strip().split('\n') if c]
    except Exception as e:
        print(f"Warning: Cannot check Docker on {host}: {e}")
        return []


def validate_registry(registry: dict) -> ValidationResult:
    """Validate that the service registry is complete."""
    errors = []
    warnings = []

    registered_services = {s['name'] for s in registry.get('services', [])}
    registered_k8s = {
        (s['namespace'], s['name'])
        for s in registry.get('services', [])
        if s.get('type') == 'k8s-deployment'
    }

    # ==========================================================================
    # 1. Check K8s Deployments
    # ==========================================================================
    k8s_deployments = get_k8s_deployments()
    ignored_namespaces = {'kube-system', 'kube-public', 'kube-node-lease', 'local-path-storage'}
    ignored_prefixes = {'coredns', 'metrics-server', 'local-path-provisioner'}

    for deploy in k8s_deployments:
        ns, name = deploy['namespace'], deploy['name']

        # Skip system namespaces
        if ns in ignored_namespaces:
            continue

        # Skip system components
        if any(name.startswith(p) for p in ignored_prefixes):
            continue

        if (ns, name) not in registered_k8s:
            errors.append(f"K8s Deployment '{ns}/{name}' is not registered in service-registry.yaml")

    # ==========================================================================
    # 2. Check Docker containers
    # ==========================================================================
    docker_hosts = ['192.168.0.188', '192.168.0.110']
    docker_services = {
        s['name']
        for s in registry.get('services', [])
        if s.get('type') == 'docker'
    }

    ignored_containers = {
        'k3s', 'pause', 'registry', 'nginx', 'traefik',
        # SignOz container group
        'signoz-alertmanager', 'signoz-query-service', 'signoz-otel-collector-metrics',
        'zookeeper', 'clickhouse',
        # Sentry container group
        'sentry-web', 'sentry-worker', 'sentry-cron', 'sentry-kafka', 'sentry-redis',
        'sentry-postgres', 'sentry-zookeeper', 'sentry-snuba',
    }

    for host in docker_hosts:
        containers = check_docker_containers(host)
        for container in containers:
            if not container:
                continue

            # Skip known system containers (substring match on the container name)
            if any(ignored in container for ignored in ignored_containers):
                continue

            # Take the leading name segment (text before the first '_' or '-'),
            # which also drops Compose-style suffixes such as _1 or -1
            base_name = container.split('_')[0].split('-')[0]

            if container not in docker_services and base_name not in docker_services:
                warnings.append(f"Docker container '{container}' on {host} is not in the registry (may need to be added)")

    # ==========================================================================
    # 3. Check API endpoint coverage
    # ==========================================================================
    api_endpoints = registry.get('api_endpoints', [])
    critical_endpoints = [e for e in api_endpoints if e.get('critical')]

    if len(critical_endpoints) < 5:
        warnings.append(f"Only {len(critical_endpoints)} critical API endpoints are defined; at least 5 are recommended")

    # ==========================================================================
    # 4. Check frontend page coverage
    # ==========================================================================
    pages = registry.get('pages', [])
    if len(pages) < 3:
        warnings.append(f"Only {len(pages)} frontend pages are monitored; at least 3 are recommended")

    # ==========================================================================
    # 5. Compute coverage
    # ==========================================================================
    services = registry.get('services', [])
    total = len(services) if services else 1

    coverage = {
        'prometheus': sum(1 for s in services if s.get('monitoring', {}).get('prometheus')) / total,
        'sentry': sum(1 for s in services if s.get('monitoring', {}).get('sentry')) / total,
        'otel': sum(1 for s in services if s.get('monitoring', {}).get('otel')) / total,
        'alerts': sum(1 for s in services if s.get('alerts')) / total,
        'auto_repair': sum(1 for s in services if s.get('auto_repair', {}).get('enabled')) / total,
    }

    # Warn when coverage is below 80%
    for metric, rate in coverage.items():
        if rate < 0.8:
            warnings.append(f"{metric} coverage is only {rate:.0%}; raising it to 80% or higher is recommended")

    passed = len(errors) == 0
    return ValidationResult(passed=passed, errors=errors, warnings=warnings, coverage=coverage)


def print_report(result: ValidationResult):
    """Print the validation report."""
    print("\n" + "=" * 60)
    print("AWOOOI monitoring coverage validation report")
    print("=" * 60)

    # Coverage
    print("\n📊 Coverage:")
    for metric, rate in result.coverage.items():
        status = "✅" if rate >= 0.8 else "⚠️" if rate >= 0.5 else "❌"
        print(f"  {status} {metric}: {rate:.0%}")

    # Errors
    if result.errors:
        print(f"\n❌ Errors ({len(result.errors)}):")
        for err in result.errors:
            print(f"  • {err}")

    # Warnings
    if result.warnings:
        print(f"\n⚠️ Warnings ({len(result.warnings)}):")
        for warn in result.warnings:
            print(f"  • {warn}")

    # Conclusion
    print("\n" + "-" * 60)
    if result.passed:
        print("✅ Validation passed - all critical services are registered for monitoring")
    else:
        print("❌ Validation failed - please update ops/monitoring/service-registry.yaml")
    print("=" * 60 + "\n")


def main():
    """Entry point."""
    registry = load_registry()
    result = validate_registry(registry)
    print_report(result)

    # Exit with code 1 on errors
    if not result.passed:
        sys.exit(1)

    sys.exit(0)


if __name__ == '__main__':
    main()
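The two core checks in `validate_registry`, the registered-deployment comparison and the per-signal coverage ratio, can be tried without a cluster or SSH access. The sketch below is a simplified restatement under stated assumptions rather than a replacement for the script: `find_unregistered`, `coverage_rate`, and the sample data (including the `demo` namespace) are hypothetical names made up for illustration.

```python
def find_unregistered(deployments, registry,
                      ignored_namespaces=frozenset({"kube-system"})):
    """Return 'namespace/name' for every deployment missing from the registry."""
    registered = {
        (s["namespace"], s["name"])
        for s in registry.get("services", [])
        if s.get("type") == "k8s-deployment"
    }
    return sorted(
        f"{d['namespace']}/{d['name']}"
        for d in deployments
        if d["namespace"] not in ignored_namespaces
        and (d["namespace"], d["name"]) not in registered
    )


def coverage_rate(services, key):
    """Share of services whose 'monitoring' block enables the given key."""
    total = len(services) or 1  # avoid division by zero on an empty registry
    return sum(1 for s in services if s.get("monitoring", {}).get(key)) / total


# Hypothetical sample data standing in for service-registry.yaml and kubectl output.
SAMPLE_REGISTRY = {
    "services": [
        {"name": "api", "namespace": "demo", "type": "k8s-deployment",
         "monitoring": {"prometheus": True}},
        {"name": "redis", "type": "docker", "monitoring": {}},
    ]
}
SAMPLE_DEPLOYMENTS = [
    {"namespace": "demo", "name": "api"},
    {"namespace": "demo", "name": "worker"},
    {"namespace": "kube-system", "name": "coredns"},
]

missing = find_unregistered(SAMPLE_DEPLOYMENTS, SAMPLE_REGISTRY)
rate = coverage_rate(SAMPLE_REGISTRY["services"], "prometheus")
print(missing, f"{rate:.0%}")
```

Here `demo/worker` is reported because it is deployed but not registered, while `coredns` is skipped through its namespace. Only one of the two sample services enables Prometheus, so that coverage is 50%.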