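Added:

1. MONITORING_COMPLETE_STRATEGY.md - complete monitoring strategy
   - Monitoring matrix: 5 hosts × 60+ services
   - P0/P1/P2 alert rule list
   - AI auto-repair closed loop
   - Safety guardrail configuration
2. MONITORING_INTEGRATION_ARCHITECTURE.md - automated integration architecture
   - Service registry (Single Source of Truth)
   - CI/CD automatically validates monitoring coverage
   - New services get monitoring automatically
3. ops/monitoring/service-registry.yaml - service inventory
   - K8s workloads (API/Web/Worker/ArgoCD)
   - Docker containers (Ollama/OpenClaw/Redis/Postgres)
   - Frontend page SLOs
   - API endpoint SLOs
   - Alert templates and auto-repair actions
4. ops/monitoring/validate_coverage.py - coverage validation
   - Runs in the CI stage
   - Detects unmonitored services
   - Generates a coverage report

Design principles:
- Monitoring as Code
- A new service must be registered in the registry before it can be deployed
- Auto-discovery prevents omissions

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
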
# AWOOOI Complete Monitoring and AI Auto-Repair Strategy

> **Version**: v1.0
> **Created**: 2026-03-29
> **Owner**: Chief Architect (Claude Code)
> **Goal**: 100% monitoring coverage + AI-driven auto-repair

---

## Executive Summary

This document defines the full-stack monitoring strategy for AWOOOI. It covers:
- **5 hosts** × **60+ services** × **4 monitoring layers**
- **Three observability pillars**: Sentry (errors) + SigNoz (traces) + Prometheus (metrics)
- **AI auto-repair closed loop**: anomaly → OpenClaw analysis → automatic execution or human review

---

## 1. Monitoring Coverage Matrix

### 1.1 Host Layer (Infrastructure)

| Host | IP | Role | Monitored items | Alert rules |
|------|----|------|-----------------|-------------|
| **mon (K3s Master)** | 192.168.0.120 | K3s Server + keepalived | CPU/MEM/Disk/etcd | NodeDown, etcdHighLatency |
| **mon1 (K3s Worker)** | 192.168.0.121 | K3s Worker + keepalived | CPU/MEM/Disk/kubelet | NodeNotReady, DiskPressure |
| **harbor (DevOps)** | 192.168.0.110 | Harbor/Sentry/Langfuse/Runner | CPU/MEM/Disk/Docker | HarborDown, RunnerOffline |
| **pg (AI/Web)** | 192.168.0.188 | PostgreSQL/Redis/Ollama/SigNoz | CPU/MEM/Disk/GPU | DBConnectionFailed, OllamaTimeout |
| **kali (Security)** | 192.168.0.112 | Kali Scanner | CPU/MEM/Disk | ScannerOffline |

### 1.2 Service Layer (Services)

#### A. Kubernetes Workloads (K3s)

| Namespace | Deployment/StatefulSet | Replicas | Health check | Alert conditions |
|-----------|------------------------|----------|--------------|------------------|
| **awoooi-prod** | awoooi-api | 2 | HTTP /api/v1/health | PodCrashLoopBackOff, ReplicasUnavailable |
| **awoooi-prod** | awoooi-web | 2 | HTTP / | HighErrorRate, SlowResponse |
| **awoooi-prod** | awoooi-worker | 1 | Exec mtime | WorkerStuck, QueueBacklog |
| **argocd** | argocd-server | 1 | HTTP /healthz | ArgoCDDown |
| **monitoring** | prometheus-server | 1 | HTTP /-/ready | PrometheusDown |
| **monitoring** | alertmanager | 1 | HTTP /-/ready | AlertmanagerDown |
| **velero** | velero | 1 | - | BackupFailed |

#### B. Container Services (Docker on hosts 188/110)

| Host | Container | Port | Health check | Alert conditions |
|------|-----------|------|--------------|------------------|
| 188 | ollama | 11434 | GET /api/tags | OllamaUnresponsive, ModelLoadFailed |
| 188 | openclaw | 8089 | GET /health | OpenClawDown, AnalysisTimeout |
| 188 | signoz-collector | 24317/24318 | gRPC health | TraceDropped |
| 188 | signoz-ui | 3301 | HTTP / | SignOzUIDown |
| 188 | redis-stack | 6380 | redis-cli ping | RedisDown, MemoryExhausted |
| 188 | postgres | 5432 | pg_isready | PostgresDown, ConnectionPoolExhausted |
| 110 | harbor-core | 5000 | GET /api/v2.0/health | HarborDown |
| 110 | sentry-web | 9000 | GET /_health/ | SentryDown |
| 110 | langfuse | 3100 | GET /api/public/health | LangfuseDown |
| 110 | actions-runner | - | systemctl status | RunnerOffline |

### 1.3 Application Layer (Application)

#### A. API Endpoint Monitoring

| Endpoint | Method | Expected response | SLO | Alert |
|----------|--------|-------------------|-----|-------|
| /api/v1/health | GET | 200 | 99.9% | APIHealthCheckFailed |
| /api/v1/approvals/pending | GET | 200 | 99% | ApprovalsAPIError |
| /api/v1/incidents | GET | 200 | 99% | IncidentsAPIError |
| /api/v1/analyze | POST | 200/202 | 95% | AnalysisTimeout (>30s) |
| /api/v1/execute | POST | 200 | 99% | ExecutionFailed |

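To make the SLO column concrete, each availability target can be converted into an error budget. The following is a minimal sketch, assuming a 30-day rolling window (the table does not state a window) and a hypothetical helper named `error_budget_minutes`.

```python
# Convert an availability SLO into an allowed-unavailability budget.
# Assumption: a 30-day rolling window (the table does not state a window).

WINDOW_DAYS = 30


def error_budget_minutes(slo_percent: float, window_days: int = WINDOW_DAYS) -> float:
    """Return the minutes of unavailability an SLO permits over the window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


if __name__ == "__main__":
    # SLO targets copied from the endpoint table.
    for endpoint, slo in [
        ("/api/v1/health", 99.9),
        ("/api/v1/incidents", 99.0),
        ("/api/v1/analyze", 95.0),
    ]:
        print(f"{endpoint}: {error_budget_minutes(slo):.1f} min per {WINDOW_DAYS} days")
```

Under these assumptions, 99.9% leaves about 43 minutes per 30 days, 99% about 7.2 hours, and 95% about 36 hours.
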
#### B. Error-Rate Monitoring (Sentry)

| Type | Threshold | Alert | Auto-repair |
|------|-----------|-------|-------------|
| Unhandled Exception | >0 in 5min | SentryNewError | AI analysis + Playbook matching |
| HTTP 5xx | >1% | HighErrorRate | Pod restart |
| HTTP 4xx | >10% | ClientErrorSpike | Alert + log analysis |
| Slow Transaction | P95 >2s | SlowTransaction | Scaling recommendation |

#### C. Frontend Monitoring

| Metric | Source | Threshold | Alert |
|--------|--------|-----------|-------|
| Page Load Time | Sentry Performance | >3s | SlowPageLoad |
| JS Error Rate | Sentry Issues | >0.1% | FrontendError |
| API Call Failures | Sentry Breadcrumbs | >1% | APICallFailed |
| Web Vitals (LCP/FID/CLS) | Sentry | Google's recommended thresholds | PoorWebVitals |

### 1.4 Data Layer (Data)

| Database | Monitored items | Alert conditions |
|----------|-----------------|------------------|
| **PostgreSQL** | Connection count, QPS, slow queries, WAL lag, disk I/O | ConnectionPoolExhausted (>90%), SlowQuery (>5s), ReplicationLag (>30s) |
| **Redis** | Memory usage, hit rate, latency, key count | MemoryHigh (>80%), HitRateLow (<90%), SlowCommands |
| **ClickHouse** | Disk usage, query latency, insert rate | DiskFull (>85%), QueryTimeout |

### 1.5 AI/LLM Layer

| Service | Monitored items | Alert conditions | Auto-repair |
|---------|-----------------|------------------|-------------|
| **Ollama** | Inference latency, model load status, GPU utilization | InferenceTimeout (>60s), ModelLoadFailed | Container restart |
| **OpenClaw** | Analysis success rate, response time, token usage | AnalysisFailed (>10%), HighTokenCost | Fall back to Gemini |
| **Gemini API** | Rate limit, error rate, cost | RateLimitHit, BudgetExceeded | Downgrade to Ollama |
| **Claude API** | Rate limit, error rate, cost | RateLimitHit, BudgetExceeded | Downgrade to Gemini |
| **Langfuse** | Trace recording success rate | TraceLost (>1%) | Reconnect |

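The downgrade entries in the auto-repair column form a chain (Claude → Gemini → Ollama). A minimal sketch of resolving that chain follows; the lowercase provider keys and the helper names `fallback_chain` and `first_available` are illustrative assumptions, not part of the AWOOOI codebase.

```python
from __future__ import annotations

from typing import Optional

# Fallback edges taken from the auto-repair column of the table above.
# The lowercase provider keys are hypothetical identifiers.
FALLBACK = {"claude": "gemini", "gemini": "ollama"}


def fallback_chain(provider: str) -> list[str]:
    """Return the provider followed by its successive fallbacks."""
    chain = [provider]
    while chain[-1] in FALLBACK:
        nxt = FALLBACK[chain[-1]]
        if nxt in chain:  # guard against accidental cycles in the mapping
            break
        chain.append(nxt)
    return chain


def first_available(provider: str, unavailable: set[str]) -> Optional[str]:
    """Pick the first provider in the chain that is not marked unavailable."""
    for candidate in fallback_chain(provider):
        if candidate not in unavailable:
            return candidate
    return None
```

Keeping the edges in one mapping makes it easy to check that every path ends at the local Ollama instance, which has no external rate limit or budget.
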
### 1.6 CI/CD Layer

| Component | Monitored items | Alert conditions |
|-----------|-----------------|------------------|
| **GitHub Actions** | Workflow status, runner health, job latency | WorkflowFailed, RunnerOffline, JobStuck (>30min) |
| **Harbor** | Image push/pull success rate, storage space | PushFailed, PullFailed, StorageFull |
| **ArgoCD** | Sync status, application health | SyncFailed, AppDegraded |

---

## 2. Complete Alert Rule List

### 2.1 P0 - Critical (respond within 5 minutes)

```yaml
# === Infrastructure layer ===
- alert: NodeDown
  expr: up{job="node-exporter"} == 0
  for: 1m
  labels:
    severity: critical
    auto_repair: "false"  # requires human intervention

- alert: K3sAPIServerDown
  expr: up{job="kubernetes-apiservers"} == 0
  for: 1m
  labels:
    severity: critical
    auto_repair: "false"

- alert: PostgreSQLDown
  expr: pg_up == 0
  for: 30s
  labels:
    severity: critical
    auto_repair: restart_container

- alert: RedisDown
  expr: redis_up == 0
  for: 30s
  labels:
    severity: critical
    auto_repair: restart_container

# === Application layer ===
- alert: AWOOOIAPIDown
  expr: probe_success{job="awoooi-api"} == 0
  for: 1m
  labels:
    severity: critical
    auto_repair: restart_pod

- alert: OpenClawDown
  expr: probe_success{job="openclaw"} == 0
  for: 2m
  labels:
    severity: critical
    auto_repair: restart_container

- alert: PodCrashLoopBackOff
  expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0
  for: 2m
  labels:
    severity: critical
    auto_repair: collect_logs_and_rollback

# === CI/CD layer ===
- alert: GitHubRunnerOffline
  expr: github_runner_status == 0
  for: 5m
  labels:
    severity: critical
    auto_repair: restart_runner_service
```

### 2.2 P1 - High (respond within 15 minutes)

```yaml
# === Performance alerts ===
- alert: HighCPUUsage
  # CPU busy percentage derived from node-exporter idle time
  expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
  for: 5m
  labels:
    severity: high
    auto_repair: scale_up_if_possible

- alert: HighMemoryUsage
  expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
  for: 5m
  labels:
    severity: high
    auto_repair: investigate_memory_leak

- alert: APIHighLatency
  # histogram_quantile needs per-bucket rates aggregated by `le`
  expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
  for: 5m
  labels:
    severity: high
    auto_repair: analyze_slow_endpoints

- alert: HighErrorRate
  # sum() both sides so the ratio is 5xx requests over all requests
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m
  labels:
    severity: high
    auto_repair: restart_pod

- alert: OllamaSlowInference
  expr: ollama_inference_duration_seconds > 60
  for: 3m
  labels:
    severity: high
    auto_repair: switch_to_smaller_model

# === Resource alerts ===
- alert: DiskSpaceLow
  expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.15
  for: 10m
  labels:
    severity: high
    auto_repair: cleanup_old_logs

- alert: PostgreSQLConnectionPoolHigh
  expr: sum by (instance) (pg_stat_activity_count) / on (instance) pg_settings_max_connections > 0.8
  for: 5m
  labels:
    severity: high
    auto_repair: analyze_connection_leaks
```

### 2.3 P2 - Medium (respond within 1 hour)

```yaml
- alert: CertificateExpiringSoon
  expr: ssl_cert_not_after - time() < 14 * 24 * 3600
  labels:
    severity: medium
    auto_repair: renew_certificate

- alert: BackupNotSuccessful
  # no successful Velero backup in the last 24 hours
  expr: increase(velero_backup_success_total[24h]) < 1
  labels:
    severity: medium
    auto_repair: trigger_backup

- alert: LangfuseTraceLoss
  expr: langfuse_trace_drop_rate > 0.01
  labels:
    severity: medium
    auto_repair: reconnect_langfuse
```

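The response targets in the three headings (5 minutes, 15 minutes, 1 hour) can be checked mechanically. The sketch below is one possible shape for such a check, assuming that "response" means acknowledgement of the alert; the function names and that interpretation are assumptions, not taken from the AWOOOI implementation.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Response targets from the P0/P1/P2 section headings, keyed by severity label.
RESPONSE_SLA = {
    "critical": timedelta(minutes=5),   # P0
    "high": timedelta(minutes=15),      # P1
    "medium": timedelta(hours=1),       # P2
}


def response_deadline(severity: str, fired_at: datetime) -> datetime:
    """Return the latest time by which the alert should be acknowledged."""
    return fired_at + RESPONSE_SLA[severity]


def is_breached(
    severity: str,
    fired_at: datetime,
    now: datetime,
    acknowledged_at: Optional[datetime] = None,
) -> bool:
    """True if the alert was not acknowledged before its response deadline."""
    deadline = response_deadline(severity, fired_at)
    reference = acknowledged_at if acknowledged_at is not None else now
    return reference > deadline


if __name__ == "__main__":
    fired = datetime(2026, 3, 29, 12, 0, tzinfo=timezone.utc)
    print(is_breached("critical", fired, now=fired + timedelta(minutes=6)))
```

A breach result like this could feed an escalation step, for example re-sending the alert to a wider audience.
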
---

## 3. AI Auto-Repair Closed Loop

### 3.1 Repair Flow Diagram

```
┌───────────────────────────────────────────────────────────────────┐
│ Anomaly occurs                                                    │
│ (Prometheus Alert / Sentry Issue / SigNoz Anomaly)                │
└─────────────────────────────────┬─────────────────────────────────┘
                                  │
                                  ▼
┌───────────────────────────────────────────────────────────────────┐
│ Alertmanager routing                                              │
│ routes: awoooi / infra / aiops                                    │
└─────────────────────────────────┬─────────────────────────────────┘
                                  │
                                  ▼
┌───────────────────────────────────────────────────────────────────┐
│ AWOOOI API: /api/v1/webhooks/alertmanager                         │
│ 1. Receive → 2. Dedupe (10 min fingerprint) → 3. Create Incident  │
└─────────────────────────────────┬─────────────────────────────────┘
                                  │
                                  ▼
┌───────────────────────────────────────────────────────────────────┐
│ OpenClaw AI analysis engine                                       │
├───────────────────────────────────────────────────────────────────┤
│ Input:                                                            │
│ - Alert content (labels, annotations)                             │
│ - K8s context (Pod logs, events, metrics)                         │
│ - Historical Playbooks (similar cases)                            │
│ - SigNoz traces (related spans)                                   │
│ - Sentry issues (related errors)                                  │
├───────────────────────────────────────────────────────────────────┤
│ Output:                                                           │
│ - suggested_action: RESTART_POD | DELETE_POD | SCALE_UP | ...     │
│ - confidence: 0.0-1.0                                             │
│ - risk_level: LOW | MEDIUM | CRITICAL                             │
│ - blast_radius: {affected_pods, estimated_downtime}               │
│ - kubectl_command: the concrete command                           │
│ - reasoning: decision rationale (Traditional Chinese)             │
└─────────────────────────────────┬─────────────────────────────────┘
                                  │
                   ┌──────────────┴──────────────┐
                   │                             │
                   ▼                             ▼
      ┌─────────────────────────┐   ┌─────────────────────────┐
      │ confidence >= 0.85      │   │ confidence < 0.85       │
      │ AND risk_level = LOW    │   │ OR risk_level > LOW     │
      │ → auto-execute          │   │ → human review          │
      └────────────┬────────────┘   └────────────┬────────────┘
                   │                             │
                   │                             ▼
                   │                ┌─────────────────────────┐
                   │                │ Telegram review card    │
                   │                │ [✅ Approve] [❌ Reject]│
                   │                │ [⏰ Later] [🔕 Mute]    │
                   │                └────────────┬────────────┘
                   │                             │
                   │                  ┌──────────┴──────────┐
                   │                  ▼                     ▼
                   │            ┌──────────┐          ┌──────────┐
                   │            │ Approved │          │ Rejected │
                   │            └─────┬────┘          └─────┬────┘
                   │                  │                     │
                   ├──────────────────┘                     │
                   ▼                                        ▼
      ┌─────────────────────────┐               ┌───────────────────────┐
      │ K8s Executor runs       │               │ Record reject reason  │
      │ kubectl $command        │               │ Update Playbook       │
      └────────────┬────────────┘               └───────────────────────┘
                   │
                   ▼
      ┌─────────────────────────┐
      │ Verify the result       │
      │ - Health check passes?  │
      │ - Error rate dropped?   │
      │ - Latency recovered?    │
      └────────────┬────────────┘
                   │
          ┌────────┴────────────┐
          ▼                     ▼
┌───────────────────┐ ┌───────────────────┐
│ Repair succeeded  │ │ Repair failed     │
│ - Close Incident  │ │ - Escalate alert  │
│ - Update Playbook │ │ - Record failure  │
│ - Notify Telegram │ │ - Manual takeover │
└───────────────────┘ └───────────────────┘
```

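The branch in the middle of the diagram (auto-execute versus human review) reduces to a small decision function. The sketch below assumes the 0.85 confidence threshold and the LOW risk ceiling shown in the diagram; the `Analysis` dataclass and the `route` function are hypothetical names, and treating an unknown risk level as "needs review" is an added fail-safe assumption.

```python
from dataclasses import dataclass

# Ordering of the risk levels listed in the analysis output.
RISK_ORDER = {"LOW": 0, "MEDIUM": 1, "CRITICAL": 2}


@dataclass(frozen=True)
class Analysis:
    suggested_action: str
    confidence: float  # 0.0-1.0
    risk_level: str    # LOW | MEDIUM | CRITICAL


def route(analysis: Analysis, min_confidence: float = 0.85, max_auto_risk: str = "LOW") -> str:
    """Return "AUTO_EXECUTE" or "HUMAN_REVIEW" for an analysis result."""
    risk = RISK_ORDER.get(analysis.risk_level)
    if risk is None:
        return "HUMAN_REVIEW"  # unknown risk level: fail safe (assumption)
    if analysis.confidence >= min_confidence and risk <= RISK_ORDER[max_auto_risk]:
        return "AUTO_EXECUTE"
    return "HUMAN_REVIEW"


if __name__ == "__main__":
    print(route(Analysis("RESTART_POD", 0.92, "LOW")))
    print(route(Analysis("ROLLBACK", 0.92, "MEDIUM")))
```

Both conditions must hold for automatic execution, so a high-confidence MEDIUM-risk suggestion still goes to a reviewer.
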
### 3.2 Auto-Repair Action List

| Action | Trigger | Command | Risk level | Auto-execute? |
|--------|---------|---------|------------|---------------|
| `RESTART_POD` | PodCrashLoop, HighErrorRate | `kubectl rollout restart deployment/{name}` | LOW | ✅ Yes |
| `DELETE_POD` | PodStuck, OOMKilled | `kubectl delete pod {name} --grace-period=30` | LOW | ✅ Yes |
| `SCALE_UP` | HighCPU, HighMemory, SlowResponse | `kubectl scale deployment/{name} --replicas={current+1}` | LOW | ✅ Yes |
| `SCALE_DOWN` | ResourceWaste | `kubectl scale deployment/{name} --replicas={current-1}` | MEDIUM | ❌ Needs review |
| `ROLLBACK` | DeploymentFailed, VersionDrift | `kubectl rollout undo deployment/{name}` | MEDIUM | ❌ Needs review |
| `RESTART_CONTAINER` | ContainerUnhealthy | `docker restart {container}` | LOW | ✅ Yes |
| `CLEAR_CACHE` | RedisMemoryHigh, StaleCache | `redis-cli FLUSHDB` | MEDIUM | ❌ Needs review |
| `VACUUM_DB` | TableBloat, SlowQuery | `VACUUM ANALYZE {table}` | MEDIUM | ❌ Needs review |
| `RENEW_CERT` | CertExpiring | `certbot renew` | LOW | ✅ Yes |
| `CLEANUP_LOGS` | DiskSpaceLow | `find /var/log -type f -mtime +7 -delete` | LOW | ✅ Yes |
| `SWITCH_MODEL` | OllamaTimeout | Switch to a smaller model | LOW | ✅ Yes |
| `FALLBACK_AI` | GeminiRateLimit | Gemini → Ollama | LOW | ✅ Yes |

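Since `kubectl scale` takes an absolute replica count, a scaling action has to compute the target from the current count before it builds the command. The sketch below shows one way to render a few of the kubectl actions as argument lists (suitable for `subprocess.run` without a shell); the `build_command` helper and its name validation are illustrative assumptions.

```python
import re
from typing import List, Optional

# Conservative pattern for Kubernetes object names (lowercase DNS-style).
_NAME_RE = re.compile(r"^[a-z0-9]([-a-z0-9.]*[a-z0-9])?$")


def build_command(
    action: str,
    name: str,
    namespace: str,
    current_replicas: Optional[int] = None,
) -> List[str]:
    """Render an action from the table as a kubectl argument list."""
    for value in (name, namespace):
        if not _NAME_RE.match(value):
            raise ValueError(f"invalid Kubernetes name: {value!r}")
    base = ["kubectl", "-n", namespace]
    if action == "RESTART_POD":
        return base + ["rollout", "restart", f"deployment/{name}"]
    if action == "DELETE_POD":
        return base + ["delete", "pod", name, "--grace-period=30"]
    if action == "SCALE_UP":
        if current_replicas is None:
            raise ValueError("SCALE_UP needs the current replica count")
        return base + ["scale", f"deployment/{name}", f"--replicas={current_replicas + 1}"]
    if action == "ROLLBACK":
        return base + ["rollout", "undo", f"deployment/{name}"]
    raise ValueError(f"unsupported action: {action}")


if __name__ == "__main__":
    print(" ".join(build_command("SCALE_UP", "awoooi-api", "awoooi-prod", current_replicas=2)))
```

Returning a list rather than a formatted string keeps AI-suggested names from being interpreted by a shell.
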
### 3.3 Safety Guardrails

```python
# === Safety limits for auto-repair ===
SAFETY_GUARDRAILS = {
    # Rate limits
    "max_repairs_per_hour": 5,        # at most 5 automatic repairs per hour
    "max_repairs_per_resource": 3,    # at most 3 per hour for the same resource
    "cooldown_after_failure": 600,    # 10-minute cooldown after a failure (seconds)

    # Risk limits
    "auto_approve_max_risk": "LOW",        # auto-approve LOW risk only
    "auto_approve_min_confidence": 0.85,   # minimum confidence of 85%

    # Blast-radius limits
    "max_affected_pods": 3,       # affect at most 3 Pods
    "min_healthy_replicas": 1,    # keep at least 1 healthy replica

    # Never executed automatically
    "blacklist_actions": [
        "DROP_DATABASE",
        "DELETE_NAMESPACE",
        "FORCE_DELETE_PVC",
        "DELETE_SECRET",
    ],

    # Allowlisted namespaces
    "allowed_namespaces": [
        "awoooi-prod",
        "monitoring",
    ],
}
```

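The three rate-limit settings can be enforced with a sliding one-hour window. The following is a minimal in-memory sketch, not the AWOOOI implementation: the `RepairLimiter` class is hypothetical, it applies the cooldown per resource (the configuration does not say whether the cooldown is global), and a multi-replica deployment would need a shared store instead of process memory.

```python
import time
from collections import defaultdict, deque


class RepairLimiter:
    """Sliding-window limiter for automatic repairs (in-memory sketch)."""

    WINDOW = 3600  # seconds in the one-hour window

    def __init__(self, max_per_hour=5, max_per_resource=3, cooldown=600, clock=time.monotonic):
        self.max_per_hour = max_per_hour
        self.max_per_resource = max_per_resource
        self.cooldown = cooldown
        self._clock = clock
        self._all = deque()                      # timestamps of all repairs
        self._by_resource = defaultdict(deque)   # timestamps per resource
        self._blocked_until = {}                 # resource -> end of cooldown

    def _prune(self, window, now):
        # Drop timestamps that have left the one-hour window.
        while window and now - window[0] >= self.WINDOW:
            window.popleft()

    def allow(self, resource: str) -> bool:
        """Check whether another automatic repair on `resource` is permitted."""
        now = self._clock()
        self._prune(self._all, now)
        self._prune(self._by_resource[resource], now)
        if now < self._blocked_until.get(resource, float("-inf")):
            return False  # still cooling down after a failed repair
        return (
            len(self._all) < self.max_per_hour
            and len(self._by_resource[resource]) < self.max_per_resource
        )

    def record(self, resource: str, success: bool) -> None:
        """Register an executed repair and start a cooldown if it failed."""
        now = self._clock()
        self._all.append(now)
        self._by_resource[resource].append(now)
        if not success:
            self._blocked_until[resource] = now + self.cooldown
```

Passing the clock in as a parameter keeps the limiter testable without sleeping.
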
---

## 4. Monitoring Data Flow Integration

### 4.1 Sentry → OpenClaw

```python
# /api/v1/webhooks/sentry - Sentry Issue Alert Webhook
# `redis`, `incident_service`, `openclaw`, `telegram` and `extract_trace_id`
# are application-level objects defined elsewhere in the AWOOOI API.
async def handle_sentry_webhook(payload: dict) -> dict:
    """
    1. Parse the Sentry issue
    2. Deduplicate (10-minute TTL)
    3. Create an Incident
    4. Trigger OpenClaw analysis
    5. Push to Telegram
    """
    issue = payload["data"]["issue"]

    # Deduplicate atomically: SET NX succeeds only for the first delivery,
    # which avoids the race between a separate GET and SETEX.
    if not await redis.set(f"sentry_dedup:{issue['id']}", "1", ex=600, nx=True):
        return {"status": "deduplicated"}

    # Create the Incident
    incident = await incident_service.create_from_sentry(payload)

    # AI analysis
    analysis = await openclaw.analyze_error(
        error_title=issue["title"],
        stack_trace=issue["culprit"],  # culprit is the failing location, not a full stack trace
        sentry_url=issue["web_url"],
        trace_id=extract_trace_id(payload),
    )

    # Telegram notification
    await telegram.send_error_alert(
        incident_id=incident.id,
        analysis=analysis,
        sentry_url=issue["web_url"],
    )
    return {"status": "processed", "incident_id": incident.id}
```

### 4.2 Alertmanager → OpenClaw

```yaml
# alertmanager.yml
route:
  receiver: awoooi-api
  routes:
    - match:
        namespace: awoooi-prod
      receiver: awoooi-api
    - match:
        severity: critical
      receiver: awoooi-api

receivers:
  - name: awoooi-api
    webhook_configs:
      - url: http://192.168.0.125:32334/api/v1/webhooks/alertmanager
        send_resolved: true
        http_config:
          basic_auth:
            username: alertmanager
            password_file: /etc/alertmanager/secrets/webhook-password
```

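On the receiving side, the webhook body carries an `alerts` list in which each entry has a `status` and a `fingerprint`, so the 10-minute deduplication step can key on the fingerprint. The sketch below keeps the state in process memory purely for illustration; the class and function names are hypothetical, and with two `awoooi-api` replicas a shared store would be needed for the window to hold across Pods.

```python
import time
from typing import Dict, List


class FingerprintDeduper:
    """Remember fingerprints for ttl_seconds (in-memory, single process)."""

    def __init__(self, ttl_seconds: float = 600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self._clock = clock
        self._expires: Dict[str, float] = {}  # fingerprint -> expiry time

    def is_new(self, fingerprint: str) -> bool:
        now = self._clock()
        # Drop expired entries so the map does not grow without bound.
        self._expires = {k: v for k, v in self._expires.items() if v > now}
        if fingerprint in self._expires:
            return False
        self._expires[fingerprint] = now + self.ttl
        return True


def new_firing_alerts(payload: dict, deduper: FingerprintDeduper) -> List[dict]:
    """Return firing alerts from a webhook body that were not seen within the TTL."""
    fresh = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved notifications are not deduplicated here
        if deduper.is_new(alert["fingerprint"]):
            fresh.append(alert)
    return fresh
```

Only the alerts returned by `new_firing_alerts` would go on to create an Incident; repeats inside the window are dropped.
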
### 4.3 SigNoz → OpenClaw

```python
# Query ClickHouse for anomalous spans.
# `clickhouse` and `openclaw` are application-level clients defined elsewhere.
async def detect_signoz_anomalies():
    """
    Periodically query the SigNoz ClickHouse store to detect:
    - an abnormal rise in error rate
    - abnormal P99 latency
    - a sudden drop in trace volume (the service may be down)

    The query below covers the error-count case only.
    """
    anomalies = await clickhouse.query("""
        SELECT
            serviceName,
            count(*) AS error_count,
            avg(durationNano) / 1e6 AS avg_latency_ms
        FROM signoz_traces.signoz_index_v2
        WHERE timestamp > now() - INTERVAL 5 MINUTE
          AND statusCode = 'STATUS_CODE_ERROR'
        GROUP BY serviceName
        HAVING error_count > 10
    """)

    for anomaly in anomalies:
        await openclaw.analyze_trace_anomaly(
            service=anomaly["serviceName"],
            error_count=anomaly["error_count"],
            avg_latency=anomaly["avg_latency_ms"],
        )
```

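The "sudden drop in trace volume" case can be approximated by comparing the current window's span count with a recent baseline. The sketch below is one simple heuristic under stated assumptions: the 50% drop ratio, the minimum-baseline cutoff, and the `is_volume_drop` helper are all illustrative choices, not values from the AWOOOI configuration.

```python
from typing import Sequence


def is_volume_drop(
    current: int,
    baseline: Sequence[int],
    drop_ratio: float = 0.5,
    min_baseline: float = 20.0,
) -> bool:
    """Flag a window whose span count fell below drop_ratio of the baseline mean.

    `baseline` holds span counts from earlier windows of the same length.
    Services with a baseline mean under `min_baseline` are skipped, since
    low-traffic services produce noisy ratios.
    """
    if not baseline:
        return False  # nothing to compare against yet
    mean = sum(baseline) / len(baseline)
    if mean < min_baseline:
        return False
    return current < mean * drop_ratio


if __name__ == "__main__":
    # Previous three 5-minute windows versus the current one.
    print(is_volume_drop(10, [100, 120, 110]))
```

A flagged service would then be passed to the same analysis call as the error-count anomalies.
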
---

## 5. Implementation Priorities

### Phase 1 (this week - P0)

| Item | Status | Owner | Notes |
|------|--------|-------|-------|
| Alertmanager → AWOOOI Webhook | ⬜ TODO | Claude Code | Configure webhook + test alerts |
| Sentry Webhook → Telegram | ⬜ TODO | Claude Code | Push errors directly + AI analysis |
| Automatic Secrets injection (CD) | ⬜ TODO | Claude Code | kubectl patch secret |
| Alert deduplication verification | ⬜ TODO | Claude Code | 10 min fingerprint test |

### Phase 2 (next week - P1)

| Item | Status | Owner | Notes |
|------|--------|-------|-------|
| SigNoz alert rules | ⬜ TODO | Claude Code | Error Rate, Latency P99 |
| Extend auto-repair actions | ⬜ TODO | Claude Code | SCALE_UP, ROLLBACK |
| Automatic Playbook extraction | ⬜ TODO | Claude Code | Successful repair → Playbook |
| Alert escalation mechanism | ⬜ TODO | Claude Code | SLA Engine |

### Phase 3 (in two weeks - P2)

| Item | Status | Owner | Notes |
|------|--------|-------|-------|
| Grafana dashboards | ⬜ TODO | Claude Code | Monitoring overview |
| SLO/SLI definitions | ⬜ TODO | Claude Code | 99.9% availability target |
| Alert noise suppression | ⬜ TODO | Claude Code | ML anomaly detection |
| Capacity forecasting | ⬜ TODO | Claude Code | Resource trend prediction |

---

## 6. Appendix

### A. Environment Variables

```bash
# === Alertmanager ===
ALERTMANAGER_WEBHOOK_URL=http://192.168.0.125:32334/api/v1/webhooks/alertmanager
ALERTMANAGER_WEBHOOK_SECRET=<secret>

# === Sentry ===
SENTRY_DSN=http://<key>@192.168.0.110:9000/<project>
SENTRY_WEBHOOK_SECRET=<secret>
SENTRY_DEDUP_TTL=600

# === SigNoz ===
SIGNOZ_CLICKHOUSE_URL=http://192.168.0.188:8123
SIGNOZ_ANOMALY_THRESHOLD_ERROR_COUNT=10

# === Auto-repair ===
AUTO_REPAIR_ENABLED=true
AUTO_REPAIR_MAX_PER_HOUR=5
AUTO_REPAIR_MIN_CONFIDENCE=0.85
AUTO_REPAIR_DRY_RUN=false
```

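The four `AUTO_REPAIR_*` variables could be parsed into a typed configuration object at startup. The sketch below is illustrative only: the `AutoRepairConfig` dataclass and loader are hypothetical, and the defaults for missing variables (disabled, dry-run on) are a fail-safe assumption rather than documented behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class AutoRepairConfig:
    enabled: bool
    max_per_hour: int
    min_confidence: float
    dry_run: bool


def _as_bool(value: str) -> bool:
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_auto_repair_config(env: Optional[Mapping[str, str]] = None) -> AutoRepairConfig:
    """Read the AUTO_REPAIR_* variables, failing safe when they are missing."""
    env = os.environ if env is None else env
    config = AutoRepairConfig(
        # Assumed defaults: auto-repair stays off and dry-run stays on unless set.
        enabled=_as_bool(env.get("AUTO_REPAIR_ENABLED", "false")),
        max_per_hour=int(env.get("AUTO_REPAIR_MAX_PER_HOUR", "5")),
        min_confidence=float(env.get("AUTO_REPAIR_MIN_CONFIDENCE", "0.85")),
        dry_run=_as_bool(env.get("AUTO_REPAIR_DRY_RUN", "true")),
    )
    if not 0.0 <= config.min_confidence <= 1.0:
        raise ValueError("AUTO_REPAIR_MIN_CONFIDENCE must be between 0 and 1")
    if config.max_per_hour < 0:
        raise ValueError("AUTO_REPAIR_MAX_PER_HOUR must not be negative")
    return config


if __name__ == "__main__":
    print(load_auto_repair_config())
```

Validating the values once at startup means a malformed threshold is rejected before any repair is attempted.
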
### B. Alert Template

```markdown
🚨 **CRITICAL | awoooi-api**
━━━━━━━━━━━━━━━━━━━
📋 INC-20260329-0001
🎯 Pod: awoooi-api-7d4b8c9f5-abc12
━━━━━━━━━━━━━━━━━━━
🤖 **AI analysis**
👥 Owner: BE (backend)
📊 Confidence: 🟢 92%
💡 Cause: OOM Killed - Memory limit exceeded
━━━━━━━━━━━━━━━━━━━
🔧 Suggested: DELETE_POD + SCALE_UP
⏱️ Downtime: ~30s
💰 Tokens: 1,234 / $0.0012
━━━━━━━━━━━━━━━━━━━
🔗 [SigNoz Trace](http://192.168.0.188:3301/trace/abc123)
🔗 [Sentry Issue](http://192.168.0.110:9000/issues/456)

[✅ Approve] [❌ Reject] [⏰ Later] [🔕 Mute]
```

---

**End of document**
**Next step**: carry out the Phase 1 tasks