# WOOO-AIOPS Monitoring Inventory Report

> **Inventory of monitoring assets to migrate to AWOOOI**
>
> Inventory date: 2026-03-22
> Source project: `/Users/ogt/wooo-aiops`

---

## 1. Monitoring Stack Overview

| Component | Purpose | Source path | Migration priority |
|-----------|---------|-------------|--------------------|
| **OpenTelemetry** | Distributed tracing | `clawbot/app/core/telemetry.py` | 🔴 P0 |
| **Prometheus** | Metrics collection | `docker/prometheus/prometheus.yml` | 🔴 P0 |
| **Alertmanager** | Alert routing and notification | `docker/alertmanager/alertmanager.yml` | 🔴 P0 |
| **SigNoz** | APM + traces + logs | `infrastructure/signoz/alert-rules.yaml` | 🟡 P1 |
| **Grafana** | Dashboard visualization | `docker/grafana/dashboards/*.json` | 🟡 P1 |
| **Loki + Promtail** | Log aggregation | `docker/loki/loki-config.yml` | 🟡 P1 |

---

## 2. Health Checks

### 2.1 API Health Endpoints

| Endpoint | Purpose | Checks |
|----------|---------|--------|
| `/health` | Liveness probe | git_sha, build_time, version |
| `/ready` | Readiness probe | DB connection, Redis connection |
| `/api/v1/health` | Gateway health | API gateway status |

**Source file**: `src/api/routes/health.py`
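
The split between the liveness and readiness endpoints above can be sketched in plain Python. This is a minimal illustration, not the contents of `src/api/routes/health.py`: the `BuildInfo` class, the function names, the response shapes, and the 200/503 status codes are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class BuildInfo:
    """Hypothetical container for the fields the /health endpoint reports."""
    git_sha: str
    build_time: str
    version: str


def liveness(info: BuildInfo) -> Tuple[int, Dict[str, str]]:
    """Liveness reports build metadata only and never touches dependencies."""
    return 200, {
        "status": "ok",
        "git_sha": info.git_sha,
        "build_time": info.build_time,
        "version": info.version,
    }


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, object]]:
    """Readiness runs every dependency check and fails if any single one fails."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises is treated the same as a check that returns False.
            results[name] = False
    ready = all(results.values())
    return (200 if ready else 503), {"ready": ready, "checks": results}
```

Keeping dependency checks out of the liveness path means a database or Redis outage marks the Pod as not ready without causing Kubernetes to restart it.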

### 2.2 K8s Probes Configuration

```yaml
# Source: infrastructure/kubernetes/base/api-deployment.yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 30

readinessProbe:
  httpGet:
    path: /ready
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10
```

---

## 3. Alert Rules Inventory

### 3.1 Prometheus Alert Rules (20+ rules)

**Source files**:
- `docker/prometheus/rules/alerts.yml`
- `docker/prometheus/rules/service-health-rules.yml`

#### 3.1.1 System-Level Alerts (P0 Critical)

| Alert name | Trigger condition | Severity |
|------------|-------------------|----------|
| `InstanceDown` | Instance offline > 1m | 🔴 P0 |
| `VersionDriftDetected` | Deployed version differs from the expected version | 🔴 P0 |
| `UnexpectedPodRestart` | Unexpected Pod restart | 🔴 P0 |
| `ImagePullBackOff` | Image pull failure | 🔴 P0 |
| `PodCrashLoopBackOff` | Pod keeps crashing | 🔴 P0 |

#### 3.1.2 CI/CD Pipeline Alerts

| Alert name | Trigger condition | Severity |
|------------|-------------------|----------|
| `PipelineFailed` | Pipeline run failed | 🔴 P0 |
| `PipelineTooSlow` | Pipeline > 30m | 🟡 P1 |
| `JobStuckPending` | Job queued > 5m | 🟡 P1 |
| `JobRunningTooLong` | Job running > 30m | 🟡 P1 |
| `GitLabRunnerOffline` | Runner offline | 🔴 P0 |

#### 3.1.3 Infrastructure Alerts

| Alert name | Trigger condition | Severity |
|------------|-------------------|----------|
| `HighCpuLoad` | CPU > 90% for 5m | 🔴 P0 |
| `HighMemoryUsage` | Memory > 90% for 5m | 🔴 P0 |
| `DiskSpaceLow` | Disk usage > 85% | 🟡 P1 |
| `HTTP502Spike` | Spike in 502 errors | 🔴 P0 |
| `HTTP500Spike` | Spike in 500 errors | 🔴 P0 |

#### 3.1.4 Database/Cache Alerts

| Alert name | Trigger condition | Severity |
|------------|-------------------|----------|
| `PostgreSQLConnectionFailed` | DB connection failure | 🔴 P0 |
| `RedisConnectionFailed` | Redis connection failure | 🔴 P0 |
| `PostgreSQLConnectionPoolExhausted` | Connection pool usage > 90% | 🟡 P1 |

#### 3.1.5 SSL Certificate Alerts

| Alert name | Trigger condition | Severity |
|------------|-------------------|----------|
| `SSLCertExpiringSoon` | Certificate expires within 14 days | 🟡 P1 |
| `SSLCertExpired` | Certificate has expired | 🔴 P0 |

#### 3.1.6 Performance Alerts

| Alert name | Trigger condition | Severity |
|------------|-------------------|----------|
| `APIResponseTimeSlow` | P95 latency > 2s | 🟡 P1 |
| `HighErrorRate` | Error rate > 1% | 🟡 P1 |
| `WebSocketConnectionFailed` | WebSocket connection failure | 🟢 P2 |

### 3.2 SigNoz Alert Rules (30+ rules)

**Source file**: `infrastructure/signoz/alert-rules.yaml`

| Category | Alert count | Severity breakdown |
|----------|-------------|--------------------|
| Database | 5 | P0: 2, P1: 3 |
| Cache | 3 | P0: 1, P1: 2 |
| HTTP errors | 6 | P0: 2, P1: 2, P2: 2 |
| Containers | 4 | P0: 2, P1: 2 |
| Service-specific | 12 | Varies by service |

**Service-specific alerts cover**:
- Gitea, Harbor, OpenClaw, Ollama, SigNoz, n8n

---

## 4. Notification Channels

### 4.1 Telegram Integration

**Source files**:
- `src/api/routes/telegram_alerts.py`
- `clawbot/app/bot/telegram.py`

| Channel | Environment variable | Purpose |
|---------|----------------------|---------|
| General alerts | `TELEGRAM_CHAT_ID` | All alerts |
| P0 urgent | `TELEGRAM_P0_CHAT_ID` | Critical only |
| Security alerts | `TELEGRAM_SECURITY_CHAT_ID` | Security only |

**Features**:
- HTML formatting with emoji severity markers
- Sent as background tasks (non-blocking)
- Two-way interaction: the `/ask` command triggers an AI diagnosis
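
The channel table and the HTML formatting feature above could be combined roughly as follows. This is a sketch under stated assumptions: `target_chats`, `format_alert`, the `category` argument, and the exact message layout are hypothetical and are not taken from `telegram_alerts.py`. Only the three environment variable names and the P0/P1/P2 color convention come from this report.

```python
import html
from typing import List, Mapping

# Emoji per severity, mirroring the markers used throughout this report.
SEVERITY_EMOJI = {"P0": "🔴", "P1": "🟡", "P2": "🟢"}


def target_chats(severity: str, category: str, env: Mapping[str, str]) -> List[str]:
    """Return the chat IDs that should receive an alert.

    `category` is a hypothetical field; "security" selects the security channel.
    """
    keys = ["TELEGRAM_CHAT_ID"]  # the general channel receives every alert
    if severity == "P0":
        keys.append("TELEGRAM_P0_CHAT_ID")
    if category == "security":
        keys.append("TELEGRAM_SECURITY_CHAT_ID")
    # Skip channels whose environment variable is unset or empty.
    return [env[key] for key in keys if env.get(key)]


def format_alert(name: str, severity: str, summary: str) -> str:
    """Build an HTML message body; user-supplied text is escaped first."""
    emoji = SEVERITY_EMOJI.get(severity, "⚪")
    return f"{emoji} <b>{html.escape(name)}</b> [{severity}]\n{html.escape(summary)}"
```

Passing the environment as a mapping (for example `os.environ`) keeps the routing logic testable without real chat IDs.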

### 4.2 Slack Integration

**Source file**: `docker/alertmanager/alertmanager.yml`

| Channel | Purpose |
|---------|---------|
| `#alerts` | Default alerts |
| `#alerts-security` | Security alerts |
| `#alerts-security-critical` | Critical security alerts |
| `#alerts-infra` | Infrastructure |
| `#p0-war-room` | P0 war room |

### 4.3 PagerDuty On-Call

| Service | SLA | Purpose |
|---------|-----|---------|
| P0 Service Key | 5-minute response | Critical |
| P1 Service Key | 15-minute response | High |

**Automatic escalation to C-Level**

### 4.4 Email Notifications

**Source file**: `src/services/notification.py`

- SMTP (TLS/STARTTLS)
- Asynchronous delivery via aiosmtplib
- HTML + plain-text bodies

---

## 5. Auto-Remediation

### 5.1 Remediation Engine v1

**Source file**: `src/automation/remediation_engine.py`

#### Remediation Action Map

| Action | Description | Triggering alerts |
|--------|-------------|-------------------|
| `restart_pod` | Restart the Pod | HighErrorRate, PodCrashLooping, SlowResponse, ServiceDown |
| `scale_up` | Scale out horizontally | HighCPU, HighMemory |
| `scale_down` | Reduce replicas | Manual trigger |
| `rollback_deployment` | Roll back the version | Manual trigger |
| `clear_cache` | Clear Redis | Manual trigger |

#### Safety Guardrails

```python
# Namespace allowlist
ALLOWED_NAMESPACES = ["wooo-aiops-uat", "wooo-aiops-prod"]

# Dry-run mode: a boolean flag (True or False); True is shown as an example
AUTOMATION_DRY_RUN = True

# Maximum replica count
MAX_REPLICAS = 10
```
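
To show how these guardrails might gate a `scale_up` action, here is a minimal sketch. The two constants are copied from the block above; `validate_target`, `scale_up`, the `ScaleFn` callback, and the returned message strings are hypothetical names invented for the example and do not describe `remediation_engine.py`.

```python
from typing import Callable, Optional, Tuple

ALLOWED_NAMESPACES = ["wooo-aiops-uat", "wooo-aiops-prod"]
MAX_REPLICAS = 10

# Hypothetical signature for whatever actually performs the scaling.
ScaleFn = Callable[[str, str, int], None]


def validate_target(namespace: str, replicas: int) -> Tuple[bool, str]:
    """Check a request against the namespace allowlist and the replica cap."""
    if namespace not in ALLOWED_NAMESPACES:
        return False, f"namespace {namespace!r} is not in the allowlist"
    if replicas > MAX_REPLICAS:
        return False, f"{replicas} replicas exceeds MAX_REPLICAS={MAX_REPLICAS}"
    return True, "ok"


def scale_up(namespace: str, deployment: str, replicas: int,
             dry_run: bool = True, apply: Optional[ScaleFn] = None) -> str:
    """Validate first, then either describe the change or hand it to `apply`."""
    ok, reason = validate_target(namespace, replicas)
    if not ok:
        return f"rejected: {reason}"
    if dry_run or apply is None:
        return f"dry-run: would scale {namespace}/{deployment} to {replicas}"
    apply(namespace, deployment, replicas)
    return f"scaled {namespace}/{deployment} to {replicas}"
```

Defaulting `dry_run` to `True` means a caller has to opt in explicitly before anything is changed, which matches the intent of a dry-run guardrail.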

### 5.2 Remediation Engine v2 (Repair Engine)

**Source file**: `src/engines/repair_engine.py`

#### 8 Repair Strategies

| Strategy | Description |
|----------|-------------|
| `RESTART` | Restart the Pod |
| `SCALE_UP` | Scale out horizontally |
| `SCALE_DOWN` | Reduce replicas |
| `ROLLBACK` | Roll back the version |
| `INCREASE_MEMORY` | Increase memory (up to +50%) |
| `INCREASE_CPU` | Increase CPU (up to +50%) |
| `VACUUM_DB` | Database maintenance |
| `CLEAR_CACHE` | Clear the cache |

#### Safety Limits

| Parameter | Value | Description |
|-----------|-------|-------------|
| Max repairs/hour | 5 | Maximum number of repairs per hour |
| Max consecutive failures | 3 | Repairs stop after this many consecutive failures |
| Min healthy replicas | 1 | Minimum number of healthy replicas |
| Rollback window | 24h | Time window in which a rollback is allowed |
| Memory increase limit | 50% | Upper bound on a memory increase |
| CPU increase limit | 50% | Upper bound on a CPU increase |
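
Three of the limits in the table above can be enforced with a small amount of state. The sketch below is an assumption-laden illustration: the `RepairBudget` class, the sliding one-hour window, and the decision to reset the failure counter on any success are choices made for the example, and the report does not say how `repair_engine.py` implements them.

```python
from collections import deque
from typing import Deque

MAX_REPAIRS_PER_HOUR = 5
MAX_CONSECUTIVE_FAILURES = 3
MAX_RESOURCE_INCREASE = 0.50  # +50% cap for memory and CPU


class RepairBudget:
    """Tracks recent repairs (timestamps in seconds) for one application."""

    def __init__(self) -> None:
        self._started: Deque[float] = deque()
        self._consecutive_failures = 0

    def allow(self, now: float) -> bool:
        """Return True if another repair may start at time `now`."""
        # Drop repairs that started an hour or more ago (sliding window).
        while self._started and now - self._started[0] >= 3600:
            self._started.popleft()
        if self._consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
            return False
        return len(self._started) < MAX_REPAIRS_PER_HOUR

    def record(self, now: float, success: bool) -> None:
        """Register a finished repair and update the failure streak."""
        self._started.append(now)
        self._consecutive_failures = 0 if success else self._consecutive_failures + 1


def capped_increase(current: float, requested: float) -> float:
    """Clamp a memory or CPU request to at most +50% of the current value."""
    return min(requested, current * (1 + MAX_RESOURCE_INCREASE))
```

For example, `capped_increase(512, 2048)` returns `768.0`, so a request to quadruple memory is reduced to the 50% ceiling.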

### 5.3 Auto-Recovery Script

**Source file**: `scripts/auto-recovery.sh`

**Cron schedule**: runs every 10 minutes

```bash
# Checks performed on each run
# 1. API health check (expects HTTP 200)
# 2. Frontend health check (expects HTTP 200/302)
# 3. Disk space (cleanup is triggered above 90% usage)
# 4. GitHub Actions Runner status
# 5. Service restart recovery
```

**Log location**: `/var/log/wooo/auto-recovery.log`

---

## 6. SLA Engine and Escalation

**Source file**: `src/engines/sla_engine.py`

### 6.1 SLA Thresholds

| Priority | Response time | Resolution time |
|----------|---------------|-----------------|
| P0 | 5 minutes | 30 minutes |
| P1 | 15 minutes | 2 hours |
| P2 | 1 hour | 8 hours |
| P3 | 4 hours | 24 hours |

### 6.2 Escalation Levels

| Level | Role |
|-------|------|
| L0 | L1 Support (first-line support) |
| L1 | L2 Expert (expert support) |
| L2 | Team Lead (department head) |
| L3 | Director |
| L4 | C-Level (CEO, CTO, CIO, CISO, CPO) |

### 6.3 Escalation Matrix

| Priority | Escalation path |
|----------|-----------------|
| P0 | On-Call → Team Lead → CIO → CISO → CEO |
| P1 | On-Call → Team Lead → CIO |
| P2 | On-Call → Team Lead |
| P3 | On-Call only |
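
The SLA thresholds and the escalation matrix lend themselves to a lookup-table sketch. The two dictionaries below restate the tables in this section. The timing rule in `current_owner` (advance one step along the path for every response window that passes without acknowledgement) is purely an assumption for illustration, because this report does not state when `sla_engine.py` escalates.

```python
from datetime import timedelta
from typing import Dict, List, Tuple

# priority -> (response time, resolution time), from section 6.1
SLA: Dict[str, Tuple[timedelta, timedelta]] = {
    "P0": (timedelta(minutes=5), timedelta(minutes=30)),
    "P1": (timedelta(minutes=15), timedelta(hours=2)),
    "P2": (timedelta(hours=1), timedelta(hours=8)),
    "P3": (timedelta(hours=4), timedelta(hours=24)),
}

# priority -> ordered escalation path, from section 6.3
ESCALATION_PATH: Dict[str, List[str]] = {
    "P0": ["On-Call", "Team Lead", "CIO", "CISO", "CEO"],
    "P1": ["On-Call", "Team Lead", "CIO"],
    "P2": ["On-Call", "Team Lead"],
    "P3": ["On-Call"],
}


def current_owner(priority: str, unacknowledged_for: timedelta) -> str:
    """Return who holds the alert after it has gone unacknowledged for a while.

    Assumed rule: one step along the path per elapsed response window,
    stopping at the last entry of the path.
    """
    response_sla, _resolution_sla = SLA[priority]
    path = ESCALATION_PATH[priority]
    step = int(unacknowledged_for / response_sla)
    return path[min(step, len(path) - 1)]
```

Under that assumed rule, a P0 alert left unacknowledged for five minutes moves from On-Call to the Team Lead, while a P3 alert never leaves On-Call.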
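
---

## 7. Alert Aggregation and Deduplication

**Source file**: `src/services/alert_aggregator.py`

### Features

| Feature | Description |
|---------|-------------|
| Fingerprint deduplication | Exact match on identical alerts |
| Time-window deduplication | Identical alerts within 5 minutes |
| Alert storm detection | More than 10 alerts per minute |
| Label grouping | Aggregates alerts with similar labels |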
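
The first three features in the table can be sketched together. This is a minimal illustration assuming a SHA-256 fingerprint over sorted labels, a window measured from the last forwarded alert, and a sliding 60-second storm counter. The `AlertAggregator` class shown here and its method names are hypothetical and may differ from `alert_aggregator.py`.

```python
import hashlib
from collections import deque
from typing import Deque, Dict, Mapping

DEDUP_WINDOW_SECONDS = 300        # 5 minutes
STORM_THRESHOLD_PER_MINUTE = 10   # more than 10 alerts per minute is a storm


def fingerprint(labels: Mapping[str, str]) -> str:
    """Hash the labels in sorted order so key order does not matter."""
    canonical = "\n".join(f"{key}={labels[key]}" for key in sorted(labels))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class AlertAggregator:
    def __init__(self) -> None:
        self._last_forwarded: Dict[str, float] = {}
        self._arrivals: Deque[float] = deque()

    def accept(self, labels: Mapping[str, str], now: float) -> bool:
        """Return True if the alert should be forwarded, False if it is a duplicate."""
        self._arrivals.append(now)  # every arrival counts toward storm detection
        fp = fingerprint(labels)
        last = self._last_forwarded.get(fp)
        if last is not None and now - last < DEDUP_WINDOW_SECONDS:
            return False
        self._last_forwarded[fp] = now
        return True

    def storm(self, now: float) -> bool:
        """Return True if more than the threshold arrived in the last 60 seconds."""
        while self._arrivals and now - self._arrivals[0] >= 60:
            self._arrivals.popleft()
        return len(self._arrivals) > STORM_THRESHOLD_PER_MINUTE
```

Counting duplicates toward the storm threshold is a deliberate choice in this sketch: a flood of identical alerts is still a flood, even if only one of them is forwarded.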

### Prometheus Metrics

```promql
wooo_alerts_received_total{severity, source}
wooo_alerts_deduplicated_total{reason}
wooo_alerts_aggregated_total{group_key}
wooo_alert_groups_active
wooo_alert_storm_detected_total
```

---

## 8. Grafana Dashboards

| Dashboard | Path | Purpose |
|-----------|------|---------|
| AIOPS Brain | `infrastructure/grafana/dashboards/aiops-brain.json` | AI brain status |
| API Performance | `docker/grafana/dashboards/api-performance.json` | API performance |
| Container Health | `docker/grafana/dashboards/container-health.json` | Container health |
| System Overview | `docker/grafana/dashboards/system-overview.json` | System overview |
| DevOps KPIs | `infrastructure/grafana/dashboards/devops-kpis.json` | DevOps metrics |
| Pipeline Health | `infrastructure/grafana/dashboards/pipeline-health.json` | Pipeline health |
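
---

## 9. Alert-to-Ticket Integration

**Source file**: `src/services/alert_ticket_service.py`

| Feature | Description |
|---------|-------------|
| Automatic ticket creation | Every alert automatically opens a ticket |
| Deduplication | Prevents duplicate tickets for the same alert |
| Severity mapping | P0/P1/P2 → ticket priority |
| Automatic closing | The ticket closes automatically when the alert resolves |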
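
The four behaviors in this table fit into a small state machine keyed by alert fingerprint. The sketch below is hypothetical throughout: `Ticket`, `TicketStore`, the in-memory dictionary, and the priority names (`critical`, `high`, `medium`, `low`) are invented for illustration, since the report only says that P0/P1/P2 map to ticket priorities.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Assumed priority names; the real mapping lives in alert_ticket_service.py.
PRIORITY_BY_SEVERITY = {"P0": "critical", "P1": "high", "P2": "medium"}


@dataclass
class Ticket:
    fingerprint: str
    priority: str
    status: str = "open"


class TicketStore:
    """In-memory stand-in for a ticket system, keyed by alert fingerprint."""

    def __init__(self) -> None:
        self._open: Dict[str, Ticket] = {}

    def on_firing(self, fingerprint: str, severity: str) -> Ticket:
        """Open a ticket, or return the existing one for a repeated alert."""
        existing = self._open.get(fingerprint)
        if existing is not None:
            return existing  # deduplication: no second ticket for the same alert
        ticket = Ticket(fingerprint, PRIORITY_BY_SEVERITY.get(severity, "low"))
        self._open[fingerprint] = ticket
        return ticket

    def on_resolved(self, fingerprint: str) -> Optional[Ticket]:
        """Close the matching ticket when the alert resolves, if one is open."""
        ticket = self._open.pop(fingerprint, None)
        if ticket is not None:
            ticket.status = "closed"
        return ticket
```

Removing a ticket from the open set on resolution means a later recurrence of the same alert opens a fresh ticket instead of reviving a closed one.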

---

## 10. Custom Metrics

### 10.1 Deployment Tracking

```promql
wooo_deployment_version_drift     # 1 = version drift
wooo_pipeline_status{status}      # failed = 1
wooo_pipeline_duration_seconds
wooo_job_queued_duration_seconds
wooo_job_duration_seconds
wooo_gitlab_runner_status         # 0 = offline
```

### 10.2 Auto-Remediation

```promql
wooo_repair_total{app_id, action, status}
wooo_repair_duration_seconds{app_id, action}
wooo_repair_in_progress{app_id}
```

---

## 11. Alert Notification Flow

```
┌─────────────────────────────────────────────────────────────┐
│                       Alert Triggered                       │
│                    (Prometheus / SigNoz)                    │
└──────────────────────┬──────────────────────────────────────┘
                       ↓
┌─────────────────────────────────────────────────────────────┐
│                    Alertmanager Webhook                     │
└──────────────────────┬──────────────────────────────────────┘
                       ↓
        ┌──────────────┼──────────────┬──────────────┐
        ↓              ↓              ↓              ↓
   ┌─────────┐   ┌──────────┐   ┌──────────┐   ┌───────────┐
   │ Ticket  │   │ Telegram │   │ PagerDuty│   │   Slack   │
   │ System  │   │   Bot    │   │ (On-Call)│   │ Channels  │
   └─────────┘   └──────────┘   └──────────┘   └───────────┘
                       │
                       ↓
    ┌──────────────────────────────────────┐
    │       Auto-Remediation Engine        │
    ├──────────────────────────────────────┤
    │ 1. Validate target (whitelist)       │
    │ 2. Execute repair action             │
    │ 3. Record result in DB               │
    │ 4. Notify outcome (Telegram/NATS)    │
    └──────────────────────────────────────┘
```

---

## 12. Recommendations for Migrating to AWOOOI

### 12.1 Must Migrate (P0)

| Component | Original path | Proposed new path |
|-----------|---------------|-------------------|
| OpenTelemetry initialization | `clawbot/app/core/telemetry.py` | `apps/api/src/core/telemetry.py` |
| Prometheus Client | `src/services/prometheus_client.py` | `apps/api/src/services/` |
| Health Routes | `src/api/routes/health.py` | `apps/api/src/routes/health.py` |
| Alert Rules | `docker/prometheus/rules/*.yml` | `ops/prometheus/rules/` |
| Alertmanager Config | `docker/alertmanager/*.yml` | `ops/alertmanager/` |

### 12.2 Optional Migration (P1)

| Component | Description |
|-----------|-------------|
| Grafana Dashboards | 6 dashboard JSON files |
| Loki + Promtail | Log aggregation |
| SLA Engine | Escalation mechanism |
| Alert Aggregator | Alert deduplication |

### 12.3 Needs Refactoring (P2)

| Component | Reason |
|-----------|--------|
| Remediation Engine | Must be adapted to the new Multi-Sig approval flow |
| On-Call Service | Must integrate the new OpenClaw notifications |

---

## Appendix: Key Configuration Files

| Configuration file | Path |
|--------------------|------|
| Alertmanager main config | `docker/alertmanager/alertmanager.yml` |
| Alertmanager production config | `infrastructure/alertmanager/alertmanager.yml` |
| Prometheus Alert Rules | `docker/prometheus/rules/alerts.yml` |
| Service Health Rules | `docker/prometheus/rules/service-health-rules.yml` |
| SigNoz Alert Rules | `infrastructure/signoz/alert-rules.yaml` |
| Prometheus Scrape Config | `docker/prometheus/prometheus.yml` |
| K8s API Deployment | `infrastructure/kubernetes/base/api-deployment.yaml` |
| Monitoring Cron Jobs | `infrastructure/cron/monitoring-jobs.cron` |
| Auto-Recovery Script | `scripts/auto-recovery.sh` |

---

**Inventory completed**: 2026-03-22
**Inventory performed by**: Claude Code (Monitoring Inventory Agent)