awoooi/docs/MONITORING_INVENTORY_FROM_AIOPS.md

# WOOO-AIOPS Monitoring Inventory Report

> **Inventory of monitoring assets to migrate to AWOOOI**
>
> Inventory date: 2026-03-22
> Source project: `/Users/ogt/wooo-aiops`

---

## 1. Monitoring Stack Overview

| Component | Purpose | Source path | Migration priority |
|-----------|---------|-------------|--------------------|
| **OpenTelemetry** | Distributed tracing | `clawbot/app/core/telemetry.py` | 🔴 P0 |
| **Prometheus** | Metrics collection | `docker/prometheus/prometheus.yml` | 🔴 P0 |
| **Alertmanager** | Alert routing and notification | `docker/alertmanager/alertmanager.yml` | 🔴 P0 |
| **SigNoz** | APM + traces + logs | `infrastructure/signoz/alert-rules.yaml` | 🟡 P1 |
| **Grafana** | Dashboard visualization | `docker/grafana/dashboards/*.json` | 🟡 P1 |
| **Loki + Promtail** | Log aggregation | `docker/loki/loki-config.yml` | 🟡 P1 |

---

## 2. Health Checks

### 2.1 API Health Endpoints

| Endpoint | Purpose | Checks |
|----------|---------|--------|
| `/health` | Liveness probe | git_sha, build_time, version |
| `/ready` | Readiness probe | DB connection, Redis connection |
| `/api/v1/health` | Gateway health | API gateway status |

**Source file**: `src/api/routes/health.py`
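
To illustrate what the liveness and readiness endpoints above might return, here is a minimal standard-library sketch. The function names, the environment variable names (`GIT_SHA`, `BUILD_TIME`, `APP_VERSION`), and the checker callables are assumptions for illustration; they are not taken from `src/api/routes/health.py`.

```python
import os
from typing import Callable, Dict, Tuple


def liveness() -> Dict[str, str]:
    """Liveness payload: build metadata only, no dependency checks."""
    # Environment variable names are hypothetical.
    return {
        "status": "ok",
        "git_sha": os.getenv("GIT_SHA", "unknown"),
        "build_time": os.getenv("BUILD_TIME", "unknown"),
        "version": os.getenv("APP_VERSION", "unknown"),
    }


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, object]]:
    """Run each dependency check and return (HTTP status, payload)."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failed dependency.
            results[name] = False
    ready = all(results.values())
    return (200 if ready else 503), {"ready": ready, "checks": results}
```

Keeping dependency checks out of the liveness payload means a database outage marks the pod as not ready without causing Kubernetes to restart it.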

### 2.2 K8s Probe Configuration

```yaml
# Source: infrastructure/kubernetes/base/api-deployment.yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 30

readinessProbe:
  httpGet:
    path: /ready
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10
```

---

## 3. Alert Rules Inventory

### 3.1 Prometheus Alert Rules (20+ rules)

**Source files**:
- `docker/prometheus/rules/alerts.yml`
- `docker/prometheus/rules/service-health-rules.yml`

#### 3.1.1 System-Level Alerts (P0 Critical)

| Alert name | Trigger condition | Severity |
|------------|-------------------|----------|
| `InstanceDown` | Instance offline > 1m | 🔴 P0 |
| `VersionDriftDetected` | Deployed version does not match the expected version | 🔴 P0 |
| `UnexpectedPodRestart` | Pod restarted unexpectedly | 🔴 P0 |
| `ImagePullBackOff` | Image pull failed | 🔴 P0 |
| `PodCrashLoopBackOff` | Pod keeps crashing | 🔴 P0 |

#### 3.1.2 CI/CD Pipeline Alerts

| Alert name | Trigger condition | Severity |
|------------|-------------------|----------|
| `PipelineFailed` | Pipeline run failed | 🔴 P0 |
| `PipelineTooSlow` | Pipeline > 30m | 🟡 P1 |
| `JobStuckPending` | Job queued > 5m | 🟡 P1 |
| `JobRunningTooLong` | Job running > 30m | 🟡 P1 |
| `GitLabRunnerOffline` | Runner offline | 🔴 P0 |

#### 3.1.3 Infrastructure Alerts

| Alert name | Trigger condition | Severity |
|------------|-------------------|----------|
| `HighCpuLoad` | CPU > 90% for 5m | 🔴 P0 |
| `HighMemoryUsage` | Memory > 90% for 5m | 🔴 P0 |
| `DiskSpaceLow` | Disk usage > 85% | 🟡 P1 |
| `HTTP502Spike` | Spike in 502 errors | 🔴 P0 |
| `HTTP500Spike` | Spike in 500 errors | 🔴 P0 |

#### 3.1.4 Database/Cache Alerts

| Alert name | Trigger condition | Severity |
|------------|-------------------|----------|
| `PostgreSQLConnectionFailed` | DB connection failed | 🔴 P0 |
| `RedisConnectionFailed` | Redis connection failed | 🔴 P0 |
| `PostgreSQLConnectionPoolExhausted` | Connection pool usage > 90% | 🟡 P1 |

#### 3.1.5 SSL Certificate Alerts

| Alert name | Trigger condition | Severity |
|------------|-------------------|----------|
| `SSLCertExpiringSoon` | Certificate expires within 14 days | 🟡 P1 |
| `SSLCertExpired` | Certificate has expired | 🔴 P0 |
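
The two certificate rules reduce to a comparison against the certificate's expiry timestamp. The sketch below shows that decision logic in plain Python. The real rules are Prometheus expressions in the rules files, and `classify_cert` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

EXPIRY_WARNING_DAYS = 14  # window used by SSLCertExpiringSoon


def classify_cert(not_after: datetime, now: Optional[datetime] = None) -> Optional[str]:
    """Return the alert name that would fire for a certificate, or None."""
    now = now or datetime.now(timezone.utc)
    if not_after <= now:
        return "SSLCertExpired"
    if not_after - now <= timedelta(days=EXPIRY_WARNING_DAYS):
        return "SSLCertExpiringSoon"
    return None
```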

#### 3.1.6 Performance Alerts

| Alert name | Trigger condition | Severity |
|------------|-------------------|----------|
| `APIResponseTimeSlow` | P95 latency > 2s | 🟡 P1 |
| `HighErrorRate` | Error rate > 1% | 🟡 P1 |
| `WebSocketConnectionFailed` | WebSocket failure | 🟢 P2 |
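
To make the P95 and error-rate thresholds concrete, here is a small sketch that evaluates them over raw samples. It uses a nearest-rank percentile for simplicity; the actual rules presumably derive the quantile from Prometheus histograms, so treat this as an illustration of the thresholds only. The function names are hypothetical.

```python
from typing import List, Sequence

P95_LATENCY_LIMIT_S = 2.0  # APIResponseTimeSlow threshold
ERROR_RATE_LIMIT = 0.01    # HighErrorRate threshold (1%)


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def firing_alerts(latencies_s: Sequence[float], errors: int, total: int) -> List[str]:
    """Return the names of the performance alerts that would fire."""
    alerts: List[str] = []
    if latencies_s and p95(latencies_s) > P95_LATENCY_LIMIT_S:
        alerts.append("APIResponseTimeSlow")
    if total > 0 and errors / total > ERROR_RATE_LIMIT:
        alerts.append("HighErrorRate")
    return alerts
```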

### 3.2 SigNoz Alert Rules (30+ rules)

**Source file**: `infrastructure/signoz/alert-rules.yaml`

| Category | Alert count | Severity distribution |
|----------|-------------|-----------------------|
| Database | 5 | P0: 2, P1: 3 |
| Cache | 3 | P0: 1, P1: 2 |
| HTTP errors | 6 | P0: 2, P1: 2, P2: 2 |
| Containers | 4 | P0: 2, P1: 2 |
| Service-specific | 12 | Varies by service |

**Service-specific alerts cover**:
- Gitea, Harbor, OpenClaw, Ollama, SigNoz, n8n

---

## 4. Notification Channels

### 4.1 Telegram Integration

**Source files**:
- `src/api/routes/telegram_alerts.py`
- `clawbot/app/bot/telegram.py`

| Channel | Environment variable | Purpose |
|---------|----------------------|---------|
| General alerts | `TELEGRAM_CHAT_ID` | All alerts |
| P0 urgent | `TELEGRAM_P0_CHAT_ID` | Critical only |
| Security alerts | `TELEGRAM_SECURITY_CHAT_ID` | Security only |

**Features**:
- HTML formatting with emoji severity markers
- Sent from background tasks (non-blocking)
- Two-way interaction: the `/ask` command triggers AI diagnosis
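
The channel table and the HTML formatting can be sketched as follows. The routing rule (every alert goes to the general chat, with P0 and security alerts copied to their dedicated chats), the `category` argument, and both function names are assumptions for illustration; the inventory does not record how `telegram_alerts.py` resolves overlaps.

```python
import html
from typing import List

SEVERITY_EMOJI = {"P0": "🔴", "P1": "🟡", "P2": "🟢"}


def target_chat_envs(severity: str, category: str = "") -> List[str]:
    """Return the chat-ID environment variables an alert should be sent to."""
    targets = ["TELEGRAM_CHAT_ID"]  # the general chat receives all alerts
    if severity == "P0":
        targets.append("TELEGRAM_P0_CHAT_ID")
    if category == "security":
        targets.append("TELEGRAM_SECURITY_CHAT_ID")
    return targets


def format_alert_html(name: str, severity: str, summary: str) -> str:
    """Build an HTML message body; user-supplied text is escaped."""
    emoji = SEVERITY_EMOJI.get(severity, "⚪")
    title = f"[{html.escape(severity)}] {html.escape(name)}"
    return f"{emoji} <b>{title}</b>\n{html.escape(summary)}"
```

Escaping the summary matters because Telegram rejects messages whose HTML does not parse, and alert annotations often contain `<` or `&`.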

### 4.2 Slack Integration

**Source file**: `docker/alertmanager/alertmanager.yml`

| Channel | Purpose |
|---------|---------|
| `#alerts` | Default alerts |
| `#alerts-security` | Security alerts |
| `#alerts-security-critical` | Critical security alerts |
| `#alerts-infra` | Infrastructure |
| `#p0-war-room` | P0 war room |

### 4.3 PagerDuty On-Call

| Service | SLA | Purpose |
|---------|-----|---------|
| P0 Service Key | 5-minute response | Critical |
| P1 Service Key | 15-minute response | High |

**Automatically escalates to C-Level**

### 4.4 Email Notifications

**Source file**: `src/services/notification.py`

- SMTP (TLS/STARTTLS)
- Asynchronous sending via aiosmtplib
- HTML + plain-text

---

## 5. Auto-Remediation

### 5.1 Remediation Engine v1

**Source file**: `src/automation/remediation_engine.py`

#### Remediation Action Mapping

| Action | Description | Triggering alerts |
|--------|-------------|-------------------|
| `restart_pod` | Restart pod | HighErrorRate, PodCrashLooping, SlowResponse, ServiceDown |
| `scale_up` | Scale out horizontally | HighCPU, HighMemory |
| `scale_down` | Reduce replicas | Manual trigger |
| `rollback_deployment` | Roll back version | Manual trigger |
| `clear_cache` | Clear Redis | Manual trigger |

#### Safety Guardrails

```python
# Namespace allowlist
ALLOWED_NAMESPACES = ["wooo-aiops-uat", "wooo-aiops-prod"]

# Dry-run mode: True or False
AUTOMATION_DRY_RUN = True

# Maximum replica count
MAX_REPLICAS = 10
```
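
A minimal sketch of how these guardrails could be applied before a `scale_up` action runs. `ScalePlan` and `plan_scale_up` are hypothetical names, and the behavior (reject namespaces outside the allowlist, cap the target at `MAX_REPLICAS`, carry the dry-run flag through) is an assumption about how the engine uses the constants.

```python
from dataclasses import dataclass

ALLOWED_NAMESPACES = ["wooo-aiops-uat", "wooo-aiops-prod"]
MAX_REPLICAS = 10


@dataclass
class ScalePlan:
    namespace: str
    deployment: str
    replicas: int
    dry_run: bool


def plan_scale_up(namespace: str, deployment: str, current: int,
                  step: int = 1, dry_run: bool = True) -> ScalePlan:
    """Validate the target and cap the replica count before any action runs."""
    if namespace not in ALLOWED_NAMESPACES:
        raise PermissionError(f"namespace {namespace!r} is not allowlisted")
    target = min(current + step, MAX_REPLICAS)
    return ScalePlan(namespace, deployment, target, dry_run)
```

For example, `plan_scale_up("wooo-aiops-uat", "api", current=9, step=3)` yields a plan for 10 replicas, not 12.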

### 5.2 Repair Engine v2

**Source file**: `src/engines/repair_engine.py`

#### 8 Repair Strategies

| Strategy | Description |
|----------|-------------|
| `RESTART` | Restart pod |
| `SCALE_UP` | Scale out horizontally |
| `SCALE_DOWN` | Reduce replicas |
| `ROLLBACK` | Roll back version |
| `INCREASE_MEMORY` | Raise memory (+50% max) |
| `INCREASE_CPU` | Raise CPU (+50% max) |
| `VACUUM_DB` | Database maintenance |
| `CLEAR_CACHE` | Clear cache |

#### Safety Limits

| Parameter | Value | Description |
|-----------|-------|-------------|
| Max repairs/hour | 5 | Maximum repairs per hour |
| Max consecutive failures | 3 | Stop after this many consecutive failures |
| Min healthy replicas | 1 | Minimum healthy replicas |
| Rollback window | 24h | Rollback time window |
| Memory increase limit | 50% | Upper bound on memory increase |
| CPU increase limit | 50% | Upper bound on CPU increase |
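
The hourly cap, the consecutive-failure stop, and the 50% resource cap can be sketched as a small guard object. `RepairGuard` and `capped_increase` are hypothetical names, and the sliding one-hour window is an assumed interpretation of "Max repairs/hour".

```python
import time
from collections import deque
from typing import Deque, Optional


class RepairGuard:
    MAX_REPAIRS_PER_HOUR = 5
    MAX_CONSECUTIVE_FAILURES = 3

    def __init__(self) -> None:
        self._attempts: Deque[float] = deque()
        self._consecutive_failures = 0

    def allow(self, now: Optional[float] = None) -> bool:
        """Return True when another repair may start."""
        now = time.time() if now is None else now
        # Drop attempts that fell out of the sliding one-hour window.
        while self._attempts and now - self._attempts[0] >= 3600:
            self._attempts.popleft()
        if self._consecutive_failures >= self.MAX_CONSECUTIVE_FAILURES:
            return False  # stays blocked until a success is recorded
        return len(self._attempts) < self.MAX_REPAIRS_PER_HOUR

    def record(self, success: bool, now: Optional[float] = None) -> None:
        """Record the outcome of a repair attempt."""
        now = time.time() if now is None else now
        self._attempts.append(now)
        self._consecutive_failures = 0 if success else self._consecutive_failures + 1


def capped_increase(current: float, requested: float, limit: float = 0.50) -> float:
    """Cap a memory or CPU request at +50% of the current value."""
    return min(requested, current * (1 + limit))
```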

### 5.3 Auto-Recovery Script

**Source file**: `scripts/auto-recovery.sh`

**Cron schedule**: runs every 10 minutes

```bash
# Checks performed
# 1. API health check (expects HTTP 200)
# 2. Frontend health check (expects HTTP 200/302)
# 3. Disk space (usage > 90% triggers cleanup)
# 4. GitHub Actions Runner status
# 5. Service restart recovery
```

**Log location**: `/var/log/wooo/auto-recovery.log`
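
The pass/fail conditions for the first three checks can be written down as small predicates. This is only a Python sketch of the decision logic, not the contents of `scripts/auto-recovery.sh`; all function names are hypothetical.

```python
import shutil

DISK_CLEANUP_THRESHOLD = 0.90  # usage above 90% triggers cleanup


def api_healthy(status_code: int) -> bool:
    """The API check passes only on HTTP 200."""
    return status_code == 200


def frontend_healthy(status_code: int) -> bool:
    """The frontend check also accepts a 302 redirect."""
    return status_code in (200, 302)


def over_threshold(used: int, total: int) -> bool:
    """True when the used fraction exceeds the cleanup threshold."""
    if total <= 0:
        raise ValueError("total must be positive")
    return used / total > DISK_CLEANUP_THRESHOLD


def disk_needs_cleanup(path: str = "/") -> bool:
    """Check the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return over_threshold(usage.used, usage.total)
```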

---

## 6. SLA Engine and Escalation

**Source file**: `src/engines/sla_engine.py`

### 6.1 SLA Thresholds

| Priority | Response time | Resolution time |
|----------|---------------|-----------------|
| P0 | 5 minutes | 30 minutes |
| P1 | 15 minutes | 2 hours |
| P2 | 1 hour | 8 hours |
| P3 | 4 hours | 24 hours |

### 6.2 Escalation Levels

| Level | Role |
|-------|------|
| L0 | L1 Support (first-line support) |
| L1 | L2 Expert (expert support) |
| L2 | Team Lead (department head) |
| L3 | Director |
| L4 | C-Level (CEO, CTO, CIO, CISO, CPO) |

### 6.3 Escalation Matrix

| Priority | Escalation path |
|----------|-----------------|
| P0 | On-Call → Team Lead → CIO → CISO → CEO |
| P1 | On-Call → Team Lead → CIO |
| P2 | On-Call → Team Lead |
| P3 | On-Call only |
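
The thresholds in 6.1 and the matrix in 6.3 map naturally onto lookup tables. The sketch below encodes them and adds one assumed rule: each missed response window moves one step along the path. That stepping rule and the function names are illustrative and not taken from `sla_engine.py`.

```python
from datetime import timedelta
from typing import Dict, List, Tuple

# (response, resolution) targets per priority
SLA: Dict[str, Tuple[timedelta, timedelta]] = {
    "P0": (timedelta(minutes=5), timedelta(minutes=30)),
    "P1": (timedelta(minutes=15), timedelta(hours=2)),
    "P2": (timedelta(hours=1), timedelta(hours=8)),
    "P3": (timedelta(hours=4), timedelta(hours=24)),
}

ESCALATION_PATH: Dict[str, List[str]] = {
    "P0": ["On-Call", "Team Lead", "CIO", "CISO", "CEO"],
    "P1": ["On-Call", "Team Lead", "CIO"],
    "P2": ["On-Call", "Team Lead"],
    "P3": ["On-Call"],
}


def response_breached(priority: str, unacknowledged_for: timedelta) -> bool:
    """True once an alert has waited longer than its response target."""
    return unacknowledged_for > SLA[priority][0]


def escalation_target(priority: str, breaches: int) -> str:
    """Who to notify after `breaches` missed windows (assumed stepping rule)."""
    path = ESCALATION_PATH[priority]
    return path[min(breaches, len(path) - 1)]
```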

---

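## 7. Alert Aggregation and Deduplication

**Source file**: `src/services/alert_aggregator.py`

### Features

| Feature | Description |
|---------|-------------|
| Fingerprint deduplication | Exact match on identical alerts |
| Time-window deduplication | Identical alerts within 5 minutes |
| Alert storm detection | > 10 alerts/minute |
| Label grouping | Aggregates alerts with similar labels |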
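
Fingerprinting, the 5-minute window, and storm detection can be sketched together. The class and function names are hypothetical, the fingerprint (a SHA-256 over the sorted label pairs) is one common choice and not necessarily what `alert_aggregator.py` uses, and timestamps are passed in explicitly to keep the example deterministic.

```python
import hashlib
from collections import deque
from typing import Deque, Dict

DEDUP_WINDOW_S = 300   # 5-minute deduplication window
STORM_THRESHOLD = 10   # more than 10 alerts per minute


def fingerprint(labels: Dict[str, str]) -> str:
    """Stable hash of an alert's label set; key order does not matter."""
    canonical = "\n".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class Aggregator:
    def __init__(self) -> None:
        self._last_accepted: Dict[str, float] = {}
        self._recent: Deque[float] = deque()

    def accept(self, labels: Dict[str, str], now: float) -> bool:
        """Return False when the alert is a duplicate inside the window."""
        self._recent.append(now)
        fp = fingerprint(labels)
        last = self._last_accepted.get(fp)
        if last is not None and now - last < DEDUP_WINDOW_S:
            return False
        self._last_accepted[fp] = now
        return True

    def storm(self, now: float) -> bool:
        """True when more than STORM_THRESHOLD alerts arrived in the last minute."""
        while self._recent and now - self._recent[0] >= 60:
            self._recent.popleft()
        return len(self._recent) > STORM_THRESHOLD
```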

### Prometheus Metrics

```promql
wooo_alerts_received_total{severity, source}
wooo_alerts_deduplicated_total{reason}
wooo_alerts_aggregated_total{group_key}
wooo_alert_groups_active
wooo_alert_storm_detected_total
```

---

## 8. Grafana Dashboards

| Dashboard | Path | Purpose |
|-----------|------|---------|
| AIOPS Brain | `infrastructure/grafana/dashboards/aiops-brain.json` | AI brain status |
| API Performance | `docker/grafana/dashboards/api-performance.json` | API performance |
| Container Health | `docker/grafana/dashboards/container-health.json` | Container health |
| System Overview | `docker/grafana/dashboards/system-overview.json` | System overview |
| DevOps KPIs | `infrastructure/grafana/dashboards/devops-kpis.json` | DevOps metrics |
| Pipeline Health | `infrastructure/grafana/dashboards/pipeline-health.json` | Pipeline health |

---

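## 9. Alert-to-Ticket Integration

**Source file**: `src/services/alert_ticket_service.py`

| Feature | Description |
|---------|-------------|
| Automatic ticket creation | Every alert automatically creates a ticket |
| Deduplication | Prevents duplicate tickets for the same alert |
| Severity mapping | P0/P1/P2 → ticket priority |
| Automatic closure | Ticket closes automatically when the alert resolves |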
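
The four behaviors in the table fit into a small in-memory sketch. The class names, the `alert_key` format, and the priority labels (`critical`, `high`, `medium`) are assumptions; the inventory does not list the actual ticket priorities.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Assumed mapping; the real ticket priorities are not listed in the inventory.
SEVERITY_TO_PRIORITY = {"P0": "critical", "P1": "high", "P2": "medium"}


@dataclass
class Ticket:
    id: int
    alert_key: str
    priority: str
    is_open: bool = True


@dataclass
class TicketService:
    tickets: Dict[int, Ticket] = field(default_factory=dict)
    open_by_key: Dict[str, int] = field(default_factory=dict)

    def on_firing(self, alert_key: str, severity: str) -> Ticket:
        """Create a ticket, or reuse the open one for the same alert."""
        existing = self.open_by_key.get(alert_key)
        if existing is not None:
            return self.tickets[existing]
        priority = SEVERITY_TO_PRIORITY.get(severity, "low")
        ticket = Ticket(len(self.tickets) + 1, alert_key, priority)
        self.tickets[ticket.id] = ticket
        self.open_by_key[alert_key] = ticket.id
        return ticket

    def on_resolved(self, alert_key: str) -> Optional[Ticket]:
        """Close the open ticket for a resolved alert, if there is one."""
        ticket_id = self.open_by_key.pop(alert_key, None)
        if ticket_id is None:
            return None
        self.tickets[ticket_id].is_open = False
        return self.tickets[ticket_id]
```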

---

## 10. Custom Metrics Export

### 10.1 Deployment Tracking

```promql
wooo_deployment_version_drift      # 1 = version drift
wooo_pipeline_status{status}       # failed = 1
wooo_pipeline_duration_seconds
wooo_job_queued_duration_seconds
wooo_job_duration_seconds
wooo_gitlab_runner_status          # 0 = offline
```

### 10.2 Auto-Remediation

```promql
wooo_repair_total{app_id, action, status}
wooo_repair_duration_seconds{app_id, action}
wooo_repair_in_progress{app_id}
```
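
The listings above give metric names with their label names. For reference, here is a hand-rolled sketch of how one sample looks in the Prometheus text exposition format. In practice a client library would produce this output; `render_sample` is a hypothetical helper and the label values are made up.

```python
from typing import Dict


def _escape(value: str) -> str:
    """Escape backslash, double quote, and newline in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name: str, labels: Dict[str, str], value: float) -> str:
    """Render one sample line in the Prometheus text exposition format."""
    if labels:
        body = ",".join(f'{k}="{_escape(v)}"' for k, v in sorted(labels.items()))
        return f"{name}{{{body}}} {value}"
    return f"{name} {value}"


# Example label values are illustrative only.
line = render_sample(
    "wooo_repair_total",
    {"app_id": "api", "action": "restart_pod", "status": "success"},
    3,
)
```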

---

## 11. Notification Flow

```
┌─────────────────────────────────────────────────────────────┐
│                    Alert Triggered                          │
│               (Prometheus / SigNoz)                         │
└──────────────────────┬──────────────────────────────────────┘
                       ↓
┌─────────────────────────────────────────────────────────────┐
│                  Alertmanager Webhook                       │
└──────────────────────┬──────────────────────────────────────┘
                       ↓
        ┌──────────────┼──────────────┬──────────────┐
        ↓              ↓              ↓              ↓
   ┌─────────┐   ┌──────────┐   ┌──────────┐   ┌───────────┐
   │ Ticket  │   │ Telegram │   │ PagerDuty│   │   Slack   │
   │ System  │   │   Bot    │   │ (On-Call)│   │ Channels  │
   └─────────┘   └──────────┘   └──────────┘   └───────────┘
                       │
                       ↓
        ┌──────────────────────────────────────┐
        │       Auto-Remediation Engine        │
        ├──────────────────────────────────────┤
        │ 1. Validate target (whitelist)       │
        │ 2. Execute repair action             │
        │ 3. Record result in DB               │
        │ 4. Notify outcome (Telegram/NATS)    │
        └──────────────────────────────────────┘
```

---

## 12. Recommendations for Migrating to AWOOOI

### 12.1 Must Migrate (P0)

| Component | Original path | Suggested new path |
|-----------|---------------|--------------------|
| OpenTelemetry initialization | `clawbot/app/core/telemetry.py` | `apps/api/src/core/telemetry.py` |
| Prometheus Client | `src/services/prometheus_client.py` | `apps/api/src/services/` |
| Health Routes | `src/api/routes/health.py` | `apps/api/src/routes/health.py` |
| Alert Rules | `docker/prometheus/rules/*.yml` | `ops/prometheus/rules/` |
| Alertmanager Config | `docker/alertmanager/*.yml` | `ops/alertmanager/` |

### 12.2 Optional Migration (P1)

| Component | Description |
|-----------|-------------|
| Grafana Dashboards | 6 dashboard JSON files |
| Loki + Promtail | Log aggregation |
| SLA Engine | Escalation mechanism |
| Alert Aggregator | Alert deduplication |

### 12.3 Requires Refactoring (P2)

| Component | Reason |
|-----------|--------|
| Remediation Engine | Must be adapted to the new Multi-Sig approval workflow |
| On-Call Service | Must integrate the new OpenClaw notifications |

---

## Appendix: Key Configuration Files

| Configuration file | Path |
|--------------------|------|
| Alertmanager main config | `docker/alertmanager/alertmanager.yml` |
| Alertmanager production config | `infrastructure/alertmanager/alertmanager.yml` |
| Prometheus Alert Rules | `docker/prometheus/rules/alerts.yml` |
| Service Health Rules | `docker/prometheus/rules/service-health-rules.yml` |
| SigNoz Alert Rules | `infrastructure/signoz/alert-rules.yaml` |
| Prometheus Scrape Config | `docker/prometheus/prometheus.yml` |
| K8s API Deployment | `infrastructure/kubernetes/base/api-deployment.yaml` |
| Monitoring Cron Jobs | `infrastructure/cron/monitoring-jobs.cron` |
| Auto-Recovery Script | `scripts/auto-recovery.sh` |

---

**Inventory completed**: 2026-03-22
**Inventory performed by**: Claude Code (Monitoring Inventory Agent)