
WOOO-AIOPS Monitoring Mechanism Inventory Report

Inventory of monitoring assets to migrate to AWOOOI

Inventory date: 2026-03-22
Source project: /Users/ogt/wooo-aiops

1. Monitoring Stack Overview

Component	Purpose	Source path	Migration priority
OpenTelemetry	Distributed tracing	`clawbot/app/core/telemetry.py`	🔴 P0
Prometheus	Metrics collection	`docker/prometheus/prometheus.yml`	🔴 P0
Alertmanager	Alert routing and notification	`docker/alertmanager/alertmanager.yml`	🔴 P0
SignOz	APM + Traces + Logs	`infrastructure/signoz/alert-rules.yaml`	🟡 P1
Grafana	Dashboard visualization	`docker/grafana/dashboards/*.json`	🟡 P1
Loki + Promtail	Log aggregation	`docker/loki/loki-config.yml`	🟡 P1

2. Health Checks

2.1 API Health Endpoints

Endpoint	Purpose	Checks
`/health`	Liveness Probe	git_sha, build_time, version
`/ready`	Readiness Probe	DB connection, Redis connection
`/api/v1/health`	Gateway Health	API gateway status

Source file: src/api/routes/health.py
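The split between the liveness and readiness endpoints can be sketched without any web framework. This is a minimal illustration, not the contents of `health.py`: the environment variable names (`GIT_SHA`, `BUILD_TIME`, `APP_VERSION`) and the shape of the `checks` argument are assumptions made for the example.

```python
import os
from typing import Callable, Dict


def liveness() -> Dict[str, str]:
    """Payload for /health: build metadata only, no dependency checks."""
    return {
        "status": "ok",
        "git_sha": os.environ.get("GIT_SHA", "unknown"),        # hypothetical env var
        "build_time": os.environ.get("BUILD_TIME", "unknown"),  # hypothetical env var
        "version": os.environ.get("APP_VERSION", "unknown"),    # hypothetical env var
    }


def readiness(checks: Dict[str, Callable[[], bool]]) -> Dict[str, object]:
    """Payload for /ready: ready only if every dependency check passes."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a check that raises counts as not ready
    return {"ready": all(results.values()), "checks": results}


if __name__ == "__main__":
    # Stand-in checks; real ones would ping the DB and Redis.
    print(readiness({"db": lambda: True, "redis": lambda: False}))
```

Keeping dependency checks out of the liveness payload means a DB or Redis outage takes the pod out of rotation without restarting it.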

2.2 K8s Probe Configuration

# Source: infrastructure/kubernetes/base/api-deployment.yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 30

readinessProbe:
  httpGet:
    path: /ready
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10
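
The excerpt above does not set `failureThreshold`. Assuming the Kubernetes default of 3 applies, a rough upper bound on how long a failing container goes unnoticed is the probe period times the threshold. The helper name below is made up for this illustration, and probe timeouts are ignored.

```python
def seconds_to_failure(period_seconds: int, failure_threshold: int = 3) -> int:
    """Approximate time for consecutive probe failures to reach the threshold."""
    return period_seconds * failure_threshold


# Values taken from the manifest excerpt; failure_threshold=3 is an assumption.
liveness_restart_after = seconds_to_failure(30)   # container restarted
readiness_removed_after = seconds_to_failure(10)  # pod removed from endpoints

if __name__ == "__main__":
    print(liveness_restart_after, readiness_removed_after)
```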

3. Alert Rules Inventory

3.1 Prometheus Alert Rules (20+ rules)

Source files:

docker/prometheus/rules/alerts.yml
docker/prometheus/rules/service-health-rules.yml

3.1.1 System-Level Alerts (P0 Critical)

Alert name	Trigger condition	Severity
`InstanceDown`	Instance down > 1m	🔴 P0
`VersionDriftDetected`	Deployed version differs from the expected version	🔴 P0
`UnexpectedPodRestart`	Unexpected pod restart	🔴 P0
`ImagePullBackOff`	Image pull failed	🔴 P0
`PodCrashLoopBackOff`	Pod keeps crashing	🔴 P0

3.1.2 CI/CD Pipeline Alerts

Alert name	Trigger condition	Severity
`PipelineFailed`	Pipeline run failed	🔴 P0
`PipelineTooSlow`	Pipeline > 30m	🟡 P1
`JobStuckPending`	Job queued > 5m	🟡 P1
`JobRunningTooLong`	Job running > 30m	🟡 P1
`GitLabRunnerOffline`	Runner offline	🔴 P0

3.1.3 Infrastructure Alerts

Alert name	Trigger condition	Severity
`HighCpuLoad`	CPU > 90% for 5m	🔴 P0
`HighMemoryUsage`	Memory > 90% for 5m	🔴 P0
`DiskSpaceLow`	Disk > 85%	🟡 P1
`HTTP502Spike`	Spike in 502 errors	🔴 P0
`HTTP500Spike`	Spike in 500 errors	🔴 P0
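
Conditions such as "CPU > 90% for 5m" only fire after the threshold has been exceeded continuously for the whole hold period, which is how the `for:` clause of a Prometheus alerting rule behaves. The sketch below mimics that on a list of samples; the function name and the sample format are assumptions for illustration, and the real rules are evaluated by Prometheus itself.

```python
from typing import Iterable, Tuple


def fires(samples: Iterable[Tuple[float, float]],
          threshold: float,
          hold_seconds: float) -> bool:
    """Return True once the value has stayed above threshold for hold_seconds.

    samples: (timestamp_seconds, value) pairs in ascending time order.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # condition just became true: start the timer
            if ts - breach_start >= hold_seconds:
                return True
        else:
            breach_start = None  # any dip below the threshold resets the timer
    return False


if __name__ == "__main__":
    cpu = [(0, 95), (60, 96), (120, 97), (180, 95), (240, 93), (300, 94)]
    print(fires(cpu, threshold=90, hold_seconds=300))
```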

3.1.4 Database/Cache Alerts

Alert name	Trigger condition	Severity
`PostgreSQLConnectionFailed`	DB connection failed	🔴 P0
`RedisConnectionFailed`	Redis connection failed	🔴 P0
`PostgreSQLConnectionPoolExhausted`	Connection pool > 90%	🟡 P1

3.1.5 SSL Certificate Alerts

Alert name	Trigger condition	Severity
`SSLCertExpiringSoon`	Certificate expires within 14 days	🟡 P1
`SSLCertExpired`	Certificate has expired	🔴 P0
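
The two certificate alerts differ only in how the expiry time compares with the current time. A minimal sketch of that classification, assuming the certificate's `notAfter` timestamp has already been obtained (the function name is hypothetical):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def classify_cert(not_after: datetime, now: datetime, warn_days: int = 14) -> Optional[str]:
    """Map a certificate expiry time to the alert name it would raise, if any."""
    if now >= not_after:
        return "SSLCertExpired"
    if not_after - now <= timedelta(days=warn_days):
        return "SSLCertExpiringSoon"
    return None


if __name__ == "__main__":
    now = datetime(2026, 3, 22, tzinfo=timezone.utc)
    print(classify_cert(now + timedelta(days=10), now))
```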

3.1.6 Performance Alerts

Alert name	Trigger condition	Severity
`APIResponseTimeSlow`	P95 latency > 2s	🟡 P1
`HighErrorRate`	Error rate > 1%	🟡 P1
`WebSocketConnectionFailed`	WebSocket failure	🟢 P2
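
To make the two numeric thresholds concrete, here is a small sketch that computes a nearest-rank P95 and an error rate from raw samples. The actual rules most likely compute these in PromQL from histogram and counter metrics; the functions below are only an illustration of what the thresholds mean.

```python
from typing import Sequence


def p95(latencies: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample."""
    if not latencies:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) using integers
    return ordered[rank - 1]


def error_rate(errors: int, total: int) -> float:
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return errors / total if total else 0.0


if __name__ == "__main__":
    samples = [0.1] * 18 + [2.5, 3.0]
    print(p95(samples) > 2.0)           # would trip APIResponseTimeSlow
    print(error_rate(3, 200) > 0.01)    # would trip HighErrorRate
```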

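3.2 SignOz Alert Rules (30+ rules)

Source file: infrastructure/signoz/alert-rules.yaml

Category	Alert count	Severity distribution
Database	5	P0: 2, P1: 3
Cache	3	P0: 1, P1: 2
HTTP errors	6	P0: 2, P1: 2, P2: 2
Containers	4	P0: 2, P1: 2
Service-specific	12	Varies by service

Service-specific alerts cover: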

Gitea, Harbor, ClawBot, Ollama, SignOz, n8n

4. Notification Channels

4.1 Telegram Integration

Source files:

src/api/routes/telegram_alerts.py
clawbot/app/bot/telegram.py

Channel	Environment variable	Purpose
General alerts	`TELEGRAM_CHAT_ID`	All alerts
P0 urgent	`TELEGRAM_P0_CHAT_ID`	Critical only
Security alerts	`TELEGRAM_SECURITY_CHAT_ID`	Security only

Features:

HTML formatting + emoji severity indicators
Sent as a background task (non-blocking)
Two-way interaction: the /ask command triggers an AI diagnosis
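
The channel table and the HTML formatting can be sketched as two pure functions. The alert fields (`severity`, `category`, `name`, `summary`) and the emoji mapping are assumptions for this example; only the three environment variable names come from the table above.

```python
import html
from typing import Dict, List, Mapping

SEVERITY_EMOJI = {"P0": "🔴", "P1": "🟡", "P2": "🟢"}  # assumed mapping


def target_chat_ids(alert: Dict[str, str], env: Mapping[str, str]) -> List[str]:
    """Pick the chats an alert goes to; unset variables are skipped."""
    wanted = ["TELEGRAM_CHAT_ID"]  # the general chat receives every alert
    if alert.get("severity") == "P0":
        wanted.append("TELEGRAM_P0_CHAT_ID")
    if alert.get("category") == "security":
        wanted.append("TELEGRAM_SECURITY_CHAT_ID")
    return [env[name] for name in wanted if env.get(name)]


def format_html(alert: Dict[str, str]) -> str:
    """Render a message body for Telegram's HTML parse mode."""
    emoji = SEVERITY_EMOJI.get(alert.get("severity", ""), "⚪")
    name = html.escape(alert.get("name", "unknown"))
    summary = html.escape(alert.get("summary", ""))
    return f"{emoji} <b>{name}</b>\n{summary}"


if __name__ == "__main__":
    alert = {"name": "HighCpuLoad", "severity": "P0", "summary": "CPU > 90% for 5m"}
    print(format_html(alert))
```

Escaping the alert text matters because label values may contain `<` or `&`, which would otherwise break HTML parsing on the Telegram side.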

4.2 Slack Integration

Source file: docker/alertmanager/alertmanager.yml

Channel	Purpose
`#alerts`	Default alerts
`#alerts-security`	Security alerts
`#alerts-security-critical`	Critical security alerts
`#alerts-infra`	Infrastructure
`#p0-war-room`	P0 war room

4.3 PagerDuty On-Call

Service	SLA	Purpose
P0 Service Key	5-minute response	Critical
P1 Service Key	15-minute response	High

Escalates automatically to C-Level

4.4 Email Notifications

Source file: src/services/notification.py

SMTP (TLS/STARTTLS)
Asynchronous sending via aiosmtplib
HTML + Plain-text

5. Auto-Remediation

5.1 Remediation Engine v1

Source file: src/automation/remediation_engine.py

Remediation Action Mapping

Action	Description	Triggering alerts
`restart_pod`	Restart pod	HighErrorRate, PodCrashLooping, SlowResponse, ServiceDown
`scale_up`	Scale out horizontally	HighCPU, HighMemory
`scale_down`	Reduce replicas	Manual trigger
`rollback_deployment`	Roll back version	Manual trigger
`clear_cache`	Clear Redis	Manual trigger
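
The mapping table translates naturally into a lookup. This is a sketch of the data only, using the alert and action names exactly as listed above; how `remediation_engine.py` actually stores the mapping is not shown in this inventory.

```python
from typing import Optional

# Alert name -> automatic action, as listed in the mapping table.
ALERT_ACTIONS = {
    "HighErrorRate": "restart_pod",
    "PodCrashLooping": "restart_pod",
    "SlowResponse": "restart_pod",
    "ServiceDown": "restart_pod",
    "HighCPU": "scale_up",
    "HighMemory": "scale_up",
}

# Actions that no alert triggers; they require a manual request.
MANUAL_ONLY = {"scale_down", "rollback_deployment", "clear_cache"}


def action_for(alert_name: str) -> Optional[str]:
    """Return the automatic action for an alert, or None if there is none."""
    return ALERT_ACTIONS.get(alert_name)


if __name__ == "__main__":
    print(action_for("HighCPU"))
```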

Safety Guardrails

# Whitelisted namespaces
ALLOWED_NAMESPACES = ["wooo-aiops-uat", "wooo-aiops-prod"]

# Dry-run mode: a boolean flag, either True or False (True shown here)
AUTOMATION_DRY_RUN = True

# Maximum replica count
MAX_REPLICAS = 10
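
A sketch of how these three guardrails could be applied before a scale-up. The function and exception names are hypothetical, and the constants are repeated so the example runs on its own.

```python
ALLOWED_NAMESPACES = ["wooo-aiops-uat", "wooo-aiops-prod"]
MAX_REPLICAS = 10


class GuardrailError(Exception):
    """Raised when a remediation request violates a safety guardrail."""


def plan_scale_up(namespace: str, current_replicas: int,
                  step: int = 1, dry_run: bool = True) -> dict:
    """Validate the target and return the scaling plan without applying it."""
    if namespace not in ALLOWED_NAMESPACES:
        raise GuardrailError(f"namespace {namespace!r} is not whitelisted")
    target = min(current_replicas + step, MAX_REPLICAS)  # clamp to the cap
    return {"namespace": namespace, "replicas": target, "dry_run": dry_run}


if __name__ == "__main__":
    print(plan_scale_up("wooo-aiops-uat", current_replicas=9, step=3))
```

Returning a plan instead of calling the cluster directly keeps the dry-run decision in one place: the caller applies the plan only when `dry_run` is False.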

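5.2 Remediation Engine v2 (Repair Engine)

Source file: src/engines/repair_engine.py

8 Repair Strategies

Strategy	Description
`RESTART`	Restart pod
`SCALE_UP`	Scale out horizontally
`SCALE_DOWN`	Reduce replicas
`ROLLBACK`	Roll back version
`INCREASE_MEMORY`	Increase memory (+50% max)
`INCREASE_CPU`	Increase CPU (+50% max)
`VACUUM_DB`	Database maintenance
`CLEAR_CACHE`	Clear cache

Safety Limits

Parameter	Value	Description
Max repairs/hour	5	Maximum number of repairs per hour
Max consecutive failures	3	Stop after this many consecutive failures
Min healthy replicas	1	Minimum number of healthy replicas
Rollback window	24h	Time window for rollbacks
Memory increase limit	50%	Upper limit on a memory increase
CPU increase limit	50%	Upper limit on a CPU increase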
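
Three of these limits (repairs per hour, consecutive failures, and the 50% resource cap) can be sketched as follows. The class and function names are hypothetical, and the inventory does not say how the real engine resets after it stops, so the explicit `reset()` here is an assumption.

```python
from collections import deque


class RepairGovernor:
    """Rate-limit repairs and stop after repeated failures."""

    def __init__(self, max_per_hour: int = 5, max_consecutive_failures: int = 3):
        self.max_per_hour = max_per_hour
        self.max_failures = max_consecutive_failures
        self._starts = deque()  # timestamps (seconds) of recent repairs
        self._consecutive_failures = 0

    def allow(self, now: float) -> bool:
        """True if another repair may start at time `now`."""
        while self._starts and now - self._starts[0] >= 3600:
            self._starts.popleft()  # forget repairs older than one hour
        if self._consecutive_failures >= self.max_failures:
            return False
        return len(self._starts) < self.max_per_hour

    def record(self, now: float, success: bool) -> None:
        """Register a finished repair and update the failure streak."""
        self._starts.append(now)
        self._consecutive_failures = 0 if success else self._consecutive_failures + 1

    def reset(self) -> None:
        """Clear the failure streak, e.g. after an operator has intervened."""
        self._consecutive_failures = 0


def capped_increase(current: float, requested: float, limit: float = 0.5) -> float:
    """Clamp a resource request to at most current * (1 + limit)."""
    return min(requested, current * (1 + limit))


if __name__ == "__main__":
    print(capped_increase(512, 1024))  # memory in MiB, capped at +50%
```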

5.3 Auto-Recovery Script

Source file: scripts/auto-recovery.sh

Cron schedule: runs every 10 minutes

# Checks
1. API Health Check (HTTP 200)
2. Frontend Health Check (HTTP 200/302)
3. Disk Space (cleanup triggered at >90%)
4. GitHub Actions Runner status
5. Service restart and recovery

Log location: /var/log/wooo/auto-recovery.log
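
The real script is a shell script; the following Python sketch only illustrates the decision logic of checks 1 to 3. The function names are made up, and how the script performs cleanup or restarts is not described in this inventory.

```python
import shutil
from typing import Iterable


def http_healthy(status: int, accepted: Iterable[int] = (200,)) -> bool:
    """True if the HTTP status code is one of the accepted codes."""
    return status in tuple(accepted)


def disk_needs_cleanup(used: int, total: int, threshold: float = 0.90) -> bool:
    """True if the used fraction of the disk exceeds the threshold."""
    return total > 0 and used / total > threshold


if __name__ == "__main__":
    usage = shutil.disk_usage("/")
    print("api ok:", http_healthy(200))
    print("frontend ok:", http_healthy(302, accepted=(200, 302)))
    print("cleanup needed:", disk_needs_cleanup(usage.used, usage.total))
```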

6. SLA Engine and Escalation

Source file: src/engines/sla_engine.py

6.1 SLA Thresholds

Priority	Response time	Resolution time
P0	5 minutes	30 minutes
P1	15 minutes	2 hours
P2	1 hour	8 hours
P3	4 hours	24 hours
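
The thresholds above translate directly into deadlines relative to the time an incident is opened. A minimal sketch, with hypothetical names, that does not attempt to reflect the internals of `sla_engine.py`:

```python
from datetime import datetime, timedelta
from typing import Tuple

# priority -> (response time, resolution time), from the SLA threshold table
SLA = {
    "P0": (timedelta(minutes=5), timedelta(minutes=30)),
    "P1": (timedelta(minutes=15), timedelta(hours=2)),
    "P2": (timedelta(hours=1), timedelta(hours=8)),
    "P3": (timedelta(hours=4), timedelta(hours=24)),
}


def sla_deadlines(priority: str, opened_at: datetime) -> Tuple[datetime, datetime]:
    """Return (respond_by, resolve_by) for an incident."""
    respond, resolve = SLA[priority]
    return opened_at + respond, opened_at + resolve


def is_breached(deadline: datetime, now: datetime, done: bool) -> bool:
    """A deadline is breached if it has passed and the step is not done."""
    return not done and now > deadline


if __name__ == "__main__":
    opened = datetime(2026, 3, 22, 12, 0)
    print(sla_deadlines("P1", opened))
```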

6.2 Escalation Levels

Level	Role
L0	L1 Support (first-line support)
L1	L2 Expert (specialist support)
L2	Team Lead (department head)
L3	Director
L4	C-Level (CEO, CTO, CIO, CISO, CPO)

6.3 Escalation Matrix

Priority	Escalation path
P0	On-Call → Team Lead → CIO → CISO → CEO
P1	On-Call → Team Lead → CIO
P2	On-Call → Team Lead
P3	On-Call only
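
The escalation matrix can be held as ordered lists, with escalation meaning "move one step to the right". The helper name is hypothetical, and the timing of each step (when to escalate) is not specified in this section, so it is left out.

```python
from typing import Optional

ESCALATION_PATHS = {
    "P0": ["On-Call", "Team Lead", "CIO", "CISO", "CEO"],
    "P1": ["On-Call", "Team Lead", "CIO"],
    "P2": ["On-Call", "Team Lead"],
    "P3": ["On-Call"],
}


def next_contact(priority: str, current: str) -> Optional[str]:
    """Return who to escalate to after `current`, or None at the end of the path."""
    path = ESCALATION_PATHS[priority]
    idx = path.index(current)  # raises ValueError if `current` is not on the path
    return path[idx + 1] if idx + 1 < len(path) else None


if __name__ == "__main__":
    print(next_contact("P0", "CIO"))
```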

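7. Alert Aggregation and Deduplication

Source file: src/services/alert_aggregator.py

Features

Feature	Description
Fingerprint deduplication	Exact match on identical alerts
Time-window deduplication	Identical alerts within 5 minutes
Alert storm detection	> 10 alerts/minute
Label grouping	Aggregates alerts with similar labels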
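
Fingerprinting, the 5-minute window, and storm detection fit in one small class. This is a sketch under assumptions: the fingerprint is a hash of the sorted labels, and the suppression window is measured from the last forwarded alert. The real `alert_aggregator.py` may differ on both points.

```python
import hashlib
from collections import deque
from typing import Dict


def fingerprint(labels: Dict[str, str]) -> str:
    """Stable hash of an alert's labels, independent of key order."""
    canonical = "|".join(f"{key}={labels[key]}" for key in sorted(labels))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class Aggregator:
    def __init__(self, window_seconds: float = 300, storm_per_minute: int = 10):
        self.window = window_seconds
        self.storm_per_minute = storm_per_minute
        self._last_forwarded = {}  # fingerprint -> time it was last forwarded
        self._recent = deque()     # arrival times within the last minute

    def accept(self, labels: Dict[str, str], now: float) -> bool:
        """Return True to forward the alert, False if it is a duplicate."""
        self._recent.append(now)
        while self._recent and now - self._recent[0] >= 60:
            self._recent.popleft()
        fp = fingerprint(labels)
        last = self._last_forwarded.get(fp)
        if last is not None and now - last < self.window:
            return False
        self._last_forwarded[fp] = now
        return True

    def storm(self) -> bool:
        """True if more than storm_per_minute alerts arrived in the last minute."""
        return len(self._recent) > self.storm_per_minute


if __name__ == "__main__":
    agg = Aggregator()
    alert = {"alertname": "HighCpuLoad", "instance": "api-1"}
    print(agg.accept(alert, now=0), agg.accept(alert, now=100))
```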

Prometheus Metrics

wooo_alerts_received_total{severity, source}
wooo_alerts_deduplicated_total{reason}
wooo_alerts_aggregated_total{group_key}
wooo_alert_groups_active
wooo_alert_storm_detected_total

8. Grafana Dashboards Inventory

Dashboard	Path	Purpose
AIOPS Brain	`infrastructure/grafana/dashboards/aiops-brain.json`	AI brain status
API Performance	`docker/grafana/dashboards/api-performance.json`	API performance
Container Health	`docker/grafana/dashboards/container-health.json`	Container health
System Overview	`docker/grafana/dashboards/system-overview.json`	System overview
DevOps KPIs	`infrastructure/grafana/dashboards/devops-kpis.json`	DevOps metrics
Pipeline Health	`infrastructure/grafana/dashboards/pipeline-health.json`	Pipeline health

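9. Alert-to-Ticket Integration

Source file: src/services/alert_ticket_service.py

Feature	Description
Automatic ticket creation	A ticket is created automatically for every alert
Deduplication	Prevents duplicate tickets for the same alert
Severity mapping	P0/P1/P2 → ticket priority
Automatic closing	The ticket is closed automatically when the alert resolves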
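
The four behaviours in the table can be shown with an in-memory store. Everything here is illustrative: the class name, the ticket priority labels (`urgent`, `high`, `normal`, `low`), and the use of an alert fingerprint as the deduplication key are assumptions, not details taken from `alert_ticket_service.py`.

```python
from typing import Dict, Optional

PRIORITY_MAP = {"P0": "urgent", "P1": "high", "P2": "normal"}  # assumed labels


class TicketStore:
    def __init__(self) -> None:
        self.tickets: Dict[int, Dict[str, str]] = {}
        self._open: Dict[str, int] = {}  # alert fingerprint -> open ticket id
        self._next_id = 1

    def on_firing(self, alert_fp: str, severity: str) -> int:
        """Create a ticket, or return the existing one for a repeated alert."""
        if alert_fp in self._open:
            return self._open[alert_fp]
        ticket_id = self._next_id
        self._next_id += 1
        self.tickets[ticket_id] = {
            "priority": PRIORITY_MAP.get(severity, "low"),
            "status": "open",
        }
        self._open[alert_fp] = ticket_id
        return ticket_id

    def on_resolved(self, alert_fp: str) -> Optional[int]:
        """Close the ticket tied to a resolved alert, if one is open."""
        ticket_id = self._open.pop(alert_fp, None)
        if ticket_id is not None:
            self.tickets[ticket_id]["status"] = "closed"
        return ticket_id


if __name__ == "__main__":
    store = TicketStore()
    first = store.on_firing("fp-1", "P0")
    print(first, store.on_firing("fp-1", "P0"), store.on_resolved("fp-1"))
```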

10. Custom Metrics Export

10.1 Deployment Tracking

wooo_deployment_version_drift      # 1 = version drift
wooo_pipeline_status{status}       # failed = 1
wooo_pipeline_duration_seconds
wooo_job_queued_duration_seconds
wooo_job_duration_seconds
wooo_gitlab_runner_status          # 0 = offline

10.2 Auto-Remediation

wooo_repair_total{app_id, action, status}
wooo_repair_duration_seconds{app_id, action}
wooo_repair_in_progress{app_id}
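
For reference, this is roughly what samples of these metrics look like on a `/metrics` endpoint in the Prometheus text exposition format. The renderer below is a hand-written sketch for illustration (a real service would normally use a Prometheus client library), and the label values are invented.

```python
from typing import Dict, Union


def _escape(value: str) -> str:
    """Escape backslash, double quote, and newline in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name: str, labels: Dict[str, str], value: Union[int, float]) -> str:
    """Render one sample line in the Prometheus text exposition format."""
    if labels:
        body = ",".join(f'{key}="{_escape(val)}"' for key, val in sorted(labels.items()))
        return f"{name}{{{body}}} {value}"
    return f"{name} {value}"


if __name__ == "__main__":
    print(render_sample(
        "wooo_repair_total",
        {"app_id": "api", "action": "restart", "status": "success"},  # example values
        3,
    ))
    print(render_sample("wooo_deployment_version_drift", {}, 1))
```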

11. Alert Notification Flow

┌─────────────────────────────────────────────────────────────┐
│                    Alert Triggered                          │
│               (Prometheus / SignOz)                         │
└──────────────────────┬──────────────────────────────────────┘
                       ↓
┌─────────────────────────────────────────────────────────────┐
│                  Alertmanager Webhook                       │
└──────────────────────┬──────────────────────────────────────┘
                       ↓
        ┌──────────────┼──────────────┬──────────────┐
        ↓              ↓              ↓              ↓
   ┌─────────┐   ┌──────────┐   ┌──────────┐   ┌───────────┐
   │ Ticket  │   │ Telegram │   │ PagerDuty│   │   Slack   │
   │ System  │   │   Bot    │   │ (On-Call)│   │ Channels  │
   └─────────┘   └──────────┘   └──────────┘   └───────────┘
                       │
                       ↓
        ┌──────────────────────────────────────┐
        │       Auto-Remediation Engine        │
        ├──────────────────────────────────────┤
        │ 1. Validate target (whitelist)       │
        │ 2. Execute repair action             │
        │ 3. Record result in DB               │
        │ 4. Notify outcome (Telegram/NATS)    │
        └──────────────────────────────────────┘
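
The four steps in the Auto-Remediation Engine box can be read as a small pipeline. The sketch below passes the side effects in as callables so it stays self-contained; the function name, the result fields, and the decision to record and notify rejected requests too are assumptions.

```python
from typing import Callable, Collection, Dict


def remediate(namespace: str,
              action_name: str,
              allowed_namespaces: Collection[str],
              execute: Callable[[], None],
              record: Callable[[Dict[str, str]], None],
              notify: Callable[[Dict[str, str]], None]) -> Dict[str, str]:
    """Validate the target, run the action, then record and notify the outcome."""
    result = {"namespace": namespace, "action": action_name}
    if namespace not in allowed_namespaces:       # 1. validate target (whitelist)
        result["status"] = "rejected"
    else:
        try:
            execute()                             # 2. execute repair action
            result["status"] = "success"
        except Exception as exc:
            result["status"] = "failed"
            result["error"] = str(exc)
    record(result)                                # 3. record result
    notify(result)                                # 4. notify outcome
    return result


if __name__ == "__main__":
    log, sent = [], []
    print(remediate("wooo-aiops-uat", "restart_pod", {"wooo-aiops-uat"},
                    lambda: None, log.append, sent.append))
```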

12. Recommendations for Migrating to AWOOOI

12.1 Must Migrate (P0)

Component	Original path	Suggested new path
OpenTelemetry initialization	`clawbot/app/core/telemetry.py`	`apps/api/src/core/telemetry.py`
Prometheus Client	`src/services/prometheus_client.py`	`apps/api/src/services/`
Health Routes	`src/api/routes/health.py`	`apps/api/src/routes/health.py`
Alert Rules	`docker/prometheus/rules/*.yml`	`ops/prometheus/rules/`
Alertmanager Config	`docker/alertmanager/*.yml`	`ops/alertmanager/`

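12.2 Optional Migration (P1)

Component	Description
Grafana Dashboards	6 dashboard JSON files
Loki + Promtail	Log aggregation
SLA Engine	Escalation mechanism
Alert Aggregator	Alert deduplication

12.3 Needs Refactoring (P2)

Component	Reason
Remediation Engine	Must be adapted to the new Multi-Sig approval flow
On-Call Service	Must be integrated with the new OpenClaw notifications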

Appendix: Key Configuration Files

Config file	Path
Alertmanager main config	`docker/alertmanager/alertmanager.yml`
Alertmanager production config	`infrastructure/alertmanager/alertmanager.yml`
Prometheus Alert Rules	`docker/prometheus/rules/alerts.yml`
Service Health Rules	`docker/prometheus/rules/service-health-rules.yml`
SignOz Alert Rules	`infrastructure/signoz/alert-rules.yaml`
Prometheus Scrape Config	`docker/prometheus/prometheus.yml`
K8s API Deployment	`infrastructure/kubernetes/base/api-deployment.yaml`
Monitoring Cron Jobs	`infrastructure/cron/monitoring-jobs.cron`
Auto-Recovery Script	`scripts/auto-recovery.sh`

Inventory completed: 2026-03-22
Inventory performed by: Claude Code (Monitoring Inventory Agent)
