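# WOOO AIOps Platform - Technical Architecture Design V10.2

> **The technical implementation that lets a new application onboard painlessly in under 5 minutes**

**Created**: 2026-02-14

**Version**: V10.2 (Technical Architecture)

---

## 🏗️ Overall Technical Architecture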
```
              WOOO AIOps Platform - Technical Architecture

╔══════════════════════════════════════════════════════════════════════╗
║  🌐 Web Portal (React/Vue)                                           ║
╠══════════════════════════════════════════════════════════════════════╣
║  User login | App management | Dashboards | Alert center | Settings  ║
╚══════════════════════════════════════════════════════════════════════╝
                                   │
                                   ▼
╔══════════════════════════════════════════════════════════════════════╗
║  🔌 AIOps API Gateway (Flask)                                        ║
╠══════════════════════════════════════════════════════════════════════╣
║  /api/apps      - application CRUD                                   ║
║  /api/deploy    - deployment management                              ║
║  /api/monitor   - monitoring data                                    ║
║  /api/alerts    - alert management                                   ║
║  /api/repair    - automated repair                                   ║
║  /api/templates - application templates                              ║
╚══════════════════════════════════════════════════════════════════════╝
                                   │
          ┌────────────────────────┼────────────────────────┐
          ▼                        ▼                        ▼
╔════════════════════╗   ╔════════════════════╗   ╔════════════════════╗
║ Deploy Engine      ║   ║ Monitor Engine     ║   ║ Repair Engine      ║
╠════════════════════╣   ╠════════════════════╣   ╠════════════════════╣
║ • K8s operations   ║   ║ • Prometheus       ║   ║ • Fault diagnosis  ║
║ • Render templates ║   ║ • Grafana          ║   ║ • Automated repair ║
║ • SSL setup        ║   ║ • Alert routing    ║   ║ • Rollback         ║
║ • DNS management   ║   ║ • Data aggregation ║   ║ • Knowledge base   ║
╚════════════════════╝   ╚════════════════════╝   ╚════════════════════╝
          │                        │                        │
          └────────────────────────┼────────────────────────┘
                                   ▼
╔══════════════════════════════════════════════════════════════════════╗
║  🗄️ PostgreSQL (metadata database)                                   ║
╠══════════════════════════════════════════════════════════════════════╣
║  apps  | app_configs | alerts        | repair_logs                   ║
║  users | templates   | metrics_cache | audit_logs                    ║
╚══════════════════════════════════════════════════════════════════════╝
                                   │
╔══════════════════════════════════════════════════════════════════════╗
║  ☸️ Kubernetes Cluster (K3s)                                         ║
╠══════════════════════════════════════════════════════════════════════╣
║                                                                      ║
║  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ║
║  │ Namespace:  │  │ Namespace:  │  │ Namespace:  │  │ Namespace:  │  ║
║  │ aiops-core  │  │ app-momo    │  │ app-crm     │  │ app-event   │  ║
║  │ (platform)  │  │ (customer 1)│  │ (customer 2)│  │ (customer 3)│  ║
║  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘  ║
║                                                                      ║
║  ┌────────────────────────────────────────────────────────────────┐  ║
║  │ monitoring Namespace                                           │  ║
║  │ Prometheus | Grafana | Alertmanager | Loki | n8n               │  ║
║  └────────────────────────────────────────────────────────────────┘  ║
║                                                                      ║
╚══════════════════════════════════════════════════════════════════════╝
```
---
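## 📦 New Application Onboarding Flow - Technical Details
### Step 1: Application Registration
```python
# aiops_api/routes/apps.py
from flask import jsonify, request

# Assumed to be defined elsewhere in the project: `app` (the Flask instance),
# the `App` model, `verify_dns`, `detect_framework`, and the async `deploy_task`.


@app.route('/api/apps', methods=['POST'])
def create_app():
    """
    Create a new application.

    Request:
    {
        "name": "event-system",
        "display_name": "Event Registration Site",
        "domain": "event.wooo.work",
        "source_type": "git",
        "source_url": "https://gitlab.com/wooo/event-system.git",
        "framework": "flask",  # optional; auto-detected when omitted
        "plan": "standard"
    }

    Response:
    {
        "app_id": "app-12345",
        "status": "initializing",
        "namespace": "app-event-system",
        "steps": [
            {"step": "create_namespace", "status": "pending"},
            {"step": "deploy_app", "status": "pending"},
            {"step": "configure_ingress", "status": "pending"},
            {"step": "setup_ssl", "status": "pending"},
            {"step": "configure_monitoring", "status": "pending"},
            {"step": "setup_auto_repair", "status": "pending"}
        ]
    }
    """
    data = request.get_json(silent=True) or {}
    missing = [key for key in ('name', 'domain') if not data.get(key)]
    if missing:
        return jsonify({"error": f"Missing fields: {', '.join(missing)}"}), 400

    # 1. Verify the domain's DNS record
    if not verify_dns(data['domain']):
        return jsonify({"error": "DNS does not point to the expected IP"}), 400

    # 2. Detect the framework if it was not specified
    if not data.get('framework'):
        data['framework'] = detect_framework(data['source_url'])

    # 3. Create the application record
    app_record = App.create(data)

    # 4. Kick off the asynchronous deployment task
    deploy_task.delay(app_record.id)

    # 202 Accepted: the deployment continues in the background
    return jsonify(app_record.to_dict()), 202
```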
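Step 1 depends on `verify_dns`, which this document does not show. The following is a minimal sketch, assuming the platform's ingress IPs are known in advance and that a plain address lookup is sufficient. `PLATFORM_IPS` is a placeholder taken from the documentation address range, not the real address, and the `resolver` parameter is a hypothetical hook that makes the check testable without network access.

```python
import socket
from typing import Callable, Iterable, Set

# Hypothetical placeholder: replace with the platform's real ingress IP(s).
PLATFORM_IPS = frozenset({"203.0.113.10"})


def resolve_ips(domain: str) -> Set[str]:
    """Resolve a domain to its set of IP addresses using the system resolver."""
    try:
        infos = socket.getaddrinfo(domain, None)
    except socket.gaierror:
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def verify_dns(
    domain: str,
    expected_ips: Iterable[str] = PLATFORM_IPS,
    resolver: Callable[[str], Set[str]] = resolve_ips,
) -> bool:
    """Return True when the domain resolves to at least one expected IP."""
    return bool(set(resolver(domain)) & set(expected_ips))
```

Passing a fake `resolver` keeps unit tests deterministic; in production the default resolver is used.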
### Step 2: Automated Deployment
```python
# aiops_api/engines/deploy_engine.py
from jinja2 import Template


class DeployEngine:
    """Application deployment engine."""

    # Resource requests per plan
    PLAN_RESOURCES = {
        'basic': {'memory_request': '256Mi', 'cpu_request': '250m'},
        'standard': {'memory_request': '512Mi', 'cpu_request': '500m'},
        'advanced': {'memory_request': '1Gi', 'cpu_request': '1000m'},
        'professional': {'memory_request': '2Gi', 'cpu_request': '2000m'},
    }

    def deploy(self, app: App):
        """Run the full deployment flow."""
        # 1. Create the Namespace
        self.create_namespace(app)
        # 2. Pick the template for the framework
        template = self.get_template(app.framework)
        # 3. Render the K8s YAML
        k8s_manifests = self.render_manifests(template, app)
        # 4. Apply the manifests to K8s
        self.apply_manifests(k8s_manifests)
        # 5. Configure Ingress + SSL
        self.setup_ingress(app)
        self.setup_ssl(app)
        # 6. Wait for the Pods to become Ready
        self.wait_for_ready(app)
        # 7. Configure monitoring
        self.setup_monitoring(app)
        # 8. Configure automated repair
        self.setup_auto_repair(app)
        # 9. Send the notification
        self.notify_success(app)

    def render_manifests(self, template: str, app: App) -> str:
        """Render the K8s manifests with Jinja2 and return the YAML text."""
        if app.plan not in self.PLAN_RESOURCES:
            raise ValueError(f"Unknown plan: {app.plan!r}")
        context = {
            'app_name': app.name,
            'namespace': app.namespace,
            'image': app.image,
            'domain': app.domain,
            'plan': app.plan,
            # NOTE: the Flask template also reads memory_limit and cpu_limit.
            # They are not defined in this table and must be supplied
            # before rendering, otherwise the limits render as empty strings.
            **self.PLAN_RESOURCES[app.plan],
        }
        return Template(template).render(**context)
```
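The `steps` list returned by `POST /api/apps` and the sequence inside `DeployEngine.deploy` describe the same pipeline. One way to keep the two in sync is to drive the deployment from a list of named steps and record each status as it changes. The sketch below is an assumption about how that could look: `run_steps` and every status string other than `pending` are hypothetical, and the lambdas stand in for the real engine methods.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_steps(steps: List[Step]) -> List[Dict[str, str]]:
    """Run steps in order; stop at the first failure and leave the rest pending."""
    statuses = [{"step": name, "status": "pending"} for name, _ in steps]
    for status, (_, action) in zip(statuses, steps):
        status["status"] = "running"
        try:
            action()
        except Exception as exc:  # record the failure instead of crashing the worker
            status["status"] = "failed"
            status["error"] = str(exc)
            break
        status["status"] = "success"
    return statuses


def failing_ingress() -> None:
    """Stand-in for a step that fails."""
    raise RuntimeError("ingress not ready")


result = run_steps([
    ("create_namespace", lambda: None),
    ("configure_ingress", failing_ingress),
    ("setup_ssl", lambda: None),
])
for entry in result:
    print(entry)
```

Because later steps stay `pending` after a failure, the portal can show exactly where the onboarding stopped.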
### Step 3: Automated Monitoring Configuration
```python
# aiops_api/engines/monitor_engine.py
class MonitorEngine:
    """Monitoring configuration engine."""

    def setup_monitoring(self, app: App):
        """Configure monitoring for the application automatically."""
        # 1. Create the ServiceMonitor (scraped automatically by Prometheus)
        self.create_service_monitor(app)
        # 2. Create the alert rules (PrometheusRule)
        self.create_alert_rules(app)
        # 3. Configure Alertmanager routing
        self.configure_alertmanager(app)
        # 4. Create the Grafana dashboard
        self.create_grafana_dashboard(app)

    def create_service_monitor(self, app: App):
        """Create the Prometheus ServiceMonitor."""
        manifest = f"""
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {app.name}
  namespace: {app.namespace}
  labels:
    app: {app.name}
    managed-by: wooo-aiops
spec:
  selector:
    matchLabels:
      app: {app.name}
  endpoints:
    - port: http
      interval: 15s
      path: /metrics
  namespaceSelector:
    matchNames:
      - {app.namespace}
"""
        self.k8s_apply(manifest)

    def create_alert_rules(self, app: App):
        """Create the standard alert rules."""
        manifest = f"""
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: {app.name}-alerts
  namespace: monitoring
  labels:
    managed-by: wooo-aiops
    app: {app.name}
spec:
  groups:
    - name: {app.name}.rules
      rules:
        # Service health check
        - alert: {app.name.title()}Down
          expr: up{{namespace="{app.namespace}", job="{app.name}"}} == 0
          for: 1m
          labels:
            severity: critical
            app: {app.name}
            auto_repair: restart
          annotations:
            summary: "{app.display_name} service is down"
            description: "The service has been down for more than 1 minute"
        # High memory usage
        - alert: {app.name.title()}HighMemory
          expr: |
            (container_memory_usage_bytes{{namespace="{app.namespace}"}}
              / container_spec_memory_limit_bytes{{namespace="{app.namespace}"}}) > 0.9
          for: 5m
          labels:
            severity: warning
            app: {app.name}
            auto_repair: scale_memory
          annotations:
            summary: "{app.display_name} memory usage is too high"
            description: "Memory usage is above 90%"
        # OOM kill
        # kube_pod_container_status_restarts_total carries no `reason` label,
        # so the restart count is joined with the last termination reason.
        - alert: {app.name.title()}OOMKilled
          expr: |
            increase(kube_pod_container_status_restarts_total{{
              namespace="{app.namespace}"
            }}[5m]) > 0
            and on (namespace, pod, container)
            kube_pod_container_status_last_terminated_reason{{
              namespace="{app.namespace}",
              reason="OOMKilled"
            }} == 1
          labels:
            severity: critical
            app: {app.name}
            auto_repair: oom_handler
          annotations:
            summary: "{app.display_name} was OOM-killed"
            description: "The container was terminated because it ran out of memory"
        # 5xx error rate
        - alert: {app.name.title()}High5xxRate
          expr: |
            (sum(rate(http_requests_total{{
              namespace="{app.namespace}",
              status=~"5.."
            }}[5m]))
            / sum(rate(http_requests_total{{
              namespace="{app.namespace}"
            }}[5m]))) > 0.05
          for: 5m
          labels:
            severity: warning
            app: {app.name}
            auto_repair: rollback
          annotations:
            summary: "{app.display_name} 5xx error rate is too high"
            description: "The 5xx error rate is above 5%"
"""
        self.k8s_apply(manifest)
```
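Building manifests with f-strings forces every PromQL brace to be doubled and makes indentation mistakes easy to introduce. An alternative is to build the manifest as a Python dict and serialize it to JSON, which `kubectl apply -f` also accepts. The following is a minimal sketch covering only the service-down alert; `build_down_alert_rule` is a hypothetical helper name, not part of the platform code.

```python
import json


def build_down_alert_rule(name: str, namespace: str, display_name: str) -> dict:
    """Build a PrometheusRule manifest (as a dict) with a single 'service down' alert."""
    selector = f'namespace="{namespace}", job="{name}"'
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {
            "name": f"{name}-alerts",
            "namespace": "monitoring",
            "labels": {"managed-by": "wooo-aiops", "app": name},
        },
        "spec": {
            "groups": [{
                "name": f"{name}.rules",
                "rules": [{
                    "alert": f"{name.title()}Down",
                    # Only the outer braces need escaping here.
                    "expr": f"up{{{selector}}} == 0",
                    "for": "1m",
                    "labels": {
                        "severity": "critical",
                        "app": name,
                        "auto_repair": "restart",
                    },
                    "annotations": {
                        "summary": f"{display_name} service is down",
                    },
                }],
            }],
        },
    }


rule = build_down_alert_rule("event-system", "app-event-system", "Event Registration Site")
manifest_json = json.dumps(rule, indent=2, ensure_ascii=False)
print(manifest_json)
```

The dict form can also be unit-tested field by field, which is hard to do with a rendered YAML string.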
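### Step 4: Automated Repair Engine
```python
# aiops_api/engines/repair_engine.py
import json
import subprocess


class RepairEngine:
    """Automated repair engine."""

    # Maps the alert's auto_repair label to a method name
    REPAIR_STRATEGIES = {
        'restart': 'restart_deployment',
        'scale_memory': 'scale_memory_limit',  # not shown in this document
        'oom_handler': 'handle_oom',
        'rollback': 'rollback_deployment',
        'cleanup': 'cleanup_resources',        # not shown in this document
    }

    def handle_alert(self, alert: dict):
        """Handle an alert and run the matching automated repair."""
        app_name = alert['labels']['app']
        repair_type = alert['labels'].get('auto_repair')
        strategy = self.REPAIR_STRATEGIES.get(repair_type)
        method = getattr(self, strategy, None) if strategy else None
        if method is None:
            # No label, unknown label, or strategy not implemented:
            # hand the alert over to a human
            self.notify_manual_required(alert)
            return
        # Run the matching repair strategy
        result = method(app_name, alert)
        self.log_repair(app_name, repair_type, result)
        self.notify_repair_result(app_name, repair_type, result)

    def wait_for_rollout(self, app_name: str, namespace: str, timeout: int = 120) -> bool:
        """Wait until the rollout finishes; False on failure or timeout."""
        result = subprocess.run([
            'kubectl', 'rollout', 'status',
            f'deployment/{app_name}',
            '-n', namespace,
            f'--timeout={timeout}s'
        ], capture_output=True, text=True)
        return result.returncode == 0

    def restart_deployment(self, app_name: str, alert: dict) -> dict:
        """Restart the Deployment."""
        namespace = self.get_app_namespace(app_name)
        # Trigger the restart
        subprocess.run([
            'kubectl', 'rollout', 'restart',
            f'deployment/{app_name}',
            '-n', namespace
        ], check=True)
        # Wait for the rollout instead of sleeping for a fixed time
        rolled_out = self.wait_for_rollout(app_name, namespace)
        # Verify the repair
        health = self.check_health(app_name)
        return {
            'action': 'restart',
            'success': rolled_out and health['status'] == 'healthy',
            'details': health
        }

    def handle_oom(self, app_name: str, alert: dict) -> dict:
        """Handle an OOM by raising the memory limit by 50%."""
        namespace = self.get_app_namespace(app_name)
        # Read the current resource limits
        current = self.get_resource_limits(app_name, namespace)
        # Compute the new limit (+50%)
        new_memory = self.increase_memory(current['memory'], 1.5)
        # Patch the Deployment (strategic merge; containers are matched by name)
        patch = {
            'spec': {
                'template': {
                    'spec': {
                        'containers': [{
                            'name': app_name,
                            'resources': {
                                'limits': {'memory': new_memory}
                            }
                        }]
                    }
                }
            }
        }
        result = subprocess.run([
            'kubectl', 'patch', 'deployment', app_name,
            '-n', namespace,
            '--patch', json.dumps(patch)
        ], capture_output=True, text=True)
        return {
            'action': 'scale_memory',
            'old_memory': current['memory'],
            'new_memory': new_memory,
            'success': result.returncode == 0
        }

    def rollback_deployment(self, app_name: str, alert: dict) -> dict:
        """Roll back to the previous revision."""
        namespace = self.get_app_namespace(app_name)
        # Undo the latest rollout
        result = subprocess.run([
            'kubectl', 'rollout', 'undo',
            f'deployment/{app_name}',
            '-n', namespace
        ], capture_output=True, text=True)
        # Wait for the rollout, then verify the repair
        rolled_out = self.wait_for_rollout(app_name, namespace)
        health = self.check_health(app_name)
        return {
            'action': 'rollback',
            'success': rolled_out and health['status'] == 'healthy',
            'output': result.stdout
        }
```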
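`handle_oom` calls `self.increase_memory(current['memory'], 1.5)` without showing the helper. Kubernetes memory quantities are strings such as `512Mi` or `1Gi`, so the unit has to be parsed before scaling. The following is a minimal sketch, assuming only the binary suffixes `Ki`, `Mi`, `Gi` and `Ti` are in use (as in the plan table) and that the result is rounded up to a whole `Mi`. The `max_mi` cap is a hypothetical safeguard so that repeated OOM alerts cannot grow the limit without bound.

```python
import math
import re

# Conversion factors from each supported suffix to Mi.
_UNIT_TO_MI = {"Ki": 1 / 1024, "Mi": 1, "Gi": 1024, "Ti": 1024 * 1024}


def increase_memory(quantity: str, factor: float, max_mi: int = 8192) -> str:
    """Scale a Kubernetes memory quantity and return it expressed in Mi."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti)", quantity.strip())
    if not match:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    current_mi = float(match.group(1)) * _UNIT_TO_MI[match.group(2)]
    target_mi = math.ceil(current_mi * factor)
    # Never exceed the cap, but never shrink a limit that is already above it.
    cap_mi = max(max_mi, math.ceil(current_mi))
    return f"{min(target_mi, cap_mi)}Mi"


print(increase_memory("512Mi", 1.5))
print(increase_memory("1Gi", 1.5))
```

With the default cap, a `standard` plan container (`512Mi`) would grow to `768Mi`, and any request that would exceed `8192Mi` is clamped.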
---
## 📋 Application Template System
### Supported Framework Templates
```
templates/
├── base/
│   ├── namespace.yaml.j2
│   ├── service.yaml.j2
│   ├── ingress.yaml.j2
│   └── ssl-certificate.yaml.j2
├── frameworks/
│   ├── flask/
│   │   ├── deployment.yaml.j2
│   │   ├── configmap.yaml.j2
│   │   └── healthcheck.yaml.j2
│   ├── django/
│   │   ├── deployment.yaml.j2
│   │   ├── configmap.yaml.j2
│   │   └── celery-worker.yaml.j2
│   ├── nodejs/
│   │   ├── deployment.yaml.j2
│   │   └── configmap.yaml.j2
│   ├── nextjs/
│   │   ├── deployment.yaml.j2
│   │   └── configmap.yaml.j2
│   └── static/
│       └── deployment.yaml.j2
└── addons/
    ├── postgresql.yaml.j2
    ├── redis.yaml.j2
    ├── mongodb.yaml.j2
    └── elasticsearch.yaml.j2
```
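When a registration request omits `framework`, Step 1 falls back to automatic detection, and the result has to match one of the directories under `templates/frameworks/`. The document does not show the detection logic. The sketch below is one plausible heuristic, assuming the repository's top-level files have already been read into a dict of file name to content; `detect_framework_from_files` is a hypothetical name, and fetching the files from `source_url` is out of scope here.

```python
import json
from typing import Dict


def detect_framework_from_files(files: Dict[str, str]) -> str:
    """Guess the framework template from top-level repository files."""
    requirements = files.get("requirements.txt", "").lower()
    if "django" in requirements:
        return "django"
    if "flask" in requirements:
        return "flask"
    if "package.json" in files:
        try:
            package = json.loads(files["package.json"])
        except json.JSONDecodeError:
            package = {}
        if not isinstance(package, dict):
            package = {}
        deps = {
            **package.get("dependencies", {}),
            **package.get("devDependencies", {}),
        }
        return "nextjs" if "next" in deps else "nodejs"
    # Nothing recognizable: fall back to the static-site template.
    return "static"


print(detect_framework_from_files({"requirements.txt": "Flask==3.0\ngunicorn\n"}))
print(detect_framework_from_files({"package.json": '{"dependencies": {"next": "14.0.0"}}'}))
```

Every return value corresponds to an existing directory in the tree above, so the deploy engine can always resolve a template.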
### Flask Application Template Example
```yaml
# templates/frameworks/flask/deployment.yaml.j2
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ app_name }}
  namespace: {{ namespace }}
  labels:
    app: {{ app_name }}
    framework: flask
    managed-by: wooo-aiops
  annotations:
    aiops.wooo.work/auto-repair: "enabled"
    aiops.wooo.work/plan: "{{ plan }}"
spec:
  replicas: {{ replicas | default(1) }}
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: {{ app_name }}
  template:
    metadata:
      labels:
        app: {{ app_name }}
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "80"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: {{ app_name }}
          image: {{ image }}
          imagePullPolicy: Always
          ports:
            - name: http
              containerPort: 80
          # Resource limits
          resources:
            requests:
              memory: "{{ memory_request }}"
              cpu: "{{ cpu_request }}"
            limits:
              memory: "{{ memory_limit }}"
              cpu: "{{ cpu_limit }}"
          # Health checks
          livenessProbe:
            httpGet:
              path: /health
              port: 80
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
          # Environment variables
          env:
            - name: FLASK_ENV
              value: "production"
            - name: GUNICORN_WORKERS
              value: "{{ gunicorn_workers | default(4) }}"
          envFrom:
            # Load from the ConfigMap
            - configMapRef:
                name: {{ app_name }}-config
            # Load sensitive values from the Secret
            - secretRef:
                name: {{ app_name }}-secrets
      # Service account
      serviceAccountName: {{ app_name }}
      # Anti-affinity: spread Pods across different nodes
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: {{ app_name }}
                topologyKey: kubernetes.io/hostname
```
---
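## 🔔 Alerting and Notification System
### Multi-Channel Notification Architecture
```python
# aiops_api/services/notification_service.py
import os
from typing import Optional

import requests


class NotificationService:
    """Multi-channel notification service."""

    # Message templates, keyed by event type
    TEMPLATES = {
        'deploy_success': (
            "🚀 *Deployment succeeded*\n"
            "App: {app_name}\n"
            "Domain: {domain}\n"
            "Time: {timestamp}\n"
            "Your application has been deployed successfully!\n"
            "Visit: https://{domain}\n"
        ),
        'alert_fired': (
            "🔴 *Alert fired*\n"
            "App: {app_name}\n"
            "Alert: {alert_name}\n"
            "Severity: {severity}\n"
            "Time: {timestamp}\n"
            "{description}\n"
        ),
        'auto_repair_success': (
            "🔧 *Auto-repair succeeded*\n"
            "App: {app_name}\n"
            "Problem: {problem}\n"
            "Repair action: {action}\n"
            "Duration: {duration}\n"
            "The service has recovered automatically.\n"
        ),
        'auto_repair_failed': (
            "❌ *Auto-repair failed*\n"
            "App: {app_name}\n"
            "Problem: {problem}\n"
            "Attempts: {attempts}\n"
            "Status: manual intervention required\n"
            "Please check the system as soon as possible.\n"
        ),
    }

    def __init__(self):
        # EmailChannel, LineChannel and SlackChannel are not shown in this document
        self.channels = {
            'telegram': TelegramChannel(),
            'email': EmailChannel(),
            'line': LineChannel(),
            'slack': SlackChannel(),
        }

    def notify(self, app: App, event: dict, channels: Optional[list] = None):
        """Send a notification to the given channels."""
        if channels is None:
            channels = app.notification_channels
        # Render the notification content
        message = self.render_message(app, event)
        # Send it to each channel
        results = {}
        for channel_name in channels:
            channel = self.channels.get(channel_name)
            if channel:
                results[channel_name] = channel.send(app, message)
        return results

    def render_message(self, app: App, event: dict) -> dict:
        """Render the notification message."""
        template = self.TEMPLATES.get(event['type'])
        if template is None:
            raise ValueError(f"Unknown event type: {event['type']!r}")
        # App-level values act as defaults; event fields override them
        fields = {'app_name': app.name, 'domain': app.domain, **event}
        text = template.format(**fields)
        return {
            'text': text,
            'html': self.to_html(text)
        }


class TelegramChannel:
    """Telegram notification channel."""

    def send(self, app: App, message: dict) -> bool:
        chat_id = app.telegram_chat_id
        bot_token = os.getenv('TELEGRAM_BOT_TOKEN')
        response = requests.post(
            f"https://api.telegram.org/bot{bot_token}/sendMessage",
            json={
                'chat_id': chat_id,
                'text': message['text'],
                'parse_mode': 'Markdown'
            },
            timeout=10,
        )
        return response.ok
```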
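`render_message` fills its templates with `str.format`, which raises `KeyError` as soon as an event lacks one of the placeholders. For notifications it is usually better to send an incomplete message than none at all. The following is a minimal sketch, assuming a hypothetical `render_template` helper that leaves unknown placeholders visible in the output.

```python
class _KeepMissing(dict):
    """dict that renders unknown placeholders back as '{name}'."""

    def __missing__(self, key: str) -> str:
        return "{" + key + "}"


def render_template(template: str, fields: dict) -> str:
    """Fill a str.format-style template, tolerating missing fields."""
    return template.format_map(_KeepMissing(fields))


message = render_template(
    "🔧 *Auto-repair succeeded*\nApp: {app_name}\nAction: {action}\nDuration: {duration}",
    {"app_name": "event-system", "action": "restart"},
)
print(message)
```

The unfilled `{duration}` stays in the text, which makes the missing field easy to spot in the chat while the rest of the message still arrives.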
---
## 💾 Database Design
### Core Table Structure
```sql
-- gen_random_uuid() is built in from PostgreSQL 13;
-- earlier versions need the pgcrypto extension.

-- Users table (created first because apps and deployments reference it)
CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email VARCHAR(200) NOT NULL UNIQUE,
    password_hash VARCHAR(200),
    name VARCHAR(100),
    company VARCHAR(200),
    plan VARCHAR(20) DEFAULT 'free',
    telegram_chat_id VARCHAR(50),
    created_at TIMESTAMP DEFAULT NOW()
);

-- Applications table
CREATE TABLE apps (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name VARCHAR(100) NOT NULL UNIQUE,
    display_name VARCHAR(200),
    domain VARCHAR(200) NOT NULL,
    namespace VARCHAR(100),
    framework VARCHAR(50),
    plan VARCHAR(20) DEFAULT 'standard',
    source_type VARCHAR(20),  -- git, docker, upload
    source_url TEXT,
    status VARCHAR(20) DEFAULT 'initializing',
    owner_id UUID REFERENCES users(id),
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Application configuration table
CREATE TABLE app_configs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    app_id UUID REFERENCES apps(id) ON DELETE CASCADE,
    config_key VARCHAR(100),
    config_value TEXT,
    is_secret BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Alert history table
CREATE TABLE alerts (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    app_id UUID REFERENCES apps(id),
    alert_name VARCHAR(200),
    severity VARCHAR(20),
    status VARCHAR(20),  -- firing, resolved
    description TEXT,
    labels JSONB,
    annotations JSONB,
    fired_at TIMESTAMP,
    resolved_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Repair history table
CREATE TABLE repair_logs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    app_id UUID REFERENCES apps(id),
    alert_id UUID REFERENCES alerts(id),
    repair_type VARCHAR(50),
    action_taken TEXT,
    result JSONB,
    success BOOLEAN,
    duration_seconds INTEGER,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Deployment history table
CREATE TABLE deployments (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    app_id UUID REFERENCES apps(id),
    version VARCHAR(100),
    image VARCHAR(500),
    status VARCHAR(20),  -- pending, deploying, success, failed, rolled_back
    rollback_version VARCHAR(100),
    deployed_by UUID REFERENCES users(id),
    deployed_at TIMESTAMP DEFAULT NOW(),
    completed_at TIMESTAMP
);
```
---
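## 🔐 Security Architecture
### Multi-Tenant Isolation
```yaml
# Each application gets its own Namespace,
# isolated with a NetworkPolicy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: {{ app_name }}-isolation
  namespace: {{ namespace }}
spec:
  podSelector:
    matchLabels:
      app: {{ app_name }}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Only allow traffic from the Ingress Controller.
    # kubernetes.io/metadata.name is set automatically on every namespace
    # (Kubernetes 1.21+); a plain `name` label would have to be added by hand.
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
    # Allow scraping from Prometheus
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
  egress:
    # Allow DNS lookups (UDP, plus TCP for large responses)
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    # Allow access to the database (same namespace)
    - to:
        - podSelector:
            matchLabels:
              app: {{ app_name }}-postgresql
```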
### API Authentication
```python
# aiops_api/middleware/auth.py
from functools import wraps

import jwt
from flask import jsonify, request

# Assumed to be defined elsewhere: SECRET_KEY and the User / App models.


def require_auth(f):
    @wraps(f)
    def decorated(*args, **kwargs):
        auth_header = request.headers.get('Authorization', '')
        if not auth_header.startswith('Bearer '):
            return jsonify({'error': 'Missing token'}), 401
        token = auth_header[len('Bearer '):]
        try:
            payload = jwt.decode(token, SECRET_KEY, algorithms=['HS256'])
        except jwt.ExpiredSignatureError:
            return jsonify({'error': 'Token expired'}), 401
        except jwt.InvalidTokenError:
            return jsonify({'error': 'Invalid token'}), 401
        # A valid token may still reference a user that no longer exists
        user = User.query.get(payload.get('user_id'))
        if user is None:
            return jsonify({'error': 'Invalid token'}), 401
        request.user = user
        return f(*args, **kwargs)
    return decorated


def require_app_access(f):
    """Ensure the user is allowed to access the given application."""
    @wraps(f)
    def decorated(app_id, *args, **kwargs):
        app = App.query.get(app_id)
        if not app:
            return jsonify({'error': 'App not found'}), 404
        if app.owner_id != request.user.id:
            return jsonify({'error': 'Access denied'}), 403
        return f(app_id, *args, **kwargs)
    return decorated
```
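`require_auth` only verifies tokens; the document does not show how they are issued. To make the contract concrete, the sketch below implements HS256 signing and verification with the standard library only. It illustrates what a check such as `jwt.decode(..., algorithms=['HS256'])` has to establish (a valid signature and an unexpired `exp` claim) and is not a replacement for PyJWT. `issue_token`, `verify_token` and the one-hour lifetime are assumptions.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: str) -> str:
    digest = hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return _b64url(digest)


def issue_token(user_id: str, secret: str, ttl_seconds: int = 3600,
                now: Optional[float] = None) -> str:
    """Create a signed token carrying user_id and an expiry time."""
    issued_at = int(time.time() if now is None else now)
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = {"user_id": user_id, "exp": issued_at + ttl_seconds}
    payload = _b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}"
    return f"{signing_input}.{_sign(signing_input, secret)}"


def verify_token(token: str, secret: str, now: Optional[float] = None) -> Optional[dict]:
    """Return the claims if the signature is valid and the token is unexpired."""
    parts = token.split(".")
    if len(parts) != 3:
        return None
    header, payload, signature = parts
    expected = _sign(f"{header}.{payload}", secret)
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    try:
        claims = json.loads(_b64url_decode(payload))
    except ValueError:
        return None
    current = time.time() if now is None else now
    if claims.get("exp", 0) <= current:
        return None
    return claims


token = issue_token("user-1", "dev-secret", now=1000)
print(verify_token(token, "dev-secret", now=1001))
```

The `now` parameter exists only so that expiry can be tested deterministically; callers would normally omit it.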
---
## 📈 Monitoring Dashboard
### Automatic Grafana Dashboard Generation
```python
# aiops_api/services/grafana_service.py
import requests

# Assumed to be defined elsewhere: GRAFANA_URL and GRAFANA_API_KEY.


class GrafanaService:
    """Grafana dashboard management service."""

    def create_app_dashboard(self, app: App):
        """Automatically create a Grafana dashboard for the application."""
        dashboard = {
            "dashboard": {
                "title": f"{app.display_name} - Monitoring Dashboard",
                "tags": ["aiops", app.name],
                "timezone": "Asia/Taipei",
                "panels": [
                    # Service availability
                    {
                        "type": "stat",
                        "title": "Availability",
                        "gridPos": {"x": 0, "y": 0, "w": 6, "h": 4},
                        "targets": [{
                            "expr": f'avg(up{{namespace="{app.namespace}"}}) * 100'
                        }],
                        "fieldConfig": {
                            "defaults": {
                                "unit": "percent",
                                "thresholds": {
                                    "steps": [
                                        {"color": "red", "value": 0},
                                        {"color": "yellow", "value": 95},
                                        {"color": "green", "value": 99}
                                    ]
                                }
                            }
                        }
                    },
                    # Request rate
                    {
                        "type": "timeseries",
                        "title": "Request Rate (QPS)",
                        "gridPos": {"x": 6, "y": 0, "w": 12, "h": 8},
                        "targets": [{
                            "expr": f'sum(rate(http_requests_total{{namespace="{app.namespace}"}}[1m]))'
                        }]
                    },
                    # Response time (aggregated by bucket so a single p95 series is shown)
                    {
                        "type": "timeseries",
                        "title": "Response Time (ms)",
                        "gridPos": {"x": 0, "y": 8, "w": 12, "h": 8},
                        "targets": [{
                            "expr": f'histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{{namespace="{app.namespace}"}}[5m]))) * 1000',
                            "legendFormat": "p95"
                        }]
                    },
                    # Memory usage
                    {
                        "type": "gauge",
                        "title": "Memory Usage",
                        "gridPos": {"x": 12, "y": 8, "w": 6, "h": 8},
                        "targets": [{
                            "expr": f'sum(container_memory_usage_bytes{{namespace="{app.namespace}"}}) / sum(container_spec_memory_limit_bytes{{namespace="{app.namespace}"}}) * 100'
                        }],
                        "fieldConfig": {
                            "defaults": {
                                "unit": "percent",
                                "max": 100,
                                "thresholds": {
                                    "steps": [
                                        {"color": "green", "value": 0},
                                        {"color": "yellow", "value": 70},
                                        {"color": "red", "value": 90}
                                    ]
                                }
                            }
                        }
                    },
                    # Error rate
                    {
                        "type": "stat",
                        "title": "5xx Error Rate",
                        "gridPos": {"x": 18, "y": 8, "w": 6, "h": 8},
                        "targets": [{
                            "expr": f'sum(rate(http_requests_total{{namespace="{app.namespace}", status=~"5.."}}[5m])) / sum(rate(http_requests_total{{namespace="{app.namespace}"}}[5m])) * 100'
                        }],
                        "fieldConfig": {
                            "defaults": {
                                "unit": "percent",
                                "thresholds": {
                                    "steps": [
                                        {"color": "green", "value": 0},
                                        {"color": "yellow", "value": 1},
                                        {"color": "red", "value": 5}
                                    ]
                                }
                            }
                        }
                    }
                ]
            }
        }
        # Create the dashboard
        response = requests.post(
            f"{GRAFANA_URL}/api/dashboards/db",
            headers={"Authorization": f"Bearer {GRAFANA_API_KEY}"},
            json=dashboard,
            timeout=10,
        )
        response.raise_for_status()
        return response.json()
```
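The panel definitions repeat the same `fieldConfig` structure with different thresholds. A small factory keeps the dashboard body short and the thresholds easy to audit. This is a sketch only: `stat_panel` and `threshold_steps` are hypothetical helper names, and the output mirrors the dict layout used above.

```python
from typing import Dict, Tuple


def threshold_steps(*steps: Tuple[str, float]) -> Dict:
    """Build a Grafana thresholds block from (color, value) pairs."""
    return {"steps": [{"color": color, "value": value} for color, value in steps]}


def stat_panel(title: str, expr: str, grid: Tuple[int, int, int, int],
               thresholds: Dict, unit: str = "percent") -> Dict:
    """Build a 'stat' panel with a single PromQL target."""
    x, y, w, h = grid
    return {
        "type": "stat",
        "title": title,
        "gridPos": {"x": x, "y": y, "w": w, "h": h},
        "targets": [{"expr": expr}],
        "fieldConfig": {"defaults": {"unit": unit, "thresholds": thresholds}},
    }


namespace = "app-event-system"
availability = stat_panel(
    "Availability",
    f'avg(up{{namespace="{namespace}"}}) * 100',
    grid=(0, 0, 6, 4),
    thresholds=threshold_steps(("red", 0), ("yellow", 95), ("green", 99)),
)
print(availability["targets"][0]["expr"])
```

The same pattern extends to `gauge` and `timeseries` panels by adding one factory per panel type.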
---
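## 🚀 Next Steps
### Can Be Done Immediately
1. **Extract existing code into modules**
   - Extract the monitoring and auto-repair code from MOMO Pro into standalone services
   - Build the application template system
2. **Develop the Web Portal**
   - User registration/login
   - Application management UI
   - New-application wizard
3. **Complete the API**
   - RESTful API design
   - API documentation (Swagger)
   - SDKs (Python, Node.js)
### Long-Term Roadmap
1. **Broader framework support** (Q2 2026)
   - Django, Node.js, Next.js, Go
2. **Multi-cloud support** (Q3 2026)
   - AWS EKS, GCP GKE, Azure AKS
3. **AI enhancements** (Q4 2026)
   - Anomaly prediction
   - Intelligent capacity planning
   - Automatic tuning

---
> **"Professional DevOps protection for every line of code"**
>
> WOOO AIOps Platform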