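# ADR-006: AI Degradation and Fallback Strategy

> **Status**: Accepted
> **Date**: 2026-03-20
> **Deciders**: CTO, CEO

---

## Background

The AWOOOI system relies heavily on AI features, including AI Copilot, anomaly detection, and smart summaries.
When the local Ollama service is unavailable, a robust fallback mechanism is needed, and cloud API costs must stay under strict control.

### CEO Directive #2

> Cloud fallback goes to the **Gemini API first, and only then to the Claude API**. API token usage
> must be effectively controlled and monitored, with alerting in place to prevent runaway costs!

---
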
## Decision

### 1. AI Service Priority Order

```
┌─────────────────────────────────────────────────────┐
│  Priority 1: Ollama (local)                         │
│  192.168.0.188:11434                                │
│  Cost: $0 / Latency: ~200ms                         │
└─────────────────────────────────────────────────────┘
                          │ on failure
                          ▼
┌─────────────────────────────────────────────────────┐
│  Priority 2: Gemini API (cloud fallback, preferred) │
│  Cost: ~$0.001/1K tokens                            │
└─────────────────────────────────────────────────────┘
                          │ on failure
                          ▼
┌─────────────────────────────────────────────────────┐
│  Priority 3: Claude API (cloud fallback, secondary) │
│  Cost: ~$0.008/1K tokens                            │
└─────────────────────────────────────────────────────┘
                          │ on failure
                          ▼
┌─────────────────────────────────────────────────────┐
│  Priority 4: Static response (full degradation)     │
│  Returns a default message; no AI is called         │
└─────────────────────────────────────────────────────┘
```

### 2. Circuit Breaker Mechanism

```python
# apps/api/app/services/ai/circuit_breaker.py

from enum import Enum
from datetime import datetime, timedelta
import asyncio


class CircuitState(Enum):
    CLOSED = "closed"        # normal operation
    OPEN = "open"            # tripped: calls are rejected
    HALF_OPEN = "half_open"  # probing for recovery


# Assumed minimal definition; the original snippet raises this
# exception without showing where it is declared.
class CircuitOpenError(Exception):
    pass


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,    # trip after 5 consecutive failures
        recovery_timeout: int = 60,    # attempt recovery 60 seconds after tripping
        half_open_max_calls: int = 3,  # at most 3 probe calls while half-open
    ):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = None
        self.half_open_calls = 0
        # ...

    async def call(self, func, *args, **kwargs):
        # While open, reject calls until the recovery timeout has elapsed.
        if self.state == CircuitState.OPEN:
            if self._should_try_recovery():
                self.state = CircuitState.HALF_OPEN
                self.half_open_calls = 0
            else:
                raise CircuitOpenError("Circuit is open")

        try:
            result = await func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise
```

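The `CircuitBreaker` snippet above elides its state-transition helpers. The sketch below shows one way `_should_try_recovery`, `_on_success`, and `_on_failure` could behave. The helper bodies, the injectable `clock` parameter, the class name `SketchCircuitBreaker`, and the rule that a single successful probe closes the circuit are all assumptions made for illustration, not the project's actual implementation.

```python
import time
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without running it."""


class SketchCircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60,
                 half_open_max_calls=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls
        self.clock = clock  # injectable so the timeout can be simulated
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = None
        self.half_open_calls = 0

    def _should_try_recovery(self):
        # Allow a probe once recovery_timeout seconds have passed since the last failure.
        return (self.last_failure_time is not None
                and self.clock() - self.last_failure_time >= self.recovery_timeout)

    def _on_success(self):
        # Assumption: one successful call closes the circuit and resets the counters.
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.half_open_calls = 0

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = self.clock()
        # A failed probe, or too many consecutive failures, opens the circuit.
        if (self.state == CircuitState.HALF_OPEN
                or self.failure_count >= self.failure_threshold):
            self.state = CircuitState.OPEN

    async def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if self._should_try_recovery():
                self.state = CircuitState.HALF_OPEN
                self.half_open_calls = 0
            else:
                raise CircuitOpenError("Circuit is open")
        if self.state == CircuitState.HALF_OPEN:
            # Cap the number of probe calls admitted while half-open.
            if self.half_open_calls >= self.half_open_max_calls:
                raise CircuitOpenError("Half-open probe limit reached")
            self.half_open_calls += 1
        try:
            result = await func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result
```

Injecting the clock keeps the recovery timeout testable without sleeping; production code would simply rely on the default `time.monotonic`.
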
### 3. Token Usage Monitoring and Alerting

#### Daily and Monthly Quotas

| API | Daily limit | Monthly limit | Alert threshold |
|-----|-------------|---------------|-----------------|
| Gemini | 100K tokens | 2M tokens | 70% |
| Claude | 50K tokens | 500K tokens | 70% |

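The quota table can be expressed as data, with the warning thresholds derived from it rather than hard-coded. The sketch below is illustrative: the `Quota` dataclass, the `QUOTAS` mapping, and both helper functions are hypothetical names, and the quota check is a guess at what a method such as `is_quota_exceeded` might compare. Only the numbers come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quota:
    daily_tokens: int
    monthly_tokens: int
    alert_ratio: float  # fraction of the cap at which a warning fires


# Values copied from the quota table; the structure itself is an assumption.
QUOTAS = {
    "gemini": Quota(daily_tokens=100_000, monthly_tokens=2_000_000, alert_ratio=0.70),
    "claude": Quota(daily_tokens=50_000, monthly_tokens=500_000, alert_ratio=0.70),
}


def daily_warning_threshold(provider: str) -> int:
    """Token count at which the daily warning should fire."""
    quota = QUOTAS[provider]
    return round(quota.daily_tokens * quota.alert_ratio)


def is_quota_exceeded(provider: str, used_today: int, used_this_month: int) -> bool:
    """True once either the daily or the monthly cap has been reached."""
    quota = QUOTAS[provider]
    return used_today >= quota.daily_tokens or used_this_month >= quota.monthly_tokens


print(daily_warning_threshold("gemini"))  # 70000
print(daily_warning_threshold("claude"))  # 35000
```

The two printed values are the same 70,000 and 35,000 token thresholds that appear in the Prometheus warning rules.
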
#### Monitoring Schema

```python
# apps/api/app/models/ai_usage.py

from sqlalchemy import Boolean, Column, DateTime, ForeignKey, Integer, String, func
from sqlalchemy.dialects.postgresql import UUID  # assumes a PostgreSQL backend
from sqlalchemy.orm import declarative_base

Base = declarative_base()  # assumed here; the project likely defines its own Base


class AIUsageLog(Base):
    __tablename__ = "ai_usage_logs"

    id = Column(UUID, primary_key=True)
    provider = Column(String)  # ollama, gemini, claude
    model = Column(String)
    input_tokens = Column(Integer)
    output_tokens = Column(Integer)
    latency_ms = Column(Integer)
    success = Column(Boolean)
    error_message = Column(String, nullable=True)
    user_id = Column(UUID, ForeignKey("users.id"))
    created_at = Column(DateTime, default=func.now())
```

#### Alert Rules

```yaml
# k8s/monitoring/prometheus/ai-usage-alerts.yaml
groups:
  - name: ai-usage-alerts
    rules:
      # Gemini daily usage: 70% warning
      - alert: GeminiDailyUsageWarning
        expr: |
          sum(increase(ai_tokens_total{provider="gemini"}[24h])) > 70000
        labels:
          severity: warning
        annotations:
          summary: "Gemini API daily usage has reached 70%"
          description: "Gemini has used {{ $value | humanize }} tokens in the last 24 hours"

      # Gemini daily usage: 90% critical
      - alert: GeminiDailyUsageCritical
        expr: |
          sum(increase(ai_tokens_total{provider="gemini"}[24h])) > 90000
        labels:
          severity: critical
        annotations:
          summary: "Gemini API daily usage has reached 90%; rate limiting is imminent"

      # Claude daily usage: 70% warning
      - alert: ClaudeDailyUsageWarning
        expr: |
          sum(increase(ai_tokens_total{provider="claude"}[24h])) > 35000
        labels:
          severity: warning
        annotations:
          summary: "Claude API daily usage has reached 70%"

      # Ollama repeated failures
      - alert: OllamaConsecutiveFailures
        expr: |
          increase(ai_requests_failed_total{provider="ollama"}[5m]) > 5
        labels:
          severity: critical
        annotations:
          summary: "Ollama service may be offline"
          description: "More than 5 Ollama requests failed in the last 5 minutes; cloud fallback has been activated"

      # Monthly budget: 50% reminder
      - alert: MonthlyAIBudgetWarning
        expr: |
          (
            sum(increase(ai_tokens_total{provider="gemini"}[30d])) * 0.000001 +
            sum(increase(ai_tokens_total{provider="claude"}[30d])) * 0.000008
          ) > 5
        labels:
          severity: warning
        annotations:
          summary: "Monthly AI cost has reached $5 (50% of budget)"
```

### 4. Cost Estimate

| Scenario | Gemini tokens | Claude tokens | Monthly cost |
|----------|---------------|---------------|--------------|
| **Normal** (Ollama 100%) | 0 | 0 | $0 |
| **Light degradation** (Ollama 90%, Gemini 10%) | ~200K | 0 | ~$0.20 |
| **Moderate degradation** (Gemini 80%, Claude 20%) | ~1.6M | ~400K | ~$5 |
| **Full degradation** (cloud 100%) | ~2M | ~500K | ~$10 |

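For reference, a monthly estimate can be recomputed from the approximate per-1K-token prices given in the priority diagram (~$0.001 for Gemini, ~$0.008 for Claude). The sketch below is only a back-of-the-envelope helper: the function name and the `scenarios` mapping are hypothetical, the prices are the rough figures quoted in this ADR, and the results may differ from the rounded values in the table.

```python
# Approximate prices per 1K tokens, as quoted in the priority diagram.
PRICE_PER_1K_TOKENS = {"gemini": 0.001, "claude": 0.008}


def monthly_cost_usd(tokens_by_provider: dict[str, int]) -> float:
    """Sum the cost of each provider's monthly token volume."""
    return sum(
        tokens / 1000 * PRICE_PER_1K_TOKENS[provider]
        for provider, tokens in tokens_by_provider.items()
    )


# Token volumes taken from the cost table rows.
scenarios = {
    "light": {"gemini": 200_000},
    "moderate": {"gemini": 1_600_000, "claude": 400_000},
    "full": {"gemini": 2_000_000, "claude": 500_000},
}

for name, usage in scenarios.items():
    print(f"{name}: ${monthly_cost_usd(usage):.2f}")
```

The same per-token multipliers (0.000001 and 0.000008) are what the `MonthlyAIBudgetWarning` rule applies to `ai_tokens_total`.
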
### 5. Implementation Example

```python
# apps/api/app/services/ai/router.py

import logging

from app.services.ai.providers import OllamaProvider, GeminiProvider, ClaudeProvider
from app.services.ai.circuit_breaker import CircuitBreaker, CircuitOpenError
from app.services.ai.usage_tracker import UsageTracker

# AIResponse is assumed to be importable from the project's own modules;
# the original snippet does not show where it is defined.

logger = logging.getLogger(__name__)


class AIRouter:
    def __init__(self):
        self.ollama = OllamaProvider()
        self.gemini = GeminiProvider()
        self.claude = ClaudeProvider()

        self.ollama_circuit = CircuitBreaker(failure_threshold=3, recovery_timeout=30)
        self.gemini_circuit = CircuitBreaker(failure_threshold=5, recovery_timeout=60)
        self.claude_circuit = CircuitBreaker(failure_threshold=5, recovery_timeout=60)

        self.usage_tracker = UsageTracker()

    async def generate(self, prompt: str, user_id: str) -> AIResponse:
        providers = [
            ("ollama", self.ollama, self.ollama_circuit),
            ("gemini", self.gemini, self.gemini_circuit),
            ("claude", self.claude, self.claude_circuit),
        ]

        for name, provider, circuit in providers:
            # Check the quota (cloud providers only)
            if name in ["gemini", "claude"]:
                if await self.usage_tracker.is_quota_exceeded(name):
                    logger.warning(f"{name} daily quota exceeded, skipping")
                    continue

            try:
                result = await circuit.call(provider.generate, prompt)

                # Record usage
                await self.usage_tracker.log(
                    provider=name,
                    input_tokens=result.input_tokens,
                    output_tokens=result.output_tokens,
                    user_id=user_id,
                    success=True,
                )

                return result

            except CircuitOpenError:
                logger.info(f"{name} circuit is open, trying next provider")
                continue
            except Exception as e:
                logger.error(f"{name} failed: {e}, trying next provider")
                await self.usage_tracker.log(
                    provider=name,
                    error_message=str(e),
                    user_id=user_id,
                    success=False,
                )
                continue

        # Every AI provider failed: return the static response
        return AIResponse(
            content="Sorry, the AI service is temporarily unavailable. Please try again later or contact an administrator.",
            provider="fallback",
            tokens=0,
        )
```

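To see the fallback order in action without any real provider, the chain can be exercised with stubs. The following sketch is self-contained and purely illustrative: `StubProvider`, `generate_with_fallback`, and this simplified `AIResponse` dataclass are hypothetical stand-ins, and circuit breaking and quota checks are deliberately left out.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class AIResponse:
    content: str
    provider: str
    tokens: int


class StubProvider:
    """Hypothetical provider that either answers or raises."""

    def __init__(self, name: str, healthy: bool):
        self.name = name
        self.healthy = healthy

    async def generate(self, prompt: str) -> AIResponse:
        if not self.healthy:
            raise ConnectionError(f"{self.name} unavailable")
        return AIResponse(content=f"[{self.name}] {prompt}",
                          provider=self.name, tokens=len(prompt.split()))


async def generate_with_fallback(providers, prompt: str) -> AIResponse:
    for provider in providers:
        try:
            return await provider.generate(prompt)
        except Exception:
            continue  # try the next provider in priority order
    # Nothing answered: fall back to the static response.
    return AIResponse(content="AI service temporarily unavailable.",
                      provider="fallback", tokens=0)


async def demo() -> AIResponse:
    chain = [
        StubProvider("ollama", healthy=False),  # simulate the local outage
        StubProvider("gemini", healthy=True),
        StubProvider("claude", healthy=True),
    ]
    return await generate_with_fallback(chain, "summarize alerts")


print(asyncio.run(demo()).provider)  # gemini
```

With Ollama marked unhealthy, the request is served by Gemini, and Claude is never called, which matches the priority order in section 1.
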
### 6. Dashboard

The AI usage monitoring dashboard should show:

- Today's usage per provider (tokens)
- Month-to-date cost (USD)
- Health status per provider (green/yellow/red)
- Average latency (ms)
- Success rate (%)

---

## Consequences

### Positive

- Keeps AI features highly available
- Costs are controllable and predictable
- Real-time alerts prevent bill shock

### Caveats

- Multiple API keys must be maintained
- Response quality may differ between providers
- API format conversion must be handled

---

## Smart Routing Integration (Phase 13.3 - 2026-03-26)

> Reference: [ADR-016 Smart Routing](ADR-016-smart-routing.md)

Phase 13.3 introduces a **smart routing** mechanism that dynamically selects the most suitable model:

### Routing Decision Flow

```
User Request
     │
     ▼
┌─────────────┐
│ Intent      │  Classify intent (ALERT/QUERY/CODE_REVIEW/...)
│ Classifier  │
└─────────────┘
     │
     ▼
┌─────────────┐
│ Complexity  │  Score complexity (1-5)
│ Scorer      │
└─────────────┘
     │
     ▼
┌─────────────┐
│ AIRouter    │  Intent + Complexity → Model Selection
└─────────────┘
     │
     ├── Low (1-2)   → Ollama llama3.2:3b
     ├── Medium (3)  → Ollama qwen2.5:7b-instruct
     └── High (4-5)  → Gemini / Claude
```

### Intent Override Rules

| Intent | Forced model | Reason |
|--------|--------------|--------|
| CODE_REVIEW | qwen2.5:7b | Requires code comprehension |
| ALERT_TRIAGE (complex) | Cloud | Cross-service correlation analysis |
| QUERY | llama3.2:3b | Simple queries do not need a large model |

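Taken together, the complexity tiers and the intent overrides amount to a small decision function. The sketch below is one possible reading, not the actual `AIRouter` logic: `select_model` and the constants are hypothetical names, the `CODE_REVIEW` override is assumed to mean the same `qwen2.5:7b-instruct` model as the medium tier, and the score at which an `ALERT_TRIAGE` request counts as "complex" is not defined in this ADR, so the value 3 is a loudly labeled assumption.

```python
OLLAMA_SMALL = "llama3.2:3b"
OLLAMA_MEDIUM = "qwen2.5:7b-instruct"
CLOUD = "cloud"  # placeholder for the Gemini / Claude tier

# Assumption: an ALERT_TRIAGE request counts as "complex" from this score upward.
ALERT_TRIAGE_COMPLEX_FROM = 3


def select_model(intent: str, complexity: int) -> str:
    """Return the model tier for a classified request."""
    if not 1 <= complexity <= 5:
        raise ValueError("complexity must be between 1 and 5")

    # Intent overrides win over the complexity score.
    if intent == "CODE_REVIEW":
        return OLLAMA_MEDIUM
    if intent == "QUERY":
        return OLLAMA_SMALL
    if intent == "ALERT_TRIAGE" and complexity >= ALERT_TRIAGE_COMPLEX_FROM:
        return CLOUD

    # Default mapping from the routing decision flow.
    if complexity <= 2:
        return OLLAMA_SMALL
    if complexity == 3:
        return OLLAMA_MEDIUM
    return CLOUD


print(select_model("QUERY", 5))  # llama3.2:3b
print(select_model("OTHER", 3))  # qwen2.5:7b-instruct
```

Checking overrides before the score keeps a simple `QUERY` on the small local model even when the scorer rates it highly, which is the cost-saving intent of the table.
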
### Integration with the Existing Fallback

```
AIRouter selects a model
     │
     ▼
┌─────────────┐
│ Primary     │  (chosen by the routing decision)
│ Model       │
└─────────────┘
     │ on failure
     ▼
┌─────────────┐
│ Fallback    │  Original ADR-006 order
│ Chain       │  Ollama → Gemini → Claude → Static
└─────────────┘
```

### Implementation Location

```
apps/api/src/services/
├── intent_classifier.py    # IntentClassifier
├── complexity_scorer.py    # ComplexityScorer
└── ai_router.py            # AIRouter
```

---

## 7. Rate Limiter Implementation (v1.2)

### Background

Commander's decision of 2026-03-26: Ollama CPU inference problems were causing `mock_fallback` responses, so the temporary switch to Gemini needs cost protection.

### Implementation

```python
# apps/api/src/services/ai_rate_limiter.py

RATE_LIMITS = {
    "gemini": {
        "rpm": 10,              # requests per minute
        "daily_requests": 500,  # requests per day
        "daily_tokens": 100_000,
    },
    "claude": {
        "rpm": 5,
        "daily_requests": 200,
        "daily_tokens": 50_000,
    },
}
```

### Integration Point

```python
# apps/api/src/services/openclaw.py

for provider in settings.AI_FALLBACK_ORDER:
    if provider in ("gemini", "claude"):
        allowed, reason = await rate_limiter.check_and_increment(provider)
        if not allowed:
            continue  # automatically fall through to the next provider
```

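The body of `check_and_increment` is not shown in this ADR. As a rough illustration of what such a method could do, the sketch below keeps a sliding one-minute window and a per-day request counter in memory. The class name `InMemoryRateLimiter`, the injectable `clock`, the reason strings, and the UTC day boundary are assumptions; token accounting against `daily_tokens` is omitted, and a multi-worker deployment would need a shared store instead of process memory.

```python
import time
from collections import defaultdict, deque

RATE_LIMITS = {
    "gemini": {"rpm": 10, "daily_requests": 500, "daily_tokens": 100_000},
    "claude": {"rpm": 5, "daily_requests": 200, "daily_tokens": 50_000},
}


class InMemoryRateLimiter:
    """Illustrative single-process limiter (hypothetical, not the project's class)."""

    def __init__(self, limits=RATE_LIMITS, clock=time.time):
        self.limits = limits
        self.clock = clock  # injectable so time can be simulated
        self.minute_window = defaultdict(deque)  # provider -> request timestamps
        self.daily_requests = defaultdict(int)   # (provider, day) -> request count

    async def check_and_increment(self, provider: str):
        now = self.clock()
        limit = self.limits[provider]
        window = self.minute_window[provider]

        # Drop timestamps that have left the 60-second sliding window.
        while window and now - window[0] >= 60:
            window.popleft()
        if len(window) >= limit["rpm"]:
            return False, "rpm limit reached"

        # Assumption: days roll over at midnight UTC (epoch seconds // 86400).
        day_key = (provider, int(now // 86_400))
        if self.daily_requests[day_key] >= limit["daily_requests"]:
            return False, "daily request limit reached"

        # Only count the request once it has been admitted.
        window.append(now)
        self.daily_requests[day_key] += 1
        return True, "ok"
```

Returning a `(allowed, reason)` tuple matches how the integration snippet unpacks the result, so a rejected provider is simply skipped by the caller.
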
### Monitoring Endpoint

```
GET /api/v1/health/ai-usage
```

Returns RPM, daily request, and token usage statistics for each provider.

---

## 8. NVIDIA Nemotron Integration (v1.3)

> Reference: [ADR-036 Nemotron Tool Calling Integration](ADR-036-nemotron-tool-calling-integration.md)

### Background

On 2026-03-28 the Commander directed an evaluation of NVIDIA Nemotron models. In testing, tool-calling accuracy was 83.3%, significantly better than CPU Ollama (~50%).

### Provider Update

| Provider | Use | Latency | Accuracy | Cost |
|----------|-----|---------|----------|------|
| **Ollama** | Real-time chat, simple queries | < 5s | Medium | $0 |
| **Nemotron** | Tool calling, K8s operations | 11-45s | High (83%) | Free tier |
| **Gemini** | General fallback | 2-5s | Medium-high | Low |
| **Claude** | Complex reasoning, CRITICAL | 2-5s | Highest | High |

### Routing Rule Update

```
Task type          → Routing target
────────────────────────────────────────────
Tool calling       → Nemotron (high accuracy)
Real-time chat     → Ollama (low latency)
Complex reasoning  → Claude (strongest)
General fallback   → Gemini (balanced)
```

### Fallback Chains

```
Tool-calling tasks:
  Nemotron → Gemini → Claude → refuse to execute

General chat tasks (unchanged):
  Ollama → Gemini → Claude → Static Response
```

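The two chains differ only in their first provider and in what happens when every provider is exhausted. A minimal sketch of that selection logic follows; `FALLBACK_CHAINS`, `ToolCallRefused`, `resolve`, and the task-type keys are hypothetical names chosen for illustration, and provider availability is reduced to a simple set.

```python
FALLBACK_CHAINS = {
    "tool_calling": ["nemotron", "gemini", "claude"],
    "chat": ["ollama", "gemini", "claude"],
}


class ToolCallRefused(Exception):
    """Raised when no provider can serve a tool-calling task."""


def resolve(task_type: str, available: set[str]) -> str:
    """Pick the first available provider in the chain for this task type."""
    for provider in FALLBACK_CHAINS[task_type]:
        if provider in available:
            return provider
    if task_type == "tool_calling":
        # Tool calls are refused rather than answered with a canned message.
        raise ToolCallRefused("no provider available for tool calling")
    return "static_response"


print(resolve("tool_calling", {"gemini", "claude"}))  # gemini
print(resolve("chat", set()))                         # static_response
```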
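---

## Changelog

| Date | Version | Change | Author |
|------|---------|--------|--------|
| 2026-03-20 | v1.0 | Initial version | CTO |
| 2026-03-26 | v1.1 | Added the Smart Routing Integration section (Phase 13.3) | Chief Architect |
| 2026-03-26 | v1.2 | Added the Rate Limiter Implementation section | Chief Architect |
| 2026-03-29 | v1.3 | Added the NVIDIA Nemotron Integration section (pending Commander approval) | Chief Architect |

---

*This ADR records the decision process and implementation specification for the AI degradation and fallback strategy.*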