fix(ai-router): v4.3 NIM protection - timeouts do not count as CB failures; always try NIM before switching to Gemini
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 20s
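Requirement: NIM must be given the chance to respond before the router switches providers; it must not be blocked by the CB and routed to Gemini merely because it is slow.

Changes:
- Timeout exceptions no longer accumulate CB failures (only real connection errors count)
- NIM CB: failure_threshold=10, recovery_timeout=30s (more lenient than the defaults)
- Design doc v4.3: update Direction 2 and remove the incorrect assumptions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>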
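The circuit-breaker rule in this commit message can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's `_SimpleCircuitBreaker`: the class `SketchCircuitBreaker`, the helper `record_provider_error`, the default parameter values, and the injectable clock are all hypothetical, and the built-in `TimeoutError` stands in for `httpx.TimeoutException` so the block runs without third-party packages.

```python
import time
from typing import Callable, Optional


class SketchCircuitBreaker:
    """Hypothetical stand-in for _SimpleCircuitBreaker (not the project's class)."""

    def __init__(self, name: str, failure_threshold: int = 5,
                 recovery_timeout: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # The default threshold/timeout here are assumptions, not the project's defaults.
        self.name = name
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        """Closed: allow. Open: block until recovery_timeout has elapsed (half-open probe)."""
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self.recovery_timeout

    def record_success(self) -> None:
        """Reset the breaker to the closed state."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        """Count one failure and open the breaker once the threshold is reached."""
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


def record_provider_error(cb: SketchCircuitBreaker, exc: Exception) -> None:
    """Count only non-timeout errors, mirroring the rule described in this commit."""
    if not isinstance(exc, TimeoutError):  # stand-in for httpx.TimeoutException
        cb.record_failure()
```

With `failure_threshold=10` and `recovery_timeout=30.0`, any number of timeouts leaves the breaker closed, while the tenth connection error opens it for 30 seconds.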
@@ -787,7 +787,16 @@ class AIRouterExecutor:
     def _get_circuit_breaker(self, name: str) -> "_SimpleCircuitBreaker":
         """Get the provider's circuit breaker (per-provider, lazy init)."""
         if name not in self._circuit_breakers:
-            self._circuit_breakers[name] = _SimpleCircuitBreaker(name)
+            # 2026-04-05 Claude Code: v4.3 - NIM uses more lenient CB parameters
+            # Always try NIM first; only real connection errors (not timeouts) accumulate failures
+            # failure_threshold=10: 10 real errors are needed to OPEN (timeouts are not counted)
+            # recovery_timeout=30: enter half-open after 30s and retry NIM immediately
+            if name == "nemotron":
+                self._circuit_breakers[name] = _SimpleCircuitBreaker(
+                    name, failure_threshold=10, recovery_timeout=30.0
+                )
+            else:
+                self._circuit_breakers[name] = _SimpleCircuitBreaker(name)
         return self._circuit_breakers[name]
 
     @staticmethod
@@ -961,7 +970,12 @@ class AIRouterExecutor:
             except Exception as e:
                 errors.append(f"{provider_name}: {e}")
                 logger.warning("ai_router_provider_exception", provider=provider_name, error=str(e))
-                cb.record_failure()
+                # 2026-04-05 Claude Code: v4.3 - timeouts do not count as CB failures
+                # NIM occasionally takes 27s when its GPU is busy; a timeout does not mean NIM is down
+                # Only explicit connection errors (not timeouts) increment the CB failure count
+                import httpx as _httpx
+                if not isinstance(e, _httpx.TimeoutException):
+                    cb.record_failure()
 
         # All providers failed
         logger.error("ai_router_all_providers_failed", tried=provider_order, errors=errors)
@@ -137,121 +137,64 @@ AutoRepairService.execute() completes
 
 ## Direction 2: DIAGNOSE Privacy-First Routing
 
-### ⚠️ Implementation correction log (2026-04-04)
+### ⚠️ Implementation correction log (2026-04-05, v4.3 final)
 
-The design discussion assumed Nemotron was a local provider, but the chief architect's Q2 ruling established that NIM is a cloud GPU,
-`NemotronProvider.privacy_level = "cloud"`.
+**Both incorrect assumptions from the design discussion have been verified and corrected:**
 
-The actual implementation was adjusted as follows:
-- FORCE_LOCAL case: `_local_fallback_chain = [OLLAMA]` (Nemotron is correctly excluded by the privacy filter)
-- Non-FORCE_LOCAL case: the DIAGNOSE override is changed to NEMOTRON (high-capability cloud diagnosis)
-- The privacy boundary is correct in both cases; the design intent is unchanged
+| Assumption | Measured result | Correction |
+|------|---------|------|
+| Nemotron NIM runs locally on 188 | `integrate.api.nvidia.com` is a cloud GPU | NIM has always been the primary provider and has received Incident data since Phase 22; no privacy issue |
+| Ollama can serve as a fallback (~173s) | Measured 238s only to return `{"ok":true}` (CPU-only) | Ollama is unusable in production; `_local_fallback_chain` is deprecated |
 
 ---
 
-### Design goals
+### Design goals (revised)
 
-- DIAGNOSE tasks are routed to local Nemotron first (NIM 188)
-- When `FORCE_LOCAL=true`, the fallback chain **stays local only** and never touches the cloud
-- The timeout is configurable (`NEMOTRON_DIAGNOSE_TIMEOUT_SECONDS`) and will be tuned after measurement
-- All local providers fail → silent REJECT + Telegram notification; never degrade to the cloud
+- DIAGNOSE tasks are routed to Nemotron NIM (primary; measured avg 10.6s)
+- NIM fails → normal degradation through `_full_fallback_chain`
+- Timeout based on measurements: NIM 60s (measured 2.2~27.3s, avg 10.6s)
 
-### Current state of the code
+### Implementation (v4.3)
 
 ```python
-# ai_router.py:255 - DIAGNOSE currently routes to Ollama (not Nemotron)
-IntentType.DIAGNOSE: AIProviderEnum.OLLAMA
+# ai_router.py - DIAGNOSE override (unchanged)
+IntentType.DIAGNOSE: AIProviderEnum.NEMOTRON,
 
-# ai_router.py:863 - the require_local privacy filter already exists
-if require_local and provider.privacy_level != "local":
-    continue
+# ai_router.py - _local_fallback_chain deprecated (empty list)
+self._local_fallback_chain = []  # Ollama is CPU-only at 238s, unusable
 
-# But the current fallback chain is global (dangerous!)
-self._fallback_chain = [OLLAMA, GEMINI, CLAUDE]
-# After a FORCE_LOCAL timeout, a request can still reach GEMINI
+# ai_router.py - DIAGNOSE uses _full_fallback_chain (NIM primary)
+fallback_chain = self._build_fallback_chain(provider)
 ```
 
-### Core fix: a separate local-only fallback chain
-
-```python
-# New (ai_router.py)
-self._local_fallback_chain: list[AIProviderEnum] = [
-    AIProviderEnum.NEMOTRON,  # NIM 188, primary (zero cost, high capability)
-    AIProviderEnum.OLLAMA,    # Ollama 188, fallback (slow but reliable)
-    # ← stop here; never continue
-]
-
-# route() logic fix
-chain = (
-    self._local_fallback_chain
-    if require_local
-    else self._fallback_chain
-)
-
-# Handling when every provider fails
-if require_local and not result:
-    await telegram_gateway.push_alert(
-        "⚠️ DIAGNOSE local providers unavailable, manual intervention required"
-    )
-    return AIResult(success=False, error="local_providers_unavailable")
-```
-
-### DIAGNOSE intent override fix
-
-```python
-# Current (changed to NEMOTRON)
-IntentType.DIAGNOSE: AIProviderEnum.NEMOTRON,  # NIM 188 (privacy + performance)
-
-# Also update the context
-context = {
-    "alert_type": "...",
-    "task_source": "auto_repair" | "user_chat" | "drift_detection",
-    "privacy": "strict"
-}
-```
-
-### Timeout configuration
+### Timeout configuration (based on measurements)
 
 ```bash
 # Added to core/config.py
-NEMOTRON_DIAGNOSE_TIMEOUT_SECONDS=30    # DIAGNOSE only; tune after measurement
-NEMOTRON_DEFAULT_TIMEOUT_SECONDS=45     # Tool Calling and other uses
-OLLAMA_DIAGNOSE_TIMEOUT_SECONDS=60      # Ollama is slower, so allow more time
+NEMOTRON_DIAGNOSE_TIMEOUT_SECONDS=60    # Measured 2.2-27.3s, avg 10.6s; 60s includes buffer
+NEMOTRON_TIMEOUT_SECONDS=45             # Tool Calling and other uses
+OLLAMA_DIAGNOSE_TIMEOUT_SECONDS=200     # Field retained; DIAGNOSE does not actually use Ollama
 ```
 
 NemotronProvider.analyze() selects the timeout based on context["task_type"], so that DIAGNOSE is not held up by the long timeout used for Tool Calling.
 
-### Full routing flow
+### Full routing flow (v4.3)
 
 ```
 DIAGNOSE request
     │
-AIRouter.route(intent=DIAGNOSE, require_local=True)
+AIRouter.route(intent=DIAGNOSE)
     │
-Uses _local_fallback_chain = [NEMOTRON → OLLAMA → REJECT]
-    │
-    ├── NemotronProvider (NIM 188)
-    │     timeout = NEMOTRON_DIAGNOSE_TIMEOUT_SECONDS (30s)
-    │     ├── success → return, cost=0, record Langfuse trace
-    │     └── timeout/failure → CB records it, next provider
-    │
-    ├── OllamaProvider (188 local fallback)
-    │     timeout = OLLAMA_DIAGNOSE_TIMEOUT_SECONDS (60s)
-    │     ├── success → return, cost=0
-    │     └── timeout/failure → next provider
-    │
-    └── REJECT (privacy boundary, never crossed)
-          → AIResult(success=False, error="local_providers_unavailable")
-          → Telegram: "⚠️ DIAGNOSE local providers unavailable, manual intervention required"
+NemotronProvider (integrate.api.nvidia.com, free GPU)
+  timeout = 60s (measured avg 10.6s)
+    ├── success → return, record Langfuse trace
+    └── failure → normal degradation via _full_fallback_chain (Gemini → Claude)
 ```
 
 ### Change list
 
 | Action | File | Change |
 |------|------|------|
-| Modify | `services/ai_router.py` | Add `_local_fallback_chain`; `route()` selects the chain based on `require_local`; DIAGNOSE override changed to NEMOTRON; Telegram notification when all providers fail |
-| Modify | `services/ai_providers/nemotron.py` | `analyze()` reads the matching timeout based on `context["task_type"]` |
-| Modify | `core/config.py` | Add three timeout environment variables |
+| Modify | `services/ai_router.py` | v4.3: `_local_fallback_chain` deprecated; DIAGNOSE reverts to `_full_fallback_chain` |
+| Modify | `services/ai_providers/nemotron.py` | `analyze()` reads the timeout based on `context["task_type"]` |
+| Modify | `core/config.py` | Timeout descriptions updated to reflect measured results |
 
 ---
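The per-task timeout selection that the design doc attributes to `NemotronProvider.analyze()` might look like the sketch below. The function name `select_timeout`, the `"diagnose"` task-type value, and reading the settings from environment variables are assumptions made for illustration; only the two variable names and their 60s/45s values come from the diff.

```python
import os
from typing import Mapping, Optional


def select_timeout(context: Optional[Mapping[str, object]],
                   env: Mapping[str, str] = os.environ) -> float:
    """Return the request timeout in seconds for a Nemotron call (hypothetical helper)."""
    task_type = (context or {}).get("task_type")
    if task_type == "diagnose":  # assumed task_type value for DIAGNOSE requests
        # 60s leaves headroom over the measured 2.2-27.3s range.
        return float(env.get("NEMOTRON_DIAGNOSE_TIMEOUT_SECONDS", "60"))
    # Tool Calling and every other use share the general timeout.
    return float(env.get("NEMOTRON_TIMEOUT_SECONDS", "45"))
```

Passing `env` explicitly keeps the helper easy to test; in a real service the values would more likely come from the settings object defined in `core/config.py`.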
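The v4.3 routing flow (NIM first, then normal degradation to Gemini and Claude) amounts to an ordered fallback loop. The sketch below is a simplified, synchronous illustration: `route_with_fallback` and the `(name, callable)` provider shape are hypothetical and do not reproduce the project's `AIRouter.route()` or `_build_fallback_chain()`.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical provider shape: (name, function taking a prompt and returning an answer).
Provider = Tuple[str, Callable[[str], str]]


def route_with_fallback(prompt: str, chain: Sequence[Provider]
                        ) -> Tuple[Optional[str], Optional[str], List[str]]:
    """Try each provider in order; return (provider_name, answer, errors)."""
    errors: List[str] = []
    for name, call in chain:
        try:
            return name, call(prompt), errors
        except Exception as exc:  # timeouts and connection errors both fall through
            errors.append(f"{name}: {exc}")
    # Every provider failed: the caller decides how to report it.
    return None, None, errors
```

In this sketch a timeout still moves the current request on to the next provider. What the commit changes is only whether that timeout counts toward opening NIM's circuit breaker, so the next request starts with NIM again.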