Commander's iron rule, 2026-04-29: "Prefer the Ollama on host 111 as the primary."
+ feedback_ai_autonomous_direction.md: rely mainly on local, free LLMs
+ feedback_ollama_111_only.md: the only Ollama host = 111
## Overturning the factual basis of A2 (2026-04-27, INC-20260425)
**Old fact**: Ollama = CPU-only deepseek-r1:14b @ 238s (unusable)
**New fact**: prod Ollama 111 = M1 Pro Apple Silicon GPU + qwen2.5:7b-instruct,
fully loaded in 8.2GB of VRAM, ctx 32k, measured 0.54s on a "hi" prompt
**Cloud providers are all down** (evidence: 2026-04-29 prod log):
- OpenClaw 188:8088 → 500 Internal Server Error
- Gemini → 429 Too Many Requests (quota exhausted)
- Claude → 404 Not Found (model claude-3-haiku-20240307 has expired)
**If A2 is not overturned → 100% of incidents end in llm_failed → AI auto-remediation never starts**
## Scope of changes (minimal, safe, verifiable)
### ai_router.py
- `_diagnose_fallback_chain`: OLLAMA in first position (replaces the old "permanently excluded" comment)
  Order: [OLLAMA, OPENCLAW_NEMO, GEMINI, CLAUDE]
- `_intent_provider_overrides[DIAGNOSE]`: OPENCLAW_NEMO → OLLAMA
- `_full_fallback_chain` is untouched (to avoid affecting RESTART/SCALE/CONFIG/DELETE)
- `_tool_calling_fallback_chain` is untouched
- `complexity_map` is untouched (critic M2 is left for a follow-up)
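
The chain order above, together with the rule that the selected primary is dropped from its own fallback list, can be sketched as follows. This is a minimal illustration, not the code in ai_router.py: `Provider`, `DIAGNOSE_CHAIN`, `build_fallback_chain`, and the string labels in the tuples are hypothetical stand-ins, and only the provider order is taken from this commit message.

```python
from enum import Enum
from typing import List, Tuple


class Provider(Enum):
    """Hypothetical stand-in for AIProviderEnum."""
    OLLAMA = "ollama"
    OPENCLAW_NEMO = "openclaw_nemo"
    GEMINI = "gemini"
    CLAUDE = "claude"


# Order from the commit message. The second tuple element is a placeholder
# label; what the real chain stores there is not shown in this commit.
DIAGNOSE_CHAIN: List[Tuple[Provider, str]] = [
    (Provider.OLLAMA, "local"),
    (Provider.OPENCLAW_NEMO, "cloud"),
    (Provider.GEMINI, "cloud"),
    (Provider.CLAUDE, "cloud"),
]


def build_fallback_chain(
    primary: Provider, chain: List[Tuple[Provider, str]]
) -> List[Tuple[Provider, str]]:
    """Return the chain without the primary, preserving the original order."""
    return [(p, label) for p, label in chain if p is not primary]


fallbacks = build_fallback_chain(Provider.OLLAMA, DIAGNOSE_CHAIN)
fallback_names = [p.value for p, _ in fallbacks]
```

With OLLAMA as primary, only the three cloud providers remain as fallbacks, which is the shape the updated regression tests assert.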
### openclaw.py
- Inject task_type="diagnose" into alert_context (critic C2, the real root cause)
- Fix the timeout mismatch at ai_providers/ollama.py:77:
  - with task_type → OLLAMA_DIAGNOSE_TIMEOUT_SECONDS=200s
  - without → OPENCLAW_TIMEOUT=30s (not enough for qwen2.5:7b inference)
  - this is the root cause of the latency_ms=120014 seen in the prod log
- Copy with dict(alert_context) so the original context is not polluted
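
A minimal sketch of the two ideas above, assuming the timeout is chosen purely by the presence of `task_type` in the context. The helper names `with_task_type` and `pick_timeout` are hypothetical; only the two constant names, their values, and the `dict(alert_context)` copy come from this commit message.

```python
from typing import Any, Dict

# Values quoted in the commit message.
OLLAMA_DIAGNOSE_TIMEOUT_SECONDS = 200
OPENCLAW_TIMEOUT = 30


def with_task_type(alert_context: Dict[str, Any], task_type: str = "diagnose") -> Dict[str, Any]:
    """Return a shallow copy of the context with task_type injected."""
    ctx = dict(alert_context)  # copy, so the caller's dict is left untouched
    ctx["task_type"] = task_type
    return ctx


def pick_timeout(context: Dict[str, Any]) -> int:
    """Use the longer diagnose timeout only when a task_type is present."""
    if context.get("task_type"):
        return OLLAMA_DIAGNOSE_TIMEOUT_SECONDS
    return OPENCLAW_TIMEOUT


original = {"alertname": "PodCrashLoop"}
enriched = with_task_type(original)
```

Note that `dict(...)` makes a shallow copy: top-level keys are isolated from the caller, but nested mutable values would still be shared.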
## Regression tests updated in sync (6)
All A2 iron-rule guard tests now reflect the new rule:
- test_p0_diagnose_routing.py::test_diagnose_override_is_ollama
  (was test_diagnose_override_is_openclaw_nemo)
- test_ai_router_diagnose_fallback.py::test_diagnose_fallback_chain_ollama_primary
  (was test_diagnose_fallback_chain_no_ollama)
- test_ai_router_diagnose_fallback.py::test_diagnose_route_primary_is_ollama
  (was test_diagnose_route_fallback_chain_excludes_ollama)
- test_ai_router_diagnose_fallback.py::test_diagnose_route_sync_primary_is_ollama
  (was test_diagnose_route_sync_fallback_chain_excludes_ollama)
- test_ai_router_diagnose_fallback.py::test_build_fallback_chain_for_intent_diagnose_with_ollama_primary
  (was test_build_fallback_chain_for_intent_diagnose_no_ollama)
- test_ai_router_failover_integration.py::test_router_uses_failover_for_diagnose_ollama_primary
  (was test_router_does_not_use_failover_for_openclaw_nemo)
Each test docstring records the historical context and the reason for the overturn.
## Verification
- 1608 unit tests all green
- 16 LLM-path tests all green (including the updated versions of the 6 A2 guard tests)
- complexity_scorer / failover_manager / intent_classifier are unaffected
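## Expected prod behaviour (to be verified after deploy)
incident arrives → DIAGNOSE intent → primary OLLAMA (qwen2.5:7b on M1 Pro GPU)
fall back only on failure → OpenClaw 188 → Gemini → Claude
Ollama uses a 200s timeout (the previous 30s was not enough)
→ AI auto-remediation can finally start; no more 100% llm_failed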
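
The flow above amounts to trying providers in order and stopping at the first success. The sketch below illustrates that pattern with synchronous stand-ins; `Result`, `execute`, and the lambda providers are hypothetical and much simpler than the real `AIRouterExecutor`. The `provider="none"` value for the all-failed case mirrors what the tests in this file assert.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Result:
    provider: str
    success: bool
    raw_response: str = ""


def execute(
    prompt: str,
    provider_order: List[str],
    providers: Dict[str, Callable[[str], Result]],
) -> Result:
    """Try each provider in order and return the first successful result."""
    for name in provider_order:
        call = providers.get(name)
        if call is None:  # provider not registered: skip it
            continue
        result = call(prompt)
        if result.success:
            return result
    # Every provider failed: degrade gracefully instead of raising.
    return Result(provider="none", success=False)


# Simulated state: Ollama and OpenClaw fail, Gemini answers.
providers = {
    "ollama": lambda p: Result("ollama", False),
    "openclaw_nemo": lambda p: Result("openclaw_nemo", False),
    "gemini": lambda p: Result("gemini", True, "diagnosis"),
    "claude": lambda p: Result("claude", True, "diagnosis"),
}
order = ["ollama", "openclaw_nemo", "gemini", "claude"]
winner = execute("RCA: pod OOMKilled", order, providers)
```

Returning a failed `Result` rather than raising keeps the caller's error handling uniform whether one provider or all of them failed.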
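## Known debt (to be handled later)
- models.json:21 ollama.default is still deepseek-r1:14b (critic C1, but prod already auto-routes to the model that is actually loaded)
- complexity 4/5 is still hard-coded to gemini/claude (critic M2)
- The Gemini API key appears in plaintext in the prod log (needs rotation + sanitization)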
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
407 lines
14 KiB
Python
# apps/api/tests/test_ai_router_diagnose_fallback.py
# 2026-04-27 Claude Sonnet 4.6: A2 INC-20260425 - removed Ollama from the DIAGNOSE fallback chain
# 2026-04-29: A2 overturned; OLLAMA is now the DIAGNOSE primary (see the per-test docstrings)
"""
DIAGNOSE fallback chain tests (A2 INC-20260425)
===============================================
Acceptance criteria:
1. DIAGNOSE intent, NEMO fails → fall back to Gemini (not Ollama)
2. Gemini fails → fall back to Claude
3. Everything fails → graceful degradation (no further attempt on Ollama)
4. Fallback behaviour of other intents (e.g. RESTART) is unchanged (Ollama stays in the chain)
5. The aiops_diagnose_fallback_total metric accumulates correctly

Note (2026-04-29): A2 was overturned and OLLAMA is now the DIAGNOSE primary at the
router level. Criteria 1-3 are exercised at the executor level with an explicit
cloud-only provider_order.

Test category: unit (mock provider / registry; no Redis / DB / K8s dependency)
"""

from __future__ import annotations

import os

os.environ.setdefault("MOCK_MODE", "true")

from unittest.mock import AsyncMock, MagicMock, patch

import pytest

from src.services.ai_router import (
    AIProviderEnum,
    AIRouter,
    AIRouterExecutor,
    AIProviderRegistry,
    reset_ai_router,
)
from src.services.intent_classifier import IntentType


# =============================================================================
# Fixtures
# =============================================================================


@pytest.fixture(autouse=True)
def reset_router():
    """Reset the singleton after each test so mocks do not leak between tests."""
    yield
    reset_ai_router()


def _make_router() -> AIRouter:
    """Build an AIRouter (failover_manager is mocked to avoid the Redis dependency)."""
    router = AIRouter()
    mock_fm = MagicMock()
    mock_fm.select_provider = AsyncMock(side_effect=RuntimeError("not needed"))
    router._failover_manager = mock_fm
    return router


def _make_registry_with_providers(
    *,
    nemo_success: bool = True,
    gemini_success: bool = True,
    claude_success: bool = True,
) -> AIProviderRegistry:
    """Build a registry with only the openclaw_nemo / gemini / claude providers (no Ollama)."""
    from src.services.ai_providers.interfaces import AIResult

    registry = AIProviderRegistry()

    def _make_provider(name: str, privacy: str, success: bool, response: str = "") -> MagicMock:
        p = MagicMock()
        p.name = name
        p.privacy_level = privacy
        p.is_enabled = True
        p.capabilities = {"rca", "chat"}
        p.analyze = AsyncMock(
            return_value=AIResult(
                raw_response=response or f"{name}_response",
                success=success,
                provider=name,
                error="" if success else f"{name}_timeout",
            )
        )
        p.health_check = AsyncMock(return_value=success)
        return p

    registry._providers = {
        "openclaw_nemo": _make_provider("openclaw_nemo", "cloud", nemo_success),
        "gemini": _make_provider("gemini", "cloud", gemini_success),
        "claude": _make_provider("claude", "cloud", claude_success, "claude_diagnosis_result"),
    }
    return registry


# =============================================================================
# Test 1: _diagnose_fallback_chain exists and has OLLAMA first
#         (under A2 this test asserted that the chain excluded Ollama)
# =============================================================================


def test_diagnose_fallback_chain_ollama_primary():
    """2026-04-29 ogt + Claude Code: overturn A2; OLLAMA is the DIAGNOSE primary.

    Commander's iron rule (2026-04-29): prefer the Ollama on host 111 as the primary
    + feedback_ai_autonomous_direction.md: rely mainly on local, free LLMs
    + feedback_ollama_111_only.md: the only Ollama host = 111

    Why A2 (2026-04-27 INC-20260425) is overturned:
      Old fact: Ollama = CPU-only deepseek-r1:14b @ 238s (unusable)
      New fact: prod Ollama 111 = M1 Pro GPU + qwen2.5:7b loaded (VRAM 8.2GB),
                measured 0.54s on a "hi" prompt
      Cloud providers all down: OpenClaw 500 / Gemini 429 / Claude 404
      Not overturning → 100% llm_failed
    """
    router = _make_router()

    assert hasattr(router, "_diagnose_fallback_chain"), (
        "_diagnose_fallback_chain attribute does not exist"
    )

    providers_in_chain = [p for p, _ in router._diagnose_fallback_chain]
    # New iron rule: OLLAMA must be first in the chain
    assert providers_in_chain[0] == AIProviderEnum.OLLAMA, (
        f"Commander's iron rule: first entry in the chain should be OLLAMA, got: {providers_in_chain}"
    )
    # Cloud fallbacks remain (last-resort backup)
    assert AIProviderEnum.OPENCLAW_NEMO in providers_in_chain
    assert AIProviderEnum.GEMINI in providers_in_chain
    assert AIProviderEnum.CLAUDE in providers_in_chain
    # OLLAMA_188 (CPU-only backup) stays excluded (the M1 Pro on 111 is the GPU used for primary inference)
    assert AIProviderEnum.OLLAMA_188 not in providers_in_chain


def test_diagnose_fallback_chain_contains_cloud_providers():
    """_diagnose_fallback_chain should contain OPENCLAW_NEMO, GEMINI, and CLAUDE."""
    router = _make_router()

    providers_in_chain = [p for p, _ in router._diagnose_fallback_chain]
    assert AIProviderEnum.OPENCLAW_NEMO in providers_in_chain
    assert AIProviderEnum.GEMINI in providers_in_chain
    assert AIProviderEnum.CLAUDE in providers_in_chain


# =============================================================================
# Test 2: DIAGNOSE route() / route_sync() select OLLAMA as primary
#         (under A2 this section asserted that fallback_chain excluded Ollama)
# =============================================================================


@pytest.mark.asyncio
async def test_diagnose_route_primary_is_ollama():
    """2026-04-29: the DIAGNOSE intent route() primary must be OLLAMA (A2 overturned)."""
    router = _make_router()

    decision = await router.route(
        "pod crash loop detected",
        context={"intent_hint": "diagnose"},
    )

    assert decision.selected_provider == AIProviderEnum.OLLAMA, (
        f"Commander's iron rule: DIAGNOSE primary should be OLLAMA, got: {decision.selected_provider}"
    )

    # Cloud fallbacks remain (OpenClaw / Gemini / Claude as last-resort backup)
    fb_providers = [p for p, _ in decision.fallback_chain]
    # ollama_failover_manager may redirect to ollama_188, but an ollama variant must be present
    has_cloud_fallback = (
        AIProviderEnum.GEMINI in fb_providers or AIProviderEnum.CLAUDE in fb_providers
    )
    assert has_cloud_fallback, (
        f"A cloud fallback should exist as last-resort backup: {fb_providers}"
    )


@pytest.mark.asyncio
async def test_diagnose_route_sync_primary_is_ollama():
    """2026-04-29: the DIAGNOSE route_sync() primary is likewise OLLAMA."""
    router = _make_router()

    decision = router.route_sync(
        "pod crash loop detected",
        context={"intent_hint": "diagnose"},
    )

    assert decision.selected_provider == AIProviderEnum.OLLAMA, (
        f"Commander's iron rule: DIAGNOSE route_sync primary should be OLLAMA, got: {decision.selected_provider}"
    )


# =============================================================================
# Test 3: DIAGNOSE, NEMO fails → fall back to Gemini (not Ollama)
# =============================================================================


@pytest.mark.asyncio
async def test_diagnose_nemo_fail_fallback_to_gemini_not_ollama():
    """DIAGNOSE: NEMO fails → the executor tries Gemini and does not try Ollama."""
    registry = _make_registry_with_providers(
        nemo_success=False,
        gemini_success=True,
    )
    executor = AIRouterExecutor(registry)

    with patch("src.services.ai_router._settings") as mock_settings:
        mock_settings.MOCK_MODE = False
        result = await executor.execute(
            prompt="RCA: pod OOMKilled",
            provider_order=["openclaw_nemo", "gemini", "claude"],
            context={"intent_hint": "diagnose"},
        )

    assert result.success is True
    assert result.provider == "gemini", (
        f"Should fall back to gemini, got: {result.provider}"
    )
    # Ollama is not in provider_order at all; confirm it was never registered either
    ollama_provider = registry._providers.get("ollama")
    assert ollama_provider is None, "registry should not contain an ollama provider (DIAGNOSE path)"


# =============================================================================
# Test 4: DIAGNOSE, Gemini fails → fall back to Claude
# =============================================================================


@pytest.mark.asyncio
async def test_diagnose_gemini_fail_fallback_to_claude():
    """DIAGNOSE: NEMO fails + Gemini fails → the executor tries Claude."""
    registry = _make_registry_with_providers(
        nemo_success=False,
        gemini_success=False,
        claude_success=True,
    )
    executor = AIRouterExecutor(registry)

    with patch("src.services.ai_router._settings") as mock_settings:
        mock_settings.MOCK_MODE = False
        result = await executor.execute(
            prompt="RCA: pod crash",
            provider_order=["openclaw_nemo", "gemini", "claude"],
            context={"intent_hint": "diagnose"},
        )

    assert result.success is True
    assert result.provider == "claude", (
        f"Should fall back to claude, got: {result.provider}"
    )


# =============================================================================
# Test 5: DIAGNOSE, everything fails → graceful degradation (no Ollama attempt)
# =============================================================================


@pytest.mark.asyncio
async def test_diagnose_all_fail_graceful_no_ollama():
    """DIAGNOSE: NEMO + Gemini + Claude all fail → graceful error, Ollama is not tried."""
    registry = _make_registry_with_providers(
        nemo_success=False,
        gemini_success=False,
        claude_success=False,
    )
    executor = AIRouterExecutor(registry)

    with patch("src.services.ai_router._settings") as mock_settings:
        mock_settings.MOCK_MODE = False
        result = await executor.execute(
            prompt="RCA: cascading failure",
            provider_order=["openclaw_nemo", "gemini", "claude"],
            context={"intent_hint": "diagnose"},
        )

    # Total failure should return success=False (graceful degradation, no raise)
    assert result.success is False
    assert result.provider == "none"
    # Confirm Ollama was not tried (the registry has no ollama entry at all)
    assert "ollama" not in registry._providers


# =============================================================================
# Test 6: fallback behaviour of other intents (RESTART) is unchanged
#         (Ollama stays in the chain)
# =============================================================================


@pytest.mark.asyncio
async def test_restart_intent_still_has_ollama_in_fallback():
    """The RESTART intent fallback_chain should still include OLLAMA (behaviour unchanged)."""
    router = _make_router()

    # RESTART → None (complexity-based routing); low complexity → OLLAMA primary
    # Pass the intent through the context hint to avoid LLM classification
    decision = await router.route(
        "restart the api service",
        context={"intent_hint": "restart"},
    )

    # The RESTART intent is unaffected by A2; _full_fallback_chain still contains OLLAMA
    all_providers_in_decision = [decision.selected_provider] + [
        p for p, _ in decision.fallback_chain
    ]
    assert AIProviderEnum.OLLAMA in all_providers_in_decision, (
        f"RESTART path should still include OLLAMA (behaviour unchanged), got: {all_providers_in_decision}"
    )


def test_build_fallback_chain_for_intent_diagnose_with_ollama_primary():
    """2026-04-29: _build_fallback_chain_for_intent(DIAGNOSE, primary=OLLAMA)
    should exclude the primary OLLAMA from the result but keep the cloud fallbacks."""
    router = _make_router()

    # The primary is OLLAMA (after A2 was overturned)
    chain = router._build_fallback_chain_for_intent(
        AIProviderEnum.OLLAMA,
        IntentType.DIAGNOSE,
    )
    providers = [p for p, _ in chain]

    # The primary has been excluded
    assert AIProviderEnum.OLLAMA not in providers
    # The cloud last-resort fallbacks must be present
    assert AIProviderEnum.OPENCLAW_NEMO in providers
    assert AIProviderEnum.GEMINI in providers
    assert AIProviderEnum.CLAUDE in providers


def test_build_fallback_chain_for_intent_restart_has_ollama():
    """_build_fallback_chain_for_intent(RESTART) should still include OLLAMA."""
    router = _make_router()

    chain = router._build_fallback_chain_for_intent(
        AIProviderEnum.OPENCLAW_NEMO,
        IntentType.RESTART,
    )
    providers = [p for p, _ in chain]

    assert AIProviderEnum.OLLAMA in providers, (
        f"RESTART fallback should include OLLAMA, got: {providers}"
    )


# =============================================================================
# Test 7: the aiops_diagnose_fallback_total metric accumulates correctly
# =============================================================================


@pytest.mark.asyncio
async def test_diagnose_fallback_metric_incremented():
    """DIAGNOSE: NEMO fails → on fallback to Gemini, aiops_diagnose_fallback_total is recorded."""
    registry = _make_registry_with_providers(
        nemo_success=False,
        gemini_success=True,
    )
    executor = AIRouterExecutor(registry)

    with patch("src.services.ai_router._settings") as mock_settings:
        mock_settings.MOCK_MODE = False
        with patch("src.core.metrics.record_diagnose_fallback") as mock_metric:
            await executor.execute(
                prompt="RCA: high error rate",
                provider_order=["openclaw_nemo", "gemini", "claude"],
                context={"intent_hint": "diagnose"},
            )

    # The fallback from openclaw_nemo → gemini should be recorded exactly once
    mock_metric.assert_called_once_with(
        from_provider="openclaw_nemo",
        to_provider="gemini",
    )


@pytest.mark.asyncio
async def test_non_diagnose_intent_no_fallback_metric():
    """A fallback for a non-DIAGNOSE intent must not trigger aiops_diagnose_fallback_total."""
    from src.services.ai_providers.interfaces import AIResult

    registry = AIProviderRegistry()

    # ollama fails
    mock_ollama = MagicMock()
    mock_ollama.name = "ollama"
    mock_ollama.privacy_level = "local"
    mock_ollama.is_enabled = True
    mock_ollama.capabilities = {"chat"}
    mock_ollama.analyze = AsyncMock(
        return_value=AIResult(raw_response="", success=False, provider="ollama", error="timeout")
    )

    # gemini succeeds
    mock_gemini = MagicMock()
    mock_gemini.name = "gemini"
    mock_gemini.privacy_level = "cloud"
    mock_gemini.is_enabled = True
    mock_gemini.analyze = AsyncMock(
        return_value=AIResult(raw_response="ok", success=True, provider="gemini")
    )

    registry._providers = {"ollama": mock_ollama, "gemini": mock_gemini}
    executor = AIRouterExecutor(registry)

    with patch("src.services.ai_router._settings") as mock_settings:
        mock_settings.MOCK_MODE = False
        with patch("src.core.metrics.record_diagnose_fallback") as mock_metric:
            await executor.execute(
                prompt="restart service",
                provider_order=["ollama", "gemini"],
                context={"intent_hint": "restart"},  # not DIAGNOSE
            )

    # Non-DIAGNOSE intent → the metric must not be called
    mock_metric.assert_not_called()