feat(p1): Ollama 多層容災系統 — P1.1 健康檢測 + P1.2 ai_router 整合 + P1.5 容災告警
ADR-092 P1 飛輪閉環的 Ollama 失敗轉移子系統,全部 Engineer-A2/C/C2 補上。 新服務 (1581 行): - ollama_health_monitor.py (356):3 層健康檢測(TCP/HTTP/推理) - ollama_failover_manager.py (571):111→188 自動切換 + Redis 持久化 + recovery callback - ollama_auto_recovery.py (436):30s 背景監控 + 連續 3 次 HEALTHY → 切回 + clear_cache - failover_alerter.py (218):P1.5 Telegram 容災告警 服務整合: - ai_router.py: AIProviderEnum.OLLAMA_188 + 120s budget + failover fallback chain - main.py lifespan: 啟動時 wire callback + start recovery,關閉時優雅 stop - config.py: OLLAMA_FALLBACK_URL / OLLAMA_HEALTH_CHECK_MODEL / GEMINI_DAILY_QUOTA(帳單熔斷) K8s 配置: - 04-configmap.yaml.patch-188-fallback:注入 OLLAMA_FALLBACK_URL=http://192.168.0.188:11434 測試 (2082 行): - test_ollama_health_monitor.py (402) - test_ollama_failover_manager.py (707) - test_ollama_auto_recovery.py (580) - test_ai_router_failover_integration.py (257) - test_lifespan_failover_wiring.py (136) 依賴鏈:service 三件套 + ai_router + main.py 一起 commit,缺一就 ImportError。 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -546,9 +546,39 @@ async def lifespan(_app: FastAPI) -> AsyncGenerator[None, None]:
|
||||
except Exception as e:
|
||||
logger.warning("ai_slo_watchdog_schedule_failed", error=str(e))
|
||||
|
||||
# 2026-04-25 P1.2 by Claude Engineer-A2 — failover 整合到 ai_router + lifespan
|
||||
# OllamaFailoverManager + OllamaAutoRecoveryService 飛輪接線:
|
||||
# failover 切換時 → recovery_callback → set_current_primary → Redis 持久化
|
||||
# recovery service 每 30s 檢查 → 111 連續 3 次 HEALTHY → 自動切回 → clear_cache
|
||||
# 順序:先取 singleton → wire callback → 啟動 recovery service(才能接收 callback)
|
||||
try:
|
||||
from src.services.ollama_failover_manager import get_ollama_failover_manager
|
||||
from src.services.ollama_auto_recovery import get_ollama_auto_recovery_service
|
||||
|
||||
_failover_mgr = get_ollama_failover_manager()
|
||||
_recovery_svc = get_ollama_auto_recovery_service()
|
||||
|
||||
# wire callback:failover 切換時通知 recovery service 更新 current_primary
|
||||
_failover_mgr.set_recovery_callback(_recovery_svc.set_current_primary)
|
||||
|
||||
# 啟動 recovery service(從 Redis bootstrap current_primary,並啟動背景監控)
|
||||
await _recovery_svc.start()
|
||||
logger.info("ollama_failover_system_started")
|
||||
except Exception as e:
|
||||
logger.warning("ollama_failover_system_start_failed", error=str(e))
|
||||
|
||||
yield
|
||||
|
||||
# Shutdown
|
||||
# 2026-04-25 P1.2 by Claude Engineer-A2 — 優雅關閉 Ollama failover 背景監控
|
||||
# 必須在 Redis pool 關閉之前停止(recovery service 可能仍在寫 Redis)
|
||||
try:
|
||||
from src.services.ollama_auto_recovery import get_ollama_auto_recovery_service
|
||||
await get_ollama_auto_recovery_service().stop()
|
||||
logger.info("ollama_failover_system_stopped")
|
||||
except Exception as e:
|
||||
logger.warning("ollama_failover_system_stop_failed", error=str(e))
|
||||
|
||||
# Phase 6.1: 關閉 Signal Worker (先關閉 Consumer)
|
||||
await close_signal_worker()
|
||||
await publisher.stop()
|
||||
|
||||
Reference in New Issue
Block a user