fix(ollama): ADR-110 correction: promote 111 to primary, use dynamic URL labels in failover logs

Root cause: K8s pods → GCP-A/B:11434 = connection refused (no external route),
but the ConfigMap set GCP-A as OLLAMA_URL (primary), so the failover chain reached 111 only as its last Ollama tier.

ConfigMap (04-configmap.yaml):
- OLLAMA_URL: GCP-A → 192.168.0.111 (primary, reachable on the K8s internal network)
- OLLAMA_SECONDARY_URL: GCP-B → 34.143.170.20 (GCP-A, kept until it is restored behind the nginx proxy)
- OLLAMA_FALLBACK_URL: 111 → 34.21.145.224 (GCP-B, kept until it is restored behind the nginx proxy)
- Long-term goal: set up an nginx proxy on 110 that forwards to GCP, then point the ConfigMap at 110:11435/11436
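
The tier order above can be sketched as a small helper that reads the three keys in failover order. This is an illustrative sketch only: `ordered_tiers` and `TIER_KEYS` are hypothetical names that do not appear in this commit, and the mapping stands in for however the application actually loads its settings.

```python
from typing import Mapping

# Failover order taken from the ConfigMap keys in this commit.
# TIER_KEYS and ordered_tiers are hypothetical, for illustration only.
TIER_KEYS = (
    ("primary", "OLLAMA_URL"),
    ("secondary", "OLLAMA_SECONDARY_URL"),
    ("tertiary", "OLLAMA_FALLBACK_URL"),
)


def ordered_tiers(env: Mapping[str, str]) -> list[tuple[str, str]]:
    """Return (tier, url) pairs in failover order, skipping unset tiers."""
    tiers = []
    for tier, key in TIER_KEYS:
        url = env.get(key, "").strip()
        if url:
            tiers.append((tier, url))
    return tiers


# Values as set by this commit.
config = {
    "OLLAMA_URL": "http://192.168.0.111:11434",
    "OLLAMA_SECONDARY_URL": "http://34.143.170.20:11434",
    "OLLAMA_FALLBACK_URL": "http://34.21.145.224:11434",
}

for tier, url in ordered_tiers(config):
    print(f"{tier}: {url}")
```

Keeping the order in one place means a later switch to the nginx proxy on 110 would only change the values, not the tier names used in logs.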

health.py (check_ollama):
- Now probes three tiers in order (primary → secondary → tertiary)
- primary up → "up"; a fallback up → "degraded"; all down → "down"
- No longer checks only the single OLLAMA_URL host; reflects actual route availability
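
The status mapping above can be sketched in isolation from the HTTP probing. This is a simplified illustration, not the code in health.py: `aggregate_status` is a hypothetical name, and it takes already-collected reachability flags instead of making requests.

```python
from typing import Literal, Sequence

Status = Literal["up", "degraded", "down"]


def aggregate_status(tier_up: Sequence[bool]) -> Status:
    """Map per-tier reachability to one status.

    tier_up[0] is the primary tier; later entries are fallback tiers.
    """
    if tier_up and tier_up[0]:
        return "up"  # primary reachable
    if any(tier_up):
        return "degraded"  # primary down, some fallback reachable
    return "down"  # nothing reachable


print(aggregate_status([True, False, False]))
print(aggregate_status([False, True, False]))
print(aggregate_status([False, False, False]))
```

Separating the mapping from the probing keeps the "degraded" rule easy to unit-test without network access.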

ollama_failover_manager.py (_decide_route / select_provider):
- Variables renamed to url_primary/secondary/tertiary (the old gcp_a/gcp_b/local names no longer matched the actual URLs)
- routing_reason now uses dynamic IP labels instead of hardcoded "GCP-A"/"GCP-B"/"Local"
- _write_failover_audit failed_host likewise uses the actual URLs
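
The dynamic label can be sketched with `urllib.parse.urlparse`, which the diff also uses. `short_label` mirrors the `_short` helper in the diff; `routing_reason` is a hypothetical helper added here only to show how a tier name and label combine into a log string.

```python
from urllib.parse import urlparse


def short_label(url: str) -> str:
    """Return the host part of a URL, or the raw string if no host parses."""
    return urlparse(url).hostname or url


def routing_reason(tier: str, url: str, status: str) -> str:
    """Hypothetical formatter: tier name plus the host actually configured."""
    return f"{tier}({short_label(url)}) {status}"


print(routing_reason("primary", "http://192.168.0.111:11434", "HEALTHY"))
print(routing_reason("secondary", "http://34.143.170.20:11434", "OFFLINE"))
```

Because the label is derived from the configured URL, reordering the ConfigMap entries cannot leave a stale host name in the logs.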

2026-05-04 ogt

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Your Name
2026-05-04 19:17:07 +08:00
parent 855819652e
commit 0a90dab1e9
3 changed files with 98 additions and 59 deletions


@@ -130,30 +130,49 @@ async def check_redis() -> Literal["up", "down"]:
         return "down"
-async def check_ollama() -> Literal["up", "down"]:
+async def check_ollama() -> Literal["up", "down", "degraded"]:
     """
-    Check Ollama service via /api/tags endpoint
+    Check the three-tier Ollama failover status (primary → secondary → tertiary)
     Commander's iron rule: real HTTP requests only, no fake data
+    2026-05-04 ogt: now checks three hosts (OLLAMA_URL / SECONDARY / FALLBACK)
+    Primary up → "up"; primary down but a fallback up → "degraded";
+    all down → "down". Reflects the Ollama routes actually usable from K8s.
     """
-    try:
-        async with httpx.AsyncClient(timeout=HEALTH_CHECK_TIMEOUT) as client:
-            response = await client.get(f"{settings.OLLAMA_URL}/api/tags")
-            if response.status_code == 200:
-                logger.debug("health_check_ollama", status="up")
-                return "up"
-            else:
-                logger.warning(
-                    "health_check_ollama",
-                    status="down",
-                    status_code=response.status_code,
-                )
-                return "down"
-    except httpx.TimeoutException:
-        logger.warning("health_check_ollama", status="down", reason="timeout")
-        return "down"
-    except Exception as e:
-        logger.warning("health_check_ollama", status="down", error=str(e))
+    urls = [
+        (settings.OLLAMA_URL, "primary"),
+        (getattr(settings, "OLLAMA_SECONDARY_URL", ""), "secondary"),
+        (getattr(settings, "OLLAMA_FALLBACK_URL", ""), "tertiary"),
+    ]
+    any_up = False
+    primary_up = False
+    async with httpx.AsyncClient(timeout=HEALTH_CHECK_TIMEOUT) as client:
+        for i, (url, label) in enumerate(urls):
+            if not url:
+                continue
+            try:
+                response = await client.get(f"{url}/api/tags")
+                if response.status_code == 200:
+                    any_up = True
+                    if i == 0:
+                        primary_up = True
+                    logger.debug("health_check_ollama", status="up", tier=label, url=url)
+                    break  # stop at the first reachable tier
+                else:
+                    logger.debug("health_check_ollama_tier", tier=label, status_code=response.status_code)
+            except (httpx.TimeoutException, httpx.ConnectError, httpx.NetworkError):
+                logger.debug("health_check_ollama_tier", tier=label, status="unreachable")
+            except Exception as e:
+                logger.warning("health_check_ollama_tier", tier=label, error=str(e))
+    if primary_up:
+        return "up"
+    elif any_up:
+        logger.warning("health_check_ollama", status="degraded", reason="primary down, fallback active")
+        return "degraded"
+    else:
+        logger.warning("health_check_ollama", status="down", reason="all tiers unreachable")
         return "down"


@@ -183,8 +183,12 @@ class OllamaFailoverManager:
         context: dict | None = None,
     ) -> OllamaRoutingResult:
         """
-        Three-tier Ollama failover routing (2026-05-03 commander's new order, ADR-110)
-        GCP-A → GCP-B → Local(111) → Gemini → Nemotron → Claude
+        Three-tier Ollama failover routing (ADR-110 revision, 2026-05-04)
+        Primary(OLLAMA_URL) → Secondary(OLLAMA_SECONDARY_URL) → Tertiary(OLLAMA_FALLBACK_URL)
+        → Gemini → Nemotron → Claude
+        2026-05-04 ogt: URL priority updated in the ConfigMap: primary = 111 (reachable on the K8s internal network).
+        GCP-A/B are secondary/tertiary; they return to primary once the nginx proxy is set up.
         Args:
             task_type: task type (reserved; does not currently affect routing logic)
@@ -193,16 +197,17 @@ class OllamaFailoverManager:
         Returns:
             OllamaRoutingResult
         """
-        # 2026-05-03 ogt: GCP three-tier failover (ADR-110): GCP-A → GCP-B → Local → Gemini
-        url_gcp_a = self._settings.OLLAMA_URL  # 34.143.170.20
-        url_gcp_b = self._settings.OLLAMA_SECONDARY_URL  # 34.21.145.224
-        url_local = self._settings.OLLAMA_FALLBACK_URL  # 192.168.0.111
+        # 2026-05-04 ogt: use the semantically neutral names primary/secondary/tertiary
+        # so that gcp_a/gcp_b/local cannot drift from the actual URLs and mislead the logs
+        url_primary = self._settings.OLLAMA_URL  # currently: 192.168.0.111
+        url_secondary = self._settings.OLLAMA_SECONDARY_URL  # currently: 34.143.170.20 (GCP-A)
+        url_tertiary = self._settings.OLLAMA_FALLBACK_URL  # currently: 34.21.145.224 (GCP-B)
         # Check the three Ollama hosts concurrently (asyncio.gather for efficiency)
         results_raw = await asyncio.gather(
-            self._monitor.check(url_gcp_a),
-            self._monitor.check(url_gcp_b),
-            self._monitor.check(url_local),
+            self._monitor.check(url_primary),
+            self._monitor.check(url_secondary),
+            self._monitor.check(url_tertiary),
             return_exceptions=True,
         )
@@ -211,17 +216,17 @@ class OllamaFailoverManager:
                 return HealthReport(status=HealthStatus.OFFLINE, reason=f"{label} check error: {r}")
             return r
-        health_gcp_a = _to_health(results_raw[0], "GCP-A")
-        health_gcp_b = _to_health(results_raw[1], "GCP-B")
-        health_local = _to_health(results_raw[2], "Local")
+        health_gcp_a = _to_health(results_raw[0], f"primary({url_primary})")
+        health_gcp_b = _to_health(results_raw[1], f"secondary({url_secondary})")
+        health_local = _to_health(results_raw[2], f"tertiary({url_tertiary})")
         result = self._decide_route(
             health_gcp_a=health_gcp_a,
             health_gcp_b=health_gcp_b,
             health_local=health_local,
-            url_gcp_a=url_gcp_a,
-            url_gcp_b=url_gcp_b,
-            url_local=url_local,
+            url_gcp_a=url_primary,
+            url_gcp_b=url_secondary,
+            url_local=url_tertiary,
         )
         # Gemini billing circuit breaker (quota gate)
@@ -316,36 +321,46 @@ class OllamaFailoverManager:
         now_ts = datetime.datetime.now(TAIPEI_TZ).isoformat()
-        # GCP-A healthy → primary GCP-A; Gemini is always last in the Ollama chain (same behavior as the old 111 setup)
+        # Derive the log label from the actual URL (IP or hostname)
+        def _short(url: str) -> str:
+            from urllib.parse import urlparse
+            return urlparse(url).hostname or url
+        lbl_p = _short(url_gcp_a)  # primary label
+        lbl_s = _short(url_gcp_b)  # secondary label
+        lbl_t = _short(url_local)  # tertiary label
+        # Primary HEALTHY → use primary
         if health_gcp_a.status == HealthStatus.HEALTHY:
             return OllamaRoutingResult(
                 primary=ep_gcp_a,
                 fallback_chain=[ep_gcp_b, ep_local, _GEMINI_ENDPOINT],
-                routing_reason="GCP-A HEALTHY → primary GCP-A",
+                routing_reason=f"primary({lbl_p}) HEALTHY",
                 health_gcp_a=health_gcp_a,
                 health_gcp_b=health_gcp_b,
                 health_local=health_local,
             )
-        # GCP-A unhealthy, GCP-B healthy → switch to GCP-B; Gemini at the end of the chain
+        # Primary unhealthy, secondary HEALTHY → switch to secondary
         if health_gcp_b.status == HealthStatus.HEALTHY:
             return OllamaRoutingResult(
                 primary=ep_gcp_b,
                 fallback_chain=[ep_local, _GEMINI_ENDPOINT],
-                routing_reason=f"GCP-A {health_gcp_a.status.value}, switch to GCP-B at {now_ts}",
+                routing_reason=f"primary({lbl_p}) {health_gcp_a.status.value}, secondary({lbl_s}) at {now_ts}",
                 health_gcp_a=health_gcp_a,
                 health_gcp_b=health_gcp_b,
                 health_local=health_local,
             )
-        # GCP-A and GCP-B both unhealthy, Local healthy → switch to Local(111)
+        # Primary and secondary unhealthy, tertiary HEALTHY → switch to tertiary
         if health_local.status == HealthStatus.HEALTHY:
             return OllamaRoutingResult(
                 primary=ep_local,
                 fallback_chain=[_GEMINI_ENDPOINT],
                 routing_reason=(
-                    f"GCP-A {health_gcp_a.status.value} + GCP-B {health_gcp_b.status.value}"
-                    f" → switch to Local(111) at {now_ts}"
+                    f"primary({lbl_p}) {health_gcp_a.status.value}"
+                    f" + secondary({lbl_s}) {health_gcp_b.status.value}"
+                    f" → tertiary({lbl_t}) at {now_ts}"
                 ),
                 health_gcp_a=health_gcp_a,
                 health_gcp_b=health_gcp_b,
@@ -353,14 +368,11 @@ class OllamaFailoverManager:
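             )
         # 2026-05-04 ogt: SLOW failover fallback (when the external links jitter at the same time, a SLOW Ollama still beats exhausting the Gemini quota)
-        # Original design: if all three tiers are non-HEALTHY, switch straight to Gemini
-        # Problem: 111 powered off + both GCP external links jittering → all three nodes SLOW at once → wrongly routed to Gemini → burns quota
-        # Fix: treat SLOW nodes as usable and pick the best SLOW node by priority
         if health_gcp_a.status == HealthStatus.SLOW:
             return OllamaRoutingResult(
                 primary=ep_gcp_a,
                 fallback_chain=[ep_gcp_b, ep_local, _GEMINI_ENDPOINT],
-                routing_reason=f"GCP-A SLOW (degraded but usable) → primary GCP-A at {now_ts}",
+                routing_reason=f"primary({lbl_p}) SLOW (degraded but usable) at {now_ts}",
                 health_gcp_a=health_gcp_a,
                 health_gcp_b=health_gcp_b,
                 health_local=health_local,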
@@ -369,7 +381,10 @@ class OllamaFailoverManager:
             return OllamaRoutingResult(
                 primary=ep_gcp_b,
                 fallback_chain=[ep_local, _GEMINI_ENDPOINT],
-                routing_reason=f"GCP-A {health_gcp_a.status.value} + GCP-B SLOW (degraded but usable) → switch to GCP-B at {now_ts}",
+                routing_reason=(
+                    f"primary({lbl_p}) {health_gcp_a.status.value}"
+                    f" + secondary({lbl_s}) SLOW (degraded but usable) at {now_ts}"
+                ),
                 health_gcp_a=health_gcp_a,
                 health_gcp_b=health_gcp_b,
                 health_local=health_local,
@@ -379,8 +394,9 @@ class OllamaFailoverManager:
                 primary=ep_local,
                 fallback_chain=[_GEMINI_ENDPOINT],
                 routing_reason=(
-                    f"GCP-A {health_gcp_a.status.value} + GCP-B {health_gcp_b.status.value}"
-                    f" + Local SLOW (degraded but usable) → switch to Local(111) at {now_ts}"
+                    f"primary({lbl_p}) {health_gcp_a.status.value}"
+                    f" + secondary({lbl_s}) {health_gcp_b.status.value}"
+                    f" + tertiary({lbl_t}) SLOW (degraded but usable) at {now_ts}"
                 ),
                 health_gcp_a=health_gcp_a,
                 health_gcp_b=health_gcp_b,
@@ -392,9 +408,9 @@ class OllamaFailoverManager:
                 primary=_GEMINI_ENDPOINT,
                 fallback_chain=[_NEMOTRON_ENDPOINT, _CLAUDE_ENDPOINT],
                 routing_reason=(
-                    f"All Ollama tiers unhealthy (GCP-A {health_gcp_a.status.value}, "
-                    f"GCP-B {health_gcp_b.status.value}, "
-                    f"Local {health_local.status.value}) → switch to Gemini at {now_ts}"
+                    f"All Ollama tiers unhealthy (primary({lbl_p}) {health_gcp_a.status.value}, "
+                    f"secondary({lbl_s}) {health_gcp_b.status.value}, "
+                    f"tertiary({lbl_t}) {health_local.status.value}) → switch to Gemini at {now_ts}"
                 ),
                 health_gcp_a=health_gcp_a,
                 health_gcp_b=health_gcp_b,
@@ -606,14 +622,14 @@ class OllamaFailoverManager:
         fallback_chain_str = "".join(
             p.provider_name for p in result.fallback_chain
         )
-        # Build the failed-host description (which Ollama tiers are unhealthy)
+        # Build the failed-host description (which Ollama tiers are unhealthy; uses actual URLs, not hardcoded labels)
         _failed = []
         if result.health_gcp_a.status != HealthStatus.HEALTHY:
-            _failed.append(f"GCP-A {self._settings.OLLAMA_URL}")
+            _failed.append(self._settings.OLLAMA_URL)
         if result.health_gcp_b and result.health_gcp_b.status != HealthStatus.HEALTHY:
-            _failed.append(f"GCP-B {self._settings.OLLAMA_SECONDARY_URL}")
+            _failed.append(self._settings.OLLAMA_SECONDARY_URL or "secondary")
         if result.health_local and result.health_local.status != HealthStatus.HEALTHY:
-            _failed.append(f"Local {self._settings.OLLAMA_FALLBACK_URL}")
+            _failed.append(self._settings.OLLAMA_FALLBACK_URL or "tertiary")
         failed_host = " + ".join(_failed) if _failed else "Ollama"
         alerter = get_failover_alerter()
         await alerter.alert_failover({


@@ -18,9 +18,13 @@ data:
   # 2026-04-16 ogt + Claude Sonnet 4.6: now points to 111 (GPU machine, RTX)
   # 188 = CPU-only Ollama (inference very slow, >60s); 111 has a GPU (avg 10s)
   # 2026-05-03 ogt: ADR-110 Ollama GCP three-tier failover (GCP-A → GCP-B → Local HDD)
-  OLLAMA_URL: "http://34.143.170.20:11434"
-  OLLAMA_SECONDARY_URL: "http://34.21.145.224:11434"
-  OLLAMA_FALLBACK_URL: "http://192.168.0.111:11434"
+  # 2026-05-04 ogt: ADR-110 correction: K8s pods → GCP-A/B:11434 = connection refused (no external route)
+  # Reachable from K8s: 111 (internal network); unreachable: GCP-A/B (external port 11434 blocked)
+  # Fix: promote 111 to primary; keep GCP-A/B as secondary/tertiary until the nginx proxy makes them usable again
+  # Long-term goal: set up an nginx proxy on 110 forwarding to GCP-A/B, then point the ConfigMap at 110:11435 / 110:11436
+  OLLAMA_URL: "http://192.168.0.111:11434"
+  OLLAMA_SECONDARY_URL: "http://34.143.170.20:11434"
+  OLLAMA_FALLBACK_URL: "http://34.21.145.224:11434"
   OPENCLAW_URL: "http://192.168.0.188:8088"
   KALI_SCANNER_URL: "http://192.168.0.112:8080"
   SIGNOZ_URL: "http://192.168.0.188:3301"
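
With this tier order in place, the selection order visible in the `_decide_route` hunks (every HEALTHY tier by priority first, then every SLOW tier by priority, then Gemini) can be sketched as a two-pass scan. This is a simplified illustration under assumptions: the `HealthStatus` enum below is a stand-in whose string values are invented, and `pick_tier` is a hypothetical function that returns only an index instead of building an `OllamaRoutingResult`.

```python
from enum import Enum
from typing import Optional, Sequence


class HealthStatus(Enum):
    # Stand-in for the project's enum; the string values are assumptions.
    HEALTHY = "healthy"
    SLOW = "slow"
    OFFLINE = "offline"


def pick_tier(statuses: Sequence[HealthStatus]) -> Optional[int]:
    """Return the index of the Ollama tier to route to.

    Pass 1 takes the highest-priority HEALTHY tier; pass 2 takes the
    highest-priority SLOW tier. None means no Ollama tier is usable and
    the caller should fall through to Gemini.
    """
    for wanted in (HealthStatus.HEALTHY, HealthStatus.SLOW):
        for index, status in enumerate(statuses):
            if status is wanted:
                return index
    return None


H, S, O = HealthStatus.HEALTHY, HealthStatus.SLOW, HealthStatus.OFFLINE
print(pick_tier([S, H, O]))  # a HEALTHY secondary wins over a SLOW primary
print(pick_tier([O, O, S]))  # a SLOW tertiary is still preferred over Gemini
print(pick_tier([O, O, O]))  # nothing usable
```

The two-pass shape makes the quota-protection rule explicit: Gemini is reached only when no tier is HEALTHY or SLOW.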