fix(watchdog): ADR-092 B4: three-layer fix for duplicate META SYSTEM alerts + Ollama routing hardening

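Root causes (full investigation by debugger):
1. Prod is still running old code (the fixes made after ec013f66 were never deployed, so alert strings still use the old format)
2. With replicas=2, the grace period is not shared between Pods → violation_codes diverge → different SHA256 → dedup fails
3. A new Pod runs _check_once() immediately on startup → an extra wave of alerts during rollout
4. W6 violation_codes include the dynamic low_count → small changes in the count bypass dedup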
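
Causes 2 and 4 come down to the same effect: the dedup key is a SHA256 derived from violation_codes, so any variation in the codes produces a new key. A minimal sketch of that effect, assuming the key is a hash over the sorted, joined codes. The helper dedup_key and the code format "W6:trust_drift:low_count=3" are hypothetical and not taken from the watchdog source:

```python
import hashlib


def dedup_key(violation_codes):
    """Hypothetical dedup key: SHA256 over the sorted, '|'-joined codes."""
    joined = "|".join(sorted(violation_codes))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Hypothetical dynamic code: the count is part of the string, so a change
# from 3 to 4 yields a different digest and the alert is not deduplicated.
key_count_3 = dedup_key(["W6:trust_drift:low_count=3"])
key_count_4 = dedup_key(["W6:trust_drift:low_count=4"])
print(key_count_3 == key_count_4)  # False

# Stable code: the digest no longer depends on the count.
print(dedup_key(["W6:trust_drift"]) == dedup_key(["W6:trust_drift"]))  # True
```

The same reasoning applies across Pods: if one Pod is inside its grace period and the other is not, their code lists differ and so do their keys.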

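Fixes (A2/A3/W6/C1/C2):
- A2: add a 90s leading sleep to run_ai_slo_watchdog_loop so a rollout does not trigger a check immediately
- A3: _grace_active() now uses a Redis cluster-shared key (watchdog:cluster_grace, ex=1800s, nx=True),
     removing grace-period inconsistency between Pods; on Redis failure it falls back to a process-local monotonic clock
- W6: remove the dynamic low_count from violation_codes and use the stable code "W6:trust_drift"
- C1: ollama_auto_recovery.py recovered_host now uses a dynamic label (GCP-A/B/Local, determined from the URL port)
- C2: ConfigMap OLLAMA_FALLBACK_URL now goes through the 110:11437 nginx proxy, unifying the three-tier disaster-recovery architecture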
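
The A3 change can be pictured as follows. This is only a sketch under stated assumptions, since the real _grace_active() is not part of the diff shown here: GraceWindow, InMemoryClient, open() and active() are hypothetical names, and the client is assumed to expose set(name, value, ex=..., nx=...) and exists(name) the way redis-py does. Only the key name and the 1800s TTL come from the commit message.

```python
import time


class InMemoryClient:
    """Stand-in for a Redis client, for this sketch only (no TTL expiry)."""

    def __init__(self):
        self._data = {}

    def set(self, name, value, ex=None, nx=False):
        if nx and name in self._data:
            return None  # mirrors redis-py: None when NX blocks the write
        self._data[name] = value
        return True

    def exists(self, name):
        return 1 if name in self._data else 0


class GraceWindow:
    """Hypothetical cluster-shared grace window with a process-local fallback."""

    def __init__(self, client, key="watchdog:cluster_grace", ttl_s=1800,
                 clock=time.monotonic):
        self._client = client
        self._key = key
        self._ttl_s = ttl_s
        self._clock = clock
        self._local_deadline = None

    def _local_active(self):
        return (self._local_deadline is not None
                and self._clock() < self._local_deadline)

    def open(self):
        """Open the window; True only for the caller that created it."""
        try:
            # SET key value EX ttl NX: only the first Pod's write succeeds.
            return bool(self._client.set(self._key, "1", ex=self._ttl_s, nx=True))
        except Exception:  # sketch only; real code should catch the client's error type
            if self._local_active():
                return False
            self._local_deadline = self._clock() + self._ttl_s
            return True

    def active(self):
        """True while the shared key exists, or the local window on failure."""
        try:
            return bool(self._client.exists(self._key))
        except Exception:  # sketch only
            return self._local_active()


shared = InMemoryClient()
pod_a, pod_b = GraceWindow(shared), GraceWindow(shared)
print(pod_a.open(), pod_b.open())      # True False
print(pod_a.active(), pod_b.active())  # True True
```

Because both Pods read the same key, they agree on whether the grace period is active, which keeps their violation_codes aligned. The local fallback uses a monotonic clock so wall-clock adjustments cannot shorten or extend the window.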

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Your Name
2026-05-05 10:31:53 +08:00
parent 3f853accf2
commit aa4ccec429
3 changed files with 57 additions and 14 deletions

@@ -423,12 +423,21 @@ class OllamaAutoRecoveryService:
 if alerter is None:
     from src.services.failover_alerter import get_failover_alerter
     alerter = get_failover_alerter()
 # 2026-05-03 ogt: ADR-110: recovered_host is dynamic and shows the actual recovered host URL
+# 2026-05-05 ogt C1 fix: determine the provider label dynamically instead of hardcoding "GCP-A"
+# Based on the OLLAMA_URL port: 11435=GCP-A, 11436=GCP-B, 11437 or host 192.168.0.111=Local
 _recovered_url = getattr(self._settings, "OLLAMA_URL", "Ollama Primary")
+if "11435" in _recovered_url:
+    _provider_label = "GCP-A"
+elif "11436" in _recovered_url:
+    _provider_label = "GCP-B"
+elif "11437" in _recovered_url or "192.168.0.111" in _recovered_url:
+    _provider_label = "Local"
+else:
+    _provider_label = "Ollama"
 await alerter.alert_recovery({
     "from": from_provider,
     "to": "ollama",
-    "recovered_host": f"GCP-A {_recovered_url}",
+    "recovered_host": f"{_provider_label} {_recovered_url}",
     "stable_count": stable_count,
     "recovery_time": recovery_time.isoformat(),
 })
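
The substring checks in this hunk match the port digits anywhere in the URL, including the path. One possible alternative, shown here only as a sketch and not as what the commit does, is to parse the URL and look at the port field directly. The helper provider_label and the constants PORT_LABELS and LOCAL_HOSTS are hypothetical; the port-to-label mapping is the one stated in the comment above.

```python
from urllib.parse import urlparse

# Port-to-label mapping taken from the comment in the diff above.
PORT_LABELS = {11435: "GCP-A", 11436: "GCP-B", 11437: "Local"}
LOCAL_HOSTS = {"192.168.0.111"}


def provider_label(url):
    """Hypothetical helper: label a URL by its parsed port, then by host."""
    try:
        # Prefix "//" when no scheme is present so urlparse fills netloc.
        parsed = urlparse(url if "://" in url else "//" + url)
        port = parsed.port  # raises ValueError for a malformed port
    except ValueError:
        return "Ollama"
    if port in PORT_LABELS:
        return PORT_LABELS[port]
    if parsed.hostname in LOCAL_HOSTS:
        return "Local"
    return "Ollama"


print(provider_label("http://ollama.example:11436"))        # GCP-B
print(provider_label("http://192.168.0.111:11434"))         # Local
print(provider_label("http://ollama.example:8080/v11435"))  # Ollama
print(provider_label("Ollama Primary"))                     # Ollama
```

The last call uses the getattr default from the hunk, which is not a URL and therefore falls through to the generic "Ollama" label, matching the else branch.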