fix(watchdog): ADR-092 B4: three-layer fix to eliminate duplicate META SYSTEM alerts + harden Ollama routing
All checks were successful
Code Review / ai-code-review (push) Successful in 7m16s
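Root causes (from a full debugger investigation):
1. Prod is still running old code (the fixes made after ec013f66 were never deployed, so alert strings still use the old format)
2. With replicas=2 the grace period is not shared between Pods → violation_codes diverge → different SHA256 → dedup fails
3. A new Pod runs _check_once() immediately on startup → an extra wave of alerts during rollout
4. W6 violation_codes include the dynamic low_count → small changes in the count bypass dedup
Fixes (A2/A3/W6/C1/C2):
- A2: add a 90s leading sleep to run_ai_slo_watchdog_loop so a rollout does not trigger a check immediately
- A3: _grace_active() now uses a Redis cluster-shared key (watchdog:cluster_grace, ex=1800s, nx=True),
  removing grace-period inconsistency between Pods; on Redis failure it falls back to a process-local monotonic clock
- W6: remove the dynamic low_count from violation_codes and use the stable code "W6:trust_drift" instead
- C1: ollama_auto_recovery.py: recovered_host now uses a dynamic label (GCP-A/B/Local, determined from the URL port)
- C2: ConfigMap OLLAMA_FALLBACK_URL now goes through the nginx proxy at 110:11437, unifying the three-tier disaster-recovery architecture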
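Root causes 2 and 4 both come down to the dedup key being a hash of the violation codes. The following is a minimal sketch of that effect, not the watchdog's actual code: `dedup_key`, the `low_count=N` code format, and the `W2:example` code are hypothetical names chosen for illustration.

```python
import hashlib


def dedup_key(violation_codes):
    """Hash a sorted, joined list of violation codes into a dedup key (illustrative helper)."""
    joined = "|".join(sorted(violation_codes))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Dynamic code (assumed format): the count is part of the string,
# so a small change in the count produces a different key.
key_low_3 = dedup_key(["W6:trust_drift:low_count=3"])
key_low_4 = dedup_key(["W6:trust_drift:low_count=4"])

# Stable code: the same condition always hashes to the same key.
key_stable_a = dedup_key(["W6:trust_drift"])
key_stable_b = dedup_key(["W6:trust_drift"])

# Pod divergence: if one Pod suppresses a code during its own grace period
# and the other does not, the two code lists (and keys) differ.
key_pod_1 = dedup_key(["W6:trust_drift"])
key_pod_2 = dedup_key(["W6:trust_drift", "W2:example"])

print(key_low_3 == key_low_4)        # dynamic codes never match across counts
print(key_stable_a == key_stable_b)  # stable codes always match
print(key_pod_1 == key_pod_2)        # diverging Pods produce different keys
```

Any value that changes between checks, or between Pods, has to stay out of the hashed input for dedup to hold.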
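The A3 item can be sketched roughly as follows. This is an assumption-laden illustration, not the commit's `_grace_active()`: the `ClusterGrace` class and its method names are hypothetical, and the client is any object exposing redis-py-style `set(name, value, ex=..., nx=...)` and `exists(name)` methods. Only the key name, the 1800s expiry, `nx=True`, and the monotonic fallback come from the message above.

```python
import time

GRACE_KEY = "watchdog:cluster_grace"
GRACE_SECONDS = 1800


class ClusterGrace:
    """Grace window shared through one Redis key, with a process-local fallback."""

    def __init__(self, redis_client, clock=time.monotonic):
        self._redis = redis_client
        self._clock = clock
        # Local deadline, used only when Redis cannot be reached.
        self._local_deadline = clock() + GRACE_SECONDS
        try:
            # nx=True: only the first Pod creates the key; later Pods leave
            # the existing window (and its remaining TTL) untouched.
            self._redis.set(GRACE_KEY, "1", ex=GRACE_SECONDS, nx=True)
        except Exception:
            # Redis is unavailable; active() will use the local deadline.
            pass

    def active(self):
        try:
            # The key disappears when its TTL expires, ending grace for all Pods.
            return bool(self._redis.exists(GRACE_KEY))
        except Exception:
            return self._clock() < self._local_deadline
```

Setting the key once at startup, rather than on every check, keeps the window from being re-armed after it expires. Whether the real implementation does the same is not stated in the commit message.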
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -423,12 +423,21 @@ class OllamaAutoRecoveryService:
         if alerter is None:
             from src.services.failover_alerter import get_failover_alerter
             alerter = get_failover_alerter()
-        # 2026-05-03 ogt: ADR-110, recovered_host is dynamic: show the actual URL of the recovered host
+        # 2026-05-05 ogt C1 fix: determine the provider label dynamically instead of hardcoding "GCP-A"
+        # Decide by OLLAMA_URL port: 11435=GCP-A, 11436=GCP-B, 11437/111=Local
         _recovered_url = getattr(self._settings, "OLLAMA_URL", "Ollama Primary")
+        if "11435" in _recovered_url:
+            _provider_label = "GCP-A"
+        elif "11436" in _recovered_url:
+            _provider_label = "GCP-B"
+        elif "11437" in _recovered_url or "192.168.0.111" in _recovered_url:
+            _provider_label = "Local"
+        else:
+            _provider_label = "Ollama"
         await alerter.alert_recovery({
             "from": from_provider,
             "to": "ollama",
-            "recovered_host": f"GCP-A {_recovered_url}",
+            "recovered_host": f"{_provider_label} {_recovered_url}",
             "stable_count": stable_count,
             "recovery_time": recovery_time.isoformat(),
         })
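The hunk above picks the label by substring match on the URL. A possible variant, sketched here as an assumption rather than as part of this commit, parses the URL and compares the port exactly, so that digits elsewhere in the URL cannot trigger a match. `provider_label`, `PORT_LABELS`, and `LOCAL_HOSTS` are hypothetical names; the port-to-label mapping and the 192.168.0.111 host are taken from the diff.

```python
from urllib.parse import urlsplit

# Mapping taken from the comment in the diff: 11435=GCP-A, 11436=GCP-B, 11437=Local.
PORT_LABELS = {11435: "GCP-A", 11436: "GCP-B", 11437: "Local"}
LOCAL_HOSTS = {"192.168.0.111"}


def provider_label(url):
    """Return a provider label for an Ollama URL, or "Ollama" if nothing matches."""
    try:
        parts = urlsplit(url)
        host, port = parts.hostname, parts.port
    except ValueError:
        # Malformed URL or out-of-range port: fall back to the generic label.
        return "Ollama"
    if port in PORT_LABELS:
        return PORT_LABELS[port]
    if host in LOCAL_HOSTS:
        return "Local"
    return "Ollama"
```

With this variant, a URL such as `http://example.internal:8080/v11435` maps to "Ollama", whereas the substring check would label it "GCP-A". The non-URL default "Ollama Primary" also falls through to "Ollama", matching the `else` branch in the diff.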