Files
awoooi/apps
Your Name aa4ccec429
All checks were successful
Code Review / ai-code-review (push) Successful in 7m16s
fix(watchdog): ADR-092 B4 — 三層修復消除 META SYSTEM 重複告警 + Ollama 路由強化
問題根因(debugger 全景徹查):
1. Prod 仍跑舊版代碼(ec013f66 後的修法未部署 → 告警字串仍含舊格式)
2. replicas=2 時 Pod 間 grace period 不共享 → violation_codes 分歧 → 不同 SHA256 → dedup 失效
3. 新 Pod 啟動立即執行 _check_once() → rollout 時多發一波
4. W6 violation_codes 含動態 low_count → count 微變繞過 dedup

修復(A2/A3/W6/C1/C2):
- A2:run_ai_slo_watchdog_loop 加 90s leading sleep,避免 rollout 立即觸發
- A3:_grace_active() 改為 Redis cluster-shared(watchdog:cluster_grace, ex=1800s, nx=True)
     消除 Pod 間 grace period 不一致;Redis 故障時 fallback 為 process-local monotonic
- W6:violation_codes 移除動態 low_count,改為穩定 "W6:trust_drift"
- C1:ollama_auto_recovery.py recovered_host 改動態 label(依 URL port 判斷 GCP-A/B/Local)
- C2:ConfigMap OLLAMA_FALLBACK_URL 改走 110:11437 nginx proxy,三層容災統一架構

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 10:31:53 +08:00
..