Your Name
|
36aeea80a3
|
fix(api): avoid local ollama health blocking gcp route
Code Review / ai-code-review (push) Successful in 11s
CD Pipeline / tests (push) Successful in 1m27s
CD Pipeline / build-and-deploy (push) Successful in 4m22s
CD Pipeline / post-deploy-checks (push) Successful in 2m0s
|
2026-05-19 12:22:46 +08:00 |
|
Your Name
|
4111ea4f9f
|
fix(ai): remove 188 ollama provider
Code Review / ai-code-review (push) Successful in 12s
CD Pipeline / tests (push) Successful in 1m13s
CD Pipeline / build-and-deploy (push) Successful in 3m36s
CD Pipeline / post-deploy-checks (push) Successful in 1m20s
|
2026-05-06 14:34:48 +08:00 |
|
Your Name
|
506744ba3a
|
test(ollama): keep slow gcp primary on ollama
CD Pipeline / tests (push) Failing after 2m21s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 26s
|
2026-05-05 13:29:27 +08:00 |
|
Your Name
|
b1ef05fa8c
|
feat(ollama): ADR-110 GCP 三層容災架構(GCP-A → GCP-B → Local → Gemini)
Code Review / ai-code-review (push) Successful in 50s
CD Pipeline / tests (push) Failing after 1m14s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
## 變更摘要
- Primary: http://34.143.170.20:11434 (GCP-A SSD, 9x 載速 + 2x 推理)
- Secondary: http://34.21.145.224:11434 (GCP-B SSD)
- Fallback: http://192.168.0.111:11434 (M1 Pro Local HDD,最後防線)
- 廢止 ADR-105「111 唯一鐵律」,新建 ADR-110
## 核心改動
- config.py: 新增 OLLAMA_SECONDARY_URL;validator 加 GCP IP 白名單(34.143.170.20, 34.21.145.224)
- ollama_failover_manager.py: 三層 Ollama 決策矩陣;並行健康檢查三台;health_111 → health_gcp_a
- ollama_health_monitor.py: host label 萃取改為通用版(支援 GCP 公網 IP)
- failover_alerter.py: 故障/恢復主機動態顯示,不再硬編碼「Ollama 111 (GPU)」
- ollama_auto_recovery.py: notify_recovery 改為 ollama_gcp_a;recovered_host 動態
- k8s/awoooi-prod: configmap + deployment + network-policy 同步更新(egress 加 GCP /32)
- 服務層: 10 個服務檔案硬編碼 192.168.0.111 改為讀 settings.OLLAMA_URL
- 測試: URL 常數更新,新增三層容災場景,GCP IP 白名單驗證測試
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-05-03 22:49:23 +08:00 |
|
Your Name
|
b432becd4e
|
fix(failover): 188 完全移出 routing chain,備援只用 Gemini
CD Pipeline / build-and-deploy (push) Has been cancelled
統帥鐵律 2026-04-26:
- 唯一 Ollama = 111(M1 Pro Metal 加速)
- 188 CPU-only (0.45 tok/s) 禁止即時回應,移出所有 fallback chain
- 111 HEALTHY → fallback=[Gemini]
- 111 非HEALTHY → primary=Gemini, fallback=[Nemotron, Claude]
- Gemini quota exceeded → Nemotron → Claude(不落 188)
- OllamaRoutingResult 移除 health_188 欄位
- select_provider 只 check 111(不再 asyncio.gather 兩節點)
- 測試全部對齊新規則(1451 passed)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-27 15:47:41 +08:00 |
|
Your Name
|
595629c013
|
fix(inc-20260425): A1 三段 Agent timeout 拆分 + A2 DIAGNOSE 移除 Ollama
CD Pipeline / build-and-deploy (push) Has been cancelled
INC-20260425-8D17BB / 3B6C39 兩則告警 AI 信心降到 20% 根因雙修(統帥批准 A+B):
A1 — 三段 Agent step timeout 拆分(北極星 §1.2 Observable by Default):
- diagnostician_agent.py: PHASE2_STEP_TIMEOUT_SEC=20.0 共用值 → 拆三段
· AGENT_DIAGNOSTICIAN_TIMEOUT_SEC=30.0(NIM 主吃口,最大 prompt + 多假設)
· AGENT_SOLVER_TIMEOUT_SEC=20.0(後續 commit 接線)
· AGENT_CRITIC_TIMEOUT_SEC=15.0(後續 commit 接線)
· env override 支援,K8s ConfigMap 動態調整不需 rebuild
· 保留 PHASE2_STEP_TIMEOUT_SEC alias(DEPRECATED,下 sprint 移除)
- observability/agent_step_metrics.py (58 行) — 新模組:
· aiops_agent_step_duration_seconds Histogram
· observe_agent_step() helper 統一三 Agent 呼叫點
· outcome label ∈ {success, timeout, error}
· agent label ∈ {diagnostician, solver, critic}
A2 — ai_router DIAGNOSE chain 移除 Ollama:
- ai_router.py v4.4 by Claude Sonnet 4.6
· 新增 _diagnose_fallback_chain: NEMO → GEMINI → CLAUDE
· Ollama 永久排除於此 chain(CPU-only 實測 238s,二次 timeout 必爆)
· 新增 aiops_diagnose_fallback_total Prometheus metric
- 根因: NIM timeout 後 fallback 到 Ollama deepseek-r1:14b CPU 238s
→ 二次 timeout → degraded confidence=0.2
Wave8-X2 整合測試補正:
- test_ollama_failover_manager.py: TestSelectProvider 補 mock _check_gemini_quota
原 test 期望 OFFLINE→Gemini,但 quota fail-closed 後沒 mock 會被切到 188
繞過 quota check 後驗純路由邏輯 → 37/37 PASS
Tests: 37 passed (test_ollama_failover_manager 全部)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (Wave 8 INC-20260425) <noreply@anthropic.com>
|
2026-04-27 08:15:10 +08:00 |
|
Your Name
|
bddf99a002
|
fix(test): test_ollama_failover_manager pipeline mock 對齊 atomic 修復
CD Pipeline / build-and-deploy (push) Has been cancelled
Wave5 B3-fix(commit 02362edd)改 _check_gemini_quota 用 redis.pipeline()
原測試 mock redis.incr.assert_awaited_once 失敗,因 incr 改在 pipeline 內。
修法(Engineer-A4 已同步寫好):
- mock_pipe.set / incr 返回 mock_pipe(chain)
- mock_pipe.execute 返回 [True, count] list
- assertion 改 mock_pipe.execute.assert_awaited_once
Tests: 37/37 PASSED
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Engineer-A4 <noreply@anthropic.com>
|
2026-04-26 20:52:11 +08:00 |
|
Your Name
|
55c6b4e2d9
|
feat(p1): Ollama 多層容災系統 — P1.1 健康檢測 + P1.2 ai_router 整合 + P1.5 容災告警
ADR-092 P1 飛輪閉環的 Ollama 失敗轉移子系統,全部 Engineer-A2/C/C2 補上。
新服務 (1581 行):
- ollama_health_monitor.py (356):3 層健康檢測(TCP/HTTP/推理)
- ollama_failover_manager.py (571):111→188 自動切換 + Redis 持久化 + recovery callback
- ollama_auto_recovery.py (436):30s 背景監控 + 連續 3 次 HEALTHY → 切回 + clear_cache
- failover_alerter.py (218):P1.5 Telegram 容災告警
服務整合:
- ai_router.py: AIProviderEnum.OLLAMA_188 + 120s budget + failover fallback chain
- main.py lifespan: 啟動時 wire callback + start recovery,關閉時優雅 stop
- config.py: OLLAMA_FALLBACK_URL / OLLAMA_HEALTH_CHECK_MODEL / GEMINI_DAILY_QUOTA(帳單熔斷)
K8s 配置:
- 04-configmap.yaml.patch-188-fallback:注入 OLLAMA_FALLBACK_URL=http://192.168.0.188:11434
測試 (2082 行):
- test_ollama_health_monitor.py (402)
- test_ollama_failover_manager.py (707)
- test_ollama_auto_recovery.py (580)
- test_ai_router_failover_integration.py (257)
- test_lifespan_failover_wiring.py (136)
依賴鏈:service 三件套 + ai_router + main.py 一起 commit,缺一就 ImportError。
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-26 20:18:33 +08:00 |
|