Commit Graph

2 Commits

Author SHA1 Message Date
Your Name
4111ea4f9f fix(ai): remove 188 ollama provider
All checks were successful
Code Review / ai-code-review (push) Successful in 12s
CD Pipeline / tests (push) Successful in 1m13s
CD Pipeline / build-and-deploy (push) Successful in 3m36s
CD Pipeline / post-deploy-checks (push) Successful in 1m20s
2026-05-06 14:34:48 +08:00
Your Name
595629c013 fix(inc-20260425): A1 三段 Agent timeout 拆分 + A2 DIAGNOSE 移除 Ollama
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
INC-20260425-8D17BB / 3B6C39 兩則告警 AI 信心降到 20% 根因雙修(統帥批准 A+B):

A1 — 三段 Agent step timeout 拆分(北極星 §1.2 Observable by Default):
- diagnostician_agent.py: PHASE2_STEP_TIMEOUT_SEC=20.0 共用值 → 拆三段
  · AGENT_DIAGNOSTICIAN_TIMEOUT_SEC=30.0(NIM 主吃口,最大 prompt + 多假設)
  · AGENT_SOLVER_TIMEOUT_SEC=20.0(後續 commit 接線)
  · AGENT_CRITIC_TIMEOUT_SEC=15.0(後續 commit 接線)
  · env override 支援,K8s ConfigMap 動態調整不需 rebuild
  · 保留 PHASE2_STEP_TIMEOUT_SEC alias(DEPRECATED,下 sprint 移除)
- observability/agent_step_metrics.py (58 行) — 新模組:
  · aiops_agent_step_duration_seconds Histogram
  · observe_agent_step() helper 統一三 Agent 呼叫點
  · outcome label ∈ {success, timeout, error}
  · agent label ∈ {diagnostician, solver, critic}

A2 — ai_router DIAGNOSE chain 移除 Ollama:
- ai_router.py v4.4 by Claude Sonnet 4.6
  · 新增 _diagnose_fallback_chain: NEMO → GEMINI → CLAUDE
  · Ollama 永久排除於此 chain(CPU-only 實測 238s,二次 timeout 必爆)
  · 新增 aiops_diagnose_fallback_total Prometheus metric
- 根因: NIM timeout 後 fallback 到 Ollama deepseek-r1:14b CPU 238s
  → 二次 timeout → degraded confidence=0.2

Wave8-X2 整合測試補正:
- test_ollama_failover_manager.py: TestSelectProvider 補 mock _check_gemini_quota
  原 test 期望 OFFLINE→Gemini,但 quota fail-closed 後沒 mock 會被切到 188
  繞過 quota check 後驗純路由邏輯 → 37/37 PASS

Tests: 37 passed (test_ollama_failover_manager 全部)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (Wave 8 INC-20260425) <noreply@anthropic.com>
2026-04-27 08:15:10 +08:00