docs(api): record ollama route order rollout [skip ci]

Your Name
2026-05-19 12:44:02 +08:00
parent 1b09a64e01
commit 8a0a3f89aa


@@ -1962,6 +1962,118 @@
**Overall progress so far**
- Alertmanager low-risk auto-remediation mainline: about 96%.
- Full AI automation management productization: about 86%.
---
## 2026-05-19 | T75/T76 Ollama global route order correction and direct caller convergence
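**Background**
- A Telegram alert appeared to have been handled by the 111 Ollama node. The Commander ruled that every Ollama-class path must follow the fixed order `GCP-A → GCP-B → 111 local → Gemini`.
- Live inspection showed that the production env URL order was already correct, but the failover manager still waited for the 111 health check even when GCP-A was healthy, which risked hitting the 90s webhook timeout.
- In addition, some direct callers only called a single `resolve_ollama_endpoint()` and never tried the full GCP-A/GCP-B/111 sequence.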
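The blocking behavior in the second bullet can be avoided by probing providers lazily, in order, and returning as soon as one is healthy. The following is a minimal sketch of that idea, not the project's actual `select_provider()`: the `Selection` dataclass, the `check_health` callback, and the provider name strings are assumptions made for illustration.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, Dict, List, Optional

# Assumed provider names; only the order itself comes from the notes above.
ORDER = ["ollama_gcp_a", "ollama_gcp_b", "ollama_local", "gemini"]


@dataclass
class Selection:
    primary: str
    fallback_chain: List[str]
    # None means "not probed", so a slow node never delays a healthy primary.
    health: Dict[str, Optional[bool]] = field(default_factory=dict)


async def select_provider(
    check_health: Callable[[str], Awaitable[bool]],
    order: List[str] = ORDER,
) -> Selection:
    """Probe providers one at a time and stop at the first healthy one."""
    health: Dict[str, Optional[bool]] = {name: None for name in order}
    for index, name in enumerate(order):
        health[name] = await check_health(name)
        if health[name]:
            # Later providers stay unprobed and form the fallback chain.
            return Selection(name, list(order[index + 1:]), health)
    raise RuntimeError("no healthy provider in the configured order")
```

With a `check_health` that reports only GCP-A as healthy, the sketch returns after a single probe and leaves the health of the remaining providers as `None`.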
**Changes completed**
1. `ollama_failover_manager.select_provider()` now returns the primary directly when GCP-A is healthy. The fallback chain stays `GCP-B → 111 → Gemini`, and the call no longer waits for the 111 health check.
2. `ollama_endpoint_resolver` adds `resolve_ollama_order()`. Every workload (including `local_required` / `privacy_sensitive` / `dr`) now returns the same order: `GCP-A → GCP-B → 111`.
3. The high-risk direct callers were wired to the ordered fallback first:
   - `EmbeddingService`
   - `KnowledgeExtractorService`
   - `KnowledgeRAGService`
   - `LocalCodeReviewService`
4. The code review Gemini fallback remains gated by the existing `LOCAL_CODE_REVIEW_ALLOW_GEMINI_FALLBACK` setting. No unconditional direct Gemini call was added, so cost governance is not bypassed.
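The ordered fallback that the direct callers adopt can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: the endpoint URLs are placeholders, the `resolve_ollama_order()` here is a stand-in whose real signature is not shown in this log, and `send`, `gemini`, and `allow_gemini` are hypothetical parameters. `allow_gemini` plays the role of a gate such as `LOCAL_CODE_REVIEW_ALLOW_GEMINI_FALLBACK`.

```python
from typing import Callable, List, Optional, Tuple

# Placeholder URLs for illustration only; not the production endpoints.
ENDPOINTS: List[Tuple[str, str]] = [
    ("ollama_gcp_a", "http://gcp-a.example:11434"),
    ("ollama_gcp_b", "http://gcp-b.example:11434"),
    ("ollama_local", "http://local.example:11434"),
]


def resolve_ollama_order(workload: str) -> List[Tuple[str, str]]:
    """Stand-in resolver: every workload gets the same global order."""
    return list(ENDPOINTS)


def call_with_ordered_fallback(
    workload: str,
    send: Callable[[str], str],
    allow_gemini: bool = False,
    gemini: Optional[Callable[[], str]] = None,
) -> Tuple[str, str]:
    """Try each Ollama endpoint in order; use Gemini only when allowed."""
    errors = []
    for name, url in resolve_ollama_order(workload):
        try:
            return name, send(url)
        except OSError as exc:  # connection refused, timeout, DNS failure
            errors.append((name, repr(exc)))
    if allow_gemini and gemini is not None:
        # The paid fallback is reached only through the explicit gate.
        return "gemini", gemini()
    raise RuntimeError(f"all Ollama endpoints failed: {errors}")
```

Keeping the Gemini call behind an explicit flag means a caller that forgets to pass it fails loudly instead of silently incurring cloud cost.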
**Commit / Deploy**
```text
36aeea80 fix(api): avoid local ollama health blocking gcp route
5fa0e145 chore(cd): deploy 36aeea8 [skip ci]
45cd55b2 fix(api): enforce global ollama endpoint order
1b09a64e chore(cd): deploy 45cd55b [skip ci]
```
**Local verification**
```text
python -m py_compile
  apps/api/src/services/ollama_endpoint_resolver.py
  apps/api/src/services/knowledge_extractor_service.py
  apps/api/src/services/knowledge_rag_service.py
  apps/api/src/services/local_code_review_service.py
  apps/api/src/services/embedding_service.py
  -> OK
ruff check
  apps/api/src/services/ollama_endpoint_resolver.py
  apps/api/src/services/knowledge_extractor_service.py
  apps/api/src/services/knowledge_rag_service.py
  apps/api/src/services/local_code_review_service.py
  apps/api/src/services/embedding_service.py
  apps/api/tests/test_ollama_endpoint_resolver.py
  apps/api/tests/test_local_code_review_cloud_fallback.py
  -> OK
DATABASE_URL=postgresql+asyncpg://test:test@localhost/test pytest
  apps/api/tests/test_ollama_endpoint_resolver.py
  apps/api/tests/test_local_code_review_cloud_fallback.py
  -> 6 passed
DATABASE_URL=postgresql+asyncpg://test:test@localhost/test pytest
  apps/api/tests/test_ollama_failover_manager.py
  apps/api/tests/test_ai_router_failover_integration.py
  -> 43 passed
```
**Gitea Actions**
```text
2423 Code Review for 36aeea80 -> success
2422 CD for 36aeea80 -> success
2426 Code Review for 45cd55b2 -> success
2425 CD for 45cd55b2 -> success
tests -> success
build-and-deploy -> success
post-deploy-checks -> success
```
**Production verification**
```text
K8s image:
  awoooi-web 192.168.0.110:5000/awoooi/web:45cd55b2dad45d7c60a247bfa58db4c412fab752
  awoooi-api 192.168.0.110:5000/awoooi/api:45cd55b2dad45d7c60a247bfa58db4c412fab752
  awoooi-worker 192.168.0.110:5000/awoooi/api:45cd55b2dad45d7c60a247bfa58db4c412fab752
GET https://awoooi.wooo.work/api/v1/health
  -> healthy, prod, mock_mode=false
Resolver smoke test inside the pod:
  interactive / deep_rca / embedding / rag / code_review / local_required / privacy_sensitive / dr
  -> ollama_gcp_a:http://34.143.170.20:11434
  -> ollama_gcp_b:http://34.21.145.224:11434
  -> ollama_local:http://192.168.0.111:11434
Failover manager smoke test inside the pod:
  primary=ollama_gcp_a
  fallback_chain=ollama_gcp_b -> ollama_local -> gemini
  latency_ms=1.5
  health_gcp_b=null / health_local=null (no blocking checks while GCP-A is healthy)
```
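A resolver smoke check like the one above could be automated along these lines. This is a sketch only: it assumes the resolver returns `name:url` strings in the same shape as the smoke output, and `find_order_violations` and its `resolve` parameter are hypothetical names.

```python
from typing import Callable, Dict, List

# Workload names and expected order are taken from the smoke output above.
WORKLOADS = [
    "interactive", "deep_rca", "embedding", "rag",
    "code_review", "local_required", "privacy_sensitive", "dr",
]
EXPECTED = ["ollama_gcp_a", "ollama_gcp_b", "ollama_local"]


def find_order_violations(
    resolve: Callable[[str], List[str]],
    workloads: List[str] = WORKLOADS,
    expected: List[str] = EXPECTED,
) -> Dict[str, List[str]]:
    """Return {workload: actual_order} for every workload that deviates."""
    violations: Dict[str, List[str]] = {}
    for workload in workloads:
        # Each entry is assumed to look like "ollama_gcp_a:http://host:11434".
        names = [entry.split(":", 1)[0] for entry in resolve(workload)]
        if names != expected:
            violations[workload] = names
    return violations
```

An empty result means every listed workload resolved to the expected order; any non-empty result names the workloads that still deviate.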
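**Assessment**
- This fix confirms that all Ollama workloads that go through the resolver no longer hit 111 first.
- The main alert route is back to `GCP-A → GCP-B → 111 → Gemini`, and when GCP-A is healthy the slow 111 health check can no longer push the webhook past its timeout.
- We do not yet claim that every historical direct HTTP caller has been fully converged. The next stage continues to scan for `resolve_ollama_endpoint` / `settings.OLLAMA_URL` / `/api/generate`, and gradually moves the remaining callers (ChatManager, log summary, intent classifier, RAG debug, and others) to the ordered fallback or to the AI Router choke point.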
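The scan for remaining direct callers can be sketched as a small script. The three search strings come from the note above; the function names and the choice to scan only `*.py` files are assumptions for illustration.

```python
from pathlib import Path
from typing import List, Tuple

# Markers of a direct Ollama call that may bypass the ordered fallback.
NEEDLES = ("resolve_ollama_endpoint", "settings.OLLAMA_URL", "/api/generate")


def find_hits(source: str) -> List[Tuple[int, str]]:
    """Return (line_number, needle) pairs for one file's text."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for needle in NEEDLES:
            if needle in line:
                hits.append((lineno, needle))
    return hits


def scan_tree(root: str) -> List[Tuple[str, int, str]]:
    """Walk a source tree and collect hits from every Python file."""
    results = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, needle in find_hits(text):
            results.append((str(path), lineno, needle))
    return results
```

A plain substring match will also flag the resolver's own definition of `resolve_ollama_endpoint` and any tests that mention it, so the output is a review list rather than a list of confirmed violations.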
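**Overall progress so far**
- This round's WIP (T73-T76 alert loop closure and Ollama route fixes): about 99.8%.
- AwoooP alert observability chain: about 95%.
- Low-risk auto-remediation closed loop: about 95%.
- Front-end AI automation management UI sync: about 91%.
- Full AI automation management productization: about 87%.
- T21 pushed verifier coverage / freshness from the back-end truth chain to the front end. The suggested next step, T22, is to break down the causes of the 9 non-success verifications and route degraded/failed/timeout cases to the work chain and to Ticket / PlayBook / KM remediation items.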
## 2026-05-14 | T20 Governance SLO front-end status semantics wired in; low sample counts no longer masquerade as a red light