diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 39f28913..9617971c 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -1962,6 +1962,118 @@ **Current overall progress**:
 
 - Alertmanager low-risk auto-remediation mainline: about 96%.
 - Full AI automation management productization: about 86%.
+
+---
+
+## 2026-05-19 | T75/T76 Ollama global routing order correction and direct caller convergence
+
+**Background**:
+
+- Telegram alerts appeared to be handled by the 111 Ollama instance. The commander corrected this: every Ollama-type path must follow the fixed order `GCP-A → GCP-B → 111 local → Gemini`.
+- A live check found that the production env URL order was itself correct, but the failover manager still waited for the 111 health check even when GCP-A was healthy, which risked hitting the 90s webhook timeout.
+- In addition, some direct callers only called a single `resolve_ollama_endpoint()` and never tried the full GCP-A/GCP-B/111 sequence.
+
+**Completed changes**:
+
+1. `ollama_failover_manager.select_provider()` now returns the primary directly when GCP-A is healthy. The fallback chain stays `GCP-B → 111 → Gemini`, and the call no longer waits for the 111 health check.
+2. `ollama_endpoint_resolver` adds `resolve_ollama_order()`; every workload (including `local_required` / `privacy_sensitive` / `dr`) uniformly returns `GCP-A → GCP-B → 111`.
+3. High-risk direct callers are wired to the ordered fallback first:
+   - `EmbeddingService`
+   - `KnowledgeExtractorService`
+   - `KnowledgeRAGService`
+   - `LocalCodeReviewService`
+4. The Gemini fallback for code review stays gated by the existing `LOCAL_CODE_REVIEW_ALLOW_GEMINI_FALLBACK`; no unconditional direct Gemini call was added, so cost governance is not bypassed.
+
+**Commit / Deploy**:
+
+```text
+36aeea80 fix(api): avoid local ollama health blocking gcp route
+5fa0e145 chore(cd): deploy 36aeea8 [skip ci]
+45cd55b2 fix(api): enforce global ollama endpoint order
+1b09a64e chore(cd): deploy 45cd55b [skip ci]
+```
+
+**Local verification**:
+
+```text
+python -m py_compile
+  apps/api/src/services/ollama_endpoint_resolver.py
+  apps/api/src/services/knowledge_extractor_service.py
+  apps/api/src/services/knowledge_rag_service.py
+  apps/api/src/services/local_code_review_service.py
+  apps/api/src/services/embedding_service.py
+  -> OK
+
+ruff check
+  apps/api/src/services/ollama_endpoint_resolver.py
+  apps/api/src/services/knowledge_extractor_service.py
+  apps/api/src/services/knowledge_rag_service.py
+  apps/api/src/services/local_code_review_service.py
+  apps/api/src/services/embedding_service.py
+  apps/api/tests/test_ollama_endpoint_resolver.py
+  apps/api/tests/test_local_code_review_cloud_fallback.py
+  -> OK
+
+DATABASE_URL=postgresql+asyncpg://test:test@localhost/test pytest
+  apps/api/tests/test_ollama_endpoint_resolver.py
+  apps/api/tests/test_local_code_review_cloud_fallback.py
+  -> 6 passed
+
+DATABASE_URL=postgresql+asyncpg://test:test@localhost/test pytest
+  apps/api/tests/test_ollama_failover_manager.py
+  apps/api/tests/test_ai_router_failover_integration.py
+  -> 43 passed
+```
+
+**Gitea Actions**:
+
+```text
+2423 Code Review for 36aeea80 -> success
+2422 CD for 36aeea80 -> success
+2426 Code Review for 45cd55b2 -> success
+2425 CD for 45cd55b2 -> success
+  tests -> success
+  build-and-deploy -> success
+  post-deploy-checks -> success
+```
+
+**Production verification**:
+
+```text
+K8s image:
+awoooi-web 192.168.0.110:5000/awoooi/web:45cd55b2dad45d7c60a247bfa58db4c412fab752
+awoooi-api 192.168.0.110:5000/awoooi/api:45cd55b2dad45d7c60a247bfa58db4c412fab752
+awoooi-worker 192.168.0.110:5000/awoooi/api:45cd55b2dad45d7c60a247bfa58db4c412fab752
+
+GET https://awoooi.wooo.work/api/v1/health
+  -> healthy, prod, mock_mode=false
+
+In-pod resolver smoke:
+interactive / deep_rca / embedding / rag / code_review / local_required / privacy_sensitive / dr
+  -> ollama_gcp_a:http://34.143.170.20:11434
+  -> ollama_gcp_b:http://34.21.145.224:11434
+  -> ollama_local:http://192.168.0.111:11434
+
+In-pod failover manager smoke:
+primary=ollama_gcp_a
+fallback_chain=ollama_gcp_b -> ollama_local -> gemini
+latency_ms=1.5
+health_gcp_b=null / health_local=null (not checked, so not blocking, while GCP-A is healthy)
+```
+
+**Interpretation**:
+
+- This fix confirms that every Ollama workload that goes through the resolver no longer hits 111 first.
+- The main alert route is back to `GCP-A → GCP-B → 111 → Gemini`, and when GCP-A is healthy the slow 111 health check no longer pushes the webhook past its timeout.
+- We do not yet claim that all legacy direct HTTP callers are 100% converged. The next phase keeps scanning for `resolve_ollama_endpoint` / `settings.OLLAMA_URL` / `/api/generate` and gradually moves the remaining callers (ChatManager, log summary, intent classifier, RAG debug, and so on) to the ordered fallback or the AI Router choke point.
+
+**Current overall progress**:
+
+- This round's WIP (T73-T76 alert closed loop and Ollama routing fix): about 99.8%.
+- AwoooP alert observability chain: about 95%.
+- Low-risk auto-remediation closed loop: about 95%.
+- Frontend AI automation management UI sync: about 91%.
+- Full AI automation management productization: about 87%.
 - T21 has pushed verifier coverage / freshness from the backend source-of-truth chain to the frontend; as the next step, T22 is suggested to break down the causes of the 9 non-success verifications and route degraded/failed/timeout into the work pipeline and the Ticket / PlayBook / KM remediation items.
 
 ## 2026-05-14 | T20 Governance SLO frontend status semantics wired in; low sample counts no longer masquerade as a red light
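
Reviewer note: the two behaviours this entry describes (one endpoint order for every workload, and primary-first selection that does not wait on the 111 health check) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the project's implementation. The signatures of `resolve_ollama_order()` and `select_provider()`, the `Selection` dataclass, `call_with_ordered_fallback`, and the `send` / `is_healthy` callbacks are all hypothetical. Only the endpoint labels and URLs are taken from the smoke output in the entry.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable

# Labels and URLs copied from the in-pod resolver smoke output.
ORDERED_ENDPOINTS: list[tuple[str, str]] = [
    ("ollama_gcp_a", "http://34.143.170.20:11434"),
    ("ollama_gcp_b", "http://34.21.145.224:11434"),
    ("ollama_local", "http://192.168.0.111:11434"),
]


def resolve_ollama_order(workload: str) -> list[tuple[str, str]]:
    """Return GCP-A -> GCP-B -> 111 for every workload (hypothetical signature).

    The workload argument is accepted but deliberately ignored, mirroring the
    entry's statement that all workloads share one order.
    """
    return list(ORDERED_ENDPOINTS)


def call_with_ordered_fallback(
    workload: str, send: Callable[[str], Any]
) -> tuple[str, Any]:
    """Try each endpoint in order and return (label, result) of the first success.

    `send` is a hypothetical callback that performs the request against a base URL
    and raises on failure.
    """
    errors: list[str] = []
    for name, url in resolve_ollama_order(workload):
        try:
            return name, send(url)
        except Exception as exc:  # a sketch; real code would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all Ollama endpoints failed: " + "; ".join(errors))


@dataclass
class Selection:
    primary: str
    fallback_chain: list[str]
    checked: list[str]  # providers whose health was actually probed


def select_provider(is_healthy: Callable[[str], bool]) -> Selection:
    """Probe providers lazily and stop at the first healthy one.

    When GCP-A is healthy, GCP-B and 111 are never probed, so a slow 111 health
    check cannot delay the caller. Treating Gemini as an unprobed last resort is
    an assumption of this sketch.
    """
    order = [name for name, _ in ORDERED_ENDPOINTS] + ["gemini"]
    checked: list[str] = []
    for index, name in enumerate(order[:-1]):
        checked.append(name)
        if is_healthy(name):
            return Selection(name, order[index + 1:], checked)
    return Selection("gemini", [], checked)
```

With a healthy GCP-A, `select_provider` returns after a single probe and still reports the full `GCP-B → 111 → Gemini` chain. This matches the `health_gcp_b=null / health_local=null` observation in the failover manager smoke output.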