Your Name
|
4111ea4f9f
|
fix(ai): remove 188 ollama provider
Code Review / ai-code-review (push) Successful in 12s
CD Pipeline / tests (push) Successful in 1m13s
CD Pipeline / build-and-deploy (push) Successful in 3m36s
CD Pipeline / post-deploy-checks (push) Successful in 1m20s
|
2026-05-06 14:34:48 +08:00 |
|
Your Name
|
b1ef05fa8c
|
feat(ollama): ADR-110 GCP three-tier failover architecture (GCP-A → GCP-B → Local → Gemini)
Code Review / ai-code-review (push) Successful in 50s
CD Pipeline / tests (push) Failing after 1m14s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
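## Change summary
- Primary: http://34.143.170.20:11434 (GCP-A SSD, 9x load speed + 2x inference speed)
- Secondary: http://34.21.145.224:11434 (GCP-B SSD)
- Fallback: http://192.168.0.111:11434 (M1 Pro local HDD, last line of defense)
- Retire the ADR-105 "111 is the only host" iron rule; add ADR-110
## Core changes
- config.py: add OLLAMA_SECONDARY_URL; the validator gains a GCP IP allowlist (34.143.170.20, 34.21.145.224)
- ollama_failover_manager.py: three-tier Ollama decision matrix; health-check all three hosts in parallel; health_111 → health_gcp_a
- ollama_health_monitor.py: host-label extraction generalized (supports GCP public IPs)
- failover_alerter.py: the failed/recovered host is shown dynamically instead of the hard-coded "Ollama 111 (GPU)"
- ollama_auto_recovery.py: notify_recovery now uses ollama_gcp_a; recovered_host is dynamic
- k8s/awoooi-prod: configmap + deployment + network-policy updated in sync (egress adds the GCP /32 addresses)
- Service layer: 10 service files with a hard-coded 192.168.0.111 now read settings.OLLAMA_URL
- Tests: URL constants updated; added three-tier failover scenarios and GCP IP allowlist validation tests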
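The three-tier decision with parallel health checks could be sketched roughly as below. This is a minimal illustration, not the code in ollama_failover_manager.py: `OLLAMA_TIERS`, `pick_ollama_tier`, and the `check` probe signature are hypothetical names, and only the URLs and their priority order come from the summary above.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Endpoints in priority order, as listed in the change summary.
OLLAMA_TIERS = [
    ("gcp_a", "http://34.143.170.20:11434"),
    ("gcp_b", "http://34.21.145.224:11434"),
    ("local", "http://192.168.0.111:11434"),
]

# Hypothetical probe signature: takes a base URL, resolves to True when healthy.
HealthCheck = Callable[[str], Awaitable[bool]]


async def pick_ollama_tier(check: HealthCheck) -> Optional[str]:
    """Probe all tiers concurrently and return the highest-priority healthy one.

    Returns None when every Ollama tier is down; the caller would then
    fall through to Gemini.
    """
    results = await asyncio.gather(
        *(check(url) for _, url in OLLAMA_TIERS), return_exceptions=True
    )
    for (name, _), healthy in zip(OLLAMA_TIERS, results):
        if healthy is True:  # exceptions and False both count as unhealthy
            return name
    return None
```

Probing concurrently keeps the decision latency close to the slowest single probe rather than the sum of all three, while the ordered scan afterwards preserves the GCP-A → GCP-B → Local priority.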
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-05-03 22:49:23 +08:00 |
|
Your Name
|
dccdcdbaf5
|
fix(flywheel): unblock action safety and Claude fallback
CD Pipeline / build-and-deploy (push) Successful in 9m45s
|
2026-04-29 21:51:18 +08:00 |
|
Your Name
|
fb0c72db42
|
feat(ai-router): overturn iron rule A2: DIAGNOSE primary switches to local Ollama first
CD Pipeline / build-and-deploy (push) Failing after 2m26s
Commander's iron rule 2026-04-29: "Prefer Ollama on host 111 as the primary"
+ feedback_ai_autonomous_direction.md: local free LLMs come first
+ feedback_ollama_111_only.md: the only Ollama host = 111
## Factual basis for overturning A2 (2026-04-27 INC-20260425)
**Old fact**: Ollama = CPU-only deepseek-r1:14b @ 238s (unusable)
**New fact**: prod Ollama 111 = M1 Pro Apple Silicon GPU + qwen2.5:7b-instruct
VRAM 8.2GB fully loaded, ctx 32k, measured 0.54s on a "hi" prompt
**All cloud providers down** (evidence from the 2026-04-29 prod log):
- OpenClaw 188:8088 → 500 Internal Server Error
- Gemini → 429 Too Many Requests (quota exhausted)
- Claude → 404 Not Found (model claude-3-haiku-20240307 retired)
**Without overturning A2 → 100% of incidents end in llm_failed → AI auto-remediation never starts**
## Scope of changes (minimal, safe, verifiable)
### ai_router.py
- `_diagnose_fallback_chain`: OLLAMA is first in line (replaces the old "permanently excluded" comment)
  Order: [OLLAMA, OPENCLAW_NEMO, GEMINI, CLAUDE]
- `_intent_provider_overrides[DIAGNOSE]`: OPENCLAW_NEMO → OLLAMA
- _full_fallback_chain untouched (avoids affecting RESTART/SCALE/CONFIG/DELETE)
- _tool_calling_fallback_chain untouched
- complexity_map untouched (critic M2 deferred)
### openclaw.py
- Inject task_type="diagnose" into alert_context (critic C2, the true root cause)
- Fix the timeout alignment problem at ai_providers/ollama.py:77:
  - with task_type → OLLAMA_DIAGNOSE_TIMEOUT_SECONDS=200s
  - without → OPENCLAW_TIMEOUT=30s (not enough for qwen2.5:7b inference)
  - this is the root cause of the latency_ms=120014 seen in the prod log
- Copy with dict(alert_context) so the original context is not mutated
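As a rough sketch of the openclaw.py change described above: inject the task type into a copy of the context, then let the presence of `task_type` select the timeout. The helper names `with_task_type` and `pick_timeout` are hypothetical; only the two timeout values and the `dict(alert_context)` copy come from the commit message.

```python
OLLAMA_DIAGNOSE_TIMEOUT_SECONDS = 200  # value stated in the commit message
OPENCLAW_TIMEOUT = 30                  # value stated in the commit message


def with_task_type(alert_context: dict, task_type: str = "diagnose") -> dict:
    """Return a shallow copy of the context with task_type set.

    dict(alert_context) copies the top-level mapping, so the caller's
    original context is left untouched.
    """
    ctx = dict(alert_context)
    ctx["task_type"] = task_type
    return ctx


def pick_timeout(context: dict) -> int:
    """Use the long diagnose timeout only when a task_type is present."""
    if context.get("task_type"):
        return OLLAMA_DIAGNOSE_TIMEOUT_SECONDS
    return OPENCLAW_TIMEOUT
```

Note that the copy is shallow: nested values are still shared with the original context, which is fine as long as only the top-level `task_type` key is added.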
## Regression tests updated in sync (6)
The A2 iron-rule guard tests all reflect the new rule:
- test_p0_diagnose_routing.py::test_diagnose_override_is_ollama
(formerly test_diagnose_override_is_openclaw_nemo)
- test_ai_router_diagnose_fallback.py::test_diagnose_fallback_chain_ollama_primary
(formerly test_diagnose_fallback_chain_no_ollama)
- test_ai_router_diagnose_fallback.py::test_diagnose_route_primary_is_ollama
(formerly test_diagnose_route_fallback_chain_excludes_ollama)
- test_ai_router_diagnose_fallback.py::test_diagnose_route_sync_primary_is_ollama
(formerly test_diagnose_route_sync_fallback_chain_excludes_ollama)
- test_ai_router_diagnose_fallback.py::test_build_fallback_chain_for_intent_diagnose_with_ollama_primary
(formerly test_build_fallback_chain_for_intent_diagnose_no_ollama)
- test_ai_router_failover_integration.py::test_router_uses_failover_for_diagnose_ollama_primary
(formerly test_router_does_not_use_failover_for_openclaw_nemo)
Every test docstring records the historical context and the reason for the overturn.
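## Verification
- 1608 unit tests all green
- 16 LLM-path tests all green (including the updated versions of the 6 A2 guard tests)
- complexity_scorer / failover_manager / intent_classifier unaffected
## Expected prod behavior (to verify after deploy)
incident arrives → DIAGNOSE intent → primary OLLAMA (qwen2.5:7b on M1 Pro GPU)
fall back only on failure → OpenClaw 188 → Gemini → Claude
Ollama uses a 200s timeout (the previous 30s was not enough)
→ AI auto-remediation can finally start, no longer 100% llm_failed
## Known debt (to handle later)
- models.json:21 ollama.default is still deepseek-r1:14b (critic C1, but prod already auto-routes to the model actually loaded)
- complexity 4/5 is still hard-coded to gemini/claude (critic M2)
- The Gemini API key appears in plaintext in the prod log (needs rotation + sanitizing)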
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-29 11:39:36 +08:00 |
|
Your Name
|
277808758d
|
fix(failover): restore the optional OllamaRoutingResult.health_188 field (dropped in a merge conflict)
CD Pipeline / build-and-deploy (push) Has been cancelled
During stash pop, --theirs overwrote the health_188 dataclass field definition,
so to_dict() raised AttributeError (health_188 was referenced only inside the method).
Restored health_188: HealthReport | None = None; 37 failover tests ✅
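A minimal sketch of the restored field, assuming a simplified shape for both dataclasses: `selected_url` and the `HealthReport` fields are hypothetical placeholders, and `Optional[HealthReport]` is used as the spelling equivalent to `HealthReport | None` that also runs on Python versions before 3.10.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class HealthReport:  # hypothetical minimal shape, for illustration only
    healthy: bool
    latency_ms: float


@dataclass
class OllamaRoutingResult:
    selected_url: str  # hypothetical field, for illustration only
    # The restored optional field. Without this declaration, to_dict()
    # fails with AttributeError when it reads self.health_188.
    health_188: Optional[HealthReport] = None

    def to_dict(self) -> dict:
        return {
            "selected_url": self.selected_url,
            "health_188": asdict(self.health_188) if self.health_188 else None,
        }
```

Giving the field a default of `None` keeps existing call sites that never pass `health_188` working unchanged.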
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-27 20:04:49 +08:00 |
|
Your Name
|
32affaffeb
|
fix(critic-hotfix): 4 fixes for critic BLOCKER + HIGH findings (CD blocked + flywheel spinning idle)
CD Pipeline / build-and-deploy (push) Has started running
After a full review of 6 commits, the critic found:
CD unblock fix:
- test_ai_router_failover_integration.py: 3 tests now use patch.object to mock
_select_provider_and_model directly and force the initial provider to OLLAMA. The original IntentType.UNKNOWN mock
was still reclassified inside the router as DIAGNOSE → openclaw_nemo, so failover never triggered.
→ 5/5 PASSED
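BLOCKER B1: Gitea Telegram notifications could never be sent:
- apps/api/src/api/v1/gitea_webhook.py:399
redis = await get_redis() → redis = get_redis()
The original await raised TypeError, which the outer except swallowed → the Task C "PR merged" and workflow_run
failure notifications all silently failed (the green CI was an illusion; the test only checked HTTP 202, not actual delivery)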
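The B1 failure mode can be reproduced in isolation. In this sketch `FakeRedis`, `notify_buggy`, and `notify_fixed` are hypothetical stand-ins; the only thing taken from the commit is the pattern of awaiting the result of a synchronous `get_redis()` inside a broad except.

```python
import asyncio


class FakeRedis:
    """Hypothetical stand-in for the Redis client."""


def get_redis() -> FakeRedis:
    # A plain synchronous function: its return value is not awaitable.
    return FakeRedis()


async def notify_buggy() -> bool:
    try:
        redis = await get_redis()  # raises TypeError: not awaitable
        return redis is not None
    except Exception:
        # The broad except hides the bug, so the handler can still
        # respond successfully while no notification is ever sent.
        return False


async def notify_fixed() -> bool:
    redis = get_redis()  # no await on a synchronous call
    return redis is not None
```

A test that asserts on the delivery result, rather than only on the HTTP status, would have caught the buggy variant.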
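BLOCKER B2: the P1.3+P1.4 learning loop was running empty (same bug in two places):
- apps/api/src/api/v1/webhooks.py:261
- apps/api/src/services/approval_execution.py:771 (pre-existing)
get_latest_snapshot(...) is a module-level async function, not a classmethod of EvidenceSnapshot,
so EvidenceSnapshot.get_latest_snapshot(...) raised AttributeError, which the except swallowed into a warning
→ the flywheel loop looked connected but did nothing (the feature flag defaults to off, so nothing blew up for now)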
HIGH H3: race in the main.py lifespan ordering:
- apps/api/src/main.py: configure_alerter() moved before _recovery_svc.start()
Original order: start() triggers an immediate check → may call alert_recovery while the alerter
has not yet had Redis injected → dedup fails open, risking duplicate alerts.
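HIGH H1: Gemini quota dedup swallowed alerts across days:
- apps/api/src/services/failover_alerter.py:89
the dedup key gains a :{YYYY-MM-DD} suffix, giving each day its own dedup window
Previously, an alert fired at 22:00 yesterday suppressed one at 21:30 today because the dedup entry had not yet expired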
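A date-suffixed dedup key might look like the sketch below. The key prefix `alert:gemini_quota`, the function name, and the choice of a fixed +08:00 zone (matching the commit timestamps) are assumptions, not details from failover_alerter.py.

```python
from datetime import datetime, timedelta, timezone

# Assumption: days roll over in the same +08:00 zone as the commit timestamps.
TAIPEI = timezone(timedelta(hours=8))


def quota_dedup_key(now: datetime, prefix: str = "alert:gemini_quota") -> str:
    """Build a dedup key that changes every calendar day.

    `now` should be timezone-aware; it is converted to +08:00 before the
    date is taken, so the window boundary is local midnight.
    """
    return f"{prefix}:{now.astimezone(TAIPEI):%Y-%m-%d}"
```

With this scheme an alert at 22:00 on one day and another at 21:30 on the next produce different keys, so the second is no longer suppressed by the first.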
Tests: 14 passed (failover_alerter + ai_router_failover_integration + lifespan_wiring)
Deferred follow-ups:
- H2: rename the proactive_inspector memory metric + clean up the baseline
- H4: probe_success NaN fallback
- M1-M4 / S1-S2: see the critic report
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-26 20:39:53 +08:00 |
|
Your Name
|
55c6b4e2d9
|
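feat(p1): Ollama multi-tier failover system: P1.1 health checks + P1.2 ai_router integration + P1.5 failover alerts
Ollama failover subsystem for the ADR-092 P1 flywheel closed loop, all filled in by Engineer-A2/C/C2.
New services (1581 lines):
- ollama_health_monitor.py (356): 3-layer health check (TCP/HTTP/inference)
- ollama_failover_manager.py (571): automatic 111→188 switchover + Redis persistence + recovery callback
- ollama_auto_recovery.py (436): 30s background monitoring + 3 consecutive HEALTHY results → switch back + clear_cache
- failover_alerter.py (218): P1.5 Telegram failover alerts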
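The "3 consecutive HEALTHY results → switch back" rule can be illustrated with a small counter. `RecoveryTracker` is a hypothetical name and this is only a sketch of the streak logic; the 30s polling loop, Redis persistence, and clear_cache step are left out.

```python
class RecoveryTracker:
    """Signal failback only after N consecutive healthy probes."""

    def __init__(self, required: int = 3) -> None:
        self.required = required
        self.streak = 0

    def record(self, healthy: bool) -> bool:
        """Record one probe result; return True when failback should happen."""
        if not healthy:
            self.streak = 0  # any unhealthy probe restarts the count
            return False
        self.streak += 1
        if self.streak >= self.required:
            self.streak = 0  # reset so the next failover starts a fresh count
            return True
        return False
```

Requiring a streak rather than a single healthy probe avoids flapping between hosts when the primary is only intermittently reachable.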
Service integration:
- ai_router.py: AIProviderEnum.OLLAMA_188 + 120s budget + failover fallback chain
- main.py lifespan: wire the callback and start recovery on startup; stop gracefully on shutdown
- config.py: OLLAMA_FALLBACK_URL / OLLAMA_HEALTH_CHECK_MODEL / GEMINI_DAILY_QUOTA (billing circuit breaker)
K8s configuration:
- 04-configmap.yaml.patch-188-fallback: injects OLLAMA_FALLBACK_URL=http://192.168.0.188:11434
Tests (2082 lines):
- test_ollama_health_monitor.py (402)
- test_ollama_failover_manager.py (707)
- test_ollama_auto_recovery.py (580)
- test_ai_router_failover_integration.py (257)
- test_lifespan_failover_wiring.py (136)
Dependency chain: the three service modules + ai_router + main.py are committed together; missing any one causes ImportError.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-26 20:18:33 +08:00 |
|