awoooi

wooo/awoooi

Commit Graph

Author SHA1 Message Date

Your Name

4111ea4f9f

fix(ai): remove 188 ollama provider

Code Review / ai-code-review (push) Successful in 12s

CD Pipeline / tests (push) Successful in 1m13s

CD Pipeline / build-and-deploy (push) Successful in 3m36s

CD Pipeline / post-deploy-checks (push) Successful in 1m20s

2026-05-06 14:34:48 +08:00

Your Name

fb0c72db42

feat(ai-router): overturn the A2 iron rule: DIAGNOSE primary switched to Ollama, local-first

CD Pipeline / build-and-deploy (push) Failing after 2m26s

Commander's iron rule, 2026-04-29: "Prefer Ollama on host 111 as the primary."
+ feedback_ai_autonomous_direction.md: rely mainly on free local LLMs
+ feedback_ollama_111_only.md: the only Ollama host is 111

## Factual basis for overturning A2 (2026-04-27, INC-20260425)

**Old fact**: Ollama = CPU-only deepseek-r1:14b @ 238s (unusable)
**New fact**: prod Ollama 111 = M1 Pro Apple Silicon GPU + qwen2.5:7b-instruct,
           fully loaded in 8.2 GB of VRAM, ctx 32k, measured 0.54s on a "hi" prompt

**All cloud providers are down** (evidence from the 2026-04-29 prod log):
- OpenClaw 188:8088 → 500 Internal Server Error
- Gemini → 429 Too Many Requests (quota exhausted)
- Claude → 404 Not Found (model claude-3-haiku-20240307 expired)

**Without overturning A2, 100% of incidents end in llm_failed and AI auto-repair never starts.**

## Scope of changes (minimal, safe, verifiable)

### ai_router.py
- `_diagnose_fallback_chain`: OLLAMA moves to first position (replaces the old "permanently excluded" comment)
  Order: [OLLAMA, OPENCLAW_NEMO, GEMINI, CLAUDE]
- `_intent_provider_overrides[DIAGNOSE]`: OPENCLAW_NEMO → OLLAMA
- `_full_fallback_chain` is unchanged (to avoid affecting RESTART/SCALE/CONFIG/DELETE)
- `_tool_calling_fallback_chain` is unchanged
- `complexity_map` is unchanged (critic M2 deferred to a follow-up)
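
A minimal sketch of how the reordered chain and the intent override could fit together. The enum classes, the module-level names, and `build_fallback_chain` are hypothetical stand-ins for the router attributes named above; the contents of the full chain are not shown in this commit, so a placeholder is used.

```python
from enum import Enum


class Provider(Enum):
    OLLAMA = "ollama"
    OPENCLAW_NEMO = "openclaw_nemo"
    GEMINI = "gemini"
    CLAUDE = "claude"


class Intent(Enum):
    DIAGNOSE = "diagnose"
    RESTART = "restart"


# Order taken from the commit message: Ollama first, cloud providers after.
DIAGNOSE_FALLBACK_CHAIN = [
    Provider.OLLAMA,
    Provider.OPENCLAW_NEMO,
    Provider.GEMINI,
    Provider.CLAUDE,
]

# Placeholder only: the real _full_fallback_chain is not shown in this commit.
FULL_FALLBACK_CHAIN = [Provider.OPENCLAW_NEMO, Provider.GEMINI, Provider.CLAUDE]

# Mirrors _intent_provider_overrides[DIAGNOSE]: OPENCLAW_NEMO -> OLLAMA.
INTENT_PROVIDER_OVERRIDES = {Intent.DIAGNOSE: Provider.OLLAMA}


def build_fallback_chain(intent):
    """Return the provider order for an intent, with any override moved to the front."""
    if intent is Intent.DIAGNOSE:
        chain = list(DIAGNOSE_FALLBACK_CHAIN)
    else:
        chain = list(FULL_FALLBACK_CHAIN)
    primary = INTENT_PROVIDER_OVERRIDES.get(intent)
    if primary is not None:
        # Put the override first and drop its duplicate further down the chain.
        chain = [primary] + [p for p in chain if p is not primary]
    return chain
```

Keeping the DIAGNOSE chain separate from the full chain is what lets this change leave RESTART/SCALE/CONFIG/DELETE routing untouched.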

### openclaw.py
- Inject task_type="diagnose" into alert_context (the real root cause, critic C2)
- Fix the timeout mismatch at ai_providers/ollama.py:77:
  - with task_type → OLLAMA_DIAGNOSE_TIMEOUT_SECONDS=200s
  - without → OPENCLAW_TIMEOUT=30s (too short for qwen2.5:7b inference)
- This is the root cause of the latency_ms=120014 seen in the prod log
- Copy with dict(alert_context) so the original context is not mutated
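
The context copy and the timeout selection described above can be sketched roughly as follows. The two constants carry the values from the commit message; the function names `with_diagnose_task_type` and `select_timeout` are hypothetical and only illustrate the behavior.

```python
OLLAMA_DIAGNOSE_TIMEOUT_SECONDS = 200.0  # value from the commit message
OPENCLAW_TIMEOUT = 30.0                  # value from the commit message


def with_diagnose_task_type(alert_context):
    """Return a shallow copy of the context with task_type set to "diagnose"."""
    ctx = dict(alert_context)  # copy first, so the caller's dict is never mutated
    ctx["task_type"] = "diagnose"
    return ctx


def select_timeout(context):
    """Use the long diagnose timeout only when a task_type is present."""
    if context.get("task_type"):
        return OLLAMA_DIAGNOSE_TIMEOUT_SECONDS
    return OPENCLAW_TIMEOUT


original = {"alert": "disk_full"}
enriched = with_diagnose_task_type(original)
```

Note that `dict(...)` is a shallow copy: top-level keys are isolated, but nested objects are still shared with the original context.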

## Regression tests updated in sync (6)

All A2 iron-rule guard tests now reflect the new rule:

- test_p0_diagnose_routing.py::test_diagnose_override_is_ollama
  (formerly test_diagnose_override_is_openclaw_nemo)
- test_ai_router_diagnose_fallback.py::test_diagnose_fallback_chain_ollama_primary
  (formerly test_diagnose_fallback_chain_no_ollama)
- test_ai_router_diagnose_fallback.py::test_diagnose_route_primary_is_ollama
  (formerly test_diagnose_route_fallback_chain_excludes_ollama)
- test_ai_router_diagnose_fallback.py::test_diagnose_route_sync_primary_is_ollama
  (formerly test_diagnose_route_sync_fallback_chain_excludes_ollama)
- test_ai_router_diagnose_fallback.py::test_build_fallback_chain_for_intent_diagnose_with_ollama_primary
  (formerly test_build_fallback_chain_for_intent_diagnose_no_ollama)
- test_ai_router_failover_integration.py::test_router_uses_failover_for_diagnose_ollama_primary
  (formerly test_router_does_not_use_failover_for_openclaw_nemo)

Every test docstring records the historical context and the reason for the reversal.

## Verification

- All 1608 unit tests pass
- All 16 LLM-path tests pass (including the 6 updated A2 guard tests)
- complexity_scorer / failover_manager / intent_classifier are unaffected

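## Expected prod behavior (to verify after deployment)

incident arrives → DIAGNOSE intent → primary OLLAMA (qwen2.5:7b on M1 Pro GPU)
  fall back only on failure → OpenClaw 188 → Gemini → Claude
  Ollama uses a 200s timeout (the previous 30s was too short)
  → AI auto-repair can finally start, no longer 100% llm_failed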
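
The flow above amounts to trying providers in order and falling back only when one fails. A minimal sketch, assuming each provider is a plain callable that raises on failure; `route_with_fallback` and the simulated providers are hypothetical and are not the router's real interface.

```python
def route_with_fallback(chain, prompt):
    """Try each (name, call) pair in order; return the first successful result."""
    errors = []
    for name, call in chain:
        try:
            return name, call(prompt)
        except Exception as exc:  # any provider failure triggers the next fallback
            errors.append((name, str(exc)))
    # Every provider failed: this is the llm_failed outcome the commit describes.
    raise RuntimeError(f"llm_failed: {errors}")


def failing(status):
    """Build a simulated provider that always fails with the given status text."""
    def call(prompt):
        raise RuntimeError(status)
    return call


def ollama_ok(prompt):
    # Simulated local success; a real call would go to the Ollama host.
    return f"diagnosis for: {prompt}"


# Simulation of the 2026-04-29 state: cloud providers fail, local Ollama answers.
chain = [
    ("OLLAMA", ollama_ok),
    ("OPENCLAW_NEMO", failing("500 Internal Server Error")),
    ("GEMINI", failing("429 Too Many Requests")),
    ("CLAUDE", failing("404 Not Found")),
]
provider, answer = route_with_fallback(chain, "pod crashloop")
```

With Ollama first, the failing cloud providers are never reached; with the old order and Ollama excluded, every entry would raise and the call would end in `llm_failed`.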

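## Known debt (to handle later)

- models.json:21 ollama.default is still deepseek-r1:14b (critic C1; prod already routes automatically to the model actually loaded)
- complexity 4/5 is still hard-coded to gemini/claude (critic M2)
- The Gemini API key appears in plaintext in the prod log (needs rotation + sanitization)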

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-29 11:39:36 +08:00

Your Name

fefe4c21cd

fix(inc-20260425): A1+A2 follow-up: Solver/Critic timeout + auto_repair wiring + Runbook + Grafana

CD Pipeline / build-and-deploy (push) Has been cancelled

Follow-up to the 595629c0 INC-20260425 fix, covering the three-stage Agent pipeline and end-to-end observability:

A1 follow-up: timeout wiring for Solver/Critic across the three stages:
- solver_agent.py: AGENT_SOLVER_TIMEOUT_SEC=20.0 (env override)
- critic_agent.py: AGENT_CRITIC_TIMEOUT_SEC=15.0 (env override)
- protocol.py: all three Agents wrap their calls in a shared observe_agent_step()
  · success/timeout/error outcome label
  · histogram written to aiops_agent_step_duration_seconds
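
A rough sketch of what a shared step wrapper with env-overridable timeouts could look like. Only the names `observe_agent_step`, `AGENT_SOLVER_TIMEOUT_SEC`, and the three outcome labels come from the commit message; the signature, the `timeout_from_env` helper, and the in-memory `STEP_DURATIONS` store (a stand-in for the real histogram) are assumptions.

```python
import os
import time
from collections import defaultdict
from contextlib import contextmanager

# In-memory stand-in for the aiops_agent_step_duration_seconds histogram:
# maps (agent, outcome) to a list of observed durations in seconds.
STEP_DURATIONS = defaultdict(list)


def timeout_from_env(name, default, environ=os.environ):
    """Read a timeout in seconds from the environment, falling back to the default."""
    return float(environ.get(name, default))


@contextmanager
def observe_agent_step(agent):
    """Time one agent step and record it under a success/timeout/error label."""
    start = time.perf_counter()
    outcome = "success"
    try:
        yield
    except TimeoutError:
        outcome = "timeout"
        raise
    except Exception:
        outcome = "error"
        raise
    finally:
        # Runs on every path, so failed steps are measured as well.
        STEP_DURATIONS[(agent, outcome)].append(time.perf_counter() - start)


solver_timeout = timeout_from_env("AGENT_SOLVER_TIMEOUT_SEC", 20.0, environ={})

with observe_agent_step("solver"):
    pass  # the real agent call would go here
```

Re-raising after labeling keeps the wrapper transparent: callers still see the original exception, and the metric only records what happened.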

A2 follow-up: auto_repair_service now uses _diagnose_fallback_chain:
- auto_repair_service.py +46 lines: switch DIAGNOSE routing to the new chain (NEMO→GEMINI→CLAUDE)
- Completely avoids the second timeout from Ollama on CPU (238s)

New metrics:
- core/metrics.py +59 lines: histogram buckets + label cardinality to support observe_agent_step

New tests (862 lines):
- test_agent_step_timeouts.py (475): timeout boundaries + outcome labels for each of the three Agents
- test_ai_router_diagnose_fallback.py (387): correct ordering of _diagnose_fallback_chain

New supporting files:
- docs/runbooks/RUNBOOK-AGENT-STEP-LATENCY.md (350): INC troubleshooting + observability guide
- ops/monitoring/grafana/agent_step_latency_rules.yaml (160)
  · histogram alert rules for the three Agents (p99 > 80% of timeout → warning)
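
The alert condition (p99 above 80% of the step timeout) can be illustrated with a small calculation. This is a simplified sketch over raw samples using a nearest-rank percentile; the actual rules file operates on histogram buckets, and the helper names here are hypothetical.

```python
WARNING_FRACTION = 0.8  # "p99 > 80% of timeout" from the alert rule description


def p99_nearest_rank(samples):
    """Nearest-rank 99th percentile of a non-empty list of durations."""
    ordered = sorted(samples)
    # Integer ceiling of 0.99 * n, avoiding floating-point rounding.
    rank = max(1, -(-len(ordered) * 99 // 100))
    return ordered[rank - 1]


def should_warn(samples, timeout_sec):
    """True when the p99 step duration exceeds 80% of the configured timeout."""
    return p99_nearest_rank(samples) > WARNING_FRACTION * timeout_sec


# Solver timeout is 20.0s in this commit, so the warning line sits near 16s.
slow_tail = [1.0] * 98 + [17.0, 18.0]
healthy = [1.0] * 100
```

Here `should_warn(slow_tail, 20.0)` is true because the 99th of 100 sorted samples is 17.0s, while `should_warn(healthy, 20.0)` is false.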

Acceptance: 33 tests pass (test_agent_step_timeouts 22 + test_ai_router_diagnose_fallback 11)

INC-20260425 total work across both fixes (595629c0 + this commit):
  · 5 service/agent files modified
  · 1 new observability module
  · 4 new test/supporting files
  · 1372+187 = 1559 lines added

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (INC-20260425 後續) <noreply@anthropic.com>

2026-04-27 08:15:53 +08:00

3 Commits