fix(BLOCKER): LLM consecutive failures: all 4 design violations fixed
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 14m21s
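The commander's audit found the real cause of the flywheel going silent: four bugs, each violating the established architecture design, collided at the same time.

P0a: Ollama timeout violates the GAP-B4 design
  config.py: OPENCLAW_TIMEOUT changed from 120s to 30s.
  The original 120s violated the ADR-052 GAP-B4 design (LLM 25s hard timeout),
  so when Ollama was overloaded, threads starved for 120s before degrading.

P0b: AI Router silent skip, observability fix
  ai_router.py: not_registered/circuit_open/rate_limit/privacy_skip are now all
  accumulated into the errors array, so the all_providers_failed log shows why
  each provider was skipped.
  Previously errors=["ollama: Timeout"] while tried=4, which made diagnosis impossible.

P1a: send_text method does not exist
  ai_router.py:1005 tg.send_text() → tg.send_notification(parse_mode=HTML)
  TelegramGateway only has send_notification, not send_text,
  so the fallback failure notification itself failed (double silence).

P1b: resend_stale_ready_tokens concurrency explosion
  decision_manager.py: added asyncio.Semaphore(5) + 200ms throttle.
  Previously fire_and_forget ran N tasks at once; at N=108 every Ollama embedding
  call timed out, and the live-fire test I ran was crowded out as well.
  Change: at most 5 concurrent tasks, plus a 200ms pause after each completion.

CD pipeline review (Blocker 1):
  Fully conforms to the ADR-039 design; 10-15 min is expected.
  No fix needed: the design requires this much time.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>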
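The P1b change (a semaphore of 5 plus a 200ms pause) can be illustrated with a minimal sketch. This is not the code in decision_manager.py: `resend_one` and `resend_all` are hypothetical names, and the per-token work is simulated with a short sleep. Only the limit of 5 and the 200ms throttle come from the commit message.

```python
import asyncio

MAX_CONCURRENCY = 5      # from the commit message: at most 5 tasks in flight
THROTTLE_SECONDS = 0.2   # from the commit message: 200 ms pause after each task


async def resend_one(token: str) -> str:
    """Hypothetical stand-in for the real per-token resend (e.g. an embedding call)."""
    await asyncio.sleep(0.01)
    return f"resent:{token}"


async def resend_all(tokens, max_concurrency=MAX_CONCURRENCY, throttle=THROTTLE_SECONDS):
    """Resend every token with bounded concurrency instead of N unbounded tasks.

    Returns the per-token results and the peak number of calls that were
    running at the same time (tracked only to make the bound observable).
    """
    semaphore = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0

    async def guarded(token):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await resend_one(token)
            finally:
                in_flight -= 1
                # Sleep while still holding the slot so the backend gets a breather.
                await asyncio.sleep(throttle)

    results = await asyncio.gather(
        *(guarded(t) for t in tokens), return_exceptions=True
    )
    return results, peak


if __name__ == "__main__":
    results, peak = asyncio.run(resend_all([f"tok{i}" for i in range(20)]))
    print(len(results), "tokens processed, peak concurrency", peak)
```

Sleeping inside the `async with` block is one way to throttle; releasing the slot first and sleeping afterwards would also cap concurrency at 5 but would let the next task start immediately.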
@@ -358,8 +358,10 @@ class Settings(BaseSettings):
         description="Default Ollama model for RCA analysis",
     )
     OPENCLAW_TIMEOUT: int = Field(
-        default=120,  # 2026-04-08 ogt: deepseek-r1:14b measured slowest at 54s; 120s includes buffer
-        description="Timeout for OpenClaw AI calls (seconds)",
+        default=30,  # 2026-04-14 Claude Sonnet 4.6: changed from 120s to 30s to align with ADR-052 GAP-B4
+        # 25s LLM hard timeout + 5s buffer. The original 120s violated the defense-in-depth design,
+        # so when Ollama was overloaded, threads starved for 120s before degrading to fallback.
+        description="Timeout for OpenClaw AI calls (seconds, aligned with GAP-B4 25s)",
     )
 
     # ==========================================================================
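For readers unfamiliar with the timeout layering the new default refers to, here is a minimal sketch of a 25s hard timeout that degrades to a fallback, with the 30s outer budget leaving a 5s buffer. It is an illustration under assumptions, not the project's implementation: `call_with_fallback`, `primary`, and `fallback` are hypothetical names; only the 25s and 30s figures come from the commit.

```python
import asyncio

LLM_HARD_TIMEOUT = 25.0   # per the commit: GAP-B4 hard timeout for LLM calls
OPENCLAW_TIMEOUT = 30.0   # new default in the diff: 25 s + 5 s buffer


async def call_with_fallback(primary, fallback, hard_timeout=LLM_HARD_TIMEOUT):
    """Await primary(); if it exceeds hard_timeout, degrade to fallback().

    Both arguments are zero-argument coroutine functions (hypothetical
    stand-ins for an LLM call and its degraded alternative).
    """
    try:
        return await asyncio.wait_for(primary(), timeout=hard_timeout)
    except asyncio.TimeoutError:
        # The slow call is cancelled by wait_for; fall back instead of blocking.
        return await fallback()


async def _demo():
    async def slow_llm():
        await asyncio.sleep(1.0)   # simulates an overloaded backend
        return "primary"

    async def canned_answer():
        return "fallback"

    # A tiny timeout keeps the demo fast; production would use LLM_HARD_TIMEOUT.
    return await call_with_fallback(slow_llm, canned_answer, hard_timeout=0.05)


if __name__ == "__main__":
    print(asyncio.run(_demo()))
```

The point of keeping the outer client timeout only slightly above the inner hard timeout is that the inner limit fires first and triggers the fallback path, while the outer limit acts as a backstop.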