fix(chat+nim): resolve chief architect Review items I1-I4 and S3 (four Important issues)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m9s

I1: chat_manager._call_openclaw timeout=30.0 → read from settings.OPENCLAW_TIMEOUT
I2: nvidia_provider.py stale comment "45" → "55", aligned with the ConfigMap
I3: remove asyncio.shield: after a shielded timeout the task keeps running but nothing awaits it (silent leak)
I4: ChatManager.__init__ no longer instantiates repositories (leWOOOgo forbids a Service from holding a repository)
S3: _check_nemotron_health probe timeout 10s → 25s, now using the lightweight /v1/models endpoint

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OG T
2026-04-03 16:36:16 +08:00
parent be247d6c5c
commit 15aabd6ac5
4 changed files with 51 additions and 16 deletions

@@ -41,15 +41,16 @@ class ChatManager:
     """AWOOOI dual-AI chat manager"""
     def __init__(self):
-        self.k8s = get_k8s_repository()
-        self.incidents = get_incident_repository()
+        pass  # 2026-04-03 ogt: removed repo instantiation; the leWOOOgo convention forbids a Service from holding a repository
     async def get_system_context(self) -> str:
         """Collect live system context"""
         now = now_taipei()
+        k8s = get_k8s_repository()
+        incidents = get_incident_repository()
         try:
-            k8s_status = await self.k8s.get_pod_status_summary(namespace="awoooi-prod")
+            k8s_status = await k8s.get_pod_status_summary(namespace="awoooi-prod")
             cluster_info = f"Cluster: {k8s_status['running']}/{k8s_status['total']} Pods Running"
             if k8s_status.get('problem_pods'):
                 cluster_info += f", {len(k8s_status['problem_pods'])} abnormal"
@@ -57,7 +58,7 @@ class ChatManager:
             cluster_info = "Cluster: status unavailable"
         try:
-            active_incidents = await self.incidents.get_active()
+            active_incidents = await incidents.get_active()
             if active_incidents:
                 lines = [f"- {inc.incident_id}: {inc.status.value} (SEV {inc.severity.value})"
                          for inc in active_incidents[:3]]
@@ -84,9 +85,10 @@ class ChatManager:
         settings = get_settings()
         openclaw_url = getattr(settings, 'OPENCLAW_URL', 'http://192.168.0.188:8088')
+        openclaw_timeout = float(getattr(settings, 'OPENCLAW_TIMEOUT', 30.0))
         try:
             # OpenClaw has no general chat endpoint, so the conversation is passed through analyze/incident
-            async with httpx.AsyncClient(timeout=30.0) as client:
+            async with httpx.AsyncClient(timeout=openclaw_timeout) as client:
                 resp = await client.post(
                     f"{openclaw_url}/api/v1/analyze/incident",
                     json={
@@ -167,8 +169,9 @@ class ChatManager:
         )
         # Wait at most 40s for OpenClaw (including context retrieval time) and at most 60s for NemoClaw
+        # 2026-04-03 ogt: removed asyncio.shield: after a timeout, shield lets the task keep running with nothing awaiting it, causing a silent leak
         try:
-            openclaw_raw = await asyncio.wait_for(asyncio.shield(openclaw_task), timeout=40.0)
+            openclaw_raw = await asyncio.wait_for(openclaw_task, timeout=40.0)
         except asyncio.TimeoutError:
             openclaw_raw = None
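
The behavior behind the `asyncio.shield` removal can be reproduced in isolation. The sketch below is a standalone illustration, not code from this commit: `slow_call` and `run_with_timeout` are hypothetical names, and the delays are shortened so it finishes quickly. It reports, for each variant, whether `wait_for` timed out and whether the underlying task was actually cancelled.

```python
import asyncio


async def slow_call(delay: float) -> str:
    """Stand-in for a slow upstream request (hypothetical helper)."""
    await asyncio.sleep(delay)
    return "done"


async def run_with_timeout(use_shield: bool) -> tuple[bool, bool]:
    """Return (timed_out, task_cancelled) for one wait_for attempt."""
    task = asyncio.create_task(slow_call(0.5))
    awaitable = asyncio.shield(task) if use_shield else task
    timed_out = False
    try:
        await asyncio.wait_for(awaitable, timeout=0.05)
    except asyncio.TimeoutError:
        timed_out = True
    # Yield once so any pending cancellation has been delivered.
    await asyncio.sleep(0)
    cancelled = task.cancelled()
    if not task.done():
        # Shielded case: the task is still running and must be cleaned up by hand.
        task.cancel()
        await asyncio.gather(task, return_exceptions=True)
    return timed_out, cancelled


async def main() -> dict[str, tuple[bool, bool]]:
    return {
        "shielded": await run_with_timeout(use_shield=True),
        "plain": await run_with_timeout(use_shield=False),
    }


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Both variants time out, but only the plain `wait_for` cancels the task. With the shield, the task survives the timeout, so the caller has to keep a reference and either await or cancel it later.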

@@ -119,7 +119,7 @@ NVIDIA_DEFAULT_MODEL = "nvidia/nemotron-mini-4b-instruct"
 # Request timeout (seconds)
 # 2026-04-01 ogt: set to 30s (balance point)
-# 2026-04-03 ogt: now read from config, aligned with NEMOTRON_TIMEOUT_SECONDS=45
+# 2026-04-03 ogt: now read from config, aligned with NEMOTRON_TIMEOUT_SECONDS=55
 # Memory records NIM free-tier latency of 11-45s; the hard-coded 30s made every slow request time out
 def _get_nvidia_timeout() -> float:
     try:
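
A timeout helper of this kind usually reads the value from configuration and falls back to a default when the value is missing or malformed. The sketch below shows that general shape under stated assumptions. It is not the body of `_get_nvidia_timeout`, which is cut off in the hunk above. `read_timeout_seconds` is a hypothetical name, and reading `NEMOTRON_TIMEOUT_SECONDS` from the process environment is assumed for illustration, since the real code may go through a settings object instead.

```python
import os


def read_timeout_seconds(name: str, default: float) -> float:
    """Read a positive timeout in seconds from the environment, else return default."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        # Malformed value (for example "55s"): fall back instead of crashing.
        return default
    # Reject zero, negative, and non-finite values (NaN fails the first comparison).
    return value if 0 < value < float("inf") else default
```

A caller would use it as `read_timeout_seconds("NEMOTRON_TIMEOUT_SECONDS", 55.0)`. Falling back quietly keeps a bad ConfigMap value from breaking requests, at the cost of hiding the misconfiguration unless the fallback is also logged.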

@@ -2980,22 +2980,19 @@ class TelegramGateway:
         if not api_key:
             return False, "❌ NVIDIA_API_KEY not set"
+        # 2026-04-03 ogt: probe the lightweight /v1/models endpoint to avoid triggering billable inference
+        # timeout raised to 25s: a NIM free-tier cold start can take 15-20s
         try:
-            async with httpx.AsyncClient(timeout=10.0) as client:
-                resp = await client.post(
-                    "https://integrate.api.nvidia.com/v1/chat/completions",
+            async with httpx.AsyncClient(timeout=25.0) as client:
+                resp = await client.get(
+                    "https://integrate.api.nvidia.com/v1/models",
                     headers={"Authorization": f"Bearer {api_key}"},
-                    json={
-                        "model": "nvidia/nemotron-mini-4b-instruct",
-                        "messages": [{"role": "user", "content": "ping"}],
-                        "max_tokens": 1,
-                    },
                 )
                 if resp.status_code == 200:
                     return True, "✅ OK"
                 return False, f"❌ HTTP {resp.status_code}"
         except httpx.TimeoutException:
-            return False, "⚠️ Timeout (>10s)"
+            return False, "⚠️ Timeout (>25s)"
         except Exception as e:
             return False, f"{str(e)[:40]}"
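
The probe's result mapping (healthy, HTTP error, timeout, other failure) can be exercised without network access by injecting the request as a coroutine function. The sketch below is an illustration under that assumption, not the project's `_check_nemotron_health`: `probe_health` and `fetch_status` are hypothetical names, and the timeout is enforced with `asyncio.wait_for` rather than an HTTP client setting.

```python
import asyncio
from typing import Awaitable, Callable


async def probe_health(
    fetch_status: Callable[[], Awaitable[int]],
    timeout: float,
) -> tuple[bool, str]:
    """Map one probe attempt to (healthy, message).

    fetch_status is a hypothetical injected coroutine function that performs
    the lightweight request and returns the HTTP status code.
    """
    try:
        status = await asyncio.wait_for(fetch_status(), timeout=timeout)
    except asyncio.TimeoutError:
        return False, f"timeout (>{timeout:g}s)"
    except Exception as exc:  # network errors, DNS failures, and so on
        return False, str(exc)[:40]
    if status == 200:
        return True, "ok"
    return False, f"HTTP {status}"
```

In real use, `fetch_status` would wrap the GET request to the models endpoint. Keeping the HTTP call behind a parameter lets the four outcomes be unit-tested with simple fakes.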

@@ -5,6 +5,41 @@
 ---
+## 📍 Current Status (2026-04-03 Phase 22.6 dual-AI chat + chief architect Code Review)
+| Item | Status | Commit/Notes |
+|------|------|-------------|
+| **Phase 22.6 chat_manager rewrite** | ✅ Dual AI (@openclaw/@nemo/mixed mode) | be247d6 |
+| **NEMOTRON_TIMEOUT 30→55s** | ✅ ConfigMap + kubectl set env | k8s configmap |
+| **nvidia_provider.py reads config** | ✅ No longer hard-codes 30s | - |
+| **Cost-change approval, Constitution Chapter 5** | ✅ HARD_RULES + Memory + CLAUDE.md | - |
+| **I1: hard-coded openclaw timeout** | ✅ Now reads OPENCLAW_TIMEOUT from config | - |
+| **I2: stale comment 45→55** | ✅ nvidia_provider.py comment corrected | - |
+| **I3: asyncio.shield task leak** | ✅ Removed shield; now a direct wait_for | - |
+| **I4: ChatManager holds repos** | ✅ Moved to local variables in get_system_context() | - |
+| **S3: NIM probe 10s timeout** | ✅ Raised to 25s + uses the lightweight /v1/models endpoint | - |
+| **Chief architect Review score** | 85/100, all 4 Important items fixed | - |
+**Next step**: wait for CI deployment verification; Ollama on 188 still needs a manual restart
+---
+## 📍 Current Status (2026-04-03 chief architect Code Review: layout alignment + Phase 24 naming wrap-up)
+| Item | Status | Notes |
+|------|------|------|
+| **sidebar top fix** | ✅ top:0→top:68px; sidebar no longer covers the header | |
+| **app-layout alignment** | ✅ pt-[68px] + ml-[224px]; removes the 32px horizontal gap | |
+| **page.tsx calc** | ✅ calc(100vh-64px)→calc(100vh-68px) | |
+| **Metrics Strip, 7 metrics** | ✅ Fully aligned with the figma-v2 design | |
+| **test_nvidia_provider.py** | ✅ "nvidia" key → "openclaw_nemo", aligned with Phase 24 | |
+| **ai_rate_limiter.py** | ✅ RATE_LIMITS/COST_LIMITS "nvidia"→"openclaw_nemo" | |
+| **Review score** | 88/100, passed; 3 warnings, 0 violations | |
+**Next step**: nothing urgent pending
+---
 ## 📍 Current Status (2026-04-03 Phase 24 wrap-up + KB + Monitoring fixes)
 | Item | Status | Commit |