fix(chat+nim): 修復首席架構師 Review I1-I4 + S3 四項重要問題
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m9s
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m9s
I1: chat_manager._call_openclaw timeout=30.0 → 讀 settings.OPENCLAW_TIMEOUT I2: nvidia_provider.py stale comment "45" → "55" 對齊 ConfigMap I3: asyncio.shield 移除 — shield 超時後 task 繼續跑但無人等待 (silent leak) I4: ChatManager.__init__ 移除 repo 實例 (leWOOOgo 禁 Service 持有 repository) S3: _check_nemotron_health probe 10s → 25s + /v1/models 輕量端點 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
@@ -41,15 +41,16 @@ class ChatManager:
|
||||
"""AWOOOI 雙 AI 對話管理器"""
|
||||
|
||||
def __init__(self):
|
||||
self.k8s = get_k8s_repository()
|
||||
self.incidents = get_incident_repository()
|
||||
pass # 2026-04-03 ogt: 移除 repo 實例化,leWOOOgo 規範禁止 Service 持有 repository
|
||||
|
||||
async def get_system_context(self) -> str:
|
||||
"""收集系統即時上下文"""
|
||||
now = now_taipei()
|
||||
k8s = get_k8s_repository()
|
||||
incidents = get_incident_repository()
|
||||
|
||||
try:
|
||||
k8s_status = await self.k8s.get_pod_status_summary(namespace="awoooi-prod")
|
||||
k8s_status = await k8s.get_pod_status_summary(namespace="awoooi-prod")
|
||||
cluster_info = f"Cluster: {k8s_status['running']}/{k8s_status['total']} Pods Running"
|
||||
if k8s_status.get('problem_pods'):
|
||||
cluster_info += f", {len(k8s_status['problem_pods'])} 異常"
|
||||
@@ -57,7 +58,7 @@ class ChatManager:
|
||||
cluster_info = "Cluster: 無法取得狀態"
|
||||
|
||||
try:
|
||||
active_incidents = await self.incidents.get_active()
|
||||
active_incidents = await incidents.get_active()
|
||||
if active_incidents:
|
||||
lines = [f"- {inc.incident_id}: {inc.status.value} (SEV {inc.severity.value})"
|
||||
for inc in active_incidents[:3]]
|
||||
@@ -84,9 +85,10 @@ class ChatManager:
|
||||
settings = get_settings()
|
||||
|
||||
openclaw_url = getattr(settings, 'OPENCLAW_URL', 'http://192.168.0.188:8088')
|
||||
openclaw_timeout = float(getattr(settings, 'OPENCLAW_TIMEOUT', 30.0))
|
||||
try:
|
||||
# OpenClaw 沒有通用 chat endpoint,用 analyze/incident 傳入對話內容
|
||||
async with httpx.AsyncClient(timeout=30.0) as client:
|
||||
async with httpx.AsyncClient(timeout=openclaw_timeout) as client:
|
||||
resp = await client.post(
|
||||
f"{openclaw_url}/api/v1/analyze/incident",
|
||||
json={
|
||||
@@ -167,8 +169,9 @@ class ChatManager:
|
||||
)
|
||||
|
||||
# OpenClaw 最多等 40s(含 context 取得時間),NemoClaw 最多等 60s
|
||||
# 2026-04-03 ogt: 移除 asyncio.shield — shield 會在超時後讓 task 繼續跑但無人等待,造成 silent leak
|
||||
try:
|
||||
openclaw_raw = await asyncio.wait_for(asyncio.shield(openclaw_task), timeout=40.0)
|
||||
openclaw_raw = await asyncio.wait_for(openclaw_task, timeout=40.0)
|
||||
except asyncio.TimeoutError:
|
||||
openclaw_raw = None
|
||||
|
||||
|
||||
@@ -119,7 +119,7 @@ NVIDIA_DEFAULT_MODEL = "nvidia/nemotron-mini-4b-instruct"
|
||||
|
||||
# 請求超時 (秒)
|
||||
# 2026-04-01 ogt: 設為 30s (平衡點)
|
||||
# 2026-04-03 ogt: 改從 config 讀取,與 NEMOTRON_TIMEOUT_SECONDS=45 對齊
|
||||
# 2026-04-03 ogt: 改從 config 讀取,與 NEMOTRON_TIMEOUT_SECONDS=55 對齊
|
||||
# Memory 記載 NIM 免費 tier 延遲 11-45s,30s 硬編碼導致慢請求全超時
|
||||
def _get_nvidia_timeout() -> float:
|
||||
try:
|
||||
|
||||
@@ -2980,22 +2980,19 @@ class TelegramGateway:
|
||||
if not api_key:
|
||||
return False, "❌ NVIDIA_API_KEY 未設定"
|
||||
|
||||
# 2026-04-03 ogt: 用 /v1/models 輕量端點探測,避免觸發推理計費
|
||||
# timeout 改為 25s — NIM 免費 tier 冷啟動可能需要 15-20s
|
||||
try:
|
||||
async with httpx.AsyncClient(timeout=10.0) as client:
|
||||
resp = await client.post(
|
||||
"https://integrate.api.nvidia.com/v1/chat/completions",
|
||||
async with httpx.AsyncClient(timeout=25.0) as client:
|
||||
resp = await client.get(
|
||||
"https://integrate.api.nvidia.com/v1/models",
|
||||
headers={"Authorization": f"Bearer {api_key}"},
|
||||
json={
|
||||
"model": "nvidia/nemotron-mini-4b-instruct",
|
||||
"messages": [{"role": "user", "content": "ping"}],
|
||||
"max_tokens": 1,
|
||||
},
|
||||
)
|
||||
if resp.status_code == 200:
|
||||
return True, "✅ 正常"
|
||||
return False, f"❌ HTTP {resp.status_code}"
|
||||
except httpx.TimeoutException:
|
||||
return False, "⚠️ 超時 (>10s)"
|
||||
return False, "⚠️ 超時 (>25s)"
|
||||
except Exception as e:
|
||||
return False, f"❌ {str(e)[:40]}"
|
||||
|
||||
|
||||
@@ -5,6 +5,41 @@
|
||||
|
||||
---
|
||||
|
||||
## 📍 當前狀態 (2026-04-03 Phase 22.6 雙 AI 對話 + 首席架構師 Code Review)
|
||||
|
||||
| 項目 | 狀態 | Commit/備註 |
|
||||
|------|------|-------------|
|
||||
| **Phase 22.6 chat_manager 重寫** | ✅ 雙 AI (@openclaw/@nemo/混合模式) | be247d6 |
|
||||
| **NEMOTRON_TIMEOUT 30→55s** | ✅ ConfigMap + kubectl set env | k8s configmap |
|
||||
| **nvidia_provider.py 讀 config** | ✅ 不再硬編碼 30s | — |
|
||||
| **費用變更審批憲法第五章** | ✅ HARD_RULES + Memory + CLAUDE.md | — |
|
||||
| **I1: openclaw timeout 硬編碼** | ✅ 改讀 OPENCLAW_TIMEOUT config | — |
|
||||
| **I2: stale 註解 45→55** | ✅ nvidia_provider.py comment 修正 | — |
|
||||
| **I3: asyncio.shield task leak** | ✅ 移除 shield,改直接 wait_for | — |
|
||||
| **I4: ChatManager 持有 repo** | ✅ 移至 get_system_context() 本地變數 | — |
|
||||
| **S3: NIM 探測 10s timeout** | ✅ 改 25s + 用 /v1/models 輕量端點 | — |
|
||||
| **首席架構師 Review 評分** | 85/100 — 4 Important 已全修 | — |
|
||||
|
||||
**下一步**: 等待 CI 部署驗證,Ollama on 188 仍需手動重啟
|
||||
|
||||
---
|
||||
|
||||
## 📍 當前狀態 (2026-04-03 首席架構師 Code Review — Layout 對齊 + Phase 24 命名收尾)
|
||||
|
||||
| 項目 | 狀態 | 備註 |
|
||||
|------|------|------|
|
||||
| **sidebar top 修正** | ✅ top:0→top:68px,sidebar 不再蓋住 header | |
|
||||
| **app-layout 對齊** | ✅ pt-[68px] + ml-[224px],消除 32px 水平空隙 | |
|
||||
| **page.tsx calc** | ✅ calc(100vh-64px)→calc(100vh-68px) | |
|
||||
| **Metrics Strip 7指標** | ✅ 完整對齊 figma-v2 設計 | |
|
||||
| **test_nvidia_provider.py** | ✅ "nvidia" key → "openclaw_nemo" 對齊 Phase 24 | |
|
||||
| **ai_rate_limiter.py** | ✅ RATE_LIMITS/COST_LIMITS "nvidia"→"openclaw_nemo" | |
|
||||
| **Review 評分** | 88/100 — 通過,3項警告,0項違規 | |
|
||||
|
||||
**下一步**: 無緊急待做
|
||||
|
||||
---
|
||||
|
||||
## 📍 當前狀態 (2026-04-03 Phase 24 收尾 + KB + Monitoring 修復)
|
||||
|
||||
| 項目 | 狀態 | Commit |
|
||||
|
||||
Reference in New Issue
Block a user