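Nemotron Integration Proposal
Version: 1.1 Created: 2026-03-28 (Taipei time) Author: Claude Code Status: ✅ Testing complete, pending Commander approval
🔥 Test Results Summary (2026-03-28)
| Metric | Nemotron (NIM) | Ollama (CPU) | Verdict |
|---|---|---|---|
| Tool Calling accuracy | 83.3% (5/6) | ~50% | Nemotron wins |
| Average latency | 11-23 s | 100+ s | Nemotron is roughly 5-10x faster |
| Traditional Chinese support | ✅ Good | ✅ Good | Tie |
| Cost | Free tier | Free | Tie |
Recommendation: make Nemotron the first-choice route for Tool Calling tasks.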
1. NIM API Integration Spec
1.1 Endpoint information
| Item | Value |
|---|---|
| Base URL | https://integrate.api.nvidia.com/v1 |
| Chat Completions | /chat/completions |
| Compatibility | ✅ Fully compatible with the OpenAI API format |
1.2 Authentication
# Environment variable
export NVIDIA_API_KEY="nvapi-xxxx"
# HTTP header
Authorization: Bearer $NVIDIA_API_KEY
1.3 Available models
| Model ID | Size | Highlights | Suggested use |
|---|---|---|---|
| nvidia/nemotron-mini-4b-instruct | 4B | Lightweight, Tool Calling | Fast classification, simple decisions |
| nvidia/llama-3.1-nemotron-70b-instruct | 70B | Strong reasoning | Complex incident analysis |
| nvidia/nemotron-3-super | 120B (MoE) | Most capable, 1M tokens | Multi-agent collaboration |
1.4 Request format (OpenAI-compatible)
import os

import httpx

# Read the key exported in section 1.2.
NVIDIA_API_KEY = os.environ["NVIDIA_API_KEY"]

response = httpx.post(
    "https://integrate.api.nvidia.com/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {NVIDIA_API_KEY}",
    },
    json={
        "model": "nvidia/nemotron-mini-4b-instruct",
        "messages": [
            {"role": "system", "content": "You are an SRE assistant."},
            {"role": "user", "content": "Analyze this K8s error..."},
        ],
        "temperature": 0.2,
        "max_tokens": 1024,
        "tools": [...],  # placeholder: replace with the Tool Calling definitions from 1.5
    },
    timeout=30,
)
response.raise_for_status()  # fail loudly on HTTP errors
1.5 Tool Calling format
tools = [
    {
        "type": "function",
        "function": {
            "name": "kubectl_execute",
            "description": "Execute kubectl command on K8s cluster",
            "parameters": {
                "type": "object",
                "properties": {
                    "command": {
                        "type": "string",
                        "description": "kubectl command (e.g., 'get pods -n awoooi-prod')"
                    },
                    "namespace": {
                        "type": "string",
                        "description": "Target namespace"
                    }
                },
                "required": ["command"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "restart_deployment",
            "description": "Restart a Kubernetes deployment",
            "parameters": {
                "type": "object",
                "properties": {
                    "deployment": {"type": "string"},
                    "namespace": {"type": "string"}
                },
                "required": ["deployment", "namespace"]
            }
        }
    }
]
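Because the model fills in tool arguments itself, it can help to check a proposed call against the tool definition before anything touches the cluster. The sketch below is one way to do that under stated assumptions: `validate_tool_arguments`, `EXAMPLE_TOOLS`, and the error strings are hypothetical names introduced here, and the check only covers the `required` list and basic JSON types, not the full JSON Schema vocabulary.

```python
import json
from typing import Any, List

# Hypothetical, trimmed copy of one definition from section 1.5.
EXAMPLE_TOOLS = [{
    "type": "function",
    "function": {
        "name": "restart_deployment",
        "parameters": {
            "type": "object",
            "properties": {
                "deployment": {"type": "string"},
                "namespace": {"type": "string"},
            },
            "required": ["deployment", "namespace"],
        },
    },
}]

# Only the JSON types used by the tool definitions in this document.
_JSON_TYPES = {"string": str, "integer": int, "boolean": bool, "object": dict}


def _type_ok(value: Any, json_type: Any) -> bool:
    expected = _JSON_TYPES.get(json_type)
    if expected is None:
        return True  # unknown or unspecified type: skip the check
    if json_type == "integer" and isinstance(value, bool):
        return False  # bool is a subclass of int in Python
    return isinstance(value, expected)


def validate_tool_arguments(tool_defs: List[dict], name: str, raw_arguments: str) -> List[str]:
    """Return a list of problems; an empty list means the call looks valid."""
    spec = next((t["function"] for t in tool_defs if t["function"]["name"] == name), None)
    if spec is None:
        return [f"unknown tool: {name}"]
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return [f"arguments are not valid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]
    params = spec["parameters"]
    problems = [f"missing required parameter: {key}"
                for key in params.get("required", []) if key not in args]
    for key, value in args.items():
        prop = params.get("properties", {}).get(key)
        if prop is None:
            problems.append(f"unexpected parameter: {key}")
        elif not _type_ok(value, prop.get("type")):
            problems.append(f"parameter {key} should be {prop['type']}")
    return problems
```

A call with an empty problem list could proceed to the later trust and approval steps; anything else would be rejected or sent back to the model.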
1.6 Response format (Tool Call)
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {
          "name": "restart_deployment",
          "arguments": "{\"deployment\": \"awoooi-api\", \"namespace\": \"awoooi-prod\"}"
        }
      }]
    },
    "finish_reason": "tool_calls"
  }]
}
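Note that `arguments` arrives as a JSON-encoded string rather than a nested object, so it needs a second decode. A minimal sketch of reading such a response and handing each call to a handler follows; `extract_tool_calls`, `dispatch`, and `fake_restart` are hypothetical names for illustration, and the handler only returns a description instead of touching a cluster.

```python
import json
from typing import Callable, Dict, List, Tuple

# The response shape shown above, as a Python dict.
SAMPLE_RESPONSE = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_abc123",
                "type": "function",
                "function": {
                    "name": "restart_deployment",
                    "arguments": '{"deployment": "awoooi-api", "namespace": "awoooi-prod"}',
                },
            }],
        },
        "finish_reason": "tool_calls",
    }]
}


def extract_tool_calls(response: dict) -> List[Tuple[str, dict]]:
    """Return (tool_name, arguments) pairs from an OpenAI-style response."""
    calls = []
    for choice in response.get("choices", []):
        message = choice.get("message") or {}
        for call in message.get("tool_calls") or []:
            function = call["function"]
            # "arguments" is a JSON string, so decode it separately.
            calls.append((function["name"], json.loads(function["arguments"])))
    return calls


def dispatch(calls: List[Tuple[str, dict]], handlers: Dict[str, Callable[..., str]]) -> List[str]:
    """Run each call through its registered handler and collect the results."""
    results = []
    for name, arguments in calls:
        handler = handlers.get(name)
        if handler is None:
            results.append(f"no handler registered for {name}")
        else:
            results.append(handler(**arguments))
    return results


def fake_restart(deployment: str, namespace: str) -> str:
    """Dry-run stand-in for a real restart handler."""
    return f"would restart {deployment} in {namespace}"
```

Usage: `dispatch(extract_tool_calls(SAMPLE_RESPONSE), {"restart_deployment": fake_restart})`.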
2. Architecture Design
2.1 Fallback tier changes
┌─────────────────────────────────────────────────────────────────┐
│                      Current architecture                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│     Tier 1             Tier 2             Tier 3                │
│   ┌─────────┐        ┌─────────┐        ┌─────────┐             │
│   │ Ollama  │  ──▶   │ Gemini  │  ──▶   │ Claude  │             │
│   │ (188)   │        │ (API)   │        │ (API)   │             │
│   │ local   │        │free tier│        │ paid    │             │
│   └─────────┘        └─────────┘        └─────────┘             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                New architecture (with Nemotron)                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│              ┌──────────────────────────────────┐               │
│              │        Smart Model Router        │               │
│              │      (routes by task type)       │               │
│              └──────────────────────────────────┘               │
│                               │                                 │
│           ┌───────────────────┼───────────────────┐             │
│           │                   │                   │             │
│           ▼                   ▼                   ▼             │
│  ┌─────────────────┐    ┌───────────┐    ┌─────────────────┐    │
│  │ Tool Calling    │    │ General   │    │ Complex         │    │
│  │ path            │    │ chat path │    │ reasoning path  │    │
│  └────────┬────────┘    └─────┬─────┘    └────────┬────────┘    │
│           │                   │                   │             │
│           ▼                   ▼                   ▼             │
│  ┌─────────────────┐    ┌───────────┐    ┌─────────────────┐    │
│  │ Nemotron (NIM)  │    │ Ollama    │    │ Nemotron-70B    │    │
│  │ nemotron-mini   │    │ qwen2.5   │    │ or Claude       │    │
│  │ 4B, tool calls  │    │ local     │    │ high quality    │    │
│  └────────┬────────┘    └─────┬─────┘    └────────┬────────┘    │
│           │                   │                   │             │
│           └───────────────────┼───────────────────┘             │
│                               │                                 │
│                               ▼                                 │
│                      ┌─────────────────┐                        │
│                      │ Fallback Chain  │                        │
│                      │ Gemini → Claude │                        │
│                      └─────────────────┘                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
2.2 Task routing rules
# apps/api/src/services/ai/model_router.py
ROUTING_RULES = {
    # Tool Calling tasks → Nemotron first
    "tool_calling": {
        "primary": "nvidia/nemotron-mini-4b-instruct",
        "fallback": ["gemini-1.5-flash", "claude-3-haiku"]
    },
    # K8s operation decisions → Nemotron first
    "k8s_operation": {
        "primary": "nvidia/nemotron-mini-4b-instruct",
        "fallback": ["ollama/qwen2.5:7b", "gemini-1.5-flash"]
    },
    # Incident analysis (complex reasoning) → Nemotron-70B or Claude
    "incident_analysis": {
        "primary": "nvidia/llama-3.1-nemotron-70b-instruct",
        "fallback": ["claude-3-sonnet", "gemini-1.5-pro"]
    },
    # General chat → local Ollama first
    "general_chat": {
        "primary": "ollama/qwen2.5:7b",
        "fallback": ["gemini-1.5-flash", "claude-3-haiku"]
    },
    # Playbook generation → Nemotron (strong coding ability)
    "code_generation": {
        "primary": "nvidia/nemotron-mini-4b-instruct",
        "fallback": ["ollama/qwen2.5-coder:7b", "claude-3-sonnet"]
    }
}
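A rule table like this still needs a small lookup that turns a task type into an ordered list of models to try. The following is a rough sketch, not the actual `model_router.py`: `candidate_models`, `first_available`, and `SAMPLE_RULES` are hypothetical, and falling back to the `general_chat` rule for unknown task types is an assumption made here for illustration.

```python
from typing import Callable, List, Optional

# Hypothetical subset of ROUTING_RULES, copied so the sketch runs on its own.
SAMPLE_RULES = {
    "tool_calling": {
        "primary": "nvidia/nemotron-mini-4b-instruct",
        "fallback": ["gemini-1.5-flash", "claude-3-haiku"],
    },
    "general_chat": {
        "primary": "ollama/qwen2.5:7b",
        "fallback": ["gemini-1.5-flash", "claude-3-haiku"],
    },
}


def candidate_models(rules: dict, task_type: str, default_task: str = "general_chat") -> List[str]:
    """Return the primary model followed by its fallbacks, in try order."""
    # Assumption: unknown task types use the general chat rule.
    rule = rules.get(task_type) or rules[default_task]
    return [rule["primary"], *rule["fallback"]]


def first_available(candidates: List[str], is_available: Callable[[str], bool]) -> Optional[str]:
    """Pick the first candidate that the availability check accepts."""
    for model in candidates:
        if is_available(model):
            return model
    return None
```

For example, if the availability check rejects every `nvidia/` model (say the free quota is used up), `first_available(candidate_models(SAMPLE_RULES, "tool_calling"), ...)` moves on to `gemini-1.5-flash`.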
2.3 Where Nemotron fits in OpenClaw
┌─────────────────────────────────────────────────────────────────┐
│                     OpenClaw Decision Flow                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. Incident arrives                                            │
│     │                                                           │
│     ▼                                                           │
│  2. Intent Classifier                                           │
│     │  └── Ollama qwen2.5 (local, fast)                         │
│     │                                                           │
│     ▼                                                           │
│  3. Complexity Analyzer                                         │
│     │  └── Ollama qwen2.5 (local, fast)                         │
│     │                                                           │
│     ▼                                                           │
│  4. Decision Manager  ← 🔴 Nemotron goes here!                  │
│     │  ├── Tool Calling decisions → Nemotron-mini (NIM)         │
│     │  ├── Complex reasoning → Nemotron-70B (NIM)               │
│     │  └── General replies → Ollama/Gemini                      │
│     │                                                           │
│     ▼                                                           │
│  5. Trust Engine                                                │
│     │                                                           │
│     ▼                                                           │
│  6. Multi-Sig (when required)                                   │
│     │                                                           │
│     ▼                                                           │
│  7. K8s Executor                                                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
2.4 Environment variables
# Additions to .env.production
# NVIDIA NIM API
NVIDIA_API_KEY=nvapi-xxxx
NVIDIA_API_BASE_URL=https://integrate.api.nvidia.com/v1
# Model selection
NEMOTRON_TOOL_MODEL=nvidia/nemotron-mini-4b-instruct
NEMOTRON_REASONING_MODEL=nvidia/llama-3.1-nemotron-70b-instruct
# Rate limiting (protects the free quota)
NEMOTRON_RPM_LIMIT=60
NEMOTRON_TPM_LIMIT=100000
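To show how `NEMOTRON_RPM_LIMIT` might be enforced on the client side, here is a minimal sliding-window limiter. It is a sketch under assumptions, not the project's `rate_limiter.py`: the class name `SlidingWindowLimiter` and its `try_acquire` method are hypothetical, the clock is injectable purely to make the behaviour easy to test, and the token budget (`NEMOTRON_TPM_LIMIT`) is not covered.

```python
import os
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests within any `window_s`-second window."""

    def __init__(self, limit: int, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_s = window_s
        self._clock = clock
        self._stamps: Deque[float] = deque()  # timestamps of accepted requests

    def try_acquire(self) -> bool:
        """Record a request and return True, or return False if over the limit."""
        now = self._clock()
        # Drop timestamps that have left the window.
        while self._stamps and now - self._stamps[0] >= self.window_s:
            self._stamps.popleft()
        if len(self._stamps) >= self.limit:
            return False
        self._stamps.append(now)
        return True


# Read the limit from the environment, defaulting to the value listed above.
rpm_limit = int(os.getenv("NEMOTRON_RPM_LIMIT", "60"))
limiter = SlidingWindowLimiter(rpm_limit)
```

When `try_acquire()` returns False, the router could skip Nemotron for that request and go straight to the next model in the fallback list.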
3. Test Scripts
3.1 Tool Calling accuracy test
#!/usr/bin/env python3
"""
Nemotron Tool Calling accuracy test.

Compares the Tool Calling ability of Nemotron vs Gemini vs Qwen.
Note: only the Nemotron and Ollama (Qwen) clients are implemented below;
Gemini is not exercised by this script yet.

Usage:
    export NVIDIA_API_KEY=nvapi-xxxx
    export GEMINI_API_KEY=xxxx
    python test_nemotron_tool_calling.py
"""
import asyncio
import json
import os
import time
from dataclasses import dataclass
from typing import Optional

import httpx

# ============================================================================
# Configuration
# ============================================================================
NVIDIA_API_KEY = os.getenv("NVIDIA_API_KEY")
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")  # read here, not used by any test below
OLLAMA_BASE_URL = "http://192.168.0.188:11434"
# ============================================================================
# Tool definitions (K8s SRE scenarios)
# ============================================================================
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "kubectl_get",
            "description": "Get Kubernetes resources (pods, deployments, services, etc.)",
            "parameters": {
                "type": "object",
                "properties": {
                    "resource": {
                        "type": "string",
                        "enum": ["pods", "deployments", "services", "nodes", "events"],
                        "description": "Resource type to query"
                    },
                    "namespace": {
                        "type": "string",
                        "description": "Kubernetes namespace (default: awoooi-prod)"
                    },
                    "name": {
                        "type": "string",
                        "description": "Specific resource name (optional)"
                    }
                },
                "required": ["resource"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "restart_deployment",
            "description": "Restart a Kubernetes deployment by rolling restart",
            "parameters": {
                "type": "object",
                "properties": {
                    "deployment": {
                        "type": "string",
                        "description": "Deployment name"
                    },
                    "namespace": {
                        "type": "string",
                        "description": "Kubernetes namespace"
                    }
                },
                "required": ["deployment", "namespace"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "scale_deployment",
            "description": "Scale a Kubernetes deployment to specified replicas",
            "parameters": {
                "type": "object",
                "properties": {
                    "deployment": {"type": "string"},
                    "namespace": {"type": "string"},
                    "replicas": {"type": "integer", "minimum": 0, "maximum": 10}
                },
                "required": ["deployment", "namespace", "replicas"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_logs",
            "description": "Get logs from a Kubernetes pod",
            "parameters": {
                "type": "object",
                "properties": {
                    "pod": {"type": "string"},
                    "namespace": {"type": "string"},
                    "tail": {"type": "integer", "description": "Number of lines (default: 100)"},
                    "container": {"type": "string", "description": "Container name (optional)"}
                },
                "required": ["pod", "namespace"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "send_alert",
            "description": "Send alert notification via Telegram",
            "parameters": {
                "type": "object",
                "properties": {
                    "severity": {"type": "string", "enum": ["info", "warning", "critical"]},
                    "message": {"type": "string"},
                    "incident_id": {"type": "string"}
                },
                "required": ["severity", "message"]
            }
        }
    }
]
# ============================================================================
# Test cases
# ============================================================================
TEST_CASES = [
    {
        "id": "TC001",
        "description": "Simple query: list all pods",
        "prompt": "Show me all pods in awoooi-prod namespace",
        "expected_tool": "kubectl_get",
        "expected_params": {"resource": "pods", "namespace": "awoooi-prod"}
    },
    {
        "id": "TC002",
        "description": "Restart a service",
        "prompt": "The API is not responding, please restart the awoooi-api deployment",
        "expected_tool": "restart_deployment",
        "expected_params": {"deployment": "awoooi-api", "namespace": "awoooi-prod"}
    },
    {
        "id": "TC003",
        "description": "Scale replicas",
        "prompt": "We're getting high traffic, scale awoooi-web to 3 replicas",
        "expected_tool": "scale_deployment",
        "expected_params": {"deployment": "awoooi-web", "replicas": 3}
    },
    {
        "id": "TC004",
        "description": "View logs",
        "prompt": "Get the last 50 lines of logs from awoooi-api-xxx pod",
        "expected_tool": "get_logs",
        "expected_params": {"tail": 50}
    },
    {
        "id": "TC005",
        "description": "Send an alert",
        "prompt": "Send a critical alert: Database connection failed for incident INC-2026-001",
        "expected_tool": "send_alert",
        "expected_params": {"severity": "critical"}
    },
    {
        "id": "TC006",
        "description": "Compound understanding (requires reasoning)",
        "prompt": "The web frontend is showing 502 errors. Check if the API pods are running.",
        "expected_tool": "kubectl_get",
        "expected_params": {"resource": "pods"}
    },
    {
        "id": "TC007",
        "description": "Traditional Chinese instruction",
        "prompt": "請重啟 awoooi-worker 這個 deployment",
        "expected_tool": "restart_deployment",
        "expected_params": {"deployment": "awoooi-worker"}
    },
    {
        "id": "TC008",
        "description": "Ambiguous instruction (requires reasoning)",
        "prompt": "Something is wrong with the worker, it keeps crashing. Fix it.",
        "expected_tool": "restart_deployment",  # or get_logs
        "expected_params": {}  # several reasonable responses are acceptable
    }
]
# ============================================================================
# API clients
# ============================================================================
@dataclass
class ToolCallResult:
    model: str
    test_id: str
    success: bool
    tool_called: Optional[str]
    params: Optional[dict]
    latency_ms: float
    error: Optional[str] = None


async def call_nemotron(prompt: str, model: str = "nvidia/nemotron-mini-4b-instruct") -> dict:
    """Call the NVIDIA NIM API."""
    async with httpx.AsyncClient(timeout=30) as client:
        start = time.perf_counter()
        response = await client.post(
            "https://integrate.api.nvidia.com/v1/chat/completions",
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {NVIDIA_API_KEY}"
            },
            json={
                "model": model,
                "messages": [
                    {"role": "system", "content": "You are an SRE assistant for AWOOOI AIOps platform. Use the provided tools to help with Kubernetes operations."},
                    {"role": "user", "content": prompt}
                ],
                "tools": TOOLS,
                "tool_choice": "auto",
                "temperature": 0.1,
                "max_tokens": 512
            }
        )
        latency = (time.perf_counter() - start) * 1000
        response.raise_for_status()  # report HTTP errors instead of parsing an error body
        return {"data": response.json(), "latency_ms": latency}


async def call_ollama(prompt: str, model: str = "qwen2.5:7b") -> dict:
    """Call the local Ollama server."""
    async with httpx.AsyncClient(timeout=60) as client:
        start = time.perf_counter()
        response = await client.post(
            f"{OLLAMA_BASE_URL}/api/chat",
            json={
                "model": model,
                "messages": [
                    {"role": "system", "content": "You are an SRE assistant. Respond with JSON indicating which tool to call and parameters."},
                    {"role": "user", "content": f"Based on this request, which tool should be called and with what parameters? Request: {prompt}\n\nAvailable tools: kubectl_get, restart_deployment, scale_deployment, get_logs, send_alert\n\nRespond in JSON format: {{\"tool\": \"tool_name\", \"params\": {{...}}}}"}
                ],
                "stream": False,
                "format": "json"
            }
        )
        latency = (time.perf_counter() - start) * 1000
        response.raise_for_status()  # report HTTP errors instead of parsing an error body
        return {"data": response.json(), "latency_ms": latency}
# ============================================================================
# Test execution
# ============================================================================
def parse_tool_call(response: dict, model_type: str) -> tuple:
    """Parse the tool call out of each model's response format."""
    try:
        if model_type == "nemotron":
            choices = response.get("choices", [])
            if not choices:
                return (None, {})
            message = choices[0].get("message", {})
            if message.get("tool_calls"):
                tool_call = message["tool_calls"][0]
                return (
                    tool_call["function"]["name"],
                    json.loads(tool_call["function"]["arguments"])
                )
            # No tool_calls: fall back to the plain-text content
            return (None, {"content": message.get("content") or ""})
        elif model_type == "ollama":
            content = response.get("message", {}).get("content", "{}")
            parsed = json.loads(content)
            return (parsed.get("tool"), parsed.get("params", {}))
    except Exception as e:
        return (None, {"error": str(e)})
    return (None, {})


async def run_test(test_case: dict) -> list:
    """Run a single test case against every configured model."""
    results = []
    prompt = test_case["prompt"]
    # Test Nemotron (skipped when no API key is set)
    if NVIDIA_API_KEY:
        try:
            resp = await call_nemotron(prompt)
            tool, params = parse_tool_call(resp["data"], "nemotron")
            # Note: only the tool name is compared; expected_params is not checked.
            success = tool == test_case["expected_tool"]
            results.append(ToolCallResult(
                model="Nemotron-mini-4B",
                test_id=test_case["id"],
                success=success,
                tool_called=tool,
                params=params,
                latency_ms=resp["latency_ms"]
            ))
        except Exception as e:
            results.append(ToolCallResult(
                model="Nemotron-mini-4B",
                test_id=test_case["id"],
                success=False,
                tool_called=None,
                params=None,
                latency_ms=0,
                error=str(e)
            ))
    # Test Ollama
    try:
        resp = await call_ollama(prompt)
        tool, params = parse_tool_call(resp["data"], "ollama")
        success = tool == test_case["expected_tool"]
        results.append(ToolCallResult(
            model="Ollama-Qwen2.5-7B",
            test_id=test_case["id"],
            success=success,
            tool_called=tool,
            params=params,
            latency_ms=resp["latency_ms"]
        ))
    except Exception as e:
        results.append(ToolCallResult(
            model="Ollama-Qwen2.5-7B",
            test_id=test_case["id"],
            success=False,
            tool_called=None,
            params=None,
            latency_ms=0,
            error=str(e)
        ))
    return results
async def main():
    """Main test flow."""
    print("=" * 70)
    print("Nemotron vs Ollama Tool Calling accuracy test")
    print("=" * 70)
    print()
    all_results = []
    for tc in TEST_CASES:
        print(f"[{tc['id']}] {tc['description']}")
        print(f"  Prompt: {tc['prompt'][:50]}...")
        print(f"  Expected: {tc['expected_tool']}")
        results = await run_test(tc)
        all_results.extend(results)
        for r in results:
            status = "✅" if r.success else "❌"
            print(f"  {r.model}: {status} → {r.tool_called} ({r.latency_ms:.0f}ms)")
            if r.error:
                print(f"    Error: {r.error}")
        print()
    # Aggregate the results per model
    print("=" * 70)
    print("Summary")
    print("=" * 70)
    models = {}
    for r in all_results:
        if r.model not in models:
            models[r.model] = {"success": 0, "total": 0, "latency": []}
        models[r.model]["total"] += 1
        if r.success:
            models[r.model]["success"] += 1
        if r.latency_ms > 0:
            models[r.model]["latency"].append(r.latency_ms)
    print(f"{'Model':<25} {'Accuracy':<15} {'Avg Latency':<15}")
    print("-" * 55)
    for model, stats in models.items():
        acc = stats["success"] / stats["total"] * 100 if stats["total"] > 0 else 0
        avg_lat = sum(stats["latency"]) / len(stats["latency"]) if stats["latency"] else 0
        print(f"{model:<25} {acc:>6.1f}% {avg_lat:>8.0f}ms")
    print()
    print("Tests complete!")


if __name__ == "__main__":
    asyncio.run(main())
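The script above marks a case as passed when the tool name matches, but it never compares `expected_params`, and TC008 notes that more than one tool could be a reasonable answer. One possible stricter check is sketched below; `params_match`, `is_success`, and the `accepted_tools` argument are hypothetical additions, not part of the original script.

```python
from typing import Any, Optional, Set


def params_match(expected: dict, actual: Any) -> bool:
    """True when every expected key appears in `actual` with an equal value."""
    if not isinstance(actual, dict):
        return False
    # Extra keys in `actual` are allowed; only the expected subset is compared.
    return all(key in actual and actual[key] == value
               for key, value in expected.items())


def is_success(test_case: dict, tool: Optional[str], params: Any,
               accepted_tools: Optional[Set[str]] = None) -> bool:
    """Pass when the tool is acceptable and the expected parameters are present."""
    accepted = accepted_tools or {test_case["expected_tool"]}
    return tool in accepted and params_match(test_case["expected_params"], params)
```

With this, TC003 would fail if the model scaled to the wrong replica count, and TC008 could pass `accepted_tools={"restart_deployment", "get_logs"}` to accept either response.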
3.2 Quick verification script (curl)
#!/bin/bash
# quick_test_nemotron.sh
# Quick check of Nemotron API connectivity
# pipefail makes a failed curl abort the script even when piped into jq
set -eo pipefail
echo "=== Nemotron API quick test ==="
echo ""
# Check the API key
if [ -z "$NVIDIA_API_KEY" ]; then
  echo "❌ Please set NVIDIA_API_KEY"
  echo "   export NVIDIA_API_KEY=nvapi-xxxx"
  exit 1
fi
echo "✅ API key is set"
echo ""
# Test a simple request
echo "Test 1: simple chat..."
curl -s -X POST "https://integrate.api.nvidia.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $NVIDIA_API_KEY" \
  -d '{
    "model": "nvidia/nemotron-mini-4b-instruct",
    "messages": [{"role": "user", "content": "Say hello in JSON format"}],
    "max_tokens": 50
  }' | jq '.choices[0].message.content'
echo ""
echo "Test 2: Tool Calling..."
curl -s -X POST "https://integrate.api.nvidia.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $NVIDIA_API_KEY" \
  -d '{
    "model": "nvidia/nemotron-mini-4b-instruct",
    "messages": [
      {"role": "system", "content": "You are a K8s assistant."},
      {"role": "user", "content": "Restart the nginx deployment in production namespace"}
    ],
    "tools": [{
      "type": "function",
      "function": {
        "name": "restart_deployment",
        "description": "Restart a K8s deployment",
        "parameters": {
          "type": "object",
          "properties": {
            "deployment": {"type": "string"},
            "namespace": {"type": "string"}
          },
          "required": ["deployment", "namespace"]
        }
      }
    }],
    "tool_choice": "auto",
    "max_tokens": 200
  }' | jq '.choices[0].message'
echo ""
echo "=== Tests complete ==="
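4. Implementation Plan
4.1 Phases
Phase N.1: Validation (1-2 days)
──────────────────────────
├── Register at build.nvidia.com
├── Obtain NVIDIA_API_KEY
├── Run quick_test_nemotron.sh
├── Run the full Tool Calling test
└── Analyze the results and decide whether to continue
Phase N.2: Integration (2-3 days)
──────────────────────────
├── Create NvidiaAIProvider (modeled on the existing GeminiProvider)
├── Add routing rules to the Model Router
├── Configure environment variables + K8s Secrets
├── Integrate Langfuse tracing
└── Unit tests
Phase N.3: Acceptance (1 day)
──────────────────────────
├── E2E tests (real incident scenarios)
├── Latency + cost analysis
├── Chief Architect review
└── Commander approval for go-live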
4.2 File structure
apps/api/src/
├── services/
│   └── ai/
│       ├── providers/
│       │   ├── ollama_provider.py    # existing
│       │   ├── gemini_provider.py    # existing
│       │   ├── claude_provider.py    # existing
│       │   └── nvidia_provider.py    # 🆕 new
│       │
│       ├── model_router.py           # modified: add Nemotron routing
│       └── rate_limiter.py           # modified: add Nemotron rate limiting
4.3 New GitHub Secrets
# Add to GitHub Secrets
NVIDIA_API_KEY: nvapi-xxxx
# Add to K8s Secrets
kubectl create secret generic nvidia-api \
  --from-literal=NVIDIA_API_KEY=nvapi-xxxx \
  -n awoooi-prod
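5. Cost Estimate
5.1 Free quota
| Item | Estimate |
|---|---|
| Development and testing | Free (build.nvidia.com) |
| Rate limit | To be confirmed (possibly 60 RPM) |
5.2 Production (if paid usage is needed)
| Model | Price (estimated) | Monthly usage | Monthly cost |
|---|---|---|---|
| nemotron-mini-4b | ~$0.1/1M tokens | ~5M | ~$0.5 |
| nemotron-70b | ~$1.0/1M tokens | ~1M | ~$1.0 |
Conclusion: at these estimates the total is about $1.5 per month, which is very low and far cheaper than the Claude API.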
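The monthly figures are simply price per million tokens multiplied by the expected volume. The short sketch below reproduces that arithmetic so the estimate can be rerun if the prices or volumes change; the numbers are the estimates from the table, and the names `ESTIMATES` and `monthly_costs` are hypothetical.

```python
# Estimated price (USD per 1M tokens) and monthly volume (millions of tokens),
# copied from the table above. Both columns are estimates, not quoted prices.
ESTIMATES = {
    "nemotron-mini-4b": {"usd_per_million": 0.1, "million_tokens": 5},
    "nemotron-70b": {"usd_per_million": 1.0, "million_tokens": 1},
}


def monthly_costs(estimates: dict) -> dict:
    """Return the estimated monthly cost in USD for each model."""
    return {
        model: round(e["usd_per_million"] * e["million_tokens"], 2)
        for model, e in estimates.items()
    }


costs = monthly_costs(ESTIMATES)
total = round(sum(costs.values()), 2)
print(costs, total)
```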
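6. Risk Assessment
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Free quota runs out | Medium | Low | Fall back to Gemini |
| High API latency | Low | Medium | Local cache + timeout |
| Poor Tool Calling accuracy | Low | High | Validate during the test phase |
| Service instability | Low | Medium | Multi-layer fallback |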
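Appendix: Next Steps
Once the Commander approves, run the following immediately:
# Step 1: Obtain an API key
# Go to https://build.nvidia.com, register, and get a key
# Step 2: Set the environment variable
export NVIDIA_API_KEY=nvapi-xxxx
# Step 3: Quick verification
cd apps/api
./scripts/quick_test_nemotron.sh
# Step 4: Full test
python scripts/test_nemotron_tool_calling.py
Author: Claude Code Date: 2026-03-28 (Taipei time) Status: Pending review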