fix(openclaw): Checkpoint-2 webhook path K8s inventory injection — 防止 NemoTron 幻覺 awoooi-service
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 11m39s

根因:NemoTron 在 webhook path(analyze_alert)無叢集上下文
→ 盲猜 deployment/awoooi-service → kubectl not found → EXECUTION_FAILED → trust score 0 永遠

修復:
- analyze_alert() Step 0.5: 呼叫 _fetch_k8s_inventory_for_openclaw() 拉取真實 Deployment 清單
- 注入「🔒 叢集實際資源清單」section 到 full_prompt,強制 LLM 從清單選擇資源名
- 失敗/超時 → 返回空字串 → 注入警示提示,主流程不中斷
- available_len 計算納入 k8s_section 長度防止 4K 截斷

影響:
- Solver Agent path (solver_agent.py) 已在 cf50a5c 修復
- 本 commit 修復 Alertmanager webhook path(analyze_alert → NemoTron)
- 兩條路徑均有 K8s 環境感知,LLM 不再幻覺資源名

ADR-082: Phase 2 多 Agent 協作
2026-04-17 ogt + Claude Sonnet 4.6(Checkpoint-2 webhook path completion)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
OG T
2026-04-18 00:53:27 +08:00
parent 0bde389323
commit 4f2e122fd2

View File

@@ -1262,19 +1262,33 @@ class OpenClawService:
Trace URL: {signoz_trace_url}
"""
# Step 0.5: 擷取 K8s 叢集真實資源清單Checkpoint-2 webhook path
# 2026-04-17 ogt + Claude Sonnet 4.6: 防止 NemoTron 幻覺 deployment/awoooi-service
# 根因webhook path 沒有叢集上下文 → LLM 盲猜資源名稱 → kubectl not found → trust 0 永遠
# 修復:每次分析前先拉真實 Deployment 清單,注入 prompt 強制 LLM 對齊
_k8s_ns = alert_context.get("namespace", "awoooi-prod")
_k8s_inventory = await _fetch_k8s_inventory_for_openclaw(namespace=_k8s_ns)
k8s_section = (
f"\n\n## 🔒 叢集實際資源清單({_k8s_ns}\n"
f"kubectl_command 與 target_resource **必須**從以下名稱選擇,不可自行編造:\n"
f"{_k8s_inventory}\n"
if _k8s_inventory
else "\n\n## ⚠️ 無法取得叢集清單target_resource 請依 alertname 推斷,勿編造。\n"
)
# 格式化告警為 Prompt (2026-03-31 ogt: 強力截斷以符合 NVIDIA 4K 限制)
# 優先保留 System Prompt截斷 Alert Data
available_len = 3500 - len(OPENCLAW_SYSTEM_PROMPT) - len(signoz_context)
available_len = 3500 - len(OPENCLAW_SYSTEM_PROMPT) - len(signoz_context) - len(k8s_section)
if available_len < 500:
# 如果 SignOz 太長,也截斷它
signoz_context = signoz_context[:500] + "... (truncated)"
available_len = 3500 - len(OPENCLAW_SYSTEM_PROMPT) - len(signoz_context)
available_len = 3500 - len(OPENCLAW_SYSTEM_PROMPT) - len(signoz_context) - len(k8s_section)
alert_json = json.dumps(alert_context, ensure_ascii=False, indent=2)
if len(alert_json) > available_len:
alert_json = alert_json[:available_len] + "... (truncated)"
full_prompt = OPENCLAW_SYSTEM_PROMPT + signoz_context + "\n\n## Alert Data:\n" + alert_json
full_prompt = OPENCLAW_SYSTEM_PROMPT + signoz_context + k8s_section + "\n\n## Alert Data:\n" + alert_json
logger.info(
"openclaw_alert_analysis_start",
@@ -2033,6 +2047,51 @@ async def close_openclaw() -> None:
_openclaw = None
async def _fetch_k8s_inventory_for_openclaw(
namespace: str = "awoooi-prod",
timeout_sec: float = 3.0,
) -> str:
"""
取得 K8s 叢集實際 Deployment/StatefulSet 清單,注入 analyze_alert prompt。
2026-04-17 ogt + Claude Sonnet 4.6 (Checkpoint-2 webhook path):
- 根因NemoTron 在 webhook path 收不到叢集清單 → 幻覺 deployment/awoooi-service
- 修復analyze_alert 前拉取真實資源名,注入 prompt強制 LLM 從清單選擇
- 超時/失敗 → 返回 ""prompt 仍正常但無鎖定效果,不中斷主流程)
- 只執行唯讀 get 指令,不修改叢集
Returns:
"awoooi-api, awoooi-web, ..." 格式字串,失敗時返回 ""
"""
import asyncio as _asyncio
import structlog as _structlog
_logger = _structlog.get_logger(__name__)
try:
cmd = (
f"kubectl get deployments,statefulsets -n {namespace} "
"-o jsonpath='{.items[*].metadata.name}' 2>/dev/null"
)
proc = await _asyncio.create_subprocess_shell(
cmd,
stdout=_asyncio.subprocess.PIPE,
stderr=_asyncio.subprocess.PIPE,
)
try:
stdout, _ = await _asyncio.wait_for(proc.communicate(), timeout=timeout_sec)
except _asyncio.TimeoutError:
proc.kill()
_logger.warning("k8s_inventory_timeout_openclaw", namespace=namespace)
return ""
raw = (stdout or b"").decode("utf-8", errors="replace").strip()
if not raw:
return ""
names = [n.strip() for n in raw.split() if n.strip()]
return ", ".join(names)
except Exception as _e:
_logger.warning("k8s_inventory_failed_openclaw", namespace=namespace, error=str(_e))
return ""
# =============================================================================
# Phase 5 + SignOz Integration Complete
# =============================================================================