fix(decision): block kubectl actions on bare_metal host alerts
When HostHighCpuLoad / HostOutOfMemory fire on a bare-metal host (192.168.0.110 and others, where Sentry / ClickHouse / Snuba are consuming the CPU), the LLM kept proposing "kubectl rollout restart awoooi-api". That is a wrong-domain action: restarting awoooi cannot fix a third-party process's CPU usage on the host. Auto-execute would then either run the no-op kubectl restart (wasted) or escalate after ssh_diagnose because no safe action was found, producing the "AI 自動修復失敗" ("AI auto-repair failed") Telegram noise the user just complained about.

Adds a guard at the top of DecisionManager._auto_execute: if the incident's primary signal carries host_type=bare_metal AND the proposed action starts with "kubectl", refuse to execute. The incident is marked READY with a clear blocked_reason so human operators see why automation declined, and emergency_escalation records the event in AOL for audit.

Also patches /home/wooo/monitoring/alerts.yml on 110 (and the new ops/monitoring/alerts.yml in repo) to add an explicit auto_repair_action annotation on HostHighCpuLoad / HostOutOfMemory that steers the LLM toward `ssh ... ps aux` rather than kubectl restart. Prometheus reload returned 200.

Tests: tests/test_decision_manager_bare_metal_kubectl_guard.py covers:
  (1) bare_metal + kubectl is blocked
  (2) kubectl get is also blocked
  (3) bare_metal + ssh is NOT blocked
  (4) k8s host_type + kubectl is NOT blocked
  (5) missing host_type label is NOT blocked

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
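For reference, the guard condition and the five test cases listed above can be sketched as a standalone predicate. This is a minimal illustration, not the code in the repo: `is_wrong_domain_kubectl` and `CASES` are hypothetical names, the `"k8s"` label value and the `kubectl get pods` command are assumed examples, and the ssh command keeps the same elision as the message above.

```python
from typing import Mapping, Optional


def is_wrong_domain_kubectl(labels: Mapping[str, Optional[str]], action: str) -> bool:
    """Return True when a kubectl action is proposed for a bare_metal host alert.

    Hypothetical helper mirroring the condition described in this commit.
    """
    # A missing or None label normalizes to "", so it never matches "bare_metal".
    host_type = (labels.get("host_type") or "").lower()
    # Leading whitespace and letter case in the proposed action are ignored.
    return host_type == "bare_metal" and action.lstrip().lower().startswith("kubectl")


# (labels, proposed action, expected "blocked?") for the five cases in the message.
CASES = [
    ({"host_type": "bare_metal"}, "kubectl rollout restart awoooi-api", True),
    ({"host_type": "bare_metal"}, "kubectl get pods", True),   # read-only is blocked too
    ({"host_type": "bare_metal"}, "ssh ... ps aux", False),     # ssh stays allowed
    ({"host_type": "k8s"}, "kubectl rollout restart awoooi-api", False),  # assumed label value
    ({}, "kubectl rollout restart awoooi-api", False),          # no host_type label
]

for labels, action, expected in CASES:
    assert is_wrong_domain_kubectl(labels, action) is expected
```

Because the check is a plain prefix match, any command beginning with "kubectl" is refused on bare_metal alerts, including read-only ones, which matches case (2).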
@@ -1753,6 +1753,42 @@ class DecisionManager:
        """
        action = token.proposal_data.get("kubectl_command", "")

        # 2026-05-02 ogt + Claude Sonnet 4.6: bare_metal × kubectl rejection guard
        # Root cause: host-level alerts such as HostHighCpuLoad / HostOutOfMemory fire on
        # physical machines (e.g. 110, which runs third-party services such as
        # Sentry/ClickHouse/Snuba), but once the LLM sees the instance it tends to propose
        # a wrong-domain K8s action like "kubectl rollout restart awoooi-api".
        # Restarting the awoooi service cannot fix third-party CPU burn; it only hurts us.
        # Fix: when the alert has host_type=bare_metal and the action is a kubectl command,
        # downgrade to manual handling immediately and state explicitly in Telegram that a
        # cross-domain action was blocked. auto_repair goes through SSH diagnosis or a human.
        _alert_labels = incident.signals[0].labels if incident.signals else {}
        _host_type = (_alert_labels.get("host_type") or "").lower()
        _action_stripped = action.lstrip().lower()
        if _host_type == "bare_metal" and _action_stripped.startswith("kubectl"):
            logger.warning(
                "auto_execute_blocked_bare_metal_kubectl",
                incident_id=incident.incident_id,
                alertname=_alert_labels.get("alertname", ""),
                instance=_alert_labels.get("instance", ""),
                proposed_action=action[:120],
                reason="bare_metal host alert + kubectl action = wrong domain",
            )
            token.state = DecisionState.READY
            token.proposal_data["auto_executed"] = False
            # Operator-facing reason, in English: "host_type=bare_metal but the LLM
            # proposed a kubectl action (...). The root cause of host-level alerts is
            # often a third-party service (e.g. Sentry / ClickHouse); restarting a K8s
            # deployment will not fix it, so this was downgraded to manual handling."
            token.proposal_data["blocked_reason"] = (
                f"host_type=bare_metal 但 LLM 提案 kubectl 動作 ({action[:60]})。"
                " 主機層告警的根因常在第三方服務(如 Sentry / ClickHouse),"
                " 重啟 K8s deployment 解不了,已降級人工。"
            )
            await self._save_token(token)
            _fire_and_forget(_escalate_decision_auto_repair_unavailable(
                incident=incident,
                token=token,
                failure_reason="bare_metal alert routed to kubectl action (wrong domain)",
                attempted_actions=f"action={action[:120]}",
            ))
            _fire_and_forget(_push_decision_to_telegram(incident, token.proposal_data))
            return

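        # 2026-04-15 ogt: YAML rule engine takes priority; fixes an architectural disconnect
        # Root cause: the LLM-generated kubectl_command was completely disconnected from the
        # YAML rule engine's NO_ACTION / SSH commands
        # The YAML rules are the human-reviewed authoritative source; the LLM is only an aid
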
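The blocked branch added in this hunk leaves the token in a recognizable state: READY, not auto-executed, and carrying a blocked_reason. The sketch below reproduces only that state change so it can be run in isolation. `Token`, the `PENDING` member, and the boolean return value are hypothetical stand-ins; the real method also logs, persists the token, and fires the escalation and Telegram calls, none of which are modeled here.

```python
import asyncio
from dataclasses import dataclass, field
from enum import Enum


class DecisionState(Enum):
    # Stand-in enum; PENDING is an assumed starting state for illustration.
    PENDING = "pending"
    READY = "ready"


@dataclass
class Token:
    # Hypothetical stand-in for the decision token used by _auto_execute.
    state: DecisionState = DecisionState.PENDING
    proposal_data: dict = field(default_factory=dict)


async def auto_execute(labels: dict, token: Token) -> bool:
    """Return False if the bare_metal × kubectl guard blocked the action."""
    action = token.proposal_data.get("kubectl_command", "")
    host_type = (labels.get("host_type") or "").lower()
    if host_type == "bare_metal" and action.lstrip().lower().startswith("kubectl"):
        token.state = DecisionState.READY
        token.proposal_data["auto_executed"] = False
        token.proposal_data["blocked_reason"] = (
            f"host_type=bare_metal but a kubectl action was proposed ({action[:60]})"
        )
        return False  # early return: nothing below the guard runs
    return True  # placeholder for the rest of the method


token = Token(proposal_data={"kubectl_command": "  kubectl rollout restart awoooi-api"})
ran = asyncio.run(auto_execute({"host_type": "bare_metal"}, token))
```

Placing the guard before the YAML rule engine check means a blocked incident never reaches the later execution paths, so the only observable effects are the token fields set above plus the notifications.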