fix(prod): wire the YAML rule engine into the auto-execute path (core architectural gap)

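Root cause of the architectural gap:
  The YAML rule engine (alert_rules.yaml) is the human-reviewed, authoritative source of actions,
  but the auto-execute path reads only proposal_data["kubectl_command"] (LLM-generated).
  The two are completely disconnected → HostHighCpuLoad gets a kubectl restart, and the
  SSH command for DockerContainerUnhealthy is overwritten by the LLM's kubectl command.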

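Fix strategy:
  At the auto_execute entry point, consult the YAML match_rule first:
  1. YAML → NO_ACTION (e.g. HostHighCpuLoad) → return immediately without executing anything
  2. YAML → non-kubectl command (e.g. ssh docker restart) → override the LLM action,
     so that the downstream infrastructure SSH routing can take effect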
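
The precedence described above can be sketched as a small pure function. This is only an illustration, not the code in this commit: resolve_action is a hypothetical helper name, and the rule is assumed to be a plain dict carrying the suggested_action and kubectl_command keys that the diff reads.

```python
from typing import Optional, Tuple

NO_ACTION = "NO_ACTION"


def resolve_action(
    yaml_rule: Optional[dict], llm_command: str
) -> Tuple[Optional[str], Optional[str]]:
    """Return (action, blocked_reason).

    action is None when auto-execution must be skipped entirely.
    Hypothetical helper: the commit inlines this logic in auto_execute.
    """
    if not yaml_rule:
        # No YAML rule matched: fall back to the LLM-generated command.
        return llm_command, None

    if yaml_rule.get("suggested_action") == NO_ACTION:
        # Case 1: the reviewed rule says do nothing, so block execution.
        return None, "YAML: NO_ACTION"

    yaml_cmd = (yaml_rule.get("kubectl_command") or "").strip()
    if yaml_cmd and not yaml_cmd.startswith("kubectl"):
        # Case 2: a non-kubectl command (e.g. SSH) overrides the LLM action.
        return yaml_cmd, None

    # The YAML rule has no command or a kubectl one: keep the LLM action.
    return llm_command, None
```

Keeping the decision in one side-effect-free function would make the three outcomes (blocked, overridden, unchanged) easy to unit test without a live incident or token.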

Impact:
  - HostHighCpuLoad / NodeCPUUsageHigh → auto-execution stops; downgraded to manual review
  - DockerContainerUnhealthy → SSH docker restart (if labels include host/container)
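
The "if labels include host/container" condition could look roughly like the sketch below. The label keys host and container, the helper name build_docker_restart, and the exact command shape are assumptions for illustration; the real command comes from alert_rules.yaml, which this commit does not show.

```python
import shlex
from typing import Mapping, Optional


def build_docker_restart(labels: Mapping[str, str]) -> Optional[str]:
    """Build an SSH docker-restart command, or None if labels are incomplete.

    Hypothetical helper: label keys and command format are assumed.
    """
    host = labels.get("host", "").strip()
    container = labels.get("container", "").strip()
    if not host or not container:
        # Without both labels there is no safe target, so do nothing.
        return None
    # Quote both values so label content cannot inject extra shell tokens.
    return f"ssh {shlex.quote(host)} docker restart {shlex.quote(container)}"
```

Returning None when a label is missing mirrors the intent of the impact note: the SSH route is taken only when the alert identifies both the host and the container.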

2026-04-15 ogt + Claude Sonnet 4.6 (APAC): third batch of production emergency fixes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OG T
2026-04-15 21:50:25 +08:00
parent 3696fb5938
commit ecfb7148bf

@@ -1318,6 +1318,50 @@ class DecisionManager:
        """
        action = token.proposal_data.get("kubectl_command", "")
        # 2026-04-15 ogt: YAML rule engine takes precedence (architectural gap fix)
        # Root cause: the LLM-generated kubectl_command is completely disconnected from
        # the YAML rule engine's NO_ACTION / SSH commands.
        # YAML rules are the human-reviewed authoritative source; the LLM is only an aid.
        # Fix strategy:
        #   1. YAML → NO_ACTION → return immediately without executing anything
        #   2. YAML → SSH command (non-kubectl) → override the LLM-generated action so SSH routing takes effect
        _alertname_for_yaml = incident.signals[0].labels.get("alertname", "") if incident.signals else ""
        if _alertname_for_yaml:
            try:
                from src.services.alert_rule_engine import match_rule as _yaml_match
                _yaml_r = _yaml_match({
                    "labels": incident.signals[0].labels if incident.signals else {},
                    "alert_type": _alertname_for_yaml,
                    "message": "",
                    "target_resource": incident.affected_services[0] if incident.affected_services else "unknown",
                    "namespace": "awoooi-prod",
                })
                if _yaml_r:
                    if _yaml_r.get("suggested_action") == "NO_ACTION":
                        logger.info(
                            "auto_execute_yaml_no_action",
                            incident_id=incident.incident_id,
                            alertname=_alertname_for_yaml,
                            reason="YAML rule explicitly marks NO_ACTION; skipping auto-remediation",
                        )
                        token.state = DecisionState.READY
                        token.proposal_data["auto_executed"] = False
                        token.proposal_data["blocked_reason"] = f"YAML: NO_ACTION for {_alertname_for_yaml}"
                        await self._save_token(token)
                        _fire_and_forget(_push_decision_to_telegram(incident, token.proposal_data))
                        return
                    _yaml_cmd = (_yaml_r.get("kubectl_command") or "").strip()
                    if _yaml_cmd and not _yaml_cmd.startswith("kubectl"):
                        # YAML provides an SSH / docker command → override the LLM-generated kubectl action
                        action = _yaml_cmd
                        logger.info(
                            "auto_execute_yaml_cmd_override",
                            incident_id=incident.incident_id,
                            alertname=_alertname_for_yaml,
                            yaml_cmd=_yaml_cmd[:80],
                        )
            except Exception as _yaml_err:
                logger.debug("auto_execute_yaml_check_error", error=str(_yaml_err))
        # Phase 6 ADR-087: self-demotion guard (controlled by AIOPS_P6_SELF_DEMOTION)
        # SLO violation → raise the global confidence threshold; consecutive violations → conservative mode, all auto-execution downgraded to manual
        # 2026-04-15 ogt + Claude Sonnet 4.6 (APAC): Phase 6 initial setup