fix(GAP-A4 Phase 2): LLM-path target rescue, unblocking 12 flywheel interceptions
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Diagnosis from the commander panoramic report (2026-04-14 20:00):
  Within 2h, all 12 auto_execute_blocked_unresolved_placeholder events came from
  the LLM directly emitting `kubectl ... deployment HostHighCpuLoad`.
  GAP-A4 Phase 1 only fixed alert_rule_engine._extract_vars,
  but the LLM path in decision_manager did not run the same check
  → 12 actions blocked → 0 KM, 0 flywheel

Fix (in decision_manager._auto_execute, after placeholder substitution):
  1. Extract the deployment name from the action by regex (kubectl ... deployment XXX)
  2. Validate it with alert_rule_engine._is_bad_target()
  3. If it is garbage (== alertname / unknown / IP) → re-derive it from incident.signals[0].labels
     (using the same multi-layer logic as _extract_vars)
  4. If a valid target is found → action.replace(llm_target, good_target)
  5. If the labels cannot rescue it either → log target_rescue_failed → the safety guard handles it

Effect:
  - KubePodCrashLooping (has a deployment label) → rescued even when the LLM fills in the wrong target
  - HostHighCpuLoad (host only, no K8s labels) → still goes to the safety guard,
    but the log becomes target_rescue_failed instead of unresolved_placeholder
  - The 12 flywheel interceptions are expected to drop substantially

Regression: 66/66 (GAP-A4 + kubectl validation) all pass

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
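The five steps above can be sketched as a standalone function. This is a minimal illustration, not the code from this commit: `is_bad_target` and `target_from_labels` are hypothetical stand-ins for `alert_rule_engine._is_bad_target` and `_extract_vars`, and the checks and label keys they use are assumptions based only on the description in the commit message.

```python
import re
from typing import Dict, Tuple

# Assumed shape of an IPv4 address with an optional port, e.g. "10.0.0.5:9100".
_IPV4_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}(:\d+)?$")
# Same pattern as in the diff: the name that follows "deployment " or "deployment/".
_DEPLOYMENT_RE = re.compile(r"deployment[/\s]+([\w.\-]+)")


def is_bad_target(target: str, alertname: str) -> bool:
    """Hypothetical stand-in: empty, "unknown", equal to alertname, or an IP."""
    if not target or target.lower() == "unknown":
        return True
    if alertname and target == alertname:
        return True
    return bool(_IPV4_RE.match(target))


def target_from_labels(labels: Dict[str, str]) -> str:
    """Hypothetical stand-in for the label lookup; the keys tried are assumptions."""
    for key in ("deployment", "app", "container"):
        value = labels.get(key, "")
        if value:
            return value
    return ""


def rescue_target(action: str, labels: Dict[str, str]) -> Tuple[str, str]:
    """Return (action, status); status is "unchanged", "rescued" or "failed"."""
    alertname = labels.get("alertname", "")
    match = _DEPLOYMENT_RE.search(action)            # step 1: extract the name
    if not match:
        return action, "unchanged"
    llm_target = match.group(1)
    if not is_bad_target(llm_target, alertname):     # step 2: validate it
        return action, "unchanged"
    good_target = target_from_labels(labels)         # step 3: re-derive from labels
    if good_target and not is_bad_target(good_target, alertname):
        return action.replace(llm_target, good_target), "rescued"  # step 4
    return action, "failed"                          # step 5: leave it to the safety guard


k8s_result = rescue_target(
    "kubectl rollout restart deployment KubePodCrashLooping -n prod",
    {"alertname": "KubePodCrashLooping", "deployment": "checkout-api"},
)
host_result = rescue_target(
    "kubectl scale deployment HostHighCpuLoad --replicas=2",
    {"alertname": "HostHighCpuLoad", "instance": "10.0.0.5:9100"},
)
```

With these stand-ins, the Kubernetes alert is rewritten to target `checkout-api`, while the host-only alert is returned unchanged with status `"failed"`, which matches the two cases listed under "Effect".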
@@ -1312,6 +1312,56 @@ class DecisionManager:
action = _re.sub(r"<deployment_name>", _target, action)
action = _re.sub(r"<[^>]+>", _target, action)

# GAP-A4 Phase 2 (2026-04-14 Claude Sonnet 4.6): LLM-path target rescue
# Root cause: the LLM directly emits `kubectl scale deployment HostHighCpuLoad` (target=alertname)
# GAP-A4 Phase 1 only fixed rule_engine._extract_vars; the LLM path had no check
# Result: 12 auto_execute_blocked_unresolved_placeholder events blocked outright → flywheel 0
try:
    from src.services.alert_rule_engine import (
        _extract_vars as _rule_extract_vars,
        _is_bad_target as _rule_is_bad_target,
    )
    _alertname_for_rescue = (
        incident.signals[0].labels.get("alertname", "") if incident.signals else ""
    )
    # Extract the deployment name from action (kubectl scale/restart deployment XXX, or deployment/XXX)
    _kubectl_target_match = _re.search(
        r"deployment[/\s]+([\w.\-]+)", action
    )
    if _kubectl_target_match:
        _llm_target = _kubectl_target_match.group(1)
        if _rule_is_bad_target(_llm_target, _alertname_for_rescue):
            # The LLM used garbage as the deployment name (alertname/unknown/IP) → re-derive it
            _alert_ctx = {
                "labels": incident.signals[0].labels if incident.signals else {},
                "target_resource": _target,
                "namespace": _ns,
                "alert_type": _alertname_for_rescue,
            }
            _good_target = _rule_extract_vars(_alert_ctx).get("target", "")
            if _good_target and _good_target != "unknown" and not _rule_is_bad_target(_good_target, _alertname_for_rescue):
                _old_action = action
                action = action.replace(_llm_target, _good_target)
                logger.info(
                    "auto_execute_target_rescued",
                    incident_id=incident.incident_id,
                    llm_target=_llm_target,
                    rescued_target=_good_target,
                    old_action=_old_action[:120],
                    new_action=action[:120],
                    reason="LLM produced a garbage target; re-derived the deployment name from labels",
                )
            else:
                logger.warning(
                    "auto_execute_target_rescue_failed",
                    incident_id=incident.incident_id,
                    llm_target=_llm_target,
                    alertname=_alertname_for_rescue,
                    reason="no valid deployment found in labels either; the safety guard will block it → manual handling",
                )
except Exception as _rescue_err:
    logger.debug("target_rescue_skipped", error=str(_rescue_err))

# ADR-073 Phase 3-2: infrastructure alerts (Docker/Host) → SSH MCP routing (2026-04-12 ogt)
# alert_category = "infrastructure" means a Docker/Host alert, which does not go through the K8s executor
# action format should be "docker restart <container>" or "systemctl restart <service>"
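One property of the `action.replace(_llm_target, _good_target)` call in the hunk above is that `str.replace` rewrites every occurrence of the bad name, not only the one after `deployment`. The sketch below contrasts that with a substitution limited to the captured deployment name. It is an illustrative alternative under assumed inputs, not part of this commit; `replace_deployment_only` is a hypothetical helper and the sample action string is made up.

```python
import re

# Same pattern as in the diff, with the "deployment" prefix captured separately
# so that it can be preserved while only the name is swapped.
_DEPLOYMENT_RE = re.compile(r"(deployment[/\s]+)([\w.\-]+)")


def replace_deployment_only(action: str, good_target: str) -> str:
    """Rewrite only the first name that directly follows "deployment"."""
    # A function replacement avoids treating backslashes in good_target as escapes.
    return _DEPLOYMENT_RE.sub(lambda m: m.group(1) + good_target, action, count=1)


sample_action = "kubectl scale deployment unknown -n unknown-ns --replicas=2"

# str.replace also rewrites the "unknown" inside the namespace.
global_result = sample_action.replace("unknown", "checkout-api")

# The targeted substitution leaves the namespace untouched.
targeted_result = replace_deployment_only(sample_action, "checkout-api")
```

Whether this matters in practice depends on how often the bad target string also appears elsewhere in the action, which the commit does not say.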