fix(decision_manager): overwrite Webhook garbage action with the Agent's analysis result

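Root cause (full root-cause analysis of incident INC-20260416-C365D0):
- The Webhook inline LLM created ApprovalRecord.action = "kubectl rollout restart awoooi-prod"
- The Agent's analysis was correct (postgres disk → NO_ACTION), but it only sent a new Telegram card and did not overwrite the DB
- User approved the Agent card → system looked up incident_id → found the Webhook's stale ApprovalRecord
  → executed the garbage action (a rollout restart for a disk alert!)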

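Fix:
- approval_db.py: add update_action_by_incident_id() (updates PENDING records by incident_id)
- decision_manager.py: overwrite ApprovalRecord as soon as the Agent confirms the action
  If action="" (NO_ACTION): store "NO_ACTION - {description}" so the user knows the Agent recommends observing
  On approval, the Agent's correct recommendation is executed rather than the Webhook's generic action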

2026-04-16 ogt + Claude Sonnet 4.6 (Asia-Pacific)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OG T
2026-04-16 20:07:15 +08:00
parent a258d87767
commit 513232e90b
2 changed files with 62 additions and 0 deletions


@@ -691,6 +691,45 @@ class ApprovalDBService:
            params,
        )

    async def update_action_by_incident_id(self, incident_id: str, new_action: str) -> int:
        """
        Overwrite ApprovalRecord.action once the Agent Orchestrator finishes its analysis.

        Design motivation (2026-04-16 ogt + Claude Sonnet 4.6):
        - The Webhook inline LLM writes a garbage action (e.g. kubectl rollout restart for a postgres disk alert)
        - The Agent's analysis is correct, but it only sends a new Telegram card and does not overwrite ApprovalRecord
        - User approves the Agent card → system looks up incident_id → executes the stale webhook garbage action
        - Fix: the Agent calls this method when it finishes, so approval executes the correct action

        Args:
            incident_id: Incident ID in INC-xxx format
            new_action: action chosen by the Agent (empty string → no overwrite)

        Returns:
            int: rowcount (0 means no matching PENDING approval was found)
        """
        if not new_action:
            return 0

        async with get_db_context() as db:
            from sqlalchemy import text as _text

            result = await db.execute(
                _text("""
                    UPDATE approval_records
                    SET action = :new_action
                    WHERE incident_id = :incident_id
                      AND status = 'PENDING'
                """),
                {"incident_id": incident_id, "new_action": new_action},
            )
            # -1 signals that the driver did not report a row count
            rowcount = result.rowcount if hasattr(result, "rowcount") else -1

        logger.info(
            "approval_action_updated_by_agent",
            incident_id=incident_id,
            new_action=new_action[:80],
            rowcount=rowcount,
        )
        return rowcount
    # =========================================================================
    # Phase 6.4h: Proposals API support methods
    # =========================================================================
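
The effect of the `UPDATE ... WHERE incident_id = ... AND status = 'PENDING'` statement above can be checked in isolation. The following is a minimal sketch only: it swaps the project's async SQLAlchemy session for the standard-library `sqlite3` module, and the table layout, the `INC-DEMO-*` IDs, and the `APPROVED` status value are assumptions made for illustration, not the real schema.

```python
import sqlite3


def update_action_by_incident_id(conn, incident_id, new_action):
    """Overwrite the action of PENDING approvals for one incident; return rows changed."""
    if not new_action:
        return 0  # empty action: leave the stored record untouched
    cur = conn.execute(
        "UPDATE approval_records SET action = :new_action "
        "WHERE incident_id = :incident_id AND status = 'PENDING'",
        {"incident_id": incident_id, "new_action": new_action},
    )
    conn.commit()
    return cur.rowcount


WEBHOOK_ACTION = "kubectl rollout restart awoooi-prod"

# Hypothetical, simplified table; the real approval_records schema is not shown in this commit.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE approval_records ("
    "id INTEGER PRIMARY KEY, incident_id TEXT, action TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO approval_records (incident_id, action, status) VALUES (?, ?, ?)",
    [
        ("INC-DEMO-1", WEBHOOK_ACTION, "PENDING"),   # should be overwritten
        ("INC-DEMO-1", WEBHOOK_ACTION, "APPROVED"),  # not PENDING: left alone
        ("INC-DEMO-2", WEBHOOK_ACTION, "PENDING"),   # other incident: left alone
    ],
)

changed = update_action_by_incident_id(
    conn, "INC-DEMO-1", "NO_ACTION - postgres disk usage high; observe"
)
actions = [
    row[0]
    for row in conn.execute("SELECT action FROM approval_records ORDER BY id")
]
```

Only the first row is rewritten. Records that have already left the PENDING state, and records for other incidents, keep the action they were created with.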


@@ -218,6 +218,29 @@ async def _push_decision_to_telegram(
    # 2026-04-16 ogt + Claude Sonnet 4.6: pass the matched Playbook name through to the TG card (ADR-076)
    _playbook_name = proposal_data.get("playbook_name", "")

    # 2026-04-16 ogt + Claude Sonnet 4.6: overwrite the Webhook's garbage action
    # Problem: the Webhook inline LLM writes a generic action (e.g. kubectl rollout restart) when it creates the ApprovalRecord
    #   The Agent's analysis is correct, but it only sends a new Telegram card and does not overwrite ApprovalRecord.action
    #   User approves the Agent card → system looks up incident_id → executes the stale webhook garbage action
    # Fix: overwrite ApprovalRecord as soon as the Agent confirms the action (only when there is a value; an empty string does not overwrite)
    _agent_action = action  # action was already decided by _package_to_proposal_data + rule engine
    if not _agent_action and description:
        # NO_ACTION: use a summary of the description so the user knows the Agent recommends "observe/investigate"
        _agent_action = f"NO_ACTION - {description[:120]}"
    if _agent_action:
        try:
            from src.services.approval_db import get_approval_service as _get_approval_svc

            await _get_approval_svc().update_action_by_incident_id(
                incident_id=incident.incident_id,
                new_action=_agent_action,
            )
        except Exception as _update_err:
            # Best effort: a failed overwrite must not block the Telegram push
            logger.warning(
                "approval_action_update_by_agent_failed",
                incident_id=incident.incident_id,
                error=str(_update_err),
            )

    # Build approval_id (use incident_id for tracking)
    # 2026-03-27 ogt: fix the repeated INC-INC-INC- prefix bug
    approval_id = incident.incident_id  # already in INC-xxx format
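
The action-selection rule in this hunk (use the Agent's action if present, otherwise fall back to a `NO_ACTION - ...` summary, otherwise write nothing) is easier to unit-test as a pure function. The sketch below is one possible extraction, not code from this commit: the name `resolve_agent_action` and the `max_len` parameter are hypothetical.

```python
def resolve_agent_action(action: str, description: str, max_len: int = 120) -> str:
    """Return the string that should be written to ApprovalRecord.action.

    Hypothetical helper mirroring the inline logic in _push_decision_to_telegram:
    - a non-empty action wins unchanged
    - an empty action with a description becomes "NO_ACTION - <truncated description>"
    - an empty action with no description yields "", which the caller treats as "do not overwrite"
    """
    if action:
        return action
    if description:
        return f"NO_ACTION - {description[:max_len]}"
    return ""


restart = resolve_agent_action("kubectl rollout restart awoooi-prod", "pods crash-looping")
observe = resolve_agent_action("", "postgres disk usage high; observe")
nothing = resolve_agent_action("", "")
truncated = resolve_agent_action("", "x" * 500)
```

Keeping the rule in one place would also make the truncation limit explicit: the stored value is at most `len("NO_ACTION - ") + max_len` characters long when the fallback is used.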