fix(decision_manager): overwrite the Webhook's garbage action with the Agent's analysis result
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 28m30s
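Root cause (full root-cause analysis of incident INC-20260416-C365D0):
- The Webhook inline LLM created ApprovalRecord.action = "kubectl rollout restart awoooi-prod"
- The Agent's analysis was correct (postgres disk → NO_ACTION), but it only sent a new Telegram card and never overwrote the DB record
- The user approved the Agent card → the system looked up incident_id → found the Webhook's stale ApprovalRecord
  → executed the garbage action (a rollout restart for a disk alert!)
Fix:
- approval_db.py: add update_action_by_incident_id() (updates PENDING records by incident_id)
- decision_manager.py: overwrite the ApprovalRecord as soon as the Agent settles on an action
  If action="" (NO_ACTION): store "NO_ACTION - {description}" so the user knows the Agent recommends observing
  What runs on approval is now the Agent's correct recommendation, not the Webhook's generic action
2026-04-16 ogt + Claude Sonnet 4.6 (Asia-Pacific)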
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
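
The failure mode described in the commit message can be reproduced with a small in-memory sketch. Everything here is a hypothetical stand-in: the `ApprovalRecord` dataclass, the `approve` and `run_incident` helpers, and the NO_ACTION description text are illustrative and are not the project's real models or code.

```python
from dataclasses import dataclass


@dataclass
class ApprovalRecord:
    """Hypothetical stand-in for the real ApprovalRecord row."""
    incident_id: str
    action: str
    status: str = "PENDING"


def approve(records: dict, incident_id: str) -> str:
    """Approval looks the record up by incident_id and returns the stored action."""
    record = records[incident_id]
    record.status = "APPROVED"
    return record.action


def run_incident(agent_overwrites: bool) -> str:
    incident_id = "INC-20260416-C365D0"
    # 1. The Webhook path creates the record with a generic action.
    records = {
        incident_id: ApprovalRecord(incident_id, "kubectl rollout restart awoooi-prod")
    }
    # 2. The Agent concludes NO_ACTION. Before the fix it only sent a new card.
    agent_action = "NO_ACTION - postgres disk usage, observe"  # illustrative text
    if agent_overwrites and records[incident_id].status == "PENDING":
        records[incident_id].action = agent_action
    # 3. The user approves the Agent card; the stored action is what would run.
    return approve(records, incident_id)


before_fix = run_incident(agent_overwrites=False)
after_fix = run_incident(agent_overwrites=True)
print("before fix:", before_fix)
print("after fix: ", after_fix)
```

The point of the sketch is that the approval path trusts whatever is stored under the incident_id, so the record has to be corrected before the user can approve it.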
@@ -691,6 +691,45 @@ class ApprovalDBService:
                 params,
             )
 
+    async def update_action_by_incident_id(self, incident_id: str, new_action: str) -> int:
+        """
+        Overwrite ApprovalRecord.action once the Agent Orchestrator finishes its analysis.
+
+        Design motivation (2026-04-16 ogt + Claude Sonnet 4.6):
+        - The Webhook inline LLM writes a garbage action (e.g. kubectl rollout restart for a postgres disk alert)
+        - The Agent's analysis is correct, but it only sends a new Telegram card and never overwrites the ApprovalRecord
+        - The user approves the Agent card → the system looks up incident_id → the stale Webhook garbage action runs
+        - Fix: call this method once the Agent finishes, so that approval executes the correct action
+
+        Args:
+            incident_id: Incident ID in INC-xxx format
+            new_action: action chosen by the Agent (empty string → no overwrite)
+
+        Returns:
+            int: rowcount (0 means no matching PENDING approval was found)
+        """
+        if not new_action:
+            return 0
+        async with get_db_context() as db:
+            from sqlalchemy import text as _text
+            result = await db.execute(
+                _text("""
+                    UPDATE approval_records
+                    SET action = :new_action
+                    WHERE incident_id = :incident_id
+                      AND status = 'PENDING'
+                """),
+                {"incident_id": incident_id, "new_action": new_action},
+            )
+            rowcount = result.rowcount if hasattr(result, "rowcount") else -1
+            logger.info(
+                "approval_action_updated_by_agent",
+                incident_id=incident_id,
+                new_action=new_action[:80],
+                rowcount=rowcount,
+            )
+            return rowcount
+
     # =========================================================================
     # Phase 6.4h: Proposals API support methods
     # =========================================================================
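
To make the semantics of the UPDATE above concrete, here is a minimal synchronous sketch using the standard-library `sqlite3` module instead of the project's async SQLAlchemy session. The table schema, the `EXECUTED` status value, and the sample rows are assumptions made for illustration; only the SQL statement and the empty-string guard mirror the diff.

```python
import sqlite3


def update_action_by_incident_id(conn: sqlite3.Connection, incident_id: str, new_action: str) -> int:
    """Synchronous sqlite3 analogue of the method in the diff (illustration only)."""
    if not new_action:
        return 0  # an empty string never overwrites
    cur = conn.execute(
        "UPDATE approval_records SET action = :new_action "
        "WHERE incident_id = :incident_id AND status = 'PENDING'",
        {"incident_id": incident_id, "new_action": new_action},
    )
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
# Assumed minimal schema; the real table certainly has more columns.
conn.execute(
    "CREATE TABLE approval_records "
    "(id INTEGER PRIMARY KEY, incident_id TEXT, action TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO approval_records (incident_id, action, status) VALUES (?, ?, ?)",
    [
        ("INC-1", "kubectl rollout restart awoooi-prod", "PENDING"),
        ("INC-1", "kubectl rollout restart awoooi-prod", "EXECUTED"),
        ("INC-2", "kubectl rollout restart awoooi-prod", "PENDING"),
    ],
)

updated = update_action_by_incident_id(conn, "INC-1", "NO_ACTION - observe disk usage")
skipped = update_action_by_incident_id(conn, "INC-1", "")
missing = update_action_by_incident_id(conn, "INC-404", "NO_ACTION - nothing pending")
rows = conn.execute(
    "SELECT incident_id, action, status FROM approval_records ORDER BY id"
).fetchall()
```

Only the PENDING row for the matching incident changes; already-resolved records and other incidents are left alone. If several PENDING records shared one incident_id, the statement would update all of them and the returned rowcount would be greater than 1.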
@@ -218,6 +218,29 @@ async def _push_decision_to_telegram(
     # 2026-04-16 ogt + Claude Sonnet 4.6: pass the matched Playbook name through to the TG card (ADR-076)
     _playbook_name = proposal_data.get("playbook_name", "")
 
+    # 2026-04-16 ogt + Claude Sonnet 4.6: overwrite the Webhook's garbage action
+    # Problem: the Webhook inline LLM writes a generic action (e.g. kubectl rollout restart) when it creates the ApprovalRecord
+    #   The Agent's analysis is correct, but it only sends a new Telegram card and never overwrites ApprovalRecord.action
+    #   The user approves the Agent card → the system looks up incident_id → the stale Webhook garbage action runs
+    # Fix: overwrite the ApprovalRecord as soon as the Agent settles on an action (only when non-empty; an empty string never overwrites)
+    _agent_action = action  # action has already been determined by _package_to_proposal_data + the rule engine
+    if not _agent_action and description:
+        # NO_ACTION: store a description summary so the user knows the Agent recommends "observe/investigate"
+        _agent_action = f"NO_ACTION - {description[:120]}"
+    if _agent_action:
+        try:
+            from src.services.approval_db import get_approval_service as _get_approval_svc
+            await _get_approval_svc().update_action_by_incident_id(
+                incident_id=incident.incident_id,
+                new_action=_agent_action,
+            )
+        except Exception as _update_err:
+            logger.warning(
+                "approval_action_update_by_agent_failed",
+                incident_id=incident.incident_id,
+                error=str(_update_err),
+            )
+
     # Build approval_id (use incident_id for tracking)
     # 2026-03-27 ogt: fix the duplicated INC-INC-INC- prefix bug
     approval_id = incident.incident_id  # already in INC-xxx format
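
The action-selection rule in the hunk above can be isolated as a pure function, which makes the three cases easy to check. This is a sketch only: `resolve_agent_action` and its `max_len` parameter are hypothetical names that do not appear in the commit, and the sample strings are made up.

```python
def resolve_agent_action(action: str, description: str, max_len: int = 120) -> str:
    """Return the string to store in the approval record, or "" for no overwrite."""
    if action:
        # The Agent chose a concrete action: store it unchanged.
        return action
    if description:
        # NO_ACTION: store a truncated summary so the approval card explains itself.
        return f"NO_ACTION - {description[:max_len]}"
    # Neither an action nor a description: nothing to overwrite with.
    return ""


concrete = resolve_agent_action("kubectl scale deploy/api --replicas=3", "scale out")
no_action = resolve_agent_action("", "postgres disk usage is high, observe")
nothing = resolve_agent_action("", "")
truncated = resolve_agent_action("", "x" * 500)
```

One consequence worth noting: when both `action` and `description` are empty, the hunk performs no overwrite, so the Webhook's original action stays in the PENDING record. Whether that case can occur in practice depends on code outside this diff.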