awoooi/docs/adr/ADR-119-durable-execution-saga-compensation.md
Your Name 13e51802fe feat(awooop): Phase 0 all ADRs + Phase 1 control plane schema (including four critic fixes)
## Phase 0 (documentation layer, all Accepted)
- ADR-106/107: AwoooP platform architecture + storage strategy
- ADR-111~118: Bootstrap → RLS, seven core ADRs
- ADR-119~124: SAGA → Singleton Decomposition, six ADRs
- ADR-UI-01~04: four Operator Console UI ADRs

## Phase 1 (DB schema + migration)
- awooop_phase1_control_plane_2026-05-04.sql: 7 new tables + triggers + RLS
  - Step 1: three roles (platform_admin/migration with BYPASSRLS; awooop_app subject to RLS)
  - Step 13: GRANT least privileges to awooop_app (7 statements)
  - Step 14: RLS fail-closed; remove the __platform__ backdoor
- awooop_phase1_batch1_rls_2026-05-04.sql: three-step ADD COLUMN for the four high-traffic tables
- awooop_phase1_batch1_backfill.py: batched backfill script using SKIP LOCKED
- awooop_models.py: 7 SQLAlchemy 2.x models

## Critic fixes (4 Critical + 3 Major)
- C-1: ADD CONSTRAINT IF NOT EXISTS → DO block + pg_constraint lookup
- C-2: __mapper_args__ string list → primary_key=True on mapped_column
- C-3: __platform__ RLS backdoor → removed entirely; use a BYPASSRLS role instead
- C-4: awooop_app role was never created → Step 1 + 7 GRANT statements
- M-1: active_pointer_guard SECURITY DEFINER (cross-tenant protection under FORCE RLS)
- M-2: add an idempotency guard to pg_partman create_parent
- M-3: immutability trigger now also protects identity columns (project_id/family/contract_id)

## Task 1.2 fixes
- agent_loader.py: hard-coded Mac path → AGENTS_DIR environment variable
- Dockerfile: add COPY .claude/agents/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 13:37:11 +08:00


ADR-119: Durable Execution & SAGA Compensation

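Status: Accepted
Date: 2026-05-03 (Taipei)
Decision maker: 統帥 (Commander)
Scope: step-level journal for agent runs, compensation command design, SAGA trigger conditions
Related: ADR-114 (worker lease), ADR-106 (Run State contract)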


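Background

ADR-114 solved run reclamation after a worker crash (the stale reaper). But the stale reaper only returns the run to the PENDING state so that a worker can retry it. This creates new problems:

Problem 1: there is no record of which step the run had reached when it crashed

After a worker picks the run up again, it executes from the beginning and repeats calls to steps that already completed (for example, a K8s repair that was already written gets written again on retry).

Problem 2: side effects of partial execution are irreversible

The agent has already executed k8s_exec_pod (restarting a Pod), but a later step failed. The retry restarts the Pod again, so the service blips twice.

Problem 3: approval decisions cannot be compensated

After a run in WAITING_APPROVAL is rejected, the side effects of tools the agent has already called cannot be rolled back.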

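Design goals

  • Step-level idempotency: steps can be retried without producing duplicate side effects
  • Compensation: a design for compensation commands
  • No dependency on an external durable execution service (Temporal, etc.); implemented with DB + Redis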

Decisions

D1: SAGA step journal (saga_steps JSONB)

Add a saga_steps JSONB column to the awooop_run_state table:

ALTER TABLE awooop_run_state
ADD COLUMN saga_steps JSONB NOT NULL DEFAULT '[]';

-- saga_steps structure (JSONB array)
-- [
--   {
--     "step_id": "uuid",
--     "step_type": "llm_call" | "tool_call" | "approval_request" | "callback",
--     "tool_id": "k8s_exec_pod",          -- set for tool_call
--     "tool_params": {...},               -- set for tool_call (used for compensation)
--     "status": "pending" | "completed" | "failed" | "compensated" | "compensation_failed",
--     "result": {...},                    -- step result (set when completed)
--     "error": "...",                     -- error message (set when failed or compensation_failed)
--     "compensation_cmd": {...},          -- compensation command (tool_call only)
--     "started_at": "ISO8601",
--     "completed_at": "ISO8601"
--   }
-- ]
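
As an illustration of the journal entry shape described above, here is a minimal validation sketch. The function name `validate_saga_step` and its rules are assumptions made for this example, not part of the ADR; the allowed values come from the schema comment, plus the `compensation_failed` status that the D4 compensation loop writes.

```python
from typing import Any

# Allowed values from the saga_steps comment; "compensation_failed" is the
# extra status written by the D4 compensation loop.
STEP_TYPES = {"llm_call", "tool_call", "approval_request", "callback"}
STEP_STATUSES = {"pending", "completed", "failed", "compensated", "compensation_failed"}


def validate_saga_step(step: dict[str, Any]) -> list[str]:
    """Return the problems found in one journal entry (empty list means valid).

    Hypothetical helper: the ADR defers format validation to Phase 3.
    """
    errors: list[str] = []
    step_id = step.get("step_id")
    if not isinstance(step_id, str) or not step_id:
        errors.append("step_id must be a non-empty string")
    if step.get("step_type") not in STEP_TYPES:
        errors.append(f"unknown step_type: {step.get('step_type')!r}")
    if step.get("status") not in STEP_STATUSES:
        errors.append(f"unknown status: {step.get('status')!r}")
    # tool_id is only required for tool_call steps.
    if step.get("step_type") == "tool_call" and not step.get("tool_id"):
        errors.append("tool_call steps need a tool_id")
    # compensation_cmd is only allowed on tool_call steps.
    if step.get("compensation_cmd") and step.get("step_type") != "tool_call":
        errors.append("only tool_call steps may carry compensation_cmd")
    return errors
```

A check like this could run before each journal write so that malformed entries are rejected early rather than discovered during compensation.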

D2: Step idempotency

After a worker picks the run up again, it skips the steps that have already completed:

async def resume_run_from_checkpoint(run: RunState, plan_steps: list[Step]) -> None:
    # IDs of the steps the journal already records as completed.
    completed_steps = {
        step["step_id"]
        for step in run.saga_steps
        if step["status"] == "completed"
    }

    for step in plan_steps:
        if step.step_id in completed_steps:
            # Skip the completed step and reuse its recorded result.
            result = get_step_result(run.saga_steps, step.step_id)
        else:
            # Execute the step and write it to the journal.
            result = await execute_step(step)
            await append_saga_step(run.run_id, step, result)

Before each step runs, the journal records it as "pending"; after it runs, the entry is updated to "completed" or "failed":

async def execute_step_with_journal(run_id: str, step: Step) -> StepResult:
    # Write "pending" first (guards against duplicate execution).
    await upsert_saga_step(run_id, step.step_id, status="pending")

    try:
        result = await step.execute()
        # Record the result so a retry can reuse it instead of re-executing.
        await upsert_saga_step(run_id, step.step_id, status="completed", result=result)
        return result
    except Exception as e:
        # Record the failure, then let the caller decide how to handle it.
        await upsert_saga_step(run_id, step.step_id, status="failed", error=str(e))
        raise
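
To make the resume behaviour concrete, the following self-contained sketch simulates a crash and a retry against an in-memory journal. It is an illustration under stated assumptions: the dict-based `Journal`, `run_with_journal`, and `demo_resume` are hypothetical stand-ins for the DB-backed helpers above, which the ADR does not define.

```python
import asyncio
from typing import Any, Awaitable, Callable

# In-memory stand-in for awooop_run_state.saga_steps, keyed by step_id.
Journal = dict[str, dict[str, Any]]
StepFn = Callable[[], Awaitable[Any]]


async def run_with_journal(journal: Journal, plan: list[tuple[str, StepFn]]) -> list[Any]:
    """Run the plan in order, skipping steps the journal marks as completed."""
    results: list[Any] = []
    for step_id, fn in plan:
        entry = journal.get(step_id)
        if entry and entry["status"] == "completed":
            results.append(entry["result"])  # reuse the recorded result
            continue
        journal[step_id] = {"status": "pending"}
        try:
            result = await fn()
        except Exception as exc:
            journal[step_id] = {"status": "failed", "error": str(exc)}
            raise
        journal[step_id] = {"status": "completed", "result": result}
        results.append(result)
    return results


async def demo_resume() -> tuple[list[Any], list[str]]:
    calls: list[str] = []
    crash_once = {"armed": True}

    async def scale() -> str:
        calls.append("scale")
        return "scaled"

    async def notify() -> str:
        calls.append("notify")
        if crash_once["armed"]:
            crash_once["armed"] = False
            raise RuntimeError("simulated worker crash")
        return "notified"

    journal: Journal = {}
    plan = [("step-1", scale), ("step-2", notify)]
    try:
        await run_with_journal(journal, plan)  # first attempt fails in step-2
    except RuntimeError:
        pass
    results = await run_with_journal(journal, plan)  # retry resumes from the journal
    return results, calls
```

In this simulation `scale` runs once and `notify` runs twice: the retry reuses the recorded result of step-1 and only re-executes the step that did not complete.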

D3: Compensation command design

Only tool_call steps carry a compensation command (LLM calls are irreversible):

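| Tool | Compensation command | Compensable when |
|------|----------------------|------------------|
| k8s_exec_pod (exec class) | N/A (side effects are irreversible) | Not compensated |
| k8s_scale_deployment (scale to N) | scale back to the original value | the original replica count is recorded |
| k8s_create_resource | k8s_delete_resource | the resource name is recorded |
| pg_mutate (INSERT) | pg_mutate (DELETE of the corresponding ID) | inserted_id is recorded |
| knowledge_write | knowledge_delete | entry_id is recorded |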

Compensation command format (inside saga_steps):

{
  "step_id": "uuid",
  "tool_id": "k8s_scale_deployment",
  "tool_params": {"deployment": "api", "replicas": 3},
  "status": "completed",
  "compensation_cmd": {
    "tool_id": "k8s_scale_deployment",
    "params": {"deployment": "api", "replicas": 1},  // original value
    "reason": "SAGA compensation: scale deployment back"
  }
}
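
One way such a compensation_cmd could be derived at the moment a tool call is journaled is sketched below. The function `build_compensation_cmd`, the `observed` dict, and its keys (`original_replicas`, `entry_id`) are assumptions for illustration, as are the `name` and `entry_id` parameter names of the delete tools. pg_mutate is left out because the ADR does not show its parameter shape.

```python
from typing import Any, Optional


def build_compensation_cmd(
    tool_id: str, tool_params: dict[str, Any], observed: dict[str, Any]
) -> Optional[dict[str, Any]]:
    """Return a compensation_cmd for a tool call, or None if it cannot be undone.

    `observed` holds values captured around the call (hypothetical keys).
    """
    if tool_id == "k8s_scale_deployment" and "original_replicas" in observed:
        return {
            "tool_id": "k8s_scale_deployment",
            "params": {
                "deployment": tool_params["deployment"],
                "replicas": observed["original_replicas"],  # value before the change
            },
            "reason": "SAGA compensation: scale deployment back",
        }
    if tool_id == "k8s_create_resource" and "name" in tool_params:
        return {
            "tool_id": "k8s_delete_resource",
            "params": {"name": tool_params["name"]},
            "reason": "SAGA compensation: delete created resource",
        }
    if tool_id == "knowledge_write" and "entry_id" in observed:
        return {
            "tool_id": "knowledge_delete",
            "params": {"entry_id": observed["entry_id"]},
            "reason": "SAGA compensation: delete written entry",
        }
    # k8s_exec_pod and anything without the required context is not compensable.
    return None
```

Returning None when the precondition is missing keeps the rule from the table explicit: a scale step without a recorded original replica count is journaled without a compensation_cmd rather than with a guessed one.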

D4: SAGA trigger conditions

Situations that trigger compensation automatically:

SAGA_COMPENSATION_TRIGGERS = [
    # 1. Approval is rejected
    RunStatus.WAITING_APPROVAL → reject → trigger_compensation,

    # 2. approval_token expires (15 min TTL)
    approval_token_expired → trigger_compensation,

    # 3. A later step fails and steps that already have a compensation_cmd need rollback
    step_failed AND has_compensatable_steps → trigger_compensation,

    # 4. Explicit CANCEL
    user_cancel → trigger_compensation if has_compensatable_steps,
]
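
The trigger list above is pseudocode. A runnable reading of it might look like the sketch below, assuming a hypothetical `SagaEvent` enum and a `should_trigger_compensation` function, neither of which appears in the ADR. Triggers 1 and 2 are treated as unconditional and triggers 3 and 4 as dependent on compensatable steps, following the list as written.

```python
from enum import Enum
from typing import Any


class SagaEvent(Enum):
    """Hypothetical event names for the four triggers listed above."""
    APPROVAL_REJECTED = "approval_rejected"
    APPROVAL_TOKEN_EXPIRED = "approval_token_expired"
    STEP_FAILED = "step_failed"
    USER_CANCEL = "user_cancel"


def has_compensatable_steps(saga_steps: list[dict[str, Any]]) -> bool:
    """True if any completed step carries a compensation_cmd."""
    return any(
        s.get("status") == "completed" and s.get("compensation_cmd")
        for s in saga_steps
    )


def should_trigger_compensation(event: SagaEvent, saga_steps: list[dict[str, Any]]) -> bool:
    # Triggers 1 and 2: approval rejected or approval_token expired.
    if event in (SagaEvent.APPROVAL_REJECTED, SagaEvent.APPROVAL_TOKEN_EXPIRED):
        return True
    # Triggers 3 and 4: only when there is something to roll back.
    return has_compensatable_steps(saga_steps)
```

When an unconditional trigger fires on a run with no compensatable steps, the compensation loop simply finds nothing to do.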

Compensation order: reverse (the step executed last is compensated first):

async def run_saga_compensation(run: RunState) -> None:
    # Completed steps that carry a compensation_cmd, newest first.
    completed_steps_with_compensation = [
        step for step in reversed(run.saga_steps)
        if step["status"] == "completed" and step.get("compensation_cmd")
    ]

    for step in completed_steps_with_compensation:
        cmd = step["compensation_cmd"]
        try:
            # Compensation commands go through the same MCP tool path (see ADR-117).
            await execute_mcp_tool(
                run_id=run.run_id,
                tool_id=cmd["tool_id"],
                tool_params=cmd["params"],
                project_id=run.project_id
            )
            await update_saga_step(run.run_id, step["step_id"], status="compensated")
        except Exception as e:
            # A failed compensation does not raise; keep compensating the remaining steps.
            await update_saga_step(run.run_id, step["step_id"], status="compensation_failed",
                                   error=str(e))
            await write_audit_log("SAGA_COMPENSATION_FAILED", run.run_id, step, str(e))
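
The ordering and the continue-on-failure behaviour can be exercised without a database. The sketch below is a hypothetical in-memory variant: `compensate_in_reverse`, the `ToolRunner` callback, and `demo_compensation` are assumptions standing in for `execute_mcp_tool` and the journal helpers.

```python
import asyncio
from typing import Any, Awaitable, Callable

ToolRunner = Callable[[str, dict[str, Any]], Awaitable[None]]


async def compensate_in_reverse(
    saga_steps: list[dict[str, Any]], run_tool: ToolRunner
) -> list[str]:
    """Compensate completed steps newest-first; return step_ids in attempt order."""
    attempted: list[str] = []
    for step in reversed(saga_steps):
        cmd = step.get("compensation_cmd")
        if step["status"] != "completed" or not cmd:
            continue  # nothing to undo for this step
        attempted.append(step["step_id"])
        try:
            await run_tool(cmd["tool_id"], cmd["params"])
            step["status"] = "compensated"
        except Exception as exc:
            # Record the failure and keep going with older steps.
            step["status"] = "compensation_failed"
            step["error"] = str(exc)
    return attempted


async def demo_compensation() -> tuple[list[str], list[str]]:
    steps = [
        {"step_id": "a", "status": "completed",
         "compensation_cmd": {"tool_id": "k8s_scale_deployment",
                              "params": {"deployment": "api", "replicas": 1}}},
        {"step_id": "b", "status": "completed"},  # e.g. an llm_call: no compensation_cmd
        {"step_id": "c", "status": "completed",
         "compensation_cmd": {"tool_id": "k8s_delete_resource",
                              "params": {"name": "tmp-job"}}},
    ]

    async def flaky_tool(tool_id: str, params: dict[str, Any]) -> None:
        if tool_id == "k8s_delete_resource":
            raise RuntimeError("delete failed")

    order = await compensate_in_reverse(steps, flaky_tool)
    return order, [s["status"] for s in steps]
```

In this demo step c is attempted first and fails, step a is still compensated afterwards, and step b is left untouched because it has no compensation_cmd.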

D5: SAGA status columns

Additional columns on awooop_run_state:

ALTER TABLE awooop_run_state
ADD COLUMN saga_status VARCHAR(32) DEFAULT 'in_progress';
-- 'in_progress' | 'compensating' | 'compensated' | 'compensation_failed'

ALTER TABLE awooop_run_state
ADD COLUMN compensation_started_at TIMESTAMPTZ;

ALTER TABLE awooop_run_state
ADD COLUMN compensation_completed_at TIMESTAMPTZ;
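
The ADR does not spell out how the run-level saga_status is derived from the per-step results. One plausible rule, shown here purely as an assumption, is that a run ends in compensation_failed if any step ended that way and in compensated otherwise. The function name `compensation_finished_update` is hypothetical.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def compensation_finished_update(
    saga_steps: list[dict[str, Any]], now: Optional[datetime] = None
) -> dict[str, Any]:
    """Column values to write once the compensation loop has finished.

    Assumed rule: any step in compensation_failed makes the whole run
    compensation_failed; otherwise the run is compensated.
    """
    finished_at = now or datetime.now(timezone.utc)
    failed = any(s.get("status") == "compensation_failed" for s in saga_steps)
    return {
        "saga_status": "compensation_failed" if failed else "compensated",
        "compensation_completed_at": finished_at,
    }
```

The returned dict maps directly onto the saga_status and compensation_completed_at columns added above, so it could feed a single UPDATE statement.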

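D6: Cases that are not compensated

In the following cases no SAGA compensation is executed; only an audit log entry is written:

  • step_type = "llm_call" (LLM output is irreversible, so compensation is meaningless)
  • tool_id = "k8s_exec_pod" (exec side effects are irreversible)
  • attempt_count >= max_attempts (maximum retries exceeded; go straight to FAILED)
  • status = "CANCELLED" with no compensatable steps (end the run directly)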
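
The step-level exclusions in this list could be expressed as a small classifier that returns the reason to record in the audit log. This is a sketch under assumptions: `step_skip_reason` and `NON_COMPENSATABLE_TOOLS` are hypothetical names, and k8s_exec_pod is matched on tool_id because that is where the D1 schema stores tool names.

```python
from typing import Any, Optional

# Tools whose side effects cannot be undone (from the D3 table).
NON_COMPENSATABLE_TOOLS = {"k8s_exec_pod"}


def step_skip_reason(step: dict[str, Any]) -> Optional[str]:
    """Return why a step is not compensated, or None if it can be."""
    if step.get("step_type") == "llm_call":
        return "llm_call output is irreversible"
    if step.get("tool_id") in NON_COMPENSATABLE_TOOLS:
        return "exec side effects are irreversible"
    if not step.get("compensation_cmd"):
        return "no compensation_cmd recorded"
    return None
```

The run-level cases (attempt_count >= max_attempts, CANCELLED with nothing to compensate) are decided before the loop starts and are therefore not covered by this per-step helper.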

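Relationship to ADR-114

| Mechanism | ADR-114 | ADR-119 |
|-----------|---------|---------|
| Recovery after a worker crash | Stale reaper reclaims the run → PENDING | ✓ (triggered by ADR-114) |
| Skipping completed steps on retry | | saga_steps journal checkpoint |
| Rolling back side effects of partial execution | | compensation_cmd executed in reverse order |
| Handling approval reject | FAILED | SAGA compensation + FAILED |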

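Consequences

Benefits

  • After a worker crash, the retry does not re-execute completed steps (idempotency)
  • After an approval reject, K8s operations can be selectively rolled back (k8s_scale, k8s_create)
  • The saga_steps journal is a complete audit record of execution

Costs

  • Each step needs an extra DB write (journal upsert)
  • The saga_steps JSONB column grows with the number of executed steps (a long run may exceed 1 MB)
  • Mitigation: store only the necessary fields of each step result; large results go to S3/GCS and only a reference is stored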
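
The mitigation for journal growth can be sketched as a function that decides, per step, whether the result is stored inline or replaced by a reference. Everything named here is an assumption for illustration: `compact_result`, the 4096-byte threshold, the object key layout, and the dict used in place of an S3/GCS client.

```python
import json
from typing import Any

MAX_INLINE_BYTES = 4096  # hypothetical threshold; the ADR does not specify one


def compact_result(
    step_id: str,
    result: Any,
    blob_store: dict[str, bytes],
    max_inline_bytes: int = MAX_INLINE_BYTES,
) -> dict[str, Any]:
    """Return what to store in the journal: the value itself or a reference."""
    payload = json.dumps(result, ensure_ascii=False).encode("utf-8")
    if len(payload) <= max_inline_bytes:
        return {"inline": result}
    key = f"saga-results/{step_id}.json"  # hypothetical object key layout
    blob_store[key] = payload  # stand-in for an S3/GCS upload
    return {"ref": key, "size_bytes": len(payload)}
```

With this shape a reader of the journal can tell from the presence of `ref` that the full result has to be fetched from object storage, while small results stay directly usable for the resume path in D2.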

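Risks

  • Compensation itself fails (compensation_failed): the operation has been partially applied and the system is left in an inconsistent state
  • Mitigation: audit_log + Telegram alert; manual intervention is required
  • Irreversible operations such as K8s exec cannot be compensated: they are explicitly marked as such by design, and HIGH risk tools get an approval gate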

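Acceptance criteria

  • awooop_run_state gains the saga_steps JSONB column (Phase 1 migration)
  • After a worker crash, the retry skips completed steps (simulation test)
  • Approval reject → the k8s_scale compensation command is executed (integration test)
  • compensation_failed → audit_log entry written + Telegram alert (integration test)
  • saga_steps JSONB format validation (Phase 3 schema validation)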

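Related

  • ADR-114: worker lease (the stale reaper is the trigger entry point)
  • ADR-106: Run State contract (run state machine)
  • ADR-116: approval_token (reject triggers SAGA compensation)
  • ADR-117: MCP tool execution (compensation commands also go through the MCP Gateway)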