ADR-119: Durable Execution & SAGA Compensation
Status: Accepted
Date: 2026-05-03 (Taipei)
Decision maker: 統帥 (Commander)
Scope: step-level journal for agent runs, compensation command design, SAGA trigger conditions
Related: ADR-114 (worker lease), ADR-106 (Run State contract)
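Background
ADR-114 solved the problem of reclaiming a run after a worker crash (stale reaper). But the stale reaper only returns the run to the PENDING state so that a worker can retry it. That introduces new problems:
Problem 1: there is no way to know which step the run had reached when it crashed
After a worker picks the run up again, it executes from the beginning and calls steps that already completed a second time (for example: a K8s repair was already written, and the retry writes it again).
Problem 2: side effects of partial execution are irreversible
The agent has already executed k8s_exec_pod (restarting some Pod), but a later step fails. The retry restarts the Pod again, so the service blips twice.
Problem 3: approval decisions cannot be compensated
After a run in WAITING_APPROVAL is rejected, the side effects of the tools the agent has already called cannot be rolled back.
Design goals:
- Step-level idempotency (a step can be retried without producing duplicate side effects)
- Compensation command design
- No dependency on an external durable execution service (Temporal and the like); implemented with DB + Redis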
Decision
D1: SAGA step journal (saga_steps JSONB)
The awooop_run_state table gains a saga_steps JSONB column:
ALTER TABLE awooop_run_state
  ADD COLUMN saga_steps JSONB NOT NULL DEFAULT '[]';
-- saga_steps structure (JSONB array)
-- [
--   {
--     "step_id": "uuid",
--     "step_type": "llm_call" | "tool_call" | "approval_request" | "callback",
--     "tool_id": "k8s_exec_pod",     -- set for tool_call steps
--     "tool_params": {...},          -- set for tool_call steps (used by compensation)
--     "status": "pending" | "completed" | "failed" | "compensated" | "compensation_failed",
--     "result": {...},               -- step result (set when completed)
--     "compensation_cmd": {...},     -- compensation command (tool_call steps only)
--     "started_at": "ISO8601",
--     "completed_at": "ISO8601"
--   }
-- ]
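The journal entry shape above can be sketched in Python as a small builder plus a validator. This is only an illustration: `new_saga_step` and `validate_saga_step` are hypothetical helper names, not part of the ADR, and the status set includes `compensation_failed` because the compensation loop in D4 writes that value.

```python
import uuid
from datetime import datetime, timezone
from typing import Any, Optional

STEP_TYPES = {"llm_call", "tool_call", "approval_request", "callback"}
# "compensation_failed" is written by the compensation loop in D4.
STEP_STATUSES = {"pending", "completed", "failed", "compensated", "compensation_failed"}


def new_saga_step(step_type: str, tool_id: Optional[str] = None,
                  tool_params: Optional[dict[str, Any]] = None) -> dict[str, Any]:
    """Build a pending journal entry in the shape described for saga_steps."""
    if step_type not in STEP_TYPES:
        raise ValueError(f"unknown step_type: {step_type}")
    if step_type == "tool_call" and tool_id is None:
        raise ValueError("tool_call steps need a tool_id")
    step: dict[str, Any] = {
        "step_id": str(uuid.uuid4()),
        "step_type": step_type,
        "status": "pending",
        "started_at": datetime.now(timezone.utc).isoformat(),
    }
    if step_type == "tool_call":
        # tool_params are kept so a compensation command can be derived later.
        step["tool_id"] = tool_id
        step["tool_params"] = tool_params or {}
    return step


def validate_saga_step(step: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the entry looks well-formed."""
    problems = [f"missing {key}" for key in ("step_id", "step_type", "status")
                if key not in step]
    if "step_type" in step and step["step_type"] not in STEP_TYPES:
        problems.append("bad step_type")
    if "status" in step and step["status"] not in STEP_STATUSES:
        problems.append("bad status")
    if "compensation_cmd" in step and step.get("step_type") != "tool_call":
        problems.append("compensation_cmd is only valid on tool_call steps")
    return problems
```

A validator of this kind could back the saga_steps format check, whatever schema tooling is eventually chosen.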
D2: Step idempotency
After a worker picks the run up again, it skips the steps that already completed:
async def resume_run_from_checkpoint(run: RunState, plan_steps: list[Step]) -> None:
    # step_ids already journaled as completed
    completed_steps = {
        step["step_id"]
        for step in run.saga_steps
        if step["status"] == "completed"
    }
    for step in plan_steps:
        if step.step_id in completed_steps:
            # Skip the completed step and reuse the recorded result
            result = get_step_result(run.saga_steps, step.step_id)
        else:
            # Execute the step and write it to the journal
            result = await execute_step(step)
            await append_saga_step(run.run_id, step, result)
Before each step runs, the journal records it as "pending"; after it runs, the entry is updated to "completed" or "failed":
async def execute_step_with_journal(run_id: str, step: Step) -> StepResult:
    # Record the step as pending first (guards against duplicate execution)
    await upsert_saga_step(run_id, step.step_id, status="pending")
    try:
        result = await step.execute()
        await upsert_saga_step(run_id, step.step_id, status="completed", result=result)
        return result
    except Exception as e:
        await upsert_saga_step(run_id, step.step_id, status="failed", error=str(e))
        raise
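To make the skip-on-retry behaviour concrete, here is a self-contained sketch that replaces the database with an in-memory dict. Everything in it (`JOURNAL`, `run_plan`, `demo`) is hypothetical and only mirrors the pending/completed/failed protocol described above. It also assumes that step IDs are stable across attempts, since a fresh ID on each retry would never match a journaled entry.

```python
import asyncio
from typing import Any, Awaitable, Callable

# In-memory stand-in for the saga_steps column: {run_id: {step_id: entry}}.
JOURNAL: dict[str, dict[str, dict[str, Any]]] = {}

Action = Callable[[], Awaitable[Any]]


async def upsert_saga_step(run_id: str, step_id: str, **fields: Any) -> None:
    """Create or update one journal entry."""
    entry = JOURNAL.setdefault(run_id, {}).setdefault(step_id, {"step_id": step_id})
    entry.update(fields)


async def run_plan(run_id: str, plan: list[tuple[str, Action]]) -> list[Any]:
    """Execute the plan, reusing journaled results for steps already completed."""
    results = []
    for step_id, action in plan:
        entry = JOURNAL.get(run_id, {}).get(step_id)
        if entry and entry.get("status") == "completed":
            results.append(entry["result"])  # skip: replay the recorded result
            continue
        await upsert_saga_step(run_id, step_id, status="pending")
        try:
            result = await action()
        except Exception as exc:
            await upsert_saga_step(run_id, step_id, status="failed", error=str(exc))
            raise
        await upsert_saga_step(run_id, step_id, status="completed", result=result)
        results.append(result)
    return results


async def demo() -> tuple[list[Any], dict[str, int]]:
    calls = {"scale": 0, "notify": 0}
    crash = {"armed": True}

    async def scale() -> str:
        calls["scale"] += 1
        return "scaled"

    async def notify() -> str:
        calls["notify"] += 1
        if crash["armed"]:
            crash["armed"] = False
            raise RuntimeError("simulated worker crash")
        return "notified"

    plan = [("step-1", scale), ("step-2", notify)]
    try:
        await run_plan("run-1", plan)        # first attempt dies in step-2
    except RuntimeError:
        pass
    results = await run_plan("run-1", plan)  # retry skips step-1
    return results, calls
```

Running `asyncio.run(demo())` executes `scale` once across both attempts, while `notify` runs twice because its first attempt was journaled as failed rather than completed.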
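D3: Compensation command design
Only tool_call steps carry a compensation command (an LLM call cannot be undone):
| Tool | Compensation command | Compensable when |
|---|---|---|
| k8s_exec_pod (exec-type) | N/A (side effects are irreversible) | Never compensated |
| k8s_scale_deployment (scale to N) | Scale back to the original value | Original replica count is recorded |
| k8s_create_resource | k8s_delete_resource | Resource name is recorded |
| pg_mutate (INSERT) | pg_mutate (DELETE of the matching ID) | inserted_id is recorded |
| knowledge_write | knowledge_delete | entry_id is recorded |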
Compensation command format (inside saga_steps); the replicas value in compensation_cmd.params is the original value from before the step ran:
{
  "step_id": "uuid",
  "tool_id": "k8s_scale_deployment",
  "tool_params": {"deployment": "api", "replicas": 3},
  "status": "completed",
  "compensation_cmd": {
    "tool_id": "k8s_scale_deployment",
    "params": {"deployment": "api", "replicas": 1},
    "reason": "SAGA compensation: scale deployment back"
  }
}
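One way to produce such a compensation_cmd at journal time is a small lookup keyed on tool_id, sketched below. This is an assumption-laden illustration: `build_compensation_cmd`, the `context` dict, the `original_replicas` key, and the parameter layout of the delete-style calls (including the `operation` and `id` fields for pg_mutate) are all hypothetical; only the tool pairings come from the table in D3.

```python
from typing import Any, Optional

# tool_id -> (compensating tool_id, context key that must have been captured).
# The params layout of the compensating calls is an assumption for this sketch.
_SIMPLE_RULES = {
    "k8s_create_resource": ("k8s_delete_resource", "resource_name"),
    "knowledge_write": ("knowledge_delete", "entry_id"),
}


def build_compensation_cmd(tool_id: str, tool_params: dict[str, Any],
                           context: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return a compensation_cmd, or None when the call cannot be compensated.

    `context` holds facts captured around the call, such as the replica
    count before scaling or the id of an inserted row.
    """
    if tool_id == "k8s_scale_deployment":
        if "original_replicas" not in context:
            return None  # without the original value there is nothing to scale back to
        return {
            "tool_id": "k8s_scale_deployment",
            "params": {"deployment": tool_params["deployment"],
                       "replicas": context["original_replicas"]},
            "reason": "SAGA compensation: scale deployment back",
        }
    if tool_id == "pg_mutate":
        # Only INSERTs are listed as compensable; "operation" is a hypothetical field.
        if tool_params.get("operation") != "INSERT" or "inserted_id" not in context:
            return None
        return {
            "tool_id": "pg_mutate",
            "params": {"operation": "DELETE", "id": context["inserted_id"]},
            "reason": "SAGA compensation: delete inserted row",
        }
    if tool_id in _SIMPLE_RULES:
        undo_tool, key = _SIMPLE_RULES[tool_id]
        if key not in context:
            return None
        return {
            "tool_id": undo_tool,
            "params": {key: context[key]},
            "reason": f"SAGA compensation: undo {tool_id}",
        }
    return None  # e.g. k8s_exec_pod: side effects are irreversible
```

Returning None rather than raising keeps non-compensable tools such as k8s_exec_pod on the normal path; the journal entry is then written without a compensation_cmd.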
D4: SAGA trigger conditions
Situations that trigger compensation automatically (pseudocode):
SAGA_COMPENSATION_TRIGGERS = [
    # 1. Approval is rejected
    RunStatus.WAITING_APPROVAL → reject → trigger_compensation,
    # 2. approval_token expires (15min TTL)
    approval_token_expired → trigger_compensation,
    # 3. A later step fails and earlier steps with a compensation_cmd need rolling back
    step_failed AND has_compensatable_steps → trigger_compensation,
    # 4. Explicit CANCEL
    user_cancel → trigger_compensation if has_compensatable_steps,
]
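The four trigger rules can be expressed as a runnable predicate. This is a minimal sketch: the `SagaEvent` enum and both function names are hypothetical, and a step counts as compensatable when it is completed and carries a compensation_cmd, matching the filter used by the compensation loop.

```python
from enum import Enum
from typing import Any


class SagaEvent(Enum):
    APPROVAL_REJECTED = "approval_rejected"
    APPROVAL_TOKEN_EXPIRED = "approval_token_expired"
    STEP_FAILED = "step_failed"
    USER_CANCEL = "user_cancel"


def has_compensatable_steps(saga_steps: list[dict[str, Any]]) -> bool:
    """True when at least one completed step carries a compensation_cmd."""
    return any(
        step.get("status") == "completed" and bool(step.get("compensation_cmd"))
        for step in saga_steps
    )


def should_trigger_compensation(event: SagaEvent,
                                saga_steps: list[dict[str, Any]]) -> bool:
    """Apply the four trigger rules to one event."""
    if event in (SagaEvent.APPROVAL_REJECTED, SagaEvent.APPROVAL_TOKEN_EXPIRED):
        return True  # rules 1 and 2: always trigger
    if event in (SagaEvent.STEP_FAILED, SagaEvent.USER_CANCEL):
        return has_compensatable_steps(saga_steps)  # rules 3 and 4
    return False
```

For rules 1 and 2 the trigger fires even when nothing is compensatable; in that case the compensation loop simply finds no steps to undo.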
Compensation runs in reverse order (the step executed last is compensated first):
async def run_saga_compensation(run: RunState) -> None:
    completed_steps_with_compensation = [
        step for step in reversed(run.saga_steps)
        if step["status"] == "completed" and step.get("compensation_cmd")
    ]
    for step in completed_steps_with_compensation:
        cmd = step["compensation_cmd"]
        try:
            await execute_mcp_tool(
                run_id=run.run_id,
                tool_id=cmd["tool_id"],
                tool_params=cmd["params"],
                project_id=run.project_id
            )
            await update_saga_step(run.run_id, step["step_id"], status="compensated")
        except Exception as e:
            # A failed compensation does not raise; keep compensating the remaining steps
            await update_saga_step(run.run_id, step["step_id"], status="compensation_failed",
                                   error=str(e))
            await write_audit_log("SAGA_COMPENSATION_FAILED", run.run_id, step, str(e))
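The reverse ordering and the continue-on-failure behaviour can be checked with a small in-memory simulation. The names here (`compensate`, `fake_tool`, `demo`) are hypothetical, and the tool runner is a stub standing in for the MCP Gateway call.

```python
import asyncio
from typing import Any, Awaitable, Callable

ToolRunner = Callable[[str, dict[str, Any]], Awaitable[None]]


async def compensate(saga_steps: list[dict[str, Any]], run_tool: ToolRunner) -> list[str]:
    """Compensate completed steps newest-first; return step_ids in the order attempted."""
    attempted = []
    for step in reversed(saga_steps):
        cmd = step.get("compensation_cmd")
        if step["status"] != "completed" or not cmd:
            continue  # nothing to undo for this step
        attempted.append(step["step_id"])
        try:
            await run_tool(cmd["tool_id"], cmd["params"])
            step["status"] = "compensated"
        except Exception as exc:
            # Record the failure and carry on with the remaining steps.
            step["status"] = "compensation_failed"
            step["error"] = str(exc)
    return attempted


async def demo() -> tuple[list[str], list[str]]:
    steps = [
        {"step_id": "s1", "status": "completed",
         "compensation_cmd": {"tool_id": "k8s_scale_deployment",
                              "params": {"deployment": "api", "replicas": 1}}},
        {"step_id": "s2", "status": "completed"},  # e.g. an llm_call: no compensation_cmd
        {"step_id": "s3", "status": "completed",
         "compensation_cmd": {"tool_id": "knowledge_delete",
                              "params": {"entry_id": "e-42"}}},
    ]

    async def fake_tool(tool_id: str, params: dict[str, Any]) -> None:
        if tool_id == "knowledge_delete":
            raise RuntimeError("gateway unavailable")

    order = await compensate(steps, fake_tool)
    return order, [step["status"] for step in steps]
```

In this run s3 is attempted first and fails, yet s1 is still scaled back afterwards, which is the behaviour the "does not raise" comment describes.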
D5: SAGA status columns
Additional columns on awooop_run_state:
ALTER TABLE awooop_run_state
  ADD COLUMN saga_status VARCHAR(32) DEFAULT 'in_progress';
-- 'in_progress' | 'compensating' | 'compensated' | 'compensation_failed'
ALTER TABLE awooop_run_state
  ADD COLUMN compensation_started_at TIMESTAMPTZ;
ALTER TABLE awooop_run_state
  ADD COLUMN compensation_completed_at TIMESTAMPTZ;
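The four saga_status values suggest a simple state machine. The sketch below is one possible reading, not something the ADR specifies: the transition table, the choice to treat `compensated` and `compensation_failed` as terminal, and both function names are assumptions.

```python
from typing import Any

# Assumed transitions: compensation starts once, then ends in one of two
# terminal states. The ADR lists the values but not the allowed moves.
ALLOWED_TRANSITIONS: dict[str, set[str]] = {
    "in_progress": {"compensating"},
    "compensating": {"compensated", "compensation_failed"},
    "compensated": set(),
    "compensation_failed": set(),
}


def advance_saga_status(current: str, target: str) -> str:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal saga_status transition: {current} -> {target}")
    return target


def final_saga_status(saga_steps: list[dict[str, Any]]) -> str:
    """Pick the terminal status after the compensation loop has finished."""
    if any(step.get("status") == "compensation_failed" for step in saga_steps):
        return "compensation_failed"
    return "compensated"
```

Under this reading, compensation_started_at would be set on the move to `compensating` and compensation_completed_at on either terminal move.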
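D6: Situations that are not compensated
In the following situations no SAGA compensation runs; only an audit log entry is written:
- step_type = "llm_call" (LLM output cannot be undone, so compensation is meaningless)
- tool_id = "k8s_exec_pod" (exec side effects are irreversible)
- attempt_count >= max_attempts (maximum retries exceeded; the run goes straight to FAILED)
- status = "CANCELLED" with no compensatable steps (the run simply ends)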
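Relationship to ADR-114
| Mechanism | ADR-114 | ADR-119 |
|---|---|---|
| Recovery after a worker crash | Stale reaper reclaims the run → PENDING | ✓ (triggered by ADR-114) |
| Skipping completed steps on retry | ✗ | saga_steps journal checkpoint |
| Rolling back side effects of partial execution | ✗ | compensation_cmd executed in reverse order |
| Handling approval reject | ✗ (FAILED) | SAGA compensation + FAILED |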
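Consequences
Benefits
- After a worker crash, the retry does not re-execute completed steps (idempotency)
- After an approval reject, K8s operations can be selectively rolled back (k8s_scale, k8s_create)
- The saga_steps journal is a complete audit record of the execution
Costs
- Every step needs an extra DB write (journal upsert)
- The saga_steps JSONB column grows with the number of executed steps (a long run may exceed 1 MB)
  - Mitigation: store only the necessary fields of each step result; put large results in S3/GCS and keep only a reference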
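The mitigation for journal growth can be sketched as a helper that keeps small results inline and offloads large ones. All names here are hypothetical (`compact_result`, `InMemoryBlobStore`, the 8 KiB threshold, the key layout), and the in-memory store merely stands in for S3/GCS.

```python
import json
from typing import Any, Callable

MAX_INLINE_BYTES = 8 * 1024  # hypothetical threshold, to be tuned per deployment


class InMemoryBlobStore:
    """Stand-in for an object store; returns a reference string per object."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> str:
        self.objects[key] = data
        return f"mem://{key}"


def compact_result(step_id: str, result: Any,
                   put_blob: Callable[[str, bytes], str],
                   max_inline_bytes: int = MAX_INLINE_BYTES) -> dict[str, Any]:
    """Return what should be stored in the journal entry's "result" field."""
    payload = json.dumps(result, ensure_ascii=False).encode("utf-8")
    if len(payload) <= max_inline_bytes:
        return {"inline": result}
    # Too large for the journal: store the body elsewhere, keep a reference.
    ref = put_blob(f"saga-results/{step_id}.json", payload)
    return {"ref": ref, "size_bytes": len(payload)}
```

A resumed run that needs an offloaded result would have to fetch it through the reference, so the skip path in D2 pays one extra read for large steps.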
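Risks
- Compensation failure (compensation_failed): the operation has partially executed and the system is left in an inconsistent state
  - Mitigation: audit_log + Telegram alert; manual intervention is required
- Irreversible operations such as K8s exec cannot be compensated: they are explicitly marked as such by design, and HIGH risk tools get an approval gate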
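Acceptance criteria
- awooop_run_state gains the saga_steps JSONB column (Phase 1 migration)
- After a worker crash, the retry skips completed steps (simulation test)
- Approval reject → k8s_scale compensation command runs (integration test)
- compensation_failed → audit_log entry written + Telegram alert (integration test)
- saga_steps JSONB format validation (Phase 3 schema validation)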
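Related
- ADR-114 (worker lease; the stale reaper is the entry point that triggers recovery)
- ADR-106 (Run State contract; the run state machine)
- ADR-116 (approval_token; a reject triggers SAGA compensation)
- ADR-117 (MCP tool execution; compensation commands also go through the MCP Gateway)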