# ADR-119: Durable Execution & SAGA Compensation
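**Status**: Accepted
**Date**: 2026-05-03 (Taipei)
**Decision maker**: Commander
**Scope**: Step-level journal for agent runs, compensation command design, SAGA trigger conditions
**Related**: ADR-114 (worker lease), ADR-106 (Run State contract)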
---
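## Background
ADR-114 solved the problem of reclaiming a run after a worker crash (the stale reaper). But the stale reaper only returns the run to the PENDING state so that a worker can retry it. This introduces new problems:
**Problem 1: there is no way to know which step the run had reached when it crashed**
After a worker picks the run up again, it executes from the beginning, so steps that already completed are called again (for example, a K8s repair that was already written is written a second time on retry).
**Problem 2: side effects of a partial execution are irreversible**
The agent has already executed `k8s_exec_pod` (restarting a Pod), but a later step fails. The retry restarts the Pod again, so the service blips twice.
**Problem 3: approval decisions cannot be compensated**
After a run in `WAITING_APPROVAL` is rejected, the side effects of tools the agent already called cannot be rolled back.
**Design goals**:
- Step-level idempotency (steps can be retried without producing duplicate side effects)
- Compensation (compensation command design)
- No dependency on an external durable execution service (Temporal, etc.); implemented with DB + Redis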
---
## Decision
### D1: SAGA step journal (`saga_steps` JSONB)
Add a `saga_steps` JSONB column to the `awooop_run_state` table:
```sql
ALTER TABLE awooop_run_state
    ADD COLUMN saga_steps JSONB NOT NULL DEFAULT '[]';

-- saga_steps structure (JSONB array):
-- [
--   {
--     "step_id": "uuid",
--     "step_type": "llm_call" | "tool_call" | "approval_request" | "callback",
--     "tool_id": "k8s_exec_pod",     -- set for tool_call
--     "tool_params": {...},          -- set for tool_call (used for compensation)
--     "status": "pending" | "completed" | "failed" | "compensated" | "compensation_failed",
--     "result": {...},               -- step result (set when completed)
--     "compensation_cmd": {...},     -- compensation command (tool_call only)
--     "started_at": "ISO8601",
--     "completed_at": "ISO8601"
--   }
-- ]
```
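The entry shape above could be checked on the application side before a step is written to the journal. The following is a minimal sketch under stated assumptions: `validate_saga_step` is a hypothetical helper, not something this ADR defines, and the allowed values are copied from the structure comment plus `compensation_failed`, which the D4 compensation loop writes as a step status.

```python
from typing import Any

# Allowed values, taken from the saga_steps structure described in D1.
STEP_TYPES = {"llm_call", "tool_call", "approval_request", "callback"}
STEP_STATUSES = {"pending", "completed", "failed", "compensated", "compensation_failed"}


def validate_saga_step(step: dict[str, Any]) -> list[str]:
    """Return the problems found in one saga_steps entry (empty list if valid)."""
    problems: list[str] = []
    if not step.get("step_id"):
        problems.append("step_id is required")
    if step.get("step_type") not in STEP_TYPES:
        problems.append(f"unknown step_type: {step.get('step_type')!r}")
    if step.get("status") not in STEP_STATUSES:
        problems.append(f"unknown status: {step.get('status')!r}")
    if step.get("step_type") == "tool_call" and not step.get("tool_id"):
        problems.append("tool_call steps must carry tool_id")
    if step.get("compensation_cmd") and step.get("step_type") != "tool_call":
        problems.append("only tool_call steps may carry compensation_cmd")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to log every defect of a malformed entry in a single audit record.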
### D2: Step idempotency
After a worker picks the run up again, it **skips the steps that already completed**:
```python
async def resume_run_from_checkpoint(run: RunState, plan_steps: list[Step]) -> None:
    """Replay the plan, skipping steps the journal already marks as completed."""
    completed_steps = {
        step["step_id"]
        for step in run.saga_steps
        if step["status"] == "completed"
    }
    for step in plan_steps:
        if step.step_id in completed_steps:
            # Skip the completed step and reuse its recorded result
            result = get_step_result(run.saga_steps, step.step_id)
        else:
            # Execute the step and write it to the journal
            result = await execute_step(step)
            await append_saga_step(run.run_id, step, result)
```
**Before each step runs, the journal records it as `pending`; after it runs, the entry is updated to `completed` or `failed`**:
```python
async def execute_step_with_journal(run_id: str, step: Step) -> StepResult:
    # Write "pending" first (guards against duplicate execution)
    await upsert_saga_step(run_id, step.step_id, status="pending")
    try:
        result = await step.execute()
        await upsert_saga_step(run_id, step.step_id, status="completed", result=result)
        return result
    except Exception as e:
        await upsert_saga_step(run_id, step.step_id, status="failed", error=str(e))
        raise
```
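The two snippets above can be combined into a small self-contained simulation of a crash and retry. This is only a sketch: the module-level `JOURNAL` dict is a hypothetical in-memory stand-in for the `saga_steps` column, and `run_plan` is an illustrative name rather than part of the worker.

```python
import asyncio
from typing import Any, Awaitable, Callable

# Hypothetical in-memory stand-in for saga_steps: run_id -> {step_id: entry}
JOURNAL: dict[str, dict[str, dict[str, Any]]] = {}


async def upsert_saga_step(run_id: str, step_id: str, **fields: Any) -> None:
    """Create or update one journal entry, keyed by step_id."""
    entry = JOURNAL.setdefault(run_id, {}).setdefault(step_id, {"step_id": step_id})
    entry.update(fields)


async def run_plan(
    run_id: str, plan: list[tuple[str, Callable[[], Awaitable[Any]]]]
) -> list[Any]:
    """Run (step_id, action) pairs, reusing journaled results of completed steps."""
    results = []
    for step_id, action in plan:
        entry = JOURNAL.get(run_id, {}).get(step_id)
        if entry and entry.get("status") == "completed":
            results.append(entry["result"])  # skip: finished before the crash
            continue
        await upsert_saga_step(run_id, step_id, status="pending")
        try:
            result = await action()
        except Exception as exc:
            await upsert_saga_step(run_id, step_id, status="failed", error=str(exc))
            raise
        await upsert_saga_step(run_id, step_id, status="completed", result=result)
        results.append(result)
    return results
```

Calling `run_plan` a second time for the same `run_id` after a failure re-executes only the steps that are not yet `completed`, which is the behaviour the acceptance criteria ask the simulation test to confirm.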
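### D3: Compensation command design
**Only tool_call steps have compensation commands** (LLM calls are irreversible):
| Tool | Compensation command | Compensatable when |
|------|---------|----------|
| `k8s_exec_pod` (exec-type) | N/A (side effects are irreversible) | Not compensated |
| `k8s_scale_deployment` (scale to N) | scale back to the original value | original replica count is known |
| `k8s_create_resource` | `k8s_delete_resource` | resource name is known |
| `pg_mutate` (INSERT) | `pg_mutate` (DELETE the corresponding ID) | `inserted_id` is known |
| `knowledge_write` | `knowledge_delete` | `entry_id` is known |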
**Compensation command format** (inside `saga_steps`); `compensation_cmd.params` holds the original value to restore, here `replicas: 1`:
```json
{
  "step_id": "uuid",
  "tool_id": "k8s_scale_deployment",
  "tool_params": {"deployment": "api", "replicas": 3},
  "status": "completed",
  "compensation_cmd": {
    "tool_id": "k8s_scale_deployment",
    "params": {"deployment": "api", "replicas": 1},
    "reason": "SAGA compensation: scale deployment back"
  }
}
```
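A compensation command has to be built when the step is journaled, because that is the only moment the original state is still known. The sketch below illustrates this for two rows of the table. `build_compensation_cmd`, the `observed` dict and its keys (`original_replicas`, `entry_id`), and the parameter shape of `knowledge_delete` are all assumptions made for illustration.

```python
from typing import Any, Optional


def build_compensation_cmd(
    tool_id: str, params: dict[str, Any], observed: dict[str, Any]
) -> Optional[dict[str, Any]]:
    """Return a compensation_cmd for a tool call, or None if it cannot be undone.

    `observed` carries state captured around the call (hypothetical keys):
    the replica count before scaling, or the ID returned by the tool.
    """
    if tool_id == "k8s_scale_deployment" and "original_replicas" in observed:
        return {
            "tool_id": "k8s_scale_deployment",
            "params": {
                "deployment": params["deployment"],
                "replicas": observed["original_replicas"],
            },
            "reason": "SAGA compensation: scale deployment back",
        }
    if tool_id == "knowledge_write" and "entry_id" in observed:
        return {
            "tool_id": "knowledge_delete",
            "params": {"entry_id": observed["entry_id"]},  # assumed parameter shape
            "reason": "SAGA compensation: delete written entry",
        }
    # k8s_exec_pod, or any call without enough captured state: no compensation
    return None
```

Returning `None` when the original state was not captured mirrors the "compensatable when" column: a step without that state is journaled with no `compensation_cmd` and is skipped during rollback.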
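### D4: SAGA trigger conditions
Situations that **trigger compensation automatically**:
```python
# Each entry is (event, requires_compensatable_steps). When the flag is True,
# compensation is triggered only if the run has steps with a compensation_cmd.
SAGA_COMPENSATION_TRIGGERS = [
    # 1. Approval rejected while the run is in RunStatus.WAITING_APPROVAL
    ("approval_reject", False),
    # 2. approval_token expired (15 min TTL)
    ("approval_token_expired", False),
    # 3. A later step failed; earlier steps with a compensation_cmd must be rolled back
    ("step_failed", True),
    # 4. Explicit CANCEL by the user
    ("user_cancel", True),
]
```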
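The trigger list can be turned into a single decision function. This is a sketch, assuming the run's journal is available as a list of dicts. The event label `approval_reject` and the function names are hypothetical, and the remaining event names mirror the list above.

```python
from typing import Any

# Events that always trigger compensation, and events that trigger it only
# when there is something to undo. "approval_reject" is a hypothetical label.
UNCONDITIONAL_EVENTS = {"approval_reject", "approval_token_expired"}
CONDITIONAL_EVENTS = {"step_failed", "user_cancel"}


def has_compensatable_steps(saga_steps: list[dict[str, Any]]) -> bool:
    """True if any completed step carries a compensation_cmd."""
    return any(
        s.get("status") == "completed" and s.get("compensation_cmd")
        for s in saga_steps
    )


def should_trigger_compensation(event: str, saga_steps: list[dict[str, Any]]) -> bool:
    """Decide whether an event starts SAGA compensation for this run."""
    if event in UNCONDITIONAL_EVENTS:
        return True
    if event in CONDITIONAL_EVENTS:
        return has_compensatable_steps(saga_steps)
    return False
```

Unknown events return `False`, so adding a new run event elsewhere in the system cannot start a rollback by accident.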
**Compensation order**: reverse (the most recently executed step is compensated first):
```python
async def run_saga_compensation(run: RunState) -> None:
    completed_steps_with_compensation = [
        step for step in reversed(run.saga_steps)
        if step["status"] == "completed" and step.get("compensation_cmd")
    ]
    for step in completed_steps_with_compensation:
        cmd = step["compensation_cmd"]
        try:
            await execute_mcp_tool(
                run_id=run.run_id,
                tool_id=cmd["tool_id"],
                tool_params=cmd["params"],
                project_id=run.project_id,
            )
            await update_saga_step(run.run_id, step["step_id"], status="compensated")
        except Exception as e:
            # A compensation failure is not re-raised; keep compensating the other steps
            await update_saga_step(
                run.run_id, step["step_id"], status="compensation_failed", error=str(e)
            )
            await write_audit_log("SAGA_COMPENSATION_FAILED", run.run_id, step, str(e))
```
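The ordering and the continue-on-failure behaviour can be exercised without the MCP Gateway. The sketch below assumes the tool executor is passed in as a callable. `compensate_in_reverse` and `call_tool` are hypothetical names, and the step dicts are mutated in place instead of being written back through `update_saga_step`.

```python
import asyncio
from typing import Any, Awaitable, Callable

# Hypothetical executor signature: (tool_id, params) -> None
ToolFn = Callable[[str, dict[str, Any]], Awaitable[None]]


async def compensate_in_reverse(saga_steps: list[dict[str, Any]], call_tool: ToolFn) -> None:
    """Undo completed steps newest-first, recording the outcome on each step."""
    for step in reversed(saga_steps):
        cmd = step.get("compensation_cmd")
        if step["status"] != "completed" or not cmd:
            continue  # nothing to undo for this step
        try:
            await call_tool(cmd["tool_id"], cmd["params"])
            step["status"] = "compensated"
        except Exception as exc:
            # Keep going so one failure does not block the remaining rollbacks
            step["status"] = "compensation_failed"
            step["error"] = str(exc)
```

With a fake `call_tool` that records its calls and raises for one tool, an integration-style test can assert both the reverse order and that the earlier steps were still compensated after the failure.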
### D5: SAGA status columns
Additional columns on `awooop_run_state`:
```sql
ALTER TABLE awooop_run_state
    ADD COLUMN saga_status VARCHAR(32) DEFAULT 'in_progress';
-- 'in_progress' | 'compensating' | 'compensated' | 'compensation_failed'
ALTER TABLE awooop_run_state
    ADD COLUMN compensation_started_at TIMESTAMPTZ;
ALTER TABLE awooop_run_state
    ADD COLUMN compensation_completed_at TIMESTAMPTZ;
```
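### D6: Cases that are not compensated
The following cases **do not run SAGA compensation**; they only write an audit log entry:
- `step_type = "llm_call"` (LLM output is irreversible, so compensation is meaningless)
- `tool_id = "k8s_exec_pod"` (exec side effects are irreversible)
- `attempt_count >= max_attempts` (maximum retries exceeded; the run goes straight to FAILED)
- `status = "CANCELLED"` with no compensatable steps (the run simply ends)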
---
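## Relationship with ADR-114
| Mechanism | ADR-114 | ADR-119 |
|------|---------|---------|
| Recovery after a worker crash | Stale reaper reclaims → PENDING | ✓ (triggered by ADR-114) |
| Skipping completed steps on retry | ✗ | saga_steps journal checkpoint |
| Rolling back side effects of a partial execution | ✗ | compensation_cmd executed in reverse order |
| Handling an approval reject | ✗ (FAILED) | SAGA compensation + FAILED |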
---
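## Consequences
### Benefits
- After a worker crash, the retry does not re-execute completed steps (idempotency)
- After an approval reject, K8s operations (k8s_scale, k8s_create) can be selectively rolled back
- The saga_steps journal is a complete audit record of the execution
### Costs
- Every step needs an extra DB write (journal upsert)
- The `saga_steps` JSONB column grows with the number of executed steps (a long run may exceed 1 MB)
  - Mitigation: store only the necessary fields of each step result; large results go to S3/GCS and only a reference is stored
### Risks
- Compensation fails (compensation_failed): the operation was partially executed and the system enters an inconsistent state
  - Mitigation: audit_log + Telegram alert; manual intervention is required
- Irreversible operations such as K8s exec cannot be compensated: they are explicitly marked as such by design, and HIGH risk tools get an approval gate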
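The mitigation for journal growth can be sketched as a small helper that decides, per step, whether the result is stored inline or by reference. Everything here is an assumption made for illustration: the 8 KiB threshold, the `saga-results/` key layout, the `result_ref` field name, and the plain dict that stands in for an S3/GCS client.

```python
import json
from typing import Any

MAX_INLINE_BYTES = 8 * 1024  # hypothetical threshold; tune for the workload


def compact_result(
    step_id: str, result: dict[str, Any], blob_store: dict[str, bytes]
) -> dict[str, Any]:
    """Return the value to store in a saga_steps entry's "result" field.

    Small results are stored inline. Large ones are written to `blob_store`
    (a dict standing in for S3/GCS) and replaced by a reference.
    """
    payload = json.dumps(result, ensure_ascii=False).encode("utf-8")
    if len(payload) <= MAX_INLINE_BYTES:
        return result
    key = f"saga-results/{step_id}.json"
    blob_store[key] = payload
    return {"result_ref": key, "size_bytes": len(payload)}
```

On resume, a step whose journaled result contains `result_ref` would need one extra fetch from the object store before its result can be reused.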
---
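## Acceptance criteria
- [ ] `awooop_run_state` gains a `saga_steps JSONB` column (Phase 1 migration)
- [ ] After a worker crash, the retry skips completed steps (simulation test)
- [ ] Approval reject → k8s_scale compensation command is executed (integration test)
- [ ] `compensation_failed` → audit_log written + Telegram alert (integration test)
- [ ] `saga_steps` JSONB format validation (Phase 3 schema validation)
## Related
- ADR-114 (worker lease; the stale reaper is the trigger entry point)
- ADR-106 (Run State contract; run state machine)
- ADR-116 (approval_token; a reject triggers SAGA compensation)
- ADR-117 (MCP tool execution; compensation commands also go through the MCP Gateway)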