# ADR-119: Durable Execution & SAGA Compensation

**Status**: Accepted
**Date**: 2026-05-03 (Taipei)
**Decision maker**: 統帥 (Commander)
**Scope**: Step-level journal for agent runs, compensation command design, SAGA trigger conditions
**Related**: ADR-114 (worker lease), ADR-106 (Run State contract)

---

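## Background

ADR-114 solved the problem of reclaiming runs after a worker crash (the stale reaper). However, the stale reaper only returns the run to the PENDING state so that a worker can retry it. This creates new problems:

**Problem 1: There is no way to know at which step the run crashed**

After a worker picks the run up again, it executes from the beginning, so steps that already completed are invoked a second time (for example, a K8s repair that was already written is written again on retry).

**Problem 2: Side effects of partial execution are irreversible**

The agent has already executed `k8s_exec_pod` (restarting a Pod), but a later step fails. The retry restarts the Pod again, so the service is interrupted twice.

**Problem 3: Approval decisions cannot be compensated**

When a run in `WAITING_APPROVAL` is rejected, the side effects of the tools the agent has already called cannot be rolled back.

**Design goals**:
- Step-level idempotency (steps can be retried without producing duplicate side effects)
- Compensation (compensation command) design
- No dependency on an external durable execution service (Temporal, etc.); implemented with DB + Redis

---
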
## Decisions

### D1 - SAGA Step Journal (saga_steps JSONB)

Add a `saga_steps` JSONB column to the `awooop_run_state` table:

```sql
ALTER TABLE awooop_run_state
    ADD COLUMN saga_steps JSONB NOT NULL DEFAULT '[]';

-- saga_steps structure (JSONB array)
-- [
--   {
--     "step_id": "uuid",
--     "step_type": "llm_call" | "tool_call" | "approval_request" | "callback",
--     "tool_id": "k8s_exec_pod",      -- set for tool_call
--     "tool_params": {...},            -- set for tool_call (used for compensation)
--     "status": "pending" | "completed" | "failed" | "compensated" | "compensation_failed",
--     "result": {...},                 -- step result (set when completed)
--     "error": "...",                  -- error message (set when failed or compensation_failed)
--     "compensation_cmd": {...},       -- compensation command (tool_call only)
--     "started_at": "ISO8601",
--     "completed_at": "ISO8601"
--   }
-- ]
```

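To make the journal entry shape concrete, the following is a minimal validation sketch for a single `saga_steps` entry. The function name `validate_saga_step` and the rules it enforces are illustrative assumptions derived from the structure above, not part of the ADR; the status set includes `compensation_failed` because the compensation routine in D4 writes that value.

```python
# Allowed values taken from the saga_steps structure in D1.
STEP_TYPES = {"llm_call", "tool_call", "approval_request", "callback"}
# "compensation_failed" is included because D4 writes it on a failed compensation.
STEP_STATUSES = {"pending", "completed", "failed", "compensated", "compensation_failed"}


def validate_saga_step(step: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry looks valid.

    Hypothetical helper: the checks are assumptions inferred from the ADR text.
    """
    problems = []
    if not step.get("step_id"):
        problems.append("step_id is required")
    if step.get("step_type") not in STEP_TYPES:
        problems.append(f"unknown step_type: {step.get('step_type')!r}")
    if step.get("status") not in STEP_STATUSES:
        problems.append(f"unknown status: {step.get('status')!r}")
    if step.get("step_type") == "tool_call" and not step.get("tool_id"):
        problems.append("tool_call steps need tool_id")
    if step.get("compensation_cmd") and step.get("step_type") != "tool_call":
        problems.append("only tool_call steps may carry compensation_cmd")
    return problems
```

A check like this could run before each journal write, so that malformed entries are rejected early rather than discovered during compensation.
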
### D2 - Step Idempotency

After a worker picks a run up again, it **skips the steps that already completed**:

```python
async def resume_run_from_checkpoint(run: RunState, plan_steps: list[Step]) -> None:
    # plan_steps: the ordered steps planned for this run
    # step_ids that the journal already records as completed
    completed_steps = {
        step["step_id"]
        for step in run.saga_steps
        if step["status"] == "completed"
    }

    for step in plan_steps:
        if step.step_id in completed_steps:
            # Skip the completed step and reuse its recorded result
            result = get_step_result(run.saga_steps, step.step_id)
        else:
            # Execute the step and write it to the journal
            result = await execute_step(step)
            await append_saga_step(run.run_id, step, result)
```

**Before each step runs, the journal records it as `pending`; after it runs, the entry is updated to `completed` or `failed`**:

```python
async def execute_step_with_journal(run_id: str, step: Step) -> StepResult:
    # Record "pending" first (guards against duplicate execution)
    await upsert_saga_step(run_id, step.step_id, status="pending")

    try:
        result = await step.execute()
    except Exception as e:
        await upsert_saga_step(run_id, step.step_id, status="failed", error=str(e))
        raise

    # Written outside the try block so that a journal write error
    # is not recorded as a failure of the step itself
    await upsert_saga_step(run_id, step.step_id, status="completed", result=result)
    return result
```

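The checkpoint behaviour described in D2 can be exercised end to end with an in-memory stand-in for the journal. This is only a sketch: `InMemoryJournal`, `run_plan`, and the `crash_after` switch are hypothetical names invented for the demonstration, and a real implementation would upsert into `awooop_run_state.saga_steps` instead of a Python dict.

```python
import asyncio


class InMemoryJournal:
    """Hypothetical stand-in for saga_steps, keyed by step_id."""

    def __init__(self):
        self.steps = {}

    async def upsert(self, step_id, **fields):
        self.steps.setdefault(step_id, {"step_id": step_id}).update(fields)


async def run_plan(journal, plan, crash_after=None):
    """Run (step_id, coroutine function) pairs, skipping journaled completions."""
    results = {}
    for index, (step_id, fn) in enumerate(plan):
        entry = journal.steps.get(step_id)
        if entry and entry["status"] == "completed":
            results[step_id] = entry["result"]  # reuse the recorded result
            continue
        await journal.upsert(step_id, status="pending")
        try:
            result = await fn()
        except Exception as exc:
            await journal.upsert(step_id, status="failed", error=str(exc))
            raise
        await journal.upsert(step_id, status="completed", result=result)
        results[step_id] = result
        if crash_after == index:
            raise RuntimeError("simulated worker crash")
    return results


calls = []  # records every real side effect


async def scale():
    calls.append("scale")
    return {"replicas": 3}


async def notify():
    calls.append("notify")
    return {"sent": True}


async def demo():
    journal = InMemoryJournal()
    plan = [("s1", scale), ("s2", notify)]
    try:
        await run_plan(journal, plan, crash_after=0)  # crash right after s1
    except RuntimeError:
        pass
    return await run_plan(journal, plan)  # resume: s1 is skipped


final_results = asyncio.run(demo())
```

After the simulated crash and the resume, `calls` contains each side effect exactly once. Note that this sketch simply re-executes a step found in `pending`; whether that is safe depends on the tool, because a crash between the side effect and the `completed` write leaves the outcome unknown.
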
### D3 - Compensation Command Design

**Only tool_call steps have compensation commands** (LLM calls are irreversible):

| Tool | Compensation command | Compensable when |
|------|----------------------|------------------|
| `k8s_exec_pod` (exec type) | N/A (side effects are irreversible) | Never compensated |
| `k8s_scale_deployment` (scale to N) | Scale back to the original value | Original replica count is known |
| `k8s_create_resource` | `k8s_delete_resource` | Resource name is known |
| `pg_mutate` (INSERT) | `pg_mutate` (DELETE the corresponding ID) | inserted_id is known |
| `knowledge_write` | `knowledge_delete` | entry_id is known |

**Compensation command format** (inside saga_steps):

```json
{
  "step_id": "uuid",
  "tool_id": "k8s_scale_deployment",
  "tool_params": {"deployment": "api", "replicas": 3},
  "status": "completed",
  "compensation_cmd": {
    "tool_id": "k8s_scale_deployment",
    "params": {"deployment": "api", "replicas": 1},
    "reason": "SAGA compensation: scale deployment back"
  }
}
```

Here `compensation_cmd.params.replicas` holds the original value (1), that is, the replica count before the step scaled the deployment to 3.

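One way to produce a `compensation_cmd` at the moment a tool call is journaled is a small registry of builders, one per compensable tool. The sketch below covers two rows of the table above; the names `COMPENSATION_BUILDERS`, `build_compensation_cmd`, and the `context` keys (`original_replicas`, `entry_id`) are assumptions made for illustration.

```python
from typing import Callable, Optional

# Hypothetical builder signature: (tool_params, context) -> compensation_cmd or None.
Builder = Callable[[dict, dict], Optional[dict]]


def _scale_back(params: dict, context: dict) -> Optional[dict]:
    # Assumes the original replica count was captured before scaling.
    original = context.get("original_replicas")
    if original is None:
        return None  # not compensable without the original value
    return {
        "tool_id": "k8s_scale_deployment",
        "params": {"deployment": params["deployment"], "replicas": original},
        "reason": "SAGA compensation: scale deployment back",
    }


def _delete_entry(params: dict, context: dict) -> Optional[dict]:
    entry_id = context.get("entry_id")
    if entry_id is None:
        return None  # not compensable without the entry_id
    return {
        "tool_id": "knowledge_delete",
        "params": {"entry_id": entry_id},
        "reason": "SAGA compensation: delete knowledge entry",
    }


COMPENSATION_BUILDERS: dict[str, Builder] = {
    "k8s_scale_deployment": _scale_back,
    "knowledge_write": _delete_entry,
}


def build_compensation_cmd(tool_id: str, params: dict, context: dict) -> Optional[dict]:
    """Return a compensation_cmd, or None for tools that cannot be compensated."""
    builder = COMPENSATION_BUILDERS.get(tool_id)
    return builder(params, context) if builder else None
```

Tools such as `k8s_exec_pod` are simply absent from the registry, so they yield `None` and are journaled without a `compensation_cmd`.
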
### D4 - SAGA Trigger Conditions

Situations that **trigger compensation automatically**:

```python
# Each entry: (trigger event, whether compensatable steps must exist)
SAGA_COMPENSATION_TRIGGERS = [
    # 1. Approval rejected while the run is in RunStatus.WAITING_APPROVAL
    ("approval_rejected", False),

    # 2. approval_token expired (15 min TTL)
    ("approval_token_expired", False),

    # 3. A later step failed and earlier steps with a compensation_cmd must be rolled back
    ("step_failed", True),

    # 4. User-initiated CANCEL
    ("user_cancel", True),
]
```

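The trigger rules above reduce to a small predicate. The sketch below is an assumption about how they might be evaluated: the event names (`approval_rejected`, `approval_token_expired`, `step_failed`, `user_cancel`) and the function names are hypothetical labels for the four situations listed.

```python
# Events that always start compensation (situations 1 and 2).
UNCONDITIONAL_TRIGGERS = {"approval_rejected", "approval_token_expired"}
# Events that start compensation only if something can be rolled back (3 and 4).
CONDITIONAL_TRIGGERS = {"step_failed", "user_cancel"}


def has_compensatable_steps(saga_steps: list[dict]) -> bool:
    """True if any completed step carries a compensation_cmd."""
    return any(
        step.get("status") == "completed" and bool(step.get("compensation_cmd"))
        for step in saga_steps
    )


def should_trigger_compensation(event: str, saga_steps: list[dict]) -> bool:
    if event in UNCONDITIONAL_TRIGGERS:
        return True
    if event in CONDITIONAL_TRIGGERS:
        return has_compensatable_steps(saga_steps)
    return False  # unknown events never start compensation
```

For an unconditional trigger with nothing to roll back, the compensation pass is effectively a no-op, since it only visits completed steps that carry a `compensation_cmd`.
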
**Compensation order**: reverse (the most recently executed step is compensated first):

```python
async def run_saga_compensation(run: RunState) -> None:
    # Completed steps that carry a compensation command, newest first
    completed_steps_with_compensation = [
        step for step in reversed(run.saga_steps)
        if step["status"] == "completed" and step.get("compensation_cmd")
    ]

    for step in completed_steps_with_compensation:
        cmd = step["compensation_cmd"]
        try:
            await execute_mcp_tool(
                run_id=run.run_id,
                tool_id=cmd["tool_id"],
                tool_params=cmd["params"],
                project_id=run.project_id,
            )
            await update_saga_step(run.run_id, step["step_id"], status="compensated")
        except Exception as e:
            # A failed compensation does not raise; keep compensating the remaining steps
            await update_saga_step(run.run_id, step["step_id"], status="compensation_failed",
                                   error=str(e))
            await write_audit_log("SAGA_COMPENSATION_FAILED", run.run_id, step, str(e))
```

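The reverse-order, continue-on-failure behaviour can be checked with a self-contained simulation. In this sketch `compensate` and `fake_tool` are hypothetical stand-ins: `fake_tool` replaces the MCP Gateway call and deliberately fails for one tool, and the steps are mutated in place instead of being written to the database.

```python
import asyncio


async def compensate(saga_steps, execute_tool):
    """Compensate completed steps newest-first; return the tool_ids attempted, in order."""
    attempted = []
    for step in reversed(saga_steps):
        cmd = step.get("compensation_cmd")
        if step["status"] != "completed" or not cmd:
            continue  # nothing to roll back for this step
        attempted.append(cmd["tool_id"])
        try:
            await execute_tool(cmd["tool_id"], cmd["params"])
            step["status"] = "compensated"
        except Exception as exc:
            # Do not re-raise: the remaining steps still need compensation.
            step["status"] = "compensation_failed"
            step["error"] = str(exc)
    return attempted


async def fake_tool(tool_id, params):
    # Stand-in for the real tool executor; one tool is made to fail.
    if tool_id == "knowledge_delete":
        raise RuntimeError("entry locked")


steps = [
    {"step_id": "a", "status": "completed",
     "compensation_cmd": {"tool_id": "k8s_scale_deployment",
                          "params": {"deployment": "api", "replicas": 1}}},
    {"step_id": "b", "status": "completed"},  # no compensation_cmd, e.g. an llm_call
    {"step_id": "c", "status": "completed",
     "compensation_cmd": {"tool_id": "knowledge_delete",
                          "params": {"entry_id": "e1"}}},
]

order = asyncio.run(compensate(steps, fake_tool))
```

Step `c` ran last, so it is compensated first; its failure is recorded on the step and does not prevent step `a` from being rolled back.
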
### D5 - SAGA Status Columns

Additional columns on `awooop_run_state`:

```sql
ALTER TABLE awooop_run_state
    ADD COLUMN saga_status VARCHAR(32) DEFAULT 'in_progress';
-- 'in_progress' | 'compensating' | 'compensated' | 'compensation_failed'

ALTER TABLE awooop_run_state
    ADD COLUMN compensation_started_at TIMESTAMPTZ;

ALTER TABLE awooop_run_state
    ADD COLUMN compensation_completed_at TIMESTAMPTZ;
```

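The ADR lists the four `saga_status` values but not the transitions between them. The sketch below encodes one plausible reading as a guard; the transition table and both function names are assumptions, not something the ADR specifies.

```python
# Assumed transitions between the saga_status values listed in D5.
ALLOWED_TRANSITIONS = {
    "in_progress": {"compensating"},
    "compensating": {"compensated", "compensation_failed"},
    "compensated": set(),            # terminal
    "compensation_failed": set(),    # terminal, needs manual intervention
}


def transition(current: str, target: str) -> str:
    """Return the new saga_status, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal saga_status transition: {current} -> {target}")
    return target


def compensation_outcome(saga_steps: list[dict]) -> str:
    """Final saga_status after a compensation pass over all steps."""
    failed = any(s.get("status") == "compensation_failed" for s in saga_steps)
    return "compensation_failed" if failed else "compensated"
```

Under this reading, `compensation_started_at` would be set on the move to `compensating` and `compensation_completed_at` on the move to either terminal value.
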
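### D6 - Situations Without Compensation

In the following situations **no SAGA compensation is executed**; only an audit log entry is written:

- `step_type = "llm_call"` (LLM output is irreversible, so compensation is meaningless)
- `tool_id = "k8s_exec_pod"` (exec side effects are irreversible)
- `attempt_count >= max_attempts` (maximum retries exceeded; the run goes straight to FAILED)
- `status = "CANCELLED"` with no compensatable steps (the run simply ends)

---
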
## Relationship to ADR-114

| Mechanism | ADR-114 | ADR-119 |
|------|---------|---------|
| Recovery after worker crash | Stale reaper reclaims → PENDING | ✓ (triggered by ADR-114) |
| Skip completed steps on retry | ✗ | saga_steps journal checkpoint |
| Roll back side effects of partial execution | ✗ | compensation_cmd executed in reverse order |
| Approval reject handling | ✗ (FAILED) | SAGA compensation + FAILED |

---

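## Consequences

### Benefits
- After a worker crash, the retry does not re-execute completed steps (idempotency)
- After an approval reject, K8s operations can be selectively rolled back (k8s_scale, k8s_create)
- The saga_steps journal is a complete execution audit record

### Costs
- Every step needs an extra DB write (journal upsert)
- The `saga_steps` JSONB column grows with the number of executed steps (long runs may exceed 1MB)
  - Mitigation: store only the necessary fields of each step result; put large results in S3/GCS and keep only a reference
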
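The mitigation for journal growth can be sketched as a helper that inlines small results and offloads large ones. Everything specific here is an assumption: the 16 KiB threshold, the `compact_result` name, the shape of the returned dict, and the `store_blob` callback, which stands in for an S3/GCS upload and returns an opaque reference.

```python
import json
from typing import Any, Callable

MAX_INLINE_BYTES = 16 * 1024  # assumed threshold, not specified by the ADR


def compact_result(
    result: Any,
    store_blob: Callable[[bytes], str],
    max_inline_bytes: int = MAX_INLINE_BYTES,
) -> dict:
    """Return what should be journaled for a step result.

    Small results are kept inline; large ones are handed to store_blob
    (a hypothetical uploader) and only the returned reference is journaled.
    """
    encoded = json.dumps(result, separators=(",", ":")).encode("utf-8")
    if len(encoded) <= max_inline_bytes:
        return {"inline": result}
    return {"ref": store_blob(encoded), "size_bytes": len(encoded)}


# Usage with an in-memory store standing in for object storage.
blobs: list[bytes] = []


def memory_store(data: bytes) -> str:
    blobs.append(data)
    return f"mem://{len(blobs) - 1}"


small = compact_result({"replicas": 3}, memory_store)
large = compact_result({"log": "x" * 50_000}, memory_store)
```

With a fixed per-step ceiling like this, the size of `saga_steps` scales with the number of steps rather than with the size of tool output.
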
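### Risks
- Compensation fails (compensation_failed): the operation was partially executed and the system enters an inconsistent state
  - Mitigation: audit_log + Telegram alert; manual intervention is required
- Irreversible operations such as K8s exec cannot be compensated: they are explicitly marked as such by design, and HIGH risk tools get an approval gate

---
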
## Acceptance Criteria

- [ ] `awooop_run_state` gains the `saga_steps JSONB` column (Phase 1 migration)
- [ ] After a worker crash, the retry skips completed steps (simulation test)
- [ ] Approval reject → k8s_scale compensation command is executed (integration test)
- [ ] `compensation_failed` → audit_log written + Telegram alert (integration test)
- [ ] `saga_steps` JSONB format validation (Phase 3 schema validation)

## Related

- ADR-114 (worker lease; the stale reaper is the trigger entry point)
- ADR-106 (Run State contract; run state machine)
- ADR-116 (approval_token; a reject triggers SAGA compensation)
- ADR-117 (MCP tool execution; compensation commands also go through the MCP Gateway)