# ADR-120: Token Budget Hard Kill

**Status**: Accepted
**Date**: 2026-05-03 (Taipei)
**Decision maker**: 統帥 (Commander)
**Scope**: three-layer token budget architecture, hard kill mechanism, lessons from the $47k incident, alert thresholds
**Related**: ADR-106 (Agent contract budget_limit), ADR-112 (contract governance)

---

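## Background

### Lessons from the $47k Agent Loop Incident

An Agent on an agentic AI platform entered an infinite LLM call loop (triggered by prompt injection) and burned $47,000 USD within 48 hours.

Why the incident happened:
- **Alerts only, no hard kill**: an alert fired when the budget threshold was exceeded, but the Agent kept running
- **No token counting**: token consumption per LLM call was not tracked in real time
- **No layered budget**: the whole platform shared a single budget cap, so one Agent could consume the entire allowance
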
### Current State

- The `budget_ledger` table exists, but there is no hard kill mechanism
- There is no pre-check before an LLM call (usage is only tallied afterwards)
- There is no per-run budget limit (only a per-provider total budget)

---

## Decision

### D1: Three-Layer Budget Architecture

```
Platform Budget (global cap)
  └── Tenant Budget (per project_id)
        └── Agent Budget (per agent_id)
              └── Run Budget (per run_id, finest granularity)
```

**Layers**:

| Layer | Limits | Stored in | Update frequency |
|-------|--------|-----------|------------------|
| Platform | All LLM calls | `config.py` + K8s Secret | Static |
| Tenant | per project_id | `awooop_projects.budget_limit_usd` | Static for the contract period |
| Agent | per agent_id | Agent contract `budget_limit` | On version update |
| Run | per run_id | `awooop_run_state.token_budget_remaining` | On every LLM call |

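The hierarchy above implies that a call is affordable only when every enclosing layer can pay for it. The following is a minimal sketch of that rule, not the platform's implementation: `BudgetSnapshot`, `first_exhausted_layer`, and `spendable` are hypothetical names introduced here for illustration, and the amounts are made up.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class BudgetSnapshot:
    """Remaining USD at each level of the hierarchy (hypothetical container)."""

    platform: Decimal
    tenant: Decimal
    agent: Decimal
    run: Decimal


def first_exhausted_layer(snapshot: BudgetSnapshot, cost_usd: Decimal) -> Optional[str]:
    """Return the finest-grained layer that cannot afford cost_usd, or None."""
    # Check from the narrowest scope outwards so the result names the closest limit.
    for name in ("run", "agent", "tenant", "platform"):
        if getattr(snapshot, name) < cost_usd:
            return name
    return None


def spendable(snapshot: BudgetSnapshot) -> Decimal:
    """Every layer must be able to pay, so the smallest remainder governs."""
    return min(snapshot.platform, snapshot.tenant, snapshot.agent, snapshot.run)
```

Checking the narrowest scope first is a design choice of this sketch: it lets the error point at the limit closest to the offending run.
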
### D2: Hard Kill Mechanism (Three Lines of Defense)

**Line of defense 1: Pre-call Budget Check (before the call)**

```python
# get_run, estimate_cost, get_tenant_budget_remaining, get_platform_budget_remaining
# and BudgetExhaustedError are assumed to be defined elsewhere in the platform code.

async def check_budget_before_llm_call(
    project_id: str,
    agent_id: str,
    run_id: str,
    estimated_tokens: int,
) -> None:
    """Estimate the token cost and reject the call outright if it exceeds a budget."""

    # Layer 1: Run budget
    run = await get_run(run_id)
    estimated_cost_usd = estimate_cost(estimated_tokens, model=run.model)
    if run.token_budget_remaining < estimated_cost_usd:
        raise BudgetExhaustedError(
            "E-BUDGET-001",
            f"Run {run_id} budget exhausted: remaining ${run.token_budget_remaining:.4f}",
        )

    # Layer 2: Tenant budget (Redis cache, synced with the DB every minute)
    tenant_budget = await get_tenant_budget_remaining(project_id)
    if tenant_budget < estimated_cost_usd:
        raise BudgetExhaustedError(
            "E-BUDGET-002",
            f"Tenant {project_id} budget exhausted",
        )

    # Layer 3: Platform budget (Redis cache)
    platform_budget = await get_platform_budget_remaining()
    if platform_budget < estimated_cost_usd:
        raise BudgetExhaustedError(
            "E-BUDGET-003",
            "Platform budget exhausted: all LLM calls suspended",
        )
```

**Line of defense 2: Post-call Accounting (after the call)**

```python
from sqlalchemy import text

# calculate_cost, db_transaction, db and update_budget_cache are assumed to be
# defined elsewhere in the platform code.

async def record_token_usage(
    project_id: str, agent_id: str, run_id: str,
    prompt_tokens: int, completion_tokens: int,
    model: str, provider: str,
) -> None:
    actual_cost = calculate_cost(prompt_tokens, completion_tokens, model, provider)

    async with db_transaction():
        # Deduct the actual cost from the run budget
        await db.execute(
            text("""
                UPDATE awooop_run_state
                SET token_budget_remaining = token_budget_remaining - :cost,
                    total_tokens_used = total_tokens_used + :total_tokens
                WHERE run_id = :run_id
            """),
            {"cost": actual_cost, "total_tokens": prompt_tokens + completion_tokens,
             "run_id": run_id},
        )

        # Append a row to budget_ledger
        await db.execute(
            text("""
                INSERT INTO budget_ledger
                    (project_id, agent_id, run_id, model, provider,
                     prompt_tokens, completion_tokens, cost_usd, recorded_at)
                VALUES (:project_id, :agent_id, :run_id, :model, :provider,
                        :prompt_tokens, :completion_tokens, :cost_usd, NOW())
            """),
            {...}  # bind parameters elided in the original; keys match the placeholders above
        )

    # Update the Redis cache (async, outside the transaction)
    await update_budget_cache(project_id, actual_cost)
```

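Post-call accounting depends on `calculate_cost`, which the ADR references but does not define. One possible shape is sketched below. The price table, its key layout, and the provider and model names are hypothetical placeholders rather than real prices, and rounding up to four decimals is an assumption chosen to match the `NUMERIC(10, 4)` columns.

```python
from decimal import ROUND_UP, Decimal

# HYPOTHETICAL prices in USD per 1,000,000 tokens, keyed by (provider, model).
# Each value is (prompt price, completion price). Replace with real price data.
PRICE_PER_MILLION = {
    ("example-provider", "example-model"): (Decimal("3.00"), Decimal("15.00")),
}


def calculate_cost(
    prompt_tokens: int, completion_tokens: int, model: str, provider: str
) -> Decimal:
    """Return the USD cost of one call, rounded up to 4 decimal places."""
    try:
        prompt_price, completion_price = PRICE_PER_MILLION[(provider, model)]
    except KeyError:
        # Fail closed: an unknown model must not be treated as free.
        raise ValueError(f"no price configured for {provider}/{model}") from None

    raw = (
        prompt_tokens * prompt_price + completion_tokens * completion_price
    ) / Decimal(1_000_000)
    # Round up so tiny calls are never booked as zero cost.
    return raw.quantize(Decimal("0.0001"), rounding=ROUND_UP)
```

Using `Decimal` rather than `float` keeps the booked amounts exact, so repeated deductions from `token_budget_remaining` do not drift.
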
**Line of defense 3: Emergency Platform Kill Switch (emergency stop)**

```python
# Redis key: platform:budget:emergency_kill
# SET platform:budget:emergency_kill "1" → stops all LLM calls
# `redis` is assumed to be an async Redis client created elsewhere.

async def check_emergency_kill() -> None:
    if await redis.exists("platform:budget:emergency_kill"):
        raise BudgetExhaustedError(
            "E-BUDGET-004",
            "Emergency kill switch activated: contact platform admin",
        )
```

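The three lines of defense are described separately, so it may help to see how they could wrap a single LLM call. The sketch below uses in-memory fakes instead of Redis or the database. `guarded_llm_call` and `demo` are hypothetical names, and running the kill-switch check before the budget pre-check is an assumption of this sketch (it is the cheapest check), not something the ADR specifies.

```python
import asyncio
from typing import Awaitable, Callable, List


class BudgetExhaustedError(Exception):
    """Stand-in for the platform's error type; carries an E-BUDGET-* code."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code


async def guarded_llm_call(
    check_kill: Callable[[], Awaitable[None]],
    pre_check: Callable[[], Awaitable[None]],
    call_llm: Callable[[], Awaitable[str]],
    record_usage: Callable[[str], Awaitable[None]],
) -> str:
    await check_kill()            # defense 3: global stop
    await pre_check()             # defense 1: budget check before spending
    response = await call_llm()   # reached only if both checks passed
    await record_usage(response)  # defense 2: book the actual cost
    return response


async def demo(kill_switch_on: bool) -> List[str]:
    """Run one guarded call against in-memory fakes and return the call order."""
    events: List[str] = []

    async def check_kill() -> None:
        events.append("kill_check")
        if kill_switch_on:
            raise BudgetExhaustedError("E-BUDGET-004", "Emergency kill switch activated")

    async def pre_check() -> None:
        events.append("pre_check")

    async def call_llm() -> str:
        events.append("llm_call")
        return "ok"

    async def record_usage(_response: str) -> None:
        events.append("accounting")

    try:
        await guarded_llm_call(check_kill, pre_check, call_llm, record_usage)
    except BudgetExhaustedError as exc:
        events.append(exc.code)
    return events
```

With the switch off, `asyncio.run(demo(False))` walks through all four stages; with it on, the call stops at the first check and no LLM request is issued.
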
### D3: Budget Columns (awooop_run_state)

```sql
ALTER TABLE awooop_run_state
    ADD COLUMN token_budget_usd NUMERIC(10, 4) NOT NULL DEFAULT 1.00,
    ADD COLUMN token_budget_remaining NUMERIC(10, 4) NOT NULL DEFAULT 1.00,
    ADD COLUMN total_tokens_used INT NOT NULL DEFAULT 0,
    ADD COLUMN total_cost_usd NUMERIC(10, 4) NOT NULL DEFAULT 0.0000;
```

**Default Run Budget**:
- Read from `run_budget_usd` in the Agent contract
- If unset: defaults to $1.00
- Cap (enforced by the Platform): $10.00 per run

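The three default rules above can be expressed as one small function. This is a sketch under stated assumptions: `resolve_run_budget` and the two constant names are hypothetical, and rejecting negative values is an added safeguard that the ADR does not mention.

```python
from decimal import Decimal
from typing import Optional

DEFAULT_RUN_BUDGET_USD = Decimal("1.00")        # used when the contract sets nothing
PLATFORM_MAX_RUN_BUDGET_USD = Decimal("10.00")  # cap enforced by the Platform


def resolve_run_budget(contract_run_budget_usd: Optional[Decimal]) -> Decimal:
    """Pick the initial budget for a run from the Agent contract value."""
    if contract_run_budget_usd is None:
        return DEFAULT_RUN_BUDGET_USD
    if contract_run_budget_usd < 0:
        # Assumption of this sketch: a negative budget is a contract error.
        raise ValueError("run_budget_usd must not be negative")
    # The contract may ask for more, but the platform cap always wins.
    return min(contract_run_budget_usd, PLATFORM_MAX_RUN_BUDGET_USD)
```
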
### D4: Alert Thresholds (Three Stages)

```python
BUDGET_ALERT_THRESHOLDS = {
    "warn": 0.80,       # 80% used → Telegram warning (no kill)
    "critical": 0.95,   # 95% used → urgent Telegram alert + recommend human confirmation
    "hard_kill": 1.00,  # 100% → hard kill (E-BUDGET-001/002/003)
}
```

**Alert notification (Telegram)**:
```
[BUDGET-WARN] Run {run_id} (project={project_id})
Used ${used:.3f} / ${limit:.3f} ({pct:.1f}%)
Model: {model}, Provider: {provider}
```

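To illustrate how the thresholds and the message template might be applied, here is a minimal sketch. `alert_level` and `format_alert` are hypothetical helpers. The thresholds are restated as `Decimal` to avoid float comparison surprises, and treating a zero or negative limit as `hard_kill` is an assumption of this sketch.

```python
from decimal import Decimal
from typing import Optional

BUDGET_ALERT_THRESHOLDS = {
    "warn": Decimal("0.80"),
    "critical": Decimal("0.95"),
    "hard_kill": Decimal("1.00"),
}


def alert_level(used_usd: Decimal, limit_usd: Decimal) -> Optional[str]:
    """Return the most severe threshold reached, or None below 80%."""
    if limit_usd <= 0:
        return "hard_kill"  # assumption: no budget means nothing may be spent
    ratio = used_usd / limit_usd
    # Test the most severe level first so 100% is not reported as "warn".
    for level in ("hard_kill", "critical", "warn"):
        if ratio >= BUDGET_ALERT_THRESHOLDS[level]:
            return level
    return None


def format_alert(
    run_id: str, project_id: str, used_usd: Decimal, limit_usd: Decimal,
    model: str, provider: str,
) -> str:
    """Fill in the Telegram template; expects a positive limit_usd."""
    pct = used_usd / limit_usd * 100
    return (
        f"[BUDGET-WARN] Run {run_id} (project={project_id})\n"
        f"Used ${used_usd:.3f} / ${limit_usd:.3f} ({pct:.1f}%)\n"
        f"Model: {model}, Provider: {provider}"
    )
```
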
### D5: Budget Ledger Table Enhancements

```sql
-- Additional columns for the existing budget_ledger table
ALTER TABLE budget_ledger
    ADD COLUMN IF NOT EXISTS project_id VARCHAR(64),
    ADD COLUMN IF NOT EXISTS agent_id VARCHAR(128),
    ADD COLUMN IF NOT EXISTS run_id UUID,
    ADD COLUMN IF NOT EXISTS model VARCHAR(64),
    ADD COLUMN IF NOT EXISTS provider VARCHAR(32),
    ADD COLUMN IF NOT EXISTS prompt_tokens INT,
    ADD COLUMN IF NOT EXISTS completion_tokens INT,
    ADD COLUMN IF NOT EXISTS cost_usd NUMERIC(10, 4);

-- Indexes for analysis (IF NOT EXISTS keeps the migration idempotent, like the ALTER above)
CREATE INDEX IF NOT EXISTS idx_budget_ledger_project_date
    ON budget_ledger(project_id, recorded_at);

CREATE INDEX IF NOT EXISTS idx_budget_ledger_run
    ON budget_ledger(run_id);
```

### D6: Agent Contract Budget Fields

Budget fields of the Agent contract (one of the six contracts):

```json
{
  "agent_id": "openclaw-sre",
  "run_budget_usd": 0.50,          // per-run cap (default $0.50)
  "daily_budget_usd": 10.00,       // per-day cap
  "max_tokens_per_run": 100000,    // auxiliary limit (token count)
  "llm_call_limit_per_run": 20     // maximum number of LLM calls, to prevent infinite loops
}
```

**`llm_call_limit_per_run`** (infinite-loop protection):
- Before each LLM call, the Agent counts the entries in `saga_steps` with `step_type=llm_call`
- Exceeding `llm_call_limit_per_run` → E-BUDGET-005 (LLM call limit exceeded)

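The call-count rule above can be sketched as a pre-call guard. The step representation (a mapping with a `step_type` key) and the names `LLMCallLimitExceeded` and `check_llm_call_limit` are assumptions for illustration. The sketch rejects the call once the number of recorded calls has already reached the limit, which is one reading of "exceeding" when the check runs before the call.

```python
from typing import Iterable, Mapping


class LLMCallLimitExceeded(Exception):
    """Hypothetical error type for E-BUDGET-005."""

    code = "E-BUDGET-005"


def check_llm_call_limit(
    saga_steps: Iterable[Mapping[str, str]], llm_call_limit_per_run: int
) -> int:
    """Raise if one more LLM call would exceed the limit; else return calls so far."""
    calls_so_far = sum(
        1 for step in saga_steps if step.get("step_type") == "llm_call"
    )
    # The check runs BEFORE the next call, so reaching the limit already blocks it.
    if calls_so_far >= llm_call_limit_per_run:
        raise LLMCallLimitExceeded(
            f"{LLMCallLimitExceeded.code}: LLM call limit exceeded "
            f"({calls_so_far}/{llm_call_limit_per_run})"
        )
    return calls_so_far
```

Because the count is derived from recorded steps instead of a cost estimate, this guard still stops a runaway loop when each individual call is cheap.
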
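---

## Consequences

### Benefits
- Three-layer budget protection: limits are enforced at the Platform, Tenant, and Run levels
- Pre-call check: over-budget calls are blocked before the API request is sent, so they incur no cost
- Emergency kill switch: in an emergency, a single Redis key stops all LLM calls
- `llm_call_limit_per_run`: addresses the root cause of the $47k incident (an infinite loop)

### Costs
- Three extra Redis GETs before every LLM call (about 1-2 ms of added latency in total)
- Post-call accounting requires a DB write (`budget_ledger` already exists; this only adds columns)
- The budget cache and the DB can be out of sync for up to 1 minute (acceptable)

### Risks
- Inaccurate token estimates (`estimated_tokens`) → the pre-call check may block calls that fit the budget or let over-budget calls through
  - Mitigation: estimate `estimated_tokens` conservatively (prompt token count × 1.5); blocking too much is preferred over blocking too little
- The Redis budget cache becomes unavailable → fall back to DB queries (slower, but the check is never bypassed)
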
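The mitigation above (prompt token count × 1.5) is easy to get subtly wrong with float rounding, so a small integer-only sketch follows. `conservative_token_estimate` is a hypothetical name, and rounding up is an assumption consistent with preferring over-blocking.

```python
def conservative_token_estimate(prompt_token_count: int) -> int:
    """Return ceil(prompt_token_count * 1.5) using integer arithmetic only."""
    if prompt_token_count < 0:
        raise ValueError("prompt_token_count must not be negative")
    # (3n + 1) // 2 equals ceil(1.5 * n) for every non-negative integer n.
    return (prompt_token_count * 3 + 1) // 2
```

The result would be passed as `estimated_tokens` to the pre-call check, so the estimate errs toward blocking.
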
---

## Acceptance Criteria

- [ ] `awooop_run_state` gains the budget columns (Phase 1 migration)
- [ ] Pre-call budget check blocks over-budget LLM calls (integration test)
- [ ] Exceeding `llm_call_limit_per_run` → E-BUDGET-005 (integration test)
- [ ] Emergency kill switch: Redis SET → all LLM calls stop (integration test)
- [ ] 80% budget alert is delivered to Telegram (integration test)
- [ ] EwoooC and AWOOOI budgets are accounted independently (ADR-115 D5 acceptance)

## Related

- ADR-106 (definition of the Agent contract `budget_limit` field)
- ADR-112 (contract governance, Agent contract publish/activate)
- ADR-119 (SAGA, budget state is preserved after a run fails)
- ADR-121 (OTel, token usage telemetry reporting)