awoooi/docs/adr/ADR-120-token-budget-hard-kill.md
Your Name 13e51802fe feat(awooop): all Phase 0 ADRs + Phase 1 control plane schema (including four critic fixes)
## Phase 0 (documentation layer, all Accepted)
- ADR-106/107: AwoooP platform architecture + storage strategy
- ADR-111~118: seven core ADRs, Bootstrap → RLS
- ADR-119~124: six ADRs, SAGA → Singleton Decomposition
- ADR-UI-01~04: four UI ADRs for the Operator Console

## Phase 1 (DB schema + migration)
- awooop_phase1_control_plane_2026-05-04.sql: 7 new tables + triggers + RLS
  - Step 1: three roles (platform_admin/migration with BYPASSRLS; awooop_app subject to RLS)
  - Step 13: GRANT least privilege to awooop_app (7 statements)
  - Step 14: RLS fail-closed; remove the __platform__ backdoor
- awooop_phase1_batch1_rls_2026-05-04.sql: three-step ADD COLUMN for the four high-traffic tables
- awooop_phase1_batch1_backfill.py: batched backfill script using SKIP LOCKED
- awooop_models.py: 7 SQLAlchemy 2.x models

## Critic fixes (4 Critical + 3 Major)
- C-1: ADD CONSTRAINT IF NOT EXISTS → DO block + pg_constraint lookup
- C-2: __mapper_args__ string list → primary_key=True on mapped_column
- C-3: __platform__ RLS backdoor → removed entirely, replaced with a BYPASSRLS role
- C-4: awooop_app role was never created → Step 1 + 7 GRANT statements
- M-1: active_pointer_guard SECURITY DEFINER (FORCE RLS cross-tenant protection)
- M-2: idempotency guard added around pg_partman create_parent
- M-3: immutability trigger now also protects identity columns (project_id/family/contract_id)

## Task 1.2 fixes
- agent_loader.py: hard-coded Mac path → AGENTS_DIR environment variable
- Dockerfile: add the missing COPY .claude/agents/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 13:37:11 +08:00

# ADR-120: Token Budget Hard Kill
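**Status**: Accepted
**Date**: 2026-05-03 (Taipei)
**Decision maker**: Commander
**Scope**: three-layer token budget architecture, hard kill mechanism, lessons from the $47k incident, alert thresholds
**Related**: ADR-106 (Agent contract budget_limit), ADR-112 (contract governance)

---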
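## Background
### Lessons from the $47k Agent Loop Incident
An Agent on an agentic AI platform entered an infinite LLM call loop (caused by prompt injection) and burned $47,000 USD within 48 hours.
Root causes of the incident:
- **Alerts only, no hard kill**: an alert fired when the budget threshold was exceeded, but the Agent kept running
- **No token counting**: the token consumption of each LLM call was not tallied in real time
- **No layered budget**: the whole platform shared a single budget cap, so one Agent could consume the entire allowance
### Current State
- The `budget_ledger` table exists, but there is no hard kill mechanism
- There is no pre-check before an LLM call; usage is only tallied after the fact
- There is no per-run budget limit, only a per-provider total budget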
---
## Decision
### D1 — Three-Layer Budget Architecture
```
Platform Budget (global cap)
└── Tenant Budget (per project_id)
    └── Agent Budget (per agent_id)
        └── Run Budget (per run_id, finest granularity)
```
**Layer overview**:
| Layer | Scope | Stored in | Update frequency |
|-------|-------|-----------|------------------|
| Platform | All LLM calls | `config.py` + K8s Secret | Static |
| Tenant | per project_id | `awooop_projects.budget_limit_usd` | Static for the contract term |
| Agent | per agent_id | Agent contract `budget_limit` | On version update |
| Run | per run_id | `awooop_run_state.token_budget_remaining` | Updated on every LLM call |
### D2 — Hard Kill Mechanism (Three Lines of Defense)
**Line of defense 1: Pre-call Budget Check (before the call)**
```python
async def check_budget_before_llm_call(
    project_id: str,
    agent_id: str,
    run_id: str,
    estimated_tokens: int,
) -> None:
    """Estimate the token cost up front and reject the call if any budget would be exceeded."""
    # Layer 1: Run budget
    run = await get_run(run_id)
    estimated_cost_usd = estimate_cost(estimated_tokens, model=run.model)
    if run.token_budget_remaining < estimated_cost_usd:
        raise BudgetExhaustedError(
            "E-BUDGET-001",
            f"Run {run_id} budget exhausted: remaining ${run.token_budget_remaining:.4f}"
        )
    # Layer 2: Tenant budget (Redis cache, synced to the DB every minute)
    tenant_budget = await get_tenant_budget_remaining(project_id)
    if tenant_budget < estimated_cost_usd:
        raise BudgetExhaustedError(
            "E-BUDGET-002",
            f"Tenant {project_id} budget exhausted"
        )
    # Layer 3: Platform budget (Redis cache)
    platform_budget = await get_platform_budget_remaining()
    if platform_budget < estimated_cost_usd:
        raise BudgetExhaustedError(
            "E-BUDGET-003",
            "Platform budget exhausted — all LLM calls suspended"
        )
```
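The layered check above can be exercised without a database or Redis. The following is a minimal sketch, assuming a hypothetical `BudgetExhaustedError` that carries an error code and a hypothetical in-memory `Budgets` holder standing in for the run row and the two cached balances; neither shape is defined in this ADR.

```python
import asyncio
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


class BudgetExhaustedError(Exception):
    """Hypothetical exception shape: an error code plus a readable message."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code


@dataclass
class Budgets:
    """In-memory stand-in for the run row and the two cached balances (USD)."""
    run_remaining: Decimal
    tenant_remaining: Decimal
    platform_remaining: Decimal


async def precheck(budgets: Budgets, estimated_cost_usd: Decimal) -> None:
    # Check from the narrowest layer outwards; the first shortfall raises.
    layers = [
        ("E-BUDGET-001", "run", budgets.run_remaining),
        ("E-BUDGET-002", "tenant", budgets.tenant_remaining),
        ("E-BUDGET-003", "platform", budgets.platform_remaining),
    ]
    for code, layer, remaining in layers:
        if remaining < estimated_cost_usd:
            raise BudgetExhaustedError(code, f"{layer} budget exhausted")


def first_failing_code(budgets: Budgets, cost: str) -> Optional[str]:
    """Return the error code the pre-check raises, or None if the call may proceed."""
    try:
        asyncio.run(precheck(budgets, Decimal(cost)))
    except BudgetExhaustedError as exc:
        return exc.code
    return None
```

Using `Decimal` rather than `float` here mirrors the `NUMERIC(10, 4)` columns, so comparisons near the limit do not depend on binary rounding.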
**Line of defense 2: Post-call Accounting (after the call)**
```python
from sqlalchemy import text

# db, db_transaction, calculate_cost and update_budget_cache are project helpers.
async def record_token_usage(
    project_id: str, agent_id: str, run_id: str,
    prompt_tokens: int, completion_tokens: int,
    model: str, provider: str
) -> None:
    actual_cost = calculate_cost(prompt_tokens, completion_tokens, model, provider)
    async with db_transaction():
        # Deduct from the run budget; total_cost_usd (D3) moves in step with it
        await db.execute(
            text("""
                UPDATE awooop_run_state
                SET token_budget_remaining = token_budget_remaining - :cost,
                    total_tokens_used = total_tokens_used + :total_tokens,
                    total_cost_usd = total_cost_usd + :cost
                WHERE run_id = :run_id
            """),
            {"cost": actual_cost, "total_tokens": prompt_tokens + completion_tokens,
             "run_id": run_id}
        )
        # Write to budget_ledger
        await db.execute(
            text("""
                INSERT INTO budget_ledger
                    (project_id, agent_id, run_id, model, provider,
                     prompt_tokens, completion_tokens, cost_usd, recorded_at)
                VALUES (:project_id, :agent_id, :run_id, :model, :provider,
                        :prompt_tokens, :completion_tokens, :cost_usd, NOW())
            """),
            {"project_id": project_id, "agent_id": agent_id, "run_id": run_id,
             "model": model, "provider": provider,
             "prompt_tokens": prompt_tokens,
             "completion_tokens": completion_tokens, "cost_usd": actual_cost}
        )
    # Update the Redis cache (async, outside the transaction)
    await update_budget_cache(project_id, actual_cost)
```
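The accounting step relies on `calculate_cost`, which this ADR does not define. One possible shape is sketched below, assuming a hypothetical per-1,000-token price table keyed by `(provider, model)`; the provider name, model name, and prices are placeholders, not real pricing.

```python
from decimal import Decimal, ROUND_UP

# Hypothetical prices in USD per 1,000 tokens: (prompt price, completion price).
# Placeholder values for illustration only.
PRICE_PER_1K_TOKENS = {
    ("example-provider", "example-model"): (Decimal("0.003"), Decimal("0.015")),
}


def calculate_cost(prompt_tokens: int, completion_tokens: int,
                   model: str, provider: str) -> Decimal:
    """Return the call cost in USD at the 4-decimal scale of NUMERIC(10, 4)."""
    try:
        prompt_price, completion_price = PRICE_PER_1K_TOKENS[(provider, model)]
    except KeyError:
        # Fail closed: an unpriced model must not be accounted as free.
        raise ValueError(f"no price configured for {provider}/{model}") from None
    cost = (Decimal(prompt_tokens) * prompt_price
            + Decimal(completion_tokens) * completion_price) / Decimal(1000)
    # Round up so that a very small call never deducts zero from the budget.
    return cost.quantize(Decimal("0.0001"), rounding=ROUND_UP)
```

Rounding up is a design choice in this sketch: it slightly overstates spend, which matches the ADR's stated preference for killing too early over killing too late.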
**Line of defense 3: Emergency Platform Kill Switch (emergency stop)**
```python
# Redis key: platform:budget:emergency_kill
# SET platform:budget:emergency_kill "1" → stops all LLM calls
async def check_emergency_kill() -> None:
    if await redis.exists("platform:budget:emergency_kill"):
        raise BudgetExhaustedError(
            "E-BUDGET-004",
            "Emergency kill switch activated — contact platform admin"
        )
```
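How the kill switch gates an LLM call can be sketched as follows. The ADR does not state where the check sits relative to the other defenses, so running it first is an assumption here, as are the `FakeRedis` stand-in and the `guarded_llm_call` wrapper, neither of which exists in the source.

```python
import asyncio

KILL_KEY = "platform:budget:emergency_kill"


class BudgetExhaustedError(Exception):
    """Hypothetical exception shape: an error code plus a readable message."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code


class FakeRedis:
    """In-memory stand-in exposing only the async calls this sketch needs."""

    def __init__(self) -> None:
        self._keys = set()

    async def exists(self, key: str) -> int:
        return 1 if key in self._keys else 0

    async def set(self, key: str, value: str) -> None:
        self._keys.add(key)

    async def delete(self, key: str) -> None:
        self._keys.discard(key)


async def check_emergency_kill(redis) -> None:
    if await redis.exists(KILL_KEY):
        raise BudgetExhaustedError("E-BUDGET-004", "Emergency kill switch activated")


async def guarded_llm_call(redis, call_llm):
    # Assumed ordering: the kill switch is consulted before anything else.
    await check_emergency_kill(redis)
    return await call_llm()


async def demo() -> list:
    redis = FakeRedis()

    async def call_llm() -> str:
        return "ok"

    results = [await guarded_llm_call(redis, call_llm)]   # switch off
    await redis.set(KILL_KEY, "1")                        # operator flips it
    try:
        await guarded_llm_call(redis, call_llm)
    except BudgetExhaustedError as exc:
        results.append(exc.code)
    await redis.delete(KILL_KEY)                          # operator clears it
    results.append(await guarded_llm_call(redis, call_llm))
    return results
```

Because the check tests key existence rather than the stored value, clearing the switch means deleting the key, not setting it to `"0"`.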
### D3 — Budget Columns (awooop_run_state)
```sql
ALTER TABLE awooop_run_state
    ADD COLUMN token_budget_usd NUMERIC(10, 4) NOT NULL DEFAULT 1.00,
    ADD COLUMN token_budget_remaining NUMERIC(10, 4) NOT NULL DEFAULT 1.00,
    ADD COLUMN total_tokens_used INT NOT NULL DEFAULT 0,
    ADD COLUMN total_cost_usd NUMERIC(10, 4) NOT NULL DEFAULT 0.0000;
```
**Default Run Budget**:
- Read from `run_budget_usd` in the Agent contract
- If unset: defaults to $1.00
- Upper limit (enforced by the Platform): $10.00 per run
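The three rules above can be expressed as a small resolver. This is a sketch under assumptions: the function name `resolve_run_budget` is hypothetical, clamping an over-limit contract value to the ceiling (rather than rejecting the contract) is one possible reading of "enforced by the Platform", and rejecting non-positive values is an added safeguard not stated in the ADR.

```python
from decimal import Decimal
from typing import Optional

DEFAULT_RUN_BUDGET_USD = Decimal("1.00")   # used when the contract sets nothing
MAX_RUN_BUDGET_USD = Decimal("10.00")      # platform-enforced ceiling per run


def resolve_run_budget(contract_run_budget_usd: Optional[Decimal]) -> Decimal:
    """Pick the run budget: contract value, else the default, capped at the ceiling."""
    if contract_run_budget_usd is None:
        return DEFAULT_RUN_BUDGET_USD
    if contract_run_budget_usd <= 0:
        # Assumption: a zero or negative budget is a contract error, not "unlimited".
        raise ValueError("run_budget_usd must be positive")
    return min(contract_run_budget_usd, MAX_RUN_BUDGET_USD)
```

The resolved value would seed both `token_budget_usd` and `token_budget_remaining` when the run row is created.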
### D4 — Alert Thresholds (Three Stages)
```python
BUDGET_ALERT_THRESHOLDS = {
    "warn": 0.80,       # 80% usage → Telegram warning (no kill)
    "critical": 0.95,   # 95% usage → urgent Telegram alert + human confirmation recommended
    "hard_kill": 1.00,  # 100% → hard kill (E-BUDGET-001/002/003)
}
```
**Alert notification (Telegram)**:
```
[BUDGET-WARN] Run {run_id} (project={project_id})
Used ${used:.3f} / ${limit:.3f} ({pct:.1f}%)
Model: {model}, Provider: {provider}
```
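Mapping a usage ratio onto these stages and rendering the message could look like the sketch below. The function names `alert_level` and `format_alert` are hypothetical, and deriving the tag from the level name (for example `BUDGET-CRITICAL`) is an assumption, since the ADR only shows the `BUDGET-WARN` form.

```python
from typing import Optional

# Ordered from lowest to highest so the last match is the most severe stage.
BUDGET_ALERT_THRESHOLDS = {"warn": 0.80, "critical": 0.95, "hard_kill": 1.00}


def alert_level(used: float, limit: float) -> Optional[str]:
    """Return the highest stage reached, or None while usage is below 80%."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    ratio = used / limit
    reached = [name for name, t in BUDGET_ALERT_THRESHOLDS.items() if ratio >= t]
    return reached[-1] if reached else None


def format_alert(level: str, run_id: str, project_id: str,
                 used: float, limit: float, model: str, provider: str) -> str:
    """Render the Telegram text in the layout of the template above."""
    pct = used / limit * 100
    return (
        f"[BUDGET-{level.upper()}] Run {run_id} (project={project_id})\n"
        f"Used ${used:.3f} / ${limit:.3f} ({pct:.1f}%)\n"
        f"Model: {model}, Provider: {provider}"
    )
```

A caller would typically send a message only when `alert_level` returns a stage that differs from the last one reported for the run, to avoid repeating the same warning on every call.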
### D5 — Budget Ledger Table Enhancements
```sql
-- Columns added to the existing budget_ledger
ALTER TABLE budget_ledger
    ADD COLUMN IF NOT EXISTS project_id VARCHAR(64),
    ADD COLUMN IF NOT EXISTS agent_id VARCHAR(128),
    ADD COLUMN IF NOT EXISTS run_id UUID,
    ADD COLUMN IF NOT EXISTS model VARCHAR(64),
    ADD COLUMN IF NOT EXISTS provider VARCHAR(32),
    ADD COLUMN IF NOT EXISTS prompt_tokens INT,
    ADD COLUMN IF NOT EXISTS completion_tokens INT,
    ADD COLUMN IF NOT EXISTS cost_usd NUMERIC(10, 4);
-- Indexes for analysis (IF NOT EXISTS keeps the migration re-runnable, like the ALTER above)
CREATE INDEX IF NOT EXISTS idx_budget_ledger_project_date
    ON budget_ledger(project_id, recorded_at);
CREATE INDEX IF NOT EXISTS idx_budget_ledger_run
    ON budget_ledger(run_id);
```
### D6 — Agent Contract Budget Fields
Budget fields of the Agent contract (one of the six contracts):
```json
{
  "agent_id": "openclaw-sre",
  "run_budget_usd": 0.50,           // per-run cap (default $0.50)
  "daily_budget_usd": 10.00,        // per-day cap
  "max_tokens_per_run": 100000,     // secondary limit (token count)
  "llm_call_limit_per_run": 20      // maximum number of LLM calls, to prevent infinite loops
}
```
**`llm_call_limit_per_run`** (infinite loop protection):
- Before each LLM call, the Agent counts the rows in `saga_steps` with `step_type=llm_call`
- Exceeding `llm_call_limit_per_run` → E-BUDGET-005 (LLM call limit exceeded)
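The call-count guard described above can be sketched as follows, assuming a hypothetical `SagaStep` record with `run_id` and `step_type` fields and a hypothetical `LLMCallLimitExceeded` exception; the real `saga_steps` schema is not shown in this ADR. Because the check runs before the call, it trips once the number of recorded calls has reached the limit, since the next call would exceed it.

```python
from dataclasses import dataclass
from typing import Iterable


class LLMCallLimitExceeded(Exception):
    """Hypothetical exception for E-BUDGET-005."""
    code = "E-BUDGET-005"


@dataclass(frozen=True)
class SagaStep:
    """Assumed minimal shape of a saga_steps row."""
    run_id: str
    step_type: str


def check_llm_call_limit(steps: Iterable[SagaStep], run_id: str,
                         llm_call_limit_per_run: int) -> int:
    """Return the LLM calls already made by this run, or raise if no more are allowed."""
    made = sum(1 for s in steps
               if s.run_id == run_id and s.step_type == "llm_call")
    if made >= llm_call_limit_per_run:
        raise LLMCallLimitExceeded(
            f"{LLMCallLimitExceeded.code}: run {run_id} already made "
            f"{made} LLM calls (limit {llm_call_limit_per_run})"
        )
    return made
```

In a real deployment the count would more likely be a `SELECT COUNT(*)` against `saga_steps` filtered by `run_id` and `step_type`; the in-memory list here only keeps the sketch self-contained.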
---
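## Consequences
### Benefits
- Three-layer budget protection: Platform / Tenant / Run limits stack on top of each other
- Pre-call check: intercepts before the API request is sent, so no cost is incurred
- Emergency kill switch: in an emergency, a single Redis key stops all LLM calls
- `llm_call_limit_per_run`: prevents the $47k incident (infinite loop) at the root
### Costs
- Adds 3 Redis GETs before each LLM call (about 1-2 ms total latency)
- Post-call accounting incurs a DB write (budget_ledger already exists; this only adds columns)
- The budget cache and the DB may be inconsistent for up to 1 minute (acceptable)
### Risks
- Token estimation (estimated_tokens) may be inaccurate → the pre-call check may kill a run wrongly or miss a kill
  - Mitigation: estimate estimated_tokens conservatively (prompt token count × 1.5); a wrong kill is preferred over a missed kill
- Budget cache (Redis) failure → fall back to DB queries (slower, but the check is never bypassed)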
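Both mitigations above are small enough to sketch. The function names, the callable-based interface, and the use of `ConnectionError` to signal a cache outage are assumptions for illustration; only the × 1.5 factor and the fall-back-to-DB behaviour come from the text.

```python
import math
from decimal import Decimal
from typing import Callable

SAFETY_FACTOR = 1.5  # from the mitigation above: prompt token count × 1.5


def estimate_tokens(prompt_token_count: int) -> int:
    """Conservative estimate, rounded up so it never understates usage."""
    return math.ceil(prompt_token_count * SAFETY_FACTOR)


def tenant_budget_remaining(read_cache: Callable[[], Decimal],
                            read_db: Callable[[], Decimal]) -> Decimal:
    """Prefer the cache; on a cache outage use the DB instead of skipping the check."""
    try:
        return read_cache()
    except ConnectionError:
        # Slower path, but the budget check still runs against real data.
        return read_db()


def unavailable_cache() -> Decimal:
    """Demo helper simulating a Redis outage."""
    raise ConnectionError("cache unavailable")
```

If the DB read fails as well, the exception propagates and the LLM call does not happen, which keeps the overall behaviour fail-closed.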
---
## Acceptance Criteria
- [ ] `awooop_run_state` gains the budget-related columns (Phase 1 migration)
- [ ] Pre-call budget check blocks over-budget LLM calls (integration test)
- [ ] Exceeding `llm_call_limit_per_run` → E-BUDGET-005 (integration test)
- [ ] Emergency kill switch: Redis SET → all LLM calls stop (integration test)
- [ ] 80% budget alert is delivered to Telegram (integration test)
- [ ] EwoooC budget and AWOOOI budget are accounted independently (ADR-115 D5 acceptance)
## Related
- ADR-106 (Agent contract `budget_limit` field definition)
- ADR-112 (contract governance: Agent contract publish/activate)
- ADR-119 (SAGA: budget state is preserved after a run fails)
- ADR-121 (OTel: token usage telemetry reporting)