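# AwoooP Complete Detailed Implementation Plan

**Version**: v1.0 (consolidated after the 12-Agent panoramic review)
**Date**: 2026-05-03 (Taipei time zone)
**Author**: 12-Agent joint review × Codex consolidation
**Base documents**: MASTER-WORKPLAN.md, ADR-106, ADR-107
**⚠️ ADR numbering correction**: ADR-108/109/110 are already taken by other ADRs, so AwoooP-specific ADRs start at ADR-111

> This document is the fully expanded version of MASTER-WORKPLAN.md.
> MASTER-WORKPLAN is the master index; this document holds the execution details.
> In case of any conflict, this document prevails (it was updated more recently).

---
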
## 0. Background Overview

### 0.1 Infrastructure Status (as of 2026-05-03)

| Component | Status | Notes |
|------|------|------|
| Ollama Primary | GCP-A `34.143.170.20:11434` (SSD) | ADR-110, supersedes ADR-105 |
| Ollama Secondary | GCP-B `34.21.145.224:11434` (SSD) | New; went live 2026-05-03 |
| Ollama Fallback | Local `192.168.0.111:11434` (HDD) | Last line of defense, not Primary |
| PostgreSQL | `192.168.0.188` (private network) | Sole source of truth for the AwoooP control plane |
| Redis | `192.168.0.188` (private network) | cache/watch/counter only (ADR-107 D4) |
| K3s cluster | `awoooi-prod` namespace | AWOOOI first tenant |
| Gitea CI/CD | `192.168.0.110` (or Gitea Cloud) | ADR-039; all builds run from Gitea |

### 0.2 Summary of 12-Agent Review Findings

The original MASTER-WORKPLAN listed 24 consensus issues. After the 12 Agents ran deep reviews in parallel, the following were added:

| Agent | New P0/P1 issues | New ADRs needed | New Inventories |
|-------|-----------------|--------------|----------------|
| critic | 10 | 1 (ADR-116 Migration Discipline) | INV-5, INV-6, INV-7 |
| vuln-verifier | 8 (3 confirmed by PoC) | 2 (ADR-116/117 security series) | - |
| debugger | 12 (failure scenarios) | - | 8 Runbooks |
| db-expert | 8 (table design defects) + RLS entirely missing | 1 (ADR-118 RLS strategy) | - |
| planner | 7 too coarse-grained + 10 acceptance criteria not closed-loop | - | - |
| fullstack-engineer | 7 missing API endpoints + 9 error codes | - | - |
| frontend-designer | 8 UI modules entirely missing | ADR-UI-01~04 | - |
| refactor-specialist | 8 refactoring landmines + 11 PR plans | - | - |
| migration-engineer | 7 compatibility risks | - | version matrix |
| onboarder | 31 background loops (vs. ~10 estimated) + 13 module conflicts | - | INV-8 |
| tool-expert | 8 tools with insufficient capacity + 8 missing tools | - | - |
| web-researcher | 5 major gaps against industry practice (SAGA/Token Kill/MCP OAuth 2.1/OTel/OWASP) | 5 (ADR-119~123) | - |
| **Total added** | **~70 issues** | **~12 ADRs** | **~4 Inventories** |

**Conclusion: unless the Pre-flight Audit is completed first, Phase 1 is bound to blow up.**

---

## 1. Complete Issue List (in P0 Priority Order)

### P0: Immediate Blow-ups (must be fixed before Phase 1)

| # | Issue | Source | Impact |
|---|------|------|---------|
| P0-01 | Redis keys are renamed directly with no dual-write period (cost counters reset to zero, Telegram 409, silence stops working, the three-tier Ollama failover topology is not covered by dual write) | critic | Cost, alerting, Ollama |
| P0-02 | Migration SQL uses wrong table names (`incident_records` / `mcp_audit_snapshots`), has no rollback, and mixes ORM 1.x with 2.x | critic | Phase 1 migration |
| P0-03 | `project_id` / `tenant_id` have 0 hits in the codebase; 30+ business tables lack this column | onboarder | Entire system |
| P0-04 | The `requires_approval` field is decided by LLM output (security_interceptor.py:451-490) | vuln-verifier (PoC confirmed) | Approval chain |
| P0-05 | Callback nonce forgery: a value that passes the server nonce check can be constructed without knowing the secret (security_interceptor.py:451-490) | vuln-verifier (PoC confirmed) | Telegram approval |
| P0-06 | Webhook HMAC is replayable because it has no timestamp/nonce (webhooks.py:679-728) | vuln-verifier (PoC confirmed) | All webhooks |
| P0-07 | None of the 31 background loops carry a project_id (main.py) | onboarder (measured) | Multi-tenancy breaks entirely |
| P0-08 | `telemetry.py:71` hard-codes `if "192.168.0.188" not in endpoint: raise`, so EwoooC is guaranteed to fail at startup | onboarder | EwoooC Phase 6 |
| P0-09 | The `project_migration_state` table is missing, so Strangler Fig has no data carrier | db-expert | Phase 1 |
| P0-10 | Task 9 ordering is inverted (the agent prompt load point comes before the ConfigMap) → everything returns None | critic | Any agent in Phase 1 |
| P0-11 | `ollama:current_primary` has a second definition at `ollama_auto_recovery.py:230`; the three-tier topology migration is bound to split | onboarder | GCP Ollama topology |
| P0-12 | In `consensus_engine.py`, `CONSENSUS_PREFIX="consensus:"` has no project prefix, so it is shared across tenants under multi-tenancy | onboarder | Multi-tenant consistency |
| P0-13 | kubectl calls at `mcp_bridge.py:592-681` hard-code `namespace="awoooi-prod"` | onboarder | EwoooC K8s tool |

### P1: Severe Defects (must be fixed before Phase 2-4)

| # | Issue | Source | Impact |
|---|------|------|---------|
| P1-01 | AWOOOI Bootstrap Paradox: cron/job/healthcheck all lack project_id | critic | Multi-tenant startup |
| P1-02 | EwoooC onboarding has no technical path at all (it is not just a matter of changing `OLLAMA_API_BASE`) | critic | Phase 6 |
| P1-03 | Strangler Fig shadow→canary→active has no quantified gate conditions | planner | Cutover decisions |
| P1-04 | Layer 3 redaction has zero implementation (the helper exists but nothing enforces it) | critic | Information security |
| P1-05 | The `_provider` attribute is public and can bypass audit (mcp/registry.py:24-71) | critic | MCP security |
| P1-06 | `WAITING_APPROVAL` resume does not verify caller identity and has no approval_token signature | critic | Approval security |
| P1-07 | Redis approval state is a single point with no PG sync | critic | Approval reliability |
| P1-08 | The audit log itself can leak secrets (redaction must happen before the audit sink) | critic | Information security |
| P1-09 | The `sanitization_service.py` helper has no enforcement point (neither MCP Gateway nor AgentToolExecutor uses it) | critic | Tool security |
| P1-10 | Active revision switching has no transactional outbox, so workers may consume a stale policy | db-expert | Policy consistency |
| P1-11 | Run/Channel idempotency lacks key derivation rules and a unique index | db-expert | Duplicate execution |
| P1-12 | Async workers lack lease / heartbeat / stale reaper | db-expert | Worker reliability |
| P1-13 | Partitioning + retention for high-traffic tables must be decided in Phase 1 (cannot be retrofitted) | db-expert | Long-term scalability |
| P1-14 | Observability metrics label cardinality (run_id/trace_id/session_id must not enter metrics) | fullstack | Prometheus |
| P1-15 | The approval flow at `multi_sig_redis.py:178-205` has no trace_id | debugger | Troubleshooting |
| P1-16 | Redis keys at `hermes/nl_gateway.py:7,146,163` have no project prefix | onboarder | Hermes isolation |
| P1-17 | AnomalyCounter at `anomaly_counter.py:790` is a global singleton; its 6 prefixes have no tenant isolation | onboarder | Multi-tenant counting |
| P1-18 | `SCAN incident:*` at `incident_service.py:603-615` has no project_id | onboarder | Redis data isolation |
| P1-19 | Contract publish permissions and signing are undefined | critic | Contract governance |
| P1-20 | 13 global singletons are shared across tenants (TrustEngine/ProviderRegistry/TelegramGateway/etc.) | onboarder | Multi-tenant isolation |
| P1-21 | Token Budget has no Hard Kill (lesson from the $47k agent loop incident) | web-researcher | Cost control |
| P1-22 | RLS (Row Level Security) is entirely missing | db-expert | DB multi-tenancy |
| P1-23 | The Redis key dual-write migration for the GCP Ollama three-tier topology is unplanned (the old `ollama:current_primary` key only knows 1 host) | critic | Ollama failover |
| P1-24 | `decision_manager.py:240` hard-codes `telegram_silence:{target}` instead of importing the gateway constant (the key is defined in two places) | debugger | Silence feature |

### P2: Design Gaps (must be closed before Phase 5-8)

| # | Issue | Source | Impact |
|---|------|------|---------|
| P2-01 | Telegram/LINE/Slack/API/Internal lack a canonical principal mapping | critic | Unified identity |
| P2-02 | Run FSM has zero implementation (only a table design, no state machine code) | fullstack | Phase 4 |
| P2-03 | The EwoooC Provider Proxy cannot just change a URL; it needs a full envelope + audit entry point | critic | Phase 6 |
| P2-04 | Industry-standard Durable Execution / SAGA compensating transaction mechanism is missing | web-researcher | Long-running agent tool chains |
| P2-05 | MCP OAuth 2.1 (RFC 9728 + RFC 7591): no protection against Confused Deputy | web-researcher | MCP security |
| P2-06 | Not aligned with OTel GenAI Semantic Conventions (span naming / attribute conventions) | web-researcher | Observability |
| P2-07 | Gaps against the OWASP Agentic AI Top 10 (7 items, including prompt injection and tool misuse) | web-researcher | AI security |
| P2-08 | ISO 42001 AI management system alignment documents are missing | web-researcher | Compliance |
| P2-09 | 7 API endpoints are missing (see the fullstack list in §6) | fullstack | API completeness |
| P2-10 | 9 error codes are missing (see the error code dictionary in §7) | fullstack | Client-side parsing |
| P2-11 | Progressive feedback policy (async runs give no progress notification within ≤30s) | fullstack | UX |
| P2-12 | 8 Operator Console UI modules are entirely missing (see frontend in §8) | frontend-designer | Operational visibility |
| P2-13 | The `awooop-ctl` CLI tool is missing (operations are currently manual kubectl + curl) | tool-expert | Ops experience |
| P2-14 | OPA/Cedar policy engine is missing (contract authorization logic is currently scattered through the code) | tool-expert | Centralized authorization |
| P2-15 | chaostoolkit / LitmusChaos is missing (Strangler Fig cutover has no chaos validation) | tool-expert | Disaster-recovery validation |
| P2-16 | PgBouncer is missing (with many AwoooP workers the PG connection pool will be exhausted) | tool-expert | DB scalability |

---

## 2. Pre-flight Audit: Complete Phase 0 Checklist

> Phase 0 is entirely docs-only. No runtime code changes.
> Only after it is complete should a new Codex conversation be opened to start Phase 1 code.

### 2.1 AwoooP Core ADRs (ADR-111~115)

**Note: ADR-108/109/110 are already taken by incident fingerprint / telegram dedup / GCP Ollama topology, so AwoooP starts from ADR-111.**

| ADR | Topic | Issues addressed | Main content |
|-----|------|---------|---------|
| **ADR-111** | AwoooP Bootstrap Order & Identity Paradox | P0-07, P0-01, P1-01 | Three markers: `platform_internal` / `requires_project_id` / `legacy_awoooi_default`; classification of the 31 background loops; transitional exemption schedule for AWOOOI cron/jobs; platform_resource declaration for the three-tier Ollama GCP failover |
| **ADR-112** | Contract Governance & Publishing Workflow | P1-19 | Who may publish / activate; CODEOWNERS; HMAC signing; approval workflow; activation audit; isolation of draft from published |
| **ADR-113** | Active Revision Invalidation & Outbox | P1-10 | `awooop_contract_outbox` table design; Redis pub/sub notification; worker revision-aware cache; split-brain defense; GCP Ollama topology switch events |
| **ADR-114** | Idempotency, Worker Lease & Run Recovery | P1-11, P1-12 | Channel event dedupe; unique on `(project_id, channel_type, provider_event_id)`; worker `lease_until` / `heartbeat_at` / `attempt_count`; stale run reaper; SKIP LOCKED |
| **ADR-115** | Canonical Principal Mapping & Tenant Onboarding | P2-01, P0-08 | Unified mapping of Telegram/LINE/Slack/API/Internal → `platform_subject`; EwoooC Proxy Adapter; Tsenyang/Bitan templates; fix plan for the `telemetry.py:71` IP assert |

### 2.2 Security Hardening ADRs

| ADR | Topic | Issues addressed | Main content |
|-----|------|---------|---------|
| **ADR-116** | AwoooP Security Hardening | P0-04, P0-05, P0-06 | Callback nonce redesign (server_secret must take part in the HMAC); webhook timestamp/nonce to prevent replay; `requires_approval` becomes policy-derived (the LLM may not decide it); approval_token signing spec (HS256, 15min TTL, `jti` uniqueness) |
| **ADR-117** | MCP OAuth 2.1 & Confused Deputy Prevention | P2-05 | RFC 9728 Protected Resource Metadata together with Resource Indicators (RFC 8707); RFC 7591 Dynamic Client Registration; per-tenant token scope; Confused Deputy protection design; PKCE flow for MCP Server binding |

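The approval_token spec in the ADR-116 row (HS256, 15min TTL, `jti` uniqueness) can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the project's implementation: the function names, the `run_id` claim, and the in-memory `used_jti` set (standing in for a Redis `SET NX` with a TTL) are all hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time
import uuid
from typing import Optional

TTL_SECONDS = 15 * 60  # 15min TTL, as listed for ADR-116


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: bytes, secret: bytes) -> str:
    return _b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())


def issue_approval_token(secret: bytes, run_id: str, now: Optional[float] = None) -> str:
    """Build a compact HS256 JWT carrying a unique jti and an expiry."""
    now = time.time() if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"run_id": run_id, "jti": uuid.uuid4().hex, "exp": int(now) + TTL_SECONDS}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    return signing_input + "." + _sign(signing_input.encode("ascii"), secret)


def verify_approval_token(token: str, secret: bytes, used_jti: set,
                          now: Optional[float] = None) -> dict:
    """Return the claims, or raise PermissionError on any failed check."""
    now = time.time() if now is None else now
    try:
        header_b64, claims_b64, signature = token.split(".")
    except ValueError:
        raise PermissionError("malformed token") from None
    expected = _sign(f"{header_b64}.{claims_b64}".encode("ascii"), secret)
    if not hmac.compare_digest(expected, signature):
        raise PermissionError("bad signature")
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise PermissionError("unexpected alg")
    claims = json.loads(_b64url_decode(claims_b64))
    if now >= claims["exp"]:
        raise PermissionError("token expired")
    if claims["jti"] in used_jti:
        raise PermissionError("token replayed")
    used_jti.add(claims["jti"])  # stand-in for Redis SET NX (assumption)
    return claims
```

The signature is checked before any claim is parsed, so an attacker who cannot produce a valid HMAC never reaches the expiry or replay logic.
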
### 2.3 Database Hardening ADRs

| ADR | Topic | Issues addressed | Main content |
|-----|------|---------|---------|
| **ADR-118** | Row-Level Security & Tenant DB Isolation | P1-22 | Enable RLS on all AwoooP tables; inject the tenant via `current_setting('app.project_id')`; RLS bypass role design; migration acceptance criteria |
| **ADR-119** | Durable Execution & SAGA Compensation | P2-04 | Step-level journal for multi-step agent tool chains; trigger conditions for compensating transactions; checkpoint/resume design; integration with the Phase 4 run state machine |

### 2.4 Observability & AI Security ADRs

| ADR | Topic | Issues addressed | Main content |
|-----|------|---------|---------|
| **ADR-120** | Token Budget Hard Kill | P1-21 | Three-tier budget limits: per run / per project / per tenant; hard kill (not just an alert); RCA of the $47k agent loop incident; `budget_ledger` table design; Redis hot counter + PG transactional hard stop |
| **ADR-121** | OTel GenAI Semantic Conventions Alignment | P2-06 | Span naming conventions (`gen_ai.request.*`); token count attributes; LLM provider attributes; integration with the existing SigNoz (188:24318); metrics label cardinality rules |
| **ADR-122** | OWASP Agentic AI Top 10 & ISO 42001 Alignment | P2-07, P2-08 | Item-by-item mapping of the Top 10 to the AwoooP control plane; list of documents required by the ISO 42001 AI management system; alignment acceptance items per Phase |

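The three-tier hard kill described for ADR-120 might look roughly like the following. It is a sketch under assumptions: `BudgetGuard`, `BudgetExceeded`, and the in-memory counters (standing in for the Redis hot counter and the PG ledger) are hypothetical names, and the limits are illustrative numbers.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


class BudgetExceeded(RuntimeError):
    """Raised to stop the call outright; the plan maps this to E-BUDGET-001."""


@dataclass
class BudgetGuard:
    # Token limit per scope level, e.g. {"run": 100, "project": 150, "tenant": 1000}
    limits: Dict[str, int]
    # Tokens consumed so far, keyed by (scope level, identifier)
    counters: Dict[Tuple[str, str], int] = field(default_factory=dict)

    def charge(self, run_id: str, project_id: str, tenant_id: str, tokens: int) -> None:
        """Reserve tokens before an LLM call, or raise without changing any counter."""
        keys = [("run", run_id), ("project", project_id), ("tenant", tenant_id)]
        # Check every tier first so that a rejected call leaves all counters untouched.
        for scope, ident in keys:
            if self.counters.get((scope, ident), 0) + tokens > self.limits[scope]:
                raise BudgetExceeded(f"E-BUDGET-001: {scope} budget exceeded for {ident}")
        for key in keys:
            self.counters[key] = self.counters.get(key, 0) + tokens
```

Raising an exception, rather than logging and continuing, is what makes this a hard kill: the caller cannot reach the provider unless `charge` returns.
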
### 2.5 Migration Discipline ADRs

| ADR | Topic | Issues addressed | Main content |
|-----|------|---------|---------|
| **ADR-123** | Background Loop project_id Migration Strategy | P0-07, P1-01 | The 31 background loops fall into three classes (platform_internal / legacy_awoooi_default / requires_project_id); migration strategy per class; regression test design; completion criterion (0 untagged loops in main.py) |
| **ADR-124** | Global Singleton Decomposition for Multi-tenancy | P1-20 | List of the 13 global singletons; decomposition strategy (per-project registry / factory pattern); AWOOOI 1.0 → AwoooP 1.0 migration path; dependency order for singletons that cannot be split at the same time |

### 2.6 Frontend Operator Console ADRs (new)

| ADR | Topic | Issues addressed | Main content |
|-----|------|---------|---------|
| **ADR-UI-01** | AwoooP Operator Console Architecture | P2-12 | Specs for the 8 UI modules; integration with the existing `apps/web/`; multi-tenant view design; i18n (next-intl) conventions |
| **ADR-UI-02** | Contract Lifecycle UI | P2-12 | draft → publish → activate workflow; revision diff visualization; contract family filtering |
| **ADR-UI-03** | Run State & Shadow Monitoring UI | P2-12 | shadow/canary/active switching dashboard; run FSM visualization; display of the quantified Strangler Fig gate metrics |
| **ADR-UI-04** | Tenant Budget & Audit UI | P2-12 | Per-project token budget; hard kill trigger history; audit log queries (with redaction masking) |

### 2.7 ADR-106 Supplementary Sections

ADR-106 needs the following additions:

- **Strangler Fig Quantified Gates** (quantified cutover conditions)
- **GCP Ollama topology impact** (how the three-tier failover becomes a `platform_resource` that belongs to no tenant)
- **Bootstrap Order**, referencing ADR-111

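A quantified gate can be expressed as a pure function over observed metrics. The sketch below encodes only the shadow → canary thresholds listed under Phase 8 item 8.3 of this plan (≥14 days, divergence <5%, p95 <10%, 0 P1 incidents). The `ShadowMetrics` type and its field names are hypothetical, and reading "p95 <10%" as a p95 latency delta is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ShadowMetrics:
    days_in_stage: int
    divergence_pct: float   # decision divergence between legacy and shadow
    p95_delta_pct: float    # p95 latency delta versus legacy (assumed meaning)
    p1_incidents: int


def shadow_to_canary_failures(m: ShadowMetrics) -> List[str]:
    """Return the unmet conditions; an empty list means the gate is open."""
    failures = []
    if m.days_in_stage < 14:
        failures.append("needs >= 14 days in shadow")
    if m.divergence_pct >= 5.0:
        failures.append("divergence must be < 5%")
    if m.p95_delta_pct >= 10.0:
        failures.append("p95 delta must be < 10%")
    if m.p1_incidents != 0:
        failures.append("must have 0 P1 incidents")
    return failures
```

Returning the list of unmet conditions, instead of a bare boolean, gives a dashboard something concrete to display for each blocked promotion.
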
### 2.8 Inventory List (9 documents)

| Inventory | Location | Scope | Issues addressed |
|-----------|------|------|---------|
| **INV-1** | `docs/awooop/inventory/INV-1-redis-keys.md` | grep the whole codebase for `redis_client.*\(["']` and similar; list the 43+ keys with namespace, TTL, purpose, write/read points, and whether each is hard-coded | P0-01, P1-18 |
| **INV-2** | `docs/awooop/inventory/INV-2-repository-project-id-retrofit.md` | 30+ business tables × whether `project_id` currently exists × all repository methods × queries needing a filter × historical data needing backfill | P0-03 |
| **INV-3** | `docs/awooop/inventory/INV-3-entrypoints.md` | All cron jobs / schedulers / webhooks / CLIs / healthchecks / internal service calls, tagged with the three types | P0-07, P1-01 |
| **INV-4** | `docs/awooop/inventory/INV-4-hardcoded-namespace-ip.md` | Hard-coded K8s namespace (`awoooi-prod`), SSH host IPs, allowlists (**including the new GCP IPs: 34.143.170.20, 34.21.145.224**) | P0-08, P0-13 |
| **INV-5** | `docs/awooop/inventory/INV-5-migration-compatibility-matrix.md` | Version compatibility matrix: SQLAlchemy 1.x→2.x / Alembic / Pydantic v1→v2 / FastAPI 0.x / Python 3.10→3.12; each breaking change + its impact | critic |
| **INV-6** | `docs/awooop/inventory/INV-6-rollback-playbook-register.md` | 6 rollback playbooks: Phase 1 schema rollback, Phase 2 Redis key rollback, Phase 5 MCP Gateway rollback, Phase 6 EwoooC rollback, Ollama GCP→Local fallback rollback, approval flow rollback | migration |
| **INV-7** | `docs/awooop/inventory/INV-7-pr-cutting-plan.md` | 11 PR cutting plans (designed by refactor-specialist): scope, prerequisites, reviewers, and merge order for each PR | refactor |
| **INV-8** | `docs/awooop/inventory/INV-8-background-loop-catalog.md` | The 31 background loops listed one by one: name, location (main.py line number), class marker, migration strategy, target Phase | onboarder |
| **INV-9** | `docs/awooop/inventory/INV-9-global-singleton-catalog.md` | The 13 global singletons listed one by one: name, location, dependents, decomposition strategy, migration risk | onboarder |

### 2.9 Phase 0 Acceptance Criteria

- [ ] ADR-111~115 (5 AwoooP core ADRs) all Accepted
- [ ] ADR-116~124 (9 hardening ADRs) all Accepted
- [ ] ADR-UI-01~04 (4 UI ADRs) all Accepted (or Proposed + Commander approval to start work)
- [ ] ADR-106 supplemented with the Strangler Fig Quantified Gates + GCP Ollama sections
- [ ] INV-1~INV-9 (9 Inventories) first drafts complete
- [ ] No runtime code changes
- [ ] `git diff --check` passes

---

## 3. Detailed Work Items for the 8 Phases

> Each item includes: goal, scope (exact paths), inputs (prerequisites), outputs (deliverables), acceptance criteria, and boundaries (what must not be touched)

### Phase 1: Control Plane Schema Foundation

**Goal**: Build the minimum viable skeleton of the PostgreSQL contract control plane, fix the three major blockers in the old SQL migration, and decide the partition strategy for high-traffic tables.

**Prerequisites**: All of Phase 0 complete (all ADRs + Inventories)

**Scope (exact files)**:

- `apps/api/migrations/`: add migration files
- `apps/api/src/models/`: add AwoooP SQLAlchemy models
- `apps/api/src/repositories/`: add AwoooP repositories
- `docs/runbooks/`: add the partition + retention runbook

**Do not touch**:

- Any existing repository method (left for Phase 2)
- Provider behavior (`ai_router.py` / `ollama_*.py`)
- Telegram/LINE webhook paths
- `apps/web/`
- Any K8s manifest

**Work items (execute in order)**:

```
1.1 Table name verification
  - grep to confirm `incidents` (not incident_records)
  - grep to confirm `mcp_audit_log` (not mcp_audit_snapshots)
  - Fix ORM: SQLAlchemy 2.x mapped_column; add the missing Numeric/UniqueConstraint/func imports
  - Every migration must have a down migration (rollback SQL)

1.2 Task 9 ordering fix (must be completed before 1.1)
  - Dockerfile: agent_loader default path points to the ConfigMap mount
  - ConfigMap preload: confirm the agent prompt path already exists in the ConfigMap
  - Acceptance: dry-run one agent loader; the output is not None

1.3 AwoooP control plane tables (new migration)
  - awooop_projects (tenant master table; project_id VARCHAR PK, budget, ACL)
  - awooop_contract_revisions (revision table shared by the six contracts, append-only; see §4.1 for the full columns)
  - awooop_active_revisions (active pointer to a specific revision_id)
  - awooop_artifact_refs (prompt/schema/eval ref + sha256 + type)
  - awooop_project_migration_state (Strangler Fig stage tracking, per project × per capability)
  - awooop_contract_outbox (ADR-113; active revision switch events, for worker invalidation)
  - awooop_channel_event_dedupe (ADR-114; idempotency, unique key)
  - awooop_platform_subjects (ADR-115; canonical principal mapping)
  - awooop_budget_ledger (ADR-120; token budget, per project × per period)

1.4 High-traffic tables (created in Phase 4/7; the partition rules are decided and written down now)
  - Add a partition template comment in this Phase's migration (not executed; left for Phase 4)
  - awooop_run_state → range partition by created_at (monthly)
  - awooop_channel_event → range partition by created_at (monthly)
  - awooop_mcp_gateway_audit → range partition by created_at (monthly)
  - awooop_agent_audit_log → range partition by created_at (monthly)
  - retention: 90 days hot + 1 year warm (pg_partman / cron job)
  - Document in docs/runbooks/awooop-partition-retention.md

1.5 AWOOOI Bootstrap (seed data)
  - INSERT INTO awooop_projects(project_id='awoooi', display_name='AWOOOI', migration_mode='legacy_awoooi_default')
  - Acceptance: 0 behavior changes for AWOOOI

1.6 RLS skeleton (ADR-118)
  - Enable RLS on all awooop_* tables
  - policy: USING (project_id = current_setting('app.project_id', TRUE))
  - bypass role: awooop_platform (for platform workers only)
  - Note: RLS needs a migration + tests, not just ALTER TABLE ENABLE ROW LEVEL SECURITY

1.7 Immutability tests
  - Attempting to UPDATE a published contract revision → must fail (trigger or check constraint)
  - Isolation of draft from active: the runtime read view excludes drafts
  - Automation: pytest + db-expert review
```

**RACI**:

- R (Responsible): fullstack-engineer
- A (Accountable): db-expert review, Commander approval
- C (Consulted): refactor-specialist (migration PR cutting), critic (final review)
- I (Informed): migration-engineer (version compatibility verification)

**DoD**:

- All migration up/down dry-runs pass
- AWOOOI can be represented as `project_id=awoooi` with 0 behavior changes
- RLS test: cross-project SELECT is rejected
- Partition runbook created

---

### Phase 2: Tenant Isolation & Namespace Hardening

**Goal**: Before opening up to any downstream tenant, turn AWOOOI itself into a clean tenant.

**Prerequisite**: Phase 1 complete

**Scope**:

- `apps/api/src/services/`: Redis key migration (per INV-1)
- `apps/api/src/repositories/`: add the project_id filter (per INV-2)
- `apps/api/src/services/security_interceptor.py`: nonce fix (P0-05, ADR-116)
- `apps/api/src/api/v1/webhooks.py`: replay protection (P0-06, ADR-116)
- `apps/api/src/core/telemetry.py:71`: remove the hard-coded IP assert (P0-08)
- `apps/api/src/services/decision_manager.py:240`: turn the silence key into a shared constant (P1-24)
- `apps/api/src/services/ollama_auto_recovery.py:230`: remove the second definition (P0-11)
- `apps/api/src/plugins/mcp/mcp_bridge.py:592-681`: make the namespace dynamic (P0-13)
- `apps/api/src/services/consensus_engine.py`: add a project prefix to CONSENSUS_PREFIX (P0-12)
- `apps/api/src/hermes/nl_gateway.py`: add a project prefix to Redis keys (P1-16)
- `apps/api/src/services/anomaly_counter.py:790`: per-project rework (P1-17)
- `apps/api/src/services/incident_service.py:603`: add a prefix to SCAN (P1-18)

**Do not touch**:

- The structure of any new AwoooP Phase 1 table other than `awooop_contract_revisions`
- Any EwoooC / Tsenyang onboarding (left for Phase 6)
- Any provider routing change (the Ollama GCP topology was settled by ADR-110 and is not changed in this Phase)

**Work items**:

```
2.1 Execute the Redis three-stage dual-write migration plan (per INV-1, in three batches)
  Batch A (critical path; affects the Ollama GCP topology):
    - ollama:current_primary (old) → {project_id}:ollama:primary (new)
      Note: all three tiers GCP-A/GCP-B/Local must be supported at once; INV-1 must confirm every write point
    - Delete the second definition at ollama_auto_recovery.py:230 and unify the constant
  Batch B (cost + alerting critical):
    - ai_rate:total_cost:gemini → {project_id}:ai_rate:total_cost:gemini
    - telegram:polling:leader → platform:telegram:polling:leader (platform_resource)
    - telegram_silence:{target} → {project_id}:telegram_silence:{target}
      Also update decision_manager.py:240 to import the gateway constant
  Batch C (working memory):
    - consensus: → {project_id}:consensus: (consensus_engine.py)
    - hermes Redis keys (nl_gateway.py)
    - the 6 anomaly_counter prefixes
    - incident:* SCAN (incident_service.py:603)

  Each batch: Phase A (dual write, 30 days) → Phase B (dual read, 14 days) → Phase C (remove old keys)

2.2 Security hardening (ADR-116)
  - telemetry.py:71: remove the hard-coded "192.168.0.188" assert; use config-driven allowed endpoints instead
  - security_interceptor.py:451-490: redesign the nonce; server_secret must take part in the HMAC
  - webhooks.py:679-728: add timestamp (±5min window) + nonce (Redis dedup)
  - requires_approval: read from the policy contract; LLM output may not decide it
  - approval_token: HS256, 15min TTL, jti uniqueness (Redis NX)

2.3 Repository project_id retrofit (per INV-2)
  - Add a project_id filter to all 30+ repository methods
  - K8s namespace allowlist → tenant-aware (make mcp_bridge.py:592-681 dynamic)
  - SSH host allowlist → tenant-aware (per INV-4)

2.4 Background loop tagging (per ADR-123, INV-3/INV-8)
  - Tag the 31 loops as platform_internal / legacy_awoooi_default / requires_project_id
  - platform_internal carries project_id=__platform__
  - legacy_awoooi_default falls back to project_id=awoooi, with a documented retirement schedule

2.5 Global singleton decomposition, first step (per ADR-124, INV-9)
  - Only: per-project rework of AnomalyCounter (already fixed under P1-17)
  - For the rest of the 13 global singletons, list a retirement schedule (not all split in this Phase, to limit the blast radius)

2.6 Token Budget Hard Kill foundation (ADR-120)
  - budget_ledger table migration (built in Phase 1; this Phase adds the write logic)
  - Before every LLM call: check budget → hard kill if exceeded (not just log)
  - Redis hot counter + PG transactional hard stop
```

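The Phase A → Phase B → Phase C sequence in item 2.1 can be sketched as a thin wrapper around the key store. This is an illustration under assumptions, not the planned implementation: a plain `dict` stands in for Redis, `MigratingKeyStore` and `Stage` are hypothetical names, the new key is formed by prefixing `{project_id}:` (which matches the Batch B/C examples but not the renamed Ollama key), and continuing to write both keys during Phase B is an assumed choice that keeps rollback possible.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    PHASE_A = "dual_write"   # write old + new, read old
    PHASE_B = "dual_read"    # write old + new, read new with fallback to old
    PHASE_C = "new_only"     # write and read the new key only


class MigratingKeyStore:
    def __init__(self, backend: dict, project_id: str, stage: Stage):
        self.backend = backend          # stand-in for a Redis client (assumption)
        self.project_id = project_id
        self.stage = stage

    def _new_key(self, old_key: str) -> str:
        return f"{self.project_id}:{old_key}"

    def set(self, old_key: str, value: str) -> None:
        self.backend[self._new_key(old_key)] = value
        if self.stage is not Stage.PHASE_C:
            self.backend[old_key] = value   # keep the legacy key in sync

    def get(self, old_key: str) -> Optional[str]:
        if self.stage is Stage.PHASE_A:
            return self.backend.get(old_key)
        value = self.backend.get(self._new_key(old_key))
        if value is None and self.stage is Stage.PHASE_B:
            value = self.backend.get(old_key)   # fallback for keys written before Phase A
        return value
```

Because callers keep passing the legacy key name, the stage can be advanced per batch by configuration without touching call sites.
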
**RACI**:

- R: fullstack-engineer + refactor-specialist (large volume of repository changes)
- A: db-expert (repository change review), vuln-verifier (security hardening PoC verification)
- C: critic (overall diff review), migration-engineer (compatibility confirmation)
- I: tool-expert (K8s namespace changes)

**DoD**:

- All P0 keys in INV-1 have entered the three-stage migration (Phase A complete; Phase B/C in the observation period)
- Cross-project access tests all fail as intended, i.e. access is denied (covered by pytest)
- `grep -r "awoooi-prod" apps/api/src/` returns 0 results
- `grep -r "192.168.0.188" apps/api/src/` no longer finds the telemetry assert
- vuln-verifier PoC rerun: P0-05 nonce forgery fails, P0-06 webhook replay fails
- Budget hard kill test: LLM calls are rejected once the budget is exceeded

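The webhook replay protection from item 2.2 (timestamp within a ±5min window plus a deduplicated nonce) can be sketched as follows. The function names and the message layout are assumptions made for illustration; the `seen_nonces` set stands in for a Redis `SET NX` whose TTL is at least as long as the window.

```python
import hashlib
import hmac
import time
from typing import Optional

WINDOW_SECONDS = 5 * 60  # ±5min window from item 2.2


def sign_webhook(secret: bytes, timestamp: int, nonce: str, body: bytes) -> str:
    # Length-prefix the nonce so that (nonce, body) pairs cannot collide.
    message = f"{timestamp}:{len(nonce)}:{nonce}:".encode("utf-8") + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, timestamp: int, nonce: str, body: bytes,
                   signature: str, seen_nonces: set,
                   now: Optional[float] = None) -> bool:
    now = time.time() if now is None else now
    if abs(now - timestamp) > WINDOW_SECONDS:
        return False                      # too old or too far in the future
    expected = sign_webhook(secret, timestamp, nonce, body)
    if not hmac.compare_digest(expected, signature):
        return False                      # body, timestamp, or nonce was altered
    if nonce in seen_nonces:
        return False                      # same signed request seen before: replay
    seen_nonces.add(nonce)                # stand-in for Redis SET NX (assumption)
    return True
```

The nonce is recorded only after the signature verifies, so unauthenticated traffic cannot fill the dedup store.
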
---

### Phase 3: Contract Packages & Validators

**Goal**: Upgrade the six contracts from prose into verifiable programs.

**Prerequisite**: Phase 1 complete (the contract_revisions table exists)

**Scope**:

- `packages/awooop-contracts/` (created only at this point!)
- `apps/api/src/services/contract_service.py` (new)
- `apps/api/src/repositories/contract_repository.py` (new)

**Do not touch**:

- Any existing provider / router / telegram path
- `apps/web/` (UI is left until after Phase 8)

**Work items**:

```
3.1 Create packages/awooop-contracts/ (this is when it first gets real content)
  - JSON Schema for the six contracts (Project/Tenant, Agent, MCP Gateway, Policy/Routing, Run State, Channel Event)
  - Pydantic v2 models matching the six contracts
  - envelope schema: platform invocation, MCP tool call, run state transition, channel event
  - golden fixtures (valid × 6 + invalid × 6)

3.2 Contract lifecycle service
  - draft(): create a draft revision; not readable by runtime
  - publish(): produce an immutable published revision (body_hash = sha256(body_json))
  - activate(): update the active pointer and write to contract_outbox (ADR-113)
  - get_active(): runtime read path; returns only published + active
  - Every operation is recorded in the audit log

3.3 Output schema validator middleware
  - LLM response → schema validator → failure → retry (max 3 times) → failure → error code (E-SCHEMA-001)
  - No LLM output that violates the schema can reach a channel adapter

3.4 Contract governance (ADR-112)
  - CODEOWNERS covers packages/awooop-contracts/
  - publish API: HMAC signature verification
  - activate API: approval workflow (multi_sig_redis path)

3.5 SHA-256 artifact verification
  - Every artifact ref includes a sha256
  - Verify the hash on runtime read (compare against the DB record)
```

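The validate-then-retry flow in item 3.3 might be structured like this. It is a sketch under assumptions: a required-field type map stands in for the JSON Schema / Pydantic v2 validation the plan calls for, `call_llm` is a hypothetical zero-argument callable, and "max 3 times" is read as up to 3 retries after the first attempt.

```python
from typing import Callable, Dict


class SchemaError(Exception):
    code = "E-SCHEMA-001"


MAX_RETRIES = 3  # assumption: 3 retries after the initial attempt


def matches_schema(output: object, required: Dict[str, type]) -> bool:
    """Stand-in validator: every required field is present with the right type."""
    if not isinstance(output, dict):
        return False
    return all(isinstance(output.get(name), kind) for name, kind in required.items())


def call_with_validation(call_llm: Callable[[], object],
                         required: Dict[str, type]) -> dict:
    """Return the first valid output, or raise once the retries are used up."""
    for _attempt in range(1 + MAX_RETRIES):
        output = call_llm()
        if matches_schema(output, required):
            return output
    raise SchemaError("E-SCHEMA-001: LLM output failed schema validation")
```

Because the function either returns validated output or raises, a channel adapter placed behind it never sees a non-conforming payload.
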
**DoD**:

- LLM output that violates the schema cannot reach a channel adapter (integration test)
- AWOOOI's first Agent contract can be published + activated (E2E)
- prompt/schema refs must include a sha256

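For `body_hash = sha256(body_json)` to be stable, the JSON text has to be serialized deterministically before hashing. The sketch below assumes a canonical form of sorted keys and compact separators; that choice, and the function names, are illustrative assumptions and are not specified by the plan.

```python
import hashlib
import json


def canonical_body_json(body: dict) -> str:
    # Sorted keys + compact separators: the same logical body always
    # yields the same text (assumed canonical form).
    return json.dumps(body, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def body_hash(body: dict) -> str:
    return hashlib.sha256(canonical_body_json(body).encode("utf-8")).hexdigest()


def artifact_matches(body: dict, recorded_sha256: str) -> bool:
    """Runtime read check: recompute the hash and compare with the stored value."""
    return body_hash(body) == recorded_sha256
```

Without a fixed serialization, two semantically identical revisions could hash differently depending on key order.
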
---

### Phase 4: Platform Shell in Shadow Mode

**Goal**: Build the first runtime shell; run shadow only, without changing legacy behavior.

**Prerequisite**: Phase 3 complete

**Scope**:

- `apps/api/src/api/v1/platform/`: add the platform runs API
- `apps/api/src/services/platform_runtime.py`: new
- `apps/api/src/services/run_state_machine.py`: Run FSM implementation (P2-02)
- `apps/api/src/workers/platform_worker.py`: new
- `apps/api/src/services/audit_sink.py`: add redaction (P1-08)

**Do not touch**:

- Any existing `/v1/incidents/` or `/v1/webhooks/` path
- Telegram bot handler (legacy stays as is)
- EwoooC onboarding (left for Phase 6)

**Work items**:

```
4.1 Run API shell (shadow only)
  - POST /v1/platform/runs
  - Generate run_id (UUID v7) and trace_id (W3C traceparent compatible)
  - Resolve the active revision of the project + agent contract
  - Resolve EffectivePolicy (6-layer merge; does not change provider behavior)

4.2 Run State Machine (ADR-114 + ADR-119)
  - States: PENDING → RUNNING → WAITING_TOOL → WAITING_APPROVAL → COMPLETED / FAILED / CANCELLED
  - Columns: lease_until, heartbeat_at, attempt_count
  - SKIP LOCKED for job pickup (prevents double-pickup)
  - stale run reaper (scans for expired leases every minute; returns runs to PENDING or FAILED)
  - SAGA step journal (ADR-119): every tool call records a step_id and a compensation instruction

4.3 Idempotency (ADR-114)
  - Composite unique on (project_id, channel_type, provider_event_id)
  - A duplicate event returns the existing run_id (no new run is created)
  - Two-layer protection: Redis NX + PG constraint

4.4 Audit log redaction (ADR-116)
  - audit_sink passes data through the sanitization_service pipeline before writing
  - Hard-block PII / secret patterns (including GCP IPs, PG password, Telegram token, etc.)
  - The audit log does not record raw LLM input/output, only a hash + the schema validation result

4.5 Observability (ADR-121)
  - OTel GenAI span naming (gen_ai.request.*)
  - Token count attributes (gen_ai.usage.prompt_tokens, etc.)
  - metrics labels: only project_id / agent_id / status / provider (run_id/trace_id/session_id must not enter metrics)
  - run_id / trace_id go only into logs/traces (not metrics)

4.6 Shadow mode wiring
  - Mirror 3 selected AWOOOI events to shadow (no user response is sent)
  - shadow runs make 0 destructive tool calls (all MCP write/execute blocked)

4.7 Token Budget Hard Kill (ADR-120)
  - per-run token budget (from EffectivePolicy)
  - Over budget → hard kill → FAILED state → error code E-BUDGET-001
  - After each run completes, write actual consumption to budget_ledger
```

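The run state machine and stale reaper from item 4.2 can be sketched as a transition table plus a pure reaper function. The state names come from the plan; the exact set of allowed transitions, `MAX_ATTEMPTS`, and the function names are assumptions made for illustration.

```python
from enum import Enum


class RunState(str, Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    WAITING_TOOL = "WAITING_TOOL"
    WAITING_APPROVAL = "WAITING_APPROVAL"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"


S = RunState
# Assumed transition table; terminal states have no outgoing edges.
ALLOWED = {
    S.PENDING: {S.RUNNING, S.CANCELLED},
    S.RUNNING: {S.WAITING_TOOL, S.WAITING_APPROVAL, S.COMPLETED,
                S.FAILED, S.CANCELLED, S.PENDING},  # PENDING only via the reaper
    S.WAITING_TOOL: {S.RUNNING, S.FAILED, S.CANCELLED},
    S.WAITING_APPROVAL: {S.RUNNING, S.FAILED, S.CANCELLED},
}
MAX_ATTEMPTS = 3  # hypothetical retry ceiling


class InvalidTransition(Exception):
    pass


def transition(current: RunState, target: RunState) -> RunState:
    if target not in ALLOWED.get(current, set()):
        raise InvalidTransition(f"{current.value} -> {target.value}")
    return target


def reap(state: RunState, lease_until: float, now: float, attempt_count: int) -> RunState:
    """Return a RUNNING run with an expired lease to PENDING, or FAILED when out of attempts."""
    if state is not S.RUNNING or lease_until > now:
        return state
    target = S.PENDING if attempt_count < MAX_ATTEMPTS else S.FAILED
    return transition(state, target)
```

Keeping the table as data makes it straightforward to render in the run FSM visualization and to test each edge independently.
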
**RACI**:

- R: fullstack-engineer (API + service), db-expert (run state schema review)
- A: critic (shadow mode design review), vuln-verifier (redaction PoC)
- C: debugger (end-to-end trace_id design), tool-expert (OTel integration)
- I: migration-engineer (worker lease compatibility)

**DoD**:

- shadow runs produce 0 user-visible responses and 0 destructive tool calls (verified by vuln-verifier)
- legacy AWOOOI behavior is unchanged (regression tests pass)
- After a worker crash, stale runs are reclaimed within 1 minute (automated test)
- Duplicate events do not create duplicate runs (idempotency test)
- 0 secret hits in the audit log (vuln-verifier samples 100 rows)
- Exceeding the token budget triggers a hard kill (integration test)

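The "duplicate events do not create duplicate runs" criterion rests on the composite unique key from item 4.3. The sketch below demonstrates the pattern with an in-memory SQLite database standing in for PostgreSQL; the table layout and `get_or_create_run` are hypothetical, and `uuid.uuid4()` is used in place of the UUID v7 the plan specifies.

```python
import sqlite3
import uuid
from typing import Tuple


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE channel_event_dedupe (
            project_id        TEXT NOT NULL,
            channel_type      TEXT NOT NULL,
            provider_event_id TEXT NOT NULL,
            run_id            TEXT NOT NULL,
            UNIQUE (project_id, channel_type, provider_event_id)
        )
        """
    )
    return conn


def get_or_create_run(conn: sqlite3.Connection, project_id: str,
                      channel_type: str, provider_event_id: str) -> Tuple[str, bool]:
    """Return (run_id, created). A duplicate event yields the existing run_id."""
    key = (project_id, channel_type, provider_event_id)
    new_run_id = str(uuid.uuid4())  # stand-in for UUID v7
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO channel_event_dedupe VALUES (?, ?, ?, ?)",
                key + (new_run_id,),
            )
        return new_run_id, True
    except sqlite3.IntegrityError:
        row = conn.execute(
            "SELECT run_id FROM channel_event_dedupe "
            "WHERE project_id = ? AND channel_type = ? AND provider_event_id = ?",
            key,
        ).fetchone()
        return row[0], False
```

Letting the database constraint decide the winner avoids a check-then-insert race between two workers that receive the same retry.
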
---

### Phase 5: MCP Gateway First Slice

**Goal**: Move tool authorization into the Gateway, bring in read-only tools first, and resolve sanitization enforcement.

**Prerequisite**: Phase 4 complete

**Scope**:

- `apps/api/src/plugins/mcp/gateway.py`: new MCP Gateway
- `apps/api/src/plugins/mcp/registry.py:24-71`: `_provider` → `__provider` (P1-05)
- `apps/api/src/plugins/mcp/mcp_bridge.py`: connect to the Gateway
- `apps/api/src/services/sanitization_service.py`: enforcement point (P1-09)

**Do not touch**:

- MCP write/execute tools (left for Phase 8)
- Telegram approval flow (changes happen in Phase 8)

**Work items**:

```
5.1 MCP Gateway tables
  - awooop_mcp_tool_registry (tool_id, project_id, agent_id, tool_type, allowed_scopes)
  - awooop_mcp_grants (grant_id, project_id, agent_id, tool_id, granted_by, expires_at)
  - awooop_mcp_credential_refs (ref_id, tool_id, k8s_secret_ref, sha256)
  - awooop_mcp_gateway_audit (call_id, trace_id, run_id, tool_id, credential_ref, latency_ms, result_status)

5.2 Five-gate enforcement
  - Check: Project AND Agent AND Tool AND Environment AND Approval
  - Any mismatch → reject + record audit + error code E-MCP-GATE-XXX

5.3 Result sanitization enforcement (P1-04, P1-09)
  - Every MCP tool result must pass through the sanitization_service pipeline
  - Add sanitization middleware to the MCP Gateway (raw results may not go straight into the LLM context)
  - One layer before the LLM + one layer before the audit sink (two-layer redaction)
  - SAST scan of agent code paths: 0 contact with raw credentials

5.4 _provider fix (P1-05)
  - registry.py: _provider → __provider (double underscore, Python name mangling)
  - Add unit test: external reflective access → AttributeError

5.5 Credential isolation
  - Agent code does not access K8s Secrets directly
  - The Gateway resolves credential_ref → returns a masked result (token substitution)
  - Replay test of the 2026-04-18 secret leak: kubectl describe output does not appear in the LLM context

5.6 MCP OAuth 2.1 (ADR-117)
  - Implement per-tenant dynamic client registration (RFC 7591)
  - Resource Indicators (RFC 8707), alongside RFC 9728 Protected Resource Metadata, to prevent Confused Deputy
  - PKCE flow for MCP Server binding
```

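The effect of the double-underscore rename in item 5.4 can be checked with a few lines. The `Registry` class below is a hypothetical stand-in for the real one in `registry.py`; only the name-mangling behavior itself is standard Python.

```python
class Registry:
    def __init__(self, provider: str):
        # Inside the class body, __provider is stored as _Registry__provider.
        self.__provider = provider

    def call(self, tool: str) -> str:
        # The audited path still reaches the attribute through the mangled name.
        return f"audited:{self.__provider}:{tool}"


def direct_access_blocked(obj: object) -> bool:
    """True when obj.__provider raises AttributeError from outside the class."""
    try:
        obj.__provider  # no mangling here: this function is not in a class body
    except AttributeError:
        return True
    return False


def mangled_access(obj: object) -> str:
    """The mangled name remains reachable to anyone who spells it out."""
    return getattr(obj, "_Registry__provider")
```

The first helper mirrors the planned unit test. The second shows that name mangling guards against accidental access and not deliberate access, so the audit guarantee still depends on the Gateway being the only route to the provider.
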
**RACI**:

- R: fullstack-engineer (Gateway service)
- A: vuln-verifier (credential isolation verification), critic (architecture review)
- C: tool-expert (MCP spec confirmation), db-expert (Gateway table design review)
- I: migration-engineer (MCP registry compatibility)

**DoD**:

- The 2026-04-18 secret leak replay test passes (kubectl describe output does not appear in the LLM context or any audit row)
- SAST scan: 0 contact with raw credentials on agent code paths
- `__provider` double-underscore unit test passes
- All five gates covered by integration tests

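The five-gate check (Project AND Agent AND Tool AND Environment AND Approval) lends itself to a single ordered evaluation that reports the first gate that fails. The `ToolCall` and `Grant` shapes below are assumptions loosely modeled on the gateway tables in item 5.1, and the function returns a gate name only; mapping names to the `E-MCP-GATE-XXX` codes is left to the error code dictionary.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ToolCall:
    project_id: str
    agent_id: str
    tool_id: str
    environment: str
    approved: bool


@dataclass(frozen=True)
class Grant:
    project_id: str
    agent_id: str
    tool_id: str
    environments: FrozenSet[str]
    requires_approval: bool   # policy-derived, never taken from LLM output


def first_failed_gate(call: ToolCall, grant: Grant) -> Optional[str]:
    """Return the name of the first gate that rejects the call, or None if all pass."""
    checks = (
        ("PROJECT", call.project_id == grant.project_id),
        ("AGENT", call.agent_id == grant.agent_id),
        ("TOOL", call.tool_id == grant.tool_id),
        ("ENVIRONMENT", call.environment in grant.environments),
        ("APPROVAL", call.approved or not grant.requires_approval),
    )
    for name, passed in checks:
        if not passed:
            return name
    return None
```

One integration test per gate can then assert that flipping a single field of an otherwise valid call yields exactly that gate's name.
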
---

### Phase 6: EwoooC Read-Only Tenant Onboarding

**Goal**: Validate AwoooP with a real downstream tenant, fully read-only.

**Prerequisites**: Phase 5 complete; the hard-coded IP assert at telemetry.py:71 removed (done in Phase 2)

**Scope**:

- `apps/api/src/`: EwoooC project provisioning
- `packages/awooop-contracts/`: EwoooC agent contract
- `apps/api/src/services/provider_proxy.py`: new Provider Proxy Adapter (P1-02)

**Do not touch**:

- Any existing AWOOOI business logic
- MCP write/execute tools

**Work items**:

```
6.1 EwoooC project provisioning
  - INSERT INTO awooop_projects(project_id='ewoooc', ...)
  - Must not be able to read AWOOOI data (RLS verification)

6.2 openclaw-biz agent contract
  - Design the I/O schema for the market intelligence domain
  - Security ceiling: read-only only; infra tools are forbidden

6.3 Provider Proxy Adapter (P1-02, ADR-115)
  - More than just changing OLLAMA_API_BASE
  - The proxy entry point forcibly injects the envelope: project_id / agent_id / trace_id / run_id
  - Passes through EffectivePolicy + budget guard + audit
  - GCP Ollama three-tier topology: EwoooC uses the same primary/secondary/fallback routing
  - read-only / model-call entry points are enabled first

6.4 Market intelligence MCP tool registration
  - 4 read-only tools: market_data_fetch, product_catalog_query, competitor_analysis, trend_report
  - All governed by the MCP Gateway five-gate check

6.5 Shadow → Canary promotion plan
  - First 14 days of shadow (Strangler Fig quantified gate)
  - Promote to canary once the conditions are met (selected responses)
  - Promote to read_only once canary passes
```

**RACI**:

- R: fullstack-engineer
- A: critic (EwoooC data isolation review), vuln-verifier (cross-tenant isolation PoC)
- C: db-expert (RLS verification), migration-engineer (EwoooC rollback playbook, INV-6)
- I: tool-expert (EwoooC routing configuration for the GCP Ollama topology)

**DoD**:

- EwoooC SELECTs cannot read AWOOOI data (RLS + cross-tenant pytest)
- Provider Proxy Adapter E2E test: the envelope is injected correctly
- budget / audit are fully project-scoped
- On EwoooC startup, telemetry.py no longer fails on the IP assert

---

### Phase 7: Communication Hub Increment

**Goal**: Standardize channels without cutting off existing bots.

**Prerequisite**: Phase 6 complete

**Scope**:

- `apps/api/src/services/channel_hub.py`: new
- `apps/api/src/services/telegram_gateway.py`: mirror inbound events
- `apps/api/src/api/v1/platform/channel.py`: new

**Do not touch**:

- The existing telegram bot handler (legacy remains authoritative until the canary quantified gate passes)
- LINE / Slack onboarding (left for v2)

**Work items**:

```
7.1 awooop_conversation_event + awooop_outbound_message tables
  - partition by created_at (monthly; strategy set in Phase 1)
  - retention policy configuration

7.2 Telegram inbound mirror
  - Copy events from the existing telegram_gateway.py into awooop_conversation_event
  - canonical principal mapping (ADR-115): every sender is written to awooop_platform_subjects

7.3 Progressive Feedback Policy (P2-11)
  - WAITING_TOOL / RUNNING / WAITING_APPROVAL → must send a transient Telegram message
  - Update via edit_message (not a new message, so no notification is triggered)
  - First progress message within ≤ 30s

7.4 Idempotency verification (already completed in Phase 4)
  - duplicate channel retries do not create duplicate runs (integration test)

7.5 Adapter-level security
  - Every channel adapter: escaping + redaction + idempotency + delivery audit
  - Channel adapters make 0 LLM calls and 0 MCP calls (covered by pytest)

7.6 Quantified gate monitoring dashboard (with ADR-UI-03)
  - Strangler Fig gate metrics: decision divergence / p95 latency / error rate
  - Used for Phase 8 promotion decisions
```

**RACI**:
|
||
- R:fullstack-engineer(API + channel hub)
|
||
- A:critic(channel 設計 review)、debugger(trace_id 貫穿驗證)
|
||
- C:frontend-designer(進度訊息 UX)、tool-expert(Telegram API 規格確認)
|
||
- I:migration-engineer(channel 相容性)
|
||
|
||
**DoD**:
|
||
- channel adapter 0 LLM 呼叫、0 MCP 呼叫
|
||
- async run 首則進度訊息 ≤ 30s
|
||
- duplicate retry 不產生 duplicate run
|
||
|
||
---
|
||
|
||
### Phase 8 — Suggest & Controlled Write Paths

**Goal**: move up from read-only to propose, then to controlled execute.
**Prerequisite**: Phase 7 complete + every Strangler Fig shadow→canary gate passed

**Scope**:
- `apps/api/src/services/multi_sig_redis.py`: approval token signing (P1-06)
- `apps/api/src/services/approval_timeout_resolver.py`: add trace_id (P1-15)
- `apps/api/src/api/v1/platform/suggest.py`: suggest mode endpoint
- Feature flags for write/execute paths

**Do not touch**:
- Default enablement of any write/execute tool
- auto_remediate, until the quantified Strangler Fig gates pass

**Work items**:

```
8.1 Approval token hardening (P1-06, ADR-116)
- WAITING_APPROVAL resume API: always verify approval_token (HS256, 15min TTL, jti Redis NX)
- approval state: PG is the source of truth, Redis is a cache
- expired / already decided / replayed → reject all, with error code E-APPROVAL-XXX

8.2 Add trace_id to multi_sig_redis.py + approval_timeout_resolver.py
- every approval operation carries trace_id (P1-15)
- the full chain is traceable (verified by debugger)

8.3 Suggest mode for AWOOOI SRE flows
- pick 3 low-risk SRE flows (e.g., alert silence suggestions, playbook recommendations)
- suggest mode: the AI outputs a suggestion, a human decides whether to execute
- quantified gates (ADR-106 addendum):
* shadow → canary: ≥14 days + divergence <5% + p95 regression <10% + 0 P1 incident
* canary → read_only: ≥7 days + error rate <0.5% + cost diff <50%
* read_only → suggest: ≥14 days + accept rate ≥50% + 0 hallucination escalation
* suggest → auto_remediate: ≥30 days + rollback evidence ≥3 + approval token live + dry-run ≥99%

8.4 Dry-run and rollback evidence gate
- every write/execute tool must have a dry-run mode
- rollback playbooks are registered in INV-6 (written in Phase 0; verified by execution here)
- record the result of every rollback as Phase 8 gate evidence

8.5 Feature Flag Registry (see §10)
- suggest mode: feature flag AWOOOP_SUGGEST_MODE (default OFF)
- controlled write: feature flag AWOOOP_WRITE_MODE (default OFF)
- enabled only by an explicit flip; a stray environment variable must not switch it on

8.6 vuln-verifier PoC acceptance
- resuming WAITING_APPROVAL without a token must fail
- with Redis down, approvals can still be recovered from PG
```

**RACI**:
- R: fullstack-engineer
- A: vuln-verifier (approval security PoC), critic (write path review)
- C: debugger (trace_id verification), db-expert (approval state PG review)
- I: migration-engineer (feature flag rollback)

**DoD**:
- Resuming WAITING_APPROVAL without a token is rejected (vuln-verifier PoC passes)
- After a Redis outage, approvals are recovered from PG (integration test)
- write/execute is OFF by default and enabled only by a manual feature flag flip
- Every Strangler Fig gate passes its quantified acceptance (signed off by critic + db-expert + vuln-verifier)

---

## 4. Detailed Database Schema

### 4.1 awooop_contract_revisions (shared revision table for the six contracts)

```sql
CREATE TABLE awooop_contract_revisions (
    revision_id             UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    project_id              VARCHAR(64) NOT NULL REFERENCES awooop_projects(project_id),
    contract_family         VARCHAR(32) NOT NULL,  -- project_tenant/agent/mcp_gateway/policy_routing/run_state/channel_event
    contract_id             VARCHAR(128) NOT NULL,
    version                 VARCHAR(32) NOT NULL,
    lifecycle_status        VARCHAR(16) NOT NULL DEFAULT 'draft',  -- draft/published/superseded/revoked
    body_json               JSONB NOT NULL,
    body_schema_version     VARCHAR(32) NOT NULL,
    body_hash               CHAR(64) NOT NULL,  -- SHA-256 hex
    created_by              VARCHAR(128) NOT NULL,
    created_at              TIMESTAMPTZ NOT NULL DEFAULT now(),
    published_at            TIMESTAMPTZ,
    supersedes_revision_id  UUID REFERENCES awooop_contract_revisions(revision_id),
    -- Immutability itself is enforced by the trigger below; body_json is
    -- already NOT NULL, so this CHECK never rejects a row on its own.
    CONSTRAINT published_body_immutable CHECK (
        lifecycle_status = 'draft' OR body_json IS NOT NULL
    )
);

-- Runtime read view: published and superseded revisions only, never drafts
CREATE VIEW awooop_published_revisions AS
    SELECT * FROM awooop_contract_revisions
    WHERE lifecycle_status IN ('published', 'superseded');

-- Append-only trigger: drafts stay editable; once a revision leaves draft,
-- its body is frozen while lifecycle_status may still move on
-- (published → superseded/revoked).
CREATE OR REPLACE FUNCTION prevent_revision_update()
RETURNS TRIGGER AS $$
BEGIN
    IF OLD.lifecycle_status <> 'draft'
       AND (NEW.body_json IS DISTINCT FROM OLD.body_json
            OR NEW.body_hash IS DISTINCT FROM OLD.body_hash
            OR NEW.body_schema_version IS DISTINCT FROM OLD.body_schema_version
            OR NEW.version IS DISTINCT FROM OLD.version) THEN
        RAISE EXCEPTION 'Published contract revision is immutable';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER enforce_revision_immutability
    BEFORE UPDATE ON awooop_contract_revisions
    FOR EACH ROW EXECUTE FUNCTION prevent_revision_update();

-- RLS
ALTER TABLE awooop_contract_revisions ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON awooop_contract_revisions
    USING (project_id = current_setting('app.project_id', TRUE)
           OR current_user = 'awooop_platform');
```

### 4.2 awooop_run_state (with lease + SAGA journal)

```sql
CREATE TABLE awooop_run_state (
    run_id          UUID NOT NULL DEFAULT gen_random_uuid(),
    project_id      VARCHAR(64) NOT NULL,
    agent_id        VARCHAR(128) NOT NULL,
    trace_id        CHAR(32),  -- W3C trace_id hex
    parent_run_id   UUID,
    status          VARCHAR(32) NOT NULL DEFAULT 'PENDING',
    migration_mode  VARCHAR(32) NOT NULL DEFAULT 'shadow',  -- shadow/canary/read_only/suggest/auto_remediate
    -- Worker lease
    lease_until     TIMESTAMPTZ,
    heartbeat_at    TIMESTAMPTZ,
    attempt_count   INT NOT NULL DEFAULT 0,
    worker_id       VARCHAR(128),
    -- Token budget
    budget_limit_tokens BIGINT,
    tokens_used     BIGINT NOT NULL DEFAULT 0,
    -- Timestamps
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    completed_at    TIMESTAMPTZ,
    -- SAGA journal (step-level)
    saga_steps      JSONB DEFAULT '[]',  -- [{step_id, tool, status, compensation_cmd, completed_at}]
    -- Metadata
    input_hash      CHAR(64),  -- SHA-256 of input envelope (for audit)
    effective_policy_revision_id UUID,
    -- On a partitioned table the primary key must include the partition key
    PRIMARY KEY (run_id, created_at)
) PARTITION BY RANGE (created_at);

-- Per-project RLS
ALTER TABLE awooop_run_state ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON awooop_run_state
    USING (project_id = current_setting('app.project_id', TRUE)
           OR current_user = 'awooop_platform');
```
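
The lease columns (`lease_until`, `heartbeat_at`, `attempt_count`, `worker_id`) imply a claim / heartbeat / reap cycle. The following is a minimal in-memory sketch of that cycle, not the planned implementation: the 60-second lease length and the names `RunLease`, `claim`, `heartbeat`, and `reap_stale` are assumptions for illustration, and a real worker would presumably perform each step inside a PG transaction.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

LEASE_TTL = timedelta(seconds=60)  # assumed lease length, not from the plan


@dataclass
class RunLease:
    """In-memory stand-in for the lease columns of awooop_run_state."""
    run_id: str
    status: str = "PENDING"
    worker_id: Optional[str] = None
    lease_until: Optional[datetime] = None
    heartbeat_at: Optional[datetime] = None
    attempt_count: int = 0


def claim(run: RunLease, worker_id: str, now: datetime) -> bool:
    """Claim a run unless another worker still holds a live lease."""
    if run.lease_until is not None and run.lease_until > now:
        return False
    run.worker_id = worker_id
    run.status = "RUNNING"
    run.lease_until = now + LEASE_TTL
    run.heartbeat_at = now
    run.attempt_count += 1
    return True


def heartbeat(run: RunLease, worker_id: str, now: datetime) -> bool:
    """Extend the lease; only the current owner of a live lease may do so."""
    if run.worker_id != worker_id or run.lease_until is None or run.lease_until <= now:
        return False
    run.heartbeat_at = now
    run.lease_until = now + LEASE_TTL
    return True


def reap_stale(runs: list[RunLease], now: datetime) -> list[str]:
    """Reset RUNNING runs with an expired lease to PENDING; return their ids."""
    reaped = []
    for run in runs:
        if run.status == "RUNNING" and run.lease_until is not None and run.lease_until <= now:
            run.status, run.worker_id, run.lease_until = "PENDING", None, None
            reaped.append(run.run_id)
    return reaped
```

Keeping `attempt_count` on the row lets the reaper or an operator following RB-02 decide when a run has been retried often enough to abandon it instead of re-queueing it.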

### 4.3 awooop_budget_ledger (Token Budget Hard Kill)

```sql
CREATE TABLE awooop_budget_ledger (
    ledger_id       UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    project_id      VARCHAR(64) NOT NULL,
    period          DATE NOT NULL,  -- YYYY-MM-DD (first day of the month)
    provider        VARCHAR(32) NOT NULL,
    tokens_input    BIGINT NOT NULL DEFAULT 0,
    tokens_output   BIGINT NOT NULL DEFAULT 0,
    cost_usd        NUMERIC(12, 6) NOT NULL DEFAULT 0,
    hard_kill_at    NUMERIC(12, 6),  -- NULL = no limit
    hard_killed     BOOLEAN NOT NULL DEFAULT FALSE,
    last_run_id     UUID,
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE(project_id, period, provider)
);
```

### 4.4 Eight tables to add or extend (db-expert findings)

| Table | Missing columns / missing indexes | Phase |
|------|----------------------|-------|
| `incidents` | add `project_id`, `trace_id`, `awooop_run_id` | Phase 2 |
| `playbooks` | add `project_id`, `agent_id` | Phase 2 |
| `km_entries` | add `project_id`, `namespace` | Phase 2 |
| `mcp_audit_log` | add `trace_id`, `run_id`, `project_id`; add index on (run_id) | Phase 2 |
| `ai_decisions` | add `project_id`, `run_id`; add index on (run_id) | Phase 2 |
| `approval_records` | add `trace_id`, `approval_token_jti`; add index on (jti) | Phase 2/8 |
| `telegram_events` | add `project_id`, `platform_subject_id` | Phase 7 |
| `ollama_health_checks` | add `host_tier` (gcp_a/gcp_b/local), `project_id=__platform__` | Phase 2 |

---

## 5. Security Remediation Plan (vuln-verifier acceptance)

### 5.1 Three PoC-confirmed vulnerabilities

| Vulnerability | Location | PoC status | Fix | Phase |
|------|------|---------|---------|-------|
| Nonce forgery (the server nonce does not depend on server_secret) | security_interceptor.py:451-490 | **PoC confirmed: passes verification** | HMAC(server_secret + nonce), with server_secret injected from a K8s Secret | Phase 2 |
| Webhook replay (no timestamp/nonce) | webhooks.py:679-728 | **PoC confirmed: replayable** | add timestamp (±5min) + nonce Redis NX | Phase 2 |
| requires_approval decided by LLM output | decision_manager.py (approval chain) | **PoC confirmed: bypassable** | decided by the policy contract; LLM output must not influence it | Phase 2 |
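
The webhook fix in the second row (a timestamp window plus a one-time nonce) can be sketched as follows. This is a minimal illustration under stated assumptions: the signature is taken to be HMAC-SHA256 over `timestamp.nonce.body`, and `sign`, `verify_webhook`, and the in-memory `_seen_nonces` set (standing in for Redis `SET ... NX EX`) are hypothetical names, not code from `webhooks.py`.

```python
import hashlib
import hmac

MAX_SKEW_SECONDS = 300  # the ±5min window from the table above

# Stand-in for Redis `SET nonce:<n> 1 NX EX 300`. A real deployment needs a
# shared store with expiry so that every API replica sees the same nonces.
_seen_nonces: set[str] = set()


def sign(secret: bytes, timestamp: int, nonce: str, body: bytes) -> str:
    """Signature the sender is assumed to attach to each webhook call."""
    message = f"{timestamp}.{nonce}.".encode() + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, timestamp: int, nonce: str, body: bytes,
                   signature: str, now: int) -> bool:
    """Accept a webhook only if it is authentic, fresh, and not seen before."""
    expected = sign(secret, timestamp, nonce, body)
    if not hmac.compare_digest(expected, signature):
        return False  # forged or tampered
    if abs(now - timestamp) > MAX_SKEW_SECONDS:
        return False  # outside the ±5min window
    if nonce in _seen_nonces:
        return False  # replay of an already accepted call
    _seen_nonces.add(nonce)
    return True
```

Because the timestamp and nonce are part of the signed message, an attacker cannot refresh either value on a captured request without invalidating the signature.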

### 5.2 approval_token specification

```
Signature algorithm: HS256
Payload:
- jti: UUID (uniqueness; Redis NX, 15min TTL)
- iss: "awooop-platform"
- sub: "{project_id}:{run_id}"
- aud: "awooop-approval"
- exp: now + 15min
- approval_type: "human" | "system"
- decision_scope: [tool_id, ...]

Verification:
1. verify the signature
2. exp has not passed
3. Redis NX confirms the jti is unused (replay protection)
4. sub matches the run_id being resumed
5. decision_scope matches the run's tools
```
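
The specification above can be exercised with a small standard-library sketch. It is an illustration under assumptions rather than the platform's implementation: `issue` and `verify` are hypothetical names, the in-memory `_used_jti` set stands in for Redis `SET jti NX EX 900`, and the mapping of failures to `E-APPROVAL-001/002/003` follows the error code dictionary in §7.

```python
import base64
import hashlib
import hmac
import json
import uuid

TTL_SECONDS = 15 * 60
_used_jti: set[str] = set()  # stand-in for Redis SET jti NX EX 900


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue(secret: bytes, project_id: str, run_id: str, scope: list[str], now: int) -> str:
    """Build an HS256-signed token with the payload fields from the spec."""
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {
        "jti": str(uuid.uuid4()),
        "iss": "awooop-platform",
        "sub": f"{project_id}:{run_id}",
        "aud": "awooop-approval",
        "exp": now + TTL_SECONDS,
        "approval_type": "human",
        "decision_scope": scope,
    }
    signing_input = f"{_b64(json.dumps(header).encode())}.{_b64(json.dumps(payload).encode())}"
    sig = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{_b64(sig)}"


def verify(secret: bytes, token: str, project_id: str, run_id: str,
           tool_id: str, now: int) -> str:
    """Return "OK" or the E-APPROVAL code that applies."""
    try:
        head, body, sig = token.split(".")
        expected = hmac.new(secret, f"{head}.{body}".encode(), hashlib.sha256).digest()
        if not hmac.compare_digest(expected, _unb64(sig)):
            return "E-APPROVAL-001"
        claims = json.loads(_unb64(body))
    except (ValueError, TypeError):
        return "E-APPROVAL-001"  # malformed token
    if claims.get("exp", 0) <= now:
        return "E-APPROVAL-002"
    if (claims.get("aud") != "awooop-approval"
            or claims.get("sub") != f"{project_id}:{run_id}"
            or tool_id not in claims.get("decision_scope", [])):
        return "E-APPROVAL-001"
    jti = str(claims.get("jti", ""))
    if not jti or jti in _used_jti:
        return "E-APPROVAL-003"
    _used_jti.add(jti)
    return "OK"
```

One deliberate difference from the listed order: the sketch consumes the `jti` last, so a request with a mismatched `sub` or `decision_scope` does not burn an otherwise valid token. Whether the platform should do the same is a design choice left to ADR-116.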

### 5.3 vuln-verifier acceptance checklist per Phase

- Phase 2: nonce forgery fails, webhook replay fails, requires_approval cannot be decided by the LLM
- Phase 4: 0 secret hits in the audit log (sample of 100 entries)
- Phase 5: 0 raw credentials on agent code paths (SAST)
- Phase 6: cross-tenant isolation PoC (EwoooC cannot read AWOOOI)
- Phase 8: resume without an approval token is rejected; approvals recover from PG after a Redis outage

---

## 6. Full API Endpoint List (fullstack additions)

### 6.1 Existing (unchanged)
- `POST /v1/webhooks/telegram`
- `POST /v1/webhooks/alertmanager`
- `GET /v1/incidents/`
- `POST /v1/decisions/`

### 6.2 Added in Phase 4 (Platform Shell)
- `POST /v1/platform/runs`: create a run (async)
- `GET /v1/platform/runs/{run_id}`: query run state
- `GET /v1/platform/runs/{run_id}/steps`: query SAGA steps
- `POST /v1/platform/runs/{run_id}/cancel`: cancel a run

### 6.3 Added in Phase 4-5 (Approval)
- `POST /v1/platform/runs/{run_id}/approve`: resume with an approval_token
- `POST /v1/platform/runs/{run_id}/reject`: reject (with a reason)

### 6.4 Added in Phase 6 (Tenant)
- `POST /v1/platform/projects`: create a project (admin only)
- `GET /v1/platform/projects/{project_id}/migration_state`: query the Strangler Fig state
- `POST /v1/platform/projects/{project_id}/contracts`: create a contract draft
- `POST /v1/platform/projects/{project_id}/contracts/{contract_id}/publish`: publish
- `POST /v1/platform/projects/{project_id}/contracts/{contract_id}/activate`: activate

### 6.5 Added in Phase 7 (Channel Hub)
- `GET /v1/platform/channel_events`: query conversation events (with pagination)
- `POST /v1/platform/outbound`: send an outbound message (admin/test)

---

## 7. Error Code Dictionary (12 codes; the review flagged 9 as must-add)

| Error Code | HTTP Status | Description | Scenario |
|------------|-------------|------|------|
| `E-SCHEMA-001` | 422 | LLM output schema validation failed | Phase 3 contract validator |
| `E-BUDGET-001` | 429 | Token budget hard kill triggered | Phase 4 budget guard |
| `E-APPROVAL-001` | 401 | approval_token missing or invalid | Phase 8 approval resume |
| `E-APPROVAL-002` | 401 | approval_token expired | Phase 8 |
| `E-APPROVAL-003` | 409 | approval_token already used (replay) | Phase 8 |
| `E-MCP-GATE-001` | 403 | MCP tool not authorized for this project | Phase 5 |
| `E-MCP-GATE-002` | 403 | MCP tool not authorized for this agent | Phase 5 |
| `E-MCP-GATE-003` | 403 | MCP write/execute tool blocked (not in auto_remediate mode) | Phase 5/8 |
| `E-TENANT-001` | 403 | Cross-tenant data access blocked | Phase 2+ |
| `E-IDEMPOTENT-001` | 200 | Duplicate event, returning existing run_id | Phase 4 |
| `E-RATE-001` | 429 | Project rate limit exceeded | Phase 2+ |
| `E-SAGA-001` | 500 | SAGA compensation failed, manual intervention required | Phase 4/ADR-119 |

---

## 8. Frontend Operator Console (frontend-designer, 8 modules)

> Implemented after Phase 8 (the Operator Console may be prototyped in Phase 6).
> ADR-UI-01~04 define the architecture; this section is the work item list.

| Module | Description | Priority |
|------|------|---------|
| **Tenant Management** | project list, creation, migration_state visualization, budget settings | P1 (Phase 6 prototype) |
| **Contract Lifecycle** | draft/publish/activate actions, revision diff, filter by the six contract families | P1 (Phase 6 prototype) |
| **Run Monitor** | run FSM visualization, shadow/canary/active markers, trace_id drill-down | P1 (after Phase 4) |
| **Strangler Fig Dashboard** | live dashboard of the shadow→canary gate metrics (divergence / latency / error rate) | P1 (after Phase 7) |
| **Budget & Cost** | per-project token budget, hard kill trigger history, cost trend (GCP Ollama vs paid provider) | P2 |
| **Audit Log Viewer** | audit log search (post-redaction), secret-hit warnings, trace_id correlation | P2 |
| **MCP Gateway Admin** | tool registry, grants management, credential refs (masked), audit | P2 |
| **Principal Directory** | platform_subject lookup, Telegram/LINE/API user mapping | P3 |

**Integration with the existing design system**:
- next-intl is mandatory (no hardcoded Chinese/English strings)
- No emoji; use Lucide/SVG icons
- Follow the site-wide design rules in `feedback_design_system_consistency.md`
- No direct access to internal IPs (`feedback_frontend_internal_ip_ban.md`)

---

## 9. Refactoring Cut Plan (11 PRs, refactor-specialist)

> Every PR must be independently mergeable, have a rollback path, and not depend on a later PR.

| PR# | Title | Prerequisite | Scope | Risk |
|-----|------|---------|---------|------|
| PR-01 | Remove the hardcoded IP assert at `telemetry.py:71` | none | 1 line | low |
| PR-02 | Turn the silence key at `decision_manager.py:240` into a constant | none | 2 lines | low |
| PR-03 | Remove the second definition at `ollama_auto_recovery.py:230` | none | ~5 lines | low |
| PR-04 | `_provider` → `__provider` (registry.py) | none | ~20 lines | low |
| PR-05 | Make the `mcp_bridge.py` namespace dynamic | none | ~30 lines | medium |
| PR-06 | Add a project prefix to CONSENSUS_PREFIX in `consensus_engine.py` | Phase 2 Redis dual-write Phase A | ~15 lines | medium |
| PR-07 | Nonce redesign + webhook timestamp/nonce (ADR-116) | none | ~100 lines | high (security fix) |
| PR-08 | Repository project_id filter, batch 1 (incidents/playbooks/km) | Phase 1 schema | ~200 lines | medium |
| PR-09 | Repository project_id filter, batch 2 (mcp/ai_decisions/approval) | PR-08 | ~200 lines | medium |
| PR-10 | Background loop tagging (31 loops, main.py) | ADR-123 | ~150 lines | medium |
| PR-11 | Per-project AnomalyCounter rework | PR-10 | ~80 lines | medium |

> PR-01~05 can run in parallel (no dependencies); whichever is ready goes first.
> PR-06 needs Redis dual-write Phase A to finish first; PR-07 has no prerequisite PR.
> PR-08~09 need the Phase 1 schema to finish first.

---

## 10. Feature Flag / Kill-Switch Registry

| Flag | Default | Description | Enable condition |
|-----------|--------|------|---------|
| `AWOOOP_SHADOW_MODE` | OFF | enable shadow runs (mirror without responding) | manual flip after Phase 4 completes |
| `AWOOOP_CANARY_MODE` | OFF | enable canary (some user-visible responses) | 14-day quantified shadow gate passed |
| `AWOOOP_READ_ONLY_MODE` | OFF | move read-only queries onto AwoooP | 7-day quantified canary gate passed |
| `AWOOOP_SUGGEST_MODE` | OFF | the AI suggests, a human decides | 14-day read_only gate passed |
| `AWOOOP_WRITE_MODE` | OFF | enable controlled write/execute tools | 30-day suggest gate passed + rollback evidence ≥3 |
| `AWOOOP_BUDGET_HARD_KILL` | ON | terminate on token budget overrun (not just alert) | **ON by default** (ADR-120) |
| `AWOOOP_MCP_OAUTH21` | OFF | MCP OAuth 2.1 flow (ADR-117) | after Phase 5 completes |
| `AWOOOP_RLS_STRICT` | OFF | strict RLS mode (no awooop_platform bypass) | Phase 2 complete + 30-day soak |
| `AWOOOP_EWOOOC_LIVE` | OFF | switch the EwoooC tenant to live (not shadow) | Phase 6 canary passed for 7 days |
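
Work item 8.5 requires that flags change only through an explicit flip and never through a stray environment variable. One way to sketch that rule, assuming a hypothetical in-process registry (`FlagRegistry`, `flip`, and the `audit` list are illustrative names; the storage backend is left open), with defaults mirroring the table above:

```python
from dataclasses import dataclass, field

# Defaults copied from the table above; only the budget kill-switch starts ON.
DEFAULTS = {
    "AWOOOP_SHADOW_MODE": False,
    "AWOOOP_CANARY_MODE": False,
    "AWOOOP_READ_ONLY_MODE": False,
    "AWOOOP_SUGGEST_MODE": False,
    "AWOOOP_WRITE_MODE": False,
    "AWOOOP_BUDGET_HARD_KILL": True,
    "AWOOOP_MCP_OAUTH21": False,
    "AWOOOP_RLS_STRICT": False,
    "AWOOOP_EWOOOC_LIVE": False,
}


@dataclass
class FlagRegistry:
    """Flags change only through flip(); environment variables are never read."""
    state: dict[str, bool] = field(default_factory=lambda: dict(DEFAULTS))
    audit: list[tuple[str, bool, str]] = field(default_factory=list)

    def is_enabled(self, name: str) -> bool:
        # Unknown flags raise KeyError instead of silently defaulting.
        return self.state[name]

    def flip(self, name: str, value: bool, actor: str) -> None:
        if name not in self.state:
            raise KeyError(f"unknown flag: {name}")
        if not actor:
            raise ValueError("an explicit actor is required to flip a flag")
        self.state[name] = value
        self.audit.append((name, value, actor))
```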

---

## 11. Runbook List (8 runbooks, requested by debugger)

| Runbook | Location | Trigger | Main steps |
|---------|------|---------|---------|
| **RB-01**: AwoooP Contract Publish Failure | `docs/runbooks/awooop-contract-publish-failure.md` | schema validation failure, CODEOWNERS reject | 1. check body_hash 2. check the draft status 3. roll back to the previous active revision |
| **RB-02**: Run State Stuck / Stale Lease | `docs/runbooks/awooop-run-stuck.md` | run stuck in RUNNING > 10min | 1. check lease_until 2. run the reaper manually 3. inspect saga_steps to decide whether to compensate or abandon |
| **RB-03**: Budget Hard Kill Triggered | `docs/runbooks/awooop-budget-hard-kill.md` | burst of E-BUDGET-001 | 1. check budget_ledger 2. confirm the hard_kill_at threshold 3. check whether an incident storm is the cause 4. raise the limit temporarily or wait for next month's reset |
| **RB-04**: Phase Rollback (Strangler Fig) | `docs/runbooks/awooop-phase-rollback.md` | canary error rate > threshold | 1. switch project_migration_state back to the previous mode 2. clear the Redis canary cache 3. notify EwoooC (if affected) |
| **RB-05**: Approval Token Replay Alert | `docs/runbooks/awooop-approval-replay.md` | E-APPROVAL-003 appears | 1. check the jti Redis key 2. confirm IP / user 3. revoke the token 4. notify security |
| **RB-06**: Cross-Tenant Data Leak Alert | `docs/runbooks/awooop-cross-tenant-leak.md` | burst of E-TENANT-001 | 1. stop canary/active mode immediately 2. check the audit log 3. verify the RLS configuration 4. evaluate a PITR restore |
| **RB-07**: GCP Ollama Failover Anomaly | `docs/runbooks/awooop-gcp-ollama-failover.md` | GCP-A/B down at the same time and the Local fallback down too | 1. check the `platform:ollama:primary` Redis key 2. set the fallback manually 3. confirm emergency routing to the paid provider |
| **RB-08**: SAGA Compensation Failure | `docs/runbooks/awooop-saga-compensation-fail.md` | E-SAGA-001 appears | 1. inspect the saga_steps JSON 2. find the failed step 3. run the compensation command manually 4. update the run status |

---

## 12. Tooling Reinforcement Plan (tool-expert)

| Tool | Purpose | Install location | Phase |
|------|------|---------|-------|
| **PgBouncer** | PG connection pooling so that AwoooP's multiple workers do not exhaust connections | K8s sidecar or standalone Pod | before Phase 4 |
| **Sealed Secrets** | replaces plaintext K8s Secrets; CI/CD safety | K3s cluster | Phase 2 (during security hardening) |
| **OPA / Cedar** | policy engine; centralizes authorization logic (replacing scattered code) | sidecar or admission webhook | before Phase 5 |
| **chaostoolkit / LitmusChaos** | chaos validation of Strangler Fig switches (worker crash, Redis outage, PG timeout) | CI pipeline | after Phase 4 completes |
| **awooop-ctl** | AwoooP CLI (contract CRUD / run queries / migration state management) | local CLI + CI | before Phase 6 |
| **pg_partman** | automatic PostgreSQL partition management | K8s Pod / cron | Phase 4 (before run_state goes live) |
| **pgvector (already present)** | KM vector search | already installed; needs per-project namespaces | Phase 2 |
| **OpenTelemetry Collector** | OTel pipeline (ADR-121); currently sends directly to SigNoz at 188:24318, sampling will be needed later | K8s DaemonSet | before Phase 4 |

---

## 13. Industry Alignment (web-researcher findings)

### 13.1 Lessons from the $47k Agent Loop Incident (Token Budget Hard Kill)

Problem: alert ≠ enforcement. Only a Prometheus alert fired while the agent kept running, and a single loop burned $47k.

AwoooP approach (ADR-120):
- Three budget limit tiers: per-run / per-project / per-tenant
- **Hard Kill**: over budget → the run is terminated outright (not just logged or alerted)
- Redis hot counter (decremented on every call) + PG budget_ledger transaction (final decision)
- The `AWOOOP_BUDGET_HARD_KILL` feature flag defaults to ON (the only flag that is on by default)
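
The difference between alerting and enforcing can be sketched as below. This is a minimal illustration, assuming an in-memory counter in place of the Redis hot counter and the PG `awooop_budget_ledger` transaction; `BudgetGuard`, `BudgetExceeded`, and the two-tier limits are hypothetical names. The point is that an overrun raises and stops the run instead of only emitting an alert.

```python
class BudgetExceeded(RuntimeError):
    """Raised to terminate a run; maps to E-BUDGET-001 in the error dictionary."""


class BudgetGuard:
    def __init__(self, run_limit: int, project_limit: int) -> None:
        self.run_limit = run_limit
        self.project_limit = project_limit
        self.run_used: dict[str, int] = {}
        self.project_used = 0
        self.hard_killed = False  # mirrors budget_ledger.hard_killed

    def charge(self, run_id: str, tokens: int) -> None:
        """Reserve tokens before a provider call; raise if any tier is exhausted."""
        if self.hard_killed:
            raise BudgetExceeded("E-BUDGET-001: project already hard-killed")
        run_total = self.run_used.get(run_id, 0) + tokens
        project_total = self.project_used + tokens
        if project_total > self.project_limit:
            self.hard_killed = True  # every later call for this project fails fast
            raise BudgetExceeded("E-BUDGET-001: project budget exhausted")
        if run_total > self.run_limit:
            raise BudgetExceeded("E-BUDGET-001: run budget exhausted")
        self.run_used[run_id] = run_total
        self.project_used = project_total
```

Calling `charge` before the provider request, rather than after it, is what keeps a runaway loop from spending past the limit while the alert is still propagating.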

### 13.2 Durable Execution / SAGA Compensating Transactions (ADR-119)

Industry standard (Temporal / Conductor / Azure Durable Functions): a multi-step tool chain needs a step-level journal plus a compensation mechanism.

AwoooP approach:
- `saga_steps` JSONB column on `awooop_run_state`
- Every tool call records: step_id / tool / status / compensation_cmd / completed_at
- On failure, run the compensation commands (reverse operations)
- Compensation failure → E-SAGA-001 + Runbook RB-08
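
The journal-plus-compensation idea above can be sketched as follows. It is a minimal illustration under assumptions: steps are plain Python callables rather than MCP tool calls, the journal is a list of dicts instead of the `saga_steps` JSONB column, and `SagaStep`, `run_saga`, and the status strings are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SagaStep:
    step_id: str
    action: Callable[[], None]
    compensation: Callable[[], None]  # reverse operation for this step


class SagaCompensationError(RuntimeError):
    """Compensation itself failed; corresponds to E-SAGA-001 / RB-08."""


def _compensate(steps: list[SagaStep], journal: list[dict]) -> None:
    """Undo completed steps in reverse order, updating the journal as we go."""
    by_id = {s.step_id: s for s in steps}
    for entry in reversed(journal):
        if entry["status"] != "COMPLETED":
            continue
        try:
            by_id[entry["step_id"]].compensation()
            entry["status"] = "COMPENSATED"
        except Exception as exc:
            entry["status"] = "COMPENSATION_FAILED"
            raise SagaCompensationError(f"E-SAGA-001: {entry['step_id']}") from exc


def run_saga(steps: list[SagaStep]) -> list[dict]:
    """Run steps in order; on the first failure, compensate and stop."""
    journal: list[dict] = []
    for step in steps:
        entry = {"step_id": step.step_id, "status": "RUNNING"}
        journal.append(entry)  # journal first, so a crash leaves a trace
        try:
            step.action()
            entry["status"] = "COMPLETED"
        except Exception:
            entry["status"] = "FAILED"
            _compensate(steps, journal)
            return journal
    return journal
```

Writing the journal entry before running the action is the property that lets RB-02 and RB-08 decide, after a worker crash, which steps still need compensation.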

### 13.3 MCP OAuth 2.1 Confused Deputy (ADR-117)

MCP spec 2025-06-18 calls for:
- per-tenant dynamic client registration (RFC 7591)
- Resource Indicators (RFC 8707): prevent a token from being used across resource servers
- PKCE (RFC 7636): prevent authorization code interception

AwoooP approach (ADR-117):
- Dynamic client registration per tenant; no shared client_id
- The Resource Indicator must match the target URI in the tool registry
- The `E-MCP-GATE-001/002/003` error codes cover the Confused Deputy scenarios

### 13.4 OTel GenAI Semantic Conventions (ADR-121)

Official specification (open-telemetry/semantic-conventions, `docs/gen-ai`):
- span name: `{gen_ai.operation.name} {gen_ai.request.model}` (e.g., `chat gemma3:4b`)
- token attributes: `gen_ai.usage.input_tokens` / `gen_ai.usage.output_tokens`
- model attributes: `gen_ai.request.model` / `gen_ai.response.model`

AwoooP approach: every LLM call must emit the attributes above and send them to SigNoz (188:24318).

### 13.5 OWASP Agentic AI Top 10 Alignment (ADR-122)

| OWASP item | AwoooP control |
|-----------|---------------|
| OAI-01 Prompt Injection | MCP Gateway result sanitization + schema validator |
| OAI-02 Insecure Tool Use | Five-gate MCP enforcement + audit |
| OAI-03 Excessive Agency | requires_approval from policy (never decided by the LLM) + write/execute feature flag |
| OAI-04 Supply Chain | contract publish HMAC + artifact SHA-256 |
| OAI-05 Data Leakage | audit log redaction + credential isolation |
| OAI-06 Insufficient Observability | OTel GenAI + audit sink + run trace_id |
| OAI-07 Unsafe Orchestration | SAGA journal + compensation + hard kill |
| OAI-08 Memory Vulnerabilities | contract revision immutability + RLS |
| OAI-09 Access Control Bypass | approval_token HS256 + jti replay prevention |
| OAI-10 Resource Exhaustion | Token Budget Hard Kill (ADR-120) |

---

## 14. Impact of the GCP Ollama Topology on AwoooP (ADR-110 integration)

### 14.1 New topology (ADR-110 + ADR-125, revised 2026-05-05)

```
Phase 0 bridge:
Primary  : GCP-A http://192.168.0.110:11435 (110 nginx → GCP public IP)
Secondary: GCP-B http://192.168.0.110:11436
Fallback : Local http://192.168.0.111:11434
Emergency: Gemini → Nemotron → Claude (when every Ollama tier is down; budget gated)

Target private mesh:
Primary  : GCP-A http://10.77.114.21:11434
Secondary: GCP-B http://10.77.114.22:11434
Fallback : Local http://10.77.114.111:11434
```

ADR-125 revises the transport layer of ADR-110: the public GCP IPs and the 110 nginx proxy remain only as a transition and rollback bridge. The production path is the WireGuard private mesh; runtime routing is managed by the AwoooP Inference Gateway.

### 14.2 Impacts AwoooP must handle

| Impact | Location | Handling | Phase |
|--------|------|---------|-------|
| `ollama:current_primary` Redis key dual-write (supports only 1 URL; 3 tiers are now needed) | INV-1 | replace with `platform:ollama:topology` (JSON: primary/secondary/fallback) | Phase 2 |
| Second definition at `ollama_auto_recovery.py:230` (P0-11) | ollama_auto_recovery.py | remove; read from config only | Phase 2 PR-03 |
| GCP public IPs enter INV-4 (34.143.170.20, 34.21.145.224) | INV-4 | mark as transitional only; production uses the `10.77.114.21/22` mesh IPs | Phase 0 INV-4 |
| WireGuard mesh | ADR-125 / runbook | build the `10.77.114.0/24` private transport; close public 11434 | Phase 2 prerequisite |
| AwoooP Inference Gateway | ADR-125 / runbook | isolate the alert-fast / code-review / embedding / deep-rca lanes so heavy models do not contend with the alert lane | Phase 4 |
| EwoooC Provider Proxy routes through GCP Ollama | Phase 6 | EwoooC shares the platform Ollama topology (platform_resource) | Phase 6 |
| `telemetry.py:71` IP assert (P0-08) | telemetry.py:71 | once removed, GCP IPs no longer trip the assert; the check becomes config-driven | Phase 2 PR-01 |
| budget_ledger records Ollama usage (free GCP still needs token counting) | Phase 4 | Ollama calls must also record token consumption (budget_ledger) | Phase 4 |
| Runbook RB-07 (GCP Ollama failover anomaly) | docs/runbooks/ | write the runbook in Phase 0; run the real E2E test after Phase 4 | Phase 0 |

### 14.3 GCP Ollama as a platform_resource (ADR-111)

GCP Ollama (bridge: 34.143.170.20 / 34.21.145.224; target mesh: 10.77.114.21 / 10.77.114.22) and Local Ollama (192.168.0.111 / target 10.77.114.111) are always declared as `platform_resource`:
- They belong to no tenant
- All tenants (AWOOOI / EwoooC / Tsenyang / Bitan) share them, while audit records carry each tenant's own project_id
- The `platform:ollama:topology` Redis key is prefixed `platform:` (not `{project_id}:`)
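
A three-tier topology value and the selection logic it enables can be sketched as below. The JSON shape is an assumption derived from the "primary/secondary/fallback" description in §14.2, the URLs are the target mesh addresses from §14.1, and `pick_endpoint` and its `is_healthy` callback are hypothetical names; the real routing belongs to the Inference Gateway.

```python
import json
from typing import Callable, Optional

# Assumed shape of the `platform:ollama:topology` value.
TOPOLOGY_JSON = json.dumps({
    "primary": "http://10.77.114.21:11434",
    "secondary": "http://10.77.114.22:11434",
    "fallback": "http://10.77.114.111:11434",
})

TIER_ORDER = ("primary", "secondary", "fallback")


def pick_endpoint(raw: str, is_healthy: Callable[[str], bool]) -> Optional[tuple[str, str]]:
    """Return (tier, url) for the first healthy tier, or None if all are down."""
    topology = json.loads(raw)
    for tier in TIER_ORDER:
        url = topology.get(tier)
        if url and is_healthy(url):
            return tier, url
    # Caller falls through to the budget-gated emergency providers.
    return None
```

Returning the tier name along with the URL lets the caller record which tier served the request, which is the kind of signal RB-07 needs when diagnosing a failover.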

### 14.4 Measured limits (2026-05-05)

Measurements from `scripts/ops/ollama-topology-check.sh`:

- GCP-A `gemma3:4b`: about 2s, but `size_vram=0`
- GCP-B `gemma3:4b`: about 8.5s, but `size_vram=0`
- 111 fallback `gemma3:4b`: about 4.9s, `size_vram=8210446336`

Conclusion: GCP-A/B can serve the synchronous `alert-fast` lane but cannot carry synchronous alert diagnosis with 14B/32B models.
Heavy models must be routed by the Inference Gateway to async / 111 / GPU nodes.

---

## 15. Master Work Ordering (parallel groups + critical path)

### Critical Path (strictly sequential; no skipping)

```
Phase 0: all ADRs/INVs
→ Phase 1 Schema (PR-01/02/03/04/05 can start early in parallel)
→ Phase 2 Security Hardening + Redis migration (PR-06~11)
→ Phase 3 Contract Packages
→ Phase 4 Platform Shell (PgBouncer + OPA/pg_partman prepared alongside)
→ Phase 5 MCP Gateway
→ Phase 6 EwoooC (14-day shadow gate)
→ Phase 7 Channel Hub (7-day canary gate)
→ Phase 8 Suggest + Write (30-day suggest gate)
```

### Parallel work groups

| Group | Work | Can run in parallel with |
|------|------|-----------|
| G-A (Phase 0 parallel) | ADR-111~115, each independent | all in parallel (5 ADRs, one owner each) |
| G-B (Phase 0 parallel) | ADR-116~124 | parallel with G-A |
| G-C (Phase 0 parallel) | INV-1~INV-9 (some depend on codebase exploration) | parallel with G-A/G-B |
| G-D (Phase 2 parallel) | PR-01/02/03/04/05 (independent small fixes) | all in parallel |
| G-E (Phase 2 parallel) | Redis dual-write + repository rework + security hardening | independent of each other, but security hardening goes first |
| G-F (Phase 4 parallel) | PgBouncer + pg_partman + OPA installation | parallel with Phase 3 Contract Packages |
| G-G (Phase 5-6 parallel) | Operator Console prototype (ADR-UI-01~04) | parallel with Phase 6 EwoooC shadow |

### Full ordering table

| Order | Work | docs-only | Parallel group | Blocks |
|------|------|-----------|---------|-------|
| 1-A | ADR-111 Bootstrap Order | ✅ | G-A | Phase 2 |
| 1-B | ADR-112 Contract Governance | ✅ | G-A | Phase 3 |
| 1-C | ADR-113 Active Revision Outbox | ✅ | G-A | Phase 1 |
| 1-D | ADR-114 Idempotency & Worker Lease | ✅ | G-A | Phase 4 |
| 1-E | ADR-115 Principal Mapping | ✅ | G-A | Phase 6, 7 |
| 2-A | ADR-116 Security Hardening | ✅ | G-B | Phase 2 |
| 2-B | ADR-117 MCP OAuth 2.1 | ✅ | G-B | Phase 5 |
| 2-C | ADR-118 RLS Strategy | ✅ | G-B | Phase 1 |
| 2-D | ADR-119 Durable Execution SAGA | ✅ | G-B | Phase 4 |
| 2-E | ADR-120 Token Budget Hard Kill | ✅ | G-B | Phase 4 |
| 2-F | ADR-121 OTel GenAI | ✅ | G-B | Phase 4 |
| 2-G | ADR-122 OWASP Agentic AI | ✅ | G-B | all Phases |
| 2-H | ADR-123 Background Loop Migration | ✅ | G-B | Phase 2 |
| 2-I | ADR-124 Global Singleton Decomposition | ✅ | G-B | Phase 2 |
| 2-J | ADR-UI-01~04 Operator Console ADR | ✅ | G-B | Phase 6+ |
| 2-K | ADR-106 Quantified Gates addendum | ✅ | G-B | Phase 8 |
| 3-A | INV-1 Redis Keys | ✅ | G-C | Phase 2 |
| 3-B | INV-2 Repository Retrofit Map | ✅ | G-C | Phase 2 |
| 3-C | INV-3 Entrypoints | ✅ | G-C | Phase 2 |
| 3-D | INV-4 Hardcoded Namespace/IP (incl. GCP IPs) | ✅ | G-C | Phase 2 |
| 3-E | INV-5 Migration Compatibility Matrix | ✅ | G-C | Phase 1 |
| 3-F | INV-6 Rollback Playbook Register | ✅ | G-C | Phase 4 |
| 3-G | INV-7 PR Cutting Plan | ✅ | G-C | Phase 2 |
| 3-H | INV-8 Background Loop Catalog (31 loops) | ✅ | G-C | Phase 2 |
| 3-I | INV-9 Global Singleton Catalog (13 singletons) | ✅ | G-C | Phase 2 |
| 4 | Task 9 ordering fix (Dockerfile/ConfigMap) | ❌ | — | Phase 1 |
| 5 | **Phase 1 Schema Migration** (rewritten) | ❌ | — | Phase 2~8 |
| 6-A | PR-01/02/03/04/05 (parallel small fixes) | ❌ | G-D | Phase 2 |
| 6-B | **Phase 2 Security Hardening** (PR-07 first) | ❌ | G-E | Phase 4 |
| 6-C | Phase 2 Redis dual-write + Repository (PR-06/08/09/10/11) | ❌ | G-E | Phase 4 |
| 7 | **Phase 3 Contract Packages** (packages/awooop-contracts/) | ❌ | — | Phase 4 |
| 8-A | PgBouncer + pg_partman + OPA installation | ❌ | G-F | Phase 4 |
| 8-B | **Phase 4 Platform Shell + Shadow** (incl. SAGA + Budget Kill) | ❌ | — | Phase 5 |
| 9 | **Phase 5 MCP Gateway** (incl. OAuth 2.1) | ❌ | — | Phase 6 |
| 10-A | **Phase 6 EwoooC Shadow Onboarding** (14-day gate) | ❌ | G-G | Phase 7 |
| 10-B | Operator Console prototype (G-G) | ❌ | G-G | Phase 7+ |
| 11 | **Phase 7 Channel Hub** (7-day canary gate) | ❌ | — | Phase 8 |
| 12 | **Phase 8 Suggest + Controlled Write** (30-day gate) | ❌ | — | AwoooP v1 GA |

**1-A through 3-I are all docs-only and can be completed back to back in the current conversation window; only after they are done does a new Codex conversation start on Phase 1 code.**

---

## 16. Quantified Acceptance Thresholds (full version)

### Strangler Fig Gates

| Transition | Quantified conditions | Sign-off |
|------|---------|------|
| pre → shadow | tenant created + agent contract published + audit/trace writes healthy | critic confirms |
| shadow → canary | ≥14 days + decision divergence < 5% + p95 regression < 10% + 0 P0/P1 incidents + 0 secrets in audit | critic + db-expert + vuln-verifier |
| canary → read_only | ≥7 days + user-visible error rate < 0.5% + cost diff < 50% of budget | critic + vuln-verifier |
| read_only → suggest | ≥14 days + suggest accept rate ≥ 50% + 0 hallucination escalations | critic |
| suggest → auto_remediate | ≥30 days + ≥ 3 successful rollback evidence records + approval token live + dry-run pass ≥ 99% | critic + db-expert + vuln-verifier |
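
Because every gate is a conjunction of numeric conditions, it can be checked mechanically before the human sign-off. A minimal sketch for the shadow → canary row, assuming a hypothetical metrics record (`ShadowMetrics`) whose fields would be fed from the Strangler Fig dashboard; the thresholds are the ones in the table above:

```python
from dataclasses import dataclass


@dataclass
class ShadowMetrics:
    days_in_mode: int
    decision_divergence: float  # fraction, 0.05 == 5%
    p95_regression: float       # fraction relative to the legacy path
    p0_p1_incidents: int
    audit_secret_hits: int


def shadow_to_canary_blockers(m: ShadowMetrics) -> list[str]:
    """Return the unmet conditions; an empty list means the gate passes."""
    blockers = []
    if m.days_in_mode < 14:
        blockers.append("needs >= 14 days in shadow")
    if m.decision_divergence >= 0.05:
        blockers.append("decision divergence must be < 5%")
    if m.p95_regression >= 0.10:
        blockers.append("p95 regression must be < 10%")
    if m.p0_p1_incidents != 0:
        blockers.append("P0/P1 incidents must be 0")
    if m.audit_secret_hits != 0:
        blockers.append("audit secret hits must be 0")
    return blockers
```

Returning the list of blockers rather than a bare boolean gives the signing reviewers the exact reason a promotion is held back.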

### Phase acceptance thresholds (quantified additions)

| Phase | Required quantified metrics |
|-------|-----------|
| Phase 1 | migration up/down dry-run passes; RLS cross-project rejection rate 100%; 0 AWOOOI behavior changes (regression pass rate 100%) |
| Phase 2 | INV-1 P0 key migration completion 100%; vuln-verifier PoC pass rate 3/3; hardcode grep returns 0 results |
| Phase 3 | contract schema coverage 100% (6 families); invalid fixture rejection rate 100% |
| Phase 4 | shadow runs produce 0 user-visible responses; duplicate events map to a single run 100% of the time; stale reaper reclaims within 1min 100% of the time |
| Phase 5 | credential leak test pass rate 100%; Five-gate integration test coverage 100% |
| Phase 6 | cross-tenant data access rejection rate 100%; EwoooC 14-day shadow gate passed |
| Phase 7 | first progress message ≤ 30s achieved ≥ 99% of the time; duplicate retries create 0 duplicate runs |
| Phase 8 | approval replay rejection rate 100%; write/execute default-OFF verified |

---

## 17. Related Documents

- [ADR-106: AwoooP Architecture](../adr/ADR-106-agent-platform-architecture.md)
- [ADR-107: Control Plane Storage Strategy](../adr/ADR-107-awooop-control-plane-storage.md)
- [ADR-110: GCP Ollama Three-Tier Failover Topology](../adr/ADR-110-gcp-ollama-topology.md)
- [MASTER-WORKPLAN.md](MASTER-WORKPLAN.md) (the master index this document expands)
- [IMPLEMENTATION-ROADMAP.md](IMPLEMENTATION-ROADMAP.md) (historical document, earlier draft)
- To be created: `docs/awooop/inventory/` INV-1~INV-9
- To be created: ADR-111~ADR-124 (AwoooP-specific ADR series)
- To be created: ADR-UI-01~ADR-UI-04 (Operator Console ADRs)
- To be created: `docs/runbooks/` RB-01~RB-08

---

*Last updated: 2026-05-03 (Taipei time)*
*Created by: 12-Agent joint review × Codex integration*
*Next step: Phase 0 docs-only work (starting with ADR-111); once complete, open a new Codex conversation for Phase 1 code*