## Phase 0 (documentation layer, all Accepted)

- ADR-106/107: AwoooP platform architecture + storage strategy
- ADR-111~118: seven core ADRs, Bootstrap → RLS
- ADR-119~124: six ADRs, SAGA → Singleton Decomposition
- ADR-UI-01~04: four Operator Console UI ADRs

## Phase 1 (DB schema + migration)

- awooop_phase1_control_plane_2026-05-04.sql: 7 new tables + triggers + RLS
- Step 1: three roles (platform_admin/migration with BYPASSRLS; awooop_app subject to RLS)
- Step 13: GRANT awooop_app least privilege (7 statements)
- Step 14: RLS fail-closed, `__platform__` backdoor removed
- awooop_phase1_batch1_rls_2026-05-04.sql: three-step ADD COLUMN for the four high-traffic tables
- awooop_phase1_batch1_backfill.py: batched backfill script using SKIP LOCKED
- awooop_models.py: 7 SQLAlchemy 2.x models

## Critic fixes (4 Critical + 3 Major)

- C-1: ADD CONSTRAINT IF NOT EXISTS → DO block + pg_constraint lookup
- C-2: `__mapper_args__` string list → primary_key=True on mapped_column
- C-3: `__platform__` RLS backdoor → removed entirely, replaced by a BYPASSRLS role
- C-4: awooop_app role was never created → Step 1 + 7 GRANT statements
- M-1: active_pointer_guard made SECURITY DEFINER (cross-tenant protection under FORCE RLS)
- M-2: pg_partman create_parent gets an idempotency guard
- M-3: immutability trigger now also protects identity columns (project_id/family/contract_id)

## Task 1.2 fixes

- agent_loader.py: hardcoded Mac path → AGENTS_DIR environment variable
- Dockerfile: add the missing COPY .claude/agents/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

# ADR-106: AwoooP Agent Platform Architecture and Migration Strategy

**Status**: Accepted
**Date**: 2026-05-01
**Scope**: Multi-tenant Agent Platform, Agent contracts, MCP Gateway, runtime state, channel events, migration strategy

## Context

AWOOOI currently contains the strongest AI automation implementation in the
ecosystem: OpenClaw/NemoTron/Hermes/ElephantAlpha roles, Agent Loop
foundations, MCP providers, Telegram workflows, model routing, cost guards,
Playbook learning, and operational audit trails.

Other products already have adjacent AI or messaging surfaces:

- EwoooC / MOMO PRO has business analysis bots, ElephantAlpha orchestration,
  market-intelligence tools, LINE/Telegram/Email notification paths, and local
  AI provider selection.
- Tsenyang already has Telegram webhook capability.
- Bitan and future products need repeatable AI onboarding without copying
  AWOOOI internals.

The old choice set was too narrow:

- A pure centralized HTTP hub solves governance but creates a single point of
  failure and makes AWOOOI look like every product's private brain.
- A shared SDK reduces duplicated code but cannot solve tenant isolation,
  identity, budget, channel, MCP credential, or cross-project audit problems.
- A light configuration plane helps with routing drift but still leaves tool
  use, session state, and channel handling scattered across projects.

The approved direction is a fourth path:

> Build **AwoooP**, a multi-tenant Agent Platform. AWOOOI is the first and
> largest tenant and first runtime host, not the platform boundary itself.

This ADR records the architecture and migration strategy only. It does not
authorize runtime code changes, provider cost changes, destructive operations,
or channel cutovers.

## Numbering Note

`ADR-105-revert-a2-ollama-primary.md` previously reserved `ADR-106` as a
placeholder for model-catalog cleanup. No checked-in `ADR-106*.md` existed.
This ADR consumes the `ADR-106` number for the broader Agent Platform decision.
The model-catalog and dynamic-routing debt is folded into the Policy / Routing
contract below and should be implemented under this platform roadmap or a later
non-conflicting ADR number.

## Decision

### D0 - Name the Platform AwoooP

The platform product name is **AwoooP**.

Naming rules:

- Product name: `AwoooP`
- Repository/package slug: `awooop`
- Project/tenant id for the existing AWOOOI product remains `awoooi`
- `AwoooP` is the platform boundary; `AWOOOI` is a tenant and initial runtime
  host

Do not create empty project directories just for the name. Create `awooop`
runtime/package directories only when a concrete implementation phase owns
code, schemas, clients, or workers.

Recommended future layout when implementation begins:

| Purpose | Path |
|---|---|
| Shared contract schemas | `packages/awooop-contracts/` |
| Python/TS client SDK | `packages/awooop-client/` |
| Platform API/runtime shell | `apps/awooop-runtime/` |
| Async run workers | `apps/awooop-worker/` |
| Detailed schema docs | `docs/awooop/` |

### D1 - Adopt AwoooP as a Six-Plane Platform

AwoooP is defined as six cooperating planes:

| Plane | Responsibility | Must not do |
|---|---|---|
| Project / Tenant | Tenant identity, isolation, budget, channel allowance, ACL, migration mode | Store agent prompt details |
| Agent | Agent identity, version, role, I/O contract, safety ceiling, context domain | Decide concrete model/provider credentials |
| MCP Gateway | Tool authorization, credential resolution, rate limits, approval, result sanitization, audit | Expose raw credentials to agents |
| Policy / Routing | Effective model/provider route, fallback, privacy ceiling, budget gate, generation defaults | Bypass tenant hard stops |
| Runtime / Run State | Run lifecycle, async state machine, shadow/canary/active mode, checkpoint/resume | Treat long tasks as simple HTTP request-response |
| Communication / Channel Event | Telegram/LINE/Slack/Email/API receive, verify, normalize, send | Run LLM inference or call MCP directly |

### D2 - Require Platform Envelope Fields

Every platform invocation must carry the following envelope fields:

- `project_id`
- `environment_id` when environment-specific resources are involved
- `agent_id`
- `agent_version` after resolution
- `session_id`
- `trace_id`
- `run_id` for runtime-managed work
- `policy_version` after effective policy resolution

Missing `project_id`, `trace_id`, or `run_id` in runtime or MCP paths is a hard
reject, not a warning.

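
The hard-reject rule above could be enforced at the platform boundary roughly as follows. This is a minimal sketch, not the platform's implementation: the `Envelope` dataclass, `EnvelopeRejected`, and `validate_for_runtime` are hypothetical names; only the field names come from this ADR.

```python
from dataclasses import dataclass
from typing import Optional


class EnvelopeRejected(ValueError):
    """Raised when a hard-required envelope field is missing (hypothetical name)."""


# Fields whose absence is a hard reject on runtime and MCP paths (per D2).
HARD_REQUIRED = ("project_id", "trace_id", "run_id")


@dataclass(frozen=True)
class Envelope:
    project_id: Optional[str] = None
    agent_id: Optional[str] = None
    session_id: Optional[str] = None
    trace_id: Optional[str] = None
    run_id: Optional[str] = None
    environment_id: Optional[str] = None  # only for environment-specific resources
    agent_version: Optional[str] = None   # filled in after agent resolution
    policy_version: Optional[str] = None  # filled in after policy resolution


def validate_for_runtime(env: Envelope) -> Envelope:
    """Reject, rather than warn, when a hard-required field is missing or empty."""
    missing = [name for name in HARD_REQUIRED if not getattr(env, name)]
    if missing:
        raise EnvelopeRejected(f"missing envelope fields: {', '.join(missing)}")
    return env
```

Raising an exception instead of logging keeps a malformed invocation from reaching the runtime or the Gateway at all.
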
### D3 - Treat Project as the Smallest Isolation Unit

`project_id` is the platform tenant boundary. All Redis keys, session stores,
RAG namespaces, KM queries, MCP tool scopes, budget ledgers, ACL checks, and
channel routing must be project-scoped unless a resource is explicitly declared
as `platform_resource`.

`global:*` resources are platform resources, not AWOOOI resources.

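
One way to make project scoping hard to forget is to route every key through a small builder. The sketch below is illustrative only: the `project:<project_id>:...` layout and both helper names are assumptions; the ADR fixes only the `global:*` prefix for platform resources.

```python
import re

# Segments may not contain ':' so a caller cannot smuggle in another scope.
_SEGMENT = re.compile(r"[A-Za-z0-9_.-]+")


def _check(segments: tuple) -> None:
    for seg in segments:
        if not _SEGMENT.fullmatch(seg):
            raise ValueError(f"invalid key segment: {seg!r}")


def scoped_key(project_id: str, *parts: str) -> str:
    """Build a tenant key, e.g. 'project:awoooi:session:abc' (assumed layout)."""
    if project_id == "global":
        raise ValueError("'global' is reserved for platform resources")
    _check((project_id, *parts))
    return ":".join(("project", project_id, *parts))


def platform_key(*parts: str) -> str:
    """Build a key for a resource explicitly declared as platform_resource."""
    _check(parts)
    return ":".join(("global", *parts))
```

Usage: `scoped_key("awoooi", "session", "abc")` yields a key that can never collide with another tenant's session key or with a `global:*` resource.
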
### D4 - Define Agents as Platform Capability Modules

Agents are not product-local prompt strings. Agents are versioned platform
capabilities with:

- `agent_id`
- `base_agent_ref`
- immutable published version
- role and capability tags
- privacy ceiling
- payload input schema
- LLM output schema
- prompt artifact reference and hash
- context domain constraints
- MCP requirement declarations
- execution profile
- eval and lifecycle governance

Specialized agents must inherit from base agents without loosening safety:

- `openclaw-core`
- `openclaw-sre`
- `openclaw-biz`
- future `openclaw-pharmacy`
- future `openclaw-marketing`

Published agent artifacts are immutable. Prompt, schema, or contract changes
must publish a new version.

### D5 - Use MCP Gateway Instead of Direct Tool Calls

Agents never call MCP servers directly and never see raw credentials. Every
tool call goes through MCP Gateway.

Gateway authorization is the intersection of:

`Project grant AND Agent requirement AND Tool contract AND Environment boundary AND Approval state`

Tool results entering model context must be sanitized first. Tool calls must be
traceable through `trace_id` and `run_id`.

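
The intersection rule can be read as a plain conjunction of five independent checks. The following sketch assumes each check has already been evaluated to a boolean; `GatewayChecks` and `authorize` are hypothetical names, and how each check is computed is outside this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatewayChecks:
    project_grant: bool         # the project has been granted this tool
    agent_requirement: bool     # the agent contract declares the tool
    tool_contract: bool         # the requested verb is allowed by the tool contract
    environment_boundary: bool  # the call stays inside the allowed environment
    approval_state: bool        # approval is satisfied or not required


def authorize(checks: GatewayChecks) -> tuple[bool, list[str]]:
    """Return (allowed, failed_checks); allowed only if every check passes."""
    failed = [name for name, passed in vars(checks).items() if not passed]
    return (not failed, failed)
```

Returning the list of failed checks alongside the decision gives the audit trail a concrete denial reason without exposing credentials.
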
### D6 - Resolve EffectivePolicy Before Every Model Call

Provider and model selection is not embedded inside Agent Contract. Runtime must
derive an `EffectivePolicy` before each model call from:

- platform policy
- agent safety ceiling
- project policy and budget
- environment constraints
- channel constraints
- intent and complexity
- current provider health

Budget hard stop overrides all fallback behavior. Cloud fallback must include a
reason and be written to trace/audit. Any policy change that increases cost,
enables a new paid provider, or raises token ceilings still requires explicit
human approval under `docs/HARD_RULES.md`.

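
To illustrate how a budget hard stop can take precedence over fallback, here is a minimal sketch of walking a fallback chain. It assumes the hard stop applies to cloud routes, as the C1 invariant suggests; `Route`, `RouteDecision`, `choose_route`, and the `NO_HEALTHY_PROVIDER` code are hypothetical, while `BUDGET_EXCEEDED` comes from this ADR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    provider_ref: str
    is_cloud: bool


@dataclass(frozen=True)
class RouteDecision:
    allowed: bool
    route: Optional[Route] = None
    fallback_reason: Optional[str] = None  # to be written to trace/audit
    deny_code: Optional[str] = None


def choose_route(chain: list[Route], healthy: set[str],
                 budget_exhausted: bool) -> RouteDecision:
    """Pick the first usable route; never fall back to cloud past a hard stop."""
    skipped: list[str] = []
    blocked_by_budget = False
    for route in chain:
        if route.provider_ref not in healthy:
            skipped.append(f"{route.provider_ref}:unhealthy")
            continue
        if route.is_cloud and budget_exhausted:
            skipped.append(f"{route.provider_ref}:budget_hard_stop")
            blocked_by_budget = True
            continue
        return RouteDecision(True, route, fallback_reason="; ".join(skipped) or None)
    code = "BUDGET_EXCEEDED" if blocked_by_budget else "NO_HEALTHY_PROVIDER"
    return RouteDecision(False, deny_code=code)
```

The `fallback_reason` string is what a runtime would record in trace/audit whenever the chosen route is not the first entry in the chain.
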
### D7 - Model Runtime as Durable Runs

An agent invocation is a `Run`, not just an HTTP request.

Required runtime states:

```text
CREATED -> POLICY_RESOLVED -> QUEUED -> RUNNING
RUNNING -> WAITING_TOOL -> RESUMED -> RUNNING
RUNNING -> WAITING_APPROVAL -> RESUMED -> RUNNING
RUNNING -> COMPLETED
any non-terminal -> FAILED
any non-terminal -> CANCELLED
```

Terminal states:

- `COMPLETED`
- `FAILED`
- `CANCELLED`

`WAITING_APPROVAL` must be resumable without replaying the whole task. `FAILED`
must include a structured `failure_code`.

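
The transitions listed above translate directly into a lookup table. This is a sketch of a transition guard only; persistence, audit writes, and checkpointing are out of scope, and `can_transition` is a hypothetical helper name.

```python
from enum import Enum


class RunState(str, Enum):
    CREATED = "CREATED"
    POLICY_RESOLVED = "POLICY_RESOLVED"
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    WAITING_TOOL = "WAITING_TOOL"
    WAITING_APPROVAL = "WAITING_APPROVAL"
    RESUMED = "RESUMED"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"


S = RunState
TERMINAL = {S.COMPLETED, S.FAILED, S.CANCELLED}

# Forward edges copied from the state list in D7.
FORWARD = {
    S.CREATED: {S.POLICY_RESOLVED},
    S.POLICY_RESOLVED: {S.QUEUED},
    S.QUEUED: {S.RUNNING},
    S.RUNNING: {S.WAITING_TOOL, S.WAITING_APPROVAL, S.COMPLETED},
    S.WAITING_TOOL: {S.RESUMED},
    S.WAITING_APPROVAL: {S.RESUMED},
    S.RESUMED: {S.RUNNING},
}


def can_transition(src: RunState, dst: RunState) -> bool:
    """Return True if src -> dst is permitted by the D7 state machine."""
    if src in TERMINAL:
        return False  # terminal states never transition, so CANCELLED cannot resume
    if dst in (S.FAILED, S.CANCELLED):
        return True   # any non-terminal state may fail or be cancelled
    return dst in FORWARD.get(src, set())
```

A runtime would call `can_transition` before every state write and reject the write otherwise, which also enforces that `CANCELLED` runs cannot resume.
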
### D8 - Keep Channel Adapters Thin

Telegram, LINE, Slack, Email, and API adapters only perform:

1. receive
2. verify
3. normalize to `ConversationEvent`
4. send `OutboundMessage`

Channel adapters must not call LLMs, call MCP tools, decide approval, or embed
business logic. Channel escaping, token resolution, delivery retry, and
provider message IDs remain adapter responsibilities.

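
As an illustration of the "normalize" step, the sketch below maps a Telegram-style update dictionary to a reduced `ConversationEvent`. The dataclass shape, the `telegram:<update_id>` event id format, and `normalize_telegram` are assumptions for this example; signature verification and sending are omitted, and the full field list belongs to C6.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class ConversationEvent:
    event_id: str
    trace_id: str
    project_id: str
    channel_type: str
    sender_id: str
    text: str
    raw: dict[str, Any] = field(repr=False)  # kept for short audit retention


def normalize_telegram(update: dict[str, Any], project_id: str) -> ConversationEvent:
    """Map a Telegram update to a ConversationEvent. No LLM or MCP calls here."""
    message = update["message"]
    return ConversationEvent(
        event_id=f"telegram:{update['update_id']}",  # assumed id format
        trace_id=uuid.uuid4().hex,                    # new trace starts at ingress
        project_id=project_id,
        channel_type="telegram",
        sender_id=str(message["from"]["id"]),
        text=message.get("text", ""),
        raw=update,
    )
```

The function is pure data mapping: everything that requires judgment (routing, approval, inference) happens downstream of the event.
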
### D9 - Migrate by Strangler Fig

No production path should be replaced in one cutover. Migration must progress by
tenant and capability:

1. `shadow`: platform receives mirrored events, writes audit/trace only, no user
   response and no external side effects.
2. `canary`: platform can respond to selected low-risk traffic, side effects
   disabled by default.
3. `read_only`: read-only queries and business chat move first.
4. `suggest`: analysis and recommendations move next, approval still external.
5. `auto_remediate`: write/execute tools move only after Gateway, approval,
   replay, and audit evidence are green.

AWOOOI must also become a tenant (`project_id=awoooi`) instead of keeping a
privileged private path forever.

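
The five modes above imply two runtime gates: whether the platform may answer the user, and whether it may execute side effects. A possible encoding is sketched below; the helper names and the `canary_side_effects_enabled` flag are assumptions, and treating `read_only` and `suggest` as side-effect-free is this example's reading of the list.

```python
from enum import Enum


class MigrationMode(str, Enum):
    SHADOW = "shadow"
    CANARY = "canary"
    READ_ONLY = "read_only"
    SUGGEST = "suggest"
    AUTO_REMEDIATE = "auto_remediate"


def may_respond_to_user(mode: MigrationMode) -> bool:
    """Shadow mode writes audit/trace only and never answers the user."""
    return mode is not MigrationMode.SHADOW


def may_run_side_effects(mode: MigrationMode,
                         canary_side_effects_enabled: bool = False) -> bool:
    """Side effects are off everywhere except explicit canary opt-in and auto_remediate."""
    if mode is MigrationMode.CANARY:
        return canary_side_effects_enabled  # disabled by default
    return mode is MigrationMode.AUTO_REMEDIATE
```
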
## Contract Baselines

### C1 - Project / Tenant Contract

Project is the smallest isolation unit.

Minimum contract fields:

- `project_id`
- `display_name`
- `status`
- `environments`
- `data_boundary`
- `platform_resources`
- `budget`
- `rate_limits`
- `allowed_agents`
- `allowed_channels`
- `approval_gates`
- `status_policy`

Invariants:

- `project_id` is immutable after creation.
- Data boundary enforcement happens at data/Gateway APIs, not in prompt text.
- Budget hard stop rejects cloud model calls with `BUDGET_EXCEEDED`.
- Agent calls outside `allowed_agents` return `AGENT_NOT_PERMITTED`.
- Project overrides may tighten permissions, never loosen them.
- Migration state lives in `project_migration_state`, not the stable project
  contract record.
- `suspended` behavior must explicitly define channel, model, and MCP access.

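
Two of the invariants above, tighten-only overrides and the `AGENT_NOT_PERMITTED` rejection, reduce to set operations. The sketch below assumes permissions are modeled as sets of identifiers; `ContractViolation`, `tighten`, `require_agent`, and the `OVERRIDE_LOOSENS` code are hypothetical.

```python
class ContractViolation(Exception):
    """Carries a structured error code (hypothetical exception type)."""

    def __init__(self, code: str, detail: str) -> None:
        super().__init__(f"{code}: {detail}")
        self.code = code


def tighten(base: frozenset[str], override: frozenset[str]) -> frozenset[str]:
    """Apply a project override that may only remove entries from the base set."""
    loosened = override - base
    if loosened:
        # OVERRIDE_LOOSENS is an illustrative code, not defined by this ADR.
        raise ContractViolation("OVERRIDE_LOOSENS", ", ".join(sorted(loosened)))
    return override


def require_agent(allowed_agents: frozenset[str], agent_id: str) -> None:
    """Reject calls to agents outside the project's allowed_agents."""
    if agent_id not in allowed_agents:
        raise ContractViolation("AGENT_NOT_PERMITTED", agent_id)
```
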
### C2 - Agent Contract

Agent Contract answers who the agent is, what it can do, what its interfaces are,
and its highest safety boundary.

Minimum contract fields:

- `agent_id`
- `base_agent_ref`
- `version`
- `lifecycle_status`
- `role`
- `capability_tags`
- `risk_class`
- `privacy_ceiling`
- `artifact_refs`
- `interface`
- `context_policy`
- `mcp_requirements`
- `execution_profile`
- `governance`

Invariants:

- Published versions are immutable.
- `base_agent_ref` must include a version range and resolve to an audited exact
  version at runtime.
- Prompt and schema artifacts require hashes.
- Agent requirements are not permissions; Gateway decides actual tool access.
- Agent output is LLM payload only. Runtime attaches trace, cost, policy, and
  validation metadata.
- `requires_approval` is calculated by runtime from project, agent, policy, and
  Gateway rules, not trusted from LLM output alone.

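
The artifact-hash invariant can be checked when an agent version is loaded. The ADR does not name a hash algorithm; the sketch below assumes SHA-256 and a `sha256:<hex>` reference format, and both function names are hypothetical.

```python
import hashlib
import hmac


def artifact_digest(content: bytes) -> str:
    """Return a digest reference for a prompt or schema artifact (assumed format)."""
    return "sha256:" + hashlib.sha256(content).hexdigest()


def verify_artifact(content: bytes, expected: str) -> bool:
    """Compare the recomputed digest with the one recorded in the contract."""
    return hmac.compare_digest(artifact_digest(content), expected)
```

Because published versions are immutable, a digest mismatch at load time indicates that an artifact was modified after publication and the run should not start.
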
### C3 - MCP Gateway Contract

MCP Gateway is the security boundary between agents and external systems.

Minimum contract fields:

- `tool_id`
- `domain`
- `resource`
- `verbs`
- `owner_project_id`
- `tenancy_scope`
- `side_effect_level`
- `data_sensitivity`
- `server`
- `schemas`
- `credential_policy`
- `authorization_policy`
- `approval_policy`
- `rate_limits`
- `result_policy`
- `audit_policy`

Tool call envelope requires:

- `trace_id`
- `run_id`
- `project_id`
- `environment_id`
- `agent_id`
- `agent_version`
- `session_id`
- `tool_id`
- `verb`
- `idempotency_key`
- `payload`

Invariants:

- Agents never see raw credentials.
- Missing `project_id`, `trace_id`, or `run_id` is rejected.
- Project grant, agent declaration, and context boundary must all pass.
- Results entering context are sanitized.
- Write, execute, and destructive operations require approval unless explicitly
  allowed by all relevant contracts.

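
The ADR requires an `idempotency_key` in the tool call envelope but does not say how it is produced. One option, shown here purely as an assumption, is to derive it deterministically from the run, a step counter, the tool, the verb, and a canonical form of the payload, so that a retried call produces the same key. The `step` parameter and the function name are hypothetical.

```python
import hashlib
import json
from typing import Any


def idempotency_key(run_id: str, step: int, tool_id: str, verb: str,
                    payload: dict[str, Any]) -> str:
    """Derive a stable key so that retrying the same step is deduplicated."""
    canonical = json.dumps(
        {"run_id": run_id, "step": step, "tool_id": tool_id,
         "verb": verb, "payload": payload},
        sort_keys=True,            # key order must not change the digest
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Including the step counter keeps two intentional, identical calls within one run from being collapsed into one.
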
### C4 - Policy / Routing Contract

Policy / Routing resolves the model and execution policy for a single call.

Minimum contract fields:

- `policy_id`
- `version`
- `lifecycle_status`
- `scope`
- `match`
- `constraints`
- `route_plan`
- `generation_defaults`
- `budget_guard`
- `approval_guard`
- `observability`

`EffectivePolicy` must include:

- `trace_id`
- `project_id`
- `agent_id`
- `agent_version`
- `policy_version`
- allow/deny decision
- provider/model refs
- fallback chain
- max tokens
- privacy mode
- budget state
- approval requirement
- matched policy layers
- decision notes

Invariants:

- Agent privacy ceiling beats project override.
- Project/global budget hard stop beats fallback.
- Providers/models are referenced through catalog refs, not hardcoded into
  agent contracts.
- Policy merge uses strictest-wins semantics.
- Effective policy must be replayable from versioned inputs.
- Schema retry has a hard cap.

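
Strictest-wins merging means that, for each constraint, the most restrictive value across all matched layers is kept. The sketch below demonstrates this for three illustrative constraints; `PolicyLayer`, the privacy level names in `PRIVACY_ORDER`, and `merge_strictest` are assumptions and not part of the contract.

```python
from dataclasses import dataclass

# Hypothetical privacy levels, ordered from least to most strict.
PRIVACY_ORDER = ("public", "internal", "restricted")


@dataclass(frozen=True)
class PolicyLayer:
    name: str
    max_tokens: int
    privacy_mode: str
    allow_cloud: bool


def merge_strictest(layers: list[PolicyLayer]) -> PolicyLayer:
    """Merge layers so that the most restrictive value of each field wins."""
    if not layers:
        raise ValueError("at least one policy layer is required")
    return PolicyLayer(
        name="+".join(layer.name for layer in layers),  # matched policy layers
        max_tokens=min(layer.max_tokens for layer in layers),
        privacy_mode=max((layer.privacy_mode for layer in layers),
                         key=PRIVACY_ORDER.index),
        allow_cloud=all(layer.allow_cloud for layer in layers),
    )
```

Because the merge is a pure function of its inputs, recomputing it from the same versioned layers yields the same result, which is what makes the effective policy replayable.
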
### C5 - Runtime / Run State Contract

Runtime owns durable execution state.

Minimum run fields:

- `run_id`
- `trace_id`
- `project_id`
- `environment_id`
- `agent_id`
- `agent_version`
- `session_id`
- `channel_type`
- `mode`
- `execution_type`
- `state`
- input refs
- effective policy ref
- checkpoint
- approval
- result
- audit timestamps and failure details

Invariants:

- Shadow mode has no external side effects and no user-visible response.
- Canary mode has no side effects by default.
- Every state transition writes audit.
- `WAITING_APPROVAL` is resumable.
- `CANCELLED` cannot resume.
- Tool loop iterations have a hard cap.
- Client response requires schema validation.

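
The hard cap on tool loop iterations could look like the following sketch. The cap value of 8, the `TOOL_LOOP_LIMIT` failure code, and the `step` callback signature are all assumptions made for illustration.

```python
from typing import Callable, Optional

MAX_TOOL_ITERATIONS = 8  # hypothetical default; the ADR does not fix a number


class ToolLoopExceeded(RuntimeError):
    """Raised when the loop hits its cap; carries a structured failure_code."""

    failure_code = "TOOL_LOOP_LIMIT"  # illustrative code


def run_tool_loop(step: Callable[[int], Optional[str]],
                  max_iterations: int = MAX_TOOL_ITERATIONS) -> str:
    """Call step(i) until it returns a final result or the cap is reached."""
    for iteration in range(max_iterations):
        result = step(iteration)
        if result is not None:
            return result
    raise ToolLoopExceeded(f"no final result after {max_iterations} iterations")
```

A runtime would map `ToolLoopExceeded` to the `FAILED` state and copy its `failure_code` into the run record.
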
### C6 - Communication / Channel Event Contract

Communication Hub normalizes channels without embedding AI logic.

Inbound `ConversationEvent` minimum fields:

- `event_id`
- `trace_id`
- `project_id`
- `environment_id`
- channel metadata
- sender metadata
- routing metadata
- payload
- security metadata
- delivery metadata

Outbound message minimum fields:

- `outbound_id`
- `trace_id`
- `run_id`
- `project_id`
- channel target
- message payload
- policy
- delivery state

Invariants:

- Adapter does not call LLM or MCP.
- Raw payload is saved for short audit retention.
- Outbound messages pass ACL, redaction, and rate limit.
- Bot token resolution stays inside adapter runtime.
- Retry is idempotent.
- Escaping/formatting is adapter responsibility.

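
Idempotent retry on the outbound side can be approximated by deduplicating on `outbound_id`. The sketch below keeps the delivery record in memory for brevity, whereas a real adapter would need a durable store; `OutboundSender` and the `deliver` callback are hypothetical.

```python
from typing import Callable


class OutboundSender:
    """Deliver each outbound_id at most once and remember the provider message id."""

    def __init__(self, deliver: Callable[[str], str]) -> None:
        self._deliver = deliver                # performs the actual channel send
        self._delivered: dict[str, str] = {}   # outbound_id -> provider message id

    def send(self, outbound_id: str, payload: str) -> str:
        if outbound_id in self._delivered:
            # Retry of an already delivered message: return the recorded id.
            return self._delivered[outbound_id]
        provider_message_id = self._deliver(payload)
        self._delivered[outbound_id] = provider_message_id
        return provider_message_id
```

If `deliver` raises, nothing is recorded, so a later retry with the same `outbound_id` attempts delivery again.
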
## Implementation Order

### Phase 0 - Documentation and Contract Freeze

- This ADR.
- Follow-up schema documents or ADRs only when implementation needs field-level
  detail.
- No runtime behavior change.

### Phase 1 - Isolation Foundation

- Add `project_id` to Redis keys, sessions, budget ledgers, dispatch logs, MCP
  audit snapshots, and approval records where applicable.
- Define `platform_resource` exceptions separately from tenant resources.
- Preserve existing AWOOOI behavior under `project_id=awoooi`.
- Follow ADR-107: PostgreSQL is the source of truth for AwoooP control-plane
  contracts; Redis is cache/watch only; CRDs are future runtime projection.

### Phase 2 - Platform Shell

- Add platform APIs and data models around existing logic.
- Implement envelope validation, trace propagation, audit, and effective policy
  calculation.
- Keep existing AWOOOI and EwoooC implementations behind adapters.

### Phase 3 - Shadow Mode

- Mirror selected events into AwoooP.
- Compare platform decisions with legacy decisions.
- Do not send user-visible responses or execute side effects.

### Phase 4 - Read-Only and Suggest Cutover

- Move low-risk chat and read-only analysis first.
- Add EwoooC business-agent traffic as the first downstream tenant validation.
- Keep remediation and write tools in legacy path until Gateway evidence is
  sufficient.

### Phase 5 - Controlled Active Runtime

- Move write/execute operations only after:
  - Gateway authorization is enforced.
  - Approval resume is proven.
  - Trace and audit can replay a complete run.
  - Budget hard stop has live evidence.
  - Production rollback path is documented.

## Consequences

### Benefits

- A single agent improvement can apply to all projects or to one specialization.
- Product-specific customization happens through policy, permissions, and
  context scope rather than copy-pasted prompts.
- Tool credentials remain outside agent context.
- Cross-project data leakage is blocked by data/Gateway filters.
- Telegram, LINE, Slack, Email, and API can share the same agent runtime.
- Long-running agent work becomes observable, resumable, and replayable.

### Costs

- More platform tables and contracts must exist before large runtime changes.
- Early delivery focuses on audit and shadow evidence rather than visible
  feature changes.
- Each downstream project needs explicit tenant onboarding.
- Provider/model policy changes become governed artifacts, not quick env edits.

### Risks

- Over-building before live traffic proves value.
- Contract sprawl if every detail becomes a new ADR.
- Legacy paths remaining forever if strangler milestones lack deadlines.
- Confusing `OpenClaw` brand identity if `openclaw-core`, `openclaw-sre`, and
  `openclaw-biz` are not documented clearly.

### Mitigations

- Use this ADR as the index and create detailed schema docs only as needed.
- Use ADR-107 for control-plane storage decisions before creating migrations or
  CRDs.
- Require `project_migration_state` milestones per tenant.
- Keep AWOOOI on the same tenant path as all other products.
- Treat shadow/canary evidence as the gate for every cutover.
- Keep cost-changing provider behavior behind explicit approval.

## Non-Goals

- This ADR does not deploy new providers or increase paid model usage.
- This ADR does not move Telegram/LINE/Slack webhooks.
- This ADR does not authorize destructive MCP tools.
- This ADR does not replace ADR-105 Agent Loop governance; it generalizes the
  platform boundary above it.
- This ADR does not require Temporal or a specific workflow engine in v1.

## Acceptance Criteria

- Six contract baselines are captured in this ADR.
- AwoooP is named as the platform product and AWOOOI is explicitly defined as
  first tenant, not the whole platform.
- MCP is Gateway-governed, not direct agent access.
- Runtime migration is shadow/canary/active, not big-bang.
- Cost and privacy hard stops are preserved.
- LOGBOOK records this architecture decision.

## References

- `docs/12-agent-game-rules.md`
- `docs/HARD_RULES.md`
- `docs/LOGBOOK.md`
- `docs/adr/ADR-080-ai-autonomy-flywheel-overview.md`
- `docs/adr/ADR-095-12agent-sdk-integration.md`
- `docs/adr/ADR-100-ai-autonomous-slo.md`
- `docs/adr/ADR-105-mcp-agent-loop-governance.md`
- `docs/adr/ADR-107-awooop-control-plane-storage.md`
- `docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md`