awoooi/docs/adr/ADR-106-agent-platform-architecture.md
Your Name 13e51802fe feat(awooop): Phase 0 full ADR set + Phase 1 control plane schema (including four critic fixes)
## Phase 0 (documentation layer, all Accepted)
- ADR-106/107: AwoooP platform architecture + storage strategy
- ADR-111~118: Bootstrap → RLS, seven core ADRs
- ADR-119~124: SAGA → Singleton Decomposition, six ADRs
- ADR-UI-01~04: four UI ADRs for the Operator Console

## Phase 1 (DB schema + migration)
- awooop_phase1_control_plane_2026-05-04.sql: 7 new tables + trigger + RLS
  - Step 1: three roles (platform_admin/migration with BYPASSRLS; awooop_app subject to RLS)
  - Step 13: GRANT least privilege to awooop_app (7 statements)
  - Step 14: RLS fail-closed; remove the __platform__ backdoor
- awooop_phase1_batch1_rls_2026-05-04.sql: three-step ADD COLUMN for the four high-traffic tables
- awooop_phase1_batch1_backfill.py: batched backfill script using SKIP LOCKED
- awooop_models.py: 7 SQLAlchemy 2.x models

## Critic fixes (4 Critical + 3 Major)
- C-1: ADD CONSTRAINT IF NOT EXISTS → DO block + pg_constraint lookup
- C-2: __mapper_args__ string list → primary_key=True on mapped_column
- C-3: __platform__ RLS backdoor → removed entirely; use a BYPASSRLS role instead
- C-4: awooop_app role was never created → Step 1 + 7 GRANT statements
- M-1: active_pointer_guard SECURITY DEFINER (cross-tenant protection under FORCE RLS)
- M-2: add an idempotency guard to pg_partman create_parent
- M-3: immutability trigger now also protects identity columns (project_id/family/contract_id)

## Task 1.2 fixes
- agent_loader.py: hardcoded Mac path → AGENTS_DIR environment variable
- Dockerfile: add the missing COPY .claude/agents/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 13:37:11 +08:00


ADR-106: AwoooP Agent Platform Architecture and Migration Strategy

Status: Accepted
Date: 2026-05-01
Scope: Multi-tenant Agent Platform, Agent contracts, MCP Gateway, runtime state, channel events, migration strategy

Context

AWOOOI currently contains the strongest AI automation implementation in the ecosystem: OpenClaw/NemoTron/Hermes/ElephantAlpha roles, Agent Loop foundations, MCP providers, Telegram workflows, model routing, cost guards, Playbook learning, and operational audit trails.

Other products already have adjacent AI or messaging surfaces:

  • EwoooC / MOMO PRO has business analysis bots, ElephantAlpha orchestration, market-intelligence tools, LINE/Telegram/Email notification paths, and local AI provider selection.
  • Tsenyang already has Telegram webhook capability.
  • Bitan and future products need repeatable AI onboarding without copying AWOOOI internals.

The old choice set was too narrow:

  • A pure centralized HTTP hub solves governance but creates a single point of failure and makes AWOOOI look like every product's private brain.
  • A shared SDK reduces duplicated code but cannot solve tenant isolation, identity, budget, channel, MCP credential, or cross-project audit problems.
  • A light configuration plane reduces routing drift but still leaves tool use, session state, and channel handling scattered across projects.

The approved direction is a fourth path:

Build AwoooP, a multi-tenant Agent Platform. AWOOOI is the first and largest tenant and first runtime host, not the platform boundary itself.

This ADR records the architecture and migration strategy only. It does not authorize runtime code changes, provider cost changes, destructive operations, or channel cutovers.

Numbering Note

ADR-105-revert-a2-ollama-primary.md previously reserved ADR-106 as a placeholder for model-catalog cleanup. No checked-in ADR-106*.md existed. This ADR consumes the ADR-106 number for the broader Agent Platform decision. The model-catalog and dynamic-routing debt is folded into the Policy / Routing contract below and should be implemented under this platform roadmap or a later non-conflicting ADR number.

Decision

D0 - Name the Platform AwoooP

The platform product name is AwoooP.

Naming rules:

  • Product name: AwoooP
  • Repository/package slug: awooop
  • Project/tenant id for the existing AWOOOI product remains awoooi
  • AwoooP is the platform boundary; AWOOOI is a tenant and initial runtime host

Do not create empty project directories just for the name. Create awooop runtime/package directories only when a concrete implementation phase owns code, schemas, clients, or workers.

Recommended future layout when implementation begins:

| Purpose | Path |
| --- | --- |
| Shared contract schemas | packages/awooop-contracts/ |
| Python/TS client SDK | packages/awooop-client/ |
| Platform API/runtime shell | apps/awooop-runtime/ |
| Async run workers | apps/awooop-worker/ |
| Detailed schema docs | docs/awooop/ |

D1 - Adopt AwoooP as a Six-Plane Platform

AwoooP is defined as six cooperating planes:

| Plane | Responsibility | Must not do |
| --- | --- | --- |
| Project / Tenant | Tenant identity, isolation, budget, channel allowance, ACL, migration mode | Store agent prompt details |
| Agent | Agent identity, version, role, I/O contract, safety ceiling, context domain | Decide concrete model/provider credentials |
| MCP Gateway | Tool authorization, credential resolution, rate limits, approval, result sanitization, audit | Expose raw credentials to agents |
| Policy / Routing | Effective model/provider route, fallback, privacy ceiling, budget gate, generation defaults | Bypass tenant hard stops |
| Runtime / Run State | Run lifecycle, async state machine, shadow/canary/active mode, checkpoint/resume | Treat long tasks as simple HTTP request-response |
| Communication / Channel Event | Telegram/LINE/Slack/Email/API receive, verify, normalize, send | Run LLM inference or call MCP directly |

D2 - Require Platform Envelope Fields

Every platform invocation must carry the following envelope fields:

  • project_id
  • environment_id when environment-specific resources are involved
  • agent_id
  • agent_version after resolution
  • session_id
  • trace_id
  • run_id for runtime-managed work
  • policy_version after effective policy resolution

Missing project_id, trace_id, or run_id in runtime or MCP paths is a hard reject, not a warning.
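
A minimal sketch of the hard-reject rule, assuming the envelope is carried as a plain Python dataclass. The names Envelope, EnvelopeRejected, and validate_envelope are hypothetical and are not part of any existing AwoooP code.

```python
from dataclasses import dataclass
from typing import Optional


class EnvelopeRejected(Exception):
    """Raised when a hard-required envelope field is missing."""


@dataclass(frozen=True)
class Envelope:
    # Field names follow D2; the dataclass itself is an illustrative assumption.
    project_id: Optional[str] = None
    agent_id: Optional[str] = None
    session_id: Optional[str] = None
    trace_id: Optional[str] = None
    run_id: Optional[str] = None
    environment_id: Optional[str] = None
    agent_version: Optional[str] = None
    policy_version: Optional[str] = None


# On runtime or MCP paths, a missing value here is a reject, not a warning.
HARD_REQUIRED = ("project_id", "trace_id", "run_id")


def validate_envelope(env: Envelope) -> Envelope:
    """Return the envelope unchanged, or raise if a hard-required field is empty."""
    missing = [name for name in HARD_REQUIRED if not getattr(env, name)]
    if missing:
        raise EnvelopeRejected("missing envelope fields: " + ", ".join(missing))
    return env
```

Raising an exception rather than logging keeps the failure closed: a caller cannot continue with a partially filled envelope by accident.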

D3 - Treat Project as the Smallest Isolation Unit

project_id is the platform tenant boundary. All Redis keys, session stores, RAG namespaces, KM queries, MCP tool scopes, budget ledgers, ACL checks, and channel routing must be project-scoped unless a resource is explicitly declared as platform_resource.

global:* resources are platform resources, not AWOOOI resources.
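
One way to make project scoping hard to bypass is to build every key through a single helper. The sketch below assumes a "project:<project_id>:..." prefix for tenant keys; that layout and the function name scoped_key are illustrative assumptions, and only the global: prefix comes from this ADR.

```python
from typing import Optional


def scoped_key(*parts: str, project_id: Optional[str] = None,
               platform_resource: bool = False) -> str:
    """Build a storage key that is either project-scoped or an explicit platform resource."""
    if platform_resource:
        # Platform resources are declared explicitly and live under global:*.
        return ":".join(("global",) + parts)
    if not project_id:
        # No implicit default tenant: an unscoped key is an error.
        raise ValueError("project_id is required for tenant-scoped keys")
    return ":".join(("project", project_id) + parts)
```

For example, scoped_key("session", "s1", project_id="awoooi") yields "project:awoooi:session:s1", while a platform resource must opt in with platform_resource=True.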

D4 - Define Agents as Platform Capability Modules

Agents are not product-local prompt strings. Agents are versioned platform capabilities with:

  • agent_id
  • base_agent_ref
  • immutable published version
  • role and capability tags
  • privacy ceiling
  • payload input schema
  • LLM output schema
  • prompt artifact reference and hash
  • context domain constraints
  • MCP requirement declarations
  • execution profile
  • eval and lifecycle governance

Specialized agents must inherit from base agents without loosening safety:

  • openclaw-core
  • openclaw-sre
  • openclaw-biz
  • future openclaw-pharmacy
  • future openclaw-marketing

Published agent artifacts are immutable. Prompt, schema, or contract changes must publish a new version.

D5 - Use MCP Gateway Instead of Direct Tool Calls

Agents never call MCP servers directly and never see raw credentials. Every tool call goes through MCP Gateway.

Gateway authorization is the intersection of:

Project grant AND Agent requirement AND Tool contract AND Environment boundary AND Approval state

Tool results entering model context must be sanitized first. Tool calls must be traceable through trace_id and run_id.
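
The intersection rule can be sketched as a conjunction of five independent checks. In this sketch each check is already reduced to a boolean; how each boolean is computed is out of scope, and the names ToolCallContext and authorize are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import List, Tuple


@dataclass(frozen=True)
class ToolCallContext:
    # One flag per term of the intersection in D5.
    project_grant: bool
    agent_requirement: bool
    tool_contract: bool
    environment_boundary: bool
    approval_state: bool


def authorize(ctx: ToolCallContext) -> Tuple[bool, List[str]]:
    """Allow the call only if every check passes; report which checks failed."""
    failed = [f.name for f in fields(ctx) if not getattr(ctx, f.name)]
    return (not failed, failed)
```

Returning the list of failed checks alongside the decision gives the audit trail a concrete denial reason without exposing any credential material.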

D6 - Resolve EffectivePolicy Before Every Model Call

Provider and model selection is not embedded inside Agent Contract. Runtime must derive an EffectivePolicy before each model call from:

  • platform policy
  • agent safety ceiling
  • project policy and budget
  • environment constraints
  • channel constraints
  • intent and complexity
  • current provider health

Budget hard stop overrides all fallback behavior. Cloud fallback must include a reason and be written to trace/audit. Any policy change that increases cost, enables a new paid provider, or raises token ceilings still requires explicit human approval under docs/HARD_RULES.md.
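
The interaction between fallback and the budget hard stop could look roughly like the sketch below. It assumes an ordered route plan and a precomputed set of healthy providers. Route, Decision, choose_route, and the NO_HEALTHY_PROVIDER code are hypothetical; only BUDGET_EXCEEDED appears elsewhere in this ADR.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Route:
    provider_ref: str
    is_cloud: bool


@dataclass(frozen=True)
class Decision:
    allowed: bool
    route: Optional[Route]
    reason: str  # written to trace/audit by the caller


def choose_route(routes: List[Route], healthy: Set[str], budget_exhausted: bool) -> Decision:
    """Walk the route plan in order; never let fallback bypass the budget hard stop."""
    skipped: List[str] = []
    blocked_by_budget = False
    for route in routes:
        if route.provider_ref not in healthy:
            skipped.append(route.provider_ref + ": unhealthy")
            continue
        if route.is_cloud and budget_exhausted:
            blocked_by_budget = True
            skipped.append(route.provider_ref + ": budget hard stop")
            continue
        reason = "primary" if not skipped else "fallback after " + "; ".join(skipped)
        return Decision(True, route, reason)
    code = "BUDGET_EXCEEDED" if blocked_by_budget else "NO_HEALTHY_PROVIDER"
    return Decision(False, None, code)
```

Every fallback decision carries a reason string, so a cloud fallback can always be explained from the audit record.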

D7 - Model Runtime as Durable Runs

An agent invocation is a Run, not just an HTTP request.

Required runtime states:

CREATED -> POLICY_RESOLVED -> QUEUED -> RUNNING
RUNNING -> WAITING_TOOL -> RESUMED -> RUNNING
RUNNING -> WAITING_APPROVAL -> RESUMED -> RUNNING
RUNNING -> COMPLETED
any non-terminal -> FAILED
any non-terminal -> CANCELLED

Terminal states:

  • COMPLETED
  • FAILED
  • CANCELLED

WAITING_APPROVAL must be resumable without replaying the whole task. FAILED must include a structured failure_code.
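
The transitions listed above can be encoded as a small lookup table. This is a sketch of the state rules only; it does not persist anything, and the function names can_transition and next_state are hypothetical.

```python
from typing import Optional

TERMINAL = frozenset({"COMPLETED", "FAILED", "CANCELLED"})

# Forward transitions taken directly from the state list in D7.
TRANSITIONS = {
    "CREATED": {"POLICY_RESOLVED"},
    "POLICY_RESOLVED": {"QUEUED"},
    "QUEUED": {"RUNNING"},
    "RUNNING": {"WAITING_TOOL", "WAITING_APPROVAL", "COMPLETED"},
    "WAITING_TOOL": {"RESUMED"},
    "WAITING_APPROVAL": {"RESUMED"},
    "RESUMED": {"RUNNING"},
}


def can_transition(current: str, target: str) -> bool:
    """Terminal states never move; any non-terminal state may fail or be cancelled."""
    if current in TERMINAL:
        return False
    if target in ("FAILED", "CANCELLED"):
        return True
    return target in TRANSITIONS.get(current, set())


def next_state(current: str, target: str, failure_code: Optional[str] = None) -> str:
    """Validate a transition and enforce that FAILED carries a failure_code."""
    if not can_transition(current, target):
        raise ValueError("illegal transition: " + current + " -> " + target)
    if target == "FAILED" and not failure_code:
        raise ValueError("FAILED requires a structured failure_code")
    return target
```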

D8 - Keep Channel Adapters Thin

Telegram, LINE, Slack, Email, and API adapters only perform:

  1. receive
  2. verify
  3. normalize to ConversationEvent
  4. send OutboundMessage

Channel adapters must not call LLMs, call MCP tools, decide approval, or embed business logic. Channel escaping, token resolution, delivery retry, and provider message IDs remain adapter responsibilities.
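
A thin adapter's inbound half might reduce to verify-then-normalize, as in the sketch below. The raw payload keys ("sender", "text"), the shared-secret check, and the reduced ConversationEvent shape are all illustrative assumptions; real channels each have their own payload and verification scheme.

```python
import hmac
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class ConversationEvent:
    # A reduced, hypothetical subset of the fields listed in C6.
    event_id: str
    trace_id: str
    project_id: str
    channel_type: str
    sender_id: str
    text: str


def handle_inbound(raw: dict, presented_secret: str, expected_secret: str,
                   project_id: str, channel_type: str) -> ConversationEvent:
    """Verify the webhook, then normalize. No LLM call, no MCP call, no business logic."""
    # Constant-time comparison avoids leaking the secret through timing.
    if not hmac.compare_digest(presented_secret.encode(), expected_secret.encode()):
        raise PermissionError("webhook verification failed")
    return ConversationEvent(
        event_id=str(uuid.uuid4()),
        trace_id=str(uuid.uuid4()),
        project_id=project_id,
        channel_type=channel_type,
        sender_id=str(raw["sender"]),
        text=str(raw["text"]),
    )
```

The adapter hands the resulting event to the runtime and stops there; deciding what to do with it is not its job.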

D9 - Migrate by Strangler Fig

No production path should be replaced in one cutover. Migration must progress by tenant and capability:

  1. shadow: platform receives mirrored events, writes audit/trace only, no user response and no external side effects.
  2. canary: platform can respond to selected low-risk traffic, side effects disabled by default.
  3. read_only: read-only queries and business chat move first.
  4. suggest: analysis and recommendations move next, approval still external.
  5. auto_remediate: write/execute tools move only after Gateway, approval, replay, and audit evidence are green.

AWOOOI must also become a tenant (project_id=awoooi) instead of keeping a privileged private path forever.
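
The five migration modes form an ordered ladder, which suggests an ordered enum. The sketch below assumes one mode per tenant and capability; the helper names and the gates_green flag (standing in for the Gateway, approval, replay, and audit evidence) are hypothetical.

```python
from enum import IntEnum


class MigrationMode(IntEnum):
    SHADOW = 1
    CANARY = 2
    READ_ONLY = 3
    SUGGEST = 4
    AUTO_REMEDIATE = 5


def may_send_user_response(mode: MigrationMode) -> bool:
    """Shadow writes audit/trace only; every later mode may respond to users."""
    return mode >= MigrationMode.CANARY


def side_effects_enabled(mode: MigrationMode, gates_green: bool = False) -> bool:
    """Write/execute tools run only in auto_remediate and only with green evidence."""
    return mode == MigrationMode.AUTO_REMEDIATE and gates_green


def next_mode(mode: MigrationMode) -> MigrationMode:
    """Advance exactly one step; there is no jump from shadow to active."""
    if mode == MigrationMode.AUTO_REMEDIATE:
        return mode
    return MigrationMode(mode + 1)
```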

Contract Baselines

C1 - Project / Tenant Contract

Project is the smallest isolation unit.

Minimum contract fields:

  • project_id
  • display_name
  • status
  • environments
  • data_boundary
  • platform_resources
  • budget
  • rate_limits
  • allowed_agents
  • allowed_channels
  • approval_gates
  • status_policy

Invariants:

  • project_id is immutable after creation.
  • Data boundary enforcement happens at data/Gateway APIs, not in prompt text.
  • Budget hard stop rejects cloud model calls with BUDGET_EXCEEDED.
  • Agent calls outside allowed_agents return AGENT_NOT_PERMITTED.
  • Project overrides may tighten permissions, never loosen them.
  • Migration state lives in project_migration_state, not the stable project contract record.
  • suspended behavior must explicitly define channel, model, and MCP access.
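
Two of these invariants, tighten-only overrides and the AGENT_NOT_PERMITTED rejection, can be sketched with plain set operations. The function names are hypothetical and the baseline set stands in for whatever the platform grants by default.

```python
from typing import FrozenSet, Optional


def apply_override(baseline: FrozenSet[str],
                   override: Optional[FrozenSet[str]]) -> FrozenSet[str]:
    """A project override may remove entries from the baseline but never add any."""
    if override is None:
        return baseline
    extra = override - baseline
    if extra:
        raise ValueError("override loosens permissions: " + ", ".join(sorted(extra)))
    return override


def check_agent(allowed_agents: FrozenSet[str], agent_id: str) -> Optional[str]:
    """Return None when permitted, otherwise the rejection code from C1."""
    return None if agent_id in allowed_agents else "AGENT_NOT_PERMITTED"
```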

C2 - Agent Contract

Agent Contract answers who the agent is, what it can do, what its interfaces are, and its highest safety boundary.

Minimum contract fields:

  • agent_id
  • base_agent_ref
  • version
  • lifecycle_status
  • role
  • capability_tags
  • risk_class
  • privacy_ceiling
  • artifact_refs
  • interface
  • context_policy
  • mcp_requirements
  • execution_profile
  • governance

Invariants:

  • Published versions are immutable.
  • base_agent_ref must include a version range and resolve to an audited exact version at runtime.
  • Prompt and schema artifacts require hashes.
  • Agent requirements are not permissions; Gateway decides actual tool access.
  • Agent output is LLM payload only. Runtime attaches trace, cost, policy, and validation metadata.
  • requires_approval is calculated by runtime from project, agent, policy, and Gateway rules, not trusted from LLM output alone.
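
Immutability and artifact hashing can be illustrated with an in-memory registry. This is a sketch only: AgentRegistry and artifact_hash are hypothetical names, SHA-256 is an assumed hash choice, and a real registry would live in the control-plane store rather than a dict.

```python
import hashlib
from typing import Dict, Tuple


def artifact_hash(content: str) -> str:
    """Content hash for a prompt or schema artifact (SHA-256 is an assumption)."""
    return "sha256:" + hashlib.sha256(content.encode("utf-8")).hexdigest()


class AgentRegistry:
    """Keeps one hash per (agent_id, version); published entries never change."""

    def __init__(self) -> None:
        self._published: Dict[Tuple[str, str], str] = {}

    def publish(self, agent_id: str, version: str, prompt: str) -> str:
        key = (agent_id, version)
        if key in self._published:
            # Any prompt, schema, or contract change must use a new version.
            raise ValueError(agent_id + "@" + version + " is already published")
        digest = artifact_hash(prompt)
        self._published[key] = digest
        return digest

    def verify(self, agent_id: str, version: str, prompt: str) -> bool:
        """True only if the prompt matches what was published for this version."""
        return self._published.get((agent_id, version)) == artifact_hash(prompt)
```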

C3 - MCP Gateway Contract

MCP Gateway is the security boundary between agents and external systems.

Minimum contract fields:

  • tool_id
  • domain
  • resource
  • verbs
  • owner_project_id
  • tenancy_scope
  • side_effect_level
  • data_sensitivity
  • server
  • schemas
  • credential_policy
  • authorization_policy
  • approval_policy
  • rate_limits
  • result_policy
  • audit_policy

Tool call envelope requires:

  • trace_id
  • run_id
  • project_id
  • environment_id
  • agent_id
  • agent_version
  • session_id
  • tool_id
  • verb
  • idempotency_key
  • payload

Invariants:

  • Agents never see raw credentials.
  • Missing project_id, trace_id, or run_id is rejected.
  • Project grant, agent declaration, and context boundary must all pass.
  • Results entering context are sanitized.
  • Write, execute, and destructive operations require approval unless explicitly allowed by all relevant contracts.
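
The last invariant can be read as "approval by default, waived only by unanimous allowance". A sketch under that reading follows; the side_effect_level values and the shape of contract_allowances (one boolean per relevant contract) are assumptions.

```python
from typing import Dict

# Levels that have external side effects; "read" is assumed to be the only other level.
SIDE_EFFECTING = frozenset({"write", "execute", "destructive"})


def requires_approval(side_effect_level: str, contract_allowances: Dict[str, bool]) -> bool:
    """Side-effecting calls need approval unless every relevant contract allows skipping it."""
    if side_effect_level not in SIDE_EFFECTING:
        return False
    # An empty mapping means nothing explicitly allowed the call, so approval is required.
    unanimously_allowed = bool(contract_allowances) and all(contract_allowances.values())
    return not unanimously_allowed
```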

C4 - Policy / Routing Contract

Policy / Routing resolves the model and execution policy for a single call.

Minimum contract fields:

  • policy_id
  • version
  • lifecycle_status
  • scope
  • match
  • constraints
  • route_plan
  • generation_defaults
  • budget_guard
  • approval_guard
  • observability

EffectivePolicy must include:

  • trace_id
  • project_id
  • agent_id
  • agent_version
  • policy_version
  • allow/deny decision
  • provider/model refs
  • fallback chain
  • max tokens
  • privacy mode
  • budget state
  • approval requirement
  • matched policy layers
  • decision notes

Invariants:

  • Agent privacy ceiling beats project override.
  • Project/global budget hard stop beats fallback.
  • Providers/models are referenced through catalog refs, not hardcoded into agent contracts.
  • Policy merge uses strictest-wins semantics.
  • Effective policy must be replayable from versioned inputs.
  • Schema retry has a hard cap.
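
Strictest-wins merging can be sketched over three representative fields. The privacy levels and their ordering are hypothetical, as are PolicyLayer and merge_strictest; a real policy would carry more fields than these.

```python
from dataclasses import dataclass
from typing import Iterable

# Hypothetical privacy levels, ordered from least to most strict.
PRIVACY_ORDER = ("public", "internal", "restricted")


@dataclass(frozen=True)
class PolicyLayer:
    name: str
    privacy_mode: str
    max_tokens: int
    allow_cloud: bool


def merge_strictest(layers: Iterable[PolicyLayer]) -> PolicyLayer:
    """Combine layers so that the most restrictive value of each field wins."""
    layers = list(layers)
    if not layers:
        raise ValueError("at least one policy layer is required")
    return PolicyLayer(
        name="effective",
        privacy_mode=max((l.privacy_mode for l in layers), key=PRIVACY_ORDER.index),
        max_tokens=min(l.max_tokens for l in layers),
        allow_cloud=all(l.allow_cloud for l in layers),
    )
```

Because each field is merged with an order-independent operation (max, min, all), the result is the same regardless of layer order, which helps keep the effective policy replayable from its versioned inputs.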

C5 - Runtime / Run State Contract

Runtime owns durable execution state.

Minimum run fields:

  • run_id
  • trace_id
  • project_id
  • environment_id
  • agent_id
  • agent_version
  • session_id
  • channel_type
  • mode
  • execution_type
  • state
  • input refs
  • effective policy ref
  • checkpoint
  • approval
  • result
  • audit timestamps and failure details

Invariants:

  • Shadow mode has no external side effects and no user-visible response.
  • Canary mode has no side effects by default.
  • Every state transition writes audit.
  • WAITING_APPROVAL is resumable.
  • CANCELLED cannot resume.
  • Tool loop iterations have a hard cap.
  • Client response requires schema validation.
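
The hard cap on tool loop iterations might be enforced as in this sketch. The step callback, the ToolLoopLimitExceeded exception, and its TOOL_LOOP_LIMIT failure code are hypothetical names chosen for illustration.

```python
from typing import Callable


class ToolLoopLimitExceeded(RuntimeError):
    # Hypothetical structured failure_code to record on the FAILED run.
    failure_code = "TOOL_LOOP_LIMIT"


def run_tool_loop(step: Callable[[int], bool], max_iterations: int) -> int:
    """Call step(iteration) until it reports completion or the hard cap is reached.

    Returns the number of iterations used.
    """
    for iteration in range(1, max_iterations + 1):
        if step(iteration):
            return iteration
    raise ToolLoopLimitExceeded("tool loop exceeded %d iterations" % max_iterations)
```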

C6 - Communication / Channel Event Contract

Communication Hub normalizes channels without embedding AI logic.

Inbound ConversationEvent minimum fields:

  • event_id
  • trace_id
  • project_id
  • environment_id
  • channel metadata
  • sender metadata
  • routing metadata
  • payload
  • security metadata
  • delivery metadata

Outbound message minimum fields:

  • outbound_id
  • trace_id
  • run_id
  • project_id
  • channel target
  • message payload
  • policy
  • delivery state

Invariants:

  • Adapter does not call LLM or MCP.
  • Raw payload is saved for short audit retention.
  • Outbound messages pass ACL, redaction, and rate limit.
  • Bot token resolution stays inside adapter runtime.
  • Retry is idempotent.
  • Escaping/formatting is adapter responsibility.
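
Idempotent retry can be sketched by keying deliveries on outbound_id. The in-memory dict stands in for durable delivery state, and OutboundSender and the deliver callback are hypothetical.

```python
from typing import Callable, Dict


class OutboundSender:
    """Delivers each outbound_id at most once; retries return the recorded result."""

    def __init__(self, deliver: Callable[[str, str], str]) -> None:
        # deliver(outbound_id, text) returns the provider message ID.
        self._deliver = deliver
        self._delivered: Dict[str, str] = {}

    def send(self, outbound_id: str, text: str) -> str:
        if outbound_id in self._delivered:
            # A retry of an already delivered message must not send it again.
            return self._delivered[outbound_id]
        provider_message_id = self._deliver(outbound_id, text)
        self._delivered[outbound_id] = provider_message_id
        return provider_message_id
```

If deliver raises, nothing is recorded, so a later retry with the same outbound_id attempts delivery again.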

Implementation Order

Phase 0 - Documentation and Contract Freeze

  • This ADR.
  • Follow-up schema documents or ADRs only when implementation needs field-level detail.
  • No runtime behavior change.

Phase 1 - Isolation Foundation

  • Add project_id to Redis keys, sessions, budget ledgers, dispatch logs, MCP audit snapshots, and approval records where applicable.
  • Define platform_resource exceptions separately from tenant resources.
  • Preserve existing AWOOOI behavior under project_id=awoooi.
  • Follow ADR-107: PostgreSQL is the source of truth for AwoooP control-plane contracts; Redis is cache/watch only; CRDs are future runtime projection.

Phase 2 - Platform Shell

  • Add platform APIs and data models around existing logic.
  • Implement envelope validation, trace propagation, audit, and effective policy calculation.
  • Keep existing AWOOOI and EwoooC implementations behind adapters.

Phase 3 - Shadow Mode

  • Mirror selected events into AwoooP.
  • Compare platform decisions with legacy decisions.
  • Do not send user-visible responses or execute side effects.

Phase 4 - Read-Only and Suggest Cutover

  • Move low-risk chat and read-only analysis first.
  • Add EwoooC business-agent traffic as the first downstream tenant validation.
  • Keep remediation and write tools in legacy path until Gateway evidence is sufficient.

Phase 5 - Controlled Active Runtime

  • Move write/execute operations only after:
    • Gateway authorization is enforced.
    • Approval resume is proven.
    • Trace and audit can replay a complete run.
    • Budget hard stop has live evidence.
    • Production rollback path is documented.

Consequences

Benefits

  • A single agent improvement can apply to all projects or to one specialization.
  • Product-specific customization happens through policy, permissions, and context scope rather than copy-pasted prompts.
  • Tool credentials remain outside agent context.
  • Cross-project data leakage is blocked by data/Gateway filters.
  • Telegram, LINE, Slack, Email, and API can share the same agent runtime.
  • Long-running agent work becomes observable, resumable, and replayable.

Costs

  • More platform tables and contracts must exist before large runtime changes.
  • Early delivery focuses on audit and shadow evidence rather than visible feature changes.
  • Each downstream project needs explicit tenant onboarding.
  • Provider/model policy changes become governed artifacts, not quick env edits.

Risks

  • Over-building before live traffic proves value.
  • Contract sprawl if every detail becomes a new ADR.
  • Legacy paths remaining forever if strangler milestones lack deadlines.
  • Confusing OpenClaw brand identity if openclaw-core, openclaw-sre, and openclaw-biz are not documented clearly.

Mitigations

  • Use this ADR as the index and create detailed schema docs only as needed.
  • Use ADR-107 for control-plane storage decisions before creating migrations or CRDs.
  • Require project_migration_state milestones per tenant.
  • Keep AWOOOI on the same tenant path as all other products.
  • Treat shadow/canary evidence as the gate for every cutover.
  • Keep cost-changing provider behavior behind explicit approval.

Non-Goals

  • This ADR does not deploy new providers or increase paid model usage.
  • This ADR does not move Telegram/LINE/Slack webhooks.
  • This ADR does not authorize destructive MCP tools.
  • This ADR does not replace ADR-105 Agent Loop governance; it generalizes the platform boundary above it.
  • This ADR does not require Temporal or a specific workflow engine in v1.

Acceptance Criteria

  • Six contract baselines are captured in this ADR.
  • AwoooP is named as the platform product and AWOOOI is explicitly defined as first tenant, not the whole platform.
  • MCP is Gateway-governed, not direct agent access.
  • Runtime migration is shadow/canary/active, not big-bang.
  • Cost and privacy hard stops are preserved.
  • LOGBOOK records this architecture decision.

References

  • docs/12-agent-game-rules.md
  • docs/HARD_RULES.md
  • docs/LOGBOOK.md
  • docs/adr/ADR-080-ai-autonomy-flywheel-overview.md
  • docs/adr/ADR-095-12agent-sdk-integration.md
  • docs/adr/ADR-100-ai-autonomous-slo.md
  • docs/adr/ADR-105-mcp-agent-loop-governance.md
  • docs/adr/ADR-107-awooop-control-plane-storage.md
  • docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md