awoooi/docs/adr/ADR-106-agent-platform-architecture.md

# ADR-106: AwoooP Agent Platform Architecture and Migration Strategy

**Status**: Accepted
**Date**: 2026-05-01
**Scope**: Multi-tenant Agent Platform, Agent contracts, MCP Gateway, runtime state, channel events, migration strategy

## Context

AWOOOI currently contains the strongest AI automation implementation in the
ecosystem: OpenClaw/NemoTron/Hermes/ElephantAlpha roles, Agent Loop
foundations, MCP providers, Telegram workflows, model routing, cost guards,
Playbook learning, and operational audit trails.

Other products already have adjacent AI or messaging surfaces:

- EwoooC / MOMO PRO has business analysis bots, ElephantAlpha orchestration,
  market-intelligence tools, LINE/Telegram/Email notification paths, and local
  AI provider selection.
- Tsenyang already has Telegram webhook capability.
- Bitan and future products need repeatable AI onboarding without copying
  AWOOOI internals.

The old choice set was too narrow:

- A pure centralized HTTP hub solves governance but creates a single point of
  failure and makes AWOOOI look like every product's private brain.
- A shared SDK reduces duplicated code but cannot solve tenant isolation,
  identity, budget, channel, MCP credential, or cross-project audit problems.
- A light configuration plane reduces routing drift but still leaves tool use,
  session state, and channel handling scattered across projects.

The approved direction is a fourth path:

> Build **AwoooP**, a multi-tenant Agent Platform. AWOOOI is the first and
> largest tenant and first runtime host, not the platform boundary itself.

This ADR records the architecture and migration strategy only. It does not
authorize runtime code changes, provider cost changes, destructive operations,
or channel cutovers.

## Numbering Note

`ADR-105-revert-a2-ollama-primary.md` previously reserved `ADR-106` as a
placeholder for model-catalog cleanup. No checked-in `ADR-106*.md` existed.
This ADR consumes the `ADR-106` number for the broader Agent Platform decision.
The model-catalog and dynamic-routing debt is folded into the Policy / Routing
contract below and should be implemented under this platform roadmap or a later
non-conflicting ADR number.

## Decision

### D0 - Name the Platform AwoooP

The platform product name is **AwoooP**.

Naming rules:

- Product name: `AwoooP`
- Repository/package slug: `awooop`
- Project/tenant id for the existing AWOOOI product remains `awoooi`
- `AwoooP` is the platform boundary; `AWOOOI` is a tenant and initial runtime
  host

Do not create empty project directories just for the name. Create `awooop`
runtime/package directories only when a concrete implementation phase owns
code, schemas, clients, or workers.

Recommended future layout when implementation begins:

| Purpose | Path |
|---|---|
| Shared contract schemas | `packages/awooop-contracts/` |
| Python/TS client SDK | `packages/awooop-client/` |
| Platform API/runtime shell | `apps/awooop-runtime/` |
| Async run workers | `apps/awooop-worker/` |
| Detailed schema docs | `docs/awooop/` |

### D1 - Adopt AwoooP as a Six-Plane Platform

AwoooP is defined as six cooperating planes:

| Plane | Responsibility | Must not do |
|---|---|---|
| Project / Tenant | Tenant identity, isolation, budget, channel allowance, ACL, migration mode | Store agent prompt details |
| Agent | Agent identity, version, role, I/O contract, safety ceiling, context domain | Decide concrete model/provider credentials |
| MCP Gateway | Tool authorization, credential resolution, rate limits, approval, result sanitization, audit | Expose raw credentials to agents |
| Policy / Routing | Effective model/provider route, fallback, privacy ceiling, budget gate, generation defaults | Bypass tenant hard stops |
| Runtime / Run State | Run lifecycle, async state machine, shadow/canary/active mode, checkpoint/resume | Treat long tasks as simple HTTP request-response |
| Communication / Channel Event | Telegram/LINE/Slack/Email/API receive, verify, normalize, send | Run LLM inference or call MCP directly |

### D2 - Require Platform Envelope Fields

Every platform invocation must carry the following envelope fields:

- `project_id`
- `environment_id` when environment-specific resources are involved
- `agent_id`
- `agent_version` after resolution
- `session_id`
- `trace_id`
- `run_id` for runtime-managed work
- `policy_version` after effective policy resolution

Missing `project_id`, `trace_id`, or `run_id` in runtime or MCP paths is a hard
reject, not a warning.
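
A minimal sketch of the hard-reject rule, assuming the envelope is carried as a
Python dataclass. The names `Envelope`, `EnvelopeRejected`, and
`validate_for_runtime` are illustrative and not part of any existing AwoooP
code.

```python
from dataclasses import dataclass
from typing import Optional

# Fields whose absence is a hard reject on runtime or MCP paths (D2).
HARD_REQUIRED = ("project_id", "trace_id", "run_id")


class EnvelopeRejected(Exception):
    """Raised when a hard-required envelope field is missing (hypothetical name)."""


@dataclass(frozen=True)
class Envelope:
    project_id: Optional[str] = None
    agent_id: Optional[str] = None
    session_id: Optional[str] = None
    trace_id: Optional[str] = None
    run_id: Optional[str] = None
    environment_id: Optional[str] = None   # only when environment-specific
    agent_version: Optional[str] = None    # filled in after agent resolution
    policy_version: Optional[str] = None   # filled in after policy resolution


def validate_for_runtime(envelope: Envelope) -> Envelope:
    """Reject, rather than warn, when a hard-required field is empty."""
    missing = [name for name in HARD_REQUIRED if not getattr(envelope, name)]
    if missing:
        raise EnvelopeRejected("missing envelope fields: " + ", ".join(missing))
    return envelope
```

Raising an exception instead of logging keeps the rule a reject at every call
site; a caller cannot accidentally continue with a partial envelope.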

### D3 - Treat Project as the Smallest Isolation Unit

`project_id` is the platform tenant boundary. All Redis keys, session stores,
RAG namespaces, KM queries, MCP tool scopes, budget ledgers, ACL checks, and
channel routing must be project-scoped unless a resource is explicitly declared
as `platform_resource`.

`global:*` resources are platform resources, not AWOOOI resources.
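
One way to make project scoping the default is to route every key through a
single helper. The sketch below assumes a `project:<id>:...` key layout, which
is a hypothetical convention; only the `global` prefix comes from this ADR.

```python
from typing import Optional

PLATFORM_PREFIX = "global"  # platform_resource namespace named in D3


def scoped_key(resource: str, *parts: str,
               project_id: Optional[str] = None,
               platform_resource: bool = False) -> str:
    """Build a storage key that is project-scoped unless explicitly platform-wide."""
    if platform_resource:
        if project_id is not None:
            raise ValueError("platform resources must not carry a project_id")
        prefix = PLATFORM_PREFIX
    else:
        if not project_id:
            raise ValueError("project_id is required for tenant resources")
        # Assumed layout; the real key format is not fixed by this ADR.
        prefix = f"project:{project_id}"
    return ":".join((prefix, resource, *parts))
```

For example, `scoped_key("session", "s1", project_id="awoooi")` yields
`project:awoooi:session:s1`, while a platform resource must opt in explicitly
with `platform_resource=True`.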

### D4 - Define Agents as Platform Capability Modules

Agents are not product-local prompt strings. Agents are versioned platform
capabilities with:

- `agent_id`
- `base_agent_ref`
- immutable published version
- role and capability tags
- privacy ceiling
- payload input schema
- LLM output schema
- prompt artifact reference and hash
- context domain constraints
- MCP requirement declarations
- execution profile
- eval and lifecycle governance

Specialized agents must inherit from base agents without loosening safety:

- `openclaw-core`
- `openclaw-sre`
- `openclaw-biz`
- future `openclaw-pharmacy`
- future `openclaw-marketing`

Published agent artifacts are immutable. Prompt, schema, or contract changes
must publish a new version.
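
The immutability and hash requirements can be sketched as a publish-once
registry. `AgentRegistry` is an in-memory stand-in with hypothetical method
names, and SHA-256 is an assumed hash choice; the ADR does not name one.

```python
import hashlib


class AgentRegistry:
    """In-memory stand-in for a store of published agent artifacts."""

    def __init__(self) -> None:
        # (agent_id, version) -> hex digest of the prompt artifact
        self._prompt_hashes = {}

    def publish(self, agent_id: str, version: str, prompt_text: str) -> str:
        key = (agent_id, version)
        if key in self._prompt_hashes:
            # Published artifacts are immutable: changes need a new version.
            raise ValueError(f"{agent_id}@{version} is already published")
        digest = hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()
        self._prompt_hashes[key] = digest
        return digest

    def matches(self, agent_id: str, version: str, prompt_text: str) -> bool:
        """Check that a prompt still hashes to the published value."""
        digest = hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()
        return self._prompt_hashes.get((agent_id, version)) == digest
```

Storing the hash alongside the version lets the runtime detect a prompt that
was edited in place instead of republished.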

### D5 - Use MCP Gateway Instead of Direct Tool Calls

Agents never call MCP servers directly and never see raw credentials. Every
tool call goes through MCP Gateway.

Gateway authorization is the intersection of:

`Project grant AND Agent requirement AND Tool contract AND Environment boundary AND Approval state`

Tool results entering model context must be sanitized first. Tool calls must be
traceable through `trace_id` and `run_id`.
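
The intersection rule can be sketched as a conjunction over five independent
checks. `GatewayChecks` and `authorize` are illustrative names; how each
boolean is computed is out of scope here.

```python
from dataclasses import asdict, dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class GatewayChecks:
    project_grant: bool
    agent_requirement: bool
    tool_contract: bool
    environment_boundary: bool
    approval_state: bool


def authorize(checks: GatewayChecks) -> Tuple[bool, List[str]]:
    """Allow only if every check passes; return failed check names for audit."""
    failed = [name for name, passed in asdict(checks).items() if not passed]
    return (not failed, failed)
```

Returning the failed check names alongside the decision gives the audit trail a
concrete denial reason without exposing any credential material.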

### D6 - Resolve EffectivePolicy Before Every Model Call

Provider and model selection is not embedded in the Agent Contract. The runtime
must derive an `EffectivePolicy` before each model call from:

- platform policy
- agent safety ceiling
- project policy and budget
- environment constraints
- channel constraints
- intent and complexity
- current provider health

A budget hard stop overrides all fallback behavior. Every cloud fallback must
include a reason and be written to trace/audit. Any policy change that
increases cost, enables a new paid provider, or raises token ceilings still
requires explicit human approval under `docs/HARD_RULES.md`.

### D7 - Model Runtime as Durable Runs

An agent invocation is a `Run`, not just an HTTP request.

Required runtime states:

```text
CREATED -> POLICY_RESOLVED -> QUEUED -> RUNNING
RUNNING -> WAITING_TOOL -> RESUMED -> RUNNING
RUNNING -> WAITING_APPROVAL -> RESUMED -> RUNNING
RUNNING -> COMPLETED
any non-terminal -> FAILED
any non-terminal -> CANCELLED
```

Terminal states:

- `COMPLETED`
- `FAILED`
- `CANCELLED`

`WAITING_APPROVAL` must be resumable without replaying the whole task. `FAILED`
must include a structured `failure_code`.
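
The transition list above can be sketched as a lookup table plus a guard
function. `RunState`, `FORWARD`, and `can_transition` are illustrative names;
the states and edges are taken directly from this section.

```python
from enum import Enum


class RunState(str, Enum):
    CREATED = "CREATED"
    POLICY_RESOLVED = "POLICY_RESOLVED"
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    WAITING_TOOL = "WAITING_TOOL"
    WAITING_APPROVAL = "WAITING_APPROVAL"
    RESUMED = "RESUMED"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"


TERMINAL = frozenset({RunState.COMPLETED, RunState.FAILED, RunState.CANCELLED})

# Forward edges copied from the transition list in D7.
FORWARD = {
    RunState.CREATED: {RunState.POLICY_RESOLVED},
    RunState.POLICY_RESOLVED: {RunState.QUEUED},
    RunState.QUEUED: {RunState.RUNNING},
    RunState.RUNNING: {
        RunState.WAITING_TOOL,
        RunState.WAITING_APPROVAL,
        RunState.COMPLETED,
    },
    RunState.WAITING_TOOL: {RunState.RESUMED},
    RunState.WAITING_APPROVAL: {RunState.RESUMED},
    RunState.RESUMED: {RunState.RUNNING},
}


def can_transition(current: RunState, target: RunState) -> bool:
    if current in TERMINAL:
        return False  # terminal states never move again
    if target in (RunState.FAILED, RunState.CANCELLED):
        return True   # any non-terminal state may fail or be cancelled
    return target in FORWARD.get(current, set())
```

Keeping the edges in data rather than scattered `if` branches makes it
straightforward to write one audit record per accepted transition.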

### D8 - Keep Channel Adapters Thin

Telegram, LINE, Slack, Email, and API adapters only perform:

1. receive
2. verify
3. normalize to `ConversationEvent`
4. send `OutboundMessage`

Channel adapters must not call LLMs, call MCP tools, decide approval, or embed
business logic. Channel escaping, token resolution, delivery retry, and
provider message IDs remain adapter responsibilities.
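
A thin adapter can be sketched as two pure functions: one that verifies the
webhook secret and one that normalizes the payload. The example assumes a
Telegram Bot API update of the form `{"update_id": ..., "message": {...}}` and
a shared secret delivered with the webhook request. The `ConversationEvent`
fields shown are a reduced, illustrative subset, and generating `trace_id` in
the adapter is an assumption.

```python
import hmac
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class ConversationEvent:
    event_id: str
    trace_id: str
    project_id: str
    channel_type: str
    sender_id: str
    text: str


def verify_secret(received: str, expected: str) -> bool:
    """Constant-time comparison of the webhook secret."""
    return hmac.compare_digest(received.encode(), expected.encode())


def normalize_telegram(update: dict, project_id: str) -> ConversationEvent:
    """Map a Telegram update to a channel-neutral event; no LLM or MCP calls."""
    message = update["message"]
    return ConversationEvent(
        event_id=f"telegram:{update['update_id']}",
        trace_id=uuid.uuid4().hex,  # assumption: trace starts at the adapter
        project_id=project_id,
        channel_type="telegram",
        sender_id=str(message["from"]["id"]),
        text=message.get("text", ""),
    )
```

Nothing in the adapter inspects the message meaning; routing, approval, and
inference all happen after the event leaves this layer.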

### D9 - Migrate by Strangler Fig

No production path should be replaced in one cutover. Migration must progress by
tenant and capability:

1. `shadow`: platform receives mirrored events, writes audit/trace only, no user
   response and no external side effects.
2. `canary`: platform can respond to selected low-risk traffic, side effects
   disabled by default.
3. `read_only`: read-only queries and business chat move first.
4. `suggest`: analysis and recommendations move next, approval still external.
5. `auto_remediate`: write/execute tools move only after Gateway, approval,
   replay, and audit evidence are green.

AWOOOI must also become a tenant (`project_id=awoooi`) instead of keeping a
privileged private path forever.
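
The five migration modes can be sketched as a capability table that the runtime
consults before responding or executing tools. The two flags are one
interpretation of the list above, not a normative mapping, and
`side_effects_allowed` is a hypothetical helper.

```python
from dataclasses import dataclass
from enum import Enum


class MigrationMode(str, Enum):
    SHADOW = "shadow"
    CANARY = "canary"
    READ_ONLY = "read_only"
    SUGGEST = "suggest"
    AUTO_REMEDIATE = "auto_remediate"


@dataclass(frozen=True)
class ModeLimits:
    may_respond: bool             # user-visible response allowed
    may_cause_side_effects: bool  # write/execute tools allowed


# Interpretation of D9; canary shows its default (side effects disabled).
LIMITS = {
    MigrationMode.SHADOW: ModeLimits(False, False),
    MigrationMode.CANARY: ModeLimits(True, False),
    MigrationMode.READ_ONLY: ModeLimits(True, False),
    MigrationMode.SUGGEST: ModeLimits(True, False),
    MigrationMode.AUTO_REMEDIATE: ModeLimits(True, True),
}


def side_effects_allowed(mode: MigrationMode, gates_green: bool) -> bool:
    """Side effects need both the right mode and green Gateway/approval/audit gates."""
    return LIMITS[mode].may_cause_side_effects and gates_green
```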

## Contract Baselines

### C1 - Project / Tenant Contract

Project is the smallest isolation unit.

Minimum contract fields:

- `project_id`
- `display_name`
- `status`
- `environments`
- `data_boundary`
- `platform_resources`
- `budget`
- `rate_limits`
- `allowed_agents`
- `allowed_channels`
- `approval_gates`
- `status_policy`

Invariants:

- `project_id` is immutable after creation.
- Data boundary enforcement happens at data/Gateway APIs, not in prompt text.
- Budget hard stop rejects cloud model calls with `BUDGET_EXCEEDED`.
- Agent calls outside `allowed_agents` return `AGENT_NOT_PERMITTED`.
- Project overrides may tighten permissions, never loosen them.
- Migration state lives in `project_migration_state`, not the stable project
  contract record.
- `suspended` behavior must explicitly define channel, model, and MCP access.

### C2 - Agent Contract

The Agent Contract answers who the agent is, what it can do, what its interfaces
are, and what its highest safety boundary is.

Minimum contract fields:

- `agent_id`
- `base_agent_ref`
- `version`
- `lifecycle_status`
- `role`
- `capability_tags`
- `risk_class`
- `privacy_ceiling`
- `artifact_refs`
- `interface`
- `context_policy`
- `mcp_requirements`
- `execution_profile`
- `governance`

Invariants:

- Published versions are immutable.
- `base_agent_ref` must include a version range and resolve to an audited exact
  version at runtime.
- Prompt and schema artifacts require hashes.
- Agent requirements are not permissions; Gateway decides actual tool access.
- Agent output is LLM payload only. Runtime attaches trace, cost, policy, and
  validation metadata.
- `requires_approval` is calculated by runtime from project, agent, policy, and
  Gateway rules, not trusted from LLM output alone.

### C3 - MCP Gateway Contract

MCP Gateway is the security boundary between agents and external systems.

Minimum contract fields:

- `tool_id`
- `domain`
- `resource`
- `verbs`
- `owner_project_id`
- `tenancy_scope`
- `side_effect_level`
- `data_sensitivity`
- `server`
- `schemas`
- `credential_policy`
- `authorization_policy`
- `approval_policy`
- `rate_limits`
- `result_policy`
- `audit_policy`

Tool call envelope requires:

- `trace_id`
- `run_id`
- `project_id`
- `environment_id`
- `agent_id`
- `agent_version`
- `session_id`
- `tool_id`
- `verb`
- `idempotency_key`
- `payload`

Invariants:

- Agents never see raw credentials.
- Missing `project_id`, `trace_id`, or `run_id` is rejected.
- Project grant, agent declaration, and context boundary must all pass.
- Results entering context are sanitized.
- Write, execute, and destructive operations require approval unless explicitly
  allowed by all relevant contracts.
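
The `idempotency_key` in the tool call envelope suggests that a retried call
should not repeat its side effect. A minimal in-memory sketch, assuming results
are cached per `(project_id, tool_id, idempotency_key)`; `IdempotentExecutor`
is a hypothetical name and a real Gateway would need durable storage and
expiry.

```python
from typing import Any, Callable, Dict, Tuple


class IdempotentExecutor:
    """Run a tool call at most once per (project, tool, idempotency key)."""

    def __init__(self) -> None:
        self._results: Dict[Tuple[str, str, str], Any] = {}

    def execute(self, project_id: str, tool_id: str, idempotency_key: str,
                call: Callable[[], Any]) -> Any:
        key = (project_id, tool_id, idempotency_key)
        if key not in self._results:
            # First time this key is seen: perform the call and remember it.
            self._results[key] = call()
        # Retries with the same key get the stored result, not a second call.
        return self._results[key]
```

Including `project_id` in the cache key keeps one tenant's retry from ever
reading another tenant's result.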

### C4 - Policy / Routing Contract

Policy / Routing resolves the model and execution policy for a single call.

Minimum contract fields:

- `policy_id`
- `version`
- `lifecycle_status`
- `scope`
- `match`
- `constraints`
- `route_plan`
- `generation_defaults`
- `budget_guard`
- `approval_guard`
- `observability`

`EffectivePolicy` must include:

- `trace_id`
- `project_id`
- `agent_id`
- `agent_version`
- `policy_version`
- allow/deny decision
- provider/model refs
- fallback chain
- max tokens
- privacy mode
- budget state
- approval requirement
- matched policy layers
- decision notes

Invariants:

- Agent privacy ceiling beats project override.
- Project/global budget hard stop beats fallback.
- Providers/models are referenced through catalog refs, not hardcoded into
  agent contracts.
- Policy merge uses strictest-wins semantics.
- Effective policy must be replayable from versioned inputs.
- Schema retry has a hard cap.
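
Strictest-wins merging can be sketched as a fold over policy layers in which
each field keeps its most restrictive value. The privacy mode names and the
three merged fields are hypothetical; the real `constraints` schema is not
defined in this ADR.

```python
from typing import List, Optional

# Hypothetical privacy modes, ordered from least to most restrictive.
PRIVACY_ORDER = ("cloud_allowed", "redacted_cloud", "local_only")


def merge_strictest(layers: List[dict]) -> dict:
    """Merge policy layers so the most restrictive value of each field wins."""
    max_tokens: Optional[int] = None
    privacy = PRIVACY_ORDER[0]
    requires_approval = False
    for layer in layers:
        if "max_tokens" in layer:
            cap = layer["max_tokens"]
            max_tokens = cap if max_tokens is None else min(max_tokens, cap)
        if "privacy" in layer:
            # Keep whichever mode sits later in PRIVACY_ORDER.
            privacy = max(privacy, layer["privacy"], key=PRIVACY_ORDER.index)
        requires_approval = requires_approval or layer.get("requires_approval", False)
    return {
        "max_tokens": max_tokens,
        "privacy": privacy,
        "requires_approval": requires_approval,
    }
```

With this rule the merge result does not depend on layer order, so a project
override can never loosen an agent's privacy ceiling, and replaying the same
versioned layers always yields the same effective policy.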

### C5 - Runtime / Run State Contract

Runtime owns durable execution state.

Minimum run fields:

- `run_id`
- `trace_id`
- `project_id`
- `environment_id`
- `agent_id`
- `agent_version`
- `session_id`
- `channel_type`
- `mode`
- `execution_type`
- `state`
- input refs
- effective policy ref
- checkpoint
- approval
- result
- audit timestamps and failure details

Invariants:

- Shadow mode has no external side effects and no user-visible response.
- Canary mode has no side effects by default.
- Every state transition writes audit.
- `WAITING_APPROVAL` is resumable.
- `CANCELLED` cannot resume.
- Tool loop iterations have a hard cap.
- Client response requires schema validation.
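
The tool-loop cap can be sketched as a bounded loop that fails with a
structured code instead of iterating indefinitely. `run_tool_loop`,
`ToolLoopLimitExceeded`, and the `failure_code` value are hypothetical names.

```python
from typing import Callable


class ToolLoopLimitExceeded(Exception):
    failure_code = "TOOL_LOOP_LIMIT_EXCEEDED"  # hypothetical structured code


def run_tool_loop(step: Callable[[int], bool], max_iterations: int) -> int:
    """Call step(i) until it reports completion; return iterations used."""
    for iteration in range(max_iterations):
        if step(iteration):
            return iteration + 1
    # The cap was reached without a final answer: fail the run explicitly.
    raise ToolLoopLimitExceeded(f"no result after {max_iterations} iterations")
```

The runtime would map the exception to the `FAILED` state and copy
`failure_code` into the run's failure details.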

### C6 - Communication / Channel Event Contract

Communication Hub normalizes channels without embedding AI logic.

Inbound `ConversationEvent` minimum fields:

- `event_id`
- `trace_id`
- `project_id`
- `environment_id`
- channel metadata
- sender metadata
- routing metadata
- payload
- security metadata
- delivery metadata

Outbound message minimum fields:

- `outbound_id`
- `trace_id`
- `run_id`
- `project_id`
- channel target
- message payload
- policy
- delivery state

Invariants:

- Adapter does not call LLM or MCP.
- Raw payload is saved for short audit retention.
- Outbound messages pass ACL, redaction, and rate limit.
- Bot token resolution stays inside adapter runtime.
- Retry is idempotent.
- Escaping/formatting is adapter responsibility.

## Implementation Order

### Phase 0 - Documentation and Contract Freeze

- This ADR.
- Follow-up schema documents or ADRs only when implementation needs field-level
  detail.
- No runtime behavior change.

### Phase 1 - Isolation Foundation

- Add `project_id` to Redis keys, sessions, budget ledgers, dispatch logs, MCP
  audit snapshots, and approval records where applicable.
- Define `platform_resource` exceptions separately from tenant resources.
- Preserve existing AWOOOI behavior under `project_id=awoooi`.
- Follow ADR-107: PostgreSQL is the source of truth for AwoooP control-plane
  contracts; Redis is cache/watch only; CRDs are a future runtime projection.

### Phase 2 - Platform Shell

- Add platform APIs and data models around existing logic.
- Implement envelope validation, trace propagation, audit, and effective policy
  calculation.
- Keep existing AWOOOI and EwoooC implementations behind adapters.

### Phase 3 - Shadow Mode

- Mirror selected events into AwoooP.
- Compare platform decisions with legacy decisions.
- Do not send user-visible responses or execute side effects.
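
Comparing platform and legacy decisions can be sketched as a field-by-field
diff that is written to audit and never acted on. `ShadowComparison` and
`compare_decisions` are hypothetical names, and decisions are assumed to be
flat dictionaries.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ShadowComparison:
    trace_id: str
    matched: bool
    differing_fields: Tuple[str, ...]


def compare_decisions(trace_id: str, legacy: dict, platform: dict) -> ShadowComparison:
    """Record which decision fields differ; produces no side effects."""
    keys = sorted(set(legacy) | set(platform))
    differing = tuple(k for k in keys if legacy.get(k) != platform.get(k))
    return ShadowComparison(trace_id, not differing, differing)
```

Aggregating `matched` per tenant and capability gives a concrete agreement
rate to use as cutover evidence.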

### Phase 4 - Read-Only and Suggest Cutover

- Move low-risk chat and read-only analysis first.
- Add EwoooC business-agent traffic as the first downstream tenant validation.
- Keep remediation and write tools in legacy path until Gateway evidence is
  sufficient.

### Phase 5 - Controlled Active Runtime

- Move write/execute operations only after:
  - Gateway authorization is enforced.
  - Approval resume is proven.
  - Trace and audit can replay a complete run.
  - Budget hard stop has live evidence.
  - Production rollback path is documented.

## Consequences

### Benefits

- A single agent improvement can apply to all projects or to one specialization.
- Product-specific customization happens through policy, permissions, and
  context scope rather than copy-pasted prompts.
- Tool credentials remain outside agent context.
- Cross-project data leakage is blocked by data/Gateway filters.
- Telegram, LINE, Slack, Email, and API can share the same agent runtime.
- Long-running agent work becomes observable, resumable, and replayable.

### Costs

- More platform tables and contracts must exist before large runtime changes.
- Early delivery focuses on audit and shadow evidence rather than visible
  feature changes.
- Each downstream project needs explicit tenant onboarding.
- Provider/model policy changes become governed artifacts, not quick env edits.

### Risks

- Over-building before live traffic proves value.
- Contract sprawl if every detail becomes a new ADR.
- Legacy paths remaining forever if strangler milestones lack deadlines.
- Confusing `OpenClaw` brand identity if `openclaw-core`, `openclaw-sre`, and
  `openclaw-biz` are not documented clearly.

### Mitigations

- Use this ADR as the index and create detailed schema docs only as needed.
- Use ADR-107 for control-plane storage decisions before creating migrations or
  CRDs.
- Require `project_migration_state` milestones per tenant.
- Keep AWOOOI on the same tenant path as all other products.
- Treat shadow/canary evidence as the gate for every cutover.
- Keep cost-changing provider behavior behind explicit approval.

## Non-Goals

- This ADR does not deploy new providers or increase paid model usage.
- This ADR does not move Telegram/LINE/Slack webhooks.
- This ADR does not authorize destructive MCP tools.
- This ADR does not replace ADR-105 Agent Loop governance; it generalizes the
  platform boundary above it.
- This ADR does not require Temporal or a specific workflow engine in v1.

## Acceptance Criteria

- Six contract baselines are captured in this ADR.
- AwoooP is named as the platform product and AWOOOI is explicitly defined as
  first tenant, not the whole platform.
- MCP is Gateway-governed, not direct agent access.
- Runtime migration is shadow/canary/active, not big-bang.
- Cost and privacy hard stops are preserved.
- LOGBOOK records this architecture decision.

## References

- `docs/12-agent-game-rules.md`
- `docs/HARD_RULES.md`
- `docs/LOGBOOK.md`
- `docs/adr/ADR-080-ai-autonomy-flywheel-overview.md`
- `docs/adr/ADR-095-12agent-sdk-integration.md`
- `docs/adr/ADR-100-ai-autonomous-slo.md`
- `docs/adr/ADR-105-mcp-agent-loop-governance.md`
- `docs/adr/ADR-107-awooop-control-plane-storage.md`
- `docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md`