awoooi/docs/adr/ADR-106-agent-platform-architecture.md
Your Name 13e51802fe feat(awooop): Phase 0 full ADR set + Phase 1 control plane schema (including four critic fixes)
## Phase 0 (documentation layer, all Accepted)
- ADR-106/107: AwoooP platform architecture + storage strategy
- ADR-111~118: Bootstrap → RLS, seven core ADRs
- ADR-119~124: SAGA → Singleton Decomposition, six ADRs
- ADR-UI-01~04: Operator Console, four UI ADRs

## Phase 1 (DB schema + migration)
- awooop_phase1_control_plane_2026-05-04.sql: 7 new tables + trigger + RLS
  - Step 1: three roles (platform_admin/migration with BYPASSRLS, awooop_app subject to RLS)
  - Step 13: GRANT least privileges to awooop_app (7 statements)
  - Step 14: RLS fail-closed, remove the __platform__ backdoor
- awooop_phase1_batch1_rls_2026-05-04.sql: three-step ADD COLUMN for the four high-traffic tables
- awooop_phase1_batch1_backfill.py: batched backfill script using SKIP LOCKED
- awooop_models.py: 7 SQLAlchemy 2.x models

## Critic fixes (4 Critical + 3 Major)
- C-1: ADD CONSTRAINT IF NOT EXISTS → DO block + pg_constraint lookup
- C-2: __mapper_args__ string list → primary_key=True on mapped_column
- C-3: __platform__ RLS backdoor → removed entirely, replaced by a BYPASSRLS role
- C-4: awooop_app role was never created → Step 1 + 7 GRANT statements
- M-1: active_pointer_guard SECURITY DEFINER (cross-tenant protection under FORCE RLS)
- M-2: idempotency guard added to pg_partman create_parent
- M-3: immutability trigger now also protects identity columns (project_id/family/contract_id)

## Task 1.2 fixes
- agent_loader.py: hard-coded Mac path → AGENTS_DIR environment variable
- Dockerfile: add the missing COPY .claude/agents/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 13:37:11 +08:00


# ADR-106: AwoooP Agent Platform Architecture and Migration Strategy
**Status**: Accepted
**Date**: 2026-05-01
**Scope**: Multi-tenant Agent Platform, Agent contracts, MCP Gateway, runtime state, channel events, migration strategy
## Context
AWOOOI currently contains the strongest AI automation implementation in the
ecosystem: OpenClaw/NemoTron/Hermes/ElephantAlpha roles, Agent Loop
foundations, MCP providers, Telegram workflows, model routing, cost guards,
Playbook learning, and operational audit trails.
Other products already have adjacent AI or messaging surfaces:
- EwoooC / MOMO PRO has business analysis bots, ElephantAlpha orchestration,
market-intelligence tools, LINE/Telegram/Email notification paths, and local
AI provider selection.
- Tsenyang already has Telegram webhook capability.
- Bitan and future products need repeatable AI onboarding without copying
AWOOOI internals.
The old choice set was too narrow:
- A pure centralized HTTP hub solves governance but creates a single point of
failure and makes AWOOOI look like every product's private brain.
- A shared SDK reduces duplicated code but cannot solve tenant isolation,
identity, budget, channel, MCP credential, or cross-project audit problems.
- A light configuration plane helps routing drift but still leaves tool use,
session state, and channel handling scattered across projects.
The approved direction is a fourth path:
> Build **AwoooP**, a multi-tenant Agent Platform. AWOOOI is the first and
> largest tenant and first runtime host, not the platform boundary itself.
This ADR records the architecture and migration strategy only. It does not
authorize runtime code changes, provider cost changes, destructive operations,
or channel cutovers.
## Numbering Note
`ADR-105-revert-a2-ollama-primary.md` previously reserved `ADR-106` as a
placeholder for model-catalog cleanup. No checked-in `ADR-106*.md` existed.
This ADR consumes the `ADR-106` number for the broader Agent Platform decision.
The model-catalog and dynamic-routing debt is folded into the Policy / Routing
contract below and should be implemented under this platform roadmap or a later
non-conflicting ADR number.
## Decision
### D0 - Name the Platform AwoooP
The platform product name is **AwoooP**.
Naming rules:
- Product name: `AwoooP`
- Repository/package slug: `awooop`
- Project/tenant id for the existing AWOOOI product remains `awoooi`
- `AwoooP` is the platform boundary; `AWOOOI` is a tenant and initial runtime
host
Do not create empty project directories just for the name. Create `awooop`
runtime/package directories only when a concrete implementation phase owns
code, schemas, clients, or workers.
Recommended future layout when implementation begins:
| Purpose | Path |
|---|---|
| Shared contract schemas | `packages/awooop-contracts/` |
| Python/TS client SDK | `packages/awooop-client/` |
| Platform API/runtime shell | `apps/awooop-runtime/` |
| Async run workers | `apps/awooop-worker/` |
| Detailed schema docs | `docs/awooop/` |
### D1 - Adopt AwoooP as a Six-Plane Platform
AwoooP is defined as six cooperating planes:
| Plane | Responsibility | Must not do |
|---|---|---|
| Project / Tenant | Tenant identity, isolation, budget, channel allowance, ACL, migration mode | Store agent prompt details |
| Agent | Agent identity, version, role, I/O contract, safety ceiling, context domain | Decide concrete model/provider credentials |
| MCP Gateway | Tool authorization, credential resolution, rate limits, approval, result sanitization, audit | Expose raw credentials to agents |
| Policy / Routing | Effective model/provider route, fallback, privacy ceiling, budget gate, generation defaults | Bypass tenant hard stops |
| Runtime / Run State | Run lifecycle, async state machine, shadow/canary/active mode, checkpoint/resume | Treat long tasks as simple HTTP request-response |
| Communication / Channel Event | Telegram/LINE/Slack/Email/API receive, verify, normalize, send | Run LLM inference or call MCP directly |
### D2 - Require Platform Envelope Fields
Every platform invocation must carry the following envelope fields:
- `project_id`
- `environment_id` when environment-specific resources are involved
- `agent_id`
- `agent_version` after resolution
- `session_id`
- `trace_id`
- `run_id` for runtime-managed work
- `policy_version` after effective policy resolution
Missing `project_id`, `trace_id`, or `run_id` in runtime or MCP paths is a hard
reject, not a warning.
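
The hard-reject rule can be illustrated with a minimal sketch. The function name `validate_envelope` and the exception `EnvelopeRejected` are hypothetical and not part of any existing AwoooP code; only the three field names come from this ADR.

```python
from typing import Mapping

# Fields whose absence is a hard reject on runtime and MCP paths (per D2).
HARD_REQUIRED = ("project_id", "trace_id", "run_id")


class EnvelopeRejected(Exception):
    """Hypothetical error raised when a hard-required field is missing."""


def validate_envelope(envelope: Mapping[str, str]) -> None:
    """Raise instead of logging a warning when a required field is absent or blank."""
    missing = [name for name in HARD_REQUIRED if not envelope.get(name)]
    if missing:
        raise EnvelopeRejected("missing envelope fields: " + ", ".join(missing))
```

Raising an exception rather than returning a flag keeps callers from accidentally ignoring the result, which matches the "reject, not a warning" intent.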
### D3 - Treat Project as the Smallest Isolation Unit
`project_id` is the platform tenant boundary. All Redis keys, session stores,
RAG namespaces, KM queries, MCP tool scopes, budget ledgers, ACL checks, and
channel routing must be project-scoped unless a resource is explicitly declared
as `platform_resource`.
`global:*` resources are platform resources, not AWOOOI resources.
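
One way to make project scoping hard to forget is to build every key through a helper. The sketch below is illustrative only: the `project:<id>:...` layout and both function names are assumptions, while the `global:` prefix for platform resources follows the statement above.

```python
def scoped_key(project_id: str, *parts: str) -> str:
    """Build a tenant-scoped key; the 'project:' prefix is an assumed layout."""
    if not project_id:
        raise ValueError("project_id is required for tenant-scoped keys")
    return ":".join(("project", project_id) + parts)


def platform_key(*parts: str) -> str:
    """Build a key for a resource explicitly declared as platform_resource."""
    return ":".join(("global",) + parts)
```

For example, `scoped_key("awoooi", "session", "s-1")` yields `project:awoooi:session:s-1`, while `platform_key("model_catalog")` yields `global:model_catalog`.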
### D4 - Define Agents as Platform Capability Modules
Agents are not product-local prompt strings. Agents are versioned platform
capabilities with:
- `agent_id`
- `base_agent_ref`
- immutable published version
- role and capability tags
- privacy ceiling
- payload input schema
- LLM output schema
- prompt artifact reference and hash
- context domain constraints
- MCP requirement declarations
- execution profile
- eval and lifecycle governance
Specialized agents must inherit from base agents without loosening safety:
- `openclaw-core`
- `openclaw-sre`
- `openclaw-biz`
- future `openclaw-pharmacy`
- future `openclaw-marketing`
Published agent artifacts are immutable. Prompt, schema, or contract changes
must publish a new version.
### D5 - Use MCP Gateway Instead of Direct Tool Calls
Agents never call MCP servers directly and never see raw credentials. Every
tool call goes through MCP Gateway.
Gateway authorization is the intersection of:
`Project grant AND Agent requirement AND Tool contract AND Environment boundary AND Approval state`
Tool results entering model context must be sanitized first. Tool calls must be
traceable through `trace_id` and `run_id`.
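
The intersection rule reads naturally as a conjunction of five independent checks. The following is a sketch under the assumption that each check has already been evaluated to a boolean; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthorizationInputs:
    project_grant: bool         # the project was granted this tool
    agent_requirement: bool     # the agent contract declares the tool
    tool_contract: bool         # the tool contract permits the requested verb
    environment_boundary: bool  # the call stays inside the allowed environment
    approval_state: bool        # approval is granted or not required


def is_authorized(inputs: AuthorizationInputs) -> bool:
    """All five checks must pass; any single failure denies the call."""
    return all((
        inputs.project_grant,
        inputs.agent_requirement,
        inputs.tool_contract,
        inputs.environment_boundary,
        inputs.approval_state,
    ))
```

Because the result is a pure AND, no single layer can widen access on its own; each layer can only narrow it.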
### D6 - Resolve EffectivePolicy Before Every Model Call
Provider and model selection is not embedded inside Agent Contract. Runtime must
derive an `EffectivePolicy` before each model call from:
- platform policy
- agent safety ceiling
- project policy and budget
- environment constraints
- channel constraints
- intent and complexity
- current provider health
Budget hard stop overrides all fallback behavior. Cloud fallback must include a
reason and be written to trace/audit. Any policy change that increases cost,
enables a new paid provider, or raises token ceilings still requires explicit
human approval under `docs/HARD_RULES.md`.
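
A small sketch of how a budget hard stop can take precedence over fallback. It assumes one possible reading of the rule: paid routes become ineligible once the budget is exhausted, and the call is rejected if no free healthy route remains. The route names, `choose_route`, and `BudgetExceeded` are hypothetical.

```python
from typing import Optional, Sequence, Set, Tuple


class BudgetExceeded(Exception):
    """Hypothetical error mapping to the BUDGET_EXCEEDED rejection."""


def choose_route(
    routes: Sequence[str],
    healthy: Set[str],
    paid: Set[str],
    budget_exhausted: bool,
) -> Tuple[str, Optional[str]]:
    """Return (route, fallback_reason); the reason is None for the primary route."""
    blocked_by_budget = False
    for index, route in enumerate(routes):
        if route not in healthy:
            continue
        if budget_exhausted and route in paid:
            # Hard stop: a paid route is never used as a fallback.
            blocked_by_budget = True
            continue
        reason = None if index == 0 else "fallback_from:" + routes[0]
        return route, reason
    if blocked_by_budget:
        raise BudgetExceeded("BUDGET_EXCEEDED")
    raise RuntimeError("no healthy route available")
```

The returned reason string is what a caller would write to trace and audit whenever a non-primary route is selected.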
### D7 - Model Runtime as Durable Runs
An agent invocation is a `Run`, not just an HTTP request.
Required runtime states:
```text
CREATED -> POLICY_RESOLVED -> QUEUED -> RUNNING
RUNNING -> WAITING_TOOL -> RESUMED -> RUNNING
RUNNING -> WAITING_APPROVAL -> RESUMED -> RUNNING
RUNNING -> COMPLETED
any non-terminal -> FAILED
any non-terminal -> CANCELLED
```
Terminal states:
- `COMPLETED`
- `FAILED`
- `CANCELLED`
`WAITING_APPROVAL` must be resumable without replaying the whole task. `FAILED`
must include a structured `failure_code`.
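
The transitions listed above can be encoded as a lookup table so that illegal moves are rejected in one place. This is a sketch, not the runtime's actual implementation; `can_transition` and `FORWARD` are hypothetical names, and the state names are taken from the diagram.

```python
from enum import Enum


class RunState(str, Enum):
    CREATED = "CREATED"
    POLICY_RESOLVED = "POLICY_RESOLVED"
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    WAITING_TOOL = "WAITING_TOOL"
    WAITING_APPROVAL = "WAITING_APPROVAL"
    RESUMED = "RESUMED"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"


TERMINAL = frozenset({RunState.COMPLETED, RunState.FAILED, RunState.CANCELLED})

# Forward edges from the diagram; FAILED and CANCELLED are handled separately.
FORWARD = {
    RunState.CREATED: frozenset({RunState.POLICY_RESOLVED}),
    RunState.POLICY_RESOLVED: frozenset({RunState.QUEUED}),
    RunState.QUEUED: frozenset({RunState.RUNNING}),
    RunState.RUNNING: frozenset({
        RunState.WAITING_TOOL, RunState.WAITING_APPROVAL, RunState.COMPLETED,
    }),
    RunState.WAITING_TOOL: frozenset({RunState.RESUMED}),
    RunState.WAITING_APPROVAL: frozenset({RunState.RESUMED}),
    RunState.RESUMED: frozenset({RunState.RUNNING}),
}


def can_transition(current: RunState, target: RunState) -> bool:
    """Terminal states never move; any non-terminal state may fail or be cancelled."""
    if current in TERMINAL:
        return False
    if target in (RunState.FAILED, RunState.CANCELLED):
        return True
    return target in FORWARD.get(current, frozenset())
```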
### D8 - Keep Channel Adapters Thin
Telegram, LINE, Slack, Email, and API adapters only perform:
1. receive
2. verify
3. normalize to `ConversationEvent`
4. send `OutboundMessage`
Channel adapters must not call LLMs, call MCP tools, decide approval, or embed
business logic. Channel escaping, token resolution, delivery retry, and
provider message IDs remain adapter responsibilities.
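
As an illustration of how thin the normalize step can be, the sketch below maps a Telegram-style update into a reduced `ConversationEvent`. The dataclass carries only a subset of the fields listed in C6, and its flat shape, the `telegram:` event id prefix, and the function name are assumptions made for this example.

```python
import uuid
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class ConversationEvent:
    event_id: str
    trace_id: str
    project_id: str
    channel_type: str
    sender_id: str
    text: str


def normalize_telegram(update: Mapping[str, Any], project_id: str) -> ConversationEvent:
    """Translate the payload shape only: no LLM call, no MCP call, no business logic."""
    message = update["message"]
    return ConversationEvent(
        event_id="telegram:" + str(update["update_id"]),
        trace_id=uuid.uuid4().hex,  # a new trace starts at the channel edge
        project_id=project_id,
        channel_type="telegram",
        sender_id=str(message["from"]["id"]),
        text=message.get("text", ""),
    )
```

Verification (for example, checking the webhook secret) would run before this function and is omitted here.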
### D9 - Migrate by Strangler Fig
No production path should be replaced in one cutover. Migration must progress by
tenant and capability:
1. `shadow`: platform receives mirrored events, writes audit/trace only, no user
response and no external side effects.
2. `canary`: platform can respond to selected low-risk traffic, side effects
disabled by default.
3. `read_only`: read-only queries and business chat move first.
4. `suggest`: analysis and recommendations move next, approval still external.
5. `auto_remediate`: write/execute tools move only after Gateway, approval,
replay, and audit evidence are green.
AWOOOI must also become a tenant (`project_id=awoooi`) instead of keeping a
privileged private path forever.
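
The five stages can be read as a capability table that the runtime consults before responding or executing a tool. The sketch below assumes default settings only (canary side effects off) and fails closed for unknown modes; `ModeCapabilities` and both helper functions are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeCapabilities:
    user_response: bool  # may the platform answer the user?
    side_effects: bool   # may write/execute tools run?


MODES = {
    "shadow": ModeCapabilities(user_response=False, side_effects=False),
    "canary": ModeCapabilities(user_response=True, side_effects=False),
    "read_only": ModeCapabilities(user_response=True, side_effects=False),
    "suggest": ModeCapabilities(user_response=True, side_effects=False),
    "auto_remediate": ModeCapabilities(user_response=True, side_effects=True),
}


def may_respond(mode: str) -> bool:
    """Unknown modes fail closed."""
    caps = MODES.get(mode)
    return caps is not None and caps.user_response


def may_execute_side_effect(mode: str) -> bool:
    """Unknown modes fail closed."""
    caps = MODES.get(mode)
    return caps is not None and caps.side_effects
```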
## Contract Baselines
### C1 - Project / Tenant Contract
Project is the smallest isolation unit.
Minimum contract fields:
- `project_id`
- `display_name`
- `status`
- `environments`
- `data_boundary`
- `platform_resources`
- `budget`
- `rate_limits`
- `allowed_agents`
- `allowed_channels`
- `approval_gates`
- `status_policy`
Invariants:
- `project_id` is immutable after creation.
- Data boundary enforcement happens at data/Gateway APIs, not in prompt text.
- Budget hard stop rejects cloud model calls with `BUDGET_EXCEEDED`.
- Agent calls outside `allowed_agents` return `AGENT_NOT_PERMITTED`.
- Project overrides may tighten permissions, never loosen them.
- Migration state lives in `project_migration_state`, not the stable project
contract record.
- `suspended` behavior must explicitly define channel, model, and MCP access.
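
The "tighten, never loosen" invariant can be enforced structurally by intersecting an override with the base grant. A minimal sketch, with hypothetical function names and the `AGENT_NOT_PERMITTED` code taken from the invariants above:

```python
from typing import FrozenSet, Optional


def effective_allowed_agents(
    base: FrozenSet[str], override: Optional[FrozenSet[str]]
) -> FrozenSet[str]:
    """An override can only remove agents; entries outside the base are ignored."""
    if override is None:
        return base
    return base & override


def check_agent(agent_id: str, allowed: FrozenSet[str]) -> Optional[str]:
    """Return an error code, or None when the call is permitted."""
    return None if agent_id in allowed else "AGENT_NOT_PERMITTED"
```

Because the result is an intersection, an override that names an agent missing from the base grant has no effect.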
### C2 - Agent Contract
Agent Contract defines who the agent is, what it can do, what its interfaces
are, and what its highest safety boundary is.
Minimum contract fields:
- `agent_id`
- `base_agent_ref`
- `version`
- `lifecycle_status`
- `role`
- `capability_tags`
- `risk_class`
- `privacy_ceiling`
- `artifact_refs`
- `interface`
- `context_policy`
- `mcp_requirements`
- `execution_profile`
- `governance`
Invariants:
- Published versions are immutable.
- `base_agent_ref` must include a version range and resolve to an audited exact
version at runtime.
- Prompt and schema artifacts require hashes.
- Agent requirements are not permissions; Gateway decides actual tool access.
- Agent output is LLM payload only. Runtime attaches trace, cost, policy, and
validation metadata.
- `requires_approval` is calculated by runtime from project, agent, policy, and
Gateway rules, not trusted from LLM output alone.
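
The hash requirement for prompt and schema artifacts can be sketched with the standard library. SHA-256 and the `sha256:` prefix are assumptions; this ADR does not fix an algorithm or an encoding.

```python
import hashlib
import hmac


def artifact_hash(content: bytes) -> str:
    """Hash an artifact; SHA-256 with a 'sha256:' prefix is an assumed convention."""
    return "sha256:" + hashlib.sha256(content).hexdigest()


def verify_artifact(content: bytes, expected: str) -> bool:
    """Compare against the hash recorded in the published contract."""
    return hmac.compare_digest(artifact_hash(content), expected)
```

Any edit to the prompt bytes changes the hash, so a published version cannot be altered silently; the change has to ship as a new version.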
### C3 - MCP Gateway Contract
MCP Gateway is the security boundary between agents and external systems.
Minimum contract fields:
- `tool_id`
- `domain`
- `resource`
- `verbs`
- `owner_project_id`
- `tenancy_scope`
- `side_effect_level`
- `data_sensitivity`
- `server`
- `schemas`
- `credential_policy`
- `authorization_policy`
- `approval_policy`
- `rate_limits`
- `result_policy`
- `audit_policy`
Tool call envelope requires:
- `trace_id`
- `run_id`
- `project_id`
- `environment_id`
- `agent_id`
- `agent_version`
- `session_id`
- `tool_id`
- `verb`
- `idempotency_key`
- `payload`
Invariants:
- Agents never see raw credentials.
- Missing `project_id`, `trace_id`, or `run_id` is rejected.
- Project grant, agent declaration, and context boundary must all pass.
- Results entering context are sanitized.
- Write, execute, and destructive operations require approval unless explicitly
allowed by all relevant contracts.
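
The approval invariant can be expressed as a default-deny check. In this sketch the level name `read` is an assumption (the ADR names only write, execute, and destructive), and `explicit_allows` stands for one boolean per relevant contract.

```python
from typing import Iterable


def requires_approval(side_effect_level: str, explicit_allows: Iterable[bool]) -> bool:
    """Anything other than 'read' needs approval unless every contract allows it."""
    if side_effect_level == "read":
        return False
    allows = list(explicit_allows)
    # An empty list means no contract opted in, so approval is still required.
    return not (allows and all(allows))
```

Unknown side-effect levels fall through to the approval path, which keeps the check fail-closed.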
### C4 - Policy / Routing Contract
Policy / Routing resolves the model and execution policy for a single call.
Minimum contract fields:
- `policy_id`
- `version`
- `lifecycle_status`
- `scope`
- `match`
- `constraints`
- `route_plan`
- `generation_defaults`
- `budget_guard`
- `approval_guard`
- `observability`
`EffectivePolicy` must include:
- `trace_id`
- `project_id`
- `agent_id`
- `agent_version`
- `policy_version`
- allow/deny decision
- provider/model refs
- fallback chain
- max tokens
- privacy mode
- budget state
- approval requirement
- matched policy layers
- decision notes
Invariants:
- Agent privacy ceiling beats project override.
- Project/global budget hard stop beats fallback.
- Providers/models are referenced through catalog refs, not hardcoded into
agent contracts.
- Policy merge uses strictest-wins semantics.
- Effective policy must be replayable from versioned inputs.
- Schema retry has a hard cap.
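
Strictest-wins merging can be sketched over three representative fields. The privacy mode names and their ordering are hypothetical, as is the flat dictionary shape of a policy layer.

```python
from typing import Any, Dict, Sequence

# Hypothetical privacy modes, ordered from least to most strict.
PRIVACY_ORDER = ("open", "restricted", "local_only")


def merge_strictest(layers: Sequence[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge policy layers so that no layer can loosen another."""
    if not layers:
        raise ValueError("at least one policy layer is required")
    return {
        "allow": all(layer["allow"] for layer in layers),
        "max_tokens": min(layer["max_tokens"] for layer in layers),
        "privacy_mode": max(
            (layer["privacy_mode"] for layer in layers),
            key=PRIVACY_ORDER.index,
        ),
    }
```

Since the merge depends only on its inputs, replaying it against the same versioned layers reproduces the same effective policy.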
### C5 - Runtime / Run State Contract
Runtime owns durable execution state.
Minimum run fields:
- `run_id`
- `trace_id`
- `project_id`
- `environment_id`
- `agent_id`
- `agent_version`
- `session_id`
- `channel_type`
- `mode`
- `execution_type`
- `state`
- input refs
- effective policy ref
- checkpoint
- approval
- result
- audit timestamps and failure details
Invariants:
- Shadow mode has no external side effects and no user-visible response.
- Canary mode has no side effects by default.
- Every state transition writes audit.
- `WAITING_APPROVAL` is resumable.
- `CANCELLED` cannot resume.
- Tool loop iterations have a hard cap.
- Client response requires schema validation.
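
The tool-loop cap can be as simple as a bounded `for` loop. In this sketch `step` is a hypothetical callable that returns `True` once the agent has finished, and the default of 8 iterations is an arbitrary placeholder rather than a value from this ADR.

```python
from typing import Callable


class ToolLoopLimitExceeded(Exception):
    """Hypothetical error; the runtime would map it to a structured failure_code."""


def run_tool_loop(step: Callable[[int], bool], max_iterations: int = 8) -> int:
    """Run step until it reports completion; return the number of iterations used."""
    for iteration in range(max_iterations):
        if step(iteration):
            return iteration + 1
    raise ToolLoopLimitExceeded("tool loop exceeded %d iterations" % max_iterations)
```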
### C6 - Communication / Channel Event Contract
Communication Hub normalizes channels without embedding AI logic.
Inbound `ConversationEvent` minimum fields:
- `event_id`
- `trace_id`
- `project_id`
- `environment_id`
- channel metadata
- sender metadata
- routing metadata
- payload
- security metadata
- delivery metadata
Outbound message minimum fields:
- `outbound_id`
- `trace_id`
- `run_id`
- `project_id`
- channel target
- message payload
- policy
- delivery state
Invariants:
- Adapter does not call LLM or MCP.
- Raw payload is saved for short audit retention.
- Outbound messages pass ACL, redaction, and rate limit.
- Bot token resolution stays inside adapter runtime.
- Retry is idempotent.
- Escaping/formatting is adapter responsibility.
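
Idempotent retry can be sketched by keying delivery on `outbound_id`. The in-memory dictionary below is a stand-in for whatever durable store the adapter runtime would use, and `OutboundSender` is a hypothetical name.

```python
from typing import Callable, Dict


class OutboundSender:
    """Deliver each outbound_id at most once; repeats return the stored provider id."""

    def __init__(self, deliver: Callable[[str, str], str]) -> None:
        self._deliver = deliver               # performs the real channel send
        self._delivered: Dict[str, str] = {}  # outbound_id -> provider message id

    def send(self, outbound_id: str, target: str, text: str) -> str:
        if outbound_id in self._delivered:
            # Retry of an already delivered message: do not send again.
            return self._delivered[outbound_id]
        provider_message_id = self._deliver(target, text)
        self._delivered[outbound_id] = provider_message_id
        return provider_message_id
```

If `deliver` raises, nothing is recorded, so a later retry with the same `outbound_id` attempts delivery again.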
## Implementation Order
### Phase 0 - Documentation and Contract Freeze
- This ADR.
- Follow-up schema documents or ADRs only when implementation needs field-level
detail.
- No runtime behavior change.
### Phase 1 - Isolation Foundation
- Add `project_id` to Redis keys, sessions, budget ledgers, dispatch logs, MCP
audit snapshots, and approval records where applicable.
- Define `platform_resource` exceptions separately from tenant resources.
- Preserve existing AWOOOI behavior under `project_id=awoooi`.
- Follow ADR-107: PostgreSQL is the source of truth for AwoooP control-plane
contracts; Redis is cache/watch only; CRDs are future runtime projection.
### Phase 2 - Platform Shell
- Add platform APIs and data models around existing logic.
- Implement envelope validation, trace propagation, audit, and effective policy
calculation.
- Keep existing AWOOOI and EwoooC implementations behind adapters.
### Phase 3 - Shadow Mode
- Mirror selected events into AwoooP.
- Compare platform decisions with legacy decisions.
- Do not send user-visible responses or execute side effects.
### Phase 4 - Read-Only and Suggest Cutover
- Move low-risk chat and read-only analysis first.
- Add EwoooC business-agent traffic as the first downstream tenant validation.
- Keep remediation and write tools in legacy path until Gateway evidence is
sufficient.
### Phase 5 - Controlled Active Runtime
- Move write/execute operations only after:
  - Gateway authorization is enforced.
  - Approval resume is proven.
  - Trace and audit can replay a complete run.
  - Budget hard stop has live evidence.
  - Production rollback path is documented.
## Consequences
### Benefits
- A single agent improvement can apply to all projects or to one specialization.
- Product-specific customization happens through policy, permissions, and
context scope rather than copy-pasted prompts.
- Tool credentials remain outside agent context.
- Cross-project data leakage is blocked by data/Gateway filters.
- Telegram, LINE, Slack, Email, and API can share the same agent runtime.
- Long-running agent work becomes observable, resumable, and replayable.
### Costs
- More platform tables and contracts must exist before large runtime changes.
- Early delivery focuses on audit and shadow evidence rather than visible
feature changes.
- Each downstream project needs explicit tenant onboarding.
- Provider/model policy changes become governed artifacts, not quick env edits.
### Risks
- Over-building before live traffic proves value.
- Contract sprawl if every detail becomes a new ADR.
- Legacy paths remaining forever if strangler milestones lack deadlines.
- Confusing `OpenClaw` brand identity if `openclaw-core`, `openclaw-sre`, and
`openclaw-biz` are not documented clearly.
### Mitigations
- Use this ADR as the index and create detailed schema docs only as needed.
- Use ADR-107 for control-plane storage decisions before creating migrations or
CRDs.
- Require `project_migration_state` milestones per tenant.
- Keep AWOOOI on the same tenant path as all other products.
- Treat shadow/canary evidence as the gate for every cutover.
- Keep cost-changing provider behavior behind explicit approval.
## Non-Goals
- This ADR does not deploy new providers or increase paid model usage.
- This ADR does not move Telegram/LINE/Slack webhooks.
- This ADR does not authorize destructive MCP tools.
- This ADR does not replace ADR-105 Agent Loop governance; it generalizes the
platform boundary above it.
- This ADR does not require Temporal or a specific workflow engine in v1.
## Acceptance Criteria
- Six contract baselines are captured in this ADR.
- AwoooP is named as the platform product and AWOOOI is explicitly defined as
first tenant, not the whole platform.
- MCP is Gateway-governed, not direct agent access.
- Runtime migration is shadow/canary/active, not big-bang.
- Cost and privacy hard stops are preserved.
- LOGBOOK records this architecture decision.
## References
- `docs/12-agent-game-rules.md`
- `docs/HARD_RULES.md`
- `docs/LOGBOOK.md`
- `docs/adr/ADR-080-ai-autonomy-flywheel-overview.md`
- `docs/adr/ADR-095-12agent-sdk-integration.md`
- `docs/adr/ADR-100-ai-autonomous-slo.md`
- `docs/adr/ADR-105-mcp-agent-loop-governance.md`
- `docs/adr/ADR-107-awooop-control-plane-storage.md`
- `docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md`