- **MCP Gateway**: the only execution choke point for tool calls.
- **Approval / Trust**: the only authorization state machine.
- **KM / RAG / PlayBook**: the only knowledge substrate.
- **Channel Event / Audit**: the only observability and trace stream.
- **AI Flywheel**: the engine, cognition, execution, and learning loop.
- **AwoooP**: the operator console, governance plane, audit plane, and human sign-off plane.
If AI automation and AwoooP are implemented as separate tracks, the system will drift into duplicate approval state machines, duplicate audit flows, duplicate routing gates, duplicate MCP and Channel decisions, and frontend states that do not match the backend execution truth.
## 2. Shared Target Loop
The target loop is:
```text
monitoring / alerting
-> event classification
-> rule match or rule creation
-> PlayBook / KM / RAG retrieval
-> AI decision
-> Approval / Trust gate
-> MCP Gateway execution
-> audit and trace
-> execution verification
-> KM / rule / PlayBook feedback
```
AwoooP surfaces and governs that loop. It does not create an independent execution lane.
## 3. Architectural Invariants
1. **One execution gate**: production MCP calls must pass through MCP Gateway. Direct provider calls are compatibility debt and must become `forbid-new`.
2. **One approval state machine**: AwoooP approvals, AWOOOI approval records, TrustEngine, and Telegram approval buttons must converge to one signed, auditable flow.
3. **One audit stream**: Channel events, runtime state, MCP audit, model routing, and approval decisions must be joinable by `project_id`, `run_id`, and `trace_id`.
4. **One knowledge substrate**: KM, RAG, and PlayBook embeddings must use consistent dimensions and model naming. Test fixtures must match production schema.
5. **One routing control plane**: provider/model fallback must resolve through EffectivePolicy or a wrapped legacy equivalent during strangler migration.
6. **No fail-open tenant boundary**: missing `project_id` must never broaden data visibility. Bootstrap exceptions must be explicit and audited.
7. **No client-only identity**: any operator identity supplied in the frontend request body is display metadata only; authorization identity must come from the authenticated principal.
8. **No channel policy decisions**: Telegram, LINE, Slack, and email adapters deliver and track messages, but do not decide model, tool, approval, or incident policy.
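
Invariants 6 and 7 can be illustrated with a minimal sketch. The names `Principal`, `scoped_rows`, `approval_record`, and `TenantBoundaryError` are hypothetical and exist only for this example; the real services may structure this differently.

```python
from dataclasses import dataclass
from typing import Optional


class TenantBoundaryError(Exception):
    """Raised when a request carries no usable project scope."""


@dataclass(frozen=True)
class Principal:
    # Hypothetical shape of an identity resolved from auth/session/token.
    subject: str
    roles: frozenset


def scoped_rows(rows: list, project_id: Optional[str]) -> list:
    """Invariant 6: a missing project_id fails closed instead of widening visibility."""
    if not project_id:
        raise TenantBoundaryError("project_id is required")
    return [r for r in rows if r.get("project_id") == project_id]


def approval_record(principal: Optional[Principal], body: dict) -> dict:
    """Invariant 7: the authorization identity comes from the principal, never the body."""
    if principal is None or "approver" not in principal.roles:
        raise PermissionError("authenticated approver required")
    return {
        "approver_id": principal.subject,          # authoritative identity
        "display_name": body.get("approver_id"),   # display metadata only
        "decision": body["decision"],
    }
```

With this shape, a body that claims `approver_id: "operator"` can change only the display label, and a null `project_id` produces an error rather than an unfiltered result set.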
### 5.1 P0: Production-broken or Security-critical
| ID | Risk | Impact | Required direction |
| --- | --- | --- | --- |
| P0-A | MCP Gateway is not yet a production choke point; callers can still reach `provider.execute()` directly. | Gateway gates, redaction, and audit can be bypassed. | Wrap all production MCP call sites; then mark direct provider access `forbid-new`. |
| P0-B | MCP blocked-call audit has gaps when `tool_row` is missing or Gate 1/2 rejects early. | Denied or suspicious calls can disappear from audit. | Audit attempt before and after gate evaluation with safe redaction. |
| P0-C | Legacy K8s tool execution still has command/shell injection risk. | Destructive command path can be polluted by LLM or user input. | Parse command structure, avoid shell, enforce operation schema and allowlists. |
| P0-D | `MCPToolResult(data=...)` mismatches dataclass fields in some success paths. | Sentry / ArgoCD or similar tools can crash on valid results. | Normalize result schema and regression-test all MCP providers. |
| P0-E | RAG/KM/PlayBook embedding dimensions remain split between 768 and 1024. | Search or backfill can silently fail or hide production drift. | Standardize on `bge-m3` 1024 dimensions and remove stale fixtures. |
| P0-F | KM backfill reconciler is missing required async imports or runtime dependencies. | Repair path that should recover embeddings may crash. | Compile and integration-test the reconciler path. |
| P0-G | Ollama routing still has direct `OLLAMA_URL` consumers. | GCP-A/GCP-B/111 ordering and Gemini fallback policy can be bypassed. | Inventory, wrap, and migrate call sites to resolver / EffectivePolicy. |
| P0-H | RLS is fail-open when `project_id` is null. | Cross-tenant data exposure risk. | Force fail-closed RLS and explicit platform bootstrap identities. |
| P0-I | Approval APIs trust frontend/body identity such as `approver_id: "operator"`. | Audit identity is not legally or operationally usable. | Decide endpoint must derive principal from auth/session/token, not body. |
| P0-J | API control plane lacks real authorization in some paths; CSRF is not authorization. | Operator APIs can be invoked without a real role boundary. | Add authenticated principal and role checks to AwoooP APIs. |
| P0-K | Alertmanager internal bypass plus `X-Forwarded-For` and broad trusted hosts may allow spoofed source identity. | Forged alert ingress can create incidents or approvals. | Require signed webhook or strict trusted proxy chain. |
| P0-L | AwoooP approval token service also depends on `aioredis`. | Python/runtime compatibility issue can break approvals. | Migrate Redis client path or wrap with a maintained async Redis interface. |
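
The direction for P0-A and P0-B (one choke point, with an audit record before and after gate evaluation) can be sketched as follows. This is an illustration under assumptions, not the gateway's actual code: `AuditLog`, `gateway_call`, the `registry` dict, and the `enabled` flag are hypothetical stand-ins, and the two checks shown only approximate the real gates.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class AuditLog:
    # In-memory stand-in for the real audit stream (assumption).
    events: list = field(default_factory=list)

    def record(self, phase: str, tool: str, trace_id: str, **extra: Any) -> None:
        self.events.append({"phase": phase, "tool": tool, "trace_id": trace_id, **extra})


def gateway_call(
    audit: AuditLog,
    registry: dict,
    tool: str,
    args: dict,
    trace_id: str,
    execute: Callable[[str, dict], Any],
) -> Any:
    # Record the attempt first so that early rejections still leave a trace.
    # Only argument keys are logged; values are treated as sensitive.
    audit.record("attempt", tool, trace_id, arg_keys=sorted(args))

    tool_row: Optional[dict] = registry.get(tool)
    if tool_row is None:
        # The missing-row case is audited explicitly instead of returning silently.
        audit.record("blocked", tool, trace_id, reason="unknown_tool")
        raise PermissionError(f"tool not registered: {tool}")
    if not tool_row.get("enabled", False):
        audit.record("blocked", tool, trace_id, reason="tool_disabled")
        raise PermissionError(f"tool disabled: {tool}")

    try:
        result = execute(tool, args)
    except Exception as exc:
        audit.record("failed", tool, trace_id, error=type(exc).__name__)
        raise
    audit.record("executed", tool, trace_id)
    return result
```

Because the `attempt` event is written before any lookup, a call that is rejected for a missing `tool_row` still produces two joinable audit events under the same `trace_id`.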
### 5.2 P1: Governance consistency and reliability
| ID | Risk | Impact | Required direction |
| --- | --- | --- | --- |
| P1-A | Approval repository / TrustEngine race conditions. | Double decision, stale decision, or missed approval. | Add row locks, idempotency keys, and single source of truth. |
| P1-B | TrustEngine uses process memory dict. | Pending approvals vanish or split across pods/restarts. | Move authoritative state to PostgreSQL; Redis only for cache/notification. |
| P1-C | NetworkPolicy contains rules whose names imply deny but semantics allow broad traffic. | False sense of isolation. | Reconcile names, policies, and live cluster behavior. |
| P1-D | Kustomize may omit service registry, HPA, VPA, backup cron, or related resources. | GitOps drift and partial deploys. | Inventory generated/live manifests and add missing resources. |
| P1-E | Hardcoded restart/SSH actions remain in alert rules. | Conflicts with AI automation direction and rule/PlayBook evidence loop. | Convert hardcoded actions into PlayBook proposals with trust gates. |
| P1-F | Startup schema mutation and scattered SQL migrations remain. | Runtime ALTER can diverge dev/prod schema. | Move to disciplined migration path and remove runtime mutation. |
| P1-G | Tests still use stale 768-dimension vectors. | Dev tests can pass while production RAG fails. | Test fixtures must mirror production vector dimensions. |
| P1-H | GCP-B can remain passive fallback only. | Active-active design is not realized. | Use GCP-B for batch, embedding, shadow, and canary lanes via policy. |
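
For P1-A and P1-B, the core idea is that a decision is applied exactly once against a single authoritative store. The sketch below uses SQLite and a conditional `UPDATE` purely so that it runs anywhere; the plan itself calls for PostgreSQL with row locks, and the `approvals` table and `decide` function are hypothetical.

```python
import sqlite3


def decide(conn: sqlite3.Connection, approval_id: str, decision: str, principal: str) -> bool:
    """Apply a decision once; return False if the approval was already decided."""
    if decision not in ("approved", "rejected"):
        raise ValueError(f"unsupported decision: {decision}")
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE approvals SET status = ?, decided_by = ? "
            "WHERE id = ? AND status = 'pending'",
            (decision, principal, approval_id),
        )
    # Exactly one row changes only when the approval was still pending.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE approvals (id TEXT PRIMARY KEY, status TEXT NOT NULL, decided_by TEXT)"
)
conn.execute("INSERT INTO approvals (id, status) VALUES ('apr-1', 'pending')")
conn.commit()

first = decide(conn, "apr-1", "approved", "alice")   # wins the transition
second = decide(conn, "apr-1", "rejected", "bob")    # stale decision, ignored
```

The second call leaves the stored state untouched, so a duplicate Telegram button press or a retry from another pod cannot overwrite an existing decision.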
### 5.3 P2: Productization and long-term maintainability
| ID | Risk | Impact | Required direction |
| --- | --- | --- | --- |
| P2-A | AwoooP frontend has incomplete i18n and inconsistent IA. | Operator experience is unprofessional and brittle. | Align UI text, sidebar, routes, and workflow labels. |
| P2-B | Operator Console lacks deep trace/journal views. | Operators cannot explain why AI made a decision. | Add run step journal, trace spans, model/tool/cost panes. |
| P2-C | Error code taxonomy is incomplete. | Frontend and channels cannot render actionable states. | Define platform error codes for auth, schema, budget, MCP, routing, approval, and RAG. |
| P2-D | Docs and runtime can drift. | Future sessions repeat the same analysis. | Keep ADR, LOGBOOK, inventory, and release checklists linked to commits. |
The following corrections must be applied when consuming the earlier 12-agent inventories:
1. C21/C22 CronJob service account and image issues appear to have been partially fixed in the worktree, but must be verified against live cluster state before closure.
2. C1 `aioredis` is not necessarily a top-level import in all affected paths; at least one Gate 5 runtime import still makes the risk valid.
3. C2 is more precise as: blocked calls can skip audit when `tool_row is None`, not only due to a database `NOT NULL` failure.
4. C12 is partially stale for the main embedding service, but `knowledge_rag_service.py` and `playbook_rag.py` still need verification for 1024-dimensional consistency.
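
The 1024-dimension verification mentioned for C12 could start from a check like the one below. It is a sketch only: `find_dimension_drift` and the sample source names are hypothetical, and the real check would read vectors from the actual services or fixtures.

```python
EXPECTED_DIM = 1024  # target dimension named for bge-m3 in P0-E


def find_dimension_drift(vectors: dict, expected: int = EXPECTED_DIM) -> dict:
    """Return {source_name: actual_dim} for every vector whose length differs from expected."""
    return {name: len(vec) for name, vec in vectors.items() if len(vec) != expected}


# Hypothetical samples: one conforming vector and one stale 768-dimension fixture.
samples = {
    "knowledge_rag_service": [0.0] * 1024,
    "stale_test_fixture": [0.0] * 768,
}
drift = find_dimension_drift(samples)
```

Running such a check against both test fixtures and production samples would surface a stale 768-dimension vector explicitly instead of letting search fail silently.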
## 7. Twelve-agent Ownership Matrix
| Owner | Scope | First verification |
| --- | --- | --- |
| Chief Architect | Total blueprint, dependency order, red-zone governance | This integration plan is linked from AwoooP master workplan and LOGBOOK. |
| API Layer | Platform endpoints auth, CSRF, tenant boundaries | Platform APIs reject unauthenticated and cross-project requests. |
| Frontend / AwoooP | i18n, sidebar, approver identity, internal IP ban | `/zh-TW/awooop` renders without redirect error and no private `NEXT_PUBLIC_*` bundle leak. |
| QA / Test | Integration tests, no-mock scanning, regression matrix | High-risk paths have production-like integration tests, not only mocks. |
| Docs / Release | ADR, LOGBOOK, Memory handoff, runbooks, release checklist | Each wave has a release note, rollback path, and live verification evidence. |
1. Wire `scripts/ops/deploy-alertmanager-config.sh` into the reboot/release checklist, then consider whether CD should run it for `ops/alertmanager/**` changes.
2. Continue Wave 1 with MCP Gateway bypass and MCP audit completeness, because production callers can still route around the gateway.
3. Keep GCP-A/GCP-B/111 Ollama routing verification in every alert-path release until EffectivePolicy becomes authoritative.
4. Add a Sentry/Snuba post-reboot health gate: ClickHouse table existence, Snuba migration status, and Kafka consumer offsets must be part of cold-start validation.
5. Add a post-deploy Alertmanager live check for `amtool check-config`, container status, and config-file mode; direct Telegram must remain emergency-only and target the SRE group.