docs(awooop): unify flywheel integration plan
@@ -1,3 +1,16 @@
## 2026-05-06 | AwoooP merged into the AI autonomous flywheel execution plan

**Background**: Another session completed the AWOOOI / AWOOOP / AI automation flywheel integration summary. It concluded that AwoooP cannot stand alone as a separate product line; it must be the flywheel's human-AI collaboration console, governance layer, audit layer, and operations layer.

**This integration**:
- Added `docs/awooop/AWOOOI-AWOOOP-AI-AUTONOMOUS-FLYWHEEL-INTEGRATION-PLAN.md`, which defines the shared goal, architectural invariants, P0/P1/P2 risks, 12-agent owners, Waves 0-3, and acceptance method.
- Merged the previously untracked `AWOOOP-MONITORING-ALERTING-CONVERGENCE.md` into the AwoooP docs as the monitoring/alerting handoff map.
- `MASTER-WORKPLAN.md` now links to the integration baseline, so that AwoooP and the AI automation flywheel do not fork.

**Follow-up**:
- Wave 1 starts with MCP Gateway bypass, approval auth, RAG 1024-dimension consistency, and cleanup of direct Ollama call sites.
- Every wave must backfill the LOGBOOK, a rollback flag, and live verification.

## 2026-05-06 | AwoooP root route no longer returns Next redirect error shell

**Background**: `https://awoooi.wooo.work/zh-TW/awooop` returned a `307` to `/zh-TW/awooop/work-items`, but the response body was the Next.js `__next_error__` + `NEXT_REDIRECT` shell, and the browser could show `Application error: a client-side exception has occurred`. `/zh-TW/awooop/work-items` itself returned a normal 200, so the problem was confined to the AwoooP root route redirect.

@@ -0,0 +1,290 @@
# AWOOOI / AwoooP / AI Autonomous Flywheel Integration Plan

**Date**: 2026-05-06
**Status**: Accepted integration baseline for AwoooP execution
**Scope**: AWOOOI alerting, AI decisioning, approval, MCP execution, KM/RAG/PlayBook learning, AwoooP Operator Console, audit, governance, and platform reliability
**Related**:
- `docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md`
- `docs/awooop/MASTER-WORKPLAN.md`
- `docs/awooop/DETAILED-IMPLEMENTATION-PLAN.md`
- `docs/awooop/AWOOOP-MONITORING-ALERTING-CONVERGENCE.md`
- `docs/adr/ADR-106-agent-platform-architecture.md`

## 1. Executive Decision

The current findings are not isolated bugs. They are evidence that the AI autonomous flywheel is not yet closed end-to-end.

AwoooP must not evolve as a separate product line beside AWOOOI automation. It is the human-and-governance plane for the same flywheel:

- **AI Flywheel**: engine, cognition, execution, learning loop.
- **AwoooP**: operator console, governance plane, audit plane, human approval plane.
- **MCP Gateway**: the only execution choke point for tool calls.
- **Approval / Trust**: the only authorization state machine.
- **KM / RAG / PlayBook**: the only knowledge substrate.
- **Channel Event / Audit**: the only observability and trace stream.

If AI automation and AwoooP are implemented as separate tracks, the system will drift into duplicate approval state machines, duplicate audit flows, duplicate routing gates, duplicate MCP and Channel decisions, and frontend states that do not match the backend execution truth.

## 2. Shared Target Loop

The target loop is:

```text
monitoring / alerting
  -> event classification
  -> rule match or rule creation
  -> PlayBook / KM / RAG retrieval
  -> AI decision
  -> Approval / Trust gate
  -> MCP Gateway execution
  -> audit and trace
  -> execution verification
  -> KM / rule / PlayBook feedback
```

AwoooP surfaces and governs that loop. It does not create an independent execution lane.

## 3. Architectural Invariants

1. **One execution gate**: production MCP calls must pass through MCP Gateway. Direct provider calls are compatibility debt and must become `forbid-new`.
2. **One approval state machine**: AwoooP approvals, AWOOOI approval records, TrustEngine, and Telegram approval buttons must converge to one signed, auditable flow.
3. **One audit stream**: Channel events, runtime state, MCP audit, model routing, and approval decisions must be joinable by `project_id`, `run_id`, and `trace_id`.
4. **One knowledge substrate**: KM, RAG, and PlayBook embeddings must use consistent dimensions and model naming. Test fixtures must match production schema.
5. **One routing control plane**: provider/model fallback must resolve through EffectivePolicy or a wrapped legacy equivalent during strangler migration.
6. **No fail-open tenant boundary**: missing `project_id` must never broaden data visibility. Bootstrap exceptions must be explicit and audited.
7. **No client-only identity**: any operator identity supplied by frontend body is display metadata only; authorization identity must come from the authenticated principal.
8. **No channel policy decisions**: Telegram, LINE, Slack, and email adapters deliver and track messages, but do not decide model, tool, approval, or incident policy.
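
Invariant 7 can be illustrated with a minimal sketch. It assumes a hypothetical `Principal` object produced by whatever auth layer the platform uses, plus a hypothetical `decide_approval` handler; neither name comes from the codebase. The point shown is only that the body field `approver_id` is stored as display metadata while the authorization identity comes from the principal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Principal:
    """Hypothetical identity established by the auth layer (session or token)."""
    subject: str
    roles: frozenset


@dataclass(frozen=True)
class DecisionRecord:
    approval_id: str
    decision: str
    decided_by: str    # always the authenticated subject
    display_name: str  # client-supplied, display metadata only


def decide_approval(principal: Optional[Principal], approval_id: str, body: dict) -> DecisionRecord:
    """Record a decision; authorization uses the principal, never the body."""
    if principal is None:
        raise PermissionError("unauthenticated")
    if "approver" not in principal.roles:  # role name is an assumption
        raise PermissionError("missing approver role")
    decision = body.get("decision")
    if decision not in {"approve", "reject"}:
        raise ValueError("decision must be 'approve' or 'reject'")
    return DecisionRecord(
        approval_id=approval_id,
        decision=decision,
        decided_by=principal.subject,
        # body["approver_id"] is kept for display and ignored for authorization
        display_name=str(body.get("approver_id", "")),
    )
```

With this shape, a request body carrying `approver_id: "operator"` cannot change who the audit record attributes the decision to.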

## 4. Integration Layers

### 4.1 AI Flywheel Core

Owns monitoring, alert ingestion, event classification, rule matching, rule generation, PlayBook, KM/RAG, AI decisions, safe repair, verification, and learning feedback.

Key existing surfaces:

- `apps/api/src/api/v1/webhooks.py`
- `apps/api/src/services/decision_manager.py`
- `apps/api/src/services/alert_rule_engine.py`
- `apps/api/src/services/learning_service.py`
- `apps/api/src/services/auto_repair_service.py`
- `apps/api/src/services/knowledge_rag_service.py`
- `apps/api/src/services/playbook_rag.py`

### 4.2 AwoooP Governance / Operator Layer

Owns operator runs, approvals, audit, channel events, MCP Gateway visibility, frontend control surface, i18n, and authenticated operator identity.

Key existing surfaces:

- `apps/api/src/api/v1/platform/*`
- `apps/api/src/services/platform_*`
- `apps/api/src/plugins/mcp/gateway.py`
- `apps/api/src/plugins/mcp/registry.py`
- `apps/web/src/app/[locale]/awooop/*`
- `docs/awooop/*`

### 4.3 Infrastructure / Reliability Layer

Owns DB schema and RLS, Redis namespace migration, K8s manifests, NetworkPolicy, Ollama failover, Gitea CD, migration discipline, and release gates.

Key existing surfaces:

- `apps/api/migrations/*`
- `apps/api/src/db/*`
- `k8s/awoooi-prod/*`
- `.gitea/workflows/*`
- `docs/runbooks/*`
- `docs/awooop/inventory/*`

## 5. Consolidated Risk Register

### 5.1 P0: Production-broken or Security-critical

| ID | Risk | Impact | Required direction |
| --- | --- | --- | --- |
| P0-A | MCP Gateway is not yet a production choke point; callers can still reach `provider.execute()` directly. | Gateway gates, redaction, and audit can be bypassed. | Wrap all production MCP call sites; then mark direct provider access `forbid-new`. |
| P0-B | MCP blocked-call audit has gaps when `tool_row` is missing or Gate 1/2 rejects early. | Denied or suspicious calls can disappear from audit. | Audit attempt before and after gate evaluation with safe redaction. |
| P0-C | Legacy K8s tool execution still has command/shell injection risk. | Destructive command path can be polluted by LLM or user input. | Parse command structure, avoid shell, enforce operation schema and allowlists. |
| P0-D | `MCPToolResult(data=...)` mismatches dataclass fields in some success paths. | Sentry / ArgoCD or similar tools can crash on valid results. | Normalize result schema and regression-test all MCP providers. |
| P0-E | RAG/KM/PlayBook embedding dimensions remain split between 768 and 1024. | Search or backfill can silently fail or hide production drift. | Standardize on `bge-m3` 1024 dimensions and remove stale fixtures. |
| P0-F | KM backfill reconciler is missing required async imports or runtime dependencies. | Repair path that should recover embeddings may crash. | Compile and integration-test the reconciler path. |
| P0-G | Ollama routing still has direct `OLLAMA_URL` consumers. | GCP-A/GCP-B/111 ordering and Gemini fallback policy can be bypassed. | Inventory, wrap, and migrate call sites to resolver / EffectivePolicy. |
| P0-H | RLS is fail-open when `project_id` is null. | Cross-tenant data exposure risk. | Force fail-closed RLS and explicit platform bootstrap identities. |
| P0-I | Approval APIs trust frontend/body identity such as `approver_id: "operator"`. | Audit identity is not legally or operationally usable. | Decide endpoint must derive principal from auth/session/token, not body. |
| P0-J | API control plane lacks real authorization in some paths; CSRF is not authorization. | Operator APIs can be invoked without a real role boundary. | Add authenticated principal and role checks to AwoooP APIs. |
| P0-K | Alertmanager internal bypass plus `X-Forwarded-For` and broad trusted hosts may allow spoofed source identity. | Forged alert ingress can create incidents or approvals. | Require signed webhook or strict trusted proxy chain. |
| P0-L | AwoooP approval token service also depends on `aioredis`. | Python/runtime compatibility issue can break approvals. | Migrate Redis client path or wrap with a maintained async Redis interface. |
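
The direction for P0-C (parse command structure, avoid shell, enforce allowlists) might look roughly like the sketch below. The allowlists, the `parse_kubectl` name, and the restriction to `kubectl <verb> <resource>` commands are illustrative assumptions, not the project's actual operation schema.

```python
import shlex

# Hypothetical allowlists; a real operation schema would be stricter and per-tool.
ALLOWED_VERBS = {"get", "describe"}
ALLOWED_RESOURCES = {"pods", "deployments", "services"}
FORBIDDEN_CHARS = set(";|&$`<>\n")


def parse_kubectl(command: str) -> list:
    """Turn a proposed kubectl command into an argv list, or raise ValueError."""
    if any(ch in FORBIDDEN_CHARS for ch in command):
        raise ValueError("shell metacharacter in command")
    argv = shlex.split(command)  # raises ValueError on unbalanced quotes
    if len(argv) < 3 or argv[0] != "kubectl":
        raise ValueError("expected: kubectl <verb> <resource> ...")
    verb = argv[1]
    resource = argv[2].split("/")[0]  # accept "pods" and "pods/name"
    if verb not in ALLOWED_VERBS:
        raise ValueError(f"verb not allowed: {verb}")
    if resource not in ALLOWED_RESOURCES:
        raise ValueError(f"resource not allowed: {resource}")
    # The caller should run this as subprocess.run(argv, shell=False).
    return argv
```

Because the result is an argv list, LLM- or user-supplied text is never interpreted by a shell; anything outside the allowlist is rejected before execution.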

### 5.2 P1: Governance consistency and reliability

| ID | Risk | Impact | Required direction |
| --- | --- | --- | --- |
| P1-A | Approval repository / TrustEngine race conditions. | Double decision, stale decision, or missed approval. | Add row locks, idempotency keys, and single source of truth. |
| P1-B | TrustEngine uses process memory dict. | Pending approvals vanish or split across pods/restarts. | Move authoritative state to PostgreSQL; Redis only for cache/notification. |
| P1-C | NetworkPolicy contains rules whose names imply deny but semantics allow broad traffic. | False sense of isolation. | Reconcile names, policies, and live cluster behavior. |
| P1-D | Kustomize may omit service registry, HPA, VPA, backup cron, or related resources. | GitOps drift and partial deploys. | Inventory generated/live manifests and add missing resources. |
| P1-E | Hardcoded restart/SSH actions remain in alert rules. | Conflicts with AI automation direction and rule/PlayBook evidence loop. | Convert hardcoded actions into PlayBook proposals with trust gates. |
| P1-F | Startup schema mutation and scattered SQL migrations remain. | Runtime ALTER can diverge dev/prod schema. | Move to disciplined migration path and remove runtime mutation. |
| P1-G | Tests still use stale 768-dimension vectors. | Dev tests can pass while production RAG fails. | Test fixtures must mirror production vector dimensions. |
| P1-H | GCP-B can remain passive fallback only. | Active-active design is not realized. | Use GCP-B for batch, embedding, shadow, and canary lanes via policy. |
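
For P1-A, one common way to prevent a double decision is a conditional update that only succeeds while the row is still pending. The sketch below uses the standard-library `sqlite3` module purely as a stand-in for PostgreSQL; the `approvals` table layout and the `decide_once` helper are assumptions made for illustration.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory stand-in for the approvals table (layout is hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE approvals (id TEXT PRIMARY KEY, status TEXT NOT NULL, decided_by TEXT)"
    )
    return conn


def decide_once(conn: sqlite3.Connection, approval_id: str, decision: str, decided_by: str) -> bool:
    """Apply a decision only if the approval is still pending.

    Returns True if this call made the decision, False if it was already decided
    or does not exist.
    """
    with conn:  # commits on success, rolls back on exception
        cursor = conn.execute(
            "UPDATE approvals SET status = ?, decided_by = ? "
            "WHERE id = ? AND status = 'pending'",
            (decision, decided_by, approval_id),
        )
    return cursor.rowcount == 1
```

The database, not process memory, is the single source of truth here: a second pod issuing the same decision sees `False` and can report the existing outcome instead of executing twice.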

### 5.3 P2: Productization and long-term maintainability

| ID | Risk | Impact | Required direction |
| --- | --- | --- | --- |
| P2-A | AwoooP frontend has incomplete i18n and inconsistent IA. | Operator experience is unprofessional and brittle. | Align UI text, sidebar, routes, and workflow labels. |
| P2-B | Operator Console lacks deep trace/journal views. | Operators cannot explain why AI made a decision. | Add run step journal, trace spans, model/tool/cost panes. |
| P2-C | Error code taxonomy is incomplete. | Frontend and channels cannot render actionable states. | Define platform error codes for auth, schema, budget, MCP, routing, approval, and RAG. |
| P2-D | Docs and runtime can drift. | Future sessions repeat the same analysis. | Keep ADR, LOGBOOK, inventory, and release checklists linked to commits. |

## 6. Corrections to Prior Inventories

The following corrections must be applied when consuming the earlier 12-agent inventories:

1. C21/C22 CronJob service account and image issues appear to have been partially fixed in the worktree, but must be verified against live cluster state before closure.
2. C1 `aioredis` is not necessarily a top-level import in all affected paths; at least one Gate 5 runtime import still makes the risk valid.
3. C2 is more precise as: blocked calls can skip audit when `tool_row is None`, not only due to a database `NOT NULL` failure.
4. C12 is partially stale for the main embedding service, but `knowledge_rag_service.py` and `playbook_rag.py` still need verification for 1024-dimensional consistency.
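
The 1024-dimension verification mentioned in item 4 can be reduced to a small guard that tests and backfill jobs share. This is a sketch only; `EXPECTED_DIM` and `check_embedding_dims` are hypothetical names, and the vectors would come from whatever fixtures or rows are being verified.

```python
EXPECTED_DIM = 1024  # bge-m3 dimension named in the risk register


def check_embedding_dims(vectors: dict, expected: int = EXPECTED_DIM) -> list:
    """Return the sorted names of vectors whose length differs from the expected dimension."""
    return sorted(name for name, vector in vectors.items() if len(vector) != expected)
```

A test suite could assert that the returned list is empty for every fixture, so a stale 768-dimension vector fails loudly instead of passing in dev.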

## 7. Twelve-agent Ownership Matrix

| Owner | Scope | First verification |
| --- | --- | --- |
| Chief Architect | Total blueprint, dependency order, red-zone governance | This integration plan is linked from AwoooP master workplan and LOGBOOK. |
| Security | Auth, webhook spoofing, RLS, MCP security, token signing | Authenticated approval decision cannot be forged by body-only identity. |
| MCP Gateway | Gateway chokepoint, audit, provider bypass, result schema | Direct provider call inventory reaches zero `forbid-new` exceptions. |
| K8s / SRE | CronJob, Kustomize, NetworkPolicy, RBAC, deployment verification | `kubectl diff` and live resource inventory match expected GitOps set. |
| DB / Migration | Alembic, RLS, JSONB, FK, approval race | RLS fail-closed tests and migration up/down pass. |
| RAG / KM | `bge-m3` 1024 dimensions, backfill, PlayBook embedding | Production vector dimension and test fixtures match. |
| AI Router / Ollama | Resolver, failover, direct URL cleanup, GCP-A/B/111 lanes | New alert route logs prove `GCP-A -> GCP-B -> 111 -> Gemini` ordering. |
| Core / Trust | TrustEngine, RedisLock, UnitOfWork, feature flags | Pending approval survives pod restart and HPA multi-pod execution. |
| API Layer | Platform endpoints auth, CSRF, tenant boundaries | Platform APIs reject unauthenticated and cross-project requests. |
| Frontend / AwoooP | i18n, sidebar, approver identity, internal IP ban | `/zh-TW/awooop` renders without redirect error and no private `NEXT_PUBLIC_*` bundle leak. |
| QA / Test | Integration tests, no-mock scanning, regression matrix | High-risk paths have production-like integration tests, not only mocks. |
| Docs / Release | ADR, LOGBOOK, Memory handoff, runbooks, release checklist | Each wave has a release note, rollback path, and live verification evidence. |

## 8. Execution Waves

### Wave 0: Formal convergence baseline

Goal: convert the cross-session conclusion into a single source of operational truth.

Work:

- Add this integration plan.
- Link it from `MASTER-WORKPLAN.md` and `AWOOOP-MONITORING-ALERTING-CONVERGENCE.md`.
- Keep runtime untouched except already-approved urgent repairs.
- Define owner and wave for each high-risk item.

Exit criteria:

- Plan is committed.
- LOGBOOK records the integration baseline.
- Next implementation session can start from this document without re-deriving the architecture.

### Wave 1: P0 production-broken / security-critical

Goal: prevent the automation flywheel from bypassing governance.

Work:

- MCP Gateway bypass and audit gaps.
- `aioredis` runtime compatibility in approval/MCP paths.
- `MCPToolResult` schema normalization.
- K8s command injection hardening.
- Webhook spoofing and platform API approval auth.
- RAG/KM/PlayBook 1024-dimensional consistency.
- Ollama direct `OLLAMA_URL` cleanup for production alert paths.

Exit criteria:

- No production MCP caller can bypass Gateway without a tracked `forbid-new` exception.
- Approval decide endpoint uses authenticated principal.
- Embedding dimensions are production-consistent.
- Alert route logs prove Gemini is fallback only after GCP-A, GCP-B, and 111.

### Wave 2: P1 governance consistency

Goal: make platform state durable and tenant-safe.

Work:

- RLS fail-closed and project context injection.
- Approval / Trust race fixes.
- TrustEngine state moves from process memory to PostgreSQL.
- NetworkPolicy semantic cleanup.
- Kustomize resource coverage.
- Ollama failover centralization and GCP-B active-active usage.

Exit criteria:

- Cross-project read tests fail closed.
- Pending approval survives pod restart.
- Live cluster resources match Git-managed resources.

### Wave 3: P2 productization

Goal: make AwoooP a usable professional operator console, not a side dashboard.

Work:

- AwoooP i18n and sidebar IA.
- Run step journal and trace view.
- Platform error code taxonomy.
- Alembic and migration discipline.
- Internal IP bundle checks.
- Regression matrix and release checklists.

Exit criteria:

- Operator can explain a run from alert to MCP execution to KM feedback.
- Frontend has no hardcoded user-facing strings outside i18n.
- Release checklist ties commit, image, route health, and live URL verification.

## 9. Wave Dependencies

```text
Wave 0
  -> Wave 1 MCP/Auth/RAG/Ollama safety
  -> Wave 2 RLS/Trust/K8s state convergence
  -> Wave 3 Operator Console productization
```

Do not move a destructive or auto-remediation path to AwoooP enforcement until Wave 1 is green and Wave 2 tenant state is fail-closed.

## 10. Release and Verification Discipline

Each wave must publish:

- changed surface
- runtime behavior changed: yes/no
- contract family
- migration posture: keep, mirror, wrap, replace, forbid-new
- rollback or disable flag
- affected event types
- risk rating
- live verification command or URL
- LOGBOOK entry

For frontend releases, also verify:

- `pnpm --filter @awoooi/web typecheck`
- `NEXT_PUBLIC_API_URL=https://awoooi.wooo.work pnpm --filter @awoooi/web build`
- no private `NEXT_PUBLIC_*` internal IPs in generated browser bundle
- live route status and content after CD rollout
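
The "no private internal IPs in the browser bundle" check above could be automated with a scan along these lines. This is a sketch under assumptions: the bundle directory is passed in by the caller (the build output location is not stated in this plan), and `find_private_ips` / `scan_bundle` are hypothetical helper names. It relies on the standard-library `ipaddress` module to decide what counts as private.

```python
import ipaddress
import re
from pathlib import Path

# Dotted-quad candidates that are not part of a longer number or dotted string.
IPV4_RE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def find_private_ips(text: str) -> set:
    """Return IPv4 literals in text that ipaddress classifies as private (non-public)."""
    found = set()
    for candidate in IPV4_RE.findall(text):
        try:
            address = ipaddress.ip_address(candidate)
        except ValueError:
            continue  # not a valid address, e.g. 999.1.1.1
        if address.is_private:
            found.add(candidate)
    return found


def scan_bundle(bundle_dir: str) -> dict:
    """Scan every .js file under bundle_dir and report private addresses per file."""
    report = {}
    for path in Path(bundle_dir).rglob("*.js"):
        hits = find_private_ips(path.read_text(encoding="utf-8", errors="ignore"))
        if hits:
            report[str(path)] = hits
    return report
```

A release gate could fail the build whenever `scan_bundle(...)` returns a non-empty report.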

For AI routing releases, also verify:

- active pod environment
- provider route logs
- cost-bearing provider fallback order
- `Gemini` appears only after all configured Ollama lanes fail or policy explicitly permits cloud fallback
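
The fallback-order check can be expressed as a small predicate over the providers attempted for one request. The sketch assumes the attempt sequence has already been extracted from route logs as a list of provider names (the log format itself is not specified here), and `route_order_is_valid` is a hypothetical helper. Note that it checks that each lane was tried, not why it failed.

```python
OLLAMA_LANES = ("GCP-A", "GCP-B", "111")  # ordering taken from this plan
CLOUD_PROVIDERS = {"Gemini"}


def route_order_is_valid(attempts: list, cloud_allowed_by_policy: bool = False) -> bool:
    """Check one request's provider attempts against the expected fallback order."""
    lanes_seen = [provider for provider in attempts if provider in OLLAMA_LANES]
    if lanes_seen != list(OLLAMA_LANES[: len(lanes_seen)]):
        return False  # lanes tried out of order or repeated
    for index, provider in enumerate(attempts):
        if provider in CLOUD_PROVIDERS and not cloud_allowed_by_policy:
            # Cloud is only valid once every Ollama lane has been attempted.
            if not set(OLLAMA_LANES) <= set(attempts[:index]):
                return False
    return True
```

Running this over sampled alert-path requests gives a pass/fail signal that can be attached to the release evidence.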

## 11. Immediate Next Items

1. Finish live deployment verification for `682c0b99` AwoooP route fix.
2. Create Wave 1 implementation checklist from this document.
3. Start with MCP Gateway bypass and platform approval authentication, because both directly affect safety and auditability.
4. Keep GCP-A/GCP-B/111 Ollama routing verification in every alert-path release until EffectivePolicy becomes authoritative.

290
docs/awooop/AWOOOP-MONITORING-ALERTING-CONVERGENCE.md
Normal file
@@ -0,0 +1,290 @@
# AwoooP x Monitoring / Alerting Convergence Map

**Date**: 2026-05-05
**Status**: Coordination map, no runtime change
**Scope**: AWOOOI monitoring, alerting, incident, AI provider routing, MCP, Telegram, governance loops, and AwoooP contract convergence
**Superseded by integration baseline for execution waves**: `docs/awooop/AWOOOI-AWOOOP-AI-AUTONOMOUS-FLYWHEEL-INTEGRATION-PLAN.md`

## 0. Purpose

AwoooP is being implemented in another active session. This document exists to keep the sessions aligned without competing for the same runtime files.

This session owns the full-flow map of the existing AWOOOI monitoring and alerting flywheel. The AwoooP implementation session owns platform schema, contracts, migrations, and runtime/control-plane implementation.

2026-05-06 update: this file remains the monitoring/alerting convergence handoff. Execution planning, 12-agent ownership, P0/P1/P2 risk register, and wave sequencing now live in the integration baseline document linked above.

The integration rule is simple:

- AwoooP should not become a parallel platform beside the old alerting flywheel.
- Existing alerting and repair paths should be progressively mirrored, wrapped, and then governed by AwoooP contracts.
- Provider routing, MCP access, channel delivery, incident state, and governance feedback must converge through contracts instead of staying as scattered direct calls.

## 1. Session Boundary

### This session may do

- Map all monitoring, alerting, incident, governance, provider, MCP, and channel nodes.
- Identify overlap with AwoooP contracts.
- Mark migration posture for each node: `keep`, `mirror`, `wrap`, `replace`, or `forbid-new`.
- Produce conflict warnings for the AwoooP implementation session.
- Update docs and LOGBOOK with coordination state.

### This session must not do yet

- Change `decision_manager.py` runtime behavior.
- Change `governance_dispatcher.py` runtime behavior.
- Change `ai_router.py` / `ollama_failover_manager.py` routing behavior.
- Change production K8s provider configuration.
- Change Telegram delivery behavior.
- Introduce paid provider behavior or raise provider budgets.

### AwoooP implementation session likely owns

- Contract schemas and validators.
- AwoooP DB models and migrations.
- Runtime/control-plane APIs.
- EffectivePolicy resolution.
- Contract revision source of truth.
- MCP Gateway implementation slices.
- Channel Hub / Channel Event runtime implementation.

## 2. Current Full-Flow Map

```text
Monitoring / Event Sources
  Prometheus / Alertmanager / SigNoz / Sentry / watchdog / governance loops / background jobs
  ↓
Ingress and Normalization
  webhooks.py / AlertAnalyzer / fingerprint / in-flight lock / classify_alert_early
  ↓
Incident and Audit State
  incidents / approval_records / alert_operation_log / timeline_events / governance_remediation_dispatch
  ↓
Evidence and Sensing
  pre_decision_investigator / MCP ProviderRegistry / k8s / prom / signoz / ssh / db / rag
  ↓
Rules and Cognition
  alert_rule_engine / OpenClaw / AIRouter / decision_fusion_adapter / agents
  ↓
Provider Routing
  OLLAMA_URL configmap / ollama_failover_manager / AIRouter / AIProviderRegistry / NVIDIA / Gemini / Claude
  ↓
Decision and Approval
  decision_manager / proposal service / approval service / governance_dispatcher
  ↓
Execution
  approval_execution / auto_repair_service / MCP gateway / SSH provider / K8s provider / emergency escalation
  ↓
Channel Delivery
  telegram_gateway / failover_alerter / workflow notifications / Channel Hub
  ↓
Verification and Learning
  post-execution verification / learning_service / playbook generation / ai_slo_watchdog / governance loop
```

## 3. AwoooP Contract Overlay

| Existing layer | AwoooP contract | Convergence need |
| --- | --- | --- |
| Alertmanager, SigNoz, watchdog, governance loops | Communication / Channel Event | Normalize all inbound events into one event envelope before business decisions. |
| Incident, approval, AOL, timeline, governance dispatch | Runtime State | Add project/run/trace identity and make event state joinable. |
| OpenClaw, AIRouter, ollama failover, cloud fallback | Policy / Routing | Resolve provider/model/fallback through EffectivePolicy, not scattered env/direct calls. |
| Agents, prompts, schemas, role-specific tools | Agent Contract | Agents become versioned capability modules with explicit I/O and safety ceilings. |
| K8s, SSH, SigNoz, Prometheus, DB, RAG tools | MCP Gateway | Tool access must satisfy Project AND Agent AND Tool AND Environment AND Approval. |
| Telegram, workflow notification, failover messages | Communication / Channel Event | Channel adapters stay thin and must not decide LLM, MCP, approval, or business policy. |
| ai_slo_watchdog, governance loop, remediation dispatch | Governance / Policy | Meta-system events must be deduped, scoped, and prevented from self-amplifying. |
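
The MCP Gateway row states that tool access must satisfy Project AND Agent AND Tool AND Environment AND Approval. A minimal sketch of that conjunction follows; the `ToolRequest` and `Grant` shapes and the reason strings are assumptions for illustration, not the gateway's real schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolRequest:
    project_id: str
    agent_id: str
    tool: str
    environment: str
    approval_granted: bool


@dataclass(frozen=True)
class Grant:
    project_id: str
    agent_id: str
    tool: str
    environments: frozenset
    requires_approval: bool


def evaluate(request: ToolRequest, grants: list) -> tuple:
    """Allow only when project, agent, tool, environment, and approval all pass."""
    reason = "no_matching_grant"
    for grant in grants:
        if (grant.project_id, grant.agent_id, grant.tool) != (
            request.project_id, request.agent_id, request.tool
        ):
            continue
        if request.environment not in grant.environments:
            reason = "environment_not_granted"
            continue
        if grant.requires_approval and not request.approval_granted:
            reason = "approval_required"
            continue
        return True, "allowed"
    return False, reason  # default deny, with the most recent failure reason
```

Returning a reason alongside the boolean keeps denied calls explainable in the audit stream.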

## 4. Known Overlap and Conflict Zones

### 4.1 Provider routing

Current routing is not yet one-control-plane:

- `OLLAMA_URL`, `OLLAMA_SECONDARY_URL`, and `OLLAMA_FALLBACK_URL` are centralized in K8s config for Ollama topology.
- Production runtime routes Ollama through the 110 proxy pool: `110:11435` → GCP-A, `110:11436` → GCP-B, `110:11437` → Local `.111`.
- `ollama_failover_manager.py` still owns important runtime failover logic.
- `AIRouter` and `AIProviderRegistry` exist, but not every LLM use path is guaranteed to enter through them.
- Some knowledge-side services may still call provider-specific clients directly.
- `INV-10` now tracks direct `OLLAMA_URL` call sites so GCP-B can become an active batch/shadow lane instead of only a passive fallback.

AwoooP action:

- Treat provider catalog as platform-level resource.
- Treat per-project route selection as EffectivePolicy.
- Do not move provider behavior until the old call sites are mapped and wrapped.
- Preserve GCP-A for interactive low-latency work, use GCP-B for batch/RAG/embedding/shadow/canary, and keep Local `.111` for `local_required` / DR.

Migration posture: `wrap` first, `replace` later.
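
The lane assignment described in the last action item can be written down as a lookup table. The `Workload` enum and `preferred_lane` helper below are hypothetical; only the lane names and the workload-to-lane mapping come from the text, and failover between lanes is deliberately out of scope.

```python
from enum import Enum


class Workload(Enum):
    INTERACTIVE = "interactive"
    BATCH = "batch"
    RAG = "rag"
    EMBEDDING = "embedding"
    SHADOW = "shadow"
    CANARY = "canary"
    LOCAL_REQUIRED = "local_required"
    DR = "dr"


# Preferred lane per workload class, as described in section 4.1.
LANE_FOR_WORKLOAD = {
    Workload.INTERACTIVE: "GCP-A",
    Workload.BATCH: "GCP-B",
    Workload.RAG: "GCP-B",
    Workload.EMBEDDING: "GCP-B",
    Workload.SHADOW: "GCP-B",
    Workload.CANARY: "GCP-B",
    Workload.LOCAL_REQUIRED: "111",
    Workload.DR: "111",
}


def preferred_lane(workload: Workload) -> str:
    """Return the first-choice Ollama lane for a workload class."""
    return LANE_FOR_WORKLOAD[workload]
```

Keeping the mapping as data makes it straightforward to move into EffectivePolicy later without touching call sites.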

### 4.2 Monitoring and alerting ingress

Current ingress is multi-path:

- Alertmanager webhook paths.
- SigNoz webhook path.
- Sentry or app error paths.
- AI SLO watchdog.
- Governance events and remediation dispatch.
- Background loops that synthesize incidents or meta-alerts.

AwoooP action:

- Define a Channel Event envelope that can represent all of these without losing raw source payload.
- Mirror events first; do not force all production ingress through AwoooP until parity and dedupe are proven.

Migration posture: `mirror` first, `wrap` later.
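
One possible shape for such an envelope is sketched below. The field names, the `wrap_event` helper, and the way the dedupe key is derived are all assumptions; the only properties taken from the text are that every source fits one envelope and that the raw payload is kept intact.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class ChannelEvent:
    source: str        # e.g. "alertmanager", "signoz", "watchdog"
    event_type: str
    received_at: str   # ISO 8601, UTC
    dedupe_key: str
    raw_payload: dict  # original payload, stored unmodified


def wrap_event(source: str, event_type: str, raw_payload: dict) -> ChannelEvent:
    """Wrap a source-specific payload without altering it."""
    # Canonical JSON makes the key independent of dict ordering.
    canonical = json.dumps(raw_payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(f"{source}:{event_type}:{canonical}".encode("utf-8")).hexdigest()
    return ChannelEvent(
        source=source,
        event_type=event_type,
        received_at=datetime.now(timezone.utc).isoformat(),
        dedupe_key=digest,
        raw_payload=raw_payload,
    )
```

During the mirror phase, the same inbound payload can be wrapped and stored alongside the existing ingress path, and the dedupe key gives a simple way to compare parity.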

### 4.3 Rule-first versus LLM-first

Historical failures show that host/resource alerts can be polluted by LLM suggestions if rule-first gates are bypassed.

AwoooP action:

- Preserve rule-first safety gates during migration.
- EffectivePolicy should be able to declare `llm_allowed=false` or `rule_authoritative=true` for specific categories.
- Channel Event and Policy/Routing contracts must not weaken existing host-vs-k8s domain guards.

Migration posture: `keep` safety gates, then `wrap` with policy metadata.
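
How the two flags could interact is sketched below. The `CategoryPolicy` type and `choose_action` function are hypothetical, and the precedence shown (an LLM suggestion may refine a rule result only when neither flag forbids it) is one plausible reading of the flags rather than a confirmed design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CategoryPolicy:
    llm_allowed: bool = True
    rule_authoritative: bool = False


def choose_action(policy: CategoryPolicy, rule_action: Optional[str], llm_action: Optional[str]) -> Optional[str]:
    """Pick the action to propose for one alert category."""
    if not policy.llm_allowed:
        return rule_action  # LLM output is ignored entirely
    if policy.rule_authoritative and rule_action is not None:
        return rule_action  # a matching rule cannot be overridden
    return llm_action if llm_action is not None else rule_action
```

For a host/resource category, a policy of `CategoryPolicy(llm_allowed=False)` would mean that an alert with no matching rule yields no proposed action instead of an LLM guess.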

### 4.4 MCP and direct tools

Current MCP capability exists, but provider registries and tools are not yet entirely tenant/project scoped.

AwoooP action:

- Keep platform-level provider registry for tool implementations.
- Move grants, credential refs, environment boundaries, and project permission decisions into MCP Gateway policy.
- Ensure direct tool calls become `forbid-new` once gateway coverage exists.

Migration posture: `wrap` existing tools, then `forbid-new` direct access.

### 4.5 Telegram and Channel Hub

Telegram currently carries incident cards, meta alerts, failover alerts, approval messages, workflow notifications, and status updates.

AwoooP action:

- Channel adapters must remain delivery-only.
- They may escape, retry, edit, and track provider message IDs.
- They must not decide provider route, approval policy, incident category, MCP access, or business logic.

Migration posture: `mirror` outbound delivery state, then `wrap` through Channel Event.

### 4.6 Meta-system governance loops

Recent production verification showed the desired health signal: multiple stale-event cycles processed without `meta_alert` / `META_SYSTEM` amplification.

AwoooP action:

- Model governance events as platform-internal runtime events.
- Require dedupe/cooldown/active-state checks to be part of the governance policy contract.
- Keep `governance_dispatch_path_skip`-style safeguards visible in audit state.

Migration posture: `keep` current guard, `mirror` event state, then `wrap` into governance policy.
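
The dedupe/cooldown requirement can be illustrated with a small gate keyed by a dedupe key. `CooldownGate` is a hypothetical class, and its state lives in process memory purely to keep the sketch short; a real governance policy would need shared, durable state so that the guard holds across pods.

```python
class CooldownGate:
    """Suppress repeated dispatches of the same governance event within a cooldown window."""

    def __init__(self, cooldown_seconds: float) -> None:
        self._cooldown = cooldown_seconds
        self._last_dispatch = {}  # dedupe_key -> time of last allowed dispatch

    def should_dispatch(self, dedupe_key: str, now: float) -> bool:
        """Return True and record the dispatch if the key is outside its cooldown."""
        last = self._last_dispatch.get(dedupe_key)
        if last is not None and now - last < self._cooldown:
            return False  # suppressed; the caller should still audit the skip
        self._last_dispatch[dedupe_key] = now
        return True
```

Passing `now` explicitly keeps the gate deterministic in tests; callers can supply `time.monotonic()` in production code. Suppressed events should still be written to audit state so the skip remains visible.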

## 5. Node Inventory for Coordination

| Node | Current role | AwoooP target | Posture | Notes |
| --- | --- | --- | --- | --- |
| `api/v1/webhooks.py` | Alertmanager/SigNoz ingress, incident creation, LLM fallback paths | Channel Event ingress + Runtime State | `mirror` | Do not rewrite first; mirror selected events. |
| `incident_service.py` | Incident lifecycle and early classification | Runtime State + L3 cognition input | `wrap` | Preserve existing classification behavior until EvidenceSnapshot parity. |
| `alert_rule_engine.py` | YAML rules and safety gates | Policy/Routing safety constraints + Playbook policy | `keep` | Rule-first gates must survive migration. |
| `pre_decision_investigator.py` | Evidence gathering through MCP tools | MCP Gateway + Runtime evidence snapshot | `wrap` | Good first bridge to gateway-controlled sensing. |
| `plugins/mcp/registry.py` | MCP provider registry | Platform provider registry with per-project grants in DB | `wrap` | Registry can stay platform-level; grants move out. |
| `plugins/mcp/gateway.py` | MCP gate implementation slice | AwoooP MCP Gateway | `replace` later | Candidate canonical gateway after contract alignment. |
| `ai_router.py` | Provider selection and execution | Policy/Routing EffectivePolicy executor | `wrap` | Do not bypass; gradually make EffectivePolicy an input. |
| `ollama_failover_manager.py` | Ollama topology/failover | Platform provider health + topology resource | `wrap` | Keep runtime behavior while policy contract learns from it; see `INV-10` for GCP-A/B active-active usage. |
| `decision_fusion_adapter.py` | Decision fusion / routing bridge | Agent + Policy/Routing integration | `wrap` | High-risk bridge, no direct rewrite without plan. |
| `decision_manager.py` | Core decision and execution orchestration | Agent Runtime + Runtime State | `keep` | Tier-sensitive; last-mile migration only. |
| `approval_execution.py` | Approval action execution | Runtime execution + MCP Gateway | `wrap` | Preserve host-vs-k8s guards. |
| `auto_repair_service.py` | Repair execution and verification | Runtime execution + verification contract | `wrap` | Must not auto-resolve without verified success. |
| `telegram_gateway.py` | Telegram send/edit/delete and cards | Channel adapter | `wrap` | Delivery only, not policy. |
| `failover_alerter.py` | Provider/governance alert rendering | Channel adapter + governance event view | `wrap` | Human-readable rendering remains useful. |
| `ai_slo_watchdog_job.py` | Platform self-monitoring | Governance policy + platform internal events | `keep` | Current anti-amplification behavior is important. |
| `governance_dispatcher.py` | Governance event dispatch | Governance runtime dispatch | `keep` | Very sensitive; do not alter while other session changes platform. |
| `learning_service.py` | Learning, playbook generation side effects | Runtime learning + Agent trace | `wrap` | Attach evidence/run identity before provider changes. |
| `runbook_generator.py` | Runbook/anti-pattern generation | Knowledge side effect under Policy/Routing | `wrap` | Direct provider usage should eventually route through EffectivePolicy. |
|
||||
## 6. Recommended Shared Handoff Protocol

When either session changes a relevant area, update this document or LOGBOOK with:

- `changed_surface`: file/module/contract touched.
- `runtime_behavior_changed`: yes/no.
- `contract_family`: Project, Agent, MCP Gateway, Policy/Routing, Runtime State, Channel Event, Governance.
- `migration_posture`: keep, mirror, wrap, replace, forbid-new.
- `rollback_or_disable_flag`: exact flag if applicable.
- `affected_event_types`: alertmanager, signoz, watchdog, governance, provider_failover, approval, execution, channel.
- `risk`: low/medium/high.
- `coordination_note`: what the other session must avoid or consume.

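The handoff fields above could be linted before an entry is written to LOGBOOK. The following is a minimal sketch, not an existing module: the `HandoffEntry` class and its `problems()` method are hypothetical names, and the rule that a runtime behavior change requires a rollback flag is an assumption layered on top of "exact flag if applicable".

```python
from dataclasses import dataclass
from typing import Optional

# Allowed values copied from the field list in this section.
CONTRACT_FAMILIES = frozenset({
    "Project", "Agent", "MCP Gateway", "Policy/Routing",
    "Runtime State", "Channel Event", "Governance",
})
MIGRATION_POSTURES = frozenset({"keep", "mirror", "wrap", "replace", "forbid-new"})
EVENT_TYPES = frozenset({
    "alertmanager", "signoz", "watchdog", "governance",
    "provider_failover", "approval", "execution", "channel",
})
RISK_LEVELS = frozenset({"low", "medium", "high"})


@dataclass
class HandoffEntry:
    """Hypothetical in-memory form of one handoff record."""
    changed_surface: list
    runtime_behavior_changed: bool
    contract_family: list
    migration_posture: list
    affected_event_types: list
    risk: str
    coordination_note: str
    rollback_or_disable_flag: Optional[str] = None

    def problems(self) -> list:
        """Return human-readable problems; an empty list means the entry passes."""
        found = []
        if not self.changed_surface:
            found.append("changed_surface is empty")
        checks = (
            ("contract_family", self.contract_family, CONTRACT_FAMILIES),
            ("migration_posture", self.migration_posture, MIGRATION_POSTURES),
            ("affected_event_types", self.affected_event_types, EVENT_TYPES),
        )
        for name, values, allowed in checks:
            unknown = sorted(set(values) - allowed)
            if unknown:
                found.append(f"{name}: unknown value(s) {unknown}")
        if self.risk not in RISK_LEVELS:
            found.append(f"risk: unknown value {self.risk!r}")
        # Assumption: a runtime change without a rollback flag is worth flagging.
        if self.runtime_behavior_changed and not self.rollback_or_disable_flag:
            found.append("runtime behavior changed without rollback_or_disable_flag")
        return found
```

Free-text values such as `low to medium` would be reported by this sketch, so it is better suited to a warning-level lint than to a hard gate.
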
## 7. Safe Integration Order

### Phase A: Shared map and mirror-only state

- Keep old runtime behavior.
- Mirror selected alert/governance/channel events into AwoooP runtime state.
- Add no new provider behavior.
- Add no new paid fallback.

### Phase B: Policy/Routing as read-only resolver

- Compute EffectivePolicy but do not enforce it on critical decisions yet.
- Compare EffectivePolicy result against current AIRouter / failover result.
- Store divergence for review.

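One way to read Phase B is as a shadow comparison: the current router stays authoritative, the EffectivePolicy result is computed alongside it, and any difference is recorded. The sketch below assumes hypothetical names throughout (`RouteDecision`, `Divergence`, `route_with_shadow`, and the three callables); it does not describe the real `ai_router.py` interface.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class RouteDecision:
    provider: str
    model: str


@dataclass(frozen=True)
class Divergence:
    call_site: str
    enforced: RouteDecision
    shadow: Optional[RouteDecision]
    shadow_error: Optional[str]


def route_with_shadow(
    call_site: str,
    current_router: Callable[[str], RouteDecision],
    shadow_resolver: Callable[[str], RouteDecision],
    record_divergence: Callable[[Divergence], None],
) -> RouteDecision:
    """Return the current router's decision; only observe the shadow result."""
    enforced = current_router(call_site)  # existing behavior, always authoritative
    try:
        shadow = shadow_resolver(call_site)
        error = None
    except Exception as exc:  # a shadow failure must never affect the live path
        shadow, error = None, repr(exc)
    if shadow != enforced:
        record_divergence(Divergence(call_site, enforced, shadow, error))
    return enforced
```

Because the function always returns the enforced decision, removing the shadow resolver later is a no-op for callers, which keeps rollback trivial.
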
### Phase C: MCP Gateway wrapper

- Wrap read-only tools first.
- Enforce project/agent/tool/environment/approval checks only on low-risk read-only paths.
- Keep direct tool access as compatibility fallback but mark `forbid-new`.

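The Phase C wrapper could look roughly like the sketch below: grants are enforced only for tools declared read-only, and every other call falls through to the legacy path while being reported. All names here (`CallContext`, `GatewayDenied`, `ReadOnlyGatewayWrapper`, the tool names in the usage note) are hypothetical, and the approval dimension is left out on the assumption that read-only tools do not need it.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet, Tuple

Grant = Tuple[str, str, str, str]  # (project_id, agent_id, tool, environment)


@dataclass(frozen=True)
class CallContext:
    project_id: str
    agent_id: str
    environment: str


class GatewayDenied(Exception):
    """Raised when a read-only tool call has no matching grant."""


class ReadOnlyGatewayWrapper:
    def __init__(
        self,
        call_tool: Callable[[str, Dict[str, Any]], Any],
        read_only_tools: FrozenSet[str],
        grants: FrozenSet[Grant],
        on_legacy_call: Callable[[CallContext, str], None],
    ) -> None:
        self._call_tool = call_tool
        self._read_only_tools = read_only_tools
        self._grants = grants
        self._on_legacy_call = on_legacy_call

    def call(self, ctx: CallContext, tool: str, args: Dict[str, Any]) -> Any:
        if tool not in self._read_only_tools:
            # Compatibility fallback: not enforced yet, but reported so that
            # legacy direct usage stays visible while it is marked forbid-new.
            self._on_legacy_call(ctx, tool)
            return self._call_tool(tool, args)
        key = (ctx.project_id, ctx.agent_id, tool, ctx.environment)
        if key not in self._grants:
            raise GatewayDenied(f"no grant for {key}")
        return self._call_tool(tool, args)
```

With a grant for `("proj-a", "investigator", "get_pod_logs", "prod")`, the same agent in `proj-b` would be denied, while a non-read-only tool such as `restart_service` would still run through the legacy path and be counted.
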
### Phase D: Channel Event wrapper

- Make Telegram and workflow sends emit Channel Event delivery state.
- Keep actual Telegram adapter behavior stable.
- Prevent channel layer from gaining policy decisions.

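A Phase D wrapper can stay deliberately thin: it records delivery state around an existing send call and never inspects or changes the message. The sketch below uses hypothetical names (`ChannelEvent`, `send_with_delivery_state`) and hypothetical state labels (`attempted`, `delivered`, `failed`); the real Channel Event contract may differ.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class ChannelEvent:
    delivery_id: str
    channel: str
    state: str  # "attempted", "delivered", or "failed" (illustrative labels)
    at: float
    error: Optional[str] = None


def send_with_delivery_state(
    channel: str,
    send: Callable[[], Any],
    emit: Callable[[ChannelEvent], None],
) -> Any:
    """Run the unchanged adapter call and emit delivery state around it."""
    delivery_id = uuid.uuid4().hex
    emit(ChannelEvent(delivery_id, channel, "attempted", time.time()))
    try:
        result = send()
    except Exception as exc:
        emit(ChannelEvent(delivery_id, channel, "failed", time.time(), repr(exc)))
        raise  # adapter behavior stays stable: callers still see the error
    emit(ChannelEvent(delivery_id, channel, "delivered", time.time()))
    return result
```

Taking `send` as a zero-argument callable is a design choice: the wrapper has no access to message content, so it has nothing on which to base a policy decision.
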
### Phase E: Critical path strangler

- Move one low-risk LLM call path to EffectivePolicy enforcement.
- Do not start with host-resource, auto-repair, or governance-dispatch paths.
- Expand only after no duplicate incidents, no meta-system amplification, and no provider-route divergence.

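The Phase E rules can be expressed as a small gate. This is only a sketch of the stated conditions: `ObservationWindow`, `may_enforce`, and the path labels are hypothetical, and how the three counters are measured is left open.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Paths the plan says not to start with (labels are illustrative).
DEFERRED_PATHS = frozenset({"host-resource", "auto-repair", "governance-dispatch"})


@dataclass(frozen=True)
class ObservationWindow:
    duplicate_incidents: int
    meta_amplification_events: int
    route_divergences: int

    def is_clean(self) -> bool:
        return (
            self.duplicate_incidents == 0
            and self.meta_amplification_events == 0
            and self.route_divergences == 0
        )


def may_enforce(path: str, already_enforced: FrozenSet[str], window: ObservationWindow) -> bool:
    """Decide whether `path` may move to EffectivePolicy enforcement."""
    if not already_enforced:
        # The first enforced path must be a low-risk one.
        return path not in DEFERRED_PATHS
    # Any expansion requires a clean observation window.
    return window.is_clean()
```
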
## 8. Immediate Recommendations for the AwoooP Session

- Do not start with provider switching as the first visible runtime change.
- Start by making EffectivePolicy resolvable and auditable without enforcement.
- Treat `ai_slo_watchdog_job.py` and `governance_dispatcher.py` as platform-internal event producers, not tenant incidents.
- Keep existing `governance_dispatch_path_skip`-style anti-amplification safeguards out of scope for any speculative refactor.
- Do not weaken host/resource rule-first bypasses.
- Keep provider catalog platform-scoped, but make route permission project-scoped.
- Add `project_id`, `trace_id`, and `run_id` to mirrored events before moving execution paths.

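Two of the recommendations above, attaching identity to mirrored events and keeping platform-internal producers separate from tenant incidents, can be sketched together. The `mirror_event` function, the `scope` field, and the choice to generate missing `trace_id` / `run_id` values are assumptions for illustration, not the actual AwoooP schema.

```python
import uuid
from typing import Any, Dict, Mapping, Optional

REQUIRED_IDENTITY = ("project_id", "trace_id", "run_id")

# Producers the plan treats as platform-internal rather than tenant incidents.
PLATFORM_INTERNAL_SOURCES = frozenset({"ai_slo_watchdog_job", "governance_dispatcher"})


def mirror_event(
    raw: Mapping[str, Any],
    *,
    project_id: str,
    source: str,
    trace_id: Optional[str] = None,
    run_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Return a mirrored copy of `raw` with identity and scope attached."""
    mirrored = dict(raw)  # copy, so the legacy event object is never mutated
    mirrored["project_id"] = project_id
    # Assumption: missing identifiers are generated rather than rejected.
    mirrored["trace_id"] = trace_id or uuid.uuid4().hex
    mirrored["run_id"] = run_id or uuid.uuid4().hex
    mirrored["source"] = source
    mirrored["scope"] = (
        "platform_internal" if source in PLATFORM_INTERNAL_SOURCES else "tenant"
    )
    return mirrored
```
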
## 9. Open Questions

- Which AwoooP session branch or commit currently owns contract schema changes?
- Has EffectivePolicy already been materialized as code, or is it still only in ADR/docs?
- Which event families are already mirrored into AwoooP tables?
- Are Channel Hub delivery records authoritative yet, or still shadow/canary?
- Which provider calls still bypass `AIRouterExecutor`?
- Which MCP calls still bypass `plugins/mcp/gateway.py`?

## 10. Current Coordination Decision

This session will not modify runtime while AwoooP is being actively implemented elsewhere.

This document is the handoff artifact for aligning AWOOOI monitoring/alerting convergence with AwoooP platform work. Runtime changes should begin only after the implementation session confirms which contract layer is ready to receive mirrored events or policy-resolution inputs.

## 11. Implementation Session Consumption Log

### 2026-05-05 — AwoooP implementation session consumed this map

- `changed_surface`: `docs/awooop/MASTER-WORKPLAN.md`, `docs/adr/ADR-106-agent-platform-architecture.md`, MCP Gateway redaction path, Operator Console API response alignment.
- `runtime_behavior_changed`: yes, but limited to MCP result redaction before Runtime/LLM context and legacy MCP audit string redaction.
- `contract_family`: MCP Gateway, Runtime State, Channel Event, Governance.
- `migration_posture`: `mirror` / `wrap`; no provider switching, no Telegram cutover, no governance runtime refactor.
- `rollback_or_disable_flag`: revert MCP Gateway redaction patch if unexpected provider consumers require raw output; no production DB migration applied.
- `affected_event_types`: MCP tool results, operator approvals/runs console views, future mirrored alert/governance/channel events.
- `risk`: low to medium; security-positive redaction path, no paid-provider or destructive-tool enablement.
- `coordination_note`: Future AwoooP work must use this map as the safety order: mirror first, read-only EffectivePolicy comparison second, read-only MCP Gateway wrapper third, Channel Event wrapper fourth, low-risk LLM path strangler last.

@@ -4,6 +4,7 @@

**Date**: 2026-05-03
**Primary ADRs**: ADR-106 (architecture), ADR-107 (control-plane storage)
**Supersedes**: This file replaces `IMPLEMENTATION-ROADMAP.md` as the AwoooP master index; the old roadmap is kept as a first-pass draft, for historical comparison only
**Integration baseline**: `docs/awooop/AWOOOI-AWOOOP-AI-AUTONOMOUS-FLYWHEEL-INTEGRATION-PLAN.md`

---

@@ -11,6 +12,8 @@
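
Twelve agents (critic / vuln-verifier / debugger / db-expert / planner / fullstack-engineer / refactor-specialist / migration-engineer / onboarder / tool-expert / web-researcher / frontend-designer) independently reviewed the old Plan 1 and ADR-106 and found at least 12 P0 issues; a further 12 design gaps that would bite after implementation were added later.

2026-05-06 addendum: AwoooP is no longer treated as a standalone product line. It is the human-AI collaboration console, governance layer, audit layer, and operations layer of the AI automation flywheel. The integration baseline document is authoritative for the full owner list, waves, risk register, and acceptance method.

Conclusion: **Going straight into the Phase 1 SQL migration will blow up immediately.** First complete 5 ADRs and 4 inventories, and pin down every part of the Strangler Fig "data carrier, dual-write migration, hard boundary interception, replayability, auditability" before writing code.

---
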