# AwoooP x Monitoring / Alerting Convergence Map

**Date**: 2026-05-05

**Status**: Coordination map, no runtime change

**Scope**: AWOOOI monitoring, alerting, incident, AI provider routing, MCP, Telegram, governance loops, and AwoooP contract convergence

**Superseded by integration baseline for execution waves**: `docs/awooop/AWOOOI-AWOOOP-AI-AUTONOMOUS-FLYWHEEL-INTEGRATION-PLAN.md`

## 0. Purpose

AwoooP is being implemented in another active session. This document exists to keep the sessions aligned without competing for the same runtime files.

This session owns the full-flow map of the existing AWOOOI monitoring and alerting flywheel. The AwoooP implementation session owns platform schema, contracts, migrations, and runtime/control-plane implementation.

2026-05-06 update: this file remains the monitoring/alerting convergence handoff. Execution planning, 12-agent ownership, P0/P1/P2 risk register, and wave sequencing now live in the integration baseline document linked above.

The integration rule is simple:

- AwoooP should not become a parallel platform beside the old alerting flywheel.
- Existing alerting and repair paths should be progressively mirrored, wrapped, and then governed by AwoooP contracts.
- Provider routing, MCP access, channel delivery, incident state, and governance feedback must converge through contracts instead of staying as scattered direct calls.

## 1. Session Boundary

### This session may do

- Map all monitoring, alerting, incident, governance, provider, MCP, and channel nodes.
- Identify overlap with AwoooP contracts.
- Mark migration posture for each node: `keep`, `mirror`, `wrap`, `replace`, or `forbid-new`.
- Produce conflict warnings for the AwoooP implementation session.
- Update docs and LOGBOOK with coordination state.

### This session must not do yet

- Change `decision_manager.py` runtime behavior.
- Change `governance_dispatcher.py` runtime behavior.
- Change `ai_router.py` / `ollama_failover_manager.py` routing behavior.
- Change production K8s provider configuration.
- Change Telegram delivery behavior.
- Introduce paid provider behavior or raise provider budgets.

### AwoooP implementation session likely owns

- Contract schemas and validators.
- AwoooP DB models and migrations.
- Runtime/control-plane APIs.
- EffectivePolicy resolution.
- Contract revision source of truth.
- MCP Gateway implementation slices.
- Channel Hub / Channel Event runtime implementation.

## 2. Current Full-Flow Map

```text
Monitoring / Event Sources
  Prometheus / Alertmanager / SignOz / Sentry / watchdog / governance loops / background jobs
    ↓
Ingress and Normalization
  webhooks.py / AlertAnalyzer / fingerprint / in-flight lock / classify_alert_early
    ↓
Incident and Audit State
  incidents / approval_records / alert_operation_log / timeline_events / governance_remediation_dispatch
    ↓
Evidence and Sensing
  pre_decision_investigator / MCP ProviderRegistry / k8s / prom / signoz / ssh / db / rag
    ↓
Rules and Cognition
  alert_rule_engine / OpenClaw / AIRouter / decision_fusion_adapter / agents
    ↓
Provider Routing
  OLLAMA_URL configmap / ollama_failover_manager / AIRouter / AIProviderRegistry / NVIDIA / Gemini / Claude
    ↓
Decision and Approval
  decision_manager / proposal service / approval service / governance_dispatcher
    ↓
Execution
  approval_execution / auto_repair_service / MCP gateway / SSH provider / K8s provider / emergency escalation
    ↓
Channel Delivery
  telegram_gateway / failover_alerter / workflow notifications / Channel Hub
    ↓
Verification and Learning
  post-execution verification / learning_service / playbook generation / ai_slo_watchdog / governance loop
```

## 3. AwoooP Contract Overlay

| Existing layer | AwoooP contract | Convergence need |
| --- | --- | --- |
| Alertmanager, SignOz, watchdog, governance loops | Communication / Channel Event | Normalize all inbound events into one event envelope before business decisions. |
| Incident, approval, AOL, timeline, governance dispatch | Runtime State | Add project/run/trace identity and make event state joinable. |
| OpenClaw, AIRouter, ollama failover, cloud fallback | Policy / Routing | Resolve provider/model/fallback through EffectivePolicy, not scattered env/direct calls. |
| Agents, prompts, schemas, role-specific tools | Agent Contract | Agents become versioned capability modules with explicit I/O and safety ceilings. |
| K8s, SSH, SignOz, Prometheus, DB, RAG tools | MCP Gateway | Tool access must satisfy Project AND Agent AND Tool AND Environment AND Approval. |
| Telegram, workflow notification, failover messages | Communication / Channel Event | Channel adapters stay thin and must not decide LLM, MCP, approval, or business policy. |
| ai_slo_watchdog, governance loop, remediation dispatch | Governance / Policy | Meta-system events must be deduped, scoped, and prevented from self-amplifying. |

## 4. Known Overlap and Conflict Zones

### 4.1 Provider routing

Current routing does not yet go through a single control plane:

- `OLLAMA_URL`, `OLLAMA_SECONDARY_URL`, and `OLLAMA_FALLBACK_URL` are centralized in K8s config for Ollama topology.
- Production runtime routes Ollama through the 110 proxy pool: `110:11435` → GCP-A, `110:11436` → GCP-B, `110:11437` → Local `.111`.
- `ollama_failover_manager.py` still owns important runtime failover logic.
- `AIRouter` and `AIProviderRegistry` exist, but not every LLM use path is guaranteed to enter through them.
- Some knowledge-side services may still call provider-specific clients directly.
- `INV-10` now tracks direct `OLLAMA_URL` call sites so GCP-B can become an active batch/shadow lane instead of only a passive fallback.

AwoooP action:

- Treat provider catalog as platform-level resource.
- Treat per-project route selection as EffectivePolicy.
- Do not move provider behavior until the old call sites are mapped and wrapped.
- Preserve GCP-A for interactive low-latency work, use GCP-B for batch/RAG/embedding/shadow/canary, and keep Local `.111` for `local_required` / DR.

Migration posture: `wrap` first, `replace` later.

### 4.2 Monitoring and alerting ingress

Current ingress is multi-path:

- Alertmanager webhook paths.
- SignOz webhook path.
- Sentry or app error paths.
- AI SLO watchdog.
- Governance events and remediation dispatch.
- Background loops that synthesize incidents or meta-alerts.

AwoooP action:

- Define a Channel Event envelope that can represent all of these without losing raw source payload.
- Mirror events first; do not force all production ingress through AwoooP until parity and dedupe are proven.

Migration posture: `mirror` first, `wrap` later.

### 4.3 Rule-first versus LLM-first

Historical failures show that host/resource alerts can be polluted by LLM suggestions if rule-first gates are bypassed.

AwoooP action:

- Preserve rule-first safety gates during migration.
- EffectivePolicy should be able to declare `llm_allowed=false` or `rule_authoritative=true` for specific categories.
- Channel Event and Policy/Routing contracts must not weaken existing host-vs-k8s domain guards.

Migration posture: `keep` safety gates, then `wrap` with policy metadata.

### 4.4 MCP and direct tools

Current MCP capability exists, but provider registries and tools are not yet entirely tenant/project scoped.

AwoooP action:

- Keep platform-level provider registry for tool implementations.
- Move grants, credential refs, environment boundaries, and project permission decisions into MCP Gateway policy.
- Ensure direct tool calls become `forbid-new` once gateway coverage exists.

Migration posture: `wrap` existing tools, then `forbid-new` direct access.

### 4.5 Telegram and Channel Hub

Telegram currently carries incident cards, meta alerts, failover alerts, approval messages, workflow notifications, and status updates.

AwoooP action:

- Channel adapters must remain delivery-only.
- They may escape, retry, edit, and track provider message IDs.
- They must not decide provider route, approval policy, incident category, MCP access, or business logic.

Migration posture: `mirror` outbound delivery state, then `wrap` through Channel Event.

### 4.6 Meta-system governance loops

Recent production verification showed the desired health signal: multiple stale-event cycles processed without `meta_alert` / `META_SYSTEM` amplification.

AwoooP action:

- Model governance events as platform-internal runtime events.
- Require dedupe/cooldown/active-state checks to be part of the governance policy contract.
- Keep `governance_dispatch_path_skip`-style safeguards visible in audit state.

Migration posture: `keep` current guard, `mirror` event state, then `wrap` into governance policy.

## 5. Node Inventory for Coordination

| Node | Current role | AwoooP target | Posture | Notes |
| --- | --- | --- | --- | --- |
| `api/v1/webhooks.py` | Alertmanager/SignOz ingress, incident creation, LLM fallback paths | Channel Event ingress + Runtime State | `mirror` | Do not rewrite first; mirror selected events. |
| `incident_service.py` | Incident lifecycle and early classification | Runtime State + L3 cognition input | `wrap` | Preserve existing classification behavior until EvidenceSnapshot parity. |
| `alert_rule_engine.py` | YAML rules and safety gates | Policy/Routing safety constraints + Playbook policy | `keep` | Rule-first gates must survive migration. |
| `pre_decision_investigator.py` | Evidence gathering through MCP tools | MCP Gateway + Runtime evidence snapshot | `wrap` | Good first bridge to gateway-controlled sensing. |
| `plugins/mcp/registry.py` | MCP provider registry | Platform provider registry with per-project grants in DB | `wrap` | Registry can stay platform-level; grants move out. |
| `plugins/mcp/gateway.py` | MCP gate implementation slice | AwoooP MCP Gateway | `replace` later | Candidate canonical gateway after contract alignment. |
| `ai_router.py` | Provider selection and execution | Policy/Routing EffectivePolicy executor | `wrap` | Do not bypass; gradually make EffectivePolicy an input. |
| `ollama_failover_manager.py` | Ollama topology/failover | Platform provider health + topology resource | `wrap` | Keep runtime behavior while policy contract learns from it; see `INV-10` for GCP-A/B active-active usage. |
| `decision_fusion_adapter.py` | Decision fusion / routing bridge | Agent + Policy/Routing integration | `wrap` | High-risk bridge, no direct rewrite without plan. |
| `decision_manager.py` | Core decision and execution orchestration | Agent Runtime + Runtime State | `keep` | Tier-sensitive; last-mile migration only. |
| `approval_execution.py` | Approval action execution | Runtime execution + MCP Gateway | `wrap` | Preserve host-vs-k8s guards. |
| `auto_repair_service.py` | Repair execution and verification | Runtime execution + verification contract | `wrap` | Must not auto-resolve without verified success. |
| `telegram_gateway.py` | Telegram send/edit/delete and cards | Channel adapter | `wrap` | Delivery only, not policy. |
| `failover_alerter.py` | Provider/governance alert rendering | Channel adapter + governance event view | `wrap` | Human-readable rendering remains useful. |
| `ai_slo_watchdog_job.py` | Platform self-monitoring | Governance policy + platform internal events | `keep` | Current anti-amplification behavior is important. |
| `governance_dispatcher.py` | Governance event dispatch | Governance runtime dispatch | `keep` | Very sensitive; do not alter while other session changes platform. |
| `learning_service.py` | Learning, playbook generation side effects | Runtime learning + Agent trace | `wrap` | Attach evidence/run identity before provider changes. |
| `runbook_generator.py` | Runbook/anti-pattern generation | Knowledge side effect under Policy/Routing | `wrap` | Direct provider usage should eventually route through EffectivePolicy. |

## 6. Recommended Shared Handoff Protocol

When either session changes a relevant area, update this document or LOGBOOK with:

- `changed_surface`: file/module/contract touched.
- `runtime_behavior_changed`: yes/no.
- `contract_family`: Project, Agent, MCP Gateway, Policy/Routing, Runtime State, Channel Event, Governance.
- `migration_posture`: keep, mirror, wrap, replace, forbid-new.
- `rollback_or_disable_flag`: exact flag if applicable.
- `affected_event_types`: alertmanager, signoz, watchdog, governance, provider_failover, approval, execution, channel.
- `risk`: low/medium/high.
- `coordination_note`: what the other session must avoid or consume.

## 7. Safe Integration Order

### Phase A: Shared map and mirror-only state

- Keep old runtime behavior.
- Mirror selected alert/governance/channel events into AwoooP runtime state.
- Add no new provider behavior.
- Add no new paid fallback.

### Phase B: Policy/Routing as read-only resolver

- Compute EffectivePolicy but do not enforce it on critical decisions yet.
- Compare EffectivePolicy result against current AIRouter / failover result.
- Store divergence for review.

### Phase C: MCP Gateway wrapper

- Wrap read-only tools first.
- Enforce project/agent/tool/environment/approval only on low-risk read-only paths.
- Keep direct tool access as compatibility fallback but mark `forbid-new`.

### Phase D: Channel Event wrapper

- Make Telegram and workflow sends emit Channel Event delivery state.
- Keep actual Telegram adapter behavior stable.
- Prevent channel layer from gaining policy decisions.

### Phase E: Critical path strangler

- Move one low-risk LLM call path to EffectivePolicy enforcement.
- Avoid host-resource, auto-repair, and governance-dispatch first.
- Expand only after no duplicate incidents, no meta-system amplification, and no provider-route divergence.

## 8. Immediate Recommendations for the AwoooP Session

- Do not start with provider switching as the first visible runtime change.
- Start by making EffectivePolicy resolvable and auditable without enforcement.
- Treat `ai_slo_watchdog_job.py` and `governance_dispatcher.py` as platform-internal event producers, not tenant incidents.
- Keep existing `governance_dispatch_path_skip`-style anti-amplification safeguards outside any speculative refactor.
- Do not weaken the rule-first gates for host/resource alerts.
- Keep provider catalog platform-scoped, but make route permission project-scoped.
- Add `project_id`, `trace_id`, and `run_id` to mirrored events before moving execution paths.

## 9. Open Questions

- Which AwoooP session branch or commit currently owns contract schema changes?
- Has EffectivePolicy already been materialized as code, or is it still only in ADR/docs?
- Which event families are already mirrored into AwoooP tables?
- Are Channel Hub delivery records authoritative yet, or still shadow/canary?
- Which provider calls still bypass `AIRouterExecutor`?
- Which MCP calls still bypass `plugins/mcp/gateway.py`?

## 10. Current Coordination Decision

This session will not modify runtime while AwoooP is being actively implemented elsewhere.

This document is the handoff artifact for aligning AWOOOI monitoring/alerting convergence with AwoooP platform work.

Runtime changes should begin only after the implementation session confirms which contract layer is ready to receive mirrored events or policy-resolution inputs.

## 11. Implementation Session Consumption Log

### 2026-05-05: AwoooP implementation session consumed this map

- `changed_surface`: `docs/awooop/MASTER-WORKPLAN.md`, `docs/adr/ADR-106-agent-platform-architecture.md`, MCP Gateway redaction path, Operator Console API response alignment.
- `runtime_behavior_changed`: yes, but limited to MCP result redaction before Runtime/LLM context and legacy MCP audit string redaction.
- `contract_family`: MCP Gateway, Runtime State, Channel Event, Governance.
- `migration_posture`: `mirror` / `wrap`; no provider switching, no Telegram cutover, no governance runtime refactor.
- `rollback_or_disable_flag`: revert MCP Gateway redaction patch if unexpected provider consumers require raw output; no production DB migration applied.
- `affected_event_types`: MCP tool results, operator approvals/runs console views, future mirrored alert/governance/channel events.
- `risk`: low to medium; security-positive redaction path, no paid-provider or destructive-tool enablement.
- `coordination_note`: Future AwoooP work must use this map as the safety order: mirror first, read-only EffectivePolicy comparison second, read-only MCP Gateway wrapper third, Channel Event wrapper fourth, low-risk LLM path strangler last.
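
The safety order named in the coordination note above (mirror, read-only EffectivePolicy comparison, read-only MCP Gateway wrapper, Channel Event wrapper, low-risk LLM path strangler) can be read as a strict sequence gate. The following is a minimal Python sketch of that reading, not an existing AwoooP API: `Phase`, `may_start`, and `next_allowed_phase` are hypothetical names introduced only for illustration, and how a phase is judged "completed" is left as an assumption.

```python
from enum import IntEnum
from typing import Iterable, Optional


class Phase(IntEnum):
    """Hypothetical encoding of the safety order; values give the sequence."""
    MIRROR = 1                  # mirror-only state
    POLICY_READ_ONLY = 2        # read-only EffectivePolicy comparison
    MCP_GATEWAY_READ_ONLY = 3   # read-only MCP Gateway wrapper
    CHANNEL_EVENT_WRAPPER = 4   # Channel Event wrapper
    LLM_PATH_STRANGLER = 5      # low-risk LLM path strangler


def may_start(phase: Phase, completed: Iterable[Phase]) -> bool:
    """A phase may start only when every earlier phase is completed."""
    done = set(completed)
    return all(earlier in done for earlier in Phase if earlier < phase)


def next_allowed_phase(completed: Iterable[Phase]) -> Optional[Phase]:
    """Return the earliest phase not yet completed, or None when all are done."""
    done = set(completed)
    for phase in Phase:  # iterates in definition order
        if phase not in done:
            return phase
    return None
```

Under these assumptions, `may_start(Phase.MCP_GATEWAY_READ_ONLY, {Phase.MIRROR})` is false because the read-only EffectivePolicy comparison has not finished, which matches the intent that no later wrapper is introduced before the earlier one has proven parity.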