AwoooP x Monitoring / Alerting Convergence Map
Date: 2026-05-05
Status: Coordination map, no runtime change
Scope: AWOOOI monitoring, alerting, incident, AI provider routing, MCP, Telegram, governance loops, and AwoooP contract convergence
Superseded by integration baseline for execution waves: docs/awooop/AWOOOI-AWOOOP-AI-AUTONOMOUS-FLYWHEEL-INTEGRATION-PLAN.md
0. Purpose
AwoooP is being implemented in another active session. This document exists to keep the sessions aligned without competing for the same runtime files.
This session owns the full-flow map of the existing AWOOOI monitoring and alerting flywheel. The AwoooP implementation session owns platform schema, contracts, migrations, and runtime/control-plane implementation.
2026-05-06 update: this file remains the monitoring/alerting convergence handoff. Execution planning, 12-agent ownership, P0/P1/P2 risk register, and wave sequencing now live in the integration baseline document linked above.
The integration rule is simple:
- AwoooP should not become a parallel platform beside the old alerting flywheel.
- Existing alerting and repair paths should be progressively mirrored, wrapped, and then governed by AwoooP contracts.
- Provider routing, MCP access, channel delivery, incident state, and governance feedback must converge through contracts instead of staying as scattered direct calls.
1. Session Boundary
This session may do
- Map all monitoring, alerting, incident, governance, provider, MCP, and channel nodes.
- Identify overlap with AwoooP contracts.
- Mark migration posture for each node: `keep`, `mirror`, `wrap`, `replace`, or `forbid-new`.
- Produce conflict warnings for the AwoooP implementation session.
- Update docs and LOGBOOK with coordination state.
This session must not do yet
- Change `decision_manager.py` runtime behavior.
- Change `governance_dispatcher.py` runtime behavior.
- Change `ai_router.py` / `ollama_failover_manager.py` routing behavior.
- Change production K8s provider configuration.
- Change Telegram delivery behavior.
- Introduce paid provider behavior or raise provider budgets.
AwoooP implementation session likely owns
- Contract schemas and validators.
- AwoooP DB models and migrations.
- Runtime/control-plane APIs.
- EffectivePolicy resolution.
- Contract revision source of truth.
- MCP Gateway implementation slices.
- Channel Hub / Channel Event runtime implementation.
2. Current Full-Flow Map
Monitoring / Event Sources
Prometheus / Alertmanager / SignOz / Sentry / watchdog / governance loops / background jobs
↓
Ingress and Normalization
webhooks.py / AlertAnalyzer / fingerprint / in-flight lock / classify_alert_early
↓
Incident and Audit State
incidents / approval_records / alert_operation_log / timeline_events / governance_remediation_dispatch
↓
Evidence and Sensing
pre_decision_investigator / MCP ProviderRegistry / k8s / prom / signoz / ssh / db / rag
↓
Rules and Cognition
alert_rule_engine / OpenClaw / AIRouter / decision_fusion_adapter / agents
↓
Provider Routing
OLLAMA_URL configmap / ollama_failover_manager / AIRouter / AIProviderRegistry / NVIDIA / Gemini / Claude
↓
Decision and Approval
decision_manager / proposal service / approval service / governance_dispatcher
↓
Execution
approval_execution / auto_repair_service / MCP gateway / SSH provider / K8s provider / emergency escalation
↓
Channel Delivery
telegram_gateway / failover_alerter / workflow notifications / Channel Hub
↓
Verification and Learning
post-execution verification / learning_service / playbook generation / ai_slo_watchdog / governance loop
3. AwoooP Contract Overlay
| Existing layer | AwoooP contract | Convergence need |
|---|---|---|
| Alertmanager, SignOz, watchdog, governance loops | Communication / Channel Event | Normalize all inbound events into one event envelope before business decisions. |
| Incident, approval, AOL, timeline, governance dispatch | Runtime State | Add project/run/trace identity and make event state joinable. |
| OpenClaw, AIRouter, ollama failover, cloud fallback | Policy / Routing | Resolve provider/model/fallback through EffectivePolicy, not scattered env/direct calls. |
| Agents, prompts, schemas, role-specific tools | Agent Contract | Agents become versioned capability modules with explicit I/O and safety ceilings. |
| K8s, SSH, SignOz, Prometheus, DB, RAG tools | MCP Gateway | Tool access must satisfy Project AND Agent AND Tool AND Environment AND Approval. |
| Telegram, workflow notification, failover messages | Communication / Channel Event | Channel adapters stay thin and must not decide LLM, MCP, approval, or business policy. |
| ai_slo_watchdog, governance loop, remediation dispatch | Governance / Policy | Meta-system events must be deduped, scoped, and prevented from self-amplifying. |
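The MCP Gateway row describes a five-way conjunction. A minimal sketch of that check follows, assuming grants are stored as per-project rows. `ToolRequest`, `Grant`, `is_allowed`, and every field name are hypothetical and do not come from the AwoooP schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolRequest:
    """One tool call as a gateway might see it (hypothetical field names)."""
    project_id: str
    agent_id: str
    tool: str
    environment: str
    approved: bool  # True if an approval record covers this call


@dataclass(frozen=True)
class Grant:
    """A hypothetical per-project grant row."""
    project_id: str
    agent_id: str
    tool: str
    environments: frozenset


def is_allowed(req: ToolRequest, grants: list, approval_required: set) -> bool:
    """Allow only when Project AND Agent AND Tool AND Environment AND Approval all hold."""
    for grant in grants:
        if (
            grant.project_id == req.project_id
            and grant.agent_id == req.agent_id
            and grant.tool == req.tool
            and req.environment in grant.environments
        ):
            # Approval is the final conjunct; tools outside approval_required pass without one.
            return req.approved or req.tool not in approval_required
    # No grant matched on all four scoping dimensions: deny by default.
    return False
```

The default-deny return at the end is the design point: a missing grant on any single dimension is enough to refuse the call.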
4. Known Overlap and Conflict Zones
4.1 Provider routing
Current routing is not yet one-control-plane:
- `OLLAMA_URL`, `OLLAMA_SECONDARY_URL`, and `OLLAMA_FALLBACK_URL` are centralized in K8s config for Ollama topology.
- Production runtime routes Ollama through the 110 proxy pool: `110:11435` → GCP-A, `110:11436` → GCP-B, `110:11437` → Local `.111`.
- `ollama_failover_manager.py` still owns important runtime failover logic.
- `AIRouter` and `AIProviderRegistry` exist, but not every LLM use path is guaranteed to enter through them.
- Some knowledge-side services may still call provider-specific clients directly.
- `INV-10` now tracks direct `OLLAMA_URL` call sites so GCP-B can become an active batch/shadow lane instead of only a passive fallback.
AwoooP action:
- Treat provider catalog as platform-level resource.
- Treat per-project route selection as EffectivePolicy.
- Do not move provider behavior until the old call sites are mapped and wrapped.
- Preserve GCP-A for interactive low-latency work, use GCP-B for batch/RAG/embedding/shadow/canary, and keep Local `.111` for `local_required` / DR.
Migration posture: wrap first, replace later.
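The lane split described above can be sketched as a pure function that a read-only EffectivePolicy resolver could call. This is an illustration only: `Workload`, `select_lane`, and the `disaster_recovery` flag are hypothetical names, and the endpoint strings are copied verbatim from the proxy-pool description in this section.

```python
from enum import Enum


class Workload(Enum):
    INTERACTIVE = "interactive"
    BATCH = "batch"
    RAG = "rag"
    EMBEDDING = "embedding"
    SHADOW = "shadow"
    CANARY = "canary"


# Endpoints as written in this document; the lane keys are hypothetical labels.
LANE_ENDPOINTS = {
    "gcp-a": "110:11435",
    "gcp-b": "110:11436",
    "local": "110:11437",
}


def select_lane(workload: Workload, local_required: bool = False,
                disaster_recovery: bool = False) -> str:
    """Pick a lane name without touching any runtime routing state."""
    # Local is reserved for local_required work and DR.
    if local_required or disaster_recovery:
        return "local"
    # GCP-A stays dedicated to interactive, low-latency work.
    if workload is Workload.INTERACTIVE:
        return "gcp-a"
    # Everything else (batch, RAG, embedding, shadow, canary) goes to GCP-B.
    return "gcp-b"
```

Because the function has no side effects, it can run beside `ollama_failover_manager.py` for comparison without changing provider behavior.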
4.2 Monitoring and alerting ingress
Current ingress is multi-path:
- Alertmanager webhook paths.
- SignOz webhook path.
- Sentry or app error paths.
- AI SLO watchdog.
- Governance events and remediation dispatch.
- Background loops that synthesize incidents or meta-alerts.
AwoooP action:
- Define a Channel Event envelope that can represent all of these without losing raw source payload.
- Mirror events first; do not force all production ingress through AwoooP until parity and dedupe are proven.
Migration posture: mirror first, wrap later.
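One possible shape for such an envelope is sketched below. It assumes a JSON-like source payload, and the class, function, and field names are all hypothetical. The fingerprint shown here is an illustrative content hash, not the fingerprint logic already used at ingress.

```python
import copy
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class ChannelEvent:
    """Hypothetical envelope: normalized header fields plus the untouched source payload."""
    source: str            # e.g. "alertmanager", "signoz", "watchdog"
    event_type: str
    fingerprint: str       # stable hash used for mirror-side dedupe
    raw_payload: Mapping[str, Any]
    project_id: Optional[str] = None
    trace_id: Optional[str] = None
    run_id: Optional[str] = None


def mirror_event(source: str, event_type: str, payload: Mapping[str, Any],
                 project_id: Optional[str] = None, trace_id: Optional[str] = None,
                 run_id: Optional[str] = None) -> ChannelEvent:
    """Wrap a source payload without modifying or dropping any of it."""
    # Canonical JSON (sorted keys) so key order does not change the fingerprint.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), default=str)
    fingerprint = hashlib.sha256(f"{source}:{canonical}".encode("utf-8")).hexdigest()
    return ChannelEvent(
        source=source,
        event_type=event_type,
        fingerprint=fingerprint,
        raw_payload=copy.deepcopy(payload),  # keep an independent copy of the raw body
        project_id=project_id,
        trace_id=trace_id,
        run_id=run_id,
    )
```

Keeping `raw_payload` intact means parity checks against the legacy path can be rerun later even if the normalized fields change.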
4.3 Rule-first versus LLM-first
Historical failures show that host/resource alerts can be polluted by LLM suggestions if rule-first gates are bypassed.
AwoooP action:
- Preserve rule-first safety gates during migration.
- EffectivePolicy should be able to declare `llm_allowed=false` or `rule_authoritative=true` for specific categories.
- Channel Event and Policy/Routing contracts must not weaken existing host-vs-k8s domain guards.
Migration posture: keep safety gates, then wrap with policy metadata.
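How the two flags might interact is sketched below. This is one possible reading, not the defined EffectivePolicy semantics: here `rule_authoritative` means a matched rule cannot be overridden, and `llm_allowed=False` means the LLM is never called. `CategoryPolicy`, `Decision`, and `decide` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class CategoryPolicy:
    """Hypothetical per-category slice of an EffectivePolicy."""
    llm_allowed: bool = True
    rule_authoritative: bool = False


@dataclass(frozen=True)
class Decision:
    action: Optional[str]
    source: str  # "rule", "llm", or "none"


def decide(policy: CategoryPolicy, rule_action: Optional[str],
           llm_suggest: Callable[[], Optional[str]]) -> Decision:
    """Apply the rule-first gate before any LLM suggestion is considered."""
    # A matched rule wins outright when rules are authoritative for this category.
    if rule_action is not None and policy.rule_authoritative:
        return Decision(rule_action, "rule")
    # With the LLM disallowed it is never invoked; fall back to the rule result, if any.
    if not policy.llm_allowed:
        return Decision(rule_action, "rule" if rule_action is not None else "none")
    suggestion = llm_suggest()
    if suggestion is not None:
        return Decision(suggestion, "llm")
    return Decision(rule_action, "rule" if rule_action is not None else "none")
```

Passing the LLM as a callable keeps the gate in front of the provider call, so a disallowed category incurs no provider traffic at all.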
4.4 MCP and direct tools
Current MCP capability exists, but provider registries and tools are not yet entirely tenant/project scoped.
AwoooP action:
- Keep platform-level provider registry for tool implementations.
- Move grants, credential refs, environment boundaries, and project permission decisions into MCP Gateway policy.
- Ensure direct tool calls become `forbid-new` once gateway coverage exists.
Migration posture: wrap existing tools, then forbid-new direct access.
4.5 Telegram and Channel Hub
Telegram currently carries incident cards, meta alerts, failover alerts, approval messages, workflow notifications, and status updates.
AwoooP action:
- Channel adapters must remain delivery-only.
- They may escape, retry, edit, and track provider message IDs.
- They must not decide provider route, approval policy, incident category, MCP access, or business logic.
Migration posture: mirror outbound delivery state, then wrap through Channel Event.
4.6 Meta-system governance loops
Recent production verification showed the desired health signal: multiple stale-event cycles processed without meta_alert / META_SYSTEM amplification.
AwoooP action:
- Model governance events as platform-internal runtime events.
- Require dedupe/cooldown/active-state checks to be part of the governance policy contract.
- Keep `governance_dispatch_path_skip`-style safeguards visible in audit state.
Migration posture: keep current guard, mirror event state, then wrap into governance policy.
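The dedupe, cooldown, and active-state checks can be illustrated with a small in-memory guard. This is a sketch under stated assumptions, not the existing `governance_dispatch_path_skip` logic: `AmplificationGuard` and its methods are hypothetical, and a real implementation would presumably persist state rather than hold it in process memory.

```python
from dataclasses import dataclass, field


@dataclass
class AmplificationGuard:
    """Hypothetical guard: one emission per key while active, plus a cooldown after resolve."""
    cooldown_seconds: float
    _last_emitted: dict = field(default_factory=dict)
    _active: set = field(default_factory=set)

    def should_emit(self, key: str, now: float) -> bool:
        # Active-state check: an unresolved event for this key already exists.
        if key in self._active:
            return False
        # Cooldown check: the same key was emitted too recently.
        last = self._last_emitted.get(key)
        if last is not None and now - last < self.cooldown_seconds:
            return False
        self._last_emitted[key] = now
        self._active.add(key)
        return True

    def resolve(self, key: str) -> None:
        """Mark the event for this key as no longer active."""
        self._active.discard(key)
```

Taking `now` as an argument keeps the guard deterministic, which makes it straightforward to replay recorded stale-event cycles against it.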
5. Node Inventory for Coordination
| Node | Current role | AwoooP target | Posture | Notes |
|---|---|---|---|---|
| `api/v1/webhooks.py` | Alertmanager/SignOz ingress, incident creation, LLM fallback paths | Channel Event ingress + Runtime State | mirror | Do not rewrite first; mirror selected events. |
| `incident_service.py` | Incident lifecycle and early classification | Runtime State + L3 cognition input | wrap | Preserve existing classification behavior until EvidenceSnapshot parity. |
| `alert_rule_engine.py` | YAML rules and safety gates | Policy/Routing safety constraints + Playbook policy | keep | Rule-first gates must survive migration. |
| `pre_decision_investigator.py` | Evidence gathering through MCP tools | MCP Gateway + Runtime evidence snapshot | wrap | Good first bridge to gateway-controlled sensing. |
| `plugins/mcp/registry.py` | MCP provider registry | Platform provider registry with per-project grants in DB | wrap | Registry can stay platform-level; grants move out. |
| `plugins/mcp/gateway.py` | MCP gate implementation slice | AwoooP MCP Gateway | replace later | Candidate canonical gateway after contract alignment. |
| `ai_router.py` | Provider selection and execution | Policy/Routing EffectivePolicy executor | wrap | Do not bypass; gradually make EffectivePolicy an input. |
| `ollama_failover_manager.py` | Ollama topology/failover | Platform provider health + topology resource | wrap | Keep runtime behavior while policy contract learns from it; see INV-10 for GCP-A/B active-active usage. |
| `decision_fusion_adapter.py` | Decision fusion / routing bridge | Agent + Policy/Routing integration | wrap | High-risk bridge, no direct rewrite without plan. |
| `decision_manager.py` | Core decision and execution orchestration | Agent Runtime + Runtime State | keep | Tier-sensitive; last-mile migration only. |
| `approval_execution.py` | Approval action execution | Runtime execution + MCP Gateway | wrap | Preserve host-vs-k8s guards. |
| `auto_repair_service.py` | Repair execution and verification | Runtime execution + verification contract | wrap | Must not auto-resolve without verified success. |
| `telegram_gateway.py` | Telegram send/edit/delete and cards | Channel adapter | wrap | Delivery only, not policy. |
| `failover_alerter.py` | Provider/governance alert rendering | Channel adapter + governance event view | wrap | Human-readable rendering remains useful. |
| `ai_slo_watchdog_job.py` | Platform self-monitoring | Governance policy + platform internal events | keep | Current anti-amplification behavior is important. |
| `governance_dispatcher.py` | Governance event dispatch | Governance runtime dispatch | keep | Very sensitive; do not alter while other session changes platform. |
| `learning_service.py` | Learning, playbook generation side effects | Runtime learning + Agent trace | wrap | Attach evidence/run identity before provider changes. |
| `runbook_generator.py` | Runbook/anti-pattern generation | Knowledge side effect under Policy/Routing | wrap | Direct provider usage should eventually route through EffectivePolicy. |
6. Recommended Shared Handoff Protocol
When either session changes a relevant area, update this document or LOGBOOK with:
- `changed_surface`: file/module/contract touched.
- `runtime_behavior_changed`: yes/no.
- `contract_family`: Project, Agent, MCP Gateway, Policy/Routing, Runtime State, Channel Event, Governance.
- `migration_posture`: keep, mirror, wrap, replace, forbid-new.
- `rollback_or_disable_flag`: exact flag if applicable.
- `affected_event_types`: alertmanager, signoz, watchdog, governance, provider_failover, approval, execution, channel.
- `risk`: low/medium/high.
- `coordination_note`: what the other session must avoid or consume.
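If the sessions want handoff notes to be machine-checkable, the fields above could be validated with something like the following sketch. `HandoffNote` and `validate` are hypothetical. The rule that a runtime change must name a rollback flag is an assumption added for illustration, and free-text values such as "low to medium" would need mapping to one of the three risk levels first.

```python
from dataclasses import dataclass

CONTRACT_FAMILIES = {"Project", "Agent", "MCP Gateway", "Policy/Routing",
                     "Runtime State", "Channel Event", "Governance"}
POSTURES = {"keep", "mirror", "wrap", "replace", "forbid-new"}
RISKS = {"low", "medium", "high"}


@dataclass(frozen=True)
class HandoffNote:
    """Hypothetical structured form of one handoff entry."""
    changed_surface: str
    runtime_behavior_changed: bool
    contract_family: tuple
    migration_posture: tuple
    risk: str
    coordination_note: str
    rollback_or_disable_flag: str = ""


def validate(note: HandoffNote) -> list:
    """Return a list of human-readable problems; an empty list means the note is well-formed."""
    problems = []
    problems += [f"unknown contract_family: {v}"
                 for v in note.contract_family if v not in CONTRACT_FAMILIES]
    problems += [f"unknown migration_posture: {v}"
                 for v in note.migration_posture if v not in POSTURES]
    if note.risk not in RISKS:
        problems.append(f"unknown risk: {note.risk}")
    # Assumption for this sketch: any runtime change must say how to roll it back.
    if note.runtime_behavior_changed and not note.rollback_or_disable_flag:
        problems.append("runtime change without rollback_or_disable_flag")
    return problems
```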
7. Safe Integration Order
Phase A: Shared map and mirror-only state
- Keep old runtime behavior.
- Mirror selected alert/governance/channel events into AwoooP runtime state.
- Add no new provider behavior.
- Add no new paid fallback.
Phase B: Policy/Routing as read-only resolver
- Compute EffectivePolicy but do not enforce it on critical decisions yet.
- Compare EffectivePolicy result against current AIRouter / failover result.
- Store divergence for review.
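Phase B amounts to a shadow comparison: the current route is always the one returned, and the policy result is only recorded. A minimal sketch follows, assuming both paths can be expressed as zero-argument callables. `RouteDecision`, `Divergence`, and `route_with_shadow` are hypothetical names, and the list stands in for whatever store holds divergences.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class RouteDecision:
    provider: str
    model: str


@dataclass(frozen=True)
class Divergence:
    trace_id: str
    current: RouteDecision
    shadow: RouteDecision


def route_with_shadow(trace_id: str,
                      current_route: Callable[[], RouteDecision],
                      resolve_policy: Callable[[], RouteDecision],
                      divergences: List[Divergence]) -> RouteDecision:
    """Return the existing routing result; evaluate EffectivePolicy for comparison only."""
    current = current_route()
    try:
        shadow = resolve_policy()
    except Exception:
        # A resolver failure must never affect the runtime decision in this phase.
        return current
    if shadow != current:
        divergences.append(Divergence(trace_id, current, shadow))
    return current
```

The caller always receives the legacy decision, so enabling the comparison cannot change provider behavior.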
Phase C: MCP Gateway wrapper
- Wrap read-only tools first.
- Enforce project/agent/tool/environment/approval only on low-risk read-only paths.
- Keep direct tool access as compatibility fallback but mark `forbid-new`.
Phase D: Channel Event wrapper
- Make Telegram and workflow sends emit Channel Event delivery state.
- Keep actual Telegram adapter behavior stable.
- Prevent channel layer from gaining policy decisions.
Phase E: Critical path strangler
- Move one low-risk LLM call path to EffectivePolicy enforcement.
- Avoid host-resource, auto-repair, and governance-dispatch first.
- Expand only after no duplicate incidents, no meta-system amplification, and no provider-route divergence.
8. Immediate Recommendations for the AwoooP Session
- Do not start with provider switching as the first visible runtime change.
- Start by making EffectivePolicy resolvable and auditable without enforcement.
- Treat `ai_slo_watchdog_job.py` and `governance_dispatcher.py` as platform-internal event producers, not tenant incidents.
- Keep existing `governance_dispatch_path_skip`-style anti-amplification safeguards outside any speculative refactor.
- Do not weaken host/resource rule-first bypasses.
- Keep provider catalog platform-scoped, but make route permission project-scoped.
- Add `project_id`, `trace_id`, and `run_id` to mirrored events before moving execution paths.
9. Open Questions
- Which AwoooP session branch or commit currently owns contract schema changes?
- Has EffectivePolicy already been materialized as code, or is it still only in ADR/docs?
- Which event families are already mirrored into AwoooP tables?
- Are Channel Hub delivery records authoritative yet, or still shadow/canary?
- Which provider calls still bypass `AIRouterExecutor`?
- Which MCP calls still bypass `plugins/mcp/gateway.py`?
10. Current Coordination Decision
This session will not modify runtime while AwoooP is being actively implemented elsewhere.
This document is the handoff artifact for aligning AWOOOI monitoring/alerting convergence with AwoooP platform work. Runtime changes should begin only after the implementation session confirms which contract layer is ready to receive mirrored events or policy-resolution inputs.
11. Implementation Session Consumption Log
2026-05-05 — AwoooP implementation session consumed this map
- `changed_surface`: `docs/awooop/MASTER-WORKPLAN.md`, `docs/adr/ADR-106-agent-platform-architecture.md`, MCP Gateway redaction path, Operator Console API response alignment.
- `runtime_behavior_changed`: yes, but limited to MCP result redaction before Runtime/LLM context and legacy MCP audit string redaction.
- `contract_family`: MCP Gateway, Runtime State, Channel Event, Governance.
- `migration_posture`: `mirror` / `wrap`; no provider switching, no Telegram cutover, no governance runtime refactor.
- `rollback_or_disable_flag`: revert MCP Gateway redaction patch if unexpected provider consumers require raw output; no production DB migration applied.
- `affected_event_types`: MCP tool results, operator approvals/runs console views, future mirrored alert/governance/channel events.
- `risk`: low to medium; security-positive redaction path, no paid-provider or destructive-tool enablement.
- `coordination_note`: Future AwoooP work must use this map as the safety order: mirror first, read-only EffectivePolicy comparison second, read-only MCP Gateway wrapper third, Channel Event wrapper fourth, low-risk LLM path strangler last.