awoooi/docs/awooop/AWOOOP-MONITORING-ALERTING-CONVERGENCE.md

AwoooP x Monitoring / Alerting Convergence Map

Date: 2026-05-05
Status: Coordination map, no runtime change
Scope: AWOOOI monitoring, alerting, incident, AI provider routing, MCP, Telegram, governance loops, and AwoooP contract convergence
Superseded by integration baseline for execution waves: docs/awooop/AWOOOI-AWOOOP-AI-AUTONOMOUS-FLYWHEEL-INTEGRATION-PLAN.md

0. Purpose

AwoooP is being implemented in another active session. This document exists to keep the sessions aligned without competing for the same runtime files.

This session owns the full-flow map of the existing AWOOOI monitoring and alerting flywheel. The AwoooP implementation session owns platform schema, contracts, migrations, and runtime/control-plane implementation.

2026-05-06 update: this file remains the monitoring/alerting convergence handoff. Execution planning, 12-agent ownership, P0/P1/P2 risk register, and wave sequencing now live in the integration baseline document linked above.

The integration rule is simple:

  • AwoooP should not become a parallel platform beside the old alerting flywheel.
  • Existing alerting and repair paths should be progressively mirrored, wrapped, and then governed by AwoooP contracts.
  • Provider routing, MCP access, channel delivery, incident state, and governance feedback must converge through contracts instead of staying as scattered direct calls.

1. Session Boundary

This session may do

  • Map all monitoring, alerting, incident, governance, provider, MCP, and channel nodes.
  • Identify overlap with AwoooP contracts.
  • Mark migration posture for each node: keep, mirror, wrap, replace, or forbid-new.
  • Produce conflict warnings for the AwoooP implementation session.
  • Update docs and LOGBOOK with coordination state.

This session must not do yet

  • Change decision_manager.py runtime behavior.
  • Change governance_dispatcher.py runtime behavior.
  • Change ai_router.py / ollama_failover_manager.py routing behavior.
  • Change production K8s provider configuration.
  • Change Telegram delivery behavior.
  • Introduce paid provider behavior or raise provider budgets.

AwoooP implementation session likely owns

  • Contract schemas and validators.
  • AwoooP DB models and migrations.
  • Runtime/control-plane APIs.
  • EffectivePolicy resolution.
  • Contract revision source of truth.
  • MCP Gateway implementation slices.
  • Channel Hub / Channel Event runtime implementation.

2. Current Full-Flow Map

Monitoring / Event Sources
  Prometheus / Alertmanager / SignOz / Sentry / watchdog / governance loops / background jobs
    ↓
Ingress and Normalization
  webhooks.py / AlertAnalyzer / fingerprint / in-flight lock / classify_alert_early
    ↓
Incident and Audit State
  incidents / approval_records / alert_operation_log / timeline_events / governance_remediation_dispatch
    ↓
Evidence and Sensing
  pre_decision_investigator / MCP ProviderRegistry / k8s / prom / signoz / ssh / db / rag
    ↓
Rules and Cognition
  alert_rule_engine / OpenClaw / AIRouter / decision_fusion_adapter / agents
    ↓
Provider Routing
  OLLAMA_URL configmap / ollama_failover_manager / AIRouter / AIProviderRegistry / NVIDIA / Gemini / Claude
    ↓
Decision and Approval
  decision_manager / proposal service / approval service / governance_dispatcher
    ↓
Execution
  approval_execution / auto_repair_service / MCP gateway / SSH provider / K8s provider / emergency escalation
    ↓
Channel Delivery
  telegram_gateway / failover_alerter / workflow notifications / Channel Hub
    ↓
Verification and Learning
  post-execution verification / learning_service / playbook generation / ai_slo_watchdog / governance loop

3. AwoooP Contract Overlay

| Existing layer | AwoooP contract | Convergence need |
| --- | --- | --- |
| Alertmanager, SignOz, watchdog, governance loops | Communication / Channel Event | Normalize all inbound events into one event envelope before business decisions. |
| Incident, approval, AOL, timeline, governance dispatch | Runtime State | Add project/run/trace identity and make event state joinable. |
| OpenClaw, AIRouter, ollama failover, cloud fallback | Policy / Routing | Resolve provider/model/fallback through EffectivePolicy, not scattered env/direct calls. |
| Agents, prompts, schemas, role-specific tools | Agent Contract | Agents become versioned capability modules with explicit I/O and safety ceilings. |
| K8s, SSH, SignOz, Prometheus, DB, RAG tools | MCP Gateway | Tool access must satisfy Project AND Agent AND Tool AND Environment AND Approval. |
| Telegram, workflow notification, failover messages | Communication / Channel Event | Channel adapters stay thin and must not decide LLM, MCP, approval, or business policy. |
| ai_slo_watchdog, governance loop, remediation dispatch | Governance / Policy | Meta-system events must be deduped, scoped, and prevented from self-amplifying. |

4. Known Overlap and Conflict Zones

4.1 Provider routing

Current routing does not yet pass through a single control plane:

  • OLLAMA_URL, OLLAMA_SECONDARY_URL, and OLLAMA_FALLBACK_URL are centralized in K8s config for Ollama topology.
  • Production runtime routes Ollama through the 110 proxy pool: 110:11435 → GCP-A, 110:11436 → GCP-B, 110:11437 → Local .111.
  • ollama_failover_manager.py still owns important runtime failover logic.
  • AIRouter and AIProviderRegistry exist, but not every LLM use path is guaranteed to enter through them.
  • Some knowledge-side services may still call provider-specific clients directly.
  • INV-10 now tracks direct OLLAMA_URL call sites so GCP-B can become an active batch/shadow lane instead of only a passive fallback.

AwoooP action:

  • Treat provider catalog as platform-level resource.
  • Treat per-project route selection as EffectivePolicy.
  • Do not move provider behavior until the old call sites are mapped and wrapped.
  • Preserve GCP-A for interactive low-latency work, use GCP-B for batch/RAG/embedding/shadow/canary, and keep Local .111 for local_required / DR.

Migration posture: wrap first, replace later.
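
The lane preference above can be written as a pure function that a read-only resolver might call. This is a minimal sketch, not existing code: the `Lane` enum, the workload labels, and `select_lane` are hypothetical names, and only the proxy ports come from the list in this section.

```python
from enum import Enum


class Lane(str, Enum):
    GCP_A = "gcp-a"          # interactive, low-latency work
    GCP_B = "gcp-b"          # batch / RAG / embedding / shadow / canary
    LOCAL_111 = "local-111"  # local_required / DR


# Proxy ports on the 110 pool, as listed in section 4.1.
LANE_PROXY_PORT = {Lane.GCP_A: 11435, Lane.GCP_B: 11436, Lane.LOCAL_111: 11437}

# Hypothetical workload labels; the real taxonomy would belong to EffectivePolicy.
BACKGROUND_WORKLOADS = frozenset({"batch", "rag", "embedding", "shadow", "canary"})


def select_lane(workload: str,
                local_required: bool = False,
                disaster_recovery: bool = False) -> Lane:
    """Pick a lane without calling any provider (pure, side-effect free)."""
    if local_required or disaster_recovery:
        return Lane.LOCAL_111
    if workload in BACKGROUND_WORKLOADS:
        return Lane.GCP_B
    # Anything not explicitly background is treated as interactive.
    return Lane.GCP_A
```

Because the function has no side effects, it can run next to ollama_failover_manager.py purely for comparison, which fits the wrap-first posture.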

4.2 Monitoring and alerting ingress

Current ingress is multi-path:

  • Alertmanager webhook paths.
  • SignOz webhook path.
  • Sentry or app error paths.
  • AI SLO watchdog.
  • Governance events and remediation dispatch.
  • Background loops that synthesize incidents or meta-alerts.

AwoooP action:

  • Define a Channel Event envelope that can represent all of these without losing raw source payload.
  • Mirror events first; do not force all production ingress through AwoooP until parity and dedupe are proven.

Migration posture: mirror first, wrap later.
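
One possible shape for such an envelope is sketched below. The field names and the `mirror_event` helper are assumptions for illustration and are not taken from the AwoooP schema. The sketch shows two properties: the raw source payload is copied untouched, and mirroring only records the event without taking over ingress.

```python
import copy
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class ChannelEvent:
    source: str                       # e.g. "alertmanager", "signoz", "watchdog"
    raw_payload: Dict[str, Any]       # untouched copy of what the source sent
    project_id: Optional[str] = None  # identity fields may be unknown at ingress
    trace_id: Optional[str] = None
    run_id: Optional[str] = None
    event_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    received_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))


def mirror_event(source: str, payload: Dict[str, Any],
                 sink: List[ChannelEvent]) -> ChannelEvent:
    """Record a copy of an inbound event; the legacy handler keeps running as before."""
    event = ChannelEvent(source=source, raw_payload=copy.deepcopy(payload))
    sink.append(event)
    return event
```

The deep copy keeps the mirrored record stable even if the legacy path mutates the payload afterwards.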

4.3 Rule-first versus LLM-first

Historical failures show that host/resource alerts can be polluted by LLM suggestions if rule-first gates are bypassed.

AwoooP action:

  • Preserve rule-first safety gates during migration.
  • EffectivePolicy should be able to declare llm_allowed=false or rule_authoritative=true for specific categories.
  • Channel Event and Policy/Routing contracts must not weaken existing host-vs-k8s domain guards.

Migration posture: keep safety gates, then wrap with policy metadata.
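
As an illustration of how the two flags could interact, the sketch below resolves a rule result and an LLM suggestion under a policy. The semantics are an assumption made for this example: `rule_authoritative` means a matched rule always wins, and `llm_allowed=false` means an LLM suggestion is never used. The real EffectivePolicy contract may define the flags differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EffectivePolicy:
    llm_allowed: bool = True
    rule_authoritative: bool = False


def choose_action(policy: EffectivePolicy,
                  rule_action: Optional[str],
                  llm_action: Optional[str]) -> Optional[str]:
    """Return the action to propose, or None when nothing is permitted."""
    # A matched rule is final when the category is rule-authoritative.
    if rule_action is not None and policy.rule_authoritative:
        return rule_action
    # The LLM suggestion is only considered when the policy allows it.
    if policy.llm_allowed and llm_action is not None:
        return llm_action
    # Otherwise fall back to the rule result (possibly None).
    return rule_action


# Hypothetical policy for host/resource alerts.
HOST_RESOURCE_POLICY = EffectivePolicy(llm_allowed=False, rule_authoritative=True)
```

Under `HOST_RESOURCE_POLICY`, an LLM suggestion cannot replace or supplement the rule outcome, which matches the rule-first gate described above.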

4.4 MCP and direct tools

Current MCP capability exists, but provider registries and tools are not yet entirely tenant/project scoped.

AwoooP action:

  • Keep platform-level provider registry for tool implementations.
  • Move grants, credential refs, environment boundaries, and project permission decisions into MCP Gateway policy.
  • Ensure direct tool calls become forbid-new once gateway coverage exists.

Migration posture: wrap existing tools, then forbid-new direct access.
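
The gateway rule "Project AND Agent AND Tool AND Environment AND Approval" can be sketched as a conjunctive check over grants. The `Grant` and `ToolRequest` types below are hypothetical and do not reflect the current plugins/mcp/gateway.py API; the sketch only shows that every dimension must match before a call is allowed.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Grant:
    project_id: str
    agent_id: str
    tool: str
    environment: str
    requires_approval: bool = False


@dataclass(frozen=True)
class ToolRequest:
    project_id: str
    agent_id: str
    tool: str
    environment: str
    approved: bool = False


def is_allowed(request: ToolRequest, grants: Iterable[Grant]) -> bool:
    """Allow only if one grant matches on every dimension (deny by default)."""
    for grant in grants:
        if (grant.project_id == request.project_id
                and grant.agent_id == request.agent_id
                and grant.tool == request.tool
                and grant.environment == request.environment
                and (request.approved or not grant.requires_approval)):
            return True
    return False
```

An empty grant list denies everything, so wrapping an existing tool with this check cannot widen access by accident.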

4.5 Telegram and Channel Hub

Telegram currently carries incident cards, meta alerts, failover alerts, approval messages, workflow notifications, and status updates.

AwoooP action:

  • Channel adapters must remain delivery-only.
  • They may escape, retry, edit, and track provider message IDs.
  • They must not decide provider route, approval policy, incident category, MCP access, or business logic.

Migration posture: mirror outbound delivery state, then wrap through Channel Event.

4.6 Meta-system governance loops

Recent production verification showed the desired health signal: multiple stale-event cycles processed without meta_alert / META_SYSTEM amplification.

AwoooP action:

  • Model governance events as platform-internal runtime events.
  • Require dedupe/cooldown/active-state checks to be part of the governance policy contract.
  • Keep governance_dispatch_path_skip-style safeguards visible in audit state.

Migration posture: keep current guard, mirror event state, then wrap into governance policy.
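
A governance policy contract that carries dedupe, cooldown, and active-state checks might reduce to a guard like the one below. This is a sketch under assumed names (`GovernanceGuard`, the skip reason strings, the dedupe key format); it does not describe how governance_dispatcher.py currently works. It returns a reason alongside the decision so that skips remain visible in audit state.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class GovernanceGuard:
    cooldown_seconds: float
    last_dispatch: Dict[str, float] = field(default_factory=dict)
    active: Set[str] = field(default_factory=set)

    def check(self, dedupe_key: str, now: float) -> Tuple[bool, str]:
        """Decide whether to dispatch; the reason string is meant for the audit log."""
        if dedupe_key in self.active:
            return False, "skip_active"
        last = self.last_dispatch.get(dedupe_key)
        if last is not None and now - last < self.cooldown_seconds:
            return False, "skip_cooldown"
        self.last_dispatch[dedupe_key] = now
        self.active.add(dedupe_key)
        return True, "dispatch"

    def resolve(self, dedupe_key: str) -> None:
        """Mark the event as no longer active; the cooldown still applies."""
        self.active.discard(dedupe_key)
```

Passing `now` explicitly keeps the guard deterministic in tests; a caller would typically supply `time.monotonic()`.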

5. Node Inventory for Coordination

| Node | Current role | AwoooP target | Posture | Notes |
| --- | --- | --- | --- | --- |
| api/v1/webhooks.py | Alertmanager/SignOz ingress, incident creation, LLM fallback paths | Channel Event ingress + Runtime State | mirror | Do not rewrite first; mirror selected events. |
| incident_service.py | Incident lifecycle and early classification | Runtime State + L3 cognition input | wrap | Preserve existing classification behavior until EvidenceSnapshot parity. |
| alert_rule_engine.py | YAML rules and safety gates | Policy/Routing safety constraints + Playbook policy | keep | Rule-first gates must survive migration. |
| pre_decision_investigator.py | Evidence gathering through MCP tools | MCP Gateway + Runtime evidence snapshot | wrap | Good first bridge to gateway-controlled sensing. |
| plugins/mcp/registry.py | MCP provider registry | Platform provider registry with per-project grants in DB | wrap | Registry can stay platform-level; grants move out. |
| plugins/mcp/gateway.py | MCP gate implementation slice | AwoooP MCP Gateway | replace later | Candidate canonical gateway after contract alignment. |
| ai_router.py | Provider selection and execution | Policy/Routing EffectivePolicy executor | wrap | Do not bypass; gradually make EffectivePolicy an input. |
| ollama_failover_manager.py | Ollama topology/failover | Platform provider health + topology resource | wrap | Keep runtime behavior while policy contract learns from it; see INV-10 for GCP-A/B active-active usage. |
| decision_fusion_adapter.py | Decision fusion / routing bridge | Agent + Policy/Routing integration | wrap | High-risk bridge, no direct rewrite without plan. |
| decision_manager.py | Core decision and execution orchestration | Agent Runtime + Runtime State | keep | Tier-sensitive; last-mile migration only. |
| approval_execution.py | Approval action execution | Runtime execution + MCP Gateway | wrap | Preserve host-vs-k8s guards. |
| auto_repair_service.py | Repair execution and verification | Runtime execution + verification contract | wrap | Must not auto-resolve without verified success. |
| telegram_gateway.py | Telegram send/edit/delete and cards | Channel adapter | wrap | Delivery only, not policy. |
| failover_alerter.py | Provider/governance alert rendering | Channel adapter + governance event view | wrap | Human-readable rendering remains useful. |
| ai_slo_watchdog_job.py | Platform self-monitoring | Governance policy + platform internal events | keep | Current anti-amplification behavior is important. |
| governance_dispatcher.py | Governance event dispatch | Governance runtime dispatch | keep | Very sensitive; do not alter while other session changes platform. |
| learning_service.py | Learning, playbook generation side effects | Runtime learning + Agent trace | wrap | Attach evidence/run identity before provider changes. |
| runbook_generator.py | Runbook/anti-pattern generation | Knowledge side effect under Policy/Routing | wrap | Direct provider usage should eventually route through EffectivePolicy. |

When either session changes a relevant area, update this document or LOGBOOK with:

  • changed_surface: file/module/contract touched.
  • runtime_behavior_changed: yes/no.
  • contract_family: Project, Agent, MCP Gateway, Policy/Routing, Runtime State, Channel Event, Governance.
  • migration_posture: keep, mirror, wrap, replace, forbid-new.
  • rollback_or_disable_flag: exact flag if applicable.
  • affected_event_types: alertmanager, signoz, watchdog, governance, provider_failover, approval, execution, channel.
  • risk: low/medium/high.
  • coordination_note: what the other session must avoid or consume.
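
If the coordination record is ever captured as structured data rather than free text, it could look like the sketch below. The class name and the `problems` method are hypothetical, and the allowed values are copied from the field list above. Requiring a rollback flag for every runtime change is an assumption of this sketch, since the field list only asks for one "if applicable". Risk and event types are left free-form.

```python
from dataclasses import dataclass
from typing import List, Optional

CONTRACT_FAMILIES = {"Project", "Agent", "MCP Gateway", "Policy/Routing",
                     "Runtime State", "Channel Event", "Governance"}
POSTURES = {"keep", "mirror", "wrap", "replace", "forbid-new"}


@dataclass
class CoordinationRecord:
    changed_surface: List[str]
    runtime_behavior_changed: bool
    contract_family: List[str]
    migration_posture: List[str]
    rollback_or_disable_flag: Optional[str]
    affected_event_types: List[str]
    risk: str
    coordination_note: str

    def problems(self) -> List[str]:
        """Return human-readable issues; an empty list means the record is usable."""
        issues = [f"unknown contract_family: {v}"
                  for v in self.contract_family if v not in CONTRACT_FAMILIES]
        issues += [f"unknown migration_posture: {v}"
                   for v in self.migration_posture if v not in POSTURES]
        # Assumption: a runtime change should always name a way back.
        if self.runtime_behavior_changed and not self.rollback_or_disable_flag:
            issues.append("runtime change without rollback_or_disable_flag")
        return issues
```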

7. Safe Integration Order

Phase A: Shared map and mirror-only state

  • Keep old runtime behavior.
  • Mirror selected alert/governance/channel events into AwoooP runtime state.
  • Add no new provider behavior.
  • Add no new paid fallback.

Phase B: Policy/Routing as read-only resolver

  • Compute EffectivePolicy but do not enforce it on critical decisions yet.
  • Compare EffectivePolicy result against current AIRouter / failover result.
  • Store divergence for review.
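
Phase B amounts to a shadow comparison: the existing router's answer is always the one used, and any disagreement with EffectivePolicy is only recorded. A minimal sketch follows, assuming routes can be compared as plain strings; `resolve_in_shadow` and the list-based store are hypothetical stand-ins for whatever the implementation session provides.

```python
from typing import Any, Dict, List


def resolve_in_shadow(current_route: str,
                      policy_route: str,
                      context: Dict[str, Any],
                      divergences: List[Dict[str, Any]]) -> str:
    """Compare both results, record any mismatch, and keep the current route."""
    if policy_route != current_route:
        divergences.append({
            "current": current_route,
            "policy": policy_route,
            "context": dict(context),  # shallow copy so later mutation is not recorded
        })
    # No enforcement in Phase B: the existing AIRouter / failover result wins.
    return current_route
```

A caller would pass the AIRouter or failover result as `current_route` and review the stored divergences before any Phase E enforcement.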

Phase C: MCP Gateway wrapper

  • Wrap read-only tools first.
  • Enforce project/agent/tool/environment/approval only on low-risk read-only paths.
  • Keep direct tool access as compatibility fallback but mark forbid-new.

Phase D: Channel Event wrapper

  • Make Telegram and workflow sends emit Channel Event delivery state.
  • Keep actual Telegram adapter behavior stable.
  • Prevent channel layer from gaining policy decisions.

Phase E: Critical path strangler

  • Move one low-risk LLM call path to EffectivePolicy enforcement.
  • Do not start with host-resource, auto-repair, or governance-dispatch paths.
  • Expand only after observing no duplicate incidents, no meta-system amplification, and no provider-route divergence.

8. Immediate Recommendations for the AwoooP Session

  • Do not make provider switching the first visible runtime change.
  • Start by making EffectivePolicy resolvable and auditable without enforcement.
  • Treat ai_slo_watchdog_job.py and governance_dispatcher.py as platform-internal event producers, not tenant incidents.
  • Keep existing governance_dispatch_path_skip-style anti-amplification safeguards outside any speculative refactor.
  • Do not weaken the rule-first gates for host/resource alerts.
  • Keep provider catalog platform-scoped, but make route permission project-scoped.
  • Add project_id, trace_id, and run_id to mirrored events before moving execution paths.

9. Open Questions

  • Which AwoooP session branch or commit currently owns contract schema changes?
  • Has EffectivePolicy already been materialized as code, or is it still only in ADR/docs?
  • Which event families are already mirrored into AwoooP tables?
  • Are Channel Hub delivery records authoritative yet, or still shadow/canary?
  • Which provider calls still bypass AIRouterExecutor?
  • Which MCP calls still bypass plugins/mcp/gateway.py?

10. Current Coordination Decision

This session will not modify runtime while AwoooP is being actively implemented elsewhere.

This document is the handoff artifact for aligning AWOOOI monitoring/alerting convergence with AwoooP platform work. Runtime changes should begin only after the implementation session confirms which contract layer is ready to receive mirrored events or policy-resolution inputs.

11. Implementation Session Consumption Log

2026-05-05 — AwoooP implementation session consumed this map

  • changed_surface: docs/awooop/MASTER-WORKPLAN.md, docs/adr/ADR-106-agent-platform-architecture.md, MCP Gateway redaction path, Operator Console API response alignment.
  • runtime_behavior_changed: yes, but limited to MCP result redaction before Runtime/LLM context and legacy MCP audit string redaction.
  • contract_family: MCP Gateway, Runtime State, Channel Event, Governance.
  • migration_posture: mirror / wrap; no provider switching, no Telegram cutover, no governance runtime refactor.
  • rollback_or_disable_flag: revert MCP Gateway redaction patch if unexpected provider consumers require raw output; no production DB migration applied.
  • affected_event_types: MCP tool results, operator approvals/runs console views, future mirrored alert/governance/channel events.
  • risk: low to medium; security-positive redaction path, no paid-provider or destructive-tool enablement.
  • coordination_note: Future AwoooP work must use this map as the safety order: mirror first, read-only EffectivePolicy comparison second, read-only MCP Gateway wrapper third, Channel Event wrapper fourth, low-risk LLM path strangler last.