# AwoooP x Monitoring / Alerting Convergence Map
**Date**: 2026-05-05
**Status**: Coordination map, no runtime change
**Scope**: AWOOOI monitoring, alerting, incident, AI provider routing, MCP, Telegram, governance loops, and AwoooP contract convergence
**Superseded by integration baseline for execution waves**: `docs/awooop/AWOOOI-AWOOOP-AI-AUTONOMOUS-FLYWHEEL-INTEGRATION-PLAN.md`
## 0. Purpose
AwoooP is being implemented in another active session. This document exists to keep the sessions aligned without competing for the same runtime files.
This session owns the full-flow map of the existing AWOOOI monitoring and alerting flywheel. The AwoooP implementation session owns platform schema, contracts, migrations, and runtime/control-plane implementation.
2026-05-06 update: this file remains the monitoring/alerting convergence handoff. Execution planning, 12-agent ownership, P0/P1/P2 risk register, and wave sequencing now live in the integration baseline document linked above.
The integration rule is simple:
- AwoooP should not become a parallel platform beside the old alerting flywheel.
- Existing alerting and repair paths should be progressively mirrored, wrapped, and then governed by AwoooP contracts.
- Provider routing, MCP access, channel delivery, incident state, and governance feedback must converge through contracts instead of staying as scattered direct calls.
## 1. Session Boundary
### This session may do
- Map all monitoring, alerting, incident, governance, provider, MCP, and channel nodes.
- Identify overlap with AwoooP contracts.
- Mark migration posture for each node: `keep`, `mirror`, `wrap`, `replace`, or `forbid-new`.
- Produce conflict warnings for the AwoooP implementation session.
- Update docs and LOGBOOK with coordination state.
### This session must not do yet
- Change `decision_manager.py` runtime behavior.
- Change `governance_dispatcher.py` runtime behavior.
- Change `ai_router.py` / `ollama_failover_manager.py` routing behavior.
- Change production K8s provider configuration.
- Change Telegram delivery behavior.
- Introduce paid provider behavior or raise provider budgets.
### AwoooP implementation session likely owns
- Contract schemas and validators.
- AwoooP DB models and migrations.
- Runtime/control-plane APIs.
- EffectivePolicy resolution.
- Contract revision source of truth.
- MCP Gateway implementation slices.
- Channel Hub / Channel Event runtime implementation.
## 2. Current Full-Flow Map
```text
Monitoring / Event Sources
    Prometheus / Alertmanager / SignOz / Sentry / watchdog / governance loops / background jobs
Ingress and Normalization
    webhooks.py / AlertAnalyzer / fingerprint / in-flight lock / classify_alert_early
Incident and Audit State
    incidents / approval_records / alert_operation_log / timeline_events / governance_remediation_dispatch
Evidence and Sensing
    pre_decision_investigator / MCP ProviderRegistry / k8s / prom / signoz / ssh / db / rag
Rules and Cognition
    alert_rule_engine / OpenClaw / AIRouter / decision_fusion_adapter / agents
Provider Routing
    OLLAMA_URL configmap / ollama_failover_manager / AIRouter / AIProviderRegistry / NVIDIA / Gemini / Claude
Decision and Approval
    decision_manager / proposal service / approval service / governance_dispatcher
Execution
    approval_execution / auto_repair_service / MCP gateway / SSH provider / K8s provider / emergency escalation
Channel Delivery
    telegram_gateway / failover_alerter / workflow notifications / Channel Hub
Verification and Learning
    post-execution verification / learning_service / playbook generation / ai_slo_watchdog / governance loop
```
## 3. AwoooP Contract Overlay
| Existing layer | AwoooP contract | Convergence need |
| --- | --- | --- |
| Alertmanager, SignOz, watchdog, governance loops | Communication / Channel Event | Normalize all inbound events into one event envelope before business decisions. |
| Incident, approval, AOL, timeline, governance dispatch | Runtime State | Add project/run/trace identity and make event state joinable. |
| OpenClaw, AIRouter, ollama failover, cloud fallback | Policy / Routing | Resolve provider/model/fallback through EffectivePolicy, not scattered env/direct calls. |
| Agents, prompts, schemas, role-specific tools | Agent Contract | Agents become versioned capability modules with explicit I/O and safety ceilings. |
| K8s, SSH, SignOz, Prometheus, DB, RAG tools | MCP Gateway | Tool access must satisfy Project AND Agent AND Tool AND Environment AND Approval. |
| Telegram, workflow notification, failover messages | Communication / Channel Event | Channel adapters stay thin and must not decide LLM, MCP, approval, or business policy. |
| ai_slo_watchdog, governance loop, remediation dispatch | Governance / Policy | Meta-system events must be deduped, scoped, and prevented from self-amplifying. |
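
The MCP Gateway row requires every dimension to pass at once (Project AND Agent AND Tool AND Environment AND Approval). A minimal sketch of that conjunctive check follows. The `ToolRequest` and `Grant` types, their field names, and the example tool name are hypothetical illustrations, not the AwoooP schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ToolRequest:
    """Hypothetical shape of a tool call arriving at the gateway."""
    project_id: str
    agent_id: str
    tool: str
    environment: str
    approved: bool


@dataclass(frozen=True)
class Grant:
    """Hypothetical per-project grant; the real contract may differ."""
    project_id: str
    agent_id: str
    tool: str
    environments: frozenset[str]
    requires_approval: bool


def is_allowed(request: ToolRequest, grants: list[Grant]) -> bool:
    """Allow only when one grant matches on all five dimensions."""
    for grant in grants:
        if (
            grant.project_id == request.project_id
            and grant.agent_id == request.agent_id
            and grant.tool == request.tool
            and request.environment in grant.environments
            and (request.approved or not grant.requires_approval)
        ):
            return True
    # Default deny: no grant satisfied every dimension.
    return False
```

A request that matches on four dimensions but targets an ungranted environment, or lacks a required approval, is denied.
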
## 4. Known Overlap and Conflict Zones
### 4.1 Provider routing
Current routing does not yet pass through a single control plane:
- `OLLAMA_URL`, `OLLAMA_SECONDARY_URL`, and `OLLAMA_FALLBACK_URL` are centralized in K8s config for Ollama topology.
- Production runtime routes Ollama through the 110 proxy pool: `110:11435` → GCP-A, `110:11436` → GCP-B, `110:11437` → Local `.111`.
- `ollama_failover_manager.py` still owns important runtime failover logic.
- `AIRouter` and `AIProviderRegistry` exist, but not every LLM use path is guaranteed to enter through them.
- Some knowledge-side services may still call provider-specific clients directly.
- `INV-10` now tracks direct `OLLAMA_URL` call sites so GCP-B can become an active batch/shadow lane instead of only a passive fallback.
AwoooP action:
- Treat provider catalog as platform-level resource.
- Treat per-project route selection as EffectivePolicy.
- Do not move provider behavior until the old call sites are mapped and wrapped.
- Preserve GCP-A for interactive low-latency work, use GCP-B for batch/RAG/embedding/shadow/canary, and keep Local `.111` for `local_required` / DR.
Migration posture: `wrap` first, `replace` later.
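
The lane assignment described above could be expressed as a small pure function that an EffectivePolicy resolver later consumes. This is a sketch under assumptions: the `Lane` enum, the workload labels, and the choice to default unknown workloads to the interactive lane are illustrative and are not taken from `ollama_failover_manager.py`.

```python
from enum import Enum


class Lane(str, Enum):
    GCP_A = "gcp-a"      # interactive, low-latency work
    GCP_B = "gcp-b"      # batch / RAG / embedding / shadow / canary
    LOCAL = "local-111"  # local_required and DR


# Hypothetical workload labels, mirroring the list in the text above.
BATCH_WORKLOADS = {"batch", "rag", "embedding", "shadow", "canary"}


def select_lane(
    workload: str,
    local_required: bool = False,
    disaster_recovery: bool = False,
) -> Lane:
    """Pick a lane; local constraints win over workload type."""
    if local_required or disaster_recovery:
        return Lane.LOCAL
    if workload in BATCH_WORKLOADS:
        return Lane.GCP_B
    # Assumption: anything not marked as batch-like is treated as interactive.
    return Lane.GCP_A
```

Because the function has no side effects, it can run in comparison mode next to the existing failover logic without changing provider behavior.
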
### 4.2 Monitoring and alerting ingress
Current ingress is multi-path:
- Alertmanager webhook paths.
- SignOz webhook path.
- Sentry or app error paths.
- AI SLO watchdog.
- Governance events and remediation dispatch.
- Background loops that synthesize incidents or meta-alerts.
AwoooP action:
- Define a Channel Event envelope that can represent all of these without losing raw source payload.
- Mirror events first; do not force all production ingress through AwoooP until parity and dedupe are proven.
Migration posture: `mirror` first, `wrap` later.
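
One possible shape for a mirror-only envelope is sketched below. The `ChannelEvent` fields and the `mirror_event` helper are hypothetical; the point is that the raw source payload is copied verbatim and never rewritten, while identity fields are attached alongside it.

```python
from __future__ import annotations

import copy
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class ChannelEvent:
    """Hypothetical envelope; field names are illustrative only."""
    source: str                    # e.g. "alertmanager", "signoz", "watchdog"
    event_type: str
    raw_payload: dict[str, Any]    # untouched copy of the source payload
    project_id: str | None = None
    run_id: str | None = None
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    received_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def mirror_event(
    source: str,
    event_type: str,
    payload: dict[str, Any],
    project_id: str | None = None,
) -> ChannelEvent:
    """Wrap a payload without mutating it or sharing references with it."""
    return ChannelEvent(
        source=source,
        event_type=event_type,
        raw_payload=copy.deepcopy(payload),
        project_id=project_id,
    )
```

The deep copy keeps the mirrored record stable even if the legacy ingress path later mutates the payload it received.
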
### 4.3 Rule-first versus LLM-first
Historical failures show that host/resource alerts can be polluted by LLM suggestions if rule-first gates are bypassed.
AwoooP action:
- Preserve rule-first safety gates during migration.
- EffectivePolicy should be able to declare `llm_allowed=false` or `rule_authoritative=true` for specific categories.
- Channel Event and Policy/Routing contracts must not weaken existing host-vs-k8s domain guards.
Migration posture: `keep` safety gates, then `wrap` with policy metadata.
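
As an illustration of how `llm_allowed` and `rule_authoritative` might gate a decision, consider the sketch below. Only those two flag names come from this document; the `EffectivePolicy` dataclass shape, the category names, and the action strings are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EffectivePolicy:
    """Hypothetical resolved policy for one alert category."""
    llm_allowed: bool = True
    rule_authoritative: bool = False


# Illustrative category table; real categories live in the rule engine.
CATEGORY_POLICY = {
    "host_resource": EffectivePolicy(llm_allowed=False, rule_authoritative=True),
}


def choose_action(
    category: str,
    rule_action: str | None,
    llm_action: str | None,
    policies: dict[str, EffectivePolicy] = CATEGORY_POLICY,
) -> str | None:
    """Rule-first selection: an LLM suggestion is used only when policy permits."""
    policy = policies.get(category, EffectivePolicy())
    if policy.rule_authoritative and rule_action is not None:
        return rule_action
    if not policy.llm_allowed:
        # No LLM fallback: return the rule result even if it is None.
        return rule_action
    return rule_action or llm_action
```

For a category with `llm_allowed=false`, a missing rule match yields no action rather than an LLM guess, which is the behavior the host-vs-k8s guards rely on.
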
### 4.4 MCP and direct tools
MCP capability exists today, but provider registries and tools are not yet fully tenant- or project-scoped.
AwoooP action:
- Keep platform-level provider registry for tool implementations.
- Move grants, credential refs, environment boundaries, and project permission decisions into MCP Gateway policy.
- Ensure direct tool calls become `forbid-new` once gateway coverage exists.
Migration posture: `wrap` existing tools, then `forbid-new` direct access.
### 4.5 Telegram and Channel Hub
Telegram currently carries incident cards, meta alerts, failover alerts, approval messages, workflow notifications, and status updates.
AwoooP action:
- Channel adapters must remain delivery-only.
- They may escape, retry, edit, and track provider message IDs.
- They must not decide provider route, approval policy, incident category, MCP access, or business logic.
Migration posture: `mirror` outbound delivery state, then `wrap` through Channel Event.
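
A delivery-only adapter can be mirrored by recording delivery state around the existing send call. The sketch below assumes a hypothetical `send` callable that returns a provider message ID and raises `ConnectionError` on transient failure; it is not the `telegram_gateway.py` interface.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class DeliveryRecord:
    """Hypothetical outbound delivery state for one Channel Event."""
    channel: str
    event_id: str
    attempts: int = 0
    provider_message_id: str | None = None
    status: str = "pending"


def deliver(
    event_id: str,
    text: str,
    send: Callable[[str], str],
    max_attempts: int = 3,
) -> DeliveryRecord:
    """Retry delivery and track state; makes no policy or routing decisions."""
    record = DeliveryRecord(channel="telegram", event_id=event_id)
    for _ in range(max_attempts):
        record.attempts += 1
        try:
            record.provider_message_id = send(text)
            record.status = "delivered"
            return record
        except ConnectionError:
            continue  # transient failure: try again until attempts run out
    record.status = "failed"
    return record
```

The adapter only retries and records; what to send, to whom, and whether approval is needed are decided upstream.
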
### 4.6 Meta-system governance loops
Recent production verification showed the desired health signal: multiple stale-event cycles processed without `meta_alert` / `META_SYSTEM` amplification.
AwoooP action:
- Model governance events as platform-internal runtime events.
- Require dedupe/cooldown/active-state checks to be part of the governance policy contract.
- Keep `governance_dispatch_path_skip`-style safeguards visible in audit state.
Migration posture: `keep` current guard, `mirror` event state, then `wrap` into governance policy.
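
The dedupe, cooldown, and active-state checks could be modeled as one guard that also returns a skip reason, so that suppressed events stay visible in audit state. This is a sketch: `GovernanceGuard`, its default cooldown, and the reason strings are hypothetical and do not describe the current `governance_dispatcher.py` logic.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class GovernanceGuard:
    """In-memory guard; a real one would persist state in the DB."""
    cooldown_seconds: float = 300.0
    _last_emitted: dict[str, float] = field(default_factory=dict)
    _active: set[str] = field(default_factory=set)

    def should_emit(self, dedupe_key: str, now: float) -> tuple[bool, str]:
        """Return (emit?, reason); the reason is meant to be written to audit state."""
        if dedupe_key in self._active:
            return False, "active_state"
        last = self._last_emitted.get(dedupe_key)
        if last is not None and now - last < self.cooldown_seconds:
            return False, "cooldown"
        self._last_emitted[dedupe_key] = now
        self._active.add(dedupe_key)
        return True, "emitted"

    def resolve(self, dedupe_key: str) -> None:
        """Mark the event as no longer active; the cooldown still applies."""
        self._active.discard(dedupe_key)
```

Passing `now` explicitly keeps the guard deterministic under test and lets the caller choose the clock.
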
## 5. Node Inventory for Coordination
| Node | Current role | AwoooP target | Posture | Notes |
| --- | --- | --- | --- | --- |
| `api/v1/webhooks.py` | Alertmanager/SignOz ingress, incident creation, LLM fallback paths | Channel Event ingress + Runtime State | `mirror` | Do not rewrite first; mirror selected events. |
| `incident_service.py` | Incident lifecycle and early classification | Runtime State + L3 cognition input | `wrap` | Preserve existing classification behavior until EvidenceSnapshot parity. |
| `alert_rule_engine.py` | YAML rules and safety gates | Policy/Routing safety constraints + Playbook policy | `keep` | Rule-first gates must survive migration. |
| `pre_decision_investigator.py` | Evidence gathering through MCP tools | MCP Gateway + Runtime evidence snapshot | `wrap` | Good first bridge to gateway-controlled sensing. |
| `plugins/mcp/registry.py` | MCP provider registry | Platform provider registry with per-project grants in DB | `wrap` | Registry can stay platform-level; grants move out. |
| `plugins/mcp/gateway.py` | MCP gate implementation slice | AwoooP MCP Gateway | `replace` later | Candidate canonical gateway after contract alignment. |
| `ai_router.py` | Provider selection and execution | Policy/Routing EffectivePolicy executor | `wrap` | Do not bypass; gradually make EffectivePolicy an input. |
| `ollama_failover_manager.py` | Ollama topology/failover | Platform provider health + topology resource | `wrap` | Keep runtime behavior while policy contract learns from it; see `INV-10` for GCP-A/B active-active usage. |
| `decision_fusion_adapter.py` | Decision fusion / routing bridge | Agent + Policy/Routing integration | `wrap` | High-risk bridge, no direct rewrite without plan. |
| `decision_manager.py` | Core decision and execution orchestration | Agent Runtime + Runtime State | `keep` | Tier-sensitive; last-mile migration only. |
| `approval_execution.py` | Approval action execution | Runtime execution + MCP Gateway | `wrap` | Preserve host-vs-k8s guards. |
| `auto_repair_service.py` | Repair execution and verification | Runtime execution + verification contract | `wrap` | Must not auto-resolve without verified success. |
| `telegram_gateway.py` | Telegram send/edit/delete and cards | Channel adapter | `wrap` | Delivery only, not policy. |
| `failover_alerter.py` | Provider/governance alert rendering | Channel adapter + governance event view | `wrap` | Human-readable rendering remains useful. |
| `ai_slo_watchdog_job.py` | Platform self-monitoring | Governance policy + platform internal events | `keep` | Current anti-amplification behavior is important. |
| `governance_dispatcher.py` | Governance event dispatch | Governance runtime dispatch | `keep` | Very sensitive; do not alter while other session changes platform. |
| `learning_service.py` | Learning, playbook generation side effects | Runtime learning + Agent trace | `wrap` | Attach evidence/run identity before provider changes. |
| `runbook_generator.py` | Runbook/anti-pattern generation | Knowledge side effect under Policy/Routing | `wrap` | Direct provider usage should eventually route through EffectivePolicy. |
## 6. Recommended Shared Handoff Protocol
When either session changes a relevant area, update this document or LOGBOOK with:
- `changed_surface`: file/module/contract touched.
- `runtime_behavior_changed`: yes/no.
- `contract_family`: Project, Agent, MCP Gateway, Policy/Routing, Runtime State, Channel Event, Governance.
- `migration_posture`: keep, mirror, wrap, replace, forbid-new.
- `rollback_or_disable_flag`: exact flag if applicable.
- `affected_event_types`: alertmanager, signoz, watchdog, governance, provider_failover, approval, execution, channel.
- `risk`: low/medium/high.
- `coordination_note`: what the other session must avoid or consume.
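
If handoff entries are ever checked mechanically, the fields above map onto a small record type. The sketch below is an assumption about how that could look: `HandoffEntry` and `validate` are hypothetical names, and only `contract_family` and `migration_posture` are checked against fixed vocabularies because the other fields are used with free-form values in this document.

```python
from __future__ import annotations

from dataclasses import dataclass

CONTRACT_FAMILIES = {
    "Project", "Agent", "MCP Gateway", "Policy/Routing",
    "Runtime State", "Channel Event", "Governance",
}
POSTURES = {"keep", "mirror", "wrap", "replace", "forbid-new"}


@dataclass(frozen=True)
class HandoffEntry:
    changed_surface: list[str]
    runtime_behavior_changed: bool
    contract_family: list[str]
    migration_posture: list[str]
    rollback_or_disable_flag: str | None
    affected_event_types: list[str]
    risk: str
    coordination_note: str


def validate(entry: HandoffEntry) -> list[str]:
    """Return a list of problems; an empty list means the entry looks well-formed."""
    errors: list[str] = []
    if not entry.changed_surface:
        errors.append("changed_surface must name at least one file, module, or contract")
    for family in entry.contract_family:
        if family not in CONTRACT_FAMILIES:
            errors.append(f"unknown contract_family: {family}")
    for posture in entry.migration_posture:
        if posture not in POSTURES:
            errors.append(f"unknown migration_posture: {posture}")
    return errors
```
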
## 7. Safe Integration Order
### Phase A: Shared map and mirror-only state
- Keep old runtime behavior.
- Mirror selected alert/governance/channel events into AwoooP runtime state.
- Add no new provider behavior.
- Add no new paid fallback.
### Phase B: Policy/Routing as read-only resolver
- Compute EffectivePolicy but do not enforce it on critical decisions yet.
- Compare EffectivePolicy result against current AIRouter / failover result.
- Store divergence for review.
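
Phase B amounts to a shadow comparison: the current route stays authoritative while the policy result is only recorded. A minimal sketch follows, assuming both resolvers can be called as plain functions that return a route name. `shadow_compare` and `Divergence` are hypothetical and not part of `AIRouter`.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Divergence:
    request_id: str
    current_route: str
    policy_route: str


def shadow_compare(
    request_id: str,
    request: dict[str, Any],
    current_router: Callable[[dict[str, Any]], str],
    policy_resolver: Callable[[dict[str, Any]], str],
    divergences: list[Divergence],
) -> str:
    """Always return the current route; only record where policy disagrees."""
    current = current_router(request)
    try:
        proposed = policy_resolver(request)
    except Exception as exc:
        # A resolver failure must never affect the live decision.
        proposed = f"resolver_error:{type(exc).__name__}"
    if proposed != current:
        divergences.append(Divergence(request_id, current, proposed))
    return current
```

Resolver errors are stored as divergences as well, so review covers both disagreement and resolver instability.
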
### Phase C: MCP Gateway wrapper
- Wrap read-only tools first.
- Enforce project/agent/tool/environment/approval only on low-risk read-only paths.
- Keep direct tool access as compatibility fallback but mark `forbid-new`.
### Phase D: Channel Event wrapper
- Make Telegram and workflow sends emit Channel Event delivery state.
- Keep actual Telegram adapter behavior stable.
- Prevent channel layer from gaining policy decisions.
### Phase E: Critical path strangler
- Move one low-risk LLM call path to EffectivePolicy enforcement.
- Avoid host-resource, auto-repair, and governance-dispatch first.
- Expand only after no duplicate incidents, no meta-system amplification, and no provider-route divergence.
## 8. Immediate Recommendations for the AwoooP Session
- Do not start with provider switching as the first visible runtime change.
- Start by making EffectivePolicy resolvable and auditable without enforcement.
- Treat `ai_slo_watchdog_job.py` and `governance_dispatcher.py` as platform-internal event producers, not tenant incidents.
- Keep existing `governance_dispatch_path_skip`-style anti-amplification safeguards outside any speculative refactor.
- Do not weaken host/resource rule-first gates or make them bypassable.
- Keep provider catalog platform-scoped, but make route permission project-scoped.
- Add `project_id`, `trace_id`, and `run_id` to mirrored events before moving execution paths.
## 9. Open Questions
- Which AwoooP session branch or commit currently owns contract schema changes?
- Has EffectivePolicy already been materialized as code, or is it still only in ADR/docs?
- Which event families are already mirrored into AwoooP tables?
- Are Channel Hub delivery records authoritative yet, or still shadow/canary?
- Which provider calls still bypass `AIRouterExecutor`?
- Which MCP calls still bypass `plugins/mcp/gateway.py`?
## 10. Current Coordination Decision
This session will not modify runtime while AwoooP is being actively implemented elsewhere.
This document is the handoff artifact for aligning AWOOOI monitoring/alerting convergence with AwoooP platform work. Runtime changes should begin only after the implementation session confirms which contract layer is ready to receive mirrored events or policy-resolution inputs.
## 11. Implementation Session Consumption Log
### 2026-05-05 — AwoooP implementation session consumed this map
- `changed_surface`: `docs/awooop/MASTER-WORKPLAN.md`, `docs/adr/ADR-106-agent-platform-architecture.md`, MCP Gateway redaction path, Operator Console API response alignment.
- `runtime_behavior_changed`: yes, but limited to MCP result redaction before Runtime/LLM context and legacy MCP audit string redaction.
- `contract_family`: MCP Gateway, Runtime State, Channel Event, Governance.
- `migration_posture`: `mirror` / `wrap`; no provider switching, no Telegram cutover, no governance runtime refactor.
- `rollback_or_disable_flag`: revert MCP Gateway redaction patch if unexpected provider consumers require raw output; no production DB migration applied.
- `affected_event_types`: MCP tool results, operator approvals/runs console views, future mirrored alert/governance/channel events.
- `risk`: low to medium; security-positive redaction path, no paid-provider or destructive-tool enablement.
- `coordination_note`: Future AwoooP work must use this map as the safety order: mirror first, read-only EffectivePolicy comparison second, read-only MCP Gateway wrapper third, Channel Event wrapper fourth, low-risk LLM path strangler last.