# AwoooP x Monitoring / Alerting Convergence Map

**Date**: 2026-05-05
**Status**: Coordination map, no runtime change
**Scope**: AWOOOI monitoring, alerting, incident, AI provider routing, MCP, Telegram, governance loops, and AwoooP contract convergence
**Superseded by integration baseline for execution waves**: `docs/awooop/AWOOOI-AWOOOP-AI-AUTONOMOUS-FLYWHEEL-INTEGRATION-PLAN.md`

## 0. Purpose

AwoooP is being implemented in another active session. This document exists to keep the sessions aligned without competing for the same runtime files.

This session owns the full-flow map of the existing AWOOOI monitoring and alerting flywheel. The AwoooP implementation session owns platform schema, contracts, migrations, and runtime/control-plane implementation.

2026-05-06 update: this file remains the monitoring/alerting convergence handoff. Execution planning, 12-agent ownership, P0/P1/P2 risk register, and wave sequencing now live in the integration baseline document linked above.

The integration rule is simple:

- AwoooP should not become a parallel platform beside the old alerting flywheel.
- Existing alerting and repair paths should be progressively mirrored, wrapped, and then governed by AwoooP contracts.
- Provider routing, MCP access, channel delivery, incident state, and governance feedback must converge through contracts instead of staying as scattered direct calls.

## 1. Session Boundary

### This session may do

- Map all monitoring, alerting, incident, governance, provider, MCP, and channel nodes.
- Identify overlap with AwoooP contracts.
- Mark migration posture for each node: `keep`, `mirror`, `wrap`, `replace`, or `forbid-new`.
- Produce conflict warnings for the AwoooP implementation session.
- Update docs and LOGBOOK with coordination state.

### This session must not do yet

- Change `decision_manager.py` runtime behavior.
- Change `governance_dispatcher.py` runtime behavior.
- Change `ai_router.py` / `ollama_failover_manager.py` routing behavior.
- Change production K8s provider configuration.
- Change Telegram delivery behavior.
- Introduce paid provider behavior or raise provider budgets.

### AwoooP implementation session likely owns

- Contract schemas and validators.
- AwoooP DB models and migrations.
- Runtime/control-plane APIs.
- EffectivePolicy resolution.
- Contract revision source of truth.
- MCP Gateway implementation slices.
- Channel Hub / Channel Event runtime implementation.

## 2. Current Full-Flow Map

```text
Monitoring / Event Sources
  Prometheus / Alertmanager / SignOz / Sentry / watchdog / governance loops / background jobs
  ↓
Ingress and Normalization
  webhooks.py / AlertAnalyzer / fingerprint / in-flight lock / classify_alert_early
  ↓
Incident and Audit State
  incidents / approval_records / alert_operation_log / timeline_events / governance_remediation_dispatch
  ↓
Evidence and Sensing
  pre_decision_investigator / MCP ProviderRegistry / k8s / prom / signoz / ssh / db / rag
  ↓
Rules and Cognition
  alert_rule_engine / OpenClaw / AIRouter / decision_fusion_adapter / agents
  ↓
Provider Routing
  OLLAMA_URL configmap / ollama_failover_manager / AIRouter / AIProviderRegistry / NVIDIA / Gemini / Claude
  ↓
Decision and Approval
  decision_manager / proposal service / approval service / governance_dispatcher
  ↓
Execution
  approval_execution / auto_repair_service / MCP gateway / SSH provider / K8s provider / emergency escalation
  ↓
Channel Delivery
  telegram_gateway / failover_alerter / workflow notifications / Channel Hub
  ↓
Verification and Learning
  post-execution verification / learning_service / playbook generation / ai_slo_watchdog / governance loop
```

## 3. AwoooP Contract Overlay

| Existing layer | AwoooP contract | Convergence need |
| --- | --- | --- |
| Alertmanager, SignOz, watchdog, governance loops | Communication / Channel Event | Normalize all inbound events into one event envelope before business decisions. |
| Incident, approval, AOL, timeline, governance dispatch | Runtime State | Add project/run/trace identity and make event state joinable. |
| OpenClaw, AIRouter, ollama failover, cloud fallback | Policy / Routing | Resolve provider/model/fallback through EffectivePolicy, not scattered env/direct calls. |
| Agents, prompts, schemas, role-specific tools | Agent Contract | Agents become versioned capability modules with explicit I/O and safety ceilings. |
| K8s, SSH, SignOz, Prometheus, DB, RAG tools | MCP Gateway | Tool access must satisfy Project AND Agent AND Tool AND Environment AND Approval. |
| Telegram, workflow notification, failover messages | Communication / Channel Event | Channel adapters stay thin and must not decide LLM, MCP, approval, or business policy. |
| ai_slo_watchdog, governance loop, remediation dispatch | Governance / Policy | Meta-system events must be deduped, scoped, and prevented from self-amplifying. |

## 4. Known Overlap and Conflict Zones

### 4.1 Provider routing

Current routing does not yet pass through a single control plane:

- `OLLAMA_URL`, `OLLAMA_SECONDARY_URL`, and `OLLAMA_FALLBACK_URL` are centralized in K8s config for Ollama topology.
- Production runtime routes Ollama through the 110 proxy pool: `110:11435` → GCP-A, `110:11436` → GCP-B, `110:11437` → Local `.111`.
- `ollama_failover_manager.py` still owns important runtime failover logic.
- `AIRouter` and `AIProviderRegistry` exist, but not every LLM use path is guaranteed to enter through them.
- Some knowledge-side services may still call provider-specific clients directly.
- `INV-10` now tracks direct `OLLAMA_URL` call sites so GCP-B can become an active batch/shadow lane instead of only a passive fallback.

AwoooP action:

- Treat provider catalog as platform-level resource.
- Treat per-project route selection as EffectivePolicy.
- Do not move provider behavior until the old call sites are mapped and wrapped.
- Preserve GCP-A for interactive low-latency work, use GCP-B for batch/RAG/embedding/shadow/canary, and keep Local `.111` for `local_required` / DR.

Migration posture: `wrap` first, `replace` later.
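
The lane split described in the action list can be sketched as a small pure function. This is only an illustration under stated assumptions: `Lane`, `BATCH_WORKLOADS`, `preferred_lane`, and the workload labels are hypothetical names, and the sketch does not reproduce the health or failover logic in `ollama_failover_manager.py`.

```python
from enum import Enum


class Lane(Enum):
    GCP_A = "gcp-a"   # interactive, low-latency work
    GCP_B = "gcp-b"   # batch / RAG / embedding / shadow / canary
    LOCAL = "local"   # local_required requests and DR


# Hypothetical workload labels; the real taxonomy is not defined in this map.
BATCH_WORKLOADS = {"batch", "rag", "embedding", "shadow", "canary"}


def preferred_lane(workload: str, local_required: bool = False) -> Lane:
    """Return the lane a request should prefer under the split described above."""
    if local_required:
        return Lane.LOCAL
    if workload in BATCH_WORKLOADS:
        return Lane.GCP_B
    # Anything not labelled as batch-like is treated as interactive in this sketch.
    return Lane.GCP_A
```

Keeping the preference as a pure function would let a read-only EffectivePolicy resolver compute it alongside the existing router without touching runtime behavior.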

### 4.2 Monitoring and alerting ingress

Current ingress is multi-path:

- Alertmanager webhook paths.
- SignOz webhook path.
- Sentry or app error paths.
- AI SLO watchdog.
- Governance events and remediation dispatch.
- Background loops that synthesize incidents or meta-alerts.

AwoooP action:

- Define a Channel Event envelope that can represent all of these without losing raw source payload.
- Mirror events first; do not force all production ingress through AwoooP until parity and dedupe are proven.

Migration posture: `mirror` first, `wrap` later.
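
One possible shape for such an envelope is sketched below. The class name `ChannelEvent`, the helper `mirror_event`, and the exact field set are assumptions for illustration; only the requirement to keep the raw payload and the `project_id` / `trace_id` / `run_id` identity fields come from this document.

```python
import copy
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class ChannelEvent:
    source: str                       # e.g. "alertmanager", "signoz", "watchdog"
    event_type: str
    raw_payload: Dict[str, Any]       # source payload, kept unmodified
    project_id: Optional[str] = None
    run_id: Optional[str] = None
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    received_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def mirror_event(
    source: str,
    event_type: str,
    payload: Dict[str, Any],
    project_id: Optional[str] = None,
) -> ChannelEvent:
    """Build a mirror record without altering or sharing the caller's payload."""
    # Deep copy so later mutation by the legacy path cannot change the mirror.
    return ChannelEvent(
        source=source,
        event_type=event_type,
        raw_payload=copy.deepcopy(payload),
        project_id=project_id,
    )
```

Because the function only reads the payload, it could be called from the existing ingress path as a side effect while the legacy flow continues unchanged.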

### 4.3 Rule-first versus LLM-first

Historical failures show that host/resource alerts can be polluted by LLM suggestions if rule-first gates are bypassed.

AwoooP action:

- Preserve rule-first safety gates during migration.
- EffectivePolicy should be able to declare `llm_allowed=false` or `rule_authoritative=true` for specific categories.
- Channel Event and Policy/Routing contracts must not weaken existing host-vs-k8s domain guards.

Migration posture: `keep` safety gates, then `wrap` with policy metadata.
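
A minimal sketch of how the two flags could gate action selection follows. `CategoryPolicy`, `choose_action`, and the precedence between the flags are assumptions made for illustration; the actual EffectivePolicy semantics belong to the implementation session.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CategoryPolicy:
    llm_allowed: bool = True
    rule_authoritative: bool = False


def choose_action(
    policy: CategoryPolicy,
    rule_action: Optional[str],
    llm_action: Optional[str],
) -> Optional[str]:
    """Pick the action for one alert; None means no automated action."""
    # A matched rule always wins when rules are authoritative for the category.
    if policy.rule_authoritative and rule_action is not None:
        return rule_action
    # An LLM suggestion is considered only when the category permits it.
    if policy.llm_allowed and llm_action is not None:
        return llm_action
    # Otherwise fall back to the rule result, which may be None.
    return rule_action
```

With `llm_allowed=False` and `rule_authoritative=True`, an LLM suggestion can never reach execution for that category, even when no rule matches.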

### 4.4 MCP and direct tools

Current MCP capability exists, but provider registries and tools are not yet entirely tenant/project scoped.

AwoooP action:

- Keep platform-level provider registry for tool implementations.
- Move grants, credential refs, environment boundaries, and project permission decisions into MCP Gateway policy.
- Ensure direct tool calls become `forbid-new` once gateway coverage exists.

Migration posture: `wrap` existing tools, then `forbid-new` direct access.
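
The gateway condition from the overlay table (Project AND Agent AND Tool AND Environment AND Approval) can be sketched as a conjunction over a grant list. `ToolRequest`, `Grant`, `is_allowed`, and the `requires_approval` flag are hypothetical; this is not the API of `plugins/mcp/gateway.py`.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ToolRequest:
    project: str
    agent: str
    tool: str
    environment: str
    approved: bool = False


@dataclass(frozen=True)
class Grant:
    project: str
    agent: str
    tool: str
    environment: str
    requires_approval: bool = True  # assumption: approval is required by default


def is_allowed(request: ToolRequest, grants: Iterable[Grant]) -> bool:
    """Allow a call only when every dimension matches one grant."""
    for grant in grants:
        same_scope = (
            grant.project == request.project
            and grant.agent == request.agent
            and grant.tool == request.tool
            and grant.environment == request.environment
        )
        # The approval dimension is satisfied if granted or not required.
        if same_scope and (request.approved or not grant.requires_approval):
            return True
    return False  # default deny
```

Default deny matters here: a tool with no grant row is rejected, which is what lets direct access move to `forbid-new` once gateway coverage exists.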

### 4.5 Telegram and Channel Hub

Telegram currently carries incident cards, meta alerts, failover alerts, approval messages, workflow notifications, and status updates.

AwoooP action:

- Channel adapters must remain delivery-only.
- They may escape, retry, edit, and track provider message IDs.
- They must not decide provider route, approval policy, incident category, MCP access, or business logic.

Migration posture: `mirror` outbound delivery state, then `wrap` through Channel Event.

### 4.6 Meta-system governance loops

Recent production verification showed the desired health signal: multiple stale-event cycles processed without `meta_alert` / `META_SYSTEM` amplification.

AwoooP action:

- Model governance events as platform-internal runtime events.
- Require dedupe/cooldown/active-state checks to be part of the governance policy contract.
- Keep `governance_dispatch_path_skip`-style safeguards visible in audit state.

Migration posture: `keep` current guard, `mirror` event state, then `wrap` into governance policy.
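
The dedupe, cooldown, and active-state checks named above could take a shape like the following. `GovernanceDeduper` and its methods are hypothetical and do not describe the current guard in `governance_dispatcher.py`; the injected clock exists only to make the sketch testable.

```python
import time
from typing import Callable, Dict, Set


class GovernanceDeduper:
    """Suppress repeat governance events by fingerprint."""

    def __init__(
        self,
        cooldown_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._last_emitted: Dict[str, float] = {}
        self._active: Set[str] = set()

    def should_emit(self, fingerprint: str) -> bool:
        now = self._clock()
        # Active-state check: an open event already covers this fingerprint.
        if fingerprint in self._active:
            return False
        # Cooldown check: too soon after the previous emission.
        last = self._last_emitted.get(fingerprint)
        if last is not None and now - last < self._cooldown:
            return False
        self._last_emitted[fingerprint] = now
        self._active.add(fingerprint)
        return True

    def resolve(self, fingerprint: str) -> None:
        """Mark the event closed; the cooldown still applies afterwards."""
        self._active.discard(fingerprint)
```

Applying the cooldown even after resolution is what prevents a flapping meta-alert from re-triggering itself on every cycle.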

## 5. Node Inventory for Coordination

| Node | Current role | AwoooP target | Posture | Notes |
| --- | --- | --- | --- | --- |
| `api/v1/webhooks.py` | Alertmanager/SignOz ingress, incident creation, LLM fallback paths | Channel Event ingress + Runtime State | `mirror` | Do not rewrite first; mirror selected events. |
| `incident_service.py` | Incident lifecycle and early classification | Runtime State + L3 cognition input | `wrap` | Preserve existing classification behavior until EvidenceSnapshot parity. |
| `alert_rule_engine.py` | YAML rules and safety gates | Policy/Routing safety constraints + Playbook policy | `keep` | Rule-first gates must survive migration. |
| `pre_decision_investigator.py` | Evidence gathering through MCP tools | MCP Gateway + Runtime evidence snapshot | `wrap` | Good first bridge to gateway-controlled sensing. |
| `plugins/mcp/registry.py` | MCP provider registry | Platform provider registry with per-project grants in DB | `wrap` | Registry can stay platform-level; grants move out. |
| `plugins/mcp/gateway.py` | MCP gate implementation slice | AwoooP MCP Gateway | `replace` later | Candidate canonical gateway after contract alignment. |
| `ai_router.py` | Provider selection and execution | Policy/Routing EffectivePolicy executor | `wrap` | Do not bypass; gradually make EffectivePolicy an input. |
| `ollama_failover_manager.py` | Ollama topology/failover | Platform provider health + topology resource | `wrap` | Keep runtime behavior while policy contract learns from it; see `INV-10` for GCP-A/B active-active usage. |
| `decision_fusion_adapter.py` | Decision fusion / routing bridge | Agent + Policy/Routing integration | `wrap` | High-risk bridge, no direct rewrite without plan. |
| `decision_manager.py` | Core decision and execution orchestration | Agent Runtime + Runtime State | `keep` | Tier-sensitive; last-mile migration only. |
| `approval_execution.py` | Approval action execution | Runtime execution + MCP Gateway | `wrap` | Preserve host-vs-k8s guards. |
| `auto_repair_service.py` | Repair execution and verification | Runtime execution + verification contract | `wrap` | Must not auto-resolve without verified success. |
| `telegram_gateway.py` | Telegram send/edit/delete and cards | Channel adapter | `wrap` | Delivery only, not policy. |
| `failover_alerter.py` | Provider/governance alert rendering | Channel adapter + governance event view | `wrap` | Human-readable rendering remains useful. |
| `ai_slo_watchdog_job.py` | Platform self-monitoring | Governance policy + platform internal events | `keep` | Current anti-amplification behavior is important. |
| `governance_dispatcher.py` | Governance event dispatch | Governance runtime dispatch | `keep` | Very sensitive; do not alter while other session changes platform. |
| `learning_service.py` | Learning, playbook generation side effects | Runtime learning + Agent trace | `wrap` | Attach evidence/run identity before provider changes. |
| `runbook_generator.py` | Runbook/anti-pattern generation | Knowledge side effect under Policy/Routing | `wrap` | Direct provider usage should eventually route through EffectivePolicy. |

## 6. Recommended Shared Handoff Protocol

When either session changes a relevant area, update this document or LOGBOOK with:

- `changed_surface`: file/module/contract touched.
- `runtime_behavior_changed`: yes/no.
- `contract_family`: Project, Agent, MCP Gateway, Policy/Routing, Runtime State, Channel Event, Governance.
- `migration_posture`: keep, mirror, wrap, replace, forbid-new.
- `rollback_or_disable_flag`: exact flag if applicable.
- `affected_event_types`: alertmanager, signoz, watchdog, governance, provider_failover, approval, execution, channel.
- `risk`: low/medium/high.
- `coordination_note`: what the other session must avoid or consume.
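
If handoff entries are ever checked mechanically, a record type along these lines could catch vocabulary drift. The field names come from the list above; `HandoffEntry`, `problems`, and the choice to validate only posture and contract family are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import List

POSTURES = {"keep", "mirror", "wrap", "replace", "forbid-new"}
CONTRACT_FAMILIES = {
    "Project", "Agent", "MCP Gateway", "Policy/Routing",
    "Runtime State", "Channel Event", "Governance",
}


@dataclass
class HandoffEntry:
    changed_surface: List[str]
    runtime_behavior_changed: bool
    contract_family: List[str]
    migration_posture: List[str]
    rollback_or_disable_flag: str
    affected_event_types: List[str]
    risk: str
    coordination_note: str

    def problems(self) -> List[str]:
        """Return vocabulary violations; an empty list means none were found."""
        found = []
        for posture in self.migration_posture:
            if posture not in POSTURES:
                found.append(f"unknown migration_posture: {posture}")
        for family in self.contract_family:
            if family not in CONTRACT_FAMILIES:
                found.append(f"unknown contract_family: {family}")
        return found
```

Posture and contract family are modelled as lists because a single change can span several of each, as the consumption log in section 11 shows.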

## 7. Safe Integration Order

### Phase A: Shared map and mirror-only state

- Keep old runtime behavior.
- Mirror selected alert/governance/channel events into AwoooP runtime state.
- Add no new provider behavior.
- Add no new paid fallback.

### Phase B: Policy/Routing as read-only resolver

- Compute EffectivePolicy but do not enforce it on critical decisions yet.
- Compare EffectivePolicy result against current AIRouter / failover result.
- Store divergence for review.
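
The read-only comparison in Phase B can be sketched as a shadow check that records a divergence and never alters the live route. `RouteDecision`, `Divergence`, and `shadow_compare` are hypothetical names; how the two decisions are obtained from AIRouter and EffectivePolicy is outside this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RouteDecision:
    provider: str
    model: str


@dataclass(frozen=True)
class Divergence:
    call_path: str
    current: RouteDecision   # what the existing router chose
    policy: RouteDecision    # what EffectivePolicy would have chosen


def shadow_compare(
    call_path: str,
    current: RouteDecision,
    policy: RouteDecision,
) -> Optional[Divergence]:
    """Return a record to store when the two results differ, else None."""
    if current == policy:
        return None
    return Divergence(call_path=call_path, current=current, policy=policy)
```

The caller keeps using `current` regardless of the outcome; only the returned record, if any, is persisted for review.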

### Phase C: MCP Gateway wrapper

- Wrap read-only tools first.
- Enforce project/agent/tool/environment/approval only on low-risk read-only paths.
- Keep direct tool access as compatibility fallback but mark `forbid-new`.

### Phase D: Channel Event wrapper

- Make Telegram and workflow sends emit Channel Event delivery state.
- Keep actual Telegram adapter behavior stable.
- Prevent channel layer from gaining policy decisions.

### Phase E: Critical path strangler

- Move one low-risk LLM call path to EffectivePolicy enforcement.
- Do not start with host-resource, auto-repair, or governance-dispatch paths.
- Expand only after observing no duplicate incidents, no meta-system amplification, and no provider-route divergence.

## 8. Immediate Recommendations for the AwoooP Session

- Do not start with provider switching as the first visible runtime change.
- Start by making EffectivePolicy resolvable and auditable without enforcement.
- Treat `ai_slo_watchdog_job.py` and `governance_dispatcher.py` as platform-internal event producers, not tenant incidents.
- Keep existing `governance_dispatch_path_skip`-style anti-amplification safeguards outside any speculative refactor.
- Do not weaken the rule-first gates for host/resource alerts.
- Keep provider catalog platform-scoped, but make route permission project-scoped.
- Add `project_id`, `trace_id`, and `run_id` to mirrored events before moving execution paths.

## 9. Open Questions

- Which AwoooP session branch or commit currently owns contract schema changes?
- Has EffectivePolicy already been materialized as code, or is it still only in ADR/docs?
- Which event families are already mirrored into AwoooP tables?
- Are Channel Hub delivery records authoritative yet, or still shadow/canary?
- Which provider calls still bypass `AIRouterExecutor`?
- Which MCP calls still bypass `plugins/mcp/gateway.py`?

## 10. Current Coordination Decision

This session will not modify runtime while AwoooP is being actively implemented elsewhere.

This document is the handoff artifact for aligning AWOOOI monitoring/alerting convergence with AwoooP platform work. Runtime changes should begin only after the implementation session confirms which contract layer is ready to receive mirrored events or policy-resolution inputs.

## 11. Implementation Session Consumption Log

### 2026-05-05: AwoooP implementation session consumed this map

- `changed_surface`: `docs/awooop/MASTER-WORKPLAN.md`, `docs/adr/ADR-106-agent-platform-architecture.md`, MCP Gateway redaction path, Operator Console API response alignment.
- `runtime_behavior_changed`: yes, but limited to MCP result redaction before Runtime/LLM context and legacy MCP audit string redaction.
- `contract_family`: MCP Gateway, Runtime State, Channel Event, Governance.
- `migration_posture`: `mirror` / `wrap`; no provider switching, no Telegram cutover, no governance runtime refactor.
- `rollback_or_disable_flag`: revert MCP Gateway redaction patch if unexpected provider consumers require raw output; no production DB migration applied.
- `affected_event_types`: MCP tool results, operator approvals/runs console views, future mirrored alert/governance/channel events.
- `risk`: low to medium; security-positive redaction path, no paid-provider or destructive-tool enablement.
- `coordination_note`: Future AwoooP work must use this map as the safety order: mirror first, read-only EffectivePolicy comparison second, read-only MCP Gateway wrapper third, Channel Event wrapper fourth, low-risk LLM path strangler last.