AwoooP Inference Gateway Runbook
Runtime design for keeping GCP-A, GCP-B, 111, and paid providers under one controlled inference lane.
Goal
Stop individual services from calling raw model hosts independently.
The gateway becomes the single platform path for:
- endpoint selection
- model lane assignment
- queue and concurrency control
- fallback
- cost and token audit
- trace correlation
Why This Is Needed
Direct provider calls caused the 2026-05-05 alert issue:
- alert diagnosis wanted a fast response
- GCP-A/B were loaded with heavyweight models
- the request timed out through GCP-A and GCP-B
- Gemini fallback generated cost
Private networking alone cannot prevent model eviction or queue contention. The gateway must own runtime scheduling.
Required Lanes
| Lane | Model | Allowed hosts | Notes |
|---|---|---|---|
| `alert-fast` | `gemma3:4b` | GCP-A, GCP-B, 111 | Synchronous, protected |
| `code-review` | `qwen2.5-coder:7b` | 111, then GCP-B | Transitional: keep GCP-B clean during alert canary |
| `embedding` | `bge-m3` | 111, then GCP-B | Transitional: keep GCP-A/B clean during alert canary |
| `deep-rca` | 14B-class model | 111 or GPU node | Async only |
| `paid-emergency` | Gemini / Claude | Cloud | Budget-gated emergency fallback |
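The lane table can be encoded as data so that host and mode constraints are checked before dispatch. The following is a minimal sketch, not the gateway's actual schema: the `Lane` dataclass, the `LANES` mapping, and `host_allowed` are hypothetical names, and `"14B-class"` is a placeholder because the table does not name a concrete model for `deep-rca`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lane:
    model: str
    hosts: tuple            # allowed hosts, in preference order
    sync_allowed: bool = True


# Values mirror the Required Lanes table; the structure itself is an assumption.
LANES = {
    "alert-fast": Lane("gemma3:4b", ("GCP-A", "GCP-B", "111")),
    "code-review": Lane("qwen2.5-coder:7b", ("111", "GCP-B")),
    "embedding": Lane("bge-m3", ("111", "GCP-B")),
    "deep-rca": Lane("14B-class", ("111", "GPU node"), sync_allowed=False),
}


def host_allowed(lane_name: str, host: str) -> bool:
    """Return True only if the lane exists and lists the host as allowed."""
    lane = LANES.get(lane_name)
    return lane is not None and host in lane.hosts
```

With this shape, a request for `code-review` on GCP-A is rejected by lookup rather than by a hard-coded branch, which keeps the deny rules in one place.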
v0 API
The gateway should initially provide an Ollama-compatible API to minimize caller changes:
POST /api/generate
GET /api/tags
GET /api/ps
Required headers for AwoooP-aware calls:
X-AwoooP-Project-ID: awoooi
X-AwoooP-Trace-ID: <w3c-trace-id>
X-AwoooP-Lane: alert-fast
X-AwoooP-Intent: DIAGNOSE
Legacy callers may be accepted in shadow mode, but must be assigned
project_id=awoooi by bootstrap rules from ADR-111.
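One way to separate AwoooP-aware callers from legacy callers is to check for the four required headers. This sketch assumes that a request missing any of them is treated as legacy; the function name `classify_caller` and the `mode` values `"aware"` and `"shadow"` are illustrative, not part of the gateway API.

```python
REQUIRED_HEADERS = (
    "X-AwoooP-Project-ID",
    "X-AwoooP-Trace-ID",
    "X-AwoooP-Lane",
    "X-AwoooP-Intent",
)


def classify_caller(headers: dict) -> dict:
    """Classify a request as AwoooP-aware or legacy from its headers."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    seen = {name.lower(): value for name, value in headers.items()}
    missing = [h for h in REQUIRED_HEADERS if not seen.get(h.lower())]
    if missing:
        # Legacy caller: shadow mode, project assigned by the bootstrap rule.
        return {"mode": "shadow", "project_id": "awoooi", "missing": missing}
    return {
        "mode": "aware",
        "project_id": seen["x-awooop-project-id"],
        "trace_id": seen["x-awooop-trace-id"],
        "lane": seen["x-awooop-lane"],
        "intent": seen["x-awooop-intent"],
    }
```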
Scheduling Rules
- `alert-fast` concurrency is reserved and cannot be borrowed by other lanes.
- `alert-fast` keeps `gemma3:4b` warm on both GCP-A and GCP-B.
- 14B/32B models are denied on GCP-A/B unless an operator opens maintenance.
- Per-host circuit breaker opens after 2 consecutive timeout failures.
- Paid provider fallback requires:
  - all Ollama endpoints failed or are circuit-open
  - budget hard kill not triggered
  - audit span records fallback reason
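The circuit-breaker and paid-fallback rules above can be sketched as follows. `HostBreaker` and `paid_fallback_allowed` are hypothetical names. Resetting the counter on any success is an assumption, since the rules only state when the breaker opens, not when it closes.

```python
from typing import Optional


class HostBreaker:
    """Per-host breaker that opens after `threshold` consecutive timeouts."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self.consecutive_timeouts = 0

    def record_timeout(self) -> None:
        self.consecutive_timeouts += 1

    def record_success(self) -> None:
        # Assumption: a success clears the streak and closes the breaker.
        self.consecutive_timeouts = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_timeouts >= self.threshold


def paid_fallback_allowed(
    ollama_unavailable: dict,
    hard_kill_triggered: bool,
    fallback_reason: Optional[str],
) -> bool:
    """Apply the three preconditions for paid provider fallback.

    `ollama_unavailable` maps each Ollama endpoint to True when it has
    failed for this request or its breaker is open.
    """
    all_down = bool(ollama_unavailable) and all(ollama_unavailable.values())
    return all_down and not hard_kill_triggered and bool(fallback_reason)
```

Requiring a non-empty `fallback_reason` before the call is made means a paid request cannot happen without something to write into the audit span.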
Minimal Routing Algorithm
    input: lane, model, project_id, trace_id

    if lane == alert-fast:
        model = gemma3:4b
        try GCP-A with 45s timeout
        try GCP-B with 45s timeout
        try 111 with 60s timeout
        if allowed by budget: try paid emergency fallback

    if lane == code-review:
        model = qwen2.5-coder:7b
        try 111 with 120s timeout
        try GCP-B with 90s timeout only if 111 is unavailable

    if lane == deep-rca:
        reject synchronous request
        create async run
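The pseudocode above translates into a small planning function that returns the ordered attempts without executing them. This is a sketch under stated assumptions: `plan_route` is a hypothetical name, "111 is unavailable" is reduced to a boolean flag supplied by the caller, and the paid fallback carries no model or timeout because the algorithm does not specify them.

```python
def plan_route(lane: str, budget_ok: bool = False, host_111_available: bool = True):
    """Return (mode, attempts); each attempt is (endpoint, model, timeout_s)."""
    if lane == "alert-fast":
        model = "gemma3:4b"
        attempts = [
            ("GCP-A", model, 45),
            ("GCP-B", model, 45),
            ("111", model, 60),
        ]
        if budget_ok:
            # Model and timeout for the paid provider are not specified.
            attempts.append(("paid-emergency", None, None))
        return "sync", attempts

    if lane == "code-review":
        model = "qwen2.5-coder:7b"
        if host_111_available:
            return "sync", [("111", model, 120)]
        return "sync", [("GCP-B", model, 90)]

    if lane == "deep-rca":
        # Synchronous requests are rejected; the caller creates an async run.
        return "async", []

    # Other lanes (for example embedding) are not covered by the minimal algorithm.
    raise ValueError(f"no routing rule for lane {lane!r}")
```

Returning a plan instead of performing the calls keeps the function usable in shadow mode, where the decision is computed but not executed.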
Metrics and Logs
Every request must emit:
- `awooop.project_id`
- `awooop.lane`
- `awooop.provider_tier`
- `awooop.endpoint`
- `gen_ai.request.model`
- `gen_ai.usage.input_tokens`
- `gen_ai.usage.output_tokens`
- `awooop.fallback_reason`
- `awooop.cost_usd`
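A completeness check over these attributes can catch requests that would otherwise be invisible to cost and token audit. In this sketch, `missing_attributes` is a hypothetical helper. It checks only that each key is present, on the assumption that `awooop.fallback_reason` is still emitted (for example as `None`) when no fallback occurred.

```python
REQUIRED_ATTRIBUTES = (
    "awooop.project_id",
    "awooop.lane",
    "awooop.provider_tier",
    "awooop.endpoint",
    "gen_ai.request.model",
    "gen_ai.usage.input_tokens",
    "gen_ai.usage.output_tokens",
    "awooop.fallback_reason",
    "awooop.cost_usd",
)


def missing_attributes(attrs: dict) -> list:
    """Return the required attribute names absent from a span or log record."""
    return [name for name in REQUIRED_ATTRIBUTES if name not in attrs]
```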
Implementation Stages
Stage 1 - Sidecar health view
- Keep existing providers.
- Add health and residency checks to identify which lane is safe.
- No traffic proxying yet.
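A residency check could be built on the Ollama `GET /api/ps` endpoint, which lists the models currently loaded on a host. The sketch below works on an already-parsed response and assumes the payload has a top-level `models` list whose entries carry a `name` field; `resident_models` and `is_resident` are hypothetical helper names, and fetching the payload is left to the caller.

```python
def resident_models(ps_payload: dict) -> set:
    """Extract the names of loaded models from a parsed /api/ps response."""
    names = set()
    for entry in ps_payload.get("models", []):
        name = entry.get("name")
        if name:
            names.add(name)
    return names


def is_resident(ps_payload: dict, model: str) -> bool:
    """Return True if `model` is currently loaded on the host."""
    return model in resident_models(ps_payload)
```

For the `alert-fast` lane, a host would count as safe only when `is_resident(payload, "gemma3:4b")` holds for that host's response.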
Stage 2 - Gateway in shadow
- Mirror inference requests to the gateway.
- Gateway computes routing decision but does not execute.
- Compare selected endpoint/model against legacy path.
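The shadow comparison can be as small as an equality check on the two decisions. This is an illustrative sketch; `Decision` and `compare_shadow` are hypothetical names, and the result keys are assumptions about what a shadow report might record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    endpoint: str
    model: str


def compare_shadow(gateway: Decision, legacy: Decision) -> dict:
    """Compare the gateway's computed decision with what the legacy path used."""
    return {
        "endpoint_match": gateway.endpoint == legacy.endpoint,
        "model_match": gateway.model == legacy.model,
        "match": gateway == legacy,
    }
```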
Stage 3 - Alert lane active
- Route only `alert-fast` through the gateway.
- Keep code review and deep RCA on legacy providers.
Stage 4 - All Ollama traffic active
- Move code review, embedding, and deep RCA to the gateway.
- Enforce lane-based deny rules.
Stage 5 - AwoooP runtime integration
- Convert gateway decisions into `run_state` and `step_journal` entries.
- Use AwoooP budget ledger as source of truth.
Rollback
Set provider env back to raw endpoints:
OLLAMA_URL: "http://192.168.0.110:11435"
OLLAMA_SECONDARY_URL: "http://192.168.0.110:11436"
OLLAMA_FALLBACK_URL: "http://192.168.0.111:11434"
Do not disable budget hard kill during rollback.