# AwoooP Inference Gateway Runbook

> Runtime design for keeping GCP-A, GCP-B, 111, and paid providers under one
> controlled inference lane.

---

## Goal

Stop individual services from calling raw model hosts independently.

The gateway becomes the single platform path for:

- endpoint selection
- model lane assignment
- queue and concurrency control
- fallback
- cost and token audit
- trace correlation

## Why This Is Needed

Direct provider calls caused the 2026-05-05 alert issue:

- alert diagnosis needed a fast response
- GCP-A and GCP-B were loaded with heavyweight models
- the request timed out on both GCP-A and GCP-B
- the Gemini fallback then incurred cost

Private networking alone cannot prevent model eviction or queue contention. The
gateway must own runtime scheduling.

## Required Lanes

| Lane | Model | Allowed hosts | Notes |
|------|-------|---------------|-------|
| `alert-fast` | `gemma3:4b` | GCP-A, GCP-B, 111 | Synchronous, protected |
| `code-review` | `qwen2.5-coder:7b` | 111, then GCP-B | Transitional: keep GCP-B clean during alert canary |
| `embedding` | `bge-m3` | 111, then GCP-B | Transitional: keep GCP-A/B clean during alert canary |
| `deep-rca` | 14B-class model | 111 or GPU node | Async only |
| `paid-emergency` | Gemini / Claude | Cloud | Budget-gated emergency fallback |

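
The lane table can be held as data that the gateway loads at startup. The following is a minimal sketch rather than the gateway's actual configuration format: the `Lane` dataclass, its field names, and `hosts_for` are hypothetical, and `gpu-node` and `cloud` are placeholder host labels. Lane names, models, and host order are taken from the table above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Lane:
    name: str
    model: Optional[str]        # None where the table names no concrete model tag
    hosts: Tuple[str, ...]      # tried in the order listed
    sync_allowed: bool = True
    budget_gated: bool = False


LANES = {
    lane.name: lane
    for lane in (
        Lane("alert-fast", "gemma3:4b", ("GCP-A", "GCP-B", "111")),
        Lane("code-review", "qwen2.5-coder:7b", ("111", "GCP-B")),
        Lane("embedding", "bge-m3", ("111", "GCP-B")),
        # The table only says "14B-class model"; "gpu-node" is a placeholder label.
        Lane("deep-rca", None, ("111", "gpu-node"), sync_allowed=False),
        # Gemini / Claude; the concrete model is left to the paid-provider config.
        Lane("paid-emergency", None, ("cloud",), budget_gated=True),
    )
}


def hosts_for(lane_name: str) -> Tuple[str, ...]:
    """Return the ordered host list for a lane; raises KeyError for unknown lanes."""
    return LANES[lane_name].hosts
```

Keeping the table as one frozen structure means an unknown lane fails loudly with a `KeyError` instead of silently falling through to a default host.
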
## v0 API

The gateway should initially provide an Ollama-compatible API to minimize caller
changes:

```http
POST /api/generate
GET /api/tags
GET /api/ps
```

Required headers for AwoooP-aware calls:

```http
X-AwoooP-Project-ID: awoooi
X-AwoooP-Trace-ID: <w3c-trace-id>
X-AwoooP-Lane: alert-fast
X-AwoooP-Intent: DIAGNOSE
```

Legacy callers may be accepted in shadow mode, but must be assigned
`project_id=awoooi` by bootstrap rules from ADR-111.

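
One way to separate AwoooP-aware callers from legacy callers is to check the four headers on each request. The sketch below is illustrative only: `CallerContext` and `parse_caller` are hypothetical names, rejecting a partial header set is an assumption, and the ADR-111 bootstrap rules are not reproduced here beyond the `project_id=awoooi` default stated above.

```python
from typing import Mapping, NamedTuple

REQUIRED_HEADERS = (
    "X-AwoooP-Project-ID",
    "X-AwoooP-Trace-ID",
    "X-AwoooP-Lane",
    "X-AwoooP-Intent",
)


class CallerContext(NamedTuple):
    project_id: str
    trace_id: str
    lane: str
    intent: str
    legacy: bool


def parse_caller(headers: Mapping[str, str]) -> CallerContext:
    """Build a caller context; callers with no AwoooP headers are treated as legacy."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    values = [lowered.get(name.lower()) for name in REQUIRED_HEADERS]
    if all(values):
        project_id, trace_id, lane, intent = values
        return CallerContext(project_id, trace_id, lane, intent, legacy=False)
    if any(values):
        # Assumption: a partial header set is a caller bug, not a legacy caller.
        missing = [n for n, v in zip(REQUIRED_HEADERS, values) if not v]
        raise ValueError(f"incomplete AwoooP headers, missing: {missing}")
    # Legacy caller: only project_id is defined by the runbook; the empty
    # trace_id, lane, and intent are placeholders for the bootstrap rules to fill.
    return CallerContext("awoooi", "", "", "", legacy=True)
```
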
## Scheduling Rules

- `alert-fast` concurrency is reserved and cannot be borrowed by other lanes.
- `alert-fast` keeps `gemma3:4b` warm on both GCP-A and GCP-B.
- 14B/32B models are denied on GCP-A/B unless an operator opens a maintenance window.
- Each per-host circuit breaker opens after 2 consecutive timeout failures.
- Paid provider fallback requires that:
  - all Ollama endpoints have failed or are circuit-open
  - the budget hard kill has not been triggered
  - the audit span records the fallback reason

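
The circuit-breaker and paid-fallback rules above can be sketched as follows. This is a minimal illustration under stated assumptions: `HostBreaker` and `paid_fallback_allowed` are hypothetical names, a success is assumed to reset the timeout count, and no cooldown or half-open recovery is modelled because the rules above do not specify one.

```python
from typing import Iterable


class HostBreaker:
    """Opens after `threshold` consecutive timeouts; a success resets the count."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self.consecutive_timeouts = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_timeouts >= self.threshold

    def record_timeout(self) -> None:
        self.consecutive_timeouts += 1

    def record_success(self) -> None:
        # Assumption: any successful response closes the breaker again.
        self.consecutive_timeouts = 0


def paid_fallback_allowed(
    ollama_hosts_usable: Iterable[bool],
    budget_hard_kill: bool,
    fallback_reason: str,
) -> bool:
    """Apply the three paid-fallback preconditions from the scheduling rules."""
    if any(ollama_hosts_usable):
        return False  # at least one Ollama endpoint is neither failed nor circuit-open
    if budget_hard_kill:
        return False  # the budget hard kill has been triggered
    # The audit span must carry a reason, so refuse to fall back without one.
    return bool(fallback_reason.strip())
```

Treating a missing fallback reason as a hard refusal keeps the audit requirement from being skipped under pressure.
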
## Minimal Routing Algorithm

```text
input: lane, model, project_id, trace_id

if lane == alert-fast:
  model = gemma3:4b
  try GCP-A with 45s timeout
  try GCP-B with 45s timeout
  try 111 with 60s timeout
  if allowed by budget: try paid emergency fallback

if lane == code-review:
  model = qwen2.5-coder:7b
  try 111 with 120s timeout
  try GCP-B with 90s timeout only if 111 is unavailable

if lane == deep-rca:
  reject synchronous request
  create async run
```

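
The pseudocode above can be turned into a small planning function that returns the ordered attempts for a lane. This is a sketch under assumptions, not the gateway's implementation: `Attempt`, `ROUTES`, and `plan_attempts` are hypothetical names, skipping circuit-open hosts is borrowed from the scheduling rules, and the paid attempt carries no model or timeout because the runbook does not specify them.

```python
from typing import AbstractSet, List, NamedTuple, Optional


class Attempt(NamedTuple):
    host: str
    model: Optional[str]        # None: not specified by the runbook
    timeout_s: Optional[int]    # None: not specified by the runbook


# Model plus (host, timeout in seconds) per lane, in the order of the pseudocode.
ROUTES = {
    "alert-fast": ("gemma3:4b", [("GCP-A", 45), ("GCP-B", 45), ("111", 60)]),
    "code-review": ("qwen2.5-coder:7b", [("111", 120), ("GCP-B", 90)]),
}
ASYNC_ONLY = frozenset({"deep-rca"})


def plan_attempts(
    lane: str,
    open_hosts: AbstractSet[str] = frozenset(),
    paid_allowed: bool = False,
) -> List[Attempt]:
    """Return the ordered attempts for a synchronous request on `lane`."""
    if lane in ASYNC_ONLY:
        # The caller is expected to create an async run instead.
        raise ValueError(f"{lane} is async only; synchronous request rejected")
    if lane not in ROUTES:
        raise KeyError(f"no route defined for lane {lane!r}")
    model, hosts = ROUTES[lane]
    # Hosts whose circuit breaker is open are skipped, preserving the order.
    attempts = [Attempt(h, model, t) for h, t in hosts if h not in open_hosts]
    # Only alert-fast has a paid emergency step in the pseudocode.
    if lane == "alert-fast" and paid_allowed:
        attempts.append(Attempt("paid-emergency", None, None))
    return attempts
```

Because attempts are tried in order, GCP-B is reached for `code-review` only after 111 has failed or been skipped, which matches the "only if 111 is unavailable" condition.
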
## Metrics and Logs

Every request must emit:

- `awooop.project_id`
- `awooop.lane`
- `awooop.provider_tier`
- `awooop.endpoint`
- `gen_ai.request.model`
- `gen_ai.usage.input_tokens`
- `gen_ai.usage.output_tokens`
- `awooop.fallback_reason`
- `awooop.cost_usd`

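
A helper that builds the attribute set in one place makes it harder to emit a request with a field missing. The sketch below is illustrative: `build_span_attributes` and `missing_attributes` are hypothetical names, and the defaults (`"none"` for the fallback reason, `0.0` for cost) and the `"local"` tier value in the usage are assumptions, since the list above does not define allowed values.

```python
from typing import Dict, List, Mapping, Union

AttrValue = Union[str, int, float]

REQUIRED_ATTRIBUTES = (
    "awooop.project_id",
    "awooop.lane",
    "awooop.provider_tier",
    "awooop.endpoint",
    "gen_ai.request.model",
    "gen_ai.usage.input_tokens",
    "gen_ai.usage.output_tokens",
    "awooop.fallback_reason",
    "awooop.cost_usd",
)


def build_span_attributes(
    project_id: str,
    lane: str,
    provider_tier: str,
    endpoint: str,
    model: str,
    input_tokens: int,
    output_tokens: int,
    fallback_reason: str = "none",   # assumed default when no fallback occurred
    cost_usd: float = 0.0,           # assumed default for non-paid endpoints
) -> Dict[str, AttrValue]:
    """Return all nine required attributes for one request."""
    return {
        "awooop.project_id": project_id,
        "awooop.lane": lane,
        "awooop.provider_tier": provider_tier,
        "awooop.endpoint": endpoint,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "awooop.fallback_reason": fallback_reason,
        "awooop.cost_usd": cost_usd,
    }


def missing_attributes(attrs: Mapping[str, AttrValue]) -> List[str]:
    """List required attribute names absent from an emitted record."""
    return [name for name in REQUIRED_ATTRIBUTES if name not in attrs]
```

`missing_attributes` can also be run against records emitted by other code paths, for example in a test that replays captured spans.
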
## Implementation Stages

### Stage 1 - Sidecar health view

- Keep existing providers.
- Add health and residency checks to identify which lane is safe.
- No traffic proxying yet.

### Stage 2 - Gateway in shadow

- Mirror inference requests to the gateway.
- Gateway computes routing decision but does not execute.
- Compare selected endpoint/model against legacy path.

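
The Stage 2 comparison can be as simple as diffing the gateway's shadow decision against what the legacy path actually used. This is a hedged sketch: `Decision` and `compare_shadow` are hypothetical names, and the shape of the result record is an assumption.

```python
from typing import Dict, NamedTuple


class Decision(NamedTuple):
    endpoint: str
    model: str


def compare_shadow(legacy: Decision, gateway: Decision) -> Dict[str, object]:
    """Compare the legacy path's choice with the gateway's shadow decision."""
    return {
        "endpoint_match": legacy.endpoint == gateway.endpoint,
        "model_match": legacy.model == gateway.model,
        # Keep both sides so mismatches can be inspected later.
        "legacy": legacy._asdict(),
        "gateway": gateway._asdict(),
    }
```

Logging the full record for every mismatch gives a concrete list of requests whose routing would change once the gateway goes active.
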
### Stage 3 - Alert lane active

- Route only `alert-fast` through the gateway.
- Keep code review and deep RCA on legacy providers.

### Stage 4 - All Ollama traffic active

- Move code review, embedding, and deep RCA to the gateway.
- Enforce lane-based deny rules.

### Stage 5 - AwoooP runtime integration

- Convert gateway decisions into `run_state` and `step_journal` entries.
- Use AwoooP budget ledger as source of truth.

## Rollback

Point the provider environment variables back at the raw endpoints:

```yaml
OLLAMA_URL: "http://192.168.0.110:11435"
OLLAMA_SECONDARY_URL: "http://192.168.0.110:11436"
OLLAMA_FALLBACK_URL: "http://192.168.0.111:11434"
```

Do not disable the budget hard kill during rollback.