awoooi/docs/runbooks/AWOOOP-INFERENCE-GATEWAY.md
Your Name c4854bb355
fix(ai): isolate heavy Ollama workloads from GCP alert lane
2026-05-05 23:06:07 +08:00


# AwoooP Inference Gateway Runbook
> Runtime design for keeping GCP-A, GCP-B, 111, and paid providers under one
> controlled inference lane.
---
## Goal
Stop individual services from calling raw model hosts independently.
The gateway becomes the single platform path for:
- endpoint selection
- model lane assignment
- queue and concurrency control
- fallback
- cost and token audit
- trace correlation
## Why This Is Needed
Direct provider calls caused the 2026-05-05 alert incident:
- alert diagnosis needed a fast response
- GCP-A and GCP-B were occupied by heavyweight models
- the request timed out on both GCP-A and GCP-B
- the Gemini fallback incurred cost
Private networking alone cannot prevent model eviction or queue contention. The
gateway must own runtime scheduling.
## Required Lanes
| Lane | Model | Allowed hosts | Notes |
|------|-------|---------------|-------|
| `alert-fast` | `gemma3:4b` | GCP-A, GCP-B, 111 | Synchronous, protected |
| `code-review` | `qwen2.5-coder:7b` | 111, then GCP-B | Transitional: keep GCP-B clean during alert canary |
| `embedding` | `bge-m3` | 111, then GCP-B | Transitional: keep GCP-A/B clean during alert canary |
| `deep-rca` | 14B-class model | 111 or GPU node | Async only |
| `paid-emergency` | Gemini / Claude | Cloud | Budget-gated emergency fallback |
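
The lane table can be held as data so that host checks are lookups rather than scattered conditionals. The following is a minimal sketch, not the gateway's actual configuration format: the `Lane` class, the `LANES` mapping, `host_allowed`, and the host labels `gpu-node` and `cloud` are all hypothetical names chosen for illustration. Models are left as `None` where the table does not pin an exact tag.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Lane:
    model: Optional[str]      # None where the table gives no exact model tag
    hosts: Tuple[str, ...]    # in the order the table lists them
    async_only: bool = False  # True only where the table says "Async only"


# Hypothetical in-memory form of the Required Lanes table.
LANES: Dict[str, Lane] = {
    "alert-fast": Lane("gemma3:4b", ("GCP-A", "GCP-B", "111")),
    "code-review": Lane("qwen2.5-coder:7b", ("111", "GCP-B")),
    "embedding": Lane("bge-m3", ("111", "GCP-B")),
    "deep-rca": Lane(None, ("111", "gpu-node"), async_only=True),
    "paid-emergency": Lane(None, ("cloud",)),
}


def host_allowed(lane_name: str, host: str) -> bool:
    """Return True only if the lane exists and lists the host."""
    lane = LANES.get(lane_name)
    return lane is not None and host in lane.hosts
```

Unknown lanes are denied rather than defaulted, which matches the intent of keeping heavyweight work off GCP-A/B.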
## v0 API
The gateway should initially provide an Ollama-compatible API to minimize caller
changes:
```http
POST /api/generate
GET /api/tags
GET /api/ps
```
Required headers for AwoooP-aware calls:
```http
X-AwoooP-Project-ID: awoooi
X-AwoooP-Trace-ID: <w3c-trace-id>
X-AwoooP-Lane: alert-fast
X-AwoooP-Intent: DIAGNOSE
```
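
As an illustration of an AwoooP-aware caller, the sketch below builds (but does not send) a `POST /api/generate` request carrying the four headers. It uses only the Python standard library. `GATEWAY_URL`, `awooop_headers`, and `build_generate_request` are hypothetical names, and the trace ID is assumed to be a 32-character hex string in the W3C trace-id shape.

```python
import json
import secrets
import urllib.request
from typing import Dict, Optional

# Hypothetical gateway address; substitute the real one.
GATEWAY_URL = "http://inference-gateway.example.internal:11434"


def awooop_headers(lane: str, intent: str,
                   trace_id: Optional[str] = None) -> Dict[str, str]:
    """Build the required AwoooP headers, generating a trace ID if none is given."""
    return {
        "X-AwoooP-Project-ID": "awoooi",
        "X-AwoooP-Trace-ID": trace_id or secrets.token_hex(16),  # 32 hex chars
        "X-AwoooP-Lane": lane,
        "X-AwoooP-Intent": intent,
    }


def build_generate_request(prompt: str, model: str, lane: str,
                           intent: str) -> urllib.request.Request:
    """Prepare an Ollama-style /api/generate request without sending it."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    headers = {"Content-Type": "application/json", **awooop_headers(lane, intent)}
    return urllib.request.Request(
        GATEWAY_URL + "/api/generate",
        data=body.encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen(req, timeout=...)` would send it. An existing trace ID should be passed through rather than regenerated so that spans correlate across services.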
Legacy callers may be accepted in shadow mode, but must be assigned
`project_id=awoooi` by bootstrap rules from ADR-111.
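
On the gateway side, the shadow-mode rule might look like the sketch below. The contents of ADR-111 are not reproduced here, so the only bootstrap rule assumed is the default `project_id=awoooi` stated above. `resolve_caller` and the `legacy` flag are hypothetical names.

```python
from typing import Dict, Optional


def resolve_caller(headers: Dict[str, str]) -> Dict[str, Optional[object]]:
    """Extract AwoooP identity from request headers, defaulting legacy callers."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    project_id = normalized.get("x-awooop-project-id")
    return {
        "project_id": project_id or "awoooi",  # assumed bootstrap default
        "lane": normalized.get("x-awooop-lane"),
        "trace_id": normalized.get("x-awooop-trace-id"),
        "legacy": not project_id,  # True when the caller sent no project ID
    }
```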
## Scheduling Rules
- `alert-fast` concurrency is reserved and cannot be borrowed by other lanes.
- `alert-fast` keeps `gemma3:4b` warm on both GCP-A and GCP-B.
- 14B/32B models are denied on GCP-A/B unless an operator opens a maintenance window.
- Per-host circuit breaker opens after 2 consecutive timeout failures.
- Paid provider fallback requires:
  - all Ollama endpoints failed or are circuit-open
  - budget hard kill not triggered
  - audit span records fallback reason
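
The circuit-breaker and paid-fallback rules can be sketched as follows. This is an assumption-laden illustration: the rules above do not say how a breaker closes again, so resetting the counter on any success is a guess, and `HostBreaker` and `paid_fallback_allowed` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class HostBreaker:
    threshold: int = 2            # opens after 2 consecutive timeouts
    consecutive_timeouts: int = 0

    def record_timeout(self) -> None:
        self.consecutive_timeouts += 1

    def record_success(self) -> None:
        # Assumption: any success closes the breaker again.
        self.consecutive_timeouts = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_timeouts >= self.threshold


def paid_fallback_allowed(breakers: Dict[str, HostBreaker],
                          failed_hosts: Set[str],
                          hard_kill_triggered: bool,
                          fallback_reason: str) -> bool:
    """Apply the three paid-fallback preconditions from the scheduling rules."""
    all_unavailable = bool(breakers) and all(
        host in failed_hosts or breaker.is_open
        for host, breaker in breakers.items()
    )
    # A non-empty reason stands in for "audit span records fallback reason".
    return all_unavailable and not hard_kill_triggered and bool(fallback_reason)
```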
## Minimal Routing Algorithm
```text
input: lane, model, project_id, trace_id

if lane == alert-fast:
    model = gemma3:4b
    try GCP-A with 45s timeout
    try GCP-B with 45s timeout
    try 111 with 60s timeout
    if allowed by budget: try paid emergency fallback

if lane == code-review:
    model = qwen2.5-coder:7b
    try 111 with 120s timeout
    try GCP-B with 90s timeout only if 111 is unavailable

if lane == deep-rca:
    reject synchronous request
    create async run
```
```
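
A Python rendering of the same routing logic is sketched below. The transport is injected as a `call(host, model, timeout)` callable so the ordering can be exercised without a network. `plan_route`, `execute`, `SyncRejected`, and the callable signatures are hypothetical, and creating the async run for `deep-rca` is left out of scope.

```python
from typing import Callable, List, Optional, Tuple

Attempt = Tuple[str, float]  # (host, timeout in seconds)


class SyncRejected(Exception):
    """Raised when a lane only accepts asynchronous runs."""


def plan_route(lane: str) -> Tuple[str, List[Attempt]]:
    """Return the pinned model and the ordered host attempts for a lane."""
    if lane == "alert-fast":
        return "gemma3:4b", [("GCP-A", 45.0), ("GCP-B", 45.0), ("111", 60.0)]
    if lane == "code-review":
        return "qwen2.5-coder:7b", [("111", 120.0), ("GCP-B", 90.0)]
    if lane == "deep-rca":
        raise SyncRejected("deep-rca must be submitted as an async run")
    raise ValueError(f"no routing rule for lane {lane!r}")


def execute(lane: str,
            call: Callable[[str, str, float], str],
            budget_allows_paid: Callable[[], bool] = lambda: False,
            call_paid: Optional[Callable[[], str]] = None) -> str:
    """Try each host in order; later hosts are reached only after earlier ones fail."""
    model, attempts = plan_route(lane)
    for host, timeout in attempts:
        try:
            return call(host, model, timeout)
        except (TimeoutError, ConnectionError):
            continue  # fall through to the next host
    # Only alert-fast has a paid emergency fallback in the algorithm above.
    if lane == "alert-fast" and call_paid is not None and budget_allows_paid():
        return call_paid()
    raise RuntimeError(f"all endpoints failed for lane {lane!r}")
```

Because attempts are sequential, GCP-B is tried for `code-review` only after 111 has failed, which is one way to read "only if 111 is unavailable".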
## Metrics and Logs
Every request must emit:
- `awooop.project_id`
- `awooop.lane`
- `awooop.provider_tier`
- `awooop.endpoint`
- `gen_ai.request.model`
- `gen_ai.usage.input_tokens`
- `gen_ai.usage.output_tokens`
- `awooop.fallback_reason`
- `awooop.cost_usd`
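
One way to keep the attribute set complete is to build it in a single place and check it before export. The sketch below assumes attributes are carried as a flat dictionary. `request_attributes` and `missing_attributes` are hypothetical helpers, and the example values (such as a `provider_tier` of `"ollama"`) are illustrative only.

```python
from typing import Dict, List, Union

AttrValue = Union[str, int, float]

REQUIRED_KEYS = (
    "awooop.project_id",
    "awooop.lane",
    "awooop.provider_tier",
    "awooop.endpoint",
    "gen_ai.request.model",
    "gen_ai.usage.input_tokens",
    "gen_ai.usage.output_tokens",
    "awooop.fallback_reason",
    "awooop.cost_usd",
)


def request_attributes(project_id: str, lane: str, provider_tier: str,
                       endpoint: str, model: str, input_tokens: int,
                       output_tokens: int, fallback_reason: str = "",
                       cost_usd: float = 0.0) -> Dict[str, AttrValue]:
    """Assemble the per-request attributes listed above."""
    return {
        "awooop.project_id": project_id,
        "awooop.lane": lane,
        "awooop.provider_tier": provider_tier,
        "awooop.endpoint": endpoint,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "awooop.fallback_reason": fallback_reason,  # empty when no fallback occurred
        "awooop.cost_usd": cost_usd,
    }


def missing_attributes(attrs: Dict[str, AttrValue]) -> List[str]:
    """List required keys absent from an attribute dictionary."""
    return [key for key in REQUIRED_KEYS if key not in attrs]
```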
## Implementation Stages
### Stage 1 - Sidecar health view
- Keep existing providers.
- Add health and residency checks to identify which lane is safe.
- No traffic proxying yet.
### Stage 2 - Gateway in shadow
- Mirror inference requests to the gateway.
- Gateway computes routing decision but does not execute.
- Compare selected endpoint/model against legacy path.
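
The shadow comparison can be as small as a field-by-field diff of the two decisions. In the sketch below, the decision dictionaries and the `compare_decision` helper are hypothetical; only the compared fields (`endpoint` and `model`) come from the stage description.

```python
from typing import Dict, Optional, Tuple

COMPARED_FIELDS = ("endpoint", "model")


def compare_decision(legacy: Dict[str, str],
                     shadow: Dict[str, str]) -> Dict[str, object]:
    """Report which compared fields differ between legacy and gateway decisions."""
    diffs: Dict[str, Tuple[Optional[str], Optional[str]]] = {}
    for field in COMPARED_FIELDS:
        if legacy.get(field) != shadow.get(field):
            diffs[field] = (legacy.get(field), shadow.get(field))
    return {"match": not diffs, "diffs": diffs}
```

Logging the `diffs` entry alongside the trace ID would allow mismatches to be reviewed before Stage 3 switches real traffic.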
### Stage 3 - Alert lane active
- Route only `alert-fast` through the gateway.
- Keep code review and deep RCA on legacy providers.
### Stage 4 - All Ollama traffic active
- Move code review, embedding, and deep RCA to the gateway.
- Enforce lane-based deny rules.
### Stage 5 - AwoooP runtime integration
- Convert gateway decisions into `run_state` and `step_journal` entries.
- Use AwoooP budget ledger as source of truth.
## Rollback
Point the provider environment variables back at the raw endpoints:
```yaml
OLLAMA_URL: "http://192.168.0.110:11435"
OLLAMA_SECONDARY_URL: "http://192.168.0.110:11436"
OLLAMA_FALLBACK_URL: "http://192.168.0.111:11434"
```
Do not disable budget hard kill during rollback.