awoooi/docs/runbooks/AWOOOP-INFERENCE-GATEWAY.md

# AwoooP Inference Gateway Runbook

> Runtime design for keeping GCP-A, GCP-B, 111, and paid providers under one
> controlled inference lane.

---

## Goal

Stop individual services from calling raw model hosts independently.

The gateway becomes the single platform path for:

- endpoint selection
- model lane assignment
- queue and concurrency control
- fallback
- cost and token audit
- trace correlation

## Why This Is Needed

Direct provider calls caused the 2026-05-05 alert issue:

- alert diagnosis needed a fast response
- GCP-A and GCP-B were loaded with heavyweight models
- the request timed out on both GCP-A and GCP-B
- the fallback to Gemini incurred paid-provider cost

Private networking alone cannot prevent model eviction or queue contention. The
gateway must own runtime scheduling.

## Required Lanes

| Lane | Model | Allowed hosts | Notes |
|------|-------|---------------|-------|
| `alert-fast` | `gemma3:4b` | GCP-A, GCP-B, 111 | Synchronous, protected |
| `code-review` | `qwen2.5-coder:7b` | 111, then GCP-B | Transitional: keep GCP-B clean during alert canary |
| `embedding` | `bge-m3` | 111, then GCP-B | Transitional: keep GCP-A/B clean during alert canary |
| `deep-rca` | 14B-class model | 111 or GPU node | Async only |
| `paid-emergency` | Gemini / Claude | Cloud | Budget-gated emergency fallback |
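
The lane table above could be held in the gateway as static configuration. The following is a minimal sketch, assuming a hypothetical `Lane` dataclass and `resolve_lane` helper (neither is an existing AwoooP API). The host labels `gpu-node` and `cloud` are placeholders, and `model` is left as `None` where the table does not name an exact model.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Lane:
    model: Optional[str]      # None where the table names no exact model
    hosts: Tuple[str, ...]    # ordered by preference; the first host is tried first
    async_only: bool = False  # True for lanes that must not run synchronously


# Mirrors the "Required Lanes" table. Host labels are illustrative.
LANES = {
    "alert-fast": Lane("gemma3:4b", ("GCP-A", "GCP-B", "111")),
    "code-review": Lane("qwen2.5-coder:7b", ("111", "GCP-B")),
    "embedding": Lane("bge-m3", ("111", "GCP-B")),
    "deep-rca": Lane(None, ("111", "gpu-node"), async_only=True),
    "paid-emergency": Lane(None, ("cloud",)),
}


def resolve_lane(name: str) -> Lane:
    """Return the lane config, rejecting lanes that are not in the table."""
    try:
        return LANES[name]
    except KeyError:
        raise ValueError(f"unknown lane: {name!r}") from None
```

Rejecting unknown lane names up front keeps callers from silently landing on a default host.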

## v0 API

The gateway should initially provide an Ollama-compatible API to minimize caller
changes:

```http
POST /api/generate
GET  /api/tags
GET  /api/ps
```

Required headers for AwoooP-aware calls:

```http
X-AwoooP-Project-ID: awoooi
X-AwoooP-Trace-ID: <w3c-trace-id>
X-AwoooP-Lane: alert-fast
X-AwoooP-Intent: DIAGNOSE
```

Legacy callers that omit these headers may be accepted in shadow mode; the
gateway must assign them `project_id=awoooi` using the bootstrap rules from
ADR-111.
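
An AwoooP-aware caller might build its request as sketched below. This is an illustration only: the gateway URL `http://gateway.example:11434` is a placeholder, the helper names are hypothetical, and the trace ID is assumed to be the 32-character lowercase hex `trace-id` field from W3C Trace Context. The request body uses the standard Ollama `/api/generate` fields `model`, `prompt`, and `stream`.

```python
import json
import secrets
import urllib.request


def awooop_headers(lane, intent, project_id="awoooi", trace_id=None):
    """Build the four AwoooP headers plus Content-Type."""
    return {
        "Content-Type": "application/json",
        "X-AwoooP-Project-ID": project_id,
        # Assumption: a fresh 16-byte (32 hex chars) ID when none is propagated.
        "X-AwoooP-Trace-ID": trace_id or secrets.token_hex(16),
        "X-AwoooP-Lane": lane,
        "X-AwoooP-Intent": intent,
    }


def build_generate_request(base_url, model, prompt, headers):
    """Prepare (but do not send) a POST /api/generate request."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/generate",
        data=body.encode("utf-8"),
        headers=headers,
        method="POST",
    )


headers = awooop_headers("alert-fast", "DIAGNOSE")
request = build_generate_request(
    "http://gateway.example:11434",  # placeholder gateway address
    "gemma3:4b",
    "Why did disk latency spike?",
    headers,
)
# Sending would be: urllib.request.urlopen(request, timeout=45)
```

In practice the trace ID should be propagated from the caller's active trace instead of generated per request, so that gateway spans correlate with the originating alert.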

## Scheduling Rules

- `alert-fast` concurrency is reserved and cannot be borrowed by other lanes.
- `alert-fast` keeps `gemma3:4b` warm on both GCP-A and GCP-B.
- 14B/32B models are denied on GCP-A/B unless an operator opens a maintenance window.
- The per-host circuit breaker opens after 2 consecutive timeout failures.
- Paid provider fallback requires:
  - all Ollama endpoints failed or are circuit-open
  - budget hard kill not triggered
  - audit span records fallback reason
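
The circuit-breaker and paid-fallback rules above can be sketched as follows. `HostBreaker` and `paid_fallback_allowed` are hypothetical names, and the status strings are illustrative. The rules do not say how a breaker closes again, so this sketch assumes a success resets the counter and leaves recovery probing (half-open state, cool-down timers) out of scope.

```python
from dataclasses import dataclass


@dataclass
class HostBreaker:
    """Opens after `threshold` consecutive timeouts on one host."""
    threshold: int = 2
    consecutive_timeouts: int = 0

    def record_timeout(self) -> None:
        self.consecutive_timeouts += 1

    def record_success(self) -> None:
        # Assumption: any success clears the streak.
        self.consecutive_timeouts = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_timeouts >= self.threshold


def paid_fallback_allowed(ollama_status, budget_hard_kill, fallback_reason):
    """Apply the three paid-fallback preconditions.

    ollama_status maps host -> "ok" | "failed" | "circuit-open" (illustrative).
    """
    all_down = bool(ollama_status) and all(
        state in ("failed", "circuit-open") for state in ollama_status.values()
    )
    # A non-empty reason must exist so the audit span can record it.
    return all_down and not budget_hard_kill and bool(fallback_reason)
```

For example, a host that times out, succeeds, then times out again stays closed, because the two timeouts are not consecutive.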

## Minimal Routing Algorithm

```text
input: lane, model, project_id, trace_id

if lane == alert-fast:
  model = gemma3:4b
  try GCP-A with 45s timeout
  try GCP-B with 45s timeout
  try 111 with 60s timeout
  if allowed by budget: try paid emergency fallback

if lane == code-review:
  model = qwen2.5-coder:7b
  try 111 with 120s timeout
  try GCP-B with 90s timeout only if 111 is unavailable

if lane == deep-rca:
  reject synchronous request
  create async run
```
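
The pseudocode above translates roughly into the Python sketch below. It is illustrative, not the gateway implementation: `route`, `PLANS`, and `AsyncOnlyError` are hypothetical names, and the `call` and `paid_call` callables stand in for whatever HTTP client the gateway uses. Only the two synchronous lanes with stated timeouts are covered, and the budget check is reduced to a boolean supplied by the caller.

```python
class AsyncOnlyError(Exception):
    """Raised when a lane must be served as an async run."""


# lane -> (model, ordered [(host, timeout_seconds)]) taken from the pseudocode.
PLANS = {
    "alert-fast": ("gemma3:4b", [("GCP-A", 45), ("GCP-B", 45), ("111", 60)]),
    "code-review": ("qwen2.5-coder:7b", [("111", 120), ("GCP-B", 90)]),
}


def route(lane, call, budget_allows_paid=False, paid_call=None):
    """Try each host in order; return (endpoint, result) from the first success.

    call(host, model, timeout_s) should raise TimeoutError or ConnectionError
    when the host is unavailable.
    """
    if lane == "deep-rca":
        # The synchronous request is rejected; the caller creates an async run.
        raise AsyncOnlyError("deep-rca is async only")
    model, attempts = PLANS[lane]
    for host, timeout_s in attempts:
        try:
            return host, call(host, model, timeout_s)
        except (TimeoutError, ConnectionError):
            continue  # fall through to the next host
    # Paid fallback applies to alert-fast only, and only if the budget allows.
    if lane == "alert-fast" and budget_allows_paid and paid_call is not None:
        return "paid-emergency", paid_call()
    raise RuntimeError(f"all endpoints failed for lane {lane!r}")
```

Because hosts are tried strictly in order, GCP-B is reached for `code-review` only after 111 has failed, which matches the "only if 111 is unavailable" condition.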

## Metrics and Logs

Every request must emit:

- `awooop.project_id`
- `awooop.lane`
- `awooop.provider_tier`
- `awooop.endpoint`
- `gen_ai.request.model`
- `gen_ai.usage.input_tokens`
- `gen_ai.usage.output_tokens`
- `awooop.fallback_reason`
- `awooop.cost_usd`
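
A small completeness check can catch requests that skip one of these attributes. The sketch below uses a plain dict so that it stays independent of any tracing library; `missing_attributes` is a hypothetical helper, and the example values (including `"self-hosted"` and `"none"`) are assumptions, since the accepted values are not defined here.

```python
REQUIRED_ATTRIBUTES = (
    "awooop.project_id",
    "awooop.lane",
    "awooop.provider_tier",
    "awooop.endpoint",
    "gen_ai.request.model",
    "gen_ai.usage.input_tokens",
    "gen_ai.usage.output_tokens",
    "awooop.fallback_reason",
    "awooop.cost_usd",
)


def missing_attributes(attrs):
    """Return the required keys that are absent from attrs, in list order."""
    return [key for key in REQUIRED_ATTRIBUTES if key not in attrs]


# Illustrative attribute set for a request served locally without fallback.
example = {
    "awooop.project_id": "awoooi",
    "awooop.lane": "alert-fast",
    "awooop.provider_tier": "self-hosted",  # assumed value
    "awooop.endpoint": "GCP-A",
    "gen_ai.request.model": "gemma3:4b",
    "gen_ai.usage.input_tokens": 812,
    "gen_ai.usage.output_tokens": 164,
    "awooop.fallback_reason": "none",       # assumed sentinel for "no fallback"
    "awooop.cost_usd": 0.0,
}
```

Emitting `awooop.fallback_reason` and `awooop.cost_usd` even when no fallback occurred keeps the attribute set uniform, which simplifies dashboards and audits.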

## Implementation Stages

### Stage 1 - Sidecar health view

- Keep existing providers.
- Add health and residency checks to identify which hosts are safe for each lane.
- No traffic proxying yet.
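
A residency check could be built on the Ollama `GET /api/ps` response, which lists the models currently loaded on a host. The sketch below only parses an already-fetched payload. The helper names are hypothetical, and it assumes the response has a top-level `models` list whose entries carry a `name` field.

```python
def resident_models(ps_payload):
    """Return the set of model names reported as loaded by /api/ps."""
    return {
        entry["name"]
        for entry in ps_payload.get("models", [])
        if "name" in entry
    }


def host_ready_for(ps_payload, model):
    """True when the lane's model is already resident on the host."""
    return model in resident_models(ps_payload)


# Illustrative payloads, trimmed to the one field the check reads.
warm_host = {"models": [{"name": "gemma3:4b"}]}
busy_host = {"models": [{"name": "qwen2.5-coder:7b"}]}
```

Here `host_ready_for(warm_host, "gemma3:4b")` is true while the same check against `busy_host` is false, which is the signal the health view needs before `alert-fast` traffic is trusted to a host.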

### Stage 2 - Gateway in shadow

- Mirror inference requests to the gateway.
- Gateway computes routing decision but does not execute.
- Compare selected endpoint/model against legacy path.
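
The shadow comparison can be as simple as diffing two decision records. This is a minimal sketch with hypothetical names (`Decision`, `compare_decisions`); how the legacy path reports its chosen endpoint and model is not specified here and is assumed to be available to the comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    endpoint: str
    model: str


def compare_decisions(legacy: Decision, shadow: Decision) -> dict:
    """Report field-level agreement between legacy and gateway choices."""
    endpoint_match = legacy.endpoint == shadow.endpoint
    model_match = legacy.model == shadow.model
    return {
        "endpoint_match": endpoint_match,
        "model_match": model_match,
        "match": endpoint_match and model_match,
    }


# Example: legacy sent a code review to GCP-A, the gateway would pick 111.
result = compare_decisions(
    Decision("GCP-A", "qwen2.5-coder:7b"),
    Decision("111", "qwen2.5-coder:7b"),
)
```

Recording endpoint and model agreement separately makes it easier to tell an expected divergence (the gateway moving a lane off GCP-A/B) from an unexpected model change.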

### Stage 3 - Alert lane active

- Route only `alert-fast` through the gateway.
- Keep code review and deep RCA on legacy providers.

### Stage 4 - All Ollama traffic active

- Move code review, embedding, and deep RCA to the gateway.
- Enforce lane-based deny rules.

### Stage 5 - AwoooP runtime integration

- Convert gateway decisions into `run_state` and `step_journal` entries.
- Use AwoooP budget ledger as source of truth.

## Rollback

Point the provider environment variables back at the raw endpoints:

```yaml
OLLAMA_URL: "http://192.168.0.110:11435"
OLLAMA_SECONDARY_URL: "http://192.168.0.110:11436"
OLLAMA_FALLBACK_URL: "http://192.168.0.111:11434"
```

Do not disable budget hard kill during rollback.