awoooi/docs/runbooks/AWOOOP-INFERENCE-GATEWAY.md
Your Name c4854bb355
fix(ai): isolate heavy Ollama workloads from GCP alert lane
2026-05-05 23:06:07 +08:00

AwoooP Inference Gateway Runbook

Runtime design for routing all inference traffic to GCP-A, GCP-B, host 111, and paid providers through one controlled gateway.


Goal

Stop individual services from calling raw model hosts independently.

The gateway becomes the single platform path for:

  • endpoint selection
  • model lane assignment
  • queue and concurrency control
  • fallback
  • cost and token audit
  • trace correlation

Why This Is Needed

Direct provider calls caused the 2026-05-05 alert incident:

  • alert diagnosis needed a fast response
  • GCP-A and GCP-B were loaded with heavyweight models
  • the request timed out on both GCP-A and GCP-B
  • the call fell back to Gemini, which incurred paid-provider cost

Private networking alone cannot prevent model eviction or queue contention. The gateway must own runtime scheduling.

Required Lanes

| Lane | Model | Allowed hosts | Notes |
| --- | --- | --- | --- |
| alert-fast | gemma3:4b | GCP-A, GCP-B, 111 | Synchronous, protected |
| code-review | qwen2.5-coder:7b | 111, then GCP-B | Transitional: keep GCP-B clean during alert canary |
| embedding | bge-m3 | 111, then GCP-B | Transitional: keep GCP-A/B clean during alert canary |
| deep-rca | 14B-class model | 111 or GPU node | Async only |
| paid-emergency | Gemini / Claude | Cloud | Budget-gated emergency fallback |
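
The lane table can be carried in the gateway as plain data. The following is a minimal sketch, not the gateway's actual schema: the `Lane` class, its field names, and `resolve_lane` are hypothetical, and `model` is left as `None` where the table does not name one exact model tag.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Lane:
    """One row of the lane table. Field names are illustrative assumptions."""
    name: str
    model: Optional[str]        # None where the table names no single model tag
    hosts: Tuple[str, ...]      # preference order, as listed in the table
    async_only: bool = False    # True for lanes that must not run synchronously
    protected: bool = False     # True for lanes with reserved capacity
    budget_gated: bool = False  # True for lanes that spend paid-provider budget


LANES = {
    lane.name: lane
    for lane in (
        Lane("alert-fast", "gemma3:4b", ("GCP-A", "GCP-B", "111"), protected=True),
        Lane("code-review", "qwen2.5-coder:7b", ("111", "GCP-B")),
        Lane("embedding", "bge-m3", ("111", "GCP-B")),
        Lane("deep-rca", None, ("111", "GPU node"), async_only=True),
        Lane("paid-emergency", None, ("Cloud",), budget_gated=True),
    )
}


def resolve_lane(name: str) -> Lane:
    """Return the lane definition, rejecting names that are not in the table."""
    try:
        return LANES[name]
    except KeyError:
        raise ValueError(f"unknown lane: {name!r}") from None
```

Keeping hosts as an ordered tuple preserves the "111, then GCP-B" preference without extra fields.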

v0 API

The gateway should initially provide an Ollama-compatible API to minimize caller changes:

POST /api/generate
GET  /api/tags
GET  /api/ps

Required headers for AwoooP-aware calls:

X-AwoooP-Project-ID: awoooi
X-AwoooP-Trace-ID: <w3c-trace-id>
X-AwoooP-Lane: alert-fast
X-AwoooP-Intent: DIAGNOSE

Legacy callers that omit these headers may be accepted in shadow mode, but must be assigned project_id=awoooi by the bootstrap rules from ADR-111.
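
As an illustration of an AwoooP-aware call, the sketch below builds a `POST /api/generate` request carrying the four headers. `GATEWAY_URL` is a hypothetical placeholder (this runbook does not define the gateway address), the helper names are made up for the example, and the trace ID format assumes `<w3c-trace-id>` means the 32-hex-character trace-id of W3C Trace Context.

```python
import json
import secrets
import urllib.request
from typing import Dict, Optional

# Hypothetical gateway address; not specified anywhere in this runbook.
GATEWAY_URL = "http://inference-gateway.example.internal:11434"


def new_trace_id() -> str:
    """Return 32 lowercase hex characters (assumed W3C trace-id shape)."""
    return secrets.token_hex(16)


def build_headers(lane: str, intent: str, trace_id: str,
                  project_id: str = "awoooi") -> Dict[str, str]:
    """Build the required AwoooP headers for one request."""
    return {
        "Content-Type": "application/json",
        "X-AwoooP-Project-ID": project_id,
        "X-AwoooP-Trace-ID": trace_id,
        "X-AwoooP-Lane": lane,
        "X-AwoooP-Intent": intent,
    }


def build_generate_request(prompt: str, model: str, lane: str, intent: str,
                           trace_id: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) an Ollama-style /api/generate request."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        f"{GATEWAY_URL}/api/generate",
        data=body.encode("utf-8"),
        headers=build_headers(lane, intent, trace_id or new_trace_id()),
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen(req, timeout=45)` would send it; the request is built separately here so it can be inspected without network access.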

Scheduling Rules

  • alert-fast concurrency is reserved and cannot be borrowed by other lanes.
  • alert-fast keeps gemma3:4b warm on both GCP-A and GCP-B.
  • 14B/32B models are denied on GCP-A/B unless an operator opens a maintenance window.
  • The per-host circuit breaker opens after 2 consecutive timeouts.
  • Paid provider fallback requires:
    • all Ollama endpoints failed or are circuit-open
    • budget hard kill not triggered
    • audit span records fallback reason

Minimal Routing Algorithm

input: lane, model, project_id, trace_id

if lane == alert-fast:
  model = gemma3:4b
  try GCP-A with 45s timeout
  try GCP-B with 45s timeout
  try 111 with 60s timeout
  if allowed by budget: try paid emergency fallback

if lane == code-review:
  model = qwen2.5-coder:7b
  try 111 with 120s timeout
  try GCP-B with 90s timeout only if 111 is unavailable

if lane == deep-rca:
  reject synchronous request
  create async run
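
One way to turn the pseudocode above into testable Python is to inject the transport as a callable. This is a sketch under assumptions: `route`, `AsyncOnly`, and the `call` signature are hypothetical, failures are assumed to surface as `OSError` subclasses (which covers `TimeoutError` and `ConnectionError`), and the caller is expected to pass `paid_call` only when the budget allows it. Creating the async run for deep-rca is left to the caller.

```python
from typing import Callable, Dict, List, Optional, Tuple

# call(host, model, timeout_seconds) -> response text; raises OSError on failure.
Call = Callable[[str, str, int], str]

# (host, timeout in seconds) in the order given by the pseudocode.
ROUTES: Dict[str, Tuple[str, List[Tuple[str, int]]]] = {
    "alert-fast": ("gemma3:4b", [("GCP-A", 45), ("GCP-B", 45), ("111", 60)]),
    "code-review": ("qwen2.5-coder:7b", [("111", 120), ("GCP-B", 90)]),
}


class AsyncOnly(Exception):
    """Raised when a lane rejects synchronous requests."""


def route(lane: str, call: Call,
          paid_call: Optional[Callable[[], str]] = None) -> Tuple[str, str]:
    """Return (endpoint, response) from the first endpoint that answers."""
    if lane == "deep-rca":
        raise AsyncOnly("deep-rca must be submitted as an async run")
    if lane not in ROUTES:
        raise ValueError(f"no route defined for lane {lane!r}")
    model, attempts = ROUTES[lane]  # the lane pins the model
    for host, timeout in attempts:
        try:
            return host, call(host, model, timeout)
        except OSError:
            continue  # timeout or connection failure: try the next host
    if lane == "alert-fast" and paid_call is not None:
        return "paid-emergency", paid_call()
    raise RuntimeError(f"no endpoint available for lane {lane!r}")
```

Because the transport is injected, the ordering and timeouts can be checked with a fake `call` that raises `TimeoutError` for chosen hosts.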

Metrics and Logs

Every request must emit:

  • awooop.project_id
  • awooop.lane
  • awooop.provider_tier
  • awooop.endpoint
  • gen_ai.request.model
  • gen_ai.usage.input_tokens
  • gen_ai.usage.output_tokens
  • awooop.fallback_reason
  • awooop.cost_usd
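
A small helper can make the required attribute set hard to get wrong. The sketch below is illustrative only: the function names are hypothetical, the `"none"` default for `awooop.fallback_reason` is an assumption about what to emit when no fallback occurred, and example values such as a provider tier of `"ollama"` are not defined by this runbook.

```python
from typing import Dict, List, Union

AttrValue = Union[str, int, float]

# Attribute keys copied from the list above.
REQUIRED_ATTRIBUTES = (
    "awooop.project_id",
    "awooop.lane",
    "awooop.provider_tier",
    "awooop.endpoint",
    "gen_ai.request.model",
    "gen_ai.usage.input_tokens",
    "gen_ai.usage.output_tokens",
    "awooop.fallback_reason",
    "awooop.cost_usd",
)


def request_attributes(project_id: str, lane: str, provider_tier: str,
                       endpoint: str, model: str,
                       input_tokens: int, output_tokens: int,
                       fallback_reason: str = "none",
                       cost_usd: float = 0.0) -> Dict[str, AttrValue]:
    """Build the attribute dict for one request's span or log record."""
    return {
        "awooop.project_id": project_id,
        "awooop.lane": lane,
        "awooop.provider_tier": provider_tier,
        "awooop.endpoint": endpoint,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "awooop.fallback_reason": fallback_reason,
        "awooop.cost_usd": cost_usd,
    }


def missing_attributes(attrs: Dict[str, AttrValue]) -> List[str]:
    """Return required keys absent from attrs, in the order listed above."""
    return [key for key in REQUIRED_ATTRIBUTES if key not in attrs]
```

`missing_attributes` can run in a test or at emit time to catch callers that build the dict by hand.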

Implementation Stages

Stage 1 - Sidecar health view

  • Keep existing providers.
  • Add health and model-residency checks to identify which lanes each host can safely serve.
  • No traffic proxying yet.
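
A residency check can be derived from the JSON that Ollama's `GET /api/ps` returns, which lists currently loaded models under a `models` array with a `name` per entry. The sketch below only interprets an already-fetched payload. The function name and the definition of "safe" (the warm model is resident and nothing else is loaded) are assumptions, not rules stated in this runbook.

```python
from typing import Any, Dict, List


def residency_report(ps_payload: Dict[str, Any],
                     warm_model: str = "gemma3:4b") -> Dict[str, Any]:
    """Summarize which models a host has loaded, based on an /api/ps payload."""
    loaded: List[str] = [
        entry["name"] for entry in ps_payload.get("models", []) if "name" in entry
    ]
    others = sorted(name for name in loaded if name != warm_model)
    warm = warm_model in loaded
    return {
        "warm": warm,                 # the alert-fast model is resident
        "other_models": others,       # anything else competing for memory
        # Assumed definition of "safe": warm model resident, nothing else loaded.
        "alert_fast_safe": warm and not others,
    }
```

For example, a payload listing both `gemma3:4b` and a 14B model would report `warm` as true but `alert_fast_safe` as false.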

Stage 2 - Gateway in shadow

  • Mirror inference requests to the gateway.
  • Gateway computes routing decision but does not execute.
  • Compare selected endpoint/model against legacy path.
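
The shadow comparison can be as simple as diffing two decisions and tracking an agreement rate. This is a sketch with hypothetical names (`Decision`, `diff_decision`, `agreement_rate`); how decisions are captured from the legacy path is not covered here.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Decision:
    """Endpoint and model chosen for one request."""
    endpoint: str
    model: str


def diff_decision(gateway: Decision, legacy: Decision) -> Dict[str, bool]:
    """Report per-field agreement between the gateway and the legacy path."""
    return {
        "endpoint_match": gateway.endpoint == legacy.endpoint,
        "model_match": gateway.model == legacy.model,
    }


def agreement_rate(pairs: Iterable[Tuple[Decision, Decision]]) -> float:
    """Fraction of (gateway, legacy) pairs that agree on endpoint and model."""
    results = [all(diff_decision(g, l).values()) for g, l in pairs]
    if not results:
        raise ValueError("no shadow samples to compare")
    return sum(results) / len(results)
```

Reporting endpoint and model mismatches separately helps distinguish routing differences from lane-to-model mapping differences.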

Stage 3 - Alert lane active

  • Route only alert-fast through the gateway.
  • Keep code review and deep RCA on legacy providers.

Stage 4 - All Ollama traffic active

  • Move code review, embedding, and deep RCA to the gateway.
  • Enforce lane-based deny rules.
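
As one possible shape for a deny rule, the sketch below applies the "14B/32B models are denied on GCP-A/B" rule from Scheduling Rules. It rests on several assumptions: parameter size is parsed from the model tag (for example `:14b`), "14B/32B" is read as 14B parameters and above, and models with no size in the tag are allowed. A stricter gateway might deny unknown sizes instead. All names are hypothetical.

```python
import re
from typing import Optional

PROTECTED_HOSTS = {"GCP-A", "GCP-B"}
HEAVY_THRESHOLD_B = 14.0  # assumption: "14B/32B" means 14B parameters and above

# Matches a size tag such as ":7b" or ":32b-instruct".
_SIZE_TAG = re.compile(r":(\d+(?:\.\d+)?)b\b", re.IGNORECASE)


def param_size_b(model: str) -> Optional[float]:
    """Return the parameter count in billions parsed from the tag, or None."""
    match = _SIZE_TAG.search(model)
    return float(match.group(1)) if match else None


def is_allowed(host: str, model: str, maintenance_open: bool = False) -> bool:
    """Apply the heavy-model deny rule for protected hosts."""
    if host not in PROTECTED_HOSTS or maintenance_open:
        return True
    size = param_size_b(model)
    # Assumption: models without a parseable size (e.g. bge-m3) are allowed.
    return size is None or size < HEAVY_THRESHOLD_B
```

With these assumptions, `gemma3:4b` is allowed on GCP-A, while a `:14b` or `:32b` tag is denied there unless maintenance is open.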

Stage 5 - AwoooP runtime integration

  • Convert gateway decisions into run_state and step_journal entries.
  • Use AwoooP budget ledger as source of truth.

Rollback

Set provider env back to raw endpoints:

OLLAMA_URL: "http://192.168.0.110:11435"
OLLAMA_SECONDARY_URL: "http://192.168.0.110:11436"
OLLAMA_FALLBACK_URL: "http://192.168.0.111:11434"

Do not disable budget hard kill during rollback.