Your Name c4854bb355

fix(ai): isolate heavy Ollama workloads from GCP alert lane

2026-05-05 23:06:07 +08:00

AwoooP Inference Gateway Runbook

Runtime design for keeping GCP-A, GCP-B, 111, and paid providers behind one controlled inference path.

Goal

Stop individual services from calling raw model hosts independently.

The gateway becomes the single platform path for:

endpoint selection
model lane assignment
queue and concurrency control
fallback
cost and token audit
trace correlation

Why This Is Needed

Direct provider calls caused the 2026-05-05 alert issue:

alert diagnosis needed a fast response
GCP-A and GCP-B were loaded with heavyweight models
the request timed out on GCP-A and then on GCP-B
the fallback to Gemini incurred paid-provider cost

Private networking alone cannot prevent model eviction or queue contention. The gateway must own runtime scheduling.

Required Lanes

Lane	Model	Allowed hosts	Notes
`alert-fast`	`gemma3:4b`	GCP-A, GCP-B, 111	Synchronous, protected
`code-review`	`qwen2.5-coder:7b`	111, then GCP-B	Transitional: keep GCP-B clean during alert canary
`embedding`	`bge-m3`	111, then GCP-B	Transitional: keep GCP-A/B clean during alert canary
`deep-rca`	14B-class model	111 or GPU node	Async only
`paid-emergency`	Gemini / Claude	Cloud	Budget-gated emergency fallback
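
The lane table could be held as data so the gateway can check host eligibility before dispatching. The following is a minimal sketch, not the gateway's actual configuration format: the `Lane` dataclass, the `LANES` mapping, and `host_allowed` are hypothetical names, and the host labels simply mirror the table.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Lane:
    name: str
    model: Optional[str]        # None where the table names no single model
    hosts: Tuple[str, ...]      # in the order listed in the table
    sync_allowed: bool = True   # deep-rca is async only
    budget_gated: bool = False  # paid-emergency is budget-gated


# Hypothetical encoding of the Required Lanes table.
LANES = {
    "alert-fast": Lane("alert-fast", "gemma3:4b", ("GCP-A", "GCP-B", "111")),
    "code-review": Lane("code-review", "qwen2.5-coder:7b", ("111", "GCP-B")),
    "embedding": Lane("embedding", "bge-m3", ("111", "GCP-B")),
    "deep-rca": Lane("deep-rca", None, ("111", "GPU node"), sync_allowed=False),
    "paid-emergency": Lane("paid-emergency", None, ("Cloud",), budget_gated=True),
}


def host_allowed(lane_name: str, host: str) -> bool:
    """Return True if the lane table lists this host for the lane."""
    lane = LANES.get(lane_name)
    return lane is not None and host in lane.hosts
```

Keeping the table as frozen data rather than branching logic means a transitional change (for example, dropping GCP-B from `code-review` during the alert canary) is a one-line edit.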

v0 API

The gateway should initially provide an Ollama-compatible API to minimize caller changes:

POST /api/generate
GET  /api/tags
GET  /api/ps

Required headers for AwoooP-aware calls:

X-AwoooP-Project-ID: awoooi
X-AwoooP-Trace-ID: <w3c-trace-id>
X-AwoooP-Lane: alert-fast
X-AwoooP-Intent: DIAGNOSE
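
As an illustration of what an AwoooP-aware caller might send, the sketch below prepares a `POST /api/generate` request carrying the four headers. The helper names and the gateway URL are hypothetical; the JSON body uses the standard Ollama `model`, `prompt`, and `stream` fields. The request is built but not sent.

```python
import json
import urllib.request


def gateway_headers(trace_id: str, lane: str, intent: str,
                    project_id: str = "awoooi") -> dict:
    """Build the four AwoooP headers listed above."""
    return {
        "X-AwoooP-Project-ID": project_id,
        "X-AwoooP-Trace-ID": trace_id,
        "X-AwoooP-Lane": lane,
        "X-AwoooP-Intent": intent,
    }


def build_generate_request(gateway_url: str, model: str, prompt: str,
                           headers: dict) -> urllib.request.Request:
    """Prepare (but do not send) a POST to the Ollama-compatible endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        url=gateway_url.rstrip("/") + "/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json", **headers},
        method="POST",
    )


# Example with a placeholder gateway address (an assumption, not a real host):
# req = build_generate_request(
#     "http://gateway.example:11434", "gemma3:4b", "Diagnose this alert",
#     gateway_headers("<w3c-trace-id>", "alert-fast", "DIAGNOSE"))
# urllib.request.urlopen(req, timeout=45) would then send it.
```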

Legacy callers that omit these headers may be accepted in shadow mode, but the bootstrap rules from ADR-111 must assign them project_id=awoooi.
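
On the gateway side, the default assignment for legacy callers might look like the sketch below. It does not reproduce ADR-111; it only shows the `project_id=awoooi` default. `RequestContext` and `resolve_context` are hypothetical names.

```python
from typing import Mapping, NamedTuple, Optional

DEFAULT_PROJECT_ID = "awoooi"  # default for legacy callers, per the text above


class RequestContext(NamedTuple):
    project_id: str
    lane: Optional[str]
    trace_id: Optional[str]
    legacy: bool  # True when the caller sent no AwoooP project header


def resolve_context(headers: Mapping[str, str]) -> RequestContext:
    """Read AwoooP headers case-insensitively and default legacy callers."""
    lowered = {key.lower(): value for key, value in headers.items()}
    project_id = lowered.get("x-awooop-project-id")
    return RequestContext(
        project_id=project_id or DEFAULT_PROJECT_ID,
        lane=lowered.get("x-awooop-lane"),
        trace_id=lowered.get("x-awooop-trace-id"),
        legacy=not project_id,
    )
```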

Scheduling Rules

alert-fast concurrency is reserved and cannot be borrowed by other lanes.
alert-fast keeps gemma3:4b warm on both GCP-A and GCP-B.
14B/32B models are denied on GCP-A/B unless an operator opens a maintenance window.
Per-host circuit breaker opens after 2 consecutive timeout failures.
Paid provider fallback requires all of the following:
- all Ollama endpoints have failed or are circuit-open
- the budget hard kill has not been triggered
- the audit span records the fallback reason
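
The circuit-breaker and paid-fallback rules above can be sketched as follows. This is an illustration under stated assumptions: `HostBreaker` and `paid_fallback_allowed` are hypothetical names, and because the rules do not say how a breaker closes again, the sketch resets only on a recorded success.

```python
from typing import Dict


class HostBreaker:
    """Opens after `threshold` consecutive timeouts; a success resets the count."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self.consecutive_timeouts = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_timeouts >= self.threshold

    def record_timeout(self) -> None:
        self.consecutive_timeouts += 1

    def record_success(self) -> None:
        self.consecutive_timeouts = 0


def paid_fallback_allowed(host_unavailable: Dict[str, bool],
                          budget_hard_kill_triggered: bool,
                          fallback_reason: str) -> bool:
    """Apply the three conditions for paid fallback.

    host_unavailable maps each Ollama host to True when it has failed
    or its breaker is open. An empty reason is rejected so the audit
    span always has something to record.
    """
    all_down = bool(host_unavailable) and all(host_unavailable.values())
    has_reason = bool(fallback_reason.strip())
    return all_down and not budget_hard_kill_triggered and has_reason
```

Counting consecutive timeouts rather than total timeouts means one slow request between successes does not take a host out of rotation.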

Minimal Routing Algorithm

input: lane, model, project_id, trace_id

if lane == alert-fast:
  model = gemma3:4b
  try GCP-A with 45s timeout
  try GCP-B with 45s timeout
  try 111 with 60s timeout
  if allowed by budget: try paid emergency fallback

if lane == code-review:
  model = qwen2.5-coder:7b
  try 111 with 120s timeout
  try GCP-B with 90s timeout only if 111 is unavailable

if lane == deep-rca:
  reject synchronous request
  create async run
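
The pseudocode above could be turned into a function that returns an ordered attempt plan, leaving the actual HTTP calls to the caller. This is one possible reading, not the gateway's implementation: `Attempt`, `AsyncOnly`, and `plan_route` are hypothetical names, "only if 111 is unavailable" is interpreted as a pre-check against a set of unavailable hosts, and the paid attempt carries no model or timeout because the pseudocode gives none.

```python
from typing import AbstractSet, List, NamedTuple, Optional


class Attempt(NamedTuple):
    host: str
    model: Optional[str]
    timeout_s: Optional[int]


class AsyncOnly(Exception):
    """Raised for lanes that reject synchronous requests."""


def plan_route(lane: str,
               unavailable: AbstractSet[str] = frozenset(),
               budget_ok: bool = False) -> List[Attempt]:
    """Return the hosts to try, in order, for a synchronous request."""
    if lane == "alert-fast":
        model = "gemma3:4b"
        plan = [
            Attempt("GCP-A", model, 45),
            Attempt("GCP-B", model, 45),
            Attempt("111", model, 60),
        ]
        if budget_ok:
            # Last resort; provider and timeout are not specified in the runbook.
            plan.append(Attempt("paid-emergency", None, None))
        return plan
    if lane == "code-review":
        model = "qwen2.5-coder:7b"
        if "111" in unavailable:
            return [Attempt("GCP-B", model, 90)]
        return [Attempt("111", model, 120)]
    if lane == "deep-rca":
        # The caller is expected to create an async run instead.
        raise AsyncOnly("deep-rca accepts async runs only")
    raise ValueError(f"no routing rule for lane {lane!r}")
```

Returning a plan instead of executing it also suits the shadow stage, where the gateway computes a decision without acting on it.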

Metrics and Logs

Every request must emit:

awooop.project_id
awooop.lane
awooop.provider_tier
awooop.endpoint
gen_ai.request.model
gen_ai.usage.input_tokens
gen_ai.usage.output_tokens
awooop.fallback_reason
awooop.cost_usd
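
One way to make the "every request must emit" rule checkable is to build the attribute set in one place and validate it before export. The sketch below is an assumption-laden illustration: the function names are hypothetical, the example `provider_tier` value is invented, and emitting `"none"` as the fallback reason for non-fallback requests is a convention chosen here, not one stated in the runbook.

```python
from typing import Dict, List, Mapping, Union

REQUIRED_ATTRIBUTES = (
    "awooop.project_id",
    "awooop.lane",
    "awooop.provider_tier",
    "awooop.endpoint",
    "gen_ai.request.model",
    "gen_ai.usage.input_tokens",
    "gen_ai.usage.output_tokens",
    "awooop.fallback_reason",
    "awooop.cost_usd",
)

AttrValue = Union[str, int, float]


def request_attributes(project_id: str, lane: str, provider_tier: str,
                       endpoint: str, model: str, input_tokens: int,
                       output_tokens: int, fallback_reason: str = "none",
                       cost_usd: float = 0.0) -> Dict[str, AttrValue]:
    """Assemble the per-request attributes listed above."""
    return {
        "awooop.project_id": project_id,
        "awooop.lane": lane,
        "awooop.provider_tier": provider_tier,
        "awooop.endpoint": endpoint,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "awooop.fallback_reason": fallback_reason,
        "awooop.cost_usd": cost_usd,
    }


def missing_attributes(attrs: Mapping[str, AttrValue]) -> List[str]:
    """Return required keys absent from attrs, in the documented order."""
    return [key for key in REQUIRED_ATTRIBUTES if key not in attrs]
```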

Implementation Stages

Stage 1 - Sidecar health view

Keep existing providers.
Add health and residency checks to identify which lane is safe.
No traffic proxying yet.
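
A residency check could be based on the parsed JSON from Ollama's `GET /api/ps`, which lists currently loaded models under a `models` array. The sketch below is illustrative only: the function names are hypothetical, and the "safe" criterion (the warm model is resident and nothing else is loaded) is an assumption rather than a rule from this runbook.

```python
from typing import Any, Dict, List


def resident_models(ps_payload: Dict[str, Any]) -> List[str]:
    """Names of models currently loaded, from a parsed /api/ps response."""
    return [entry.get("name", "") for entry in ps_payload.get("models", [])]


def alert_lane_safe(ps_payload: Dict[str, Any],
                    warm_model: str = "gemma3:4b") -> bool:
    """Assumed criterion: the warm model is loaded and is the only one."""
    names = resident_models(ps_payload)
    return warm_model in names and all(name == warm_model for name in names)


# Example payloads shaped like /api/ps output (other fields omitted):
clean_host = {"models": [{"name": "gemma3:4b"}]}
crowded_host = {"models": [{"name": "gemma3:4b"}, {"name": "qwen2.5:14b"}]}
```

With these payloads, `alert_lane_safe(clean_host)` is true and `alert_lane_safe(crowded_host)` is false, which is the distinction the health view needs to surface.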

Stage 2 - Gateway in shadow

Mirror inference requests to the gateway.
Gateway computes routing decision but does not execute.
Compare selected endpoint/model against legacy path.
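
The shadow comparison might be recorded as one small record per mirrored request. In this sketch `Decision` and `compare_shadow` are hypothetical names, and the record layout is an assumption.

```python
from typing import Any, Dict, NamedTuple


class Decision(NamedTuple):
    endpoint: str
    model: str


def compare_shadow(trace_id: str, gateway: Decision,
                   legacy: Decision) -> Dict[str, Any]:
    """Compare the gateway's would-be route with what the legacy path used."""
    endpoint_match = gateway.endpoint == legacy.endpoint
    model_match = gateway.model == legacy.model
    return {
        "trace_id": trace_id,
        "gateway": gateway._asdict(),
        "legacy": legacy._asdict(),
        "endpoint_match": endpoint_match,
        "model_match": model_match,
        "match": endpoint_match and model_match,
    }
```

Keying the record by trace ID lets mismatches be joined back to the original request during review.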

Stage 3 - Alert lane active

Route only alert-fast through the gateway.
Keep code review and deep RCA on legacy providers.

Stage 4 - All Ollama traffic active

Move code review, embedding, and deep RCA to the gateway.
Enforce lane-based deny rules.
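
A deny rule such as the 14B/32B restriction from Scheduling Rules could be enforced with a check like the one below. This is a sketch under assumptions: the names are hypothetical, the parameter size is guessed from the model tag (for example `14b` in a tag), anything at or above 14B is treated as heavy, and models whose size cannot be parsed are not denied by this rule.

```python
import re
from typing import Optional

PROTECTED_HOSTS = frozenset({"GCP-A", "GCP-B"})
HEAVY_THRESHOLD_B = 14.0  # assumed cut-off; the runbook names 14B and 32B

_SIZE_IN_TAG = re.compile(r"(\d+(?:\.\d+)?)b", re.IGNORECASE)


def parameter_size_b(model: str) -> Optional[float]:
    """Parse a size like '14b' from the tag after ':'; None if absent."""
    tag = model.partition(":")[2]
    match = _SIZE_IN_TAG.search(tag)
    return float(match.group(1)) if match else None


def is_denied(model: str, host: str, maintenance_open: bool = False) -> bool:
    """Deny heavy models on protected hosts unless maintenance is open."""
    if host not in PROTECTED_HOSTS or maintenance_open:
        return False
    size = parameter_size_b(model)
    return size is not None and size >= HEAVY_THRESHOLD_B
```

Parsing the tag is a heuristic; a production check would more likely rely on an explicit allowlist per host.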

Stage 5 - AwoooP runtime integration

Convert gateway decisions into run_state and step_journal entries.
Use AwoooP budget ledger as source of truth.

Rollback

Set the provider environment variables back to the raw endpoints:

OLLAMA_URL: "http://192.168.0.110:11435"
OLLAMA_SECONDARY_URL: "http://192.168.0.110:11436"
OLLAMA_FALLBACK_URL: "http://192.168.0.111:11434"

Do not disable budget hard kill during rollback.
