docs(awooop): define private Ollama mesh gateway

2026-05-05 22:56:22 +08:00
parent 7baa316224
commit ed7c6946cb
9 changed files with 786 additions and 9 deletions
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -3175,3 +3175,21 @@ kubectl -n awoooi-prod exec deploy/awoooi-api -- printenv | grep -E 'ALERT_OLLAM
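 - `192.168.0.110:11435/11436` currently forwards through the 110 nginx to GCP public IPs. This is a transitional arrangement and should not serve as the long-term primary Ollama lane.
 - Recommendation: build a WireGuard site-to-site private mesh so that K3s / 110 / 111 / GCP-A / GCP-B interconnect over private IPs, bind Ollama to the mesh interface only, and let the AwoooP Inference Gateway own routing, circuit breaking, queueing, and model warm-keeping.
 - Note: GCP-A / GCP-B `/api/ps` currently reports `size_vram: 0`. Moving to a private network addresses connectivity and security, but it cannot make CPU-only GCP nodes match the VRAM/GPU performance of 111; large models should stay on 111 or move to GPU-type GCP nodes.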
+
+### Follow-up documentation
+
+- Added `docs/adr/ADR-125-gcp-ollama-private-mesh-inference-gateway.md`
+- Added `docs/runbooks/GCP-OLLAMA-WIREGUARD-MESH.md`
+- Added `docs/runbooks/AWOOOP-INFERENCE-GATEWAY.md`
+- Added `scripts/ops/ollama-topology-check.sh` as an on-site check tool for three-tier Ollama health / residency / latency
+
+### `ollama-topology-check` measured results
+
+```bash
+bash scripts/ops/ollama-topology-check.sh
+# primary   GCP-A via 110 proxy: gemma3:4b generate OK, ~2s, size_vram=0
+# secondary GCP-B via 110 proxy: gemma3:4b generate OK, ~8.5s, size_vram=0
+# fallback  111 direct:          gemma3:4b generate OK, ~4.9s, size_vram=8210446336
+```
+
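+Conclusion: GCP-A/B can serve the `alert-fast` lane, but for now they should not carry synchronous 14B/32B alert inference; heavy models must be isolated by the AwoooP Inference Gateway onto async / 111 / GPU nodes.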
--- a/docs/adr/ADR-110-gcp-ollama-topology.md
+++ b/docs/adr/ADR-110-gcp-ollama-topology.md
@@ -5,6 +5,10 @@
 **Decision maker**: 統帥 (Commander)
 **Related**: Supersedes ADR-105 (Revert A2 Ollama Primary)

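+> 2026-05-05 amendment: this ADR's "GCP-A → GCP-B → 111 → paid provider" ordering remains valid,
+> but the public GCP IP / 110 nginx proxy is only a transitional transport. The formal transport and runtime
+> management are superseded by ADR-125 (GCP Ollama Private Mesh and AwoooP Inference Gateway).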
+
 ---

 ## Background
@@ -62,3 +66,15 @@ K8s NetworkPolicy egress now includes /32 egress rules for GCP-A/GCP-B (port 11434
 - Primary Ollama traffic goes to GCP SSD, improving performance
 - Local 111 is kept as the last line of defense and is not retired
 - The Gemini/Nemotron/Claude fallback chain is unchanged
+
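+## 2026-05-05 field correction
+
+Measurements taken during the cold-start recovery showed:
+
+- GCP-A / GCP-B were reachable through the 110 nginx proxy, but long prompts returned 504 at times.
+- `/api/ps` reported `size_vram: 0` for GCP-A / GCP-B, so they must not be assumed to match 111's GPU/VRAM inference capability.
+- The synchronous alert path must use a fast-lane model such as `gemma3:4b`; 14B/32B models must move to async or to 111/GPU nodes.
+- Public `34.143.170.20:11434` / `34.21.145.224:11434` are no longer regarded as the final security architecture.
+
+ADR-125 governs from here on: the WireGuard private mesh is the formal network layer, and the AwoooP
+Inference Gateway is the formal runtime layer.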
--- a/docs/adr/ADR-125-gcp-ollama-private-mesh-inference-gateway.md
+++ b/docs/adr/ADR-125-gcp-ollama-private-mesh-inference-gateway.md
@@ -0,0 +1,187 @@
+# ADR-125: GCP Ollama Private Mesh and AwoooP Inference Gateway
+
+**Status**: Accepted  
+**Date**: 2026-05-05 (Asia/Taipei)  
+**Decision Maker**: ogt / Codex  
+**Related**: ADR-110, ADR-111, ADR-113, ADR-121, ADR-124
+
+---
+
+## Context
+
+ADR-110 moved Ollama priority from local-only 111 to a three-layer Ollama topology, with a paid provider as the last resort:
+
+1. GCP-A
+2. GCP-B
+3. Local 111
+4. Paid cloud fallback only after all Ollama lanes fail
+
+The 2026-05-05 dirty-reboot recovery and alert incident exposed two gaps in the
+initial ADR-110 implementation:
+
+- The live transport is `K8s Pod -> 192.168.0.110 nginx -> GCP public IP`, not a
+  true private network path.
+- GCP-A and GCP-B reported `size_vram: 0` in `/api/ps`, so they are CPU-only from
+  Ollama's perspective. Private networking improves reachability and security,
+  but does not make these nodes equivalent to local 111 GPU/VRAM behavior.
+
+The public nginx proxy is useful as a bootstrap bridge, but it must not become
+the long-term primary transport for platform inference.
+
+## Decision
+
+Adopt a two-layer target architecture: a private transport layer (D1, D2) and a runtime gateway layer (D3, D4).
+
+### D1 - WireGuard private mesh is the target transport
+
+AwoooP uses a WireGuard site-to-site mesh for GCP Ollama access.
+
+Planned mesh addresses (CIDR `10.77.114.0/24`):
+
+| Host | Role | WireGuard IP |
+|------|------|--------------|
+| 110 | DevOps / transitional proxy / optional mesh router | `10.77.114.10` |
+| 120 | K3s control-plane node | `10.77.114.120` |
+| 121 | K3s control-plane node | `10.77.114.121` |
+| 111 | Local Ollama fallback | `10.77.114.111` |
+| GCP-A | Ollama primary | `10.77.114.21` |
+| GCP-B | Ollama secondary | `10.77.114.22` |
+
+Ollama endpoints after cutover:
+
+| Tier | Endpoint |
+|------|----------|
+| Primary | `http://10.77.114.21:11434` |
+| Secondary | `http://10.77.114.22:11434` |
+| Fallback | `http://10.77.114.111:11434` |
+
+The current `192.168.0.110:11435/11436` nginx proxy remains an emergency bridge
+only until the mesh cutover passes shadow and canary gates.
+
+### D2 - Public Ollama exposure is forbidden after cutover
+
+After mesh cutover:
+
+- GCP firewall must deny public `0.0.0.0/0 -> 11434`.
+- Ollama should bind to the mesh interface or host firewall should allow
+  `11434/tcp` only from `10.77.114.0/24`.
+- K8s NetworkPolicy should allow egress only to the mesh IPs for Ollama.
+
+### D3 - AwoooP Inference Gateway owns runtime routing
+
+Provider clients should stop selecting raw Ollama hosts directly. They should
+call an AwoooP Inference Gateway that owns:
+
+- endpoint health and circuit breakers
+- per-lane concurrency limits
+- model residency and keep-alive policy
+- request timeouts by intent
+- token/cost audit spans
+- fallback order: GCP-A -> GCP-B -> 111 -> paid provider
+
+The gateway may initially expose an Ollama-compatible surface:
+
+| Endpoint | Purpose |
+|----------|---------|
+| `/api/tags` | health/model inventory |
+| `/api/ps` | residency inventory |
+| `/api/generate` | Ollama-compatible generation |
+| `/v1/awooop/inference/runs` | future async AwoooP run API |
+
+Gateway requests must carry `project_id`, `trace_id`, and an intent/lane label
+when called from AwoooP-aware code.
+
+### D4 - Alert lane is protected
+
+Alert diagnosis must not share an unconstrained queue with heavy code-review or
+deep-RCA jobs.
+
+Initial lanes:
+
+| Lane | Model | Primary use | Default timeout |
+|------|-------|-------------|-----------------|
+| `alert-fast` | `gemma3:4b` | Telegram incident cards and low-risk RCA | 45s |
+| `code-review` | `qwen2.5-coder:7b` | Gitea review | 90s |
+| `embedding` | `bge-m3` | RAG embeddings | 30s |
+| `deep-rca` | 111-hosted 14B-class model | slow human-reviewed diagnosis | async only |
+
+No 14B/32B model may evict `alert-fast` residency on GCP-A/GCP-B unless the
+gateway explicitly opens a maintenance window.
+
+## Migration Plan
+
+### Phase 0 - Current bridge
+
+- Keep `192.168.0.110:11435` and `192.168.0.110:11436` active.
+- Alert path uses `ALERT_OLLAMA_MODEL=gemma3:4b`.
+- Gemini remains paid emergency fallback only.
+
+### Phase 1 - Mesh build in parallel
+
+- Install WireGuard on 110, 120, 121, 111, GCP-A, and GCP-B.
+- Assign mesh IPs from `10.77.114.0/24`.
+- Keep public proxy and old env values unchanged.
+- Verify `/api/tags`, `/api/ps`, and `gemma3:4b` generation over mesh.
+
+### Phase 2 - Shadow mesh
+
+- Add shadow health checks from the API pod to mesh endpoints.
+- Emit OTel spans with both `active_endpoint` and `shadow_endpoint`.
+- Do not send production inference traffic to mesh yet.
+
+Promotion gate:
+
+- 24h continuous mesh health
+- p95 `alert-fast` latency <= current proxy p95 + 10%
+- zero public-path-only success events (requests that succeed over the public proxy but fail over the mesh)
+
+### Phase 3 - Switch active endpoints
+
+Set production env:
+
+```yaml
+OLLAMA_URL: "http://10.77.114.21:11434"
+OLLAMA_SECONDARY_URL: "http://10.77.114.22:11434"
+OLLAMA_FALLBACK_URL: "http://10.77.114.111:11434"
+```
+
+Promotion gate:
+
+- 7-day canary
+- Gemini usage for the alert lane is zero, except during a documented all-Ollama outage
+- no alert-card timeout regression
+
+### Phase 4 - Close public exposure
+
+- Remove or firewall public GCP `11434/tcp`.
+- Keep the nginx bridge config, but disable its listener or restrict it to
+  operator-only rollback use.
+
+## Rollback
+
+Rollback is env-only while the bridge remains available:
+
+```yaml
+OLLAMA_URL: "http://192.168.0.110:11435"
+OLLAMA_SECONDARY_URL: "http://192.168.0.110:11436"
+OLLAMA_FALLBACK_URL: "http://192.168.0.111:11434"
+```
+
+If GCP-A/B are unstable, force 111-first temporarily:
+
+```yaml
+OLLAMA_URL: "http://192.168.0.111:11434"
+OLLAMA_SECONDARY_URL: "http://192.168.0.110:11435"
+OLLAMA_FALLBACK_URL: "http://192.168.0.110:11436"
+```
+
+Paid provider fallback must remain budget-gated.
+
+## Consequences
+
+- GCP Ollama becomes private-by-default instead of public-IP dependent.
+- K8s NetworkPolicy can move from public `/32` rules to stable mesh `/32` rules.
+- AwoooP can manage Ollama as a platform resource shared by all tenants.
+- CPU-only GCP performance remains a capacity constraint; routing must keep
+  heavy jobs off the alert lane or use GPU-capable GCP nodes.
+
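The Phase 2 promotion gate in ADR-125 above (mesh `alert-fast` p95 no worse than the current proxy p95 plus 10%) could be evaluated roughly as follows. This is a minimal sketch and not part of the commit: the function names, the nearest-rank percentile method, the reading of "+ 10%" as a multiplicative tolerance, and the sample latencies are all assumptions made for illustration.

```python
import math


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies (seconds)."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest-rank index
    return ordered[rank - 1]


def shadow_gate_passes(proxy_latencies, mesh_latencies, tolerance=0.10):
    """True when mesh p95 <= proxy p95 * (1 + tolerance)."""
    return p95(mesh_latencies) <= p95(proxy_latencies) * (1 + tolerance)


# Hypothetical alert-fast latencies in seconds, for illustration only.
proxy = [2.0, 2.1, 2.3, 2.2, 2.4, 2.0, 2.5, 2.1, 2.2, 2.6]
mesh_ok = [1.9, 2.0, 2.2, 2.1, 2.3, 2.0, 2.4, 2.0, 2.1, 2.8]
mesh_slow = [2.0, 2.1, 2.3, 2.2, 2.4, 2.0, 2.5, 2.1, 2.2, 3.5]

print(shadow_gate_passes(proxy, mesh_ok))
print(shadow_gate_passes(proxy, mesh_slow))
```

In practice the samples would come from the OTel spans emitted in Phase 2; the 24h health and public-path gates would need separate checks.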
--- a/docs/awooop/DETAILED-IMPLEMENTATION-PLAN.md
+++ b/docs/awooop/DETAILED-IMPLEMENTATION-PLAN.md
@@ -1120,22 +1120,34 @@ AwoooP fix: every LLM call must emit the attributes above into SignOz (188:

 ## 14. Impact of the GCP Ollama topology on AwoooP (ADR-110 integration)

-### 14.1 New topology (ADR-110, effective 2026-05-03)
+### 14.1 New topology (ADR-110 + ADR-125, amended 2026-05-05)

 ```
-Primary  : GCP-A  http://34.143.170.20:11434   (SSD, 9x load speed)
-Secondary: GCP-B  http://34.21.145.224:11434    (SSD, standby)
-Fallback : Local  http://192.168.0.111:11434    (HDD, last line of defense)
-Emergency: Gemini → Nemotron → Claude           (when all Ollama lanes are down)
+Phase 0 bridge:
+Primary  : GCP-A  http://192.168.0.110:11435   (110 nginx → GCP public IP)
+Secondary: GCP-B  http://192.168.0.110:11436
+Fallback : Local  http://192.168.0.111:11434
+Emergency: Gemini → Nemotron → Claude           (when all Ollama lanes are down, budget gated)
+
+Target private mesh:
+Primary  : GCP-A  http://10.77.114.21:11434
+Secondary: GCP-B  http://10.77.114.22:11434
+Fallback : Local  http://10.77.114.111:11434
 ```

+ADR-125 amends the transport layer of ADR-110: the public GCP IP / 110 nginx proxy is kept only as a
+transitional and rollback bridge. The formal path is the WireGuard private mesh; runtime routing is managed by the
+AwoooP Inference Gateway.
+
 ### 14.2 Impact items AwoooP must handle

 | Impact item | Location | Handling | Phase |
 |--------|------|---------|-------|
 | `ollama:current_primary` Redis key dual-write (supports only 1 URL; 3 tiers are now needed) | INV-1 | Change to `platform:ollama:topology` (JSON: primary/secondary/fallback) | Phase 2 |
 | Second definition at `ollama_auto_recovery.py:230` (P0-11) | ollama_auto_recovery.py | Remove; read from config everywhere | Phase 2 PR-03 |
-| GCP IPs into INV-4 (34.143.170.20, 34.21.145.224) | INV-4 | Add to the allowed IP list; confirm the K8s NetworkPolicy egress is configured | Phase 0 INV-4 |
+| GCP public IPs into INV-4 (34.143.170.20, 34.21.145.224) | INV-4 | Mark as transitional only; formally switch to the `10.77.114.21/22` mesh IPs | Phase 0 INV-4 |
+| WireGuard mesh | ADR-125 / runbook | Build the `10.77.114.0/24` private transport; close public 11434 | Prerequisite for Phase 2 |
+| AwoooP Inference Gateway | ADR-125 / runbook | Isolate the alert-fast / code-review / embedding / deep-rca lanes so heavy models cannot crowd the alert lane | Phase 4 |
 | EwoooC Provider Proxy routes through GCP Ollama | Phase 6 | EwoooC shares the platform Ollama topology (platform_resource) | Phase 6 |
 | `telemetry.py:71` IP assert (P0-08) | telemetry.py:71 | Once removed, GCP IPs no longer trigger the assert; make it config-driven | Phase 2 PR-01 |
 | budget_ledger records Ollama usage (free GCP still needs token counting) | Phase 4 | Ollama calls must also record token consumption (budget_ledger) | Phase 4 |
@@ -1143,11 +1155,24 @@ Emergency: Gemini → Nemotron → Claude           (when all Ollama lanes are down)

 ### 14.3 GCP Ollama as a platform_resource (ADR-111)

-GCP Ollama (34.143.170.20, 34.21.145.224) and Local Ollama (192.168.0.111) are always declared as `platform_resource`:
+GCP Ollama (bridge: 34.143.170.20 / 34.21.145.224; target mesh:
+10.77.114.21 / 10.77.114.22) and Local Ollama (192.168.0.111 / target
+10.77.114.111) are always declared as `platform_resource`:
 - Belongs to no tenant
 - Shared by all tenants (AWOOOI / EwoooC / Tsenyang / Bitan), but audit records carry each tenant's own project_id
 - The `platform:ollama:topology` Redis key prefix is `platform:` (not `{project_id}:`)

+### 14.4 Measured limits (2026-05-05)
+
+Measured with `scripts/ops/ollama-topology-check.sh`:
+
+- GCP-A `gemma3:4b` took about 2s, but `size_vram=0`
+- GCP-B `gemma3:4b` took about 8.5s, but `size_vram=0`
+- 111 fallback `gemma3:4b` took about 4.9s, with `size_vram=8210446336`
+
+Conclusion: GCP-A/B can serve as the synchronous `alert-fast` lane, but must not carry synchronous 14B/32B alert diagnosis.
+Heavy models must be diverted by the Inference Gateway to async / 111 / GPU nodes.
+
 ---

 ## 15. Work ordering master table (with parallel groups + Critical Path)
--- a/docs/awooop/MASTER-WORKPLAN.md
+++ b/docs/awooop/MASTER-WORKPLAN.md
@@ -135,7 +135,7 @@ ADR-106 also needs an added section: **Strangler Fig Quantified Gates**, taking shadow →
 3. **Redis working memory project boundary** (#15):
   - `SCAN incident:*` at `incident_service.py:603` → `SCAN {project_id}:incident:*`
   - Every `SCAN`/`KEYS` must carry a prefix
-4. **`platform_resource` exception list**: Ollama failover state, global rate limit, leader election lock, etc. are explicitly marked
+4. **`platform_resource` exception list**: Ollama failover state, global rate limit, leader election lock, etc. are explicitly marked; per ADR-125 the formal GCP Ollama path becomes WireGuard mesh + AwoooP Inference Gateway, and the 110 nginx proxy is kept only as a transitional / rollback bridge
 5. **Regression tests**: cross-project read/write must be rejected; platform_resource must be allowed but written to audit
 6. **AWOOOI Bootstrap Paradox fix** (per ADR-111, INV-3):
   - Entrypoints marked `platform_internal` carry `project_id=__platform__`; they are exempt from hard reject but are written to audit
--- a/docs/runbooks/AWOOOP-INFERENCE-GATEWAY.md
+++ b/docs/runbooks/AWOOOP-INFERENCE-GATEWAY.md
@@ -0,0 +1,153 @@
+# AwoooP Inference Gateway Runbook
+
+> Runtime design for keeping GCP-A, GCP-B, 111, and paid providers behind one
+> controlled inference path.
+
+---
+
+## Goal
+
+Stop individual services from calling raw model hosts independently.
+
+The gateway becomes the single platform path for:
+
+- endpoint selection
+- model lane assignment
+- queue and concurrency control
+- fallback
+- cost and token audit
+- trace correlation
+
+## Why This Is Needed
+
+Direct provider calls caused the 2026-05-05 alert issue:
+
+- alert diagnosis wanted a fast response
+- GCP-A/B were loaded with heavyweight models
+- the request timed out through GCP-A and GCP-B
+- Gemini fallback generated cost
+
+Private networking alone cannot prevent model eviction or queue contention. The
+gateway must own runtime scheduling.
+
+## Required Lanes
+
+| Lane | Model | Allowed hosts | Notes |
+|------|-------|---------------|-------|
+| `alert-fast` | `gemma3:4b` | GCP-A, GCP-B, 111 | Synchronous, protected |
+| `code-review` | `qwen2.5-coder:7b` | GCP-B, 111 | Never 32B on GCP during alert canary |
+| `embedding` | `bge-m3` | GCP-A, GCP-B, 111 | Short timeout |
+| `deep-rca` | 14B-class model | 111 or GPU node | Async only |
+| `paid-emergency` | Gemini / Claude | Cloud | Budget-gated emergency fallback |
+
+## v0 API
+
+The gateway should initially provide an Ollama-compatible API to minimize caller
+changes:
+
+```http
+POST /api/generate
+GET  /api/tags
+GET  /api/ps
+```
+
+Required headers for AwoooP-aware calls:
+
+```http
+X-AwoooP-Project-ID: awoooi
+X-AwoooP-Trace-ID: <w3c-trace-id>
+X-AwoooP-Lane: alert-fast
+X-AwoooP-Intent: DIAGNOSE
+```
+
+Legacy callers may be accepted in shadow mode, but must be assigned
+`project_id=awoooi` by bootstrap rules from ADR-111.
+
+## Scheduling Rules
+
+- `alert-fast` concurrency is reserved and cannot be borrowed by other lanes.
+- `alert-fast` keeps `gemma3:4b` warm on both GCP-A and GCP-B.
+- 14B/32B models are denied on GCP-A/B unless an operator opens a maintenance window.
+- Per-host circuit breaker opens after 2 consecutive timeout failures.
+- Paid provider fallback requires:
+  - all Ollama endpoints failed or are circuit-open
+  - budget hard kill not triggered
+  - audit span records fallback reason
+
+## Minimal Routing Algorithm
+
+```text
+input: lane, model, project_id, trace_id
+
+if lane == alert-fast:
+  model = gemma3:4b
+  try GCP-A with 45s timeout
+  try GCP-B with 45s timeout
+  try 111 with 60s timeout
+  if allowed by budget: try paid emergency fallback
+
+if lane == code-review:
+  model = qwen2.5-coder:7b
+  try GCP-B with 90s timeout
+  try 111 with 120s timeout
+
+if lane == deep-rca:
+  reject synchronous request
+  create async run
+```
+
+## Metrics and Logs
+
+Every request must emit:
+
+- `awooop.project_id`
+- `awooop.lane`
+- `awooop.provider_tier`
+- `awooop.endpoint`
+- `gen_ai.request.model`
+- `gen_ai.usage.input_tokens`
+- `gen_ai.usage.output_tokens`
+- `awooop.fallback_reason`
+- `awooop.cost_usd`
+
+## Implementation Stages
+
+### Stage 1 - Sidecar health view
+
+- Keep existing providers.
+- Add health and residency checks to identify which lane is safe.
+- No traffic proxying yet.
+
+### Stage 2 - Gateway in shadow
+
+- Mirror inference requests to the gateway.
+- Gateway computes routing decision but does not execute.
+- Compare selected endpoint/model against legacy path.
+
+### Stage 3 - Alert lane active
+
+- Route only `alert-fast` through the gateway.
+- Keep code review and deep RCA on legacy providers.
+
+### Stage 4 - All Ollama traffic active
+
+- Move code review, embedding, and deep RCA to the gateway.
+- Enforce lane-based deny rules.
+
+### Stage 5 - AwoooP runtime integration
+
+- Convert gateway decisions into `run_state` and `step_journal` entries.
+- Use AwoooP budget ledger as source of truth.
+
+## Rollback
+
+Set provider env back to raw endpoints:
+
+```yaml
+OLLAMA_URL: "http://192.168.0.110:11435"
+OLLAMA_SECONDARY_URL: "http://192.168.0.110:11436"
+OLLAMA_FALLBACK_URL: "http://192.168.0.111:11434"
+```
+
+Do not disable budget hard kill during rollback.
+
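The gateway runbook's "Minimal Routing Algorithm" and its rule that a per-host circuit breaker opens after 2 consecutive timeout failures can be sketched together as below. This is an illustrative sketch only, not the gateway implementation: `ROUTES`, `CircuitBreaker`, and `plan_attempts` are hypothetical names, resetting the counter on any success is an assumption (the runbook does not say how a breaker closes), and paid fallback, half-open probing, and the actual HTTP calls are left out.

```python
from dataclasses import dataclass

# Lane -> (model, ordered (host, timeout_seconds) attempts), copied from the
# runbook's routing pseudocode. deep-rca has no synchronous route.
ROUTES = {
    "alert-fast": ("gemma3:4b", [("GCP-A", 45), ("GCP-B", 45), ("111", 60)]),
    "code-review": ("qwen2.5-coder:7b", [("GCP-B", 90), ("111", 120)]),
}
ASYNC_ONLY_LANES = {"deep-rca"}


@dataclass
class CircuitBreaker:
    """Opens after `threshold` consecutive timeouts; a success resets it (assumed)."""
    threshold: int = 2
    consecutive_timeouts: int = 0

    @property
    def is_open(self):
        return self.consecutive_timeouts >= self.threshold

    def record_timeout(self):
        self.consecutive_timeouts += 1

    def record_success(self):
        self.consecutive_timeouts = 0


def plan_attempts(lane, breakers):
    """Return (model, attempts) for a lane, skipping hosts whose breaker is open."""
    if lane in ASYNC_ONLY_LANES:
        raise ValueError(f"{lane} is async only; create an async run instead")
    if lane not in ROUTES:
        raise KeyError(f"unknown lane: {lane}")
    model, attempts = ROUTES[lane]
    usable = [(host, timeout) for host, timeout in attempts
              if not breakers.get(host, CircuitBreaker()).is_open]
    return model, usable


breakers = {name: CircuitBreaker() for name in ("GCP-A", "GCP-B", "111")}
breakers["GCP-A"].record_timeout()
breakers["GCP-A"].record_timeout()  # second consecutive timeout opens GCP-A
print(plan_attempts("alert-fast", breakers))
```

An empty attempt list is the point at which the runbook's paid-fallback conditions (all endpoints failed or circuit-open, budget hard kill not triggered) would be checked.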
--- a/docs/runbooks/DEPLOY-GCP-OLLAMA-PROXY.md
+++ b/docs/runbooks/DEPLOY-GCP-OLLAMA-PROXY.md
@@ -1,6 +1,10 @@
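 # GCP Ollama Nginx Proxy Deployment Guide

 > ADR-110 three-tier disaster recovery: key steps for enabling GCP Ollama
+>
+> 2026-05-05 amendment: this runbook is kept only as a transitional / rollback bridge. The formal solution is
+> the WireGuard private mesh and AwoooP Inference Gateway from ADR-125. New deployments must not leave
+> GCP `11434/tcp` open to `0.0.0.0/0` long term.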

 ---

@@ -173,7 +177,10 @@ kubectl describe networkpolicy -n awoooi-prod allow-required-egress
 curl -v http://34.143.170.20:11434/api/tags
 ```

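-If this fails, check whether the GCP firewall rules open 0.0.0.0/0:11434.
+If this fails, the only permitted check is a brief confirmation that the GCP firewall opens `11434/tcp` to 110's fixed
+egress IP. Do not treat `0.0.0.0/0:11434` as a formal setting.
+
+For the formal cutover, follow [GCP-OLLAMA-WIREGUARD-MESH.md](GCP-OLLAMA-WIREGUARD-MESH.md).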

 ### 3. Model loads but inference fails

@@ -189,9 +196,12 @@ curl -v http://34.143.170.20:11434/api/tags
 ## Related documents

 - ADR-110: GCP three-tier disaster recovery architecture
+- ADR-125: GCP Ollama Private Mesh and AwoooP Inference Gateway
 - `k8s/awoooi-prod/04-configmap.yaml`
 - `k8s/awoooi-prod/02-network-policy.yaml`
 - `docs/runbooks/RUNBOOK-OLLAMA-FAILOVER.md`
+- `docs/runbooks/GCP-OLLAMA-WIREGUARD-MESH.md`
+- `docs/runbooks/AWOOOP-INFERENCE-GATEWAY.md`

 ---

--- a/docs/runbooks/GCP-OLLAMA-WIREGUARD-MESH.md
+++ b/docs/runbooks/GCP-OLLAMA-WIREGUARD-MESH.md
@@ -0,0 +1,280 @@
+# GCP Ollama WireGuard Mesh Runbook
+
+> Target state for ADR-125. This replaces the public GCP Ollama proxy as the
+> primary path after shadow and canary validation.
+
+---
+
+## Scope
+
+This runbook builds private Ollama connectivity between AWOOOI K3s and the GCP
+Ollama hosts.
+
+It does not replace AwoooP Inference Gateway work. The mesh solves transport and
+security. The gateway solves routing, queueing, model residency, and fallback.
+
+## Current State
+
+Current production endpoints:
+
+| Variable | Endpoint | Meaning |
+|----------|----------|---------|
+| `OLLAMA_URL` | `http://192.168.0.110:11435` | GCP-A through 110 nginx |
+| `OLLAMA_SECONDARY_URL` | `http://192.168.0.110:11436` | GCP-B through 110 nginx |
+| `OLLAMA_FALLBACK_URL` | `http://192.168.0.111:11434` | Local 111 |
+
+This is a bridge. Do not treat the public proxy as the final architecture.
+
+## Target State
+
+| Host | WireGuard IP | Notes |
+|------|--------------|-------|
+| 110 | `10.77.114.10` | DevOps host and rollback bridge |
+| 120 | `10.77.114.120` | K3s node |
+| 121 | `10.77.114.121` | K3s node |
+| 111 | `10.77.114.111` | Local Ollama fallback |
+| GCP-A | `10.77.114.21` | Primary Ollama |
+| GCP-B | `10.77.114.22` | Secondary Ollama |
+
+Production endpoints after cutover:
+
+```yaml
+OLLAMA_URL: "http://10.77.114.21:11434"
+OLLAMA_SECONDARY_URL: "http://10.77.114.22:11434"
+OLLAMA_FALLBACK_URL: "http://10.77.114.111:11434"
+```
+
+## Prerequisites
+
+- SSH access to GCP-A and GCP-B.
+- GCP IAM permissions for firewall rules if OS firewall alone is not enough.
+- SSH access to 110, 111, 120, and 121.
+- A secured place to store WireGuard private keys. Never commit private keys.
+- Confirm the GCP hosts have enough CPU/RAM for `gemma3:4b`.
+
+## Key Rules
+
+- Private keys are generated on each host and never copied into Git.
+- Public keys may be recorded in the operator handoff note.
+- Public GCP `11434/tcp` must be closed after cutover.
+- `alert-fast` uses `gemma3:4b`; 14B/32B models must not run on GCP-A/B during
+  alert-lane canary.
+
+## Install WireGuard
+
+Ubuntu/Debian:
+
+```bash
+sudo apt-get update
+sudo apt-get install -y wireguard
+```
+
+Alpine:
+
+```bash
+sudo apk add --no-cache wireguard-tools
+```
+
+Generate keys on every host:
+
+```bash
+umask 077
+wg genkey | sudo tee /etc/wireguard/awooop.key >/dev/null  # do not echo the private key to the terminal
+sudo cat /etc/wireguard/awooop.key | wg pubkey | sudo tee /etc/wireguard/awooop.pub
+```
+
+## Configure Peers
+
+Create `/etc/wireguard/wg-awooop.conf` on each host.
+
+Example for GCP-A:
+
+```ini
+[Interface]
+Address = 10.77.114.21/32
+ListenPort = 51820
+PrivateKey = <GCP_A_PRIVATE_KEY>
+
+[Peer]
+# 120 K3s node
+PublicKey = <K3S_120_PUBLIC_KEY>
+AllowedIPs = 10.77.114.120/32
+Endpoint = <120_REACHABLE_ENDPOINT>:51820
+PersistentKeepalive = 25
+
+[Peer]
+# 121 K3s node
+PublicKey = <K3S_121_PUBLIC_KEY>
+AllowedIPs = 10.77.114.121/32
+Endpoint = <121_REACHABLE_ENDPOINT>:51820
+PersistentKeepalive = 25
+
+[Peer]
+# 110 DevOps rollback bridge
+PublicKey = <HOST_110_PUBLIC_KEY>
+AllowedIPs = 10.77.114.10/32
+Endpoint = <110_REACHABLE_ENDPOINT>:51820
+PersistentKeepalive = 25
+```
+
+Example for a K3s node:
+
+```ini
+[Interface]
+Address = 10.77.114.120/32
+ListenPort = 51820
+PrivateKey = <K3S_120_PRIVATE_KEY>
+
+[Peer]
+# GCP-A
+PublicKey = <GCP_A_PUBLIC_KEY>
+AllowedIPs = 10.77.114.21/32
+Endpoint = 34.143.170.20:51820
+PersistentKeepalive = 25
+
+[Peer]
+# GCP-B
+PublicKey = <GCP_B_PUBLIC_KEY>
+AllowedIPs = 10.77.114.22/32
+Endpoint = 34.21.145.224:51820
+PersistentKeepalive = 25
+
+[Peer]
+# Local 111
+PublicKey = <HOST_111_PUBLIC_KEY>
+AllowedIPs = 10.77.114.111/32
+Endpoint = 192.168.0.111:51820
+PersistentKeepalive = 25
+```
+
+The exact peer list depends on reachable endpoints. If inbound access to 120/121
+is not available, use 110 as a temporary mesh relay, then replace it with direct
+K3s-to-GCP peers when routing is confirmed.
+
+## Start WireGuard
+
+```bash
+sudo systemctl enable --now wg-quick@wg-awooop
+sudo wg show wg-awooop
+```
+
+Verify connectivity:
+
+```bash
+ping -c 3 10.77.114.21
+ping -c 3 10.77.114.22
+curl -fsS http://10.77.114.21:11434/api/tags
+curl -fsS http://10.77.114.22:11434/api/tags
+```
+
+## Bind or Firewall Ollama
+
+Preferred: bind Ollama to the mesh interface.
+
+```bash
+sudo systemctl edit ollama
+```
+
+```ini
+[Service]
+Environment="OLLAMA_HOST=10.77.114.21:11434"
+```
+
+Use `10.77.114.22:11434` on GCP-B. Ollama can bind a mesh address only while `wg-awooop` is up, so bring WireGuard up first.
+
+If binding is not possible, firewall the host:
+
+```bash
+sudo ufw allow from 10.77.114.0/24 to any port 11434 proto tcp
+sudo ufw deny 11434/tcp
+```
+
+After changing the systemd override, reload and restart Ollama:
+
+```bash
+sudo systemctl daemon-reload
+sudo systemctl restart ollama
+```
+
+## K8s NetworkPolicy
+
+After mesh cutover, allow only mesh endpoints for Ollama:
+
+```yaml
+- to:
+    - ipBlock:
+        cidr: 10.77.114.21/32
+    - ipBlock:
+        cidr: 10.77.114.22/32
+    - ipBlock:
+        cidr: 10.77.114.111/32
+  ports:
+    - protocol: TCP
+      port: 11434
+```
+
+Do not remove the `192.168.0.110:11435/11436` rules until rollback is no longer
+needed.
+
+## Shadow Validation
+
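+Run from an operator host with `kubectl` access. The script execs into the API pod, so requests originate there and target whatever the pod's `OLLAMA_URL`, `OLLAMA_SECONDARY_URL`, and `OLLAMA_FALLBACK_URL` currently point to: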
+
+```bash
+bash scripts/ops/ollama-topology-check.sh
+```
+
+Expected:
+
+- GCP-A `/api/tags` returns 200.
+- GCP-B `/api/tags` returns 200.
+- `gemma3:4b` generation succeeds on both nodes.
+- `/api/ps` contains `gemma3:4b`.
+- If `size_vram=0`, keep GCP-A/B on `alert-fast` only and route heavy models to
+  111 or a GPU-capable node.
+
+## Cutover
+
+Patch deployment env after shadow passes:
+
+```bash
+kubectl -n awoooi-prod set env deploy/awoooi-api \
+  OLLAMA_URL=http://10.77.114.21:11434 \
+  OLLAMA_SECONDARY_URL=http://10.77.114.22:11434 \
+  OLLAMA_FALLBACK_URL=http://10.77.114.111:11434
+
+kubectl -n awoooi-prod set env deploy/awoooi-worker \
+  OLLAMA_URL=http://10.77.114.21:11434 \
+  OLLAMA_SECONDARY_URL=http://10.77.114.22:11434 \
+  OLLAMA_FALLBACK_URL=http://10.77.114.111:11434
+```
+
+Verify:
+
+```bash
+kubectl -n awoooi-prod rollout status deploy/awoooi-api --timeout=180s
+kubectl -n awoooi-prod rollout status deploy/awoooi-worker --timeout=180s
+bash scripts/ops/ollama-topology-check.sh
+```
+
+## Rollback
+
+```bash
+kubectl -n awoooi-prod set env deploy/awoooi-api \
+  OLLAMA_URL=http://192.168.0.110:11435 \
+  OLLAMA_SECONDARY_URL=http://192.168.0.110:11436 \
+  OLLAMA_FALLBACK_URL=http://192.168.0.111:11434
+
+kubectl -n awoooi-prod set env deploy/awoooi-worker \
+  OLLAMA_URL=http://192.168.0.110:11435 \
+  OLLAMA_SECONDARY_URL=http://192.168.0.110:11436 \
+  OLLAMA_FALLBACK_URL=http://192.168.0.111:11434
+```
+
+## Done Criteria
+
+- Mesh endpoints pass 7 days of canary.
+- Alert lane Gemini usage is zero except documented all-Ollama outages.
+- Public GCP `11434/tcp` is closed.
+- Operator handoff note records peer public keys and rollback owner.
+
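The mesh runbook's NetworkPolicy fragment lists one `/32` `ipBlock` per Ollama mesh endpoint. A small generator like the one below could keep that list consistent with the target-state table and catch an address outside `10.77.114.0/24`. This is a sketch under assumptions: `ollama_egress_rule` and `OLLAMA_HOSTS` are hypothetical names, and the output is a plain dict mirroring the YAML shape rather than a manifest applied to any cluster.

```python
import ipaddress

MESH_CIDR = ipaddress.ip_network("10.77.114.0/24")

# Ollama hosts and mesh IPs from the runbook's target-state table.
OLLAMA_HOSTS = {
    "GCP-A": "10.77.114.21",
    "GCP-B": "10.77.114.22",
    "111": "10.77.114.111",
}


def ollama_egress_rule(hosts, port=11434):
    """Build a NetworkPolicy egress rule (as a dict) with one /32 ipBlock per host."""
    blocks = []
    for name, ip in hosts.items():
        addr = ipaddress.ip_address(ip)
        if addr not in MESH_CIDR:
            raise ValueError(f"{name} ({ip}) is outside {MESH_CIDR}")
        blocks.append({"ipBlock": {"cidr": f"{addr}/32"}})
    return {"to": blocks, "ports": [{"protocol": "TCP", "port": port}]}


rule = ollama_egress_rule(OLLAMA_HOSTS)
for block in rule["to"]:
    print(block["ipBlock"]["cidr"])
```

Passing a bridge address such as `192.168.0.111` raises `ValueError`, which is the intended guard once the public and LAN rules are retired.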
--- a/scripts/ops/ollama-topology-check.sh
+++ b/scripts/ops/ollama-topology-check.sh
@@ -0,0 +1,88 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+NAMESPACE="${NAMESPACE:-awoooi-prod}"
+DEPLOYMENT="${DEPLOYMENT:-awoooi-api}"
+MODEL="${MODEL:-gemma3:4b}"
+TIMEOUT_SECONDS="${TIMEOUT_SECONDS:-60}"
+
+kubectl -n "${NAMESPACE}" exec -i "deploy/${DEPLOYMENT}" -- \
+  env CHECK_MODEL="${MODEL}" CHECK_TIMEOUT_SECONDS="${TIMEOUT_SECONDS}" python - <<'PY'
+import json
+import os
+import time
+import urllib.error
+import urllib.request
+
+model = os.environ["CHECK_MODEL"]
+timeout = int(os.environ["CHECK_TIMEOUT_SECONDS"])
+
+endpoints = [
+    ("primary", os.environ.get("OLLAMA_URL", "")),
+    ("secondary", os.environ.get("OLLAMA_SECONDARY_URL", "")),
+    ("fallback", os.environ.get("OLLAMA_FALLBACK_URL", "")),
+]
+
+print(f"model={model} timeout={timeout}s")
+
+def request_json(url: str, path: str, payload=None, timeout_seconds=10):
+    data = None
+    headers = {}
+    if payload is not None:
+        data = json.dumps(payload).encode()
+        headers["Content-Type"] = "application/json"
+    req = urllib.request.Request(url.rstrip("/") + path, data=data, headers=headers)
+    with urllib.request.urlopen(req, timeout=timeout_seconds) as response:
+        return json.loads(response.read().decode())
+
+for label, url in endpoints:
+    print(f"\n== {label}: {url or '<missing>'} ==")
+    if not url:
+        print("status=missing")
+        continue
+
+    try:
+        tags = request_json(url, "/api/tags", timeout_seconds=10)
+        names = sorted(m.get("name", "") for m in tags.get("models", []))
+        print("tags=ok", ",".join(names[:12]))
+    except Exception as exc:
+        print("tags=fail", type(exc).__name__, str(exc)[:160])
+        continue
+
+    try:
+        ps = request_json(url, "/api/ps", timeout_seconds=10)
+        live = ps.get("models", [])
+        if not live:
+            print("ps=ok live_models=<none>")
+        for item in live:
+            print(
+                "ps=ok",
+                f"model={item.get('model')}",
+                f"expires={item.get('expires_at')}",
+                f"size_vram={item.get('size_vram')}",
+                f"context={item.get('context_length')}",
+            )
+            if item.get("size_vram") == 0:
+                print("warning=cpu_only_or_no_vram")
+    except Exception as exc:
+        print("ps=fail", type(exc).__name__, str(exc)[:160])
+
+    payload = {
+        "model": model,
+        "prompt": "用繁體中文用一行回答：Ollama health check",
+        "stream": False,
+        "keep_alive": "8h",
+        "options": {"num_predict": 32, "temperature": 0.1},
+    }
+    start = time.monotonic()
+    try:
+        result = request_json(url, "/api/generate", payload, timeout_seconds=timeout)
+        latency_ms = round((time.monotonic() - start) * 1000)
+        response = (result.get("response") or "").replace("\n", " ")[:120]
+        print(f"generate=ok latency_ms={latency_ms} response={response}")
+    except urllib.error.HTTPError as exc:
+        body = exc.read().decode(errors="replace")[:200]
+        print("generate=fail", "HTTPError", exc.code, body)
+    except Exception as exc:
+        print("generate=fail", type(exc).__name__, str(exc)[:200])
+PY
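The script above prints `warning=cpu_only_or_no_vram` when `/api/ps` reports `size_vram` of 0. The same signal could feed a lane decision, as in this minimal sketch. It is not part of the commit: `classify_residency` is a hypothetical helper, the "heavy models allowed only when something is resident and nothing is CPU-only" rule is an assumed simplification of the documented policy, and the sample payloads only reuse the `size_vram` values quoted in the logbook.

```python
def classify_residency(ps_payload, expected_model="gemma3:4b"):
    """Summarise one endpoint's /api/ps response for lane decisions."""
    models = ps_payload.get("models", [])
    resident = [m.get("model") for m in models]
    # size_vram == 0 means the loaded model is not using GPU memory.
    cpu_only = [m.get("model") for m in models if m.get("size_vram") == 0]
    return {
        "expected_resident": expected_model in resident,
        "cpu_only_models": cpu_only,
        "heavy_models_allowed": bool(models) and not cpu_only,
    }


# Sample payloads shaped like /api/ps output, trimmed to the fields used here.
gcp_like = {"models": [{"model": "gemma3:4b", "size_vram": 0}]}
local_111_like = {"models": [{"model": "gemma3:4b", "size_vram": 8210446336}]}

print(classify_residency(gcp_like))
print(classify_residency(local_111_like))
```

An endpoint with no resident models is treated as unknown rather than GPU-capable, so heavy models stay off it until a check shows otherwise.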