diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 0e4244e8..71c9c04b 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -3175,3 +3175,21 @@ kubectl -n awoooi-prod exec deploy/awoooi-api -- printenv | grep -E 'ALERT_OLLAM
 - Currently `192.168.0.110:11435/11436` is forwarded by nginx on 110 to the GCP public IPs. This is a transitional setup and should not serve as the long-term primary Ollama lane.
 - Recommendation: build a WireGuard site-to-site private mesh so that K3s / 110 / 111 / GCP-A / GCP-B interconnect over private IPs, bind Ollama only to the mesh interface, and let the AwoooP Inference Gateway centralize routing, circuit breaking, queueing, and model keep-warm.
 - Note: `/api/ps` on GCP-A / GCP-B currently reports `size_vram: 0`. Moving to a private network solves connectivity and security, but it cannot make CPU-only GCP nodes match the VRAM/GPU performance of 111; large models should stay on 111 or move to GPU-equipped GCP nodes.
+
+### Follow-up documentation
+
+- Added `docs/adr/ADR-125-gcp-ollama-private-mesh-inference-gateway.md`
+- Added `docs/runbooks/GCP-OLLAMA-WIREGUARD-MESH.md`
+- Added `docs/runbooks/AWOOOP-INFERENCE-GATEWAY.md`
+- Added `scripts/ops/ollama-topology-check.sh` as an on-site health / residency / latency check tool for the three Ollama tiers
+
+### `ollama-topology-check` measurements
+
+```bash
+bash scripts/ops/ollama-topology-check.sh
+# primary GCP-A via 110 proxy: gemma3:4b generate OK, ~2s, size_vram=0
+# secondary GCP-B via 110 proxy: gemma3:4b generate OK, ~8.5s, size_vram=0
+# fallback 111 direct: gemma3:4b generate OK, ~4.9s, size_vram=8210446336
+```
+
+Conclusion: GCP-A/B can serve the `alert-fast` lane, but for now they should not carry synchronous 14B/32B alert inference; heavy models must be isolated by the AwoooP Inference Gateway to async / 111 / GPU nodes.
diff --git a/docs/adr/ADR-110-gcp-ollama-topology.md b/docs/adr/ADR-110-gcp-ollama-topology.md
index bb9a4b72..f347c85a 100644
--- a/docs/adr/ADR-110-gcp-ollama-topology.md
+++ b/docs/adr/ADR-110-gcp-ollama-topology.md
@@ -5,6 +5,10 @@
 **Decision maker**: 統帥
 **Related**: supersedes ADR-105 (Revert A2 Ollama Primary)
 
+> 2026-05-05 amendment: this ADR's "GCP-A → GCP-B → 111 → paid provider" logic remains valid,
+> but the public GCP IPs / 110 nginx proxy are a transitional transport only. The formal transport and runtime
+> management are superseded by ADR-125 (GCP Ollama Private Mesh and AwoooP Inference Gateway).
+
 ---
 
 ## Background
@@ -62,3 +66,15 @@ K8s NetworkPolicy egress already adds /32 egress rules for GCP-A/GCP-B (port 11434
 - Ollama's main traffic goes to GCP SSD, improving performance
 - Local 111 is kept as the last line of defense, not retired
 - The Gemini/Nemotron/Claude fallback chain is unchanged
+
+## 2026-05-05 field correction
+
+Measurements during the cold-start recovery showed:
+
+- GCP-A / GCP-B are reachable through the 110 nginx proxy, but long prompts have returned 504.
+- `/api/ps` reports `size_vram: 0` for GCP-A / GCP-B, so they cannot be assumed to match the GPU/VRAM inference capacity of 111.
+- The synchronous alert path must use a fast-lane model such as `gemma3:4b`; 14B/32B models must move to async or to 111/GPU nodes.
+- The public endpoints `34.143.170.20:11434` / `34.21.145.224:11434` are no longer considered the final security architecture.
+
+ADR-125 governs from here on: the WireGuard private mesh is the formal network layer, and the AwoooP
+Inference Gateway is the formal runtime layer.
diff --git a/docs/adr/ADR-125-gcp-ollama-private-mesh-inference-gateway.md b/docs/adr/ADR-125-gcp-ollama-private-mesh-inference-gateway.md
new file mode 100644
index 00000000..99676bb9
--- /dev/null
+++ b/docs/adr/ADR-125-gcp-ollama-private-mesh-inference-gateway.md
@@ -0,0 +1,187 @@
+# ADR-125: GCP Ollama Private Mesh and AwoooP Inference Gateway
+
+**Status**: Accepted
+**Date**: 2026-05-05 (Asia/Taipei)
+**Decision Maker**: ogt / Codex
+**Related**: ADR-110, ADR-111, ADR-113, ADR-121, ADR-124
+
+---
+
+## Context
+
+ADR-110 moved Ollama priority from local-only 111 to a three-layer topology backed by a paid last resort:
+
+1. GCP-A
+2. GCP-B
+3. Local 111
+4. Paid cloud fallback only after all Ollama lanes fail
+
+The 2026-05-05 dirty-reboot recovery and alert incident exposed two gaps in the
+initial ADR-110 implementation:
+
+- The live transport is `K8s Pod -> 192.168.0.110 nginx -> GCP public IP`, not a
+  true private network path.
+- GCP-A and GCP-B reported `size_vram: 0` in `/api/ps`, so they are CPU-only from
+  Ollama's perspective. Private networking improves reachability and security,
+  but does not make these nodes equivalent to local 111 GPU/VRAM behavior.
+
+The public nginx proxy is useful as a bootstrap bridge, but it must not become
+the long-term primary transport for platform inference.
+
+## Decision
+
+Adopt a two-layer target architecture:
+
+### D1 - WireGuard private mesh is the target transport
+
+AwoooP uses a WireGuard site-to-site mesh for GCP Ollama access.
+
+Planned mesh CIDR:
+
+| Host | Role | WireGuard IP |
+|------|------|--------------|
+| 110 | DevOps / transitional proxy / optional mesh router | `10.77.114.10` |
+| 120 | K3s control-plane node | `10.77.114.120` |
+| 121 | K3s control-plane node | `10.77.114.121` |
+| 111 | Local Ollama fallback | `10.77.114.111` |
+| GCP-A | Ollama primary | `10.77.114.21` |
+| GCP-B | Ollama secondary | `10.77.114.22` |
+
+Ollama endpoints after cutover:
+
+| Tier | Endpoint |
+|------|----------|
+| Primary | `http://10.77.114.21:11434` |
+| Secondary | `http://10.77.114.22:11434` |
+| Fallback | `http://10.77.114.111:11434` |
+
+The current `192.168.0.110:11435/11436` nginx proxy remains an emergency bridge
+only until the mesh cutover passes shadow and canary gates.
+
+### D2 - Public Ollama exposure is forbidden after cutover
+
+After mesh cutover:
+
+- GCP firewall must deny public `0.0.0.0/0 -> 11434`.
+- Ollama should bind to the mesh interface or host firewall should allow
+  `11434/tcp` only from `10.77.114.0/24`.
+- K8s NetworkPolicy should allow egress only to the mesh IPs for Ollama.
+
+### D3 - AwoooP Inference Gateway owns runtime routing
+
+Provider clients should stop selecting raw Ollama hosts directly. They should
+call an AwoooP Inference Gateway that owns:
+
+- endpoint health and circuit breakers
+- per-lane concurrency limits
+- model residency and keep-alive policy
+- request timeouts by intent
+- token/cost audit spans
+- fallback order: GCP-A -> GCP-B -> 111 -> paid provider
+
+The gateway may initially expose an Ollama-compatible surface:
+
+| Endpoint | Purpose |
+|----------|---------|
+| `/api/tags` | health/model inventory |
+| `/api/ps` | residency inventory |
+| `/api/generate` | Ollama-compatible generation |
+| `/v1/awooop/inference/runs` | future async AwoooP run API |
+
+Gateway requests must carry `project_id`, `trace_id`, and an intent/lane label
+when called from AwoooP-aware code.
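A minimal sketch of how an AwoooP-aware caller might attach the required `project_id`, `trace_id`, and lane/intent labels to an Ollama-compatible `/api/generate` call. The header names follow the gateway runbook in this change set; the gateway address, the trace ID value, and the `build_gateway_request` helper are assumptions for illustration, not part of the commit. The request is built but never sent.

```python
import json
import urllib.request


def build_gateway_request(base_url, project_id, trace_id, lane, intent, model, prompt):
    """Build (but do not send) an Ollama-compatible generate request."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/generate",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Header names as listed in the AwoooP Inference Gateway runbook.
            "X-AwoooP-Project-ID": project_id,
            "X-AwoooP-Trace-ID": trace_id,
            "X-AwoooP-Lane": lane,
            "X-AwoooP-Intent": intent,
        },
    )


# Hypothetical gateway address and trace ID; the ADR does not name either.
req = build_gateway_request(
    "http://inference-gateway.example:8080",
    project_id="awoooi",
    trace_id="trace-0001",
    lane="alert-fast",
    intent="DIAGNOSE",
    model="gemma3:4b",
    prompt="health check",
)
```

Keeping the labels in headers rather than in the JSON body leaves the payload identical to a raw Ollama call, which is one way to keep legacy callers working while the gateway is in shadow mode.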
+
+### D4 - Alert lane is protected
+
+Alert diagnosis must not share an unconstrained queue with heavy code-review or
+deep-RCA jobs.
+
+Initial lanes:
+
+| Lane | Model | Primary use | Default timeout |
+|------|-------|-------------|-----------------|
+| `alert-fast` | `gemma3:4b` | Telegram incident cards and low-risk RCA | 45s |
+| `code-review` | `qwen2.5-coder:7b` | Gitea review | 90s |
+| `embedding` | `bge-m3` | RAG embeddings | 30s |
+| `deep-rca` | 111-hosted 14B-class model | slow human-reviewed diagnosis | async only |
+
+No 14B/32B model may evict `alert-fast` residency on GCP-A/GCP-B unless the
+gateway explicitly opens a maintenance window.
+
+## Migration Plan
+
+### Phase 0 - Current bridge
+
+- Keep `192.168.0.110:11435` and `192.168.0.110:11436` active.
+- Alert path uses `ALERT_OLLAMA_MODEL=gemma3:4b`.
+- Gemini remains paid emergency fallback only.
+
+### Phase 1 - Mesh build in parallel
+
+- Install WireGuard on 110, 120, 121, 111, GCP-A, and GCP-B.
+- Assign mesh IPs from `10.77.114.0/24`.
+- Keep public proxy and old env values unchanged.
+- Verify `/api/tags`, `/api/ps`, and `gemma3:4b` generation over mesh.
+
+### Phase 2 - Shadow mesh
+
+- Add shadow health checks from the API pod to mesh endpoints.
+- Emit OTel spans with both `active_endpoint` and `shadow_endpoint`.
+- Do not send production inference traffic to mesh yet.
+
+Promotion gate:
+
+- 24h continuous mesh health
+- p95 `alert-fast` latency <= current proxy p95 + 10%
+- zero public-path-only success events
+
+### Phase 3 - Switch active endpoints
+
+Set production env:
+
+```yaml
+OLLAMA_URL: "http://10.77.114.21:11434"
+OLLAMA_SECONDARY_URL: "http://10.77.114.22:11434"
+OLLAMA_FALLBACK_URL: "http://10.77.114.111:11434"
+```
+
+Promotion gate:
+
+- 7 days canary
+- Gemini usage for alert lane is zero except documented all-Ollama outage
+- no alert-card timeout regression
+
+### Phase 4 - Close public exposure
+
+- Remove or firewall public GCP `11434/tcp`.
+- Keep nginx bridge config but disable listener or restrict to operator-only
+  rollback.
+
+## Rollback
+
+Rollback is env-only while the bridge remains available:
+
+```yaml
+OLLAMA_URL: "http://192.168.0.110:11435"
+OLLAMA_SECONDARY_URL: "http://192.168.0.110:11436"
+OLLAMA_FALLBACK_URL: "http://192.168.0.111:11434"
+```
+
+If GCP-A/B are unstable, force 111-first temporarily:
+
+```yaml
+OLLAMA_URL: "http://192.168.0.111:11434"
+OLLAMA_SECONDARY_URL: "http://192.168.0.110:11435"
+OLLAMA_FALLBACK_URL: "http://192.168.0.110:11436"
+```
+
+Paid provider fallback must remain budget-gated.
+
+## Consequences
+
+- GCP Ollama becomes private-by-default instead of public-IP dependent.
+- K8s NetworkPolicy can move from public `/32` rules to stable mesh `/32` rules.
+- AwoooP can manage Ollama as a platform resource shared by all tenants.
+- CPU-only GCP performance remains a capacity constraint; routing must keep
+  heavy jobs off the alert lane or use GPU-capable GCP nodes.
+
diff --git a/docs/awooop/DETAILED-IMPLEMENTATION-PLAN.md b/docs/awooop/DETAILED-IMPLEMENTATION-PLAN.md
index e7f2a855..97c90cab 100644
--- a/docs/awooop/DETAILED-IMPLEMENTATION-PLAN.md
+++ b/docs/awooop/DETAILED-IMPLEMENTATION-PLAN.md
@@ -1120,22 +1120,34 @@ AwoooP fix: every LLM call must emit the attributes above into SignOz (188:
 
 ## 14. Impact of the GCP Ollama topology on AwoooP (ADR-110 integration)
 
-### 14.1 New topology (ADR-110, effective 2026-05-03)
+### 14.1 New topology (ADR-110 + ADR-125, amended 2026-05-05)
 
 ```
-Primary  : GCP-A http://34.143.170.20:11434 (SSD, 9x load speed)
-Secondary: GCP-B http://34.21.145.224:11434 (SSD, standby)
-Fallback : Local http://192.168.0.111:11434 (HDD, last line of defense)
-Emergency: Gemini → Nemotron → Claude (when all Ollama lanes are down)
+Phase 0 bridge:
+Primary  : GCP-A http://192.168.0.110:11435 (110 nginx → GCP public IP)
+Secondary: GCP-B http://192.168.0.110:11436
+Fallback : Local http://192.168.0.111:11434
+Emergency: Gemini → Nemotron → Claude (when all Ollama lanes are down, budget gated)
+
+Target private mesh:
+Primary  : GCP-A http://10.77.114.21:11434
+Secondary: GCP-B http://10.77.114.22:11434
+Fallback : Local http://10.77.114.111:11434
 ```
 
+ADR-125 amends the transport layer of ADR-110: the public GCP IPs / 110 nginx proxy are kept only as a
+transitional and rollback bridge. The formal path is the WireGuard private mesh; runtime routing is managed by the
+AwoooP Inference Gateway.
+
 ### 14.2 Impact items AwoooP must handle
 
 | Impact item | Location | Handling | Phase |
 |--------|------|---------|-------|
 | `ollama:current_primary` Redis key dual-write (supports only 1 URL; 3 tiers are now needed) | INV-1 | Change to `platform:ollama:topology` (JSON: primary/secondary/fallback) | Phase 2 |
 | Second definition at `ollama_auto_recovery.py:230` (P0-11) | ollama_auto_recovery.py | Remove; read from config only | Phase 2 PR-03 |
-| GCP IPs enter INV-4 (34.143.170.20, 34.21.145.224) | INV-4 | Add to the allowed IP list; confirm K8s NetworkPolicy egress is configured | Phase 0 INV-4 |
+| GCP public IPs enter INV-4 (34.143.170.20, 34.21.145.224) | INV-4 | Mark as transitional only; formally switch to the `10.77.114.21/22` mesh IPs | Phase 0 INV-4 |
+| WireGuard mesh | ADR-125 / runbook | Build the `10.77.114.0/24` private transport; close public 11434 | Phase 2 prerequisite |
+| AwoooP Inference Gateway | ADR-125 / runbook | Isolate the alert-fast / code-review / embedding / deep-rca lanes so heavy models cannot crowd the alert lane | Phase 4 |
 | EwoooC Provider Proxy routes through GCP Ollama | Phase 6 | EwoooC shares the platform Ollama topology (platform_resource) | Phase 6 |
 | `telemetry.py:71` IP assert (P0-08) | telemetry.py:71 | Once removed, GCP IPs no longer trigger the assert; make it config-driven | Phase 2 PR-01 |
 | budget_ledger records Ollama usage (free GCP still needs token counts) | Phase 4 | Ollama calls must also record token consumption (budget_ledger) | Phase 4 |
@@ -1143,11 +1155,24 @@ Emergency: Gemini → Nemotron → Claude (when all Ollama lanes are down)
 
 ### 14.3 GCP Ollama is a platform_resource (ADR-111)
 
-GCP Ollama (34.143.170.20, 34.21.145.224) and Local Ollama (192.168.0.111) are always declared as `platform_resource`:
+GCP Ollama (bridge: 34.143.170.20 / 34.21.145.224; target mesh:
+10.77.114.21 / 10.77.114.22) and Local Ollama (192.168.0.111 / target
+10.77.114.111) are always declared as `platform_resource`:
 - Belongs to no tenant
 - Shared by all tenants (AWOOOI / EwoooC / Tsenyang / Bitan), but audit records each tenant's own project_id
 - The `platform:ollama:topology` Redis key is prefixed with `platform:` (not `{project_id}:`)
 
+### 14.4 Measured limits (2026-05-05)
+
+Measured with `scripts/ops/ollama-topology-check.sh`:
+
+- GCP-A `gemma3:4b` takes about 2s, but `size_vram=0`
+- GCP-B `gemma3:4b` takes about 8.5s, but `size_vram=0`
+- 111 fallback `gemma3:4b` takes about 4.9s, `size_vram=8210446336`
+
+Conclusion: GCP-A/B can serve as the synchronous `alert-fast` lane, but must not carry synchronous 14B/32B alert diagnosis.
+Heavy models must be routed by the Inference Gateway to async / 111 / GPU nodes.
+
 ---
 
 ## 15. Work ordering summary (with parallel groups + Critical Path)
diff --git a/docs/awooop/MASTER-WORKPLAN.md b/docs/awooop/MASTER-WORKPLAN.md
index 1e3032cb..28ce5b5c 100644
--- a/docs/awooop/MASTER-WORKPLAN.md
+++ b/docs/awooop/MASTER-WORKPLAN.md
@@ -135,7 +135,7 @@ ADR-106 also needs an added section: **Strangler Fig Quantified Gates**, turning shadow →
 3. **Redis working memory project boundary** (#15):
    - `SCAN incident:*` at `incident_service.py:603` → `SCAN {project_id}:incident:*`
    - Every `SCAN`/`KEYS` must carry a prefix
-4. **`platform_resource` exception list**: Ollama failover state, global rate limit, leader election lock, etc. are explicitly marked
+4. **`platform_resource` exception list**: Ollama failover state, global rate limit, leader election lock, etc. are explicitly marked; the formal GCP Ollama path moves to the WireGuard mesh + AwoooP Inference Gateway per ADR-125, and the 110 nginx proxy is kept only as a transitional / rollback bridge
 5. **Regression tests**: cross-project read/write must be rejected; platform_resource must be allowed but write an audit record
 6. **AWOOOI Bootstrap Paradox fix** (per ADR-111, INV-3):
    - Entrypoints marked `platform_internal` carry `project_id=__platform__`; they are exempt from hard reject but write an audit record
diff --git a/docs/runbooks/AWOOOP-INFERENCE-GATEWAY.md b/docs/runbooks/AWOOOP-INFERENCE-GATEWAY.md
new file mode 100644
index 00000000..8de6d15b
--- /dev/null
+++ b/docs/runbooks/AWOOOP-INFERENCE-GATEWAY.md
@@ -0,0 +1,153 @@
+# AwoooP Inference Gateway Runbook
+
+> Runtime design for keeping GCP-A, GCP-B, 111, and paid providers under one
+> controlled inference lane.
+
+---
+
+## Goal
+
+Stop individual services from calling raw model hosts independently.
+
+The gateway becomes the single platform path for:
+
+- endpoint selection
+- model lane assignment
+- queue and concurrency control
+- fallback
+- cost and token audit
+- trace correlation
+
+## Why This Is Needed
+
+Direct provider calls caused the 2026-05-05 alert issue:
+
+- alert diagnosis wanted a fast response
+- GCP-A/B were loaded with heavyweight models
+- the request timed out through GCP-A and GCP-B
+- Gemini fallback generated cost
+
+Private networking alone cannot prevent model eviction or queue contention. The
+gateway must own runtime scheduling.
+
+## Required Lanes
+
+| Lane | Model | Allowed hosts | Notes |
+|------|-------|---------------|-------|
+| `alert-fast` | `gemma3:4b` | GCP-A, GCP-B, 111 | Synchronous, protected |
+| `code-review` | `qwen2.5-coder:7b` | GCP-B, 111 | Never 32B on GCP during alert canary |
+| `embedding` | `bge-m3` | GCP-A, GCP-B, 111 | Short timeout |
+| `deep-rca` | 14B-class model | 111 or GPU node | Async only |
+| `paid-emergency` | Gemini / Claude | Cloud | Budget-gated emergency fallback |
+
+## v0 API
+
+The gateway should initially provide an Ollama-compatible API to minimize caller
+changes:
+
+```http
+POST /api/generate
+GET /api/tags
+GET /api/ps
+```
+
+Required headers for AwoooP-aware calls:
+
+```http
+X-AwoooP-Project-ID: awoooi
+X-AwoooP-Trace-ID:
+X-AwoooP-Lane: alert-fast
+X-AwoooP-Intent: DIAGNOSE
+```
+
+Legacy callers may be accepted in shadow mode, but must be assigned
+`project_id=awoooi` by bootstrap rules from ADR-111.
+
+## Scheduling Rules
+
+- `alert-fast` concurrency is reserved and cannot be borrowed by other lanes.
+- `alert-fast` keeps `gemma3:4b` warm on both GCP-A and GCP-B.
+- 14B/32B models are denied on GCP-A/B unless an operator opens maintenance.
+- Per-host circuit breaker opens after 2 consecutive timeout failures.
+- Paid provider fallback requires:
+  - all Ollama endpoints failed or are circuit-open
+  - budget hard kill not triggered
+  - audit span records fallback reason
+
+## Minimal Routing Algorithm
+
+```text
+input: lane, model, project_id, trace_id
+
+if lane == alert-fast:
+  model = gemma3:4b
+  try GCP-A with 45s timeout
+  try GCP-B with 45s timeout
+  try 111 with 60s timeout
+  if allowed by budget: try paid emergency fallback
+
+if lane == code-review:
+  model = qwen2.5-coder:7b
+  try GCP-B with 90s timeout
+  try 111 with 120s timeout
+
+if lane == deep-rca:
+  reject synchronous request
+  create async run
+```
+
+## Metrics and Logs
+
+Every request must emit:
+
+- `awooop.project_id`
+- `awooop.lane`
+- `awooop.provider_tier`
+- `awooop.endpoint`
+- `gen_ai.request.model`
+- `gen_ai.usage.input_tokens`
+- `gen_ai.usage.output_tokens`
+- `awooop.fallback_reason`
+- `awooop.cost_usd`
+
+## Implementation Stages
+
+### Stage 1 - Sidecar health view
+
+- Keep existing providers.
+- Add health and residency checks to identify which lane is safe.
+- No traffic proxying yet.
+
+### Stage 2 - Gateway in shadow
+
+- Mirror inference requests to the gateway.
+- Gateway computes routing decision but does not execute.
+- Compare selected endpoint/model against legacy path.
+
+### Stage 3 - Alert lane active
+
+- Route only `alert-fast` through the gateway.
+- Keep code review and deep RCA on legacy providers.
+
+### Stage 4 - All Ollama traffic active
+
+- Move code review, embedding, and deep RCA to the gateway.
+- Enforce lane-based deny rules.
+
+### Stage 5 - AwoooP runtime integration
+
+- Convert gateway decisions into `run_state` and `step_journal` entries.
+- Use AwoooP budget ledger as source of truth.
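The Minimal Routing Algorithm and the per-host circuit breaker rule from this runbook can be sketched in a few lines. This is an illustration under stated assumptions, not the gateway implementation: `LaneRouter`, its `call` transport, and the result dictionaries are hypothetical names, and the breaker here never closes again (a real gateway would need a recovery or half-open policy, which the runbook does not specify).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# (model, [(host, timeout_seconds), ...]) copied from the routing text in this runbook.
LANE_PLANS = {
    "alert-fast": ("gemma3:4b", [("GCP-A", 45), ("GCP-B", 45), ("111", 60)]),
    "code-review": ("qwen2.5-coder:7b", [("GCP-B", 90), ("111", 120)]),
}
BREAKER_THRESHOLD = 2  # breaker opens after 2 consecutive timeout failures


@dataclass
class LaneRouter:
    # Hypothetical transport: call(host, model, timeout_seconds) -> response text,
    # raising TimeoutError when the host does not answer in time.
    call: Callable[[str, str, int], str]
    timeouts_in_a_row: Dict[str, int] = field(default_factory=dict)

    def route(self, lane: str, budget_ok: bool = False) -> dict:
        if lane == "deep-rca":
            # Synchronous deep RCA is rejected; the caller should create an async run.
            return {"status": "async", "host": None}
        model, plan = LANE_PLANS[lane]
        for host, timeout_seconds in plan:
            if self.timeouts_in_a_row.get(host, 0) >= BREAKER_THRESHOLD:
                continue  # circuit open: skip without calling the host
            try:
                text = self.call(host, model, timeout_seconds)
            except TimeoutError:
                self.timeouts_in_a_row[host] = self.timeouts_in_a_row.get(host, 0) + 1
                continue
            self.timeouts_in_a_row[host] = 0  # a success resets the streak
            return {"status": "ok", "host": host, "model": model, "response": text}
        if lane == "alert-fast" and budget_ok:
            # Only the alert lane may reach the paid emergency fallback, and only within budget.
            return {"status": "paid-fallback", "host": None}
        return {"status": "failed", "host": None}
```

With a transport that always times out on GCP-A, the first two `route("alert-fast")` calls try GCP-A and then succeed on GCP-B; from the third call on, GCP-A is skipped without being contacted.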
+
+## Rollback
+
+Set provider env back to raw endpoints:
+
+```yaml
+OLLAMA_URL: "http://192.168.0.110:11435"
+OLLAMA_SECONDARY_URL: "http://192.168.0.110:11436"
+OLLAMA_FALLBACK_URL: "http://192.168.0.111:11434"
+```
+
+Do not disable budget hard kill during rollback.
+
diff --git a/docs/runbooks/DEPLOY-GCP-OLLAMA-PROXY.md b/docs/runbooks/DEPLOY-GCP-OLLAMA-PROXY.md
index c07a6687..77a39ad5 100644
--- a/docs/runbooks/DEPLOY-GCP-OLLAMA-PROXY.md
+++ b/docs/runbooks/DEPLOY-GCP-OLLAMA-PROXY.md
@@ -1,6 +1,10 @@
 # GCP Ollama Nginx Proxy Deployment Guide
 
 > ADR-110 three-tier disaster recovery: the key steps for enabling GCP Ollama
+>
+> 2026-05-05 amendment: this runbook is kept only as a transitional / rollback bridge. The formal solution is
+> the WireGuard private mesh and AwoooP Inference Gateway from ADR-125. New deployments must not leave
+> GCP `11434/tcp` open to `0.0.0.0/0` long term.
 
 ---
 
@@ -173,7 +177,10 @@ kubectl describe networkpolicy -n awoooi-prod allow-required-egress
 curl -v http://34.143.170.20:11434/api/tags
 ```
 
-If this fails, check whether the GCP firewall rule opens 0.0.0.0/0:11434.
+If this fails, only a brief check is allowed of whether the GCP firewall opens
+`11434/tcp` to the fixed egress IP of 110. `0.0.0.0/0:11434` must not be treated as the formal configuration.
+
+For the formal cutover, follow [GCP-OLLAMA-WIREGUARD-MESH.md](GCP-OLLAMA-WIREGUARD-MESH.md) instead.
 
 ### 3. Model loads but inference fails
 
@@ -189,9 +196,12 @@ curl -v http://34.143.170.20:11434/api/tags
 ## Related documents
 
 - ADR-110: GCP three-tier disaster recovery architecture
+- ADR-125: GCP Ollama Private Mesh and AwoooP Inference Gateway
 - `k8s/awoooi-prod/04-configmap.yaml`
 - `k8s/awoooi-prod/02-network-policy.yaml`
 - `docs/runbooks/RUNBOOK-OLLAMA-FAILOVER.md`
+- `docs/runbooks/GCP-OLLAMA-WIREGUARD-MESH.md`
+- `docs/runbooks/AWOOOP-INFERENCE-GATEWAY.md`
 
 ---
 
diff --git a/docs/runbooks/GCP-OLLAMA-WIREGUARD-MESH.md b/docs/runbooks/GCP-OLLAMA-WIREGUARD-MESH.md
new file mode 100644
index 00000000..a2bce332
--- /dev/null
+++ b/docs/runbooks/GCP-OLLAMA-WIREGUARD-MESH.md
@@ -0,0 +1,280 @@
+# GCP Ollama WireGuard Mesh Runbook
+
+> Target state for ADR-125. This replaces the public GCP Ollama proxy as the
+> primary path after shadow and canary validation.
+
+---
+
+## Scope
+
+This runbook builds private Ollama connectivity between AWOOOI K3s and the GCP
+Ollama hosts.
+
+It does not replace AwoooP Inference Gateway work. The mesh solves transport and
+security. The gateway solves routing, queueing, model residency, and fallback.
+
+## Current State
+
+Current production endpoints:
+
+| Variable | Endpoint | Meaning |
+|----------|----------|---------|
+| `OLLAMA_URL` | `http://192.168.0.110:11435` | GCP-A through 110 nginx |
+| `OLLAMA_SECONDARY_URL` | `http://192.168.0.110:11436` | GCP-B through 110 nginx |
+| `OLLAMA_FALLBACK_URL` | `http://192.168.0.111:11434` | Local 111 |
+
+This is a bridge. Do not treat the public proxy as the final architecture.
+
+## Target State
+
+| Host | WireGuard IP | Notes |
+|------|--------------|-------|
+| 110 | `10.77.114.10` | DevOps host and rollback bridge |
+| 120 | `10.77.114.120` | K3s node |
+| 121 | `10.77.114.121` | K3s node |
+| 111 | `10.77.114.111` | Local Ollama fallback |
+| GCP-A | `10.77.114.21` | Primary Ollama |
+| GCP-B | `10.77.114.22` | Secondary Ollama |
+
+Production endpoints after cutover:
+
+```yaml
+OLLAMA_URL: "http://10.77.114.21:11434"
+OLLAMA_SECONDARY_URL: "http://10.77.114.22:11434"
+OLLAMA_FALLBACK_URL: "http://10.77.114.111:11434"
+```
+
+## Prerequisites
+
+- SSH access to GCP-A and GCP-B.
+- GCP IAM permissions for firewall rules if OS firewall alone is not enough.
+- SSH access to 110, 111, 120, and 121.
+- A secured place to store WireGuard private keys. Never commit private keys.
+- Confirm the GCP hosts have enough CPU/RAM for `gemma3:4b`.
+
+## Key Rules
+
+- Private keys are generated on each host and never copied into Git.
+- Public keys may be recorded in the operator handoff note.
+- Public GCP `11434/tcp` must be closed after cutover.
+- `alert-fast` uses `gemma3:4b`; 14B/32B models must not run on GCP-A/B during
+  alert-lane canary.
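Before any peer is configured, the Target State address plan can be sanity-checked against the `10.77.114.0/24` mesh range named in ADR-125. A minimal sketch using only the standard library; `check_plan` and the `MESH_HOSTS` mapping are illustrative names, not part of the commit.

```python
import ipaddress

MESH_CIDR = ipaddress.ip_network("10.77.114.0/24")

# Host -> WireGuard IP, copied from the Target State table.
MESH_HOSTS = {
    "110": "10.77.114.10",
    "120": "10.77.114.120",
    "121": "10.77.114.121",
    "111": "10.77.114.111",
    "GCP-A": "10.77.114.21",
    "GCP-B": "10.77.114.22",
}


def check_plan(hosts, cidr=MESH_CIDR):
    """Return a list of human-readable problems; an empty list means the plan is consistent."""
    problems = []
    owner_by_ip = {}
    for name, ip in hosts.items():
        if ipaddress.ip_address(ip) not in cidr:
            problems.append(f"{name}: {ip} is outside {cidr}")
        if ip in owner_by_ip:
            problems.append(f"{name}: {ip} duplicates {owner_by_ip[ip]}")
        owner_by_ip.setdefault(ip, name)
    return problems


problems = check_plan(MESH_HOSTS)
```

The same check flags a LAN address such as `192.168.0.110` if it is pasted into the plan by mistake, which is the most likely slip while the bridge and mesh endpoints coexist.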
+
+## Install WireGuard
+
+Ubuntu/Debian:
+
+```bash
+sudo apt-get update
+sudo apt-get install -y wireguard
+```
+
+Alpine:
+
+```bash
+sudo apk add --no-cache wireguard-tools
+```
+
+Generate keys on every host:
+
+```bash
+umask 077
+wg genkey | sudo tee /etc/wireguard/awooop.key >/dev/null
+sudo cat /etc/wireguard/awooop.key | wg pubkey | sudo tee /etc/wireguard/awooop.pub
+```
+
+## Configure Peers
+
+Create `/etc/wireguard/wg-awooop.conf` on each host.
+
+Example for GCP-A:
+
+```ini
+[Interface]
+Address = 10.77.114.21/32
+ListenPort = 51820
+PrivateKey =
+
+[Peer]
+# 120 K3s node
+PublicKey =
+AllowedIPs = 10.77.114.120/32
+Endpoint = <120_REACHABLE_ENDPOINT>:51820
+PersistentKeepalive = 25
+
+[Peer]
+# 121 K3s node
+PublicKey =
+AllowedIPs = 10.77.114.121/32
+Endpoint = <121_REACHABLE_ENDPOINT>:51820
+PersistentKeepalive = 25
+
+[Peer]
+# 110 DevOps rollback bridge
+PublicKey =
+AllowedIPs = 10.77.114.10/32
+Endpoint = <110_REACHABLE_ENDPOINT>:51820
+PersistentKeepalive = 25
+```
+
+Example for a K3s node:
+
+```ini
+[Interface]
+Address = 10.77.114.120/32
+ListenPort = 51820
+PrivateKey =
+
+[Peer]
+# GCP-A
+PublicKey =
+AllowedIPs = 10.77.114.21/32
+Endpoint = 34.143.170.20:51820
+PersistentKeepalive = 25
+
+[Peer]
+# GCP-B
+PublicKey =
+AllowedIPs = 10.77.114.22/32
+Endpoint = 34.21.145.224:51820
+PersistentKeepalive = 25
+
+[Peer]
+# Local 111
+PublicKey =
+AllowedIPs = 10.77.114.111/32
+Endpoint = 192.168.0.111:51820
+PersistentKeepalive = 25
+```
+
+The exact peer list depends on reachable endpoints. If inbound access to 120/121
+is not available, use 110 as a temporary mesh relay, then replace it with direct
+K3s-to-GCP peers when routing is confirmed.
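Because the peer list differs per host, the config layout shown in the examples can be rendered from a small peer table. The sketch below is an assumption-laden illustration: `render_wg_config`, the peer dictionaries, the placeholder key strings, and `relay.example.internal` are all hypothetical. It omits the `Endpoint` line for a peer with no reachable inbound address, which WireGuard accepts (that peer must then initiate the handshake). Real private keys should be filled in on the host and never passed through tooling that is committed or logged.

```python
def render_wg_config(mesh_ip, private_key, peers, listen_port=51820):
    """Render a wg-quick style config in the same layout as the runbook examples."""
    lines = [
        "[Interface]",
        f"Address = {mesh_ip}/32",
        f"ListenPort = {listen_port}",
        f"PrivateKey = {private_key}",
    ]
    for peer in peers:
        lines += [
            "",
            "[Peer]",
            f"# {peer['name']}",
            f"PublicKey = {peer['public_key']}",
            f"AllowedIPs = {peer['mesh_ip']}/32",
        ]
        if peer.get("endpoint"):
            # Only peers with a reachable address get an Endpoint line.
            lines.append(f"Endpoint = {peer['endpoint']}:{listen_port}")
        lines.append("PersistentKeepalive = 25")
    return "\n".join(lines) + "\n"


# Placeholder values only; none of these strings are real keys or hosts.
config = render_wg_config(
    "10.77.114.21",
    "REPLACE_ON_HOST",
    [
        {"name": "120 K3s node", "public_key": "PUBKEY_120_PLACEHOLDER",
         "mesh_ip": "10.77.114.120", "endpoint": None},
        {"name": "110 relay", "public_key": "PUBKEY_110_PLACEHOLDER",
         "mesh_ip": "10.77.114.10", "endpoint": "relay.example.internal"},
    ],
)
```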
+
+## Start WireGuard
+
+```bash
+sudo systemctl enable --now wg-quick@wg-awooop
+sudo wg show wg-awooop
+```
+
+Verify connectivity:
+
+```bash
+ping -c 3 10.77.114.21
+ping -c 3 10.77.114.22
+curl -fsS http://10.77.114.21:11434/api/tags
+curl -fsS http://10.77.114.22:11434/api/tags
+```
+
+## Bind or Firewall Ollama
+
+Preferred: bind Ollama to the mesh interface.
+
+```bash
+sudo systemctl edit ollama
+```
+
+```ini
+[Service]
+Environment="OLLAMA_HOST=10.77.114.21:11434"
+```
+
+Use `10.77.114.22:11434` on GCP-B.
+
+If binding is not possible, firewall the host:
+
+```bash
+sudo ufw allow from 10.77.114.0/24 to any port 11434 proto tcp
+sudo ufw deny 11434/tcp
+```
+
+Then restart:
+
+```bash
+sudo systemctl daemon-reload
+sudo systemctl restart ollama
+```
+
+## K8s NetworkPolicy
+
+After mesh cutover, allow only mesh endpoints for Ollama:
+
+```yaml
+- to:
+    - ipBlock:
+        cidr: 10.77.114.21/32
+    - ipBlock:
+        cidr: 10.77.114.22/32
+    - ipBlock:
+        cidr: 10.77.114.111/32
+  ports:
+    - protocol: TCP
+      port: 11434
+```
+
+Do not remove the `192.168.0.110:11435/11436` rules until rollback is no longer
+needed.
+
+## Shadow Validation
+
+From the API pod:
+
+```bash
+bash scripts/ops/ollama-topology-check.sh
+```
+
+Expected:
+
+- GCP-A `/api/tags` returns 200.
+- GCP-B `/api/tags` returns 200.
+- `gemma3:4b` generation succeeds on both nodes.
+- `/api/ps` contains `gemma3:4b`.
+- If `size_vram=0`, keep GCP-A/B on `alert-fast` only and route heavy models to
+  111 or a GPU-capable node.
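The last two expectations can be evaluated mechanically from an `/api/ps` response. A minimal sketch, assuming the payload shape the check script already reads (`models[].model` and `models[].size_vram`); `assess_residency` and its result keys are illustrative names. An unknown `size_vram` is treated conservatively, as if the node were CPU-only.

```python
def assess_residency(ps_payload, model="gemma3:4b"):
    """Summarize whether `model` is resident and whether the node looks CPU-only."""
    entry = next(
        (m for m in ps_payload.get("models", []) if m.get("model") == model),
        None,
    )
    if entry is None:
        return {"resident": False, "cpu_only": None, "alert_fast_only": True}
    size_vram = entry.get("size_vram")
    cpu_only = None if size_vram is None else size_vram == 0
    # Anything other than a confirmed non-zero VRAM footprint keeps the node on alert-fast only.
    return {"resident": True, "cpu_only": cpu_only, "alert_fast_only": cpu_only is not False}


# Sample payloads shaped like the values recorded in the 2026-05-05 measurements.
gcp_like = assess_residency({"models": [{"model": "gemma3:4b", "size_vram": 0}]})
local_like = assess_residency({"models": [{"model": "gemma3:4b", "size_vram": 8210446336}]})
```

A non-zero `size_vram` only shows that the model is at least partly GPU-resident; whether a node may also take 14B/32B work is still a lane policy decision for the gateway.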
+
+## Cutover
+
+Patch deployment env after shadow passes:
+
+```bash
+kubectl -n awoooi-prod set env deploy/awoooi-api \
+  OLLAMA_URL=http://10.77.114.21:11434 \
+  OLLAMA_SECONDARY_URL=http://10.77.114.22:11434 \
+  OLLAMA_FALLBACK_URL=http://10.77.114.111:11434
+
+kubectl -n awoooi-prod set env deploy/awoooi-worker \
+  OLLAMA_URL=http://10.77.114.21:11434 \
+  OLLAMA_SECONDARY_URL=http://10.77.114.22:11434 \
+  OLLAMA_FALLBACK_URL=http://10.77.114.111:11434
+```
+
+Verify:
+
+```bash
+kubectl -n awoooi-prod rollout status deploy/awoooi-api --timeout=180s
+kubectl -n awoooi-prod rollout status deploy/awoooi-worker --timeout=180s
+bash scripts/ops/ollama-topology-check.sh
+```
+
+## Rollback
+
+```bash
+kubectl -n awoooi-prod set env deploy/awoooi-api \
+  OLLAMA_URL=http://192.168.0.110:11435 \
+  OLLAMA_SECONDARY_URL=http://192.168.0.110:11436 \
+  OLLAMA_FALLBACK_URL=http://192.168.0.111:11434
+
+kubectl -n awoooi-prod set env deploy/awoooi-worker \
+  OLLAMA_URL=http://192.168.0.110:11435 \
+  OLLAMA_SECONDARY_URL=http://192.168.0.110:11436 \
+  OLLAMA_FALLBACK_URL=http://192.168.0.111:11434
+```
+
+## Done Criteria
+
+- Mesh endpoints pass 7 days of canary.
+- Alert lane Gemini usage is zero except documented all-Ollama outages.
+- Public GCP `11434/tcp` is closed.
+- Operator runbook records peer public keys and rollback owner.
+
diff --git a/scripts/ops/ollama-topology-check.sh b/scripts/ops/ollama-topology-check.sh
new file mode 100755
index 00000000..2a2a4534
--- /dev/null
+++ b/scripts/ops/ollama-topology-check.sh
@@ -0,0 +1,88 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+NAMESPACE="${NAMESPACE:-awoooi-prod}"
+DEPLOYMENT="${DEPLOYMENT:-awoooi-api}"
+MODEL="${MODEL:-gemma3:4b}"
+TIMEOUT_SECONDS="${TIMEOUT_SECONDS:-60}"
+
+kubectl -n "${NAMESPACE}" exec -i "deploy/${DEPLOYMENT}" -- \
+  env CHECK_MODEL="${MODEL}" CHECK_TIMEOUT_SECONDS="${TIMEOUT_SECONDS}" python - <<'PY'
+import json
+import os
+import time
+import urllib.error
+import urllib.request
+
+model = os.environ["CHECK_MODEL"]
+timeout = int(os.environ["CHECK_TIMEOUT_SECONDS"])
+
+endpoints = [
+    ("primary", os.environ.get("OLLAMA_URL", "")),
+    ("secondary", os.environ.get("OLLAMA_SECONDARY_URL", "")),
+    ("fallback", os.environ.get("OLLAMA_FALLBACK_URL", "")),
+]
+
+print(f"model={model} timeout={timeout}s")
+
+def request_json(url: str, path: str, payload=None, timeout_seconds=10):
+    data = None
+    headers = {}
+    if payload is not None:
+        data = json.dumps(payload).encode()
+        headers["Content-Type"] = "application/json"
+    req = urllib.request.Request(url.rstrip("/") + path, data=data, headers=headers)
+    with urllib.request.urlopen(req, timeout=timeout_seconds) as response:
+        return json.loads(response.read().decode())
+
+for label, url in endpoints:
+    print(f"\n== {label}: {url or ''} ==")
+    if not url:
+        print("status=missing")
+        continue
+
+    try:
+        tags = request_json(url, "/api/tags", timeout_seconds=10)
+        names = sorted(m.get("name", "") for m in tags.get("models", []))
+        print("tags=ok", ",".join(names[:12]))
+    except Exception as exc:
+        print("tags=fail", type(exc).__name__, str(exc)[:160])
+        continue
+
+    try:
+        ps = request_json(url, "/api/ps", timeout_seconds=10)
+        live = ps.get("models", [])
+        if not live:
+            print("ps=ok live_models=")
+        for item in live:
+            print(
+                "ps=ok",
+                f"model={item.get('model')}",
+                f"expires={item.get('expires_at')}",
+                f"size_vram={item.get('size_vram')}",
+                f"context={item.get('context_length')}",
+            )
+            if item.get("size_vram") == 0:
+                print("warning=cpu_only_or_no_vram")
+    except Exception as exc:
+        print("ps=fail", type(exc).__name__, str(exc)[:160])
+
+    payload = {
+        "model": model,
+        "prompt": "用繁體中文用一行回答:Ollama health check",
+        "stream": False,
+        "keep_alive": "8h",
+        "options": {"num_predict": 32, "temperature": 0.1},
+    }
+    start = time.time()
+    try:
+        result = request_json(url, "/api/generate", payload, timeout_seconds=timeout)
+        latency_ms = round((time.time() - start) * 1000)
+        response = (result.get("response") or "").replace("\n", " ")[:120]
+        print(f"generate=ok latency_ms={latency_ms} response={response}")
+    except urllib.error.HTTPError as exc:
+        body = exc.read().decode(errors="replace")[:200]
+        print("generate=fail", "HTTPError", exc.code, body)
+    except Exception as exc:
+        print("generate=fail", type(exc).__name__, str(exc)[:200])
+PY