docs(awooop): define private Ollama mesh gateway
Your Name
2026-05-05 22:56:22 +08:00
parent 7baa316224
commit ed7c6946cb
9 changed files with 786 additions and 9 deletions


@@ -3175,3 +3175,21 @@ kubectl -n awoooi-prod exec deploy/awoooi-api -- printenv | grep -E 'ALERT_OLLAM
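- `192.168.0.110:11435/11436` currently forwards through nginx on 110 to the GCP public IPs. This is a transitional arrangement and should not serve as the long-term primary Ollama lane.
- Recommendation: build a WireGuard site-to-site private mesh so that K3s / 110 / 111 / GCP-A / GCP-B interconnect over private IPs, bind Ollama to the mesh interface only, and let the AwoooP Inference Gateway centralize routing, circuit breaking, queueing, and model keep-warm.
- Note: `/api/ps` on GCP-A / GCP-B currently reports `size_vram: 0`. Moving to a private network fixes connectivity and security, but it cannot make the CPU-only GCP nodes match the VRAM/GPU performance of 111; large models should stay on 111 or move to GPU-capable GCP nodes.
### Follow-up documentation
- Add `docs/adr/ADR-125-gcp-ollama-private-mesh-inference-gateway.md`
- Add `docs/runbooks/GCP-OLLAMA-WIREGUARD-MESH.md`
- Add `docs/runbooks/AWOOOP-INFERENCE-GATEWAY.md`
- Add `scripts/ops/ollama-topology-check.sh` as the on-site health / residency / latency check for the three Ollama tiers
### `ollama-topology-check` measured results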
```bash
bash scripts/ops/ollama-topology-check.sh
# primary GCP-A via 110 proxy: gemma3:4b generate OK, ~2s, size_vram=0
# secondary GCP-B via 110 proxy: gemma3:4b generate OK, ~8.5s, size_vram=0
# fallback 111 direct: gemma3:4b generate OK, ~4.9s, size_vram=8210446336
```
Conclusion: GCP-A/B can serve the `alert-fast` lane, but for now they should not carry synchronous 14B/32B alert inference; the AwoooP Inference Gateway must isolate heavy models to async / 111 / GPU nodes.


@@ -5,6 +5,10 @@
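**Decision Maker**: 統帥 (Commander)
**Related**: Supersedes ADR-105 (Revert A2 Ollama Primary)
> 2026-05-05 amendment: the "GCP-A → GCP-B → 111 → paid provider" logic in this ADR remains valid,
> but the public GCP IPs / 110 nginx proxy are transitional transport only. Production transport and runtime
> management are superseded by ADR-125 (GCP Ollama Private Mesh and AwoooP Inference Gateway).
---
## Background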
@@ -62,3 +66,15 @@ K8s NetworkPolicy egress 已新增 GCP-A/GCP-B 的 /32 出口規則port 11434
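- Primary Ollama traffic runs on GCP SSD, which improves performance
- Local 111 is kept as the last line of defense and is not retired
- The Gemini/Nemotron/Claude fallback chain is unchanged
## 2026-05-05 Field Correction
Measurements taken during the cold-start recovery showed:
- GCP-A / GCP-B are reachable through the 110 nginx proxy, but long prompts have returned 504.
- `/api/ps` reports `size_vram: 0` on GCP-A / GCP-B, so they cannot be assumed to match the GPU/VRAM inference capability of 111.
- The synchronous alert path must use a fast-lane model such as `gemma3:4b`; 14B/32B models must move to async or to 111/GPU nodes.
- The public endpoints `34.143.170.20:11434` / `34.21.145.224:11434` are no longer considered the final secure architecture.
ADR-125 is authoritative from here on: the WireGuard private mesh is the production network layer, and the AwoooP
Inference Gateway is the production runtime layer.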


@@ -0,0 +1,187 @@
# ADR-125: GCP Ollama Private Mesh and AwoooP Inference Gateway
**Status**: Accepted
**Date**: 2026-05-05 (Asia/Taipei)
**Decision Maker**: ogt / Codex
**Related**: ADR-110, ADR-111, ADR-113, ADR-121, ADR-124
---
## Context
ADR-110 moved Ollama priority from local-only 111 to a three-layer Ollama topology, with paid providers as a last resort:
1. GCP-A
2. GCP-B
3. Local 111
4. Paid cloud fallback only after all Ollama lanes fail
The 2026-05-05 dirty-reboot recovery and alert incident exposed two gaps in the
initial ADR-110 implementation:
- The live transport is `K8s Pod -> 192.168.0.110 nginx -> GCP public IP`, not a
true private network path.
- GCP-A and GCP-B reported `size_vram: 0` in `/api/ps`, so they are CPU-only from
Ollama's perspective. Private networking improves reachability and security,
but does not make these nodes equivalent to local 111 GPU/VRAM behavior.
The public nginx proxy is useful as a bootstrap bridge, but it must not become
the long-term primary transport for platform inference.
## Decision
Adopt a two-layer target architecture, with private transport (D1, D2) beneath a runtime gateway (D3, D4):
### D1 - WireGuard private mesh is the target transport
AwoooP uses a WireGuard site-to-site mesh for GCP Ollama access.
Planned mesh addressing (CIDR `10.77.114.0/24`):
| Host | Role | WireGuard IP |
|------|------|--------------|
| 110 | DevOps / transitional proxy / optional mesh router | `10.77.114.10` |
| 120 | K3s control-plane node | `10.77.114.120` |
| 121 | K3s control-plane node | `10.77.114.121` |
| 111 | Local Ollama fallback | `10.77.114.111` |
| GCP-A | Ollama primary | `10.77.114.21` |
| GCP-B | Ollama secondary | `10.77.114.22` |
Ollama endpoints after cutover:
| Tier | Endpoint |
|------|----------|
| Primary | `http://10.77.114.21:11434` |
| Secondary | `http://10.77.114.22:11434` |
| Fallback | `http://10.77.114.111:11434` |
The current `192.168.0.110:11435/11436` nginx proxy remains an emergency bridge
only until the mesh cutover passes shadow and canary gates.
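The address plan in D1 can be sanity-checked before any peer is configured. The following is a minimal sketch, assuming the mesh CIDR is `10.77.114.0/24` as stated in D2; `MESH_HOSTS` and `validate_mesh_plan` are illustrative names and are not part of this commit.

```python
import ipaddress

# Assumption: the mesh CIDR named in D2 and in Phase 1 of the migration plan.
MESH_CIDR = ipaddress.ip_network("10.77.114.0/24")

# Host -> planned WireGuard IP, copied from the D1 table.
MESH_HOSTS = {
    "110": "10.77.114.10",
    "120": "10.77.114.120",
    "121": "10.77.114.121",
    "111": "10.77.114.111",
    "GCP-A": "10.77.114.21",
    "GCP-B": "10.77.114.22",
}


def validate_mesh_plan(hosts, cidr=MESH_CIDR):
    """Return a list of problems; an empty list means the plan is consistent."""
    problems = []
    seen = {}
    for host, raw_ip in hosts.items():
        ip = ipaddress.ip_address(raw_ip)
        if ip not in cidr:
            problems.append(f"{host}: {ip} is outside {cidr}")
        if ip in seen:
            problems.append(f"{host}: {ip} already assigned to {seen[ip]}")
        seen.setdefault(ip, host)
    return problems


if __name__ == "__main__":
    print(validate_mesh_plan(MESH_HOSTS) or "mesh plan ok")
```

A check like this could run in CI against whatever file ends up holding the peer inventory, so that a mistyped address is caught before it reaches a WireGuard config.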
### D2 - Public Ollama exposure is forbidden after cutover
After mesh cutover:
- GCP firewall must deny public `0.0.0.0/0 -> 11434`.
- Ollama should bind to the mesh interface, or the host firewall should allow
`11434/tcp` only from `10.77.114.0/24`.
- K8s NetworkPolicy should allow egress only to the mesh IPs for Ollama.
### D3 - AwoooP Inference Gateway owns runtime routing
Provider clients should stop selecting raw Ollama hosts directly. They should
call an AwoooP Inference Gateway that owns:
- endpoint health and circuit breakers
- per-lane concurrency limits
- model residency and keep-alive policy
- request timeouts by intent
- token/cost audit spans
- fallback order: GCP-A -> GCP-B -> 111 -> paid provider
The gateway may initially expose an Ollama-compatible surface:
| Endpoint | Purpose |
|----------|---------|
| `/api/tags` | health/model inventory |
| `/api/ps` | residency inventory |
| `/api/generate` | Ollama-compatible generation |
| `/v1/awooop/inference/runs` | future async AwoooP run API |
Gateway requests must carry `project_id`, `trace_id`, and an intent/lane label
when called from AwoooP-aware code.
### D4 - Alert lane is protected
Alert diagnosis must not share an unconstrained queue with heavy code-review or
deep-RCA jobs.
Initial lanes:
| Lane | Model | Primary use | Default timeout |
|------|-------|-------------|-----------------|
| `alert-fast` | `gemma3:4b` | Telegram incident cards and low-risk RCA | 45s |
| `code-review` | `qwen2.5-coder:7b` | Gitea review | 90s |
| `embedding` | `bge-m3` | RAG embeddings | 30s |
| `deep-rca` | 111-hosted 14B-class model | slow human-reviewed diagnosis | async only |
No 14B/32B model may evict `alert-fast` residency on GCP-A/GCP-B unless the
gateway explicitly opens a maintenance window.
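The residency rule in D4 could be enforced with a small admission check in the gateway. This is only a sketch under stated assumptions: the parameter size is read from the Ollama tag suffix (for example `:14b`), "14B/32B" is interpreted as 14 billion parameters and above, and `may_load` and the host labels are hypothetical names rather than an existing gateway API.

```python
import re

PROTECTED_HOSTS = {"GCP-A", "GCP-B"}
HEAVY_THRESHOLD_B = 14.0  # assumption: "14B/32B" means 14B parameters and up


def parameter_size_b(model):
    """Parse the size from a tag such as 'gemma3:4b'; None if the tag has no size."""
    _, _, tag = model.partition(":")
    match = re.match(r"(\d+(?:\.\d+)?)b\b", tag.lower())
    return float(match.group(1)) if match else None


def may_load(model, host, maintenance_open=False):
    """Deny heavy models on protected hosts unless a maintenance window is open."""
    size = parameter_size_b(model)
    heavy = size is not None and size >= HEAVY_THRESHOLD_B
    if heavy and host in PROTECTED_HOSTS:
        return maintenance_open
    return True


if __name__ == "__main__":
    for model, host in [("gemma3:4b", "GCP-A"), ("qwen2.5:14b", "GCP-A"), ("qwen2.5:14b", "111")]:
        print(model, host, may_load(model, host))
```

Models whose tag carries no size (such as `bge-m3`) are treated as not heavy in this sketch; a real gateway would more likely use an explicit allow-list per lane.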
## Migration Plan
### Phase 0 - Current bridge
- Keep `192.168.0.110:11435` and `192.168.0.110:11436` active.
- Alert path uses `ALERT_OLLAMA_MODEL=gemma3:4b`.
- Gemini remains paid emergency fallback only.
### Phase 1 - Mesh build in parallel
- Install WireGuard on 110, 120, 121, 111, GCP-A, and GCP-B.
- Assign mesh IPs from `10.77.114.0/24`.
- Keep public proxy and old env values unchanged.
- Verify `/api/tags`, `/api/ps`, and `gemma3:4b` generation over mesh.
### Phase 2 - Shadow mesh
- Add shadow health checks from the API pod to mesh endpoints.
- Emit OTel spans with both `active_endpoint` and `shadow_endpoint`.
- Do not send production inference traffic to mesh yet.
Promotion gate:
- 24h continuous mesh health
- p95 `alert-fast` latency <= current proxy p95 + 10%
- zero public-path-only success events
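The Phase 2 promotion gate can be expressed as a single predicate. The sketch below assumes latency samples are collected in milliseconds and uses a nearest-rank p95; `healthy_hours` and `public_only_successes` are hypothetical inputs standing in for whatever the shadow health checks actually record.

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def shadow_gate_passes(mesh_ms, proxy_ms, healthy_hours, public_only_successes,
                       tolerance_pct=10):
    """True when all three Phase 2 promotion conditions hold."""
    # Mesh p95 must be within proxy p95 + 10%; integers avoid float rounding.
    latency_ok = p95(mesh_ms) * 100 <= p95(proxy_ms) * (100 + tolerance_pct)
    return healthy_hours >= 24 and latency_ok and public_only_successes == 0


if __name__ == "__main__":
    proxy = [2000] * 19 + [9000]
    mesh = [2100] * 20
    print(shadow_gate_passes(mesh, proxy, healthy_hours=24, public_only_successes=0))
```

Keeping the gate as a pure function makes it easy to evaluate against exported span data before anyone edits the production env.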
### Phase 3 - Switch active endpoints
Set production env:
```yaml
OLLAMA_URL: "http://10.77.114.21:11434"
OLLAMA_SECONDARY_URL: "http://10.77.114.22:11434"
OLLAMA_FALLBACK_URL: "http://10.77.114.111:11434"
```
Promotion gate:
- 7 days canary
- Gemini usage for alert lane is zero except documented all-Ollama outage
- no alert-card timeout regression
### Phase 4 - Close public exposure
- Remove or firewall public GCP `11434/tcp`.
- Keep nginx bridge config but disable listener or restrict to operator-only
rollback.
## Rollback
Rollback is env-only while the bridge remains available:
```yaml
OLLAMA_URL: "http://192.168.0.110:11435"
OLLAMA_SECONDARY_URL: "http://192.168.0.110:11436"
OLLAMA_FALLBACK_URL: "http://192.168.0.111:11434"
```
If GCP-A/B are unstable, force 111-first temporarily:
```yaml
OLLAMA_URL: "http://192.168.0.111:11434"
OLLAMA_SECONDARY_URL: "http://192.168.0.110:11435"
OLLAMA_FALLBACK_URL: "http://192.168.0.110:11436"
```
Paid provider fallback must remain budget-gated.
## Consequences
- GCP Ollama becomes private-by-default instead of public-IP dependent.
- K8s NetworkPolicy can move from public `/32` rules to stable mesh `/32` rules.
- AwoooP can manage Ollama as a platform resource shared by all tenants.
- CPU-only GCP performance remains a capacity constraint; routing must keep
heavy jobs off the alert lane or use GPU-capable GCP nodes.


@@ -1120,22 +1120,34 @@ AwoooP 解法:全部 LLM call 必須 emit 以上 attribute進 SignOz188:
## 14. Impact of the GCP Ollama topology on AwoooP (ADR-110 integration)
### 14.1 New topology (ADR-110, effective 2026-05-03)
### 14.1 New topology (ADR-110 + ADR-125, amended 2026-05-05)
```
Primary  : GCP-A  http://34.143.170.20:11434   SSD (9x load speed)
Secondary: GCP-B  http://34.21.145.224:11434   SSD (standby)
Fallback : Local  http://192.168.0.111:11434   HDD (last line of defense)
Emergency: Gemini → Nemotron → Claude (when all Ollama tiers are down)
Phase 0 bridge:
Primary  : GCP-A  http://192.168.0.110:11435   110 nginx → GCP public IP
Secondary: GCP-B  http://192.168.0.110:11436
Fallback : Local  http://192.168.0.111:11434
Emergency: Gemini → Nemotron → Claude (when all Ollama tiers are down; budget-gated)
Target private mesh:
Primary  : GCP-A  http://10.77.114.21:11434
Secondary: GCP-B  http://10.77.114.22:11434
Fallback : Local  http://10.77.114.111:11434
```
ADR-125 amends the transport layer of ADR-110: the public GCP IPs / 110 nginx proxy are kept only as a
transitional and rollback bridge. The production path is the WireGuard private mesh, and runtime routing is
managed by the AwoooP Inference Gateway.
### 14.2 Impact items AwoooP must handle
| Impact item | Location | Handling | Phase |
|--------|------|---------|-------|
| `ollama:current_primary` Redis key dual-write (supports only 1 URL; the new topology needs 3 tiers) | INV-1 | Change to `platform:ollama:topology` (JSON: primary/secondary/fallback) | Phase 2 |
| Second definition at `ollama_auto_recovery.py:230` (P0-11) | ollama_auto_recovery.py | Remove; read from config everywhere | Phase 2 PR-03 |
| GCP IPs enter INV-4 (34.143.170.20, 34.21.145.224) | INV-4 | Add to the allowed IP list and confirm K8s NetworkPolicy egress is configured | Phase 0 INV-4 |
| GCP public IPs enter INV-4 (34.143.170.20, 34.21.145.224) | INV-4 | Mark as transitional only; production moves to the `10.77.114.21/22` mesh IPs | Phase 0 INV-4 |
| WireGuard mesh | ADR-125 / runbook | Build the `10.77.114.0/24` private transport and close public 11434 | Phase 2 prerequisite |
| AwoooP Inference Gateway | ADR-125 / runbook | Isolate the alert-fast / code-review / embedding / deep-rca lanes so heavy models cannot crowd out the alert lane | Phase 4 |
| EwoooC Provider Proxy routes through GCP Ollama | Phase 6 | EwoooC shares the platform Ollama topology (platform_resource) | Phase 6 |
| IP assert at `telemetry.py:71` (P0-08) | telemetry.py:71 | Once removed, GCP IPs no longer trigger the assert; switch to config-driven | Phase 2 PR-01 |
| budget_ledger records Ollama usage (free GCP still needs token counting) | Phase 4 | Ollama calls must also record token consumption (budget_ledger) | Phase 4 |
@@ -1143,11 +1155,24 @@ Emergency: Gemini → Nemotron → Claude (全 Ollama 掛時)
### 14.3 GCP Ollama is a platform_resource (ADR-111)
GCP Ollama (34.143.170.20, 34.21.145.224) and Local Ollama (192.168.0.111) are all declared as `platform_resource`
GCP Ollama (bridge: 34.143.170.20 / 34.21.145.224; target mesh:
10.77.114.21 / 10.77.114.22) and Local Ollama (192.168.0.111 / target
10.77.114.111) are all declared as `platform_resource`
- Belongs to no tenant
- Shared by all tenants (AWOOOI / EwoooC / Tsenyang / Bitan), but audit records each tenant's own project_id
- The `platform:ollama:topology` Redis key uses the `platform:` prefix (not `{project_id}:`)
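One possible shape for the value stored under `platform:ollama:topology` is sketched below. Only the key name and the three tier names come from this document; the JSON layout, the helper names, and the choice to store plain URLs are assumptions, and no Redis client call is shown.

```python
import json

TOPOLOGY_KEY = "platform:ollama:topology"  # platform-scoped, not {project_id}-scoped
TIERS = ("primary", "secondary", "fallback")


def encode_topology(primary, secondary, fallback):
    """Serialize the three tier URLs into one JSON string for a single Redis key."""
    return json.dumps(
        {"primary": primary, "secondary": secondary, "fallback": fallback},
        sort_keys=True,
    )


def decode_topology(raw):
    """Return the tier URLs in failover order, rejecting incomplete values."""
    data = json.loads(raw)
    missing = [tier for tier in TIERS if not data.get(tier)]
    if missing:
        raise ValueError(f"topology is missing tiers: {missing}")
    return [data[tier] for tier in TIERS]


if __name__ == "__main__":
    raw = encode_topology(
        "http://192.168.0.110:11435",
        "http://192.168.0.110:11436",
        "http://192.168.0.111:11434",
    )
    print(TOPOLOGY_KEY, decode_topology(raw))
```

Storing all three tiers in one value means a reader never observes a half-updated topology, which the single-URL `ollama:current_primary` key could not express.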
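### 14.4 Measured limits (2026-05-05)
Measured with `scripts/ops/ollama-topology-check.sh`:
- GCP-A `gemma3:4b`: about 2s, `size_vram=0`
- GCP-B `gemma3:4b`: about 8.5s, `size_vram=0`
- 111 fallback `gemma3:4b`: about 4.9s, `size_vram=8210446336`
Conclusion: GCP-A/B can serve as the synchronous `alert-fast` lane, but must not carry synchronous 14B/32B alert diagnosis.
Heavy models must be routed by the Inference Gateway to async / 111 / GPU nodes.
---
## 15. Work sequencing summary (with parallel groups + Critical Path)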


@@ -135,7 +135,7 @@ ADR-106 也需要補一節:**Strangler Fig Quantified Gates**,把 shadow →
3. **Redis working memory project boundary** (#15)
- `incident_service.py:603`: `SCAN incident:*` → `SCAN {project_id}:incident:*`
- Every `SCAN`/`KEYS` must carry a prefix
4. **`platform_resource` exception list**: Ollama failover state, global rate limit, leader election lock, etc. are explicitly marked
4. **`platform_resource` exception list**: Ollama failover state, global rate limit, leader election lock, etc. are explicitly marked. Per ADR-125, the production GCP Ollama path becomes WireGuard mesh + AwoooP Inference Gateway; the 110 nginx proxy is kept only as a transitional / rollback bridge
5. **Regression tests**: cross-project read/write must be rejected; platform_resource must be allowed but written to audit
6. **AWOOOI Bootstrap Paradox fix** (per ADR-111, INV-3)
- Entrypoints marked `platform_internal` carry `project_id=__platform__`; they are exempt from hard reject but still write audit


@@ -0,0 +1,153 @@
# AwoooP Inference Gateway Runbook
> Runtime design for keeping GCP-A, GCP-B, 111, and paid providers under one
> controlled inference lane.
---
## Goal
Stop individual services from calling raw model hosts independently.
The gateway becomes the single platform path for:
- endpoint selection
- model lane assignment
- queue and concurrency control
- fallback
- cost and token audit
- trace correlation
## Why This Is Needed
Direct provider calls caused the 2026-05-05 alert issue:
- alert diagnosis wanted a fast response
- GCP-A/B were loaded with heavyweight models
- the request timed out through GCP-A and GCP-B
- Gemini fallback generated cost
Private networking alone cannot prevent model eviction or queue contention. The
gateway must own runtime scheduling.
## Required Lanes
| Lane | Model | Allowed hosts | Notes |
|------|-------|---------------|-------|
| `alert-fast` | `gemma3:4b` | GCP-A, GCP-B, 111 | Synchronous, protected |
| `code-review` | `qwen2.5-coder:7b` | GCP-B, 111 | Never 32B on GCP during alert canary |
| `embedding` | `bge-m3` | GCP-A, GCP-B, 111 | Short timeout |
| `deep-rca` | 14B-class model | 111 or GPU node | Async only |
| `paid-emergency` | Gemini / Claude | Cloud | Budget-gated emergency fallback |
## v0 API
The gateway should initially provide an Ollama-compatible API to minimize caller
changes:
```http
POST /api/generate
GET /api/tags
GET /api/ps
```
Required headers for AwoooP-aware calls:
```http
X-AwoooP-Project-ID: awoooi
X-AwoooP-Trace-ID: <w3c-trace-id>
X-AwoooP-Lane: alert-fast
X-AwoooP-Intent: DIAGNOSE
```
Legacy callers may be accepted in shadow mode, but must be assigned
`project_id=awoooi` by bootstrap rules from ADR-111.
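An AwoooP-aware caller could attach the required headers as sketched below. The gateway URL is a placeholder (no gateway address is defined in this runbook), `build_generate_request` is a hypothetical helper, and the trace ID is generated as 32 hex characters to match the W3C trace-id length. The request is only constructed here, not sent.

```python
import json
import secrets
import urllib.request

# Hypothetical address; the runbook does not define where the gateway listens.
GATEWAY_URL = "http://awooop-inference-gateway.example:8080"


def build_generate_request(prompt, project_id="awoooi", lane="alert-fast",
                           intent="DIAGNOSE", trace_id=None):
    """Build an Ollama-compatible /api/generate request with AwoooP headers."""
    headers = {
        "Content-Type": "application/json",
        "X-AwoooP-Project-ID": project_id,
        # 16 random bytes -> 32 hex characters, the W3C trace-id length.
        "X-AwoooP-Trace-ID": trace_id or secrets.token_hex(16),
        "X-AwoooP-Lane": lane,
        "X-AwoooP-Intent": intent,
    }
    body = json.dumps({"model": "gemma3:4b", "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(
        GATEWAY_URL + "/api/generate", data=body, headers=headers, method="POST"
    )


if __name__ == "__main__":
    req = build_generate_request("summarize the alert")
    for name, value in req.header_items():
        print(f"{name}: {value}")
```

`urllib.request` stores header names in capitalized form (for example `X-awooop-lane`); HTTP header names are case-insensitive, so the gateway should match them without regard to case.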
## Scheduling Rules
- `alert-fast` concurrency is reserved and cannot be borrowed by other lanes.
- `alert-fast` keeps `gemma3:4b` warm on both GCP-A and GCP-B.
- 14B/32B models are denied on GCP-A/B unless an operator opens maintenance.
- Per-host circuit breaker opens after 2 consecutive timeout failures.
- Paid provider fallback requires:
- all Ollama endpoints failed or are circuit-open
- budget hard kill not triggered
- audit span records fallback reason
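The per-host breaker rule can be sketched as a small state holder. The threshold of 2 consecutive timeouts comes from the rule above; the class name and methods are illustrative, and recovery (how and when an open breaker closes again) is not specified in this runbook, so the sketch simply resets on the next recorded success.

```python
from dataclasses import dataclass


@dataclass
class HostBreaker:
    """Opens after `threshold` consecutive timeouts on one host."""

    threshold: int = 2
    consecutive_timeouts: int = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_timeouts >= self.threshold

    def record_timeout(self) -> None:
        self.consecutive_timeouts += 1

    def record_success(self) -> None:
        # Any success breaks the streak; only consecutive timeouts count.
        self.consecutive_timeouts = 0


if __name__ == "__main__":
    breaker = HostBreaker()
    breaker.record_timeout()
    print("after 1 timeout:", breaker.is_open)
    breaker.record_timeout()
    print("after 2 timeouts:", breaker.is_open)
```

One breaker instance per host (GCP-A, GCP-B, 111) keeps a slow node from consuming the full timeout on every alert request.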
## Minimal Routing Algorithm
```text
input: lane, model, project_id, trace_id

if lane == alert-fast:
  model = gemma3:4b
  try GCP-A with 45s timeout
  try GCP-B with 45s timeout
  try 111 with 60s timeout
  if allowed by budget: try paid emergency fallback

if lane == code-review:
  model = qwen2.5-coder:7b
  try GCP-B with 90s timeout
  try 111 with 120s timeout

if lane == deep-rca:
  reject synchronous request
  create async run
```
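The routing pseudocode can be turned into a table-driven planner. This is a sketch only: it returns the ordered attempts rather than executing them, `plan_route` and the dictionary layout are hypothetical, and the `embedding` lane is left out because the pseudocode does not define its order or timeouts.

```python
# Lane -> model and ordered (host, timeout_seconds) attempts, from the pseudocode.
SYNC_ROUTES = {
    "alert-fast": {
        "model": "gemma3:4b",
        "attempts": [("GCP-A", 45), ("GCP-B", 45), ("111", 60)],
        "paid_fallback": True,
    },
    "code-review": {
        "model": "qwen2.5-coder:7b",
        "attempts": [("GCP-B", 90), ("111", 120)],
        "paid_fallback": False,
    },
}


def plan_route(lane, budget_allows_paid=False):
    """Return the execution mode, model, and ordered attempts for a lane."""
    if lane == "deep-rca":
        # Synchronous requests are rejected; the caller must create an async run.
        return {"mode": "async", "model": None, "attempts": []}
    if lane not in SYNC_ROUTES:
        raise ValueError(f"unknown lane: {lane}")
    route = SYNC_ROUTES[lane]
    attempts = list(route["attempts"])
    if route["paid_fallback"] and budget_allows_paid:
        attempts.append(("paid-emergency", None))
    return {"mode": "sync", "model": route["model"], "attempts": attempts}


if __name__ == "__main__":
    print(plan_route("alert-fast", budget_allows_paid=True))
```

Separating planning from execution lets the Stage 2 shadow gateway compute and log a decision without sending any traffic.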
## Metrics and Logs
Every request must emit:
- `awooop.project_id`
- `awooop.lane`
- `awooop.provider_tier`
- `awooop.endpoint`
- `gen_ai.request.model`
- `gen_ai.usage.input_tokens`
- `gen_ai.usage.output_tokens`
- `awooop.fallback_reason`
- `awooop.cost_usd`
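A gateway test could assert that every span carries the attributes listed above. The attribute names are taken from the list; the example values and the `missing_attributes` helper are hypothetical, and no tracing SDK is involved in this sketch.

```python
REQUIRED_ATTRIBUTES = (
    "awooop.project_id",
    "awooop.lane",
    "awooop.provider_tier",
    "awooop.endpoint",
    "gen_ai.request.model",
    "gen_ai.usage.input_tokens",
    "gen_ai.usage.output_tokens",
    "awooop.fallback_reason",
    "awooop.cost_usd",
)


def missing_attributes(span_attributes):
    """Return the required attribute names that are absent from a span."""
    return [name for name in REQUIRED_ATTRIBUTES if name not in span_attributes]


if __name__ == "__main__":
    # Illustrative values only; real spans come from the gateway.
    example = {
        "awooop.project_id": "awoooi",
        "awooop.lane": "alert-fast",
        "awooop.provider_tier": "primary",
        "awooop.endpoint": "http://10.77.114.21:11434",
        "gen_ai.request.model": "gemma3:4b",
        "gen_ai.usage.input_tokens": 120,
        "gen_ai.usage.output_tokens": 32,
        "awooop.fallback_reason": "",
        "awooop.cost_usd": 0.0,
    }
    print(missing_attributes(example) or "all required attributes present")
```

The check tests for presence rather than truthiness, so an empty `awooop.fallback_reason` or a zero cost on a free Ollama call still counts as emitted.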
## Implementation Stages
### Stage 1 - Sidecar health view
- Keep existing providers.
- Add health and residency checks to identify which lane is safe.
- No traffic proxying yet.
### Stage 2 - Gateway in shadow
- Mirror inference requests to the gateway.
- Gateway computes routing decision but does not execute.
- Compare selected endpoint/model against legacy path.
### Stage 3 - Alert lane active
- Route only `alert-fast` through the gateway.
- Keep code review and deep RCA on legacy providers.
### Stage 4 - All Ollama traffic active
- Move code review, embedding, and deep RCA to the gateway.
- Enforce lane-based deny rules.
### Stage 5 - AwoooP runtime integration
- Convert gateway decisions into `run_state` and `step_journal` entries.
- Use AwoooP budget ledger as source of truth.
## Rollback
Set provider env back to raw endpoints:
```yaml
OLLAMA_URL: "http://192.168.0.110:11435"
OLLAMA_SECONDARY_URL: "http://192.168.0.110:11436"
OLLAMA_FALLBACK_URL: "http://192.168.0.111:11434"
```
Do not disable budget hard kill during rollback.


@@ -1,6 +1,10 @@
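# GCP Ollama Nginx Proxy Deployment Guide
> ADR-110 three-tier disaster recovery: key steps for enabling GCP Ollama
>
> 2026-05-05 amendment: this runbook is kept only as a transitional / rollback bridge. The production approach is
> the WireGuard private mesh and AwoooP Inference Gateway from ADR-125. New deployments must not leave
> GCP `11434/tcp` open to `0.0.0.0/0` long term.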
---
@@ -173,7 +177,10 @@ kubectl describe networkpolicy -n awoooi-prod allow-required-egress
curl -v http://34.143.170.20:11434/api/tags
```
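If it fails, check whether the GCP firewall rule opens 0.0.0.0/0:11434.
If it fails, only briefly confirm whether the GCP firewall opens
`11434/tcp` to the fixed egress IP of 110. Do not treat `0.0.0.0/0:11434` as a production setting.
For the production cutover, follow [GCP-OLLAMA-WIREGUARD-MESH.md](GCP-OLLAMA-WIREGUARD-MESH.md) instead.
### 3. Model loads but inference fails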
@@ -189,9 +196,12 @@ curl -v http://34.143.170.20:11434/api/tags
## Related documents
- ADR-110: GCP three-tier disaster recovery architecture
- ADR-125: GCP Ollama Private Mesh and AwoooP Inference Gateway
- `k8s/awoooi-prod/04-configmap.yaml`
- `k8s/awoooi-prod/02-network-policy.yaml`
- `docs/runbooks/RUNBOOK-OLLAMA-FAILOVER.md`
- `docs/runbooks/GCP-OLLAMA-WIREGUARD-MESH.md`
- `docs/runbooks/AWOOOP-INFERENCE-GATEWAY.md`
---


@@ -0,0 +1,280 @@
# GCP Ollama WireGuard Mesh Runbook
> Target state for ADR-125. This replaces the public GCP Ollama proxy as the
> primary path after shadow and canary validation.
---
## Scope
This runbook builds private Ollama connectivity between AWOOOI K3s and the GCP
Ollama hosts.
It does not replace AwoooP Inference Gateway work. The mesh solves transport and
security. The gateway solves routing, queueing, model residency, and fallback.
## Current State
Current production endpoints:
| Variable | Endpoint | Meaning |
|----------|----------|---------|
| `OLLAMA_URL` | `http://192.168.0.110:11435` | GCP-A through 110 nginx |
| `OLLAMA_SECONDARY_URL` | `http://192.168.0.110:11436` | GCP-B through 110 nginx |
| `OLLAMA_FALLBACK_URL` | `http://192.168.0.111:11434` | Local 111 |
This is a bridge. Do not treat the public proxy as the final architecture.
## Target State
| Host | WireGuard IP | Notes |
|------|--------------|-------|
| 110 | `10.77.114.10` | DevOps host and rollback bridge |
| 120 | `10.77.114.120` | K3s node |
| 121 | `10.77.114.121` | K3s node |
| 111 | `10.77.114.111` | Local Ollama fallback |
| GCP-A | `10.77.114.21` | Primary Ollama |
| GCP-B | `10.77.114.22` | Secondary Ollama |
Production endpoints after cutover:
```yaml
OLLAMA_URL: "http://10.77.114.21:11434"
OLLAMA_SECONDARY_URL: "http://10.77.114.22:11434"
OLLAMA_FALLBACK_URL: "http://10.77.114.111:11434"
```
## Prerequisites
- SSH access to GCP-A and GCP-B.
- GCP IAM permissions for firewall rules if OS firewall alone is not enough.
- SSH access to 110, 111, 120, and 121.
- A secured place to store WireGuard private keys. Never commit private keys.
- Confirm the GCP hosts have enough CPU/RAM for `gemma3:4b`.
## Key Rules
- Private keys are generated on each host and never copied into Git.
- Public keys may be recorded in the operator handoff note.
- Public GCP `11434/tcp` must be closed after cutover.
- `alert-fast` uses `gemma3:4b`; 14B/32B models must not run on GCP-A/B during
alert-lane canary.
## Install WireGuard
Ubuntu/Debian:
```bash
sudo apt-get update
sudo apt-get install -y wireguard
```
Alpine:
```bash
sudo apk add --no-cache wireguard-tools
```
Generate keys on every host:
```bash
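umask 077
# Write the private key without echoing it to the terminal.
wg genkey | sudo tee /etc/wireguard/awooop.key >/dev/null
# Derive the public key; this one is safe to display and share.
sudo cat /etc/wireguard/awooop.key | wg pubkey | sudo tee /etc/wireguard/awooop.pub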
```
## Configure Peers
Create `/etc/wireguard/wg-awooop.conf` on each host.
Example for GCP-A:
```ini
[Interface]
Address = 10.77.114.21/32
ListenPort = 51820
PrivateKey = <GCP_A_PRIVATE_KEY>
[Peer]
# 120 K3s node
PublicKey = <K3S_120_PUBLIC_KEY>
AllowedIPs = 10.77.114.120/32
Endpoint = <120_REACHABLE_ENDPOINT>:51820
PersistentKeepalive = 25
[Peer]
# 121 K3s node
PublicKey = <K3S_121_PUBLIC_KEY>
AllowedIPs = 10.77.114.121/32
Endpoint = <121_REACHABLE_ENDPOINT>:51820
PersistentKeepalive = 25
[Peer]
# 110 DevOps rollback bridge
PublicKey = <HOST_110_PUBLIC_KEY>
AllowedIPs = 10.77.114.10/32
Endpoint = <110_REACHABLE_ENDPOINT>:51820
PersistentKeepalive = 25
```
Example for a K3s node:
```ini
[Interface]
Address = 10.77.114.120/32
ListenPort = 51820
PrivateKey = <K3S_120_PRIVATE_KEY>
[Peer]
# GCP-A
PublicKey = <GCP_A_PUBLIC_KEY>
AllowedIPs = 10.77.114.21/32
Endpoint = 34.143.170.20:51820
PersistentKeepalive = 25
[Peer]
# GCP-B
PublicKey = <GCP_B_PUBLIC_KEY>
AllowedIPs = 10.77.114.22/32
Endpoint = 34.21.145.224:51820
PersistentKeepalive = 25
[Peer]
# Local 111
PublicKey = <HOST_111_PUBLIC_KEY>
AllowedIPs = 10.77.114.111/32
Endpoint = 192.168.0.111:51820
PersistentKeepalive = 25
```
The exact peer list depends on reachable endpoints. If inbound access to 120/121
is not available, use 110 as a temporary mesh relay, then replace it with direct
K3s-to-GCP peers when routing is confirmed.
## Start WireGuard
```bash
sudo systemctl enable --now wg-quick@wg-awooop
sudo wg show wg-awooop
```
Verify connectivity:
```bash
ping -c 3 10.77.114.21
ping -c 3 10.77.114.22
curl -fsS http://10.77.114.21:11434/api/tags
curl -fsS http://10.77.114.22:11434/api/tags
```
## Bind or Firewall Ollama
Preferred: bind Ollama to the mesh interface.
```bash
sudo systemctl edit ollama
```
```ini
[Service]
Environment="OLLAMA_HOST=10.77.114.21:11434"
```
Use `10.77.114.22:11434` on GCP-B.
If binding is not possible, firewall the host:
```bash
sudo ufw allow from 10.77.114.0/24 to any port 11434 proto tcp
sudo ufw deny 11434/tcp
```
Then restart:
```bash
sudo systemctl daemon-reload
sudo systemctl restart ollama
```
## K8s NetworkPolicy
After mesh cutover, allow only mesh endpoints for Ollama:
```yaml
- to:
    - ipBlock:
        cidr: 10.77.114.21/32
    - ipBlock:
        cidr: 10.77.114.22/32
    - ipBlock:
        cidr: 10.77.114.111/32
  ports:
    - protocol: TCP
      port: 11434
```
Do not remove the `192.168.0.110:11435/11436` rules until rollback is no longer
needed.
## Shadow Validation
From the API pod:
```bash
bash scripts/ops/ollama-topology-check.sh
```
Expected:
- GCP-A `/api/tags` returns 200.
- GCP-B `/api/tags` returns 200.
- `gemma3:4b` generation succeeds on both nodes.
- `/api/ps` contains `gemma3:4b`.
- If `size_vram=0`, keep GCP-A/B on `alert-fast` only and route heavy models to
111 or a GPU-capable node.
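The `size_vram` expectation can be checked programmatically against an `/api/ps` response. The sketch below assumes the response shape that `scripts/ops/ollama-topology-check.sh` already reads (a `models` list whose items carry `size_vram`); `vram_status` is a hypothetical helper, and a host with no loaded models is reported as `unknown` rather than guessed.

```python
def vram_status(ps_payload):
    """Classify a host from its /api/ps payload: cpu_only, has_vram, or unknown."""
    models = ps_payload.get("models", [])
    if not models:
        return "unknown"  # nothing resident, so size_vram cannot be observed
    if all(model.get("size_vram") == 0 for model in models):
        return "cpu_only"
    return "has_vram"


if __name__ == "__main__":
    gcp_like = {"models": [{"model": "gemma3:4b", "size_vram": 0}]}
    local_like = {"models": [{"model": "gemma3:4b", "size_vram": 8210446336}]}
    print("GCP-like host:", vram_status(gcp_like))
    print("111-like host:", vram_status(local_like))
```

A `cpu_only` result corresponds to the case above where GCP-A/B stay on `alert-fast` only.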
## Cutover
Patch deployment env after shadow passes:
```bash
kubectl -n awoooi-prod set env deploy/awoooi-api \
  OLLAMA_URL=http://10.77.114.21:11434 \
  OLLAMA_SECONDARY_URL=http://10.77.114.22:11434 \
  OLLAMA_FALLBACK_URL=http://10.77.114.111:11434

kubectl -n awoooi-prod set env deploy/awoooi-worker \
  OLLAMA_URL=http://10.77.114.21:11434 \
  OLLAMA_SECONDARY_URL=http://10.77.114.22:11434 \
  OLLAMA_FALLBACK_URL=http://10.77.114.111:11434
```
Verify:
```bash
kubectl -n awoooi-prod rollout status deploy/awoooi-api --timeout=180s
kubectl -n awoooi-prod rollout status deploy/awoooi-worker --timeout=180s
bash scripts/ops/ollama-topology-check.sh
```
## Rollback
```bash
kubectl -n awoooi-prod set env deploy/awoooi-api \
  OLLAMA_URL=http://192.168.0.110:11435 \
  OLLAMA_SECONDARY_URL=http://192.168.0.110:11436 \
  OLLAMA_FALLBACK_URL=http://192.168.0.111:11434

kubectl -n awoooi-prod set env deploy/awoooi-worker \
  OLLAMA_URL=http://192.168.0.110:11435 \
  OLLAMA_SECONDARY_URL=http://192.168.0.110:11436 \
  OLLAMA_FALLBACK_URL=http://192.168.0.111:11434
```
## Done Criteria
- Mesh endpoints pass 7 days of canary.
- Alert lane Gemini usage is zero except documented all-Ollama outages.
- Public GCP `11434/tcp` is closed.
- Operator runbook records peer public keys and rollback owner.


@@ -0,0 +1,88 @@
#!/usr/bin/env bash
set -euo pipefail
NAMESPACE="${NAMESPACE:-awoooi-prod}"
DEPLOYMENT="${DEPLOYMENT:-awoooi-api}"
MODEL="${MODEL:-gemma3:4b}"
TIMEOUT_SECONDS="${TIMEOUT_SECONDS:-60}"

# Run the check inside the API pod so it uses the pod's own OLLAMA_* env values.
kubectl -n "${NAMESPACE}" exec -i "deploy/${DEPLOYMENT}" -- \
  env CHECK_MODEL="${MODEL}" CHECK_TIMEOUT_SECONDS="${TIMEOUT_SECONDS}" python - <<'PY'
import json
import os
import time
import urllib.error
import urllib.request

model = os.environ["CHECK_MODEL"]
timeout = int(os.environ["CHECK_TIMEOUT_SECONDS"])
endpoints = [
    ("primary", os.environ.get("OLLAMA_URL", "")),
    ("secondary", os.environ.get("OLLAMA_SECONDARY_URL", "")),
    ("fallback", os.environ.get("OLLAMA_FALLBACK_URL", "")),
]
print(f"model={model} timeout={timeout}s")


def request_json(url: str, path: str, payload=None, timeout_seconds=10):
    """GET when payload is None, otherwise POST the payload as JSON."""
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode()
        headers["Content-Type"] = "application/json"
    req = urllib.request.Request(url.rstrip("/") + path, data=data, headers=headers)
    with urllib.request.urlopen(req, timeout=timeout_seconds) as response:
        return json.loads(response.read().decode())


for label, url in endpoints:
    print(f"\n== {label}: {url or '<missing>'} ==")
    if not url:
        print("status=missing")
        continue

    # Health and model inventory; skip the endpoint entirely if unreachable.
    try:
        tags = request_json(url, "/api/tags", timeout_seconds=10)
        names = sorted(m.get("name", "") for m in tags.get("models", []))
        print("tags=ok", ",".join(names[:12]))
    except Exception as exc:
        print("tags=fail", type(exc).__name__, str(exc)[:160])
        continue

    # Residency: which models are loaded and whether they sit in VRAM.
    try:
        ps = request_json(url, "/api/ps", timeout_seconds=10)
        live = ps.get("models", [])
        if not live:
            print("ps=ok live_models=<none>")
        for item in live:
            print(
                "ps=ok",
                f"model={item.get('model')}",
                f"expires={item.get('expires_at')}",
                f"size_vram={item.get('size_vram')}",
                f"context={item.get('context_length')}",
            )
            if item.get("size_vram") == 0:
                print("warning=cpu_only_or_no_vram")
    except Exception as exc:
        print("ps=fail", type(exc).__name__, str(exc)[:160])

    # Latency probe. The prompt asks for a one-line answer in Traditional Chinese.
    payload = {
        "model": model,
        "prompt": "用繁體中文用一行回答Ollama health check",
        "stream": False,
        "keep_alive": "8h",
        "options": {"num_predict": 32, "temperature": 0.1},
    }
    start = time.time()
    try:
        result = request_json(url, "/api/generate", payload, timeout_seconds=timeout)
        latency_ms = round((time.time() - start) * 1000)
        response = (result.get("response") or "").replace("\n", " ")[:120]
        print(f"generate=ok latency_ms={latency_ms} response={response}")
    except urllib.error.HTTPError as exc:
        body = exc.read().decode(errors="replace")[:200]
        print("generate=fail", "HTTPError", exc.code, body)
    except Exception as exc:
        print("generate=fail", type(exc).__name__, str(exc)[:200])
PY