docs(awooop): define private Ollama mesh gateway
All checks were successful
Code Review / ai-code-review (push) Successful in 10s
@@ -3175,3 +3175,21 @@ kubectl -n awoooi-prod exec deploy/awoooi-api -- printenv | grep -E 'ALERT_OLLAM
- `192.168.0.110:11435/11436` is currently forwarded by the nginx on 110 to the GCP public IPs. This is a transitional arrangement and should not serve as the long-term primary Ollama lane.
- Recommendation: build a WireGuard site-to-site private mesh so that K3s / 110 / 111 / GCP-A / GCP-B interconnect over private IPs, Ollama binds only to the mesh interface, and the AwoooP Inference Gateway centrally handles routing, circuit breaking, queueing, and model keep-warm.
- Note: `/api/ps` on GCP-A / GCP-B currently reports `size_vram: 0`. Moving to a private network fixes connectivity and security, but it cannot make CPU-only GCP nodes match the VRAM/GPU performance of 111; large models should stay on 111 or move to GPU-backed GCP nodes.

### Follow-up documentation

- Added `docs/adr/ADR-125-gcp-ollama-private-mesh-inference-gateway.md`
- Added `docs/runbooks/GCP-OLLAMA-WIREGUARD-MESH.md`
- Added `docs/runbooks/AWOOOP-INFERENCE-GATEWAY.md`
- Added `scripts/ops/ollama-topology-check.sh` as the on-site tool for checking three-tier Ollama health / residency / latency

### `ollama-topology-check` field measurements

```bash
bash scripts/ops/ollama-topology-check.sh
# primary GCP-A via 110 proxy: gemma3:4b generate OK, ~2s, size_vram=0
# secondary GCP-B via 110 proxy: gemma3:4b generate OK, ~8.5s, size_vram=0
# fallback 111 direct: gemma3:4b generate OK, ~4.9s, size_vram=8210446336
```

Conclusion: GCP-A/B can serve as the `alert-fast` lane, but for now they should not carry synchronous alert inference on 14B/32B models; heavy models must be isolated by the AwoooP Inference Gateway onto async / 111 / GPU nodes.

@@ -5,6 +5,10 @@
**Decision maker**: Commander (統帥)
**Related**: supersedes ADR-105 (Revert A2 Ollama Primary)

> 2026-05-05 amendment: the "GCP-A → GCP-B → 111 → paid provider" logic of this ADR remains valid,
> but the public GCP IPs / 110 nginx proxy are only a transitional transport. The formal transport and runtime
> management are superseded by ADR-125 (GCP Ollama Private Mesh and AwoooP Inference Gateway).

---

## Background

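@@ -62,3 +66,15 @@ K8s NetworkPolicy egress now has /32 egress rules for GCP-A/GCP-B (port 11434
- Primary Ollama traffic runs on GCP SSD-backed hosts, which improves performance
- Local 111 is kept as the last line of defense and is not retired
- The Gemini/Nemotron/Claude fallback chain is unchanged

## 2026-05-05 Field Correction

Measurements taken during the cold-start recovery showed:

- GCP-A / GCP-B are reachable through the 110 nginx proxy, but long prompts have returned 504.
- `/api/ps` reports `size_vram: 0` for GCP-A / GCP-B, so they must not be assumed to match the GPU/VRAM inference capability of 111.
- The synchronous alert path must use a fast-lane model such as `gemma3:4b`; 14B/32B models must move to async or to 111/GPU nodes.
- The public endpoints `34.143.170.20:11434` / `34.21.145.224:11434` are no longer regarded as the final security architecture.

From here on ADR-125 is authoritative: the WireGuard private mesh is the formal
network layer and the AwoooP Inference Gateway is the formal runtime layer.
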
187
docs/adr/ADR-125-gcp-ollama-private-mesh-inference-gateway.md
Normal file
@@ -0,0 +1,187 @@
# ADR-125: GCP Ollama Private Mesh and AwoooP Inference Gateway

**Status**: Accepted
**Date**: 2026-05-05 (Asia/Taipei)
**Decision Maker**: ogt / Codex
**Related**: ADR-110, ADR-111, ADR-113, ADR-121, ADR-124

---

## Context

ADR-110 moved Ollama priority from local-only 111 to a three-layer Ollama
topology backed by a paid fallback:

1. GCP-A
2. GCP-B
3. Local 111
4. Paid cloud fallback only after all Ollama lanes fail

The 2026-05-05 dirty-reboot recovery and alert incident exposed two gaps in the
initial ADR-110 implementation:

- The live transport is `K8s Pod -> 192.168.0.110 nginx -> GCP public IP`, not a
  true private network path.
- GCP-A and GCP-B reported `size_vram: 0` in `/api/ps`, so they are CPU-only from
  Ollama's perspective. Private networking improves reachability and security,
  but does not make these nodes equivalent to local 111 GPU/VRAM behavior.

The public nginx proxy is useful as a bootstrap bridge, but it must not become
the long-term primary transport for platform inference.

## Decision

Adopt a two-layer target architecture:

### D1 - WireGuard private mesh is the target transport

AwoooP uses a WireGuard site-to-site mesh for GCP Ollama access.

Planned mesh addressing (CIDR `10.77.114.0/24`):

| Host | Role | WireGuard IP |
|------|------|--------------|
| 110 | DevOps / transitional proxy / optional mesh router | `10.77.114.10` |
| 120 | K3s control-plane node | `10.77.114.120` |
| 121 | K3s control-plane node | `10.77.114.121` |
| 111 | Local Ollama fallback | `10.77.114.111` |
| GCP-A | Ollama primary | `10.77.114.21` |
| GCP-B | Ollama secondary | `10.77.114.22` |

Ollama endpoints after cutover:

| Tier | Endpoint |
|------|----------|
| Primary | `http://10.77.114.21:11434` |
| Secondary | `http://10.77.114.22:11434` |
| Fallback | `http://10.77.114.111:11434` |

The current `192.168.0.110:11435/11436` nginx proxy remains an emergency bridge
only until the mesh cutover passes shadow and canary gates.

### D2 - Public Ollama exposure is forbidden after cutover

After mesh cutover:

- GCP firewall must deny public `0.0.0.0/0 -> 11434`.
- Ollama should bind to the mesh interface or host firewall should allow
  `11434/tcp` only from `10.77.114.0/24`.
- K8s NetworkPolicy should allow egress only to the mesh IPs for Ollama.

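The D2 rules lend themselves to a mechanical pre-cutover check. The following is a minimal sketch, assuming the configured Ollama endpoints are available as plain URL strings; the helper name `non_mesh_endpoints` is hypothetical and is not part of the repository.

```python
import ipaddress
from urllib.parse import urlsplit

# Mesh CIDR taken from ADR-125; the helper itself is an illustrative assumption.
MESH_CIDR = ipaddress.ip_network("10.77.114.0/24")


def non_mesh_endpoints(urls):
    """Return the URLs whose host is not a literal IP inside the mesh CIDR."""
    offenders = []
    for url in urls:
        host = urlsplit(url).hostname
        try:
            inside = ipaddress.ip_address(host) in MESH_CIDR
        except (ValueError, TypeError):
            # Hostnames or missing hosts cannot be verified here, so flag them.
            inside = False
        if not inside:
            offenders.append(url)
    return offenders
```

An empty result means every endpoint already points at a mesh address; any bridge or public address is returned so an operator can see exactly what still needs to change.
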
### D3 - AwoooP Inference Gateway owns runtime routing

Provider clients should stop selecting raw Ollama hosts directly. They should
call an AwoooP Inference Gateway that owns:

- endpoint health and circuit breakers
- per-lane concurrency limits
- model residency and keep-alive policy
- request timeouts by intent
- token/cost audit spans
- fallback order: GCP-A -> GCP-B -> 111 -> paid provider

The gateway may initially expose an Ollama-compatible surface:

| Endpoint | Purpose |
|----------|---------|
| `/api/tags` | health/model inventory |
| `/api/ps` | residency inventory |
| `/api/generate` | Ollama-compatible generation |
| `/v1/awooop/inference/runs` | future async AwoooP run API |

Gateway requests must carry `project_id`, `trace_id`, and an intent/lane label
when called from AwoooP-aware code.

### D4 - Alert lane is protected

Alert diagnosis must not share an unconstrained queue with heavy code-review or
deep-RCA jobs.

Initial lanes:

| Lane | Model | Primary use | Default timeout |
|------|-------|-------------|-----------------|
| `alert-fast` | `gemma3:4b` | Telegram incident cards and low-risk RCA | 45s |
| `code-review` | `qwen2.5-coder:7b` | Gitea review | 90s |
| `embedding` | `bge-m3` | RAG embeddings | 30s |
| `deep-rca` | 111-hosted 14B-class model | slow human-reviewed diagnosis | async only |

No 14B/32B model may evict `alert-fast` residency on GCP-A/GCP-B unless the
gateway explicitly opens a maintenance window.

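The eviction rule above could be enforced with a small admission check. This is only a sketch under stated assumptions: the ADR does not say how the gateway recognizes a 14B/32B model, so the size is parsed heuristically from the tag after the colon, "14B/32B" is read as "14B parameters or larger", and the names `parameter_size_b` and `may_load` are hypothetical.

```python
import re

# Hosts whose `alert-fast` residency is protected (from ADR-125 D4).
PROTECTED_HOSTS = {"GCP-A", "GCP-B"}
# Assumption: "14B/32B" is read as "14B parameters or larger".
HEAVY_THRESHOLD_B = 14.0


def parameter_size_b(model):
    """Best-effort parse of the size tag, e.g. 'gemma3:4b' -> 4.0; None if absent."""
    match = re.search(r":(\d+(?:\.\d+)?)b\b", model.lower())
    return float(match.group(1)) if match else None


def may_load(model, host, maintenance_window=False):
    """Deny heavy models on protected hosts unless a maintenance window is open."""
    if host not in PROTECTED_HOSTS or maintenance_window:
        return True
    size = parameter_size_b(model)
    # Unknown sizes are allowed here; a stricter gateway might deny them instead.
    return size is None or size < HEAVY_THRESHOLD_B
```

Allowing models with no parsable size (such as `bge-m3`) keeps the embedding lane working; whether unknown sizes should instead be denied is a policy choice the ADR leaves open.
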
## Migration Plan

### Phase 0 - Current bridge

- Keep `192.168.0.110:11435` and `192.168.0.110:11436` active.
- Alert path uses `ALERT_OLLAMA_MODEL=gemma3:4b`.
- Gemini remains paid emergency fallback only.

### Phase 1 - Mesh build in parallel

- Install WireGuard on 110, 120, 121, 111, GCP-A, and GCP-B.
- Assign mesh IPs from `10.77.114.0/24`.
- Keep public proxy and old env values unchanged.
- Verify `/api/tags`, `/api/ps`, and `gemma3:4b` generation over mesh.

### Phase 2 - Shadow mesh

- Add shadow health checks from the API pod to mesh endpoints.
- Emit OTel spans with both `active_endpoint` and `shadow_endpoint`.
- Do not send production inference traffic to mesh yet.

Promotion gate:

- 24h continuous mesh health
- p95 `alert-fast` latency <= current proxy p95 + 10%
- zero public-path-only success events

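The latency condition in the Phase 2 gate can be evaluated from two lists of latency samples. The sketch below assumes a nearest-rank p95 and reads "+ 10%" as a multiplicative tolerance on the proxy p95; the ADR does not specify either choice, and the function names are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def passes_latency_gate(mesh_ms, proxy_ms, tolerance=0.10):
    """True when mesh p95 is within `tolerance` of the current proxy p95."""
    return percentile(mesh_ms, 95) <= percentile(proxy_ms, 95) * (1 + tolerance)
```

Both sample sets should cover the same window and the same `alert-fast` workload, otherwise the comparison says more about traffic mix than about the transport.
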
### Phase 3 - Switch active endpoints

Set production env:

```yaml
OLLAMA_URL: "http://10.77.114.21:11434"
OLLAMA_SECONDARY_URL: "http://10.77.114.22:11434"
OLLAMA_FALLBACK_URL: "http://10.77.114.111:11434"
```

Promotion gate:

- 7 days canary
- Gemini usage for alert lane is zero except documented all-Ollama outage
- no alert-card timeout regression

### Phase 4 - Close public exposure

- Remove or firewall public GCP `11434/tcp`.
- Keep nginx bridge config but disable listener or restrict to operator-only
  rollback.

## Rollback

Rollback is env-only while the bridge remains available:

```yaml
OLLAMA_URL: "http://192.168.0.110:11435"
OLLAMA_SECONDARY_URL: "http://192.168.0.110:11436"
OLLAMA_FALLBACK_URL: "http://192.168.0.111:11434"
```

If GCP-A/B are unstable, force 111-first temporarily:

```yaml
OLLAMA_URL: "http://192.168.0.111:11434"
OLLAMA_SECONDARY_URL: "http://192.168.0.110:11435"
OLLAMA_FALLBACK_URL: "http://192.168.0.110:11436"
```

Paid provider fallback must remain budget-gated.

## Consequences

- GCP Ollama becomes private-by-default instead of public-IP dependent.
- K8s NetworkPolicy can move from public `/32` rules to stable mesh `/32` rules.
- AwoooP can manage Ollama as a platform resource shared by all tenants.
- CPU-only GCP performance remains a capacity constraint; routing must keep
  heavy jobs off the alert lane or use GPU-capable GCP nodes.

@@ -1120,22 +1120,34 @@ AwoooP solution: every LLM call must emit the attributes above, into SignOz (188:

## 14. Impact of the GCP Ollama Topology on AwoooP (ADR-110 Integration)

### 14.1 New topology (ADR-110, effective 2026-05-03)
### 14.1 New topology (ADR-110 + ADR-125, amended 2026-05-05)

```
Primary : GCP-A http://34.143.170.20:11434 (SSD, 9x load speed)
Secondary: GCP-B http://34.21.145.224:11434 (SSD, standby)
Fallback : Local http://192.168.0.111:11434 (HDD, last line of defense)
Emergency: Gemini → Nemotron → Claude (when all Ollama lanes are down)
Phase 0 bridge:
Primary : GCP-A http://192.168.0.110:11435 (110 nginx → GCP public IP)
Secondary: GCP-B http://192.168.0.110:11436
Fallback : Local http://192.168.0.111:11434
Emergency: Gemini → Nemotron → Claude (when all Ollama lanes are down, budget gated)

Target private mesh:
Primary : GCP-A http://10.77.114.21:11434
Secondary: GCP-B http://10.77.114.22:11434
Fallback : Local http://10.77.114.111:11434
```

ADR-125 amends the transport layer of ADR-110: the public GCP IPs / 110 nginx proxy are kept only as a
transitional and rollback bridge. The formal path is the WireGuard private mesh; runtime routing is
managed by the AwoooP Inference Gateway.

### 14.2 Impact items AwoooP must handle

| Impact item | Location | Handling | Phase |
|--------|------|---------|-------|
| `ollama:current_primary` Redis key dual-write (supports only 1 URL; 3 tiers are now needed) | INV-1 | Change to `platform:ollama:topology` (JSON: primary/secondary/fallback) | Phase 2 |
| Second definition at `ollama_auto_recovery.py:230` (P0-11) | ollama_auto_recovery.py | Remove it; read from config only | Phase 2 PR-03 |
| GCP IPs enter INV-4 (34.143.170.20, 34.21.145.224) | INV-4 | Add to the allowed IP list; confirm K8s NetworkPolicy egress is configured | Phase 0 INV-4 |
| GCP public IPs enter INV-4 (34.143.170.20, 34.21.145.224) | INV-4 | Mark as transitional only; formally switch to the `10.77.114.21/22` mesh IPs | Phase 0 INV-4 |
| WireGuard mesh | ADR-125 / runbook | Build the `10.77.114.0/24` private transport; close public 11434 | Phase 2 prerequisite |
| AwoooP Inference Gateway | ADR-125 / runbook | Isolate the alert-fast / code-review / embedding / deep-rca lanes so heavy models cannot crowd the alert lane | Phase 4 |
| EwoooC Provider Proxy routes through GCP Ollama | Phase 6 | EwoooC shares the platform Ollama topology (platform_resource) | Phase 6 |
| `telemetry.py:71` IP assert (P0-08) | telemetry.py:71 | Once removed, GCP IPs no longer trigger the assert; make it config-driven | Phase 2 PR-01 |
| budget_ledger records Ollama usage (free GCP still needs token counts) | Phase 4 | Ollama calls must also record token consumption (budget_ledger) | Phase 4 |

@@ -1143,11 +1155,24 @@ Emergency: Gemini → Nemotron → Claude (when all Ollama lanes are down)

### 14.3 GCP Ollama is a platform_resource (ADR-111)

GCP Ollama (34.143.170.20, 34.21.145.224) and Local Ollama (192.168.0.111) are always declared as `platform_resource`:
GCP Ollama (bridge: 34.143.170.20 / 34.21.145.224; target mesh:
10.77.114.21 / 10.77.114.22) and Local Ollama (192.168.0.111 / target
10.77.114.111) are always declared as `platform_resource`:
- They belong to no tenant
- All tenants (AWOOOI / EwoooC / Tsenyang / Bitan) share them, but audit records carry each caller's own project_id
- The `platform:ollama:topology` Redis key uses the `platform:` prefix (not `{project_id}:`)

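As an illustration of the key layout described in 14.3, the sketch below serializes the three tiers into the JSON value for `platform:ollama:topology` and shows how a tenant-prefix check would treat that key. The field names `primary` / `secondary` / `fallback` come from the 14.2 table; the helper names are hypothetical, and no Redis client is involved.

```python
import json

# Key name from section 14.3; it deliberately lacks a `{project_id}:` prefix.
TOPOLOGY_KEY = "platform:ollama:topology"


def encode_topology(primary, secondary, fallback):
    """Serialize the three tiers as the JSON value stored under TOPOLOGY_KEY."""
    return json.dumps(
        {"primary": primary, "secondary": secondary, "fallback": fallback},
        sort_keys=True,
    )


def is_tenant_scoped(key, project_id):
    """True when a key lives under a tenant prefix such as 'awoooi:'."""
    return key.startswith(f"{project_id}:")
```

A tenant-boundary guard built on `is_tenant_scoped` would reject the topology key for every tenant, which is why such keys need an explicit `platform_resource` exception plus an audit record.
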
### 14.4 Measured limits (2026-05-05)

Measurements from `scripts/ops/ollama-topology-check.sh`:

- GCP-A `gemma3:4b` takes about 2s, but `size_vram=0`
- GCP-B `gemma3:4b` takes about 8.5s, but `size_vram=0`
- 111 fallback `gemma3:4b` takes about 4.9s, `size_vram=8210446336`

Conclusion: GCP-A/B can serve as the synchronous `alert-fast` lane, but must not carry synchronous 14B/32B alert diagnosis.
Heavy models must be diverted by the Inference Gateway to async / 111 / GPU nodes.

---

## 15. Work Sequencing Overview (with parallel groups + critical path)

@@ -135,7 +135,7 @@ ADR-106 also needs an added section: **Strangler Fig Quantified Gates**, turning shadow →
3. **Redis working memory project boundary** (#15):
   - `SCAN incident:*` at `incident_service.py:603` → `SCAN {project_id}:incident:*`
   - Every `SCAN`/`KEYS` must carry a prefix
4. **`platform_resource` exception list**: Ollama failover state, global rate limit, leader election lock, and similar items are explicitly marked
4. **`platform_resource` exception list**: Ollama failover state, global rate limit, leader election lock, and similar items are explicitly marked; per ADR-125 the formal GCP Ollama path becomes the WireGuard mesh + AwoooP Inference Gateway, and the 110 nginx proxy is kept only as a transitional / rollback bridge
5. **Regression tests**: cross-project read/write must be rejected; platform_resource must be allowed but audited
6. **AWOOOI Bootstrap Paradox fix** (per ADR-111, INV-3):
   - Entrypoints marked `platform_internal` carry `project_id=__platform__`; they are exempt from hard reject but still write audit

153
docs/runbooks/AWOOOP-INFERENCE-GATEWAY.md
Normal file
@@ -0,0 +1,153 @@
# AwoooP Inference Gateway Runbook

> Runtime design for keeping GCP-A, GCP-B, 111, and paid providers under one
> controlled inference lane.

---

## Goal

Stop individual services from calling raw model hosts independently.

The gateway becomes the single platform path for:

- endpoint selection
- model lane assignment
- queue and concurrency control
- fallback
- cost and token audit
- trace correlation

## Why This Is Needed

Direct provider calls caused the 2026-05-05 alert issue:

- alert diagnosis wanted a fast response
- GCP-A/B were loaded with heavyweight models
- the request timed out through GCP-A and GCP-B
- Gemini fallback generated cost

Private networking alone cannot prevent model eviction or queue contention. The
gateway must own runtime scheduling.

## Required Lanes

| Lane | Model | Allowed hosts | Notes |
|------|-------|---------------|-------|
| `alert-fast` | `gemma3:4b` | GCP-A, GCP-B, 111 | Synchronous, protected |
| `code-review` | `qwen2.5-coder:7b` | GCP-B, 111 | Never 32B on GCP during alert canary |
| `embedding` | `bge-m3` | GCP-A, GCP-B, 111 | Short timeout |
| `deep-rca` | 14B-class model | 111 or GPU node | Async only |
| `paid-emergency` | Gemini / Claude | Cloud | Budget-gated emergency fallback |

## v0 API

The gateway should initially provide an Ollama-compatible API to minimize caller
changes:

```http
POST /api/generate
GET /api/tags
GET /api/ps
```

Required headers for AwoooP-aware calls:

```http
X-AwoooP-Project-ID: awoooi
X-AwoooP-Trace-ID: <w3c-trace-id>
X-AwoooP-Lane: alert-fast
X-AwoooP-Intent: DIAGNOSE
```

Legacy callers may be accepted in shadow mode, but must be assigned
`project_id=awoooi` by bootstrap rules from ADR-111.

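A caller-side request against the v0 surface might look like the sketch below. It only builds the request object and does not send it. The gateway base URL is a placeholder, and the function name `build_generate_request` is hypothetical; the header names and the `/api/generate` path are the ones listed above.

```python
import json
import urllib.request


def build_generate_request(gateway_url, project_id, trace_id, lane, intent, model, prompt):
    """Build (but do not send) a POST /api/generate request with AwoooP headers."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    headers = {
        "Content-Type": "application/json",
        "X-AwoooP-Project-ID": project_id,
        "X-AwoooP-Trace-ID": trace_id,
        "X-AwoooP-Lane": lane,
        "X-AwoooP-Intent": intent,
    }
    return urllib.request.Request(
        gateway_url.rstrip("/") + "/api/generate",
        data=body,
        headers=headers,
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen(request, timeout=...)` would send it; the timeout should then match the lane's budget rather than a library default.
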
## Scheduling Rules

- `alert-fast` concurrency is reserved and cannot be borrowed by other lanes.
- `alert-fast` keeps `gemma3:4b` warm on both GCP-A and GCP-B.
- 14B/32B models are denied on GCP-A/B unless an operator opens maintenance.
- Per-host circuit breaker opens after 2 consecutive timeout failures.
- Paid provider fallback requires:
  - all Ollama endpoints failed or are circuit-open
  - budget hard kill not triggered
  - audit span records fallback reason

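The breaker and paid-fallback rules above can be sketched as follows. This is an illustration under assumptions rather than the gateway's implementation: the runbook does not describe how a breaker closes again, so here any success simply resets the counter, and the names `HostBreaker` and `paid_fallback_allowed` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HostBreaker:
    """Opens after `threshold` consecutive timeouts; any success resets the count."""

    threshold: int = 2
    consecutive_timeouts: int = 0

    @property
    def is_open(self):
        return self.consecutive_timeouts >= self.threshold

    def record_timeout(self):
        self.consecutive_timeouts += 1

    def record_success(self):
        self.consecutive_timeouts = 0


def paid_fallback_allowed(breakers, failed_hosts, budget_hard_kill):
    """Allow paid fallback only if every Ollama host failed or is circuit-open."""
    all_unavailable = bool(breakers) and all(
        breaker.is_open or host in failed_hosts for host, breaker in breakers.items()
    )
    return all_unavailable and not budget_hard_kill
```

The third requirement, recording the fallback reason in the audit span, is left to the caller because it depends on the tracing setup.
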
## Minimal Routing Algorithm

```text
input: lane, model, project_id, trace_id

if lane == alert-fast:
    model = gemma3:4b
    try GCP-A with 45s timeout
    try GCP-B with 45s timeout
    try 111 with 60s timeout
    if allowed by budget: try paid emergency fallback

if lane == code-review:
    model = qwen2.5-coder:7b
    try GCP-B with 90s timeout
    try 111 with 120s timeout

if lane == deep-rca:
    reject synchronous request
    create async run
```

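The pseudocode above translates fairly directly into Python. The sketch below is one possible shape, assuming the caller supplies a `call_host(host, model, timeout_s)` function that raises `TimeoutError` on timeout and, when the budget allows it, a zero-argument `paid_fallback` callable; both of those, along with `LANE_PLANS`, `AsyncOnlyLane`, and `route`, are hypothetical names.

```python
# Lane plans copied from the pseudocode: (model, [(host, timeout_seconds), ...]).
LANE_PLANS = {
    "alert-fast": ("gemma3:4b", [("GCP-A", 45), ("GCP-B", 45), ("111", 60)]),
    "code-review": ("qwen2.5-coder:7b", [("GCP-B", 90), ("111", 120)]),
}


class AsyncOnlyLane(Exception):
    """Raised when a lane must be served through an async run instead."""


def route(lane, call_host, paid_fallback=None):
    """Try each host in order; `call_host(host, model, timeout_s)` is caller-supplied."""
    if lane == "deep-rca":
        raise AsyncOnlyLane("deep-rca is async only; create an async run")
    model, attempts = LANE_PLANS[lane]
    for host, timeout_s in attempts:
        try:
            return call_host(host, model, timeout_s)
        except TimeoutError:
            continue  # move on to the next tier
    # Only alert-fast may reach the paid tier, and only when the caller passes
    # a budget-approved fallback callable.
    if lane == "alert-fast" and paid_fallback is not None:
        return paid_fallback()
    raise RuntimeError(f"all hosts failed for lane {lane}")
```

Keeping the plan in a table rather than in branches makes it easier to add the `embedding` lane later, which the pseudocode does not yet cover.
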
## Metrics and Logs

Every request must emit:

- `awooop.project_id`
- `awooop.lane`
- `awooop.provider_tier`
- `awooop.endpoint`
- `gen_ai.request.model`
- `gen_ai.usage.input_tokens`
- `gen_ai.usage.output_tokens`
- `awooop.fallback_reason`
- `awooop.cost_usd`

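One way to keep this list enforceable is a small completeness check run in tests or at span-emit time. The sketch below assumes span attributes are available as a plain mapping; `missing_attributes` is a hypothetical helper and does not depend on any tracing library.

```python
# Attribute names copied from the list above.
REQUIRED_ATTRIBUTES = (
    "awooop.project_id",
    "awooop.lane",
    "awooop.provider_tier",
    "awooop.endpoint",
    "gen_ai.request.model",
    "gen_ai.usage.input_tokens",
    "gen_ai.usage.output_tokens",
    "awooop.fallback_reason",
    "awooop.cost_usd",
)


def missing_attributes(attributes):
    """Return the required attribute names absent from a span attribute mapping."""
    return [name for name in REQUIRED_ATTRIBUTES if name not in attributes]
```

Requests served without any fallback would still need an `awooop.fallback_reason` entry under this check; which placeholder value to use for that case is left open here.
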
## Implementation Stages

### Stage 1 - Sidecar health view

- Keep existing providers.
- Add health and residency checks to identify which lane is safe.
- No traffic proxying yet.

### Stage 2 - Gateway in shadow

- Mirror inference requests to the gateway.
- Gateway computes routing decision but does not execute.
- Compare selected endpoint/model against legacy path.

### Stage 3 - Alert lane active

- Route only `alert-fast` through the gateway.
- Keep code review and deep RCA on legacy providers.

### Stage 4 - All Ollama traffic active

- Move code review, embedding, and deep RCA to the gateway.
- Enforce lane-based deny rules.

### Stage 5 - AwoooP runtime integration

- Convert gateway decisions into `run_state` and `step_journal` entries.
- Use AwoooP budget ledger as source of truth.

## Rollback

Set provider env back to raw endpoints:

```yaml
OLLAMA_URL: "http://192.168.0.110:11435"
OLLAMA_SECONDARY_URL: "http://192.168.0.110:11436"
OLLAMA_FALLBACK_URL: "http://192.168.0.111:11434"
```

Do not disable budget hard kill during rollback.

@@ -1,6 +1,10 @@
# GCP Ollama Nginx Proxy Deployment Guide

> ADR-110 three-tier disaster recovery: the key steps for enabling GCP Ollama
>
> 2026-05-05 amendment: this runbook is kept only as a transitional / rollback bridge. The formal solution is
> the WireGuard private mesh and AwoooP Inference Gateway from ADR-125. New deployments must not leave
> GCP `11434/tcp` open to `0.0.0.0/0` long-term.

---

@@ -173,7 +177,10 @@ kubectl describe networkpolicy -n awoooi-prod allow-required-egress
curl -v http://34.143.170.20:11434/api/tags
```

If this fails, check whether the GCP firewall rule opens 0.0.0.0/0:11434.
If this fails, only briefly confirm whether the GCP firewall opens `11434/tcp`
to the fixed egress IP of 110. `0.0.0.0/0:11434` must not be treated as a permanent setting.

For the formal cutover, follow [GCP-OLLAMA-WIREGUARD-MESH.md](GCP-OLLAMA-WIREGUARD-MESH.md) instead.

### 3. Model loads but inference fails

@@ -189,9 +196,12 @@ curl -v http://34.143.170.20:11434/api/tags
## Related Documents

- ADR-110: GCP three-tier disaster recovery architecture
- ADR-125: GCP Ollama Private Mesh and AwoooP Inference Gateway
- `k8s/awoooi-prod/04-configmap.yaml`
- `k8s/awoooi-prod/02-network-policy.yaml`
- `docs/runbooks/RUNBOOK-OLLAMA-FAILOVER.md`
- `docs/runbooks/GCP-OLLAMA-WIREGUARD-MESH.md`
- `docs/runbooks/AWOOOP-INFERENCE-GATEWAY.md`

---

280
docs/runbooks/GCP-OLLAMA-WIREGUARD-MESH.md
Normal file
@@ -0,0 +1,280 @@
# GCP Ollama WireGuard Mesh Runbook

> Target state for ADR-125. This replaces the public GCP Ollama proxy as the
> primary path after shadow and canary validation.

---

## Scope

This runbook builds private Ollama connectivity between AWOOOI K3s and the GCP
Ollama hosts.

It does not replace AwoooP Inference Gateway work. The mesh solves transport and
security. The gateway solves routing, queueing, model residency, and fallback.

## Current State

Current production endpoints:

| Variable | Endpoint | Meaning |
|----------|----------|---------|
| `OLLAMA_URL` | `http://192.168.0.110:11435` | GCP-A through 110 nginx |
| `OLLAMA_SECONDARY_URL` | `http://192.168.0.110:11436` | GCP-B through 110 nginx |
| `OLLAMA_FALLBACK_URL` | `http://192.168.0.111:11434` | Local 111 |

This is a bridge. Do not treat the public proxy as the final architecture.

## Target State

| Host | WireGuard IP | Notes |
|------|--------------|-------|
| 110 | `10.77.114.10` | DevOps host and rollback bridge |
| 120 | `10.77.114.120` | K3s node |
| 121 | `10.77.114.121` | K3s node |
| 111 | `10.77.114.111` | Local Ollama fallback |
| GCP-A | `10.77.114.21` | Primary Ollama |
| GCP-B | `10.77.114.22` | Secondary Ollama |

Production endpoints after cutover:

```yaml
OLLAMA_URL: "http://10.77.114.21:11434"
OLLAMA_SECONDARY_URL: "http://10.77.114.22:11434"
OLLAMA_FALLBACK_URL: "http://10.77.114.111:11434"
```

## Prerequisites

- SSH access to GCP-A and GCP-B.
- GCP IAM permissions for firewall rules if OS firewall alone is not enough.
- SSH access to 110, 111, 120, and 121.
- A secured place to store WireGuard private keys. Never commit private keys.
- Confirm the GCP hosts have enough CPU/RAM for `gemma3:4b`.

## Key Rules

- Private keys are generated on each host and never copied into Git.
- Public keys may be recorded in the operator handoff note.
- Public GCP `11434/tcp` must be closed after cutover.
- `alert-fast` uses `gemma3:4b`; 14B/32B models must not run on GCP-A/B during
  alert-lane canary.

## Install WireGuard

Ubuntu/Debian:

```bash
sudo apt-get update
sudo apt-get install -y wireguard
```

Alpine:

```bash
sudo apk add --no-cache wireguard-tools
```

Generate keys on every host:

```bash
umask 077
# Redirect tee's stdout so the private key is never echoed to the terminal.
wg genkey | sudo tee /etc/wireguard/awooop.key > /dev/null
sudo chmod 600 /etc/wireguard/awooop.key
# The public key is safe to display and record in the handoff note.
sudo cat /etc/wireguard/awooop.key | wg pubkey | sudo tee /etc/wireguard/awooop.pub
```

## Configure Peers

Create `/etc/wireguard/wg-awooop.conf` on each host.

Example for GCP-A:

```ini
[Interface]
Address = 10.77.114.21/32
ListenPort = 51820
PrivateKey = <GCP_A_PRIVATE_KEY>

[Peer]
# 120 K3s node
PublicKey = <K3S_120_PUBLIC_KEY>
AllowedIPs = 10.77.114.120/32
Endpoint = <120_REACHABLE_ENDPOINT>:51820
PersistentKeepalive = 25

[Peer]
# 121 K3s node
PublicKey = <K3S_121_PUBLIC_KEY>
AllowedIPs = 10.77.114.121/32
Endpoint = <121_REACHABLE_ENDPOINT>:51820
PersistentKeepalive = 25

[Peer]
# 110 DevOps rollback bridge
PublicKey = <HOST_110_PUBLIC_KEY>
AllowedIPs = 10.77.114.10/32
Endpoint = <110_REACHABLE_ENDPOINT>:51820
PersistentKeepalive = 25
```

Example for a K3s node:

```ini
[Interface]
Address = 10.77.114.120/32
ListenPort = 51820
PrivateKey = <K3S_120_PRIVATE_KEY>

[Peer]
# GCP-A
PublicKey = <GCP_A_PUBLIC_KEY>
AllowedIPs = 10.77.114.21/32
Endpoint = 34.143.170.20:51820
PersistentKeepalive = 25

[Peer]
# GCP-B
PublicKey = <GCP_B_PUBLIC_KEY>
AllowedIPs = 10.77.114.22/32
Endpoint = 34.21.145.224:51820
PersistentKeepalive = 25

[Peer]
# Local 111
PublicKey = <HOST_111_PUBLIC_KEY>
AllowedIPs = 10.77.114.111/32
Endpoint = 192.168.0.111:51820
PersistentKeepalive = 25
```

The exact peer list depends on reachable endpoints. If inbound access to 120/121
is not available, use 110 as a temporary mesh relay, then replace it with direct
K3s-to-GCP peers when routing is confirmed.

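Because the peer list differs per host, the stanzas can be generated rather than hand-edited. The sketch below renders one `[Peer]` block in the same shape as the examples above; `render_peer` is a hypothetical helper, and the public key is passed through as an opaque placeholder string. The `Endpoint` line is optional in WireGuard, so it is omitted for peers that cannot be reached inbound.

```python
def render_peer(comment, public_key, mesh_ip, endpoint=None, keepalive=25):
    """Render one [Peer] stanza; `endpoint` is a host or IP without the port."""
    lines = [
        "[Peer]",
        f"# {comment}",
        f"PublicKey = {public_key}",
        f"AllowedIPs = {mesh_ip}/32",
    ]
    if endpoint:
        # Port 51820 matches the ListenPort used throughout this runbook.
        lines.append(f"Endpoint = {endpoint}:51820")
    lines.append(f"PersistentKeepalive = {keepalive}")
    return "\n".join(lines)
```

For example, `render_peer("GCP-A", "<GCP_A_PUBLIC_KEY>", "10.77.114.21", "34.143.170.20")` reproduces the first peer of the K3s node example.
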
## Start WireGuard

```bash
sudo systemctl enable --now wg-quick@wg-awooop
sudo wg show wg-awooop
```

Verify connectivity:

```bash
ping -c 3 10.77.114.21
ping -c 3 10.77.114.22
curl -fsS http://10.77.114.21:11434/api/tags
curl -fsS http://10.77.114.22:11434/api/tags
```

## Bind or Firewall Ollama

Preferred: bind Ollama to the mesh interface.

```bash
sudo systemctl edit ollama
```

```ini
[Service]
Environment="OLLAMA_HOST=10.77.114.21:11434"
```

Use `10.77.114.22:11434` on GCP-B.

If binding is not possible, firewall the host:

```bash
sudo ufw allow from 10.77.114.0/24 to any port 11434 proto tcp
sudo ufw deny 11434/tcp
```

Then reload systemd and restart Ollama so the change takes effect:

```bash
sudo systemctl daemon-reload
sudo systemctl restart ollama
```

## K8s NetworkPolicy

After mesh cutover, allow only mesh endpoints for Ollama:

```yaml
- to:
    - ipBlock:
        cidr: 10.77.114.21/32
    - ipBlock:
        cidr: 10.77.114.22/32
    - ipBlock:
        cidr: 10.77.114.111/32
  ports:
    - protocol: TCP
      port: 11434
```

Do not remove the `192.168.0.110:11435/11436` rules until rollback is no longer
needed.

## Shadow Validation

From the API pod:

```bash
bash scripts/ops/ollama-topology-check.sh
```

Expected:

- GCP-A `/api/tags` returns 200.
- GCP-B `/api/tags` returns 200.
- `gemma3:4b` generation succeeds on both nodes.
- `/api/ps` contains `gemma3:4b`.
- If `size_vram=0`, keep GCP-A/B on `alert-fast` only and route heavy models to
  111 or a GPU-capable node.

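The last two expectations can be checked programmatically from a parsed `/api/ps` response. The sketch below assumes the response has the shape the topology check script reads (a `models` list whose items carry `model` and `size_vram`); `residency_report` is a hypothetical helper name.

```python
def residency_report(ps_payload, model):
    """Summarize whether `model` is resident and whether it reports any VRAM."""
    for item in ps_payload.get("models", []):
        if item.get("model") == model:
            vram = item.get("size_vram") or 0
            return {"resident": True, "cpu_only": vram == 0, "size_vram": vram}
    # Model not loaded on this host: residency and VRAM are both unknown.
    return {"resident": False, "cpu_only": None, "size_vram": None}
```

A host reporting `cpu_only=True` for `gemma3:4b` is still acceptable for `alert-fast`, but it is a signal to keep heavier models away from it.
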
## Cutover

Patch deployment env after shadow passes:

```bash
kubectl -n awoooi-prod set env deploy/awoooi-api \
  OLLAMA_URL=http://10.77.114.21:11434 \
  OLLAMA_SECONDARY_URL=http://10.77.114.22:11434 \
  OLLAMA_FALLBACK_URL=http://10.77.114.111:11434

kubectl -n awoooi-prod set env deploy/awoooi-worker \
  OLLAMA_URL=http://10.77.114.21:11434 \
  OLLAMA_SECONDARY_URL=http://10.77.114.22:11434 \
  OLLAMA_FALLBACK_URL=http://10.77.114.111:11434
```

Verify:

```bash
kubectl -n awoooi-prod rollout status deploy/awoooi-api --timeout=180s
kubectl -n awoooi-prod rollout status deploy/awoooi-worker --timeout=180s
bash scripts/ops/ollama-topology-check.sh
```

## Rollback

```bash
kubectl -n awoooi-prod set env deploy/awoooi-api \
  OLLAMA_URL=http://192.168.0.110:11435 \
  OLLAMA_SECONDARY_URL=http://192.168.0.110:11436 \
  OLLAMA_FALLBACK_URL=http://192.168.0.111:11434

kubectl -n awoooi-prod set env deploy/awoooi-worker \
  OLLAMA_URL=http://192.168.0.110:11435 \
  OLLAMA_SECONDARY_URL=http://192.168.0.110:11436 \
  OLLAMA_FALLBACK_URL=http://192.168.0.111:11434
```

## Done Criteria

- Mesh endpoints pass 7 days of canary.
- Alert lane Gemini usage is zero except documented all-Ollama outages.
- Public GCP `11434/tcp` is closed.
- Operator runbook records peer public keys and rollback owner.

88
scripts/ops/ollama-topology-check.sh
Executable file
@@ -0,0 +1,88 @@
#!/usr/bin/env bash
set -euo pipefail

NAMESPACE="${NAMESPACE:-awoooi-prod}"
DEPLOYMENT="${DEPLOYMENT:-awoooi-api}"
MODEL="${MODEL:-gemma3:4b}"
TIMEOUT_SECONDS="${TIMEOUT_SECONDS:-60}"

kubectl -n "${NAMESPACE}" exec -i "deploy/${DEPLOYMENT}" -- \
  env CHECK_MODEL="${MODEL}" CHECK_TIMEOUT_SECONDS="${TIMEOUT_SECONDS}" python - <<'PY'
import json
import os
import time
import urllib.error
import urllib.request

model = os.environ["CHECK_MODEL"]
timeout = int(os.environ["CHECK_TIMEOUT_SECONDS"])

endpoints = [
    ("primary", os.environ.get("OLLAMA_URL", "")),
    ("secondary", os.environ.get("OLLAMA_SECONDARY_URL", "")),
    ("fallback", os.environ.get("OLLAMA_FALLBACK_URL", "")),
]

print(f"model={model} timeout={timeout}s")

def request_json(url: str, path: str, payload=None, timeout_seconds=10):
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode()
        headers["Content-Type"] = "application/json"
    req = urllib.request.Request(url.rstrip("/") + path, data=data, headers=headers)
    with urllib.request.urlopen(req, timeout=timeout_seconds) as response:
        return json.loads(response.read().decode())

for label, url in endpoints:
    print(f"\n== {label}: {url or '<missing>'} ==")
    if not url:
        print("status=missing")
        continue

    try:
        tags = request_json(url, "/api/tags", timeout_seconds=10)
        names = sorted(m.get("name", "") for m in tags.get("models", []))
        print("tags=ok", ",".join(names[:12]))
    except Exception as exc:
        print("tags=fail", type(exc).__name__, str(exc)[:160])
        continue

    try:
        ps = request_json(url, "/api/ps", timeout_seconds=10)
        live = ps.get("models", [])
        if not live:
            print("ps=ok live_models=<none>")
        for item in live:
            print(
                "ps=ok",
                f"model={item.get('model')}",
                f"expires={item.get('expires_at')}",
                f"size_vram={item.get('size_vram')}",
                f"context={item.get('context_length')}",
            )
            if item.get("size_vram") == 0:
                print("warning=cpu_only_or_no_vram")
    except Exception as exc:
        print("ps=fail", type(exc).__name__, str(exc)[:160])

    payload = {
        "model": model,
        "prompt": "用繁體中文用一行回答:Ollama health check",
        "stream": False,
        "keep_alive": "8h",
        "options": {"num_predict": 32, "temperature": 0.1},
    }
    start = time.time()
    try:
        result = request_json(url, "/api/generate", payload, timeout_seconds=timeout)
        latency_ms = round((time.time() - start) * 1000)
        response = (result.get("response") or "").replace("\n", " ")[:120]
        print(f"generate=ok latency_ms={latency_ms} response={response}")
    except urllib.error.HTTPError as exc:
        body = exc.read().decode(errors="replace")[:200]
        print("generate=fail", "HTTPError", exc.code, body)
    except Exception as exc:
        print("generate=fail", type(exc).__name__, str(exc)[:200])
PY