fix(watchdog): fix duplicate Trust Drift alerts + set up GCP Ollama nginx proxy

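- ai_slo_watchdog_job: switch to the pure-statistics trust_drift_detector lib
  so it no longer triggers Telegram twice alongside governance_agent's hourly self-check

- infra/ansible: set up an nginx proxy on 110 that forwards to GCP-A/B
  port 11435 -> 34.143.170.20:11434 (GCP-A)
  port 11436 -> 34.21.145.224:11434 (GCP-B)

- docs/runbooks: DEPLOY-GCP-OLLAMA-PROXY.md, the full deployment guide
- ops/nginx: manual deployment script to run directly on 110

Prerequisite for enabling ADR-110 three-tier disaster recovery: deploy the proxy first, then change the ConfigMap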
Your Name
2026-05-04 23:12:35 +08:00
parent a1b61289f5
commit ec013f662d
5 changed files with 446 additions and 14 deletions


@@ -159,21 +159,17 @@ async def _check_once() -> None:
    # W-6: Trust Drift detection (Playbook trust-score drift)
    # 2026-05-02 ogt + Claude Sonnet 4.6 (APAC): consolidate the dual-write path
    # Previous behavior: called trust_drift_detector.run(), which wrote event_type=trust_drift straight to PG
    # governance_agent.check_trust_drift() also wrote the same event_type every 1h → dual write
    # Consolidation: call governance_agent.check_trust_drift() as the single source of truth
    # The W-6 watchdog still runs the sensor every 15 minutes; the violations count drives the meta-alert
    # PG writes go through governance_agent._alert() only, avoiding dual writes
    # 2026-05-04 ogt + Claude: fix duplicate alerts by using trust_drift_detector for statistics only
    # Background: calling governance_agent.check_trust_drift() triggers a Telegram alert
    # governance_agent.run_self_check() also calls the same method every 1h → double Telegram
    # Fix: the watchdog only reads the statistics and never triggers Telegram; alerting is owned solely by governance_agent
    try:
        from src.services.governance_agent import get_governance_agent
        trust_result = await get_governance_agent().check_trust_drift()
        if trust_result.get("drifted", 0) > 0:
            drifted = trust_result["drifted"]
            auto_deprecated = trust_result.get("auto_deprecated", 0)
            kept = trust_result.get("kept", 0)
        from src.services.trust_drift_detector import get_trust_drift_detector
        dist = await get_trust_drift_detector().detect()
        if dist.drift_detected:
            violations.append(
                f"Trust Drift detected {drifted} low-trust Playbook(s), "
                f"auto-deprecated: {auto_deprecated}, pending manual review: {kept}"
                f"Trust Drift detected {dist.low_count} low-trust Playbook(s), "
                f"low_ratio: {dist.low_ratio:.1%}, mean_trust: {dist.mean_trust:.2f}"
            )
    except Exception as e:
        logger.warning("watchdog_w6_trust_drift_check_failed", error=str(e))


@@ -0,0 +1,201 @@
# GCP Ollama Nginx Proxy Deployment Guide
> ADR-110 three-tier disaster recovery: the key step for enabling GCP Ollama
---
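## Background
GCP Ollama (34.143.170.20 / 34.21.145.224) is already deployed, but workloads inside the K3s cluster cannot reach the GCP public IPs directly.
An nginx reverse proxy on 192.168.0.110 (the DevOps vault host) lets K3s Pods reach GCP Ollama over the internal network.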
---
## Deployment files
| File | Purpose |
|-----|------|
| `infra/ansible/roles/nginx/templates/110-ollama-proxy.conf.j2` | nginx config template |
| `infra/ansible/playbooks/nginx-sync.yml` | Ansible playbook |
---
## Run the deployment
```bash
# 1. Change to the Ansible directory
cd /Users/ogt/awoooi/infra/ansible
# 2. Deploy to 110 (dry run first to validate)
ansible-playbook -i inventory/hosts.yml playbooks/nginx-sync.yml --tags 110 --check
# 3. Deploy for real
ansible-playbook -i inventory/hosts.yml playbooks/nginx-sync.yml --tags 110
```
---
## Verify the deployment
### From 110 itself
```bash
# Test the GCP-A proxy
curl http://127.0.0.1:11435/api/tags
# Test the GCP-B proxy
curl http://127.0.0.1:11436/api/tags
```
### From a K3s node
```bash
# Log in to a K3s node (120 or 121)
ssh wooo@192.168.0.120
# Test connectivity to the 110 proxy
curl http://192.168.0.110:11435/api/tags
curl http://192.168.0.110:11436/api/tags
```
### From a K8s Pod
```bash
# Open a shell in the API Pod
kubectl exec -it -n awoooi-prod deployment/awoooi-api -- bash
# Install curl if the image lacks it, then test connectivity
apt-get update && apt-get install -y curl
curl http://192.168.0.110:11435/api/tags
```
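
Where the API image ships Python but not curl, the same check can be run without installing packages. The following is a minimal sketch, assuming the proxy addresses used in this guide and that `/api/tags` returns JSON with a `models` array of objects carrying a `name` field. The helper names `parse_model_names`, `probe` and `report` are illustrative and not part of the repository; calling `report()` from a Python shell inside the Pod prints one status line per proxy.

```python
import json
import urllib.error
import urllib.request
from typing import List, Optional

# Proxy endpoints from this guide (110 forwards to GCP-A / GCP-B).
PROXIES = {
    "GCP-A": "http://192.168.0.110:11435",
    "GCP-B": "http://192.168.0.110:11436",
}


def parse_model_names(body: bytes) -> List[str]:
    """Extract model names from an /api/tags response body."""
    payload = json.loads(body)
    return [model["name"] for model in payload.get("models", [])]


def probe(base_url: str, timeout: float = 5.0) -> Optional[List[str]]:
    """Return the model names served behind base_url, or None if the check fails."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return parse_model_names(resp.read())
    except (urllib.error.URLError, OSError, ValueError, KeyError):
        # Unreachable host, timeout, or a body that is not the expected JSON.
        return None


def report() -> None:
    """Print one status line per proxy (hypothetical convenience wrapper)."""
    for label, url in PROXIES.items():
        models = probe(url)
        status = "unreachable" if models is None else f"{len(models)} models"
        print(f"{label} ({url}): {status}")
```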
---
## Enable GCP Ollama
Once the proxy is deployed, edit the ConfigMap to enable the GCP endpoints:
```bash
# Edit the ConfigMap
kubectl edit configmap -n awoooi-prod awoooi-config
```
Change the following fields:
```yaml
# Before
OLLAMA_URL: "http://192.168.0.111:11434"
OLLAMA_SECONDARY_URL: "http://192.168.0.110:11435"
OLLAMA_FALLBACK_URL: "http://192.168.0.110:11436"
# After (GCP-A becomes the primary)
OLLAMA_URL: "http://192.168.0.110:11435" # GCP-A via proxy
OLLAMA_SECONDARY_URL: "http://192.168.0.110:11436" # GCP-B via proxy
OLLAMA_FALLBACK_URL: "http://192.168.0.111:11434" # Local GPU, last line of defense
```
Restart the Deployment:
```bash
kubectl rollout restart deployment/awoooi-api -n awoooi-prod
```
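
The three ConfigMap keys form an ordered chain: primary, secondary, fallback. How awoooi-api actually consumes them is not shown in this guide, so the sketch below only illustrates the intended ordering. `ordered_endpoints` and `first_healthy` are hypothetical names, and the injected `is_healthy` callable stands in for whatever health probe the service really uses. In a Pod, `ordered_endpoints(os.environ)` would read the live values.

```python
from typing import Callable, List, Mapping, Optional

# ConfigMap keys in priority order, as listed in this guide.
TIER_KEYS = ("OLLAMA_URL", "OLLAMA_SECONDARY_URL", "OLLAMA_FALLBACK_URL")


def ordered_endpoints(env: Mapping[str, str]) -> List[str]:
    """Return the configured endpoints in tier order, skipping unset or empty keys."""
    return [env[key] for key in TIER_KEYS if env.get(key)]


def first_healthy(
    endpoints: List[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Return the first endpoint that passes the health probe, or None if all fail."""
    for url in endpoints:
        if is_healthy(url):
            return url
    return None
```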
---
## Confirm the models are available
The following models must already be pulled on GCP Ollama:
```bash
# Check GCP-A
curl http://34.143.170.20:11434/api/tags | jq '.models[].name'
# Must include:
# - bge-m3:latest (embedding)
# - qwen2.5:7b-instruct (health check)
# - qwen3:14b (RCA analysis)
# - hermes3:latest (tool calling)
# - deepseek-r1:14b (reasoning)
```
If a model is missing, SSH to the GCP host and run:
```bash
ollama pull bge-m3:latest
ollama pull qwen2.5:7b-instruct
ollama pull qwen3:14b
ollama pull hermes3:latest
ollama pull deepseek-r1:14b
```
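
Comparing the `jq` output against the required list by eye is easy to get wrong, so the comparison can be scripted. This is a sketch under the assumption that the `/api/tags` body is JSON with a `models` array whose entries have a `name` field. `missing_models` is a hypothetical helper; it takes the raw response text and returns the required names that are absent, in the order listed above.

```python
import json
from typing import Iterable, List

# Required models, copied from the list in this guide.
REQUIRED_MODELS = (
    "bge-m3:latest",
    "qwen2.5:7b-instruct",
    "qwen3:14b",
    "hermes3:latest",
    "deepseek-r1:14b",
)


def missing_models(
    tags_body: str, required: Iterable[str] = REQUIRED_MODELS
) -> List[str]:
    """Return the required model names that the /api/tags body does not list."""
    available = {model["name"] for model in json.loads(tags_body).get("models", [])}
    return [name for name in required if name not in available]
```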
---
## Deployment checklist
- [ ] Ansible playbook ran successfully (110)
- [ ] 110:11435 is listening (`ss -tlnp | grep 11435`)
- [ ] 110:11436 is listening (`ss -tlnp | grep 11436`)
- [ ] K3s nodes can reach 110:11435/11436
- [ ] K8s Pods can reach 110:11435/11436
- [ ] Models are available on GCP-A/B
- [ ] ConfigMap updated
- [ ] Deployment restarted
- [ ] API Pod starts without errors
- [ ] Inference test passes (check latency < 10s)
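
The last checklist item can be measured rather than estimated. The sketch below times one non-streaming request against Ollama's `/api/generate` endpoint and compares it with the 10 s threshold from the checklist. The helper names are hypothetical, and the model and prompt passed to `generate_once` are left to the operator, for example `timed(lambda: generate_once("http://192.168.0.110:11435", "qwen2.5:7b-instruct", "ping"))`.

```python
import json
import time
import urllib.request
from typing import Callable, Tuple

LATENCY_BUDGET_S = 10.0  # threshold taken from the checklist


def timed(call: Callable[[], object]) -> Tuple[object, float]:
    """Run call() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = call()
    return result, time.perf_counter() - start


def generate_once(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send one non-streaming /api/generate request and return the generated text."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


def within_budget(elapsed_s: float, budget_s: float = LATENCY_BUDGET_S) -> bool:
    """True when the measured latency is strictly under the budget."""
    return elapsed_s < budget_s
```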
---
## Common issues
### 1. Connection refused from a K3s Pod
Check the NetworkPolicy:
```bash
kubectl describe networkpolicy -n awoooi-prod allow-required-egress
```
Confirm it includes:
```yaml
- to:
    - ipBlock:
        cidr: 192.168.0.110/32
  ports:
    - protocol: TCP
      port: 11435
    - protocol: TCP
      port: 11436
```
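
To check an exported policy mechanically, the egress rules can be evaluated against the proxy address. This is a simplified sketch, not a full NetworkPolicy evaluator: it only considers `ipBlock` peers and numeric ports, and ignores `except` ranges, named ports, `endPort` and pod or namespace selectors. `egress_allows` is a hypothetical helper; its `rules` argument is assumed to be the `.spec.egress` list from `kubectl get networkpolicy -n awoooi-prod allow-required-egress -o json`.

```python
import ipaddress
from typing import Any, Dict, List


def egress_allows(
    rules: List[Dict[str, Any]], ip: str, port: int, protocol: str = "TCP"
) -> bool:
    """Return True if any egress rule permits traffic to ip:port over protocol."""
    target = ipaddress.ip_address(ip)
    for rule in rules:
        peers = rule.get("to", [])
        ports = rule.get("ports", [])
        # An empty or missing `to` list matches every destination.
        ip_ok = not peers or any(
            "ipBlock" in peer
            and target in ipaddress.ip_network(peer["ipBlock"]["cidr"], strict=False)
            for peer in peers
        )
        # An empty or missing `ports` list matches every port; an entry
        # without `port` matches all ports for its protocol.
        port_ok = not ports or any(
            entry.get("port") in (None, port)
            and entry.get("protocol", "TCP") == protocol
            for entry in ports
        )
        if ip_ok and port_ok:
            return True
    return False
```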
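### 2. nginx cannot reach GCP
Check outbound connectivity from 110:
```bash
curl -v http://34.143.170.20:11434/api/tags
```
If it fails, check whether the GCP firewall rule allows inbound TCP 11434 (0.0.0.0/0:11434).
### 3. Models are available but inference fails
Check memory and CPU usage on the GCP VM:
```bash
# GCP Console → Compute Engine → VM instances → Monitoring
```
If memory is insufficient, upgrade the machine type or reduce the number of models loaded at the same time.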
---
## Related documents
- ADR-110: GCP three-tier disaster recovery architecture
- `k8s/awoooi-prod/04-configmap.yaml`
- `k8s/awoooi-prod/02-network-policy.yaml`
- `docs/runbooks/RUNBOOK-OLLAMA-FAILOVER.md`
---
## Owners
- Created by: Claude Sonnet 4.6, 2026-05-04
- Reviewed by: Chief Architect ogt


@@ -36,11 +36,13 @@
        name: nginx
        state: reloaded
- name: "110 Nginx conf sync (if any)"
- name: "110 Ollama GCP Proxy deployment"
  hosts: host_110
  become: true
  vars:
    ansible_become_pass: "{{ vault_sudo_password | default(omit) }}"
    ollama_proxy_src: "{{ playbook_dir }}/../roles/nginx/templates/110-ollama-proxy.conf.j2"
    ollama_proxy_dest: /etc/nginx/sites-enabled/110-ollama-proxy.conf
  tasks:
    - name: "nginx | Confirm 110 nginx has no all-sites-from-188.conf in sites-enabled"
@@ -54,3 +56,44 @@
        msg: "⚠️ 110 sites-enabled still contains all-sites-from-188.conf (it should have been archived)"
      when: stale_conf.stat.exists
      tags: ["110", "nginx"]
    # ADR-110 (Ollama GCP three-tier disaster recovery): 110 acts as an nginx proxy forwarding K3s traffic to GCP
    - name: "nginx | Deploy the Ollama GCP Proxy config"
      ansible.builtin.template:
        src: "{{ ollama_proxy_src }}"
        dest: "{{ ollama_proxy_dest }}"
        owner: root
        group: root
        mode: "0644"
        backup: true
      notify: reload nginx 110
      tags: ["110", "nginx", "ollama-proxy"]
    - name: "nginx | Test the 110 config"
      ansible.builtin.command:
        cmd: "nginx -t"
      changed_when: false
      tags: ["110", "nginx", "ollama-proxy"]
    - name: "nginx | Ensure nginx is started"
      ansible.builtin.systemd:
        name: nginx
        state: started
        enabled: true
      tags: ["110", "nginx", "ollama-proxy"]
    - name: "nginx | Verify the Ollama proxy ports are listening"
      ansible.builtin.wait_for:
        port: "{{ item }}"
        host: 127.0.0.1
        timeout: 10
      loop:
        - 11435 # GCP-A
        - 11436 # GCP-B
      tags: ["110", "nginx", "ollama-proxy"]
  handlers:
    - name: reload nginx 110
      ansible.builtin.systemd:
        name: nginx
        state: reloaded


@@ -0,0 +1,71 @@
# 110 Ollama GCP Proxy: ADR-110 three-tier disaster recovery
# Lets workloads inside the K3s cluster reach the public GCP Ollama hosts via internal host 110
# Created: 2026-05-04
# Deploy: ansible-playbook -i inventory/hosts.yml playbooks/nginx-sync.yml --tags 110
# ============================================================
# Ollama GCP-A Primary (port 11435 → 34.143.170.20:11434)
# ============================================================
server {
listen 11435;
listen [::]:11435;
server_name _;
access_log /var/log/nginx/ollama-gcp-a-access.log;
error_log /var/log/nginx/ollama-gcp-a-error.log warn;
location / {
proxy_pass http://34.143.170.20:11434;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
# Ollama inference can be slow, so allow longer timeouts
proxy_connect_timeout 10s;
proxy_send_timeout 300s;
proxy_read_timeout 300s;
# Support streaming responses
proxy_buffering off;
proxy_cache off;
}
# Health-check endpoint
location /nginx-health {
access_log off;
return 200 "Ollama GCP-A Proxy OK\n";
add_header Content-Type text/plain;
}
}
# ============================================================
# Ollama GCP-B Secondary (port 11436 → 34.21.145.224:11434)
# ============================================================
server {
listen 11436;
listen [::]:11436;
server_name _;
access_log /var/log/nginx/ollama-gcp-b-access.log;
error_log /var/log/nginx/ollama-gcp-b-error.log warn;
location / {
proxy_pass http://34.21.145.224:11434;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_connect_timeout 10s;
proxy_send_timeout 300s;
proxy_read_timeout 300s;
proxy_buffering off;
proxy_cache off;
}
location /nginx-health {
access_log off;
return 200 "Ollama GCP-B Proxy OK\n";
add_header Content-Type text/plain;
}
}


@@ -0,0 +1,121 @@
#!/bin/bash
# GCP Ollama Nginx Proxy deployment script (run manually on 110)
# ADR-110 three-tier disaster recovery: lets K3s reach GCP Ollama over the internal network
# Usage: ssh wooo@192.168.0.110 'sudo bash -s' < deploy-ollama-proxy-110.sh
set -euo pipefail
echo "🚀 Deploying GCP Ollama Nginx Proxy (110)..."
# Config file path
NGINX_CONF="/etc/nginx/sites-enabled/110-ollama-proxy.conf"
# Back up the existing config
if [ -f "$NGINX_CONF" ]; then
    echo "📦 Backing up the existing config..."
    cp "$NGINX_CONF" "${NGINX_CONF}.backup.$(date +%Y%m%d%H%M%S)"
fi
# Write the nginx config
echo "📝 Writing the nginx config..."
cat > "$NGINX_CONF" << 'EOF'
# 110 Ollama GCP Proxy: ADR-110 three-tier disaster recovery
# Lets workloads inside the K3s cluster reach the public GCP Ollama hosts via internal host 110
# ============================================================
# Ollama GCP-A Primary (port 11435 → 34.143.170.20:11434)
# ============================================================
server {
listen 11435;
listen [::]:11435;
server_name _;
access_log /var/log/nginx/ollama-gcp-a-access.log;
error_log /var/log/nginx/ollama-gcp-a-error.log warn;
location / {
proxy_pass http://34.143.170.20:11434;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
# Ollama inference can be slow, so allow longer timeouts
proxy_connect_timeout 10s;
proxy_send_timeout 300s;
proxy_read_timeout 300s;
# Support streaming responses
proxy_buffering off;
proxy_cache off;
}
# Health-check endpoint
location /nginx-health {
access_log off;
return 200 "Ollama GCP-A Proxy OK\n";
add_header Content-Type text/plain;
}
}
# ============================================================
# Ollama GCP-B Secondary (port 11436 → 34.21.145.224:11434)
# ============================================================
server {
listen 11436;
listen [::]:11436;
server_name _;
access_log /var/log/nginx/ollama-gcp-b-access.log;
error_log /var/log/nginx/ollama-gcp-b-error.log warn;
location / {
proxy_pass http://34.21.145.224:11434;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_connect_timeout 10s;
proxy_send_timeout 300s;
proxy_read_timeout 300s;
proxy_buffering off;
proxy_cache off;
}
location /nginx-health {
access_log off;
return 200 "Ollama GCP-B Proxy OK\n";
add_header Content-Type text/plain;
}
}
EOF
# Test the nginx config
echo "🧪 Testing the nginx config..."
nginx -t
# Reload nginx
echo "🔄 Reloading nginx..."
systemctl reload nginx
# Verify the ports are listening
echo "🔍 Verifying listening ports..."
sleep 2
ss -tlnp | grep -E '11435|11436' || true
# Local test
echo "🌐 Testing the proxy locally..."
echo "Testing GCP-A proxy (11435)..."
curl -s -o /dev/null -w "%{http_code}" http://127.0.0.1:11435/api/tags || echo "connection failed"
echo ""
echo "Testing GCP-B proxy (11436)..."
curl -s -o /dev/null -w "%{http_code}" http://127.0.0.1:11436/api/tags || echo "connection failed"
echo ""
echo "✅ Deployment complete!"
echo ""
echo "Next steps:"
echo "1. Test from a K3s node: curl http://192.168.0.110:11435/api/tags"
echo "2. Update the K8s ConfigMap to point at 110:11435/11436"
echo "3. Restart the awoooi-api deployment"