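fix(watchdog): fix duplicate Trust Drift alerts + set up GCP Ollama nginx proxy

- ai_slo_watchdog_job: switch to the pure-statistics trust_drift_detector lib so the watchdog no longer triggers Telegram alerts on top of governance_agent's hourly self-check
- infra/ansible: set up an nginx proxy on 110 that forwards to GCP-A/B
  - port 11435 -> 34.143.170.20:11434 (GCP-A)
  - port 11436 -> 34.21.145.224:11434 (GCP-B)
- docs/runbooks: DEPLOY-GCP-OLLAMA-PROXY.md, the complete deployment guide
- ops/nginx: manual deployment script to run directly on 110

Prerequisite for enabling ADR-110 three-tier disaster recovery: deploy the proxy first, then change the ConfigMap.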
@@ -159,21 +159,17 @@ async def _check_once() -> None:
     # W-6: Trust Drift detection (Playbook trust-score drift)
-    # 2026-05-02 ogt + Claude Sonnet 4.6 (Asia-Pacific): consolidate the dual-write path
-    # Old behavior: called trust_drift_detector.run(), which wrote event_type=trust_drift directly to PG
-    # governance_agent.check_trust_drift() also writes the same event_type every 1h → double write
-    # Consolidation: call governance_agent.check_trust_drift() as the single source of truth
-    # The W-6 watchdog still runs every 15 minutes (as a sensor); the violations count drives the meta-alert
-    # PG writes are handled solely by governance_agent._alert(), avoiding double writes
+    # 2026-05-04 ogt + Claude: fix duplicate alerts by using trust_drift_detector for statistics only
+    # Background: calling governance_agent.check_trust_drift() triggers a Telegram alert,
+    # but governance_agent.run_self_check() also calls the same method every 1h → duplicate Telegram alerts
+    # Fix: the watchdog only reads the statistics and never triggers Telegram; alerting is owned solely by governance_agent
     try:
-        from src.services.governance_agent import get_governance_agent
-        trust_result = await get_governance_agent().check_trust_drift()
-        if trust_result.get("drifted", 0) > 0:
-            drifted = trust_result["drifted"]
-            auto_deprecated = trust_result.get("auto_deprecated", 0)
-            kept = trust_result.get("kept", 0)
+        from src.services.trust_drift_detector import get_trust_drift_detector
+        dist = await get_trust_drift_detector().detect()
+        if dist.drift_detected:
             violations.append(
-                f"Trust Drift detected {drifted} Playbooks with low trust"
-                f" (auto-deprecated: {auto_deprecated}, pending manual review: {kept})"
+                f"Trust Drift detected {dist.low_count} Playbooks with low trust"
+                f" (low_ratio: {dist.low_ratio:.1%}, mean_trust: {dist.mean_trust:.2f})"
             )
     except Exception as e:
         logger.warning("watchdog_w6_trust_drift_check_failed", error=str(e))

201
docs/runbooks/DEPLOY-GCP-OLLAMA-PROXY.md
Normal file
@@ -0,0 +1,201 @@
# GCP Ollama Nginx Proxy Deployment Guide

> ADR-110 three-tier disaster recovery: the key step for enabling GCP Ollama

---

## Background

GCP Ollama (34.143.170.20 / 34.21.145.224) is already deployed, but workloads inside the K3s cluster cannot connect directly to the GCP public IPs.
An nginx reverse proxy on 192.168.0.110 (the DevOps vault host) lets K3s Pods reach GCP Ollama over the internal network.

---

## Deployment Files

| File | Purpose |
|------|---------|
| `infra/ansible/roles/nginx/templates/110-ollama-proxy.conf.j2` | nginx config template |
| `infra/ansible/playbooks/nginx-sync.yml` | Ansible playbook |

---

## Run the Deployment

```bash
# 1. Enter the Ansible directory
cd /Users/ogt/awoooi/infra/ansible

# 2. Deploy to 110 (dry run first to validate)
ansible-playbook -i inventory/hosts.yml playbooks/nginx-sync.yml --tags 110 --check

# 3. Run the actual deployment
ansible-playbook -i inventory/hosts.yml playbooks/nginx-sync.yml --tags 110
```

---

## Verify the Deployment

### From 110 itself

```bash
# Test the GCP-A proxy
curl http://127.0.0.1:11435/api/tags

# Test the GCP-B proxy
curl http://127.0.0.1:11436/api/tags
```

### From a K3s node

```bash
# Log in to a K3s node (120 or 121)
ssh wooo@192.168.0.120

# Test connectivity to the 110 proxy
curl http://192.168.0.110:11435/api/tags
curl http://192.168.0.110:11436/api/tags
```

### From a K8s Pod

```bash
# Open a shell in the API Pod
kubectl exec -it -n awoooi-prod deployment/awoooi-api -- bash

# Install curl if needed, then test connectivity
apt-get update && apt-get install -y curl
curl http://192.168.0.110:11435/api/tags
```

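The manual checks above all request the same `/api/tags` path, so they can be scripted. The following is a minimal sketch using only the Python standard library; `http_status`, `check_proxies`, and the injectable `fetch` parameter are hypothetical names chosen for illustration and are not part of this repository.

```python
import urllib.request
from typing import Callable

# Proxy endpoints from this runbook: 110 forwards 11435 -> GCP-A, 11436 -> GCP-B.
PROXY_URLS = {
    "GCP-A": "http://192.168.0.110:11435/api/tags",
    "GCP-B": "http://192.168.0.110:11436/api/tags",
}


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code of a GET request (raises on network errors)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def check_proxies(
    urls: dict[str, str] = PROXY_URLS,
    fetch: Callable[[str], int] = http_status,
) -> dict[str, bool]:
    """Map each endpoint name to True when it answered with HTTP 200."""
    results = {}
    for name, url in urls.items():
        try:
            results[name] = fetch(url) == 200
        except OSError:  # URLError, HTTPError, and socket timeouts subclass OSError
            results[name] = False
    return results
```

Calling `check_proxies()` from a K3s node or Pod should return `True` for both names once the proxy is up; passing a fake `fetch` makes the function usable without network access.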
---

## Enable GCP Ollama

Once the proxy is deployed, edit the ConfigMap to enable the GCP endpoints:

```bash
# Edit the ConfigMap
kubectl edit configmap -n awoooi-prod awoooi-config
```

Change the following fields:

```yaml
# Before
OLLAMA_URL: "http://192.168.0.111:11434"
OLLAMA_SECONDARY_URL: "http://192.168.0.110:11435"
OLLAMA_FALLBACK_URL: "http://192.168.0.110:11436"

# After (GCP-A becomes the primary)
OLLAMA_URL: "http://192.168.0.110:11435"           # GCP-A via proxy
OLLAMA_SECONDARY_URL: "http://192.168.0.110:11436" # GCP-B via proxy
OLLAMA_FALLBACK_URL: "http://192.168.0.111:11434"  # Local GPU, last line of defense
```

Restart the Deployment:

```bash
kubectl rollout restart deployment/awoooi-api -n awoooi-prod
```

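The three ConfigMap keys describe a priority order. How `awoooi-api` actually consumes them is not shown in this commit, so the following is only a sketch of one plausible "first healthy endpoint wins" selection; `pick_endpoint`, `TIER_KEYS`, and `is_healthy` are hypothetical names.

```python
from typing import Callable, Mapping, Optional

# Priority order implied by the ConfigMap keys (highest priority first).
TIER_KEYS = ("OLLAMA_URL", "OLLAMA_SECONDARY_URL", "OLLAMA_FALLBACK_URL")


def pick_endpoint(
    config: Mapping[str, str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first configured URL whose health probe succeeds, else None."""
    for key in TIER_KEYS:
        url = config.get(key)
        if url and is_healthy(url):
            return url
    return None
```

With the "after" values above, a failed probe on both proxy ports would fall through to the local GPU at `http://192.168.0.111:11434`.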
---

## Confirm the Models Are Loaded

GCP Ollama must have the following models loaded:

```bash
# Check GCP-A
curl http://34.143.170.20:11434/api/tags | jq '.models[].name'

# Must include:
# - bge-m3:latest (embedding)
# - qwen2.5:7b-instruct (health check)
# - qwen3:14b (RCA analysis)
# - hermes3:latest (tool calling)
# - deepseek-r1:14b (reasoning)
```

If a model is missing, SSH to the GCP host and run:

```bash
ollama pull bge-m3:latest
ollama pull qwen2.5:7b-instruct
ollama pull qwen3:14b
ollama pull hermes3:latest
ollama pull deepseek-r1:14b
```

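The manual `jq` check above can be automated. This is a minimal sketch, assuming the `/api/tags` response has the shape `{"models": [{"name": ...}, ...]}` that the `jq` filter relies on; `missing_models` and `REQUIRED_MODELS` are hypothetical names.

```python
import json

# Models this runbook requires on each GCP Ollama host.
REQUIRED_MODELS = (
    "bge-m3:latest",
    "qwen2.5:7b-instruct",
    "qwen3:14b",
    "hermes3:latest",
    "deepseek-r1:14b",
)


def missing_models(tags_json: str, required=REQUIRED_MODELS) -> list[str]:
    """Return the required model names absent from an /api/tags response body."""
    payload = json.loads(tags_json)
    loaded = {model["name"] for model in payload.get("models", [])}
    return [name for name in required if name not in loaded]
```

Feed it the body returned by `curl http://34.143.170.20:11434/api/tags`; an empty list means nothing needs to be pulled.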
---

## Deployment Checklist

- [ ] Ansible playbook ran successfully (110)
- [ ] 110:11435 is listening (`ss -tlnp | grep 11435`)
- [ ] 110:11436 is listening (`ss -tlnp | grep 11436`)
- [ ] K3s nodes can reach 110:11435/11436
- [ ] K8s Pods can reach 110:11435/11436
- [ ] Models are loaded on GCP-A/B
- [ ] ConfigMap has been updated
- [ ] Deployment has been restarted
- [ ] API Pod starts without errors
- [ ] Inference test succeeds (check latency < 10s)

---

## Troubleshooting

### 1. Connection refused from a K3s Pod

Check the NetworkPolicy:
```bash
kubectl describe networkpolicy -n awoooi-prod allow-required-egress
```

Confirm that it contains:
```yaml
- to:
    - ipBlock:
        cidr: 192.168.0.110/32
  ports:
    - protocol: TCP
      port: 11435
    - protocol: TCP
      port: 11436
```

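The same NetworkPolicy check can be done programmatically. This is a sketch, assuming the policy was exported with `kubectl get networkpolicy allow-required-egress -n awoooi-prod -o json`; `allows_proxy_egress` is a hypothetical helper that only handles numeric ports and an exact CIDR match.

```python
import json

PROXY_CIDR = "192.168.0.110/32"
PROXY_PORTS = {11435, 11436}


def allows_proxy_egress(policy_json: str) -> bool:
    """True if one egress rule targets PROXY_CIDR and opens both proxy ports over TCP."""
    policy = json.loads(policy_json)
    for rule in policy.get("spec", {}).get("egress", []):
        cidrs = {
            peer["ipBlock"]["cidr"]
            for peer in rule.get("to", [])
            if "ipBlock" in peer
        }
        # Kubernetes defaults the protocol to TCP when it is omitted.
        tcp_ports = {
            port.get("port")
            for port in rule.get("ports", [])
            if port.get("protocol", "TCP") == "TCP"
        }
        if PROXY_CIDR in cidrs and PROXY_PORTS <= tcp_ports:
            return True
    return False
```

A `False` result suggests the egress rule shown above is missing or incomplete; a broader CIDR that happens to cover 110 would also return `False` here, since the sketch does not do subnet matching.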
### 2. nginx cannot reach GCP

Check outbound connectivity from 110:
```bash
curl -v http://34.143.170.20:11434/api/tags
```

If this fails, check whether the GCP firewall rules allow 0.0.0.0/0 on port 11434.

### 3. Models are loaded but inference fails

Check memory and CPU usage on the GCP VM:
```bash
# GCP Console → Compute Engine → VM instances → Monitoring
```

If memory is insufficient, upgrade the machine type or reduce the number of models loaded at the same time.

---

## Related Documents

- ADR-110: GCP three-tier disaster recovery architecture
- `k8s/awoooi-prod/04-configmap.yaml`
- `k8s/awoooi-prod/02-network-policy.yaml`
- `docs/runbooks/RUNBOOK-OLLAMA-FAILOVER.md`

---

## Owners

- Created by: Claude Sonnet 4.6, 2026-05-04
- Reviewed by: Chief Architect ogt
@@ -36,11 +36,13 @@
         name: nginx
         state: reloaded

-- name: "110 Nginx conf sync (if any)"
+- name: "110 Ollama GCP Proxy deployment"
   hosts: host_110
   become: true
   vars:
     ansible_become_pass: "{{ vault_sudo_password | default(omit) }}"
+    ollama_proxy_src: "{{ playbook_dir }}/../roles/nginx/templates/110-ollama-proxy.conf.j2"
+    ollama_proxy_dest: /etc/nginx/sites-enabled/110-ollama-proxy.conf

   tasks:
     - name: "nginx | confirm 110 nginx has no all-sites-from-188.conf in sites-enabled"
@@ -54,3 +56,44 @@
         msg: "⚠️ 110 sites-enabled still contains all-sites-from-188.conf; it should have been archived"
       when: stale_conf.stat.exists
       tags: ["110", "nginx"]
+
+    # ADR-110: Ollama GCP three-tier disaster recovery; 110 acts as the nginx proxy that forwards K3s traffic to GCP
+    - name: "nginx | deploy Ollama GCP Proxy config"
+      ansible.builtin.template:
+        src: "{{ ollama_proxy_src }}"
+        dest: "{{ ollama_proxy_dest }}"
+        owner: root
+        group: root
+        mode: "0644"
+        backup: true
+      notify: reload nginx 110
+      tags: ["110", "nginx", "ollama-proxy"]
+
+    - name: "nginx | test 110 config"
+      ansible.builtin.command:
+        cmd: "nginx -t"
+      changed_when: false
+      tags: ["110", "nginx", "ollama-proxy"]
+
+    - name: "nginx | ensure nginx is started"
+      ansible.builtin.systemd:
+        name: nginx
+        state: started
+        enabled: true
+      tags: ["110", "nginx", "ollama-proxy"]
+
+    - name: "nginx | verify Ollama proxy ports are listening"
+      ansible.builtin.wait_for:
+        port: "{{ item }}"
+        host: 127.0.0.1
+        timeout: 10
+      loop:
+        - 11435  # GCP-A
+        - 11436  # GCP-B
+      tags: ["110", "nginx", "ollama-proxy"]
+
+  handlers:
+    - name: reload nginx 110
+      ansible.builtin.systemd:
+        name: nginx
+        state: reloaded

71
infra/ansible/roles/nginx/templates/110-ollama-proxy.conf.j2
Normal file
@@ -0,0 +1,71 @@
# 110 Ollama GCP Proxy (ADR-110 three-tier disaster recovery)
# Lets the K3s cluster reach the external GCP Ollama hosts through internal host 110
# Created: 2026-05-04
# Deploy: ansible-playbook -i inventory/hosts.yml playbooks/nginx-sync.yml --tags 110

# ============================================================
# Ollama GCP-A Primary (port 11435 → 34.143.170.20:11434)
# ============================================================
server {
    listen 11435;
    listen [::]:11435;
    server_name _;

    access_log /var/log/nginx/ollama-gcp-a-access.log;
    error_log /var/log/nginx/ollama-gcp-a-error.log warn;

    location / {
        proxy_pass http://34.143.170.20:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # Ollama inference can be slow, so allow long timeouts
        proxy_connect_timeout 10s;
        proxy_send_timeout 300s;
        proxy_read_timeout 300s;

        # Support streaming responses
        proxy_buffering off;
        proxy_cache off;
    }

    # Health check endpoint
    location /nginx-health {
        access_log off;
        return 200 "Ollama GCP-A Proxy OK\n";
        add_header Content-Type text/plain;
    }
}

# ============================================================
# Ollama GCP-B Secondary (port 11436 → 34.21.145.224:11434)
# ============================================================
server {
    listen 11436;
    listen [::]:11436;
    server_name _;

    access_log /var/log/nginx/ollama-gcp-b-access.log;
    error_log /var/log/nginx/ollama-gcp-b-error.log warn;

    location / {
        proxy_pass http://34.21.145.224:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        proxy_connect_timeout 10s;
        proxy_send_timeout 300s;
        proxy_read_timeout 300s;

        proxy_buffering off;
        proxy_cache off;
    }

    location /nginx-health {
        access_log off;
        return 200 "Ollama GCP-B Proxy OK\n";
        add_header Content-Type text/plain;
    }
}
121
ops/nginx/deploy-ollama-proxy-110.sh
Executable file
@@ -0,0 +1,121 @@
#!/bin/bash
# GCP Ollama Nginx Proxy deployment script (run manually on 110)
# ADR-110 three-tier disaster recovery: lets K3s reach GCP Ollama over the internal network
# Usage: ssh wooo@192.168.0.110 'sudo bash -s' < deploy-ollama-proxy-110.sh

set -euo pipefail

echo "🚀 Deploying GCP Ollama Nginx Proxy (110)..."

# Target config path
NGINX_CONF="/etc/nginx/sites-enabled/110-ollama-proxy.conf"

# Back up the existing config
if [ -f "$NGINX_CONF" ]; then
    echo "📦 Backing up existing config..."
    cp "$NGINX_CONF" "${NGINX_CONF}.backup.$(date +%Y%m%d%H%M%S)"
fi

# Write the nginx config
echo "📝 Writing nginx config..."
cat > "$NGINX_CONF" << 'EOF'
# 110 Ollama GCP Proxy (ADR-110 three-tier disaster recovery)
# Lets the K3s cluster reach the external GCP Ollama hosts through internal host 110

# ============================================================
# Ollama GCP-A Primary (port 11435 → 34.143.170.20:11434)
# ============================================================
server {
    listen 11435;
    listen [::]:11435;
    server_name _;

    access_log /var/log/nginx/ollama-gcp-a-access.log;
    error_log /var/log/nginx/ollama-gcp-a-error.log warn;

    location / {
        proxy_pass http://34.143.170.20:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # Ollama inference can be slow, so allow long timeouts
        proxy_connect_timeout 10s;
        proxy_send_timeout 300s;
        proxy_read_timeout 300s;

        # Support streaming responses
        proxy_buffering off;
        proxy_cache off;
    }

    # Health check endpoint
    location /nginx-health {
        access_log off;
        return 200 "Ollama GCP-A Proxy OK\n";
        add_header Content-Type text/plain;
    }
}

# ============================================================
# Ollama GCP-B Secondary (port 11436 → 34.21.145.224:11434)
# ============================================================
server {
    listen 11436;
    listen [::]:11436;
    server_name _;

    access_log /var/log/nginx/ollama-gcp-b-access.log;
    error_log /var/log/nginx/ollama-gcp-b-error.log warn;

    location / {
        proxy_pass http://34.21.145.224:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        proxy_connect_timeout 10s;
        proxy_send_timeout 300s;
        proxy_read_timeout 300s;

        proxy_buffering off;
        proxy_cache off;
    }

    location /nginx-health {
        access_log off;
        return 200 "Ollama GCP-B Proxy OK\n";
        add_header Content-Type text/plain;
    }
}
EOF

# Test the nginx config
echo "🧪 Testing nginx config..."
nginx -t

# Reload nginx
echo "🔄 Reloading nginx..."
systemctl reload nginx

# Verify the ports are listening
echo "🔍 Verifying listening ports..."
sleep 2
ss -tlnp | grep -E '11435|11436' || true

# Local test
echo "🌐 Testing the proxy locally..."
echo "Testing GCP-A proxy (11435)..."
curl -s -o /dev/null -w "%{http_code}" http://127.0.0.1:11435/api/tags || echo "connection failed"
echo ""

echo "Testing GCP-B proxy (11436)..."
curl -s -o /dev/null -w "%{http_code}" http://127.0.0.1:11436/api/tags || echo "connection failed"
echo ""

echo "✅ Deployment complete!"
echo ""
echo "Next steps:"
echo "1. Test from a K3s node: curl http://192.168.0.110:11435/api/tags"
echo "2. Update the K8s ConfigMap to point to 110:11435/11436"
echo "3. Restart the awoooi-api deployment"