feat(infra): ADR-110 add the missing Local Fallback + password SSH recovery tooling
Some checks failed
Ansible Lint / lint (push) Has been cancelled
@@ -6,6 +6,91 @@

---

## 2026-05-05 | ADR-110 three-tier disaster recovery completed + password SSH restored on four hosts

**ADR-110 Local Fallback (port 11437)**:
- `110-ollama-proxy.conf.j2`: added a server block for port 11437 → 192.168.0.111:11434
- `nginx-sync.yml`: added 11437 to the `wait_for` loop verification
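
The port verification can also be reproduced outside Ansible. The following is a minimal sketch, not the playbook's implementation: it assumes the three proxy ports listen on 192.168.0.110, and the helper names `port_open` and `wait_for_ports` are hypothetical.

```python
import socket
import time

# Ports from the entry above; the host is assumed to be the 110 nginx box.
PROXY_HOST = "192.168.0.110"
PROXY_PORTS = (11435, 11436, 11437)


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_ports(host, ports, deadline_s: float = 30.0, interval_s: float = 1.0) -> dict:
    """Poll until every port accepts connections or the deadline passes.

    Returns a mapping of port -> True (reachable) / False (still closed).
    """
    ports = list(ports)
    pending = set(ports)
    stop_at = time.monotonic() + deadline_s
    while pending and time.monotonic() < stop_at:
        pending = {p for p in pending if not port_open(host, p)}
        if pending:
            time.sleep(interval_s)
    return {p: p not in pending for p in ports}
```

Calling `wait_for_ports(PROXY_HOST, PROXY_PORTS)` returns one boolean per port, so a missing 11437 listener shows up as a single `False` entry rather than a generic failure.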

**Password SSH restored on four hosts**:
- Cause: `/etc/shadow` was read-only and the sudo password was unknown → it could not be changed directly
- Fix: `docker run --privileged --pid=host alpine nsenter --target 1 --mount -- chpasswd` (no sudo needed)
- Result: the passwords on 110/120/121 (wooo) and 188 (ollama) are all set to the Commander's password, and `PasswordAuthentication yes` is in effect
- Added `infra/ansible/playbooks/restore-password-auth.yml` (allows unified management through Ansible in the future)

---

## 2026-05-04 (Session 2) | Ollama GCP routing officially switched + root fix for the governance infinite loop

### Final Ollama routing (ADR-110 official routing)

**Background**: The nginx proxy on 110 is now in place (11435→GCP-A, 11436→GCP-B), so K8s pods can reach GCP Ollama through the proxy. The Commander requires GCP as primary, with 111 as the last-resort fallback.

**Official routing (commit 40badc42)**:
```
OLLAMA_URL = http://192.168.0.110:11435 ← GCP-A primary (via nginx proxy)
OLLAMA_SECONDARY_URL = http://192.168.0.110:11436 ← GCP-B secondary (via nginx proxy)
OLLAMA_FALLBACK_URL = http://192.168.0.111:11434 ← 111 last-resort fallback
```
- Verification: both GCP hosts serve 10 models each, 200 OK
- Hot update: `kubectl set env` (leaves the image tag untouched, so `IMAGE_TAG_PLACEHOLDER` cannot overwrite it)
- Ansible template: `infra/ansible/roles/nginx/templates/110-ollama-proxy.conf.j2`

**⚠️ Hard-won lesson**: `kubectl apply -f 06-deployment-api.yaml` pushes `IMAGE_TAG_PLACEHOLDER` to the cluster → ImagePullBackOff. Routing changes must use `kubectl set env`; never apply the whole deployment yaml.
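
The tier order above amounts to "use the first endpoint that answers". Here is a minimal sketch of that selection, assuming a caller-supplied probe; `pick_ollama_url` and the `is_up` callback are hypothetical names, not the project's actual failover code.

```python
from typing import Callable, Optional, Sequence, Tuple

# Tier order from the routing block above: GCP-A, then GCP-B, then the 111 fallback.
OLLAMA_TIERS: Tuple[Tuple[str, str], ...] = (
    ("primary", "http://192.168.0.110:11435"),
    ("secondary", "http://192.168.0.110:11436"),
    ("fallback", "http://192.168.0.111:11434"),
)


def pick_ollama_url(
    is_up: Callable[[str], bool],
    tiers: Sequence[Tuple[str, str]] = OLLAMA_TIERS,
) -> Optional[Tuple[str, str]]:
    """Return (tier_name, url) for the first tier whose probe succeeds, else None."""
    for name, url in tiers:
        if is_up(url):
            return name, url
    return None
```

Passing the probe in as a function keeps the ordering logic testable without a network; in practice `is_up` would wrap an HTTP health request with a short timeout.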

---

### Root fix for governance_fusion_complete firing every 20 seconds (commit a1b61289)

**Root cause A: infinite loop on the skip path**:
After a skip decision, `dispatch_governance_event` wrote no record → the event stayed `resolved=False` forever → it was re-fetched every 30s → every round hit the LLM (30s timeout) + Prometheus → infinite loop. 4581 stale events piled up.

**Root cause B: MCP score stuck at 0.2**:
The SLI recording rules were not yet active in Prometheus, so `result_list=[]` → `success_count=0` → `0.2 + 0.7×0 = 0.2`. The fused confidence of 0.3565 was always below the 0.65 threshold, so every event took the skip path → which made the loop worse.

**Fix (`governance_dispatcher.py` + `decision_fusion_adapter.py`)**:
1. A Redis 90min cooldown key (`governance:skip:{event_id}`) prevents repeated LLM calls
2. Stale events that have been skipped for more than 2h are automatically marked `resolved=True` (persistent problems are re-raised as new events by governance_agent)
3. MCP: an empty result (no_data) ≠ failure, so it now gets a neutral 0.5 contribution; new formula: `0.2 + 0.7×(success + 0.5×no_data)/total`
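
To make the new formula concrete, here is a minimal sketch of the scoring arithmetic. The function name `mcp_score` and the zero-total guard are assumptions for illustration; only the formula itself comes from the entry above.

```python
def mcp_score(success: int, no_data: int, total: int) -> float:
    """Score = 0.2 + 0.7 * (success + 0.5 * no_data) / total.

    success: queries that returned data
    no_data: queries that returned an empty result (neutral, weight 0.5)
    total:   all queries issued, including failures
    """
    if success < 0 or no_data < 0 or success + no_data > total:
        raise ValueError("counts must be non-negative and sum to at most total")
    if total == 0:
        # Assumption: with nothing queried, fall back to the 0.2 base score.
        return 0.2
    return 0.2 + 0.7 * (success + 0.5 * no_data) / total
```

With four queries that all return no data, the old formula yields 0.2 while this one yields 0.2 + 0.7 × 0.5 = 0.55; four outright failures still score 0.2.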

**Immediate mitigation**:
- Bulk-resolved 4437 stale events (>2h) directly via `kubectl exec`, then cleared another 144 (>30min)
- Pre-seeded 105 Redis cooldown keys (90min TTL)
- Result: `governance_fusion_complete` dropped from every 20s → only 1-2 times per cycle, then went silent
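
The cooldown key and the 2h stale rule from the fix list interact as a small decision procedure. The sketch below is one way to express it, using an in-memory stand-in for Redis so it runs anywhere; `CooldownStore`, `next_action`, and the returned labels are hypothetical, and the real dispatcher may order these checks differently.

```python
from typing import Dict

SKIP_COOLDOWN_S = 90 * 60    # 90min cooldown
STALE_AFTER_S = 2 * 60 * 60  # 2h stale threshold


class CooldownStore:
    """In-memory stand-in for a Redis `SET key value NX EX ttl` call."""

    def __init__(self) -> None:
        self._expires_at: Dict[str, float] = {}

    def set_if_absent(self, key: str, ttl_s: float, now: float) -> bool:
        """Set the key unless a live one exists; True means this caller set it."""
        if self._expires_at.get(key, 0.0) > now:
            return False
        self._expires_at[key] = now + ttl_s
        return True


def next_action(store: CooldownStore, event_id: str, first_skipped_at: float, now: float) -> str:
    """Decide what to do with an unresolved event that was previously skipped."""
    if now - first_skipped_at > STALE_AFTER_S:
        return "resolve"   # stale: mark resolved=True instead of re-evaluating
    if store.set_if_absent(f"governance:skip:{event_id}", SKIP_COOLDOWN_S, now):
        return "evaluate"  # no live cooldown key: one LLM call is allowed
    return "cooldown"      # key still live: do not call the LLM again
```

Timestamps are passed in explicitly so the behavior can be checked without waiting; a production version would read the clock and call Redis instead.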

---

### META SYSTEM alert repeating every minute (ai_slo_watchdog dedup bypass)

**Root cause**: Python's `hash()` is seeded by PYTHONHASHSEED, and each Pod restart produces a different seed → the same violations hash to different values → the Redis 10min dedup is bypassed → the Pods re-send the alert every round.

**Fix**: switched to `hashlib.sha256` + atomic `SET NX` (prevents a race between two Pods) + pre-seeded the dedup key to stop the alerts immediately.
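
A process-independent dedup key can be derived as follows. This is a minimal sketch under stated assumptions: the `slo:dedup:` prefix, the `dedup_key` name, and sorting the violations before hashing are illustrative choices, not details confirmed by the commit.

```python
import hashlib
import json
from typing import Iterable


def dedup_key(violations: Iterable[str], prefix: str = "slo:dedup:") -> str:
    """Build a key that is identical across processes, Pods, and restarts.

    Unlike the built-in hash(), sha256 has no per-process seed. Sorting makes
    the key independent of the order in which violations were collected.
    """
    canonical = json.dumps(sorted(violations), ensure_ascii=False, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{prefix}{digest}"


# With redis-py, the atomic claim would look like this (not executed here):
#   first = redis_client.set(dedup_key(violations), "1", nx=True, ex=600)
#   if first: send_alert(...)   # send_alert is a hypothetical function
```

Because `SET` with `nx=True` succeeds for only one caller within the 10min TTL, two Pods that compute the same key cannot both send the alert.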

---

## 2026-05-04 | Ollama routing root fix: unblocking K8s → GCP:11434 (ADR-110 correction)

**Root cause confirmed**: The K8s NetworkPolicy `allow-required-egress` only opens port 443 for external egress, so GCP:11434 always returned connection refused from K8s pods. Seen from K8s, the ADR-110 "GCP-A primary" design never took effect, and the system has been burning Gemini quota since launch.

**Fix list** (commits 85581965 + 0a90dab1):
- B1: OFFLINE cache 30s → 5s (prevents amplification when all three nodes cache the state at the same time)
- B3: inference-layer ConnectError → DEGRADED (no longer misclassified as OFFLINE)
- B5/B6: `_current_primary` naming aligned ("ollama" → "ollama_gcp_a")
- Filled in the missing SLOW routing (failover_manager + auto_recovery)
- Telegram alerts now show AI Agent + LLM + Ollama host + token count

**Long-term fixes** (this session):
- `k8s/awoooi-prod/04-configmap.yaml`: OLLAMA_URL=GCP-A (110:11435), SECONDARY=GCP-B (110:11436), FALLBACK=111
- `k8s/awoooi-prod/06-deployment-api.yaml`: kept in sync
- `k8s/awoooi-prod/02-network-policy.yaml`: added pod→110:11435/11436 egress
- `110:/etc/nginx/conf.d/ollama-gcp-proxy.conf`: 11435→GCP-A, 11436→GCP-B
- `health.py check_ollama()`: now polls all three tiers; fallback up → degraded
- `failover_manager routing_reason`: dynamic IP label, no longer hard-coded GCP-A/GCP-B

**Verification**: ArgoCD Synced Healthy; both GCP hosts serve 10 models each; NetworkPolicy + nginx proxy working
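
The three-tier health rule ("fallback up → degraded") can be summarized in a few lines. This is a sketch rather than the actual `check_ollama()` code: the entry only states the fallback case, so treating either GCP tier as "healthy" and no tier as "offline" are assumptions, as is the function name.

```python
def classify_ollama_health(primary_up: bool, secondary_up: bool, fallback_up: bool) -> str:
    """Map per-tier probe results to an overall status label."""
    if primary_up or secondary_up:
        return "healthy"   # assumption: either GCP tier counts as fully healthy
    if fallback_up:
        return "degraded"  # from the entry above: only the 111 fallback answers
    return "offline"       # assumption: no tier reachable
```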

---

## 2026-05-04 | AwoooP Phase 6-8 completed (EwoooC Onboarding / Channel Hub / Approval Token)

### Phase 6: EwoooC Tenant Onboarding (ADR-115)

@@ -90,6 +90,7 @@
      loop:
        - 11435 # GCP-A
        - 11436 # GCP-B
        - 11437 # Local Fallback
      tags: ["110", "nginx", "ollama-proxy"]

  handlers:
31
infra/ansible/playbooks/restore-password-auth.yml
Normal file
@@ -0,0 +1,31 @@
---
# Restore SSH password authentication (authorized by the Commander, 2026-05-04)
# Run: ansible-playbook -i inventory/hosts.yml playbooks/restore-password-auth.yml --extra-vars @/tmp/awoooi-pw.yml
# Delete /tmp/awoooi-pw.yml when done

- name: "Restore SSH password authentication on four hosts"
  hosts: host_110,host_120,host_121,host_188
  become: true
  vars:
    ansible_become_pass: "{{ user_password }}"

  tasks:
    - name: "SSH | Enable PasswordAuthentication"
      ansible.builtin.lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^#?PasswordAuthentication'
        line: 'PasswordAuthentication yes'
        state: present
      notify: reload sshd

    - name: "SSH | Set the system user's password"
      ansible.builtin.user:
        name: "{{ ansible_user }}"
        password: "{{ user_password | password_hash('sha512') }}"
      no_log: true

  handlers:
    - name: reload sshd
      ansible.builtin.systemd:
        name: ssh
        state: reloaded
@@ -69,3 +69,36 @@ server {
        add_header Content-Type text/plain;
    }
}

# ============================================================
# Ollama Local Fallback (port 11437 → 192.168.0.111:11434)
# ADR-110 tier 3: local fallback when both GCP-A and GCP-B are unavailable
# ============================================================
server {
    listen 11437;
    listen [::]:11437;
    server_name _;

    access_log /var/log/nginx/ollama-local-access.log;
    error_log /var/log/nginx/ollama-local-error.log warn;

    location / {
        proxy_pass http://192.168.0.111:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        proxy_connect_timeout 5s;
        proxy_send_timeout 300s;
        proxy_read_timeout 300s;

        proxy_buffering off;
        proxy_cache off;
    }

    location /nginx-health {
        access_log off;
        return 200 "Ollama Local Fallback Proxy OK\n";
        add_header Content-Type text/plain;
    }
}