feat(infra): ADR-110 add the missing Local Fallback + password SSH recovery tooling
Some checks failed
Ansible Lint / lint (push) Has been cancelled

Your Name
2026-05-05 00:49:14 +08:00
parent 10e665a540
commit d934242846
4 changed files with 150 additions and 0 deletions


@@ -6,6 +6,91 @@
---
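## 2026-05-05 | ADR-110 three-tier disaster recovery completed + password SSH restored on four hosts
**ADR-110 Local Fallback (port 11437)**
- `110-ollama-proxy.conf.j2`: added a server block proxying port 11437 → 192.168.0.111:11434
- `nginx-sync.yml`: added 11437 to the `wait_for` verification loop
**Password SSH restored on four hosts**
- Cause: `/etc/shadow` was read-only and the sudo password was unknown, so the password could not be changed directly
- Fix: `docker run --privileged --pid=host alpine nsenter --target 1 --mount -- chpasswd` (no sudo required)
- Result: passwords on 110/120/121 (wooo) and 188 (ollama) are all set to the Commander's password; `PasswordAuthentication yes` is in effect
- Added `infra/ansible/playbooks/restore-password-auth.yml` (so this can be managed uniformly through Ansible in future)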
---
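## 2026-05-04 (Session 2) | Ollama GCP routing officially switched over + governance infinite-loop root-cause fix
### Ollama routing, final version (ADR-110 official routing)
**Background**: the nginx proxy on 110 is set up (11435→GCP-A, 11436→GCP-B), so K8s pods can reach GCP Ollama through the proxy. The Commander requires GCP as primary and 111 as the last-resort fallback.
**Official routing (commit 40badc42)**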
```
OLLAMA_URL           = http://192.168.0.110:11435  ← GCP-A primary (via nginx proxy)
OLLAMA_SECONDARY_URL = http://192.168.0.110:11436  ← GCP-B secondary (via nginx proxy)
OLLAMA_FALLBACK_URL  = http://192.168.0.111:11434  ← 111 last-resort fallback
```
- Verification: both GCP hosts serve 10 models each, 200 OK
- Hot update: `kubectl set env` (leaves the image tag untouched, so `IMAGE_TAG_PLACEHOLDER` cannot overwrite it)
- Ansible template: `infra/ansible/roles/nginx/templates/110-ollama-proxy.conf.j2`
**⚠️ Hard-won lesson**: `kubectl apply -f 06-deployment-api.yaml` pushes the literal `IMAGE_TAG_PLACEHOLDER` to the cluster → ImagePullBackOff. Routing changes must go through `kubectl set env`; never apply the whole deployment YAML.
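The primary → secondary → fallback order of the routing table can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's routing code: `pick_ollama_url` and the `is_up` probe callback are hypothetical names, and only the three URLs come from the table above.

```python
from typing import Callable, Optional, Tuple

# Tier order copied from the routing table above.
OLLAMA_TIERS = [
    ("primary", "http://192.168.0.110:11435"),    # GCP-A via nginx proxy
    ("secondary", "http://192.168.0.110:11436"),  # GCP-B via nginx proxy
    ("fallback", "http://192.168.0.111:11434"),   # 111 last-resort fallback
]


def pick_ollama_url(is_up: Callable[[str], bool]) -> Optional[Tuple[str, str]]:
    """Return (tier, url) for the first tier whose probe succeeds, else None.

    `is_up` is a hypothetical health probe supplied by the caller.
    """
    for tier, url in OLLAMA_TIERS:
        if is_up(url):
            return tier, url
    return None
```

Passing the probe in as a callback keeps the ordering logic testable without any network access.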
---
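### Root-cause fix for `governance_fusion_complete` flooding every 20 seconds (commit a1b61289)
**Root cause A: infinite loop on the skip path**
After a skip decision, `dispatch_governance_event` wrote no record → the event stayed `resolved=False` forever → it was re-fetched every 30s → each round hit the LLM (30s timeout) + Prometheus → infinite loop. 4581 stale events had piled up.
**Root cause B: MCP score stuck at 0.2**
The SLI recording rules were not yet active in Prometheus, so `result_list=[]` → `success_count=0` → `0.2 + 0.7×0 = 0.2`. The fused confidence of 0.3565 was always below the 0.65 threshold, so every event took the skip path → making the loop worse.
**Fix (`governance_dispatcher.py` + `decision_fusion_adapter.py`)**
1. Redis cooldown key with a 90min TTL (`governance:skip:{event_id}`) to prevent repeated LLM calls
2. Stale events that have been skipped for more than 2h are automatically marked `resolved=True` (a persistent problem will be re-raised as a new event by governance_agent)
3. MCP: an empty result (no_data) ≠ failure, so it now contributes a neutral 0.5; new formula: `0.2 + 0.7×(success + 0.5×no_data)/total`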
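The new MCP formula from item 3 can be written out as a small Python function. This is a sketch for illustration only: the name `mcp_score` and the handling of `total == 0` are assumptions, not taken from `decision_fusion_adapter.py`.

```python
def mcp_score(success: int, no_data: int, total: int) -> float:
    """Score per the new formula: 0.2 + 0.7 * (success + 0.5 * no_data) / total."""
    if total <= 0:
        # Assumption: nothing was queried at all, so treat it like one no_data result.
        return 0.2 + 0.7 * 0.5
    return 0.2 + 0.7 * (success + 0.5 * no_data) / total
```

With three queries that all return no data, the score is `0.2 + 0.7×0.5 = 0.55` rather than the old floor of 0.2; three outright failures still score 0.2.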
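**Immediate mitigation**
- Bulk-resolved 4437 stale events (>2h) directly via `kubectl exec`, then cleared another 144 (>30min)
- Pre-seeded 105 Redis cooldown keys (90min TTL)
- Result: `governance_fusion_complete` dropped from once every 20s to only 1-2 times per cycle, then went silent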
---
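### META SYSTEM alert repeating every minute (ai_slo_watchdog dedup bypass)
**Root cause**: Python's `hash()` is seeded via PYTHONHASHSEED, and every Pod restart produces a different seed → identical violations hash to different values → the Redis 10min dedup is bypassed → the Pods re-send the alert on every round.
**Fix**: switched to `hashlib.sha256` + atomic `SET NX` (prevents a race between two Pods) + pre-seeded the dedup key for immediate mitigation.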
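The fix can be illustrated with a short Python sketch: a digest that is stable across processes, plus a set-if-absent check with a TTL. The key prefix `slo:dedup:`, the function names, the choice to sort the violations, and the in-memory `FakeStore` are all assumptions made for this example; the real watchdog talks to Redis.

```python
import hashlib
import json
import time
from typing import Dict, List, Optional


def violations_key(violations: List[str]) -> str:
    """Build a dedup key that is identical in every process and after every restart."""
    # Assumption: the order of violations does not matter, so sort before hashing.
    payload = json.dumps(sorted(violations), ensure_ascii=False).encode("utf-8")
    return "slo:dedup:" + hashlib.sha256(payload).hexdigest()


class FakeStore:
    """In-memory stand-in that mimics Redis `SET key value NX EX ttl`."""

    def __init__(self) -> None:
        self._expiry: Dict[str, float] = {}

    def set_nx_ex(self, key: str, ttl_seconds: float, now: Optional[float] = None) -> bool:
        """Set the key only if it is absent or expired; return True if it was set."""
        now = time.monotonic() if now is None else now
        expires = self._expiry.get(key)
        if expires is not None and expires > now:
            return False  # key still live: another caller already claimed it
        self._expiry[key] = now + ttl_seconds
        return True


def should_send_alert(store: FakeStore, violations: List[str],
                      ttl_seconds: float = 600, now: Optional[float] = None) -> bool:
    """Send only when this caller is the one that created the dedup key."""
    return store.set_nx_ex(violations_key(violations), ttl_seconds, now)
```

Against a real Redis server, the same check would typically be a single redis-py call such as `r.set(key, "1", nx=True, ex=600)`, which is atomic on the server, so two Pods cannot both win.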
---
## 2026-05-04 | Ollama routing root-cause fix: K8s → GCP:11434 block resolved (ADR-110 correction)
**Root cause confirmed**: the K8s NetworkPolicy `allow-required-egress` only opens port 443 for external egress, so GCP:11434 was always connection refused from K8s pods. From the K8s side, the ADR-110 "GCP-A primary" design never took effect, and requests had been burning Gemini quota since launch.
**Fix list** (commits 85581965 + 0a90dab1)
- B1: OFFLINE cache 30s → 5s (prevents amplification when all three nodes are cached as OFFLINE at once)
- B3: inference-layer ConnectError → DEGRADED (no longer misjudged as OFFLINE)
- B5/B6: `_current_primary` naming aligned ("ollama" → "ollama_gcp_a")
- Filled in the missing SLOW routing (failover_manager + auto_recovery)
- Telegram alerts now show AI Agent + LLM + Ollama host + token count
**Long-term fix** (this session)
- `k8s/awoooi-prod/04-configmap.yaml`: OLLAMA_URL=GCP-A(110:11435), SECONDARY=GCP-B(110:11436), FALLBACK=111
- `k8s/awoooi-prod/06-deployment-api.yaml`: synced to match
- `k8s/awoooi-prod/02-network-policy.yaml`: added pod→110:11435/11436 egress
- `110:/etc/nginx/conf.d/ollama-gcp-proxy.conf`: 11435→GCP-A, 11436→GCP-B
- `health.py check_ollama()`: changed to a three-tier round of checks (fallback up → degraded)
- `failover_manager routing_reason`: dynamic IP label (GCP-A/GCP-B no longer hard-coded)
**Verification result**: ArgoCD Synced + Healthy; both GCP hosts serve 10 models each; NetworkPolicy + nginx proxy working
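The "fallback up → degraded" rule of the three-tier health check can be sketched as below. This is a hedged illustration, not the contents of `health.py`: the function name and the `HEALTHY` label are hypothetical, and treating "primary down, secondary up" as `DEGRADED` is an assumption the changelog does not spell out.

```python
def classify_ollama_health(primary_up: bool, secondary_up: bool, fallback_up: bool) -> str:
    """Map the three tier probes to a single status string."""
    if primary_up:
        return "HEALTHY"   # hypothetical label for the normal case
    if secondary_up or fallback_up:
        return "DEGRADED"  # a lower tier still serves requests
    return "OFFLINE"       # no tier reachable
```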
---
## 2026-05-04 | AwoooP Phase 6-8 wrapped up (EwoooC Onboarding / Channel Hub / Approval Token)
### Phase 6: EwoooC Tenant Onboarding (ADR-115)


@@ -90,6 +90,7 @@
loop:
- 11435 # GCP-A
- 11436 # GCP-B
- 11437 # Local Fallback
tags: ["110", "nginx", "ollama-proxy"]
handlers:


@@ -0,0 +1,31 @@
---
# Restore SSH password authentication (authorized by the Commander, 2026-05-04)
# Run: ansible-playbook -i inventory/hosts.yml playbooks/restore-password-auth.yml --extra-vars @/tmp/awoooi-pw.yml
# Delete /tmp/awoooi-pw.yml when done
- name: "Restore SSH password authentication on four hosts"
  hosts: host_110,host_120,host_121,host_188
  become: true
  vars:
    ansible_become_pass: "{{ user_password }}"
  tasks:
    - name: "SSH | Enable PasswordAuthentication"
      ansible.builtin.lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^#?PasswordAuthentication'
        line: 'PasswordAuthentication yes'
        state: present
      notify: reload sshd
    - name: "SSH | Set the system user's password"
      ansible.builtin.user:
        name: "{{ ansible_user }}"
        password: "{{ user_password | password_hash('sha512') }}"
      no_log: true
  handlers:
    - name: reload sshd
      ansible.builtin.systemd:
        name: ssh
        state: reloaded


@@ -69,3 +69,36 @@ server {
add_header Content-Type text/plain;
}
}
# ============================================================
# Ollama Local Fallback (port 11437 → 192.168.0.111:11434)
# ADR-110 third tier: local fallback when both GCP-A and GCP-B are unavailable
# ============================================================
server {
    listen 11437;
    listen [::]:11437;
    server_name _;
    access_log /var/log/nginx/ollama-local-access.log;
    error_log /var/log/nginx/ollama-local-error.log warn;
    location / {
        proxy_pass http://192.168.0.111:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_connect_timeout 5s;
        proxy_send_timeout 300s;
        proxy_read_timeout 300s;
        proxy_buffering off;
        proxy_cache off;
    }
    location /nginx-health {
        access_log off;
        return 200 "Ollama Local Fallback Proxy OK\n";
        add_header Content-Type text/plain;
    }
}