diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md index 84525489..3174929d 100644 --- a/docs/LOGBOOK.md +++ b/docs/LOGBOOK.md @@ -6,6 +6,91 @@ --- +## 2026-05-05 | ADR-110 三層容災補齊 + 四台主機密碼 SSH 恢復 + +**ADR-110 Local Fallback(port 11437)**: +- `110-ollama-proxy.conf.j2`:新增 port 11437 → 192.168.0.111:11434 server block +- `nginx-sync.yml`:wait_for loop 補 11437 驗證 + +**四台主機密碼 SSH 恢復**: +- 原因:`/etc/shadow` 唯讀 + sudo 密碼不明 → 無法直接改 +- 解法:`docker run --privileged --pid=host alpine nsenter --target 1 --mount -- chpasswd`(不需 sudo) +- 結果:110/120/121(wooo)、188(ollama)密碼全設為統帥密碼,PasswordAuthentication yes 已生效 +- 新增 `infra/ansible/playbooks/restore-password-auth.yml`(未來可用 Ansible 統一管理) + +--- + +## 2026-05-04 (Session 2) | Ollama GCP 路由正式切換 + governance 無限迴圈根修 + +### Ollama 路由最終版(ADR-110 正式路由) + +**背景**:110 nginx proxy 架設完成(11435→GCP-A、11436→GCP-B),K8s pod 可透過 proxy 存取 GCP Ollama。統帥要求 GCP 為 primary,111 為兜底。 + +**正式路由(commit 40badc42)**: +``` +OLLAMA_URL = http://192.168.0.110:11435 ← GCP-A primary(via nginx proxy) +OLLAMA_SECONDARY_URL = http://192.168.0.110:11436 ← GCP-B secondary(via nginx proxy) +OLLAMA_FALLBACK_URL = http://192.168.0.111:11434 ← 111 兜底 +``` +- 驗證:兩台 GCP 各 10 個模型,200 OK +- 熱更新:`kubectl set env`(不動 image tag,避免 IMAGE_TAG_PLACEHOLDER 蓋掉) +- Ansible template:`infra/ansible/roles/nginx/templates/110-ollama-proxy.conf.j2` + +**⚠️ 血的教訓**:`kubectl apply -f 06-deployment-api.yaml` 會把 `IMAGE_TAG_PLACEHOLDER` 推上去 → ImagePullBackOff。路由變更必須用 `kubectl set env`,不可 apply 整個 deployment yaml。 + +--- + +### governance_fusion_complete 每 20 秒狂刷根修(commit a1b61289) + +**根因 A — skip 路徑無限迴圈**: +`dispatch_governance_event` skip 決策後不寫任何記錄 → 事件永遠 `resolved=False` → 每 30s 重撈 → 每輪打 LLM(30s timeout)+ Prometheus → 無限迴圈。積壓 4581 筆 stale 事件。 + +**根因 B — MCP 評分卡 0.2**: +SLI recording rules 尚未在 Prometheus 生效,`result_list=[]` → `success_count=0` → `0.2 + 0.7×0 = 0.2`。融合信心度 0.3565 永遠 < 0.65 threshold,全部走 skip → 加重迴圈。 + +**修復(`governance_dispatcher.py` + `decision_fusion_adapter.py`)**: +1. Redis 90min 冷卻鍵(`governance:skip:{event_id}`)防重複 LLM 呼叫 +2. Skip 超過 2h 的 stale 事件自動標 `resolved=True`(持久問題會由 governance_agent 重新產生新事件) +3. MCP:empty result(no_data)≠ 故障,改給 0.5 中性貢獻;新公式:`0.2 + 0.7×(success + 0.5×no_data)/total` + +**即時止血**: +- `kubectl exec` 直接 bulk-resolve 4437 筆 stale 事件(>2h)+ 再清 144 筆(>30min) +- 預設 Redis 冷卻鍵 105 筆(90min TTL) +- 結果:`governance_fusion_complete` 從每 20s → 每 cycle 僅 1-2 次,最後靜音 + +--- + +### META SYSTEM 告警每分鐘重複(ai_slo_watchdog dedup bypass) + +**根因**:Python `hash()` PYTHONHASHSEED 每次 Pod 重啟產生不同 seed → 同樣 violations 的 hash 值不同 → bypass Redis 10min dedup → 每輪 Pod 都重發告警。 + +**修復**:改 `hashlib.sha256` + atomic `SET NX`(防兩 Pod 競態)+ 預設 dedup key 立即止血。 + +--- + +## 2026-05-04 | Ollama 路由根本修復 — K8s → GCP:11434 封鎖破解(ADR-110 修正) + +**根因確認**:K8s NetworkPolicy `allow-required-egress` 外網 egress 只開 port 443,GCP:11434 從 K8s pod 永遠 connection refused。ADR-110「GCP-A primary」設計從 K8s 視角從未生效,自上線起一直燒 Gemini quota。 + +**修復清單**(commits 85581965 + 0a90dab1): +- B1:OFFLINE cache 30s → 5s(防三節點同時快取放大) +- B3:推理層 ConnectError → DEGRADED(不再誤判 OFFLINE) +- B5/B6:`_current_primary` 命名對齊("ollama" → "ollama_gcp_a") +- SLOW 路由缺失補全(failover_manager + auto_recovery) +- Telegram 告警顯示 AI Agent + LLM + Ollama 主機 + Token 數 + +**長期修復**(本次 session): +- `k8s/awoooi-prod/04-configmap.yaml`:OLLAMA_URL=GCP-A(110:11435), SECONDARY=GCP-B(110:11436), FALLBACK=111 +- `k8s/awoooi-prod/06-deployment-api.yaml`:同步 +- `k8s/awoooi-prod/02-network-policy.yaml`:新增 pod→110:11435/11436 egress +- `110:/etc/nginx/conf.d/ollama-gcp-proxy.conf`:11435→GCP-A, 11436→GCP-B +- `health.py check_ollama()`:改三層輪查,fallback up → degraded +- `failover_manager routing_reason`:動態 IP label,不再硬編碼 GCP-A/GCP-B + +**驗證結果**:ArgoCD Synced Healthy;兩台 GCP 各 10 models;NetworkPolicy + nginx proxy 正常 + +--- + ## 2026-05-04 | AwoooP Phase 6-8 完收(EwoooC Onboarding / Channel Hub / Approval Token) ### Phase 6: EwoooC Tenant Onboarding(ADR-115) diff --git a/infra/ansible/playbooks/nginx-sync.yml b/infra/ansible/playbooks/nginx-sync.yml index fd9a35b0..e3674929 100644 --- a/infra/ansible/playbooks/nginx-sync.yml +++ b/infra/ansible/playbooks/nginx-sync.yml @@ -90,6 +90,7 @@ loop: - 11435 # GCP-A - 11436 # GCP-B + - 11437 # Local Fallback tags: ["110", "nginx", "ollama-proxy"] handlers: diff --git a/infra/ansible/playbooks/restore-password-auth.yml b/infra/ansible/playbooks/restore-password-auth.yml new file mode 100644 index 00000000..0b1eac1d --- /dev/null +++ b/infra/ansible/playbooks/restore-password-auth.yml @@ -0,0 +1,31 @@ +--- +# 恢復 SSH 密碼驗證 — 統帥授權 2026-05-04 +# 執行: ansible-playbook -i inventory/hosts.yml playbooks/restore-password-auth.yml --extra-vars @/tmp/awoooi-pw.yml +# 完成後刪除 /tmp/awoooi-pw.yml + +- name: "四台主機恢復 SSH 密碼驗證" + hosts: host_110,host_120,host_121,host_188 + become: true + vars: + ansible_become_pass: "{{ user_password }}" + + tasks: + - name: "SSH | 啟用 PasswordAuthentication" + ansible.builtin.lineinfile: + path: /etc/ssh/sshd_config + regexp: '^#?PasswordAuthentication' + line: 'PasswordAuthentication yes' + state: present + notify: reload sshd + + - name: "SSH | 設定系統使用者密碼" + ansible.builtin.user: + name: "{{ ansible_user }}" + password: "{{ user_password | password_hash('sha512') }}" + no_log: true + + handlers: + - name: reload sshd + ansible.builtin.systemd: + name: ssh + state: reloaded diff --git a/infra/ansible/roles/nginx/templates/110-ollama-proxy.conf.j2 b/infra/ansible/roles/nginx/templates/110-ollama-proxy.conf.j2 index 94f9a95d..02628478 100644 --- a/infra/ansible/roles/nginx/templates/110-ollama-proxy.conf.j2 +++ b/infra/ansible/roles/nginx/templates/110-ollama-proxy.conf.j2 @@ -69,3 +69,36 @@ server { add_header Content-Type text/plain; } } + +# ============================================================ +# Ollama Local Fallback (port 11437 → 192.168.0.111:11434) +# ADR-110 第三層:GCP-A 與 GCP-B 均不可用時的本機備援 +# ============================================================ +server { + listen 11437; + listen [::]:11437; + server_name _; + + access_log /var/log/nginx/ollama-local-access.log; + error_log /var/log/nginx/ollama-local-error.log warn; + + location / { + proxy_pass http://192.168.0.111:11434; + proxy_set_header Host $host; + proxy_set_header X-Real-IP $remote_addr; + proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; + + proxy_connect_timeout 5s; + proxy_send_timeout 300s; + proxy_read_timeout 300s; + + proxy_buffering off; + proxy_cache off; + } + + location /nginx-health { + access_log off; + return 200 "Ollama Local Fallback Proxy OK\n"; + add_header Content-Type text/plain; + } +}