# Phase 2 Deployment Verification Playbook (ADR-027 fully landed)
> **Date**: 2026-05-03
> **Phase**: Operation Ollama-First v5.0, Phase 2 (A6 debugger)
> **Fixes**: B1 / B2 / B3 / B4 / N2 / N3
> **Modified files**: `config.py` / `services/ollama_service.py` / `services/aider_heal_executor.py` / `services/code_review_pipeline_service.py`
> **New file**: `tests/test_ollama_resolve.py` (13 tests, passing locally)
---
## 1. Pre-deployment dry run (local)
### 1.1 Syntax check
```bash
cd "/Users/ooo/Library/Mobile Documents/com~apple~CloudDocs/momo-pro-system"
python3 -m py_compile config.py services/ollama_service.py \
services/aider_heal_executor.py services/code_review_pipeline_service.py \
tests/test_ollama_resolve.py && echo "PYCOMPILE_OK"
```
Expected: `PYCOMPILE_OK` (verified).
### 1.2 Unit test
```bash
MOMO_ALLOW_INSECURE_CONFIG_FOR_TESTS=true /opt/anaconda3/bin/python3 -m pytest \
tests/test_ollama_resolve.py \
tests/test_phase3f_cleanup_contracts.py \
tests/test_app_startup_contracts.py \
tests/test_ai_call_logger.py \
tests/test_code_review_pipeline_security.py \
tests/test_auto_heal_safety.py -v
```
Expected: 56 passed (13 new + 43 existing). Verified.
### 1.3 Import consistency
```bash
MOMO_ALLOW_INSECURE_CONFIG_FOR_TESTS=true /opt/anaconda3/bin/python3 -c "
from config import get_ollama_host, get_hermes_url, get_embedding_host
from services.ollama_service import resolve_ollama_host, mark_unhealthy
print('get_ollama_host =', get_ollama_host())
print('get_hermes_url =', get_hermes_url())
print('get_embedding_host =', get_embedding_host())
print('resolve_ollama_host=', resolve_ollama_host())
"
```
Expected (when the network is up): all four lines print `http://34.21.145.224:11434` (GCP reachable) or `http://192.168.0.111:11434` (GCP unreachable).
`https://ollama.wooo.work/ollama` (the old hard-coded URL) must not appear.
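
The check in 1.3 can also be scripted instead of read by eye. The following is a minimal sketch, assuming the four getters return plain URL strings; `check_hosts` is a hypothetical helper for illustration and is not part of the repository.

```python
# The old hard-coded URL that no getter may return any more.
LEGACY_URL = "https://ollama.wooo.work/ollama"
# The two hosts this playbook expects: GCP primary and LAN fallback.
EXPECTED = {"http://34.21.145.224:11434", "http://192.168.0.111:11434"}


def check_hosts(resolved: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the check passed."""
    problems = []
    for name, url in resolved.items():
        normalized = url.rstrip("/")
        if normalized == LEGACY_URL:
            problems.append(f"{name} still returns the legacy URL")
        elif normalized not in EXPECTED:
            problems.append(f"{name} returned an unexpected host: {url}")
    # All getters are expected to agree on one host at a given moment.
    if len({u.rstrip("/") for u in resolved.values()}) > 1:
        problems.append("getters disagree on the resolved host")
    return problems


if __name__ == "__main__":
    sample = {
        "get_ollama_host": "http://34.21.145.224:11434",
        "get_hermes_url": "http://34.21.145.224:11434",
    }
    print(check_hosts(sample) or "OK")
```

In practice the dictionary would be filled from the real getters (`get_ollama_host()`, `get_hermes_url()`, and so on) inside the same environment used for the command above.
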
---
## 2. Post-deployment verification (SSH to 188)
### 2.1 Container health
```bash
ssh wooo@192.168.0.110 "ssh ollama@192.168.0.188 \"\
docker ps --format '{{.Names}} | {{.Status}}' | grep momo-; \
docker exec momo-pro python3 -c 'from config import get_ollama_host; print(get_ollama_host())' 2>&1\""
```
Expected:
- `momo-pro | Up` (a new container after the restart)
- The printed host is not `https://ollama.wooo.work/ollama`
### 2.2 OllamaHost resolution log (B3 HTTP probe verification)
```bash
ssh wooo@192.168.0.110 "ssh ollama@192.168.0.188 \"\
docker logs momo-pro --since 10m 2>&1 | grep -E 'OllamaHost' | tail -20\""
```
Expected (GCP reachable); the log line says the GCP host is available and Primary is used:
```
[OllamaHost] GCP 主機可用,使用 Primary: http://34.21.145.224:11434
```
Expected (GCP down); the log line says the GCP host cannot be reached and the resolver switched to Fallback:
```
[OllamaHost] GCP 主機無法連線,自動切換 Fallback: http://192.168.0.111:11434
```
Rare (hung process: TCP still connects but HTTP does not answer):
```
[OllamaHost] GCP HTTP 探測失敗但 TCP 仍通,疑似 process 卡死http://34.21.145.224:11434
[OllamaHost] GCP 主機無法連線,自動切換 Fallback: http://192.168.0.111:11434
```
> The third log pattern is a **new observability capability that only appears after the Phase 2 fix**; the old TCP-only probe never printed it.
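
The three outcomes above (healthy, unreachable, hung process) follow from probing HTTP first and using TCP only as a secondary observation. The sketch below illustrates that idea under stated assumptions: it uses Ollama's `GET /api/version` endpoint with a 2s timeout as described in fix B3, but `tcp_open`, `http_alive` and `classify` are hypothetical names, not the actual `_is_reachable` implementation.

```python
import http.client
import socket
import urllib.error
import urllib.request
from urllib.parse import urlparse


def tcp_open(base_url: str, timeout: float = 2.0) -> bool:
    """True if a plain TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(base_url)
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_alive(base_url: str, timeout: float = 2.0) -> bool:
    """True if GET <base_url>/api/version answers with HTTP 200 in time."""
    url = f"{base_url.rstrip('/')}/api/version"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        return False


def classify(base_url: str, http_probe=http_alive, tcp_probe=tcp_open) -> str:
    """Classify a host as 'healthy', 'hung' (TCP up, HTTP down) or 'unreachable'."""
    if http_probe(base_url):
        return "healthy"
    if tcp_probe(base_url):
        return "hung"
    return "unreachable"
```

The probes are injectable so the classification can be unit-tested without touching the network; only the `"hung"` branch corresponds to the third log pattern.
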
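### 2.3 mark_unhealthy trigger (B4 verification)
When an LLM generate call actually fails, the log shows the host being marked unhealthy and skipped for 30s: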
```
[OllamaHost] 主機標記為 unhealthy30s 跳過http://34.21.145.224:11434
```
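Then check the log of the very next Ollama call of any kind; Primary should still be inside the unhealthy TTL and be skipped in favor of fallback: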
```
[OllamaHost] Primary http://34.21.145.224:11434 仍在 unhealthy TTL 內,跳過直接 fallback: http://192.168.0.111:11434
```
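
The behavior verified in 2.3 can be pictured as a deadline stored next to the resolver. This is a minimal sketch, assuming a 30s TTL as in the log message; `HostResolver` is a hypothetical class, it leaves out the separate result cache, and it is not the code in `services/ollama_service.py`.

```python
import time


class HostResolver:
    """Prefers the primary host unless it was recently marked unhealthy."""

    def __init__(self, primary, fallback, probe, ttl=30.0, clock=time.monotonic):
        self.primary = primary
        self.fallback = fallback
        self.probe = probe          # callable(host) -> bool
        self.ttl = ttl              # seconds to skip the primary after a failure
        self.clock = clock          # injectable for tests
        self._unhealthy_until = 0.0

    def mark_unhealthy(self) -> None:
        """Called when a generate or embedding request against the primary fails."""
        self._unhealthy_until = self.clock() + self.ttl

    def resolve(self) -> str:
        # Inside the TTL window the primary is skipped without probing.
        if self.clock() < self._unhealthy_until:
            return self.fallback
        return self.primary if self.probe(self.primary) else self.fallback
```

Using a monotonic clock avoids surprises when the wall clock is adjusted while a host is marked unhealthy.
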
### 2.4 AiderHeal OLLAMA_API_BASE made dynamic (N2 verification)
The next time AiderHeal triggers, grep:
```bash
ssh wooo@192.168.0.110 "ssh ollama@192.168.0.188 \"\
docker logs momo-pro --since 30m 2>&1 | grep 'aider_ollama_api_base' | tail -5\""
```
Expected:
```
event=aider_ollama_api_base host=http://34.21.145.224:11434
```
That is the output when GCP is reachable; on fallback the line shows `host=http://192.168.0.111:11434`.
It must **never** still show `http://192.168.0.111:11434` while GCP is reachable.
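
The point of N2 is that the host is resolved on every execution instead of once at import time. A minimal sketch of that pattern follows; `build_aider_env` is a hypothetical helper and `resolve_host` stands in for whatever resolver the executor actually calls.

```python
import os
from typing import Callable, Mapping, Optional


def build_aider_env(
    resolve_host: Callable[[], str],
    base_env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Build the subprocess environment, resolving the Ollama host right now."""
    env = dict(os.environ if base_env is None else base_env)
    # Resolved per call, so a GCP outage or recovery is picked up next run.
    env["OLLAMA_API_BASE"] = resolve_host()
    return env
```

A module-level constant such as `OLLAMA_API_BASE = resolve()` would freeze the value for the lifetime of the process, which is the failure mode this check guards against.
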
### 2.5 Code Review provider tag (N3 verification)
After the next Code Review pipeline run:
```bash
ssh wooo@192.168.0.110 "ssh ollama@192.168.0.188 \"\
docker exec momo-postgres psql -U momo -d momo_analytics -c \
\\\"SELECT caller, provider, meta->>'host' AS host \
FROM ai_calls \
WHERE caller = 'code_review_hermes' \
ORDER BY created_at DESC LIMIT 5;\\\"\""
```
Expected (GCP reachable):
```
caller | provider | host
code_review_hermes | gcp_ollama | http://34.21.145.224:11434
```
It must never still be tagged `ollama_111` when the host is GCP.
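
One way to keep the `provider` column consistent with `meta->>'host'` is to derive the tag from the resolved URL rather than hard-coding it. The sketch below assumes that `ollama_111` corresponds to `192.168.0.111` (inferred from the tag name, not confirmed by the source); `provider_for` and the `"unknown"` default are hypothetical.

```python
from urllib.parse import urlparse

# Assumed mapping from host IP to the provider tag stored in ai_calls.
PROVIDER_BY_HOST = {
    "34.21.145.224": "gcp_ollama",
    "192.168.0.111": "ollama_111",
}


def provider_for(url: str) -> str:
    """Derive the provider tag from the URL actually used for the call."""
    return PROVIDER_BY_HOST.get(urlparse(url).hostname or "", "unknown")
```

Deriving both fields from the same resolved value makes the mismatch checked in 2.5 impossible by construction.
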
---
## 3. Simulated failure verification (optional)
### 3.1 Simulate GCP unreachable → fallback within 5s
Temporarily block the GCP IP on 188:
```bash
ssh wooo@192.168.0.110 "ssh ollama@192.168.0.188 \"\
sudo iptables -A OUTPUT -d 34.21.145.224 -j DROP\""
```
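Immediately trigger sales copy (or any other LLM entry point) and watch the log:
- The first call should time out (`_is_reachable` fails within 2s) → switch to fallback
- All calls in the following 30s go straight to fallback
- After 30s the cache TTL expires and the host is probed again (still blocked: stay on fallback; unblocked: return to GCP)
Restore: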
```bash
ssh wooo@192.168.0.110 "ssh ollama@192.168.0.188 \"\
sudo iptables -D OUTPUT -d 34.21.145.224 -j DROP\""
```
> This step requires commander-level authority; the debugger does not run it.
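
The timeline in 3.1 (probe once, reuse the result for 30s, probe again) can be rehearsed offline with a fake clock before touching iptables. This is a minimal sketch under stated assumptions: `CachedResolver` is a hypothetical class that models only the 30s result cache and is not the production resolver.

```python
import time


class CachedResolver:
    """Caches the probe decision for ttl seconds before probing again."""

    def __init__(self, primary, fallback, probe, ttl=30.0, clock=time.monotonic):
        self.primary = primary
        self.fallback = fallback
        self.probe = probe      # callable(host) -> bool
        self.ttl = ttl
        self.clock = clock      # injectable so tests can advance time
        self._cached = None
        self._expires = 0.0

    def resolve(self) -> str:
        now = self.clock()
        # Probe only when nothing is cached or the cached decision expired.
        if self._cached is None or now >= self._expires:
            reachable = self.probe(self.primary)
            self._cached = self.primary if reachable else self.fallback
            self._expires = now + self.ttl
        return self._cached
```

With a fake clock, flipping the probe result from blocked to unblocked and stepping past 30s reproduces the "return to GCP" case from the last bullet.
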
---
## 4. Rollback SOP
If something goes wrong after deployment, the fastest rollback is:
```bash
git revert <this-commit-sha>
git push origin main
# Wait for Gitea CD to deploy automatically
```
You can also roll back only `services/ollama_service.py` and `config.py`:
```bash
git checkout HEAD~1 -- services/ollama_service.py config.py
```
(The changes in the other three files can be kept independently.)
---
## 5. Commit draft
```
[V-New] ADR-027 Phase 2: fully lazy Ollama host resolution + HTTP probe + unhealthy marking
Six fixes so that ADR-027 "GCP first" truly lands 100%:
B1: config.OLLAMA_HOST now resolves lazily (hard-coded ollama.wooo.work URL removed)
B2: config.EMBEDDING_HOST / HERMES_URL now lazy (avoids import-time freeze)
B3: _is_reachable now uses an HTTP probe (/api/version, 2s timeout); TCP kept as an observation point
B4: added mark_unhealthy(); generate / embedding failures mark the host for 30s (cache invalidated)
N2: aider_heal_executor.OLLAMA_API_BASE now resolves lazily (re-evaluated on every execute)
N3: code_review_pipeline_service Hermes scan now calls get_hermes_url() instead of a frozen value
Added: tests/test_ollama_resolve.py (13 tests)
Changed: config.py / services/ollama_service.py /
services/aider_heal_executor.py / services/code_review_pipeline_service.py
Verified: 56 tests green (13 new + 43 existing regression); py_compile green.
Verification playbook: docs/phase2_deploy_verify_20260503.md (for the commander to run over SSH on 188).
```