Harden CD health check retries
All checks were successful
CD Pipeline / deploy (push) Successful in 1m32s

OoO
2026-04-30 08:58:22 +08:00
parent 9dd5986077
commit 5a569d1e05
9 changed files with 44 additions and 13 deletions


@@ -224,17 +224,21 @@ jobs:
       # ── Health check (H3: dual verification of HTTP + three-container status) ─────────────────────────
       - name: 健康檢查
         run: |
-          echo "⏳ 等待服務啟動(15s..."
-          sleep 15
-          for i in $(seq 1 5); do
-            HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" https://mo.wooo.work/health --max-time 10 || echo "000")
-            if [ "$HTTP_CODE" = "200" ]; then
-              echo "✅ HTTP 健康檢查通過HTTP $HTTP_CODE"
+          echo "⏳ 等待服務啟動(30s..."
+          sleep 30
+          for i in $(seq 1 12); do
+            INTERNAL_CODE=$(ssh -i ~/.ssh/id_deploy ollama@192.168.0.188 \
+              "docker exec momo-pro-system curl -s -o /dev/null -w '%{http_code}' --max-time 8 http://127.0.0.1:80/health" 2>/dev/null || true)
+            EXTERNAL_CODE=$(curl -s -o /dev/null -w "%{http_code}" https://mo.wooo.work/health --max-time 10 2>/dev/null || true)
+            INTERNAL_CODE=${INTERNAL_CODE:-000}
+            EXTERNAL_CODE=${EXTERNAL_CODE:-000}
+            if [ "$INTERNAL_CODE" = "200" ] && [ "$EXTERNAL_CODE" = "200" ]; then
+              echo "✅ HTTP 健康檢查通過internal=$INTERNAL_CODE, external=$EXTERNAL_CODE"
               break
             fi
-            echo "⏳ 嘗試 $i/5HTTP $HTTP_CODE等待 10s..."
-            [ "$i" -eq 5 ] && echo "❌ HTTP 健康檢查失敗" && exit 1
-            sleep 10
+            echo "⏳ 嘗試 $i/12internal=$INTERNAL_CODE external=$EXTERNAL_CODE等待 15s..."
+            [ "$i" -eq 12 ] && echo "❌ HTTP 健康檢查失敗" && exit 1
+            sleep 15
           done
           # Verify that all three application containers are in the Running state
           ssh -i ~/.ssh/id_deploy ollama@192.168.0.188 \
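
The new retry window can be checked with simple arithmetic. The step sleeps once before the first attempt and after every attempt except the last one. The old loop therefore waited at most 15 + 4 × 10 = 55 s, and the new one waits 30 + 11 × 15 = 195 s, which is the "about 3 minutes" quoted in the docs. The Python sketch below models that budget and the new pass rule: both the internal and external probes must return 200, and an empty result is treated as `000`. It is an illustrative model, not code from the repository. `sleep_budget`, `normalize` and `first_passing_attempt` are hypothetical names.

```python
from typing import Iterable, Optional, Tuple


def sleep_budget(warmup_s: int, attempts: int, interval_s: int) -> int:
    """Worst-case seconds spent in `sleep` by the health-check step.

    One warm-up sleep, then one sleep after every attempt except the last,
    which exits (pass or `exit 1`) instead of sleeping.
    """
    return warmup_s + (attempts - 1) * interval_s


def normalize(code: Optional[str]) -> str:
    """Mirror the shell's `${CODE:-000}`: empty or missing counts as 000."""
    return code if code else "000"


def first_passing_attempt(
    probes: Iterable[Tuple[Optional[str], Optional[str]]], attempts: int
) -> Optional[int]:
    """Return the 1-based attempt where internal AND external were both 200.

    `probes` yields (internal, external) status codes per attempt; probes
    beyond `attempts` are ignored. Returns None if the check would fail.
    """
    for i, (internal, external) in enumerate(probes, start=1):
        if i > attempts:
            break
        if normalize(internal) == "200" and normalize(external) == "200":
            return i
    return None


if __name__ == "__main__":
    print(sleep_budget(15, 5, 10))   # old loop
    print(sleep_budget(30, 12, 15))  # new loop
    # Container already healthy, Nginx/tunnel still answering 502 twice:
    probes = [("200", "502"), ("200", "502"), ("200", "200")]
    print(first_passing_attempt(probes, attempts=12))
```

The budget counts sleeps only. Each attempt can also spend up to 8 s on the internal curl and 10 s on the external one, plus SSH overhead. A run in which every probe times out therefore lasts noticeably longer than 195 s.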


@@ -2,7 +2,7 @@
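 > This document defines the core guidelines for project development and the rules that must never be broken.
 > **Created**: 2026-01-12
-> **Current version**: V10.11 (four-AI-Agent automation Metrics Scrape fix release)
+> **Current version**: V10.12 (CD health check hardening release)
 > **Last updated**: 2026-04-29
 ---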


@@ -18,6 +18,7 @@
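 - Grafana production deployment: the active Grafana on 188 has loaded 4 dashboards, and `MOMO AI Automation Overview` was provisioned successfully.
 - Prometheus scrape fix: the active monitoring stack gained a `momo-app` scrape job targeting `momo-pro-system:80/metrics`.
 - Gunicorn preload fix: `post_fork` skips Flask/Werkzeug request-bound LocalProxy objects to avoid worker boot failures.
+- CD health check hardening: switched to a dual check (internal container health + external `mo.wooo.work`) and extended the retry window to about 3 minutes.
 【Next to-do】
 - After the Prometheus scrape fix, check whether `momo_ai_*` time series appear once events occur.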
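
For the to-do above, one quick check is whether the app's `/metrics` endpoint exports any sample lines whose names start with `momo_ai_`. That endpoint is the `momo-app` scrape target, `momo-pro-system:80/metrics`. The minimal sketch below assumes the Prometheus text exposition has already been fetched as a string. The helper name and the sample metric `momo_ai_events_total` are hypothetical. The check only shows what the app exposes. Whether Prometheus has actually stored the series still has to be confirmed in Prometheus itself.

```python
import re
from typing import Set

# A sample line looks like: metric_name{optional="labels"} value [timestamp]
_SAMPLE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{|\s)")


def metric_names_with_prefix(exposition: str, prefix: str = "momo_ai_") -> Set[str]:
    """Collect names of metrics that have at least one sample line.

    `# HELP` / `# TYPE` comment lines and blank lines are skipped, so a
    metric that is declared but has no samples yet is not reported.
    """
    names: Set[str] = set()
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match and match.group(1).startswith(prefix):
            names.add(match.group(1))
    return names


if __name__ == "__main__":
    sample = (
        "# HELP momo_ai_events_total Hypothetical counter for illustration\n"
        "# TYPE momo_ai_events_total counter\n"
        'momo_ai_events_total{route="safe_action"} 3\n'
        "process_cpu_seconds_total 12.5\n"
    )
    print(sorted(metric_names_with_prefix(sample)))
```

Skipping comment lines is deliberate. With the Python `prometheus_client` library, for example, a labelled counter that has never been incremented exports only its HELP/TYPE lines. That behaviour matches the to-do's wording about series appearing "once events occur".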

app.py

@@ -95,8 +95,8 @@ except Exception as e:
     sys_log.error(f"無法檢測磁碟空間: {e}")
 # 🚩 System version definition (used for backups and display)
-# 🚩 2026-04-30 V10.11: Gunicorn preload guard + AI metrics scrape
-SYSTEM_VERSION = "V10.11"
+# 🚩 2026-04-30 V10.12: CD health check internal/external hardening
+SYSTEM_VERSION = "V10.12"
 # ==========================================
 # 🔒 SQL injection protection functions


@@ -253,7 +253,7 @@ YOUTUBE_API_KEY = os.getenv('YOUTUBE_API_KEY', '')
 # ==========================================
 # System version and paths
 # ==========================================
-SYSTEM_VERSION = "V10.11"
+SYSTEM_VERSION = "V10.12"
 LOG_FILE_PATH = os.path.join(BASE_DIR, 'logs/system.log')
 public_url = PUBLIC_URL # Used for template display


@@ -63,6 +63,7 @@
 - **Cause**: The SSH tunnel between 110 and 188 is down.
 - **Check**: On 110, run `curl -I http://127.0.0.1:5003/health`
 - **Fix**: On 110, run `ssh -fN -L 5003:127.0.0.1:5003 ollama@192.168.0.188` to restart the tunnel.
+- **CD triage**: First confirm the internal check on 188 (`docker exec momo-pro-system curl http://127.0.0.1:80/health`), then the external `https://mo.wooo.work/health`. If internal already returns 200 but external returns 502, the cause is most likely a transient Nginx/tunnel delay.
 ### 2. CI/CD error `parent snapshot ... not found`
 - **Cause**: Corrupted Docker Buildx cache.
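
The rule added above for CD failures reads naturally as a small decision table: check the internal probe on 188 first, then the external URL. The minimal Python sketch below assumes the two status codes come from the internal `docker exec ... /health` probe and the external `https://mo.wooo.work/health` probe. The function name `triage` is hypothetical. Only the "internal 200, external 502" case comes from the doc. The messages for the other cases are illustrative assumptions.

```python
def triage(internal: str, external: str) -> str:
    """Suggest where to look after a CD health-check failure.

    Follows the documented order: internal probe on 188 first, then the
    external URL.
    """
    if internal == "200" and external == "200":
        return "healthy"
    if internal == "200" and external == "502":
        # Documented case: the app answers inside the container, so the
        # 502 most likely comes from a transient Nginx/tunnel delay.
        return "likely transient Nginx/tunnel delay; re-check external"
    if internal == "200":
        # Assumption: any other external failure is still on the path in
        # front of the app rather than in the app itself.
        return "app healthy internally; inspect external path (Nginx/tunnel)"
    # Assumption: a failing internal probe points at the app container.
    return "internal check failing; inspect momo-pro-system container"


if __name__ == "__main__":
    print(triage("200", "502"))
```

Keeping internal status first in the branch order mirrors the troubleshooting note. A 502 at the edge only says something about Nginx or the tunnel once the container itself is known to be healthy.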


@@ -18,6 +18,7 @@
 - 2026-04-30 The active Grafana has loaded 4 dashboards; the AI dashboard files were synced to the directory actually mounted on 188, `monitoring/grafana/provisioning/dashboards/json/`
 - 2026-04-30 Added the `momo-app` scrape job to the active Prometheus (target `momo-pro-system:80/metrics`); Prometheus must join `momo-network` to resolve the app container's DNS name.
 - 2026-04-30 Found and fixed a bug where `post_fork` in `gunicorn.conf.py` picked up Flask/Werkzeug LocalProxy objects while scanning, causing worker boot failures.
+- 2026-04-30 The CD health check had failed too early because of a brief 502 after rebuild. It now runs a dual check, internal `docker exec momo-pro-system /health` + external `https://mo.wooo.work/health`, and retries for about 3 minutes.
 ## Implemented scope
@@ -48,6 +49,7 @@
 - 2026-04-29 AI Grafana observability + AI core regression: `36 passed`; collect-only: `36 tests collected`
 - 2026-04-30 Gunicorn LocalProxy fix: added `tests/test_gunicorn_config.py`
 - 2026-04-30 Prometheus scrape fix: added `tests/test_prometheus_ai_automation_scrape.py`
+- 2026-04-30 CD health check hardening: added `tests/test_cd_health_check.py`
 - 2026-04-29 L2 safety-memory batch: `24 passed`
 - collect-only: `48 tests collected`
 - `git diff --check` passed.


@@ -31,6 +31,7 @@
 - **Smoke daily summary push**: Added a manual Telegram push API and a daily 09:10 momo-scheduler summary job; both only read smoke history.
 - **Grafana AI observability**: Added the `MOMO AI Automation Overview` provisioned dashboard, covering EventRouter, safe action, replay, and AutoHeal Prometheus metrics.
 - **Grafana production loading and scrape fix**: The active Grafana on 188 loads 4 dashboards, the active Prometheus gained the `momo-app` scrape job, and the gunicorn preload LocalProxy boot crash was fixed.
+- **CD health check hardening**: The Gitea Actions health check now runs a dual check (internal container health + external URL), reducing false failures from brief 502s after a rebuild.
 ### 2026-04-28~29: Phase 3e refactoring marathon + eradication of a hidden daily_sales cache bug
 - **app.py reduced by 10.8%**: 7,386 → 6,590 lines; 11 commits, all green, zero 502s.


@@ -0,0 +1,22 @@
+from pathlib import Path
+
+
+ROOT = Path(__file__).resolve().parents[1]
+CD_WORKFLOW = ROOT / ".gitea/workflows/cd.yaml"
+
+
+def test_cd_health_check_allows_slow_rebuild_warmup():
+    workflow = CD_WORKFLOW.read_text(encoding="utf-8")
+
+    assert "等待服務啟動30s" in workflow
+    assert "seq 1 12" in workflow
+    assert "等待 15s" in workflow
+
+
+def test_cd_health_check_validates_internal_and_external_health():
+    workflow = CD_WORKFLOW.read_text(encoding="utf-8")
+
+    assert "docker exec momo-pro-system curl" in workflow
+    assert "http://127.0.0.1:80/health" in workflow
+    assert "https://mo.wooo.work/health" in workflow
+    assert 'internal=$INTERNAL_CODE, external=$EXTERNAL_CODE' in workflow