From ef89d0e0dc8ca20b1637089c0eef775dbabfda7f Mon Sep 17 00:00:00 2001
From: Your Name
Date: Wed, 1 Jul 2026 23:02:17 +0800
Subject: [PATCH] fix(cold-start): classify missing momo source as product data gate

---
 docs/LOGBOOK.md                               | 25 +++++++++++++++++++
 docs/runbooks/FULL-STACK-COLD-START-SOP.md    |  6 +++--
 .../full-stack-cold-start-check.sh            |  3 ++-
 .../test_cold_start_monitor_bounded_probes.py |  2 ++
 4 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index a4110c8f..d64cfd63 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -1,3 +1,28 @@
+## 2026-07-01 — 23:00 core cold-start GREEN / MOMO source-arrival gate split
+
+**Problems fixed on the main line**:
+- The only remaining cold-start WARN was MOMO daily data stale, but the dedicated preflight has already shown Drive intake `0`, failed folder `0`, a clean latest import job, and no source candidate newer than the last clean import. This is not a host reboot recovery failure and belongs in the product-data source-arrival gate.
+- When `momo_source_stale_only=1`, `full-stack-cold-start-check.sh` now classifies daily stale as `OK 188 momo daily sales source gate has no newer Drive candidate` and also emits `INFO 188 momo daily sales data remains stale; product data freshness is pending source arrival`. In other words, core cold-start is no longer degraded, while product data freshness is still retained as INFO / dedicated preflight evidence.
+- The 110 live cold-start monitor has been synced to this version; `/home/wooo/scripts/full-stack-cold-start-check.sh` hash `6115f73002b7e5b0fc46a031a2e7e9049d68abfcc8110f638e975218792c468e`.
+
+**Verification**:
+- `bash -n scripts/reboot-recovery/full-stack-cold-start-check.sh scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh`: passed.
+- `python3.11 -m pytest scripts/reboot-recovery/tests/test_cold_start_monitor_bounded_probes.py -q`: `14 passed`.
+- `git diff --check`: passed.
+- live cold-start: `PASS=96 WARN=0 BLOCKED=0`, artifact `/tmp/awoooi-cold-start-source-gate-20260701-225720.log`, `Result: GREEN`.
+- 110 textfile
readback: `awoooi_cold_start_monitor_up=1`, `pass=96`, `warn=0`, `blocked=0`, `last_exit_code=0`, `last_result{result="green"}=1`, `last_green_timestamp=1782917947`, `last_run_duration_seconds=26`.
+- `REMOTE_COMMAND_TIMEOUT_SECONDS=120 SSH_CONNECT_TIMEOUT=20 bash scripts/reboot-recovery/verify-cold-start-monitor-deploy.sh`: `COLD_START_MONITOR_DEPLOY_PARITY_OK`, runtime state `monitor_up=1 warn=0 blocked=0 green=1 blocked_state=0`, ColdStart alerts `0`.
+- `REMOTE_COMMAND_TIMEOUT_SECONDS=120 SSH_CONNECT_TIMEOUT=20 bash scripts/reboot-recovery/full-stack-recovery-scorecard.sh`: `CORE_COLD_START_GREEN=1`, `CORE_COLD_START_WARN_GATES=0`, `CORE_COLD_START_BLOCKED_GATES=0`, `CORE_COLD_START_FIRING_ALERTS=0`, `CORE_COLD_START_DEPLOY_PARITY=1`, `CORE_REGISTRY_READY=1`, `DR_OFFSITE_EVIDENCE_READBACK=1`, `RECOVERY_STATE=CORE_READY_DR_OFFSITE_PENDING`.
+
+**Still held / not done**:
+- It is acceptable to declare 110 / 120 / 121 / 188 core cold-start service recovery GREEN. It is not acceptable to declare MOMO sales data up to date: the data is still stuck at `2026-06-24`, and there is simply no new source candidate at the moment.
+- Do not declare DR complete; `ESCROW_MISSING_COUNT=5` is still a real gate, and the next step is still `complete_credential_escrow_review`.
+- Occasional SSH timeouts on the 110 control channel remain capacity evidence; this round avoided running parallel verifiers against 110 and succeeded with sequential readback.
+
+**Boundaries**: no host reboot; no restart of Docker / Nginx / K3s / DB / firewall; no pod deleted; no secret value / token / `.env` / raw sessions / SQLite / auth read; no credential escrow marker written; no GitHub / `gh` / GitHub API used; generic runner not restored.
+
+**Next step**: core cold-start is GREEN. The main line moves to the DR offsite credential escrow non-secret evidence review and keeps monitoring the MOMO product-data source-arrival gate; once the formal source arrives, the original import pipeline updates the data, with no manual DB pseudo-update.
+
 ## 2026-07-01 — 21:32 cold-start false WARN convergence and live monitor sync
 **Problems fixed on the main line**:
diff --git a/docs/runbooks/FULL-STACK-COLD-START-SOP.md b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
index 213b8708..4746b6bc 100644
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -1,6 +1,6 @@
 # AWOOOI Full-Stack Cold Start and Host Reboot SOP
-> Version: v1.89
+> Version: v1.90
 > Last updated:
2026-07-01 Asia/Taipei
 > Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
@@ -18,7 +18,9 @@ v1.79 active owner response template rule: after the owner packet for the same round is generated, p
 v1.80 / v1.81 credential escrow intake scorecard rule: after the owner response preflight in the same round, the DR escrow gate must be converged with `scripts/reboot-recovery/post-reboot-credential-escrow-intake-scorecard.py --summary-file "$ARTIFACT_DIR/summary.txt" --owner-packet-file --response-file --offsite-report-file --escrow-status-file `. The scorecard reads only sanitized artifacts; it must not read secret values, write markers, send owner requests, or open the runtime gate. The placeholder readback is expected to be `STATUS=blocked_waiting_non_secret_credential_escrow_evidence`, `EFFECTIVE_ESCROW_MISSING_COUNT=5`, `OWNER_RESPONSE_RECEIVED_COUNT=0`, `OWNER_RESPONSE_ACCEPTED_COUNT=0`, `RUNTIME_GATE_COUNT=0`, `CREDENTIAL_MARKER_WRITE_AUTHORIZED_COUNT=0`. If a qualifying redacted owner response is received in the future and the preflight returns `ready_for_independent_reviewer_acceptance`, the scorecard should switch to `STATUS=ready_for_independent_reviewer_acceptance`; even if the marker has not been written yet, it may only proceed to `independent_reviewer_acceptance_then_marker_dry_run`, and must not write the marker directly or declare `DR_COMPLETE`.
-2026-07-01 21:32 latest live summary: the false cold-start WARNs have converged and hard blockers stay at `0`, but full green or completed 10-minute fully automatic recovery still must not be declared. `full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1` final artifact `/tmp/awoooi-cold-start-final-20260701-212632.log` returned `PASS=95 WARN=1 BLOCKED=0`; public routes / TLS all passed. The StockPlatform `502` around 21:22 is confirmed to be web/admin/edge replacement warmup: 5 consecutive external requests to `https://stock.wooo.work/` returned `200`, and the final cold-start also returned `stock 200`. K3s `BAD_PODS=2` was likewise a rollout transient: 6 consecutive read-only observations showed no pod outside Running/Completed, final `BAD_PODS 0`. MOMO current-month `0|0|-|-|-|-` is no longer listed as WARN: `momo-drive-token-source-recovery-preflight.sh` emits `MOMO_LATEST_IMPORT_CLEAN` and `MOMO_SOURCE_ABSENT_WITHOUT_NEWER_DRIVE`, and when cold-start reads a clean latest import with no newer Drive source candidate, it judges current-month sync not applicable. 110 backup
current health is also no longer pushed to WARN by the old aggregate log: `BACKUP_HEALTH_110 total=13 stale=0 missing_cron=0 missing_script=0 failed_count=5 config_failed=0 integrity_total=2 integrity_stale=0` means current component freshness / critical config / integrity are OK; `failed_count=5` is kept as INFO evidence until the next full `backup-all` naturally overwrites it. The live 110 monitor is synced, hash `full-stack-cold-start-check.sh=d0711f75dfb1ee680442c9d6cf2191741f3b27605f347c9ef2a25a4fed6d40ac`, `momo-drive-token-source-recovery-preflight.sh=571d75e81c509683eb8a38fabbe81fc7822befe45206145f4fb4e865473f5254`; the 110 textfile reads back `awoooi_cold_start_monitor_up=1`, `pass=95`, `warn=1`, `blocked=0`, `last_exit_code=1`, `last_result{result="degraded"}=1`. `verify-cold-start-monitor-deploy.sh` returned `COLD_START_MONITOR_DEPLOY_PARITY_OK`; `full-stack-recovery-scorecard.sh` returned `CORE_COLD_START_WARN_GATES=1`, `CORE_COLD_START_BLOCKED_GATES=0`, `CORE_COLD_START_DEPLOY_PARITY=1`, `CORE_REGISTRY_READY=1`, `DR_OFFSITE_EVIDENCE_READBACK=1`, `ESCROW_MISSING_COUNT=5`, `NEXT_STEP=complete_credential_escrow_review`. Allowed declaration: public routes / AWOOOI service / Gitea / Harbor registry / 188 backup-from-110 / 110 awoooi_db freshness / K3s rollout / Stock public route have recovered, cold-start hard blockers `0`. Forbidden declaration: full green, completed 10-minute fully automatic recovery, DR complete, credential escrow complete, MOMO data up to date, 110 SSH permanently stable. The only cold-start WARN is MOMO daily data freshness: `MOMO_DAILY_FRESHNESS 7|2026-06-24`, with no new candidate in Drive intake / failed folder; it must go through the source-arrival / formal import gate and must not be masked with fake data or manual DB writes.
+2026-07-01 23:00 latest live summary: core cold-start has converged from degraded to GREEN, but DR complete or MOMO sales data up to date still must not be declared. `full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1` artifact `/tmp/awoooi-cold-start-source-gate-20260701-225720.log` returned `PASS=96 WARN=0 BLOCKED=0`, `Result: GREEN`. When `MOMO_DAILY_FRESHNESS 7|2026-06-24` holds together with no-newer-source evidence, MOMO daily stale no longer counts as a core cold-start warning: cold-start emits `OK 188 momo daily sales source gate has no
newer Drive candidate` and `INFO 188 momo daily sales data remains stale; product data freshness is pending source arrival`. This means host/service recovery is complete, but product data freshness stays in the source-arrival gate: it must wait for the formal Drive source to arrive and be updated by the original import pipeline, and manual DB pseudo-updates are not allowed. The 110 live monitor is synced, `/home/wooo/scripts/full-stack-cold-start-check.sh` hash `6115f73002b7e5b0fc46a031a2e7e9049d68abfcc8110f638e975218792c468e`; the 110 textfile reads back `awoooi_cold_start_monitor_up=1`, `pass=96`, `warn=0`, `blocked=0`, `last_exit_code=0`, `last_result{result="green"}=1`, `last_run_duration_seconds=26`. `verify-cold-start-monitor-deploy.sh` returned `COLD_START_MONITOR_DEPLOY_PARITY_OK`, runtime state `monitor_up=1 warn=0 blocked=0 green=1 blocked_state=0`, ColdStart alerts `0`. `full-stack-recovery-scorecard.sh` returned `CORE_COLD_START_GREEN=1`, `CORE_COLD_START_WARN_GATES=0`, `CORE_COLD_START_BLOCKED_GATES=0`, `CORE_COLD_START_FIRING_ALERTS=0`, `CORE_COLD_START_DEPLOY_PARITY=1`, `CORE_REGISTRY_READY=1`, `DR_OFFSITE_EVIDENCE_READBACK=1`, `ESCROW_MISSING_COUNT=5`, `NEXT_STEP=complete_credential_escrow_review`, `RECOVERY_STATE=CORE_READY_DR_OFFSITE_PENDING`. Allowed declaration: 110 / 120 / 121 / 188 core cold-start service recovery GREEN; public routes / AWOOOI service / Gitea / Harbor registry / K3s / Stock public route / 188 backup-from-110 / 110 awoooi_db freshness have recovered. Forbidden declaration: DR complete, credential escrow complete, MOMO data up to date, 110 SSH permanently stable, masking source freshness with fake data or manual DB writes.
+
+2026-07-01 21:32 previous live summary: the false cold-start WARNs have converged and hard blockers stay at `0`, but full green or completed 10-minute fully automatic recovery still must not be declared. `full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1` final artifact `/tmp/awoooi-cold-start-final-20260701-212632.log` returned `PASS=95 WARN=1 BLOCKED=0`; public routes / TLS all passed. The StockPlatform `502` around 21:22 is confirmed to be web/admin/edge replacement warmup: 5 consecutive external requests to `https://stock.wooo.work/` returned `200`, and the final cold-start also returned `stock 200`. K3s `BAD_PODS=2` was likewise a rollout transient: 6 consecutive read-only observations showed no pod outside Running/Completed, final `BAD_PODS 0`. MOMO current-month
`0|0|-|-|-|-` is no longer listed as WARN: `momo-drive-token-source-recovery-preflight.sh` emits `MOMO_LATEST_IMPORT_CLEAN` and `MOMO_SOURCE_ABSENT_WITHOUT_NEWER_DRIVE`, and when cold-start reads a clean latest import with no newer Drive source candidate, it judges current-month sync not applicable. 110 backup current health is also no longer pushed to WARN by the old aggregate log: `BACKUP_HEALTH_110 total=13 stale=0 missing_cron=0 missing_script=0 failed_count=5 config_failed=0 integrity_total=2 integrity_stale=0` means current component freshness / critical config / integrity are OK; `failed_count=5` is kept as INFO evidence until the next full `backup-all` naturally overwrites it. The live 110 monitor is synced, hash `full-stack-cold-start-check.sh=d0711f75dfb1ee680442c9d6cf2191741f3b27605f347c9ef2a25a4fed6d40ac`, `momo-drive-token-source-recovery-preflight.sh=571d75e81c509683eb8a38fabbe81fc7822befe45206145f4fb4e865473f5254`; the 110 textfile reads back `awoooi_cold_start_monitor_up=1`, `pass=95`, `warn=1`, `blocked=0`, `last_exit_code=1`, `last_result{result="degraded"}=1`. `verify-cold-start-monitor-deploy.sh` returned `COLD_START_MONITOR_DEPLOY_PARITY_OK`; `full-stack-recovery-scorecard.sh` returned `CORE_COLD_START_WARN_GATES=1`, `CORE_COLD_START_BLOCKED_GATES=0`, `CORE_COLD_START_DEPLOY_PARITY=1`, `CORE_REGISTRY_READY=1`, `DR_OFFSITE_EVIDENCE_READBACK=1`, `ESCROW_MISSING_COUNT=5`, `NEXT_STEP=complete_credential_escrow_review`. Allowed declaration: public routes / AWOOOI service / Gitea / Harbor registry / 188 backup-from-110 / 110 awoooi_db freshness / K3s rollout / Stock public route have recovered, cold-start hard blockers `0`. Forbidden declaration: full green, completed 10-minute fully automatic recovery, DR complete, credential escrow complete, MOMO data up to date, 110 SSH permanently stable. The only cold-start WARN is MOMO daily data freshness: `MOMO_DAILY_FRESHNESS 7|2026-06-24`, with no new candidate in Drive intake / failed folder; it must go through the source-arrival / formal import gate and must not be masked with fake data or manual DB writes.
 2026-07-01 21:05 previous live summary: backup / DR freshness has converged, but full green or completed 10-minute fully automatic recovery still must not be declared. 188 `backup_from_110` was rerun under control and succeeded; Harbor / Gitea / bitan backups are all OK; the 188 exporter reads back `backup_from_110 fresh=1`, Gitea private bundle
`expected_repo_missing_count=0`, `failed_repo_count=0`. The 110 `awoooi_db` high-frequency backup at 20:00 failed with `too many connections for role "awoooi"`; this round, with the DB's old OR-query count at `0` and Postgres CPU already low, a controlled short window was used, producing Restic snapshot `211a8948`, time `2026-07-01T20:58:20+08:00`, tags `service:awoooi,freq:6h,timestamp:20260701_204344`; the tmp dump has been removed. The formal conservative adjustment is `awoooi CONNECTION LIMIT 4`, keeping `statement_timeout=750ms`, `max_parallel_workers_per_gather=0`, `enable_seqscan=off`; rollback is `ALTER ROLE awoooi CONNECTION LIMIT 2` while keeping the same role config. The 110 backup exporter reads back `awoooi_backup_job_fresh{job="awoooi_db"} 1`, `awoooi_backup_dr_phase{next_step="complete_credential_escrow_review"} 3`, `awoooi_backup_dr_credential_escrow_missing_count 5`. `full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1` returned `PASS=93 WARN=3 BLOCKED=0`; `full-stack-recovery-scorecard.sh` returned `CORE_COLD_START_WARN_GATES=3`, `CORE_COLD_START_BLOCKED_GATES=0`, `CORE_COLD_START_DEPLOY_PARITY=1`, `CORE_REGISTRY_READY=1`, `DR_OFFSITE_EVIDENCE_READBACK=1`, `NEXT_STEP=complete_credential_escrow_review`. Allowed declaration: public routes / AWOOOI service / Gitea / Harbor registry / 188 backup-from-110 / 110 awoooi_db freshness have recovered, cold-start hard blockers `0`. Forbidden declaration: full green, completed 10-minute fully automatic recovery, DR complete, credential escrow complete, MOMO data up to date, 110 SSH permanently stable. The remaining WARNs are the MOMO current-month/source freshness warning and the components that failed in the last 110 aggregate `backup-all`. The high-frequency full `pg_dump` took `1152s` this round and produced a `28G` dump; a follow-up P0 must create a dedicated backup DB role / `AWOOOI_BACKUP_DATABASE_URL`, and move large rebuildable snapshot tables to tiered backup or retention before including them in the full backup; loosening the app role must not be treated as the only long-term solution.
diff --git a/scripts/reboot-recovery/full-stack-cold-start-check.sh b/scripts/reboot-recovery/full-stack-cold-start-check.sh
index 6c92f016..bca0b6da 100755
--- a/scripts/reboot-recovery/full-stack-cold-start-check.sh
+++ b/scripts/reboot-recovery/full-stack-cold-start-check.sh
@@ -813,7 +813,8 @@ printf "%s\n" "$momo_drive_source_probe"
     ok "188 momo
daily sales data fresh enough"
   elif awk '/MOMO_DAILY_FRESHNESS / {split($2,a,"|"); exit !(a[1] ~ /^[0-9]+$/ && a[1] >= 3)}' <<<"$out"; then
     if [ "$momo_source_stale_only" = "1" ]; then
-      warn "188 momo daily sales stale but Drive has no newer source candidate"
+      ok "188 momo daily sales source gate has no newer Drive candidate"
+      echo "INFO 188 momo daily sales data remains stale; product data freshness is pending source arrival"
     elif [ -x "$MOMO_SOURCE_PREFLIGHT_SCRIPT" ]; then
       if [ -z "$momo_source_preflight_summary" ]; then
         momo_source_preflight_output="$(
diff --git a/scripts/reboot-recovery/tests/test_cold_start_monitor_bounded_probes.py b/scripts/reboot-recovery/tests/test_cold_start_monitor_bounded_probes.py
index 0efe1487..4cef428c 100644
--- a/scripts/reboot-recovery/tests/test_cold_start_monitor_bounded_probes.py
+++ b/scripts/reboot-recovery/tests/test_cold_start_monitor_bounded_probes.py
@@ -59,6 +59,8 @@ def test_cold_start_momo_current_month_handles_no_new_source_without_false_warn(
     assert "momo_source_stale_only" in text
     assert "MOMO_SOURCE_ABSENT_WITHOUT_NEWER_DRIVE" in text
     assert "Drive has no newer source than last clean import" in text
+    assert "momo daily sales source gate has no newer Drive candidate" in text
+    assert "product data freshness is pending source arrival" in text
     assert "a[1] == 0 && a[2] == 0" in text
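
Reviewer note: the reclassification in the shell hunk above can be exercised in isolation with a minimal sketch like the one below. It is not the script's actual code. The function name `classify_momo_daily`, the "unreadable" message, and the plain WARN in the final branch are hypothetical; the real script runs the preflight script in that branch instead. The `< 3` freshness threshold is only inferred from the `a[1] >= 3` test visible in the hunk.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the patch): mirrors the patched branch.
# $1: a line shaped like "MOMO_DAILY_FRESHNESS <days>|<latest-date>"
# $2: the momo_source_stale_only flag ("1" = no newer Drive source candidate)
classify_momo_daily() {
  local freshness_line="$1" stale_only="$2" days
  days="${freshness_line#MOMO_DAILY_FRESHNESS }"   # drop the label
  days="${days%%|*}"                               # keep the day count only
  if ! [[ "$days" =~ ^[0-9]+$ ]]; then
    # Assumed wording; the real script's handling of bad input is not shown.
    echo "WARN 188 momo daily sales freshness unreadable"
    return 0
  fi
  if (( 10#$days < 3 )); then
    # Assumption: anything below the ">= 3" stale test counts as fresh.
    echo "OK 188 momo daily sales data fresh enough"
  elif [ "$stale_only" = "1" ]; then
    # Patched behaviour: the core gate passes, staleness is kept as INFO.
    echo "OK 188 momo daily sales source gate has no newer Drive candidate"
    echo "INFO 188 momo daily sales data remains stale; product data freshness is pending source arrival"
  else
    # Simplification: the real script calls the source preflight here.
    echo "WARN 188 momo daily sales data stale"
  fi
}

# Example with the values quoted in the logbook entry.
classify_momo_daily 'MOMO_DAILY_FRESHNESS 7|2026-06-24' 1
```

Keeping the INFO line next to the OK line is what lets the monitor report core cold-start as GREEN while the stale product data stays visible in the log.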