From d2854edcd8b722114575887e57b093b969b327a2 Mon Sep 17 00:00:00 2001 From: ogt Date: Thu, 25 Jun 2026 11:06:22 +0800 Subject: [PATCH] docs(ops): add momo preflight and cpu triage evidence [skip ci] --- docs/LOGBOOK.md | 17 ++ docs/runbooks/FULL-STACK-COLD-START-SOP.md | 56 ++++- ...oot-cold-start-backup-recovery-workplan.md | 6 +- ...o-drive-token-source-recovery-preflight.sh | 231 ++++++++++++++++++ 4 files changed, 305 insertions(+), 5 deletions(-) create mode 100755 scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md index b7ec8d1a..869dfed1 100644 --- a/docs/LOGBOOK.md +++ b/docs/LOGBOOK.md @@ -1,3 +1,20 @@ +## 2026-06-25|11:01 MOMO preflight gate and 110 CPU orphan Chrome triage + +**Background**: At 10:45 the MOMO Drive token/source recovery was written up as an owner-gated documentation boundary; at the same time the user asked why the 110 CPU load stayed high. Both need to be reduced to rerunnable evidence: MOMO cannot be judged by `/health` alone, and 110 CPU cannot be judged by load average alone. + +**Read-only / approved evidence**: +- Added `scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh`, which only performs read-only SSH / Docker metadata / scheduler logs / DB query; it does not read token content, import, move Drive files, or restart anything. +- 11:01 live preflight returned expected exit code `2`: `PASS=8 WARN=4 BLOCKED=2`. MOMO public health `200`, local health `200`, `momo-scheduler` running/healthy, current-month DB parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, latest daily import job `56 completed` clean. +- Remaining MOMO blockers: host/container Google token metadata both `missing`; `DB_DAILY_FRESHNESS 8|2026-06-17`; scheduler recent fail-closed notification log not observed after the latest container restart window, so it remains WARN rather than success evidence. +- 110 CPU read-only triage at 10:56 showed load `10.99 / 9.19 / 7.49`, 12 cores, low iowait and no active swap thrash. 
Top CPU was `stockplatform-review-bulk-ux` orphan Chrome groups plus active Gitea Actions CI load. +- The user approved continuing after the CPU triage; at 10:58 the only write action was a targeted `SIGTERM` to orphan Chrome process groups `438005`, `471295`, `640155`, `670628` on 110. Post-check `OLD_GROUPS_REMAINING` returned empty. +- 11:01 110 load dropped to around `8.24 / 9.62 / 8.24`; remaining CPU was active Gitea Actions / CI build work, not the killed orphan groups. + +**Still 0 / false**: +- token value / hash / partial collection, manual import, Drive file movement, DB truncate / restore, Docker / systemd / Nginx / firewall / K8s / ArgoCD write, CI cancellation, Wazuh / 112 / SOC change, secret read. + +**Verdict**: MOMO recovery now has its own dedicated preflight gate; the 110 CPU PlayBook again confirms that orphan browser smoke processes can be precisely SIGTERMed, but active Gitea Actions / CI load must be handled by observing queue / timeout and must not be killed indiscriminately as if it were a runaway process. + ## 2026-06-25|10:45 MOMO Drive token recovery gate hardening **背景**:10:35 live refresh 確認 MOMO 服務與 DB parity 正常,但 Drive token artifact metadata missing,scheduler 因 auth failure 正確 fail closed。既有 SOP 14.28 仍保留 2026-06-24「token live repair」舊口徑,容易讓 operator 誤以為只要 owner/mode 還在就已修復。本輪只做 docs-only gate hardening。 diff --git a/docs/runbooks/FULL-STACK-COLD-START-SOP.md b/docs/runbooks/FULL-STACK-COLD-START-SOP.md index 49dba4d9..84d8035b 100644 --- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md +++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md @@ -1,6 +1,6 @@ # AWOOOI 全棧冷啟動與主機重啟 SOP -> Version: v1.48 +> Version: v1.49 > Last updated: 2026-06-25 Asia/Taipei > Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path. @@ -21,7 +21,8 @@ Service state: SERVICE_AVAILABLE_MOMO_DATA_STALE_GDRIVE_TOKEN_WARN_DR_ESCROW_BLO Runtime release state: API/Web/Worker are ready; production API health returns healthy with `environment=prod`, `mock_mode=false`, and postgresql / redis / openclaw / signoz / gcp ollama providers up. 
10:35 direct route smoke returned 200 for AWOOOI API, `/zh-TW/iwooos`, VibeWork, AwoooGo, MOMO health, Stock, and Bitan. Cold-start raw route gate still records expected redirect statuses such as awoooi web=307 and sentry=302. MOMO release state: mo.wooo.work health is healthy on version V10.657. Gitea main fast-forwarded to e137d7a5d02a7595a44c3f3cc1cf54b766424ee7 and Gitea Actions cd.yaml #910 completed Success for the Drive-auth fail-closed fix. 188 host source and `momo-scheduler` container source both contain the marker `Google Drive 連線或認證失敗,未能確認來源資料夾是否有新檔案`; scheduler / telegram-bot were restarted by CD and are healthy. 10:04 scheduler proof shows auth failure is now logged as `❌ 自動匯入失敗` and Telegram failure notification is sent successfully. Earlier cd.yaml #904 remains the monthly-sync failure boundary fix. MOMO data state: current-month daily_sales_snapshot and realtime_sales_monthly still match, but both stop at 2026-06-17: `daily_sales_snapshot=104614|2025-07-01|2026-06-17`, `realtime_sales_monthly current-month=10936|2026/06/01|2026/06/17`, and cold-start still reports `MOMO_MONTHLY_SYNC 10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`. `MOMO_DAILY_FRESHNESS 8|2026-06-17` is a hard blocker because business data is not current. Latest import job remains `56 completed|即時業績_當日.xlsx|2026-06-18 11:41:00|2026-06-18 11:42:02|10936|10936|0`; no newer successful daily-sales import appeared by the 10:35 refresh. -Google Drive / source-file state: 2026-06-25 10:35 cold-start reports `MOMO_GDRIVE_TOKEN_STAT missing scheduler_uid=100000`; direct metadata-only readback confirms host path `/home/ollama/momo-pro/config/google_token.json` is missing and container-side `config/google_token.json` is missing, while the scheduler process runs as UID/GID `100000:100000`. Do not read token content and do not recreate/chown token evidence without an explicit maintenance-window / owner approval. 
The data blocker is now stale business data with a live auth-failure proof; token missing remains a separate WARN until owner-provided token/writeback evidence is restored. With cd.yaml #910 live, any future Drive auth/API failure must be treated as failed import evidence rather than a no-file success. +Google Drive / source-file state: 2026-06-25 10:35 cold-start reports `MOMO_GDRIVE_TOKEN_STAT missing scheduler_uid=100000`; direct metadata-only readback confirms host path `/home/ollama/momo-pro/config/google_token.json` is missing and container-side `config/google_token.json` is missing, while the scheduler process runs as UID/GID `100000:100000`. Do not read token content and do not recreate/chown token evidence without an explicit maintenance-window / owner approval. The data blocker is now stale business data with a live auth-failure proof; token missing remains a separate WARN until owner-provided token/writeback evidence is restored. With cd.yaml #910 live, any future Drive auth/API failure must be treated as failed import evidence rather than a no-file success. 2026-06-25 11:01 dedicated preflight `scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh` returns `PASS=8 WARN=4 BLOCKED=2`: public/local health and scheduler are healthy, token metadata remains missing, and `DB_DAILY_FRESHNESS 8|2026-06-17` remains a hard blocker. +110 CPU/load readback: 2026-06-25 10:58 user-approved minimal SIGTERM targeted only orphan `stockplatform-review-bulk-ux` Chrome process groups `438005`, `471295`, `640155`, and `670628`; `OLD_GROUPS_REMAINING` returned empty. No Docker/systemd/Nginx/firewall/K8s write was performed. Remaining load is attributed to active Gitea Actions / CI build work such as `GITEA-ACTIONS-TASK-7111`, not orphan Chrome; do not cancel CI unless separately approved. 
Backup / monitoring state: backup-status core blockers are 0, 110 is 13/13 fresh failed=0, 188 is 2/2 fresh failed=0, offsite_fresh=1, rclone_gdrive_fresh=1, integrity_stale=0, last aggregate is 2026-06-25 02:35:09. 09:33 backup-status --no-notify --no-refresh reports 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, integrity_stale=0, offsite_fresh=1, rclone_gdrive_fresh=1, escrow_missing=5. Notification-noise state: healthy AWOOOI heartbeat is suppressed; heartbeat warning dedupe uses stable actionable fingerprints so HTTP status / timeout / latency drift does not create a new Telegram event every 30 minutes; MOMO Pro monitor uses https://mo.wooo.work/health as primary truth and no longer checks the 188 root path; MoWoooWorkDown now labels component=momo-pro-system and requires public/local/container/data-freshness triage instead of blind restart; docker-health-monitor keeps 5-minute repair cadence but has a separate 30-minute Telegram fallback cooldown; Bitan public-content check keeps failure alerting with same-fingerprint cooldown and one recovery notice. Monitoring coverage recovery state: if CD post-deploy fails only because `scripts/generate_monitoring.py --check` reports `nginx-exporter` down on `192.168.0.188:9113`, first verify 188 `stub_status` and restore the stateless exporter with `scripts/ops/188-nginx-exporter-restore.sh`; do not reload Nginx or restart product containers for this symptom. 
@@ -1840,6 +1841,8 @@ ssh ollama@192.168.0.188 'bash -s' < scripts/ops/188-node-exporter-restore.sh MOMO post-reboot 最小 readback: ```bash +scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh + ssh ollama@192.168.0.188 ' stat -c "%u:%g:%a %n" /home/ollama/momo-pro/config/google_token.json 2>/dev/null || echo "google_token.json missing" docker top momo-scheduler -eo pid,user,uid,gid,args | head -n 3 @@ -1853,6 +1856,8 @@ SELECT 'daily_freshness|' || (CURRENT_DATE - max(snapshot_date)::date) || '|' || SQL ``` +The preferred path is the scripted preflight. It is read-only and returns `0` for clean, `1` for WARN-only, and `2` for BLOCKED. 2026-06-25 11:01 live run returned `PASS=8 WARN=4 BLOCKED=2`: `https://mo.wooo.work/health` and local health both returned `200`, `momo-scheduler` was running/healthy, current-month DB parity matched, latest daily import job `56` was clean, token metadata remained missing, and `DB_DAILY_FRESHNESS 8|2026-06-17` remained the blocker. + 若 Drive token artifact missing 或 Drive pending folder 無新來源檔,不可手動 truncate、不可以舊 archive 檔重複匯入來製造「最新」,也不可把 DB parity 當 data freshness。下一個解除 blocker 的證據必須是: 1. Owner 提供非 secret evidence ref,確認可以恢復 Google Drive token artifact 或合法來源檔。 @@ -2131,6 +2136,53 @@ NO-GO: send a generic success notification for file_count > 0 before verify_impo 3. Data correctness: import job completed with sync_success=true, and daily_sales_snapshot / realtime_sales_monthly match the imported date range. ``` +### 14.35 2026-06-25 MOMO preflight and 110 CPU orphan Chrome triage + +The ninth change segment at 2026-06-25 11:01 turns two common misjudgments into a rerunnable SOP: + +1. MOMO service health green does not mean data is fresh. +2. 
110 high load does not mean Docker may be restarted or CI cancelled. + +MOMO dedicated preflight: + +```bash +scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh +``` + +This script only performs read-only SSH / Docker metadata / logs / DB query; it does not read token content, import, move Drive files, or restart anything. 11:01 live result: + +```text +MOMO_DRIVE_TOKEN_SOURCE_PREFLIGHT PASS=8 WARN=4 BLOCKED=2 HOST=ollama@192.168.0.188 FRESHNESS_MAX_DAYS=2 +MOMO_PUBLIC_HEALTH_CODE 200 +MOMO_HEALTH_CODE 200 +SCHEDULER_RUNNING true +SCHEDULER_HEALTH healthy +TOKEN_STAT missing +CONTAINER_TOKEN_STAT missing +DB_MONTHLY_SYNC 10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17 +DB_DAILY_FRESHNESS 8|2026-06-17 +DB_LATEST_DAILY_IMPORT_JOB 56|completed|即時業績_當日.xlsx|2026-06-18T11:41:00.853176|2026-06-18T11:42:02.309425|10936|10936|0 +``` + +110 CPU triage: + +| Evidence | Decision | +|----------|----------| +| `ps` shows `stockplatform-review-bulk-ux` Chrome groups with root process PPID `1`, no parent node smoke, and sustained high CPU | Treat as orphan browser smoke. Run dry-run if available, then only with owner approval use targeted `SIGTERM` by process group. | +| Active Gitea Actions container is consuming CPU, e.g. `GITEA-ACTIONS-TASK-*`, `next build`, `uv pip install`, `docker-buildx` | Treat as legitimate CI/CD load. Do not kill unless there is explicit release owner approval to cancel the run. | +| `vmstat` shows high iowait or active swap in/out | Treat as storage / memory pressure, not browser runaway. Do not kill random processes; capture disk / memory evidence first. | + +2026-06-25 10:58 user-approved action: + +```text +Targeted command type: process SIGTERM only. +Targeted process groups: 438005, 471295, 640155, 670628. +Scope: orphan `stockplatform-review-bulk-ux` Chrome groups on 110. +Post-check: `OLD_GROUPS_REMAINING` empty. +Not performed: Docker restart, systemd restart, Nginx reload, firewall/iptables change, K8s action, CI cancellation, Wazuh/SOC change, secret read. 
+Remaining load: active Gitea Actions / CI build work; observe queue and timeout instead of killing. +``` + ### 14.22 重啟後時間軸驗證 每次重啟後照時間軸推進,不要等到最後才一次判定。 diff --git a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md index 4e8707e0..7facf59a 100644 --- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md +++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md @@ -15,7 +15,7 @@ | P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-25 09:05 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, VIP `192.168.0.125` is present, node filesystem / disk-pressure / readonly events are `0`, and latest `km-vectorize-29705460-55rgs` completed. | | P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-25 09:05 backup / alert readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`, last aggregate `2026-06-25 02:35:09`。DR remains blocked on real non-secret credential escrow evidence IDs. | | P2 service / data truth | BLOCKED_MOMO_DATA_FRESHNESS_WITH_GDRIVE_TOKEN_WARN | 98% | Public route/TLS, API/Web route, MOMO health `V10.657`, MOMO main / CD `#904` monthly-sync failure boundary, MOMO main / CD `#910` Drive-auth fail-closed boundary, 10:04 live scheduler fail-closed proof, direct 10:35 public route smoke all 200 for AWOOOI API / IwoooS / VibeWork / AwoooGo / MOMO health / Stock / Bitan, current-month parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, backup exporters, schedules, K3s node readiness/storage conditions, VIP, and 110 / 188 runtime health are green. MOMO latest business date remains `2026-06-17`; stale age is `8` days as of 10:35. 
Latest valid job `56` already imported `即時業績_當日.xlsx` with `sync_success=true` and bounds `2026-06-01..2026-06-17`. Google Drive token metadata is still a WARN because host and container token paths are missing; this requires owner-gated metadata repair/evidence and must not be solved by reading token contents. | -| P3 docs / automation contracts | DONE_WITH_MOMO_FAIL_CLOSED_LIVE_PROOF | 100% | Workplan, SOP v1.48, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, 188 DB/Redis exporter restore helper, 188 MinIO/Velero restore helper, 188 nginx-exporter restore helper, 110 Docker disk pressure cleanup boundary, MOMO Google Drive token userns readback, MOMO daily freshness blocker, MOMO Pro false-noise health monitor source-of-truth, docker-health direct Telegram fallback cooldown, Bitan public-content same-fingerprint cooldown, notification-noise readback, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side cold-start v1.42 source absence classifier, live-sync parity 
gate, MOMO import-boundary production deploy, MOMO Drive-auth fail-closed production deploy, 10:04 scheduler fail-closed live proof, 10:35 route / DB / backup refresh, MacBook Pro Codex safe artifact sync readback, and 2026-06-25 live refresh with Google Drive token metadata WARN are updated. 2026-06-24 23:15 read-only verify still shows repo cold-start hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`; live 110 script sync of the v1.42 classifier is not claimed until separately approved and recorded. | +| P3 docs / automation contracts | DONE_WITH_MOMO_PREFLIGHT_AND_CPU_TRIAGE | 100% | Workplan, SOP v1.49, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, 188 DB/Redis exporter restore helper, 188 MinIO/Velero restore helper, 188 nginx-exporter restore helper, 110 Docker disk pressure cleanup boundary, MOMO Google Drive token userns readback, MOMO daily freshness blocker, MOMO Pro false-noise health 
monitor source-of-truth, docker-health direct Telegram fallback cooldown, Bitan public-content same-fingerprint cooldown, notification-noise readback, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side cold-start v1.42 source absence classifier, live-sync parity gate, MOMO import-boundary production deploy, MOMO Drive-auth fail-closed production deploy, 10:04 scheduler fail-closed live proof, 10:35 route / DB / backup refresh, 11:01 MOMO dedicated preflight gate, 10:58 user-approved 110 orphan Chrome SIGTERM evidence, MacBook Pro Codex safe artifact sync readback, and 2026-06-25 live refresh with Google Drive token metadata WARN are updated. 2026-06-24 23:15 read-only verify still shows repo cold-start hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`; live 110 script sync of the v1.42 classifier is not claimed until separately approved and recorded. | Full cold-start service readiness may not be declared green for the latest verified evidence set. As of 2026-06-25 10:35, routes/hosts/K3s/backups/exporters/monitoring surfaces are available, AWOOOI API is healthy, and MOMO service health is `V10.657`, but the latest live read-only cold-start scorecard remains `PASS=85 WARN=1 BLOCKED=1` because MOMO business data freshness is stale beyond 3 days and Google Drive token metadata is missing / writeback not confirmed. The hard blocker is `188 momo daily sales data stale beyond 3 days`; the token state is a separate WARN and not a reason to read token contents. MOMO Drive auth/API failure is no longer allowed to be recorded as a no-file success after CD `#910`; the 10:04 scheduler run proved it now fails closed and sends failure notification. This code fix does not create new business data. Do not declare DR scorecard complete while credential escrow evidence remains blocked. 
@@ -158,7 +158,7 @@ Next: | ID | Status | % | Work item | Fine analysis | Next action | Done criteria | |----|--------|---:|-----------|---------------|-------------|---------------| | P2-001 | VERIFIED | 100 | Public route smoke | 2026-06-12 18:57 cold-start confirms all listed domains returned expected 2xx/3xx over HTTPS; registry root route returned 200 in the scorecard and `/v2/` remains the normal unauthenticated 401 pattern from earlier checks. This proves ingress/TLS plus current route availability. | Keep as one row in scorecard. | Public route table updated after each reboot. | -| P2-002 | BLOCKED_MOMO_DATA_FRESHNESS_WITH_GDRIVE_TOKEN_WARN | 98 | momo latest/current-month parity and freshness | Latest current-month parity is good: `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`. Full-table read-only DB evidence is `daily_sales_snapshot=104614 rows, 2025-07-01..2026-06-17` and current-month `realtime_sales_monthly=10936 rows, 2026/06/01..2026/06/17`. However latest business data is stale: `MOMO_DAILY_FRESHNESS 8|2026-06-17`; latest valid job `56` already imported `即時業績_當日.xlsx` with `sync_success=true` and bounds `2026-06-01..2026-06-17`. 10:04 scheduler run proves Drive auth failure now fails closed and sends Telegram failure notification. Token metadata remains missing (`MOMO_GDRIVE_TOKEN_STAT missing scheduler_uid=100000`), so source listing cannot be treated as healthy until owner-gated token/source evidence is restored. | Obtain owner-provided non-secret evidence ref for Google Drive token artifact recovery or a newer legitimate PChome daily-sales source file. Recovery must have maintenance window, rollback owner, token metadata-only verification, import job `sync_success=true`, file movement only after success, table bounds, and `MOMO_DAILY_FRESHNESS <= 2`. Do not read token contents, do not manually import stale local samples, product exports, header-only sheets, or already imported archives. 
| Token artifact metadata matches scheduler UID/mode without exposing token value; source folder has a newer legitimate source or scheduler can list the expected folder; next import has `sync_success=true`; snapshot/current-month row count and bounds match; daily freshness is within threshold. | +| P2-002 | BLOCKED_MOMO_DATA_FRESHNESS_WITH_GDRIVE_TOKEN_WARN | 98 | momo latest/current-month parity and freshness | Latest current-month parity is good: `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`. Full-table read-only DB evidence is `daily_sales_snapshot=104614 rows, 2025-07-01..2026-06-17` and current-month `realtime_sales_monthly=10936 rows, 2026/06/01..2026/06/17`. 11:01 dedicated preflight returns `PASS=8 WARN=4 BLOCKED=2`: public/local health and scheduler are healthy, latest job `56` is clean, but latest business data is stale: `DB_DAILY_FRESHNESS 8|2026-06-17`; host/container token metadata remains missing, and scheduler fail-closed log evidence is not present after the latest container restart window. 10:04 scheduler run remains the previous proof that Drive auth failure now fails closed and sends Telegram failure notification. | Run `scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh` before any MOMO recovery claim. Obtain owner-provided non-secret evidence ref for Google Drive token artifact recovery or a newer legitimate PChome daily-sales source file. Recovery must have maintenance window, rollback owner, token metadata-only verification, import job `sync_success=true`, file movement only after success, table bounds, and `MOMO_DAILY_FRESHNESS <= 2`. Do not read token contents, do not manually import stale local samples, product exports, header-only sheets, or already imported archives. 
| Token artifact metadata matches scheduler UID/mode without exposing token value; source folder has a newer legitimate source or scheduler can list the expected folder; next import has `sync_success=true`; snapshot/current-month row count and bounds match; daily freshness is within threshold; preflight returns no BLOCKED result. | | P2-008 | DONE_SUPERSEDED_BY_TOKEN_WARN | 100 | Separate MOMO service recovery from upstream source absence | 2026-06-24 11:35 readback proved MOMO service was healthy and source-file absence was the blocker. 2026-06-25 10:35 supersedes that with a stricter split: service is still healthy, DB parity is still good, but token artifact metadata is missing and the latest scheduler evidence is auth failure, not a healthy empty-source listing. SOP v1.48 records GO/NO-GO rules forbidding old archive re-import, product-export import, truncate, whole-DB restore, fake freshness, or token secret exposure. | Keep stale warning and token WARN active until owner-gated Drive token/source evidence is restored and a legitimate newer `即時業績_當日` source imports cleanly. | Operators can say "MOMO service recovered, data pipeline blocked by Drive token/source evidence and stale business data" without calling the full stack green. | | P2-003 | DONE_PRODUCTION_DEPLOYED_WAITING_NEXT_REAL_IMPORT | 99 | Fix momo job semantics | Gitea-first repair is in `/Users/ogt/codex-workspaces/momo-pro-dev` commit `84035906aba0e5e190d031a13cfd9b47a8cd1f73` on branch `codex/momo-current-main-dev-base-20260624`, also fast-forwarded to MacBook Pro and fast-forwarded to MOMO `main`. Gitea Actions `cd.yaml #904` succeeded, and 188 live source contains `_table_columns`, `業績分析儀表板同步失敗`, and `保留來源檔案等待重試,不移動 Google Drive 檔案`. `process_daily_sales_import()` marks monthly sync failure as `failed`, records the sync error in summary, returns `False`, and leaves `auto_import_from_drive()` outside the Drive archive/move path. Regression tests cover both job failure and no-move behavior. 
| Watch the next real Google Drive import and confirm no file moves unless both tables sync; if a real monthly sync failure happens, verify import job status is `failed` and source file remains pending. | `pytest tests/test_import_service_sql_params.py tests/test_auto_import_data_sync.py tests/test_auto_import_failure_boundaries.py -q` returns `10 passed`; production deployment/readback is complete; final behavioral closeout requires next real import evidence. | | P2-004 | DONE | 100 | PostgreSQL index corruption runbook path | SOP v1.2 now states `posting list tuple ... cannot be split` is an index repair incident. | Use only concurrent reindex if the error returns. | No truncate, no whole DB restore; `REINDEX TABLE CONCURRENTLY public.realtime_sales_monthly;` and idempotent resync evidence recorded. | @@ -179,7 +179,7 @@ Next: | P3-005 | DONE | 100 | Update cold-start SOP | SOP now includes start, shutdown, reboot, record, comparison, and 120 blocker handling. | Increment SOP version after each process change. | SOP has controlled power-operation sections and ledger template. | | P3-006 | DONE | 100 | Update backup status | Backup status now reflects current cron, rclone latest-only, failure-only alert posture, and escrow blocker. | Refresh after 120 backup rerun. | Backup status no longer claims noisy success Telegram notifications. | | P3-007 | DONE | 100 | Harden Gitea backup stale dump handling | 2026-06-05 manual Gitea backup failed because the container retained `/tmp/gitea-dump.zip` from the 02:00 failure. `scripts/backup/backup-gitea.sh` now renames stale container dump files to timestamped evidence before running a new dump, and the live 110 script is updated. | Watch the next 02:00 Gitea backup. | `bash -n` passes locally and on 110; manual Gitea backup completed after stale evidence rename. 
| -| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.44 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, stale-vs-active K8s failed Job classification, post-reboot / post-CD recovery anchors, AA/AS 判定, workload 分散判定, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, 110 runaway browser / CI load 分流 PlayBook, healthy-heartbeat suppression, 188 node-exporter restore, 188 DB/Redis exporter restore, 188 MinIO/Velero restore, 188 nginx-exporter restore, 110 Docker disk cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, post-reboot notification noise gates, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side scorecard source-absence classifier, 110 live-sync parity gate, and CD monitoring coverage target-down classification. | Use v1.43 for the next reboot record, then compare actual timing, Plan B trigger, degraded level, failed/stale/active Job counters, runaway-process metrics, CI load attribution, MOMO source availability, data freshness, Velero freshness, exporter scrape, disk usage, notification-noise state, monitoring coverage, and blockers against §1.4 plus §11.1 / §14.8 through §14.32. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow / runaway-process / notification-noise / MOMO source-file / monitoring coverage checks. If using the live 110 script, record its hash and do not assume repo-side v1.42 behavior until synced under approval and deploy parity passes. 
| SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, and `B5_DR_COMPLETE`; live cold-start returns `PASS=86 WARN=0 BLOCKED=1` for the current evidence set, repo-side v1.42 dry-run returns `PASS=88 WARN=0 BLOCKED=1` with blocker `188 momo source file absent while daily sales data stale`, and 23:15 deploy parity correctly blocks live-sync claims until hash parity is restored; repeated healthy/same-failure notification noise is controlled without hiding real alerts, and monitoring coverage target-down is routed through exporter restore before any product restart. |
+| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.49 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, stale-vs-active K8s failed Job classification, post-reboot / post-CD recovery anchors, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, 110 runaway browser / CI load triage PlayBook, healthy-heartbeat suppression, 188 node-exporter restore, 188 DB/Redis exporter restore, 188 MinIO/Velero restore, 188 nginx-exporter restore, 110 Docker disk cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, post-reboot notification noise gates, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side scorecard source-absence classifier, 110 live-sync parity gate, CD monitoring coverage target-down classification, MOMO dedicated token/source preflight, and 2026-06-25 110 CPU orphan Chrome vs active CI triage evidence. | Use v1.49 for the next reboot record, then compare actual timing, Plan B trigger, degraded level, failed/stale/active Job counters, runaway-process metrics, CI load attribution, MOMO source availability, data freshness, Velero freshness, exporter scrape, disk usage, notification-noise state, monitoring coverage, and blockers against §1.4 plus §11.1 / §14.8 through §14.35. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow / runaway-process / notification-noise / MOMO preflight / monitoring coverage checks. If using the live 110 script, record its hash and do not assume repo-side v1.42 behavior until synced under approval and deploy parity passes. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, and `B5_DR_COMPLETE`; latest MOMO dedicated preflight returns `PASS=8 WARN=4 BLOCKED=2`; 110 CPU evidence records old orphan Chrome groups removed by approved SIGTERM while active CI load remains observation-only; repeated healthy/same-failure notification noise is controlled without hiding real alerts, and monitoring coverage target-down is routed through exporter restore before any product restart. |
 | P3-009 | DONE | 100 | Assess 120/121 AA/AS role and host load balancing | 2026-06-12 15:19 live check confirms 120 and 121 are both `Ready control-plane`, `k3s active`, `k3s-agent inactive`, with no taints; however most AWOOOI / ArgoCD / Velero workload remains on 121 after 120 fsck recovery. New assessment defines control-plane AA vs workload AA, migration candidates from 110/188, and stateful migration blockers. | After P0 backup/offsite/cold-start green, implement topology spread for AWOOOI API/Web before moving additional services. | `docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md` exists; SOP v1.6 links AA/AS and load-balancing checks; migration implementation remains explicitly `0%`. |
 | P3-010 | DONE | 100 | Update workload balancing docs with 2026-06-13 live truth | Host role assessment, workplan, SOP, backup status, and LOGBOOK are refreshed with current cold-start, backup, 188 certbot degraded, ArgoCD `km-vectorize` degraded, Gitea main `acaae999`, ArgoCD sync, and final pod placement evidence. | Keep updating this file after the next reboot or deploy. | Docs separate service-green status from DR escrow, workload rollout, and non-service governance debt. |
 | P3-011 | DONE | 100 | Record `km-vectorize` remediation status | LOGBOOK, this workplan, and SOP now state the schedule/label fix, ArgoCD sync evidence, the invalid manual Job boundary, and the 90% waiting-for-next-schedule gate. | After next 03:00 run, update this row and the top verdict with `lastSuccessfulTime` / ArgoCD health evidence. | No document claims ArgoCD green before official CronJob success evidence exists. |
diff --git a/scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh b/scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh
new file mode 100755
index 00000000..0a44bdbb
--- /dev/null
+++ b/scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh
@@ -0,0 +1,231 @@
+#!/usr/bin/env bash
+set -uo pipefail
+
+# Read-only MOMO recovery preflight. This script must not import files, move
+# Drive artifacts, restart containers, change token ownership, or print secrets.
+
+MOMO_HOST="${MOMO_HOST:-ollama@192.168.0.188}"
+FRESHNESS_MAX_DAYS="${FRESHNESS_MAX_DAYS:-2}"
+SSH_CONNECT_TIMEOUT="${SSH_CONNECT_TIMEOUT:-8}"
+
+PASS_COUNT=0
+WARN_COUNT=0
+BLOCKED_COUNT=0
+
+ok() {
+  PASS_COUNT=$((PASS_COUNT + 1))
+  printf 'OK: %s\n' "$*"
+}
+
+warn() {
+  WARN_COUNT=$((WARN_COUNT + 1))
+  printf 'WARN: %s\n' "$*"
+}
+
+blocked() {
+  BLOCKED_COUNT=$((BLOCKED_COUNT + 1))
+  printf 'BLOCKED: %s\n' "$*"
+}
+
+usage() {
+  cat <<'USAGE'
+Usage: momo-drive-token-source-recovery-preflight.sh [--host user@host] [--freshness-max-days N]
+
+Read-only checks:
+  - MOMO public health and local health endpoint on 188
+  - momo-scheduler running / health / UID
+  - Google token metadata only, never token content
+  - scheduler fail-closed log evidence and notification evidence
+  - daily_sales_snapshot / realtime_sales_monthly bounds
+  - latest daily_sales import job
+
+Exit codes:
+  0 = no warnings or blockers
+  1 = warnings only
+  2 = one or more blockers
+USAGE
+}
+
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --host)
+      MOMO_HOST="${2:-}"
+      shift "$(( $# > 1 ? 2 : 1 ))"  # a bare "shift 2" fails without shifting when the value is missing, looping forever
+      ;;
+    --freshness-max-days)
+      FRESHNESS_MAX_DAYS="${2:-}"
+      shift "$(( $# > 1 ? 2 : 1 ))"  # same guard; an empty value is rejected by the numeric check below
+      ;;
+    -h|--help)
+      usage
+      exit 0
+      ;;
+    *)
+      printf 'Unknown argument: %s\n' "$1" >&2
+      usage >&2
+      exit 2
+      ;;
+  esac
+done
+
+if ! [[ "$FRESHNESS_MAX_DAYS" =~ ^[0-9]+$ ]]; then
+  printf 'FRESHNESS_MAX_DAYS must be numeric: %s\n' "$FRESHNESS_MAX_DAYS" >&2
+  exit 2
+fi
+
+tmp_output="$(mktemp -t momo-drive-preflight.XXXXXX)"
+trap 'rm -f "$tmp_output"' EXIT
+
+if ! ssh -o BatchMode=yes -o StrictHostKeyChecking=accept-new -o ConnectTimeout="$SSH_CONNECT_TIMEOUT" "$MOMO_HOST" 'bash -s' >"$tmp_output" <<'REMOTE'
+set -uo pipefail
+
+emit() {
+  printf '%s %s\n' "$1" "${2:-}"
+}
+
+emit HOST "$(hostname 2>/dev/null || true)"
+emit MOMO_HEALTH_CODE "$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 http://127.0.0.1:5003/health 2>/dev/null || true)"
+emit MOMO_PUBLIC_HEALTH_CODE "$(curl -s -o /dev/null -w '%{http_code}' --max-time 8 https://mo.wooo.work/health 2>/dev/null || true)"
+emit SCHEDULER_RUNNING "$(docker inspect -f '{{.State.Running}}' momo-scheduler 2>/dev/null || true)"
+emit SCHEDULER_HEALTH "$(docker inspect -f '{{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' momo-scheduler 2>/dev/null || true)"
+emit SCHEDULER_STARTED_AT "$(docker inspect -f '{{.State.StartedAt}}' momo-scheduler 2>/dev/null || true)"
+emit SCHEDULER_UID "$(docker top momo-scheduler -eo pid,user,uid 2>/dev/null | awk 'NR==2 {print $3}' || true)"
+
+token_stat="$(stat -c '%u:%g:%a' /home/ollama/momo-pro/config/google_token.json 2>/dev/null || true)"
+emit TOKEN_STAT "${token_stat:-missing}"
+
+container_token_stat="$(docker exec momo-scheduler sh -lc 'stat -c "%u:%g:%a" config/google_token.json 2>/dev/null || true' 2>/dev/null || true)"
+emit CONTAINER_TOKEN_STAT "${container_token_stat:-missing}"
+
+logs="$(docker logs --since 8h momo-scheduler 2>&1 || true)"
+emit LOG_AUTH_FAILURE_COUNT "$(printf '%s\n' "$logs" | grep -Ec 'Google Drive 認證失敗|could not locate runnable browser|Permission denied.*google_token|連線或認證失敗' || true)"
+emit LOG_FAIL_CLOSED_COUNT "$(printf '%s\n' "$logs" | grep -Ec '自動匯入失敗|未能確認來源資料夾是否有新檔案' || true)"
+emit LOG_FAILURE_NOTIFY_SUCCESS_COUNT "$(printf '%s\n' "$logs" | grep -Ec '匯入失敗通知已發送|Telegram 通知發送成功' || true)"
+emit LOG_EMPTY_SOURCE_COUNT "$(printf '%s\n' "$logs" | grep -Ec '找到 0 個 Excel|沒有找到待匯入' || true)"
+emit LOG_SUCCESS_IMPORT_COUNT "$(printf '%s\n' "$logs" | grep -Ec '自動匯入完成|匯入成功|成功匯入' || true)"
+
+psql_query() {
+  docker exec momo-db psql -h 127.0.0.1 -U momo -d momo_analytics -Atc "$1" 2>/dev/null || true
+}
+
+emit DB_DAILY "$(psql_query "SELECT count(*) || chr(124) || coalesce(min(snapshot_date::date)::text, chr(45)) || chr(124) || coalesce(max(snapshot_date::date)::text, chr(45)) FROM daily_sales_snapshot;")"
+emit DB_MONTHLY_CURRENT "$(psql_query "SELECT count(*) || chr(124) || coalesce(min(\"日期\"::date)::text, chr(45)) || chr(124) || coalesce(max(\"日期\"::date)::text, chr(45)) FROM realtime_sales_monthly WHERE \"日期\"::date >= make_date(extract(year from current_date)::int, extract(month from current_date)::int, 1);")"
+emit DB_MONTHLY_SYNC "$(psql_query "WITH scope AS (SELECT min(snapshot_date::date) dmin, max(snapshot_date::date) dmax, count(*) sc FROM daily_sales_snapshot WHERE snapshot_date::date >= make_date(extract(year from current_date)::int, extract(month from current_date)::int, 1)), monthly AS (SELECT count(*) mc, min(\"日期\"::date) mmin, max(\"日期\"::date) mmax FROM realtime_sales_monthly, scope WHERE scope.sc > 0 AND \"日期\"::date BETWEEN scope.dmin AND scope.dmax) SELECT coalesce(scope.sc,0)::text || chr(124) || coalesce(monthly.mc,0)::text || chr(124) || coalesce(scope.dmin::text,chr(45)) || chr(124) || coalesce(scope.dmax::text,chr(45)) || chr(124) || coalesce(monthly.mmin::text,chr(45)) || chr(124) || coalesce(monthly.mmax::text,chr(45)) FROM scope, monthly;")"
+emit DB_DAILY_FRESHNESS "$(psql_query "SELECT coalesce((current_date - max(snapshot_date::date))::text, chr(45)) || chr(124) || coalesce(max(snapshot_date::date)::text, chr(45)) FROM daily_sales_snapshot;")"
+emit DB_LATEST_DAILY_IMPORT_JOB "$(psql_query "SELECT coalesce(id::text, chr(45)) || chr(124) || coalesce(status, chr(45)) || chr(124) || coalesce(drive_file_name, chr(45)) || chr(124) || coalesce(replace(created_at::text, chr(32), chr(84)), chr(45)) || chr(124) || coalesce(replace(completed_at::text, chr(32), chr(84)), chr(45)) || chr(124) || coalesce(total_rows::text, chr(45)) || chr(124) || coalesce(success_rows::text, chr(45)) || chr(124) || coalesce(error_rows::text, chr(45)) FROM import_jobs WHERE job_type = 'daily_sales' ORDER BY created_at DESC LIMIT 1;")"
+emit IMPORT_CONFIG "$(psql_query "SELECT config_key || chr(61) || config_value FROM import_config;" | awk -F= '$1 == "gdrive_folder_path" {folder=$2} $1 == "gdrive_file_pattern" {pattern=$2} END {if (folder || pattern) print folder "|" pattern}')"
+REMOTE
+then
+  cat "$tmp_output"
+  blocked "MOMO host read-only SSH preflight failed: $MOMO_HOST"
+else
+  cat "$tmp_output"
+fi
+
+value_for() {
+  awk -v key="$1" '$1 == key {sub($1 " ", ""); print; exit}' "$tmp_output"
+}
+
+num_for() {
+  local value
+  value="$(value_for "$1")"
+  [[ "$value" =~ ^[0-9]+$ ]] && printf '%s\n' "$value" || printf '0\n'
+}
+
+health_code="$(value_for MOMO_HEALTH_CODE)"
+public_health_code="$(value_for MOMO_PUBLIC_HEALTH_CODE)"
+[[ "$public_health_code" == "200" ]] && ok "MOMO public health endpoint returns 200" || blocked "MOMO public health endpoint is not 200: ${public_health_code:-missing}"
+[[ "$health_code" == "200" ]] && ok "MOMO local health endpoint returns 200" || warn "MOMO local health endpoint is not 200: ${health_code:-missing}"
+
+scheduler_running="$(value_for SCHEDULER_RUNNING)"
+scheduler_health="$(value_for SCHEDULER_HEALTH)"
+scheduler_started_at="$(value_for SCHEDULER_STARTED_AT)"
+[[ "$scheduler_running" == "true" ]] && ok "momo-scheduler container is running" || blocked "momo-scheduler container is not running"
+[[ "$scheduler_health" == "healthy" ]] && ok "momo-scheduler container health is healthy" || warn "momo-scheduler health is not healthy: ${scheduler_health:-missing}"
+[[ -n "$scheduler_started_at" ]] && ok "momo-scheduler started_at metadata is available: $scheduler_started_at" || warn "momo-scheduler started_at metadata unavailable"
+
+scheduler_uid="$(value_for SCHEDULER_UID)"
+token_stat="$(value_for TOKEN_STAT)"
+container_token_stat="$(value_for CONTAINER_TOKEN_STAT)"
+if [[ "$token_stat" == "missing" || -z "$token_stat" ]]; then
+  warn "host Google token artifact metadata is missing"
+elif [[ "$scheduler_uid" =~ ^[0-9]+$ ]]; then
+  token_uid="${token_stat%%:*}"
+  token_mode="${token_stat##*:}"
+  if [[ "$token_uid" == "$scheduler_uid" && "$token_mode" =~ ^[0-6]00$ ]]; then  # match octal digits; a decimal "-le 600" would also accept group-readable modes such as 440
+    ok "host Google token metadata matches scheduler UID and restrictive mode"
+  else
+    warn "host Google token metadata does not match scheduler UID/mode: token=$token_stat scheduler_uid=$scheduler_uid"
+  fi
+else
+  warn "scheduler UID unavailable; token metadata cannot be matched"
+fi
+
+if [[ "$container_token_stat" == "missing" || -z "$container_token_stat" ]]; then
+  warn "container Google token artifact metadata is missing"
+else
+  ok "container Google token artifact metadata exists"
+fi
+
+auth_failures="$(num_for LOG_AUTH_FAILURE_COUNT)"
+fail_closed="$(num_for LOG_FAIL_CLOSED_COUNT)"
+notify_success="$(num_for LOG_FAILURE_NOTIFY_SUCCESS_COUNT)"
+if [[ "$auth_failures" -gt 0 && "$fail_closed" -gt 0 ]]; then
+  ok "scheduler has recent Drive auth/API failure fail-closed evidence"
+else
+  warn "scheduler recent fail-closed evidence not observed in the last 8h"
+fi
+
+if [[ "$notify_success" -gt 0 ]]; then
+  ok "scheduler failure notification success evidence exists"
+else
+  warn "scheduler failure notification success evidence not observed in the last 8h"
+fi
+
+import_config="$(value_for IMPORT_CONFIG)"
+[[ "$import_config" == *"當日業績匯入|即時業績_當日"* ]] && ok "Drive import config points to expected daily-sales intake" || blocked "Drive import config is unavailable or drifted: ${import_config:-missing}"
+
+monthly_sync="$(value_for DB_MONTHLY_SYNC)"
+IFS='|' read -r sync_snapshot_count sync_monthly_count sync_dmin sync_dmax sync_mmin sync_mmax <<<"$monthly_sync"
+if [[ "$sync_snapshot_count" =~ ^[0-9]+$ && "$sync_snapshot_count" -gt 0 && "$sync_snapshot_count" == "$sync_monthly_count" && "$sync_dmin" == "$sync_mmin" && "$sync_dmax" == "$sync_mmax" ]]; then
+  ok "current-month daily snapshot and realtime tables are in sync"
+else
+  blocked "current-month daily snapshot and realtime sync is not proven: ${monthly_sync:-missing}"
+fi
+
+freshness="$(value_for DB_DAILY_FRESHNESS)"
+IFS='|' read -r freshness_days latest_daily_date <<<"$freshness"
+if [[ "$freshness_days" =~ ^[0-9]+$ && "$freshness_days" -le "$FRESHNESS_MAX_DAYS" ]]; then
+  ok "daily sales data freshness is within ${FRESHNESS_MAX_DAYS} days: $freshness"
+elif [[ "$freshness_days" =~ ^[0-9]+$ ]]; then
+  blocked "daily sales data is stale: $freshness"
+else
+  blocked "daily sales freshness is unavailable: ${freshness:-missing}"
+fi
+
+latest_job="$(value_for DB_LATEST_DAILY_IMPORT_JOB)"
+IFS='|' read -r job_id job_status job_file job_created job_completed job_total job_success job_errors <<<"$latest_job"
+if [[ "$job_id" =~ ^[0-9]+$ && "$job_status" == "completed" && "$job_total" == "$job_success" && "$job_errors" == "0" ]]; then
+  ok "latest daily import job completed cleanly: id=$job_id file=$job_file"
+else
+  warn "latest daily import job is not a clean completed job: ${latest_job:-missing}"
+fi
+
+if [[ "$freshness_days" =~ ^[0-9]+$ && "$freshness_days" -gt "$FRESHNESS_MAX_DAYS" ]]; then
+  if [[ "$auth_failures" -gt 0 ]]; then
+    blocked "release blocker is stale business data with active Drive auth/source evidence gate"
+  else
+    blocked "release blocker is stale business data; source evidence must be refreshed"
+  fi
+fi
+
+printf 'MOMO_DRIVE_TOKEN_SOURCE_PREFLIGHT PASS=%d WARN=%d BLOCKED=%d HOST=%s FRESHNESS_MAX_DAYS=%s\n' \
+  "$PASS_COUNT" "$WARN_COUNT" "$BLOCKED_COUNT" "$MOMO_HOST" "$FRESHNESS_MAX_DAYS"
+
+if [[ "$BLOCKED_COUNT" -gt 0 ]]; then
+  exit 2
+fi
+if [[ "$WARN_COUNT" -gt 0 ]]; then
+  exit 1
+fi
+exit 0