diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index b1ba4124..c2be9d88 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -1,3 +1,21 @@
+## 2026-06-25|11:44 MOMO V10.667 preflight hardening and replacement-event readback
+
+**Background**: After the 11:35 readback, MOMO on 188 went through another automatic replacement / restart warm-up, and the `/health` version advanced to `V10.667`. To avoid treating a stale StartedAt or a bare HTTP 200 as the latest state, this round hardens `scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh` so the SOP directly reads the MOMO version, the StartedAt / health / restart count of the three core containers, lifecycle events within the last 45 minutes, and whether an exact `即時業績_當日.xlsx` candidate file exists on the 188 / backup paths.
+
+**Read-only evidence**:
+- `scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh` passes the syntax check; the live run is expected to exit with code `2`, result `PASS=15 WARN=5 BLOCKED=2`.
+- `https://mo.wooo.work/health` and the 188 local health both return `200`, version `V10.667`.
+- 188 container metadata: `momo-pro-system` StartedAt `2026-06-25T03:43:30Z`, `momo-scheduler` StartedAt `2026-06-25T03:43:31Z`, `momo-telegram-bot` StartedAt `2026-06-25T03:43:30Z`; all three report health `healthy`, scheduler `RestartCount=0`.
+- `MOMO_CONTAINER_REPLACE_EVENTS_45M 23`, meaning there is still recent lifecycle / replacement evidence from 11:42-11:43; this window did not run `docker compose`, a container restart, or any Docker / systemd / Nginx / firewall / K8s write.
+- `LOCAL_EXACT_DAILY_SOURCE_COUNT 0`, `LOCAL_EXACT_DAILY_SOURCE_LATEST none`: no exact `即時業績_當日.xlsx` candidate usable directly as the Drive daily-sales intake was found under 188 `/home/ollama/momo-pro` or `/backup`.
+- DB freshness unchanged: `DB_DAILY 104614|2025-07-01|2026-06-17`, `DB_MONTHLY_SYNC 10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, `DB_DAILY_FRESHNESS 8|2026-06-17`, latest daily import job `56 completed|即時業績_當日.xlsx|2026-06-18T11:41:00|2026-06-18T11:42:02|10936|10936|0`.
+
+**Verdict**:
+- May claim: the MOMO public service is currently on `V10.667` and app / scheduler / bot are healthy; the preflight now captures version, container replacement events, and source-file absence instead of looking only at `/health`.
+- May not claim: MOMO data current, Google token repaired, source file restored, full-stack green, DR complete. The only hard service blocker is still MOMO business data freshness; credential escrow is still missing `5`.
+
+**Boundary**: This round was entirely read-only SSH / curl / Docker metadata / DB query / docs-only; no import, no Drive file moves, no token reads, no host runtime writes, and no Wazuh / 112 / SOC operations.
+
 ## 2026-06-25|11:35 MOMO V10.665 cold-start / source absence readback after replace

 **Background**: After the 11:32 read-only refresh, 188 Docker events showed MOMO `momo-pro-system`, `momo-scheduler`, and `momo-telegram-bot` going through a compose replace / restart at 11:33. This was not a manual operation in this window; this window performed no Docker / systemd / Nginx / firewall / K8s write. Because the version and container StartedAt changed, the latest health / preflight / cold-start must be rerun to avoid using the stale 11:21 container evidence to claim the current state.
diff --git a/docs/runbooks/FULL-STACK-COLD-START-SOP.md b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
index b4844f7a..f6b9635e 100644
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -1,6 +1,6 @@
 # AWOOOI Full-Stack Cold Start and Host Reboot SOP

-> Version: v1.49
+> Version: v1.50
 > Last updated: 2026-06-25 Asia/Taipei
 > Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.

@@ -10,7 +10,7 @@

 This section is the first decision anchor after every handover, power-on, shutdown, or reboot. If the date is not today, rerun the live check first, then update this section and `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md`.

-2026-06-25 11:35 live read-only refresh supersedes the 11:21 wording for service availability. It confirms the SOP is behaving correctly after the 11:33 MOMO compose replace: hosts, routes, K3s, AWOOOI API health, MOMO service health, backup/offsite, and core monitoring/exporter surfaces are available, while full-stack release remains blocked by MOMO business data freshness and DR remains blocked by missing credential escrow evidence. MOMO is healthy on `V10.665`, but data bounds still stop at `2026-06-17`; therefore website `200` / new version / container healthy must not be treated as data freshness. MOMO Drive auth false-green is fixed in production by Gitea Actions `cd.yaml #910` at commit `e137d7a5d02a7595a44c3f3cc1cf54b766424ee7`: the 10:04 scheduler run proved Google Drive auth failure now records a failed import path and sends a formal failure notification instead of reporting empty-folder success. The Google Drive token ownership/writeback WARN remains because token metadata is still missing; this must be handled as a separate evidence / owner gate and not by reading token contents. The 110 live script parity blocker from 23:15 still applies.
+2026-06-25 11:44 live read-only refresh supersedes the 11:35 MOMO service-version wording. Hosts, routes, K3s, AWOOOI API health, MOMO service health, backup/offsite, and core monitoring/exporter surfaces are available, while full-stack release remains blocked by MOMO business data freshness and DR remains blocked by missing credential escrow evidence. MOMO is healthy on `V10.667` after 11:42-11:43 replacement / restart warm-up evidence, but data bounds still stop at `2026-06-17`; therefore website `200` / new version / container healthy must not be treated as data freshness. MOMO Drive auth false-green is fixed in production by Gitea Actions `cd.yaml #910` at commit `e137d7a5d02a7595a44c3f3cc1cf54b766424ee7`: the 10:04 scheduler run proved Google Drive auth failure now records a failed import path and sends a formal failure notification instead of reporting empty-folder success. The Google Drive token ownership/writeback WARN remains because token metadata is still missing; this must be handled as a separate evidence / owner gate and not by reading token contents. The 110 live script parity blocker from 23:15 still applies.

 ```text
 Repo-side reboot SOP / Plan B / automation contracts: COMPLETE, 100%.
@@ -19,9 +19,9 @@ Repo-side cold-start v1.42+ live read-only run: MOMO_SOURCE_EMPTY_EVIDENCE_LINES
 110 live-sync parity: 2026-06-24 23:15 read-only `verify-cold-start-monitor-deploy.sh` correctly BLOCKED because repo script hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`. Do not use live 110 monitor output to prove v1.42 behavior until the approved live-sync gate in §13.3.1 passes.
 Service state: SERVICE_AVAILABLE_MOMO_DATA_STALE_GDRIVE_TOKEN_WARN_DR_ESCROW_BLOCKED; 110/120/121/188 reachable, K3s mon/mon1 Ready, ArgoCD awoooi-prod Synced/Healthy at revision 7db7800e399caed5487a705c81ec993dec76c70f, public routes/TLS green, 110/188 backup health fresh, 188 node-exporter / PostgreSQL exporter / Redis exporter restored, 188 MinIO endpoint and Velero BackupStorageLocation restored, 110 disk pressure cleared.
 Runtime release state: API/Web/Worker are ready; production API health returns healthy with `environment=prod`, `mock_mode=false`, and postgresql / redis / openclaw / signoz / gcp ollama providers up. 11:35 direct route smoke returned 200 for AWOOOI API, `/zh-TW/iwooos`, VibeWork, AwoooGo, MOMO health, Stock, and Bitan. Cold-start raw route gate still records expected redirect statuses such as awoooi web=307 and sentry=302.
-MOMO release state: mo.wooo.work health is healthy on version V10.665 after an 11:33 compose replace / restart sequence observed via Docker events. `momo-pro-system`, `momo-scheduler`, and `momo-telegram-bot` are healthy with StartedAt around `2026-06-25T03:33Z` and `RestartCount=0`. Gitea main fast-forwarded to e137d7a5d02a7595a44c3f3cc1cf54b766424ee7 and Gitea Actions cd.yaml #910 completed Success for the Drive-auth fail-closed fix. 188 host source and `momo-scheduler` container source both contain the marker `Google Drive 連線或認證失敗,未能確認來源資料夾是否有新檔案`; the 10:04 scheduler proof shows auth failure is now logged as `❌ 自動匯入失敗` and Telegram failure notification is sent successfully. Earlier cd.yaml #904 remains the monthly-sync failure boundary fix.
+MOMO release state: mo.wooo.work health is healthy on version V10.667 after 11:42-11:43 lifecycle / replacement events observed via Docker metadata. `momo-pro-system`, `momo-scheduler`, and `momo-telegram-bot` are healthy with StartedAt around `2026-06-25T03:43Z`; scheduler `RestartCount=0`. 11:44 dedicated preflight records `MOMO_CONTAINER_REPLACE_EVENTS_45M 23`, so a stability window must be observed before calling it quiet, even though health is green. Gitea main fast-forwarded to e137d7a5d02a7595a44c3f3cc1cf54b766424ee7 and Gitea Actions cd.yaml #910 completed Success for the Drive-auth fail-closed fix. 188 host source and `momo-scheduler` container source both contain the marker `Google Drive 連線或認證失敗,未能確認來源資料夾是否有新檔案`; the 10:04 scheduler proof shows auth failure is now logged as `❌ 自動匯入失敗` and Telegram failure notification is sent successfully. Earlier cd.yaml #904 remains the monthly-sync failure boundary fix.
 MOMO data state: current-month daily_sales_snapshot and realtime_sales_monthly still match, but both stop at 2026-06-17: `daily_sales_snapshot=104614|2025-07-01|2026-06-17`, `realtime_sales_monthly current-month=10936|2026/06/01|2026/06/17`, and cold-start still reports `MOMO_MONTHLY_SYNC 10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`. `MOMO_DAILY_FRESHNESS 8|2026-06-17` is a hard blocker because business data is not current. Latest import job remains `56 completed|即時業績_當日.xlsx|2026-06-18 11:41:00|2026-06-18 11:42:02|10936|10936|0`; no newer successful daily-sales import appeared by the 11:35 refresh. Targeted 188 source-file search did not find a newer `即時業績_當日` intake file; `data/excel_exports/MOMO_All_20260620_2211.xlsx` is an export artifact, not the configured Drive daily-sales intake source.
-Google Drive / source-file state: 2026-06-25 10:35 cold-start reports `MOMO_GDRIVE_TOKEN_STAT missing scheduler_uid=100000`; direct metadata-only readback confirms host path `/home/ollama/momo-pro/config/google_token.json` is missing and container-side `config/google_token.json` is missing, while the scheduler process runs as UID/GID `100000:100000`. Do not read token content and do not recreate/chown token evidence without an explicit maintenance-window / owner approval. The data blocker is now stale business data with a live auth-failure proof; token missing remains a separate WARN until owner-provided token/writeback evidence is restored. With cd.yaml #910 live, any future Drive auth/API failure must be treated as failed import evidence rather than a no-file success. 2026-06-25 11:01 dedicated preflight `scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh` returns `PASS=8 WARN=4 BLOCKED=2`: public/local health and scheduler are healthy, token metadata remains missing, and `DB_DAILY_FRESHNESS 8|2026-06-17` remains a hard blocker.
+Google Drive / source-file state: 2026-06-25 10:35 cold-start reports `MOMO_GDRIVE_TOKEN_STAT missing scheduler_uid=100000`; direct metadata-only readback confirms host path `/home/ollama/momo-pro/config/google_token.json` is missing and container-side `config/google_token.json` is missing, while the scheduler process runs as UID/GID `100000:100000`. Do not read token content and do not recreate/chown token evidence without an explicit maintenance-window / owner approval. The data blocker is now stale business data with a live auth-failure proof; token missing remains a separate WARN until owner-provided token/writeback evidence is restored. With cd.yaml #910 live, any future Drive auth/API failure must be treated as failed import evidence rather than a no-file success. 2026-06-25 11:44 dedicated preflight `scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh` returns `PASS=15 WARN=5 BLOCKED=2`: public/local health and scheduler are healthy, V10.667 is live, exact local `即時業績_當日.xlsx` candidate count is `0`, token metadata remains missing, and `DB_DAILY_FRESHNESS 8|2026-06-17` remains a hard blocker.
 110 CPU/load readback: 2026-06-25 10:58 user-approved minimal SIGTERM targeted only orphan `stockplatform-review-bulk-ux` Chrome process groups `438005`, `471295`, `640155`, and `670628`; `OLD_GROUPS_REMAINING` returned empty. 11:20 readback shows remaining CPU is active `stockplatform-product-ux-smoke.mjs` with parent node process plus install/build work, not orphan Chrome. No Docker/systemd/Nginx/firewall/K8s write was performed; do not cancel active CI/smoke unless separately approved.
 Backup / monitoring state: backup-status core blockers are 0, 110 is 13/13 fresh failed=0, 188 is 2/2 fresh failed=0, offsite_fresh=1, rclone_gdrive_fresh=1, integrity_stale=0, last aggregate is 2026-06-25 02:35:09. 11:32 backup-status --no-notify --no-refresh reports 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, integrity_stale=0, offsite_fresh=1, rclone_gdrive_fresh=1, escrow_missing=5.
 Notification-noise state: healthy AWOOOI heartbeat is suppressed; heartbeat warning dedupe uses stable actionable fingerprints so HTTP status / timeout / latency drift does not create a new Telegram event every 30 minutes; MOMO Pro monitor uses https://mo.wooo.work/health as primary truth and no longer checks the 188 root path; MoWoooWorkDown now labels component=momo-pro-system and requires public/local/container/data-freshness triage instead of blind restart; docker-health-monitor keeps 5-minute repair cadence but has a separate 30-minute Telegram fallback cooldown; Bitan public-content check keeps failure alerting with same-fingerprint cooldown and one recovery notice.
@@ -1825,15 +1825,15 @@ ssh ollama@192.168.0.188 'bash -s' < scripts/ops/188-node-exporter-restore.sh

 | Item | 2026-06-25 MOMO freshness / token baseline |
 |------|------------------------------------|
-| SOP version | `v1.48` |
+| SOP version | `v1.50` |
 | Token current state | `MOMO_GDRIVE_TOKEN_STAT missing scheduler_uid=100000`; host path and container-side `config/google_token.json` are missing by metadata-only readback |
 | Token recovery boundary | Owner-gated maintenance only; never read, paste, or store the token value / hash / partial; never write chat passwords or workarounds into the repo |
 | Drive auth behavior | 2026-06-25 10:04 scheduler logs `Google Drive 認證失敗: could not locate runnable browser`, then `❌ 自動匯入失敗` and Telegram failure notification success; this is the correct fail-closed behavior |
 | Drive pending folder | `當日業績匯入`, pattern `即時業績_當日`; no new successful import before 10:35 |
 | Latest valid import | Job `56 completed`, `即時業績_當日.xlsx`, `2026-06-18 11:41:00..11:42:02`, `10936/10936/0` |
 | DB parity | `daily_sales_snapshot=104614|2025-07-01|2026-06-17`; current-month `realtime_sales_monthly=10936|2026/06/01|2026/06/17`; cold-start `MOMO_MONTHLY_SYNC 10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17` |
-| Data freshness | `MOMO_DAILY_FRESHNESS 8|2026-06-17` as of 2026-06-25 10:35 |
-| Live cold-start readback | `PASS=85 WARN=1 BLOCKED=1`, result `BLOCKED` |
+| Data freshness | `MOMO_DAILY_FRESHNESS 8|2026-06-17` as of 2026-06-25 11:44 |
+| Live cold-start readback | `PASS=87 WARN=1 BLOCKED=1`, result `BLOCKED`; dedicated MOMO preflight `PASS=15 WARN=5 BLOCKED=2` |
 | 110 live script sync | `/home/wooo/scripts/full-stack-cold-start-check.sh` hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8` |
 | Alert behavior | Drive auth failure now sends failure notification; heartbeat success remains suppressed; stale data alert must remain visible until data freshness recovers |
 | Declaration limit | May claim hosts/routes/K3s/backups recovered; may not claim MOMO data current, full-stack green, or DR complete |
@@ -1856,7 +1856,7 @@ SELECT 'daily_freshness|' || (CURRENT_DATE - max(snapshot_date)::date) || '|' ||
 SQL
 ```

-Preferred path is the scripted preflight. It is read-only and returns `0` for clean, `1` for WARN-only, and `2` for BLOCKED. 2026-06-25 11:01 live run returned `PASS=8 WARN=4 BLOCKED=2`: `https://mo.wooo.work/health` and local health both returned `200`, `momo-scheduler` was running/healthy, current-month DB parity matched, latest daily import job `56` was clean, token metadata remained missing, and `DB_DAILY_FRESHNESS 8|2026-06-17` remained the blocker.
+Preferred path is the scripted preflight. It is read-only and returns `0` for clean, `1` for WARN-only, and `2` for BLOCKED. 2026-06-25 11:44 live run returned `PASS=15 WARN=5 BLOCKED=2`: `https://mo.wooo.work/health` and local health both returned `200`, health version was `V10.667`, app / scheduler / Telegram bot were healthy, current-month DB parity matched, latest daily import job `56` was clean, exact local `即時業績_當日.xlsx` candidates under 188 / backup paths were `0`, token metadata remained missing, recent MOMO lifecycle events were observed, and `DB_DAILY_FRESHNESS 8|2026-06-17` remained the blocker.
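
Reviewer note: the exit-code contract stated for the preflight (`0` clean, `1` WARN-only, `2` BLOCKED) could be mirrored by a caller roughly as follows. This is a minimal sketch, not the preflight script itself: `preflight_exit_code` is a hypothetical helper, it assumes the summary line carries `WARN=<n>` and `BLOCKED=<n>` tokens as in the 11:44 output, and treating an unparseable line as BLOCKED is an assumption made here to stay fail-closed.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): map a preflight summary line
# to the documented exit codes: 0 clean, 1 WARN-only, 2 BLOCKED.
preflight_exit_code() {
  local line="$1" warn blocked
  # Extract the numeric counters; sed prints nothing if a token is absent.
  warn=$(printf '%s\n' "$line" | sed -n 's/.*WARN=\([0-9][0-9]*\).*/\1/p')
  blocked=$(printf '%s\n' "$line" | sed -n 's/.*BLOCKED=\([0-9][0-9]*\).*/\1/p')
  # Assumption: a line without both counters is treated as BLOCKED (fail closed).
  if [ -z "$warn" ] || [ -z "$blocked" ]; then
    echo 2
    return
  fi
  if [ "$blocked" -gt 0 ]; then
    echo 2   # any BLOCKED finding wins over WARN
  elif [ "$warn" -gt 0 ]; then
    echo 1   # WARN-only
  else
    echo 0   # clean
  fi
}
```

Given the 11:44 summary line `MOMO_DRIVE_TOKEN_SOURCE_PREFLIGHT PASS=15 WARN=5 BLOCKED=2 ...`, the helper prints `2`, which matches the BLOCKED exit code expected for the live run.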
若 Drive token artifact missing 或 Drive pending folder 無新來源檔,不可手動 truncate、不可以舊 archive 檔重複匯入來製造「最新」,也不可把 DB parity 當 data freshness。下一個解除 blocker 的證據必須是: @@ -2149,16 +2149,23 @@ MOMO 專用 preflight: scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh ``` -此腳本只做 read-only SSH / Docker metadata / logs / DB query,不讀 token 內容、不 import、不移動 Drive 檔、不 restart。11:01 live result: +此腳本只做 read-only SSH / Docker metadata / logs / DB query,不讀 token 內容、不 import、不移動 Drive 檔、不 restart。11:44 live result: ```text -MOMO_DRIVE_TOKEN_SOURCE_PREFLIGHT PASS=8 WARN=4 BLOCKED=2 HOST=ollama@192.168.0.188 FRESHNESS_MAX_DAYS=2 +MOMO_DRIVE_TOKEN_SOURCE_PREFLIGHT PASS=15 WARN=5 BLOCKED=2 HOST=ollama@192.168.0.188 FRESHNESS_MAX_DAYS=2 MOMO_PUBLIC_HEALTH_CODE 200 MOMO_HEALTH_CODE 200 +MOMO_HEALTH_VERSION V10.667 +MOMO_APP_HEALTH healthy SCHEDULER_RUNNING true SCHEDULER_HEALTH healthy +SCHEDULER_RESTART_COUNT 0 +TELEGRAM_BOT_HEALTH healthy +MOMO_CONTAINER_REPLACE_EVENTS_45M 23 TOKEN_STAT missing CONTAINER_TOKEN_STAT missing +LOCAL_EXACT_DAILY_SOURCE_COUNT 0 +LOCAL_EXACT_DAILY_SOURCE_LATEST none DB_MONTHLY_SYNC 10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17 DB_DAILY_FRESHNESS 8|2026-06-17 DB_LATEST_DAILY_IMPORT_JOB 56|completed|即時業績_當日.xlsx|2026-06-18T11:41:00.853176|2026-06-18T11:42:02.309425|10936|10936|0 diff --git a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md index 07f1d6c2..ec6d8875 100644 --- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md +++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md @@ -11,15 +11,15 @@ | Area | Status | Completion | Evidence | |------|--------|------------|----------| -| Overall recovery readiness | SERVICE_AVAILABLE_MOMO_DATA_STALE_GDRIVE_TOKEN_WARN_DR_ESCROW_BLOCKED | 97% | 2026-06-25 11:35 live cold-start returned `PASS=87 WARN=1 BLOCKED=1`, result `BLOCKED` because MOMO business data 
freshness remains stale and Google Drive token ownership/writeback metadata is not confirmed. 110 / 120 / 121 / 188 ping and SSH port are OK, K3s `mon` / `mon1` are Ready, public routes/TLS are green, AWOOOI API health is healthy/prod/mock=false, MOMO service health is healthy on `V10.665` after an 11:33 compose replace, 110 / 188 runtime and backup checks are green。MOMO Gitea `main` is `e137d7a5d02a7595a44c3f3cc1cf54b766424ee7`; `cd.yaml #910` succeeded and deployed a fail-closed Drive auth/API boundary into 188 host source and `momo-scheduler` container source. Remaining hard service blocker is still MOMO business data freshness: `MOMO_DAILY_FRESHNESS 8|2026-06-17`; DB current-month readback remains `daily_sales_snapshot=104614|2025-07-01|2026-06-17` and `realtime_sales_monthly=10936|2026/06/01|2026/06/17`; latest valid job `56` is still completed with `sync_success=true` and bounds `2026-06-01..2026-06-17`. Warning evidence: metadata-only check shows `/home/ollama/momo-pro/config/google_token.json` missing on host and `config/google_token.json` missing inside `momo-scheduler`, while scheduler runs as UID/GID `100000:100000`; no token content was read. DR remains blocked because credential escrow evidence markers are still missing and must not be forged. | +| Overall recovery readiness | SERVICE_AVAILABLE_MOMO_DATA_STALE_GDRIVE_TOKEN_WARN_DR_ESCROW_BLOCKED | 97% | 2026-06-25 11:35 live cold-start returned `PASS=87 WARN=1 BLOCKED=1`, result `BLOCKED` because MOMO business data freshness remains stale and Google Drive token ownership/writeback metadata is not confirmed. 
2026-06-25 11:44 dedicated MOMO preflight returned `PASS=15 WARN=5 BLOCKED=2`: 110 / 120 / 121 / 188 ping and SSH port are OK, K3s `mon` / `mon1` are Ready, public routes/TLS are green, AWOOOI API health is healthy/prod/mock=false, MOMO service health is healthy on `V10.667` after 11:42-11:43 replacement / restart warm-up evidence, 110 / 188 runtime and backup checks are green。MOMO Gitea `main` is `e137d7a5d02a7595a44c3f3cc1cf54b766424ee7`; `cd.yaml #910` succeeded and deployed a fail-closed Drive auth/API boundary into 188 host source and `momo-scheduler` container source. Remaining hard service blocker is still MOMO business data freshness: `MOMO_DAILY_FRESHNESS 8|2026-06-17`; DB current-month readback remains `daily_sales_snapshot=104614|2025-07-01|2026-06-17` and `realtime_sales_monthly=10936|2026/06/01|2026/06/17`; latest valid job `56` is still completed with `sync_success=true` and bounds `2026-06-01..2026-06-17`. Warning evidence: metadata-only check shows `/home/ollama/momo-pro/config/google_token.json` missing on host and `config/google_token.json` missing inside `momo-scheduler`, while scheduler runs as UID/GID `100000:100000`; no token content was read. DR remains blocked because credential escrow evidence markers are still missing and must not be forged. | | P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-25 09:05 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, VIP `192.168.0.125` is present, node filesystem / disk-pressure / readonly events are `0`, and latest `km-vectorize-29705460-55rgs` completed. 
| | P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-25 09:05 backup / alert readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`, last aggregate `2026-06-25 02:35:09`。DR remains blocked on real non-secret credential escrow evidence IDs. | -| P2 service / data truth | BLOCKED_MOMO_DATA_FRESHNESS_WITH_GDRIVE_TOKEN_WARN | 98% | Public route/TLS, API/Web route, MOMO health `V10.665`, MOMO main / CD `#904` monthly-sync failure boundary, MOMO main / CD `#910` Drive-auth fail-closed boundary, 10:04 live scheduler fail-closed proof, direct 11:35 public route smoke all 200 for AWOOOI API / IwoooS / VibeWork / AwoooGo / MOMO health / Stock / Bitan, current-month parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, backup exporters, schedules, K3s node readiness/storage conditions, VIP, and 110 / 188 runtime health are green. MOMO latest business date remains `2026-06-17`; stale age is `8` days as of 11:35. Latest valid job `56` already imported `即時業績_當日.xlsx` with `sync_success=true` and bounds `2026-06-01..2026-06-17`; targeted source search did not find a newer `即時業績_當日` intake file. Google Drive token metadata is still a WARN because host and container token paths are missing; this requires owner-gated metadata repair/evidence and must not be solved by reading token contents. 
| -| P3 docs / automation contracts | DONE_WITH_MOMO_PREFLIGHT_AND_CPU_TRIAGE | 100% | Workplan, SOP v1.49, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, 188 DB/Redis exporter restore helper, 188 MinIO/Velero restore helper, 188 nginx-exporter restore helper, 110 Docker disk pressure cleanup boundary, MOMO Google Drive token userns readback, MOMO daily freshness blocker, MOMO Pro false-noise health monitor source-of-truth, docker-health direct Telegram fallback cooldown, Bitan public-content same-fingerprint cooldown, notification-noise readback, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side cold-start v1.42 source absence classifier, live-sync parity gate, MOMO import-boundary production deploy, MOMO Drive-auth fail-closed production deploy, 10:04 scheduler fail-closed live proof, 10:35 route / DB / backup refresh, 11:01 MOMO dedicated preflight gate, 10:58 user-approved 110 orphan Chrome SIGTERM evidence, MacBook Pro Codex safe artifact sync readback, and 
2026-06-25 live refresh with Google Drive token metadata WARN are updated. 2026-06-24 23:15 read-only verify still shows repo cold-start hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`; live 110 script sync of the v1.42 classifier is not claimed until separately approved and recorded. | +| P2 service / data truth | BLOCKED_MOMO_DATA_FRESHNESS_WITH_GDRIVE_TOKEN_WARN | 98% | Public route/TLS, API/Web route, MOMO health `V10.667`, MOMO main / CD `#904` monthly-sync failure boundary, MOMO main / CD `#910` Drive-auth fail-closed boundary, 10:04 live scheduler fail-closed proof, direct 11:35 public route smoke all 200 for AWOOOI API / IwoooS / VibeWork / AwoooGo / MOMO health / Stock / Bitan, current-month parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, backup exporters, schedules, K3s node readiness/storage conditions, VIP, and 110 / 188 runtime health are green. 11:44 preflight confirms app / scheduler / Telegram bot healthy, scheduler restart count `0`, recent lifecycle events `23`, and exact local source candidate count `0`. MOMO latest business date remains `2026-06-17`; stale age is `8` days as of 11:44. Latest valid job `56` already imported `即時業績_當日.xlsx` with `sync_success=true` and bounds `2026-06-01..2026-06-17`; targeted source search did not find a newer `即時業績_當日` intake file. Google Drive token metadata is still a WARN because host and container token paths are missing; this requires owner-gated metadata repair/evidence and must not be solved by reading token contents. 
| +| P3 docs / automation contracts | DONE_WITH_MOMO_PREFLIGHT_AND_CPU_TRIAGE | 100% | Workplan, SOP v1.50, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, 188 DB/Redis exporter restore helper, 188 MinIO/Velero restore helper, 188 nginx-exporter restore helper, 110 Docker disk pressure cleanup boundary, MOMO Google Drive token userns readback, MOMO daily freshness blocker, MOMO Pro false-noise health monitor source-of-truth, docker-health direct Telegram fallback cooldown, Bitan public-content same-fingerprint cooldown, notification-noise readback, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side cold-start v1.42 source absence classifier, live-sync parity gate, MOMO import-boundary production deploy, MOMO Drive-auth fail-closed production deploy, 10:04 scheduler fail-closed live proof, 10:35 route / DB / backup refresh, 11:44 MOMO dedicated preflight version / StartedAt / lifecycle / source-candidate gate, 10:58 user-approved 110 orphan Chrome SIGTERM evidence, 
MacBook Pro Codex safe artifact sync readback, and 2026-06-25 live refresh with Google Drive token metadata WARN are updated. 2026-06-24 23:15 read-only verify still shows repo cold-start hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`; live 110 script sync of the v1.42 classifier is not claimed until separately approved and recorded. |
-2026-06-25 11:35 supplemental readback supersedes the 11:21 route / DB / backup wording for current evidence: direct route smoke is still 200 for AWOOOI API / IwoooS / VibeWork / AwoooGo / MOMO health / Stock / Bitan; repo-side cold-start returns `PASS=87 WARN=1 BLOCKED=1`; `/backup/scripts/backup-status.sh --no-notify --no-refresh` reports 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`; MOMO dedicated preflight remains `PASS=8 WARN=4 BLOCKED=2`; MOMO health is `V10.665` after an 11:33 compose replace; 110 CPU is stable around load `3.16 / 3.26 / 4.36`, not orphan Chrome.
+2026-06-25 11:44 supplemental readback supersedes the 11:35 MOMO service-version wording for current evidence: direct route smoke is still 200 for AWOOOI API / IwoooS / VibeWork / AwoooGo / MOMO health / Stock / Bitan; repo-side cold-start returns `PASS=87 WARN=1 BLOCKED=1`; `/backup/scripts/backup-status.sh --no-notify --no-refresh` reports 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`; MOMO dedicated preflight returns `PASS=15 WARN=5 BLOCKED=2`; MOMO health is `V10.667` after 11:42-11:43 lifecycle events; 110 CPU is stable around load `3.16 / 3.26 / 4.36`, not orphan Chrome.

-Full cold-start service readiness may not be declared green for the latest verified evidence set. As of 2026-06-25 11:35, routes/hosts/K3s/backups/exporters/monitoring surfaces are available, AWOOOI API is healthy, and MOMO service health is `V10.665`, but the latest live read-only cold-start scorecard remains `PASS=87 WARN=1 BLOCKED=1` because MOMO business data freshness is stale beyond 3 days and Google Drive token metadata is missing / writeback not confirmed. The hard blocker is `188 momo daily sales data stale beyond 3 days`; the token state is a separate WARN and not a reason to read token contents. MOMO Drive auth/API failure is no longer allowed to be recorded as a no-file success after CD `#910`; the 10:04 scheduler run proved it now fails closed and sends failure notification. This code fix does not create new business data. Do not declare DR scorecard complete while credential escrow evidence remains blocked.
+Full cold-start service readiness may not be declared green for the latest verified evidence set. As of 2026-06-25 11:44, routes/hosts/K3s/backups/exporters/monitoring surfaces are available, AWOOOI API is healthy, and MOMO service health is `V10.667`, but the latest live read-only cold-start scorecard remains `PASS=87 WARN=1 BLOCKED=1` because MOMO business data freshness is stale beyond 3 days and Google Drive token metadata is missing / writeback not confirmed. The hard blocker is `188 momo daily sales data stale beyond 3 days`; the token state is a separate WARN and not a reason to read token contents. MOMO Drive auth/API failure is no longer allowed to be recorded as a no-file success after CD `#910`; the 10:04 scheduler run proved it now fails closed and sends failure notification. This code fix does not create new business data. Do not declare DR scorecard complete while credential escrow evidence remains blocked.
 2026-06-13 01:26 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing survived the next normal CD deploy: Gitea main `e4a349bc`, ArgoCD revision `e4a349bc`, images from `414413a5`, API/Web split across `mon` / `mon1`, and global `known_hosts` retained 120 / 188 after CD fix `80e6ec1a`. Do not declare DR complete while credential escrow is missing. `km-vectorize` remediation is `90%`: schedule/label fix is live, and the remaining gate is the next official 03:00 CronJob success readback.
@@ -160,7 +160,7 @@ Next:
 | ID | Status | % | Work item | Fine analysis | Next action | Done criteria |
 |----|--------|---:|-----------|---------------|-------------|---------------|
 | P2-001 | VERIFIED | 100 | Public route smoke | 2026-06-12 18:57 cold-start confirms all listed domains returned expected 2xx/3xx over HTTPS; registry root route returned 200 in the scorecard and `/v2/` remains the normal unauthenticated 401 pattern from earlier checks. This proves ingress/TLS plus current route availability. | Keep as one row in scorecard. | Public route table updated after each reboot. |
-| P2-002 | BLOCKED_MOMO_DATA_FRESHNESS_WITH_GDRIVE_TOKEN_WARN | 98 | momo latest/current-month parity and freshness | Latest current-month parity is good: `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`. Full-table read-only DB evidence is `daily_sales_snapshot=104614 rows, 2025-07-01..2026-06-17` and current-month `realtime_sales_monthly=10936 rows, 2026/06/01..2026/06/17`. 11:01 dedicated preflight returns `PASS=8 WARN=4 BLOCKED=2`: public/local health and scheduler are healthy, latest job `56` is clean, but latest business data is stale: `DB_DAILY_FRESHNESS 8|2026-06-17`; host/container token metadata remains missing, and scheduler fail-closed log evidence is not present after the latest container restart window. 10:04 scheduler run remains the previous proof that Drive auth failure now fails closed and sends Telegram failure notification.
| Run `scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh` before any MOMO recovery claim. Obtain owner-provided non-secret evidence ref for Google Drive token artifact recovery or a newer legitimate PChome daily-sales source file. Recovery must have maintenance window, rollback owner, token metadata-only verification, import job `sync_success=true`, file movement only after success, table bounds, and `MOMO_DAILY_FRESHNESS <= 2`. Do not read token contents, do not manually import stale local samples, product exports, header-only sheets, or already imported archives. | Token artifact metadata matches scheduler UID/mode without exposing token value; source folder has a newer legitimate source or scheduler can list the expected folder; next import has `sync_success=true`; snapshot/current-month row count and bounds match; daily freshness is within threshold; preflight returns no BLOCKED result. |
+| P2-002 | BLOCKED_MOMO_DATA_FRESHNESS_WITH_GDRIVE_TOKEN_WARN | 98 | momo latest/current-month parity and freshness | Latest current-month parity is good: `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`. Full-table read-only DB evidence is `daily_sales_snapshot=104614 rows, 2025-07-01..2026-06-17` and current-month `realtime_sales_monthly=10936 rows, 2026/06/01..2026/06/17`. 11:44 dedicated preflight returns `PASS=15 WARN=5 BLOCKED=2`: public/local health and scheduler are healthy on `V10.667`, app / scheduler / Telegram bot StartedAt metadata is captured, recent lifecycle events are visible, exact local daily-sales source candidates are `0`, latest job `56` is clean, but latest business data is stale: `DB_DAILY_FRESHNESS 8|2026-06-17`; host/container token metadata remains missing, and scheduler fail-closed log evidence is not present after the latest container restart window. 10:04 scheduler run remains the previous proof that Drive auth failure now fails closed and sends Telegram failure notification. | Run `scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh` before any MOMO recovery claim. Obtain owner-provided non-secret evidence ref for Google Drive token artifact recovery or a newer legitimate PChome daily-sales source file. Recovery must have maintenance window, rollback owner, token metadata-only verification, import job `sync_success=true`, file movement only after success, table bounds, and `MOMO_DAILY_FRESHNESS <= 2`. Do not read token contents, do not manually import stale local samples, product exports, header-only sheets, or already imported archives. | Token artifact metadata matches scheduler UID/mode without exposing token value; source folder has a newer legitimate source or scheduler can list the expected folder; next import has `sync_success=true`; snapshot/current-month row count and bounds match; daily freshness is within threshold; preflight returns no BLOCKED result. |
 | P2-008 | DONE_SUPERSEDED_BY_TOKEN_WARN | 100 | Separate MOMO service recovery from upstream source absence | 2026-06-24 11:35 readback proved MOMO service was healthy and source-file absence was the blocker. 2026-06-25 10:35 supersedes that with a stricter split: service is still healthy, DB parity is still good, but token artifact metadata is missing and the latest scheduler evidence is auth failure, not a healthy empty-source listing. SOP v1.48 records GO/NO-GO rules forbidding old archive re-import, product-export import, truncate, whole-DB restore, fake freshness, or token secret exposure. | Keep stale warning and token WARN active until owner-gated Drive token/source evidence is restored and a legitimate newer `即時業績_當日` source imports cleanly. | Operators can say "MOMO service recovered, data pipeline blocked by Drive token/source evidence and stale business data" without calling the full stack green. |
 | P2-003 | DONE_PRODUCTION_DEPLOYED_WAITING_NEXT_REAL_IMPORT | 99 | Fix momo job semantics | Gitea-first repair is in `/Users/ogt/codex-workspaces/momo-pro-dev` commit `84035906aba0e5e190d031a13cfd9b47a8cd1f73` on branch `codex/momo-current-main-dev-base-20260624`, also fast-forwarded to MacBook Pro and fast-forwarded to MOMO `main`. Gitea Actions `cd.yaml #904` succeeded, and 188 live source contains `_table_columns`, `業績分析儀表板同步失敗`, and `保留來源檔案等待重試,不移動 Google Drive 檔案`. `process_daily_sales_import()` marks monthly sync failure as `failed`, records the sync error in summary, returns `False`, and leaves `auto_import_from_drive()` outside the Drive archive/move path. Regression tests cover both job failure and no-move behavior. | Watch the next real Google Drive import and confirm no file moves unless both tables sync; if a real monthly sync failure happens, verify import job status is `failed` and source file remains pending. | `pytest tests/test_import_service_sql_params.py tests/test_auto_import_data_sync.py tests/test_auto_import_failure_boundaries.py -q` returns `10 passed`; production deployment/readback is complete; final behavioral closeout requires next real import evidence. |
 | P2-004 | DONE | 100 | PostgreSQL index corruption runbook path | SOP v1.2 now states `posting list tuple ... cannot be split` is an index repair incident. | Use only concurrent reindex if the error returns. | No truncate, no whole DB restore; `REINDEX TABLE CONCURRENTLY public.realtime_sales_monthly;` and idempotent resync evidence recorded. |
@@ -181,7 +181,7 @@ Next:
 | P3-005 | DONE | 100 | Update cold-start SOP | SOP now includes start, shutdown, reboot, record, comparison, and 120 blocker handling. | Increment SOP version after each process change. | SOP has controlled power-operation sections and ledger template. |
 | P3-006 | DONE | 100 | Update backup status | Backup status now reflects current cron, rclone latest-only, failure-only alert posture, and escrow blocker.
| Refresh after 120 backup rerun. | Backup status no longer claims noisy success Telegram notifications. |
 | P3-007 | DONE | 100 | Harden Gitea backup stale dump handling | 2026-06-05 manual Gitea backup failed because the container retained `/tmp/gitea-dump.zip` from the 02:00 failure. `scripts/backup/backup-gitea.sh` now renames stale container dump files to timestamped evidence before running a new dump, and the live 110 script is updated. | Watch the next 02:00 Gitea backup. | `bash -n` passes locally and on 110; manual Gitea backup completed after stale evidence rename. |
-| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.49 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, stale-vs-active K8s failed Job classification, post-reboot / post-CD recovery anchors, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, 110 runaway browser / CI load triage PlayBook, healthy-heartbeat suppression, 188 node-exporter restore, 188 DB/Redis exporter restore, 188 MinIO/Velero restore, 188 nginx-exporter restore, 110 Docker disk cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, post-reboot notification noise gates, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side scorecard source-absence classifier, 110 live-sync parity gate, CD monitoring coverage target-down classification, MOMO dedicated token/source preflight, and 2026-06-25 110 CPU orphan Chrome vs active CI triage evidence. | Use v1.49 for the next reboot record, then compare actual timing, Plan B trigger, degraded level, failed/stale/active Job counters, runaway-process metrics, CI load attribution, MOMO source availability, data freshness, Velero freshness, exporter scrape, disk usage, notification-noise state, monitoring coverage, and blockers against §1.4 plus §11.1 / §14.8 through §14.35. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow / runaway-process / notification-noise / MOMO preflight / monitoring coverage checks. If using the live 110 script, record its hash and do not assume repo-side v1.42 behavior until synced under approval and deploy parity passes. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, and `B5_DR_COMPLETE`; latest MOMO dedicated preflight returns `PASS=8 WARN=4 BLOCKED=2`; 110 CPU evidence records old orphan Chrome groups removed by approved SIGTERM while active CI load remains observation-only; repeated healthy/same-failure notification noise is controlled without hiding real alerts, and monitoring coverage target-down is routed through exporter restore before any product restart.
|
+| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.50 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, stale-vs-active K8s failed Job classification, post-reboot / post-CD recovery anchors, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, 110 runaway browser / CI load triage PlayBook, healthy-heartbeat suppression, 188 node-exporter restore, 188 DB/Redis exporter restore, 188 MinIO/Velero restore, 188 nginx-exporter restore, 110 Docker disk cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, post-reboot notification noise gates, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side scorecard source-absence classifier, 110 live-sync parity gate, CD monitoring coverage target-down classification, MOMO dedicated token/source preflight, MOMO V10.667 version / StartedAt / lifecycle / exact source-candidate readback, and 2026-06-25 110 CPU orphan Chrome vs active CI triage evidence. | Use v1.50 for the next reboot record, then compare actual timing, Plan B trigger, degraded level, failed/stale/active Job counters, runaway-process metrics, CI load attribution, MOMO source availability, data freshness, Velero freshness, exporter scrape, disk usage, notification-noise state, monitoring coverage, and blockers against §1.4 plus §11.1 / §14.8 through §14.35. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow / runaway-process / notification-noise / MOMO preflight / monitoring coverage checks. If using the live 110 script, record its hash and do not assume repo-side v1.42 behavior until synced under approval and deploy parity passes. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, and `B5_DR_COMPLETE`; latest MOMO dedicated preflight returns `PASS=15 WARN=5 BLOCKED=2`; 110 CPU evidence records old orphan Chrome groups removed by approved SIGTERM while active CI load remains observation-only; repeated healthy/same-failure notification noise is controlled without hiding real alerts, and monitoring coverage target-down is routed through exporter restore before any product restart. |
 | P3-009 | DONE | 100 | Assess 120/121 AA/AS role and host load balancing | 2026-06-12 15:19 live check confirms 120 and 121 are both `Ready control-plane`, `k3s active`, `k3s-agent inactive`, with no taints; however most AWOOOI / ArgoCD / Velero workload remains on 121 after 120 fsck recovery. New assessment defines control-plane AA vs workload AA, migration candidates from 110/188, and stateful migration blockers. | After P0 backup/offsite/cold-start green, implement topology spread for AWOOOI API/Web before moving additional services. | `docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md` exists; SOP v1.6 links AA/AS and load-balancing checks; migration implementation remains explicitly `0%`. |
 | P3-010 | DONE | 100 | Update workload balancing docs with 2026-06-13 live truth | Host role assessment, workplan, SOP, backup status, and LOGBOOK are refreshed with current cold-start, backup, 188 certbot degraded, ArgoCD `km-vectorize` degraded, Gitea main `acaae999`, ArgoCD sync, and final pod placement evidence. | Keep updating this file after the next reboot or deploy.
| Docs separate service-green status from DR escrow, workload rollout, and non-service governance debt. |
 | P3-011 | DONE | 100 | Record `km-vectorize` remediation status | LOGBOOK, this workplan, and SOP now state the schedule/label fix, ArgoCD sync evidence, the invalid manual Job boundary, and the 90% waiting-for-next-schedule gate. | After next 03:00 run, update this row and the top verdict with `lastSuccessfulTime` / ArgoCD health evidence. | No document claims ArgoCD green before official CronJob success evidence exists. |
diff --git a/scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh b/scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh
index 0a44bdbb..351261ef 100755
--- a/scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh
+++ b/scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh
@@ -33,9 +33,11 @@ Usage: momo-drive-token-source-recovery-preflight.sh [--host user@host] [--fresh

 Read-only checks:
   - MOMO public health and local health endpoint on 188
+  - MOMO version, container StartedAt, health, recent replace/restart evidence
   - momo-scheduler running / health / UID
   - Google token metadata only, never token content
   - scheduler fail-closed log evidence and notification evidence
+  - exact local daily-sales source candidate presence
   - daily_sales_snapshot / realtime_sales_monthly bounds
   - latest daily_sales import job

@@ -84,12 +86,24 @@ emit() {
 }

 emit HOST "$(hostname 2>/dev/null || true)"
+momo_health_json="$(curl -s --max-time 5 http://127.0.0.1:5003/health 2>/dev/null || true)"
+momo_public_health_json="$(curl -s --max-time 8 https://mo.wooo.work/health 2>/dev/null || true)"
 emit MOMO_HEALTH_CODE "$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 http://127.0.0.1:5003/health 2>/dev/null || true)"
 emit MOMO_PUBLIC_HEALTH_CODE "$(curl -s -o /dev/null -w '%{http_code}' --max-time 8 https://mo.wooo.work/health 2>/dev/null || true)"
+emit MOMO_HEALTH_VERSION "$(printf '%s\n%s\n' "$momo_health_json" "$momo_public_health_json" | sed -n 's/.*"version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1)"
+emit MOMO_APP_STARTED_AT "$(docker inspect -f '{{.State.StartedAt}}' momo-pro-system 2>/dev/null || true)"
+emit MOMO_APP_HEALTH "$(docker inspect -f '{{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' momo-pro-system 2>/dev/null || true)"
 emit SCHEDULER_RUNNING "$(docker inspect -f '{{.State.Running}}' momo-scheduler 2>/dev/null || true)"
 emit SCHEDULER_HEALTH "$(docker inspect -f '{{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' momo-scheduler 2>/dev/null || true)"
 emit SCHEDULER_STARTED_AT "$(docker inspect -f '{{.State.StartedAt}}' momo-scheduler 2>/dev/null || true)"
+emit SCHEDULER_RESTART_COUNT "$(docker inspect -f '{{.RestartCount}}' momo-scheduler 2>/dev/null || true)"
 emit SCHEDULER_UID "$(docker top momo-scheduler -eo pid,user,uid 2>/dev/null | awk 'NR==2 {print $3}' || true)"
+emit TELEGRAM_BOT_STARTED_AT "$(docker inspect -f '{{.State.StartedAt}}' momo-telegram-bot 2>/dev/null || true)"
+emit TELEGRAM_BOT_HEALTH "$(docker inspect -f '{{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' momo-telegram-bot 2>/dev/null || true)"
+
+event_until="$(date --iso-8601=seconds 2>/dev/null || date -Iseconds)"
+recent_events="$(docker events --since 45m --until "$event_until" --filter container=momo-pro-system --filter container=momo-scheduler --filter container=momo-telegram-bot 2>/dev/null || true)"
+emit MOMO_CONTAINER_REPLACE_EVENTS_45M "$(printf '%s\n' "$recent_events" | grep -Ec 'container (restart|start|kill|die|stop)' || true)"

 token_stat="$(stat -c '%u:%g:%a' /home/ollama/momo-pro/config/google_token.json 2>/dev/null || true)"
 emit TOKEN_STAT "${token_stat:-missing}"
@@ -104,6 +118,11 @@ emit LOG_FAILURE_NOTIFY_SUCCESS_COUNT "$(printf '%s\n' "$logs" | grep -Ec '匯
 emit LOG_EMPTY_SOURCE_COUNT "$(printf '%s\n' "$logs" | grep -Ec '找到 0 個 Excel|沒有找到待匯入' || true)"
 emit LOG_SUCCESS_IMPORT_COUNT "$(printf '%s\n' "$logs" | grep -Ec '自動匯入完成|匯入成功|成功匯入' || true)"

+source_count="$(find /home/ollama/momo-pro /backup -type f -name '即時業績_當日.xlsx' 2>/dev/null | wc -l | awk '{print $1}' || true)"
+latest_source="$(find /home/ollama/momo-pro /backup -type f -name '即時業績_當日.xlsx' -printf '%T@|%TY-%Tm-%TdT%TH:%TM:%TS|%s|%f\n' 2>/dev/null | sort -n | tail -n 1 | cut -d'|' -f2- || true)"
+emit LOCAL_EXACT_DAILY_SOURCE_COUNT "${source_count:-0}"
+emit LOCAL_EXACT_DAILY_SOURCE_LATEST "${latest_source:-none}"
+
 psql_query() {
   docker exec momo-db psql -h 127.0.0.1 -U momo -d momo_analytics -Atc "$1" 2>/dev/null || true
 }
@@ -137,12 +156,32 @@ public_health_code="$(value_for MOMO_PUBLIC_HEALTH_CODE)"
 [[ "$public_health_code" == "200" ]] && ok "MOMO public health endpoint returns 200" || blocked "MOMO public health endpoint is not 200: ${public_health_code:-missing}"
 [[ "$health_code" == "200" ]] && ok "MOMO local health endpoint returns 200" || warn "MOMO local health endpoint is not 200: ${health_code:-missing}"

+momo_version="$(value_for MOMO_HEALTH_VERSION)"
+[[ -n "$momo_version" ]] && ok "MOMO health version readback is available: $momo_version" || warn "MOMO health version readback unavailable"
+
 scheduler_running="$(value_for SCHEDULER_RUNNING)"
 scheduler_health="$(value_for SCHEDULER_HEALTH)"
 scheduler_started_at="$(value_for SCHEDULER_STARTED_AT)"
+scheduler_restart_count="$(value_for SCHEDULER_RESTART_COUNT)"
+momo_app_health="$(value_for MOMO_APP_HEALTH)"
+momo_app_started_at="$(value_for MOMO_APP_STARTED_AT)"
+telegram_bot_health="$(value_for TELEGRAM_BOT_HEALTH)"
+telegram_bot_started_at="$(value_for TELEGRAM_BOT_STARTED_AT)"
 [[ "$scheduler_running" == "true" ]] && ok "momo-scheduler container is running" || blocked "momo-scheduler container is not running"
 [[ "$scheduler_health" == "healthy" ]] && ok "momo-scheduler container health is healthy" || warn "momo-scheduler health is not healthy: ${scheduler_health:-missing}"
 [[ -n "$scheduler_started_at" ]] && ok "momo-scheduler started_at metadata is available: $scheduler_started_at" || warn "momo-scheduler started_at metadata unavailable"
+[[ "$scheduler_restart_count" == "0" ]] && ok "momo-scheduler restart count is 0" || warn "momo-scheduler restart count is not 0: ${scheduler_restart_count:-missing}"
+[[ "$momo_app_health" == "healthy" ]] && ok "momo-pro-system health is healthy" || warn "momo-pro-system health is not healthy: ${momo_app_health:-missing}"
+[[ -n "$momo_app_started_at" ]] && ok "momo-pro-system started_at metadata is available: $momo_app_started_at" || warn "momo-pro-system started_at metadata unavailable"
+[[ "$telegram_bot_health" == "healthy" ]] && ok "momo-telegram-bot health is healthy" || warn "momo-telegram-bot health is not healthy: ${telegram_bot_health:-missing}"
+[[ -n "$telegram_bot_started_at" ]] && ok "momo-telegram-bot started_at metadata is available: $telegram_bot_started_at" || warn "momo-telegram-bot started_at metadata unavailable"
+
+replace_events="$(num_for MOMO_CONTAINER_REPLACE_EVENTS_45M)"
+if [[ "$replace_events" -gt 0 ]]; then
+  warn "recent MOMO container replace/restart events observed in the last 45m: $replace_events"
+else
+  ok "no MOMO container replace/restart events observed in the last 45m"
+fi

 scheduler_uid="$(value_for SCHEDULER_UID)"
 token_stat="$(value_for TOKEN_STAT)"
@@ -182,6 +221,14 @@ else
   warn "scheduler failure notification success evidence not observed in the last 8h"
 fi

+local_source_count="$(num_for LOCAL_EXACT_DAILY_SOURCE_COUNT)"
+local_source_latest="$(value_for LOCAL_EXACT_DAILY_SOURCE_LATEST)"
+if [[ "$local_source_count" -gt 0 ]]; then
+  warn "exact local daily-sales source candidates exist outside Drive intake: count=$local_source_count latest=${local_source_latest:-unknown}"
+else
+  ok "no exact local daily-sales source candidate found on 188 / backup paths"
+fi
+
 import_config="$(value_for IMPORT_CONFIG)"
 [[ "$import_config" == *"當日業績匯入|即時業績_當日"* ]] && ok "Drive import config points to expected daily-sales intake" || blocked "Drive import config is unavailable or drifted: ${import_config:-missing}"
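
Reviewer sketch: the two new parsing steps in this script diff (version extraction from the health JSON and lifecycle-event counting) can be exercised offline without touching 188. The helper names `extract_version` and `count_lifecycle_events` are hypothetical and do not exist in the preflight script, and the sample payloads are invented for illustration rather than captured output. Only the `sed` and `grep -Ec` expressions are taken from the diff.

```shell
#!/usr/bin/env bash
# Hypothetical helpers mirroring two expressions from the preflight diff.
# All sample inputs below are invented; they are not real readback data.

# Print the first "version" value found in one or more health JSON payloads.
extract_version() {
  printf '%s\n' "$@" \
    | sed -n 's/.*"version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' \
    | head -n 1
}

# Count docker-events lines that describe a container lifecycle transition.
count_lifecycle_events() {
  # grep -c prints 0 but exits 1 when nothing matches, hence the || true guard.
  printf '%s\n' "$1" | grep -Ec 'container (restart|start|kill|die|stop)' || true
}

sample_local='{"status":"ok","version":"V10.667"}'
sample_public=''
sample_events='2026-06-25T11:42:58 container kill abc (name=momo-scheduler)
2026-06-25T11:42:59 container die abc (name=momo-scheduler)
2026-06-25T11:43:31 container start def (name=momo-scheduler)
2026-06-25T11:43:40 container exec_create: sh def (name=momo-scheduler)'

extract_version "$sample_local" "$sample_public"   # prints V10.667
count_lifecycle_events "$sample_events"            # prints 3
count_lifecycle_events ""                          # prints 0
```

The `exec_create` line is deliberately excluded by the pattern, so health-check probes inside a container should not inflate the replace/restart count. An empty public payload also shows why the script feeds both the local and public JSON into one `sed` pass: the first non-empty version wins.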