diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index d44fbbb8..9e9b3f58 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -1,3 +1,26 @@
+## 2026-06-24|23:33 live cold-start / public routes / backup read-only refresh
+
+**Background**: At 23:15 it was confirmed that the live cold-start monitor on 110 has not yet been synced to the repo-side v1.42 hash. This round performs no live script install; it only reruns the full read-only cold-start with the repo-side authoritative script to confirm that the on-site judgment of the reboot SOP is still correct.
+
+**Read-only evidence**:
+- Repo / Gitea baseline: `6b9a09a0 docs(ops): record cold-start monitor live-sync gate [skip ci]`; the Mac Mini / MacBook Pro AWOOOI controlled workspaces are both on `codex/awoooi-current-main-dev-base-20260624`, commit `6b9a09a01a34`, dirty `0`.
+- `scripts/reboot-recovery/full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1` returned expected exit code `2` with `PASS=88 WARN=0 BLOCKED=1`.
+- Hosts / K3s: 110 / 120 / 121 / 188 ping and SSH port OK; K3s `mon` / `mon1` both `Ready control-plane`; VIP `192.168.0.125` present; node filesystem / disk-pressure / readonly events `0`.
+- Public routes direct smoke: `awoooi API=200`, `/zh-TW/iwooos=200`, `vibework=200`, `awooogo=200`, `mo health=200`, `stock=200`, `gitea=200`, `harbor=200`, `registry /v2=401`, `sentry=200`, `signoz=200`, `langfuse=200`, `bitan=200`, `aiops=200`.
+- AWOOOI API health: `status=healthy`, `environment=prod`, `mock_mode=false`; postgresql / redis / openclaw / signoz / ollama providers all `up`.
+- MOMO service health: `https://mo.wooo.work/health` returned `{"database":"postgresql","status":"healthy","version":"V10.653"}`.
+- 188 data / MOMO: PostgreSQL accepts connections, Redis PONG, SignOz 200, `momo-pro-system` healthy, `momo-scheduler` healthy, Google token owner/mode `100000:100000:600` matches scheduler UID, current-month parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`.
+- MOMO blocker evidence remains: `MOMO_DAILY_FRESHNESS 7|2026-06-17` and `BLOCKED 188 momo source file absent while daily sales data stale`.
+- Backup read-only status from 110 `/backup/scripts/backup-status.sh --no-notify --no-refresh` at 23:33: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`, last aggregate `2026-06-24 02:28:39`.
+- Readiness audit with local PyYAML venv remains `PASS=197 WARN=2 BLOCKED=0`; WARN only `ansible-playbook unavailable locally` and `live cold-start gate skipped` for that static audit path.
+
+**Verdict**:
+- The SOP is effective: it correctly separates the recovered route/service/DB/backup/K3s layers from MOMO business data freshness, which is still blocked; website 200s and DB parity did not mislead it into declaring full-stack green.
+- May be claimed: core hosts, K3s, public routes, AWOOOI API health, MOMO service health, and backup/offsite surfaces are available for this read-only evidence set.
+- Must not be claimed: full-stack green, MOMO data current, DR complete, credential escrow complete, or that the 110 live monitor is synced to repo v1.42. Live 110 script parity still requires a separate maintenance window.
+
+**Boundary**: this round made no host writes, did not `scp` the live script, performed no Docker / Nginx / firewall / K8s / ArgoCD operations, made no Wazuh / SOC changes, and did not read or store any secret.
+
 ## 2026-06-24|23:15 110 cold-start monitor live-sync gate readback
 
 **Background**: At 23:04 the MOMO source absence classifier was added to repo-side cold-start v1.42, but that does not mean `/home/wooo/scripts/full-stack-cold-start-check.sh` on 110 has been updated. To keep an operator from misjudging the next reboot from the old live 110 script output, this round only performs read-only verification of deploy parity and strengthens the SOP gate.
diff --git a/docs/runbooks/BACKUP-STATUS.md b/docs/runbooks/BACKUP-STATUS.md
index 6868af8e..df38f106 100644
--- a/docs/runbooks/BACKUP-STATUS.md
+++ b/docs/runbooks/BACKUP-STATUS.md
@@ -14,6 +14,30 @@
 > 2026-06-24 22:40 Codex MOMO source readback: scheduler / DB / import metadata confirm the full-stack blocker is missing upstream source data, not backup freshness; no manual import or Drive write was performed.
 > 2026-06-24 23:04 Codex cold-start gate refresh: repo-side v1.42 dry-run now emits MOMO source-absence evidence and blocks with `188 momo source file absent while daily sales data stale`; backup/offsite remains green and live 110 script deployment is not claimed.
 > 2026-06-24 23:15 Codex live-sync gate readback: read-only deploy parity check correctly blocks because repo cold-start hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`; installer remains a live write requiring explicit approval.
+> 2026-06-24 23:33 Codex backup readback: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`; live cold-start still blocks only on MOMO source absence / data freshness, not backup.
+
+---
+
+## 2026-06-24 23:33 Backup / Offsite / Escrow Live Status
+
+Read-only command: `/backup/scripts/backup-status.sh --no-notify --no-refresh` from 110 at 23:33 Asia/Taipei.
+
+- 110 backup health: `13/13 fresh failed=0`.
+- 188 backup health: `2/2 fresh failed=0`.
+- Integrity / configured blockers: `core_blockers=0`, `dr_warnings=5`, `configured_missing_110=0`, `configured_missing_188=0`, `script_missing_110=0`, `script_missing_188=0`, `integrity_stale=0`.
+- Offsite / GDrive freshness: `offsite_configured=1`, `offsite_fresh=1`, `rclone_gdrive_configured=1`, `rclone_gdrive_fresh=1`.
+- Last aggregate backup: `2026-06-24 02:28:39`.
+- DR blocker remains: `escrow_missing=5`; do not forge evidence markers, and do not paste any secret value / hash / partial token.
+- Full-stack service release blocker remains separate: cold-start `PASS=88 WARN=0 BLOCKED=1`, caused by `188 momo source file absent while daily sales data stale` / `MOMO_DAILY_FRESHNESS 7|2026-06-17`; this is not a backup freshness failure.
+
+| Gate | Status | Evidence |
+|------|--------|----------|
+| 110 backup freshness | VERIFIED | 13/13 fresh, failed count 0. |
+| 188 backup freshness | VERIFIED | 2/2 fresh, failed count 0. |
+| Offsite / GDrive freshness | VERIFIED | `offsite_fresh=1`, `rclone_gdrive_fresh=1`. |
+| Backup core blockers | GREEN | `core_blockers=0`. |
+| Credential escrow | BLOCKED | `escrow_missing=5`; only real non-secret owner evidence may close this. |
+| Service full green | NO-GO | Blocked by MOMO source absence / data freshness, not by backup. |
 
 ---
 
diff --git a/docs/runbooks/FULL-STACK-COLD-START-SOP.md b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
index 7b5bd496..ff2f152c 100644
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -1,6 +1,6 @@
 # AWOOOI Full-Stack Cold-Start and Host Reboot SOP
 
-> Version: v1.43
+> Version: v1.44
 > Last updated: 2026-06-24 Asia/Taipei
 > Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
 
@@ -10,19 +10,19 @@
 
 This section is the first judgment anchor after every handover, power-on, shutdown, or reboot. If its date is not today, rerun the live check first, then update this section and `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md`.
 
-2026-06-24 23:04 repo-side gate refresh supersedes the earlier 22:40 wording where the source absence evidence existed only in LOGBOOK and manual readback. The service and data readiness gates below are refreshed by the 22:16 live cold-start scorecard, 22:17 backup-status readback, MOMO Gitea CD `#904` production deploy evidence, 22:40 MOMO scheduler / DB / import metadata read-only evidence, and 23:04 repo-side cold-start v1.42 dry-run evidence:
+2026-06-24 23:33 live read-only refresh supersedes the earlier 22:16 / 23:04 scorecard wording. It confirms the SOP is behaving correctly: hosts, routes, K3s, AWOOOI API health, MOMO service health, and backup/offsite are available, while full-stack release remains blocked only by MOMO source absence / business data freshness and DR remains blocked by missing credential escrow evidence. The 110 live script parity blocker from 23:15 still applies.
 
 ```text
 Repo-side reboot SOP / Plan B / automation contracts: COMPLETE, 100%.
-Live cold-start read-only check: 2026-06-24 22:16 PASS=86 WARN=0 BLOCKED=1, Result=BLOCKED.
-Repo-side cold-start v1.42 dry-run: 2026-06-24 23:04 PASS=88 WARN=0 BLOCKED=1 against live read-only targets. New MOMO fields are MOMO_SOURCE_EMPTY_EVIDENCE_LINES=21, MOMO_IMPORT_CONFIG=當日業績匯入|即時業績_當日, MOMO_LATEST_IMPORT_JOB=56|completed|即時業績_當日.xlsx|2026-06-18T11:41:00.853176|2026-06-18T11:42:02.309425|10936|10936|0. The only BLOCKED text is now "188 momo source file absent while daily sales data stale". Live 110 script sync is not claimed until a separate approved deployment/sync happens.
+Live cold-start read-only check: 2026-06-24 23:33 PASS=88 WARN=0 BLOCKED=1, Result=BLOCKED.
+Repo-side cold-start v1.42 live read-only run: New MOMO fields remain MOMO_SOURCE_EMPTY_EVIDENCE_LINES=21, MOMO_IMPORT_CONFIG=當日業績匯入|即時業績_當日, MOMO_LATEST_IMPORT_JOB=56|completed|即時業績_當日.xlsx|2026-06-18T11:41:00.853176|2026-06-18T11:42:02.309425|10936|10936|0. The only BLOCKED text is "188 momo source file absent while daily sales data stale". Live 110 script sync is not claimed until a separate approved deployment/sync happens.
 110 live-sync parity: 2026-06-24 23:15 read-only `verify-cold-start-monitor-deploy.sh` correctly BLOCKED because repo script hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`. Do not use live 110 monitor output to prove v1.42 behavior until the approved live-sync gate in §13.3.1 passes.
 Service state: SERVICE_AVAILABLE_MOMO_SOURCE_BLOCKED_DR_ESCROW_BLOCKED; 110/120/121/188 reachable, K3s mon/mon1 Ready, ArgoCD awoooi-prod Synced/Healthy at revision 7db7800e399caed5487a705c81ec993dec76c70f, public routes/TLS green, 110/188 backup health fresh, 188 node-exporter / PostgreSQL exporter / Redis exporter restored, 188 MinIO endpoint and Velero BackupStorageLocation restored, 110 disk pressure cleared.
-Runtime release state: API/Web/Worker are ready; latest deployment marker 622bc372 points runtime image to 2ec7f6f4 and production API health returns healthy. 21:33 redirect-followed route batch shows awoooi web=200, awoooi API=200, vibework=200, awoooogo=200, momo health=200, stock=200, bitan=200, gitea=200, harbor=200, sentry=200, signoz=200, langfuse=200, registry /v2=401; cold-start raw route gate still records expected redirect statuses such as awoooi web=307, momo web=302, sentry=302. CD #3294 still has a historical Failure record because post-deploy monitoring coverage saw 188 nginx-exporter down before the exporter restore.
+Runtime release state: API/Web/Worker are ready; production API health returns healthy with `environment=prod`, `mock_mode=false`, and postgresql / redis / openclaw / signoz / ollama providers all up. 23:33 redirect-followed route batch shows awoooi API=200, `/zh-TW/iwooos`=200, vibework=200, awoooogo=200, momo health=200, stock=200, bitan=200, gitea=200, harbor=200, sentry=200, signoz=200, langfuse=200, aiops=200, registry /v2=401; cold-start raw route gate still records expected redirect statuses such as awoooi web=307, momo web=302, sentry=302.
 MOMO release state: mo.wooo.work health is healthy on version V10.653. Gitea main fast-forwarded to 84035906aba0e5e190d031a13cfd9b47a8cd1f73 and Gitea Actions cd.yaml #904 completed Success. 188 live source contains the production marker `def _table_columns`, `業績分析儀表板同步失敗`, and `保留來源檔案等待重試,不移動 Google Drive 檔案`, proving the import-boundary fix is deployed. Mac Mini and MacBook Pro controlled Codex workspaces are both on branch codex/momo-current-main-dev-base-20260624 at commit 84035906aba0e5e190d031a13cfd9b47a8cd1f73 with dirty=0.
 MOMO data state: full-table read-only DB query shows `daily_sales_snapshot=104614 rows, 2025/07/01..2026/06/17` and `realtime_sales_monthly=786621 rows, 2024/01/01..2026/06/17`. Current-month daily_sales_snapshot and realtime_sales_monthly match, but both stop at 2026-06-17. MOMO_DAILY_FRESHNESS is 7 days, which is a hard blocker because business data is not current.
 Google Drive / source-file state: momo scheduler token ownership is fixed for Docker userns, container-side Drive listing works, and import config is `gdrive_folder_path=當日業績匯入`, `gdrive_file_pattern=即時業績_當日`; however scheduler stats and logs show repeated AutoImport runs with `file_count=0`, `imported_count=0`, including 2026-06-24 21:56 where the folder had `0` matching Excel files. Latest import job 56 was already completed on 2026-06-18 with `sync_success=true`, `source_file=即時業績_當日.xlsx`, and bounds `2026-06-01..2026-06-17`. Mac Mini and MacBook candidate spreadsheets were also read-only inspected: the local current daily candidate only contains 2025-07-01..2025-07-02, the iCloud full-month candidate only contains 2025-06-01..2025-06-30, and MacBook candidates are either header-only or the same 2025-07-01..2025-07-02 dataset. These are not legitimate newer sources.
-Backup / monitoring state: backup-status core blockers are 0, 110 is 13/13 fresh failed=0, 188 is 2/2 fresh failed=0, offsite_fresh=1, rclone_gdrive_fresh=1, integrity_stale=0, last aggregate is 2026-06-24 02:28:39, 188 MinIO is healthy, Velero BackupStorageLocation default is Available, one-off backup reboot-recovery-202606240456 completed, backup-health textfile reports Velero freshness green, PostgreSQL / Redis exporters are green, 188 nginx-exporter is restored with nginx_up=1, monitoring coverage is 14/14 jobs UP, and VeleroBackupNotRun / PostgreSQLDown / RedisDown / disk-pressure / nginx-exporter target-down evidence is resolved. 22:17 backup-status --no-notify --no-refresh reports 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, integrity_stale=0, offsite_fresh=1, rclone_gdrive_fresh=1, escrow_missing=5.
+Backup / monitoring state: backup-status core blockers are 0, 110 is 13/13 fresh failed=0, 188 is 2/2 fresh failed=0, offsite_fresh=1, rclone_gdrive_fresh=1, integrity_stale=0, last aggregate is 2026-06-24 02:28:39, 188 MinIO is healthy, Velero BackupStorageLocation default is Available, backup-health textfile reports Velero freshness green, PostgreSQL / Redis exporters are green, 188 nginx-exporter is restored with nginx_up=1, monitoring coverage is 14/14 jobs UP, and VeleroBackupNotRun / PostgreSQLDown / RedisDown / disk-pressure / nginx-exporter target-down evidence is resolved. 23:33 backup-status --no-notify --no-refresh reports 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, integrity_stale=0, offsite_fresh=1, rclone_gdrive_fresh=1, escrow_missing=5.
 Notification-noise state: healthy AWOOOI heartbeat is suppressed; heartbeat warning dedupe uses stable actionable fingerprints so HTTP status / timeout / latency drift does not create a new Telegram event every 30 minutes; MOMO Pro monitor uses https://mo.wooo.work/health as primary truth and no longer checks the 188 root path; MoWoooWorkDown now labels component=momo-pro-system and requires public/local/container/data-freshness triage instead of blind restart; docker-health-monitor keeps 5-minute repair cadence but has a separate 30-minute Telegram fallback cooldown; Bitan public-content check keeps failure alerting with same-fingerprint cooldown and one recovery notice.
 Monitoring coverage recovery state: if CD post-deploy fails only because `scripts/generate_monitoring.py --check` reports `nginx-exporter` down on `192.168.0.188:9113`, first verify 188 `stub_status` and restore the stateless exporter with `scripts/ops/188-nginx-exporter-restore.sh`; do not reload Nginx or restart product containers for this symptom.
Allowed declaration: core hosts, routes, K3s, backup/exporter surfaces are recovered; MOMO production code release includes the import-boundary fix at Gitea main 84035906aba0; both controlled Codex workspaces are aligned on the same MOMO fix branch; MOMO data pipeline is blocked waiting for a newer source file or owner-provided source evidence. diff --git a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md index 8369a2aa..6b514793 100644 --- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md +++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md @@ -11,13 +11,13 @@ | Area | Status | Completion | Evidence | |------|--------|------------|----------| -| Overall recovery readiness | SERVICE_AVAILABLE_MOMO_SOURCE_BLOCKED_DR_ESCROW_BLOCKED | 98% | 2026-06-24 22:16 live cold-start returned `PASS=86 WARN=0 BLOCKED=1`, result `BLOCKED` because MOMO business data freshness remains stale. 110 / 120 / 121 / 188 ping and SSH port are OK, K3s `mon` / `mon1` are Ready, public routes/TLS are green, 110 / 188 runtime and backup checks are green。188 `node-exporter`、PostgreSQL exporter、Redis exporter、`nginx-exporter`、MinIO / Velero BSL are restored; monitoring coverage is now `14/14 UP`; 110 disk pressure cleared。Remaining service blocker is MOMO business data freshness: `MOMO_DAILY_FRESHNESS 7|2026-06-17`; 22:40 scheduler / DB / import metadata read-only evidence confirms Drive listing works from the scheduler container, `import_config` points to `當日業績匯入` / `即時業績_當日`, but recent scheduler runs all report `file_count=0` and no newer legitimate source file exists. 2026-06-24 22:17 confirms MOMO `main` and Gitea Actions `cd.yaml #904` deployed `84035906aba0`, so monthly sync failure now fails the import job and prevents Drive file movement in production. 
DR remains blocked because credential escrow evidence markers are still missing and must not be forged. | +| Overall recovery readiness | SERVICE_AVAILABLE_MOMO_SOURCE_BLOCKED_DR_ESCROW_BLOCKED | 98% | 2026-06-24 23:33 live cold-start returned `PASS=88 WARN=0 BLOCKED=1`, result `BLOCKED` because MOMO business data freshness remains stale and no newer legitimate source file is present. 110 / 120 / 121 / 188 ping and SSH port are OK, K3s `mon` / `mon1` are Ready, public routes/TLS are green, 110 / 188 runtime and backup checks are green。188 `node-exporter`、PostgreSQL exporter、Redis exporter、`nginx-exporter`、MinIO / Velero BSL are restored; monitoring coverage is now `14/14 UP`; 110 disk pressure cleared。Remaining service blocker is MOMO business data freshness: `MOMO_DAILY_FRESHNESS 7|2026-06-17`; 23:33 cold-start plus scheduler / DB / import metadata read-only evidence confirms Drive listing works from the scheduler container, `import_config` points to `當日業績匯入` / `即時業績_當日`, but recent scheduler runs report `file_count=0` and no newer legitimate source file exists. 2026-06-24 22:17 confirms MOMO `main` and Gitea Actions `cd.yaml #904` deployed `84035906aba0`, so monthly sync failure now fails the import job and prevents Drive file movement in production. DR remains blocked because credential escrow evidence markers are still missing and must not be forged. | | P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-14 18:15 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, and cold-start P0/P1 checks are green. 
| -| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-24 21:33 backup / alert readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`。188 `node-exporter` textfile scrape、PostgreSQL exporter、Redis exporter、`nginx-exporter`、MinIO endpoint、Velero BSL and latest completed backup freshness are restored; monitoring coverage is `14/14 UP`; `BackupHealthMonitorMissing188`、`PostgreSQLDown`、`RedisDown`、`VeleroBackupNotRun` and 110 disk-pressure alerts resolved. DR remains blocked on real non-secret credential escrow evidence IDs. | -| P2 service / data truth | BLOCKED_MOMO_DATA_FRESHNESS | 98% | Public route/TLS, API/Web route, momo health `V10.653`, MOMO main / CD `#904` commit `84035906aba0e5e190d031a13cfd9b47a8cd1f73`, 188 live import-boundary source marker, current-month parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, backup exporters, schedules, K3s node readiness/storage conditions, VIP, and 110 / 188 runtime health are green. Mac Mini / MacBook Pro controlled MOMO workspaces both point to the same codex branch commit. MOMO latest business date remains `2026-06-17`; stale age is `7` days as of 22:40. Drive pending folder has `0` matching files in repeated scheduler checks; scheduler stats show `file_count=0` / `imported_count=0` for repeated AutoImport runs; latest valid job `56` already imported `即時業績_當日.xlsx` with `sync_success=true` and bounds `2026-06-01..2026-06-17`; Mac Mini / MacBook candidate files are old or header-only, so there is no safe newer source to import. 
| -| P3 docs / automation contracts | DONE_WITH_MOMO_SOURCE_ABSENCE_GATE_V142_REPO_ONLY | 100% | Workplan, SOP v1.43, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, 188 DB/Redis exporter restore helper, 188 MinIO/Velero restore helper, 188 nginx-exporter restore helper, 110 Docker disk pressure cleanup boundary, MOMO Google Drive token userns readback, MOMO daily freshness blocker, MOMO Pro false-noise health monitor source-of-truth, docker-health direct Telegram fallback cooldown, Bitan public-content same-fingerprint cooldown, notification-noise readback, MOMO source-file absence GO/NO-GO gate with scheduler stats / import_config / job 56 evidence, repo-side cold-start v1.42 source absence classifier, live-sync parity gate, MOMO V10.653 / Gitea main / dual-workstation Codex baseline readback, MOMO import-boundary production deploy, MacBook Pro Codex safe artifact sync readback, and MacBook Pro AwoooGo Gitea SSH / dev workspace readback are updated. 
Latest deploy marker `622bc372` points runtime image to `2ec7f6f4`; CD `#3294` retains a historical Failure because post-deploy monitoring coverage saw 188 `nginx-exporter` down before recovery, while manual coverage now passes `14/14 UP`. 2026-06-24 23:15 read-only verify shows repo cold-start hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`; live 110 script sync of the v1.42 classifier is not claimed until separately approved and recorded. | +| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-24 23:33 backup / alert readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`。188 `node-exporter` textfile scrape、PostgreSQL exporter、Redis exporter、`nginx-exporter`、MinIO endpoint、Velero BSL and latest completed backup freshness are restored; monitoring coverage is `14/14 UP`; `BackupHealthMonitorMissing188`、`PostgreSQLDown`、`RedisDown`、`VeleroBackupNotRun` and 110 disk-pressure alerts resolved. DR remains blocked on real non-secret credential escrow evidence IDs. | +| P2 service / data truth | BLOCKED_MOMO_DATA_FRESHNESS | 98% | Public route/TLS, API/Web route, momo health `V10.653`, MOMO main / CD `#904` commit `84035906aba0e5e190d031a13cfd9b47a8cd1f73`, 188 live import-boundary source marker, current-month parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, backup exporters, schedules, K3s node readiness/storage conditions, VIP, and 110 / 188 runtime health are green. Mac Mini / MacBook Pro controlled MOMO workspaces both point to the same codex branch commit. MOMO latest business date remains `2026-06-17`; stale age is `7` days as of 23:33. 
Drive pending folder has `0` matching files in repeated scheduler checks; scheduler stats show `file_count=0` / `imported_count=0` for repeated AutoImport runs; latest valid job `56` already imported `即時業績_當日.xlsx` with `sync_success=true` and bounds `2026-06-01..2026-06-17`; Mac Mini / MacBook candidate files are old or header-only, so there is no safe newer source to import. | +| P3 docs / automation contracts | DONE_WITH_MOMO_SOURCE_ABSENCE_GATE_V142_REPO_ONLY | 100% | Workplan, SOP v1.44, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, 188 DB/Redis exporter restore helper, 188 MinIO/Velero restore helper, 188 nginx-exporter restore helper, 110 Docker disk pressure cleanup boundary, MOMO Google Drive token userns readback, MOMO daily freshness blocker, MOMO Pro false-noise health monitor source-of-truth, docker-health direct Telegram fallback cooldown, Bitan public-content same-fingerprint cooldown, notification-noise readback, MOMO source-file absence GO/NO-GO gate with scheduler stats / import_config / job 56 
evidence, repo-side cold-start v1.42 source absence classifier, live-sync parity gate, MOMO V10.653 / Gitea main / dual-workstation Codex baseline readback, MOMO import-boundary production deploy, MacBook Pro Codex safe artifact sync readback, and MacBook Pro AwoooGo Gitea SSH / dev workspace readback are updated. Latest deploy marker `622bc372` points runtime image to `2ec7f6f4`; CD `#3294` retains a historical Failure because post-deploy monitoring coverage saw 188 `nginx-exporter` down before recovery, while manual coverage now passes `14/14 UP`. 2026-06-24 23:15 read-only verify still shows repo cold-start hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`; live 110 script sync of the v1.42 classifier is not claimed until separately approved and recorded. | -Full cold-start service readiness may not be declared green for the latest verified evidence set. As of 2026-06-24 22:40, routes/hosts/K3s/backups/exporters/Velero/monitoring coverage are available, and MOMO production code has the import-boundary fix, but the latest live cold-start scorecard remains `PASS=86 WARN=0 BLOCKED=1` because MOMO business data freshness is stale beyond 3 days and no newer legitimate source file is available. The 23:04 repo-side v1.42 dry-run now returns `PASS=88 WARN=0 BLOCKED=1` and names the blocker as `188 momo source file absent while daily sales data stale`; this is repo-side source-of-truth evidence and not yet a claim that the live 110 script was deployed. Do not declare DR scorecard complete while credential escrow evidence remains blocked. +Full cold-start service readiness may not be declared green for the latest verified evidence set. 
As of 2026-06-24 23:33, routes/hosts/K3s/backups/exporters/Velero/monitoring coverage are available, and MOMO production code has the import-boundary fix, but the latest repo-side live read-only cold-start scorecard remains `PASS=88 WARN=0 BLOCKED=1` because MOMO business data freshness is stale beyond 3 days and no newer legitimate source file is available. The blocker is explicitly `188 momo source file absent while daily sales data stale`; this is repo-side source-of-truth evidence and not yet a claim that the 110 live monitor script was deployed. Do not declare DR scorecard complete while credential escrow evidence remains blocked. 2026-06-13 01:26 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing survived the next normal CD deploy: Gitea main `e4a349bc`, ArgoCD revision `e4a349bc`, images from `414413a5`, API/Web split across `mon` / `mon1`, and global `known_hosts` retained 120 / 188 after CD fix `80e6ec1a`. Do not declare DR complete while credential escrow is missing. `km-vectorize` remediation is `90%`: schedule/label fix is live, and the remaining gate is the next official 03:00 CronJob success readback. @@ -179,7 +179,7 @@ Next: | P3-005 | DONE | 100 | Update cold-start SOP | SOP now includes start, shutdown, reboot, record, comparison, and 120 blocker handling. | Increment SOP version after each process change. | SOP has controlled power-operation sections and ledger template. | | P3-006 | DONE | 100 | Update backup status | Backup status now reflects current cron, rclone latest-only, failure-only alert posture, and escrow blocker. | Refresh after 120 backup rerun. | Backup status no longer claims noisy success Telegram notifications. | | P3-007 | DONE | 100 | Harden Gitea backup stale dump handling | 2026-06-05 manual Gitea backup failed because the container retained `/tmp/gitea-dump.zip` from the 02:00 failure. 
`scripts/backup/backup-gitea.sh` now renames stale container dump files to timestamped evidence before running a new dump, and the live 110 script is updated. | Watch the next 02:00 Gitea backup. | `bash -n` passes locally and on 110; manual Gitea backup completed after stale evidence rename. |
-| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.43 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, stale-vs-active K8s failed Job classification, post-reboot / post-CD recovery anchors, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, 110 runaway browser / CI load-diversion PlayBook, healthy-heartbeat suppression, 188 node-exporter restore, 188 DB/Redis exporter restore, 188 MinIO/Velero restore, 188 nginx-exporter restore, 110 Docker disk cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, post-reboot notification noise gates, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side scorecard source-absence classifier, 110 live-sync parity gate, and CD monitoring coverage target-down classification. | Use v1.43 for the next reboot record, then compare actual timing, Plan B trigger, degraded level, failed/stale/active Job counters, runaway-process metrics, CI load attribution, MOMO source availability, data freshness, Velero freshness, exporter scrape, disk usage, notification-noise state, monitoring coverage, and blockers against §1.4 plus §11.1 / §14.8 through §14.32. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow / runaway-process / notification-noise / MOMO source-file / monitoring coverage checks. If using the live 110 script, record its hash and do not assume repo-side v1.42 behavior until synced under approval and deploy parity passes. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, and `B5_DR_COMPLETE`; live cold-start returns `PASS=86 WARN=0 BLOCKED=1` for the current evidence set, repo-side v1.42 dry-run returns `PASS=88 WARN=0 BLOCKED=1` with blocker `188 momo source file absent while daily sales data stale`, and 23:15 deploy parity correctly blocks live-sync claims until hash parity is restored; repeated healthy/same-failure notification noise is controlled without hiding real alerts, and monitoring coverage target-down is routed through exporter restore before any product restart. |
+| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.44 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, stale-vs-active K8s failed Job classification, post-reboot / post-CD recovery anchors, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, 110 runaway browser / CI load-diversion PlayBook, healthy-heartbeat suppression, 188 node-exporter restore, 188 DB/Redis exporter restore, 188 MinIO/Velero restore, 188 nginx-exporter restore, 110 Docker disk cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, post-reboot notification noise gates, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side scorecard source-absence classifier, 110 live-sync parity gate, and CD monitoring coverage target-down classification. | Use v1.43 for the next reboot record, then compare actual timing, Plan B trigger, degraded level, failed/stale/active Job counters, runaway-process metrics, CI load attribution, MOMO source availability, data freshness, Velero freshness, exporter scrape, disk usage, notification-noise state, monitoring coverage, and blockers against §1.4 plus §11.1 / §14.8 through §14.32. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow / runaway-process / notification-noise / MOMO source-file / monitoring coverage checks. If using the live 110 script, record its hash and do not assume repo-side v1.42 behavior until synced under approval and deploy parity passes. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, and `B5_DR_COMPLETE`; live cold-start returns `PASS=86 WARN=0 BLOCKED=1` for the current evidence set, repo-side v1.42 dry-run returns `PASS=88 WARN=0 BLOCKED=1` with blocker `188 momo source file absent while daily sales data stale`, and 23:15 deploy parity correctly blocks live-sync claims until hash parity is restored; repeated healthy/same-failure notification noise is controlled without hiding real alerts, and monitoring coverage target-down is routed through exporter restore before any product restart. |
 | P3-009 | DONE | 100 | Assess 120/121 AA/AS role and host load balancing | 2026-06-12 15:19 live check confirms 120 and 121 are both `Ready control-plane`, `k3s active`, `k3s-agent inactive`, with no taints; however most AWOOOI / ArgoCD / Velero workload remains on 121 after 120 fsck recovery. New assessment defines control-plane AA vs workload AA, migration candidates from 110/188, and stateful migration blockers. | After P0 backup/offsite/cold-start green, implement topology spread for AWOOOI API/Web before moving additional services. | `docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md` exists; SOP v1.6 links AA/AS and load-balancing checks; migration implementation remains explicitly `0%`. |
 | P3-010 | DONE | 100 | Update workload balancing docs with 2026-06-13 live truth | Host role assessment, workplan, SOP, backup status, and LOGBOOK are refreshed with current cold-start, backup, 188 certbot degraded, ArgoCD `km-vectorize` degraded, Gitea main `acaae999`, ArgoCD sync, and final pod placement evidence. | Keep updating this file after the next reboot or deploy. | Docs separate service-green status from DR escrow, workload rollout, and non-service governance debt. |
 | P3-011 | DONE | 100 | Record `km-vectorize` remediation status | LOGBOOK, this workplan, and SOP now state the schedule/label fix, ArgoCD sync evidence, the invalid manual Job boundary, and the 90% waiting-for-next-schedule gate. | After next 03:00 run, update this row and the top verdict with `lastSuccessfulTime` / ArgoCD health evidence. | No document claims ArgoCD green before official CronJob success evidence exists. |