diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 30c514ee..d5f026d0 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -1,3 +1,41 @@
+## 2026-06-24|11:19 full-stack recovery readback and MOMO upstream file absence gate
+
+**Background**: After finishing noise reduction for the Telegram heartbeat, MOMO false noise, and Bitan repeated failures, a read-only recovery check was rerun so that "service returns 200" is not mistaken for "data is also current". This read-only pass did not restart the host, Docker, Nginx, or K3s, and did not manually create or delete Jobs or import old files.
+
+**Full-stack readback**:
+- 110 `full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1`: `PASS=86 WARN=0 BLOCKED=1`, Result `BLOCKED`.
+- 110 / 120 / 121 / 188 ping + SSH all OK; K3s `mon` / `mon1` Ready; ArgoCD `awoooi-prod` is `Synced / Healthy`, source revision `35a3a59839bb09404099f6b5a22be1354a247abe`.
+- `awoooi-api` / `awoooi-web` / `awoooi-worker` are all ready; the image is still `a84a5a0b...`, which is expected because the later `35a3a598` only changes docs / ops scripts and triggers no runtime image rebuild.
+- Public routes readback: `awoooi`, `vibework`, `awooogo`, `mo`, `stock`, and `bitan` all return 200; AWOOOI API health is `healthy / prod / mock_mode=false`, GCP Ollama routes are up, and the local Ollama route 502 is still covered by provider fallback, so this alone must not be overstated as all green.
+
+**Backup / DR readback**:
+- `backup-status` core green: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`.
+- Last aggregate: `2026-06-24 02:28:39`.
+- DR still cannot be declared complete: `escrow_missing=5`; the five credential escrow evidence markers still require real external non-secret evidence. Do not forge markers, and do not put passwords / tokens in the repo or in chat.
+
+**MOMO data truth**:
+- `https://mo.wooo.work/health`: `status=healthy`, `database=postgresql`, `version=V10.639`.
+- DB readback: `daily_sales_snapshot=104614` rows, dates `2025-07-01` to `2026-06-17`; `realtime_sales_monthly=786621` rows, dates `2024-01-01` to `2026-06-17`.
+- current-month parity: `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`; the two tables match.
+- The latest successful `daily_sales` import job is `id=56`, source `即時業績_當日.xlsx`, created at `2026-06-18 11:41`, `total_records=10936`, `sync_success=true`; there has been no new `daily_sales` import job since 2026-06-18.
+- Drive pending folder `當日業績匯入` has count `0` for pattern `即時業績_當日`; the archive's latest matching file is `2026-06-18T01:30:39Z` and was already imported by job `56`.
+- Inside the `momo-scheduler` container, Google Drive listing currently works and returns `count=0`; recent logs no longer show `Permission denied`. Yesterday's token permission error is now historical and is not the current main blocker.
+
+**Verdict**:
+- Services, routes, K3s, backups, offsite, and exporters are currently restored to a usable state.
+- The only service hard blocker is `MOMO_DAILY_FRESHNESS 7|2026-06-17`: the website is not down, DB parity is not broken, and the scheduler is not dead; there is simply no newer legitimate upstream Excel source file.
+- Do not re-import old archives, truncate, restore the whole database, or pass off product export files as a sales source. The blocker can be lifted only when a new `即時業績_當日` source file appears and imports successfully, `sync_success=true`, file movement follows the rules, the date bounds of the two tables match, and `MOMO_DAILY_FRESHNESS <= 2`.
+
+**SOP / workplan**:
+- `docs/runbooks/FULL-STACK-COLD-START-SOP.md` is upgraded to v1.32, adding §14.31 MOMO source-file absence gate.
+- `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md` is updated with 11:19/11:23 evidence; P2 data truth stays `BLOCKED_MOMO_DATA_FRESHNESS`, and P3 docs / automation contracts is updated to `DONE_WITH_MOMO_SOURCE_ABSENCE_GATE`.
+
+**Completion**:
+- Host / K3s / public route recovery: 100%.
+- Backup / offsite core recovery: 100%, DR escrow still blocked.
+- MOMO service availability: 100%; MOMO business data freshness: blocked on upstream source.
+- Overall recovery readiness: 98%, remaining at `SERVICE_AVAILABLE_MOMO_SOURCE_BLOCKED_DR_ESCROW_BLOCKED`.
+
 ## 2026-06-24|Telegram alert noise reduction and post-reboot notification gate

 **Background**: After service recovery, Telegram still showed two classes of noise: the healthy AWOOOI heartbeat flooding the chat every 30 minutes, the old MOMO Pro monitor reporting `HTTP 502` every 5 minutes, and the Bitan public content cleanliness failure that used to be pushed repeatedly every 30 minutes. These are not the same pipeline: the AWOOOI heartbeat has been quieted in production code; MOMO / Bitan come from 110 cron scripts. The repair principle is "stay quiet on normal success, reduce noise for the same failure, keep warning / failure / recovery", and real red lights must not be silenced.
diff --git a/docs/runbooks/FULL-STACK-COLD-START-SOP.md b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
index 14ec39c5..00407b86 100644
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -1,6 +1,6 @@
 # AWOOOI Full-Stack Cold Start and Host Reboot SOP

-> Version: v1.31
+> Version: v1.32
 > Last updated: 2026-06-24 Asia/Taipei
 > Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.

@@ -10,15 +10,16 @@

 This section is the first decision anchor after every handover, power-on, shutdown, or reboot. If the date is not today, rerun the live check first, then update this section and `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md`.

-2026-06-24 06:35 live readback supersedes the earlier 02:44 wording:
+2026-06-24 11:19 live readback supersedes the earlier 06:35 wording:

 ```text
 Repo-side reboot SOP / Plan B / automation contracts: COMPLETE, 100%.
 Live cold-start read-only check: PASS=86 WARN=0 BLOCKED=1, Result=BLOCKED.
-Service state: SERVICE_AVAILABLE_MOMO_SOURCE_BLOCKED_DR_ESCROW_BLOCKED; 110/120/121/188 reachable, K3s mon/mon1 Ready, public routes/TLS green, 110/188 backup health fresh, 188 node-exporter / PostgreSQL exporter / Redis exporter restored, 188 MinIO endpoint and Velero BackupStorageLocation restored, 110 disk pressure cleared.
-MOMO state: current-month daily_sales_snapshot and realtime_sales_monthly match, but both stop at 2026-06-17. MOMO_DAILY_FRESHNESS is 6 days, which is a hard blocker because business data is not current.
-Google Drive state: momo scheduler token ownership is fixed for Docker userns, Drive listing works, but folder 當日業績匯入 currently has no matching 即時業績_當日 Excel source file. Archive latest matching file is 2026-06-18 and was already imported.
-Backup / monitoring state: 188 MinIO is healthy, Velero BackupStorageLocation default is Available, one-off backup reboot-recovery-202606240456 completed, backup-health textfile reports Velero freshness green, and VeleroBackupNotRun / PostgreSQLDown / RedisDown / disk-pressure alerts resolved.
+Service state: SERVICE_AVAILABLE_MOMO_SOURCE_BLOCKED_DR_ESCROW_BLOCKED; 110/120/121/188 reachable, K3s mon/mon1 Ready, ArgoCD awoooi-prod Synced/Healthy at revision 35a3a59839bb09404099f6b5a22be1354a247abe, public routes/TLS green, 110/188 backup health fresh, 188 node-exporter / PostgreSQL exporter / Redis exporter restored, 188 MinIO endpoint and Velero BackupStorageLocation restored, 110 disk pressure cleared.
+Runtime release state: API/Web/Worker are ready; image remains a84a5a0b because 35a3a598 is docs/ops-script only and does not rebuild runtime images.
+MOMO state: mo.wooo.work health is healthy on version V10.639; current-month daily_sales_snapshot and realtime_sales_monthly match, but both stop at 2026-06-17. MOMO_DAILY_FRESHNESS is 7 days, which is a hard blocker because business data is not current.
+Google Drive state: momo scheduler token ownership is fixed for Docker userns, container-side Drive listing works, but folder 當日業績匯入 currently has no matching 即時業績_當日 Excel source file. Archive latest matching file is 2026-06-18T01:30:39Z and was already imported by job 56.
+Backup / monitoring state: backup-status core blockers are 0, last aggregate is 2026-06-24 02:28:39, 188 MinIO is healthy, Velero BackupStorageLocation default is Available, one-off backup reboot-recovery-202606240456 completed, backup-health textfile reports Velero freshness green, and VeleroBackupNotRun / PostgreSQLDown / RedisDown / disk-pressure alerts resolved.
 Notification-noise state: healthy AWOOOI heartbeat is suppressed; MOMO Pro monitor now uses https://mo.wooo.work/health as primary truth and no longer checks the 188 root path; docker-health-monitor keeps 5-minute repair cadence but has a separate 30-minute Telegram fallback cooldown; Bitan public-content check keeps failure alerting with same-fingerprint cooldown and one recovery notice.
 Allowed declaration: core hosts, routes, K3s, backup/exporter surfaces are recovered; MOMO data pipeline is blocked waiting for a newer source file or owner-provided source evidence.
 Forbidden declaration: full-stack green, MOMO data current, DR complete, or runtime/security acceptance. Credential escrow evidence is still missing and must not be forged.
@@ -83,13 +84,13 @@ Allowed declaration: monitoring, alert rules, AI event packet, PlayBook / KM con
 Forbidden declaration: AI runtime remediation is enabled. Process termination, Docker/systemd restart, Nginx reload, firewall/K8s action, Telegram live send, Gateway queue write, Bot API call, production write, and secret read remain forbidden without owner approval, maintenance window, evidence ref, dry-run, and post-check.
 ```

-| Item | 2026-06-24 06:35 Asia/Taipei live result | Verdict |
+| Item | 2026-06-24 11:19 Asia/Taipei live result | Verdict |
 |------|-------------------------------------------|------|
 | Overall recovery readiness | `98%` | `SERVICE_AVAILABLE_MOMO_SOURCE_BLOCKED_DR_ESCROW_BLOCKED` |
 | P0 host / K3s recovery | `100%` | `DONE` |
 | P1 backup / alert / escrow | `96%` | `BLOCKED_DR_ESCROW` |
 | P2 service / data truth | `96%` | `BLOCKED_MOMO_DATA_FRESHNESS` |
-| P3 docs / automation contracts | `100%` | `DONE_WITH_VELERO_AND_EXPORTER_RECOVERY_GATE` |
+| P3 docs / automation contracts | `100%` | `DONE_WITH_MOMO_SOURCE_ABSENCE_GATE` |
 | 110 host runtime | `fwupd-refresh.timer` intentionally disabled/inactive after non-runtime firmware metadata refresh failed units were classified; `systemctl --failed` returns `0 loaded units listed`; rollback is `sudo systemctl enable --now fwupd-refresh.timer` | `GREEN_WITH_FWUPD_TIMER_DISABLED` |
 | 110 host runaway process guard | 14:31-14:32 live scrape confirms `monitor_up=1`, orphan browser groups `0`, active Gitea Actions containers `2`, `load5_per_core≈0.79-0.81`, `swap_used_ratio≈1.0`, and `remediation_authorized=0`; exporter/helper also remain in Ansible textfile exporter source-of-truth. | `LIVE_SCRAPED_RUNTIME_GATE_0` |
 | 120 reachability | ping OK, SSH OK, boot around `2026-06-14 02:23`, K3s active, node `mon Ready` | `GREEN` |
@@ -97,13 +98,13 @@ Forbidden declaration: AI runtime remediation is enabled. Process termination, D
 | 188 host runtime | production services green; host degraded only by `certbot.service` and `snap.certbot.renew.service` | `GREEN_WITH_CERTBOT_DEBT` |
 | K3s node state | `mon Ready control-plane`, `mon1 Ready control-plane`; bad pods `0`; `FAILED_JOBS=1`, `STALE_FAILED_JOBS=1`, `ACTIVE_FAILED_JOBS=0` | `GREEN_WITH_RETAINED_EVIDENCE` |
 | 110 -> 120 / 188 SSH trust | 00:33 cold-start exposed stale `known_hosts`; backup `/home/wooo/.ssh/known_hosts.before-120-188-refresh.20260613-003416`; final repair backup `/home/wooo/.ssh/known_hosts.before-120-188-final-refresh.20260613-011949`; CD fix `80e6ec1a` moves deploy trust to `/home/wooo/.ssh/deploy_known_hosts`; 01:28 global `known_hosts` still contains 120 / 188 and was not clobbered by deploy marker `e4a349bc` | `GREEN_WITH_GUARDRAIL` |
-| Backup status | 06:35 status: 110 backup health fresh, 188 backup health fresh, MinIO / Velero storage restored, latest Velero backup fresh, exporter scrape green; escrow readback still shows `ESCROW_MISSING_COUNT=5` | `GREEN_WITH_DR_ESCROW_WARNING` |
+| Backup status | 11:20 status: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`; escrow readback still shows `ESCROW_MISSING_COUNT=5` | `GREEN_WITH_DR_ESCROW_WARNING` |
 | Offsite sync / verify | 01:28 textfile: `awoooi_backup_offsite_remote_verify_ok=1`, `full_verify_fresh=1`, all 13 repos have `snapshot_count=1` and `snapshot_latest_only=1`; latest scheduled verifier log is 2026-06-12 07:20 | `GREEN` |
 | Backup / cold-start alerts | 01:27 live visibility check confirms Prometheus and Alertmanager expose the 5 required credential escrow gap alerts; Prometheus rules API has all five required alert names healthy; label contract check loads 24 baseline backup alert rules | `GREEN_WITH_EXPECTED_REDLIGHTS` |
-| Cold-start scorecard | 06:35 read-only scorecard: `PASS=86 WARN=0 BLOCKED=1`. Public routes / TLS, momo DB parity, backup exporters, 120/121 K3s, MinIO / Velero, and AWOOOI API/Web all pass; the only blocker is MOMO data freshness. | `BLOCKED_MOMO_DATA_FRESHNESS` |
-| momo DB parity | `4571|4571|2026-06-01|2026-06-07|2026-06-01|2026-06-07` | `GREEN` |
-| momo scheduler | container healthy; scorecard reads `SCHEDULER_RECENT_ACTIVITY 1009`; detector widened and deployed to 110 | `GREEN` |
-| ArgoCD app health | This 13:43 cold-start pass does not reassert ArgoCD app health directly. Service release uses K8s node / route / workload / CronJob active-failure evidence. Any ArgoCD release claim still requires a fresh `awoooi-prod` app readback. | `NOT_REASSERTED_BY_THIS_GATE` |
+| Cold-start scorecard | 11:19 read-only scorecard: `PASS=86 WARN=0 BLOCKED=1`. Public routes / TLS, momo DB parity, backup exporters, 120/121 K3s, MinIO / Velero, and AWOOOI API/Web all pass; the only blocker is MOMO data freshness. | `BLOCKED_MOMO_DATA_FRESHNESS` |
+| momo DB parity | `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17` | `GREEN` |
+| momo scheduler | container healthy; Drive listing from container works; pending folder `當日業績匯入` count is `0` for `即時業績_當日`; no current `Permission denied` evidence in the latest readback | `GREEN_WITH_SOURCE_ABSENT` |
+| ArgoCD app health | 11:19 readback: `awoooi-prod` sync `Synced`, health `Healthy`, source revision `35a3a59839bb09404099f6b5a22be1354a247abe`; API/Web/Worker ready. | `GREEN` |
 | Workload balancing | Live API/Web/Worker/CronJob image is `e999c16b3435f197b78fe2adfeec1c4faa6c4675`; API/Web pods remain split across `mon` / `mon1`, Worker single replica remains healthy on `mon` | `GREEN` |
 | Credential escrow | 5 non-secret evidence markers missing | `BLOCKED` |

@@ -201,7 +202,7 @@ DR_COMPLETE = no, because credential escrow evidence is incomplete
 110 / 120 / 121 / 188 HOST_READY = yes
 Core public services SERVICE_READY = yes
 MOMO_DB_PARITY = yes
-MOMO_DATA_FRESH = no, because latest daily_sales_snapshot date is 2026-06-17 and stale age is 6 days
+MOMO_DATA_FRESH = no, because latest daily_sales_snapshot date is 2026-06-17 and stale age is 7 days as of 2026-06-24 11:19
 FULL_STACK_GREEN = no, because cold-start scorecard is PASS=86 WARN=0 BLOCKED=1
 DR_COMPLETE = no, because credential escrow evidence is incomplete
 ```
@@ -1725,7 +1726,7 @@ ssh ollama@192.168.0.188 'bash -s' < scripts/ops/188-node-exporter-restore.sh
 | Drive pending folder | `當日業績匯入`, pattern `即時業績_當日`, current matching Excel count `0` |
 | Drive archive folder | `當日業績匯入/已匯入`, latest matching file modifiedTime `2026-06-18T01:30:39Z`, already imported by import job `56` |
 | DB parity | `MOMO_MONTHLY_SYNC 10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17` |
-| Data freshness | `MOMO_DAILY_FRESHNESS 6|2026-06-17` |
+| Data freshness | `MOMO_DAILY_FRESHNESS 7|2026-06-17` as of 2026-06-24 11:19 |
 | Live cold-start readback | `PASS=86 WARN=0 BLOCKED=1`, result `BLOCKED` |
 | 110 live script sync | `/home/wooo/scripts/full-stack-cold-start-check.sh` hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8` |
 | Alert dedupe | `data_stale_alert` for `upstream_drive` has 24h dedupe; latest evidence was 2026-06-23 with last_date `2026-06-17` |
@@ -1766,7 +1767,7 @@ SQL
 | Velero restore | 120 `BackupStorageLocation/default` phase is `Available`; one-off backup `reboot-recovery-202606240456` is `Completed` |
 | Backup-health textfile | 110 exporter refresh reports `awoooi_velero_monitor_up=1`, `awoooi_velero_latest_completed_backup_fresh=1`, restore-test cron present, failed jobs `0` |
 | Alert readback | `VeleroBackupNotRun`, `PostgreSQLDown`, `RedisDown`, and 110 disk-pressure alerts resolved |
-| Live cold-start readback | `PASS=86 WARN=0 BLOCKED=1`, result `BLOCKED`; only blocker remains `MOMO_DAILY_FRESHNESS 6|2026-06-17` |
+| Live cold-start readback | `PASS=86 WARN=0 BLOCKED=1`, result `BLOCKED`; only blocker remains MOMO data freshness |
 | Declaration limit | May declare backup / exporter / MinIO / Velero chain recovered; must not declare full-stack green, MOMO data current, DR complete, or runtime/security acceptance |

 188 PostgreSQL / Redis exporter post-reboot recovery:
@@ -1840,6 +1841,55 @@ Generic docker-health monitor: API 200/202 path is primary; direct Telegram fall
 Bitan public content: pass -> no failure Telegram; repeated same failure -> cooled; recovery after prior failure -> one notice.
 ```

+### 14.31 2026-06-24 MOMO source-file absence decision gate
+
+The 2026-06-24 11:19 recovery verdict splits MOMO into two questions: service availability and data freshness. Service availability is restored; data freshness is still blocked. This gate exists to stop operators from mistaking "no new source file" for "recovery complete" when the public site returns 200, the container is healthy, and DB parity is good.
+
+| Item | 11:19 source-file absence baseline |
+|------|------------------------------------|
+| SOP version | `v1.32` |
+| MOMO public health | `https://mo.wooo.work/health` returns healthy; version `V10.639` |
+| DB rows | `daily_sales_snapshot=104614`, `realtime_sales_monthly=786621` |
+| DB bounds | daily `2025-07-01..2026-06-17`; monthly `2024-01-01..2026-06-17` |
+| Current-month parity | `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17` |
+| Latest successful import | `daily_sales` job `56`, created `2026-06-18 11:41`, source `即時業績_當日.xlsx`, `sync_success=true` |
+| Pending source folder | `當日業績匯入` count `0` for pattern `即時業績_當日` |
+| Archive latest | `2026-06-18T01:30:39Z`, already imported by job `56` |
+| Scheduler Drive readback | container-side Drive listing works and currently returns count `0`; no current `Permission denied` evidence in latest readback |
+| Stale alert posture | `data_stale_alert` has 24h dedupe; this is a true warning, not heartbeat spam |
+| Blocking metric | `MOMO_DAILY_FRESHNESS 7|2026-06-17` |
+
+GO / NO-GO:
+
+```text
+GO: declare MOMO web/API/container/database service available.
+GO: declare current-month table parity good.
+NO-GO: declare MOMO business data current.
+NO-GO: declare FULL_STACK_GREEN while MOMO_DAILY_FRESHNESS > 2.
+NO-GO: re-import old archived files to fake freshness.
+NO-GO: import product exports or manually constructed spreadsheets as daily sales source.
+NO-GO: truncate tables, restore whole DB, or move Drive files when sync_success is false.
+```
+
+The only acceptable evidence for lifting the blocker:
+
+```text
+1. New legitimate 即時業績_當日 source file appears in the expected Drive intake path, or owner supplies a verifiable source-evidence reference.
+2. Import job completes with success=true and sync_success=true.
+3. Drive file movement / archive evidence shows the source was handled once.
+4. daily_sales_snapshot and realtime_sales_monthly counts and date bounds match for the imported range.
+5. MOMO_DAILY_FRESHNESS <= 2.
+6. backup / offsite / cold-start scorecard rerun after import remains green except known DR escrow blocker.
+```
+
+If the source file is absent, the correct report is:
+
+```text
+MOMO service is recovered, data pipeline is waiting for upstream source file.
+No safe import candidate exists.
+Full-stack remains blocked by data freshness, not by service outage.
+``` + ### 14.22 重啟後時間軸驗證 每次重啟後照時間軸推進,不要等到最後才一次判定。 diff --git a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md index b6aad499..3989f8c5 100644 --- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md +++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md @@ -11,13 +11,13 @@ | Area | Status | Completion | Evidence | |------|--------|------------|----------| -| Overall recovery readiness | SERVICE_AVAILABLE_MOMO_SOURCE_BLOCKED_DR_ESCROW_BLOCKED | 98% | 2026-06-24 06:35 live cold-start read-only gate returned `PASS=86 WARN=0 BLOCKED=1`, result `BLOCKED`。110 / 120 / 121 / 188 ping and SSH port are OK, K3s `mon` / `mon1` are Ready, `NODE_FS_ERROR_EVENTS=0`, `NODE_READONLY_FILESYSTEM_TRUE=0`, `NODE_DISK_PRESSURE_TRUE=0`, public routes/TLS are green, 110 / 188 runtime and backup checks are green。188 `node-exporter`、PostgreSQL exporter、Redis exporter、MinIO / Velero BSL are restored; 110 disk pressure cleared to 73%。Remaining service blocker is MOMO business data freshness: `MOMO_DAILY_FRESHNESS 6|2026-06-17`; Drive listing works after token owner repair, but `當日業績匯入` has no newer `即時業績_當日` Excel source file. DR remains blocked because credential escrow evidence markers are still missing and must not be forged. 
| +| Overall recovery readiness | SERVICE_AVAILABLE_MOMO_SOURCE_BLOCKED_DR_ESCROW_BLOCKED | 98% | 2026-06-24 11:19 live cold-start read-only gate returned `PASS=86 WARN=0 BLOCKED=1`, result `BLOCKED`。110 / 120 / 121 / 188 ping and SSH port are OK, K3s `mon` / `mon1` are Ready, ArgoCD `awoooi-prod` is `Synced / Healthy` at revision `35a3a59839bb09404099f6b5a22be1354a247abe`, public routes/TLS are green, 110 / 188 runtime and backup checks are green。188 `node-exporter`、PostgreSQL exporter、Redis exporter、MinIO / Velero BSL are restored; 110 disk pressure cleared。Remaining service blocker is MOMO business data freshness: `MOMO_DAILY_FRESHNESS 7|2026-06-17`; Drive listing works from the scheduler container, but `當日業績匯入` has no newer `即時業績_當日` Excel source file. DR remains blocked because credential escrow evidence markers are still missing and must not be forged. | | P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-14 18:15 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, and cold-start P0/P1 checks are green. | -| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 96% | 2026-06-24 06:35 backup / alert readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`。188 `node-exporter` textfile scrape、PostgreSQL exporter、Redis exporter、MinIO endpoint、Velero BSL and latest completed backup freshness are restored; `BackupHealthMonitorMissing188`、`PostgreSQLDown`、`RedisDown`、`VeleroBackupNotRun` and 110 disk-pressure alerts resolved. DR remains blocked on real non-secret credential escrow evidence IDs. | -| P2 service / data truth | BLOCKED_MOMO_DATA_FRESHNESS | 96% | Public route/TLS, API/Web route, momo health, current-month parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, backup exporters, schedules, K3s node readiness/storage conditions, VIP, and 110 / 188 runtime health are green. 
However MOMO latest business date is `2026-06-17`; stale age is `6` days. Drive pending folder has `0` matching files and archive latest is the already-imported 2026-06-18 file, so there is no safe newer source to import. | -| P3 docs / automation contracts | DONE_WITH_NOTIFICATION_NOISE_GATE | 100% | Workplan, SOP v1.31, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, 188 DB/Redis exporter restore helper, 188 MinIO/Velero restore helper, 110 Docker disk pressure cleanup boundary, MOMO Google Drive token userns readback, MOMO daily freshness blocker, MOMO Pro false-noise health monitor source-of-truth, docker-health direct Telegram fallback cooldown, Bitan public-content same-fingerprint cooldown, and 2026-06-24 notification-noise readback are updated. Production image `a84a5a0b` is live with API `2/2`, Web `2/2`, Worker `1/1`; CD `#3289` is a known false-negative caused by worker startup / rollout timeout after deploy marker `4a7b5329`. 
| +| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 96% | 2026-06-24 11:20 backup / alert readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`。188 `node-exporter` textfile scrape、PostgreSQL exporter、Redis exporter、MinIO endpoint、Velero BSL and latest completed backup freshness are restored; `BackupHealthMonitorMissing188`、`PostgreSQLDown`、`RedisDown`、`VeleroBackupNotRun` and 110 disk-pressure alerts resolved. DR remains blocked on real non-secret credential escrow evidence IDs. | +| P2 service / data truth | BLOCKED_MOMO_DATA_FRESHNESS | 96% | Public route/TLS, API/Web route, momo health `V10.639`, current-month parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, backup exporters, schedules, K3s node readiness/storage conditions, VIP, and 110 / 188 runtime health are green. However MOMO latest business date is `2026-06-17`; stale age is `7` days as of 11:19. Drive pending folder has `0` matching files and archive latest `2026-06-18T01:30:39Z` is already imported by job `56`, so there is no safe newer source to import. 
| +| P3 docs / automation contracts | DONE_WITH_MOMO_SOURCE_ABSENCE_GATE | 100% | Workplan, SOP v1.32, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, 188 DB/Redis exporter restore helper, 188 MinIO/Velero restore helper, 110 Docker disk pressure cleanup boundary, MOMO Google Drive token userns readback, MOMO daily freshness blocker, MOMO Pro false-noise health monitor source-of-truth, docker-health direct Telegram fallback cooldown, Bitan public-content same-fingerprint cooldown, notification-noise readback, and MOMO source-file absence GO/NO-GO gate are updated. Production image `a84a5a0b` remains live with API `2/2`, Web `2/2`, Worker `1/1`; later docs/ops commits do not require runtime image rebuild. | -Full cold-start service readiness may not be declared green for the latest verified evidence set. As of 2026-06-24 06:35, routes/hosts/K3s/backups/exporters/Velero are available, but the scorecard is `PASS=86 WARN=0 BLOCKED=1` because MOMO business data freshness is stale beyond 3 days. 
Do not declare DR scorecard complete while credential escrow evidence remains blocked. +Full cold-start service readiness may not be declared green for the latest verified evidence set. As of 2026-06-24 11:19, routes/hosts/K3s/backups/exporters/Velero are available, but the scorecard is `PASS=86 WARN=0 BLOCKED=1` because MOMO business data freshness is stale beyond 3 days and no newer legitimate source file is available. Do not declare DR scorecard complete while credential escrow evidence remains blocked. 2026-06-13 01:26 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing survived the next normal CD deploy: Gitea main `e4a349bc`, ArgoCD revision `e4a349bc`, images from `414413a5`, API/Web split across `mon` / `mon1`, and global `known_hosts` retained 120 / 188 after CD fix `80e6ec1a`. Do not declare DR complete while credential escrow is missing. `km-vectorize` remediation is `90%`: schedule/label fix is live, and the remaining gate is the next official 03:00 CronJob success readback. @@ -157,7 +157,8 @@ Next: | ID | Status | % | Work item | Fine analysis | Next action | Done criteria | |----|--------|---:|-----------|---------------|-------------|---------------| | P2-001 | VERIFIED | 100 | Public route smoke | 2026-06-12 18:57 cold-start confirms all listed domains returned expected 2xx/3xx over HTTPS; registry root route returned 200 in the scorecard and `/v2/` remains the normal unauthenticated 401 pattern from earlier checks. This proves ingress/TLS plus current route availability. | Keep as one row in scorecard. | Public route table updated after each reboot. | -| P2-002 | BLOCKED_MOMO_DATA_FRESHNESS | 96 | momo latest/current-month parity and freshness | Latest current-month parity is good: `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`. 
However latest business data is stale: `MOMO_DAILY_FRESHNESS 6|2026-06-17`; Drive pending folder `當日業績匯入` has `0` matching `即時業績_當日` Excel files after token owner repair. | Wait for or obtain a newer legitimate source file, then verify import job `sync_success=true`, archive movement, table bounds, and `MOMO_DAILY_FRESHNESS <= 2`. | Snapshot/current-month row count and bounds match, source folder has no unprocessed stale file, and daily freshness is within threshold. | +| P2-002 | BLOCKED_MOMO_DATA_FRESHNESS | 96 | momo latest/current-month parity and freshness | Latest current-month parity is good: `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`. However latest business data is stale: `MOMO_DAILY_FRESHNESS 7|2026-06-17`; Drive pending folder `當日業績匯入` has `0` matching `即時業績_當日` Excel files after token owner repair, and archive latest `2026-06-18T01:30:39Z` was already imported by job `56`. | Wait for or obtain a newer legitimate source file, then verify import job `sync_success=true`, archive movement, table bounds, and `MOMO_DAILY_FRESHNESS <= 2`. | Snapshot/current-month row count and bounds match, source folder has no unprocessed stale file, and daily freshness is within threshold. | +| P2-008 | DONE | 100 | Separate MOMO service recovery from upstream source absence | 2026-06-24 11:19 readback proves MOMO service is healthy (`V10.639`), DB parity is good, scheduler container can list Drive, and recent logs have no current token `Permission denied`; the blocker is source-file absence, not service outage. SOP v1.32 records GO/NO-GO rules forbidding old archive re-import, product-export import, truncate, whole-DB restore, or fake freshness. | Keep the stale warning active until a legitimate newer `即時業績_當日` source file appears and imports cleanly. | Operators can say "MOMO service recovered, data pipeline waiting for upstream source file" without calling the full stack green. 
| | P2-003 | VERIFIED | 95 | Fix momo job semantics | `/Users/ogt/momo-pro-system/services/import_service.py` and live `/home/ollama/momo-pro/services/import_service.py` now mark monthly sync failure as `failed`, write `drive_file_movable=false`, return `False`, emit a failure alert path, and make auto-import aggregate failures as `success=false`. Live 188 backup: `services/import_service.py.bak.20260604-152827`; live hash after patch: `3fc45671986fa4cc155119f588bc1ebefd272927730052e42e2b9eb4352b2586`. | Watch the next real Google Drive import and confirm no file moves unless both tables sync; keep canonical source-control reconciliation open as a separate supply-chain task. | Live isolated temp-DB/real-Excel test passes; containers reloaded healthy; Telegram token/chat markers are present without exposing secrets; latest DB parity remains 404/404. | | P2-004 | DONE | 100 | PostgreSQL index corruption runbook path | SOP v1.2 now states `posting list tuple ... cannot be split` is an index repair incident. | Use only concurrent reindex if the error returns. | No truncate, no whole DB restore; `REINDEX TABLE CONCURRENTLY public.realtime_sales_monthly;` and idempotent resync evidence recorded. | | P2-005 | VERIFIED | 100 | Do not rely on route 200 only | 2026-06-12 closeout has route + DB + backup + offsite + schedule + alert + K3s + cold-start scorecard evidence. The only remaining blocker is DR credential escrow, outside service availability. | Keep this cross-surface checklist mandatory after every reboot. | Each reboot record has route, DB, backup, schedules, alert, scorecard rows. | @@ -177,7 +178,7 @@ Next: | P3-005 | DONE | 100 | Update cold-start SOP | SOP now includes start, shutdown, reboot, record, comparison, and 120 blocker handling. | Increment SOP version after each process change. | SOP has controlled power-operation sections and ledger template. 
| | P3-006 | DONE | 100 | Update backup status | Backup status now reflects current cron, rclone latest-only, failure-only alert posture, and escrow blocker. | Refresh after 120 backup rerun. | Backup status no longer claims noisy success Telegram notifications. | | P3-007 | DONE | 100 | Harden Gitea backup stale dump handling | 2026-06-05 manual Gitea backup failed because the container retained `/tmp/gitea-dump.zip` from the 02:00 failure. `scripts/backup/backup-gitea.sh` now renames stale container dump files to timestamped evidence before running a new dump, and the live 110 script is updated. | Watch the next 02:00 Gitea backup. | `bash -n` passes locally and on 110; manual Gitea backup completed after stale evidence rename. | -| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.31 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, stale-vs-active K8s failed Job classification, post-reboot / post-CD recovery anchors, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, 110 runaway browser / CI load diversion playbook, healthy-heartbeat suppression, 188 node-exporter restore, 188 DB/Redis exporter restore, 188 MinIO/Velero restore, 110 Docker disk cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, and post-reboot notification noise gates.
| Use v1.31 for the next reboot record, then compare actual timing, Plan B trigger, degraded level, failed/stale/active Job counters, runaway-process metrics, CI load attribution, MOMO source availability, data freshness, Velero freshness, exporter scrape, disk usage, notification-noise state, and blockers against §1.4 plus §11.1 / §14.8 through §14.30. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow / runaway-process / notification-noise checks. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, and `B5_DR_COMPLETE`; live cold-start now returns `PASS=86 WARN=0 BLOCKED=1` when MOMO data freshness is stale, and repeated healthy/same-failure notification noise is controlled without hiding real alerts. | +| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.32 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, stale-vs-active K8s failed Job classification, post-reboot / post-CD recovery anchors, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, 110 runaway browser / CI load diversion playbook, healthy-heartbeat suppression, 188 node-exporter restore, 188 DB/Redis exporter restore, 188 MinIO/Velero restore, 110 Docker disk cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, post-reboot notification noise gates, and MOMO source-file absence decision gate.
| Use v1.32 for the next reboot record, then compare actual timing, Plan B trigger, degraded level, failed/stale/active Job counters, runaway-process metrics, CI load attribution, MOMO source availability, data freshness, Velero freshness, exporter scrape, disk usage, notification-noise state, and blockers against §1.4 plus §11.1 / §14.8 through §14.31. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow / runaway-process / notification-noise / MOMO source-file checks. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, and `B5_DR_COMPLETE`; live cold-start now returns `PASS=86 WARN=0 BLOCKED=1` when MOMO data freshness is stale because the source file is absent, and repeated healthy/same-failure notification noise is controlled without hiding real alerts. | | P3-009 | DONE | 100 | Assess 120/121 AA/AS role and host load balancing | 2026-06-12 15:19 live check confirms 120 and 121 are both `Ready control-plane`, `k3s active`, `k3s-agent inactive`, with no taints; however, most AWOOOI / ArgoCD / Velero workload remains on 121 after 120 fsck recovery. New assessment defines control-plane AA vs workload AA, migration candidates from 110/188, and stateful migration blockers. | After P0 backup/offsite/cold-start green, implement topology spread for AWOOOI API/Web before moving additional services. | `docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md` exists; SOP v1.6 links AA/AS and load-balancing checks; migration implementation remains explicitly `0%`.
| | P3-010 | DONE | 100 | Update workload balancing docs with 2026-06-13 live truth | Host role assessment, workplan, SOP, backup status, and LOGBOOK are refreshed with current cold-start, backup, 188 certbot degraded, ArgoCD `km-vectorize` degraded, Gitea main `acaae999`, ArgoCD sync, and final pod placement evidence. | Keep updating this file after the next reboot or deploy. | Docs separate service-green status from DR escrow, workload rollout, and non-service governance debt. | | P3-011 | DONE | 100 | Record `km-vectorize` remediation status | LOGBOOK, this workplan, and SOP now state the schedule/label fix, ArgoCD sync evidence, the invalid manual Job boundary, and the 90% waiting-for-next-schedule gate. | After next 03:00 run, update this row and the top verdict with `lastSuccessfulTime` / ArgoCD health evidence. | No document claims ArgoCD green before official CronJob success evidence exists. |