@@ -11,13 +11,13 @@
| Area | Status | Completion | Evidence |
|------|--------|------------|----------|
| Overall recovery readiness | SERVICE_AVAILABLE_MOMO_SOURCE_BLOCKED_DR_ESCROW_BLOCKED | 98% | 2026-06-24 22:16 live cold-start returned `PASS=86 WARN=0 BLOCKED=1`, result `BLOCKED` because MOMO business data freshness remains stale. 110 / 120 / 121 / 188 ping and SSH port are OK, K3s `mon` / `mon1` are Ready, public routes/TLS are green, 110 / 188 runtime and backup checks are green. 188 `node-exporter`, PostgreSQL exporter, Redis exporter, `nginx-exporter`, MinIO / Velero BSL are restored; monitoring coverage is now `14/14 UP`; 110 disk pressure cleared. Remaining service blocker is MOMO business data freshness: `MOMO_DAILY_FRESHNESS 7|2026-06-17`; 22:40 scheduler / DB / import metadata read-only evidence confirms Drive listing works from the scheduler container, `import_config` points to `當日業績匯入` / `即時業績_當日`, but recent scheduler runs all report `file_count=0` and no newer legitimate source file exists. 2026-06-24 22:17 confirms MOMO `main` and Gitea Actions `cd.yaml #904` deployed `84035906aba0`, so monthly sync failure now fails the import job and prevents Drive file movement in production. DR remains blocked because credential escrow evidence markers are still missing and must not be forged. |
| Overall recovery readiness | SERVICE_AVAILABLE_MOMO_SOURCE_BLOCKED_DR_ESCROW_BLOCKED | 98% | 2026-06-24 23:33 live cold-start returned `PASS=88 WARN=0 BLOCKED=1`, result `BLOCKED` because MOMO business data freshness remains stale and no newer legitimate source file is present. 110 / 120 / 121 / 188 ping and SSH port are OK, K3s `mon` / `mon1` are Ready, public routes/TLS are green, 110 / 188 runtime and backup checks are green. 188 `node-exporter`, PostgreSQL exporter, Redis exporter, `nginx-exporter`, MinIO / Velero BSL are restored; monitoring coverage is now `14/14 UP`; 110 disk pressure cleared. Remaining service blocker is MOMO business data freshness: `MOMO_DAILY_FRESHNESS 7|2026-06-17`; 23:33 cold-start plus scheduler / DB / import metadata read-only evidence confirms Drive listing works from the scheduler container, `import_config` points to `當日業績匯入` / `即時業績_當日`, but recent scheduler runs report `file_count=0` and no newer legitimate source file exists. 2026-06-24 22:17 confirms MOMO `main` and Gitea Actions `cd.yaml #904` deployed `84035906aba0`, so monthly sync failure now fails the import job and prevents Drive file movement in production. DR remains blocked because credential escrow evidence markers are still missing and must not be forged. |
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-14 18:15 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, and cold-start P0/P1 checks are green. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-24 21:33 backup / alert readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`. 188 `node-exporter` textfile scrape, PostgreSQL exporter, Redis exporter, `nginx-exporter`, MinIO endpoint, Velero BSL and latest completed backup freshness are restored; monitoring coverage is `14/14 UP`; `BackupHealthMonitorMissing188`, `PostgreSQLDown`, `RedisDown`, `VeleroBackupNotRun` and 110 disk-pressure alerts resolved. DR remains blocked on real non-secret credential escrow evidence IDs. |
| P2 service / data truth | BLOCKED_MOMO_DATA_FRESHNESS | 98% | Public route/TLS, API/Web route, momo health `V10.653`, MOMO main / CD `#904` commit `84035906aba0e5e190d031a13cfd9b47a8cd1f73`, 188 live import-boundary source marker, current-month parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, backup exporters, schedules, K3s node readiness/storage conditions, VIP, and 110 / 188 runtime health are green. Mac Mini / MacBook Pro controlled MOMO workspaces both point to the same codex branch commit. MOMO latest business date remains `2026-06-17`; stale age is `7` days as of 22:40. Drive pending folder has `0` matching files in repeated scheduler checks; scheduler stats show `file_count=0` / `imported_count=0` for repeated AutoImport runs; latest valid job `56` already imported `即時業績_當日.xlsx` with `sync_success=true` and bounds `2026-06-01..2026-06-17`; Mac Mini / MacBook candidate files are old or header-only, so there is no safe newer source to import. |
| P3 docs / automation contracts | DONE_WITH_MOMO_SOURCE_ABSENCE_GATE_V142_REPO_ONLY | 100% | Workplan, SOP v1.43, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, 188 DB/Redis exporter restore helper, 188 MinIO/Velero restore helper, 188 nginx-exporter restore helper, 110 Docker disk pressure cleanup boundary, MOMO Google Drive token userns readback, MOMO daily freshness blocker, MOMO Pro false-noise health monitor source-of-truth, docker-health direct Telegram fallback cooldown, Bitan public-content same-fingerprint cooldown, notification-noise readback, MOMO source-file absence GO/NO-GO gate with scheduler stats / import_config / job 56 evidence, repo-side cold-start v1.42 source absence classifier, live-sync parity gate, MOMO V10.653 / Gitea main / dual-workstation Codex baseline readback, MOMO import-boundary production deploy, MacBook Pro Codex safe artifact sync readback, and MacBook Pro AwoooGo Gitea SSH / dev workspace readback are updated. Latest deploy marker `622bc372` points runtime image to `2ec7f6f4`; CD `#3294` retains a historical Failure because post-deploy monitoring coverage saw 188 `nginx-exporter` down before recovery, while manual coverage now passes `14/14 UP`. 2026-06-24 23:15 read-only verify shows repo cold-start hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`; live 110 script sync of the v1.42 classifier is not claimed until separately approved and recorded. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-24 23:33 backup / alert readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`. 188 `node-exporter` textfile scrape, PostgreSQL exporter, Redis exporter, `nginx-exporter`, MinIO endpoint, Velero BSL and latest completed backup freshness are restored; monitoring coverage is `14/14 UP`; `BackupHealthMonitorMissing188`, `PostgreSQLDown`, `RedisDown`, `VeleroBackupNotRun` and 110 disk-pressure alerts resolved. DR remains blocked on real non-secret credential escrow evidence IDs. |
| P2 service / data truth | BLOCKED_MOMO_DATA_FRESHNESS | 98% | Public route/TLS, API/Web route, momo health `V10.653`, MOMO main / CD `#904` commit `84035906aba0e5e190d031a13cfd9b47a8cd1f73`, 188 live import-boundary source marker, current-month parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, backup exporters, schedules, K3s node readiness/storage conditions, VIP, and 110 / 188 runtime health are green. Mac Mini / MacBook Pro controlled MOMO workspaces both point to the same codex branch commit. MOMO latest business date remains `2026-06-17`; stale age is `7` days as of 23:33. Drive pending folder has `0` matching files in repeated scheduler checks; scheduler stats show `file_count=0` / `imported_count=0` for repeated AutoImport runs; latest valid job `56` already imported `即時業績_當日.xlsx` with `sync_success=true` and bounds `2026-06-01..2026-06-17`; Mac Mini / MacBook candidate files are old or header-only, so there is no safe newer source to import. |
| P3 docs / automation contracts | DONE_WITH_MOMO_SOURCE_ABSENCE_GATE_V142_REPO_ONLY | 100% | Workplan, SOP v1.44, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, 188 DB/Redis exporter restore helper, 188 MinIO/Velero restore helper, 188 nginx-exporter restore helper, 110 Docker disk pressure cleanup boundary, MOMO Google Drive token userns readback, MOMO daily freshness blocker, MOMO Pro false-noise health monitor source-of-truth, docker-health direct Telegram fallback cooldown, Bitan public-content same-fingerprint cooldown, notification-noise readback, MOMO source-file absence GO/NO-GO gate with scheduler stats / import_config / job 56 evidence, repo-side cold-start v1.42 source absence classifier, live-sync parity gate, MOMO V10.653 / Gitea main / dual-workstation Codex baseline readback, MOMO import-boundary production deploy, MacBook Pro Codex safe artifact sync readback, and MacBook Pro AwoooGo Gitea SSH / dev workspace readback are updated. Latest deploy marker `622bc372` points runtime image to `2ec7f6f4`; CD `#3294` retains a historical Failure because post-deploy monitoring coverage saw 188 `nginx-exporter` down before recovery, while manual coverage now passes `14/14 UP`. 2026-06-24 23:15 read-only verify still shows repo cold-start hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`; live 110 script sync of the v1.42 classifier is not claimed until separately approved and recorded. |
|
|
|
|
|
|
|
|
|
|
Full cold-start service readiness may not be declared green for the latest verified evidence set. As of 2026-06-24 22:40, routes/hosts/K3s/backups/exporters/Velero/monitoring coverage are available, and MOMO production code has the import-boundary fix, but the latest live cold-start scorecard remains `PASS=86 WARN=0 BLOCKED=1` because MOMO business data freshness is stale beyond 3 days and no newer legitimate source file is available. The 23:04 repo-side v1.42 dry-run now returns `PASS=88 WARN=0 BLOCKED=1` and names the blocker as `188 momo source file absent while daily sales data stale`; this is repo-side source-of-truth evidence and not yet a claim that the live 110 script was deployed. Do not declare DR scorecard complete while credential escrow evidence remains blocked.
|
|
|
|
|
Full cold-start service readiness may not be declared green for the latest verified evidence set. As of 2026-06-24 23:33, routes/hosts/K3s/backups/exporters/Velero/monitoring coverage are available, and MOMO production code has the import-boundary fix, but the latest repo-side live read-only cold-start scorecard remains `PASS=88 WARN=0 BLOCKED=1` because MOMO business data freshness is stale beyond 3 days and no newer legitimate source file is available. The blocker is explicitly `188 momo source file absent while daily sales data stale`; this is repo-side source-of-truth evidence and not yet a claim that the 110 live monitor script was deployed. Do not declare DR scorecard complete while credential escrow evidence remains blocked.
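
The verdict rule above can be sketched as a small check. This is only an illustrative sketch, not the 110 cold-start detector script: the function names, the `max_stale_days=3` default (taken from the "stale beyond 3 days" wording), and the choice to require service green before DR complete are all assumptions made for the example.

```python
import re
from datetime import date


def parse_scorecard(line: str) -> dict[str, int]:
    """Extract PASS / WARN / BLOCKED counters from a scorecard summary string."""
    return {key: int(value) for key, value in re.findall(r"\b(PASS|WARN|BLOCKED)=(\d+)", line)}


def stale_days(latest_business_date: date, as_of: date) -> int:
    """Age of the latest business date in whole days."""
    return (as_of - latest_business_date).days


def verdict(scorecard: str, latest_business_date: date, as_of: date,
            escrow_missing: int, max_stale_days: int = 3) -> dict[str, bool]:
    """Hypothetical gate: never report green while any blocker remains."""
    counts = parse_scorecard(scorecard)
    age = stale_days(latest_business_date, as_of)
    # Assumption: a missing BLOCKED counter is treated as blocked, not as zero.
    service_green = counts.get("BLOCKED", 1) == 0 and age <= max_stale_days
    # Assumption: DR complete requires service green plus zero missing escrow markers.
    dr_complete = service_green and escrow_missing == 0
    return {"service_green": service_green, "dr_complete": dr_complete}


# Values from the 2026-06-24 23:33 evidence set in this workplan.
result = verdict("PASS=88 WARN=0 BLOCKED=1", date(2026, 6, 17), date(2026, 6, 24), escrow_missing=5)
print(result)
```

With the current evidence set both flags stay false: one blocker remains, the data is 7 days old, and 5 escrow markers are missing.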
|
|
|
|
|
|
|
|
|
|
2026-06-13 01:26 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing survived the next normal CD deploy: Gitea main `e4a349bc`, ArgoCD revision `e4a349bc`, images from `414413a5`, API/Web split across `mon` / `mon1`, and global `known_hosts` retained 120 / 188 after CD fix `80e6ec1a`. Do not declare DR complete while credential escrow is missing. `km-vectorize` remediation is `90%`: schedule/label fix is live, and the remaining gate is the next official 03:00 CronJob success readback.
|
|
|
|
|
|
|
|
|
|
@@ -179,7 +179,7 @@ Next: <single next action>
| P3-005 | DONE | 100 | Update cold-start SOP | SOP now includes start, shutdown, reboot, record, comparison, and 120 blocker handling. | Increment SOP version after each process change. | SOP has controlled power-operation sections and ledger template. |
| P3-006 | DONE | 100 | Update backup status | Backup status now reflects current cron, rclone latest-only, failure-only alert posture, and escrow blocker. | Refresh after 120 backup rerun. | Backup status no longer claims noisy success Telegram notifications. |
| P3-007 | DONE | 100 | Harden Gitea backup stale dump handling | 2026-06-05 manual Gitea backup failed because the container retained `/tmp/gitea-dump.zip` from the 02:00 failure. `scripts/backup/backup-gitea.sh` now renames stale container dump files to timestamped evidence before running a new dump, and the live 110 script is updated. | Watch the next 02:00 Gitea backup. | `bash -n` passes locally and on 110; manual Gitea backup completed after stale evidence rename. |
| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.43 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, stale-vs-active K8s failed Job classification, post-reboot / post-CD recovery anchors, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, 110 runaway browser / CI load diversion PlayBook, healthy-heartbeat suppression, 188 node-exporter restore, 188 DB/Redis exporter restore, 188 MinIO/Velero restore, 188 nginx-exporter restore, 110 Docker disk cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, post-reboot notification noise gates, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side scorecard source-absence classifier, 110 live-sync parity gate, and CD monitoring coverage target-down classification. | Use v1.43 for the next reboot record, then compare actual timing, Plan B trigger, degraded level, failed/stale/active Job counters, runaway-process metrics, CI load attribution, MOMO source availability, data freshness, Velero freshness, exporter scrape, disk usage, notification-noise state, monitoring coverage, and blockers against §1.4 plus §11.1 / §14.8 through §14.32. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow / runaway-process / notification-noise / MOMO source-file / monitoring coverage checks. If using the live 110 script, record its hash and do not assume repo-side v1.42 behavior until synced under approval and deploy parity passes. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, and `B5_DR_COMPLETE`; live cold-start returns `PASS=86 WARN=0 BLOCKED=1` for the current evidence set, repo-side v1.42 dry-run returns `PASS=88 WARN=0 BLOCKED=1` with blocker `188 momo source file absent while daily sales data stale`, and 23:15 deploy parity correctly blocks live-sync claims until hash parity is restored; repeated healthy/same-failure notification noise is controlled without hiding real alerts, and monitoring coverage target-down is routed through exporter restore before any product restart. |
| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.44 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, stale-vs-active K8s failed Job classification, post-reboot / post-CD recovery anchors, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, 110 runaway browser / CI load diversion PlayBook, healthy-heartbeat suppression, 188 node-exporter restore, 188 DB/Redis exporter restore, 188 MinIO/Velero restore, 188 nginx-exporter restore, 110 Docker disk cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, post-reboot notification noise gates, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side scorecard source-absence classifier, 110 live-sync parity gate, and CD monitoring coverage target-down classification. | Use v1.43 for the next reboot record, then compare actual timing, Plan B trigger, degraded level, failed/stale/active Job counters, runaway-process metrics, CI load attribution, MOMO source availability, data freshness, Velero freshness, exporter scrape, disk usage, notification-noise state, monitoring coverage, and blockers against §1.4 plus §11.1 / §14.8 through §14.32. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow / runaway-process / notification-noise / MOMO source-file / monitoring coverage checks. If using the live 110 script, record its hash and do not assume repo-side v1.42 behavior until synced under approval and deploy parity passes. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, and `B5_DR_COMPLETE`; live cold-start returns `PASS=86 WARN=0 BLOCKED=1` for the current evidence set, repo-side v1.42 dry-run returns `PASS=88 WARN=0 BLOCKED=1` with blocker `188 momo source file absent while daily sales data stale`, and 23:15 deploy parity correctly blocks live-sync claims until hash parity is restored; repeated healthy/same-failure notification noise is controlled without hiding real alerts, and monitoring coverage target-down is routed through exporter restore before any product restart. |
| P3-009 | DONE | 100 | Assess 120/121 AA/AS role and host load balancing | 2026-06-12 15:19 live check confirms 120 and 121 are both `Ready control-plane`, `k3s active`, `k3s-agent inactive`, with no taints; however most AWOOOI / ArgoCD / Velero workload remains on 121 after 120 fsck recovery. New assessment defines control-plane AA vs workload AA, migration candidates from 110/188, and stateful migration blockers. | After P0 backup/offsite/cold-start green, implement topology spread for AWOOOI API/Web before moving additional services. | `docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md` exists; SOP v1.6 links AA/AS and load-balancing checks; migration implementation remains explicitly `0%`. |
| P3-010 | DONE | 100 | Update workload balancing docs with 2026-06-13 live truth | Host role assessment, workplan, SOP, backup status, and LOGBOOK are refreshed with current cold-start, backup, 188 certbot degraded, ArgoCD `km-vectorize` degraded, Gitea main `acaae999`, ArgoCD sync, and final pod placement evidence. | Keep updating this file after the next reboot or deploy. | Docs separate service-green status from DR escrow, workload rollout, and non-service governance debt. |
| P3-011 | DONE | 100 | Record `km-vectorize` remediation status | LOGBOOK, this workplan, and SOP now state the schedule/label fix, ArgoCD sync evidence, the invalid manual Job boundary, and the 90% waiting-for-next-schedule gate. | After next 03:00 run, update this row and the top verdict with `lastSuccessfulTime` / ArgoCD health evidence. | No document claims ArgoCD green before official CronJob success evidence exists. |