From 6b9a09a01a340c7d4de14063580dc330782bc4a8 Mon Sep 17 00:00:00 2001
From: ogt
Date: Wed, 24 Jun 2026 23:20:40 +0800
Subject: [PATCH] docs(ops): record cold-start monitor live-sync gate [skip ci]

---
 docs/LOGBOOK.md | 18 ++++++++
 docs/runbooks/BACKUP-STATUS.md | 2 +
 docs/runbooks/FULL-STACK-COLD-START-SOP.md | 46 ++++++++++++++++++-
 ...oot-cold-start-backup-recovery-workplan.md | 6 +--
 4 files changed, 67 insertions(+), 5 deletions(-)

diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 1d11cd4e..d44fbbb8 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -1,3 +1,21 @@
+## 2026-06-24|23:15 110 cold-start monitor live-sync gate readback
+
+**Background**: At 23:04 the MOMO source absence classifier was added to repo-side cold-start v1.42, but that does not mean `/home/wooo/scripts/full-stack-cold-start-check.sh` on 110 has been updated. To keep the operator from misjudging the next reboot based on output from the old live 110 script, this round only performs read-only verification of deploy parity and strengthens the SOP gate.
+
+**Read-only evidence**:
+- Repo-side authoritative script hash: `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05`.
+- 110 live `/home/wooo/scripts/full-stack-cold-start-check.sh` hash: `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`, mtime `2026-06-24 02:45`.
+- Command type: read-only SSH hash / parity check; no `scp`, no cron change, no textfile refresh, no service restart.
+- `bash scripts/reboot-recovery/verify-cold-start-monitor-deploy.sh` correctly returns `BLOCKED full-stack-cold-start-check.sh hash mismatch local=f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05 remote=10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`.
+- Existing apply path is `bash scripts/reboot-recovery/install-cold-start-monitor-110.sh`; it performs `scp`, `chmod`, crontab replacement, and immediate textfile exporter refresh, so it is a live write and remains blocked until an explicit maintenance-window / owner approval.
+
+**Verdict**:
+- May claim: repo-side v1.42 classifier exists and the deploy parity guard correctly prevents false live-sync claims.
+- May not claim: 110 live cold-start monitor already emits v1.42 MOMO source-absence fields, full-stack green, MOMO data current, or DR complete.
+- Minimal gate to complete live-sync: approved maintenance window -> run `bash scripts/reboot-recovery/install-cold-start-monitor-110.sh` -> run `bash scripts/reboot-recovery/verify-cold-start-monitor-deploy.sh` until hash parity OK -> run `/home/wooo/scripts/full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1` and confirm v1.42 fields appear.
+
+**Boundary**: This round made no host writes, performed no Docker / Nginx / firewall / K8s / ArgoCD operations, made no Wazuh / SOC changes, used no passwords from chat, and neither read nor stored any secret.
+
 ## 2026-06-24|23:04 MOMO source absence cold-start gate v1.42
 
 **Background**: The 22:40 readback confirmed that the root cause of MOMO staleness is upstream source absence, but the existing cold-start scorecard only shows `MOMO_DAILY_FRESHNESS 7|2026-06-17`, so the operator still has to look back through LOGBOOK to learn that the service / DB / scheduler / import config is not broken. This round adds that classification directly to the repo-side cold-start script and machine-readable baseline, so the next reboot verdict can show the missing source file in the scorecard itself.
diff --git a/docs/runbooks/BACKUP-STATUS.md b/docs/runbooks/BACKUP-STATUS.md
index bf4da13f..6868af8e 100644
--- a/docs/runbooks/BACKUP-STATUS.md
+++ b/docs/runbooks/BACKUP-STATUS.md
@@ -13,6 +13,7 @@
 > 2026-06-24 22:17 Codex backup readback: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`; MOMO import-boundary fix is production-deployed, but full-stack remains blocked by MOMO data freshness.
 > 2026-06-24 22:40 Codex MOMO source readback: scheduler / DB / import metadata confirm the full-stack blocker is missing upstream source data, not backup freshness; no manual import or Drive write was performed.
 > 2026-06-24 23:04 Codex cold-start gate refresh: repo-side v1.42 dry-run now emits MOMO source-absence evidence and blocks with `188 momo source file absent while daily sales data stale`; backup/offsite remains green and live 110 script deployment is not claimed.
+> 2026-06-24 23:15 Codex live-sync gate readback: read-only deploy parity check correctly blocks because repo cold-start hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`; installer remains a live write requiring explicit approval.
 
 ---
 
@@ -45,6 +46,7 @@ Read-only command: `/backup/scripts/backup-status.sh --no-notify --no-refresh` f
 - `import_config` remains `gdrive_folder_path=當日業績匯入` and `gdrive_file_pattern=即時業績_當日`.
 - Latest valid import job `56` already completed with `sync_success=true` and bounds `2026-06-01..2026-06-17`.
 - Repo-side cold-start v1.42 dry-run emits `MOMO_SOURCE_EMPTY_EVIDENCE_LINES 21`, `MOMO_IMPORT_CONFIG 當日業績匯入|即時業績_當日`, `MOMO_LATEST_IMPORT_JOB 56|completed|即時業績_當日.xlsx|2026-06-18T11:41:00.853176|2026-06-18T11:42:02.309425|10936|10936|0` and keeps the only hard blocker as source absence.
+- 110 live monitor deployment is intentionally not claimed: `verify-cold-start-monitor-deploy.sh` reports repo hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` vs live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`.
 - Therefore backup/offsite remains green while service full-green remains blocked by business data source absence. Do not run backup restore or DB restore to solve this symptom.
 
 ---
diff --git a/docs/runbooks/FULL-STACK-COLD-START-SOP.md b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
index f446a2cb..7b5bd496 100644
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -1,6 +1,6 @@
 # AWOOOI Full-Stack Cold-Start and Host Reboot SOP
 
-> Version: v1.42
+> Version: v1.43
 > Last updated: 2026-06-24 Asia/Taipei
 > Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
 
@@ -16,6 +16,7 @@
 Repo-side reboot SOP / Plan B / automation contracts: COMPLETE, 100%.
 Live cold-start read-only check: 2026-06-24 22:16 PASS=86 WARN=0 BLOCKED=1, Result=BLOCKED.
 Repo-side cold-start v1.42 dry-run: 2026-06-24 23:04 PASS=88 WARN=0 BLOCKED=1 against live read-only targets. New MOMO fields are MOMO_SOURCE_EMPTY_EVIDENCE_LINES=21, MOMO_IMPORT_CONFIG=當日業績匯入|即時業績_當日, MOMO_LATEST_IMPORT_JOB=56|completed|即時業績_當日.xlsx|2026-06-18T11:41:00.853176|2026-06-18T11:42:02.309425|10936|10936|0. The only BLOCKED text is now "188 momo source file absent while daily sales data stale". Live 110 script sync is not claimed until a separate approved deployment/sync happens.
+110 live-sync parity: 2026-06-24 23:15 read-only `verify-cold-start-monitor-deploy.sh` correctly BLOCKED because repo script hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`. Do not use live 110 monitor output to prove v1.42 behavior until the approved live-sync gate in §13.3.1 passes.
 Service state: SERVICE_AVAILABLE_MOMO_SOURCE_BLOCKED_DR_ESCROW_BLOCKED; 110/120/121/188 reachable, K3s mon/mon1 Ready, ArgoCD awoooi-prod Synced/Healthy at revision 7db7800e399caed5487a705c81ec993dec76c70f, public routes/TLS green, 110/188 backup health fresh, 188 node-exporter / PostgreSQL exporter / Redis exporter restored, 188 MinIO endpoint and Velero BackupStorageLocation restored, 110 disk pressure cleared.
 Runtime release state: API/Web/Worker are ready; latest deployment marker 622bc372 points runtime image to 2ec7f6f4 and production API health returns healthy. 21:33 redirect-followed route batch shows awoooi web=200, awoooi API=200, vibework=200, awoooogo=200, momo health=200, stock=200, bitan=200, gitea=200, harbor=200, sentry=200, signoz=200, langfuse=200, registry /v2=401; cold-start raw route gate still records expected redirect statuses such as awoooi web=307, momo web=302, sentry=302. CD #3294 still has a historical Failure record because post-deploy monitoring coverage saw 188 nginx-exporter down before the exporter restore.
 MOMO release state: mo.wooo.work health is healthy on version V10.653. Gitea main fast-forwarded to 84035906aba0e5e190d031a13cfd9b47a8cd1f73 and Gitea Actions cd.yaml #904 completed Success. 188 live source contains the production marker `def _table_columns`, `業績分析儀表板同步失敗`, and `保留來源檔案等待重試,不移動 Google Drive 檔案`, proving the import-boundary fix is deployed. Mac Mini and MacBook Pro controlled Codex workspaces are both on branch codex/momo-current-main-dev-base-20260624 at commit 84035906aba0e5e190d031a13cfd9b47a8cd1f73 with dirty=0.
@@ -1032,6 +1033,8 @@ After recovery, host 110 should run the same gate as a node-exporter textfile mo
 bash scripts/reboot-recovery/install-cold-start-monitor-110.sh
 ```
 
+This command is not read-only. It copies scripts to 110, rewrites the marked `wooo` crontab block, and immediately refreshes the textfile metric. Run it only inside an approved maintenance window or explicit owner-approved live-sync change.
+
 This installs two scripts under `/home/wooo/scripts/`, adds a marked user-cron block, and writes:
 
 - `/home/wooo/node_exporter_textfiles/cold_start_recovery.prom`
@@ -1049,6 +1052,45 @@ The cron path uses `--monitor-read-only`, so it does not POST Alertmanager smoke
 
 Prometheus rules in `ops/monitoring/alerts-unified.yml` alert when the monitor is missing, stale, blocked, degraded, or has not been green for more than 6 hours.
 
+#### 13.3.1 110 cold-start monitor live-sync gate
+
+Use this gate whenever the repo-side cold-start script changes. This prevents a false-green where repo evidence is newer than the live 110 monitor.
+
+Current read-only evidence, 2026-06-24 23:15 Asia/Taipei:
+
+```text
+Repo script hash: f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05
+110 live script hash: 10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8
+verify result: BLOCKED full-stack-cold-start-check.sh hash mismatch
+```
+
+Read-only verification:
+
+```bash
+bash scripts/reboot-recovery/verify-cold-start-monitor-deploy.sh
+```
+
+Approved apply path, only after maintenance-window / owner approval:
+
+```bash
+bash scripts/reboot-recovery/install-cold-start-monitor-110.sh
+bash scripts/reboot-recovery/verify-cold-start-monitor-deploy.sh
+/home/wooo/scripts/full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1
+```
+
+Completion criteria:
+
+- `verify-cold-start-monitor-deploy.sh` reports hash parity for `full-stack-cold-start-check.sh` and `cold-start-textfile-exporter.sh`.
+- The live 110 cold-start output includes the expected current fields, including `MOMO_SOURCE_EMPTY_EVIDENCE_LINES`, `MOMO_IMPORT_CONFIG`, and `MOMO_LATEST_IMPORT_JOB` while MOMO data freshness remains blocked by source absence.
+- The textfile monitor refreshes without creating alert spam.
+- LOGBOOK records local hash, remote hash, command type, approval reference, and final cold-start result.
+
+NO-GO:
+
+- Do not run the installer as part of routine read-only triage.
+- Do not call repo-side v1.42 deployed on 110 while the hash mismatch remains.
+- Do not patch 110 manually with ad hoc `scp`; use the existing installer or Ansible source-of-truth path under an approved change.
+
 ### 13.4 Script-To-SOP Coverage Map
 
 | Script gate | SOP coverage | Blocks |
@@ -1928,7 +1970,7 @@ Bitan public content: pass -> no failure Telegram; repeated same failure -> cool
 188 momo source file absent while daily sales data stale
 ```
 
-This is repo-side source-of-truth enhancement only. Do not claim the live 110 deployed script has this v1.42 behavior until `/home/wooo/scripts/full-stack-cold-start-check.sh` is synced under an approved change and its hash/readback is recorded.
+This is repo-side source-of-truth enhancement only. 2026-06-24 23:15 read-only deploy parity check proves the live 110 script is still older: repo hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05`, live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`. Do not claim the live 110 deployed script has this v1.42 behavior until `/home/wooo/scripts/full-stack-cold-start-check.sh` is synced under an approved change and its hash/readback is recorded through §13.3.1.
 
 GO / NO-GO:
 
diff --git a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
index 349d461c..8369a2aa 100644
--- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
+++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
@@ -15,7 +15,7 @@
 | P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-14 18:15 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, and cold-start P0/P1 checks are green.
| | P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-24 21:33 backup / alert readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`。188 `node-exporter` textfile scrape、PostgreSQL exporter、Redis exporter、`nginx-exporter`、MinIO endpoint、Velero BSL and latest completed backup freshness are restored; monitoring coverage is `14/14 UP`; `BackupHealthMonitorMissing188`、`PostgreSQLDown`、`RedisDown`、`VeleroBackupNotRun` and 110 disk-pressure alerts resolved. DR remains blocked on real non-secret credential escrow evidence IDs. | | P2 service / data truth | BLOCKED_MOMO_DATA_FRESHNESS | 98% | Public route/TLS, API/Web route, momo health `V10.653`, MOMO main / CD `#904` commit `84035906aba0e5e190d031a13cfd9b47a8cd1f73`, 188 live import-boundary source marker, current-month parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, backup exporters, schedules, K3s node readiness/storage conditions, VIP, and 110 / 188 runtime health are green. Mac Mini / MacBook Pro controlled MOMO workspaces both point to the same codex branch commit. MOMO latest business date remains `2026-06-17`; stale age is `7` days as of 22:40. Drive pending folder has `0` matching files in repeated scheduler checks; scheduler stats show `file_count=0` / `imported_count=0` for repeated AutoImport runs; latest valid job `56` already imported `即時業績_當日.xlsx` with `sync_success=true` and bounds `2026-06-01..2026-06-17`; Mac Mini / MacBook candidate files are old or header-only, so there is no safe newer source to import. 
| -| P3 docs / automation contracts | DONE_WITH_MOMO_SOURCE_ABSENCE_GATE_V142_REPO_ONLY | 100% | Workplan, SOP v1.42, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, 188 DB/Redis exporter restore helper, 188 MinIO/Velero restore helper, 188 nginx-exporter restore helper, 110 Docker disk pressure cleanup boundary, MOMO Google Drive token userns readback, MOMO daily freshness blocker, MOMO Pro false-noise health monitor source-of-truth, docker-health direct Telegram fallback cooldown, Bitan public-content same-fingerprint cooldown, notification-noise readback, MOMO source-file absence GO/NO-GO gate with scheduler stats / import_config / job 56 evidence, repo-side cold-start v1.42 source absence classifier, MOMO V10.653 / Gitea main / dual-workstation Codex baseline readback, MOMO import-boundary production deploy, MacBook Pro Codex safe artifact sync readback, and MacBook Pro AwoooGo Gitea SSH / dev workspace readback are updated. 
Latest deploy marker `622bc372` points runtime image to `2ec7f6f4`; CD `#3294` retains a historical Failure because post-deploy monitoring coverage saw 188 `nginx-exporter` down before recovery, while manual coverage now passes `14/14 UP`. Live 110 script sync of the v1.42 classifier is not claimed until separately approved and recorded. | +| P3 docs / automation contracts | DONE_WITH_MOMO_SOURCE_ABSENCE_GATE_V142_REPO_ONLY | 100% | Workplan, SOP v1.43, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, 188 DB/Redis exporter restore helper, 188 MinIO/Velero restore helper, 188 nginx-exporter restore helper, 110 Docker disk pressure cleanup boundary, MOMO Google Drive token userns readback, MOMO daily freshness blocker, MOMO Pro false-noise health monitor source-of-truth, docker-health direct Telegram fallback cooldown, Bitan public-content same-fingerprint cooldown, notification-noise readback, MOMO source-file absence GO/NO-GO gate with scheduler stats / import_config / job 56 evidence, repo-side cold-start v1.42 source 
absence classifier, live-sync parity gate, MOMO V10.653 / Gitea main / dual-workstation Codex baseline readback, MOMO import-boundary production deploy, MacBook Pro Codex safe artifact sync readback, and MacBook Pro AwoooGo Gitea SSH / dev workspace readback are updated. Latest deploy marker `622bc372` points runtime image to `2ec7f6f4`; CD `#3294` retains a historical Failure because post-deploy monitoring coverage saw 188 `nginx-exporter` down before recovery, while manual coverage now passes `14/14 UP`. 2026-06-24 23:15 read-only verify shows repo cold-start hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`; live 110 script sync of the v1.42 classifier is not claimed until separately approved and recorded. | Full cold-start service readiness may not be declared green for the latest verified evidence set. As of 2026-06-24 22:40, routes/hosts/K3s/backups/exporters/Velero/monitoring coverage are available, and MOMO production code has the import-boundary fix, but the latest live cold-start scorecard remains `PASS=86 WARN=0 BLOCKED=1` because MOMO business data freshness is stale beyond 3 days and no newer legitimate source file is available. The 23:04 repo-side v1.42 dry-run now returns `PASS=88 WARN=0 BLOCKED=1` and names the blocker as `188 momo source file absent while daily sales data stale`; this is repo-side source-of-truth evidence and not yet a claim that the live 110 script was deployed. Do not declare DR scorecard complete while credential escrow evidence remains blocked. @@ -173,13 +173,13 @@ Next: | ID | Status | % | Work item | Fine analysis | Next action | Done criteria | |----|--------|---:|-----------|---------------|-------------|---------------| | P3-001 | VERIFIED | 100 | Confirm hardening commit | Gitea `main` currently points to `0260ec89...`; `git merge-base --is-ancestor ae7b39d9 0260ec89...` returned true. 
| Keep evidence in LOGBOOK. | Gitea main contains `ae7b39d9 fix(ops): harden reboot recovery and backup alerts`. | -| P3-002 | VERIFIED | 100 | Confirm live 110 scripts | All required recovery/check scripts exist under `/home/wooo/scripts/`; cold-start script hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8` is live on 110 after the MOMO freshness gate update. | Record every live script hash change in LOGBOOK and SOP. | Script paths and hashes recorded. | +| P3-002 | VERIFIED_WITH_V142_SYNC_BLOCKED | 100 | Confirm live 110 scripts | All required recovery/check scripts exist under `/home/wooo/scripts/`; cold-start script hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8` is live on 110. Repo-side v1.42 authoritative script hash is `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05`, and `verify-cold-start-monitor-deploy.sh` correctly blocks on the mismatch. | Do not run `install-cold-start-monitor-110.sh` during read-only triage. After explicit maintenance-window / owner approval, run the installer, rerun deploy parity, then rerun the live 110 cold-start monitor and record the new hash. | Script paths and current mismatch are recorded; v1.42 live-sync done criteria remains hash parity plus live scorecard fields. | | P3-003 | DONE | 100 | Reconcile 188 nginx Ansible baseline | Live 188 already routes `aiops.wooo.work` through VIP; the Ansible template matches that route and has no 120 upstream for aiops. `nginx-sync.yml` now also carries the `188-internal-tools-https.conf.j2` source-of-truth path, and `ansible-validate.sh` syntax-check passes with repo-local roles path. | Run only approved dry-run/apply from the normal Ansible environment before changing live nginx. | Template and live config agree; no 120 upstream for aiops; repo-side syntax and readiness contract pass. | | P3-004 | DONE | 100 | Update `docs/LOGBOOK.md` | Live blocker and new docs are recorded. 
| Keep this entry updated after each recovery phase. | LOGBOOK has current recovery status and next actions. | | P3-005 | DONE | 100 | Update cold-start SOP | SOP now includes start, shutdown, reboot, record, comparison, and 120 blocker handling. | Increment SOP version after each process change. | SOP has controlled power-operation sections and ledger template. | | P3-006 | DONE | 100 | Update backup status | Backup status now reflects current cron, rclone latest-only, failure-only alert posture, and escrow blocker. | Refresh after 120 backup rerun. | Backup status no longer claims noisy success Telegram notifications. | | P3-007 | DONE | 100 | Harden Gitea backup stale dump handling | 2026-06-05 manual Gitea backup failed because the container retained `/tmp/gitea-dump.zip` from the 02:00 failure. `scripts/backup/backup-gitea.sh` now renames stale container dump files to timestamped evidence before running a new dump, and the live 110 script is updated. | Watch the next 02:00 Gitea backup. | `bash -n` passes locally and on 110; manual Gitea backup completed after stale evidence rename. 
| -| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.42 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, stale-vs-active K8s failed Job classification, post-reboot / post-CD recovery anchors, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, 110 runaway browser / CI load-diversion PlayBook, healthy-heartbeat suppression, 188 node-exporter restore, 188 DB/Redis exporter restore, 188 MinIO/Velero restore, 188 nginx-exporter restore, 110 Docker disk cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, post-reboot notification noise gates, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side scorecard source-absence classifier, and CD monitoring coverage target-down classification. | Use v1.42 for the next reboot record, then compare actual timing, Plan B trigger, degraded level, failed/stale/active Job counters, runaway-process metrics, CI load attribution, MOMO source availability, data freshness, Velero freshness, exporter scrape, disk usage, notification-noise state, monitoring coverage, and blockers against §1.4 plus §11.1 / §14.8 through §14.32. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow / runaway-process / notification-noise / MOMO source-file / monitoring coverage checks. If using the live 110 script, record its hash and do not assume repo-side v1.42 behavior until synced under approval. 
| SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, and `B5_DR_COMPLETE`; live cold-start returns `PASS=86 WARN=0 BLOCKED=1` for the current evidence set, and repo-side v1.42 dry-run returns `PASS=88 WARN=0 BLOCKED=1` with blocker `188 momo source file absent while daily sales data stale`; repeated healthy/same-failure notification noise is controlled without hiding real alerts, and monitoring coverage target-down is routed through exporter restore before any product restart. | +| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.43 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, stale-vs-active K8s failed Job classification, post-reboot / post-CD recovery anchors, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, 110 runaway browser / CI load-diversion PlayBook, healthy-heartbeat suppression, 188 node-exporter restore, 188 DB/Redis exporter restore, 188 MinIO/Velero restore, 188 nginx-exporter restore, 110 Docker disk cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, post-reboot notification noise gates, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side scorecard source-absence classifier, 110 live-sync parity gate, and CD monitoring coverage target-down classification. 
| Use v1.43 for the next reboot record, then compare actual timing, Plan B trigger, degraded level, failed/stale/active Job counters, runaway-process metrics, CI load attribution, MOMO source availability, data freshness, Velero freshness, exporter scrape, disk usage, notification-noise state, monitoring coverage, and blockers against §1.4 plus §11.1 / §14.8 through §14.32. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow / runaway-process / notification-noise / MOMO source-file / monitoring coverage checks. If using the live 110 script, record its hash and do not assume repo-side v1.42 behavior until synced under approval and deploy parity passes. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, and `B5_DR_COMPLETE`; live cold-start returns `PASS=86 WARN=0 BLOCKED=1` for the current evidence set, repo-side v1.42 dry-run returns `PASS=88 WARN=0 BLOCKED=1` with blocker `188 momo source file absent while daily sales data stale`, and 23:15 deploy parity correctly blocks live-sync claims until hash parity is restored; repeated healthy/same-failure notification noise is controlled without hiding real alerts, and monitoring coverage target-down is routed through exporter restore before any product restart. | | P3-009 | DONE | 100 | Assess 120/121 AA/AS role and host load balancing | 2026-06-12 15:19 live check confirms 120 and 121 are both `Ready control-plane`, `k3s active`, `k3s-agent inactive`, with no taints; however most AWOOOI / ArgoCD / Velero workload remains on 121 after 120 fsck recovery. New assessment defines control-plane AA vs workload AA, migration candidates from 110/188, and stateful migration blockers. 
| After P0 backup/offsite/cold-start green, implement topology spread for AWOOOI API/Web before moving additional services. | `docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md` exists; SOP v1.6 links AA/AS and load-balancing checks; migration implementation remains explicitly `0%`. | | P3-010 | DONE | 100 | Update workload balancing docs with 2026-06-13 live truth | Host role assessment, workplan, SOP, backup status, and LOGBOOK are refreshed with current cold-start, backup, 188 certbot degraded, ArgoCD `km-vectorize` degraded, Gitea main `acaae999`, ArgoCD sync, and final pod placement evidence. | Keep updating this file after the next reboot or deploy. | Docs separate service-green status from DR escrow, workload rollout, and non-service governance debt. | | P3-011 | DONE | 100 | Record `km-vectorize` remediation status | LOGBOOK, this workplan, and SOP now state the schedule/label fix, ArgoCD sync evidence, the invalid manual Job boundary, and the 90% waiting-for-next-schedule gate. | After next 03:00 run, update this row and the top verdict with `lastSuccessfulTime` / ArgoCD health evidence. | No document claims ArgoCD green before official CronJob success evidence exists. |
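
The live-sync gate recorded in this patch comes down to two read-only checks: compare the repo script digest with the live digest, then confirm the expected MOMO fields appear in captured cold-start output. The patch does not show the contents of `verify-cold-start-monitor-deploy.sh`, so the following is only a minimal sketch of that idea. The helper names `file_sha256`, `parity_verdict`, and `missing_fields` are hypothetical, and GNU coreutils `sha256sum` is assumed to be available.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a read-only parity gate. This is NOT the repo's
# verify-cold-start-monitor-deploy.sh; all function names are illustrative.

# Print the SHA-256 hex digest of a file (assumes GNU coreutils sha256sum).
file_sha256() {
  sha256sum "$1" | awk '{print $1}'
}

# Compare two digests and print an OK/BLOCKED verdict in the style of the
# LOGBOOK entry. Returns 0 on parity and 1 on mismatch.
parity_verdict() {
  local name="$1" local_hash="$2" remote_hash="$3"
  if [ "$local_hash" = "$remote_hash" ]; then
    echo "OK $name hash parity local=$local_hash remote=$remote_hash"
    return 0
  fi
  echo "BLOCKED $name hash mismatch local=$local_hash remote=$remote_hash"
  return 1
}

# Report every expected field name that is absent from a captured
# cold-start output file. Returns 0 only when all fields are present.
missing_fields() {
  local output_file="$1"
  shift
  local field missing=0
  for field in "$@"; do
    if ! grep -qF -- "$field" "$output_file"; then
      echo "MISSING $field"
      missing=1
    fi
  done
  return "$missing"
}
```

Under these assumptions, the remote digest would come from a read-only command such as `ssh <host> sha256sum /home/wooo/scripts/full-stack-cold-start-check.sh`, where the host alias depends on the operator's SSH configuration and is not given in this patch. Passing the two hashes recorded at 23:15 to `parity_verdict full-stack-cold-start-check.sh` would print the same `BLOCKED ... hash mismatch` text as the LOGBOOK entry, and `missing_fields` could be run against saved monitor output with `MOMO_SOURCE_EMPTY_EVIDENCE_LINES`, `MOMO_IMPORT_CONFIG`, and `MOMO_LATEST_IMPORT_JOB` as arguments. Neither helper writes to any host.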