ogt 5e4887d15c fix(ops): gate reboot recovery on product freshness [skip ci]

2026-06-25 19:39:42 +08:00

2026-06-04 Reboot / Cold-Start / Backup Recovery Workplan

Owner: SRE / DevOps commander
Timezone: Asia/Taipei
Baseline: 2026-06-04 15:00 live read-only checks. Do not reuse the 2026-05-29 baseline without rerunning checks.
Scope: 110 / 120 / 121 / 188. 112 is Kali and is intentionally excluded from this recovery wave.

1. Current Verdict

Area	Status	Completion	Evidence
Overall recovery readiness	HOST_AND_CORE_SERVICE_GREEN_STOCK_DATA_BLOCKED_DR_ESCROW_BLOCKED	96%	2026-06-25 19:24 full post-start readback showed hosts / K3s / AWOOOI / MOMO / backup / offsite service gates green and `escrow_missing=5`; 2026-06-25 19:35 stricter product-data wrapper returned `POST_START_QUICK_CHECK PASS=31 WARN=1 BLOCKED=1`, result `BLOCKED`, because StockPlatform `/api/v1/system/freshness` is `blocked` with `core_margin_short_daily_missing,ai_recommendations_stale`. Expanded public route smoke covers AWOOOI, VibeWork, AwoooGo, 2026FIFA, Agent Bounty, MOMO, Stock, Bitan, TsenYang, VTuber, Gitea, Harbor, Registry, Sentry, SigNoz, Langfuse, and AIOps; all returned expected 2xx/3xx. MOMO remains fresh through `2026-06-24` with latest job `57` completed cleanly, and Bitan public-content cleanliness direct check passed. Do not declare "all products/data latest" until StockPlatform freshness is `ok`; do not declare DR complete until `escrow_missing=0`.
P0 host / K3s recovery	DONE	100%	120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-25 09:05 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, VIP `192.168.0.125` is present, node filesystem / disk-pressure / readonly events are `0`, and latest `km-vectorize-29705460-55rgs` completed.
P1 backup / alert / escrow	BLOCKED_DR_ESCROW	97%	2026-06-25 19:17 backup readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`, last aggregate `2026-06-25 02:35:09`. 2026-06-25 19:19 offsite escrow report shows script presence OK, rclone configured, full and partial rclone markers present, `PASS=8 WARN=5 BLOCKED=0`, `ESCROW_MISSING_COUNT=5`; DR remains blocked on real non-secret credential escrow evidence IDs.
P2 service / data truth	BLOCKED_STOCK_DATA_FRESHNESS	92%	Service routes and core runtime are available, but product-data truth is not complete. 2026-06-25 19:35 StockPlatform `/api/v1/system/freshness` returned `status=blocked`, `latest_trading_date=2026-06-25`, blockers `core_margin_short_daily_missing,ai_recommendations_stale`; OK sources include price / chips / market index for `2026-06-25`, while `core.margin_short_daily` and `ai.recommendations` stop at `2026-06-24`. MOMO health `V10.690`, current-month parity `15383
P3 docs / automation contracts	DONE_WITH_PRODUCT_DATA_GATE_V157	100%	Workplan, SOP v1.57, one-page post-start quick check v1.2, expanded public route list, StockPlatform freshness gate, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and 2026-06-25 stricter product-data gate are updated. Live 110 script sync remains a separate approved live-write gate; do not claim it here.
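
The P1 verdict above combines several `key=value` counters into one gate decision. A minimal sketch of that logic follows, assuming the status text can be scanned for integer `key=value` tokens; the real output format of `backup-status.sh` is not shown in this workplan, and the verdict labels `BACKUP_BLOCKED`, `OFFSITE_BLOCKED`, and `DR_COMPLETE` are hypothetical names used only for illustration.

```python
import re


def parse_status_fields(line: str) -> dict[str, int]:
    """Extract integer key=value tokens such as core_blockers=0."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}


def dr_verdict(fields: dict[str, int]) -> str:
    """Fail closed: a missing counter is treated like a failing one."""
    if fields.get("core_blockers", 1) != 0 or fields.get("integrity_stale", 1) != 0:
        return "BACKUP_BLOCKED"      # hypothetical label
    if fields.get("offsite_fresh", 0) != 1 or fields.get("rclone_gdrive_fresh", 0) != 1:
        return "OFFSITE_BLOCKED"     # hypothetical label
    if fields.get("escrow_missing", 1) != 0:
        return "DR_ESCROW_BLOCKED"   # DR stays blocked until escrow_missing=0
    return "DR_COMPLETE"             # hypothetical label


# Counter values copied from the 2026-06-25 19:17 readback.
line = ("core_blockers=0 integrity_stale=0 offsite_fresh=1 "
        "rclone_gdrive_fresh=1 escrow_missing=5")
print(dr_verdict(parse_status_fields(line)))  # prints DR_ESCROW_BLOCKED
```

Treating an absent counter as a failure keeps the sketch consistent with the rule that DR must not be declared complete without positive evidence.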

2026-06-25 19:06 post-CD wrapper readback supersedes the 18:53 wording: consecutive main pushes created a deploy storm in which older deploy markers were superseded by later commits. Latest production truth is deploy marker d8ca8224 chore(cd): deploy 9dbe044 [skip ci], ArgoCD Synced / Healthy, and API/Web/Worker image tag 9dbe044ea1e8e3894ccbeb5ed760bb124b87f7be. Direct route smoke returned 200 for AWOOOI API / IwoooS / VibeWork / AwoooGo / MOMO health / Stock / Bitan and the expected route-gate statuses for MOMO / Gitea / Harbor / Registry / Sentry / SigNoz / Langfuse / AIOps; the wrapper reported POST_START_QUICK_CHECK PASS=18 WARN=3 BLOCKED=0. Repo-side cold-start returns PASS=89 WARN=0 BLOCKED=0. /backup/scripts/backup-status.sh --no-notify --no-refresh reports 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, integrity_stale=0, offsite_fresh=1, rclone_gdrive_fresh=1, escrow_missing=5. MOMO dedicated preflight returns PASS=19 WARN=2 BLOCKED=0, and MOMO health is V10.690. AwoooGo / Stock transient 502 reads cleared after upstream warmup, and five consecutive route reads returned 200. 110 load is around 14.51 / 12.34 / 11.42; the visible contributors are Gitea Actions cache save / zstdmt / tar, StockPlatform headless Chrome smoke / CI, Gitea, AWOOOI API, ClickHouse, Docker, and platform services, so the load is not an AWOOOI service blocker. Wrapper result is FULL_STACK_GREEN_DR_ESCROW_BLOCKED, not DEGRADED, because service warnings are 0 and only DR boundary / evidence warnings remain. Wazuh route readback is now 200 disabled_waiting_iwooos_wazuh_owner_gate, but manager registry accepted remains 0, so Wazuh is a security registry evidence blocker rather than a reboot service blocker.

Full cold-start service readiness may now be declared GREEN for the latest verified evidence set. As of 2026-06-25 19:06, routes/hosts/K3s/backups/exporters/monitoring surfaces are available, AWOOOI API is healthy, MOMO service health is V10.690, and MOMO business data is fresh through 2026-06-24. The live read-only cold-start scorecard is PASS=89 WARN=0 BLOCKED=0, the post-start wrapper result is FULL_STACK_GREEN_DR_ESCROW_BLOCKED, AwoooGo / Stock route stability has been rechecked after transient warmup, and final API/Web workload placement is split across mon / mon1. Do not declare DR scorecard complete while credential escrow evidence remains blocked, and do not declare Wazuh registry recovery until manager registry evidence is accepted.

2026-06-25 19:35 stricter product-data gate readback supersedes the earlier "all product data green" interpretation. The full host/cold-start/backup layer remains green from the 19:24 read-only evidence, but the updated quick check now includes StockPlatform /api/v1/system/freshness and therefore blocks on product-data completeness: POST_START_QUICK_CHECK PASS=31 WARN=1 BLOCKED=1, RESULT=BLOCKED, blocker core_margin_short_daily_missing,ai_recommendations_stale. This is a correct no-false-green outcome: stock.wooo.work, /healthz, and /api/healthz all return 200, but StockPlatform data and AI recommendations are not latest. Next action is a separate StockPlatform data freshness remediation lane; do not solve it by host reboot, Nginx reload, Docker restart, or route-only smoke.
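
The no-false-green behaviour described above can be sketched as a small fail-closed check. This assumes the `/api/v1/system/freshness` response is JSON with a `status` field and a `blockers` field (either a list or a comma-separated string); the workplan reports those values but does not show the payload shape, so both field names and the `PASS` / `BLOCKED` return labels are assumptions.

```python
def freshness_gate(payload: dict) -> tuple[str, list[str]]:
    """Return (result, blockers); anything other than a clean 'ok' blocks."""
    status = payload.get("status")
    blockers = payload.get("blockers") or []
    if isinstance(blockers, str):
        # Accept the comma-separated form quoted in the readback.
        blockers = [item for item in blockers.split(",") if item]
    if status == "ok" and not blockers:
        return "PASS", []
    # Fail closed: a missing or unknown status is reported as its own blocker.
    return "BLOCKED", blockers or [f"status={status}"]


# Values copied from the 2026-06-25 19:35 readback; the dict shape is assumed.
payload = {
    "status": "blocked",
    "latest_trading_date": "2026-06-25",
    "blockers": "core_margin_short_daily_missing,ai_recommendations_stale",
}
print(freshness_gate(payload))
```

A route returning 200 never reaches this function as evidence of freshness; only the payload content decides the gate, which is why route-only smoke cannot clear the blocker.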

2026-06-13 01:26 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing survived the next normal CD deploy: Gitea main e4a349bc, ArgoCD revision e4a349bc, images from 414413a5, API/Web split across mon / mon1, and global known_hosts retained 120 / 188 after CD fix 80e6ec1a. Do not declare DR complete while credential escrow is missing. km-vectorize remediation is 90%: schedule/label fix is live, and the remaining gate is the next official 03:00 CronJob success readback.

2. Live Check Evidence, 2026-06-04

Target	Live result	Notes
192.168.0.110	ping OK, SSH port OK	Boot `2026-05-06 12:12`; load was elevated around `10.54 7.42 6.28`; cron and Docker active.
192.168.0.120	ping failed, SSH port failed	ARP incomplete; K3s node `mon` remains `NotReady,SchedulingDisabled`.
192.168.0.121	ping OK, SSH port OK	Boot `2026-05-22 02:30`; `sudo kubectl get nodes` shows `mon1 Ready`.
192.168.0.188	ping OK, SSH port OK	Boot `2026-05-06 12:07`; Docker/PostgreSQL/Redis/nginx active; momo containers healthy.
Cold-start scorecard	BLOCKED_BY_120	2026-06-12 14:47 read-only rerun: `PASS=72 WARN=2 BLOCKED=3`; hard blocks remain 120 reachability / SSH / 120 K3s read-only check.
Public routes	OK ingress only	2026-06-12 14:47: `awoooi`, `aiops`, `mo`, `momo_health`, `gitea`, `harbor`, `registry`, `sentry`, `signoz`, `stock`, `langfuse`, `bitan` returned 2xx/3xx over HTTPS.
momo DB current-month parity	OK	Scorecard reports `4571
110 daily backup cron	OK	`02:00 backup-all`, `03:00 rclone sync`, `06:05 backup-status`, `07:20 full offsite verify`.
Backup freshness	OK with remaining aggregate blocker	2026-06-05 18:40 status: `stale110=none`, `stale188=none`, `configured_missing_188=0`; remaining `core_blockers=6` is 02:00 aggregate failure history plus 120 config capture.
Google Drive latest-only	OK	2026-06-12 14:48 verifier: 13 repos, each `remote snapshots=1`, `REMOTE_LATEST_ONLY_OK=1`, `FULL_MARKER_FRESH=1`, `VERIFY_OK=1`, `FAILED=0`.
Live Prometheus / Alertmanager alert rules	OK	2026-06-12 14:49 `backup-alert-live-visibility-check.py` returned `BACKUP_ALERT_LIVE_VISIBILITY_OK`; all five required backup/cold-start/escrow alerts are visible in Prometheus and Alertmanager.
Credential escrow	BLOCKED	Missing markers: `break_glass_admin_credentials`, `dns_registrar_recovery`, `oauth_ai_provider_recovery`, `offsite_provider_credentials`, `restic_repository_password`.
Config backup capture	BLOCKED until 120 returns	`awoooi_backup_config_capture_ok{target="120-k3s-host-configs"} 0`; critical failed count `1`.
Live 110 script sync	OK	Six recovery/check scripts exist under `/home/wooo/scripts/`; `/home/wooo/scripts/full-stack-cold-start-check.sh` hash is `31321428207308d6c159fabb679d9f1d0848194b8e6d7eb7b04a2c05779ade46` after scheduler detector fix.
Gitea commit evidence	VERIFIED	Gitea `main` at `0260ec89...` contains `ae7b39d9 fix(ops): harden reboot recovery and backup alerts`.
188 nginx Ansible baseline	DONE	Template now pins `aiops.wooo.work` to VIP `192.168.0.125:32334/32335`, contains no `192.168.0.120`, and live smoke returned `https://aiops.wooo.work/` 307 plus `/api/v1/health` 200.
120 failure-domain triage	BLOCKED	19:02 checks from local/110/121/188 all fail to reach 120; 121 reports `Destination Host Unreachable`; K3s node lease renew stopped at `2026-05-21T18:48:36Z`; `120-fsck-maintenance-checklist.sh --no-color` returns `PASS=2 WARN=2 BLOCKED=3`, `MAINTENANCE REQUIRED`.
2026-06-05 backup remediation	BLOCKED with repaired freshness	16:00 live check still had 120 down and `stale110=awoooi_db`; manual backups produced snapshots `b7d5ee4e` (AWOOOI high-frequency DB), `ea641613` (Gitea), `d1147507` (Open-WebUI), `73ead3cc` (ClawBot), `b1161ab8` (AI artifacts). 18:40 backup status: `stale110=none`, `stale188=none`, `core_blockers=6`, `escrow_missing=5`.
2026-06-05 offsite closure	OK partial + full verify	Full sync was correctly skipped by runway gate; partial sync for `awoooi gitea open-webui clawbot ai-artifacts` completed `5/5`; full verifier at 18:39 shows all 13 remote repos `snapshots=1`, `REMOTE_LATEST_ONLY_OK=1`, `VERIFY_OK=1`.
2026-06-06 backup convergence	BLOCKED only by 120/escrow	14:58 backup status: 110 `13/13 fresh failed=1`, 188 `2/2 fresh failed=0`, `stale110=none`, `stale188=none`, `core_blockers=1`, `escrow_missing=5`; the 02:00 aggregate failed only on Configs, due to 120.
2026-06-06 offsite verify	OK	14:46 verifier: all 13 remote repos `snapshots=1`, `REMOTE_LATEST_ONLY_OK=1`, `VERIFY_OK=1`.
2026-06-06 cold-start scorecard	BLOCKED	15:03 read-only rerun: `PASS=71 WARN=3 BLOCKED=3`; hard blocks remain 120 ping / SSH / K3s read-only check. Direct 188 scheduler check still shows `momo-scheduler` healthy and active.
2026-06-12 pre-reboot check	NO-GO until offsite finishes	120 still ping/SSH failed and ARP incomplete; 110->188 SSH host key trust was repaired; 04:11 backup status cleared `stale110=awoooi_db` after daily backup but still has `failed=1/core_blockers=1` due to 120 config capture; the 03:00 offsite sync was still running at 04:10.
2026-06-12 post-reboot recovery	SERVICE_GREEN_WITH_120_BLOCKER	14:47 scorecard: `PASS=72 WARN=2 BLOCKED=3`; 110 failed units `0`, Swap `0B`, public routes/TLS green, momo scheduler and DB parity green, backup/offsite/alert surfaces green except the correct 120 config capture and escrow evidence red gates.
2026-06-12 blocker pursuit	WAITING_EXTERNAL_ACCESS	15:00 four-view 120 check still failed; no WOL/IPMI/vmrun/hypervisor entry found in repo, 110, 121, 188, local tools, or Chronicle-visible console. 15:02 escrow report shows offsite ready with warnings and all five escrow markers missing; no real non-secret evidence ID found in repo.
2026-06-12 120 recovery closeout	SERVICE_GREEN_DR_ESCROW_BLOCKED	120 root fsck was completed from console/initramfs and booted at `15:13`; 15:54 backup-all finished `13/13`; 17:37 full offsite sync finished `13/13`; 18:55 offsite verifier returned `REMOTE_LATEST_ONLY_OK=1`, `VERIFY_OK=1`, `FAILED=0`; 18:55 backup-status shows `core_blockers=0`, `escrow_missing=5`; 18:57 cold-start is `PASS=83 WARN=0 BLOCKED=0`.
2026-06-13 live refresh	SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED	00:13 backup status: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`; 00:33 cold-start exposed 110 `known_hosts` drift for 120 / 188, fixed after backup `/home/wooo/.ssh/known_hosts.before-120-188-refresh.20260613-003416`; 00:34 final cold-start: `PASS=83 WARN=0 BLOCKED=0`; live K3s has `mon` / `mon1` Ready, API/Web are split 120 / 121. 188 host is `degraded` only because `certbot.service` and `snap.certbot.renew.service` failed; ArgoCD remains Degraded because `km-vectorize` CronJob last success is stale. Manual Job `km-vectorize-codex-002709` did not leave verified completion evidence, so this remains open.
2026-06-13 `km-vectorize` health remediation	IN_PROGRESS_92	13:37 live readback: ArgoCD revision `88dc08e5` is `Synced / Degraded`; only unhealthy resource is `CronJob/awoooi-prod/km-vectorize` with message `CronJob has not completed its last execution successfully`. CronJob `lastScheduleTime=2026-06-12T19:00:00Z`, `lastSuccessfulTime=2026-06-04T11:00:37Z`; no 2026-06-13 failed Job is retained because `failedJobsHistoryLimit=0`. GitOps candidate now changes `km-vectorize` to `failedJobsHistoryLimit=3` so future 03:00 failures keep inspectable Job/Pod evidence. Next gate is ArgoCD sync plus the next official 03:00 success readback.
2026-06-13 post-CD trust / workload verification	SERVICE_GREEN_CD_GUARDRAIL_HELD	Gitea main advanced to deploy marker `e4a349bc chore(cd): deploy 414413a [skip ci]`; ArgoCD revision is `e4a349bc`, sync `Synced`, health still `Degraded` only by `km-vectorize` stale success. Live K3s image readback uses `414413a59268eedd391648f112e228716dd05362`; API pods split `mon1` / `mon`, Web pods split `mon` / `mon1`, Worker is single replica on `mon`. 01:28 `/home/wooo/.ssh/known_hosts` mtime remains `2026-06-13 01:20:02 +0800` with 120 / 188 entries present; deploy-specific `/home/wooo/.ssh/deploy_known_hosts` mtime is `01:24:05`, proving CD fix `80e6ec1a` stopped clobbering global trust. 01:26 cold-start: `PASS=83 WARN=0 BLOCKED=0`.
2026-06-13 API placement hardening	IN_PROGRESS	12:43 live refresh showed cold-start `PASS=83 WARN=0 BLOCKED=0`, but API replicas `2/2` were on 120 even though topology spread existed. Root cause: `whenUnsatisfiable=ScheduleAnyway` is a soft preference. GitOps candidate changes API/Web/Worker to `minDomains=2` + `DoNotSchedule`; completion requires ArgoCD sync, rollout readback, public route smoke, and cold-start rerun.
2026-06-13 API rollout strategy hardening	LIVE_VERIFIED	First hard-spread rollout reached ArgoCD revision `17e017f5`; `DoNotSchedule` was live, but API completed with both new pods on 121 because old 120 pods were still terminating during scheduling. Second GitOps rollout reached ArgoCD revision `60f653a0`, API/Web use `maxSurge=0`, `maxUnavailable=1`, `minDomains=2`, `DoNotSchedule`, and both deployments are split `mon` / `mon1`. Public API / governance route smoke passed and 12:59 cold-start returned `PASS=83 WARN=0 BLOCKED=0`.
2026-06-13 security mirror guard closure	LIVE_VERIFIED	Gitea main `b557a4b5` restores `apps/web/messages/en.json` as the required Traditional Chinese mirror of `zh-TW.json`; `security-mirror-progress-guard.py` now passes. ArgoCD revision `b557a4b5` is `Synced / Degraded` only by `km-vectorize`; API/Web/Worker are ready, API pods split `mon` / `mon1`, Web pods split `mon1` / `mon`, public API health is `healthy`, zh/en governance routes are `200`, backup status has `core_blockers=0`, and 13:52 cold-start is `PASS=83 WARN=0 BLOCKED=0`.
2026-06-13 security mirror production image closeout	LIVE_VERIFIED	Gitea main `64ea2444` records the Web rebuild trigger. Deploy marker `2cc02f1c chore(cd): deploy 6cf8d3c [skip ci]` put Web image `6cf8d3ca` live; ArgoCD source revision later advanced to `64ea2444` while Web image correctly remains `6cf8d3ca` because `64ea2444` is docs/changelog only. Public `/zh-TW/governance` and `/en/governance` return `200`, API health is `healthy`, `security-mirror-progress-guard.py` passes, and 14:10 cold-start is `PASS=83 WARN=0 BLOCKED=0`.
2026-06-13 final post-trigger deploy closeout	LIVE_VERIFIED	Deploy marker `834ccdba chore(cd): deploy bf86017 [skip ci]` put API/Web/Worker image `bf860177` live. ArgoCD revision `834ccdba` is `Synced / Degraded` only by `km-vectorize`; routes `/zh-TW/governance` and `/en/governance` return `200`, API health is `healthy`, source guards pass, backup status has `core_blockers=0` and `escrow_missing=5`, and 14:13 cold-start is `PASS=83 WARN=0 BLOCKED=0`.
2026-06-13 final goal audit refresh	SERVICE_GREEN_REMAINING_GATES_EXPLICIT	Clean worktree rebased onto `a520c32d` and reran source guards successfully; live ArgoCD tracks revision `a520c32d` with API/Web/Worker image `e897c8bf`, health `Degraded` only by `km-vectorize`; `km-vectorize` schedule remains `0 3 * * *`, `timeZone=Asia/Taipei`, `failedJobsHistoryLimit=3`, and no failed Job is currently retained. Public `/zh-TW/governance`, `/en/governance`, and `/api/v1/health` are green; backup core blockers remain `0`, `escrow_missing=5`; 14:16 cold-start is `PASS=83 WARN=0 BLOCKED=0`. Remaining gates: five credential escrow markers and next official 03:00 `km-vectorize` success readback.
2026-06-14 `km-vectorize` official run follow-up	DEGRADED_EVIDENCE_RETENTION_LIVE	03:00 official `km-vectorize-29689620` ran from CronJob and failed with `BackoffLimitExceeded`; ArgoCD later auto-synced revision `8868c025` and remains `Synced / Degraded`. Job is retained, but failed Pod `km-vectorize-29689620-nwpqz` was deleted before logs could be read, so root cause remains unproven for this run. Live CronJob is now `restartPolicy: Never` plus `terminationMessagePolicy: FallbackToLogsOnError`, so the next official failure should retain Pod/log evidence. Backup core remains green, `escrow_missing=5`, and 03:11 cold-start is `PASS=81 WARN=2 BLOCKED=0`.
2026-06-14 `km-vectorize` tenant context follow-up	ROOT_CAUSE_CANDIDATE_LIVE	Source audit shows `cron_km_vectorize.py` calls `/api/v1/knowledge/embed-all` without project context, while API middleware and `get_db_context()` require `X-Project-ID` / tenant context for fail-closed RLS. API logs show matching `db_context_missing` / `Missing tenant context` patterns. Deploy marker `ec03f0b7` put image `8ddb80d6` live; CronJob now has `KM_PROJECT_ID=awoooi`, script sends `X-Project-ID`, targeted pytest `7 passed`, and no manual Job was created. Completion still waits for the next official 03:00 success or retained failed Pod/log.
2026-06-14 110 failed-unit cleanup	SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED	`fwupd-refresh.timer` is intentionally `disabled / inactive` after non-runtime firmware metadata refresh failed units were classified; rollback is `sudo systemctl enable --now fwupd-refresh.timer`. `systemctl --failed` now returns `0 loaded units listed`; 08:24 cold-start improved to `PASS=82 WARN=1 BLOCKED=0`. Remaining warning is only K8s failed Job `km-vectorize-29689620`; backup core remains green and `escrow_missing=5`.
2026-06-14 post-CD recovery readback	SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED	Gitea main / ArgoCD revision `18b867c3` synced after deploy marker `18b867c3 chore(cd): deploy e0a6d33 [skip ci]`; API/Web/Worker/CronJob image is `e0a6d339`. API/Web remain split across `mon` / `mon1`, Worker is healthy on `mon`, public routes and TLS pass, backup core remains `0`, escrow missing remains `5`, and 08:40 cold-start remains `PASS=82 WARN=1 BLOCKED=0`. This proves no post-CD reboot recovery regression, but still not full green.
2026-06-14 P2-135 deploy recovery readback	SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED	Gitea main `5bad267e` and ArgoCD revision `5bad267e` are synced after deploy marker `8d575c1a`; API/Web/Worker/CronJob image is `280e0fbe`. API/Web remain split across `mon` / `mon1`, Worker is healthy on `mon1`, backup core remains `0`, escrow missing remains `5`, and 09:27 cold-start rerun is `PASS=82 WARN=1 BLOCKED=0`. 09:26 first run saw transient `stock.wooo.work` `502` while stockplatform-v2 containers were under one minute old; direct route/TLS recheck and scorecard rerun returned `200`. This proves no persistent post-P2-135 recovery regression, but still not full green.
2026-06-14 P2-136 / AI Agent activity post-production-deploy recovery readback	SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED	Latest docs head before this recovery commit is `a0fe7741`; runtime deploy marker / ArgoCD revision `60a0415c` is `Synced / Degraded`, and API/Web/Worker/CronJob image is `a3de0ffb`. API/Web remain split across `mon` / `mon1`, Worker is healthy on `mon1`, backup core remains `0`, escrow missing remains `5`, and 09:56 cold-start is `PASS=82 WARN=1 BLOCKED=0`. This proves no recovery regression after the P2-136 / AI Agent activity production deploy, but still not full green.
2026-06-14 P2-137 / CI smoke timeout recovery readback	SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED	Latest docs head before this recovery commit is `50d4f2ba`; runtime deploy marker `d023f5d7` put image `f737f278` live, and ArgoCD revision `50d4f2ba` is `Synced / Degraded`. API/Web remain split across `mon` / `mon1`, Worker is healthy on `mon`, backup core remains `0`, escrow missing remains `5`, and 10:40 cold-start is `PASS=82 WARN=1 BLOCKED=0`. This proves no recovery regression after the P2-137 / CI smoke timeout fix, but still not full green.
2026-06-14 P2-143 owner response preflight recovery readback	SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED	Latest docs baseline is `b09eb1c6`; runtime deploy marker `667d6329` put image `755b0a8d3038df2c52dee280067863d92db1eda5` live, and ArgoCD revision `4abf0c0f750254d3c7137eae049abdfd99630f5f` is `Synced / Degraded`. API/Web remain split across `mon` / `mon1`, Worker is healthy on `mon`, backup core remains `0`, escrow missing remains `5`, and 15:00 cold-start is `PASS=82 WARN=1 BLOCKED=0`; P2-143 endpoint reports current `P2-143` and completion `100`, and all writer / Gateway / Telegram / Bot API / production write / secret read / destructive operation values remain `0 / false`. This proves no recovery regression after the P2-143 owner response preflight, but still not full green.
2026-06-14 P2-144 owner response readback recovery readback	SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED	`gitea/main` has advanced to deploy marker `180a6543`; image `fef94df877c5438f9f34ddbcace8ad8112a141ef` is live, and ArgoCD source revision `180a6543eaf26dd6b345d45114316926056a965a` is `Synced / Degraded`. API/Web remain split across `mon` / `mon1`, Worker is healthy on `mon1`, backup core remains `0`, escrow missing remains `5`, and 15:58 cold-start is `PASS=82 WARN=1 BLOCKED=0`; P2-144 endpoint reports current `P2-144` and completion `100`, and owner response received / accepted / rejected, reviewer / Gateway / Telegram / Bot API / result capture / learning / PlayBook trust / production write / secret read / destructive operation values remain `0 / false`. This proves no recovery regression after the P2-144 owner response readback, but still not full green.
2026-06-14 P2-145 owner response acceptance threshold recovery readback	SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED	Latest docs baseline is `06fe0a8f`; runtime deploy marker `36fbfc6b` put image `386dbd078ef63401d9736048463f4ef5326442d9` live, and ArgoCD source revision `06fe0a8f14167824fea512f942d2569431bbcbc8` is `Synced / Degraded`. API/Web remain split across `mon` / `mon1`, Worker is healthy on `mon`, backup core remains `0`, escrow missing remains `5`, and 16:29 cold-start is `PASS=82 WARN=1 BLOCKED=0`; P2-145 endpoint reports current `P2-145` and completion `100`, and owner response received / accepted / rejected, reviewer / Gateway / Telegram / Bot API / result capture / learning / PlayBook trust / production write / secret read / destructive operation values remain `0 / false`. This proves no recovery regression after the P2-145 owner response acceptance threshold, but still not full green.
2026-06-14 IwoooS P0 configuration control priority recovery readback	SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED	Latest docs baseline is `af62ec1f`; runtime deploy marker `ed651a98` put image `e992af89955f8aae40a383b2f2e2f645445a690d` live for API/Web/Worker/CronJob, and ArgoCD source revision `af62ec1fe72b3e84e179d80e788e5a5902bdaf27` is `Synced / Degraded`. API/Web remain split across `mon` / `mon1`, Worker is healthy on `mon1`; IwoooS route `/zh-TW/iwooos` returned `200`. Backup core remains `0`, escrow missing remains `5`, and 17:04 cold-start is `PASS=82 WARN=1 BLOCKED=0`. This proves no recovery regression after the IwoooS P0 configuration control priority front-end release; it does not mean that Nginx reload, DNS/TLS/certbot, workflow/secret/public route/runtime gate, or production write is authorized, and it is still not full green.
2026-06-14 high-value configuration Owner Packet front-end sync recovery readback	SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED	Latest repo docs baseline is `0a4766dd`; runtime deploy marker `16c6b983` put image `e999c16b3435f197b78fe2adfeec1c4faa6c4675` live for API/Web/Worker/CronJob, and ArgoCD source revision `0a4766ddc94b0690824ce3deba5c6b9a69764f94` is `Synced / Degraded`. API/Web remain split across `mon` / `mon1`, Worker is healthy on `mon`; IwoooS route `/zh-TW/iwooos` and AwoooP route `/zh-TW/awooop` both return `200`. Backup core remains `0`, escrow missing remains `5`, and 18:15 cold-start is `PASS=82 WARN=1 BLOCKED=0`. This proves no recovery regression after the high-value configuration Owner Packet front-end sync; it does not mean that request sent, owner response received / accepted, Nginx reload, DNS/TLS/certbot, workflow/secret/public route/runtime gate, host write, active scan, or production write is authorized, and it is still not full green.
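
The 2026-06-13 API placement and rollout hardening rows name four settings that had to hold together: `minDomains=2`, `DoNotSchedule`, `maxSurge=0`, and `maxUnavailable=1`. A minimal sketch of a readback check over a Deployment manifest already loaded as a Python dict is shown below. The field paths follow the Kubernetes Deployment schema; the function name and the sample manifest (including `maxSkew` and `topologyKey`) are illustrative assumptions, not a copy of the live GitOps manifest.

```python
def has_hard_spread(deployment: dict) -> bool:
    """True only if the hard topology spread and the rollout strategy both match."""
    spec = deployment.get("spec", {})
    rolling = spec.get("strategy", {}).get("rollingUpdate", {})
    constraints = (
        spec.get("template", {}).get("spec", {}).get("topologySpreadConstraints", [])
    )
    # ScheduleAnyway is only a soft preference, so it does not count here.
    hard = any(
        c.get("whenUnsatisfiable") == "DoNotSchedule" and c.get("minDomains", 1) >= 2
        for c in constraints
    )
    return hard and rolling.get("maxSurge") == 0 and rolling.get("maxUnavailable") == 1


# Illustrative manifest fragment; values other than the four named settings are assumed.
sample = {
    "spec": {
        "strategy": {"rollingUpdate": {"maxSurge": 0, "maxUnavailable": 1}},
        "template": {"spec": {"topologySpreadConstraints": [{
            "maxSkew": 1,
            "topologyKey": "kubernetes.io/hostname",
            "minDomains": 2,
            "whenUnsatisfiable": "DoNotSchedule",
        }]}},
    }
}
print(has_hard_spread(sample))  # prints True
```

Checking the rollout strategy alongside the constraint reflects the second finding in those rows: the hard constraint alone still placed both new pods on one node while old pods were terminating.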

3. Progress Update Contract

Every phase update must change both status and percentage in this file.

State	Meaning
NOT_STARTED	Listed but no live evidence gathered in this session.
IN_PROGRESS	Actively being checked or fixed.
BLOCKED	A live red gate prevents completion. Do not downgrade or silence the alert.
WAITING_HOST_120	Action is intentionally held until 120 is reachable.
VERIFIED	Live evidence proves the item.
DONE	Fix is implemented, verified, and documented.

Completion is weighted by release risk:

Priority	Weight
P0	45%
P1	25%
P2	20%
P3	10%
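
As a worked example of the weighting, the sketch below combines per-phase percentages into one figure with a plain weighted sum. The weights come from the table above; the combination rule itself is an assumption, since this workplan does not state how the overall percentage is derived.

```python
# Weights in percent, copied from the priority table.
WEIGHTS = {"P0": 45, "P1": 25, "P2": 20, "P3": 10}


def overall_completion(phase_percent: dict[str, int]) -> float:
    """Weighted sum of per-phase completion, assuming a simple linear rule."""
    if set(phase_percent) != set(WEIGHTS):
        raise ValueError("expected exactly the phases P0, P1, P2, P3")
    return sum(WEIGHTS[p] * phase_percent[p] for p in WEIGHTS) / 100


# Phase percentages from the Current Verdict table.
print(overall_completion({"P0": 100, "P1": 97, "P2": 92, "P3": 100}))  # prints 97.65
```

The plain sum gives 97.65, slightly above the 96% recorded for overall recovery readiness in Section 1, so the recorded figure presumably includes an adjustment that this sketch does not model.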

For every push forward, update:

YYYY-MM-DD HH:MM Asia/Taipei
Phase: P0/P1/P2/P3
Before: <old percent>
After: <new percent>
Evidence: <command/file/snapshot>
Blocked: <yes/no and why>
Next: <single next action>
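
One way to keep entries in this shape consistent is to render them from a small record type. The sketch below is illustrative only: the class name, the field names, and the `before` value in the example are hypothetical, while the line layout follows the template above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Asia/Taipei is UTC+8 with no daylight saving, so a fixed offset is enough here.
TAIPEI = timezone(timedelta(hours=8), "Asia/Taipei")
PHASES = ("P0", "P1", "P2", "P3")


@dataclass
class ProgressUpdate:
    when: datetime
    phase: str
    before: int
    after: int
    evidence: str
    blocked: str
    next_action: str

    def render(self) -> str:
        """Return the seven-line entry in the order the contract requires."""
        if self.phase not in PHASES:
            raise ValueError(f"unknown phase: {self.phase}")
        stamp = self.when.astimezone(TAIPEI).strftime("%Y-%m-%d %H:%M")
        return "\n".join([
            f"{stamp} Asia/Taipei",
            f"Phase: {self.phase}",
            f"Before: {self.before}%",
            f"After: {self.after}%",
            f"Evidence: {self.evidence}",
            f"Blocked: {self.blocked}",
            f"Next: {self.next_action}",
        ])


update = ProgressUpdate(
    when=datetime(2026, 6, 25, 19, 35, tzinfo=TAIPEI),
    phase="P2",
    before=100,  # placeholder: the previous percentage is not stated in this workplan
    after=92,
    evidence="POST_START_QUICK_CHECK PASS=31 WARN=1 BLOCKED=1",
    blocked="yes, StockPlatform freshness is blocked",
    next_action="open a StockPlatform data freshness remediation lane",
)
print(update.render())
```

Rejecting unknown phases at render time keeps an entry from being logged against a priority that has no weight.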

4. P0 Must-Do Gates

ID	Status	%	Work item	Fine analysis	Next action	Done criteria
P0-001	DONE	100	Rerun four-host reachability	18:57 cold-start confirms 110 / 120 / 121 / 188 ping and SSH are all OK; ARP neighbor evidence is reachable for 120 / 121 / 188.	Keep evidence in LOGBOOK/runbook.	Host reachability table recorded with date/time.
P0-002	DONE	100	Recover 192.168.0.120	120 root filesystem inconsistency was repaired from console/initramfs with offline fsck; host booted at `2026-06-12 15:13`, SSH returned, root is `rw`, failed units `0`, and K3s `mon` is `Ready control-plane`.	Continue normal monitoring; schedule storage review if fsck recurs.	120 ping/SSH OK, node `Ready`, root not readonly, failed units `0`.
P0-003	DONE	100	Rerun `/backup/scripts/backup-configs.sh`	15:17 manual config capture succeeded; 15:54 aggregate Configs succeeded again, including `120-k3s-host-configs`, `121-k3s-host-configs`, K8s workloads, K8s secrets, and Velero backups.	Keep next scheduled run under normal cron.	`config_failed=0`; Configs snapshot `bee9ae22` exists after 120 recovery.
P0-004	DONE	100	Rerun `/backup/scripts/backup-all.sh`	2026-06-12 15:54 aggregate completed `13/13` in `2170s`; 18:55 backup-status shows `failed=0`, `core_blockers=0`.	Keep 02:00 daily cadence.	Aggregate backup exits 0; backup health failed count 0.
P0-005	DONE	100	Rerun `/backup/scripts/sync-offsite-backups.sh --mode sync`	Default runway gate skipped full sync at 270m; controlled recovery override set runway to 240m without changing scripts. Full offsite sync completed `13/13` at 17:37 in `6027s`.	Restore normal default runway for scheduled sync; use override only for documented P0 recovery windows.	New `rclone-last-success` marker after local backup timestamp.
P0-006	DONE	100	Rerun `/backup/scripts/verify-offsite-full-sync.sh --write-textfile --no-color`	18:55 verifier confirms all 13 remote repos have `snapshots=1`, `REMOTE_LATEST_ONLY_OK=1`, `FULL_MARKER_FRESH=1`, `VERIFY_OK=1`, `FAILED=0`.	Keep 07:20 daily verifier.	`REMOTE_LATEST_ONLY_OK=1`, all 13 repos `snapshots=1`.
P0-007	DONE	100	Rerun full cold-start scorecard	First 18:56 rerun had one transient internal VIP API `000`; direct VIP checks from 110/120/121/188 returned API `200` and Web `307`. Second 18:57 rerun returned `PASS=83 WARN=0 BLOCKED=0`, result `GREEN`.	Treat future internal VIP `000` as transient only after direct multi-host VIP checks prove API `200`.	`BLOCKED=0`, `WARN=0`, result `GREEN`.
P0-008	DONE	100	Narrow 120 failure domain and prepare console handoff	110 and 188 see no route / no ping; 121 reports destination host unreachable; local ARP is incomplete. Kubernetes retained only stale node/lease data and cannot read current 120 host/filesystem state. No BMC/IPMI/WOL inventory was found in the repo.	Physical/VM console must verify power state, NIC attachment, boot screen, initramfs/fsck state, and then restore SSH.	Handoff evidence is recorded; no remote-only fix path remains before console access.
P0-009	DONE	100	Exhaust safe remote 120 recovery channels	2026-06-12 15:00 local/110/121/188 all still fail ping/SSH with ARP incomplete. Searched repo, local tools, 110, 121, 188, SSH config, local VM files, and Chronicle-visible desktop; no usable BMC/IPMI/WOL/vmrun/hypervisor/120 console entry was found.	Use hypervisor / console / VM inventory outside SSH path.	Remote-only path is proven unavailable; no alert was silenced and no unsafe reboot/restart was attempted.
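
The done criteria for P0-007 (`BLOCKED=0`, `WARN=0`, result `GREEN`) can be expressed as a small classifier over a scorecard summary line. This is a sketch under the assumption that the summary always carries `PASS=`, `WARN=`, and `BLOCKED=` in that order, as in the lines quoted in this workplan; the labels `WARN_REVIEW` and `UNKNOWN` are hypothetical, and the real wrapper applies additional rules (for example separating service warnings from DR evidence warnings) that are not modelled here.

```python
import re

SUMMARY = re.compile(r"PASS=(\d+)\s+WARN=(\d+)\s+BLOCKED=(\d+)")


def classify_scorecard(line: str) -> str:
    """Map a PASS/WARN/BLOCKED summary line to a coarse result label."""
    match = SUMMARY.search(line)
    if match is None:
        return "UNKNOWN"       # fail closed: no parsable summary is never GREEN
    _passed, warn, blocked = (int(group) for group in match.groups())
    if blocked > 0:
        return "BLOCKED"
    if warn > 0:
        return "WARN_REVIEW"   # hypothetical label: warnings need a human read
    return "GREEN"


print(classify_scorecard("PASS=83 WARN=0 BLOCKED=0"))                         # prints GREEN
print(classify_scorecard("POST_START_QUICK_CHECK PASS=31 WARN=1 BLOCKED=1"))  # prints BLOCKED
```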

5. P1 Backup And Alert Gates

ID	Status	%	Work item	Fine analysis	Next action	Done criteria
P1-001	VERIFIED	100	Confirm 110 backup schedule	Live crontab has `02:00 backup-all`, `03:00 rclone gated sync`, `06:05 backup-status`, `07:20 full offsite verify`.	Update `BACKUP-STATUS.md`.	Schedule documented and matches live crontab.
P1-002	VERIFIED	100	Confirm success-noise policy	Daily status is once at 06:05; normal backup success is not a Telegram spam path.	Keep failure-only escalation in backup docs.	Docs say failures escalate; daily status is summary only.
P1-003	VERIFIED	100	Confirm Google Drive latest-only	2026-06-12 18:55 verifier shows 13 repos with exactly one remote snapshot each after the post-120 aggregate backup and full offsite sync.	Record evidence in backup status.	`REMOTE_LATEST_ONLY_OK=1`, `VERIFY_OK=1`.
P1-004	VERIFIED	100	Confirm required alerts exist	Live Prometheus rules include all five required backup/cold-start alerts.	Keep in scorecard.	All five alert names FOUND live.
P1-005	BLOCKED_WAITING_OWNER_EVIDENCE	20	Fill credential escrow evidence markers	Five markers are missing. This is a DR scorecard blocker, not a service outage. 2026-06-13 13:10 proves scripts/offsite/rclone readiness is green; the remaining blocker is owner-provided real non-secret evidence IDs. Owner request package exists at `docs/security/CREDENTIAL-ESCROW-EVIDENCE-OWNER-REQUEST.md`; secrets must not enter repo or chat.	Human verifies vault/offline escrow, validates each non-secret evidence ID with `--dry-run`, then writes markers using `/backup/scripts/mark-credential-escrow-verified.sh`.	`awoooi_backup_dr_credential_escrow_missing_count=0`.
P1-006	DONE	100	Fix backup health failed component	2026-06-12 18:55 backup-status shows `failed=0`, `core_blockers=0`, `config_failed=0`; 120 config capture is no longer red.	Keep normal daily backup cadence.	`failed_count=0`, `config_failed=0`.
P1-007	DONE	100	Refresh stale backup jobs	2026-06-04 cleared `stale188=momo_pg_daily`; 2026-06-05 cleared recurring `stale110=awoooi_db`; 2026-06-06 confirms no stale jobs after the next aggregate window.	Keep normal cron cadence; only 120-driven Configs remains red.	`stale110=none`, `stale188=none`, 110 `13/13 fresh`, 188 `2/2 fresh`.
P1-008	DONE	100	Align 188 momo backup cron/exporter contract	188 backup exporter expected `/home/ollama/bin/momo-pg-backup.sh`; crontab still pointed to the old app-side script. Crontab was backed up and updated to the host-owned controller script.	Keep backup controller path in future deploy docs.	`configured_missing_188=0`, `awoooi_backup_job_configured{host="188",job="momo_pg_daily"} 1`.
P1-009	DONE	100	Repair 2026-06-05 non-120 backup failures	02:00 aggregate failed Gitea, AWOOOI DB, Open-WebUI, ClawBot, AI Artifacts, and Configs. The next aggregate window held the five non-120 fixes; Configs remains 120-blocked.	Leave aggregate red until 120 returns and Configs can rerun cleanly.	Fresh single-repo evidence exists for all non-120 failures and the next aggregate run only failed Configs.
P1-010	DONE	100	Offsite sync manual backup repairs	2026-06-12 17:37 full offsite sync completed `13/13` after controlled P0 runway override to 240m; 18:55 verifier confirmed 13 remote repos each have one snapshot.	Allow normal 03:00 full sync cadence unless another manual backup creates new snapshots.	`REMOTE_LATEST_ONLY_OK=1`, `VERIFY_OK=1`, full sync `13/13`.
P1-011	DONE	100	Confirm 2026-06-12 backup convergence	18:55 live check confirms the post-120 aggregate held: no stale jobs, no configured/missing script jobs, no failed components, offsite fresh, and only credential escrow remains as DR warning.	Keep escrow as explicit red gate.	`stale110=none`, `stale188=none`, `failed=0`, `config_failed=0`, `core_blockers=0`.
P1-012	DONE	100	Audit credential escrow marker write safety	2026-06-12 15:02 `mark-credential-escrow-verified.sh --status` reports all five allowed items missing; `offsite-escrow-evidence-report.sh --no-color` reports rclone/offsite configured and `ESCROW_MISSING_COUNT=5`; repo search found only runbooks/placeholders/rules, not real evidence IDs.	Write markers only after a real non-secret evidence ID exists for each item; never write placeholder or secret.	The marker blocker is narrowed to missing external evidence IDs, not missing script/config/offsite readiness.
P1-014	DONE	100	Publish credential escrow owner request package	2026-06-13 13:10 live report confirms `SCRIPT_MISSING_COUNT=0`, `OFFSITE_CONFIGURED=1`, `RCLONE_CONFIGURED=1`, `ESCROW_MISSING_COUNT=5`, `PASS=8 WARN=5 BLOCKED=0`. New owner request package defines allowed evidence-id types, forbidden secret values, safe dry-run flow, write flow, and closeout gates.	Dispatch to the credential owners without collecting secret values; keep marker write gated until owner gives real non-secret evidence IDs.	`docs/security/CREDENTIAL-ESCROW-EVIDENCE-OWNER-REQUEST.md` and snapshot exist and validate.
P1-013	DONE_FOR_SERVICE_READINESS	100	Remediate `km-vectorize` CronJob health debt	The retained `km-vectorize-29689620` failed Job is now classified as stale evidence, not an active blocker, because later official `km-vectorize` Jobs completed successfully. 2026-06-18 13:43 cold-start reads `FAILED_JOBS=1`, `STALE_FAILED_JOBS=1`, `ACTIVE_FAILED_JOBS=0`, `BAD_PODS=0`, and returns `PASS=84 WARN=0 BLOCKED=0`.	Keep retained failed Job as evidence unless an explicit maintenance window authorizes cleanup. Reassert ArgoCD app health only with a fresh ArgoCD app readback, not from the cold-start scorecard alone.	Service readiness no longer warns on stale failed Job evidence; active failed Job detection remains guarded.
P1-015	DONE	100	Restore 188 MinIO / Velero backup freshness and DB exporters	2026-06-24 06:35 resolved real backup / exporter red lights: 188 PostgreSQL exporter and Redis exporter now expose `pg_up=1` / `redis_up=1`; 188 MinIO health is live; 120 Velero BSL is `Available`; one-off backup `reboot-recovery-202606240456` completed; 110 backup-health textfile reports latest Velero backup fresh. 110 disk pressure was reduced from 92% to 73% by Docker image/build-cache cleanup only.	Reconcile MinIO `userns_mode: host` override into formal source-of-truth or data ownership fix; keep Docker volume prune forbidden without explicit owner approval.	`VeleroBackupNotRun`, `PostgreSQLDown`, `RedisDown`, and 110 disk-pressure alerts are resolved, and SOP includes restore helpers.
P1-016	DONE	100	Control repeated Telegram notification noise without hiding real alerts	2026-06-24 confirmed MOMO Pro 5-minute spam came from a legacy 110 script checking `http://192.168.0.188/health`; live script now uses `https://mo.wooo.work/health` as primary truth. Heartbeat warning dedupe now hashes stable actionable fingerprints so HTTP status / timeout / latency drift does not create a new Telegram event every 30 minutes. `MoWoooWorkDown` now labels `component=momo-pro-system`, disables blind auto-repair, and requires public/local/container/data-freshness triage. Generic docker-health monitor keeps 5-minute repair checks but adds a separate 30-minute direct Telegram fallback cooldown. Bitan public-content cleanliness keeps failure notification with same-fingerprint cooldown and one recovery notice.	Fold remaining cross-product direct Telegram egress into the unified notification gateway over time; do not disable real warning/failure/recovery signals. Production deployment/readback must confirm the code and Prometheus rules are live before declaring runtime closure.	Healthy heartbeat is quiet, same actionable heartbeat warning is deduped, MOMO public health success produces no alert, repeated same-failure direct fallback paths are cooled, and real failure/recovery/new-warning notifications remain enabled.
P1-017	DONE	100	Restore 188 nginx-exporter and post-CD monitoring coverage	CD `#3294` deployed marker `622bc372` but failed post-deploy checks because `scripts/generate_monitoring.py --check` saw Prometheus job `nginx-exporter` down at `192.168.0.188:9113`. 188 `stub_status` and compose config were healthy, so the correct fix was restoring the stateless exporter from `/home/ollama/nginx-exporter.yml`, not reloading Nginx or restarting products. New helper `scripts/ops/188-nginx-exporter-restore.sh` defaults to read-only `--check` and exposes explicit `--apply` for maintenance-window restore. `high-value-config-change-gate.py` now classifies `scripts/ops/*/exporter*` as `monitoring_alerting_observability` P1 / C1.	Keep this check in post-reboot and post-CD recovery. Do not mark historical CD `#3294` as success; use the next CD run plus monitoring coverage as future proof.	`bash scripts/ops/188-nginx-exporter-restore.sh --check` reports `nginx_up 1`; `python3 scripts/generate_monitoring.py --check --stabilization-sleep-seconds 0` reports `Jobs=14`, `全部 UP=14`, `真實問題=0`, coverage `100.0%`; high-value gate matches the helper as P1 / C1, not unmanaged.
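The done criteria in this table repeatedly separate backup health (`failed=0`, `core_blockers=0`, `config_failed=0`) from the DR escrow gate (`escrow_missing`). A minimal sketch of that split follows, assuming the status is available as a single line of space-separated `key=value` pairs. The function names and the result labels are hypothetical and are not part of the real backup-status script.

```bash
#!/usr/bin/env bash
# Sketch only: the input format and the result labels are assumptions.

# Print the value of one key from a line such as "failed=0 core_blockers=0".
get_field() {
  local key="$1" line="$2" tok
  for tok in $line; do
    case "$tok" in
      "$key"=*) printf '%s\n' "${tok#*=}"; return 0 ;;
    esac
  done
  return 1   # key not present
}

# Classify a summary line. A missing field is treated as a failure (fail closed).
backup_gate() {
  local line="$1" key val
  for key in failed core_blockers config_failed; do
    val="$(get_field "$key" "$line")" || { echo "BLOCKED_MISSING_FIELD:$key"; return 2; }
    [ "$val" = "0" ] || { echo "BLOCKED_BACKUP:$key=$val"; return 2; }
  done
  val="$(get_field escrow_missing "$line")" || { echo "BLOCKED_MISSING_FIELD:escrow_missing"; return 2; }
  if [ "$val" = "0" ]; then
    echo "BACKUP_GREEN_DR_COMPLETE"
  else
    echo "BACKUP_GREEN_DR_ESCROW_BLOCKED:escrow_missing=$val"
    return 1
  fi
}

# prints BACKUP_GREEN_DR_ESCROW_BLOCKED:escrow_missing=5
backup_gate "failed=0 core_blockers=0 config_failed=0 escrow_missing=5" || true
```

Keeping the escrow check last means a backup failure is always reported before the DR gate, so the two kinds of blocker are not confused in a closeout record.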

6. P2 Service And Data Gates

ID	Status	%	Work item	Fine analysis	Next action	Done criteria
P2-001	VERIFIED	100	Public route smoke	2026-06-12 18:57 cold-start confirms all listed domains returned expected 2xx/3xx over HTTPS; registry root route returned 200 in the scorecard and `/v2/` remains the normal unauthenticated 401 pattern from earlier checks. This proves ingress/TLS plus current route availability.	Keep as one row in scorecard.	Public route table updated after each reboot.
P2-002	GREEN	100	momo latest/current-month parity and freshness	Latest current-month parity is good: `15383	15383	2026-06-01
P2-008	DONE_SUPERSEDED_BY_JOB_57_RECOVERY	100	Separate MOMO service recovery from upstream source absence	2026-06-24 11:35 readback proved MOMO service was healthy and source-file absence was the blocker. 2026-06-25 10:35 superseded that with a stricter split: service healthy, DB parity good, but token / Drive auth evidence not sufficient and scheduler fail-closed behavior required. 2026-06-25 14:16 supersedes the blocker with job `57` clean import, `V10.674`, token metadata aligned to scheduler UID, current-month parity through `2026-06-24`, and `DB_DAILY_FRESHNESS 1	2026-06-24`. SOP v1.51 preserves the GO/NO-GO rules forbidding old archive re-import, product-export import, truncate, whole-DB restore, fake freshness, or token secret exposure.	Keep running the dedicated preflight after each reboot/import window; if Drive/API auth fails again, it must fail closed and alert rather than becoming an empty-folder success.
P2-003	DONE_PRODUCTION_DEPLOYED_WAITING_NEXT_REAL_IMPORT	99	Fix momo job semantics	Gitea-first repair is in `/Users/ogt/codex-workspaces/momo-pro-dev` commit `84035906aba0e5e190d031a13cfd9b47a8cd1f73` on branch `codex/momo-current-main-dev-base-20260624`, also fast-forwarded to MacBook Pro and fast-forwarded to MOMO `main`. Gitea Actions `cd.yaml #904` succeeded, and 188 live source contains `_table_columns`, `業績分析儀表板同步失敗`, and `保留來源檔案等待重試，不移動 Google Drive 檔案`. `process_daily_sales_import()` marks monthly sync failure as `failed`, records the sync error in summary, returns `False`, and leaves `auto_import_from_drive()` outside the Drive archive/move path. Regression tests cover both job failure and no-move behavior.	Watch the next real Google Drive import and confirm no file moves unless both tables sync; if a real monthly sync failure happens, verify import job status is `failed` and source file remains pending.	`pytest tests/test_import_service_sql_params.py tests/test_auto_import_data_sync.py tests/test_auto_import_failure_boundaries.py -q` returns `10 passed`; production deployment/readback is complete; final behavioral closeout requires next real import evidence.
P2-004	DONE	100	PostgreSQL index corruption runbook path	SOP v1.2 now states `posting list tuple ... cannot be split` is an index repair incident.	Use only concurrent reindex if the error returns.	No truncate, no whole DB restore; `REINDEX TABLE CONCURRENTLY public.realtime_sales_monthly;` and idempotent resync evidence recorded.
P2-005	VERIFIED	100	Do not rely on route 200 only	2026-06-12 closeout has route + DB + backup + offsite + schedule + alert + K3s + cold-start scorecard evidence. The only remaining blocker is DR credential escrow, outside service availability.	Keep this cross-surface checklist mandatory after every reboot.	Each reboot record has route, DB, backup, schedules, alert, scorecard rows.
P2-006	DONE	100	Validate momo scheduler WARN	2026-06-12 post-reboot regression showed the old detector was too narrow for Chinese batch and `[Feeder]` logs. The detector was widened and deployed to 110; 14:47 scorecard reads `SCHEDULER_RECENT_ACTIVITY 1070` and marks scheduler healthy.	Keep normal monitoring; treat future recurrence as detector tuning only if direct logs remain active.	Container healthy, direct log activity exists, and latest scorecard removed this WARN.
P2-007	DONE	100	Balance K3s AWOOOI workload across 120 / 121	Gitea main `acaae999` adds topology spread for API/Web/Worker. ArgoCD later synced deploy marker `e4a349bc`; live deployments still have split placement after a normal CD rollout: API pods on `mon1` / `mon`, Web pods on `mon` / `mon1`, Worker single replica on `mon`; 01:26 final cold-start is `PASS=83 WARN=0 BLOCKED=0`.	Keep watching future deploys; do not manually delete pods unless placement drift becomes a real service or HA gate.	Live deployment has non-empty topology spread, API/Web placement max skew <= 1 after normal CD, public routes green, cold-start `WARN=0 BLOCKED=0`.
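The P2-007 done criterion "placement max skew <= 1" can be checked from a plain list of node names, one per pod. The sketch below assumes that list has already been extracted (for example from a pod listing) and is fed on stdin. `placement_skew` is a hypothetical helper name, and skew is taken here as the difference between the busiest and the least busy expected node.

```bash
#!/usr/bin/env bash
# Sketch only: reads one node name per line on stdin and prints max - min pod count.
# Nodes named in the first argument but hosting no pods count as 0.
placement_skew() {
  local expected_nodes="$1"   # space-separated node names, e.g. "mon mon1"
  awk -v nodes="$expected_nodes" '
    BEGIN { n = split(nodes, names, " "); for (i = 1; i <= n; i++) count[names[i]] = 0 }
    NF    { count[$1]++ }
    END {
      first = 1
      for (k in count) {
        if (first || count[k] > max) max = count[k]
        if (first || count[k] < min) min = count[k]
        first = 0
      }
      print max - min
    }'
}

# API pods on mon1 / mon, as recorded in P2-007: prints 0
printf 'mon1\nmon\n' | placement_skew "mon mon1"
```

Passing the expected node list explicitly matters: if both replicas land on `mon`, the empty `mon1` still counts as zero and the skew is reported as 2 instead of 0.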

7. P3 Documentation And Automation

ID	Status	%	Work item	Fine analysis	Next action	Done criteria
P3-001	VERIFIED	100	Confirm hardening commit	Gitea `main` currently points to `0260ec89...`; `git merge-base --is-ancestor ae7b39d9 0260ec89...` returned true.	Keep evidence in LOGBOOK.	Gitea main contains `ae7b39d9 fix(ops): harden reboot recovery and backup alerts`.
P3-002	VERIFIED_WITH_V142_SYNC_BLOCKED	100	Confirm live 110 scripts	All required recovery/check scripts exist under `/home/wooo/scripts/`; cold-start script hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8` is live on 110. Repo-side v1.42 authoritative script hash is `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05`, and `verify-cold-start-monitor-deploy.sh` correctly blocks on the mismatch.	Do not run `install-cold-start-monitor-110.sh` during read-only triage. After explicit maintenance-window / owner approval, run the installer, rerun deploy parity, then rerun the live 110 cold-start monitor and record the new hash.	Script paths and current mismatch are recorded; v1.42 live-sync done criteria remains hash parity plus live scorecard fields.
P3-003	DONE	100	Reconcile 188 nginx Ansible baseline	Live 188 already routes `aiops.wooo.work` through VIP; the Ansible template matches that route and has no 120 upstream for aiops. `nginx-sync.yml` now also carries the `188-internal-tools-https.conf.j2` source-of-truth path, and `ansible-validate.sh` syntax-check passes with repo-local roles path.	Run only approved dry-run/apply from the normal Ansible environment before changing live nginx.	Template and live config agree; no 120 upstream for aiops; repo-side syntax and readiness contract pass.
P3-004	DONE	100	Update `docs/LOGBOOK.md`	Live blocker and new docs are recorded.	Keep this entry updated after each recovery phase.	LOGBOOK has current recovery status and next actions.
P3-005	DONE	100	Update cold-start SOP	SOP now includes start, shutdown, reboot, record, comparison, and 120 blocker handling.	Increment SOP version after each process change.	SOP has controlled power-operation sections and ledger template.
P3-006	DONE	100	Update backup status	Backup status now reflects current cron, rclone latest-only, failure-only alert posture, and escrow blocker.	Refresh after 120 backup rerun.	Backup status no longer claims noisy success Telegram notifications.
P3-007	DONE	100	Harden Gitea backup stale dump handling	2026-06-05 manual Gitea backup failed because the container retained `/tmp/gitea-dump.zip` from the 02:00 failure. `scripts/backup/backup-gitea.sh` now renames stale container dump files to timestamped evidence before running a new dump, and the live 110 script is updated.	Watch the next 02:00 Gitea backup.	`bash -n` passes locally and on 110; manual Gitea backup completed after stale evidence rename.
P3-008	DONE	100	Continuously optimize host reboot SOP	SOP v1.52 adds one-page post-start quick check wrapper, fallback runbook, startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, stale-vs-active K8s failed Job classification, post-reboot / post-CD recovery anchors, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, 110 runaway browser / CI load triage PlayBook, healthy-heartbeat suppression, 188 node-exporter restore, 188 DB/Redis exporter restore, 188 MinIO/Velero restore, 188 nginx-exporter restore, 110 Docker disk cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, post-reboot notification noise gates, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side scorecard source-absence classifier, 110 live-sync parity gate, CD monitoring coverage target-down classification, MOMO dedicated token/source preflight, MOMO V10.674 / StartedAt / lifecycle / job 57 / freshness 1 recovery readback, and 2026-06-25 110 CPU orphan Chrome vs active CI triage evidence.	Use `scripts/reboot-recovery/post-start-quick-check.sh --no-color` for T+10 post-reboot triage, then use `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` as manual fallback and SOP v1.52 for exceptions, Plan B, blocker-specific recovery, and historical comparison. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow / runaway-process / notification-noise / MOMO preflight / monitoring coverage checks. If using the live 110 script, record its hash and do not assume repo-side v1.42 behavior until synced under approval and deploy parity passes.	SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, and `B5_DR_COMPLETE`; quick check wrapper has one command order and LOGBOOK summary; latest MOMO dedicated preflight returns `PASS=19 WARN=2 BLOCKED=0`; 110 CPU evidence records old orphan Chrome groups removed by approved SIGTERM while active CI load remains observation-only; repeated healthy/same-failure notification noise is controlled without hiding real alerts, and monitoring coverage target-down is routed through exporter restore before any product restart.
P3-009	DONE	100	Assess 120/121 AA/AS role and host load balancing	2026-06-12 15:19 live check confirms 120 and 121 are both `Ready control-plane`, `k3s active`, `k3s-agent inactive`, with no taints; however most AWOOOI / ArgoCD / Velero workload remains on 121 after 120 fsck recovery. New assessment defines control-plane AA vs workload AA, migration candidates from 110/188, and stateful migration blockers.	After P0 backup/offsite/cold-start green, implement topology spread for AWOOOI API/Web before moving additional services.	`docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md` exists; SOP v1.6 links AA/AS and load-balancing checks; migration implementation remains explicitly `0%`.
P3-010	DONE	100	Update workload balancing docs with 2026-06-13 live truth	Host role assessment, workplan, SOP, backup status, and LOGBOOK are refreshed with current cold-start, backup, 188 certbot degraded, ArgoCD `km-vectorize` degraded, Gitea main `acaae999`, ArgoCD sync, and final pod placement evidence.	Keep updating this file after the next reboot or deploy.	Docs separate service-green status from DR escrow, workload rollout, and non-service governance debt.
P3-011	DONE	100	Record `km-vectorize` remediation status	LOGBOOK, this workplan, and SOP now state the schedule/label fix, ArgoCD sync evidence, the invalid manual Job boundary, and the 90% waiting-for-next-schedule gate.	After next 03:00 run, update this row and the top verdict with `lastSuccessfulTime` / ArgoCD health evidence.	No document claims ArgoCD green before official CronJob success evidence exists.
P3-012	DONE	100	Prevent CD from clobbering cold-start SSH trust	Source fix `80e6ec1a` changes Gitea CD workflows to use deploy-specific `deploy_known_hosts` and `UserKnownHostsFile`; post-deploy marker `e4a349bc` proves global `/home/wooo/.ssh/known_hosts` retained 120 / 188 entries. SOP v1.8 records this as a release guardrail.	Keep the guardrail in future workflow reviews; any `> ~/.ssh/known_hosts` in deploy code is a release blocker.	CD success plus post-CD `known_hosts` readback and strict SSH checks to 120 / 188 remain green.
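The P3-012 guardrail (any `> ~/.ssh/known_hosts` in deploy code is a release blocker) lends itself to a simple text scan during workflow review. This is a minimal sketch, not the project's actual check: `find_known_hosts_clobber` is a hypothetical name, and the pattern deliberately reports append redirects (`>>`) as well, which is stricter than the rule as written.

```bash
#!/usr/bin/env bash
# Sketch only: flag deploy code that redirects output into the global known_hosts file.
# Returns 0 and prints matching lines when a violation exists, 1 when the tree is clean.
find_known_hosts_clobber() {
  local dir="$1"
  # -r recurse, -n show line numbers, -E extended regex.
  # Matches ~, $HOME or ${HOME} as the home prefix, with an optional opening quote.
  grep -rnE '>[[:space:]]*("?)(~|\$HOME|\$\{HOME\})/\.ssh/known_hosts' "$dir"
}

# Example (the path is a placeholder for wherever the CD workflows live):
#   if find_known_hosts_clobber path/to/workflows; then echo "RELEASE BLOCKER"; fi
```

A deploy-specific file selected through `UserKnownHostsFile`, as described in the fix, does not match the pattern, so the scan stays quiet on the corrected workflows.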

8. Required 120 Recovery Sequence

Do this only after physical/VM console access confirms 120 is powered on, attached to the LAN, and either booted or repairable.

# 0. Console-side checks first; do not do these through an online mounted root filesystem.
#    - power / VM state
#    - NIC connected to the 192.168.0.x LAN
#    - boot screen / initramfs / rescue state
#    - if root FS repair is required: fsck -f /dev/mapper/ubuntu--vg-ubuntu--lv from console/rescue only

# 1. After SSH returns, run read-only 120 maintenance readiness
bash scripts/reboot-recovery/120-fsck-maintenance-checklist.sh --no-color

# 2. After 120 is reachable and stable, on 110
/backup/scripts/backup-configs.sh
/backup/scripts/backup-all.sh
/backup/scripts/sync-offsite-backups.sh --mode sync
/backup/scripts/verify-offsite-full-sync.sh --write-textfile --no-color

# 3. Final cold-start scorecard
/home/wooo/scripts/full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1

Do not run truncate, whole DB restore, force-push, DROP, or online root filesystem fsck as part of this flow.
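Step 2 of the sequence above is order-sensitive: an offsite sync after a failed backup would publish stale data. One way to enforce the order is a small runner that stops at the first failure and defaults to dry-run. This is a sketch under stated assumptions; `run_in_order` and the `DRY_RUN` variable are hypothetical and are not part of the existing scripts.

```bash
#!/usr/bin/env bash
# Sketch only: run commands in order, stop at the first failure.
# DRY_RUN defaults to 1, so nothing executes unless the operator sets DRY_RUN=0.
run_in_order() {
  local cmd
  for cmd in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "DRY-RUN: $cmd"
      continue
    fi
    echo "RUN: $cmd"
    # Word splitting is intentional so "script --flag value" works;
    # arguments that contain spaces are not supported by this sketch.
    $cmd || { echo "STOPPED: '$cmd' failed; later steps were not run" >&2; return 1; }
  done
}

# Prints the four step 2 commands without executing them.
DRY_RUN=1 run_in_order \
  "/backup/scripts/backup-configs.sh" \
  "/backup/scripts/backup-all.sh" \
  "/backup/scripts/sync-offsite-backups.sh --mode sync" \
  "/backup/scripts/verify-offsite-full-sync.sh --write-textfile --no-color"
```

Running it once in dry-run mode before the maintenance window also gives a reviewable record of exactly which commands are about to run.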

9. Progress Updates

2026-06-18 14:20 Asia/Taipei
Phase: P3 AI Ops runaway process automation
Before: CPU saturation on 110 could only be diagnosed manually with `ps/top`; the generic `HostHighCpuLoad` alert could not distinguish cross-project orphan Chrome smoke processes from legitimate Gitea Actions CI load.
After: Added read-only `host-runaway-process-exporter.py`, gated `host-runaway-process-remediation.py`, Prometheus `host_runaway_process_alerts`, an Ansible textfile exporter source-of-truth, SOP v1.26, and `HOST-RUNAWAY-PROCESS-AIOPS-PLAYBOOK.md`. The exporter exposes orphan browser, active CI, load/core, swap ratio, and `remediation_authorized=0`; the remediator defaults to dry-run, and `SIGTERM` requires owner approval, a maintenance window, and an evidence ref.
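The gate described in this entry (dry-run by default, `SIGTERM` only with owner approval, maintenance window, and evidence ref) can be illustrated as a single decision function. This is a sketch, not the logic of `host-runaway-process-remediation.py`, whose flags are not shown in this workplan; the variable names `OWNER_APPROVAL`, `MAINTENANCE_WINDOW`, and `EVIDENCE_REF` and both output strings are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: all three inputs must be non-empty before leaving dry-run mode.
remediation_mode() {
  local missing=""
  [ -n "${OWNER_APPROVAL:-}" ]     || missing="$missing owner_approval"
  [ -n "${MAINTENANCE_WINDOW:-}" ] || missing="$missing maintenance_window"
  [ -n "${EVIDENCE_REF:-}" ]       || missing="$missing evidence_ref"
  if [ -n "$missing" ]; then
    echo "dry-run (missing:$missing)"
  else
    echo "sigterm-authorized"
  fi
}

# With nothing supplied, the result names every missing precondition.
OWNER_APPROVAL= MAINTENANCE_WINDOW= EVIDENCE_REF= remediation_mode
```

Reporting which precondition is missing, instead of a bare refusal, keeps the dry-run output usable as evidence in the triage packet.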

2026-06-18 14:31 Asia/Taipei
Phase: P3 AI Ops runaway process live observability
Before: Repo-side exporter / alert / PlayBook were complete, but Prometheus on 110 had not yet read live `awoooi_host_runaway_process_*` metrics.
After: 110 now has the read-only exporter/helper and cron installed, and the textfile was refreshed immediately. The second Prometheus scrape read `monitor_up=1`, orphan browser group count `0`, active CI containers `2`, load5/core about `0.79-0.81`, swap ratio about `1.0`, and `remediation_authorized=0`; `HostRunawayProcessMonitorMissing` and `HostOrphanBrowserSmokeHighCpu` are not firing.
Evidence: `/home/wooo/node_exporter_textfiles/host_runaway_process.prom`, Prometheus query `awoooi_host_runaway_process_monitor_up{host="110"}`, and `ALERTS{alertname="HostRunawayProcessMonitorMissing",host="110",alertstate="firing"}`.
Blocked: No for live observability; yes for runtime remediation by design until owner approval / maintenance window / evidence ref / dry-run / post-check exist.
Next: Keep cron scrape under normal monitoring; if orphan count becomes >0, create AI triage packet and remediation dry-run before any gated `SIGTERM`.
Completion: monitoring / alert / PlayBook / KM contract 100%; runtime auto-remediation remains gated at 0 until a real owner-approved apply is executed.
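The "if orphan count becomes >0" rule can be checked directly against the textfile before Prometheus is consulted. The sketch below assumes the file uses the standard Prometheus text exposition format with each metric appearing once. `awoooi_host_runaway_process_monitor_up` is the metric quoted in the evidence above; the orphan-group metric name in `ORPHAN_METRIC` is a hypothetical placeholder, as are the function names and result labels.

```bash
#!/usr/bin/env bash
# Sketch only. ORPHAN_METRIC is a hypothetical name; substitute the real one.
ORPHAN_METRIC="awoooi_host_runaway_process_orphan_browser_groups"

# Print the sample value of one metric, ignoring labels. Exit 1 if absent.
prom_value() {
  local metric="$1" file="$2"
  awk -v m="$metric" '
    /^#/ { next }                       # skip HELP / TYPE comment lines
    {
      name = $1
      sub(/\{.*/, "", name)             # drop the {label="..."} part
      if (name == m) { print $NF; found = 1; exit }
    }
    END { if (!found) exit 1 }' "$file"
}

# QUIET only when the monitor is up and no orphan groups are reported.
triage_needed() {
  local file="$1" up orphans
  up="$(prom_value awoooi_host_runaway_process_monitor_up "$file")" \
    || { echo "MONITOR_METRIC_MISSING"; return 2; }
  orphans="$(prom_value "$ORPHAN_METRIC" "$file")" \
    || { echo "ORPHAN_METRIC_MISSING"; return 2; }
  if awk -v u="$up" 'BEGIN { exit !(u + 0 == 1) }' && \
     awk -v o="$orphans" 'BEGIN { exit !(o + 0 == 0) }'; then
    echo "QUIET"
  else
    echo "TRIAGE"; return 1
  fi
}

# Example: triage_needed /home/wooo/node_exporter_textfiles/host_runaway_process.prom
```

A missing metric is reported separately from a non-zero count, so a broken exporter is not mistaken for a healthy host.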

2026-06-18 14:38 Asia/Taipei
Phase: P3 AI Ops alert-to-event packet
Before: A generic CPU raw dump could be converted into an AI automation card, but `HostOrphanBrowserSmokeHighCpu` / `HostCiRunnerLoadSaturation` alert text had no dedicated lane.
After: The final Telegram egress now maps `HostOrphanBrowserSmokeHighCpu` to `orphan_browser_smoke_runaway_process` and `HostCiRunnerLoadSaturation` to `ci_runner_load_saturation`; both keep `runtime_write_gate=0` and require dry-run / owner / maintenance / evidence / KM / PlayBook / Verifier.
Evidence: `apps/api/src/services/telegram_gateway.py` and `apps/api/tests/test_telegram_message_templates.py`; targeted pytest `59 passed`.
Blocked: No for alert-to-event packet; yes for Telegram live send / runtime remediation by design.
Next: After code-review / CD, perform a production readback; if an alert actually fires in the future, confirm that the Telegram card and the AwoooP Run truth-chain both show the same lane.

2026-06-18 14:51 Asia/Taipei
Phase: P3 AI Ops alert-to-event packet production readback
Before: `HostOrphanBrowserSmokeHighCpu` / `HostCiRunnerLoadSaturation` had source + tests, but production deployment and runtime revision readback were not yet complete.
After: `f358a0f6` was deployed by Gitea CD `#3150` with deploy marker `2d278568`. ArgoCD `awoooi-prod` is `Synced / Healthy`, and the API / Web / Worker images are all `f358a0f6c3e614e407dedb6eee89bf10b2bc8173`; production health and the IwoooS / Governance / AwoooP tenants routes all returned `200`, with `0` hits in sensitive-string sampling.
Evidence: Gitea CD `#3150` tests `3221 passed, 23 skipped`, B5 integration `5 passed`, post-deploy alert-chain smoke `9/9`, monitoring coverage `14/14` jobs up; Prometheus still reads 110 `monitor_up=1`, orphan browser group count `0`, CI active containers `2`, `remediation_authorized=0`, and the missing / orphan alerts are not firing.
Blocked: No for production alert-to-event packet deployment; yes for runtime remediation by design.
Next: Future firing alert must produce AI triage packet and dry-run evidence first. `HostCiRunnerLoadSaturation` remains capacity / runner scheduling triage, not process kill. Runtime remediation remains `0` until owner approval, maintenance window, evidence ref, gated SIGTERM, post-check, and KM / PlayBook / Verifier writeback exist.
Completion: host runaway monitoring / alert / PlayBook / Telegram event packet / production deploy readback 100%; runtime auto-remediation remains safely gated at 0.

2026-06-18 15:08 Asia/Taipei
Phase: P3-009 Host runaway AIOps loop product readback
Before: Monitoring, alert rules, event packet routing, live scrape, and production deploy readback were complete, but governance UI still lacked a single product-visible loop state for monitor -> alert -> event packet -> PlayBook -> KM / Verifier -> gated remediation.
After: Added `host_runaway_aiops_loop_readiness_v1` committed snapshot, schema, strict API loader, endpoint `/api/v1/agents/agent-host-runaway-aiops-loop-readiness`, regression tests, API client type, and governance automation-inventory card. The card shows 6 loop stages, 2 alert lanes, 5 asset writeback contracts, host 110 live readback, deploy marker 2d278568, orphan groups 0, and runtime writes 0.
Evidence: `apps/api/tests/test_host_runaway_aiops_loop_readiness.py` + API test `9 passed`; web typecheck passed using a temporary existing node_modules symlink that was removed before commit; snapshot/schema/messages JSON parse and py_compile passed.
Blocked: No for product readback; yes for runtime remediation by design.
Next: If a real or fixture alert fires, verify Telegram card, AwoooP Work Item, KM / PlayBook / Verifier fields agree before considering any owner-approved non-production gated SIGTERM drill.
Completion: host runaway AIOps product-visible loop readback 100%; runtime auto-remediation remains safely gated at 0.

2026-06-18 16:08 Asia/Taipei
Phase: P3-009 Host runaway AIOps loop production verification
Before: P3-009 source, API, UI and tests were pushed, but production still needed deploy marker, API readback, desktop/mobile browser smoke, and CD runner lock recovery evidence.
After: Final deploy marker `42c08ece chore(cd): deploy 27143fb [skip ci]` is live after CD runner lock fixes `fc6c01ee` / `84ca8423` / `27143fb0`; `cd.yaml #3177` and `code-review.yaml #3178` are successful. Production endpoint `/api/v1/agents/agent-host-runaway-aiops-loop-readiness` returns `schema_version=host_runaway_aiops_loop_readiness_v1`, `current_task_id=P3-009`, `next_task_id=P3-010`, completion `100`, loop stages `6`, alert lanes `2`, writeback contracts `5`, host `110`, orphan browser groups `0`, active CI containers `2`, and every runtime/write/remediation counter `0`.
Evidence: API health `healthy / prod / mock_mode=false`; desktop `1440x1100` and mobile `390x844` governance smoke with deploy marker `42c08ece` have required text missing `0`, console/page errors `0`, horizontal overflow `false`, overflowing elements `0`; screenshots are `/tmp/awoooi-host-runaway-aiops-desktop-1440x1100-42c08ece.png` and `/tmp/awoooi-host-runaway-aiops-mobile-390x844-42c08ece.png`.
Blocked: No for production product readback. Yes for runtime remediation by design: process termination, Docker/systemd restart, Nginx reload, firewall/K8s action, Telegram live send, Gateway queue write, Bot API call, production write, and secret read remain `0 / false`.
Next: Treat the next real or fixture `HostOrphanBrowserSmokeHighCpu` as the acceptance drill for end-to-end Telegram card / AwoooP work item / KM / PlayBook / Verifier field agreement. Any actual SIGTERM remains owner-approved, maintenance-windowed, dry-run-first, and post-check-gated.
Completion: host runaway AIOps product-visible loop readback and production verification 100%; runtime auto-remediation remains safely gated at 0.

2026-06-18 13:43 Asia/Taipei
Phase: P1/P2/P3 live readback
Before: live cold-start was `PASS=83 WARN=1 BLOCKED=0`, result `DEGRADED`, because retained stale `km-vectorize-29689620` failed Job evidence was still counted as a service warning.
After: live cold-start is `PASS=84 WARN=0 BLOCKED=0`, result `GREEN`; P2 service readiness is now `100%`; overall recovery readiness is `99% SERVICE_GREEN_DR_ESCROW_BLOCKED`.
Evidence: `full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1`; K8s schedule counters `FAILED_JOBS=1`, `STALE_FAILED_JOBS=1`, `ACTIVE_FAILED_JOBS=0`, `BAD_PODS=0`; repo-side readiness audit `PASS=187 WARN=1 BLOCKED=0`; escrow readback `ESCROW_MISSING_COUNT=5`.
Blocked: no for full-stack service readiness. Yes for DR complete, because five credential escrow evidence markers still need real non-secret owner evidence IDs.
Next: use SOP v1.25 for the next reboot; record failed/stale/active Job counters separately; close B5 only after real credential escrow marker evidence exists.

2026-06-18 12:17 Asia/Taipei
Phase: P0/P2/P3 live readback
Before: repo-side readiness was complete, but live gate had not been rerun after the same-day push.
After: live cold-start is `PASS=83 WARN=1 BLOCKED=0`, result `DEGRADED`; final rollout readback shows API `2/2`, Web `2/2`, Worker `1/1`, Canary `1/1`, and API health `200 healthy`.
Evidence: `full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1`; read-only K8s deployment/job snapshot from 120; public API health readback.
Blocked: no hard blocker. One warning remains: stale retained Job `km-vectorize-29689620` from 2026-06-14 03:00; later official km-vectorize Jobs are Complete. DR complete still blocked by real credential escrow evidence markers.
Next: before any actual reboot, rerun the same live preflight and classify as `B3_SERVICE_AVAILABLE_DEGRADED` if only stale evidence remains, or `B4_FULL_STACK_GREEN` only when `WARN=0 BLOCKED=0`.
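The B3 / B4 rule in this entry can be applied mechanically to the scorecard summary line. A minimal sketch follows, assuming the summary has the `PASS=… WARN=… BLOCKED=…` shape quoted above. `classify_level` is a hypothetical helper, and the `BLOCKED_SEE_SOP` label is a placeholder, because the correct B-level for a blocked run depends on the SOP and is not stated here. The function cannot tell whether a warning is stale evidence only, so a B3 result still needs that confirmation from the operator.

```bash
#!/usr/bin/env bash
# Sketch only: map a scorecard summary line to the level named in this entry.
classify_level() {
  local line="$1" warn blocked
  warn="$(printf '%s\n' "$line"    | sed -n 's/.*WARN=\([0-9][0-9]*\).*/\1/p')"
  blocked="$(printf '%s\n' "$line" | sed -n 's/.*BLOCKED=\([0-9][0-9]*\).*/\1/p')"
  if [ -z "$warn" ] || [ -z "$blocked" ]; then
    echo "UNPARSEABLE"; return 2      # fail closed on unexpected input
  fi
  if [ "$blocked" -gt 0 ]; then
    echo "BLOCKED_SEE_SOP"            # placeholder label, not an SOP level
  elif [ "$warn" -gt 0 ]; then
    echo "B3_SERVICE_AVAILABLE_DEGRADED"
  else
    echo "B4_FULL_STACK_GREEN"
  fi
}

classify_level "PASS=83 WARN=1 BLOCKED=0"   # prints B3_SERVICE_AVAILABLE_DEGRADED
classify_level "PASS=84 WARN=0 BLOCKED=0"   # prints B4_FULL_STACK_GREEN
```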

2026-06-18 12:06 Asia/Taipei
Phase: P3
Before: repo-side readiness audit PASS=147 WARN=2 BLOCKED=37 before blocker batch; after Plan B-only guard it still had pre-existing blockers.
After: repo-side readiness audit PASS=185 WARN=1 BLOCKED=0, result READY WITH WARNINGS.
Evidence: full-stack-cold-start-check.sh now emits NODE_FS_ERROR_EVENTS and blocks K3s release on node filesystem evidence; backup-awoooi.sh no longer runs direct service-level rclone sync; 110-devops.yml manages cold-start monitor, runner guardrails, textfile exporters, backup scripts, daily backup heartbeat, offsite evidence report and offsite full-sync verifier; 188-ai-web.yml uses host-owned /home/ollama/bin/momo-pg-backup.sh and no longer contains the old app-directory backup cron path; nginx-sync.yml includes 188-internal-tools-https.conf.j2; ansible-lint.yml now runs self-hosted validation across Ansible, ops baseline, monitoring rules, backup scripts, reboot scripts, docs and workflow changes; bootstrap-ansible-validation-env.sh selects Python 3.11/3.10 for pinned ansible-core; ansible-validate.sh passes YAML, shell, Python, doc secret, backup alert label, recovery scorecard, Ansible syntax-check and ansible-lint minimum profile.
Blocked: no for repo-side reboot readiness contracts. Yes for live reboot authorization until same-day live checks run; yes for DR complete while credential escrow evidence markers remain missing.
Next: before an actual reboot, run the same-day live preflight and then the live cold-start gate with --live or the 110 deployed monitor; do not use repo-side READY WITH WARNINGS as a substitute for host/runtime truth.

2026-06-18 11:48 Asia/Taipei
Phase: P3
Before: P3 100%
After: P3 100%
Evidence: ops/reboot-recovery/full-stack-cold-start-baseline.yml now has a machine-readable plan_b section with red lines, triggers, host paths, B0-B5 levels, T+0/T+120 timeline, and closeout states; scripts/reboot-recovery/reboot-recovery-readiness-audit.sh now checks SOP and baseline for Plan B markers. Targeted assertion returned PLAN_B_BASELINE_ASSERTIONS_OK levels=6 closeout=3 timeline_stop=T+120. Full readiness audit confirms all new Plan B checks pass, but overall audit remains NOT READY because of pre-existing Ansible / workflow / backup-contract blockers unrelated to this Plan B addition.
Blocked: no for Plan B mechanism. Yes for overall reboot automation readiness audit until the existing non-Plan-B BLOCKED rows are resolved.
Next: continue closing pre-existing readiness-audit blockers by priority, without changing runtime or pretending the overall audit is green.

2026-06-18 11:41 Asia/Taipei
Phase: P3
Before: P3 100%
After: P3 100%
Evidence: docs/runbooks/FULL-STACK-COLD-START-SOP.md updated to v1.22 with explicit Plan B degraded-operation path, B0-B5 service levels, Plan B trigger table, host-specific fallback routes for 110/120/121/188/K3s/public gateway, T+0/T+120 fallback timeline, and Plan B closeout states. This workplan now requires every future reboot record to compare actual timing and blockers against SOP §1.4, not only the Plan A cold-start chain.
Blocked: no for documentation. Live reboot authorization still requires fresh same-day preflight before any maintenance window; DR complete remains blocked while credential escrow missing count is 5.
Next: before the next host reboot, rerun live preflight, choose Plan A or Plan B entry criteria, then record final level as B0/B1/B2/B3/B4/B5 with the exact blocker.
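A reboot record that always carries a final level and an exact blocker could be enforced with a small validated record type. The sketch below assumes only what this entry states (plans A/B, levels B0 through B5, a blocker string); the class name, field names, and the convention of writing `none` when nothing blocked are hypothetical and are not defined by the SOP.

```python
from dataclasses import dataclass

PLANS = ("A", "B")
LEVELS = ("B0", "B1", "B2", "B3", "B4", "B5")


@dataclass(frozen=True)
class RebootOutcome:
    plan: str      # entry criteria chosen before the reboot
    level: str     # final service level recorded at closeout
    blocker: str   # exact blocker text; assumed to be "none" when nothing blocked

    def __post_init__(self):
        if self.plan not in PLANS:
            raise ValueError(f"plan must be one of {PLANS}, got {self.plan!r}")
        if self.level not in LEVELS:
            raise ValueError(f"level must be one of {LEVELS}, got {self.level!r}")
        if not self.blocker.strip():
            raise ValueError("blocker must be stated explicitly, even if it is 'none'")


if __name__ == "__main__":
    print(RebootOutcome(plan="B", level="B2", blocker="120 unreachable, console required"))
```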

2026-06-14 18:15 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 97%, P1 92%, P2 99%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: latest repo docs baseline 0a4766dd; runtime deploy marker 16c6b983 put image e999c16b3435f197b78fe2adfeec1c4faa6c4675 live for API/Web/Worker/CronJob; ArgoCD source revision 0a4766ddc94b0690824ce3deba5c6b9a69764f94 remains Synced/Degraded, still only because of km-vectorize; API/Web split across mon/mon1; Worker on mon; IwoooS route /zh-TW/iwooos returned 200; AwoooP route /zh-TW/awooop returned 200; 110 systemctl --failed returned 0 loaded units listed; backup-status core_blockers=0 and escrow_missing=5; final cold-start PASS=82 WARN=1 BLOCKED=0.
Blocked: yes for full cold-start green, because km-vectorize-29689620 remains failed until the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: keep the 03:00 official schedule gate; after the next official km-vectorize run, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health.

2026-06-14 17:04 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 97%, P1 92%, P2 99%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: latest docs baseline af62ec1f; runtime deploy marker ed651a98 put image e992af89955f8aae40a383b2f2e2f645445a690d live for API/Web/Worker/CronJob; ArgoCD source revision af62ec1fe72b3e84e179d80e788e5a5902bdaf27 remains Synced/Degraded, still only because of km-vectorize; API/Web split across mon/mon1; Worker on mon1; IwoooS route /zh-TW/iwooos returned 200; 110 systemctl --failed returned 0 loaded units listed; backup-status core_blockers=0 and escrow_missing=5; final cold-start PASS=82 WARN=1 BLOCKED=0.
Blocked: yes for full cold-start green, because km-vectorize-29689620 remains failed until the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: keep the 03:00 official schedule gate; after the next official km-vectorize run, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health.

2026-06-14 16:29 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 97%, P1 92%, P2 99%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: latest docs baseline 06fe0a8f; runtime deploy marker 36fbfc6b put image 386dbd078ef63401d9736048463f4ef5326442d9 live for API/Web/Worker/CronJob; ArgoCD source revision 06fe0a8f14167824fea512f942d2569431bbcbc8 remains Synced/Degraded, still only because of km-vectorize; API/Web split across mon/mon1; Worker on mon; 110 systemctl --failed returned 0 loaded units listed; backup-status core_blockers=0 and escrow_missing=5; P2-145 endpoint current=P2-145 completion=100, with owner response received/accepted/rejected and reviewer/Gateway/Telegram/Bot API/result capture/learning/PlayBook trust/production write/secret/destructive all 0/false; final cold-start PASS=82 WARN=1 BLOCKED=0.
Blocked: yes for full cold-start green, because km-vectorize-29689620 remains failed until the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: keep the 03:00 official schedule gate; after the next official km-vectorize run, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health.

2026-06-14 15:58 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 97%, P1 92%, P2 99%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: gitea/main advanced to deploy marker 180a6543; image fef94df877c5438f9f34ddbcace8ad8112a141ef is live for API/Web/Worker; ArgoCD source revision 180a6543eaf26dd6b345d45114316926056a965a remains Synced/Degraded, still only because of km-vectorize; API/Web split across mon/mon1; Worker on mon1; 110 systemctl --failed returned 0 loaded units listed; backup-status core_blockers=0 and escrow_missing=5; P2-144 endpoint current=P2-144 completion=100, with owner response received/accepted/rejected and reviewer/Gateway/Telegram/Bot API/result capture/learning/PlayBook trust/production write/secret/destructive all 0/false; final cold-start PASS=82 WARN=1 BLOCKED=0.
Blocked: yes for full cold-start green, because km-vectorize-29689620 remains failed until the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: keep the 03:00 official schedule gate; after the next official km-vectorize run, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health.

2026-06-14 15:00 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 97%, P1 92%, P2 99%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: latest docs baseline b09eb1c6; runtime deploy marker 667d6329 put image 755b0a8d3038df2c52dee280067863d92db1eda5 live for API/Web/Worker/CronJob; ArgoCD revision 4abf0c0f750254d3c7137eae049abdfd99630f5f remains Synced/Degraded, still only because of km-vectorize; API/Web split across mon/mon1; Worker on mon; 110 systemctl --failed returned 0 loaded units listed; backup-status core_blockers=0 and escrow_missing=5; P2-143 endpoint current=P2-143 completion=100, with writer/Gateway/Telegram/Bot API/production write/secret/destructive all 0/false; final cold-start PASS=82 WARN=1 BLOCKED=0.
Blocked: yes for full cold-start green, because km-vectorize-29689620 remains failed until the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: keep the 03:00 official schedule gate; after the next official km-vectorize run, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health.

2026-06-14 10:40 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 97%, P1 92%, P2 99%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: latest docs head observed before this recovery commit 50d4f2ba; runtime deploy marker d023f5d7 put image f737f278 live for API/Web/Worker/CronJob; ArgoCD revision 50d4f2ba; API/Web split across mon/mon1; Worker on mon; 110 systemctl --failed returned 0 loaded units listed and fwupd-refresh.timer remained disabled/inactive; backup-status core_blockers=0 and escrow_missing=5; final cold-start PASS=82 WARN=1 BLOCKED=0.
Blocked: yes for full cold-start green, because km-vectorize-29689620 remains failed until the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: keep the 03:00 official schedule gate; after the next official km-vectorize run, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health.

2026-06-14 09:56 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 97%, P1 92%, P2 99%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: latest docs head before this recovery commit a0fe7741; runtime deploy marker and ArgoCD revision 60a0415c put image a3de0ffb live for API/Web/Worker/CronJob; API/Web split across mon/mon1; Worker on mon1; 110 systemctl --failed returned 0 loaded units listed and fwupd-refresh.timer remained disabled/inactive; backup-status core_blockers=0 and escrow_missing=5; final cold-start PASS=82 WARN=1 BLOCKED=0.
Blocked: yes for full cold-start green, because km-vectorize-29689620 remains failed until the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: keep the 03:00 official schedule gate; after the next official km-vectorize run, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health.

2026-06-14 09:27 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 97%, P1 92%, P2 99%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: gitea/main 5bad267e and ArgoCD revision 5bad267e; deploy marker 8d575c1a put image 280e0fbe live for API/Web/Worker/CronJob; API/Web split across mon/mon1; 110 systemctl --failed returned 0 loaded units listed and fwupd-refresh.timer remained disabled/inactive; backup-status core_blockers=0 and escrow_missing=5; first cold-start had transient stock 502 during stockplatform-v2 warmup, direct route/TLS recheck returned 200, final cold-start PASS=82 WARN=1 BLOCKED=0.
Blocked: yes for full cold-start green, because km-vectorize-29689620 remains failed until the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: keep the 03:00 official schedule gate; after the next official km-vectorize run, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health.

2026-06-14 08:40 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 97%, P1 92%, P2 99%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: gitea/main and ArgoCD revision 18b867c3; deploy marker 18b867c3 put image e0a6d339 live for API/Web/Worker/CronJob; API/Web split across mon/mon1; 110 systemctl --failed returned 0 loaded units listed and fwupd-refresh.timer remained disabled/inactive; backup-status core_blockers=0 and escrow_missing=5; cold-start PASS=82 WARN=1 BLOCKED=0.
Blocked: yes for full cold-start green, because km-vectorize-29689620 remains failed until the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: keep the 03:00 official schedule gate; after the next official km-vectorize run, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health.

2026-06-14 08:24 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 96%, P1 92%, P2 98%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: 110 fwupd-refresh.timer disabled/inactive with rollback command recorded; systemctl --failed returned 0 loaded units listed; backup-status 110 13/13 fresh failed=0 and 188 2/2 fresh failed=0 with core_blockers=0 and escrow_missing=5; cold-start PASS=82 WARN=1 BLOCKED=0; ArgoCD/CronJob still waiting for official km-vectorize lastSuccessfulTime after deploy marker ec03f0b7 / image 8ddb80d6.
Blocked: yes for full cold-start green, because km-vectorize-29689620 remains failed until the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: after the next 03:00 Asia/Taipei official km-vectorize schedule, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health; do not manual-run, delete, patch, or fake evidence.
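The read-only km-vectorize check can be reduced to comparing two fields of the CronJob status. The sketch below assumes the CronJob object has already been exported as JSON by a read-only query and that timestamps use the usual Kubernetes `YYYY-MM-DDTHH:MM:SSZ` form; `lastScheduleTime` and `lastSuccessfulTime` are standard CronJob status fields, while the function names and the sample timestamps are illustrative and not taken from the cluster.

```python
from datetime import datetime, timezone

K8S_TIME = "%Y-%m-%dT%H:%M:%SZ"


def parse_k8s_time(value: str) -> datetime:
    """Parse a Kubernetes UTC timestamp into an aware datetime."""
    return datetime.strptime(value, K8S_TIME).replace(tzinfo=timezone.utc)


def latest_schedule_succeeded(cronjob: dict) -> bool:
    """True only if a success was recorded at or after the latest schedule time."""
    status = cronjob.get("status", {})
    scheduled = status.get("lastScheduleTime")
    succeeded = status.get("lastSuccessfulTime")
    if not scheduled or not succeeded:
        # Missing evidence is treated as not verified, never as success.
        return False
    return parse_k8s_time(succeeded) >= parse_k8s_time(scheduled)


if __name__ == "__main__":
    # Illustrative values: 03:00 Asia/Taipei corresponds to 19:00 UTC the previous day.
    stale = {"status": {"lastScheduleTime": "2026-06-13T19:00:00Z",
                        "lastSuccessfulTime": "2026-06-12T19:04:10Z"}}
    print(latest_schedule_succeeded(stale))
```

Because the check only reads status, it stays within the rule above: no manual run, delete, or patch is needed to decide whether the official schedule has produced a success yet.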

2026-06-13 01:29 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 95%, P1 90%, P2 100%, P3 100%
After: Overall 95%, P1 90%, P2 100%, P3 100%
Evidence: Gitea main e4a349bc; ArgoCD revision e4a349bc sync=Synced health=Degraded only by km-vectorize stale success; K3s images 414413a59268eedd391648f112e228716dd05362; API/Web split across mon/mon1; /home/wooo/.ssh/known_hosts retained 120/188 after CD fix 80e6ec1a; backup-status 110 13/13 fresh failed=0 and 188 2/2 fresh failed=0; offsite textfile remote_verify_ok=1 and 13 repos snapshot_count=1; backup alert live visibility OK; all five required Prometheus alert rule names health=ok; cold-start PASS=83 WARN=0 BLOCKED=0.
Blocked: yes for DR complete only, because 5 credential escrow evidence markers are still missing; full ArgoCD health still waits for the official 03:00 km-vectorize lastSuccessfulTime.
Next: after 03:00 Asia/Taipei, verify km-vectorize official Job completion and ArgoCD health; keep escrow alerts firing until real non-secret evidence IDs are written.

2026-06-04 15:23 Asia/Taipei
Phase: P3
Before: 78%
After: 95%
Evidence: infra/ansible/roles/nginx/templates/188-all-sites.conf.j2 now contains aiops VIP upstreams 192.168.0.125:32334/32335; live smoke aiops / -> 307 and /api/v1/health -> 200; content guard passed.
Blocked: no for route baseline; ansible-playbook is unavailable on this workstation, so syntax-check remains delegated to the normal Ansible environment before next apply.
Next: run Ansible syntax/apply validation from the Ansible host before changing 188 nginx live config.

2026-06-04 15:23 Asia/Taipei
Phase: P2
Before: 52%
After: 66%
Evidence: /Users/ogt/momo-pro-system/services/import_service.py updated; /Users/ogt/momo-pro-system/tests/test_daily_sales_monthly_sync_failure.py added; targeted pytest passed with temp SQLite and real Excel input.
Blocked: yes. Live 188 uses /home/ollama/momo-pro bind-mounted code, while momo/ewoooc canonical source remains unresolved.
Next: reconcile canonical source/deploy path, apply the same monthly-sync failure contract to live, then run controlled live auto-import failure-path verification.

2026-06-04 15:34 Asia/Taipei
Phase: P2
Before: 66%
After: 86%
Evidence: live /home/ollama/momo-pro/services/import_service.py patched from backup services/import_service.py.bak.20260604-152827; live hash 3fc45671986fa4cc155119f588bc1ebefd272927730052e42e2b9eb4352b2586; container isolated temp-DB/real-Excel contract test passed; momo-scheduler and momo-pro-system restarted and healthy; mo.wooo.work /health 200; latest DB parity daily=404 and monthly=404 for 2026-06-02.
Blocked: no for momo failure contract. Overall remains blocked by 120 reachability and credential escrow.
Next: observe the next real Google Drive import and keep canonical momo/ewoooc source-control reconciliation as a separate supply-chain item.

2026-06-04 15:50 Asia/Taipei
Phase: P1
Before: 58%
After: 72%
Evidence: /backup/scripts/backup-status.sh --no-notify initially showed stale110=awoooi_db, stale188=momo_pg_daily, configured_missing_188=1; manual 188 momo PostgreSQL backup completed and kept latest-only; manual 110 backup-awoooi-frequent completed with restic snapshot 7440d75f; 188 crontab now points momo_pg_daily to /home/ollama/bin/momo-pg-backup.sh; final backup-status shows stale110=none, stale188=none, configured_missing_188=0, core_blockers=1, escrow_missing=5.
Blocked: yes. 120 config capture still keeps aggregate backup red, and five credential escrow evidence markers are still missing.
Next: after 120 returns, rerun backup-configs, backup-all, offsite sync, full offsite verify, then cold-start scorecard; separately fill escrow only with real non-secret evidence IDs.
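The way the backup-status fields map to a red or green aggregate can be sketched as follows. This assumes the status is available as `key=value` tokens in the shape quoted in the evidence line above; the real output format of backup-status.sh may differ, and the function names and the rule that a missing field counts as red are assumptions.

```python
import re

STALE_KEYS = ("stale110", "stale188")
COUNT_KEYS = ("configured_missing_188", "core_blockers", "escrow_missing")


def parse_status(text: str) -> dict:
    """Collect key=value tokens, ignoring separators such as commas and semicolons."""
    return dict(re.findall(r"\b([a-z0-9_]+)=([^\s,;]+)", text))


def red_reasons(text: str) -> list:
    """Return every field that keeps the aggregate red; an empty list means green."""
    fields = parse_status(text)
    reasons = []
    for key in STALE_KEYS:
        if fields.get(key) != "none":
            reasons.append(f"{key}={fields.get(key)}")
    for key in COUNT_KEYS:
        if fields.get(key) != "0":
            reasons.append(f"{key}={fields.get(key)}")
    return reasons


if __name__ == "__main__":
    final = ("stale110=none, stale188=none, configured_missing_188=0, "
             "core_blockers=1, escrow_missing=5")
    print(red_reasons(final))
```

Listing the reasons instead of returning a single boolean keeps the two remaining blockers (120 config capture and escrow markers) visible as separate items.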

2026-06-04 18:55 Asia/Taipei
Phase: P0/P1/P2
Before: Overall 60%, P1 72%, P2 86%
After: Overall 61%, P1 74%, P2 88%
Evidence: local ping to 192.168.0.120 still 0/3, SSH 22 timed out, ARP incomplete; 121 kubectl still shows mon NotReady,SchedulingDisabled and mon1 Ready; 110 backup-status --no-notify shows stale110=none, stale188=none, configured_missing_188=0, core_blockers=1, escrow_missing=5; cold-start scorecard now reports PASS=71 WARN=3 BLOCKED=3 and momo monthly parity 2215/2215 for 2026-06-01 through 2026-06-04.
Blocked: yes. The three hard blocks are still 120 ping, 120 SSH, and 120 K3s read-only check; escrow remains missing 5 evidence markers.
Next: wait for physical/console recovery of 120, then run the required backup-configs / backup-all / offsite sync / full verify / cold-start sequence.

2026-06-04 19:02 Asia/Taipei
Phase: P0/P3
Before: Overall 61%, P0 35%, P3 95%
After: Overall 62%, P0 36%, P3 96%
Evidence: local/110/121/188 all failed to reach 192.168.0.120; 121 returned Destination Host Unreachable; kubectl describe node mon shows LastHeartbeatTime 2026-05-22 02:44:13 +08, Ready Unknown since 2026-05-22 02:49:48 +08, and kube-node-lease renewTime 2026-05-22 02:48:36 +08; 120-fsck-maintenance-checklist.sh --no-color returned PASS=2 WARN=2 BLOCKED=3 and MAINTENANCE REQUIRED; repo search found no BMC/IPMI/WOL inventory for 120.
Blocked: yes. 120 requires physical or VM console recovery before backup-configs, backup-all, offsite sync, and full cold-start can be made green.
Next: use console to verify 120 power/NIC/boot/initramfs state, perform offline fsck only if needed, then restore SSH and run the required recovery sequence.
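The heartbeat evidence above can be turned into an explicit age calculation. This sketch assumes both timestamps are in Asia/Taipei (which has no daylight saving time), so they are compared as naive local times; the 5-minute staleness threshold and the function names are hypothetical and are not a K3s setting taken from this environment.

```python
from datetime import datetime, timedelta

LOCAL_FMT = "%Y-%m-%d %H:%M:%S"


def heartbeat_age(last_heartbeat: str, checked_at: str) -> timedelta:
    """Time elapsed between the node's last heartbeat and the check."""
    return datetime.strptime(checked_at, LOCAL_FMT) - datetime.strptime(last_heartbeat, LOCAL_FMT)


def node_is_stale(last_heartbeat: str, checked_at: str,
                  threshold: timedelta = timedelta(minutes=5)) -> bool:
    """Hypothetical rule: a heartbeat older than the threshold marks the node stale."""
    return heartbeat_age(last_heartbeat, checked_at) > threshold


if __name__ == "__main__":
    # LastHeartbeatTime from the evidence line, checked at this entry's timestamp.
    age = heartbeat_age("2026-05-22 02:44:13", "2026-06-04 19:02:00")
    print(age, node_is_stale("2026-05-22 02:44:13", "2026-06-04 19:02:00"))
```

An age of nearly two weeks is far beyond any transient network blip, which is consistent with requiring console recovery instead of waiting for the node to rejoin.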

2026-06-05 18:40 Asia/Taipei
Phase: P0/P1/P3
Before: Overall 62%, P1 74%, P3 96%
After: Overall 64%, P1 80%, P3 97%
Evidence: 120 remains unreachable from local/110/121/188 and K3s mon remains NotReady,SchedulingDisabled; 14:00 AWOOOI high-frequency backup had failed, then 16:01 manual high-frequency backup completed snapshot b7d5ee4e; Gitea stale container dump /tmp/gitea-dump.zip was preserved as /tmp/gitea-dump.stale.20260605_161032.zip, script hardened, and manual Gitea backup completed snapshot ea641613; Open-WebUI d1147507, ClawBot 73ead3cc, AI artifacts b1161ab8 completed; partial offsite sync for five changed repos completed 5/5; verify-offsite-full-sync reports REMOTE_LATEST_ONLY_OK=1 and VERIFY_OK=1; final backup-status shows stale110=none, stale188=none, core_blockers=6, escrow_missing=5; cold-start remains PASS=71 WARN=3 BLOCKED=3.
Blocked: yes. 120 remains the P0 blocker, backup_all failed history remains red until backup-all can rerun after 120 returns, and credential escrow still lacks five non-secret evidence markers.
Next: monitor the 20:00 high-frequency backup, keep 120 console recovery as P0, then rerun backup-configs / backup-all / offsite sync / full verify / cold-start after 120 returns.

2026-06-06 14:47 Asia/Taipei
Phase: P0/P1/P2
Before: Overall 64%, P1 80%, P2 88%
After: Overall 65%, P1 84%, P2 89%
Evidence: 120 ping still failed, SSH timed out, ARP incomplete, and K3s mon remains NotReady,SchedulingDisabled; 06-06 02:00 aggregate failed only Configs (12/13 success) due to the 120 config capture blocker; backup-status at 14:46 shows stale110=none, stale188=none, failed=1, core_blockers=1, escrow_missing=5; verify-offsite-full-sync shows all 13 remote repos snapshots=1, REMOTE_LATEST_ONLY_OK=1, VERIFY_OK=1; cold-start reports PASS=70 WARN=4 BLOCKED=3; momo scheduler direct log activity count over the last 15 minutes is 151 despite the scorecard WARN.
Blocked: yes. 120 remains unreachable, aggregate backup cannot be green until backup-configs and backup-all rerun after 120 returns, and credential escrow still lacks five evidence markers.
Next: keep 120 console recovery as P0, keep escrow marker collection separate from secrets, and rerun the required backup/offsite/cold-start sequence only after 120 is reachable.

2026-06-06 15:00 Asia/Taipei
Phase: P3
Before: P3 97%
After: P3 98%
Evidence: docs/runbooks/FULL-STACK-COLD-START-SOP.md updated to v1.3 with 2026-06-06 live baseline, full shutdown/startup/single-host reboot SOP, mandatory reboot ledger template, and SOP version-comparison rules.
Blocked: no for documentation. Validation gap remains because ansible-playbook is unavailable on this workstation and 120 recovery still requires console access.
Next: after the next actual reboot or 120 console recovery, append a LOGBOOK reboot record and compare it against this 2026-06-06 baseline before changing SOP version again.

2026-06-06 15:03 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 65%, P2 89%, P3 98%
After: Overall 65%, P2 90%, P3 99%
Evidence: 120 still ping/SSH failed with ARP incomplete; 121 still shows mon NotReady,SchedulingDisabled and mon1 Ready; backup-status at 15:02 shows stale110=none, stale188=none, failed=1, core_blockers=1, escrow_missing=5; offsite verifier shows 13 repos snapshots=1 with REMOTE_LATEST_ONLY_OK=1 and VERIFY_OK=1; Alertmanager has all five required backup/cold-start rules; escrow report shows scripts/config present but 5 evidence markers missing; 15:03 cold-start reports PASS=71 WARN=3 BLOCKED=3; direct 188 momo-scheduler check is healthy with recent log activity.
Blocked: yes. The three hard blocks remain 120 ping, 120 SSH, and 120 K3s read-only check; aggregate backup remains blocked by 120 config capture; DR scorecard remains blocked by five missing non-secret escrow markers.
Next: do not fake escrow markers; after real non-secret evidence IDs are available, run mark-credential-escrow-verified.sh for the five items. Keep 120 console recovery as P0.

2026-06-06 15:06 Asia/Taipei
Phase: P1/P3
Before: Overall 65%, P1 84%, P3 99%
After: Overall 65%, P1 85%, P3 99%
Evidence: /backup/scripts/mark-credential-escrow-verified.sh --help confirms --dry-run support, allowed item names, and placeholder/secret rejection rules; docs/runbooks/BACKUP-STATUS.md now contains the credential escrow evidence checklist and safe marker flow.
Blocked: yes. No marker was written because no real non-secret evidence IDs were available in this session; escrow_missing remains 5.
Next: once real external evidence IDs exist, dry-run each item first, then write markers and rerun offsite-escrow-evidence-report plus backup-status.
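Before dry-running a marker, an evidence ID can be screened locally for the two failure modes this entry mentions: placeholders and secret-like values. The sketch below is a hypothetical pre-check and does not reproduce the rules inside mark-credential-escrow-verified.sh; the placeholder list, the regular expressions, and the example IDs are all assumptions.

```python
import re

# Assumed placeholder words; the real script's list is not shown in this logbook.
PLACEHOLDERS = {"todo", "tbd", "placeholder", "changeme", "example", "xxx", "n/a", "none"}
# Assumed hints that a value carries secret material instead of a reference to it.
SECRET_HINTS = re.compile(r"(-----BEGIN|password\s*=|token\s*=|secret\s*=)", re.IGNORECASE)
LONG_OPAQUE = re.compile(r"^[A-Za-z0-9+/=_-]{40,}$")


def evidence_id_problem(value: str):
    """Return a short reason if the ID should be rejected, or None if it passes this screen."""
    candidate = value.strip()
    if not candidate:
        return "empty"
    if candidate.lower() in PLACEHOLDERS:
        return "placeholder"
    if SECRET_HINTS.search(candidate) or LONG_OPAQUE.match(candidate):
        return "looks like a secret"
    return None


if __name__ == "__main__":
    for sample in ("TODO", "TICKET-1234", "password=hunter2"):
        print(sample, "->", evidence_id_problem(sample))
```

Passing this screen does not make an ID real; it only catches obvious mistakes before the script's own dry-run is invoked.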

2026-06-12 04:11 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 65%, P1 85%, P2 90%, P3 99%
After: Overall 66%, P1 86%, P2 90%, P3 99%
Evidence: 120 ping/SSH still failed with ARP incomplete; 121 still shows mon NotReady,SchedulingDisabled and mon1 Ready; 110->188 SSH host key trust repaired after matching ED25519 fingerprint; 02:00 backup-all completed 12/13 and failed only Configs due to 120; backup-status at 04:11 shows stale110=none, stale188=none, failed=1, core_blockers=1, escrow_missing=5; offsite sync from 03:00 is still running at 04:10.
Blocked: yes. Full reboot window is NO-GO until current offsite sync exits and a fresh offsite verifier passes; full green remains impossible while 120 is unreachable.
Next: wait for the 03:00 offsite sync to finish, run verify-offsite-full-sync, then rerun cold-start scorecard before approving any maintenance window.

2026-06-12 18:57 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 67%, P0 36%, P1 86%, P2 95%, P3 99%
After: Overall 95%, P0 100%, P1 90%, P2 97%, P3 100%
Evidence: 120 root fsck recovery booted at 15:13; 120/121 are both Ready control-plane; backup-configs and backup-all captured 120/121/K8s successfully; backup-all completed 13/13 at 15:54; full offsite sync completed 13/13 at 17:37 after documented recovery runway override to 240m; verify-offsite-full-sync returned REMOTE_LATEST_ONLY_OK=1, FULL_MARKER_FRESH=1, VERIFY_OK=1, FAILED=0; backup-status at 18:55 reports core_blockers=0 and escrow_missing=5; cold-start at 18:57 reports PASS=83 WARN=0 BLOCKED=0.
Blocked: yes for DR only. Service/full-stack recovery is green, but DR scorecard remains blocked until five credential escrow evidence markers are written with real non-secret evidence IDs.
Next: collect real credential escrow evidence IDs, dry-run each marker, then write markers and rerun offsite-escrow-evidence-report plus backup-status; separately plan AWOOOI API/Web topology spread before moving services from 110/188 to 120/121.

10. Completion Claims That Are Not Allowed Yet

Do not claim every future reboot is guaranteed green. This run is green for the latest verified evidence set only.
Do not silence credential escrow alerts. They are the remaining correct DR red light.
Do not claim DR scorecard complete. Credential escrow markers are missing.
Do not claim public-route success is system success. Route checks must be paired with DB, backup, schedules, Alertmanager, and cold-start scorecard evidence.
Do not claim the next real Google Drive import has succeeded until the post-import row counts/date bounds and Drive archive movement are rechecked.
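The rule that public-route success alone is not system success can be expressed as a check over a set of evidence flags. This is a minimal sketch: the evidence names mirror the items listed above (routes, DB, backup, schedules, Alertmanager, cold-start scorecard), but the dictionary keys and function names are hypothetical.

```python
REQUIRED_EVIDENCE = (
    "public_routes",
    "db",
    "backup",
    "schedules",
    "alertmanager",
    "cold_start_scorecard",
)


def missing_evidence(evidence: dict) -> list:
    """List every required evidence item that is absent or not verified."""
    return [name for name in REQUIRED_EVIDENCE if not evidence.get(name, False)]


def may_claim_system_success(evidence: dict) -> bool:
    """A success claim is allowed only when no required evidence is missing."""
    return not missing_evidence(evidence)


if __name__ == "__main__":
    routes_only = {"public_routes": True}
    print(may_claim_system_success(routes_only), missing_evidence(routes_only))
```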
