2026-06-04 Reboot / Cold-Start / Backup Recovery Workplan
Owner: SRE / DevOps commander
Timezone: Asia/Taipei
Baseline: 2026-06-04 15:00 live read-only checks. Do not reuse the 2026-05-29 baseline without rerunning checks.
Scope: 110 / 120 / 121 / 188. 112 is Kali and is intentionally excluded from this recovery wave.
1. Current Verdict
| Area | Status | Completion | Evidence |
|---|---|---|---|
| Overall recovery readiness | HOST_AND_CORE_SERVICE_GREEN_STOCK_DATA_BLOCKED_DR_ESCROW_BLOCKED | 96% | 2026-06-25 19:24 full post-start readback showed hosts / K3s / AWOOOI / MOMO / backup / offsite service gates green and escrow_missing=5; 2026-06-25 19:35 stricter product-data wrapper returned POST_START_QUICK_CHECK PASS=31 WARN=1 BLOCKED=1, result BLOCKED, because StockPlatform /api/v1/system/freshness is blocked with core_margin_short_daily_missing,ai_recommendations_stale. Expanded public route smoke covers AWOOOI, VibeWork, AwoooGo, 2026FIFA, Agent Bounty, MOMO, Stock, Bitan, TsenYang, VTuber, Gitea, Harbor, Registry, Sentry, SigNoz, Langfuse, and AIOps; all returned expected 2xx/3xx. MOMO remains fresh through 2026-06-24 with latest job 57 completed cleanly, and Bitan public-content cleanliness direct check passed. Do not declare "all products/data latest" until StockPlatform freshness is ok; do not declare DR complete until escrow_missing=0. |
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at 2026-06-12 15:13; latest 2026-06-25 09:05 readback shows 120 is reachable, K3s is active, mon and mon1 are both Ready control-plane, VIP 192.168.0.125 is present, node filesystem / disk-pressure / readonly events are 0, and latest km-vectorize-29705460-55rgs completed. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-25 19:17 backup readback shows 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, integrity_stale=0, offsite_fresh=1, rclone_gdrive_fresh=1, escrow_missing=5, last aggregate 2026-06-25 02:35:09. 2026-06-25 19:19 offsite escrow report shows script presence OK, rclone configured, full and partial rclone markers present, PASS=8 WARN=5 BLOCKED=0, ESCROW_MISSING_COUNT=5; DR remains blocked on real non-secret credential escrow evidence IDs. |
| P2 service / data truth | BLOCKED_STOCK_DATA_FRESHNESS | 92% | Service routes and core runtime are available, but product-data truth is not complete. 2026-06-25 19:35 StockPlatform /api/v1/system/freshness returned status=blocked, latest_trading_date=2026-06-25, blockers core_margin_short_daily_missing,ai_recommendations_stale; OK sources include price / chips / market index for 2026-06-25, while core.margin_short_daily and ai.recommendations stop at 2026-06-24. MOMO health V10.690, current-month parity `15383 |
| P3 docs / automation contracts | DONE_WITH_PRODUCT_DATA_GATE_V157 | 100% | Workplan, SOP v1.57, one-page post-start quick check v1.2, expanded public route list, StockPlatform freshness gate, baseline stockplatform_system_freshness_ok, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable plan_b baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD known_hosts guardrail, fwupd-refresh.timer rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and 2026-06-25 stricter product-data gate are updated. Live 110 script sync remains a separate approved live-write gate; do not claim it here. |
2026-06-25 19:06 post-CD wrapper readback supersedes the 18:53 wording: consecutive main pushes created a deploy storm where older deploy markers were superseded by later commits. Latest production truth is deploy marker d8ca8224 chore(cd): deploy 9dbe044 [skip ci], ArgoCD Synced / Healthy, API/Web/Worker image tag 9dbe044ea1e8e3894ccbeb5ed760bb124b87f7be, direct route smoke 200 for AWOOOI API / IwoooS / VibeWork / AwoooGo / MOMO health / Stock / Bitan and expected route-gate statuses for MOMO / Gitea / Harbor / Registry / Sentry / SigNoz / Langfuse / AIOps, and wrapper POST_START_QUICK_CHECK PASS=18 WARN=3 BLOCKED=0. Repo-side cold-start returns PASS=89 WARN=0 BLOCKED=0; /backup/scripts/backup-status.sh --no-notify --no-refresh reports 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, integrity_stale=0, offsite_fresh=1, rclone_gdrive_fresh=1, escrow_missing=5; MOMO dedicated preflight returns PASS=19 WARN=2 BLOCKED=0; MOMO health is V10.690; AwoooGo / Stock transient 502 reads cleared after upstream warmup and five consecutive route reads returned 200; 110 load is around 14.51 / 12.34 / 11.42, with Gitea Actions cache save / zstdmt / tar, StockPlatform headless Chrome smoke / CI, Gitea, AWOOOI API, ClickHouse, Docker, and platform services visible, not an AWOOOI service blocker. Wrapper result is FULL_STACK_GREEN_DR_ESCROW_BLOCKED, not DEGRADED, because service warnings are 0 and only DR boundary / evidence warnings remain. Wazuh route readback is now 200 disabled_waiting_iwooos_wazuh_owner_gate, but manager registry accepted remains 0, so Wazuh is a security registry evidence blocker rather than a reboot service blocker.
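The wrapper naming rule described above (service warnings decide DEGRADED, DR-only warnings do not) can be sketched as a small decision function. This is a minimal illustration, not the real wrapper: the function name, its parameters, and the `FULL_STACK_GREEN` label for the no-warning case are assumptions, since the wrapper's source is not shown in this workplan.

```python
def classify_wrapper_result(blocked: int, service_warnings: int, dr_warnings: int) -> str:
    """Illustrative decision rule; hypothetical, not the real wrapper logic."""
    if blocked > 0:
        # Any hard block wins over every warning class.
        return "BLOCKED"
    if service_warnings > 0:
        # Warnings that touch running services mean degraded operation.
        return "DEGRADED"
    if dr_warnings > 0:
        # Only DR boundary / evidence warnings remain: services are green.
        return "FULL_STACK_GREEN_DR_ESCROW_BLOCKED"
    # Assumed label for the fully clean case; not quoted from the workplan.
    return "FULL_STACK_GREEN"


# 19:06 readback: BLOCKED=0, WARN=3, and none of the warnings are service warnings.
print(classify_wrapper_result(blocked=0, service_warnings=0, dr_warnings=3))
```

Keeping service and DR warnings as separate inputs is what prevents a missing escrow marker from being reported as a service degradation.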
Full cold-start service readiness may now be declared GREEN for the latest verified evidence set. As of 2026-06-25 19:06, routes/hosts/K3s/backups/exporters/monitoring surfaces are available, AWOOOI API is healthy, MOMO service health is V10.690, and MOMO business data is fresh through 2026-06-24. The live read-only cold-start scorecard is PASS=89 WARN=0 BLOCKED=0, the post-start wrapper result is FULL_STACK_GREEN_DR_ESCROW_BLOCKED, AwoooGo / Stock route stability has been rechecked after transient warmup, and final API/Web workload placement is split across mon / mon1. Do not declare DR scorecard complete while credential escrow evidence remains blocked, and do not declare Wazuh registry recovery until manager registry evidence is accepted.
2026-06-25 19:35 stricter product-data gate readback supersedes the earlier "all product data green" interpretation. The full host/cold-start/backup layer remains green from the 19:24 read-only evidence, but the updated quick check now includes StockPlatform /api/v1/system/freshness and therefore blocks on product-data completeness: POST_START_QUICK_CHECK PASS=31 WARN=1 BLOCKED=1, RESULT=BLOCKED, blocker core_margin_short_daily_missing,ai_recommendations_stale. This is a correct no-false-green outcome: stock.wooo.work, /healthz, and /api/healthz all return 200, but StockPlatform data and AI recommendations are not latest. Next action is a separate StockPlatform data freshness remediation lane; do not solve it by host reboot, Nginx reload, Docker restart, or route-only smoke.
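The no-false-green behavior above can be illustrated with a gate that treats a 200 route as necessary but not sufficient. This is a sketch under assumptions: the field names `status`, `latest_trading_date`, and `blockers` come from the readback text, but the JSON shape of /api/v1/system/freshness and the function name are hypothetical.

```python
from typing import List, Tuple


def freshness_gate(http_status: int, payload: dict) -> Tuple[str, List[str]]:
    """Return (result, blockers). Payload shape is an assumption, not a documented API."""
    if http_status != 200:
        # The route itself is down; report that as the blocker.
        return "BLOCKED", [f"http_{http_status}"]
    blockers = list(payload.get("blockers") or [])
    if payload.get("status") != "ok" or blockers:
        # Route is up, but the data behind it is not the latest.
        return "BLOCKED", blockers or ["status_not_ok"]
    return "PASS", []


# Values taken from the 19:35 readback; the dict layout is illustrative.
readback = {
    "status": "blocked",
    "latest_trading_date": "2026-06-25",
    "blockers": ["core_margin_short_daily_missing", "ai_recommendations_stale"],
}
print(freshness_gate(200, readback))
```

A route-only smoke would have stopped at the HTTP status, which is why it cannot stand in for this gate.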
2026-06-13 01:26 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing survived the next normal CD deploy: Gitea main e4a349bc, ArgoCD revision e4a349bc, images from 414413a5, API/Web split across mon / mon1, and global known_hosts retained 120 / 188 after CD fix 80e6ec1a. Do not declare DR complete while credential escrow is missing. km-vectorize remediation is 90%: schedule/label fix is live, and the remaining gate is the next official 03:00 CronJob success readback.
2. Live Check Evidence, 2026-06-04
| Target | Live result | Notes |
|---|---|---|
| 192.168.0.110 | ping OK, SSH port OK | Boot 2026-05-06 12:12; load was elevated around 10.54 7.42 6.28; cron and Docker active. |
| 192.168.0.120 | ping failed, SSH port failed | ARP incomplete; K3s node mon remains NotReady,SchedulingDisabled. |
| 192.168.0.121 | ping OK, SSH port OK | Boot 2026-05-22 02:30; sudo kubectl get nodes shows mon1 Ready. |
| 192.168.0.188 | ping OK, SSH port OK | Boot 2026-05-06 12:07; Docker/PostgreSQL/Redis/nginx active; momo containers healthy. |
| Cold-start scorecard | BLOCKED_BY_120 | 2026-06-12 14:47 read-only rerun: PASS=72 WARN=2 BLOCKED=3; hard blocks remain 120 reachability / SSH / 120 K3s read-only check. |
| Public routes | OK ingress only | 2026-06-12 14:47: awoooi, aiops, mo, momo_health, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan returned 2xx/3xx over HTTPS. |
| momo DB current-month parity | OK | Scorecard reports `4571 |
| 110 daily backup cron | OK | 02:00 backup-all, 03:00 rclone sync, 06:05 backup-status, 07:20 full offsite verify. |
| Backup freshness | OK with remaining aggregate blocker | 2026-06-05 18:40 status: stale110=none, stale188=none, configured_missing_188=0; remaining core_blockers=6 is 02:00 aggregate failure history plus 120 config capture. |
| Google Drive latest-only | OK | 2026-06-12 14:48 verifier: 13 repos, each remote snapshots=1, REMOTE_LATEST_ONLY_OK=1, FULL_MARKER_FRESH=1, VERIFY_OK=1, FAILED=0. |
| Live Prometheus / Alertmanager alert rules | OK | 2026-06-12 14:49 backup-alert-live-visibility-check.py returned BACKUP_ALERT_LIVE_VISIBILITY_OK; all five required backup/cold-start/escrow alerts are visible in Prometheus and Alertmanager. |
| Credential escrow | BLOCKED | Missing markers: break_glass_admin_credentials, dns_registrar_recovery, oauth_ai_provider_recovery, offsite_provider_credentials, restic_repository_password. |
| Config backup capture | BLOCKED until 120 returns | awoooi_backup_config_capture_ok{target="120-k3s-host-configs"} 0; critical failed count 1. |
| Live 110 script sync | OK | Six recovery/check scripts exist under /home/wooo/scripts/; /home/wooo/scripts/full-stack-cold-start-check.sh hash is 31321428207308d6c159fabb679d9f1d0848194b8e6d7eb7b04a2c05779ade46 after scheduler detector fix. |
| Gitea commit evidence | VERIFIED | Gitea main at 0260ec89... contains ae7b39d9 fix(ops): harden reboot recovery and backup alerts. |
| 188 nginx Ansible baseline | DONE | Template now pins aiops.wooo.work to VIP 192.168.0.125:32334/32335, contains no 192.168.0.120, and live smoke returned https://aiops.wooo.work/ 307 plus /api/v1/health 200. |
| 120 failure-domain triage | BLOCKED | 19:02 checks from local/110/121/188 all fail to reach 120; 121 reports Destination Host Unreachable; K3s node lease renew stopped at 2026-05-21T18:48:36Z; 120-fsck-maintenance-checklist.sh --no-color returns PASS=2 WARN=2 BLOCKED=3, MAINTENANCE REQUIRED. |
| 2026-06-05 backup remediation | BLOCKED with repaired freshness | 16:00 live check still had 120 down and stale110=awoooi_db; manual backups produced snapshots b7d5ee4e (AWOOOI high-frequency DB), ea641613 (Gitea), d1147507 (Open-WebUI), 73ead3cc (ClawBot), b1161ab8 (AI artifacts). 18:40 backup status: stale110=none, stale188=none, core_blockers=6, escrow_missing=5. |
| 2026-06-05 offsite closure | OK partial + full verify | Full sync was correctly skipped by runway gate; partial sync for awoooi gitea open-webui clawbot ai-artifacts completed 5/5; full verifier at 18:39 shows all 13 remote repos snapshots=1, REMOTE_LATEST_ONLY_OK=1, VERIFY_OK=1. |
| 2026-06-06 backup convergence | BLOCKED only by 120/escrow | 14:58 backup status: 110 13/13 fresh failed=1, 188 2/2 fresh failed=0, stale110=none, stale188=none, core_blockers=1, escrow_missing=5; 02:00 aggregate failed only Configs due 120. |
| 2026-06-06 offsite verify | OK | 14:46 verifier: all 13 remote repos snapshots=1, REMOTE_LATEST_ONLY_OK=1, VERIFY_OK=1. |
| 2026-06-06 cold-start scorecard | BLOCKED | 15:03 read-only rerun: PASS=71 WARN=3 BLOCKED=3; hard blocks remain 120 ping / SSH / K3s read-only check. Direct 188 scheduler check still shows momo-scheduler healthy and active. |
| 2026-06-12 pre-reboot check | NO-GO until offsite finishes | 120 still ping/SSH failed and ARP incomplete; 110->188 SSH host key trust was repaired; 04:11 backup status cleared stale110=awoooi_db after daily backup but still has failed=1/core_blockers=1 due 120 config capture; 03:00 offsite sync is still running at 04:10. |
| 2026-06-12 post-reboot recovery | SERVICE_GREEN_WITH_120_BLOCKER | 14:47 scorecard: PASS=72 WARN=2 BLOCKED=3; 110 failed units 0, Swap 0B, public routes/TLS green, momo scheduler and DB parity green, backup/offsite/alert surfaces green except the correct 120 config capture and escrow evidence red gates. |
| 2026-06-12 blocker pursuit | WAITING_EXTERNAL_ACCESS | 15:00 four-view 120 check still failed; no WOL/IPMI/vmrun/hypervisor entry found in repo, 110, 121, 188, local tools, or Chronicle-visible console. 15:02 escrow report shows offsite ready with warnings and all five escrow markers missing; no real non-secret evidence ID found in repo. |
| 2026-06-12 120 recovery closeout | SERVICE_GREEN_DR_ESCROW_BLOCKED | 120 root fsck was completed from console/initramfs and booted at 15:13; 15:54 backup-all finished 13/13; 17:37 full offsite sync finished 13/13; 18:55 offsite verifier returned REMOTE_LATEST_ONLY_OK=1, VERIFY_OK=1, FAILED=0; 18:55 backup-status shows core_blockers=0, escrow_missing=5; 18:57 cold-start is PASS=83 WARN=0 BLOCKED=0. |
| 2026-06-13 live refresh | SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED | 00:13 backup status: 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, escrow_missing=5; 00:33 cold-start exposed 110 known_hosts drift for 120 / 188, fixed after backup /home/wooo/.ssh/known_hosts.before-120-188-refresh.20260613-003416; 00:34 final cold-start: PASS=83 WARN=0 BLOCKED=0; live K3s has mon / mon1 Ready, API/Web are split 120 / 121. 188 host is degraded only because certbot.service and snap.certbot.renew.service failed; ArgoCD remains Degraded because km-vectorize CronJob last success is stale. Manual Job km-vectorize-codex-002709 did not leave verified completion evidence, so this remains open. |
| 2026-06-13 km-vectorize health remediation | IN_PROGRESS_92 | 13:37 live readback: ArgoCD revision 88dc08e5 is Synced / Degraded; only unhealthy resource is CronJob/awoooi-prod/km-vectorize with message CronJob has not completed its last execution successfully. CronJob lastScheduleTime=2026-06-12T19:00:00Z, lastSuccessfulTime=2026-06-04T11:00:37Z; no 2026-06-13 failed Job is retained because failedJobsHistoryLimit=0. GitOps candidate now changes km-vectorize to failedJobsHistoryLimit=3 so future 03:00 failures keep inspectable Job/Pod evidence. Next gate is ArgoCD sync plus the next official 03:00 success readback. |
| 2026-06-13 post-CD trust / workload verification | SERVICE_GREEN_CD_GUARDRAIL_HELD | Gitea main advanced to deploy marker e4a349bc chore(cd): deploy 414413a [skip ci]; ArgoCD revision is e4a349bc, sync Synced, health still Degraded only by km-vectorize stale success. Live K3s image readback uses 414413a59268eedd391648f112e228716dd05362; API pods split mon1 / mon, Web pods split mon / mon1, Worker is single replica on mon. 01:28 /home/wooo/.ssh/known_hosts mtime remains 2026-06-13 01:20:02 +0800 with 120 / 188 entries present; deploy-specific /home/wooo/.ssh/deploy_known_hosts mtime is 01:24:05, proving CD fix 80e6ec1a stopped clobbering global trust. 01:26 cold-start: PASS=83 WARN=0 BLOCKED=0. |
| 2026-06-13 API placement hardening | IN_PROGRESS | 12:43 live refresh showed cold-start PASS=83 WARN=0 BLOCKED=0, but API replicas 2/2 were on 120 even though topology spread existed. Root cause: whenUnsatisfiable=ScheduleAnyway is a soft preference. GitOps candidate changes API/Web/Worker to minDomains=2 + DoNotSchedule; completion requires ArgoCD sync, rollout readback, public route smoke, and cold-start rerun. |
| 2026-06-13 API rollout strategy hardening | LIVE_VERIFIED | First hard-spread rollout reached ArgoCD revision 17e017f5; DoNotSchedule was live, but API completed with both new pods on 121 because old 120 pods were still terminating during scheduling. Second GitOps rollout reached ArgoCD revision 60f653a0, API/Web use maxSurge=0, maxUnavailable=1, minDomains=2, DoNotSchedule, and both deployments are split mon / mon1. Public API / governance route smoke passed and 12:59 cold-start returned PASS=83 WARN=0 BLOCKED=0. |
| 2026-06-13 security mirror guard closure | LIVE_VERIFIED | Gitea main b557a4b5 restores apps/web/messages/en.json as the required Traditional Chinese mirror of zh-TW.json; security-mirror-progress-guard.py now passes. ArgoCD revision b557a4b5 is Synced / Degraded only by km-vectorize; API/Web/Worker are ready, API pods split mon / mon1, Web pods split mon1 / mon, public API health is healthy, zh/en governance routes are 200, backup status has core_blockers=0, and 13:52 cold-start is PASS=83 WARN=0 BLOCKED=0. |
| 2026-06-13 security mirror production image closeout | LIVE_VERIFIED | Gitea main 64ea2444 records the Web rebuild trigger. Deploy marker 2cc02f1c chore(cd): deploy 6cf8d3c [skip ci] put Web image 6cf8d3ca live; ArgoCD source revision later advanced to 64ea2444 while Web image correctly remains 6cf8d3ca because 64ea2444 is docs/changelog only. Public /zh-TW/governance and /en/governance return 200, API health is healthy, security-mirror-progress-guard.py passes, and 14:10 cold-start is PASS=83 WARN=0 BLOCKED=0. |
| 2026-06-13 final post-trigger deploy closeout | LIVE_VERIFIED | Deploy marker 834ccdba chore(cd): deploy bf86017 [skip ci] put API/Web/Worker image bf860177 live. ArgoCD revision 834ccdba is Synced / Degraded only by km-vectorize; routes /zh-TW/governance and /en/governance return 200, API health is healthy, source guards pass, backup status has core_blockers=0 and escrow_missing=5, and 14:13 cold-start is PASS=83 WARN=0 BLOCKED=0. |
| 2026-06-13 final goal audit refresh | SERVICE_GREEN_REMAINING_GATES_EXPLICIT | Clean worktree rebased onto a520c32d and reran source guards successfully; live ArgoCD tracks revision a520c32d with API/Web/Worker image e897c8bf, health Degraded only by km-vectorize; km-vectorize schedule remains 0 3 * * *, timeZone=Asia/Taipei, failedJobsHistoryLimit=3, and no failed Job is currently retained. Public /zh-TW/governance, /en/governance, and /api/v1/health are green; backup core blockers remain 0, escrow_missing=5; 14:16 cold-start is PASS=83 WARN=0 BLOCKED=0. Remaining gates: five credential escrow markers and next official 03:00 km-vectorize success readback. |
| 2026-06-14 km-vectorize official run follow-up | DEGRADED_EVIDENCE_RETENTION_LIVE | 03:00 official km-vectorize-29689620 ran from CronJob and failed with BackoffLimitExceeded; ArgoCD later auto-synced revision 8868c025 and remains Synced / Degraded. Job is retained, but failed Pod km-vectorize-29689620-nwpqz was deleted before logs could be read, so root cause remains unproven for this run. Live CronJob is now restartPolicy: Never plus terminationMessagePolicy: FallbackToLogsOnError, so the next official failure should retain Pod/log evidence. Backup core remains green, escrow_missing=5, and 03:11 cold-start is PASS=81 WARN=2 BLOCKED=0. |
| 2026-06-14 km-vectorize tenant context follow-up | ROOT_CAUSE_CANDIDATE_LIVE | Source audit shows cron_km_vectorize.py calls /api/v1/knowledge/embed-all without project context, while API middleware and get_db_context() require X-Project-ID / tenant context for fail-closed RLS. API logs show matching db_context_missing / Missing tenant context patterns. Deploy marker ec03f0b7 put image 8ddb80d6 live; CronJob now has KM_PROJECT_ID=awoooi, script sends X-Project-ID, targeted pytest 7 passed, and no manual Job was created. Completion still waits for the next official 03:00 success or retained failed Pod/log. |
| 2026-06-14 110 failed-unit cleanup | SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED | fwupd-refresh.timer is intentionally disabled / inactive after non-runtime firmware metadata refresh failed units were classified; rollback is sudo systemctl enable --now fwupd-refresh.timer. systemctl --failed now returns 0 loaded units listed; 08:24 cold-start improved to PASS=82 WARN=1 BLOCKED=0. Remaining warning is only K8s failed Job km-vectorize-29689620; backup core remains green and escrow_missing=5. |
| 2026-06-14 post-CD recovery readback | SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED | Gitea main / ArgoCD revision 18b867c3 synced after deploy marker 18b867c3 chore(cd): deploy e0a6d33 [skip ci]; API/Web/Worker/CronJob image is e0a6d339. API/Web remain split across mon / mon1, Worker is healthy on mon, public routes and TLS pass, backup core remains 0, escrow missing remains 5, and 08:40 cold-start remains PASS=82 WARN=1 BLOCKED=0. This proves no post-CD reboot recovery regression, but still not full green. |
| 2026-06-14 P2-135 deploy recovery readback | SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED | Gitea main 5bad267e and ArgoCD revision 5bad267e are synced after deploy marker 8d575c1a; API/Web/Worker/CronJob image is 280e0fbe. API/Web remain split across mon / mon1, Worker is healthy on mon1, backup core remains 0, escrow missing remains 5, and 09:27 cold-start rerun is PASS=82 WARN=1 BLOCKED=0. 09:26 first run saw transient stock.wooo.work 502 while stockplatform-v2 containers were under one minute old; direct route/TLS recheck and scorecard rerun returned 200. This proves no persistent post-P2-135 recovery regression, but still not full green. |
| 2026-06-14 P2-136 / AI Agent activity post-production-deploy recovery readback | SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED | The latest docs head before this recovery commit is a0fe7741; runtime deploy marker / ArgoCD revision 60a0415c is Synced / Degraded, and the API/Web/Worker/CronJob image is a3de0ffb. API/Web remain split across mon / mon1, Worker is healthy on mon1, backup core remains 0, escrow missing remains 5, and 09:56 cold-start is PASS=82 WARN=1 BLOCKED=0. This proves no recovery regression after the P2-136 / AI Agent activity production deploy, but still not full green. |
| 2026-06-14 P2-137 / CI smoke timeout recovery readback | SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED | The latest docs head before this recovery commit is 50d4f2ba; runtime deploy marker d023f5d7 put image f737f278 live, and ArgoCD revision 50d4f2ba is Synced / Degraded. API/Web remain split across mon / mon1, Worker is healthy on mon, backup core remains 0, escrow missing remains 5, and 10:40 cold-start is PASS=82 WARN=1 BLOCKED=0. This proves no recovery regression after the P2-137 / CI smoke timeout fix, but still not full green. |
| 2026-06-14 P2-143 owner response preflight recovery readback | SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED | The latest docs baseline is b09eb1c6; runtime deploy marker 667d6329 put image 755b0a8d3038df2c52dee280067863d92db1eda5 live, and ArgoCD revision 4abf0c0f750254d3c7137eae049abdfd99630f5f is Synced / Degraded. API/Web remain split across mon / mon1, Worker is healthy on mon, backup core remains 0, escrow missing remains 5, and 15:00 cold-start is PASS=82 WARN=1 BLOCKED=0. The P2-143 endpoint reports current P2-143 and completion 100, and all writer / Gateway / Telegram / Bot API / production write / secret read / destructive operation counters remain 0 / false. This proves no recovery regression after the P2-143 owner response preflight, but still not full green. |
| 2026-06-14 P2-144 owner response readback recovery readback | SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED | gitea/main advanced to deploy marker 180a6543; image fef94df877c5438f9f34ddbcace8ad8112a141ef is live, and ArgoCD source revision 180a6543eaf26dd6b345d45114316926056a965a is Synced / Degraded. API/Web remain split across mon / mon1, Worker is healthy on mon1, backup core remains 0, escrow missing remains 5, and 15:58 cold-start is PASS=82 WARN=1 BLOCKED=0. The P2-144 endpoint reports current P2-144 and completion 100, and owner response received / accepted / rejected, reviewer / Gateway / Telegram / Bot API / result capture / learning / PlayBook trust / production write / secret read / destructive operation all remain 0 / false. This proves no recovery regression after the P2-144 owner response readback, but still not full green. |
| 2026-06-14 P2-145 owner response acceptance threshold recovery readback | SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED | The latest docs baseline is 06fe0a8f; runtime deploy marker 36fbfc6b put image 386dbd078ef63401d9736048463f4ef5326442d9 live, and ArgoCD source revision 06fe0a8f14167824fea512f942d2569431bbcbc8 is Synced / Degraded. API/Web remain split across mon / mon1, Worker is healthy on mon, backup core remains 0, escrow missing remains 5, and 16:29 cold-start is PASS=82 WARN=1 BLOCKED=0. The P2-145 endpoint reports current P2-145 and completion 100, and owner response received / accepted / rejected, reviewer / Gateway / Telegram / Bot API / result capture / learning / PlayBook trust / production write / secret read / destructive operation all remain 0 / false. This proves no recovery regression after the P2-145 owner response acceptance threshold, but still not full green. |
| 2026-06-14 IwoooS P0 configuration control priority order recovery readback | SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED | The latest docs baseline is af62ec1f; runtime deploy marker ed651a98 put image e992af89955f8aae40a383b2f2e2f645445a690d live for API/Web/Worker/CronJob, and ArgoCD source revision af62ec1fe72b3e84e179d80e788e5a5902bdaf27 is Synced / Degraded. API/Web remain split across mon / mon1, Worker is healthy on mon1; IwoooS route /zh-TW/iwooos returned 200. Backup core remains 0, escrow missing remains 5, and 17:04 cold-start is PASS=82 WARN=1 BLOCKED=0. This proves no recovery regression after the IwoooS P0 configuration control priority order frontend release; it does not mean that Nginx reload, DNS/TLS/certbot, workflow/secret/public route/runtime gate, or production write has been authorized, and it is still not full green. |
| 2026-06-14 high-value configuration Owner Packet frontend sync recovery readback | SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED | The latest repo docs baseline is 0a4766dd; runtime deploy marker 16c6b983 put image e999c16b3435f197b78fe2adfeec1c4faa6c4675 live for API/Web/Worker/CronJob, and ArgoCD source revision 0a4766ddc94b0690824ce3deba5c6b9a69764f94 is Synced / Degraded. API/Web remain split across mon / mon1, Worker is healthy on mon; IwoooS route /zh-TW/iwooos and AwoooP route /zh-TW/awooop both returned 200. Backup core remains 0, escrow missing remains 5, and 18:15 cold-start is PASS=82 WARN=1 BLOCKED=0. This proves no recovery regression after the high-value configuration Owner Packet frontend sync; it does not mean that request sent, owner response received / accepted, Nginx reload, DNS/TLS/certbot, workflow/secret/public route/runtime gate, host write, active scan, or production write has been authorized, and it is still not full green. |
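Many rows above quote the same backup-status summary line. A minimal parsing sketch follows, assuming the summary keeps the shape quoted in this table (`110 13/13 fresh failed=0, ... core_blockers=0, ... escrow_missing=5`). The real output format of backup-status.sh is not shown here, and the function and key names `backup_core_green` / `dr_escrow_complete` are hypothetical.

```python
import re

# Per-host segment, e.g. "110 13/13 fresh failed=0".
HOST_RE = re.compile(r"\b(\d+) (\d+)/(\d+) fresh failed=(\d+)")
# Global counters quoted in the workplan readbacks.
KEY_RE = re.compile(
    r"\b(core_blockers|integrity_stale|offsite_fresh|rclone_gdrive_fresh|escrow_missing)=(\d+)"
)


def parse_backup_status(summary: str):
    """Split a summary line into per-host counts and global counters."""
    hosts = {
        host: {"fresh": int(fresh), "total": int(total), "failed": int(failed)}
        for host, fresh, total, failed in HOST_RE.findall(summary)
    }
    counters = {key: int(value) for key, value in KEY_RE.findall(summary)}
    return hosts, counters


def backup_verdict(summary: str) -> dict:
    """Keep the backup-core gate and the DR escrow gate as separate answers."""
    hosts, counters = parse_backup_status(summary)
    core_green = (
        bool(hosts)
        and all(h["fresh"] == h["total"] and h["failed"] == 0 for h in hosts.values())
        and counters.get("core_blockers") == 0
    )
    return {
        "backup_core_green": core_green,
        "dr_escrow_complete": counters.get("escrow_missing") == 0,
    }


line = ("110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, "
        "integrity_stale=0, offsite_fresh=1, rclone_gdrive_fresh=1, escrow_missing=5")
print(backup_verdict(line))
```

Returning two booleans rather than one status mirrors the table's recurring distinction between a green backup core and a still-blocked DR escrow.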
3. Progress Update Contract
Every phase update must change both status and percentage in this file.
| State | Meaning |
|---|---|
| NOT_STARTED | Listed but no live evidence gathered in this session. |
| IN_PROGRESS | Actively being checked or fixed. |
| BLOCKED | A live red gate prevents completion. Do not downgrade or silence the alert. |
| WAITING_HOST_120 | Action is intentionally held until 120 is reachable. |
| VERIFIED | Live evidence proves the item. |
| DONE | Fix is implemented, verified, and documented. |
Completion is weighted by release risk:
| Priority | Weight |
|---|---|
| P0 | 45% |
| P1 | 25% |
| P2 | 20% |
| P3 | 10% |
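The weighting can be applied as a plain weighted sum. The sketch below uses the weights from the table; the function name is hypothetical, and because the workplan does not state how the overall figure is rounded or adjusted, the result need not equal the Overall row in Section 1.

```python
# Weights copied from the priority table (percent of overall completion).
WEIGHTS = {"P0": 45, "P1": 25, "P2": 20, "P3": 10}


def weighted_completion(phase_percent: dict) -> float:
    """Combine per-phase completion (0-100) into one risk-weighted percentage."""
    missing = set(WEIGHTS) - set(phase_percent)
    if missing:
        raise ValueError(f"missing phases: {sorted(missing)}")
    # Integer arithmetic first, one division at the end.
    return sum(WEIGHTS[p] * phase_percent[p] for p in WEIGHTS) / 100


# Per-phase percentages as listed in the Section 1 table.
print(weighted_completion({"P0": 100, "P1": 97, "P2": 92, "P3": 100}))
```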
For every push forward, update:
YYYY-MM-DD HH:MM Asia/Taipei
Phase: P0/P1/P2/P3
Before: <old percent>
After: <new percent>
Evidence: <command/file/snapshot>
Blocked: <yes/no and why>
Next: <single next action>
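The update template above can be filled in programmatically so that no field is skipped. This is a minimal sketch; the function name, the trailing percent sign, and the sample values are illustrative assumptions, not entries from the real log.

```python
from datetime import datetime

PHASES = {"P0", "P1", "P2", "P3"}


def render_update(when: datetime, phase: str, before: int, after: int,
                  evidence: str, blocked: str, next_action: str) -> str:
    """Render one progress update; `when` is assumed to already be Asia/Taipei local time."""
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase}")
    lines = [
        when.strftime("%Y-%m-%d %H:%M") + " Asia/Taipei",
        f"Phase: {phase}",
        f"Before: {before}%",
        f"After: {after}%",
        f"Evidence: {evidence}",
        f"Blocked: {blocked}",
        f"Next: {next_action}",
    ]
    return "\n".join(lines)


# Hypothetical sample values, shown only to illustrate the layout.
entry = render_update(
    datetime(2026, 6, 25, 19, 35), "P2", 95, 92,
    "/api/v1/system/freshness readback",
    "yes, StockPlatform freshness is blocked",
    "open StockPlatform data freshness remediation lane",
)
print(entry)
```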
4. P0 Must-Do Gates
| ID | Status | % | Work item | Fine analysis | Next action | Done criteria |
|---|---|---|---|---|---|---|
| P0-001 | DONE | 100 | Rerun four-host reachability | 18:57 cold-start confirms 110 / 120 / 121 / 188 ping and SSH are all OK; ARP neighbor evidence is reachable for 120 / 121 / 188. | Keep evidence in LOGBOOK/runbook. | Host reachability table recorded with date/time. |
| P0-002 | DONE | 100 | Recover 192.168.0.120 | 120 root filesystem inconsistency was repaired from console/initramfs with offline fsck; host booted at 2026-06-12 15:13, SSH returned, root is rw, failed units 0, and K3s mon is Ready control-plane. | Continue normal monitoring; schedule storage review if fsck recurs. | 120 ping/SSH OK, node Ready, root not readonly, failed units 0. |
| P0-003 | DONE | 100 | Rerun /backup/scripts/backup-configs.sh | 15:17 manual config capture succeeded; 15:54 aggregate Configs succeeded again, including 120-k3s-host-configs, 121-k3s-host-configs, K8s workloads, K8s secrets, and Velero backups. | Keep next scheduled run under normal cron. | config_failed=0; Configs snapshot bee9ae22 exists after 120 recovery. |
| P0-004 | DONE | 100 | Rerun /backup/scripts/backup-all.sh | 2026-06-12 15:54 aggregate completed 13/13 in 2170s; 18:55 backup-status shows failed=0, core_blockers=0. | Keep 02:00 daily cadence. | Aggregate backup exits 0; backup health failed count 0. |
| P0-005 | DONE | 100 | Rerun /backup/scripts/sync-offsite-backups.sh --mode sync | Default runway gate skipped full sync at 270m; controlled recovery override set runway to 240m without changing scripts. Full offsite sync completed 13/13 at 17:37 in 6027s. | Restore normal default runway for scheduled sync; use override only for documented P0 recovery windows. | New rclone-last-success marker after local backup timestamp. |
| P0-006 | DONE | 100 | Rerun /backup/scripts/verify-offsite-full-sync.sh --write-textfile --no-color | 18:55 verifier confirms all 13 remote repos have snapshots=1, REMOTE_LATEST_ONLY_OK=1, FULL_MARKER_FRESH=1, VERIFY_OK=1, FAILED=0. | Keep 07:20 daily verifier. | REMOTE_LATEST_ONLY_OK=1, all 13 repos snapshots=1. |
| P0-007 | DONE | 100 | Rerun full cold-start scorecard | First 18:56 rerun had one transient internal VIP API 000; direct VIP checks from 110/120/121/188 returned API 200 and Web 307. Second 18:57 rerun returned PASS=83 WARN=0 BLOCKED=0, result GREEN. | Treat future internal VIP 000 as transient only after direct multi-host VIP checks prove API 200. | BLOCKED=0, WARN=0, result GREEN. |
| P0-008 | DONE | 100 | Narrow 120 failure domain and prepare console handoff | 110 and 188 see no route / no ping; 121 reports destination host unreachable; local ARP is incomplete. Kubernetes retained only stale node/lease data and cannot read current 120 host/filesystem state. No BMC/IPMI/WOL inventory was found in the repo. | Physical/VM console must verify power state, NIC attachment, boot screen, initramfs/fsck state, and then restore SSH. | Handoff evidence is recorded; no remote-only fix path remains before console access. |
| P0-009 | DONE | 100 | Exhaust safe remote 120 recovery channels | 2026-06-12 15:00 local/110/121/188 all still fail ping/SSH with ARP incomplete. Searched repo, local tools, 110, 121, 188, SSH config, local VM files, and Chronicle-visible desktop; no usable BMC/IPMI/WOL/vmrun/hypervisor/120 console entry was found. | Use hypervisor / console / VM inventory outside SSH path. | Remote-only path is proven unavailable; no alert was silenced and no unsafe reboot/restart was attempted. |
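The four-host reachability rerun in P0-001 can be approximated with a plain TCP probe. The sketch below is an assumption-laden illustration: it checks only the SSH port (assumed to be the default 22) and skips ping and ARP evidence, and the function names are hypothetical rather than taken from the cold-start script.

```python
import socket
from typing import Callable, Dict, Iterable

# In-scope hosts from the live check table; 112 is intentionally excluded.
HOSTS = ("192.168.0.110", "192.168.0.120", "192.168.0.121", "192.168.0.188")


def ssh_port_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the SSH port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def reachability_table(hosts: Iterable[str],
                       probe: Callable[[str], bool] = ssh_port_open) -> Dict[str, str]:
    """Map each host to the wording used in the live check table."""
    return {host: ("SSH port OK" if probe(host) else "SSH port failed") for host in hosts}


def all_reachable(table: Dict[str, str]) -> bool:
    """True only when every host reported an open SSH port."""
    return all(state == "SSH port OK" for state in table.values())
```

Calling `reachability_table(HOSTS)` performs the live probes; passing a custom `probe` allows a dry run without touching the network. The probe is read-only, consistent with the rule that no restart or alert silencing is attempted during triage.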
5. P1 Backup And Alert Gates
| ID | Status | % | Work item | Fine analysis | Next action | Done criteria |
|---|---|---|---|---|---|---|
| P1-001 | VERIFIED | 100 | Confirm 110 backup schedule | Live crontab has 02:00 backup-all, 03:00 rclone gated sync, 06:05 backup-status, 07:20 full offsite verify. | Update BACKUP-STATUS.md. | Schedule documented and matches live crontab. |
| P1-002 | VERIFIED | 100 | Confirm success-noise policy | Daily status is once at 06:05; normal backup success is not a Telegram spam path. | Keep failure-only escalation in backup docs. | Docs say failures escalate; daily status is summary only. |
| P1-003 | VERIFIED | 100 | Confirm Google Drive latest-only | 2026-06-12 18:55 verifier shows 13 repos with exactly one remote snapshot each after the post-120 aggregate backup and full offsite sync. | Record evidence in backup status. | REMOTE_LATEST_ONLY_OK=1, VERIFY_OK=1. |
| P1-004 | VERIFIED | 100 | Confirm required alerts exist | Live Prometheus rules include all five required backup/cold-start alerts. | Keep in scorecard. | All five alert names FOUND live. |
| P1-005 | BLOCKED_WAITING_OWNER_EVIDENCE | 20 | Fill credential escrow evidence markers | Five markers are missing. This is a DR scorecard blocker, not a service outage. 2026-06-13 13:10 proves scripts/offsite/rclone readiness is green; the remaining blocker is owner-provided real non-secret evidence IDs. Owner request package exists at docs/security/CREDENTIAL-ESCROW-EVIDENCE-OWNER-REQUEST.md; secrets must not enter repo or chat. | Human verifies vault/offline escrow, validates each non-secret evidence ID with --dry-run, then writes markers using /backup/scripts/mark-credential-escrow-verified.sh. | awoooi_backup_dr_credential_escrow_missing_count=0. |
| P1-006 | DONE | 100 | Fix backup health failed component | 2026-06-12 18:55 backup-status shows failed=0, core_blockers=0, config_failed=0; 120 config capture is no longer red. | Keep normal daily backup cadence. | failed_count=0, config_failed=0. |
| P1-007 | DONE | 100 | Refresh stale backup jobs | 2026-06-04 cleared stale188=momo_pg_daily; 2026-06-05 cleared recurring stale110=awoooi_db; 2026-06-06 confirms no stale jobs after the next aggregate window. | Keep normal cron cadence; only 120-driven Configs remains red. | stale110=none, stale188=none, 110 13/13 fresh, 188 2/2 fresh. |
| P1-008 | DONE | 100 | Align 188 momo backup cron/exporter contract | 188 backup exporter expected /home/ollama/bin/momo-pg-backup.sh; crontab still pointed to the old app-side script. Crontab was backed up and updated to the host-owned controller script. | Keep backup controller path in future deploy docs. | configured_missing_188=0, awoooi_backup_job_configured{host="188",job="momo_pg_daily"} 1. |
| P1-009 | DONE | 100 | Repair 2026-06-05 non-120 backup failures | 02:00 aggregate failed Gitea, AWOOOI DB, Open-WebUI, ClawBot, AI Artifacts, and Configs. The next aggregate window held the five non-120 fixes; Configs remains 120-blocked. | Leave aggregate red until 120 returns and Configs can rerun cleanly. | Fresh single-repo evidence exists for all non-120 failures and the next aggregate run only failed Configs. |
| P1-010 | DONE | 100 | Offsite sync manual backup repairs | 2026-06-12 17:37 full offsite sync completed 13/13 after controlled P0 runway override to 240m; 18:55 verifier confirmed 13 remote repos each have one snapshot. | Allow normal 03:00 full sync cadence unless another manual backup creates new snapshots. | REMOTE_LATEST_ONLY_OK=1, VERIFY_OK=1, full sync 13/13. |
| P1-011 | DONE | 100 | Confirm 2026-06-12 backup convergence | 18:55 live check confirms the post-120 aggregate held: no stale jobs, no configured/missing script jobs, no failed components, offsite fresh, and only credential escrow remains as DR warning. | Keep escrow as explicit red gate. | stale110=none, stale188=none, failed=0, config_failed=0, core_blockers=0. |
| P1-012 | DONE | 100 | Audit credential escrow marker write safety | 2026-06-12 15:02 mark-credential-escrow-verified.sh --status reports all five allowed items missing; offsite-escrow-evidence-report.sh --no-color reports rclone/offsite configured and ESCROW_MISSING_COUNT=5; repo search found only runbooks/placeholders/rules, not real evidence IDs. | Write markers only after a real non-secret evidence ID exists for each item; never write placeholder or secret. | The marker blocker is narrowed to missing external evidence IDs, not missing script/config/offsite readiness. |
| P1-014 | DONE | 100 | Publish credential escrow owner request package | 2026-06-13 13:10 live report confirms SCRIPT_MISSING_COUNT=0, OFFSITE_CONFIGURED=1, RCLONE_CONFIGURED=1, ESCROW_MISSING_COUNT=5, PASS=8 WARN=5 BLOCKED=0. New owner request package defines allowed evidence-id types, forbidden secret values, safe dry-run flow, write flow, and closeout gates. | Dispatch to the credential owners without collecting secret values; keep marker write gated until owner gives real non-secret evidence IDs. | docs/security/CREDENTIAL-ESCROW-EVIDENCE-OWNER-REQUEST.md and snapshot exist and validate. |
| P1-013 | DONE_FOR_SERVICE_READINESS | 100 | Remediate km-vectorize CronJob health debt | The retained km-vectorize-29689620 failed Job is now classified as stale evidence, not an active blocker, because later official km-vectorize Jobs completed successfully. 2026-06-18 13:43 cold-start reads FAILED_JOBS=1, STALE_FAILED_JOBS=1, ACTIVE_FAILED_JOBS=0, BAD_PODS=0, and returns PASS=84 WARN=0 BLOCKED=0. | Keep retained failed Job as evidence unless an explicit maintenance window authorizes cleanup. Reassert ArgoCD app health only with a fresh ArgoCD app readback, not from the cold-start scorecard alone. | Service readiness no longer warns on stale failed Job evidence; active failed Job detection remains guarded. |
| P1-015 | DONE | 100 | Restore 188 MinIO / Velero backup freshness and DB exporters | 2026-06-24 06:35 resolved real backup / exporter red lights: 188 PostgreSQL exporter and Redis exporter now expose pg_up=1 / redis_up=1; 188 MinIO health is live; 120 Velero BSL is Available; one-off backup reboot-recovery-202606240456 completed; 110 backup-health textfile reports latest Velero backup fresh. 110 disk pressure was reduced from 92% to 73% by Docker image/build-cache cleanup only. | Reconcile MinIO userns_mode: host override into formal source-of-truth or data ownership fix; keep Docker volume prune forbidden without explicit owner approval. | VeleroBackupNotRun, PostgreSQLDown, RedisDown, and 110 disk-pressure alerts are resolved, and SOP includes restore helpers. |
| P1-016 | DONE | 100 | Control repeated Telegram notification noise without hiding real alerts | 2026-06-24 confirmed MOMO Pro 5-minute spam came from a legacy 110 script checking http://192.168.0.188/health; live script now uses https://mo.wooo.work/health as primary truth. Heartbeat warning dedupe now hashes stable actionable fingerprints so HTTP status / timeout / latency drift does not create a new Telegram event every 30 minutes. MoWoooWorkDown now labels component=momo-pro-system, disables blind auto-repair, and requires public/local/container/data-freshness triage. Generic docker-health monitor keeps 5-minute repair checks but adds a separate 30-minute direct Telegram fallback cooldown. Bitan public-content cleanliness keeps failure notification with same-fingerprint cooldown and one recovery notice. | Fold remaining cross-product direct Telegram egress into the unified notification gateway over time; do not disable real warning/failure/recovery signals. Production deployment/readback must confirm the code and Prometheus rules are live before declaring runtime closure. | Healthy heartbeat is quiet, same actionable heartbeat warning is deduped, MOMO public health success produces no alert, repeated same-failure direct fallback paths are cooled, and real failure/recovery/new-warning notifications remain enabled. |
| P1-017 | DONE | 100 | Restore 188 nginx-exporter and post-CD monitoring coverage | CD #3294 deployed marker 622bc372 but failed post-deploy checks because scripts/generate_monitoring.py --check saw Prometheus job nginx-exporter down at 192.168.0.188:9113. 188 stub_status and compose config were healthy, so the correct fix was restoring the stateless exporter from /home/ollama/nginx-exporter.yml, not reloading Nginx or restarting products. New helper scripts/ops/188-nginx-exporter-restore.sh defaults to read-only --check and exposes explicit --apply for maintenance-window restore. high-value-config-change-gate.py now classifies scripts/ops/**/*exporter* as monitoring_alerting_observability P1 / C1. | Keep this check in post-reboot and post-CD recovery. Do not mark historical CD #3294 as success; use the next CD run plus monitoring coverage as future proof. | bash scripts/ops/188-nginx-exporter-restore.sh --check reports nginx_up 1; python3 scripts/generate_monitoring.py --check --stabilization-sleep-seconds 0 reports Jobs=14, 全部 UP=14 (all up), 真實問題=0 (real problems), coverage 100.0%; high-value gate matches the helper as P1 / C1, not unmanaged. |
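
The heartbeat dedupe described in P1-016 can be sketched as follows. This is a minimal illustration, not the live script: the field names (`component`, `check`, `severity`), the helper names, and the single 30-minute cooldown constant are assumptions made for the example. The idea is that volatile readings such as HTTP status or latency stay out of the hash, so drift in those values maps to the same fingerprint.

```python
import hashlib

# Hypothetical field names; the real heartbeat schema is not shown in this workplan.
STABLE_FIELDS = ("component", "check", "severity")
# Assumed to mirror the 30-minute cooldown mentioned in P1-016.
COOLDOWN_SECONDS = 30 * 60


def fingerprint(event: dict) -> str:
    """Hash only the fields that identify the actionable problem."""
    stable = "|".join(str(event.get(name, "")) for name in STABLE_FIELDS)
    return hashlib.sha256(stable.encode("utf-8")).hexdigest()


def should_notify(event: dict, last_sent: dict, now: float) -> bool:
    """Return True for a new fingerprint or once the cooldown has elapsed."""
    key = fingerprint(event)
    previous = last_sent.get(key)
    if previous is not None and now - previous < COOLDOWN_SECONDS:
        return False  # same actionable problem, still inside the cooldown
    last_sent[key] = now
    return True
```

A change in severity produces a different fingerprint, so a genuinely new warning or failure is still delivered while repeats of the same problem are suppressed.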
6. P2 Service And Data Gates
| ID | Status | % | Work item | Fine analysis | Next action | Done criteria |
|---|---|---|---|---|---|---|
| P2-001 | VERIFIED | 100 | Public route smoke | 2026-06-12 18:57 cold-start confirms all listed domains returned expected 2xx/3xx over HTTPS; registry root route returned 200 in the scorecard and /v2/ remains the normal unauthenticated 401 pattern from earlier checks. This proves ingress/TLS plus current route availability. | Keep as one row in scorecard. | Public route table updated after each reboot. |
| P2-002 | GREEN | 100 | momo latest/current-month parity and freshness | Latest current-month parity is good: `15383 | 15383 | 2026-06-01 |
| P2-008 | DONE_SUPERSEDED_BY_JOB_57_RECOVERY | 100 | Separate MOMO service recovery from upstream source absence | 2026-06-24 11:35 readback proved MOMO service was healthy and source-file absence was the blocker. 2026-06-25 10:35 superseded that with a stricter split: service healthy, DB parity good, but token / Drive auth evidence not sufficient and scheduler fail-closed behavior required. 2026-06-25 14:16 supersedes the blocker with job 57 clean import, V10.674, token metadata aligned to scheduler UID, current-month parity through 2026-06-24, and `DB_DAILY_FRESHNESS 1 \| 2026-06-24`. SOP v1.51 preserves the GO/NO-GO rules forbidding old archive re-import, product-export import, truncate, whole-DB restore, fake freshness, or token secret exposure. | Keep running the dedicated preflight after each reboot/import window; if Drive/API auth fails again, it must fail closed and alert rather than becoming an empty-folder success. |
| P2-003 | DONE_PRODUCTION_DEPLOYED_WAITING_NEXT_REAL_IMPORT | 99 | Fix momo job semantics | Gitea-first repair is in /Users/ogt/codex-workspaces/momo-pro-dev commit 84035906aba0e5e190d031a13cfd9b47a8cd1f73 on branch codex/momo-current-main-dev-base-20260624, also fast-forwarded to MacBook Pro and fast-forwarded to MOMO main. Gitea Actions cd.yaml #904 succeeded, and 188 live source contains _table_columns, 業績分析儀表板同步失敗 ("performance analysis dashboard sync failed"), and 保留來源檔案等待重試,不移動 Google Drive 檔案 ("keep the source file pending retry; do not move the Google Drive file"). process_daily_sales_import() marks monthly sync failure as failed, records the sync error in summary, returns False, and leaves auto_import_from_drive() outside the Drive archive/move path. Regression tests cover both job failure and no-move behavior. | Watch the next real Google Drive import and confirm no file moves unless both tables sync; if a real monthly sync failure happens, verify import job status is failed and source file remains pending. | pytest tests/test_import_service_sql_params.py tests/test_auto_import_data_sync.py tests/test_auto_import_failure_boundaries.py -q returns 10 passed; production deployment/readback is complete; final behavioral closeout requires next real import evidence. |
| P2-004 | DONE | 100 | PostgreSQL index corruption runbook path | SOP v1.2 now states posting list tuple ... cannot be split is an index repair incident. | Use only concurrent reindex if the error returns. | No truncate, no whole DB restore; REINDEX TABLE CONCURRENTLY public.realtime_sales_monthly; and idempotent resync evidence recorded. |
| P2-005 | VERIFIED | 100 | Do not rely on route 200 only | 2026-06-12 closeout has route + DB + backup + offsite + schedule + alert + K3s + cold-start scorecard evidence. The only remaining blocker is DR credential escrow, outside service availability. | Keep this cross-surface checklist mandatory after every reboot. | Each reboot record has route, DB, backup, schedules, alert, scorecard rows. |
| P2-006 | DONE | 100 | Validate momo scheduler WARN | 2026-06-12 post-reboot regression showed the old detector was too narrow for Chinese batch and [Feeder] logs. The detector was widened and deployed to 110; 14:47 scorecard reads SCHEDULER_RECENT_ACTIVITY 1070 and marks scheduler healthy. | Keep normal monitoring; treat future recurrence as detector tuning only if direct logs remain active. | Container healthy, direct log activity exists, and latest scorecard removed this WARN. |
| P2-007 | DONE | 100 | Balance K3s AWOOOI workload across 120 / 121 | Gitea main acaae999 adds topology spread for API/Web/Worker. ArgoCD later synced deploy marker e4a349bc; live deployments still have split placement after a normal CD rollout: API pods on mon1 / mon, Web pods on mon / mon1, Worker single replica on mon; 01:26 final cold-start is PASS=83 WARN=0 BLOCKED=0. | Keep watching future deploys; do not manually delete pods unless placement drift becomes a real service or HA gate. | Live deployment has non-empty topology spread, API/Web placement max skew <= 1 after normal CD, public routes green, cold-start WARN=0 BLOCKED=0. |
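
The "max skew <= 1" criterion in P2-007 can be checked with simple arithmetic on a placement readback. The sketch below is illustrative only: `max_skew` is a hypothetical helper, and the node names `mon` / `mon1` are taken from the row above. It treats skew as the difference between the most and least loaded node, counting a node with no pods as 0.

```python
from collections import Counter


def max_skew(pod_nodes: list[str], nodes: list[str]) -> int:
    """Difference between the most and least loaded node (empty nodes count as 0)."""
    counts = Counter(pod_nodes)
    per_node = [counts.get(node, 0) for node in nodes]
    return max(per_node) - min(per_node)


nodes = ["mon", "mon1"]
api_pods = ["mon1", "mon"]   # API pods on mon1 / mon, as read back above
worker_pods = ["mon"]        # Worker single replica on mon

print(max_skew(api_pods, nodes))     # 0
print(max_skew(worker_pods, nodes))  # 1
```

A single-replica workload always shows a skew of 1 across two nodes, which is why the done criterion applies the limit to API/Web placement.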
7. P3 Documentation And Automation
| ID | Status | % | Work item | Fine analysis | Next action | Done criteria |
|---|---|---|---|---|---|---|
| P3-001 | VERIFIED | 100 | Confirm hardening commit | Gitea main currently points to 0260ec89...; git merge-base --is-ancestor ae7b39d9 0260ec89... returned true. | Keep evidence in LOGBOOK. | Gitea main contains ae7b39d9 fix(ops): harden reboot recovery and backup alerts. |
| P3-002 | VERIFIED_WITH_V142_SYNC_BLOCKED | 100 | Confirm live 110 scripts | All required recovery/check scripts exist under /home/wooo/scripts/; cold-start script hash 10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8 is live on 110. Repo-side v1.42 authoritative script hash is f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05, and verify-cold-start-monitor-deploy.sh correctly blocks on the mismatch. | Do not run install-cold-start-monitor-110.sh during read-only triage. After explicit maintenance-window / owner approval, run the installer, rerun deploy parity, then rerun the live 110 cold-start monitor and record the new hash. | Script paths and current mismatch are recorded; v1.42 live-sync done criteria remains hash parity plus live scorecard fields. |
| P3-003 | DONE | 100 | Reconcile 188 nginx Ansible baseline | Live 188 already routes aiops.wooo.work through VIP; the Ansible template matches that route and has no 120 upstream for aiops. nginx-sync.yml now also carries the 188-internal-tools-https.conf.j2 source-of-truth path, and ansible-validate.sh syntax-check passes with repo-local roles path. | Run only approved dry-run/apply from the normal Ansible environment before changing live nginx. | Template and live config agree; no 120 upstream for aiops; repo-side syntax and readiness contract pass. |
| P3-004 | DONE | 100 | Update docs/LOGBOOK.md | Live blocker and new docs are recorded. | Keep this entry updated after each recovery phase. | LOGBOOK has current recovery status and next actions. |
| P3-005 | DONE | 100 | Update cold-start SOP | SOP now includes start, shutdown, reboot, record, comparison, and 120 blocker handling. | Increment SOP version after each process change. | SOP has controlled power-operation sections and ledger template. |
| P3-006 | DONE | 100 | Update backup status | Backup status now reflects current cron, rclone latest-only, failure-only alert posture, and escrow blocker. | Refresh after 120 backup rerun. | Backup status no longer claims noisy success Telegram notifications. |
| P3-007 | DONE | 100 | Harden Gitea backup stale dump handling | 2026-06-05 manual Gitea backup failed because the container retained /tmp/gitea-dump.zip from the 02:00 failure. scripts/backup/backup-gitea.sh now renames stale container dump files to timestamped evidence before running a new dump, and the live 110 script is updated. | Watch the next 02:00 Gitea backup. | bash -n passes locally and on 110; manual Gitea backup completed after stale evidence rename. |
| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.52 adds one-page post-start quick check wrapper, fallback runbook, startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable plan_b baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, stale-vs-active K8s failed Job classification, post-reboot / post-CD recovery anchors, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, fwupd-refresh.timer rollback note, 110 runaway browser / CI load triage PlayBook, healthy-heartbeat suppression, 188 node-exporter restore, 188 DB/Redis exporter restore, 188 MinIO/Velero restore, 188 nginx-exporter restore, 110 Docker disk cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, post-reboot notification noise gates, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side scorecard source-absence classifier, 110 live-sync parity gate, CD monitoring coverage target-down classification, MOMO dedicated token/source preflight, MOMO V10.674 / StartedAt / lifecycle / job 57 / freshness 1 recovery readback, and 2026-06-25 110 CPU orphan Chrome vs active CI triage evidence. | Use scripts/reboot-recovery/post-start-quick-check.sh --no-color for T+10 post-reboot triage, then use docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md as manual fallback and SOP v1.52 for exceptions, Plan B, blocker-specific recovery, and historical comparison. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow / runaway-process / notification-noise / MOMO preflight / monitoring coverage checks. If using the live 110 script, record its hash and do not assume repo-side v1.42 behavior until synced under approval and deploy parity passes. | SOP distinguishes HOST_BOOTED, HOST_READY, SERVICE_READY, FULL_STACK_GREEN, K3S_CONTROL_PLANE_AA, WORKLOAD_BALANCED, B0_ABORTED_BEFORE_REBOOT, B1_HOST_RECOVERY_ONLY, B2_CORE_SERVICE_READY, B3_SERVICE_AVAILABLE_DEGRADED, B4_FULL_STACK_GREEN, and B5_DR_COMPLETE; quick check wrapper has one command order and LOGBOOK summary; latest MOMO dedicated preflight returns PASS=19 WARN=2 BLOCKED=0; 110 CPU evidence records old orphan Chrome groups removed by approved SIGTERM while active CI load remains observation-only; repeated healthy/same-failure notification noise is controlled without hiding real alerts, and monitoring coverage target-down is routed through exporter restore before any product restart. |
| P3-009 | DONE | 100 | Assess 120/121 AA/AS role and host load balancing | 2026-06-12 15:19 live check confirms 120 and 121 are both Ready control-plane, k3s active, k3s-agent inactive, with no taints; however most AWOOOI / ArgoCD / Velero workload remains on 121 after 120 fsck recovery. New assessment defines control-plane AA vs workload AA, migration candidates from 110/188, and stateful migration blockers. | After P0 backup/offsite/cold-start green, implement topology spread for AWOOOI API/Web before moving additional services. | docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md exists; SOP v1.6 links AA/AS and load-balancing checks; migration implementation remains explicitly 0%. |
| P3-010 | DONE | 100 | Update workload balancing docs with 2026-06-13 live truth | Host role assessment, workplan, SOP, backup status, and LOGBOOK are refreshed with current cold-start, backup, 188 certbot degraded, ArgoCD km-vectorize degraded, Gitea main acaae999, ArgoCD sync, and final pod placement evidence. | Keep updating this file after the next reboot or deploy. | Docs separate service-green status from DR escrow, workload rollout, and non-service governance debt. |
| P3-011 | DONE | 100 | Record km-vectorize remediation status | LOGBOOK, this workplan, and SOP now state the schedule/label fix, ArgoCD sync evidence, the invalid manual Job boundary, and the 90% waiting-for-next-schedule gate. | After next 03:00 run, update this row and the top verdict with lastSuccessfulTime / ArgoCD health evidence. | No document claims ArgoCD green before official CronJob success evidence exists. |
| P3-012 | DONE | 100 | Prevent CD from clobbering cold-start SSH trust | Source fix 80e6ec1a changes Gitea CD workflows to use deploy-specific deploy_known_hosts and UserKnownHostsFile; post-deploy marker e4a349bc proves global /home/wooo/.ssh/known_hosts retained 120 / 188 entries. SOP v1.8 records this as a release guardrail. | Keep the guardrail in future workflow reviews; any `> ~/.ssh/known_hosts` in deploy code is a release blocker. | CD success plus post-CD known_hosts readback and strict SSH checks to 120 / 188 remain green. |
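
The hash-parity gate described in P3-002 comes down to comparing SHA-256 digests of the live and repo-side scripts. The following is a minimal sketch under stated assumptions: `sha256_of` and `parity_status` are hypothetical helpers, the status labels are invented for illustration, and none of this is the actual logic of verify-cold-start-monitor-deploy.sh.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large scripts are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parity_status(live_hash: str, repo_hash: str) -> str:
    """Hypothetical labels: block on any mismatch, never auto-sync."""
    if live_hash.lower() == repo_hash.lower():
        return "PARITY_OK"
    return "BLOCKED_HASH_MISMATCH"


# Hashes as recorded in P3-002.
live = "10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8"
repo = "f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05"
print(parity_status(live, repo))  # BLOCKED_HASH_MISMATCH
```

The check only reports; running the installer stays a separate, approval-gated step, in line with the next action for P3-002.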
8. Required 120 Recovery Sequence
Do this only after physical/VM console access confirms 120 is powered on, attached to the LAN, and either booted or repairable.
# 0. Console-side checks first; do not do these through an online mounted root filesystem.
# - power / VM state
# - NIC connected to the 192.168.0.x LAN
# - boot screen / initramfs / rescue state
# - if root FS repair is required: fsck -f /dev/mapper/ubuntu--vg-ubuntu--lv from console/rescue only
# 1. After SSH returns, run read-only 120 maintenance readiness
bash scripts/reboot-recovery/120-fsck-maintenance-checklist.sh --no-color
# 2. After 120 is reachable and stable, on 110
/backup/scripts/backup-configs.sh
/backup/scripts/backup-all.sh
/backup/scripts/sync-offsite-backups.sh --mode sync
/backup/scripts/verify-offsite-full-sync.sh --write-textfile --no-color
# 3. Final cold-start scorecard
/home/wooo/scripts/full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1
Do not run truncate, whole DB restore, force-push, DROP, or online root filesystem fsck as part of this flow.
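
The prohibition above could be backed by a pre-flight guard that scans a proposed command before it is run in the post-SSH part of this flow. The sketch below is an assumption-laden illustration, not an existing script: the patterns and the `violations` helper are hypothetical, and "whole DB restore" is not covered because it has no single recognizable command form. The `fsck` pattern is written so that a script path such as `120-fsck-maintenance-checklist.sh` is not flagged.

```python
import re

# Illustrative patterns derived from the prohibited actions listed above.
FORBIDDEN = {
    "truncate": re.compile(r"\btruncate\b", re.IGNORECASE),
    "drop": re.compile(r"\bdrop\s+(?:table|database|schema|index)\b", re.IGNORECASE),
    "force-push": re.compile(r"\bgit\s+push\b.*\s(?:--force\S*|-f)(?:\s|$)", re.IGNORECASE),
    # Match fsck as a command word, not as part of a hyphenated file name.
    "fsck": re.compile(r"(?<![\w.-])fsck(?:\.\w+)?(?![\w-])", re.IGNORECASE),
}


def violations(command: str) -> list[str]:
    """Return the names of forbidden actions that appear in the command string."""
    return [name for name, pattern in FORBIDDEN.items() if pattern.search(command)]


print(violations("/backup/scripts/backup-all.sh"))              # []
print(violations("fsck -f /dev/mapper/ubuntu--vg-ubuntu--lv"))  # ['fsck']
```

A string match like this is a safety net, not a substitute for review: it cannot tell a console-side repair from an online one, so any hit should stop the run and go to a human.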
9. Progress Updates
2026-06-18 14:20 Asia/Taipei
Phase: P3 AI Ops runaway process automation
Before: 110 CPU saturation could only be diagnosed manually with `ps/top`; the generic `HostHighCpuLoad` alert could not distinguish cross-project orphan Chrome smoke processes from legitimate Gitea Actions CI load.
After: Added read-only `host-runaway-process-exporter.py`, gated `host-runaway-process-remediation.py`, Prometheus `host_runaway_process_alerts`, Ansible textfile exporter source-of-truth, SOP v1.26, and `HOST-RUNAWAY-PROCESS-AIOPS-PLAYBOOK.md`. The exporter exposes orphan browser, active CI, load/core, swap ratio, and `remediation_authorized=0`; the remediator defaults to dry-run, and `SIGTERM` requires owner approval, a maintenance window, and an evidence ref.
2026-06-18 14:31 Asia/Taipei
Phase: P3 AI Ops runaway process live observability
Before: Repo-side exporter / alert / PlayBook were complete, but 110 Prometheus had not yet read `awoooi_host_runaway_process_*` live metrics.
After: 110 has the read-only exporter/helper and cron installed, and the textfile was refreshed immediately; the second Prometheus scrape read `monitor_up=1`, orphan browser group count `0`, active CI containers `2`, load5/core about `0.79-0.81`, swap ratio about `1.0`, and `remediation_authorized=0`; `HostRunawayProcessMonitorMissing` and `HostOrphanBrowserSmokeHighCpu` are not firing.
Evidence: `/home/wooo/node_exporter_textfiles/host_runaway_process.prom`, Prometheus query `awoooi_host_runaway_process_monitor_up{host="110"}`, `ALERTS{alertname="HostRunawayProcessMonitorMissing",host="110",alertstate="firing"}`.
Blocked: No for live observability; yes for runtime remediation by design until owner approval / maintenance window / evidence ref / dry-run / post-check exist.
Next: Keep cron scrape under normal monitoring; if orphan count becomes >0, create AI triage packet and remediation dry-run before any gated `SIGTERM`.
Completion: monitoring / alert / PlayBook / KM contract 100%; runtime auto-remediation remains gated at 0 until a real owner-approved apply is executed.
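
The remediation gate described in this entry (owner approval, maintenance window, evidence ref, dry-run, post-check) can be sketched as an all-or-nothing check. This is a hypothetical model only: the dataclass, its field names, and the helper functions are assumptions for illustration and do not reflect the actual `host-runaway-process-remediation.py` interface.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RemediationRequest:
    """Hypothetical gate inputs; every field defaults to 'not satisfied'."""
    owner_approval: bool = False
    maintenance_window: bool = False
    evidence_ref: str = ""
    dry_run_passed: bool = False
    post_check_planned: bool = False


def missing_gates(request: RemediationRequest) -> list[str]:
    """List every gate that is still empty or False."""
    return [f.name for f in fields(request) if not getattr(request, f.name)]


def remediation_authorized(request: RemediationRequest) -> int:
    """Mirror the 0/1 style of the exported metric: 1 only when no gate is missing."""
    return 0 if missing_gates(request) else 1


print(remediation_authorized(RemediationRequest()))  # 0
```

Defaulting every field to the unsatisfied state keeps the sketch fail-closed: an incomplete request can never evaluate to 1.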
2026-06-18 14:38 Asia/Taipei
Phase: P3 AI Ops alert-to-event packet
Before: A generic CPU raw dump could be converted into an AI automation card, but `HostOrphanBrowserSmokeHighCpu` / `HostCiRunnerLoadSaturation` alert text did not yet have a dedicated lane.
After: The final Telegram egress can convert `HostOrphanBrowserSmokeHighCpu` into `orphan_browser_smoke_runaway_process` and `HostCiRunnerLoadSaturation` into `ci_runner_load_saturation`; both keep `runtime_write_gate=0` and require dry-run / owner / maintenance / evidence / KM / PlayBook / Verifier.
Evidence: `apps/api/src/services/telegram_gateway.py`, `apps/api/tests/test_telegram_message_templates.py`; targeted pytest `59 passed`.
Blocked: No for alert-to-event packet; yes for Telegram live send / runtime remediation by design.
Next: After code-review / CD, perform production readback; if an alert actually fires in the future, confirm that the Telegram card and the AwoooP Run truth-chain both present the same lane.
2026-06-18 14:51 Asia/Taipei
Phase: P3 AI Ops alert-to-event packet production readback
Before: `HostOrphanBrowserSmokeHighCpu` / `HostCiRunnerLoadSaturation` had source + tests, but production deployment and runtime revision readback were not yet complete.
After: `f358a0f6` was deployed by Gitea CD `#3150`, deploy marker `2d278568`. ArgoCD `awoooi-prod` is `Synced / Healthy`, and the API / Web / Worker images are all `f358a0f6c3e614e407dedb6eee89bf10b2bc8173`; production health and the IwoooS / Governance / AwoooP tenants routes all return `200`, with `0` hits in sensitive-string sampling.
Evidence: Gitea CD `#3150` tests `3221 passed, 23 skipped`, B5 integration `5 passed`, post-deploy alert-chain smoke `9/9`, monitoring coverage `14/14` jobs up; Prometheus still reads 110 `monitor_up=1`, orphan browser group count `0`, CI active containers `2`, `remediation_authorized=0`, and the missing / orphan alerts are not firing.
Blocked: No for production alert-to-event packet deployment; yes for runtime remediation by design.
Next: Future firing alert must produce AI triage packet and dry-run evidence first. `HostCiRunnerLoadSaturation` remains capacity / runner scheduling triage, not process kill. Runtime remediation remains `0` until owner approval, maintenance window, evidence ref, gated SIGTERM, post-check, and KM / PlayBook / Verifier writeback exist.
Completion: host runaway monitoring / alert / PlayBook / Telegram event packet / production deploy readback 100%; runtime auto-remediation remains safely gated at 0.
2026-06-18 15:08 Asia/Taipei
Phase: P3-009 Host runaway AIOps loop product readback
Before: Monitoring, alert rules, event packet routing, live scrape, and production deploy readback were complete, but governance UI still lacked a single product-visible loop state for monitor -> alert -> event packet -> PlayBook -> KM / Verifier -> gated remediation.
After: Added `host_runaway_aiops_loop_readiness_v1` committed snapshot, schema, strict API loader, endpoint `/api/v1/agents/agent-host-runaway-aiops-loop-readiness`, regression tests, API client type, and governance automation-inventory card. The card shows 6 loop stages, 2 alert lanes, 5 asset writeback contracts, host 110 live readback, deploy marker 2d278568, orphan groups 0, and runtime writes 0.
Evidence: `apps/api/tests/test_host_runaway_aiops_loop_readiness.py` + API test `9 passed`; web typecheck passed using a temporary existing node_modules symlink that was removed before commit; snapshot/schema/messages JSON parse and py_compile passed.
Blocked: No for product readback; yes for runtime remediation by design.
Next: If a real or fixture alert fires, verify Telegram card, AwoooP Work Item, KM / PlayBook / Verifier fields agree before considering any owner-approved non-production gated SIGTERM drill.
Completion: host runaway AIOps product-visible loop readback 100%; runtime auto-remediation remains safely gated at 0.
2026-06-18 16:08 Asia/Taipei
Phase: P3-009 Host runaway AIOps loop production verification
Before: P3-009 source, API, UI and tests were pushed, but production still needed deploy marker, API readback, desktop/mobile browser smoke, and CD runner lock recovery evidence.
After: Final deploy marker `42c08ece chore(cd): deploy 27143fb [skip ci]` is live after CD runner lock fixes `fc6c01ee` / `84ca8423` / `27143fb0`; `cd.yaml #3177` and `code-review.yaml #3178` are successful. Production endpoint `/api/v1/agents/agent-host-runaway-aiops-loop-readiness` returns `schema_version=host_runaway_aiops_loop_readiness_v1`, `current_task_id=P3-009`, `next_task_id=P3-010`, completion `100`, loop stages `6`, alert lanes `2`, writeback contracts `5`, host `110`, orphan browser groups `0`, active CI containers `2`, and every runtime/write/remediation counter `0`.
Evidence: API health `healthy / prod / mock_mode=false`; desktop `1440x1100` and mobile `390x844` governance smoke with deploy marker `42c08ece` have required text missing `0`, console/page errors `0`, horizontal overflow `false`, overflowing elements `0`; screenshots are `/tmp/awoooi-host-runaway-aiops-desktop-1440x1100-42c08ece.png` and `/tmp/awoooi-host-runaway-aiops-mobile-390x844-42c08ece.png`.
Blocked: No for production product readback. Yes for runtime remediation by design: process termination, Docker/systemd restart, Nginx reload, firewall/K8s action, Telegram live send, Gateway queue write, Bot API call, production write, and secret read remain `0 / false`.
Next: Treat the next real or fixture `HostOrphanBrowserSmokeHighCpu` as the acceptance drill for end-to-end Telegram card / AwoooP work item / KM / PlayBook / Verifier field agreement. Any actual SIGTERM remains owner-approved, maintenance-windowed, dry-run-first, and post-check-gated.
Completion: host runaway AIOps product-visible loop readback and production verification 100%; runtime auto-remediation remains safely gated at 0.
2026-06-18 13:43 Asia/Taipei
Phase: P1/P2/P3 live readback
Before: live cold-start was `PASS=83 WARN=1 BLOCKED=0`, result `DEGRADED`, because retained stale `km-vectorize-29689620` failed Job evidence was still counted as a service warning.
After: live cold-start is `PASS=84 WARN=0 BLOCKED=0`, result `GREEN`; P2 service readiness is now `100%`; overall recovery readiness is `99% SERVICE_GREEN_DR_ESCROW_BLOCKED`.
Evidence: `full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1`; K8s schedule counters `FAILED_JOBS=1`, `STALE_FAILED_JOBS=1`, `ACTIVE_FAILED_JOBS=0`, `BAD_PODS=0`; repo-side readiness audit `PASS=187 WARN=1 BLOCKED=0`; escrow readback `ESCROW_MISSING_COUNT=5`.
Blocked: no for full-stack service readiness. Yes for DR complete, because five credential escrow evidence markers still need real non-secret owner evidence IDs.
Next: use SOP v1.25 for the next reboot; record failed/stale/active Job counters separately; close B5 only after real credential escrow marker evidence exists.
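
The stale-versus-active split recorded in this entry can be illustrated with a small classifier. It is a sketch under assumptions, not the logic of `full-stack-cold-start-check.sh`: the `Job` dataclass and `classify_failed_jobs` are hypothetical, and the rule used here is simply that a failed Job counts as stale once a later Job from the same CronJob has succeeded, ordered by the numeric suffix of the Job name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    cronjob: str
    schedule_id: int   # numeric suffix of the Job name, e.g. 29689620
    succeeded: bool


def classify_failed_jobs(jobs: list[Job]) -> dict[str, int]:
    """Count failed Jobs and split them into stale and active."""
    latest_success: dict[str, int] = {}
    for job in jobs:
        if job.succeeded:
            current = latest_success.get(job.cronjob, -1)
            latest_success[job.cronjob] = max(current, job.schedule_id)
    failed = [job for job in jobs if not job.succeeded]
    # Stale: a later run of the same CronJob has already completed successfully.
    stale = [j for j in failed if latest_success.get(j.cronjob, -1) > j.schedule_id]
    return {
        "FAILED_JOBS": len(failed),
        "STALE_FAILED_JOBS": len(stale),
        "ACTIVE_FAILED_JOBS": len(failed) - len(stale),
    }


jobs = [
    Job("km-vectorize", 29689620, succeeded=False),  # retained failed Job
    Job("km-vectorize", 29705460, succeeded=True),   # a later completed Job
]
print(classify_failed_jobs(jobs))
```

Keeping the three counters separate matches the next action above: the retained failure stays visible as evidence without being counted as an active blocker.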
2026-06-18 12:17 Asia/Taipei
Phase: P0/P2/P3 live readback
Before: repo-side readiness was complete, but live gate had not been rerun after the same-day push.
After: live cold-start is `PASS=83 WARN=1 BLOCKED=0`, result `DEGRADED`; final rollout readback shows API `2/2`, Web `2/2`, Worker `1/1`, Canary `1/1`, and API health `200 healthy`.
Evidence: `full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1`; read-only K8s deployment/job snapshot from 120; public API health readback.
Blocked: no hard blocker. One warning remains: stale retained Job `km-vectorize-29689620` from 2026-06-14 03:00; later official km-vectorize Jobs are Complete. DR complete still blocked by real credential escrow evidence markers.
Next: before any actual reboot, rerun the same live preflight and classify as `B3_SERVICE_AVAILABLE_DEGRADED` if only stale evidence remains, or `B4_FULL_STACK_GREEN` only when `WARN=0 BLOCKED=0`.
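
The B3/B4 decision in the step above can be derived mechanically from the scorecard summary line. The sketch below is a simplification under stated assumptions: `parse_counters` and `service_level` are hypothetical helpers, any `WARN>0` is mapped to B3 (the workplan additionally requires that only stale evidence remains, which still needs a human check), and the `NOT_CLASSIFIED_BLOCKED` label is invented for illustration.

```python
import re


def parse_counters(line: str) -> dict[str, int]:
    """Extract PASS / WARN / BLOCKED counters from a scorecard summary line."""
    return {key: int(value) for key, value in re.findall(r"\b(PASS|WARN|BLOCKED)=(\d+)", line)}


def service_level(line: str) -> str:
    counters = parse_counters(line)
    if set(counters) != {"PASS", "WARN", "BLOCKED"}:
        raise ValueError(f"incomplete scorecard line: {line!r}")
    if counters["BLOCKED"] > 0:
        return "NOT_CLASSIFIED_BLOCKED"  # hypothetical label; resolve blockers first
    if counters["WARN"] > 0:
        return "B3_SERVICE_AVAILABLE_DEGRADED"
    return "B4_FULL_STACK_GREEN"


print(service_level("PASS=83 WARN=1 BLOCKED=0"))  # B3_SERVICE_AVAILABLE_DEGRADED
print(service_level("PASS=84 WARN=0 BLOCKED=0"))  # B4_FULL_STACK_GREEN
```

Raising on an incomplete line avoids treating a truncated or missing scorecard as green.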
2026-06-18 12:06 Asia/Taipei
Phase: P3
Before: repo-side readiness audit PASS=147 WARN=2 BLOCKED=37 before blocker batch; after Plan B-only guard it still had pre-existing blockers.
After: repo-side readiness audit PASS=185 WARN=1 BLOCKED=0, result READY WITH WARNINGS.
Evidence: full-stack-cold-start-check.sh now emits NODE_FS_ERROR_EVENTS and blocks K3s release on node filesystem evidence; backup-awoooi.sh no longer runs direct service-level rclone sync; 110-devops.yml manages cold-start monitor, runner guardrails, textfile exporters, backup scripts, daily backup heartbeat, offsite evidence report and offsite full-sync verifier; 188-ai-web.yml uses host-owned /home/ollama/bin/momo-pg-backup.sh and no longer contains the old app-directory backup cron path; nginx-sync.yml includes 188-internal-tools-https.conf.j2; ansible-lint.yml now runs self-hosted validation across Ansible, ops baseline, monitoring rules, backup scripts, reboot scripts, docs and workflow changes; bootstrap-ansible-validation-env.sh selects Python 3.11/3.10 for pinned ansible-core; ansible-validate.sh passes YAML, shell, Python, doc secret, backup alert label, recovery scorecard, Ansible syntax-check and ansible-lint minimum profile.
Blocked: no for repo-side reboot readiness contracts. Yes for live reboot authorization until same-day live checks run; yes for DR complete while credential escrow evidence markers remain missing.
Next: before an actual reboot, run the same-day live preflight and then the live cold-start gate with --live or the 110 deployed monitor; do not use repo-side READY WITH WARNINGS as a substitute for host/runtime truth.
2026-06-18 11:48 Asia/Taipei
Phase: P3
Before: P3 100%
After: P3 100%
Evidence: ops/reboot-recovery/full-stack-cold-start-baseline.yml now has a machine-readable plan_b section with red lines, triggers, host paths, B0-B5 levels, T+0/T+120 timeline, and closeout states; scripts/reboot-recovery/reboot-recovery-readiness-audit.sh now checks SOP and baseline for Plan B markers. Targeted assertion returned PLAN_B_BASELINE_ASSERTIONS_OK levels=6 closeout=3 timeline_stop=T+120. Full readiness audit confirms all new Plan B checks pass, but overall audit remains NOT READY because of pre-existing Ansible / workflow / backup-contract blockers unrelated to this Plan B addition.
Blocked: no for Plan B mechanism. Yes for overall reboot automation readiness audit until the existing non-Plan-B BLOCKED rows are resolved.
Next: continue closing pre-existing readiness-audit blockers by priority, without changing runtime or pretending the overall audit is green.
2026-06-18 11:41 Asia/Taipei
Phase: P3
Before: P3 100%
After: P3 100%
Evidence: docs/runbooks/FULL-STACK-COLD-START-SOP.md updated to v1.22 with explicit Plan B degraded-operation path, B0-B5 service levels, Plan B trigger table, host-specific fallback routes for 110/120/121/188/K3s/public gateway, T+0/T+120 fallback timeline, and Plan B closeout states. This workplan now requires every future reboot record to compare actual timing and blockers against SOP §1.4, not only the Plan A cold-start chain.
Blocked: no for documentation. Live reboot authorization still requires fresh same-day preflight before any maintenance window; DR complete remains blocked while credential escrow missing count is 5.
Next: before the next host reboot, rerun live preflight, choose Plan A or Plan B entry criteria, then record final level as B0/B1/B2/B3/B4/B5 with the exact blocker.
2026-06-14 18:15 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 97%, P1 92%, P2 99%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: latest repo docs baseline 0a4766dd; runtime deploy marker 16c6b983 put image e999c16b3435f197b78fe2adfeec1c4faa6c4675 live for API/Web/Worker/CronJob; ArgoCD source revision 0a4766ddc94b0690824ce3deba5c6b9a69764f94 remains Synced/Degraded, with km-vectorize still the only cause; API/Web split across mon/mon1; Worker on mon; IwoooS route /zh-TW/iwooos returned 200; AwoooP route /zh-TW/awooop returned 200; 110 systemctl --failed returned 0 loaded units listed; backup-status core_blockers=0 and escrow_missing=5; final cold-start PASS=82 WARN=1 BLOCKED=0.
Blocked: yes for full cold-start green, because km-vectorize-29689620 remains failed until the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: keep the 03:00 official schedule gate; after the next official km-vectorize run, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health.
2026-06-14 17:04 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 97%, P1 92%, P2 99%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: latest docs baseline af62ec1f; runtime deploy marker ed651a98 put image e992af89955f8aae40a383b2f2e2f645445a690d live for API/Web/Worker/CronJob; ArgoCD source revision af62ec1fe72b3e84e179d80e788e5a5902bdaf27 remains Synced/Degraded, with km-vectorize still the only cause; API/Web split across mon/mon1; Worker on mon1; IwoooS route /zh-TW/iwooos returned 200; 110 systemctl --failed returned 0 loaded units listed; backup-status core_blockers=0 and escrow_missing=5; final cold-start PASS=82 WARN=1 BLOCKED=0.
Blocked: yes for full cold-start green, because km-vectorize-29689620 remains failed until the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: keep the 03:00 official schedule gate; after the next official km-vectorize run, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health.
2026-06-14 16:29 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 97%, P1 92%, P2 99%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: latest docs baseline 06fe0a8f; runtime deploy marker 36fbfc6b put image 386dbd078ef63401d9736048463f4ef5326442d9 live for API/Web/Worker/CronJob; ArgoCD source revision 06fe0a8f14167824fea512f942d2569431bbcbc8 remains Synced/Degraded, with km-vectorize still the only cause; API/Web split across mon/mon1; Worker on mon; 110 systemctl --failed returned 0 loaded units listed; backup-status core_blockers=0 and escrow_missing=5; P2-145 endpoint current=P2-145 completion=100, with owner response received/accepted/rejected and reviewer/Gateway/Telegram/Bot API/result capture/learning/PlayBook trust/production write/secret/destructive all 0/false; final cold-start PASS=82 WARN=1 BLOCKED=0.
Blocked: yes for full cold-start green, because km-vectorize-29689620 remains failed until the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: keep the 03:00 official schedule gate; after the next official km-vectorize run, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health.
2026-06-14 15:58 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 97%, P1 92%, P2 99%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: gitea/main advanced to deploy marker 180a6543; image fef94df877c5438f9f34ddbcace8ad8112a141ef is live for API/Web/Worker; ArgoCD source revision 180a6543eaf26dd6b345d45114316926056a965a remains Synced/Degraded, with km-vectorize still the only cause; API/Web split across mon/mon1; Worker on mon1; 110 systemctl --failed returned 0 loaded units listed; backup-status core_blockers=0 and escrow_missing=5; P2-144 endpoint current=P2-144 completion=100, with owner response received/accepted/rejected and reviewer/Gateway/Telegram/Bot API/result capture/learning/PlayBook trust/production write/secret/destructive all 0/false; final cold-start PASS=82 WARN=1 BLOCKED=0.
Blocked: yes for full cold-start green, because km-vectorize-29689620 remains failed until the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: keep the 03:00 official schedule gate; after the next official km-vectorize run, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health.
2026-06-14 15:00 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 97%, P1 92%, P2 99%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: latest docs baseline b09eb1c6; runtime deploy marker 667d6329 put image 755b0a8d3038df2c52dee280067863d92db1eda5 live for API/Web/Worker/CronJob; ArgoCD revision 4abf0c0f750254d3c7137eae049abdfd99630f5f remains Synced/Degraded, with km-vectorize still the only cause; API/Web split across mon/mon1; Worker on mon; 110 systemctl --failed returned 0 loaded units listed; backup-status core_blockers=0 and escrow_missing=5; P2-143 endpoint current=P2-143 completion=100, with writer/Gateway/Telegram/Bot API/production write/secret/destructive all 0/false; final cold-start PASS=82 WARN=1 BLOCKED=0.
Blocked: yes for full cold-start green, because km-vectorize-29689620 remains failed until the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: keep the 03:00 official schedule gate; after the next official km-vectorize run, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health.
2026-06-14 10:40 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 97%, P1 92%, P2 99%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: latest docs head observed before this recovery commit 50d4f2ba; runtime deploy marker d023f5d7 put image f737f278 live for API/Web/Worker/CronJob; ArgoCD revision 50d4f2ba; API/Web split across mon/mon1; Worker on mon; 110 systemctl --failed returned 0 loaded units listed and fwupd-refresh.timer remained disabled/inactive; backup-status core_blockers=0 and escrow_missing=5; final cold-start PASS=82 WARN=1 BLOCKED=0.
Blocked: yes for full cold-start green, because km-vectorize-29689620 remains failed until the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: keep the 03:00 official schedule gate; after the next official km-vectorize run, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health.
2026-06-14 09:56 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 97%, P1 92%, P2 99%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: latest docs head observed before this recovery commit a0fe7741; runtime deploy marker and ArgoCD revision 60a0415c put image a3de0ffb live for API/Web/Worker/CronJob; API/Web split across mon/mon1; Worker on mon1; 110 systemctl --failed returned 0 loaded units listed and fwupd-refresh.timer remained disabled/inactive; backup-status core_blockers=0 and escrow_missing=5; final cold-start PASS=82 WARN=1 BLOCKED=0.
Blocked: yes for full cold-start green, because km-vectorize-29689620 remains failed until the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: keep the 03:00 official schedule gate; after the next official km-vectorize run, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health.
2026-06-14 09:27 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 97%, P1 92%, P2 99%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: gitea/main 5bad267e and ArgoCD revision 5bad267e; deploy marker 8d575c1a put image 280e0fbe live for API/Web/Worker/CronJob; API/Web split across mon/mon1; 110 systemctl --failed returned 0 loaded units listed and fwupd-refresh.timer remained disabled/inactive; backup-status core_blockers=0 and escrow_missing=5; first cold-start had transient stock 502 during stockplatform-v2 warmup, direct route/TLS recheck returned 200, final cold-start PASS=82 WARN=1 BLOCKED=0.
Blocked: yes for full cold-start green, because km-vectorize-29689620 remains failed until the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: keep the 03:00 official schedule gate; after the next official km-vectorize run, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health.
2026-06-14 08:40 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 97%, P1 92%, P2 99%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: gitea/main and ArgoCD revision 18b867c3; deploy marker 18b867c3 put image e0a6d339 live for API/Web/Worker/CronJob; API/Web split across mon/mon1; 110 systemctl --failed returned 0 loaded units listed and fwupd-refresh.timer remained disabled/inactive; backup-status core_blockers=0 and escrow_missing=5; cold-start PASS=82 WARN=1 BLOCKED=0.
Blocked: yes for full cold-start green, because km-vectorize-29689620 remains failed until the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: keep the 03:00 official schedule gate; after the next official km-vectorize run, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health.
2026-06-14 08:24 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 96%, P1 92%, P2 98%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: 110 fwupd-refresh.timer disabled/inactive with rollback command recorded; systemctl --failed returned 0 loaded units listed; backup-status 110 13/13 fresh failed=0 and 188 2/2 fresh failed=0 with core_blockers=0 and escrow_missing=5; cold-start PASS=82 WARN=1 BLOCKED=0; ArgoCD/CronJob still waiting for official km-vectorize lastSuccessfulTime after deploy marker ec03f0b7 / image 8ddb80d6.
Blocked: yes for full cold-start green, because km-vectorize-29689620 remains failed until the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: after the next 03:00 Asia/Taipei official km-vectorize schedule, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health; do not manual-run, delete, patch, or fake evidence.
2026-06-13 01:29 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 95%, P1 90%, P2 100%, P3 100%
After: Overall 95%, P1 90%, P2 100%, P3 100%
Evidence: Gitea main e4a349bc; ArgoCD revision e4a349bc sync=Synced health=Degraded only by km-vectorize stale success; K3s images 414413a59268eedd391648f112e228716dd05362; API/Web split across mon/mon1; /home/wooo/.ssh/known_hosts retained 120/188 after CD fix 80e6ec1a; backup-status 110 13/13 fresh failed=0 and 188 2/2 fresh failed=0; offsite textfile remote_verify_ok=1 and 13 repos snapshot_count=1; backup alert live visibility OK; all five required Prometheus alert rule names health=ok; cold-start PASS=83 WARN=0 BLOCKED=0.
Blocked: yes for DR complete only, because 5 credential escrow evidence markers are still missing; a fully healthy ArgoCD still waits for the official 03:00 km-vectorize lastSuccessfulTime.
Next: after 03:00 Asia/Taipei, verify km-vectorize official Job completion and ArgoCD health; keep escrow alerts firing until real non-secret evidence IDs are written.
2026-06-04 15:23 Asia/Taipei
Phase: P3
Before: 78%
After: 95%
Evidence: infra/ansible/roles/nginx/templates/188-all-sites.conf.j2 now contains aiops VIP upstreams 192.168.0.125:32334/32335; live smoke aiops / -> 307 and /api/v1/health -> 200; content guard passed.
Blocked: no for route baseline; ansible-playbook is unavailable on this workstation, so syntax-check remains delegated to the normal Ansible environment before next apply.
Next: run Ansible syntax/apply validation from the Ansible host before changing 188 nginx live config.
2026-06-04 15:23 Asia/Taipei
Phase: P2
Before: 52%
After: 66%
Evidence: /Users/ogt/momo-pro-system/services/import_service.py updated; /Users/ogt/momo-pro-system/tests/test_daily_sales_monthly_sync_failure.py added; targeted pytest passed with temp SQLite and real Excel input.
Blocked: yes. Live 188 uses /home/ollama/momo-pro bind-mounted code, while momo/ewoooc canonical source remains unresolved.
Next: reconcile canonical source/deploy path, apply the same monthly-sync failure contract to live, then run controlled live auto-import failure-path verification.
2026-06-04 15:34 Asia/Taipei
Phase: P2
Before: 66%
After: 86%
Evidence: live /home/ollama/momo-pro/services/import_service.py patched from backup services/import_service.py.bak.20260604-152827; live hash 3fc45671986fa4cc155119f588bc1ebefd272927730052e42e2b9eb4352b2586; container isolated temp-DB/real-Excel contract test passed; momo-scheduler and momo-pro-system restarted and healthy; mo.wooo.work /health 200; latest DB parity daily=404 and monthly=404 for 2026-06-02.
Blocked: no for momo failure contract. Overall remains blocked by 120 reachability and credential escrow.
Next: observe the next real Google Drive import and keep canonical momo/ewoooc source-control reconciliation as a separate supply-chain item.
2026-06-04 15:50 Asia/Taipei
Phase: P1
Before: 58%
After: 72%
Evidence: /backup/scripts/backup-status.sh --no-notify initially showed stale110=awoooi_db, stale188=momo_pg_daily, configured_missing_188=1; manual 188 momo PostgreSQL backup completed and kept latest-only; manual 110 backup-awoooi-frequent completed with restic snapshot 7440d75f; 188 crontab now points momo_pg_daily to /home/ollama/bin/momo-pg-backup.sh; final backup-status shows stale110=none, stale188=none, configured_missing_188=0, core_blockers=1, escrow_missing=5.
Blocked: yes. 120 config capture still keeps aggregate backup red, and five credential escrow evidence markers are still missing.
Next: after 120 returns, rerun backup-configs, backup-all, offsite sync, full offsite verify, then cold-start scorecard; separately fill escrow only with real non-secret evidence IDs.
2026-06-04 18:55 Asia/Taipei
Phase: P0/P1/P2
Before: Overall 60%, P1 72%, P2 86%
After: Overall 61%, P1 74%, P2 88%
Evidence: local ping to 192.168.0.120 still 0/3, SSH 22 timed out, ARP incomplete; 121 kubectl still shows mon NotReady,SchedulingDisabled and mon1 Ready; 110 backup-status --no-notify shows stale110=none, stale188=none, configured_missing_188=0, core_blockers=1, escrow_missing=5; cold-start scorecard now reports PASS=71 WARN=3 BLOCKED=3 and momo monthly parity 2215/2215 for 2026-06-01 through 2026-06-04.
Blocked: yes. The three hard blocks are still 120 ping, 120 SSH, and 120 K3s read-only check; escrow remains missing 5 evidence markers.
Next: wait for physical/console recovery of 120, then run the required backup-configs / backup-all / offsite sync / full verify / cold-start sequence.
2026-06-04 19:02 Asia/Taipei
Phase: P0/P3
Before: Overall 61%, P0 35%, P3 95%
After: Overall 62%, P0 36%, P3 96%
Evidence: local/110/121/188 all failed to reach 192.168.0.120; 121 returned Destination Host Unreachable; kubectl describe node mon shows LastHeartbeatTime 2026-05-22 02:44:13 +08, Ready Unknown since 2026-05-22 02:49:48 +08, and kube-node-lease renewTime 2026-05-22 02:48:36 +08; 120-fsck-maintenance-checklist.sh --no-color returned PASS=2 WARN=2 BLOCKED=3 and MAINTENANCE REQUIRED; repo search found no BMC/IPMI/WOL inventory for 120.
Blocked: yes. 120 requires physical or VM console recovery before backup-configs, backup-all, offsite sync, and full cold-start can be made green.
Next: use console to verify 120 power/NIC/boot/initramfs state, perform offline fsck only if needed, then restore SSH and run the required recovery sequence.
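To make the size of the heartbeat gap in this entry concrete, the age of the last node heartbeat can be computed from the two timestamps it quotes. This is a minimal sketch: the helper name is hypothetical, and the observation time is taken from the entry's own 19:02 header rather than from any tool output.

```python
from datetime import datetime, timedelta, timezone

# Asia/Taipei is UTC+8 with no daylight saving, so a fixed offset is sufficient here.
TAIPEI = timezone(timedelta(hours=8))
FMT = "%Y-%m-%d %H:%M:%S"

def heartbeat_age(last_heartbeat: str, observed_at: str) -> timedelta:
    """Return how long ago the node last reported a heartbeat (both times in Asia/Taipei)."""
    last = datetime.strptime(last_heartbeat, FMT).replace(tzinfo=TAIPEI)
    seen = datetime.strptime(observed_at, FMT).replace(tzinfo=TAIPEI)
    return seen - last

# LastHeartbeatTime for node mon versus the time of this log entry.
age = heartbeat_age("2026-05-22 02:44:13", "2026-06-04 19:02:00")
print(age)  # 13 days, 16:17:47
```

A gap of almost two weeks is consistent with a host that is down at the power, NIC, or boot level rather than a transient kubelet problem, which is why the entry routes recovery through the console.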
2026-06-05 18:40 Asia/Taipei
Phase: P0/P1/P3
Before: Overall 62%, P1 74%, P3 96%
After: Overall 64%, P1 80%, P3 97%
Evidence: 120 remains unreachable from local/110/121/188 and K3s mon remains NotReady,SchedulingDisabled; 14:00 AWOOOI high-frequency backup had failed, then 16:01 manual high-frequency backup completed snapshot b7d5ee4e; Gitea stale container dump /tmp/gitea-dump.zip was preserved as /tmp/gitea-dump.stale.20260605_161032.zip, script hardened, and manual Gitea backup completed snapshot ea641613; Open-WebUI d1147507, ClawBot 73ead3cc, AI artifacts b1161ab8 completed; partial offsite sync for five changed repos completed 5/5; verify-offsite-full-sync reports REMOTE_LATEST_ONLY_OK=1 and VERIFY_OK=1; final backup-status shows stale110=none, stale188=none, core_blockers=6, escrow_missing=5; cold-start remains PASS=71 WARN=3 BLOCKED=3.
Blocked: yes. 120 remains the P0 blocker, backup_all failed history remains red until backup-all can rerun after 120 returns, and credential escrow still lacks five non-secret evidence markers.
Next: monitor the 20:00 high-frequency backup, keep 120 console recovery as P0, then rerun backup-configs / backup-all / offsite sync / full verify / cold-start after 120 returns.
2026-06-06 14:47 Asia/Taipei
Phase: P0/P1/P2
Before: Overall 64%, P1 80%, P2 88%
After: Overall 65%, P1 84%, P2 89%
Evidence: 120 ping still failed, SSH timed out, ARP incomplete, and K3s mon remains NotReady,SchedulingDisabled; 06-06 02:00 aggregate failed only Configs (12/13 success) due to the 120 config capture blocker; backup-status at 14:46 shows stale110=none, stale188=none, failed=1, core_blockers=1, escrow_missing=5; verify-offsite-full-sync shows all 13 remote repos snapshots=1, REMOTE_LATEST_ONLY_OK=1, VERIFY_OK=1; cold-start reports PASS=70 WARN=4 BLOCKED=3; momo scheduler direct log activity count over the last 15 minutes is 151 despite the scorecard WARN.
Blocked: yes. 120 remains unreachable, aggregate backup cannot be green until backup-configs and backup-all rerun after 120 returns, and credential escrow still lacks five evidence markers.
Next: keep 120 console recovery as P0, keep escrow marker collection separate from secrets, and rerun the required backup/offsite/cold-start sequence only after 120 is reachable.
2026-06-06 15:00 Asia/Taipei
Phase: P3
Before: P3 97%
After: P3 98%
Evidence: docs/runbooks/FULL-STACK-COLD-START-SOP.md updated to v1.3 with 2026-06-06 live baseline, full shutdown/startup/single-host reboot SOP, mandatory reboot ledger template, and SOP version-comparison rules.
Blocked: no for documentation. Validation gap remains because ansible-playbook is unavailable on this workstation and 120 recovery still requires console access.
Next: after the next actual reboot or 120 console recovery, append a LOGBOOK reboot record and compare it against this 2026-06-06 baseline before changing SOP version again.
2026-06-06 15:03 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 65%, P2 89%, P3 98%
After: Overall 65%, P2 90%, P3 99%
Evidence: 120 still ping/SSH failed with ARP incomplete; 121 still shows mon NotReady,SchedulingDisabled and mon1 Ready; backup-status at 15:02 shows stale110=none, stale188=none, failed=1, core_blockers=1, escrow_missing=5; offsite verifier shows 13 repos snapshots=1 with REMOTE_LATEST_ONLY_OK=1 and VERIFY_OK=1; Alertmanager has all five required backup/cold-start rules; escrow report shows scripts/config present but 5 evidence markers missing; 15:03 cold-start reports PASS=71 WARN=3 BLOCKED=3; direct 188 momo-scheduler check is healthy with recent log activity.
Blocked: yes. The three hard blocks remain 120 ping, 120 SSH, and 120 K3s read-only check; aggregate backup remains blocked by 120 config capture; DR scorecard remains blocked by five missing non-secret escrow markers.
Next: do not fake escrow markers; after real non-secret evidence IDs are available, run mark-credential-escrow-verified.sh for the five items. Keep 120 console recovery as P0.
2026-06-06 15:06 Asia/Taipei
Phase: P1/P3
Before: Overall 65%, P1 84%, P3 99%
After: Overall 65%, P1 85%, P3 99%
Evidence: /backup/scripts/mark-credential-escrow-verified.sh --help confirms --dry-run support, allowed item names, and placeholder/secret rejection rules; docs/runbooks/BACKUP-STATUS.md now contains the credential escrow evidence checklist and safe marker flow.
Blocked: yes. No marker was written because no real non-secret evidence IDs were available in this session; escrow_missing remains 5.
Next: once real external evidence IDs exist, dry-run each item first, then write markers and rerun offsite-escrow-evidence-report plus backup-status.
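The dry-run-first ordering described in this entry can be sketched as follows. This is an illustration only: the `mark` callable is a hypothetical wrapper around mark-credential-escrow-verified.sh, and its argument shape is an assumption, since this log confirms only that the script supports `--dry-run`. The item names and evidence IDs in the usage lines are placeholders for the demonstration, not real escrow items.

```python
from typing import Callable, Dict, List

# Hypothetical wrapper signature: (item, evidence_id, dry_run) -> True on success.
Marker = Callable[[str, str, bool], bool]

def write_markers(entries: Dict[str, str], mark: Marker) -> List[str]:
    """Dry-run every item first; write markers only if all dry-runs pass."""
    rejected = [item for item, eid in entries.items() if not mark(item, eid, True)]
    if rejected:
        # Nothing has been written at this point, so the escrow state is unchanged.
        raise ValueError(f"dry-run rejected: {rejected}")
    written = []
    for item, eid in entries.items():
        if not mark(item, eid, False):
            raise RuntimeError(f"write failed for {item}; already written: {written}")
        written.append(item)
    return written

# Demonstration with a fake marker that rejects an obvious placeholder value.
calls = []
def fake_mark(item: str, evidence_id: str, dry_run: bool) -> bool:
    calls.append((item, dry_run))
    return evidence_id != "TODO"

print(write_markers({"item_a": "EV-1", "item_b": "EV-2"}, fake_mark))
```

Running all dry-runs before any write keeps a rejected evidence ID from leaving a partial marker set, after which offsite-escrow-evidence-report and backup-status can be rerun as the entry describes.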
2026-06-12 04:11 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 65%, P1 85%, P2 90%, P3 99%
After: Overall 66%, P1 86%, P2 90%, P3 99%
Evidence: 120 still ping/SSH failed with ARP incomplete; 121 still shows mon NotReady,SchedulingDisabled and mon1 Ready; 110->188 SSH host key trust repaired after matching ED25519 fingerprint; 02:00 backup-all completed 12/13 and failed only Configs due to 120; backup-status at 04:11 shows stale110=none, stale188=none, failed=1, core_blockers=1, escrow_missing=5; offsite sync from 03:00 is still running at 04:10.
Blocked: yes. Full reboot window is NO-GO until current offsite sync exits and a fresh offsite verifier passes; full green remains impossible while 120 is unreachable.
Next: wait for the 03:00 offsite sync to finish, run verify-offsite-full-sync, then rerun cold-start scorecard before approving any maintenance window.
2026-06-12 18:57 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 67%, P0 36%, P1 86%, P2 95%, P3 99%
After: Overall 95%, P0 100%, P1 90%, P2 97%, P3 100%
Evidence: 120 root fsck recovery booted at 15:13; 120/121 are both Ready control-plane; backup-configs and backup-all captured 120/121/K8s successfully; backup-all completed 13/13 at 15:54; full offsite sync completed 13/13 at 17:37 after documented recovery runway override to 240m; verify-offsite-full-sync returned REMOTE_LATEST_ONLY_OK=1, FULL_MARKER_FRESH=1, VERIFY_OK=1, FAILED=0; backup-status at 18:55 reports core_blockers=0 and escrow_missing=5; cold-start at 18:57 reports PASS=83 WARN=0 BLOCKED=0.
Blocked: yes for DR only. Service/full-stack recovery is green, but DR scorecard remains blocked until five credential escrow evidence markers are written with real non-secret evidence IDs.
Next: collect real credential escrow evidence IDs, dry-run each marker, then write markers and rerun offsite-escrow-evidence-report plus backup-status; separately plan AWOOOI API/Web topology spread before moving services from 110/188 to 120/121.
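The offsite verifier gate quoted in this entry can be checked mechanically. This is a minimal sketch that assumes the verifier output can be reduced to `KEY=value` tokens as they appear in this log; the real output format of verify-offsite-full-sync is not shown here, and the function names are hypothetical.

```python
# Expected values taken from the 2026-06-12 18:57 entry.
REQUIRED = {
    "REMOTE_LATEST_ONLY_OK": 1,
    "FULL_MARKER_FRESH": 1,
    "VERIFY_OK": 1,
    "FAILED": 0,
}

def parse_kv(text: str) -> dict:
    """Collect KEY=<integer> tokens from a line, ignoring everything else."""
    fields = {}
    for token in text.replace(",", " ").split():
        key, sep, value = token.partition("=")
        if sep and value.isdigit():
            fields[key] = int(value)
    return fields

def offsite_problems(text: str) -> list:
    """Return the required markers that are missing or have the wrong value."""
    fields = parse_kv(text)
    return [f"{key}={fields.get(key)}" for key, want in REQUIRED.items() if fields.get(key) != want]

line = "REMOTE_LATEST_ONLY_OK=1, FULL_MARKER_FRESH=1, VERIFY_OK=1, FAILED=0"
print(offsite_problems(line))  # []
```

A missing marker is reported the same way as a wrong value, so an incomplete verifier run cannot be read as a pass.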
10. Completion Claims That Are Not Allowed Yet
- Do not claim every future reboot is guaranteed green. This run is green for the latest verified evidence set only.
- Do not silence credential escrow alerts. They are the remaining correct DR red light.
- Do not claim DR scorecard complete. Credential escrow markers are missing.
- Do not claim public-route success is system success. Route checks must be paired with DB, backup, schedules, Alertmanager, and cold-start scorecard evidence.
- Do not claim the next real Google Drive import has succeeded until the post-import row counts/date bounds and Drive archive movement are rechecked.
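The claim rules above can be sketched as a gate over the evidence set. This is an illustration under stated assumptions: the field and claim names are hypothetical, and treating service green plus `escrow_missing=0` as sufficient for a DR claim is a simplification; the list above only establishes that missing escrow markers block it.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    routes_ok: bool        # public route smoke
    db_ok: bool
    backup_ok: bool
    schedules_ok: bool
    alertmanager_ok: bool
    cold_start_blocked: int  # BLOCKED count from the cold-start scorecard
    escrow_missing: int      # missing credential escrow evidence markers

def allowed_claims(e: Evidence) -> list:
    """Return the completion claims this evidence set supports."""
    claims = []
    # Route success alone is never enough; every paired check must also pass.
    service_green = (
        all([e.routes_ok, e.db_ok, e.backup_ok, e.schedules_ok, e.alertmanager_ok])
        and e.cold_start_blocked == 0
    )
    if service_green:
        claims.append("service_green_for_this_evidence_set")
    if service_green and e.escrow_missing == 0:
        claims.append("dr_complete")
    return claims

# Roughly the 2026-06-12 18:57 state: services green, five escrow markers missing.
latest = Evidence(True, True, True, True, True, cold_start_blocked=0, escrow_missing=5)
print(allowed_claims(latest))  # ['service_green_for_this_evidence_set']
```

The claim name is scoped to the evidence set on purpose, matching the first rule: a green result describes the latest verified run, not future reboots.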