2026-06-04 Reboot / Cold-Start / Backup Recovery Workplan

Owner: SRE / DevOps commander
Timezone: Asia/Taipei
Baseline: 2026-06-04 15:00 live read-only checks. Do not reuse the 2026-05-29 baseline without rerunning checks.
Scope: 110 / 120 / 121 / 188. 112 is Kali and is intentionally excluded from this recovery wave.


1. Current Verdict

Area Status Completion Evidence
Overall recovery readiness FULL_STACK_GREEN_DR_ESCROW_BLOCKED 99% 2026-06-27 02:42 live revalidation supersedes the temporary blocked reading from 00:16. post-reboot-readiness-summary.sh --no-color artifact /tmp/awoooi-post-reboot-readiness-20260627-024151/summary.txt returns SERVICE_GREEN=1, PRODUCT_DATA_GREEN=1, POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED, POST_START_WARN=3, POST_START_BLOCKED=0, STOCK_FRESHNESS_STATUS=ok, STOCK_LATEST_TRADING_DATE=2026-06-26, STOCK_BLOCKERS=none, BACKUP_CORE_GREEN=1, ESCROW_MISSING_COUNT=5, WAZUH_MANAGER_REGISTRY_ACCEPTED=0. The 00:16 blocker was 188 momo_pg_daily configured drift: the backup was fresh, but the exporter reported configured_missing_188=1 because the crontab still pointed at the app-side path. At 00:19 the 188 crontab was backed up to /home/ollama/momo_backups/crontab-before-momo-pg-host-owned-20260627-001925.txt and converged to the host-owned /home/ollama/bin/momo-pg-backup.sh; after the exporter refresh, awoooi_backup_job_configured{host="188",job="momo_pg_daily"} is 1, and the 00:56 backup-status shows core_blockers=0. The 02:41 DR checklist returns CORE_COLD_START_GREEN=1, RECOVERY_STATE=CORE_READY_DR_OFFSITE_PENDING; the Prometheus live contract returns awoooi_recovery_core_ready=1, awoooi_recovery_dr_offsite_ready=0. Hosts / K3s / public routes / AWOOOI / MOMO / Stock / backup core / 188 hygiene have recovered. DR still cannot be declared complete because credential escrow is missing 5; the Wazuh registry has a redacted manager readback but not yet Dashboard API / owner acceptance.
P0 host / K3s recovery DONE 100% 120 booted after console fsck at 2026-06-12 15:13; latest 2026-06-26 07:19 readback shows 120 and 121 reachable, K3s active, mon and mon1 both Ready control-plane, AWOOOI API/Web replicas split across both nodes, ArgoCD awoooi-prod Synced / Healthy at revision 1fd5e2a8b0f18d24eed16aa2a44286bcbf230603, and km-vectorize official 03:00 Taipei-time run succeeded with lastSuccess=2026-06-25T19:00:14Z.
P1 backup / alert / escrow BLOCKED_DR_ESCROW 98% 2026-06-27 00:56 backup readback shows 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, integrity_stale=0, offsite_fresh=1, rclone_gdrive_fresh=1, configured_missing_188=0, escrow_missing=5, last aggregate 2026-06-26 02:31:02. The 188 MOMO backup crontab drift has been repaired and a rollback crontab is retained. DR remains blocked on real non-secret credential escrow evidence IDs; do not write placeholder markers or paste secret values.
P2 service / data truth DONE 100% Public routes and service health are green; MOMO health is V10.719; current-month parity is `15383
P3 docs / automation contracts DONE_WITH_BACKUP_CORE_RECOVERY_V178 100% Workplan, SOP v1.78, post-reboot declaration guard, machine-readable post-reboot readiness summary with Wazuh registry detail fields and auto-persisted summary.txt, post-reboot next-gate dispatch checklist, owner-packet JSON generator, dynamic owner-packet contract guard, post-reboot owner response preflight, owner response placeholder template, one-page post-start quick check v1.18, route retry gate, delegated cold-start public-route / AWOOOI API warmup classifier, backup-status core-blocker readback, PyYAML-optional recovery-scorecard contract check, 188 MOMO backup crontab host-owned rollback evidence, deploy warmup classification, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, StockPlatform natural schedule green evidence, 110 orphan Chrome recurrence cleanup evidence, 188 fail-closed startup data recovery gate, 188 host hygiene read-only checklist, 188 PostgreSQL runtime-ready source-of-truth, 188 ACME route/timer hygiene, baseline stockplatform_system_freshness_ok, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable plan_b baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD known_hosts guardrail, fwupd-refresh.timer rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet 
mapping, healthy heartbeat suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and stricter product-data / route retry gates are updated. Declaration guard now machine-checks allowed / forbidden recovery statements from the same summary.txt: service/data/backup/188 host hygiene green may be declared when live summary says so, while DR_COMPLETE, WAZUH_REGISTRY_RECOVERED, and RUNTIME_ACTION_AUTHORIZED remain forbidden until evidence gates close.

2026-06-26 12:13 machine-readable summary baseline supersedes the 07:47 / 08:59 gate set: scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color stores delegated logs under /tmp/awoooi-post-reboot-readiness-20260626-121303 and returns SERVICE_GREEN=1, PRODUCT_DATA_GREEN=1, BACKUP_CORE_GREEN=1, DR_ESCROW_BLOCKED=1, ESCROW_MISSING_COUNT=5, HOST_188_SERVICE_GREEN=1, HOST_188_HYGIENE_BLOCKED=0, HOST_188_CHECK_RC=0, HOST_188_RESULT=HOST_188_HYGIENE_GREEN., WAZUH_ROUTE_CODE=200, WAZUH_TRANSPORT_COUNT=6, WAZUH_COVERAGE_SCOPE=6, WAZUH_DIRECT_ACTIVE=2, WAZUH_NO_TRANSPORT=1, WAZUH_SSH_BLOCKED=3, WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning, WAZUH_DASHBOARD_INDEX_OK=3, WAZUH_MANAGER_REGISTRY_ACCEPTED=0, WAZUH_RUNTIME_GATE=0, RUNTIME_ACTION_AUTHORIZED=0, OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED, and NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export. This is now the preferred first operator/AI-agent entrypoint after reboot because it separates service health from DR and security registry evidence; 188 host hygiene is no longer a next gate unless the live checklist regresses.
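A minimal sketch of how an operator or agent might consume this summary. It assumes the artifact holds one KEY=VALUE pair per line, which the source does not confirm; the helper names `parse_summary` and `next_required_gates` are hypothetical and are not part of the repository scripts.

```python
# Sketch only: assumes summary.txt is one KEY=VALUE pair per line.
# parse_summary / next_required_gates are hypothetical helper names.

def parse_summary(text: str) -> dict[str, str]:
    """Return the KEY=VALUE pairs found in the text, skipping other lines."""
    summary: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        summary[key.strip()] = value.strip()
    return summary


def next_required_gates(summary: dict[str, str]) -> list[str]:
    """Split the comma-separated NEXT_REQUIRED_GATES value into a list."""
    raw = summary.get("NEXT_REQUIRED_GATES", "")
    return [gate.strip() for gate in raw.split(",") if gate.strip()]


# Values copied from the 12:13 baseline described above.
SAMPLE = """\
SERVICE_GREEN=1
PRODUCT_DATA_GREEN=1
BACKUP_CORE_GREEN=1
DR_ESCROW_BLOCKED=1
ESCROW_MISSING_COUNT=5
OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED
NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export
"""

if __name__ == "__main__":
    parsed = parse_summary(SAMPLE)
    print(parsed["OVERALL_DECLARATION"])
    print(next_required_gates(parsed))
```

Keeping values as strings avoids guessing types for fields such as WAZUH_DASHBOARD_API_CONNECTION, which carry non-numeric states.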

2026-06-26 07:47 machine-readable summary baseline is retained as historical evidence only. It showed HOST_188_HYGIENE_BLOCKED=1 and three next gates before the 188 startup / ACME / certbot hygiene repair. Do not use the 07:47 gate set as the current status.

2026-06-26 12:13 next-gate dispatch baseline: scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color now emits only the gates present in the current summary. Current expected gates are credential_escrow_evidence and wazuh_manager_registry_export, with NEXT_GATE_COUNT=2, REQUEST_SENT_COUNT=0, DISPATCH_AUTHORIZED=0, HOST_WRITE_AUTHORIZED=0, SECRET_VALUE_COLLECTION_ALLOWED=0, RUNTIME_ACTION_AUTHORIZED=0. If 188 hygiene regresses, host_188_hygiene_maintenance_window will reappear automatically.

2026-06-26 12:13 owner-packet JSON baseline: scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color emits schema_version=awoooi_post_reboot_next_gate_owner_packets_v1 with dynamic next_gate_count=2, p0_gate_count=2, request_sent_count=0, owner_response_received_count=0, owner_response_accepted_count=0, runtime_action_authorized_count=0. This packet is for AI / operator / owner review intake only; it does not send request, write credential marker, read secret, or authorize runtime action.

2026-06-26 12:13 owner-packet contract guard baseline: scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json validates the generated JSON before any owner review intake. It requires the packet gates to equal the live source.next_required_gates, preserves request_sent=0, owner_response_received=0, owner_response_accepted=0, runtime_action_authorized=0, host_write_authorized=0, secret_value_collection_allowed=0, runtime_gate=0, and rejects missing forbidden payload/action controls for active gates. Current expected success line: POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=2 request_sent=0 accepted=0 runtime_gate=0.
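The two core checks of such a guard can be sketched as follows. The packet layout here is an assumption for illustration only: a top-level `source.next_required_gates` list (the one name taken from the text) and a hypothetical `packets` list whose entries carry a `gate` name plus the zero-valued counters. The real schema may differ, and the forbidden payload/action control check is not modeled.

```python
# Sketch only: the "packets" list and per-packet "gate" key are hypothetical;
# only source.next_required_gates and the counter names come from the text.

ZERO_FIELDS = (
    "request_sent",
    "owner_response_received",
    "owner_response_accepted",
    "runtime_action_authorized",
    "host_write_authorized",
    "secret_value_collection_allowed",
    "runtime_gate",
)


def check_packet(packet: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the packet passes."""
    errors: list[str] = []
    expected = list(packet.get("source", {}).get("next_required_gates", []))
    packets = packet.get("packets", [])
    gates = [entry.get("gate") for entry in packets]
    # The packet gate set must equal the live gate set exactly.
    if sorted(map(str, gates)) != sorted(map(str, expected)):
        errors.append(f"gate mismatch: packet={gates} source={expected}")
    # Every safety counter must be present and zero.
    for entry in packets:
        for field in ZERO_FIELDS:
            if entry.get(field) != 0:
                errors.append(f"{entry.get('gate')}: {field} must be 0")
    return errors


def make_entry(gate: str) -> dict:
    """Build a packet entry with all safety counters at zero."""
    entry = {"gate": gate}
    entry.update({field: 0 for field in ZERO_FIELDS})
    return entry


if __name__ == "__main__":
    live_gates = ["credential_escrow_evidence", "wazuh_manager_registry_export"]
    good = {
        "source": {"next_required_gates": live_gates},
        "packets": [make_entry(g) for g in live_gates],
    }
    print(check_packet(good))
```

A missing counter is treated as a violation rather than as zero, so the check fails closed.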

2026-06-26 13:01 owner response preflight baseline: scripts/reboot-recovery/post-reboot-owner-response-preflight.py --no-color validates future owner responses against the dynamic owner-packet gate set without sending requests, writing markers, reading secrets, or changing runtime. Missing response file must remain blocked_waiting_owner_response_file; the placeholder template docs/templates/post-reboot-next-gate-owner-response.json must remain blocked_waiting_owner_response_content with received=0, accepted=0, and runtime_gate=0. The only acceptable payload class is redacted owner evidence for credential escrow and Wazuh manager registry export; secret values, hash / prefix / suffix, raw Wazuh payload, agent real names, internal IPs, client.keys, credential marker write, host write, Wazuh active response / re-enroll / restart, and Kali active scan are rejected.

2026-06-26 17:45 single-summary replay baseline: scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color now writes the exact emitted key/value summary to $ARTIFACT_DIR/summary.txt; latest artifact /tmp/awoooi-post-reboot-readiness-20260626-174451/summary.txt returns SERVICE_GREEN=1, PRODUCT_DATA_GREEN=1, BACKUP_CORE_GREEN=1, DR_ESCROW_BLOCKED=1, ESCROW_MISSING_COUNT=5, HOST_188_HYGIENE_BLOCKED=0, WAZUH_MANAGER_REGISTRY_ACCEPTED=0, RUNTIME_ACTION_AUTHORIZED=0, OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED, and NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export. The same summary file drives declaration guard, next-gate dispatch, owner packet generation, contract guard, and owner response preflight. post-start-quick-check.sh now holds delegated cold-start blockers until wrapper route retry completes; route-only cold-start blockers that recover under wrapper retry are evidence warnings, while non-route blockers or unrecovered routes remain hard blockers.
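The blocker-holding rule in the last sentence can be sketched as below. The `Blocker` record and `classify_blockers` function are hypothetical names for illustration; the actual shell script may represent blockers differently.

```python
# Sketch only: Blocker and classify_blockers are hypothetical names that
# illustrate the rule stated above, not the real quick-check implementation.
from dataclasses import dataclass


@dataclass(frozen=True)
class Blocker:
    name: str
    is_route: bool                    # True if the blocker is a public-route failure
    recovered_on_retry: bool = False  # True if the wrapper route retry later succeeded


def classify_blockers(blockers: list[Blocker]) -> tuple[list[str], list[str]]:
    """Split delegated cold-start blockers into (evidence_warnings, hard_blockers)."""
    warnings: list[str] = []
    hard: list[str] = []
    for blocker in blockers:
        if blocker.is_route and blocker.recovered_on_retry:
            # Route-only blocker that recovered under retry: keep as evidence.
            warnings.append(blocker.name)
        else:
            # Non-route blockers and unrecovered routes stay hard blockers.
            hard.append(blocker.name)
    return warnings, hard


if __name__ == "__main__":
    sample = [
        Blocker("stock_route_502", is_route=True, recovered_on_retry=True),
        Blocker("gitea_route_timeout", is_route=True, recovered_on_retry=False),
        Blocker("k3s_node_not_ready", is_route=False),
    ]
    print(classify_blockers(sample))
```

The blocker names in the sample are illustrative, not taken from a real run.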

2026-06-26 08:47 Wazuh registry detail summary baseline: post-reboot readiness summary now emits WAZUH_COVERAGE_SCOPE, WAZUH_DIRECT_ACTIVE, WAZUH_NO_TRANSPORT, WAZUH_SSH_BLOCKED, WAZUH_DASHBOARD_API_CONNECTION, and WAZUH_DASHBOARD_INDEX_OK alongside existing route / transport / registry fields. Current read-only truth is coverage scope 6, direct active 2, no transport 1, SSH blocked 3, route 200, transport 6, Dashboard API pending_or_spinning, index OK 3, manager registry accepted 0, runtime gate 0. This is a security evidence blocker, not a reboot service blocker.

2026-06-26 12:13 declaration guard baseline: scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color emits schema_version=awoooi_post_reboot_declaration_guard_v1, status allowed_with_boundary_blockers, allowed declarations including service / product data / backup / 188 host hygiene green for this evidence set, and forbidden declarations DR_COMPLETE, WAZUH_REGISTRY_RECOVERED, RUNTIME_ACTION_AUTHORIZED. Proposed false-green declarations are rejected before they can enter LOGBOOK / owner packets / external status updates.
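A rough sketch of the allowed / forbidden split, driven by summary keys quoted in this workplan. The mapping from each declaration to its evidence condition is an assumption made for illustration; the real guard's rule table is not shown in this document.

```python
# Sketch only: the rule table below is assumed, not copied from
# post-reboot-declaration-guard.py. Summary values are strings.

# Declarations that need closed evidence gates before they may be made.
EVIDENCE_GATES = {
    "DR_COMPLETE": lambda s: s.get("ESCROW_MISSING_COUNT") == "0"
    and s.get("DR_ESCROW_BLOCKED") == "0",
    "WAZUH_REGISTRY_RECOVERED": lambda s: s.get("WAZUH_MANAGER_REGISTRY_ACCEPTED") == "1",
    "RUNTIME_ACTION_AUTHORIZED": lambda s: s.get("RUNTIME_ACTION_AUTHORIZED") == "1",
}

# Declarations allowed whenever the live summary reports the flag as 1.
GREEN_FLAGS = {"SERVICE_GREEN", "PRODUCT_DATA_GREEN", "BACKUP_CORE_GREEN"}


def review_declarations(summary: dict, proposed: list[str]) -> tuple[list[str], list[str]]:
    """Return (allowed, rejected) for the proposed declarations."""
    allowed: list[str] = []
    rejected: list[str] = []
    for declaration in proposed:
        if declaration in EVIDENCE_GATES:
            ok = EVIDENCE_GATES[declaration](summary)
        elif declaration in GREEN_FLAGS:
            ok = summary.get(declaration) == "1"
        else:
            ok = False  # unknown declarations fail closed
        (allowed if ok else rejected).append(declaration)
    return allowed, rejected


if __name__ == "__main__":
    live = {
        "SERVICE_GREEN": "1",
        "PRODUCT_DATA_GREEN": "1",
        "BACKUP_CORE_GREEN": "1",
        "DR_ESCROW_BLOCKED": "1",
        "ESCROW_MISSING_COUNT": "5",
        "WAZUH_MANAGER_REGISTRY_ACCEPTED": "0",
        "RUNTIME_ACTION_AUTHORIZED": "0",
    }
    print(review_declarations(live, ["SERVICE_GREEN", "DR_COMPLETE"]))
```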

2026-06-26 07:39 live quick-check refresh supersedes the 07:19 row for current operator status. scripts/reboot-recovery/post-start-quick-check.sh --no-color returned POST_START_QUICK_CHECK PASS=38 WARN=3 BLOCKED=0, warning split SERVICE=0 BOUNDARY=1 EVIDENCE=2, result FULL_STACK_GREEN_DR_ESCROW_BLOCKED. Delegated cold-start returned PASS=89 WARN=0 BLOCKED=0; four reboot-scope hosts ping/SSH were OK; AWOOOI / VibeWork / AwoooGo / 2026FIFA / Agent Bounty / MOMO / Stock / Bitan / TsenYang / VTuber / Gitea / Harbor / Registry / Sentry / SigNoz / Langfuse / AIOps routes returned expected 2xx/3xx. MOMO V10.701 has job 57 completed, daily freshness 1|2026-06-24, and current-month parity 15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24. StockPlatform freshness is ok through 2026-06-25 with price / chips / margin / AI recommendations current. Backup core remains green: 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, offsite/rclone fresh, last_backup_all=2026-06-26 02:31:02; DR still has escrow_missing=5. 110 load around 5.19 / 4.66 / 4.91 is attributable to normal platform processes, not orphan Chrome. 188 host hygiene remains blocked by failed host PostgreSQL / certbot / startup units and must use the dedicated maintenance runbook and read-only checklist.

2026-06-25 19:06 post-CD wrapper readback supersedes the 18:53 wording: consecutive main pushes created a deploy storm where older deploy markers were superseded by later commits. Latest production truth is deploy marker d8ca8224 chore(cd): deploy 9dbe044 [skip ci], ArgoCD Synced / Healthy, API/Web/Worker image tag 9dbe044ea1e8e3894ccbeb5ed760bb124b87f7be, direct route smoke 200 for AWOOOI API / IwoooS / VibeWork / AwoooGo / MOMO health / Stock / Bitan and expected route-gate statuses for MOMO / Gitea / Harbor / Registry / Sentry / SigNoz / Langfuse / AIOps, and wrapper POST_START_QUICK_CHECK PASS=18 WARN=3 BLOCKED=0. Repo-side cold-start returns PASS=89 WARN=0 BLOCKED=0; /backup/scripts/backup-status.sh --no-notify --no-refresh reports 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, integrity_stale=0, offsite_fresh=1, rclone_gdrive_fresh=1, escrow_missing=5; MOMO dedicated preflight returns PASS=19 WARN=2 BLOCKED=0; MOMO health is V10.690; AwoooGo / Stock transient 502 reads cleared after upstream warmup and five consecutive route reads returned 200; 110 load is around 14.51 / 12.34 / 11.42, with Gitea Actions cache save / zstdmt / tar, StockPlatform headless Chrome smoke / CI, Gitea, AWOOOI API, ClickHouse, Docker, and platform services visible, not an AWOOOI service blocker. Wrapper result is FULL_STACK_GREEN_DR_ESCROW_BLOCKED, not DEGRADED, because service warnings are 0 and only DR boundary / evidence warnings remain. Wazuh route readback is now 200 disabled_waiting_iwooos_wazuh_owner_gate, but manager registry accepted remains 0, so Wazuh is a security registry evidence blocker rather than a reboot service blocker.
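The reasoning behind "FULL_STACK_GREEN_DR_ESCROW_BLOCKED, not DEGRADED" can be written as a small decision function. This is a simplified sketch: the precedence order is inferred from the readbacks in this workplan, the function name is hypothetical, and the `FULL_STACK_GREEN` label for a warning-free run is an assumed name that does not appear in the source.

```python
# Sketch only: precedence inferred from the readbacks in this workplan.
# "FULL_STACK_GREEN" for a warning-free run is an assumed label.

def classify_result(blocked: int, service_warn: int,
                    boundary_warn: int, evidence_warn: int) -> str:
    """Map quick-check counters to a wrapper result string."""
    if blocked > 0:
        return "BLOCKED"                      # any hard blocker wins
    if service_warn > 0:
        return "DEGRADED"                     # a service-level warning degrades the stack
    if boundary_warn + evidence_warn > 0:
        # Only DR boundary / evidence warnings remain.
        return "FULL_STACK_GREEN_DR_ESCROW_BLOCKED"
    return "FULL_STACK_GREEN"


if __name__ == "__main__":
    # 2026-06-26 07:39 readback: WARN=3 split SERVICE=0 BOUNDARY=1 EVIDENCE=2, BLOCKED=0.
    print(classify_result(blocked=0, service_warn=0, boundary_warn=1, evidence_warn=2))
```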

Full cold-start service readiness may now be declared GREEN for the latest verified evidence set. As of 2026-06-25 19:06, routes/hosts/K3s/backups/exporters/monitoring surfaces are available, AWOOOI API is healthy, MOMO service health is V10.690, and MOMO business data is fresh through 2026-06-24. The live read-only cold-start scorecard is PASS=89 WARN=0 BLOCKED=0, the post-start wrapper result is FULL_STACK_GREEN_DR_ESCROW_BLOCKED, AwoooGo / Stock route stability has been rechecked after transient warmup, and final API/Web workload placement is split across mon / mon1. Do not declare DR scorecard complete while credential escrow evidence remains blocked, and do not declare Wazuh registry recovery until manager registry evidence is accepted.

2026-06-25 19:35 stricter product-data gate readback supersedes the earlier "all product data green" interpretation. The full host/cold-start/backup layer remains green from the 19:24 read-only evidence, but the updated quick check now includes StockPlatform /api/v1/system/freshness and therefore blocks on product-data completeness: POST_START_QUICK_CHECK PASS=31 WARN=1 BLOCKED=1, RESULT=BLOCKED, blocker core_margin_short_daily_missing,ai_recommendations_stale. This is a correct no-false-green outcome: stock.wooo.work, /healthz, and /api/healthz all return 200, but StockPlatform data and AI recommendations are not latest. Next action is a separate StockPlatform data freshness remediation lane; do not solve it by host reboot, Nginx reload, Docker restart, or route-only smoke.

2026-06-25 20:11 StockPlatform cron-source recovery closeout: root cause for several StockPlatform stale/old-data symptoms included production source drift where cron referenced scripts that were absent from live /home/wooo/stockplatform-v2, producing script_exit_127 for source remediation, market index, price ingestion, and related monitors. Commit fb91aa4c6272469d1d26e0820169629eac17d28a fix(ops): restore production cron recovery entrypoints was pushed to gitea/main and fast-forward pulled on 110 only. Live post-pull checks confirm all referenced cron scripts exist and bash -n passes. Natural cron runs then recovered: source remediation 19:56 / 20:00 succeeded, market index 20:00 succeeded, price 20:02 succeeded, margin 20:05 succeeded, chips 20:06 succeeded, and AI pipeline 20:10 succeeded at cron/job level while correctly blocking on official margin-short source pending. Remaining blocker is official 2026-06-25 margin-short data and dependent AI recommendation freshness, not source-version drift. Next natural follow-up is 21:00 intelligence-sync to prove the restored Docker-backed psql shim without manual production writes.

2026-06-25 20:25 110 CPU orphan Chrome cleanup closeout: read-only process attribution found two stockplatform-review-bulk-ux Chrome process groups 2756503 and 2829627 with root Chrome process PPID=1, elapsed about 5h, and sustained GPU/renderer CPU. With user approval, only those PGIDs received targeted SIGTERM; post-check showed no remaining group entries, CPU idle around 85-90%, and si/so=0. Full post-start wrapper after cleanup returned cold-start PASS=89 WARN=0 BLOCKED=0, backup core 0, MOMO fresh, expanded public routes green, and overall PASS=37 WARN=2 BLOCKED=1, RESULT=BLOCKED only because StockPlatform product data freshness remains blocked. This confirms the reboot SOP is effective at separating host/service recovery, runaway process cleanup, product-data freshness, DR escrow, and Wazuh security evidence.
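The read-only attribution step can be sketched as a parser over `ps -eo pid,ppid,pgid,etimes,args` output (procps format specifiers). It only reports candidate process groups and sends no signals. The one-hour age threshold, the function name, and the sample command paths are assumptions for illustration.

```python
# Sketch only: read-only candidate detection; it never sends signals.
# Threshold, function name, and sample command paths are assumptions.

def find_orphan_groups(ps_output: str, needle: str = "chrome",
                       min_age_s: int = 3600) -> list[int]:
    """Return sorted PGIDs whose matching process has PPID=1 and is old enough."""
    groups: set[int] = set()
    for line in ps_output.splitlines():
        parts = line.split(None, 4)
        if len(parts) < 5 or not parts[0].isdigit():
            continue  # skip the header and malformed rows
        _pid, ppid, pgid, etimes = (int(p) for p in parts[:4])
        if ppid == 1 and etimes >= min_age_s and needle in parts[4].lower():
            groups.add(pgid)
    return sorted(groups)


# Illustrative output of: ps -eo pid,ppid,pgid,etimes,args
SAMPLE_PS = """\
    PID    PPID    PGID ELAPSED COMMAND
2756503       1 2756503   18000 /opt/chrome/chrome --headless
2756600 2756503 2756503   17990 /opt/chrome/chrome --type=renderer
2829627       1 2829627   17500 /opt/chrome/chrome --headless
   4242       1    4242     120 /opt/chrome/chrome --headless
    900       1     900   90000 /usr/sbin/cron -f
"""

if __name__ == "__main__":
    print(find_orphan_groups(SAMPLE_PS))
```

Any cleanup of the reported groups stays a separate, approved step, as in the closeout above.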

2026-06-25 21:14 StockPlatform natural-cron closeout: after waiting for official schedules, 21:00 intelligence-sync succeeded with status=0, core.margin_short_daily reached 2026-06-25, and 21:10 ai-recommendation-pipeline produced STOCKPLATFORM_AI_RECOMMENDATION_PIPELINE_OK as_of_date=2026-06-25. /api/v1/system/freshness is status=ok, blockers [], with price / chips / margin / AI recommendations all current for 2026-06-25. Full wrapper returned POST_START_QUICK_CHECK PASS=38 WARN=2 BLOCKED=0, RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED. This is the current service/data recovery baseline: all reboot service and product-data gates are green; DR remains blocked only by credential escrow evidence 5, and Wazuh registry remains a separate security evidence blocker.

2026-06-13 01:26 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing survived the next normal CD deploy: Gitea main e4a349bc, ArgoCD revision e4a349bc, images from 414413a5, API/Web split across mon / mon1, and global known_hosts retained 120 / 188 after CD fix 80e6ec1a. Do not declare DR complete while credential escrow is missing. km-vectorize remediation is 90%: schedule/label fix is live, and the remaining gate is the next official 03:00 CronJob success readback.


2. Live Check Evidence, 2026-06-04

Target Live result Notes
192.168.0.110 ping OK, SSH port OK Boot 2026-05-06 12:12; load was elevated around 10.54 7.42 6.28; cron and Docker active.
192.168.0.120 ping failed, SSH port failed ARP incomplete; K3s node mon remains NotReady,SchedulingDisabled.
192.168.0.121 ping OK, SSH port OK Boot 2026-05-22 02:30; sudo kubectl get nodes shows mon1 Ready.
192.168.0.188 ping OK, SSH port OK Boot 2026-05-06 12:07; Docker/PostgreSQL/Redis/nginx active; momo containers healthy.
Cold-start scorecard BLOCKED_BY_120 2026-06-12 14:47 read-only rerun: PASS=72 WARN=2 BLOCKED=3; hard blocks remain 120 reachability / SSH / 120 K3s read-only check.
Public routes OK ingress only 2026-06-12 14:47: awoooi, aiops, mo, momo_health, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan returned 2xx/3xx over HTTPS.
momo DB current-month parity OK Scorecard reports `4571
110 daily backup cron OK 02:00 backup-all, 03:00 rclone sync, 06:05 backup-status, 07:20 full offsite verify.
Backup freshness OK with remaining aggregate blocker 2026-06-05 18:40 status: stale110=none, stale188=none, configured_missing_188=0; remaining core_blockers=6 is 02:00 aggregate failure history plus 120 config capture.
Google Drive latest-only OK 2026-06-12 14:48 verifier: 13 repos, each remote snapshots=1, REMOTE_LATEST_ONLY_OK=1, FULL_MARKER_FRESH=1, VERIFY_OK=1, FAILED=0.
Live Prometheus / Alertmanager alert rules OK 2026-06-12 14:49 backup-alert-live-visibility-check.py returned BACKUP_ALERT_LIVE_VISIBILITY_OK; all five required backup/cold-start/escrow alerts are visible in Prometheus and Alertmanager.
Credential escrow BLOCKED Missing markers: break_glass_admin_credentials, dns_registrar_recovery, oauth_ai_provider_recovery, offsite_provider_credentials, restic_repository_password.
Config backup capture BLOCKED until 120 returns awoooi_backup_config_capture_ok{target="120-k3s-host-configs"} 0; critical failed count 1.
Live 110 script sync OK Six recovery/check scripts exist under /home/wooo/scripts/; /home/wooo/scripts/full-stack-cold-start-check.sh hash is 31321428207308d6c159fabb679d9f1d0848194b8e6d7eb7b04a2c05779ade46 after scheduler detector fix.
Gitea commit evidence VERIFIED Gitea main at 0260ec89... contains ae7b39d9 fix(ops): harden reboot recovery and backup alerts.
188 nginx Ansible baseline DONE Template now pins aiops.wooo.work to VIP 192.168.0.125:32334/32335, contains no 192.168.0.120, and live smoke returned https://aiops.wooo.work/ 307 plus /api/v1/health 200.
120 failure-domain triage BLOCKED 19:02 checks from local/110/121/188 all fail to reach 120; 121 reports Destination Host Unreachable; K3s node lease renew stopped at 2026-05-21T18:48:36Z; 120-fsck-maintenance-checklist.sh --no-color returns PASS=2 WARN=2 BLOCKED=3, MAINTENANCE REQUIRED.
2026-06-05 backup remediation BLOCKED with repaired freshness 16:00 live check still had 120 down and stale110=awoooi_db; manual backups produced snapshots b7d5ee4e (AWOOOI high-frequency DB), ea641613 (Gitea), d1147507 (Open-WebUI), 73ead3cc (ClawBot), b1161ab8 (AI artifacts). 18:40 backup status: stale110=none, stale188=none, core_blockers=6, escrow_missing=5.
2026-06-05 offsite closure OK partial + full verify Full sync was correctly skipped by runway gate; partial sync for awoooi gitea open-webui clawbot ai-artifacts completed 5/5; full verifier at 18:39 shows all 13 remote repos snapshots=1, REMOTE_LATEST_ONLY_OK=1, VERIFY_OK=1.
2026-06-06 backup convergence BLOCKED only by 120/escrow 14:58 backup status: 110 13/13 fresh failed=1, 188 2/2 fresh failed=0, stale110=none, stale188=none, core_blockers=1, escrow_missing=5; 02:00 aggregate failed only Configs due 120.
2026-06-06 offsite verify OK 14:46 verifier: all 13 remote repos snapshots=1, REMOTE_LATEST_ONLY_OK=1, VERIFY_OK=1.
2026-06-06 cold-start scorecard BLOCKED 15:03 read-only rerun: PASS=71 WARN=3 BLOCKED=3; hard blocks remain 120 ping / SSH / K3s read-only check. Direct 188 scheduler check still shows momo-scheduler healthy and active.
2026-06-12 pre-reboot check NO-GO until offsite finishes 120 still ping/SSH failed and ARP incomplete; 110->188 SSH host key trust was repaired; 04:11 backup status cleared stale110=awoooi_db after daily backup but still has failed=1/core_blockers=1 due 120 config capture; 03:00 offsite sync is still running at 04:10.
2026-06-12 post-reboot recovery SERVICE_GREEN_WITH_120_BLOCKER 14:47 scorecard: PASS=72 WARN=2 BLOCKED=3; 110 failed units 0, Swap 0B, public routes/TLS green, momo scheduler and DB parity green, backup/offsite/alert surfaces green except the correct 120 config capture and escrow evidence red gates.
2026-06-12 blocker pursuit WAITING_EXTERNAL_ACCESS 15:00 four-view 120 check still failed; no WOL/IPMI/vmrun/hypervisor entry found in repo, 110, 121, 188, local tools, or Chronicle-visible console. 15:02 escrow report shows offsite ready with warnings and all five escrow markers missing; no real non-secret evidence ID found in repo.
2026-06-12 120 recovery closeout SERVICE_GREEN_DR_ESCROW_BLOCKED 120 root fsck was completed from console/initramfs and booted at 15:13; 15:54 backup-all finished 13/13; 17:37 full offsite sync finished 13/13; 18:55 offsite verifier returned REMOTE_LATEST_ONLY_OK=1, VERIFY_OK=1, FAILED=0; 18:55 backup-status shows core_blockers=0, escrow_missing=5; 18:57 cold-start is PASS=83 WARN=0 BLOCKED=0.
2026-06-13 live refresh SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED 00:13 backup status: 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, escrow_missing=5; 00:33 cold-start exposed 110 known_hosts drift for 120 / 188, fixed after backup /home/wooo/.ssh/known_hosts.before-120-188-refresh.20260613-003416; 00:34 final cold-start: PASS=83 WARN=0 BLOCKED=0; live K3s has mon / mon1 Ready, API/Web are split 120 / 121. 188 host is degraded only because certbot.service and snap.certbot.renew.service failed; ArgoCD remains Degraded because km-vectorize CronJob last success is stale. Manual Job km-vectorize-codex-002709 did not leave verified completion evidence, so this remains open.
2026-06-13 km-vectorize health remediation IN_PROGRESS_92 13:37 live readback: ArgoCD revision 88dc08e5 is Synced / Degraded; only unhealthy resource is CronJob/awoooi-prod/km-vectorize with message CronJob has not completed its last execution successfully. CronJob lastScheduleTime=2026-06-12T19:00:00Z, lastSuccessfulTime=2026-06-04T11:00:37Z; no 2026-06-13 failed Job is retained because failedJobsHistoryLimit=0. GitOps candidate now changes km-vectorize to failedJobsHistoryLimit=3 so future 03:00 failures keep inspectable Job/Pod evidence. Next gate is ArgoCD sync plus the next official 03:00 success readback.
2026-06-13 post-CD trust / workload verification SERVICE_GREEN_CD_GUARDRAIL_HELD Gitea main advanced to deploy marker e4a349bc chore(cd): deploy 414413a [skip ci]; ArgoCD revision is e4a349bc, sync Synced, health still Degraded only by km-vectorize stale success. Live K3s image readback uses 414413a59268eedd391648f112e228716dd05362; API pods split mon1 / mon, Web pods split mon / mon1, Worker is single replica on mon. 01:28 /home/wooo/.ssh/known_hosts mtime remains 2026-06-13 01:20:02 +0800 with 120 / 188 entries present; deploy-specific /home/wooo/.ssh/deploy_known_hosts mtime is 01:24:05, proving CD fix 80e6ec1a stopped clobbering global trust. 01:26 cold-start: PASS=83 WARN=0 BLOCKED=0.
2026-06-13 API placement hardening IN_PROGRESS 12:43 live refresh showed cold-start PASS=83 WARN=0 BLOCKED=0, but API replicas 2/2 were on 120 even though topology spread existed. Root cause: whenUnsatisfiable=ScheduleAnyway is a soft preference. GitOps candidate changes API/Web/Worker to minDomains=2 + DoNotSchedule; completion requires ArgoCD sync, rollout readback, public route smoke, and cold-start rerun.
2026-06-13 API rollout strategy hardening LIVE_VERIFIED First hard-spread rollout reached ArgoCD revision 17e017f5; DoNotSchedule was live, but API completed with both new pods on 121 because old 120 pods were still terminating during scheduling. Second GitOps rollout reached ArgoCD revision 60f653a0, API/Web use maxSurge=0, maxUnavailable=1, minDomains=2, DoNotSchedule, and both deployments are split mon / mon1. Public API / governance route smoke passed and 12:59 cold-start returned PASS=83 WARN=0 BLOCKED=0.
2026-06-13 security mirror guard closure LIVE_VERIFIED Gitea main b557a4b5 restores apps/web/messages/en.json as the required Traditional Chinese mirror of zh-TW.json; security-mirror-progress-guard.py now passes. ArgoCD revision b557a4b5 is Synced / Degraded only by km-vectorize; API/Web/Worker are ready, API pods split mon / mon1, Web pods split mon1 / mon, public API health is healthy, zh/en governance routes are 200, backup status has core_blockers=0, and 13:52 cold-start is PASS=83 WARN=0 BLOCKED=0.
2026-06-13 security mirror production image closeout LIVE_VERIFIED Gitea main 64ea2444 records the Web rebuild trigger. Deploy marker 2cc02f1c chore(cd): deploy 6cf8d3c [skip ci] put Web image 6cf8d3ca live; ArgoCD source revision later advanced to 64ea2444 while Web image correctly remains 6cf8d3ca because 64ea2444 is docs/changelog only. Public /zh-TW/governance and /en/governance return 200, API health is healthy, security-mirror-progress-guard.py passes, and 14:10 cold-start is PASS=83 WARN=0 BLOCKED=0.
2026-06-13 final post-trigger deploy closeout LIVE_VERIFIED Deploy marker 834ccdba chore(cd): deploy bf86017 [skip ci] put API/Web/Worker image bf860177 live. ArgoCD revision 834ccdba is Synced / Degraded only by km-vectorize; routes /zh-TW/governance and /en/governance return 200, API health is healthy, source guards pass, backup status has core_blockers=0 and escrow_missing=5, and 14:13 cold-start is PASS=83 WARN=0 BLOCKED=0.
2026-06-13 final goal audit refresh SERVICE_GREEN_REMAINING_GATES_EXPLICIT Clean worktree rebased onto a520c32d and reran source guards successfully; live ArgoCD tracks revision a520c32d with API/Web/Worker image e897c8bf, health Degraded only by km-vectorize; km-vectorize schedule remains 0 3 * * *, timeZone=Asia/Taipei, failedJobsHistoryLimit=3, and no failed Job is currently retained. Public /zh-TW/governance, /en/governance, and /api/v1/health are green; backup core blockers remain 0, escrow_missing=5; 14:16 cold-start is PASS=83 WARN=0 BLOCKED=0. Remaining gates: five credential escrow markers and next official 03:00 km-vectorize success readback.
2026-06-14 km-vectorize official run follow-up DEGRADED_EVIDENCE_RETENTION_LIVE 03:00 official km-vectorize-29689620 ran from CronJob and failed with BackoffLimitExceeded; ArgoCD later auto-synced revision 8868c025 and remains Synced / Degraded. Job is retained, but failed Pod km-vectorize-29689620-nwpqz was deleted before logs could be read, so root cause remains unproven for this run. Live CronJob is now restartPolicy: Never plus terminationMessagePolicy: FallbackToLogsOnError, so the next official failure should retain Pod/log evidence. Backup core remains green, escrow_missing=5, and 03:11 cold-start is PASS=81 WARN=2 BLOCKED=0.
2026-06-14 km-vectorize tenant context follow-up ROOT_CAUSE_CANDIDATE_LIVE Source audit shows cron_km_vectorize.py calls /api/v1/knowledge/embed-all without project context, while API middleware and get_db_context() require X-Project-ID / tenant context for fail-closed RLS. API logs show matching db_context_missing / Missing tenant context patterns. Deploy marker ec03f0b7 put image 8ddb80d6 live; CronJob now has KM_PROJECT_ID=awoooi, script sends X-Project-ID, targeted pytest 7 passed, and no manual Job was created. Completion still waits for the next official 03:00 success or retained failed Pod/log.
2026-06-14 110 failed-unit cleanup SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED fwupd-refresh.timer is intentionally disabled / inactive after non-runtime firmware metadata refresh failed units were classified; rollback is sudo systemctl enable --now fwupd-refresh.timer. systemctl --failed now returns 0 loaded units listed; 08:24 cold-start improved to PASS=82 WARN=1 BLOCKED=0. Remaining warning is only K8s failed Job km-vectorize-29689620; backup core remains green and escrow_missing=5.
2026-06-14 post-CD recovery readback SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED Gitea main / ArgoCD revision 18b867c3 synced after deploy marker 18b867c3 chore(cd): deploy e0a6d33 [skip ci]; API/Web/Worker/CronJob image is e0a6d339. API/Web remain split across mon / mon1, Worker is healthy on mon, public routes and TLS pass, backup core remains 0, escrow missing remains 5, and 08:40 cold-start remains PASS=82 WARN=1 BLOCKED=0. This proves no post-CD reboot recovery regression, but still not full green.
2026-06-14 P2-135 deploy recovery readback SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED Gitea main 5bad267e and ArgoCD revision 5bad267e are synced after deploy marker 8d575c1a; API/Web/Worker/CronJob image is 280e0fbe. API/Web remain split across mon / mon1, Worker is healthy on mon1, backup core remains 0, escrow missing remains 5, and 09:27 cold-start rerun is PASS=82 WARN=1 BLOCKED=0. 09:26 first run saw transient stock.wooo.work 502 while stockplatform-v2 containers were under one minute old; direct route/TLS recheck and scorecard rerun returned 200. This proves no persistent post-P2-135 recovery regression, but still not full green.
2026-06-14 P2-136 / AI Agent activity production-deploy recovery readback SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED The latest docs head before this recovery commit is a0fe7741; runtime deploy marker / ArgoCD revision 60a0415c is Synced / Degraded; API/Web/Worker/CronJob image is a3de0ffb. API/Web remain split across mon / mon1, Worker is healthy on mon1, backup core remains 0, escrow missing remains 5, and 09:56 cold-start is PASS=82 WARN=1 BLOCKED=0. This proves no recovery regression after the P2-136 / AI Agent activity production deploy, but still not full green.
2026-06-14 P2-137 / CI smoke timeout recovery readback SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED The latest docs head before this recovery commit is 50d4f2ba; runtime deploy marker d023f5d7 brought image f737f278 live; ArgoCD revision 50d4f2ba is Synced / Degraded. API/Web remain split across mon / mon1, Worker is healthy on mon, backup core remains 0, escrow missing remains 5, and 10:40 cold-start is PASS=82 WARN=1 BLOCKED=0. This proves no recovery regression after the P2-137 / CI smoke timeout fix, but still not full green.
2026-06-14 P2-143 owner response precheck recovery readback SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED The latest docs baseline is b09eb1c6; runtime deploy marker 667d6329 brought image 755b0a8d3038df2c52dee280067863d92db1eda5 live; ArgoCD revision 4abf0c0f750254d3c7137eae049abdfd99630f5f is Synced / Degraded. API/Web remain split across mon / mon1, Worker is healthy on mon, backup core remains 0, escrow missing remains 5, and 15:00 cold-start is PASS=82 WARN=1 BLOCKED=0. The P2-143 endpoint reports current P2-143 and completion 100, and all writer / Gateway / Telegram / Bot API / production write / secret read / destructive operation values remain 0 / false. This proves no recovery regression after the P2-143 owner response precheck, but still not full green.
2026-06-14 P2-144 owner response readback recovery readback SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED gitea/main has advanced to deploy marker 180a6543; image fef94df877c5438f9f34ddbcace8ad8112a141ef is live; ArgoCD source revision 180a6543eaf26dd6b345d45114316926056a965a is Synced / Degraded. API/Web remain split across mon / mon1, Worker is healthy on mon1, backup core remains 0, escrow missing remains 5, and 15:58 cold-start is PASS=82 WARN=1 BLOCKED=0. The P2-144 endpoint reports current P2-144 and completion 100; owner response received / accepted / rejected, reviewer / Gateway / Telegram / Bot API / result capture / learning / PlayBook trust / production write / secret read / destructive operation all remain 0 / false. This proves no recovery regression after the P2-144 owner response readback, but still not full green.
2026-06-14 P2-145 owner response acceptance threshold recovery readback SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED The latest docs baseline is 06fe0a8f; runtime deploy marker 36fbfc6b brought image 386dbd078ef63401d9736048463f4ef5326442d9 live; ArgoCD source revision 06fe0a8f14167824fea512f942d2569431bbcbc8 is Synced / Degraded. API/Web remain split across mon / mon1, Worker is healthy on mon, backup core remains 0, escrow missing remains 5, and 16:29 cold-start is PASS=82 WARN=1 BLOCKED=0. The P2-145 endpoint reports current P2-145 and completion 100; owner response received / accepted / rejected, reviewer / Gateway / Telegram / Bot API / result capture / learning / PlayBook trust / production write / secret read / destructive operation all remain 0 / false. This proves no recovery regression after the P2-145 owner response acceptance threshold, but still not full green.
2026-06-14 IwoooS P0 configuration control priority order recovery readback SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED The latest docs baseline is af62ec1f; runtime deploy marker ed651a98 brought image e992af89955f8aae40a383b2f2e2f645445a690d live on API/Web/Worker/CronJob; ArgoCD source revision af62ec1fe72b3e84e179d80e788e5a5902bdaf27 is Synced / Degraded. API/Web remain split across mon / mon1, Worker is healthy on mon1, and IwoooS route /zh-TW/iwooos returned 200. Backup core remains 0, escrow missing remains 5, and 17:04 cold-start is PASS=82 WARN=1 BLOCKED=0. This proves no recovery regression after the IwoooS P0 configuration control priority order frontend release; it does not mean that Nginx reload, DNS/TLS/certbot, workflow/secret/public route/runtime gate, or production write is authorized, and it is still not full green.
2026-06-14 High-value configuration Owner Packet frontend sync recovery readback SERVICE_AVAILABLE_KM_VECTORIZE_FAILED_DR_ESCROW_BLOCKED The latest repo docs baseline is 0a4766dd; runtime deploy marker 16c6b983 brought image e999c16b3435f197b78fe2adfeec1c4faa6c4675 live on API/Web/Worker/CronJob; ArgoCD source revision 0a4766ddc94b0690824ce3deba5c6b9a69764f94 is Synced / Degraded. API/Web remain split across mon / mon1, Worker is healthy on mon, and IwoooS route /zh-TW/iwooos and AwoooP route /zh-TW/awooop both returned 200. Backup core remains 0, escrow missing remains 5, and 18:15 cold-start is PASS=82 WARN=1 BLOCKED=0. This proves no recovery regression after the high-value configuration Owner Packet frontend sync; it does not mean that request sent, owner response received / accepted, Nginx reload, DNS/TLS/certbot, workflow/secret/public route/runtime gate, host write, active scan, or production write is authorized, and it is still not full green.
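
The km-vectorize tenant-context row above describes the fix as "CronJob now has KM_PROJECT_ID=awoooi, script sends X-Project-ID". A minimal sketch of that request shape follows. It is not the actual `cron_km_vectorize.py`; the base URL, the POST method, and the function name are assumptions made for illustration, and the request is only built, never sent.

```python
import os
import urllib.request

# Hypothetical in-cluster base URL; the real value is not stated in this workplan.
API_BASE = "http://awoooi-api.example.internal"


def build_embed_all_request(env=None):
    """Build (but do not send) the embed-all request with tenant context."""
    env = os.environ if env is None else env
    project_id = env.get("KM_PROJECT_ID", "").strip()
    if not project_id:
        # Fail closed locally: the API would reject a call without tenant context.
        raise RuntimeError("KM_PROJECT_ID is not set; refusing to call embed-all")
    return urllib.request.Request(
        API_BASE + "/api/v1/knowledge/embed-all",
        data=b"",
        method="POST",  # assumed; the workplan does not name the HTTP method
        headers={"X-Project-ID": project_id},
    )


req = build_embed_all_request({"KM_PROJECT_ID": "awoooi"})
# urllib normalizes header-name casing, so compare case-insensitively.
headers = {name.lower(): value for name, value in req.header_items()}
print(req.get_method(), req.full_url, headers["x-project-id"])
```

Raising before the call mirrors the fail-closed RLS behavior on the API side: a missing project ID surfaces as an explicit job error instead of a `db_context_missing` log line.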

3. Progress Update Contract

Every phase update must change both status and percentage in this file.

| State | Meaning |
| --- | --- |
| NOT_STARTED | Listed but no live evidence gathered in this session. |
| IN_PROGRESS | Actively being checked or fixed. |
| BLOCKED | A live red gate prevents completion. Do not downgrade or silence the alert. |
| WAITING_HOST_120 | Action is intentionally held until 120 is reachable. |
| VERIFIED | Live evidence proves the item. |
| DONE | Fix is implemented, verified, and documented. |

Completion is weighted by release risk:

| Priority | Weight |
| --- | --- |
| P0 | 45% |
| P1 | 25% |
| P2 | 20% |
| P3 | 10% |
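
The weights above can be turned into a single overall percentage. The sketch below assumes the overall figure is the weight-scaled sum of each phase's own completion percentage; the workplan does not spell out the formula, so `weighted_completion` and its input shape are illustrative only.

```python
# Release-risk weights from the table above, in percent.
WEIGHTS = {"P0": 45, "P1": 25, "P2": 20, "P3": 10}


def weighted_completion(phase_percent):
    """Combine per-phase completion (0-100) into one weighted percentage."""
    missing = sorted(set(WEIGHTS) - set(phase_percent))
    if missing:
        raise ValueError(f"missing phases: {missing}")
    for phase in WEIGHTS:
        if not 0 <= phase_percent[phase] <= 100:
            raise ValueError(f"{phase} must be between 0 and 100")
    # Weights sum to 100, so dividing by 100 keeps the result on a 0-100 scale.
    return sum(WEIGHTS[p] * phase_percent[p] for p in WEIGHTS) / 100


# Illustrative input only: P1 at 80%, everything else complete.
print(weighted_completion({"P0": 100, "P1": 80, "P2": 100, "P3": 100}))  # 95.0
```

Because P0 carries 45% of the weight, an unfinished P0 gate moves the overall figure far more than the same gap in P3.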

For every push forward, update:

YYYY-MM-DD HH:MM Asia/Taipei
Phase: P0/P1/P2/P3
Before: <old percent>
After: <new percent>
Evidence: <command/file/snapshot>
Blocked: <yes/no and why>
Next: <single next action>
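
The seven-field template above is easy to get wrong by hand. As a rough sketch, a small helper can validate and render one entry; the class name, field names, and validation rules are assumptions, and the sample values are illustrative rather than taken from a real update.

```python
from dataclasses import dataclass
from datetime import datetime

PHASES = ("P0", "P1", "P2", "P3")


@dataclass(frozen=True)
class ProgressUpdate:
    timestamp: str      # "YYYY-MM-DD HH:MM", interpreted as Asia/Taipei
    phase: str          # must start with P0, P1, P2, or P3
    before: str
    after: str
    evidence: str
    blocked: str        # "yes/no and why"
    next_action: str    # a single next action

    def __post_init__(self):
        # strptime raises ValueError on a malformed timestamp.
        datetime.strptime(self.timestamp, "%Y-%m-%d %H:%M")
        if not self.phase.startswith(PHASES):
            raise ValueError(f"phase must start with one of {PHASES}")
        if not self.next_action.strip():
            raise ValueError("next_action must not be empty")

    def render(self):
        return "\n".join([
            f"{self.timestamp} Asia/Taipei",
            f"Phase: {self.phase}",
            f"Before: {self.before}",
            f"After: {self.after}",
            f"Evidence: {self.evidence}",
            f"Blocked: {self.blocked}",
            f"Next: {self.next_action}",
        ])


# Illustrative values only.
entry = ProgressUpdate(
    timestamp="2026-06-12 18:57",
    phase="P0",
    before="90%",
    after="100%",
    evidence="cold-start scorecard rerun",
    blocked="no",
    next_action="Keep evidence in LOGBOOK",
)
print(entry.render())
```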

4. P0 Must-Do Gates

ID Status % Work item Fine analysis Next action Done criteria
P0-001 DONE 100 Rerun four-host reachability 18:57 cold-start confirms 110 / 120 / 121 / 188 ping and SSH are all OK; ARP neighbor evidence is reachable for 120 / 121 / 188. Keep evidence in LOGBOOK/runbook. Host reachability table recorded with date/time.
P0-002 DONE 100 Recover 192.168.0.120 120 root filesystem inconsistency was repaired from console/initramfs with offline fsck; host booted at 2026-06-12 15:13, SSH returned, root is rw, failed units 0, and K3s mon is Ready control-plane. Continue normal monitoring; schedule storage review if fsck recurs. 120 ping/SSH OK, node Ready, root not readonly, failed units 0.
P0-003 DONE 100 Rerun /backup/scripts/backup-configs.sh 15:17 manual config capture succeeded; 15:54 aggregate Configs succeeded again, including 120-k3s-host-configs, 121-k3s-host-configs, K8s workloads, K8s secrets, and Velero backups. Keep next scheduled run under normal cron. config_failed=0; Configs snapshot bee9ae22 exists after 120 recovery.
P0-004 DONE 100 Rerun /backup/scripts/backup-all.sh 2026-06-12 15:54 aggregate completed 13/13 in 2170s; 18:55 backup-status shows failed=0, core_blockers=0. Keep 02:00 daily cadence. Aggregate backup exits 0; backup health failed count 0.
P0-005 DONE 100 Rerun /backup/scripts/sync-offsite-backups.sh --mode sync Default runway gate skipped full sync at 270m; controlled recovery override set runway to 240m without changing scripts. Full offsite sync completed 13/13 at 17:37 in 6027s. Restore normal default runway for scheduled sync; use override only for documented P0 recovery windows. New rclone-last-success marker after local backup timestamp.
P0-006 DONE 100 Rerun /backup/scripts/verify-offsite-full-sync.sh --write-textfile --no-color 18:55 verifier confirms all 13 remote repos have snapshots=1, REMOTE_LATEST_ONLY_OK=1, FULL_MARKER_FRESH=1, VERIFY_OK=1, FAILED=0. Keep 07:20 daily verifier. REMOTE_LATEST_ONLY_OK=1, all 13 repos snapshots=1.
P0-007 DONE 100 Rerun full cold-start scorecard First 18:56 rerun had one transient internal VIP API 000; direct VIP checks from 110/120/121/188 returned API 200 and Web 307. Second 18:57 rerun returned PASS=83 WARN=0 BLOCKED=0, result GREEN. Treat future internal VIP 000 as transient only after direct multi-host VIP checks prove API 200. BLOCKED=0, WARN=0, result GREEN.
P0-008 DONE 100 Narrow 120 failure domain and prepare console handoff 110 and 188 see no route / no ping; 121 reports destination host unreachable; local ARP is incomplete. Kubernetes retained only stale node/lease data and cannot read current 120 host/filesystem state. No BMC/IPMI/WOL inventory was found in the repo. Physical/VM console must verify power state, NIC attachment, boot screen, initramfs/fsck state, and then restore SSH. Handoff evidence is recorded; no remote-only fix path remains before console access.
P0-009 DONE 100 Exhaust safe remote 120 recovery channels 2026-06-12 15:00 local/110/121/188 all still fail ping/SSH with ARP incomplete. Searched repo, local tools, 110, 121, 188, SSH config, local VM files, and Chronicle-visible desktop; no usable BMC/IPMI/WOL/vmrun/hypervisor/120 console entry was found. Use hypervisor / console / VM inventory outside SSH path. Remote-only path is proven unavailable; no alert was silenced and no unsafe reboot/restart was attempted.
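
Several rows above quote the scorecard summary in the form `PASS=83 WARN=0 BLOCKED=0`. The sketch below parses that string and maps it to a coarse result. Only the GREEN case (`BLOCKED=0`, `WARN=0`) is stated in this workplan; the `WARN` and `BLOCKED` labels and both function names are assumptions.

```python
import re


def parse_scorecard(summary):
    """Extract PASS / WARN / BLOCKED counts from a scorecard summary string."""
    counts = {k: int(v) for k, v in re.findall(r"\b(PASS|WARN|BLOCKED)=(\d+)", summary)}
    missing = [k for k in ("PASS", "WARN", "BLOCKED") if k not in counts]
    if missing:
        raise ValueError(f"summary is missing fields: {missing}")
    return counts


def classify(counts):
    """Blocked gates dominate warnings; only a clean run is GREEN."""
    if counts["BLOCKED"] > 0:
        return "BLOCKED"
    if counts["WARN"] > 0:
        return "WARN"
    return "GREEN"


print(classify(parse_scorecard("PASS=83 WARN=0 BLOCKED=0")))  # GREEN
print(classify(parse_scorecard("PASS=82 WARN=1 BLOCKED=0")))  # WARN
```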

5. P1 Backup And Alert Gates

ID Status % Work item Fine analysis Next action Done criteria
P1-001 VERIFIED 100 Confirm 110 backup schedule Live crontab has 02:00 backup-all, 03:00 rclone gated sync, 06:05 backup-status, 07:20 full offsite verify. Update BACKUP-STATUS.md. Schedule documented and matches live crontab.
P1-002 VERIFIED 100 Confirm success-noise policy Daily status is once at 06:05; normal backup success is not a Telegram spam path. Keep failure-only escalation in backup docs. Docs say failures escalate; daily status is summary only.
P1-003 VERIFIED 100 Confirm Google Drive latest-only 2026-06-12 18:55 verifier shows 13 repos with exactly one remote snapshot each after the post-120 aggregate backup and full offsite sync. Record evidence in backup status. REMOTE_LATEST_ONLY_OK=1, VERIFY_OK=1.
P1-004 VERIFIED 100 Confirm required alerts exist Live Prometheus rules include all five required backup/cold-start alerts. Keep in scorecard. All five alert names FOUND live.
P1-005 BLOCKED_WAITING_OWNER_EVIDENCE 20 Fill credential escrow evidence markers Five markers are missing. This is a DR scorecard blocker, not a service outage. 2026-06-13 13:10 proves scripts/offsite/rclone readiness is green; the remaining blocker is owner-provided real non-secret evidence IDs. Owner request package exists at docs/security/CREDENTIAL-ESCROW-EVIDENCE-OWNER-REQUEST.md; secrets must not enter repo or chat. Human verifies vault/offline escrow, validates each non-secret evidence ID with --dry-run, then writes markers using /backup/scripts/mark-credential-escrow-verified.sh. awoooi_backup_dr_credential_escrow_missing_count=0.
P1-006 DONE 100 Fix backup health failed component 2026-06-12 18:55 backup-status shows failed=0, core_blockers=0, config_failed=0; 120 config capture is no longer red. Keep normal daily backup cadence. failed_count=0, config_failed=0.
P1-007 DONE 100 Refresh stale backup jobs 2026-06-04 cleared stale188=momo_pg_daily; 2026-06-05 cleared recurring stale110=awoooi_db; 2026-06-06 confirms no stale jobs after the next aggregate window. Keep normal cron cadence; only 120-driven Configs remains red. stale110=none, stale188=none, 110 13/13 fresh, 188 2/2 fresh.
P1-008 DONE 100 Align 188 momo backup cron/exporter contract 188 backup exporter expected /home/ollama/bin/momo-pg-backup.sh; crontab still pointed to the old app-side script. Crontab was backed up and updated to the host-owned controller script. Keep backup controller path in future deploy docs. configured_missing_188=0, awoooi_backup_job_configured{host="188",job="momo_pg_daily"} 1.
P1-009 DONE 100 Repair 2026-06-05 non-120 backup failures 02:00 aggregate failed Gitea, AWOOOI DB, Open-WebUI, ClawBot, AI Artifacts, and Configs. The next aggregate window held the five non-120 fixes; Configs remains 120-blocked. Leave aggregate red until 120 returns and Configs can rerun cleanly. Fresh single-repo evidence exists for all non-120 failures and the next aggregate run only failed Configs.
P1-010 DONE 100 Offsite sync manual backup repairs 2026-06-12 17:37 full offsite sync completed 13/13 after controlled P0 runway override to 240m; 18:55 verifier confirmed 13 remote repos each have one snapshot. Allow normal 03:00 full sync cadence unless another manual backup creates new snapshots. REMOTE_LATEST_ONLY_OK=1, VERIFY_OK=1, full sync 13/13.
P1-011 DONE 100 Confirm 2026-06-12 backup convergence 18:55 live check confirms the post-120 aggregate held: no stale jobs, no configured/missing script jobs, no failed components, offsite fresh, and only credential escrow remains as DR warning. Keep escrow as explicit red gate. stale110=none, stale188=none, failed=0, config_failed=0, core_blockers=0.
P1-012 DONE 100 Audit credential escrow marker write safety 2026-06-12 15:02 mark-credential-escrow-verified.sh --status reports all five allowed items missing; offsite-escrow-evidence-report.sh --no-color reports rclone/offsite configured and ESCROW_MISSING_COUNT=5; repo search found only runbooks/placeholders/rules, not real evidence IDs. Write markers only after a real non-secret evidence ID exists for each item; never write placeholder or secret. The marker blocker is narrowed to missing external evidence IDs, not missing script/config/offsite readiness.
P1-014 DONE 100 Publish credential escrow owner request package 2026-06-13 13:10 live report confirms SCRIPT_MISSING_COUNT=0, OFFSITE_CONFIGURED=1, RCLONE_CONFIGURED=1, ESCROW_MISSING_COUNT=5, PASS=8 WARN=5 BLOCKED=0. New owner request package defines allowed evidence-id types, forbidden secret values, safe dry-run flow, write flow, and closeout gates. Dispatch to the credential owners without collecting secret values; keep marker write gated until owner gives real non-secret evidence IDs. docs/security/CREDENTIAL-ESCROW-EVIDENCE-OWNER-REQUEST.md and snapshot exist and validate.
P1-013 DONE_FOR_SERVICE_READINESS 100 Remediate km-vectorize CronJob health debt The retained km-vectorize-29689620 failed Job is now classified as stale evidence, not an active blocker, because later official km-vectorize Jobs completed successfully. 2026-06-18 13:43 cold-start reads FAILED_JOBS=1, STALE_FAILED_JOBS=1, ACTIVE_FAILED_JOBS=0, BAD_PODS=0, and returns PASS=84 WARN=0 BLOCKED=0. Keep retained failed Job as evidence unless an explicit maintenance window authorizes cleanup. Reassert ArgoCD app health only with a fresh ArgoCD app readback, not from the cold-start scorecard alone. Service readiness no longer warns on stale failed Job evidence; active failed Job detection remains guarded.
P1-015 DONE 100 Restore 188 MinIO / Velero backup freshness and DB exporters 2026-06-24 06:35 resolved real backup / exporter red lights: 188 PostgreSQL exporter and Redis exporter now expose pg_up=1 / redis_up=1; 188 MinIO health is live; 120 Velero BSL is Available; one-off backup reboot-recovery-202606240456 completed; 110 backup-health textfile reports latest Velero backup fresh. 110 disk pressure was reduced from 92% to 73% by Docker image/build-cache cleanup only. Reconcile MinIO userns_mode: host override into formal source-of-truth or data ownership fix; keep Docker volume prune forbidden without explicit owner approval. VeleroBackupNotRun, PostgreSQLDown, RedisDown, and 110 disk-pressure alerts are resolved, and SOP includes restore helpers.
P1-016 DONE 100 Control repeated Telegram notification noise without hiding real alerts 2026-06-24 confirmed MOMO Pro 5-minute spam came from a legacy 110 script checking http://192.168.0.188/health; live script now uses https://mo.wooo.work/health as primary truth. Heartbeat warning dedupe now hashes stable actionable fingerprints so HTTP status / timeout / latency drift does not create a new Telegram event every 30 minutes. MoWoooWorkDown now labels component=momo-pro-system, disables blind auto-repair, and requires public/local/container/data-freshness triage. Generic docker-health monitor keeps 5-minute repair checks but adds a separate 30-minute direct Telegram fallback cooldown. Bitan public-content cleanliness keeps failure notification with same-fingerprint cooldown and one recovery notice. Fold remaining cross-product direct Telegram egress into the unified notification gateway over time; do not disable real warning/failure/recovery signals. Production deployment/readback must confirm the code and Prometheus rules are live before declaring runtime closure. Healthy heartbeat is quiet, same actionable heartbeat warning is deduped, MOMO public health success produces no alert, repeated same-failure direct fallback paths are cooled, and real failure/recovery/new-warning notifications remain enabled.
P1-017 DONE 100 Restore 188 nginx-exporter and post-CD monitoring coverage CD #3294 deployed marker 622bc372 but failed post-deploy checks because scripts/generate_monitoring.py --check saw Prometheus job nginx-exporter down at 192.168.0.188:9113. 188 stub_status and compose config were healthy, so the correct fix was restoring the stateless exporter from /home/ollama/nginx-exporter.yml, not reloading Nginx or restarting products. New helper scripts/ops/188-nginx-exporter-restore.sh defaults to read-only --check and exposes explicit --apply for maintenance-window restore. high-value-config-change-gate.py now classifies scripts/ops/**/*exporter* as monitoring_alerting_observability P1 / C1. Keep this check in post-reboot and post-CD recovery. Do not mark historical CD #3294 as success; use the next CD run plus monitoring coverage as future proof. bash scripts/ops/188-nginx-exporter-restore.sh --check reports nginx_up 1; python3 scripts/generate_monitoring.py --check --stabilization-sleep-seconds 0 reports Jobs=14, 全部 UP=14, 真實問題=0, coverage 100.0%; high-value gate matches the helper as P1 / C1, not unmanaged.
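
P1-016 describes deduplication by hashing a stable actionable fingerprint, so that HTTP status, timeout, or latency drift does not look like a new event. The sketch below illustrates that idea under stated assumptions: the field names in `STABLE_FIELDS`, the `Deduper` class, and applying the 30-minute cooldown to every fingerprint are all hypothetical and not the deployed implementation.

```python
import hashlib
import json

# Hypothetical stable fields; volatile ones (http_status, latency_ms) are excluded.
STABLE_FIELDS = ("component", "check", "failure_class")


def fingerprint(event):
    """Hash only the stable, actionable part of an event."""
    stable = {field: event.get(field, "") for field in STABLE_FIELDS}
    payload = json.dumps(stable, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class Deduper:
    def __init__(self, cooldown_seconds=1800):  # 30 minutes, assumed for all paths
        self.cooldown = cooldown_seconds
        self.last_sent = {}

    def should_notify(self, event, now):
        """Return True only for a new fingerprint or one past its cooldown."""
        key = fingerprint(event)
        last = self.last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False
        self.last_sent[key] = now
        return True


first = {"component": "momo-pro-system", "check": "public-health",
         "failure_class": "unreachable", "http_status": 502, "latency_ms": 120}
drift = dict(first, http_status=504, latency_ms=900)  # same failure, different noise

deduper = Deduper()
print(deduper.should_notify(first, now=0))     # True: first occurrence
print(deduper.should_notify(drift, now=600))   # False: same fingerprint, in cooldown
```

A genuinely different failure class produces a different hash and is sent immediately, which is how real new warnings stay visible while repeats are cooled.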

6. P2 Service And Data Gates

ID Status % Work item Fine analysis Next action Done criteria
P2-001 VERIFIED 100 Public route smoke 2026-06-12 18:57 cold-start confirms all listed domains returned expected 2xx/3xx over HTTPS; registry root route returned 200 in the scorecard and /v2/ remains the normal unauthenticated 401 pattern from earlier checks. This proves ingress/TLS plus current route availability. Keep as one row in scorecard. Public route table updated after each reboot.
P2-002 GREEN 100 momo latest/current-month parity and freshness Latest current-month parity is good: `15383 15383 2026-06-01
P2-008 DONE_SUPERSEDED_BY_JOB_57_RECOVERY 100 Separate MOMO service recovery from upstream source absence 2026-06-24 11:35 readback proved MOMO service was healthy and source-file absence was the blocker. 2026-06-25 10:35 superseded that with a stricter split: service healthy, DB parity good, but token / Drive auth evidence not sufficient and scheduler fail-closed behavior required. 2026-06-25 14:16 supersedes the blocker with job 57 clean import, V10.674, token metadata aligned to scheduler UID, current-month parity through 2026-06-24, and `DB_DAILY_FRESHNESS 1 2026-06-24`. SOP v1.51 preserves the GO/NO-GO rules forbidding old archive re-import, product-export import, truncate, whole-DB restore, fake freshness, or token secret exposure. Keep running the dedicated preflight after each reboot/import window; if Drive/API auth fails again, it must fail closed and alert rather than becoming an empty-folder success.
P2-003 DONE_PRODUCTION_DEPLOYED_WAITING_NEXT_REAL_IMPORT 99 Fix momo job semantics Gitea-first repair is in /Users/ogt/codex-workspaces/momo-pro-dev commit 84035906aba0e5e190d031a13cfd9b47a8cd1f73 on branch codex/momo-current-main-dev-base-20260624, also fast-forwarded to MacBook Pro and fast-forwarded to MOMO main. Gitea Actions cd.yaml #904 succeeded, and 188 live source contains _table_columns, 業績分析儀表板同步失敗, and 保留來源檔案等待重試,不移動 Google Drive 檔案. process_daily_sales_import() marks monthly sync failure as failed, records the sync error in summary, returns False, and leaves auto_import_from_drive() outside the Drive archive/move path. Regression tests cover both job failure and no-move behavior. Watch the next real Google Drive import and confirm no file moves unless both tables sync; if a real monthly sync failure happens, verify import job status is failed and source file remains pending. pytest tests/test_import_service_sql_params.py tests/test_auto_import_data_sync.py tests/test_auto_import_failure_boundaries.py -q returns 10 passed; production deployment/readback is complete; final behavioral closeout requires next real import evidence.
P2-004 DONE 100 PostgreSQL index corruption runbook path SOP v1.2 now states posting list tuple ... cannot be split is an index repair incident. Use only concurrent reindex if the error returns. No truncate, no whole DB restore; REINDEX TABLE CONCURRENTLY public.realtime_sales_monthly; and idempotent resync evidence recorded.
P2-005 VERIFIED 100 Do not rely on route 200 only 2026-06-12 closeout has route + DB + backup + offsite + schedule + alert + K3s + cold-start scorecard evidence. The only remaining blocker is DR credential escrow, outside service availability. Keep this cross-surface checklist mandatory after every reboot. Each reboot record has route, DB, backup, schedules, alert, scorecard rows.
P2-006 DONE 100 Validate momo scheduler WARN 2026-06-12 post-reboot regression showed the old detector was too narrow for Chinese batch and [Feeder] logs. The detector was widened and deployed to 110; 14:47 scorecard reads SCHEDULER_RECENT_ACTIVITY 1070 and marks scheduler healthy. Keep normal monitoring; treat future recurrence as detector tuning only if direct logs remain active. Container healthy, direct log activity exists, and latest scorecard removed this WARN.
P2-007 DONE 100 Balance K3s AWOOOI workload across 120 / 121 Gitea main acaae999 adds topology spread for API/Web/Worker. ArgoCD later synced deploy marker e4a349bc; live deployments still have split placement after a normal CD rollout: API pods on mon1 / mon, Web pods on mon / mon1, Worker single replica on mon; 01:26 final cold-start is PASS=83 WARN=0 BLOCKED=0. Keep watching future deploys; do not manually delete pods unless placement drift becomes a real service or HA gate. Live deployment has non-empty topology spread, API/Web placement max skew <= 1 after normal CD, public routes green, cold-start WARN=0 BLOCKED=0.
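
The P2-007 done criterion "placement max skew <= 1" can be checked from a plain list of the nodes each pod landed on. This is a minimal sketch assuming skew is the difference between the busiest and the least busy eligible node; `max_skew` is a hypothetical helper and does not query Kubernetes.

```python
from collections import Counter


def max_skew(pod_nodes, eligible_nodes):
    """Difference between the most and least loaded eligible node."""
    if not eligible_nodes:
        raise ValueError("at least one eligible node is required")
    counts = Counter(pod_nodes)
    # Nodes with no pods count as 0, so an empty node raises the skew.
    per_node = [counts.get(node, 0) for node in eligible_nodes]
    return max(per_node) - min(per_node)


nodes = ["mon", "mon1"]
print(max_skew(["mon1", "mon"], nodes))   # API split across both nodes -> 0
print(max_skew(["mon1", "mon1"], nodes))  # both replicas on one node -> 2
```

A single-replica workload such as the Worker always yields a skew of 1 across two nodes, which is still within the stated limit.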

7. P3 Documentation And Automation

ID Status % Work item Fine analysis Next action Done criteria
P3-001 VERIFIED 100 Confirm hardening commit Gitea main currently points to 0260ec89...; git merge-base --is-ancestor ae7b39d9 0260ec89... returned true. Keep evidence in LOGBOOK. Gitea main contains ae7b39d9 fix(ops): harden reboot recovery and backup alerts.
P3-002 VERIFIED_WITH_V142_SYNC_BLOCKED 100 Confirm live 110 scripts All required recovery/check scripts exist under /home/wooo/scripts/; cold-start script hash 10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8 is live on 110. Repo-side v1.42 authoritative script hash is f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05, and verify-cold-start-monitor-deploy.sh correctly blocks on the mismatch. Do not run install-cold-start-monitor-110.sh during read-only triage. After explicit maintenance-window / owner approval, run the installer, rerun deploy parity, then rerun the live 110 cold-start monitor and record the new hash. Script paths and current mismatch are recorded; v1.42 live-sync done criteria remains hash parity plus live scorecard fields.
P3-003 DONE 100 Reconcile 188 nginx Ansible baseline Live 188 already routes aiops.wooo.work through VIP; the Ansible template matches that route and has no 120 upstream for aiops. nginx-sync.yml now also carries the 188-internal-tools-https.conf.j2 source-of-truth path, and ansible-validate.sh syntax-check passes with repo-local roles path. Run only approved dry-run/apply from the normal Ansible environment before changing live nginx. Template and live config agree; no 120 upstream for aiops; repo-side syntax and readiness contract pass.
P3-004 DONE 100 Update docs/LOGBOOK.md Live blocker and new docs are recorded. Keep this entry updated after each recovery phase. LOGBOOK has current recovery status and next actions.
P3-005 DONE 100 Update cold-start SOP SOP now includes start, shutdown, reboot, record, comparison, and 120 blocker handling. Increment SOP version after each process change. SOP has controlled power-operation sections and ledger template.
P3-006 DONE 100 Update backup status Backup status now reflects current cron, rclone latest-only, failure-only alert posture, and escrow blocker. Refresh after 120 backup rerun. Backup status no longer claims noisy success Telegram notifications.
P3-007 DONE 100 Harden Gitea backup stale dump handling 2026-06-05 manual Gitea backup failed because the container retained /tmp/gitea-dump.zip from the 02:00 failure. scripts/backup/backup-gitea.sh now renames stale container dump files to timestamped evidence before running a new dump, and the live 110 script is updated. Watch the next 02:00 Gitea backup. bash -n passes locally and on 110; manual Gitea backup completed after stale evidence rename.
P3-008 DONE 100 Continuously optimize host reboot SOP SOP v1.52 adds one-page post-start quick check wrapper, fallback runbook, startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable plan_b baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, stale-vs-active K8s failed Job classification, post-reboot / post-CD recovery anchors, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, fwupd-refresh.timer rollback note, 110 runaway browser / CI load triage PlayBook, healthy-heartbeat suppression, 188 node-exporter restore, 188 DB/Redis exporter restore, 188 MinIO/Velero restore, 188 nginx-exporter restore, 110 Docker disk cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, post-reboot notification noise gates, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side scorecard source-absence classifier, 110 live-sync parity gate, CD monitoring coverage target-down classification, MOMO dedicated token/source preflight, MOMO V10.674 / StartedAt / lifecycle / job 57 / freshness 1 recovery readback, and 2026-06-25 110 CPU orphan Chrome vs active CI triage evidence. Use scripts/reboot-recovery/post-start-quick-check.sh --no-color for T+10 post-reboot triage, then use docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md as manual fallback and SOP v1.52 for exceptions, Plan B, blocker-specific recovery, and historical comparison. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow / runaway-process / notification-noise / MOMO preflight / monitoring coverage checks. If using the live 110 script, record its hash and do not assume repo-side v1.42 behavior until synced under approval and deploy parity passes. SOP distinguishes HOST_BOOTED, HOST_READY, SERVICE_READY, FULL_STACK_GREEN, K3S_CONTROL_PLANE_AA, WORKLOAD_BALANCED, B0_ABORTED_BEFORE_REBOOT, B1_HOST_RECOVERY_ONLY, B2_CORE_SERVICE_READY, B3_SERVICE_AVAILABLE_DEGRADED, B4_FULL_STACK_GREEN, and B5_DR_COMPLETE; quick check wrapper has one command order and LOGBOOK summary; latest MOMO dedicated preflight returns PASS=19 WARN=2 BLOCKED=0; 110 CPU evidence records old orphan Chrome groups removed by approved SIGTERM while active CI load remains observation-only; repeated healthy/same-failure notification noise is controlled without hiding real alerts, and monitoring coverage target-down is routed through exporter restore before any product restart.
P3-009 DONE 100 Assess 120/121 AA/AS role and host load balancing 2026-06-12 15:19 live check confirms 120 and 121 are both Ready control-plane, k3s active, k3s-agent inactive, with no taints; however most AWOOOI / ArgoCD / Velero workload remains on 121 after 120 fsck recovery. New assessment defines control-plane AA vs workload AA, migration candidates from 110/188, and stateful migration blockers. After P0 backup/offsite/cold-start green, implement topology spread for AWOOOI API/Web before moving additional services. docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md exists; SOP v1.6 links AA/AS and load-balancing checks; migration implementation remains explicitly 0%.
P3-010 DONE 100 Update workload balancing docs with 2026-06-13 live truth Host role assessment, workplan, SOP, backup status, and LOGBOOK are refreshed with current cold-start, backup, 188 certbot degraded, ArgoCD km-vectorize degraded, Gitea main acaae999, ArgoCD sync, and final pod placement evidence. Keep updating this file after the next reboot or deploy. Docs separate service-green status from DR escrow, workload rollout, and non-service governance debt.
P3-011 DONE 100 Record km-vectorize remediation status LOGBOOK, this workplan, and SOP now state the schedule/label fix, ArgoCD sync evidence, the invalid manual Job boundary, and the 90% waiting-for-next-schedule gate. After next 03:00 run, update this row and the top verdict with lastSuccessfulTime / ArgoCD health evidence. No document claims ArgoCD green before official CronJob success evidence exists.
P3-012 DONE 100 Prevent CD from clobbering cold-start SSH trust Source fix 80e6ec1a changes Gitea CD workflows to use deploy-specific deploy_known_hosts and UserKnownHostsFile; post-deploy marker e4a349bc proves global /home/wooo/.ssh/known_hosts retained 120 / 188 entries. SOP v1.8 records this as a release guardrail. Keep the guardrail in future workflow reviews; any > ~/.ssh/known_hosts in deploy code is a release blocker. CD success plus post-CD known_hosts readback and strict SSH checks to 120 / 188 remain green.
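
P3-002 blocks on a mismatch between the repo-side script hash and the hash live on 110. The sketch below shows that kind of parity check in isolation; it is not `verify-cold-start-monitor-deploy.sh`, and the function names and result labels are assumptions. The two hash constants are the values quoted in P3-002.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Stream a file through SHA-256 so large scripts are not read at once."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def deploy_parity(repo_hash, live_hash):
    """Hypothetical result labels: parity passes only on an exact match."""
    return "PARITY_OK" if repo_hash == live_hash else "PARITY_BLOCKED"


# Hashes quoted in P3-002: live 110 script vs. repo-side v1.42 script.
LIVE_110 = "10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8"
REPO_V142 = "f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05"
print(deploy_parity(REPO_V142, LIVE_110))  # PARITY_BLOCKED

# Self-contained demonstration of the file hashing on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "script.sh"
    sample.write_bytes(b"abc")
    sample_hash = sha256_of(sample)
print(sample_hash)
```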

8. Required 120 Recovery Sequence

Do this only after physical/VM console access confirms 120 is powered on, attached to the LAN, and either booted or repairable.

# 0. Console-side checks first; do not do these through an online mounted root filesystem.
#    - power / VM state
#    - NIC connected to the 192.168.0.x LAN
#    - boot screen / initramfs / rescue state
#    - if root FS repair is required: fsck -f /dev/mapper/ubuntu--vg-ubuntu--lv from console/rescue only

# 1. After SSH returns, run read-only 120 maintenance readiness
bash scripts/reboot-recovery/120-fsck-maintenance-checklist.sh --no-color

# 2. After 120 is reachable and stable, on 110
/backup/scripts/backup-configs.sh
/backup/scripts/backup-all.sh
/backup/scripts/sync-offsite-backups.sh --mode sync
/backup/scripts/verify-offsite-full-sync.sh --write-textfile --no-color

# 3. Final cold-start scorecard
/home/wooo/scripts/full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1

Do not run truncate, whole DB restore, force-push, DROP, or online root filesystem fsck as part of this flow.
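
Step 2 of the sequence above is order-dependent: offsite sync and verification only make sense after the local backups succeed. A minimal sketch of a stop-on-first-failure runner follows. It assumes each step should run only if the previous one exited 0, which the workplan implies but does not state; `run_in_order` and the injectable `runner` are hypothetical, and the demonstration uses a fake runner so nothing is executed.

```python
import subprocess

# Step 2 commands from the sequence above, as argument lists.
STEPS = [
    ["/backup/scripts/backup-configs.sh"],
    ["/backup/scripts/backup-all.sh"],
    ["/backup/scripts/sync-offsite-backups.sh", "--mode", "sync"],
    ["/backup/scripts/verify-offsite-full-sync.sh", "--write-textfile", "--no-color"],
]


def run_in_order(steps, runner=None):
    """Run steps sequentially; return (completed, failed_step, exit_code)."""
    if runner is None:
        # Default runner really executes the command and returns its exit code.
        runner = lambda argv: subprocess.run(argv, check=False).returncode
    completed = []
    for argv in steps:
        code = runner(argv)
        if code != 0:
            return completed, argv, code  # stop: later steps depend on this one
        completed.append(argv)
    return completed, None, 0


def fake_runner(argv):
    """Simulate an offsite sync failure without touching any host."""
    return 1 if "sync-offsite-backups.sh" in argv[0] else 0


done, failed, exit_code = run_in_order(STEPS, runner=fake_runner)
print(len(done), failed, exit_code)
```

Stopping early keeps the verifier from reporting on a stale offsite state after a failed sync.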


9. Progress Updates

2026-06-18 14:20 Asia/Taipei
Phase: P3 AI Ops runaway process automation
Before: A saturated 110 CPU could only be triaged manually with `ps/top`; the generic `HostHighCpuLoad` alert could not distinguish cross-project orphan Chrome smoke runs from legitimate Gitea Actions CI load.
After: Added the read-only `host-runaway-process-exporter.py`, the gated `host-runaway-process-remediation.py`, Prometheus `host_runaway_process_alerts`, the Ansible textfile exporter source-of-truth, SOP v1.26, and `HOST-RUNAWAY-PROCESS-AIOPS-PLAYBOOK.md`. The exporter exposes orphan browser, active CI, load/core, swap ratio, and `remediation_authorized=0`; the remediator defaults to dry-run, and `SIGTERM` requires owner approval, a maintenance window, and an evidence ref.
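
The gating rule in this entry (dry-run by default; `SIGTERM` only with owner approval, a maintenance window, and an evidence ref) can be sketched as a pure decision function. This is not the code of `host-runaway-process-remediation.py`; the class, field names, and return labels are assumptions.

```python
from dataclasses import dataclass

REQUIRED = ("owner_approval", "maintenance_window", "evidence_ref")


@dataclass(frozen=True)
class RemediationRequest:
    apply: bool = False            # dry-run unless explicitly requested
    owner_approval: str = ""
    maintenance_window: str = ""
    evidence_ref: str = ""


def decide(request):
    """Return (action, missing_gates); never signals a process itself."""
    if not request.apply:
        return "DRY_RUN", []
    missing = [name for name in REQUIRED if not getattr(request, name).strip()]
    if missing:
        # An apply request with an incomplete gate set degrades to dry-run.
        return "DRY_RUN", missing
    return "SIGTERM_ALLOWED", []


print(decide(RemediationRequest()))
print(decide(RemediationRequest(apply=True, owner_approval="ticket-ref")))
```

Returning the missing gate names makes a refused apply self-explanatory in the triage packet.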

2026-06-18 14:31 Asia/Taipei
Phase: P3 AI Ops runaway process live observability
Before: The repo-side exporter / alert / PlayBook were complete, but Prometheus on 110 had not yet read live `awoooi_host_runaway_process_*` metrics.
After: 110 now has the read-only exporter/helper and cron installed, and the textfile was refreshed immediately. The second Prometheus scrape read `monitor_up=1`, orphan browser group count `0`, active CI containers `2`, load5/core about `0.79-0.81`, swap ratio about `1.0`, and `remediation_authorized=0`; `HostRunawayProcessMonitorMissing` and `HostOrphanBrowserSmokeHighCpu` are not firing.
Evidence: `/home/wooo/node_exporter_textfiles/host_runaway_process.prom`, Prometheus query `awoooi_host_runaway_process_monitor_up{host="110"}`, and `ALERTS{alertname="HostRunawayProcessMonitorMissing",host="110",alertstate="firing"}`.
Blocked: No for live observability; yes for runtime remediation by design until owner approval / maintenance window / evidence ref / dry-run / post-check exist.
Next: Keep cron scrape under normal monitoring; if orphan count becomes >0, create AI triage packet and remediation dry-run before any gated `SIGTERM`.
Completion: monitoring / alert / PlayBook / KM contract 100%; runtime auto-remediation remains gated at 0 until a real owner-approved apply is executed.
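
A read-only check of the textfile evidence above might look like the sketch below. It parses simple Prometheus text-exposition lines and applies the "orphan count > 0 means triage" rule from this entry. Only `awoooi_host_runaway_process_monitor_up` appears in this workplan; the metric name `awoooi_host_runaway_process_orphan_browser_groups` and the return labels are assumptions for illustration.

```python
import re
from typing import Dict, Tuple

# Matches simple exposition lines of the form: name{labels} value
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{([^}]*)\})?\s+(\S+)$')

Samples = Dict[Tuple[str, str], float]


def parse_textfile(text: str) -> Samples:
    """Parse metric samples, skipping blank lines and # HELP / # TYPE comments."""
    samples: Samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match:
            name, labels, value = match.groups()
            samples[(name, labels or "")] = float(value)
    return samples


def classify_host(samples: Samples, host: str = "110") -> str:
    """Return 'monitor-missing', 'triage', or 'ok' for one host."""
    labels = f'host="{host}"'
    up = samples.get(("awoooi_host_runaway_process_monitor_up", labels), 0.0)
    # Hypothetical metric name for the orphan browser group count.
    orphans = samples.get(
        ("awoooi_host_runaway_process_orphan_browser_groups", labels), 0.0
    )
    if up != 1.0:
        return "monitor-missing"
    if orphans > 0:
        return "triage"
    return "ok"
```

A `triage` result would only mean "create the AI triage packet and a remediation dry-run"; nothing in the sketch touches a process.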

2026-06-18 14:38 Asia/Taipei
Phase: P3 AI Ops alert-to-event packet
Before: A generic CPU raw dump could be converted into an AI automation card, but the `HostOrphanBrowserSmokeHighCpu` / `HostCiRunnerLoadSaturation` alert text did not yet have a dedicated lane.
After: The Telegram final egress can convert `HostOrphanBrowserSmokeHighCpu` into `orphan_browser_smoke_runaway_process` and `HostCiRunnerLoadSaturation` into `ci_runner_load_saturation`; both keep `runtime_write_gate=0` and require dry-run / owner / maintenance / evidence / KM / PlayBook / Verifier.
Evidence: `apps/api/src/services/telegram_gateway.py`, `apps/api/tests/test_telegram_message_templates.py`; targeted pytest `59 passed`.
Blocked: No for alert-to-event packet; yes for Telegram live send / runtime remediation by design.
Next: After code-review / CD, do a production readback; if an alert actually fires in the future, confirm that the Telegram card and the AwoooP Run truth-chain both show the same lane.
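
The alert-to-lane mapping in this entry can be pictured with the sketch below. The two alert names, the two lane names, and `runtime_write_gate=0` come from the entry; the packet layout, the `REQUIRED_GATES` tuple, and the `unmapped` fallback are assumptions and do not describe the real `telegram_gateway.py`.

```python
from typing import Dict, List, Union

# Alert name -> event lane, as recorded in this progress entry.
ALERT_LANES: Dict[str, str] = {
    "HostOrphanBrowserSmokeHighCpu": "orphan_browser_smoke_runaway_process",
    "HostCiRunnerLoadSaturation": "ci_runner_load_saturation",
}

# Gates every packet must carry before any runtime write is considered.
REQUIRED_GATES = (
    "dry_run", "owner", "maintenance", "evidence", "km", "playbook", "verifier",
)

Packet = Dict[str, Union[str, int, List[str]]]


def to_event_packet(alertname: str) -> Packet:
    """Build a read-only event packet; the runtime write gate stays closed."""
    return {
        "alertname": alertname,
        # 'unmapped' is a hypothetical fallback for alerts without a lane.
        "lane": ALERT_LANES.get(alertname, "unmapped"),
        "runtime_write_gate": 0,
        "required_gates": list(REQUIRED_GATES),
    }
```

Keeping the mapping in one table makes it straightforward to assert in a test that the Telegram card and any other consumer see the same lane for the same alert.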

2026-06-18 14:51 Asia/Taipei
Phase: P3 AI Ops alert-to-event packet production readback
Before: `HostOrphanBrowserSmokeHighCpu` / `HostCiRunnerLoadSaturation` had source + tests, but production deployment and runtime revision readback were not yet complete.
After: `f358a0f6` was deployed by Gitea CD `#3150` with deploy marker `2d278568`. ArgoCD `awoooi-prod` is `Synced / Healthy`; the API / Web / Worker images are all `f358a0f6c3e614e407dedb6eee89bf10b2bc8173`; production health and the IwoooS / Governance / AwoooP tenants routes all return `200`, and sampled sensitive-string hits are `0`.
Evidence: Gitea CD `#3150` tests `3221 passed, 23 skipped`, B5 integration `5 passed`, post-deploy alert-chain smoke `9/9`, monitoring coverage `14/14` jobs up; Prometheus still reads 110 `monitor_up=1`, orphan browser group count `0`, CI active containers `2`, `remediation_authorized=0`; missing / orphan alerts are not firing.
Blocked: No for production alert-to-event packet deployment; yes for runtime remediation by design.
Next: Future firing alert must produce AI triage packet and dry-run evidence first. `HostCiRunnerLoadSaturation` remains capacity / runner scheduling triage, not process kill. Runtime remediation remains `0` until owner approval, maintenance window, evidence ref, gated SIGTERM, post-check, and KM / PlayBook / Verifier writeback exist.
Completion: host runaway monitoring / alert / PlayBook / Telegram event packet / production deploy readback 100%; runtime auto-remediation remains safely gated at 0.

2026-06-18 15:08 Asia/Taipei
Phase: P3-009 Host runaway AIOps loop product readback
Before: Monitoring, alert rules, event packet routing, live scrape, and production deploy readback were complete, but governance UI still lacked a single product-visible loop state for monitor -> alert -> event packet -> PlayBook -> KM / Verifier -> gated remediation.
After: Added `host_runaway_aiops_loop_readiness_v1` committed snapshot, schema, strict API loader, endpoint `/api/v1/agents/agent-host-runaway-aiops-loop-readiness`, regression tests, API client type, and governance automation-inventory card. The card shows 6 loop stages, 2 alert lanes, 5 asset writeback contracts, host 110 live readback, deploy marker 2d278568, orphan groups 0, and runtime writes 0.
Evidence: `apps/api/tests/test_host_runaway_aiops_loop_readiness.py` + API test `9 passed`; web typecheck passed using a temporary existing node_modules symlink that was removed before commit; snapshot/schema/messages JSON parse and py_compile passed.
Blocked: No for product readback; yes for runtime remediation by design.
Next: If a real or fixture alert fires, verify Telegram card, AwoooP Work Item, KM / PlayBook / Verifier fields agree before considering any owner-approved non-production gated SIGTERM drill.
Completion: host runaway AIOps product-visible loop readback 100%; runtime auto-remediation remains safely gated at 0.

2026-06-18 16:08 Asia/Taipei
Phase: P3-009 Host runaway AIOps loop production verification
Before: P3-009 source, API, UI and tests were pushed, but production still needed deploy marker, API readback, desktop/mobile browser smoke, and CD runner lock recovery evidence.
After: Final deploy marker `42c08ece chore(cd): deploy 27143fb [skip ci]` is live after CD runner lock fixes `fc6c01ee` / `84ca8423` / `27143fb0`; `cd.yaml #3177` and `code-review.yaml #3178` are successful. Production endpoint `/api/v1/agents/agent-host-runaway-aiops-loop-readiness` returns `schema_version=host_runaway_aiops_loop_readiness_v1`, `current_task_id=P3-009`, `next_task_id=P3-010`, completion `100`, loop stages `6`, alert lanes `2`, writeback contracts `5`, host `110`, orphan browser groups `0`, active CI containers `2`, and every runtime/write/remediation counter `0`.
Evidence: API health `healthy / prod / mock_mode=false`; desktop `1440x1100` and mobile `390x844` governance smoke with deploy marker `42c08ece` have required text missing `0`, console/page errors `0`, horizontal overflow `false`, overflowing elements `0`; screenshots are `/tmp/awoooi-host-runaway-aiops-desktop-1440x1100-42c08ece.png` and `/tmp/awoooi-host-runaway-aiops-mobile-390x844-42c08ece.png`.
Blocked: No for production product readback. Yes for runtime remediation by design: process termination, Docker/systemd restart, Nginx reload, firewall/K8s action, Telegram live send, Gateway queue write, Bot API call, production write, and secret read remain `0 / false`.
Next: Treat the next real or fixture `HostOrphanBrowserSmokeHighCpu` as the acceptance drill for end-to-end Telegram card / AwoooP work item / KM / PlayBook / Verifier field agreement. Any actual SIGTERM remains owner-approved, maintenance-windowed, dry-run-first, and post-check-gated.
Completion: host runaway AIOps product-visible loop readback and production verification 100%; runtime auto-remediation remains safely gated at 0.

2026-06-18 13:43 Asia/Taipei
Phase: P1/P2/P3 live readback
Before: live cold-start was `PASS=83 WARN=1 BLOCKED=0`, result `DEGRADED`, because retained stale `km-vectorize-29689620` failed Job evidence was still counted as a service warning.
After: live cold-start is `PASS=84 WARN=0 BLOCKED=0`, result `GREEN`; P2 service readiness is now `100%`; overall recovery readiness is `99% SERVICE_GREEN_DR_ESCROW_BLOCKED`.
Evidence: `full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1`; K8s schedule counters `FAILED_JOBS=1`, `STALE_FAILED_JOBS=1`, `ACTIVE_FAILED_JOBS=0`, `BAD_PODS=0`; repo-side readiness audit `PASS=187 WARN=1 BLOCKED=0`; escrow readback `ESCROW_MISSING_COUNT=5`.
Blocked: no for full-stack service readiness. Yes for DR complete, because five credential escrow evidence markers still need real non-secret owner evidence IDs.
Next: use SOP v1.25 for the next reboot; record failed/stale/active Job counters separately; close B5 only after real credential escrow marker evidence exists.
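
Recording failed, stale, and active Job counters separately could be sketched as below. The rule used here (a failed Job counts as stale once a later Job of the same CronJob has succeeded) is an assumption inferred from the km-vectorize case in these entries, not the confirmed logic of `full-stack-cold-start-check.sh`; the `JobRecord` shape is hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class JobRecord:
    """Hypothetical, minimal view of one Kubernetes Job."""
    cronjob: str
    name: str
    started: datetime
    succeeded: bool


def count_failed_jobs(jobs: Iterable[JobRecord]) -> Dict[str, int]:
    """Split failed Jobs into stale (superseded by a later success) and active."""
    job_list: List[JobRecord] = list(jobs)
    latest_success: Dict[str, datetime] = {}
    for job in job_list:
        if job.succeeded:
            previous = latest_success.get(job.cronjob)
            if previous is None or job.started > previous:
                latest_success[job.cronjob] = job.started

    failed = [job for job in job_list if not job.succeeded]
    stale = [
        job for job in failed
        if job.cronjob in latest_success and latest_success[job.cronjob] > job.started
    ]
    return {
        "FAILED_JOBS": len(failed),
        "STALE_FAILED_JOBS": len(stale),
        "ACTIVE_FAILED_JOBS": len(failed) - len(stale),
    }
```

Under this assumption only `ACTIVE_FAILED_JOBS` would need to count as a service warning, while stale retained evidence stays visible in its own counter.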

2026-06-18 12:17 Asia/Taipei
Phase: P0/P2/P3 live readback
Before: repo-side readiness was complete, but live gate had not been rerun after the same-day push.
After: live cold-start is `PASS=83 WARN=1 BLOCKED=0`, result `DEGRADED`; final rollout readback shows API `2/2`, Web `2/2`, Worker `1/1`, Canary `1/1`, and API health `200 healthy`.
Evidence: `full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1`; read-only K8s deployment/job snapshot from 120; public API health readback.
Blocked: no hard blocker. One warning remains: stale retained Job `km-vectorize-29689620` from 2026-06-14 03:00; later official km-vectorize Jobs are Complete. DR complete still blocked by real credential escrow evidence markers.
Next: before any actual reboot, rerun the same live preflight and classify as `B3_SERVICE_AVAILABLE_DEGRADED` if only stale evidence remains, or `B4_FULL_STACK_GREEN` only when `WARN=0 BLOCKED=0`.
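
The scorecard classification used in these entries can be sketched as a small parser. The `GREEN` and `DEGRADED` results match the readbacks recorded here (`WARN=0 BLOCKED=0` is green, a warning without blocks is degraded); the `BLOCKED` result label for `BLOCKED>0` is an assumption, since the workplan does not name that result string.

```python
import re
from typing import Dict


def parse_scorecard(line: str) -> Dict[str, int]:
    """Extract PASS / WARN / BLOCKED counters from a scorecard summary line."""
    counters = {key: int(value)
                for key, value in re.findall(r"\b(PASS|WARN|BLOCKED)=(\d+)", line)}
    missing = {"PASS", "WARN", "BLOCKED"} - counters.keys()
    if missing:
        raise ValueError("scorecard line lacks counters: " + ", ".join(sorted(missing)))
    return counters


def classify_scorecard(line: str) -> str:
    """Map counters to a result; any block outranks any warning."""
    counters = parse_scorecard(line)
    if counters["BLOCKED"] > 0:
        return "BLOCKED"  # assumed label for the hard-block case
    if counters["WARN"] > 0:
        return "DEGRADED"
    return "GREEN"
```

Raising on a line without all three counters avoids treating an empty or truncated scorecard as green.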

2026-06-18 12:06 Asia/Taipei
Phase: P3
Before: repo-side readiness audit PASS=147 WARN=2 BLOCKED=37 before blocker batch; after Plan B-only guard it still had pre-existing blockers.
After: repo-side readiness audit PASS=185 WARN=1 BLOCKED=0, result READY WITH WARNINGS.
Evidence: full-stack-cold-start-check.sh now emits NODE_FS_ERROR_EVENTS and blocks K3s release on node filesystem evidence; backup-awoooi.sh no longer runs direct service-level rclone sync; 110-devops.yml manages cold-start monitor, runner guardrails, textfile exporters, backup scripts, daily backup heartbeat, offsite evidence report and offsite full-sync verifier; 188-ai-web.yml uses host-owned /home/ollama/bin/momo-pg-backup.sh and no longer contains the old app-directory backup cron path; nginx-sync.yml includes 188-internal-tools-https.conf.j2; ansible-lint.yml now runs self-hosted validation across Ansible, ops baseline, monitoring rules, backup scripts, reboot scripts, docs and workflow changes; bootstrap-ansible-validation-env.sh selects Python 3.11/3.10 for pinned ansible-core; ansible-validate.sh passes YAML, shell, Python, doc secret, backup alert label, recovery scorecard, Ansible syntax-check and ansible-lint minimum profile.
Blocked: no for repo-side reboot readiness contracts. Yes for live reboot authorization until same-day live checks run; yes for DR complete while credential escrow evidence markers remain missing.
Next: before an actual reboot, run the same-day live preflight and then the live cold-start gate with --live or the 110 deployed monitor; do not use repo-side READY WITH WARNINGS as a substitute for host/runtime truth.

2026-06-18 11:48 Asia/Taipei
Phase: P3
Before: P3 100%
After: P3 100%
Evidence: ops/reboot-recovery/full-stack-cold-start-baseline.yml now has a machine-readable plan_b section with red lines, triggers, host paths, B0-B5 levels, T+0/T+120 timeline, and closeout states; scripts/reboot-recovery/reboot-recovery-readiness-audit.sh now checks SOP and baseline for Plan B markers. Targeted assertion returned PLAN_B_BASELINE_ASSERTIONS_OK levels=6 closeout=3 timeline_stop=T+120. Full readiness audit confirms all new Plan B checks pass, but overall audit remains NOT READY because of pre-existing Ansible / workflow / backup-contract blockers unrelated to this Plan B addition.
Blocked: no for Plan B mechanism. Yes for overall reboot automation readiness audit until the existing non-Plan-B BLOCKED rows are resolved.
Next: continue closing pre-existing readiness-audit blockers by priority, without changing runtime or pretending the overall audit is green.

2026-06-18 11:41 Asia/Taipei
Phase: P3
Before: P3 100%
After: P3 100%
Evidence: docs/runbooks/FULL-STACK-COLD-START-SOP.md updated to v1.22 with explicit Plan B degraded-operation path, B0-B5 service levels, Plan B trigger table, host-specific fallback routes for 110/120/121/188/K3s/public gateway, T+0/T+120 fallback timeline, and Plan B closeout states. This workplan now requires every future reboot record to compare actual timing and blockers against SOP §1.4, not only the Plan A cold-start chain.
Blocked: no for documentation. Live reboot authorization still requires fresh same-day preflight before any maintenance window; DR complete remains blocked while credential escrow missing count is 5.
Next: before the next host reboot, rerun live preflight, choose Plan A or Plan B entry criteria, then record final level as B0/B1/B2/B3/B4/B5 with the exact blocker.

2026-06-14 18:15 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 97%, P1 92%, P2 99%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: latest repo docs baseline 0a4766dd; runtime deploy marker 16c6b983 put image e999c16b3435f197b78fe2adfeec1c4faa6c4675 live for API/Web/Worker/CronJob; ArgoCD source revision 0a4766ddc94b0690824ce3deba5c6b9a69764f94 remains Synced/Degraded, still only because of km-vectorize; API/Web split across mon/mon1; Worker on mon; IwoooS route /zh-TW/iwooos returned 200; AwoooP route /zh-TW/awooop returned 200; 110 systemctl --failed returned 0 loaded units listed; backup-status core_blockers=0 and escrow_missing=5; final cold-start PASS=82 WARN=1 BLOCKED=0.
Blocked: yes for full cold-start green, because km-vectorize-29689620 is still failed and must wait for the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: keep the 03:00 official schedule gate; after the next official km-vectorize run, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health.

2026-06-14 17:04 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 97%, P1 92%, P2 99%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: latest docs baseline af62ec1f; runtime deploy marker ed651a98 put image e992af89955f8aae40a383b2f2e2f645445a690d live for API/Web/Worker/CronJob; ArgoCD source revision af62ec1fe72b3e84e179d80e788e5a5902bdaf27 remains Synced/Degraded, still only because of km-vectorize; API/Web split across mon/mon1; Worker on mon1; IwoooS route /zh-TW/iwooos returned 200; 110 systemctl --failed returned 0 loaded units listed; backup-status core_blockers=0 and escrow_missing=5; final cold-start PASS=82 WARN=1 BLOCKED=0.
Blocked: yes for full cold-start green, because km-vectorize-29689620 is still failed and must wait for the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: keep the 03:00 official schedule gate; after the next official km-vectorize run, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health.

2026-06-14 16:29 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 97%, P1 92%, P2 99%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: latest docs baseline 06fe0a8f; runtime deploy marker 36fbfc6b put image 386dbd078ef63401d9736048463f4ef5326442d9 live for API/Web/Worker/CronJob; ArgoCD source revision 06fe0a8f14167824fea512f942d2569431bbcbc8 remains Synced/Degraded, still only because of km-vectorize; API/Web split across mon/mon1; Worker on mon; 110 systemctl --failed returned 0 loaded units listed; backup-status core_blockers=0 and escrow_missing=5; P2-145 endpoint current=P2-145 completion=100; owner response received/accepted/rejected and reviewer/Gateway/Telegram/Bot API/result capture/learning/PlayBook trust/production write/secret/destructive are all 0/false; final cold-start PASS=82 WARN=1 BLOCKED=0.
Blocked: yes for full cold-start green, because km-vectorize-29689620 is still failed and must wait for the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: keep the 03:00 official schedule gate; after the next official km-vectorize run, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health.

2026-06-14 15:58 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 97%, P1 92%, P2 99%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: gitea/main advanced to deploy marker 180a6543; image fef94df877c5438f9f34ddbcace8ad8112a141ef is live for API/Web/Worker; ArgoCD source revision 180a6543eaf26dd6b345d45114316926056a965a remains Synced/Degraded, still only because of km-vectorize; API/Web split across mon/mon1; Worker on mon1; 110 systemctl --failed returned 0 loaded units listed; backup-status core_blockers=0 and escrow_missing=5; P2-144 endpoint current=P2-144 completion=100; owner response received/accepted/rejected and reviewer/Gateway/Telegram/Bot API/result capture/learning/PlayBook trust/production write/secret/destructive are all 0/false; final cold-start PASS=82 WARN=1 BLOCKED=0.
Blocked: yes for full cold-start green, because km-vectorize-29689620 is still failed and must wait for the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: keep the 03:00 official schedule gate; after the next official km-vectorize run, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health.

2026-06-14 15:00 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 97%, P1 92%, P2 99%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: latest docs baseline b09eb1c6; runtime deploy marker 667d6329 put image 755b0a8d3038df2c52dee280067863d92db1eda5 live for API/Web/Worker/CronJob; ArgoCD revision 4abf0c0f750254d3c7137eae049abdfd99630f5f remains Synced/Degraded, still only because of km-vectorize; API/Web split across mon/mon1; Worker on mon; 110 systemctl --failed returned 0 loaded units listed; backup-status core_blockers=0 and escrow_missing=5; P2-143 endpoint current=P2-143 completion=100; writer/Gateway/Telegram/Bot API/production write/secret/destructive are all 0/false; final cold-start PASS=82 WARN=1 BLOCKED=0.
Blocked: yes for full cold-start green, because km-vectorize-29689620 is still failed and must wait for the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: keep the 03:00 official schedule gate; after the next official km-vectorize run, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health.

2026-06-14 10:40 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 97%, P1 92%, P2 99%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: latest docs head observed before this recovery commit 50d4f2ba; runtime deploy marker d023f5d7 put image f737f278 live for API/Web/Worker/CronJob; ArgoCD revision 50d4f2ba; API/Web split across mon/mon1; Worker on mon; 110 systemctl --failed returned 0 loaded units listed and fwupd-refresh.timer remained disabled/inactive; backup-status core_blockers=0 and escrow_missing=5; final cold-start PASS=82 WARN=1 BLOCKED=0.
Blocked: yes for full cold-start green, because km-vectorize-29689620 remains failed until the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: keep the 03:00 official schedule gate; after the next official km-vectorize run, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health.

2026-06-14 09:56 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 97%, P1 92%, P2 99%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: latest docs head observed before this recovery commit a0fe7741; runtime deploy marker and ArgoCD revision 60a0415c put image a3de0ffb live for API/Web/Worker/CronJob; API/Web split across mon/mon1; Worker on mon1; 110 systemctl --failed returned 0 loaded units listed and fwupd-refresh.timer remained disabled/inactive; backup-status core_blockers=0 and escrow_missing=5; final cold-start PASS=82 WARN=1 BLOCKED=0.
Blocked: yes for full cold-start green, because km-vectorize-29689620 remains failed until the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: keep the 03:00 official schedule gate; after the next official km-vectorize run, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health.

2026-06-14 09:27 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 97%, P1 92%, P2 99%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: gitea/main 5bad267e and ArgoCD revision 5bad267e; deploy marker 8d575c1a put image 280e0fbe live for API/Web/Worker/CronJob; API/Web split across mon/mon1; 110 systemctl --failed returned 0 loaded units listed and fwupd-refresh.timer remained disabled/inactive; backup-status core_blockers=0 and escrow_missing=5; first cold-start had transient stock 502 during stockplatform-v2 warmup, direct route/TLS recheck returned 200, final cold-start PASS=82 WARN=1 BLOCKED=0.
Blocked: yes for full cold-start green, because km-vectorize-29689620 remains failed until the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: keep the 03:00 official schedule gate; after the next official km-vectorize run, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health.

2026-06-14 08:40 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 97%, P1 92%, P2 99%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: gitea/main and ArgoCD revision 18b867c3; deploy marker 18b867c3 put image e0a6d339 live for API/Web/Worker/CronJob; API/Web split across mon/mon1; 110 systemctl --failed returned 0 loaded units listed and fwupd-refresh.timer remained disabled/inactive; backup-status core_blockers=0 and escrow_missing=5; cold-start PASS=82 WARN=1 BLOCKED=0.
Blocked: yes for full cold-start green, because km-vectorize-29689620 remains failed until the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: keep the 03:00 official schedule gate; after the next official km-vectorize run, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health.

2026-06-14 08:24 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 96%, P1 92%, P2 98%, P3 100%
After: Overall 97%, P1 92%, P2 99%, P3 100%
Evidence: 110 fwupd-refresh.timer disabled/inactive with rollback command recorded; systemctl --failed returned 0 loaded units listed; backup-status 110 13/13 fresh failed=0 and 188 2/2 fresh failed=0 with core_blockers=0 and escrow_missing=5; cold-start PASS=82 WARN=1 BLOCKED=0; ArgoCD/CronJob still waiting for official km-vectorize lastSuccessfulTime after deploy marker ec03f0b7 / image 8ddb80d6.
Blocked: yes for full cold-start green, because km-vectorize-29689620 remains failed until the next official 03:00 success or retained failed Pod/log evidence; yes for DR complete, because 5 credential escrow evidence markers are still missing.
Next: after the next 03:00 Asia/Taipei official km-vectorize schedule, read-only verify lastSuccessfulTime, latest Job/Pod/log, and ArgoCD health; do not manual-run, delete, patch, or fake evidence.

2026-06-13 01:29 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 95%, P1 90%, P2 100%, P3 100%
After: Overall 95%, P1 90%, P2 100%, P3 100%
Evidence: Gitea main e4a349bc; ArgoCD revision e4a349bc sync=Synced health=Degraded only by km-vectorize stale success; K3s images 414413a59268eedd391648f112e228716dd05362; API/Web split across mon/mon1; /home/wooo/.ssh/known_hosts retained 120/188 after CD fix 80e6ec1a; backup-status 110 13/13 fresh failed=0 and 188 2/2 fresh failed=0; offsite textfile remote_verify_ok=1 and 13 repos snapshot_count=1; backup alert live visibility OK; all five required Prometheus alert rule names health=ok; cold-start PASS=83 WARN=0 BLOCKED=0.
Blocked: yes for DR complete only, because 5 credential escrow evidence markers are still missing; a fully healthy ArgoCD still waits for the official 03:00 km-vectorize lastSuccessfulTime.
Next: after 03:00 Asia/Taipei, verify km-vectorize official Job completion and ArgoCD health; keep escrow alerts firing until real non-secret evidence IDs are written.

2026-06-04 15:23 Asia/Taipei
Phase: P3
Before: 78%
After: 95%
Evidence: infra/ansible/roles/nginx/templates/188-all-sites.conf.j2 now contains aiops VIP upstreams 192.168.0.125:32334/32335; live smoke aiops / -> 307 and /api/v1/health -> 200; content guard passed.
Blocked: no for route baseline; ansible-playbook is unavailable on this workstation, so syntax-check remains delegated to the normal Ansible environment before next apply.
Next: run Ansible syntax/apply validation from the Ansible host before changing 188 nginx live config.

2026-06-04 15:23 Asia/Taipei
Phase: P2
Before: 52%
After: 66%
Evidence: /Users/ogt/momo-pro-system/services/import_service.py updated; /Users/ogt/momo-pro-system/tests/test_daily_sales_monthly_sync_failure.py added; targeted pytest passed with temp SQLite and real Excel input.
Blocked: yes. Live 188 uses /home/ollama/momo-pro bind-mounted code, while momo/ewoooc canonical source remains unresolved.
Next: reconcile canonical source/deploy path, apply the same monthly-sync failure contract to live, then run controlled live auto-import failure-path verification.

2026-06-04 15:34 Asia/Taipei
Phase: P2
Before: 66%
After: 86%
Evidence: live /home/ollama/momo-pro/services/import_service.py patched from backup services/import_service.py.bak.20260604-152827; live hash 3fc45671986fa4cc155119f588bc1ebefd272927730052e42e2b9eb4352b2586; container isolated temp-DB/real-Excel contract test passed; momo-scheduler and momo-pro-system restarted and healthy; mo.wooo.work /health 200; latest DB parity daily=404 and monthly=404 for 2026-06-02.
Blocked: no for momo failure contract. Overall remains blocked by 120 reachability and credential escrow.
Next: observe the next real Google Drive import and keep canonical momo/ewoooc source-control reconciliation as a separate supply-chain item.

2026-06-04 15:50 Asia/Taipei
Phase: P1
Before: 58%
After: 72%
Evidence: /backup/scripts/backup-status.sh --no-notify initially showed stale110=awoooi_db, stale188=momo_pg_daily, configured_missing_188=1; manual 188 momo PostgreSQL backup completed and kept latest-only; manual 110 backup-awoooi-frequent completed with restic snapshot 7440d75f; 188 crontab now points momo_pg_daily to /home/ollama/bin/momo-pg-backup.sh; final backup-status shows stale110=none, stale188=none, configured_missing_188=0, core_blockers=1, escrow_missing=5.
Blocked: yes. 120 config capture still keeps aggregate backup red, and five credential escrow evidence markers are still missing.
Next: after 120 returns, rerun backup-configs, backup-all, offsite sync, full offsite verify, then cold-start scorecard; separately fill escrow only with real non-secret evidence IDs.

2026-06-04 18:55 Asia/Taipei
Phase: P0/P1/P2
Before: Overall 60%, P1 72%, P2 86%
After: Overall 61%, P1 74%, P2 88%
Evidence: local ping to 192.168.0.120 still 0/3, SSH 22 timed out, ARP incomplete; 121 kubectl still shows mon NotReady,SchedulingDisabled and mon1 Ready; 110 backup-status --no-notify shows stale110=none, stale188=none, configured_missing_188=0, core_blockers=1, escrow_missing=5; cold-start scorecard now reports PASS=71 WARN=3 BLOCKED=3 and momo monthly parity 2215/2215 for 2026-06-01 through 2026-06-04.
Blocked: yes. The three hard blocks are still 120 ping, 120 SSH, and 120 K3s read-only check; escrow remains missing 5 evidence markers.
Next: wait for physical/console recovery of 120, then run the required backup-configs / backup-all / offsite sync / full verify / cold-start sequence.

2026-06-04 19:02 Asia/Taipei
Phase: P0/P3
Before: Overall 61%, P0 35%, P3 95%
After: Overall 62%, P0 36%, P3 96%
Evidence: local/110/121/188 all failed to reach 192.168.0.120; 121 returned Destination Host Unreachable; kubectl describe node mon shows LastHeartbeatTime 2026-05-22 02:44:13 +08, Ready Unknown since 2026-05-22 02:49:48 +08, and kube-node-lease renewTime 2026-05-22 02:48:36 +08; 120-fsck-maintenance-checklist.sh --no-color returned PASS=2 WARN=2 BLOCKED=3 and MAINTENANCE REQUIRED; repo search found no BMC/IPMI/WOL inventory for 120.
Blocked: yes. 120 requires physical or VM console recovery before backup-configs, backup-all, offsite sync, and full cold-start can be made green.
Next: use console to verify 120 power/NIC/boot/initramfs state, perform offline fsck only if needed, then restore SSH and run the required recovery sequence.

2026-06-05 18:40 Asia/Taipei
Phase: P0/P1/P3
Before: Overall 62%, P1 74%, P3 96%
After: Overall 64%, P1 80%, P3 97%
Evidence: 120 remains unreachable from local/110/121/188 and K3s mon remains NotReady,SchedulingDisabled; 14:00 AWOOOI high-frequency backup had failed, then 16:01 manual high-frequency backup completed snapshot b7d5ee4e; Gitea stale container dump /tmp/gitea-dump.zip was preserved as /tmp/gitea-dump.stale.20260605_161032.zip, script hardened, and manual Gitea backup completed snapshot ea641613; Open-WebUI d1147507, ClawBot 73ead3cc, AI artifacts b1161ab8 completed; partial offsite sync for five changed repos completed 5/5; verify-offsite-full-sync reports REMOTE_LATEST_ONLY_OK=1 and VERIFY_OK=1; final backup-status shows stale110=none, stale188=none, core_blockers=6, escrow_missing=5; cold-start remains PASS=71 WARN=3 BLOCKED=3.
Blocked: yes. 120 remains the P0 blocker, backup_all failed history remains red until backup-all can rerun after 120 returns, and credential escrow still lacks five non-secret evidence markers.
Next: monitor the 20:00 high-frequency backup, keep 120 console recovery as P0, then rerun backup-configs / backup-all / offsite sync / full verify / cold-start after 120 returns.

2026-06-06 14:47 Asia/Taipei
Phase: P0/P1/P2
Before: Overall 64%, P1 80%, P2 88%
After: Overall 65%, P1 84%, P2 89%
Evidence: 120 still fails ping, SSH times out, ARP is incomplete, and K3s mon remains NotReady,SchedulingDisabled; 06-06 02:00 aggregate failed only Configs (12/13 success) due to the 120 config capture blocker; backup-status at 14:46 shows stale110=none, stale188=none, failed=1, core_blockers=1, escrow_missing=5; verify-offsite-full-sync shows all 13 remote repos snapshots=1, REMOTE_LATEST_ONLY_OK=1, VERIFY_OK=1; cold-start reports PASS=70 WARN=4 BLOCKED=3; momo scheduler direct log activity count over the last 15 minutes is 151 despite the scorecard WARN.
Blocked: yes. 120 remains unreachable, aggregate backup cannot be green until backup-configs and backup-all rerun after 120 returns, and credential escrow still lacks five evidence markers.
Next: keep 120 console recovery as P0, keep escrow marker collection separate from secrets, and rerun the required backup/offsite/cold-start sequence only after 120 is reachable.

2026-06-06 15:00 Asia/Taipei
Phase: P3
Before: P3 97%
After: P3 98%
Evidence: docs/runbooks/FULL-STACK-COLD-START-SOP.md updated to v1.3 with 2026-06-06 live baseline, full shutdown/startup/single-host reboot SOP, mandatory reboot ledger template, and SOP version-comparison rules.
Blocked: no for documentation. Validation gap remains because ansible-playbook is unavailable on this workstation and 120 recovery still requires console access.
Next: after the next actual reboot or 120 console recovery, append a LOGBOOK reboot record and compare it against this 2026-06-06 baseline before changing SOP version again.

2026-06-06 15:03 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 65%, P2 89%, P3 98%
After: Overall 65%, P2 90%, P3 99%
Evidence: 120 still ping/SSH failed with ARP incomplete; 121 still shows mon NotReady,SchedulingDisabled and mon1 Ready; backup-status at 15:02 shows stale110=none, stale188=none, failed=1, core_blockers=1, escrow_missing=5; offsite verifier shows 13 repos snapshots=1 with REMOTE_LATEST_ONLY_OK=1 and VERIFY_OK=1; Alertmanager has all five required backup/cold-start rules; escrow report shows scripts/config present but 5 evidence markers missing; 15:03 cold-start reports PASS=71 WARN=3 BLOCKED=3; direct 188 momo-scheduler check is healthy with recent log activity.
Blocked: yes. The three hard blocks remain 120 ping, 120 SSH, and 120 K3s read-only check; aggregate backup remains blocked by 120 config capture; DR scorecard remains blocked by five missing non-secret escrow markers.
Next: do not fake escrow markers; after real non-secret evidence IDs are available, run mark-credential-escrow-verified.sh for the five items. Keep 120 console recovery as P0.

2026-06-06 15:06 Asia/Taipei
Phase: P1/P3
Before: Overall 65%, P1 84%, P3 99%
After: Overall 65%, P1 85%, P3 99%
Evidence: /backup/scripts/mark-credential-escrow-verified.sh --help confirms --dry-run support, allowed item names, and placeholder/secret rejection rules; docs/runbooks/BACKUP-STATUS.md now contains the credential escrow evidence checklist and safe marker flow.
Blocked: yes. No marker was written because no real non-secret evidence IDs were available in this session; escrow_missing remains 5.
Next: once real external evidence IDs exist, dry-run each item first, then write markers and rerun offsite-escrow-evidence-report plus backup-status.

2026-06-12 04:11 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 65%, P1 85%, P2 90%, P3 99%
After: Overall 66%, P1 86%, P2 90%, P3 99%
Evidence: 120 still fails ping/SSH with ARP incomplete; 121 still shows mon NotReady,SchedulingDisabled and mon1 Ready; 110->188 SSH host key trust repaired after matching ED25519 fingerprint; 02:00 backup-all completed 12/13 and failed only Configs due to 120; backup-status at 04:11 shows stale110=none, stale188=none, failed=1, core_blockers=1, escrow_missing=5; offsite sync from 03:00 is still running at 04:10.
Blocked: yes. Full reboot window is NO-GO until current offsite sync exits and a fresh offsite verifier passes; full green remains impossible while 120 is unreachable.
Next: wait for the 03:00 offsite sync to finish, run verify-offsite-full-sync, then rerun cold-start scorecard before approving any maintenance window.

2026-06-12 18:57 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 67%, P0 36%, P1 86%, P2 95%, P3 99%
After: Overall 95%, P0 100%, P1 90%, P2 97%, P3 100%
Evidence: 120 root fsck recovery booted at 15:13; 120/121 are both Ready control-plane; backup-configs and backup-all captured 120/121/K8s successfully; backup-all completed 13/13 at 15:54; full offsite sync completed 13/13 at 17:37 after documented recovery runway override to 240m; verify-offsite-full-sync returned REMOTE_LATEST_ONLY_OK=1, FULL_MARKER_FRESH=1, VERIFY_OK=1, FAILED=0; backup-status at 18:55 reports core_blockers=0 and escrow_missing=5; cold-start at 18:57 reports PASS=83 WARN=0 BLOCKED=0.
Blocked: yes for DR only. Service/full-stack recovery is green, but DR scorecard remains blocked until five credential escrow evidence markers are written with real non-secret evidence IDs.
Next: collect real credential escrow evidence IDs, dry-run each marker, then write markers and rerun offsite-escrow-evidence-report plus backup-status; separately plan AWOOOI API/Web topology spread before moving services from 110/188 to 120/121.
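
The recurring distinction in these entries between "service green" and "DR complete" can be sketched from the `backup-status` summary fields. The keys `core_blockers` and `escrow_missing` and the state name `SERVICE_GREEN_DR_ESCROW_BLOCKED` appear in this workplan; the `NOT_GREEN` and `DR_COMPLETE` labels and the function names are assumptions for illustration.

```python
from typing import Dict


def parse_status_summary(summary: str) -> Dict[str, str]:
    """Parse 'key=value' tokens separated by commas and/or whitespace."""
    pairs: Dict[str, str] = {}
    for token in summary.replace(",", " ").split():
        if "=" in token:
            key, value = token.split("=", 1)
            pairs[key] = value
    return pairs


def recovery_state(cold_start_green: bool, summary: str) -> str:
    """Combine the cold-start result with backup core and escrow counters."""
    fields = parse_status_summary(summary)
    core_blockers = int(fields["core_blockers"])
    escrow_missing = int(fields["escrow_missing"])
    if not cold_start_green or core_blockers > 0:
        return "NOT_GREEN"  # assumed label
    if escrow_missing > 0:
        return "SERVICE_GREEN_DR_ESCROW_BLOCKED"
    return "DR_COMPLETE"  # assumed label
```

For example, the 2026-06-12 18:57 readback (`core_blockers=0`, `escrow_missing=5`, cold-start green) lands in the escrow-blocked state under this sketch, which is why DR is not reported complete.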

10. Completion Claims That Are Not Allowed Yet

  • Do not claim every future reboot is guaranteed green. This run is green for the latest verified evidence set only.
  • Do not silence credential escrow alerts. They are the remaining correct DR red light.
  • Do not claim DR scorecard complete. Credential escrow markers are missing.
  • Do not claim public-route success is system success. Route checks must be paired with DB, backup, schedules, Alertmanager, and cold-start scorecard evidence.
  • Do not claim the next real Google Drive import has succeeded until the post-import row counts/date bounds and Drive archive movement are rechecked.
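
The rule that public-route success must be paired with other evidence can be expressed as a simple all-or-nothing check. This is a sketch only; the evidence key names are hypothetical labels for the categories listed above.

```python
from typing import Dict, List, Tuple

# Evidence categories named in the list above; key names are illustrative.
REQUIRED_EVIDENCE = (
    "public_routes",
    "db",
    "backup",
    "schedules",
    "alertmanager",
    "cold_start_scorecard",
)


def system_success(evidence: Dict[str, bool]) -> Tuple[bool, List[str]]:
    """Return (ok, missing); ok is True only if every category has passing evidence."""
    missing = [key for key in REQUIRED_EVIDENCE if not evidence.get(key, False)]
    return (not missing, missing)
```

A call with only `public_routes` set to true returns `False` together with the five categories that still lack evidence, so a green route check alone cannot be reported as system success.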