BACKUP-STATUS.md: Backup Status Overview

  • 2026-04-05 Claude Code: chief-architect full inventory covering all services, full automation, and the alerting mechanism.
  • Backup center 192.168.0.110 (/backup/): Restic + latest-only retention + Google Drive/rclone offsite mirror.
  • 2026-06-12 Codex pre-reboot refresh: 110 cron / Google Drive rclone / Alertmanager / credential escrow / cold-start scorecard rechecked.
  • 2026-06-12 Codex post-reboot refresh: 110 recovered, offsite latest-only verified, and remaining red gates narrowed to 120 config capture plus credential escrow evidence.
  • 2026-06-12 Codex post-120 recovery refresh: 120 restored, backup aggregate / offsite / full cold-start green; DR still blocked only by credential escrow evidence.
  • 2026-06-13 Codex live refresh: backup core remains green; DR still blocked only by credential escrow evidence.
  • 2026-06-13 Codex post-CD refresh: backup/offsite/alert contracts remain green after deploy marker e4a349bc; global SSH trust guardrail held; DR still blocked only by credential escrow evidence.
  • 2026-06-13 Codex escrow refresh: 13:10 live report confirms offsite/rclone/script readiness is green and only five non-secret credential escrow evidence markers remain missing.
  • 2026-06-18 Codex cold-start refresh: full-stack service readiness is green after stale failed Job classification; backup core remains green; DR still blocked only by five credential escrow evidence markers.
  • 2026-06-24 Codex Velero/exporter refresh: 188 MinIO / Velero backup freshness, 188 PostgreSQL / Redis exporters, 188 node-exporter, and 110 disk pressure are recovered; DR still blocked only by five credential escrow evidence markers, and service full-green is blocked by MOMO data freshness.
  • 2026-06-24 22:17 Codex backup readback: 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, integrity_stale=0, offsite_fresh=1, rclone_gdrive_fresh=1, escrow_missing=5; MOMO import-boundary fix is production-deployed, but full-stack remains blocked by MOMO data freshness.
  • 2026-06-24 22:40 Codex MOMO source readback: scheduler / DB / import metadata confirm the full-stack blocker is missing upstream source data, not backup freshness; no manual import or Drive write was performed.
  • 2026-06-24 23:04 Codex cold-start gate refresh: repo-side v1.42 dry-run now emits MOMO source-absence evidence and blocks with "188 momo source file absent while daily sales data stale"; backup/offsite remains green and live 110 script deployment is not claimed.
  • 2026-06-24 23:15 Codex live-sync gate readback: read-only deploy parity check correctly blocks because repo cold-start hash f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05 differs from 110 live hash 10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8; installer remains a live write requiring explicit approval.
  • 2026-06-24 23:33 Codex backup readback: 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, integrity_stale=0, offsite_fresh=1, rclone_gdrive_fresh=1, escrow_missing=5; live cold-start still blocks only on MOMO source absence / data freshness, not backup.
  • 2026-06-25 09:05 Codex backup readback: 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, integrity_stale=0, offsite_fresh=1, rclone_gdrive_fresh=1, escrow_missing=5; live cold-start is PASS=87 WARN=1 BLOCKED=1 because MOMO business data is stale and MOMO Google Drive token metadata is missing.
  • 2026-06-25 09:37 Codex MOMO deploy readback: MOMO main is e137d7a5d02a7595a44c3f3cc1cf54b766424ee7, Gitea Actions cd.yaml #910 succeeded, and 188 host / momo-scheduler source now fail closed on Google Drive auth/API failure. Backup/offsite remains green; full-stack still blocks on MOMO data freshness and escrow_missing=5.
  • 2026-06-25 10:23 Codex MOMO fail-closed live proof: the 10:04 scheduler run recorded the Google Drive auth failure as "❌ 自動匯入失敗" (automatic import failed) and sent the Telegram failure notification successfully; cold-start remains PASS=87 WARN=1 BLOCKED=1 because business data is stale beyond 3 days and Drive token metadata/writeback is not confirmed. Backup/offsite remains green and escrow_missing=5 remains the DR blocker.
  • 2026-06-25 10:35 Codex route / DB / backup refresh: direct public routes for AWOOOI API, IwoooS, VibeWork, AwoooGo, MOMO health, Stock, and Bitan are 200; backup remains 110 13/13 and 188 2/2 fresh; MOMO daily and monthly DB bounds still stop at 2026-06-17; latest import job remains 56 completed.
  • 2026-06-25 19:17 Codex latest recovery readback: post-start quick check is FULL_STACK_GREEN_DR_ESCROW_BLOCKED; 110 backup 13/13 fresh failed=0, 188 backup 2/2 fresh failed=0, core_blockers=0, integrity_stale=0, offsite_fresh=1, rclone_gdrive_fresh=1; MOMO data freshness is recovered through 2026-06-24; DR still blocked only by escrow_missing=5.
  • 2026-06-25 19:35 Codex product-data gate refresh: backup/offsite remains green, but overall "all products/data latest" is blocked by StockPlatform /api/v1/system/freshness (core_margin_short_daily_missing, ai_recommendations_stale). This is not a backup failure; keep escrow_missing=5 as the DR blocker and Stock freshness as a separate product-data blocker.
  • 2026-06-25 20:11 Codex StockPlatform cron-source recovery: StockPlatform Gitea/live source is now fb91aa4c6272469d1d26e0820169629eac17d28a; six missing production cron entrypoints are restored; natural cron runs for source remediation, market index, price, margin, chips, and AI no longer fail from missing files. Backup/offsite remains green. Stock freshness still blocks because official 2026-06-25 margin-short data is pending and AI recommendations correctly stay on 2026-06-24; this is still not a backup or restore incident.
  • 2026-06-25 20:25 Codex 110 CPU cleanup: two orphan StockPlatform headless Chrome process groups were cleared by targeted approved SIGTERM; no Docker/systemd/Nginx/K8s/DB/backup write occurred. Backup/offsite remains green, DR still blocked by escrow_missing=5, and Stock freshness remains the only hard product-data blocker.
  • 2026-06-25 21:14 Codex full wrapper refresh: StockPlatform 21:00 intelligence-sync and 21:10 AI pipeline naturally caught up; /api/v1/system/freshness is status=ok with blockers []. Backup/offsite remains 110 13/13 and 188 2/2 fresh, core_blockers=0, offsite_fresh=1, rclone_gdrive_fresh=1; full-stack service/data result is FULL_STACK_GREEN_DR_ESCROW_BLOCKED, with only escrow_missing=5 blocking DR complete.
  • 2026-06-26 06:28 Codex next-day backup readback: 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, integrity_stale=0, offsite_fresh=1, rclone_gdrive_fresh=1, last_backup_all=2026-06-26 02:31:02, escrow_missing=5; full-stack service/data result remains FULL_STACK_GREEN_DR_ESCROW_BLOCKED.
  • 2026-06-27 00:56 Codex backup core recovery: 188 momo_pg_daily was fresh but temporarily false-blocked by cron/config drift (configured_missing_188=1). The 188 crontab was backed up to /home/ollama/momo_backups/crontab-before-momo-pg-host-owned-20260627-001925.txt, the daily MOMO PostgreSQL backup entry was restored to host-owned /home/ollama/bin/momo-pg-backup.sh, and the exporter now reports awoooi_backup_job_configured{host="188",job="momo_pg_daily"} 1. backup-status now reports 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, configured_missing_188=0, integrity_stale=0, offsite_fresh=1, rclone_gdrive_fresh=1, escrow_missing=5; DR still blocked only by credential escrow evidence.
  • 2026-06-27 02:42 Codex post-reboot revalidation: post-reboot-readiness-summary.sh remains FULL_STACK_GREEN_DR_ESCROW_BLOCKED with SERVICE_GREEN=1, PRODUCT_DATA_GREEN=1, BACKUP_CORE_GREEN=1, HOST_188_HYGIENE_BLOCKED=0, STOCK_FRESHNESS_STATUS=ok, and ESCROW_MISSING_COUNT=5. dr-offsite-operator-checklist.sh --check confirms CORE_COLD_START_GREEN=1, RECOVERY_STATE=CORE_READY_DR_OFFSITE_PENDING, live Prometheus awoooi_recovery_core_ready=1, and awoooi_recovery_dr_offsite_ready=0.
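The readiness label quoted in the log entries can be read as a function of the summary flags. The following is a minimal sketch of that reading, not the logic of post-reboot-readiness-summary.sh itself: the function name, the `NOT_GREEN` label, and the `FULL_STACK_GREEN_DR_COMPLETE` label are hypothetical, and only the flag names and `FULL_STACK_GREEN_DR_ESCROW_BLOCKED` come from the log.

```python
# Hypothetical sketch: derive a readiness label from summary flags.
# Flag names come from the runbook; the decision rule is an assumption.
CORE_FLAGS = ("SERVICE_GREEN", "PRODUCT_DATA_GREEN", "BACKUP_CORE_GREEN")


def classify_readiness(summary: dict) -> str:
    """Return a readiness label; missing flags are treated as not green."""
    if any(summary.get(flag) != 1 for flag in CORE_FLAGS):
        return "NOT_GREEN"  # hypothetical label, not from the runbook
    # Fail closed: an absent escrow count is treated the same as a gap.
    if summary.get("ESCROW_MISSING_COUNT") != 0:
        return "FULL_STACK_GREEN_DR_ESCROW_BLOCKED"
    return "FULL_STACK_GREEN_DR_COMPLETE"  # hypothetical label


if __name__ == "__main__":
    print(classify_readiness({
        "SERVICE_GREEN": 1,
        "PRODUCT_DATA_GREEN": 1,
        "BACKUP_CORE_GREEN": 1,
        "ESCROW_MISSING_COUNT": 5,
    }))
```

Treating a missing `ESCROW_MISSING_COUNT` as blocked keeps the sketch consistent with the rule that service and backup health can be green while DR stays open.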


2026-06-27 00:56 Backup / Offsite / Escrow Live Status

Read-only and minimal-write evidence sources: 00:56 /backup/scripts/backup-status.sh --no-notify --no-refresh from 110, 188 crontab backup / controlled MOMO backup path correction, 188 textfile exporter refresh, post-start quick check at 00:57, and Prometheus recovery recording-rule readback.

  • 110 backup health: 13/13 fresh failed=0
  • 188 backup health: 2/2 fresh failed=0
  • Integrity / configured blockers: core_blockers=0, configured_missing_110=0, configured_missing_188=0, script_missing_110=0, script_missing_188=0, integrity_stale=0
  • 188 MOMO backup config drift fix: crontab rollback file /home/ollama/momo_backups/crontab-before-momo-pg-host-owned-20260627-001925.txt; active cron now uses /home/ollama/bin/momo-pg-backup.sh; exporter reports awoooi_backup_job_configured{host="188",job="momo_pg_daily"} 1
  • Offsite / GDrive freshness: offsite_configured=1, offsite_fresh=1, rclone_gdrive_configured=1, rclone_gdrive_fresh=1
  • Last aggregate backup: 2026-06-26 02:31:02
  • Prometheus recovery rules: awoooi_recovery_core_ready=1, awoooi_recovery_dr_offsite_ready=0
  • DR blocker remains: escrow_missing=5. Do not forge evidence markers, and do not paste any secret value, hash, or partial token.
  • Full-stack service state: FULL_STACK_GREEN_DR_ESCROW_BLOCKED. Post-start quick check PASS=38 WARN=3 BLOCKED=0; StockPlatform freshness status=ok; MOMO daily freshness 2|2026-06-24.

| Gate | Status | Evidence |
|---|---|---|
| 110 backup freshness | VERIFIED | 13/13 fresh, failed count 0. |
| 188 backup freshness | VERIFIED | 2/2 fresh, failed count 0. |
| 188 MOMO backup cron/config | VERIFIED | Active crontab uses /home/ollama/bin/momo-pg-backup.sh; configured_missing_188=0. |
| Offsite / GDrive freshness | VERIFIED | offsite_fresh=1, rclone_gdrive_fresh=1. |
| Backup core blockers | GREEN | core_blockers=0; Prometheus awoooi_recovery_core_ready=1. |
| Full-stack service state | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | POST_START_QUICK_CHECK PASS=38 WARN=3 BLOCKED=0; service/data/backup core green. |
| Credential escrow | BLOCKED | escrow_missing=5; only real non-secret owner evidence may close this. |
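The exporter line quoted above uses the Prometheus text exposition format, so the configured-job check can be reproduced offline from a textfile. This is a simplified sketch under stated assumptions: `parse_sample` and `job_configured` are hypothetical helper names, and the parser ignores timestamps and escaped quotes inside label values.

```python
import re

# Simplified matcher for one exposition-format sample line:
#   metric_name{label="value",...} 1
_SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line: str):
    """Split a sample line into (name, labels, value)."""
    match = _SAMPLE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(_LABEL.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


def job_configured(lines, host: str, job: str) -> bool:
    """True only if the exporter reports the job as configured (value 1)."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, labels, value = parse_sample(line)
        if (name == "awoooi_backup_job_configured"
                and labels.get("host") == host
                and labels.get("job") == job):
            return value == 1.0
    return False  # metric absent: treat as not configured
```

Returning `False` when the metric is absent mirrors the `configured_missing_188=1` state described above, where a fresh backup was still flagged because the configuration signal was missing.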

2026-06-26 06:28 Backup / Offsite / Escrow Live Status

Read-only evidence sources: 06:26 / 06:28 post-start-quick-check.sh, delegated /backup/scripts/backup-status.sh --no-notify --no-refresh, route-only wrapper retry validation, and direct StockPlatform / MOMO freshness readback.

  • 110 backup health: 13/13 fresh failed=0
  • 188 backup health: 2/2 fresh failed=0
  • Integrity / configured blockers: core_blockers=0, configured_missing_110=0, configured_missing_188=0, script_missing_110=0, script_missing_188=0, integrity_stale=0
  • Offsite / GDrive freshness: offsite_configured=1, offsite_fresh=1, rclone_gdrive_configured=1, rclone_gdrive_fresh=1
  • Last aggregate backup: 2026-06-26 02:31:02
  • DR blocker remains: escrow_missing=5. Do not forge evidence markers, and do not paste any secret value, hash, or partial token.
  • Full-stack service state: FULL_STACK_GREEN_DR_ESCROW_BLOCKED. Cold-start PASS=89 WARN=0 BLOCKED=0; StockPlatform freshness status=ok; MOMO daily freshness 1|2026-06-24.
  • Route note: the 06:26 full wrapper saw a one-time route 000 for IwoooS / VibeWork, but direct curl and the route-only wrapper immediately returned 200 and RESULT=GREEN; the v1.6 wrapper now retries routes before blocking.

| Gate | Status | Evidence |
|---|---|---|
| 110 backup freshness | VERIFIED | 13/13 fresh, failed count 0. |
| 188 backup freshness | VERIFIED | 2/2 fresh, failed count 0. |
| Offsite / GDrive freshness | VERIFIED | offsite_fresh=1, rclone_gdrive_fresh=1. |
| Backup core blockers | GREEN | core_blockers=0. |
| Full-stack service state | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | Cold-start PASS=89 WARN=0 BLOCKED=0; core wrapper PASS=15 WARN=2 BLOCKED=0; route-only wrapper PASS=31 WARN=0 BLOCKED=0. |
| Credential escrow | BLOCKED | escrow_missing=5; only real non-secret owner evidence may close this. |
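The route note describes retrying a route before declaring it blocked. A minimal sketch of that idea follows; it is not the v1.6 wrapper's code. The function name, the attempt count, and the delay are assumptions, and the status fetcher is injected so the logic can be exercised without network access (a curl-style `000` is represented as `0`).

```python
import time
from typing import Callable


def route_is_green(fetch_status: Callable[[], int],
                   attempts: int = 3,
                   delay_seconds: float = 0.0) -> bool:
    """Return True as soon as one attempt yields HTTP 200.

    fetch_status is a caller-supplied probe (hypothetical); it should
    return the HTTP status code, or 0 when no response was received.
    """
    for attempt in range(attempts):
        if fetch_status() == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)  # brief pause before the next try
    return False


if __name__ == "__main__":
    # One transient failure followed by success, as in the 06:26 run.
    statuses = iter([0, 200])
    print(route_is_green(lambda: next(statuses)))
```

A route is reported as blocked only after every attempt fails, so a single transient `000` no longer turns the whole wrapper red.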

2026-06-25 19:17 Backup / Offsite / Escrow Live Status

Read-only evidence sources: /backup/scripts/backup-status.sh --no-notify --no-refresh from 110 at 19:17 Asia/Taipei, plus 19:05 post-start quick check and 19:05-19:06 route stability readback.

  • 110 backup health: 13/13 fresh failed=0
  • 188 backup health: 2/2 fresh failed=0
  • Integrity / configured blockers: core_blockers=0, dr_warnings=5, configured_missing_110=0, configured_missing_188=0, script_missing_110=0, script_missing_188=0, integrity_stale=0
  • Offsite / GDrive freshness: offsite_configured=1, offsite_fresh=1, rclone_gdrive_configured=1, rclone_gdrive_fresh=1
  • Last aggregate backup: 2026-06-25 02:35:09
  • DR blocker remains: escrow_missing=5. Do not forge evidence markers, and do not paste any secret value, hash, or partial token.
  • Full-stack service state: FULL_STACK_GREEN_DR_ESCROW_BLOCKED. This means the service layer, routes, K3s, MOMO data freshness, and backup/offsite are green; it does not mean DR is complete.
  • MOMO DB readback from the 19:05 wrapper: daily_sales_snapshot=109061|2025-07-01|2026-06-24; DB_MONTHLY_SYNC 15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24; DB_DAILY_FRESHNESS 1|2026-06-24; latest import job 57 completed|即時業績_當日.xlsx|2026-06-25T13:16:47.359958|2026-06-25T13:18:02.964985|15383|15383|0

| Gate | Status | Evidence |
|---|---|---|
| 110 backup freshness | VERIFIED | 13/13 fresh, failed count 0. |
| 188 backup freshness | VERIFIED | 2/2 fresh, failed count 0. |
| Offsite / GDrive freshness | VERIFIED | offsite_fresh=1, rclone_gdrive_fresh=1. |
| Backup core blockers | GREEN | core_blockers=0. |
| MOMO data freshness | VERIFIED | `DB_DAILY_FRESHNESS 1` |
| Full-stack service state | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 21:14 POST_START_QUICK_CHECK PASS=38 WARN=2 BLOCKED=0; cold-start PASS=89 WARN=0 BLOCKED=0; StockPlatform freshness is OK, and only escrow_missing=5 blocks DR complete. |
| Credential escrow | BLOCKED | escrow_missing=5; only real non-secret owner evidence may close this. |

2026-06-25 20:11 StockPlatform Cron Source / Backup Boundary

Read-only and minimal-write evidence sources: StockPlatform Gitea / live source readback, one fast-forward git pull --ff-only origin main on 110 /home/wooo/stockplatform-v2, natural cron logs, ops.job_runs, and /api/v1/system/freshness.

  • Backup remains green from the 19:17 readback: 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, offsite_fresh=1, rclone_gdrive_fresh=1
  • DR blocker remains escrow_missing=5
  • StockPlatform source-version drift is repaired: live /home/wooo/stockplatform-v2 and Gitea main are fb91aa4c6272469d1d26e0820169629eac17d28a
  • Six previously missing production cron entrypoint scripts are present, and every scripts/ops/*.sh referenced by install-production-cron.sh exists on live source.
  • Natural cron evidence after source sync:
    • source-remediation-queue succeeded at 19:56 and 20:00.
    • market-index-ingestion succeeded at 20:00.
    • price-ingestion succeeded at 20:02.
    • margin-short-ingestion succeeded at 20:05 but official 2026-06-25 margin-short data remained pending, with row_count=0.
    • chips-ingestion succeeded at 20:06.
    • ai-recommendation-pipeline succeeded at the cron/job layer at 20:10 and correctly blocked internally on core_margin_short_daily_incomplete, official_margin_short_daily_official_pending.
  • Stock freshness remains separate from backup: /api/v1/system/freshness is still blocked with core_margin_short_daily_missing and ai_recommendations_stale.
  • No backup restore, manual DB restore, Docker restart, Nginx reload, K8s action, firewall change, or secret read was performed to address StockPlatform.

| Gate | Status | Evidence |
|---|---|---|
| Backup / offsite | VERIFIED | 19:17 backup readback remains green. |
| StockPlatform cron source | REPAIRED | Live and Gitea at fb91aa4c6272469d1d26e0820169629eac17d28a; missing entrypoints restored. |
| StockPlatform natural cron entrypoints | VERIFIED | 19:56-20:10 official schedule runs no longer fail with script_exit_127. |
| StockPlatform product data freshness | BLOCKED_EXTERNAL_SOURCE | Official 2026-06-25 margin-short source pending; AI recommendations stay stale by design. |
| Credential escrow | BLOCKED | escrow_missing=5; only real non-secret owner evidence may close this. |

2026-06-25 10:35 Backup / Offsite / Escrow Live Status

Read-only evidence sources: /backup/scripts/backup-status.sh --no-notify --no-refresh from 110 at 10:35 Asia/Taipei; scheduler log proof at 10:04; cold-start rerun at 10:35; public route and DB readback at 10:35.

  • 110 backup health: 13/13 fresh failed=0
  • 188 backup health: 2/2 fresh failed=0
  • Integrity / configured blockers: core_blockers=0, dr_warnings=5, configured_missing_110=0, configured_missing_188=0, script_missing_110=0, script_missing_188=0, integrity_stale=0
  • Offsite / GDrive freshness: offsite_configured=1, offsite_fresh=1, rclone_gdrive_configured=1, rclone_gdrive_fresh=1
  • Last aggregate backup: 2026-06-25 02:35:09
  • DR blocker remains: escrow_missing=5. Do not forge evidence markers, and do not paste any secret value, hash, or partial token.
  • Full-stack service release blocker remains separate: cold-start PASS=85 WARN=1 BLOCKED=1, because MOMO business data freshness is stale (MOMO_DAILY_FRESHNESS 8|2026-06-17) and Google Drive token metadata is missing / writeback is not confirmed. This is not a backup freshness failure.
  • MOMO code boundary now covers both failure modes: cd.yaml #904 makes monthly sync failure fail the import job and prevents Drive file movement; cd.yaml #910 makes Drive auth/API failure return success=false instead of a no-file success.
  • Live scheduler proof: at 2026-06-25 10:04, auto_import_task logs "Google Drive 認證失敗: could not locate runnable browser" (Google Drive authentication failed), then logs "❌ 自動匯入失敗" (automatic import failed) and sends the Telegram failure notification successfully. Therefore the alert is now a correct failure signal, not a heartbeat / no-file false green.
  • MOMO DB readback: daily_sales_snapshot=104614|2025-07-01|2026-06-17; current-month realtime_sales_monthly=10936|2026/06/01|2026/06/17; latest import job remains 56 completed with 10936/10936/0 and no newer successful daily-sales import by 10:35.

| Gate | Status | Evidence |
|---|---|---|
| 110 backup freshness | VERIFIED | 13/13 fresh, failed count 0. |
| 188 backup freshness | VERIFIED | 2/2 fresh, failed count 0. |
| Offsite / GDrive freshness | VERIFIED | offsite_fresh=1, rclone_gdrive_fresh=1. |
| Backup core blockers | GREEN | core_blockers=0. |
| Credential escrow | BLOCKED | escrow_missing=5; only real non-secret owner evidence may close this. |
| MOMO Drive token metadata | WARN | Host and container token metadata paths are missing; no token content was read. |
| MOMO Drive auth false-green | FIX_DEPLOYED_AND_LIVE_PROVEN | Gitea Actions cd.yaml #910 success; 188 host and scheduler container source include fail-closed marker; 10:04 scheduler cycle failed closed and sent failure notification. |
| Service full green | NO-GO | Blocked by MOMO source absence / data freshness; token metadata warning also requires owner-gated evidence. |
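The fail-closed boundary described for cd.yaml #910 separates "the Drive listing failed" from "the Drive listing succeeded and was empty". The sketch below illustrates that distinction only; `DriveAuthError`, `ImportResult`, `run_import`, and `broken_drive` are hypothetical names, and the treatment of a genuinely empty folder as a success is an assumption rather than confirmed MOMO behaviour.

```python
from dataclasses import dataclass
from typing import Callable, List


class DriveAuthError(Exception):
    """Hypothetical error raised when Drive auth or the API call fails."""


@dataclass
class ImportResult:
    success: bool
    file_count: int
    message: str


def run_import(list_files: Callable[[], List[str]]) -> ImportResult:
    """Fail closed: a listing error must never look like 'no files'."""
    try:
        files = list_files()
    except DriveAuthError as exc:
        return ImportResult(False, 0, f"Drive auth/API failure: {exc}")
    if not files:
        # Assumption: a reachable but empty folder is not an error.
        return ImportResult(True, 0, "no files to import")
    return ImportResult(True, len(files), "files listed for import")


def broken_drive() -> List[str]:
    """Demo stub that mimics the auth failure seen in the scheduler log."""
    raise DriveAuthError("could not locate runnable browser")


if __name__ == "__main__":
    print(run_import(broken_drive))
```

With this shape, a notifier keyed on `success` raises a failure alert for the auth case instead of emitting a heartbeat-style "no file" message.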

2026-06-24 23:33 Backup / Offsite / Escrow Live Status

Read-only command: /backup/scripts/backup-status.sh --no-notify --no-refresh from 110 at 23:33 Asia/Taipei.

  • 110 backup health: 13/13 fresh failed=0
  • 188 backup health: 2/2 fresh failed=0
  • Integrity / configured blockers: core_blockers=0, dr_warnings=5, configured_missing_110=0, configured_missing_188=0, script_missing_110=0, script_missing_188=0, integrity_stale=0
  • Offsite / GDrive freshness: offsite_configured=1, offsite_fresh=1, rclone_gdrive_configured=1, rclone_gdrive_fresh=1
  • Last aggregate backup: 2026-06-24 02:28:39
  • DR blocker remains: escrow_missing=5. Do not forge evidence markers, and do not paste any secret value, hash, or partial token.
  • Full-stack service release blocker remains separate: cold-start PASS=88 WARN=0 BLOCKED=1, because of "188 momo source file absent while daily sales data stale" / MOMO_DAILY_FRESHNESS 7|2026-06-17. This is not a backup freshness failure.

| Gate | Status | Evidence |
|---|---|---|
| 110 backup freshness | VERIFIED | 13/13 fresh, failed count 0. |
| 188 backup freshness | VERIFIED | 2/2 fresh, failed count 0. |
| Offsite / GDrive freshness | VERIFIED | offsite_fresh=1, rclone_gdrive_fresh=1. |
| Backup core blockers | GREEN | core_blockers=0. |
| Credential escrow | BLOCKED | escrow_missing=5; only real non-secret owner evidence may close this. |
| Service full green | NO-GO | Blocked by MOMO source absence / data freshness, not by backup. |
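Values such as `MOMO_DAILY_FRESHNESS 7|2026-06-17` pair a count with the latest data date. The sketch below parses that value and applies a staleness gate. It assumes the first field is the reported number of stale days and uses a 3-day limit taken from the "stale beyond 3 days" wording in the log; both function names are hypothetical, and the count is read as reported rather than recomputed from the date.

```python
from datetime import date
from typing import Tuple


def parse_daily_freshness(value: str) -> Tuple[int, date]:
    """Parse '<days>|<YYYY-MM-DD>' into (reported_days, latest_date)."""
    days_text, date_text = value.split("|", 1)
    return int(days_text), date.fromisoformat(date_text)


def freshness_blocks(value: str, max_stale_days: int = 3) -> bool:
    """True when the reported staleness exceeds the allowed limit."""
    reported_days, _latest = parse_daily_freshness(value)
    return reported_days > max_stale_days


if __name__ == "__main__":
    for sample in ("7|2026-06-17", "1|2026-06-24"):
        print(sample, "blocked" if freshness_blocks(sample) else "ok")
```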

2026-06-24 22:17 Backup / Offsite / Escrow Live Status

Read-only command: /backup/scripts/backup-status.sh --no-notify --no-refresh from 110 at 22:17 Asia/Taipei.

  • 110 backup health: 13/13 fresh failed=0
  • 188 backup health: 2/2 fresh failed=0
  • Integrity / configured blockers: core_blockers=0, dr_warnings=5, configured_missing_110=0, configured_missing_188=0, script_missing_110=0, script_missing_188=0
  • Offsite / GDrive freshness: offsite_configured=1, offsite_fresh=1, rclone_gdrive_configured=1, rclone_gdrive_fresh=1
  • Last aggregate backup: 2026-06-24 02:28:39
  • DR blocker remains: escrow_missing=5. Do not forge evidence markers, and do not paste any secret value, hash, or partial token.
  • Full-stack service release blocker remains separate: cold-start PASS=86 WARN=0 BLOCKED=1, because of MOMO_DAILY_FRESHNESS 7|2026-06-17. This is not a backup freshness failure.
  • MOMO code boundary is now production-deployed: Gitea Actions cd.yaml #904 succeeded at commit 84035906aba0e5e190d031a13cfd9b47a8cd1f73; 188 live source marker confirms monthly sync failure now fails the job and prevents Drive file movement.

| Gate | Status | Evidence |
|---|---|---|
| 110 backup freshness | VERIFIED | 13/13 fresh, failed count 0. |
| 188 backup freshness | VERIFIED | 2/2 fresh, failed count 0. |
| Offsite / GDrive freshness | VERIFIED | offsite_fresh=1, rclone_gdrive_fresh=1. |
| Backup core blockers | GREEN | core_blockers=0. |
| Credential escrow | BLOCKED | escrow_missing=5; only real non-secret owner evidence may close this. |
| MOMO code boundary | VERIFIED | cd.yaml #904 success and 188 live import-service marker. |
| Service full green | NO-GO | Blocked by MOMO data freshness, not by backup. |

2026-06-24 22:40 MOMO source absence clarification:

  • Scheduler auto_import_task recent runs report file_count=0 and imported_count=0 for the expected Drive folder / pattern.
  • import_config remains gdrive_folder_path=當日業績匯入 and gdrive_file_pattern=即時業績_當日.
  • Latest valid import job 56 already completed with sync_success=true and bounds 2026-06-01..2026-06-17.
  • Repo-side cold-start v1.42 dry-run emits MOMO_SOURCE_EMPTY_EVIDENCE_LINES 21, MOMO_IMPORT_CONFIG 當日業績匯入|即時業績_當日, and MOMO_LATEST_IMPORT_JOB 56|completed|即時業績_當日.xlsx|2026-06-18T11:41:00.853176|2026-06-18T11:42:02.309425|10936|10936|0, and keeps the only hard blocker as source absence.
  • 110 live monitor deployment is intentionally not claimed: verify-cold-start-monitor-deploy.sh reports repo hash f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05 vs live hash 10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8.
  • Therefore backup/offsite remains green while service full-green remains blocked by business data source absence. Do not run backup restore or DB restore to solve this symptom.
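The deploy parity check above compares a repo-side hash with the live one. The following sketch shows that comparison in isolation; it is not verify-cold-start-monitor-deploy.sh. It assumes the 64-hex-character values are SHA-256 digests of the script contents, and all function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def file_sha256(path: Path) -> str:
    """Hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hashes_match(repo_hash: str, live_hash: str) -> bool:
    """Parity holds only when both digests are identical."""
    return repo_hash.strip().lower() == live_hash.strip().lower()


if __name__ == "__main__":
    repo = "f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05"
    live = "10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8"
    # The two hashes recorded in this runbook differ, so parity fails
    # and live deployment must not be claimed.
    print("parity" if hashes_match(repo, live) else "drift: do not claim deploy")
```

The check is read-only; installing the repo version on 110 remains a separate live write that needs explicit approval.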

2026-06-24 21:33 Backup / Offsite / Escrow Live Status

Read-only command: /backup/scripts/backup-status.sh --no-notify --no-refresh from 110 at 21:33 Asia/Taipei.

  • 110 backup health: 13/13 fresh failed=0
  • 188 backup health: 2/2 fresh failed=0
  • Integrity / configured blockers: core_blockers=0, configured_missing_110=0, configured_missing_188=0, script_missing_110=0, script_missing_188=0
  • Offsite / GDrive freshness: offsite_configured=1, offsite_fresh=1, rclone_gdrive_configured=1, rclone_gdrive_fresh=1
  • Last aggregate backup: 2026-06-24 02:28:39
  • DR blocker remains: escrow_missing=5. Do not forge evidence markers, and do not paste any secret value, hash, or partial token.
  • Full-stack service release blocker remains separate: cold-start PASS=86 WARN=0 BLOCKED=1, because of MOMO_DAILY_FRESHNESS 7|2026-06-17. This is not a backup freshness failure. The MOMO code version is already V10.653, but the business data has not yet caught up to today.

| Gate | Status | Evidence |
|---|---|---|
| 110 backup freshness | VERIFIED | 13/13 fresh, failed count 0. |
| 188 backup freshness | VERIFIED | 2/2 fresh, failed count 0. |
| Offsite / GDrive freshness | VERIFIED | offsite_fresh=1, rclone_gdrive_fresh=1. |
| Backup core blockers | GREEN | core_blockers=0. |
| Credential escrow | BLOCKED | escrow_missing=5; only real non-secret owner evidence may close this. |
| Service full green | NO-GO | Blocked by MOMO data freshness, not by backup. |

2026-06-24 Velero / Exporter / Disk-Pressure Live Status

2026-06-24 06:35 refresh:

  • 110 backup health remains fresh: 13 configured jobs, stale 0, failed count 0, config failed 0
  • 188 backup health remains fresh: 2 configured jobs, stale 0, missing cron/script 0
  • 188 node-exporter textfile scrape is restored: Prometheus up{job="node-exporter-188"}=1 and awoooi_backup_health_monitor_up{host="188"}=1
  • 188 PostgreSQL exporter and Redis exporter are restored: local metrics pg_up=1 / redis_up=1; Prometheus sees up{job="postgres-exporter"}=1, pg_up=1, up{job="redis-exporter"}=1, redis_up=1
  • 188 MinIO endpoint is healthy on 192.168.0.188:9000; 120 Velero BackupStorageLocation/default is Available
  • One-off Velero backup reboot-recovery-202606240456 completed successfully; 110 backup-health textfile reports awoooi_velero_latest_completed_backup_fresh=1
  • VeleroBackupNotRun, BackupHealthMonitorMissing188, PostgreSQLDown, RedisDown, and 110 disk-pressure alerts are resolved.
  • 110 / disk use is reduced from 92% to 73% after Docker image/build-cache cleanup only. Docker volume prune remains forbidden without explicit owner approval.
  • Credential escrow readback remains blocked: ESCROW_MISSING_COUNT=5
  • Full service green is still blocked by MOMO business data freshness: MOMO_DAILY_FRESHNESS 6|2026-06-17

| Gate | Status | Evidence |
|---|---|---|
| 110 backup freshness | VERIFIED | 13/13 fresh, failed count 0, config failed 0. |
| 188 backup freshness | VERIFIED | 2/2 fresh, node-exporter scrape and textfile metrics restored. |
| Velero / MinIO storage | VERIFIED | MinIO health OK, BSL Available, backup reboot-recovery-202606240456 Completed, freshness metric 1. |
| PostgreSQL / Redis exporters | VERIFIED | pg_up=1, redis_up=1, Prometheus scrape up=1 for both exporters. |
| Alert chain | VERIFIED_WITH_EXPECTED_REDLIGHTS | Exporter/Velero/disk alerts resolved; escrow missing and MOMO freshness blocker remain visible. |
| Credential escrow | BLOCKED | Five non-secret evidence markers still missing. |
| DR closeout | NO-GO | Must not be declared complete until real owner-provided non-secret evidence IDs are validated and markers are written. |

Operational helpers:

# 188 PostgreSQL / Redis exporters
ssh ollama@192.168.0.188 'bash /home/ollama/bin/188-db-exporters-restore.sh'

# 188 MinIO / 120 Velero BSL readback
ssh wooo@192.168.0.110 '/home/wooo/scripts/188-minio-velero-restore.sh'

# Maintenance-window one-off Velero backup + backup-health textfile refresh
ssh wooo@192.168.0.110 'CREATE_VELERO_BACKUP=true REFRESH_BACKUP_HEALTH=true /home/wooo/scripts/188-minio-velero-restore.sh'

Current policy: restore backup and monitoring red lights first; do not silence VeleroBackupNotRun or exporter-down alerts. Healthy heartbeat success messages are suppressed separately and should not be confused with real backup/data/escrow alerts.


2026-06-18 Cold-Start / Backup Live Status

2026-06-18 13:43 refresh:

  • full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1: PASS=84 WARN=0 BLOCKED=0, result GREEN.
  • K8s schedule evidence: FAILED_JOBS=1, STALE_FAILED_JOBS=1, ACTIVE_FAILED_JOBS=0, BAD_PODS=0. The retained km-vectorize failed Job is historical evidence only; the active failed Job count is zero.
  • 110 backup health: total=13 stale=0 missing_cron=0 missing_script=0 failed_count=0 config_failed=0 integrity_total=2 integrity_stale=0
  • 188 backup health: total=2 stale=0 missing_cron=0 missing_script=0
  • Public routes / TLS, momo DB parity, 120 / 121 K3s readiness, and AWOOOI API/Web route checks all passed in the same cold-start run.
  • Credential escrow readback remains blocked: /backup/scripts/offsite-escrow-evidence-report.sh --no-color reports SCRIPT_MISSING_COUNT=0, OFFSITE_CONFIGURED=1, RCLONE_CONFIGURED=1, READINESS_REQUIRE_CONFIGURED_BLOCKED=0, ESCROW_MISSING_COUNT=5

| Gate | Status | Evidence |
|---|---|---|
| Full cold-start service readiness | GREEN | PASS=84 WARN=0 BLOCKED=0; stale failed Job evidence is separated from active failed Job blockers. |
| 110 backup freshness | VERIFIED | 13/13 fresh, failed count 0, config failed 0, integrity stale 0. |
| 188 backup freshness | VERIFIED | 2/2 fresh, missing cron/script 0. |
| Credential escrow | BLOCKED | Five non-secret evidence markers still missing: restic_repository_password, offsite_provider_credentials, break_glass_admin_credentials, dns_registrar_recovery, oauth_ai_provider_recovery. |
| DR closeout | NO-GO | Must not be declared complete until real owner-provided non-secret evidence IDs are validated and markers are written. |
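The K8s counters above separate retained historical failures from failures that should block. As a small worked illustration (the function names are hypothetical, and how a Job is classed as stale is not specified here), the blocking count is the failed count minus the stale count:

```python
def active_failed_jobs(failed_jobs: int, stale_failed_jobs: int) -> int:
    """Failed Jobs that are not classified as stale history."""
    if stale_failed_jobs > failed_jobs or min(failed_jobs, stale_failed_jobs) < 0:
        raise ValueError("stale count must be between 0 and the failed count")
    return failed_jobs - stale_failed_jobs


def jobs_block_cold_start(failed_jobs: int, stale_failed_jobs: int) -> bool:
    """Only active failed Jobs block; stale ones are evidence only."""
    return active_failed_jobs(failed_jobs, stale_failed_jobs) > 0


if __name__ == "__main__":
    # FAILED_JOBS=1, STALE_FAILED_JOBS=1 from the 13:43 refresh.
    print(active_failed_jobs(1, 1), jobs_block_cold_start(1, 1))
```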

Current policy: service recovery and backup health can be green while DR is still blocked. Do not fake escrow markers, do not paste secrets into repo/chat, and do not silence escrow alerts.


2026-06-13 Post-CD Live Status

2026-06-13 01:26 / 01:28 refresh:

  • /backup/scripts/backup-status.sh --no-notify: 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, integrity_stale=0, offsite_fresh=1, rclone_gdrive_fresh=1, core_blockers=0, escrow_missing=5, last aggregate 2026-06-12 15:54:40
  • /home/wooo/node_exporter_textfiles/offsite_full_sync_verify.prom: awoooi_backup_offsite_remote_verify_ok=1, awoooi_backup_offsite_full_verify_fresh=1; all 13 repos report snapshot_count=1 and snapshot_latest_only=1.
  • backup-alert-live-visibility-check.py: Prometheus and Alertmanager both show 5 active/firing escrow gap alerts.
  • Prometheus rules API: BackupConfigCapturePartial, BackupAggregateRunFailed, BackupCredentialEscrowEvidenceMissing, ColdStartRecoveryBlocked, and ColdStartHost120Unreachable all exist with health ok; currently only the escrow gap rule is firing, as expected, and the rest are inactive.
  • backup-alert-label-contract-check.py: the local ops/monitoring/alerts-unified.yml is aligned with the live Prometheus label contract; 24 baseline backup alert rules are loaded.
  • This means the backup core is still green; the remaining red light is still DR credential escrow evidence, not a backup script or offsite sync failure.

| Gate | Status | Evidence |
|---|---|---|
| 110 backup cron | VERIFIED | Live crontab still has 02:00 backup-all, 03:00 sync-offsite-backups --mode sync, 06:05 backup-status, 07:20 verify-offsite-full-sync; success is summarized once daily and not sent as noisy Telegram heartbeat. |
| Backup freshness | VERIFIED | 2026-06-13 01:26 status shows 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0; last aggregate backup completed 2026-06-12 15:54:40. |
| 188 momo backup cron/exporter contract | VERIFIED | 188 crontab now runs /home/ollama/bin/momo-pg-backup.sh; exporter reports awoooi_backup_job_configured{host="188",job="momo_pg_daily"} 1, so configured_missing_188=0. |
| Google Drive/rclone remote latest-only | VERIFIED | 2026-06-13 01:28 textfile confirms all 13 remote repos have snapshot_count=1 and snapshot_latest_only=1; latest scheduled verifier log at 2026-06-12 07:20 returned REMOTE_LATEST_ONLY_OK=1, FULL_MARKER_FRESH=1, VERIFY_OK=1, FAILED=0. |
| Offsite gate marker | VERIFIED | /backup/offsite/enable-rclone-sync present; full marker fresh and verifier wrote /home/wooo/node_exporter_textfiles/offsite_full_sync_verify.prom. |
| Backup alert rules | VERIFIED | 2026-06-13 01:27 live check confirms Prometheus and Alertmanager expose the active BackupCredentialEscrowEvidenceMissing gap alerts for the five missing items; Prometheus rules API has all five required alert names healthy; label contract check confirms all baseline backup alert rules are loaded. |
| Backup aggregate health | VERIFIED | 2026-06-12 15:54 /backup/scripts/backup-all.sh completed 13/13 successfully; Configs captured 120 / 121 / K8s workloads / K8s secrets / Velero from source 192.168.0.120; 18:55 core_blockers=0. |
| Credential escrow | BLOCKED | Five evidence markers missing. Only write non-secret marker evidence with /backup/scripts/mark-credential-escrow-verified.sh. |
| Config backup capture | VERIFIED | 2026-06-12 15:54 Configs backup succeeded for 120-k3s-host-configs, 121-k3s-host-configs, cluster-k8s-workloads, cluster-k8s-secrets, and cluster-velero-backups; latest Configs snapshot bee9ae22. |
| Full cold-start | GREEN | 2026-06-13 01:26 read-only rerun: PASS=83 WARN=0 BLOCKED=0; result GREEN. |
| 110 -> 120 / 188 SSH trust | VERIFIED | Final trust repair backup /home/wooo/.ssh/known_hosts.before-120-188-final-refresh.20260613-011949; CD fix 80e6ec1a uses /home/wooo/.ssh/deploy_known_hosts; post-deploy marker e4a349bc did not clobber global known_hosts, and 120 / 188 entries remain present. |
| 120 console handoff | CLOSED | 120 root filesystem was repaired from console/initramfs with offline fsck, booted at 2026-06-12 15:13, SSH returned, root mounted rw, failed units 0, and K3s mon returned Ready. |
| 2026-06-05 manual backup remediation | VERIFIED with aggregate blocker | 18:40 status: stale110=none, stale188=none, configured_missing_188=0; manual snapshots: AWOOOI b7d5ee4e, Gitea ea641613, Open-WebUI d1147507, ClawBot 73ead3cc, AI artifacts b1161ab8. |
| 2026-06-06 credential escrow audit | BLOCKED | 15:03 report confirms scripts/config/rclone are present, but all five non-secret evidence markers are still missing; 15:06 safe dry-run checklist is documented below. |

Current policy: normal success should not create immediate Telegram noise. Failures and operator-action states must still alert; a single daily status summary runs at 06:05.

2026-06-13 post-CD closeout:

  • 110 / 120 / 121 / 188 public/core service recovery is green.
  • Backup aggregate, Google Drive/rclone offsite latest-only, and full cold-start are green after 120 recovery.
  • Latest normal CD deploy preserved API/Web workload balancing and did not break cold-start SSH trust.
  • Do not close DR scorecard until all five credential escrow evidence markers are written with non-secret evidence IDs.

Credential Escrow Evidence Checklist

2026-06-13 13:10 live refresh:

  • /backup/scripts/mark-credential-escrow-verified.sh --status: still missing restic_repository_password, offsite_provider_credentials, break_glass_admin_credentials, dns_registrar_recovery, and oauth_ai_provider_recovery.
  • /backup/scripts/offsite-escrow-evidence-report.sh --no-color: SCRIPT_MISSING_COUNT=0, OFFSITE_CONFIGURED=1, RCLONE_CONFIGURED=1, READINESS_REQUIRE_CONFIGURED_BLOCKED=0, ESCROW_MISSING_COUNT=5, SUMMARY PASS=8 WARN=5 BLOCKED=0.
  • Owner request package: CREDENTIAL-ESCROW-EVIDENCE-OWNER-REQUEST.md
  • Verdict: backup core and offsite readiness are green; DR closeout stays blocked until all five markers are written with real, non-secret evidence IDs.
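
The verdict above depends on two numbers in the report: ESCROW_MISSING_COUNT and the BLOCKED value in the SUMMARY. A minimal sketch of turning those into a closeout gate is shown below. It assumes the report prints whitespace-separated KEY=VALUE tokens, as in the readback above; report_value and dr_closeout_allowed are hypothetical helper names, not part of the deployed scripts.

```shell
#!/bin/sh
# Hypothetical helpers; the real report layout may differ from this assumption.

# Print the value of an exact KEY from whitespace-separated KEY=VALUE tokens on stdin.
report_value() {
  awk -v key="$1" '{
    for (i = 1; i <= NF; i++) {
      split($i, kv, "=")
      if (kv[1] == key) { print kv[2]; exit }
    }
  }'
}

# Succeed only when no escrow marker is missing and nothing is blocked.
# A key that is absent from the report counts as a failure (fail closed).
dr_closeout_allowed() {
  report=$(cat)
  missing=$(printf '%s\n' "$report" | report_value ESCROW_MISSING_COUNT)
  blocked=$(printf '%s\n' "$report" | report_value BLOCKED)
  [ "${missing:-1}" -eq 0 ] && [ "${blocked:-1}" -eq 0 ]
}

# Values copied from the 13:10 readback above.
sample='SCRIPT_MISSING_COUNT=0 OFFSITE_CONFIGURED=1 RCLONE_CONFIGURED=1 READINESS_REQUIRE_CONFIGURED_BLOCKED=0 ESCROW_MISSING_COUNT=5 SUMMARY PASS=8 WARN=5 BLOCKED=0'
if printf '%s\n' "$sample" | dr_closeout_allowed; then
  echo "DR closeout: allowed"
else
  echo "DR closeout: blocked"   # this branch runs: ESCROW_MISSING_COUNT=5
fi
```

Matching on the exact key keeps BLOCKED from being confused with READINESS_REQUIRE_CONFIGURED_BLOCKED.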

A credential escrow marker only proves that the recovery material has been manually verified and is retrievable; it must not contain any secret.

| Item | Acceptable evidence-id | Forbidden |
|---|---|---|
| restic_repository_password | Password manager item ID, sealed envelope ID, recovery checklist ID | Restic password, recovery code, secret URL |
| offsite_provider_credentials | Vault item ID for Google Drive/rclone or provider credential record | OAuth token, refresh token, application key |
| break_glass_admin_credentials | Break-glass credential record ID or sealed envelope ID | Admin password, SSH private key, OTP seed |
| dns_registrar_recovery | Registrar recovery checklist ID or vault item ID | Registrar password, recovery codes |
| oauth_ai_provider_recovery | Provider account recovery checklist ID or vault item ID | API key, token, client secret |
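
The five item names above, together with the placeholder rule in this checklist (EVIDENCE_ID_FOR_* and VAULT-ITEM-ID are rejected), can be checked on the operator side before the real script is called. The sketch below is illustrative only: precheck_escrow_marker and its helpers are hypothetical names, not part of mark-credential-escrow-verified.sh, and the check cannot tell whether a value is a secret, so it does not replace human verification or the script's own --dry-run.

```shell
#!/bin/sh
# Hypothetical operator-side pre-check; NOT the logic of the deployed script.

# Item names are taken from the checklist table.
is_allowed_item() {
  case "$1" in
    restic_repository_password)    return 0 ;;
    offsite_provider_credentials)  return 0 ;;
    break_glass_admin_credentials) return 0 ;;
    dns_registrar_recovery)        return 0 ;;
    oauth_ai_provider_recovery)    return 0 ;;
    *)                             return 1 ;;
  esac
}

# Empty values and the placeholder patterns named in this runbook are rejected.
is_placeholder_evidence_id() {
  case "$1" in
    ""|EVIDENCE_ID_FOR_*|VAULT-ITEM-ID) return 0 ;;
    *)                                  return 1 ;;
  esac
}

# Usage: precheck_escrow_marker <item> <evidence-id>
precheck_escrow_marker() {
  item=$1
  evidence_id=$2
  if ! is_allowed_item "$item"; then
    echo "REJECT unknown item: $item"
    return 1
  fi
  if is_placeholder_evidence_id "$evidence_id"; then
    echo "REJECT placeholder evidence-id for $item"
    return 1
  fi
  echo "OK $item"
}
```

For example, `precheck_escrow_marker restic_repository_password VAULT-ITEM-ID` prints a REJECT line and returns non-zero, so a wrapper can stop before anything is written.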

Safe flow after human verification:

# 1. Read current status; this does not expose secrets.
/backup/scripts/mark-credential-escrow-verified.sh --status

# 2. Validate the non-secret evidence-id first.
/backup/scripts/mark-credential-escrow-verified.sh \
  --item <allowed-item> \
  --evidence-id <non-secret-evidence-id> \
  --dry-run

# 3. Only after dry-run passes, write the marker.
/backup/scripts/mark-credential-escrow-verified.sh \
  --item <allowed-item> \
  --evidence-id <non-secret-evidence-id>

# 4. Recheck DR metric.
/backup/scripts/offsite-escrow-evidence-report.sh --no-color

<non-secret-evidence-id> must be an existing external reference. Placeholder values such as EVIDENCE_ID_FOR_* or VAULT-ITEM-ID are rejected and must not be written.


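Backup Overview (Fully Automated)

| # | Data type | Backup script | Schedule | Max data loss | Status |
|---|---|---|---|---|---|
| 1 | Gitea (DB + repositories) | backup-gitea.sh | Daily 02:00 | 24h | |
| 2 | MOMO PostgreSQL | backup-momo.sh | Daily 02:00 | 24h | |
| 3 | Harbor (Registry + DB) | backup-harbor.sh | Daily 02:00 | 24h | |
| 4 | AWOOOI PostgreSQL (full) | backup-awoooi.sh | Daily 02:00 | 6h | |
| 4h | AWOOOI PostgreSQL (high-frequency) | backup-awoooi-frequent.sh | 08:00 / 14:00 / 20:00 | 6h | |
| 5 | Langfuse (AI tracing/evaluation) | backup-langfuse.sh | Daily 02:00 | 24h | |
| 6 | Monitoring (Prometheus/Grafana/Alertmanager) | backup-monitoring.sh | Daily 02:00 | 24h | |
| 7 | SignOz (ClickHouse traces/logs) | backup-signoz.sh | Daily 02:00 | 24h | |
| 8 | Open-WebUI (LLM conversation history) | backup-open-webui.sh | Daily 02:00 | 24h | |
| 9 | ClawBot Redis (state/cache) | backup-clawbot.sh | Daily 02:00 | 24h | |
| - | K8s resources (all namespaces) | Velero + MinIO | Daily 02:00 | 24h | |

Backup orchestrator: /backup/scripts/backup-all.sh v3.0, which runs the 9 backups in a single pass.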


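Alerting

Backup failures and states that need operator action must be pushed to AwoooP / Telegram. Normal successes are not pushed immediately, to avoid flooding the channel; success state is carried by the daily 06:05 summary and the Prometheus/textfile evidence.

| Status | Severity | Telegram receives |
|---|---|---|
| success | info | No immediate message; included in the daily 06:05 backup status summary |
| warning | warning | ⚠️ Yellow warning |
| failed | critical | 🔴 Immediate alert |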
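
The routing in this table can be written as a small lookup. The sketch below is an illustration, not the code in common.sh: alert_route is a hypothetical name, and treating an unknown status as critical is an assumption chosen so that unexpected states are never silent.

```shell
#!/bin/sh
# Hypothetical mapping from backup status to "<severity> <delivery>".
alert_route() {
  case "$1" in
    success) echo "info daily-summary" ;;      # no immediate Telegram message
    warning) echo "warning immediate" ;;
    failed)  echo "critical immediate" ;;
    *)       echo "critical immediate" ;;      # assumption: unknown states fail loud
  esac
}

alert_route success   # prints: info daily-summary
alert_route failed    # prints: critical immediate
```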

Alert endpoint: http://192.168.0.188:8088/api/v1/webhook/custom
Test command:

source /backup/scripts/common.sh
notify_clawbot "failed" "backup-test" "test alert" 0

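Retention Policy

Since 2026-05-19, the local restic repos on 110, the MOMO file backups on 188, and the Google Drive/rclone offsite mirror follow a latest-only policy: once a new snapshot is created successfully, only the latest one is kept. The 2026-06-13 01:28 live textfile confirmed that each of the 13 Google Drive/rclone remote repos holds exactly one snapshot and that every latest-only metric is 1.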
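
A latest-only check against a node-exporter textfile can be sketched as follows. The metric naming is an assumption: the function treats any sample whose metric name ends in snapshot_count as a per-repo snapshot count and expects the value 1. latest_only_violations is a hypothetical helper, and the real names written by verify-offsite-full-sync.sh may differ.

```shell
#!/bin/sh
# Print every snapshot_count sample whose value is not 1.
# Returns non-zero when at least one violation is found.
# Assumed sample shape: <prefix>snapshot_count{repo="gitea"} 1
latest_only_violations() {
  awk '
    /^#/ { next }                                   # skip HELP/TYPE comment lines
    $1 ~ /snapshot_count([{]|$)/ && $NF != 1 {      # the sample value is the last field
      print $1, $NF
      bad = 1
    }
    END { exit bad }
  ' "$1"
}

# Example call (path taken from this runbook):
#   latest_only_violations /home/wooo/node_exporter_textfiles/offsite_full_sync_verify.prom
```

An empty output with exit status 0 means every matched repo holds exactly one snapshot.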

2026-06-04 manual refresh evidence:

  • 188 momo-pg-backup.sh produced momo_analytics_20260604_154234.sql.gz and pruned old backups beyond keep-last=1.
  • 110 backup-awoooi-frequent.sh completed restic snapshot 7440d75f and pruned previous AWOOOI high-frequency DB snapshot.
  • 18:54 backup-status.sh --no-notify: stale110=none, stale188=none, configured_missing_188=0, core_blockers=1, escrow_missing=5.

18:55 cold-start scorecard refresh:

  • PASS=71 WARN=3 BLOCKED=3.
  • Remaining hard blocks: 120 ping, 120 SSH, and 120 K3s read-only check.
  • 188 backup health stale jobs are clear.
  • momo current-month parity is green: 2215|2215|2026-06-01|2026-06-04|2026-06-01|2026-06-04.

19:02 120 console handoff evidence:

  • local/110/121/188 cannot reach 192.168.0.120.
  • K3s node lease for mon stopped renewing at 2026-05-22 02:48:36 +08.
  • 120-fsck-maintenance-checklist.sh --no-color returns PASS=2 WARN=2 BLOCKED=3, so backup aggregate remains correctly blocked until console/SSH recovery.

2026-06-12 18:55 update: 120 has returned and the aggregate backup blocker is cleared. /backup/scripts/backup-all.sh completed 13/13, full offsite sync completed 13/13, full verifier returned REMOTE_LATEST_ONLY_OK=1 / VERIFY_OK=1, and backup-status.sh --no-notify reports core_blockers=0. The only remaining DR warning is escrow_missing=5.

2026-06-13 01:28 update: post-CD live readback still shows remote_verify_ok=1, full_verify_fresh=1, and all 13 repos snapshot_count=1; backup core remains green after deploy marker e4a349bc.

2026-06-05 manual remediation:

  • 16:00 live check still had 120 unreachable, stale110=awoooi_db, backup_all failed=6, and escrow_missing=5.
  • 14:00 AWOOOI high-frequency backup failed, then 16:01 manual rerun completed snapshot b7d5ee4e.
  • 02:00 Gitea failure was caused by stale container /tmp/gitea-dump.zip; it was renamed in-container to /tmp/gitea-dump.stale.20260605_161032.zip, then Gitea backup completed snapshot ea641613.
  • scripts/backup/backup-gitea.sh and live 110 /backup/scripts/backup-gitea.sh now preserve stale container dump files with timestamped names before starting a new dump.
  • 110 -> 188 SSH known_hosts was refreshed after fingerprint match for 192.168.0.188; Open-WebUI backup completed snapshot d1147507.
  • ClawBot backup completed snapshot 73ead3cc; BGSAVE still warned, but the Redis volume backup succeeded.
  • AI artifacts backup completed snapshot b1161ab8.
  • Full offsite sync was skipped by runway gate because the next scheduled backup was too close; partial sync for awoooi gitea open-webui clawbot ai-artifacts completed 5/5.
  • 18:39 full verifier confirmed all 13 Google Drive/rclone repos have remote snapshots=1, REMOTE_LATEST_ONLY_OK=1, and VERIFY_OK=1.
  • 18:40 backup status still reports failed=6 / core_blockers=6 because the 02:00 aggregate history remains failed until backup-all reruns after 120 returns. Do not mark aggregate backup green from individual backup success alone.
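
The Gitea fix in this list preserves a leftover dump under a timestamped name instead of deleting it. A minimal local sketch of that renaming pattern is shown below. preserve_stale_dump is a hypothetical name, the sketch assumes the path has a file extension, and the deployed backup-gitea.sh performs the rename inside the container, which is not reproduced here.

```shell
#!/bin/sh
# Rename a leftover dump to <name>.stale.<timestamp>.<ext> so a new dump can
# start without destroying the old file. Prints the new path when a rename happens.
preserve_stale_dump() {
  dump=$1
  [ -e "$dump" ] || return 0             # nothing left behind: no-op
  stamp=$(date +%Y%m%d_%H%M%S)
  base=${dump%.*}                        # e.g. /tmp/gitea-dump
  ext=${dump##*.}                        # e.g. zip
  target="$base.stale.$stamp.$ext"
  mv -- "$dump" "$target" && echo "$target"
}

# Example: preserve_stale_dump /tmp/gitea-dump.zip
```

Keeping the stale file leaves evidence for the failure review while unblocking the next run.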

2026-06-06 convergence evidence:

  • 14:46 live check: 120 still ping/SSH failed and K3s mon remains NotReady,SchedulingDisabled.
  • 02:00 aggregate backup failed only Configs; the log line reads "全服務備份完成 (1532s) - 1 個失敗 (12/13)" (all-service backup finished in 1532s with 1 failure, 12/13 succeeded).
  • 14:58 backup-status.sh --no-notify: stale110=none, stale188=none, failed=1, core_blockers=1, escrow_missing=5.
  • 14:46 verify-offsite-full-sync.sh --write-textfile --no-color: all 13 remote repos have one snapshot, REMOTE_LATEST_ONLY_OK=1, VERIFY_OK=1.
  • 15:03 cold-start scorecard: PASS=71 WARN=3 BLOCKED=3; direct 188 checks still show momo-scheduler healthy with recent log activity, and the scheduler WARN is no longer present in the scorecard.
  • 15:03 credential escrow report: rclone/offsite readiness is configured, but restic_repository_password, offsite_provider_credentials, break_glass_admin_credentials, dns_registrar_recovery, and oauth_ai_provider_recovery still lack non-secret evidence markers. Do not write placeholders or secrets.

Full Crontab Schedule (110)

0 2       * * *   backup-all.sh              ← full backup of the 9 services
0 8,14,20 * * *   backup-awoooi-frequent.sh  ← AWOOOI high-frequency (every 6 hours, counting the 02:00 full backup)
0 3       * * *   sync-offsite-backups.sh --mode sync          ← Google Drive/rclone gated sync
5 6       * * *   backup-status.sh           ← once-daily backup status summary; avoids success-heartbeat noise
20 7      * * *   verify-offsite-full-sync.sh --write-textfile ← Google Drive/rclone latest-only verification
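
The maximum data loss figures in the overview table follow from the largest gap between consecutive runs, including the wrap-around from the last run of one day to the first run of the next. The arithmetic is sketched below; max_gap_hours is a hypothetical helper that takes whole hours only.

```shell
#!/bin/sh
# Largest gap in hours between consecutive daily runs, wrap-around included.
max_gap_hours() {
  printf '%s\n' "$@" | sort -n | awk '
    NR == 1 { first = $1; prev = $1; max = 0; next }
    { gap = $1 - prev; if (gap > max) max = gap; prev = $1 }
    END { wrap = first + 24 - prev; if (wrap > max) max = wrap; print max }
  '
}

max_gap_hours 2 8 14 20   # AWOOOI: 02:00 full plus 08/14/20 runs, prints 6
max_gap_hours 2           # services with only the 02:00 run, prints 24
```

The 08/14/20 entries alone would leave a 12-hour gap overnight; the 6h figure holds only because the 02:00 full backup also covers AWOOOI.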

Backup Architecture

192.168.0.110 (/backup/scripts/backup-all.sh) daily 02:00
├── [1/9] backup-gitea.sh       → gitea dump → /backup/gitea
├── [2/9] backup-momo.sh        → SSH 188 pg_dump momo → /backup/momo
├── [3/9] backup-harbor.sh      → harbor dump → /backup/harbor
├── [4/9] backup-awoooi.sh      → SSH 188 pg_dump awoooi_prod/dev/k3s → /backup/awoooi
├── [5/9] backup-langfuse.sh    → docker exec langfuse-db pg_dump → /backup/langfuse
├── [6/9] backup-monitoring.sh  → volumes prometheus/grafana/alertmanager → /backup/monitoring
├── [7/9] backup-signoz.sh      → volumes signoz-clickhouse/sqlite → /backup/signoz
├── [8/9] backup-open-webui.sh  → SSH 188 volume open-webui → /backup/open-webui
└── [9/9] backup-clawbot.sh     → SSH 188 volume clawbot-redis → /backup/clawbot

Backup failure → notify_clawbot("failed") → /webhook/custom or AwoooP/Alertmanager path → Telegram 🔴
Backup success → textfile / Prometheus / 06:05 status summary, no immediate message

192.168.0.188 (Velero) daily 02:00
└── K8s resource snapshot → MinIO :9000 (bucket: velero)
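
The orchestration shown in the diagram (run every step in order, keep going after a failure, alert per failure, report an aggregate result) can be sketched as below. This is not the content of backup-all.sh: run_all_backups, notify_failure, step_ok, and step_bad are hypothetical names, and notify_failure stands in for notify_clawbot.

```shell
#!/bin/sh
# Hypothetical stand-in for the real alert hook.
notify_failure() { echo "ALERT failed: $1" >&2; }

# Run each step (a command name) in order and continue after failures.
# Returns zero only when every step succeeded.
run_all_backups() {
  total=$#
  failed=0
  index=0
  for step in "$@"; do
    index=$((index + 1))
    echo "[$index/$total] $step"
    if ! "$step"; then
      failed=$((failed + 1))
      notify_failure "$step"
    fi
  done
  echo "done: $((total - failed))/$total succeeded, $failed failed"
  [ "$failed" -eq 0 ]
}

# Demo with stub steps in place of the real backup scripts.
step_ok()  { return 0; }
step_bad() { return 1; }
run_all_backups step_ok step_bad step_ok || echo "aggregate: not green"
```

Because the aggregate result comes from the whole run, a later manual success of one step does not turn an earlier failed run green; only a clean rerun of the full sequence does.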

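Not Yet Backed Up (with reasons)

| Service | Reason | Notes |
|---|---|---|
| Prometheus TSDB | Raw metric data rather than configuration; the TSDB has its own 30d TTL | Low priority; Grafana configuration is already backed up |
| Sentry | Not currently running (docker ps shows nothing) | A volume exists; re-evaluate after redeployment |
| Redis (AWOOOI) | Cache/WorkingMemory only; no persistent business data | Low priority |
| Velero MinIO data | MinIO is the backup of the backup and needs an offsite copy | B2/S3 offsite pending evaluation |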

Verification SOP

# Latest backup log
ssh wooo@192.168.0.110 "tail -50 /backup/logs/backup.log"

# Snapshot count for every service (read from restic's trailing "N snapshots" summary line; 0 if the repo cannot be read)
ssh wooo@192.168.0.110 "for r in gitea momo harbor awoooi langfuse monitoring signoz open-webui clawbot; do
  printf '%s: ' \"\$r\"
  restic -r /backup/\$r snapshots --password-file /backup/scripts/.restic-password 2>/dev/null | awk '/ snapshots\$/ {n=\$1} END {print n+0}'
done"

# Alert test
ssh wooo@192.168.0.110 "source /backup/scripts/common.sh && notify_clawbot 'warning' 'manual-test' 'manual alert test' 0"

Related Documents

  • REBOOT-RECOVERY-SOP.md - reboot recovery SOP
  • scripts/backup/ - all backup scripts (Git-tracked versions)
  • /backup/scripts/ (on 110) - the scripts actually deployed