BACKUP-STATUS.md: Backup Status Overview

  • 2026-04-05 Claude Code: full inventory by the chief architect; all services fully automated, plus alerting.
  • Backup center: 192.168.0.110 (/backup/), using Restic, latest-only retention, and a Google Drive/rclone offsite mirror.
  • 2026-06-12 Codex pre-reboot refresh: 110 cron / Google Drive rclone / Alertmanager / credential escrow / cold-start scorecard rechecked.
  • 2026-06-12 Codex post-reboot refresh: 110 recovered, offsite latest-only verified, and remaining red gates narrowed to 120 config capture plus credential escrow evidence.
  • 2026-06-12 Codex post-120 recovery refresh: 120 restored, backup aggregate / offsite / full cold-start green; DR still blocked only by credential escrow evidence.
  • 2026-06-13 Codex live refresh: backup core remains green; DR still blocked only by credential escrow evidence.
  • 2026-06-13 Codex post-CD refresh: backup/offsite/alert contracts remain green after deploy marker e4a349bc; global SSH trust guardrail held; DR still blocked only by credential escrow evidence.
  • 2026-06-13 Codex escrow refresh: 13:10 live report confirms offsite/rclone/script readiness is green and only five non-secret credential escrow evidence markers remain missing.
  • 2026-06-18 Codex cold-start refresh: full-stack service readiness is green after stale failed Job classification; backup core remains green; DR still blocked only by five credential escrow evidence markers.
  • 2026-06-24 Codex Velero/exporter refresh: 188 MinIO / Velero backup freshness, 188 PostgreSQL / Redis exporters, 188 node-exporter, and 110 disk pressure are recovered; DR still blocked only by five credential escrow evidence markers, and service full-green is blocked by MOMO data freshness.
  • 2026-06-24 22:17 Codex backup readback: 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, integrity_stale=0, offsite_fresh=1, rclone_gdrive_fresh=1, escrow_missing=5; MOMO import-boundary fix is production-deployed, but full-stack remains blocked by MOMO data freshness.
  • 2026-06-24 22:40 Codex MOMO source readback: scheduler / DB / import metadata confirm the full-stack blocker is missing upstream source data, not backup freshness; no manual import or Drive write was performed.
  • 2026-06-24 23:04 Codex cold-start gate refresh: repo-side v1.42 dry-run now emits MOMO source-absence evidence and blocks with "188 momo source file absent while daily sales data stale"; backup/offsite remains green and live 110 script deployment is not claimed.
  • 2026-06-24 23:15 Codex live-sync gate readback: read-only deploy parity check correctly blocks because repo cold-start hash f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05 differs from 110 live hash 10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8; installer remains a live write requiring explicit approval.
  • 2026-06-24 23:33 Codex backup readback: 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0, integrity_stale=0, offsite_fresh=1, rclone_gdrive_fresh=1, escrow_missing=5; live cold-start still blocks only on MOMO source absence / data freshness, not backup.


2026-06-24 23:33 Backup / Offsite / Escrow Live Status

Read-only command: /backup/scripts/backup-status.sh --no-notify --no-refresh from 110 at 23:33 Asia/Taipei.

  • 110 backup health: 13/13 fresh failed=0
  • 188 backup health: 2/2 fresh failed=0
  • Integrity / configured blockers: core_blockers=0, dr_warnings=5, configured_missing_110=0, configured_missing_188=0, script_missing_110=0, script_missing_188=0, integrity_stale=0
  • Offsite / GDrive freshness: offsite_configured=1, offsite_fresh=1, rclone_gdrive_configured=1, rclone_gdrive_fresh=1
  • Last aggregate backup: 2026-06-24 02:28:39
  • DR blocker remains: escrow_missing=5. Do not forge evidence markers, and do not paste secret values, hashes, or partial tokens.
  • Full-stack service release blocker remains separate: cold-start PASS=88 WARN=0 BLOCKED=1. The cause is "188 momo source file absent while daily sales data stale" / MOMO_DAILY_FRESHNESS 7|2026-06-17; this is not a backup freshness failure.
| Gate | Status | Evidence |
| --- | --- | --- |
| 110 backup freshness | VERIFIED | 13/13 fresh, failed count 0. |
| 188 backup freshness | VERIFIED | 2/2 fresh, failed count 0. |
| Offsite / GDrive freshness | VERIFIED | offsite_fresh=1, rclone_gdrive_fresh=1. |
| Backup core blockers | GREEN | core_blockers=0. |
| Credential escrow | BLOCKED | escrow_missing=5; only real non-secret owner evidence may close this. |
| Service full green | NO-GO | Blocked by MOMO source absence / data freshness, not by backup. |
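
The gate table above treats backup core health and DR closeout as two independent verdicts. A minimal sketch of that separation follows. It assumes the status fields arrive as space-separated key=value pairs; `status_field` and `classify_gates` are hypothetical helpers for illustration, not part of backup-status.sh.

```shell
# Hypothetical helper: print the value of one key from a "k=v k=v ..." line.
status_field() {
  local key="$1" line="$2" pair
  for pair in $line; do
    if [ "${pair%%=*}" = "$key" ]; then
      printf '%s\n' "${pair#*=}"
      return 0
    fi
  done
  return 1
}

# Backup core and DR closeout are judged independently:
# core_blockers drives the first verdict, escrow_missing the second.
classify_gates() {
  local line="$1" core escrow
  core="$(status_field core_blockers "$line")" || return 2
  escrow="$(status_field escrow_missing "$line")" || return 2
  local backup=GREEN dr=GREEN
  [ "$core" -eq 0 ] || backup=BLOCKED
  [ "$escrow" -eq 0 ] || dr=BLOCKED
  printf 'backup_core=%s dr_closeout=%s\n' "$backup" "$dr"
}

classify_gates "core_blockers=0 integrity_stale=0 offsite_fresh=1 escrow_missing=5"
```

With the 23:33 values this prints `backup_core=GREEN dr_closeout=BLOCKED`, which matches the table: a green backup core never implies that DR can be closed.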

2026-06-24 22:17 Backup / Offsite / Escrow Live Status

Read-only command: /backup/scripts/backup-status.sh --no-notify --no-refresh from 110 at 22:17 Asia/Taipei.

  • 110 backup health: 13/13 fresh failed=0
  • 188 backup health: 2/2 fresh failed=0
  • Integrity / configured blockers: core_blockers=0, dr_warnings=5, configured_missing_110=0, configured_missing_188=0, script_missing_110=0, script_missing_188=0
  • Offsite / GDrive freshness: offsite_configured=1, offsite_fresh=1, rclone_gdrive_configured=1, rclone_gdrive_fresh=1
  • Last aggregate backup: 2026-06-24 02:28:39
  • DR blocker remains: escrow_missing=5. Do not forge evidence markers, and do not paste secret values, hashes, or partial tokens.
  • Full-stack service release blocker remains separate: cold-start PASS=86 WARN=0 BLOCKED=1. The cause is MOMO_DAILY_FRESHNESS 7|2026-06-17; this is not a backup freshness failure.
  • MOMO code boundary is now production-deployed: Gitea Actions cd.yaml #904 succeeded at commit 84035906aba0e5e190d031a13cfd9b47a8cd1f73; 188 live source marker confirms monthly sync failure now fails the job and prevents Drive file movement.
| Gate | Status | Evidence |
| --- | --- | --- |
| 110 backup freshness | VERIFIED | 13/13 fresh, failed count 0. |
| 188 backup freshness | VERIFIED | 2/2 fresh, failed count 0. |
| Offsite / GDrive freshness | VERIFIED | offsite_fresh=1, rclone_gdrive_fresh=1. |
| Backup core blockers | GREEN | core_blockers=0. |
| Credential escrow | BLOCKED | escrow_missing=5; only real non-secret owner evidence may close this. |
| MOMO code boundary | VERIFIED | cd.yaml #904 success and 188 live import-service marker. |
| Service full green | NO-GO | Blocked by MOMO data freshness, not by backup. |

2026-06-24 22:40 MOMO source absence clarification:

  • Scheduler auto_import_task recent runs report file_count=0 and imported_count=0 for the expected Drive folder / pattern.
  • import_config remains gdrive_folder_path=當日業績匯入 (the "daily sales import" folder) and gdrive_file_pattern=即時業績_當日 (the "real-time sales, same day" file pattern).
  • Latest valid import job 56 already completed with sync_success=true and bounds 2026-06-01..2026-06-17.
  • Repo-side cold-start v1.42 dry-run emits MOMO_SOURCE_EMPTY_EVIDENCE_LINES 21, MOMO_IMPORT_CONFIG 當日業績匯入|即時業績_當日, and MOMO_LATEST_IMPORT_JOB 56|completed|即時業績_當日.xlsx|2026-06-18T11:41:00.853176|2026-06-18T11:42:02.309425|10936|10936|0, and keeps the only hard blocker as source absence.
  • 110 live monitor deployment is intentionally not claimed: verify-cold-start-monitor-deploy.sh reports repo hash f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05 vs live hash 10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8.
  • Therefore backup/offsite remains green while service full-green remains blocked by business data source absence. Do not run backup restore or DB restore to solve this symptom.
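
The deploy parity gate described above comes down to comparing two SHA-256 digests without writing anything. The following is a rough sketch of that comparison, not the logic of verify-cold-start-monitor-deploy.sh itself. It assumes both copies are readable as local files (for example, the live copy fetched read-only beforehand), and `hash_parity` is a hypothetical name.

```shell
# Read-only parity check: compares SHA-256 digests of two files and never
# modifies either one. Returns 0 on match, 1 on mismatch, 2 on unreadable input.
hash_parity() {
  local repo_file="$1" live_file="$2" repo_hash live_hash
  if [ ! -r "$repo_file" ] || [ ! -r "$live_file" ]; then
    echo "PARITY_UNKNOWN unreadable input" >&2
    return 2
  fi
  repo_hash="$(sha256sum "$repo_file" | cut -d' ' -f1)"
  live_hash="$(sha256sum "$live_file" | cut -d' ' -f1)"
  if [ "$repo_hash" = "$live_hash" ]; then
    echo "PARITY_OK $repo_hash"
  else
    echo "PARITY_BLOCKED repo=$repo_hash live=$live_hash"
    return 1
  fi
}
```

A mismatch only reports and blocks. Installing the repo version on 110 remains a separate live write that needs explicit approval.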

2026-06-24 21:33 Backup / Offsite / Escrow Live Status

Read-only command: /backup/scripts/backup-status.sh --no-notify --no-refresh from 110 at 21:33 Asia/Taipei.

  • 110 backup health: 13/13 fresh failed=0
  • 188 backup health: 2/2 fresh failed=0
  • Integrity / configured blockers: core_blockers=0, configured_missing_110=0, configured_missing_188=0, script_missing_110=0, script_missing_188=0
  • Offsite / GDrive freshness: offsite_configured=1, offsite_fresh=1, rclone_gdrive_configured=1, rclone_gdrive_fresh=1
  • Last aggregate backup: 2026-06-24 02:28:39
  • DR blocker remains: escrow_missing=5. Do not forge evidence markers, and do not paste secret values, hashes, or partial tokens.
  • Full-stack service release blocker remains separate: cold-start PASS=86 WARN=0 BLOCKED=1. The cause is MOMO_DAILY_FRESHNESS 7|2026-06-17; this is not a backup freshness failure. The MOMO program version is already V10.653, but the business data has not yet caught up to today.
| Gate | Status | Evidence |
| --- | --- | --- |
| 110 backup freshness | VERIFIED | 13/13 fresh, failed count 0. |
| 188 backup freshness | VERIFIED | 2/2 fresh, failed count 0. |
| Offsite / GDrive freshness | VERIFIED | offsite_fresh=1, rclone_gdrive_fresh=1. |
| Backup core blockers | GREEN | core_blockers=0. |
| Credential escrow | BLOCKED | escrow_missing=5; only real non-secret owner evidence may close this. |
| Service full green | NO-GO | Blocked by MOMO data freshness, not by backup. |

2026-06-24 Velero / Exporter / Disk-Pressure Live Status

2026-06-24 06:35 refresh:

  • 110 backup health remains fresh: 13 configured jobs, stale 0, failed count 0, config failed 0
  • 188 backup health remains fresh: 2 configured jobs, stale 0, missing cron/script 0
  • 188 node-exporter textfile scrape is restored: Prometheus up{job="node-exporter-188"}=1 and awoooi_backup_health_monitor_up{host="188"}=1
  • 188 PostgreSQL exporter and Redis exporter are restored: local metrics pg_up=1 / redis_up=1; Prometheus sees up{job="postgres-exporter"}=1, pg_up=1, up{job="redis-exporter"}=1, redis_up=1
  • 188 MinIO endpoint is healthy on 192.168.0.188:9000; 120 Velero BackupStorageLocation/default is Available
  • One-off Velero backup reboot-recovery-202606240456 completed successfully; 110 backup-health textfile reports awoooi_velero_latest_completed_backup_fresh=1
  • VeleroBackupNotRun, BackupHealthMonitorMissing188, PostgreSQLDown, RedisDown, and 110 disk-pressure alerts are resolved.
  • 110 / disk use is reduced from 92% to 73% after Docker image/build-cache cleanup only. Docker volume prune remains forbidden without explicit owner approval.
  • Credential escrow readback remains blocked: ESCROW_MISSING_COUNT=5
  • Full service green is still blocked by MOMO business data freshness: MOMO_DAILY_FRESHNESS 6|2026-06-17
| Gate | Status | Evidence |
| --- | --- | --- |
| 110 backup freshness | VERIFIED | 13/13 fresh, failed count 0, config failed 0. |
| 188 backup freshness | VERIFIED | 2/2 fresh, node-exporter scrape and textfile metrics restored. |
| Velero / MinIO storage | VERIFIED | MinIO health OK, BSL Available, backup reboot-recovery-202606240456 Completed, freshness metric 1. |
| PostgreSQL / Redis exporters | VERIFIED | pg_up=1, redis_up=1, Prometheus scrape up=1 for both exporters. |
| Alert chain | VERIFIED_WITH_EXPECTED_REDLIGHTS | Exporter/Velero/disk alerts resolved; escrow missing and MOMO freshness blocker remain visible. |
| Credential escrow | BLOCKED | Five non-secret evidence markers still missing. |
| DR closeout | NO-GO | Must not be declared complete until real owner-provided non-secret evidence IDs are validated and markers are written. |

Operational helpers:

# 188 PostgreSQL / Redis exporters
ssh ollama@192.168.0.188 'bash /home/ollama/bin/188-db-exporters-restore.sh'

# 188 MinIO / 120 Velero BSL readback
ssh wooo@192.168.0.110 '/home/wooo/scripts/188-minio-velero-restore.sh'

# Maintenance-window one-off Velero backup + backup-health textfile refresh
ssh wooo@192.168.0.110 'CREATE_VELERO_BACKUP=true REFRESH_BACKUP_HEALTH=true /home/wooo/scripts/188-minio-velero-restore.sh'

Current policy: restore backup and monitoring red lights first; do not silence VeleroBackupNotRun or exporter-down alerts. Healthy heartbeat success messages are suppressed separately and should not be confused with real backup/data/escrow alerts.


2026-06-18 Cold-Start / Backup Live Status

2026-06-18 13:43 refresh:

  • full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1: PASS=84 WARN=0 BLOCKED=0, result GREEN.
  • K8s schedule evidence: FAILED_JOBS=1, STALE_FAILED_JOBS=1, ACTIVE_FAILED_JOBS=0, BAD_PODS=0. Retained km-vectorize failed Job is historical evidence only; active failed Job count is zero.
  • 110 backup health: total=13 stale=0 missing_cron=0 missing_script=0 failed_count=0 config_failed=0 integrity_total=2 integrity_stale=0
  • 188 backup health: total=2 stale=0 missing_cron=0 missing_script=0
  • Public routes / TLS, momo DB parity, 120 / 121 K3s readiness, AWOOOI API/Web route checks all passed in the same cold-start run.
  • Credential escrow readback remains blocked: /backup/scripts/offsite-escrow-evidence-report.sh --no-color reports SCRIPT_MISSING_COUNT=0, OFFSITE_CONFIGURED=1, RCLONE_CONFIGURED=1, READINESS_REQUIRE_CONFIGURED_BLOCKED=0, ESCROW_MISSING_COUNT=5.
| Gate | Status | Evidence |
| --- | --- | --- |
| Full cold-start service readiness | GREEN | PASS=84 WARN=0 BLOCKED=0; stale failed Job evidence is separated from active failed Job blockers. |
| 110 backup freshness | VERIFIED | 13/13 fresh, failed count 0, config failed 0, integrity stale 0. |
| 188 backup freshness | VERIFIED | 2/2 fresh, missing cron/script 0. |
| Credential escrow | BLOCKED | Five non-secret evidence markers still missing: restic_repository_password, offsite_provider_credentials, break_glass_admin_credentials, dns_registrar_recovery, oauth_ai_provider_recovery. |
| DR closeout | NO-GO | Must not be declared complete until real owner-provided non-secret evidence IDs are validated and markers are written. |
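
The scorecard summary and the stale-versus-active failed Job split above can be read mechanically. The sketch below is an illustration under stated assumptions, not the logic of full-stack-cold-start-check.sh: it assumes any BLOCKED count above zero blocks the result, that warnings alone yield a WARN result, and that active failed Jobs equal all failed Jobs minus those classified as stale. All three function names are hypothetical.

```shell
# Print the integer after "NAME=" in a summary line, or nothing if absent.
scorecard_field() {
  printf '%s\n' "$2" | sed -n "s/.*$1=\([0-9][0-9]*\).*/\1/p"
}

# Map a "PASS=n WARN=n BLOCKED=n" summary to a single verdict.
scorecard_result() {
  local warn blocked
  warn="$(scorecard_field WARN "$1")"
  blocked="$(scorecard_field BLOCKED "$1")"
  if [ -z "$warn" ] || [ -z "$blocked" ]; then
    echo "UNPARSEABLE"
    return 2
  fi
  if [ "$blocked" -gt 0 ]; then
    echo "BLOCKED"
  elif [ "$warn" -gt 0 ]; then
    echo "WARN"      # assumption: warnings alone do not block
  else
    echo "GREEN"
  fi
}

# Assumption: only failed Jobs that are not classified as stale count as blockers.
active_failed_jobs() {
  echo $(( $1 - $2 ))
}

scorecard_result "PASS=84 WARN=0 BLOCKED=0"
active_failed_jobs 1 1
```

For the 13:43 run this prints `GREEN` and `0`: the retained km-vectorize Job is counted as failed and stale, so it stays visible as evidence without blocking the gate.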

Current policy: service recovery and backup health can be green while DR is still blocked. Do not fake escrow markers, do not paste secrets into repo/chat, and do not silence escrow alerts.


2026-06-13 Post-CD Live Status

2026-06-13 01:26 / 01:28 refresh:

  • /backup/scripts/backup-status.sh --no-notify: 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, integrity_stale=0, offsite_fresh=1, rclone_gdrive_fresh=1, core_blockers=0, escrow_missing=5, last aggregate 2026-06-12 15:54:40.
  • /home/wooo/node_exporter_textfiles/offsite_full_sync_verify.prom: awoooi_backup_offsite_remote_verify_ok=1, awoooi_backup_offsite_full_verify_fresh=1; all 13 repos report snapshot_count=1 and snapshot_latest_only=1.
  • backup-alert-live-visibility-check.py: both Prometheus and Alertmanager show the 5 escrow gap alerts as active/firing.
  • Prometheus rules API: BackupConfigCapturePartial, BackupAggregateRunFailed, BackupCredentialEscrowEvidenceMissing, ColdStartRecoveryBlocked, and ColdStartHost120Unreachable all exist with health ok; currently only the escrow gap rule is firing, as expected, and the rest are inactive.
  • backup-alert-label-contract-check.py: the local ops/monitoring/alerts-unified.yml is aligned with the live Prometheus label contract, and the 24 baseline backup alert rules are loaded.
  • This means the backup core is still green; the remaining red light is still DR credential escrow evidence, not a backup script or offsite sync failure.
| Gate | Status | Evidence |
| --- | --- | --- |
| 110 backup cron | VERIFIED | Live crontab still has 02:00 backup-all, 03:00 sync-offsite-backups --mode sync, 06:05 backup-status, 07:20 verify-offsite-full-sync; success is summarized once daily and not sent as noisy Telegram heartbeat. |
| Backup freshness | VERIFIED | 2026-06-13 01:26 status shows 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0; last aggregate backup completed 2026-06-12 15:54:40. |
| 188 momo backup cron/exporter contract | VERIFIED | 188 crontab now runs /home/ollama/bin/momo-pg-backup.sh; exporter reports awoooi_backup_job_configured{host="188",job="momo_pg_daily"} 1, so configured_missing_188=0. |
| Google Drive/rclone remote latest-only | VERIFIED | 2026-06-13 01:28 textfile confirms all 13 remote repos have snapshot_count=1 and snapshot_latest_only=1; latest scheduled verifier log at 2026-06-12 07:20 returned REMOTE_LATEST_ONLY_OK=1, FULL_MARKER_FRESH=1, VERIFY_OK=1, FAILED=0. |
| Offsite gate marker | VERIFIED | /backup/offsite/enable-rclone-sync present; full marker fresh and verifier wrote /home/wooo/node_exporter_textfiles/offsite_full_sync_verify.prom. |
| Backup alert rules | VERIFIED | 2026-06-13 01:27 live check confirms Prometheus and Alertmanager expose the active BackupCredentialEscrowEvidenceMissing gap alerts for the five missing items; Prometheus rules API has all five required alert names healthy; label contract check confirms all baseline backup alert rules are loaded. |
| Backup aggregate health | VERIFIED | 2026-06-12 15:54 /backup/scripts/backup-all.sh completed 13/13 successfully; Configs captured 120 / 121 / K8s workloads / K8s secrets / Velero from source 192.168.0.120; 18:55 core_blockers=0. |
| Credential escrow | BLOCKED | Five evidence markers missing. Only write non-secret marker evidence with /backup/scripts/mark-credential-escrow-verified.sh. |
| Config backup capture | VERIFIED | 2026-06-12 15:54 Configs backup succeeded for 120-k3s-host-configs, 121-k3s-host-configs, cluster-k8s-workloads, cluster-k8s-secrets, and cluster-velero-backups; latest Configs snapshot bee9ae22. |
| Full cold-start | GREEN | 2026-06-13 01:26 read-only rerun: PASS=83 WARN=0 BLOCKED=0; result GREEN. |
| 110 -> 120 / 188 SSH trust | VERIFIED | Final trust repair backup /home/wooo/.ssh/known_hosts.before-120-188-final-refresh.20260613-011949; CD fix 80e6ec1a uses /home/wooo/.ssh/deploy_known_hosts; post-deploy marker e4a349bc did not clobber global known_hosts, and 120 / 188 entries remain present. |
| 120 console handoff | CLOSED | 120 root filesystem was repaired from console/initramfs with offline fsck, booted at 2026-06-12 15:13, SSH returned, root mounted rw, failed units 0, and K3s mon returned Ready. |
| 2026-06-05 manual backup remediation | VERIFIED with aggregate blocker | 18:40 status: stale110=none, stale188=none, configured_missing_188=0; manual snapshots: AWOOOI b7d5ee4e, Gitea ea641613, Open-WebUI d1147507, ClawBot 73ead3cc, AI artifacts b1161ab8. |
| 2026-06-06 credential escrow audit | BLOCKED | 15:03 report confirms scripts/config/rclone are present, but all five non-secret evidence markers are still missing; 15:06 safe dry-run checklist is documented below. |

Current policy: normal success should not create immediate Telegram noise. Failures and operator-action states must still alert; a single daily status summary runs at 06:05.

2026-06-13 post-CD closeout:

  • 110 / 120 / 121 / 188 public/core service recovery is green.
  • Backup aggregate, Google Drive/rclone offsite latest-only, and full cold-start are green after 120 recovery.
  • Latest normal CD deploy preserved API/Web workload balancing and did not break cold-start SSH trust.
  • Do not close DR scorecard until all five credential escrow evidence markers are written with non-secret evidence IDs.

Credential Escrow Evidence Checklist

2026-06-13 13:10 live refresh:

  • /backup/scripts/mark-credential-escrow-verified.sh --status: still missing restic_repository_password, offsite_provider_credentials, break_glass_admin_credentials, dns_registrar_recovery, and oauth_ai_provider_recovery.
  • /backup/scripts/offsite-escrow-evidence-report.sh --no-color: SCRIPT_MISSING_COUNT=0, OFFSITE_CONFIGURED=1, RCLONE_CONFIGURED=1, READINESS_REQUIRE_CONFIGURED_BLOCKED=0, ESCROW_MISSING_COUNT=5, SUMMARY PASS=8 WARN=5 BLOCKED=0.
  • Owner request package: CREDENTIAL-ESCROW-EVIDENCE-OWNER-REQUEST.md
  • Verdict: backup core and offsite readiness are green; DR closeout stays blocked until all five markers are written with real, non-sensitive evidence IDs.

A credential escrow marker only proves that the recovery material has been manually verified and can be retrieved; it must not contain any secret.

| Item | Acceptable evidence-id | Forbidden |
| --- | --- | --- |
| restic_repository_password | Password manager item ID, sealed envelope ID, recovery checklist ID | Restic password, recovery code, secret URL |
| offsite_provider_credentials | Vault item ID for Google Drive/rclone or provider credential record | OAuth token, refresh token, application key |
| break_glass_admin_credentials | Break-glass credential record ID or sealed envelope ID | Admin password, SSH private key, OTP seed |
| dns_registrar_recovery | Registrar recovery checklist ID or vault item ID | Registrar password, recovery codes |
| oauth_ai_provider_recovery | Provider account recovery checklist ID or vault item ID | API key, token, client secret |

Safe flow after human verification:

# 1. Read current status; this does not expose secrets.
/backup/scripts/mark-credential-escrow-verified.sh --status

# 2. Validate the non-secret evidence-id first.
/backup/scripts/mark-credential-escrow-verified.sh \
  --item <allowed-item> \
  --evidence-id <non-secret-evidence-id> \
  --dry-run

# 3. Only after dry-run passes, write the marker.
/backup/scripts/mark-credential-escrow-verified.sh \
  --item <allowed-item> \
  --evidence-id <non-secret-evidence-id>

# 4. Recheck DR metric.
/backup/scripts/offsite-escrow-evidence-report.sh --no-color

<non-secret-evidence-id> must be an existing external reference. Placeholder values such as EVIDENCE_ID_FOR_* or VAULT-ITEM-ID are rejected and must not be written.
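
The placeholder rule above can be pre-checked locally before the dry-run step. This is a sketch only: the real validation lives in mark-credential-escrow-verified.sh and is not shown in this runbook, `is_placeholder_evidence_id` is a hypothetical helper, and treating empty values and unfilled `<...>` template text as placeholders is an added assumption.

```shell
# Returns 0 when the value looks like a placeholder that must never be written.
# Only the two patterns named in this runbook are known; the empty-string and
# "<...>" template cases are extra assumptions.
is_placeholder_evidence_id() {
  case "$1" in
    ""|EVIDENCE_ID_FOR_*|VAULT-ITEM-ID|"<"*">") return 0 ;;
    *) return 1 ;;
  esac
}

# Example with made-up values: the first is rejected, the second passes this pre-check.
for candidate in "EVIDENCE_ID_FOR_RESTIC" "envelope-0042"; do
  if is_placeholder_evidence_id "$candidate"; then
    echo "REJECT $candidate"
  else
    echo "PRECHECK_OK $candidate"
  fi
done
```

Passing this pre-check does not prove the ID refers to a real external record; that still requires human verification and the `--dry-run` step.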


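Backup Panorama (Fully Automated)

| # | Data type | Backup script | Schedule | Max data loss | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Gitea (DB + repositories) | backup-gitea.sh | Daily 02:00 | 24h | |
| 2 | MOMO PostgreSQL | backup-momo.sh | Daily 02:00 | 24h | |
| 3 | Harbor (Registry + DB) | backup-harbor.sh | Daily 02:00 | 24h | |
| 4 | AWOOOI PostgreSQL (full) | backup-awoooi.sh | Daily 02:00 | 6h | |
| 4h | AWOOOI PostgreSQL (high-frequency) | backup-awoooi-frequent.sh | 08/14/20:00 | 6h | |
| 5 | Langfuse (AI tracing/evaluation) | backup-langfuse.sh | Daily 02:00 | 24h | |
| 6 | Monitoring (Prometheus/Grafana/Alertmanager) | backup-monitoring.sh | Daily 02:00 | 24h | |
| 7 | SignOz (ClickHouse traces/logs) | backup-signoz.sh | Daily 02:00 | 24h | |
| 8 | Open-WebUI (LLM conversation history) | backup-open-webui.sh | Daily 02:00 | 24h | |
| 9 | ClawBot Redis (state/cache) | backup-clawbot.sh | Daily 02:00 | 24h | |
| - | K8s resources (all namespaces) | Velero + MinIO | Daily 02:00 | 24h | |

Backup orchestrator: /backup/scripts/backup-all.sh v3.0, which runs the 9 backups in a single pass.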


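Alerting Mechanism

Backup failures and states that need manual handling must be pushed to AwoooP / Telegram. Normal successes are not pushed in real time, to avoid flooding the channel; success state is carried by the daily 06:05 summary and by Prometheus/textfile evidence.

| Status | Severity | Telegram delivery |
| --- | --- | --- |
| success | info | Not pushed in real time; included in the daily 06:05 backup status summary |
| warning | warning | ⚠️ Yellow warning |
| failed | critical | 🔴 Immediate alert |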

Alert endpoint: http://192.168.0.188:8088/api/v1/webhook/custom
Test command:

source /backup/scripts/common.sh
notify_clawbot "failed" "backup-test" "test alert" 0
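
The routing policy in the status table can be written as a small lookup. This is a sketch, not the behaviour of notify_clawbot in common.sh: `alert_route` is a hypothetical helper, and pushing unknown statuses immediately is an assumption chosen so that unexpected states are never silently dropped.

```shell
# Map a backup status to its severity and whether it is pushed immediately.
alert_route() {
  case "$1" in
    success) echo "severity=info push_now=no" ;;        # carried by the 06:05 summary
    warning) echo "severity=warning push_now=yes" ;;
    failed)  echo "severity=critical push_now=yes" ;;
    *)       echo "severity=unknown push_now=yes"       # assumption: fail loud
             return 1 ;;
  esac
}

alert_route success
alert_route failed
```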

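Retention Policy

Since 2026-05-19, the local restic repos on 110, the MOMO file backups on 188, and the Google Drive/rclone offsite mirror follow a latest-only policy: once a new snapshot has been created successfully, only the latest copy is kept. The 2026-06-13 01:28 live textfile confirmed that each of the 13 Google Drive/rclone remote repos holds exactly one snapshot and that every latest-only metric is 1.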
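
A latest-only policy can be checked from a Prometheus textfile by confirming that every per-repo snapshot count is exactly 1. The sketch below illustrates the idea only. The metric name `awoooi_backup_offsite_snapshot_count` is an assumption (this runbook quotes only `snapshot_count=1`), and `latest_only_ok` is a hypothetical helper, not verify-offsite-full-sync.sh.

```shell
# Reads Prometheus textfile lines on stdin. Succeeds only if at least one
# sample of the given metric is present and every sample value equals 1.
# The default metric name is an assumption for illustration.
latest_only_ok() {
  awk -v metric="${1:-awoooi_backup_offsite_snapshot_count}" '
    index($0, metric) == 1 { seen++; if ($NF != 1) bad++ }
    END {
      printf "repos=%d violations=%d\n", seen, bad
      exit (seen == 0 || bad > 0)
    }
  '
}
```

Usage would look like `latest_only_ok < /home/wooo/node_exporter_textfiles/offsite_full_sync_verify.prom`, assuming the file exposes one sample per repo under that metric name. An empty input fails on purpose, so a missing textfile cannot pass as green.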

2026-06-04 manual refresh evidence:

  • 188 momo-pg-backup.sh produced momo_analytics_20260604_154234.sql.gz and pruned old backups beyond keep-last=1.
  • 110 backup-awoooi-frequent.sh completed restic snapshot 7440d75f and pruned previous AWOOOI high-frequency DB snapshot.
  • 18:54 backup-status.sh --no-notify: stale110=none, stale188=none, configured_missing_188=0, core_blockers=1, escrow_missing=5.

18:55 cold-start scorecard refresh:

  • PASS=71 WARN=3 BLOCKED=3.
  • Remaining hard blocks: 120 ping, 120 SSH, and 120 K3s read-only check.
  • 188 backup health stale jobs are clear.
  • momo current-month parity is green: 2215|2215|2026-06-01|2026-06-04|2026-06-01|2026-06-04.

19:02 120 console handoff evidence:

  • local/110/121/188 cannot reach 192.168.0.120.
  • K3s node lease for mon stopped renewing at 2026-05-22 02:48:36 +08.
  • 120-fsck-maintenance-checklist.sh --no-color returns PASS=2 WARN=2 BLOCKED=3, so backup aggregate remains correctly blocked until console/SSH recovery.

2026-06-12 18:55 update: 120 has returned and the aggregate backup blocker is cleared. /backup/scripts/backup-all.sh completed 13/13, full offsite sync completed 13/13, full verifier returned REMOTE_LATEST_ONLY_OK=1 / VERIFY_OK=1, and backup-status.sh --no-notify reports core_blockers=0. The only remaining DR warning is escrow_missing=5.

2026-06-13 01:28 update: post-CD live readback still shows remote_verify_ok=1, full_verify_fresh=1, and all 13 repos snapshot_count=1; backup core remains green after deploy marker e4a349bc.

2026-06-05 manual remediation:

  • 16:00 live check still had 120 unreachable, stale110=awoooi_db, backup_all failed=6, and escrow_missing=5.
  • 14:00 AWOOOI high-frequency backup failed, then 16:01 manual rerun completed snapshot b7d5ee4e.
  • 02:00 Gitea failure was caused by stale container /tmp/gitea-dump.zip; it was renamed in-container to /tmp/gitea-dump.stale.20260605_161032.zip, then Gitea backup completed snapshot ea641613.
  • scripts/backup/backup-gitea.sh and live 110 /backup/scripts/backup-gitea.sh now preserve stale container dump files with timestamped names before starting a new dump.
  • 110 -> 188 SSH known_hosts was refreshed after fingerprint match for 192.168.0.188; Open-WebUI backup completed snapshot d1147507.
  • ClawBot backup completed snapshot 73ead3cc; BGSAVE still warned, but the Redis volume backup succeeded.
  • AI artifacts backup completed snapshot b1161ab8.
  • Full offsite sync was skipped by runway gate because the next scheduled backup was too close; partial sync for awoooi gitea open-webui clawbot ai-artifacts completed 5/5.
  • 18:39 full verifier confirmed all 13 Google Drive/rclone repos have remote snapshots=1, REMOTE_LATEST_ONLY_OK=1, and VERIFY_OK=1.
  • 18:40 backup status still reports failed=6 / core_blockers=6 because the 02:00 aggregate history remains failed until backup-all reruns after 120 returns. Do not mark aggregate backup green from individual backup success alone.

2026-06-06 convergence evidence:

  • 14:46 live check: 120 still ping/SSH failed and K3s mon remains NotReady,SchedulingDisabled.
  • 02:00 aggregate backup failed only Configs. The log line reads 全服務備份完成 (1532s) - 1 個失敗 (12/13), that is, "all-service backup finished (1532s), 1 failure (12/13)".
  • 14:58 backup-status.sh --no-notify: stale110=none, stale188=none, failed=1, core_blockers=1, escrow_missing=5.
  • 14:46 verify-offsite-full-sync.sh --write-textfile --no-color: all 13 remote repos have one snapshot, REMOTE_LATEST_ONLY_OK=1, VERIFY_OK=1.
  • 15:03 cold-start scorecard: PASS=71 WARN=3 BLOCKED=3; direct 188 checks still show momo-scheduler healthy with recent log activity, and the scheduler WARN is no longer present in the scorecard.
  • 15:03 credential escrow report: rclone/offsite readiness is configured, but restic_repository_password, offsite_provider_credentials, break_glass_admin_credentials, dns_registrar_recovery, and oauth_ai_provider_recovery still lack non-secret evidence markers. Do not write placeholders or secrets.

Full Crontab Schedule (110)

0 2       * * *   backup-all.sh              ← full backup of the 9 services
0 8,14,20 * * *   backup-awoooi-frequent.sh  ← AWOOOI high-frequency (every 6 hours)
0 3       * * *   sync-offsite-backups.sh --mode sync          ← Google Drive/rclone gated sync
5 6       * * *   backup-status.sh           ← once-daily backup status summary, avoids success-heartbeat noise
20 7      * * *   verify-offsite-full-sync.sh --write-textfile ← Google Drive/rclone latest-only verification
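
The 6h maximum loss listed for AWOOOI follows from combining the 02:00 full backup with the 08/14/20:00 high-frequency runs. A small sketch of that arithmetic, where `max_gap_hours` is a hypothetical helper that takes whole start hours (plain integers, no leading zeros) and ignores backup duration:

```shell
# Largest gap in hours between consecutive daily runs, including the
# wrap-around from the last run of one day to the first run of the next.
max_gap_hours() {
  local first="" prev="" max=0 gap h
  for h in $(printf '%s\n' "$@" | sort -n); do
    if [ -z "$first" ]; then
      first="$h"
    else
      gap=$(( h - prev ))
      if [ "$gap" -gt "$max" ]; then max="$gap"; fi
    fi
    prev="$h"
  done
  gap=$(( first + 24 - prev ))
  if [ "$gap" -gt "$max" ]; then max="$gap"; fi
  echo "$max"
}

max_gap_hours 2 8 14 20   # AWOOOI: full at 02:00 plus high-frequency runs
max_gap_hours 2           # services backed up only at 02:00
```

The first call prints 6 and the second prints 24, matching the maximum-loss figures in the backup table.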

Backup Architecture

192.168.0.110 (/backup/scripts/backup-all.sh) daily 02:00
├── [1/9] backup-gitea.sh       → gitea dump → /backup/gitea
├── [2/9] backup-momo.sh        → SSH 188 pg_dump momo → /backup/momo
├── [3/9] backup-harbor.sh      → harbor dump → /backup/harbor
├── [4/9] backup-awoooi.sh      → SSH 188 pg_dump awoooi_prod/dev/k3s → /backup/awoooi
├── [5/9] backup-langfuse.sh    → docker exec langfuse-db pg_dump → /backup/langfuse
├── [6/9] backup-monitoring.sh  → volumes prometheus/grafana/alertmanager → /backup/monitoring
├── [7/9] backup-signoz.sh      → volumes signoz-clickhouse/sqlite → /backup/signoz
├── [8/9] backup-open-webui.sh  → SSH 188 volume open-webui → /backup/open-webui
└── [9/9] backup-clawbot.sh     → SSH 188 volume clawbot-redis → /backup/clawbot

Backup failure → notify_clawbot("failed") → /webhook/custom or AwoooP/Alertmanager path → Telegram 🔴
Backup success → textfile / Prometheus / 06:05 status summary, not pushed in real time

192.168.0.188 (Velero) daily 02:00
└── K8s resource snapshot → MinIO :9000 (bucket: velero)
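
The orchestration pattern in the tree above (run every step, keep going after a failure, and call the aggregate green only when all steps pass) can be sketched as follows. This is a hypothetical loop for illustration and is not the real backup-all.sh; each argument is assumed to be a command or function name that takes no arguments.

```shell
# Run every backup step even if an earlier one fails, then report the tally.
# Returns non-zero unless all steps succeeded.
run_all_backups() {
  local total=$# ok=0 failed=0 step
  for step in "$@"; do
    if "$step"; then
      ok=$(( ok + 1 ))
    else
      failed=$(( failed + 1 ))
      # A real run would raise an alert here, e.g. via notify_clawbot "failed".
      echo "FAILED: $step" >&2
    fi
  done
  echo "completed ${ok}/${total} failed=${failed}"
  [ "$failed" -eq 0 ]
}
```

Continuing past a failed step is what makes a partial result such as 12/13 possible, and the non-zero exit status keeps individual successes from being mistaken for a green aggregate.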

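Not Yet Backed Up (with Rationale)

| Service | Reason | Notes |
| --- | --- | --- |
| Prometheus TSDB | Raw metric data, not configuration; the TSDB has its own 30d TTL | Low priority; Grafana configuration is already backed up |
| Sentry | Not currently running (docker ps is empty) | Volume exists; re-evaluate after redeployment |
| Redis (AWOOOI) | Cache/WorkingMemory; no persistent business data | Low priority |
| Velero MinIO data | MinIO is the backup of the backup and needs an offsite copy | B2/S3 offsite pending evaluation |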

Verification SOP

# Latest backup log
ssh wooo@192.168.0.110 "tail -50 /backup/logs/backup.log"

# Snapshot count for every service.
# restic ends its listing with an "N snapshots" summary line; print that line,
# or a fallback when the repo cannot be read.
ssh wooo@192.168.0.110 "for r in gitea momo harbor awoooi langfuse monitoring signoz open-webui clawbot; do
  echo -n \"\$r: \"
  restic -r /backup/\$r snapshots --password-file /backup/scripts/.restic-password 2>/dev/null | grep -E '^[0-9]+ snapshots' || echo 'unreadable'
done"

# Alert test
ssh wooo@192.168.0.110 "source /backup/scripts/common.sh && notify_clawbot 'warning' 'manual-test' 'manual alert test' 0"

Related Documents

  • REBOOT-RECOVERY-SOP.md - reboot recovery SOP
  • scripts/backup/ - all backup scripts (Git version)
  • /backup/scripts/ (on 110) - the scripts as actually deployed