BACKUP-STATUS.md: Backup Status Overview
2026-04-05 Claude Code: chief-architect full inventory covering all services, full automation, and alerting.

Backup hub: 192.168.0.110 (/backup/), using Restic + latest-only retention + Google Drive/rclone offsite mirror.

- 2026-06-12 Codex pre-reboot refresh: 110 cron / Google Drive rclone / Alertmanager / credential escrow / cold-start scorecard rechecked.
- 2026-06-12 Codex post-reboot refresh: 110 recovered, offsite latest-only verified, and remaining red gates narrowed to 120 config capture plus credential escrow evidence.
- 2026-06-12 Codex post-120 recovery refresh: 120 restored, backup aggregate / offsite / full cold-start green; DR still blocked only by credential escrow evidence.
- 2026-06-13 Codex live refresh: backup core remains green; DR still blocked only by credential escrow evidence.
- 2026-06-13 Codex post-CD refresh: backup/offsite/alert contracts remain green after deploy marker e4a349bc; global SSH trust guardrail held; DR still blocked only by credential escrow evidence.
- 2026-06-13 Codex escrow refresh: 13:10 live report confirms offsite/rclone/script readiness is green and only five non-secret credential escrow evidence markers remain missing.
2026-06-13 Post-CD Live Status
2026-06-13 01:26 / 01:28 refresh:
- /backup/scripts/backup-status.sh --no-notify: 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, integrity_stale=0, offsite_fresh=1, rclone_gdrive_fresh=1, core_blockers=0, escrow_missing=5, last aggregate 2026-06-12 15:54:40.
- /home/wooo/node_exporter_textfiles/offsite_full_sync_verify.prom: awoooi_backup_offsite_remote_verify_ok=1, awoooi_backup_offsite_full_verify_fresh=1; all 13 repos have snapshot_count=1 and snapshot_latest_only=1.
- backup-alert-live-visibility-check.py: Prometheus and Alertmanager both show the 5 escrow gap alerts as active/firing.
- Prometheus rules API: BackupConfigCapturePartial, BackupAggregateRunFailed, BackupCredentialEscrowEvidenceMissing, ColdStartRecoveryBlocked, and ColdStartHost120Unreachable all exist with health ok; only the escrow gap rule is firing, as expected, and the rest are inactive.
- backup-alert-label-contract-check.py: the local ops/monitoring/alerts-unified.yml matches the live Prometheus label contract, and 24 baseline backup alert rules are loaded.
- Conclusion: the backup core is still green; the remaining red gate is DR credential escrow evidence, not a backup script or offsite sync failure.
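
The textfile readback above can also be spot-checked by hand. The following is a minimal sketch, assuming the .prom file uses the plain Prometheus text format (one `name{labels} value` sample per line, no timestamps). The helper name `metric_all_equal` is hypothetical and is not one of the deployed scripts.

```shell
# Hypothetical helper: succeed only if the metric has at least one sample
# in the textfile and every sample equals the expected value.
metric_all_equal() {
  file=$1 metric=$2 expected=$3
  awk -v m="$metric" -v want="$expected" '
    /^#/ { next }                        # skip HELP/TYPE comment lines
    {
      name = $1
      sub(/\{.*/, "", name)              # strip the label set, if any
      if (name == m) {
        seen++
        if ($NF + 0 != want + 0) bad++   # last field is the sample value
      }
    }
    END { exit !(seen > 0 && bad == 0) }
  ' "$file"
}
```

For example, `metric_all_equal /home/wooo/node_exporter_textfiles/offsite_full_sync_verify.prom awoooi_backup_offsite_remote_verify_ok 1 && echo OK` would print OK only when that metric is present and equal to 1. A missing metric is treated as a failure so that an empty or truncated textfile cannot pass silently.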
| Gate | Status | Evidence |
|---|---|---|
| 110 backup cron | VERIFIED | Live crontab still has 02:00 backup-all, 03:00 sync-offsite-backups --mode sync, 06:05 backup-status, 07:20 verify-offsite-full-sync; success is summarized once daily and not sent as noisy Telegram heartbeat. |
| Backup freshness | VERIFIED | 2026-06-13 01:26 status shows 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, core_blockers=0; last aggregate backup completed 2026-06-12 15:54:40. |
| 188 momo backup cron/exporter contract | VERIFIED | 188 crontab now runs /home/ollama/bin/momo-pg-backup.sh; exporter reports awoooi_backup_job_configured{host="188",job="momo_pg_daily"} 1, so configured_missing_188=0. |
| Google Drive/rclone remote latest-only | VERIFIED | 2026-06-13 01:28 textfile confirms all 13 remote repos have snapshot_count=1 and snapshot_latest_only=1; latest scheduled verifier log at 2026-06-12 07:20 returned REMOTE_LATEST_ONLY_OK=1, FULL_MARKER_FRESH=1, VERIFY_OK=1, FAILED=0. |
| Offsite gate marker | VERIFIED | /backup/offsite/enable-rclone-sync present; full marker fresh and verifier wrote /home/wooo/node_exporter_textfiles/offsite_full_sync_verify.prom. |
| Backup alert rules | VERIFIED | 2026-06-13 01:27 live check confirms Prometheus and Alertmanager expose the active BackupCredentialEscrowEvidenceMissing gap alerts for the five missing items; Prometheus rules API has all five required alert names healthy; label contract check confirms all baseline backup alert rules are loaded. |
| Backup aggregate health | VERIFIED | 2026-06-12 15:54 /backup/scripts/backup-all.sh completed 13/13 successfully; Configs captured 120 / 121 / K8s workloads / K8s secrets / Velero from source 192.168.0.120; 18:55 core_blockers=0. |
| Credential escrow | BLOCKED | Five evidence markers missing. Only write non-secret marker evidence with /backup/scripts/mark-credential-escrow-verified.sh. |
| Config backup capture | VERIFIED | 2026-06-12 15:54 Configs backup succeeded for 120-k3s-host-configs, 121-k3s-host-configs, cluster-k8s-workloads, cluster-k8s-secrets, and cluster-velero-backups; latest Configs snapshot bee9ae22. |
| Full cold-start | GREEN | 2026-06-13 01:26 read-only rerun: PASS=83 WARN=0 BLOCKED=0; result GREEN. |
| 110 -> 120 / 188 SSH trust | VERIFIED | Final trust repair backup /home/wooo/.ssh/known_hosts.before-120-188-final-refresh.20260613-011949; CD fix 80e6ec1a uses /home/wooo/.ssh/deploy_known_hosts; post-deploy marker e4a349bc did not clobber global known_hosts, and 120 / 188 entries remain present. |
| 120 console handoff | CLOSED | 120 root filesystem was repaired from console/initramfs with offline fsck, booted at 2026-06-12 15:13, SSH returned, root mounted rw, failed units 0, and K3s mon returned Ready. |
| 2026-06-05 manual backup remediation | VERIFIED with aggregate blocker | 18:40 status: stale110=none, stale188=none, configured_missing_188=0; manual snapshots: AWOOOI b7d5ee4e, Gitea ea641613, Open-WebUI d1147507, ClawBot 73ead3cc, AI artifacts b1161ab8. |
| 2026-06-06 credential escrow audit | BLOCKED | 15:03 report confirms scripts/config/rclone are present, but all five non-secret evidence markers are still missing; 15:06 safe dry-run checklist is documented below. |
Current policy: normal success should not create immediate Telegram noise. Failures and operator-action states must still alert; a single daily status summary runs at 06:05.
2026-06-13 post-CD closeout:
- 110 / 120 / 121 / 188 public/core service recovery is green.
- Backup aggregate, Google Drive/rclone offsite latest-only, and full cold-start are green after 120 recovery.
- Latest normal CD deploy preserved API/Web workload balancing and did not break cold-start SSH trust.
- Do not close DR scorecard until all five credential escrow evidence markers are written with non-secret evidence IDs.
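
The closeout rule above reduces to two counters that backup-status.sh already reports. A minimal sketch, assuming the counters are passed in as plain integers; the function name `dr_gate` and its output labels are hypothetical and only illustrate the decision order.

```shell
# Hypothetical gate: derive the backup-core and DR closeout verdicts from
# the core_blockers and escrow_missing counters in the status output.
dr_gate() {
  core_blockers=$1 escrow_missing=$2
  if [ "$core_blockers" -gt 0 ]; then
    # Any core blocker keeps both gates closed.
    echo "backup_core=RED dr_closeout=BLOCKED"
  elif [ "$escrow_missing" -gt 0 ]; then
    # Backups are healthy, but escrow evidence is still incomplete.
    echo "backup_core=GREEN dr_closeout=BLOCKED"
  else
    echo "backup_core=GREEN dr_closeout=READY"
  fi
}
```

With the current values, `dr_gate 0 5` reports a green backup core and a blocked DR closeout, which matches the status recorded in this section.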
Credential Escrow Evidence Checklist
2026-06-13 13:10 live refresh:
- /backup/scripts/mark-credential-escrow-verified.sh --status: still missing restic_repository_password, offsite_provider_credentials, break_glass_admin_credentials, dns_registrar_recovery, and oauth_ai_provider_recovery.
- /backup/scripts/offsite-escrow-evidence-report.sh --no-color: SCRIPT_MISSING_COUNT=0, OFFSITE_CONFIGURED=1, RCLONE_CONFIGURED=1, READINESS_REQUIRE_CONFIGURED_BLOCKED=0, ESCROW_MISSING_COUNT=5, SUMMARY PASS=8 WARN=5 BLOCKED=0.
- Owner request package: CREDENTIAL-ESCROW-EVIDENCE-OWNER-REQUEST.md.
- Verdict: backup core and offsite readiness are green; DR closeout stays blocked until all five markers are written with real, non-secret evidence IDs.

A credential escrow marker only proves that the recovery material has been manually verified and can be retrieved; it must never contain a secret.
| Item | Acceptable evidence-id | Forbidden |
|---|---|---|
| restic_repository_password | Password manager item ID, sealed envelope ID, recovery checklist ID | Restic password, recovery code, secret URL |
| offsite_provider_credentials | Vault item ID for Google Drive/rclone or provider credential record | OAuth token, refresh token, application key |
| break_glass_admin_credentials | Break-glass credential record ID or sealed envelope ID | Admin password, SSH private key, OTP seed |
| dns_registrar_recovery | Registrar recovery checklist ID or vault item ID | Registrar password, recovery codes |
| oauth_ai_provider_recovery | Provider account recovery checklist ID or vault item ID | API key, token, client secret |
Safe flow after human verification:
# 1. Read current status; this does not expose secrets.
/backup/scripts/mark-credential-escrow-verified.sh --status
# 2. Validate the non-secret evidence-id first.
/backup/scripts/mark-credential-escrow-verified.sh \
--item <allowed-item> \
--evidence-id <non-secret-evidence-id> \
--dry-run
# 3. Only after dry-run passes, write the marker.
/backup/scripts/mark-credential-escrow-verified.sh \
--item <allowed-item> \
--evidence-id <non-secret-evidence-id>
# 4. Recheck DR metric.
/backup/scripts/offsite-escrow-evidence-report.sh --no-color
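
Before the dry-run in step 2, an operator can screen an evidence-id locally for obvious placeholder strings. This is a minimal sketch only: the function name `is_placeholder_evidence_id` is hypothetical, the real validation lives in mark-credential-escrow-verified.sh and may differ, and the empty-string and angle-bracket patterns are assumptions added here for illustration.

```shell
# Hypothetical pre-check: return 0 when the value looks like a placeholder.
is_placeholder_evidence_id() {
  case $1 in
    # EVIDENCE_ID_FOR_* and VAULT-ITEM-ID are the documented placeholders;
    # the empty string and <...> template text are extra assumed patterns.
    ''|EVIDENCE_ID_FOR_*|VAULT-ITEM-ID|'<'*'>') return 0 ;;
    *) return 1 ;;
  esac
}
```

A caller would typically stop early, for example `is_placeholder_evidence_id "$id" && { echo "placeholder, refusing" >&2; exit 1; }`. Passing this check says nothing about whether the ID refers to a real external record; that still needs human verification.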
<non-secret-evidence-id> must be an existing external reference. Placeholder values such as EVIDENCE_ID_FOR_* or VAULT-ITEM-ID are rejected and must not be written.
Backup landscape (fully automated)
| # | Data type | Backup script | Schedule | Max data loss | Status |
|---|---|---|---|---|---|
| 1 | Gitea (DB + repositories) | backup-gitea.sh | Daily 02:00 | 24h | ✅ |
| 2 | MOMO PostgreSQL | backup-momo.sh | Daily 02:00 | 24h | ✅ |
| 3 | Harbor (Registry + DB) | backup-harbor.sh | Daily 02:00 | 24h | ✅ |
| 4 | AWOOOI PostgreSQL (full) | backup-awoooi.sh | Daily 02:00 | 6h | ✅ |
| 4h | AWOOOI PostgreSQL (high-frequency) | backup-awoooi-frequent.sh | 08/14/20:00 | 6h | ✅ |
| 5 | Langfuse (AI tracing/evaluation) | backup-langfuse.sh | Daily 02:00 | 24h | ✅ |
| 6 | Monitoring (Prometheus/Grafana/Alertmanager) | backup-monitoring.sh | Daily 02:00 | 24h | ✅ |
| 7 | SignOz (ClickHouse traces/logs) | backup-signoz.sh | Daily 02:00 | 24h | ✅ |
| 8 | Open-WebUI (LLM conversation history) | backup-open-webui.sh | Daily 02:00 | 24h | ✅ |
| 9 | ClawBot Redis (state/cache) | backup-clawbot.sh | Daily 02:00 | 24h | ✅ |
| - | K8s resources (all namespaces) | Velero + MinIO | Daily 02:00 | 24h | ✅ |
Backup orchestrator: /backup/scripts/backup-all.sh v3.0, which runs the 9 backups in a single pass
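Alerting
Backup failures and states that need operator action must be pushed to AwoooP / Telegram. Normal success is not pushed immediately, to avoid flooding the channel; success is recorded in the daily 06:05 summary and in Prometheus/textfile evidence.
| Status | Severity | Telegram delivery |
|---|---|---|
| success | info | No immediate push; included in the daily 06:05 backup status summary |
| warning | warning | ⚠️ Yellow warning |
| failed | critical | 🔴 Immediate alert |
Alert endpoint: http://192.168.0.188:8088/api/v1/webhook/custom
Test command: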
source /backup/scripts/common.sh
notify_clawbot "failed" "backup-test" "測試告警" 0
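
The status-to-severity policy in this section can be sketched as a small lookup. This is an illustration under assumptions, not the deployed notify_clawbot logic: the function name `alert_policy` and the delivery labels `daily-summary` and `immediate` are hypothetical, and treating warning as an immediate push is inferred from the rule that operator-action states must alert.

```shell
# Hypothetical mapping from backup status to "severity delivery".
alert_policy() {
  case $1 in
    success) echo "info daily-summary" ;;   # no immediate Telegram push
    warning) echo "warning immediate" ;;    # assumed: pushed right away
    failed)  echo "critical immediate" ;;
    *)       echo "unknown status: $1" >&2; return 2 ;;
  esac
}
```

Keeping the mapping in one place makes it easier to confirm that a success path can never produce an immediate notification.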
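Retention policy
Since 2026-05-19, the 110 local restic repos, the 188 MOMO file backups, and the Google Drive/rclone offsite mirror follow a latest-only policy: once a new snapshot is created successfully, only the newest one is kept. The 2026-06-13 01:28 live textfile confirmed that each of the 13 Google Drive/rclone remote repos holds exactly one snapshot and that every latest-only metric is 1.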
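
For file-based backups such as the 188 MOMO dumps, a keep-last=1 prune can be sketched as follows. This is a minimal sketch under stated assumptions, not the deployed momo-pg-backup.sh: the function name `prune_keep_latest` is hypothetical, file names are assumed to sort chronologically (for example momo_analytics_YYYYMMDD_HHMMSS.sql.gz) and to contain no whitespace, and `find -maxdepth` is assumed to be available (GNU or BSD find).

```shell
# Hypothetical keep-last=1 prune for timestamped dump files in one directory.
prune_keep_latest() {
  dir=$1 pattern=$2
  # List matches oldest-first, drop the last (newest) line, delete the rest.
  find "$dir" -maxdepth 1 -type f -name "$pattern" | sort | sed '$d' |
    while IFS= read -r old; do
      rm -f -- "$old"
    done
}
```

A call such as `prune_keep_latest /path/to/dumps 'momo_analytics_*.sql.gz'` should run only after the new dump has been written and verified, so a failed backup never deletes the last good copy. For restic repositories, `restic forget --keep-last 1 --prune` is the built-in equivalent; whether the deployed scripts invoke it in exactly that form is not stated here.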
2026-06-04 manual refresh evidence:
- 188 momo-pg-backup.sh produced momo_analytics_20260604_154234.sql.gz and pruned old backups beyond keep-last=1.
- 110 backup-awoooi-frequent.sh completed restic snapshot 7440d75f and pruned the previous AWOOOI high-frequency DB snapshot.
- 18:54 backup-status.sh --no-notify: stale110=none, stale188=none, configured_missing_188=0, core_blockers=1, escrow_missing=5.
18:55 cold-start scorecard refresh:
- PASS=71 WARN=3 BLOCKED=3.
- Remaining hard blocks: 120 ping, 120 SSH, and 120 K3s read-only check.
- 188 backup health stale jobs are clear.
- momo current-month parity is green: 2215|2215|2026-06-01|2026-06-04|2026-06-01|2026-06-04.
19:02 120 console handoff evidence:
- local/110/121/188 cannot reach 192.168.0.120.
- K3s node lease for mon stopped renewing at 2026-05-22 02:48:36 +08.
- 120-fsck-maintenance-checklist.sh --no-color returns PASS=2 WARN=2 BLOCKED=3, so backup aggregate remains correctly blocked until console/SSH recovery.
2026-06-12 18:55 update: 120 has returned and the aggregate backup blocker is cleared. /backup/scripts/backup-all.sh completed 13/13, full offsite sync completed 13/13, full verifier returned REMOTE_LATEST_ONLY_OK=1 / VERIFY_OK=1, and backup-status.sh --no-notify reports core_blockers=0. The only remaining DR warning is escrow_missing=5.
2026-06-13 01:28 update: post-CD live readback still shows remote_verify_ok=1, full_verify_fresh=1, and all 13 repos snapshot_count=1; backup core remains green after deploy marker e4a349bc.
2026-06-05 manual remediation:
- 16:00 live check still had 120 unreachable, stale110=awoooi_db, backup_all failed=6, and escrow_missing=5.
- 14:00 AWOOOI high-frequency backup failed, then 16:01 manual rerun completed snapshot b7d5ee4e.
- 02:00 Gitea failure was caused by a stale in-container /tmp/gitea-dump.zip; it was renamed in-container to /tmp/gitea-dump.stale.20260605_161032.zip, then Gitea backup completed snapshot ea641613.
- scripts/backup/backup-gitea.sh and live 110 /backup/scripts/backup-gitea.sh now preserve stale container dump files with timestamped names before starting a new dump.
- 110 -> 188 SSH known_hosts was refreshed after fingerprint match for 192.168.0.188; Open-WebUI backup completed snapshot d1147507.
- ClawBot backup completed snapshot 73ead3cc; BGSAVE still warned, but the Redis volume backup succeeded.
- AI artifacts backup completed snapshot b1161ab8.
- Full offsite sync was skipped by runway gate because the next scheduled backup was too close; partial sync for awoooi gitea open-webui clawbot ai-artifacts completed 5/5.
- 18:39 full verifier confirmed all 13 Google Drive/rclone repos have remote snapshots=1, REMOTE_LATEST_ONLY_OK=1, and VERIFY_OK=1.
- 18:40 backup status still reports failed=6 / core_blockers=6 because the 02:00 aggregate history remains failed until backup-all reruns after 120 returns. Do not mark aggregate backup green from individual backup success alone.
2026-06-06 convergence evidence:
- 14:46 live check: 120 still fails ping/SSH and K3s mon remains NotReady,SchedulingDisabled.
- 02:00 aggregate backup failed only Configs; the log line reads 全服務備份完成 (1532s) - 1 個失敗 (12/13), that is, the all-service backup finished in 1532s with 1 failure (12/13).
- 14:58 backup-status.sh --no-notify: stale110=none, stale188=none, failed=1, core_blockers=1, escrow_missing=5.
- 14:46 verify-offsite-full-sync.sh --write-textfile --no-color: all 13 remote repos have one snapshot, REMOTE_LATEST_ONLY_OK=1, VERIFY_OK=1.
- 15:03 cold-start scorecard: PASS=71 WARN=3 BLOCKED=3; direct 188 checks still show momo-scheduler healthy with recent log activity, and the scheduler WARN is no longer present in the scorecard.
- 15:03 credential escrow report: rclone/offsite readiness is configured, but restic_repository_password, offsite_provider_credentials, break_glass_admin_credentials, dns_registrar_recovery, and oauth_ai_provider_recovery still lack non-secret evidence markers. Do not write placeholders or secrets.
Full crontab schedule (110)
0 2 * * * backup-all.sh ← full backup of 9 services
0 8,14,20 * * * backup-awoooi-frequent.sh ← AWOOOI high-frequency (every 6 hours)
0 3 * * * sync-offsite-backups.sh --mode sync ← Google Drive/rclone gated sync
5 6 * * * backup-status.sh ← once-daily backup status summary; avoids success-heartbeat noise
20 7 * * * verify-offsite-full-sync.sh --write-textfile ← Google Drive/rclone latest-only verification
Backup architecture
192.168.0.110 (/backup/scripts/backup-all.sh) daily 02:00
├── [1/9] backup-gitea.sh → gitea dump → /backup/gitea
├── [2/9] backup-momo.sh → SSH 188 pg_dump momo → /backup/momo
├── [3/9] backup-harbor.sh → harbor dump → /backup/harbor
├── [4/9] backup-awoooi.sh → SSH 188 pg_dump awoooi_prod/dev/k3s → /backup/awoooi
├── [5/9] backup-langfuse.sh → docker exec langfuse-db pg_dump → /backup/langfuse
├── [6/9] backup-monitoring.sh → volumes prometheus/grafana/alertmanager → /backup/monitoring
├── [7/9] backup-signoz.sh → volumes signoz-clickhouse/sqlite → /backup/signoz
├── [8/9] backup-open-webui.sh → SSH 188 volume open-webui → /backup/open-webui
└── [9/9] backup-clawbot.sh → SSH 188 volume clawbot-redis → /backup/clawbot
Backup failure → notify_clawbot("failed") → /webhook/custom or AwoooP/Alertmanager path → Telegram 🔴
Backup success → textfile / Prometheus / 06:05 status summary, no immediate push
192.168.0.188 (Velero) daily 02:00
└── K8s resource snapshot → MinIO :9000 (bucket: velero)
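
The sequential flow in the diagram, together with the rule that the aggregate is green only when every step succeeds, can be sketched as a loop. This is a hypothetical outline and not the deployed backup-all.sh: the function name `run_all` and the summary format are assumptions, and each step is assumed to be a command or function that returns non-zero on failure.

```shell
# Hypothetical orchestrator loop: run every step, keep going on failure,
# and report one aggregate result at the end.
run_all() {
  total=0 failed=0
  for step in "$@"; do
    total=$((total + 1))
    if ! "$step"; then
      failed=$((failed + 1))
      echo "FAILED: $step" >&2   # a real script would send the alert here
    fi
  done
  echo "done: $((total - failed))/$total ok, failed=$failed"
  [ "$failed" -eq 0 ]            # aggregate status: green only if all passed
}
```

Continuing after a failed step lets one broken service leave the other backups intact, while the non-zero aggregate exit status prevents the run from being reported as green.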
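Not yet backed up (notes)
| Service | Reason | Notes |
|---|---|---|
| Prometheus TSDB | Raw metric data (not configuration); TSDB has its own 30d TTL | Low priority; Grafana configuration is already backed up |
| Sentry | Not currently running (docker ps is empty) | Volume exists; re-evaluate after redeployment |
| Redis (AWOOOI) | Cache/WorkingMemory; no persistent business data | Low priority |
| Velero MinIO data | MinIO is a backup of a backup and needs its own offsite copy | B2/S3 offsite to be evaluated |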
Verification SOP
# Latest backup log
ssh wooo@192.168.0.110 "tail -50 /backup/logs/backup.log"
# Snapshot count per service.
# Counts short_id entries in the JSON listing; grep -c on the text table counts matching lines, not snapshots.
ssh wooo@192.168.0.110 "for r in gitea momo harbor awoooi langfuse monitoring signoz open-webui clawbot; do
echo -n \"\$r: \"
restic -r /backup/\$r snapshots --json --password-file /backup/scripts/.restic-password 2>/dev/null | grep -o '\"short_id\"' | wc -l
done"
# Alert test
ssh wooo@192.168.0.110 "source /backup/scripts/common.sh && notify_clawbot 'warning' 'manual-test' '手動告警測試' 0"
Related documents
- REBOOT-RECOVERY-SOP.md - reboot recovery SOP
- scripts/backup/ - all backup scripts (Git version)
- /backup/scripts/ (on 110) - scripts as actually deployed