awoooi/ops/reboot-recovery/full-stack-backup-baseline.yml
2026-05-29 12:41:34 +08:00
version: 2026-05-19.v7
scope: "Backup baseline for all services, data, configuration, and restore verification on 110/120/121/188"
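principles:
- "Data backups and configuration backups are layered separately: DB/PV/object data cover the data; configs cover the bootable state."
- "Secrets, TLS private keys, and SSH host keys may go into encrypted restic/Velero backups, but must never be printed to logs, the repo, or Telegram."
- "The backup system itself must also be backed up: restic repository health, password/key escrow, offsite copy, and restore drill evidence are all required."
- "Every backup must have three pieces of evidence: the schedule exists, a recent success time, and a restore or dry-run verification."
- "AI auto-remediation defaults to observe-only in the backup/restore domain: deletion, DROP DB, and overwriting a production namespace are forbidden without evidence of a new successful backup and a passed baseline gate."
- "From 2026-05-19 the backup retention policy is latest-only: each local restic repo, the 188 MOMO file backup, and the Google Drive/rclone offsite mirror keep only the latest copy."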
backup_domains:
- id: host_configs
owner_host: "110"
script: "/backup/scripts/backup-configs.sh"
repository: "/backup/configs"
schedule: "daily via /backup/scripts/backup-all.sh"
max_age_hours: 48
includes:
- "110/188/120/121: /etc/nginx, /etc/systemd/system, /etc/cron.d, /etc/crontab"
- "110/188/120/121: /etc/letsencrypt, /etc/ssh, /etc/fstab, /etc/hosts, /etc/netplan"
- "110: /opt/harbor, /opt/sentry, /home/wooo/monitoring, /home/wooo/scripts, /backup/scripts"
- "188: /opt/n8n, /opt/open-webui, /opt/litellm, /opt/signoz, /home/ollama/momo-pro, /home/ollama/bin"
- "120/121: /etc/rancher/k3s, K3s manifests, containerd/keepalived host config"
- "K8s: workloads, services, ingress, configmaps, secrets, RBAC, PV/PVC, CRDs, Velero schedules/backups"
restore_test: "Sample a restic restore into an isolated directory and confirm the nginx/systemd/K8s YAML is readable; never overwrite production directly."
- id: awoooi_databases
owner_host: "110"
scripts:
- "/backup/scripts/backup-awoooi.sh"
- "/backup/scripts/backup-awoooi-frequent.sh"
repository: "/backup/awoooi"
schedule: "daily 02:00 + high-frequency 08:00/14:00/20:00"
max_age_hours: 7
includes:
- "awoooi_prod"
- "awoooi_dev"
- "k3s_datastore if present"
restore_test: "pg_restore/psql into an isolated DB; verify the schema and the row counts of core tables; never overwrite the production DB."
- id: gitea_and_ci
owner_host: "110"
repository: "/backup/gitea"
schedule: "daily via backup-all"
max_age_hours: 48
includes:
- "Gitea DB"
- "Git repositories"
- "Gitea app.ini and runner registration/config evidence"
- "workflow definitions from repos"
restore_test: "Sample git fsck / git clone; the Gitea DB dump is readable."
- id: harbor_registry
owner_host: "110"
repository: "/backup/harbor"
schedule: "daily via backup-all"
max_age_hours: 48
includes:
- "Harbor DB/config"
- "registry storage"
- "TLS/config state from configs backup"
restore_test: "Sampled registry manifests/blobs are readable; Harbor compose/config can be rebuilt."
- id: observability
owner_host: "110"
repositories:
- "/backup/monitoring"
- "/backup/signoz"
schedule: "daily via backup-all"
max_age_hours: 48
includes:
- "Prometheus TSDB"
- "Grafana dashboards/datasources"
- "Alertmanager config/state"
- "SignOz ClickHouse/SQLite/config"
- "blackbox/node-exporter textfile config"
restore_test: "Prometheus/Grafana/Alertmanager config passes lint; the SignOz dump can list tables."
- id: sentry
owner_host: "110"
coverage_status: "covered_by_backup_sentry_script"
script: "/backup/scripts/backup-sentry.sh"
repository: "/backup/sentry"
schedule: "daily via backup-all; config also covered by /backup/configs"
max_age_hours: 48
includes:
- "Sentry compose/.env/config"
- "Sentry Postgres logical dump"
- "Sentry ClickHouse volume snapshot and table inventory"
- "Sentry Kafka queue volume snapshot"
- "Sentry Redis / SeaweedFS / Taskbroker / Vroom / Symbolicator state"
restore_test: "First verify in an isolated compose stack that the Postgres dump is readable, the ClickHouse volume can be mounted, and web/symbolicator/snuba can start."
- id: credential_escrow
owner_host: "human-controlled"
coverage_status: "gap_p0_out_of_band_escrow_required"
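repository: "Must not live in the same restic repo; keep it in a password manager or an offline encrypted vault"
schedule: "Update the escrow immediately after every Secret addition/rotation; manual spot check every month"
max_age_hours: 744
includes:
- "restic password files / repository keys / Google Drive rclone.conf / offsite provider credentials"
- "Cloud DNS / registrar / CDN / tunnel management accounts and recovery codes"
- "Gitea/Harbor/Sentry/admin break-glass credentials"
- "Recovery paths for Git deploy keys, runner registration tokens, and K8s bootstrap/admin kubeconfig"
- "Rotation and recovery procedures for Google Drive / OAuth / Telegram / AI provider tokens, with no plaintext output"
restore_test: "Confirm through a manual two-person review that the key escrow can be found, can be decrypted, and can be used to list snapshots; never write Secret values into the repo or monitoring labels."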
- id: external_dns_and_public_routes
owner_host: "110"
coverage_status: "covered_by_public_route_evidence_backup; provider_zone_export_still_requires_credentials"
script: "/backup/scripts/backup-public-routes.sh"
repository: "/backup/public-routes"
schedule: "daily via backup-all; DNS/CDN provider zone export after every routing change when credentials are available"
max_age_hours: 168
includes:
- "wooo.work DNS answers; exporting CDN/Cloudflare/registrar settings still requires a provider token"
- "public nginx route map, TLS renewal config, ACME account evidence"
- "blackbox public endpoint inventory and expected status codes"
- "VPN/tunnel/port-forward/HA VIP public routing config"
restore_test: "Rebuild the public route checklist from the export files and confirm that endpoints such as awoooi/mo/registry/harbor/gitea map correctly; never change production DNS during the test."
- id: backup_repositories_and_integrity
owner_host: "110/188/121/offsite"
coverage_status: "covered_locally_by_check_backup_integrity_script; offsite copy still depends on credentials"
scripts:
- "/backup/scripts/check-backup-integrity.sh"
- "/backup/scripts/configure-offsite-rclone.sh"
- "/backup/scripts/configure-offsite-b2.sh"
- "/backup/scripts/sync-offsite-backups.sh"
- "/backup/scripts/backup-offsite-readiness-gate.sh"
- "/backup/scripts/offsite-escrow-evidence-report.sh"
- "/backup/scripts/mark-credential-escrow-verified.sh"
repositories:
- "/backup/* restic repos"
- "/home/ollama/backup/110"
- "Google Drive/rclone/offsite remote when credentials are configured"
schedule: "daily freshness; daily 06:10 offsite status; daily 06:15 offsite escrow evidence report; weekly restic check; monthly sample restore drill"
max_age_hours: 168
includes:
- "restic snapshots metadata, repo config, locks/prune policy"
- "188 backup-from-110 rsync copy"
- "offsite copy status and retention policy"
- "restore drill logs with snapshot id and restored object counts"
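restore_test: "Weekly `restic check --read-data-subset=1%`; monthly `restic dump latest <sample>` into a 0700 temp directory to verify readability."
retention_policy: "latest-only: after a new snapshot succeeds on a local restic repo, run --group-by \"\" --keep-last=1 + prune; the 188 MOMO file backup keeps only the latest copy; the offsite Google Drive/rclone mirror follows the local repo and deletes old files."
offsite_sync_policy: "offsite-escrow-evidence-report.sh first produces redacted evidence and NEXT_STEP; backup-offsite-readiness-gate.sh then runs status / dry-run-small / pre-full-sync; sync-offsite-backups.sh defaults to status, and dry-run may be run at any time; a Google Drive/rclone full sync must be scheduled in an off-peak window, writes /backup/offsite/rclone-last-success only after it succeeds, and deletes old remote files when OFFSITE_SYNC_DELETE_OLD=1. A full sync must not overlap with local backup jobs and must be at least 270 minutes away from the next scheduled backup."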
- id: momo_web_and_data
owner_host: "188"
scripts:
- "/backup/scripts/backup-momo.sh on 110"
- "/home/ollama/bin/momo-pg-backup.sh on 188"
repositories:
- "/backup/momo"
- "/home/ollama/momo_backups"
schedule: "110 daily + 188 daily 02:00"
max_age_hours: 30
includes:
- "mo.wooo.work app DB"
- "momo uploads/files/config"
- "scheduler config and cron"
restore_test: "After restoring into an isolated DB, run the app health check and confirm the tables and row counts that mo.wooo.work needs."
- id: ai_and_tooling
owner_host: "188"
coverage_status: "covered_by_backup_ai_artifacts_for_manifest_and_metadata; model_blobs_require_manual_classification"
script: "/backup/scripts/backup-ai-artifacts.sh"
repositories:
- "/backup/langfuse"
- "/backup/open-webui"
- "/backup/clawbot"
- "/backup/configs"
- "/backup/ai-artifacts"
schedule: "daily via backup-all"
max_age_hours: 48
includes:
- "Langfuse traces/evaluations"
- "Open-WebUI conversations/config"
- "LiteLLM config, model routing, provider state"
- "OpenClaw/ClawBot Redis or persistent state"
- "n8n workflows/credentials through encrypted config backup"
- "Ollama model manifest/tag list/Modelfile; back up blobs only for self-built or non-redownloadable models/adapters"
- "KM/RAG/vector state: if it lives in the AWOOOI DB, it is restored with the DB dump; an external vector store must have its own dump"
restore_test: "Sample-export workflow/config; the Redis dump is readable; the Langfuse/Open-WebUI DB dumps are readable; the Ollama manifest tar can list model tags."
- id: source_of_truth_and_ops_memory
owner_host: "110/Gitea"
coverage_status: "gap_p1_sanitized_operational_context"
repositories:
- "/backup/gitea"
- "/backup/configs"
schedule: "Gitea daily; configs daily; update docs/LOGBOOK.md and runbooks after every incident"
max_age_hours: 48
includes:
- "All Git repositories, Ansible roles/playbooks/inventory, K8s manifests, monitoring rules"
- "Decision and startup-order documents such as AGENTS/HARD_RULES/runbooks/LOGBOOK/ADR"
- "AI agent handoff summaries and operational memory exports after sanitization"
- "CI/CD workflow definitions, runner labels, deployment marker policy"
restore_test: "Sample-clone a repo from the Gitea backup and run ansible/k8s/alerts YAML validation; never back up chats or shell transcripts that contain plaintext tokens."
- id: k3s_and_velero
owner_host: "120"
schedule: "Velero daily-awoooi-prod + weekly restore dry-run"
max_age_hours: 25
includes:
- "K8s manifests and CRDs"
- "Secrets/ConfigMaps/RBAC"
- "PVC/PV snapshots via Velero provider"
- "backup-restore-test CronJob and result metrics"
restore_test: "The backup-restore-test CronJob runs a weekly dry-run into the restore-test-dry namespace mapping."
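# Illustrative sketch only, not part of the baseline: a Velero Schedule that could
# back the "daily-awoooi-prod" entry above. The cron time, the velero namespace,
# the awoooi-prod namespace name, and the ttl are assumptions; only the schedule
# name comes from this file.
#
# apiVersion: velero.io/v1
# kind: Schedule
# metadata:
#   name: daily-awoooi-prod
#   namespace: velero              # assumed install namespace
# spec:
#   schedule: "0 3 * * *"          # assumed time; the real time is not stated here
#   template:
#     includedNamespaces:
#       - awoooi-prod              # assumed namespace name
#     snapshotVolumes: true        # PVC/PV snapshots via the Velero provider
#     ttl: 48h0m0s                 # assumed retention for the backup object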
- id: offsite_and_dr
owner_host: "188/121"
schedule: "188 backup-from-110 daily 01:00; 121 DR drill monthly"
max_age_hours: 25
includes:
- "110 Harbor/Gitea/bitan rsync copy on 188"
- "DR drill evidence on 121"
- "Google Drive/rclone remote when credentials are configured"
restore_test: "121 DR drill dry-run finds latest Completed Velero backup; 188 backup-from-110 textfile fresh."
monitoring_contract:
textfile_metrics:
"110": "/home/wooo/node_exporter_textfiles/backup_health.prom"
"188": "/home/ollama/node_exporter_textfiles/backup_health.prom"
"120": "Reported through the 110 backup_health.prom, which queries Velero/CronJob/Job status via kubectl on 120"
offsite_and_escrow_metrics:
- "awoooi_backup_offsite_configured: reports only whether Google Drive/rclone or a compatible provider appears to be configured; never outputs credential values."
- "awoooi_backup_offsite_fresh: judges from /backup/offsite/*last_success style markers whether the offsite sync is fresh."
- "awoooi_backup_offsite_partial_fresh: judges from the small-scope partial sync marker whether the Google Drive/rclone write path has been proven."
- "awoooi_backup_credential_escrow_fresh: judges from /backup/escrow-evidence/*.last_verified style markers whether the manual vault review was completed within 31 days."
- "awoooi_backup_dr_next_step_info: uses the next_step label to tell the AI patrol and the operator the next safe manual task; contains no secret."
- "awoooi_backup_dr_credential_escrow_missing_count: number of items still missing from the vault review."
- "awoooi_backup_cron_active_duplicate_count: number of exact duplicate entries in the active crontab on 110."
- "awoooi_backup_cron_singular_entry_ok: whether single-entry schedules such as offsite/status/verifier/exporter have exactly one active cron entry."
- "awoooi_backup_config_capture_ok: whether the latest configs snapshot actually captured the 110/120/121/188 host config and K8s workloads/secrets; never outputs secrets."
- "awoooi_backup_config_capture_critical_failed_count: number of critical capture targets missing from the latest config backup."
prometheus_alerts:
- BackupHealthMonitorMissing110
- BackupHealthMonitorMissing188
- BackupHealthMonitorStale
- BackupExpectedJobMissing
- BackupScheduleDuplicateActiveEntries
- BackupScheduleSingletonMismatch
- BackupScriptMissing
- BackupJobStale
- BackupAggregateRunFailed
- BackupConfigCapturePartial
- BackupConfigCaptureStatusStale
- BackupIntegrityCheckMissingOrFailed
- BackupRestoreDrillMissingOrFailed
- BackupRestoreTestMissing
- BackupRestoreTestCronMissing
- BackupRestoreTestFailed
- BackupRestoreTestStale
- BackupOffsiteCopyNotConfigured
- BackupOffsiteCopyStale
- BackupCredentialEscrowEvidenceMissing
- BackupRetentionPolicyNotLatestOnly
- BackupSnapshotRetentionExceeded
- BackupOffsiteFullVerifyFailed
- BackupOffsiteRemoteSnapshotRetentionExceeded
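# Illustrative sketch only, not part of the baseline: how one of the alerts listed
# above might be written as a Prometheus alerting rule. The alert name and the
# awoooi_backup_offsite_configured metric come from this file; the group name,
# the "for" duration, the severity label, and the summary text are assumptions.
#
# groups:
#   - name: backup-visibility-example        # hypothetical group name
#     rules:
#       - alert: BackupOffsiteCopyNotConfigured
#         expr: awoooi_backup_offsite_configured{host="110"} == 0
#         for: 10m                            # assumed hold time
#         labels:
#           severity: warning                 # assumed severity
#         annotations:
#           summary: "Offsite backup copy is not configured on host 110"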
live_visibility_checks:
- "If awoooi_backup_offsite_configured{host=\"110\"} is 0, Prometheus must show BackupOffsiteCopyNotConfigured firing and Alertmanager must have an active alert."
- "If the offsite provider is configured, the full marker is not yet fresh, and the full sync enable marker is missing or older than 30 hours, BackupOffsiteCopyStale must be visible in Prometheus and Alertmanager."
- "If awoooi_backup_credential_escrow_fresh{host=\"110\"} == 0, BackupCredentialEscrowEvidenceMissing must be visible per item in Prometheus and Alertmanager."
- "If awoooi_backup_retention_latest_only{host=\"110\"} or awoooi_backup_retention_offsite_delete_old_enabled{host=\"110\",provider=\"rclone\"} is missing or not 1, BackupRetentionPolicyNotLatestOnly must be visible in Prometheus and Alertmanager."
- "If any awoooi_backup_job_snapshot_count{host=\"110\",type=\"restic\"} > 1, BackupSnapshotRetentionExceeded must be visible in Prometheus and Alertmanager."
- "If the full offsite marker is fresh but awoooi_backup_offsite_remote_verify_ok{host=\"110\",provider=\"rclone\"} is not 1 or is missing, BackupOffsiteFullVerifyFailed must be visible in Prometheus."
- "If the full offsite marker is fresh and any awoooi_backup_offsite_remote_snapshot_count{host=\"110\",provider=\"rclone\"} > 1, BackupOffsiteRemoteSnapshotRetentionExceeded must be visible in Prometheus."
- "If awoooi_backup_cron_active_duplicate_count{host=\"110\"} > 0, BackupScheduleDuplicateActiveEntries must be visible in Prometheus and Alertmanager."
- "If any awoooi_backup_cron_singular_entry_ok{host=\"110\"} == 0, BackupScheduleSingletonMismatch must be visible in Prometheus and Alertmanager."
- "If any awoooi_backup_config_capture_ok{host=\"110\",critical=\"true\"} == 0, BackupConfigCapturePartial must be visible in Prometheus and Alertmanager, and the target label must identify which config source is missing."
- "If awoooi_backup_config_capture_status_timestamp is missing or older than 48 hours, BackupConfigCaptureStatusStale must be visible in Prometheus and Alertmanager."
- "The live visibility check only reads the Prometheus / Alertmanager APIs; it does not send test alerts, change silences, change routes, or trigger remediation."
prometheus_recording_rules:
- awoooi_recovery_core_ready
- awoooi_recovery_dr_offsite_ready
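# Illustrative sketch only, not part of the baseline: the general shape a
# Prometheus recording rule for awoooi_recovery_core_ready could take. The real
# expression is not given in this file; the one below is an assumption that
# simply takes the minimum of awoooi_backup_job_fresh, so the recorded value is
# 1 only when every reported job is fresh.
#
# groups:
#   - name: recovery-readiness-example       # hypothetical group name
#     rules:
#       - record: awoooi_recovery_core_ready
#         expr: min(awoooi_backup_job_fresh)  # assumed expression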
release_gate:
cold_start_script: "bash scripts/reboot-recovery/full-stack-cold-start-check.sh --monitor-read-only --no-color"
p3_script: "bash scripts/reboot-recovery/p3-controlled-release-gate.sh"
recovery_core_scorecard: "bash scripts/reboot-recovery/full-stack-recovery-scorecard.sh --require-core"
dr_offsite_operator_checklist: "bash scripts/reboot-recovery/dr-offsite-operator-checklist.sh"
dr_offsite_scorecard: "bash scripts/reboot-recovery/full-stack-recovery-scorecard.sh --require-dr"
dr_offsite_final_gate: "bash scripts/reboot-recovery/dr-offsite-operator-checklist.sh --require-dr"
dr_offsite_post_marker_wait: "bash scripts/reboot-recovery/wait-dr-offsite-ready.sh --timeout-seconds 900 --interval-seconds 30 --no-color"
required_green:
- "backup_health.prom fresh on 110/188"
- "awoooi_backup_job_fresh == 1 for every expected job"
- "Velero latest Completed backup < 25h"
- "backup-restore-test CronJob present and lastSuccessfulTime not stale"
- "weekly restic check successful"
- "monthly sample restore drill successful"
warning_until_human_escrow_ready:
- "offsite provider configured and latest offsite copy marker fresh"
- "credential escrow marker files refreshed after human verification; marker files must contain only timestamp/evidence id, never secret values"
strict_dr_exit_conditions:
- "Google Drive/rclone provider configured on 110 host-local rclone.conf; /backup/scripts/offsite.env keeps only non-secret remote/path with mode 0600"
- "credential escrow markers fresh for restic_repository_password, offsite_provider_credentials, break_glass_admin_credentials, dns_registrar_recovery, oauth_ai_provider_recovery"
- "full offsite marker /backup/offsite/rclone-last-success fresh after full 13 repo sync"
- "full-stack-recovery-scorecard.sh --require-dr exits 0"
- "recovery-scorecard-contract-check.py --expect-dr-ready exits 0 against 110 Prometheus"
- "dr-offsite-operator-checklist.sh --require-dr exits 0 after scorecard, Prometheus recording rule, and backup alert visibility contract agree"
- "wait-dr-offsite-ready.sh exits 0 after post-marker textfile, Prometheus, Alertmanager, and final checklist convergence"