version: 2026-05-19.v7
scope: "Backup baseline for all services, data, configuration, and restore verification on 110/120/121/188"

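principles:
  - "Data backups and configuration backups are separate layers: DB/PV/object data covers the data; configs covers the bootable state."
  - "Secrets, TLS private keys, and SSH host keys may go into encrypted restic/Velero backups, but must never be printed to logs, repos, or Telegram."
  - "The backup system itself must also be backed up: restic repository health, password/key escrow, offsite copy, and restore drill evidence are all required."
  - "Every backup must have three pieces of evidence: the schedule exists, a recent success timestamp, and a restore or dry-run verification."
  - "AI auto-remediation defaults to observe-only in the backup/restore domain; deleting data, DROP DB, or overwriting a production namespace is forbidden without evidence of a fresh successful backup and a passing baseline gate."
  - "As of 2026-05-19 the backup retention policy is latest-only: each local restic repo, the 188 MOMO file backup, and the Google Drive/rclone offsite mirror keep only the latest copy."
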
backup_domains:
  - id: host_configs
    owner_host: "110"
    script: "/backup/scripts/backup-configs.sh"
    repository: "/backup/configs"
    schedule: "daily via /backup/scripts/backup-all.sh"
    max_age_hours: 48
    includes:
      - "110/188/120/121: /etc/nginx, /etc/systemd/system, /etc/cron.d, /etc/crontab"
      - "110/188/120/121: /etc/letsencrypt, /etc/ssh, /etc/fstab, /etc/hosts, /etc/netplan"
      - "110: /opt/harbor, /opt/sentry, /home/wooo/monitoring, /home/wooo/scripts, /backup/scripts"
      - "188: /opt/n8n, /opt/open-webui, /opt/litellm, /opt/signoz, /home/ollama/momo-pro, /home/ollama/bin"
      - "120/121: /etc/rancher/k3s, K3s manifests, containerd/keepalived host config"
      - "K8s: workloads, services, ingress, configmaps, secrets, RBAC, PV/PVC, CRDs, Velero schedules/backups"
    restore_test: "Sample restic restore into an isolated directory and confirm the nginx/systemd/K8s YAML is readable; never overwrite production directly."

  - id: awoooi_databases
    owner_host: "110"
    scripts:
      - "/backup/scripts/backup-awoooi.sh"
      - "/backup/scripts/backup-awoooi-frequent.sh"
    repository: "/backup/awoooi"
    schedule: "daily 02:00 + high-frequency 08:00/14:00/20:00"
    max_age_hours: 7
    includes:
      - "awoooi_prod"
      - "awoooi_dev"
      - "k3s_datastore if present"
    restore_test: "pg_restore/psql into an isolated DB and verify the schema and core table row counts; never overwrite the production DB."

  - id: gitea_and_ci
    owner_host: "110"
    repository: "/backup/gitea"
    schedule: "daily via backup-all"
    max_age_hours: 48
    includes:
      - "Gitea DB"
      - "Git repositories"
      - "Gitea app.ini and runner registration/config evidence"
      - "workflow definitions from repos"
    restore_test: "Sample git fsck / git clone; the Gitea DB dump is readable."

  - id: harbor_registry
    owner_host: "110"
    repository: "/backup/harbor"
    schedule: "daily via backup-all"
    max_age_hours: 48
    includes:
      - "Harbor DB/config"
      - "registry storage"
      - "TLS/config state from configs backup"
    restore_test: "Sampled registry manifests/blobs are readable; Harbor compose/config can be rebuilt."

  - id: observability
    owner_host: "110"
    repositories:
      - "/backup/monitoring"
      - "/backup/signoz"
    schedule: "daily via backup-all"
    max_age_hours: 48
    includes:
      - "Prometheus TSDB"
      - "Grafana dashboards/datasources"
      - "Alertmanager config/state"
      - "SigNoz ClickHouse/SQLite/config"
      - "blackbox/node-exporter textfile config"
    restore_test: "Lint the Prometheus/Grafana/Alertmanager configuration; the SigNoz dump can list its tables."

  - id: sentry
    owner_host: "110"
    coverage_status: "covered_by_backup_sentry_script"
    script: "/backup/scripts/backup-sentry.sh"
    repository: "/backup/sentry"
    schedule: "daily via backup-all; config also covered by /backup/configs"
    max_age_hours: 48
    includes:
      - "Sentry compose/.env/config"
      - "Sentry Postgres logical dump"
      - "Sentry ClickHouse volume snapshot and table inventory"
      - "Sentry Kafka queue volume snapshot"
      - "Sentry Redis / SeaweedFS / Taskbroker / Vroom / Symbolicator state"
    restore_test: "First verify in an isolated compose stack that the Postgres dump is readable, the ClickHouse volume can be mounted, and web/symbolicator/snuba can start."

  - id: credential_escrow
    owner_host: "human-controlled"
    coverage_status: "gap_p0_out_of_band_escrow_required"
    repository: "Must not live in the same restic repo; keep it in a password manager or an offline encrypted vault"
    schedule: "Update escrow immediately after every Secret is added or rotated; manual spot check monthly"
    max_age_hours: 744
    includes:
      - "restic password files / repository keys / Google Drive rclone.conf / offsite provider credentials"
      - "Cloud DNS / registrar / CDN / tunnel admin accounts and recovery codes"
      - "Gitea/Harbor/Sentry/admin break-glass credentials"
      - "Recovery paths for Git deploy keys, runner registration tokens, and K8s bootstrap/admin kubeconfig"
      - "Rotation and recovery procedures for Google Drive / OAuth / Telegram / AI provider tokens, with no plaintext output"
    restore_test: "A two-person manual review confirms the key escrow can be located, decrypted, and used to list snapshots; Secret values must never be written to a repo or a monitoring label."

  - id: external_dns_and_public_routes
    owner_host: "110"
    coverage_status: "covered_by_public_route_evidence_backup; provider_zone_export_still_requires_credentials"
    script: "/backup/scripts/backup-public-routes.sh"
    repository: "/backup/public-routes"
    schedule: "daily via backup-all; DNS/CDN provider zone export after every routing change when credentials are available"
    max_age_hours: 168
    includes:
      - "wooo.work DNS answers; CDN/Cloudflare/registrar configuration export still requires a provider token"
      - "public nginx route map, TLS renewal config, ACME account evidence"
      - "blackbox public endpoint inventory and expected status codes"
      - "Public-facing routing configuration for VPN/tunnel/port-forward/HA VIP"
    restore_test: "Rebuild the public route checklist from the export and confirm that endpoints such as awoooi/mo/registry/harbor/gitea map correctly; never change production DNS during the test."

  - id: backup_repositories_and_integrity
    owner_host: "110/188/121/offsite"
    coverage_status: "covered_locally_by_check_backup_integrity_script; offsite copy still depends on credentials"
    scripts:
      - "/backup/scripts/check-backup-integrity.sh"
      - "/backup/scripts/configure-offsite-rclone.sh"
      - "/backup/scripts/configure-offsite-b2.sh"
      - "/backup/scripts/sync-offsite-backups.sh"
      - "/backup/scripts/backup-offsite-readiness-gate.sh"
      - "/backup/scripts/offsite-escrow-evidence-report.sh"
      - "/backup/scripts/mark-credential-escrow-verified.sh"
    repositories:
      - "/backup/* restic repos"
      - "/home/ollama/backup/110"
      - "Google Drive/rclone/offsite remote when credentials are configured"
    schedule: "daily freshness; daily 06:10 offsite status; daily 06:15 offsite escrow evidence report; weekly restic check; monthly sample restore drill"
    max_age_hours: 168
    includes:
      - "restic snapshot metadata, repo config, locks/prune policy"
      - "188 backup-from-110 rsync copy"
      - "offsite copy status and retention policy"
      - "restore drill logs with snapshot id and restored object counts"
    restore_test: "Weekly `restic check --read-data-subset=1%`; monthly `restic dump latest <sample>` into a 0700 temp directory to verify readability."
    retention_policy: "latest-only; after a new snapshot succeeds, each local restic repo runs --group-by \"\" --keep-last=1 + prune; the 188 MOMO file backup keeps only the latest copy; the offsite Google Drive/rclone mirror follows the local repo and deletes old files."
    offsite_sync_policy: "offsite-escrow-evidence-report.sh first produces redacted evidence and NEXT_STEP; backup-offsite-readiness-gate.sh then runs status / dry-run-small / pre-full-sync; sync-offsite-backups.sh defaults to status; dry-run may be run at any time; a Google Drive/rclone full sync must be scheduled in an off-peak window, writes /backup/offsite/rclone-last-success only after it succeeds, and deletes old remote files when OFFSITE_SYNC_DELETE_OLD=1. A full sync must not overlap with local backup jobs and must be at least 270 minutes away from the next scheduled backup."
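    # Illustrative sketch only, not part of this baseline: the offsite_sync_policy
    # string above could also be written as ordered steps. The key names
    # (offsite_sync_steps, step, note, modes, success_marker,
    # min_minutes_before_next_backup) are hypothetical; the script names, modes,
    # marker path, and the 270-minute gap come from the policy text.
    #
    # offsite_sync_steps:
    #   - step: "/backup/scripts/offsite-escrow-evidence-report.sh"
    #     note: "emit redacted evidence and NEXT_STEP"
    #   - step: "/backup/scripts/backup-offsite-readiness-gate.sh"
    #     modes: ["status", "dry-run-small", "pre-full-sync"]
    #   - step: "/backup/scripts/sync-offsite-backups.sh"
    #     modes: ["status", "dry-run", "full"]   # status is the default
    #     success_marker: "/backup/offsite/rclone-last-success"
    #     min_minutes_before_next_backup: 270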

  - id: momo_web_and_data
    owner_host: "188"
    scripts:
      - "/backup/scripts/backup-momo.sh on 110"
      - "/home/ollama/bin/momo-pg-backup.sh on 188"
    repositories:
      - "/backup/momo"
      - "/home/ollama/momo_backups"
    schedule: "110 daily + 188 daily 02:00"
    max_age_hours: 30
    includes:
      - "mo.wooo.work app DB"
      - "momo uploads/files/config"
      - "scheduler config and cron"
    restore_test: "After restoring into an isolated DB, run the app health check; confirm the tables and row counts that mo.wooo.work needs."

  - id: ai_and_tooling
    owner_host: "188"
    coverage_status: "covered_by_backup_ai_artifacts_for_manifest_and_metadata; model_blobs_require_manual_classification"
    script: "/backup/scripts/backup-ai-artifacts.sh"
    repositories:
      - "/backup/langfuse"
      - "/backup/open-webui"
      - "/backup/clawbot"
      - "/backup/configs"
      - "/backup/ai-artifacts"
    schedule: "daily via backup-all"
    max_age_hours: 48
    includes:
      - "Langfuse traces/evaluations"
      - "Open-WebUI conversations/config"
      - "LiteLLM config, model routing, provider state"
      - "OpenClaw/ClawBot Redis or persistent state"
      - "n8n workflows/credentials through encrypted config backup"
      - "Ollama model manifest/tag list/Modelfile; blobs are backed up only for custom or non-redownloadable models/adapters"
      - "KM/RAG/vector state; if stored in the AWOOOI DB it is restored with the DB dump; an external vector store must have its own dump"
    restore_test: "Sample export of workflow/config; the Redis dump is readable; the Langfuse/Open-WebUI DB dumps are readable; the Ollama manifest tar can list model tags."

  - id: source_of_truth_and_ops_memory
    owner_host: "110/Gitea"
    coverage_status: "gap_p1_sanitized_operational_context"
    repositories:
      - "/backup/gitea"
      - "/backup/configs"
    schedule: "Gitea daily; configs daily; update docs/LOGBOOK.md and runbooks after every incident"
    max_age_hours: 48
    includes:
      - "All Git repositories, Ansible roles/playbooks/inventory, K8s manifests, monitoring rules"
      - "Decision and startup-order documents such as AGENTS/HARD_RULES/runbooks/LOGBOOK/ADR"
      - "AI agent handoff summaries and operational memory exports after sanitization"
      - "CI/CD workflow definitions, runner labels, deployment marker policy"
    restore_test: "Clone a sample repo from the Gitea backup and run ansible/k8s/alerts YAML validation; never back up chat or shell transcripts that contain plaintext tokens."

  - id: k3s_and_velero
    owner_host: "120"
    schedule: "Velero daily-awoooi-prod + weekly restore dry-run"
    max_age_hours: 25
    includes:
      - "K8s manifests and CRDs"
      - "Secrets/ConfigMaps/RBAC"
      - "PVC/PV snapshots via Velero provider"
      - "backup-restore-test CronJob and result metrics"
    restore_test: "The backup-restore-test CronJob runs a weekly dry-run into the restore-test-dry namespace mapping."
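    # Illustrative sketch only, not part of this baseline: one way the
    # "Velero daily-awoooi-prod" schedule and the restore into a mapped namespace
    # might look as Velero resources. The namespaces (velero, awoooi-prod), the
    # cron time, the ttl, and the Restore/backup names are assumptions; only
    # daily-awoooi-prod and restore-test-dry come from this file.
    #
    # apiVersion: velero.io/v1
    # kind: Schedule
    # metadata:
    #   name: daily-awoooi-prod
    #   namespace: velero                       # assumed install namespace
    # spec:
    #   schedule: "0 3 * * *"                   # assumed time of day
    #   template:
    #     includedNamespaces: ["awoooi-prod"]   # assumed namespace name
    #     ttl: 48h0m0s                          # assumed; align with retention policy
    # ---
    # apiVersion: velero.io/v1
    # kind: Restore
    # metadata:
    #   name: restore-test-example              # hypothetical
    #   namespace: velero
    # spec:
    #   backupName: daily-awoooi-prod-example   # hypothetical backup name
    #   namespaceMapping:
    #     awoooi-prod: restore-test-dry         # restore lands outside production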

  - id: offsite_and_dr
    owner_host: "188/121"
    schedule: "188 backup-from-110 daily 01:00; 121 DR drill monthly"
    max_age_hours: 25
    includes:
      - "110 Harbor/Gitea/bitan rsync copy on 188"
      - "DR drill evidence on 121"
      - "Google Drive/rclone remote when credentials are configured"
    restore_test: "121 DR drill dry-run finds latest Completed Velero backup; 188 backup-from-110 textfile fresh."

monitoring_contract:
  textfile_metrics:
    "110": "/home/wooo/node_exporter_textfiles/backup_health.prom"
    "188": "/home/ollama/node_exporter_textfiles/backup_health.prom"
    "120": "110's backup_health.prom queries Velero/CronJob/Job status via kubectl on 120"
  offsite_and_escrow_metrics:
    - "awoooi_backup_offsite_configured: reports only whether Google Drive/rclone or a compatible provider appears to be configured; never outputs credential values."
    - "awoooi_backup_offsite_fresh: uses /backup/offsite/*last_success style markers to decide whether the offsite sync is fresh."
    - "awoooi_backup_offsite_partial_fresh: uses the small-scope partial sync marker to decide whether the Google Drive/rclone write path has been proven."
    - "awoooi_backup_credential_escrow_fresh: uses /backup/escrow-evidence/*.last_verified style markers to decide whether the manual vault review was completed within the last 31 days."
    - "awoooi_backup_dr_next_step_info: uses the next_step label to tell AI patrols and operators the next safe manual task; contains no secrets."
    - "awoooi_backup_dr_credential_escrow_missing_count: number of items still missing from the vault review."
    - "awoooi_backup_cron_active_duplicate_count: number of exact duplicate entries in the active crontab on 110."
    - "awoooi_backup_cron_singular_entry_ok: whether each single-entry schedule (offsite/status/verifier/exporter, etc.) has exactly one active cron entry."
    - "awoooi_backup_config_capture_ok: whether the latest configs snapshot actually captured the 110/120/121/188 host config and K8s workloads/secrets; never outputs secrets."
    - "awoooi_backup_config_capture_critical_failed_count: number of critical capture targets missing from the latest configuration backup."
  prometheus_alerts:
    - BackupHealthMonitorMissing110
    - BackupHealthMonitorMissing188
    - BackupHealthMonitorStale
    - BackupExpectedJobMissing
    - BackupScheduleDuplicateActiveEntries
    - BackupScheduleSingletonMismatch
    - BackupScriptMissing
    - BackupJobStale
    - BackupAggregateRunFailed
    - BackupConfigCapturePartial
    - BackupConfigCaptureStatusStale
    - BackupIntegrityCheckMissingOrFailed
    - BackupRestoreDrillMissingOrFailed
    - BackupRestoreTestMissing
    - BackupRestoreTestCronMissing
    - BackupRestoreTestFailed
    - BackupRestoreTestStale
    - BackupOffsiteCopyNotConfigured
    - BackupOffsiteCopyStale
    - BackupCredentialEscrowEvidenceMissing
    - BackupRetentionPolicyNotLatestOnly
    - BackupSnapshotRetentionExceeded
    - BackupOffsiteFullVerifyFailed
    - BackupOffsiteRemoteSnapshotRetentionExceeded
  live_visibility_checks:
    - "If awoooi_backup_offsite_configured{host=\"110\"} is 0, Prometheus must show BackupOffsiteCopyNotConfigured firing and Alertmanager must have an active alert."
    - "If the offsite provider is configured, the full marker is not yet fresh, and the full sync enable marker is missing or older than 30 hours, BackupOffsiteCopyStale must be visible in both Prometheus and Alertmanager."
    - "If awoooi_backup_credential_escrow_fresh{host=\"110\"} == 0, BackupCredentialEscrowEvidenceMissing must be visible per item in both Prometheus and Alertmanager."
    - "If awoooi_backup_retention_latest_only{host=\"110\"} or awoooi_backup_retention_offsite_delete_old_enabled{host=\"110\",provider=\"rclone\"} is missing or not 1, BackupRetentionPolicyNotLatestOnly must be visible in both Prometheus and Alertmanager."
    - "If any awoooi_backup_job_snapshot_count{host=\"110\",type=\"restic\"} > 1, BackupSnapshotRetentionExceeded must be visible in both Prometheus and Alertmanager."
    - "If the full offsite marker is fresh but awoooi_backup_offsite_remote_verify_ok{host=\"110\",provider=\"rclone\"} is not 1 or is missing, BackupOffsiteFullVerifyFailed must be visible in Prometheus."
    - "If the full offsite marker is fresh and any awoooi_backup_offsite_remote_snapshot_count{host=\"110\",provider=\"rclone\"} > 1, BackupOffsiteRemoteSnapshotRetentionExceeded must be visible in Prometheus."
    - "If awoooi_backup_cron_active_duplicate_count{host=\"110\"} > 0, BackupScheduleDuplicateActiveEntries must be visible in both Prometheus and Alertmanager."
    - "If any awoooi_backup_cron_singular_entry_ok{host=\"110\"} == 0, BackupScheduleSingletonMismatch must be visible in both Prometheus and Alertmanager."
    - "If any awoooi_backup_config_capture_ok{host=\"110\",critical=\"true\"} == 0, BackupConfigCapturePartial must be visible in both Prometheus and Alertmanager, and the target label must identify which configuration source is missing."
    - "If awoooi_backup_config_capture_status_timestamp is missing or older than 48 hours, BackupConfigCaptureStatusStale must be visible in both Prometheus and Alertmanager."
    - "The live visibility check only reads the Prometheus / Alertmanager APIs; it sends no test alerts, changes no silences or routes, and triggers no remediation."
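    # Illustrative sketch only, not part of this baseline: two of the checks above
    # written as Prometheus alerting rules. The group name, the `for` durations,
    # and the severity labels are assumptions; the metric names, label matchers,
    # and alert names come from this file.
    #
    # groups:
    #   - name: backup-visibility-example        # hypothetical group name
    #     rules:
    #       - alert: BackupSnapshotRetentionExceeded
    #         expr: awoooi_backup_job_snapshot_count{host="110",type="restic"} > 1
    #         for: 30m                           # assumed
    #         labels:
    #           severity: warning                # assumed
    #       - alert: BackupScheduleDuplicateActiveEntries
    #         expr: awoooi_backup_cron_active_duplicate_count{host="110"} > 0
    #         for: 15m                           # assumed
    #         labels:
    #           severity: warning                # assumed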
  prometheus_recording_rules:
    - awoooi_recovery_core_ready
    - awoooi_recovery_dr_offsite_ready
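    # Illustrative sketch only: the real expressions behind these recording rules
    # are not given in this file. Assuming awoooi_backup_job_fresh is a 0/1 gauge
    # per job, a hypothetical rule could fold per-job freshness into one series:
    #
    # groups:
    #   - name: recovery-readiness-example       # hypothetical group name
    #     rules:
    #       - record: awoooi_recovery_core_ready
    #         expr: min(awoooi_backup_job_fresh)   # 0 as soon as any reported job is stale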

release_gate:
  cold_start_script: "bash scripts/reboot-recovery/full-stack-cold-start-check.sh --monitor-read-only --no-color"
  p3_script: "bash scripts/reboot-recovery/p3-controlled-release-gate.sh"
  recovery_core_scorecard: "bash scripts/reboot-recovery/full-stack-recovery-scorecard.sh --require-core"
  dr_offsite_operator_checklist: "bash scripts/reboot-recovery/dr-offsite-operator-checklist.sh"
  dr_offsite_scorecard: "bash scripts/reboot-recovery/full-stack-recovery-scorecard.sh --require-dr"
  dr_offsite_final_gate: "bash scripts/reboot-recovery/dr-offsite-operator-checklist.sh --require-dr"
  dr_offsite_post_marker_wait: "bash scripts/reboot-recovery/wait-dr-offsite-ready.sh --timeout-seconds 900 --interval-seconds 30 --no-color"
  required_green:
    - "backup_health.prom fresh on 110/188"
    - "awoooi_backup_job_fresh == 1 for every expected job"
    - "Velero latest Completed backup < 25h"
    - "backup-restore-test CronJob present and lastSuccessfulTime not stale"
    - "weekly restic check successful"
    - "monthly sample restore drill successful"
  warning_until_human_escrow_ready:
    - "offsite provider configured and latest offsite copy marker fresh"
    - "credential escrow marker files refreshed after human verification; marker files must contain only timestamp/evidence id, never secret values"
  strict_dr_exit_conditions:
    - "Google Drive/rclone provider configured on 110 host-local rclone.conf; /backup/scripts/offsite.env keeps only non-secret remote/path with mode 0600"
    - "credential escrow markers fresh for restic_repository_password, offsite_provider_credentials, break_glass_admin_credentials, dns_registrar_recovery, oauth_ai_provider_recovery"
    - "full offsite marker /backup/offsite/rclone-last-success fresh after full 13 repo sync"
    - "full-stack-recovery-scorecard.sh --require-dr exits 0"
    - "recovery-scorecard-contract-check.py --expect-dr-ready exits 0 against 110 Prometheus"
    - "dr-offsite-operator-checklist.sh --require-dr exits 0 after scorecard, Prometheus recording rule, and backup alert visibility contract agree"
    - "wait-dr-offsite-ready.sh exits 0 after post-marker textfile, Prometheus, Alertmanager, and final checklist convergence"