BACKUP-STATUS.md: Backup Status Overview

2026-04-05 Claude Code: full inventory by the chief architect, covering full automation of all services plus the alerting mechanism.
Backup hub: 192.168.0.110 (/backup/), using Restic + latest-only retention + Google Drive/rclone offsite mirror.
2026-06-04 Codex live refresh: 110 cron / Google Drive rclone / Alertmanager / credential escrow / cold-start scorecard rechecked.


2026-06-04 Live Status

| Gate | Status | Evidence |
| --- | --- | --- |
| 110 backup cron | VERIFIED | 02:00 backup-all, 03:00 sync-offsite-backups --mode sync, 06:05 backup-status, 07:20 verify-offsite-full-sync. |
| Backup freshness | VERIFIED with one blocker | 2026-06-04 manual refresh cleared stale110=awoooi_db and stale188=momo_pg_daily; 18:54 status still shows stale110=none, stale188=none, 110 13/13 fresh, 188 2/2 fresh. |
| 188 momo backup cron/exporter contract | VERIFIED | 188 crontab now runs /home/ollama/bin/momo-pg-backup.sh; exporter reports awoooi_backup_job_configured{host="188",job="momo_pg_daily"} 1, so configured_missing_188=0. |
| Google Drive/rclone remote latest-only | VERIFIED | 2026-06-04 07:20 verifier: 13 repos each remote snapshots=1, REMOTE_LATEST_ONLY_OK=1, VERIFY_OK=1. |
| Offsite gate marker | VERIFIED | /backup/offsite/enable-rclone-sync present; rclone success markers fresh on 2026-06-04. |
| Backup alert rules | VERIFIED | Live Prometheus contains BackupConfigCapturePartial, BackupAggregateRunFailed, BackupCredentialEscrowEvidenceMissing, ColdStartRecoveryBlocked, ColdStartHost120Unreachable. |
| Backup aggregate health | BLOCKED until 120 recovers | 18:54 backup-status --no-notify: failed=1, core_blockers=1; the remaining red component is 120 config capture, not stale backup freshness. |
| Credential escrow | BLOCKED | Five evidence markers missing. Only write non-secret marker evidence with /backup/scripts/mark-credential-escrow-verified.sh. |
| Config backup capture | BLOCKED until 120 recovers | awoooi_backup_config_capture_ok{target="120-k3s-host-configs"} 0; critical failed count 1. |
| Full cold-start | BLOCKED | 18:55 read-only rerun: PASS=71 WARN=3 BLOCKED=3; 120 remains unreachable and K3s mon remains NotReady,SchedulingDisabled. |
| 120 console handoff | BLOCKED | 19:02 120-fsck-maintenance-checklist.sh --no-color: PASS=2 WARN=2 BLOCKED=3, MAINTENANCE REQUIRED; 120 host/K3s/filesystem evidence is unreadable until console or SSH returns. |
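
Several gates above are backed by gauge metrics such as awoooi_backup_config_capture_ok, where a sample value of 0 means red. The following is a minimal sketch of how such red series could be listed from Prometheus text-format output. The function name `red_series` is hypothetical, and the sketch assumes label values contain no spaces, as in the samples quoted in the table.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every series of one metric whose value is 0.
# Input: Prometheus text exposition format on stdin.
# Assumption: label values contain no spaces, so field 1 is the full series.
red_series() {
  local metric="$1"
  awk -v m="$metric" '
    /^#/ { next }                          # skip HELP/TYPE comment lines
    NF >= 2 {
      i = index($1, "{")                   # where the label set starts, if any
      name = (i ? substr($1, 1, i - 1) : $1)
      if (name == m && $NF + 0 == 0) print $1
    }'
}
```

It could be fed from whatever file or endpoint exposes the backup metrics, for example `red_series awoooi_backup_config_capture_ok < backup.prom`, where `backup.prom` is a placeholder name rather than a path taken from this runbook.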

Current policy: normal success should not create immediate Telegram noise. Failures and operator-action states must still alert; a single daily status summary runs at 06:05.


Backup Overview (Fully Automated)

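| # | Data type | Backup script | Schedule | Max data loss | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Gitea (DB + repositories) | backup-gitea.sh | Daily 02:00 | 24h | |
| 2 | MOMO PostgreSQL | backup-momo.sh | Daily 02:00 | 24h | |
| 3 | Harbor (Registry + DB) | backup-harbor.sh | Daily 02:00 | 24h | |
| 4 | AWOOOI PostgreSQL (full) | backup-awoooi.sh | Daily 02:00 | 6h | |
| 4h | AWOOOI PostgreSQL (high-frequency) | backup-awoooi-frequent.sh | 08/14/20:00 | 6h | |
| 5 | Langfuse (AI tracing/evaluation) | backup-langfuse.sh | Daily 02:00 | 24h | |
| 6 | Monitoring (Prometheus/Grafana/Alertmanager) | backup-monitoring.sh | Daily 02:00 | 24h | |
| 7 | SignOz (ClickHouse traces/logs) | backup-signoz.sh | Daily 02:00 | 24h | |
| 8 | Open-WebUI (LLM conversation history) | backup-open-webui.sh | Daily 02:00 | 24h | |
| 9 | ClawBot Redis (state/cache) | backup-clawbot.sh | Daily 02:00 | 24h | |
| - | K8s resources (all namespaces) | Velero + MinIO | Daily 02:00 | 24h | |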

Backup orchestrator: /backup/scripts/backup-all.sh v3.0, which runs all 9 backups in a single pass.
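
The maximum-loss column above implies a freshness window per job (24h for most, 6h for AWOOOI PostgreSQL). As an illustration only, a freshness test against such a window might look like the sketch below. The function `is_fresh` is hypothetical; how backup-status.sh actually decides that a job is stale is not documented here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when the last successful run is inside the window.
#   $1 = epoch seconds of the last successful backup
#   $2 = allowed window in hours (e.g. 24, or 6 for AWOOOI PostgreSQL)
#   $3 = "now" in epoch seconds (optional, defaults to the current time)
is_fresh() {
  local last_epoch="$1" max_hours="$2" now_epoch="${3:-$(date +%s)}"
  [ $(( now_epoch - last_epoch )) -le $(( max_hours * 3600 )) ]
}
```

Passing the current time as an optional third argument keeps the check deterministic when it is exercised outside cron.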


Alerting

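Backup failures and states that require operator action must be pushed to AwoooP / Telegram. Normal successes are not pushed immediately, to avoid flooding the channel; success state is carried by the daily 06:05 summary and by Prometheus/textfile evidence.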

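| Status | Severity | What Telegram receives |
| --- | --- | --- |
| success | info | No immediate push; included in the daily 06:05 backup status summary |
| warning | warning | ⚠️ Yellow warning |
| failed | critical | 🔴 Immediate alert |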

Alert endpoint: http://192.168.0.188:8088/api/v1/webhook/custom
Test command:

source /backup/scripts/common.sh
notify_clawbot "failed" "backup-test" "Test alert" 0
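
The status table in this section can be read as a small routing rule. The sketch below encodes that mapping for illustration; `severity_for` and `should_push_now` are hypothetical names, and this is not the implementation of notify_clawbot in common.sh. It assumes that warning, like failed, is pushed right away, since only success is described as deferred to the daily summary.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a backup status to its alert severity.
severity_for() {
  case "$1" in
    success) echo info ;;
    warning) echo warning ;;
    failed)  echo critical ;;
    *)       return 1 ;;        # unknown status: let the caller decide
  esac
}

# Hypothetical helper: succeed when the status should reach Telegram immediately.
# Assumption: only success is held back for the 06:05 daily summary.
should_push_now() {
  case "$1" in
    warning|failed) return 0 ;;
    *)              return 1 ;;
  esac
}
```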

Retention Policy

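Since 2026-05-19, the local restic repos on 110, the MOMO file backups on 188, and the Google Drive/rclone offsite mirror follow a latest-only policy: once a new snapshot has been created successfully, only the newest one is kept. The 2026-06-04 07:20 live verifier confirmed that each of the 13 Google Drive/rclone remote repos holds exactly 1 snapshot.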
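
A latest-only policy is easy to assert once per-repo snapshot counts are available. The sketch below is a hypothetical checker, not the logic of verify-offsite-full-sync.sh: it reads `<repo> <count>` pairs and fails when any repo holds anything other than exactly one snapshot. The input format is an assumption made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify latest-only retention.
# Input on stdin: one "<repo> <snapshot_count>" pair per line (assumed format).
# Prints one line per violating repo and returns non-zero if any exist.
# Note: with restic, pruning down to one snapshot is typically done with
#   restic forget --keep-last 1 --prune
# Whether the deployed scripts use exactly that invocation is an assumption.
latest_only_ok() {
  awk '
    NF >= 2 && $2 != 1 { print "NOT_LATEST_ONLY " $1 " snapshots=" $2; bad = 1 }
    END { exit bad + 0 }'
}
```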

2026-06-04 manual refresh evidence:

  • 188 momo-pg-backup.sh produced momo_analytics_20260604_154234.sql.gz and pruned old backups beyond keep-last=1.
  • 110 backup-awoooi-frequent.sh completed restic snapshot 7440d75f and pruned previous AWOOOI high-frequency DB snapshot.
  • 18:54 backup-status.sh --no-notify: stale110=none, stale188=none, configured_missing_188=0, core_blockers=1, escrow_missing=5.

18:55 cold-start scorecard refresh:

  • PASS=71 WARN=3 BLOCKED=3.
  • Remaining hard blocks: 120 ping, 120 SSH, and 120 K3s read-only check.
  • 188 backup health stale jobs are clear.
  • momo current-month parity is green: 2215|2215|2026-06-01|2026-06-04|2026-06-01|2026-06-04.
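
The scorecard summary line has a simple `KEY=N` shape, so a wrapper can gate on it mechanically. The following sketch assumes the summary is available as a single line like the one quoted above; `scorecard_field` and `scorecard_gate` are hypothetical names and not part of the cold-start tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract one counter from a "PASS=71 WARN=3 BLOCKED=3" line.
scorecard_field() {
  local key="$1" line="$2" tok
  for tok in $line; do                     # split the summary on whitespace
    case "$tok" in
      "$key"=*) echo "${tok#*=}"; return 0 ;;
    esac
  done
  return 1                                 # key not present in the line
}

# Hypothetical helper: succeed only when nothing is blocked.
# Returns 2 when the BLOCKED counter cannot be found at all.
scorecard_gate() {
  local blocked
  blocked="$(scorecard_field BLOCKED "$1")" || return 2
  [ "$blocked" -eq 0 ]
}
```

With the 18:55 values, `scorecard_gate "PASS=71 WARN=3 BLOCKED=3"` fails, which matches the BLOCKED state recorded in this runbook.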

19:02 120 console handoff evidence:

  • local/110/121/188 cannot reach 192.168.0.120.
  • K3s node lease for mon stopped renewing at 2026-05-22 02:48:36 +08.
  • 120-fsck-maintenance-checklist.sh --no-color returns PASS=2 WARN=2 BLOCKED=3, so backup aggregate remains correctly blocked until console/SSH recovery.

The remaining core_blockers=1 is expected until 192.168.0.120 comes back and /backup/scripts/backup-configs.sh plus /backup/scripts/backup-all.sh both complete cleanly. Do not suppress this red gate.


Full Crontab Schedule (110)

0 2       * * *   backup-all.sh              ← full backup of 9 services
0 8,14,20 * * *   backup-awoooi-frequent.sh  ← AWOOOI high-frequency (6-hour cadence together with the 02:00 full backup)
0 3       * * *   sync-offsite-backups.sh --mode sync          ← Google Drive/rclone gated sync
5 6       * * *   backup-status.sh           ← once-daily backup status summary, avoids success-heartbeat spam
20 7      * * *   verify-offsite-full-sync.sh --write-textfile ← Google Drive/rclone latest-only verification
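
To confirm that a host's crontab still carries all five entries listed above, a check along these lines could be used. It is a sketch under assumptions: `missing_cron_jobs` is a hypothetical name, and matching is by script file name only, so it does not validate the schedule fields.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report expected backup jobs absent from a crontab.
# Input on stdin: crontab text (for example the output of `crontab -l`).
# Commented-out lines are ignored; returns non-zero if anything is missing.
missing_cron_jobs() {
  local crontab_text job missing=0
  crontab_text="$(cat)"
  for job in backup-all.sh backup-awoooi-frequent.sh sync-offsite-backups.sh \
             backup-status.sh verify-offsite-full-sync.sh; do
    if ! printf '%s\n' "$crontab_text" \
         | grep -v '^[[:space:]]*#' \
         | grep -qF -- "$job"; then
      echo "MISSING $job"
      missing=1
    fi
  done
  return "$missing"
}
```

On 110 this would be run as `crontab -l | missing_cron_jobs`.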

Backup Architecture

192.168.0.110 (/backup/scripts/backup-all.sh) daily 02:00
├── [1/9] backup-gitea.sh       → gitea dump → /backup/gitea
├── [2/9] backup-momo.sh        → SSH 188 pg_dump momo → /backup/momo
├── [3/9] backup-harbor.sh      → harbor dump → /backup/harbor
├── [4/9] backup-awoooi.sh      → SSH 188 pg_dump awoooi_prod/dev/k3s → /backup/awoooi
├── [5/9] backup-langfuse.sh    → docker exec langfuse-db pg_dump → /backup/langfuse
├── [6/9] backup-monitoring.sh  → volumes prometheus/grafana/alertmanager → /backup/monitoring
├── [7/9] backup-signoz.sh      → volumes signoz-clickhouse/sqlite → /backup/signoz
├── [8/9] backup-open-webui.sh  → SSH 188 volume open-webui → /backup/open-webui
└── [9/9] backup-clawbot.sh     → SSH 188 volume clawbot-redis → /backup/clawbot

Backup failure → notify_clawbot("failed") → /webhook/custom or AwoooP/Alertmanager path → Telegram 🔴
Backup success → textfile / Prometheus / 06:05 status summary, no immediate push

192.168.0.188 (Velero) daily 02:00
└── K8s resource snapshots → MinIO :9000 (bucket: velero)
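
The nine-step sequence in the diagram above suggests an orchestrator that labels each step `[i/n]` and reports an aggregate result. The sketch below illustrates one way such a loop could be written so that a failing step does not prevent later steps from running. It is an assumption about the general pattern, not the contents of backup-all.sh, and `run_steps` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical orchestrator loop: run every step, keep going after failures,
# and return non-zero if at least one step failed.
# Each argument is a command name (no arguments) to execute in order.
run_steps() {
  local total=$# i=0 failed=0 step
  for step in "$@"; do
    i=$((i + 1))
    echo "[$i/$total] $step"
    if ! "$step"; then
      echo "FAILED $step" >&2              # a real script would alert here
      failed=$((failed + 1))
    fi
  done
  echo "failed=$failed"
  [ "$failed" -eq 0 ]
}
```

Continuing past a failure means one broken service does not cost the other services their nightly backup, while the non-zero exit status still lets an aggregate gate turn red.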

Not Yet Backed Up (with Rationale)

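| Service | Reason | Notes |
| --- | --- | --- |
| Prometheus TSDB | Raw metric data, not configuration; the TSDB has its own 30d TTL | Low priority; Grafana configuration is already backed up |
| Sentry | Not currently running (docker ps is empty) | Volume exists; re-evaluate after redeployment |
| Redis (AWOOOI) | Cache/WorkingMemory; no persistent business data | Low priority |
| Velero MinIO data | MinIO is a backup of a backup and needs its own offsite copy | B2/S3 offsite pending evaluation |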

Verification SOP

# Latest backup log
ssh wooo@192.168.0.110 "tail -50 /backup/logs/backup.log"

# Snapshot count for every service, read from restic's "N snapshots" summary line
# (prints 0 when the repo cannot be read)
ssh wooo@192.168.0.110 "for r in gitea momo harbor awoooi langfuse monitoring signoz open-webui clawbot; do
  echo -n \"\$r: \"
  restic -r /backup/\$r snapshots --password-file /backup/scripts/.restic-password 2>/dev/null | awk '/^[0-9]+ snapshots/ {n=\$1} END {print n+0}'
done"

# Alert test
ssh wooo@192.168.0.110 "source /backup/scripts/common.sh && notify_clawbot 'warning' 'manual-test' 'Manual alert test' 0"

Related Documents

  • REBOOT-RECOVERY-SOP.md - reboot recovery SOP
  • scripts/backup/ - all backup scripts (Git-versioned)
  • /backup/scripts/ (on 110) - the scripts as actually deployed