# BACKUP-STATUS.md: Backup Status Overview

> 2026-04-05 Claude Code: chief-architect full inventory; all services fully automated, with alerting
> Backup hub: 192.168.0.110 (`/backup/`), using Restic + latest-only retention + Google Drive/rclone offsite mirror
> 2026-06-04 Codex live refresh: 110 cron / Google Drive rclone / Alertmanager / credential escrow / cold-start scorecard rechecked.

---

## 2026-06-04 Live Status

| Gate | Status | Evidence |
|------|--------|----------|
| 110 backup cron | VERIFIED | `02:00 backup-all`, `03:00 sync-offsite-backups --mode sync`, `06:05 backup-status`, `07:20 verify-offsite-full-sync`. |
| Backup freshness | VERIFIED with one blocker | 2026-06-04 manual refresh cleared `stale110=awoooi_db` and `stale188=momo_pg_daily`; 18:54 status still shows `stale110=none`, `stale188=none`, 110 `13/13 fresh`, 188 `2/2 fresh`. |
| 188 momo backup cron/exporter contract | VERIFIED | 188 crontab now runs `/home/ollama/bin/momo-pg-backup.sh`; exporter reports `awoooi_backup_job_configured{host="188",job="momo_pg_daily"} 1`, so `configured_missing_188=0`. |
| Google Drive/rclone remote latest-only | VERIFIED | 2026-06-04 07:20 verifier: 13 repos each `remote snapshots=1`, `REMOTE_LATEST_ONLY_OK=1`, `VERIFY_OK=1`. |
| Offsite gate marker | VERIFIED | `/backup/offsite/enable-rclone-sync` present; rclone success markers fresh on 2026-06-04. |
| Backup alert rules | VERIFIED | Live Prometheus contains `BackupConfigCapturePartial`, `BackupAggregateRunFailed`, `BackupCredentialEscrowEvidenceMissing`, `ColdStartRecoveryBlocked`, `ColdStartHost120Unreachable`. |
| Backup aggregate health | BLOCKED until 120 recovers | 18:54 `backup-status --no-notify`: `failed=1`, `core_blockers=1`; the remaining red component is 120 config capture, not stale backup freshness. |
| Credential escrow | BLOCKED | Five evidence markers missing. Only write non-secret marker evidence with `/backup/scripts/mark-credential-escrow-verified.sh`. |
| Config backup capture | BLOCKED until 120 recovers | `awoooi_backup_config_capture_ok{target="120-k3s-host-configs"} 0`; critical failed count `1`. |
| Full cold-start | BLOCKED | 18:55 read-only rerun: `PASS=71 WARN=3 BLOCKED=3`; 120 remains unreachable and K3s `mon` remains `NotReady,SchedulingDisabled`. |
| 120 console handoff | BLOCKED | 19:02 `120-fsck-maintenance-checklist.sh --no-color`: `PASS=2 WARN=2 BLOCKED=3`, `MAINTENANCE REQUIRED`; 120 host/K3s/filesystem evidence is unreadable until console or SSH returns. |

Current policy: normal success should not create immediate Telegram noise. Failures and operator-action states must still alert; a single daily status summary runs at 06:05.

---

## Backup Landscape (Fully Automated)

| # | Data type | Backup script | Schedule | Max data loss | Status |
|---|---------|---------|------|---------|------|
| 1 | Gitea (DB + repositories) | `backup-gitea.sh` | Daily 02:00 | 24h | ✅ |
| 2 | MOMO PostgreSQL | `backup-momo.sh` | Daily 02:00 | 24h | ✅ |
| 3 | Harbor (Registry + DB) | `backup-harbor.sh` | Daily 02:00 | 24h | ✅ |
| 4 | **AWOOOI PostgreSQL (full)** | **`backup-awoooi.sh`** | **Daily 02:00** | **6h** | **✅** |
| 4h | **AWOOOI PostgreSQL (high-frequency)** | **`backup-awoooi-frequent.sh`** | **08/14/20:00** | **6h** | **✅** |
| 5 | **Langfuse (AI tracing/evaluation)** | **`backup-langfuse.sh`** | **Daily 02:00** | **24h** | **✅** |
| 6 | **Monitoring (Prometheus/Grafana/Alertmanager)** | **`backup-monitoring.sh`** | **Daily 02:00** | **24h** | **✅** |
| 7 | **SignOz (ClickHouse traces/logs)** | **`backup-signoz.sh`** | **Daily 02:00** | **24h** | **✅** |
| 8 | **Open-WebUI (LLM conversation history)** | **`backup-open-webui.sh`** | **Daily 02:00** | **24h** | **✅** |
| 9 | **ClawBot Redis (state/cache)** | **`backup-clawbot.sh`** | **Daily 02:00** | **24h** | **✅** |
| - | K8s resources (all namespaces) | Velero + MinIO | Daily 02:00 | 24h | ✅ |

**Backup orchestrator**: `/backup/scripts/backup-all.sh` v3.0, which runs all 9 backups in a single pass

---

## Alerting

Backup failures and states that need operator action must be pushed to AwoooP / Telegram. Normal successes are not pushed immediately, to avoid flooding the channel; success is recorded in the daily 06:05 summary and in Prometheus/textfile evidence.

| Status | Severity | What Telegram receives |
|------|---------|--------------|
| `success` | info | No immediate push; included in the daily 06:05 backup status summary |
| `warning` | warning | ⚠️ Yellow warning |
| `failed` | **critical** | 🔴 **Immediate alert** |

**Alert endpoint**: `http://192.168.0.188:8088/api/v1/webhook/custom`

**Test command**:
```bash
# Load the shared helpers, then send a test alert through notify_clawbot
source /backup/scripts/common.sh
notify_clawbot "failed" "backup-test" "Test alert" 0
```
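
The routing in the table above can be sketched as a small shell helper. This is a minimal sketch and not the actual logic in `common.sh`: `status_severity` and `should_push_now` are hypothetical names, and pushing unrecognized statuses immediately is an assumption chosen so that unexpected states are never silent.

```bash
# Hypothetical helper: map a backup status to the severity listed in the table.
status_severity() {
    case "$1" in
        success) echo "info" ;;
        warning) echo "warning" ;;
        failed)  echo "critical" ;;
        *)       echo "unknown" ;;
    esac
}

# Hypothetical helper: exit 0 when the status should reach Telegram right away.
# Only a plain success is deferred to the daily 06:05 summary; anything
# unrecognized is pushed as well (an assumption, not documented behavior).
should_push_now() {
    [ "$1" != "success" ]
}

# Walk through the three documented statuses and show the routing decision.
for status in success warning failed; do
    if should_push_now "$status"; then
        echo "$status -> $(status_severity "$status"): push now"
    else
        echo "$status -> $(status_severity "$status"): daily summary only"
    fi
done
```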

---

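## Retention Policy

Since 2026-05-19, the local restic repos on 110, the MOMO file backups on 188, and the Google Drive/rclone offsite mirror all follow a latest-only policy: once a new snapshot has been created successfully, only the newest copy is kept. The 2026-06-04 07:20 live verifier confirmed that each of the 13 Google Drive/rclone remote repos holds exactly 1 snapshot.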
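
For plain file backups, latest-only retention can be sketched as "delete every matching file except the newest". The helper below is a minimal sketch under stated assumptions: `prune_keep_latest` is a hypothetical name, the example directory is a placeholder, and file names are assumed to embed a fixed-width timestamp so that sorted order is chronological. For restic repositories the usual counterpart is `restic forget --keep-last 1 --prune`; whether the scripts here use exactly that invocation is not confirmed by this document.

```bash
# Hypothetical helper: keep only the newest file whose name starts with a prefix.
# Assumes names embed a fixed-width timestamp (for example
# momo_analytics_20260604_154234.sql.gz), so sorted order is chronological.
prune_keep_latest() {
    backup_dir=$1
    name_prefix=$2
    # The glob expands in sorted order, so the last match is the newest.
    set -- "$backup_dir"/"$name_prefix"*
    # With no match the pattern stays literal; nothing to prune then.
    [ -e "$1" ] || return 0
    # Remove every match except the last one.
    while [ "$#" -gt 1 ]; do
        echo "pruning $1"
        rm -f -- "$1"
        shift
    done
}

# Example call (the directory is a placeholder, not a path from this document):
# prune_keep_latest /path/to/momo-backups momo_analytics_
```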

2026-06-04 manual refresh evidence:

- 188 `momo-pg-backup.sh` produced `momo_analytics_20260604_154234.sql.gz` and pruned old backups beyond keep-last=1.
- 110 `backup-awoooi-frequent.sh` completed restic snapshot `7440d75f` and pruned previous AWOOOI high-frequency DB snapshot.
- 18:54 `backup-status.sh --no-notify`: `stale110=none`, `stale188=none`, `configured_missing_188=0`, `core_blockers=1`, `escrow_missing=5`.
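
The status fields listed above lend themselves to a simple gate check. The following is a minimal sketch that assumes the fields arrive as one space-separated `key=value` line; the real output format of `backup-status.sh` is not shown in this document, and `summary_field` and `aggregate_is_green` are hypothetical names.

```bash
# Hypothetical helper: print the value of one key from a "k=v k=v ..." line.
# $1 = key, $2 = summary line. Exits 1 when the key is absent.
summary_field() {
    for pair in $2; do
        case "$pair" in
            "$1"=*) echo "${pair#*=}"; return 0 ;;
        esac
    done
    return 1
}

# Hypothetical gate: exit 0 only when core_blockers is 0,
# exit 2 when the field is missing from the line.
aggregate_is_green() {
    blockers=$(summary_field core_blockers "$1") || return 2
    [ "$blockers" -eq 0 ]
}

# Values copied from the 18:54 evidence above.
summary='stale110=none stale188=none configured_missing_188=0 core_blockers=1 escrow_missing=5'
if aggregate_is_green "$summary"; then
    echo "aggregate: green"
else
    echo "aggregate: blocked (core_blockers=$(summary_field core_blockers "$summary"))"
fi
```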

18:55 cold-start scorecard refresh:

- `PASS=71 WARN=3 BLOCKED=3`.
- Remaining hard blocks: 120 ping, 120 SSH, and 120 K3s read-only check.
- 188 backup health stale jobs are clear.
- momo current-month parity is green: `2215|2215|2026-06-01|2026-06-04|2026-06-01|2026-06-04`.

19:02 120 console handoff evidence:

- local/110/121/188 cannot reach 192.168.0.120.
- K3s node lease for `mon` stopped renewing at `2026-05-22 02:48:36 +08`.
- `120-fsck-maintenance-checklist.sh --no-color` returns `PASS=2 WARN=2 BLOCKED=3`, so backup aggregate remains correctly blocked until console/SSH recovery.

The remaining `core_blockers=1` is expected until 192.168.0.120 comes back and `/backup/scripts/backup-configs.sh` plus `/backup/scripts/backup-all.sh` both complete cleanly. Do not suppress this red gate.

---

## Full Crontab Schedule (110)

```
0 2 * * *        backup-all.sh                                   ← full backup of all 9 services
0 8,14,20 * * *  backup-awoooi-frequent.sh                       ← AWOOOI high-frequency (every 6 hours, counting the 02:00 full run)
0 3 * * *        sync-offsite-backups.sh --mode sync             ← Google Drive/rclone gated sync
5 6 * * *        backup-status.sh                                ← one backup status summary per day, so success heartbeats do not flood the channel
20 7 * * *       verify-offsite-full-sync.sh --write-textfile    ← Google Drive/rclone latest-only verification
```
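
A quick way to confirm that the schedule above is actually installed is to compare `crontab -l` against the expected script names. This is a minimal sketch, not an existing script: `missing_cron_jobs` and `EXPECTED_JOBS` are hypothetical names, and the check only looks for the script name on an uncommented line, not for the exact time fields.

```bash
# Script names taken from the schedule above.
EXPECTED_JOBS="backup-all.sh backup-awoooi-frequent.sh sync-offsite-backups.sh backup-status.sh verify-offsite-full-sync.sh"

# Hypothetical helper: read crontab text on stdin and print every expected
# script that has no active (uncommented) entry.
missing_cron_jobs() {
    # Drop comment lines first so a disabled entry counts as missing.
    active_lines=$(grep -v '^[[:space:]]*#' || true)
    for job in $EXPECTED_JOBS; do
        printf '%s\n' "$active_lines" | grep -qF -- "$job" || echo "$job"
    done
}

# Example call on the backup host; empty output means nothing is missing:
# crontab -l | missing_cron_jobs
```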

---

## Backup Architecture

```
192.168.0.110 (/backup/scripts/backup-all.sh) daily 02:00
├── [1/9] backup-gitea.sh       → gitea dump → /backup/gitea
├── [2/9] backup-momo.sh        → SSH 188 pg_dump momo → /backup/momo
├── [3/9] backup-harbor.sh      → harbor dump → /backup/harbor
├── [4/9] backup-awoooi.sh      → SSH 188 pg_dump awoooi_prod/dev/k3s → /backup/awoooi
├── [5/9] backup-langfuse.sh    → docker exec langfuse-db pg_dump → /backup/langfuse
├── [6/9] backup-monitoring.sh  → volumes prometheus/grafana/alertmanager → /backup/monitoring
├── [7/9] backup-signoz.sh      → volumes signoz-clickhouse/sqlite → /backup/signoz
├── [8/9] backup-open-webui.sh  → SSH 188 volume open-webui → /backup/open-webui
└── [9/9] backup-clawbot.sh     → SSH 188 volume clawbot-redis → /backup/clawbot

Backup failed    → notify_clawbot("failed") → /webhook/custom or AwoooP/Alertmanager path → Telegram 🔴
Backup succeeded → textfile / Prometheus / 06:05 status summary, no immediate push

192.168.0.188 (Velero) daily 02:00
└── K8s resource snapshot → MinIO :9000 (bucket: velero)
```
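
The numbered flow in the diagram can be sketched as a sequential runner that keeps going after a failure and reports an aggregate result at the end. This is a minimal sketch under stated assumptions, not the contents of `backup-all.sh`: `run_all`, `run_job_stub`, and `RUN_JOB` are hypothetical names, and the default stub only echoes so the sketch never touches real backups.

```bash
# Hypothetical stub: stands in for executing one backup script.
run_job_stub() { echo "would run $1"; }

# RUN_JOB names the command that runs a single job; override it to plug in
# a real executor. It defaults to the harmless stub above.
RUN_JOB=${RUN_JOB:-run_job_stub}

# Hypothetical runner: execute each job in order, log "[i/N] name",
# keep going after a failure, and exit non-zero if any job failed.
run_all() {
    total=$#
    index=0
    failed=0
    for job in "$@"; do
        index=$((index + 1))
        echo "[$index/$total] $job"
        if ! "$RUN_JOB" "$job"; then
            failed=$((failed + 1))
            echo "FAILED: $job" >&2
        fi
    done
    echo "failed=$failed"
    [ "$failed" -eq 0 ]
}

# Dry run with the stub, using the first three script names from the diagram.
run_all backup-gitea.sh backup-momo.sh backup-harbor.sh
```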

---

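## Not Yet Backed Up (Rationale)

| Service | Reason | Notes |
|------|------|------|
| Prometheus TSDB | Raw metric data (not configuration); the TSDB has its own 30d TTL | Low priority; Grafana configuration is already backed up |
| Sentry | Not currently running (`docker ps` is empty) | A volume exists; re-evaluate after redeployment |
| Redis (AWOOOI) | Cache/WorkingMemory, no persistent business data | Low priority |
| Velero MinIO data | MinIO is a backup of the backups and needs an offsite copy | B2/S3 offsite under evaluation |

---
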
## Verification SOP

```bash
# Latest backup log
ssh wooo@192.168.0.110 "tail -50 /backup/logs/backup.log"

# Snapshot count for every service.
# Counts table rows that start with an 8-hex short snapshot ID (assumes restic's
# default table output); matching the word "snapshot" would only hit the summary line.
# grep -c already prints 0 when nothing matches, so no extra fallback echo is needed.
ssh wooo@192.168.0.110 "for r in gitea momo harbor awoooi langfuse monitoring signoz open-webui clawbot; do
  echo -n \"\$r: \"
  restic -r /backup/\$r snapshots --password-file /backup/scripts/.restic-password 2>/dev/null | grep -cE '^[0-9a-f]{8} ' || true
done"

# Alert test
ssh wooo@192.168.0.110 "source /backup/scripts/common.sh && notify_clawbot 'warning' 'manual-test' 'Manual alert test' 0"
```
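
With latest-only retention, every repository in the snapshot-count check should report exactly one snapshot. The filter below is a minimal sketch that flags anything else; it assumes input lines in the `name: count` form printed by the loop above, and `latest_only_violations` is a hypothetical name.

```bash
# Hypothetical helper: read "repo: count" lines on stdin and print every repo
# whose snapshot count differs from the latest-only expectation of 1.
# Lines without a "name: count" shape are ignored.
latest_only_violations() {
    awk -F': *' 'NF >= 2 && $2 != 1 { print $1 " has " $2 " snapshots" }'
}

# Sample input; only the repos that break the expectation are printed.
printf '%s\n' 'gitea: 1' 'momo: 0' 'harbor: 3' | latest_only_violations
```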

---

## Related Documents

- [REBOOT-RECOVERY-SOP.md](REBOOT-RECOVERY-SOP.md) - Reboot recovery SOP
- `scripts/backup/` - All backup scripts (Git-tracked versions)
- `/backup/scripts/` (on 110) - Scripts as actually deployed