# BACKUP-STATUS.md: Backup Status Overview
> 2026-04-05 Claude Code: chief-architect full inventory; all services fully automated, plus alerting.
> Backup hub: 192.168.0.110 (`/backup/`), using Restic + latest-only retention + Google Drive/rclone offsite mirror.
> 2026-06-04 Codex live refresh: 110 cron / Google Drive rclone / Alertmanager / credential escrow / cold-start scorecard rechecked.
---
## 2026-06-04 Live Status
| Gate | Status | Evidence |
|------|--------|----------|
| 110 backup cron | VERIFIED | `02:00 backup-all`, `03:00 sync-offsite-backups --mode sync`, `06:05 backup-status`, `07:20 verify-offsite-full-sync`. |
| Backup freshness | VERIFIED with one blocker | 2026-06-04 manual refresh cleared `stale110=awoooi_db` and `stale188=momo_pg_daily`; 18:54 status still shows `stale110=none`, `stale188=none`, 110 `13/13 fresh`, 188 `2/2 fresh`. |
| 188 momo backup cron/exporter contract | VERIFIED | 188 crontab now runs `/home/ollama/bin/momo-pg-backup.sh`; exporter reports `awoooi_backup_job_configured{host="188",job="momo_pg_daily"} 1`, so `configured_missing_188=0`. |
| Google Drive/rclone remote latest-only | VERIFIED | 2026-06-04 07:20 verifier: 13 repos each `remote snapshots=1`, `REMOTE_LATEST_ONLY_OK=1`, `VERIFY_OK=1`. |
| Offsite gate marker | VERIFIED | `/backup/offsite/enable-rclone-sync` present; rclone success markers fresh on 2026-06-04. |
| Backup alert rules | VERIFIED | Live Prometheus contains `BackupConfigCapturePartial`, `BackupAggregateRunFailed`, `BackupCredentialEscrowEvidenceMissing`, `ColdStartRecoveryBlocked`, `ColdStartHost120Unreachable`. |
| Backup aggregate health | BLOCKED until 120 recovers | 18:54 `backup-status --no-notify`: `failed=1`, `core_blockers=1`; the remaining red component is 120 config capture, not stale backup freshness. |
| Credential escrow | BLOCKED | Five evidence markers missing. Only write non-secret marker evidence with `/backup/scripts/mark-credential-escrow-verified.sh`. |
| Config backup capture | BLOCKED until 120 recovers | `awoooi_backup_config_capture_ok{target="120-k3s-host-configs"} 0`; critical failed count `1`. |
| Full cold-start | BLOCKED | 18:55 read-only rerun: `PASS=71 WARN=3 BLOCKED=3`; 120 remains unreachable and K3s `mon` remains `NotReady,SchedulingDisabled`. |
| 120 console handoff | BLOCKED | 19:02 `120-fsck-maintenance-checklist.sh --no-color`: `PASS=2 WARN=2 BLOCKED=3`, `MAINTENANCE REQUIRED`; 120 host/K3s/filesystem evidence is unreadable until console or SSH returns. |
Current policy: normal success should not create immediate Telegram noise. Failures and operator-action states must still alert; a single daily status summary runs at 06:05.
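
Several gates above are backed by Prometheus textfile samples such as `awoooi_backup_config_capture_ok{target="120-k3s-host-configs"} 0`. The following is a minimal sketch of reading one such sample in a shell check, assuming the textfile uses the standard text exposition format with no spaces inside label values. The helper name `metric_value` and the inline sample are illustrative and are not part of the deployed scripts.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in /backup/scripts): print the value of one exact
# metric series read from Prometheus text-format input on stdin.
# Exits non-zero when the series is absent.
metric_value() {
  local series="$1"
  # Compare the first field literally (no regex) and print the second field.
  awk -v s="$series" '$1 == s { print $2; found = 1 } END { exit !found }'
}

# Sample lines copied from the status table above.
sample='awoooi_backup_config_capture_ok{target="120-k3s-host-configs"} 0
awoooi_backup_job_configured{host="188",job="momo_pg_daily"} 1'

printf '%s\n' "$sample" \
  | metric_value 'awoooi_backup_config_capture_ok{target="120-k3s-host-configs"}'
```

A value of `0` for the config-capture series is what keeps the "Config backup capture" gate red while 120 is unreachable.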
---
## Backup Landscape (Fully Automated)
| # | Data type | Backup script | Schedule | Max data loss | Status |
|---|-----------|---------------|----------|---------------|--------|
| 1 | Gitea (DB + repositories) | `backup-gitea.sh` | Daily 02:00 | 24h | ✅ |
| 2 | MOMO PostgreSQL | `backup-momo.sh` | Daily 02:00 | 24h | ✅ |
| 3 | Harbor (Registry + DB) | `backup-harbor.sh` | Daily 02:00 | 24h | ✅ |
| 4 | **AWOOOI PostgreSQL (full)** | **`backup-awoooi.sh`** | **Daily 02:00** | **6h** | **✅** |
| 4h | **AWOOOI PostgreSQL (high-frequency)** | **`backup-awoooi-frequent.sh`** | **08:00/14:00/20:00** | **6h** | **✅** |
| 5 | **Langfuse (AI tracing/evaluation)** | **`backup-langfuse.sh`** | **Daily 02:00** | **24h** | **✅** |
| 6 | **Monitoring (Prometheus/Grafana/Alertmanager)** | **`backup-monitoring.sh`** | **Daily 02:00** | **24h** | **✅** |
| 7 | **SignOz (ClickHouse traces/logs)** | **`backup-signoz.sh`** | **Daily 02:00** | **24h** | **✅** |
| 8 | **Open-WebUI (LLM conversation history)** | **`backup-open-webui.sh`** | **Daily 02:00** | **24h** | **✅** |
| 9 | **ClawBot Redis (state/cache)** | **`backup-clawbot.sh`** | **Daily 02:00** | **24h** | **✅** |
| - | K8s resources (all namespaces) | Velero + MinIO | Daily 02:00 | 24h | ✅ |

**Backup orchestrator:** `/backup/scripts/backup-all.sh` v3.0, which runs all 9 backups in one pass.
---
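## Alerting
Backup failures and states that need operator action must be pushed to AwoooP / Telegram. Normal successes are not pushed immediately, to avoid flooding the channel; success is recorded by the daily 06:05 summary and by Prometheus/textfile evidence.

| Status | Severity | Telegram delivery |
|--------|----------|-------------------|
| `success` | info | No immediate push; included in the daily 06:05 backup status summary |
| `warning` | warning | ⚠️ Yellow warning |
| `failed` | **critical** | 🔴 **Immediate alert** |

**Alert endpoint:** `http://192.168.0.188:8088/api/v1/webhook/custom`

**Test command:**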
```bash
source /backup/scripts/common.sh
notify_clawbot "failed" "backup-test" "test alert" 0
```
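
The routing in the table above can be sketched as a small lookup. This is an illustration only: the internals of `notify_clawbot` in `common.sh` are not shown in this runbook, the function name `alert_route` is hypothetical, and treating unknown states as critical is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the status -> severity/delivery table above.
# Prints "<severity> <delivery>", where delivery is "push" (send to Telegram
# now) or "summary" (defer to the daily 06:05 backup status summary).
alert_route() {
  case "$1" in
    success) echo "info summary" ;;
    warning) echo "warning push" ;;
    failed)  echo "critical push" ;;
    *)       echo "critical push" ;;  # assumption: unknown states fail loud
  esac
}

for status in success warning failed; do
  printf '%-8s -> %s\n' "$status" "$(alert_route "$status")"
done
```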
---
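## Retention Policy
Since 2026-05-19, the local restic repos on 110, the MOMO file backups on 188, and the Google Drive/rclone offsite mirror all follow a latest-only policy: once a new snapshot is created successfully, only the newest copy is kept. The 2026-06-04 07:20 live verifier confirmed that each of the 13 repos on the Google Drive/rclone remote holds exactly 1 snapshot.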
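
For a restic repo, a latest-only prune can be expressed with `restic forget --keep-last 1 --prune`. The sketch below only assembles and prints that command by default; the wrapper name `prune_latest_only` and the `DRY_RUN` switch are assumptions for illustration, and the deployed scripts may prune differently. The repo path and password file are the ones used in the verification SOP in this runbook.

```bash
#!/usr/bin/env bash
# Sketch only, not the deployed implementation.
# Builds a latest-only prune command for one restic repo. By default it just
# prints the command; set DRY_RUN=0 to actually execute it.
prune_latest_only() {
  local repo="$1" password_file="$2"
  local cmd=(restic -r "$repo" --password-file "$password_file"
             forget --keep-last 1 --prune)
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

prune_latest_only /backup/gitea /backup/scripts/.restic-password
```

Under the policy described above, a call like this would belong only after the new snapshot has been created successfully, so a failed backup never removes the last good copy.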

2026-06-04 manual refresh evidence:

- 188 `momo-pg-backup.sh` produced `momo_analytics_20260604_154234.sql.gz` and pruned old backups beyond keep-last=1.
- 110 `backup-awoooi-frequent.sh` completed restic snapshot `7440d75f` and pruned previous AWOOOI high-frequency DB snapshot.
- 18:54 `backup-status.sh --no-notify`: `stale110=none`, `stale188=none`, `configured_missing_188=0`, `core_blockers=1`, `escrow_missing=5`.

18:55 cold-start scorecard refresh:

- `PASS=71 WARN=3 BLOCKED=3`.
- Remaining hard blocks: 120 ping, 120 SSH, and 120 K3s read-only check.
- 188 backup health stale jobs are clear.
- momo current-month parity is green: `2215|2215|2026-06-01|2026-06-04|2026-06-01|2026-06-04`.

19:02 120 console handoff evidence:

- local/110/121/188 cannot reach 192.168.0.120.
- K3s node lease for `mon` stopped renewing at `2026-05-22 02:48:36 +08`.
- `120-fsck-maintenance-checklist.sh --no-color` returns `PASS=2 WARN=2 BLOCKED=3`, so backup aggregate remains correctly blocked until console/SSH recovery.

The remaining `core_blockers=1` is expected until 192.168.0.120 comes back and `/backup/scripts/backup-configs.sh` plus `/backup/scripts/backup-all.sh` both complete cleanly. Do not suppress this red gate.
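
The "do not suppress" rule can be made mechanical by treating the summary as `key=value` tokens and failing while any blocker counter is above zero. This is a hedged sketch: `status_gate` is a hypothetical name, the choice of `core_blockers`, `failed`, and `BLOCKED` as the gating keys is an assumption, and the real `backup-status.sh` output format may differ from the space-separated form used here.

```bash
#!/usr/bin/env bash
# Hypothetical gate: read space-separated key=value tokens (as in the 18:54
# summary and the scorecard above) and return non-zero while any blocker
# counter is greater than zero.
status_gate() {
  local -a tokens
  local token key value rc=0
  read -r -a tokens <<< "$1"
  for token in "${tokens[@]}"; do
    key="${token%%=*}"
    value="${token#*=}"
    # Skip non-numeric values such as stale110=none.
    case "$value" in ''|*[!0-9]*) continue ;; esac
    case "$key" in
      core_blockers|failed|BLOCKED)
        if [ "$value" -gt 0 ]; then
          echo "RED: $key=$value"
          rc=1
        fi
        ;;
    esac
  done
  return "$rc"
}

status_gate "stale110=none stale188=none configured_missing_188=0 core_blockers=1 escrow_missing=5" \
  || echo "gate stays red"
```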
---
## Full Crontab Schedule (110)
```
0 2 * * * backup-all.sh ← full backup of 9 services
0 8,14,20 * * * backup-awoooi-frequent.sh ← AWOOOI high-frequency (every 6 hours)
0 3 * * * sync-offsite-backups.sh --mode sync ← Google Drive/rclone gated sync
5 6 * * * backup-status.sh ← once-daily backup status summary; avoids success-heartbeat spam
20 7 * * * verify-offsite-full-sync.sh --write-textfile ← Google Drive/rclone latest-only verification
```
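
The 6h maximum loss for AWOOOI follows from combining the 02:00 full backup with the 08:00/14:00/20:00 high-frequency runs. A small sketch of that arithmetic, using an illustrative helper `max_gap_hours` that is not one of the deployed scripts:

```bash
#!/usr/bin/env bash
# Illustrative helper: largest gap, in hours, between consecutive daily run
# hours, including the wrap from the last run to the first run next day.
# Pass hours without leading zeros (8, not 08) to avoid octal parsing.
max_gap_hours() {
  local sorted first prev="" gap max=0 h
  sorted=$(printf '%s\n' "$@" | sort -n)
  first=$(printf '%s\n' "$sorted" | head -n 1)
  for h in $sorted; do
    if [ -n "$prev" ]; then
      gap=$(( h - prev ))
      if [ "$gap" -gt "$max" ]; then max=$gap; fi
    fi
    prev=$h
  done
  gap=$(( first + 24 - prev ))
  if [ "$gap" -gt "$max" ]; then max=$gap; fi
  echo "$max"
}

max_gap_hours 2 8 14 20   # AWOOOI: 02:00 full + 08/14/20 high-frequency -> 6
max_gap_hours 2           # services with only the 02:00 run -> 24
```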
---
## Backup Architecture
```
192.168.0.110 (/backup/scripts/backup-all.sh) daily at 02:00
├── [1/9] backup-gitea.sh → gitea dump → /backup/gitea
├── [2/9] backup-momo.sh → SSH 188 pg_dump momo → /backup/momo
├── [3/9] backup-harbor.sh → harbor dump → /backup/harbor
├── [4/9] backup-awoooi.sh → SSH 188 pg_dump awoooi_prod/dev/k3s → /backup/awoooi
├── [5/9] backup-langfuse.sh → docker exec langfuse-db pg_dump → /backup/langfuse
├── [6/9] backup-monitoring.sh → volumes prometheus/grafana/alertmanager → /backup/monitoring
├── [7/9] backup-signoz.sh → volumes signoz-clickhouse/sqlite → /backup/signoz
├── [8/9] backup-open-webui.sh → SSH 188 volume open-webui → /backup/open-webui
└── [9/9] backup-clawbot.sh → SSH 188 volume clawbot-redis → /backup/clawbot
Backup failure → notify_clawbot("failed") → /webhook/custom or AwoooP/Alertmanager path → Telegram 🔴
Backup success → textfile / Prometheus / 06:05 status summary, no immediate push
192.168.0.188 (Velero) daily at 02:00
└── K8s resource snapshot → MinIO :9000 (bucket: velero)
```
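
The fan-out above can be sketched as a loop that runs each step, keeps going after a failure, and reports every failure through a notifier. This is a hypothetical illustration: the real `backup-all.sh` v3.0 is not reproduced in this runbook, continuing after a failed step is an assumption, and `run_steps`, `fake_notify`, `step_ok`, and `step_broken` are stand-in names. The notifier is called with the same four-argument shape as the `notify_clawbot` test command, with the fourth argument passed as `0` as it is there.

```bash
#!/usr/bin/env bash
# Hypothetical orchestration sketch, not the deployed backup-all.sh.
# Usage: run_steps <notifier> <step>...
# Runs every step in order, alerts on each failure, and returns non-zero
# if any step failed.
run_steps() {
  local notify="$1"; shift
  local total=$# index=0 failed=0 step
  for step in "$@"; do
    index=$(( index + 1 ))
    echo "[$index/$total] $step"
    if ! "$step"; then
      failed=$(( failed + 1 ))
      "$notify" "failed" "$step" "step exited non-zero" 0
    fi
  done
  echo "failed=$failed"
  [ "$failed" -eq 0 ]
}

# Stand-ins so the sketch runs anywhere; on 110 these would be the real
# backup scripts and notify_clawbot from common.sh.
fake_notify() { echo "ALERT $1 $2"; }
step_ok()     { return 0; }
step_broken() { return 1; }

run_steps fake_notify step_ok step_broken step_ok || true
```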
---
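## Not Yet Backed Up (Rationale)
| Service | Reason | Notes |
|---------|--------|-------|
| Prometheus TSDB | Raw metric data, not configuration; the TSDB has its own 30d TTL | Low priority; Grafana configuration is already backed up |
| Sentry | Not currently running (`docker ps` shows nothing) | A volume exists; re-evaluate after redeployment |
| Redis (AWOOOI) | Cache/WorkingMemory; no persistent business data | Low priority |
| Velero MinIO data | MinIO is the backup of the backup and needs an offsite copy | B2/S3 offsite under evaluation |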
---
## Verification SOP
```bash
# Latest backup log
ssh wooo@192.168.0.110 "tail -50 /backup/logs/backup.log"
# Snapshot count per service: count rows that start with an 8-hex-digit short ID.
# grep -c already prints 0 when nothing matches, so only its exit status is masked.
ssh wooo@192.168.0.110 "for r in gitea momo harbor awoooi langfuse monitoring signoz open-webui clawbot; do
  echo -n \"\$r: \"
  restic -r /backup/\$r snapshots --password-file /backup/scripts/.restic-password 2>/dev/null | grep -cE '^[0-9a-f]{8} ' || true
done"
# Alert test
ssh wooo@192.168.0.110 "source /backup/scripts/common.sh && notify_clawbot 'warning' 'manual-test' 'manual alert test' 0"
```
---
## Related Documents
- [REBOOT-RECOVERY-SOP.md](REBOOT-RECOVERY-SOP.md) - Reboot recovery SOP
- `scripts/backup/` - All backup scripts (Git-versioned)
- `/backup/scripts/` (on 110) - Scripts as actually deployed