diff --git a/docs/runbooks/BACKUP-STATUS.md b/docs/runbooks/BACKUP-STATUS.md
index 40f6c1d6..42e83341 100644
--- a/docs/runbooks/BACKUP-STATUS.md
+++ b/docs/runbooks/BACKUP-STATUS.md
@@ -1,44 +1,66 @@
 # BACKUP-STATUS.md - Backup Status Overview
 
-> 2026-04-05 Claude Code: chief-architect full inventory + full automation + high-frequency backup deployment completed
+> 2026-04-05 Claude Code: chief-architect full inventory: all services fully automated + alerting
 > Backup hub: 192.168.0.110 (`/backup/`), Restic + GFS (grandfather-father-son) policy
 
 ---
 
 ## Backup Overview (fully automated)
 
-| Data type | Backup script | Schedule | Max data loss | Retention | Status |
-|---------|---------|------|---------|---------|------|
-| Gitea (DB + repositories) | `backup-gitea.sh` | Daily 02:00 | 24h | 28h/30d/12w/24m | ✅ |
-| MOMO PostgreSQL | `backup-momo.sh` | Daily 02:00 | 24h | 28h/30d/12w/24m | ✅ |
-| Harbor (Registry + DB) | `backup-harbor.sh` | Daily 02:00 | 24h | 28h/30d/12w/24m | ✅ |
-| **AWOOOI PostgreSQL (full)** | **`backup-awoooi.sh`** | **Daily 02:00** | **6h** | **28h/30d/12w/24m** | **✅** |
-| **AWOOOI PostgreSQL (high-frequency)** | **`backup-awoooi-frequent.sh`** | **Daily 08/14/20:00** | **6h** | **28h/30d/12w/24m** | **✅** |
-| K8s resources (all namespaces) | Velero + MinIO | Daily 02:00 | 24h | 7 copies | ✅ |
+| # | Data type | Backup script | Schedule | Max data loss | Status |
+|---|---------|---------|------|---------|------|
+| 1 | Gitea (DB + repositories) | `backup-gitea.sh` | Daily 02:00 | 24h | ✅ |
+| 2 | MOMO PostgreSQL | `backup-momo.sh` | Daily 02:00 | 24h | ✅ |
+| 3 | Harbor (Registry + DB) | `backup-harbor.sh` | Daily 02:00 | 24h | ✅ |
+| 4 | **AWOOOI PostgreSQL (full)** | **`backup-awoooi.sh`** | **Daily 02:00** | **6h** | **✅** |
+| 4h | **AWOOOI PostgreSQL (high-frequency)** | **`backup-awoooi-frequent.sh`** | **08/14/20:00** | **6h** | **✅** |
+| 5 | **Langfuse (AI tracing/evaluation)** | **`backup-langfuse.sh`** | **Daily 02:00** | **24h** | **✅** |
+| 6 | **Monitoring (Prometheus/Grafana/Alertmanager)** | **`backup-monitoring.sh`** | **Daily 02:00** | **24h** | **✅** |
+| 7 | **SigNoz (ClickHouse traces/logs)** | **`backup-signoz.sh`** | **Daily 02:00** | **24h** | **✅** |
+| 8 | **Open-WebUI (LLM chat history)** | **`backup-open-webui.sh`** | **Daily 02:00** | **24h** | **✅** |
+| 9 | **ClawBot Redis (state/cache)** | **`backup-clawbot.sh`** | **Daily 02:00** | **24h** | **✅** |
+| - | K8s resources (all namespaces) | Velero + MinIO | Daily 02:00 | 24h | ✅ |
 
-**AWOOOI daily backup schedule**: 02:00 (includes awoooi_dev + k3s), 08:00, 14:00, 20:00 (awoooi_prod only) = **4 runs/day**
+**Master backup script**: `/backup/scripts/backup-all.sh` v3.0, runs all 9 backups in one pass
+
+---
+
+## Alerting
+
+Backup failures are pushed to Telegram automatically (via ClawBot `/webhook/custom`):
+
+| Status | Severity | Telegram message |
+|------|---------|--------------|
+| `success` | info | ✅ Normal notification |
+| `warning` | warning | ⚠️ Yellow warning |
+| `failed` | **critical** | 🔴 **Immediate alert** |
+
+**Alert endpoint**: `http://192.168.0.188:8088/api/v1/webhook/custom`
+**Test command**:
+```bash
+source /backup/scripts/common.sh
+notify_clawbot "failed" "backup-test" "測試告警" 0
+```
 
 ---
 
 ## GFS Retention Policy
 
-| Tier | Copies kept | Coverage | Notes |
-|------|---------|---------|------|
-| Hourly | 28 | Last 7 days (6h snapshots) | AWOOOI high-frequency |
-| Daily | 30 | Last 30 days | All services |
-| Weekly | 12 | Last 3 months | All services |
-| Monthly | 24 | Last 2 years | All services |
-
-> Previous policy: 7d/4w/6m → extended on 2026-04-05 to 28h/30d/12w/24m
+| Tier | Copies kept | Coverage |
+|------|---------|---------|
+| Hourly (AWOOOI high-frequency) | 28 | Last 7 days |
+| Daily | 30 | Last 30 days |
+| Weekly | 12 | Last 3 months |
+| Monthly | 24 | Last **2 years** |
 
 ---
 
 ## Full Crontab Schedule (110)
 
 ```
-0 2 * * * backup-all.sh ← Gitea + MOMO + Harbor + AWOOOI full backup
-0 8,14,20 * * * backup-awoooi-frequent.sh ← AWOOOI high-frequency (every 6 hours)
-0 6 * * * backup-status.sh ← backup status report
+0 2 * * * backup-all.sh ← full backup of 9 services
+0 8,14,20 * * * backup-awoooi-frequent.sh ← AWOOOI high-frequency (every 6 hours)
+0 6 * * * backup-status.sh ← backup status report
 ```
 
 ---
@@ -46,48 +68,56 @@
 ## Backup Architecture
 
 ```
-192.168.0.110 (/backup/scripts/)
-├── backup-all.sh (daily 02:00)
-│   ├── [1/4] backup-gitea.sh → gitea dump → restic /backup/gitea
-│   ├── [2/4] backup-momo.sh → SSH 188 pg_dump → restic /backup/momo
-│   ├── [3/4] backup-harbor.sh → harbor dump → restic /backup/harbor
-│   └── [4/4] backup-awoooi.sh → SSH 188 pg_dump (prod/dev/k3s) → restic /backup/awoooi
-│
-└── backup-awoooi-frequent.sh (08/14/20:00)
-    └── SSH 188 pg_dump awoooi_prod → restic /backup/awoooi (same repository)
+192.168.0.110 (/backup/scripts/backup-all.sh) daily 02:00
+├── [1/9] backup-gitea.sh → gitea dump → /backup/gitea
+├── [2/9] backup-momo.sh → SSH 188 pg_dump momo → /backup/momo
+├── [3/9] backup-harbor.sh → harbor dump → /backup/harbor
+├── [4/9] backup-awoooi.sh → SSH 188 pg_dump awoooi_prod/dev/k3s → /backup/awoooi
+├── [5/9] backup-langfuse.sh → docker exec langfuse-db pg_dump → /backup/langfuse
+├── [6/9] backup-monitoring.sh → volumes prometheus/grafana/alertmanager → /backup/monitoring
+├── [7/9] backup-signoz.sh → volumes signoz-clickhouse/sqlite → /backup/signoz
+├── [8/9] backup-open-webui.sh → SSH 188 volume open-webui → /backup/open-webui
+└── [9/9] backup-clawbot.sh → SSH 188 volume clawbot-redis → /backup/clawbot
 
-192.168.0.188 (Velero)
-└── K8s resource snapshots → MinIO 192.168.0.188:9000 (bucket: velero)
+Backup failure → notify_clawbot("failed") → /webhook/custom → Telegram 🔴
+
+192.168.0.188 (Velero) daily 02:00
+└── K8s resource snapshots → MinIO :9000 (bucket: velero)
 ```
 
 ---
 
+## Not Yet Backed Up (rationale)
+
+| Service | Reason | Notes |
+|------|------|------|
+| Prometheus TSDB | Raw metric data (not configuration); the TSDB has its own 30d TTL | Low priority; Grafana configuration is already backed up |
+| Sentry | Not currently running (`docker ps` shows nothing) | Volume exists; reassess after redeployment |
+| Redis (AWOOOI) | Cache/WorkingMemory only, no persistent business data | Low priority |
+| Velero MinIO data | MinIO is a backup of a backup and needs an off-host copy | B2/S3 offsite to be evaluated |
+
+---
+
 ## Verification SOP
 
 ```bash
-# Check the latest backup log
-ssh wooo@192.168.0.110 "tail -30 /backup/logs/backup.log"
+# Latest backup log
+ssh wooo@192.168.0.110 "tail -50 /backup/logs/backup.log"
 
-# AWOOOI snapshot list (including high-frequency)
-ssh wooo@192.168.0.110 "restic -r /backup/awoooi snapshots \
-  --password-file /backup/scripts/.restic-password | tail -10"
+# Snapshot count for every service
+ssh wooo@192.168.0.110 "for r in gitea momo harbor awoooi langfuse monitoring signoz open-webui clawbot; do
+  echo -n \"\$r: \"
+  restic -r /backup/\$r snapshots --json --password-file /backup/scripts/.restic-password 2>/dev/null | grep -o short_id | wc -l
+done"
 
-# Snapshot count per service
-ssh wooo@192.168.0.110 "for r in gitea momo harbor awoooi; do \
-  echo -n \"\$r: \"; \
-  restic -r /backup/\$r snapshots --password-file /backup/scripts/.restic-password \
-  2>/dev/null | grep -c '^\w'; done"
-
-# Velero K8s
-kubectl get backup -n velero --sort-by=.metadata.creationTimestamp | tail -3
+# Alert test
+ssh wooo@192.168.0.110 "source /backup/scripts/common.sh && notify_clawbot 'warning' 'manual-test' '手動告警測試' 0"
 ```
 
 ---
 
 ## Related Documents
 
-- [REBOOT-RECOVERY-SOP.md](REBOOT-RECOVERY-SOP.md) - Reboot recovery SOP (including MinIO startup)
-- `scripts/backup/backup-awoooi.sh` - AWOOOI full backup script
-- `scripts/backup/backup-awoooi-frequent.sh` - AWOOOI high-frequency backup script
-- `scripts/backup/backup-all.sh` - master backup script for all services v2.0
+- [REBOOT-RECOVERY-SOP.md](REBOOT-RECOVERY-SOP.md) - Reboot recovery SOP
+- `scripts/backup/` - all backup scripts (Git-versioned)
 - `/backup/scripts/` (on 110) - scripts as actually deployed
diff --git a/scripts/backup/backup-all.sh b/scripts/backup/backup-all.sh
index 3a22743a..1a92af69 100755
--- a/scripts/backup/backup-all.sh
+++ b/scripts/backup/backup-all.sh
@@ -1,55 +1,52 @@
 #!/bin/bash
 # =============================================================================
 # WOOO AIOps - master backup script for all services
-# Version: 2.0.0
+# Version: 3.0.0
 # Created: 2026-03-12
 # 2026-04-05 Claude Code: added AWOOOI DB (v1→v2), chief-architect backup audit
+# 2026-04-05 Claude Code: added Langfuse/Monitoring/SigNoz/Open-WebUI/ClawBot (v2→v3), backup coverage audit
 # =============================================================================
 
 set -euo pipefail
 
-# Load shared functions
 source "$(dirname "$0")/common.sh"
 
-# -----------------------------------------------------------------------------
-# Main function
-# -----------------------------------------------------------------------------
 main() {
     local start_time=$(date +%s)
     local failed=0
-    local total=4
-    
+    local total=9
+
     log_info "╔══════════════════════════════════════════════════════════════╗"
-    log_info "║ WOOO AIOps - 全服務備份開始 (v2.0) ║"
+    log_info "║ WOOO AIOps - 全服務備份開始 (v3.0) ║"
     log_info "╚══════════════════════════════════════════════════════════════╝"
-    
+
     # Back up Gitea
     log_info ">>> [1/${total}] 備份 Gitea..."
     if /backup/scripts/backup-gitea.sh; then
         log_success " Gitea 備份成功"
     else
         log_error " Gitea 備份失敗"
-        ((failed++))
+        failed=$((failed+1))
     fi
-    
+
     # Back up MOMO Pro
     log_info ">>> [2/${total}] 備份 MOMO Pro..."
     if /backup/scripts/backup-momo.sh; then
         log_success " MOMO Pro 備份成功"
     else
         log_error " MOMO Pro 備份失敗"
-        ((failed++))
+        failed=$((failed+1))
     fi
-    
+
     # Back up Harbor
     log_info ">>> [3/${total}] 備份 Harbor..."
     if /backup/scripts/backup-harbor.sh; then
         log_success " Harbor 備份成功"
     else
         log_error " Harbor 備份失敗"
-        ((failed++))
+        failed=$((failed+1))
     fi
-    
+
     # Back up AWOOOI DB (awoooi_prod + k3s_datastore)
     # 2026-04-05 Claude Code: added after the chief-architect backup audit
     log_info ">>> [4/${total}] 備份 AWOOOI DB..."
@@ -57,12 +54,62 @@ main() {
         log_success " AWOOOI DB 備份成功"
     else
         log_error " AWOOOI DB 備份失敗"
-        ((failed++))
+        failed=$((failed+1))
     fi
-    
+
+    # Back up Langfuse (AI tracing/evaluation data)
+    # 2026-04-05 Claude Code: added after the backup coverage audit
+    log_info ">>> [5/${total}] 備份 Langfuse..."
+    if /backup/scripts/backup-langfuse.sh; then
+        log_success " Langfuse 備份成功"
+    else
+        log_error " Langfuse 備份失敗"
+        failed=$((failed+1))
+    fi
+
+    # Back up Monitoring (Prometheus + Grafana + Alertmanager)
+    # 2026-04-05 Claude Code: added after the backup coverage audit
+    log_info ">>> [6/${total}] 備份 Monitoring..."
+    if /backup/scripts/backup-monitoring.sh; then
+        log_success " Monitoring 備份成功"
+    else
+        log_error " Monitoring 備份失敗"
+        failed=$((failed+1))
+    fi
+
+    # Back up SigNoz (ClickHouse + SQLite)
+    # 2026-04-05 Claude Code: added after the backup coverage audit
+    log_info ">>> [7/${total}] 備份 SignOz..."
+    if /backup/scripts/backup-signoz.sh; then
+        log_success " SignOz 備份成功"
+    else
+        log_error " SignOz 備份失敗"
+        failed=$((failed+1))
+    fi
+
+    # Back up Open-WebUI (LLM chat history, from 188)
+    # 2026-04-05 Claude Code: added after the backup coverage audit
+    log_info ">>> [8/${total}] 備份 Open-WebUI (188)..."
+    if /backup/scripts/backup-open-webui.sh; then
+        log_success " Open-WebUI 備份成功"
+    else
+        log_error " Open-WebUI 備份失敗"
+        failed=$((failed+1))
+    fi
+
+    # Back up ClawBot Redis (state/cache, from 188, low priority)
+    # 2026-04-05 Claude Code: added after the backup coverage audit
+    log_info ">>> [9/${total}] 備份 ClawBot Redis (188)..."
+    if /backup/scripts/backup-clawbot.sh; then
+        log_success " ClawBot Redis 備份成功"
+    else
+        log_error " ClawBot Redis 備份失敗"
+        failed=$((failed+1))
+    fi
+
     local end_time=$(date +%s)
     local duration=$((end_time - start_time))
-    
+
     log_info "╔══════════════════════════════════════════════════════════════╗"
     if [ $failed -eq 0 ]; then
         log_success "║ 全服務備份完成 (${duration}s) - 全部成功 (${total}/${total}) ║"
@@ -72,9 +119,8 @@ main() {
         notify_clawbot "warning" "all" "全服務備份完成 ($((total-failed))/${total} 成功)" "${duration}"
     fi
     log_info "╚══════════════════════════════════════════════════════════════╝"
-    
+
     return $failed
 }
 
-# Run
 main "$@"
diff --git a/scripts/backup/backup-clawbot.sh b/scripts/backup/backup-clawbot.sh
new file mode 100755
index 00000000..7e38d0c6
--- /dev/null
+++ b/scripts/backup/backup-clawbot.sh
@@ -0,0 +1,75 @@
+#!/bin/bash
+# =============================================================================
+# WOOO AIOps - ClawBot Redis backup script (SSH → 192.168.0.188)
+# Version: 1.0.0
+# Created: 2026-04-05
+# 2026-04-05 Claude Code: added ClawBot Redis state/cache backup (chief-architect backup audit)
+# =============================================================================
+
+set -euo pipefail
+
+source "$(dirname "$0")/common.sh"
+
+SERVICE="clawbot"
+LOCAL_REPO="${BACKUP_BASE}/clawbot"
+DUMP_DIR="/tmp/clawbot-backup-$$"
+REMOTE_HOST="ollama@192.168.0.188"
+
+cleanup() {
+    rm -rf "${DUMP_DIR}"
+}
+
+main() {
+    local start_time=$(date +%s)
+    log_info "========== 開始 ClawBot Redis 備份 (188→110) =========="
+    mkdir -p "${DUMP_DIR}"
+
+    local timestamp=$(date "+%Y%m%d_%H%M%S")
+
+    # Step 1: trigger Redis BGSAVE so data is flushed to disk
+    log_info "觸發 Redis BGSAVE..."
+    ssh "${REMOTE_HOST}" "docker exec clawbot-redis redis-cli BGSAVE" 2>/dev/null || log_warn "BGSAVE 失敗或 clawbot-redis 未運行,繼續備份"
+    sleep 2 # give BGSAVE time to finish (fixed delay, not a completion check)
+
+    # Step 2: SSH to 188, tar the Redis volume and stream it back
+    log_info "從 192.168.0.188 拉取 clawbot-redis volume..."
+    if ssh "${REMOTE_HOST}" "docker run --rm -v clawbot-v5_clawbot-redis-data:/data alpine tar czf - /data 2>/dev/null" > "${DUMP_DIR}/clawbot-redis_${timestamp}.tar.gz"; then
+        local size=$(du -h "${DUMP_DIR}/clawbot-redis_${timestamp}.tar.gz" | cut -f1)
+        log_success "ClawBot Redis volume 拉取完成 (${size})"
+    else
+        log_error "ClawBot Redis volume 拉取失敗"
+        notify_clawbot "failed" "${SERVICE}" "ClawBot Redis 備份失敗 (SSH 188)"
+        cleanup
+        exit 1
+    fi
+
+    # Step 3: initialise the Restic repository
+    if [ ! -d "${LOCAL_REPO}/data" ]; then
+        log_info "初始化 Restic 倉庫: ${LOCAL_REPO}"
+        restic -r "${LOCAL_REPO}" init --password-file "${RESTIC_PASSWORD_FILE}" 2>&1 || {
+            log_error "Restic 倉庫初始化失敗"
+            cleanup
+            exit 1
+        }
+    fi
+
+    # Step 4: Restic backup
+    log_info "建立 Restic 備份..."
+    local tags=$(build_tags "${SERVICE}")
+    restic -r "${LOCAL_REPO}" backup "${DUMP_DIR}" --password-file "${RESTIC_PASSWORD_FILE}" ${tags} 2>&1
+
+    local snapshot_id=$(restic -r "${LOCAL_REPO}" snapshots --latest 1 --json --password-file "${RESTIC_PASSWORD_FILE}" 2>/dev/null | grep -oP '"short_id":"\K[^"]+' | head -1)
+    log_success "Restic 備份完成: ${snapshot_id}"
+
+    # Step 5: GFS cleanup
+    cleanup_old_backups "${LOCAL_REPO}"
+
+    cleanup
+
+    local end_time=$(date +%s)
+    local duration=$((end_time - start_time))
+    log_success "========== ClawBot Redis 備份完成 (${duration}s) =========="
+    notify_clawbot "success" "${SERVICE}" "ClawBot Redis 備份完成" "${duration}"
+}
+
+main "$@"
diff --git a/scripts/backup/backup-langfuse.sh b/scripts/backup/backup-langfuse.sh
new file mode 100755
index 00000000..f83124b4
--- /dev/null
+++ b/scripts/backup/backup-langfuse.sh
@@ -0,0 +1,69 @@
+#!/bin/bash
+# =============================================================================
+# WOOO AIOps - Langfuse backup script
+# Version: 1.0.0
+# Created: 2026-04-05
+# 2026-04-05 Claude Code: added Langfuse AI tracing data backup (chief-architect backup audit)
+# =============================================================================
+
+set -euo pipefail
+
+source "$(dirname "$0")/common.sh"
+
+SERVICE="langfuse"
+LOCAL_REPO="${BACKUP_BASE}/langfuse"
+DUMP_DIR="/tmp/langfuse-backup-$$"
+
+cleanup() {
+    rm -rf "${DUMP_DIR}"
+}
+
+main() {
+    local start_time=$(date +%s)
+    log_info "========== 開始 Langfuse 備份 =========="
+    mkdir -p "${DUMP_DIR}"
+
+    local timestamp=$(date "+%Y%m%d_%H%M%S")
+
+    # Step 1: Langfuse PostgreSQL dump (stderr stays out of the .sql file)
+    log_info "執行 Langfuse DB dump..."
+    if docker exec langfuse-db pg_dump -U langfuse langfuse > "${DUMP_DIR}/langfuse_${timestamp}.sql"; then
+        local size=$(du -h "${DUMP_DIR}/langfuse_${timestamp}.sql" | cut -f1)
+        log_success "Langfuse DB dump 完成 (${size})"
+    else
+        log_error "Langfuse DB dump 失敗"
+        notify_clawbot "failed" "${SERVICE}" "Langfuse 備份失敗"
+        cleanup
+        exit 1
+    fi
+
+    # Step 2: initialise the Restic repository (if it does not exist)
+    if [ ! -d "${LOCAL_REPO}/data" ]; then
+        log_info "初始化 Restic 倉庫: ${LOCAL_REPO}"
+        restic -r "${LOCAL_REPO}" init --password-file "${RESTIC_PASSWORD_FILE}" 2>&1 || {
+            log_error "Restic 倉庫初始化失敗"
+            cleanup
+            exit 1
+        }
+    fi
+
+    # Step 3: Restic backup
+    log_info "建立 Restic 備份..."
+    local tags=$(build_tags "${SERVICE}")
+    restic -r "${LOCAL_REPO}" backup "${DUMP_DIR}" --password-file "${RESTIC_PASSWORD_FILE}" ${tags} 2>&1
+
+    local snapshot_id=$(restic -r "${LOCAL_REPO}" snapshots --latest 1 --json --password-file "${RESTIC_PASSWORD_FILE}" 2>/dev/null | grep -oP '"short_id":"\K[^"]+' | head -1)
+    log_success "Restic 備份完成: ${snapshot_id}"
+
+    # Step 4: GFS cleanup
+    cleanup_old_backups "${LOCAL_REPO}"
+
+    cleanup
+
+    local end_time=$(date +%s)
+    local duration=$((end_time - start_time))
+    log_success "========== Langfuse 備份完成 (${duration}s) =========="
+    notify_clawbot "success" "${SERVICE}" "Langfuse 備份完成" "${duration}"
+}
+
+main "$@"
diff --git a/scripts/backup/backup-monitoring.sh b/scripts/backup/backup-monitoring.sh
new file mode 100755
index 00000000..9a43f01f
--- /dev/null
+++ b/scripts/backup/backup-monitoring.sh
@@ -0,0 +1,109 @@
+#!/bin/bash
+# =============================================================================
+# WOOO AIOps - Monitoring backup script (Prometheus + Grafana + Alertmanager)
+# Version: 1.1.0
+# Created: 2026-04-05
+# 2026-04-05 Claude Code: added monitoring data backup (chief-architect backup audit)
+# 2026-04-05 Claude Code: v1.1 fixed tar pipeline exit-code handling for the 1.1GB Prometheus volume
+# =============================================================================
+
+set -euo pipefail
+
+source "$(dirname "$0")/common.sh"
+
+SERVICE="monitoring"
+LOCAL_REPO="${BACKUP_BASE}/monitoring"
+DUMP_DIR="/tmp/monitoring-backup-$$"
+MONITORING_CONFIG_DIR="/home/wooo/monitoring"
+
+cleanup() {
+    rm -rf "${DUMP_DIR}"
+}
+
+backup_volume() {
+    local volume_name="$1"
+    local output_file="$2"
+    log_info "備份 volume: ${volume_name}"
+    # Note: tar may exit 1 when archiving a large volume (mmap/lock files change during the read)
+    # Use || true so a warning does not fail the run, but still verify the archive is non-empty
+    docker run --rm -v "${volume_name}:/data" alpine tar czf - /data 2>/dev/null > "${output_file}" || true
+    if [ -s "${output_file}" ]; then
+        local size=$(du -h "${output_file}" | cut -f1)
+        log_success " Volume ${volume_name} 備份完成 (${size})"
+        return 0
+    else
+        log_error " Volume ${volume_name} 備份失敗 (空檔案)"
+        return 1
+    fi
+}
+
+main() {
+    local start_time=$(date +%s)
+    log_info "========== 開始 Monitoring 備份 =========="
+    mkdir -p "${DUMP_DIR}"
+
+    local timestamp=$(date "+%Y%m%d_%H%M%S")
+    local any_failed=0
+
+    # Step 1: back up the Prometheus volume (TSDB data, roughly 1GB+)
+    backup_volume "monitoring_prometheus_data" "${DUMP_DIR}/prometheus_${timestamp}.tar.gz" || {
+        notify_clawbot "failed" "${SERVICE}" "Prometheus volume 備份失敗"
+        cleanup
+        exit 1
+    }
+
+    # Step 2: back up the Grafana volume (dashboard/alert settings)
+    backup_volume "monitoring_grafana_data" "${DUMP_DIR}/grafana_${timestamp}.tar.gz" || {
+        log_warn "Grafana volume 備份失敗,繼續..."
+        any_failed=1
+    }
+
+    # Step 3: back up the Alertmanager volume (silences/routing settings)
+    backup_volume "monitoring_alertmanager_data" "${DUMP_DIR}/alertmanager_${timestamp}.tar.gz" || {
+        log_warn "Alertmanager volume 備份失敗,繼續..."
+        any_failed=1
+    }
+
+    # Step 4: back up the monitoring config directory
+    log_info "備份 monitoring 設定檔 (${MONITORING_CONFIG_DIR})"
+    if [ -d "${MONITORING_CONFIG_DIR}" ]; then
+        tar czf "${DUMP_DIR}/monitoring-configs_${timestamp}.tar.gz" -C "$(dirname "${MONITORING_CONFIG_DIR}")" "$(basename "${MONITORING_CONFIG_DIR}")" 2>/dev/null || true
+        if [ -s "${DUMP_DIR}/monitoring-configs_${timestamp}.tar.gz" ]; then
+            log_success "設定檔備份完成"
+        else
+            log_warn "設定檔備份失敗或為空"
+        fi
+    else
+        log_warn "monitoring 設定目錄不存在: ${MONITORING_CONFIG_DIR}"
+    fi
+
+    # Step 5: initialise the Restic repository
+    if [ ! -d "${LOCAL_REPO}/data" ]; then
+        log_info "初始化 Restic 倉庫: ${LOCAL_REPO}"
+        restic -r "${LOCAL_REPO}" init --password-file "${RESTIC_PASSWORD_FILE}" 2>&1 || {
+            log_error "Restic 倉庫初始化失敗"
+            cleanup
+            exit 1
+        }
+    fi
+
+    # Step 6: Restic backup
+    log_info "建立 Restic 備份..."
+    local tags=$(build_tags "${SERVICE}")
+    restic -r "${LOCAL_REPO}" backup "${DUMP_DIR}" --password-file "${RESTIC_PASSWORD_FILE}" ${tags} 2>&1
+
+    local snapshot_id=$(restic -r "${LOCAL_REPO}" snapshots --latest 1 --json --password-file "${RESTIC_PASSWORD_FILE}" 2>/dev/null | grep -oP '"short_id":"\K[^"]+' | head -1)
+    log_success "Restic 備份完成: ${snapshot_id}"
+
+    # Step 7: GFS cleanup
+    cleanup_old_backups "${LOCAL_REPO}"
+
+    cleanup
+
+    local end_time=$(date +%s)
+    local duration=$((end_time - start_time))
+    log_success "========== Monitoring 備份完成 (${duration}s) =========="
+    notify_clawbot "$([ "${any_failed}" -eq 0 ] && echo success || echo warning)" "${SERVICE}" "Monitoring 備份完成 (Prometheus+Grafana+Alertmanager)" "${duration}"
+}
+
+main "$@"
diff --git a/scripts/backup/backup-open-webui.sh b/scripts/backup/backup-open-webui.sh
new file mode 100755
index 00000000..b9ecc8f6
--- /dev/null
+++ b/scripts/backup/backup-open-webui.sh
@@ -0,0 +1,70 @@
+#!/bin/bash
+# =============================================================================
+# WOOO AIOps - Open-WebUI backup script (SSH → 192.168.0.188)
+# Version: 1.0.0
+# Created: 2026-04-05
+# 2026-04-05 Claude Code: added Open-WebUI LLM chat history backup (chief-architect backup audit)
+# =============================================================================
+
+set -euo pipefail
+
+source "$(dirname "$0")/common.sh"
+
+SERVICE="open-webui"
+LOCAL_REPO="${BACKUP_BASE}/open-webui"
+DUMP_DIR="/tmp/open-webui-backup-$$"
+REMOTE_HOST="ollama@192.168.0.188"
+
+cleanup() {
+    rm -rf "${DUMP_DIR}"
+}
+
+main() {
+    local start_time=$(date +%s)
+    log_info "========== 開始 Open-WebUI 備份 (188→110) =========="
+    mkdir -p "${DUMP_DIR}"
+
+    local timestamp=$(date "+%Y%m%d_%H%M%S")
+
+    # Step 1: SSH to 188, tar the open-webui volume and stream it back
+    log_info "從 192.168.0.188 拉取 open-webui volume..."
+    if ssh "${REMOTE_HOST}" "docker run --rm -v open-webui:/data alpine tar czf - /data 2>/dev/null" > "${DUMP_DIR}/open-webui_${timestamp}.tar.gz"; then
+        local size=$(du -h "${DUMP_DIR}/open-webui_${timestamp}.tar.gz" | cut -f1)
+        log_success "Open-WebUI volume 拉取完成 (${size})"
+    else
+        log_error "Open-WebUI volume 拉取失敗"
+        notify_clawbot "failed" "${SERVICE}" "Open-WebUI 備份失敗 (SSH 188)"
+        cleanup
+        exit 1
+    fi
+
+    # Step 2: initialise the Restic repository
+    if [ ! -d "${LOCAL_REPO}/data" ]; then
+        log_info "初始化 Restic 倉庫: ${LOCAL_REPO}"
+        restic -r "${LOCAL_REPO}" init --password-file "${RESTIC_PASSWORD_FILE}" 2>&1 || {
+            log_error "Restic 倉庫初始化失敗"
+            cleanup
+            exit 1
+        }
+    fi
+
+    # Step 3: Restic backup
+    log_info "建立 Restic 備份..."
+    local tags=$(build_tags "${SERVICE}")
+    restic -r "${LOCAL_REPO}" backup "${DUMP_DIR}" --password-file "${RESTIC_PASSWORD_FILE}" ${tags} 2>&1
+
+    local snapshot_id=$(restic -r "${LOCAL_REPO}" snapshots --latest 1 --json --password-file "${RESTIC_PASSWORD_FILE}" 2>/dev/null | grep -oP '"short_id":"\K[^"]+' | head -1)
+    log_success "Restic 備份完成: ${snapshot_id}"
+
+    # Step 4: GFS cleanup
+    cleanup_old_backups "${LOCAL_REPO}"
+
+    cleanup
+
+    local end_time=$(date +%s)
+    local duration=$((end_time - start_time))
+    log_success "========== Open-WebUI 備份完成 (${duration}s) =========="
+    notify_clawbot "success" "${SERVICE}" "Open-WebUI 備份完成" "${duration}"
+}
+
+main "$@"
diff --git a/scripts/backup/backup-signoz.sh b/scripts/backup/backup-signoz.sh
new file mode 100755
index 00000000..9fb6464d
--- /dev/null
+++ b/scripts/backup/backup-signoz.sh
@@ -0,0 +1,103 @@
+#!/bin/bash
+# =============================================================================
+# WOOO AIOps - SigNoz backup script (ClickHouse + SQLite)
+# Version: 1.1.0
+# Created: 2026-04-05
+# 2026-04-05 Claude Code: added SigNoz distributed-tracing backup (chief-architect backup audit)
+# 2026-04-05 Claude Code: v1.1 fixed tar pipeline exit-code handling, added || true
+# =============================================================================
+
+set -euo pipefail
+
+source "$(dirname "$0")/common.sh"
+
+SERVICE="signoz"
+LOCAL_REPO="${BACKUP_BASE}/signoz"
+DUMP_DIR="/tmp/signoz-backup-$$"
+
+cleanup() {
+    # Make sure the collector is running again
+    docker start signoz-otel-collector 2>/dev/null || true
+    rm -rf "${DUMP_DIR}"
+}
+
+backup_volume() {
+    local volume_name="$1"
+    local output_file="$2"
+    local extra_exclude="${3:-}"
+    log_info "備份 volume: ${volume_name}"
+    # Use || true to tolerate tar's exit 1 warning when archiving a live volume
+    if [ -n "${extra_exclude}" ]; then
+        docker run --rm -v "${volume_name}:/data" alpine tar czf - "${extra_exclude}" /data 2>/dev/null > "${output_file}" || true
+    else
+        docker run --rm -v "${volume_name}:/data" alpine tar czf - /data 2>/dev/null > "${output_file}" || true
+    fi
+    if [ -s "${output_file}" ]; then
+        local size=$(du -h "${output_file}" | cut -f1)
+        log_success " Volume ${volume_name} 備份完成 (${size})"
+        return 0
+    else
+        log_error " Volume ${volume_name} 備份失敗 (空檔案)"
+        return 1
+    fi
+}
+
+main() {
+    local start_time=$(date +%s)
+    log_info "========== 開始 SignOz 備份 =========="
+    mkdir -p "${DUMP_DIR}"
+
+    local timestamp=$(date "+%Y%m%d_%H%M%S")
+
+    # Step 1: stop the OTEL Collector for data consistency
+    log_info "暫停 signoz-otel-collector 以確保數據一致性..."
+    docker stop signoz-otel-collector 2>/dev/null || log_warn "signoz-otel-collector 未在運行,繼續"
+    docker stop signoz-telemetrystore-migrator 2>/dev/null || true
+
+    # Step 2: back up the ClickHouse volume (exclude the tmp directory to reduce size)
+    backup_volume "signoz-clickhouse" "${DUMP_DIR}/clickhouse_${timestamp}.tar.gz" "--exclude=/data/tmp" || {
+        log_error "ClickHouse volume 備份失敗"
+        cleanup
+        notify_clawbot "failed" "${SERVICE}" "SignOz ClickHouse 備份失敗"
+        exit 1
+    }
+
+    # Step 3: back up the SQLite volume (SigNoz metadata)
+    backup_volume "signoz-sqlite" "${DUMP_DIR}/sqlite_${timestamp}.tar.gz" || {
+        log_warn "SQLite volume 備份失敗,繼續..."
+    }
+
+    # Step 4: restart the Collector
+    log_info "重啟 signoz-otel-collector..."
+    docker start signoz-otel-collector 2>/dev/null || log_warn "signoz-otel-collector 重啟失敗"
+
+    # Step 5: initialise the Restic repository
+    if [ ! -d "${LOCAL_REPO}/data" ]; then
+        log_info "初始化 Restic 倉庫: ${LOCAL_REPO}"
+        restic -r "${LOCAL_REPO}" init --password-file "${RESTIC_PASSWORD_FILE}" 2>&1 || {
+            log_error "Restic 倉庫初始化失敗"
+            rm -rf "${DUMP_DIR}"
+            exit 1
+        }
+    fi
+
+    # Step 6: Restic backup
+    log_info "建立 Restic 備份..."
+    local tags=$(build_tags "${SERVICE}")
+    restic -r "${LOCAL_REPO}" backup "${DUMP_DIR}" --password-file "${RESTIC_PASSWORD_FILE}" ${tags} 2>&1
+
+    local snapshot_id=$(restic -r "${LOCAL_REPO}" snapshots --latest 1 --json --password-file "${RESTIC_PASSWORD_FILE}" 2>/dev/null | grep -oP '"short_id":"\K[^"]+' | head -1)
+    log_success "Restic 備份完成: ${snapshot_id}"
+
+    # Step 7: GFS cleanup
+    cleanup_old_backups "${LOCAL_REPO}"
+
+    rm -rf "${DUMP_DIR}"
+
+    local end_time=$(date +%s)
+    local duration=$((end_time - start_time))
+    log_success "========== SignOz 備份完成 (${duration}s) =========="
+    notify_clawbot "success" "${SERVICE}" "SignOz 備份完成 (ClickHouse+SQLite)" "${duration}"
+}
+
+main "$@"
diff --git a/scripts/backup/common.sh b/scripts/backup/common.sh
new file mode 100644
index 00000000..ad5cda72
--- /dev/null
+++ b/scripts/backup/common.sh
@@ -0,0 +1,147 @@
+#!/bin/bash
+# =============================================================================
+# WOOO AIOps - shared backup function library
+# Version: 1.0.0
+# Created: 2026-03-12
+# =============================================================================
+
+# -----------------------------------------------------------------------------
+# Configuration (to be updated once the CEO provides the B2 account)
+# -----------------------------------------------------------------------------
+export BACKUP_BASE="/backup"
+export BACKUP_LOG_DIR="${BACKUP_BASE}/logs"
+export RESTIC_PASSWORD_FILE="${BACKUP_BASE}/scripts/.restic-password"
+
+# Backblaze B2 settings (to be filled in)
+export B2_ACCOUNT_ID="" # pending from CEO
+export B2_APPLICATION_KEY="" # pending from CEO
+export B2_BUCKET="wooo-aiops-backup"
+
+# ClawBot notification webhook
+export CLAWBOT_WEBHOOK="http://192.168.0.188:8088/api/v1/webhook/custom"
+
+# Retention policy (GFS, grandfather-father-son)
+export KEEP_DAILY=30 # 2026-04-05 Claude Code: retention extended (was 7, now 30)
+export KEEP_WEEKLY=12 # 2026-04-05 Claude Code: retention extended (was 4, now 12)
+export KEEP_MONTHLY=24 # 2026-04-05 Claude Code: retention extended (was 6, now 24)
+
+# -----------------------------------------------------------------------------
+# Logging functions
+# -----------------------------------------------------------------------------
+log() {
+    local level="$1"
+    local message="$2"
+    local timestamp=$(date "+%Y-%m-%d %H:%M:%S")
+    echo "[${timestamp}] [${level}] ${message}" | tee -a "${BACKUP_LOG_DIR}/backup.log"
+}
+
+log_info() { log "INFO" "$1"; }
+log_warn() { log "WARN" "$1"; }
+log_error() { log "ERROR" "$1"; }
+log_success() { log "SUCCESS" "$1"; }
+
+# -----------------------------------------------------------------------------
+# Notification function
+# -----------------------------------------------------------------------------
+notify_clawbot() {
+    local status="$1"
+    local service="$2"
+    local message="$3"
+    local duration="${4:-0}"
+
+    # 2026-04-05 Claude Code: correct /webhook/custom payload, severity derived from status
+    local severity="info"
+    [ "$status" = "warning" ] && severity="warning"
+    [ "$status" = "failed" ] && severity="critical"
+
+    if command -v curl &> /dev/null; then
+        curl -s -X POST "${CLAWBOT_WEBHOOK}" \
+            -H 'Content-Type: application/json' \
+            -d "{\"name\":\"Backup.${service}\",\"severity\":\"${severity}\",\"service\":\"${service}\",\"description\":\"[${status}] ${message} (${duration}s)\"}" \
+            --connect-timeout 5 2>/dev/null || true
+    fi
+}
+
+# -----------------------------------------------------------------------------
+# Restic tag functions
+# -----------------------------------------------------------------------------
+get_app_version() {
+    local service="$1"
+    case "$service" in
+        gitea)
+            docker exec gitea gitea --version 2>/dev/null | grep -oP "\\d+\\.\\d+\\.\\d+" | head -1 || echo "unknown"
+            ;;
+        harbor)
+            cat /opt/harbor/harbor.yml 2>/dev/null | grep -oP "version: \\K.*" || echo "unknown"
+            ;;
+        momo)
+            echo "1.0.0" # MOMO version is fixed, or could be read from config
+            ;;
+        *)
+            echo "unknown"
+            ;;
+    esac
+}
+
+get_git_hash() {
+    local service="$1"
+    case "$service" in
+        gitea)
+            cd /var/lib/gitea 2>/dev/null && git rev-parse --short HEAD 2>/dev/null || echo "none"
+            ;;
+        *)
+            echo "none"
+            ;;
+    esac
+}
+
+build_tags() {
+    local service="$1"
+    local version=$(get_app_version "$service")
+    local git_hash=$(get_git_hash "$service")
+    local timestamp=$(date "+%Y%m%d_%H%M%S")
+
+    echo "--tag service:${service} --tag version:${version} --tag git:${git_hash} --tag timestamp:${timestamp}"
+}
+
+# -----------------------------------------------------------------------------
+# Backup verification function
+# -----------------------------------------------------------------------------
+verify_backup() {
+    local repo="$1"
+    local snapshot_id="$2"
+
+    log_info "驗證備份快照: ${snapshot_id}"
+    restic -r "${repo}" check --read-data-subset=1% 2>&1
+    return $?
+}
+
+# -----------------------------------------------------------------------------
+# Cleanup function (GFS policy)
+# -----------------------------------------------------------------------------
+cleanup_old_backups() {
+    local repo="$1"
+
+    log_info "執行 GFS 清理策略"
+    restic -r "${repo}" forget \
+        --keep-daily ${KEEP_DAILY} \
+        --keep-weekly ${KEEP_WEEKLY} \
+        --keep-monthly ${KEEP_MONTHLY} \
+        --prune 2>&1
+}
+
+# -----------------------------------------------------------------------------
+# Configuration check
+# -----------------------------------------------------------------------------
+check_b2_config() {
+    if [ -z "${B2_ACCOUNT_ID}" ] || [ -z "${B2_APPLICATION_KEY}" ]; then
+        log_warn "B2 配置未設定,僅執行本地備份"
+        return 1
+    fi
+    return 0
+}
+
+# Create the log directory
+mkdir -p "${BACKUP_LOG_DIR}"
+
+log_info "共用函式庫載入完成 (v1.0.0)"
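
Review note on `notify_clawbot` in `scripts/backup/common.sh`: the JSON body is built by interpolating `${message}` straight into a double-quoted string, so a message containing a double quote or a backslash would likely yield an invalid payload. The following is a minimal sketch of one way to harden this, assuming pure-Bash escaping is acceptable on the backup host. `json_escape` and `build_payload` are hypothetical helper names that do not exist in the repository; the field names simply mirror the payload shown in the diff.

```bash
#!/bin/bash
# Hypothetical helpers (not part of the original common.sh).

# Escape a string so it can sit inside a JSON string literal.
# Handles backslash, double quote, newline and tab only; other control
# characters are passed through unchanged in this sketch.
json_escape() {
    local s="$1"
    local nl=$'\n'
    local tab=$'\t'
    s="${s//\\/\\\\}"      # backslash must be escaped first
    s="${s//\"/\\\"}"      # double quote
    s="${s//${nl}/\\n}"    # newline
    s="${s//${tab}/\\t}"   # tab
    printf '%s' "${s}"
}

# Build the same payload shape that notify_clawbot posts to /webhook/custom.
build_payload() {
    local status="$1"
    local service="$2"
    local message="$3"
    local duration="${4:-0}"
    local severity

    # Same status-to-severity mapping as the original function.
    case "${status}" in
        warning) severity="warning" ;;
        failed)  severity="critical" ;;
        *)       severity="info" ;;
    esac

    printf '{"name":"Backup.%s","severity":"%s","service":"%s","description":"[%s] %s (%ss)"}' \
        "$(json_escape "${service}")" \
        "${severity}" \
        "$(json_escape "${service}")" \
        "$(json_escape "${status}")" \
        "$(json_escape "${message}")" \
        "$(json_escape "${duration}")"
}
```

Inside `notify_clawbot`, the `-d` argument could then become `-d "$(build_payload "$status" "$service" "$message" "$duration")"`. If `jq` happens to be installed on 110, building the body with `jq -n --arg` would be a more complete alternative, since it also covers the control characters this sketch ignores.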