feat(backup): full-service backup coverage + alerting, 9/9 services covered

New backups (deployed to 110; every first run passed):
- backup-langfuse.sh: Langfuse AI tracing/evaluation DB (7238 traces)
- backup-monitoring.sh: Prometheus + Grafana + Alertmanager volumes + configs
- backup-signoz.sh: SignOz ClickHouse + SQLite (distributed tracing/logs)
- backup-open-webui.sh: Open-WebUI LLM conversation history (volume on 188, over SSH)
- backup-clawbot.sh: ClawBot Redis state/cache (volume on 188, over SSH)
- backup-all.sh v3.0: consolidated to 9/9 services

Alerting:
- common.sh: notify_clawbot now sends the correct /webhook/custom payload format
- failed → severity:critical → immediate Telegram 🔴 alert
- Alert test passed: {"status":"ok","alert_id":"878c4c59..."}

GFS retention: 30 daily / 12 weekly / 24 monthly (AWOOOI additionally keeps 28 high-frequency hourly snapshots)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OG T
2026-04-05 11:12:42 +08:00
parent 67fd5e61fb
commit f51bf5a6a8
8 changed files with 718 additions and 69 deletions


@@ -1,44 +1,66 @@
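# BACKUP-STATUS.md: Backup Status Overview
> 2026-04-05 Claude Code: chief-architect full inventory + automation + high-frequency backup deployment complete
> 2026-04-05 Claude Code: chief-architect full inventory, all services fully automated + alerting
> Backup hub: 192.168.0.110 (`/backup/`), Restic + GFS (grandfather-father-son) policy
---
## Backup Landscape (fully automated)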
| Data type | Backup script | Schedule | Max data loss | Retention | Status |
|---------|---------|------|---------|---------|------|
| Gitea (DB + repositories) | `backup-gitea.sh` | Daily 02:00 | 24h | 28 hourly/30 daily/12 weekly/24 monthly | ✅ |
| MOMO PostgreSQL | `backup-momo.sh` | Daily 02:00 | 24h | 28 hourly/30 daily/12 weekly/24 monthly | ✅ |
| Harbor (Registry + DB) | `backup-harbor.sh` | Daily 02:00 | 24h | 28 hourly/30 daily/12 weekly/24 monthly | ✅ |
| **AWOOOI PostgreSQL (full)** | **`backup-awoooi.sh`** | **Daily 02:00** | **6h** | **28 hourly/30 daily/12 weekly/24 monthly** | **✅** |
| **AWOOOI PostgreSQL (high-frequency)** | **`backup-awoooi-frequent.sh`** | **Daily 08/14/20:00** | **6h** | **28 hourly/30 daily/12 weekly/24 monthly** | **✅** |
| K8s resources (all namespaces) | Velero + MinIO | Daily 02:00 | 24h | 7 copies | ✅ |
| # | Data type | Backup script | Schedule | Max data loss | Status |
|---|---------|---------|------|---------|------|
| 1 | Gitea (DB + repositories) | `backup-gitea.sh` | Daily 02:00 | 24h | ✅ |
| 2 | MOMO PostgreSQL | `backup-momo.sh` | Daily 02:00 | 24h | ✅ |
| 3 | Harbor (Registry + DB) | `backup-harbor.sh` | Daily 02:00 | 24h | ✅ |
| 4 | **AWOOOI PostgreSQL (full)** | **`backup-awoooi.sh`** | **Daily 02:00** | **6h** | **✅** |
| 4h | **AWOOOI PostgreSQL (high-frequency)** | **`backup-awoooi-frequent.sh`** | **08/14/20:00** | **6h** | **✅** |
| 5 | **Langfuse (AI tracing/evaluation)** | **`backup-langfuse.sh`** | **Daily 02:00** | **24h** | **✅** |
| 6 | **Monitoring (Prometheus/Grafana/Alertmanager)** | **`backup-monitoring.sh`** | **Daily 02:00** | **24h** | **✅** |
| 7 | **SignOz (ClickHouse traces/logs)** | **`backup-signoz.sh`** | **Daily 02:00** | **24h** | **✅** |
| 8 | **Open-WebUI (LLM conversation history)** | **`backup-open-webui.sh`** | **Daily 02:00** | **24h** | **✅** |
| 9 | **ClawBot Redis (state/cache)** | **`backup-clawbot.sh`** | **Daily 02:00** | **24h** | **✅** |
| - | K8s resources (all namespaces) | Velero + MinIO | Daily 02:00 | 24h | ✅ |
**AWOOOI daily backup schedule**: 02:00 (includes awoooi_dev + k3s), then 08:00, 14:00, and 20:00 (awoooi_prod only) = **4 runs/day**
**Backup orchestrator**: `/backup/scripts/backup-all.sh` v3.0, runs all 9 backups in a single pass
---
## Alerting
Backup failures are pushed to Telegram automatically through ClawBot `/webhook/custom`.
| Status | Severity | Telegram message |
|------|---------|--------------|
| `success` | info | ✅ Normal notification |
| `warning` | warning | ⚠️ Yellow warning |
| `failed` | **critical** | 🔴 **Immediate alert** |
**Alert endpoint**: `http://192.168.0.188:8088/api/v1/webhook/custom`
**Test command**:
```bash
source /backup/scripts/common.sh
notify_clawbot "failed" "backup-test" "Test alert" 0
```
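
The status-to-severity mapping in the table above can be sketched as two small pure functions. This is a minimal illustration and not the deployed `notify_clawbot`: the helper names `status_to_severity`, `json_escape`, and `build_alert_payload` are hypothetical, while the JSON field names follow the payload that `common.sh` sends. The escaping step is an addition, assumed useful when a message contains double quotes.

```bash
# Hypothetical helpers mirroring the table: success -> info, warning -> warning, failed -> critical.
status_to_severity() {
    case "$1" in
        failed)  echo "critical" ;;
        warning) echo "warning" ;;
        *)       echo "info" ;;
    esac
}

# Escape backslashes and double quotes so a message cannot break the JSON string.
json_escape() {
    local s="$1"
    s="${s//\\/\\\\}"
    s="${s//\"/\\\"}"
    printf '%s' "$s"
}

# Print the JSON body for one alert; the fourth argument (duration in seconds) defaults to 0.
build_alert_payload() {
    local status="$1" service="$2" message="$3" duration="${4:-0}"
    printf '{"name":"Backup.%s","severity":"%s","service":"%s","description":"[%s] %s (%ss)"}' \
        "$(json_escape "$service")" "$(status_to_severity "$status")" \
        "$(json_escape "$service")" "$status" "$(json_escape "$message")" "$duration"
}
```

The resulting string could be passed to `curl -d` in place of the hand-assembled payload.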
---
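## GFS Retention Policy
| Tier | Snapshots kept | Coverage | Notes |
|------|---------|---------|------|
| Hourly | 28 | Last 7 days (6h snapshots) | AWOOOI high-frequency |
| Daily | 30 | Last 30 days | All services |
| Weekly | 12 | Last 3 months | All services |
| Monthly | 24 | Last 2 years | All services |
> Original policy: 7 daily/4 weekly/6 monthly → extended on 2026-04-05 to 28 hourly/30 daily/12 weekly/24 monthly
| Tier | Snapshots kept | Coverage |
|------|---------|---------|
| Hourly (AWOOOI high-frequency) | 28 | Last 7 days |
| Daily | 30 | Last 30 days |
| Weekly | 12 | Last 3 months |
| Monthly | 24 | Last **2 years** |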
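
The coverage column follows from simple arithmetic: 28 snapshots taken every 6 hours span 28 × 6 / 24 = 7 days. The sketch below reproduces that calculation; `coverage_days` is a hypothetical helper, not part of the backup scripts. In restic terms the hourly tier would correspond to `--keep-hourly 28`. The `cleanup_old_backups` helper in this commit passes only the daily, weekly, and monthly flags, so the hourly tier is presumably configured in the AWOOOI scripts, which are not part of this diff.

```bash
# coverage_days COUNT INTERVAL_HOURS
# Whole days spanned by COUNT snapshots taken every INTERVAL_HOURS hours.
coverage_days() {
    local count="$1" interval_hours="$2"
    echo $(( count * interval_hours / 24 ))
}

coverage_days 28 6      # hourly tier at 4 snapshots per day
coverage_days 30 24     # daily tier
coverage_days 12 168    # weekly tier (84 days, roughly 3 months)
```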
---
## Full Crontab Schedule (110)
```
0 2 * * * backup-all.sh ← Gitea + MOMO + Harbor + AWOOOI full backup
0 8,14,20 * * * backup-awoooi-frequent.sh ← AWOOOI high-frequency (every 6 hours)
0 6 * * * backup-status.sh ← backup status report
0 2 * * * backup-all.sh ← full backup of all 9 services
0 8,14,20 * * * backup-awoooi-frequent.sh ← AWOOOI high-frequency (every 6 hours)
0 6 * * * backup-status.sh ← backup status report
```
```
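
The 6h maximum-loss figure for AWOOOI can be checked from the run times in the crontab: 02:00 from `backup-all.sh` plus 08:00, 14:00, and 20:00 from the high-frequency job. A small sketch, assuming run times are given as whole hours in ascending order; `max_gap_hours` is a hypothetical helper used only for this check.

```bash
# max_gap_hours H1 H2 ...
# Largest gap in hours between consecutive daily run times (ascending order),
# including the wrap-around from the last run to the first run of the next day.
max_gap_hours() {
    local hours=("$@")
    local n=${#hours[@]}
    local max=0 i next gap
    for (( i = 0; i < n; i++ )); do
        next=$(( (i + 1) % n ))
        gap=$(( (hours[next] - hours[i] + 24) % 24 ))
        # A single daily run wraps onto itself, so the gap is a full day.
        if (( gap == 0 )); then
            gap=24
        fi
        if (( gap > max )); then
            max=$gap
        fi
    done
    echo "$max"
}

max_gap_hours 2 8 14 20   # AWOOOI: four runs per day
max_gap_hours 2           # services backed up once per day
```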
---
@@ -46,48 +68,56 @@
## Backup Architecture
```
192.168.0.110 (/backup/scripts/)
├── backup-all.sh (daily 02:00)
│   ├── [1/4] backup-gitea.sh → gitea dump → restic /backup/gitea
│   ├── [2/4] backup-momo.sh → SSH 188 pg_dump → restic /backup/momo
│   ├── [3/4] backup-harbor.sh → harbor dump → restic /backup/harbor
│   └── [4/4] backup-awoooi.sh → SSH 188 pg_dump (prod/dev/k3s) → restic /backup/awoooi
└── backup-awoooi-frequent.sh (08/14/20:00)
    └── SSH 188 pg_dump awoooi_prod → restic /backup/awoooi (same repository)
192.168.0.110 (/backup/scripts/backup-all.sh) daily 02:00
├── [1/9] backup-gitea.sh → gitea dump → /backup/gitea
├── [2/9] backup-momo.sh → SSH 188 pg_dump momo → /backup/momo
├── [3/9] backup-harbor.sh → harbor dump → /backup/harbor
├── [4/9] backup-awoooi.sh → SSH 188 pg_dump awoooi_prod/dev/k3s → /backup/awoooi
├── [5/9] backup-langfuse.sh → docker exec langfuse-db pg_dump → /backup/langfuse
├── [6/9] backup-monitoring.sh → volumes prometheus/grafana/alertmanager → /backup/monitoring
├── [7/9] backup-signoz.sh → volumes signoz-clickhouse/sqlite → /backup/signoz
├── [8/9] backup-open-webui.sh → SSH 188 volume open-webui → /backup/open-webui
└── [9/9] backup-clawbot.sh → SSH 188 volume clawbot-redis → /backup/clawbot
192.168.0.188 (Velero)
└── K8s resource snapshot → MinIO 192.168.0.188:9000 (bucket: velero)
Backup failure → notify_clawbot("failed") → /webhook/custom → Telegram 🔴
192.168.0.188 (Velero) daily 02:00
└── K8s resource snapshot → MinIO :9000 (bucket: velero)
```
---
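## Not Yet Backed Up (notes)
| Service | Reason | Notes |
|------|------|------|
| Prometheus TSDB | Raw metrics data, not configuration; the TSDB has its own 30d TTL | Low priority; Grafana configuration is already backed up |
| Sentry | Not currently running (`docker ps` shows nothing) | Volume exists; re-evaluate after redeployment |
| Redis (AWOOOI) | Cache/WorkingMemory; no persistent business data | Low priority |
| Velero MinIO data | MinIO is a backup of the backup and needs an off-machine copy | B2/S3 offsite under evaluation |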
---
## Verification SOP
```bash
# Check the latest backup log
ssh wooo@192.168.0.110 "tail -30 /backup/logs/backup.log"
# Latest backup log
ssh wooo@192.168.0.110 "tail -50 /backup/logs/backup.log"
# AWOOOI snapshot list (including high-frequency)
ssh wooo@192.168.0.110 "restic -r /backup/awoooi snapshots \
--password-file /backup/scripts/.restic-password | tail -10"
# Snapshot count for all services (counts short_id fields in restic's JSON output)
ssh wooo@192.168.0.110 "for r in gitea momo harbor awoooi langfuse monitoring signoz open-webui clawbot; do
echo -n \"\$r: \"
restic -r /backup/\$r snapshots --json --password-file /backup/scripts/.restic-password 2>/dev/null | grep -o '\"short_id\"' | wc -l
done"
# Snapshot count per service
ssh wooo@192.168.0.110 "for r in gitea momo harbor awoooi; do \
echo -n \"\$r: \"; \
restic -r /backup/\$r snapshots --password-file /backup/scripts/.restic-password \
2>/dev/null | grep -c '^\w'; done"
# Velero K8s
kubectl get backup -n velero --sort-by=.metadata.creationTimestamp | tail -3
# Alert test
ssh wooo@192.168.0.110 "source /backup/scripts/common.sh && notify_clawbot 'warning' 'manual-test' 'Manual alert test' 0"
```
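
Counting snapshots is easier to get right from restic's JSON output than from its table output, because `grep -c` counts matching lines rather than snapshots. A minimal sketch, assuming each snapshot object in `restic snapshots --json` carries a `short_id` field, as the backup scripts in this commit already rely on; `count_snapshots` is a hypothetical helper, and the JSON shown in the usage comment is a simplified stand-in.

```bash
# count_snapshots: read `restic snapshots --json` output on stdin and print the snapshot count.
# Counts "short_id" fields and always prints exactly one number, including 0 for an empty list.
count_snapshots() {
    grep -o '"short_id"' | wc -l | tr -d ' '
}

# Usage with simplified sample input:
#   printf '[{"short_id":"a1b2c3d4"},{"short_id":"e5f6a7b8"}]' | count_snapshots
# Usage against a repository on 110:
#   restic -r /backup/langfuse snapshots --json --password-file /backup/scripts/.restic-password | count_snapshots
```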
---
## Related Documents
- [REBOOT-RECOVERY-SOP.md](REBOOT-RECOVERY-SOP.md) - Reboot recovery SOP (including MinIO startup)
- `scripts/backup/backup-awoooi.sh` - AWOOOI full backup script
- `scripts/backup/backup-awoooi-frequent.sh` - AWOOOI high-frequency backup script
- `scripts/backup/backup-all.sh` - All-service backup orchestrator v2.0
- [REBOOT-RECOVERY-SOP.md](REBOOT-RECOVERY-SOP.md) - Reboot recovery SOP
- `scripts/backup/` - All backup scripts (Git-versioned)
- `/backup/scripts/` (on 110) - Scripts as actually deployed


@@ -1,55 +1,52 @@
#!/bin/bash
# =============================================================================
# WOOO AIOps - All-service backup orchestrator script
# Version: 2.0.0
# Version: 3.0.0
# Created: 2026-03-12
# 2026-04-05 Claude Code: added AWOOOI DB (v1→v2) after the chief-architect backup audit
# 2026-04-05 Claude Code: added Langfuse/Monitoring/SignOz/Open-WebUI/ClawBot (v2→v3) after the backup coverage audit
# =============================================================================
set -euo pipefail
# 載入共用函式
source "$(dirname "$0")/common.sh"
# -----------------------------------------------------------------------------
# 主函式
# -----------------------------------------------------------------------------
main() {
local start_time=$(date +%s)
local failed=0
local total=4
local total=9
log_info "╔══════════════════════════════════════════════════════════════╗"
log_info "║ WOOO AIOps - 全服務備份開始 (v2.0) ║"
log_info "║ WOOO AIOps - 全服務備份開始 (v3.0) ║"
log_info "╚══════════════════════════════════════════════════════════════╝"
# 備份 Gitea
log_info ">>> [1/${total}] 備份 Gitea..."
if /backup/scripts/backup-gitea.sh; then
log_success " Gitea 備份成功"
else
log_error " Gitea 備份失敗"
((failed++))
failed=$((failed+1))
fi
# 備份 MOMO Pro
log_info ">>> [2/${total}] 備份 MOMO Pro..."
if /backup/scripts/backup-momo.sh; then
log_success " MOMO Pro 備份成功"
else
log_error " MOMO Pro 備份失敗"
((failed++))
failed=$((failed+1))
fi
# 備份 Harbor
log_info ">>> [3/${total}] 備份 Harbor..."
if /backup/scripts/backup-harbor.sh; then
log_success " Harbor 備份成功"
else
log_error " Harbor 備份失敗"
((failed++))
failed=$((failed+1))
fi
# 備份 AWOOOI DB (awoooi_prod + k3s_datastore)
# 2026-04-05 Claude Code: 首席架構師備份審計後加入
log_info ">>> [4/${total}] 備份 AWOOOI DB..."
@@ -57,12 +54,62 @@ main() {
log_success " AWOOOI DB 備份成功"
else
log_error " AWOOOI DB 備份失敗"
((failed++))
failed=$((failed+1))
fi
# 備份 Langfuse (AI 追蹤/評測數據)
# 2026-04-05 Claude Code: 備份覆蓋率審計後加入
log_info ">>> [5/${total}] 備份 Langfuse..."
if /backup/scripts/backup-langfuse.sh; then
log_success " Langfuse 備份成功"
else
log_error " Langfuse 備份失敗"
failed=$((failed+1))
fi
# 備份 Monitoring (Prometheus + Grafana + Alertmanager)
# 2026-04-05 Claude Code: 備份覆蓋率審計後加入
log_info ">>> [6/${total}] 備份 Monitoring..."
if /backup/scripts/backup-monitoring.sh; then
log_success " Monitoring 備份成功"
else
log_error " Monitoring 備份失敗"
failed=$((failed+1))
fi
# 備份 SignOz (ClickHouse + SQLite)
# 2026-04-05 Claude Code: 備份覆蓋率審計後加入
log_info ">>> [7/${total}] 備份 SignOz..."
if /backup/scripts/backup-signoz.sh; then
log_success " SignOz 備份成功"
else
log_error " SignOz 備份失敗"
failed=$((failed+1))
fi
# 備份 Open-WebUI (LLM 對話紀錄,從 188)
# 2026-04-05 Claude Code: 備份覆蓋率審計後加入
log_info ">>> [8/${total}] 備份 Open-WebUI (188)..."
if /backup/scripts/backup-open-webui.sh; then
log_success " Open-WebUI 備份成功"
else
log_error " Open-WebUI 備份失敗"
failed=$((failed+1))
fi
# 備份 ClawBot Redis (狀態/快取,從 188低優先)
# 2026-04-05 Claude Code: 備份覆蓋率審計後加入
log_info ">>> [9/${total}] 備份 ClawBot Redis (188)..."
if /backup/scripts/backup-clawbot.sh; then
log_success " ClawBot Redis 備份成功"
else
log_error " ClawBot Redis 備份失敗"
failed=$((failed+1))
fi
local end_time=$(date +%s)
local duration=$((end_time - start_time))
log_info "╔══════════════════════════════════════════════════════════════╗"
if [ $failed -eq 0 ]; then
log_success "║ 全服務備份完成 (${duration}s) - 全部成功 (${total}/${total}) ║"
@@ -72,9 +119,8 @@ main() {
notify_clawbot "warning" "all" "全服務備份完成 ($((total-failed))/${total} 成功)" "${duration}"
fi
log_info "╚══════════════════════════════════════════════════════════════╝"
return $failed
}
# 執行
main "$@"
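
The nine near-identical `if ... else ... fi` stanzas in `main` follow one pattern, which could be condensed into a loop. The sketch below is a minimal illustration under the assumption that each step is a single command that returns non-zero on failure; `run_all` is a hypothetical name and not part of the committed script. It also shows the likely reason this commit swaps `((failed++))` for `failed=$((failed+1))`, which is an inference from bash semantics rather than something the commit message states.

```bash
# run_all CMD...
# Run each command in order, keep going after a failure,
# and return the number of commands that failed.
run_all() {
    local failed=0 total=$# step=0 cmd
    for cmd in "$@"; do
        step=$((step + 1))
        if "$cmd"; then
            echo "[${step}/${total}] ${cmd}: ok"
        else
            echo "[${step}/${total}] ${cmd}: FAILED"
            # Use failed=$((failed + 1)) rather than ((failed++)): the post-increment
            # evaluates to 0 on the first failure, so (( )) returns status 1 and a
            # script running under `set -e` would abort at that point.
            failed=$((failed + 1))
        fi
    done
    return "$failed"
}

# Usage (paths as in backup-all.sh):
#   run_all /backup/scripts/backup-gitea.sh /backup/scripts/backup-momo.sh
```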


@@ -0,0 +1,75 @@
#!/bin/bash
# =============================================================================
# WOOO AIOps - ClawBot Redis backup script (SSH → 192.168.0.188)
# Version: 1.0.0
# Created: 2026-04-05
# 2026-04-05 Claude Code: added ClawBot Redis state/cache backup after the chief-architect backup audit
# =============================================================================
set -euo pipefail
source "$(dirname "$0")/common.sh"
SERVICE="clawbot"
LOCAL_REPO="${BACKUP_BASE}/clawbot"
DUMP_DIR="/tmp/clawbot-backup-$$"
REMOTE_HOST="ollama@192.168.0.188"
cleanup() {
rm -rf "${DUMP_DIR}"
}
main() {
local start_time=$(date +%s)
log_info "========== 開始 ClawBot Redis 備份 (188→110) =========="
mkdir -p "${DUMP_DIR}"
local timestamp=$(date "+%Y%m%d_%H%M%S")
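# Step 1: trigger Redis BGSAVE so the data is flushed to disk
log_info "Triggering Redis BGSAVE..."
ssh "${REMOTE_HOST}" "docker exec clawbot-redis redis-cli BGSAVE" 2>/dev/null || log_warn "BGSAVE failed or clawbot-redis is not running; continuing with the backup"
sleep 2 # Fixed wait; BGSAVE runs in the background and may take longer on a large dataset
# Step 2: SSH to 188, tar the Redis volume, and stream it back
log_info "Pulling the clawbot-redis volume from 192.168.0.188..."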
if ssh "${REMOTE_HOST}" "docker run --rm -v clawbot-v5_clawbot-redis-data:/data alpine tar czf - /data 2>/dev/null" > "${DUMP_DIR}/clawbot-redis_${timestamp}.tar.gz"; then
local size=$(du -h "${DUMP_DIR}/clawbot-redis_${timestamp}.tar.gz" | cut -f1)
log_success "ClawBot Redis volume 拉取完成 (${size})"
else
log_error "ClawBot Redis volume 拉取失敗"
notify_clawbot "failed" "${SERVICE}" "ClawBot Redis 備份失敗 (SSH 188)"
cleanup
exit 1
fi
# Step 3: 初始化 Restic 倉庫
if [ ! -d "${LOCAL_REPO}/data" ]; then
log_info "初始化 Restic 倉庫: ${LOCAL_REPO}"
restic -r "${LOCAL_REPO}" init --password-file "${RESTIC_PASSWORD_FILE}" 2>&1 || {
log_error "Restic 倉庫初始化失敗"
cleanup
exit 1
}
fi
# Step 4: Restic 備份
log_info "建立 Restic 備份..."
local tags=$(build_tags "${SERVICE}")
restic -r "${LOCAL_REPO}" backup "${DUMP_DIR}" --password-file "${RESTIC_PASSWORD_FILE}" ${tags} 2>&1
local snapshot_id=$(restic -r "${LOCAL_REPO}" snapshots --latest 1 --json --password-file "${RESTIC_PASSWORD_FILE}" 2>/dev/null | grep -oP '"short_id":"\K[^"]+' | head -1)
log_success "Restic 備份完成: ${snapshot_id}"
# Step 5: GFS 清理
cleanup_old_backups "${LOCAL_REPO}"
cleanup
local end_time=$(date +%s)
local duration=$((end_time - start_time))
log_success "========== ClawBot Redis 備份完成 (${duration}s) =========="
notify_clawbot "success" "${SERVICE}" "ClawBot Redis 備份完成" "${duration}"
}
main "$@"
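
`BGSAVE` returns immediately and writes the snapshot in the background, so a fixed `sleep 2` is only a guess at how long the save takes. One alternative is to poll Redis's `LASTSAVE` command, which reports the UNIX time of the last successful save, until the value changes. A minimal sketch; `wait_for_change` is a hypothetical helper, and the retry count in the usage comment is an assumption.

```bash
# wait_for_change OLD TRIES CMD...
# Run CMD up to TRIES times, one second apart, until its output differs from OLD.
# Returns 0 once the output has changed, 1 if it never did.
wait_for_change() {
    local old="$1" tries="$2" current
    shift 2
    while (( tries > 0 )); do
        # Treat a failing CMD as "no change yet" and retry.
        current="$("$@")" || current="$old"
        if [ "$current" != "$old" ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Usage, assuming the same container name as the script:
#   before=$(ssh "${REMOTE_HOST}" "docker exec clawbot-redis redis-cli LASTSAVE")
#   ssh "${REMOTE_HOST}" "docker exec clawbot-redis redis-cli BGSAVE"
#   wait_for_change "$before" 30 ssh "${REMOTE_HOST}" "docker exec clawbot-redis redis-cli LASTSAVE"
```

Because `LASTSAVE` has one-second resolution, a save that completes within the same second as the previous one would not be detected; the timeout covers that case.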


@@ -0,0 +1,69 @@
#!/bin/bash
# =============================================================================
# WOOO AIOps - Langfuse backup script
# Version: 1.0.0
# Created: 2026-04-05
# 2026-04-05 Claude Code: added Langfuse AI tracing data backup after the chief-architect backup audit
# =============================================================================
set -euo pipefail
source "$(dirname "$0")/common.sh"
SERVICE="langfuse"
LOCAL_REPO="${BACKUP_BASE}/langfuse"
DUMP_DIR="/tmp/langfuse-backup-$$"
cleanup() {
rm -rf "${DUMP_DIR}"
}
main() {
local start_time=$(date +%s)
log_info "========== 開始 Langfuse 備份 =========="
mkdir -p "${DUMP_DIR}"
local timestamp=$(date "+%Y%m%d_%H%M%S")
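# Step 1: Langfuse PostgreSQL dump
log_info "Running Langfuse DB dump..."
# stderr goes to the backup log so pg_dump warnings do not end up inside the .sql file
if docker exec langfuse-db pg_dump -U langfuse langfuse > "${DUMP_DIR}/langfuse_${timestamp}.sql" 2>>"${BACKUP_LOG_DIR}/backup.log"; then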
local size=$(du -h "${DUMP_DIR}/langfuse_${timestamp}.sql" | cut -f1)
log_success "Langfuse DB dump 完成 (${size})"
else
log_error "Langfuse DB dump 失敗"
notify_clawbot "failed" "${SERVICE}" "Langfuse 備份失敗"
cleanup
exit 1
fi
# Step 2: 初始化 Restic 倉庫 (如果不存在)
if [ ! -d "${LOCAL_REPO}/data" ]; then
log_info "初始化 Restic 倉庫: ${LOCAL_REPO}"
restic -r "${LOCAL_REPO}" init --password-file "${RESTIC_PASSWORD_FILE}" 2>&1 || {
log_error "Restic 倉庫初始化失敗"
cleanup
exit 1
}
fi
# Step 3: Restic 備份
log_info "建立 Restic 備份..."
local tags=$(build_tags "${SERVICE}")
restic -r "${LOCAL_REPO}" backup "${DUMP_DIR}" --password-file "${RESTIC_PASSWORD_FILE}" ${tags} 2>&1
local snapshot_id=$(restic -r "${LOCAL_REPO}" snapshots --latest 1 --json --password-file "${RESTIC_PASSWORD_FILE}" 2>/dev/null | grep -oP '"short_id":"\K[^"]+' | head -1)
log_success "Restic 備份完成: ${snapshot_id}"
# Step 4: GFS 清理
cleanup_old_backups "${LOCAL_REPO}"
cleanup
local end_time=$(date +%s)
local duration=$((end_time - start_time))
log_success "========== Langfuse 備份完成 (${duration}s) =========="
notify_clawbot "success" "${SERVICE}" "Langfuse 備份完成" "${duration}"
}
main "$@"
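
The script calls `cleanup` by hand on each failure path, so a step that aborts under `set -e` (for example a failing `restic backup`) would leave the dump directory behind. A common alternative is an `EXIT` trap. The sketch below is illustrative only: `with_temp_dir` is a hypothetical wrapper, and it uses `mktemp -d` in place of the script's `/tmp/<name>-$$` naming.

```bash
# with_temp_dir CMD...
# Create a scratch directory, run CMD with the directory path appended as its
# last argument, and remove the directory however CMD exits.
# The function body is a subshell, so the EXIT trap fires when the function ends.
with_temp_dir() (
    set -euo pipefail
    dump_dir="$(mktemp -d)"
    trap 'rm -rf "${dump_dir}"' EXIT
    "$@" "${dump_dir}"
)

# Usage with a hypothetical step function that receives the directory as $1:
#   dump_and_backup() { pg_dump ... > "$1/db.sql" && restic backup "$1"; }
#   with_temp_dir dump_and_backup
```

The wrapper returns the exit status of the wrapped command, so the caller can still decide whether to send a `failed` notification.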


@@ -0,0 +1,109 @@
#!/bin/bash
# =============================================================================
# WOOO AIOps - Monitoring backup script (Prometheus + Grafana + Alertmanager)
# Version: 1.1.0
# Created: 2026-04-05
# 2026-04-05 Claude Code: added monitoring data backup after the chief-architect backup audit
# 2026-04-05 Claude Code: v1.1 fixed exit-code handling of the tar pipeline for the 1.1GB Prometheus volume
# =============================================================================
set -euo pipefail
source "$(dirname "$0")/common.sh"
SERVICE="monitoring"
LOCAL_REPO="${BACKUP_BASE}/monitoring"
DUMP_DIR="/tmp/monitoring-backup-$$"
MONITORING_CONFIG_DIR="/home/wooo/monitoring"
cleanup() {
rm -rf "${DUMP_DIR}"
}
backup_volume() {
local volume_name="$1"
local output_file="$2"
log_info "備份 volume: ${volume_name}"
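# Note: tar may exit 1 when archiving a large live volume (mmap/lock files modified during the read)
# `|| true` keeps that warning from failing the script; the output file size is still checked below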
docker run --rm -v "${volume_name}:/data" alpine tar czf - /data 2>/dev/null > "${output_file}" || true
if [ -s "${output_file}" ]; then
local size=$(du -h "${output_file}" | cut -f1)
log_success " Volume ${volume_name} 備份完成 (${size})"
return 0
else
log_error " Volume ${volume_name} 備份失敗 (空檔案)"
return 1
fi
}
main() {
local start_time=$(date +%s)
log_info "========== 開始 Monitoring 備份 =========="
mkdir -p "${DUMP_DIR}"
local timestamp=$(date "+%Y%m%d_%H%M%S")
local any_failed=0
# Step 1: 備份 Prometheus volume (TSDB 數據,約 1GB+)
backup_volume "monitoring_prometheus_data" "${DUMP_DIR}/prometheus_${timestamp}.tar.gz" || {
notify_clawbot "failed" "${SERVICE}" "Prometheus volume 備份失敗"
cleanup
exit 1
}
# Step 2: 備份 Grafana volume (dashboards/alerts 設定)
backup_volume "monitoring_grafana_data" "${DUMP_DIR}/grafana_${timestamp}.tar.gz" || {
log_warn "Grafana volume 備份失敗,繼續..."
any_failed=1
}
# Step 3: 備份 Alertmanager volume (靜默/路由設定)
backup_volume "monitoring_alertmanager_data" "${DUMP_DIR}/alertmanager_${timestamp}.tar.gz" || {
log_warn "Alertmanager volume 備份失敗,繼續..."
any_failed=1
}
# Step 4: 備份 monitoring 設定檔目錄
log_info "備份 monitoring 設定檔 (${MONITORING_CONFIG_DIR})"
if [ -d "${MONITORING_CONFIG_DIR}" ]; then
tar czf "${DUMP_DIR}/monitoring-configs_${timestamp}.tar.gz" -C "$(dirname "${MONITORING_CONFIG_DIR}")" "$(basename "${MONITORING_CONFIG_DIR}")" 2>/dev/null || true
if [ -s "${DUMP_DIR}/monitoring-configs_${timestamp}.tar.gz" ]; then
log_success "設定檔備份完成"
else
log_warn "設定檔備份失敗或為空"
fi
else
log_warn "monitoring 設定目錄不存在: ${MONITORING_CONFIG_DIR}"
fi
# Step 5: 初始化 Restic 倉庫
if [ ! -d "${LOCAL_REPO}/data" ]; then
log_info "初始化 Restic 倉庫: ${LOCAL_REPO}"
restic -r "${LOCAL_REPO}" init --password-file "${RESTIC_PASSWORD_FILE}" 2>&1 || {
log_error "Restic 倉庫初始化失敗"
cleanup
exit 1
}
fi
# Step 6: Restic 備份
log_info "建立 Restic 備份..."
local tags=$(build_tags "${SERVICE}")
restic -r "${LOCAL_REPO}" backup "${DUMP_DIR}" --password-file "${RESTIC_PASSWORD_FILE}" ${tags} 2>&1
local snapshot_id=$(restic -r "${LOCAL_REPO}" snapshots --latest 1 --json --password-file "${RESTIC_PASSWORD_FILE}" 2>/dev/null | grep -oP '"short_id":"\K[^"]+' | head -1)
log_success "Restic 備份完成: ${snapshot_id}"
# Step 7: GFS 清理
cleanup_old_backups "${LOCAL_REPO}"
cleanup
local end_time=$(date +%s)
local duration=$((end_time - start_time))
log_success "========== Monitoring 備份完成 (${duration}s) =========="
notify_clawbot "success" "${SERVICE}" "Monitoring 備份完成 (Prometheus+Grafana+Alertmanager)" "${duration}"
}
main "$@"
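
Because `backup_volume` discards tar's exit status with `|| true`, the only remaining check is that the output file is non-empty, which a truncated archive would also pass. A slightly stronger check is to test the gzip stream with `gzip -t`. A minimal sketch; `verify_archive` is a hypothetical helper, and it validates only the compression layer, not the tar structure inside it.

```bash
# verify_archive FILE
# Succeed only if FILE is non-empty and a complete, valid gzip stream.
verify_archive() {
    local file="$1"
    [ -s "${file}" ] && gzip -t "${file}" 2>/dev/null
}

# Usage inside backup_volume, in place of the bare [ -s ... ] test:
#   if verify_archive "${output_file}"; then ... fi
```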


@@ -0,0 +1,70 @@
#!/bin/bash
# =============================================================================
# WOOO AIOps - Open-WebUI backup script (SSH → 192.168.0.188)
# Version: 1.0.0
# Created: 2026-04-05
# 2026-04-05 Claude Code: added Open-WebUI LLM conversation history backup after the chief-architect backup audit
# =============================================================================
set -euo pipefail
source "$(dirname "$0")/common.sh"
SERVICE="open-webui"
LOCAL_REPO="${BACKUP_BASE}/open-webui"
DUMP_DIR="/tmp/open-webui-backup-$$"
REMOTE_HOST="ollama@192.168.0.188"
cleanup() {
rm -rf "${DUMP_DIR}"
}
main() {
local start_time=$(date +%s)
log_info "========== 開始 Open-WebUI 備份 (188→110) =========="
mkdir -p "${DUMP_DIR}"
local timestamp=$(date "+%Y%m%d_%H%M%S")
# Step 1: SSH 到 188 將 open-webui volume 打包傳回
log_info "從 192.168.0.188 拉取 open-webui volume..."
if ssh "${REMOTE_HOST}" "docker run --rm -v open-webui:/data alpine tar czf - /data 2>/dev/null" > "${DUMP_DIR}/open-webui_${timestamp}.tar.gz"; then
local size=$(du -h "${DUMP_DIR}/open-webui_${timestamp}.tar.gz" | cut -f1)
log_success "Open-WebUI volume 拉取完成 (${size})"
else
log_error "Open-WebUI volume 拉取失敗"
notify_clawbot "failed" "${SERVICE}" "Open-WebUI 備份失敗 (SSH 188)"
cleanup
exit 1
fi
# Step 2: 初始化 Restic 倉庫
if [ ! -d "${LOCAL_REPO}/data" ]; then
log_info "初始化 Restic 倉庫: ${LOCAL_REPO}"
restic -r "${LOCAL_REPO}" init --password-file "${RESTIC_PASSWORD_FILE}" 2>&1 || {
log_error "Restic 倉庫初始化失敗"
cleanup
exit 1
}
fi
# Step 3: Restic 備份
log_info "建立 Restic 備份..."
local tags=$(build_tags "${SERVICE}")
restic -r "${LOCAL_REPO}" backup "${DUMP_DIR}" --password-file "${RESTIC_PASSWORD_FILE}" ${tags} 2>&1
local snapshot_id=$(restic -r "${LOCAL_REPO}" snapshots --latest 1 --json --password-file "${RESTIC_PASSWORD_FILE}" 2>/dev/null | grep -oP '"short_id":"\K[^"]+' | head -1)
log_success "Restic 備份完成: ${snapshot_id}"
# Step 4: GFS 清理
cleanup_old_backups "${LOCAL_REPO}"
cleanup
local end_time=$(date +%s)
local duration=$((end_time - start_time))
log_success "========== Open-WebUI 備份完成 (${duration}s) =========="
notify_clawbot "success" "${SERVICE}" "Open-WebUI 備份完成" "${duration}"
}
main "$@"
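
Several lines in these scripts use the form `local name=$(command)`. In bash the exit status of that statement is the status of `local`, not of the command, so a failing `restic snapshots` call would go unnoticed even under `set -e`. The sketch below demonstrates the difference with `false` standing in for a failing command; the function names are hypothetical.

```bash
# masked: `local` itself succeeds, so the failure of `false` is hidden.
masked() {
    local value=$(false)
    echo "status=$?"
}

# unmasked: declare first and assign second, so $? reflects the command substitution.
unmasked() {
    local value
    value=$(false)
    echo "status=$?"
}

masked      # prints status=0
unmasked    # prints status=1
```

Splitting the declaration from the assignment would let `set -e` stop the script, or let an explicit `if` report the failure, before "backup complete" is logged with an empty snapshot ID.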

scripts/backup/backup-signoz.sh (executable file, 103 lines)

@@ -0,0 +1,103 @@
#!/bin/bash
# =============================================================================
# WOOO AIOps - SignOz backup script (ClickHouse + SQLite)
# Version: 1.1.0
# Created: 2026-04-05
# 2026-04-05 Claude Code: added SignOz distributed tracing backup after the chief-architect backup audit
# 2026-04-05 Claude Code: v1.1 fixed tar pipeline exit-code handling and added || true
# =============================================================================
set -euo pipefail
source "$(dirname "$0")/common.sh"
SERVICE="signoz"
LOCAL_REPO="${BACKUP_BASE}/signoz"
DUMP_DIR="/tmp/signoz-backup-$$"
cleanup() {
# Make sure the collector has been restarted
docker start signoz-otel-collector 2>/dev/null || true
rm -rf "${DUMP_DIR}"
}
backup_volume() {
local volume_name="$1"
local output_file="$2"
local extra_exclude="${3:-}"
log_info "備份 volume: ${volume_name}"
# `|| true` absorbs the exit 1 warning that tar emits when archiving a volume that is in use
if [ -n "${extra_exclude}" ]; then
docker run --rm -v "${volume_name}:/data" alpine tar czf - "${extra_exclude}" /data 2>/dev/null > "${output_file}" || true
else
docker run --rm -v "${volume_name}:/data" alpine tar czf - /data 2>/dev/null > "${output_file}" || true
fi
if [ -s "${output_file}" ]; then
local size=$(du -h "${output_file}" | cut -f1)
log_success " Volume ${volume_name} 備份完成 (${size})"
return 0
else
log_error " Volume ${volume_name} 備份失敗 (空檔案)"
return 1
fi
}
main() {
local start_time=$(date +%s)
log_info "========== 開始 SignOz 備份 =========="
mkdir -p "${DUMP_DIR}"
local timestamp=$(date "+%Y%m%d_%H%M%S")
# Step 1: 停止 OTEL Collector 確保數據一致性
log_info "暫停 signoz-otel-collector 以確保數據一致性..."
docker stop signoz-otel-collector 2>/dev/null || log_warn "signoz-otel-collector 未在運行,繼續"
docker stop signoz-telemetrystore-migrator 2>/dev/null || true
# Step 2: 備份 ClickHouse volume (排除 tmp 目錄降低體積)
backup_volume "signoz-clickhouse" "${DUMP_DIR}/clickhouse_${timestamp}.tar.gz" "--exclude=/data/tmp" || {
log_error "ClickHouse volume 備份失敗"
cleanup
notify_clawbot "failed" "${SERVICE}" "SignOz ClickHouse 備份失敗"
exit 1
}
# Step 3: 備份 SQLite volume (SignOz metadata)
backup_volume "signoz-sqlite" "${DUMP_DIR}/sqlite_${timestamp}.tar.gz" || {
log_warn "SQLite volume 備份失敗,繼續..."
}
# Step 4: 重啟 Collector
log_info "重啟 signoz-otel-collector..."
docker start signoz-otel-collector 2>/dev/null || log_warn "signoz-otel-collector 重啟失敗"
# Step 5: 初始化 Restic 倉庫
if [ ! -d "${LOCAL_REPO}/data" ]; then
log_info "初始化 Restic 倉庫: ${LOCAL_REPO}"
restic -r "${LOCAL_REPO}" init --password-file "${RESTIC_PASSWORD_FILE}" 2>&1 || {
log_error "Restic 倉庫初始化失敗"
rm -rf "${DUMP_DIR}"
exit 1
}
fi
# Step 6: Restic 備份
log_info "建立 Restic 備份..."
local tags=$(build_tags "${SERVICE}")
restic -r "${LOCAL_REPO}" backup "${DUMP_DIR}" --password-file "${RESTIC_PASSWORD_FILE}" ${tags} 2>&1
local snapshot_id=$(restic -r "${LOCAL_REPO}" snapshots --latest 1 --json --password-file "${RESTIC_PASSWORD_FILE}" 2>/dev/null | grep -oP '"short_id":"\K[^"]+' | head -1)
log_success "Restic 備份完成: ${snapshot_id}"
# Step 7: GFS 清理
cleanup_old_backups "${LOCAL_REPO}"
rm -rf "${DUMP_DIR}"
local end_time=$(date +%s)
local duration=$((end_time - start_time))
log_success "========== SignOz 備份完成 (${duration}s) =========="
notify_clawbot "success" "${SERVICE}" "SignOz 備份完成 (ClickHouse+SQLite)" "${duration}"
}
main "$@"

scripts/backup/common.sh (normal file, 147 lines)

@@ -0,0 +1,147 @@
#!/bin/bash
# =============================================================================
# WOOO AIOps - Shared backup function library
# Version: 1.0.0
# Created: 2026-03-12
# =============================================================================
# -----------------------------------------------------------------------------
# Configuration (to be updated once the CEO provides the B2 account)
# -----------------------------------------------------------------------------
export BACKUP_BASE="/backup"
export BACKUP_LOG_DIR="${BACKUP_BASE}/logs"
export RESTIC_PASSWORD_FILE="${BACKUP_BASE}/scripts/.restic-password"
# Backblaze B2 configuration (to be filled in)
export B2_ACCOUNT_ID="" # pending from the CEO
export B2_APPLICATION_KEY="" # pending from the CEO
export B2_BUCKET="wooo-aiops-backup"
# ClawBot notification webhook
export CLAWBOT_WEBHOOK="http://192.168.0.188:8088/api/v1/webhook/custom"
# Retention policy (GFS, grandfather-father-son)
export KEEP_DAILY=30 # 2026-04-05 Claude Code: extended retention (was 7, now 30)
export KEEP_WEEKLY=12 # 2026-04-05 Claude Code: extended retention (was 4, now 12)
export KEEP_MONTHLY=24 # 2026-04-05 Claude Code: extended retention (was 6, now 24)
# -----------------------------------------------------------------------------
# 日誌函式
# -----------------------------------------------------------------------------
log() {
local level="$1"
local message="$2"
local timestamp=$(date "+%Y-%m-%d %H:%M:%S")
echo "[${timestamp}] [${level}] ${message}" | tee -a "${BACKUP_LOG_DIR}/backup.log"
}
log_info() { log "INFO" "$1"; }
log_warn() { log "WARN" "$1"; }
log_error() { log "ERROR" "$1"; }
log_success() { log "SUCCESS" "$1"; }
# -----------------------------------------------------------------------------
# 通知函式
# -----------------------------------------------------------------------------
notify_clawbot() {
local status="$1"
local service="$2"
local message="$3"
local duration="${4:-0}"
# 2026-04-05 Claude Code: 正確的 /webhook/custom payload + severity 依狀態
local severity="info"
[ "$status" = "warning" ] && severity="warning"
[ "$status" = "failed" ] && severity="critical"
if command -v curl &> /dev/null; then
curl -s -X POST "${CLAWBOT_WEBHOOK}" \
-H 'Content-Type: application/json' \
-d "{\"name\":\"Backup.${service}\",\"severity\":\"${severity}\",\"service\":\"${service}\",\"description\":\"[${status}] ${message} (${duration}s)\"}" \
--connect-timeout 5 2>/dev/null || true
fi
}
# -----------------------------------------------------------------------------
# Restic 標籤函式
# -----------------------------------------------------------------------------
get_app_version() {
local service="$1"
case "$service" in
gitea)
docker exec gitea gitea --version 2>/dev/null | grep -oP "\\d+\\.\\d+\\.\\d+" | head -1 || echo "unknown"
;;
harbor)
cat /opt/harbor/harbor.yml 2>/dev/null | grep -oP "version: \\K.*" || echo "unknown"
;;
momo)
echo "1.0.0" # MOMO 版本固定或從配置讀取
;;
*)
echo "unknown"
;;
esac
}
get_git_hash() {
local service="$1"
case "$service" in
gitea)
cd /var/lib/gitea 2>/dev/null && git rev-parse --short HEAD 2>/dev/null || echo "none"
;;
*)
echo "none"
;;
esac
}
build_tags() {
local service="$1"
local version=$(get_app_version "$service")
local git_hash=$(get_git_hash "$service")
local timestamp=$(date "+%Y%m%d_%H%M%S")
echo "--tag service:${service} --tag version:${version} --tag git:${git_hash} --tag timestamp:${timestamp}"
}
# -----------------------------------------------------------------------------
# 備份驗證函式
# -----------------------------------------------------------------------------
verify_backup() {
local repo="$1"
local snapshot_id="$2"
log_info "驗證備份快照: ${snapshot_id}"
restic -r "${repo}" check --read-data-subset=1% 2>&1
return $?
}
# -----------------------------------------------------------------------------
# 清理函式 (GFS 策略)
# -----------------------------------------------------------------------------
cleanup_old_backups() {
local repo="$1"
log_info "執行 GFS 清理策略"
restic -r "${repo}" forget \
--keep-daily ${KEEP_DAILY} \
--keep-weekly ${KEEP_WEEKLY} \
--keep-monthly ${KEEP_MONTHLY} \
--prune 2>&1
}
# -----------------------------------------------------------------------------
# 檢查配置
# -----------------------------------------------------------------------------
check_b2_config() {
if [ -z "${B2_ACCOUNT_ID}" ] || [ -z "${B2_APPLICATION_KEY}" ]; then
log_warn "B2 配置未設定,僅執行本地備份"
return 1
fi
return 0
}
# 初始化日誌目錄
mkdir -p "${BACKUP_LOG_DIR}"
log_info "共用函式庫載入完成 (v1.0.0)"
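
`build_tags` returns one space-separated string that callers expand unquoted (`${tags}`), which works only as long as no tag value contains whitespace. A version string read from a config file could break that assumption. The sketch below shows an array-based variant, assuming bash 4 or later for `mapfile`; `build_tag_args` is a hypothetical replacement that takes the version and git hash as parameters instead of looking them up.

```bash
# build_tag_args SERVICE VERSION GIT_HASH
# Print restic tag arguments, one per line.
build_tag_args() {
    local service="$1" version="$2" git_hash="$3"
    local timestamp
    timestamp=$(date "+%Y%m%d_%H%M%S")
    printf '%s\n' \
        --tag "service:${service}" \
        --tag "version:${version}" \
        --tag "git:${git_hash}" \
        --tag "timestamp:${timestamp}"
}

# Collect the lines into an array so each argument stays a single word,
# even when a value contains spaces.
mapfile -t tags < <(build_tag_args "langfuse" "unknown" "none")

# Usage in a backup script:
#   restic -r "${LOCAL_REPO}" backup "${DUMP_DIR}" "${tags[@]}"
```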