Your Name 88dc08e595 docs(ops): add credential escrow evidence owner request [skip ci]

2026-06-13 13:14:51 +08:00

BACKUP-STATUS.md: Backup Status Overview

2026-04-05 Claude Code: chief-architect full inventory; all services fully automated, with alerting in place.
Backup center: 192.168.0.110 (/backup/), Restic + latest-only retention + Google Drive/rclone offsite mirror.
2026-06-12 Codex pre-reboot refresh: 110 cron / Google Drive rclone / Alertmanager / credential escrow / cold-start scorecard rechecked.
2026-06-12 Codex post-reboot refresh: 110 recovered, offsite latest-only verified, and remaining red gates narrowed to 120 config capture plus credential escrow evidence.
2026-06-12 Codex post-120 recovery refresh: 120 restored, backup aggregate / offsite / full cold-start green; DR still blocked only by credential escrow evidence.
2026-06-13 Codex live refresh: backup core remains green; DR still blocked only by credential escrow evidence.
2026-06-13 Codex post-CD refresh: backup/offsite/alert contracts remain green after deploy marker e4a349bc; global SSH trust guardrail held; DR still blocked only by credential escrow evidence.
2026-06-13 Codex escrow refresh: 13:10 live report confirms offsite/rclone/script readiness is green and only five non-secret credential escrow evidence markers remain missing.

2026-06-13 Post-CD Live Status

2026-06-13 01:26 / 01:28 refresh:

/backup/scripts/backup-status.sh --no-notify: 110 13/13 fresh failed=0, 188 2/2 fresh failed=0, integrity_stale=0, offsite_fresh=1, rclone_gdrive_fresh=1, core_blockers=0, escrow_missing=5, last aggregate 2026-06-12 15:54:40.
/home/wooo/node_exporter_textfiles/offsite_full_sync_verify.prom: awoooi_backup_offsite_remote_verify_ok=1, awoooi_backup_offsite_full_verify_fresh=1; all 13 repos have snapshot_count=1 and snapshot_latest_only=1.
backup-alert-live-visibility-check.py: the 5 escrow gap alerts are visible as active/firing in both Prometheus and Alertmanager.
Prometheus rules API: BackupConfigCapturePartial, BackupAggregateRunFailed, BackupCredentialEscrowEvidenceMissing, ColdStartRecoveryBlocked, and ColdStartHost120Unreachable all exist with health ok; only the escrow gap rule is firing (correctly), and the rest are inactive.
backup-alert-label-contract-check.py: the local ops/monitoring/alerts-unified.yml matches the live Prometheus label contract, and all 24 baseline backup alert rules are loaded.
This means the backup core is still green; the remaining red gate is DR credential escrow evidence, not a backup script or offsite sync failure.
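The distinction drawn above (backup core green, DR closeout still blocked) can be sketched as a small check over the status summary. This is a minimal sketch, assuming the summary can be reduced to comma- or space-separated key=value tokens like those quoted above; `status_field`, `core_is_green`, and `dr_closeout_ready` are hypothetical helper names and are not part of backup-status.sh.

```shell
#!/bin/sh
# Hypothetical helpers; the real backup-status.sh output format may differ.

# status_field LINE KEY -> prints the value of the first KEY=value token in LINE.
status_field() {
  printf '%s\n' "$1" | tr ', ' '\n\n' | awk -F= -v k="$2" '$1 == k { print $2; exit }'
}

# core_is_green LINE -> exit 0 when no core blocker, no stale integrity check,
# and a fresh offsite copy are reported.
core_is_green() {
  [ "$(status_field "$1" core_blockers)" = "0" ] &&
  [ "$(status_field "$1" integrity_stale)" = "0" ] &&
  [ "$(status_field "$1" offsite_fresh)" = "1" ]
}

# dr_closeout_ready LINE -> exit 0 only when the core is green AND no escrow
# evidence marker is missing.
dr_closeout_ready() {
  core_is_green "$1" && [ "$(status_field "$1" escrow_missing)" = "0" ]
}
```

With the values reported above, `core_is_green` would succeed while `dr_closeout_ready` would fail, because `escrow_missing=5`.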

Gate	Status	Evidence
110 backup cron	VERIFIED	Live crontab still has `02:00 backup-all`, `03:00 sync-offsite-backups --mode sync`, `06:05 backup-status`, `07:20 verify-offsite-full-sync`; success is summarized once daily and not sent as noisy Telegram heartbeat.
Backup freshness	VERIFIED	2026-06-13 01:26 status shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`; last aggregate backup completed `2026-06-12 15:54:40`.
188 momo backup cron/exporter contract	VERIFIED	188 crontab now runs `/home/ollama/bin/momo-pg-backup.sh`; exporter reports `awoooi_backup_job_configured{host="188",job="momo_pg_daily"} 1`, so `configured_missing_188=0`.
Google Drive/rclone remote latest-only	VERIFIED	2026-06-13 01:28 textfile confirms all 13 remote repos have `snapshot_count=1` and `snapshot_latest_only=1`; latest scheduled verifier log at 2026-06-12 07:20 returned `REMOTE_LATEST_ONLY_OK=1`, `FULL_MARKER_FRESH=1`, `VERIFY_OK=1`, `FAILED=0`.
Offsite gate marker	VERIFIED	`/backup/offsite/enable-rclone-sync` present; full marker fresh and verifier wrote `/home/wooo/node_exporter_textfiles/offsite_full_sync_verify.prom`.
Backup alert rules	VERIFIED	2026-06-13 01:27 live check confirms Prometheus and Alertmanager expose the active `BackupCredentialEscrowEvidenceMissing` gap alerts for the five missing items; Prometheus rules API has all five required alert names healthy; label contract check confirms all baseline backup alert rules are loaded.
Backup aggregate health	VERIFIED	2026-06-12 15:54 `/backup/scripts/backup-all.sh` completed `13/13` successfully; Configs captured 120 / 121 / K8s workloads / K8s secrets / Velero from source `192.168.0.120`; 18:55 `core_blockers=0`.
Credential escrow	BLOCKED	Five evidence markers missing. Only write non-secret marker evidence with `/backup/scripts/mark-credential-escrow-verified.sh`.
Config backup capture	VERIFIED	2026-06-12 15:54 Configs backup succeeded for `120-k3s-host-configs`, `121-k3s-host-configs`, `cluster-k8s-workloads`, `cluster-k8s-secrets`, and `cluster-velero-backups`; latest Configs snapshot `bee9ae22`.
Full cold-start	GREEN	2026-06-13 01:26 read-only rerun: `PASS=83 WARN=0 BLOCKED=0`; result `GREEN`.
110 -> 120 / 188 SSH trust	VERIFIED	Final trust repair backup `/home/wooo/.ssh/known_hosts.before-120-188-final-refresh.20260613-011949`; CD fix `80e6ec1a` uses `/home/wooo/.ssh/deploy_known_hosts`; post-deploy marker `e4a349bc` did not clobber global `known_hosts`, and 120 / 188 entries remain present.
120 console handoff	CLOSED	120 root filesystem was repaired from console/initramfs with offline fsck, booted at `2026-06-12 15:13`, SSH returned, root mounted `rw`, failed units `0`, and K3s `mon` returned `Ready`.
2026-06-05 manual backup remediation	VERIFIED with aggregate blocker	18:40 status: `stale110=none`, `stale188=none`, `configured_missing_188=0`; manual snapshots: AWOOOI `b7d5ee4e`, Gitea `ea641613`, Open-WebUI `d1147507`, ClawBot `73ead3cc`, AI artifacts `b1161ab8`.
2026-06-06 credential escrow audit	BLOCKED	15:03 report confirms scripts/config/rclone are present, but all five non-secret evidence markers are still missing; 15:06 safe dry-run checklist is documented below.

Current policy: normal success should not create immediate Telegram noise. Failures and operator-action states must still alert; a single daily status summary runs at 06:05.

2026-06-13 post-CD closeout:

110 / 120 / 121 / 188 public/core service recovery is green.
Backup aggregate, Google Drive/rclone offsite latest-only, and full cold-start are green after 120 recovery.
Latest normal CD deploy preserved API/Web workload balancing and did not break cold-start SSH trust.
Do not close DR scorecard until all five credential escrow evidence markers are written with non-secret evidence IDs.

Credential Escrow Evidence Checklist

2026-06-13 13:10 live refresh:

/backup/scripts/mark-credential-escrow-verified.sh --status: still missing restic_repository_password, offsite_provider_credentials, break_glass_admin_credentials, dns_registrar_recovery, and oauth_ai_provider_recovery.
/backup/scripts/offsite-escrow-evidence-report.sh --no-color: SCRIPT_MISSING_COUNT=0, OFFSITE_CONFIGURED=1, RCLONE_CONFIGURED=1, READINESS_REQUIRE_CONFIGURED_BLOCKED=0, ESCROW_MISSING_COUNT=5, SUMMARY PASS=8 WARN=5 BLOCKED=0.
Owner request package: CREDENTIAL-ESCROW-EVIDENCE-OWNER-REQUEST.md.
Verdict: backup core and offsite readiness are green; DR closeout stays blocked until all five markers are written with real, non-secret evidence IDs.

A credential escrow marker only proves that the recovery material has been manually verified and is retrievable; it must never contain a secret.

Item	Acceptable evidence-id	Forbidden
`restic_repository_password`	Password manager item ID, sealed envelope ID, recovery checklist ID	Restic password, recovery code, secret URL
`offsite_provider_credentials`	Vault item ID for Google Drive/rclone or provider credential record	OAuth token, refresh token, application key
`break_glass_admin_credentials`	Break-glass credential record ID or sealed envelope ID	Admin password, SSH private key, OTP seed
`dns_registrar_recovery`	Registrar recovery checklist ID or vault item ID	Registrar password, recovery codes
`oauth_ai_provider_recovery`	Provider account recovery checklist ID or vault item ID	API key, token, client secret

Safe flow after human verification:

# 1. Read current status; this does not expose secrets.
/backup/scripts/mark-credential-escrow-verified.sh --status

# 2. Validate the non-secret evidence-id first.
/backup/scripts/mark-credential-escrow-verified.sh \
  --item <allowed-item> \
  --evidence-id <non-secret-evidence-id> \
  --dry-run

# 3. Only after dry-run passes, write the marker.
/backup/scripts/mark-credential-escrow-verified.sh \
  --item <allowed-item> \
  --evidence-id <non-secret-evidence-id>

# 4. Recheck DR metric.
/backup/scripts/offsite-escrow-evidence-report.sh --no-color

<non-secret-evidence-id> must be an existing external reference. Placeholder values such as EVIDENCE_ID_FOR_* or VAULT-ITEM-ID are rejected and must not be written.
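The placeholder rule can be illustrated with a local pre-check that an operator might run before step 2. This is only a sketch under assumptions: `is_allowed_item`, `is_placeholder_evidence_id`, and `precheck_marker` are hypothetical names, and the authoritative validation remains whatever mark-credential-escrow-verified.sh itself enforces in `--dry-run`.

```shell
#!/bin/sh
# Hypothetical pre-check; it does not replace the script's own --dry-run validation.

# is_allowed_item ITEM -> exit 0 for the five escrow items listed in the table above.
is_allowed_item() {
  case "$1" in
    restic_repository_password|offsite_provider_credentials|break_glass_admin_credentials|dns_registrar_recovery|oauth_ai_provider_recovery) return 0 ;;
    *) return 1 ;;
  esac
}

# is_placeholder_evidence_id ID -> exit 0 when ID is empty, an unfilled
# <template> value, or one of the placeholder patterns named above.
is_placeholder_evidence_id() {
  case "$1" in
    ''|EVIDENCE_ID_FOR_*|VAULT-ITEM-ID|'<'*'>') return 0 ;;
    *) return 1 ;;
  esac
}

# precheck_marker ITEM ID -> exit 0 only when the item is known and the ID
# does not look like a placeholder.
precheck_marker() {
  if ! is_allowed_item "$1"; then
    echo "unknown escrow item: $1" >&2
    return 2
  fi
  if is_placeholder_evidence_id "$2"; then
    echo "placeholder evidence-id rejected for item: $1" >&2
    return 3
  fi
  return 0
}
```

A pre-check like this only catches obvious placeholders; it cannot confirm that the evidence-id refers to a real external record, which still requires human verification.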

Backup Landscape (Fully Automated)

#	Data type	Backup script	Schedule	Max data loss	Status
1	Gitea (DB + repositories)	`backup-gitea.sh`	Daily 02:00	24h	✅
2	MOMO PostgreSQL	`backup-momo.sh`	Daily 02:00	24h	✅
3	Harbor (Registry + DB)	`backup-harbor.sh`	Daily 02:00	24h	✅
4	AWOOOI PostgreSQL (full)	`backup-awoooi.sh`	Daily 02:00	6h	✅
4h	AWOOOI PostgreSQL (high-frequency)	`backup-awoooi-frequent.sh`	08:00/14:00/20:00	6h	✅
5	Langfuse (AI tracing/evaluation)	`backup-langfuse.sh`	Daily 02:00	24h	✅
6	Monitoring (Prometheus/Grafana/Alertmanager)	`backup-monitoring.sh`	Daily 02:00	24h	✅
7	SigNoz (ClickHouse traces/logs)	`backup-signoz.sh`	Daily 02:00	24h	✅
8	Open-WebUI (LLM conversation history)	`backup-open-webui.sh`	Daily 02:00	24h	✅
9	ClawBot Redis (state/cache)	`backup-clawbot.sh`	Daily 02:00	24h	✅
-	K8s resources (all namespaces)	Velero + MinIO	Daily 02:00	24h	✅

Backup orchestrator: /backup/scripts/backup-all.sh v3.0, runs all 9 backups in a single pass

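Alerting

Backup failures and states that need operator action must be pushed to AwoooP / Telegram. Normal successes are not pushed immediately, to avoid flooding the channel; success is carried by the daily 06:05 summary and by Prometheus/textfile evidence.

Status	Severity	Telegram delivery
`success`	info	No immediate message; included in the daily 06:05 backup status summary
`warning`	warning	⚠️ Yellow warning
`failed`	critical	🔴 Immediate alert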
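The status-to-severity mapping in the table can be sketched as two small shell functions. This is a sketch, not the logic in common.sh: `severity_for` and `notify_immediately` are hypothetical names, and treating an unrecognized status as alert-worthy is an assumption made here so that unexpected states fail loudly.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the table above; notify_clawbot itself is not modeled.

# severity_for STATUS -> prints the severity for a backup status.
severity_for() {
  case "$1" in
    success) echo info ;;
    warning) echo warning ;;
    failed)  echo critical ;;
    *)       echo "unknown status: $1" >&2; return 1 ;;
  esac
}

# notify_immediately STATUS -> exit 0 when the status should be pushed right away.
# Only success is deferred to the daily summary; anything else (including an
# unrecognized status, by assumption) is pushed immediately.
notify_immediately() {
  case "$1" in
    success) return 1 ;;
    *)       return 0 ;;
  esac
}
```

A wrapper could then call the real notifier only when `notify_immediately "$status"` succeeds, leaving successes to the 06:05 summary.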

Alert endpoint: http://192.168.0.188:8088/api/v1/webhook/custom
Test command:

source /backup/scripts/common.sh
notify_clawbot "failed" "backup-test" "Test alert" 0

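Retention Policy

Since 2026-05-19, the local restic repos on 110, the MOMO file backups on 188, and the Google Drive/rclone offsite mirror follow a latest-only policy: once a new snapshot is created successfully, only that newest snapshot is kept. The 2026-06-13 01:28 live textfile confirms that each of the 13 Google Drive/rclone remote repos holds exactly 1 snapshot and that every latest-only metric is 1.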
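The latest-only condition can be checked mechanically against a node_exporter textfile. The following is a minimal sketch, assuming each repo is exported as one sample whose metric name ends in `snapshot_count` (the full metric name is not given in this document, so the suffix match is an assumption); `latest_only_ok` is a hypothetical helper name.

```shell
#!/bin/sh
# latest_only_ok EXPECTED_REPOS < textfile
# Exit 0 only when exactly EXPECTED_REPOS samples named *snapshot_count are
# present and every one of them has the value 1.
latest_only_ok() {
  awk -v want="$1" '
    /^#/ { next }                          # skip HELP/TYPE comment lines
    $1 ~ /snapshot_count([{]|$)/ {         # assumed metric-name suffix
      seen++
      if ($NF != 1) bad++                  # more than one snapshot retained
    }
    END { exit !(seen == want && bad == 0) }
  '
}
```

For the setup described here, the call would look like `latest_only_ok 13 < /home/wooo/node_exporter_textfiles/offsite_full_sync_verify.prom`. Requiring the exact repo count means a repo that silently disappears from the textfile also fails the check.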

2026-06-04 manual refresh evidence:

188 momo-pg-backup.sh produced momo_analytics_20260604_154234.sql.gz and pruned old backups beyond keep-last=1.
110 backup-awoooi-frequent.sh completed restic snapshot 7440d75f and pruned previous AWOOOI high-frequency DB snapshot.
18:54 backup-status.sh --no-notify: stale110=none, stale188=none, configured_missing_188=0, core_blockers=1, escrow_missing=5.

18:55 cold-start scorecard refresh:

PASS=71 WARN=3 BLOCKED=3.
Remaining hard blocks: 120 ping, 120 SSH, and 120 K3s read-only check.
188 backup health stale jobs are clear.
momo current-month parity is green: 2215|2215|2026-06-01|2026-06-04|2026-06-01|2026-06-04.

19:02 120 console handoff evidence:

local/110/121/188 cannot reach 192.168.0.120.
K3s node lease for mon stopped renewing at 2026-05-22 02:48:36 +08.
120-fsck-maintenance-checklist.sh --no-color returns PASS=2 WARN=2 BLOCKED=3, so backup aggregate remains correctly blocked until console/SSH recovery.

2026-06-12 18:55 update: 120 has returned and the aggregate backup blocker is cleared. /backup/scripts/backup-all.sh completed 13/13, full offsite sync completed 13/13, full verifier returned REMOTE_LATEST_ONLY_OK=1 / VERIFY_OK=1, and backup-status.sh --no-notify reports core_blockers=0. The only remaining DR warning is escrow_missing=5.

2026-06-13 01:28 update: post-CD live readback still shows remote_verify_ok=1, full_verify_fresh=1, and all 13 repos snapshot_count=1; backup core remains green after deploy marker e4a349bc.

2026-06-05 manual remediation:

16:00 live check still had 120 unreachable, stale110=awoooi_db, backup_all failed=6, and escrow_missing=5.
14:00 AWOOOI high-frequency backup failed, then 16:01 manual rerun completed snapshot b7d5ee4e.
02:00 Gitea failure was caused by stale container /tmp/gitea-dump.zip; it was renamed in-container to /tmp/gitea-dump.stale.20260605_161032.zip, then Gitea backup completed snapshot ea641613.
scripts/backup/backup-gitea.sh and live 110 /backup/scripts/backup-gitea.sh now preserve stale container dump files with timestamped names before starting a new dump.
110 -> 188 SSH known_hosts was refreshed after fingerprint match for 192.168.0.188; Open-WebUI backup completed snapshot d1147507.
ClawBot backup completed snapshot 73ead3cc; BGSAVE still warned, but the Redis volume backup succeeded.
AI artifacts backup completed snapshot b1161ab8.
Full offsite sync was skipped by runway gate because the next scheduled backup was too close; partial sync for awoooi gitea open-webui clawbot ai-artifacts completed 5/5.
18:39 full verifier confirmed all 13 Google Drive/rclone repos have remote snapshots=1, REMOTE_LATEST_ONLY_OK=1, and VERIFY_OK=1.
18:40 backup status still reports failed=6 / core_blockers=6 because the 02:00 aggregate history remains failed until backup-all reruns after 120 returns. Do not mark aggregate backup green from individual backup success alone.

2026-06-06 convergence evidence:

14:46 live check: 120 still ping/SSH failed and K3s mon remains NotReady,SchedulingDisabled.
02:00 aggregate backup failed only Configs; log line: 全服務備份完成 (1532s) - 1 個失敗 (12/13), i.e. "all-service backup finished (1532s), 1 failure (12/13)".
14:58 backup-status.sh --no-notify: stale110=none, stale188=none, failed=1, core_blockers=1, escrow_missing=5.
14:46 verify-offsite-full-sync.sh --write-textfile --no-color: all 13 remote repos have one snapshot, REMOTE_LATEST_ONLY_OK=1, VERIFY_OK=1.
15:03 cold-start scorecard: PASS=71 WARN=3 BLOCKED=3; direct 188 checks still show momo-scheduler healthy with recent log activity, and the scheduler WARN is no longer present in the scorecard.
15:03 credential escrow report: rclone/offsite readiness is configured, but restic_repository_password, offsite_provider_credentials, break_glass_admin_credentials, dns_registrar_recovery, and oauth_ai_provider_recovery still lack non-secret evidence markers. Do not write placeholders or secrets.

Full Crontab Schedule (110)

0 2       * * *   backup-all.sh              ← full backup of 9 services
0 8,14,20 * * *   backup-awoooi-frequent.sh  ← AWOOOI high-frequency (every 6 hours)
0 3       * * *   sync-offsite-backups.sh --mode sync          ← Google Drive/rclone gated sync
5 6       * * *   backup-status.sh           ← once-daily backup status summary; avoids noisy success heartbeats
20 7      * * *   verify-offsite-full-sync.sh --write-textfile ← Google Drive/rclone latest-only verification

Backup Architecture

192.168.0.110 (/backup/scripts/backup-all.sh) daily 02:00
├── [1/9] backup-gitea.sh       → gitea dump → /backup/gitea
├── [2/9] backup-momo.sh        → SSH 188 pg_dump momo → /backup/momo
├── [3/9] backup-harbor.sh      → harbor dump → /backup/harbor
├── [4/9] backup-awoooi.sh      → SSH 188 pg_dump awoooi_prod/dev/k3s → /backup/awoooi
├── [5/9] backup-langfuse.sh    → docker exec langfuse-db pg_dump → /backup/langfuse
├── [6/9] backup-monitoring.sh  → volumes prometheus/grafana/alertmanager → /backup/monitoring
├── [7/9] backup-signoz.sh      → volumes signoz-clickhouse/sqlite → /backup/signoz
├── [8/9] backup-open-webui.sh  → SSH 188 volume open-webui → /backup/open-webui
└── [9/9] backup-clawbot.sh     → SSH 188 volume clawbot-redis → /backup/clawbot

Backup failure → notify_clawbot("failed") → /webhook/custom or the AwoooP/Alertmanager path → Telegram 🔴
Backup success → textfile / Prometheus / 06:05 status summary, no immediate message

192.168.0.188 (Velero) daily 02:00
└── K8s resource snapshot → MinIO :9000 (bucket: velero)
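The orchestration pattern in the diagram (run every step, count failures, report N/M, stay red unless all steps pass) can be sketched as follows. This is an illustrative sketch only, not the contents of backup-all.sh; `run_all` is a hypothetical name and the steps are passed in as plain command names.

```shell
#!/bin/sh
# run_all STEP... -> runs every step even if an earlier one fails, prints
# "<succeeded>/<total>", and exits 0 only when nothing failed.
# The function body is a subshell, so its counters do not leak to the caller.
run_all() (
  total=0
  failed=0
  for step in "$@"; do
    total=$((total + 1))
    if ! "$step"; then
      failed=$((failed + 1))
      echo "FAILED: $step" >&2     # a real orchestrator would notify here
    fi
  done
  echo "$((total - failed))/$total"
  [ "$failed" -eq 0 ]
)
```

Continuing after a failed step keeps one broken service from blocking the remaining backups, while the non-zero exit keeps the aggregate result red, which matches the rule stated earlier that individual successes alone must not mark the aggregate green.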

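Not Yet Backed Up (Rationale)

Service	Reason	Notes
Prometheus TSDB	Raw metric data (not configuration); the TSDB has its own 30d TTL	Low priority; Grafana configuration is already backed up
Sentry	Not currently running (`docker ps` shows nothing)	Volume exists; re-evaluate after redeployment
Redis (AWOOOI)	Cache/WorkingMemory only, no persistent business data	Low priority
Velero MinIO data	MinIO is a backup of a backup and itself needs an offsite copy	B2/S3 offsite under evaluation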

Verification SOP

# Latest backup log
ssh wooo@192.168.0.110 "tail -50 /backup/logs/backup.log"

# Snapshot count for each service repo.
# Counts table rows that start with an 8-hex-digit snapshot ID; grep -c prints 0 when there are none.
ssh wooo@192.168.0.110 "for r in gitea momo harbor awoooi langfuse monitoring signoz open-webui clawbot; do
  echo -n \"\$r: \"
  restic -r /backup/\$r snapshots --password-file /backup/scripts/.restic-password 2>/dev/null | grep -cE '^[0-9a-f]{8} '
done"

# Alert test
ssh wooo@192.168.0.110 "source /backup/scripts/common.sh && notify_clawbot 'warning' 'manual-test' 'manual alert test' 0"
