# IwoooS Backup / Restore / Escrow / Retention Read-Only Inventory

| Item | Value |
|------|------|
| Date | 2026-06-11 |
| Status | `repo_only_inventory_ready` |
| Tool | `scripts/security/backup-restore-escrow-inventory.py` |
| Snapshot | `docs/security/backup-restore-escrow-inventory.snapshot.json` |
| Schema | `docs/schemas/backup_restore_escrow_inventory_v1.schema.json` |
| Mode for this phase | committed repo files only; no backup / restore / offsite sync is executed |
| runtime gate | `0` |

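## 1. Key conclusions

Backup / restore / escrow / retention has moved from an abstract to-do item to a re-runnable, repo-only inventory of high-value configuration. The inventory covers `38` surfaces, of which `27` are write-capable or apply-capable. Each of those needs an owner response, a maintenance window, a rollback owner, and validation metrics before any runtime action may begin.

This inventory is still not live backup truth. It only shows which backup, restore, offsite sync, credential escrow, Velero, alert, and DR files in the repo need to be controlled. It does not mean that backups have succeeded, that a restore drill has run, that offsite sync is authorized, that escrow markers may be written, or that the retention policy may be changed.
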
## 2. Coverage summary

| Metric | Current value | Boundary |
|------|--------|------|
| surface count | `38` | Only means the source is visible in the repo |
| source exists | `38` | Every source has a SHA256; does not mean it is applied live |
| backup script surface | `15` | Includes the orchestrator, service backups, and config capture |
| offsite / escrow surface | `8` | Includes rclone sync, verifier, readiness, marker, and config helpers |
| Velero surface | `5` | Includes CronJob, ConfigMap script, standalone script, credential manifest, and install manifest |
| restore drill surface | `4` | Still requires human approval; must not be drilled directly |
| retention surface | `3` | `restic forget --prune` and latest-only delete remain unauthorized |
| credential surface | `5` | Only metadata / evidence ids are allowed; no secret values are collected |
| alert surface | `1` | Does not imply a Prometheus / Alertmanager reload |
| write-capable surface | `27` | Visible means it must be controlled, not that it may be executed |
| owner response received / accepted | `0 / 0` | Must not be artificially inflated |
| restore / offsite / escrow / retention accepted | `0 / 0 / 0 / 0` | All still pending the owner gate |
| runtime gate | `0` | No action buttons are provided |

Maturity of the backup / restore / credential category rose from `52%` to `58%`. This only reflects completion of the repo-only inventory, schema, snapshot, and front-end boundary; it does not mean that live evidence, a restore drill, or offsite sync has been accepted.

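The "source exists" and SHA256 figures can be derived from committed files alone, without touching any live system. The following is a minimal sketch of that idea, not the actual logic of `backup-restore-escrow-inventory.py`; the function names and the record fields (`path`, `exists`, `sha256`) are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def describe_source(root: Path, rel_path: str) -> dict:
    """Return a read-only record for one repo file (hypothetical field names)."""
    target = root / rel_path
    if not target.is_file():
        return {"path": rel_path, "exists": False, "sha256": None}
    # Hash the committed bytes only; nothing is executed or written.
    digest = hashlib.sha256(target.read_bytes()).hexdigest()
    return {"path": rel_path, "exists": True, "sha256": digest}


def count_existing(records: list[dict]) -> tuple[int, int]:
    """Return (existing, total) so a report can show a ratio such as 38 / 38."""
    existing = sum(1 for record in records if record["exists"])
    return existing, len(records)
```

A matching hash only shows that the file in the repo is unchanged; it says nothing about what is deployed.
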
## 3. Surface types under management

| Type | Representative sources | Current status |
|------|-------------|----------|
| Backup orchestrator | `scripts/backup/backup-all.sh` | The executable orchestrator is visible but is not run in this phase |
| Service backups | `backup-gitea.sh`, `backup-awoooi.sh`, `backup-harbor.sh`, `backup-langfuse.sh`, `backup-monitoring.sh`, `backup-signoz.sh`, `backup-open-webui.sh`, `backup-clawbot.sh`, `backup-sentry.sh`, `backup-ai-artifacts.sh`, `backup-public-routes.sh` | Each service needs an owner, freshness, restore target isolation, and secret redaction proof |
| Restic / retention | `scripts/backup/common.sh`, `scripts/backup/enforce-latest-only-retention.sh` | GFS and latest-only policies are visible; `restic prune` and delete remain unauthorized |
| Offsite / escrow | `sync-offsite-backups.sh`, `verify-offsite-full-sync.sh`, `backup-offsite-readiness-gate.sh`, `offsite-escrow-evidence-report.sh`, `mark-credential-escrow-verified.sh` | Remote sync, remote delete, and marker write all still require human approval |
| Credential config | `configure-offsite-rclone.sh`, `configure-offsite-b2.sh`, `k8s/velero/01-credentials.yaml` | Only secret metadata is controlled; values, hashes, partial tokens, and recovery codes must not be collected |
| Velero restore drill | `k8s/awoooi-prod/16-cronjob-backup-restore-test.yaml`, `17-configmap-backup-restore-scripts.yaml`, `scripts/cron_backup_restore_test.sh` | Manifests and script are visible; this does not mean the CronJob is live or that the restore dry-run or metrics are healthy |
| Alert / health | `ops/monitoring/alerts.yml`, `scripts/ops/backup-health-textfile-exporter.py` | Only rule / metric sources are included; Alertmanager is not reloaded and no live textfile is written |
| DR / cold-start docs | `BACKUP-STATUS.md`, `FULL-STACK-COLD-START-SOP.md`, backup DR evaluation snapshots | Documents are under control; command examples must not be treated as authorization |

## 4. Current non-conformances and gaps

| Priority | Gap | Risk | Next step |
|------|------|------|--------|
| P0 | `credential_escrow_markers` not yet accepted | No human-verified evidence that restic / offsite / break-glass / DNS / OAuth credentials are recoverable | Create an owner response keyed by non-secret evidence id; do not write the marker directly |
| P0 | Restore drill approval package is still a template | Cannot prove that DB, config, K8s, and observability restores can be drilled safely | Add isolated environment, observer, source backup ref, stop condition, and rollback owner |
| P0 | Offsite sync has remote delete capability | latest-only / rclone sync may delete older remote packs | Add offsite owner, runway, remote delete owner, full sync window, and verifier evidence |
| P0 | Retention / restic prune has no owner gate | Accidental snapshot deletion or a shortened recovery window | Add retention owner, restore runway, prune window, and rollback conditions |
| P0 | Velero credential / install manifest still needs live disposition | Cluster-admin, MinIO endpoint, and credential injection carry high risk | Add RBAC owner, secret manager source, rotation owner, and least-privilege review |
| P1 | Restore test ConfigMap and standalone script write timestamps differently | Prometheus textfile may not read a 13-digit timestamp correctly | List under owner disposition first; the fix must go through the CronJob / ConfigMap owner gate |
| P1 | Backup status runbook carries an old live refresh note | Previously recorded live status may be stale | Requires an owner-provided live refresh; this inventory does not SSH on its own |
| P1 | Backup health exporter can write the textfile | A false-green metric would mislead alerting | Add exporter owner, metric freshness SLO, failure notification, and guard |

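For the timestamp gap, a 13-digit Unix epoch is a millisecond value, whereas PromQL's `time()` returns seconds. If the timestamp is written as a metric value (an assumption; the scripts themselves are not shown here), a freshness rule that subtracts it from `time()` would misbehave on the millisecond form. The sketch below normalises either form to seconds. The function name and the digit-count heuristic are illustrative only, and any real fix would still go through the CronJob / ConfigMap owner gate.

```python
def epoch_to_seconds(raw: str) -> int:
    """Normalise a Unix epoch string to whole seconds.

    Heuristic (an assumption): 13 digits means milliseconds,
    10 digits means seconds. Any other shape is rejected.
    """
    value = raw.strip()
    if not value.isdecimal():
        raise ValueError(f"not a Unix epoch: {raw!r}")
    if len(value) == 13:
        return int(value) // 1000  # milliseconds -> seconds
    if len(value) == 10:
        return int(value)  # already seconds
    raise ValueError(f"unexpected epoch width {len(value)}: {raw!r}")
```

Rejecting unexpected widths, rather than guessing, keeps a malformed value from turning into a false-green metric.
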
## 5. Fixed 0 / false boundary

The following flags must remain `false`:

```text
runtime_execution_authorized=false
host_write_authorized=false
backup_run_authorized=false
restore_run_authorized=false
restore_drill_authorized=false
offsite_sync_authorized=false
offsite_remote_delete_authorized=false
credential_escrow_marker_write_authorized=false
retention_change_authorized=false
restic_prune_authorized=false
rclone_config_authorized=false
velero_restore_authorized=false
velero_backup_authorized=false
kubectl_action_authorized=false
ssh_read_authorized=false
ssh_write_authorized=false
secret_value_collection_allowed=false
active_scan_authorized=false
action_buttons_allowed=false
```

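A check along the following lines could confirm that none of these flags has drifted to `true`. It is a minimal sketch that assumes the flags are available as `key=value` text in the form listed in this section; how the real snapshot stores them is not shown in this document, and `find_unlocked_flags` is a hypothetical name.

```python
def find_unlocked_flags(text: str) -> list[str]:
    """Return the names of flags whose value is anything other than 'false'."""
    unlocked = []
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line:
            continue  # skip blank lines and anything that is not key=value
        name, _, value = line.partition("=")
        if value.strip().lower() != "false":
            unlocked.append(name.strip())
    return unlocked


# Three flags copied from the list in this section; all are still false.
sample = """
runtime_execution_authorized=false
restic_prune_authorized=false
action_buttons_allowed=false
"""
assert find_unlocked_flags(sample) == []
```

Treating every value other than `false` as unlocked means a typo or an empty value fails the check instead of passing silently.
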
## 6. Next-phase priorities

1. P0: Assemble the backup / restore / escrow owner response packet, with the fields owner role / team, decision, decision reason, affected scope, redacted evidence refs, followup owner, rollback owner, maintenance window, and validation plan.
2. P0: Build the credential escrow review package; only non-secret evidence ids are allowed and no marker is written.
3. P0: For offsite sync, add remote delete owner, runway, full sync window, and verifier evidence; `sync` must not run before acceptance.
4. P0: For the restore drill, add isolated environment, observer, source backup refs, stop condition, and rollback owner; no restore may run before acceptance.
5. P1: Create an owner disposition for the timestamp difference between the Velero CronJob and the ConfigMap script; do not apply directly.
6. P1: The owner provides the latest redacted evidence for backup status / offsite / escrow / Velero metrics; this phase does not SSH to obtain it.
7. P1: Sync the P1-3 metrics to `/zh-TW/iwooos`, and verify desktop / mobile overflow and the absence of action buttons.

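The packet fields named in item 1 can be captured as a simple record so that incomplete responses are caught before review. This is a sketch only: the Python attribute names are illustrative renderings of those fields, and `missing_fields` is a hypothetical helper rather than part of the existing tooling.

```python
from dataclasses import dataclass, field, fields


@dataclass
class OwnerResponsePacket:
    """One owner response; every field starts empty and must be filled in."""
    owner_role_or_team: str = ""
    decision: str = ""
    decision_reason: str = ""
    affected_scope: str = ""
    redacted_evidence_refs: list[str] = field(default_factory=list)
    followup_owner: str = ""
    rollback_owner: str = ""
    maintenance_window: str = ""
    validation_plan: str = ""


def missing_fields(packet: OwnerResponsePacket) -> list[str]:
    """List every field that is still empty, in declaration order."""
    return [f.name for f in fields(packet) if not getattr(packet, f.name)]
```

A packet would only count as received once `missing_fields` returns an empty list; acceptance remains a separate human decision.
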
## 7. Verification commands

```bash
python3 scripts/security/backup-restore-escrow-inventory.py \
  --root . \
  --output /tmp/backup-restore-escrow-inventory-check.json
```

To pin the timestamp of the committed snapshot:

```bash
python3 scripts/security/backup-restore-escrow-inventory.py \
  --root . \
  --generated-at 2026-06-11T22:20:00+08:00 \
  --output docs/security/backup-restore-escrow-inventory.snapshot.json
```

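To confirm that a fresh run matches the committed snapshot, the two JSON files can be compared with the generation time ignored. The sketch below assumes the timestamp is stored under a top-level `generated_at` key, which is a guess based on the `--generated-at` flag; adjust the key to whatever the real schema uses.

```python
import json
from pathlib import Path


def load_without_timestamp(path: Path, key: str = "generated_at") -> dict:
    """Load a JSON object and drop the (assumed) top-level timestamp key."""
    data = json.loads(path.read_text(encoding="utf-8"))
    data.pop(key, None)
    return data


def snapshots_match(fresh: Path, committed: Path) -> bool:
    """True when both files carry the same content apart from the timestamp."""
    return load_without_timestamp(fresh) == load_without_timestamp(committed)
```

For example, `snapshots_match(Path("/tmp/backup-restore-escrow-inventory-check.json"), Path("docs/security/backup-restore-escrow-inventory.snapshot.json"))` compares the output of the first command with the committed file. Both paths are read-only inputs here.
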
## 8. Completion

| Work item | Completion | Notes |
|------|--------|------|
| repo-only surface registration | `100%` | `38` source surfaces included |
| source existence / SHA256 | `100%` | `38 / 38` sources exist |
| schema / snapshot | `100%` | `backup_restore_escrow_inventory_v1` created |
| High-value configuration maturity | `58%` | Up from `52%`; reflects the read-only framework only |
| owner response received / accepted | `0%` | Not yet sent, received, or accepted |
| live evidence collection | `0%` | No SSH, no rclone, no kubectl, no restore |
| restore / offsite / escrow / retention gate | `0%` | All remain `0 / false` |

## 9. Boundary

Producing this inventory did not run `backup-all.sh`, run any service backup, run `restic check`, run `restic forget --prune`, run `rclone sync`, read the remote offsite store, write an escrow marker, modify rclone / B2 configuration, apply any Velero manifest, run a restore dry-run, write a Prometheus textfile, reload alert rules, open any SSH session, collect any secret value, or add any front-end action button.