# IwoooS Backup / Restore / Escrow / Retention Read-Only Inventory

| Item | Value |
|------|------|
| Date | 2026-06-11 |
| Status | `repo_only_inventory_ready` |
| Tool | `scripts/security/backup-restore-escrow-inventory.py` |
| Snapshot | `docs/security/backup-restore-escrow-inventory.snapshot.json` |
| Schema | `docs/schemas/backup_restore_escrow_inventory_v1.schema.json` |
| Mode for this phase | committed repo files only; no backup / restore / offsite sync is executed |
| runtime gate | `0` |
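## 1. Key conclusions
Backup / restore / escrow / retention has moved from an abstract to-do item to a re-runnable, repo-only inventory of high-value configuration. The inventory covers `38` surfaces, of which `27` are write-capable or apply-capable; each of those requires an owner response, a maintenance window, a rollback owner, and validation metrics before any runtime action may begin.

This inventory is still not live backup truth. It only shows which backup, restore, offsite sync, credential escrow, Velero, alert, and DR files in the repo need to be brought under control. It does not mean that backups have succeeded, that a restore drill has been run, that offsite sync is authorized, that escrow markers may be written, or that the retention policy may be changed.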
## 2. Coverage summary

| Metric | Current value | Boundary |
|------|--------|------|
| surface count | `38` | Reflects only what is visible in repo sources |
| source exists | `38` | Every source has a SHA256; does not mean it is applied live |
| backup script surface | `15` | Includes the orchestrator, service backups, and config capture |
| offsite / escrow surface | `8` | Includes rclone sync, verifier, readiness, marker, and config helpers |
| Velero surface | `5` | Includes CronJob, ConfigMap script, standalone script, credential manifest, and install manifest |
| restore drill surface | `4` | Still requires human approval; drills must not be run directly |
| retention surface | `3` | `restic forget --prune` and latest-only delete remain unauthorized |
| credential surface | `5` | Only metadata / evidence ids are allowed; no secret values are collected |
| alert surface | `1` | Does not imply a Prometheus / Alertmanager reload |
| write-capable surface | `27` | Visible means it needs control, not that it may be executed |
| owner response received / accepted | `0 / 0` | Must not be inflated artificially |
| restore / offsite / escrow / retention accepted | `0 / 0 / 0 / 0` | All still pending the owner gate |
| runtime gate | `0` | No action buttons are provided |

Maturity for the backup / restore / credential category moved from `52%` to `58%`. This reflects only that the repo-only inventory, schema, snapshot, and front-end boundaries are complete; it does not mean that live evidence, a restore drill, or offsite sync has been accepted.
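
The `source exists` row pairs each source with a SHA256. As a minimal sketch of what such a read-only record could look like, the snippet below hashes a file without executing it. The function names and record fields are hypothetical and are not taken from `backup-restore-escrow-inventory.py`.

```python
import hashlib
from pathlib import Path


def describe_source(root: Path, relative_path: str) -> dict:
    """Build a read-only record for one repo source (hypothetical shape)."""
    path = root / relative_path
    if not path.is_file():
        # A missing source is only recorded; nothing is created or repaired.
        return {"source": relative_path, "exists": False, "sha256": None}
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return {"source": relative_path, "exists": True, "sha256": digest}


def summarize(records: list[dict]) -> dict:
    """Count surfaces and existing sources; the runtime gate stays at 0."""
    return {
        "surface_count": len(records),
        "source_exists": sum(1 for r in records if r["exists"]),
        "runtime_gate": 0,
    }
```

A record of this kind proves only that the file is present in the repo with a given content hash, which matches the boundary stated in the table: it says nothing about whether the file is applied live.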
## 3. Surface types under management

| Type | Representative sources | Current status |
|------|-------------|----------|
| Backup orchestrator | `scripts/backup/backup-all.sh` | The executable orchestrator is visible but is not run in this phase |
| Service backups | `backup-gitea.sh`, `backup-awoooi.sh`, `backup-harbor.sh`, `backup-langfuse.sh`, `backup-monitoring.sh`, `backup-signoz.sh`, `backup-open-webui.sh`, `backup-clawbot.sh`, `backup-sentry.sh`, `backup-ai-artifacts.sh`, `backup-public-routes.sh` | Each service needs an owner, freshness, restore target isolation, and secret redaction proof |
| Restic / retention | `scripts/backup/common.sh`, `scripts/backup/enforce-latest-only-retention.sh` | GFS and latest-only policies are visible; `restic prune` and delete remain unauthorized |
| Offsite / escrow | `sync-offsite-backups.sh`, `verify-offsite-full-sync.sh`, `backup-offsite-readiness-gate.sh`, `offsite-escrow-evidence-report.sh`, `mark-credential-escrow-verified.sh` | Remote sync, remote delete, and marker write all still require human approval |
| Credential config | `configure-offsite-rclone.sh`, `configure-offsite-b2.sh`, `k8s/velero/01-credentials.yaml` | Only secret metadata is managed; values, hashes, partial tokens, and recovery codes must not be collected |
| Velero restore drill | `k8s/awoooi-prod/16-cronjob-backup-restore-test.yaml`, `17-configmap-backup-restore-scripts.yaml`, `scripts/cron_backup_restore_test.sh` | Manifests and scripts are visible; this does not mean the CronJob is live or that the restore dry-run or metrics are healthy |
| Alert / health | `ops/monitoring/alerts.yml`, `scripts/ops/backup-health-textfile-exporter.py` | Only the rule / metric sources are included; Alertmanager is not reloaded and no live textfile is written |
| DR / cold-start docs | `BACKUP-STATUS.md`, `FULL-STACK-COLD-START-SOP.md`, backup DR evaluation snapshots | The documents are under control; command examples must not be treated as authorization |
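
The credential row above allows only secret metadata. A hedged sketch of a guard that flags evidence records carrying anything beyond metadata might look like this; the key names in `FORBIDDEN_KEYS` are assumptions chosen to mirror the table wording and are not defined by the schema.

```python
# Hypothetical key names mirroring "value, hash, partial token, recovery code".
FORBIDDEN_KEYS = {"value", "hash", "partial_token", "recovery_code"}


def forbidden_fields(record: dict) -> list[str]:
    """Return the keys in a credential record that must not be collected."""
    return sorted(key for key in record if key.lower() in FORBIDDEN_KEYS)


def is_metadata_only(record: dict) -> bool:
    """True when the record carries no secret-bearing fields."""
    return not forbidden_fields(record)
```

For example, a record such as `{"evidence_id": "EV-001", "owner": "platform"}` would pass, while one that also contains `"hash"` would be flagged for review instead of being stored.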
## 4. Current gaps and items needing reinforcement

| Priority | Gap | Risk | Next step |
|------|------|------|--------|
| P0 | `credential_escrow_markers` not yet accepted | No human-verified evidence that restic / offsite / break-glass / DNS / OAuth credentials can be recovered | Establish a non-secret evidence id and an owner response; do not write markers directly |
| P0 | Restore drill approval package is still a template | Cannot prove that DB, config, K8s, and observability restores can be drilled safely | Add an isolated environment, observer, source backup ref, stop condition, and rollback owner |
| P0 | Offsite sync is capable of remote delete | latest-only / rclone sync may delete older remote packs | Add an offsite owner, runway, remote delete owner, full sync window, and verifier evidence |
| P0 | Retention / restic prune has no owner gate yet | Snapshots deleted by mistake, or a shortened recovery window | Add a retention owner, restore runway, prune window, and rollback conditions |
| P0 | Velero credential / install manifests still need a live disposition | Cluster-admin, the MinIO endpoint, and credential injection are high risk | Add an RBAC owner, secret manager source, rotation owner, and least privilege review |
| P1 | Restore test ConfigMap and standalone script write timestamps differently | The Prometheus textfile may not read a 13-digit timestamp correctly | List under owner disposition first; any fix must go through the CronJob / ConfigMap owner gate |
| P1 | Backup status runbook contains a stale live refresh note | Previously recorded live status may be out of date | Requires an owner-provided live refresh; this inventory does not initiate SSH |
| P1 | Backup health exporter can write the textfile | A false-green metric would mislead alerting | Add an exporter owner, metric freshness SLO, failure notification, and guard |
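
The timestamp gap above concerns 13-digit values, which is the length of a Unix epoch expressed in milliseconds, whereas an epoch in seconds currently has 10 digits. The source does not say which unit each script emits, so the following is only an illustrative sketch, with hypothetical helper names, of how a reviewer could classify a value and normalize it to seconds.

```python
def epoch_unit(value: int) -> str:
    """Guess the unit of a Unix epoch value from its digit count."""
    digits = len(str(abs(value)))
    if digits <= 10:
        return "seconds"
    if digits <= 13:
        return "milliseconds"
    return "unknown"


def to_epoch_seconds(value: int) -> int:
    """Normalize an epoch value to whole seconds, rejecting unknown units."""
    unit = epoch_unit(value)
    if unit == "seconds":
        return value
    if unit == "milliseconds":
        return value // 1000
    raise ValueError(f"unrecognized epoch magnitude: {value}")
```

A helper like this belongs in review tooling only; changing what the CronJob or ConfigMap script writes still has to go through the owner gate listed in the table.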
## 5. Fixed 0 / false boundaries
The following flags must remain `false`:
```text
runtime_execution_authorized=false
host_write_authorized=false
backup_run_authorized=false
restore_run_authorized=false
restore_drill_authorized=false
offsite_sync_authorized=false
offsite_remote_delete_authorized=false
credential_escrow_marker_write_authorized=false
retention_change_authorized=false
restic_prune_authorized=false
rclone_config_authorized=false
velero_restore_authorized=false
velero_backup_authorized=false
kubectl_action_authorized=false
ssh_read_authorized=false
ssh_write_authorized=false
secret_value_collection_allowed=false
active_scan_authorized=false
action_buttons_allowed=false
```
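
A boundary list like this is easy to check mechanically. The sketch below assumes the flags are available as plain `key=value` text, as shown above; how the snapshot actually stores them is not stated here, and the function names are hypothetical.

```python
def parse_flags(text: str) -> dict[str, str]:
    """Parse key=value lines into a dict, skipping blank or malformed lines."""
    flags = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        flags[key.strip()] = value.strip()
    return flags


def violations(flags: dict[str, str]) -> list[str]:
    """Return the names of flags whose value is anything other than 'false'."""
    return sorted(name for name, value in flags.items() if value != "false")
```

An empty result from `violations` means every parsed flag is still `false`; any returned name would indicate a boundary that was changed and needs review before merge.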
## 6. Next-phase priorities
1. P0: Assemble the backup / restore / escrow owner response packet. Fields: owner role / team, decision, decision reason, affected scope, redacted evidence refs, followup owner, rollback owner, maintenance window, validation plan.
2. P0: Build the credential escrow review package; only non-secret evidence ids are allowed and no markers are written.
3. P0: For offsite sync, add the remote delete owner, runway, full sync window, and verifier evidence; `sync` must not be run before acceptance.
4. P0: For restore drills, add the isolated environment, observer, source backup refs, stop condition, and rollback owner; no restore may be run before acceptance.
5. P1: Establish an owner disposition for the timestamp discrepancy in the Velero CronJob / ConfigMap scripts; do not apply directly.
6. P1: Have owners provide the latest redacted evidence for backup status / offsite / escrow / Velero metrics; this phase does not obtain it via SSH.
7. P1: Sync the P1-3 metrics to `/zh-TW/iwooos`, and verify desktop / mobile overflow and the absence of action buttons.
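
The owner response packet in item 1 can be sketched as a simple record type. This is an illustration only: the attribute names are derived from the field list above, while the types and the `missing_fields` helper are assumptions, not a committed schema.

```python
from dataclasses import dataclass, field, fields


@dataclass
class OwnerResponsePacket:
    """Hypothetical container for the owner response fields listed above."""
    owner_role_or_team: str = ""
    decision: str = ""
    decision_reason: str = ""
    affected_scope: str = ""
    redacted_evidence_refs: list[str] = field(default_factory=list)
    followup_owner: str = ""
    rollback_owner: str = ""
    maintenance_window: str = ""
    validation_plan: str = ""

    def missing_fields(self) -> list[str]:
        """Names of fields that are still empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]
```

A packet would count as complete only when `missing_fields()` returns an empty list; completeness of the form is separate from acceptance, which remains an owner decision.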
## 7. Verification commands
```bash
python3 scripts/security/backup-restore-escrow-inventory.py \
--root . \
--output /tmp/backup-restore-escrow-inventory-check.json
```
To pin the timestamp of the committed snapshot:
```bash
python3 scripts/security/backup-restore-escrow-inventory.py \
--root . \
--generated-at 2026-06-11T22:20:00+08:00 \
--output docs/security/backup-restore-escrow-inventory.snapshot.json
```
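
Pinning `--generated-at` suggests the snapshot is meant to be reproducible. One hedged way to check that a fresh run still matches the committed snapshot is to compare the two JSON files while ignoring the timestamp. The `generated_at` key name and the assumption that the top level is a JSON object are guesses, not taken from the schema.

```python
import json
from pathlib import Path


def load_without_timestamp(path: Path, timestamp_key: str = "generated_at") -> dict:
    """Load a JSON object and drop the (assumed) top-level timestamp key."""
    data = json.loads(path.read_text(encoding="utf-8"))
    data.pop(timestamp_key, None)
    return data


def snapshots_match(fresh: Path, committed: Path) -> bool:
    """Compare two inventory outputs, ignoring only the timestamp."""
    return load_without_timestamp(fresh) == load_without_timestamp(committed)
```

With the paths used above, this would compare `/tmp/backup-restore-escrow-inventory-check.json` against `docs/security/backup-restore-escrow-inventory.snapshot.json`; both are read only and nothing is written.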
## 8. Completion

| Work item | Completion | Notes |
|------|--------|------|
| repo-only surface registration | `100%` | `38` source surfaces registered |
| source existence / SHA256 | `100%` | `38 / 38` sources exist |
| schema / snapshot | `100%` | `backup_restore_escrow_inventory_v1` created |
| High-value configuration maturity | `58%` | Up from `52%`; reflects only the read-only framework |
| owner response received / accepted | `0%` | Not yet submitted, received, or accepted |
| live evidence collection | `0%` | No SSH, no rclone, no kubectl, no restore |
| restore / offsite / escrow / retention gate | `0%` | All remain `0 / false` |
## 9. Boundaries
This inventory did not run `backup-all.sh`, run any service backup, run `restic check`, run `restic forget --prune`, run `rclone sync`, read the remote offsite, write escrow markers, modify the rclone / B2 configuration, apply Velero manifests, run a restore dry-run, write the Prometheus textfile, reload alert rules, use SSH, collect secret values, or add any front-end action button.