docs(ops): add credential escrow evidence owner request [skip ci]

Your Name
2026-06-13 13:14:46 +08:00
parent 7c1ebe0153
commit 88dc08e595
5 changed files with 290 additions and 3 deletions


@@ -33943,3 +33943,26 @@ production browser smoke:
- Service / cold-start: `GREEN`
- API / Web workload balancing: `LIVE_VERIFIED`
- The DR scorecard still cannot be declared complete; credential escrow evidence is still missing `5` items.
## 2026-06-13 — Credential escrow owner evidence request package
**Live read-only evidence (13:10 Asia/Taipei)**
- `/backup/scripts/mark-credential-escrow-verified.sh --status`: still missing `restic_repository_password`, `offsite_provider_credentials`, `break_glass_admin_credentials`, `dns_registrar_recovery`, `oauth_ai_provider_recovery`.
- `/backup/scripts/offsite-escrow-evidence-report.sh --no-color`: `SCRIPT_MISSING_COUNT=0`, `OFFSITE_CONFIGURED=1`, `RCLONE_CONFIGURED=1`, `READINESS_REQUIRE_CONFIGURED_BLOCKED=0`, `ESCROW_MISSING_COUNT=5`, `SUMMARY PASS=8 WARN=5 BLOCKED=0`.
**Documents / snapshot**
- Added `docs/security/CREDENTIAL-ESCROW-EVIDENCE-OWNER-REQUEST.md`.
- Added `docs/security/credential-escrow-evidence-owner-request.snapshot.json`.
- Updated `docs/runbooks/BACKUP-STATUS.md` and the reboot workplan, narrowing P1 credential escrow from "are the scripts/config usable" to "waiting for the owner to provide real non-secret evidence-ids".
**Current progress**
- Credential escrow owner request package: `80%`
- Owner external verification: `0%`
- Dry-run validation: `0%`
- Marker write: `0%`
- DR closeout verification: `0%`
**Boundaries**
- This round did not read, collect, paste, or store any secret value, hash, prefix/suffix, or partial token.
- This round did not write any live marker; `BackupCredentialEscrowEvidenceMissing` must keep firing until all five markers are filled with real non-secret evidence-ids.
- Service / cold-start remains `GREEN`; the DR scorecard is still `BLOCKED`.


@@ -7,6 +7,7 @@
> 2026-06-12 Codex post-120 recovery refresh: 120 restored, backup aggregate / offsite / full cold-start green; DR still blocked only by credential escrow evidence.
> 2026-06-13 Codex live refresh: backup core remains green; DR still blocked only by credential escrow evidence.
> 2026-06-13 Codex post-CD refresh: backup/offsite/alert contracts remain green after deploy marker `e4a349bc`; global SSH trust guardrail held; DR still blocked only by credential escrow evidence.
> 2026-06-13 Codex escrow refresh: 13:10 live report confirms offsite/rclone/script readiness is green and only five non-secret credential escrow evidence markers remain missing.
---
@@ -50,6 +51,13 @@ Current policy: normal success should not create immediate Telegram noise. Failu
## Credential Escrow Evidence Checklist
2026-06-13 13:10 live refresh:
- `/backup/scripts/mark-credential-escrow-verified.sh --status`: still missing `restic_repository_password`, `offsite_provider_credentials`, `break_glass_admin_credentials`, `dns_registrar_recovery`, `oauth_ai_provider_recovery`.
- `/backup/scripts/offsite-escrow-evidence-report.sh --no-color`: `SCRIPT_MISSING_COUNT=0`, `OFFSITE_CONFIGURED=1`, `RCLONE_CONFIGURED=1`, `READINESS_REQUIRE_CONFIGURED_BLOCKED=0`, `ESCROW_MISSING_COUNT=5`, `SUMMARY PASS=8 WARN=5 BLOCKED=0`.
- Owner request package: [CREDENTIAL-ESCROW-EVIDENCE-OWNER-REQUEST.md](../security/CREDENTIAL-ESCROW-EVIDENCE-OWNER-REQUEST.md).
- Verdict: backup core and offsite readiness are green; DR closeout remains blocked until all five markers are written with real non-secret evidence-ids.
A credential escrow marker only proves that "the recovery material has been manually verified and is retrievable"; it must not contain any secret.
| Item | Acceptable evidence-id | Forbidden |


@@ -0,0 +1,140 @@
# Credential Escrow Evidence Owner Request
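> Status time: 2026-06-13 13:10 Asia/Taipei
> Scope: filling the DR credential escrow evidence markers
> Principle: accept only non-secret evidence-ids; collecting, posting, committing, or storing any password, token, key, recovery code, secret value, hash, prefix/suffix, or partial token is prohibited.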
---
## 1. Current verdict
| Item | Status | Completion | Notes |
|------|------|-------:|------|
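| Backup core | GREEN | 100% | 110 / 188 freshness, Google Drive rclone, latest-only verifier, and full cold-start are all currently green. |
| Escrow script / offsite readiness | READY | 100% | `SCRIPT_MISSING_COUNT=0`, `OFFSITE_CONFIGURED=1`, `RCLONE_CONFIGURED=1`, `READINESS_REQUIRE_CONFIGURED_BLOCKED=0`. |
| Credential escrow owner request package | READY_TO_DISPATCH | 80% | This document defines the non-secret evidence-ids the owner must provide, the forbidden content, and the dry-run and acceptance flow. |
| Credential escrow marker | BLOCKED_WAITING_OWNER_EVIDENCE | 0% | The five markers are still unwritten; they must not be filled with placeholders or secrets. |
| DR scorecard closeout | BLOCKED | 90% | Service availability is green; DR completion still waits on `ESCROW_MISSING_COUNT=0`. |
This is not a service outage. It is a DR recovery governance gate: it must be proven that the designated owner can retrieve the critical recovery credentials during a disaster, but the proof itself must not leak the credentials.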
---
## 2. Live Evidence
Read-only check on 110 at 2026-06-13 13:10:
```text
missing: restic_repository_password
missing: offsite_provider_credentials
missing: break_glass_admin_credentials
missing: dns_registrar_recovery
missing: oauth_ai_provider_recovery
SCRIPT_MISSING_COUNT=0
OFFSITE_CONFIGURED=1
RCLONE_CONFIGURED=1
B2_CONFIGURED=0
READINESS_REQUIRE_CONFIGURED_BLOCKED=0
ESCROW_MISSING_COUNT=5
PARTIAL_MARKER_PRESENT=1
FULL_MARKER_PRESENT=1
SUMMARY PASS=8 WARN=5 BLOCKED=0
```
`B2_CONFIGURED=0` means legacy B2 is not configured; the Google Drive / rclone provider is currently configured, so this is not a blocker for this round.
---
## 3. Owner Request Matrix
| Item | Recovery capability the owner must confirm | Acceptable non-secret evidence-id | Forbidden content |
|------|--------------------------|-----------------------------|----------|
| `restic_repository_password` | Can retrieve the restic repository password during a disaster. | Password manager item ID, sealed envelope ID, recovery checklist ID. | Restic password, recovery code, secret URL, passwords visible in screenshots. |
| `offsite_provider_credentials` | Can retrieve Google Drive / rclone or offsite provider credentials during a disaster. | Vault item ID, provider credential record ID, offsite access checklist ID. | OAuth token, refresh token, application key, client secret, cookie. |
| `break_glass_admin_credentials` | Can obtain break-glass admin login or an alternative recovery path during a disaster. | Break-glass credential record ID, sealed envelope ID, emergency access checklist ID. | Admin password, SSH private key, OTP seed, recovery code. |
| `dns_registrar_recovery` | Can restore DNS registrar / domain control during a disaster. | Registrar recovery checklist ID, vault item ID, domain recovery record ID. | Registrar password, recovery code, unredacted registrar session. |
| `oauth_ai_provider_recovery` | Can restore administrative control of the AI provider / OAuth provider during a disaster. | Provider recovery checklist ID, vault item ID, provider account recovery record ID. | API key, token, client secret, OAuth refresh token. |
An evidence-id must be an "auditable record identifier in an external system", such as a password manager item ID, a sealed envelope number, or an internal recovery checklist number. It must not be the credential itself, nor be sufficient to derive the credential.
---
## 4. Data that must not be submitted
The following must not appear in the repo, chat, issues, PRs, LOGBOOK, snapshots, pasted terminal output, or marker notes:
- Passwords, tokens, API keys, private keys, SSH keys, cookies, sessions.
- OAuth client secrets, refresh tokens, authorization headers.
- OTP seeds, recovery codes, backup codes.
- PostgreSQL / Redis / Sentry / provider connection URLs that embed credentials.
- Secret hashes, prefixes, suffixes, partial tokens, reversibly masked values.
- Unredacted screenshots, unredacted password manager screens.
- Placeholders, such as `EVIDENCE_ID_FOR_*`, `VAULT-ITEM-ID`, `TODO`, `TBD`.
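The placeholder rule in the last bullet can be pre-checked locally before anything is passed to the marker CLI. The following is a minimal sketch only: `looks_like_placeholder` is a hypothetical helper, not part of `mark-credential-escrow-verified.sh`, and it does not reproduce whatever validation `--dry-run` performs. It catches the placeholder strings listed above and unreplaced `<...>` templates; it cannot tell whether a value is a secret.

```bash
#!/usr/bin/env bash
# Hypothetical local pre-check (an assumption for illustration, not the real
# validation inside mark-credential-escrow-verified.sh).

# Returns 0 when the candidate is empty, a listed placeholder, or an
# unreplaced template such as <NON_SECRET_EVIDENCE_ID>; returns 1 otherwise.
looks_like_placeholder() {
  local id="$1"
  case "$id" in
    ""|TODO|TBD|VAULT-ITEM-ID|EVIDENCE_ID_FOR_*) return 0 ;;
    *"<"*|*">"*) return 0 ;;
  esac
  return 1
}
```

A caller would stop and return to the owner whenever the function succeeds, for example `looks_like_placeholder "$candidate" && echo "rejected: placeholder"`.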
---
## 5. Safe Execution Flow
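Run the commands below only after the owner has confirmed, outside the repo / chat, that the recovery material exists. `<NON_SECRET_EVIDENCE_ID>` must be replaced with a real but non-secret external record identifier.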
```bash
# 1. Read the current marker status; this step exposes no secrets.
/backup/scripts/mark-credential-escrow-verified.sh --status
# 2. Dry-run first to validate the evidence-id format and the item name; this step writes no marker.
/backup/scripts/mark-credential-escrow-verified.sh \
  --item "<item>" \
  --evidence-id "<NON_SECRET_EVIDENCE_ID>" \
  --dry-run
# 3. Write the marker only after the dry-run is OK and the owner has explicitly approved.
/backup/scripts/mark-credential-escrow-verified.sh \
  --item "<item>" \
  --evidence-id "<NON_SECRET_EVIDENCE_ID>" \
  --note "<SHORT_NON_SECRET_NOTE>"
# 4. After writing, regenerate the escrow / backup / cold-start evidence.
/backup/scripts/offsite-escrow-evidence-report.sh --no-color
/backup/scripts/backup-status.sh --no-notify --no-refresh
/home/wooo/scripts/full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1
```
If the dry-run rejects a placeholder, a too-short value, a suspected secret, or an invalid item, stop and go back to the owner for a new non-secret evidence-id.
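Because step 2 has to be repeated for each of the five items, it can help to lay out the dry-run commands before running any of them. The sketch below only prints the commands and never executes the marker script; `ESCROW_ITEMS` and `print_dry_run_plan` are names assumed here for illustration, and the evidence-id stays as the unreplaced template on purpose.

```bash
#!/usr/bin/env bash
# Sketch only: prints one dry-run command per item instead of running it,
# so nothing here touches the live markers.

# The five item names reported as missing by --status.
ESCROW_ITEMS="restic_repository_password offsite_provider_credentials break_glass_admin_credentials dns_registrar_recovery oauth_ai_provider_recovery"

print_dry_run_plan() {
  local item
  for item in $ESCROW_ITEMS; do
    printf '%s --item %s --evidence-id "<NON_SECRET_EVIDENCE_ID>" --dry-run\n' \
      /backup/scripts/mark-credential-escrow-verified.sh "$item"
  done
}

print_dry_run_plan
```

Each printed line still needs its own owner-supplied evidence-id substituted by hand before it is run.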
---
## 6. Acceptance criteria
| Gate | Must observe |
|------|----------|
| Escrow marker | None of the five items shows missing any longer. |
| Escrow report | `ESCROW_MISSING_COUNT=0`. |
| Prometheus textfile | `awoooi_backup_dr_credential_escrow_missing_count 0`. |
| Backup status | `escrow_missing=0`, with `core_blockers=0` unchanged. |
| Alertmanager | `BackupCredentialEscrowEvidenceMissing` is no longer firing. |
| Cold-start | `WARN=0 BLOCKED=0` stays green. |
Only when all of the above hold may the DR scorecard be changed from `BLOCKED` to `COMPLETE`.
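The Escrow report gate can be checked mechanically against saved report output. This is a minimal sketch that assumes the `KEY=VALUE` line format shown in the Live Evidence section; `escrow_report_gate` is a hypothetical helper, and it covers only this one gate, not the Prometheus, backup-status, Alertmanager, or cold-start gates.

```bash
#!/usr/bin/env bash
# Reads saved `offsite-escrow-evidence-report.sh --no-color` output on stdin.
# Assumption: the report contains a line of the form ESCROW_MISSING_COUNT=<n>.
escrow_report_gate() {
  local count
  count="$(sed -n 's/^ESCROW_MISSING_COUNT=\([0-9][0-9]*\)$/\1/p' | head -n 1)"
  if [ -z "$count" ]; then
    echo "UNKNOWN"          # the line was absent, so the gate cannot be judged
    return 2
  fi
  if [ "$count" -eq 0 ]; then
    echo "PASS"
    return 0
  fi
  echo "BLOCKED missing=$count"
  return 1
}
```

Fed the 13:10 report from this document, which contains `ESCROW_MISSING_COUNT=5`, the function prints `BLOCKED missing=5` and returns a non-zero status.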
---
## 7. Work progress percentages
| Lane | Current completion | Next step |
|------|-----------:|--------|
| Owner request package | 80% | Designate the owner role / team; deliver this document and the five-item list. |
| Owner external verification | 0% | Owner completes verification externally in the password manager, sealed envelope, and registrar/provider accounts. |
| Dry-run validation | 0% | All five items pass `--dry-run` with non-secret evidence-ids. |
| Marker write | 0% | All five markers are written successfully. |
| DR closeout verification | 0% | Escrow report, backup status, Alertmanager, and cold-start are all re-run and green. |
---
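## 8. What must not be claimed at this time
- Do not claim the DR scorecard is complete.
- Do not claim credential escrow has been filled.
- Do not equate backup / offsite / cold-start green with credential escrow green.
- Do not fill markers with placeholders, test IDs, or secret values.
- Do not silence `BackupCredentialEscrowEvidenceMissing`; it is currently a correct red signal.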


@@ -0,0 +1,115 @@
{
"schema_version": 1,
"generated_at": "2026-06-13T13:10:53+08:00",
"timezone": "Asia/Taipei",
"scope": "credential_escrow_evidence_owner_request",
"source_evidence": {
"host": "192.168.0.110",
"commands": [
"/backup/scripts/mark-credential-escrow-verified.sh --status",
"/backup/scripts/offsite-escrow-evidence-report.sh --no-color"
],
"script_missing_count": 0,
"offsite_configured": 1,
"rclone_configured": 1,
"b2_configured": 0,
"readiness_require_configured_blocked": 0,
"partial_marker_present": 1,
"full_marker_present": 1,
"escrow_missing_count": 5,
"summary": {
"pass": 8,
"warn": 5,
"blocked": 0
}
},
"missing_items": [
{
"item": "restic_repository_password",
"allowed_evidence_id_types": [
"password_manager_item_id",
"sealed_envelope_id",
"recovery_checklist_id"
],
"status": "missing"
},
{
"item": "offsite_provider_credentials",
"allowed_evidence_id_types": [
"vault_item_id",
"provider_credential_record_id",
"offsite_access_checklist_id"
],
"status": "missing"
},
{
"item": "break_glass_admin_credentials",
"allowed_evidence_id_types": [
"break_glass_credential_record_id",
"sealed_envelope_id",
"emergency_access_checklist_id"
],
"status": "missing"
},
{
"item": "dns_registrar_recovery",
"allowed_evidence_id_types": [
"registrar_recovery_checklist_id",
"vault_item_id",
"domain_recovery_record_id"
],
"status": "missing"
},
{
"item": "oauth_ai_provider_recovery",
"allowed_evidence_id_types": [
"provider_recovery_checklist_id",
"vault_item_id",
"provider_account_recovery_record_id"
],
"status": "missing"
}
],
"forbidden_values": [
"password",
"token",
"api_key",
"private_key",
"ssh_key",
"cookie",
"session",
"authorization_header",
"oauth_client_secret",
"refresh_token",
"otp_seed",
"recovery_code",
"backup_code",
"database_url_with_credentials",
"secret_hash",
"secret_prefix",
"secret_suffix",
"partial_token",
"unredacted_screenshot",
"placeholder"
],
"progress": {
"owner_request_package_percent": 80,
"owner_external_verification_percent": 0,
"dry_run_validation_percent": 0,
"marker_write_percent": 0,
"dr_closeout_verification_percent": 0
},
"gates": {
"runtime_execution_authorized": false,
"secret_value_collection_authorized": false,
"marker_write_completed": false,
"dr_scorecard_complete": false
},
"done_criteria": [
"ESCROW_MISSING_COUNT=0",
"awoooi_backup_dr_credential_escrow_missing_count=0",
"backup-status escrow_missing=0",
"BackupCredentialEscrowEvidenceMissing not firing",
"cold-start WARN=0 BLOCKED=0"
]
}


@@ -11,9 +11,9 @@
| Area | Status | Completion | Evidence |
|------|--------|------------|----------|
| Overall recovery readiness | SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED | 95% | 2026-06-13 01:26 final cold-start scorecard is `PASS=83 WARN=0 BLOCKED=0`; 120/121 K3s are both `Ready control-plane`, backup core blockers remain `0`, public routes/TLS/momo DB/schedules/Alertmanager are green, API/Web remain spread across 120 / 121 after deploy marker `e4a349bc`, and CD no longer clobbers global `known_hosts`. Remaining blocker is DR-only credential escrow evidence (`escrow_missing=5`); ArgoCD `km-vectorize` is tracked separately as governance health debt until its official scheduled Job refreshes `lastSuccessfulTime`. |
| Overall recovery readiness | SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED | 95% | 2026-06-13 12:59 final cold-start scorecard is `PASS=83 WARN=0 BLOCKED=0`; 120/121 K3s are both `Ready control-plane`, backup core blockers remain `0`, public routes/TLS/momo DB/schedules/Alertmanager are green, API/Web are live-verified split across 120 / 121 after topology strategy hardening, and CD no longer clobbers global `known_hosts`. 13:10 escrow report shows offsite/rclone/script readiness green, but DR remains blocked by five missing credential escrow evidence markers; ArgoCD `km-vectorize` is tracked separately as governance health debt until its official scheduled Job refreshes `lastSuccessfulTime`. |
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; host is reachable, root is mounted `rw`, failed units `0`, `mon` and `mon1` are both `Ready control-plane`, and cold-start P0/P1 checks are green. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 90% | 2026-06-13 01:26 `backup-status` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`; 01:28 offsite textfile has `remote_verify_ok=1`, `full_verify_fresh=1`, and all 13 repos `snapshot_count=1`; 01:27 Alertmanager exposes the five expected escrow gap alerts and Prometheus has all five required alert rule names healthy. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 92% | 2026-06-13 12:43 `backup-status` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`; 13:10 escrow report shows `SCRIPT_MISSING_COUNT=0`, `OFFSITE_CONFIGURED=1`, `RCLONE_CONFIGURED=1`, `ESCROW_MISSING_COUNT=5`, `PASS=8 WARN=5 BLOCKED=0`. Owner request package is now ready; actual marker write remains blocked on real non-secret evidence IDs. |
| P2 service / data truth | VERIFIED_WORKLOAD_BALANCED | 100% | 2026-06-13 01:26 cold-start is green; public routes/TLS are green, VIP API/Web are reachable, momo current-month parity is `4571/4571` with matching date bounds, schedules/services are green. API/Web both keep 120 / 121 split placement after latest deploy marker `e4a349bc`. |
| P3 docs / automation contracts | DONE_WITH_VALIDATION_GAP | 100% | Workplan, SOP v1.8, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, T+0/T+60 timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, and `km-vectorize` remediation tracking are updated; Ansible syntax check is unavailable on this workstation. |
@@ -121,7 +121,7 @@ Next: <single next action>
| P1-002 | VERIFIED | 100 | Confirm success-noise policy | Daily status is once at 06:05; normal backup success is not a Telegram spam path. | Keep failure-only escalation in backup docs. | Docs say failures escalate; daily status is summary only. |
| P1-003 | VERIFIED | 100 | Confirm Google Drive latest-only | 2026-06-12 18:55 verifier shows 13 repos with exactly one remote snapshot each after the post-120 aggregate backup and full offsite sync. | Record evidence in backup status. | `REMOTE_LATEST_ONLY_OK=1`, `VERIFY_OK=1`. |
| P1-004 | VERIFIED | 100 | Confirm required alerts exist | Live Prometheus rules include all five required backup/cold-start alerts. | Keep in scorecard. | All five alert names FOUND live. |
| P1-005 | BLOCKED | 5 | Fill credential escrow evidence markers | Five markers are missing. This is a DR scorecard blocker, not a service outage. Scripts/config are present and the marker CLI supports `--dry-run`; secrets must not enter repo or chat. | Human verifies vault/offline escrow, validates each non-secret evidence ID with `--dry-run`, then writes markers using `/backup/scripts/mark-credential-escrow-verified.sh`. | `awoooi_backup_dr_credential_escrow_missing_count=0`. |
| P1-005 | BLOCKED_WAITING_OWNER_EVIDENCE | 20 | Fill credential escrow evidence markers | Five markers are missing. This is a DR scorecard blocker, not a service outage. 2026-06-13 13:10 proves scripts/offsite/rclone readiness is green; the remaining blocker is owner-provided real non-secret evidence IDs. Owner request package exists at `docs/security/CREDENTIAL-ESCROW-EVIDENCE-OWNER-REQUEST.md`; secrets must not enter repo or chat. | Human verifies vault/offline escrow, validates each non-secret evidence ID with `--dry-run`, then writes markers using `/backup/scripts/mark-credential-escrow-verified.sh`. | `awoooi_backup_dr_credential_escrow_missing_count=0`. |
| P1-006 | DONE | 100 | Fix backup health failed component | 2026-06-12 18:55 backup-status shows `failed=0`, `core_blockers=0`, `config_failed=0`; 120 config capture is no longer red. | Keep normal daily backup cadence. | `failed_count=0`, `config_failed=0`. |
| P1-007 | DONE | 100 | Refresh stale backup jobs | 2026-06-04 cleared `stale188=momo_pg_daily`; 2026-06-05 cleared recurring `stale110=awoooi_db`; 2026-06-06 confirms no stale jobs after the next aggregate window. | Keep normal cron cadence; only 120-driven Configs remains red. | `stale110=none`, `stale188=none`, 110 `13/13 fresh`, 188 `2/2 fresh`. |
| P1-008 | DONE | 100 | Align 188 momo backup cron/exporter contract | 188 backup exporter expected `/home/ollama/bin/momo-pg-backup.sh`; crontab still pointed to the old app-side script. Crontab was backed up and updated to the host-owned controller script. | Keep backup controller path in future deploy docs. | `configured_missing_188=0`, `awoooi_backup_job_configured{host="188",job="momo_pg_daily"} 1`. |
@@ -129,6 +129,7 @@ Next: <single next action>
| P1-010 | DONE | 100 | Offsite sync manual backup repairs | 2026-06-12 17:37 full offsite sync completed `13/13` after controlled P0 runway override to 240m; 18:55 verifier confirmed 13 remote repos each have one snapshot. | Allow normal 03:00 full sync cadence unless another manual backup creates new snapshots. | `REMOTE_LATEST_ONLY_OK=1`, `VERIFY_OK=1`, full sync `13/13`. |
| P1-011 | DONE | 100 | Confirm 2026-06-12 backup convergence | 18:55 live check confirms the post-120 aggregate held: no stale jobs, no configured/missing script jobs, no failed components, offsite fresh, and only credential escrow remains as DR warning. | Keep escrow as explicit red gate. | `stale110=none`, `stale188=none`, `failed=0`, `config_failed=0`, `core_blockers=0`. |
| P1-012 | DONE | 100 | Audit credential escrow marker write safety | 2026-06-12 15:02 `mark-credential-escrow-verified.sh --status` reports all five allowed items missing; `offsite-escrow-evidence-report.sh --no-color` reports rclone/offsite configured and `ESCROW_MISSING_COUNT=5`; repo search found only runbooks/placeholders/rules, not real evidence IDs. | Write markers only after a real non-secret evidence ID exists for each item; never write placeholder or secret. | The marker blocker is narrowed to missing external evidence IDs, not missing script/config/offsite readiness. |
| P1-014 | DONE | 100 | Publish credential escrow owner request package | 2026-06-13 13:10 live report confirms `SCRIPT_MISSING_COUNT=0`, `OFFSITE_CONFIGURED=1`, `RCLONE_CONFIGURED=1`, `ESCROW_MISSING_COUNT=5`, `PASS=8 WARN=5 BLOCKED=0`. New owner request package defines allowed evidence-id types, forbidden secret values, safe dry-run flow, write flow, and closeout gates. | Dispatch to the credential owners without collecting secret values; keep marker write gated until owner gives real non-secret evidence IDs. | `docs/security/CREDENTIAL-ESCROW-EVIDENCE-OWNER-REQUEST.md` and snapshot exist and validate. |
| P1-013 | IN_PROGRESS | 90 | Remediate `km-vectorize` CronJob health debt | ArgoCD Degraded is isolated to `CronJob/km-vectorize`: `lastSuccessfulTime` is stale even though retained 6/2-6/4 Jobs completed, and the manifest schedule was semantically wrong (`0 19` with `timeZone: Asia/Taipei` ran at 19:00 Taipei time, not 03:00). Manual Job evidence is invalid because the controller deleted `km-vectorize-codex-002709` as `UnexpectedJob`. Gitea main `47ee96b0` is synced live and the CronJob spec is corrected. | Verify the next 03:00 official CronJob updates `lastSuccessfulTime` and ArgoCD returns `Healthy`. | `lastSuccessfulTime` is after the manifest sync, the official scheduled Job is `Complete`, and ArgoCD `awoooi-prod` health is `Healthy`. |
---