From 6a3f6caedbaed427ef31c8a17f52d6b66d3565e1 Mon Sep 17 00:00:00 2001
From: Your Name
Date: Sat, 13 Jun 2026 01:35:39 +0800
Subject: [PATCH] docs(ops): refresh reboot SOP evidence [skip ci]

---
 docs/LOGBOOK.md                               | 21 ++++++
 docs/runbooks/BACKUP-STATUS.md                | 30 ++++++---
 docs/runbooks/FULL-STACK-COLD-START-SOP.md    | 67 +++++++++++++++----
 ...oot-cold-start-backup-recovery-workplan.md | 26 +++++--
 4 files changed, 112 insertions(+), 32 deletions(-)

diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 466bbdc7..1a2a4691 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -1,3 +1,24 @@
+## 2026-06-13|Post-CD cold-start green 與 SSH trust guardrail closeout
+
+**01:26 / 01:29 live refresh:**
+- Gitea main 已到 deploy marker `e4a349bc chore(cd): deploy 414413a [skip ci]`;ArgoCD `awoooi-prod` revision `e4a349bc`、sync `Synced`,health 仍 `Degraded`,原因仍是 `km-vectorize` 官方排程成功回寫尚未更新,不是 service outage。
+- K3s live image readback:`awoooi-api`、`awoooi-web`、`awoooi-worker` 均使用 `414413a59268eedd391648f112e228716dd05362`;API pods split `mon1` / `mon`,Web pods split `mon` / `mon1`,Worker single replica 在 `mon`,證明 topology spread 沒被最新 CD 打回集中配置。
+- 110 global `/home/wooo/.ssh/known_hosts` 在 CD 後仍保留 120 / 188 entries,mtime `2026-06-13 01:20:02 +0800`;deploy-specific `/home/wooo/.ssh/deploy_known_hosts` mtime `01:24:05`。這驗證 `80e6ec1a fix(ci): avoid clobbering runner known hosts` 已阻止 CD 再覆蓋 cold-start 使用的全域 SSH trust。
+- 01:26 `full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1` 回 `PASS=83 WARN=0 BLOCKED=0`;result `GREEN`。110 / 120 / 121 / 188 ping + SSH OK,public routes/TLS、momo DB parity、CronJobs、backup health、textfile exporters 全部綠。
+- 01:26 `/backup/scripts/backup-status.sh --no-notify`:110 `13/13 fresh failed=0`、188 `2/2 fresh failed=0`、`core_blockers=0`、`escrow_missing=5`、last aggregate `2026-06-12 15:54:40`。
+- 01:28 offsite textfile:`remote_verify_ok=1`、`full_verify_fresh=1`,13 個 repo 全部 `snapshot_count=1` / `snapshot_latest_only=1`。最新排程 verifier log 為 2026-06-12 07:20,仍在 48h freshness 內。
+- 01:27 backup alert live visibility:Prometheus 與 Alertmanager 都看得到 5 個 credential escrow gap active/firing alerts;Prometheus rules API 確認 `BackupConfigCapturePartial`、`BackupAggregateRunFailed`、`BackupCredentialEscrowEvidenceMissing`、`ColdStartRecoveryBlocked`、`ColdStartHost120Unreachable` 全部 health `ok`;label contract check 載入 24 條 baseline backup alert rules。這些 escrow 紅燈是正確阻擋,不可消音。
+
+**文件 / SOP 更新:**
+- `docs/runbooks/FULL-STACK-COLD-START-SOP.md` 升到 v1.8,新增 CD / SSH trust guardrail、post-CD 比較錨點與 done criteria。
+- `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md` 更新 01:29 完成度:Overall `95%`、P1 `90%`、P2 `100%`、P3 `100%`;新增 `P3-012` 防止 CD clobber `known_hosts`。
+- `docs/runbooks/BACKUP-STATUS.md` 更新 01:26 / 01:28 備份、offsite、Alertmanager 證據。
+
+**剩餘不可宣稱完成:**
+- 不宣稱 `DR complete`:credential escrow evidence markers 仍缺 `5`。
+- 不宣稱 ArgoCD fully healthy:`km-vectorize` corrected schedule 已 live,但要等 03:00 Asia/Taipei 官方 CronJob 更新 `lastSuccessfulTime`。
+- 不把 P0-PUBLICENV local source patch 視為 production remediation complete;production public bundle leak audit 仍需等前端 worktree / build / deploy / rescan 收斂。
+
 ## 2026-06-13|P2-104 matched PlayBook 學習缺口

 **背景**:P2-103 已把任務結果接回 KM 草稿、LOGBOOK 證據、audit trail、timeline 與人工交接契約;原先下一步被描述為修復 `matched_playbook_id` 學習缺口。正式 DB 只讀回查後修正判斷:近 24h approval 的 `matched_playbook_id` 已 `66/66`,真正缺口已移到 approved 後沒有形成 execution learning 與 PlayBook trust update。
diff --git a/docs/runbooks/BACKUP-STATUS.md b/docs/runbooks/BACKUP-STATUS.md
index 1f81be0c..98362069 100644
--- a/docs/runbooks/BACKUP-STATUS.md
+++ b/docs/runbooks/BACKUP-STATUS.md
@@ -6,38 +6,44 @@
 > 2026-06-12 Codex post-reboot refresh: 110 recovered, offsite latest-only verified, and remaining red gates narrowed to 120 config capture plus credential escrow evidence.
 > 2026-06-12 Codex post-120 recovery refresh: 120 restored, backup aggregate / offsite / full cold-start green; DR still blocked only by credential escrow evidence.
 > 2026-06-13 Codex live refresh: backup core remains green; DR still blocked only by credential escrow evidence.
+> 2026-06-13 Codex post-CD refresh: backup/offsite/alert contracts remain green after deploy marker `e4a349bc`; global SSH trust guardrail held; DR still blocked only by credential escrow evidence.

 ---

-## 2026-06-12 Post-Reboot Live Status
+## 2026-06-13 Post-CD Live Status

-2026-06-13 00:13 refresh:
+2026-06-13 01:26 / 01:28 refresh:

-- `backup-status.sh --no-notify`:110 `13/13 fresh failed=0`、188 `2/2 fresh failed=0`、`integrity_stale=0`、`offsite_fresh=1`、`rclone_gdrive_fresh=1`、`core_blockers=0`、`escrow_missing=5`。
+- `/backup/scripts/backup-status.sh --no-notify`:110 `13/13 fresh failed=0`、188 `2/2 fresh failed=0`、`integrity_stale=0`、`offsite_fresh=1`、`rclone_gdrive_fresh=1`、`core_blockers=0`、`escrow_missing=5`、last aggregate `2026-06-12 15:54:40`。
+- `/home/wooo/node_exporter_textfiles/offsite_full_sync_verify.prom`:`awoooi_backup_offsite_remote_verify_ok=1`、`awoooi_backup_offsite_full_verify_fresh=1`,13 個 repo 都是 `snapshot_count=1` 且 `snapshot_latest_only=1`。
+- `backup-alert-live-visibility-check.py`:Prometheus 與 Alertmanager 皆可見 5 個 escrow gap active/firing alerts。
+- Prometheus rules API:`BackupConfigCapturePartial`、`BackupAggregateRunFailed`、`BackupCredentialEscrowEvidenceMissing`、`ColdStartRecoveryBlocked`、`ColdStartHost120Unreachable` 全部存在且 health `ok`;目前只有 escrow gap rule 正確 firing,其餘 inactive。
+- `backup-alert-label-contract-check.py`:本地 `ops/monitoring/alerts-unified.yml` 與 live Prometheus label contract 對齊,24 條 baseline backup alert rules 已載入。
 - 這代表備份核心仍綠;剩餘紅燈仍是 DR credential escrow evidence,不是備份腳本或 offsite sync 失敗。

 | Gate | Status | Evidence |
 |------|--------|----------|
-| 110 backup cron | VERIFIED | `02:00 backup-all`, `03:00 sync-offsite-backups --mode sync`, `06:05 backup-status`, `07:20 verify-offsite-full-sync`. |
-| Backup freshness | VERIFIED | 2026-06-12 18:55 status shows `stale110=none`, `stale188=none`, 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`; last aggregate backup completed `2026-06-12 15:54:40`. |
+| 110 backup cron | VERIFIED | Live crontab still has `02:00 backup-all`, `03:00 sync-offsite-backups --mode sync`, `06:05 backup-status`, `07:20 verify-offsite-full-sync`; success is summarized once daily and not sent as noisy Telegram heartbeat. |
+| Backup freshness | VERIFIED | 2026-06-13 01:26 status shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`; last aggregate backup completed `2026-06-12 15:54:40`. |
 | 188 momo backup cron/exporter contract | VERIFIED | 188 crontab now runs `/home/ollama/bin/momo-pg-backup.sh`; exporter reports `awoooi_backup_job_configured{host="188",job="momo_pg_daily"} 1`, so `configured_missing_188=0`. |
-| Google Drive/rclone remote latest-only | VERIFIED | 2026-06-12 18:55 verifier: all 13 remote repos have `snapshots=1`, `REMOTE_LATEST_ONLY_OK=1`, `FULL_MARKER_FRESH=1`, `VERIFY_OK=1`, `FAILED=0`. |
+| Google Drive/rclone remote latest-only | VERIFIED | 2026-06-13 01:28 textfile confirms all 13 remote repos have `snapshot_count=1` and `snapshot_latest_only=1`; latest scheduled verifier log at 2026-06-12 07:20 returned `REMOTE_LATEST_ONLY_OK=1`, `FULL_MARKER_FRESH=1`, `VERIFY_OK=1`, `FAILED=0`. |
 | Offsite gate marker | VERIFIED | `/backup/offsite/enable-rclone-sync` present; full marker fresh and verifier wrote `/home/wooo/node_exporter_textfiles/offsite_full_sync_verify.prom`. |
-| Backup alert rules | VERIFIED | 2026-06-12 14:49 live check confirms Prometheus and Alertmanager expose `BackupConfigCapturePartial`, `BackupAggregateRunFailed`, `BackupCredentialEscrowEvidenceMissing`, `ColdStartRecoveryBlocked`, `ColdStartHost120Unreachable`. |
+| Backup alert rules | VERIFIED | 2026-06-13 01:27 live check confirms Prometheus and Alertmanager expose the active `BackupCredentialEscrowEvidenceMissing` gap alerts for the five missing items; Prometheus rules API has all five required alert names healthy; label contract check confirms all baseline backup alert rules are loaded. |
 | Backup aggregate health | VERIFIED | 2026-06-12 15:54 `/backup/scripts/backup-all.sh` completed `13/13` successfully; Configs captured 120 / 121 / K8s workloads / K8s secrets / Velero from source `192.168.0.120`; 18:55 `core_blockers=0`. |
 | Credential escrow | BLOCKED | Five evidence markers missing. Only write non-secret marker evidence with `/backup/scripts/mark-credential-escrow-verified.sh`. |
 | Config backup capture | VERIFIED | 2026-06-12 15:54 Configs backup succeeded for `120-k3s-host-configs`, `121-k3s-host-configs`, `cluster-k8s-workloads`, `cluster-k8s-secrets`, and `cluster-velero-backups`; latest Configs snapshot `bee9ae22`. |
-| Full cold-start | GREEN | 2026-06-12 18:57 read-only rerun: `PASS=83 WARN=0 BLOCKED=0`; result `GREEN`. |
-| 110 -> 188 SSH trust | VERIFIED | 2026-06-12 known_hosts refreshed after current ED25519 fingerprint matched the previously verified `SHA256:d+3QQO4Sh9hQ1DUhPEEJsxqE5D/CYVH+qTgG9TSMNaM`; this cleared the 188 SSH host-key blocker. |
+| Full cold-start | GREEN | 2026-06-13 01:26 read-only rerun: `PASS=83 WARN=0 BLOCKED=0`; result `GREEN`. |
+| 110 -> 120 / 188 SSH trust | VERIFIED | Final trust repair backup `/home/wooo/.ssh/known_hosts.before-120-188-final-refresh.20260613-011949`; CD fix `80e6ec1a` uses `/home/wooo/.ssh/deploy_known_hosts`; post-deploy marker `e4a349bc` did not clobber global `known_hosts`, and 120 / 188 entries remain present. |
 | 120 console handoff | CLOSED | 120 root filesystem was repaired from console/initramfs with offline fsck, booted at `2026-06-12 15:13`, SSH returned, root mounted `rw`, failed units `0`, and K3s `mon` returned `Ready`. |
 | 2026-06-05 manual backup remediation | VERIFIED with aggregate blocker | 18:40 status: `stale110=none`, `stale188=none`, `configured_missing_188=0`; manual snapshots: AWOOOI `b7d5ee4e`, Gitea `ea641613`, Open-WebUI `d1147507`, ClawBot `73ead3cc`, AI artifacts `b1161ab8`. |
 | 2026-06-06 credential escrow audit | BLOCKED | 15:03 report confirms scripts/config/rclone are present, but all five non-secret evidence markers are still missing; 15:06 safe dry-run checklist is documented below. |

 Current policy: normal success should not create immediate Telegram noise. Failures and operator-action states must still alert; a single daily status summary runs at 06:05.

-2026-06-12 post-reboot closeout:
+2026-06-13 post-CD closeout:
 - 110 / 120 / 121 / 188 public/core service recovery is green.
 - Backup aggregate, Google Drive/rclone offsite latest-only, and full cold-start are green after 120 recovery.
+- Latest normal CD deploy preserved API/Web workload balancing and did not break cold-start SSH trust.
 - Do not close DR scorecard until all five credential escrow evidence markers are written with non-secret evidence IDs.

 ---
@@ -120,7 +126,7 @@ notify_clawbot "failed" "backup-test" "測試告警" 0

 ## 保留策略

-2026-05-19 起,110 本地 restic repo、188 MOMO 檔案備份與 Google Drive/rclone 離機鏡像採 latest-only 策略:成功建立新 snapshot 後只保留最新一份。2026-06-06 14:46 live verifier 已確認 Google Drive/rclone remote 13 個 repo 各 1 份。
+2026-05-19 起,110 本地 restic repo、188 MOMO 檔案備份與 Google Drive/rclone 離機鏡像採 latest-only 策略:成功建立新 snapshot 後只保留最新一份。2026-06-13 01:28 live textfile 已確認 Google Drive/rclone remote 13 個 repo 各 1 份,且 latest-only 指標全為 `1`。

 2026-06-04 manual refresh evidence:
 - 188 `momo-pg-backup.sh` produced `momo_analytics_20260604_154234.sql.gz` and pruned old backups beyond keep-last=1.
@@ -140,6 +146,8 @@ notify_clawbot "failed" "backup-test" "測試告警" 0

 2026-06-12 18:55 update: 120 has returned and the aggregate backup blocker is cleared. `/backup/scripts/backup-all.sh` completed `13/13`, full offsite sync completed `13/13`, full verifier returned `REMOTE_LATEST_ONLY_OK=1` / `VERIFY_OK=1`, and `backup-status.sh --no-notify` reports `core_blockers=0`. The only remaining DR warning is `escrow_missing=5`.

+2026-06-13 01:28 update: post-CD live readback still shows `remote_verify_ok=1`, `full_verify_fresh=1`, and all 13 repos `snapshot_count=1`; backup core remains green after deploy marker `e4a349bc`.
+
 2026-06-05 manual remediation:
 - 16:00 live check still had 120 unreachable, `stale110=awoooi_db`, `backup_all failed=6`, and `escrow_missing=5`.
 - 14:00 AWOOOI high-frequency backup failed, then 16:01 manual rerun completed snapshot `b7d5ee4e`.
diff --git a/docs/runbooks/FULL-STACK-COLD-START-SOP.md b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
index aff9bfd7..554cd064 100644
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -1,6 +1,6 @@
 # AWOOOI 全棧冷啟動與主機重啟 SOP

-> Version: v1.7
+> Version: v1.8
 > Last updated: 2026-06-13 Asia/Taipei
 > Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.

@@ -10,27 +10,27 @@

 本節是每次接手、開機、關機、重啟後的第一個判定錨點。若日期不是今天,必須先重跑 live check,再更新本節與 `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md`。

-| 項目 | 2026-06-13 00:14 Asia/Taipei live result | 判定 |
+| 項目 | 2026-06-13 01:29 Asia/Taipei live result | 判定 |
 |------|-------------------------------------------|------|
-| Overall recovery readiness | `95%` | `SERVICE_GREEN_DR_ESCROW_BLOCKED` |
+| Overall recovery readiness | `95%` | `SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED` |
 | P0 host / K3s recovery | `100%` | `DONE` |
 | P1 backup / alert / escrow | `90%` | `BLOCKED_DR_ESCROW` |
 | P2 service / data truth | `100%` | `VERIFIED_WORKLOAD_BALANCED` |
 | P3 docs / automation contracts | `100%` | `DONE_WITH_VALIDATION_GAP` |
-| 110 host runtime | `systemctl is-system-running=running`, failed units `0`, load `4.45 / 3.28 / 2.89`, Swap `166Mi/7.8GiB` | `GREEN_WITH_LOAD_WATCH` |
+| 110 host runtime | `systemctl is-system-running=running`, failed units `0`, cold-start load `3.55 / 4.54 / 4.51`; Docker / Harbor / Gitea / Prometheus / Alertmanager / Sentry checks pass | `GREEN_WITH_LOAD_WATCH` |
 | 120 reachability | ping OK, SSH OK, boot `2026-06-12 15:13`, root ext4 `rw`, failed units `0` | `GREEN` |
 | 121 reachability | ping OK, SSH OK, failed units `0` | `GREEN` |
 | 188 host runtime | production services green; host degraded only by `certbot.service` and `snap.certbot.renew.service` | `GREEN_WITH_CERTBOT_DEBT` |
-| K3s node state | `mon Ready control-plane`, `mon1 Ready control-plane`, no failed jobs / bad pods; API/Web are split across 120 / 121 | `GREEN_WORKLOAD_BALANCED` |
-| 110 -> 120 / 188 SSH trust | 00:33 cold-start exposed stale `known_hosts`; backup `/home/wooo/.ssh/known_hosts.before-120-188-refresh.20260613-003416`; 120 ED25519 `SHA256:LLhtf/K09AQjRE1Csi0pDCkwvtM9/Ag48RcLgVF4EuY`; 188 ED25519 `SHA256:d+3QQO4Sh9hQ1DUhPEEJsxqE5D/CYVH+qTgG9TSMNaM`; strict SSH checks now pass | `GREEN` |
-| Backup status | 00:13 status: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5` | `GREEN_WITH_DR_ESCROW_WARNING` |
-| Offsite sync / verify | 17:37 full sync `13/13`; 18:55 verifier: 13 repos each `snapshots=1`, `REMOTE_LATEST_ONLY_OK=1`, `FULL_MARKER_FRESH=1`, `VERIFY_OK=1`, `FAILED=0` | `GREEN` |
-| Backup / cold-start alerts | 14:49 Prometheus and Alertmanager expose 5 required backup/cold-start/escrow alerts | `GREEN_WITH_EXPECTED_REDLIGHTS` |
-| Cold-start scorecard | 01:04 final read-only scorecard after `km-vectorize` sync / 188 SSH trust refresh: `PASS=83 WARN=0 BLOCKED=0` | `GREEN` |
+| K3s node state | `mon Ready control-plane`, `mon1 Ready control-plane`, no failed jobs / bad pods; latest CD kept API/Web split across 120 / 121 | `GREEN_WORKLOAD_BALANCED` |
+| 110 -> 120 / 188 SSH trust | 00:33 cold-start exposed stale `known_hosts`; backup `/home/wooo/.ssh/known_hosts.before-120-188-refresh.20260613-003416`; final repair backup `/home/wooo/.ssh/known_hosts.before-120-188-final-refresh.20260613-011949`; CD fix `80e6ec1a` moves deploy trust to `/home/wooo/.ssh/deploy_known_hosts`; 01:28 global `known_hosts` still contains 120 / 188 and was not clobbered by deploy marker `e4a349bc` | `GREEN_WITH_GUARDRAIL` |
+| Backup status | 01:26 status: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-12 15:54:40` | `GREEN_WITH_DR_ESCROW_WARNING` |
+| Offsite sync / verify | 01:28 textfile: `awoooi_backup_offsite_remote_verify_ok=1`, `full_verify_fresh=1`, all 13 repos have `snapshot_count=1` and `snapshot_latest_only=1`; latest scheduled verifier log is 2026-06-12 07:20 | `GREEN` |
+| Backup / cold-start alerts | 01:27 live visibility check confirms Prometheus and Alertmanager expose the 5 required credential escrow gap alerts; Prometheus rules API has all five required alert names healthy; label contract check loads 24 baseline backup alert rules | `GREEN_WITH_EXPECTED_REDLIGHTS` |
+| Cold-start scorecard | 01:26 read-only scorecard after latest CD / trust guardrail: `PASS=83 WARN=0 BLOCKED=0` | `GREEN` |
 | momo DB parity | `4571|4571|2026-06-01|2026-06-07|2026-06-01|2026-06-07` | `GREEN` |
 | momo scheduler | container healthy; scorecard reads `SCHEDULER_RECENT_ACTIVITY 1136`; detector widened and deployed to 110 | `GREEN` |
-| ArgoCD app health | Deployments/PDB healthy; app Degraded because `km-vectorize` last success is stale. Gitea main `47ee96b0` / ArgoCD live now use 03:00 Asia/Taipei and complete Job labels; waiting for official CronJob success readback. | `GOVERNANCE_DEBT_IN_REMEDIATION` |
-| Workload balancing | Gitea main `acaae999` synced by ArgoCD; API/Web each have one pod on 120 and one pod on 121 | `GREEN` |
+| ArgoCD app health | ArgoCD revision `e4a349bc` is `Synced`; app remains `Degraded` while `km-vectorize` `lastSuccessfulTime=2026-06-04T11:00:37Z`; corrected schedule remains `0 3 * * *` with `timeZone=Asia/Taipei` | `GOVERNANCE_DEBT_IN_REMEDIATION` |
+| Workload balancing | Latest live deployments use images from `414413a5`; API/Web each have one pod on 120 and one pod on 121 after deploy marker `e4a349bc` | `GREEN` |
 | Credential escrow | 5 non-secret evidence markers missing | `BLOCKED` |

 Release rule:
@@ -44,12 +44,13 @@ Do not declare DR scorecard complete while credential escrow markers are missing
 2026-06-13 live rule:

 ```text
-110 / 120 / 121 / 188 service recovery is full-stack green after the 00:34 scorecard.
+110 / 120 / 121 / 188 service recovery is full-stack green after the 01:26 scorecard.
 GO for controlled runner/CD release; keep AI auto-remediation governed by normal gates.
 NO-GO for "DR complete" while credential escrow evidence markers are missing.
 Do not fake or silence credential escrow alerts; they are the remaining correct DR red light.
 GO for "AWOOOI core workload balanced"; topology spread is in Gitea main / ArgoCD live and API/Web placement proves max skew <= 1.
 NO-GO for "ArgoCD fully healthy" until `km-vectorize` updates `lastSuccessfulTime` after an official scheduled Job, not a manual `UnexpectedJob`.
+NO-GO for any CD workflow that writes deploy host keys into `/home/wooo/.ssh/known_hosts`; deploy jobs must use an isolated `UserKnownHostsFile=/home/wooo/.ssh/deploy_known_hosts`.
 Current allowed wording: "service / cold-start green; ArgoCD CronJob governance debt in remediation and waiting for 03:00 `km-vectorize` success evidence."
 ```

@@ -229,6 +230,26 @@ Cold start creates noisy metrics and partial failures. During P0/P1, keep automa

 多個工作視窗同時處理事故時,第一優先是避免互相打斷:只要有人在收斂 Docker / Nginx / firewall / K3s 寫操作,其他視窗先只讀觀察,直到明確交接。

+### 2.2 CD / SSH Trust Guardrail
+
+2026-06-13 的冷啟動假紅燈顯示:CD workflow 若用 `ssh-keyscan ... > /home/wooo/.ssh/known_hosts`,會覆蓋 110 使用者層的全域 SSH trust,導致 110 到 120 / 188 的 strict SSH 檢查失敗。這會把實際已恢復的主機誤判成 blocked。
+
+固定規則:
+
+| 項目 | 正確做法 | 禁止 |
+|------|----------|------|
+| Deploy 專用 host key | 寫入 `/home/wooo/.ssh/deploy_known_hosts` | 寫入或覆蓋 `/home/wooo/.ssh/known_hosts` |
+| Deploy SSH options | `-o UserKnownHostsFile=/home/wooo/.ssh/deploy_known_hosts` | 共用 operator / cold-start 的 `known_hosts` |
+| 冷啟動 SSH trust | 保留 120 / 188 的已驗證 fingerprint;修復前先備份 | 無 fingerprint 交叉驗證就 `ssh-keygen -R` 或重建全檔 |
+| 驗證 | CD 後檢查 `known_hosts` mtime、120 / 188 entries、strict SSH | 只看 CD success badge |
+
+2026-06-13 修復錨點:
+
+- Source fix:Gitea main 包含 `80e6ec1a fix(ci): avoid clobbering runner known hosts`。
+- Deploy marker:`e4a349bc chore(cd): deploy 414413a [skip ci]` 後,`/home/wooo/.ssh/known_hosts` mtime 仍停在 `2026-06-13 01:20:02 +0800`,未被 CD 覆蓋。
+- Deploy isolated file:`/home/wooo/.ssh/deploy_known_hosts` mtime `2026-06-13 01:24:05 +0800`。
+- Global strict entries:120 ED25519 line 4、188 ED25519 line 5 仍存在;strict SSH 到 `wooo@192.168.0.120` 與 `ollama@192.168.0.188` 必須通過。
+
 ---

 ## 3. P0 Evidence And Network
@@ -1135,7 +1156,24 @@ SOP update:
 | momo scheduler | scorecard reads `SCHEDULER_RECENT_ACTIVITY 1070` after detector fix |
 | SOP change | v1.5 adds startup judgment layers, GO/NO-GO tree, host recovery cards, and timeline checks |

-### 14.9 重啟後時間軸驗證
+### 14.9 2026-06-13 CD 後恢復比較錨點
+
+2026-06-13 不是主機重啟,而是用來驗證「120/121 workload balancing + CD known_hosts guardrail」是否能承受下一次正常部署的比較錨點。
+
+| 項目 | 2026-06-13 01:26 baseline |
+|------|----------------------------|
+| Gitea / ArgoCD | Gitea main `e4a349bc`,ArgoCD revision `e4a349bc`,sync `Synced` |
+| K3s image readback | API/Web/Worker image tag `414413a59268eedd391648f112e228716dd05362` |
+| K3s placement | API pods split `mon1` / `mon`;Web pods split `mon` / `mon1`;Worker single replica on `mon` |
+| Cold-start | `PASS=83 WARN=0 BLOCKED=0` |
+| Public routes | Scorecard verifies awoooi API/Web, momo, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan, aiops over TLS |
+| Backup | `backup-status`: 110 `13/13 fresh failed=0`,188 `2/2 fresh failed=0`,`core_blockers=0`,`escrow_missing=5` |
+| Offsite | textfile `remote_verify_ok=1`、`full_verify_fresh=1`,13 repos each `snapshot_count=1` |
+| SSH trust | Global `known_hosts` retained 120 / 188 entries after CD; deploy-specific trust moved to `deploy_known_hosts` |
+| Remaining non-service debt | `km-vectorize` waits for official 03:00 `lastSuccessfulTime`; credential escrow missing count remains `5`; 188 host degraded only by certbot units |
+| SOP change | v1.8 adds CD / SSH trust guardrail and post-CD comparison baseline |
+
+### 14.10 重啟後時間軸驗證

 每次重啟後照時間軸推進,不要等到最後才一次判定。

@@ -1170,6 +1208,7 @@ All must be true:
 - Runners are guarded and released last.
 - AI auto-remediation is not in full execution mode until all gates are green.
 - 110 cold-start textfile monitor is fresh, and Prometheus has the cold-start alert rules loaded.
+- 110 global `/home/wooo/.ssh/known_hosts` still contains verified 120 / 188 entries after any CD run; deploy jobs use `/home/wooo/.ssh/deploy_known_hosts` only. ### 15.1 可宣稱狀態 diff --git a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md index d2dfe77b..6e75eefa 100644 --- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md +++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md @@ -11,15 +11,15 @@ | Area | Status | Completion | Evidence | |------|--------|------------|----------| -| Overall recovery readiness | SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED | 95% | 2026-06-13 01:04 final cold-start scorecard is `PASS=83 WARN=0 BLOCKED=0`; 120/121 K3s are both `Ready control-plane`, backup core blockers remain `0`, public routes/TLS/momo DB/schedules/Alertmanager are green, and API/Web are now spread across 120 / 121. Remaining blocker is DR-only credential escrow evidence (`escrow_missing=5`); ArgoCD `km-vectorize` is tracked separately as governance health debt until its official scheduled Job refreshes `lastSuccessfulTime`. | +| Overall recovery readiness | SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED | 95% | 2026-06-13 01:26 final cold-start scorecard is `PASS=83 WARN=0 BLOCKED=0`; 120/121 K3s are both `Ready control-plane`, backup core blockers remain `0`, public routes/TLS/momo DB/schedules/Alertmanager are green, API/Web remain spread across 120 / 121 after deploy marker `e4a349bc`, and CD no longer clobbers global `known_hosts`. Remaining blocker is DR-only credential escrow evidence (`escrow_missing=5`); ArgoCD `km-vectorize` is tracked separately as governance health debt until its official scheduled Job refreshes `lastSuccessfulTime`. 
| | P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; host is reachable, root is mounted `rw`, failed units `0`, `mon` and `mon1` are both `Ready control-plane`, and cold-start P0/P1 checks are green. | -| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 90% | 2026-06-12 15:54 `backup-all` completed `13/13`; 17:37 offsite full sync completed `13/13`; 18:55 verifier returned `REMOTE_LATEST_ONLY_OK=1`, `VERIFY_OK=1`, `FAILED=0`; 18:55 backup-status has `core_blockers=0` but `escrow_missing=5`. | -| P2 service / data truth | VERIFIED_WORKLOAD_BALANCED | 100% | 2026-06-13 post-rollout cold-start is green; public routes/TLS are green, VIP API/Web are reachable, momo current-month parity is `4571/4571` with matching date bounds, schedules/services are green. API/Web now both have 120 / 121 split placement. | -| P3 docs / automation contracts | DONE_WITH_VALIDATION_GAP | 100% | Workplan, SOP v1.7, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, T+0/T+60 timeline checks, host role / load-balancing assessment, and `km-vectorize` remediation tracking are updated; Ansible syntax check is unavailable on this workstation. | +| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 90% | 2026-06-13 01:26 `backup-status` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`; 01:28 offsite textfile has `remote_verify_ok=1`, `full_verify_fresh=1`, and all 13 repos `snapshot_count=1`; 01:27 Alertmanager exposes the five expected escrow gap alerts and Prometheus has all five required alert rule names healthy. 
| +| P2 service / data truth | VERIFIED_WORKLOAD_BALANCED | 100% | 2026-06-13 01:26 cold-start is green; public routes/TLS are green, VIP API/Web are reachable, momo current-month parity is `4571/4571` with matching date bounds, schedules/services are green. API/Web both keep 120 / 121 split placement after latest deploy marker `e4a349bc`. | +| P3 docs / automation contracts | DONE_WITH_VALIDATION_GAP | 100% | Workplan, SOP v1.8, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, T+0/T+60 timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, and `km-vectorize` remediation tracking are updated; Ansible syntax check is unavailable on this workstation. | Full cold-start may be declared green only for the latest verified evidence set. Do not declare DR scorecard complete while credential escrow evidence remains blocked. -2026-06-13 01:04 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing is live verified after Gitea main `acaae999` and ArgoCD sync; do not declare DR complete while credential escrow is missing. `km-vectorize` remediation is `90%`: schedule/label fix is in Gitea main `47ee96b0`, ArgoCD has synced it live, and the remaining gate is the next official 03:00 CronJob success readback. +2026-06-13 01:26 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing survived the next normal CD deploy: Gitea main `e4a349bc`, ArgoCD revision `e4a349bc`, images from `414413a5`, API/Web split across `mon` / `mon1`, and global `known_hosts` retained 120 / 188 after CD fix `80e6ec1a`. Do not declare DR complete while credential escrow is missing. 
`km-vectorize` remediation is `90%`: schedule/label fix is live, and the remaining gate is the next official 03:00 CronJob success readback. --- @@ -55,6 +55,7 @@ Full cold-start may be declared green only for the latest verified evidence set. | 2026-06-12 120 recovery closeout | SERVICE_GREEN_DR_ESCROW_BLOCKED | 120 root fsck was completed from console/initramfs and booted at `15:13`; 15:54 backup-all finished `13/13`; 17:37 full offsite sync finished `13/13`; 18:55 offsite verifier returned `REMOTE_LATEST_ONLY_OK=1`, `VERIFY_OK=1`, `FAILED=0`; 18:55 backup-status shows `core_blockers=0`, `escrow_missing=5`; 18:57 cold-start is `PASS=83 WARN=0 BLOCKED=0`. | | 2026-06-13 live refresh | SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED | 00:13 backup status: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`; 00:33 cold-start exposed 110 `known_hosts` drift for 120 / 188, fixed after backup `/home/wooo/.ssh/known_hosts.before-120-188-refresh.20260613-003416`; 00:34 final cold-start: `PASS=83 WARN=0 BLOCKED=0`; live K3s has `mon` / `mon1` Ready, API/Web are split 120 / 121. 188 host is `degraded` only because `certbot.service` and `snap.certbot.renew.service` failed; ArgoCD remains Degraded because `km-vectorize` CronJob last success is stale. Manual Job `km-vectorize-codex-002709` did not leave verified completion evidence, so this remains open. | | 2026-06-13 `km-vectorize` health remediation | IN_PROGRESS_90 | 00:50 live readback: CronJob `lastScheduleTime=2026-06-12T11:00:00Z`, `lastSuccessfulTime=2026-06-04T11:00:37Z`; retained 6/2, 6/3, 6/4 Jobs are `Complete`, latest visible pod log returned `embed-all: 200 {"total":32,"success":32,"failed":0}`. Gitea main `47ee96b0` and ArgoCD sync now corrected live spec to `schedule=0 3 * * *`, `timeZone=Asia/Taipei`, with Job/Pod labels `app/component/environment/phase/system`. 01:04 cold-start is `PASS=83 WARN=0 BLOCKED=0`. Next gate is the official 03:00 CronJob success readback. 
| +| 2026-06-13 post-CD trust / workload verification | SERVICE_GREEN_CD_GUARDRAIL_HELD | Gitea main advanced to deploy marker `e4a349bc chore(cd): deploy 414413a [skip ci]`; ArgoCD revision is `e4a349bc`, sync `Synced`, health still `Degraded` only by `km-vectorize` stale success. Live K3s image readback uses `414413a59268eedd391648f112e228716dd05362`; API pods split `mon1` / `mon`, Web pods split `mon` / `mon1`, Worker is single replica on `mon`. 01:28 `/home/wooo/.ssh/known_hosts` mtime remains `2026-06-13 01:20:02 +0800` with 120 / 188 entries present; deploy-specific `/home/wooo/.ssh/deploy_known_hosts` mtime is `01:24:05`, proving CD fix `80e6ec1a` stopped clobbering global trust. 01:26 cold-start: `PASS=83 WARN=0 BLOCKED=0`. | --- @@ -140,7 +141,7 @@ Next: | P2-004 | DONE | 100 | PostgreSQL index corruption runbook path | SOP v1.2 now states `posting list tuple ... cannot be split` is an index repair incident. | Use only concurrent reindex if the error returns. | No truncate, no whole DB restore; `REINDEX TABLE CONCURRENTLY public.realtime_sales_monthly;` and idempotent resync evidence recorded. | | P2-005 | VERIFIED | 100 | Do not rely on route 200 only | 2026-06-12 closeout has route + DB + backup + offsite + schedule + alert + K3s + cold-start scorecard evidence. The only remaining blocker is DR credential escrow, outside service availability. | Keep this cross-surface checklist mandatory after every reboot. | Each reboot record has route, DB, backup, schedules, alert, scorecard rows. | | P2-006 | DONE | 100 | Validate momo scheduler WARN | 2026-06-12 post-reboot regression showed the old detector was too narrow for Chinese batch and `[Feeder]` logs. The detector was widened and deployed to 110; 14:47 scorecard reads `SCHEDULER_RECENT_ACTIVITY 1070` and marks scheduler healthy. | Keep normal monitoring; treat future recurrence as detector tuning only if direct logs remain active. 
| Container healthy, direct log activity exists, and latest scorecard removed this WARN. | -| P2-007 | DONE | 100 | Balance K3s AWOOI workload across 120 / 121 | Gitea main `acaae999` adds topology spread for API/Web/Worker. ArgoCD synced revision `acaae99986aee2e1f5630984981ccb0f2b676bb8`, live deployments have non-empty `topologySpreadConstraints`, API/Web are each split 120 / 121, and 00:34 final cold-start is `PASS=83 WARN=0 BLOCKED=0`. | Watch the next normal deploy to confirm topology spread remains effective. | Live deployment has non-empty topology spread, API/Web placement max skew <= 1, public routes green, cold-start `WARN=0 BLOCKED=0`. | +| P2-007 | DONE | 100 | Balance K3s AWOOI workload across 120 / 121 | Gitea main `acaae999` adds topology spread for API/Web/Worker. ArgoCD later synced deploy marker `e4a349bc`; live deployments still have split placement after a normal CD rollout: API pods on `mon1` / `mon`, Web pods on `mon` / `mon1`, Worker single replica on `mon`; 01:26 final cold-start is `PASS=83 WARN=0 BLOCKED=0`. | Keep watching future deploys; do not manually delete pods unless placement drift becomes a real service or HA gate. | Live deployment has non-empty topology spread, API/Web placement max skew <= 1 after normal CD, public routes green, cold-start `WARN=0 BLOCKED=0`. | --- @@ -155,10 +156,11 @@ Next: | P3-005 | DONE | 100 | Update cold-start SOP | SOP now includes start, shutdown, reboot, record, comparison, and 120 blocker handling. | Increment SOP version after each process change. | SOP has controlled power-operation sections and ledger template. | | P3-006 | DONE | 100 | Update backup status | Backup status now reflects current cron, rclone latest-only, failure-only alert posture, and escrow blocker. | Refresh after 120 backup rerun. | Backup status no longer claims noisy success Telegram notifications. 
| | P3-007 | DONE | 100 | Harden Gitea backup stale dump handling | 2026-06-05 manual Gitea backup failed because the container retained `/tmp/gitea-dump.zip` from the 02:00 failure. `scripts/backup/backup-gitea.sh` now renames stale container dump files to timestamped evidence before running a new dump, and the live 110 script is updated. | Watch the next 02:00 Gitea backup. | `bash -n` passes locally and on 110; manual Gitea backup completed after stale evidence rename. | -| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.6 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, 2026-06-12 comparison anchor, T+0/T+60 verification timeline, AA/AS 判定, workload 分散判定, and allowed declaration wording. | Use v1.6 for the next reboot record, then compare actual timing and blockers against §14.8 / §14.9. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, and `WORKLOAD_BALANCED`, and blocks false green while 120 / escrow remain red. | +| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.8 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, 2026-06-12 post-reboot anchor, 2026-06-13 post-CD trust/workload anchor, T+0/T+60 verification timeline, AA/AS determination, workload distribution determination, CD SSH trust guardrail, and allowed declaration wording. | Use v1.8 for the next reboot record, then compare actual timing and blockers against §14.8 / §14.9 / §14.10. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, and `WORKLOAD_BALANCED`, and blocks false green while escrow or governed CronJob debt remains red. 
| | P3-009 | DONE | 100 | Assess 120/121 AA/AS role and host load balancing | 2026-06-12 15:19 live check confirms 120 and 121 are both `Ready control-plane`, `k3s active`, `k3s-agent inactive`, with no taints; however most AWOOOI / ArgoCD / Velero workload remains on 121 after 120 fsck recovery. New assessment defines control-plane AA vs workload AA, migration candidates from 110/188, and stateful migration blockers. | After P0 backup/offsite/cold-start green, implement topology spread for AWOOOI API/Web before moving additional services. | `docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md` exists; SOP v1.6 links AA/AS and load-balancing checks; migration implementation remains explicitly `0%`. | | P3-010 | DONE | 100 | Update workload balancing docs with 2026-06-13 live truth | Host role assessment, workplan, SOP, backup status, and LOGBOOK are refreshed with current cold-start, backup, 188 certbot degraded, ArgoCD `km-vectorize` degraded, Gitea main `acaae999`, ArgoCD sync, and final pod placement evidence. | Keep updating this file after the next reboot or deploy. | Docs separate service-green status from DR escrow, workload rollout, and non-service governance debt. | | P3-011 | DONE | 100 | Record `km-vectorize` remediation status | LOGBOOK, this workplan, and SOP now state the schedule/label fix, ArgoCD sync evidence, the invalid manual Job boundary, and the 90% waiting-for-next-schedule gate. | After next 03:00 run, update this row and the top verdict with `lastSuccessfulTime` / ArgoCD health evidence. | No document claims ArgoCD green before official CronJob success evidence exists. | +| P3-012 | DONE | 100 | Prevent CD from clobbering cold-start SSH trust | Source fix `80e6ec1a` changes Gitea CD workflows to use deploy-specific `deploy_known_hosts` and `UserKnownHostsFile`; post-deploy marker `e4a349bc` proves global `/home/wooo/.ssh/known_hosts` retained 120 / 188 entries. SOP v1.8 records this as a release guardrail. 
| Keep the guardrail in future workflow reviews; any `> ~/.ssh/known_hosts` in deploy code is a release blocker. | CD success plus post-CD `known_hosts` readback and strict SSH checks to 120 / 188 remain green. | --- @@ -192,6 +194,16 @@ Do not run `truncate`, whole DB restore, force-push, DROP, or online root filesy ## 9. Progress Updates +```text +2026-06-13 01:29 Asia/Taipei +Phase: P0/P1/P2/P3 +Before: Overall 95%, P1 90%, P2 100%, P3 100% +After: Overall 95%, P1 90%, P2 100%, P3 100% +Evidence: Gitea main e4a349bc; ArgoCD revision e4a349bc sync=Synced health=Degraded only because km-vectorize last success is stale; K3s images 414413a59268eedd391648f112e228716dd05362; API/Web split across mon/mon1; /home/wooo/.ssh/known_hosts retained 120/188 after CD fix 80e6ec1a; backup-status 110 13/13 fresh failed=0 and 188 2/2 fresh failed=0; offsite textfile remote_verify_ok=1 and 13 repos snapshot_count=1; backup alert live visibility OK; all five required Prometheus alert rule names health=ok; cold-start PASS=83 WARN=0 BLOCKED=0. +Blocked: yes, for DR complete only, because 5 credential escrow evidence markers are still missing; ArgoCD fully healthy still waits for official 03:00 km-vectorize lastSuccessfulTime. +Next: after 03:00 Asia/Taipei, verify km-vectorize official Job completion and ArgoCD health; keep escrow alerts firing until real non-secret evidence IDs are written. +``` + ```text 2026-06-04 15:23 Asia/Taipei Phase: P3