docs(ops): refresh reboot SOP evidence [skip ci]

Your Name
2026-06-13 01:35:39 +08:00
parent 208a6bd023
commit 6a3f6caedb
4 changed files with 112 additions and 32 deletions


@@ -1,3 +1,24 @@
## 2026-06-13: Post-CD cold-start green and SSH trust guardrail closeout
**01:26 / 01:29 live refresh**
- Gitea main has reached deploy marker `e4a349bc chore(cd): deploy 414413a [skip ci]`; ArgoCD `awoooi-prod` is at revision `e4a349bc` with sync `Synced`. Health is still `Degraded` because the official scheduled `km-vectorize` run has not yet written back a fresh success; this is not a service outage.
- K3s live image readback: `awoooi-api`, `awoooi-web`, and `awoooi-worker` all use `414413a59268eedd391648f112e228716dd05362`. API pods are split `mon1` / `mon`, Web pods are split `mon` / `mon1`, and the single Worker replica is on `mon`, which shows that the latest CD did not revert topology spread to a concentrated placement.
- On 110, the global `/home/wooo/.ssh/known_hosts` still retains the 120 / 188 entries after CD, with mtime `2026-06-13 01:20:02 +0800`; the deploy-specific `/home/wooo/.ssh/deploy_known_hosts` has mtime `01:24:05`. This verifies that `80e6ec1a fix(ci): avoid clobbering runner known hosts` stops CD from overwriting the global SSH trust that cold-start relies on.
- 01:26 `full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1`: `PASS=83 WARN=0 BLOCKED=0`, result `GREEN`. 110 / 120 / 121 / 188 ping + SSH OK; public routes/TLS, momo DB parity, CronJobs, backup health, and textfile exporters are all green.
- 01:26 `/backup/scripts/backup-status.sh --no-notify`: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-12 15:54:40`.
- 01:28 offsite textfile: `remote_verify_ok=1`, `full_verify_fresh=1`, and all 13 repos report `snapshot_count=1` / `snapshot_latest_only=1`. The latest scheduled verifier log is from 2026-06-12 07:20, still within the 48h freshness window.
- 01:27 backup alert live visibility: Prometheus and Alertmanager both show the 5 credential escrow gap alerts as active/firing; the Prometheus rules API confirms that `BackupConfigCapturePartial`, `BackupAggregateRunFailed`, `BackupCredentialEscrowEvidenceMissing`, `ColdStartRecoveryBlocked`, and `ColdStartHost120Unreachable` all have health `ok`; the label contract check loads 24 baseline backup alert rules. These escrow red lights are a correct block and must not be silenced.
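The scorecard summary quoted above (`PASS=83 WARN=0 BLOCKED=0`) can be checked mechanically. The following is a minimal sketch, not the logic of `full-stack-cold-start-check.sh` itself: the function names are hypothetical, and the rule "green only when `WARN` and `BLOCKED` are both zero" is an assumption inferred from the acceptance wording in this entry.

```python
import re

# Matches the summary triple as it is quoted in this logbook entry.
SUMMARY_RE = re.compile(r"PASS=(\d+)\s+WARN=(\d+)\s+BLOCKED=(\d+)")


def parse_scorecard(line: str) -> dict:
    """Extract the PASS / WARN / BLOCKED counters from a summary line."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError(f"no scorecard summary found in: {line!r}")
    passed, warn, blocked = (int(group) for group in match.groups())
    return {"pass": passed, "warn": warn, "blocked": blocked}


def is_green(counts: dict) -> bool:
    """Assumed rule: green only when nothing is WARN or BLOCKED."""
    return counts["warn"] == 0 and counts["blocked"] == 0


if __name__ == "__main__":
    counts = parse_scorecard("PASS=83 WARN=0 BLOCKED=0")
    print(counts, "GREEN" if is_green(counts) else "NOT_GREEN")
```

A check like this only covers the summary line; the per-host and per-route evidence still has to come from the scorecard output itself.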
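**Docs / SOP updates:**
- `docs/runbooks/FULL-STACK-COLD-START-SOP.md` bumped to v1.8, adding the CD / SSH trust guardrail, the post-CD comparison anchor, and done criteria.
- `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md` updated with the 01:29 completion figures: Overall `95%`, P1 `90%`, P2 `100%`, P3 `100%`; added `P3-012` to prevent CD from clobbering `known_hosts`.
- `docs/runbooks/BACKUP-STATUS.md` updated with the 01:26 / 01:28 backup, offsite, and Alertmanager evidence.
**Still not claimable as complete:**
- Do not claim `DR complete`: `5` credential escrow evidence markers are still missing.
- Do not claim ArgoCD fully healthy: the corrected `km-vectorize` schedule is live, but `lastSuccessfulTime` will only update after the official 03:00 Asia/Taipei CronJob run.
- Do not treat the P0-PUBLICENV local source patch as production remediation complete: the production public bundle leak audit still has to wait for the frontend worktree / build / deploy / rescan to converge.
## 2026-06-13: P2-104 matched PlayBook learning gap
**Background**: P2-103 wired task results back into KM drafts, LOGBOOK evidence, the audit trail, the timeline, and the manual handoff contract; the next step was originally described as fixing the `matched_playbook_id` learning gap. A read-only check against the production DB corrected that judgment: `matched_playbook_id` is `66/66` for approvals in the last 24h, so the real gap has moved to approved items that produce no execution learning and no PlayBook trust update.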


@@ -6,38 +6,44 @@
> 2026-06-12 Codex post-reboot refresh: 110 recovered, offsite latest-only verified, and remaining red gates narrowed to 120 config capture plus credential escrow evidence.
> 2026-06-12 Codex post-120 recovery refresh: 120 restored, backup aggregate / offsite / full cold-start green; DR still blocked only by credential escrow evidence.
> 2026-06-13 Codex live refresh: backup core remains green; DR still blocked only by credential escrow evidence.
> 2026-06-13 Codex post-CD refresh: backup/offsite/alert contracts remain green after deploy marker `e4a349bc`; global SSH trust guardrail held; DR still blocked only by credential escrow evidence.
---
## 2026-06-12 Post-Reboot Live Status
## 2026-06-13 Post-CD Live Status
2026-06-13 00:13 refresh:
2026-06-13 01:26 / 01:28 refresh:
- `backup-status.sh --no-notify`: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `core_blockers=0`, `escrow_missing=5`.
- `/backup/scripts/backup-status.sh --no-notify`: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-12 15:54:40`.
- `/home/wooo/node_exporter_textfiles/offsite_full_sync_verify.prom`: `awoooi_backup_offsite_remote_verify_ok=1`, `awoooi_backup_offsite_full_verify_fresh=1`; all 13 repos have `snapshot_count=1` and `snapshot_latest_only=1`.
- `backup-alert-live-visibility-check.py`: Prometheus and Alertmanager both show the 5 escrow gap alerts as active/firing.
- Prometheus rules API: `BackupConfigCapturePartial`, `BackupAggregateRunFailed`, `BackupCredentialEscrowEvidenceMissing`, `ColdStartRecoveryBlocked`, and `ColdStartHost120Unreachable` all exist with health `ok`; currently only the escrow gap rule is firing, which is correct, and the rest are inactive.
- `backup-alert-label-contract-check.py`: the local `ops/monitoring/alerts-unified.yml` matches the live Prometheus label contract; 24 baseline backup alert rules are loaded.
- This means the backup core is still green; the remaining red light is still DR credential escrow evidence, not a backup script or offsite sync failure.
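The latest-only evidence in the textfile can be read back with a small parser. This is a hedged sketch rather than the verifier used on 110: the per-repo metric names are not shown in this document, so the code matches any metric whose name ends in `snapshot_count` or `snapshot_latest_only`, and `offsite_latest_only_ok` is a hypothetical helper name.

```python
import re

# One sample per line in the Prometheus text exposition format:
#   metric_name{optional="labels"} value
LINE_RE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)")


def parse_textfile(text: str) -> list:
    """Return (name, labels, value) tuples, skipping comments and blanks."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match:
            samples.append((match.group(1), match.group(2) or "", float(match.group(3))))
    return samples


def offsite_latest_only_ok(text: str, expected_repos: int = 13) -> bool:
    """True when both verify gauges are 1 and every repo holds one snapshot."""
    samples = parse_textfile(text)

    def values(suffix: str) -> list:
        # Suffix matching is an assumption; the real metric names may differ.
        return [value for name, _, value in samples if name.endswith(suffix)]

    counts = values("snapshot_count")
    latest = values("snapshot_latest_only")
    return (
        values("remote_verify_ok") == [1.0]
        and values("full_verify_fresh") == [1.0]
        and len(counts) == expected_repos
        and all(value == 1.0 for value in counts)
        and len(latest) == expected_repos
        and all(value == 1.0 for value in latest)
    )
```

Typical use would be `offsite_latest_only_ok(open(path).read())` against the `.prom` file named above; textfile freshness (mtime) is a separate check and is not covered here.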
| Gate | Status | Evidence |
|------|--------|----------|
| 110 backup cron | VERIFIED | `02:00 backup-all`, `03:00 sync-offsite-backups --mode sync`, `06:05 backup-status`, `07:20 verify-offsite-full-sync`. |
| Backup freshness | VERIFIED | 2026-06-12 18:55 status shows `stale110=none`, `stale188=none`, 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`; last aggregate backup completed `2026-06-12 15:54:40`. |
| 110 backup cron | VERIFIED | Live crontab still has `02:00 backup-all`, `03:00 sync-offsite-backups --mode sync`, `06:05 backup-status`, `07:20 verify-offsite-full-sync`; success is summarized once daily and not sent as noisy Telegram heartbeat. |
| Backup freshness | VERIFIED | 2026-06-13 01:26 status shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`; last aggregate backup completed `2026-06-12 15:54:40`. |
| 188 momo backup cron/exporter contract | VERIFIED | 188 crontab now runs `/home/ollama/bin/momo-pg-backup.sh`; exporter reports `awoooi_backup_job_configured{host="188",job="momo_pg_daily"} 1`, so `configured_missing_188=0`. |
| Google Drive/rclone remote latest-only | VERIFIED | 2026-06-12 18:55 verifier: all 13 remote repos have `snapshots=1`, `REMOTE_LATEST_ONLY_OK=1`, `FULL_MARKER_FRESH=1`, `VERIFY_OK=1`, `FAILED=0`. |
| Google Drive/rclone remote latest-only | VERIFIED | 2026-06-13 01:28 textfile confirms all 13 remote repos have `snapshot_count=1` and `snapshot_latest_only=1`; latest scheduled verifier log at 2026-06-12 07:20 returned `REMOTE_LATEST_ONLY_OK=1`, `FULL_MARKER_FRESH=1`, `VERIFY_OK=1`, `FAILED=0`. |
| Offsite gate marker | VERIFIED | `/backup/offsite/enable-rclone-sync` present; full marker fresh and verifier wrote `/home/wooo/node_exporter_textfiles/offsite_full_sync_verify.prom`. |
| Backup alert rules | VERIFIED | 2026-06-12 14:49 live check confirms Prometheus and Alertmanager expose `BackupConfigCapturePartial`, `BackupAggregateRunFailed`, `BackupCredentialEscrowEvidenceMissing`, `ColdStartRecoveryBlocked`, `ColdStartHost120Unreachable`. |
| Backup alert rules | VERIFIED | 2026-06-13 01:27 live check confirms Prometheus and Alertmanager expose the active `BackupCredentialEscrowEvidenceMissing` gap alerts for the five missing items; Prometheus rules API has all five required alert names healthy; label contract check confirms all baseline backup alert rules are loaded. |
| Backup aggregate health | VERIFIED | 2026-06-12 15:54 `/backup/scripts/backup-all.sh` completed `13/13` successfully; Configs captured 120 / 121 / K8s workloads / K8s secrets / Velero from source `192.168.0.120`; 18:55 `core_blockers=0`. |
| Credential escrow | BLOCKED | Five evidence markers missing. Only write non-secret marker evidence with `/backup/scripts/mark-credential-escrow-verified.sh`. |
| Config backup capture | VERIFIED | 2026-06-12 15:54 Configs backup succeeded for `120-k3s-host-configs`, `121-k3s-host-configs`, `cluster-k8s-workloads`, `cluster-k8s-secrets`, and `cluster-velero-backups`; latest Configs snapshot `bee9ae22`. |
| Full cold-start | GREEN | 2026-06-12 18:57 read-only rerun: `PASS=83 WARN=0 BLOCKED=0`; result `GREEN`. |
| 110 -> 188 SSH trust | VERIFIED | 2026-06-12 known_hosts refreshed after current ED25519 fingerprint matched the previously verified `SHA256:d+3QQO4Sh9hQ1DUhPEEJsxqE5D/CYVH+qTgG9TSMNaM`; this cleared the 188 SSH host-key blocker. |
| Full cold-start | GREEN | 2026-06-13 01:26 read-only rerun: `PASS=83 WARN=0 BLOCKED=0`; result `GREEN`. |
| 110 -> 120 / 188 SSH trust | VERIFIED | Final trust repair backup `/home/wooo/.ssh/known_hosts.before-120-188-final-refresh.20260613-011949`; CD fix `80e6ec1a` uses `/home/wooo/.ssh/deploy_known_hosts`; post-deploy marker `e4a349bc` did not clobber global `known_hosts`, and 120 / 188 entries remain present. |
| 120 console handoff | CLOSED | 120 root filesystem was repaired from console/initramfs with offline fsck, booted at `2026-06-12 15:13`, SSH returned, root mounted `rw`, failed units `0`, and K3s `mon` returned `Ready`. |
| 2026-06-05 manual backup remediation | VERIFIED with aggregate blocker | 18:40 status: `stale110=none`, `stale188=none`, `configured_missing_188=0`; manual snapshots: AWOOOI `b7d5ee4e`, Gitea `ea641613`, Open-WebUI `d1147507`, ClawBot `73ead3cc`, AI artifacts `b1161ab8`. |
| 2026-06-06 credential escrow audit | BLOCKED | 15:03 report confirms scripts/config/rclone are present, but all five non-secret evidence markers are still missing; 15:06 safe dry-run checklist is documented below. |
Current policy: normal success should not create immediate Telegram noise. Failures and operator-action states must still alert; a single daily status summary runs at 06:05.
2026-06-12 post-reboot closeout:
2026-06-13 post-CD closeout:
- 110 / 120 / 121 / 188 public/core service recovery is green.
- Backup aggregate, Google Drive/rclone offsite latest-only, and full cold-start are green after 120 recovery.
- Latest normal CD deploy preserved API/Web workload balancing and did not break cold-start SSH trust.
- Do not close DR scorecard until all five credential escrow evidence markers are written with non-secret evidence IDs.
---
@@ -120,7 +126,7 @@ notify_clawbot "failed" "backup-test" "測試告警" 0
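## Retention Policy
Since 2026-05-19, the 110 local restic repos, the 188 MOMO file backups, and the Google Drive/rclone offsite mirror follow a latest-only policy: after a new snapshot is created successfully, only the newest one is kept. The 2026-06-06 14:46 live verifier confirmed that each of the 13 Google Drive/rclone remote repos holds 1 snapshot.
Since 2026-05-19, the 110 local restic repos, the 188 MOMO file backups, and the Google Drive/rclone offsite mirror follow a latest-only policy: after a new snapshot is created successfully, only the newest one is kept. The 2026-06-13 01:28 live textfile confirmed that each of the 13 Google Drive/rclone remote repos holds 1 snapshot and that every latest-only indicator is `1`.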
2026-06-04 manual refresh evidence:
- 188 `momo-pg-backup.sh` produced `momo_analytics_20260604_154234.sql.gz` and pruned old backups beyond keep-last=1.
@@ -140,6 +146,8 @@ notify_clawbot "failed" "backup-test" "測試告警" 0
2026-06-12 18:55 update: 120 has returned and the aggregate backup blocker is cleared. `/backup/scripts/backup-all.sh` completed `13/13`, full offsite sync completed `13/13`, full verifier returned `REMOTE_LATEST_ONLY_OK=1` / `VERIFY_OK=1`, and `backup-status.sh --no-notify` reports `core_blockers=0`. The only remaining DR warning is `escrow_missing=5`.
2026-06-13 01:28 update: post-CD live readback still shows `remote_verify_ok=1`, `full_verify_fresh=1`, and all 13 repos `snapshot_count=1`; backup core remains green after deploy marker `e4a349bc`.
2026-06-05 manual remediation:
- 16:00 live check still had 120 unreachable, `stale110=awoooi_db`, `backup_all failed=6`, and `escrow_missing=5`.
- 14:00 AWOOOI high-frequency backup failed, then 16:01 manual rerun completed snapshot `b7d5ee4e`.


@@ -1,6 +1,6 @@
# AWOOOI Full-Stack Cold Start and Host Reboot SOP
> Version: v1.7
> Version: v1.8
> Last updated: 2026-06-13 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
@@ -10,27 +10,27 @@
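This section is the first judgment anchor after every handoff, power-on, shutdown, or reboot. If the date is not today, rerun the live check first, then update this section and `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md`.
| Item | 2026-06-13 00:14 Asia/Taipei live result | Verdict |
| Item | 2026-06-13 01:29 Asia/Taipei live result | Verdict |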
|------|-------------------------------------------|------|
| Overall recovery readiness | `95%` | `SERVICE_GREEN_DR_ESCROW_BLOCKED` |
| Overall recovery readiness | `95%` | `SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED` |
| P0 host / K3s recovery | `100%` | `DONE` |
| P1 backup / alert / escrow | `90%` | `BLOCKED_DR_ESCROW` |
| P2 service / data truth | `100%` | `VERIFIED_WORKLOAD_BALANCED` |
| P3 docs / automation contracts | `100%` | `DONE_WITH_VALIDATION_GAP` |
| 110 host runtime | `systemctl is-system-running=running`, failed units `0`, load `4.45 / 3.28 / 2.89`, Swap `166Mi/7.8GiB` | `GREEN_WITH_LOAD_WATCH` |
| 110 host runtime | `systemctl is-system-running=running`, failed units `0`, cold-start load `3.55 / 4.54 / 4.51`; Docker / Harbor / Gitea / Prometheus / Alertmanager / Sentry checks pass | `GREEN_WITH_LOAD_WATCH` |
| 120 reachability | ping OK, SSH OK, boot `2026-06-12 15:13`, root ext4 `rw`, failed units `0` | `GREEN` |
| 121 reachability | ping OK, SSH OK, failed units `0` | `GREEN` |
| 188 host runtime | production services green; host degraded only by `certbot.service` and `snap.certbot.renew.service` | `GREEN_WITH_CERTBOT_DEBT` |
| K3s node state | `mon Ready control-plane`, `mon1 Ready control-plane`, no failed jobs / bad pods; API/Web are split across 120 / 121 | `GREEN_WORKLOAD_BALANCED` |
| 110 -> 120 / 188 SSH trust | 00:33 cold-start exposed stale `known_hosts`; backup `/home/wooo/.ssh/known_hosts.before-120-188-refresh.20260613-003416`; 120 ED25519 `SHA256:LLhtf/K09AQjRE1Csi0pDCkwvtM9/Ag48RcLgVF4EuY`; 188 ED25519 `SHA256:d+3QQO4Sh9hQ1DUhPEEJsxqE5D/CYVH+qTgG9TSMNaM`; strict SSH checks now pass | `GREEN` |
| Backup status | 00:13 status: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5` | `GREEN_WITH_DR_ESCROW_WARNING` |
| Offsite sync / verify | 17:37 full sync `13/13`; 18:55 verifier: 13 repos each `snapshots=1`, `REMOTE_LATEST_ONLY_OK=1`, `FULL_MARKER_FRESH=1`, `VERIFY_OK=1`, `FAILED=0` | `GREEN` |
| Backup / cold-start alerts | 14:49 Prometheus and Alertmanager expose 5 required backup/cold-start/escrow alerts | `GREEN_WITH_EXPECTED_REDLIGHTS` |
| Cold-start scorecard | 01:04 final read-only scorecard after `km-vectorize` sync / 188 SSH trust refresh: `PASS=83 WARN=0 BLOCKED=0` | `GREEN` |
| K3s node state | `mon Ready control-plane`, `mon1 Ready control-plane`, no failed jobs / bad pods; latest CD kept API/Web split across 120 / 121 | `GREEN_WORKLOAD_BALANCED` |
| 110 -> 120 / 188 SSH trust | 00:33 cold-start exposed stale `known_hosts`; backup `/home/wooo/.ssh/known_hosts.before-120-188-refresh.20260613-003416`; final repair backup `/home/wooo/.ssh/known_hosts.before-120-188-final-refresh.20260613-011949`; CD fix `80e6ec1a` moves deploy trust to `/home/wooo/.ssh/deploy_known_hosts`; 01:28 global `known_hosts` still contains 120 / 188 and was not clobbered by deploy marker `e4a349bc` | `GREEN_WITH_GUARDRAIL` |
| Backup status | 01:26 status: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-12 15:54:40` | `GREEN_WITH_DR_ESCROW_WARNING` |
| Offsite sync / verify | 01:28 textfile: `awoooi_backup_offsite_remote_verify_ok=1`, `full_verify_fresh=1`, all 13 repos have `snapshot_count=1` and `snapshot_latest_only=1`; latest scheduled verifier log is 2026-06-12 07:20 | `GREEN` |
| Backup / cold-start alerts | 01:27 live visibility check confirms Prometheus and Alertmanager expose the 5 required credential escrow gap alerts; Prometheus rules API has all five required alert names healthy; label contract check loads 24 baseline backup alert rules | `GREEN_WITH_EXPECTED_REDLIGHTS` |
| Cold-start scorecard | 01:26 read-only scorecard after latest CD / trust guardrail: `PASS=83 WARN=0 BLOCKED=0` | `GREEN` |
| momo DB parity | `4571|4571|2026-06-01|2026-06-07|2026-06-01|2026-06-07` | `GREEN` |
| momo scheduler | container healthy; scorecard reads `SCHEDULER_RECENT_ACTIVITY 1136`; detector widened and deployed to 110 | `GREEN` |
| ArgoCD app health | Deployments/PDB healthy; app Degraded because `km-vectorize` last success is stale. Gitea main `47ee96b0` / ArgoCD live now use 03:00 Asia/Taipei and complete Job labels; waiting for official CronJob success readback. | `GOVERNANCE_DEBT_IN_REMEDIATION` |
| Workload balancing | Gitea main `acaae999` synced by ArgoCD; API/Web each have one pod on 120 and one pod on 121 | `GREEN` |
| ArgoCD app health | ArgoCD revision `e4a349bc` is `Synced`; app remains `Degraded` while `km-vectorize` `lastSuccessfulTime=2026-06-04T11:00:37Z`; corrected schedule remains `0 3 * * *` with `timeZone=Asia/Taipei` | `GOVERNANCE_DEBT_IN_REMEDIATION` |
| Workload balancing | Latest live deployments use images from `414413a5`; API/Web each have one pod on 120 and one pod on 121 after deploy marker `e4a349bc` | `GREEN` |
| Credential escrow | 5 non-secret evidence markers missing | `BLOCKED` |
Release rule:
@@ -44,12 +44,13 @@ Do not declare DR scorecard complete while credential escrow markers are missing
2026-06-13 live rule:
```text
110 / 120 / 121 / 188 service recovery is full-stack green after the 00:34 scorecard.
110 / 120 / 121 / 188 service recovery is full-stack green after the 01:26 scorecard.
GO for controlled runner/CD release; keep AI auto-remediation governed by normal gates.
NO-GO for "DR complete" while credential escrow evidence markers are missing.
Do not fake or silence credential escrow alerts; they are the remaining correct DR red light.
GO for "AWOOOI core workload balanced"; topology spread is in Gitea main / ArgoCD live and API/Web placement proves max skew <= 1.
NO-GO for "ArgoCD fully healthy" until `km-vectorize` updates `lastSuccessfulTime` after an official scheduled Job, not a manual `UnexpectedJob`.
NO-GO for any CD workflow that writes deploy host keys into `/home/wooo/.ssh/known_hosts`; deploy jobs must use an isolated `UserKnownHostsFile=/home/wooo/.ssh/deploy_known_hosts`.
Current allowed wording: "service / cold-start green; ArgoCD CronJob governance debt in remediation and waiting for 03:00 `km-vectorize` success evidence."
```
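The `km-vectorize` NO-GO rule above hinges on whether `lastSuccessfulTime` is newer than the most recent 03:00 Asia/Taipei schedule slot. The sketch below illustrates only that timestamp comparison; the function names are hypothetical, the fixed UTC+8 offset stands in for `Asia/Taipei` (which has no daylight saving), and it does not by itself distinguish an official scheduled Job from a manual `UnexpectedJob`.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset used as a stand-in for Asia/Taipei (UTC+8, no DST).
TAIPEI = timezone(timedelta(hours=8))


def latest_scheduled_run(now: datetime, hour: int = 3) -> datetime:
    """Most recent daily schedule slot (default 03:00 Taipei) at or before now."""
    local = now.astimezone(TAIPEI)
    run = local.replace(hour=hour, minute=0, second=0, microsecond=0)
    if run > local:
        run -= timedelta(days=1)
    return run


def cronjob_success_is_fresh(last_successful_time: str, now: datetime) -> bool:
    """True when the recorded success is not older than the latest slot."""
    last = datetime.fromisoformat(last_successful_time.replace("Z", "+00:00"))
    return last >= latest_scheduled_run(now)


if __name__ == "__main__":
    now = datetime(2026, 6, 13, 1, 29, tzinfo=TAIPEI)
    # Value recorded in this SOP; it predates the latest slot, so it is stale.
    print(cronjob_success_is_fresh("2026-06-04T11:00:37Z", now))
```

With the recorded value `2026-06-04T11:00:37Z`, the check stays false until a run at or after the next 03:00 slot writes a newer timestamp, which matches the "waiting for 03:00 success evidence" wording.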
@@ -229,6 +230,26 @@ Cold start creates noisy metrics and partial failures. During P0/P1, keep automa
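When multiple work windows handle an incident at the same time, the first priority is to avoid interrupting each other: while anyone is converging Docker / Nginx / firewall / K3s write operations, the other windows stay read-only observers until there is an explicit handoff.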
### 2.2 CD / SSH Trust Guardrail
The 2026-06-13 cold-start false red light showed that if a CD workflow runs `ssh-keyscan ... > /home/wooo/.ssh/known_hosts`, it overwrites the user-level global SSH trust on 110 and makes the strict SSH checks from 110 to 120 / 188 fail. Hosts that have actually recovered are then misjudged as blocked.
Fixed rules:
| Item | Correct practice | Forbidden |
|------|----------|------|
| Deploy-specific host key | Write to `/home/wooo/.ssh/deploy_known_hosts` | Writing to or overwriting `/home/wooo/.ssh/known_hosts` |
| Deploy SSH options | `-o UserKnownHostsFile=/home/wooo/.ssh/deploy_known_hosts` | Sharing the operator / cold-start `known_hosts` |
| Cold-start SSH trust | Keep the verified fingerprints for 120 / 188; back up before any repair | Running `ssh-keygen -R` or rebuilding the whole file without fingerprint cross-verification |
| Verification | After CD, check `known_hosts` mtime, the 120 / 188 entries, and strict SSH | Looking only at the CD success badge |
2026-06-13 repair anchors:
- Source fix: Gitea main contains `80e6ec1a fix(ci): avoid clobbering runner known hosts`.
- Deploy marker: after `e4a349bc chore(cd): deploy 414413a [skip ci]`, `/home/wooo/.ssh/known_hosts` mtime still sits at `2026-06-13 01:20:02 +0800`, so CD did not overwrite it.
- Deploy isolated file: `/home/wooo/.ssh/deploy_known_hosts` mtime `2026-06-13 01:24:05 +0800`.
- Global strict entries: the 120 ED25519 entry on line 4 and the 188 ED25519 entry on line 5 are still present; strict SSH to `wooo@192.168.0.120` and `ollama@192.168.0.188` must pass.
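The "Verification" row above (mtime unchanged, 120 / 188 entries still present) can be sketched as a post-CD check. This is an illustrative helper, not part of the actual CD workflow: `known_hosts_guardrail_ok` is a hypothetical name, the mtime recorded before the deploy is assumed to be captured by the caller, and the file is assumed to use plain (unhashed) host names, since hashed `|1|...` entries cannot be matched by string comparison.

```python
import os


def known_hosts_guardrail_ok(path: str, mtime_before: float, required_hosts: list) -> bool:
    """Check that CD left the global known_hosts untouched.

    Passes only when the file mtime equals the value recorded before the
    deploy and every required host still has an entry.
    """
    if os.stat(path).st_mtime != mtime_before:
        return False  # something rewrote the file during the deploy

    hosts = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if not fields or fields[0].startswith("#"):
                continue
            if fields[0].startswith("@"):
                fields = fields[1:]  # drop markers such as @cert-authority
            if len(fields) < 3:
                continue  # not a "hosts keytype key" entry
            hosts.update(fields[0].split(","))
    return all(host in hosts for host in required_hosts)
```

A caller would record `os.stat(path).st_mtime` before the deploy and run the check afterwards with `["192.168.0.120", "192.168.0.188"]`. It complements the strict SSH probe rather than replacing it, because an entry being present does not prove the fingerprint still matches the host.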
---
## 3. P0 Evidence And Network
@@ -1135,7 +1156,24 @@ SOP update:
| momo scheduler | scorecard reads `SCHEDULER_RECENT_ACTIVITY 1070` after detector fix |
| SOP change | v1.5 adds startup judgment layers, GO/NO-GO tree, host recovery cards, and timeline checks |
### 14.9 Post-Reboot Timeline Verification
### 14.9 2026-06-13 Post-CD Recovery Comparison Anchor
2026-06-13 was not a host reboot; it is a comparison anchor used to verify whether "120/121 workload balancing + CD known_hosts guardrail" can withstand the next normal deployment.
| Item | 2026-06-13 01:26 baseline |
|------|----------------------------|
| Gitea / ArgoCD | Gitea main `e4a349bc`, ArgoCD revision `e4a349bc`, sync `Synced` |
| K3s image readback | API/Web/Worker image tag `414413a59268eedd391648f112e228716dd05362` |
| K3s placement | API pods split `mon1` / `mon`, Web pods split `mon` / `mon1`, Worker single replica on `mon` |
| Cold-start | `PASS=83 WARN=0 BLOCKED=0` |
| Public routes | Scorecard verifies awoooi API/Web, momo, gitea, harbor, registry, sentry, signoz, stock, langfuse, bitan, aiops over TLS |
| Backup | `backup-status`: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5` |
| Offsite | textfile `remote_verify_ok=1`, `full_verify_fresh=1`, 13 repos each `snapshot_count=1` |
| SSH trust | Global `known_hosts` retained 120 / 188 entries after CD; deploy-specific trust moved to `deploy_known_hosts` |
| Remaining non-service debt | `km-vectorize` waits for official 03:00 `lastSuccessfulTime`; credential escrow missing count remains `5`; 188 host degraded only by certbot units |
| SOP change | v1.8 adds CD / SSH trust guardrail and post-CD comparison baseline |
### 14.10 Post-Reboot Timeline Verification
After every reboot, advance along the timeline step by step; do not wait until the end to judge everything at once.
@@ -1170,6 +1208,7 @@ All must be true:
- Runners are guarded and released last.
- AI auto-remediation is not in full execution mode until all gates are green.
- 110 cold-start textfile monitor is fresh, and Prometheus has the cold-start alert rules loaded.
- 110 global `/home/wooo/.ssh/known_hosts` still contains verified 120 / 188 entries after any CD run; deploy jobs use `/home/wooo/.ssh/deploy_known_hosts` only.
### 15.1 Claimable States


@@ -11,15 +11,15 @@
| Area | Status | Completion | Evidence |
|------|--------|------------|----------|
| Overall recovery readiness | SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED | 95% | 2026-06-13 01:04 final cold-start scorecard is `PASS=83 WARN=0 BLOCKED=0`; 120/121 K3s are both `Ready control-plane`, backup core blockers remain `0`, public routes/TLS/momo DB/schedules/Alertmanager are green, and API/Web are now spread across 120 / 121. Remaining blocker is DR-only credential escrow evidence (`escrow_missing=5`); ArgoCD `km-vectorize` is tracked separately as governance health debt until its official scheduled Job refreshes `lastSuccessfulTime`. |
| Overall recovery readiness | SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED | 95% | 2026-06-13 01:26 final cold-start scorecard is `PASS=83 WARN=0 BLOCKED=0`; 120/121 K3s are both `Ready control-plane`, backup core blockers remain `0`, public routes/TLS/momo DB/schedules/Alertmanager are green, API/Web remain spread across 120 / 121 after deploy marker `e4a349bc`, and CD no longer clobbers global `known_hosts`. Remaining blocker is DR-only credential escrow evidence (`escrow_missing=5`); ArgoCD `km-vectorize` is tracked separately as governance health debt until its official scheduled Job refreshes `lastSuccessfulTime`. |
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; host is reachable, root is mounted `rw`, failed units `0`, `mon` and `mon1` are both `Ready control-plane`, and cold-start P0/P1 checks are green. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 90% | 2026-06-12 15:54 `backup-all` completed `13/13`; 17:37 offsite full sync completed `13/13`; 18:55 verifier returned `REMOTE_LATEST_ONLY_OK=1`, `VERIFY_OK=1`, `FAILED=0`; 18:55 backup-status has `core_blockers=0` but `escrow_missing=5`. |
| P2 service / data truth | VERIFIED_WORKLOAD_BALANCED | 100% | 2026-06-13 post-rollout cold-start is green; public routes/TLS are green, VIP API/Web are reachable, momo current-month parity is `4571/4571` with matching date bounds, schedules/services are green. API/Web now both have 120 / 121 split placement. |
| P3 docs / automation contracts | DONE_WITH_VALIDATION_GAP | 100% | Workplan, SOP v1.7, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, T+0/T+60 timeline checks, host role / load-balancing assessment, and `km-vectorize` remediation tracking are updated; Ansible syntax check is unavailable on this workstation. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 90% | 2026-06-13 01:26 `backup-status` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`; 01:28 offsite textfile has `remote_verify_ok=1`, `full_verify_fresh=1`, and all 13 repos `snapshot_count=1`; 01:27 Alertmanager exposes the five expected escrow gap alerts and Prometheus has all five required alert rule names healthy. |
| P2 service / data truth | VERIFIED_WORKLOAD_BALANCED | 100% | 2026-06-13 01:26 cold-start is green; public routes/TLS are green, VIP API/Web are reachable, momo current-month parity is `4571/4571` with matching date bounds, schedules/services are green. API/Web both keep 120 / 121 split placement after latest deploy marker `e4a349bc`. |
| P3 docs / automation contracts | DONE_WITH_VALIDATION_GAP | 100% | Workplan, SOP v1.8, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, T+0/T+60 timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, and `km-vectorize` remediation tracking are updated; Ansible syntax check is unavailable on this workstation. |
Full cold-start may be declared green only for the latest verified evidence set. Do not declare DR scorecard complete while credential escrow evidence remains blocked.
2026-06-13 01:04 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing is live verified after Gitea main `acaae999` and ArgoCD sync; do not declare DR complete while credential escrow is missing. `km-vectorize` remediation is `90%`: schedule/label fix is in Gitea main `47ee96b0`, ArgoCD has synced it live, and the remaining gate is the next official 03:00 CronJob success readback.
2026-06-13 01:26 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing survived the next normal CD deploy: Gitea main `e4a349bc`, ArgoCD revision `e4a349bc`, images from `414413a5`, API/Web split across `mon` / `mon1`, and global `known_hosts` retained 120 / 188 after CD fix `80e6ec1a`. Do not declare DR complete while credential escrow is missing. `km-vectorize` remediation is `90%`: schedule/label fix is live, and the remaining gate is the next official 03:00 CronJob success readback.
---
@@ -55,6 +55,7 @@ Full cold-start may be declared green only for the latest verified evidence set.
| 2026-06-12 120 recovery closeout | SERVICE_GREEN_DR_ESCROW_BLOCKED | 120 root fsck was completed from console/initramfs and booted at `15:13`; 15:54 backup-all finished `13/13`; 17:37 full offsite sync finished `13/13`; 18:55 offsite verifier returned `REMOTE_LATEST_ONLY_OK=1`, `VERIFY_OK=1`, `FAILED=0`; 18:55 backup-status shows `core_blockers=0`, `escrow_missing=5`; 18:57 cold-start is `PASS=83 WARN=0 BLOCKED=0`. |
| 2026-06-13 live refresh | SERVICE_GREEN_WORKLOAD_BALANCED_DR_ESCROW_BLOCKED | 00:13 backup status: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`; 00:33 cold-start exposed 110 `known_hosts` drift for 120 / 188, fixed after backup `/home/wooo/.ssh/known_hosts.before-120-188-refresh.20260613-003416`; 00:34 final cold-start: `PASS=83 WARN=0 BLOCKED=0`; live K3s has `mon` / `mon1` Ready, API/Web are split 120 / 121. 188 host is `degraded` only because `certbot.service` and `snap.certbot.renew.service` failed; ArgoCD remains Degraded because `km-vectorize` CronJob last success is stale. Manual Job `km-vectorize-codex-002709` did not leave verified completion evidence, so this remains open. |
| 2026-06-13 `km-vectorize` health remediation | IN_PROGRESS_90 | 00:50 live readback: CronJob `lastScheduleTime=2026-06-12T11:00:00Z`, `lastSuccessfulTime=2026-06-04T11:00:37Z`; retained 6/2, 6/3, 6/4 Jobs are `Complete`, latest visible pod log returned `embed-all: 200 {"total":32,"success":32,"failed":0}`. Gitea main `47ee96b0` and ArgoCD sync now corrected live spec to `schedule=0 3 * * *`, `timeZone=Asia/Taipei`, with Job/Pod labels `app/component/environment/phase/system`. 01:04 cold-start is `PASS=83 WARN=0 BLOCKED=0`. Next gate is the official 03:00 CronJob success readback. |
| 2026-06-13 post-CD trust / workload verification | SERVICE_GREEN_CD_GUARDRAIL_HELD | Gitea main advanced to deploy marker `e4a349bc chore(cd): deploy 414413a [skip ci]`; ArgoCD revision is `e4a349bc`, sync `Synced`, health still `Degraded` only by `km-vectorize` stale success. Live K3s image readback uses `414413a59268eedd391648f112e228716dd05362`; API pods split `mon1` / `mon`, Web pods split `mon` / `mon1`, Worker is single replica on `mon`. 01:28 `/home/wooo/.ssh/known_hosts` mtime remains `2026-06-13 01:20:02 +0800` with 120 / 188 entries present; deploy-specific `/home/wooo/.ssh/deploy_known_hosts` mtime is `01:24:05`, proving CD fix `80e6ec1a` stopped clobbering global trust. 01:26 cold-start: `PASS=83 WARN=0 BLOCKED=0`. |
---
@@ -140,7 +141,7 @@ Next: <single next action>
| P2-004 | DONE | 100 | PostgreSQL index corruption runbook path | SOP v1.2 now states `posting list tuple ... cannot be split` is an index repair incident. | Use only concurrent reindex if the error returns. | No truncate, no whole DB restore; `REINDEX TABLE CONCURRENTLY public.realtime_sales_monthly;` and idempotent resync evidence recorded. |
| P2-005 | VERIFIED | 100 | Do not rely on route 200 only | 2026-06-12 closeout has route + DB + backup + offsite + schedule + alert + K3s + cold-start scorecard evidence. The only remaining blocker is DR credential escrow, outside service availability. | Keep this cross-surface checklist mandatory after every reboot. | Each reboot record has route, DB, backup, schedules, alert, scorecard rows. |
| P2-006 | DONE | 100 | Validate momo scheduler WARN | 2026-06-12 post-reboot regression showed the old detector was too narrow for Chinese batch and `[Feeder]` logs. The detector was widened and deployed to 110; 14:47 scorecard reads `SCHEDULER_RECENT_ACTIVITY 1070` and marks scheduler healthy. | Keep normal monitoring; treat future recurrence as detector tuning only if direct logs remain active. | Container healthy, direct log activity exists, and latest scorecard removed this WARN. |
| P2-007 | DONE | 100 | Balance K3s AWOOOI workload across 120 / 121 | Gitea main `acaae999` adds topology spread for API/Web/Worker. ArgoCD synced revision `acaae99986aee2e1f5630984981ccb0f2b676bb8`, live deployments have non-empty `topologySpreadConstraints`, API/Web are each split 120 / 121, and 00:34 final cold-start is `PASS=83 WARN=0 BLOCKED=0`. | Watch the next normal deploy to confirm topology spread remains effective. | Live deployment has non-empty topology spread, API/Web placement max skew <= 1, public routes green, cold-start `WARN=0 BLOCKED=0`. |
| P2-007 | DONE | 100 | Balance K3s AWOOOI workload across 120 / 121 | Gitea main `acaae999` adds topology spread for API/Web/Worker. ArgoCD later synced deploy marker `e4a349bc`; live deployments still have split placement after a normal CD rollout: API pods on `mon1` / `mon`, Web pods on `mon` / `mon1`, Worker single replica on `mon`; 01:26 final cold-start is `PASS=83 WARN=0 BLOCKED=0`. | Keep watching future deploys; do not manually delete pods unless placement drift becomes a real service or HA gate. | Live deployment has non-empty topology spread, API/Web placement max skew <= 1 after normal CD, public routes green, cold-start `WARN=0 BLOCKED=0`. |
---
@@ -155,10 +156,11 @@ Next: <single next action>
| P3-005 | DONE | 100 | Update cold-start SOP | SOP now includes start, shutdown, reboot, record, comparison, and 120 blocker handling. | Increment SOP version after each process change. | SOP has controlled power-operation sections and ledger template. |
| P3-006 | DONE | 100 | Update backup status | Backup status now reflects current cron, rclone latest-only, failure-only alert posture, and escrow blocker. | Refresh after 120 backup rerun. | Backup status no longer claims noisy success Telegram notifications. |
| P3-007 | DONE | 100 | Harden Gitea backup stale dump handling | 2026-06-05 manual Gitea backup failed because the container retained `/tmp/gitea-dump.zip` from the 02:00 failure. `scripts/backup/backup-gitea.sh` now renames stale container dump files to timestamped evidence before running a new dump, and the live 110 script is updated. | Watch the next 02:00 Gitea backup. | `bash -n` passes locally and on 110; manual Gitea backup completed after stale evidence rename. |
| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.6 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, 2026-06-12 comparison anchor, T+0/T+60 verification timeline, AA/AS determination, workload distribution determination, and allowed declaration wording. | Use v1.6 for the next reboot record, then compare actual timing and blockers against §14.8 / §14.9. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, and `WORKLOAD_BALANCED`, and blocks false green while 120 / escrow remain red. |
| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.8 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, 2026-06-12 post-reboot anchor, 2026-06-13 post-CD trust/workload anchor, T+0/T+60 verification timeline, AA/AS determination, workload distribution determination, CD SSH trust guardrail, and allowed declaration wording. | Use v1.8 for the next reboot record, then compare actual timing and blockers against §14.8 / §14.9 / §14.10. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, and `WORKLOAD_BALANCED`, and blocks false green while escrow or governed CronJob debt remains red. |
| P3-009 | DONE | 100 | Assess 120/121 AA/AS role and host load balancing | 2026-06-12 15:19 live check confirms 120 and 121 are both `Ready control-plane`, `k3s active`, `k3s-agent inactive`, with no taints; however most AWOOOI / ArgoCD / Velero workload remains on 121 after 120 fsck recovery. New assessment defines control-plane AA vs workload AA, migration candidates from 110/188, and stateful migration blockers. | After P0 backup/offsite/cold-start green, implement topology spread for AWOOOI API/Web before moving additional services. | `docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md` exists; SOP v1.6 links AA/AS and load-balancing checks; migration implementation remains explicitly `0%`. |
| P3-010 | DONE | 100 | Update workload balancing docs with 2026-06-13 live truth | Host role assessment, workplan, SOP, backup status, and LOGBOOK are refreshed with current cold-start, backup, 188 certbot degraded, ArgoCD `km-vectorize` degraded, Gitea main `acaae999`, ArgoCD sync, and final pod placement evidence. | Keep updating this file after the next reboot or deploy. | Docs separate service-green status from DR escrow, workload rollout, and non-service governance debt. |
| P3-011 | DONE | 100 | Record `km-vectorize` remediation status | LOGBOOK, this workplan, and SOP now state the schedule/label fix, ArgoCD sync evidence, the invalid manual Job boundary, and the 90% waiting-for-next-schedule gate. | After next 03:00 run, update this row and the top verdict with `lastSuccessfulTime` / ArgoCD health evidence. | No document claims ArgoCD green before official CronJob success evidence exists. |
| P3-012 | DONE | 100 | Prevent CD from clobbering cold-start SSH trust | Source fix `80e6ec1a` changes Gitea CD workflows to use deploy-specific `deploy_known_hosts` and `UserKnownHostsFile`; post-deploy marker `e4a349bc` proves global `/home/wooo/.ssh/known_hosts` retained 120 / 188 entries. SOP v1.8 records this as a release guardrail. | Keep the guardrail in future workflow reviews; any `> ~/.ssh/known_hosts` in deploy code is a release blocker. | CD success plus post-CD `known_hosts` readback and strict SSH checks to 120 / 188 remain green. |
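
The P3-012 review rule (any `> ~/.ssh/known_hosts` in deploy code is a release blocker) could be backed by a small scanner. This is a minimal sketch under stated assumptions, not the project's actual check: `CLOBBER`, `find_clobbers`, and the sample workflow text are hypothetical, the IP is a documentation address, and the pattern only flags truncating redirects (`>`), not appends (`>>`).

```python
import re

# A truncating redirect whose target path ends in ".ssh/known_hosts".
# Deploy-specific files such as ".ssh/deploy_known_hosts" do not match.
CLOBBER = re.compile(
    r"(?<!>)>(?!>)\s*[\"']?[^\s\"']*\.ssh/known_hosts(?![\w.-])"
)


def find_clobbers(workflow_text):
    """Return (line_number, line) pairs that overwrite the global known_hosts."""
    hits = []
    for number, line in enumerate(workflow_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # ignore shell/YAML comments
        if CLOBBER.search(line):
            hits.append((number, stripped))
    return hits


# Hypothetical deploy step contents, for illustration only.
sample = """\
ssh-keyscan -H 192.0.2.10 > ~/.ssh/known_hosts
ssh-keyscan -H 192.0.2.10 > ~/.ssh/deploy_known_hosts
ssh -o UserKnownHostsFile=~/.ssh/deploy_known_hosts deploy@192.0.2.10 true
"""

for number, line in find_clobbers(sample):
    print(f"release blocker at line {number}: {line}")
```

Only the first sample line is reported; the deploy-specific file and the `UserKnownHostsFile` option pass. A scanner like this complements, and does not replace, the post-CD `known_hosts` readback named in the acceptance column.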
---
@@ -192,6 +194,16 @@ Do not run `truncate`, whole DB restore, force-push, DROP, or online root filesy
## 9. Progress Updates
```text
2026-06-13 01:29 Asia/Taipei
Phase: P0/P1/P2/P3
Before: Overall 95%, P1 90%, P2 100%, P3 100%
After: Overall 95%, P1 90%, P2 100%, P3 100%
Evidence: Gitea main e4a349bc; ArgoCD revision e4a349bc sync=Synced health=Degraded only by km-vectorize stale success; K3s images 414413a59268eedd391648f112e228716dd05362; API/Web split across mon/mon1; /home/wooo/.ssh/known_hosts retained 120/188 after CD fix 80e6ec1a; backup-status 110 13/13 fresh failed=0 and 188 2/2 fresh failed=0; offsite textfile remote_verify_ok=1 and 13 repos snapshot_count=1; backup alert live visibility OK; all five required Prometheus alert rule names health=ok; cold-start PASS=83 WARN=0 BLOCKED=0.
Blocked: yes, but only for the DR-complete declaration, because 5 credential escrow evidence markers are still missing; a fully healthy ArgoCD still waits for the official 03:00 km-vectorize lastSuccessfulTime.
Next: after 03:00 Asia/Taipei, verify km-vectorize official Job completion and ArgoCD health; keep escrow alerts firing until real non-secret evidence IDs are written.
```
```text
2026-06-04 15:23 Asia/Taipei
Phase: P3