From e4a85847c3aa5e8273058bb50b30e0696c3195f8 Mon Sep 17 00:00:00 2001
From: ogt
Date: Fri, 26 Jun 2026 07:29:57 +0800
Subject: [PATCH] docs(ops): add 188 hygiene maintenance runbook [skip ci]

---
 docs/LOGBOOK.md | 20 +++
 docs/runbooks/FULL-STACK-COLD-START-SOP.md | 2 +-
 .../HOST-188-HYGIENE-MAINTENANCE-RUNBOOK.md | 168 ++++++++++++++++++
 ...REDENTIAL-ESCROW-EVIDENCE-OWNER-REQUEST.md | 21 ++-
 ...oot-cold-start-backup-recovery-workplan.md | 2 +-
 5 files changed, 209 insertions(+), 4 deletions(-)
 create mode 100644 docs/runbooks/HOST-188-HYGIENE-MAINTENANCE-RUNBOOK.md

diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 70b0f32f..94f869dd 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -44751,3 +44751,23 @@ production browser smoke:
 - 建立 188 維護窗口與 rollback owner:決定 host PostgreSQL `14/main` 是廢棄 cluster、需從 backup restore、或需 break-glass WAL/FS 修復;不得直接 `pg_resetwal`。
 - 建立 certbot / DNS challenge 修復窗口:先確認 SAN / wildcard / public gateway owner evidence,再處理 rate-limit 後的 renew。
 - 補五個 credential escrow non-secret evidence marker,讓 DR scorecard 從 `BLOCKED` 轉為可驗收。
+
+## 2026-06-26 — 188 host hygiene 維護窗口 runbook 與 escrow live evidence 補強
+
+**時間與來源**:
+- 2026-06-26 07:24 Asia/Taipei。
+- 來源:前一輪 188 `systemctl status` / `pg_lsclusters` read-only、110 backup-status 07:18、既有 Backup / Restore / Escrow 與 DNS / TLS owner acceptance 文件。
+
+**完成內容**:
+- 新增 `docs/runbooks/HOST-188-HYGIENE-MAINTENANCE-RUNBOOK.md`,把 188 host PostgreSQL `14/main` checkpoint/WAL failure、certbot rate-limit / challenge failure、`awoooi-startup.service` root `pg_resetwal` 失敗,拆成維護窗口決策樹。
+- `FULL-STACK-COLD-START-SOP.md` 保持 service green 判定,但新增指向 188 專用 runbook,避免 cold-start 自動腳本承擔資料層 break-glass 修復。
+- `CREDENTIAL-ESCROW-EVIDENCE-OWNER-REQUEST.md` 補入 2026-06-26 07:18 live backup evidence:110 `13/13 fresh failed=0`、188 `2/2 fresh failed=0`、offsite/rclone fresh、`core_blockers=0`、`escrow_missing=5`。
+
+**做過的命令類型**:
+- 寫入:repo docs-only。
+- 未做:host / Docker / systemd / Nginx / firewall / K8s / DB / Wazuh runtime 寫操作;未執行 `pg_resetwal`、certbot renew、Nginx reload、restore、`reset-failed`;未收 secret value。
+
+**目前判定**:
+- 188 service recovery 仍 `GREEN`。
+- 188 host hygiene evidence package 從 `80%` 推進到 `90%`;runtime repair 仍 `0%`。
+- Credential escrow owner request package 維持 `READY_TO_DISPATCH`;marker write 仍 `0%`,不得用 placeholder 或 secret 補齊。
diff --git a/docs/runbooks/FULL-STACK-COLD-START-SOP.md b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
index 3f37e114..d2035f37 100644
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -12,7 +12,7 @@
 若只是重啟後要快速判斷能不能宣稱恢復,先跑一頁式總檢查:`scripts/reboot-recovery/post-start-quick-check.sh --no-color`,並以 `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` 作為人工 fallback。長 SOP 保留完整背景、例外處理與 Plan B;短版 wrapper / checklist 負責每次 T+10 分鐘內的固定判定。

-2026-06-26 07:19 follow-up:`gitea/main` 已包含前一輪 SOP 文件 commit `1fd5e2a8`,ArgoCD `awoooi-prod` 讀回 `Synced / Healthy`,revision `1fd5e2a8b0f18d24eed16aa2a44286bcbf230603`,API `2/2`、Web `2/2`、Worker `1/1`,pods `restart=0`。重跑 full cold-start 仍是 `PASS=87 WARN=0 BLOCKED=0`,result `GREEN`。直接 public route 讀回:AWOOOI API `200`、AWOOOI Web `307`、VibeWork `200`、AwoooGo `200`、MOMO health `200`、Stock freshness `200`、Bitan `200`、Gitea `200`、Harbor `200`、Registry `/v2/` expected `401`、Sentry expected `302`、SigNoz `200`、Langfuse `200`。188 blocker 精準分類:`pg_lsclusters` 顯示 host PostgreSQL `14/main` down,`systemctl status postgresql@14-main` 顯示 `invalid primary checkpoint record` 與 `PANIC: could not locate a valid checkpoint record`;`certbot.service` 顯示 `sentry.wooo.work` renew rate-limited,`snap.certbot.renew.service` 顯示 challenge failed;`awoooi-startup.service` 曾嘗試以 root 執行 `pg_resetwal` 並失敗。本輪不執行 `pg_resetwal`、不 `reset-failed`、不重啟 service;188 需用獨立維護窗口、rollback owner、restore/source-of-truth plan 處理。110 load 已降到約 `4.83 / 4.82 / 5.52`,top CPU 是 active AWOOOI Web `turbo build` / Docker buildx;Swap 仍滿但 memory available 約 `41Gi`,本輪不手動清 swap。整體宣告仍是 `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`。
+2026-06-26 07:19 follow-up:`gitea/main` 已包含前一輪 SOP 文件 commit `1fd5e2a8`,ArgoCD `awoooi-prod` 讀回 `Synced / Healthy`,revision `1fd5e2a8b0f18d24eed16aa2a44286bcbf230603`,API `2/2`、Web `2/2`、Worker `1/1`,pods `restart=0`。重跑 full cold-start 仍是 `PASS=87 WARN=0 BLOCKED=0`,result `GREEN`。直接 public route 讀回:AWOOOI API `200`、AWOOOI Web `307`、VibeWork `200`、AwoooGo `200`、MOMO health `200`、Stock freshness `200`、Bitan `200`、Gitea `200`、Harbor `200`、Registry `/v2/` expected `401`、Sentry expected `302`、SigNoz `200`、Langfuse `200`。188 blocker 精準分類:`pg_lsclusters` 顯示 host PostgreSQL `14/main` down,`systemctl status postgresql@14-main` 顯示 `invalid primary checkpoint record` 與 `PANIC: could not locate a valid checkpoint record`;`certbot.service` 顯示 `sentry.wooo.work` renew rate-limited,`snap.certbot.renew.service` 顯示 challenge failed;`awoooi-startup.service` 曾嘗試以 root 執行 `pg_resetwal` 並失敗。本輪不執行 `pg_resetwal`、不 `reset-failed`、不重啟 service;188 需用獨立維護窗口、rollback owner、restore/source-of-truth plan 處理,詳見 `docs/runbooks/HOST-188-HYGIENE-MAINTENANCE-RUNBOOK.md`。110 load 已降到約 `4.83 / 4.82 / 5.52`,top CPU 是 active AWOOOI Web `turbo build` / Docker buildx;Swap 仍滿但 memory available 約 `41Gi`,本輪不手動清 swap。整體宣告仍是 `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`。

 2026-06-26 07:02 全主機 live refresh:`110 / 120 / 121 / 188 / 112 / 111 / 168` ping 與 SSH port 全部 OK。110 `systemctl=running`、failed units `0`,但 load `5.83 / 7.26 / 5.77` 且 top CPU 是 AWOOOI Web `next build`,Swap 仍 `7.8Gi / 7.8Gi`;這是 CI/build 壓力,不是 orphan Chrome 或 Docker 事故。120 / 121 `systemctl=running`、K3s active,nodes `mon` / `mon1` 均為 Ready。ArgoCD `awoooi-prod` 在 06:57 曾短暫 `OutOfSync / Progressing`,因 deploy marker `52f61da4` rollout 正在替換 API/Web/Worker;07:00 後已穩定為 `Synced / Healthy`,API `2/2`、Web `2/2`、Worker `1/1`,API/Web 仍跨 `mon` / `mon1`。重跑 live cold-start:`PASS=87 WARN=0 BLOCKED=0`,result `GREEN`。StockPlatform `/api/v1/system/freshness` 曾在容器剛重啟約 35 秒時短暫 `502`,後續連續讀回皆 `200` 且 `status=ok`、`latest_trading_date=2026-06-25`、blockers `[]`;這類 rollout warmup 只有連續失敗才算 blocker。MOMO health 是 `V10.699`,cold-start direct evidence 仍顯示 current-month parity `15383 / 15383` 截至 `2026-06-24`,daily freshness `1|2026-06-24`。Backup status 06:58:110 `13/13 fresh failed=0`、188 `2/2 fresh failed=0`、`core_blockers=0`、offsite/rclone fresh、`last_backup_all=2026-06-26 02:31:02`、`escrow_missing=5`。188 產品容器健康,但 host `systemctl=degraded` 仍是真實 host hygiene blocker:`awoooi-startup.service`、`postgresql@14-main.service`、`certbot.service`、`snap.certbot.renew.service` failed。112 Wazuh manager/indexer/dashboard active,ports `1514 / 1515 / 55000` listen,但 production Wazuh route 仍回報 `disabled_waiting_iwooos_wazuh_owner_gate`、`configured=false`、manager registry accepted `0`、runtime gate `0`。111 / 168 可連線,但兩邊 AWOOOI dev workspaces 皆 ahead 17 且 HEAD 不同(`111=56c83257`、`168=59485d51`);Mac Mini `/System/Volumes/Data` 只剩約 `3.2Gi`。目前 service recovery 宣告維持 `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`,host hygiene / DR escrow / Wazuh registry / workstation capacity 明確列為 service green 之外的 blocker。
diff --git a/docs/runbooks/HOST-188-HYGIENE-MAINTENANCE-RUNBOOK.md b/docs/runbooks/HOST-188-HYGIENE-MAINTENANCE-RUNBOOK.md
new file mode 100644
index 00000000..b5807f99
--- /dev/null
+++ b/docs/runbooks/HOST-188-HYGIENE-MAINTENANCE-RUNBOOK.md
@@ -0,0 +1,168 @@
+# 188 Host Hygiene 維護窗口 Runbook
+
+> 版本:2026-06-26.v1
+> 適用範圍:`192.168.0.188` host PostgreSQL `14/main`、`awoooi-startup.service`、certbot / ACME renewal hygiene。
+> 狀態:`SERVICE_GREEN_HOST_HYGIENE_BLOCKED`
+
+---
+
+## 1. 目的
+
+188 目前的產品服務與 public routes 可用,但 host systemd 仍是 `degraded`。這份 runbook 將「服務已恢復」與「host hygiene 未收斂」分開,避免把 route `200`、container healthy、`pg_isready` 或 exporter green 誤判成 host 已完全健康。
+
+本文件不是維護窗口批准,也不是 runtime action 授權。未取得 owner、rollback owner、maintenance window、backup / restore evidence 與 post-check plan 前,不得執行 restart、reset-failed、renew、reload、restore 或 `pg_resetwal`。
+
+---
+
+## 2.
最新只讀證據
+
+2026-06-26 07:18-07:19 Asia/Taipei read-only evidence:
+
+| 層級 | 證據 | 判定 |
+|------|------|------|
+| 188 disk | `/` 約 `982G`,使用 `712G`,可用 `230G`,`76%` | 不是容量耗盡 |
+| 產品容器 | MOMO、VibeWork、2026FIFA、ClawBot、MinIO、exporters、MOMO DB 等容器可見 healthy / up | 產品服務 green |
+| Host systemd | failed units:`awoooi-startup.service`、`postgresql@14-main.service`、`certbot.service`、`snap.certbot.renew.service` | host hygiene blocked |
+| Host PostgreSQL | `pg_lsclusters` 顯示 `14/main down` | host cluster 不健康 |
+| PostgreSQL status | `invalid primary checkpoint record`、`PANIC: could not locate a valid checkpoint record` | checkpoint / WAL 類資料層錯誤 |
+| Startup unit | 曾嘗試以 root 執行 `pg_resetwal`,並失敗 | 自動修復邏輯必須 fail-closed |
+| certbot apt unit | `sentry.wooo.work` renewal rate-limited | 不可重複猛打 ACME |
+| certbot snap unit | `sentry.wooo.work` challenge failed | 需先確認 ACME route / DNS / gateway owner |
+| public cert | shared `sentry.wooo.work` certificate valid until `2026-07-09 16:03:40 UTC` | 有維護窗口,不是立即過期 |
+
+---
+
+## 3. GO / NO-GO
+
+| Gate | GO 條件 | NO-GO 條件 |
+|------|---------|------------|
+| 服務層 | MOMO / VibeWork / 2026FIFA / SigNoz / public routes 可用 | 產品 route 或 DB 容器同時故障時,先走 incident triage |
+| PostgreSQL | DB owner 確認 host `14/main` 是否仍為 authoritative source | 無法確認用途就不得 reset / disable / restore |
+| Backup | 有可追溯 backup ref、restore target isolation、rollback owner | 沒有 backup / restore evidence 就不得 `pg_resetwal` 或 restore |
+| certbot | DNS / ACME / public gateway owner response accepted,rate-limit 冷卻期明確 | ACME route 未確認、仍被 rate-limit、或缺 rollback owner |
+| 維護窗口 | owner、窗口、post-check、operator notification 完整 | 只有「服務看起來綠」或「想清 failed units」 |
+
+---
+
+## 4. PostgreSQL `14/main` 決策樹
+
+1. **先判定用途**
+   - 是否仍被任何 production service 使用。
+   - 是否只是舊的 host cluster,產品 DB 已由 containerized PostgreSQL 接管。
+   - 是否仍承載 K3s Kine、MOMO、monitoring、legacy app 或 operator tooling。
+
+2. **若確認 host cluster 已廢棄**
+   - 準備 owner disposition:廢棄原因、affected scope、最後 backup / dump ref、保留期限。
+   - 維護窗口內才可停用 unit 或調整 startup script 的 host PG dependency。
+   - 停用後仍需 post-check:public routes、MOMO health、SigNoz、exporters、backup-status、cold-start。
+
+3. **若確認 host cluster 仍需要保留**
+   - 先建立隔離 restore target,不直接覆蓋 production data directory。
+   - 先比對 backup / WAL / data directory metadata。
+   - 只有 DB owner 明確接受資料風險與 rollback plan 後,才可進入 break-glass 修復。
+
+4. **禁止預設路徑**
+   - 不得直接 `pg_resetwal`。
+   - 不得直接 `rm -rf /var/lib/postgresql/14/main`。
+   - 不得直接 `systemctl reset-failed` 假綠。
+   - 不得把 container `momo-db` healthy 當成 host `14/main` healthy。
+
+---
+
+## 5. certbot / ACME 決策樹
+
+1. **先停止誤判**
+   - `certbot.service` failed 目前代表 renewal 未成功,不代表現有憑證立即失效。
+   - `sentry.wooo.work` cert 仍有效到 `2026-07-09 16:03:40 UTC`。
+
+2. **先做 owner evidence**
+   - domain / SAN / shared certificate coverage ref。
+   - ACME challenge route owner。
+   - renewal owner。
+   - rollback owner。
+   - validation plan。
+
+3. **維護窗口內才可做**
+   - `nginx -t`。
+   - ACME challenge route smoke。
+   - certbot dry-run 或 renew。
+   - Nginx reload。
+
+4. **禁止事項**
+   - rate-limit 期間反覆 renew。
+   - 未確認 challenge route 就 renew。
+   - 同時保留 apt certbot timer 與 snap certbot timer 的重複失敗狀態。
+   - 貼出 private key、ACME account key、DNS credential、未脫敏 certbot log。
+
+---
+
+## 6. 維護窗口最小步驟
+
+此段只定義順序;不代表目前已批准執行。
+
+### Phase A:維護前只讀快照
+
+```bash
+systemctl --failed --no-pager
+pg_lsclusters
+systemctl status postgresql@14-main.service --no-pager
+systemctl status certbot.service snap.certbot.renew.service --no-pager
+docker ps --format '{{.Names}}\t{{.Status}}\t{{.Ports}}'
+/backup/scripts/backup-status.sh --no-notify
+/home/wooo/scripts/full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1
+```
+
+### Phase B:owner 決策
+
+| 決策 | 必填 evidence |
+|------|---------------|
+| host PostgreSQL 廢棄 | owner disposition、affected services、last backup ref、retention owner |
+| host PostgreSQL restore | source backup ref、isolated restore target、checksum / row-count plan、rollback owner |
+| certbot route 修復 | certificate coverage ref、ACME owner、challenge route plan、post-renew validation |
+
+### Phase C:受控修復
+
+只在 maintenance window 內由 operator 執行;此文件不提供自動 apply。
+
+### Phase D:post-check
+
+```bash
+systemctl --failed --no-pager
+pg_lsclusters
+curl -fsS https://mo.wooo.work/health
+curl -fsS https://stock.wooo.work/api/v1/system/freshness
+curl -fsS https://awoooi.wooo.work/api/v1/health
+/backup/scripts/backup-status.sh --no-notify --no-refresh
+/home/wooo/scripts/full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1
+```
+
+成功標準:
+
+- 產品 routes 仍綠。
+- backup core 仍 `core_blockers=0`。
+- cold-start 維持 `WARN=0 BLOCKED=0`。
+- 若 owner 決定 host PG 廢棄,systemd failed unit 必須以 owner disposition 收斂,不可只是 reset-failed。
+- 若 owner 決定 host PG 修復,必須有 restore / data integrity evidence。
+- 若 certbot 修復,必須有 cert notAfter、ACME route、Nginx config test 與 reload evidence。
+
+---
+
+## 7. 目前完成度
+
+| Lane | 完成度 | 說明 |
+|------|-------:|------|
+| 服務恢復 | `100%` | routes、containers、cold-start、backup core 已 green |
+| 188 host hygiene evidence | `80%` | 只讀 root cause 已明確;缺 owner disposition |
+| PostgreSQL owner decision | `0%` | 尚未確認廢棄、restore 或 break-glass |
+| certbot owner decision | `0%` | 尚未接受 DNS / ACME / coverage evidence |
+| runtime repair | `0%` | 未批准、未執行 |
+
+---
+
+## 8. 不得宣稱
+
+- 不得宣稱 188 host fully green。
+- 不得宣稱 host PostgreSQL 已恢復。
+- 不得宣稱 certbot renewal 已恢復。
+- 不得宣稱 `reset-failed` 後就是修復完成。
+- 不得宣稱這份文件批准任何 runtime action。
diff --git a/docs/security/CREDENTIAL-ESCROW-EVIDENCE-OWNER-REQUEST.md b/docs/security/CREDENTIAL-ESCROW-EVIDENCE-OWNER-REQUEST.md
index 3215301e..901c3db1 100644
--- a/docs/security/CREDENTIAL-ESCROW-EVIDENCE-OWNER-REQUEST.md
+++ b/docs/security/CREDENTIAL-ESCROW-EVIDENCE-OWNER-REQUEST.md
@@ -10,7 +10,7 @@
 | 項目 | 狀態 | 完成度 | 說明 |
 |------|------|-------:|------|
-| 備份核心 | GREEN | 100% | 110 / 188 freshness、Google Drive rclone、latest-only verifier、full cold-start 目前皆為綠燈。 |
+| 備份核心 | GREEN | 100% | 2026-06-26 07:18 readback:110 `13/13 fresh failed=0`、188 `2/2 fresh failed=0`、offsite / rclone fresh、latest-only verifier 與 full cold-start 目前皆為綠燈。 |
 | Escrow script / offsite readiness | READY | 100% | `SCRIPT_MISSING_COUNT=0`、`OFFSITE_CONFIGURED=1`、`RCLONE_CONFIGURED=1`、`READINESS_REQUIRE_CONFIGURED_BLOCKED=0`。 |
 | Credential escrow owner request package | READY_TO_DISPATCH | 80% | 本文件定義 owner 需要提供的非敏感 evidence-id、禁止內容、dry-run 與驗收流程。 |
 | Credential escrow marker | BLOCKED_WAITING_OWNER_EVIDENCE | 0% | 五個 marker 仍未寫入;不得用 placeholder 或 secret 補齊。 |
@@ -22,7 +22,24 @@
 ## 2.
Live Evidence

-2026-06-13 13:10 在 110 只讀檢查:
+2026-06-26 07:18 在 110 只讀檢查:
+
+```text
+每日備份心跳核心正常但 DR 未完成
+110備份=13/13 fresh failed=0
+188備份=2/2 fresh failed=0
+integrity_stale=0
+offsite_configured=1
+offsite_fresh=1
+rclone_gdrive_configured=1
+rclone_gdrive_fresh=1
+escrow_missing=5
+last_backup_all=2026-06-26 02:31:02
+core_blockers=0
+dr_warnings=5
+```
+
+2026-06-13 13:10 在 110 只讀檢查仍保留為原始缺口基準:

 ```text
 missing: restic_repository_password
diff --git a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
index d8621d88..c2d993a0 100644
--- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
+++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
@@ -11,7 +11,7 @@
 | Area | Status | Completion | Evidence |
 |------|--------|------------|----------|
-| Overall recovery readiness | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-26 07:19 read-only follow-up confirms service recovery remains green after the SOP commit reached production. Cold-start rerun returned `PASS=87 WARN=0 BLOCKED=0`, result `GREEN`; ArgoCD `awoooi-prod Synced / Healthy` at revision `1fd5e2a8b0f18d24eed16aa2a44286bcbf230603`; API/Web/Worker pods are Running with restart `0`; public routes return expected statuses including AWOOOI API `200`, AWOOOI Web `307`, VibeWork `200`, AwoooGo `200`, MOMO health `200`, Stock freshness `200`, Bitan `200`, Gitea `200`, Harbor `200`, Registry `/v2/` expected `401`, Sentry expected `302`, SigNoz `200`, Langfuse `200`. Backup-status 07:18 remains 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, offsite/rclone fresh, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`. 188 remains product-service green but host-hygiene blocked: host PostgreSQL `14/main` has `invalid primary checkpoint record` / `could not locate a valid checkpoint record`, certbot renew for `sentry.wooo.work` is rate-limited / challenge-failed, and `awoooi-startup.service` previously attempted `pg_resetwal` as root. Do not declare DR complete until `escrow_missing=0`; Wazuh manager registry accepted remains `0`; 111/168 Codex workspace HEAD drift and Mac Mini low free space remain workstation blockers, not reboot service blockers. |
+| Overall recovery readiness | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-26 07:19 read-only follow-up confirms service recovery remains green after the SOP commit reached production. Cold-start rerun returned `PASS=87 WARN=0 BLOCKED=0`, result `GREEN`; ArgoCD `awoooi-prod Synced / Healthy` at revision `1fd5e2a8b0f18d24eed16aa2a44286bcbf230603`; API/Web/Worker pods are Running with restart `0`; public routes return expected statuses including AWOOOI API `200`, AWOOOI Web `307`, VibeWork `200`, AwoooGo `200`, MOMO health `200`, Stock freshness `200`, Bitan `200`, Gitea `200`, Harbor `200`, Registry `/v2/` expected `401`, Sentry expected `302`, SigNoz `200`, Langfuse `200`. Backup-status 07:18 remains 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, offsite/rclone fresh, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`. 188 remains product-service green but host-hygiene blocked: host PostgreSQL `14/main` has `invalid primary checkpoint record` / `could not locate a valid checkpoint record`, certbot renew for `sentry.wooo.work` is rate-limited / challenge-failed, and `awoooi-startup.service` previously attempted `pg_resetwal` as root. New runbook `docs/runbooks/HOST-188-HYGIENE-MAINTENANCE-RUNBOOK.md` defines the maintenance-window decision tree without authorizing runtime repair. Do not declare DR complete until `escrow_missing=0`; Wazuh manager registry accepted remains `0`; 111/168 Codex workspace HEAD drift and Mac Mini low free space remain workstation blockers, not reboot service blockers. |
 | P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-26 07:19 readback shows 120 and 121 reachable, K3s active, `mon` and `mon1` both `Ready control-plane`, AWOOOI API/Web replicas split across both nodes, ArgoCD `awoooi-prod Synced / Healthy` at revision `1fd5e2a8b0f18d24eed16aa2a44286bcbf230603`, and `km-vectorize` official 03:00 台北時間 run succeeded with `lastSuccess=2026-06-25T19:00:14Z`. |
 | P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-26 06:58 backup readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`, last aggregate `2026-06-26 02:31:02`。DR remains blocked on real non-secret credential escrow evidence IDs; do not write placeholder markers or paste secret values. |
 | P2 service / data truth | DONE | 100% | Service routes and core runtime are available, 110 current CPU pressure is attributable to active AWOOOI Web `turbo build` / Docker buildx, and previous orphan Chrome groups remain cleared. 2026-06-26 07:19 StockPlatform `/api/v1/system/freshness` returned `200`; 07:01 freshness payload was `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; price / chips / margin / AI recommendations are all on `2026-06-25`. `ai.recommendations` row count is `2868`; `core.margin_short_daily` row count is `1976`. MOMO health `V10.699`, current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, and `MOMO_DAILY_FRESHNESS 1|2026-06-24` are green; expanded public routes are green. |