ops(reboot): close 188 hygiene and dynamic post-reboot gates

2026-06-26 12:39:55 +08:00
parent d8a68c742c
commit 71261c122e
12 changed files with 227 additions and 67 deletions
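
Note on the dynamic gate check: the core change to `post-reboot-owner-packet-contract-guard.py` is that the expected gate set is read from the packet's own `source.next_required_gates` rather than hard-coded to three P0 gates. The guard source is not part of this diff excerpt, so what follows is only a minimal sketch of that idea, not the actual implementation. The function names `check_packet` and `make_packet`, the per-packet `gate_id` field, and the exact nesting of `source.next_required_gates` are assumptions made for illustration; the counter names and the success line are taken from the docs changed below.

```python
# Sketch only: field layout is assumed, not copied from the real guard script.

# Counters the docs say must stay at zero; the guard never authorizes anything.
ZERO_FIELDS = (
    "request_sent_count",
    "owner_response_received_count",
    "owner_response_accepted_count",
    "runtime_action_authorized_count",
)


def check_packet(packet: dict) -> str:
    """Validate an owner-packet dict against its own live gate list.

    Assumed layout (hypothetical): packet["source"]["next_required_gates"] is a
    list of gate ids, and each entry of packet["owner_packets"] has a "gate_id".
    """
    live_gates = list(packet["source"]["next_required_gates"])
    packet_gates = [p["gate_id"] for p in packet["owner_packets"]]

    # The packet set must match the live gates exactly: no stale 188 gate,
    # no missing escrow or Wazuh gate.
    if sorted(packet_gates) != sorted(live_gates):
        raise ValueError(f"gate drift: packets={packet_gates} live={live_gates}")
    if packet.get("next_gate_count") != len(live_gates):
        raise ValueError("next_gate_count does not match live gate list")

    # Any non-zero boundary counter is treated as BLOCKED.
    for field in ZERO_FIELDS:
        if packet.get(field) != 0:
            raise ValueError(f"{field} must be 0")

    return (
        "POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK "
        f"gates={len(live_gates)} request_sent=0 accepted=0 runtime_gate=0"
    )


def make_packet(gates: list) -> dict:
    """Build a minimal packet for illustration only (not the real schema)."""
    packet = {
        "source": {"next_required_gates": list(gates)},
        "owner_packets": [{"gate_id": g} for g in gates],
        "next_gate_count": len(gates),
    }
    packet.update({field: 0 for field in ZERO_FIELDS})
    return packet


if __name__ == "__main__":
    current = make_packet(
        ["credential_escrow_evidence", "wazuh_manager_registry_export"]
    )
    print(check_packet(current))
```

With the 12:13 gate set this sketch yields `gates=2`; if `host_188_hygiene_maintenance_window` reappears in the live list, the same check yields `gates=3` without any code change, which is the behaviour the updated runbooks describe.
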
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -45489,3 +45489,40 @@ production browser smoke:
 - 188 host hygiene 維護窗口仍未執行。
 - Wazuh manager registry accepted remains `0`。
 - 不得宣稱 owner request 已送出、owner response 已收到 / 接受、runtime 寫入已批准、`DR_COMPLETE`、188 host fully green、或 Wazuh registry recovered。
+
+## 2026-06-26 — 12:13 188 host hygiene live repair closeout / SOP v1.73
+
+**時間與來源**：
+- 2026-06-26 11:40-12:13 Asia/Taipei。
+- 來源：188 live SSH maintenance session、`scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh --no-color`、`scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color`、public ACME HTTP-01 self-test、systemd / Docker / PostgreSQL runtime readback。
+
+**完成內容**：
+- 188 startup source-of-truth 已修正：`/usr/local/bin/awoooi-startup.sh` 改為 `postgres_runtime_ready()`，接受 active `postgresql@14-main` 或 `k3s-postgres-recovery` host-network PostgreSQL runtime + `pg_isready`；不再自動執行 `pg_resetwal`。
+- 188 `awoooi-startup.service` 移除 hard `postgresql@14-main.service` dependency，避免把已由 recovery container 提供的 production DB runtime 誤判成 host cluster failed。
+- 188 Nginx ACME HTTP-01 route 已修：`sentry.wooo.work`、`gitea.wooo.work`、`langfuse.wooo.work`、`signoz.wooo.work` 均可從 public HTTP 讀回同一個 non-secret self-test token。
+- 188 duplicate certbot runner 已收斂：apt `certbot.timer` disabled / inactive，snap `snap.certbot.renew.timer` enabled / active / waiting；`certbot.service` 與 `snap.certbot.renew.service` 均 `Result=success / inactive`。
+- 188 failed units 已清零，`systemctl is-system-running=running`。
+- Repo baseline 同步更新：`scripts/reboot-recovery/awoooi-startup.sh`、`scripts/reboot-recovery/awoooi-startup.service`、`scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh`、188 Nginx Ansible templates、SOP / quick-check / workplan / runbook。
+- `post-reboot-owner-packet-contract-guard.py` 改為依 live `source.next_required_gates` 動態驗收，不再把已修復的 188 gate 永久鎖成必備三 gate。
+- `post-start-quick-check.sh` 的 StockPlatform dedicated freshness gate 改用 `STOCK_FRESHNESS_RETRY_ATTEMPTS=6` / `STOCK_FRESHNESS_RETRY_DELAY_SECONDS=5` 重試，避免 110 Docker / CI rollout 後的短暫 upstream `502` 把已恢復的服務 / 資料 freshness 誤判成 blocked；連續失敗仍維持 hard blocker。
+
+**live 驗證結果**：
+- `scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh --no-color`：`PASS=16 WARN=3 BLOCKED=0`、`SERVICE_GREEN=1`、`HOST_HYGIENE_BLOCKED=0`、`Result: HOST_188_HYGIENE_GREEN.`
+- `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color`：`POST_START_RC=0`、`POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`、`POST_START_PASS=38`、`POST_START_WARN=4`、`POST_START_BLOCKED=0`、`SERVICE_GREEN=1`、`PRODUCT_DATA_GREEN=1`、`BACKUP_CORE_GREEN=1`、`DR_ESCROW_BLOCKED=1`、`ESCROW_MISSING_COUNT=5`、`HOST_188_HYGIENE_BLOCKED=0`、`WAZUH_MANAGER_REGISTRY_ACCEPTED=0`、`RUNTIME_ACTION_AUTHORIZED=0`。
+- `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export`；188 host hygiene 已從 next gates 移除。
+
+**做過的命令類型**：
+- 寫入：188 startup script/service 安裝、188 Nginx ACME route config 安裝、`systemctl daemon-reload`、`systemctl reset-failed`、`nginx -t`、`systemctl reload nginx`、停用 duplicate apt `certbot.timer`、建立 / 刪除 non-secret ACME self-test token file。
+- 只讀：systemd / Docker / PostgreSQL runtime / certbot timer / public route / backup / post-reboot summary / Wazuh repo gates。
+- 未做：沒有 `pg_resetwal`、沒有 DB restore、沒有 Docker restart、沒有 K3s / ArgoCD / firewall / Wazuh runtime action、沒有 secret value / token / password 讀取或保存、沒有強制 certbot renew。
+
+**目前判定**：
+- 188 host hygiene repair：`0% -> 100%`。
+- Reboot service / product data / backup / 188 host hygiene：`GREEN`。
+- Overall recovery declaration：`FULL_STACK_GREEN_DR_ESCROW_BLOCKED`。
+- SOP / quick-check / owner packet guard：v1.73 動態 gate baseline。
+
+**仍 blocked / 不得宣稱**：
+- DR credential escrow evidence 仍缺 `5`：不得宣稱 `DR_COMPLETE`。
+- Wazuh manager registry accepted 仍為 `0`：不得宣稱 Wazuh 全主機納管恢復。
+- certbot formal renewal 尚未完成 readback；本輪完成的是 HTTP-01 route / timer hygiene / failed-unit 清除，正式 renew 成功需等 snap certbot timer 或獨立 ACME window。
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -1,6 +1,6 @@
 # AWOOOI 全棧冷啟動與主機重啟 SOP

-> Version: v1.72
+> Version: v1.73
 > Last updated: 2026-06-26 Asia/Taipei
 > Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.

@@ -10,19 +10,21 @@

 本節是每次接手、開機、關機、重啟後的第一個判定錨點。若日期不是今天，必須先重跑 live check，再更新本節與 `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md`。

-若只是重啟後要快速判斷能不能宣稱恢復，先跑機器可讀摘要：`scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color`。此腳本會呼叫一頁式總檢查、188 host hygiene checklist 與 Wazuh no-false-green repo gates，並把 delegated logs 留在 `/tmp/awoooi-post-reboot-readiness-*`。接著跑 `scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color`，把 summary 轉成 allowed / forbidden declaration，避免把服務綠誤報成 DR complete、188 host fully green、Wazuh registry recovered 或 runtime authorized。若 summary 顯示 `SERVICE_GREEN=1` 但 `NEXT_REQUIRED_GATES` 仍非空，再跑 `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color`，把 DR escrow、188 hygiene、Wazuh registry 三條 blocker 轉成 owner / evidence / forbidden-action dispatch checklist；需要機器可讀 intake 時，再跑 `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color --output /tmp/awoooi-post-reboot-owner-packets.json` 產生 `awoooi_post_reboot_next_gate_owner_packets_v1` JSON，並立刻跑 `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json`。dispatch / packet / guard 均固定 `DISPATCH_AUTHORIZED=0`、`REQUEST_SENT_COUNT=0`、`OWNER_RESPONSE_ACCEPTED=0`、`HOST_WRITE_AUTHORIZED=0`、`SECRET_VALUE_COLLECTION_ALLOWED=0`、`RUNTIME_GATE=0`；guard 未通過時不得送 owner request、不得寫 escrow marker、不得進維護窗口、不得宣稱 DR / 188 host hygiene / Wazuh registry complete。需要人工展開時，再跑 `scripts/reboot-recovery/post-start-quick-check.sh --no-color` 並以 `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` 作為 fallback。長 SOP 保留完整背景、例外處理與 Plan B；短版 wrapper / checklist 負責每次 T+10 分鐘內的固定判定。
+若只是重啟後要快速判斷能不能宣稱恢復，先跑機器可讀摘要：`scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color`。此腳本會呼叫一頁式總檢查、188 host hygiene checklist 與 Wazuh no-false-green repo gates，並把 delegated logs 留在 `/tmp/awoooi-post-reboot-readiness-*`。接著跑 `scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color`，把 summary 轉成 allowed / forbidden declaration，避免把服務綠誤報成 DR complete、188 host hygiene、Wazuh registry recovered 或 runtime authorized。若 summary 顯示 `SERVICE_GREEN=1` 但 `NEXT_REQUIRED_GATES` 仍非空，再跑 `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color`，把 live summary 內尚未完成的 blocker 轉成 owner / evidence / forbidden-action dispatch checklist；需要機器可讀 intake 時，再跑 `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color --output /tmp/awoooi-post-reboot-owner-packets.json` 產生 `awoooi_post_reboot_next_gate_owner_packets_v1` JSON，並立刻跑 `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json`。dispatch / packet / guard 均固定 `DISPATCH_AUTHORIZED=0`、`REQUEST_SENT_COUNT=0`、`OWNER_RESPONSE_ACCEPTED=0`、`HOST_WRITE_AUTHORIZED=0`、`SECRET_VALUE_COLLECTION_ALLOWED=0`、`RUNTIME_GATE=0`；guard 未通過時不得送 owner request、不得寫 escrow marker、不得進維護窗口、不得宣稱 DR / Wazuh registry complete。需要人工展開時，再跑 `scripts/reboot-recovery/post-start-quick-check.sh --no-color` 並以 `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` 作為 fallback。長 SOP 保留完整背景、例外處理與 Plan B；短版 wrapper / checklist 負責每次 T+10 分鐘內的固定判定。

-2026-06-26 07:47 machine-readable readiness summary：`scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` 已驗證可用，artifact dir `/tmp/awoooi-post-reboot-readiness-20260626-074702`。摘要輸出 `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`、`POST_START_PASS=38`、`POST_START_WARN=3`、`POST_START_BLOCKED=0`、`SERVICE_GREEN=1`、`PRODUCT_DATA_GREEN=1`、`BACKUP_CORE_GREEN=1`、`DR_ESCROW_BLOCKED=1`、`ESCROW_MISSING_COUNT=5`、`HOST_188_SERVICE_GREEN=1`、`HOST_188_HYGIENE_BLOCKED=1`、`WAZUH_ROUTE_CODE=200`、`WAZUH_TRANSPORT_COUNT=6`、`WAZUH_MANAGER_REGISTRY_ACCEPTED=0`、`WAZUH_RUNTIME_GATE=0`、`RUNTIME_ACTION_AUTHORIZED=0`。目前 `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`，`NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export`。這是每次重啟後的第一層 operator / AI agent 判定格式。
+2026-06-26 12:13 latest live summary supersedes the 08:59 gate set：`scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` 回傳 `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`、`POST_START_PASS=38`、`POST_START_WARN=4`、`POST_START_BLOCKED=0`、`SERVICE_GREEN=1`、`PRODUCT_DATA_GREEN=1`、`BACKUP_CORE_GREEN=1`、`DR_ESCROW_BLOCKED=1`、`ESCROW_MISSING_COUNT=5`、`HOST_188_SERVICE_GREEN=1`、`HOST_188_HYGIENE_BLOCKED=0`、`HOST_188_RESULT=HOST_188_HYGIENE_GREEN.`、`WAZUH_ROUTE_CODE=200`、`WAZUH_TRANSPORT_COUNT=6`、`WAZUH_MANAGER_REGISTRY_ACCEPTED=0`、`WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`、`WAZUH_DASHBOARD_INDEX_OK=3`、`RUNTIME_ACTION_AUTHORIZED=0`、`OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`、`NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export`。188 host hygiene 已從 blocker 移除；目前不可宣稱完成的只剩 DR credential escrow 與 Wazuh manager registry。ACME HTTP-01 route 與 certbot timer hygiene 已修復，但不得宣稱憑證已正式 renew，需等 snap certbot timer / ACME window readback。

-2026-06-26 08:12 next-gate dispatch baseline：`scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` 已以最新 summary live output 驗證。腳本讀回 `SERVICE_GREEN=1`、`OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`、`NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export`，並輸出三個 P0 checklist：一是 credential escrow non-secret evidence，要求五個 escrow item 的 evidence id / owner / reviewer 且禁止密碼、token、hash、prefix/suffix；二是 188 host PostgreSQL / certbot hygiene maintenance window，要求 DB / DNS-TLS / rollback / postcheck owner 決策且禁止 `pg_resetwal`、certbot renew、Nginx reload、DB restore、Docker restart 等未批准動作；三是 Wazuh manager registry redacted export，要求脫敏 registry count、host alias status、dashboard API/version status、time window 與 reviewer，且禁止 agent real name、internal IP、client.keys、raw payload、active response、agent re-enroll、Wazuh restart、secret patch、host write、Kali active scan。輸出固定 `NEXT_GATE_COUNT=3`、`NEXT_STEP=dispatch_owner_packets_manually_after_review`、`RUNTIME_ACTION_AUTHORIZED=0`，這是 dispatch checklist，不是 request sent 或 runtime approval。
+2026-06-26 07:47 machine-readable readiness summary retained as historical pre-repair evidence：當時 `HOST_188_HYGIENE_BLOCKED=1`、`NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export`。此段只用來比對 188 修復前後差異；現行 gate set 必須使用 12:13 baseline。
+
+2026-06-26 08:12 next-gate dispatch baseline retained as historical pre-repair evidence：當時 output 固定三個 P0 checklist。12:13 起 dispatch 依 live summary 動態輸出，目前 expected `NEXT_GATE_COUNT=2`，只剩 credential escrow 與 Wazuh registry。

 2026-06-26 08:29 owner-packet JSON baseline：`scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color` 將 dispatch output 轉成 `schema_version=awoooi_post_reboot_next_gate_owner_packets_v1`，包含三個 `owner_packets`、`next_gate_count=3`、`p0_gate_count=3`、`request_sent_count=0`、`owner_response_received_count=0`、`owner_response_accepted_count=0`、`runtime_action_authorized_count=0`。此 JSON 是 AI / operator / owner review intake，不是外部 request，也不是維護窗口批准。

-2026-06-26 08:40 owner-packet contract guard baseline：`scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json` 鎖定 `schema_version=awoooi_post_reboot_next_gate_owner_packets_v1`、三個 P0 gate id、`next_gate_count=3`、`p0_gate_count=3`、`request_sent_count=0`、`owner_response_received_count=0`、`owner_response_accepted_count=0`、`runtime_action_authorized_count=0`、`dispatch_authorized=0`、`host_write_authorized=0`、`secret_value_collection_allowed=0`、`runtime_gate_count=0`。此 guard 也驗證 escrow 禁止 password / token / secret value / hash / prefix / suffix / raw credential，188 禁止 `pg_resetwal` / certbot renew / Nginx reload / DB restore / Docker restart / host file write，Wazuh 禁止 raw payload / internal IP / active response / re-enroll / restart / secret patch / host write / Kali active scan，並要求四條 no-false-green 規則存在。輸出必須是 `POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=3 request_sent=0 accepted=0 runtime_gate=0`。
+2026-06-26 08:40 owner-packet contract guard baseline retained as historical pre-repair evidence：舊版鎖定三個 P0 gate。12:13 起 contract guard 依 `source.next_required_gates` 動態驗收，現行 expected success line 是 `POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=2 request_sent=0 accepted=0 runtime_gate=0`；若 188 hygiene future regression，才會回到 `gates=3`。

 2026-06-26 08:47 Wazuh registry detail baseline：`scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` 已把 Wazuh repo-side coverage / runtime gate 的細節納入固定 key/value：`WAZUH_COVERAGE_SCOPE=6`、`WAZUH_DIRECT_ACTIVE=2`、`WAZUH_NO_TRANSPORT=1`、`WAZUH_SSH_BLOCKED=3`、`WAZUH_ROUTE_CODE=200`、`WAZUH_TRANSPORT_COUNT=6`、`WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`、`WAZUH_DASHBOARD_INDEX_OK=3`、`WAZUH_MANAGER_REGISTRY_ACCEPTED=0`、`WAZUH_RUNTIME_GATE=0`。`scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` 的 `wazuh_manager_registry_export` gate 會把這些狀態放入 `CURRENT_EVIDENCE`。判讀鐵律：route `200`、transport `6`、Dashboard index pattern `3` 都不是 manager registry accepted；全主機納管與 Dashboard API 修復仍需 owner evidence / registry export / acceptance record。

-2026-06-26 08:59 declaration guard baseline：`scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color` 將 summary 轉成 `awoooi_post_reboot_declaration_guard_v1`，目前 status 為 `allowed_with_boundary_blockers`。允許宣稱 `SERVICE_RECOVERY_GREEN`、`PRODUCT_DATA_GREEN`、`BACKUP_CORE_GREEN` 與 `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`；禁止宣稱 `DR_COMPLETE`、`HOST_188_FULLY_GREEN`、`WAZUH_REGISTRY_RECOVERED`、`RUNTIME_ACTION_AUTHORIZED`。若用 `--proposed DR_COMPLETE` 或 `--proposed WAZUH_REGISTRY_RECOVERED` 測試，guard 必須以 `blocked_false_green_proposal` 拒絕。
+2026-06-26 08:59 declaration guard baseline retained as historical pre-repair evidence：當時 `HOST_188_FULLY_GREEN` 仍 forbidden。12:13 起 guard 依 `HOST_188_HYGIENE_BLOCKED=0` 動態允許 188 host hygiene green，但仍拒絕 `DR_COMPLETE`、`WAZUH_REGISTRY_RECOVERED`、`RUNTIME_ACTION_AUTHORIZED`。

 2026-06-26 07:39 live quick-check refresh：`scripts/reboot-recovery/post-start-quick-check.sh --no-color` 完整跑完，四主機 ping / SSH 全部 OK，delegated cold-start 為 `PASS=89 WARN=0 BLOCKED=0`，wrapper 總結為 `POST_START_QUICK_CHECK PASS=38 WARN=3 BLOCKED=0`、warning split `SERVICE=0 BOUNDARY=1 EVIDENCE=2`、`RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`。MOMO health `V10.701`，daily snapshot `109061` rows / `2025-07-01..2026-06-24`，current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`，latest import job `57 completed`。StockPlatform freshness `status=ok`、latest trading date `2026-06-25`，price / chips / margin / AI recommendations 均為 `2026-06-25`。Backup-status 07:39 顯示 110 `13/13 fresh failed=0`、188 `2/2 fresh failed=0`、`core_blockers=0`、offsite/rclone fresh、`last_backup_all=2026-06-26 02:31:02`、`escrow_missing=5`。Public routes extended list 全部回 expected 2xx/3xx。110 CPU attribution 顯示 load 約 `5.19 / 4.66 / 4.91`，CPU idle 多數樣本 `80%+`，目前負載來自 Gitea / ClickHouse / Docker / Kafka / StockPlatform / AWOOOI API / Sentry 等正常平台工作，不是 orphan Chrome。這一輪 allowed declaration：主機、K3s、服務、網站、產品資料 freshness、備份核心與 offsite freshness 綠；forbidden declaration：DR complete、credential escrow complete、188 host fully green、Wazuh registry recovered。

--- a/docs/runbooks/HOST-188-HYGIENE-MAINTENANCE-RUNBOOK.md
+++ b/docs/runbooks/HOST-188-HYGIENE-MAINTENANCE-RUNBOOK.md
@@ -1,34 +1,33 @@
 # 188 Host Hygiene 維護窗口 Runbook

-> 版本：2026-06-26.v1
+> 版本：2026-06-26.v2
 > 適用範圍：`192.168.0.188` host PostgreSQL `14/main`、`awoooi-startup.service`、certbot / ACME renewal hygiene。
-> 狀態：`SERVICE_GREEN_HOST_HYGIENE_BLOCKED`
+> 狀態：`HOST_188_HYGIENE_GREEN`

 ---

 ## 1. 目的

-188 目前的產品服務與 public routes 可用，但 host systemd 仍是 `degraded`。這份 runbook 將「服務已恢復」與「host hygiene 未收斂」分開，避免把 route `200`、container healthy、`pg_isready` 或 exporter green 誤判成 host 已完全健康。
+188 目前的產品服務、public routes 與 host hygiene 均已收斂為 green。這份 runbook 將「服務恢復」、「host hygiene」、「DR escrow」與「Wazuh registry」分開，避免把 route `200`、container healthy、`pg_isready`、exporter green 或 `reset-failed` 誤判成其他層級也完成。

-本文件不是維護窗口批准，也不是 runtime action 授權。未取得 owner、rollback owner、maintenance window、backup / restore evidence 與 post-check plan 前，不得執行 restart、reset-failed、renew、reload、restore 或 `pg_resetwal`。
+本文件記錄 2026-06-26 已完成的 188 host hygiene 修復與後續操作邊界。它不是 DR escrow 完成證明，也不是 Wazuh / SOC runtime action 授權。未取得 owner、rollback owner、maintenance window、backup / restore evidence 與 post-check plan 前，仍不得執行資料層 break-glass、restore、未批准 renew、或其他主機寫入。

 ---

 ## 2. 最新只讀證據

-2026-06-26 07:18-07:19 Asia/Taipei read-only evidence：
+2026-06-26 12:02-12:13 Asia/Taipei live evidence：

 | 層級 | 證據 | 判定 |
 |------|------|------|
-| 188 disk | `/` 約 `982G`，使用 `712G`，可用 `230G`，`76%` | 不是容量耗盡 |
-| 產品容器 | MOMO、VibeWork、2026FIFA、ClawBot、MinIO、exporters、MOMO DB 等容器可見 healthy / up | 產品服務 green |
-| Host systemd | failed units：`awoooi-startup.service`、`postgresql@14-main.service`、`certbot.service`、`snap.certbot.renew.service` | host hygiene blocked |
-| Host PostgreSQL | `pg_lsclusters` 顯示 `14/main down` | host cluster 不健康 |
-| PostgreSQL status | `invalid primary checkpoint record`、`PANIC: could not locate a valid checkpoint record` | checkpoint / WAL 類資料層錯誤 |
-| Startup unit | 曾嘗試以 root 執行 `pg_resetwal`，並失敗 | 自動修復邏輯必須 fail-closed |
-| certbot apt unit | `sentry.wooo.work` renewal rate-limited | 不可重複猛打 ACME |
-| certbot snap unit | `sentry.wooo.work` challenge failed | 需先確認 ACME route / DNS / gateway owner |
-| public cert | shared `sentry.wooo.work` certificate valid until `2026-07-09 16:03:40 UTC` | 有維護窗口，不是立即過期 |
+| Host systemd | `systemctl is-system-running` 為 `running`，failed units `0 loaded units listed` | host hygiene green |
+| 產品容器 | MOMO、VibeWork、2026FIFA、ClawBot、MinIO、exporters、MOMO DB 等容器 healthy / up | 產品服務 green |
+| PostgreSQL runtime | `k3s-postgres-recovery` container 為 `Running=true`、`NetworkMode=host`、restart policy `unless-stopped`，`pg_isready -h localhost -p 5432` accepting connections | production DB runtime green |
+| Host PostgreSQL unit | `postgresql@14-main.service` reset 後 `Result=success / inactive`；`pg_lsclusters` 仍可能因 recovery container PID model 顯示 host cluster down | 不再作為 service blocker；由 runtime-ready 判斷 |
+| Startup unit | `/usr/local/bin/awoooi-startup.sh` 已改為 `postgres_runtime_ready()`，接受 active host unit 或 `k3s-postgres-recovery` host-network runtime；不再自動執行 `pg_resetwal` | fail-closed 修復完成 |
+| Nginx / ACME route | `sentry.wooo.work`、`gitea.wooo.work`、`langfuse.wooo.work`、`signoz.wooo.work` HTTP-01 self-test token 均由 `/var/www/html` 正確回應 | challenge route green |
+| certbot timers | duplicate apt `certbot.timer` 已停用；snap `snap.certbot.renew.timer` enabled / active / waiting；`certbot.service` 與 `snap.certbot.renew.service` `Result=success / inactive` | renewal runner hygiene green |
+| public cert | shared `sentry.wooo.work` certificate valid until `2026-07-09 16:03:40 UTC` | 尚未宣稱已 renew；等待 snap timer / ACME window |

 ---

@@ -112,9 +111,9 @@ SSH_BATCH_MODE=yes bash scripts/reboot-recovery/188-host-hygiene-maintenance-che

 ```text
 SERVICE_GREEN=1
-HOST_HYGIENE_BLOCKED=1
+HOST_HYGIENE_BLOCKED=0
 RUNTIME_ACTION_AUTHORIZED=0
-Result: SERVICE_GREEN_HOST_HYGIENE_BLOCKED. Open maintenance window; do not auto-repair.
+Result: HOST_188_HYGIENE_GREEN.
 ```

 ### Phase A：維護前只讀快照
@@ -169,17 +168,18 @@ curl -fsS https://awoooi.wooo.work/api/v1/health
 | Lane | 完成度 | 說明 |
 |------|-------:|------|
 | 服務恢復 | `100%` | routes、containers、cold-start、backup core 已 green |
-| 188 host hygiene evidence | `80%` | 只讀 root cause 已明確；缺 owner disposition |
-| PostgreSQL owner decision | `0%` | 尚未確認廢棄、restore 或 break-glass |
-| certbot owner decision | `0%` | 尚未接受 DNS / ACME / coverage evidence |
-| runtime repair | `0%` | 未批准、未執行 |
+| 188 host hygiene repair | `100%` | startup fail-closed、PostgreSQL runtime-ready 判斷、systemd failed units、ACME route 與 certbot timer hygiene 已收斂 |
+| PostgreSQL runtime source-of-truth | `100%` | production DB runtime 以 `k3s-postgres-recovery` host-network container + `pg_isready` 判斷，不用 host `pg_lsclusters` 假紅 |
+| certbot / ACME route hygiene | `95%` | HTTP-01 route 與 timer split 已修；正式 renew 成功仍等待 snap timer / ACME window |
+| DR escrow | `BLOCKED` | `escrow_missing=5`，不可用 188 host green 替代 |
+| Wazuh registry | `BLOCKED` | manager registry accepted `0`，不可用 route `200` 或 transport count 替代 |

 ---

 ## 8. 不得宣稱

-- 不得宣稱 188 host fully green。
-- 不得宣稱 host PostgreSQL 已恢復。
-- 不得宣稱 certbot renewal 已恢復。
-- 不得宣稱 `reset-failed` 後就是修復完成。
-- 不得宣稱這份文件批准任何 runtime action。
+- 不得宣稱 DR complete；credential escrow evidence 仍缺 `5`。
+- 不得宣稱 Wazuh registry recovered；manager registry accepted 仍為 `0`。
+- 不得宣稱 certbot certificate 已完成正式 renew；目前完成的是 HTTP-01 route / timer hygiene / failed units 清除。
+- 不得把 `reset-failed` 單獨當成修復完成；本次完成是因 startup source、Nginx ACME route、timer split 與 postcheck 同時收斂。
+- 不得宣稱這份文件批准任何新的 runtime action。
--- a/docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md
+++ b/docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md
@@ -1,6 +1,6 @@
 # 主機重啟後一頁式總檢查

-> Version: v1.12
+> Version: v1.13
 > Last updated: 2026-06-26 Asia/Taipei
 > Scope: 110 / 120 / 121 / 188 post-reboot service recovery. 112 Kali / Wazuh / active scan 不屬於本流程。

@@ -10,7 +10,7 @@

 每次 110 / 120 / 121 / 188 任一台主機開機、關機、重啟、斷電恢復、VMware console fsck、Docker / K3s 大量重排後，都先跑本頁，再決定是否宣稱恢復。

-最新基準：2026-06-26 08:59 post-reboot declaration guard。`scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` 回傳 `SERVICE_GREEN=1`、`PRODUCT_DATA_GREEN=1`、`BACKUP_CORE_GREEN=1`、`DR_ESCROW_BLOCKED=1`、`ESCROW_MISSING_COUNT=5`、`HOST_188_HYGIENE_BLOCKED=1`、`WAZUH_MANAGER_REGISTRY_ACCEPTED=0`、`WAZUH_COVERAGE_SCOPE=6`、`WAZUH_DIRECT_ACTIVE=2`、`WAZUH_NO_TRANSPORT=1`、`WAZUH_SSH_BLOCKED=3`、`WAZUH_ROUTE_CODE=200`、`WAZUH_TRANSPORT_COUNT=6`、`WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`、`WAZUH_DASHBOARD_INDEX_OK=3`、`RUNTIME_ACTION_AUTHORIZED=0`、`OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`。`scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color` 會把 summary 轉成 allowed / forbidden declaration：目前允許宣稱服務、產品資料、備份核心與 `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`；禁止宣稱 `DR_COMPLETE`、`HOST_188_FULLY_GREEN`、`WAZUH_REGISTRY_RECOVERED`、`RUNTIME_ACTION_AUTHORIZED`。接著 `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` 將 `NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export` 展成三個 owner / evidence / forbidden-action checklist；Wazuh checklist 的 `CURRENT_EVIDENCE` 會保留 registry accepted、coverage scope、direct active、no transport、SSH blocked、route、transport、Dashboard API 與 index pattern 狀態，避免把 route `200` 或 transport `6` 誤報成 registry recovered。`scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color` 進一步轉成 `awoooi_post_reboot_next_gate_owner_packets_v1` JSON，固定 `dispatch_authorized=0`、`request_sent_count=0`、`owner_response_accepted_count=0`、`host_write_authorized=0`、`secret_value_collection_allowed=0`、`runtime_gate_count=0`；`scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json` 鎖定三個 P0 gate、所有 `0 / false` 邊界、禁用 secret payload / runtime action 與 no-false-green 規則。Cold-start `PASS=89 WARN=0 BLOCKED=0`；MOMO `V10.701`、latest import job `57 completed`、`DB_DAILY_FRESHNESS 1|2026-06-24`；StockPlatform `/api/v1/system/freshness` 為 `status=ok`、`latest_trading_date=2026-06-25`、blockers `[]`；backup-status 110 `13/13 fresh failed=0`、188 `2/2 fresh failed=0`、`core_blockers=0`、`offsite_fresh=1`、`rclone_gdrive_fresh=1`、`last_backup_all=2026-06-26 02:31:02`。DR 仍因 `escrow_missing=5` 不可宣稱 complete。188 host hygiene 與 Wazuh manager registry 仍是 service green 之外的獨立 blocker。
+最新基準：2026-06-26 12:13 post-reboot summary / declaration guard。`scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` 回傳 `SERVICE_GREEN=1`、`PRODUCT_DATA_GREEN=1`、`BACKUP_CORE_GREEN=1`、`DR_ESCROW_BLOCKED=1`、`ESCROW_MISSING_COUNT=5`、`HOST_188_HYGIENE_BLOCKED=0`、`HOST_188_RESULT=HOST_188_HYGIENE_GREEN.`、`WAZUH_MANAGER_REGISTRY_ACCEPTED=0`、`WAZUH_COVERAGE_SCOPE=6`、`WAZUH_DIRECT_ACTIVE=2`、`WAZUH_NO_TRANSPORT=1`、`WAZUH_SSH_BLOCKED=3`、`WAZUH_ROUTE_CODE=200`、`WAZUH_TRANSPORT_COUNT=6`、`WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`、`WAZUH_DASHBOARD_INDEX_OK=3`、`RUNTIME_ACTION_AUTHORIZED=0`、`OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`。`scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color` 會把 summary 轉成 allowed / forbidden declaration：目前允許宣稱服務、產品資料、備份核心、188 host hygiene green 與 `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`；禁止宣稱 `DR_COMPLETE`、`WAZUH_REGISTRY_RECOVERED`、`RUNTIME_ACTION_AUTHORIZED`。接著 `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` 將 `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export` 展成 owner / evidence / forbidden-action checklist；Wazuh checklist 的 `CURRENT_EVIDENCE` 會保留 registry accepted、coverage scope、direct active、no transport、SSH blocked、route、transport、Dashboard API 與 index pattern 狀態，避免把 route `200` 或 transport `6` 誤報成 registry recovered。`scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color` 進一步轉成 `awoooi_post_reboot_next_gate_owner_packets_v1` JSON，固定 `dispatch_authorized=0`、`request_sent_count=0`、`owner_response_accepted_count=0`、`host_write_authorized=0`、`secret_value_collection_allowed=0`、`runtime_gate_count=0`；`scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json` 依 live `next_required_gates` 動態鎖定 P0 gate、所有 `0 / false` 邊界、禁用 secret payload / runtime action 與 no-false-green 規則。DR 仍因 `escrow_missing=5` 不可宣稱 complete；Wazuh manager registry 仍是 service green 之外的獨立 blocker。ACME HTTP-01 route / certbot timer hygiene 已修復，但憑證正式 renew 成功需等 snap certbot timer 或獨立 ACME window readback。

 本頁只回答四件事：

@@ -55,7 +55,7 @@ scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color
 - `PRODUCT_DATA_GREEN=1`：MOMO / StockPlatform 主要資料 freshness 可宣稱恢復。
 - `BACKUP_CORE_GREEN=1`：備份核心可宣稱恢復。
 - `DR_ESCROW_BLOCKED=1` / `ESCROW_MISSING_COUNT>0`：不可宣稱 DR complete。
-- `HOST_188_HYGIENE_BLOCKED=1`：188 host hygiene 需維護窗口，不等於產品服務掛掉。
+- `HOST_188_HYGIENE_BLOCKED=0`：188 host hygiene 已收斂；若未來回到 `1`，再走 188 專用 runbook。
 - `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`：不可宣稱 Wazuh 全主機納管恢復。
 - `WAZUH_ROUTE_CODE=200` / `WAZUH_TRANSPORT_COUNT>0` 只能代表 route / transport evidence；仍必須搭配 `WAZUH_COVERAGE_SCOPE`、`WAZUH_DIRECT_ACTIVE`、`WAZUH_NO_TRANSPORT`、`WAZUH_SSH_BLOCKED`、`WAZUH_DASHBOARD_API_CONNECTION` 與 `WAZUH_MANAGER_REGISTRY_ACCEPTED` 判讀。
 - `RUNTIME_ACTION_AUTHORIZED=0`：本流程沒有授權 runtime 寫操作。
@@ -73,7 +73,7 @@ scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color
 scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color --proposed DR_COMPLETE --proposed WAZUH_REGISTRY_RECOVERED
 ```

-在目前基準下，`DR_COMPLETE`、`HOST_188_FULLY_GREEN`、`WAZUH_REGISTRY_RECOVERED`、`RUNTIME_ACTION_AUTHORIZED` 都必須被拒絕；任何人、任何 AI agent 或任何文件若提出這些宣告，必須先補對應 evidence / owner acceptance。
+在目前基準下，`DR_COMPLETE`、`WAZUH_REGISTRY_RECOVERED`、`RUNTIME_ACTION_AUTHORIZED` 都必須被拒絕；任何人、任何 AI agent 或任何文件若提出這些宣告，必須先補對應 evidence / owner acceptance。`HOST_188_FULLY_GREEN` 只有在 summary 維持 `HOST_188_HYGIENE_BLOCKED=0` 且 checklist `HOST_188_HYGIENE_GREEN` 時才可宣稱。

 summary 顯示 `SERVICE_GREEN=1` 但仍有 `NEXT_REQUIRED_GATES` 時，再跑下一 Gate 派工 checklist：

@@ -81,7 +81,7 @@ summary 顯示 `SERVICE_GREEN=1` 但仍有 `NEXT_REQUIRED_GATES` 時，再跑下
 scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color
 ```

-這支腳本只把 `credential_escrow_evidence`、`host_188_hygiene_maintenance_window`、`wazuh_manager_registry_export` 轉成 owner / required evidence / forbidden action / done criteria。它不送 request、不讀 secret、不寫 marker、不 restart / reload / repair / import / delete / patch，也不授權 host / Wazuh / Nginx / K8s / DB runtime action。若它輸出 `SERVICE_GREEN=0`，先回到服務恢復，不進入 boundary dispatch。
+這支腳本只把目前 live summary 內的 `NEXT_REQUIRED_GATES` 轉成 owner / required evidence / forbidden action / done criteria。2026-06-26 12:13 起通常只剩 `credential_escrow_evidence` 與 `wazuh_manager_registry_export`；若未來 188 又紅，才會重新出現 `host_188_hygiene_maintenance_window`。它不送 request、不讀 secret、不寫 marker、不 restart / reload / repair / import / delete / patch，也不授權 host / Wazuh / Nginx / K8s / DB runtime action。若它輸出 `SERVICE_GREEN=0`，先回到服務恢復，不進入 boundary dispatch。

 若要交給 AI / 工單 / owner review 使用，產生機器可讀 owner packet：

@@ -98,7 +98,7 @@ scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color --outp
 scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json
 ```

-guard 必須輸出 `POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=3 request_sent=0 accepted=0 runtime_gate=0`。若 gate 數量、P0 gate id、`0 / false` 欄位、禁用 secret payload、188 禁用維修動作、Wazuh 禁用 active response / host write，或 no-false-green 規則任何一項漂移，視為 `BLOCKED`，不得送 owner request、不得寫 escrow marker、不得進維護窗口、不得宣稱 DR / Wazuh / 188 host hygiene 完成。
+guard 必須輸出 `POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=<live_next_gate_count> request_sent=0 accepted=0 runtime_gate=0`。目前預期 `gates=2`；若 188 hygiene 回到 blocked，才會是 `gates=3`。若 gate 數量、P0 gate id、`0 / false` 欄位、禁用 secret payload、Wazuh 禁用 active response / host write，或 no-false-green 規則任何一項漂移，視為 `BLOCKED`，不得送 owner request、不得寫 escrow marker、不得進維護窗口、不得宣稱 DR / Wazuh 完成。

 需要展開細節時，再使用 repo-side wrapper：

@@ -108,7 +108,7 @@ scripts/reboot-recovery/post-start-quick-check.sh --no-color

 此 wrapper 只做 read-only 檢查，並委派既有 cold-start / MOMO preflight / backup-status；不 restart、不 reload、不 import、不改 K8s、不讀 token 內容。wrapper 會把 warning 分成 `SERVICE`、`BOUNDARY`、`EVIDENCE` 三類，避免把 `escrow_missing>0` 誤判成服務降級。若 summary 或 wrapper 因某個 SSH 權限或路徑失敗，再依下列分段命令手動補證據。

-Public route gate 自 v1.6 起會使用 `ROUTE_RETRY_ATTEMPTS`（預設 `3`）與 `ROUTE_RETRY_DELAY_SECONDS`（預設 `2`）重試。單次 `000` / timeout 若 retry 後恢復，應列為 evidence warning 或 transient route evidence，不可直接當成網站仍壞；只有連續失敗才是 service blocker。
+Public route gate 自 v1.6 起會使用 `ROUTE_RETRY_ATTEMPTS`（預設 `3`）與 `ROUTE_RETRY_DELAY_SECONDS`（預設 `2`）重試。StockPlatform dedicated freshness gate 自 v1.13 起使用 `STOCK_FRESHNESS_RETRY_ATTEMPTS`（預設 `6`）與 `STOCK_FRESHNESS_RETRY_DELAY_SECONDS`（預設 `5`），因為它常在 110 Docker / CI rollout 後比 route health 多需要數十秒 warmup。單次 `000` / timeout / `502` 若 retry 後恢復，應列為 evidence warning 或 transient evidence，不可直接當成網站或資料仍壞；只有連續失敗才是 service blocker。

 Credential escrow gate 自 v1.6 起在 `escrow_missing>0` 時，會只讀呼叫 `/backup/scripts/mark-credential-escrow-verified.sh --status` 並列出缺項。這只是 evidence readback，不會寫 marker、不會讀密碼、不會降低 DR blocker；用途是讓 operator 立即知道缺的是哪幾個非 secret evidence marker。

--- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
+++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
@@ -11,23 +11,25 @@

 | Area | Status | Completion | Evidence |
 |------|--------|------------|----------|
-| Overall recovery readiness | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-26 07:19 read-only follow-up confirms service recovery remains green after the SOP commit reached production. Cold-start rerun returned `PASS=87 WARN=0 BLOCKED=0`, result `GREEN`; ArgoCD `awoooi-prod Synced / Healthy` at revision `1fd5e2a8b0f18d24eed16aa2a44286bcbf230603`; API/Web/Worker pods are Running with restart `0`; public routes return expected statuses including AWOOOI API `200`, AWOOOI Web `307`, VibeWork `200`, AwoooGo `200`, MOMO health `200`, Stock freshness `200`, Bitan `200`, Gitea `200`, Harbor `200`, Registry `/v2/` expected `401`, Sentry expected `302`, SigNoz `200`, Langfuse `200`. Backup-status 07:18 remains 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, offsite/rclone fresh, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`. 188 remains product-service green but host-hygiene blocked: host PostgreSQL `14/main` has `invalid primary checkpoint record` / `could not locate a valid checkpoint record`, certbot renew for `sentry.wooo.work` is rate-limited / challenge-failed, and `awoooi-startup.service` previously attempted `pg_resetwal` as root. New runbook `docs/runbooks/HOST-188-HYGIENE-MAINTENANCE-RUNBOOK.md` plus read-only script `scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh` define the maintenance-window decision tree without authorizing runtime repair. Do not declare DR complete until `escrow_missing=0`; Wazuh manager registry accepted remains `0`; 111/168 Codex workspace HEAD drift and Mac Mini low free space remain workstation blockers, not reboot service blockers. |
+| Overall recovery readiness | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-26 12:13 live summary confirms host / service / product data / backup / 188 host hygiene are green for the current evidence set. `post-reboot-readiness-summary.sh --no-color` returns `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=0`, `HOST_188_RESULT=HOST_188_HYGIENE_GREEN.`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, and `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export`. 188 startup now accepts the real `k3s-postgres-recovery` host-network PostgreSQL runtime, fails closed instead of running `pg_resetwal`, Nginx ACME HTTP-01 routes for `sentry/gitea/langfuse/signoz` are corrected, duplicate apt certbot timer is disabled, snap certbot timer remains enabled, and failed units are cleared. Do not declare DR complete until `escrow_missing=0`; do not declare Wazuh registry recovered until manager registry accepted is `1`; certbot formal renewal success still requires snap timer / ACME window readback. |
 | P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-26 07:19 readback shows 120 and 121 reachable, K3s active, `mon` and `mon1` both `Ready control-plane`, AWOOOI API/Web replicas split across both nodes, ArgoCD `awoooi-prod Synced / Healthy` at revision `1fd5e2a8b0f18d24eed16aa2a44286bcbf230603`, and `km-vectorize` official 03:00 Asia/Taipei run succeeded with `lastSuccess=2026-06-25T19:00:14Z`. |
 | P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-26 06:58 backup readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`, last aggregate `2026-06-26 02:31:02`. DR remains blocked on real non-secret credential escrow evidence IDs; do not write placeholder markers or paste secret values. |
 | P2 service / data truth | DONE | 100% | Service routes and core runtime are available, 110 current CPU pressure is attributable to active AWOOOI Web `turbo build` / Docker buildx, and previous orphan Chrome groups remain cleared. 2026-06-26 07:19 StockPlatform `/api/v1/system/freshness` returned `200`; 07:01 freshness payload was `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; price / chips / margin / AI recommendations are all on `2026-06-25`. `ai.recommendations` row count is `2868`; `core.margin_short_daily` row count is `1976`. MOMO health `V10.699`, current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, and `MOMO_DAILY_FRESHNESS 1|2026-06-24` are green; expanded public routes are green. |
-| P3 docs / automation contracts | DONE_WITH_DECLARATION_GUARD_V172 | 100% | Workplan, SOP v1.72, post-reboot declaration guard, machine-readable post-reboot readiness summary with Wazuh registry detail fields, post-reboot next-gate dispatch checklist, owner-packet JSON generator, owner-packet contract guard, one-page post-start quick check v1.12, route retry gate, deploy warmup classification, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, StockPlatform natural schedule green evidence, 110 orphan Chrome recurrence cleanup evidence, 188 fail-closed startup data recovery gate, 188 host hygiene read-only checklist, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and stricter product-data / route retry gates are updated. Declaration guard now machine-checks allowed / forbidden recovery statements: service/data/backup green may be declared, while `DR_COMPLETE`、`HOST_188_FULLY_GREEN`、`WAZUH_REGISTRY_RECOVERED` and `RUNTIME_ACTION_AUTHORIZED` remain forbidden until evidence gates close. Live 110 script sync remains a separate approved live-write gate; do not claim it here. |
+| P3 docs / automation contracts | DONE_WITH_DECLARATION_GUARD_V173 | 100% | Workplan, SOP v1.73, post-reboot declaration guard, machine-readable post-reboot readiness summary with Wazuh registry detail fields, post-reboot next-gate dispatch checklist, owner-packet JSON generator, dynamic owner-packet contract guard, one-page post-start quick check v1.13, route retry gate, deploy warmup classification, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, StockPlatform natural schedule green evidence, 110 orphan Chrome recurrence cleanup evidence, 188 fail-closed startup data recovery gate, 188 host hygiene read-only checklist, 188 PostgreSQL runtime-ready source-of-truth, 188 ACME route/timer hygiene, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and stricter product-data / route retry gates are updated. Declaration guard now machine-checks allowed / forbidden recovery statements: service/data/backup/188 host hygiene green may be declared when live summary says so, while `DR_COMPLETE`、`WAZUH_REGISTRY_RECOVERED` and `RUNTIME_ACTION_AUTHORIZED` remain forbidden until evidence gates close. Live 110 script sync remains a separate approved live-write gate; do not claim it here. |

-2026-06-26 07:47 machine-readable summary baseline: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` stores delegated logs under `/tmp/awoooi-post-reboot-readiness-20260626-074702` and returns `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_SERVICE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=1`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_RUNTIME_GATE=0`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, and `NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export`. This is now the preferred first operator/AI-agent entrypoint after reboot because it separates service health from DR, host hygiene, and security registry evidence.
+2026-06-26 12:13 machine-readable summary baseline supersedes the 07:47 / 08:59 gate set: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` stores delegated logs under `/tmp/awoooi-post-reboot-readiness-20260626-121303` and returns `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_SERVICE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=0`, `HOST_188_CHECK_RC=0`, `HOST_188_RESULT=HOST_188_HYGIENE_GREEN.`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_COVERAGE_SCOPE=6`, `WAZUH_DIRECT_ACTIVE=2`, `WAZUH_NO_TRANSPORT=1`, `WAZUH_SSH_BLOCKED=3`, `WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`, `WAZUH_DASHBOARD_INDEX_OK=3`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_RUNTIME_GATE=0`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, and `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export`. This is now the preferred first operator/AI-agent entrypoint after reboot because it separates service health from DR and security registry evidence; 188 host hygiene is no longer a next gate unless the live checklist regresses.

-2026-06-26 08:12 next-gate dispatch baseline: `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` reads the same summary path and emits three explicit P0 dispatch checklists without sending requests or changing runtime. `credential_escrow_evidence` requires non-secret evidence id / owner / reviewer for five escrow items and rejects password / token / secret value / hash / prefix / suffix / raw credential payloads. `host_188_hygiene_maintenance_window` requires PostgreSQL `14/main` decision, DNS / TLS / certbot path, startup unit source-of-truth, rollback owner, postcheck owner, and blocks unapproved `pg_resetwal` / certbot renew / Nginx reload / DB restore / Docker restart / host file writes. `wazuh_manager_registry_export` requires redacted registry counts, per-host alias status, dashboard API / version status, time window, and reviewer while blocking raw agent names, internal IPs, client keys, Wazuh payloads, active response, re-enroll, restart, secret patch, host write, and Kali active scan. Output fixed `NEXT_GATE_COUNT=3`, `REQUEST_SENT_COUNT=0`, `DISPATCH_AUTHORIZED=0`, `HOST_WRITE_AUTHORIZED=0`, `SECRET_VALUE_COLLECTION_ALLOWED=0`, `RUNTIME_ACTION_AUTHORIZED=0`.
+2026-06-26 07:47 machine-readable summary baseline is retained as historical evidence only. It showed `HOST_188_HYGIENE_BLOCKED=1` and three next gates before the 188 startup / ACME / certbot hygiene repair. Do not use the 07:47 gate set as the current status.

-2026-06-26 08:29 owner-packet JSON baseline: `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color` emits `schema_version=awoooi_post_reboot_next_gate_owner_packets_v1` with `next_gate_count=3`, `p0_gate_count=3`, `request_sent_count=0`, `owner_response_received_count=0`, `owner_response_accepted_count=0`, `runtime_action_authorized_count=0`. This packet is for AI / operator / owner review intake only; it does not send request, write credential marker, read secret, or authorize runtime action.
+2026-06-26 12:13 next-gate dispatch baseline: `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` now emits only the gates present in the current summary. Current expected gates are `credential_escrow_evidence` and `wazuh_manager_registry_export`, with `NEXT_GATE_COUNT=2`, `REQUEST_SENT_COUNT=0`, `DISPATCH_AUTHORIZED=0`, `HOST_WRITE_AUTHORIZED=0`, `SECRET_VALUE_COLLECTION_ALLOWED=0`, `RUNTIME_ACTION_AUTHORIZED=0`. If 188 hygiene regresses, `host_188_hygiene_maintenance_window` will reappear automatically.

-2026-06-26 08:40 owner-packet contract guard baseline: `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json` validates the generated JSON before any owner review intake. It requires exactly three P0 gates, preserves `request_sent=0`、`owner_response_received=0`、`owner_response_accepted=0`、`runtime_action_authorized=0`、`host_write_authorized=0`、`secret_value_collection_allowed=0`、`runtime_gate=0`, and rejects missing forbidden payload/action controls for credential escrow, 188 host hygiene, and Wazuh registry export. Expected success line: `POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=3 request_sent=0 accepted=0 runtime_gate=0`.
+2026-06-26 12:13 owner-packet JSON baseline: `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color` emits `schema_version=awoooi_post_reboot_next_gate_owner_packets_v1` with dynamic `next_gate_count=2`, `p0_gate_count=2`, `request_sent_count=0`, `owner_response_received_count=0`, `owner_response_accepted_count=0`, `runtime_action_authorized_count=0`. This packet is for AI / operator / owner review intake only; it does not send request, write credential marker, read secret, or authorize runtime action.
+
+2026-06-26 12:13 owner-packet contract guard baseline: `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json` validates the generated JSON before any owner review intake. It requires the packet gates to equal the live `source.next_required_gates`, preserves `request_sent=0`、`owner_response_received=0`、`owner_response_accepted=0`、`runtime_action_authorized=0`、`host_write_authorized=0`、`secret_value_collection_allowed=0`、`runtime_gate=0`, and rejects missing forbidden payload/action controls for active gates. Current expected success line: `POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=2 request_sent=0 accepted=0 runtime_gate=0`.

 2026-06-26 08:47 Wazuh registry detail summary baseline: post-reboot readiness summary now emits `WAZUH_COVERAGE_SCOPE`, `WAZUH_DIRECT_ACTIVE`, `WAZUH_NO_TRANSPORT`, `WAZUH_SSH_BLOCKED`, `WAZUH_DASHBOARD_API_CONNECTION`, and `WAZUH_DASHBOARD_INDEX_OK` alongside existing route / transport / registry fields. Current read-only truth is coverage scope `6`, direct active `2`, no transport `1`, SSH blocked `3`, route `200`, transport `6`, Dashboard API `pending_or_spinning`, index OK `3`, manager registry accepted `0`, runtime gate `0`. This is a security evidence blocker, not a reboot service blocker.

-2026-06-26 08:59 declaration guard baseline: `scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color` emits `schema_version=awoooi_post_reboot_declaration_guard_v1`, status `allowed_with_boundary_blockers`, allowed declarations `BACKUP_CORE_GREEN`、`FULL_STACK_GREEN_DR_ESCROW_BLOCKED`、`PRODUCT_DATA_GREEN`、`SERVICE_RECOVERY_GREEN`, and forbidden declarations `DR_COMPLETE`、`HOST_188_FULLY_GREEN`、`WAZUH_REGISTRY_RECOVERED`、`RUNTIME_ACTION_AUTHORIZED`. Proposed false-green declarations are rejected before they can enter LOGBOOK / owner packets / external status updates.
+2026-06-26 12:13 declaration guard baseline: `scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color` emits `schema_version=awoooi_post_reboot_declaration_guard_v1`, status `allowed_with_boundary_blockers`, allowed declarations including service / product data / backup / 188 host hygiene green for this evidence set, and forbidden declarations `DR_COMPLETE`、`WAZUH_REGISTRY_RECOVERED`、`RUNTIME_ACTION_AUTHORIZED`. Proposed false-green declarations are rejected before they can enter LOGBOOK / owner packets / external status updates.

 2026-06-26 07:39 live quick-check refresh supersedes the 07:19 row for current operator status. `scripts/reboot-recovery/post-start-quick-check.sh --no-color` returned `POST_START_QUICK_CHECK PASS=38 WARN=3 BLOCKED=0`, warning split `SERVICE=0 BOUNDARY=1 EVIDENCE=2`, result `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. Delegated cold-start returned `PASS=89 WARN=0 BLOCKED=0`; four reboot-scope hosts ping/SSH were OK; AWOOOI / VibeWork / AwoooGo / 2026FIFA / Agent Bounty / MOMO / Stock / Bitan / TsenYang / VTuber / Gitea / Harbor / Registry / Sentry / SigNoz / Langfuse / AIOps routes returned expected 2xx/3xx. MOMO `V10.701` has job `57 completed`, daily freshness `1|2026-06-24`, and current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`. StockPlatform freshness is `ok` through `2026-06-25` with price / chips / margin / AI recommendations current. Backup core remains green: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, offsite/rclone fresh, `last_backup_all=2026-06-26 02:31:02`; DR still has `escrow_missing=5`. 110 load around `5.19 / 4.66 / 4.91` is attributable to normal platform processes, not orphan Chrome. 188 host hygiene remains blocked by failed host PostgreSQL / certbot / startup units and must use the dedicated maintenance runbook and read-only checklist.

--- a/infra/ansible/roles/nginx/templates/188-all-sites.conf.j2
+++ b/infra/ansible/roles/nginx/templates/188-all-sites.conf.j2
@@ -86,6 +86,10 @@ server {
    listen 80;
    server_name signoz.wooo.work;

+    location /.well-known/acme-challenge/ {
+        root /var/www/html;
+    }
+
    location / {
        proxy_pass http://127.0.0.1:3301;
        proxy_http_version 1.1;
--- a/infra/ansible/roles/nginx/templates/188-internal-tools-https.conf.j2
+++ b/infra/ansible/roles/nginx/templates/188-internal-tools-https.conf.j2
@@ -9,10 +9,22 @@ server {
    server_name
        gitea.wooo.work
        sentry.wooo.work
-        langfuse.wooo.work
+        langfuse.wooo.work;
+
+    location /.well-known/acme-challenge/ {
+        root /var/www/html;
+    }
+
+    location / {
+        return 301 https://$host$request_uri;
+    }
+}
+
+server {
+    listen 80;
+    server_name
        harbor.wooo.work
-        registry.wooo.work
-        stock.wooo.work;
+        registry.wooo.work;

    location /.well-known/acme-challenge/ {
        root /var/www/certbot;
--- a/scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh
+++ b/scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh
@@ -149,17 +149,35 @@ if out=$(ssh_cmd "$REMOTE_188" '
 pg_lsclusters 2>/dev/null || true
 systemctl status postgresql@14-main.service --no-pager || true
 echo "PG_ISREADY_LOCAL $(pg_isready -h localhost -p 5432 2>/dev/null || true)"
+echo "RECOVERY_CONTAINER $(docker inspect -f "{{.State.Running}} {{.HostConfig.NetworkMode}} {{.HostConfig.RestartPolicy.Name}}" k3s-postgres-recovery 2>/dev/null || echo missing)"
 ' 2>&1); then
  echo "$out"
+  recovery_container_ready=0
+  if grep -q '^RECOVERY_CONTAINER true host ' <<<"$out" && grep -q 'PG_ISREADY_LOCAL .*accepting connections' <<<"$out"; then
+    recovery_container_ready=1
+  fi
+
  if grep -Eq '^14[[:space:]]+main[[:space:]]+5432[[:space:]]+down' <<<"$out"; then
-    blocked "host PostgreSQL cluster 14/main is down"
+    if [[ "$recovery_container_ready" -eq 1 ]]; then
+      warn "host PostgreSQL cluster 14/main is down, but controlled k3s-postgres-recovery runtime is accepting connections"
+    else
+      blocked "host PostgreSQL cluster 14/main is down and no controlled recovery runtime was accepted"
+    fi
  else
    ok "host PostgreSQL cluster 14/main not reported down"
  fi
+
  if grep -Eiq 'invalid primary checkpoint record|could not locate a valid checkpoint record|PANIC:' <<<"$out"; then
-    blocked "PostgreSQL checkpoint/WAL error detected; pg_resetwal is break-glass only"
+    if [[ "$recovery_container_ready" -eq 1 ]]; then
+      warn "PostgreSQL checkpoint/WAL error remains historical host-cluster evidence; pg_resetwal is still break-glass only"
+    else
+      blocked "PostgreSQL checkpoint/WAL error detected; pg_resetwal is break-glass only"
+    fi
  fi
-  if grep -q 'PG_ISREADY_LOCAL .*accepting connections' <<<"$out"; then
+
+  if [[ "$recovery_container_ready" -eq 1 ]]; then
+    ok "PostgreSQL runtime is provided by k3s-postgres-recovery on host network"
+  elif grep -q 'PG_ISREADY_LOCAL .*accepting connections' <<<"$out"; then
    warn "pg_isready accepts on localhost; do not use this alone as host 14/main health"
  fi
 else
@@ -169,12 +187,30 @@ fi

 section "188 certbot / ACME"
 if out=$(ssh_cmd "$REMOTE_188" '
-systemctl status certbot.service --no-pager || true
-systemctl status snap.certbot.renew.service --no-pager || true
+systemctl show certbot.service snap.certbot.renew.service certbot.timer snap.certbot.renew.timer -p Id -p ActiveState -p SubState -p Result -p UnitFileState --no-pager || true
+systemctl list-timers --all --no-pager | grep -i certbot || true
 ' 2>&1); then
  echo "$out"
-  grep -Eiq 'rateLimited|Service busy' <<<"$out" && blocked "certbot renewal is rate-limited; do not retry blindly"
-  grep -Eiq 'Some challenges have failed|challenge' <<<"$out" && blocked "certbot challenge failure requires DNS / ACME route owner evidence"
+  if grep -q 'Id=certbot.service' <<<"$out" && grep -A3 'Id=certbot.service' <<<"$out" | grep -q 'Result=failed'; then
+    blocked "apt certbot service currently failed"
+  else
+    ok "apt certbot service is not currently failed"
+  fi
+  if grep -q 'Id=snap.certbot.renew.service' <<<"$out" && grep -A3 'Id=snap.certbot.renew.service' <<<"$out" | grep -q 'Result=failed'; then
+    blocked "snap certbot renew service currently failed"
+  else
+    ok "snap certbot renew service is not currently failed"
+  fi
+  if grep -A4 'Id=certbot.timer' <<<"$out" | grep -q 'UnitFileState=disabled'; then
+    ok "legacy apt certbot timer disabled to avoid duplicate renewals"
+  else
+    warn "legacy apt certbot timer is not disabled"
+  fi
+  if grep -A4 'Id=snap.certbot.renew.timer' <<<"$out" | grep -q 'ActiveState=active' && grep -A4 'Id=snap.certbot.renew.timer' <<<"$out" | grep -q 'UnitFileState=enabled'; then
+    ok "snap certbot renew timer enabled"
+  else
+    blocked "snap certbot renew timer is not enabled and active"
+  fi
 else
  blocked "certbot status unavailable"
  echo "$out"
@@ -223,7 +259,27 @@ else
 fi

 section "Maintenance decision tree"
-cat <<'STEPS'
+if [ "$SERVICE_GREEN" -eq 1 ] && [ "$HOST_HYGIENE_BLOCKED" -eq 0 ]; then
+  cat <<'STEPS'
+Current expected outcome:
+  SERVICE_GREEN=1
+  HOST_HYGIENE_BLOCKED=0
+  RESULT=HOST_188_HYGIENE_GREEN
+
+Allowed next step:
+  1. Keep this host in the normal post-reboot summary.
+  2. Wait for snap certbot timer / ACME-window readback before declaring formal certificate renewal success.
+  3. Keep DR credential escrow and Wazuh registry evidence as separate blockers.
+
+Forbidden without separate approval:
+  - pg_resetwal
+  - DB restore
+  - Docker/systemd restart
+  - firewall change
+  - Wazuh active response or agent re-enroll
+STEPS
+else
+  cat <<'STEPS'
 Current expected outcome when 188 service is green but host hygiene is not:
  SERVICE_GREEN=1
  HOST_HYGIENE_BLOCKED=1
@@ -244,6 +300,7 @@ Forbidden without maintenance approval:
  - Docker/systemd restart
  - host file write
 STEPS
+fi

 echo
 echo "SERVICE_GREEN=$SERVICE_GREEN"
--- a/scripts/reboot-recovery/awoooi-startup.service
+++ b/scripts/reboot-recovery/awoooi-startup.service
@@ -14,8 +14,9 @@ Description=AWOOOI Auto-Startup Recovery Sequence
 After=network-online.target containerd.service docker.service
 Wants=network-online.target

-# Ensure PostgreSQL attempts to start as early as possible
-Wants=postgresql@14-main.service redis-server.service ollama.service nginx.service
+# PostgreSQL may be provided by a controlled recovery container; startup must not hard-pull
+# postgresql@14-main.service, to avoid racing the recovery runtime or triggering a false-green repair.
+Wants=redis-server.service ollama.service nginx.service

 [Service]
 Type=oneshot
--- a/scripts/reboot-recovery/awoooi-startup.sh
+++ b/scripts/reboot-recovery/awoooi-startup.sh
@@ -3,6 +3,8 @@
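 # 2026-04-04 ogt: Created from a real incident; handles container / Docker startup ordering and K3s Kine maintenance.
 # 2026-06-26 Codex: PostgreSQL checkpoint/WAL errors are now fail-closed;
 # pg_resetwal is not run inside the automatic startup script, so data destruction cannot be mistaken for recovery.
+# 2026-06-26 Codex: Allow a controlled recovery container to provide the 14/main runtime;
+# a failed systemd postgresql@14-main no longer causes a live DB to be misjudged as unavailable.
 # Deploy location: /usr/local/bin/awoooi-startup.sh (on 192.168.0.188)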
 # systemd unit: /etc/systemd/system/awoooi-startup.service

@@ -12,6 +14,22 @@ exec > >(tee -a "$LOG") 2>&1

 log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }

+postgres_runtime_ready() {
+    if systemctl is-active postgresql@14-main >/dev/null 2>&1; then
+        log "✅ PostgreSQL systemd unit active"
+        return 0
+    fi
+
+    if docker inspect -f '{{.State.Running}} {{.HostConfig.NetworkMode}}' k3s-postgres-recovery 2>/dev/null | grep -q '^true host$'; then
+        if pg_isready -h localhost -p 5432 >/dev/null 2>&1; then
+            log "✅ PostgreSQL recovery container active on host network"
+            return 0
+        fi
+    fi
+
+    return 1
+}
+
 log "=== AWOOOI startup sequence begin ==="

 # ──────────────────────────────────────────────
@@ -73,20 +91,20 @@ fi
 # ──────────────────────────────────────────────
 log "[3/7] Checking PostgreSQL..."

-if ! systemctl is-active postgresql@14-main >/dev/null 2>&1; then
+if ! postgres_runtime_ready; then
    log "PostgreSQL is not running; attempting to start it..."
    systemctl start postgresql@14-main || true
    sleep 8
 fi

-if ! systemctl is-active postgresql@14-main >/dev/null 2>&1; then
+if ! postgres_runtime_ready; then
    log "PostgreSQL failed to start; checking for a checkpoint/WAL data-layer error..."
    if journalctl -u postgresql@14-main -n 20 | grep -q "could not locate a valid checkpoint"; then
        log "❌ PostgreSQL checkpoint/WAL error detected; automatic pg_resetwal is forbidden."
        log "Manual handling requires a DB owner, backup/restore evidence, a maintenance window, and a post-check."
        exit 1
    fi
-    systemctl is-active postgresql@14-main && log "✅ PostgreSQL repair succeeded" || { log "❌ PostgreSQL repair failed"; exit 1; }
+    postgres_runtime_ready && log "✅ PostgreSQL runtime available" || { log "❌ PostgreSQL repair failed"; exit 1; }
 fi

 # Wait for PG to accept connections
--- a/scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py
+++ b/scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py
@@ -23,7 +23,7 @@ OWNER_PACKET_GENERATOR = (
 )

 EXPECTED_SCHEMA = "awoooi_post_reboot_next_gate_owner_packets_v1"
-EXPECTED_GATES = {
+KNOWN_GATES = {
    "credential_escrow_evidence",
    "host_188_hygiene_maintenance_window",
    "wazuh_manager_registry_export",
@@ -187,12 +187,21 @@ def validate_packet(packet: dict[str, Any]) -> list[str]:
        counts = {}

    gate_ids = {str(item.get("packet_id", "")) for item in owner_packets if isinstance(item, dict)}
-    if gate_ids != EXPECTED_GATES:
+    unknown_gates = sorted(gate_ids - KNOWN_GATES)
+    if unknown_gates:
+        failures.append(f"unknown_gate_ids={unknown_gates}")
+
+    source = packet.get("source", {})
+    if not isinstance(source, dict):
+        failures.append("source_not_object")
+        source = {}
+    expected_gates = set(str(item) for item in as_list(source.get("next_required_gates")))
+    if expected_gates != gate_ids:
        failures.append(f"gate_ids={sorted(gate_ids)}")

    expected_counts = {
-        "next_gate_count": 3,
-        "p0_gate_count": 3,
+        "next_gate_count": len(gate_ids),
+        "p0_gate_count": len(gate_ids),
    }
    for key, expected in expected_counts.items():
        if counts.get(key) != expected:
--- a/scripts/reboot-recovery/post-start-quick-check.sh
+++ b/scripts/reboot-recovery/post-start-quick-check.sh
@@ -9,6 +9,8 @@ ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
 SSH_CONNECT_TIMEOUT="${SSH_CONNECT_TIMEOUT:-6}"
 ROUTE_RETRY_ATTEMPTS="${ROUTE_RETRY_ATTEMPTS:-3}"
 ROUTE_RETRY_DELAY_SECONDS="${ROUTE_RETRY_DELAY_SECONDS:-2}"
+STOCK_FRESHNESS_RETRY_ATTEMPTS="${STOCK_FRESHNESS_RETRY_ATTEMPTS:-6}"
+STOCK_FRESHNESS_RETRY_DELAY_SECONDS="${STOCK_FRESHNESS_RETRY_DELAY_SECONDS:-5}"
 RUN_COLD_START=1
 RUN_MOMO=1
 RUN_STOCK=1
@@ -76,6 +78,8 @@ Options:
 Environment:
  ROUTE_RETRY_ATTEMPTS        Public route attempts before blocking. Default: 3.
  ROUTE_RETRY_DELAY_SECONDS  Delay between failed public route attempts. Default: 2.
+  STOCK_FRESHNESS_RETRY_ATTEMPTS       Stock freshness attempts before blocking. Default: 6.
+  STOCK_FRESHNESS_RETRY_DELAY_SECONDS  Delay between failed Stock freshness attempts. Default: 5.

 Exit codes:
  0 = no service blockers. Boundary / evidence warnings may still be present.
@@ -277,9 +281,23 @@ fi
 if [[ "$RUN_STOCK" -eq 1 ]]; then
  section "StockPlatform freshness"
  stock_tmp="$(mktemp -t post-start-stock.XXXXXX)"
-  stock_code="$(curl -k -sS -o "$stock_tmp" -w '%{http_code}' --max-time 12 "https://stock.wooo.work/api/v1/system/freshness" 2>/dev/null || true)"
+  stock_code=""
+  stock_attempt=1
+  while [[ "$stock_attempt" -le "$STOCK_FRESHNESS_RETRY_ATTEMPTS" ]]; do
+    stock_code="$(curl -k -sS -o "$stock_tmp" -w '%{http_code}' --max-time 12 "https://stock.wooo.work/api/v1/system/freshness" 2>/dev/null || true)"
+    if [[ "$stock_code" == 2* ]]; then
+      if [[ "$stock_attempt" -gt 1 ]]; then
+        evidence_warn "StockPlatform freshness recovered after attempt=$stock_attempt"
+      fi
+      break
+    fi
+    if [[ "$stock_attempt" -lt "$STOCK_FRESHNESS_RETRY_ATTEMPTS" ]]; then
+      sleep "$STOCK_FRESHNESS_RETRY_DELAY_SECONDS"
+    fi
+    stock_attempt=$((stock_attempt + 1))
+  done
  if [[ "$stock_code" != 2* ]]; then
-    blocked "StockPlatform freshness endpoint returned ${stock_code:-curl_failed}"
+    blocked "StockPlatform freshness endpoint returned ${stock_code:-curl_failed} attempts=$STOCK_FRESHNESS_RETRY_ATTEMPTS"
    cat "$stock_tmp" || true
  else
    python3 - "$stock_tmp" <<'PY'