diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index af818481..77416464 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -45489,3 +45489,40 @@ production browser smoke:
 - The 188 host hygiene maintenance window has still not been executed.
 - Wazuh manager registry accepted remains `0`.
 - Do not declare that an owner request has been sent, that an owner response has been received / accepted, that runtime writes have been approved, `DR_COMPLETE`, 188 host fully green, or Wazuh registry recovered.
+
+## 2026-06-26 - 12:13 188 host hygiene live repair closeout / SOP v1.73
+
+**Time and sources**:
+- 2026-06-26 11:40-12:13 Asia/Taipei.
+- Sources: 188 live SSH maintenance session, `scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh --no-color`, `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color`, public ACME HTTP-01 self-test, systemd / Docker / PostgreSQL runtime readback.
+
+**Completed**:
+- 188 startup source-of-truth fixed: `/usr/local/bin/awoooi-startup.sh` now uses `postgres_runtime_ready()`, which accepts either an active `postgresql@14-main` or the `k3s-postgres-recovery` host-network PostgreSQL runtime plus `pg_isready`; it no longer runs `pg_resetwal` automatically.
+- 188 `awoooi-startup.service` drops the hard `postgresql@14-main.service` dependency, so that a production DB runtime already provided by the recovery container is not misjudged as a failed host cluster.
+- 188 Nginx ACME HTTP-01 route fixed: `sentry.wooo.work`, `gitea.wooo.work`, `langfuse.wooo.work`, and `signoz.wooo.work` all return the same non-secret self-test token over public HTTP.
+- 188 duplicate certbot runners consolidated: apt `certbot.timer` is disabled / inactive, snap `snap.certbot.renew.timer` is enabled / active / waiting; `certbot.service` and `snap.certbot.renew.service` are both `Result=success / inactive`.
+- 188 failed units cleared to zero, `systemctl is-system-running=running`.
+- Repo baseline updated in sync: `scripts/reboot-recovery/awoooi-startup.sh`, `scripts/reboot-recovery/awoooi-startup.service`, `scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh`, 188 Nginx Ansible templates, SOP / quick-check / workplan / runbook.
+- `post-reboot-owner-packet-contract-guard.py` now validates dynamically against the live `source.next_required_gates`, and no longer permanently pins the repaired 188 gate as one of three mandatory gates.
+- `post-start-quick-check.sh`: the StockPlatform dedicated freshness gate now retries with `STOCK_FRESHNESS_RETRY_ATTEMPTS=6` / `STOCK_FRESHNESS_RETRY_DELAY_SECONDS=5`, so that a brief upstream `502` after a 110 Docker / CI rollout does not cause recovered service / data freshness to be misjudged as blocked; consecutive failures remain a hard blocker.
+
+**Live verification results**:
+- `scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh --no-color`: `PASS=16 WARN=3 BLOCKED=0`, `SERVICE_GREEN=1`, `HOST_HYGIENE_BLOCKED=0`, `Result: HOST_188_HYGIENE_GREEN.`
+- `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color`: `POST_START_RC=0`, `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `POST_START_PASS=38`, `POST_START_WARN=4`, `POST_START_BLOCKED=0`, `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_HYGIENE_BLOCKED=0`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `RUNTIME_ACTION_AUTHORIZED=0`.
+- `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export`; 188 host hygiene has been removed from the next gates.
+
+**Command types run**:
+- Writes: 188 startup script/service install, 188 Nginx ACME route config install, `systemctl daemon-reload`, `systemctl reset-failed`, `nginx -t`, `systemctl reload nginx`, disabling the duplicate apt `certbot.timer`, creating / deleting the non-secret ACME self-test token file.
+- Read-only: systemd / Docker / PostgreSQL runtime / certbot timer / public route / backup / post-reboot summary / Wazuh repo gates.
+- Not done: no `pg_resetwal`, no DB restore, no Docker restart, no K3s / ArgoCD / firewall / Wazuh runtime action, no reading or storing of secret values / tokens / passwords, no forced certbot renew.
+
+**Current determination**:
+- 188 host hygiene repair: `0% -> 100%`.
+- Reboot service / product data / backup / 188 host hygiene: `GREEN`.
+- Overall recovery declaration: `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`.
+- SOP / quick-check / owner packet guard: v1.73 dynamic gate baseline.
+
+**Still blocked / must not be declared**:
+- DR credential escrow evidence is still missing `5`: do not declare `DR_COMPLETE`.
+- Wazuh manager registry accepted is still `0`: do not declare that Wazuh management of all hosts has recovered.
+- certbot formal renewal readback is not yet complete; this round completed HTTP-01 route / timer hygiene / failed-unit cleanup, and a successful formal renew must wait for the snap certbot timer or a separate ACME window.
diff --git a/docs/runbooks/FULL-STACK-COLD-START-SOP.md b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
index 14e52a65..bc75f398 100644
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -1,6 +1,6 @@
 # AWOOOI Full-Stack Cold Start and Host Reboot SOP

-> Version: v1.72
+> Version: v1.73
 > Last updated: 2026-06-26 Asia/Taipei
 > Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.

@@ -10,19 +10,21 @@

 This section is the first determination anchor after every handover, power-on, shutdown, or reboot. If the date is not today, rerun the live check first, then update this section and `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md`.

-If you only need to decide quickly after a reboot whether recovery can be declared, first run the machine-readable summary: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color`. The script calls the one-page overall check, the 188 host hygiene checklist, and the Wazuh no-false-green repo gates, and leaves the delegated logs in `/tmp/awoooi-post-reboot-readiness-*`. Then run `scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color` to turn the summary into allowed / forbidden declarations, so that service green is not misreported as DR complete, 188 host fully green, Wazuh registry recovered, or runtime authorized. If the summary shows `SERVICE_GREEN=1` but `NEXT_REQUIRED_GATES` is still non-empty, run `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` to turn the three blockers (DR escrow, 188 hygiene, Wazuh registry) into an owner / evidence / forbidden-action dispatch checklist; when a machine-readable intake is needed, run `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color --output /tmp/awoooi-post-reboot-owner-packets.json` to produce the `awoooi_post_reboot_next_gate_owner_packets_v1` JSON, and immediately run `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json`. Dispatch / packet / guard all pin `DISPATCH_AUTHORIZED=0`, `REQUEST_SENT_COUNT=0`, `OWNER_RESPONSE_ACCEPTED=0`, `HOST_WRITE_AUTHORIZED=0`, `SECRET_VALUE_COLLECTION_ALLOWED=0`, `RUNTIME_GATE=0`; if the guard does not pass, do not send an owner request, do not write an escrow marker, do not enter a maintenance window, and do not declare DR / 188 host hygiene / Wazuh registry complete. When manual expansion is needed, run `scripts/reboot-recovery/post-start-quick-check.sh --no-color` and use `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` as the fallback. The long SOP keeps the full background, exception handling, and Plan B; the short wrapper / checklist handles the fixed determination within T+10 minutes each time.
+If you only need to decide quickly after a reboot whether recovery can be declared, first run the machine-readable summary: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color`. The script calls the one-page overall check, the 188 host hygiene checklist, and the Wazuh no-false-green repo gates, and leaves the delegated logs in `/tmp/awoooi-post-reboot-readiness-*`. Then run `scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color` to turn the summary into allowed / forbidden declarations, so that service green is not misreported as DR complete, 188 host hygiene, Wazuh registry recovered, or runtime authorized. If the summary shows `SERVICE_GREEN=1` but `NEXT_REQUIRED_GATES` is still non-empty, run `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` to turn the blockers still open in the live summary into an owner / evidence / forbidden-action dispatch checklist; when a machine-readable intake is needed, run `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color --output /tmp/awoooi-post-reboot-owner-packets.json` to produce the `awoooi_post_reboot_next_gate_owner_packets_v1` JSON, and immediately run `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json`. Dispatch / packet / guard all pin `DISPATCH_AUTHORIZED=0`, `REQUEST_SENT_COUNT=0`, `OWNER_RESPONSE_ACCEPTED=0`, `HOST_WRITE_AUTHORIZED=0`, `SECRET_VALUE_COLLECTION_ALLOWED=0`, `RUNTIME_GATE=0`; if the guard does not pass, do not send an owner request, do not write an escrow marker, do not enter a maintenance window, and do not declare DR / Wazuh registry complete. When manual expansion is needed, run `scripts/reboot-recovery/post-start-quick-check.sh --no-color` and use `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` as the fallback. The long SOP keeps the full background, exception handling, and Plan B; the short wrapper / checklist handles the fixed determination within T+10 minutes each time.

-2026-06-26 07:47 machine-readable readiness summary: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` has been verified as usable, artifact dir `/tmp/awoooi-post-reboot-readiness-20260626-074702`. The summary printed `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `POST_START_PASS=38`, `POST_START_WARN=3`, `POST_START_BLOCKED=0`, `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_SERVICE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=1`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_RUNTIME_GATE=0`, `RUNTIME_ACTION_AUTHORIZED=0`. The current `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, with `NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export`. This is the first-layer operator / AI agent determination format after every reboot.
+2026-06-26 12:13 latest live summary supersedes the 08:59 gate set: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` returned `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `POST_START_PASS=38`, `POST_START_WARN=4`, `POST_START_BLOCKED=0`, `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_SERVICE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=0`, `HOST_188_RESULT=HOST_188_HYGIENE_GREEN.`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`, `WAZUH_DASHBOARD_INDEX_OK=3`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export`. 188 host hygiene has been removed from the blockers; the only items that cannot yet be declared complete are DR credential escrow and the Wazuh manager registry. The ACME HTTP-01 route and certbot timer hygiene are repaired, but do not declare that the certificate has been formally renewed; that must wait for the snap certbot timer / ACME window readback.

-2026-06-26 08:12 next-gate dispatch baseline: `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` has been verified against the latest summary live output. The script read back `SERVICE_GREEN=1`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export`, and emitted three P0 checklists: first, credential escrow non-secret evidence, requiring the evidence id / owner / reviewer for the five escrow items and forbidding passwords, tokens, hashes, and prefixes/suffixes; second, the 188 host PostgreSQL / certbot hygiene maintenance window, requiring DB / DNS-TLS / rollback / postcheck owner decisions and forbidding unapproved actions such as `pg_resetwal`, certbot renew, Nginx reload, DB restore, and Docker restart; third, the Wazuh manager registry redacted export, requiring a redacted registry count, host alias status, dashboard API/version status, time window, and reviewer, and forbidding agent real names, internal IPs, client.keys, raw payloads, active response, agent re-enroll, Wazuh restart, secret patch, host write, and Kali active scan. The output pins `NEXT_GATE_COUNT=3`, `NEXT_STEP=dispatch_owner_packets_manually_after_review`, `RUNTIME_ACTION_AUTHORIZED=0`; this is a dispatch checklist, not a sent request or a runtime approval.
+2026-06-26 07:47 machine-readable readiness summary retained as historical pre-repair evidence: at that time `HOST_188_HYGIENE_BLOCKED=1` and `NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export`. This paragraph is only for comparing the state before and after the 188 repair; the current gate set must use the 12:13 baseline.
+
+2026-06-26 08:12 next-gate dispatch baseline retained as historical pre-repair evidence: at that time the output was pinned to three P0 checklists. From 12:13 on, dispatch is emitted dynamically from the live summary; the currently expected `NEXT_GATE_COUNT=2` covers only credential escrow and Wazuh registry.

 2026-06-26 08:29 owner-packet JSON baseline: `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color` converts the dispatch output into `schema_version=awoooi_post_reboot_next_gate_owner_packets_v1`, containing three `owner_packets`, `next_gate_count=3`, `p0_gate_count=3`, `request_sent_count=0`, `owner_response_received_count=0`, `owner_response_accepted_count=0`, `runtime_action_authorized_count=0`. This JSON is an AI / operator / owner review intake, not an external request, and not a maintenance-window approval.

-2026-06-26 08:40 owner-packet contract guard baseline: `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json` pins `schema_version=awoooi_post_reboot_next_gate_owner_packets_v1`, the three P0 gate ids, `next_gate_count=3`, `p0_gate_count=3`, `request_sent_count=0`, `owner_response_received_count=0`, `owner_response_accepted_count=0`, `runtime_action_authorized_count=0`, `dispatch_authorized=0`, `host_write_authorized=0`, `secret_value_collection_allowed=0`, `runtime_gate_count=0`. The guard also verifies that escrow forbids password / token / secret value / hash / prefix / suffix / raw credential, that 188 forbids `pg_resetwal` / certbot renew / Nginx reload / DB restore / Docker restart / host file write, and that Wazuh forbids raw payload / internal IP / active response / re-enroll / restart / secret patch / host write / Kali active scan, and it requires the four no-false-green rules to be present. The output must be `POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=3 request_sent=0 accepted=0 runtime_gate=0`.
+2026-06-26 08:40 owner-packet contract guard baseline retained as historical pre-repair evidence: the old version pinned three P0 gates. From 12:13 on, the contract guard validates dynamically against `source.next_required_gates`, and the currently expected success line is `POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=2 request_sent=0 accepted=0 runtime_gate=0`; it returns to `gates=3` only if 188 hygiene regresses in the future.

 2026-06-26 08:47 Wazuh registry detail baseline: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` now includes the details of the Wazuh repo-side coverage / runtime gate as fixed key/value pairs: `WAZUH_COVERAGE_SCOPE=6`, `WAZUH_DIRECT_ACTIVE=2`, `WAZUH_NO_TRANSPORT=1`, `WAZUH_SSH_BLOCKED=3`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`, `WAZUH_DASHBOARD_INDEX_OK=3`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_RUNTIME_GATE=0`. The `wazuh_manager_registry_export` gate of `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` puts these states into `CURRENT_EVIDENCE`. Iron rule for interpretation: route `200`, transport `6`, and Dashboard index pattern `3` are none of them manager registry accepted; management of all hosts and the Dashboard API repair still require owner evidence / registry export / acceptance record.

-2026-06-26 08:59 declaration guard baseline: `scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color` converts the summary into `awoooi_post_reboot_declaration_guard_v1`; the current status is `allowed_with_boundary_blockers`. Allowed declarations: `SERVICE_RECOVERY_GREEN`, `PRODUCT_DATA_GREEN`, `BACKUP_CORE_GREEN`, and `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`; forbidden declarations: `DR_COMPLETE`, `HOST_188_FULLY_GREEN`, `WAZUH_REGISTRY_RECOVERED`, `RUNTIME_ACTION_AUTHORIZED`. When tested with `--proposed DR_COMPLETE` or `--proposed WAZUH_REGISTRY_RECOVERED`, the guard must reject with `blocked_false_green_proposal`.
+2026-06-26 08:59 declaration guard baseline retained as historical pre-repair evidence: at that time `HOST_188_FULLY_GREEN` was still forbidden. From 12:13 on, the guard dynamically allows 188 host hygiene green based on `HOST_188_HYGIENE_BLOCKED=0`, but still rejects `DR_COMPLETE`, `WAZUH_REGISTRY_RECOVERED`, `RUNTIME_ACTION_AUTHORIZED`.

 2026-06-26 07:39 live quick-check refresh: `scripts/reboot-recovery/post-start-quick-check.sh --no-color` ran to completion; ping / SSH to all four hosts OK, delegated cold-start was `PASS=89 WARN=0 BLOCKED=0`, and the wrapper summary was `POST_START_QUICK_CHECK PASS=38 WARN=3 BLOCKED=0`, warning split `SERVICE=0 BOUNDARY=1 EVIDENCE=2`, `RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. MOMO health `V10.701`, daily snapshot `109061` rows / `2025-07-01..2026-06-24`, current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, latest import job `57 completed`. StockPlatform freshness `status=ok`, latest trading date `2026-06-25`; price / chips / margin / AI recommendations are all at `2026-06-25`. Backup-status at 07:39 shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, offsite/rclone fresh, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`. The public routes extended list all returned the expected 2xx/3xx. 110 CPU attribution shows load of about `5.19 / 4.66 / 4.91`, with CPU idle at `80%+` in most samples; the current load comes from normal platform work such as Gitea / ClickHouse / Docker / Kafka / StockPlatform / AWOOOI API / Sentry, not orphan Chrome. Allowed declarations this round: hosts, K3s, services, websites, product data freshness, backup core, and offsite freshness are green; forbidden declarations: DR complete, credential escrow complete, 188 host fully green, Wazuh registry recovered.

diff --git a/docs/runbooks/HOST-188-HYGIENE-MAINTENANCE-RUNBOOK.md b/docs/runbooks/HOST-188-HYGIENE-MAINTENANCE-RUNBOOK.md
index 98b3449b..4966e785 100644
--- a/docs/runbooks/HOST-188-HYGIENE-MAINTENANCE-RUNBOOK.md
+++ b/docs/runbooks/HOST-188-HYGIENE-MAINTENANCE-RUNBOOK.md
@@ -1,34 +1,33 @@
 # 188 Host Hygiene Maintenance Window Runbook

-> Version: 2026-06-26.v1
+> Version: 2026-06-26.v2
 > Scope: `192.168.0.188` host PostgreSQL `14/main`, `awoooi-startup.service`, certbot / ACME renewal hygiene.
-> Status: `SERVICE_GREEN_HOST_HYGIENE_BLOCKED`
+> Status: `HOST_188_HYGIENE_GREEN`

 ---

 ## 1. Purpose

-The product services and public routes on 188 are currently available, but host systemd is still `degraded`. This runbook separates "service recovered" from "host hygiene not yet converged", so that route `200`, container healthy, `pg_isready`, or exporter green is not mistaken for a fully healthy host.
+The product services, public routes, and host hygiene on 188 have all now converged to green. This runbook keeps "service recovery", "host hygiene", "DR escrow", and "Wazuh registry" separate, so that route `200`, container healthy, `pg_isready`, exporter green, or `reset-failed` is not mistaken for completion at the other layers as well.

-This document is not a maintenance-window approval, nor a runtime action authorization. Until an owner, rollback owner, maintenance window, backup / restore evidence, and post-check plan are in place, do not run restart, reset-failed, renew, reload, restore, or `pg_resetwal`.
+This document records the 188 host hygiene repair completed on 2026-06-26 and the boundaries for subsequent operations. It is not proof that DR escrow is complete, nor an authorization for Wazuh / SOC runtime action. Until an owner, rollback owner, maintenance window, backup / restore evidence, and post-check plan are in place, data-layer break-glass, restore, unapproved renew, and other host writes remain forbidden.

 ---

 ## 2. Latest read-only evidence

-2026-06-26 07:18-07:19 Asia/Taipei read-only evidence:
+2026-06-26 12:02-12:13 Asia/Taipei live evidence:

 | Layer | Evidence | Verdict |
 |------|------|------|
-| 188 disk | `/` is about `982G`, `712G` used, `230G` available, `76%` | Not capacity exhaustion |
-| Product containers | MOMO, VibeWork, 2026FIFA, ClawBot, MinIO, exporters, MOMO DB and other containers visibly healthy / up | Product services green |
-| Host systemd | failed units: `awoooi-startup.service`, `postgresql@14-main.service`, `certbot.service`, `snap.certbot.renew.service` | host hygiene blocked |
-| Host PostgreSQL | `pg_lsclusters` shows `14/main down` | host cluster unhealthy |
-| PostgreSQL status | `invalid primary checkpoint record`, `PANIC: could not locate a valid checkpoint record` | checkpoint / WAL class data-layer error |
-| Startup unit | Previously attempted to run `pg_resetwal` as root, and failed | Auto-repair logic must fail closed |
-| certbot apt unit | `sentry.wooo.work` renewal rate-limited | Do not hammer ACME repeatedly |
-| certbot snap unit | `sentry.wooo.work` challenge failed | Confirm the ACME route / DNS / gateway owner first |
-| public cert | shared `sentry.wooo.work` certificate valid until `2026-07-09 16:03:40 UTC` | Leaves room for a maintenance window; not expiring immediately |
+| Host systemd | `systemctl is-system-running` is `running`, failed units `0 loaded units listed` | host hygiene green |
+| Product containers | MOMO, VibeWork, 2026FIFA, ClawBot, MinIO, exporters, MOMO DB and other containers healthy / up | Product services green |
+| PostgreSQL runtime | `k3s-postgres-recovery` container is `Running=true`, `NetworkMode=host`, restart policy `unless-stopped`; `pg_isready -h localhost -p 5432` accepting connections | production DB runtime green |
+| Host PostgreSQL unit | `postgresql@14-main.service` is `Result=success / inactive` after reset; `pg_lsclusters` may still show the host cluster as down because of the recovery container PID model | No longer treated as a service blocker; decided by runtime-ready |
+| Startup unit | `/usr/local/bin/awoooi-startup.sh` now uses `postgres_runtime_ready()`, accepting an active host unit or the `k3s-postgres-recovery` host-network runtime; it no longer runs `pg_resetwal` automatically | fail-closed repair complete |
+| Nginx / ACME route | HTTP-01 self-test tokens for `sentry.wooo.work`, `gitea.wooo.work`, `langfuse.wooo.work`, and `signoz.wooo.work` are all served correctly from `/var/www/html` | challenge route green |
+| certbot timers | duplicate apt `certbot.timer` disabled; snap `snap.certbot.renew.timer` enabled / active / waiting; `certbot.service` and `snap.certbot.renew.service` `Result=success / inactive` | renewal runner hygiene green |
+| public cert | shared `sentry.wooo.work` certificate valid until `2026-07-09 16:03:40 UTC` | Not yet declared renewed; waiting for the snap timer / ACME window |

 ---

@@ -112,9 +111,9 @@ SSH_BATCH_MODE=yes bash scripts/reboot-recovery/188-host-hygiene-maintenance-che

 ```text
 SERVICE_GREEN=1
-HOST_HYGIENE_BLOCKED=1
+HOST_HYGIENE_BLOCKED=0
 RUNTIME_ACTION_AUTHORIZED=0
-Result: SERVICE_GREEN_HOST_HYGIENE_BLOCKED. Open maintenance window; do not auto-repair.
+Result: HOST_188_HYGIENE_GREEN.
 ```

 ### Phase A: Pre-maintenance read-only snapshot
@@ -169,17 +168,18 @@ curl -fsS https://awoooi.wooo.work/api/v1/health
 | Lane | Completion | Notes |
 |------|-------:|------|
 | Service recovery | `100%` | routes, containers, cold-start, backup core are green |
-| 188 host hygiene evidence | `80%` | Read-only root cause is clear; owner disposition missing |
-| PostgreSQL owner decision | `0%` | Retire, restore, or break-glass not yet confirmed |
-| certbot owner decision | `0%` | DNS / ACME / coverage evidence not yet accepted |
-| runtime repair | `0%` | Not approved, not executed |
+| 188 host hygiene repair | `100%` | startup fail-closed, PostgreSQL runtime-ready check, systemd failed units, ACME route, and certbot timer hygiene have converged |
+| PostgreSQL runtime source-of-truth | `100%` | production DB runtime is judged by the `k3s-postgres-recovery` host-network container + `pg_isready`, not by the false red from host `pg_lsclusters` |
+| certbot / ACME route hygiene | `95%` | HTTP-01 route and timer split are fixed; a successful formal renew still waits for the snap timer / ACME window |
+| DR escrow | `BLOCKED` | `escrow_missing=5`; 188 host green cannot substitute for it |
+| Wazuh registry | `BLOCKED` | manager registry accepted `0`; route `200` or transport count cannot substitute for it |

 ---

 ## 8. Must not be declared

-- Do not declare 188 host fully green.
-- Do not declare host PostgreSQL recovered.
-- Do not declare certbot renewal recovered.
-- Do not declare that the repair is complete merely because `reset-failed` was run.
-- Do not declare that this document approves any runtime action.
+- Do not declare DR complete; credential escrow evidence is still missing `5`.
+- Do not declare Wazuh registry recovered; manager registry accepted is still `0`.
+- Do not declare that the certbot certificate has completed a formal renew; what is complete is HTTP-01 route / timer hygiene / failed-unit cleanup.
+- Do not treat `reset-failed` alone as a completed repair; this repair is complete because the startup source, Nginx ACME route, timer split, and postcheck all converged together.
+- Do not declare that this document approves any new runtime action.
diff --git a/docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md b/docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md
index c01b3b4a..ec4c6f3f 100644
--- a/docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md
+++ b/docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md
@@ -1,6 +1,6 @@
 # One-Page Overall Check After Host Reboot

-> Version: v1.12
+> Version: v1.13
 > Last updated: 2026-06-26 Asia/Taipei
 > Scope: 110 / 120 / 121 / 188 post-reboot service recovery. 112 Kali / Wazuh / active scan are not part of this procedure.

@@ -10,7 +10,7 @@

 After any of 110 / 120 / 121 / 188 is powered on, shut down, rebooted, recovered from a power loss, put through a VMware console fsck, or sees a large Docker / K3s reshuffle, run this page first, then decide whether to declare recovery.

-Latest baseline: 2026-06-26 08:59 post-reboot declaration guard. `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` returns `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_HYGIENE_BLOCKED=1`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_COVERAGE_SCOPE=6`, `WAZUH_DIRECT_ACTIVE=2`, `WAZUH_NO_TRANSPORT=1`, `WAZUH_SSH_BLOCKED=3`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`, `WAZUH_DASHBOARD_INDEX_OK=3`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. `scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color` turns the summary into allowed / forbidden declarations: currently allowed are service, product data, backup core, and `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`; forbidden are `DR_COMPLETE`, `HOST_188_FULLY_GREEN`, `WAZUH_REGISTRY_RECOVERED`, `RUNTIME_ACTION_AUTHORIZED`. Next, `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` expands `NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export` into three owner / evidence / forbidden-action checklists; the Wazuh checklist's `CURRENT_EVIDENCE` keeps the registry accepted, coverage scope, direct active, no transport, SSH blocked, route, transport, Dashboard API, and index pattern states, so that route `200` or transport `6` is not misreported as registry recovered. `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color` further converts this into the `awoooi_post_reboot_next_gate_owner_packets_v1` JSON, pinning `dispatch_authorized=0`, `request_sent_count=0`, `owner_response_accepted_count=0`, `host_write_authorized=0`, `secret_value_collection_allowed=0`, `runtime_gate_count=0`; `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json` pins the three P0 gates, all `0 / false` boundaries, the forbidden secret payload / runtime action, and the no-false-green rules. Cold-start `PASS=89 WARN=0 BLOCKED=0`; MOMO `V10.701`, latest import job `57 completed`, `DB_DAILY_FRESHNESS 1|2026-06-24`; StockPlatform `/api/v1/system/freshness` is `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; backup-status 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `last_backup_all=2026-06-26 02:31:02`. DR still cannot be declared complete because `escrow_missing=5`. 188 host hygiene and the Wazuh manager registry remain independent blockers outside service green.
+Latest baseline: 2026-06-26 12:13 post-reboot summary / declaration guard. `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` returns `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_HYGIENE_BLOCKED=0`, `HOST_188_RESULT=HOST_188_HYGIENE_GREEN.`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_COVERAGE_SCOPE=6`, `WAZUH_DIRECT_ACTIVE=2`, `WAZUH_NO_TRANSPORT=1`, `WAZUH_SSH_BLOCKED=3`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`, `WAZUH_DASHBOARD_INDEX_OK=3`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. `scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color` turns the summary into allowed / forbidden declarations: currently allowed are service, product data, backup core, 188 host hygiene green, and `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`; forbidden are `DR_COMPLETE`, `WAZUH_REGISTRY_RECOVERED`, `RUNTIME_ACTION_AUTHORIZED`. Next, `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` expands `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export` into owner / evidence / forbidden-action checklists; the Wazuh checklist's `CURRENT_EVIDENCE` keeps the registry accepted, coverage scope, direct active, no transport, SSH blocked, route, transport, Dashboard API, and index pattern states, so that route `200` or transport `6` is not misreported as registry recovered. `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color` further converts this into the `awoooi_post_reboot_next_gate_owner_packets_v1` JSON, pinning `dispatch_authorized=0`, `request_sent_count=0`, `owner_response_accepted_count=0`, `host_write_authorized=0`, `secret_value_collection_allowed=0`, `runtime_gate_count=0`; `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json` dynamically pins the P0 gates from the live `next_required_gates`, along with all `0 / false` boundaries, the forbidden secret payload / runtime action, and the no-false-green rules. DR still cannot be declared complete because `escrow_missing=5`; the Wazuh manager registry remains an independent blocker outside service green. ACME HTTP-01 route / certbot timer hygiene is repaired, but a successful formal certificate renew must wait for the snap certbot timer or a separate ACME window readback.

 This page answers only four things:

@@ -55,7 +55,7 @@ scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color
 - `PRODUCT_DATA_GREEN=1`: the main MOMO / StockPlatform data freshness can be declared recovered.
 - `BACKUP_CORE_GREEN=1`: backup core can be declared recovered.
 - `DR_ESCROW_BLOCKED=1` / `ESCROW_MISSING_COUNT>0`: DR complete cannot be declared.
-- `HOST_188_HYGIENE_BLOCKED=1`: 188 host hygiene needs a maintenance window; this does not mean product services are down.
+- `HOST_188_HYGIENE_BLOCKED=0`: 188 host hygiene has converged; if it returns to `1` in the future, follow the 188-specific runbook.
 - `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`: Wazuh management of all hosts cannot be declared recovered.
 - `WAZUH_ROUTE_CODE=200` / `WAZUH_TRANSPORT_COUNT>0` only represent route / transport evidence; they must still be read together with `WAZUH_COVERAGE_SCOPE`, `WAZUH_DIRECT_ACTIVE`, `WAZUH_NO_TRANSPORT`, `WAZUH_SSH_BLOCKED`, `WAZUH_DASHBOARD_API_CONNECTION`, and `WAZUH_MANAGER_REGISTRY_ACCEPTED`.
 - `RUNTIME_ACTION_AUTHORIZED=0`: this procedure authorizes no runtime write operations.
@@ -73,7 +73,7 @@ scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color
 scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color --proposed DR_COMPLETE --proposed WAZUH_REGISTRY_RECOVERED
 ```

-Under the current baseline, `DR_COMPLETE`, `HOST_188_FULLY_GREEN`, `WAZUH_REGISTRY_RECOVERED`, and `RUNTIME_ACTION_AUTHORIZED` must all be rejected; any person, AI agent, or document that proposes these declarations must first supply the corresponding evidence / owner acceptance.
+Under the current baseline, `DR_COMPLETE`, `WAZUH_REGISTRY_RECOVERED`, and `RUNTIME_ACTION_AUTHORIZED` must all be rejected; any person, AI agent, or document that proposes these declarations must first supply the corresponding evidence / owner acceptance. `HOST_188_FULLY_GREEN` may be declared only while the summary keeps `HOST_188_HYGIENE_BLOCKED=0` and the checklist reports `HOST_188_HYGIENE_GREEN`.

 When the summary shows `SERVICE_GREEN=1` but `NEXT_REQUIRED_GATES` remains, run the next-gate dispatch checklist:

@@ -81,7 +81,7 @@ When the summary shows `SERVICE_GREEN=1` but `NEXT_REQUIRED_GATES` remains, run the next
 scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color
 ```

-This script only turns `credential_escrow_evidence`, `host_188_hygiene_maintenance_window`, and `wazuh_manager_registry_export` into owner / required evidence / forbidden action / done criteria. It sends no request, reads no secret, writes no marker, does not restart / reload / repair / import / delete / patch, and authorizes no host / Wazuh / Nginx / K8s / DB runtime action. If it outputs `SERVICE_GREEN=0`, go back to service recovery first and do not enter boundary dispatch.
+這支腳本只把目前 live summary 內的 `NEXT_REQUIRED_GATES` 轉成 owner / required evidence / forbidden action / done criteria。2026-06-26 12:13 起通常只剩 `credential_escrow_evidence` 與 `wazuh_manager_registry_export`;若未來 188 又紅,才會重新出現 `host_188_hygiene_maintenance_window`。它不送 request、不讀 secret、不寫 marker、不 restart / reload / repair / import / delete / patch,也不授權 host / Wazuh / Nginx / K8s / DB runtime action。若它輸出 `SERVICE_GREEN=0`,先回到服務恢復,不進入 boundary dispatch。 若要交給 AI / 工單 / owner review 使用,產生機器可讀 owner packet: @@ -98,7 +98,7 @@ scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color --outp scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json ``` -guard 必須輸出 `POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=3 request_sent=0 accepted=0 runtime_gate=0`。若 gate 數量、P0 gate id、`0 / false` 欄位、禁用 secret payload、188 禁用維修動作、Wazuh 禁用 active response / host write,或 no-false-green 規則任何一項漂移,視為 `BLOCKED`,不得送 owner request、不得寫 escrow marker、不得進維護窗口、不得宣稱 DR / Wazuh / 188 host hygiene 完成。 +guard 必須輸出 `POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates= request_sent=0 accepted=0 runtime_gate=0`。目前預期 `gates=2`;若 188 hygiene 回到 blocked,才會是 `gates=3`。若 gate 數量、P0 gate id、`0 / false` 欄位、禁用 secret payload、Wazuh 禁用 active response / host write,或 no-false-green 規則任何一項漂移,視為 `BLOCKED`,不得送 owner request、不得寫 escrow marker、不得進維護窗口、不得宣稱 DR / Wazuh 完成。 需要展開細節時,再使用 repo-side wrapper: @@ -108,7 +108,7 @@ scripts/reboot-recovery/post-start-quick-check.sh --no-color 此 wrapper 只做 read-only 檢查,並委派既有 cold-start / MOMO preflight / backup-status;不 restart、不 reload、不 import、不改 K8s、不讀 token 內容。wrapper 會把 warning 分成 `SERVICE`、`BOUNDARY`、`EVIDENCE` 三類,避免把 `escrow_missing>0` 誤判成服務降級。若 summary 或 wrapper 因某個 SSH 權限或路徑失敗,再依下列分段命令手動補證據。 -Public route gate 自 v1.6 起會使用 `ROUTE_RETRY_ATTEMPTS`(預設 `3`)與 `ROUTE_RETRY_DELAY_SECONDS`(預設 `2`)重試。單次 `000` / timeout 若 retry 後恢復,應列為 evidence warning 或 transient route evidence,不可直接當成網站仍壞;只有連續失敗才是 service 
blocker。 +Public route gate 自 v1.6 起會使用 `ROUTE_RETRY_ATTEMPTS`(預設 `3`)與 `ROUTE_RETRY_DELAY_SECONDS`(預設 `2`)重試。StockPlatform dedicated freshness gate 自 v1.13 起使用 `STOCK_FRESHNESS_RETRY_ATTEMPTS`(預設 `6`)與 `STOCK_FRESHNESS_RETRY_DELAY_SECONDS`(預設 `5`),因為它常在 110 Docker / CI rollout 後比 route health 多需要數十秒 warmup。單次 `000` / timeout / `502` 若 retry 後恢復,應列為 evidence warning 或 transient evidence,不可直接當成網站或資料仍壞;只有連續失敗才是 service blocker。 Credential escrow gate 自 v1.6 起在 `escrow_missing>0` 時,會只讀呼叫 `/backup/scripts/mark-credential-escrow-verified.sh --status` 並列出缺項。這只是 evidence readback,不會寫 marker、不會讀密碼、不會降低 DR blocker;用途是讓 operator 立即知道缺的是哪幾個非 secret evidence marker。 diff --git a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md index cfb9daa1..34e68637 100644 --- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md +++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md @@ -11,23 +11,25 @@ | Area | Status | Completion | Evidence | |------|--------|------------|----------| -| Overall recovery readiness | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-26 07:19 read-only follow-up confirms service recovery remains green after the SOP commit reached production. Cold-start rerun returned `PASS=87 WARN=0 BLOCKED=0`, result `GREEN`; ArgoCD `awoooi-prod Synced / Healthy` at revision `1fd5e2a8b0f18d24eed16aa2a44286bcbf230603`; API/Web/Worker pods are Running with restart `0`; public routes return expected statuses including AWOOOI API `200`, AWOOOI Web `307`, VibeWork `200`, AwoooGo `200`, MOMO health `200`, Stock freshness `200`, Bitan `200`, Gitea `200`, Harbor `200`, Registry `/v2/` expected `401`, Sentry expected `302`, SigNoz `200`, Langfuse `200`. Backup-status 07:18 remains 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, offsite/rclone fresh, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`. 
188 remains product-service green but host-hygiene blocked: host PostgreSQL `14/main` has `invalid primary checkpoint record` / `could not locate a valid checkpoint record`, certbot renew for `sentry.wooo.work` is rate-limited / challenge-failed, and `awoooi-startup.service` previously attempted `pg_resetwal` as root. New runbook `docs/runbooks/HOST-188-HYGIENE-MAINTENANCE-RUNBOOK.md` plus read-only script `scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh` define the maintenance-window decision tree without authorizing runtime repair. Do not declare DR complete until `escrow_missing=0`; Wazuh manager registry accepted remains `0`; 111/168 Codex workspace HEAD drift and Mac Mini low free space remain workstation blockers, not reboot service blockers. | +| Overall recovery readiness | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-26 12:13 live summary confirms host / service / product data / backup / 188 host hygiene are green for the current evidence set. `post-reboot-readiness-summary.sh --no-color` returns `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=0`, `HOST_188_RESULT=HOST_188_HYGIENE_GREEN.`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, and `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export`. 188 startup now accepts the real `k3s-postgres-recovery` host-network PostgreSQL runtime, fails closed instead of running `pg_resetwal`, Nginx ACME HTTP-01 routes for `sentry/gitea/langfuse/signoz` are corrected, duplicate apt certbot timer is disabled, snap certbot timer remains enabled, and failed units are cleared. Do not declare DR complete until `escrow_missing=0`; do not declare Wazuh registry recovered until manager registry accepted is `1`; certbot formal renewal success still requires snap timer / ACME window readback. 
| | P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-26 07:19 readback shows 120 and 121 reachable, K3s active, `mon` and `mon1` both `Ready control-plane`, AWOOOI API/Web replicas split across both nodes, ArgoCD `awoooi-prod Synced / Healthy` at revision `1fd5e2a8b0f18d24eed16aa2a44286bcbf230603`, and `km-vectorize` official 03:00 Taipei time run succeeded with `lastSuccess=2026-06-25T19:00:14Z`. | | P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-26 06:58 backup readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`, last aggregate `2026-06-26 02:31:02`. DR remains blocked on real non-secret credential escrow evidence IDs; do not write placeholder markers or paste secret values. | | P2 service / data truth | DONE | 100% | Service routes and core runtime are available, 110 current CPU pressure is attributable to active AWOOOI Web `turbo build` / Docker buildx, and previous orphan Chrome groups remain cleared. 2026-06-26 07:19 StockPlatform `/api/v1/system/freshness` returned `200`; 07:01 freshness payload was `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; price / chips / margin / AI recommendations are all on `2026-06-25`. `ai.recommendations` row count is `2868`; `core.margin_short_daily` row count is `1976`. MOMO health `V10.699`, current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, and `MOMO_DAILY_FRESHNESS 1|2026-06-24` are green; expanded public routes are green.
| -| P3 docs / automation contracts | DONE_WITH_DECLARATION_GUARD_V172 | 100% | Workplan, SOP v1.72, post-reboot declaration guard, machine-readable post-reboot readiness summary with Wazuh registry detail fields, post-reboot next-gate dispatch checklist, owner-packet JSON generator, owner-packet contract guard, one-page post-start quick check v1.12, route retry gate, deploy warmup classification, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, StockPlatform natural schedule green evidence, 110 orphan Chrome recurrence cleanup evidence, 188 fail-closed startup data recovery gate, 188 host hygiene read-only checklist, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and stricter product-data / route retry gates are updated. 
Declaration guard now machine-checks allowed / forbidden recovery statements: service/data/backup green may be declared, while `DR_COMPLETE`、`HOST_188_FULLY_GREEN`、`WAZUH_REGISTRY_RECOVERED` and `RUNTIME_ACTION_AUTHORIZED` remain forbidden until evidence gates close. Live 110 script sync remains a separate approved live-write gate; do not claim it here. | +| P3 docs / automation contracts | DONE_WITH_DECLARATION_GUARD_V173 | 100% | Workplan, SOP v1.73, post-reboot declaration guard, machine-readable post-reboot readiness summary with Wazuh registry detail fields, post-reboot next-gate dispatch checklist, owner-packet JSON generator, dynamic owner-packet contract guard, one-page post-start quick check v1.13, route retry gate, deploy warmup classification, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, StockPlatform natural schedule green evidence, 110 orphan Chrome recurrence cleanup evidence, 188 fail-closed startup data recovery gate, 188 host hygiene read-only checklist, 188 PostgreSQL runtime-ready source-of-truth, 188 ACME route/timer hygiene, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + 
gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and stricter product-data / route retry gates are updated. Declaration guard now machine-checks allowed / forbidden recovery statements: service/data/backup/188 host hygiene green may be declared when live summary says so, while `DR_COMPLETE`、`WAZUH_REGISTRY_RECOVERED` and `RUNTIME_ACTION_AUTHORIZED` remain forbidden until evidence gates close. Live 110 script sync remains a separate approved live-write gate; do not claim it here. | -2026-06-26 07:47 machine-readable summary baseline: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` stores delegated logs under `/tmp/awoooi-post-reboot-readiness-20260626-074702` and returns `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_SERVICE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=1`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_RUNTIME_GATE=0`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, and `NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export`. This is now the preferred first operator/AI-agent entrypoint after reboot because it separates service health from DR, host hygiene, and security registry evidence. 
+2026-06-26 12:13 machine-readable summary baseline supersedes the 07:47 / 08:59 gate set: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` stores delegated logs under `/tmp/awoooi-post-reboot-readiness-20260626-121303` and returns `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_SERVICE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=0`, `HOST_188_CHECK_RC=0`, `HOST_188_RESULT=HOST_188_HYGIENE_GREEN.`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_COVERAGE_SCOPE=6`, `WAZUH_DIRECT_ACTIVE=2`, `WAZUH_NO_TRANSPORT=1`, `WAZUH_SSH_BLOCKED=3`, `WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`, `WAZUH_DASHBOARD_INDEX_OK=3`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_RUNTIME_GATE=0`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, and `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export`. This is now the preferred first operator/AI-agent entrypoint after reboot because it separates service health from DR and security registry evidence; 188 host hygiene is no longer a next gate unless the live checklist regresses. -2026-06-26 08:12 next-gate dispatch baseline: `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` reads the same summary path and emits three explicit P0 dispatch checklists without sending requests or changing runtime. `credential_escrow_evidence` requires non-secret evidence id / owner / reviewer for five escrow items and rejects password / token / secret value / hash / prefix / suffix / raw credential payloads. `host_188_hygiene_maintenance_window` requires PostgreSQL `14/main` decision, DNS / TLS / certbot path, startup unit source-of-truth, rollback owner, postcheck owner, and blocks unapproved `pg_resetwal` / certbot renew / Nginx reload / DB restore / Docker restart / host file writes. 
`wazuh_manager_registry_export` requires redacted registry counts, per-host alias status, dashboard API / version status, time window, and reviewer while blocking raw agent names, internal IPs, client keys, Wazuh payloads, active response, re-enroll, restart, secret patch, host write, and Kali active scan. Output fixed `NEXT_GATE_COUNT=3`, `REQUEST_SENT_COUNT=0`, `DISPATCH_AUTHORIZED=0`, `HOST_WRITE_AUTHORIZED=0`, `SECRET_VALUE_COLLECTION_ALLOWED=0`, `RUNTIME_ACTION_AUTHORIZED=0`. +2026-06-26 07:47 machine-readable summary baseline is retained as historical evidence only. It showed `HOST_188_HYGIENE_BLOCKED=1` and three next gates before the 188 startup / ACME / certbot hygiene repair. Do not use the 07:47 gate set as the current status. -2026-06-26 08:29 owner-packet JSON baseline: `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color` emits `schema_version=awoooi_post_reboot_next_gate_owner_packets_v1` with `next_gate_count=3`, `p0_gate_count=3`, `request_sent_count=0`, `owner_response_received_count=0`, `owner_response_accepted_count=0`, `runtime_action_authorized_count=0`. This packet is for AI / operator / owner review intake only; it does not send request, write credential marker, read secret, or authorize runtime action. +2026-06-26 12:13 next-gate dispatch baseline: `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` now emits only the gates present in the current summary. Current expected gates are `credential_escrow_evidence` and `wazuh_manager_registry_export`, with `NEXT_GATE_COUNT=2`, `REQUEST_SENT_COUNT=0`, `DISPATCH_AUTHORIZED=0`, `HOST_WRITE_AUTHORIZED=0`, `SECRET_VALUE_COLLECTION_ALLOWED=0`, `RUNTIME_ACTION_AUTHORIZED=0`. If 188 hygiene regresses, `host_188_hygiene_maintenance_window` will reappear automatically. 
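A minimal sketch of how a consumer might read the current gate set from the readiness summary instead of hard-coding three gates. It assumes the summary prints one `KEY=VALUE` pair per line; the helper names `parse_summary` and `next_required_gates` are hypothetical and are not part of the repo scripts.

```python
def parse_summary(text: str) -> dict[str, str]:
    """Collect upper-case KEY=VALUE lines (assumed summary format) into a dict."""
    fields: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        # Keep only lines that look like SUMMARY_KEYS; skip banners and prose.
        if sep and key.isidentifier() and key.isupper():
            fields[key] = value
    return fields


def next_required_gates(fields: dict[str, str]) -> list[str]:
    """Split the comma-separated NEXT_REQUIRED_GATES value, dropping empty entries."""
    raw = fields.get("NEXT_REQUIRED_GATES", "")
    return [gate.strip() for gate in raw.split(",") if gate.strip()]


# Sample values copied from the 12:13 baseline; the line layout is an assumption.
sample = """SERVICE_GREEN=1
HOST_188_HYGIENE_BLOCKED=0
NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export
"""
gates = next_required_gates(parse_summary(sample))
print(len(gates), gates)
```

Deriving the count from the parsed list keeps `NEXT_GATE_COUNT` consistent with whatever the live summary reports, so a regressed 188 checklist would add its gate back without any code change.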
-2026-06-26 08:40 owner-packet contract guard baseline: `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json` validates the generated JSON before any owner review intake. It requires exactly three P0 gates, preserves `request_sent=0`、`owner_response_received=0`、`owner_response_accepted=0`、`runtime_action_authorized=0`、`host_write_authorized=0`、`secret_value_collection_allowed=0`、`runtime_gate=0`, and rejects missing forbidden payload/action controls for credential escrow, 188 host hygiene, and Wazuh registry export. Expected success line: `POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=3 request_sent=0 accepted=0 runtime_gate=0`. +2026-06-26 12:13 owner-packet JSON baseline: `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color` emits `schema_version=awoooi_post_reboot_next_gate_owner_packets_v1` with dynamic `next_gate_count=2`, `p0_gate_count=2`, `request_sent_count=0`, `owner_response_received_count=0`, `owner_response_accepted_count=0`, `runtime_action_authorized_count=0`. This packet is for AI / operator / owner review intake only; it does not send request, write credential marker, read secret, or authorize runtime action. + +2026-06-26 12:13 owner-packet contract guard baseline: `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json` validates the generated JSON before any owner review intake. It requires the packet gates to equal the live `source.next_required_gates`, preserves `request_sent=0`、`owner_response_received=0`、`owner_response_accepted=0`、`runtime_action_authorized=0`、`host_write_authorized=0`、`secret_value_collection_allowed=0`、`runtime_gate=0`, and rejects missing forbidden payload/action controls for active gates. Current expected success line: `POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=2 request_sent=0 accepted=0 runtime_gate=0`. 
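The dynamic gate check can be sketched roughly as follows. This is an illustration only, not the guard's actual code: the `owner_packets` and `counts` key names are assumptions, while `packet_id`, `source.next_required_gates`, and the three known gate ids come from the diff.

```python
KNOWN_GATES = {
    "credential_escrow_evidence",
    "host_188_hygiene_maintenance_window",
    "wazuh_manager_registry_export",
}


def check_gates(packet: dict) -> list[str]:
    """Return failure strings; an empty list means the gate contract holds."""
    failures: list[str] = []
    items = packet.get("owner_packets", [])  # assumed key name
    gate_ids = {str(i.get("packet_id", "")) for i in items if isinstance(i, dict)}

    # Gate ids outside the known vocabulary are always rejected.
    unknown = sorted(gate_ids - KNOWN_GATES)
    if unknown:
        failures.append(f"unknown_gate_ids={unknown}")

    # The packet must carry exactly the gates the live summary asked for.
    source = packet.get("source", {})
    if not isinstance(source, dict):
        failures.append("source_not_object")
        source = {}
    listed = source.get("next_required_gates")
    expected = {str(g) for g in listed} if isinstance(listed, list) else set()
    if expected != gate_ids:
        failures.append(f"gate_ids={sorted(gate_ids)}")

    # Counts follow the gate set instead of a fixed 3.
    counts = packet.get("counts", {})  # assumed key name
    for key in ("next_gate_count", "p0_gate_count"):
        if counts.get(key) != len(gate_ids):
            failures.append(f"{key}={counts.get(key)}")
    return failures


current = ["credential_escrow_evidence", "wazuh_manager_registry_export"]
packet = {
    "source": {"next_required_gates": current},
    "owner_packets": [{"packet_id": g} for g in current],
    "counts": {"next_gate_count": 2, "p0_gate_count": 2},
}
print(check_gates(packet))
```

A packet that still carries `host_188_hygiene_maintenance_window` while the summary lists only two gates fails the equality check, which is the drift the guard is meant to catch.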
2026-06-26 08:47 Wazuh registry detail summary baseline: post-reboot readiness summary now emits `WAZUH_COVERAGE_SCOPE`, `WAZUH_DIRECT_ACTIVE`, `WAZUH_NO_TRANSPORT`, `WAZUH_SSH_BLOCKED`, `WAZUH_DASHBOARD_API_CONNECTION`, and `WAZUH_DASHBOARD_INDEX_OK` alongside existing route / transport / registry fields. Current read-only truth is coverage scope `6`, direct active `2`, no transport `1`, SSH blocked `3`, route `200`, transport `6`, Dashboard API `pending_or_spinning`, index OK `3`, manager registry accepted `0`, runtime gate `0`. This is a security evidence blocker, not a reboot service blocker. -2026-06-26 08:59 declaration guard baseline: `scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color` emits `schema_version=awoooi_post_reboot_declaration_guard_v1`, status `allowed_with_boundary_blockers`, allowed declarations `BACKUP_CORE_GREEN`、`FULL_STACK_GREEN_DR_ESCROW_BLOCKED`、`PRODUCT_DATA_GREEN`、`SERVICE_RECOVERY_GREEN`, and forbidden declarations `DR_COMPLETE`、`HOST_188_FULLY_GREEN`、`WAZUH_REGISTRY_RECOVERED`、`RUNTIME_ACTION_AUTHORIZED`. Proposed false-green declarations are rejected before they can enter LOGBOOK / owner packets / external status updates. +2026-06-26 12:13 declaration guard baseline: `scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color` emits `schema_version=awoooi_post_reboot_declaration_guard_v1`, status `allowed_with_boundary_blockers`, allowed declarations including service / product data / backup / 188 host hygiene green for this evidence set, and forbidden declarations `DR_COMPLETE`、`WAZUH_REGISTRY_RECOVERED`、`RUNTIME_ACTION_AUTHORIZED`. Proposed false-green declarations are rejected before they can enter LOGBOOK / owner packets / external status updates. 2026-06-26 07:39 live quick-check refresh supersedes the 07:19 row for current operator status. 
`scripts/reboot-recovery/post-start-quick-check.sh --no-color` returned `POST_START_QUICK_CHECK PASS=38 WARN=3 BLOCKED=0`, warning split `SERVICE=0 BOUNDARY=1 EVIDENCE=2`, result `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. Delegated cold-start returned `PASS=89 WARN=0 BLOCKED=0`; four reboot-scope hosts ping/SSH were OK; AWOOOI / VibeWork / AwoooGo / 2026FIFA / Agent Bounty / MOMO / Stock / Bitan / TsenYang / VTuber / Gitea / Harbor / Registry / Sentry / SigNoz / Langfuse / AIOps routes returned expected 2xx/3xx. MOMO `V10.701` has job `57 completed`, daily freshness `1|2026-06-24`, and current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`. StockPlatform freshness is `ok` through `2026-06-25` with price / chips / margin / AI recommendations current. Backup core remains green: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, offsite/rclone fresh, `last_backup_all=2026-06-26 02:31:02`; DR still has `escrow_missing=5`. 110 load around `5.19 / 4.66 / 4.91` is attributable to normal platform processes, not orphan Chrome. 188 host hygiene remains blocked by failed host PostgreSQL / certbot / startup units and must use the dedicated maintenance runbook and read-only checklist. 
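The retry classification behind the `SERVICE` / `EVIDENCE` warning split can be illustrated with a small Python rendering of the StockPlatform freshness loop. The real gate is the bash loop in `post-start-quick-check.sh`; the function name `probe_with_retry`, its return labels, and the injected `probe` / `sleep` callables here are hypothetical.

```python
import time
from typing import Callable


def probe_with_retry(
    probe: Callable[[], str],
    attempts: int = 6,            # mirrors STOCK_FRESHNESS_RETRY_ATTEMPTS default
    delay_seconds: float = 5.0,   # mirrors STOCK_FRESHNESS_RETRY_DELAY_SECONDS default
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, str]:
    """Return (classification, last_http_code) for a freshness probe."""
    code = ""
    for attempt in range(1, attempts + 1):
        code = probe()
        if code.startswith("2"):
            # Success on a later attempt is transient evidence, not a blocker.
            return ("ok" if attempt == 1 else "evidence_warn", code)
        if attempt < attempts:
            sleep(delay_seconds)
    # Only consecutive failure across every attempt blocks the service gate.
    return ("blocked", code or "curl_failed")


# Simulated rollout warmup: two upstream 502 responses, then recovery.
responses = iter(["502", "502", "200"])
result = probe_with_retry(lambda: next(responses), sleep=lambda _s: None)
print(result)
```

Injecting `sleep` keeps the example fast to run; with the defaults, a real probe would wait up to 25 seconds between the first and last attempt before reporting a blocker.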
diff --git a/infra/ansible/roles/nginx/templates/188-all-sites.conf.j2 b/infra/ansible/roles/nginx/templates/188-all-sites.conf.j2 index 47687632..bd3b90ad 100644 --- a/infra/ansible/roles/nginx/templates/188-all-sites.conf.j2 +++ b/infra/ansible/roles/nginx/templates/188-all-sites.conf.j2 @@ -86,6 +86,10 @@ server { listen 80; server_name signoz.wooo.work; + location /.well-known/acme-challenge/ { + root /var/www/html; + } + location / { proxy_pass http://127.0.0.1:3301; proxy_http_version 1.1; diff --git a/infra/ansible/roles/nginx/templates/188-internal-tools-https.conf.j2 b/infra/ansible/roles/nginx/templates/188-internal-tools-https.conf.j2 index d4213451..5a7118a5 100644 --- a/infra/ansible/roles/nginx/templates/188-internal-tools-https.conf.j2 +++ b/infra/ansible/roles/nginx/templates/188-internal-tools-https.conf.j2 @@ -9,10 +9,22 @@ server { server_name gitea.wooo.work sentry.wooo.work - langfuse.wooo.work + langfuse.wooo.work; + + location /.well-known/acme-challenge/ { + root /var/www/html; + } + + location / { + return 301 https://$host$request_uri; + } +} + +server { + listen 80; + server_name harbor.wooo.work - registry.wooo.work - stock.wooo.work; + registry.wooo.work; location /.well-known/acme-challenge/ { root /var/www/certbot; diff --git a/scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh b/scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh index d8542570..62998e6f 100755 --- a/scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh +++ b/scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh @@ -149,17 +149,35 @@ if out=$(ssh_cmd "$REMOTE_188" ' pg_lsclusters 2>/dev/null || true systemctl status postgresql@14-main.service --no-pager || true echo "PG_ISREADY_LOCAL $(pg_isready -h localhost -p 5432 2>/dev/null || true)" +echo "RECOVERY_CONTAINER $(docker inspect -f "{{.State.Running}} {{.HostConfig.NetworkMode}} {{.HostConfig.RestartPolicy.Name}}" k3s-postgres-recovery 2>/dev/null || echo 
missing)" ' 2>&1); then echo "$out" + recovery_container_ready=0 + if grep -q '^RECOVERY_CONTAINER true host ' <<<"$out" && grep -q 'PG_ISREADY_LOCAL .*accepting connections' <<<"$out"; then + recovery_container_ready=1 + fi + if grep -Eq '^14[[:space:]]+main[[:space:]]+5432[[:space:]]+down' <<<"$out"; then - blocked "host PostgreSQL cluster 14/main is down" + if [[ "$recovery_container_ready" -eq 1 ]]; then + warn "host PostgreSQL cluster 14/main is down, but controlled k3s-postgres-recovery runtime is accepting connections" + else + blocked "host PostgreSQL cluster 14/main is down and no controlled recovery runtime was accepted" + fi else ok "host PostgreSQL cluster 14/main not reported down" fi + if grep -Eiq 'invalid primary checkpoint record|could not locate a valid checkpoint record|PANIC:' <<<"$out"; then - blocked "PostgreSQL checkpoint/WAL error detected; pg_resetwal is break-glass only" + if [[ "$recovery_container_ready" -eq 1 ]]; then + warn "PostgreSQL checkpoint/WAL error remains historical host-cluster evidence; pg_resetwal is still break-glass only" + else + blocked "PostgreSQL checkpoint/WAL error detected; pg_resetwal is break-glass only" + fi fi - if grep -q 'PG_ISREADY_LOCAL .*accepting connections' <<<"$out"; then + + if [[ "$recovery_container_ready" -eq 1 ]]; then + ok "PostgreSQL runtime is provided by k3s-postgres-recovery on host network" + elif grep -q 'PG_ISREADY_LOCAL .*accepting connections' <<<"$out"; then warn "pg_isready accepts on localhost; do not use this alone as host 14/main health" fi else @@ -169,12 +187,30 @@ fi section "188 certbot / ACME" if out=$(ssh_cmd "$REMOTE_188" ' -systemctl status certbot.service --no-pager || true -systemctl status snap.certbot.renew.service --no-pager || true +systemctl show certbot.service snap.certbot.renew.service certbot.timer snap.certbot.renew.timer -p Id -p ActiveState -p SubState -p Result -p UnitFileState --no-pager || true +systemctl list-timers --all --no-pager | grep -i certbot || 
true ' 2>&1); then echo "$out" - grep -Eiq 'rateLimited|Service busy' <<<"$out" && blocked "certbot renewal is rate-limited; do not retry blindly" - grep -Eiq 'Some challenges have failed|challenge' <<<"$out" && blocked "certbot challenge failure requires DNS / ACME route owner evidence" + if grep -q 'Id=certbot.service' <<<"$out" && grep -A3 'Id=certbot.service' <<<"$out" | grep -q 'Result=failed'; then + blocked "apt certbot service currently failed" + else + ok "apt certbot service is not currently failed" + fi + if grep -q 'Id=snap.certbot.renew.service' <<<"$out" && grep -A3 'Id=snap.certbot.renew.service' <<<"$out" | grep -q 'Result=failed'; then + blocked "snap certbot renew service currently failed" + else + ok "snap certbot renew service is not currently failed" + fi + if grep -A4 'Id=certbot.timer' <<<"$out" | grep -q 'UnitFileState=disabled'; then + ok "legacy apt certbot timer disabled to avoid duplicate renewals" + else + warn "legacy apt certbot timer is not disabled" + fi + if grep -A4 'Id=snap.certbot.renew.timer' <<<"$out" | grep -q 'ActiveState=active' && grep -A4 'Id=snap.certbot.renew.timer' <<<"$out" | grep -q 'UnitFileState=enabled'; then + ok "snap certbot renew timer enabled" + else + blocked "snap certbot renew timer is not enabled and active" + fi else blocked "certbot status unavailable" echo "$out" @@ -223,7 +259,27 @@ else fi section "Maintenance decision tree" -cat <<'STEPS' +if [ "$SERVICE_GREEN" -eq 1 ] && [ "$HOST_HYGIENE_BLOCKED" -eq 0 ]; then + cat <<'STEPS' +Current expected outcome: + SERVICE_GREEN=1 + HOST_HYGIENE_BLOCKED=0 + RESULT=HOST_188_HYGIENE_GREEN + +Allowed next step: + 1. Keep this host in the normal post-reboot summary. + 2. Wait for snap certbot timer / ACME-window readback before declaring formal certificate renewal success. + 3. Keep DR credential escrow and Wazuh registry evidence as separate blockers. 
+ +Forbidden without separate approval: + - pg_resetwal + - DB restore + - Docker/systemd restart + - firewall change + - Wazuh active response or agent re-enroll +STEPS +else + cat <<'STEPS' Current expected outcome when 188 service is green but host hygiene is not: SERVICE_GREEN=1 HOST_HYGIENE_BLOCKED=1 @@ -244,6 +300,7 @@ Forbidden without maintenance approval: - Docker/systemd restart - host file write STEPS +fi echo echo "SERVICE_GREEN=$SERVICE_GREEN" diff --git a/scripts/reboot-recovery/awoooi-startup.service b/scripts/reboot-recovery/awoooi-startup.service index 65a30d44..871cb460 100644 --- a/scripts/reboot-recovery/awoooi-startup.service +++ b/scripts/reboot-recovery/awoooi-startup.service @@ -14,8 +14,9 @@ Description=AWOOOI Auto-Startup Recovery Sequence After=network-online.target containerd.service docker.service Wants=network-online.target -# 確保 PostgreSQL 盡早嘗試啟動 -Wants=postgresql@14-main.service redis-server.service ollama.service nginx.service +# PostgreSQL 可由受控 recovery container 提供;不得在 startup 階段硬拉 +# postgresql@14-main.service,避免與 recovery runtime 競爭或觸發假綠修復。 +Wants=redis-server.service ollama.service nginx.service [Service] Type=oneshot diff --git a/scripts/reboot-recovery/awoooi-startup.sh b/scripts/reboot-recovery/awoooi-startup.sh index 8a91db2f..8b0c49a1 100644 --- a/scripts/reboot-recovery/awoooi-startup.sh +++ b/scripts/reboot-recovery/awoooi-startup.sh @@ -3,6 +3,8 @@ # 2026-04-04 ogt: 根據實際事故建立,處理 container / Docker 啟動順序與 K3s Kine 維護。 # 2026-06-26 Codex: PostgreSQL checkpoint/WAL 錯誤改為 fail-closed; # 不在自動啟動腳本內執行 pg_resetwal,避免資料破壞被誤判成恢復。 +# 2026-06-26 Codex: 允許受控 recovery container 提供 14/main runtime; +# 不再因 systemd postgresql@14-main failed 而誤判活 DB 為不可用。 # 部署位置: /usr/local/bin/awoooi-startup.sh (on 192.168.0.188) # systemd unit: /etc/systemd/system/awoooi-startup.service @@ -12,6 +14,22 @@ exec > >(tee -a "$LOG") 2>&1 log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; } +postgres_runtime_ready() { + if systemctl is-active postgresql@14-main 
>/dev/null 2>&1; then + log "✅ PostgreSQL systemd unit active" + return 0 + fi + + if docker inspect -f '{{.State.Running}} {{.HostConfig.NetworkMode}}' k3s-postgres-recovery 2>/dev/null | grep -q '^true host$'; then + if pg_isready -h localhost -p 5432 >/dev/null 2>&1; then + log "✅ PostgreSQL recovery container active on host network" + return 0 + fi + fi + + return 1 +} + log "=== AWOOOI 啟動序列開始 ===" # ────────────────────────────────────────────── @@ -73,20 +91,20 @@ fi # ────────────────────────────────────────────── log "[3/7] 檢查 PostgreSQL..." -if ! systemctl is-active postgresql@14-main >/dev/null 2>&1; then +if ! postgres_runtime_ready; then log "PostgreSQL 未啟動,嘗試啟動..." systemctl start postgresql@14-main || true sleep 8 fi -if ! systemctl is-active postgresql@14-main >/dev/null 2>&1; then +if ! postgres_runtime_ready; then log "PostgreSQL 啟動失敗,檢查是否屬於 checkpoint/WAL 類資料層錯誤..." if journalctl -u postgresql@14-main -n 20 | grep -q "could not locate a valid checkpoint"; then log "❌ 偵測到 PostgreSQL checkpoint/WAL 錯誤;禁止自動 pg_resetwal。" log "需要 DB owner、備份/restore evidence、maintenance window 與 post-check 後才能人工處理。" exit 1 fi - systemctl is-active postgresql@14-main && log "✅ PostgreSQL 修復成功" || { log "❌ PostgreSQL 修復失敗"; exit 1; } + postgres_runtime_ready && log "✅ PostgreSQL runtime 可用" || { log "❌ PostgreSQL 修復失敗"; exit 1; } fi # 等待 PG 接受連線 diff --git a/scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py b/scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py index 31f21afc..2ab61468 100755 --- a/scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py +++ b/scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py @@ -23,7 +23,7 @@ OWNER_PACKET_GENERATOR = ( ) EXPECTED_SCHEMA = "awoooi_post_reboot_next_gate_owner_packets_v1" -EXPECTED_GATES = { +KNOWN_GATES = { "credential_escrow_evidence", "host_188_hygiene_maintenance_window", "wazuh_manager_registry_export", @@ -187,12 +187,21 @@ def validate_packet(packet: 
dict[str, Any]) -> list[str]: counts = {} gate_ids = {str(item.get("packet_id", "")) for item in owner_packets if isinstance(item, dict)} - if gate_ids != EXPECTED_GATES: + unknown_gates = sorted(gate_ids - KNOWN_GATES) + if unknown_gates: + failures.append(f"unknown_gate_ids={unknown_gates}") + + source = packet.get("source", {}) + if not isinstance(source, dict): + failures.append("source_not_object") + source = {} + expected_gates = set(str(item) for item in as_list(source.get("next_required_gates"))) + if expected_gates != gate_ids: failures.append(f"gate_ids={sorted(gate_ids)}") expected_counts = { - "next_gate_count": 3, - "p0_gate_count": 3, + "next_gate_count": len(gate_ids), + "p0_gate_count": len(gate_ids), } for key, expected in expected_counts.items(): if counts.get(key) != expected: diff --git a/scripts/reboot-recovery/post-start-quick-check.sh b/scripts/reboot-recovery/post-start-quick-check.sh index 4109fb3d..3390e114 100755 --- a/scripts/reboot-recovery/post-start-quick-check.sh +++ b/scripts/reboot-recovery/post-start-quick-check.sh @@ -9,6 +9,8 @@ ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)" SSH_CONNECT_TIMEOUT="${SSH_CONNECT_TIMEOUT:-6}" ROUTE_RETRY_ATTEMPTS="${ROUTE_RETRY_ATTEMPTS:-3}" ROUTE_RETRY_DELAY_SECONDS="${ROUTE_RETRY_DELAY_SECONDS:-2}" +STOCK_FRESHNESS_RETRY_ATTEMPTS="${STOCK_FRESHNESS_RETRY_ATTEMPTS:-6}" +STOCK_FRESHNESS_RETRY_DELAY_SECONDS="${STOCK_FRESHNESS_RETRY_DELAY_SECONDS:-5}" RUN_COLD_START=1 RUN_MOMO=1 RUN_STOCK=1 @@ -76,6 +78,8 @@ Options: Environment: ROUTE_RETRY_ATTEMPTS Public route attempts before blocking. Default: 3. ROUTE_RETRY_DELAY_SECONDS Delay between failed public route attempts. Default: 2. + STOCK_FRESHNESS_RETRY_ATTEMPTS Stock freshness attempts before blocking. Default: 6. + STOCK_FRESHNESS_RETRY_DELAY_SECONDS Delay between failed Stock freshness attempts. Default: 5. Exit codes: 0 = no service blockers. Boundary / evidence warnings may still be present. 
@@ -277,9 +281,23 @@ fi if [[ "$RUN_STOCK" -eq 1 ]]; then section "StockPlatform freshness" stock_tmp="$(mktemp -t post-start-stock.XXXXXX)" - stock_code="$(curl -k -sS -o "$stock_tmp" -w '%{http_code}' --max-time 12 "https://stock.wooo.work/api/v1/system/freshness" 2>/dev/null || true)" + stock_code="" + stock_attempt=1 + while [[ "$stock_attempt" -le "$STOCK_FRESHNESS_RETRY_ATTEMPTS" ]]; do + stock_code="$(curl -k -sS -o "$stock_tmp" -w '%{http_code}' --max-time 12 "https://stock.wooo.work/api/v1/system/freshness" 2>/dev/null || true)" + if [[ "$stock_code" == 2* ]]; then + if [[ "$stock_attempt" -gt 1 ]]; then + evidence_warn "StockPlatform freshness recovered after attempt=$stock_attempt" + fi + break + fi + if [[ "$stock_attempt" -lt "$STOCK_FRESHNESS_RETRY_ATTEMPTS" ]]; then + sleep "$STOCK_FRESHNESS_RETRY_DELAY_SECONDS" + fi + stock_attempt=$((stock_attempt + 1)) + done if [[ "$stock_code" != 2* ]]; then - blocked "StockPlatform freshness endpoint returned ${stock_code:-curl_failed}" + blocked "StockPlatform freshness endpoint returned ${stock_code:-curl_failed} attempts=$STOCK_FRESHNESS_RETRY_ATTEMPTS" cat "$stock_tmp" || true else python3 - "$stock_tmp" <<'PY'