diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index d07e5d4f..4606736a 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -45027,3 +45027,48 @@ production browser smoke:
 - The 188 host hygiene maintenance window has still not been executed.
 - Wazuh manager registry accepted remains `0`.
 - Do not claim `DR_COMPLETE`, 188 host fully green, Wazuh registry recovered, or runtime/security acceptance enabled.
+
+## 2026-06-26 — 08:12 post-reboot next-gate dispatch checklist / SOP v1.68
+
+**Time and source**:
+- 2026-06-26 08:12-08:20 Asia/Taipei.
+- Source: added `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color`; the 08:20 live validation delegated to `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color`; summary artifact dir `/tmp/awoooi-post-reboot-readiness-20260626-082049`.
+
+**Completed**:
+- Added the post-reboot next-gate dispatch checklist, which turns `NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export` into owner / required evidence / forbidden payload / forbidden action / done criteria.
+- `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` bumped to v1.8; next-gate dispatch is now the fixed read-only step after the summary.
+- `docs/runbooks/FULL-STACK-COLD-START-SOP.md` bumped to v1.68; it explicitly lists the dispatch boundaries of the three P0 gates.
+- `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md` updated to `DONE_WITH_NEXT_GATE_DISPATCH_V168`.
+
+**Read-only validation results**:
+- `SERVICE_GREEN=1`
+- `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`
+- `NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export`
+- `NEXT_GATE_COUNT=3`
+- `RUNTIME_ACTION_AUTHORIZED=0`
+- `DISPATCH_AUTHORIZED=0`
+- `REQUEST_SENT_COUNT=0`
+- `HOST_WRITE_AUTHORIZED=0`
+- `SECRET_VALUE_COLLECTION_ALLOWED=0`
+
+**Next steps for the three gates**:
+- `credential_escrow_evidence`: requires a non-secret evidence id / owner / reviewer for each of the five escrow items; forbids passwords, tokens, secret values, hashes, prefixes/suffixes, and raw credentials.
+- `host_188_hygiene_maintenance_window`: requires maintenance-window decisions for 188 host PostgreSQL `14/main`, certbot / DNS-TLS, the startup unit, the rollback owner, and the postcheck owner; forbids unapproved `pg_resetwal`, certbot renew, Nginx reload, DB restore, Docker restart, and host file writes.
+- `wazuh_manager_registry_export`: requires a redacted Wazuh manager registry export, host alias status, Dashboard API / version status, time window, and reviewer; forbids agent real names, internal IPs, client.keys, raw Wazuh payloads, active response, agent re-enroll, Wazuh restart, secret patch, host write, and Kali active scan.
+
+**Command types run**:
+- Read-only: post-reboot readiness summary, next-gate dispatch checklist, source guard.
+- Writes: repo script / docs only.
+- Not done: no host / Docker / systemd / Nginx / firewall / K8s / DB / Wazuh runtime writes; no secret plaintext read; no owner request sent; no escrow marker written; no active response executed.
+
+**Current verdict**:
+- Reboot service / data / backup readiness remains `GREEN`.
+- Overall declaration remains `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`.
+- Next-gate dispatch automation: `0% -> 100%`.
+- Runtime repair / owner request sent / credential marker write: still `0%`.
+
+**Still blocked / must not be claimed**:
+- DR credential escrow evidence missing `5`.
+- The 188 host hygiene maintenance window has still not been executed.
+- Wazuh manager registry accepted remains `0`.
+- Do not claim `DR_COMPLETE`, 188 host fully green, Wazuh registry recovered, runtime/security acceptance enabled, or that an owner request has been sent.
diff --git a/docs/runbooks/FULL-STACK-COLD-START-SOP.md b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
index 8f3171d7..64bc6804 100644
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -1,6 +1,6 @@
 # AWOOOI Full-Stack Cold Start and Host Reboot SOP

-> Version: v1.67
+> Version: v1.68
 > Last updated: 2026-06-26 Asia/Taipei
 > Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
@@ -10,10 +10,12 @@
 This section is the first judgment anchor after every handover, power-on, shutdown, or reboot. If its date is not today, rerun the live check first, then update this section and `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md`.

-If you only need a quick post-reboot judgment on whether recovery can be claimed, run the machine-readable summary first: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color`. The script calls the one-page overall check, the 188 host hygiene checklist, and the Wazuh no-false-green repo gates, and leaves delegated logs under `/tmp/awoooi-post-reboot-readiness-*`. When manual expansion is needed, run `scripts/reboot-recovery/post-start-quick-check.sh --no-color` and use `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` as the fallback. The long SOP keeps the full background, exception handling, and Plan B; the short wrapper / checklist handles the fixed judgment within T+10 minutes of every reboot.
+If you only need a quick post-reboot judgment on whether recovery can be claimed, run the machine-readable summary first: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color`. The script calls the one-page overall check, the 188 host hygiene checklist, and the Wazuh no-false-green repo gates, and leaves delegated logs under `/tmp/awoooi-post-reboot-readiness-*`. If the summary shows `SERVICE_GREEN=1` but `NEXT_REQUIRED_GATES` is still non-empty, run `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` next to turn the three blockers (DR escrow, 188 hygiene, Wazuh registry) into an owner / evidence / forbidden-action dispatch checklist; this dispatch still pins `DISPATCH_AUTHORIZED=0`, `REQUEST_SENT_COUNT=0`, `HOST_WRITE_AUTHORIZED=0`, and `SECRET_VALUE_COLLECTION_ALLOWED=0`. When manual expansion is needed, run `scripts/reboot-recovery/post-start-quick-check.sh --no-color` and use `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` as the fallback. The long SOP keeps the full background, exception handling, and Plan B; the short wrapper / checklist handles the fixed judgment within T+10 minutes of every reboot.

 2026-06-26 07:47 machine-readable readiness summary: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` has been validated as usable, artifact dir `/tmp/awoooi-post-reboot-readiness-20260626-074702`. The summary outputs `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `POST_START_PASS=38`, `POST_START_WARN=3`, `POST_START_BLOCKED=0`, `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_SERVICE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=1`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_RUNTIME_GATE=0`, `RUNTIME_ACTION_AUTHORIZED=0`. Currently `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED` and `NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export`. This is the first-layer operator / AI agent judgment format after every reboot.

+2026-06-26 08:12 next-gate dispatch baseline: `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` has been validated against the latest summary live output. The script reads back `SERVICE_GREEN=1`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, and `NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export`, and emits three P0 checklists. First, credential escrow non-secret evidence: requires an evidence id / owner / reviewer for each of the five escrow items and forbids passwords, tokens, hashes, and prefixes/suffixes. Second, the 188 host PostgreSQL / certbot hygiene maintenance window: requires DB / DNS-TLS / rollback / postcheck owner decisions and forbids unapproved actions such as `pg_resetwal`, certbot renew, Nginx reload, DB restore, and Docker restart. Third, the Wazuh manager registry redacted export: requires redacted registry counts, host alias status, dashboard API/version status, time window, and reviewer, and forbids agent real names, internal IPs, client.keys, raw payloads, active response, agent re-enroll, Wazuh restart, secret patch, host write, and Kali active scan. The output pins `NEXT_GATE_COUNT=3`, `NEXT_STEP=dispatch_owner_packets_manually_after_review`, and `RUNTIME_ACTION_AUTHORIZED=0`; this is a dispatch checklist, not a sent request or a runtime approval.
+
 2026-06-26 07:39 live quick-check refresh: `scripts/reboot-recovery/post-start-quick-check.sh --no-color` ran to completion, ping / SSH were OK on all four hosts, delegated cold-start was `PASS=89 WARN=0 BLOCKED=0`, and the wrapper summary was `POST_START_QUICK_CHECK PASS=38 WARN=3 BLOCKED=0`, warning split `SERVICE=0 BOUNDARY=1 EVIDENCE=2`, `RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. MOMO health `V10.701`, daily snapshot `109061` rows / `2025-07-01..2026-06-24`, current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, latest import job `57 completed`. StockPlatform freshness `status=ok`, latest trading date `2026-06-25`; price / chips / margin / AI recommendations are all at `2026-06-25`. Backup-status at 07:39 shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, offsite/rclone fresh, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`. The extended public routes list all returned the expected 2xx/3xx. 110 CPU attribution shows load around `5.19 / 4.66 / 4.91` with CPU idle at `80%+` in most samples; the current load comes from normal platform work such as Gitea / ClickHouse / Docker / Kafka / StockPlatform / AWOOOI API / Sentry, not orphan Chrome. Allowed declaration for this round: hosts, K3s, services, websites, product data freshness, backup core, and offsite freshness are green. Forbidden declaration: DR complete, credential escrow complete, 188 host fully green, Wazuh registry recovered.

 2026-06-26 07:19 follow-up: `gitea/main` already contains the previous round's SOP docs commit `1fd5e2a8`; ArgoCD `awoooi-prod` reads back `Synced / Healthy`, revision `1fd5e2a8b0f18d24eed16aa2a44286bcbf230603`, API `2/2`, Web `2/2`, Worker `1/1`, pods `restart=0`. Rerunning the full cold-start still gives `PASS=87 WARN=0 BLOCKED=0`, result `GREEN`. Direct public route readback: AWOOOI API `200`, AWOOOI Web `307`, VibeWork `200`, AwoooGo `200`, MOMO health `200`, Stock freshness `200`, Bitan `200`, Gitea `200`, Harbor `200`, Registry `/v2/` expected `401`, Sentry expected `302`, SigNoz `200`, Langfuse `200`. Precise classification of the 188 blockers: `pg_lsclusters` shows host PostgreSQL `14/main` down, and `systemctl status postgresql@14-main` shows `invalid primary checkpoint record` and `PANIC: could not locate a valid checkpoint record`; `certbot.service` shows the `sentry.wooo.work` renew was rate-limited, and `snap.certbot.renew.service` shows a failed challenge; `awoooi-startup.service` previously tried to run `pg_resetwal` as root and failed. This round does not run `pg_resetwal`, does not `reset-failed`, and does not restart services; 188 must be handled through a separate maintenance window with a rollback owner and a restore/source-of-truth plan. For details see `docs/runbooks/HOST-188-HYGIENE-MAINTENANCE-RUNBOOK.md`; you can first run `scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh --no-color` to get a read-only preflight. 110 load has dropped to around `4.83 / 4.82 / 5.52`, and top CPU is the active AWOOOI Web `turbo build` / Docker buildx; swap is still full but memory available is about `41Gi`, so swap is not cleared manually this round. The overall declaration remains `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`.
diff --git a/docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md b/docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md
index dcdff0b2..6ad4a4a3 100644
--- a/docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md
+++ b/docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md
@@ -1,6 +1,6 @@
 # One-Page Overall Check After Host Reboot

-> Version: v1.7
+> Version: v1.8
 > Last updated: 2026-06-26 Asia/Taipei
 > Scope: 110 / 120 / 121 / 188 post-reboot service recovery. 112 Kali / Wazuh / active scan are not part of this flow.
@@ -10,7 +10,7 @@
 After any of the 110 / 120 / 121 / 188 hosts is powered on, shut down, rebooted, recovered from power loss, put through a VMware console fsck, or hit by a large Docker / K3s reshuffle, run this page first, then decide whether to claim recovery.

-Latest baseline: 2026-06-26 07:47 machine-readable summary. `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` returned `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_HYGIENE_BLOCKED=1`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. Cold-start `PASS=89 WARN=0 BLOCKED=0`; MOMO `V10.701`, latest import job `57 completed`, `DB_DAILY_FRESHNESS 1|2026-06-24`; StockPlatform `/api/v1/system/freshness` is `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; backup-status 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `last_backup_all=2026-06-26 02:31:02`. DR still cannot be claimed complete because `escrow_missing=5`. 188 host hygiene and the Wazuh manager registry remain independent blockers outside service green.
+Latest baseline: 2026-06-26 08:12 next-gate dispatch. `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` returned `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_HYGIENE_BLOCKED=1`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. Next, `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` expands `NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export` into three owner / evidence / forbidden-action checklists and pins `DISPATCH_AUTHORIZED=0`, `REQUEST_SENT_COUNT=0`, `HOST_WRITE_AUTHORIZED=0`, `SECRET_VALUE_COLLECTION_ALLOWED=0`. Cold-start `PASS=89 WARN=0 BLOCKED=0`; MOMO `V10.701`, latest import job `57 completed`, `DB_DAILY_FRESHNESS 1|2026-06-24`; StockPlatform `/api/v1/system/freshness` is `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; backup-status 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `last_backup_all=2026-06-26 02:31:02`. DR still cannot be claimed complete because `escrow_missing=5`. 188 host hygiene and the Wazuh manager registry remain independent blockers outside service green.

 This page answers only four questions:

@@ -60,6 +60,14 @@ scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color
 - `RUNTIME_ACTION_AUTHORIZED=0`: this flow does not authorize runtime writes.
 - `OVERALL_DECLARATION`: the highest declaration that may be used this round.

+When the summary shows `SERVICE_GREEN=1` but `NEXT_REQUIRED_GATES` is still non-empty, run the next-gate dispatch checklist next:
+
+```bash
+scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color
+```
+
+This script only turns `credential_escrow_evidence`, `host_188_hygiene_maintenance_window`, and `wazuh_manager_registry_export` into owner / required evidence / forbidden action / done criteria. It sends no requests, reads no secrets, writes no markers, does not restart / reload / repair / import / delete / patch, and does not authorize any host / Wazuh / Nginx / K8s / DB runtime action. If it outputs `SERVICE_GREEN=0`, return to service recovery first and do not enter boundary dispatch.
+
 When details need to be expanded, use the repo-side wrapper:

 ```bash
diff --git a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
index 8e7c5b88..0090fc44 100644
--- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
+++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
@@ -15,10 +15,12 @@
 | P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-26 07:19 readback shows 120 and 121 reachable, K3s active, `mon` and `mon1` both `Ready control-plane`, AWOOOI API/Web replicas split across both nodes, ArgoCD `awoooi-prod Synced / Healthy` at revision `1fd5e2a8b0f18d24eed16aa2a44286bcbf230603`, and `km-vectorize` official 03:00 Taipei-time run succeeded with `lastSuccess=2026-06-25T19:00:14Z`. |
 | P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-26 06:58 backup readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`, last aggregate `2026-06-26 02:31:02`. DR remains blocked on real non-secret credential escrow evidence IDs; do not write placeholder markers or paste secret values. |
 | P2 service / data truth | DONE | 100% | Service routes and core runtime are available, 110 current CPU pressure is attributable to active AWOOOI Web `turbo build` / Docker buildx, and previous orphan Chrome groups remain cleared. 2026-06-26 07:19 StockPlatform `/api/v1/system/freshness` returned `200`; 07:01 freshness payload was `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; price / chips / margin / AI recommendations are all on `2026-06-25`. `ai.recommendations` row count is `2868`; `core.margin_short_daily` row count is `1976`. MOMO health `V10.699`, current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, and `MOMO_DAILY_FRESHNESS 1|2026-06-24` are green; expanded public routes are green.
| -| P3 docs / automation contracts | DONE_WITH_MACHINE_READINESS_V167 | 100% | Workplan, SOP v1.67, machine-readable post-reboot readiness summary, one-page post-start quick check v1.6, route retry gate, deploy warmup classification, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, StockPlatform natural schedule green evidence, 110 orphan Chrome recurrence cleanup evidence, 188 fail-closed startup data recovery gate, 188 host hygiene read-only checklist, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and stricter product-data / route retry gates are updated. Live 110 script sync remains a separate approved live-write gate; do not claim it here. 
| +| P3 docs / automation contracts | DONE_WITH_NEXT_GATE_DISPATCH_V168 | 100% | Workplan, SOP v1.68, machine-readable post-reboot readiness summary, post-reboot next-gate dispatch checklist, one-page post-start quick check v1.8, route retry gate, deploy warmup classification, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, StockPlatform natural schedule green evidence, 110 orphan Chrome recurrence cleanup evidence, 188 fail-closed startup data recovery gate, 188 host hygiene read-only checklist, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and stricter product-data / route retry gates are updated. 
Next-gate dispatch turns `credential_escrow_evidence`、`host_188_hygiene_maintenance_window`、`wazuh_manager_registry_export` into owner / evidence / forbidden-action checklists while keeping request sent / host write / secret collection / runtime action at `0`. Live 110 script sync remains a separate approved live-write gate; do not claim it here. | 2026-06-26 07:47 machine-readable summary baseline: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` stores delegated logs under `/tmp/awoooi-post-reboot-readiness-20260626-074702` and returns `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_SERVICE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=1`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_RUNTIME_GATE=0`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, and `NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export`. This is now the preferred first operator/AI-agent entrypoint after reboot because it separates service health from DR, host hygiene, and security registry evidence. +2026-06-26 08:12 next-gate dispatch baseline: `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` reads the same summary path and emits three explicit P0 dispatch checklists without sending requests or changing runtime. `credential_escrow_evidence` requires non-secret evidence id / owner / reviewer for five escrow items and rejects password / token / secret value / hash / prefix / suffix / raw credential payloads. `host_188_hygiene_maintenance_window` requires PostgreSQL `14/main` decision, DNS / TLS / certbot path, startup unit source-of-truth, rollback owner, postcheck owner, and blocks unapproved `pg_resetwal` / certbot renew / Nginx reload / DB restore / Docker restart / host file writes. 
`wazuh_manager_registry_export` requires redacted registry counts, per-host alias status, dashboard API / version status, time window, and reviewer while blocking raw agent names, internal IPs, client keys, Wazuh payloads, active response, re-enroll, restart, secret patch, host write, and Kali active scan. Output fixed `NEXT_GATE_COUNT=3`, `REQUEST_SENT_COUNT=0`, `DISPATCH_AUTHORIZED=0`, `HOST_WRITE_AUTHORIZED=0`, `SECRET_VALUE_COLLECTION_ALLOWED=0`, `RUNTIME_ACTION_AUTHORIZED=0`. + 2026-06-26 07:39 live quick-check refresh supersedes the 07:19 row for current operator status. `scripts/reboot-recovery/post-start-quick-check.sh --no-color` returned `POST_START_QUICK_CHECK PASS=38 WARN=3 BLOCKED=0`, warning split `SERVICE=0 BOUNDARY=1 EVIDENCE=2`, result `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. Delegated cold-start returned `PASS=89 WARN=0 BLOCKED=0`; four reboot-scope hosts ping/SSH were OK; AWOOOI / VibeWork / AwoooGo / 2026FIFA / Agent Bounty / MOMO / Stock / Bitan / TsenYang / VTuber / Gitea / Harbor / Registry / Sentry / SigNoz / Langfuse / AIOps routes returned expected 2xx/3xx. MOMO `V10.701` has job `57 completed`, daily freshness `1|2026-06-24`, and current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`. StockPlatform freshness is `ok` through `2026-06-25` with price / chips / margin / AI recommendations current. Backup core remains green: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, offsite/rclone fresh, `last_backup_all=2026-06-26 02:31:02`; DR still has `escrow_missing=5`. 110 load around `5.19 / 4.66 / 4.91` is attributable to normal platform processes, not orphan Chrome. 188 host hygiene remains blocked by failed host PostgreSQL / certbot / startup units and must use the dedicated maintenance runbook and read-only checklist. 2026-06-25 19:06 post-CD wrapper readback supersedes the 18:53 wording: consecutive main pushes created a deploy storm where older deploy markers were superseded by later commits. 
Latest production truth is deploy marker `d8ca8224 chore(cd): deploy 9dbe044 [skip ci]`, ArgoCD `Synced / Healthy`, API/Web/Worker image tag `9dbe044ea1e8e3894ccbeb5ed760bb124b87f7be`, direct route smoke 200 for AWOOOI API / IwoooS / VibeWork / AwoooGo / MOMO health / Stock / Bitan and expected route-gate statuses for MOMO / Gitea / Harbor / Registry / Sentry / SigNoz / Langfuse / AIOps, and wrapper `POST_START_QUICK_CHECK PASS=18 WARN=3 BLOCKED=0`. Repo-side cold-start returns `PASS=89 WARN=0 BLOCKED=0`; `/backup/scripts/backup-status.sh --no-notify --no-refresh` reports 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`; MOMO dedicated preflight returns `PASS=19 WARN=2 BLOCKED=0`; MOMO health is `V10.690`; AwoooGo / Stock transient 502 reads cleared after upstream warmup and five consecutive route reads returned `200`; 110 load is around `14.51 / 12.34 / 11.42`, with Gitea Actions cache save / `zstdmt` / `tar`, StockPlatform headless Chrome smoke / CI, Gitea, AWOOOI API, ClickHouse, Docker, and platform services visible, not an AWOOOI service blocker. Wrapper result is `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, not `DEGRADED`, because service warnings are `0` and only DR boundary / evidence warnings remain. Wazuh route readback is now `200 disabled_waiting_iwooos_wazuh_owner_gate`, but manager registry accepted remains `0`, so Wazuh is a security registry evidence blocker rather than a reboot service blocker. diff --git a/scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh b/scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh new file mode 100755 index 00000000..a352bf1b --- /dev/null +++ b/scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh @@ -0,0 +1,171 @@ +#!/usr/bin/env bash +# AWOOOI post-reboot next-gate dispatch checklist. +# Read-only by design. 
It does not send requests, restart services, modify
+# hosts, query secrets, or enable runtime actions.
+
+set -uo pipefail
+
+ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
+SUMMARY_FILE=""
+RUN_SUMMARY=1
+NO_COLOR_FLAG=0
+
+usage() {
+  cat <<'USAGE'
+Usage: post-reboot-next-gate-dispatch.sh [options]
+
+Turns post-reboot readiness summary blockers into a deterministic owner/action
+checklist. This is a dispatch checklist only; it does not dispatch requests.
+
+Options:
+  --summary-file PATH  Use an existing post-reboot readiness summary file.
+  --no-run-summary     Do not run post-reboot-readiness-summary.sh.
+  --no-color           Disable color in delegated summary.
+  -h, --help           Show this help.
+
+Exit codes:
+  0 = checklist emitted. Runtime action remains unauthorized.
+  2 = summary unavailable or service blocker observed.
+USAGE
+}
+
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --summary-file)
+      shift
+      SUMMARY_FILE="${1:-}"
+      ;;
+    --no-run-summary)
+      RUN_SUMMARY=0
+      ;;
+    --no-color)
+      NO_COLOR_FLAG=1
+      ;;
+    -h|--help)
+      usage
+      exit 0
+      ;;
+    *)
+      printf 'Unknown argument: %s\n' "$1" >&2
+      usage >&2
+      exit 2
+      ;;
+  esac
+  shift
+done
+
+if [[ -z "$SUMMARY_FILE" && "$RUN_SUMMARY" -eq 1 ]]; then
+  SUMMARY_FILE="$(mktemp -t awoooi-post-reboot-summary.XXXXXX)"
+  summary_args=()
+  [[ "$NO_COLOR_FLAG" -eq 1 ]] && summary_args+=(--no-color)
+  bash "$ROOT_DIR/scripts/reboot-recovery/post-reboot-readiness-summary.sh" "${summary_args[@]}" >"$SUMMARY_FILE" 2>&1
+  summary_rc=$?
+else
+  summary_rc=0
+fi
+
+if [[ -z "$SUMMARY_FILE" || ! -s "$SUMMARY_FILE" ]]; then
+  echo "BLOCKED_SUMMARY_UNAVAILABLE=1"
+  echo "RUNTIME_ACTION_AUTHORIZED=0"
+  exit 2
+fi
+
+value_for() {
+  local key="$1"
+  awk -F= -v key="$key" '$1 == key {value=$2; found=1} END {if (found) print value; else print ""}' "$SUMMARY_FILE"
+}
+
+service_green="$(value_for SERVICE_GREEN)"
+overall_declaration="$(value_for OVERALL_DECLARATION)"
+next_required_gates="$(value_for NEXT_REQUIRED_GATES)"
+escrow_missing_count="$(value_for ESCROW_MISSING_COUNT)"
+host_188_hygiene_blocked="$(value_for HOST_188_HYGIENE_BLOCKED)"
+wazuh_registry_accepted="$(value_for WAZUH_MANAGER_REGISTRY_ACCEPTED)"
+runtime_action_authorized="$(value_for RUNTIME_ACTION_AUTHORIZED)"
+summary_artifact_dir="$(value_for ARTIFACT_DIR)"
+
+contains_gate() {
+  local gate="$1"
+  [[ ",${next_required_gates}," == *",${gate},"* ]]
+}
+
+print_gate_header() {
+  local id="$1"
+  local title="$2"
+  echo
+  echo "GATE_ID=$id"
+  echo "GATE_TITLE=$title"
+}
+
+echo "AWOOOI_POST_REBOOT_NEXT_GATE_DISPATCH=1"
+echo "SUMMARY_FILE=$SUMMARY_FILE"
+echo "SUMMARY_ARTIFACT_DIR=${summary_artifact_dir:-unknown}"
+echo "SUMMARY_RC=$summary_rc"
+echo "SERVICE_GREEN=${service_green:-unknown}"
+echo "OVERALL_DECLARATION=${overall_declaration:-unknown}"
+echo "NEXT_REQUIRED_GATES=${next_required_gates:-unknown}"
+echo "RUNTIME_ACTION_AUTHORIZED=0"
+echo "DISPATCH_AUTHORIZED=0"
+echo "REQUEST_SENT_COUNT=0"
+echo "HOST_WRITE_AUTHORIZED=0"
+echo "SECRET_VALUE_COLLECTION_ALLOWED=0"
+
+if [[ "$service_green" != "1" ]]; then
+  echo
+  echo "BLOCKED_SERVICE_GREEN=0"
+  echo "NEXT_STEP=restore_service_before_boundary_dispatch"
+  exit 2
+fi
+
+gate_count=0
+
+if contains_gate "credential_escrow_evidence"; then
+  gate_count=$((gate_count + 1))
+  print_gate_header "credential_escrow_evidence" "DR credential escrow non-secret evidence"
+  echo "GATE_PRIORITY=P0"
+  echo "GATE_STATUS=owner_evidence_required"
+  echo "CURRENT_EVIDENCE=escrow_missing_count:${escrow_missing_count:-unknown}"
+  echo "OWNER_GROUP=backup_dr_owner,security_owner,business_owner"
+  echo "REQUIRED_ITEMS=restic_repository_password,offsite_provider_credentials,break_glass_admin_credentials,dns_registrar_recovery,oauth_ai_provider_recovery"
+  echo "REQUIRED_EVIDENCE=non_secret_evidence_id,owner_role,owner_team,evidence_location,review_date,reviewer"
+  echo "FORBIDDEN_PAYLOADS=password,token,secret_value,hash,prefix,suffix,raw_credential,screenshot_with_secret"
+  echo "ALLOWED_ACTION=collect_non_secret_marker_evidence_only"
+  echo "FORBIDDEN_ACTION=mark_placeholder,write_fake_marker,store_secret,disable_alert"
+  echo "DONE_CRITERIA=escrow_missing_count:0,offsite_report_full_marker:1,backup_status_core_blockers:0"
+fi
+
+if contains_gate "host_188_hygiene_maintenance_window"; then
+  gate_count=$((gate_count + 1))
+  print_gate_header "host_188_hygiene_maintenance_window" "188 host PostgreSQL / certbot hygiene maintenance window"
+  echo "GATE_PRIORITY=P0"
+  echo "GATE_STATUS=maintenance_window_required"
+  echo "CURRENT_EVIDENCE=host_188_hygiene_blocked:${host_188_hygiene_blocked:-unknown}"
+  echo "OWNER_GROUP=188_host_owner,db_owner,dns_tls_owner,rollback_owner"
+  echo "REQUIRED_DECISIONS=postgresql_14_main_retire_or_restore_or_break_glass,certbot_dns_acme_owner_path,startup_unit_source_of_truth,rollback_owner,postcheck_owner"
+  echo "REQUIRED_EVIDENCE=maintenance_window,rollback_plan,precheck_artifact,postcheck_artifact,service_impact,stop_condition"
+  echo "FORBIDDEN_ACTIONS=pg_resetwal,certbot_renew,nginx_reload,systemctl_reset_failed,db_restore,docker_restart,host_file_write"
+  echo "ALLOWED_ACTION=prepare_maintenance_packet_and_read_only_preflight"
+  echo "DONE_CRITERIA=service_green:1,host_188_hygiene_blocked:0,systemd_failed_units:0,certbot_owner_evidence_accepted:1"
+fi
+
+if contains_gate "wazuh_manager_registry_export"; then
+  gate_count=$((gate_count + 1))
+  print_gate_header "wazuh_manager_registry_export" "Wazuh manager registry redacted export"
+  echo "GATE_PRIORITY=P0"
+  echo "GATE_STATUS=readonly_registry_export_required"
+  echo "CURRENT_EVIDENCE=wazuh_manager_registry_accepted:${wazuh_registry_accepted:-unknown}"
+  echo "OWNER_GROUP=iwooos_soc_owner,wazuh_owner,host_owner"
+  echo "REQUIRED_EXPORT=redacted_manager_registry_counts,per_host_alias_status,dashboard_api_connection_status,dashboard_api_version_status,collection_time_window,reviewer"
+  echo "FORBIDDEN_PAYLOADS=agent_real_name,internal_ip,client_keys,raw_wazuh_payload,token,password,authorization_header"
+  echo "FORBIDDEN_ACTIONS=active_response,agent_reenroll,wazuh_restart,secret_patch,host_write,kali_active_scan"
+  echo "ALLOWED_ACTION=collect_redacted_registry_export_or_no_data_attestation"
+  echo "DONE_CRITERIA=manager_registry_accepted:1,owner_evidence_accepted:1,runtime_gate:0"
+fi
+
+echo
+echo "NEXT_GATE_COUNT=$gate_count"
+echo "NEXT_STEP=dispatch_owner_packets_manually_after_review"
+echo "POSTCHECK_COMMAND=scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color"
+echo "NO_FALSE_GREEN_RULE=service_green_does_not_equal_dr_complete_or_wazuh_registry_recovered"
+
+exit 0
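
Review note: the runbooks above repeatedly rely on the dispatch output pinning its five authorization flags to `0`. A minimal sketch of a consumer-side guard for that invariant follows, assuming the output has been captured to a file of `KEY=VALUE` lines. The function names `dispatch_value` and `dispatch_is_read_only` are hypothetical and not part of this diff; the lookup borrows the `awk -F=` idea from `value_for` in the script.

```shell
# Hypothetical helpers, not part of the diff above.
# They read a captured dispatch output (KEY=VALUE lines) and check
# that every authorization flag the script is meant to pin is "0".

dispatch_value() {
  # $1 = captured output file, $2 = key to look up.
  # Prints the last value recorded for the key, or an empty line if absent.
  awk -F= -v key="$2" '$1 == key {value=$2} END {print value}' "$1"
}

dispatch_is_read_only() {
  # $1 = captured output file.
  # Returns 0 only if all five flags are present and exactly "0".
  for key in RUNTIME_ACTION_AUTHORIZED DISPATCH_AUTHORIZED REQUEST_SENT_COUNT \
             HOST_WRITE_AUTHORIZED SECRET_VALUE_COLLECTION_ALLOWED; do
    [ "$(dispatch_value "$1" "$key")" = "0" ] || return 1
  done
  return 0
}
```

One possible use is to redirect the output of `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` to a temporary file and call `dispatch_is_read_only` on it before any owner packet is prepared. A missing flag is treated as a failure, which matches the fail-closed intent described in the runbooks.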