fix(ops): recover backup core after reboot [skip ci]

ogt
2026-06-27 03:06:42 +08:00
parent 6e6e9fa746
commit 8fdcc0194f
7 changed files with 184 additions and 13 deletions

@@ -1,3 +1,44 @@
## 2026-06-27 00:58 reboot SOP actual fix: 188 MOMO backup core false red resolved
**Time and sources**
- 2026-06-27 00:11-00:58 Asia/Taipei.
- Sources: `dr-offsite-operator-checklist.sh --check --no-color`, `recovery-scorecard-contract-check.py`, 188 `ollama` crontab / textfile exporter, 110 `/backup/scripts/backup-status.sh --no-notify --no-refresh`, `post-start-quick-check.sh --no-color`, `post-reboot-readiness-summary.sh --no-color`, and Prometheus recovery recording rules.
**Actual problems**
- `dr-offsite-operator-checklist.sh` used to abort in lean Python environments because `scripts/ops/recovery-scorecard-contract-check.py` ran `import yaml` directly; the error was `ModuleNotFoundError: No module named 'yaml'`.
- The 00:16 post-reboot summary further showed `SERVICE_GREEN=0`, `BACKUP_CORE_GREEN=0`, and `POST_START_BLOCKED=2`. The root cause was not missing backup data: the 188 `momo_pg_daily` backup was fresh and its cron entry existed, but the exporter still reported `awoooi_backup_job_configured{host="188",job="momo_pg_daily"} 0`, so 110 backup-status returned `core_blockers=1` and `configured_missing_188=1`.
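
The false red described above comes down to one sample line in the textfile exporter output. A minimal Python sketch of how such a line could be read, assuming the exporter writes standard Prometheus text-exposition samples; `parse_sample` and `is_false_red` are hypothetical helper names and are not taken from the repo scripts:

```python
import re

# Matches one labelled Prometheus text-exposition sample, for example:
#   awoooi_backup_job_configured{host="188",job="momo_pg_daily"} 0
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line):
    """Return (metric_name, labels_dict, float_value), or None if the line does not match."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        return None
    labels = dict(LABEL_RE.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))


def is_false_red(configured_line, backup_is_fresh):
    """Assumption: a job that is fresh on disk but reported as not configured is a false red."""
    sample = parse_sample(configured_line)
    if sample is None:
        return False
    _, _, value = sample
    return backup_is_fresh and value == 0.0
```

With the 00:16 readback line and a fresh backup, `is_false_red` returns `True`; after the exporter refresh (value `1`) it returns `False`.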
**Fixes applied**
- `scripts/ops/recovery-scorecard-contract-check.py` now treats PyYAML as optional: without PyYAML, it uses a standard-library Python fallback to parse the recovery recording rules and the baseline `monitoring_contract.prometheus_recording_rules`.
- A minimal, reversible host write was made on 188: the `ollama` crontab was first backed up to `/home/ollama/momo_backups/crontab-before-momo-pg-host-owned-20260627-001925.txt`, then the `AWOOOI momo PostgreSQL daily backup` entry was converged onto the host-owned `/home/ollama/bin/momo-pg-backup.sh`. Docker / systemd / Nginx / firewall / K3s / DB were not restarted.
- The 188 textfile exporter was refreshed manually and now reads back `awoooi_backup_job_configured{host="188",job="momo_pg_daily"} 1`.
- The repo source of truth `infra/ansible/playbooks/188-momo-backup-user.yml` was updated in step to use the host-owned `/home/ollama/bin/momo-pg-backup.sh`, so the crontab is not changed back to the app-side path in the future.
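
The PyYAML-optional change can be pictured as a guarded import with a plain-text fallback. This is only a sketch of the general pattern, not the actual contents of `recovery-scorecard-contract-check.py`; the function names and the regex-based fallback are assumptions, and the fallback only understands simple `record: name` lines:

```python
import re

try:
    import yaml  # PyYAML is optional; lean environments may not have it
except ImportError:
    yaml = None

# Fallback pattern for lines such as "      - record: awoooi_recovery_core_ready".
RECORD_RE = re.compile(
    r'^\s*-?\s*record:\s*["\']?([A-Za-z_:][A-Za-z0-9_:]*)["\']?\s*$'
)


def fallback_record_names(text):
    """Collect `record:` names line by line, without any YAML library."""
    names = []
    for line in text.splitlines():
        match = RECORD_RE.match(line)
        if match:
            names.append(match.group(1))
    return names


def recording_rule_names(text):
    """Return the recording-rule names from a Prometheus rules file."""
    if yaml is None:
        return fallback_record_names(text)
    doc = yaml.safe_load(text) or {}
    return [
        rule["record"]
        for group in doc.get("groups") or []
        for rule in group.get("rules") or []
        if "record" in rule
    ]
```

Both paths should agree on a well-formed rules file, which is what lets the checklist run unchanged whether or not PyYAML is installed.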
**Verification results**
- `python3 scripts/ops/recovery-scorecard-contract-check.py`: `RECOVERY_SCORECARD_CONTRACT_OK`.
- `python3 scripts/ops/recovery-scorecard-contract-check.py --prometheus-url http://192.168.0.110:9090 --expect-core-ready`: `awoooi_recovery_core_ready=1`, `awoooi_recovery_dr_offsite_ready=0`; core ready has recovered, and DR is still correctly 0 because of escrow.
- `python3 scripts/ops/recovery-scorecard-contract-check.py --prometheus-url http://192.168.0.110:9090 --expect-core-ready --expect-dr-ready`: fails as expected, with reason `expected DR offsite ready, got 0.0`.
- 110 backup-status 00:56: `110備份=13/13 fresh failed=0`, `188備份=2/2 fresh failed=0`, `core_blockers=0`, `configured_missing_188=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`.
- `post-start-quick-check.sh` 00:57: `POST_START_QUICK_CHECK PASS=38 WARN=3 BLOCKED=0`, `SERVICE=0`, `RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`.
- `post-reboot-readiness-summary.sh` 00:58 artifact `/tmp/awoooi-post-reboot-readiness-20260627-005728/summary.txt`: `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=0`, `ESCROW_MISSING_COUNT=5`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`.
- 02:42 live revalidation artifact `/tmp/awoooi-post-reboot-readiness-20260627-024151/summary.txt`: `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `POST_START_PASS=38`, `POST_START_WARN=3`, `POST_START_BLOCKED=0`, `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `STOCK_FRESHNESS_STATUS=ok`, `STOCK_LATEST_TRADING_DATE=2026-06-26`, `BACKUP_CORE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=0`, `ESCROW_MISSING_COUNT=5`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `RUNTIME_ACTION_AUTHORIZED=0`.
- 02:41 DR checklist: `CORE_COLD_START_GREEN=1`, `RECOVERY_STATE=CORE_READY_DR_OFFSITE_PENDING`; Prometheus contract `awoooi_recovery_core_ready=1`, `awoooi_recovery_dr_offsite_ready=0`.
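
The `--expect-core-ready` / `--expect-dr-ready` behaviour above can be sketched as a comparison of two instant-query values against the expectations. The sketch below assumes the standard Prometheus HTTP API response shape for `/api/v1/query`; the function names and the "expected core ready" wording are hypothetical, and only the DR message text is taken from the readback above:

```python
def vector_value(response, default=None):
    """Extract the first sample value from a Prometheus instant-query JSON response."""
    if response.get("status") != "success":
        return default
    result = response.get("data", {}).get("result", [])
    if not result:
        return default
    # Each sample looks like {"metric": {...}, "value": [timestamp, "string_value"]}.
    return float(result[0]["value"][1])


def contract_errors(core_ready, dr_ready, expect_core_ready=False, expect_dr_ready=False):
    """Return a list of human-readable failures; an empty list means the contract holds."""
    errors = []
    if expect_core_ready and core_ready != 1.0:
        errors.append(f"expected core ready, got {core_ready}")
    if expect_dr_ready and dr_ready != 1.0:
        errors.append(f"expected DR offsite ready, got {dr_ready}")
    return errors
```

With core at `1.0` and DR at `0.0`, expecting only core passes, while also expecting DR yields the single failure seen in the 00:58 verification.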
**Command types run**
- Read-only: scorecard / DR checklist / backup-status / post-start / post-reboot summary / Prometheus readback / route and process evidence.
- Writes: repo script / Ansible playbook / runbook / workplan / LOGBOOK; a single backup-schedule path correction in the 188 `ollama` crontab and a manual exporter refresh.
- Not done: no secrets read or stored, no credential marker write, no backup restore / prune / remote delete, no Docker/systemd/Nginx/firewall/K8s/DB/Wazuh restart or active response, no Kali active scan.
**Current verdict**
- Hosts / K3s / public routes / AWOOOI / MOMO / Stock / backup core / 188 hygiene: `GREEN`.
- Prometheus recovery core: `awoooi_recovery_core_ready=1`.
- Overall recovery declaration: `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`.
**Still blocked / must not be declared**
- DR credential escrow evidence is still missing `5` items: `restic_repository_password`, `offsite_provider_credentials`, `break_glass_admin_credentials`, `dns_registrar_recovery`, `oauth_ai_provider_recovery`; `DR_COMPLETE` must not be declared.
- Wazuh manager registry accepted is still `0`; full-host Wazuh management recovery must not be declared.
- Runtime action / expanded host-write authorization / Wazuh active response / Kali active scan all remain `0 / false`.
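
The "must not be declared" rules above map naturally onto a small fail-closed check over the summary keys. A hedged sketch, assuming the summary is already available as a dict of strings; `forbidden_declarations` is a hypothetical name, and treating missing keys as blocked is an assumption rather than documented behaviour of `post-reboot-declaration-guard.py`:

```python
def forbidden_declarations(summary):
    """Return the recovery statements that may not be declared for this summary."""
    forbidden = []
    # Fail closed: a missing key is treated the same as a blocked gate.
    if int(summary.get("ESCROW_MISSING_COUNT", "1")) > 0:
        forbidden.append("DR_COMPLETE")
    if summary.get("WAZUH_MANAGER_REGISTRY_ACCEPTED", "0") != "1":
        forbidden.append("WAZUH_REGISTRY_RECOVERED")
    if summary.get("RUNTIME_ACTION_AUTHORIZED", "0") != "1":
        forbidden.append("RUNTIME_ACTION_AUTHORIZED")
    return forbidden
```

For the 02:42 summary values (`ESCROW_MISSING_COUNT=5`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `RUNTIME_ACTION_AUTHORIZED=0`) all three statements stay forbidden.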
## 2026-06-26 D1I latest production baseline sync: Delivery workbench, controlled apply, Wazuh metadata gate smoke
**Background**: After D1H, the parallel delivery workbench release and the Wazuh live metadata gate kept moving forward. To keep production from falling behind main again, this section only performs the latest `gitea/main` fast-forward, production API / browser smoke tests, and evidence backfill; it adds no runtime execution permissions.
@@ -72,7 +113,6 @@
- Owner response accepted, secret-source metadata accepted, read-only scope accepted, post-enable readback, Wazuh live query, active response, host write, Kali active scan, Telegram live send, secret collection, runtime gate: all still `0 / false`.
**Next P0**: Obtain the formal owner response packet: live metadata owner, secret injection owner, secret-source metadata reference, Wazuh manager node health reference, TLS verification reference, read-only account scope reference, agent alias mapping policy, post-enable readback command, rollback owner, maintenance window, verification plan, and a statement that no secret plaintext / no raw payload will be provided. Before acceptance, do not enable the Wazuh live metadata environment variables, do not query the live Wazuh API, do not restart Wazuh / Docker / Nginx / firewall, do not re-register agents, and do not enable active response.
## 2026-06-26 D1G IwoooS Wazuh live route red light moved forward: Runtime board and production readback complete
**Background**: Production has confirmed that `/api/iwooos/wazuh` is not registry empty but `disabled_waiting_iwooos_wazuh_owner_gate`. Previously this state was visible only in the Wazuh card near the bottom of the page, which made the Runtime security board look as if only a static snapshot remained. This section wires the public-safe aggregate status of the Wazuh read-only route into the first screen of the Runtime security readback, so that disabled, misconfigured, empty, below expected, and unavailable all become P0 red lights.

@@ -25,9 +25,35 @@
> 2026-06-25 20:25 Codex 110 CPU cleanup: two orphan StockPlatform headless Chrome process groups were cleared by targeted approved `SIGTERM`; no Docker/systemd/Nginx/K8s/DB/backup write occurred. Backup/offsite remains green, DR still blocked by `escrow_missing=5`, and Stock freshness remains the only hard product-data blocker.
> 2026-06-25 21:14 Codex full wrapper refresh: StockPlatform 21:00 `intelligence-sync` and 21:10 AI pipeline naturally caught up; `/api/v1/system/freshness` is `status=ok` with blockers `[]`. Backup/offsite remains 110 `13/13` and 188 `2/2` fresh, `core_blockers=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`; full-stack service/data result is `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, with only `escrow_missing=5` blocking DR complete.
> 2026-06-26 06:28 Codex next-day backup readback: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`; full-stack service/data result remains `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`.
> 2026-06-27 00:56 Codex backup core recovery: 188 `momo_pg_daily` was fresh but temporarily false-blocked by cron/config drift (`configured_missing_188=1`). 188 crontab was backed up to `/home/ollama/momo_backups/crontab-before-momo-pg-host-owned-20260627-001925.txt`, the daily MOMO PostgreSQL backup entry was restored to host-owned `/home/ollama/bin/momo-pg-backup.sh`, and the exporter now reports `awoooi_backup_job_configured{host="188",job="momo_pg_daily"} 1`. `backup-status` now reports 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `configured_missing_188=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`; DR still blocked only by credential escrow evidence.
> 2026-06-27 02:42 Codex post-reboot revalidation: `post-reboot-readiness-summary.sh` remains `FULL_STACK_GREEN_DR_ESCROW_BLOCKED` with `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=0`, `STOCK_FRESHNESS_STATUS=ok`, and `ESCROW_MISSING_COUNT=5`. `dr-offsite-operator-checklist.sh --check` confirms `CORE_COLD_START_GREEN=1`, `RECOVERY_STATE=CORE_READY_DR_OFFSITE_PENDING`, live Prometheus `awoooi_recovery_core_ready=1`, and `awoooi_recovery_dr_offsite_ready=0`.
---
## 2026-06-27 00:56 Backup / Offsite / Escrow Live Status
Read-only and minimal-write evidence sources: 00:56 `/backup/scripts/backup-status.sh --no-notify --no-refresh` from 110, 188 crontab backup / controlled MOMO backup path correction, 188 textfile exporter refresh, post-start quick check at 00:57, and Prometheus recovery recording-rule readback.
- 110 backup health: `13/13 fresh failed=0`
- 188 backup health: `2/2 fresh failed=0`
- Integrity / configured blockers: `core_blockers=0`, `configured_missing_110=0`, `configured_missing_188=0`, `script_missing_110=0`, `script_missing_188=0`, `integrity_stale=0`
- 188 MOMO backup config drift fix: crontab rollback file `/home/ollama/momo_backups/crontab-before-momo-pg-host-owned-20260627-001925.txt`; active cron now uses `/home/ollama/bin/momo-pg-backup.sh`; exporter reports `awoooi_backup_job_configured{host="188",job="momo_pg_daily"} 1`
- Offsite / GDrive freshness: `offsite_configured=1`, `offsite_fresh=1`, `rclone_gdrive_configured=1`, `rclone_gdrive_fresh=1`
- Last aggregate backup: `2026-06-26 02:31:02`
- Prometheus recovery rules: `awoooi_recovery_core_ready=1`, `awoooi_recovery_dr_offsite_ready=0`
- DR blocker remains: `escrow_missing=5`. Do not fabricate evidence markers, and do not paste any secret value / hash / partial token.
- Full-stack service state: `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. Post-start quick check `PASS=38 WARN=3 BLOCKED=0`; StockPlatform freshness `status=ok`; MOMO daily freshness `2|2026-06-24`
| Gate | Status | Evidence |
|------|--------|----------|
| 110 backup freshness | VERIFIED | 13/13 fresh, failed count 0. |
| 188 backup freshness | VERIFIED | 2/2 fresh, failed count 0. |
| 188 MOMO backup cron/config | VERIFIED | Active crontab uses `/home/ollama/bin/momo-pg-backup.sh`; `configured_missing_188=0`. |
| Offsite / GDrive freshness | VERIFIED | `offsite_fresh=1`, `rclone_gdrive_fresh=1`. |
| Backup core blockers | GREEN | `core_blockers=0`; Prometheus `awoooi_recovery_core_ready=1`. |
| Full-stack service state | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | `POST_START_QUICK_CHECK PASS=38 WARN=3 BLOCKED=0`; service/data/backup core green. |
| Credential escrow | BLOCKED | `escrow_missing=5`; only real non-secret owner evidence may close this. |
## 2026-06-26 06:28 Backup / Offsite / Escrow Live Status
Read-only evidence sources: 06:26 / 06:28 `post-start-quick-check.sh`, delegated `/backup/scripts/backup-status.sh --no-notify --no-refresh`, route-only wrapper retry validation, and direct StockPlatform / MOMO freshness readback.

@@ -1,6 +1,6 @@
# AWOOOI Full-Stack Cold Start and Host Reboot SOP
- > Version: v1.77
+ > Version: v1.78
> Last updated: 2026-06-27 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
@@ -14,6 +14,10 @@
v1.76 owner gate replay rule: once a summary has been produced for a round, the owner packet and owner response preflight must prefer `--summary-file "$ARTIFACT_DIR/summary.txt"`, for example `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color --summary-file "$ARTIFACT_DIR/summary.txt" --output /tmp/awoooi-post-reboot-owner-packets.json` and `scripts/reboot-recovery/post-reboot-owner-response-preflight.py --no-color --summary-file "$ARTIFACT_DIR/summary.txt" --response-file <file>`. Omitting `--summary-file` is allowed only when live evidence is deliberately being re-collected; otherwise preflight must not re-run the summary on its own and cause state drift within the same round.
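
The replay rule above amounts to loading one `summary.txt` snapshot and handing that same snapshot to every later step. A minimal sketch, assuming the file holds one `KEY=VALUE` pair per line (the exact on-disk format is not shown in this document); `parse_summary` and `load_summary` are hypothetical helpers, not functions from the repo:

```python
from pathlib import Path


def parse_summary(text):
    """Parse KEY=VALUE lines into a dict, skipping blanks, comments, and malformed lines."""
    summary = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        summary[key.strip()] = value.strip()
    return summary


def load_summary(path):
    """Read one persisted summary snapshot; callers reuse the result instead of re-probing."""
    return parse_summary(Path(path).read_text(encoding="utf-8"))
```

Reading the snapshot once and passing the resulting dict around is what keeps the packet generator and the preflight on the same point-in-time state.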
2026-06-27 02:42 latest live revalidation: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` artifact `/tmp/awoooi-post-reboot-readiness-20260627-024151/summary.txt` returned `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `POST_START_PASS=38`, `POST_START_WARN=3`, `POST_START_BLOCKED=0`, `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `STOCK_FRESHNESS_STATUS=ok`, `STOCK_LATEST_TRADING_DATE=2026-06-26`, `STOCK_BLOCKERS=none`, `BACKUP_CORE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=0`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `RUNTIME_ACTION_AUTHORIZED=0`. The DR checklist from the same round returned `CORE_COLD_START_GREEN=1`, `RECOVERY_STATE=CORE_READY_DR_OFFSITE_PENDING`, and Prometheus contract `awoooi_recovery_core_ready=1`, `awoooi_recovery_dr_offsite_ready=0`. Service / data / backup core recovery may therefore be declared now; DR complete still must not be declared because `ESCROW_MISSING_COUNT=5`, and full-host Wazuh management still must not be declared because manager registry accepted is `0`.
2026-06-27 00:58 latest live summary: this round first fixed two actual SOP blockers. First, `scripts/ops/recovery-scorecard-contract-check.py` now treats PyYAML as optional, so a standard Python environment can also validate the recovery recording-rule contract and the DR/offsite checklist no longer aborts on `ModuleNotFoundError: yaml`. Second, the 188 `ollama` crontab was backed up to `/home/ollama/momo_backups/crontab-before-momo-pg-host-owned-20260627-001925.txt`, and `AWOOOI momo PostgreSQL daily backup` was converged from the app-side `/home/ollama/momo-pro/scripts/pg_backup.sh` back to the host-owned `/home/ollama/bin/momo-pg-backup.sh`; after refreshing the 188 textfile exporter, `awoooi_backup_job_configured{host="188",job="momo_pg_daily"} 1`. The 00:58 `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` artifact `/tmp/awoooi-post-reboot-readiness-20260627-005728/summary.txt` returned `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `POST_START_PASS=38`, `POST_START_WARN=3`, `POST_START_BLOCKED=0`, `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_HYGIENE_BLOCKED=0`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `RUNTIME_ACTION_AUTHORIZED=0`. In the same round, `backup-status` returned `core_blockers=0` and `configured_missing_188=0`, and the Prometheus live contract returned `awoooi_recovery_core_ready=1` and `awoooi_recovery_dr_offsite_ready=0`. This means hosts / K3s / public routes / product data / backup core have recovered; DR is still blocked only because credential escrow is missing 5 non-secret evidence markers; and Wazuh full-host registry accepted is still 0.
2026-06-27 00:02 latest live summary: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` returned `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `POST_START_PASS=38`, `POST_START_WARN=4`, `POST_START_BLOCKED=0`, `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `STOCK_FRESHNESS_STATUS=ok`, `STOCK_LATEST_TRADING_DATE=2026-06-26`, `STOCK_BLOCKERS=none`, `BACKUP_CORE_GREEN=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_HYGIENE_BLOCKED=0`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `RUNTIME_ACTION_AUTHORIZED=0`. Production route smoke in the same round returned: IwoooS `200`, Wazuh read-only routes `200`, VibeWork `200`, AwoooGo `200`, MOMO health `200`, Stock `200`; AWOOOI API health `healthy / prod / mock_mode=false`; PostgreSQL / Redis / OpenClaw / SigNoz / GCP Ollama provider up. The local Ollama endpoint is still cooldown / degraded and is covered by provider fallback, so it is not a website or API service blocker. The latest deploy marker is `e506b9d5 chore(cd): deploy fe74d86 [skip ci]`; this round's `89b9e67a` is a SOP / scripts / docs source update, not a runtime bundle deploy marker. The 23:56 redacted readbacks of 112 Wazuh and 120 K3s still serve as adjacent evidence for this round: 120 ArgoCD `Synced / Healthy`, Pods all `Running` or `Completed`; the Wazuh manager registry is not entirely empty, but `WAZUH_MANAGER_REGISTRY_ACCEPTED=0` stands, so full-host management recovery cannot be declared.
2026-06-26 23:56 live summary retained for comparison: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` returned `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `POST_START_PASS=38`, `POST_START_WARN=3`, `POST_START_BLOCKED=0`, `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `STOCK_FRESHNESS_STATUS=ok`, `STOCK_LATEST_TRADING_DATE=2026-06-26`, `STOCK_BLOCKERS=none`, `BACKUP_CORE_GREEN=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_HYGIENE_BLOCKED=0`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `RUNTIME_ACTION_AUTHORIZED=0`. A read-only follow-up check of 120 in the same window: ArgoCD `awoooi-prod` is `Synced / Healthy`, and `awoooi-prod` Pods are all `Running` or `Completed`; the historical failed Job `km-vectorize-29689620` has been superseded by later successful Jobs on 2026-06-23, 2026-06-24, and 2026-06-25 and is not a current service blocker. A read-only follow-up check of 112 in the same window: systemd `running`; Wazuh manager / indexer / dashboard `active`; manager API root returns `401`; Dashboard unauthenticated check endpoints return `401`; the redacted manager registry readback shows local manager `1`, managed agents `5`, active managed `5`, disconnected `0`, never connected `0`. This evidence shows the registry is no longer "entirely empty", but full-host Wazuh management recovery still cannot be declared, because the SOP expected scope is still 6, Dashboard API connection / version has not yet been accepted via login or owner evidence, and owner response accepted is still `0`.

@@ -1,6 +1,6 @@
# One-Page Post-Reboot Host Check
- > Version: v1.17
+ > Version: v1.18
> Last updated: 2026-06-27 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 post-reboot service recovery. 112 Kali / Wazuh / active scan is not part of this flow.
@@ -10,7 +10,7 @@
Whenever any of the 110 / 120 / 121 / 188 hosts is powered on, shut down, rebooted, recovered from power loss, goes through a VMware console fsck, or has a large Docker / K3s reschedule, run this page first and only then decide whether to declare recovery.
- Latest baseline: 2026-06-27 00:02 single-summary replay / route + AWOOOI API warmup classifier. `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` returned `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `POST_START_PASS=38`, `POST_START_WARN=4`, `POST_START_BLOCKED=0`, `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `STOCK_FRESHNESS_STATUS=ok`, `STOCK_LATEST_TRADING_DATE=2026-06-26`, `STOCK_BLOCKERS=none`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_HYGIENE_BLOCKED=0`, `HOST_188_RESULT=HOST_188_HYGIENE_GREEN.`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_COVERAGE_SCOPE=6`, `WAZUH_DIRECT_ACTIVE=2`, `WAZUH_NO_TRANSPORT=1`, `WAZUH_SSH_BLOCKED=3`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`, `WAZUH_DASHBOARD_INDEX_OK=3`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, and automatically wrote the same key/value set to `$ARTIFACT_DIR/summary.txt`. Production route smoke in the same round confirmed IwoooS, Wazuh read-only routes, VibeWork, AwoooGo, MOMO health, and Stock all `200`; AWOOOI API health is `healthy` overall; local Ollama cooldown is covered by GCP provider fallback and is not a website or API service blocker. Later steps in the same round (`post-reboot-declaration-guard.py`, `post-reboot-next-gate-dispatch.sh`, `post-reboot-next-gate-owner-packets.py`, `post-reboot-owner-packet-contract-guard.py`, `post-reboot-owner-response-preflight.py`) must use this `summary.txt` or the dispatch / packet generated from it, and must not mix results from multiple live probes taken at different times. `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export` remain the only current next gates: DR cannot be declared complete because `escrow_missing=5`, and Wazuh manager registry accepted is still `0`, so route `200`, transport `6`, Dashboard index pattern `3`, or redacted registry counts must not be treated as full-host management being complete. v1.17 keeps the route/API warmup classifier: if a delegated cold start is blocked only by a single public-route 502 / TLS readback, or by a single transient `BLOCKED AWOOOI API not reachable` during a K3s rollout, and the wrapper route retry has confirmed that public AWOOOI API health is 2xx, the blocker is downgraded to an evidence warning; if the public API still fails, another non-route blocker exists, or the retry does not recover, it remains hard blocked.
+ Latest baseline: 2026-06-27 02:42 live revalidation / backup core recovery. `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` artifact `/tmp/awoooi-post-reboot-readiness-20260627-024151/summary.txt` returned `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `POST_START_PASS=38`, `POST_START_WARN=3`, `POST_START_BLOCKED=0`, `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `STOCK_FRESHNESS_STATUS=ok`, `STOCK_LATEST_TRADING_DATE=2026-06-26`, `STOCK_BLOCKERS=none`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_HYGIENE_BLOCKED=0`, `HOST_188_RESULT=HOST_188_HYGIENE_GREEN.`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_COVERAGE_SCOPE=6`, `WAZUH_DIRECT_ACTIVE=2`, `WAZUH_NO_TRANSPORT=1`, `WAZUH_SSH_BLOCKED=3`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`, `WAZUH_DASHBOARD_INDEX_OK=3`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. This round actually fixed the 188 `momo_pg_daily` backup configured drift: the earlier 00:16 summary was temporarily blocked by `configured_missing_188=1`; at 00:19 the 188 crontab was backed up to `/home/ollama/momo_backups/crontab-before-momo-pg-host-owned-20260627-001925.txt` and the MOMO PostgreSQL daily backup was converged onto the host-owned `/home/ollama/bin/momo-pg-backup.sh`; after refreshing the exporter, `configured_missing_188=0`, and the 00:56 `backup-status` returned `core_blockers=0`. The 02:41 DR checklist returned `CORE_COLD_START_GREEN=1` and `RECOVERY_STATE=CORE_READY_DR_OFFSITE_PENDING`, and the Prometheus recovery contract returned `awoooi_recovery_core_ready=1` and `awoooi_recovery_dr_offsite_ready=0`. Later steps in the same round (`post-reboot-declaration-guard.py`, `post-reboot-next-gate-dispatch.sh`, `post-reboot-next-gate-owner-packets.py`, `post-reboot-owner-packet-contract-guard.py`, `post-reboot-owner-response-preflight.py`) must use this `summary.txt` or the dispatch / packet generated from it, and must not mix results from multiple live probes taken at different times. `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export` remain the only current next gates: DR cannot be declared complete because `escrow_missing=5`, and Wazuh manager registry accepted is still `0`, so route `200`, transport `6`, Dashboard index pattern `3`, or redacted registry counts must not be treated as full-host management being complete. v1.18 keeps the route/API warmup classifier: if a delegated cold start is blocked only by a single public-route 502 / TLS readback, or by a single transient `BLOCKED AWOOOI API not reachable` during a K3s rollout, and the wrapper route retry has confirmed that public AWOOOI API health is 2xx, the blocker is downgraded to an evidence warning; if the public API still fails, another non-route blocker exists, or the retry does not recover, it remains hard blocked.
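
The warmup classifier rule described above can be expressed as a small decision function. This is a sketch under stated assumptions, not the wrapper's real code: `classify_blocker`, the `"route"` kind label, and the `"WARN"` / `"BLOCKED"` return strings are hypothetical, and the retry result is modelled as a list of HTTP status codes:

```python
def classify_blocker(kind, retry_status_codes):
    """Downgrade a route/API reachability blocker to a warning only if the retry ended in 2xx.

    kind: "route" for public-route / API reachability blockers; any other value
          is treated as a non-route blocker and always stays hard blocked.
    retry_status_codes: HTTP status codes seen by the wrapper's route retry, in order.
    """
    if kind != "route":
        return "BLOCKED"
    if retry_status_codes and 200 <= retry_status_codes[-1] < 300:
        return "WARN"
    # No retry evidence, or the retry never recovered: stay hard blocked.
    return "BLOCKED"
```

Keying the downgrade on the final retry status keeps the rule fail-closed: a transient 502 followed by a 200 becomes a warning, while anything that never reaches 2xx remains a hard blocker.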
This page answers only four things:

@@ -11,11 +11,11 @@
| Area | Status | Completion | Evidence |
|------|--------|------------|----------|
- | Overall recovery readiness | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 99% | The 2026-06-27 00:02 live summary supersedes the 2026-06-26 23:56 reading. `post-reboot-readiness-summary.sh --no-color` returned `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `POST_START_WARN=4`, `STOCK_FRESHNESS_STATUS=ok`, `STOCK_LATEST_TRADING_DATE=2026-06-26`, `STOCK_BLOCKERS=none`, `BACKUP_CORE_GREEN=1`, `ESCROW_MISSING_COUNT=5`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`. Production route smoke: IwoooS / Wazuh read-only routes / VibeWork / AwoooGo / MOMO health / Stock all `200`; AWOOOI API health `healthy / prod / mock_mode=false`; local Ollama cooldown is covered by GCP provider fallback and is not a website or API blocker. Hosts / K3s / public routes / AWOOOI / MOMO / Stock / backup core / 188 hygiene have recovered. DR still cannot be declared complete because credential escrow is missing 5; the Wazuh registry has a redacted manager readback but no Dashboard API / owner acceptance yet. |
+ | Overall recovery readiness | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 99% | The 2026-06-27 02:42 live revalidation supersedes the temporarily blocked 00:16 reading. `post-reboot-readiness-summary.sh --no-color` artifact `/tmp/awoooi-post-reboot-readiness-20260627-024151/summary.txt` returned `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `POST_START_WARN=3`, `POST_START_BLOCKED=0`, `STOCK_FRESHNESS_STATUS=ok`, `STOCK_LATEST_TRADING_DATE=2026-06-26`, `STOCK_BLOCKERS=none`, `BACKUP_CORE_GREEN=1`, `ESCROW_MISSING_COUNT=5`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`. The 00:16 blocker was 188 `momo_pg_daily` configured drift: the backup was fresh, but the exporter reported `configured_missing_188=1` because the crontab still pointed at the app-side path. At 00:19 the 188 crontab was backed up to `/home/ollama/momo_backups/crontab-before-momo-pg-host-owned-20260627-001925.txt` and converged onto the host-owned `/home/ollama/bin/momo-pg-backup.sh`; after refreshing the exporter, `awoooi_backup_job_configured{host="188",job="momo_pg_daily"} 1`, and the 00:56 `backup-status` returned `core_blockers=0`. The 02:41 DR checklist returned `CORE_COLD_START_GREEN=1` and `RECOVERY_STATE=CORE_READY_DR_OFFSITE_PENDING`, and the Prometheus live contract returned `awoooi_recovery_core_ready=1` and `awoooi_recovery_dr_offsite_ready=0`. Hosts / K3s / public routes / AWOOOI / MOMO / Stock / backup core / 188 hygiene have recovered. DR still cannot be declared complete because credential escrow is missing 5; the Wazuh registry has a redacted manager readback but no Dashboard API / owner acceptance yet. |
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-26 07:19 readback shows 120 and 121 reachable, K3s active, `mon` and `mon1` both `Ready control-plane`, AWOOOI API/Web replicas split across both nodes, ArgoCD `awoooi-prod Synced / Healthy` at revision `1fd5e2a8b0f18d24eed16aa2a44286bcbf230603`, and `km-vectorize` official 03:00 Taipei time run succeeded with `lastSuccess=2026-06-25T19:00:14Z`. |
- | P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-26 06:58 backup readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`, last aggregate `2026-06-26 02:31:02`. DR remains blocked on real non-secret credential escrow evidence IDs; do not write placeholder markers or paste secret values. |
+ | P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 98% | 2026-06-27 00:56 backup readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `configured_missing_188=0`, `escrow_missing=5`, last aggregate `2026-06-26 02:31:02`. The 188 MOMO backup crontab drift has been fixed and the rollback crontab is retained. DR remains blocked on real non-secret credential escrow evidence IDs; do not write placeholder markers or paste secret values. |
| P2 service / data truth | DONE | 100% | Public routes and service health are green; MOMO health `V10.719`; current-month parity is `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`. StockPlatform `/api/v1/system/freshness` is `ok`, latest trading date `2026-06-26`, blockers `none`; the earlier Stock EOD blocker converged naturally through the official source and the production cron. |
-| P3 docs / automation contracts | DONE_WITH_API_WARMUP_CLASSIFIER_V176 | 100% | Workplan, SOP v1.76, post-reboot declaration guard, machine-readable post-reboot readiness summary with Wazuh registry detail fields and auto-persisted `summary.txt`, post-reboot next-gate dispatch checklist, owner-packet JSON generator, dynamic owner-packet contract guard, post-reboot owner response preflight, owner response placeholder template, one-page post-start quick check v1.16, route retry gate, delegated cold-start public-route / AWOOOI API warmup classifier, deploy warmup classification, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, StockPlatform natural schedule green evidence, 110 orphan Chrome recurrence cleanup evidence, 188 fail-closed startup data recovery gate, 188 host hygiene read-only checklist, 188 PostgreSQL runtime-ready source-of-truth, 188 ACME route/timer hygiene, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and stricter product-data / route retry gates are updated. Declaration guard now machine-checks allowed / forbidden recovery statements from the same `summary.txt`: service/data/backup/188 host hygiene green may be declared when the live summary says so, while `DR_COMPLETE`, `WAZUH_REGISTRY_RECOVERED` and `RUNTIME_ACTION_AUTHORIZED` remain forbidden until evidence gates close. |
+| P3 docs / automation contracts | DONE_WITH_BACKUP_CORE_RECOVERY_V178 | 100% | Workplan, SOP v1.78, post-reboot declaration guard, machine-readable post-reboot readiness summary with Wazuh registry detail fields and auto-persisted `summary.txt`, post-reboot next-gate dispatch checklist, owner-packet JSON generator, dynamic owner-packet contract guard, post-reboot owner response preflight, owner response placeholder template, one-page post-start quick check v1.18, route retry gate, delegated cold-start public-route / AWOOOI API warmup classifier, backup-status core-blocker readback, PyYAML-optional recovery-scorecard contract check, 188 MOMO backup crontab host-owned rollback evidence, deploy warmup classification, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, StockPlatform natural schedule green evidence, 110 orphan Chrome recurrence cleanup evidence, 188 fail-closed startup data recovery gate, 188 host hygiene read-only checklist, 188 PostgreSQL runtime-ready source-of-truth, 188 ACME route/timer hygiene, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and stricter product-data / route retry gates are updated. Declaration guard now machine-checks allowed / forbidden recovery statements from the same `summary.txt`: service/data/backup/188 host hygiene green may be declared when the live summary says so, while `DR_COMPLETE`, `WAZUH_REGISTRY_RECOVERED` and `RUNTIME_ACTION_AUTHORIZED` remain forbidden until evidence gates close. |
2026-06-26 12:13 machine-readable summary baseline supersedes the 07:47 / 08:59 gate set: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` stores delegated logs under `/tmp/awoooi-post-reboot-readiness-20260626-121303` and returns `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_SERVICE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=0`, `HOST_188_CHECK_RC=0`, `HOST_188_RESULT=HOST_188_HYGIENE_GREEN.`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_COVERAGE_SCOPE=6`, `WAZUH_DIRECT_ACTIVE=2`, `WAZUH_NO_TRANSPORT=1`, `WAZUH_SSH_BLOCKED=3`, `WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`, `WAZUH_DASHBOARD_INDEX_OK=3`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_RUNTIME_GATE=0`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, and `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export`. This is now the preferred first operator/AI-agent entrypoint after reboot because it separates service health from DR and security registry evidence; 188 host hygiene is no longer a next gate unless the live checklist regresses. 
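
The summary described above is a flat list of `KEY=VALUE` fields, so a consumer can gate declarations on it mechanically. The following is a minimal sketch, not the repository's declaration guard: the sample text is a hand-copied subset of the fields quoted above, and the mapping from fields to forbidden declarations (`forbidden_declarations`) is an assumption made for illustration. Missing fields are treated as blocked, which keeps the sketch fail-closed.

```python
from __future__ import annotations

# Hypothetical sample: a subset of the fields quoted in the paragraph above.
# The real values come from the summary.txt written by the summary script.
SAMPLE_SUMMARY = """\
SERVICE_GREEN=1
PRODUCT_DATA_GREEN=1
BACKUP_CORE_GREEN=1
DR_ESCROW_BLOCKED=1
WAZUH_MANAGER_REGISTRY_ACCEPTED=0
RUNTIME_ACTION_AUTHORIZED=0
NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export
"""


def parse_summary(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and lines without '='."""
    fields: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip()
    return fields


def forbidden_declarations(fields: dict[str, str]) -> list[str]:
    """Return declarations that must not be made yet.

    Assumed mapping (not taken from the real guard); absent fields
    default to the blocked state.
    """
    forbidden: list[str] = []
    if fields.get("DR_ESCROW_BLOCKED", "1") != "0":
        forbidden.append("DR_COMPLETE")
    if fields.get("WAZUH_MANAGER_REGISTRY_ACCEPTED", "0") != "1":
        forbidden.append("WAZUH_REGISTRY_RECOVERED")
    if fields.get("RUNTIME_ACTION_AUTHORIZED", "0") != "1":
        forbidden.append("RUNTIME_ACTION_AUTHORIZED")
    return forbidden


fields = parse_summary(SAMPLE_SUMMARY)
next_gates = [g for g in fields.get("NEXT_REQUIRED_GATES", "").split(",") if g]
print("forbidden:", forbidden_declarations(fields))
print("next gates:", next_gates)
```

With the sample fields, all three declarations stay forbidden and the two next gates are the escrow evidence and the Wazuh manager registry export, which matches the `FULL_STACK_GREEN_DR_ESCROW_BLOCKED` reading in the text.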


@@ -11,14 +11,14 @@
   vars:
     momo_backup_script_source: "{{ playbook_dir }}/../../../scripts/backup/backup-momo-188-pg.sh"
     momo_notify_helper_source: "{{ playbook_dir }}/../../../scripts/ops/notify-awoooi-ops.sh"
-    momo_scripts_dir: /home/ollama/momo-pro/scripts
-    momo_backup_script_path: /home/ollama/momo-pro/scripts/pg_backup.sh
-    momo_notify_helper_path: /home/ollama/momo-pro/scripts/notify-awoooi-ops.sh
+    momo_scripts_dir: /home/ollama/bin
+    momo_backup_script_path: /home/ollama/bin/momo-pg-backup.sh
+    momo_notify_helper_path: /home/ollama/bin/notify-awoooi-ops.sh
     momo_backup_dir: /home/ollama/momo_backups
     momo_backup_cron_name: AWOOOI momo PostgreSQL daily backup
     momo_backup_cron_job: >-
       PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
-      /home/ollama/momo-pro/scripts/pg_backup.sh
+      /home/ollama/bin/momo-pg-backup.sh
       >> /home/ollama/momo_backups/backup.log 2>&1
     momo_legacy_bin_cron_line: "0 2 * * * /home/ollama/bin/momo-pg-backup.sh >> /home/ollama/momo_backups/backup.log 2>&1"
     momo_legacy_direct_cron_line: "0 2 * * * /home/ollama/momo-pro/scripts/pg_backup.sh >> /home/ollama/momo_backups/backup.log 2>&1"


@@ -5,13 +5,20 @@ from __future__ import annotations
 import argparse
 import json
+import re
 import sys
 import urllib.parse
 import urllib.request
 from pathlib import Path
 from typing import Any

-import yaml
+try:
+    import yaml
+except ModuleNotFoundError:  # pragma: no cover - exercised on lean operator hosts
+    yaml = None
+    YAML_ERROR_TYPES: tuple[type[BaseException], ...] = ()
+else:
+    YAML_ERROR_TYPES = (yaml.YAMLError,)


 DEFAULT_RULES = Path("ops/monitoring/alerts-unified.yml")
@@ -24,7 +31,99 @@ class ContractError(RuntimeError):
     pass


+RECOVERABLE_ERRORS = (ContractError, OSError, json.JSONDecodeError) + YAML_ERROR_TYPES
+
+_RECORD_RE = re.compile(r"^(?P<indent>\s*)-\s+record:\s*(?P<record>.+?)\s*$")
+_RULE_START_RE = re.compile(r"^(?P<indent>\s*)-\s+(?:record|alert):\s*.+$")
+_EXPR_RE = re.compile(r"^(?P<indent>\s*)expr:\s*(?P<tail>.*)$")
+_PROM_RULES_RE = re.compile(r"^(?P<indent>\s*)prometheus_recording_rules:\s*$")
+_LIST_ITEM_RE = re.compile(r"^(?P<indent>\s*)-\s+(?P<value>.+?)\s*$")
+
+
+def _strip_yaml_scalar(value: str) -> str:
+    return value.strip().strip('"').strip("'")
+
+
+def _indent_width(line: str) -> int:
+    return len(line) - len(line.lstrip(" "))
+
+
+def _fallback_rules(path: Path) -> list[dict[str, Any]]:
+    lines = path.read_text(encoding="utf-8").splitlines()
+    rules: list[dict[str, Any]] = []
+    index = 0
+    while index < len(lines):
+        record_match = _RECORD_RE.match(lines[index])
+        if not record_match:
+            index += 1
+            continue
+
+        record_indent = len(record_match.group("indent"))
+        rule: dict[str, Any] = {"record": _strip_yaml_scalar(record_match.group("record"))}
+        index += 1
+        while index < len(lines):
+            next_rule = _RULE_START_RE.match(lines[index])
+            if next_rule and len(next_rule.group("indent")) <= record_indent:
+                break
+
+            expr_match = _EXPR_RE.match(lines[index])
+            if not expr_match:
+                index += 1
+                continue
+
+            expr_indent = len(expr_match.group("indent"))
+            tail = expr_match.group("tail").strip()
+            if tail not in {"|", "|-", "|+"}:
+                rule["expr"] = _strip_yaml_scalar(tail)
+                index += 1
+                continue
+
+            block: list[str] = []
+            index += 1
+            while index < len(lines):
+                block_next_rule = _RULE_START_RE.match(lines[index])
+                if block_next_rule and len(block_next_rule.group("indent")) <= record_indent:
+                    break
+                if lines[index].strip() and _indent_width(lines[index]) <= expr_indent:
+                    break
+                block.append(lines[index])
+                index += 1
+            rule["expr"] = "\n".join(block)
+
+        rules.append(rule)
+
+    if not rules:
+        raise ContractError(f"missing recording rules in {path}")
+    return rules
+
+
+def _fallback_expected_recording_rules(path: Path) -> list[str]:
+    lines = path.read_text(encoding="utf-8").splitlines()
+    for index, line in enumerate(lines):
+        key_match = _PROM_RULES_RE.match(line)
+        if not key_match:
+            continue
+
+        key_indent = len(key_match.group("indent"))
+        rules: list[str] = []
+        for child in lines[index + 1 :]:
+            if not child.strip():
+                continue
+            child_indent = _indent_width(child)
+            if child_indent <= key_indent:
+                break
+            item_match = _LIST_ITEM_RE.match(child)
+            if item_match and len(item_match.group("indent")) > key_indent:
+                rules.append(_strip_yaml_scalar(item_match.group("value")))
+        if rules:
+            return rules
+
+    raise ContractError(f"missing monitoring_contract.prometheus_recording_rules in {path}")
+
+
 def _rules(path: Path) -> list[dict[str, Any]]:
+    if yaml is None:
+        return _fallback_rules(path)
     data = yaml.safe_load(path.read_text(encoding="utf-8")) or {}
     rules: list[dict[str, Any]] = []
     for group in data.get("groups") or []:
@@ -33,6 +132,8 @@ def _rules(path: Path) -> list[dict[str, Any]]:
 def _expected_recording_rules(path: Path) -> list[str]:
+    if yaml is None:
+        return _fallback_expected_recording_rules(path)
     data = yaml.safe_load(path.read_text(encoding="utf-8")) or {}
     rules = data.get("monitoring_contract", {}).get("prometheus_recording_rules") or []
     if not rules:
@@ -136,7 +237,7 @@ def main() -> int:
             args.expect_dr_ready,
         ):
             print(line)
-    except (ContractError, OSError, yaml.YAMLError, json.JSONDecodeError) as exc:
+    except RECOVERABLE_ERRORS as exc:
         print(f"RECOVERY_SCORECARD_CONTRACT_FAILED {exc}", file=sys.stderr)
         return 1