ops(reboot): close 188 hygiene and dynamic post-reboot gates

ogt
2026-06-26 12:39:55 +08:00
parent d8a68c742c
commit 71261c122e
12 changed files with 227 additions and 67 deletions


@@ -45489,3 +45489,40 @@ production browser smoke:
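- The 188 host hygiene maintenance window has still not been executed.
- Wazuh manager registry accepted remains `0`.
- Do not declare that the owner request has been sent, that an owner response has been received / accepted, that runtime writes are approved, `DR_COMPLETE`, 188 host fully green, or Wazuh registry recovered.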
## 2026-06-26 — 12:13 188 host hygiene live repair closeout / SOP v1.73
**Time and source**
- 2026-06-26 11:40-12:13 Asia/Taipei.
- Source: 188 live SSH maintenance session, `scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh --no-color`, `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color`, public ACME HTTP-01 self-test, and systemd / Docker / PostgreSQL runtime readback.
**Completed work**
- 188 startup source of truth fixed: `/usr/local/bin/awoooi-startup.sh` now uses `postgres_runtime_ready()`, which accepts either an active `postgresql@14-main` or the `k3s-postgres-recovery` host-network PostgreSQL runtime plus `pg_isready`; it no longer runs `pg_resetwal` automatically.
- The hard `postgresql@14-main.service` dependency was removed from `awoooi-startup.service` on 188, so that a production DB runtime already provided by the recovery container is not misjudged as a failed host cluster.
- 188 Nginx ACME HTTP-01 routes fixed: `sentry.wooo.work`, `gitea.wooo.work`, `langfuse.wooo.work`, and `signoz.wooo.work` all return the same non-secret self-test token over public HTTP.
- Duplicate certbot runners on 188 consolidated: apt `certbot.timer` is disabled / inactive; snap `snap.certbot.renew.timer` is enabled / active / waiting; `certbot.service` and `snap.certbot.renew.service` both show `Result=success / inactive`.
- Failed units on 188 cleared to zero; `systemctl is-system-running` reports `running`.
- Repo baseline updated in sync: `scripts/reboot-recovery/awoooi-startup.sh`, `scripts/reboot-recovery/awoooi-startup.service`, `scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh`, the 188 Nginx Ansible templates, and the SOP / quick-check / workplan / runbook.
- `post-reboot-owner-packet-contract-guard.py` now validates dynamically against the live `source.next_required_gates`; it no longer permanently pins the already repaired 188 gate as one of three mandatory gates.
- The StockPlatform dedicated freshness gate in `post-start-quick-check.sh` now retries with `STOCK_FRESHNESS_RETRY_ATTEMPTS=6` / `STOCK_FRESHNESS_RETRY_DELAY_SECONDS=5`, so that a brief upstream `502` after a 110 Docker / CI rollout does not cause recovered service / data freshness to be misjudged as blocked; consecutive failures remain a hard blocker.
**Live verification results**
- `scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh --no-color`: `PASS=16 WARN=3 BLOCKED=0`, `SERVICE_GREEN=1`, `HOST_HYGIENE_BLOCKED=0`, `Result: HOST_188_HYGIENE_GREEN.`
- `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color`: `POST_START_RC=0`, `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `POST_START_PASS=38`, `POST_START_WARN=4`, `POST_START_BLOCKED=0`, `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_HYGIENE_BLOCKED=0`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `RUNTIME_ACTION_AUTHORIZED=0`.
- `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export`; 188 host hygiene has been removed from the next gates.
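The summary readback is a flat set of `KEY=VALUE` pairs, so it can be checked mechanically. A minimal Python sketch, assuming the summary emits one pair per line; the helper names `parse_summary` and `next_gates` are hypothetical and are not part of the repo scripts:

```python
def parse_summary(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines; lines without '=' are ignored."""
    pairs: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        pairs[key.strip()] = value.strip()
    return pairs


def next_gates(summary: dict[str, str]) -> list[str]:
    """Split NEXT_REQUIRED_GATES on commas; an empty value means no gates."""
    raw = summary.get("NEXT_REQUIRED_GATES", "")
    return [gate for gate in (part.strip() for part in raw.split(",")) if gate]


# Sample values copied from the 12:13 readback above; the one-pair-per-line
# layout is an assumption about the script's output format.
SAMPLE = """\
SERVICE_GREEN=1
HOST_188_HYGIENE_BLOCKED=0
NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export
"""

summary = parse_summary(SAMPLE)
print(next_gates(summary))
```

Keeping the values as strings avoids guessing at types; a caller can compare against `"0"` / `"1"` exactly as the summary prints them.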
**Command types executed**
- Write: 188 startup script / service installation, 188 Nginx ACME route config installation, `systemctl daemon-reload`, `systemctl reset-failed`, `nginx -t`, `systemctl reload nginx`, disabling the duplicate apt `certbot.timer`, and creating / deleting the non-secret ACME self-test token file.
- Read-only: systemd / Docker / PostgreSQL runtime / certbot timer / public route / backup / post-reboot summary / Wazuh repo gates.
- Not done: no `pg_resetwal`, no DB restore, no Docker restart, no K3s / ArgoCD / firewall / Wazuh runtime action, no reading or storing of secret values / tokens / passwords, and no forced certbot renew.
**Current verdict**
- 188 host hygiene repair: `0% -> 100%`.
- Reboot service / product data / backup / 188 host hygiene: `GREEN`.
- Overall recovery declaration: `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`.
- SOP / quick-check / owner packet guard: v1.73 dynamic gate baseline.
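**Still blocked / must not be declared**
- DR credential escrow evidence is still missing `5` items: do not declare `DR_COMPLETE`.
- Wazuh manager registry accepted is still `0`: do not declare that Wazuh management of all hosts has been restored.
- The certbot formal renewal readback is not yet complete: this round completed the HTTP-01 route / timer hygiene / failed-unit cleanup, and a successful formal renew has to wait for the snap certbot timer or a separate ACME window.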


@@ -1,6 +1,6 @@
# AWOOOI Full-Stack Cold Start and Host Reboot SOP
> Version: v1.72
> Version: v1.73
> Last updated: 2026-06-26 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
@@ -10,19 +10,21 @@
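This section is the first decision anchor after every handover, power-on, shutdown, or reboot. If the date is not today, rerun the live check first, then update this section and `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md`.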
If you only need a quick post-reboot judgment on whether recovery can be declared, first run the machine-readable summary: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color`. This script calls the one-page overall check, the 188 host hygiene checklist, and the Wazuh no-false-green repo gates, and leaves the delegated logs in `/tmp/awoooi-post-reboot-readiness-*`. Next, run `scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color` to turn the summary into allowed / forbidden declarations, so that service green is not misreported as DR complete, 188 host fully green, Wazuh registry recovered, or runtime authorized. If the summary shows `SERVICE_GREEN=1` but `NEXT_REQUIRED_GATES` is still non-empty, run `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` to turn the three blockers (DR escrow, 188 hygiene, Wazuh registry) into an owner / evidence / forbidden-action dispatch checklist. When a machine-readable intake is needed, run `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color --output /tmp/awoooi-post-reboot-owner-packets.json` to produce the `awoooi_post_reboot_next_gate_owner_packets_v1` JSON, then immediately run `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json`. Dispatch / packet / guard all pin `DISPATCH_AUTHORIZED=0`, `REQUEST_SENT_COUNT=0`, `OWNER_RESPONSE_ACCEPTED=0`, `HOST_WRITE_AUTHORIZED=0`, `SECRET_VALUE_COLLECTION_ALLOWED=0`, `RUNTIME_GATE=0`. Until the guard passes, do not send owner requests, do not write escrow markers, do not enter a maintenance window, and do not declare DR / 188 host hygiene / Wazuh registry complete. When manual expansion is needed, run `scripts/reboot-recovery/post-start-quick-check.sh --no-color` and use `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` as the fallback. The long SOP keeps the full background, exception handling, and Plan B; the short wrapper / checklist handles the fixed judgment within T+10 minutes every time.
If you only need a quick post-reboot judgment on whether recovery can be declared, first run the machine-readable summary: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color`. This script calls the one-page overall check, the 188 host hygiene checklist, and the Wazuh no-false-green repo gates, and leaves the delegated logs in `/tmp/awoooi-post-reboot-readiness-*`. Next, run `scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color` to turn the summary into allowed / forbidden declarations, so that service green is not misreported as DR complete, 188 host hygiene, Wazuh registry recovered, or runtime authorized. If the summary shows `SERVICE_GREEN=1` but `NEXT_REQUIRED_GATES` is still non-empty, run `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` to turn the blockers still open in the live summary into an owner / evidence / forbidden-action dispatch checklist. When a machine-readable intake is needed, run `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color --output /tmp/awoooi-post-reboot-owner-packets.json` to produce the `awoooi_post_reboot_next_gate_owner_packets_v1` JSON, then immediately run `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json`. Dispatch / packet / guard all pin `DISPATCH_AUTHORIZED=0`, `REQUEST_SENT_COUNT=0`, `OWNER_RESPONSE_ACCEPTED=0`, `HOST_WRITE_AUTHORIZED=0`, `SECRET_VALUE_COLLECTION_ALLOWED=0`, `RUNTIME_GATE=0`. Until the guard passes, do not send owner requests, do not write escrow markers, do not enter a maintenance window, and do not declare DR / Wazuh registry complete. When manual expansion is needed, run `scripts/reboot-recovery/post-start-quick-check.sh --no-color` and use `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` as the fallback. The long SOP keeps the full background, exception handling, and Plan B; the short wrapper / checklist handles the fixed judgment within T+10 minutes every time.
2026-06-26 07:47 machine-readable readiness summary: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` has been verified as usable; artifact dir `/tmp/awoooi-post-reboot-readiness-20260626-074702`. The summary outputs `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `POST_START_PASS=38`, `POST_START_WARN=3`, `POST_START_BLOCKED=0`, `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_SERVICE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=1`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_RUNTIME_GATE=0`, `RUNTIME_ACTION_AUTHORIZED=0`. Currently `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED` and `NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export`. This is the first-layer operator / AI agent judgment format after every reboot.
2026-06-26 12:13 latest live summary supersedes the 08:59 gate set: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` returns `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `POST_START_PASS=38`, `POST_START_WARN=4`, `POST_START_BLOCKED=0`, `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_SERVICE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=0`, `HOST_188_RESULT=HOST_188_HYGIENE_GREEN.`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`, `WAZUH_DASHBOARD_INDEX_OK=3`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export`. 188 host hygiene has been removed from the blockers; the only items that still cannot be declared complete are DR credential escrow and the Wazuh manager registry. The ACME HTTP-01 route and certbot timer hygiene have been repaired, but do not declare that the certificate has been formally renewed; that requires a snap certbot timer / ACME window readback.
2026-06-26 08:12 next-gate dispatch baseline: `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` has been verified against the latest summary live output. The script reads back `SERVICE_GREEN=1`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export` and outputs three P0 checklists. First, credential escrow non-secret evidence: requires the evidence id / owner / reviewer for the five escrow items and forbids passwords, tokens, hashes, and prefixes / suffixes. Second, the 188 host PostgreSQL / certbot hygiene maintenance window: requires DB / DNS-TLS / rollback / postcheck owner decisions and forbids unapproved actions such as `pg_resetwal`, certbot renew, Nginx reload, DB restore, and Docker restart. Third, the Wazuh manager registry redacted export: requires the redacted registry count, host alias status, dashboard API/version status, time window, and reviewer, and forbids agent real names, internal IPs, client.keys, raw payloads, active response, agent re-enroll, Wazuh restart, secret patch, host write, and Kali active scan. The output pins `NEXT_GATE_COUNT=3`, `NEXT_STEP=dispatch_owner_packets_manually_after_review`, `RUNTIME_ACTION_AUTHORIZED=0`; this is a dispatch checklist, not a sent request or a runtime approval.
2026-06-26 07:47 machine-readable readiness summary retained as historical pre-repair evidence: at that time `HOST_188_HYGIENE_BLOCKED=1` and `NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export`. This paragraph is only for comparing the state before and after the 188 repair; the current gate set must use the 12:13 baseline.
2026-06-26 08:12 next-gate dispatch baseline retained as historical pre-repair evidence: at that time the output was pinned to three P0 checklists. From 12:13 onward, dispatch outputs dynamically from the live summary; the currently expected `NEXT_GATE_COUNT=2` leaves only credential escrow and Wazuh registry.
2026-06-26 08:29 owner-packet JSON baseline: `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color` converts the dispatch output into `schema_version=awoooi_post_reboot_next_gate_owner_packets_v1`, containing three `owner_packets`, `next_gate_count=3`, `p0_gate_count=3`, `request_sent_count=0`, `owner_response_received_count=0`, `owner_response_accepted_count=0`, `runtime_action_authorized_count=0`. This JSON is an AI / operator / owner review intake; it is neither an external request nor a maintenance-window approval.
2026-06-26 08:40 owner-packet contract guard baseline: `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json` pins `schema_version=awoooi_post_reboot_next_gate_owner_packets_v1`, the three P0 gate ids, `next_gate_count=3`, `p0_gate_count=3`, `request_sent_count=0`, `owner_response_received_count=0`, `owner_response_accepted_count=0`, `runtime_action_authorized_count=0`, `dispatch_authorized=0`, `host_write_authorized=0`, `secret_value_collection_allowed=0`, `runtime_gate_count=0`. The guard also verifies that escrow forbids password / token / secret value / hash / prefix / suffix / raw credential; that 188 forbids `pg_resetwal` / certbot renew / Nginx reload / DB restore / Docker restart / host file write; and that Wazuh forbids raw payload / internal IP / active response / re-enroll / restart / secret patch / host write / Kali active scan. It also requires the four no-false-green rules to be present. The output must be `POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=3 request_sent=0 accepted=0 runtime_gate=0`.
2026-06-26 08:40 owner-packet contract guard baseline retained as historical pre-repair evidence: the old version pinned three P0 gates. From 12:13 onward, the contract guard validates dynamically against `source.next_required_gates`, and the current expected success line is `POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=2 request_sent=0 accepted=0 runtime_gate=0`; it returns to `gates=3` only if 188 hygiene regresses in the future.
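The dynamic validation can be illustrated with a small Python sketch. It is not the repo's `post-reboot-owner-packet-contract-guard.py`: the per-packet field name `gate_id`, the top-level placement of the `0 / false` fields, and the helper names `check_packet` / `sample_packet` are assumptions made for illustration; only the schema id, `source.next_required_gates`, the counter names, and the success line come from the text above.

```python
SCHEMA = "awoooi_post_reboot_next_gate_owner_packets_v1"

# Boundary fields the SOP says must stay 0. Their exact nesting in the real
# packet is not shown in the source; top-level placement is an assumption.
ZERO_FIELDS = (
    "request_sent_count",
    "owner_response_received_count",
    "owner_response_accepted_count",
    "runtime_action_authorized_count",
    "dispatch_authorized",
    "host_write_authorized",
    "secret_value_collection_allowed",
    "runtime_gate_count",
)


def check_packet(packet: dict) -> str:
    """Return the success line, or raise ValueError on any contract drift."""
    if packet.get("schema_version") != SCHEMA:
        raise ValueError("schema_version drift")
    expected = list(packet["source"]["next_required_gates"])
    # "gate_id" is a hypothetical field name for each owner packet.
    actual = [p["gate_id"] for p in packet["owner_packets"]]
    if sorted(actual) != sorted(expected):
        raise ValueError(f"gate ids {actual} do not match live gates {expected}")
    if packet.get("next_gate_count") != len(expected):
        raise ValueError("next_gate_count drift")
    for field in ZERO_FIELDS:
        # A missing field is treated as drift (fail closed).
        if packet.get(field) not in (0, False):
            raise ValueError(f"{field} must stay 0 / false")
    return (
        f"POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates={len(expected)} "
        "request_sent=0 accepted=0 runtime_gate=0"
    )


def sample_packet(gates: list[str]) -> dict:
    """Build a minimal illustrative packet for the given live gate list."""
    packet = {
        "schema_version": SCHEMA,
        "source": {"next_required_gates": list(gates)},
        "owner_packets": [{"gate_id": g} for g in gates],
        "next_gate_count": len(gates),
    }
    packet.update({field: 0 for field in ZERO_FIELDS})
    return packet


print(check_packet(sample_packet(
    ["credential_escrow_evidence", "wazuh_manager_registry_export"]
)))
```

Because the expected gate list is read from the packet's own `source` block rather than hard-coded, the same check yields `gates=3` again if `host_188_hygiene_maintenance_window` reappears in the live summary.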
2026-06-26 08:47 Wazuh registry detail baseline: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` now includes the Wazuh repo-side coverage / runtime gate details as fixed key/value pairs: `WAZUH_COVERAGE_SCOPE=6`, `WAZUH_DIRECT_ACTIVE=2`, `WAZUH_NO_TRANSPORT=1`, `WAZUH_SSH_BLOCKED=3`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`, `WAZUH_DASHBOARD_INDEX_OK=3`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_RUNTIME_GATE=0`. The `wazuh_manager_registry_export` gate of `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` places these states into `CURRENT_EVIDENCE`. Iron rule for interpretation: route `200`, transport `6`, and Dashboard index pattern `3` are none of them manager registry accepted; management of all hosts and the Dashboard API repair still require owner evidence / registry export / acceptance record.
2026-06-26 08:59 declaration guard baseline: `scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color` converts the summary into `awoooi_post_reboot_declaration_guard_v1`; the current status is `allowed_with_boundary_blockers`. Allowed declarations: `SERVICE_RECOVERY_GREEN`, `PRODUCT_DATA_GREEN`, `BACKUP_CORE_GREEN`, `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. Forbidden declarations: `DR_COMPLETE`, `HOST_188_FULLY_GREEN`, `WAZUH_REGISTRY_RECOVERED`, `RUNTIME_ACTION_AUTHORIZED`. When tested with `--proposed DR_COMPLETE` or `--proposed WAZUH_REGISTRY_RECOVERED`, the guard must reject with `blocked_false_green_proposal`.
2026-06-26 08:59 declaration guard baseline retained as historical pre-repair evidence: at that time `HOST_188_FULLY_GREEN` was still forbidden. From 12:13 onward, the guard dynamically allows 188 host hygiene green based on `HOST_188_HYGIENE_BLOCKED=0`, but still rejects `DR_COMPLETE`, `WAZUH_REGISTRY_RECOVERED`, and `RUNTIME_ACTION_AUTHORIZED`.
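The allowed / forbidden logic described for the declaration guard can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual `post-reboot-declaration-guard.py`: the mapping from summary keys to forbidden declarations is inferred from the baselines above, missing keys are treated as blocked, and the final `"allowed"` status name is hypothetical (the source only names `allowed_with_boundary_blockers` and `blocked_false_green_proposal`).

```python
def forbidden_declarations(summary: dict[str, str]) -> set[str]:
    """Derive forbidden declarations from summary key/value strings.

    A missing key is treated as blocked (fail closed).
    """
    forbidden: set[str] = set()
    if (summary.get("DR_ESCROW_BLOCKED") != "0"
            or summary.get("ESCROW_MISSING_COUNT") != "0"):
        forbidden.add("DR_COMPLETE")
    if summary.get("HOST_188_HYGIENE_BLOCKED") != "0":
        forbidden.add("HOST_188_FULLY_GREEN")
    if summary.get("WAZUH_MANAGER_REGISTRY_ACCEPTED") != "1":
        forbidden.add("WAZUH_REGISTRY_RECOVERED")
    if summary.get("RUNTIME_ACTION_AUTHORIZED") != "1":
        forbidden.add("RUNTIME_ACTION_AUTHORIZED")
    return forbidden


def guard_status(summary: dict[str, str], proposed: tuple[str, ...] = ()) -> str:
    """Reject any proposed declaration that is currently forbidden."""
    forbidden = forbidden_declarations(summary)
    if any(p in forbidden for p in proposed):
        return "blocked_false_green_proposal"
    if forbidden:
        return "allowed_with_boundary_blockers"
    return "allowed"  # hypothetical status name, not taken from the source


# Values from the 12:13 live summary.
LIVE_1213 = {
    "DR_ESCROW_BLOCKED": "1",
    "ESCROW_MISSING_COUNT": "5",
    "HOST_188_HYGIENE_BLOCKED": "0",
    "WAZUH_MANAGER_REGISTRY_ACCEPTED": "0",
    "RUNTIME_ACTION_AUTHORIZED": "0",
}

print(sorted(forbidden_declarations(LIVE_1213)))
print(guard_status(LIVE_1213, proposed=("DR_COMPLETE",)))
```

With the pre-repair value `HOST_188_HYGIENE_BLOCKED=1`, the same function adds `HOST_188_FULLY_GREEN` back to the forbidden set, which mirrors the historical 08:59 behaviour.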
2026-06-26 07:39 live quick-check refresh: `scripts/reboot-recovery/post-start-quick-check.sh --no-color` ran to completion; ping / SSH to all four hosts OK; delegated cold-start was `PASS=89 WARN=0 BLOCKED=0`; the wrapper summarized `POST_START_QUICK_CHECK PASS=38 WARN=3 BLOCKED=0`, warning split `SERVICE=0 BOUNDARY=1 EVIDENCE=2`, `RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. MOMO health `V10.701`; daily snapshot `109061` rows / `2025-07-01..2026-06-24`; current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`; latest import job `57 completed`. StockPlatform freshness `status=ok`, latest trading date `2026-06-25`; price / chips / margin / AI recommendations are all at `2026-06-25`. Backup-status at 07:39 shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, offsite/rclone fresh, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`. The extended public routes list all returned the expected 2xx/3xx. 110 CPU attribution shows load of about `5.19 / 4.66 / 4.91` with CPU idle at `80%+` in most samples; the current load comes from normal platform work such as Gitea / ClickHouse / Docker / Kafka / StockPlatform / AWOOOI API / Sentry, not orphan Chrome. Allowed declaration for this round: hosts, K3s, services, websites, product data freshness, backup core, and offsite freshness are green. Forbidden declaration: DR complete, credential escrow complete, 188 host fully green, Wazuh registry recovered.


@@ -1,34 +1,33 @@
# 188 Host Hygiene Maintenance Window Runbook
> Version: 2026-06-26.v1
> Version: 2026-06-26.v2
> Scope: `192.168.0.188` host PostgreSQL `14/main`, `awoooi-startup.service`, certbot / ACME renewal hygiene.
> Status: `SERVICE_GREEN_HOST_HYGIENE_BLOCKED`
> Status: `HOST_188_HYGIENE_GREEN`
---
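## 1. Purpose
The product services and public routes on 188 are currently available, but host systemd is still `degraded`. This runbook separates "service recovery" from "host hygiene not yet converged", so that route `200`, container healthy, `pg_isready`, or exporter green is not misjudged as the host being fully healthy.
The product services, public routes, and host hygiene on 188 have all converged to green. This runbook separates "service recovery", "host hygiene", "DR escrow", and "Wazuh registry", so that route `200`, container healthy, `pg_isready`, exporter green, or `reset-failed` is not misjudged as the other layers also being complete.
This document is neither a maintenance-window approval nor a runtime action authorization. Before an owner, a rollback owner, a maintenance window, backup / restore evidence, and a post-check plan have been obtained, do not execute restart, reset-failed, renew, reload, restore, or `pg_resetwal`.
This document records the 188 host hygiene repair completed on 2026-06-26 and the boundaries for subsequent operations. It is neither proof that DR escrow is complete nor an authorization for Wazuh / SOC runtime actions. Before an owner, a rollback owner, a maintenance window, backup / restore evidence, and a post-check plan have been obtained, do not execute data-layer break-glass, restore, unapproved renew, or other host writes.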
---
## 2. Latest read-only evidence
2026-06-26 07:18-07:19 Asia/Taipei read-only evidence:
2026-06-26 12:02-12:13 Asia/Taipei live evidence:
| Layer | Evidence | Verdict |
|------|------|------|
| 188 disk | `/` `982G`, used `712G`, available `230G`, `76%` | Not capacity exhaustion |
| Product containers | MOMO, VibeWork, 2026FIFA, ClawBot, MinIO, exporters, MOMO DB, and other containers visibly healthy / up | Product services green |
| Host systemd | failed units: `awoooi-startup.service`, `postgresql@14-main.service`, `certbot.service`, `snap.certbot.renew.service` | host hygiene blocked |
| Host PostgreSQL | `pg_lsclusters` shows `14/main down` | Host cluster unhealthy |
| PostgreSQL status | `invalid primary checkpoint record`, `PANIC: could not locate a valid checkpoint record` | Checkpoint / WAL class data-layer error |
| Startup unit | Previously attempted to run `pg_resetwal` as root, and failed | Automatic repair logic must fail closed |
| certbot apt unit | `sentry.wooo.work` renewal rate-limited | Do not hammer ACME repeatedly |
| certbot snap unit | `sentry.wooo.work` challenge failed | ACME route / DNS / gateway owner must be confirmed first |
| public cert | shared `sentry.wooo.work` certificate valid until `2026-07-09 16:03:40 UTC` | There is time for a maintenance window; not expiring immediately |
| Host systemd | `systemctl is-system-running` reports `running`; failed units: `0 loaded units listed` | host hygiene green |
| Product containers | MOMO, VibeWork, 2026FIFA, ClawBot, MinIO, exporters, MOMO DB, and other containers healthy / up | Product services green |
| PostgreSQL runtime | `k3s-postgres-recovery` container is `Running=true`, `NetworkMode=host`, restart policy `unless-stopped`; `pg_isready -h localhost -p 5432` accepting connections | production DB runtime green |
| Host PostgreSQL unit | `postgresql@14-main.service` shows `Result=success / inactive` after reset; `pg_lsclusters` may still show the host cluster as down because of the recovery container PID model | No longer treated as a service blocker; judged by runtime-ready |
| Startup unit | `/usr/local/bin/awoooi-startup.sh` now uses `postgres_runtime_ready()`, accepting an active host unit or the `k3s-postgres-recovery` host-network runtime; it no longer runs `pg_resetwal` automatically | Fail-closed repair complete |
| Nginx / ACME route | HTTP-01 self-test token for `sentry.wooo.work`, `gitea.wooo.work`, `langfuse.wooo.work`, and `signoz.wooo.work` is served correctly from `/var/www/html` | challenge route green |
| certbot timers | duplicate apt `certbot.timer` disabled; snap `snap.certbot.renew.timer` enabled / active / waiting; `certbot.service` and `snap.certbot.renew.service` show `Result=success / inactive` | renewal runner hygiene green |
| public cert | shared `sentry.wooo.work` certificate valid until `2026-07-09 16:03:40 UTC` | Not yet declared renewed; waiting for snap timer / ACME window |
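The runtime-ready decision in the Startup unit row can be illustrated as a pure function. The real `postgres_runtime_ready()` lives in the shell script `/usr/local/bin/awoooi-startup.sh` and is not shown in the source; the Python sketch below only mirrors the described decision. The `PgProbe` type, the `startup_action` helper, and the choice to require `pg_isready` on both paths are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PgProbe:
    """Probe results gathered elsewhere (hypothetical container type).

    One plausible way to collect them: `systemctl is-active postgresql@14-main`,
    `docker inspect` on k3s-postgres-recovery (State.Running,
    HostConfig.NetworkMode), and the exit code of
    `pg_isready -h localhost -p 5432`.
    """
    host_unit_active: bool
    recovery_running: bool
    recovery_network_mode: str
    pg_isready_ok: bool


def postgres_runtime_ready(probe: PgProbe) -> bool:
    """Accept either the host unit or the host-network recovery container.

    Assumption: pg_isready must succeed in both cases.
    """
    recovery_ok = probe.recovery_running and probe.recovery_network_mode == "host"
    runtime_present = probe.host_unit_active or recovery_ok
    return runtime_present and probe.pg_isready_ok


def startup_action(probe: PgProbe) -> str:
    """Fail closed: never fall back to an automatic pg_resetwal."""
    return "continue" if postgres_runtime_ready(probe) else "fail_closed"


# State described in the 12:02-12:13 evidence: host unit inactive,
# recovery container running on the host network, pg_isready accepting.
live = PgProbe(
    host_unit_active=False,
    recovery_running=True,
    recovery_network_mode="host",
    pg_isready_ok=True,
)
print(startup_action(live))
```

Separating probing from the decision keeps the logic testable without touching systemd or Docker, and makes the only two outcomes explicit: continue, or stop without attempting a repair.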
---
@@ -112,9 +111,9 @@ SSH_BATCH_MODE=yes bash scripts/reboot-recovery/188-host-hygiene-maintenance-che
```text
SERVICE_GREEN=1
HOST_HYGIENE_BLOCKED=1
HOST_HYGIENE_BLOCKED=0
RUNTIME_ACTION_AUTHORIZED=0
Result: SERVICE_GREEN_HOST_HYGIENE_BLOCKED. Open maintenance window; do not auto-repair.
Result: HOST_188_HYGIENE_GREEN.
```
### Phase A: Pre-maintenance read-only snapshot
@@ -169,17 +168,18 @@ curl -fsS https://awoooi.wooo.work/api/v1/health
| Lane | Completion | Notes |
|------|-------:|------|
| Service recovery | `100%` | routes, containers, cold-start, and backup core are green |
| 188 host hygiene evidence | `80%` | Read-only root cause is clear; owner disposition missing |
| PostgreSQL owner decision | `0%` | Decommission, restore, or break-glass not yet confirmed |
| certbot owner decision | `0%` | DNS / ACME / coverage evidence not yet accepted |
| runtime repair | `0%` | Not approved, not executed |
| 188 host hygiene repair | `100%` | startup fail-closed, PostgreSQL runtime-ready judgment, systemd failed units, ACME route, and certbot timer hygiene have converged |
| PostgreSQL runtime source-of-truth | `100%` | production DB runtime is judged by the `k3s-postgres-recovery` host-network container + `pg_isready`, not by the false red from host `pg_lsclusters` |
| certbot / ACME route hygiene | `95%` | HTTP-01 route and timer split fixed; a successful formal renew still waits for the snap timer / ACME window |
| DR escrow | `BLOCKED` | `escrow_missing=5`; 188 host green cannot substitute for it |
| Wazuh registry | `BLOCKED` | manager registry accepted `0`; route `200` or transport count cannot substitute for it |
---
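## 8. Must not be declared
- Do not declare 188 host fully green.
- Do not declare host PostgreSQL recovered.
- Do not declare certbot renewal recovered.
- Do not declare that the repair is complete just because of `reset-failed`.
- Do not declare that this document approves any runtime action.
- Do not declare DR complete; credential escrow evidence is still missing `5`.
- Do not declare Wazuh registry recovered; manager registry accepted is still `0`.
- Do not declare that the certbot certificate has completed formal renewal; what is complete so far is the HTTP-01 route / timer hygiene / failed units cleanup.
- Do not treat `reset-failed` alone as a completed repair; this round is complete because the startup source, Nginx ACME route, timer split, and postcheck converged together.
- Do not declare that this document approves any new runtime action.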


@@ -1,6 +1,6 @@
# One-Page Post-Reboot Host Check
> Version: v1.12
> Version: v1.13
> Last updated: 2026-06-26 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 post-reboot service recovery. 112 Kali / Wazuh / active scan are not part of this procedure.
@@ -10,7 +10,7 @@
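After any power-on, shutdown, reboot, power-loss recovery, VMware console fsck, or large-scale Docker / K3s rescheduling on any of the 110 / 120 / 121 / 188 hosts, run this page first, then decide whether to declare recovery.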
Latest baseline: 2026-06-26 08:59 post-reboot declaration guard. `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` returns `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_HYGIENE_BLOCKED=1`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_COVERAGE_SCOPE=6`, `WAZUH_DIRECT_ACTIVE=2`, `WAZUH_NO_TRANSPORT=1`, `WAZUH_SSH_BLOCKED=3`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`, `WAZUH_DASHBOARD_INDEX_OK=3`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. `scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color` converts the summary into allowed / forbidden declarations: it currently allows declaring services, product data, backup core, and `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, and forbids declaring `DR_COMPLETE`, `HOST_188_FULLY_GREEN`, `WAZUH_REGISTRY_RECOVERED`, and `RUNTIME_ACTION_AUTHORIZED`. Next, `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` expands `NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export` into three owner / evidence / forbidden-action checklists. The `CURRENT_EVIDENCE` of the Wazuh checklist retains the registry accepted, coverage scope, direct active, no transport, SSH blocked, route, transport, Dashboard API, and index pattern states, so that route `200` or transport `6` is not misreported as registry recovered. `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color` further converts this into the `awoooi_post_reboot_next_gate_owner_packets_v1` JSON, pinning `dispatch_authorized=0`, `request_sent_count=0`, `owner_response_accepted_count=0`, `host_write_authorized=0`, `secret_value_collection_allowed=0`, `runtime_gate_count=0`. `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json` pins the three P0 gates, all `0 / false` boundaries, the forbidden secret payload / runtime action lists, and the no-false-green rules. Cold-start `PASS=89 WARN=0 BLOCKED=0`; MOMO `V10.701`, latest import job `57 completed`, `DB_DAILY_FRESHNESS 1|2026-06-24`; StockPlatform `/api/v1/system/freshness`: `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; backup-status 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `last_backup_all=2026-06-26 02:31:02`. DR still cannot be declared complete because of `escrow_missing=5`. 188 host hygiene and the Wazuh manager registry remain independent blockers outside service green.
Latest baseline: 2026-06-26 12:13 post-reboot summary / declaration guard. `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` returns `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_HYGIENE_BLOCKED=0`, `HOST_188_RESULT=HOST_188_HYGIENE_GREEN.`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_COVERAGE_SCOPE=6`, `WAZUH_DIRECT_ACTIVE=2`, `WAZUH_NO_TRANSPORT=1`, `WAZUH_SSH_BLOCKED=3`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`, `WAZUH_DASHBOARD_INDEX_OK=3`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. `scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color` converts the summary into allowed / forbidden declarations: it currently allows declaring services, product data, backup core, 188 host hygiene green, and `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, and forbids declaring `DR_COMPLETE`, `WAZUH_REGISTRY_RECOVERED`, and `RUNTIME_ACTION_AUTHORIZED`. Next, `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` expands `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export` into owner / evidence / forbidden-action checklists. The `CURRENT_EVIDENCE` of the Wazuh checklist retains the registry accepted, coverage scope, direct active, no transport, SSH blocked, route, transport, Dashboard API, and index pattern states, so that route `200` or transport `6` is not misreported as registry recovered. `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color` further converts this into the `awoooi_post_reboot_next_gate_owner_packets_v1` JSON, pinning `dispatch_authorized=0`, `request_sent_count=0`, `owner_response_accepted_count=0`, `host_write_authorized=0`, `secret_value_collection_allowed=0`, `runtime_gate_count=0`. `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json` dynamically pins the P0 gates according to the live `next_required_gates`, along with all `0 / false` boundaries, the forbidden secret payload / runtime action lists, and the no-false-green rules. DR still cannot be declared complete because of `escrow_missing=5`; the Wazuh manager registry remains an independent blocker outside service green. ACME HTTP-01 route / certbot timer hygiene have been repaired, but a successful formal certificate renew requires a snap certbot timer or separate ACME window readback.
This page answers only four questions:
@@ -55,7 +55,7 @@ scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color
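- `PRODUCT_DATA_GREEN=1`: freshness of the main MOMO / StockPlatform data can be declared recovered.
- `BACKUP_CORE_GREEN=1`: backup core can be declared recovered.
- `DR_ESCROW_BLOCKED=1` / `ESCROW_MISSING_COUNT>0`: DR complete cannot be declared.
- `HOST_188_HYGIENE_BLOCKED=1`: 188 host hygiene needs a maintenance window; this does not mean the product services are down.
- `HOST_188_HYGIENE_BLOCKED=0`: 188 host hygiene has converged; if it returns to `1` in the future, follow the 188-specific runbook.
- `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`: Wazuh management of all hosts cannot be declared recovered.
- `WAZUH_ROUTE_CODE=200` / `WAZUH_TRANSPORT_COUNT>0` only represent route / transport evidence; they must still be read together with `WAZUH_COVERAGE_SCOPE`, `WAZUH_DIRECT_ACTIVE`, `WAZUH_NO_TRANSPORT`, `WAZUH_SSH_BLOCKED`, `WAZUH_DASHBOARD_API_CONNECTION`, and `WAZUH_MANAGER_REGISTRY_ACCEPTED`.
- `RUNTIME_ACTION_AUTHORIZED=0`: this procedure does not authorize any runtime write operation.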
@@ -73,7 +73,7 @@ scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color
scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color --proposed DR_COMPLETE --proposed WAZUH_REGISTRY_RECOVERED
```
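Under the current baseline, `DR_COMPLETE`, `HOST_188_FULLY_GREEN`, `WAZUH_REGISTRY_RECOVERED`, and `RUNTIME_ACTION_AUTHORIZED` must all be rejected; any person, AI agent, or document that proposes these declarations must first supply the corresponding evidence / owner acceptance.
Under the current baseline, `DR_COMPLETE`, `WAZUH_REGISTRY_RECOVERED`, and `RUNTIME_ACTION_AUTHORIZED` must all be rejected; any person, AI agent, or document that proposes these declarations must first supply the corresponding evidence / owner acceptance. `HOST_188_FULLY_GREEN` may be declared only while the summary stays at `HOST_188_HYGIENE_BLOCKED=0` and the checklist result is `HOST_188_HYGIENE_GREEN`.
When the summary shows `SERVICE_GREEN=1` but `NEXT_REQUIRED_GATES` is still non-empty, run the next-gate dispatch checklist: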
@@ -81,7 +81,7 @@ summary 顯示 `SERVICE_GREEN=1` 但仍有 `NEXT_REQUIRED_GATES` 時,再跑下
scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color
```
This script only converts `credential_escrow_evidence`, `host_188_hygiene_maintenance_window`, and `wazuh_manager_registry_export` into owner / required evidence / forbidden action / done criteria. It does not send requests, read secrets, write markers, or restart / reload / repair / import / delete / patch, nor does it authorize host / Wazuh / Nginx / K8s / DB runtime actions. If it outputs `SERVICE_GREEN=0`, return to service recovery first and do not enter boundary dispatch.
This script only converts the `NEXT_REQUIRED_GATES` in the current live summary into owner / required evidence / forbidden action / done criteria. From 2026-06-26 12:13 onward, usually only `credential_escrow_evidence` and `wazuh_manager_registry_export` remain; `host_188_hygiene_maintenance_window` reappears only if 188 goes red again. It does not send requests, read secrets, write markers, or restart / reload / repair / import / delete / patch, nor does it authorize host / Wazuh / Nginx / K8s / DB runtime actions. If it outputs `SERVICE_GREEN=0`, return to service recovery first and do not enter boundary dispatch.
To hand off to AI / ticket / owner review, generate the machine-readable owner packet:
@@ -98,7 +98,7 @@ scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color --outp
scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json
```
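The guard must output `POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=3 request_sent=0 accepted=0 runtime_gate=0`. If any of the gate count, P0 gate ids, `0 / false` fields, forbidden secret payloads, forbidden 188 repair actions, forbidden Wazuh active response / host write, or no-false-green rules drifts, treat it as `BLOCKED`: do not send owner requests, do not write escrow markers, do not enter a maintenance window, and do not declare DR / Wazuh / 188 host hygiene complete.
The guard must output `POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=<live_next_gate_count> request_sent=0 accepted=0 runtime_gate=0`. `gates=2` is currently expected; it becomes `gates=3` only if 188 hygiene returns to blocked. If any of the gate count, P0 gate ids, `0 / false` fields, forbidden secret payloads, forbidden Wazuh active response / host write, or no-false-green rules drifts, treat it as `BLOCKED`: do not send owner requests, do not write escrow markers, do not enter a maintenance window, and do not declare DR / Wazuh complete.
When details need to be expanded, use the repo-side wrapper: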
@@ -108,7 +108,7 @@ scripts/reboot-recovery/post-start-quick-check.sh --no-color
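This wrapper performs read-only checks only and delegates to the existing cold-start / MOMO preflight / backup-status; it does not restart, reload, import, change K8s, or read token contents. The wrapper splits warnings into three classes, `SERVICE`, `BOUNDARY`, and `EVIDENCE`, so that `escrow_missing>0` is not misjudged as service degradation. If the summary or wrapper fails because of an SSH permission or path, collect the evidence manually with the segmented commands below.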
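Since v1.6 the public route gate retries using `ROUTE_RETRY_ATTEMPTS` (default `3`) and `ROUTE_RETRY_DELAY_SECONDS` (default `2`). A single `000` / timeout that recovers after retry should be listed as an evidence warning or transient route evidence and must not be treated directly as the website still being broken; only consecutive failures are a service blocker.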
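Since v1.6 the public route gate retries using `ROUTE_RETRY_ATTEMPTS` (default `3`) and `ROUTE_RETRY_DELAY_SECONDS` (default `2`). Since v1.13 the StockPlatform dedicated freshness gate uses `STOCK_FRESHNESS_RETRY_ATTEMPTS` (default `6`) and `STOCK_FRESHNESS_RETRY_DELAY_SECONDS` (default `5`), because after a 110 Docker / CI rollout it often needs tens of seconds more warmup than route health. A single `000` / timeout / `502` that recovers after retry should be listed as an evidence warning or transient evidence and must not be treated directly as the website or data still being broken; only consecutive failures are a service blocker.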
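The retry rule can be sketched as a small Python function. This is an illustration only: the real gate is implemented in `post-start-quick-check.sh`, and the function name `probe_with_retry`, the injected `fetch` / `sleep` callables, and the result labels `PASS` / `WARN_TRANSIENT` / `BLOCKED` are hypothetical. The defaults mirror `STOCK_FRESHNESS_RETRY_ATTEMPTS=6` and `STOCK_FRESHNESS_RETRY_DELAY_SECONDS=5`.

```python
import time
from typing import Callable


def probe_with_retry(
    fetch: Callable[[], int],
    attempts: int = 6,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, list[int]]:
    """Call fetch() until it returns a 2xx code or attempts run out.

    fetch returns an HTTP status code (0 can stand in for a timeout / `000`).
    A first-try success is PASS, a later success is a transient warning,
    and only consecutive failures across all attempts are BLOCKED.
    """
    codes: list[int] = []
    for attempt in range(1, attempts + 1):
        code = fetch()
        codes.append(code)
        if 200 <= code < 300:
            return ("PASS" if attempt == 1 else "WARN_TRANSIENT", codes)
        if attempt < attempts:
            sleep(delay_seconds)
    return ("BLOCKED", codes)


# Simulated upstream: two 502s during rollout warmup, then a 200.
responses = iter([502, 502, 200])
result = probe_with_retry(lambda: next(responses), sleep=lambda _s: None)
print(result)
```

Injecting `sleep` keeps the example fast to run and test; in real use the default `time.sleep` gives up to 25 seconds of warmup tolerance with the stated defaults (five delays of 5 seconds between six attempts).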
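Since v1.6, when `escrow_missing>0`, the credential escrow gate makes a read-only call to `/backup/scripts/mark-credential-escrow-verified.sh --status` and lists the missing items. This is only an evidence readback: it does not write markers, read passwords, or lower the DR blocker. Its purpose is to let the operator know immediately which non-secret evidence markers are missing.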


@@ -11,23 +11,25 @@
| Area | Status | Completion | Evidence |
|------|--------|------------|----------|
| Overall recovery readiness | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-26 07:19 read-only follow-up confirms service recovery remains green after the SOP commit reached production. Cold-start rerun returned `PASS=87 WARN=0 BLOCKED=0`, result `GREEN`; ArgoCD `awoooi-prod Synced / Healthy` at revision `1fd5e2a8b0f18d24eed16aa2a44286bcbf230603`; API/Web/Worker pods are Running with restart `0`; public routes return expected statuses including AWOOOI API `200`, AWOOOI Web `307`, VibeWork `200`, AwoooGo `200`, MOMO health `200`, Stock freshness `200`, Bitan `200`, Gitea `200`, Harbor `200`, Registry `/v2/` expected `401`, Sentry expected `302`, SigNoz `200`, Langfuse `200`. Backup-status 07:18 remains 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, offsite/rclone fresh, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`. 188 remains product-service green but host-hygiene blocked: host PostgreSQL `14/main` has `invalid primary checkpoint record` / `could not locate a valid checkpoint record`, certbot renew for `sentry.wooo.work` is rate-limited / challenge-failed, and `awoooi-startup.service` previously attempted `pg_resetwal` as root. New runbook `docs/runbooks/HOST-188-HYGIENE-MAINTENANCE-RUNBOOK.md` plus read-only script `scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh` define the maintenance-window decision tree without authorizing runtime repair. Do not declare DR complete until `escrow_missing=0`; Wazuh manager registry accepted remains `0`; 111/168 Codex workspace HEAD drift and Mac Mini low free space remain workstation blockers, not reboot service blockers. |
| Overall recovery readiness | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-26 12:13 live summary confirms host / service / product data / backup / 188 host hygiene are green for the current evidence set. `post-reboot-readiness-summary.sh --no-color` returns `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=0`, `HOST_188_RESULT=HOST_188_HYGIENE_GREEN.`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, and `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export`. 188 startup now accepts the real `k3s-postgres-recovery` host-network PostgreSQL runtime, fails closed instead of running `pg_resetwal`, Nginx ACME HTTP-01 routes for `sentry/gitea/langfuse/signoz` are corrected, duplicate apt certbot timer is disabled, snap certbot timer remains enabled, and failed units are cleared. Do not declare DR complete until `escrow_missing=0`; do not declare Wazuh registry recovered until manager registry accepted is `1`; certbot formal renewal success still requires snap timer / ACME window readback. |
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-26 07:19 readback shows 120 and 121 reachable, K3s active, `mon` and `mon1` both `Ready control-plane`, AWOOOI API/Web replicas split across both nodes, ArgoCD `awoooi-prod Synced / Healthy` at revision `1fd5e2a8b0f18d24eed16aa2a44286bcbf230603`, and `km-vectorize` official 03:00 Taipei time run succeeded with `lastSuccess=2026-06-25T19:00:14Z`. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-26 06:58 backup readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`, last aggregate `2026-06-26 02:31:02`. DR remains blocked on real non-secret credential escrow evidence IDs; do not write placeholder markers or paste secret values. |
| P2 service / data truth | DONE | 100% | Service routes and core runtime are available, 110 current CPU pressure is attributable to active AWOOOI Web `turbo build` / Docker buildx, and previous orphan Chrome groups remain cleared. 2026-06-26 07:19 StockPlatform `/api/v1/system/freshness` returned `200`; 07:01 freshness payload was `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; price / chips / margin / AI recommendations are all on `2026-06-25`. `ai.recommendations` row count is `2868`; `core.margin_short_daily` row count is `1976`. MOMO health `V10.699`, current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, and `MOMO_DAILY_FRESHNESS 1|2026-06-24` are green; expanded public routes are green. |
| P3 docs / automation contracts | DONE_WITH_DECLARATION_GUARD_V172 | 100% | Workplan, SOP v1.72, post-reboot declaration guard, machine-readable post-reboot readiness summary with Wazuh registry detail fields, post-reboot next-gate dispatch checklist, owner-packet JSON generator, owner-packet contract guard, one-page post-start quick check v1.12, route retry gate, deploy warmup classification, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, StockPlatform natural schedule green evidence, 110 orphan Chrome recurrence cleanup evidence, 188 fail-closed startup data recovery gate, 188 host hygiene read-only checklist, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and stricter product-data / route retry gates are updated. Declaration guard now machine-checks allowed / forbidden recovery statements: service/data/backup green may be declared, while `DR_COMPLETE`, `HOST_188_FULLY_GREEN`, `WAZUH_REGISTRY_RECOVERED`, and `RUNTIME_ACTION_AUTHORIZED` remain forbidden until evidence gates close. Live 110 script sync remains a separate approved live-write gate; do not claim it here. |
| P3 docs / automation contracts | DONE_WITH_DECLARATION_GUARD_V173 | 100% | Workplan, SOP v1.73, post-reboot declaration guard, machine-readable post-reboot readiness summary with Wazuh registry detail fields, post-reboot next-gate dispatch checklist, owner-packet JSON generator, dynamic owner-packet contract guard, one-page post-start quick check v1.13, route retry gate, deploy warmup classification, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, StockPlatform natural schedule green evidence, 110 orphan Chrome recurrence cleanup evidence, 188 fail-closed startup data recovery gate, 188 host hygiene read-only checklist, 188 PostgreSQL runtime-ready source-of-truth, 188 ACME route/timer hygiene, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and stricter product-data / route retry gates are updated. Declaration guard now machine-checks allowed / forbidden recovery statements: service/data/backup/188 host hygiene green may be declared when live summary says so, while `DR_COMPLETE`, `WAZUH_REGISTRY_RECOVERED`, and `RUNTIME_ACTION_AUTHORIZED` remain forbidden until evidence gates close. Live 110 script sync remains a separate approved live-write gate; do not claim it here. |
2026-06-26 07:47 machine-readable summary baseline: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` stores delegated logs under `/tmp/awoooi-post-reboot-readiness-20260626-074702` and returns `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_SERVICE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=1`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_RUNTIME_GATE=0`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, and `NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export`. This is now the preferred first operator/AI-agent entrypoint after reboot because it separates service health from DR, host hygiene, and security registry evidence.
2026-06-26 12:13 machine-readable summary baseline supersedes the 07:47 / 08:59 gate set: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` stores delegated logs under `/tmp/awoooi-post-reboot-readiness-20260626-121303` and returns `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_SERVICE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=0`, `HOST_188_CHECK_RC=0`, `HOST_188_RESULT=HOST_188_HYGIENE_GREEN.`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_COVERAGE_SCOPE=6`, `WAZUH_DIRECT_ACTIVE=2`, `WAZUH_NO_TRANSPORT=1`, `WAZUH_SSH_BLOCKED=3`, `WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`, `WAZUH_DASHBOARD_INDEX_OK=3`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_RUNTIME_GATE=0`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, and `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export`. This is now the preferred first operator/AI-agent entrypoint after reboot because it separates service health from DR and security registry evidence; 188 host hygiene is no longer a next gate unless the live checklist regresses.
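A minimal sketch of how an operator or agent might consume this summary, assuming the script prints one `KEY=VALUE` field per line. The helper name `parse_summary`, the heading line in the sample, and the sample text itself are illustrative assumptions; only the field names and values are taken from the 12:13 baseline above.

```python
def parse_summary(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines into a dict, skipping anything else (log noise, blanks)."""
    fields: dict[str, str] = {}
    for raw_line in text.splitlines():
        key, sep, value = raw_line.strip().partition("=")
        # Accept only upper-case keys such as SERVICE_GREEN or HOST_188_CHECK_RC.
        is_field_key = bool(key) and key.replace("_", "").isalnum() and key == key.upper()
        if sep and is_field_key:
            fields[key] = value
    return fields


# Hypothetical sample: the heading line is invented noise, the fields mirror the baseline.
SAMPLE_OUTPUT = """\
== post-reboot readiness summary ==
SERVICE_GREEN=1
PRODUCT_DATA_GREEN=1
BACKUP_CORE_GREEN=1
DR_ESCROW_BLOCKED=1
ESCROW_MISSING_COUNT=5
HOST_188_HYGIENE_BLOCKED=0
WAZUH_MANAGER_REGISTRY_ACCEPTED=0
NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export
"""

summary = parse_summary(SAMPLE_OUTPUT)
next_gates = summary["NEXT_REQUIRED_GATES"].split(",")
print(summary["HOST_188_HYGIENE_BLOCKED"], next_gates)
```

Keeping the values as strings avoids guessing at types for fields such as `WAZUH_DASHBOARD_API_CONNECTION`, which is not numeric.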
2026-06-26 08:12 next-gate dispatch baseline: `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` reads the same summary path and emits three explicit P0 dispatch checklists without sending requests or changing runtime. `credential_escrow_evidence` requires non-secret evidence id / owner / reviewer for five escrow items and rejects password / token / secret value / hash / prefix / suffix / raw credential payloads. `host_188_hygiene_maintenance_window` requires PostgreSQL `14/main` decision, DNS / TLS / certbot path, startup unit source-of-truth, rollback owner, postcheck owner, and blocks unapproved `pg_resetwal` / certbot renew / Nginx reload / DB restore / Docker restart / host file writes. `wazuh_manager_registry_export` requires redacted registry counts, per-host alias status, dashboard API / version status, time window, and reviewer while blocking raw agent names, internal IPs, client keys, Wazuh payloads, active response, re-enroll, restart, secret patch, host write, and Kali active scan. Output fixed `NEXT_GATE_COUNT=3`, `REQUEST_SENT_COUNT=0`, `DISPATCH_AUTHORIZED=0`, `HOST_WRITE_AUTHORIZED=0`, `SECRET_VALUE_COLLECTION_ALLOWED=0`, `RUNTIME_ACTION_AUTHORIZED=0`.
2026-06-26 07:47 machine-readable summary baseline is retained as historical evidence only. It showed `HOST_188_HYGIENE_BLOCKED=1` and three next gates before the 188 startup / ACME / certbot hygiene repair. Do not use the 07:47 gate set as the current status.
2026-06-26 08:29 owner-packet JSON baseline: `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color` emits `schema_version=awoooi_post_reboot_next_gate_owner_packets_v1` with `next_gate_count=3`, `p0_gate_count=3`, `request_sent_count=0`, `owner_response_received_count=0`, `owner_response_accepted_count=0`, `runtime_action_authorized_count=0`. This packet is for AI / operator / owner review intake only; it does not send requests, write credential markers, read secrets, or authorize runtime actions.
2026-06-26 12:13 next-gate dispatch baseline: `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` now emits only the gates present in the current summary. Current expected gates are `credential_escrow_evidence` and `wazuh_manager_registry_export`, with `NEXT_GATE_COUNT=2`, `REQUEST_SENT_COUNT=0`, `DISPATCH_AUTHORIZED=0`, `HOST_WRITE_AUTHORIZED=0`, `SECRET_VALUE_COLLECTION_ALLOWED=0`, `RUNTIME_ACTION_AUTHORIZED=0`. If 188 hygiene regresses, `host_188_hygiene_maintenance_window` will reappear automatically.
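The dynamic gate list can be sketched as a small function over the summary fields. The mapping from flags to gates below is an assumption inferred from the 07:47 and 12:13 baselines, not the dispatch script's actual logic, and `derive_next_gates` is a hypothetical name. Missing fields are treated as blocked so the sketch fails closed.

```python
def derive_next_gates(summary: dict[str, str]) -> list[str]:
    """Return the gates that still need evidence, in a stable order."""
    gates: list[str] = []
    # Assumption: any value other than an explicit "0" keeps the escrow gate open.
    if summary.get("ESCROW_MISSING_COUNT") != "0":
        gates.append("credential_escrow_evidence")
    # Assumption: the 188 gate reappears whenever hygiene is not explicitly unblocked.
    if summary.get("HOST_188_HYGIENE_BLOCKED") != "0":
        gates.append("host_188_hygiene_maintenance_window")
    # Assumption: the Wazuh gate closes only when the registry is accepted.
    if summary.get("WAZUH_MANAGER_REGISTRY_ACCEPTED") != "1":
        gates.append("wazuh_manager_registry_export")
    return gates


baseline_1213 = {
    "ESCROW_MISSING_COUNT": "5",
    "HOST_188_HYGIENE_BLOCKED": "0",
    "WAZUH_MANAGER_REGISTRY_ACCEPTED": "0",
}
baseline_0747 = dict(baseline_1213, HOST_188_HYGIENE_BLOCKED="1")

print(derive_next_gates(baseline_1213))
print(derive_next_gates(baseline_0747))
```

With these inputs the 12:13 values yield two gates and the 07:47 values yield three, which matches the gate counts recorded in the two baselines.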
2026-06-26 08:40 owner-packet contract guard baseline: `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json` validates the generated JSON before any owner review intake. It requires exactly three P0 gates, preserves `request_sent=0`, `owner_response_received=0`, `owner_response_accepted=0`, `runtime_action_authorized=0`, `host_write_authorized=0`, `secret_value_collection_allowed=0`, `runtime_gate=0`, and rejects missing forbidden payload/action controls for credential escrow, 188 host hygiene, and Wazuh registry export. Expected success line: `POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=3 request_sent=0 accepted=0 runtime_gate=0`.
2026-06-26 12:13 owner-packet JSON baseline: `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color` emits `schema_version=awoooi_post_reboot_next_gate_owner_packets_v1` with dynamic `next_gate_count=2`, `p0_gate_count=2`, `request_sent_count=0`, `owner_response_received_count=0`, `owner_response_accepted_count=0`, `runtime_action_authorized_count=0`. This packet is for AI / operator / owner review intake only; it does not send requests, write credential markers, read secrets, or authorize runtime actions.
2026-06-26 12:13 owner-packet contract guard baseline: `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json` validates the generated JSON before any owner review intake. It requires the packet gates to equal the live `source.next_required_gates`, preserves `request_sent=0`, `owner_response_received=0`, `owner_response_accepted=0`, `runtime_action_authorized=0`, `host_write_authorized=0`, `secret_value_collection_allowed=0`, `runtime_gate=0`, and rejects missing forbidden payload/action controls for active gates. Current expected success line: `POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=2 request_sent=0 accepted=0 runtime_gate=0`.
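The dynamic gate check described above can be illustrated with a reduced sketch. The packet layout used here (a top-level `counts` object next to `source` and `owner_packets`) and the function name `check_packet` are assumptions for illustration; the repo guard validates more fields than this.

```python
from typing import Any

# Counters that must stay at zero until an owner actually responds (names from the baseline).
MUST_BE_ZERO = (
    "request_sent_count",
    "owner_response_received_count",
    "owner_response_accepted_count",
    "runtime_action_authorized_count",
)


def check_packet(packet: dict[str, Any]) -> list[str]:
    """Return a list of contract failures; an empty list means the packet passes."""
    failures: list[str] = []
    source = packet.get("source")
    if not isinstance(source, dict):
        failures.append("source_not_object")
        source = {}
    expected = {str(gate) for gate in (source.get("next_required_gates") or [])}
    actual = {
        str(item.get("packet_id", ""))
        for item in (packet.get("owner_packets") or [])
        if isinstance(item, dict)
    }
    # The packet must carry exactly the gates the live summary asked for.
    if expected != actual:
        failures.append(f"gate_ids={sorted(actual)} expected={sorted(expected)}")
    counts = packet.get("counts")
    if not isinstance(counts, dict):
        counts = {}
    for key in ("next_gate_count", "p0_gate_count"):
        if counts.get(key) != len(actual):
            failures.append(f"{key}={counts.get(key)}")
    for key in MUST_BE_ZERO:
        if counts.get(key) != 0:
            failures.append(f"{key}={counts.get(key)}")
    return failures


GOOD_PACKET: dict[str, Any] = {
    "source": {"next_required_gates": ["credential_escrow_evidence", "wazuh_manager_registry_export"]},
    "owner_packets": [
        {"packet_id": "credential_escrow_evidence"},
        {"packet_id": "wazuh_manager_registry_export"},
    ],
    "counts": {"next_gate_count": 2, "p0_gate_count": 2, **{key: 0 for key in MUST_BE_ZERO}},
}
print(check_packet(GOOD_PACKET))
```

Comparing against `source.next_required_gates` instead of a fixed set is what lets the same guard accept two gates now and three gates again if 188 hygiene regresses.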
2026-06-26 08:47 Wazuh registry detail summary baseline: post-reboot readiness summary now emits `WAZUH_COVERAGE_SCOPE`, `WAZUH_DIRECT_ACTIVE`, `WAZUH_NO_TRANSPORT`, `WAZUH_SSH_BLOCKED`, `WAZUH_DASHBOARD_API_CONNECTION`, and `WAZUH_DASHBOARD_INDEX_OK` alongside existing route / transport / registry fields. Current read-only truth is coverage scope `6`, direct active `2`, no transport `1`, SSH blocked `3`, route `200`, transport `6`, Dashboard API `pending_or_spinning`, index OK `3`, manager registry accepted `0`, runtime gate `0`. This is a security evidence blocker, not a reboot service blocker.
2026-06-26 08:59 declaration guard baseline: `scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color` emits `schema_version=awoooi_post_reboot_declaration_guard_v1`, status `allowed_with_boundary_blockers`, allowed declarations `BACKUP_CORE_GREEN`, `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `PRODUCT_DATA_GREEN`, `SERVICE_RECOVERY_GREEN`, and forbidden declarations `DR_COMPLETE`, `HOST_188_FULLY_GREEN`, `WAZUH_REGISTRY_RECOVERED`, `RUNTIME_ACTION_AUTHORIZED`. Proposed false-green declarations are rejected before they can enter LOGBOOK / owner packets / external status updates.
2026-06-26 12:13 declaration guard baseline: `scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color` emits `schema_version=awoooi_post_reboot_declaration_guard_v1`, status `allowed_with_boundary_blockers`, allowed declarations including service / product data / backup / 188 host hygiene green for this evidence set, and forbidden declarations `DR_COMPLETE`, `WAZUH_REGISTRY_RECOVERED`, `RUNTIME_ACTION_AUTHORIZED`. Proposed false-green declarations are rejected before they can enter LOGBOOK / owner packets / external status updates.
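A rough sketch of the false-green rejection, assuming the guard derives its forbidden set from the live summary fields. The conditions follow the rules stated in this workplan (no DR complete until escrow missing is `0`, no Wazuh recovery until registry accepted is `1`); the function names are hypothetical and the real guard may evaluate more inputs.

```python
def forbidden_declarations(summary: dict[str, str]) -> set[str]:
    """Declarations that must not be made for this evidence set."""
    forbidden: set[str] = set()
    if summary.get("ESCROW_MISSING_COUNT") != "0":
        forbidden.add("DR_COMPLETE")
    if summary.get("HOST_188_HYGIENE_BLOCKED") != "0":
        forbidden.add("HOST_188_FULLY_GREEN")
    if summary.get("WAZUH_MANAGER_REGISTRY_ACCEPTED") != "1":
        forbidden.add("WAZUH_REGISTRY_RECOVERED")
    if summary.get("RUNTIME_ACTION_AUTHORIZED") != "1":
        forbidden.add("RUNTIME_ACTION_AUTHORIZED")
    return forbidden


def reject_false_green(proposed: set[str], summary: dict[str, str]) -> list[str]:
    """Return the proposed declarations that the evidence does not support."""
    return sorted(proposed & forbidden_declarations(summary))


summary_1213 = {
    "ESCROW_MISSING_COUNT": "5",
    "HOST_188_HYGIENE_BLOCKED": "0",
    "WAZUH_MANAGER_REGISTRY_ACCEPTED": "0",
    "RUNTIME_ACTION_AUTHORIZED": "0",
}
print(reject_false_green({"SERVICE_RECOVERY_GREEN", "DR_COMPLETE"}, summary_1213))
```

Under these assumptions `HOST_188_FULLY_GREEN` drops out of the forbidden set once `HOST_188_HYGIENE_BLOCKED=0`, which is the difference between the 08:59 and 12:13 baselines.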
2026-06-26 07:39 live quick-check refresh supersedes the 07:19 row for current operator status. `scripts/reboot-recovery/post-start-quick-check.sh --no-color` returned `POST_START_QUICK_CHECK PASS=38 WARN=3 BLOCKED=0`, warning split `SERVICE=0 BOUNDARY=1 EVIDENCE=2`, result `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. Delegated cold-start returned `PASS=89 WARN=0 BLOCKED=0`; four reboot-scope hosts ping/SSH were OK; AWOOOI / VibeWork / AwoooGo / 2026FIFA / Agent Bounty / MOMO / Stock / Bitan / TsenYang / VTuber / Gitea / Harbor / Registry / Sentry / SigNoz / Langfuse / AIOps routes returned expected 2xx/3xx. MOMO `V10.701` has job `57 completed`, daily freshness `1|2026-06-24`, and current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`. StockPlatform freshness is `ok` through `2026-06-25` with price / chips / margin / AI recommendations current. Backup core remains green: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, offsite/rclone fresh, `last_backup_all=2026-06-26 02:31:02`; DR still has `escrow_missing=5`. 110 load around `5.19 / 4.66 / 4.91` is attributable to normal platform processes, not orphan Chrome. 188 host hygiene remains blocked by failed host PostgreSQL / certbot / startup units and must use the dedicated maintenance runbook and read-only checklist.


@@ -86,6 +86,10 @@ server {
listen 80;
server_name signoz.wooo.work;
location /.well-known/acme-challenge/ {
root /var/www/html;
}
location / {
proxy_pass http://127.0.0.1:3301;
proxy_http_version 1.1;


@@ -9,10 +9,22 @@ server {
server_name
gitea.wooo.work
sentry.wooo.work
langfuse.wooo.work
langfuse.wooo.work;
location /.well-known/acme-challenge/ {
root /var/www/html;
}
location / {
return 301 https://$host$request_uri;
}
}
server {
listen 80;
server_name
harbor.wooo.work
registry.wooo.work
stock.wooo.work;
registry.wooo.work;
location /.well-known/acme-challenge/ {
root /var/www/certbot;


@@ -149,17 +149,35 @@ if out=$(ssh_cmd "$REMOTE_188" '
pg_lsclusters 2>/dev/null || true
systemctl status postgresql@14-main.service --no-pager || true
echo "PG_ISREADY_LOCAL $(pg_isready -h localhost -p 5432 2>/dev/null || true)"
echo "RECOVERY_CONTAINER $(docker inspect -f "{{.State.Running}} {{.HostConfig.NetworkMode}} {{.HostConfig.RestartPolicy.Name}}" k3s-postgres-recovery 2>/dev/null || echo missing)"
' 2>&1); then
echo "$out"
recovery_container_ready=0
if grep -q '^RECOVERY_CONTAINER true host ' <<<"$out" && grep -q 'PG_ISREADY_LOCAL .*accepting connections' <<<"$out"; then
recovery_container_ready=1
fi
if grep -Eq '^14[[:space:]]+main[[:space:]]+5432[[:space:]]+down' <<<"$out"; then
blocked "host PostgreSQL cluster 14/main is down"
if [[ "$recovery_container_ready" -eq 1 ]]; then
warn "host PostgreSQL cluster 14/main is down, but controlled k3s-postgres-recovery runtime is accepting connections"
else
blocked "host PostgreSQL cluster 14/main is down and no controlled recovery runtime was accepted"
fi
else
ok "host PostgreSQL cluster 14/main not reported down"
fi
if grep -Eiq 'invalid primary checkpoint record|could not locate a valid checkpoint record|PANIC:' <<<"$out"; then
blocked "PostgreSQL checkpoint/WAL error detected; pg_resetwal is break-glass only"
if [[ "$recovery_container_ready" -eq 1 ]]; then
warn "PostgreSQL checkpoint/WAL error remains historical host-cluster evidence; pg_resetwal is still break-glass only"
else
blocked "PostgreSQL checkpoint/WAL error detected; pg_resetwal is break-glass only"
fi
fi
if grep -q 'PG_ISREADY_LOCAL .*accepting connections' <<<"$out"; then
if [[ "$recovery_container_ready" -eq 1 ]]; then
ok "PostgreSQL runtime is provided by k3s-postgres-recovery on host network"
elif grep -q 'PG_ISREADY_LOCAL .*accepting connections' <<<"$out"; then
warn "pg_isready accepts on localhost; do not use this alone as host 14/main health"
fi
else
@@ -169,12 +187,30 @@ fi
section "188 certbot / ACME"
if out=$(ssh_cmd "$REMOTE_188" '
systemctl status certbot.service --no-pager || true
systemctl status snap.certbot.renew.service --no-pager || true
systemctl show certbot.service snap.certbot.renew.service certbot.timer snap.certbot.renew.timer -p Id -p ActiveState -p SubState -p Result -p UnitFileState --no-pager || true
systemctl list-timers --all --no-pager | grep -i certbot || true
' 2>&1); then
echo "$out"
grep -Eiq 'rateLimited|Service busy' <<<"$out" && blocked "certbot renewal is rate-limited; do not retry blindly"
grep -Eiq 'Some challenges have failed|challenge' <<<"$out" && blocked "certbot challenge failure requires DNS / ACME route owner evidence"
if grep -q 'Id=certbot.service' <<<"$out" && grep -A3 'Id=certbot.service' <<<"$out" | grep -q 'Result=failed'; then
blocked "apt certbot service currently failed"
else
ok "apt certbot service is not currently failed"
fi
if grep -q 'Id=snap.certbot.renew.service' <<<"$out" && grep -A3 'Id=snap.certbot.renew.service' <<<"$out" | grep -q 'Result=failed'; then
blocked "snap certbot renew service currently failed"
else
ok "snap certbot renew service is not currently failed"
fi
if grep -A4 'Id=certbot.timer' <<<"$out" | grep -q 'UnitFileState=disabled'; then
ok "legacy apt certbot timer disabled to avoid duplicate renewals"
else
warn "legacy apt certbot timer is not disabled"
fi
if grep -A4 'Id=snap.certbot.renew.timer' <<<"$out" | grep -q 'ActiveState=active' && grep -A4 'Id=snap.certbot.renew.timer' <<<"$out" | grep -q 'UnitFileState=enabled'; then
ok "snap certbot renew timer enabled"
else
blocked "snap certbot renew timer is not enabled and active"
fi
else
blocked "certbot status unavailable"
echo "$out"
@@ -223,7 +259,27 @@ else
fi
section "Maintenance decision tree"
cat <<'STEPS'
if [ "$SERVICE_GREEN" -eq 1 ] && [ "$HOST_HYGIENE_BLOCKED" -eq 0 ]; then
cat <<'STEPS'
Current expected outcome:
SERVICE_GREEN=1
HOST_HYGIENE_BLOCKED=0
RESULT=HOST_188_HYGIENE_GREEN
Allowed next step:
1. Keep this host in the normal post-reboot summary.
2. Wait for snap certbot timer / ACME-window readback before declaring formal certificate renewal success.
3. Keep DR credential escrow and Wazuh registry evidence as separate blockers.
Forbidden without separate approval:
- pg_resetwal
- DB restore
- Docker/systemd restart
- firewall change
- Wazuh active response or agent re-enroll
STEPS
else
cat <<'STEPS'
Current expected outcome when 188 service is green but host hygiene is not:
SERVICE_GREEN=1
HOST_HYGIENE_BLOCKED=1
@@ -244,6 +300,7 @@ Forbidden without maintenance approval:
- Docker/systemd restart
- host file write
STEPS
fi
echo
echo "SERVICE_GREEN=$SERVICE_GREEN"


@@ -14,8 +14,9 @@ Description=AWOOOI Auto-Startup Recovery Sequence
After=network-online.target containerd.service docker.service
Wants=network-online.target
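# Ensure PostgreSQL attempts to start as early as possible
Wants=postgresql@14-main.service redis-server.service ollama.service nginx.service
# PostgreSQL may be provided by the controlled recovery container; startup must not hard-pull
# postgresql@14-main.service, to avoid competing with the recovery runtime or triggering a false-green repair.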
Wants=redis-server.service ollama.service nginx.service
[Service]
Type=oneshot


@@ -3,6 +3,8 @@
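# 2026-04-04 ogt: Created after a real incident; handles container / Docker startup ordering and K3s Kine maintenance.
# 2026-06-26 Codex: PostgreSQL checkpoint/WAL errors now fail closed;
# pg_resetwal is never run from the automatic startup script, so data damage cannot be mistaken for recovery.
# 2026-06-26 Codex: Allow the controlled recovery container to provide the 14/main runtime;
# a failed systemd postgresql@14-main unit no longer marks the live DB as unavailable.
# Deployed at: /usr/local/bin/awoooi-startup.sh (on 192.168.0.188)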
# systemd unit: /etc/systemd/system/awoooi-startup.service
@@ -12,6 +14,22 @@ exec > >(tee -a "$LOG") 2>&1
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }
postgres_runtime_ready() {
if systemctl is-active postgresql@14-main >/dev/null 2>&1; then
log "✅ PostgreSQL systemd unit active"
return 0
fi
if docker inspect -f '{{.State.Running}} {{.HostConfig.NetworkMode}}' k3s-postgres-recovery 2>/dev/null | grep -q '^true host$'; then
if pg_isready -h localhost -p 5432 >/dev/null 2>&1; then
log "✅ PostgreSQL recovery container active on host network"
return 0
fi
fi
return 1
}
log "=== AWOOOI startup sequence started ==="
# ──────────────────────────────────────────────
@@ -73,20 +91,20 @@ fi
# ──────────────────────────────────────────────
log "[3/7] Checking PostgreSQL..."
if ! systemctl is-active postgresql@14-main >/dev/null 2>&1; then
if ! postgres_runtime_ready; then
log "PostgreSQL is not running; attempting to start..."
systemctl start postgresql@14-main || true
sleep 8
fi
if ! systemctl is-active postgresql@14-main >/dev/null 2>&1; then
if ! postgres_runtime_ready; then
log "PostgreSQL failed to start; checking for a checkpoint/WAL data-layer error..."
if journalctl -u postgresql@14-main -n 20 | grep -q "could not locate a valid checkpoint"; then
log "❌ PostgreSQL checkpoint/WAL error detected; automatic pg_resetwal is forbidden."
log "Manual handling requires a DB owner, backup/restore evidence, a maintenance window, and a post-check."
exit 1
fi
systemctl is-active postgresql@14-main && log "✅ PostgreSQL repair succeeded" || { log "❌ PostgreSQL repair failed"; exit 1; }
postgres_runtime_ready && log "✅ PostgreSQL runtime available" || { log "❌ PostgreSQL repair failed"; exit 1; }
fi
# Wait for PG to accept connections


@@ -23,7 +23,7 @@ OWNER_PACKET_GENERATOR = (
)
EXPECTED_SCHEMA = "awoooi_post_reboot_next_gate_owner_packets_v1"
EXPECTED_GATES = {
KNOWN_GATES = {
"credential_escrow_evidence",
"host_188_hygiene_maintenance_window",
"wazuh_manager_registry_export",
@@ -187,12 +187,21 @@ def validate_packet(packet: dict[str, Any]) -> list[str]:
counts = {}
gate_ids = {str(item.get("packet_id", "")) for item in owner_packets if isinstance(item, dict)}
if gate_ids != EXPECTED_GATES:
unknown_gates = sorted(gate_ids - KNOWN_GATES)
if unknown_gates:
failures.append(f"unknown_gate_ids={unknown_gates}")
source = packet.get("source", {})
if not isinstance(source, dict):
failures.append("source_not_object")
source = {}
expected_gates = set(str(item) for item in as_list(source.get("next_required_gates")))
if expected_gates != gate_ids:
failures.append(f"gate_ids={sorted(gate_ids)}")
expected_counts = {
"next_gate_count": 3,
"p0_gate_count": 3,
"next_gate_count": len(gate_ids),
"p0_gate_count": len(gate_ids),
}
for key, expected in expected_counts.items():
if counts.get(key) != expected:


@@ -9,6 +9,8 @@ ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
SSH_CONNECT_TIMEOUT="${SSH_CONNECT_TIMEOUT:-6}"
ROUTE_RETRY_ATTEMPTS="${ROUTE_RETRY_ATTEMPTS:-3}"
ROUTE_RETRY_DELAY_SECONDS="${ROUTE_RETRY_DELAY_SECONDS:-2}"
STOCK_FRESHNESS_RETRY_ATTEMPTS="${STOCK_FRESHNESS_RETRY_ATTEMPTS:-6}"
STOCK_FRESHNESS_RETRY_DELAY_SECONDS="${STOCK_FRESHNESS_RETRY_DELAY_SECONDS:-5}"
RUN_COLD_START=1
RUN_MOMO=1
RUN_STOCK=1
@@ -76,6 +78,8 @@ Options:
Environment:
ROUTE_RETRY_ATTEMPTS Public route attempts before blocking. Default: 3.
ROUTE_RETRY_DELAY_SECONDS Delay between failed public route attempts. Default: 2.
STOCK_FRESHNESS_RETRY_ATTEMPTS Stock freshness attempts before blocking. Default: 6.
STOCK_FRESHNESS_RETRY_DELAY_SECONDS Delay between failed Stock freshness attempts. Default: 5.
Exit codes:
0 = no service blockers. Boundary / evidence warnings may still be present.
@@ -277,9 +281,23 @@ fi
if [[ "$RUN_STOCK" -eq 1 ]]; then
section "StockPlatform freshness"
stock_tmp="$(mktemp -t post-start-stock.XXXXXX)"
stock_code="$(curl -k -sS -o "$stock_tmp" -w '%{http_code}' --max-time 12 "https://stock.wooo.work/api/v1/system/freshness" 2>/dev/null || true)"
stock_code=""
stock_attempt=1
while [[ "$stock_attempt" -le "$STOCK_FRESHNESS_RETRY_ATTEMPTS" ]]; do
stock_code="$(curl -k -sS -o "$stock_tmp" -w '%{http_code}' --max-time 12 "https://stock.wooo.work/api/v1/system/freshness" 2>/dev/null || true)"
if [[ "$stock_code" == 2* ]]; then
if [[ "$stock_attempt" -gt 1 ]]; then
evidence_warn "StockPlatform freshness recovered after attempt=$stock_attempt"
fi
break
fi
if [[ "$stock_attempt" -lt "$STOCK_FRESHNESS_RETRY_ATTEMPTS" ]]; then
sleep "$STOCK_FRESHNESS_RETRY_DELAY_SECONDS"
fi
stock_attempt=$((stock_attempt + 1))
done
if [[ "$stock_code" != 2* ]]; then
blocked "StockPlatform freshness endpoint returned ${stock_code:-curl_failed}"
blocked "StockPlatform freshness endpoint returned ${stock_code:-curl_failed} attempts=$STOCK_FRESHNESS_RETRY_ATTEMPTS"
cat "$stock_tmp" || true
else
python3 - "$stock_tmp" <<'PY'