docs(ops): refresh reboot readback route retry [skip ci]

ogt
2026-06-26 06:33:04 +08:00
parent 1966647691
commit 482ff21af5
6 changed files with 94 additions and 10 deletions


@@ -1,3 +1,37 @@
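## 2026-06-26: Host reboot SOP next-day readback and route retry gate
**Background**: `FULL_STACK_GREEN_DR_ESCROW_BLOCKED` was reached at 2026-06-25 21:14. At 06:26 the next day, the live read-only check was rerun to confirm that services were still green and to close the SOP gap where the wrapper was overly sensitive to a single route `000`.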
**Read-only evidence**
- Four hosts `110 / 120 / 121 / 188`: ping and SSH port all OK.
- Cold-start: `PASS=89 WARN=0 BLOCKED=0`, Result `GREEN`.
- K3s: `mon` / `mon1` Ready; AWOOOI API/Web/Worker Running; active failed Jobs `0`.
- MOMO: health `V10.690`; latest import job `57 completed`; `DB_DAILY_FRESHNESS 1|2026-06-24`; current-month parity `15383|15383`.
- StockPlatform: `/api/v1/system/freshness` returns `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; price / chips / margin / AI recommendations are all on `2026-06-25`.
- Backup: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`.
- Public routes: the 06:26 full wrapper saw a single `000` on `https://awoooi.wooo.work/zh-TW/iwooos` and `https://vibework.wooo.work/`; an independent curl immediately returned `200`, and the route-only wrapper returned `PASS=31 WARN=0 BLOCKED=0 RESULT=GREEN`.
- 110 CPU: load about `5.50 / 3.41 / 2.74`; `vmstat` shows no immediate swap thrash; no orphan Chrome or long-running active StockPlatform query was seen. The load comes mainly from Gitea / ClickHouse / Docker / Kafka / platform background services and short queries.
**Done**
- `scripts/reboot-recovery/post-start-quick-check.sh`: the public route gate now retries, with defaults `ROUTE_RETRY_ATTEMPTS=3` and `ROUTE_RETRY_DELAY_SECONDS=2`.
- A route that recovers after a retry is reported as `evidence_warn recovered_after_attempt=<n>`; only consecutive failures count as `BLOCKED`.
- Updated `FULL-STACK-COLD-START-SOP.md` to v1.61 and `REBOOT-POST-START-QUICK-CHECK.md` to v1.6, along with the recovery workplan and `BACKUP-STATUS.md`.
**Verification**
- `bash -n scripts/reboot-recovery/post-start-quick-check.sh` passes.
- Route-only wrapper: `PASS=31 WARN=0 BLOCKED=0`, `RESULT=GREEN`.
- Core wrapper with routes skipped: `PASS=15 WARN=2 BLOCKED=0`, warning split `SERVICE=0 BOUNDARY=1 EVIDENCE=1`, `RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`.
**Types of commands run**
- Read-only: cold-start, quick-check, MOMO / Stock freshness, backup-status, route curl, CPU / PostgreSQL activity readback.
- Repo-only: SOP / runbook / workplan / LOGBOOK documents and the wrapper retry gate.
- No host/runtime writes; no Docker/systemd/Nginx/firewall/K8s/ArgoCD/DB/Wazuh operations; no manual ingestion; no secret reads.
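**Still blocked / must not be claimed**
- `DR_COMPLETE` is still blocked: `escrow_missing=5`.
- Wazuh manager registry accepted is still `0`; a route `200` or a visible UI does not justify claiming that all hosts are enrolled in Wazuh.
- The overall product-governance program is still `not_complete` per `CODEX-START-HERE`; this round's reboot service green must not be described as full product governance being complete.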
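## 2026-06-25: Status-chain apply candidate wording fix: dry-run candidates are no longer described as purely manual
**Background**: Alerts of the `INC-20260625-977E5F` / `node-exporter-188` class had completed MCP investigation and Ansible check-mode, and status-chain could derive `ansible-apply-candidate:*`, `verifier-plan:*`, and a Work Item. The operator outcome, however, still described this as a plain dry-run / manual gate, and the frontend showed the raw `next_step` directly. On-call staff were left with the impression that the AI only handed the work back to humans, with no way to see that it had already produced a reviewable apply candidate.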


@@ -24,9 +24,32 @@
> 2026-06-25 20:11 Codex StockPlatform cron-source recovery: StockPlatform Gitea/live source is now `fb91aa4c6272469d1d26e0820169629eac17d28a`; six missing production cron entrypoints are restored; natural cron runs for source remediation, market index, price, margin, chips, and AI no longer fail from missing files. Backup/offsite remains green. Stock freshness still blocks because official 2026-06-25 margin-short data is pending and AI recommendations correctly stay on 2026-06-24; this is still not a backup or restore incident.
> 2026-06-25 20:25 Codex 110 CPU cleanup: two orphan StockPlatform headless Chrome process groups were cleared by targeted approved `SIGTERM`; no Docker/systemd/Nginx/K8s/DB/backup write occurred. Backup/offsite remains green, DR still blocked by `escrow_missing=5`, and Stock freshness remains the only hard product-data blocker.
> 2026-06-25 21:14 Codex full wrapper refresh: StockPlatform 21:00 `intelligence-sync` and 21:10 AI pipeline naturally caught up; `/api/v1/system/freshness` is `status=ok` with blockers `[]`. Backup/offsite remains 110 `13/13` and 188 `2/2` fresh, `core_blockers=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`; full-stack service/data result is `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, with only `escrow_missing=5` blocking DR complete.
> 2026-06-26 06:28 Codex next-day backup readback: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`; full-stack service/data result remains `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`.
---
## 2026-06-26 06:28 Backup / Offsite / Escrow Live Status
Read-only evidence sources: 06:26 / 06:28 `post-start-quick-check.sh`, delegated `/backup/scripts/backup-status.sh --no-notify --no-refresh`, route-only wrapper retry validation, and direct StockPlatform / MOMO freshness readback.
- 110 backup health: `13/13 fresh failed=0`
- 188 backup health: `2/2 fresh failed=0`
- Integrity / configured blockers: `core_blockers=0`, `configured_missing_110=0`, `configured_missing_188=0`, `script_missing_110=0`, `script_missing_188=0`, `integrity_stale=0`
- Offsite / GDrive freshness: `offsite_configured=1`, `offsite_fresh=1`, `rclone_gdrive_configured=1`, `rclone_gdrive_fresh=1`
- Last aggregate backup: `2026-06-26 02:31:02`
- DR blocker remains: `escrow_missing=5`. Do not forge evidence markers, and do not paste any secret value, hash, or partial token.
- Full-stack service state: `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. Cold-start `PASS=89 WARN=0 BLOCKED=0`; StockPlatform freshness `status=ok`; MOMO daily freshness `1|2026-06-24`.
- Route note: 06:26 full wrapper had one-time route `000` for IwoooS / VibeWork, but direct curl and route-only wrapper immediately returned `200` and `RESULT=GREEN`; v1.6 wrapper now retries routes before blocking.
| Gate | Status | Evidence |
|------|--------|----------|
| 110 backup freshness | VERIFIED | 13/13 fresh, failed count 0. |
| 188 backup freshness | VERIFIED | 2/2 fresh, failed count 0. |
| Offsite / GDrive freshness | VERIFIED | `offsite_fresh=1`, `rclone_gdrive_fresh=1`. |
| Backup core blockers | GREEN | `core_blockers=0`. |
| Full-stack service state | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | Cold-start `PASS=89 WARN=0 BLOCKED=0`; core wrapper `PASS=15 WARN=2 BLOCKED=0`; route-only wrapper `PASS=31 WARN=0 BLOCKED=0`. |
| Credential escrow | BLOCKED | `escrow_missing=5`; only real non-secret owner evidence may close this. |
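The gate table above can also be evaluated mechanically. The following is a minimal sketch, assuming the backup-status evidence is available as one line of comma- or space-separated `key=value` tokens. `kv_get`, `backup_gate`, and the `BACKUP_BLOCKED` label are hypothetical names introduced for illustration and are not part of `backup-status.sh`. The sketch fails closed: a missing or unexpected value never yields `DR_COMPLETE`.

```shell
# Hypothetical helper: print the value of KEY from "key=value" tokens in LINE.
# Tokens may be separated by commas and/or spaces; the last match wins.
kv_get() {
  printf '%s\n' "$2" | tr ', ' '\n\n' | sed -n "s/^$1=//p" | tail -n 1
}

# Hypothetical gate: classify one backup-status evidence line.
# String comparison is used on purpose so that empty or non-numeric
# values fall into the blocked branches instead of passing.
backup_gate() {
  local line="$1" core offsite gdrive escrow
  core="$(kv_get core_blockers "$line")"
  offsite="$(kv_get offsite_fresh "$line")"
  gdrive="$(kv_get rclone_gdrive_fresh "$line")"
  escrow="$(kv_get escrow_missing "$line")"

  if [ "$core" != "0" ] || [ "$offsite" != "1" ] || [ "$gdrive" != "1" ]; then
    echo "BACKUP_BLOCKED"        # assumed label, not from the source
  elif [ "$escrow" != "0" ]; then
    echo "BLOCKED_DR_ESCROW"     # backup is green, escrow evidence still missing
  else
    echo "DR_COMPLETE"
  fi
}

backup_gate 'core_blockers=0, offsite_fresh=1, rclone_gdrive_fresh=1, escrow_missing=5'
```

With the 06:28 values this prints `BLOCKED_DR_ESCROW`. The sketch covers only the backup and escrow gates; other evidence blockers such as the Wazuh registry count would need their own checks.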
## 2026-06-25 19:17 Backup / Offsite / Escrow Live Status
Read-only evidence sources: `/backup/scripts/backup-status.sh --no-notify --no-refresh` from 110 at 19:17 Asia/Taipei, plus 19:05 post-start quick check and 19:05-19:06 route stability readback.


@@ -1,7 +1,7 @@
# AWOOOI Full-Stack Cold-Start and Host Reboot SOP
-> Version: v1.60
-> Last updated: 2026-06-25 Asia/Taipei
+> Version: v1.61
+> Last updated: 2026-06-26 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
---
@@ -12,6 +12,8 @@
If the goal after a reboot is only to decide quickly whether recovery can be claimed, run the one-page overall check first: `scripts/reboot-recovery/post-start-quick-check.sh --no-color`, and use `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` as the manual fallback. The long SOP keeps the full background, exception handling, and Plan B; the short wrapper / checklist handles the fixed verdict within T+10 minutes each time.
2026-06-26 06:26-06:28 next-day read-only refresh: all four hosts ping/SSH OK; cold-start `PASS=89 WARN=0 BLOCKED=0`; MOMO `V10.690` with latest import job `57 completed`; StockPlatform `/api/v1/system/freshness` still `status=ok` / `latest_trading_date=2026-06-25` / blockers `[]`; backup-status 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`. The first pass of the 06:26 full wrapper saw a single `000` on `https://awoooi.wooo.work/zh-TW/iwooos` and `https://vibework.wooo.work/`, but an independent curl immediately returned `200` and the route-only wrapper also returned `PASS=31 WARN=0 BLOCKED=0 RESULT=GREEN`. v1.61 therefore changes the public route gate to make up to 3 attempts by default: only consecutive failures count as `BLOCKED`, and a route that recovers after a retry is recorded as an evidence warning. The 06:28 core wrapper with routes skipped returns `POST_START_QUICK_CHECK PASS=15 WARN=2 BLOCKED=0`, `RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. No Docker/systemd/Nginx/firewall/K8s/DB/Wazuh runtime write was performed in this round.
2026-06-25 21:14 StockPlatform natural-cron / full-wrapper refresh supersedes the 20:25 product-data blocker wording. After waiting for official schedules instead of manual ingestion, `intelligence-sync` 21:00 finished `status=0`, `core.margin_short_daily` reached `2026-06-25` / 1976 rows, and `ai-recommendation-pipeline` 21:10 finished `STOCKPLATFORM_AI_RECOMMENDATION_PIPELINE_OK as_of_date=2026-06-25` with `draft_count=120`, `candidate_count=120`, and `rag_documents=1000`. StockPlatform `/api/v1/system/freshness` now returns `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`, with price / chips / margin / AI recommendations all on `2026-06-25`. The 21:14 full wrapper returns cold-start `PASS=89 WARN=0 BLOCKED=0` and overall `POST_START_QUICK_CHECK PASS=38 WARN=2 BLOCKED=0`, `RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. The only remaining recovery red gate is DR credential escrow evidence `escrow_missing=5`; Wazuh manager registry accepted remains `0` as a security evidence blocker, not a reboot service blocker.
2026-06-25 20:25 orphan Chrome cleanup / scorecard refresh supersedes the 20:11 CPU wording. 110 high CPU was traced to two `stockplatform-review-bulk-ux` Chrome process groups `2756503` and `2829627` with root Chrome process `PPID=1`, elapsed about 5h, no active parent smoke, and sustained GPU/renderer CPU. With user approval, only those two process groups received targeted `SIGTERM` at 20:24. Post-check showed no remaining PGID entries; `vmstat` showed CPU idle around `85-90%`, `si/so=0`, and no immediate swap thrash. No Docker/systemd/Nginx/firewall/K8s action, CI cancellation, manual data ingestion, manual DB write, Wazuh/SOC runtime change, or secret read was performed. The 20:25 full post-start wrapper then returned cold-start `PASS=89 WARN=0 BLOCKED=0`, but overall `POST_START_QUICK_CHECK PASS=37 WARN=2 BLOCKED=1`, `RESULT=BLOCKED`, because StockPlatform data freshness was still blocked at that time and DR remained incomplete.


@@ -1,7 +1,7 @@
# One-Page Post-Reboot Overall Check
-> Version: v1.5
-> Last updated: 2026-06-25 Asia/Taipei
+> Version: v1.6
+> Last updated: 2026-06-26 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 post-reboot service recovery. 112 Kali / Wazuh / active scan are not part of this procedure.
---
@@ -10,7 +10,7 @@
Whenever any of the hosts 110 / 120 / 121 / 188 is powered on, shut down, rebooted, restored after a power loss, put through a VMware console fsck, or subjected to a large Docker / K3s reschedule, run this page first and only then decide whether to claim recovery.
-Latest baseline: 2026-06-25 21:14 full wrapper `POST_START_QUICK_CHECK PASS=38 WARN=2 BLOCKED=0`, warning split `SERVICE=0 BOUNDARY=1 EVIDENCE=1`, Result `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`; the StockPlatform 21:10 natural AI pipeline has caught up to `as_of_date=2026-06-25`, `/api/v1/system/freshness` is `status=ok`, and DR still cannot be declared complete because of `escrow_missing=5`.
+Latest baseline: 2026-06-26 06:26-06:28 read-only refresh. Cold-start `PASS=89 WARN=0 BLOCKED=0`; MOMO `V10.690`, latest import job `57 completed`, `DB_DAILY_FRESHNESS 1|2026-06-24`; StockPlatform `/api/v1/system/freshness` `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; backup-status 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `last_backup_all=2026-06-26 02:31:02`; DR still cannot be declared complete because of `escrow_missing=5`. The single route `000` on `iwooos` / `vibework` in the 06:26 full wrapper was confirmed as transient by an independent curl and the route-only wrapper; from v1.6 the public route gate retries, and only consecutive failures count as `BLOCKED`.
This page answers only four questions:
@@ -51,6 +51,8 @@ scripts/reboot-recovery/post-start-quick-check.sh --no-color
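This wrapper performs read-only checks only and delegates to the existing cold-start / MOMO preflight / backup-status; it does not restart, reload, import, modify K8s, or read token contents. The wrapper splits warnings into three classes, `SERVICE`, `BOUNDARY`, and `EVIDENCE`, so that `escrow_missing>0` is not misjudged as a service degradation. If the wrapper fails because of an SSH permission or path problem, collect the evidence manually with the segmented commands below.
From v1.6, the public route gate retries using `ROUTE_RETRY_ATTEMPTS` (default `3`) and `ROUTE_RETRY_DELAY_SECONDS` (default `2`). A single `000` / timeout that recovers after a retry should be recorded as an evidence warning or transient route evidence and must not be treated as the site still being down; only consecutive failures are a service blocker.
The wrapper must parse the cold-start summary first and must not rely on the cold-start exit code alone:
- Only when cold-start reports `BLOCKED>0` may the wrapper judge `BLOCKED`.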
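
One way to judge from the summary rather than the exit code is sketched below. It assumes the cold-start output contains a `PASS=<n> WARN=<n> BLOCKED=<n>` summary in the form quoted in this runbook. `summary_field`, `cold_start_verdict`, and the `UNPARSED` / `WARN` labels are hypothetical and are not taken from the wrapper; how the real wrapper treats warnings or a missing summary is not shown here.

```shell
# Hypothetical helper: print the integer after the last "KEY=" token in TEXT.
summary_field() {
  printf '%s\n' "$2" | tr ' \t' '\n\n' | sed -n "s/^$1=\([0-9][0-9]*\)\$/\1/p" | tail -n 1
}

# Hypothetical verdict: decide from the parsed counters, never from $?.
cold_start_verdict() {
  local output="$1" blocked warn
  blocked="$(summary_field BLOCKED "$output")"
  warn="$(summary_field WARN "$output")"

  if [ -z "$blocked" ]; then
    echo "UNPARSED"              # assumption: no summary means manual evidence is needed
  elif [ "$blocked" -gt 0 ]; then
    echo "BLOCKED"               # the only case the rule above allows to block
  elif [ "${warn:-0}" -gt 0 ]; then
    echo "WARN"                  # assumed label for a summary with warnings only
  else
    echo "GREEN"
  fi
}

cold_start_verdict 'PASS=89 WARN=0 BLOCKED=0'
```

For the 06:26 cold-start summary this prints `GREEN`. A non-zero exit code with `BLOCKED=0` would not produce `BLOCKED` in this sketch, which is the point of parsing the summary first.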


@@ -11,11 +11,11 @@
| Area | Status | Completion | Evidence |
|------|--------|------------|----------|
| Overall recovery readiness | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-25 21:14 full post-start wrapper showed hosts / K3s / AWOOOI / public routes / MOMO / StockPlatform / backup / offsite service and data gates green. Cold-start returned `PASS=89 WARN=0 BLOCKED=0`; wrapper returned `POST_START_QUICK_CHECK PASS=38 WARN=2 BLOCKED=0`, warning split `SERVICE=0 BOUNDARY=1 EVIDENCE=1`, result `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. 20:24 targeted approved `SIGTERM` cleared orphan 110 `stockplatform-review-bulk-ux` Chrome PGIDs `2756503` and `2829627`; 21:14 CPU attribution shows current load is active AWOOOI Web `next build`, not orphan Chrome. StockPlatform Gitea/live source is `fb91aa4c6272469d1d26e0820169629eac17d28a`; 21:00 `intelligence-sync` succeeded, 21:10 `ai-recommendation-pipeline` produced `STOCKPLATFORM_AI_RECOMMENDATION_PIPELINE_OK as_of_date=2026-06-25`, and `/api/v1/system/freshness` is now `status=ok` with blockers `[]`. MOMO remains fresh through `2026-06-24` with latest job `57` completed cleanly. Do not declare DR complete until `escrow_missing=0`; Wazuh manager registry accepted remains `0` and is a security evidence blocker, not a reboot service blocker. |
| Overall recovery readiness | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-26 06:26-06:28 read-only refresh confirms the 2026-06-25 21:14 green baseline still holds. Four hosts ping/SSH OK; cold-start `PASS=89 WARN=0 BLOCKED=0`; MOMO health `V10.690`, latest import job `57 completed`, daily freshness `1|2026-06-24`; StockPlatform freshness `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; backup-status 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`. The 06:26 full wrapper saw one-time route `000` on IwoooS / VibeWork, but independent curl and route-only wrapper immediately returned `200` / `PASS=31 WARN=0 BLOCKED=0`; v1.61/v1.6 now retries public routes before blocking. 06:28 core wrapper with routes skipped returned `PASS=15 WARN=2 BLOCKED=0`, result `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. Do not declare DR complete until `escrow_missing=0`; Wazuh manager registry accepted remains `0` and is a security evidence blocker, not a reboot service blocker. |
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-25 09:05 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, VIP `192.168.0.125` is present, node filesystem / disk-pressure / readonly events are `0`, and latest `km-vectorize-29705460-55rgs` completed. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-25 19:17 backup readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`, last aggregate `2026-06-25 02:35:09`. 2026-06-25 19:19 offsite escrow report shows script presence OK, rclone configured, full and partial rclone markers present, `PASS=8 WARN=5 BLOCKED=0`, `ESCROW_MISSING_COUNT=5`; DR remains blocked on real non-secret credential escrow evidence IDs. |
| P2 service / data truth | DONE | 100% | Service routes and core runtime are available, 110 orphan Chrome CPU pressure is cleared, and StockPlatform cron-source drift is repaired. 2026-06-25 21:13 StockPlatform `/api/v1/system/freshness` returned `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; price / chips / margin / AI recommendations are all on `2026-06-25`. `ai.recommendations` row count is `2868`; `core.margin_short_daily` row count is `1976`. MOMO health `V10.690`, current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, and `DB_DAILY_FRESHNESS 1|2026-06-24` are green; expanded public routes are green. |
| P3 docs / automation contracts | DONE_WITH_PRODUCT_DATA_GREEN_V160 | 100% | Workplan, SOP v1.60, one-page post-start quick check v1.5, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, StockPlatform 21:00 / 21:10 natural schedule green evidence, 110 orphan Chrome recurrence cleanup evidence, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and 2026-06-25 stricter product-data gate are updated. Live 110 script sync remains a separate approved live-write gate; do not claim it here. |
| P3 docs / automation contracts | DONE_WITH_ROUTE_RETRY_V161 | 100% | Workplan, SOP v1.61, one-page post-start quick check v1.6, route retry gate, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, StockPlatform 21:00 / 21:10 natural schedule green evidence, 110 orphan Chrome recurrence cleanup evidence, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and stricter product-data / route retry gates are updated. Live 110 script sync remains a separate approved live-write gate; do not claim it here. |
2026-06-25 19:06 post-CD wrapper readback supersedes the 18:53 wording: consecutive main pushes created a deploy storm where older deploy markers were superseded by later commits. Latest production truth is deploy marker `d8ca8224 chore(cd): deploy 9dbe044 [skip ci]`, ArgoCD `Synced / Healthy`, API/Web/Worker image tag `9dbe044ea1e8e3894ccbeb5ed760bb124b87f7be`, direct route smoke 200 for AWOOOI API / IwoooS / VibeWork / AwoooGo / MOMO health / Stock / Bitan and expected route-gate statuses for MOMO / Gitea / Harbor / Registry / Sentry / SigNoz / Langfuse / AIOps, and wrapper `POST_START_QUICK_CHECK PASS=18 WARN=3 BLOCKED=0`. Repo-side cold-start returns `PASS=89 WARN=0 BLOCKED=0`; `/backup/scripts/backup-status.sh --no-notify --no-refresh` reports 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`; MOMO dedicated preflight returns `PASS=19 WARN=2 BLOCKED=0`; MOMO health is `V10.690`; AwoooGo / Stock transient 502 reads cleared after upstream warmup and five consecutive route reads returned `200`; 110 load is around `14.51 / 12.34 / 11.42`, with Gitea Actions cache save / `zstdmt` / `tar`, StockPlatform headless Chrome smoke / CI, Gitea, AWOOOI API, ClickHouse, Docker, and platform services visible, not an AWOOOI service blocker. Wrapper result is `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, not `DEGRADED`, because service warnings are `0` and only DR boundary / evidence warnings remain. Wazuh route readback is now `200 disabled_waiting_iwooos_wazuh_owner_gate`, but manager registry accepted remains `0`, so Wazuh is a security registry evidence blocker rather than a reboot service blocker.
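
The result wording used in this workplan can be summarized as a small decision function. This is a sketch inferred from the recorded results (`BLOCKED=1` gave `RESULT=BLOCKED`; service warnings `0` with only boundary / evidence warnings gave `FULL_STACK_GREEN_DR_ESCROW_BLOCKED` rather than `DEGRADED`; no warnings gave `GREEN`). `derive_result` and its argument order are hypothetical, and the wrapper's actual logic may differ.

```shell
# Hypothetical: derive_result BLOCKED SERVICE_WARN BOUNDARY_WARN EVIDENCE_WARN
# All four arguments are assumed to be non-negative integers.
derive_result() {
  local blocked="$1" service="$2" boundary="$3" evidence="$4"

  if [ "$blocked" -gt 0 ]; then
    echo "BLOCKED"                               # any hard blocker wins
  elif [ "$service" -gt 0 ]; then
    echo "DEGRADED"                              # service-level warnings
  elif [ $((boundary + evidence)) -gt 0 ]; then
    echo "FULL_STACK_GREEN_DR_ESCROW_BLOCKED"    # only DR boundary / evidence warnings remain
  else
    echo "GREEN"
  fi
}

# 06:28 core wrapper: BLOCKED=0, warning split SERVICE=0 BOUNDARY=1 EVIDENCE=1
derive_result 0 0 1 1
```

The ordering encodes the claim in the paragraph above: boundary and evidence warnings alone never downgrade the service verdict to `DEGRADED`.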


@@ -7,6 +7,8 @@ set -uo pipefail
 ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
 SSH_CONNECT_TIMEOUT="${SSH_CONNECT_TIMEOUT:-6}"
+ROUTE_RETRY_ATTEMPTS="${ROUTE_RETRY_ATTEMPTS:-3}"
+ROUTE_RETRY_DELAY_SECONDS="${ROUTE_RETRY_DELAY_SECONDS:-2}"
 RUN_COLD_START=1
 RUN_MOMO=1
 RUN_STOCK=1
@@ -71,6 +73,10 @@ Options:
   --no-color                  Disable ANSI color.
   -h, --help                  Show this help.
+Environment:
+  ROUTE_RETRY_ATTEMPTS        Public route attempts before blocking. Default: 3.
+  ROUTE_RETRY_DELAY_SECONDS   Delay between failed public route attempts. Default: 2.
 Exit codes:
   0 = no service blockers. Boundary / evidence warnings may still be present.
   1 = service warnings only.
@@ -348,13 +354,30 @@ fi
 if [[ "$RUN_ROUTES" -eq 1 ]]; then
   section "Public routes"
   for url in "${ROUTES[@]}"; do
-    code="$(curl -k -sS -o /dev/null -w '%{http_code}' --max-time 12 "$url" 2>/dev/null || true)"
+    code=""
+    attempt=1
+    while [[ "$attempt" -le "$ROUTE_RETRY_ATTEMPTS" ]]; do
+      code="$(curl -k -sS -o /dev/null -w '%{http_code}' --max-time 12 "$url" 2>/dev/null || true)"
+      case "$code" in
+        2*|3*)
+          break
+          ;;
+      esac
+      if [[ "$attempt" -lt "$ROUTE_RETRY_ATTEMPTS" ]]; then
+        sleep "$ROUTE_RETRY_DELAY_SECONDS"
+      fi
+      attempt=$((attempt + 1))
+    done
     case "$code" in
       2*|3*)
-        ok "$code $url"
+        if [[ "$attempt" -gt 1 ]]; then
+          evidence_warn "$code $url recovered_after_attempt=$attempt"
+        else
+          ok "$code $url"
+        fi
         ;;
       *)
-        blocked "${code:-curl_failed} $url"
+        blocked "${code:-curl_failed} $url attempts=$ROUTE_RETRY_ATTEMPTS"
         ;;
     esac
   done
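
The retry gate in the hunk above can be exercised offline with a stand-in probe. The following is a standalone sketch, not the wrapper itself: `route_gate`, `http_code`, and the optional probe argument are hypothetical names added so the loop can be tested without network access. The fallback to `3` when `ROUTE_RETRY_ATTEMPTS` is not a positive integer is also an assumption; in the hunk as written, a value of `0` would skip the loop and report the route as `curl_failed`.

```shell
# Print the HTTP status for a URL; curl writes 000 when no response arrives.
http_code() {
  curl -k -sS -o /dev/null -w '%{http_code}' --max-time 12 "$1" 2>/dev/null || true
}

# Hypothetical: route_gate URL [PROBE]
# Prints "ok", "evidence_warn", or "blocked" plus details; returns 1 only when blocked.
route_gate() {
  local url="$1" probe="${2:-http_code}"
  local attempts="${ROUTE_RETRY_ATTEMPTS:-3}"
  local delay="${ROUTE_RETRY_DELAY_SECONDS:-2}"
  local code="" attempt=1

  # Assumption: fall back to 3 unless the value is a positive integer
  # without leading zeros.
  case "$attempts" in
    ''|*[!0-9]*|0*) attempts=3 ;;
  esac

  while [ "$attempt" -le "$attempts" ]; do
    code="$("$probe" "$url")"
    case "$code" in
      2*|3*) break ;;            # success: keep the attempt number
    esac
    if [ "$attempt" -lt "$attempts" ]; then
      sleep "$delay"             # no sleep after the final failed attempt
    fi
    attempt=$((attempt + 1))
  done

  case "$code" in
    2*|3*)
      if [ "$attempt" -gt 1 ]; then
        echo "evidence_warn $code $url recovered_after_attempt=$attempt"
      else
        echo "ok $code $url"
      fi
      ;;
    *)
      echo "blocked ${code:-curl_failed} $url attempts=$attempts"
      return 1
      ;;
  esac
}
```

Passing a probe function that returns `000` once and `200` afterwards reproduces the 06:26 situation: the route is reported as `evidence_warn ... recovered_after_attempt=2` instead of `blocked`.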