From 482ff21af59c44ceaa9a3499d6669e595189ff1a Mon Sep 17 00:00:00 2001
From: ogt
Date: Fri, 26 Jun 2026 06:33:04 +0800
Subject: [PATCH] docs(ops): refresh reboot readback route retry [skip ci]

---
 docs/LOGBOOK.md | 34 +++++++++++++++++++
 docs/runbooks/BACKUP-STATUS.md | 23 +++++++++++++
 docs/runbooks/FULL-STACK-COLD-START-SOP.md | 6 ++--
 .../runbooks/REBOOT-POST-START-QUICK-CHECK.md | 8 +++--
 ...oot-cold-start-backup-recovery-workplan.md | 4 +--
 .../reboot-recovery/post-start-quick-check.sh | 29 ++++++++++++++--
 6 files changed, 94 insertions(+), 10 deletions(-)

diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index ea32be74..1aed803e 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -1,3 +1,37 @@
+## 2026-06-26|Host reboot SOP next-day readback and route retry gate
+
+**Background**: `FULL_STACK_GREEN_DR_ESCROW_BLOCKED` was reached at 2026-06-25 21:14. The next day at 06:26 the live read-only check was rerun to confirm that service green still held, and to close the SOP gap where the wrapper was oversensitive to a single route `000`.
+
+**Read-only evidence**:
+- All four hosts `110 / 120 / 121 / 188`: ping / SSH port OK.
+- Cold-start: `PASS=89 WARN=0 BLOCKED=0`, Result `GREEN`.
+- K3s: `mon` / `mon1` Ready; AWOOOI API/Web/Worker Running; active failed Jobs `0`.
+- MOMO: health `V10.690`, latest import job `57 completed`, `DB_DAILY_FRESHNESS 1|2026-06-24`, current-month parity `15383|15383`.
+- StockPlatform: `/api/v1/system/freshness` returns `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; price / chips / margin / AI recommendations are all on `2026-06-25`.
+- Backup: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`.
+- Public routes: the 06:26 full wrapper saw a single `000` on `https://awoooi.wooo.work/zh-TW/iwooos` and `https://vibework.wooo.work/`; an independent curl immediately returned `200`, and the route-only wrapper returned `PASS=31 WARN=0 BLOCKED=0 RESULT=GREEN`.
+- 110 CPU: load about `5.50 / 3.41 / 2.74`; `vmstat` shows no immediate swap thrash; no orphan Chrome or long-running active StockPlatform query was seen. The load is mainly Gitea / ClickHouse / Docker / Kafka / platform background services and short queries.
+
+**Done**:
+- The public route gate in `scripts/reboot-recovery/post-start-quick-check.sh` now retries: defaults `ROUTE_RETRY_ATTEMPTS=3`, `ROUTE_RETRY_DELAY_SECONDS=2`.
+- A route that recovers after a retry is reported as `evidence_warn recovered_after_attempt=`; only consecutive failures count as `BLOCKED`.
+- Updated `FULL-STACK-COLD-START-SOP.md` v1.61, `REBOOT-POST-START-QUICK-CHECK.md` v1.6, the recovery workplan, and `BACKUP-STATUS.md`.
+
+**Verification**:
+- `bash -n scripts/reboot-recovery/post-start-quick-check.sh` passes.
+- Route-only wrapper: `PASS=31 WARN=0 BLOCKED=0`, `RESULT=GREEN`.
+- Core wrapper with routes skipped: `PASS=15 WARN=2 BLOCKED=0`, warning split `SERVICE=0 BOUNDARY=1 EVIDENCE=1`, `RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`.
+
+**Command types run**:
+- Read-only: cold-start, quick-check, MOMO / Stock freshness, backup-status, route curl, CPU / PostgreSQL activity readback.
+- Repo-only: SOP / runbook / workplan / LOGBOOK documents and the wrapper retry gate.
+- No host/runtime write: no Docker/systemd/Nginx/firewall/K8s/ArgoCD/DB/Wazuh operations, no manual ingestion, no secret read.
+
+**Still blocked / must not be claimed**:
+- `DR_COMPLETE` remains blocked: `escrow_missing=5`.
+- Wazuh manager registry accepted is still `0`; a route `200` or a visible UI does not justify claiming that all hosts are enrolled in Wazuh.
+- The overall product governance program remains `not_complete` per `CODEX-START-HERE`; this round's reboot service green must not be described as product-wide governance being complete.
+
 ## 2026-06-25|Status-chain apply candidate semantics fix: dry-run candidates are no longer described as purely manual
 
 **Background**: `INC-20260625-977E5F` / `node-exporter-188`-class alerts had completed MCP investigation and Ansible check-mode, and status-chain could derive `ansible-apply-candidate:*`, `verifier-plan:*`, and a Work Item; but the operator outcome still described this as a plain dry-run / manual gate, and the frontend displayed the raw `next_step` directly. On-call operators were left feeling that the AI had simply handed the work back to humans, with no way to see that it had already produced a reviewable apply candidate.
diff --git a/docs/runbooks/BACKUP-STATUS.md b/docs/runbooks/BACKUP-STATUS.md
index 44b47d06..6991c01a 100644
--- a/docs/runbooks/BACKUP-STATUS.md
+++ b/docs/runbooks/BACKUP-STATUS.md
@@ -24,9 +24,32 @@
 > 2026-06-25 20:11 Codex StockPlatform cron-source recovery: StockPlatform Gitea/live source is now `fb91aa4c6272469d1d26e0820169629eac17d28a`; six missing production cron entrypoints are restored; natural cron runs for source
remediation, market index, price, margin, chips, and AI no longer fail from missing files. Backup/offsite remains green. Stock freshness still blocks because official 2026-06-25 margin-short data is pending and AI recommendations correctly stay on 2026-06-24; this is still not a backup or restore incident. > 2026-06-25 20:25 Codex 110 CPU cleanup: two orphan StockPlatform headless Chrome process groups were cleared by targeted approved `SIGTERM`; no Docker/systemd/Nginx/K8s/DB/backup write occurred. Backup/offsite remains green, DR still blocked by `escrow_missing=5`, and Stock freshness remains the only hard product-data blocker. > 2026-06-25 21:14 Codex full wrapper refresh: StockPlatform 21:00 `intelligence-sync` and 21:10 AI pipeline naturally caught up; `/api/v1/system/freshness` is `status=ok` with blockers `[]`. Backup/offsite remains 110 `13/13` and 188 `2/2` fresh, `core_blockers=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`; full-stack service/data result is `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, with only `escrow_missing=5` blocking DR complete. +> 2026-06-26 06:28 Codex隔日 backup readback: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`; full-stack service/data result remains `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. --- +## 2026-06-26 06:28 Backup / Offsite / Escrow Live Status + +Read-only evidence sources: 06:26 / 06:28 `post-start-quick-check.sh`, delegated `/backup/scripts/backup-status.sh --no-notify --no-refresh`, route-only wrapper retry validation, and direct StockPlatform / MOMO freshness readback. 
+ +- 110 backup health: `13/13 fresh failed=0`。 +- 188 backup health: `2/2 fresh failed=0`。 +- Integrity / configured blockers: `core_blockers=0`、`configured_missing_110=0`、`configured_missing_188=0`、`script_missing_110=0`、`script_missing_188=0`、`integrity_stale=0`。 +- Offsite / GDrive freshness: `offsite_configured=1`、`offsite_fresh=1`、`rclone_gdrive_configured=1`、`rclone_gdrive_fresh=1`。 +- Last aggregate backup: `2026-06-26 02:31:02`。 +- DR blocker remains: `escrow_missing=5`,不得偽造 evidence marker,也不得貼 secret value / hash / partial token。 +- Full-stack service state: `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`。Cold-start `PASS=89 WARN=0 BLOCKED=0`;StockPlatform freshness `status=ok`;MOMO daily freshness `1|2026-06-24`。 +- Route note: 06:26 full wrapper had one-time route `000` for IwoooS / VibeWork, but direct curl and route-only wrapper immediately returned `200` and `RESULT=GREEN`; v1.6 wrapper now retries routes before blocking. + +| Gate | Status | Evidence | +|------|--------|----------| +| 110 backup freshness | VERIFIED | 13/13 fresh, failed count 0. | +| 188 backup freshness | VERIFIED | 2/2 fresh, failed count 0. | +| Offsite / GDrive freshness | VERIFIED | `offsite_fresh=1`, `rclone_gdrive_fresh=1`. | +| Backup core blockers | GREEN | `core_blockers=0`. | +| Full-stack service state | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | Cold-start `PASS=89 WARN=0 BLOCKED=0`; core wrapper `PASS=15 WARN=2 BLOCKED=0`; route-only wrapper `PASS=31 WARN=0 BLOCKED=0`. | +| Credential escrow | BLOCKED | `escrow_missing=5`; only real non-secret owner evidence may close this. | + ## 2026-06-25 19:17 Backup / Offsite / Escrow Live Status Read-only evidence sources: `/backup/scripts/backup-status.sh --no-notify --no-refresh` from 110 at 19:17 Asia/Taipei, plus 19:05 post-start quick check and 19:05-19:06 route stability readback. 
diff --git a/docs/runbooks/FULL-STACK-COLD-START-SOP.md b/docs/runbooks/FULL-STACK-COLD-START-SOP.md index 05b05167..5e34cb45 100644 --- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md +++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md @@ -1,7 +1,7 @@ # AWOOOI 全棧冷啟動與主機重啟 SOP -> Version: v1.60 -> Last updated: 2026-06-25 Asia/Taipei +> Version: v1.61 +> Last updated: 2026-06-26 Asia/Taipei > Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path. --- @@ -12,6 +12,8 @@ 若只是重啟後要快速判斷能不能宣稱恢復,先跑一頁式總檢查:`scripts/reboot-recovery/post-start-quick-check.sh --no-color`,並以 `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` 作為人工 fallback。長 SOP 保留完整背景、例外處理與 Plan B;短版 wrapper / checklist 負責每次 T+10 分鐘內的固定判定。 +2026-06-26 06:26-06:28 隔日 read-only refresh:四主機 ping/SSH OK,cold-start `PASS=89 WARN=0 BLOCKED=0`,MOMO `V10.690` 且 latest import job `57 completed`,StockPlatform `/api/v1/system/freshness` 仍為 `status=ok` / `latest_trading_date=2026-06-25` / blockers `[]`,backup-status 110 `13/13 fresh failed=0`、188 `2/2 fresh failed=0`、`core_blockers=0`、`offsite_fresh=1`、`rclone_gdrive_fresh=1`、`last_backup_all=2026-06-26 02:31:02`、`escrow_missing=5`。06:26 full wrapper 首輪在 `https://awoooi.wooo.work/zh-TW/iwooos` 與 `https://vibework.wooo.work/` 出現單次 `000`,但獨立 curl 立即回 `200`,route-only wrapper 也回 `PASS=31 WARN=0 BLOCKED=0 RESULT=GREEN`;因此 v1.61 將 public route gate 改為最多 3 次 retry,只有連續失敗才算 `BLOCKED`,retry 後恢復則列為 evidence warning。06:28 core wrapper with routes skipped returns `POST_START_QUICK_CHECK PASS=15 WARN=2 BLOCKED=0`, `RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`。本次沒有 Docker/systemd/Nginx/firewall/K8s/DB/Wazuh runtime 寫操作。 + 2026-06-25 21:14 StockPlatform natural-cron / full-wrapper refresh supersedes the 20:25 product-data blocker wording. 
After waiting for official schedules instead of manual ingestion, `intelligence-sync` 21:00 finished `status=0`, `core.margin_short_daily` reached `2026-06-25` / 1976 rows, and `ai-recommendation-pipeline` 21:10 finished `STOCKPLATFORM_AI_RECOMMENDATION_PIPELINE_OK as_of_date=2026-06-25` with `draft_count=120`, `candidate_count=120`, and `rag_documents=1000`. StockPlatform `/api/v1/system/freshness` now returns `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`, with price / chips / margin / AI recommendations all on `2026-06-25`. The 21:14 full wrapper returns cold-start `PASS=89 WARN=0 BLOCKED=0` and overall `POST_START_QUICK_CHECK PASS=38 WARN=2 BLOCKED=0`, `RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. The only remaining recovery red gate is DR credential escrow evidence `escrow_missing=5`; Wazuh manager registry accepted remains `0` as a security evidence blocker, not a reboot service blocker. 2026-06-25 20:25 orphan Chrome cleanup / scorecard refresh supersedes the 20:11 CPU wording. 110 high CPU was traced to two `stockplatform-review-bulk-ux` Chrome process groups `2756503` and `2829627` with root Chrome process `PPID=1`, elapsed about 5h, no active parent smoke, and sustained GPU/renderer CPU. With user approval, only those two process groups received targeted `SIGTERM` at 20:24. Post-check showed no remaining PGID entries; `vmstat` showed CPU idle around `85-90%`, `si/so=0`, and no immediate swap thrash. No Docker/systemd/Nginx/firewall/K8s action, CI cancellation, manual data ingestion, manual DB write, Wazuh/SOC runtime change, or secret read was performed. The 20:25 full post-start wrapper then returned cold-start `PASS=89 WARN=0 BLOCKED=0`, but overall `POST_START_QUICK_CHECK PASS=37 WARN=2 BLOCKED=1`, `RESULT=BLOCKED`, because StockPlatform data freshness was still blocked at that time and DR remained incomplete. 
diff --git a/docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md b/docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md index 8eb4d4d6..6e74672e 100644 --- a/docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md +++ b/docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md @@ -1,7 +1,7 @@ # 主機重啟後一頁式總檢查 -> Version: v1.5 -> Last updated: 2026-06-25 Asia/Taipei +> Version: v1.6 +> Last updated: 2026-06-26 Asia/Taipei > Scope: 110 / 120 / 121 / 188 post-reboot service recovery. 112 Kali / Wazuh / active scan 不屬於本流程。 --- @@ -10,7 +10,7 @@ 每次 110 / 120 / 121 / 188 任一台主機開機、關機、重啟、斷電恢復、VMware console fsck、Docker / K3s 大量重排後,都先跑本頁,再決定是否宣稱恢復。 -最新基準:2026-06-25 21:14 full wrapper `POST_START_QUICK_CHECK PASS=38 WARN=2 BLOCKED=0`,warning split `SERVICE=0 BOUNDARY=1 EVIDENCE=1`,Result `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`。StockPlatform 21:10 自然 AI pipeline 已補到 `as_of_date=2026-06-25`,`/api/v1/system/freshness` 為 `status=ok`;DR 仍因 `escrow_missing=5` 不可宣稱 complete。 +最新基準:2026-06-26 06:26-06:28 read-only refresh。Cold-start `PASS=89 WARN=0 BLOCKED=0`;MOMO `V10.690`、latest import job `57 completed`、`DB_DAILY_FRESHNESS 1|2026-06-24`;StockPlatform `/api/v1/system/freshness` 為 `status=ok`、`latest_trading_date=2026-06-25`、blockers `[]`;backup-status 110 `13/13 fresh failed=0`、188 `2/2 fresh failed=0`、`core_blockers=0`、`offsite_fresh=1`、`rclone_gdrive_fresh=1`、`last_backup_all=2026-06-26 02:31:02`。DR 仍因 `escrow_missing=5` 不可宣稱 complete。06:26 full wrapper 的 `iwooos` / `vibework` 單次 route `000` 已由獨立 curl 與 route-only wrapper 確認為 transient;v1.6 起 public route gate 會 retry,只有連續失敗才算 `BLOCKED`。 本頁只回答四件事: @@ -51,6 +51,8 @@ scripts/reboot-recovery/post-start-quick-check.sh --no-color 此 wrapper 只做 read-only 檢查,並委派既有 cold-start / MOMO preflight / backup-status;不 restart、不 reload、不 import、不改 K8s、不讀 token 內容。wrapper 會把 warning 分成 `SERVICE`、`BOUNDARY`、`EVIDENCE` 三類,避免把 `escrow_missing>0` 誤判成服務降級。若 wrapper 因某個 SSH 權限或路徑失敗,再依下列分段命令手動補證據。 +Public route gate 自 v1.6 起會使用 `ROUTE_RETRY_ATTEMPTS`(預設 `3`)與 `ROUTE_RETRY_DELAY_SECONDS`(預設 
`2`)重試。單次 `000` / timeout 若 retry 後恢復,應列為 evidence warning 或 transient route evidence,不可直接當成網站仍壞;只有連續失敗才是 service blocker。 + Wrapper 必須先解析 cold-start summary,不可只看 cold-start exit code: - cold-start `BLOCKED>0`:wrapper 才可判定 `BLOCKED`。 diff --git a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md index d0a8548a..c4c0ae4a 100644 --- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md +++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md @@ -11,11 +11,11 @@ | Area | Status | Completion | Evidence | |------|--------|------------|----------| -| Overall recovery readiness | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-25 21:14 full post-start wrapper showed hosts / K3s / AWOOOI / public routes / MOMO / StockPlatform / backup / offsite service and data gates green. Cold-start returned `PASS=89 WARN=0 BLOCKED=0`; wrapper returned `POST_START_QUICK_CHECK PASS=38 WARN=2 BLOCKED=0`, warning split `SERVICE=0 BOUNDARY=1 EVIDENCE=1`, result `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. 20:24 targeted approved `SIGTERM` cleared orphan 110 `stockplatform-review-bulk-ux` Chrome PGIDs `2756503` and `2829627`; 21:14 CPU attribution shows current load is active AWOOOI Web `next build`, not orphan Chrome. StockPlatform Gitea/live source is `fb91aa4c6272469d1d26e0820169629eac17d28a`; 21:00 `intelligence-sync` succeeded, 21:10 `ai-recommendation-pipeline` produced `STOCKPLATFORM_AI_RECOMMENDATION_PIPELINE_OK as_of_date=2026-06-25`, and `/api/v1/system/freshness` is now `status=ok` with blockers `[]`. MOMO remains fresh through `2026-06-24` with latest job `57` completed cleanly. Do not declare DR complete until `escrow_missing=0`; Wazuh manager registry accepted remains `0` and is a security evidence blocker, not a reboot service blocker. 
| +| Overall recovery readiness | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-26 06:26-06:28 read-only refresh confirms the 2026-06-25 21:14 green baseline still holds. Four hosts ping/SSH OK; cold-start `PASS=89 WARN=0 BLOCKED=0`; MOMO health `V10.690`, latest import job `57 completed`, daily freshness `1|2026-06-24`; StockPlatform freshness `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; backup-status 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`. The 06:26 full wrapper saw one-time route `000` on IwoooS / VibeWork, but independent curl and route-only wrapper immediately returned `200` / `PASS=31 WARN=0 BLOCKED=0`; v1.61/v1.6 now retries public routes before blocking. 06:28 core wrapper with routes skipped returned `PASS=15 WARN=2 BLOCKED=0`, result `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. Do not declare DR complete until `escrow_missing=0`; Wazuh manager registry accepted remains `0` and is a security evidence blocker, not a reboot service blocker. | | P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-25 09:05 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, VIP `192.168.0.125` is present, node filesystem / disk-pressure / readonly events are `0`, and latest `km-vectorize-29705460-55rgs` completed. 
| | P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-25 19:17 backup readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`, last aggregate `2026-06-25 02:35:09`。2026-06-25 19:19 offsite escrow report shows script presence OK, rclone configured, full and partial rclone markers present, `PASS=8 WARN=5 BLOCKED=0`, `ESCROW_MISSING_COUNT=5`; DR remains blocked on real non-secret credential escrow evidence IDs. | | P2 service / data truth | DONE | 100% | Service routes and core runtime are available, 110 orphan Chrome CPU pressure is cleared, and StockPlatform cron-source drift is repaired. 2026-06-25 21:13 StockPlatform `/api/v1/system/freshness` returned `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; price / chips / margin / AI recommendations are all on `2026-06-25`. `ai.recommendations` row count is `2868`; `core.margin_short_daily` row count is `1976`. MOMO health `V10.690`, current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, and `DB_DAILY_FRESHNESS 1|2026-06-24` are green; expanded public routes are green. 
| -| P3 docs / automation contracts | DONE_WITH_PRODUCT_DATA_GREEN_V160 | 100% | Workplan, SOP v1.60, one-page post-start quick check v1.5, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, StockPlatform 21:00 / 21:10 natural schedule green evidence, 110 orphan Chrome recurrence cleanup evidence, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and 2026-06-25 stricter product-data gate are updated. Live 110 script sync remains a separate approved live-write gate; do not claim it here. 
| +| P3 docs / automation contracts | DONE_WITH_ROUTE_RETRY_V161 | 100% | Workplan, SOP v1.61, one-page post-start quick check v1.6, route retry gate, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, StockPlatform 21:00 / 21:10 natural schedule green evidence, 110 orphan Chrome recurrence cleanup evidence, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and stricter product-data / route retry gates are updated. Live 110 script sync remains a separate approved live-write gate; do not claim it here. | 2026-06-25 19:06 post-CD wrapper readback supersedes the 18:53 wording: consecutive main pushes created a deploy storm where older deploy markers were superseded by later commits. 
Latest production truth is deploy marker `d8ca8224 chore(cd): deploy 9dbe044 [skip ci]`, ArgoCD `Synced / Healthy`, API/Web/Worker image tag `9dbe044ea1e8e3894ccbeb5ed760bb124b87f7be`, direct route smoke 200 for AWOOOI API / IwoooS / VibeWork / AwoooGo / MOMO health / Stock / Bitan and expected route-gate statuses for MOMO / Gitea / Harbor / Registry / Sentry / SigNoz / Langfuse / AIOps, and wrapper `POST_START_QUICK_CHECK PASS=18 WARN=3 BLOCKED=0`. Repo-side cold-start returns `PASS=89 WARN=0 BLOCKED=0`; `/backup/scripts/backup-status.sh --no-notify --no-refresh` reports 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`; MOMO dedicated preflight returns `PASS=19 WARN=2 BLOCKED=0`; MOMO health is `V10.690`; AwoooGo / Stock transient 502 reads cleared after upstream warmup and five consecutive route reads returned `200`; 110 load is around `14.51 / 12.34 / 11.42`, with Gitea Actions cache save / `zstdmt` / `tar`, StockPlatform headless Chrome smoke / CI, Gitea, AWOOOI API, ClickHouse, Docker, and platform services visible, not an AWOOOI service blocker. Wrapper result is `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, not `DEGRADED`, because service warnings are `0` and only DR boundary / evidence warnings remain. Wazuh route readback is now `200 disabled_waiting_iwooos_wazuh_owner_gate`, but manager registry accepted remains `0`, so Wazuh is a security registry evidence blocker rather than a reboot service blocker.
diff --git a/scripts/reboot-recovery/post-start-quick-check.sh b/scripts/reboot-recovery/post-start-quick-check.sh
index 5cebab85..889ca025 100755
--- a/scripts/reboot-recovery/post-start-quick-check.sh
+++ b/scripts/reboot-recovery/post-start-quick-check.sh
@@ -7,6 +7,8 @@ set -uo pipefail
 
 ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
 SSH_CONNECT_TIMEOUT="${SSH_CONNECT_TIMEOUT:-6}"
+ROUTE_RETRY_ATTEMPTS="${ROUTE_RETRY_ATTEMPTS:-3}"
+ROUTE_RETRY_DELAY_SECONDS="${ROUTE_RETRY_DELAY_SECONDS:-2}"
 RUN_COLD_START=1
 RUN_MOMO=1
 RUN_STOCK=1
@@ -71,6 +73,10 @@ Options:
   --no-color  Disable ANSI color.
   -h, --help  Show this help.
 
+Environment:
+  ROUTE_RETRY_ATTEMPTS       Public route attempts before blocking. Default: 3.
+  ROUTE_RETRY_DELAY_SECONDS  Delay between failed public route attempts. Default: 2.
+
 Exit codes:
   0 = no service blockers. Boundary / evidence warnings may still be present.
   1 = service warnings only.
@@ -348,13 +354,30 @@ fi
 if [[ "$RUN_ROUTES" -eq 1 ]]; then
   section "Public routes"
   for url in "${ROUTES[@]}"; do
-    code="$(curl -k -sS -o /dev/null -w '%{http_code}' --max-time 12 "$url" 2>/dev/null || true)"
+    code=""
+    attempt=1
+    while [[ "$attempt" -le "$ROUTE_RETRY_ATTEMPTS" ]]; do
+      code="$(curl -k -sS -o /dev/null -w '%{http_code}' --max-time 12 "$url" 2>/dev/null || true)"
+      case "$code" in
+        2*|3*)
+          break
+          ;;
+      esac
+      if [[ "$attempt" -lt "$ROUTE_RETRY_ATTEMPTS" ]]; then
+        sleep "$ROUTE_RETRY_DELAY_SECONDS"
+      fi
+      attempt=$((attempt + 1))
+    done
     case "$code" in
       2*|3*)
-        ok "$code $url"
+        if [[ "$attempt" -gt 1 ]]; then
+          evidence_warn "$code $url recovered_after_attempt=$attempt"
+        else
+          ok "$code $url"
+        fi
         ;;
       *)
-        blocked "${code:-curl_failed} $url"
+        blocked "${code:-curl_failed} $url attempts=$ROUTE_RETRY_ATTEMPTS"
         ;;
     esac
   done
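
The retry gate in the final hunk can be exercised in isolation. The following is a minimal standalone sketch, not part of the patch: `check_route` and `fetch_code` are hypothetical names introduced here so the probe can be stubbed, the `OK` / `EVIDENCE_WARN` / `BLOCKED` prefixes stand in for the wrapper's own `ok`, `evidence_warn`, and `blocked` helpers, and the non-zero return on a blocked route is an assumption rather than the wrapper's real exit-code contract.

```shell
#!/usr/bin/env bash
# Standalone sketch of the public route retry gate (assumed helper names).

ROUTE_RETRY_ATTEMPTS="${ROUTE_RETRY_ATTEMPTS:-3}"
ROUTE_RETRY_DELAY_SECONDS="${ROUTE_RETRY_DELAY_SECONDS:-2}"

# Hypothetical probe hook: prints the HTTP status code for a URL.
# Uses the same curl invocation as the patched script; curl prints 000
# when no HTTP response was received.
fetch_code() {
  curl -k -sS -o /dev/null -w '%{http_code}' --max-time 12 "$1" 2>/dev/null || true
}

# Probe one URL up to ROUTE_RETRY_ATTEMPTS times and print a verdict line.
check_route() {
  local url="$1" code="" attempt=1
  while [[ "$attempt" -le "$ROUTE_RETRY_ATTEMPTS" ]]; do
    code="$(fetch_code "$url")"
    case "$code" in
      2*|3*)
        # Success on a later attempt is evidence of a transient failure,
        # not a service blocker.
        if [[ "$attempt" -gt 1 ]]; then
          echo "EVIDENCE_WARN $code $url recovered_after_attempt=$attempt"
        else
          echo "OK $code $url"
        fi
        return 0
        ;;
    esac
    # Sleep only between attempts, never after the last one.
    if [[ "$attempt" -lt "$ROUTE_RETRY_ATTEMPTS" ]]; then
      sleep "$ROUTE_RETRY_DELAY_SECONDS"
    fi
    attempt=$((attempt + 1))
  done
  # Every attempt failed: only now is the route treated as blocked.
  echo "BLOCKED ${code:-curl_failed} $url attempts=$ROUTE_RETRY_ATTEMPTS"
  return 1
}
```

Sourcing this file and calling `check_route https://vibework.wooo.work/` prints a single verdict line. Overriding `fetch_code` with a stub makes the three outcomes (first-try success, recovered after retry, blocked after consecutive failures) reproducible without touching the network.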