fix(ops): classify cold-start warning-only quick checks [skip ci]
@@ -1,3 +1,30 @@
## 2026-06-25|15:04 post-start wrapper cold-start WARN classification fix and live readback

**Background**: At 15:02, a read-only verification run of the latest-main `post-start-quick-check.sh --no-color` produced a cold-start result of `PASS=88 WARN=1 BLOCKED=0`, but the wrapper looked only at the cold-start exit code and misclassified the WARN-only run as `BLOCKED`. That inflates a rollout / stale warning into a service blocker, which contradicts the SOP's layered-judgment goal.

**Fix**:
- The cold-start block of `post-start-quick-check.sh` now parses the `PASS / WARN / BLOCKED` summary first.
- The wrapper reports `BLOCKED` only when cold-start has `BLOCKED>0`.
- Cold-start `WARN>0 BLOCKED=0` counts only as a `SERVICE` warning; the wrapper returns at most `DEGRADED` and must not escalate it to `BLOCKED`.
- Only cold-start `WARN=0 BLOCKED=0` proceeds to the service green / DR boundary judgment.
- `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` now documents this rule.

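The rule above can be read as a small decision function. The following is a minimal sketch, assuming a hypothetical helper `classify_cold_start` (it is not a function in `post-start-quick-check.sh`) and hypothetical output labels. It takes the last summary line and the cold-start exit code, and prints the class the wrapper would record.

```bash
# Hypothetical helper for illustration only.
# Usage: classify_cold_start "<summary line, may be empty>" <exit code>
classify_cold_start() (
  summary="$1"
  rc="$2"
  if [ -z "$summary" ]; then
    # No parsable summary: the exit code is the only signal left.
    if [ "$rc" -eq 0 ]; then echo SERVICE_WARN; else echo BLOCKED; fi
    exit 0
  fi
  # Extract the integers that follow WARN= and BLOCKED=.
  warn="$(printf '%s\n' "$summary" | sed -n 's/.*WARN=\([0-9][0-9]*\).*/\1/p')"
  blocked="$(printf '%s\n' "$summary" | sed -n 's/.*BLOCKED=\([0-9][0-9]*\).*/\1/p')"
  if [ "${blocked:-0}" -gt 0 ]; then
    echo BLOCKED            # real blockers always win
  elif [ "${warn:-0}" -gt 0 ]; then
    echo SERVICE_WARN       # warning-only: at most DEGRADED
  elif [ "$rc" -eq 0 ]; then
    echo OK
  else
    echo EVIDENCE_WARN      # clean summary but non-zero exit
  fi
)

classify_cold_start 'PASS=88 WARN=1 BLOCKED=0' 1   # prints SERVICE_WARN
```

With the 15:02 input (`WARN=1`, non-zero exit) the sketch yields a service warning rather than a blocker, which is the behavior this entry describes.
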
**15:04 live read-only wrapper evidence**:
- `scripts/reboot-recovery/post-start-quick-check.sh --no-color` returned exit code `0`.
- Hosts / SSH: ping and the SSH port are OK on 110, 120, 121, and 188.
- Cold-start: `PASS=89 WARN=0 BLOCKED=0`, Result `GREEN`. The transient rollout `BAD_PODS` warning from the previous run has converged to `BAD_PODS 0`.
- MOMO: health `V10.681`, dedicated preflight `PASS=19 WARN=2 BLOCKED=0`; latest job `57 completed`, `DB_DAILY_FRESHNESS 1|2026-06-24`, current-month DB parity through `2026-06-24`.
- Backup: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`.
- Routes: AWOOOI API, IwoooS, MOMO health, and Stock direct smoke all returned `200`.
- 110 CPU: load is about `8.99 / 5.76 / 4.89`; the top processes are an active `next build` and Playwright e2e smoke under the Gitea Actions / CD workflow. Chrome is an active child of Playwright, not orphan Chrome; no kill / restart was performed.
- Wrapper summary: `POST_START_QUICK_CHECK PASS=18 WARN=3 BLOCKED=0`, `POST_START_QUICK_CHECK_WARNINGS SERVICE=0 BOUNDARY=1 EVIDENCE=2`, `RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`.

**Judgment**:
- May claim: after the reboot, the service surface / public routes / K3s / MOMO data freshness / backup core have recovered; the latest-main quick check correctly separates rollout WARN-only results, the DR boundary, and evidence notes.
- Must not claim: DR complete, credential escrow complete, Wazuh manager registry accepted, or Wazuh active response / host write / runtime security gate. `escrow_missing=5` must remain a DR blocker, and markers must not be forged.

**Boundary**: This round covered only the read-only wrapper live run, the repo-side script / docs fix, and the guard. There were no Docker / systemd / Nginx / firewall / K8s / ArgoCD / Wazuh runtime write operations, no import, and no token read.

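## 2026-06-25|14:41 post-start quick check live wrapper classification readback

**Background**: The first `post-start-quick-check.sh` live run counted the expected `escrow_missing=5` and the MOMO non-service warnings together as `DEGRADED`, which makes the reboot SOP look as if it always falls slightly short. That does not match this round's goal: service recovery, data freshness, backup health, DR escrow, and Wazuh registry must each be judged at its own layer.
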
@@ -1,6 +1,6 @@
# AWOOOI Full-Stack Cold-Start and Host Reboot SOP

-> Version: v1.53
+> Version: v1.54
> Last updated: 2026-06-25 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.

@@ -12,21 +12,21 @@
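If you only need a quick post-reboot decision on whether recovery can be claimed, first run the one-page overall check, `scripts/reboot-recovery/post-start-quick-check.sh --no-color`, and use `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` as the manual fallback. The long SOP keeps the full background, exception handling, and Plan B; the short wrapper / checklist handles the fixed judgment made within T+10 minutes of every reboot.
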
2026-06-25 14:41 post-start wrapper live read-only refresh supersedes the first wrapper wording. Hosts, routes, K3s, AWOOOI API health, MOMO service health, MOMO business data freshness, backup core/offsite, and core monitoring/exporter surfaces are green for controlled runner/CD release. MOMO is healthy on `V10.676`; latest import job `57` completed cleanly; `MOMO_DAILY_FRESHNESS 1|2026-06-24`; current-month daily snapshot and realtime tables match through `2026-06-24`. `post-start-quick-check.sh` now separates `SERVICE / BOUNDARY / EVIDENCE` warnings and returns `RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED` when service blockers are zero but `escrow_missing=5` remains. Do not turn this into a DR complete or security/runtime acceptance claim. Wazuh host registry acceptance remains outside this SOP lane and is still not complete.
2026-06-25 15:04 post-start wrapper live read-only refresh supersedes the 14:41 wrapper wording. Hosts, routes, K3s, AWOOOI API health, MOMO service health, MOMO business data freshness, backup core/offsite, and core monitoring/exporter surfaces are green for controlled runner/CD release. MOMO is healthy on `V10.681`; latest import job `57` completed cleanly; `MOMO_DAILY_FRESHNESS 1|2026-06-24`; current-month daily snapshot and realtime tables match through `2026-06-24`. `post-start-quick-check.sh` now parses cold-start `PASS / WARN / BLOCKED` summary before classifying exit codes, so WARN-only rollout/stale evidence is no longer inflated into a service blocker. The latest wrapper returns `RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED` when service blockers are zero but `escrow_missing=5` remains. Do not turn this into a DR complete or security/runtime acceptance claim. Wazuh host registry acceptance remains outside this SOP lane and is still not complete.
```text
Repo-side reboot SOP / Plan B / automation contracts: COMPLETE, 100%.
Live cold-start read-only check: 2026-06-25 14:41 wrapper delegated cold-start PASS=89 WARN=0 BLOCKED=0, Result=GREEN.
Post-start quick check: 2026-06-25 14:41 PASS=18 WARN=2 BLOCKED=0; warning split SERVICE=0 BOUNDARY=1 EVIDENCE=1; Result=FULL_STACK_GREEN_DR_ESCROW_BLOCKED; exit code 0.
Live cold-start read-only check: 2026-06-25 15:04 wrapper delegated cold-start PASS=89 WARN=0 BLOCKED=0, Result=GREEN.
Post-start quick check: 2026-06-25 15:04 PASS=18 WARN=3 BLOCKED=0; warning split SERVICE=0 BOUNDARY=1 EVIDENCE=2; Result=FULL_STACK_GREEN_DR_ESCROW_BLOCKED; exit code 0.
Repo-side cold-start v1.42+ live read-only run: MOMO source absence / stale data blocker is cleared by import job 57 and `MOMO_DAILY_FRESHNESS 1|2026-06-24`. Live 110 script sync is not claimed until a separate approved deployment/sync happens.
110 live-sync parity: 2026-06-24 23:15 read-only `verify-cold-start-monitor-deploy.sh` correctly BLOCKED because repo script hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`. Do not use live 110 monitor output to prove v1.42 behavior until the approved live-sync gate in §13.3.1 passes.
Service state: FULL_STACK_GREEN_DR_ESCROW_BLOCKED; 110/120/121/188 reachable, K3s mon/mon1 Ready, public routes/TLS green, MOMO data fresh, 110/188 backup health fresh, 188 node-exporter / PostgreSQL exporter / Redis exporter restored, 188 MinIO endpoint and Velero BackupStorageLocation restored, 110 disk pressure cleared.
Runtime release state: API/Web/Worker are ready; production API health returns healthy with `environment=prod`, `mock_mode=false`, and postgresql / redis / openclaw / signoz / gcp ollama providers up. 14:16 direct route smoke returned 200 for AWOOOI API, `/zh-TW/iwooos`, MOMO health, and Stock; cold-start raw route gate returned all expected route statuses, including redirects such as awoooi web=307 and sentry=302.
MOMO release state: mo.wooo.work health is healthy on version V10.676 after 14:14-14:15 lifecycle / replacement events observed via Docker metadata. `momo-pro-system`, `momo-scheduler`, and `momo-telegram-bot` are healthy; scheduler `RestartCount=0`. 14:41 dedicated preflight returns PASS=19 WARN=2 BLOCKED=0, so retain stability / evidence notes, but no service blocker remains.
MOMO release state: mo.wooo.work health is healthy on version V10.681. `momo-pro-system`, `momo-scheduler`, and `momo-telegram-bot` are healthy; scheduler `RestartCount=0`. 15:04 dedicated preflight returns PASS=19 WARN=2 BLOCKED=0, so retain scheduler fail-closed / notification evidence notes, but no service blocker remains.
MOMO data state: current-month daily_sales_snapshot and realtime_sales_monthly match through 2026-06-24: `daily_sales_snapshot=109061|2025-07-01|2026-06-24`, `MOMO_MONTHLY_SYNC 15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, and `MOMO_DAILY_FRESHNESS 1|2026-06-24`. Latest import job is `57 completed|即時業績_當日.xlsx|2026-06-25T13:16:47.359958|2026-06-25T13:18:02.964985|15383|15383|0`.
Google Drive / source-file state: 14:16 cold-start reports `MOMO_GDRIVE_TOKEN_STAT 100000:100000:600 scheduler_uid=100000`. Dedicated preflight confirms host token metadata matches scheduler UID and restrictive mode; container token artifact exists with mode `600`. Token content was not read. Future Drive auth/API failure must still be treated as failed import evidence rather than no-file success.
110 CPU/load readback: 2026-06-25 10:58 user-approved minimal SIGTERM targeted only orphan `stockplatform-review-bulk-ux` Chrome process groups `438005`, `471295`, `640155`, and `670628`; `OLD_GROUPS_REMAINING` returned empty. 11:20 readback shows remaining CPU is active `stockplatform-product-ux-smoke.mjs` with parent node process plus install/build work, not orphan Chrome. No Docker/systemd/Nginx/firewall/K8s write was performed; do not cancel active CI/smoke unless separately approved.
Backup / monitoring state: 14:41 wrapper readback confirms backup core blockers are 0, 110 is 13/13 fresh failed=0, 188 is 2/2 fresh failed=0, offsite_fresh=1, rclone_gdrive_fresh=1, integrity_stale=0, last aggregate is 2026-06-25 02:35:09, and escrow_missing=5.
110 CPU/load readback: 2026-06-25 10:58 user-approved minimal SIGTERM targeted only orphan `stockplatform-review-bulk-ux` Chrome process groups `438005`, `471295`, `640155`, and `670628`; `OLD_GROUPS_REMAINING` returned empty. 15:04 readback shows current higher load is active Gitea Actions / CD `next build` and Playwright e2e smoke, with Chrome as an active child of Playwright, not orphan Chrome. No Docker/systemd/Nginx/firewall/K8s write was performed; do not cancel active CI/smoke unless separately approved.
Backup / monitoring state: 15:04 wrapper readback confirms backup core blockers are 0, 110 is 13/13 fresh failed=0, 188 is 2/2 fresh failed=0, offsite_fresh=1, rclone_gdrive_fresh=1, integrity_stale=0, last aggregate is 2026-06-25 02:35:09, and escrow_missing=5.
Notification-noise state: healthy AWOOOI heartbeat is suppressed; heartbeat warning dedupe uses stable actionable fingerprints so HTTP status / timeout / latency drift does not create a new Telegram event every 30 minutes; MOMO Pro monitor uses https://mo.wooo.work/health as primary truth and no longer checks the 188 root path; MoWoooWorkDown now labels component=momo-pro-system and requires public/local/container/data-freshness triage instead of blind restart; docker-health-monitor keeps 5-minute repair cadence but has a separate 30-minute Telegram fallback cooldown; Bitan public-content check keeps failure alerting with same-fingerprint cooldown and one recovery notice.
Monitoring coverage recovery state: if CD post-deploy fails only because `scripts/generate_monitoring.py --check` reports `nginx-exporter` down on `192.168.0.188:9113`, first verify 188 `stub_status` and restore the stateless exporter with `scripts/ops/188-nginx-exporter-restore.sh`; do not reload Nginx or restart product containers for this symptom.
Allowed declaration: full-stack service readiness is GREEN for controlled runner/CD release; core hosts, routes, K3s, backup/exporter surfaces, AWOOOI API health, MOMO service health, and MOMO data freshness are green for the latest read-only evidence set.
@@ -49,6 +49,12 @@ scripts/reboot-recovery/post-start-quick-check.sh --no-color
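This wrapper performs read-only checks only and delegates to the existing cold-start / MOMO preflight / backup-status scripts; it does not restart, reload, import, change K8s, or read token content. It splits warnings into three classes, `SERVICE`, `BOUNDARY`, and `EVIDENCE`, so that `escrow_missing>0` is not misjudged as service degradation. If the wrapper fails because of an SSH permission or path problem, collect the evidence manually with the segmented commands below.

The wrapper must parse the cold-start summary first and must not rely on the cold-start exit code alone:

- Cold-start `BLOCKED>0`: only then may the wrapper report `BLOCKED`.
- Cold-start `WARN>0 BLOCKED=0`: the wrapper classifies it as a `SERVICE` warning and the result is at most `DEGRADED`; it must not be escalated to `BLOCKED`.
- Cold-start `WARN=0 BLOCKED=0`: only then may the run proceed to the `FULL_STACK_GREEN` / `FULL_STACK_GREEN_DR_ESCROW_BLOCKED` judgment.
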
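How the warning classes might fold into the final `RESULT` can be sketched as follows. This is an illustration only: `wrapper_result` is a hypothetical helper, and the mapping (any `BOUNDARY` warning yields `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, while `EVIDENCE` warnings never change the result) is an assumption inferred from the recorded runs, not a copy of the wrapper's logic.

```bash
# Hypothetical helper for illustration only.
# Usage: wrapper_result <blocked count> <SERVICE warnings> <BOUNDARY warnings>
wrapper_result() (
  blocked="$1"
  service="$2"
  boundary="$3"
  if [ "$blocked" -gt 0 ]; then
    echo BLOCKED
  elif [ "$service" -gt 0 ]; then
    echo DEGRADED
  elif [ "$boundary" -gt 0 ]; then
    # Assumed: the DR escrow boundary is the only BOUNDARY source.
    echo FULL_STACK_GREEN_DR_ESCROW_BLOCKED
  else
    echo FULL_STACK_GREEN
  fi
)

wrapper_result 0 0 1   # prints FULL_STACK_GREEN_DR_ESCROW_BLOCKED
```

The ordering matters: service-level signals are tested before boundary signals, so a DR escrow gap can never mask a real service warning.
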
### Step 1 - Hosts and SSH

```bash
@@ -177,6 +183,17 @@ CPU / runaway:orphan=? active_ci=? load=?
## 6. Latest verified baseline

2026-06-25 15:04 wrapper live run:

- Wrapper: `POST_START_QUICK_CHECK PASS=18 WARN=3 BLOCKED=0`.
- Warning split: `SERVICE=0 BOUNDARY=1 EVIDENCE=2`.
- Result: `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, exit code `0`.
- Cold-start: `PASS=89 WARN=0 BLOCKED=0`, Result `GREEN`.
- MOMO: `V10.681`, dedicated preflight `PASS=19 WARN=2 BLOCKED=0`, job `57` clean, `DB_DAILY_FRESHNESS 1|2026-06-24`.
- Backup: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`.
- DR: `escrow_missing=5`; DR complete must not be claimed.
- CPU: 110 shows active Gitea Actions / CD `next build` / Playwright e2e smoke; Chrome is a Playwright child, not orphan Chrome.

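In both recorded runs the warning split adds up to the wrapper's `WARN` total (0+1+2=3 at 15:04, 0+1+1=2 at 14:41). Someone copying these lines into a ledger could sanity-check that with a sketch like the one below. `field` and `split_matches_total` are hypothetical helpers, and treating the sum as an invariant is an assumption based on those two runs only.

```bash
# Hypothetical helpers for illustration only.
# field <KEY> <line>: print the integer that follows KEY= in the line.
field() ( printf '%s\n' "$2" | sed -n "s/.*$1=\([0-9][0-9]*\).*/\1/p" )

# split_matches_total <summary line> <warning split line>
# Succeeds when SERVICE + BOUNDARY + EVIDENCE equals WARN; exits 2 on a parse failure.
split_matches_total() (
  total="$(field WARN "$1")"
  s="$(field SERVICE "$2")"
  b="$(field BOUNDARY "$2")"
  e="$(field EVIDENCE "$2")"
  [ -n "$total" ] && [ -n "$s" ] && [ -n "$b" ] && [ -n "$e" ] || exit 2
  [ "$((s + b + e))" -eq "$total" ]
)

split_matches_total \
  'POST_START_QUICK_CHECK PASS=18 WARN=3 BLOCKED=0' \
  'POST_START_QUICK_CHECK_WARNINGS SERVICE=0 BOUNDARY=1 EVIDENCE=2' \
  && echo "split consistent"
```
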
2026-06-25 14:41 wrapper live run:

- Wrapper: `POST_START_QUICK_CHECK PASS=18 WARN=2 BLOCKED=0`.

@@ -11,15 +11,15 @@
| Area | Status | Completion | Evidence |
|------|--------|------------|----------|
| Overall recovery readiness | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-25 14:41 post-start quick check returned exit `0`, `POST_START_QUICK_CHECK PASS=18 WARN=2 BLOCKED=0`, warning split `SERVICE=0 BOUNDARY=1 EVIDENCE=1`, result `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`: 110 / 120 / 121 / 188 ping and SSH port are OK, K3s `mon` / `mon1` are Ready, public routes/TLS are green, AWOOOI API health is healthy/prod/mock=false, delegated cold-start is `PASS=89 WARN=0 BLOCKED=0`, MOMO service health is healthy on `V10.676`, MOMO data freshness is `1|2026-06-24`, 110 / 188 runtime and backup checks are green。MOMO latest valid job `57` completed cleanly at `2026-06-25T13:18:02`, `15383/15383/0`, and current-month snapshot / realtime bounds match through `2026-06-24`. DR remains blocked because credential escrow evidence markers are still missing (`escrow_missing=5`) and must not be forged. |
| Overall recovery readiness | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-25 15:04 post-start quick check returned exit `0`, `POST_START_QUICK_CHECK PASS=18 WARN=3 BLOCKED=0`, warning split `SERVICE=0 BOUNDARY=1 EVIDENCE=2`, result `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`: 110 / 120 / 121 / 188 ping and SSH port are OK, K3s `mon` / `mon1` are Ready, public routes/TLS are green, AWOOOI API health is healthy/prod/mock=false, delegated cold-start is `PASS=89 WARN=0 BLOCKED=0`, MOMO service health is healthy on `V10.681`, MOMO data freshness is `1|2026-06-24`, 110 / 188 runtime and backup checks are green。MOMO latest valid job `57` completed cleanly at `2026-06-25T13:18:02`, `15383/15383/0`, and current-month snapshot / realtime bounds match through `2026-06-24`. DR remains blocked because credential escrow evidence markers are still missing (`escrow_missing=5`) and must not be forged. |
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-25 09:05 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, VIP `192.168.0.125` is present, node filesystem / disk-pressure / readonly events are `0`, and latest `km-vectorize-29705460-55rgs` completed. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-25 09:05 backup / alert readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`, last aggregate `2026-06-25 02:35:09`. DR remains blocked on real non-secret credential escrow evidence IDs. |
| P2 service / data truth | GREEN | 100% | Public route/TLS, API/Web route, MOMO health `V10.676`, MOMO main / CD `#904` monthly-sync failure boundary, MOMO main / CD `#910` Drive-auth fail-closed boundary, direct 14:41 wrapper public route smoke all expected 2xx/3xx, current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, backup exporters, schedules, K3s node readiness/storage conditions, VIP, and 110 / 188 runtime health are green. 14:41 preflight confirms app / scheduler / Telegram bot healthy, scheduler restart count `0`, token metadata aligned to scheduler UID, latest job `57` completed cleanly, and `DB_DAILY_FRESHNESS 1|2026-06-24`. |
| P3 docs / automation contracts | DONE_WITH_WRAPPER_LIVE_READBACK | 100% | Workplan, SOP v1.53, one-page post-start quick check wrapper + fallback runbook, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, 188 DB/Redis exporter restore helper, 188 MinIO/Velero restore helper, 188 nginx-exporter restore helper, 110 Docker disk pressure cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, MOMO Pro false-noise health monitor source-of-truth, docker-health direct Telegram fallback cooldown, Bitan public-content same-fingerprint cooldown, notification-noise readback, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side cold-start v1.42 source absence classifier, live-sync parity gate, MOMO import-boundary production deploy, MOMO Drive-auth fail-closed production deploy, 10:04 scheduler fail-closed live proof, 10:35 route / DB / backup refresh, 11:44 MOMO dedicated preflight blocked readback, 14:16 MOMO dedicated preflight recovery on 
V10.674 / job 57 / freshness 1, 14:41 wrapper warning split `SERVICE=0 BOUNDARY=1 EVIDENCE=1`, 10:58 user-approved 110 orphan Chrome SIGTERM evidence, MacBook Pro Codex safe artifact sync readback, and 2026-06-25 live refresh with full cold-start GREEN are updated. 2026-06-24 23:15 read-only verify still shows repo cold-start hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`; live 110 script sync of the v1.42 classifier is not claimed until separately approved and recorded. |
| P2 service / data truth | GREEN | 100% | Public route/TLS, API/Web route, MOMO health `V10.681`, MOMO main / CD `#904` monthly-sync failure boundary, MOMO main / CD `#910` Drive-auth fail-closed boundary, direct 15:04 wrapper public route smoke all expected 2xx/3xx, current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, backup exporters, schedules, K3s node readiness/storage conditions, VIP, and 110 / 188 runtime health are green. 15:04 preflight confirms app / scheduler / Telegram bot healthy, scheduler restart count `0`, token metadata aligned to scheduler UID, latest job `57` completed cleanly, and `DB_DAILY_FRESHNESS 1|2026-06-24`. |
| P3 docs / automation contracts | DONE_WITH_WRAPPER_COLD_START_WARN_CLASSIFIER | 100% | Workplan, SOP v1.54, one-page post-start quick check wrapper + fallback runbook, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, 188 DB/Redis exporter restore helper, 188 MinIO/Velero restore helper, 188 nginx-exporter restore helper, 110 Docker disk pressure cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, MOMO Pro false-noise health monitor source-of-truth, docker-health direct Telegram fallback cooldown, Bitan public-content same-fingerprint cooldown, notification-noise readback, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side cold-start v1.42 source absence classifier, live-sync parity gate, MOMO import-boundary production deploy, MOMO Drive-auth fail-closed production deploy, 10:04 scheduler fail-closed live proof, 10:35 route / DB / backup refresh, 11:44 MOMO dedicated preflight blocked readback, 14:16 MOMO dedicated 
preflight recovery on V10.674 / job 57 / freshness 1, 14:41 wrapper warning split, 15:04 cold-start WARN-only classifier fix, 10:58 user-approved 110 orphan Chrome SIGTERM evidence, MacBook Pro Codex safe artifact sync readback, and 2026-06-25 live refresh with full cold-start GREEN are updated. 2026-06-24 23:15 read-only verify still shows repo cold-start hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`; live 110 script sync of the v1.42 classifier is not claimed until separately approved and recorded. |
2026-06-25 14:41 supplemental wrapper readback supersedes the 14:16 wrapper wording: direct route smoke is 200 for AWOOOI API / IwoooS / MOMO health / Stock, and cold-start public route/TLS gate is green for all expected 2xx/3xx routes. Repo-side cold-start returns `PASS=89 WARN=0 BLOCKED=0`; `/backup/scripts/backup-status.sh --no-notify --no-refresh` reports 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`; MOMO dedicated preflight returns `PASS=19 WARN=2 BLOCKED=0`; MOMO health is `V10.676`; 110 load is around `4.09 / 4.52 / 4.39`, with active Gitea Actions / build / test load visible, not orphan Chrome. Wrapper result is `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, not `DEGRADED`, because service warnings are `0` and only DR boundary / evidence warnings remain.
2026-06-25 15:04 supplemental wrapper readback supersedes the 14:41 wrapper wording: direct route smoke is 200 for AWOOOI API / IwoooS / MOMO health / Stock, and cold-start public route/TLS gate is green for all expected 2xx/3xx routes. Repo-side cold-start returns `PASS=89 WARN=0 BLOCKED=0`; `/backup/scripts/backup-status.sh --no-notify --no-refresh` reports 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`; MOMO dedicated preflight returns `PASS=19 WARN=2 BLOCKED=0`; MOMO health is `V10.681`; 110 load is around `8.99 / 5.76 / 4.89`, with active Gitea Actions / CD `next build` / Playwright e2e smoke visible, not orphan Chrome. Wrapper result is `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, not `DEGRADED`, because service warnings are `0` and only DR boundary / evidence warnings remain.
Full cold-start service readiness may now be declared GREEN for the latest verified evidence set. As of 2026-06-25 14:41, routes/hosts/K3s/backups/exporters/monitoring surfaces are available, AWOOOI API is healthy, MOMO service health is `V10.676`, and MOMO business data is fresh through `2026-06-24`. The live read-only cold-start scorecard is `PASS=89 WARN=0 BLOCKED=0`, and the post-start wrapper result is `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. Do not declare DR scorecard complete while credential escrow evidence remains blocked.
Full cold-start service readiness may now be declared GREEN for the latest verified evidence set. As of 2026-06-25 15:04, routes/hosts/K3s/backups/exporters/monitoring surfaces are available, AWOOOI API is healthy, MOMO service health is `V10.681`, and MOMO business data is fresh through `2026-06-24`. The live read-only cold-start scorecard is `PASS=89 WARN=0 BLOCKED=0`, and the post-start wrapper result is `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. Do not declare DR scorecard complete while credential escrow evidence remains blocked.
2026-06-13 01:26 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing survived the next normal CD deploy: Gitea main `e4a349bc`, ArgoCD revision `e4a349bc`, images from `414413a5`, API/Web split across `mon` / `mon1`, and global `known_hosts` retained 120 / 188 after CD fix `80e6ec1a`. Do not declare DR complete while credential escrow is missing. `km-vectorize` remediation is `90%`: schedule/label fix is live, and the remaining gate is the next official 03:00 CronJob success readback.
@@ -181,17 +181,39 @@ done
 if [[ "$RUN_COLD_START" -eq 1 ]]; then
   section "Cold-start scorecard"
   cold_tmp="$(mktemp -t post-start-cold-start.XXXXXX)"
+  cold_rc=0
   if bash "$ROOT_DIR/scripts/reboot-recovery/full-stack-cold-start-check.sh" --monitor-read-only --no-color --watch --interval 1 --max-attempts 1 >"$cold_tmp" 2>&1; then
-    ok "cold-start command exited 0"
+    cold_rc=0
   else
-    blocked "cold-start command returned non-zero"
+    cold_rc=$?
   fi
   cat "$cold_tmp"
   cold_summary="$(grep -E 'PASS=[0-9]+ WARN=[0-9]+ BLOCKED=[0-9]+' "$cold_tmp" | tail -n 1 || true)"
   if [[ -n "$cold_summary" ]]; then
     ok "cold-start summary: $cold_summary"
+    cold_warn=0
+    cold_blocked=0
+    if [[ "$cold_summary" =~ WARN=([0-9]+) ]]; then
+      cold_warn="${BASH_REMATCH[1]}"
+    fi
+    if [[ "$cold_summary" =~ BLOCKED=([0-9]+) ]]; then
+      cold_blocked="${BASH_REMATCH[1]}"
+    fi
+    if [[ "$cold_blocked" -gt 0 ]]; then
+      blocked "cold-start has blockers: $cold_summary"
+    elif [[ "$cold_warn" -gt 0 ]]; then
+      service_warn "cold-start is warning-only, not blocked: $cold_summary"
+    elif [[ "$cold_rc" -eq 0 ]]; then
+      ok "cold-start command exited 0"
+    else
+      evidence_warn "cold-start exited $cold_rc but summary has no blockers: $cold_summary"
+    fi
   else
-    service_warn "cold-start summary not found"
+    if [[ "$cold_rc" -eq 0 ]]; then
+      service_warn "cold-start summary not found"
+    else
+      blocked "cold-start command returned $cold_rc without summary"
+    fi
   fi
   rm -f "$cold_tmp"
 fi
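The `cold_rc=$?` capture depends on a shell detail: inside the `else` branch of an `if`, `$?` still holds the exit status of the condition command, but only until another command runs. A minimal demonstration, using a hypothetical stand-in `fake_check` in place of the cold-start script:

```bash
# Hypothetical stand-in for the cold-start script; exits with the code passed as $1.
fake_check() { return "$1"; }

capture_rc() {
  # Reading $? as the first statement of the else branch preserves the
  # exit status of the if condition.
  if fake_check "$1"; then
    rc=0
  else
    rc=$?
  fi
  echo "$rc"
}

capture_rc_late() {
  # Anti-pattern: another command runs first, so $? reflects that command.
  if fake_check "$1"; then
    rc=0
  else
    true
    rc=$?
  fi
  echo "$rc"
}

capture_rc 3        # prints 3
capture_rc_late 3   # prints 0: the original status is lost
```

This is why any logging call in the failure branch has to come after the assignment, not before it.
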