docs(ops): codify full stack cold start recovery

2026-05-06 00:07:57 +08:00
parent 2aa31c205a
commit 0315c2b510
4 changed files with 552 additions and 27 deletions
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -3269,3 +3269,48 @@ DATABASE_URL=postgresql+asyncpg://u:p@localhost:5432/test REDIS_URL=redis://loca
  apps/api/tests/test_openclaw_alert_cloud_fallback_gate.py -q
 # 15 passed
 ```
+
+---
+
+## 2026-05-06 (Taipei): Full-stack reboot cold-start SOP / baseline / watch mode
+
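+**Trigger**: After the unplanned reboots of 110 / 120 / 121 / 188 on the evening of 2026-05-05, the request was to fully document the recovery order, service dependencies, release logic, and final confirmation mechanism used this time, and to establish a standard procedure for fast recovery after the next reboot.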
+
+### Completed
+
+| Artifact | Result |
+|----------|--------|
+| `docs/runbooks/FULL-STACK-COLD-START-SOP.md` | Upgraded to v1.1; added the Golden Startup Order, Mermaid dependency graph, phase gate logic, script-to-SOP coverage map, and next-reboot operator contract |
+| `ops/reboot-recovery/full-stack-cold-start-baseline.yml` | Added a machine-readable baseline that pins hosts, roles, startup order, endpoint codes, schedule freshness, stateful-service no-go zones, and the AI auto-remediation gate |
+| `scripts/reboot-recovery/full-stack-cold-start-check.sh` | Added `--watch` / `--interval` / `--max-attempts` so the checks repeat after a reboot until `GREEN` |
+
+### Standard release command for the next reboot
+
+```bash
+bash scripts/reboot-recovery/full-stack-cold-start-check.sh \
+  --watch \
+  --interval 60 \
+  --max-attempts 30 \
+  --send-alert-test
+```
+
+### Verification results
+
+```bash
+bash -n scripts/reboot-recovery/full-stack-cold-start-check.sh
+# OK
+
+ruby -e 'require "yaml"; YAML.load_file("ops/reboot-recovery/full-stack-cold-start-baseline.yml"); puts "YAML OK"'
+# YAML OK
+
+bash scripts/reboot-recovery/full-stack-cold-start-check.sh --watch --interval 1 --max-attempts 1 --send-alert-test
+# PASS=50 WARN=0 BLOCKED=0
+# Result: GREEN. Full stack is ready for controlled runner/CD release.
+```
+
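+### Release principles
+
+- `BLOCKED`: stop releasing later phases and fix the first blocked gate first.
+- `WARN`: do not release runner/CD/AI full execution; clear the warnings or explicitly accept them.
+- `GREEN`: only means the next phase may begin; high-load crawlers / Snuba / ClickHouse merges / runner/CD are still released last.
+- Stateful DB / ClickHouse / Kafka / Harbor / Sentry data layers must not be destructively repaired by AI automation.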
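
The release principles above line up with the check script's exit status as changed in this commit: 0 for `GREEN`, 1 for `DEGRADED` (warnings remain), 2 for `BLOCKED`, and 64 for a usage error. The following is a minimal sketch of a wrapper that turns that status into an operator action. `release_action` and its message strings are hypothetical and not part of the commit.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map the check script's exit status to an operator action.
# Exit codes 0/1/2/64 are taken from the script diff; the wording is illustrative.
release_action() {
  case "$1" in
    0)  echo "GREEN: release the next phase only" ;;
    1)  echo "DEGRADED: hold runner/CD/AI full execution; clear or accept warnings" ;;
    2)  echo "BLOCKED: stop phase release; fix the first blocked gate" ;;
    64) echo "USAGE: invalid arguments; no checks were run" ;;
    *)  echo "UNKNOWN: treat as blocked (exit code $1)" ;;
  esac
}

# Possible usage after a reboot (not executed here):
#   bash scripts/reboot-recovery/full-stack-cold-start-check.sh --watch --send-alert-test
#   release_action "$?"
```

Treating any unrecognized status as blocked keeps the wrapper on the conservative side of the release rules.
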
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -1,7 +1,7 @@
 # AWOOOI Full-Stack Cold Start SOP

-> Version: v1.0
-> Last updated: 2026-05-05 Asia/Taipei
+> Version: v1.1
+> Last updated: 2026-05-06 Asia/Taipei
 > Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.

 ---
@@ -38,6 +38,71 @@ The rule is simple: **recover the dependency chain, not the loudest symptom.**

 Never start runner/CD before 188 PostgreSQL, 110 Harbor, K3s nodes, and AWOOOI API are healthy.

+### 1.1 Dependency Graph
+
+```mermaid
+flowchart TD
+  network["P0 network: LAN, ARP, SSH"] --> data188["188 data: PostgreSQL, Redis, momo DB, SignOz"]
+  network --> obs110["110 registry/observability: Harbor, Gitea, Prometheus, Alertmanager, Sentry"]
+  data188 --> k3s["120/121 K3s: server, agent, VIP, NodePorts"]
+  obs110 --> k3s
+  k3s --> workload["AWOOOI workload: API, Web, K8s Secrets"]
+  workload --> alertchain["Alert chain: Alertmanager webhook, Telegram"]
+  workload --> public["Public routes: awoooi.wooo.work, mo.wooo.work"]
+  public --> schedules["Schedules: cron, CronJobs, backups, exporters"]
+  schedules --> highload["High-load release: crawlers, Snuba, ClickHouse merges, runners/CD"]
+  highload --> ai["AI auto-remediation: limited execution"]
+```
+
+This is also captured in the machine-readable baseline:
+
+```text
+ops/reboot-recovery/full-stack-cold-start-baseline.yml
+```
+
+The YAML baseline is the source of truth for:
+
+- hosts, roles, and SSH users
+- phase ordering
+- service startup dependencies
+- endpoint success codes
+- schedule freshness thresholds
+- stateful-service protection boundaries
+- AI automation release gates
+
+### 1.2 Phase Gate Logic
+
+Each phase has the same decision rule:
+
+| Result | Meaning | Action |
+|--------|---------|--------|
+| `BLOCKED` | A dependency required by later phases is down. | Stop phase release and fix the first blocked gate. |
+| `WARN` | Core dependency passed, but confidence is incomplete. The script reports this as `Result: DEGRADED`. | Continue diagnosis, but do not release runner/CD/AI full execution. |
+| `GREEN` | All checks in scope passed. | Release the next phase only. |
+
+The cold-start flow is intentionally conservative:
+
+```text
+P0 network green
+  -> P0 188 data green
+  -> P0 110 registry/observability green
+  -> P1 K3s green
+  -> P2 workload + alert chain green
+  -> P2 public routes green
+  -> P2 schedules green
+  -> P3 high-load services and runners/CD
+  -> AI auto-remediation limited execution
+```
+
+The final release condition is not "containers are running". It is:
+
+```text
+PASS > 0
+WARN = 0
+BLOCKED = 0
+Result: GREEN
+```
+
 ---

 ## 2. Automation Freeze
@@ -443,26 +508,63 @@ Until then:

 ## 13. One-Command Readiness Script

-Run:
+### 13.1 Single Pass
+
+Run this when you want one read-only snapshot:

 ```bash
 bash scripts/reboot-recovery/full-stack-cold-start-check.sh
 ```

-The script is read-only. It reports gates:
+The script is read-only. It does not restart services, delete data, change memory/CPU limits, or patch Kubernetes. It reports gates:

 - `P0-NETWORK`
 - `P0-188-DATA`
- `P0-110-REGISTRY`
+- `P0-110-REGISTRY-OBSERVABILITY`
 - `P1-K3S`
- `P2-WORKLOAD`
- `P2-ALERTCHAIN`
+- `P2-WORKLOAD-ALERTCHAIN`
 - `P2-PUBLIC-ROUTES`
 - `P2-SCHEDULES`
 - runner guardrail state inside `P0-110-REGISTRY-OBSERVABILITY`

 If it prints `BLOCKED`, fix the first blocked gate before moving forward.

+### 13.2 Professional Watch Mode
+
+Run this after a full reboot when you want the machine to keep checking until the whole stack is ready:
+
+```bash
+bash scripts/reboot-recovery/full-stack-cold-start-check.sh \
+  --watch \
+  --interval 60 \
+  --max-attempts 30 \
+  --send-alert-test
+```
+
+This is the standard next-reboot release command. It checks every 60 seconds for up to 30 attempts. It exits 0 as soon as a pass is `GREEN`; if attempt 30 is still not `GREEN`, it exits with that attempt's code (1 for `DEGRADED`, 2 for `BLOCKED`).
+
+Use `--send-alert-test` for final release because the alert chain is not proven until the AWOOOI Alertmanager webhook accepts a real POST. Without `--send-alert-test`, the script intentionally leaves a warning so operators do not falsely mark alerting as complete.
+
+### 13.3 Script-To-SOP Coverage Map
+
+| Script gate | SOP coverage | Blocks |
+|-------------|--------------|--------|
+| `P0-NETWORK` | host reachability, ARP, SSH | every later phase |
+| `P0-188-DATA` | PostgreSQL, Redis, momo, SignOz | K3s, AWOOOI API, momo public site |
+| `P0-110-REGISTRY-OBSERVABILITY` | Harbor, Gitea, Prometheus, Alertmanager, Sentry, runner quotas | image pulls, CD, alert rules, runners |
+| `P1-K3S` | 120/121 K3s, VIP, node readiness, pod health | workload and webhook health |
+| `P2-WORKLOAD-ALERTCHAIN` | AWOOOI API/Web, Alertmanager webhook | AI auto-remediation and alert confidence |
+| `P2-PUBLIC-ROUTES` | external AWOOOI and momo URLs | external release |
+| `P2-SCHEDULES` | cron, CronJobs, backups, textfile exporters, DR drill | final done criteria |
+
+### 13.4 Next-Reboot Operator Contract
+
+1. Run the watch command above.
+2. If it stops at `BLOCKED`, repair the first blocked gate and rerun watch mode.
+3. If it stops at `WARN`, do not release runner/CD/AI full execution; clear or explicitly accept each warning.
+4. Release high-load services only after `GREEN` and only once host load average per CPU core has stayed below `1.0` for 15 minutes.
+5. Record the final output summary and any manual repair in `docs/LOGBOOK.md`.
+
 ---

 ## 14. Done Criteria
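
Step 4 of the operator contract in section 13.4 asks for host load per CPU core below `1.0` for 15 minutes before high-load release. The commit does not show how that is measured, so the sketch below is one possible approximation: it assumes the kernel's 15-minute load average divided by the core count is an acceptable proxy. `load_per_core_ok` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed when load / cores is strictly below a limit.
# Arguments: load average, core count, optional limit (default 1.0).
# Assumption: the 15-minute load average stands in for "below 1.0 for 15 minutes".
load_per_core_ok() {
  local load="$1" cores="$2" limit="${3:-1.0}"
  # awk handles the floating-point division that shell arithmetic cannot.
  awk -v l="$load" -v c="$cores" -v m="$limit" \
    'BEGIN { exit !(c > 0 && (l / c) < m) }'
}

# Possible usage on a Linux host (not executed here):
#   load_per_core_ok "$(cut -d' ' -f3 /proc/loadavg)" "$(nproc)" \
#     && echo "load gate passed" || echo "load gate not yet passed"
```

A stricter reading of the rule would sample the 1-minute average repeatedly over 15 minutes; the single 15-minute reading is only the cheapest check.
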
--- a/ops/reboot-recovery/full-stack-cold-start-baseline.yml
+++ b/ops/reboot-recovery/full-stack-cold-start-baseline.yml
@@ -0,0 +1,303 @@
+# AWOOOI full-stack cold-start dependency baseline.
+# This is the machine-readable companion to docs/runbooks/FULL-STACK-COLD-START-SOP.md.
+#
+# Intent:
+# - document the reboot startup order and service dependency graph
+# - define release gates for operators and AI automation
+# - keep stateful services out of generic auto-restart loops
+
+version: "2026-05-06"
+incident_reference: "2026-05-05 full-stack reboot recovery"
+scope:
+  managed_hosts:
+    "110":
+      address: "192.168.0.110"
+      ssh_user: "wooo"
+      roles:
+        - registry
+        - git
+        - observability
+        - sentry
+        - runners
+    "120":
+      address: "192.168.0.120"
+      ssh_user: "wooo"
+      roles:
+        - k3s_server
+        - keepalived_vip
+        - awoooi_nodeport
+    "121":
+      address: "192.168.0.121"
+      ssh_user: "wooo"
+      roles:
+        - k3s_node
+        - keepalived_peer
+        - dr_drill
+    "188":
+      address: "192.168.0.188"
+      ssh_user: "ollama"
+      roles:
+        - postgres_datastore
+        - redis
+        - momo
+        - signoz
+        - ai_proxy
+  intentionally_skipped:
+    "112":
+      role: "kali"
+      reason: "scanner host is not required for production cold-start release"
+
+global_policy:
+  startup_rule: "Recover the dependency chain before releasing high-load work."
+  runner_cd_rule: "Release runners and CD only after data, registry, K3s, workload, routes, schedules, and alert E2E gates are green."
+  ai_auto_repair_rule: "Observe-only until all green gates pass and host load stays below baseline."
+  destructive_state_rule: "No DROP, data directory deletion, volume recreation, pg_resetwal, fsck, or backup restore without explicit human approval."
+  no_generic_restart_rule: "Never run generic docker restart against all containers during cold start."
+
+phases:
+  - id: "P0-NETWORK"
+    order: 0
+    start_after: []
+    owns:
+      - "LAN reachability"
+      - "SSH reachability"
+      - "ARP evidence"
+    gates:
+      - "ping 192.168.0.110/120/121/188 succeeds"
+      - "TCP 22 open on 192.168.0.110/120/121/188"
+      - "reboot evidence captured before repair"
+    blocks:
+      - "all other phases"
+
+  - id: "P0-188-DATA"
+    order: 1
+    start_after:
+      - "P0-NETWORK"
+    host: "188"
+    service_order:
+      - "containerd"
+      - "docker"
+      - "postgresql@14-main"
+      - "k3s_datastore.kine maintenance"
+      - "redis-server"
+      - "ollama or current AI proxy dependencies"
+      - "nginx"
+      - "Docker networks"
+      - "MinIO / OpenClaw / SignOz"
+      - "momo / litellm / batch services"
+    gates:
+      - "PostgreSQL port 5432 open"
+      - "pg_isready reports accepting connections"
+      - "Redis replies PONG"
+      - "momo health endpoint returns 200"
+      - "SignOz HTTP route is reachable"
+    blocks:
+      - "120/121 K3s"
+      - "AWOOOI API database access"
+      - "Alertmanager webhook"
+      - "momo public site"
+
+  - id: "P0-110-REGISTRY-OBSERVABILITY"
+    order: 2
+    start_after:
+      - "P0-NETWORK"
+      - "P0-188-DATA"
+    host: "110"
+    service_order:
+      - "docker"
+      - "orphan Exited(128/137) cleanup if needed"
+      - "Harbor log"
+      - "Harbor registry stack"
+      - "Gitea"
+      - "Prometheus / Alertmanager / Grafana / exporters"
+      - "Langfuse"
+      - "SignOz or local observability companions"
+      - "Sentry DB layer"
+      - "Sentry web / worker / consumer layer"
+      - "Gitea host runner and actions runners"
+    gates:
+      - "Harbor /v2/ returns 200 or 401"
+      - "Gitea returns 200 or 302"
+      - "Prometheus /-/ready returns 200"
+      - "Alertmanager /-/healthy returns 200"
+      - "Sentry HTTP returns 200, 302, or 400"
+      - "runner CPUQuota=200%, MemoryMax=2G, WatchdogUSec=0"
+    blocks:
+      - "K3s image pulls"
+      - "runtime CD"
+      - "alert rules deploy"
+      - "code-review runners"
+
+  - id: "P1-K3S"
+    order: 3
+    start_after:
+      - "P0-188-DATA"
+      - "P0-110-REGISTRY-OBSERVABILITY"
+    hosts:
+      - "120"
+      - "121"
+    service_order:
+      - "120 k3s.service"
+      - "121 k3s-agent.service or live role"
+      - "CNI / kube-proxy"
+      - "nodes Ready"
+      - "core pods"
+      - "awoooi-prod pods"
+      - "keepalived VIP 192.168.0.125"
+      - "NodePorts 32334 and 32335"
+    gates:
+      - "120 can reach 188:5432"
+      - "K3s nodes show Ready"
+      - "VIP 192.168.0.125 is present"
+      - "awoooi-prod pods are Running or Completed"
+    blocks:
+      - "AWOOOI workload health"
+      - "public AWOOOI route"
+      - "Alertmanager webhook"
+
+  - id: "P2-WORKLOAD-ALERTCHAIN"
+    order: 4
+    start_after:
+      - "P1-K3S"
+    owns:
+      - "AWOOOI API"
+      - "AWOOOI Web"
+      - "Alertmanager webhook"
+      - "Telegram delivery"
+    gates:
+      - "http://192.168.0.125:32334/api/v1/health returns 2xx/3xx"
+      - "http://192.168.0.125:32335/ returns 2xx/3xx"
+      - "Alertmanager webhook POST returns 2xx"
+      - "K8s Telegram secrets are present and non-placeholder"
+    blocks:
+      - "AI auto-remediation"
+      - "full alert confidence"
+
+  - id: "P2-PUBLIC-ROUTES"
+    order: 5
+    start_after:
+      - "P2-WORKLOAD-ALERTCHAIN"
+    gates:
+      - "https://awoooi.wooo.work/api/v1/health returns 2xx/3xx"
+      - "https://awoooi.wooo.work/ returns 2xx/3xx"
+      - "https://mo.wooo.work/ returns 2xx/3xx"
+      - "https://mo.wooo.work/health returns 2xx/3xx"
+    blocks:
+      - "external release complete"
+
+  - id: "P2-SCHEDULES"
+    order: 6
+    start_after:
+      - "P2-PUBLIC-ROUTES"
+    gates:
+      - "110/120/121/188 cron services active"
+      - "188 backup-from-110 success age below 25h"
+      - "188 docker restart/stats textfiles fresh"
+      - "110 docker/systemd textfiles fresh"
+      - "120 awoooi-prod CronJobs present and unsuspended"
+      - "120 awoooi-prod has no failed Jobs"
+      - "121 DR drill cron present"
+    blocks:
+      - "done criteria"
+      - "AI auto-remediation release"
+
+  - id: "P3-HIGH-LOAD-RELEASE"
+    order: 7
+    start_after:
+      - "P2-SCHEDULES"
+    release_last:
+      - "momo-scheduler / Chrome crawlers"
+      - "Sentry Snuba consumers"
+      - "SignOz ClickHouse merge-heavy work"
+      - "Gitea actions runners"
+      - "runtime CD jobs"
+    gates:
+      - "all prior gates green"
+      - "host load per CPU below 1.0 for 15 minutes before releasing batch/runner work"
+      - "ClickHouse/Kafka/Snuba backlog decreasing for two consecutive checks if backlog exists"
+
+baselines:
+  endpoints:
+    awoooi_vip_api_health: "http://192.168.0.125:32334/api/v1/health"
+    awoooi_vip_web: "http://192.168.0.125:32335/"
+    awoooi_public_api_health: "https://awoooi.wooo.work/api/v1/health"
+    awoooi_public_web: "https://awoooi.wooo.work/"
+    momo_public_web: "https://mo.wooo.work/"
+    momo_public_health: "https://mo.wooo.work/health"
+    harbor_registry: "http://127.0.0.1:5000/v2/"
+    gitea: "http://127.0.0.1:3001/"
+    prometheus_ready: "http://127.0.0.1:9090/-/ready"
+    alertmanager_healthy: "http://127.0.0.1:9093/-/healthy"
+    sentry: "http://127.0.0.1:9000/"
+  expected_codes:
+    harbor_registry:
+      - 200
+      - 401
+    gitea:
+      - 200
+      - 302
+    prometheus_ready:
+      - 200
+    alertmanager_healthy:
+      - 200
+    sentry:
+      - 200
+      - 302
+      - 400
+    workload_and_public:
+      - "2xx"
+      - "3xx"
+  runner_guardrails:
+    CPUQuotaPerSecUSec: "2s"
+    MemoryMax: "2147483648"
+    WatchdogUSec: "0"
+  freshness_seconds:
+    docker_textfiles: 300
+    systemd_textfiles: 300
+    backup_success: 90000
+
+stateful_services:
+  hard_block_auto_repair:
+    - "188 PostgreSQL data directory"
+    - "188 k3s_datastore"
+    - "188 momo database"
+    - "110 Harbor DB"
+    - "110 Sentry DB"
+    - "Sentry ClickHouse data"
+    - "SignOz ClickHouse data"
+    - "Kafka topic/log directories"
+  human_in_loop_required:
+    - "pg_resetwal"
+    - "ClickHouse clean-clone recovery"
+    - "Kafka checkpoint file quarantine"
+    - "backup restore"
+    - "filesystem repair"
+
+ai_automation_gate:
+  observe_only_until:
+    - "P0-NETWORK green"
+    - "P0-188-DATA green"
+    - "P0-110-REGISTRY-OBSERVABILITY green"
+    - "P1-K3S green"
+    - "P2-WORKLOAD-ALERTCHAIN green"
+    - "P2-PUBLIC-ROUTES green"
+    - "P2-SCHEDULES green"
+    - "no active restart storm"
+    - "host load per CPU below 1.0 for 15 minutes"
+  allowed_before_green:
+    - "diagnose"
+    - "collect evidence"
+    - "notify"
+  blocked_before_green:
+    - "stateful restart"
+    - "destructive repair"
+    - "runner/CD release"
+    - "generic container restart"
+
+final_confirmation:
+  command: "bash scripts/reboot-recovery/full-stack-cold-start-check.sh --watch --interval 60 --max-attempts 30 --send-alert-test"
+  green_result:
+    PASS: "greater than 0"
+    WARN: 0
+    BLOCKED: 0
+    summary: "Result: GREEN"
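
The `expected_codes` block in the baseline mixes exact status codes (for example `200` and `401` for Harbor) with class patterns (`"2xx"`, `"3xx"`). The sketch below shows one way a checker could test a response code against such a list. `code_allowed` is a hypothetical helper; the commit does not show whether the script reads these values from the YAML at all.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed when an HTTP status code matches any expected entry.
# Arguments: the observed code, then one or more expected entries.
# An entry is either an exact code ("401") or a class pattern ("2xx").
code_allowed() {
  local code="$1" expected
  shift
  for expected in "$@"; do
    if [[ "$expected" =~ ^[1-5]xx$ ]]; then
      # Class match: require a three-digit code with the same hundreds digit.
      if [[ "$code" =~ ^[0-9]{3}$ ]] && [ "${code:0:1}" = "${expected:0:1}" ]; then
        return 0
      fi
    elif [ "$code" = "$expected" ]; then
      return 0
    fi
  done
  return 1
}

# Possible usage with the Harbor gate from the baseline (not executed here):
#   code="$(curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:5000/v2/)"
#   code_allowed "$code" 200 401 && echo "Harbor gate passed"
```
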
--- a/scripts/reboot-recovery/full-stack-cold-start-check.sh
+++ b/scripts/reboot-recovery/full-stack-cold-start-check.sh
@@ -6,26 +6,61 @@ set -uo pipefail

 SSH_OPTS=(-o BatchMode=yes -o ConnectTimeout=6)
 SEND_ALERT_TEST=0
+WATCH_MODE=0
+WATCH_INTERVAL=60
+WATCH_MAX_ATTEMPTS=30

-for arg in "$@"; do
-  case "$arg" in
+usage() {
+  cat <<'USAGE'
+Usage: bash scripts/reboot-recovery/full-stack-cold-start-check.sh [options]
+
+Options:
+  --send-alert-test       POST one Alertmanager webhook test after AWOOOI API is ready.
+  --watch                 Repeat checks until all gates are GREEN or max attempts is reached.
+  --interval SECONDS      Retry interval for --watch. Default: 60.
+  --max-attempts COUNT    Max attempts for --watch. Default: 30. Use 0 for unlimited.
+  -h, --help              Show this help.
+
+Default mode is read-only and does not POST an Alertmanager test event.
+Use --send-alert-test for the final release gate after AWOOOI API is expected to be ready.
+USAGE
+}
+
+while [ "$#" -gt 0 ]; do
+  case "$1" in
    --send-alert-test)
      SEND_ALERT_TEST=1
      ;;
+    --watch)
+      WATCH_MODE=1
+      ;;
+    --interval)
+      shift
+      if ! [[ "${1:-}" =~ ^[0-9]+$ ]] || [ "${1:-0}" -lt 1 ]; then
+        echo "--interval requires a positive integer number of seconds" >&2
+        exit 64
+      fi
+      WATCH_INTERVAL="$1"
+      ;;
+    --max-attempts)
+      shift
+      if ! [[ "${1:-}" =~ ^[0-9]+$ ]]; then
+        echo "--max-attempts requires a non-negative integer" >&2
+        exit 64
+      fi
+      WATCH_MAX_ATTEMPTS="$1"
+      ;;
    -h|--help)
-      cat <<'USAGE'
-Usage: bash scripts/reboot-recovery/full-stack-cold-start-check.sh [--send-alert-test]
-
-Default mode is read-only and does not POST an Alertmanager test event.
-Use --send-alert-test only after AWOOOI API is expected to be ready.
-USAGE
+      usage
      exit 0
      ;;
    *)
-      echo "Unknown argument: $arg" >&2
+      echo "Unknown argument: $1" >&2
+      usage >&2
      exit 64
      ;;
  esac
+  shift
 done

 RED=$'\033[0;31m'
@@ -38,6 +73,12 @@ PASS=0
 WARN=0
 FAIL=0

+reset_counters() {
+  PASS=0
+  WARN=0
+  FAIL=0
+}
+
 log_section() {
  printf "\n%s=== %s ===%s\n" "$BLUE" "$1" "$NC"
 }
@@ -104,6 +145,7 @@ print_header() {
  echo "AWOOOI full-stack cold-start check"
  date '+%Y-%m-%d %H:%M:%S %Z'
  echo "Scope: 110 / 120 / 121 / 188. 112 Kali is intentionally skipped."
+  echo "Baseline: ops/reboot-recovery/full-stack-cold-start-baseline.yml"
 }

 check_network() {
@@ -385,21 +427,54 @@ summary() {
  echo "PASS=$PASS WARN=$WARN BLOCKED=$FAIL"
  if [ "$FAIL" -gt 0 ]; then
    echo "Result: BLOCKED. Fix the first blocked gate before releasing runner/CD/AI auto-remediation."
-    exit 2
+    return 2
  fi
  if [ "$WARN" -gt 0 ]; then
    echo "Result: DEGRADED. Core gates passed but warnings remain."
-    exit 1
+    return 1
  fi
  echo "Result: GREEN. Full stack is ready for controlled runner/CD release."
+  return 0
 }

-print_header
-check_network
-check_188
-check_110
-check_k3s
-check_workload_and_alertchain
-check_public_routes
-check_schedules
-summary
+run_once() {
+  reset_counters
+  print_header
+  check_network
+  check_188
+  check_110
+  check_k3s
+  check_workload_and_alertchain
+  check_public_routes
+  check_schedules
+  summary
+}
+
+if [ "$WATCH_MODE" -eq 1 ]; then
+  attempt=1
+  while :; do
+    if [ "$WATCH_MAX_ATTEMPTS" -eq 0 ]; then
+      printf "\nWatch attempt %s/unlimited\n" "$attempt"
+    else
+      printf "\nWatch attempt %s/%s\n" "$attempt" "$WATCH_MAX_ATTEMPTS"
+    fi
+
+    run_once
+    rc=$?
+    if [ "$rc" -eq 0 ]; then
+      exit 0
+    fi
+
+    if [ "$WATCH_MAX_ATTEMPTS" -ne 0 ] && [ "$attempt" -ge "$WATCH_MAX_ATTEMPTS" ]; then
+      echo "Watch stopped before GREEN. Last result code: $rc"
+      exit "$rc"
+    fi
+
+    echo "Waiting ${WATCH_INTERVAL}s before the next cold-start gate check..."
+    sleep "$WATCH_INTERVAL"
+    attempt=$((attempt + 1))
+  done
+fi
+
+run_once
+exit $?
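
Changing `summary` from `exit` to `return` is what makes the watch loop possible: a function that exits would end the script on the first non-green pass. The sketch below isolates that retry pattern under simplified assumptions. `fake_run_once` is a hypothetical stand-in for `run_once` that reports `BLOCKED` twice and then `GREEN`, and `watch_until_green` is a hypothetical reduction of the loop in the diff.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for run_once: returns 2 (BLOCKED) twice, then 0 (GREEN).
calls=0
fake_run_once() {
  calls=$((calls + 1))
  if [ "$calls" -lt 3 ]; then
    return 2
  fi
  return 0
}

# Retry a check function until it returns 0 or max_attempts is reached.
# Arguments: check function name, sleep interval in seconds, max attempts.
# max_attempts=0 means unlimited, mirroring --max-attempts 0 in the script.
watch_until_green() {
  local check="$1" interval="$2" max_attempts="$3"
  local attempt=1 rc
  while :; do
    rc=0
    "$check" || rc=$?
    if [ "$rc" -eq 0 ]; then
      return 0
    fi
    if [ "$max_attempts" -ne 0 ] && [ "$attempt" -ge "$max_attempts" ]; then
      # Out of attempts: surface the last non-green code to the caller.
      return "$rc"
    fi
    sleep "$interval"
    attempt=$((attempt + 1))
  done
}
```

With a five-attempt budget the stand-in turns green on the third call; with a two-attempt budget the loop gives up and hands back the last code, which is the same contract the watch mode exposes through its exit status.
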