fix(ops): harden reboot recovery and backup alerts

2026-05-29 12:38:58 +08:00
parent 70637ec871
commit ae7b39d96a
14 changed files with 2354 additions and 672 deletions
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -22528,3 +22528,29 @@ production browser smoke:
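 - 24h fully automated repair production claim: 0%; we still cannot claim that a true AI auto-repair closed loop has been achieved.
 - Productization of full AI-automated management: about 99.3%, but the "truly fully automated repair / approval / learning / KM writeback closed loop"
  still needs to be backed by 24h production evidence.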
+
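+## 2026-05-29 | Reboot recovery follow-up: aiops entry point, backup alerts, and Ansible baseline convergence
+
+**Background**: The commander asked to confirm that after every host reboots, services, websites, tools, databases, schedules, and backups all recover quickly, and that the work does not stop at manual hotfixes. The previous round fixed the AWOOOI/Flywheel stale incident and the success-rate rules; this round handles the items where the cold-start gate is still not green.
+
+**On-site repairs**:
+- The 188 public gateway entry for `aiops.wooo.work` was still reverse-proxying to the unreachable `192.168.0.120:31234/31235`, which caused a 502 on the public route; it now points to the official VIP `192.168.0.125:32334/32335`, `/` returns 307 to `/zh-TW`, and `/api/v1/health` returns `healthy`.
+- On 188, old backup files in `/etc/nginx/sites-enabled/` were still being included by Nginx, so the new vhost was hit by `conflicting server name ... ignored`; they were moved to `/etc/nginx/sites-disabled-codex/`, which keeps the backups without loading them.
+- On 110, `fwupd.service` / `fwupd-refresh.service` were in a stale failed state; after `reset-failed`, `systemctl --failed` reports 0 units.
+- The live Prometheus `alerts.yml` and `alerts-unified.canonical.yml` had been shrunk to an old version that lacked the full backup, offsite sync, credential escrow, and cold-start scorecard rules; the repo's `ops/monitoring/alerts-unified.yml` was re-synced to both live files and Prometheus was reloaded.
+- `prometheus-rule-drift-guard` now reports `missing_required_count=0` and `current_matches_canonical=1`, so it will no longer pull the full backup rules back to the old version every 5 minutes.
+- The Ansible template `infra/ansible/roles/nginx/templates/188-all-sites.conf.j2` has been synced with the live 188 public gateway baseline, so the next `nginx-sync.yml` run does not point aiops back at the single 120 node.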
+
+**Verification**:
+- The `https://aiops.wooo.work/` public route and TLS are back in the 200/307 success range; `https://aiops.wooo.work/api/v1/health` returns `healthy prod`.
+- `bash /home/wooo/scripts/full-stack-cold-start-check.sh --monitor-read-only --no-color --watch --interval 1 --max-attempts 1`: all public routes pass, 110 failed units = 0, the momo scheduler is judged healthy by container health plus task activity within the last 2h, momo's current-month `daily_sales_snapshot`/`realtime_sales_monthly` are consistent, and the result is `PASS=72 WARN=2 BLOCKED=3`.
+- All of `BLOCKED=3` still point at 120: `ping 192.168.0.120`, `ssh 192.168.0.120:22`, `ssh 120 k3s read-only check`.
+- The Google Drive/rclone daily full sync is still healthy: `rclone-last-success` and `rclone-full-verify-last-success` are both dated 2026-05-29, and the full repos cover `awoooi configs gitea harbor momo langfuse monitoring signoz open-webui clawbot sentry ai-artifacts public-routes`.
+- The full backup alert rules are loaded: `BackupAggregateRunFailed`, `BackupConfigCapturePartial`, `BackupOffsiteCopyStale`, `BackupCredentialEscrowEvidenceMissing`, `awoooi_recovery_core_ready`, and `ColdStartRecoveryBlocked` all exist; Prometheus rule count = 142.
+- Because 120 is unreachable, `BackupConfigCapturePartial{target="120-k3s-host-configs"}` and `BackupAggregateRunFailed` will enter pending/firing; this is the correct signal and must not be silenced.
+- `mo.wooo.work` data repair: the momo auto-import at 2026-05-29 11:55 wrote 17,353 rows for 2026-05-01~2026-05-28 into `daily_sales_snapshot`, but the sync to `realtime_sales_monthly` hit the PostgreSQL internal index error `posting list tuple ... cannot be split`, leaving the May analytics table at 0 rows. `REINDEX TABLE CONCURRENTLY public.realtime_sales_monthly` was run on 188 `momo-db`, followed by an idempotent re-sync of the same date range from `daily_sales_snapshot`; verification shows `daily_sales_snapshot=17,353`, `realtime_sales_monthly=17,353`, a `realtime_sales_monthly` total of `774,111` rows, and a maximum date of `2026-05-28`. The momo application cache was then cleared.
+
+**Must not be claimed as done**:
+- 120 is still unreachable and K3s node `mon` is `NotReady,SchedulingDisabled`; `mon1` can carry the AWOOOI workloads, but the full cold-start done criteria have not been met.
+- The 110 backup aggregate `failed_count=1` comes from the 120 config capture that cannot complete; once 120 is back, rerun `/backup/scripts/backup-configs.sh` or `/backup/scripts/backup-all.sh`, then rerun the Google Drive/rclone full sync.
+- `SLO_KMGrowthRate_Low` is still a warning (24h KM about 19/20); it is not a website outage, but KM output needs follow-up.
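Reviewer note: the LOGBOOK entry above reads the cold-start result off a summary of the form `PASS=72 WARN=2 BLOCKED=3`. A minimal Python sketch of turning that line into a gate decision could look like the following. The helper names `parse_summary` and `gate_is_green` are hypothetical, the summary format is assumed from the logged output only, and treating any `BLOCKED` item as "not green" is an assumption rather than the documented behavior of `full-stack-cold-start-check.sh`.

```python
import re


def parse_summary(line: str) -> dict:
    """Extract the PASS/WARN/BLOCKED counters from a summary line."""
    pairs = re.findall(r"\b(PASS|WARN|BLOCKED)=(\d+)", line)
    return {key: int(value) for key, value in pairs}


def gate_is_green(counts: dict) -> bool:
    # Assumption: a single BLOCKED item is enough to keep the gate closed.
    return counts.get("BLOCKED", 0) == 0


summary = parse_summary("PASS=72 WARN=2 BLOCKED=3")
print(summary, gate_is_green(summary))
```

With the logged result the gate stays closed, which matches the entry's statement that the full cold-start done criteria have not been met.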
--- a/docs/runbooks/BACKUP-STATUS.md
+++ b/docs/runbooks/BACKUP-STATUS.md
@@ -60,7 +60,7 @@ notify_clawbot "failed" "backup-test" "測試告警" 0
 ```
 0 2       * * *   backup-all.sh              ← full backup of 9 services
 0 8,14,20 * * *   backup-awoooi-frequent.sh  ← AWOOOI high-frequency (every 6 hours)
-0 6       * * *   backup-status.sh           ← backup status report
+5 6       * * *   backup-status.sh           ← backup status report (once daily, to avoid Telegram heartbeat noise)
 ```

 ---
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -590,6 +590,84 @@ Prometheus rules in `ops/monitoring/alerts-unified.yml` alert when the monitor i
 4. Release high-load services only after `GREEN` and load/core stays below `1.0` for 15 minutes.
 5. Record the final output summary and any manual repair in `docs/LOGBOOK.md`.

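+### 13.6 2026-05-29 Addendum: 188 Public Gateway and Backup Alerts
+
+The 188 public gateway for `aiops.wooo.work` must no longer point at the single node `192.168.0.120:31234/31235`. When 120 is unreachable, that makes the public route return 502 directly. The official baseline must go through the K3s VIP: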
+
+```nginx
+location /api/ {
+    proxy_pass http://192.168.0.125:32334/api/;
+}
+
+location /api/v1/ws {
+    proxy_pass http://192.168.0.125:32334/api/v1/ws;
+}
+
+location / {
+    proxy_pass http://192.168.0.125:32335;
+}
+```
+
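+Changes must originate in `infra/ansible/roles/nginx/templates/188-all-sites.conf.j2` and be converged with `infra/ansible/playbooks/nginx-sync.yml`; editing only the live file on 188 without writing the change back to the Ansible baseline is forbidden.
+
+Backup alerting has two layers, and both are required:
+
+- `ops/monitoring/alerts-unified.yml` is the canonical copy in the repo.
+- On 110, the live `/home/wooo/monitoring/alerts.yml` and `/home/wooo/monitoring/alerts-unified.canonical.yml` must stay consistent with it; otherwise `prometheus-rule-drift-guard` may pull the rules back to an old version.
+
+Checks required after every reboot: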
+
+```bash
+curl -s http://127.0.0.1:9090/api/v1/rules \
+  | python3 -c 'import json,sys; d=json.load(sys.stdin); names=[r.get("name") for g in d["data"]["groups"] for r in g["rules"]]; print([n for n in ["BackupAggregateRunFailed","BackupConfigCapturePartial","BackupOffsiteCopyStale","BackupCredentialEscrowEvidenceMissing","ColdStartRecoveryBlocked"] if n not in names])'
+
+cat /home/wooo/node_exporter_textfiles/prometheus_rule_drift_guard.prom
+```
+
+If 120 has not recovered yet, `BackupConfigCapturePartial{target="120-k3s-host-configs"}` and the cold-start blocked alert are correct signals and must not be silenced. After 120 recovers, rerun:
+
+```bash
+/backup/scripts/backup-configs.sh
+/backup/scripts/backup-all.sh
+/backup/scripts/sync-offsite-backups.sh --mode sync
+/backup/scripts/verify-offsite-full-sync.sh --write-textfile --no-color
+```
+
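+### 13.7 2026-05-29 Addendum: momo PostgreSQL Index and Data Sync
+
+For `mo.wooo.work`, checking only `/health` or a 200 on the home page is not enough. After a reboot or fsck, a PostgreSQL index problem can let the import flow appear to finish while `daily_sales_snapshot` is not synced to `realtime_sales_monthly`. Symptoms this time:
+
+- `daily_sales_snapshot` already held 17,353 rows for 2026-05-01 through 2026-05-28.
+- `realtime_sales_monthly` held 0 rows for the same date range.
+- The momo-scheduler log showed the PostgreSQL internal error `posting list tuple ... cannot be split`.
+
+Standard handling order: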
+
+```bash
+# 188 / momo-db: rebuild the index only, do not delete data
+docker exec -i momo-db bash -lc 'psql -U "$POSTGRES_USER" -d "$POSTGRES_DB" -v ON_ERROR_STOP=1' <<'SQL'
+REINDEX TABLE CONCURRENTLY public.realtime_sales_monthly;
+SQL
+```
+
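+Only after the index is rebuilt may you run an idempotent re-sync for the missing dates. The formal procedure must first check the `realtime_sales_monthly` row count for that date range; if it is not 0, save the query result first and confirm whether to rerun the sync for the same range. Do not truncate the whole table and do not restore the whole database. After the re-sync, verify at least: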
+
+```sql
+SELECT count(*), min(snapshot_date::date), max(snapshot_date::date)
+FROM daily_sales_snapshot
+WHERE snapshot_date::date BETWEEN DATE '2026-05-01' AND DATE '2026-05-28';
+
+SELECT count(*), min("日期"::date), max("日期"::date)
+FROM realtime_sales_monthly
+WHERE "日期"::date BETWEEN DATE '2026-05-01' AND DATE '2026-05-28';
+```
+
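+The row counts and the minimum/maximum dates for the same date range must match across both tables. When done, clear the momo application cache: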
+
+```bash
+docker exec momo-pro-system python -c 'from services.cache_service import clear_all_cache; clear_all_cache(); print("cache_cleared")'
+```
+
 ---

 ## 14. Done Criteria
@@ -604,6 +682,7 @@ All must be true:
 - AWOOOI API and Web reachable through NodePort/VIP.
 - Alertmanager E2E webhook succeeds.
 - cron/CronJob schedules are active, unsuspended, and verified.
+- momo `daily_sales_snapshot` and `realtime_sales_monthly` row counts match within the latest imported date range.
 - Sentry and SignOz are either healthy or explicitly in controlled backlog recovery.
 - High-load batch services are capped or delayed.
 - Runners are guarded and released last.
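Reviewer note: the new momo done criterion compares the `count(*)`, `min`, and `max` results of the two verification queries in section 13.7. A minimal Python sketch of that comparison follows. `RangeStats` and `tables_consistent` are hypothetical names, the minimum date in the sample is illustrative, and rejecting an empty source range is an added assumption that the runbook does not state.

```python
from datetime import date
from typing import NamedTuple, Optional


class RangeStats(NamedTuple):
    """One result row of: SELECT count(*), min(date), max(date) ..."""
    row_count: int
    min_date: Optional[date]
    max_date: Optional[date]


def tables_consistent(snapshot: RangeStats, monthly: RangeStats) -> bool:
    # Row count and both date bounds must match.
    # Assumption: an empty source range is never treated as consistent.
    return snapshot.row_count > 0 and snapshot == monthly


snapshot = RangeStats(17353, date(2026, 5, 1), date(2026, 5, 28))
before_repair = RangeStats(0, None, None)   # state described in 13.7
after_repair = RangeStats(17353, date(2026, 5, 1), date(2026, 5, 28))

print(tables_consistent(snapshot, before_repair))
print(tables_consistent(snapshot, after_repair))
```

Comparing the date bounds as well as the count catches a re-sync that wrote the right number of rows into the wrong dates.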
--- a/infra/ansible/roles/nginx/templates/188-all-sites.conf.j2
+++ b/infra/ansible/roles/nginx/templates/188-all-sites.conf.j2
@@ -1,145 +1,268 @@
 # 188-all-sites.conf.j2
-# AWOOOI Nginx site-wide configuration, managed by the Ansible nginx-sync.yml playbook
-# Do not edit this file by hand → edit roles/nginx/templates/188-all-sites.conf.j2 instead
-# Deploy command: ansible-playbook -i inventory/hosts.yml playbooks/nginx-sync.yml --tags 188
-# Last synced: {{ ansible_date_time.iso8601 }}
-
-# ============================================================
-# OpenClaw (port 8088)
-# ============================================================
+# AWOOOI 188 public gateway baseline managed by infra/ansible/playbooks/nginx-sync.yml.
+# 2026-05-29 Codex: synced from live 188 after reboot recovery; aiops.wooo.work
+# must use the K3s VIP 192.168.0.125:32334/32335 instead of a single 120 node.
+#
+# =============================================================================
+# AIOPS - aiops.wooo.work
+# =============================================================================
 server {
    listen 80;
-    server_name openclaw.awoooi.com;
+    server_name aiops.wooo.work;
+    return 301 https://$server_name$request_uri;
+}

-    location / {
-        proxy_pass http://127.0.0.1:8088;
+server {
+    listen 443 ssl http2;
+    server_name aiops.wooo.work;
+
+    ssl_certificate /etc/letsencrypt/live/aiops.wooo.work/fullchain.pem;
+    ssl_certificate_key /etc/letsencrypt/live/aiops.wooo.work/privkey.pem;
+
+    # API
+    location /api/ {
+        proxy_pass http://192.168.0.125:32334/api/;
+        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+        proxy_set_header X-Forwarded-Proto $scheme;
+    }
+
+    # WebSocket
+    location /api/v1/ws {
+        proxy_pass http://192.168.0.125:32334/api/v1/ws;
+        proxy_http_version 1.1;
+        proxy_set_header Upgrade $http_upgrade;
+        proxy_set_header Connection "upgrade";
+        proxy_set_header Host $host;
+    }
+
+    # Frontend
+    location / {
+        proxy_pass http://192.168.0.125:32335;
+        proxy_http_version 1.1;
+        proxy_set_header Host $host;
+        proxy_set_header X-Real-IP $remote_addr;
+        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+        proxy_set_header X-Forwarded-Proto $scheme;
+    }
+}
+
+# =============================================================================
+# GitLab - gitlab.wooo.work (proxied to 110)
+# =============================================================================
+server {
+    listen 80;
+    server_name gitlab.wooo.work;
+    return 301 https://$server_name$request_uri;
+}
+
+server {
+    listen 443 ssl http2;
+    server_name gitlab.wooo.work;
+
+    ssl_certificate /etc/letsencrypt/live/gitlab.wooo.work/fullchain.pem;
+    ssl_certificate_key /etc/letsencrypt/live/gitlab.wooo.work/privkey.pem;
+
+    client_max_body_size 500m;
+
+    location / {
+        proxy_pass http://192.168.0.110:8929;
+        proxy_http_version 1.1;
+        proxy_set_header Host $host;
+        proxy_set_header X-Real-IP $remote_addr;
+        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_read_timeout 300s;
+        proxy_connect_timeout 300s;
    }
 }

-# ============================================================
-# tsenyang (port 3000)
-# ============================================================
+# =============================================================================
+# SigNoz - signoz.wooo.work
+# =============================================================================
 server {
    listen 80;
-    server_name tsenyang.awoooi.com;
-
-    location / {
-        proxy_pass http://127.0.0.1:3000;
-        proxy_set_header Host $host;
-        proxy_set_header X-Real-IP $remote_addr;
-        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
-    }
-}
-
-# ============================================================
-# momo (port 5003)
-# ============================================================
-server {
-    listen 80;
-    server_name momo.awoooi.com;
-
-    location / {
-        proxy_pass http://127.0.0.1:5003;
-        proxy_set_header Host $host;
-        proxy_set_header X-Real-IP $remote_addr;
-        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
-    }
-}
-
-# ============================================================
-# SignOz (port 3301)
-# ============================================================
-server {
-    listen 80;
-    server_name signoz.awoooi.internal;
+    server_name signoz.wooo.work;

    location / {
        proxy_pass http://127.0.0.1:3301;
+        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
+    }
+}
+
+# =============================================================================
+# Tsenyang - www.tsenyang.com (pending migration, temporarily proxied to 110)
+# =============================================================================
+server {
+    listen 80;
+    server_name www.tsenyang.com tsenyang.com;
+    return 301 https://$server_name$request_uri;
+}
+
+server {
+    listen 443 ssl http2;
+    server_name www.tsenyang.com tsenyang.com;
+
+    ssl_certificate /etc/letsencrypt/live/www.tsenyang.com/fullchain.pem;
+    ssl_certificate_key /etc/letsencrypt/live/www.tsenyang.com/privkey.pem;
+
+    location / {
+        proxy_pass http://127.0.0.1:3000;
+        proxy_http_version 1.1;
+        proxy_set_header Host $host;
+        proxy_set_header X-Real-IP $remote_addr;
+    }
+}
+
+# =============================================================================
+# Stock Platform - stock.wooo.work
+# =============================================================================
+server {
+    listen 80;
+    server_name stock.wooo.work;
+
+    location /.well-known/acme-challenge/ {
+        root /var/www/html;
+    }
+
+    location / {
+        return 301 https://$server_name$request_uri;
+    }
+}
+
+server {
+    listen 443 ssl http2;
+    server_name stock.wooo.work;
+
+    ssl_certificate /etc/letsencrypt/live/stock.wooo.work/fullchain.pem;
+    ssl_certificate_key /etc/letsencrypt/live/stock.wooo.work/privkey.pem;
+
+    # Admin backend is served directly, bypassing the main site's Basic Auth
+    location = /admin {
+        return 301 /admin/;
+    }
+
+    location /admin/ {
+        auth_basic off;
+        proxy_pass http://192.168.0.110:31235;
+        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
-    }
-}
-
-# ============================================================
-# MinIO (port 9000 API / 9001 Console)
-# ============================================================
-server {
-    listen 80;
-    server_name minio.awoooi.internal;
-
-    location / {
-        proxy_pass http://127.0.0.1:9001;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
-        client_max_body_size 500m;
+        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+        proxy_set_header X-Forwarded-Proto $scheme;
+        proxy_buffering off;
+    }
+
+    # Public front-end main site
+    location / {
+        proxy_pass http://192.168.0.110:31235;
+        proxy_http_version 1.1;
+        proxy_set_header Upgrade $http_upgrade;
+        proxy_set_header Connection "upgrade";
+        proxy_set_header Host $host;
+        proxy_set_header X-Real-IP $remote_addr;
+        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+        proxy_set_header X-Forwarded-Proto $scheme;
    }
 }

-# ============================================================
-# LiteLLM (port 4000)
-# ============================================================
+# =============================================================================
+# MOMO PRO - mo.wooo.work (pending deployment)
+# =============================================================================
 server {
    listen 80;
-    server_name litellm.awoooi.internal;
+    server_name mo.wooo.work;
+    return 301 https://$server_name$request_uri;
+}
+
+server {
+    listen 443 ssl http2;
+    server_name mo.wooo.work;
+
+    ssl_certificate /etc/letsencrypt/live/mo.wooo.work/fullchain.pem;
+    ssl_certificate_key /etc/letsencrypt/live/mo.wooo.work/privkey.pem;

    location / {
-        proxy_pass http://127.0.0.1:4000;
+        proxy_pass http://127.0.0.1:5003;
+        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 300s;
    }
 }

-# ============================================================
-# n8n (port 5678)
-# ============================================================
+# =============================================================================
+# Bitan pharmacy - bitan.wooo.work (pending deployment)
+# =============================================================================
 server {
    listen 80;
-    server_name n8n.awoooi.internal;
+    server_name bitan.wooo.work;
+    return 301 https://$server_name$request_uri;
+}
+
+server {
+    listen 443 ssl http2;
+    server_name bitan.wooo.work;
+
+    ssl_certificate /etc/letsencrypt/live/bitan.wooo.work/fullchain.pem;
+    ssl_certificate_key /etc/letsencrypt/live/bitan.wooo.work/privkey.pem;
+
+    client_max_body_size 25m;

    location / {
-        proxy_pass http://127.0.0.1:5678;
-        proxy_set_header Host $host;
-        proxy_set_header X-Real-IP $remote_addr;
+        proxy_pass http://192.168.0.110:3003;
+        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
+        proxy_set_header Host $host;
+        proxy_set_header X-Real-IP $remote_addr;
    }
 }

-# ============================================================
-# Open WebUI (port 3010)
-# ============================================================
+# =============================================================================
+# VTuber - vtuber.wooo.work
+# =============================================================================
 server {
-    listen 80;
-    server_name open-webui.awoooi.internal;
+    server_name vtuber.wooo.work;
+
+    location /.well-known/acme-challenge/ {
+        root /var/www/html;
+    }

    location / {
-        proxy_pass http://127.0.0.1:3010;
-        proxy_set_header Host $host;
-        proxy_set_header X-Real-IP $remote_addr;
+        proxy_pass https://192.168.0.110;
+        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
-        proxy_read_timeout 300s;
-    }
-}
-
-# ============================================================
-# Docker Registry (port 5001)
-# ============================================================
-server {
-    listen 80;
-    server_name registry.awoooi.internal;
-
-    location / {
-        proxy_pass http://127.0.0.1:5001;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
-        client_max_body_size 2g;
+        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+        proxy_set_header X-Forwarded-Proto $scheme;
    }
+
+    listen 443 ssl; # managed by Certbot
+    ssl_certificate /etc/letsencrypt/live/vtuber.wooo.work/fullchain.pem; # managed by Certbot
+    ssl_certificate_key /etc/letsencrypt/live/vtuber.wooo.work/privkey.pem; # managed by Certbot
+    include /etc/letsencrypt/options-ssl-nginx.conf; # managed by Certbot
+    ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem; # managed by Certbot
+
+}
+
+server {
+    if ($host = vtuber.wooo.work) {
+        return 301 https://$host$request_uri;
+    } # managed by Certbot
+
+
+    listen 80;
+    server_name vtuber.wooo.work;
+    return 404; # managed by Certbot
+
+
 }
--- a/k8s/monitoring/prometheus.yml
+++ b/k8s/monitoring/prometheus.yml
@@ -57,8 +57,8 @@ scrape_configs:
          - https://mo.wooo.work
          - http://192.168.0.188:4000/health/liveliness
          - http://192.168.0.110:3001
-          - http://192.168.0.120:31234
-          - http://192.168.0.120:31235
+          - http://192.168.0.125:32334/api/v1/health
+          - http://192.168.0.125:32335
          - https://www.tsenyang.com
          - http://stock.wooo.work
          - https://bitan.wooo.work
@@ -93,8 +93,8 @@ scrape_configs:
          - 192.168.0.188:6380
          - 192.168.0.188:8089
          # K3s Worker
-          - 192.168.0.120:31234
-          - 192.168.0.120:31235
+          - 192.168.0.125:32334
+          - 192.168.0.125:32335
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
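Reviewer note: section 13.6 checks the loaded alert rules with a `python3 -c` one-liner piped from `/api/v1/rules`. The same check, written out as a small function, might look like the sketch below. `missing_alerts` is a hypothetical helper name and the sample payload is a trimmed, made-up response that only mirrors the `data.groups[].rules[].name` shape the one-liner already relies on.

```python
import json

# Rule names taken from the post-reboot check in section 13.6.
REQUIRED_ALERTS = [
    "BackupAggregateRunFailed",
    "BackupConfigCapturePartial",
    "BackupOffsiteCopyStale",
    "BackupCredentialEscrowEvidenceMissing",
    "ColdStartRecoveryBlocked",
]


def missing_alerts(rules_payload: dict, required=REQUIRED_ALERTS) -> list:
    """Return the required rule names that are absent from a rules response."""
    loaded = {
        rule.get("name")
        for group in rules_payload["data"]["groups"]
        for rule in group["rules"]
    }
    return [name for name in required if name not in loaded]


# Hypothetical, trimmed payload for illustration only.
sample = json.loads(
    '{"data": {"groups": [{"rules": ['
    '{"name": "BackupAggregateRunFailed"},'
    '{"name": "ColdStartRecoveryBlocked"}]}]}}'
)
print(missing_alerts(sample))
```

An empty list means every required rule is loaded; anything else names the rules that need to be re-synced from `ops/monitoring/alerts-unified.yml`.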
--- a/ops/monitoring/alerts-unified.yml
+++ b/ops/monitoring/alerts-unified.yml
--- a/ops/reboot-recovery/full-stack-backup-baseline.yml
+++ b/ops/reboot-recovery/full-stack-backup-baseline.yml
@@ -0,0 +1,306 @@
+version: 2026-05-19.v7
+scope: "Backup baseline for all services, data, configuration, and restore verification on 110/120/121/188"
+
+principles:
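+  - "Data backups and config backups are layered: DB/PV/object data cover the data; configs cover the bootable state."
+  - "Secrets, TLS private keys, and SSH host keys may go into encrypted restic/Velero backups, but must never be printed to logs, the repo, or Telegram."
+  - "The backup system itself must also be backed up: restic repository health, password/key escrow, offsite copy, and restore drill evidence are all required."
+  - "Every backup must have three pieces of evidence: the schedule exists, a recent success time, and a restore or dry-run verification."
+  - "AI auto-repair defaults to observe-only in the backup/restore domain; deletion, DROP DB, or overwriting a production namespace is forbidden without fresh successful-backup evidence and a baseline gate."
+  - "From 2026-05-19 the backup retention policy is latest-only: each local restic repo, the 188 MOMO file backup, and the Google Drive/rclone offsite mirror keep only the latest copy."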
+
+backup_domains:
+  - id: host_configs
+    owner_host: "110"
+    script: "/backup/scripts/backup-configs.sh"
+    repository: "/backup/configs"
+    schedule: "daily via /backup/scripts/backup-all.sh"
+    max_age_hours: 48
+    includes:
+      - "110/188/120/121: /etc/nginx, /etc/systemd/system, /etc/cron.d, /etc/crontab"
+      - "110/188/120/121: /etc/letsencrypt, /etc/ssh, /etc/fstab, /etc/hosts, /etc/netplan"
+      - "110: /opt/harbor, /opt/sentry, /home/wooo/monitoring, /home/wooo/scripts, /backup/scripts"
+      - "188: /opt/n8n, /opt/open-webui, /opt/litellm, /opt/signoz, /home/ollama/momo-pro, /home/ollama/bin"
+      - "120/121: /etc/rancher/k3s, K3s manifests, containerd/keepalived host config"
+      - "K8s: workloads, services, ingress, configmaps, secrets, RBAC, PV/PVC, CRDs, Velero schedules/backups"
+    restore_test: "Sample a restic restore into an isolated directory and confirm the nginx/systemd/K8s YAML is readable; never overwrite production directly."
+
+  - id: awoooi_databases
+    owner_host: "110"
+    scripts:
+      - "/backup/scripts/backup-awoooi.sh"
+      - "/backup/scripts/backup-awoooi-frequent.sh"
+    repository: "/backup/awoooi"
+    schedule: "daily 02:00 + high-frequency 08:00/14:00/20:00"
+    max_age_hours: 7
+    includes:
+      - "awoooi_prod"
+      - "awoooi_dev"
+      - "k3s_datastore if present"
+    restore_test: "pg_restore/psql into an isolated DB and verify the schema and core table row counts; must not overwrite the production DB."
+
+  - id: gitea_and_ci
+    owner_host: "110"
+    repository: "/backup/gitea"
+    schedule: "daily via backup-all"
+    max_age_hours: 48
+    includes:
+      - "Gitea DB"
+      - "Git repositories"
+      - "Gitea app.ini and runner registration/config evidence"
+      - "workflow definitions from repos"
+    restore_test: "Sample git fsck / git clone; the Gitea DB dump is readable."
+
+  - id: harbor_registry
+    owner_host: "110"
+    repository: "/backup/harbor"
+    schedule: "daily via backup-all"
+    max_age_hours: 48
+    includes:
+      - "Harbor DB/config"
+      - "registry storage"
+      - "TLS/config state from configs backup"
+    restore_test: "Sampled registry manifests/blobs are readable; Harbor compose/config can be rebuilt."
+
+  - id: observability
+    owner_host: "110"
+    repositories:
+      - "/backup/monitoring"
+      - "/backup/signoz"
+    schedule: "daily via backup-all"
+    max_age_hours: 48
+    includes:
+      - "Prometheus TSDB"
+      - "Grafana dashboards/datasources"
+      - "Alertmanager config/state"
+      - "SignOz ClickHouse/SQLite/config"
+      - "blackbox/node-exporter textfile config"
+    restore_test: "Lint the Prometheus/Grafana/Alertmanager configuration; the SignOz dump can list tables."
+
+  - id: sentry
+    owner_host: "110"
+    coverage_status: "covered_by_backup_sentry_script"
+    script: "/backup/scripts/backup-sentry.sh"
+    repository: "/backup/sentry"
+    schedule: "daily via backup-all; config also covered by /backup/configs"
+    max_age_hours: 48
+    includes:
+      - "Sentry compose/.env/config"
+      - "Sentry Postgres logical dump"
+      - "Sentry ClickHouse volume snapshot and table inventory"
+      - "Sentry Kafka queue volume snapshot"
+      - "Sentry Redis / SeaweedFS / Taskbroker / Vroom / Symbolicator state"
+    restore_test: "First verify in an isolated compose stack that the Postgres dump is readable, the ClickHouse volume can be mounted, and web/symbolicator/snuba can start."
+
+  - id: credential_escrow
+    owner_host: "human-controlled"
+    coverage_status: "gap_p0_out_of_band_escrow_required"
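+    repository: "Must not live in the same restic repo; keep it in a password manager or an offline encrypted vault"
+    schedule: "Update escrow immediately after every Secret is added or rotated; manual spot check monthly"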
+    max_age_hours: 744
+    includes:
+      - "restic password files / repository keys / Google Drive rclone.conf / offsite provider credentials"
+      - "Cloud DNS / registrar / CDN / tunnel 管理帳號與 recovery codes"
+      - "Gitea/Harbor/Sentry/admin break-glass credentials"
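+      - "Recovery paths for Git deploy keys, runner registration tokens, and the K8s bootstrap/admin kubeconfig"
+      - "Rotation and recovery procedures for Google Drive / OAuth / Telegram / AI provider tokens, with no plaintext output"
+    restore_test: "Use a manual two-person review to confirm the key escrow can be found, can be decrypted, and can be used to list snapshots; never write Secret values into the repo or monitoring labels."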
+
+  - id: external_dns_and_public_routes
+    owner_host: "110"
+    coverage_status: "covered_by_public_route_evidence_backup; provider_zone_export_still_requires_credentials"
+    script: "/backup/scripts/backup-public-routes.sh"
+    repository: "/backup/public-routes"
+    schedule: "daily via backup-all; DNS/CDN provider zone export after every routing change when credentials are available"
+    max_age_hours: 168
+    includes:
+      - "wooo.work DNS answers; exporting CDN/Cloudflare/registrar settings still requires a provider token"
+      - "public nginx route map, TLS renewal config, ACME account evidence"
+      - "blackbox public endpoint inventory and expected status codes"
+      - "VPN/tunnel/port-forward/HA VIP public routing configuration"
+    restore_test: "Rebuild the public route checklist from the export and confirm that endpoints such as awoooi/mo/registry/harbor/gitea map correctly; never change production DNS during the test."
+
+  - id: backup_repositories_and_integrity
+    owner_host: "110/188/121/offsite"
+    coverage_status: "covered_locally_by_check_backup_integrity_script; offsite copy still depends on credentials"
+    scripts:
+      - "/backup/scripts/check-backup-integrity.sh"
+      - "/backup/scripts/configure-offsite-rclone.sh"
+      - "/backup/scripts/configure-offsite-b2.sh"
+      - "/backup/scripts/sync-offsite-backups.sh"
+      - "/backup/scripts/backup-offsite-readiness-gate.sh"
+      - "/backup/scripts/offsite-escrow-evidence-report.sh"
+      - "/backup/scripts/mark-credential-escrow-verified.sh"
+    repositories:
+      - "/backup/* restic repos"
+      - "/home/ollama/backup/110"
+      - "Google Drive/rclone/offsite remote when credentials are configured"
+    schedule: "daily freshness; daily 06:10 offsite status; daily 06:15 offsite escrow evidence report; weekly restic check; monthly sample restore drill"
+    max_age_hours: 168
+    includes:
+      - "restic snapshots metadata, repo config, locks/prune policy"
+      - "188 backup-from-110 rsync copy"
+      - "offsite copy status and retention policy"
+      - "restore drill logs with snapshot id and restored object counts"
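+    restore_test: "Weekly `restic check --read-data-subset=1%`; monthly `restic dump latest <sample>` into a 0700 temporary directory to verify readability."
+    retention_policy: "latest-only; after a new snapshot succeeds, each local restic repo runs --group-by \"\" --keep-last=1 + prune; the 188 MOMO file backup keeps only the latest copy; the offsite Google Drive/rclone mirror follows the local repo and deletes old files."
+    offsite_sync_policy: "offsite-escrow-evidence-report.sh first produces redacted evidence and NEXT_STEP; backup-offsite-readiness-gate.sh then runs status / dry-run-small / pre-full-sync; sync-offsite-backups.sh defaults to status; dry-run can be run at any time; a Google Drive/rclone full sync must pick an off-peak window, writes /backup/offsite/rclone-last-success only after success, and deletes old remote files when OFFSITE_SYNC_DELETE_OLD=1. A full sync must not overlap with local backup jobs and must be at least 270 minutes away from the next scheduled backup."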
+
+  - id: momo_web_and_data
+    owner_host: "188"
+    scripts:
+      - "/backup/scripts/backup-momo.sh on 110"
+      - "/home/ollama/bin/momo-pg-backup.sh on 188"
+    repositories:
+      - "/backup/momo"
+      - "/home/ollama/momo_backups"
+    schedule: "110 daily + 188 daily 02:00"
+    max_age_hours: 30
+    includes:
+      - "mo.wooo.work app DB"
+      - "momo uploads/files/config"
+      - "scheduler config and cron"
+    restore_test: "After an isolated DB restore, run the app health check; confirm the tables and row counts that mo.wooo.work needs."
+
+  - id: ai_and_tooling
+    owner_host: "188"
+    coverage_status: "covered_by_backup_ai_artifacts_for_manifest_and_metadata; model_blobs_require_manual_classification"
+    script: "/backup/scripts/backup-ai-artifacts.sh"
+    repositories:
+      - "/backup/langfuse"
+      - "/backup/open-webui"
+      - "/backup/clawbot"
+      - "/backup/configs"
+      - "/backup/ai-artifacts"
+    schedule: "daily via backup-all"
+    max_age_hours: 48
+    includes:
+      - "Langfuse traces/evaluations"
+      - "Open-WebUI conversations/config"
+      - "LiteLLM config, model routing, provider state"
+      - "OpenClaw/ClawBot Redis or persistent state"
+      - "n8n workflows/credentials through encrypted config backup"
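+      - "Ollama model manifest/tag list/Modelfile; back up blobs only for custom models/adapters or ones that cannot be re-downloaded"
+      - "KM/RAG/vector state; if it lives in the AWOOOI DB it is restored with the DB dump; an external vector store must have its own dump"
+    restore_test: "Sample-export workflow/config; the Redis dump is readable; the Langfuse/Open-WebUI DB dumps are readable; the Ollama manifest tar can list model tags."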
+
+  - id: source_of_truth_and_ops_memory
+    owner_host: "110/Gitea"
+    coverage_status: "gap_p1_sanitized_operational_context"
+    repositories:
+      - "/backup/gitea"
+      - "/backup/configs"
+    schedule: "Gitea daily; configs daily; update docs/LOGBOOK.md and runbooks after every incident"
+    max_age_hours: 48
+    includes:
+      - "All Git repositories, Ansible roles/playbooks/inventory, K8s manifests, monitoring rules"
+      - "Decision and startup-order documents such as AGENTS/HARD_RULES/runbooks/LOGBOOK/ADR"
+      - "AI agent handoff summaries and operational memory exports after sanitization"
+      - "CI/CD workflow definitions, runner labels, deployment marker policy"
+    restore_test: "Sample-clone a repo from the Gitea backup and run ansible/k8s/alerts YAML validation; never back up chats or shell transcripts that contain plaintext tokens."
+
+  - id: k3s_and_velero
+    owner_host: "120"
+    schedule: "Velero daily-awoooi-prod + weekly restore dry-run"
+    max_age_hours: 25
+    includes:
+      - "K8s manifests and CRDs"
+      - "Secrets/ConfigMaps/RBAC"
+      - "PVC/PV snapshots via Velero provider"
+      - "backup-restore-test CronJob and result metrics"
+    restore_test: "The backup-restore-test CronJob runs a weekly dry-run into the restore-test-dry namespace mapping."
+
+  - id: offsite_and_dr
+    owner_host: "188/121"
+    schedule: "188 backup-from-110 daily 01:00; 121 DR drill monthly"
+    max_age_hours: 25
+    includes:
+      - "110 Harbor/Gitea/bitan rsync copy on 188"
+      - "DR drill evidence on 121"
+      - "Google Drive/rclone remote when credentials are configured"
+    restore_test: "121 DR drill dry-run finds latest Completed Velero backup; 188 backup-from-110 textfile fresh."
+
+monitoring_contract:
+  textfile_metrics:
+    "110": "/home/wooo/node_exporter_textfiles/backup_health.prom"
+    "188": "/home/ollama/node_exporter_textfiles/backup_health.prom"
+    "120": "110's backup_health.prom queries Velero/CronJob/Job status through kubectl on 120"
+  offsite_and_escrow_metrics:
+    - "awoooi_backup_offsite_configured: reports only whether Google Drive/rclone or a compatible provider appears to be configured; never outputs credential values."
+    - "awoooi_backup_offsite_fresh: judges whether the offsite sync is fresh from /backup/offsite/*last_success style markers."
+    - "awoooi_backup_offsite_partial_fresh: judges from the small-scope partial sync marker whether the Google Drive/rclone write path has been proven."
+    - "awoooi_backup_credential_escrow_fresh: judges from /backup/escrow-evidence/*.last_verified style markers whether the manual vault review was completed within 31 days."
+    - "awoooi_backup_dr_next_step_info: uses the next_step label to tell AI patrols and the operator the next safe manual action; contains no secret."
+    - "awoooi_backup_dr_credential_escrow_missing_count: number of items still missing from the vault review."
+    - "awoooi_backup_cron_active_duplicate_count: number of exact duplicate entries in the active crontab on 110."
+    - "awoooi_backup_cron_singular_entry_ok: whether single-entry schedules such as offsite/status/verifier/exporter have exactly one active cron line."
+    - "awoooi_backup_config_capture_ok: whether the latest configs snapshot actually captured the 110/120/121/188 host config and K8s workloads/secrets; never outputs secrets."
+    - "awoooi_backup_config_capture_critical_failed_count: number of critical capture targets missing from the latest config backup."
+  prometheus_alerts:
+    - BackupHealthMonitorMissing110
+    - BackupHealthMonitorMissing188
+    - BackupHealthMonitorStale
+    - BackupExpectedJobMissing
+    - BackupScheduleDuplicateActiveEntries
+    - BackupScheduleSingletonMismatch
+    - BackupScriptMissing
+    - BackupJobStale
+    - BackupAggregateRunFailed
+    - BackupConfigCapturePartial
+    - BackupConfigCaptureStatusStale
+    - BackupIntegrityCheckMissingOrFailed
+    - BackupRestoreDrillMissingOrFailed
+    - BackupRestoreTestMissing
+    - BackupRestoreTestCronMissing
+    - BackupRestoreTestFailed
+    - BackupRestoreTestStale
+    - BackupOffsiteCopyNotConfigured
+    - BackupOffsiteCopyStale
+    - BackupCredentialEscrowEvidenceMissing
+    - BackupRetentionPolicyNotLatestOnly
+    - BackupSnapshotRetentionExceeded
+    - BackupOffsiteFullVerifyFailed
+    - BackupOffsiteRemoteSnapshotRetentionExceeded
+  live_visibility_checks:
+    - "If awoooi_backup_offsite_configured{host=\"110\"} is 0, Prometheus must show BackupOffsiteCopyNotConfigured firing and Alertmanager must have an active alert."
+    - "If the offsite provider is configured, the full marker is not yet fresh, and the full sync enable marker is missing or older than 30 hours, BackupOffsiteCopyStale must be visible in Prometheus and Alertmanager."
+    - "If awoooi_backup_credential_escrow_fresh{host=\"110\"} == 0, BackupCredentialEscrowEvidenceMissing must be visible per item in Prometheus and Alertmanager."
+    - "If awoooi_backup_retention_latest_only{host=\"110\"} or awoooi_backup_retention_offsite_delete_old_enabled{host=\"110\",provider=\"rclone\"} is missing or not 1, BackupRetentionPolicyNotLatestOnly must be visible in Prometheus and Alertmanager."
+    - "If any awoooi_backup_job_snapshot_count{host=\"110\",type=\"restic\"} > 1, BackupSnapshotRetentionExceeded must be visible in Prometheus and Alertmanager."
+    - "If the full offsite marker is fresh but awoooi_backup_offsite_remote_verify_ok{host=\"110\",provider=\"rclone\"} is missing or not 1, BackupOffsiteFullVerifyFailed must be visible in Prometheus."
+    - "If the full offsite marker is fresh and any awoooi_backup_offsite_remote_snapshot_count{host=\"110\",provider=\"rclone\"} > 1, BackupOffsiteRemoteSnapshotRetentionExceeded must be visible in Prometheus."
+    - "If awoooi_backup_cron_active_duplicate_count{host=\"110\"} > 0, BackupScheduleDuplicateActiveEntries must be visible in Prometheus and Alertmanager."
+    - "If any awoooi_backup_cron_singular_entry_ok{host=\"110\"} == 0, BackupScheduleSingletonMismatch must be visible in Prometheus and Alertmanager."
+    - "If any awoooi_backup_config_capture_ok{host=\"110\",critical=\"true\"} == 0, BackupConfigCapturePartial must be visible in Prometheus and Alertmanager, and the target label must identify which config source is missing."
+    - "If awoooi_backup_config_capture_status_timestamp is missing or older than 48 hours, BackupConfigCaptureStatusStale must be visible in Prometheus and Alertmanager."
+    - "The live visibility check only reads the Prometheus / Alertmanager APIs; it sends no test alerts, changes no silences, changes no routes, and triggers no repairs."
+  prometheus_recording_rules:
+    - awoooi_recovery_core_ready
+    - awoooi_recovery_dr_offsite_ready
+
+release_gate:
+  cold_start_script: "bash scripts/reboot-recovery/full-stack-cold-start-check.sh --monitor-read-only --no-color"
+  p3_script: "bash scripts/reboot-recovery/p3-controlled-release-gate.sh"
+  recovery_core_scorecard: "bash scripts/reboot-recovery/full-stack-recovery-scorecard.sh --require-core"
+  dr_offsite_operator_checklist: "bash scripts/reboot-recovery/dr-offsite-operator-checklist.sh"
+  dr_offsite_scorecard: "bash scripts/reboot-recovery/full-stack-recovery-scorecard.sh --require-dr"
+  dr_offsite_final_gate: "bash scripts/reboot-recovery/dr-offsite-operator-checklist.sh --require-dr"
+  dr_offsite_post_marker_wait: "bash scripts/reboot-recovery/wait-dr-offsite-ready.sh --timeout-seconds 900 --interval-seconds 30 --no-color"
+  required_green:
+    - "backup_health.prom fresh on 110/188"
+    - "awoooi_backup_job_fresh == 1 for every expected job"
+    - "Velero latest Completed backup < 25h"
+    - "backup-restore-test CronJob present and lastSuccessfulTime not stale"
+    - "weekly restic check successful"
+    - "monthly sample restore drill successful"
+  warning_until_human_escrow_ready:
+    - "offsite provider configured and latest offsite copy marker fresh"
+    - "credential escrow marker files refreshed after human verification; marker files must contain only timestamp/evidence id, never secret values"
+  strict_dr_exit_conditions:
+    - "Google Drive/rclone provider configured on 110 host-local rclone.conf; /backup/scripts/offsite.env keeps only non-secret remote/path with mode 0600"
+    - "credential escrow markers fresh for restic_repository_password, offsite_provider_credentials, break_glass_admin_credentials, dns_registrar_recovery, oauth_ai_provider_recovery"
+    - "full offsite marker /backup/offsite/rclone-last-success fresh after full 13 repo sync"
+    - "full-stack-recovery-scorecard.sh --require-dr exits 0"
+    - "recovery-scorecard-contract-check.py --expect-dr-ready exits 0 against 110 Prometheus"
+    - "dr-offsite-operator-checklist.sh --require-dr exits 0 after scorecard, Prometheus recording rule, and backup alert visibility contract agree"
+    - "wait-dr-offsite-ready.sh exits 0 after post-marker textfile, Prometheus, Alertmanager, and final checklist convergence"
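Reviewer note: the `max_age_hours` fields in the backup baseline above amount to a simple freshness rule. The following is a minimal sketch of that rule, assuming the last-success time of a job or marker is available as a Unix timestamp. `is_fresh` is a hypothetical helper for illustration and is not one of the repository scripts.

```python
from __future__ import annotations

import time


def is_fresh(
    last_success_epoch: float,
    max_age_hours: float,
    now: float | None = None,
) -> bool:
    """Return True when the last success is no older than max_age_hours.

    Hypothetical helper: the real exporter may compute freshness differently.
    """
    current = time.time() if now is None else now
    age_seconds = current - last_success_epoch
    # A timestamp in the future is treated as not fresh, since it usually
    # points at clock skew or a hand-edited marker.
    return 0 <= age_seconds <= max_age_hours * 3600
```

Passing `now` explicitly keeps the function deterministic in tests; for example, a success 24 hours old counts as fresh under the 25-hour limit used for `k3s_and_velero`, while one 26 hours old does not.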
--- a/ops/reboot-recovery/full-stack-cold-start-baseline.yml
+++ b/ops/reboot-recovery/full-stack-cold-start-baseline.yml
@@ -1,337 +1,204 @@
-# AWOOOI full-stack cold-start dependency baseline.
-# This is the machine-readable companion to docs/runbooks/FULL-STACK-COLD-START-SOP.md.
-#
-# Intent:
-# - document the reboot startup order and service dependency graph
-# - define release gates for operators and AI automation
-# - keep stateful services out of generic auto-restart loops
-
-version: "2026-05-06"
-incident_reference: "2026-05-05 full-stack reboot recovery"
+version: 2026-05-06.v1
 scope:
-  managed_hosts:
-    "110":
-      address: "192.168.0.110"
-      ssh_user: "wooo"
-      roles:
-        - registry
-        - git
-        - observability
-        - sentry
-        - runners
-    "120":
-      address: "192.168.0.120"
-      ssh_user: "wooo"
-      roles:
-        - k3s_server
-        - keepalived_vip
-        - awoooi_nodeport
-    "121":
-      address: "192.168.0.121"
-      ssh_user: "wooo"
-      roles:
-        - k3s_node
-        - keepalived_peer
-        - dr_drill
-    "188":
-      address: "192.168.0.188"
-      ssh_user: "ollama"
-      roles:
-        - postgres_datastore
-        - redis
-        - momo
-        - signoz
-        - ai_proxy
-  intentionally_skipped:
-    "112":
-      role: "kali"
-      reason: "scanner host is not required for production cold-start release"
+  included_hosts:
+    "110": "DevOps, registry, observability, Sentry, runners"
+    "120": "K3s control plane and VIP"
+    "121": "K3s peer node and DR drill cron"
+    "188": "Data, AI, web, momo, SignOz, public nginx gateway"
+  excluded_hosts:
+    "112": "Kali security host; recorded but not part of cold-start release gate"

-global_policy:
-  startup_rule: "Recover the dependency chain before releasing high-load work."
-  runner_cd_rule: "Release runners and CD only after data, registry, K3s, workload, routes, schedules, and alert E2E gates are green."
-  ai_auto_repair_rule: "Observe-only until all green gates pass and host load stays below baseline."
-  destructive_state_rule: "No DROP, data directory deletion, volume recreation, pg_resetwal, fsck, or backup restore without explicit human approval."
-  no_generic_restart_rule: "Never run generic docker restart against all containers during cold start."
+principles:
+  - recover_dependency_chain_before_workloads
+  - keep_ai_auto_repair_observe_only_until_green
+  - never_generic_restart_stateful_services
+  - preserve_corrupt_parts_in_quarantine_not_delete
+  - release_runners_and_crawlers_last

 phases:
-  - id: "P0-NETWORK"
+  - id: P0-NETWORK
    order: 0
-    start_after: []
-    owns:
-      - "LAN reachability"
-      - "SSH reachability"
-      - "ARP evidence"
    gates:
-      - "ping 192.168.0.110/120/121/188 succeeds"
-      - "TCP 22 open on 192.168.0.110/120/121/188"
-      - "reboot evidence captured before repair"
-    blocks:
-      - "all other phases"
+      - ping_110_120_121_188
+      - ssh_port_110_120_121_188
+      - arp_evidence_or_monitor_mode_fallback

-  - id: "P0-188-DATA"
-    order: 1
-    start_after:
-      - "P0-NETWORK"
-    host: "188"
-    service_order:
-      - "containerd"
-      - "docker"
-      - "postgresql@14-main"
-      - "k3s_datastore.kine maintenance"
-      - "redis-server"
-      - "ollama or current AI proxy dependencies"
-      - "nginx"
-      - "Docker networks"
-      - "MinIO / OpenClaw / SignOz"
-      - "momo / litellm / batch services"
+  - id: P0-188-DATA
+    order: 10
+    required_before:
+      - P1-K3S
+      - P2-WORKLOAD-ALERTCHAIN
    gates:
-      - "PostgreSQL port 5432 open"
-      - "pg_isready reports accepting connections"
-      - "Redis replies PONG"
-      - "momo health endpoint returns 200"
-      - "SignOz HTTP route is reachable"
-    blocks:
-      - "120/121 K3s"
-      - "AWOOOI API database access"
-      - "Alertmanager webhook"
-      - "momo public site"
+      - containerd_docker_postgresql_redis_ollama_nginx_active
+      - postgresql_5432_accepting_connections
+      - redis_pong
+      - momo_db_not_restarting
+      - signoz_http_reachable
+      - momo_health_200

-  - id: "P0-110-REGISTRY-OBSERVABILITY"
-    order: 2
-    start_after:
-      - "P0-NETWORK"
-      - "P0-188-DATA"
-    host: "110"
-    service_order:
-      - "docker"
-      - "orphan Exited(128/137) cleanup if needed"
-      - "Harbor log"
-      - "Harbor registry stack"
-      - "Gitea"
-      - "Prometheus / Alertmanager / Grafana / exporters"
-      - "Langfuse"
-      - "SignOz or local observability companions"
-      - "Sentry DB layer"
-      - "Sentry web / worker / consumer layer"
-      - "Gitea host runner and actions runners"
+  - id: P0-110-REGISTRY-OBSERVABILITY
+    order: 20
+    required_before:
+      - P1-K3S
+      - P3-RUNNER-CD
    gates:
-      - "Harbor /v2/ returns 200 or 401"
-      - "Gitea returns 200 or 302"
-      - "Prometheus /-/ready returns 200"
-      - "Alertmanager /-/healthy returns 200"
-      - "Sentry HTTP returns 200, 302, or 400"
-      - "runner CPUQuota=200%, MemoryMax=2G, WatchdogUSec=0"
-    blocks:
-      - "K3s image pulls"
-      - "runtime CD"
-      - "alert rules deploy"
-      - "code-review runners"
+      - docker_active
+      - harbor_v2_200_or_401
+      - gitea_200_or_302
+      - prometheus_ready
+      - alertmanager_healthy
+      - sentry_http_reachable
+      - docker_containers_all_up
+      - runner_watchdog_disabled
+      - sentry_clickhouse_not_restarting
+      - cadvisor_image_v0_47_0
+      - cadvisor_cpu_cap_0_3

-  - id: "P1-K3S"
-    order: 3
-    start_after:
-      - "P0-188-DATA"
-      - "P0-110-REGISTRY-OBSERVABILITY"
-    hosts:
-      - "120"
-      - "121"
-    service_order:
-      - "120 k3s.service"
-      - "121 k3s-agent.service or live role"
-      - "CNI / kube-proxy"
-      - "nodes Ready"
-      - "core pods"
-      - "awoooi-prod pods"
-      - "keepalived VIP 192.168.0.125"
-      - "NodePorts 32334 and 32335"
+  - id: P1-K3S
+    order: 30
    gates:
-      - "120 can reach 188:5432"
-      - "K3s nodes show Ready"
-      - "VIP 192.168.0.125 is present"
-      - "awoooi-prod pods are Running or Completed"
-    blocks:
-      - "AWOOOI workload health"
-      - "public AWOOOI route"
-      - "Alertmanager webhook"
+      - 120_can_reach_188_postgres
+      - mon_and_mon1_ready
+      - no_non_running_non_succeeded_pods
+      - awoooi_dev_api_nodeport_200
+      - vip_192_168_0_125_present

-  - id: "P2-WORKLOAD-ALERTCHAIN"
-    order: 4
-    start_after:
-      - "P1-K3S"
-    owners:
-      - "AWOOOI API"
-      - "AWOOOI Web"
-      - "Alertmanager webhook"
-      - "Telegram delivery"
+  - id: P2-WORKLOAD-ALERTCHAIN
+    order: 40
    gates:
-      - "http://192.168.0.125:32334/api/v1/health returns 2xx/3xx"
-      - "http://192.168.0.125:32335/ returns 2xx/3xx"
-      - "Alertmanager webhook POST returns 2xx"
-      - "K8s Telegram secrets are present and non-placeholder"
-    blocks:
-      - "AI auto-remediation"
-      - "full alert confidence"
+      - awoooi_api_vip_health_2xx_or_3xx
+      - awoooi_web_vip_2xx_or_3xx
+      - alertmanager_webhook_e2e_2xx_when_release_gate

-  - id: "P2-PUBLIC-ROUTES"
-    order: 5
-    start_after:
-      - "P2-WORKLOAD-ALERTCHAIN"
+  - id: P2-PUBLIC-ROUTES
+    order: 50
+    public_https_routes:
+      - https://awoooi.wooo.work/api/v1/health
+      - https://awoooi.wooo.work/
+      - https://mo.wooo.work/
+      - https://mo.wooo.work/health
+      - https://gitea.wooo.work/
+      - https://harbor.wooo.work/
+      - https://registry.wooo.work/
+      - https://sentry.wooo.work/
+      - https://signoz.wooo.work/
+      - https://stock.wooo.work/
+      - https://langfuse.wooo.work/
+      - https://bitan.wooo.work/
+      - https://aiops.wooo.work/
+
+  - id: P2-SCHEDULES
+    order: 60
    gates:
-      - "https://awoooi.wooo.work/api/v1/health returns 2xx/3xx"
-      - "https://awoooi.wooo.work/ returns 2xx/3xx"
-      - "https://mo.wooo.work/ returns 2xx/3xx"
-      - "https://mo.wooo.work/health returns 2xx/3xx"
-    blocks:
-      - "external release complete"
+      - cron_active_188_110_120_121
+      - docker_restart_textfile_fresh_188
+      - docker_stats_textfile_fresh_188_110
+      - systemd_units_textfile_fresh_110
+      - backup_health_textfile_fresh_188_110
+      - backup_from_110_success_under_25h
+      - expected_backup_jobs_fresh_188_110
+      - host_service_config_backup_success_under_48h
+      - sentry_dedicated_backup_success_under_48h
+      - backup_integrity_check_success_under_8d
+      - backup_restore_drill_success_under_31d
+      - velero_schedule_present_and_latest_completed_under_25h
+      - velero_restore_test_cron_present
+      - momo_scheduler_registered_jobs
+      - k8s_cronjobs_unsuspended
+      - k8s_failed_jobs_zero
+      - dr_drill_cron_present_121

-  - id: "P2-SCHEDULES"
-    order: 6
-    start_after:
-      - "P2-PUBLIC-ROUTES"
-    gates:
-      - "110/120/121/188 cron services active"
-      - "188 backup-from-110 success age below 25h"
-      - "188 docker restart/stats textfiles fresh"
-      - "188 momo-scheduler container healthy and registration evidence present within 6h"
-      - "110 docker/systemd textfiles fresh"
-      - "120 awoooi-prod CronJobs present and unsuspended"
-      - "120 awoooi-prod has no failed Jobs"
-      - "121 DR drill cron present"
-    blocks:
-      - "done criteria"
-      - "AI auto-remediation release"
+  - id: P3-HIGH-LOAD-WORK
+    order: 70
+    release_after:
+      - P0-NETWORK
+      - P0-188-DATA
+      - P0-110-REGISTRY-OBSERVABILITY
+      - P1-K3S
+      - P2-WORKLOAD-ALERTCHAIN
+      - P2-PUBLIC-ROUTES
+      - P2-SCHEDULES
+    release_conditions:
+      - host_load_per_core_below_1_0_for_15m
+      - no_restart_storm
+      - clickhouse_merge_or_kafka_lag_not_increasing_two_checks
+    examples:
+      - sentry_snuba_consumers
+      - momo_scheduler_chrome_crawlers
+      - gitea_actions_jobs

-  - id: "P3-HIGH-LOAD-RELEASE"
-    order: 7
-    start_after:
-      - "P2-SCHEDULES"
-    release_last:
-      - "momo-scheduler / Chrome crawlers"
-      - "Sentry Snuba consumers"
-      - "SignOz ClickHouse merge-heavy work"
-      - "Gitea actions runners"
-      - "runtime CD jobs"
-    gates:
-      - "all prior gates green"
-      - "host load per CPU below 1.0 for 15 minutes before releasing batch/runner work"
-      - "ClickHouse/Kafka/Snuba backlog decreasing for two consecutive checks if backlog exists"
+  - id: P3-RUNNER-CD
+    order: 80
+    release_conditions:
+      - all_previous_gates_green
+      - runner_cpuquota_200_percent
+      - runner_memorymax_2g
+      - watchdogusec_0
+      - active_awoooi_cd_or_gitea_actions_task_containers_cpu_capped_during_cold_start

-baselines:
-  endpoints:
-    awoooi_vip_api_health: "http://192.168.0.125:32334/api/v1/health"
-    awoooi_vip_web: "http://192.168.0.125:32335/"
-    awoooi_public_api_health: "https://awoooi.wooo.work/api/v1/health"
-    awoooi_public_web: "https://awoooi.wooo.work/"
-    momo_public_web: "https://mo.wooo.work/"
-    momo_public_health: "https://mo.wooo.work/health"
-    harbor_registry: "http://127.0.0.1:5000/v2/"
-    gitea: "http://127.0.0.1:3001/"
-    prometheus_ready: "http://127.0.0.1:9090/-/ready"
-    alertmanager_healthy: "http://127.0.0.1:9093/-/healthy"
-    sentry: "http://127.0.0.1:9000/"
-  expected_codes:
-    harbor_registry:
-      - 200
-      - 401
-    gitea:
-      - 200
-      - 302
-    prometheus_ready:
-      - 200
-    alertmanager_healthy:
-      - 200
-    sentry:
-      - 200
-      - 302
-      - 400
-    workload_and_public:
-      - "2xx"
-      - "3xx"
-  runner_guardrails:
-    CPUQuotaPerSecUSec: "2s"
-    MemoryMax: "2147483648"
-    WatchdogUSec: "0"
-  freshness_seconds:
-    docker_textfiles: 300
-    systemd_textfiles: 300
-    backup_success: 90000
+automation_policy:
+  before_green:
+    ai_auto_repair: observe_only
+    alertmanager_smoke_test: manual_or_release_gate_only
+    stateful_service_actions: human_approval_required
+    generic_restart: forbidden
+  after_green:
+    ai_auto_repair: limited_execution_for_stateless_exporters_only
+    stateful_service_actions: human_in_the_loop
+    runner_cd: controlled_release

-stateful_services:
-  hard_block_auto_repair:
-    - "188 PostgreSQL data directory"
-    - "188 k3s_datastore"
-    - "188 momo database"
-    - "110 Harbor DB"
-    - "110 Sentry DB"
-    - "Sentry ClickHouse data"
-    - "SignOz ClickHouse data"
-    - "Kafka topic/log directories"
-  human_in_loop_required:
-    - "pg_resetwal"
-    - "ClickHouse clean-clone recovery"
-    - "Kafka checkpoint file quarantine"
-    - "backup restore"
-    - "filesystem repair"
+resource_guardrails:
+  "110":
+    cadvisor:
+      image: gcr.io/cadvisor/cadvisor:v0.47.0
+      cpus: 0.3
+      mem_limit: 512m
+    sentry_snuba_cold_start_consumers:
+      cpus: 0.5
+      persist_in: /opt/sentry/docker-compose.override.yml
+    sentry_self_hosted_memory_limits:
+      taskscheduler_mem_limit: 1g
+      relay_mem_limit: 2g
+      persist_in: /opt/sentry/docker-compose.override.yml
+      note: "taskscheduler/relay must not regress to 512m/1g, which causes sustained >85% memory-limit pressure; host 110 still relies on ClickHouse/Kafka/Snuba CPU caps to prevent cold-start overload."
+    actions_runner_systemd:
+      cpu_quota: 200%
+      memory_max: 2G
+      watchdog: disabled
+  "188":
+    ollama_systemd:
+      cpu_quota: 300%
+      memory_high: 20G
+      memory_max: 24G
+      max_loaded_models: 1
+      num_parallel: 1
+      note: "Local Ollama on 188 is a cold-start dependency and the Open-WebUI local endpoint; it must not stay disabled/inactive, nor keep the unrestrained 700%/45G guardrail."
+    litellm:
+      cpus: 1.0
+      memory: 1G
+      mode: stateless
+    momo_scheduler:
+      cpus: 2.0
+      memory: 2G
+    signoz_clickhouse:
+      memory: 24G
+      note: do_not_lower_during_merge_backlog

-ai_automation_gate:
-  observe_only_until:
-    - "P0-NETWORK green"
-    - "P0-188-DATA green"
-    - "P0-110-REGISTRY-OBSERVABILITY green"
-    - "P1-K3S green"
-    - "P2-WORKLOAD-ALERTCHAIN green"
-    - "P2-PUBLIC-ROUTES green"
-    - "P2-SCHEDULES green"
-    - "no active restart storm"
-    - "host load per CPU below 1.0 for 15 minutes"
-  allowed_before_green:
-    - "diagnose"
-    - "collect evidence"
-    - "notify"
-  blocked_before_green:
-    - "stateful restart"
-    - "destructive repair"
-    - "runner/CD release"
-    - "generic container restart"
-
-persistent_monitoring:
-  host: "110"
-  install_command: "bash scripts/reboot-recovery/install-cold-start-monitor-110.sh"
-  schedule: "*/10 * * * *"
-  mode: "read_only"
-  send_alert_test: false
-  scripts:
-    check: "/home/wooo/scripts/full-stack-cold-start-check.sh"
-    exporter: "/home/wooo/scripts/cold-start-textfile-exporter.sh"
-  outputs:
-    textfile: "/home/wooo/node_exporter_textfiles/cold_start_recovery.prom"
-    last_log: "/home/wooo/reboot-recovery/cold-start-last.log"
-  metrics:
-    - "awoooi_cold_start_monitor_up"
-    - "awoooi_cold_start_pass_gates"
-    - "awoooi_cold_start_warn_gates"
-    - "awoooi_cold_start_blocked_gates"
-    - "awoooi_cold_start_last_run_timestamp"
-    - "awoooi_cold_start_last_green_timestamp"
-    - "awoooi_cold_start_last_result"
-  prometheus_alerts:
-    - "ColdStartMonitorMissing"
-    - "ColdStartMonitorStale"
-    - "ColdStartRecoveryBlocked"
-    - "ColdStartRecoveryDegraded"
-    - "ColdStartLastGreenTooOld"
-  ai_contract:
-    monitor_missing: "diagnose cron/textfile path only"
-    stale: "collect cron log and last check log"
-    degraded: "collect evidence, do not release high-load work"
-    blocked: "follow first BLOCKED gate in phase order"
-    forbidden: "generic restart, stateful restart, destructive repair"
-
-final_confirmation:
-  command: "bash scripts/reboot-recovery/full-stack-cold-start-check.sh --watch --interval 60 --max-attempts 30 --send-alert-test"
-  green_result:
-    PASS: "greater than 0"
-    WARN: 0
-    BLOCKED: 0
-    summary: "Result: GREEN"
+authoritative_checks:
+  read_only_monitor:
+    command: bash scripts/reboot-recovery/full-stack-cold-start-check.sh --monitor-read-only --no-color
+    expected_for_cron: PASS>0 WARN=0 BLOCKED=0
+  release_gate:
+    command: SSH_BATCH_MODE=yes bash scripts/reboot-recovery/full-stack-cold-start-check.sh --send-alert-test
+    expected: PASS=64 WARN=0 BLOCKED=0
+  textfile_metric:
+    path: /home/wooo/node_exporter_textfiles/cold_start_recovery.prom
+    green_metric: awoooi_cold_start_last_result{host="110",scope="110_120_121_188",result="green"} 1
+  backup_baseline:
+    path: ops/reboot-recovery/full-stack-backup-baseline.yml
+    required_metrics:
+      - awoooi_backup_health_monitor_up
+      - awoooi_backup_job_fresh
+      - awoooi_backup_integrity_fresh
+      - awoooi_velero_restore_test_cron_present
+      - awoooi_velero_restore_test_last_success_fresh
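Reviewer note: `expected_for_cron: PASS>0 WARN=0 BLOCKED=0` can be read as a predicate over the check script's summary counts. The sketch below assumes the script prints a summary of the form `PASS=72 WARN=2 BLOCKED=3`, as quoted in the logbook entry. `is_green` and `SUMMARY_RE` are hypothetical names, not part of the repository.

```python
from __future__ import annotations

import re

# Assumed summary shape, taken from the logbook entry: "PASS=72 WARN=2 BLOCKED=3".
SUMMARY_RE = re.compile(r"PASS=(\d+)\s+WARN=(\d+)\s+BLOCKED=(\d+)")


def is_green(summary: str) -> bool:
    """Return True only for PASS>0, WARN=0, BLOCKED=0."""
    match = SUMMARY_RE.search(summary)
    if match is None:
        # No recognisable summary is treated as not green rather than guessed.
        return False
    passed, warned, blocked = (int(group) for group in match.groups())
    return passed > 0 and warned == 0 and blocked == 0
```

Under this reading, the `PASS=72 WARN=2 BLOCKED=3` result recorded in the logbook is not green, while the release-gate expectation `PASS=64 WARN=0 BLOCKED=0` is.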
--- a/scripts/ops/backup-alert-label-contract-check.py
+++ b/scripts/ops/backup-alert-label-contract-check.py
@@ -0,0 +1,260 @@
+#!/usr/bin/env python3
+"""
+Validate the backup alert label contract.
+
+Node exporter textfile metrics use labels such as job="backup_all" locally, but
+Prometheus rewrites that metric label to exported_job because the scrape target
+already has job="node-exporter-110". Backup alerts must therefore use
+$labels.exported_job in user-facing text and exported_job="..." in expressions.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+import urllib.parse
+import urllib.request
+from pathlib import Path
+from typing import Any
+
+import yaml
+
+
+DEFAULT_RULES = Path("ops/monitoring/alerts-unified.yml")
+DEFAULT_BASELINE = Path("ops/reboot-recovery/full-stack-backup-baseline.yml")
+
+
+class ContractError(RuntimeError):
+    pass
+
+
+def _load_alerts(path: Path) -> dict[str, dict[str, Any]]:
+    data = yaml.safe_load(path.read_text(encoding="utf-8")) or {}
+    alerts: dict[str, dict[str, Any]] = {}
+    for group in data.get("groups") or []:
+        for rule in group.get("rules") or []:
+            name = rule.get("alert")
+            if name:
+                alerts[name] = rule
+    return alerts
+
+
+def _annotation_text(rule: dict[str, Any]) -> str:
+    annotations = rule.get("annotations") or {}
+    return "\n".join(str(value) for value in annotations.values())
+
+
+def _require_alert(alerts: dict[str, dict[str, Any]], name: str) -> dict[str, Any]:
+    if name not in alerts:
+        raise ContractError(f"missing alert: {name}")
+    return alerts[name]
+
+
+def _require_contains(value: str, expected: str, label: str) -> None:
+    if expected not in value:
+        raise ContractError(f"{label} must contain {expected!r}")
+
+
+def _require_not_contains(value: str, forbidden: str, label: str) -> None:
+    if forbidden in value:
+        raise ContractError(f"{label} must not contain {forbidden!r}")
+
+
+def _expected_backup_alerts(path: Path) -> list[str]:
+    data = yaml.safe_load(path.read_text(encoding="utf-8")) or {}
+    alerts = data.get("monitoring_contract", {}).get("prometheus_alerts") or []
+    if not alerts:
+        raise ContractError(f"missing monitoring_contract.prometheus_alerts in {path}")
+    return [str(alert) for alert in alerts]
+
+
+def static_check(path: Path, baseline_path: Path) -> list[str]:
+    alerts = _load_alerts(path)
+    lines: list[str] = []
+
+    missing = sorted(set(_expected_backup_alerts(baseline_path)) - set(alerts))
+    if missing:
+        raise ContractError(f"alerts-unified.yml missing baseline backup alerts: {missing}")
+    lines.append("OK alerts-unified.yml contains every baseline backup alert")
+
+    rule = _require_alert(alerts, "BackupExpectedJobMissing")
+    _require_contains(str(rule.get("expr", "")), "awoooi_backup_job_configured", "BackupExpectedJobMissing expr")
+    text = _annotation_text(rule)
+    _require_contains(text, "$labels.exported_job", "BackupExpectedJobMissing annotations")
+    _require_not_contains(text, "$labels.job", "BackupExpectedJobMissing annotations")
+    lines.append("OK BackupExpectedJobMissing uses exported_job label")
+
+    rule = _require_alert(alerts, "BackupJobStale")
+    _require_contains(str(rule.get("expr", "")), "awoooi_backup_job_fresh", "BackupJobStale expr")
+    text = _annotation_text(rule)
+    _require_contains(text, "$labels.exported_job", "BackupJobStale annotations")
+    _require_not_contains(text, "$labels.job", "BackupJobStale annotations")
+    for required_label in ["$labels.max_age_hours", "$labels.source", "$labels.target"]:
+        _require_contains(text, required_label, "BackupJobStale annotations")
+    lines.append("OK BackupJobStale uses exported_job/source/target labels")
+
+    rule = _require_alert(alerts, "BackupAggregateRunFailed")
+    _require_contains(
+        str(rule.get("expr", "")),
+        'awoooi_backup_last_run_failed_count{host="110",exported_job="backup_all"}',
+        "BackupAggregateRunFailed expr",
+    )
+    lines.append("OK BackupAggregateRunFailed filters exported_job=backup_all")
+
+    rule = _require_alert(alerts, "BackupConfigCapturePartial")
+    _require_contains(str(rule.get("expr", "")), "awoooi_backup_config_capture_ok", "BackupConfigCapturePartial expr")
+    text = _annotation_text(rule)
+    for required_label in ["$labels.target", "$labels.source"]:
+        _require_contains(text, required_label, "BackupConfigCapturePartial annotations")
+    lines.append("OK BackupConfigCapturePartial uses target/source labels")
+
+    rule = _require_alert(alerts, "BackupConfigCaptureStatusStale")
+    _require_contains(
+        str(rule.get("expr", "")),
+        "awoooi_backup_config_capture_status_timestamp",
+        "BackupConfigCaptureStatusStale expr",
+    )
+    lines.append("OK BackupConfigCaptureStatusStale checks config capture status timestamp")
+
+    rule = _require_alert(alerts, "BackupScriptMissing")
+    _require_contains(_annotation_text(rule), "$labels.script", "BackupScriptMissing annotations")
+    lines.append("OK BackupScriptMissing uses script label")
+
+    rule = _require_alert(alerts, "BackupCredentialEscrowEvidenceMissing")
+    _require_contains(_annotation_text(rule), "$labels.item", "BackupCredentialEscrowEvidenceMissing annotations")
+    lines.append("OK BackupCredentialEscrowEvidenceMissing uses item label")
+
+    return lines
+
+
+def _prom_query(base_url: str, expr: str) -> list[dict[str, Any]]:
+    query = urllib.parse.urlencode({"query": expr})
+    url = f"{base_url.rstrip('/')}/api/v1/query?{query}"
+    with urllib.request.urlopen(url, timeout=8) as response:
+        payload = json.loads(response.read().decode("utf-8"))
+    if payload.get("status") != "success":
+        raise ContractError(f"Prometheus query failed for {expr}: {payload}")
+    return payload.get("data", {}).get("result") or []
+
+
+def _prom_rules(base_url: str) -> list[dict[str, Any]]:
+    url = f"{base_url.rstrip('/')}/api/v1/rules"
+    with urllib.request.urlopen(url, timeout=8) as response:
+        payload = json.loads(response.read().decode("utf-8"))
+    if payload.get("status") != "success":
+        raise ContractError(f"Prometheus rules query failed: {payload}")
+    rules: list[dict[str, Any]] = []
+    for group in payload.get("data", {}).get("groups") or []:
+        for rule in group.get("rules") or []:
+            name = rule.get("name") or rule.get("alert")
+            if not name:
+                continue
+            rules.append(
+                {
+                    "name": str(name),
+                    "health": str(rule.get("health", "")),
+                    "state": str(rule.get("state", "")),
+                    "group": str(group.get("name", "")),
+                }
+            )
+    return rules
+
+
+def _require_live_label(base_url: str, expr: str, labels: set[str]) -> str:
+    rows = _prom_query(base_url, expr)
+    if not rows:
+        raise ContractError(f"Prometheus query returned no series: {expr}")
+    metric = rows[0].get("metric") or {}
+    missing = sorted(label for label in labels if label not in metric)
+    if missing:
+        raise ContractError(f"{expr} missing labels {missing}; labels={sorted(metric)}")
+    return f"OK live {expr} exposes labels {','.join(sorted(labels))}"
+
+
+def _require_live_rules(base_url: str, expected_alerts: list[str]) -> list[str]:
+    rules = _prom_rules(base_url)
+    by_name = {rule["name"]: rule for rule in rules}
+    missing = sorted(set(expected_alerts) - set(by_name))
+    if missing:
+        raise ContractError(f"Prometheus missing loaded backup alert rules: {missing}")
+
+    unhealthy = [
+        f"{rule['name']} health={rule['health']} group={rule['group']}"
+        for rule in by_name.values()
+        if rule["name"] in expected_alerts and rule["health"] not in {"", "ok"}
+    ]
+    if unhealthy:
+        raise ContractError(f"Prometheus backup alert rule health is not ok: {unhealthy}")
+
+    state_counts: dict[str, int] = {}
+    for name in expected_alerts:
+        state = by_name[name]["state"] or "unknown"
+        state_counts[state] = state_counts.get(state, 0) + 1
+    state_summary = ",".join(f"{key}={state_counts[key]}" for key in sorted(state_counts))
+    return [
+        f"OK live Prometheus loaded {len(expected_alerts)} baseline backup alert rules",
+        f"OK live Prometheus backup alert rule states {state_summary}",
+    ]
+
+
+def live_check(base_url: str, baseline_path: Path) -> list[str]:
+    lines = [
+        _require_live_label(
+            base_url,
+            'awoooi_backup_job_configured{host="110"}',
+            {"exported_job", "host", "job"},
+        ),
+        _require_live_label(
+            base_url,
+            'awoooi_backup_job_fresh{host="110"}',
+            {"exported_job", "host", "job", "source", "target", "max_age_hours"},
+        ),
+        _require_live_label(
+            base_url,
+            'awoooi_backup_last_run_failed_count{host="110"}',
+            {"exported_job", "host", "job"},
+        ),
+        _require_live_label(
+            base_url,
+            'awoooi_backup_dr_next_step_info{host="110"}',
+            {"host", "next_step"},
+        ),
+        _require_live_label(
+            base_url,
+            'awoooi_backup_offsite_partial_fresh{host="110",provider="rclone"}',
+            {"host", "provider", "scope", "max_age_hours"},
+        ),
+        _require_live_label(
+            base_url,
+            'awoooi_backup_config_capture_ok{host="110"}',
+            {"host", "target", "source", "critical"},
+        ),
+    ]
+    lines.extend(_require_live_rules(base_url, _expected_backup_alerts(baseline_path)))
+    return lines
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("--rules", type=Path, default=DEFAULT_RULES)
+    parser.add_argument("--baseline", type=Path, default=DEFAULT_BASELINE)
+    parser.add_argument("--prometheus-url", default="")
+    args = parser.parse_args()
+
+    try:
+        for line in static_check(args.rules, args.baseline):
+            print(line)
+        if args.prometheus_url:
+            for line in live_check(args.prometheus_url, args.baseline):
+                print(line)
+    except (ContractError, OSError, yaml.YAMLError, json.JSONDecodeError) as exc:
+        print(f"BACKUP_ALERT_LABEL_CONTRACT_FAILED {exc}", file=sys.stderr)
+        return 1
+
+    print("BACKUP_ALERT_LABEL_CONTRACT_OK")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
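
Review note on `_require_live_rules`: the check reduces to three set and dict operations over the loaded rules. The sketch below restates that logic as a pure function so it can be exercised without a live Prometheus. It assumes each rule is a dict with `name`, `health`, `state`, and `group` keys, which is inferred from how the function above reads `_prom_rules` output. `BackupJobFailed` is a hypothetical rule name used only to show the missing case. Unlike the script, which raises on the first failing category, this version returns all three results together.

```python
from collections import Counter


def summarize_rules(rules: list[dict[str, str]], expected: list[str]) -> dict[str, object]:
    """Classify expected alert rules as missing, unhealthy, or counted by state."""
    by_name = {rule["name"]: rule for rule in rules}
    # Expected names that Prometheus has not loaded at all.
    missing = sorted(set(expected) - set(by_name))
    # Loaded rules whose evaluation health is neither empty nor "ok".
    unhealthy = sorted(
        name
        for name in expected
        if name in by_name and by_name[name].get("health", "") not in {"", "ok"}
    )
    # Per-state tally; an empty state is reported as "unknown".
    states = Counter(
        by_name[name].get("state") or "unknown" for name in expected if name in by_name
    )
    return {"missing": missing, "unhealthy": unhealthy, "states": dict(states)}


# Hand-written sample data; real names come from the baseline file.
SAMPLE_RULES = [
    {"name": "BackupExpectedJobMissing", "health": "ok", "state": "inactive", "group": "backup"},
    {"name": "BackupOffsiteCopyStale", "health": "err", "state": "firing", "group": "backup"},
]
```

Calling `summarize_rules(SAMPLE_RULES, [...])` with a list that includes an unloaded name reports it under `missing`, which corresponds to the first `ContractError` raised by the script.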
--- a/scripts/ops/backup-alert-live-visibility-check.py
+++ b/scripts/ops/backup-alert-live-visibility-check.py
@@ -0,0 +1,242 @@
+#!/usr/bin/env python3
+"""Verify live visibility for backup gap alerts.
+
+This read-only check closes the gap between "metrics exist" and "alerts are
+actually visible". If the offsite or credential-escrow gap metrics are present,
+the corresponding Prometheus firing alerts must be visible. When Alertmanager is
+provided, those same alerts must also be active there.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+import time
+import urllib.parse
+import urllib.request
+from dataclasses import dataclass
+from typing import Any
+
+
+class VisibilityError(RuntimeError):
+    """Raised when a required backup gap alert is not visible in Prometheus or Alertmanager."""
+
+
+@dataclass(frozen=True)
+class RequiredAlert:
+    name: str
+    labels: dict[str, str]
+
+
+COMMON_LABELS = {
+    "host": "110",
+    "auto_repair": "false",
+    "alert_category": "infrastructure",
+    "notification_type": "TYPE-1",
+    "severity": "warning",
+}
+
+
+def _json_get(url: str, timeout: int) -> Any:
+    with urllib.request.urlopen(url, timeout=timeout) as response:
+        return json.loads(response.read().decode("utf-8"))
+
+
+def _prom_query(base_url: str, expr: str, timeout: int) -> list[dict[str, Any]]:
+    query = urllib.parse.urlencode({"query": expr})
+    url = f"{base_url.rstrip('/')}/api/v1/query?{query}"
+    payload = _json_get(url, timeout)
+    if payload.get("status") != "success":
+        raise VisibilityError(f"Prometheus query failed for {expr}: {payload}")
+    return payload.get("data", {}).get("result") or []
+
+
+def _prom_alerts(base_url: str, timeout: int) -> list[dict[str, Any]]:
+    url = f"{base_url.rstrip('/')}/api/v1/alerts"
+    payload = _json_get(url, timeout)
+    if payload.get("status") != "success":
+        raise VisibilityError(f"Prometheus alerts query failed: {payload}")
+    return payload.get("data", {}).get("alerts") or []
+
+
+def _alertmanager_alerts(base_url: str, timeout: int) -> list[dict[str, Any]]:
+    url = f"{base_url.rstrip('/')}/api/v2/alerts"
+    payload = _json_get(url, timeout)
+    if not isinstance(payload, list):
+        raise VisibilityError(f"Alertmanager alerts query returned unexpected payload: {payload}")
+    return payload
+
+
+def _float_value(row: dict[str, Any], expr: str) -> float:
+    value = row.get("value")
+    if not isinstance(value, list) or len(value) < 2:
+        raise VisibilityError(f"Prometheus query returned unexpected value for {expr}: {row}")
+    try:
+        return float(value[1])
+    except (TypeError, ValueError) as exc:
+        raise VisibilityError(f"Prometheus query returned non-numeric value for {expr}: {row}") from exc
+
+
+def _metric_labels(row: dict[str, Any]) -> dict[str, str]:
+    metric = row.get("metric") or {}
+    return {str(key): str(value) for key, value in metric.items()}
+
+
+def _labels_match(actual: dict[str, str], expected: dict[str, str]) -> bool:
+    return all(actual.get(key) == value for key, value in expected.items())
+
+
+def _find_prom_alert(alerts: list[dict[str, Any]], required: RequiredAlert) -> dict[str, Any] | None:
+    expected = {"alertname": required.name, **required.labels}
+    for alert in alerts:
+        if str(alert.get("state", "")) != "firing":
+            continue
+        labels = {str(key): str(value) for key, value in (alert.get("labels") or {}).items()}
+        if _labels_match(labels, expected):
+            return alert
+    return None
+
+
+def _find_alertmanager_alert(alerts: list[dict[str, Any]], required: RequiredAlert) -> dict[str, Any] | None:
+    expected = {"alertname": required.name, **required.labels}
+    for alert in alerts:
+        status = alert.get("status") or {}
+        if str(status.get("state", "")) != "active":
+            continue
+        labels = {str(key): str(value) for key, value in (alert.get("labels") or {}).items()}
+        if _labels_match(labels, expected):
+            return alert
+    return None
+
+
+def _require_prom_alert(alerts: list[dict[str, Any]], required: RequiredAlert) -> None:
+    if _find_prom_alert(alerts, required) is None:
+        raise VisibilityError(
+            f"missing Prometheus firing alert {required.name} with labels {required.labels}"
+        )
+
+
+def _require_alertmanager_alert(alerts: list[dict[str, Any]], required: RequiredAlert) -> None:
+    if _find_alertmanager_alert(alerts, required) is None:
+        raise VisibilityError(
+            f"missing Alertmanager active alert {required.name} with labels {required.labels}"
+        )
+
+
+def _sum_query_values(prometheus_url: str, expr: str, timeout: int) -> float:
+    return sum(_float_value(row, expr) for row in _prom_query(prometheus_url, expr, timeout))
+
+
+def _max_query_value(prometheus_url: str, expr: str, timeout: int) -> float:
+    rows = _prom_query(prometheus_url, expr, timeout)
+    if not rows:
+        return 0.0
+    return max(_float_value(row, expr) for row in rows)
+
+
+def _offsite_required_alerts(prometheus_url: str, host: str, timeout: int) -> tuple[list[RequiredAlert], str]:
+    expr = f'awoooi_backup_offsite_configured{{host="{host}"}}'
+    rows = _prom_query(prometheus_url, expr, timeout)
+    if not rows:
+        raise VisibilityError(f"Prometheus query returned no offsite configured series: {expr}")
+    configured_total = sum(_float_value(row, expr) for row in rows)
+    if configured_total == 0:
+        return (
+            [RequiredAlert("BackupOffsiteCopyNotConfigured", {**COMMON_LABELS, "host": host})],
+            "OK offsite gap metric requires BackupOffsiteCopyNotConfigured visibility",
+        )
+
+    fresh_expr = f'awoooi_backup_offsite_fresh{{host="{host}"}}'
+    if _sum_query_values(prometheus_url, fresh_expr, timeout) > 0:
+        return [], "OK offsite full marker is fresh; no offsite gap alert required"
+
+    enabled_expr = f'awoooi_backup_offsite_full_sync_enabled{{host="{host}"}}'
+    enabled_total = _sum_query_values(prometheus_url, enabled_expr, timeout)
+    if enabled_total > 0:
+        timestamp_expr = f'awoooi_backup_offsite_full_sync_enabled_timestamp{{host="{host}"}}'
+        enabled_timestamp = _max_query_value(prometheus_url, timestamp_expr, timeout)
+        enabled_age = int(time.time() - enabled_timestamp) if enabled_timestamp else 0
+        if enabled_timestamp and enabled_age <= 30 * 3600:
+            return (
+                [],
+                f"OK offsite full sync enabled within grace window; BackupOffsiteCopyStale not required yet age_seconds={enabled_age}",
+            )
+
+    return (
+        [RequiredAlert("BackupOffsiteCopyStale", {**COMMON_LABELS, "host": host})],
+        "OK offsite full marker gap requires BackupOffsiteCopyStale visibility",
+    )
+
+
+def _escrow_required_alerts(prometheus_url: str, host: str, timeout: int) -> list[RequiredAlert]:
+    expr = f'awoooi_backup_credential_escrow_fresh{{host="{host}"}} == 0'
+    rows = _prom_query(prometheus_url, expr, timeout)
+    required: list[RequiredAlert] = []
+    for row in rows:
+        labels = _metric_labels(row)
+        item = labels.get("item")
+        if not item:
+            raise VisibilityError(f"Credential escrow gap metric missing item label: {row}")
+        required.append(
+            RequiredAlert(
+                "BackupCredentialEscrowEvidenceMissing",
+                {**COMMON_LABELS, "host": host, "item": item},
+            )
+        )
+    return sorted(required, key=lambda alert: alert.labels["item"])
+
+
+def live_check(prometheus_url: str, alertmanager_url: str, host: str, timeout: int) -> list[str]:
+    required_alerts: list[RequiredAlert] = []
+    lines: list[str] = []
+
+    offsite_alerts, offsite_line = _offsite_required_alerts(prometheus_url, host, timeout)
+    required_alerts.extend(offsite_alerts)
+    lines.append(offsite_line)
+
+    escrow_alerts = _escrow_required_alerts(prometheus_url, host, timeout)
+    required_alerts.extend(escrow_alerts)
+    if escrow_alerts:
+        escrow_items = ", ".join(alert.labels["item"] for alert in escrow_alerts)
+        lines.append(
+            f"OK credential escrow gap metrics require {len(escrow_alerts)} alert(s): {escrow_items}"
+        )
+    else:
+        lines.append("OK credential escrow markers are fresh; no escrow gap alert required")
+
+    prom_alerts = _prom_alerts(prometheus_url, timeout)
+    for required in required_alerts:
+        _require_prom_alert(prom_alerts, required)
+    lines.append(f"OK Prometheus exposes {len(required_alerts)} required backup gap firing alert(s)")
+
+    if alertmanager_url:
+        am_alerts = _alertmanager_alerts(alertmanager_url, timeout)
+        for required in required_alerts:
+            _require_alertmanager_alert(am_alerts, required)
+        lines.append(f"OK Alertmanager exposes {len(required_alerts)} required backup gap active alert(s)")
+
+    return lines
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("--prometheus-url", required=True)
+    parser.add_argument("--alertmanager-url", default="")
+    parser.add_argument("--host", default="110")
+    parser.add_argument("--timeout", type=int, default=8)
+    args = parser.parse_args()
+
+    try:
+        for line in live_check(args.prometheus_url, args.alertmanager_url, args.host, args.timeout):
+            print(line)
+    except (VisibilityError, OSError, json.JSONDecodeError) as exc:
+        print(f"BACKUP_ALERT_LIVE_VISIBILITY_FAILED {exc}", file=sys.stderr)
+        return 1
+
+    print("BACKUP_ALERT_LIVE_VISIBILITY_OK")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
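
Review note on `_offsite_required_alerts`: the function interleaves Prometheus queries with a four-way decision. The sketch below isolates that decision as a pure function over already-fetched totals, which may make the grace-window behaviour easier to test. The function name and parameters are hypothetical. The alert names and the 30-hour window are taken from the script above.

```python
from __future__ import annotations

GRACE_SECONDS = 30 * 3600  # same 30-hour window as the check above


def offsite_required_alert(
    configured_total: float,
    fresh_total: float,
    enabled_total: float,
    enabled_timestamp: float,
    now: float,
) -> str | None:
    """Return the alert name that must be firing, or None when none is required."""
    if configured_total == 0:
        # No offsite target is configured at all.
        return "BackupOffsiteCopyNotConfigured"
    if fresh_total > 0:
        # A fresh full offsite marker exists, so there is no gap.
        return None
    if enabled_total > 0 and enabled_timestamp:
        # Full sync was enabled recently; allow it time to produce a marker.
        age = int(now - enabled_timestamp)
        if age <= GRACE_SECONDS:
            return None
    return "BackupOffsiteCopyStale"
```

Passing `now` explicitly instead of calling `time.time()` inside the function is a design choice for this sketch only; it keeps the grace-window boundary deterministic in tests.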
--- a/scripts/ops/prometheus-rule-drift-guard.sh
+++ b/scripts/ops/prometheus-rule-drift-guard.sh
@@ -1,9 +1,9 @@
 #!/usr/bin/env bash
 # Guard 110 Prometheus alert rules against stale deploys.
 #
-# The canonical file is the source of truth. The guard restores active
-# alerts.yml only when the active file differs from canonical or when
-# Prometheus is missing rule names declared by canonical.
+# This script is intentionally narrow: it only restores the canonical alert
+# rules file when required recovery/backup rules disappear from live Prometheus
+# or when the active file differs from the canonical copy.

 set -uo pipefail

@@ -14,6 +14,14 @@ CANONICAL_RULES="${CANONICAL_RULES:-/home/wooo/monitoring/alerts-unified.canonic
 TEXTFILE="${TEXTFILE:-/home/wooo/node_exporter_textfiles/prometheus_rule_drift_guard.prom}"
 LOG_FILE="${LOG_FILE:-/home/wooo/logs/prometheus-rule-drift-guard.log}"

+REQUIRED_RULES=(
+  "BackupCredentialEscrowEvidenceMissing"
+  "BackupExpectedJobMissing"
+  "awoooi_recovery_core_ready"
+  "awoooi_recovery_dr_offsite_ready"
+  "ColdStartRecoveryBlocked"
+)
+
 log() {
  mkdir -p "$(dirname "$LOG_FILE")" 2>/dev/null || true
  printf '[%s] %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >>"$LOG_FILE"
@@ -34,7 +42,7 @@ awoooi_prometheus_rule_drift_guard_last_run_timestamp{host="${HOST_LABEL}",statu
 # HELP awoooi_prometheus_rule_drift_guard_repaired Whether the guard restored canonical Prometheus rules on the last run.
 # TYPE awoooi_prometheus_rule_drift_guard_repaired gauge
 awoooi_prometheus_rule_drift_guard_repaired{host="${HOST_LABEL}"} ${repaired}
-# HELP awoooi_prometheus_rule_drift_guard_missing_required_count Number of canonical live rules missing after the last check.
+# HELP awoooi_prometheus_rule_drift_guard_missing_required_count Number of required live rules missing after the last check.
 # TYPE awoooi_prometheus_rule_drift_guard_missing_required_count gauge
 awoooi_prometheus_rule_drift_guard_missing_required_count{host="${HOST_LABEL}"} ${missing_count}
 # HELP awoooi_prometheus_rule_drift_guard_current_matches_canonical Whether active alerts.yml matches canonical copy.
@@ -46,27 +54,13 @@ EOF
 }

 rules_missing_count() {
-  python3 - "$PROMETHEUS_URL" "$CANONICAL_RULES" <<'PY'
+  python3 - "$PROMETHEUS_URL" "${REQUIRED_RULES[@]}" <<'PY'
 import json
-import re
 import sys
 import urllib.request

 base_url = sys.argv[1].rstrip("/")
-canonical_path = sys.argv[2]
-
-name_pattern = re.compile(r"^\s*-\s*(?:alert|record):\s*['\"]?([^'\"#]+?)['\"]?\s*(?:#.*)?$")
-required: set[str] = set()
-try:
-    with open(canonical_path, encoding="utf-8") as handle:
-        for line in handle:
-            match = name_pattern.match(line)
-            if match:
-                required.add(match.group(1).strip())
-except Exception as exc:
-    print(f"CANONICAL_PARSE_FAILED:{exc}")
-    raise SystemExit(0)
-
+required = set(sys.argv[2:])
 try:
    with urllib.request.urlopen(f"{base_url}/api/v1/rules", timeout=8) as response:
        payload = json.loads(response.read().decode("utf-8"))
@@ -115,8 +109,8 @@ main() {
  before_matches="$(matches_canonical)"
  repaired=0

-  if [[ "$missing" == QUERY_FAILED:* || "$missing" == CANONICAL_PARSE_FAILED:* ]]; then
-    log "Prometheus/canonical query failed: ${missing}"
+  if [[ "$missing" == QUERY_FAILED:* ]]; then
+    log "Prometheus query failed: ${missing}"
    write_textfile "query_failed" 0 999 "$before_matches"
    return 1
  fi
@@ -135,8 +129,8 @@ main() {

  after_missing="$(rules_missing_count)"
  after_matches="$(matches_canonical)"
-  if [[ "$after_missing" == QUERY_FAILED:* || "$after_missing" == CANONICAL_PARSE_FAILED:* ]]; then
-    log "post-restore Prometheus/canonical query failed: ${after_missing}"
+  if [[ "$after_missing" == QUERY_FAILED:* ]]; then
+    log "post-restore Prometheus query failed: ${after_missing}"
    write_textfile "post_query_failed" "$repaired" 999 "$after_matches"
    return 1
  fi
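
Review note on `rules_missing_count`: the rest of the embedded Python is outside this hunk, so the following is only a sketch of how required names passed as `sys.argv[2:]` could be compared against the rules Prometheus has loaded. It assumes the standard `/api/v1/rules` response shape (`data.groups[].rules[].name`). The function name and the sample body are hypothetical.

```python
import json


def missing_required_rules(payload: dict, required: set[str]) -> list[str]:
    """Return required rule names absent from an /api/v1/rules response body."""
    if payload.get("status") != "success":
        raise ValueError(f"unexpected Prometheus status: {payload.get('status')!r}")
    # Alerting and recording rules both expose their name under "name".
    loaded = {
        rule.get("name")
        for group in payload.get("data", {}).get("groups") or []
        for rule in group.get("rules") or []
    }
    return sorted(required - loaded)


# Hand-written response body in the /api/v1/rules shape.
SAMPLE = json.loads(
    """
    {"status": "success", "data": {"groups": [
      {"name": "backup", "rules": [
        {"name": "BackupExpectedJobMissing", "type": "alerting"},
        {"name": "awoooi_recovery_core_ready", "type": "recording"}
      ]}
    ]}}
    """
)
```

Because `REQUIRED_RULES` mixes alert names and recording-rule names, a single name lookup across both rule types, as sketched here, is sufficient; no per-type handling is needed.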
--- a/scripts/ops/recovery-scorecard-contract-check.py
+++ b/scripts/ops/recovery-scorecard-contract-check.py
@@ -0,0 +1,148 @@
+#!/usr/bin/env python3
+"""Validate recovery scorecard recording-rule contract."""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+import urllib.parse
+import urllib.request
+from pathlib import Path
+from typing import Any
+
+import yaml
+
+
+DEFAULT_RULES = Path("ops/monitoring/alerts-unified.yml")
+DEFAULT_BASELINE = Path("ops/reboot-recovery/full-stack-backup-baseline.yml")
+EXPECTED_CORE = 'awoooi_recovery_core_ready{host="110",scope="110_120_121_188"}'
+EXPECTED_DR = 'awoooi_recovery_dr_offsite_ready{host="110"}'
+
+
+class ContractError(RuntimeError):
+    """Raised when the recovery scorecard recording rules violate the expected contract."""
+
+
+def _rules(path: Path) -> list[dict[str, Any]]:
+    data = yaml.safe_load(path.read_text(encoding="utf-8")) or {}
+    rules: list[dict[str, Any]] = []
+    for group in data.get("groups") or []:
+        rules.extend(group.get("rules") or [])
+    return rules
+
+
+def _expected_recording_rules(path: Path) -> list[str]:
+    data = yaml.safe_load(path.read_text(encoding="utf-8")) or {}
+    rules = data.get("monitoring_contract", {}).get("prometheus_recording_rules") or []
+    if not rules:
+        raise ContractError(f"missing monitoring_contract.prometheus_recording_rules in {path}")
+    return [str(rule) for rule in rules]
+
+
+def static_check(rules_path: Path, baseline_path: Path) -> list[str]:
+    rules = _rules(rules_path)
+    by_record = {str(rule.get("record")): rule for rule in rules if rule.get("record")}
+    expected = _expected_recording_rules(baseline_path)
+    missing = sorted(set(expected) - set(by_record))
+    if missing:
+        raise ContractError(f"alerts-unified.yml missing recovery recording rules: {missing}")
+
+    core_expr = str(by_record["awoooi_recovery_core_ready"].get("expr", ""))
+    for required in [
+        "awoooi_cold_start_last_result",
+        "awoooi_cold_start_warn_gates",
+        "awoooi_cold_start_blocked_gates",
+        "awoooi_cold_start_last_green_timestamp",
+    ]:
+        if required not in core_expr:
+            raise ContractError(f"awoooi_recovery_core_ready expr missing {required}")
+
+    dr_expr = str(by_record["awoooi_recovery_dr_offsite_ready"].get("expr", ""))
+    for required in [
+        "awoooi_backup_offsite_configured",
+        "awoooi_backup_offsite_fresh",
+        "awoooi_backup_credential_escrow_fresh",
+    ]:
+        if required not in dr_expr:
+            raise ContractError(f"awoooi_recovery_dr_offsite_ready expr missing {required}")
+
+    return [
+        "OK alerts-unified.yml contains every recovery scorecard recording rule",
+        "OK recovery core rule depends on cold-start green/warn/blocked/last-green metrics",
+        "OK recovery DR rule depends on provider-neutral offsite freshness and credential escrow freshness",
+    ]
+
+
+def _prom_query(base_url: str, expr: str) -> list[dict[str, Any]]:
+    url = f"{base_url.rstrip('/')}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
+    with urllib.request.urlopen(url, timeout=8) as response:
+        payload = json.loads(response.read().decode("utf-8"))
+    if payload.get("status") != "success":
+        raise ContractError(f"Prometheus query failed for {expr}: {payload}")
+    return payload.get("data", {}).get("result") or []
+
+
+def _single_value(base_url: str, expr: str) -> float:
+    rows = _prom_query(base_url, expr)
+    if len(rows) != 1:
+        raise ContractError(f"Prometheus query expected one series for {expr}, got {len(rows)}")
+    value = rows[0].get("value") or []
+    if len(value) < 2:
+        raise ContractError(f"Prometheus query returned malformed value for {expr}: {rows[0]}")
+    try:
+        number = float(value[1])
+    except (TypeError, ValueError) as exc:
+        raise ContractError(f"Prometheus query returned non-numeric value for {expr}: {rows[0]}") from exc
+    if number not in {0.0, 1.0}:
+        raise ContractError(f"Prometheus recovery scorecard metric must be 0 or 1: {expr}={number}")
+    return number
+
+
+def live_check(
+    base_url: str,
+    expect_core_ready: bool = False,
+    expect_dr_ready: bool = False,
+) -> list[str]:
+    core = _single_value(base_url, EXPECTED_CORE)
+    dr = _single_value(base_url, EXPECTED_DR)
+    lines = [
+        f"OK live {EXPECTED_CORE} value={int(core)}",
+        f"OK live {EXPECTED_DR} value={int(dr)}",
+    ]
+    if expect_core_ready and core != 1.0:
+        raise ContractError(f"expected core recovery ready, got {core}")
+    if expect_dr_ready and dr != 1.0:
+        raise ContractError(f"expected DR offsite ready, got {dr}")
+    return lines
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("--rules", type=Path, default=DEFAULT_RULES)
+    parser.add_argument("--baseline", type=Path, default=DEFAULT_BASELINE)
+    parser.add_argument("--prometheus-url", default="")
+    parser.add_argument("--expect-core-ready", action="store_true")
+    parser.add_argument("--expect-dr-ready", action="store_true")
+    args = parser.parse_args()
+
+    try:
+        for line in static_check(args.rules, args.baseline):
+            print(line)
+        if args.prometheus_url:
+            for line in live_check(
+                args.prometheus_url,
+                args.expect_core_ready,
+                args.expect_dr_ready,
+            ):
+                print(line)
+    except (ContractError, OSError, yaml.YAMLError, json.JSONDecodeError) as exc:
+        print(f"RECOVERY_SCORECARD_CONTRACT_FAILED {exc}", file=sys.stderr)
+        return 1
+
+    print("RECOVERY_SCORECARD_CONTRACT_OK")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
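
Review note on `static_check`: the dependency test is a plain substring match against each recording rule's `expr`. The sketch below shows that behaviour on a hypothetical expression; the real expression lives in `ops/monitoring/alerts-unified.yml` and is not reproduced here. The helper name `missing_dependencies` is also hypothetical.

```python
CORE_DEPENDENCIES = [
    "awoooi_cold_start_last_result",
    "awoooi_cold_start_warn_gates",
    "awoooi_cold_start_blocked_gates",
    "awoooi_cold_start_last_green_timestamp",
]


def missing_dependencies(expr: str, required: list[str]) -> list[str]:
    """Return required metric names that do not appear anywhere in expr."""
    return [name for name in required if name not in expr]


# Hypothetical PromQL, written only to exercise the check.
EXAMPLE_EXPR = """
(awoooi_cold_start_last_result{result="green"} == 1)
and on (host, scope) (awoooi_cold_start_warn_gates == 0)
and on (host, scope) (awoooi_cold_start_blocked_gates == 0)
"""
```

A substring match is satisfied by any occurrence of the name, including one inside a longer metric name, so this check works as a contract smoke test rather than a PromQL parser.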
--- a/scripts/reboot-recovery/cold-start-textfile-exporter.sh
+++ b/scripts/reboot-recovery/cold-start-textfile-exporter.sh
@@ -1,10 +1,8 @@
 #!/usr/bin/env bash
 # Export AWOOOI full-stack cold-start gate status as node-exporter textfile metrics.
 #
-# 2026-05-06 ogt + Codex: reboot recovery hardening.
-# Intent: give Prometheus and the AI incident flow a durable, read-only signal
-# for the 110/120/121/188 startup gates. This wrapper never sends the
-# Alertmanager smoke event and never writes remote state.
+# This wrapper is read-only: it never sends the Alertmanager smoke event and
+# never mutates remote host/service state.

 set -uo pipefail

@@ -13,6 +11,8 @@ TEXTFILE_DIR="${TEXTFILE_DIR:-${NODE_EXPORTER_TEXTFILE_DIR:-/home/wooo/node_expo
 OUTPUT_NAME="${OUTPUT_NAME:-cold_start_recovery.prom}"
 LOG_DIR="${LOG_DIR:-/home/wooo/reboot-recovery}"
 CHECK_TIMEOUT_SECONDS="${CHECK_TIMEOUT_SECONDS:-240}"
+CHECK_WATCH_INTERVAL_SECONDS="${CHECK_WATCH_INTERVAL_SECONDS:-10}"
+CHECK_WATCH_MAX_ATTEMPTS="${CHECK_WATCH_MAX_ATTEMPTS:-3}"
 HOST_LABEL="${AIOPS_HOST_LABEL:-110}"
 SCOPE_LABEL="${AIOPS_SCOPE_LABEL:-110_120_121_188}"
 LOCK_FILE="${LOCK_FILE:-/tmp/awoooi-cold-start-textfile-exporter.lock}"
@@ -35,6 +35,10 @@ write_metric_file() {
  local blocked_state="${11}"
  local check_failed="${12}"
  local last_green="${13}"
+  local k3s_node_fs_blocker="${14}"
+  local public_route_tls_blocker="${15}"
+  local host_120_unreachable_blocker="${16}"
+  local backup_health_blocker="${17}"
  local host scope
  host=$(escape_label "$HOST_LABEL")
  scope=$(escape_label "$SCOPE_LABEL")
@@ -70,10 +74,16 @@ awoooi_cold_start_last_result{host="$host",scope="$scope",result="green"} $green
 awoooi_cold_start_last_result{host="$host",scope="$scope",result="degraded"} $degraded
 awoooi_cold_start_last_result{host="$host",scope="$scope",result="blocked"} $blocked_state
 awoooi_cold_start_last_result{host="$host",scope="$scope",result="check_failed"} $check_failed
+# HELP awoooi_cold_start_blocker_reason Whether a known cold-start blocker reason was detected in the last log.
+# TYPE awoooi_cold_start_blocker_reason gauge
+awoooi_cold_start_blocker_reason{host="$host",scope="$scope",reason="k3s_node_filesystem_error",target="120"} $k3s_node_fs_blocker
+awoooi_cold_start_blocker_reason{host="$host",scope="$scope",reason="public_route_tls_failure",target="public_https"} $public_route_tls_blocker
+awoooi_cold_start_blocker_reason{host="$host",scope="$scope",reason="host_unreachable",target="120"} $host_120_unreachable_blocker
+awoooi_cold_start_blocker_reason{host="$host",scope="$scope",reason="backup_health_blocked",target="110"} $backup_health_blocker
 METRICS
 }

-if [ -n "${BASH_VERSION:-}" ] && command -v flock >/dev/null 2>&1; then
+if command -v flock >/dev/null 2>&1; then
  exec 9>"$LOCK_FILE"
  if ! flock -n 9; then
    exit 0
@@ -92,13 +102,19 @@ if [ ! -x "$CHECK_SCRIPT" ]; then
  tmp_metric=$(mktemp "$TEXTFILE_DIR/.cold_start_recovery.XXXXXX")
  last_green=$(cat "$state_file" 2>/dev/null || echo 0)
  printf 'CHECK_SCRIPT not executable: %s\n' "$CHECK_SCRIPT" >"$log_file"
-  write_metric_file "$tmp_metric" "$end_ts" "$((end_ts - start_ts))" 127 0 0 0 1 0 0 0 1 "$last_green"
+  write_metric_file "$tmp_metric" "$end_ts" "$((end_ts - start_ts))" 127 0 0 0 1 0 0 0 1 "$last_green" 0 0 0 0
  chmod 0644 "$tmp_metric"
  mv "$tmp_metric" "$TEXTFILE_DIR/$OUTPUT_NAME"
  exit 0
 fi

-timeout "$CHECK_TIMEOUT_SECONDS" bash "$CHECK_SCRIPT" --monitor-read-only --no-color >"$log_tmp" 2>&1
+timeout "$CHECK_TIMEOUT_SECONDS" bash "$CHECK_SCRIPT" \
+  --monitor-read-only \
+  --no-color \
+  --watch \
+  --interval "$CHECK_WATCH_INTERVAL_SECONDS" \
+  --max-attempts "$CHECK_WATCH_MAX_ATTEMPTS" \
+  >"$log_tmp" 2>&1
 exit_code=$?
 mv "$log_tmp" "$log_file"

@@ -111,6 +127,10 @@ green=0
 degraded=0
 blocked_state=0
 check_failed=0
+k3s_node_fs_blocker=0
+public_route_tls_blocker=0
+host_120_unreachable_blocker=0
+backup_health_blocker=0

 if [ -n "$summary_line" ]; then
  monitor_up=1
@@ -130,6 +150,22 @@ else
  check_failed=1
 fi

+if grep -Eq 'NODE_FS_ERROR_EVENTS[[:space:]]+[1-9][0-9]*|K3s node filesystem error events present' "$log_file"; then
+  k3s_node_fs_blocker=1
+fi
+
+if grep -Eq 'PUBLIC_ROUTE_TLS [^[:space:]]+ (000|5[0-9][0-9]) |public route .* TLS certificate verification failed' "$log_file"; then
+  public_route_tls_blocker=1
+fi
+
+if grep -Eq 'BLOCKED (ping 192\.168\.0\.120|ssh port 192\.168\.0\.120:22|ssh 120 k3s read-only check)' "$log_file"; then
+  host_120_unreachable_blocker=1
+fi
+
+if grep -Eq 'BLOCKED 110 backup health has stale expected jobs' "$log_file"; then
+  backup_health_blocker=1
+fi
+
 end_ts=$(date +%s)
 if [ "$green" -eq 1 ]; then
  printf '%s\n' "$end_ts" >"$state_file"
@@ -137,6 +173,6 @@ fi
 last_green=$(cat "$state_file" 2>/dev/null || echo 0)

 tmp_metric=$(mktemp "$TEXTFILE_DIR/.cold_start_recovery.XXXXXX")
-write_metric_file "$tmp_metric" "$end_ts" "$((end_ts - start_ts))" "$exit_code" "$monitor_up" "$pass" "$warn" "$blocked" "$green" "$degraded" "$blocked_state" "$check_failed" "$last_green"
+write_metric_file "$tmp_metric" "$end_ts" "$((end_ts - start_ts))" "$exit_code" "$monitor_up" "$pass" "$warn" "$blocked" "$green" "$degraded" "$blocked_state" "$check_failed" "$last_green" "$k3s_node_fs_blocker" "$public_route_tls_blocker" "$host_120_unreachable_blocker" "$backup_health_blocker"
 chmod 0644 "$tmp_metric"
 mv "$tmp_metric" "$TEXTFILE_DIR/$OUTPUT_NAME"
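
Review note on the blocker-reason gauges: the four `grep -Eq` checks map log lines to 0/1 values for `awoooi_cold_start_blocker_reason`. The sketch below restates that mapping in Python so the patterns can be unit-tested against sample logs. The patterns are translated from `grep -E` to Python `re`. The TLS pattern is anchored to the status-code field here, which assumes the log line format `PUBLIC_ROUTE_TLS <name> <code> <url>` emitted by the cold-start check. The function name and sample log are hypothetical.

```python
import re

# (reason, target) -> pattern, in the same order the exporter writes the gauges.
BLOCKER_PATTERNS = {
    ("k3s_node_filesystem_error", "120"): re.compile(
        r"NODE_FS_ERROR_EVENTS[ \t]+[1-9][0-9]*|K3s node filesystem error events present"
    ),
    ("public_route_tls_failure", "public_https"): re.compile(
        r"PUBLIC_ROUTE_TLS \S+ (?:000|5[0-9][0-9]) "
        r"|public route .* TLS certificate verification failed"
    ),
    ("host_unreachable", "120"): re.compile(
        r"BLOCKED (?:ping 192\.168\.0\.120|ssh port 192\.168\.0\.120:22"
        r"|ssh 120 k3s read-only check)"
    ),
    ("backup_health_blocked", "110"): re.compile(
        r"BLOCKED 110 backup health has stale expected jobs"
    ),
}


def blocker_metrics(log_text: str, host: str = "110", scope: str = "110_120_121_188") -> list[str]:
    """Render one textfile sample line per known blocker reason."""
    lines = []
    for (reason, target), pattern in BLOCKER_PATTERNS.items():
        value = 1 if pattern.search(log_text) else 0
        lines.append(
            f'awoooi_cold_start_blocker_reason{{host="{host}",scope="{scope}",'
            f'reason="{reason}",target="{target}"}} {value}'
        )
    return lines


# Hand-written log excerpt for illustration.
SAMPLE_LOG = "\n".join(
    [
        "NODE_FS_ERROR_EVENTS 0",
        "PUBLIC_ROUTE_TLS aiops 502 https://aiops.wooo.work/",
    ]
)
```

Every reason is always emitted, with value 0 when absent, which matches the exporter's approach of writing all four series on each run so that alert expressions never see a missing series.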
--- a/scripts/reboot-recovery/full-stack-cold-start-check.sh
+++ b/scripts/reboot-recovery/full-stack-cold-start-check.sh
@@ -7,6 +7,7 @@ set -uo pipefail
 SSH_OPTS=(-o BatchMode=yes -o ConnectTimeout=6)
 SEND_ALERT_TEST=0
 MONITOR_READ_ONLY=0
+NO_COLOR_FLAG=0
 WATCH_MODE=0
 WATCH_INTERVAL=60
 WATCH_MAX_ATTEMPTS=30
@@ -30,15 +31,17 @@ USAGE
 }

 while [ "$#" -gt 0 ]; do
-  case "$1" in
+  arg="$1"
+  case "$arg" in
    --send-alert-test)
      SEND_ALERT_TEST=1
      ;;
    --monitor-read-only)
      MONITOR_READ_ONLY=1
+      SEND_ALERT_TEST=0
      ;;
    --no-color)
-      NO_COLOR=1
+      NO_COLOR_FLAG=1
      ;;
    --watch)
      WATCH_MODE=1
@@ -64,7 +67,7 @@ while [ "$#" -gt 0 ]; do
      exit 0
      ;;
    *)
-      echo "Unknown argument: $1" >&2
+      echo "Unknown argument: $arg" >&2
      usage >&2
      exit 64
      ;;
@@ -72,7 +75,7 @@ while [ "$#" -gt 0 ]; do
  shift
 done

-if [ -n "${NO_COLOR:-}" ]; then
+if [ -n "${NO_COLOR:-}" ] || [ "$NO_COLOR_FLAG" -eq 1 ]; then
  RED=""
  GREEN=""
  YELLOW=""
@@ -90,12 +93,6 @@ PASS=0
 WARN=0
 FAIL=0

-reset_counters() {
-  PASS=0
-  WARN=0
-  FAIL=0
-}
-
 log_section() {
  printf "\n%s=== %s ===%s\n" "$BLUE" "$1" "$NC"
 }
@@ -198,6 +195,18 @@ probe_tcp() {
  nc -G 3 -z "$host" "$port" >/dev/null 2>&1 || nc -w 3 -z "$host" "$port" >/dev/null 2>&1
 }

+print_neighbor_rows() {
+  if command -v arp >/dev/null 2>&1; then
+    arp -an | grep -E '192\.168\.0\.(110|120|121|188)'
+    return $?
+  fi
+  if command -v ip >/dev/null 2>&1; then
+    ip neigh show | grep -E '192\.168\.0\.(110|120|121|188)'
+    return $?
+  fi
+  return 1
+}
+
 print_header() {
  echo "AWOOOI full-stack cold-start check"
  date '+%Y-%m-%d %H:%M:%S %Z'
@@ -222,12 +231,12 @@ check_network() {
    fi
  done

-  if arp -an | grep -E '192\.168\.0\.(110|120|121|188)'; then
-    ok "ARP evidence printed"
+  if print_neighbor_rows; then
+    ok "neighbor evidence printed"
  elif [ "$MONITOR_READ_ONLY" -eq 1 ]; then
-    ok "ARP evidence unavailable in monitor mode; ping and TCP gates passed"
+    ok "neighbor evidence unavailable in monitor mode; ping and TCP gates provide primary signal"
  else
-    warn "no ARP rows printed for one or more hosts"
+    warn "no neighbor rows printed for one or more hosts"
  fi
 }

@@ -370,21 +379,34 @@ WEB_CODE $web_code"

 check_public_routes() {
  log_section "P2-PUBLIC-ROUTES"
-  local awoooi_api_code awoooi_web_code momo_code momo_health_code
-  awoooi_api_code=$(probe_http_code "https://awoooi.wooo.work/api/v1/health")
-  awoooi_web_code=$(probe_http_code "https://awoooi.wooo.work/")
-  momo_code=$(probe_http_code "https://mo.wooo.work/")
-  momo_health_code=$(probe_http_code "https://mo.wooo.work/health")
+  local item name url code tls_code
+  local routes=(
+    "awoooi_api|https://awoooi.wooo.work/api/v1/health"
+    "awoooi_web|https://awoooi.wooo.work/"
+    "momo_web|https://mo.wooo.work/"
+    "momo_health|https://mo.wooo.work/health"
+    "gitea|https://gitea.wooo.work/"
+    "harbor|https://harbor.wooo.work/"
+    "registry|https://registry.wooo.work/"
+    "sentry|https://sentry.wooo.work/"
+    "signoz|https://signoz.wooo.work/"
+    "stock|https://stock.wooo.work/"
+    "langfuse|https://langfuse.wooo.work/"
+    "bitan|https://bitan.wooo.work/"
+    "aiops|https://aiops.wooo.work/"
+  )

-  echo "AWOOOI_PUBLIC_API_CODE $awoooi_api_code"
-  echo "AWOOOI_PUBLIC_WEB_CODE $awoooi_web_code"
-  echo "MOMO_PUBLIC_CODE $momo_code"
-  echo "MOMO_PUBLIC_HEALTH_CODE $momo_health_code"
-
-  [[ "$awoooi_api_code" =~ ^[23] ]] && ok "AWOOOI public API reachable" || warn "AWOOOI public API not confirmed"
-  [[ "$awoooi_web_code" =~ ^[23] ]] && ok "AWOOOI public web reachable" || warn "AWOOOI public web not confirmed"
-  [[ "$momo_code" =~ ^[23] ]] && ok "momo public route reachable" || warn "momo public route not confirmed"
-  [[ "$momo_health_code" =~ ^[23] ]] && ok "momo public health reachable" || warn "momo public health not confirmed"
+  for item in "${routes[@]}"; do
+    name="${item%%|*}"
+    url="${item#*|}"
+    code=$(probe_http_code "$url")
+    echo "PUBLIC_ROUTE $name $code $url"
+    [[ "$code" =~ ^[23] ]] && ok "public route $name reachable" || warn "public route $name not confirmed"
+    tls_code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 8 "$url" 2>/dev/null || true)
+    tls_code="${tls_code:-000}"
+    echo "PUBLIC_ROUTE_TLS $name $tls_code $url"
+    [[ "$tls_code" =~ ^[23] ]] && ok "public route $name TLS certificate verified" || fail "public route $name TLS certificate verification failed"
+  done
 }

 check_schedules() {
@@ -394,7 +416,7 @@ check_schedules() {
  if out=$(host_cmd "ollama@192.168.0.188" '
 now=$(date +%s)
 echo "CRON_188 $(systemctl is-active cron 2>/dev/null || systemctl is-active crond 2>/dev/null || true)"
-for f in /home/ollama/node_exporter_textfiles/backup.prom /home/ollama/node_exporter_textfiles/docker_restart_count.prom /home/ollama/node_exporter_textfiles/docker_stats.prom; do
+for f in /home/ollama/node_exporter_textfiles/backup.prom /home/ollama/node_exporter_textfiles/backup_health.prom /home/ollama/node_exporter_textfiles/docker_restart_count.prom /home/ollama/node_exporter_textfiles/docker_stats.prom /home/ollama/node_exporter_textfiles/storage_health.prom; do
  if [ -f "$f" ]; then
    mt=$(stat -c %Y "$f")
    echo "TEXTFILE_188 $(basename "$f") age=$((now - mt))"
@@ -405,17 +427,37 @@ done
 if [ -f /home/ollama/node_exporter_textfiles/backup.prom ]; then
  awk -v now="$now" "/^backup_110_last_success_timestamp / {printf \"BACKUP_110_AGE %d\\n\", now - int(\$2)}" /home/ollama/node_exporter_textfiles/backup.prom
 fi
-echo "SCHEDULER_STATE $(docker inspect -f "{{.State.Status}} {{if .State.Health}}{{.State.Health.Status}}{{end}}" momo-scheduler 2>/dev/null || true)"
-echo "SCHEDULER_REGISTERED $(docker logs --since 6h momo-scheduler 2>&1 | grep -c "全部排程任務已註冊" || true)"
+if [ -f /home/ollama/node_exporter_textfiles/backup_health.prom ]; then
+  awk "/^awoooi_backup_job_fresh/ {total++; if (int(\$2) == 0) stale++} /^awoooi_backup_job_configured/ {if (int(\$2) == 0) missing_cron++} /^awoooi_backup_script_present/ {if (int(\$2) == 0) missing_script++} END {printf \"BACKUP_HEALTH_188 total=%d stale=%d missing_cron=%d missing_script=%d\\n\", total+0, stale+0, missing_cron+0, missing_script+0}" /home/ollama/node_exporter_textfiles/backup_health.prom
+fi
+if [ -f /home/ollama/node_exporter_textfiles/storage_health.prom ]; then
+  awk "/^awoooi_host_storage_root_readonly/ {readonly=int(\$2)} /^awoooi_host_storage_current_boot_error_count/ {current=int(\$2)} END {printf \"STORAGE_HEALTH_188 root_readonly=%d current=%d\\n\", readonly+0, current+0}" /home/ollama/node_exporter_textfiles/storage_health.prom
+fi
+echo "SCHEDULER_CONTAINER_RUNNING $(docker inspect -f "{{.State.Running}}" momo-scheduler 2>/dev/null || true)"
+echo "SCHEDULER_CONTAINER_HEALTH $(docker inspect -f "{{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}" momo-scheduler 2>/dev/null || true)"
+echo "SCHEDULER_REGISTERED $(docker logs --tail 200 momo-scheduler 2>&1 | grep -c "全部排程任務已註冊" || true)"
+echo "SCHEDULER_RECENT_ACTIVITY $(docker logs --since 2h momo-scheduler 2>&1 | grep -Ec "AutoImport|Meta-Analysis|Scheduler" || true)"
+momo_sync=$(docker exec momo-db sh -c "psql -U \"\$POSTGRES_USER\" -d \"\$POSTGRES_DB\" -Atc \"WITH scope AS (SELECT min(snapshot_date::date) dmin, max(snapshot_date::date) dmax, count(*) sc FROM daily_sales_snapshot WHERE snapshot_date::date >= make_date(extract(year from current_date)::int, extract(month from current_date)::int, 1)), monthly AS (SELECT count(*) mc, min(\\\"日期\\\"::date) mmin, max(\\\"日期\\\"::date) mmax FROM realtime_sales_monthly, scope WHERE scope.sc > 0 AND \\\"日期\\\"::date BETWEEN scope.dmin AND scope.dmax) SELECT coalesce(scope.sc,0)::text || chr(124) || coalesce(monthly.mc,0)::text || chr(124) || coalesce(scope.dmin::text,chr(45)) || chr(124) || coalesce(scope.dmax::text,chr(45)) || chr(124) || coalesce(monthly.mmin::text,chr(45)) || chr(124) || coalesce(monthly.mmax::text,chr(45)) FROM scope, monthly;\"" 2>/dev/null || true)
+echo "MOMO_MONTHLY_SYNC ${momo_sync:-unavailable}"
 ' 2>&1); then
    echo "$out"
    grep -q "CRON_188 active" <<<"$out" && ok "188 cron active" || warn "188 cron not confirmed"
    awk '/TEXTFILE_188 backup.prom age=/ {split($3,a,"="); exit !(a[2] < 90000)}' <<<"$out" && ok "188 backup textfile fresh enough" || warn "188 backup textfile stale or missing"
+    awk '/TEXTFILE_188 backup_health.prom age=/ {found=1; split($3,a,"="); fresh=(a[2] < 900)} END {exit !(found && fresh)}' <<<"$out" && ok "188 backup health exporter fresh" || warn "188 backup health exporter stale or missing"
    awk '/TEXTFILE_188 docker_restart_count.prom age=/ {split($3,a,"="); exit !(a[2] < 300)}' <<<"$out" && ok "188 docker restart exporter fresh" || warn "188 docker restart exporter stale"
    awk '/TEXTFILE_188 docker_stats.prom age=/ {split($3,a,"="); exit !(a[2] < 300)}' <<<"$out" && ok "188 docker stats exporter fresh" || warn "188 docker stats exporter stale"
+    awk '/TEXTFILE_188 storage_health.prom age=/ {found=1; split($3,a,"="); fresh=(a[2] < 300)} END {exit !(found && fresh)}' <<<"$out" && ok "188 storage health exporter fresh" || warn "188 storage health exporter stale or missing"
+    grep -q "STORAGE_HEALTH_188 root_readonly=0 current=0" <<<"$out" && ok "188 current boot storage health clean" || warn "188 storage health not clean"
    awk '/BACKUP_110_AGE / {exit !($2 < 90000)}' <<<"$out" && ok "188 backup-from-110 success within 25h" || warn "188 backup-from-110 success not confirmed"
-    grep -q "SCHEDULER_STATE running healthy" <<<"$out" && ok "188 momo scheduler container healthy" || warn "188 momo scheduler health not confirmed"
-    awk '/SCHEDULER_REGISTERED / {exit !($2 > 0)}' <<<"$out" && ok "188 momo scheduler registered jobs within 6h" || warn "188 momo scheduler registration not confirmed within 6h"
+    grep -q "BACKUP_HEALTH_188 total=" <<<"$out" && awk '/BACKUP_HEALTH_188/ {split($3,a,"="); split($4,b,"="); split($5,c,"="); exit !((a[2]+b[2]+c[2]) == 0)}' <<<"$out" && ok "188 backup health has no stale expected jobs" || warn "188 backup health has stale expected jobs"
+    if grep -q "SCHEDULER_CONTAINER_HEALTH healthy" <<<"$out" && awk '/SCHEDULER_RECENT_ACTIVITY / {exit !($2 > 0)}' <<<"$out"; then
+      ok "188 momo scheduler healthy with recent task activity"
+    elif awk '/SCHEDULER_REGISTERED / {exit !($2 > 0)}' <<<"$out"; then
+      ok "188 momo scheduler registered jobs"
+    else
+      warn "188 momo scheduler registration/activity not confirmed"
+    fi
+    awk '/MOMO_MONTHLY_SYNC / {split($2,a,"|"); exit !(a[1] > 0 && a[1] == a[2] && a[3] == a[5] && a[4] == a[6])}' <<<"$out" && ok "188 momo current-month snapshot and realtime tables match" || warn "188 momo current-month snapshot/realtime sync not confirmed"
  else
    warn "188 schedule check unavailable"
    echo "$out"
@@ -427,7 +469,7 @@ echo "CRON_110 $(systemctl is-active cron 2>/dev/null || systemctl is-active cro
 echo "FAILED_UNITS_110 $(systemctl --failed --no-legend --plain 2>/dev/null | wc -l)"
 echo "MOMO_STARTUP_ENABLED $(systemctl is-enabled momo-startup-complete.service 2>/dev/null || true)"
 echo "STAGGERED_STARTUP_ENABLED $(systemctl is-enabled wooo-staggered-startup.service 2>/dev/null || true)"
-for f in /home/wooo/node_exporter_textfiles/docker_stats.prom /home/wooo/node_exporter_textfiles/systemd_units.prom; do
+for f in /home/wooo/node_exporter_textfiles/docker_stats.prom /home/wooo/node_exporter_textfiles/systemd_units.prom /home/wooo/node_exporter_textfiles/storage_health.prom /home/wooo/node_exporter_textfiles/backup_health.prom; do
  if [ -f "$f" ]; then
    mt=$(stat -c %Y "$f")
    echo "TEXTFILE_110 $(basename "$f") age=$((now - mt))"
@@ -435,6 +477,12 @@ for f in /home/wooo/node_exporter_textfiles/docker_stats.prom /home/wooo/node_ex
    echo "TEXTFILE_110 $(basename "$f") missing"
  fi
 done
+if [ -f /home/wooo/node_exporter_textfiles/storage_health.prom ]; then
+  awk "/^awoooi_host_storage_root_readonly/ {readonly=int(\$2)} /^awoooi_host_storage_current_boot_error_count/ {current=int(\$2)} END {printf \"STORAGE_HEALTH_110 root_readonly=%d current=%d\\n\", readonly+0, current+0}" /home/wooo/node_exporter_textfiles/storage_health.prom
+fi
+if [ -f /home/wooo/node_exporter_textfiles/backup_health.prom ]; then
+  awk "/^awoooi_backup_job_fresh/ {total++; if (int(\$2) == 0) stale++} /^awoooi_backup_job_configured/ {if (int(\$2) == 0) missing_cron++} /^awoooi_backup_script_present/ {if (int(\$2) == 0) missing_script++} /^awoooi_backup_last_run_failed_count/ {if (\$0 ~ /(exported_job|job)=\"backup_all\"/) failed=int(\$2)} /^awoooi_backup_config_capture_critical_failed_count/ {config_failed=int(\$2)} /^awoooi_backup_integrity_fresh/ {integrity_total++; if (int(\$2) == 0) integrity_stale++} END {printf \"BACKUP_HEALTH_110 total=%d stale=%d missing_cron=%d missing_script=%d failed_count=%d config_failed=%d integrity_total=%d integrity_stale=%d\\n\", total+0, stale+0, missing_cron+0, missing_script+0, failed+0, config_failed+0, integrity_total+0, integrity_stale+0}" /home/wooo/node_exporter_textfiles/backup_health.prom
+fi
 ' 2>&1); then
    echo "$out"
    grep -q "CRON_110 active" <<<"$out" && ok "110 cron active" || warn "110 cron not confirmed"
@@ -443,6 +491,11 @@ done
    grep -q "STAGGERED_STARTUP_ENABLED disabled" <<<"$out" && ok "110 stale staggered startup unit disabled" || warn "110 stale staggered startup unit not disabled"
    awk '/TEXTFILE_110 docker_stats.prom age=/ {split($3,a,"="); exit !(a[2] < 300)}' <<<"$out" && ok "110 docker stats exporter fresh" || warn "110 docker stats exporter stale"
    awk '/TEXTFILE_110 systemd_units.prom age=/ {split($3,a,"="); exit !(a[2] < 300)}' <<<"$out" && ok "110 systemd units exporter fresh" || warn "110 systemd units exporter stale"
+    awk '/TEXTFILE_110 storage_health.prom age=/ {found=1; split($3,a,"="); fresh=(a[2] < 300)} END {exit !(found && fresh)}' <<<"$out" && ok "110 storage health exporter fresh" || warn "110 storage health exporter stale or missing"
+    awk '/TEXTFILE_110 backup_health.prom age=/ {found=1; split($3,a,"="); fresh=(a[2] < 900)} END {exit !(found && fresh)}' <<<"$out" && ok "110 backup health exporter fresh" || warn "110 backup health exporter stale or missing"
+    grep -q "STORAGE_HEALTH_110 root_readonly=0 current=0" <<<"$out" && ok "110 current boot storage health clean" || warn "110 storage health not clean"
+    grep -q "BACKUP_HEALTH_110 total=" <<<"$out" && awk '/BACKUP_HEALTH_110/ {split($3,a,"="); split($4,b,"="); split($5,c,"="); split($6,d,"="); split($7,e,"="); exit !((a[2]+b[2]+c[2]) == 0 && d[2] == 0 && e[2] == 0)}' <<<"$out" && ok "110 backup health has no stale, missing, or failed jobs" || warn "110 backup health has stale/missing jobs or failed aggregate/config components; rerun backup-all after 120 recovers"
+    awk '/BACKUP_HEALTH_110/ {found=1; split($9,a,"="); stale=a[2]+0} END {exit !(found && stale == 0)}' <<<"$out" && ok "110 backup integrity and restore drill fresh" || warn "110 backup integrity or restore drill stale"
  else
    warn "110 schedule check unavailable"
    echo "$out"
@@ -494,54 +547,41 @@ summary() {
  echo "PASS=$PASS WARN=$WARN BLOCKED=$FAIL"
  if [ "$FAIL" -gt 0 ]; then
    echo "Result: BLOCKED. Fix the first blocked gate before releasing runner/CD/AI auto-remediation."
-    return 2
+    exit 2
  fi
  if [ "$WARN" -gt 0 ]; then
    echo "Result: DEGRADED. Core gates passed but warnings remain."
-    return 1
+    exit 1
  fi
  echo "Result: GREEN. Full stack is ready for controlled runner/CD release."
-  return 0
-}
-
-run_once() {
-  reset_counters
-  print_header
-  check_network
-  check_188
-  check_110
-  check_k3s
-  check_workload_and_alertchain
-  check_public_routes
-  check_schedules
-  summary
 }

 if [ "$WATCH_MODE" -eq 1 ]; then
  attempt=1
-  while :; do
-    if [ "$WATCH_MAX_ATTEMPTS" -eq 0 ]; then
-      printf "\nWatch attempt %s/unlimited\n" "$attempt"
-    else
-      printf "\nWatch attempt %s/%s\n" "$attempt" "$WATCH_MAX_ATTEMPTS"
-    fi
-
-    run_once
+  rc=2
+  while true; do
+    echo "WATCH_ATTEMPT=$attempt"
+    args=()
+    [ "$MONITOR_READ_ONLY" -eq 1 ] && args+=(--monitor-read-only)
+    [ "$NO_COLOR_FLAG" -eq 1 ] && args+=(--no-color)
+    [ "$SEND_ALERT_TEST" -eq 1 ] && args+=(--send-alert-test)
+    bash "$0" ${args[@]+"${args[@]}"}
    rc=$?
-    if [ "$rc" -eq 0 ]; then
-      exit 0
-    fi
-
-    if [ "$WATCH_MAX_ATTEMPTS" -ne 0 ] && [ "$attempt" -ge "$WATCH_MAX_ATTEMPTS" ]; then
-      echo "Watch stopped before GREEN. Last result code: $rc"
+    [ "$rc" -eq 0 ] && exit 0
+    if [ "$WATCH_MAX_ATTEMPTS" -gt 0 ] && [ "$attempt" -ge "$WATCH_MAX_ATTEMPTS" ]; then
      exit "$rc"
    fi
-
-    echo "Waiting ${WATCH_INTERVAL}s before the next cold-start gate check..."
-    sleep "$WATCH_INTERVAL"
    attempt=$((attempt + 1))
+    sleep "$WATCH_INTERVAL"
  done
 fi

-run_once
-exit $?
+print_header
+check_network
+check_188
+check_110
+check_k3s
+check_workload_and_alertchain
+check_public_routes
+check_schedules
+summary