fix(ops): gate reboot recovery on product freshness [skip ci]

ogt
2026-06-25 19:39:42 +08:00
parent bfc78d3fee
commit 5e4887d15c
7 changed files with 170 additions and 15 deletions


@@ -1,3 +1,29 @@
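## 2026-06-25: Reboot SOP product-version / data-freshness gate tightened
**Background**: The user requires that recovery after a host reboot must not stop at websites returning 200; every host, service, product, website, tool, package, and dataset must be confirmed healthy, current, and traceable. This round expands the post-start quick check from the four core routes to all product entry points plus a StockPlatform freshness gate, so that a green route cannot mask a data blocker.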
**Read-only live evidence**:
- The working tree was rebased onto `gitea/main=232d75f1`; no force push and no host / Docker / Nginx / firewall / K8s write operations were performed.
- 19:24 full post-start / cold-start / backup readback: 110 / 120 / 121 / 188 ping and SSH OK; delegated cold-start `PASS=89 WARN=0 BLOCKED=0`; backup 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`.
- Expanded public route readback: `awoooi`, AWOOOI API, IwoooS, VibeWork, AwoooGo, 2026FIFA, Agent Bounty, MOMO, MOMO health, Stock, Stock `/healthz`, Stock `/api/healthz`, Bitan, TsenYang, VTuber, Gitea, Harbor, Registry, Sentry, SigNoz, Langfuse, and AIOps all returned the expected 2xx/3xx.
- MOMO: health `V10.690`; `momo-pro-system` / `momo-scheduler` / `momo-telegram-bot` healthy; `DB_DAILY_FRESHNESS 1|2026-06-24`; latest import job `57 completed|即時業績_當日.xlsx|2026-06-25T13:16:47.359958|2026-06-25T13:18:02.964985|15383|15383|0`.
- Bitan: `scripts/verify-public-content-cleanliness.sh --target-url https://bitan.wooo.work --db-container bitan-pg-restore` passed; source contract, DB codex/test content, NEWS API, announcements API, products API, and products/news pages all PASS.
- StockPlatform: service route and health routes are `200`; live source `/home/wooo/stockplatform-v2` matches Gitea main `c67a2cf5aef3f15f14c99941a1615d1c809bac33`; but `/api/v1/system/freshness` returns `status=blocked`, `latest_trading_date=2026-06-25`, blockers `core_margin_short_daily_missing,ai_recommendations_stale`. OK sources: `core.price_daily` / `core.chips_daily` / `core.market_index_daily.tw` at `2026-06-25`; blocked sources: `core.margin_short_daily` and `ai.recommendations` at `2026-06-24`.
- Product version readback: VibeWork runtime image `192.168.0.110:5000/vibework/web:76a4ee15026af278a3660ad4b4547e9308b107be` matches Gitea main `76a4ee15026af278a3660ad4b4547e9308b107be`; AwoooGo live source matches Gitea main `6897972e9820cbb96c508fa9a80c66246c973307`; MOMO runtime uses `registry.wooo.work/wooo/momo-pro-system:stable` image id `df931906e158` created `2026-06-25T13:28:59+08:00`, while Gitea `wooo/momo-pro-system.git` main is `25120cbf21ba51affc94d0220ec87e607f59a833` and the runtime compose dir is not a git worktree.
**Repo changes**:
- `scripts/reboot-recovery/post-start-quick-check.sh` adds the full product route list and a `StockPlatform freshness` section; `--skip-stock` skips that gate.
- `ops/reboot-recovery/full-stack-cold-start-baseline.yml` adds VibeWork, AwoooGo, 2026FIFA, Agent Bounty, Stock health routes, Bitan, TsenYang, VTuber, and `stockplatform_system_freshness_ok` / `stockplatform_latest_trading_day_sources_current`.
- `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` v1.2, `docs/runbooks/FULL-STACK-COLD-START-SOP.md` v1.57, `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md`, and `docs/runbooks/BACKUP-STATUS.md` have been synced.
**Latest verdict**:
- Host / K3s / core service / public route / MOMO data / backup / offsite: green for current evidence.
- Product-data completeness: `BLOCKED_STOCK_DATA_FRESHNESS`.
- DR: `BLOCKED_DR_ESCROW` because `escrow_missing=5`.
- Wazuh: security registry evidence blocker, not a reboot service blocker; manager registry accepted remains `0`.
**Boundary**: This round did not restart Docker/systemd/Nginx/K3s, change the firewall, run an active scan, read secrets, write to the host runtime, or run a manual Stock import. Stock freshness remediation must go through the StockPlatform data lane; it must not be masked by a route 200 or a host reboot.
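## 2026-06-25: IwoooS P0 security-incident convergence gate, source-side complete
**Background**: Recent Wazuh API / registry, host intrusion, port change, and alert readability work all still lack acceptable owner evidence; the frontend and governance documents must also avoid treating raw logs, working-window conversations, personal names, intranet information, or a single green light as security completion. This round first converges the immediate security risks into one read-only P0 gate, so they stop being scattered across Wazuh API / registry, host forensics, Nginx, firewall, host runtime, alert receipts, SOC cases, and cross-project runtime boundaries.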


@@ -20,6 +20,7 @@
> 2026-06-25 10:23 Codex MOMO fail-closed live proof: 10:04 scheduler run recorded Google Drive auth failure as `❌ 自動匯入失敗` and sent Telegram failure notification successfully; cold-start remains `PASS=87 WARN=1 BLOCKED=1` because business data is stale beyond 3 days and Drive token metadata/writeback is not confirmed. Backup/offsite remains green and `escrow_missing=5` remains the DR blocker.
> 2026-06-25 10:35 Codex route / DB / backup refresh: direct public routes for AWOOOI API, IwoooS, VibeWork, AwoooGo, MOMO health, Stock, and Bitan are 200; backup remains 110 `13/13` and 188 `2/2` fresh; MOMO daily and monthly DB bounds still stop at `2026-06-17`; latest import job remains `56 completed`.
> 2026-06-25 19:17 Codex latest recovery readback: post-start quick check is `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`; 110 backup `13/13 fresh failed=0`, 188 backup `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`; MOMO data freshness is recovered through `2026-06-24`; DR still blocked only by `escrow_missing=5`.
> 2026-06-25 19:35 Codex product-data gate refresh: backup/offsite remains green, but overall "all products/data latest" is blocked by StockPlatform `/api/v1/system/freshness` (`core_margin_short_daily_missing`, `ai_recommendations_stale`). This is not a backup failure; keep `escrow_missing=5` as the DR blocker and Stock freshness as a separate product-data blocker.
---


@@ -1,6 +1,6 @@
# AWOOOI Full-Stack Cold-Start and Host Reboot SOP
> Version: v1.56
> Version: v1.57
> Last updated: 2026-06-25 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
@@ -12,6 +12,8 @@
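If the goal after a reboot is only a quick judgment on whether recovery can be declared, first run the one-page overall check: `scripts/reboot-recovery/post-start-quick-check.sh --no-color`, and use `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` as the manual fallback. The long SOP keeps the full background, exception handling, and Plan B; the short wrapper / checklist handles the fixed verdict within T+10 minutes of each reboot.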
2026-06-25 19:35 product-version / data-freshness refresh supersedes the 19:06 data-complete wording. Host boot, K3s, AWOOOI runtime, MOMO service/data, backup/offsite, Bitan cleanliness, and expanded public routes are available, but the stricter post-start wrapper now checks StockPlatform `/api/v1/system/freshness` and correctly returns `RESULT=BLOCKED` when product data is not current. The 19:35 lightweight wrapper run used `--skip-cold-start --skip-backup --skip-cpu` after the 19:24 full host/cold-start/backup readback and returned `PASS=31 WARN=1 BLOCKED=1`, with the single blocker `StockPlatform freshness is blocked: core_margin_short_daily_missing,ai_recommendations_stale`. `stock.wooo.work`, `/healthz`, and `/api/healthz` all return `200`, and live source `/home/wooo/stockplatform-v2` matches Gitea `main` `c67a2cf5aef3f15f14c99941a1615d1c809bac33`, so this is a data-freshness blocker, not a route or source-version blocker. Public routes now covered by the wrapper include AWOOOI, VibeWork, AwoooGo, 2026FIFA, Agent Bounty, MOMO, Stock, Bitan, TsenYang, VTuber, Gitea, Harbor, Registry, Sentry, SigNoz, Langfuse, and AIOps. Do not declare "all products and data are latest" until StockPlatform freshness is `ok`; keep DR blocked until `escrow_missing=0`.
2026-06-25 19:06 post-CD live read-only refresh supersedes the 18:53 wrapper wording. Consecutive main pushes caused older deploy markers to be replaced, so the latest production truth is deploy marker `d8ca8224 chore(cd): deploy 9dbe044 [skip ci]`. Read-only ArgoCD shows `awoooi-prod Synced / Healthy` at revision `d8ca822422021d0fda8da8fa4c354c0c4db7ff22`; API/Web/Worker live image tag `9dbe044ea1e8e3894ccbeb5ed760bb124b87f7be`; API `2/2`, Web `2/2`, Worker `1/1`. The 19:05 post-start quick check returns `RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, delegated cold-start remains `PASS=89 WARN=0 BLOCKED=0`, and 19:05-19:06 route stability checks confirm AWOOOI API, IwoooS, AwoooGo, Stock, VibeWork, Bitan, and MOMO health all return `200` for five consecutive external reads. Earlier AwoooGo / Stock `502` reads were post-deploy upstream warmup transients, not persistent service failures. Hosts, routes, K3s, AWOOOI API health, MOMO service health, MOMO business data freshness, backup core/offsite, and core monitoring/exporter surfaces are green for controlled runner/CD release. MOMO is healthy on `V10.690`; latest import job `57` completed cleanly; `MOMO_DAILY_FRESHNESS 1|2026-06-24`; current-month daily snapshot and realtime tables match through `2026-06-24`. `post-start-quick-check.sh` parses cold-start `PASS / WARN / BLOCKED` summary before classifying exit codes, so WARN-only rollout/stale evidence is no longer inflated into a service blocker. The wrapper returns `RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED` when service blockers are zero but `escrow_missing=5` remains. Do not turn this into a DR complete or security/runtime acceptance claim. Wazuh production routes are now `200 disabled_waiting_iwooos_wazuh_owner_gate`, but `configured=false`, manager query accepted `0`, manager registry accepted `0`, and runtime gate `0`; treat Wazuh as a security registry evidence blocker, not a reboot service blocker.
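The summary-first classification described above can be sketched roughly as follows. This is an illustrative sketch only, not the actual logic of `post-start-quick-check.sh`: the function name `classify_post_start` and the `FULL_STACK_GREEN` label are hypothetical, and the sketch deliberately ignores the `SERVICE / BOUNDARY / EVIDENCE` warning split and the `SERVICE_AVAILABLE_DEGRADED` outcome.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not taken from the repo): classify a run from its
# "PASS=.. WARN=.. BLOCKED=.." summary plus the escrow_missing count.
classify_post_start() {
  local summary="$1" escrow_missing="${2:-}" blocked
  # Missing or non-numeric escrow evidence is never treated as green.
  if ! [[ "$escrow_missing" =~ ^[0-9]+$ ]]; then
    echo "BLOCKED"
    return 1
  fi
  # Read the BLOCKED counter from the summary; no summary means no green.
  if [[ "$summary" =~ BLOCKED=([0-9]+) ]]; then
    blocked="${BASH_REMATCH[1]}"
  else
    echo "BLOCKED"
    return 1
  fi
  if (( 10#$blocked > 0 )); then
    echo "BLOCKED"
    return 1
  fi
  # WARN-only runs fall through here: warnings alone are not blockers.
  if (( 10#$escrow_missing > 0 )); then
    echo "FULL_STACK_GREEN_DR_ESCROW_BLOCKED"
    return 0
  fi
  echo "FULL_STACK_GREEN"   # hypothetical label for the fully green case
}
```

For example, `classify_post_start 'PASS=18 WARN=3 BLOCKED=0' 5` prints `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, while `classify_post_start 'PASS=31 WARN=1 BLOCKED=1' 5` prints `BLOCKED`.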
```text
@@ -20,10 +22,12 @@ Live cold-start read-only check: 2026-06-25 19:05 wrapper delegated cold-start P
Post-start quick check: 2026-06-25 19:05 PASS=18 WARN=3 BLOCKED=0; warning split SERVICE=0 BOUNDARY=1 EVIDENCE=2; Result=FULL_STACK_GREEN_DR_ESCROW_BLOCKED; exit code 0.
Repo-side cold-start v1.42+ live read-only run: MOMO source absence / stale data blocker is cleared by import job 57 and `MOMO_DAILY_FRESHNESS 1|2026-06-24`. Live 110 script sync is not claimed until a separate approved deployment/sync happens.
110 live-sync parity: 2026-06-24 23:15 read-only `verify-cold-start-monitor-deploy.sh` correctly BLOCKED because repo script hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`. Do not use live 110 monitor output to prove v1.42 behavior until the approved live-sync gate in §13.3.1 passes.
Service state: FULL_STACK_GREEN_DR_ESCROW_BLOCKED; 110/120/121/188 reachable, K3s mon/mon1 Ready, public routes/TLS green, MOMO data fresh, 110/188 backup health fresh, 188 node-exporter / PostgreSQL exporter / Redis exporter restored, 188 MinIO endpoint and Velero BackupStorageLocation restored, 110 disk pressure cleared.
Runtime release state: API/Web/Worker live image tag is `9dbe044ea1e8e3894ccbeb5ed760bb124b87f7be`, and 19:06 K3s readback shows API/Web/Worker pods Running; production API health returns healthy with `environment=prod`, `mock_mode=false`, and postgresql / redis / openclaw / signoz / gcp ollama providers up. 19:05 route smoke returned 200 for AWOOOI API, IwoooS, MOMO health, and Stock; cold-start route gate also returned expected statuses for AWOOOI web, MOMO, Gitea, Harbor, Registry, Sentry, SigNoz, Langfuse, Bitan, and AIOps. AwoooGo, Stock, AWOOOI API, IwoooS, VibeWork, MOMO health, and Bitan then returned 200 for five consecutive external route reads from 19:05:38 to 19:06:24. Cold-start raw route gate returned all expected route statuses, including redirects such as awoooi web=307 and sentry=302.
Service state: HOST_AND_CORE_SERVICE_GREEN_STOCK_DATA_BLOCKED_DR_ESCROW_BLOCKED; 110/120/121/188 reachable, K3s mon/mon1 Ready, public routes/TLS green, MOMO data fresh, 110/188 backup health fresh, 188 node-exporter / PostgreSQL exporter / Redis exporter restored, 188 MinIO endpoint and Velero BackupStorageLocation restored, 110 disk pressure cleared. Product-data completeness is blocked by StockPlatform freshness.
Runtime release state: API/Web/Worker live image tag is `9dbe044ea1e8e3894ccbeb5ed760bb124b87f7be`, and 19:06 K3s readback shows API/Web/Worker pods Running; production API health returns healthy with `environment=prod`, `mock_mode=false`, and postgresql / redis / openclaw / signoz / gcp ollama providers up. 19:05 route smoke returned 200 for AWOOOI API, IwoooS, MOMO health, and Stock; cold-start route gate also returned expected statuses for AWOOOI web, MOMO, Gitea, Harbor, Registry, Sentry, SigNoz, Langfuse, Bitan, and AIOps. AwoooGo, Stock, AWOOOI API, IwoooS, VibeWork, MOMO health, and Bitan then returned 200 for five consecutive external route reads from 19:05:38 to 19:06:24. 19:35 expanded route readback returned expected 2xx/3xx for AWOOOI, VibeWork, AwoooGo, 2026FIFA, Agent Bounty, MOMO, Stock, Bitan, TsenYang, VTuber, Gitea, Harbor, Registry, Sentry, SigNoz, Langfuse, and AIOps. Cold-start raw route gate returned all expected route statuses, including redirects such as awoooi web=307 and sentry=302.
MOMO release state: mo.wooo.work health is healthy on version V10.690. `momo-pro-system`, `momo-scheduler`, and `momo-telegram-bot` are healthy; scheduler `RestartCount=0`. 18:23 dedicated preflight returns PASS=19 WARN=2 BLOCKED=0, so retain recent container-replace / scheduler fail-closed / notification evidence notes, but no service blocker remains.
MOMO data state: current-month daily_sales_snapshot and realtime_sales_monthly match through 2026-06-24: `daily_sales_snapshot=109061|2025-07-01|2026-06-24`, `MOMO_MONTHLY_SYNC 15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, and `MOMO_DAILY_FRESHNESS 1|2026-06-24`. Latest import job is `57 completed|即時業績_當日.xlsx|2026-06-25T13:16:47.359958|2026-06-25T13:18:02.964985|15383|15383|0`.
StockPlatform data state: `/api/v1/system/freshness` returns `status=blocked`, `latest_trading_date=2026-06-25`, blockers `core_margin_short_daily_missing,ai_recommendations_stale`. Current OK sources include `core.price_daily` 2026-06-25 / 1976 rows, `core.chips_daily` 2026-06-25 / 1976 rows, `core.market_index_daily.tw` 2026-06-25 / 2 rows, and `core.market_index_daily.global` 2026-06-25 / 2001 rows. Blocked sources are `core.margin_short_daily` latest 2026-06-24 / 0 rows for 2026-06-25 and `ai.recommendations` latest 2026-06-24 / 2748 rows. Do not claim StockPlatform data or AI recommendations are latest until this endpoint is `ok`.
Product version readback: StockPlatform live source `/home/wooo/stockplatform-v2` matches Gitea `wooo/stockplatform-v2.git` main `c67a2cf5aef3f15f14c99941a1615d1c809bac33`; VibeWork live image `192.168.0.110:5000/vibework/web:76a4ee15026af278a3660ad4b4547e9308b107be` matches Gitea `wooo/vibework.git` main `76a4ee15026af278a3660ad4b4547e9308b107be`; AwoooGo live source `/home/wooo/awooogo` matches Gitea `wooo/AwoooGo` main `6897972e9820cbb96c508fa9a80c66246c973307`; MOMO runtime uses `registry.wooo.work/wooo/momo-pro-system:stable` image id `df931906e158` created `2026-06-25T13:28:59+08:00`, while Gitea `wooo/momo-pro-system.git` main is `25120cbf21ba51affc94d0220ec87e607f59a833`; 188 runtime directory is a compose/image deployment path, not a git worktree, so add image revision label evidence before declaring code-image parity.
Google Drive / source-file state: 14:16 cold-start reports `MOMO_GDRIVE_TOKEN_STAT 100000:100000:600 scheduler_uid=100000`. Dedicated preflight confirms host token metadata matches scheduler UID and restrictive mode; container token artifact exists with mode `600`. Token content was not read. Future Drive auth/API failure must still be treated as failed import evidence rather than no-file success.
110 CPU/load readback: 2026-06-25 10:58 user-approved minimal SIGTERM targeted only orphan `stockplatform-review-bulk-ux` Chrome process groups `438005`, `471295`, `640155`, and `670628`; `OLD_GROUPS_REMAINING` returned empty. 19:05 readback shows current higher load is mainly Gitea Actions cache save / `zstdmt` / `tar`, StockPlatform headless Chrome smoke / CI, Gitea, AWOOOI API, ClickHouse, Docker, and platform services. No Docker/systemd/Nginx/firewall/K8s write was performed; do not cancel active CI/smoke unless separately approved. If Chrome groups are active children of Playwright / CI, observe queue and timeout; if they become PPID 1 orphan process groups with sustained CPU and no parent smoke, run dry-run and require owner approval before targeted `SIGTERM`.
Backup / monitoring state: 19:05 wrapper readback confirms backup core blockers are 0, 110 is 13/13 fresh failed=0, 188 is 2/2 fresh failed=0, offsite_fresh=1, rclone_gdrive_fresh=1, integrity_stale=0, last aggregate is 2026-06-25 02:35:09, and escrow_missing=5.
@@ -32,8 +36,8 @@ Notification-noise state: healthy AWOOOI heartbeat is suppressed; heartbeat warn
Deploy storm / CD replacement state: if several main commits land during recovery, older CD runs may be canceled by newer commits. Do not treat the canceled run as a service failure. Wait for the final deploy marker, verify live image tags, ArgoCD health, public routes, DB freshness, backup status, and post-start quick check before declaring latest production recovered.
Wazuh / SOC boundary state: production Wazuh read-only route presence is not equivalent to Wazuh registry recovery. `/api/iwooos/wazuh` and `/api/v1/iwooos/wazuh` returning `200 disabled_waiting_iwooos_wazuh_owner_gate` only proves the route boundary is deployed; manager registry accepted, owner evidence accepted, active response, host write, agent re-enroll, restart, secret patch, Kali active scan, and runtime gate remain `0 / false`.
Monitoring coverage recovery state: if CD post-deploy fails only because `scripts/generate_monitoring.py --check` reports `nginx-exporter` down on `192.168.0.188:9113`, first verify 188 `stub_status` and restore the stateless exporter with `scripts/ops/188-nginx-exporter-restore.sh`; do not reload Nginx or restart product containers for this symptom.
Allowed declaration: full-stack service readiness is GREEN for controlled runner/CD release; core hosts, routes, K3s, backup/exporter surfaces, AWOOOI API health, MOMO service health, and MOMO data freshness are green for the latest read-only evidence set.
Forbidden declaration: DR complete, credential escrow complete, Wazuh host registry accepted, 110 live monitor synced, or runtime/security acceptance. Credential escrow evidence is still missing and must not be forged.
Allowed declaration: host boot, core service readiness, K3s, public route availability, AWOOOI API health, MOMO service health/data freshness, Bitan public-content cleanliness, and backup/offsite readiness are green for the latest read-only evidence set.
Forbidden declaration: all product data latest, StockPlatform data freshness green, DR complete, credential escrow complete, Wazuh host registry accepted, 110 live monitor synced, or runtime/security acceptance. Credential escrow evidence is still missing and StockPlatform freshness is blocked; neither may be smoothed into green.
```
2026-06-24 22:17 Codex workstation continuity readback:


@@ -1,6 +1,6 @@
# One-Page Overall Check After Host Reboot
> Version: v1.1
> Version: v1.2
> Last updated: 2026-06-25 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 post-reboot service recovery. 112 Kali / Wazuh / active scan are not part of this procedure.
@@ -95,7 +95,22 @@ scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh
When `DB_DAILY_FRESHNESS > 2` or the import job has failed, do not declare MOMO data recovered.
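A minimal sketch of the `DB_DAILY_FRESHNESS` threshold, assuming the readback line has the shape `DB_DAILY_FRESHNESS <lag>|<date>` seen in the evidence (for example `DB_DAILY_FRESHNESS 1|2026-06-24`) and that the first field is the lag in days. The function name and the `MOMO_DATA_FRESH` label are hypothetical; the import-job check is not covered here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify one DB_DAILY_FRESHNESS readback line.
# Assumption: the field before the first '|' is the data lag in days.
momo_freshness_verdict() {
  local line="$1" payload lag
  payload="${line#DB_DAILY_FRESHNESS }"   # drop the label
  lag="${payload%%|*}"                     # keep the field before the first '|'
  # Unparsable evidence fails closed instead of passing silently.
  if ! [[ "$lag" =~ ^[0-9]+$ ]]; then
    echo "BLOCKED_MOMO_DATA_FRESHNESS"
    return 1
  fi
  if (( 10#$lag > 2 )); then
    echo "BLOCKED_MOMO_DATA_FRESHNESS"
    return 1
  fi
  echo "MOMO_DATA_FRESH"                   # hypothetical label
}
```

Failing closed on a malformed line follows the same no-false-green rule as the rest of this runbook: missing evidence is treated as a blocker.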
### Step 4 - Backup / offsite / escrow
### Step 4 - StockPlatform freshness gate
```bash
curl -k -sS https://stock.wooo.work/api/v1/system/freshness
```
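Required fields:
- `status` must be `ok` before the StockPlatform data layer can be declared recovered.
- `latest_trading_date` must match the most recent trading day.
- `core.price_daily`, `core.chips_daily`, and `core.market_index_daily.tw` must be `ok`.
- `blockers` must not contain `core_margin_short_daily_missing`, `ai_recommendations_stale`, or any other data-gate blocker.
`stock.wooo.work`, `/healthz`, and `/api/healthz` all returning 200 only shows the service is alive; when `/api/v1/system/freshness` is `blocked`, do not declare StockPlatform data current.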
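The gate decision itself can be sketched as a small function. This is a hedged illustration: it assumes `status` and the blocker list have already been extracted from the freshness response (the response layout is not shown in this runbook, so the extraction step is left out), and the function name and the `STOCK_DATA_FRESH` label are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: decide the StockPlatform freshness gate.
#   $1 = status value from the freshness response
#   $2 = comma-separated blocker names (empty when there are none)
stock_freshness_verdict() {
  local status="$1" blockers="${2:-}"
  # Green requires both conditions: status is ok AND no blocker is listed.
  if [[ "$status" == "ok" && -z "$blockers" ]]; then
    echo "STOCK_DATA_FRESH"                # hypothetical label
    return 0
  fi
  # Anything else is a product-data blocker, regardless of route status.
  echo "BLOCKED_STOCK_DATA_FRESHNESS${blockers:+: $blockers}"
  return 1
}
```

Route status is intentionally not an input: a 200 from `/healthz` cannot change the verdict.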
### Step 5 - Backup / offsite / escrow
Run read-only on 110:
@@ -115,14 +130,33 @@ scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh
When `escrow_missing>0`, services may be green, but DR must not be green.
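As a rough sketch of the escrow rule, the snippet below pulls `escrow_missing` out of a readback line and derives the DR verdict. It assumes the readback is a line of `key=value` tokens such as `core_blockers=0 offsite_fresh=1 escrow_missing=5`; the real output format of the backup status script may differ. `kv_get`, `dr_verdict`, and the `DR_ESCROW_OK` label are hypothetical names.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the numeric value of key=value found in a line.
kv_get() {
  local key="$1" line="$2"
  # The key must start the line or follow a non-identifier character,
  # so "x_escrow_missing=0" does not match "escrow_missing".
  if [[ "$line" =~ (^|[^A-Za-z0-9_])${key}=([0-9]+) ]]; then
    echo "${BASH_REMATCH[2]}"
  else
    return 1
  fi
}

# Hypothetical helper: DR is green only when escrow_missing is exactly 0.
dr_verdict() {
  local missing
  if missing="$(kv_get escrow_missing "$1")" && (( 10#$missing == 0 )); then
    echo "DR_ESCROW_OK"                    # hypothetical label
  else
    echo "BLOCKED_DR_ESCROW"
    return 1
  fi
}
```

A line with no `escrow_missing` token also yields `BLOCKED_DR_ESCROW`, so absent evidence cannot be read as DR green.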
### Step 5 - Public routes are supporting evidence only
### Step 6 - Public routes are supporting evidence only
```bash
for url in \
https://awoooi.wooo.work/ \
https://awoooi.wooo.work/api/v1/health \
https://awoooi.wooo.work/zh-TW/iwooos \
https://vibework.wooo.work/ \
https://awooogo.wooo.work/ \
https://2026fifa.wooo.work/ \
https://agent.wooo.work/ \
https://mo.wooo.work/ \
https://mo.wooo.work/health \
https://stock.wooo.work/; do
https://stock.wooo.work/ \
https://stock.wooo.work/healthz \
https://stock.wooo.work/api/healthz \
https://bitan.wooo.work/ \
https://tsenyang.com/ \
https://www.tsenyang.com/ \
https://vtuber.wooo.work/ \
https://gitea.wooo.work/ \
https://harbor.wooo.work/ \
https://registry.wooo.work/ \
https://sentry.wooo.work/ \
https://signoz.wooo.work/ \
https://langfuse.wooo.work/ \
https://aiops.wooo.work/; do
code="$(curl -k -sS -o /dev/null -w '%{http_code}' "$url" || true)"
echo "$code $url"
done
@@ -130,7 +164,7 @@ done
Route smoke must be read together with cold-start / DB / backup; it cannot serve as recovery proof on its own.
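To make the loop output easier to scan, a small filter can flag anything outside the expected 2xx/3xx range. This is a sketch under the assumption that the input is the `"<code> <url>"` lines echoed by the loop; `route_code_ok` and `unexpected_routes` are hypothetical names, and the per-route expected status (for example a specific redirect) is not modeled.

```bash
#!/usr/bin/env bash
# Hypothetical helper: any 2xx or 3xx code counts as an expected response.
route_code_ok() {
  [[ "$1" =~ ^[23][0-9][0-9]$ ]]
}

# Hypothetical helper: read "<code> <url>" lines on stdin, print the
# unexpected ones, and return non-zero if at least one was found.
unexpected_routes() {
  local code url bad=0
  while read -r code url; do
    [[ -z "$code" ]] && continue           # skip blank lines
    if ! route_code_ok "$code"; then
      echo "UNEXPECTED $code $url"
      bad=1
    fi
  done
  return "$bad"
}
```

Saving the loop output to a file and running `unexpected_routes < routes.txt` lists only the failures; `curl` reports `000` when no HTTP response was received, which this filter also flags.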
### Step 6 - 110 CPU / runaway process
### Step 7 - 110 CPU / runaway process
```bash
ssh wooo@192.168.0.110 'uptime; vmstat 1 5; ps -eo pid,ppid,pgid,stat,pcpu,pmem,comm,args --sort=-pcpu | head -25'
@@ -151,6 +185,7 @@ ssh wooo@192.168.0.110 'uptime; vmstat 1 5; ps -eo pid,ppid,pgid,stat,pcpu,pmem,
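| `FULL_STACK_GREEN_DR_ESCROW_BLOCKED` | All service surfaces may be declared recovered; DR complete may not be declared. |
| `SERVICE_AVAILABLE_DEGRADED` | Services may be declared available; WARNs and next steps must be listed. |
| `BLOCKED_MOMO_DATA_FRESHNESS` | The website may be declared available; data may not be declared current. |
| `BLOCKED_STOCK_DATA_FRESHNESS` | StockPlatform route / container may be declared available; StockPlatform data or AI recommendations may not be declared current. |
| `BLOCKED_HOST_OR_K3S` | Full-stack recovery may not be declared; fix the host / K3s first. |
| `BLOCKED_BACKUP_CORE` | Recovery may not be declared complete; backup red lights take priority. |
| `BLOCKED_WAZUH_REGISTRY` | Not a service-recovery blocker under this SOP; must be handed to the IwoooS / Wazuh lane; do not modify the Wazuh runtime. |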


@@ -11,16 +11,18 @@
| Area | Status | Completion | Evidence |
|------|--------|------------|----------|
| Overall recovery readiness | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-25 19:05 post-CD quick check returned exit `0`, `POST_START_QUICK_CHECK PASS=18 WARN=3 BLOCKED=0`, warning split `SERVICE=0 BOUNDARY=1 EVIDENCE=2`, result `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`: latest deploy marker is `d8ca8224 chore(cd): deploy 9dbe044 [skip ci]`; read-only ArgoCD shows `awoooi-prod Synced / Healthy` at revision `d8ca822422021d0fda8da8fa4c354c0c4db7ff22`; API/Web/Worker images are `9dbe044ea1e8e3894ccbeb5ed760bb124b87f7be`; 110 / 120 / 121 / 188 ping and SSH port are OK, K3s `mon` / `mon1` are Ready, public routes/TLS are green, AWOOOI API health is healthy/prod/mock=false, delegated cold-start is `PASS=89 WARN=0 BLOCKED=0`, MOMO service health is healthy on `V10.690`, MOMO data freshness is `1|2026-06-24`, 110 / 188 runtime and backup checks are green. MOMO latest valid job `57` completed cleanly at `2026-06-25T13:18:02`, `15383/15383/0`, and current-month snapshot / realtime bounds match through `2026-06-24`. AwoooGo, Stock, AWOOOI API, IwoooS, VibeWork, MOMO health, and Bitan returned `200` for 5 consecutive reads from 19:05-19:06; final K3s readback confirms API/Web split across `mon` / `mon1`. DR remains blocked because credential escrow evidence markers are still missing (`escrow_missing=5`) and must not be forged. |
| Overall recovery readiness | HOST_AND_CORE_SERVICE_GREEN_STOCK_DATA_BLOCKED_DR_ESCROW_BLOCKED | 96% | 2026-06-25 19:24 full post-start readback showed hosts / K3s / AWOOOI / MOMO / backup / offsite service gates green and `escrow_missing=5`; 2026-06-25 19:35 stricter product-data wrapper returned `POST_START_QUICK_CHECK PASS=31 WARN=1 BLOCKED=1`, result `BLOCKED`, because StockPlatform `/api/v1/system/freshness` is `blocked` with `core_margin_short_daily_missing,ai_recommendations_stale`. Expanded public route smoke covers AWOOOI, VibeWork, AwoooGo, 2026FIFA, Agent Bounty, MOMO, Stock, Bitan, TsenYang, VTuber, Gitea, Harbor, Registry, Sentry, SigNoz, Langfuse, and AIOps; all returned expected 2xx/3xx. MOMO remains fresh through `2026-06-24` with latest job `57` completed cleanly, and Bitan public-content cleanliness direct check passed. Do not declare "all products/data latest" until StockPlatform freshness is `ok`; do not declare DR complete until `escrow_missing=0`. |
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-25 09:05 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, VIP `192.168.0.125` is present, node filesystem / disk-pressure / readonly events are `0`, and latest `km-vectorize-29705460-55rgs` completed. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-25 19:17 backup readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`, last aggregate `2026-06-25 02:35:09`. 2026-06-25 19:19 offsite escrow report shows script presence OK, rclone configured, full and partial rclone markers present, `PASS=8 WARN=5 BLOCKED=0`, `ESCROW_MISSING_COUNT=5`; DR remains blocked on real non-secret credential escrow evidence IDs. |
| P2 service / data truth | GREEN | 100% | Public route/TLS, API/Web route, MOMO health `V10.690`, MOMO main / CD `#904` monthly-sync failure boundary, MOMO main / CD `#910` Drive-auth fail-closed boundary, direct 19:05 wrapper public route smoke all expected 2xx/3xx, five-iteration AwoooGo / Stock / AWOOOI / IwoooS / VibeWork / MOMO / Bitan route stability reads, current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, backup exporters, schedules, K3s node readiness/storage conditions, VIP, and 110 / 188 runtime health are green. 19:05 preflight confirms app / scheduler / Telegram bot healthy, scheduler restart count `0`, token metadata aligned to scheduler UID, latest job `57` completed cleanly, and `DB_DAILY_FRESHNESS 1|2026-06-24`. |
| P3 docs / automation contracts | DONE_WITH_POST_CD_DEPLOY_STORM_READBACK | 100% | Workplan, SOP v1.56, one-page post-start quick check wrapper + fallback runbook, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, 188 DB/Redis exporter restore helper, 188 MinIO/Velero restore helper, 188 nginx-exporter restore helper, 110 Docker disk pressure cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, MOMO Pro false-noise health monitor source-of-truth, docker-health direct Telegram fallback cooldown, Bitan public-content same-fingerprint cooldown, notification-noise readback, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side cold-start v1.42 source absence classifier, live-sync parity gate, MOMO import-boundary production deploy, MOMO Drive-auth fail-closed production deploy, 10:04 scheduler fail-closed live proof, 10:35 route / DB / backup refresh, 11:44 MOMO dedicated preflight blocked readback, 14:16 MOMO dedicated preflight recovery on V10.674 / job 57 / freshness 1, 14:41 wrapper warning split, 15:04 cold-start WARN-only classifier fix, 18:23 deploy-storm replacement readback after marker `2a9e816a`, 18:53 latest marker `cc835df5` readback, 19:06 latest marker `d8ca8224` readback, transient AwoooGo / Stock 502 warmup classification, rollout terminating-pod workload-balanced timing note, 10:58 user-approved 110 orphan Chrome SIGTERM evidence, MacBook Pro Codex safe artifact sync readback, and 2026-06-25 live refresh with full cold-start GREEN are updated. 2026-06-24 23:15 read-only verify still shows repo cold-start hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`; live 110 script sync of the v1.42 classifier is not claimed until separately approved and recorded. |
| P2 service / data truth | BLOCKED_STOCK_DATA_FRESHNESS | 92% | Service routes and core runtime are available, but product-data truth is not complete. 2026-06-25 19:35 StockPlatform `/api/v1/system/freshness` returned `status=blocked`, `latest_trading_date=2026-06-25`, blockers `core_margin_short_daily_missing,ai_recommendations_stale`; OK sources include price / chips / market index for `2026-06-25`, while `core.margin_short_daily` and `ai.recommendations` stop at `2026-06-24`. MOMO health `V10.690`, current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, and `DB_DAILY_FRESHNESS 1|2026-06-24` are green; Bitan public-content cleanliness direct check passed; expanded public routes are green. |
| P3 docs / automation contracts | DONE_WITH_PRODUCT_DATA_GATE_V157 | 100% | Workplan, SOP v1.57, one-page post-start quick check v1.2, expanded public route list, StockPlatform freshness gate, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and 2026-06-25 stricter product-data gate are updated. Live 110 script sync remains a separate approved live-write gate; do not claim it here. |
2026-06-25 19:06 post-CD wrapper readback supersedes the 18:53 wording: consecutive main pushes created a deploy storm where older deploy markers were superseded by later commits. Latest production truth is deploy marker `d8ca8224 chore(cd): deploy 9dbe044 [skip ci]`, ArgoCD `Synced / Healthy`, API/Web/Worker image tag `9dbe044ea1e8e3894ccbeb5ed760bb124b87f7be`, direct route smoke 200 for AWOOOI API / IwoooS / VibeWork / AwoooGo / MOMO health / Stock / Bitan and expected route-gate statuses for MOMO / Gitea / Harbor / Registry / Sentry / SigNoz / Langfuse / AIOps, and wrapper `POST_START_QUICK_CHECK PASS=18 WARN=3 BLOCKED=0`. Repo-side cold-start returns `PASS=89 WARN=0 BLOCKED=0`; `/backup/scripts/backup-status.sh --no-notify --no-refresh` reports 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`; MOMO dedicated preflight returns `PASS=19 WARN=2 BLOCKED=0`; MOMO health is `V10.690`; AwoooGo / Stock transient 502 reads cleared after upstream warmup and five consecutive route reads returned `200`; 110 load is around `14.51 / 12.34 / 11.42`, with Gitea Actions cache save / `zstdmt` / `tar`, StockPlatform headless Chrome smoke / CI, Gitea, AWOOOI API, ClickHouse, Docker, and platform services visible, not an AWOOOI service blocker. Wrapper result is `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, not `DEGRADED`, because service warnings are `0` and only DR boundary / evidence warnings remain. Wazuh route readback is now `200 disabled_waiting_iwooos_wazuh_owner_gate`, but manager registry accepted remains `0`, so Wazuh is a security registry evidence blocker rather than a reboot service blocker.
Full cold-start service readiness may now be declared GREEN for the latest verified evidence set. As of 2026-06-25 19:06, routes/hosts/K3s/backups/exporters/monitoring surfaces are available, AWOOOI API is healthy, MOMO service health is `V10.690`, and MOMO business data is fresh through `2026-06-24`. The live read-only cold-start scorecard is `PASS=89 WARN=0 BLOCKED=0`, the post-start wrapper result is `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, AwoooGo / Stock route stability has been rechecked after transient warmup, and final API/Web workload placement is split across `mon` / `mon1`. Do not declare DR scorecard complete while credential escrow evidence remains blocked, and do not declare Wazuh registry recovery until manager registry evidence is accepted.
2026-06-25 19:35 stricter product-data gate readback supersedes the earlier "all product data green" interpretation. The full host/cold-start/backup layer remains green per the 19:24 read-only evidence, but the updated quick check now queries StockPlatform `/api/v1/system/freshness` and therefore blocks on product-data completeness: `POST_START_QUICK_CHECK PASS=31 WARN=1 BLOCKED=1`, `RESULT=BLOCKED`, blockers `core_margin_short_daily_missing,ai_recommendations_stale`. This is the intended no-false-green outcome: `stock.wooo.work`, `/healthz`, and `/api/healthz` all return `200`, but `core.margin_short_daily` and `ai.recommendations` are still at `2026-06-24` while `latest_trading_date` is `2026-06-25`. The next action is a separate StockPlatform data freshness remediation lane; do not try to resolve it with a host reboot, Nginx reload, Docker restart, or route-only smoke.
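
The gate decision described in this entry can be sketched in a few lines. This is a minimal illustration, not the quick-check script itself: the function name `evaluate_freshness` is hypothetical, the sample payload is reconstructed from the readback values recorded here, and the per-source `status` values and the exact JSON layout are assumptions.

```python
import json


def evaluate_freshness(payload):
    """Classify a freshness payload: anything other than status "ok" blocks."""
    status = payload.get("status") or "unknown"
    blockers = list(payload.get("blockers") or [])
    latest = payload.get("latest_trading_date") or ""
    # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
    lagging = [
        source.get("source")
        for source in payload.get("sources") or []
        if (source.get("latest_date") or "") < latest
    ]
    result = "ok" if status == "ok" else "blocked"
    return result, blockers, lagging


# Sample shaped after the 19:35 readback; the field layout is an assumption.
sample = json.loads("""
{
  "status": "blocked",
  "latest_trading_date": "2026-06-25",
  "blockers": ["core_margin_short_daily_missing", "ai_recommendations_stale"],
  "sources": [
    {"source": "core.price_daily", "status": "ok", "latest_date": "2026-06-25"},
    {"source": "core.margin_short_daily", "status": "blocked", "latest_date": "2026-06-24"},
    {"source": "ai.recommendations", "status": "blocked", "latest_date": "2026-06-24"}
  ]
}
""")

result, blockers, lagging = evaluate_freshness(sample)
print(result, ",".join(blockers))
print("lagging:", ",".join(lagging))
```

Route status codes never enter the decision, which is the point of the gate: a `200` from `/healthz` says the service is up, while only the freshness payload says whether the data is current.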
2026-06-13 01:26 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing survived the next normal CD deploy: Gitea main `e4a349bc`, ArgoCD revision `e4a349bc`, images from `414413a5`, API/Web split across `mon` / `mon1`, and global `known_hosts` retained 120 / 188 after CD fix `80e6ec1a`. Do not declare DR complete while credential escrow is missing. `km-vectorize` remediation is `90%`: schedule/label fix is live, and the remaining gate is the next official 03:00 CronJob success readback.
---


@@ -75,16 +75,26 @@ phases:
public_https_routes:
- https://awoooi.wooo.work/api/v1/health
- https://awoooi.wooo.work/
- https://awoooi.wooo.work/zh-TW/iwooos
- https://vibework.wooo.work/
- https://awooogo.wooo.work/
- https://2026fifa.wooo.work/
- https://agent.wooo.work/
- https://mo.wooo.work/
- https://mo.wooo.work/health
- https://stock.wooo.work/
- https://stock.wooo.work/healthz
- https://stock.wooo.work/api/healthz
- https://bitan.wooo.work/
- https://tsenyang.com/
- https://www.tsenyang.com/
- https://vtuber.wooo.work/
- https://gitea.wooo.work/
- https://harbor.wooo.work/
- https://registry.wooo.work/
- https://sentry.wooo.work/
- https://signoz.wooo.work/
- https://stock.wooo.work/
- https://langfuse.wooo.work/
- https://bitan.wooo.work/
- https://aiops.wooo.work/
- id: P2-SCHEDULES
@@ -106,6 +116,8 @@ phases:
- momo_scheduler_registered_jobs
- momo_import_config_daily_sales_intake
- momo_source_absence_evidence_when_freshness_blocked
- stockplatform_system_freshness_ok
- stockplatform_latest_trading_day_sources_current
- k8s_cronjobs_unsuspended
- k8s_failed_jobs_zero
- dr_drill_cron_present_121


@@ -9,6 +9,7 @@ ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
SSH_CONNECT_TIMEOUT="${SSH_CONNECT_TIMEOUT:-6}"
RUN_COLD_START=1
RUN_MOMO=1
RUN_STOCK=1
RUN_BACKUP=1
RUN_ROUTES=1
RUN_CPU=1
@@ -29,10 +30,29 @@ HOSTS=(
)
ROUTES=(
"https://awoooi.wooo.work/"
"https://awoooi.wooo.work/api/v1/health"
"https://awoooi.wooo.work/zh-TW/iwooos"
"https://vibework.wooo.work/"
"https://awooogo.wooo.work/"
"https://2026fifa.wooo.work/"
"https://agent.wooo.work/"
"https://mo.wooo.work/"
"https://mo.wooo.work/health"
"https://stock.wooo.work/"
"https://stock.wooo.work/healthz"
"https://stock.wooo.work/api/healthz"
"https://bitan.wooo.work/"
"https://tsenyang.com/"
"https://www.tsenyang.com/"
"https://vtuber.wooo.work/"
"https://gitea.wooo.work/"
"https://harbor.wooo.work/"
"https://registry.wooo.work/"
"https://sentry.wooo.work/"
"https://signoz.wooo.work/"
"https://langfuse.wooo.work/"
"https://aiops.wooo.work/"
)
usage() {
@@ -44,6 +64,7 @@ Read-only post-reboot quick check for 110 / 120 / 121 / 188.
Options:
--skip-cold-start Do not run full-stack-cold-start-check.sh.
--skip-momo Do not run momo-drive-token-source-recovery-preflight.sh.
--skip-stock Do not query StockPlatform data freshness.
--skip-backup Do not run /backup/scripts/backup-status.sh on 110.
--skip-routes Do not curl public route smoke targets.
--skip-cpu Do not read 110 CPU / process summary.
@@ -67,6 +88,9 @@ while [[ $# -gt 0 ]]; do
    --skip-momo)
      RUN_MOMO=0
      ;;
    --skip-stock)
      RUN_STOCK=0
      ;;
    --skip-backup)
      RUN_BACKUP=0
      ;;
@@ -244,6 +268,57 @@ if [[ "$RUN_MOMO" -eq 1 ]]; then
rm -f "$momo_tmp"
fi
if [[ "$RUN_STOCK" -eq 1 ]]; then
  section "StockPlatform freshness"
  stock_tmp="$(mktemp -t post-start-stock.XXXXXX)"
  stock_code="$(curl -k -sS -o "$stock_tmp" -w '%{http_code}' --max-time 12 "https://stock.wooo.work/api/v1/system/freshness" 2>/dev/null || true)"
  if [[ "$stock_code" != 2* ]]; then
    blocked "StockPlatform freshness endpoint returned ${stock_code:-curl_failed}"
    cat "$stock_tmp" || true
  else
    python3 - "$stock_tmp" <<'PY'
import json
import sys
path = sys.argv[1]
with open(path, "r", encoding="utf-8") as fh:
    payload = json.load(fh)
print(f"STOCK_FRESHNESS_STATUS {payload.get('status')}")
print(f"STOCK_LATEST_TRADING_DATE {payload.get('latest_trading_date')}")
print("STOCK_BLOCKERS " + ",".join(payload.get("blockers") or []))
for source in payload.get("sources") or []:
    print(
        "STOCK_SOURCE "
        f"{source.get('source')}|{source.get('status')}|"
        f"{source.get('latest_date')}|{source.get('row_count')}"
    )
PY
    stock_status="$(python3 - "$stock_tmp" <<'PY'
import json
import sys
with open(sys.argv[1], "r", encoding="utf-8") as fh:
    print(json.load(fh).get("status") or "")
PY
)"
    if [[ "$stock_status" == "ok" ]]; then
      ok "StockPlatform freshness is ok"
    else
      stock_blockers="$(python3 - "$stock_tmp" <<'PY'
import json
import sys
with open(sys.argv[1], "r", encoding="utf-8") as fh:
    print(",".join(json.load(fh).get("blockers") or []))
PY
)"
      blocked "StockPlatform freshness is ${stock_status:-unknown}: ${stock_blockers:-no_blocker_list}"
    fi
  fi
  rm -f "$stock_tmp"
fi
if [[ "$RUN_BACKUP" -eq 1 ]]; then
section "Backup / offsite / escrow"
backup_tmp="$(mktemp -t post-start-backup.XXXXXX)"