Compare commits

2 Commits

| Author | SHA1 | Message | Date |
|--------|------|---------|------|
| ogt | 47797d9684 | Merge remote-tracking branch 'gitea/main' into codex/delivery-workbench-release-20260626-ffsync | 2026-06-27 00:00:06 +08:00 |
| ogt | 89b9e67a41 | fix(ops): harden reboot API warmup evidence flow | 2026-06-26 23:59:06 +08:00 |

Checks for 89b9e67a41 (some checks failed):
- Code Review / ai-code-review (push): successful in 13s
- E2E Health Check / e2e-health (push): successful in 31s
- Ansible / Reboot Recovery Contract / validate (push): cancelled
7 changed files with 74 additions and 15 deletions

View File

@@ -46252,3 +46252,35 @@ production browser smoke:
- Wazuh manager registry accepted is still `0`: do not declare Wazuh all-host enrollment recovered.
- Owner response received / accepted is still `0 / 0`: do not treat an "approved to continue" reply, an empty template, UI visibility, route `200`, transport `6`, Dashboard index pattern `3`, or the owner-packet JSON as evidence accepted.
- Runtime action / host write / credential marker write / Wazuh active response / Kali active scan are all still `0 / false`.
## 2026-06-26 — 23:56 AWOOOI API rollout warmup classifier / SOP v1.76
**Time and sources**
- 2026-06-26 23:31-23:56 Asia/Taipei.
- Sources: `scripts/reboot-recovery/post-start-quick-check.sh --no-color`, `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color`, AWOOOI production `/api/v1/health`, read-only 120 K3s / ArgoCD readback, and read-only redacted 112 Wazuh manager / dashboard readback.
**Completed work**
- `post-start-quick-check.sh` adds `AWOOOI_API_ROUTE_OK`: when the delegated cold-start emits a single `BLOCKED AWOOOI API not reachable` at the instant of a K3s/CD rollout, but the wrapper's public route retry has confirmed that `https://awoooi.wooo.work/api/v1/health` returns 2xx, that cold-start blocker is downgraded to a route/API warmup evidence warning; a still-failing public API, any other non-route blocker, or no recovery after retry remains a hard blocker.
- `docs/runbooks/FULL-STACK-COLD-START-SOP.md` is bumped to v1.76; `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` is bumped to v1.16; the workplan Current Verdict is updated to `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`.
- Read-only follow-up check of 120 K3s completed: ArgoCD `awoooi-prod` is `Synced / Healthy`; `awoooi-prod` Pods are all `Running` or `Completed`. The historical `km-vectorize-29689620` failed Job has been superseded by later successful Jobs on 2026-06-23, 2026-06-24, and 2026-06-25 and is not a current service blocker.
- Read-only follow-up check of 112 Wazuh completed: systemd is `running`; manager / indexer / dashboard are `active`; the API root returns `401`; the dashboard unauthenticated check endpoints return `401`; no secret value was read, printed, or stored. Redacted manager registry statistics show local manager `1`, managed agents `5`, active managed `5`, disconnected `0`, never connected `0`. This proves the registry is not completely empty, but it still does not satisfy the SOP expected scope / Dashboard API / owner acceptance, so `WAZUH_MANAGER_REGISTRY_ACCEPTED=0` stays.
**Live verification results**
- 23:34 summary: `POST_START_RC=0`, `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `STOCK_FRESHNESS_STATUS=ok`, `STOCK_LATEST_TRADING_DATE=2026-06-26`, `STOCK_BLOCKERS=none`, `BACKUP_CORE_GREEN=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_HYGIENE_BLOCKED=0`.
- Final 23:56 summary: `POST_START_RC=0`, `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `POST_START_PASS=38`, `POST_START_WARN=3`, `POST_START_BLOCKED=0`, `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_HYGIENE_BLOCKED=0`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `RUNTIME_ACTION_AUTHORIZED=0`, `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export`.
- 110 CPU attribution: the high load comes from active Gitea Actions / StockPlatform Next build / Jest workers and platform services; no orphan Chrome recurrence was observed; nothing was killed or restarted and no CI was cancelled in this round.
**Types of commands run**
- Read-only: cold-start / post-start / readiness summary, public route, 112 Wazuh service / API / dashboard endpoint / registry counts, 110 CPU attribution.
- Writes: repo script / runbook / workplan / LOGBOOK only.
- Not done: no host / Docker / systemd / Nginx / firewall / K8s / DB runtime write operations; no secret plaintext read; no credential marker write; no Wazuh active response / agent re-enroll / restart; no Kali active scan; no CI cancellation.
**Current verdict**
- Hosts, K3s, services, public routes, MOMO, StockPlatform freshness, backup core, 188 host hygiene: `GREEN`.
- Overall recovery declaration: `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`.
- SOP / quick-check / route + AWOOOI API warmup classifier: v1.76.
**Still blocked / must not be declared**
- DR credential escrow evidence is still missing `5`: do not declare `DR_COMPLETE`.
- Wazuh manager registry accepted is still `0`: do not declare Wazuh all-host enrollment recovered; the redacted manager registry statistics are only evidence, and Dashboard API / owner acceptance is still missing.
- Runtime action / host write / credential marker write / Wazuh active response / Kali active scan are all still `0 / false`.

View File

@@ -1,6 +1,6 @@
# AWOOOI Full-Stack Cold-Start and Host Reboot SOP
-> Version: v1.75
+> Version: v1.76
> Last updated: 2026-06-26 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
@@ -10,7 +10,11 @@
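This section is the first judgment anchor after every handover, power-on, shutdown, or reboot. If the date is not today, rerun the live check first, then update this section and `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md`.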
-If you only need a quick post-reboot judgment on whether recovery can be declared, first run the machine-readable summary: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color`. This script calls the one-page overall check, the 188 host hygiene checklist, and the Wazuh no-false-green repo gates, and leaves the delegated logs and a replayable `summary.txt` under `/tmp/awoooi-post-reboot-readiness-*`. From v1.75 on, later steps in the same acceptance round must consume the same `$ARTIFACT_DIR/summary.txt`, for example `scripts/reboot-recovery/post-reboot-declaration-guard.py --summary-file "$ARTIFACT_DIR/summary.txt" --no-color` and `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --summary-file "$ARTIFACT_DIR/summary.txt" --no-color`; do not rerun live probes repeatedly within one evidence chain and then mix conclusions from different points in time. The declaration guard turns the summary into allowed / forbidden declarations so that service green is not misreported as DR complete, 188 host hygiene, Wazuh registry recovered, or runtime authorized. If the summary shows `SERVICE_GREEN=1` while `NEXT_REQUIRED_GATES` is still non-empty, the dispatch checklist turns the unfinished blockers into an owner / evidence / forbidden-action checklist; when a machine-readable intake is needed, run `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --dispatch-file <dispatch.txt> --output /tmp/awoooi-post-reboot-owner-packets.json` to produce the `awoooi_post_reboot_next_gate_owner_packets_v1` JSON, and immediately run `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json`. Dispatch / packet / guard all pin `DISPATCH_AUTHORIZED=0`, `REQUEST_SENT_COUNT=0`, `OWNER_RESPONSE_ACCEPTED=0`, `HOST_WRITE_AUTHORIZED=0`, `SECRET_VALUE_COLLECTION_ALLOWED=0`, and `RUNTIME_GATE=0`; if the guard does not pass, do not send an owner request, do not write an escrow marker, do not enter a maintenance window, and do not declare DR / Wazuh registry complete. From v1.74 on, any owner response JSON must also pass `scripts/reboot-recovery/post-reboot-owner-response-preflight.py --no-color --owner-packet-file <owner-packets.json> --response-file <file>`; an empty template, a placeholder, a secret payload, a runtime action request, a credential marker write, a Wazuh active response / re-enroll / restart, a Kali active scan, or missing Dashboard API / manager registry evidence must all fail closed; passing preflight only means the response may proceed to independent reviewer acceptance and is not runtime authorization. When manual expansion is needed, run `scripts/reboot-recovery/post-start-quick-check.sh --no-color` and use `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` as the fallback. The long SOP keeps the full background, exception handling, and Plan B; the short wrapper / checklist handles the fixed judgment within T+10 minutes each time.
+If you only need a quick post-reboot judgment on whether recovery can be declared, first run the machine-readable summary: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color`. This script calls the one-page overall check, the 188 host hygiene checklist, and the Wazuh no-false-green repo gates, and leaves the delegated logs and a replayable `summary.txt` under `/tmp/awoooi-post-reboot-readiness-*`. From v1.75 on, later steps in the same acceptance round must consume the same `$ARTIFACT_DIR/summary.txt`, for example `scripts/reboot-recovery/post-reboot-declaration-guard.py --summary-file "$ARTIFACT_DIR/summary.txt" --no-color` and `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --summary-file "$ARTIFACT_DIR/summary.txt" --no-color`; do not rerun live probes repeatedly within one evidence chain and then mix conclusions from different points in time. From v1.76 on, if the delegated cold-start emits a single `BLOCKED AWOOOI API not reachable` at the instant of a K3s rollout / CD replacement, but the wrapper's own public `https://awoooi.wooo.work/api/v1/health` route retry has returned 2xx, that blocker is listed only as a route/API warmup evidence warning; if the public API still fails, another non-route blocker exists, or the retry does not recover, the result stays hard blocked. The declaration guard turns the summary into allowed / forbidden declarations so that service green is not misreported as DR complete, 188 host hygiene, Wazuh registry recovered, or runtime authorized. If the summary shows `SERVICE_GREEN=1` while `NEXT_REQUIRED_GATES` is still non-empty, the dispatch checklist turns the unfinished blockers into an owner / evidence / forbidden-action checklist; when a machine-readable intake is needed, run `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --dispatch-file <dispatch.txt> --output /tmp/awoooi-post-reboot-owner-packets.json` to produce the `awoooi_post_reboot_next_gate_owner_packets_v1` JSON, and immediately run `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json`. Dispatch / packet / guard all pin `DISPATCH_AUTHORIZED=0`, `REQUEST_SENT_COUNT=0`, `OWNER_RESPONSE_ACCEPTED=0`, `HOST_WRITE_AUTHORIZED=0`, `SECRET_VALUE_COLLECTION_ALLOWED=0`, and `RUNTIME_GATE=0`; if the guard does not pass, do not send an owner request, do not write an escrow marker, do not enter a maintenance window, and do not declare DR / Wazuh registry complete. From v1.74 on, any owner response JSON must also pass `scripts/reboot-recovery/post-reboot-owner-response-preflight.py --no-color --owner-packet-file <owner-packets.json> --response-file <file>`; an empty template, a placeholder, a secret payload, a runtime action request, a credential marker write, a Wazuh active response / re-enroll / restart, a Kali active scan, or missing Dashboard API / manager registry evidence must all fail closed; passing preflight only means the response may proceed to independent reviewer acceptance and is not runtime authorization. When manual expansion is needed, run `scripts/reboot-recovery/post-start-quick-check.sh --no-color` and use `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` as the fallback. The long SOP keeps the full background, exception handling, and Plan B; the short wrapper / checklist handles the fixed judgment within T+10 minutes each time.
+v1.76 owner gate replay rule: once a round's summary has been produced, the owner packet and owner response preflight steps must prefer `--summary-file "$ARTIFACT_DIR/summary.txt"`, for example `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color --summary-file "$ARTIFACT_DIR/summary.txt" --output /tmp/awoooi-post-reboot-owner-packets.json` and `scripts/reboot-recovery/post-reboot-owner-response-preflight.py --no-color --summary-file "$ARTIFACT_DIR/summary.txt" --response-file <file>`. Omitting `--summary-file` is allowed only when fresh live evidence is deliberately wanted; otherwise preflight must not rerun the summary itself and cause state drift within the same round.
+2026-06-26 23:56 latest live summary: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` returned `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `POST_START_PASS=38`, `POST_START_WARN=3`, `POST_START_BLOCKED=0`, `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `STOCK_FRESHNESS_STATUS=ok`, `STOCK_LATEST_TRADING_DATE=2026-06-26`, `STOCK_BLOCKERS=none`, `BACKUP_CORE_GREEN=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_HYGIENE_BLOCKED=0`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `RUNTIME_ACTION_AUTHORIZED=0`. Read-only follow-up check of 120 in the same window: ArgoCD `awoooi-prod` is `Synced / Healthy`; `awoooi-prod` Pods are all `Running` or `Completed`; the historical `km-vectorize-29689620` failed Job has been superseded by later successful Jobs on 2026-06-23, 2026-06-24, and 2026-06-25 and is not a current service blocker. Read-only follow-up check of 112 in the same window: systemd is `running`; Wazuh manager / indexer / dashboard are `active`; the manager API root returns `401`; the Dashboard unauthenticated check endpoints return `401`; the redacted manager registry readback shows local manager `1`, managed agents `5`, active managed `5`, disconnected `0`, never connected `0`. This evidence shows the registry is no longer "completely empty", but Wazuh all-host enrollment still cannot be declared recovered, because the SOP expected scope is still 6, the Dashboard API connection / version has not been accepted through login or owner evidence, and owner response accepted is still `0`.
The 2026-06-26 18:46 live recovery truth supersedes the 12:13 reading of today's product data: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` returned `POST_START_RESULT=PRODUCT_DATA_PENDING_EOD_WINDOW`, `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=0`, `STOCK_LATEST_TRADING_DATE=2026-06-26`, `STOCK_BLOCKERS=core_margin_short_daily_missing,ai_recommendations_stale`, `BACKUP_CORE_GREEN=1`, `ESCROW_MISSING_COUNT=5`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`. The long live cold-start check in the same round returned `PASS=87 WARN=0 BLOCKED=0` and `Result: GREEN`, meaning the service layer of hosts 110 / 120 / 121 / 188, K3s, public routes, AWOOOI API, MOMO, backup core, exporters, cron, and Alertmanager has recovered; however, StockPlatform's official margin-short data for today has not been published and AI recommendations still depend on that data, so do not declare all product data current. At 18:43, an authorized `SIGTERM` cleared two `stockplatform-review-bulk-ux` orphan Chrome process groups on 110 that had been running for more than 6 hours, leaving `REMAINING=0`; from 18:44 to 18:46, the local AWOOOI `next build` on the 168 Mac Mini was stopped and temp/build/cache plus Antigravity backup browser recordings were cleaned up, bringing `/System/Volumes/Data` from about `1.0Gi / 100%` back to about `8.7Gi / 96%`. The failed `networking.service` on 112 Kali was traced to the malformed shebang `#\!/bin/bash` in `/etc/network/if-up.d/wg-nat`, which causes `Exec format error`; Wazuh manager / indexer / dashboard remain active; fixing that hook requires sudo elevation on 112, and no password was used or stored.
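The single-summary replay rule can be illustrated with a minimal Python sketch. It assumes `summary.txt` holds one `KEY=VALUE` pair per line (the runbook only says the summary is key/value), and `parse_summary` and `forbidden_declarations` are hypothetical helpers written for illustration, not the logic of `post-reboot-declaration-guard.py`. The three rules are a simplified reading of the text above and fail closed when a key is missing.

```python
def parse_summary(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines; blank lines and lines without '=' are skipped."""
    pairs: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key:
            pairs[key] = value
    return pairs


def forbidden_declarations(summary: dict[str, str]) -> list[str]:
    """Return declarations that must not be made (illustrative rules only)."""
    forbidden = []
    # A missing key counts as "not proven", so the declaration stays forbidden.
    if summary.get("ESCROW_MISSING_COUNT", "1") != "0":
        forbidden.append("DR_COMPLETE")
    if summary.get("WAZUH_MANAGER_REGISTRY_ACCEPTED", "0") != "1":
        forbidden.append("WAZUH_REGISTRY_RECOVERED")
    if summary.get("RUNTIME_ACTION_AUTHORIZED", "0") != "1":
        forbidden.append("RUNTIME_ACTION_AUTHORIZED")
    return forbidden


# Sample values taken from the 23:56 summary quoted above.
sample = """\
SERVICE_GREEN=1
ESCROW_MISSING_COUNT=5
WAZUH_MANAGER_REGISTRY_ACCEPTED=0
RUNTIME_ACTION_AUTHORIZED=0
NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export
"""
summary = parse_summary(sample)
print(forbidden_declarations(summary))
```

Parsing the file once and handing the resulting mapping (or the same file path) to every later step is what keeps all conclusions in a round tied to one point in time.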

View File

@@ -1,6 +1,6 @@
# One-Page Post-Reboot Overall Check
-> Version: v1.15
+> Version: v1.16
> Last updated: 2026-06-26 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 post-reboot service recovery. 112 Kali / Wazuh / active scan are not part of this procedure.
@@ -10,7 +10,7 @@
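After any of hosts 110 / 120 / 121 / 188 is powered on, shut down, rebooted, restored after a power loss, put through a VMware console fsck, or subjected to a large Docker / K3s reschedule, run this page first and only then decide whether to declare recovery.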
-Latest baseline: 2026-06-26 17:45 single-summary replay / route warmup classifier. `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` returned `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_HYGIENE_BLOCKED=0`, `HOST_188_RESULT=HOST_188_HYGIENE_GREEN.`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_COVERAGE_SCOPE=6`, `WAZUH_DIRECT_ACTIVE=2`, `WAZUH_NO_TRANSPORT=1`, `WAZUH_SSH_BLOCKED=3`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`, `WAZUH_DASHBOARD_INDEX_OK=3`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, and automatically wrote the same key/value set to `$ARTIFACT_DIR/summary.txt`. Later steps in the same round (`post-reboot-declaration-guard.py`, `post-reboot-next-gate-dispatch.sh`, `post-reboot-next-gate-owner-packets.py`, `post-reboot-owner-packet-contract-guard.py`, `post-reboot-owner-response-preflight.py`) must use this `summary.txt` or the dispatch / packet derived from it; do not mix results from multiple live probes taken at different times. `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export` remain the only current next gates: DR still cannot be declared complete because of `escrow_missing=5`; Wazuh manager registry accepted is still `0`, so route `200`, transport `6`, and Dashboard index pattern `3` must not be treated as registry recovered. v1.15 also adds a route warmup classifier: if the delegated cold-start is temporarily blocked only by a single public-route 502 / TLS readback, but the wrapper route retry has confirmed full recovery, that blocker is downgraded to an evidence warning; a route blocker or a retry that still fails remains hard blocked.
+Latest baseline: 2026-06-26 23:56 single-summary replay / route + AWOOOI API warmup classifier. `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` returned `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `POST_START_PASS=38`, `POST_START_WARN=3`, `POST_START_BLOCKED=0`, `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `STOCK_FRESHNESS_STATUS=ok`, `STOCK_LATEST_TRADING_DATE=2026-06-26`, `STOCK_BLOCKERS=none`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_HYGIENE_BLOCKED=0`, `HOST_188_RESULT=HOST_188_HYGIENE_GREEN.`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_COVERAGE_SCOPE=6`, `WAZUH_DIRECT_ACTIVE=2`, `WAZUH_NO_TRANSPORT=1`, `WAZUH_SSH_BLOCKED=3`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`, `WAZUH_DASHBOARD_INDEX_OK=3`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, and automatically wrote the same key/value set to `$ARTIFACT_DIR/summary.txt`. Later steps in the same round (`post-reboot-declaration-guard.py`, `post-reboot-next-gate-dispatch.sh`, `post-reboot-next-gate-owner-packets.py`, `post-reboot-owner-packet-contract-guard.py`, `post-reboot-owner-response-preflight.py`) must use this `summary.txt` or the dispatch / packet derived from it; do not mix results from multiple live probes taken at different times. `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export` remain the only current next gates: DR still cannot be declared complete because of `escrow_missing=5`; Wazuh manager registry accepted is still `0`, so route `200`, transport `6`, Dashboard index pattern `3`, or the redacted registry counts must not be treated as all-host enrollment complete. v1.16 also adds a route/API warmup classifier: if the delegated cold-start is blocked only by a single public-route 502 / TLS readback, or by a single `BLOCKED AWOOOI API not reachable` at the instant of a K3s rollout, but the wrapper route retry has confirmed that the public AWOOOI API health is 2xx, that blocker is downgraded to an evidence warning; a still-failing public API, any other non-route blocker, or no recovery after retry remains hard blocked.
This page answers only four things:
@@ -99,7 +99,7 @@ scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color --summary-f
To hand off to AI / a ticket / owner review, generate the machine-readable owner packet:
```bash
-scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color
+scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color --summary-file "$ARTIFACT_DIR/summary.txt"
```
The output JSON may serve only as an intake / review packet; it is not a request sent. It must show `request_sent_count=0`, `owner_response_accepted_count=0`, and `runtime_action_authorized_count=0`; otherwise treat it as non-compliant.
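A reviewer could check those three counters mechanically before trusting a packet. The following is a minimal Python sketch, assuming the counters are top-level integer fields of the packet JSON; the real `awoooi_post_reboot_next_gate_owner_packets_v1` layout is not shown in this diff, so the field placement and the `packet_counter_violations` helper are hypothetical.

```python
import json

REQUIRED_ZERO_COUNTERS = (
    "request_sent_count",
    "owner_response_accepted_count",
    "runtime_action_authorized_count",
)


def packet_counter_violations(packet: dict) -> list[str]:
    """List counters that are missing or not exactly the integer 0 (fail closed)."""
    violations = []
    for name in REQUIRED_ZERO_COUNTERS:
        value = packet.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value != 0:
            violations.append(name)
    return violations


# Hypothetical packets: one compliant, one with a sent request and a missing counter.
good = json.loads(
    '{"request_sent_count": 0, "owner_response_accepted_count": 0,'
    ' "runtime_action_authorized_count": 0}'
)
bad = json.loads('{"request_sent_count": 1, "owner_response_accepted_count": 0}')

print(packet_counter_violations(good))
print(packet_counter_violations(bad))
```

Treating a missing counter the same as a non-zero one matches the "otherwise treat it as non-compliant" wording: absence of proof is not read as a pass.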
@@ -107,7 +107,7 @@ scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color
Before sending it to any owner review queue, first save the JSON as an artifact and run the contract guard:
```bash
-scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color --dispatch-file /tmp/awoooi-post-reboot-dispatch.txt --output /tmp/awoooi-post-reboot-owner-packets.json
+scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color --summary-file "$ARTIFACT_DIR/summary.txt" --output /tmp/awoooi-post-reboot-owner-packets.json
scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json
```
@@ -116,7 +116,7 @@ guard must output `POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=<live_next_
Before taking in an owner response file, or any JSON that claims evidence has been supplied, run the owner response preflight:
```bash
-scripts/reboot-recovery/post-reboot-owner-response-preflight.py --no-color
+scripts/reboot-recovery/post-reboot-owner-response-preflight.py --no-color --summary-file "$ARTIFACT_DIR/summary.txt"
scripts/reboot-recovery/post-reboot-owner-response-preflight.py --no-color --owner-packet-file /tmp/awoooi-post-reboot-owner-packets.json
scripts/reboot-recovery/post-reboot-owner-response-preflight.py --no-color --owner-packet-file /tmp/awoooi-post-reboot-owner-packets.json --response-file docs/templates/post-reboot-next-gate-owner-response.json
```

View File

@@ -11,11 +11,11 @@
| Area | Status | Completion | Evidence |
|------|--------|------------|----------|
-| Overall recovery readiness | SERVICE_GREEN_PRODUCT_DATA_PENDING_EOD_WINDOW | 98% | The 2026-06-26 18:45 live summary supersedes the 12:13 reading of today's product data. `post-reboot-readiness-summary.sh --no-color` returned `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=0`, `POST_START_RESULT=PRODUCT_DATA_PENDING_EOD_WINDOW`, `STOCK_LATEST_TRADING_DATE=2026-06-26`, `STOCK_BLOCKERS=core_margin_short_daily_missing,ai_recommendations_stale`, `BACKUP_CORE_GREEN=1`, `ESCROW_MISSING_COUNT=5`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`; the long live cold-start script still returned `PASS=87 WARN=0 BLOCKED=0` / `GREEN`, so hosts / services / routes / K3s / backup core have recovered; StockPlatform's official margin-trading and short-selling data for today has not been published and must wait for the existing EOD retry, so do not declare all product data current. |
+| Overall recovery readiness | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 99% | The 2026-06-26 23:56 live summary supersedes the 18:45 EOD-pending reading. `post-reboot-readiness-summary.sh --no-color` returned `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `POST_START_WARN=3`, `STOCK_FRESHNESS_STATUS=ok`, `STOCK_LATEST_TRADING_DATE=2026-06-26`, `STOCK_BLOCKERS=none`, `BACKUP_CORE_GREEN=1`, `ESCROW_MISSING_COUNT=5`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`; hosts / K3s / public routes / AWOOOI / MOMO / Stock / backup core / 188 hygiene have recovered; 120 ArgoCD is `Synced / Healthy`; `awoooi-prod` Pods are all `Running` or `Completed`; the historical `km-vectorize-29689620` failed Job has been superseded by later successful Jobs. DR still cannot be declared complete because credential escrow is missing 5; the Wazuh registry now has a redacted manager readback but has not yet passed Dashboard API / owner acceptance. |
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-26 07:19 readback shows 120 and 121 reachable, K3s active, `mon` and `mon1` both `Ready control-plane`, AWOOOI API/Web replicas split across both nodes, ArgoCD `awoooi-prod Synced / Healthy` at revision `1fd5e2a8b0f18d24eed16aa2a44286bcbf230603`, and `km-vectorize` official 03:00 Taipei-time run succeeded with `lastSuccess=2026-06-25T19:00:14Z`. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-26 06:58 backup readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`, last aggregate `2026-06-26 02:31:02`. DR remains blocked on real non-secret credential escrow evidence IDs; do not write placeholder markers or paste secret values. |
-| P2 service / data truth | BLOCKED_STOCK_EOD_WINDOW | 90% | Public routes and service health are green; MOMO `V10.716` is healthy; current-month parity is `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`. The StockPlatform site returns `200`, and 2026-06-26 price / chips / market index data already exist, but `/api/v1/system/freshness` is `blocked` because `core.margin_short_daily` for 2026-06-26 is still waiting on the official source / row count `0`, and `ai.recommendations` stays at 2026-06-25 by design. This is a product-data freshness blocker, not a reboot / Nginx / Docker blocker. |
-| P3 docs / automation contracts | DONE_WITH_SINGLE_SUMMARY_REPLAY_V175 | 100% | Workplan, SOP v1.75, post-reboot declaration guard, machine-readable post-reboot readiness summary with Wazuh registry detail fields and auto-persisted `summary.txt`, post-reboot next-gate dispatch checklist, owner-packet JSON generator, dynamic owner-packet contract guard, post-reboot owner response preflight, owner response placeholder template, one-page post-start quick check v1.15, route retry gate, delegated cold-start public-route warmup classifier, deploy warmup classification, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, StockPlatform natural schedule green evidence, 110 orphan Chrome recurrence cleanup evidence, 188 fail-closed startup data recovery gate, 188 host hygiene read-only checklist, 188 PostgreSQL runtime-ready source-of-truth, 188 ACME route/timer hygiene, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and stricter product-data / route retry gates are updated. Declaration guard now machine-checks allowed / forbidden recovery statements from the same `summary.txt`: service/data/backup/188 host hygiene green may be declared when live summary says so, while `DR_COMPLETE`, `WAZUH_REGISTRY_RECOVERED` and `RUNTIME_ACTION_AUTHORIZED` remain forbidden until evidence gates close. Owner response preflight blocks missing files, placeholder templates, secret payloads, credential marker writes, Wazuh active response / re-enroll / restart, host write, and Kali active scan before any evidence can be counted as received or accepted. Live 110 script sync remains a separate approved live-write gate; do not claim it here. |
+| P2 service / data truth | DONE | 100% | Public routes and service health are green; MOMO health is `V10.719`; current-month parity is `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`. StockPlatform `/api/v1/system/freshness` is `ok`; latest trading date `2026-06-26`; blockers `none`; the earlier Stock EOD blocker has converged naturally through the official source and the production cron. |
+| P3 docs / automation contracts | DONE_WITH_API_WARMUP_CLASSIFIER_V176 | 100% | Workplan, SOP v1.76, post-reboot declaration guard, machine-readable post-reboot readiness summary with Wazuh registry detail fields and auto-persisted `summary.txt`, post-reboot next-gate dispatch checklist, owner-packet JSON generator, dynamic owner-packet contract guard, post-reboot owner response preflight, owner response placeholder template, one-page post-start quick check v1.16, route retry gate, delegated cold-start public-route / AWOOOI API warmup classifier, deploy warmup classification, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, StockPlatform natural schedule green evidence, 110 orphan Chrome recurrence cleanup evidence, 188 fail-closed startup data recovery gate, 188 host hygiene read-only checklist, 188 PostgreSQL runtime-ready source-of-truth, 188 ACME route/timer hygiene, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and stricter product-data / route retry gates are updated. Declaration guard now machine-checks allowed / forbidden recovery statements from the same `summary.txt`: service/data/backup/188 host hygiene green may be declared when live summary says so, while `DR_COMPLETE`, `WAZUH_REGISTRY_RECOVERED` and `RUNTIME_ACTION_AUTHORIZED` remain forbidden until evidence gates close. |
2026-06-26 12:13 machine-readable summary baseline supersedes the 07:47 / 08:59 gate set: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` stores delegated logs under `/tmp/awoooi-post-reboot-readiness-20260626-121303` and returns `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_SERVICE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=0`, `HOST_188_CHECK_RC=0`, `HOST_188_RESULT=HOST_188_HYGIENE_GREEN.`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_COVERAGE_SCOPE=6`, `WAZUH_DIRECT_ACTIVE=2`, `WAZUH_NO_TRANSPORT=1`, `WAZUH_SSH_BLOCKED=3`, `WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`, `WAZUH_DASHBOARD_INDEX_OK=3`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_RUNTIME_GATE=0`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, and `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export`. This is now the preferred first operator/AI-agent entrypoint after reboot because it separates service health from DR and security registry evidence; 188 host hygiene is no longer a next gate unless the live checklist regresses.

View File

@@ -30,6 +30,11 @@ def parse_args() -> argparse.Namespace:
         type=Path,
         help="Use an existing post-reboot-next-gate-dispatch output file.",
     )
+    parser.add_argument(
+        "--summary-file",
+        type=Path,
+        help="Pass an existing readiness summary file to the delegated dispatch checklist.",
+    )
     parser.add_argument(
         "--output",
         type=Path,
@@ -43,8 +48,10 @@ def parse_args() -> argparse.Namespace:
     return parser.parse_args()


-def run_dispatch(no_color: bool) -> str:
+def run_dispatch(no_color: bool, summary_file: Path | None) -> str:
     cmd = [str(DISPATCH_SCRIPT)]
+    if summary_file:
+        cmd.extend(["--summary-file", str(summary_file)])
     if no_color:
         cmd.append("--no-color")
     completed = subprocess.run(
@@ -65,7 +72,7 @@ def run_dispatch(no_color: bool) -> str:
 def load_dispatch(args: argparse.Namespace) -> str:
     if args.dispatch_file:
         return args.dispatch_file.read_text(encoding="utf-8")
-    return run_dispatch(no_color=args.no_color)
+    return run_dispatch(no_color=args.no_color, summary_file=args.summary_file)


 def split_csv(value: str) -> list[str]:

View File

@@ -97,6 +97,11 @@ def parse_args() -> argparse.Namespace:
         type=Path,
         help="Use an existing owner packet JSON instead of generating one.",
     )
+    parser.add_argument(
+        "--summary-file",
+        type=Path,
+        help="Generate owner packets from an existing readiness summary file.",
+    )
     parser.add_argument("--json", action="store_true", help="Print machine-readable JSON.")
     parser.add_argument(
         "--no-color",
@@ -118,8 +123,10 @@ def load_json(path: Path, label: str = "response_file") -> dict[str, Any]:
     return payload


-def generate_owner_packet(no_color: bool) -> dict[str, Any]:
+def generate_owner_packet(no_color: bool, summary_file: Path | None) -> dict[str, Any]:
     cmd = [str(OWNER_PACKET_GENERATOR)]
+    if summary_file:
+        cmd.extend(["--summary-file", str(summary_file)])
     if no_color:
         cmd.append("--no-color")
     completed = subprocess.run(
@@ -147,7 +154,7 @@ def generate_owner_packet(no_color: bool) -> dict[str, Any]:
 def load_owner_packet(args: argparse.Namespace) -> dict[str, Any]:
     if args.owner_packet_file:
         return load_json(args.owner_packet_file, label="owner_packet_file")
-    return generate_owner_packet(no_color=args.no_color)
+    return generate_owner_packet(no_color=args.no_color, summary_file=args.summary_file)


 def as_list(value: Any) -> list[Any]:

View File

@@ -22,6 +22,7 @@ COLD_START_PENDING_BLOCKERS=0
 COLD_START_BLOCKED_SUMMARY=""
 COLD_START_BLOCKED_LINES=""
 ROUTE_SMOKE_BLOCKED=0
+AWOOOI_API_ROUTE_OK=0
 STOCK_EOD_WINDOW_PENDING=0
 STOCK_EOD_CLASSIFICATION="not_evaluated"
 STOCK_EOD_NEXT_ACTION="not_evaluated"
@@ -469,6 +470,9 @@ if [[ "$RUN_ROUTES" -eq 1 ]]; then
     done
     case "$code" in
       2*|3*)
+        if [[ "$url" == "https://awoooi.wooo.work/api/v1/health" && "$code" == 2* ]]; then
+          AWOOOI_API_ROUTE_OK=1
+        fi
         if [[ "$attempt" -gt 1 ]]; then
           evidence_warn "$code $url recovered_after_attempt=$attempt"
         else
@@ -485,8 +489,13 @@ fi
 if [[ "$COLD_START_PENDING_BLOCKERS" -gt 0 ]]; then
   non_route_cold_blockers="$(printf '%s\n' "$COLD_START_BLOCKED_LINES" | grep -Ev '^BLOCKED public route ' || true)"
+  if [[ "$RUN_ROUTES" -eq 1 && "$ROUTE_SMOKE_BLOCKED" -eq 0 && "$AWOOOI_API_ROUTE_OK" -eq 1 ]]; then
+    non_route_cold_blockers="$(
+      printf '%s\n' "$non_route_cold_blockers" | grep -Ev '^BLOCKED AWOOOI API not reachable$|^BLOCKED AWOOI API not reachable$' || true
+    )"
+  fi
   if [[ "$RUN_ROUTES" -eq 1 && "$ROUTE_SMOKE_BLOCKED" -eq 0 && -z "$non_route_cold_blockers" ]]; then
-    evidence_warn "cold-start public-route blockers recovered under wrapper route retry: $COLD_START_BLOCKED_SUMMARY"
+    evidence_warn "cold-start route/API warmup blockers recovered under wrapper route retry: $COLD_START_BLOCKED_SUMMARY"
     printf '%s\n' "$COLD_START_BLOCKED_LINES"
   else
     blocked "cold-start has blockers: $COLD_START_BLOCKED_SUMMARY"
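The downgrade decision in this hunk can be restated as a small pure function. This is a Python sketch of the same filtering, not the script itself: `classify_cold_start` and its parameter names are hypothetical, the two regular expressions are copied from the `grep -Ev` patterns in the hunk, and empty lines are simply ignored here. It models only the branch that runs when `COLD_START_PENDING_BLOCKERS` is greater than 0.

```python
import re

ROUTE_BLOCKER = re.compile(r"^BLOCKED public route ")
API_BLOCKER = re.compile(
    r"^BLOCKED AWOOOI API not reachable$|^BLOCKED AWOOI API not reachable$"
)


def classify_cold_start(blocked_lines, run_routes, route_smoke_blocked, api_route_ok):
    """Return 'warn' if every blocker is explained by route/API warmup, else 'blocked'."""
    routes_recovered = run_routes and not route_smoke_blocked
    # Public-route blockers are always removed from the non-route list.
    remaining = [
        line for line in blocked_lines if line and not ROUTE_BLOCKER.search(line)
    ]
    # API-unreachable blockers are removed only when the wrapper saw 2xx on the health route.
    if routes_recovered and api_route_ok:
        remaining = [line for line in remaining if not API_BLOCKER.search(line)]
    if routes_recovered and not remaining:
        return "warn"
    return "blocked"


api_line = "BLOCKED AWOOOI API not reachable"
print(classify_cold_start([api_line], True, False, True))
print(classify_cold_start([api_line], True, False, False))
print(classify_cold_start([api_line, "BLOCKED some other check"], True, False, True))
```

Under these assumptions the first call yields a warning, while the second (health route never confirmed 2xx) and the third (an unrelated blocker is present) stay blocked, which mirrors the "hard blocker" cases listed in the runbook text.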