ops(reboot): persist summary evidence and classify warmup routes
Some checks failed
Code Review / ai-code-review (push) Successful in 14s
Ansible / Reboot Recovery Contract / validate (push) Has been cancelled

ogt
2026-06-26 17:55:45 +08:00
parent 1d4b3df5e4
commit 35dba35253
7 changed files with 94 additions and 15 deletions


@@ -45664,3 +45664,41 @@ production browser smoke:
- Wazuh manager registry accepted is still `0`: do not claim that Wazuh management of all hosts has been restored.
- Owner response received / accepted is still `0 / 0`: do not treat "approval to continue", an empty template, UI visibility, route `200`, transport `6`, Dashboard index pattern `3`, or owner-packet JSON as evidence accepted.
- Runtime action / host write / credential marker write / Wazuh active response / Kali active scan all remain `0 / false`.
## 2026-06-26 — 17:45 single-summary replay / route warmup classifier / SOP v1.75
**Time and source**
- 2026-06-26 17:39-17:45 Asia/Taipei.
- Sources: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color`, `scripts/reboot-recovery/post-start-quick-check.sh --no-color`, `scripts/reboot-recovery/post-reboot-declaration-guard.py --summary-file ...`, `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --summary-file ...`, and owner packet / contract / owner response preflight.
**Completed**
- `post-reboot-readiness-summary.sh` now also writes the key/value summary it prints to stdout into `$ARTIFACT_DIR/summary.txt`, so that the declaration guard, next-gate dispatch, owner packet, contract guard, and owner response preflight of the same round all consume the same evidence.
- `post-start-quick-check.sh` adds a delegated cold-start blocker classification: if cold-start is blocked only temporarily by a single 502 / TLS readback failure on a public route, and the wrapper's own route retry has fully recovered, that blocker is downgraded to an evidence warning; non-route blockers, or routes that still fail after retry, remain hard blockers.
- The JSON loader error messages in `post-reboot-owner-response-preflight.py` now distinguish `owner_packet_file_*` from `response_file_*`, so that a race or a missing file no longer misleads the operator.
- `docs/runbooks/FULL-STACK-COLD-START-SOP.md` is bumped to v1.75, `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` is bumped to v1.15, and the workplan status is updated to `DONE_WITH_SINGLE_SUMMARY_REPLAY_V175`.
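The downgrade rule described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in `post-start-quick-check.sh`: the function name `classify_cold_start_blockers`, the `<kind> <target>` input layout, and the three result labels are all assumptions made for this example.

```bash
# Hypothetical sketch, not the logic of post-start-quick-check.sh.
# Assumed inputs: $1 = blockers, one per line as "<kind> <target>" (kind "route" for public routes);
#                 $2 = URLs that recovered under the wrapper retry, one per line.
classify_cold_start_blockers() {
  blockers=$1
  recovered=$2
  if [ -z "$blockers" ]; then
    echo "pass"
    return 0
  fi
  while read -r kind target; do
    [ -z "$kind" ] && continue
    # Non-route blockers, or route entries without a target, stay hard blockers.
    if [ "$kind" != "route" ] || [ -z "$target" ]; then
      echo "hard_blocker"
      return 0
    fi
    # A route that is not in the recovered list stays a hard blocker.
    if ! printf '%s\n' "$recovered" | grep -qxF -- "$target"; then
      echo "hard_blocker"
      return 0
    fi
  done <<EOF
$blockers
EOF
  # Every blocker was a route and every route recovered under retry.
  echo "evidence_warning"
}
```

The sketch fails closed: a single non-route blocker or a single unrecovered route keeps the whole result at `hard_blocker`, which mirrors the rule that only route-only, fully recovered blockers may be downgraded.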
**Live / replay evidence**
- 17:42: after the first round of fixes, the summary confirmed that `summary.txt` had been written to `/tmp/awoooi-post-reboot-readiness-20260626-174129/summary.txt`, but delegated cold-start produced `POST_START_BLOCKED=1` because of a single 502 / TLS check failure on `stock.wooo.work`; in the same round the wrapper route retry already showed `WARN 200 https://stock.wooo.work/ recovered_after_attempt=2`, and StockPlatform freshness was `status=ok`.
- After the classification fix, `scripts/reboot-recovery/post-start-quick-check.sh --no-color` returned to `POST_START_QUICK_CHECK PASS=38 WARN=4 BLOCKED=0`, `POST_START_QUICK_CHECK_WARNINGS SERVICE=0 BOUNDARY=1 EVIDENCE=3`, `RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`.
- Final summary artifact: `/tmp/awoooi-post-reboot-readiness-20260626-174451/summary.txt`, which returns `POST_START_RC=0`, `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `POST_START_PASS=38`, `POST_START_WARN=4`, `POST_START_BLOCKED=0`, `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_HYGIENE_BLOCKED=0`, `HOST_188_RESULT=HOST_188_HYGIENE_GREEN.`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export`.
- Fixed-summary replay: `post-reboot-next-gate-dispatch.sh --summary-file /tmp/awoooi-post-reboot-readiness-20260626-174451/summary.txt` emits only two P0 gates, `credential_escrow_evidence` and `wazuh_manager_registry_export`, with `DISPATCH_AUTHORIZED=0`, `REQUEST_SENT_COUNT=0`, `HOST_WRITE_AUTHORIZED=0`, `SECRET_VALUE_COLLECTION_ALLOWED=0`.
- The declaration guard, using the same summary, correctly rejects the proposed `DR_COMPLETE`, `WAZUH_REGISTRY_RECOVERED`, and `RUNTIME_ACTION_AUTHORIZED`: `POST_REBOOT_DECLARATION_GUARD_REJECTED status=blocked_false_green_proposal allowed=5 forbidden=3 next_gates=2 rejected_proposed=3`.
- Owner packet contract guard: `POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=2 request_sent=0 accepted=0 runtime_gate=0`.
- Owner response preflight: with no response file it stays `blocked_waiting_owner_response_file`; with the placeholder template it stays `blocked_waiting_owner_response_content`, `received=0`, `accepted=0`, `runtime_gate=0`.
**Command types run**
- Read-only: live summary, quick-check, declaration guard, dispatch replay, owner packet / contract / preflight verification.
- Writes: repo script / runbook / workplan / LOGBOOK only.
- Not done: no write operations on host / Docker / systemd / Nginx / firewall / K8s / DB / Wazuh runtime; no reading of plaintext secrets; no credential marker writes; no owner request sent; no Wazuh active response / agent re-enroll / restart; no Kali active scan.
**Current verdict**
- Reboot service / product data / backup / 188 host hygiene: `GREEN`.
- Overall recovery declaration: `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`.
- SOP / quick-check / single-summary evidence chain: v1.75.
- Route warmup no-false-blocker classifier: `100%`.
**Still blocked / must not be claimed**
- DR credential escrow evidence is still missing `5` items: do not claim `DR_COMPLETE` or credential escrow complete.
- Wazuh manager registry accepted is still `0`: do not claim that Wazuh management of all hosts has been restored.
- Owner response received / accepted is still `0 / 0`: do not treat "approval to continue", an empty template, UI visibility, route `200`, transport `6`, Dashboard index pattern `3`, or owner-packet JSON as evidence accepted.
- Runtime action / host write / credential marker write / Wazuh active response / Kali active scan all remain `0 / false`.
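For readers who want to inspect a persisted `summary.txt` by hand, the following is a small sketch of how its fields could be read. It assumes the file holds one `KEY=VALUE` pair per line, which the log entry implies but does not state; the helper names `summary_get` and `summary_gate_count` are hypothetical.

```bash
# Hypothetical helpers; assumes summary.txt holds one KEY=VALUE pair per line.
summary_get() {
  # $1 = summary file, $2 = key. Prints the value of the last matching line.
  sed -n "s/^$2=//p" "$1" | tail -n 1
}

summary_gate_count() {
  # $1 = summary file. Counts the comma-separated entries of NEXT_REQUIRED_GATES.
  gates=$(summary_get "$1" NEXT_REQUIRED_GATES)
  if [ -z "$gates" ]; then
    echo 0
    return 0
  fi
  printf '%s\n' "$gates" | tr ',' '\n' | grep -c .
}
```

With the 17:45 artifact quoted above, `summary_gate_count` would be expected to report two gates, matching `next_gates=2` in the declaration guard output.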


@@ -1,6 +1,6 @@
# AWOOOI Full-Stack Cold-Start and Host Reboot SOP
> Version: v1.74
> Version: v1.75
> Last updated: 2026-06-26 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
@@ -10,12 +10,14 @@
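This section is the first decision anchor after every handover, power-on, shutdown, or reboot. If the date is not today, rerun the live check first, then update this section and `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md`.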
If you only need to decide quickly after a reboot whether recovery can be declared, run the machine-readable summary first: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color`. This script invokes the one-page check, the 188 host hygiene checklist, and the Wazuh no-false-green repo gates, and keeps the delegated logs under `/tmp/awoooi-post-reboot-readiness-*`. Then run `scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color` to turn the summary into allowed / forbidden declarations, so that service green is not misreported as DR complete, 188 host hygiene, Wazuh registry recovered, or runtime authorized. If the summary shows `SERVICE_GREEN=1` but `NEXT_REQUIRED_GATES` is still non-empty, run `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` to turn the blockers still open in the live summary into an owner / evidence / forbidden-action dispatch checklist. When machine-readable intake is needed, run `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color --output /tmp/awoooi-post-reboot-owner-packets.json` to produce the `awoooi_post_reboot_next_gate_owner_packets_v1` JSON, and immediately run `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json`. Dispatch / packet / guard all pin `DISPATCH_AUTHORIZED=0`, `REQUEST_SENT_COUNT=0`, `OWNER_RESPONSE_ACCEPTED=0`, `HOST_WRITE_AUTHORIZED=0`, `SECRET_VALUE_COLLECTION_ALLOWED=0`, `RUNTIME_GATE=0`. If the guard does not pass, do not send an owner request, do not write an escrow marker, do not enter a maintenance window, and do not claim DR / Wazuh registry complete. From v1.74 on, every owner response JSON must also pass `scripts/reboot-recovery/post-reboot-owner-response-preflight.py --no-color --response-file <file>`: an empty template, a placeholder, a secret payload, a runtime action request, a credential marker write, Wazuh active response / re-enroll / restart, a Kali active scan, or missing Dashboard API / manager registry evidence must all fail closed. Passing preflight only means the response may proceed to independent reviewer acceptance; it is not runtime authorization. When manual expansion is needed, run `scripts/reboot-recovery/post-start-quick-check.sh --no-color` and use `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` as the fallback. The long SOP keeps the full background, exception handling, and Plan B; the short wrapper / checklist handles the fixed verdict within T+10 minutes each time.
If you only need to decide quickly after a reboot whether recovery can be declared, run the machine-readable summary first: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color`. This script invokes the one-page check, the 188 host hygiene checklist, and the Wazuh no-false-green repo gates, and keeps the delegated logs and a replayable `summary.txt` under `/tmp/awoooi-post-reboot-readiness-*`. From v1.75 on, the later steps of the same acceptance round must consume the same `$ARTIFACT_DIR/summary.txt`, for example `scripts/reboot-recovery/post-reboot-declaration-guard.py --summary-file "$ARTIFACT_DIR/summary.txt" --no-color` and `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --summary-file "$ARTIFACT_DIR/summary.txt" --no-color`; do not rerun live probes repeatedly within one evidence chain and then mix conclusions from different points in time. The declaration guard turns the summary into allowed / forbidden declarations, so that service green is not misreported as DR complete, 188 host hygiene, Wazuh registry recovered, or runtime authorized. If the summary shows `SERVICE_GREEN=1` but `NEXT_REQUIRED_GATES` is still non-empty, the dispatch checklist turns the blockers still open into an owner / evidence / forbidden-action checklist. When machine-readable intake is needed, run `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --dispatch-file <dispatch.txt> --output /tmp/awoooi-post-reboot-owner-packets.json` to produce the `awoooi_post_reboot_next_gate_owner_packets_v1` JSON, and immediately run `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json`. Dispatch / packet / guard all pin `DISPATCH_AUTHORIZED=0`, `REQUEST_SENT_COUNT=0`, `OWNER_RESPONSE_ACCEPTED=0`, `HOST_WRITE_AUTHORIZED=0`, `SECRET_VALUE_COLLECTION_ALLOWED=0`, `RUNTIME_GATE=0`. If the guard does not pass, do not send an owner request, do not write an escrow marker, do not enter a maintenance window, and do not claim DR / Wazuh registry complete. From v1.74 on, every owner response JSON must also pass `scripts/reboot-recovery/post-reboot-owner-response-preflight.py --no-color --owner-packet-file <owner-packets.json> --response-file <file>`: an empty template, a placeholder, a secret payload, a runtime action request, a credential marker write, Wazuh active response / re-enroll / restart, a Kali active scan, or missing Dashboard API / manager registry evidence must all fail closed. Passing preflight only means the response may proceed to independent reviewer acceptance; it is not runtime authorization. When manual expansion is needed, run `scripts/reboot-recovery/post-start-quick-check.sh --no-color` and use `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` as the fallback. The long SOP keeps the full background, exception handling, and Plan B; the short wrapper / checklist handles the fixed verdict within T+10 minutes each time.
2026-06-26 12:13 latest live summary supersedes the 08:59 gate set: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` returns `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `POST_START_PASS=38`, `POST_START_WARN=4`, `POST_START_BLOCKED=0`, `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_SERVICE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=0`, `HOST_188_RESULT=HOST_188_HYGIENE_GREEN.`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`, `WAZUH_DASHBOARD_INDEX_OK=3`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export`. 188 host hygiene has been removed from the blockers; the only items that still cannot be declared complete are DR credential escrow and the Wazuh manager registry. The ACME HTTP-01 route and certbot timer hygiene have been repaired, but do not claim that the certificate has been formally renewed; that requires a snap certbot timer / ACME window readback.
2026-06-26 13:01 owner response preflight baseline: adds `scripts/reboot-recovery/post-reboot-owner-response-preflight.py --no-color` and `docs/templates/post-reboot-next-gate-owner-response.json`. With no response file the output must be `POST_REBOOT_OWNER_RESPONSE_PREFLIGHT_BLOCKED status=blocked_waiting_owner_response_file expected_gates=2 received=0 accepted=0 runtime_gate=0`; when the template is used as-is the output must be `POST_REBOOT_OWNER_RESPONSE_PREFLIGHT_BLOCKED status=blocked_waiting_owner_response_content expected_gates=2 received=0 accepted=0 runtime_gate=0`. This gate only accepts redacted owner evidence for `credential_escrow_evidence` and `wazuh_manager_registry_export`; it sends no request, writes no escrow marker, reads no secret, performs no Wazuh / host / Kali runtime action, and does not turn a general approval message into owner accepted.
2026-06-26 17:45 single-summary replay baseline: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` now automatically writes `/tmp/awoooi-post-reboot-readiness-20260626-174451/summary.txt`, and the subsequent `declaration guard`, `next-gate dispatch`, `owner packet`, `contract guard`, and `owner response preflight` steps of the same round all replay this summary. The 17:45 summary returns `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_HYGIENE_BLOCKED=0`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export`. `post-start-quick-check.sh` also gains route warmup classification: if every `BLOCKED` item from delegated cold-start is a public route and the wrapper's own route retry has fully recovered, that cold-start blocker is downgraded to an evidence warning and the round's service recovery is no longer misjudged as blocked; non-route blockers, or routes that still fail after retry, remain hard blocked.
2026-06-26 07:47 machine-readable readiness summary retained as historical pre-repair evidence: at that time `HOST_188_HYGIENE_BLOCKED=1` and `NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export`. This paragraph is only for comparing the state before and after the 188 repair; the current gate set must use the 12:13 baseline.
2026-06-26 08:12 next-gate dispatch baseline retained as historical pre-repair evidence: at that time the output was fixed at three P0 checklists. From 12:13 on, dispatch output is generated dynamically from the live summary; the currently expected `NEXT_GATE_COUNT=2` leaves only credential escrow and the Wazuh registry.
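One way to keep a whole round on a single summary is to resolve the artifact path once and pass it to every later step. The sketch below assumes the artifact directories follow the `/tmp/awoooi-post-reboot-readiness-*` naming shown above with a `YYYYMMDD-HHMMSS` suffix, so that lexical order equals chronological order; the helper name `latest_summary_file` is hypothetical and is not part of the repo scripts.

```bash
# Hypothetical helper: print the newest summary.txt under a base directory.
# Assumes directory names end in YYYYMMDD-HHMMSS, so glob order is chronological.
latest_summary_file() {
  base="${1:-/tmp}"
  latest=""
  for f in "$base"/awoooi-post-reboot-readiness-*/summary.txt; do
    if [ -f "$f" ]; then
      latest="$f"
    fi
  done
  # Fail closed when no summary has been persisted yet.
  [ -n "$latest" ] || return 1
  printf '%s\n' "$latest"
}

# Usage sketch (not executed here); the flags are the ones quoted in this SOP:
#   SUMMARY_FILE=$(latest_summary_file) || exit 1
#   scripts/reboot-recovery/post-reboot-declaration-guard.py --summary-file "$SUMMARY_FILE" --no-color
#   scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --summary-file "$SUMMARY_FILE" --no-color
```

Resolving the path once and exporting it is a design choice for the sketch; copying the path printed by the summary run works just as well, as long as every step in the round sees the same file.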


@@ -1,6 +1,6 @@
# One-Page Post-Reboot Host Check
> Version: v1.14
> Version: v1.15
> Last updated: 2026-06-26 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 post-reboot service recovery. 112 Kali / Wazuh / active scan are not part of this procedure.
@@ -10,7 +10,7 @@
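After any of the hosts 110 / 120 / 121 / 188 is powered on, shut down, rebooted, recovered from a power loss, put through a VMware console fsck, or subjected to a large Docker / K3s reschedule, run this page first and only then decide whether to declare recovery.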
Latest baseline: 2026-06-26 13:01 post-reboot owner response preflight. `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` returns `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_HYGIENE_BLOCKED=0`, `HOST_188_RESULT=HOST_188_HYGIENE_GREEN.`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_COVERAGE_SCOPE=6`, `WAZUH_DIRECT_ACTIVE=2`, `WAZUH_NO_TRANSPORT=1`, `WAZUH_SSH_BLOCKED=3`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`, `WAZUH_DASHBOARD_INDEX_OK=3`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. `scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color` turns the summary into allowed / forbidden declarations: it currently allows declaring service, product data, backup core, and 188 host hygiene green as well as `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, and forbids declaring `DR_COMPLETE`, `WAZUH_REGISTRY_RECOVERED`, and `RUNTIME_ACTION_AUTHORIZED`. Next, `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` expands `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export` into an owner / evidence / forbidden-action checklist; the `CURRENT_EVIDENCE` of the Wazuh checklist keeps the registry accepted, coverage scope, direct active, no transport, SSH blocked, route, transport, Dashboard API, and index pattern states, so that route `200` or transport `6` is not misreported as registry recovered. `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color` further converts this into the `awoooi_post_reboot_next_gate_owner_packets_v1` JSON, pinning `dispatch_authorized=0`, `request_sent_count=0`, `owner_response_accepted_count=0`, `host_write_authorized=0`, `secret_value_collection_allowed=0`, `runtime_gate_count=0`. `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json` dynamically locks the P0 gates according to the live `next_required_gates`, together with all `0 / false` boundaries, the ban on secret payloads / runtime actions, and the no-false-green rules. The new `scripts/reboot-recovery/post-reboot-owner-response-preflight.py --no-color` serves as the intake precheck for owner responses: with no response file the result must be `blocked_waiting_owner_response_file`; applying `docs/templates/post-reboot-next-gate-owner-response.json` directly must yield `blocked_waiting_owner_response_content`; only a response that has masked evidence refs, complete owner fields, Wazuh registry / Dashboard API status, and five non-secret credential escrow evidence refs, and that contains no secret value or runtime action request, may proceed to the next layer of reviewer acceptance. DR still cannot be declared complete because `escrow_missing=5`; the Wazuh manager registry remains an independent blocker outside service green. The ACME HTTP-01 route / certbot timer hygiene has been repaired, but a successful formal certificate renewal must wait for the snap certbot timer or an independent ACME window readback.
Latest baseline: 2026-06-26 17:45 single-summary replay / route warmup classifier. `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` returns `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_HYGIENE_BLOCKED=0`, `HOST_188_RESULT=HOST_188_HYGIENE_GREEN.`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_COVERAGE_SCOPE=6`, `WAZUH_DIRECT_ACTIVE=2`, `WAZUH_NO_TRANSPORT=1`, `WAZUH_SSH_BLOCKED=3`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`, `WAZUH_DASHBOARD_INDEX_OK=3`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, and automatically writes the same key/value pairs to `$ARTIFACT_DIR/summary.txt`. In the same round, the subsequent `post-reboot-declaration-guard.py`, `post-reboot-next-gate-dispatch.sh`, `post-reboot-next-gate-owner-packets.py`, `post-reboot-owner-packet-contract-guard.py`, and `post-reboot-owner-response-preflight.py` must use this `summary.txt` or the dispatch / packet generated from it; results from multiple live probes at different points in time must not be mixed. `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export` remains the only current set of next gates: DR still cannot be declared complete because `escrow_missing=5`, and Wazuh manager registry accepted is still `0`, so route `200`, transport `6`, and Dashboard index pattern `3` must not be treated as registry recovered. v1.15 also adds the route warmup classifier: if delegated cold-start is blocked only temporarily by a single 502 / TLS readback failure on a public route, and the wrapper route retry has confirmed full recovery, that blocker is downgraded to an evidence warning; non-route blockers, or routes that still fail after retry, remain hard blocked.
This page answers only four things:
@@ -49,7 +49,7 @@
scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color
```
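This summary performs read-only checks only. It delegates to the one-page check, the 188 host hygiene checklist, and the Wazuh repo-side no-false-green gates, and keeps the delegated logs under `/tmp/awoooi-post-reboot-readiness-*`. Look at these fields first:
This summary performs read-only checks only. It delegates to the one-page check, the 188 host hygiene checklist, and the Wazuh repo-side no-false-green gates, and keeps the delegated logs and the replayable `summary.txt` under `/tmp/awoooi-post-reboot-readiness-*`. Look at these fields first:
- `SERVICE_GREEN=1`: the service layer can be declared recovered.
- `PRODUCT_DATA_GREEN=1`: the primary MOMO / StockPlatform data freshness can be declared recovered.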
@@ -67,6 +67,13 @@ summary 顯示 `SERVICE_GREEN=1` 後,先跑宣告 guard確認本輪可以
scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color
```
If `$ARTIFACT_DIR/summary.txt` already exists, keep using it for the whole round, so that the verdict is not affected by transient CI / route warmup states when live probes are rerun:
```bash
SUMMARY_FILE=/tmp/awoooi-post-reboot-readiness-YYYYMMDD-HHMMSS/summary.txt
scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color --summary-file "$SUMMARY_FILE"
```
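This guard only reads the summary evidence and prints the boundary of what may and may not be declared in this round. To test whether a particular statement is allowed, use `--proposed`: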
```bash
@@ -81,6 +88,12 @@ summary 顯示 `SERVICE_GREEN=1` 但仍有 `NEXT_REQUIRED_GATES` 時,再跑下
scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color
```
Within one evidence chain, use the same summary instead:
```bash
scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color --summary-file "$SUMMARY_FILE" | tee /tmp/awoooi-post-reboot-dispatch.txt
```
This script only converts the `NEXT_REQUIRED_GATES` in the current live summary into owner / required evidence / forbidden action / done criteria. Since 2026-06-26 12:13 usually only `credential_escrow_evidence` and `wazuh_manager_registry_export` remain; `host_188_hygiene_maintenance_window` reappears only if 188 turns red again in the future. It sends no request, reads no secret, writes no marker, does not restart / reload / repair / import / delete / patch, and does not authorize any host / Wazuh / Nginx / K8s / DB runtime action. If it outputs `SERVICE_GREEN=0`, go back to service recovery first and do not enter boundary dispatch.
To hand off to AI / ticket / owner review, generate the machine-readable owner packet:
@@ -94,7 +107,7 @@ scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color
Before submitting to any owner review queue, first save the JSON as an artifact and run the contract guard:
```bash
scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color --output /tmp/awoooi-post-reboot-owner-packets.json
scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color --dispatch-file /tmp/awoooi-post-reboot-dispatch.txt --output /tmp/awoooi-post-reboot-owner-packets.json
scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json
```
@@ -104,7 +117,8 @@ guard 必須輸出 `POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=<live_next_
```bash
scripts/reboot-recovery/post-reboot-owner-response-preflight.py --no-color
scripts/reboot-recovery/post-reboot-owner-response-preflight.py --no-color --response-file docs/templates/post-reboot-next-gate-owner-response.json
scripts/reboot-recovery/post-reboot-owner-response-preflight.py --no-color --owner-packet-file /tmp/awoooi-post-reboot-owner-packets.json
scripts/reboot-recovery/post-reboot-owner-response-preflight.py --no-color --owner-packet-file /tmp/awoooi-post-reboot-owner-packets.json --response-file docs/templates/post-reboot-next-gate-owner-response.json
```
The first command must output `POST_REBOOT_OWNER_RESPONSE_PREFLIGHT_BLOCKED status=blocked_waiting_owner_response_file expected_gates=2 received=0 accepted=0 runtime_gate=0`. The second command must output `POST_REBOOT_OWNER_RESPONSE_PREFLIGHT_BLOCKED status=blocked_waiting_owner_response_content expected_gates=2 received=0 accepted=0 runtime_gate=0`, proving that an empty template cannot be counted as received or accepted. A qualifying response may only contain redacted evidence refs, owner role / team / decision / reviewer / followup owner, a non-secret evidence ref for each of the five escrow items, and the Wazuh manager registry / Dashboard API readback; it must not contain passwords, tokens, secret values, hashes, prefix/suffix, raw Wazuh payload, original agent names, internal IPs, `client.keys`, active response, host write, agent re-enroll, Wazuh restart, Kali active scan, or credential marker write. Passing preflight only means the response may proceed to independent reviewer acceptance; it does not imply `DR_COMPLETE`, `WAZUH_REGISTRY_RECOVERED`, or authorization for any runtime action.
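When wiring these expectations into a wrapper, the status line can be checked field by field instead of compared as a whole string. The following is a minimal sketch that relies only on the `key=value` token layout of the status lines quoted above; the helper names `status_field` and `preflight_is_fail_closed` are hypothetical.

```bash
# Hypothetical helpers for inspecting a preflight status line.
status_field() {
  # $1 = full status line, $2 = field name. Prints the value of the first key=value token.
  for tok in $1; do
    case "$tok" in
      "$2"=*)
        printf '%s\n' "${tok#*=}"
        return 0
        ;;
    esac
  done
  return 1
}

preflight_is_fail_closed() {
  # True only when nothing is counted as received or accepted and no runtime gate is open.
  [ "$(status_field "$1" received)" = "0" ] &&
    [ "$(status_field "$1" accepted)" = "0" ] &&
    [ "$(status_field "$1" runtime_gate)" = "0" ]
}
```

A missing field makes the comparison fail, so a malformed status line is treated as not fail-closed rather than silently accepted.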


@@ -15,7 +15,7 @@
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-26 07:19 readback shows 120 and 121 reachable, K3s active, `mon` and `mon1` both `Ready control-plane`, AWOOOI API/Web replicas split across both nodes, ArgoCD `awoooi-prod Synced / Healthy` at revision `1fd5e2a8b0f18d24eed16aa2a44286bcbf230603`, and `km-vectorize` official 03:00 Taipei time run succeeded with `lastSuccess=2026-06-25T19:00:14Z`. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-26 06:58 backup readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`, last aggregate `2026-06-26 02:31:02`. DR remains blocked on real non-secret credential escrow evidence IDs; do not write placeholder markers or paste secret values. |
| P2 service / data truth | DONE | 100% | Service routes and core runtime are available, 110 current CPU pressure is attributable to active AWOOOI Web `turbo build` / Docker buildx, and previous orphan Chrome groups remain cleared. 2026-06-26 07:19 StockPlatform `/api/v1/system/freshness` returned `200`; 07:01 freshness payload was `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; price / chips / margin / AI recommendations are all on `2026-06-25`. `ai.recommendations` row count is `2868`; `core.margin_short_daily` row count is `1976`. MOMO health `V10.699`, current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, and `MOMO_DAILY_FRESHNESS 1|2026-06-24` are green; expanded public routes are green. |
| P3 docs / automation contracts | DONE_WITH_OWNER_RESPONSE_PREFLIGHT_V174 | 100% | Workplan, SOP v1.74, post-reboot declaration guard, machine-readable post-reboot readiness summary with Wazuh registry detail fields, post-reboot next-gate dispatch checklist, owner-packet JSON generator, dynamic owner-packet contract guard, post-reboot owner response preflight, owner response placeholder template, one-page post-start quick check v1.14, route retry gate, deploy warmup classification, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, StockPlatform natural schedule green evidence, 110 orphan Chrome recurrence cleanup evidence, 188 fail-closed startup data recovery gate, 188 host hygiene read-only checklist, 188 PostgreSQL runtime-ready source-of-truth, 188 ACME route/timer hygiene, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / 
Drive-auth fail-closed deploys, product version/readback matrix, and stricter product-data / route retry gates are updated. Declaration guard now machine-checks allowed / forbidden recovery statements: service/data/backup/188 host hygiene green may be declared when live summary says so, while `DR_COMPLETE`, `WAZUH_REGISTRY_RECOVERED`, and `RUNTIME_ACTION_AUTHORIZED` remain forbidden until evidence gates close. Owner response preflight blocks missing files, placeholder templates, secret payloads, credential marker writes, Wazuh active response / re-enroll / restart, host write, and Kali active scan before any evidence can be counted as received or accepted. Live 110 script sync remains a separate approved live-write gate; do not claim it here. |
| P3 docs / automation contracts | DONE_WITH_SINGLE_SUMMARY_REPLAY_V175 | 100% | Workplan, SOP v1.75, post-reboot declaration guard, machine-readable post-reboot readiness summary with Wazuh registry detail fields and auto-persisted `summary.txt`, post-reboot next-gate dispatch checklist, owner-packet JSON generator, dynamic owner-packet contract guard, post-reboot owner response preflight, owner response placeholder template, one-page post-start quick check v1.15, route retry gate, delegated cold-start public-route warmup classifier, deploy warmup classification, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, StockPlatform natural schedule green evidence, 110 orphan Chrome recurrence cleanup evidence, 188 fail-closed startup data recovery gate, 188 host hygiene read-only checklist, 188 PostgreSQL runtime-ready source-of-truth, 188 ACME route/timer hygiene, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk 
pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and stricter product-data / route retry gates are updated. Declaration guard now machine-checks allowed / forbidden recovery statements from the same `summary.txt`: service/data/backup/188 host hygiene green may be declared when live summary says so, while `DR_COMPLETE`, `WAZUH_REGISTRY_RECOVERED`, and `RUNTIME_ACTION_AUTHORIZED` remain forbidden until evidence gates close. Owner response preflight blocks missing files, placeholder templates, secret payloads, credential marker writes, Wazuh active response / re-enroll / restart, host write, and Kali active scan before any evidence can be counted as received or accepted. Live 110 script sync remains a separate approved live-write gate; do not claim it here. |
2026-06-26 12:13 machine-readable summary baseline supersedes the 07:47 / 08:59 gate set: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` stores delegated logs under `/tmp/awoooi-post-reboot-readiness-20260626-121303` and returns `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_SERVICE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=0`, `HOST_188_CHECK_RC=0`, `HOST_188_RESULT=HOST_188_HYGIENE_GREEN.`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_COVERAGE_SCOPE=6`, `WAZUH_DIRECT_ACTIVE=2`, `WAZUH_NO_TRANSPORT=1`, `WAZUH_SSH_BLOCKED=3`, `WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`, `WAZUH_DASHBOARD_INDEX_OK=3`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_RUNTIME_GATE=0`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, and `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export`. This is now the preferred first operator/AI-agent entrypoint after reboot because it separates service health from DR and security registry evidence; 188 host hygiene is no longer a next gate unless the live checklist regresses.
@@ -29,6 +29,8 @@
2026-06-26 13:01 owner response preflight baseline: `scripts/reboot-recovery/post-reboot-owner-response-preflight.py --no-color` validates future owner responses against the dynamic owner-packet gate set without sending requests, writing markers, reading secrets, or changing runtime. Missing response file must remain `blocked_waiting_owner_response_file`; the placeholder template `docs/templates/post-reboot-next-gate-owner-response.json` must remain `blocked_waiting_owner_response_content` with `received=0`, `accepted=0`, and `runtime_gate=0`. The only acceptable payload class is redacted owner evidence for credential escrow and Wazuh manager registry export; secret values, hash / prefix / suffix, raw Wazuh payload, agent real names, internal IPs, `client.keys`, credential marker write, host write, Wazuh active response / re-enroll / restart, and Kali active scan are rejected.
2026-06-26 17:45 single-summary replay baseline: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` now writes the exact emitted key/value summary to `$ARTIFACT_DIR/summary.txt`; latest artifact `/tmp/awoooi-post-reboot-readiness-20260626-174451/summary.txt` returns `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_HYGIENE_BLOCKED=0`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, and `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export`. The same summary file drives declaration guard, next-gate dispatch, owner packet generation, contract guard, and owner response preflight. `post-start-quick-check.sh` now holds delegated cold-start blockers until wrapper route retry completes; route-only cold-start blockers that recover under wrapper retry are evidence warnings, while non-route blockers or unrecovered routes remain hard blockers.
2026-06-26 08:47 Wazuh registry detail summary baseline: post-reboot readiness summary now emits `WAZUH_COVERAGE_SCOPE`, `WAZUH_DIRECT_ACTIVE`, `WAZUH_NO_TRANSPORT`, `WAZUH_SSH_BLOCKED`, `WAZUH_DASHBOARD_API_CONNECTION`, and `WAZUH_DASHBOARD_INDEX_OK` alongside existing route / transport / registry fields. Current read-only truth is coverage scope `6`, direct active `2`, no transport `1`, SSH blocked `3`, route `200`, transport `6`, Dashboard API `pending_or_spinning`, index OK `3`, manager registry accepted `0`, runtime gate `0`. This is a security evidence blocker, not a reboot service blocker.
2026-06-26 12:13 declaration guard baseline: `scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color` emits `schema_version=awoooi_post_reboot_declaration_guard_v1`, status `allowed_with_boundary_blockers`, allowed declarations including service / product data / backup / 188 host hygiene green for this evidence set, and forbidden declarations `DR_COMPLETE`, `WAZUH_REGISTRY_RECOVERED`, and `RUNTIME_ACTION_AUTHORIZED`. Proposed false-green declarations are rejected before they can enter LOGBOOK / owner packets / external status updates.
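The allowed / forbidden split can be illustrated with a small shell sketch. It is not the logic of `post-reboot-declaration-guard.py`: the mapping from summary keys to forbidden declarations is an assumption inferred from the baselines above, the summary file is assumed to hold one `KEY=VALUE` pair per line, and the names `kv` and `proposal_allowed` are hypothetical.

```bash
# Illustrative only; the real guard is a Python script whose rules are not shown here.
kv() {
  # $1 = summary file, $2 = key. Prints the value of the last matching KEY=VALUE line.
  sed -n "s/^$2=//p" "$1" | tail -n 1
}

proposal_allowed() {
  # $1 = summary file, $2 = proposed declaration. Returns 0 only when evidence supports it.
  case "$2" in
    DR_COMPLETE)
      # Assumed rule: escrow must be unblocked and nothing may be missing.
      [ "$(kv "$1" DR_ESCROW_BLOCKED)" = "0" ] &&
        [ "$(kv "$1" ESCROW_MISSING_COUNT)" = "0" ]
      ;;
    WAZUH_REGISTRY_RECOVERED)
      # Assumed rule: the manager registry must have been accepted (present and non-zero).
      v=$(kv "$1" WAZUH_MANAGER_REGISTRY_ACCEPTED)
      [ -n "$v" ] && [ "$v" != "0" ]
      ;;
    RUNTIME_ACTION_AUTHORIZED)
      v=$(kv "$1" RUNTIME_ACTION_AUTHORIZED)
      [ -n "$v" ] && [ "$v" != "0" ]
      ;;
    *)
      # Any other proposal must literally equal the summary's overall declaration.
      [ "$2" = "$(kv "$1" OVERALL_DECLARATION)" ]
      ;;
  esac
}
```

Missing keys evaluate to an empty string and therefore reject the proposal, which keeps the sketch fail-closed in the same spirit as the baselines.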