ops(reboot): close 188 hygiene and dynamic post-reboot gates
@@ -45489,3 +45489,40 @@ production browser smoke:
- The 188 host hygiene maintenance window has still not been executed.
- Wazuh manager registry accepted remains `0`.
- Do not claim that the owner request has been sent, that an owner response has been received / accepted, that runtime writes have been approved, `DR_COMPLETE`, 188 host fully green, or Wazuh registry recovered.

## 2026-06-26 — 12:13 188 host hygiene live repair closeout / SOP v1.73

**Time and sources**:
- 2026-06-26 11:40-12:13 Asia/Taipei.
- Sources: 188 live SSH maintenance session, `scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh --no-color`, `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color`, public ACME HTTP-01 self-test, systemd / Docker / PostgreSQL runtime readback.

**Completed work**:
- 188 startup source-of-truth fixed: `/usr/local/bin/awoooi-startup.sh` now uses `postgres_runtime_ready()`, which accepts an active `postgresql@14-main` or the `k3s-postgres-recovery` host-network PostgreSQL runtime plus `pg_isready`; it no longer runs `pg_resetwal` automatically.
- 188 `awoooi-startup.service` drops the hard `postgresql@14-main.service` dependency, so the production DB runtime already provided by the recovery container is no longer misjudged as a failed host cluster.
- 188 Nginx ACME HTTP-01 route fixed: `sentry.wooo.work`, `gitea.wooo.work`, `langfuse.wooo.work`, and `signoz.wooo.work` all return the same non-secret self-test token over public HTTP.
- 188 duplicate certbot runners consolidated: apt `certbot.timer` is disabled / inactive, snap `snap.certbot.renew.timer` is enabled / active / waiting; `certbot.service` and `snap.certbot.renew.service` both show `Result=success / inactive`.
- 188 failed units cleared to zero; `systemctl is-system-running=running`.
- Repo baseline updated in sync: `scripts/reboot-recovery/awoooi-startup.sh`, `scripts/reboot-recovery/awoooi-startup.service`, `scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh`, 188 Nginx Ansible templates, SOP / quick-check / workplan / runbook.
- `post-reboot-owner-packet-contract-guard.py` now validates dynamically against the live `source.next_required_gates`, and no longer permanently locks the already-repaired 188 gate into a mandatory set of three gates.
- The StockPlatform dedicated freshness gate in `post-start-quick-check.sh` now retries with `STOCK_FRESHNESS_RETRY_ATTEMPTS=6` / `STOCK_FRESHNESS_RETRY_DELAY_SECONDS=5`, so that a brief upstream `502` after a 110 Docker / CI rollout does not mark an already-recovered service / data freshness as blocked; consecutive failures remain a hard blocker.

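The runtime-ready rule described above can be sketched as follows. This is a minimal illustration in Python, not the contents of `awoooi-startup.sh` (which is a shell script). It assumes that `pg_isready` must succeed in both branches, and the `run` helper and injectable `runner` parameter are hypothetical names introduced here so the logic can be exercised without a live host. The sketch only reads state and never invokes `pg_resetwal`.

```python
import subprocess
from typing import Callable, Sequence, Tuple

Runner = Callable[[Sequence[str]], Tuple[int, str]]


def run(cmd: Sequence[str]) -> Tuple[int, str]:
    """Run a read-only command; return (exit code, stripped stdout)."""
    try:
        proc = subprocess.run(list(cmd), capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return 1, ""  # a missing binary or a hang counts as "not ready"
    return proc.returncode, proc.stdout.strip()


def postgres_runtime_ready(runner: Runner = run) -> bool:
    """Fail-closed check: True only if a known runtime exists AND accepts connections."""
    # Branch 1: the host systemd unit is active.
    unit_rc, _ = runner(["systemctl", "is-active", "--quiet", "postgresql@14-main"])
    host_unit_active = unit_rc == 0

    # Branch 2: the recovery container is running on the host network.
    rc, out = runner([
        "docker", "inspect", "--format",
        "{{.State.Running}} {{.HostConfig.NetworkMode}}",
        "k3s-postgres-recovery",
    ])
    recovery_runtime = rc == 0 and out == "true host"

    if not (host_unit_active or recovery_runtime):
        return False  # no repair attempt here, just report "not ready"

    # Assumption: pg_isready is required regardless of which branch matched.
    ready_rc, _ = runner(["pg_isready", "-h", "localhost", "-p", "5432"])
    return ready_rc == 0
```

Returning `False` instead of attempting a repair keeps the decision with the operator, which matches the fail-closed intent recorded in this entry.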
**Live verification results**:
- `scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh --no-color`: `PASS=16 WARN=3 BLOCKED=0`, `SERVICE_GREEN=1`, `HOST_HYGIENE_BLOCKED=0`, `Result: HOST_188_HYGIENE_GREEN.`
- `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color`: `POST_START_RC=0`, `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `POST_START_PASS=38`, `POST_START_WARN=4`, `POST_START_BLOCKED=0`, `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_HYGIENE_BLOCKED=0`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `RUNTIME_ACTION_AUTHORIZED=0`.
- `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export`; 188 host hygiene has been removed from the next gates.

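The relationship between the summary flags and `NEXT_REQUIRED_GATES` can be illustrated with a small sketch. The flag-to-gate mapping below is inferred from the before / after evidence in this entry and is an assumption, not the summary script's actual logic; the function name is hypothetical. A gate is dropped only when its flag is explicitly clear, so a missing key keeps the gate open.

```python
from typing import Dict, List


def next_required_gates(summary: Dict[str, str]) -> List[str]:
    """Derive open gates from summary flags (assumed mapping, fail-closed)."""
    gates: List[str] = []
    # A gate is cleared only by an explicit "clear" value; anything else keeps it open.
    if summary.get("DR_ESCROW_BLOCKED") != "0":
        gates.append("credential_escrow_evidence")
    if summary.get("HOST_188_HYGIENE_BLOCKED") != "0":
        gates.append("host_188_hygiene_maintenance_window")
    if summary.get("WAZUH_MANAGER_REGISTRY_ACCEPTED") != "1":
        gates.append("wazuh_manager_registry_export")
    return gates


# Flag values taken from the 12:13 live summary in this entry.
post_repair = {
    "DR_ESCROW_BLOCKED": "1",
    "HOST_188_HYGIENE_BLOCKED": "0",
    "WAZUH_MANAGER_REGISTRY_ACCEPTED": "0",
}
print(",".join(next_required_gates(post_repair)))
```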
**Types of commands run**:
- Writes: 188 startup script/service install, 188 Nginx ACME route config install, `systemctl daemon-reload`, `systemctl reset-failed`, `nginx -t`, `systemctl reload nginx`, disabling the duplicate apt `certbot.timer`, creating / deleting the non-secret ACME self-test token file.
- Read-only: systemd / Docker / PostgreSQL runtime / certbot timer / public route / backup / post-reboot summary / Wazuh repo gates.
- Not done: no `pg_resetwal`, no DB restore, no Docker restart, no K3s / ArgoCD / firewall / Wazuh runtime action, no reading or storing of secret values / tokens / passwords, no forced certbot renew.

**Current determination**:
- 188 host hygiene repair: `0% -> 100%`.
- Reboot service / product data / backup / 188 host hygiene: `GREEN`.
- Overall recovery declaration: `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`.
- SOP / quick-check / owner packet guard: v1.73 dynamic gate baseline.

**Still blocked / must not be claimed**:
- DR credential escrow evidence is still missing `5` items: do not claim `DR_COMPLETE`.
- Wazuh manager registry accepted is still `0`: do not claim that Wazuh management of all hosts has recovered.
- certbot formal renewal has no completed readback yet; this round completed the HTTP-01 route / timer hygiene / failed-unit cleanup, and a successful formal renew must wait for the snap certbot timer or a separate ACME window.

@@ -1,6 +1,6 @@
# AWOOOI Full-Stack Cold Start and Host Reboot SOP

> Version: v1.72
> Version: v1.73
> Last updated: 2026-06-26 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.

@@ -10,19 +10,21 @@

This section is the first decision anchor after every handover, power-on, shutdown, or reboot. If its date is not today, rerun the live check first, then update this section and `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md`.

If you only need a quick post-reboot answer on whether recovery can be declared, first run the machine-readable summary: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color`. This script calls the one-page check, the 188 host hygiene checklist, and the Wazuh no-false-green repo gates, and leaves delegated logs in `/tmp/awoooi-post-reboot-readiness-*`. Next run `scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color` to turn the summary into allowed / forbidden declarations, so that service green is not misreported as DR complete, 188 host fully green, Wazuh registry recovered, or runtime authorized. If the summary shows `SERVICE_GREEN=1` but `NEXT_REQUIRED_GATES` is still non-empty, run `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` to turn the three blockers (DR escrow, 188 hygiene, Wazuh registry) into an owner / evidence / forbidden-action dispatch checklist; when a machine-readable intake is needed, run `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color --output /tmp/awoooi-post-reboot-owner-packets.json` to produce the `awoooi_post_reboot_next_gate_owner_packets_v1` JSON, and immediately run `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json`. Dispatch / packet / guard all pin `DISPATCH_AUTHORIZED=0`, `REQUEST_SENT_COUNT=0`, `OWNER_RESPONSE_ACCEPTED=0`, `HOST_WRITE_AUTHORIZED=0`, `SECRET_VALUE_COLLECTION_ALLOWED=0`, `RUNTIME_GATE=0`; if the guard does not pass, do not send an owner request, do not write an escrow marker, do not enter a maintenance window, and do not claim DR / 188 host hygiene / Wazuh registry complete. When manual expansion is needed, run `scripts/reboot-recovery/post-start-quick-check.sh --no-color` and use `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` as the fallback. The long SOP keeps the full background, exception handling, and Plan B; the short wrapper / checklist handles the fixed determination within T+10 minutes each time.

If you only need a quick post-reboot answer on whether recovery can be declared, first run the machine-readable summary: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color`. This script calls the one-page check, the 188 host hygiene checklist, and the Wazuh no-false-green repo gates, and leaves delegated logs in `/tmp/awoooi-post-reboot-readiness-*`. Next run `scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color` to turn the summary into allowed / forbidden declarations, so that service green is not misreported as DR complete, 188 host hygiene, Wazuh registry recovered, or runtime authorized. If the summary shows `SERVICE_GREEN=1` but `NEXT_REQUIRED_GATES` is still non-empty, run `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` to turn the blockers still open in the live summary into an owner / evidence / forbidden-action dispatch checklist; when a machine-readable intake is needed, run `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color --output /tmp/awoooi-post-reboot-owner-packets.json` to produce the `awoooi_post_reboot_next_gate_owner_packets_v1` JSON, and immediately run `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json`. Dispatch / packet / guard all pin `DISPATCH_AUTHORIZED=0`, `REQUEST_SENT_COUNT=0`, `OWNER_RESPONSE_ACCEPTED=0`, `HOST_WRITE_AUTHORIZED=0`, `SECRET_VALUE_COLLECTION_ALLOWED=0`, `RUNTIME_GATE=0`; if the guard does not pass, do not send an owner request, do not write an escrow marker, do not enter a maintenance window, and do not claim DR / Wazuh registry complete. When manual expansion is needed, run `scripts/reboot-recovery/post-start-quick-check.sh --no-color` and use `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` as the fallback. The long SOP keeps the full background, exception handling, and Plan B; the short wrapper / checklist handles the fixed determination within T+10 minutes each time.

2026-06-26 07:47 machine-readable readiness summary: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` is verified usable, artifact dir `/tmp/awoooi-post-reboot-readiness-20260626-074702`. The summary outputs `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `POST_START_PASS=38`, `POST_START_WARN=3`, `POST_START_BLOCKED=0`, `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_SERVICE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=1`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_RUNTIME_GATE=0`, `RUNTIME_ACTION_AUTHORIZED=0`. Currently `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED` and `NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export`. This is the first-layer operator / AI agent decision format after every reboot.

2026-06-26 12:13 latest live summary supersedes the 08:59 gate set: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` returns `POST_START_RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `POST_START_PASS=38`, `POST_START_WARN=4`, `POST_START_BLOCKED=0`, `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_SERVICE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=0`, `HOST_188_RESULT=HOST_188_HYGIENE_GREEN.`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`, `WAZUH_DASHBOARD_INDEX_OK=3`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export`. 188 host hygiene has been removed from the blockers; the only items that still cannot be claimed complete are DR credential escrow and Wazuh manager registry. The ACME HTTP-01 route and certbot timer hygiene are repaired, but do not claim that the certificate has been formally renewed; that requires the snap certbot timer / ACME window readback.

2026-06-26 08:12 next-gate dispatch baseline: `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` was verified against the latest summary live output. The script reads back `SERVICE_GREEN=1`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export`, and outputs three P0 checklists. The first is credential escrow non-secret evidence, requiring evidence id / owner / reviewer for the five escrow items and forbidding passwords, tokens, hashes, and prefix/suffix. The second is the 188 host PostgreSQL / certbot hygiene maintenance window, requiring DB / DNS-TLS / rollback / postcheck owner decisions and forbidding unapproved actions such as `pg_resetwal`, certbot renew, Nginx reload, DB restore, and Docker restart. The third is the Wazuh manager registry redacted export, requiring a redacted registry count, host alias status, dashboard API/version status, time window, and reviewer, and forbidding agent real name, internal IP, client.keys, raw payload, active response, agent re-enroll, Wazuh restart, secret patch, host write, and Kali active scan. The output pins `NEXT_GATE_COUNT=3`, `NEXT_STEP=dispatch_owner_packets_manually_after_review`, `RUNTIME_ACTION_AUTHORIZED=0`; this is a dispatch checklist, not a sent request or a runtime approval.

2026-06-26 07:47 machine-readable readiness summary retained as historical pre-repair evidence: at that time `HOST_188_HYGIENE_BLOCKED=1` and `NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export`. This entry is only for comparing the state before and after the 188 repair; the current gate set must use the 12:13 baseline.

2026-06-26 08:12 next-gate dispatch baseline retained as historical pre-repair evidence: at that time the output was pinned to three P0 checklists. From 12:13 on, dispatch output follows the live summary dynamically; the currently expected value is `NEXT_GATE_COUNT=2`, with only credential escrow and Wazuh registry remaining.

2026-06-26 08:29 owner-packet JSON baseline: `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color` converts the dispatch output into `schema_version=awoooi_post_reboot_next_gate_owner_packets_v1`, containing three `owner_packets`, `next_gate_count=3`, `p0_gate_count=3`, `request_sent_count=0`, `owner_response_received_count=0`, `owner_response_accepted_count=0`, `runtime_action_authorized_count=0`. This JSON is an AI / operator / owner review intake, not an external request and not a maintenance-window approval.

2026-06-26 08:40 owner-packet contract guard baseline: `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json` pins `schema_version=awoooi_post_reboot_next_gate_owner_packets_v1`, three P0 gate ids, `next_gate_count=3`, `p0_gate_count=3`, `request_sent_count=0`, `owner_response_received_count=0`, `owner_response_accepted_count=0`, `runtime_action_authorized_count=0`, `dispatch_authorized=0`, `host_write_authorized=0`, `secret_value_collection_allowed=0`, `runtime_gate_count=0`. The guard also verifies that escrow forbids password / token / secret value / hash / prefix / suffix / raw credential, that 188 forbids `pg_resetwal` / certbot renew / Nginx reload / DB restore / Docker restart / host file write, and that Wazuh forbids raw payload / internal IP / active response / re-enroll / restart / secret patch / host write / Kali active scan, and it requires the four no-false-green rules to be present. The output must be `POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=3 request_sent=0 accepted=0 runtime_gate=0`.

2026-06-26 08:40 owner-packet contract guard baseline retained as historical pre-repair evidence: the old version pinned three P0 gates. From 12:13 on, the contract guard validates dynamically against `source.next_required_gates`, and the current expected success line is `POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=2 request_sent=0 accepted=0 runtime_gate=0`; it returns to `gates=3` only if 188 hygiene regresses in the future.

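A minimal sketch of what dynamic validation against `source.next_required_gates` could look like is shown here. It is not the code of `post-reboot-owner-packet-contract-guard.py`. The packet layout is an assumption: the zero-valued fields are assumed to sit at the top level, `next_required_gates` is assumed to be a list, every gate is assumed to be P0, and the per-packet `gate_id` key is a hypothetical name.

```python
from typing import Any, Dict

SCHEMA = "awoooi_post_reboot_next_gate_owner_packets_v1"
# Boundary fields that must stay 0 / false (assumed to be top-level keys).
ZERO_FIELDS = (
    "request_sent_count", "owner_response_received_count",
    "owner_response_accepted_count", "runtime_action_authorized_count",
    "dispatch_authorized", "host_write_authorized",
    "secret_value_collection_allowed", "runtime_gate_count",
)


def check_owner_packet(packet: Dict[str, Any]) -> str:
    """Validate a packet against its own live gate list; raise ValueError on drift."""
    if packet.get("schema_version") != SCHEMA:
        raise ValueError("unexpected schema_version")

    expected = list(packet.get("source", {}).get("next_required_gates", []))
    # "gate_id" is a hypothetical key name for the gate carried by each owner packet.
    actual = [p.get("gate_id") for p in packet.get("owner_packets", [])]
    if sorted(actual) != sorted(expected):
        raise ValueError("owner_packets do not match source.next_required_gates")

    # Assumption: every gate is P0, so both counts equal the live gate count.
    for field in ("next_gate_count", "p0_gate_count"):
        if packet.get(field) != len(expected):
            raise ValueError(f"{field} does not match live gate count")

    for field in ZERO_FIELDS:
        if packet.get(field) not in (0, False):
            raise ValueError(f"{field} must be 0 / false")

    return (f"POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates={len(expected)} "
            "request_sent=0 accepted=0 runtime_gate=0")
```

Because the expected gate set comes from the packet's own `source` block rather than a hard-coded list, the same check yields `gates=2` today and `gates=3` if the 188 gate reappears.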
2026-06-26 08:47 Wazuh registry detail baseline: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` now includes the Wazuh repo-side coverage / runtime gate details as fixed key/value pairs: `WAZUH_COVERAGE_SCOPE=6`, `WAZUH_DIRECT_ACTIVE=2`, `WAZUH_NO_TRANSPORT=1`, `WAZUH_SSH_BLOCKED=3`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`, `WAZUH_DASHBOARD_INDEX_OK=3`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_RUNTIME_GATE=0`. The `wazuh_manager_registry_export` gate of `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` puts these states into `CURRENT_EVIDENCE`. Hard rule for interpretation: route `200`, transport `6`, and Dashboard index pattern `3` are none of them manager registry accepted; all-host management and the Dashboard API repair still require owner evidence / registry export / acceptance record.

2026-06-26 08:59 declaration guard baseline: `scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color` converts the summary into `awoooi_post_reboot_declaration_guard_v1`; the current status is `allowed_with_boundary_blockers`. Allowed declarations are `SERVICE_RECOVERY_GREEN`, `PRODUCT_DATA_GREEN`, `BACKUP_CORE_GREEN`, and `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`; forbidden declarations are `DR_COMPLETE`, `HOST_188_FULLY_GREEN`, `WAZUH_REGISTRY_RECOVERED`, and `RUNTIME_ACTION_AUTHORIZED`. When tested with `--proposed DR_COMPLETE` or `--proposed WAZUH_REGISTRY_RECOVERED`, the guard must reject with `blocked_false_green_proposal`.

2026-06-26 08:59 declaration guard baseline retained as historical pre-repair evidence: at that time `HOST_188_FULLY_GREEN` was still forbidden. From 12:13 on, the guard dynamically allows 188 host hygiene green based on `HOST_188_HYGIENE_BLOCKED=0`, but still rejects `DR_COMPLETE`, `WAZUH_REGISTRY_RECOVERED`, and `RUNTIME_ACTION_AUTHORIZED`.

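The allowed / forbidden split and the proposal rejection can be sketched as below. This is an illustration only, not the logic of `post-reboot-declaration-guard.py`: the conditions attached to each declaration are assumptions read off the summary keys, and the plain `allowed` status used when nothing is forbidden is a hypothetical placeholder, since the source names only `allowed_with_boundary_blockers` and `blocked_false_green_proposal`.

```python
from typing import Dict, Iterable, List


def declaration_guard(summary: Dict[str, str], proposed: Iterable[str] = ()) -> Dict[str, object]:
    """Split declarations into allowed / forbidden and reject false-green proposals."""
    rules = {
        "SERVICE_RECOVERY_GREEN": summary.get("SERVICE_GREEN") == "1",
        "PRODUCT_DATA_GREEN": summary.get("PRODUCT_DATA_GREEN") == "1",
        "BACKUP_CORE_GREEN": summary.get("BACKUP_CORE_GREEN") == "1",
        # Assumed conditions: each boundary claim needs its flag explicitly clear.
        "HOST_188_FULLY_GREEN": summary.get("HOST_188_HYGIENE_BLOCKED") == "0",
        "DR_COMPLETE": summary.get("DR_ESCROW_BLOCKED") == "0"
        and summary.get("ESCROW_MISSING_COUNT") == "0",
        "WAZUH_REGISTRY_RECOVERED": summary.get("WAZUH_MANAGER_REGISTRY_ACCEPTED") == "1",
        "RUNTIME_ACTION_AUTHORIZED": summary.get("RUNTIME_ACTION_AUTHORIZED") == "1",
    }
    allowed: List[str] = [name for name, ok in rules.items() if ok]
    forbidden: List[str] = [name for name, ok in rules.items() if not ok]

    overall = summary.get("OVERALL_DECLARATION")
    if overall and overall not in allowed:
        allowed.append(overall)  # the summary's own overall declaration may be repeated

    rejected = [p for p in proposed if p not in allowed]
    if rejected:
        status = "blocked_false_green_proposal"
    elif forbidden:
        status = "allowed_with_boundary_blockers"
    else:
        status = "allowed"  # hypothetical name; the source does not state this case
    return {"status": status, "allowed": allowed, "forbidden": forbidden, "rejected": rejected}
```

With the 12:13 values, `HOST_188_FULLY_GREEN` moves to the allowed side, while a proposal of `DR_COMPLETE` is still rejected.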
2026-06-26 07:39 live quick-check refresh: `scripts/reboot-recovery/post-start-quick-check.sh --no-color` ran to completion. Ping / SSH to all four hosts is OK, delegated cold-start is `PASS=89 WARN=0 BLOCKED=0`, and the wrapper summary is `POST_START_QUICK_CHECK PASS=38 WARN=3 BLOCKED=0`, warning split `SERVICE=0 BOUNDARY=1 EVIDENCE=2`, `RESULT=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. MOMO health `V10.701`, daily snapshot `109061` rows / `2025-07-01..2026-06-24`, current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, latest import job `57 completed`. StockPlatform freshness `status=ok`, latest trading date `2026-06-25`; price / chips / margin / AI recommendations are all `2026-06-25`. Backup-status at 07:39 shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, offsite/rclone fresh, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`. The extended public routes list all returned the expected 2xx/3xx. 110 CPU attribution shows load around `5.19 / 4.66 / 4.91` with CPU idle at `80%+` in most samples; the current load comes from normal platform work such as Gitea / ClickHouse / Docker / Kafka / StockPlatform / AWOOOI API / Sentry, not from orphan Chrome. Allowed declarations for this round: hosts, K3s, services, websites, product data freshness, backup core, and offsite freshness are green. Forbidden declarations: DR complete, credential escrow complete, 188 host fully green, Wazuh registry recovered.

@@ -1,34 +1,33 @@
# 188 Host Hygiene Maintenance Window Runbook

> Version: 2026-06-26.v1
> Version: 2026-06-26.v2
> Applies to: `192.168.0.188` host PostgreSQL `14/main`, `awoooi-startup.service`, certbot / ACME renewal hygiene.
> Status: `SERVICE_GREEN_HOST_HYGIENE_BLOCKED`
> Status: `HOST_188_HYGIENE_GREEN`

---

## 1. Purpose

188's product services and public routes are currently available, but host systemd is still `degraded`. This runbook separates "services recovered" from "host hygiene not yet converged", to avoid misreading route `200`, container healthy, `pg_isready`, or exporter green as the host being fully healthy.

188's product services, public routes, and host hygiene have all converged to green. This runbook keeps "service recovery", "host hygiene", "DR escrow", and "Wazuh registry" separate, to avoid misreading route `200`, container healthy, `pg_isready`, exporter green, or `reset-failed` as other layers also being complete.

This document is not a maintenance-window approval, nor a runtime action authorization. Until an owner, rollback owner, maintenance window, backup / restore evidence, and post-check plan are in place, do not run restart, reset-failed, renew, reload, restore, or `pg_resetwal`.

This document records the 188 host hygiene repair completed on 2026-06-26 and the operating boundaries that follow. It is not proof of DR escrow completion, nor a Wazuh / SOC runtime action authorization. Until an owner, rollback owner, maintenance window, backup / restore evidence, and post-check plan are in place, still do not perform data-layer break-glass, restore, unapproved renew, or any other host write.

---

## 2. Latest read-only evidence

2026-06-26 07:18-07:19 Asia/Taipei read-only evidence:

2026-06-26 12:02-12:13 Asia/Taipei live evidence:

| Layer | Evidence | Determination |
|------|------|------|
| 188 disk | `/` about `982G`, `712G` used, `230G` available, `76%` | Not out of capacity |
| Product containers | MOMO, VibeWork, 2026FIFA, ClawBot, MinIO, exporters, MOMO DB and other containers visibly healthy / up | Product services green |
| Host systemd | failed units: `awoooi-startup.service`, `postgresql@14-main.service`, `certbot.service`, `snap.certbot.renew.service` | host hygiene blocked |
| Host PostgreSQL | `pg_lsclusters` shows `14/main down` | host cluster unhealthy |
| PostgreSQL status | `invalid primary checkpoint record`, `PANIC: could not locate a valid checkpoint record` | checkpoint / WAL class data-layer error |
| Startup unit | previously attempted to run `pg_resetwal` as root, and failed | auto-repair logic must fail closed |
| certbot apt unit | `sentry.wooo.work` renewal rate-limited | do not hammer ACME repeatedly |
| certbot snap unit | `sentry.wooo.work` challenge failed | confirm ACME route / DNS / gateway owner first |
| public cert | shared `sentry.wooo.work` certificate valid until `2026-07-09 16:03:40 UTC` | there is time for a maintenance window; not expiring immediately |
| Host systemd | `systemctl is-system-running` is `running`, failed units `0 loaded units listed` | host hygiene green |
| Product containers | MOMO, VibeWork, 2026FIFA, ClawBot, MinIO, exporters, MOMO DB and other containers healthy / up | Product services green |
| PostgreSQL runtime | `k3s-postgres-recovery` container is `Running=true`, `NetworkMode=host`, restart policy `unless-stopped`; `pg_isready -h localhost -p 5432` accepting connections | production DB runtime green |
| Host PostgreSQL unit | `postgresql@14-main.service` is `Result=success / inactive` after reset; `pg_lsclusters` may still show the host cluster as down because of the recovery container PID model | no longer a service blocker; decided by the runtime-ready check |
| Startup unit | `/usr/local/bin/awoooi-startup.sh` now uses `postgres_runtime_ready()`, accepting an active host unit or the `k3s-postgres-recovery` host-network runtime; it no longer runs `pg_resetwal` automatically | fail-closed repair complete |
| Nginx / ACME route | HTTP-01 self-test tokens for `sentry.wooo.work`, `gitea.wooo.work`, `langfuse.wooo.work`, `signoz.wooo.work` are all served correctly from `/var/www/html` | challenge route green |
| certbot timers | duplicate apt `certbot.timer` disabled; snap `snap.certbot.renew.timer` enabled / active / waiting; `certbot.service` and `snap.certbot.renew.service` `Result=success / inactive` | renewal runner hygiene green |
| public cert | shared `sentry.wooo.work` certificate valid until `2026-07-09 16:03:40 UTC` | not yet claimed as renewed; waiting for snap timer / ACME window |

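The HTTP-01 self-test row can be reproduced with a check along these lines. This is a hedged sketch, not the procedure that was actually run: the token file name and its contents are hypothetical and supplied by the caller, and only the `/.well-known/acme-challenge/` path prefix comes from the ACME HTTP-01 challenge definition. The `fetch` parameter is injectable so the comparison logic can be tested without network access.

```python
from typing import Callable, Dict, Iterable
from urllib.request import urlopen

HOSTS = ("sentry.wooo.work", "gitea.wooo.work", "langfuse.wooo.work", "signoz.wooo.work")


def http_fetch(url: str) -> str:
    """Read a small plain-text body over public HTTP."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8").strip()


def acme_route_mismatches(
    hosts: Iterable[str],
    token_name: str,            # hypothetical self-test file name
    expected: str,              # non-secret self-test value written beforehand
    fetch: Callable[[str], str] = http_fetch,
) -> Dict[str, str]:
    """Return {host: problem} for hosts that do not serve the expected token."""
    problems: Dict[str, str] = {}
    for host in hosts:
        url = f"http://{host}/.well-known/acme-challenge/{token_name}"
        try:
            body = fetch(url)
        except OSError as exc:  # urllib's URLError / HTTPError derive from OSError
            problems[host] = f"fetch failed: {exc}"
            continue
        if body != expected:
            problems[host] = "unexpected body"
    return problems
```

An empty result means every listed host served the same token. It says nothing about whether a certificate was renewed, which is why the table keeps the public cert row separate.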
---
@@ -112,9 +111,9 @@ SSH_BATCH_MODE=yes bash scripts/reboot-recovery/188-host-hygiene-maintenance-che

```text
SERVICE_GREEN=1
HOST_HYGIENE_BLOCKED=1
HOST_HYGIENE_BLOCKED=0
RUNTIME_ACTION_AUTHORIZED=0
Result: SERVICE_GREEN_HOST_HYGIENE_BLOCKED. Open maintenance window; do not auto-repair.
Result: HOST_188_HYGIENE_GREEN.
```

### Phase A: Pre-maintenance read-only snapshot
@@ -169,17 +168,18 @@ curl -fsS https://awoooi.wooo.work/api/v1/health
| Lane | Completion | Notes |
|------|-------:|------|
| Service recovery | `100%` | routes, containers, cold-start, and backup core are green |
| 188 host hygiene evidence | `80%` | read-only root cause is clear; owner disposition is missing |
| PostgreSQL owner decision | `0%` | retire, restore, or break-glass not yet confirmed |
| certbot owner decision | `0%` | DNS / ACME / coverage evidence not yet accepted |
| runtime repair | `0%` | not approved, not executed |
| 188 host hygiene repair | `100%` | startup fail-closed, PostgreSQL runtime-ready check, systemd failed units, ACME route, and certbot timer hygiene have converged |
| PostgreSQL runtime source-of-truth | `100%` | production DB runtime is judged by the `k3s-postgres-recovery` host-network container + `pg_isready`, not by the false red from host `pg_lsclusters` |
| certbot / ACME route hygiene | `95%` | HTTP-01 route and timer split are fixed; a successful formal renew still waits for the snap timer / ACME window |
| DR escrow | `BLOCKED` | `escrow_missing=5`; 188 host green is not a substitute |
| Wazuh registry | `BLOCKED` | manager registry accepted `0`; route `200` or transport count is not a substitute |

---

## 8. Must not be claimed

- Do not claim 188 host fully green.
- Do not claim host PostgreSQL has recovered.
- Do not claim certbot renewal has recovered.
- Do not claim that `reset-failed` alone means the repair is complete.
- Do not claim this document approves any runtime action.
- Do not claim DR complete; credential escrow evidence is still missing `5`.
- Do not claim Wazuh registry recovered; manager registry accepted is still `0`.
- Do not claim the certbot certificate has been formally renewed; what is complete is the HTTP-01 route / timer hygiene / failed-unit cleanup.
- Do not treat `reset-failed` by itself as a completed repair; this round is complete because the startup source, Nginx ACME route, timer split, and postcheck converged together.
- Do not claim this document approves any new runtime action.

@@ -1,6 +1,6 @@
# One-Page Post-Reboot Host Check

> Version: v1.12
> Version: v1.13
> Last updated: 2026-06-26 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 post-reboot service recovery. 112 Kali / Wazuh / active scan is not part of this flow.

@@ -10,7 +10,7 @@

After any of the 110 / 120 / 121 / 188 hosts is powered on, shut down, rebooted, recovered from a power loss, put through a VMware console fsck, or hit by large-scale Docker / K3s rescheduling, run this page first and then decide whether to declare recovery.

Latest baseline: 2026-06-26 08:59 post-reboot declaration guard. `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` returns `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_HYGIENE_BLOCKED=1`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_COVERAGE_SCOPE=6`, `WAZUH_DIRECT_ACTIVE=2`, `WAZUH_NO_TRANSPORT=1`, `WAZUH_SSH_BLOCKED=3`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`, `WAZUH_DASHBOARD_INDEX_OK=3`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. `scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color` converts the summary into allowed / forbidden declarations: currently allowed are service, product data, backup core, and `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`; forbidden are `DR_COMPLETE`, `HOST_188_FULLY_GREEN`, `WAZUH_REGISTRY_RECOVERED`, and `RUNTIME_ACTION_AUTHORIZED`. Next, `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` expands `NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export` into three owner / evidence / forbidden-action checklists; the Wazuh checklist's `CURRENT_EVIDENCE` retains the registry accepted, coverage scope, direct active, no transport, SSH blocked, route, transport, Dashboard API, and index pattern states, so that route `200` or transport `6` is not misreported as registry recovered. `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color` further converts this into the `awoooi_post_reboot_next_gate_owner_packets_v1` JSON, pinning `dispatch_authorized=0`, `request_sent_count=0`, `owner_response_accepted_count=0`, `host_write_authorized=0`, `secret_value_collection_allowed=0`, `runtime_gate_count=0`; `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json` pins the three P0 gates, all `0 / false` boundaries, the forbidden secret payload / runtime action, and the no-false-green rules. Cold-start `PASS=89 WARN=0 BLOCKED=0`; MOMO `V10.701`, latest import job `57 completed`, `DB_DAILY_FRESHNESS 1|2026-06-24`; StockPlatform `/api/v1/system/freshness` is `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; backup-status 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `last_backup_all=2026-06-26 02:31:02`. DR still cannot be claimed complete because `escrow_missing=5`. 188 host hygiene and Wazuh manager registry remain independent blockers outside service green.

Latest baseline: 2026-06-26 12:13 post-reboot summary / declaration guard. `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` returns `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_HYGIENE_BLOCKED=0`, `HOST_188_RESULT=HOST_188_HYGIENE_GREEN.`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_COVERAGE_SCOPE=6`, `WAZUH_DIRECT_ACTIVE=2`, `WAZUH_NO_TRANSPORT=1`, `WAZUH_SSH_BLOCKED=3`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`, `WAZUH_DASHBOARD_INDEX_OK=3`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. `scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color` converts the summary into allowed / forbidden declarations: currently allowed are service, product data, backup core, 188 host hygiene green, and `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`; forbidden are `DR_COMPLETE`, `WAZUH_REGISTRY_RECOVERED`, and `RUNTIME_ACTION_AUTHORIZED`. Next, `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` expands `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export` into owner / evidence / forbidden-action checklists; the Wazuh checklist's `CURRENT_EVIDENCE` retains the registry accepted, coverage scope, direct active, no transport, SSH blocked, route, transport, Dashboard API, and index pattern states, so that route `200` or transport `6` is not misreported as registry recovered. `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color` further converts this into the `awoooi_post_reboot_next_gate_owner_packets_v1` JSON, pinning `dispatch_authorized=0`, `request_sent_count=0`, `owner_response_accepted_count=0`, `host_write_authorized=0`, `secret_value_collection_allowed=0`, `runtime_gate_count=0`; `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json` dynamically pins the P0 gates from the live `next_required_gates`, along with all `0 / false` boundaries, the forbidden secret payload / runtime action, and the no-false-green rules. DR still cannot be claimed complete because `escrow_missing=5`; Wazuh manager registry remains an independent blocker outside service green. The ACME HTTP-01 route / certbot timer hygiene is repaired, but a successful formal certificate renew must wait for the snap certbot timer or a separate ACME window readback.

This page answers only four things:

@@ -55,7 +55,7 @@ scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color
- `PRODUCT_DATA_GREEN=1`: MOMO / StockPlatform primary data freshness can be declared recovered.
- `BACKUP_CORE_GREEN=1`: backup core can be declared recovered.
- `DR_ESCROW_BLOCKED=1` / `ESCROW_MISSING_COUNT>0`: DR complete cannot be declared.
- `HOST_188_HYGIENE_BLOCKED=1`: 188 host hygiene needs a maintenance window; this does not mean product services are down.
- `HOST_188_HYGIENE_BLOCKED=0`: 188 host hygiene has converged; if it returns to `1` in the future, follow the 188-specific runbook.
- `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`: Wazuh management of all hosts cannot be declared recovered.
- `WAZUH_ROUTE_CODE=200` / `WAZUH_TRANSPORT_COUNT>0` only represent route / transport evidence; they must still be read together with `WAZUH_COVERAGE_SCOPE`, `WAZUH_DIRECT_ACTIVE`, `WAZUH_NO_TRANSPORT`, `WAZUH_SSH_BLOCKED`, `WAZUH_DASHBOARD_API_CONNECTION`, and `WAZUH_MANAGER_REGISTRY_ACCEPTED`.
- `RUNTIME_ACTION_AUTHORIZED=0`: this flow does not authorize runtime write operations.
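For tooling that consumes the summary, the keys listed above can be read with a small parser. This sketch assumes the summary prints one `KEY=VALUE` pair per line, which is an assumption about the output layout; the function names are hypothetical. The two helpers encode the reading rules stated in the list: route and transport evidence never stand in for registry acceptance, and DR completion needs both escrow keys clear.

```python
import re
from typing import Dict

# Upper-case key, '=', then the rest of the line as the value.
_KEY_VALUE = re.compile(r"^([A-Z][A-Z0-9_]*)=(.*)$")


def parse_summary(text: str) -> Dict[str, str]:
    """Collect KEY=VALUE lines; any other line (banners, 'Result: ...') is ignored."""
    summary: Dict[str, str] = {}
    for raw in text.splitlines():
        match = _KEY_VALUE.match(raw.strip())
        if match:
            summary[match.group(1)] = match.group(2)
    return summary


def wazuh_registry_recovered(summary: Dict[str, str]) -> bool:
    """Only the registry-accepted flag counts; route code and transport count do not."""
    return summary.get("WAZUH_MANAGER_REGISTRY_ACCEPTED") == "1"


def dr_complete_allowed(summary: Dict[str, str]) -> bool:
    """DR complete needs the blocker cleared and zero missing escrow items."""
    return (summary.get("DR_ESCROW_BLOCKED") == "0"
            and summary.get("ESCROW_MISSING_COUNT") == "0")
```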
@@ -73,7 +73,7 @@ scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color
scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color --proposed DR_COMPLETE --proposed WAZUH_REGISTRY_RECOVERED
```

Under the current baseline, `DR_COMPLETE`, `HOST_188_FULLY_GREEN`, `WAZUH_REGISTRY_RECOVERED`, and `RUNTIME_ACTION_AUTHORIZED` must all be rejected; any person, AI agent, or document proposing these declarations must first supply the corresponding evidence / owner acceptance.

Under the current baseline, `DR_COMPLETE`, `WAZUH_REGISTRY_RECOVERED`, and `RUNTIME_ACTION_AUTHORIZED` must all be rejected; any person, AI agent, or document proposing these declarations must first supply the corresponding evidence / owner acceptance. `HOST_188_FULLY_GREEN` may be declared only while the summary stays at `HOST_188_HYGIENE_BLOCKED=0` and the checklist reports `HOST_188_HYGIENE_GREEN`.

When the summary shows `SERVICE_GREEN=1` but `NEXT_REQUIRED_GATES` is still non-empty, run the next-gate dispatch checklist:

@@ -81,7 +81,7 @@ summary 顯示 `SERVICE_GREEN=1` 但仍有 `NEXT_REQUIRED_GATES` 時,再跑下
scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color
```

This script only turns `credential_escrow_evidence`, `host_188_hygiene_maintenance_window`, and `wazuh_manager_registry_export` into owner / required evidence / forbidden action / done criteria. It does not send requests, read secrets, write markers, or restart / reload / repair / import / delete / patch, and it does not authorize host / Wazuh / Nginx / K8s / DB runtime actions. If it outputs `SERVICE_GREEN=0`, return to service recovery first and do not enter boundary dispatch.

This script only turns the `NEXT_REQUIRED_GATES` in the current live summary into owner / required evidence / forbidden action / done criteria. From 2026-06-26 12:13 on, usually only `credential_escrow_evidence` and `wazuh_manager_registry_export` remain; `host_188_hygiene_maintenance_window` reappears only if 188 turns red again. It does not send requests, read secrets, write markers, or restart / reload / repair / import / delete / patch, and it does not authorize host / Wazuh / Nginx / K8s / DB runtime actions. If it outputs `SERVICE_GREEN=0`, return to service recovery first and do not enter boundary dispatch.

To hand this to AI / tickets / owner review, generate the machine-readable owner packet:

@@ -98,7 +98,7 @@ scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color --outp
scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json
```

The guard must output `POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=3 request_sent=0 accepted=0 runtime_gate=0`. If the gate count, the P0 gate ids, the `0 / false` fields, the forbidden secret payload, the 188 forbidden repair actions, the Wazuh forbidden active response / host write, or the no-false-green rules drift in any way, treat the result as `BLOCKED`: do not send an owner request, do not write an escrow marker, do not enter a maintenance window, and do not claim DR / Wazuh / 188 host hygiene complete.

The guard must output `POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=<live_next_gate_count> request_sent=0 accepted=0 runtime_gate=0`. The current expectation is `gates=2`; it becomes `gates=3` only if 188 hygiene returns to blocked. If the gate count, the P0 gate ids, the `0 / false` fields, the forbidden secret payload, the Wazuh forbidden active response / host write, or the no-false-green rules drift in any way, treat the result as `BLOCKED`: do not send an owner request, do not write an escrow marker, do not enter a maintenance window, and do not claim DR / Wazuh complete.

When more detail is needed, use the repo-side wrapper:

@@ -108,7 +108,7 @@ scripts/reboot-recovery/post-start-quick-check.sh --no-color
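
This wrapper performs read-only checks only and delegates to the existing cold-start / MOMO preflight / backup-status; it does not restart, reload, import, modify K8s, or read token contents. The wrapper splits warnings into three classes, `SERVICE`, `BOUNDARY`, and `EVIDENCE`, so that `escrow_missing>0` is not misread as service degradation. If the summary or wrapper fails because of an SSH permission or path problem, collect the evidence manually with the segmented commands below.

Since v1.6 the public route gate retries using `ROUTE_RETRY_ATTEMPTS` (default `3`) and `ROUTE_RETRY_DELAY_SECONDS` (default `2`). A single `000` / timeout that recovers after a retry should be recorded as an evidence warning or transient route evidence, not treated as the site still being down; only consecutive failures are a service blocker.

Since v1.6 the public route gate retries using `ROUTE_RETRY_ATTEMPTS` (default `3`) and `ROUTE_RETRY_DELAY_SECONDS` (default `2`). Since v1.13 the StockPlatform dedicated freshness gate uses `STOCK_FRESHNESS_RETRY_ATTEMPTS` (default `6`) and `STOCK_FRESHNESS_RETRY_DELAY_SECONDS` (default `5`), because after a 110 Docker / CI rollout it often needs tens of seconds more warmup than route health. A single `000` / timeout / `502` that recovers after a retry should be recorded as an evidence warning or transient evidence, not treated as the site or data still being broken; only consecutive failures are a service blocker.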
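The retry behaviour described above can be sketched as follows. This is an illustration in Python, not the code in `post-start-quick-check.sh`; the function names and the `PASS` / `WARN` / `BLOCKED` mapping are assumptions, and only the two environment variable names and their defaults come from the text. The `probe` and `sleep` parameters are injectable so the loop can be tested without real waiting.

```python
import os
import time
from typing import Callable, Mapping, Tuple


def stock_freshness_retry_settings(env: Mapping[str, str] = os.environ) -> Tuple[int, float]:
    """Read attempts / delay from the environment, falling back to the documented defaults."""
    attempts = int(env.get("STOCK_FRESHNESS_RETRY_ATTEMPTS", "6"))
    delay = float(env.get("STOCK_FRESHNESS_RETRY_DELAY_SECONDS", "5"))
    return attempts, delay


def retry_until_ok(
    probe: Callable[[], bool],
    attempts: int,
    delay_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Return the 1-based attempt that succeeded, or 0 if every attempt failed."""
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay_seconds)  # wait only between attempts, not after the last one
    return 0


def classify(success_attempt: int) -> str:
    """Assumed mapping: first-try success passes, recovered-after-retry warns, never-ok blocks."""
    if success_attempt == 0:
        return "BLOCKED"
    return "PASS" if success_attempt == 1 else "WARN"
```

With the defaults, a gate that needs warmup gets up to six probes with 25 seconds of waiting in between before it is treated as a blocker.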

Since v1.6, when `escrow_missing>0`, the credential escrow gate makes a read-only call to `/backup/scripts/mark-credential-escrow-verified.sh --status` and lists the missing items. This is only an evidence readback: it does not write markers, read passwords, or lower the DR blocker. Its purpose is to let the operator know immediately which non-secret evidence markers are missing.

@@ -11,23 +11,25 @@
| Area | Status | Completion | Evidence |
|------|--------|------------|----------|
|
||||
| Overall recovery readiness | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-26 07:19 read-only follow-up confirms service recovery remains green after the SOP commit reached production. Cold-start rerun returned `PASS=87 WARN=0 BLOCKED=0`, result `GREEN`; ArgoCD `awoooi-prod Synced / Healthy` at revision `1fd5e2a8b0f18d24eed16aa2a44286bcbf230603`; API/Web/Worker pods are Running with restart `0`; public routes return expected statuses including AWOOOI API `200`, AWOOOI Web `307`, VibeWork `200`, AwoooGo `200`, MOMO health `200`, Stock freshness `200`, Bitan `200`, Gitea `200`, Harbor `200`, Registry `/v2/` expected `401`, Sentry expected `302`, SigNoz `200`, Langfuse `200`. Backup-status 07:18 remains 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, offsite/rclone fresh, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`. 188 remains product-service green but host-hygiene blocked: host PostgreSQL `14/main` has `invalid primary checkpoint record` / `could not locate a valid checkpoint record`, certbot renew for `sentry.wooo.work` is rate-limited / challenge-failed, and `awoooi-startup.service` previously attempted `pg_resetwal` as root. New runbook `docs/runbooks/HOST-188-HYGIENE-MAINTENANCE-RUNBOOK.md` plus read-only script `scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh` define the maintenance-window decision tree without authorizing runtime repair. Do not declare DR complete until `escrow_missing=0`; Wazuh manager registry accepted remains `0`; 111/168 Codex workspace HEAD drift and Mac Mini low free space remain workstation blockers, not reboot service blockers. |
|
||||
| Overall recovery readiness | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-26 12:13 live summary confirms host / service / product data / backup / 188 host hygiene are green for the current evidence set. `post-reboot-readiness-summary.sh --no-color` returns `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=0`, `HOST_188_RESULT=HOST_188_HYGIENE_GREEN.`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, and `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export`. 188 startup now accepts the real `k3s-postgres-recovery` host-network PostgreSQL runtime, fails closed instead of running `pg_resetwal`, Nginx ACME HTTP-01 routes for `sentry/gitea/langfuse/signoz` are corrected, duplicate apt certbot timer is disabled, snap certbot timer remains enabled, and failed units are cleared. Do not declare DR complete until `escrow_missing=0`; do not declare Wazuh registry recovered until manager registry accepted is `1`; certbot formal renewal success still requires snap timer / ACME window readback. |
|
||||
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-26 07:19 readback shows 120 and 121 reachable, K3s active, `mon` and `mon1` both `Ready control-plane`, AWOOOI API/Web replicas split across both nodes, ArgoCD `awoooi-prod Synced / Healthy` at revision `1fd5e2a8b0f18d24eed16aa2a44286bcbf230603`, and `km-vectorize` official 03:00 Taipei time run succeeded with `lastSuccess=2026-06-25T19:00:14Z`. |
|
||||
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-26 06:58 backup readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`, last aggregate `2026-06-26 02:31:02`. DR remains blocked on real non-secret credential escrow evidence IDs; do not write placeholder markers or paste secret values. |
|
||||
| P2 service / data truth | DONE | 100% | Service routes and core runtime are available, 110 current CPU pressure is attributable to active AWOOOI Web `turbo build` / Docker buildx, and previous orphan Chrome groups remain cleared. 2026-06-26 07:19 StockPlatform `/api/v1/system/freshness` returned `200`; 07:01 freshness payload was `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; price / chips / margin / AI recommendations are all on `2026-06-25`. `ai.recommendations` row count is `2868`; `core.margin_short_daily` row count is `1976`. MOMO health `V10.699`, current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, and `MOMO_DAILY_FRESHNESS 1|2026-06-24` are green; expanded public routes are green. |
|
||||
| P3 docs / automation contracts | DONE_WITH_DECLARATION_GUARD_V172 | 100% | Workplan, SOP v1.72, post-reboot declaration guard, machine-readable post-reboot readiness summary with Wazuh registry detail fields, post-reboot next-gate dispatch checklist, owner-packet JSON generator, owner-packet contract guard, one-page post-start quick check v1.12, route retry gate, deploy warmup classification, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, StockPlatform natural schedule green evidence, 110 orphan Chrome recurrence cleanup evidence, 188 fail-closed startup data recovery gate, 188 host hygiene read-only checklist, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and stricter product-data / route retry gates are updated. Declaration guard now machine-checks allowed / forbidden recovery statements: service/data/backup green may be declared, while `DR_COMPLETE`, `HOST_188_FULLY_GREEN`, `WAZUH_REGISTRY_RECOVERED` and `RUNTIME_ACTION_AUTHORIZED` remain forbidden until evidence gates close. Live 110 script sync remains a separate approved live-write gate; do not claim it here. |
|
||||
| P3 docs / automation contracts | DONE_WITH_DECLARATION_GUARD_V173 | 100% | Workplan, SOP v1.73, post-reboot declaration guard, machine-readable post-reboot readiness summary with Wazuh registry detail fields, post-reboot next-gate dispatch checklist, owner-packet JSON generator, dynamic owner-packet contract guard, one-page post-start quick check v1.13, route retry gate, deploy warmup classification, expanded public route list, StockPlatform freshness gate, StockPlatform cron-source recovery evidence, StockPlatform natural schedule green evidence, 110 orphan Chrome recurrence cleanup evidence, 188 fail-closed startup data recovery gate, 188 host hygiene read-only checklist, 188 PostgreSQL runtime-ready source-of-truth, 188 ACME route/timer hygiene, baseline `stockplatform_system_freshness_ok`, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, exporter restore helpers, 110 Docker disk pressure cleanup boundary, notification-noise readback, MOMO import-boundary / Drive-auth fail-closed deploys, product version/readback matrix, and stricter product-data / route retry gates are updated. Declaration guard now machine-checks allowed / forbidden recovery statements: service/data/backup/188 host hygiene green may be declared when live summary says so, while `DR_COMPLETE`, `WAZUH_REGISTRY_RECOVERED` and `RUNTIME_ACTION_AUTHORIZED` remain forbidden until evidence gates close. Live 110 script sync remains a separate approved live-write gate; do not claim it here. |
|
||||
|
||||
2026-06-26 07:47 machine-readable summary baseline: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` stores delegated logs under `/tmp/awoooi-post-reboot-readiness-20260626-074702` and returns `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_SERVICE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=1`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_RUNTIME_GATE=0`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, and `NEXT_REQUIRED_GATES=credential_escrow_evidence,host_188_hygiene_maintenance_window,wazuh_manager_registry_export`. This is now the preferred first operator/AI-agent entrypoint after reboot because it separates service health from DR, host hygiene, and security registry evidence.
|
||||
2026-06-26 12:13 machine-readable summary baseline supersedes the 07:47 / 08:59 gate set: `scripts/reboot-recovery/post-reboot-readiness-summary.sh --no-color` stores delegated logs under `/tmp/awoooi-post-reboot-readiness-20260626-121303` and returns `SERVICE_GREEN=1`, `PRODUCT_DATA_GREEN=1`, `BACKUP_CORE_GREEN=1`, `DR_ESCROW_BLOCKED=1`, `ESCROW_MISSING_COUNT=5`, `HOST_188_SERVICE_GREEN=1`, `HOST_188_HYGIENE_BLOCKED=0`, `HOST_188_CHECK_RC=0`, `HOST_188_RESULT=HOST_188_HYGIENE_GREEN.`, `WAZUH_ROUTE_CODE=200`, `WAZUH_TRANSPORT_COUNT=6`, `WAZUH_COVERAGE_SCOPE=6`, `WAZUH_DIRECT_ACTIVE=2`, `WAZUH_NO_TRANSPORT=1`, `WAZUH_SSH_BLOCKED=3`, `WAZUH_DASHBOARD_API_CONNECTION=pending_or_spinning`, `WAZUH_DASHBOARD_INDEX_OK=3`, `WAZUH_MANAGER_REGISTRY_ACCEPTED=0`, `WAZUH_RUNTIME_GATE=0`, `RUNTIME_ACTION_AUTHORIZED=0`, `OVERALL_DECLARATION=FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, and `NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export`. This is now the preferred first operator/AI-agent entrypoint after reboot because it separates service health from DR and security registry evidence; 188 host hygiene is no longer a next gate unless the live checklist regresses.
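A consumer of this summary might read it roughly as follows. This is a minimal sketch that assumes the script prints one `KEY=VALUE` pair per line among other log lines; the helper names `parse_summary` and `next_required_gates` are hypothetical.

```python
def parse_summary(text: str) -> dict[str, str]:
    """Collect upper-case KEY=VALUE lines from summary output; ignore everything else."""
    summary: dict[str, str] = {}
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition("=")
        # Keep only lines that look like SOME_FLAG=value.
        if sep and key.isidentifier() and key.isupper():
            summary[key] = value
    return summary


def next_required_gates(summary: dict[str, str]) -> list[str]:
    """Split NEXT_REQUIRED_GATES into a list; an empty or missing value yields []."""
    return [gate for gate in summary.get("NEXT_REQUIRED_GATES", "").split(",") if gate]


if __name__ == "__main__":
    sample = (
        "SERVICE_GREEN=1\n"
        "HOST_188_HYGIENE_BLOCKED=0\n"
        "ESCROW_MISSING_COUNT=5\n"
        "NEXT_REQUIRED_GATES=credential_escrow_evidence,wazuh_manager_registry_export\n"
    )
    print(next_required_gates(parse_summary(sample)))
```

Deriving the gate list from the live output, instead of hard-coding it, is what lets `host_188_hygiene_maintenance_window` drop out once the checklist is green and reappear if it regresses.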
|
||||
|
||||
2026-06-26 08:12 next-gate dispatch baseline: `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` reads the same summary path and emits three explicit P0 dispatch checklists without sending requests or changing runtime. `credential_escrow_evidence` requires non-secret evidence id / owner / reviewer for five escrow items and rejects password / token / secret value / hash / prefix / suffix / raw credential payloads. `host_188_hygiene_maintenance_window` requires PostgreSQL `14/main` decision, DNS / TLS / certbot path, startup unit source-of-truth, rollback owner, postcheck owner, and blocks unapproved `pg_resetwal` / certbot renew / Nginx reload / DB restore / Docker restart / host file writes. `wazuh_manager_registry_export` requires redacted registry counts, per-host alias status, dashboard API / version status, time window, and reviewer while blocking raw agent names, internal IPs, client keys, Wazuh payloads, active response, re-enroll, restart, secret patch, host write, and Kali active scan. Output fixed `NEXT_GATE_COUNT=3`, `REQUEST_SENT_COUNT=0`, `DISPATCH_AUTHORIZED=0`, `HOST_WRITE_AUTHORIZED=0`, `SECRET_VALUE_COLLECTION_ALLOWED=0`, `RUNTIME_ACTION_AUTHORIZED=0`.
|
||||
2026-06-26 07:47 machine-readable summary baseline is retained as historical evidence only. It showed `HOST_188_HYGIENE_BLOCKED=1` and three next gates before the 188 startup / ACME / certbot hygiene repair. Do not use the 07:47 gate set as the current status.
|
||||
|
||||
2026-06-26 08:29 owner-packet JSON baseline: `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color` emits `schema_version=awoooi_post_reboot_next_gate_owner_packets_v1` with `next_gate_count=3`, `p0_gate_count=3`, `request_sent_count=0`, `owner_response_received_count=0`, `owner_response_accepted_count=0`, `runtime_action_authorized_count=0`. This packet is for AI / operator / owner review intake only; it does not send request, write credential marker, read secret, or authorize runtime action.
|
||||
2026-06-26 12:13 next-gate dispatch baseline: `scripts/reboot-recovery/post-reboot-next-gate-dispatch.sh --no-color` now emits only the gates present in the current summary. Current expected gates are `credential_escrow_evidence` and `wazuh_manager_registry_export`, with `NEXT_GATE_COUNT=2`, `REQUEST_SENT_COUNT=0`, `DISPATCH_AUTHORIZED=0`, `HOST_WRITE_AUTHORIZED=0`, `SECRET_VALUE_COLLECTION_ALLOWED=0`, `RUNTIME_ACTION_AUTHORIZED=0`. If 188 hygiene regresses, `host_188_hygiene_maintenance_window` will reappear automatically.
|
||||
|
||||
2026-06-26 08:40 owner-packet contract guard baseline: `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json` validates the generated JSON before any owner review intake. It requires exactly three P0 gates, preserves `request_sent=0`, `owner_response_received=0`, `owner_response_accepted=0`, `runtime_action_authorized=0`, `host_write_authorized=0`, `secret_value_collection_allowed=0`, `runtime_gate=0`, and rejects packets that are missing the forbidden payload/action controls for credential escrow, 188 host hygiene, and Wazuh registry export. Expected success line: `POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=3 request_sent=0 accepted=0 runtime_gate=0`.
|
||||
2026-06-26 12:13 owner-packet JSON baseline: `scripts/reboot-recovery/post-reboot-next-gate-owner-packets.py --no-color` emits `schema_version=awoooi_post_reboot_next_gate_owner_packets_v1` with dynamic `next_gate_count=2`, `p0_gate_count=2`, `request_sent_count=0`, `owner_response_received_count=0`, `owner_response_accepted_count=0`, `runtime_action_authorized_count=0`. This packet is for AI / operator / owner review intake only; it does not send request, write credential marker, read secret, or authorize runtime action.
|
||||
|
||||
2026-06-26 12:13 owner-packet contract guard baseline: `scripts/reboot-recovery/post-reboot-owner-packet-contract-guard.py --packet-file /tmp/awoooi-post-reboot-owner-packets.json` validates the generated JSON before any owner review intake. It requires the packet gates to equal the live `source.next_required_gates`, preserves `request_sent=0`, `owner_response_received=0`, `owner_response_accepted=0`, `runtime_action_authorized=0`, `host_write_authorized=0`, `secret_value_collection_allowed=0`, `runtime_gate=0`, and rejects packets that are missing the forbidden payload/action controls for active gates. Current expected success line: `POST_REBOOT_OWNER_PACKET_CONTRACT_GUARD_OK gates=2 request_sent=0 accepted=0 runtime_gate=0`.
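The dynamic gate-set rule can be illustrated with a simplified, standalone sketch. The fields `packet_id`, `source.next_required_gates`, `next_gate_count`, and `p0_gate_count` appear in this commit; the top-level keys `owner_packets` and `counts`, and the function name `validate_gate_set`, are assumptions made for illustration and may not match the real packet layout.

```python
from typing import Any

KNOWN_GATES = {
    "credential_escrow_evidence",
    "host_188_hygiene_maintenance_window",
    "wazuh_manager_registry_export",
}


def validate_gate_set(packet: dict[str, Any]) -> list[str]:
    """Return a list of failure strings; an empty list means the gate set is consistent."""
    failures: list[str] = []

    owner_packets = packet.get("owner_packets") or []  # assumed key name
    gate_ids = {
        str(item.get("packet_id", "")) for item in owner_packets if isinstance(item, dict)
    }

    # A gate id outside the known vocabulary is always a failure.
    unknown = sorted(gate_ids - KNOWN_GATES)
    if unknown:
        failures.append(f"unknown_gate_ids={unknown}")

    source = packet.get("source", {})
    if not isinstance(source, dict):
        failures.append("source_not_object")
        source = {}

    # The packet must carry exactly the gates the live summary still requires.
    expected = {str(gate) for gate in (source.get("next_required_gates") or [])}
    if expected != gate_ids:
        failures.append(f"gate_ids={sorted(gate_ids)}")

    counts = packet.get("counts")  # assumed key name
    if not isinstance(counts, dict):
        counts = {}
    for key in ("next_gate_count", "p0_gate_count"):
        if counts.get(key) != len(gate_ids):
            failures.append(f"{key}={counts.get(key)}")
    return failures
```

With this shape, a packet generated while 188 hygiene is green validates with two gates, and the same guard accepts three gates again if the live summary starts listing `host_188_hygiene_maintenance_window`.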
|
||||
|
||||
2026-06-26 08:47 Wazuh registry detail summary baseline: post-reboot readiness summary now emits `WAZUH_COVERAGE_SCOPE`, `WAZUH_DIRECT_ACTIVE`, `WAZUH_NO_TRANSPORT`, `WAZUH_SSH_BLOCKED`, `WAZUH_DASHBOARD_API_CONNECTION`, and `WAZUH_DASHBOARD_INDEX_OK` alongside existing route / transport / registry fields. Current read-only truth is coverage scope `6`, direct active `2`, no transport `1`, SSH blocked `3`, route `200`, transport `6`, Dashboard API `pending_or_spinning`, index OK `3`, manager registry accepted `0`, runtime gate `0`. This is a security evidence blocker, not a reboot service blocker.
|
||||
|
||||
2026-06-26 08:59 declaration guard baseline: `scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color` emits `schema_version=awoooi_post_reboot_declaration_guard_v1`, status `allowed_with_boundary_blockers`, allowed declarations `BACKUP_CORE_GREEN`, `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, `PRODUCT_DATA_GREEN`, `SERVICE_RECOVERY_GREEN`, and forbidden declarations `DR_COMPLETE`, `HOST_188_FULLY_GREEN`, `WAZUH_REGISTRY_RECOVERED`, `RUNTIME_ACTION_AUTHORIZED`. Proposed false-green declarations are rejected before they can enter LOGBOOK / owner packets / external status updates.
|
||||
2026-06-26 12:13 declaration guard baseline: `scripts/reboot-recovery/post-reboot-declaration-guard.py --no-color` emits `schema_version=awoooi_post_reboot_declaration_guard_v1`, status `allowed_with_boundary_blockers`, allowed declarations including service / product data / backup / 188 host hygiene green for this evidence set, and forbidden declarations `DR_COMPLETE`, `WAZUH_REGISTRY_RECOVERED`, `RUNTIME_ACTION_AUTHORIZED`. Proposed false-green declarations are rejected before they can enter LOGBOOK / owner packets / external status updates.
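One way to picture the false-green rejection is a table of evidence conditions keyed by declaration name. The sketch below is an assumption about how such a check could be driven by the readiness summary; the `RULES` table and `rejected_declarations` are hypothetical and are not the guard's actual logic.

```python
from typing import Callable

Summary = dict[str, str]

# Each guarded declaration is allowed only when its evidence condition holds.
# The conditions mirror the readiness-summary fields named in this document.
RULES: dict[str, Callable[[Summary], bool]] = {
    "DR_COMPLETE": lambda s: s.get("ESCROW_MISSING_COUNT") == "0",
    "WAZUH_REGISTRY_RECOVERED": lambda s: s.get("WAZUH_MANAGER_REGISTRY_ACCEPTED") == "1",
    "RUNTIME_ACTION_AUTHORIZED": lambda s: s.get("RUNTIME_ACTION_AUTHORIZED") == "1",
    "HOST_188_FULLY_GREEN": lambda s: s.get("HOST_188_HYGIENE_BLOCKED") == "0",
}


def rejected_declarations(proposed: set[str], summary: Summary) -> list[str]:
    """Return the proposed declarations whose evidence gate is still open, sorted."""
    return sorted(
        name for name in proposed if name in RULES and not RULES[name](summary)
    )
```

A missing summary field fails closed here: `s.get(...)` returns `None`, the comparison is false, and the declaration is rejected.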
|
||||
|
||||
2026-06-26 07:39 live quick-check refresh supersedes the 07:19 row for current operator status. `scripts/reboot-recovery/post-start-quick-check.sh --no-color` returned `POST_START_QUICK_CHECK PASS=38 WARN=3 BLOCKED=0`, warning split `SERVICE=0 BOUNDARY=1 EVIDENCE=2`, result `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`. Delegated cold-start returned `PASS=89 WARN=0 BLOCKED=0`; four reboot-scope hosts ping/SSH were OK; AWOOOI / VibeWork / AwoooGo / 2026FIFA / Agent Bounty / MOMO / Stock / Bitan / TsenYang / VTuber / Gitea / Harbor / Registry / Sentry / SigNoz / Langfuse / AIOps routes returned expected 2xx/3xx. MOMO `V10.701` has job `57 completed`, daily freshness `1|2026-06-24`, and current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`. StockPlatform freshness is `ok` through `2026-06-25` with price / chips / margin / AI recommendations current. Backup core remains green: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, offsite/rclone fresh, `last_backup_all=2026-06-26 02:31:02`; DR still has `escrow_missing=5`. 110 load around `5.19 / 4.66 / 4.91` is attributable to normal platform processes, not orphan Chrome. 188 host hygiene remains blocked by failed host PostgreSQL / certbot / startup units and must use the dedicated maintenance runbook and read-only checklist.
|
||||
|
||||
|
||||
@@ -86,6 +86,10 @@ server {
|
||||
listen 80;
|
||||
server_name signoz.wooo.work;
|
||||
|
||||
location /.well-known/acme-challenge/ {
|
||||
root /var/www/html;
|
||||
}
|
||||
|
||||
location / {
|
||||
proxy_pass http://127.0.0.1:3301;
|
||||
proxy_http_version 1.1;
|
||||
|
||||
@@ -9,10 +9,22 @@ server {
|
||||
server_name
|
||||
gitea.wooo.work
|
||||
sentry.wooo.work
|
||||
langfuse.wooo.work
|
||||
langfuse.wooo.work;
|
||||
|
||||
location /.well-known/acme-challenge/ {
|
||||
root /var/www/html;
|
||||
}
|
||||
|
||||
location / {
|
||||
return 301 https://$host$request_uri;
|
||||
}
|
||||
}
|
||||
|
||||
server {
|
||||
listen 80;
|
||||
server_name
|
||||
harbor.wooo.work
|
||||
registry.wooo.work
|
||||
stock.wooo.work;
|
||||
registry.wooo.work;
|
||||
|
||||
location /.well-known/acme-challenge/ {
|
||||
root /var/www/certbot;
|
||||
|
||||
@@ -149,17 +149,35 @@ if out=$(ssh_cmd "$REMOTE_188" '
|
||||
pg_lsclusters 2>/dev/null || true
|
||||
systemctl status postgresql@14-main.service --no-pager || true
|
||||
echo "PG_ISREADY_LOCAL $(pg_isready -h localhost -p 5432 2>/dev/null || true)"
|
||||
echo "RECOVERY_CONTAINER $(docker inspect -f "{{.State.Running}} {{.HostConfig.NetworkMode}} {{.HostConfig.RestartPolicy.Name}}" k3s-postgres-recovery 2>/dev/null || echo missing)"
|
||||
' 2>&1); then
|
||||
echo "$out"
|
||||
recovery_container_ready=0
|
||||
if grep -q '^RECOVERY_CONTAINER true host ' <<<"$out" && grep -q 'PG_ISREADY_LOCAL .*accepting connections' <<<"$out"; then
|
||||
recovery_container_ready=1
|
||||
fi
|
||||
|
||||
if grep -Eq '^14[[:space:]]+main[[:space:]]+5432[[:space:]]+down' <<<"$out"; then
|
||||
blocked "host PostgreSQL cluster 14/main is down"
|
||||
if [[ "$recovery_container_ready" -eq 1 ]]; then
|
||||
warn "host PostgreSQL cluster 14/main is down, but controlled k3s-postgres-recovery runtime is accepting connections"
|
||||
else
|
||||
blocked "host PostgreSQL cluster 14/main is down and no controlled recovery runtime was accepted"
|
||||
fi
|
||||
else
|
||||
ok "host PostgreSQL cluster 14/main not reported down"
|
||||
fi
|
||||
|
||||
if grep -Eiq 'invalid primary checkpoint record|could not locate a valid checkpoint record|PANIC:' <<<"$out"; then
|
||||
blocked "PostgreSQL checkpoint/WAL error detected; pg_resetwal is break-glass only"
|
||||
if [[ "$recovery_container_ready" -eq 1 ]]; then
|
||||
warn "PostgreSQL checkpoint/WAL error remains historical host-cluster evidence; pg_resetwal is still break-glass only"
|
||||
else
|
||||
blocked "PostgreSQL checkpoint/WAL error detected; pg_resetwal is break-glass only"
|
||||
fi
|
||||
fi
|
||||
if grep -q 'PG_ISREADY_LOCAL .*accepting connections' <<<"$out"; then
|
||||
|
||||
if [[ "$recovery_container_ready" -eq 1 ]]; then
|
||||
ok "PostgreSQL runtime is provided by k3s-postgres-recovery on host network"
|
||||
elif grep -q 'PG_ISREADY_LOCAL .*accepting connections' <<<"$out"; then
|
||||
warn "pg_isready accepts on localhost; do not use this alone as host 14/main health"
|
||||
fi
|
||||
else
|
||||
@@ -169,12 +187,30 @@ fi
|
||||
|
||||
section "188 certbot / ACME"
|
||||
if out=$(ssh_cmd "$REMOTE_188" '
|
||||
systemctl status certbot.service --no-pager || true
|
||||
systemctl status snap.certbot.renew.service --no-pager || true
|
||||
systemctl show certbot.service snap.certbot.renew.service certbot.timer snap.certbot.renew.timer -p Id -p ActiveState -p SubState -p Result -p UnitFileState --no-pager || true
|
||||
systemctl list-timers --all --no-pager | grep -i certbot || true
|
||||
' 2>&1); then
|
||||
echo "$out"
|
||||
grep -Eiq 'rateLimited|Service busy' <<<"$out" && blocked "certbot renewal is rate-limited; do not retry blindly"
|
||||
grep -Eiq 'Some challenges have failed|challenge' <<<"$out" && blocked "certbot challenge failure requires DNS / ACME route owner evidence"
|
||||
if grep -q 'Id=certbot.service' <<<"$out" && grep -A3 'Id=certbot.service' <<<"$out" | grep -q 'Result=failed'; then
|
||||
blocked "apt certbot service currently failed"
|
||||
else
|
||||
ok "apt certbot service is not currently failed"
|
||||
fi
|
||||
if grep -q 'Id=snap.certbot.renew.service' <<<"$out" && grep -A3 'Id=snap.certbot.renew.service' <<<"$out" | grep -q 'Result=failed'; then
|
||||
blocked "snap certbot renew service currently failed"
|
||||
else
|
||||
ok "snap certbot renew service is not currently failed"
|
||||
fi
|
||||
if grep -A4 'Id=certbot.timer' <<<"$out" | grep -q 'UnitFileState=disabled'; then
|
||||
ok "legacy apt certbot timer disabled to avoid duplicate renewals"
|
||||
else
|
||||
warn "legacy apt certbot timer is not disabled"
|
||||
fi
|
||||
if grep -A4 'Id=snap.certbot.renew.timer' <<<"$out" | grep -q 'ActiveState=active' && grep -A4 'Id=snap.certbot.renew.timer' <<<"$out" | grep -q 'UnitFileState=enabled'; then
|
||||
ok "snap certbot renew timer enabled"
|
||||
else
|
||||
blocked "snap certbot renew timer is not enabled and active"
|
||||
fi
|
||||
else
|
||||
blocked "certbot status unavailable"
|
||||
echo "$out"
|
||||
@@ -223,7 +259,27 @@ else
|
||||
fi
|
||||
|
||||
section "Maintenance decision tree"
|
||||
cat <<'STEPS'
|
||||
if [ "$SERVICE_GREEN" -eq 1 ] && [ "$HOST_HYGIENE_BLOCKED" -eq 0 ]; then
|
||||
cat <<'STEPS'
|
||||
Current expected outcome:
|
||||
SERVICE_GREEN=1
|
||||
HOST_HYGIENE_BLOCKED=0
|
||||
RESULT=HOST_188_HYGIENE_GREEN
|
||||
|
||||
Allowed next step:
|
||||
1. Keep this host in the normal post-reboot summary.
|
||||
2. Wait for snap certbot timer / ACME-window readback before declaring formal certificate renewal success.
|
||||
3. Keep DR credential escrow and Wazuh registry evidence as separate blockers.
|
||||
|
||||
Forbidden without separate approval:
|
||||
- pg_resetwal
|
||||
- DB restore
|
||||
- Docker/systemd restart
|
||||
- firewall change
|
||||
- Wazuh active response or agent re-enroll
|
||||
STEPS
|
||||
else
|
||||
cat <<'STEPS'
|
||||
Current expected outcome when 188 service is green but host hygiene is not:
|
||||
SERVICE_GREEN=1
|
||||
HOST_HYGIENE_BLOCKED=1
|
||||
@@ -244,6 +300,7 @@ Forbidden without maintenance approval:
|
||||
- Docker/systemd restart
|
||||
- host file write
|
||||
STEPS
|
||||
fi
|
||||
|
||||
echo
|
||||
echo "SERVICE_GREEN=$SERVICE_GREEN"
|
||||
|
||||
@@ -14,8 +14,9 @@ Description=AWOOOI Auto-Startup Recovery Sequence
|
||||
After=network-online.target containerd.service docker.service
|
||||
Wants=network-online.target
|
||||
|
||||
# Ensure PostgreSQL attempts to start as early as possible
|
||||
Wants=postgresql@14-main.service redis-server.service ollama.service nginx.service
|
||||
# PostgreSQL may be provided by the controlled recovery container; do not force-start
|
||||
# postgresql@14-main.service during startup, to avoid racing the recovery runtime or triggering a false-green repair.
|
||||
Wants=redis-server.service ollama.service nginx.service
|
||||
|
||||
[Service]
|
||||
Type=oneshot
|
||||
|
||||
@@ -3,6 +3,8 @@
|
||||
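# 2026-04-04 ogt: created from real incidents; handles container / Docker startup order and K3s Kine maintenance.
|
||||
# 2026-06-26 Codex: PostgreSQL checkpoint/WAL errors are now fail-closed;
|
||||
# pg_resetwal is not run inside the auto-startup script, so data destruction is not mistaken for recovery.
|
||||
# 2026-06-26 Codex: allow the controlled recovery container to provide the 14/main runtime;
|
||||
# a failed systemd postgresql@14-main no longer causes a live DB to be misjudged as unavailable.
|
||||
# Deployed at: /usr/local/bin/awoooi-startup.sh (on 192.168.0.188)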
|
||||
# systemd unit: /etc/systemd/system/awoooi-startup.service
|
||||
|
||||
@@ -12,6 +14,22 @@ exec > >(tee -a "$LOG") 2>&1
|
||||
|
||||
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }
|
||||
|
||||
postgres_runtime_ready() {
|
||||
if systemctl is-active postgresql@14-main >/dev/null 2>&1; then
|
||||
log "✅ PostgreSQL systemd unit active"
|
||||
return 0
|
||||
fi
|
||||
|
||||
if docker inspect -f '{{.State.Running}} {{.HostConfig.NetworkMode}}' k3s-postgres-recovery 2>/dev/null | grep -q '^true host$'; then
|
||||
if pg_isready -h localhost -p 5432 >/dev/null 2>&1; then
|
||||
log "✅ PostgreSQL recovery container active on host network"
|
||||
return 0
|
||||
fi
|
||||
fi
|
||||
|
||||
return 1
|
||||
}
|
||||
|
||||
log "=== AWOOOI 啟動序列開始 ==="
|
||||
|
||||
# ──────────────────────────────────────────────
|
||||
@@ -73,20 +91,20 @@ fi
|
||||
# ──────────────────────────────────────────────
|
||||
log "[3/7] 檢查 PostgreSQL..."
|
||||
|
||||
if ! systemctl is-active postgresql@14-main >/dev/null 2>&1; then
|
||||
if ! postgres_runtime_ready; then
|
||||
log "PostgreSQL 未啟動,嘗試啟動..."
|
||||
systemctl start postgresql@14-main || true
|
||||
sleep 8
|
||||
fi
|
||||
|
||||
if ! systemctl is-active postgresql@14-main >/dev/null 2>&1; then
|
||||
if ! postgres_runtime_ready; then
|
||||
log "PostgreSQL 啟動失敗,檢查是否屬於 checkpoint/WAL 類資料層錯誤..."
|
||||
if journalctl -u postgresql@14-main -n 20 | grep -q "could not locate a valid checkpoint"; then
|
||||
log "❌ 偵測到 PostgreSQL checkpoint/WAL 錯誤;禁止自動 pg_resetwal。"
|
||||
log "需要 DB owner、備份/restore evidence、maintenance window 與 post-check 後才能人工處理。"
|
||||
exit 1
|
||||
fi
|
||||
systemctl is-active postgresql@14-main && log "✅ PostgreSQL 修復成功" || { log "❌ PostgreSQL 修復失敗"; exit 1; }
|
||||
postgres_runtime_ready && log "✅ PostgreSQL runtime 可用" || { log "❌ PostgreSQL 修復失敗"; exit 1; }
|
||||
fi
|
||||
|
||||
# Wait for PG to accept connections
|
||||
|
||||
@@ -23,7 +23,7 @@ OWNER_PACKET_GENERATOR = (
|
||||
)
|
||||
|
||||
EXPECTED_SCHEMA = "awoooi_post_reboot_next_gate_owner_packets_v1"
|
||||
EXPECTED_GATES = {
|
||||
KNOWN_GATES = {
|
||||
"credential_escrow_evidence",
|
||||
"host_188_hygiene_maintenance_window",
|
||||
"wazuh_manager_registry_export",
|
||||
@@ -187,12 +187,21 @@ def validate_packet(packet: dict[str, Any]) -> list[str]:
|
||||
counts = {}
|
||||
|
||||
gate_ids = {str(item.get("packet_id", "")) for item in owner_packets if isinstance(item, dict)}
|
||||
if gate_ids != EXPECTED_GATES:
|
||||
unknown_gates = sorted(gate_ids - KNOWN_GATES)
|
||||
if unknown_gates:
|
||||
failures.append(f"unknown_gate_ids={unknown_gates}")
|
||||
|
||||
source = packet.get("source", {})
|
||||
if not isinstance(source, dict):
|
||||
failures.append("source_not_object")
|
||||
source = {}
|
||||
expected_gates = set(str(item) for item in as_list(source.get("next_required_gates")))
|
||||
if expected_gates != gate_ids:
|
||||
failures.append(f"gate_ids={sorted(gate_ids)}")
|
||||
|
||||
expected_counts = {
|
||||
"next_gate_count": 3,
|
||||
"p0_gate_count": 3,
|
||||
"next_gate_count": len(gate_ids),
|
||||
"p0_gate_count": len(gate_ids),
|
||||
}
|
||||
for key, expected in expected_counts.items():
|
||||
if counts.get(key) != expected:
|
||||
|
||||
@@ -9,6 +9,8 @@ ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
|
||||
SSH_CONNECT_TIMEOUT="${SSH_CONNECT_TIMEOUT:-6}"
|
||||
ROUTE_RETRY_ATTEMPTS="${ROUTE_RETRY_ATTEMPTS:-3}"
|
||||
ROUTE_RETRY_DELAY_SECONDS="${ROUTE_RETRY_DELAY_SECONDS:-2}"
|
||||
STOCK_FRESHNESS_RETRY_ATTEMPTS="${STOCK_FRESHNESS_RETRY_ATTEMPTS:-6}"
|
||||
STOCK_FRESHNESS_RETRY_DELAY_SECONDS="${STOCK_FRESHNESS_RETRY_DELAY_SECONDS:-5}"
|
||||
RUN_COLD_START=1
|
||||
RUN_MOMO=1
|
||||
RUN_STOCK=1
|
||||
@@ -76,6 +78,8 @@ Options:
|
||||
Environment:
|
||||
ROUTE_RETRY_ATTEMPTS Public route attempts before blocking. Default: 3.
|
||||
ROUTE_RETRY_DELAY_SECONDS Delay between failed public route attempts. Default: 2.
|
||||
STOCK_FRESHNESS_RETRY_ATTEMPTS Stock freshness attempts before blocking. Default: 6.
|
||||
STOCK_FRESHNESS_RETRY_DELAY_SECONDS Delay between failed Stock freshness attempts. Default: 5.
|
||||
|
||||
Exit codes:
|
||||
0 = no service blockers. Boundary / evidence warnings may still be present.
|
||||
@@ -277,9 +281,23 @@ fi
|
||||
if [[ "$RUN_STOCK" -eq 1 ]]; then
|
||||
section "StockPlatform freshness"
|
||||
stock_tmp="$(mktemp -t post-start-stock.XXXXXX)"
|
||||
stock_code="$(curl -k -sS -o "$stock_tmp" -w '%{http_code}' --max-time 12 "https://stock.wooo.work/api/v1/system/freshness" 2>/dev/null || true)"
|
||||
stock_code=""
|
||||
stock_attempt=1
|
||||
while [[ "$stock_attempt" -le "$STOCK_FRESHNESS_RETRY_ATTEMPTS" ]]; do
|
||||
stock_code="$(curl -k -sS -o "$stock_tmp" -w '%{http_code}' --max-time 12 "https://stock.wooo.work/api/v1/system/freshness" 2>/dev/null || true)"
|
||||
if [[ "$stock_code" == 2* ]]; then
|
||||
if [[ "$stock_attempt" -gt 1 ]]; then
|
||||
evidence_warn "StockPlatform freshness recovered after attempt=$stock_attempt"
|
||||
fi
|
||||
break
|
||||
fi
|
||||
if [[ "$stock_attempt" -lt "$STOCK_FRESHNESS_RETRY_ATTEMPTS" ]]; then
|
||||
sleep "$STOCK_FRESHNESS_RETRY_DELAY_SECONDS"
|
||||
fi
|
||||
stock_attempt=$((stock_attempt + 1))
|
||||
done
|
||||
if [[ "$stock_code" != 2* ]]; then
|
||||
blocked "StockPlatform freshness endpoint returned ${stock_code:-curl_failed}"
|
||||
blocked "StockPlatform freshness endpoint returned ${stock_code:-curl_failed} attempts=$STOCK_FRESHNESS_RETRY_ATTEMPTS"
|
||||
cat "$stock_tmp" || true
|
||||
else
|
||||
python3 - "$stock_tmp" <<'PY'
|
||||
|
||||