diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 6a274273..f82bd381 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -3,11 +3,12 @@
 **Background**: SOP v1.51 can already determine full-stack service GREEN, but the long SOP is too comprehensive to serve as the operating page for the first T+10 minutes after each reboot. To avoid once again confusing route 200, container healthy, DB freshness, backup, Wazuh registry, and DR escrow, this round adds a one-page post-start quick check.
 
 **Updates**:
-- Added `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md`, which fixes the read-only check order for the first 10 minutes after a reboot: hosts / SSH, cold-start scorecard, MOMO freshness, backup / offsite / escrow, public routes, 110 CPU / runaway process.
+- Added `scripts/reboot-recovery/post-start-quick-check.sh`, a read-only wrapper for the first 10 minutes after a reboot: hosts / SSH, cold-start scorecard, MOMO freshness, backup / offsite / escrow, public routes, 110 CPU / runaway process.
+- Added `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` as the wrapper documentation and manual fallback.
 - `docs/runbooks/FULL-STACK-COLD-START-SOP.md` upgraded to `v1.52`, linking directly to the quick check from the latest baseline.
 - `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md` updates the P3 docs / automation contract and P3-008, explicitly separating the short quick check from the long SOP / Plan B.
 
-**Boundary**: this round is docs-only, with no SSH, Docker, systemd, Nginx, firewall, K8s, ArgoCD, Wazuh runtime, active scan, or secret operations. The quick check still forbids treating a website 200 as fresh data, a fresh backup as DR complete, or a Wazuh route 200 as agent registry accepted.
+**Boundary**: this round is repo-side script / docs-only; verification ran only syntax and guard checks, and no live SSH, Docker, systemd, Nginx, firewall, K8s, ArgoCD, Wazuh runtime, active scan, or secret operations were executed. The quick check still forbids treating a website 200 as fresh data, a fresh backup as DR complete, or a Wazuh route 200 as agent registry accepted.
 
 ## 2026-06-25|14:16 full cold-start GREEN / MOMO data freshness recovered
 
diff --git a/docs/runbooks/FULL-STACK-COLD-START-SOP.md b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
index a8f5c425..c6492b27 100644
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -10,7 +10,7 @@
 
 This section is the first decision anchor after every handover, power-on, shutdown, or reboot. If the date is not today, rerun the live check first, then update this section and `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md`.
 
-If you only need to judge quickly after a reboot whether recovery can be claimed, run the one-page overall check first: `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md`. The long SOP keeps the full background, exception handling, and Plan B; the short checklist handles the fixed determination within T+10 minutes of every reboot.
+If you only need to judge quickly after a reboot whether recovery can be claimed, run the one-page overall check first: `scripts/reboot-recovery/post-start-quick-check.sh --no-color`, and use `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` as the manual fallback. The long SOP keeps the full background, exception handling, and Plan B; the short wrapper / checklist handles the fixed determination within T+10 minutes of every reboot.
 
 2026-06-25 14:16 live read-only refresh supersedes the 11:53 BLOCKED wording. Hosts, routes, K3s, AWOOOI API health, MOMO service health, MOMO business data freshness, backup core/offsite, and core monitoring/exporter surfaces are green for controlled runner/CD release. MOMO is healthy on `V10.674`; latest import job `57` completed cleanly; `MOMO_DAILY_FRESHNESS 1|2026-06-24`; current-month daily snapshot and realtime tables match through `2026-06-24`. Full-stack service readiness is now GREEN, but DR remains blocked by missing credential escrow evidence (`escrow_missing=5`). Do not turn this into a DR complete or security/runtime acceptance claim. Wazuh host registry acceptance remains outside this SOP lane and is still not complete.
 
diff --git a/docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md b/docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md
index 9719b3a2..30429cd2 100644
--- a/docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md
+++ b/docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md
@@ -41,6 +41,14 @@
 
 ## 3. 10-minute read-only overall check order
 
+Prefer the repo-side wrapper:
+
+```bash
+scripts/reboot-recovery/post-start-quick-check.sh --no-color
+```
+
+This wrapper performs read-only checks only and delegates to the existing cold-start / MOMO preflight / backup-status scripts; it does not restart, reload, import, modify K8s, or read token content. If the wrapper fails because of an SSH permission or path problem, collect the missing evidence manually with the step-by-step commands below.
+
 ### Step 1 - Hosts and SSH
 
 ```bash
diff --git a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
index 49f3b8b2..5b586fd6 100644
--- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
+++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
@@ -15,7 +15,7 @@
 | P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-25 09:05 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, VIP `192.168.0.125` is present, node filesystem / disk-pressure / readonly events are `0`, and latest `km-vectorize-29705460-55rgs` completed. |
 | P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-25 09:05 backup / alert readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`, last aggregate `2026-06-25 02:35:09`. DR remains blocked on real non-secret credential escrow evidence IDs. |
 | P2 service / data truth | GREEN | 100% | Public route/TLS, API/Web route, MOMO health `V10.674`, MOMO main / CD `#904` monthly-sync failure boundary, MOMO main / CD `#910` Drive-auth fail-closed boundary, direct 14:16 public route smoke all expected 2xx/3xx, current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, backup exporters, schedules, K3s node readiness/storage conditions, VIP, and 110 / 188 runtime health are green.
14:16 preflight confirms app / scheduler / Telegram bot healthy, scheduler restart count `0`, token metadata aligned to scheduler UID, latest job `57` completed cleanly, and `DB_DAILY_FRESHNESS 1|2026-06-24`. | -| P3 docs / automation contracts | DONE_WITH_MOMO_PREFLIGHT_AND_CPU_TRIAGE | 100% | Workplan, SOP v1.52, one-page post-start quick check, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, 188 DB/Redis exporter restore helper, 188 MinIO/Velero restore helper, 188 nginx-exporter restore helper, 110 Docker disk pressure cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, MOMO Pro false-noise health monitor source-of-truth, docker-health direct Telegram fallback cooldown, Bitan public-content same-fingerprint cooldown, notification-noise readback, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side cold-start v1.42 source absence classifier, live-sync parity gate, MOMO import-boundary production deploy, MOMO Drive-auth 
fail-closed production deploy, 10:04 scheduler fail-closed live proof, 10:35 route / DB / backup refresh, 11:44 MOMO dedicated preflight blocked readback, 14:16 MOMO dedicated preflight recovery on V10.674 / job 57 / freshness 1, 10:58 user-approved 110 orphan Chrome SIGTERM evidence, MacBook Pro Codex safe artifact sync readback, and 2026-06-25 live refresh with full cold-start GREEN are updated. 2026-06-24 23:15 read-only verify still shows repo cold-start hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`; live 110 script sync of the v1.42 classifier is not claimed until separately approved and recorded. | +| P3 docs / automation contracts | DONE_WITH_MOMO_PREFLIGHT_AND_CPU_TRIAGE | 100% | Workplan, SOP v1.52, one-page post-start quick check wrapper + fallback runbook, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, 188 DB/Redis exporter restore helper, 188 MinIO/Velero restore helper, 188 
nginx-exporter restore helper, 110 Docker disk pressure cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, MOMO Pro false-noise health monitor source-of-truth, docker-health direct Telegram fallback cooldown, Bitan public-content same-fingerprint cooldown, notification-noise readback, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side cold-start v1.42 source absence classifier, live-sync parity gate, MOMO import-boundary production deploy, MOMO Drive-auth fail-closed production deploy, 10:04 scheduler fail-closed live proof, 10:35 route / DB / backup refresh, 11:44 MOMO dedicated preflight blocked readback, 14:16 MOMO dedicated preflight recovery on V10.674 / job 57 / freshness 1, 10:58 user-approved 110 orphan Chrome SIGTERM evidence, MacBook Pro Codex safe artifact sync readback, and 2026-06-25 live refresh with full cold-start GREEN are updated. 2026-06-24 23:15 read-only verify still shows repo cold-start hash `f60b81029969a527dc742ebc9558d2933f11fe24ec4f46f7a7bc6637759b7b05` differs from 110 live hash `10608873d406911a519afa96218abebc2b85ab6123bdf46b6e21eb269e554bb8`; live 110 script sync of the v1.42 classifier is not claimed until separately approved and recorded. | 2026-06-25 14:16 supplemental readback supersedes the 11:53 BLOCKED wording: direct route smoke is 200 for AWOOOI API / IwoooS / MOMO health / Stock, and cold-start public route/TLS gate is green for all expected 2xx/3xx routes. 
Repo-side cold-start returns `PASS=89 WARN=0 BLOCKED=0`; `/backup/scripts/backup-status.sh --no-notify --no-refresh` reports 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`; MOMO dedicated preflight returns `PASS=18 WARN=3 BLOCKED=0`; MOMO health is `V10.674`; 110 load is around `3.85 / 3.33 / 3.19`, with active Gitea Actions / 2026 World Cup pipeline visible, not orphan Chrome. @@ -181,7 +181,7 @@ Next: | P3-005 | DONE | 100 | Update cold-start SOP | SOP now includes start, shutdown, reboot, record, comparison, and 120 blocker handling. | Increment SOP version after each process change. | SOP has controlled power-operation sections and ledger template. | | P3-006 | DONE | 100 | Update backup status | Backup status now reflects current cron, rclone latest-only, failure-only alert posture, and escrow blocker. | Refresh after 120 backup rerun. | Backup status no longer claims noisy success Telegram notifications. | | P3-007 | DONE | 100 | Harden Gitea backup stale dump handling | 2026-06-05 manual Gitea backup failed because the container retained `/tmp/gitea-dump.zip` from the 02:00 failure. `scripts/backup/backup-gitea.sh` now renames stale container dump files to timestamped evidence before running a new dump, and the live 110 script is updated. | Watch the next 02:00 Gitea backup. | `bash -n` passes locally and on 110; manual Gitea backup completed after stale evidence rename. 
|
-| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.52 adds one-page post-start quick check, startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, stale-vs-active K8s failed Job classification, post-reboot / post-CD recovery anchors, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, 110 runaway browser / CI load triage PlayBook, healthy-heartbeat suppression, 188 node-exporter restore, 188 DB/Redis exporter restore, 188 MinIO/Velero restore, 188 nginx-exporter restore, 110 Docker disk cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, post-reboot notification noise gates, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side scorecard source-absence classifier, 110 live-sync parity gate, CD monitoring coverage target-down classification, MOMO dedicated token/source preflight, MOMO V10.674 / StartedAt / lifecycle / job 57 / freshness 1 recovery readback, and 2026-06-25 110 CPU orphan Chrome vs active CI triage evidence. | Use `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` for T+10 post-reboot triage, then use SOP v1.52 for exceptions, Plan B, blocker-specific recovery, and historical comparison. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow / runaway-process / notification-noise / MOMO preflight / monitoring coverage checks. If using the live 110 script, record its hash and do not assume repo-side v1.42 behavior until synced under approval and deploy parity passes. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, and `B5_DR_COMPLETE`; quick check has one-page command order and LOGBOOK template; latest MOMO dedicated preflight returns `PASS=18 WARN=3 BLOCKED=0`; 110 CPU evidence records old orphan Chrome groups removed by approved SIGTERM while active CI load remains observation-only; repeated healthy/same-failure notification noise is controlled without hiding real alerts, and monitoring coverage target-down is routed through exporter restore before any product restart. |
+| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.52 adds one-page post-start quick check wrapper, fallback runbook, startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, stale-vs-active K8s failed Job classification, post-reboot / post-CD recovery anchors, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, 110 runaway browser / CI load triage PlayBook, healthy-heartbeat suppression, 188 node-exporter restore, 188 DB/Redis exporter restore, 188 MinIO/Velero restore, 188 nginx-exporter restore, 110 Docker disk cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, post-reboot notification noise gates, MOMO source-file absence decision gate with scheduler stats / import_config / job 56 evidence, repo-side scorecard source-absence classifier, 110 live-sync parity gate, CD monitoring coverage target-down classification, MOMO dedicated token/source preflight, MOMO V10.674 / StartedAt / lifecycle / job 57 / freshness 1 recovery readback, and 2026-06-25 110 CPU orphan Chrome vs active CI triage evidence. | Use `scripts/reboot-recovery/post-start-quick-check.sh --no-color` for T+10 post-reboot triage, then use `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` as manual fallback and SOP v1.52 for exceptions, Plan B, blocker-specific recovery, and historical comparison. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow / runaway-process / notification-noise / MOMO preflight / monitoring coverage checks. If using the live 110 script, record its hash and do not assume repo-side v1.42 behavior until synced under approval and deploy parity passes. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, and `B5_DR_COMPLETE`; quick check wrapper has one command order and LOGBOOK summary; latest MOMO dedicated preflight returns `PASS=18 WARN=3 BLOCKED=0`; 110 CPU evidence records old orphan Chrome groups removed by approved SIGTERM while active CI load remains observation-only; repeated healthy/same-failure notification noise is controlled without hiding real alerts, and monitoring coverage target-down is routed through exporter restore before any product restart. |
 | P3-009 | DONE | 100 | Assess 120/121 AA/AS role and host load balancing | 2026-06-12 15:19 live check confirms 120 and 121 are both `Ready control-plane`, `k3s active`, `k3s-agent inactive`, with no taints; however most AWOOOI / ArgoCD / Velero workload remains on 121 after 120 fsck recovery. New assessment defines control-plane AA vs workload AA, migration candidates from 110/188, and stateful migration blockers.
| After P0 backup/offsite/cold-start green, implement topology spread for AWOOOI API/Web before moving additional services. | `docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md` exists; SOP v1.6 links AA/AS and load-balancing checks; migration implementation remains explicitly `0%`. |
 | P3-010 | DONE | 100 | Update workload balancing docs with 2026-06-13 live truth | Host role assessment, workplan, SOP, backup status, and LOGBOOK are refreshed with current cold-start, backup, 188 certbot degraded, ArgoCD `km-vectorize` degraded, Gitea main `acaae999`, ArgoCD sync, and final pod placement evidence. | Keep updating this file after the next reboot or deploy. | Docs separate service-green status from DR escrow, workload rollout, and non-service governance debt. |
 | P3-011 | DONE | 100 | Record `km-vectorize` remediation status | LOGBOOK, this workplan, and SOP now state the schedule/label fix, ArgoCD sync evidence, the invalid manual Job boundary, and the 90% waiting-for-next-schedule gate. | After next 03:00 run, update this row and the top verdict with `lastSuccessfulTime` / ArgoCD health evidence. | No document claims ArgoCD green before official CronJob success evidence exists. |
diff --git a/scripts/reboot-recovery/post-start-quick-check.sh b/scripts/reboot-recovery/post-start-quick-check.sh
new file mode 100755
index 00000000..d9cd0b7c
--- /dev/null
+++ b/scripts/reboot-recovery/post-start-quick-check.sh
@@ -0,0 +1,273 @@
+#!/usr/bin/env bash
+set -uo pipefail
+
+# One-entry read-only post-reboot check. This wrapper intentionally delegates
+# deep checks to the existing recovery scripts and does not restart, patch,
+# delete, import, reload, or write runtime state.
+
+ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
+SSH_CONNECT_TIMEOUT="${SSH_CONNECT_TIMEOUT:-6}"
+RUN_COLD_START=1
+RUN_MOMO=1
+RUN_BACKUP=1
+RUN_ROUTES=1
+RUN_CPU=1
+NO_COLOR_FLAG=0
+
+PASS_COUNT=0
+WARN_COUNT=0
+BLOCKED_COUNT=0
+
+HOSTS=(
+  "192.168.0.110"
+  "192.168.0.120"
+  "192.168.0.121"
+  "192.168.0.188"
+)
+
+ROUTES=(
+  "https://awoooi.wooo.work/api/v1/health"
+  "https://awoooi.wooo.work/zh-TW/iwooos"
+  "https://mo.wooo.work/health"
+  "https://stock.wooo.work/"
+)
+
+usage() {
+  cat <<'USAGE'
+Usage: post-start-quick-check.sh [options]
+
+Read-only post-reboot quick check for 110 / 120 / 121 / 188.
+
+Options:
+  --skip-cold-start   Do not run full-stack-cold-start-check.sh.
+  --skip-momo         Do not run momo-drive-token-source-recovery-preflight.sh.
+  --skip-backup       Do not run /backup/scripts/backup-status.sh on 110.
+  --skip-routes       Do not curl public route smoke targets.
+  --skip-cpu          Do not read 110 CPU / process summary.
+  --no-color          Disable ANSI color.
+  -h, --help          Show this help.
+
+Exit codes:
+  0 = no blockers and no warnings.
+  1 = warnings only.
+  2 = blockers observed.
+
+This script never reads token content and never writes runtime state.
+USAGE
+}
+
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --skip-cold-start)
+      RUN_COLD_START=0
+      ;;
+    --skip-momo)
+      RUN_MOMO=0
+      ;;
+    --skip-backup)
+      RUN_BACKUP=0
+      ;;
+    --skip-routes)
+      RUN_ROUTES=0
+      ;;
+    --skip-cpu)
+      RUN_CPU=0
+      ;;
+    --no-color)
+      NO_COLOR_FLAG=1
+      ;;
+    -h|--help)
+      usage
+      exit 0
+      ;;
+    *)
+      printf 'Unknown argument: %s\n' "$1" >&2
+      usage >&2
+      exit 2
+      ;;
+  esac
+  shift
+done
+
+if [[ -n "${NO_COLOR:-}" || "$NO_COLOR_FLAG" -eq 1 ]]; then
+  RED=""
+  GREEN=""
+  YELLOW=""
+  BLUE=""
+  NC=""
+else
+  RED=$'\033[0;31m'
+  GREEN=$'\033[0;32m'
+  YELLOW=$'\033[1;33m'
+  BLUE=$'\033[0;34m'
+  NC=$'\033[0m'
+fi
+
+section() {
+  printf '\n%s=== %s ===%s\n' "$BLUE" "$1" "$NC"
+}
+
+ok() {
+  PASS_COUNT=$((PASS_COUNT + 1))
+  printf '%sOK%s %s\n' "$GREEN" "$NC" "$*"
+}
+
+warn() {
+  WARN_COUNT=$((WARN_COUNT + 1))
+  printf '%sWARN%s %s\n' "$YELLOW" "$NC" "$*"
+}
+
+blocked() {
+  BLOCKED_COUNT=$((BLOCKED_COUNT + 1))
+  printf '%sBLOCKED%s %s\n' "$RED" "$NC" "$*"
+}
+
+ssh_read() {
+  local user_host="$1"
+  local command="$2"
+  ssh -o BatchMode=yes -o ConnectTimeout="$SSH_CONNECT_TIMEOUT" "$user_host" "$command"
+}
+
+run_and_capture() {
+  local label="$1"
+  shift
+  local tmp rc=0
+  tmp="$(mktemp -t post-start-quick-check.XXXXXX)"
+  # Record the command's exit status right away. Reading $? after a failed
+  # `if` that has no `else` branch yields 0, which would hide the failure
+  # from the caller.
+  "$@" >"$tmp" 2>&1 || rc=$?
+  if [[ "$rc" -eq 0 ]]; then
+    ok "$label"
+  fi
+  cat "$tmp"
+  rm -f "$tmp"
+  return "$rc"
+}
+
+section "Hosts / SSH"
+for host in "${HOSTS[@]}"; do
+  if ping -c 1 -W 1 "$host" >/dev/null 2>&1; then
+    ok "PING_OK $host"
+  else
+    blocked "PING_FAIL $host"
+  fi
+
+  if nc -z -w 2 "$host" 22 >/dev/null 2>&1; then
+    ok "SSH_PORT_OK $host"
+  else
+    blocked "SSH_PORT_FAIL $host"
+  fi
+done
+
+if [[ "$RUN_COLD_START" -eq 1 ]]; then
+  section "Cold-start scorecard"
+  cold_tmp="$(mktemp -t post-start-cold-start.XXXXXX)"
+  if bash "$ROOT_DIR/scripts/reboot-recovery/full-stack-cold-start-check.sh" --monitor-read-only --no-color --watch --interval 1 --max-attempts 1 >"$cold_tmp" 2>&1; then
+    ok "cold-start command exited 0"
+  else
+    blocked "cold-start command returned non-zero"
+  fi
+  cat "$cold_tmp"
+  cold_summary="$(grep -E 'PASS=[0-9]+ WARN=[0-9]+ BLOCKED=[0-9]+' "$cold_tmp" | tail -n 1 || true)"
+  if [[ -n "$cold_summary" ]]; then
+    ok "cold-start summary: $cold_summary"
+  else
+    warn "cold-start summary not found"
+  fi
+  rm -f "$cold_tmp"
+fi
+
+if [[ "$RUN_MOMO" -eq 1 ]]; then
+  section "MOMO freshness"
+  momo_tmp="$(mktemp -t post-start-momo.XXXXXX)"
+  bash "$ROOT_DIR/scripts/reboot-recovery/momo-drive-token-source-recovery-preflight.sh" >"$momo_tmp" 2>&1
+  momo_rc=$?
+  cat "$momo_tmp"
+  case "$momo_rc" in
+    0)
+      ok "MOMO preflight clean"
+      ;;
+    1)
+      warn "MOMO preflight has warnings"
+      ;;
+    *)
+      blocked "MOMO preflight has blockers"
+      ;;
+  esac
+  grep -E 'MOMO_DRIVE_TOKEN_SOURCE_PREFLIGHT|MOMO_HEALTH_VERSION|DB_MONTHLY_SYNC|DB_DAILY_FRESHNESS|DB_LATEST_DAILY_IMPORT_JOB' "$momo_tmp" || true
+  rm -f "$momo_tmp"
+fi
+
+if [[ "$RUN_BACKUP" -eq 1 ]]; then
+  section "Backup / offsite / escrow"
+  backup_tmp="$(mktemp -t post-start-backup.XXXXXX)"
+  if ssh_read "wooo@192.168.0.110" '/backup/scripts/backup-status.sh --no-notify --no-refresh' >"$backup_tmp" 2>&1; then
+    ok "backup-status readback succeeded"
+  else
+    blocked "backup-status readback failed"
+  fi
+  cat "$backup_tmp"
+  if grep -Eq 'core_blockers=0|CORE_BLOCKERS[ =]0' "$backup_tmp"; then
+    ok "backup core blockers are 0"
+  else
+    warn "backup core blocker summary not confirmed"
+  fi
+  if grep -Eq 'escrow_missing=0|ESCROW_MISSING_COUNT[ =]0' "$backup_tmp"; then
+    ok "credential escrow missing is 0"
+  elif grep -Eq 'escrow_missing=[1-9]|ESCROW_MISSING_COUNT[ =][1-9]' "$backup_tmp"; then
+    warn "credential escrow still missing; DR_COMPLETE is forbidden"
+  else
+    warn "credential escrow count not found"
+  fi
+  rm -f "$backup_tmp"
+fi
+
+if [[ "$RUN_ROUTES" -eq 1 ]]; then
+  section "Public routes"
+  for url in "${ROUTES[@]}"; do
+    code="$(curl -k -sS -o /dev/null -w '%{http_code}' --max-time 12 "$url" 2>/dev/null || true)"
+    case "$code" in
+      2*|3*)
+        ok "$code $url"
+        ;;
+      *)
+        blocked "${code:-curl_failed} $url"
+        ;;
+    esac
+  done
+fi
+
+if [[ "$RUN_CPU" -eq 1 ]]; then
+  section "110 CPU / process attribution"
+  cpu_tmp="$(mktemp -t post-start-cpu.XXXXXX)"
+  if ssh_read "wooo@192.168.0.110" 'uptime; vmstat 1 5; ps -eo pid,ppid,pgid,stat,pcpu,pmem,comm,args --sort=-pcpu | head -25' >"$cpu_tmp" 2>&1; then
+    ok "110 CPU/process readback succeeded"
+  else
+    warn "110 CPU/process readback failed"
+  fi
+  cat "$cpu_tmp"
+  if grep -Eiq 'chrome|chromium|playwright' "$cpu_tmp"; then
+    warn "browser/smoke process is visible; classify orphan vs active parent before action"
+  fi
+  if grep -Eiq 'gitea|actions|runner|npm|pnpm|pytest|pip-audit' "$cpu_tmp"; then
+    ok "active CI/build/test load is visible"
+  fi
+  rm -f "$cpu_tmp"
+fi
+
+section "Summary"
+printf 'POST_START_QUICK_CHECK PASS=%s WARN=%s BLOCKED=%s\n' "$PASS_COUNT" "$WARN_COUNT" "$BLOCKED_COUNT"
+
+if [[ "$BLOCKED_COUNT" -gt 0 ]]; then
+  printf 'RESULT=BLOCKED\n'
+  exit 2
+fi
+
+if [[ "$WARN_COUNT" -gt 0 ]]; then
+  printf 'RESULT=DEGRADED\n'
+  exit 1
+fi
+
+printf 'RESULT=GREEN\n'
+exit 0
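
Review note: a caller that archives the wrapper's output (for example a LOGBOOK entry or a CD step) may want to re-derive the verdict from the final `POST_START_QUICK_CHECK PASS=N WARN=N BLOCKED=N` line instead of relying on the exit code alone. The sketch below is a hypothetical helper, `classify_summary`, that is not part of this diff. It assumes only the summary format printed by the script above and applies the same precedence: any blocker wins, then any warning, otherwise GREEN.

```shell
# Hypothetical helper (not in the diff): map a captured summary line to
# the verdict the wrapper itself would print as RESULT=...
classify_summary() {
  # Use the last matching summary line in case the captured log has several.
  line=$(printf '%s\n' "$1" |
    grep -E 'POST_START_QUICK_CHECK PASS=[0-9]+ WARN=[0-9]+ BLOCKED=[0-9]+' |
    tail -n 1)
  if [ -z "$line" ]; then
    # No summary means the run did not finish; never report this as GREEN.
    echo "UNKNOWN"
    return 1
  fi
  warn=$(printf '%s\n' "$line" | sed -E 's/.*WARN=([0-9]+).*/\1/')
  blocked=$(printf '%s\n' "$line" | sed -E 's/.*BLOCKED=([0-9]+).*/\1/')
  if [ "$blocked" -gt 0 ]; then
    echo "BLOCKED"
  elif [ "$warn" -gt 0 ]; then
    echo "DEGRADED"
  else
    echo "GREEN"
  fi
}

# Example: one warning and no blockers is classified as DEGRADED.
classify_summary 'POST_START_QUICK_CHECK PASS=12 WARN=1 BLOCKED=0'
```

Treating a missing summary as `UNKNOWN` with a non-zero return keeps a truncated log from being read as a green run, in line with the wrapper's own rule that a result is only GREEN when both counters are zero.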