ops(reboot): add 188 hygiene read-only checklist [skip ci]
@@ -44810,3 +44810,23 @@ production browser smoke:
- 188 service recovery remains `GREEN`.
- 188 host hygiene evidence package advanced from `80%` to `90%`; runtime repair remains `0%`.
- Credential escrow owner request package stays `READY_TO_DISPATCH`; marker write remains `0%` and must not be filled in with placeholders or secrets.

## 2026-06-26: 188 host hygiene read-only checklist automation

**Time and sources**:
- 2026-06-26 07:31 Asia/Taipei.
- Sources: `docs/runbooks/HOST-188-HYGIENE-MAINTENANCE-RUNBOOK.md`, the existing 120 fsck checklist script pattern, and 188 / 110 read-only evidence.

**Completed**:
- Added `scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh`.
- The script is read-only by default and checks 188 systemd failed units, product containers, host PostgreSQL `14/main`, certbot / ACME, the `sentry.wooo.work` public cert, the 110 backup / escrow boundary, and critical public routes.
- The script always prints `RUNTIME_ACTION_AUTHORIZED=0`; the currently expected result is `SERVICE_GREEN=1`, `HOST_HYGIENE_BLOCKED=1`, `Result: SERVICE_GREEN_HOST_HYGIENE_BLOCKED`.
- `HOST-188-HYGIENE-MAINTENANCE-RUNBOOK.md`, `FULL-STACK-COLD-START-SOP.md`, and this workplan now link to this checklist.

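The markers listed above can be read back from a saved checklist log. The following is a minimal sketch, assuming the output was redirected to a file; `read_marker` and `is_expected_blocked_state` are hypothetical helper names, not part of the repo. Because the script prints `RUNTIME_ACTION_AUTHORIZED=0` more than once and also echoes sample marker lines in its decision-tree text, the sketch takes the last match for each key.

```bash
#!/usr/bin/env bash
# Sketch only: read_marker is a hypothetical helper, not part of the repo.
# It prints the numeric value of the LAST "KEY=<digits>" line in a saved
# checklist log, or nothing when the marker is absent.
read_marker() {
  local key="$1" file="$2"
  sed -n "s/^${key}=\([0-9][0-9]*\)\$/\1/p" "$file" | tail -n 1
}

# Succeeds only when the log shows the currently expected state and confirms
# that no runtime action was authorized.
is_expected_blocked_state() {
  local file="$1"
  [ "$(read_marker SERVICE_GREEN "$file")" = "1" ] &&
    [ "$(read_marker HOST_HYGIENE_BLOCKED "$file")" = "1" ] &&
    [ "$(read_marker RUNTIME_ACTION_AUTHORIZED "$file")" = "0" ]
}
```

A caller might run the checklist with `--no-color`, redirect stdout to a log file, and then call `is_expected_blocked_state` on that file before opening a maintenance window.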
**Types of commands run**:
- Writes: repo script / docs only.
- Not done: runtime write operations on host / Docker / systemd / Nginx / firewall / K8s / DB / Wazuh; `pg_resetwal`, certbot renew, Nginx reload, restore, and `reset-failed` were not run; no secret values were collected.

**Current verdict**:
- 188 host hygiene preflight automation: `0% -> 100%`.
- 188 runtime repair: still `0%`.

@@ -12,7 +12,7 @@
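
If you only need a quick post-reboot judgment on whether recovery can be declared, first run the one-page overall check `scripts/reboot-recovery/post-start-quick-check.sh --no-color`, and use `docs/runbooks/REBOOT-POST-START-QUICK-CHECK.md` as the manual fallback. The long SOP keeps the full background, exception handling, and Plan B; the short wrapper / checklist is responsible for the fixed verdict within T+10 minutes on every run.
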
2026-06-26 07:19 follow-up: `gitea/main` already contains the previous round's SOP docs commit `1fd5e2a8`; ArgoCD `awoooi-prod` reads back `Synced / Healthy` at revision `1fd5e2a8b0f18d24eed16aa2a44286bcbf230603`, with API `2/2`, Web `2/2`, Worker `1/1`, and pods at `restart=0`. A rerun of the full cold-start is still `PASS=87 WARN=0 BLOCKED=0`, result `GREEN`. Direct public route readback: AWOOOI API `200`, AWOOOI Web `307`, VibeWork `200`, AwoooGo `200`, MOMO health `200`, Stock freshness `200`, Bitan `200`, Gitea `200`, Harbor `200`, Registry `/v2/` expected `401`, Sentry expected `302`, SigNoz `200`, Langfuse `200`. Precise classification of the 188 blockers: `pg_lsclusters` shows host PostgreSQL `14/main` down, and `systemctl status postgresql@14-main` shows `invalid primary checkpoint record` and `PANIC: could not locate a valid checkpoint record`; `certbot.service` shows the `sentry.wooo.work` renew as rate-limited, and `snap.certbot.renew.service` shows challenge failed; `awoooi-startup.service` previously attempted to run `pg_resetwal` as root and failed. This round does not run `pg_resetwal`, does not `reset-failed`, and does not restart any service; 188 must be handled with a dedicated maintenance window, a rollback owner, and a restore/source-of-truth plan, see `docs/runbooks/HOST-188-HYGIENE-MAINTENANCE-RUNBOOK.md`. 110 load has dropped to about `4.83 / 4.82 / 5.52`, and top CPU is the active AWOOOI Web `turbo build` / Docker buildx; swap is still full but memory available is about `41Gi`, so swap is not cleared manually this round. The overall declaration remains `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`.
2026-06-26 07:19 follow-up: `gitea/main` already contains the previous round's SOP docs commit `1fd5e2a8`; ArgoCD `awoooi-prod` reads back `Synced / Healthy` at revision `1fd5e2a8b0f18d24eed16aa2a44286bcbf230603`, with API `2/2`, Web `2/2`, Worker `1/1`, and pods at `restart=0`. A rerun of the full cold-start is still `PASS=87 WARN=0 BLOCKED=0`, result `GREEN`. Direct public route readback: AWOOOI API `200`, AWOOOI Web `307`, VibeWork `200`, AwoooGo `200`, MOMO health `200`, Stock freshness `200`, Bitan `200`, Gitea `200`, Harbor `200`, Registry `/v2/` expected `401`, Sentry expected `302`, SigNoz `200`, Langfuse `200`. Precise classification of the 188 blockers: `pg_lsclusters` shows host PostgreSQL `14/main` down, and `systemctl status postgresql@14-main` shows `invalid primary checkpoint record` and `PANIC: could not locate a valid checkpoint record`; `certbot.service` shows the `sentry.wooo.work` renew as rate-limited, and `snap.certbot.renew.service` shows challenge failed; `awoooi-startup.service` previously attempted to run `pg_resetwal` as root and failed. This round does not run `pg_resetwal`, does not `reset-failed`, and does not restart any service; 188 must be handled with a dedicated maintenance window, a rollback owner, and a restore/source-of-truth plan, see `docs/runbooks/HOST-188-HYGIENE-MAINTENANCE-RUNBOOK.md`, and you can first run `scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh --no-color` to get a read-only preflight. 110 load has dropped to about `4.83 / 4.82 / 5.52`, and top CPU is the active AWOOOI Web `turbo build` / Docker buildx; swap is still full but memory available is about `41Gi`, so swap is not cleared manually this round. The overall declaration remains `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`.

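The route readback in the 07:19 follow-up expects non-200 statuses for a few routes (Registry `/v2/` `401`, Sentry `302`, AWOOOI Web `307`). A minimal sketch of a per-route comparison follows; the route keys (`registry-v2`, `sentry`, `awoooi-web`) and both function names are hypothetical labels chosen for illustration, and every other route is assumed to expect `200` as listed.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not part of the repo.
# Maps an illustrative route key to the status the readback expects.
expected_status_for() {
  case "$1" in
    registry-v2) echo 401 ;;  # unauthenticated /v2/ is expected to be rejected
    sentry)      echo 302 ;;  # expected redirect
    awoooi-web)  echo 307 ;;  # expected redirect
    *)           echo 200 ;;  # all other listed routes read back 200
  esac
}

# Succeeds when the observed status equals the expected one for that route.
route_matches_expected() {
  local route="$1" observed="$2"
  [ "$observed" = "$(expected_status_for "$route")" ]
}
```

Comparing against an exact expected status is stricter than accepting any `2xx` / `3xx`, so a route that starts returning `200` where `401` is expected would be flagged rather than passed.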
2026-06-26 07:02 all-host live refresh: ping and SSH port are OK for all of `110 / 120 / 121 / 188 / 112 / 111 / 168`. 110 has `systemctl=running` and failed units `0`, but load is `5.83 / 7.26 / 5.77`, top CPU is AWOOOI Web `next build`, and swap is still `7.8Gi / 7.8Gi`; this is CI/build pressure, not an orphan Chrome or Docker incident. 120 / 121 have `systemctl=running` and K3s active, and nodes `mon` / `mon1` are both Ready. ArgoCD `awoooi-prod` was briefly `OutOfSync / Progressing` at 06:57 because the deploy marker `52f61da4` rollout was replacing API/Web/Worker; after 07:00 it stabilized at `Synced / Healthy`, with API `2/2`, Web `2/2`, Worker `1/1`, and API/Web still spread across `mon` / `mon1`. Live cold-start rerun: `PASS=87 WARN=0 BLOCKED=0`, result `GREEN`. StockPlatform `/api/v1/system/freshness` briefly returned `502` about 35 seconds after the container restarted; subsequent consecutive reads all returned `200` with `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; this kind of rollout warmup counts as a blocker only when the failures are consecutive. MOMO health is `V10.699`; cold-start direct evidence still shows current-month parity `15383 / 15383` through `2026-06-24` and daily freshness `1|2026-06-24`. Backup status 06:58: 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, offsite/rclone fresh, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`. 188 product containers are healthy, but host `systemctl=degraded` remains a real host hygiene blocker: `awoooi-startup.service`, `postgresql@14-main.service`, `certbot.service`, and `snap.certbot.renew.service` are failed. 112 Wazuh manager/indexer/dashboard are active and ports `1514 / 1515 / 55000` are listening, but the production Wazuh route still reports `disabled_waiting_iwooos_wazuh_owner_gate`, `configured=false`, manager registry accepted `0`, runtime gate `0`. 111 / 168 are reachable, but the AWOOOI dev workspaces on both are ahead 17 with different HEADs (`111=56c83257`, `168=59485d51`); Mac Mini `/System/Volumes/Data` has only about `3.2Gi` left. The current service recovery declaration stays `FULL_STACK_GREEN_DR_ESCROW_BLOCKED`, with host hygiene / DR escrow / Wazuh registry / workstation capacity explicitly listed as blockers outside service green.

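The warmup rule in the 07:02 refresh (a single `502` right after a restart is not a blocker; only consecutive failures are) can be sketched as a small retry wrapper. This is an illustration under stated assumptions: `is_persistent_failure`, `probe_http_200`, and the `PROBE_SLEEP` knob are hypothetical names, and the SOP does not specify an attempt count or interval.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not part of the repo.
# Runs a probe command up to N times. Returns 0 (persistent failure, i.e. a
# blocker) only if every attempt fails; returns 1 as soon as one succeeds.
is_persistent_failure() {
  local attempts="$1"; shift
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 1   # one success: earlier failures are treated as warmup
    fi
    if (( i < attempts )); then
      sleep "${PROBE_SLEEP:-5}"   # assumed interval between attempts
    fi
  done
  return 0       # all attempts failed
}

# Example probe: succeeds only on HTTP 200 from the given URL.
probe_http_200() {
  local code
  code="$(curl -sS -o /dev/null -w '%{http_code}' --max-time 12 "$1" 2>/dev/null)" || code=000
  [ "$code" = "200" ]
}
```

For example, `is_persistent_failure 3 probe_http_200 "$url"` would report a blocker only after three failed reads in a row, with `$url` supplied by the caller.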
@@ -100,6 +100,23 @@

This section only defines the order; it does not mean execution is currently approved.

### Phase 0: one-command read-only preflight

Run the read-only checklist from the repo workstation first; it does not restart, reload, renew, reset-failed, restore, run `pg_resetwal`, or write remote state.

```bash
SSH_BATCH_MODE=yes bash scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh --no-color
```

The currently expected result is:

```text
SERVICE_GREEN=1
HOST_HYGIENE_BLOCKED=1
RUNTIME_ACTION_AUTHORIZED=0
Result: SERVICE_GREEN_HOST_HYGIENE_BLOCKED. Open maintenance window; do not auto-repair.
```

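Note that the expected result above comes with a non-zero exit status: the checklist script exits `1` for `SERVICE_GREEN_HOST_HYGIENE_BLOCKED`, `2` for `BLOCKED`, `0` for `HOST_188_HYGIENE_GREEN`, and `64` for an unknown argument. A wrapper should therefore not treat every non-zero status as an unexpected failure. A minimal sketch of that mapping follows; `classify_checklist_exit` is a hypothetical helper, and the `USAGE_ERROR` / `UNKNOWN` labels are illustrative rather than strings the script prints.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helper, not part of the repo.
# Maps the checklist exit status to a label for logs or dashboards.
classify_checklist_exit() {
  case "$1" in
    0)  echo "HOST_188_HYGIENE_GREEN" ;;
    1)  echo "SERVICE_GREEN_HOST_HYGIENE_BLOCKED" ;;  # currently expected
    2)  echo "BLOCKED" ;;                             # needs triage
    64) echo "USAGE_ERROR" ;;                         # illustrative label
    *)  echo "UNKNOWN" ;;                             # illustrative label
  esac
}

# Example use (commented out so this block has no side effects):
#   bash scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh --no-color
#   classify_checklist_exit "$?"
```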
### Phase A: read-only snapshot before maintenance

```bash
|
||||
|
||||
@@ -11,7 +11,7 @@
| Area | Status | Completion | Evidence |
|------|--------|------------|----------|
| Overall recovery readiness | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-26 07:19 read-only follow-up confirms service recovery remains green after the SOP commit reached production. Cold-start rerun returned `PASS=87 WARN=0 BLOCKED=0`, result `GREEN`; ArgoCD `awoooi-prod Synced / Healthy` at revision `1fd5e2a8b0f18d24eed16aa2a44286bcbf230603`; API/Web/Worker pods are Running with restart `0`; public routes return expected statuses including AWOOOI API `200`, AWOOOI Web `307`, VibeWork `200`, AwoooGo `200`, MOMO health `200`, Stock freshness `200`, Bitan `200`, Gitea `200`, Harbor `200`, Registry `/v2/` expected `401`, Sentry expected `302`, SigNoz `200`, Langfuse `200`. Backup-status 07:18 remains 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, offsite/rclone fresh, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`. 188 remains product-service green but host-hygiene blocked: host PostgreSQL `14/main` has `invalid primary checkpoint record` / `could not locate a valid checkpoint record`, certbot renew for `sentry.wooo.work` is rate-limited / challenge-failed, and `awoooi-startup.service` previously attempted `pg_resetwal` as root. New runbook `docs/runbooks/HOST-188-HYGIENE-MAINTENANCE-RUNBOOK.md` defines the maintenance-window decision tree without authorizing runtime repair. Do not declare DR complete until `escrow_missing=0`; Wazuh manager registry accepted remains `0`; 111/168 Codex workspace HEAD drift and Mac Mini low free space remain workstation blockers, not reboot service blockers. |
| Overall recovery readiness | FULL_STACK_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-26 07:19 read-only follow-up confirms service recovery remains green after the SOP commit reached production. Cold-start rerun returned `PASS=87 WARN=0 BLOCKED=0`, result `GREEN`; ArgoCD `awoooi-prod Synced / Healthy` at revision `1fd5e2a8b0f18d24eed16aa2a44286bcbf230603`; API/Web/Worker pods are Running with restart `0`; public routes return expected statuses including AWOOOI API `200`, AWOOOI Web `307`, VibeWork `200`, AwoooGo `200`, MOMO health `200`, Stock freshness `200`, Bitan `200`, Gitea `200`, Harbor `200`, Registry `/v2/` expected `401`, Sentry expected `302`, SigNoz `200`, Langfuse `200`. Backup-status 07:18 remains 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, offsite/rclone fresh, `last_backup_all=2026-06-26 02:31:02`, `escrow_missing=5`. 188 remains product-service green but host-hygiene blocked: host PostgreSQL `14/main` has `invalid primary checkpoint record` / `could not locate a valid checkpoint record`, certbot renew for `sentry.wooo.work` is rate-limited / challenge-failed, and `awoooi-startup.service` previously attempted `pg_resetwal` as root. New runbook `docs/runbooks/HOST-188-HYGIENE-MAINTENANCE-RUNBOOK.md` plus read-only script `scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh` define the maintenance-window decision tree without authorizing runtime repair. Do not declare DR complete until `escrow_missing=0`; Wazuh manager registry accepted remains `0`; 111/168 Codex workspace HEAD drift and Mac Mini low free space remain workstation blockers, not reboot service blockers. |
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-26 07:19 readback shows 120 and 121 reachable, K3s active, `mon` and `mon1` both `Ready control-plane`, AWOOOI API/Web replicas split across both nodes, ArgoCD `awoooi-prod Synced / Healthy` at revision `1fd5e2a8b0f18d24eed16aa2a44286bcbf230603`, and `km-vectorize` official 03:00 Taipei-time run succeeded with `lastSuccess=2026-06-25T19:00:14Z`. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-26 06:58 backup readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`, last aggregate `2026-06-26 02:31:02`. DR remains blocked on real non-secret credential escrow evidence IDs; do not write placeholder markers or paste secret values. |
| P2 service / data truth | DONE | 100% | Service routes and core runtime are available, 110 current CPU pressure is attributable to active AWOOOI Web `turbo build` / Docker buildx, and previous orphan Chrome groups remain cleared. 2026-06-26 07:19 StockPlatform `/api/v1/system/freshness` returned `200`; 07:01 freshness payload was `status=ok`, `latest_trading_date=2026-06-25`, blockers `[]`; price / chips / margin / AI recommendations are all on `2026-06-25`. `ai.recommendations` row count is `2868`; `core.margin_short_daily` row count is `1976`. MOMO health `V10.699`, current-month parity `15383|15383|2026-06-01|2026-06-24|2026-06-01|2026-06-24`, and `MOMO_DAILY_FRESHNESS 1|2026-06-24` are green; expanded public routes are green. |
265
scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh
Executable file
@@ -0,0 +1,265 @@
#!/usr/bin/env bash
# Read-only pre-maintenance check for 188 host hygiene.
# This script never runs restart, reload, renew, reset-failed, restore, or pg_resetwal, and never modifies remote state.

set -uo pipefail

REMOTE_188="${REMOTE_188:-192.168.0.188}"
REMOTE_110="${REMOTE_110:-192.168.0.110}"
SSH_BATCH_MODE="${SSH_BATCH_MODE:-yes}"
SSH_STRICT_HOST_KEY_CHECKING="${SSH_STRICT_HOST_KEY_CHECKING:-accept-new}"
SSH_CONNECT_TIMEOUT="${SSH_CONNECT_TIMEOUT:-8}"
NO_COLOR=0

usage() {
  cat <<'USAGE'
Usage: bash scripts/reboot-recovery/188-host-hygiene-maintenance-checklist.sh [--no-color]

Read-only pre-maintenance checklist for host 188 hygiene issues.
It prints evidence for host PostgreSQL, certbot, product containers, backup core,
and route health. It never runs pg_resetwal, certbot renew, reset-failed, restore,
Nginx reload, Docker/systemd restart, or any host write.

Environment:
REMOTE_188=192.168.0.188
REMOTE_110=192.168.0.110
SSH_BATCH_MODE=yes
SSH_STRICT_HOST_KEY_CHECKING=accept-new
SSH_CONNECT_TIMEOUT=8
USAGE
}

while [ "$#" -gt 0 ]; do
  case "$1" in
    --no-color)
      NO_COLOR=1
      shift
      ;;
    -h|--help)
      usage
      exit 0
      ;;
    *)
      echo "Unknown argument: $1" >&2
      usage >&2
      exit 64
      ;;
  esac
done

if [ "$NO_COLOR" = "1" ]; then
  green=""
  yellow=""
  red=""
  blue=""
  reset=""
else
  green="$(printf '\033[32m')"
  yellow="$(printf '\033[33m')"
  red="$(printf '\033[31m')"
  blue="$(printf '\033[34m')"
  reset="$(printf '\033[0m')"
fi

PASS=0
WARN=0
BLOCKED=0
SERVICE_GREEN=1
HOST_HYGIENE_BLOCKED=0

ssh_opts=(
  -o BatchMode="$SSH_BATCH_MODE"
  -o ConnectTimeout="$SSH_CONNECT_TIMEOUT"
  -o StrictHostKeyChecking="$SSH_STRICT_HOST_KEY_CHECKING"
)

section() {
  printf "\n%s=== %s ===%s\n" "$blue" "$1" "$reset"
}

ok() {
  PASS=$((PASS + 1))
  printf "%sOK%s %s\n" "$green" "$reset" "$*"
}

warn() {
  WARN=$((WARN + 1))
  printf "%sWARN%s %s\n" "$yellow" "$reset" "$*"
}

blocked() {
  BLOCKED=$((BLOCKED + 1))
  HOST_HYGIENE_BLOCKED=1
  printf "%sBLOCKED%s %s\n" "$red" "$reset" "$*"
}

ssh_cmd() {
  local target="$1"
  local command="$2"
  ssh "${ssh_opts[@]}" "$target" "$command"
}

echo "AWOOOI 188 host hygiene maintenance checklist"
date '+%Y-%m-%d %H:%M:%S %Z'
echo "Scope: 188 host PostgreSQL 14/main, startup unit, certbot / ACME hygiene."
echo "RUNTIME_ACTION_AUTHORIZED=0"

section "188 host state"
if out=$(ssh_cmd "$REMOTE_188" '
  hostname
  date --iso-8601=seconds
  uptime
  echo "SYSTEMD_STATE $(systemctl is-system-running || true)"
  systemctl --failed --no-pager --plain || true
  df -h /
  free -h
' 2>&1); then
  echo "$out"
  ok "188 SSH readable"
  if grep -q 'SYSTEMD_STATE running' <<<"$out"; then
    ok "188 systemd running"
  else
    blocked "188 systemd is not fully running; failed units require owner disposition"
  fi
  grep -q 'postgresql@14-main.service' <<<"$out" && blocked "host PostgreSQL failed unit visible" || ok "host PostgreSQL failed unit not visible"
  grep -q 'certbot.service' <<<"$out" && blocked "apt certbot failed unit visible" || ok "apt certbot failed unit not visible"
  grep -q 'snap.certbot.renew.service' <<<"$out" && blocked "snap certbot renew failed unit visible" || ok "snap certbot renew failed unit not visible"
else
  SERVICE_GREEN=0
  blocked "188 SSH host state unavailable"
  echo "$out"
fi

section "188 product container state"
if out=$(ssh_cmd "$REMOTE_188" '
  docker ps --format "{{.Names}}\t{{.Status}}\t{{.Ports}}" 2>/dev/null || true
' 2>&1); then
  echo "$out"
  grep -q '^momo-pro-system[[:space:]].*healthy' <<<"$out" && ok "MOMO Pro container healthy" || { SERVICE_GREEN=0; blocked "MOMO Pro container not confirmed healthy"; }
  grep -q '^momo-db[[:space:]].*Up' <<<"$out" && ok "MOMO DB container up" || { SERVICE_GREEN=0; blocked "MOMO DB container not confirmed up"; }
  grep -q '^vibework-production-web-1[[:space:]].*healthy' <<<"$out" && ok "VibeWork production web container healthy" || warn "VibeWork production web container health not confirmed"
else
  SERVICE_GREEN=0
  blocked "188 docker ps unavailable"
  echo "$out"
fi

section "188 host PostgreSQL 14/main"
if out=$(ssh_cmd "$REMOTE_188" '
  pg_lsclusters 2>/dev/null || true
  systemctl status postgresql@14-main.service --no-pager || true
  echo "PG_ISREADY_LOCAL $(pg_isready -h localhost -p 5432 2>/dev/null || true)"
' 2>&1); then
  echo "$out"
  if grep -Eq '^14[[:space:]]+main[[:space:]]+5432[[:space:]]+down' <<<"$out"; then
    blocked "host PostgreSQL cluster 14/main is down"
  else
    ok "host PostgreSQL cluster 14/main not reported down"
  fi
  if grep -Eiq 'invalid primary checkpoint record|could not locate a valid checkpoint record|PANIC:' <<<"$out"; then
    blocked "PostgreSQL checkpoint/WAL error detected; pg_resetwal is break-glass only"
  fi
  if grep -q 'PG_ISREADY_LOCAL .*accepting connections' <<<"$out"; then
    warn "pg_isready accepts on localhost; do not use this alone as host 14/main health"
  fi
else
  blocked "PostgreSQL evidence unavailable"
  echo "$out"
fi

section "188 certbot / ACME"
if out=$(ssh_cmd "$REMOTE_188" '
  systemctl status certbot.service --no-pager || true
  systemctl status snap.certbot.renew.service --no-pager || true
' 2>&1); then
  echo "$out"
  grep -Eiq 'rateLimited|Service busy' <<<"$out" && blocked "certbot renewal is rate-limited; do not retry blindly"
  grep -Eiq 'Some challenges have failed|challenge' <<<"$out" && blocked "certbot challenge failure requires DNS / ACME route owner evidence"
else
  blocked "certbot status unavailable"
  echo "$out"
fi

cert_out="$(openssl s_client -servername sentry.wooo.work -connect sentry.wooo.work:443 </dev/null 2>/dev/null | openssl x509 -noout -subject -issuer -dates 2>/dev/null || true)"
printf '%s\n' "$cert_out"
if grep -q 'notAfter=' <<<"$cert_out"; then
  ok "sentry.wooo.work public certificate readable"
else
  warn "sentry.wooo.work public certificate not readable"
fi

section "110 backup / escrow boundary"
if out=$(ssh_cmd "$REMOTE_110" '
  test -x /backup/scripts/backup-status.sh && /backup/scripts/backup-status.sh --no-notify --no-refresh
' 2>&1); then
  echo "$out"
  grep -q 'core_blockers=0' <<<"$out" && ok "backup core blockers remain zero" || blocked "backup core blockers not zero"
  grep -q 'escrow_missing=0' <<<"$out" && ok "credential escrow complete" || warn "credential escrow still incomplete; DR_COMPLETE remains blocked"
else
  warn "110 backup status unavailable"
  echo "$out"
fi

section "Public route health"
route_fail=0
for url in \
  https://mo.wooo.work/health \
  https://signoz.wooo.work/ \
  https://vibework.wooo.work/ \
  https://awoooi.wooo.work/api/v1/health
do
  # curl normally prints 000 itself via -w when the request fails, so assign
  # the fallback instead of appending a second 000 to the captured value.
  code="$(curl -sS -o /dev/null -w '%{http_code}' --max-time 12 "$url" 2>/dev/null)" || code=000
  printf 'PUBLIC_ROUTE %s %s\n' "$code" "$url"
  case "$code" in
    2??|3??) ;;
    *) route_fail=1 ;;
  esac
done
if [ "$route_fail" -eq 0 ]; then
  ok "critical public routes respond"
else
  SERVICE_GREEN=0
  blocked "one or more critical public routes failed"
fi

section "Maintenance decision tree"
cat <<'STEPS'
Current expected outcome when 188 service is green but host hygiene is not:
SERVICE_GREEN=1
HOST_HYGIENE_BLOCKED=1
RESULT=SERVICE_GREEN_HOST_HYGIENE_BLOCKED

Allowed next step:
1. Collect owner disposition for host PostgreSQL 14/main:
- retired cluster, restore from backup, or break-glass repair.
2. Collect DNS / ACME / certificate owner evidence before certbot renew.
3. Open an explicit maintenance window with rollback owner and post-check.

Forbidden without maintenance approval:
- pg_resetwal
- systemctl reset-failed
- certbot renew
- nginx reload
- DB restore
- Docker/systemd restart
- host file write
STEPS

echo
echo "SERVICE_GREEN=$SERVICE_GREEN"
echo "HOST_HYGIENE_BLOCKED=$HOST_HYGIENE_BLOCKED"
echo "RUNTIME_ACTION_AUTHORIZED=0"
echo "PASS=$PASS WARN=$WARN BLOCKED=$BLOCKED"

if [ "$SERVICE_GREEN" -eq 1 ] && [ "$HOST_HYGIENE_BLOCKED" -eq 1 ]; then
  echo "Result: SERVICE_GREEN_HOST_HYGIENE_BLOCKED. Open maintenance window; do not auto-repair."
  exit 1
fi

if [ "$BLOCKED" -gt 0 ]; then
  echo "Result: BLOCKED. Service or host evidence needs triage."
  exit 2
fi

echo "Result: HOST_188_HYGIENE_GREEN."
exit 0