diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index db69a708..3e0ab742 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -15,6 +15,22 @@
 - `_push_decision_to_telegram()` and the stale READY token resend now share one dedup helper, so the two paths cannot drift apart again.
 - Added `test_decision_manager_telegram_dedup.py` to lock in that an `Incident` without a `title` field still produces an alertname fingerprint.
 
+## 2026-05-06 | cold-start gate promoted to persistent Prometheus monitor
+
+**Background**: The reboot SOP / baseline / one-shot script already let a manual rescue reach GREEN, but the commander requires that after the next reboot monitoring and alerting happen automatically, and that AI must not restart stateful services before the gate has passed.
+
+**Persisted in this change**:
+- Added `cold-start-textfile-exporter.sh`, which on every run converts the result of `full-stack-cold-start-check.sh --monitor-read-only --no-color` into node-exporter textfile metrics.
+- Added `install-cold-start-monitor-110.sh`, which installs the monitor into the user cron on 110 and writes `/home/wooo/node_exporter_textfiles/cold_start_recovery.prom` and `/home/wooo/reboot-recovery/cold-start-last.log` every 10 minutes.
+- `full-stack-cold-start-check.sh` gains `--monitor-read-only` / `--no-color`, so the resident monitor does not POST an Alertmanager smoke event every 10 minutes; the manual final gate must still use `--send-alert-test`.
+- `ops/monitoring/alerts-unified.yml` gains five `cold_start_recovery_alerts` rules: monitor missing, stale, blocked, degraded, last green too old.
+- The monitor on 110 needs to query K3s on 120 and the DR cron on 121; the existing `wooo` public key from 110 was added to `authorized_keys` on 120/121, and each host automatically backed up the original file as `authorized_keys.bak-cold-start-monitor-*`.
+
+**Verification**:
+- 110 textfile monitor live result: `awoooi_cold_start_last_result{result="green"} 1`, `warn_gates=0`, `blocked_gates=0`.
+- Prometheus reload succeeded with `107` rules; all five `cold_start_recovery_alerts` rules are `inactive ok`.
+- Formal final gate: `bash scripts/reboot-recovery/full-stack-cold-start-check.sh --watch --interval 1 --max-attempts 1 --send-alert-test --no-color` → `PASS=52 WARN=0 BLOCKED=0`, `ALERTCHAIN_CODE 200`.
+
 ## 2026-05-06 | momo-scheduler cold-start noise cleanup after reboot recovery
 
 **Background**: The full-stack cold-start SOP already reached `PASS=51 WARN=0 BLOCKED=0`, but `momo-scheduler` on 188 still left three non-fatal noise sources: the blank-page check still used the old copy marker, the TokenReport query was missing the `ai_call_budgets` table, and the ElephantAlpha/Hermes legacy step lacked engine injection.
diff --git a/docs/runbooks/FULL-STACK-COLD-START-SOP.md b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
index 3e300ce5..ae46188f 100644
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -545,7 +545,32 @@ This is the standard next-reboot release command. It checks every 60 seconds for
 
 Use `--send-alert-test` for final release because the alert chain is not proven until the AWOOOI Alertmanager webhook accepts a real POST. Without `--send-alert-test`, the script intentionally leaves a warning so operators do not falsely mark alerting as complete.
 
-### 13.3 Script-To-SOP Coverage Map
+### 13.3 Persistent Read-Only Monitor
+
+After recovery, host 110 should run the same gate as a node-exporter textfile monitor:
+
+```bash
+bash scripts/reboot-recovery/install-cold-start-monitor-110.sh
+```
+
+This installs two scripts under `/home/wooo/scripts/`, adds a marked user-cron block, and writes:
+
+- `/home/wooo/node_exporter_textfiles/cold_start_recovery.prom`
+- `/home/wooo/reboot-recovery/cold-start-last.log`
+
+The cron path uses `--monitor-read-only`, so it does not POST Alertmanager smoke events every 10 minutes. It converts the cold-start gate into Prometheus metrics:
+
+- `awoooi_cold_start_monitor_up`
+- `awoooi_cold_start_pass_gates`
+- `awoooi_cold_start_warn_gates`
+- `awoooi_cold_start_blocked_gates`
+- `awoooi_cold_start_last_run_timestamp`
+- `awoooi_cold_start_last_green_timestamp`
+- `awoooi_cold_start_last_result{result="green|degraded|blocked|check_failed"}`
+
+Prometheus rules in `ops/monitoring/alerts-unified.yml` alert when the monitor is missing, stale, blocked, degraded, or has not been green for more than 6 hours.
+
+### 13.4 Script-To-SOP Coverage Map
 
 | Script gate | SOP coverage | Blocks |
 |-------------|--------------|--------|
@@ -557,7 +582,7 @@ Use `--send-alert-test` for final release because the alert chain is not proven
 | `P2-PUBLIC-ROUTES` | external AWOOOI and momo URLs | external release |
 | `P2-SCHEDULES` | cron, CronJobs, backups, textfile exporters, DR drill | final done criteria |
 
-### 13.4 Next-Reboot Operator Contract
+### 13.5 Next-Reboot Operator Contract
 
 1. Run the watch command above.
 2. If it stops at `BLOCKED`, repair the first blocked gate and rerun watch mode.
@@ -583,6 +608,7 @@ All must be true:
 - High-load batch services are capped or delayed.
 - Runners are guarded and released last.
 - AI auto-remediation is not in full execution mode until all gates are green.
+- 110 cold-start textfile monitor is fresh, and Prometheus has the cold-start alert rules loaded.
 
 ---
 
diff --git a/ops/monitoring/alerts-unified.yml b/ops/monitoring/alerts-unified.yml
index 27e68900..f89493e5 100644
--- a/ops/monitoring/alerts-unified.yml
+++ b/ops/monitoring/alerts-unified.yml
@@ -772,6 +772,103 @@ groups:
           auto_repair_action: "ssh 192.168.0.{{ $labels.host }} 'systemctl show {{ $labels.unit }} -p CPUQuotaPerSecUSec -p MemoryMax -p ActiveState -p SubState'"
           runbook: "Recommended baseline: CPUQuota=200% and MemoryMax=2G per runner, applied by /home/wooo/scripts/apply-runner-systemd-guardrails.sh; if still overloaded, further limit concurrency or split the load."
 
+  # =========================================================================
+  # Full-stack reboot/cold-start gate monitor
+  # =========================================================================
+  - name: cold_start_recovery_alerts
+    interval: 60s
+    rules:
+      - alert: ColdStartMonitorMissing
+        # 2026-05-06 ogt + Codex: full-stack reboot recovery must have a durable signal,
+        # not only a one-off terminal transcript.
+        expr: absent(awoooi_cold_start_monitor_up{host="110"})
+        for: 20m
+        labels:
+          severity: warning
+          layer: host-110
+          team: ops
+          alert_category: infrastructure
+          notification_type: TYPE-1
+          auto_repair: "true"
+          mcp_provider: "ssh_host"
+          host_type: "bare_metal"
+        annotations:
+          summary: "Cold-start gate monitor has exposed no metrics for 20 minutes"
+          description: "110 is not exposing awoooi_cold_start_monitor_up, which means the full-stack cold-start gate is not being monitored by Prometheus."
+          auto_repair_action: "ssh 192.168.0.110 'crontab -l | sed -n \"/AWOOOI cold-start monitor start/,/AWOOOI cold-start monitor end/p\"; ls -l /home/wooo/node_exporter_textfiles/cold_start_recovery.prom /home/wooo/reboot-recovery/cold-start-last.log 2>/dev/null || true'"
+          runbook: "Run scripts/reboot-recovery/install-cold-start-monitor-110.sh; it only installs the read-only textfile exporter and does not need sudo."
+
+      - alert: ColdStartMonitorStale
+        expr: time() - awoooi_cold_start_last_run_timestamp{host="110"} > 1800
+        for: 10m
+        labels:
+          severity: warning
+          layer: host-110
+          team: ops
+          alert_category: infrastructure
+          notification_type: TYPE-1
+          auto_repair: "true"
+          mcp_provider: "ssh_host"
+          host_type: "bare_metal"
+        annotations:
+          summary: "Cold-start gate monitor has not updated for more than 30 minutes"
+          description: "cold_start_recovery.prom is stale; cannot confirm whether the reboot gates for 110/120/121/188 are still healthy."
+          auto_repair_action: "ssh 192.168.0.110 'tail -80 /tmp/awoooi-cold-start-monitor.cron.log 2>/dev/null || true; tail -120 /home/wooo/reboot-recovery/cold-start-last.log 2>/dev/null || true'"
+          runbook: "Check the user cron on 110, the SSH keys, and the permissions on /home/wooo/node_exporter_textfiles; do not treat stale as service available."
+
+      - alert: ColdStartRecoveryBlocked
+        expr: awoooi_cold_start_blocked_gates{host="110"} > 0 or awoooi_cold_start_last_result{host="110",result="blocked"} == 1
+        for: 5m
+        labels:
+          severity: critical
+          layer: full-stack
+          team: ops
+          alert_category: infrastructure
+          notification_type: TYPE-3
+          auto_repair: "true"
+          mcp_provider: "ssh_host"
+          host_type: "bare_metal"
+        annotations:
+          summary: "Full-stack cold-start gate has BLOCKED gates"
+          description: "The full-stack cold-start check detected {{ $value }} blocked gate(s). AI auto-repair may only collect evidence and notify first; it must not release runner/CD or restart stateful services."
+          auto_repair_action: "ssh 192.168.0.110 'tail -220 /home/wooo/reboot-recovery/cold-start-last.log'"
+          runbook: "Start repairs from the first BLOCKED gate; follow the phase order in docs/runbooks/FULL-STACK-COLD-START-SOP.md."
+
+      - alert: ColdStartRecoveryDegraded
+        expr: awoooi_cold_start_warn_gates{host="110"} > 0 or awoooi_cold_start_last_result{host="110",result="degraded"} == 1
+        for: 30m
+        labels:
+          severity: warning
+          layer: full-stack
+          team: ops
+          alert_category: infrastructure
+          notification_type: TYPE-1
+          auto_repair: "true"
+          mcp_provider: "ssh_host"
+          host_type: "bare_metal"
+        annotations:
+          summary: "Full-stack cold-start gate remains degraded"
+          description: "The full-stack cold-start check has reported WARN for 30 consecutive minutes. In this state reboot recovery must not be declared complete and high-load runner/CD must not be released."
+          auto_repair_action: "ssh 192.168.0.110 'tail -180 /home/wooo/reboot-recovery/cold-start-last.log'"
+          runbook: "After clearing the WARNs, rerun the final gate: bash scripts/reboot-recovery/full-stack-cold-start-check.sh --watch --interval 60 --max-attempts 30 --send-alert-test."
+
+      - alert: ColdStartLastGreenTooOld
+        expr: (time() - awoooi_cold_start_last_green_timestamp{host="110"} > 21600) and awoooi_cold_start_last_green_timestamp{host="110"} > 0
+        for: 30m
+        labels:
+          severity: warning
+          layer: full-stack
+          team: ops
+          alert_category: infrastructure
+          notification_type: TYPE-1
+          auto_repair: "false"
+          mcp_provider: "ssh_host"
+          host_type: "bare_metal"
+        annotations:
+          summary: "Full-stack cold-start monitor has not been GREEN for more than 6 hours"
+          description: "The last GREEN was more than 6 hours ago, meaning the cold-start baseline has not fully passed for an extended period."
+          runbook: "Check /home/wooo/reboot-recovery/cold-start-last.log; if the only cause is that the read-only monitor lacks the final webhook POST, fix the monitor mode rather than silencing the alert."
+
   # =========================================================================
   # MinIO / Kali alerts
   # =========================================================================
diff --git a/ops/reboot-recovery/full-stack-cold-start-baseline.yml b/ops/reboot-recovery/full-stack-cold-start-baseline.yml
index 08cdf10d..d83d53a1 100644
--- a/ops/reboot-recovery/full-stack-cold-start-baseline.yml
+++ b/ops/reboot-recovery/full-stack-cold-start-baseline.yml
@@ -295,6 +295,39 @@ ai_automation_gate:
     - "runner/CD release"
     - "generic container restart"
 
+persistent_monitoring:
+  host: "110"
+  install_command: "bash scripts/reboot-recovery/install-cold-start-monitor-110.sh"
+  schedule: "*/10 * * * *"
+  mode: "read_only"
+  send_alert_test: false
+  scripts:
+    check: "/home/wooo/scripts/full-stack-cold-start-check.sh"
+    exporter: "/home/wooo/scripts/cold-start-textfile-exporter.sh"
+  outputs:
+    textfile: "/home/wooo/node_exporter_textfiles/cold_start_recovery.prom"
+    last_log: "/home/wooo/reboot-recovery/cold-start-last.log"
+  metrics:
+    - "awoooi_cold_start_monitor_up"
+    - "awoooi_cold_start_pass_gates"
+    - "awoooi_cold_start_warn_gates"
+    - "awoooi_cold_start_blocked_gates"
+    - "awoooi_cold_start_last_run_timestamp"
+    - "awoooi_cold_start_last_green_timestamp"
+    - "awoooi_cold_start_last_result"
+  prometheus_alerts:
+    - "ColdStartMonitorMissing"
+    - "ColdStartMonitorStale"
+    - "ColdStartRecoveryBlocked"
+    - "ColdStartRecoveryDegraded"
+    - "ColdStartLastGreenTooOld"
+  ai_contract:
+    monitor_missing: "diagnose cron/textfile path only"
+    stale: "collect cron log and last check log"
+    degraded: "collect evidence, do not release high-load work"
+    blocked: "follow first BLOCKED gate in phase order"
+    forbidden: "generic restart, stateful restart, destructive repair"
+
 final_confirmation:
   command: "bash scripts/reboot-recovery/full-stack-cold-start-check.sh --watch --interval 60 --max-attempts 30 --send-alert-test"
   green_result:
diff --git a/scripts/ops/deploy-alerts.sh b/scripts/ops/deploy-alerts.sh
index d016b511..f35a8fd3 100755
--- a/scripts/ops/deploy-alerts.sh
+++ b/scripts/ops/deploy-alerts.sh
@@ -21,7 +21,14 @@ if [ ! -f "$RULES_FILE" ]; then
 fi
 
 # Validate YAML syntax
-python3 -c "import yaml; yaml.safe_load(open('$RULES_FILE'))" || { echo "ERROR: YAML syntax error"; exit 1; }
+if python3 -c "import yaml; yaml.safe_load(open('$RULES_FILE'))" 2>/dev/null; then
+  :
+elif ruby -e "require 'yaml'; YAML.load_file('$RULES_FILE')" 2>/dev/null; then
+  :
+else
+  echo "ERROR: YAML syntax error or no YAML parser available"
+  exit 1
+fi
 log "✅ YAML syntax validation passed"
 
 # Dry run mode
diff --git a/scripts/reboot-recovery/cold-start-textfile-exporter.sh b/scripts/reboot-recovery/cold-start-textfile-exporter.sh
new file mode 100755
index 00000000..82262eea
--- /dev/null
+++ b/scripts/reboot-recovery/cold-start-textfile-exporter.sh
@@ -0,0 +1,142 @@
+#!/usr/bin/env bash
+# Export AWOOOI full-stack cold-start gate status as node-exporter textfile metrics.
+#
+# 2026-05-06 ogt + Codex: reboot recovery hardening.
+# Intent: give Prometheus and the AI incident flow a durable, read-only signal
+# for the 110/120/121/188 startup gates. This wrapper never sends the
+# Alertmanager smoke event and never writes remote state.
+
+set -uo pipefail
+
+CHECK_SCRIPT="${CHECK_SCRIPT:-/home/wooo/scripts/full-stack-cold-start-check.sh}"
+TEXTFILE_DIR="${TEXTFILE_DIR:-${NODE_EXPORTER_TEXTFILE_DIR:-/home/wooo/node_exporter_textfiles}}"
+OUTPUT_NAME="${OUTPUT_NAME:-cold_start_recovery.prom}"
+LOG_DIR="${LOG_DIR:-/home/wooo/reboot-recovery}"
+CHECK_TIMEOUT_SECONDS="${CHECK_TIMEOUT_SECONDS:-240}"
+HOST_LABEL="${AIOPS_HOST_LABEL:-110}"
+SCOPE_LABEL="${AIOPS_SCOPE_LABEL:-110_120_121_188}"
+LOCK_FILE="${LOCK_FILE:-/tmp/awoooi-cold-start-textfile-exporter.lock}"
+
+escape_label() {
+  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
+}
+
+write_metric_file() {
+  local tmp="$1"
+  local now="$2"
+  local duration="$3"
+  local exit_code="$4"
+  local monitor_up="$5"
+  local pass="$6"
+  local warn="$7"
+  local blocked="$8"
+  local green="$9"
+  local degraded="${10}"
+  local blocked_state="${11}"
+  local check_failed="${12}"
+  local last_green="${13}"
+  local host scope
+  host=$(escape_label "$HOST_LABEL")
+  scope=$(escape_label "$SCOPE_LABEL")
+
+  cat >"$tmp" </dev/null 2>&1; then
+  exec 9>"$LOCK_FILE"
+  if ! flock -n 9; then
+    exit 0
+  fi
+fi
+
+mkdir -p "$TEXTFILE_DIR" "$LOG_DIR"
+
+start_ts=$(date +%s)
+log_tmp="$LOG_DIR/cold-start-last.log.tmp"
+log_file="$LOG_DIR/cold-start-last.log"
+state_file="$LOG_DIR/cold-start-last-green.timestamp"
+
+if [ ! -x "$CHECK_SCRIPT" ]; then
+  end_ts=$(date +%s)
+  tmp_metric=$(mktemp "$TEXTFILE_DIR/.cold_start_recovery.XXXXXX")
+  last_green=$(cat "$state_file" 2>/dev/null || echo 0)
+  printf 'CHECK_SCRIPT not executable: %s\n' "$CHECK_SCRIPT" >"$log_file"
+  write_metric_file "$tmp_metric" "$end_ts" "$((end_ts - start_ts))" 127 0 0 0 1 0 0 0 1 "$last_green"
+  chmod 0644 "$tmp_metric"
+  mv "$tmp_metric" "$TEXTFILE_DIR/$OUTPUT_NAME"
+  exit 0
+fi
+
+timeout "$CHECK_TIMEOUT_SECONDS" bash "$CHECK_SCRIPT" --monitor-read-only --no-color >"$log_tmp" 2>&1
+exit_code=$?
+mv "$log_tmp" "$log_file"
+
+summary_line=$(grep -E '^PASS=[0-9]+ WARN=[0-9]+ BLOCKED=[0-9]+' "$log_file" | tail -1 || true)
+monitor_up=0
+pass=0
+warn=0
+blocked=0
+green=0
+degraded=0
+blocked_state=0
+check_failed=0
+
+if [ -n "$summary_line" ]; then
+  monitor_up=1
+  pass=$(printf '%s\n' "$summary_line" | sed -n 's/.*PASS=\([0-9][0-9]*\).*/\1/p')
+  warn=$(printf '%s\n' "$summary_line" | sed -n 's/.*WARN=\([0-9][0-9]*\).*/\1/p')
+  blocked=$(printf '%s\n' "$summary_line" | sed -n 's/.*BLOCKED=\([0-9][0-9]*\).*/\1/p')
+  if [ "$blocked" -gt 0 ]; then
+    blocked_state=1
+  elif [ "$warn" -gt 0 ]; then
+    degraded=1
+  elif [ "$exit_code" -eq 0 ]; then
+    green=1
+  else
+    check_failed=1
+  fi
+else
+  check_failed=1
+fi
+
+end_ts=$(date +%s)
+if [ "$green" -eq 1 ]; then
+  printf '%s\n' "$end_ts" >"$state_file"
+fi
+last_green=$(cat "$state_file" 2>/dev/null || echo 0)
+
+tmp_metric=$(mktemp "$TEXTFILE_DIR/.cold_start_recovery.XXXXXX")
+write_metric_file "$tmp_metric" "$end_ts" "$((end_ts - start_ts))" "$exit_code" "$monitor_up" "$pass" "$warn" "$blocked" "$green" "$degraded" "$blocked_state" "$check_failed" "$last_green"
+chmod 0644 "$tmp_metric"
+mv "$tmp_metric" "$TEXTFILE_DIR/$OUTPUT_NAME"
diff --git a/scripts/reboot-recovery/full-stack-cold-start-check.sh b/scripts/reboot-recovery/full-stack-cold-start-check.sh
index cf36910b..a032104a 100755
--- a/scripts/reboot-recovery/full-stack-cold-start-check.sh
+++ b/scripts/reboot-recovery/full-stack-cold-start-check.sh
@@ -6,6 +6,7 @@ set -uo pipefail
 
 SSH_OPTS=(-o BatchMode=yes -o ConnectTimeout=6)
 SEND_ALERT_TEST=0
+MONITOR_READ_ONLY=0
 WATCH_MODE=0
 WATCH_INTERVAL=60
 WATCH_MAX_ATTEMPTS=30
@@ -16,9 +17,11 @@ Usage: bash scripts/reboot-recovery/full-stack-cold-start-check.sh [options]
 
 Options:
   --send-alert-test     POST one Alertmanager webhook test after AWOOOI API is ready.
+  --monitor-read-only   Skip the webhook POST without warning; intended for cron/textfile monitors.
   --watch               Repeat checks until all gates are GREEN or max attempts is reached.
   --interval SECONDS    Retry interval for --watch. Default: 60.
   --max-attempts COUNT  Max attempts for --watch. Default: 30. Use 0 for unlimited.
+  --no-color            Disable ANSI colors in output.
   -h, --help            Show this help.
 
 Default mode is read-only and does not POST an Alertmanager test event.
@@ -31,6 +34,12 @@ while [ "$#" -gt 0 ]; do
     --send-alert-test)
       SEND_ALERT_TEST=1
       ;;
+    --monitor-read-only)
+      MONITOR_READ_ONLY=1
+      ;;
+    --no-color)
+      NO_COLOR=1
+      ;;
     --watch)
       WATCH_MODE=1
       ;;
@@ -63,11 +72,19 @@ while [ "$#" -gt 0 ]; do
   shift
 done
 
-RED=$'\033[0;31m'
-GREEN=$'\033[0;32m'
-YELLOW=$'\033[1;33m'
-BLUE=$'\033[0;34m'
-NC=$'\033[0m'
+if [ -n "${NO_COLOR:-}" ]; then
+  RED=""
+  GREEN=""
+  YELLOW=""
+  BLUE=""
+  NC=""
+else
+  RED=$'\033[0;31m'
+  GREEN=$'\033[0;32m'
+  YELLOW=$'\033[1;33m'
+  BLUE=$'\033[0;34m'
+  NC=$'\033[0m'
+fi
 
 PASS=0
 WARN=0
@@ -121,6 +138,46 @@ ssh_cmd() {
   ssh "${SSH_OPTS[@]}" "$user_host" "${prefix}${cmd}"
 }
 
+host_has_ip() {
+  local expected_ip="$1"
+  if command -v ip >/dev/null 2>&1; then
+    ip -o -4 addr show 2>/dev/null | awk '{print $4}' | grep -q "^${expected_ip}/" && return 0
+  fi
+  hostname -I 2>/dev/null | tr ' ' '\n' | grep -qx "$expected_ip"
+}
+
+host_cmd() {
+  local user_host="$1"
+  local cmd="$2"
+  case "$user_host" in
+    *@192.168.0.110)
+      if host_has_ip "192.168.0.110"; then
+        bash -lc "$cmd"
+        return
+      fi
+      ;;
+    *@192.168.0.120)
+      if host_has_ip "192.168.0.120"; then
+        bash -lc "$cmd"
+        return
+      fi
+      ;;
+    *@192.168.0.121)
+      if host_has_ip "192.168.0.121"; then
+        bash -lc "$cmd"
+        return
+      fi
+      ;;
+    *@192.168.0.188)
+      if host_has_ip "192.168.0.188"; then
+        bash -lc "$cmd"
+        return
+      fi
+      ;;
+  esac
+  ssh_cmd "$user_host" "$cmd"
+}
+
 probe_http_code() {
   local url="$1"
   local attempt code
@@ -165,13 +222,19 @@ check_network() {
     fi
   done
 
-  arp -an | grep -E '192\.168\.0\.(110|120|121|188)' || warn "no ARP rows printed for one or more hosts"
+  if arp -an | grep -E '192\.168\.0\.(110|120|121|188)'; then
+    ok "ARP evidence printed"
+  elif [ "$MONITOR_READ_ONLY" -eq 1 ]; then
+    ok "ARP evidence unavailable in monitor mode; ping and TCP gates passed"
+  else
+    warn "no ARP rows printed for one or more hosts"
+  fi
 }
 
 check_188() {
   log_section "P0-188-DATA"
   local out
-  if ! out=$(ssh_cmd "ollama@192.168.0.188" '
+  if ! out=$(host_cmd "ollama@192.168.0.188" '
 echo "HOST $(hostname) $(uptime)"
 echo "MEM $(free -h | awk "/Mem:/ {print \$2,\$3,\$7}")"
 echo "SYSTEMD $(systemctl is-active containerd docker postgresql@14-main redis-server ollama nginx 2>/dev/null | tr "\n" " ")"
@@ -199,7 +262,7 @@ docker ps --format "DOCKER {{.Names}}\t{{.Status}}" | head -80
 check_110() {
   log_section "P0-110-REGISTRY-OBSERVABILITY"
   local out
-  if ! out=$(ssh_cmd "wooo@192.168.0.110" '
+  if ! out=$(host_cmd "wooo@192.168.0.110" '
 echo "HOST $(hostname) $(uptime)"
 echo "MEM $(free -h | awk "/Mem:/ {print \$2,\$3,\$7}")"
 echo "DOCKER_SYSTEMD $(systemctl is-active docker 2>/dev/null || true)"
@@ -231,7 +294,7 @@ docker ps --format "DOCKER {{.Names}}\t{{.Status}}" | head -120
 check_k3s() {
   log_section "P1-K3S"
   local out local_kubectl_out
-  if ! out=$(ssh_cmd "wooo@192.168.0.120" '
+  if ! out=$(host_cmd "wooo@192.168.0.120" '
 echo "HOST $(hostname) $(uptime)"
 echo "PG188_PORT $(nc -z -w 2 192.168.0.188 5432 >/dev/null 2>&1 && echo OPEN || echo CLOSED)"
 echo "SYSTEMD $(systemctl is-active k3s k3s-agent keepalived 2>/dev/null | tr "\n" " ")"
@@ -271,7 +334,7 @@ check_workload_and_alertchain() {
   log_section "P2-WORKLOAD-ALERTCHAIN"
   local api_code web_code alert_code
   local out
-  if out=$(ssh_cmd "wooo@192.168.0.120" '
+  if out=$(host_cmd "wooo@192.168.0.120" '
 api_code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 http://192.168.0.125:32334/api/v1/health 2>/dev/null || true)
 web_code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 http://192.168.0.125:32335/ 2>/dev/null || true)
 echo "API_CODE ${api_code:-000}"
@@ -292,12 +355,14 @@ WEB_CODE $web_code"
   [[ "$web_code" =~ ^[23] ]] && ok "AWOOOI Web reachable" || warn "AWOOOI Web not confirmed"
 
   if [ "$SEND_ALERT_TEST" -eq 1 ]; then
-    alert_code=$(ssh_cmd "wooo@192.168.0.120" 'curl -s -o /tmp/awoooi-alertchain.out -w "%{http_code}" --max-time 8 \
+    alert_code=$(host_cmd "wooo@192.168.0.120" 'curl -s -o /tmp/awoooi-alertchain.out -w "%{http_code}" --max-time 8 \
       -X POST "http://192.168.0.125:32334/api/v1/webhooks/alertmanager" \
       -H '"'"'Content-Type: application/json'"'"' \
       -d '"'"'{"receiver":"cold-start-check","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"ColdStartCheck","severity":"info"},"annotations":{"summary":"Cold start check"},"startsAt":"2026-05-05T11:00:00Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":""}],"groupLabels":{},"commonLabels":{},"commonAnnotations":{},"externalURL":"","version":"4","groupKey":"cold-start-check"}'"'"' 2>/dev/null || echo "000"')
     echo "ALERTCHAIN_CODE $alert_code"
     [[ "$alert_code" =~ ^2 ]] && ok "Alertmanager webhook endpoint accepts POST" || warn "Alertmanager webhook E2E not confirmed"
+  elif [ "$MONITOR_READ_ONLY" -eq 1 ]; then
+    ok "Alertmanager webhook POST intentionally skipped in read-only monitor mode"
   else
     warn "Alertmanager webhook POST skipped; rerun with --send-alert-test after API is ready"
   fi
@@ -326,7 +391,7 @@ check_schedules() {
   log_section "P2-SCHEDULES"
   local out
 
-  if out=$(ssh_cmd "ollama@192.168.0.188" '
+  if out=$(host_cmd "ollama@192.168.0.188" '
 now=$(date +%s)
 echo "CRON_188 $(systemctl is-active cron 2>/dev/null || systemctl is-active crond 2>/dev/null || true)"
 for f in /home/ollama/node_exporter_textfiles/backup.prom /home/ollama/node_exporter_textfiles/docker_restart_count.prom /home/ollama/node_exporter_textfiles/docker_stats.prom; do
@@ -356,7 +421,7 @@ echo "SCHEDULER_REGISTERED $(docker logs --since 6h momo-scheduler 2>&1 | grep -
     echo "$out"
   fi
 
-  if out=$(ssh_cmd "wooo@192.168.0.110" '
+  if out=$(host_cmd "wooo@192.168.0.110" '
 now=$(date +%s)
 echo "CRON_110 $(systemctl is-active cron 2>/dev/null || systemctl is-active crond 2>/dev/null || true)"
 echo "FAILED_UNITS_110 $(systemctl --failed --no-legend --plain 2>/dev/null | wc -l)"
@@ -383,7 +448,7 @@ done
     echo "$out"
   fi
 
-  if out=$(ssh_cmd "wooo@192.168.0.120" '
+  if out=$(host_cmd "wooo@192.168.0.120" '
 kcmd() {
   if [ -n "${REMOTE_SUDO_PASSWORD:-}" ]; then
     printf "%s\n" "$REMOTE_SUDO_PASSWORD" | sudo -S -p "" kubectl "$@"
@@ -411,7 +476,7 @@ kcmd get pods -n awoooi-prod --no-headers 2>/dev/null | awk "\$3 !~ /^(Running|C
     echo "$out"
   fi
 
-  if out=$(ssh_cmd "wooo@192.168.0.121" '
+  if out=$(host_cmd "wooo@192.168.0.121" '
 echo "CRON_121 $(systemctl is-active cron 2>/dev/null || systemctl is-active crond 2>/dev/null || true)"
 crontab -l 2>/dev/null | grep -q "dr-drill.sh" && echo "DR_DRILL_CRON present" || echo "DR_DRILL_CRON missing"
 ' 2>&1); then
diff --git a/scripts/reboot-recovery/install-cold-start-monitor-110.sh b/scripts/reboot-recovery/install-cold-start-monitor-110.sh
new file mode 100755
index 00000000..661632ca
--- /dev/null
+++ b/scripts/reboot-recovery/install-cold-start-monitor-110.sh
@@ -0,0 +1,44 @@
+#!/usr/bin/env bash
+# Install the AWOOOI cold-start textfile monitor on host 110 using the wooo user cron.
+#
+# This installer does not require sudo. It copies read-only monitor scripts into
+# /home/wooo/scripts and maintains a marked crontab block.
+
+set -euo pipefail
+
+TARGET_HOST="${TARGET_HOST:-wooo@192.168.0.110}"
+REMOTE_SCRIPT_DIR="${REMOTE_SCRIPT_DIR:-/home/wooo/scripts}"
+REMOTE_TEXTFILE_DIR="${REMOTE_TEXTFILE_DIR:-/home/wooo/node_exporter_textfiles}"
+REMOTE_LOG_DIR="${REMOTE_LOG_DIR:-/home/wooo/reboot-recovery}"
+CRON_SCHEDULE="${CRON_SCHEDULE:-*/10 * * * *}"
+CHECK_TIMEOUT_SECONDS="${CHECK_TIMEOUT_SECONDS:-240}"
+
+SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
+CHECK_SCRIPT="$SCRIPT_DIR/full-stack-cold-start-check.sh"
+EXPORTER_SCRIPT="$SCRIPT_DIR/cold-start-textfile-exporter.sh"
+
+if [ ! -f "$CHECK_SCRIPT" ] || [ ! -f "$EXPORTER_SCRIPT" ]; then
+  echo "Required scripts not found beside installer" >&2
+  exit 66
+fi
+
+ssh "$TARGET_HOST" "mkdir -p '$REMOTE_SCRIPT_DIR' '$REMOTE_TEXTFILE_DIR' '$REMOTE_LOG_DIR'"
+scp "$CHECK_SCRIPT" "$TARGET_HOST:$REMOTE_SCRIPT_DIR/full-stack-cold-start-check.sh"
+scp "$EXPORTER_SCRIPT" "$TARGET_HOST:$REMOTE_SCRIPT_DIR/cold-start-textfile-exporter.sh"
+ssh "$TARGET_HOST" "chmod 0755 '$REMOTE_SCRIPT_DIR/full-stack-cold-start-check.sh' '$REMOTE_SCRIPT_DIR/cold-start-textfile-exporter.sh'"
+
+ssh "$TARGET_HOST" bash -s </dev/null | sed '/# AWOOOI cold-start monitor start/,/# AWOOOI cold-start monitor end/d' > "\$tmp" || true
+cat >> "\$tmp" <<'CRON'
+# AWOOOI cold-start monitor start
+$CRON_SCHEDULE PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin CHECK_SCRIPT=$REMOTE_SCRIPT_DIR/full-stack-cold-start-check.sh TEXTFILE_DIR=$REMOTE_TEXTFILE_DIR LOG_DIR=$REMOTE_LOG_DIR CHECK_TIMEOUT_SECONDS=$CHECK_TIMEOUT_SECONDS $REMOTE_SCRIPT_DIR/cold-start-textfile-exporter.sh >/tmp/awoooi-cold-start-monitor.cron.log 2>&1
+# AWOOOI cold-start monitor end
+CRON
+crontab "\$tmp"
+rm -f "\$tmp"
+PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin CHECK_SCRIPT=$REMOTE_SCRIPT_DIR/full-stack-cold-start-check.sh TEXTFILE_DIR=$REMOTE_TEXTFILE_DIR LOG_DIR=$REMOTE_LOG_DIR CHECK_TIMEOUT_SECONDS=$CHECK_TIMEOUT_SECONDS "$REMOTE_SCRIPT_DIR/cold-start-textfile-exporter.sh"
+test -s "$REMOTE_TEXTFILE_DIR/cold_start_recovery.prom"
+tail -40 "$REMOTE_TEXTFILE_DIR/cold_start_recovery.prom"
+REMOTE
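
Review note: the exporter maps the check script's `PASS=… WARN=… BLOCKED=…` summary line and exit code onto the `green|degraded|blocked|check_failed` result label. The sketch below isolates that mapping so it can be exercised without SSH access to the hosts. It is a minimal illustration, not code from this patch: the function name `classify_cold_start` is hypothetical, and the precedence (blocked, then degraded, then exit code) is assumed to mirror the exporter's `if`/`elif` chain.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the patch): classify one summary line the
# way the exporter's if/elif chain appears to, given the check's exit code.
classify_cold_start() {
  local summary="$1" exit_code="${2:-0}"
  local warn blocked

  # No "PASS=.. WARN=.. BLOCKED=.." line means the check itself failed.
  if ! printf '%s\n' "$summary" | grep -Eq '^PASS=[0-9]+ WARN=[0-9]+ BLOCKED=[0-9]+'; then
    echo check_failed
    return 0
  fi

  warn=$(printf '%s\n' "$summary" | sed -n 's/.*WARN=\([0-9][0-9]*\).*/\1/p')
  blocked=$(printf '%s\n' "$summary" | sed -n 's/.*BLOCKED=\([0-9][0-9]*\).*/\1/p')

  if [ "$blocked" -gt 0 ]; then
    echo blocked        # any BLOCKED gate wins over everything else
  elif [ "$warn" -gt 0 ]; then
    echo degraded       # WARN without BLOCKED
  elif [ "$exit_code" -eq 0 ]; then
    echo green          # clean summary and clean exit
  else
    echo check_failed   # clean summary but non-zero exit (e.g. timeout)
  fi
}

classify_cold_start "PASS=52 WARN=0 BLOCKED=0" 0   # prints: green
classify_cold_start "PASS=50 WARN=1 BLOCKED=1" 1   # prints: blocked
```

One consequence worth noting under these assumptions: a run that prints a clean summary but exits non-zero is reported as `check_failed` rather than `green`, so the last-green timestamp does not advance.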