fix(ops): monitor full-stack cold-start gates
@@ -15,6 +15,22 @@
- `_push_decision_to_telegram()` and the stale READY token resend now share a single dedup helper, so the two paths cannot drift apart again.
- Added `test_decision_manager_telegram_dedup.py` to lock in that an alertname fingerprint is still produced when `Incident` has no `title` field.

## 2026-05-06 | cold-start gate promoted to persistent Prometheus monitor

**Background**: The reboot SOP, the baseline, and the one-shot script already let a manual rescue reach GREEN, but the commander requires that after the next reboot monitoring and alerting run automatically, and that the AI must not restart stateful services before the gates have passed.

**Persisted in this change**:
- Added `cold-start-textfile-exporter.sh`, which converts the result of each `full-stack-cold-start-check.sh --monitor-read-only --no-color` run into node-exporter textfile metrics.
- Added `install-cold-start-monitor-110.sh`, which installs the monitor into the user cron on 110; every 10 minutes it writes `/home/wooo/node_exporter_textfiles/cold_start_recovery.prom` and `/home/wooo/reboot-recovery/cold-start-last.log`.
- `full-stack-cold-start-check.sh` gains `--monitor-read-only` / `--no-color`, so the resident monitor does not POST an Alertmanager smoke event every 10 minutes; the manual final gate must still use `--send-alert-test`.
- `ops/monitoring/alerts-unified.yml` gains the `cold_start_recovery_alerts` group with 5 rules: monitor missing, stale, blocked, degraded, last green too old.
- The monitor on 110 needs to query K3s on 120 and the DR cron on 121; the existing `wooo` public key from 110 has been added to `authorized_keys` on 120/121, and each host automatically backed up the original file as `authorized_keys.bak-cold-start-monitor-*`.

**Verification**:
- Live result of the textfile monitor on 110: `awoooi_cold_start_last_result{result="green"} 1`, `warn_gates=0`, `blocked_gates=0`.
- Prometheus reload succeeded with `107` rules; all 5 `cold_start_recovery_alerts` rules are `inactive ok`.
- Official final gate: `bash scripts/reboot-recovery/full-stack-cold-start-check.sh --watch --interval 1 --max-attempts 1 --send-alert-test --no-color` → `PASS=52 WARN=0 BLOCKED=0`, `ALERTCHAIN_CODE 200`.
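
The `PASS=… WARN=… BLOCKED=…` summary line is what the exporter in this commit turns into a result label. The sketch below restates that mapping as a standalone function so the rule order is easy to see. The function name `classify_summary` is hypothetical and does not exist in the repository; the precedence (blocked, then degraded, then exit code) follows the exporter script shown later in this diff.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repository): map a summary line plus the
# check's exit code to one of green / degraded / blocked / check_failed.
classify_summary() {
  local line="$1" exit_code="${2:-0}"
  local warn blocked
  # No parseable summary means the check itself failed.
  if ! printf '%s\n' "$line" | grep -Eq '^PASS=[0-9]+ WARN=[0-9]+ BLOCKED=[0-9]+'; then
    echo "check_failed"
    return 0
  fi
  warn=$(printf '%s\n' "$line" | sed -n 's/.*WARN=\([0-9][0-9]*\).*/\1/p')
  blocked=$(printf '%s\n' "$line" | sed -n 's/.*BLOCKED=\([0-9][0-9]*\).*/\1/p')
  if [ "$blocked" -gt 0 ]; then
    echo "blocked"            # any blocked gate wins over warnings
  elif [ "$warn" -gt 0 ]; then
    echo "degraded"
  elif [ "$exit_code" -eq 0 ]; then
    echo "green"
  else
    echo "check_failed"       # clean counts but a non-zero exit code
  fi
}

classify_summary "PASS=52 WARN=0 BLOCKED=0" 0
```

Because blocked is tested before degraded, a run with both warnings and blocked gates is reported as blocked only.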

## 2026-05-06 | momo-scheduler cold-start noise cleanup after reboot recovery

**Background**: The full-stack cold-start SOP has reached `PASS=51 WARN=0 BLOCKED=0`, but `momo-scheduler` on 188 still leaves three non-fatal sources of noise: the white-page check still uses the old copy marker, the TokenReport query is missing the `ai_call_budgets` table, and the ElephantAlpha/Hermes legacy step lacks engine injection.

@@ -545,7 +545,32 @@ This is the standard next-reboot release command. It checks every 60 seconds for

Use `--send-alert-test` for final release because the alert chain is not proven until the AWOOOI Alertmanager webhook accepts a real POST. Without `--send-alert-test`, the script intentionally leaves a warning so operators do not falsely mark alerting as complete.

### 13.3 Script-To-SOP Coverage Map
### 13.3 Persistent Read-Only Monitor

After recovery, host 110 should run the same gate as a node-exporter textfile monitor:

```bash
bash scripts/reboot-recovery/install-cold-start-monitor-110.sh
```

This installs two scripts under `/home/wooo/scripts/`, adds a marked user-cron block, and writes:

- `/home/wooo/node_exporter_textfiles/cold_start_recovery.prom`
- `/home/wooo/reboot-recovery/cold-start-last.log`

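
To confirm what the installer left behind, the marked cron block can be pulled back out of the crontab. The sketch below is a minimal illustration, assuming the block is delimited by the `# AWOOOI cold-start monitor start` and `# AWOOOI cold-start monitor end` comment lines used by the installer in this commit. The function name `extract_monitor_block` is hypothetical, and the sample crontab lines are placeholders.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repository): print only the marked
# cold-start monitor block from crontab text supplied on stdin.
extract_monitor_block() {
  sed -n '/# AWOOOI cold-start monitor start/,/# AWOOOI cold-start monitor end/p'
}

# Demonstration on sample crontab text; schedule and command are placeholders.
printf '%s\n' \
  '0 3 * * * /usr/bin/true' \
  '# AWOOOI cold-start monitor start' \
  '*/10 * * * * /home/wooo/scripts/cold-start-textfile-exporter.sh' \
  '# AWOOOI cold-start monitor end' |
  extract_monitor_block
```

On host 110 the same function would be fed from `crontab -l 2>/dev/null`; an empty result suggests the installer has not been run for that user.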

The cron path uses `--monitor-read-only`, so it does not POST Alertmanager smoke events every 10 minutes. It converts the cold-start gate into Prometheus metrics:

- `awoooi_cold_start_monitor_up`
- `awoooi_cold_start_pass_gates`
- `awoooi_cold_start_warn_gates`
- `awoooi_cold_start_blocked_gates`
- `awoooi_cold_start_last_run_timestamp`
- `awoooi_cold_start_last_green_timestamp`
- `awoooi_cold_start_last_result{result="green|degraded|blocked|check_failed"}`

Prometheus rules in `ops/monitoring/alerts-unified.yml` alert when the monitor is missing, stale, blocked, degraded, or has not been green for more than 6 hours.
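
When Prometheus itself is unreachable, the same freshness question can be answered straight from the textfile. The following is a minimal sketch, not part of the repository: `metric_value` and `monitor_is_stale` are hypothetical helper names, and the 1800-second default is taken from the `ColdStartMonitorStale` expression in this commit. It assumes the timestamp is written as an integer, as the exporter does with `date +%s`.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not in the repository) for reading the textfile by hand.

# Print the value of the first sample of metric $2 in textfile $1, ignoring labels.
metric_value() {
  local file="$1" name="$2"
  awk -v n="$name" '
    /^#/ { next }                      # skip HELP and TYPE lines
    {
      metric = $1
      sub(/[{].*/, "", metric)         # drop the label set
      if (metric == n) { print $NF; exit }
    }
  ' "$file"
}

# Succeed (exit 0) when the last run is older than max_age seconds, or unknown.
monitor_is_stale() {
  local file="$1" now="$2" max_age="${3:-1800}" last_run
  last_run=$(metric_value "$file" awoooi_cold_start_last_run_timestamp)
  [ -n "$last_run" ] || return 0
  [ $(( now - last_run )) -gt "$max_age" ]
}

prom=/home/wooo/node_exporter_textfiles/cold_start_recovery.prom
if [ -r "$prom" ] && ! monitor_is_stale "$prom" "$(date +%s)"; then
  echo "cold-start monitor is fresh"
else
  echo "cold-start monitor is stale or missing"
fi
```

Passing `21600` as the third argument and reading `awoooi_cold_start_last_green_timestamp` instead would approximate the 6-hour GREEN check in the same way.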

### 13.4 Script-To-SOP Coverage Map

| Script gate | SOP coverage | Blocks |
|-------------|--------------|--------|
@@ -557,7 +582,7 @@ Use `--send-alert-test` for final release because the alert chain is not proven
| `P2-PUBLIC-ROUTES` | external AWOOOI and momo URLs | external release |
| `P2-SCHEDULES` | cron, CronJobs, backups, textfile exporters, DR drill | final done criteria |

### 13.4 Next-Reboot Operator Contract
### 13.5 Next-Reboot Operator Contract

1. Run the watch command above.
2. If it stops at `BLOCKED`, repair the first blocked gate and rerun watch mode.
@@ -583,6 +608,7 @@ All must be true:
- High-load batch services are capped or delayed.
- Runners are guarded and released last.
- AI auto-remediation is not in full execution mode until all gates are green.
- 110 cold-start textfile monitor is fresh, and Prometheus has the cold-start alert rules loaded.
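
One part of the last criterion can be pre-checked offline: that the rules file names all five cold-start alerts. The sketch below is an illustration only. `missing_cold_start_rules` is a hypothetical helper that is not in the repository, and it inspects the file on disk, so whether the running Prometheus has actually loaded the rules still has to be confirmed against the server.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repository): list any of the five cold-start
# alert names that are absent from a rules file. Exits non-zero if any is missing.
missing_cold_start_rules() {
  local file="$1" name missing=0
  for name in ColdStartMonitorMissing ColdStartMonitorStale \
              ColdStartRecoveryBlocked ColdStartRecoveryDegraded \
              ColdStartLastGreenTooOld; do
    if ! grep -Eq "^[[:space:]]*- alert: ${name}[[:space:]]*\$" "$file"; then
      echo "$name"
      missing=1
    fi
  done
  return "$missing"
}

# Run from the repository root; the path is the one named in this commit.
rules=ops/monitoring/alerts-unified.yml
if [ -r "$rules" ]; then
  missing_cold_start_rules "$rules" || echo "some cold-start rules are missing from $rules"
fi
```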

---

@@ -772,6 +772,103 @@ groups:
|
||||
auto_repair_action: "ssh 192.168.0.{{ $labels.host }} 'systemctl show {{ $labels.unit }} -p CPUQuotaPerSecUSec -p MemoryMax -p ActiveState -p SubState'"
|
||||
runbook: "Suggested baseline: CPUQuota=200% and MemoryMax=2G per runner; apply with /home/wooo/scripts/apply-runner-systemd-guardrails.sh, and if still overloaded, limit concurrency or split the load."
|
||||
|
||||
# =========================================================================
|
||||
# Full-stack reboot/cold-start gate monitor
|
||||
# =========================================================================
|
||||
- name: cold_start_recovery_alerts
|
||||
interval: 60s
|
||||
rules:
|
||||
- alert: ColdStartMonitorMissing
|
||||
# 2026-05-06 ogt + Codex: full-stack reboot recovery must have a durable signal,
|
||||
# not only a one-off terminal transcript.
|
||||
expr: absent(awoooi_cold_start_monitor_up{host="110"})
|
||||
for: 20m
|
||||
labels:
|
||||
severity: warning
|
||||
layer: host-110
|
||||
team: ops
|
||||
alert_category: infrastructure
|
||||
notification_type: TYPE-1
|
||||
auto_repair: "true"
|
||||
mcp_provider: "ssh_host"
|
||||
host_type: "bare_metal"
|
||||
annotations:
|
||||
summary: "Cold-start gate monitor has exposed no metrics for 20 minutes"
|
||||
description: "110 is not exposing awoooi_cold_start_monitor_up, which means the full-stack cold-start gate is not being monitored by Prometheus."
|
||||
auto_repair_action: "ssh 192.168.0.110 'crontab -l | sed -n \"/AWOOOI cold-start monitor start/,/AWOOOI cold-start monitor end/p\"; ls -l /home/wooo/node_exporter_textfiles/cold_start_recovery.prom /home/wooo/reboot-recovery/cold-start-last.log 2>/dev/null || true'"
|
||||
runbook: "Run scripts/reboot-recovery/install-cold-start-monitor-110.sh; it installs only the read-only textfile exporter and does not need sudo."
|
||||
|
||||
- alert: ColdStartMonitorStale
|
||||
expr: time() - awoooi_cold_start_last_run_timestamp{host="110"} > 1800
|
||||
for: 10m
|
||||
labels:
|
||||
severity: warning
|
||||
layer: host-110
|
||||
team: ops
|
||||
alert_category: infrastructure
|
||||
notification_type: TYPE-1
|
||||
auto_repair: "true"
|
||||
mcp_provider: "ssh_host"
|
||||
host_type: "bare_metal"
|
||||
annotations:
|
||||
summary: "Cold-start gate monitor has not updated for more than 30 minutes"
|
||||
description: "cold_start_recovery.prom is stale, so it cannot be confirmed whether the reboot gates on 110/120/121/188 are still healthy."
|
||||
auto_repair_action: "ssh 192.168.0.110 'tail -80 /tmp/awoooi-cold-start-monitor.cron.log 2>/dev/null || true; tail -120 /home/wooo/reboot-recovery/cold-start-last.log 2>/dev/null || true'"
|
||||
runbook: "Check the 110 user cron, the SSH key, and the permissions on /home/wooo/node_exporter_textfiles; do not treat a stale monitor as proof that services are available."
|
||||
|
||||
- alert: ColdStartRecoveryBlocked
|
||||
expr: awoooi_cold_start_blocked_gates{host="110"} > 0 or awoooi_cold_start_last_result{host="110",result="blocked"} == 1
|
||||
for: 5m
|
||||
labels:
|
||||
severity: critical
|
||||
layer: full-stack
|
||||
team: ops
|
||||
alert_category: infrastructure
|
||||
notification_type: TYPE-3
|
||||
auto_repair: "true"
|
||||
mcp_provider: "ssh_host"
|
||||
host_type: "bare_metal"
|
||||
annotations:
|
||||
summary: "Full-stack cold-start gate has BLOCKED gates"
|
||||
description: "The full-stack cold-start check detected {{ $value }} blocked gate(s). AI auto-remediation may only collect evidence and notify; it must not release runner/CD or restart stateful services."
|
||||
auto_repair_action: "ssh 192.168.0.110 'tail -220 /home/wooo/reboot-recovery/cold-start-last.log'"
|
||||
runbook: "Start repairs from the first BLOCKED gate; follow the phase order in docs/runbooks/FULL-STACK-COLD-START-SOP.md."
|
||||
|
||||
- alert: ColdStartRecoveryDegraded
|
||||
expr: awoooi_cold_start_warn_gates{host="110"} > 0 or awoooi_cold_start_last_result{host="110",result="degraded"} == 1
|
||||
for: 30m
|
||||
labels:
|
||||
severity: warning
|
||||
layer: full-stack
|
||||
team: ops
|
||||
alert_category: infrastructure
|
||||
notification_type: TYPE-1
|
||||
auto_repair: "true"
|
||||
mcp_provider: "ssh_host"
|
||||
host_type: "bare_metal"
|
||||
annotations:
|
||||
summary: "Full-stack cold-start gate remains degraded"
|
||||
description: "The full-stack cold-start check has reported WARN for 30 consecutive minutes. In this state reboot recovery must not be declared complete, and high-load runner/CD must not be released."
|
||||
auto_repair_action: "ssh 192.168.0.110 'tail -180 /home/wooo/reboot-recovery/cold-start-last.log'"
|
||||
runbook: "After clearing the WARNs, run the final gate: bash scripts/reboot-recovery/full-stack-cold-start-check.sh --watch --interval 60 --max-attempts 30 --send-alert-test."
|
||||
|
||||
- alert: ColdStartLastGreenTooOld
|
||||
expr: (time() - awoooi_cold_start_last_green_timestamp{host="110"} > 21600) and awoooi_cold_start_last_green_timestamp{host="110"} > 0
|
||||
for: 30m
|
||||
labels:
|
||||
severity: warning
|
||||
layer: full-stack
|
||||
team: ops
|
||||
alert_category: infrastructure
|
||||
notification_type: TYPE-1
|
||||
auto_repair: "false"
|
||||
mcp_provider: "ssh_host"
|
||||
host_type: "bare_metal"
|
||||
annotations:
|
||||
summary: "Full-stack cold-start monitor has had no GREEN for more than 6 hours"
|
||||
description: "The last GREEN was more than 6 hours ago, which means the cold-start baseline has not fully passed for an extended period."
|
||||
runbook: "Check /home/wooo/reboot-recovery/cold-start-last.log; if the only cause is the read-only monitor lacking the final webhook POST, fix the monitor mode rather than disabling the alert."
|
||||
|
||||
# =========================================================================
|
||||
# MinIO / Kali alerts
|
||||
# =========================================================================
|
||||
|
||||
@@ -295,6 +295,39 @@ ai_automation_gate:
|
||||
- "runner/CD release"
|
||||
- "generic container restart"
|
||||
|
||||
persistent_monitoring:
|
||||
host: "110"
|
||||
install_command: "bash scripts/reboot-recovery/install-cold-start-monitor-110.sh"
|
||||
schedule: "*/10 * * * *"
|
||||
mode: "read_only"
|
||||
send_alert_test: false
|
||||
scripts:
|
||||
check: "/home/wooo/scripts/full-stack-cold-start-check.sh"
|
||||
exporter: "/home/wooo/scripts/cold-start-textfile-exporter.sh"
|
||||
outputs:
|
||||
textfile: "/home/wooo/node_exporter_textfiles/cold_start_recovery.prom"
|
||||
last_log: "/home/wooo/reboot-recovery/cold-start-last.log"
|
||||
metrics:
|
||||
- "awoooi_cold_start_monitor_up"
|
||||
- "awoooi_cold_start_pass_gates"
|
||||
- "awoooi_cold_start_warn_gates"
|
||||
- "awoooi_cold_start_blocked_gates"
|
||||
- "awoooi_cold_start_last_run_timestamp"
|
||||
- "awoooi_cold_start_last_green_timestamp"
|
||||
- "awoooi_cold_start_last_result"
|
||||
prometheus_alerts:
|
||||
- "ColdStartMonitorMissing"
|
||||
- "ColdStartMonitorStale"
|
||||
- "ColdStartRecoveryBlocked"
|
||||
- "ColdStartRecoveryDegraded"
|
||||
- "ColdStartLastGreenTooOld"
|
||||
ai_contract:
|
||||
monitor_missing: "diagnose cron/textfile path only"
|
||||
stale: "collect cron log and last check log"
|
||||
degraded: "collect evidence, do not release high-load work"
|
||||
blocked: "follow first BLOCKED gate in phase order"
|
||||
forbidden: "generic restart, stateful restart, destructive repair"
|
||||
|
||||
final_confirmation:
|
||||
command: "bash scripts/reboot-recovery/full-stack-cold-start-check.sh --watch --interval 60 --max-attempts 30 --send-alert-test"
|
||||
green_result:
|
||||
|
||||
@@ -21,7 +21,14 @@ if [ ! -f "$RULES_FILE" ]; then
|
||||
fi
|
||||
|
||||
# Validate YAML syntax
|
||||
python3 -c "import yaml; yaml.safe_load(open('$RULES_FILE'))" || { echo "ERROR: YAML syntax error"; exit 1; }
|
||||
if python3 -c "import yaml; yaml.safe_load(open('$RULES_FILE'))" 2>/dev/null; then
|
||||
:
|
||||
elif ruby -e "require 'yaml'; YAML.load_file('$RULES_FILE')" 2>/dev/null; then
|
||||
:
|
||||
else
|
||||
echo "ERROR: YAML syntax error or no YAML parser available"
|
||||
exit 1
|
||||
fi
|
||||
log "✅ YAML syntax validation passed"
|
||||
|
||||
# Dry-run mode
|
||||
|
||||
142
scripts/reboot-recovery/cold-start-textfile-exporter.sh
Executable file
@@ -0,0 +1,142 @@
|
||||
#!/usr/bin/env bash
|
||||
# Export AWOOOI full-stack cold-start gate status as node-exporter textfile metrics.
|
||||
#
|
||||
# 2026-05-06 ogt + Codex: reboot recovery hardening.
|
||||
# Intent: give Prometheus and the AI incident flow a durable, read-only signal
|
||||
# for the 110/120/121/188 startup gates. This wrapper never sends the
|
||||
# Alertmanager smoke event and never writes remote state.
|
||||
|
||||
set -uo pipefail
|
||||
|
||||
CHECK_SCRIPT="${CHECK_SCRIPT:-/home/wooo/scripts/full-stack-cold-start-check.sh}"
|
||||
TEXTFILE_DIR="${TEXTFILE_DIR:-${NODE_EXPORTER_TEXTFILE_DIR:-/home/wooo/node_exporter_textfiles}}"
|
||||
OUTPUT_NAME="${OUTPUT_NAME:-cold_start_recovery.prom}"
|
||||
LOG_DIR="${LOG_DIR:-/home/wooo/reboot-recovery}"
|
||||
CHECK_TIMEOUT_SECONDS="${CHECK_TIMEOUT_SECONDS:-240}"
|
||||
HOST_LABEL="${AIOPS_HOST_LABEL:-110}"
|
||||
SCOPE_LABEL="${AIOPS_SCOPE_LABEL:-110_120_121_188}"
|
||||
LOCK_FILE="${LOCK_FILE:-/tmp/awoooi-cold-start-textfile-exporter.lock}"
|
||||
|
||||
escape_label() {
|
||||
printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
|
||||
}
|
||||
|
||||
write_metric_file() {
|
||||
local tmp="$1"
|
||||
local now="$2"
|
||||
local duration="$3"
|
||||
local exit_code="$4"
|
||||
local monitor_up="$5"
|
||||
local pass="$6"
|
||||
local warn="$7"
|
||||
local blocked="$8"
|
||||
local green="$9"
|
||||
local degraded="${10}"
|
||||
local blocked_state="${11}"
|
||||
local check_failed="${12}"
|
||||
local last_green="${13}"
|
||||
local host scope
|
||||
host=$(escape_label "$HOST_LABEL")
|
||||
scope=$(escape_label "$SCOPE_LABEL")
|
||||
|
||||
cat >"$tmp" <<METRICS
|
||||
# HELP awoooi_cold_start_monitor_up Whether the cold-start monitor produced a parseable summary.
|
||||
# TYPE awoooi_cold_start_monitor_up gauge
|
||||
awoooi_cold_start_monitor_up{host="$host",scope="$scope",mode="read_only"} $monitor_up
|
||||
# HELP awoooi_cold_start_pass_gates Last cold-start check pass gate count.
|
||||
# TYPE awoooi_cold_start_pass_gates gauge
|
||||
awoooi_cold_start_pass_gates{host="$host",scope="$scope"} $pass
|
||||
# HELP awoooi_cold_start_warn_gates Last cold-start check warning gate count.
|
||||
# TYPE awoooi_cold_start_warn_gates gauge
|
||||
awoooi_cold_start_warn_gates{host="$host",scope="$scope"} $warn
|
||||
# HELP awoooi_cold_start_blocked_gates Last cold-start check blocked gate count.
|
||||
# TYPE awoooi_cold_start_blocked_gates gauge
|
||||
awoooi_cold_start_blocked_gates{host="$host",scope="$scope"} $blocked
|
||||
# HELP awoooi_cold_start_last_run_timestamp Unix timestamp of the last cold-start monitor run.
|
||||
# TYPE awoooi_cold_start_last_run_timestamp gauge
|
||||
awoooi_cold_start_last_run_timestamp{host="$host",scope="$scope"} $now
|
||||
# HELP awoooi_cold_start_last_green_timestamp Unix timestamp of the last GREEN cold-start monitor run.
|
||||
# TYPE awoooi_cold_start_last_green_timestamp gauge
|
||||
awoooi_cold_start_last_green_timestamp{host="$host",scope="$scope"} $last_green
|
||||
# HELP awoooi_cold_start_last_run_duration_seconds Last cold-start monitor run duration in seconds.
|
||||
# TYPE awoooi_cold_start_last_run_duration_seconds gauge
|
||||
awoooi_cold_start_last_run_duration_seconds{host="$host",scope="$scope"} $duration
|
||||
# HELP awoooi_cold_start_last_exit_code Last cold-start monitor process exit code.
|
||||
# TYPE awoooi_cold_start_last_exit_code gauge
|
||||
awoooi_cold_start_last_exit_code{host="$host",scope="$scope"} $exit_code
|
||||
# HELP awoooi_cold_start_last_result Last cold-start result as one-hot labels.
|
||||
# TYPE awoooi_cold_start_last_result gauge
|
||||
awoooi_cold_start_last_result{host="$host",scope="$scope",result="green"} $green
|
||||
awoooi_cold_start_last_result{host="$host",scope="$scope",result="degraded"} $degraded
|
||||
awoooi_cold_start_last_result{host="$host",scope="$scope",result="blocked"} $blocked_state
|
||||
awoooi_cold_start_last_result{host="$host",scope="$scope",result="check_failed"} $check_failed
|
||||
METRICS
|
||||
}
|
||||
|
||||
if [ -n "${BASH_VERSION:-}" ] && command -v flock >/dev/null 2>&1; then
|
||||
exec 9>"$LOCK_FILE"
|
||||
if ! flock -n 9; then
|
||||
exit 0
|
||||
fi
|
||||
fi
|
||||
|
||||
mkdir -p "$TEXTFILE_DIR" "$LOG_DIR"
|
||||
|
||||
start_ts=$(date +%s)
|
||||
log_tmp="$LOG_DIR/cold-start-last.log.tmp"
|
||||
log_file="$LOG_DIR/cold-start-last.log"
|
||||
state_file="$LOG_DIR/cold-start-last-green.timestamp"
|
||||
|
||||
if [ ! -x "$CHECK_SCRIPT" ]; then
|
||||
end_ts=$(date +%s)
|
||||
tmp_metric=$(mktemp "$TEXTFILE_DIR/.cold_start_recovery.XXXXXX")
|
||||
last_green=$(cat "$state_file" 2>/dev/null || echo 0)
|
||||
printf 'CHECK_SCRIPT not executable: %s\n' "$CHECK_SCRIPT" >"$log_file"
|
||||
write_metric_file "$tmp_metric" "$end_ts" "$((end_ts - start_ts))" 127 0 0 0 1 0 0 0 1 "$last_green"
|
||||
chmod 0644 "$tmp_metric"
|
||||
mv "$tmp_metric" "$TEXTFILE_DIR/$OUTPUT_NAME"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
timeout "$CHECK_TIMEOUT_SECONDS" bash "$CHECK_SCRIPT" --monitor-read-only --no-color >"$log_tmp" 2>&1
|
||||
exit_code=$?
|
||||
mv "$log_tmp" "$log_file"
|
||||
|
||||
summary_line=$(grep -E '^PASS=[0-9]+ WARN=[0-9]+ BLOCKED=[0-9]+' "$log_file" | tail -1 || true)
|
||||
monitor_up=0
|
||||
pass=0
|
||||
warn=0
|
||||
blocked=0
|
||||
green=0
|
||||
degraded=0
|
||||
blocked_state=0
|
||||
check_failed=0
|
||||
|
||||
if [ -n "$summary_line" ]; then
|
||||
monitor_up=1
|
||||
pass=$(printf '%s\n' "$summary_line" | sed -n 's/.*PASS=\([0-9][0-9]*\).*/\1/p')
|
||||
warn=$(printf '%s\n' "$summary_line" | sed -n 's/.*WARN=\([0-9][0-9]*\).*/\1/p')
|
||||
blocked=$(printf '%s\n' "$summary_line" | sed -n 's/.*BLOCKED=\([0-9][0-9]*\).*/\1/p')
|
||||
if [ "$blocked" -gt 0 ]; then
|
||||
blocked_state=1
|
||||
elif [ "$warn" -gt 0 ]; then
|
||||
degraded=1
|
||||
elif [ "$exit_code" -eq 0 ]; then
|
||||
green=1
|
||||
else
|
||||
check_failed=1
|
||||
fi
|
||||
else
|
||||
check_failed=1
|
||||
fi
|
||||
|
||||
end_ts=$(date +%s)
|
||||
if [ "$green" -eq 1 ]; then
|
||||
printf '%s\n' "$end_ts" >"$state_file"
|
||||
fi
|
||||
last_green=$(cat "$state_file" 2>/dev/null || echo 0)
|
||||
|
||||
tmp_metric=$(mktemp "$TEXTFILE_DIR/.cold_start_recovery.XXXXXX")
|
||||
write_metric_file "$tmp_metric" "$end_ts" "$((end_ts - start_ts))" "$exit_code" "$monitor_up" "$pass" "$warn" "$blocked" "$green" "$degraded" "$blocked_state" "$check_failed" "$last_green"
|
||||
chmod 0644 "$tmp_metric"
|
||||
mv "$tmp_metric" "$TEXTFILE_DIR/$OUTPUT_NAME"
|
||||
@@ -6,6 +6,7 @@ set -uo pipefail
|
||||
|
||||
SSH_OPTS=(-o BatchMode=yes -o ConnectTimeout=6)
|
||||
SEND_ALERT_TEST=0
|
||||
MONITOR_READ_ONLY=0
|
||||
WATCH_MODE=0
|
||||
WATCH_INTERVAL=60
|
||||
WATCH_MAX_ATTEMPTS=30
|
||||
@@ -16,9 +17,11 @@ Usage: bash scripts/reboot-recovery/full-stack-cold-start-check.sh [options]
|
||||
|
||||
Options:
|
||||
--send-alert-test POST one Alertmanager webhook test after AWOOOI API is ready.
|
||||
--monitor-read-only Skip the webhook POST without warning; intended for cron/textfile monitors.
|
||||
--watch Repeat checks until all gates are GREEN or max attempts is reached.
|
||||
--interval SECONDS Retry interval for --watch. Default: 60.
|
||||
--max-attempts COUNT Max attempts for --watch. Default: 30. Use 0 for unlimited.
|
||||
--no-color Disable ANSI colors in output.
|
||||
-h, --help Show this help.
|
||||
|
||||
Default mode is read-only and does not POST an Alertmanager test event.
|
||||
@@ -31,6 +34,12 @@ while [ "$#" -gt 0 ]; do
|
||||
--send-alert-test)
|
||||
SEND_ALERT_TEST=1
|
||||
;;
|
||||
--monitor-read-only)
|
||||
MONITOR_READ_ONLY=1
|
||||
;;
|
||||
--no-color)
|
||||
NO_COLOR=1
|
||||
;;
|
||||
--watch)
|
||||
WATCH_MODE=1
|
||||
;;
|
||||
@@ -63,11 +72,19 @@ while [ "$#" -gt 0 ]; do
|
||||
shift
|
||||
done
|
||||
|
||||
RED=$'\033[0;31m'
|
||||
GREEN=$'\033[0;32m'
|
||||
YELLOW=$'\033[1;33m'
|
||||
BLUE=$'\033[0;34m'
|
||||
NC=$'\033[0m'
|
||||
if [ -n "${NO_COLOR:-}" ]; then
|
||||
RED=""
|
||||
GREEN=""
|
||||
YELLOW=""
|
||||
BLUE=""
|
||||
NC=""
|
||||
else
|
||||
RED=$'\033[0;31m'
|
||||
GREEN=$'\033[0;32m'
|
||||
YELLOW=$'\033[1;33m'
|
||||
BLUE=$'\033[0;34m'
|
||||
NC=$'\033[0m'
|
||||
fi
|
||||
|
||||
PASS=0
|
||||
WARN=0
|
||||
@@ -121,6 +138,46 @@ ssh_cmd() {
|
||||
ssh "${SSH_OPTS[@]}" "$user_host" "${prefix}${cmd}"
|
||||
}
|
||||
|
||||
host_has_ip() {
|
||||
local expected_ip="$1"
|
||||
if command -v ip >/dev/null 2>&1; then
|
||||
ip -o -4 addr show 2>/dev/null | awk '{print $4}' | grep -q "^${expected_ip}/" && return 0
|
||||
fi
|
||||
hostname -I 2>/dev/null | tr ' ' '\n' | grep -qx "$expected_ip"
|
||||
}
|
||||
|
||||
host_cmd() {
|
||||
local user_host="$1"
|
||||
local cmd="$2"
|
||||
case "$user_host" in
|
||||
*@192.168.0.110)
|
||||
if host_has_ip "192.168.0.110"; then
|
||||
bash -lc "$cmd"
|
||||
return
|
||||
fi
|
||||
;;
|
||||
*@192.168.0.120)
|
||||
if host_has_ip "192.168.0.120"; then
|
||||
bash -lc "$cmd"
|
||||
return
|
||||
fi
|
||||
;;
|
||||
*@192.168.0.121)
|
||||
if host_has_ip "192.168.0.121"; then
|
||||
bash -lc "$cmd"
|
||||
return
|
||||
fi
|
||||
;;
|
||||
*@192.168.0.188)
|
||||
if host_has_ip "192.168.0.188"; then
|
||||
bash -lc "$cmd"
|
||||
return
|
||||
fi
|
||||
;;
|
||||
esac
|
||||
ssh_cmd "$user_host" "$cmd"
|
||||
}
|
||||
|
||||
probe_http_code() {
|
||||
local url="$1"
|
||||
local attempt code
|
||||
@@ -165,13 +222,19 @@ check_network() {
|
||||
fi
|
||||
done
|
||||
|
||||
arp -an | grep -E '192\.168\.0\.(110|120|121|188)' || warn "no ARP rows printed for one or more hosts"
|
||||
if arp -an | grep -E '192\.168\.0\.(110|120|121|188)'; then
|
||||
ok "ARP evidence printed"
|
||||
elif [ "$MONITOR_READ_ONLY" -eq 1 ]; then
|
||||
ok "ARP evidence unavailable in monitor mode; ping and TCP gates passed"
|
||||
else
|
||||
warn "no ARP rows printed for one or more hosts"
|
||||
fi
|
||||
}
|
||||
|
||||
check_188() {
|
||||
log_section "P0-188-DATA"
|
||||
local out
|
||||
if ! out=$(ssh_cmd "ollama@192.168.0.188" '
|
||||
if ! out=$(host_cmd "ollama@192.168.0.188" '
|
||||
echo "HOST $(hostname) $(uptime)"
|
||||
echo "MEM $(free -h | awk "/Mem:/ {print \$2,\$3,\$7}")"
|
||||
echo "SYSTEMD $(systemctl is-active containerd docker postgresql@14-main redis-server ollama nginx 2>/dev/null | tr "\n" " ")"
|
||||
@@ -199,7 +262,7 @@ docker ps --format "DOCKER {{.Names}}\t{{.Status}}" | head -80
|
||||
check_110() {
|
||||
log_section "P0-110-REGISTRY-OBSERVABILITY"
|
||||
local out
|
||||
if ! out=$(ssh_cmd "wooo@192.168.0.110" '
|
||||
if ! out=$(host_cmd "wooo@192.168.0.110" '
|
||||
echo "HOST $(hostname) $(uptime)"
|
||||
echo "MEM $(free -h | awk "/Mem:/ {print \$2,\$3,\$7}")"
|
||||
echo "DOCKER_SYSTEMD $(systemctl is-active docker 2>/dev/null || true)"
|
||||
@@ -231,7 +294,7 @@ docker ps --format "DOCKER {{.Names}}\t{{.Status}}" | head -120
|
||||
check_k3s() {
|
||||
log_section "P1-K3S"
|
||||
local out local_kubectl_out
|
||||
if ! out=$(ssh_cmd "wooo@192.168.0.120" '
|
||||
if ! out=$(host_cmd "wooo@192.168.0.120" '
|
||||
echo "HOST $(hostname) $(uptime)"
|
||||
echo "PG188_PORT $(nc -z -w 2 192.168.0.188 5432 >/dev/null 2>&1 && echo OPEN || echo CLOSED)"
|
||||
echo "SYSTEMD $(systemctl is-active k3s k3s-agent keepalived 2>/dev/null | tr "\n" " ")"
|
||||
@@ -271,7 +334,7 @@ check_workload_and_alertchain() {
|
||||
log_section "P2-WORKLOAD-ALERTCHAIN"
|
||||
local api_code web_code alert_code
|
||||
local out
|
||||
if out=$(ssh_cmd "wooo@192.168.0.120" '
|
||||
if out=$(host_cmd "wooo@192.168.0.120" '
|
||||
api_code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 http://192.168.0.125:32334/api/v1/health 2>/dev/null || true)
|
||||
web_code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 http://192.168.0.125:32335/ 2>/dev/null || true)
|
||||
echo "API_CODE ${api_code:-000}"
|
||||
@@ -292,12 +355,14 @@ WEB_CODE $web_code"
|
||||
[[ "$web_code" =~ ^[23] ]] && ok "AWOOOI Web reachable" || warn "AWOOOI Web not confirmed"
|
||||
|
||||
if [ "$SEND_ALERT_TEST" -eq 1 ]; then
|
||||
alert_code=$(ssh_cmd "wooo@192.168.0.120" 'curl -s -o /tmp/awoooi-alertchain.out -w "%{http_code}" --max-time 8 \
|
||||
alert_code=$(host_cmd "wooo@192.168.0.120" 'curl -s -o /tmp/awoooi-alertchain.out -w "%{http_code}" --max-time 8 \
|
||||
-X POST "http://192.168.0.125:32334/api/v1/webhooks/alertmanager" \
|
||||
-H '"'"'Content-Type: application/json'"'"' \
|
||||
-d '"'"'{"receiver":"cold-start-check","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"ColdStartCheck","severity":"info"},"annotations":{"summary":"Cold start check"},"startsAt":"2026-05-05T11:00:00Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":""}],"groupLabels":{},"commonLabels":{},"commonAnnotations":{},"externalURL":"","version":"4","groupKey":"cold-start-check"}'"'"' 2>/dev/null || echo "000"')
|
||||
echo "ALERTCHAIN_CODE $alert_code"
|
||||
[[ "$alert_code" =~ ^2 ]] && ok "Alertmanager webhook endpoint accepts POST" || warn "Alertmanager webhook E2E not confirmed"
|
||||
elif [ "$MONITOR_READ_ONLY" -eq 1 ]; then
|
||||
ok "Alertmanager webhook POST intentionally skipped in read-only monitor mode"
|
||||
else
|
||||
warn "Alertmanager webhook POST skipped; rerun with --send-alert-test after API is ready"
|
||||
fi
|
||||
@@ -326,7 +391,7 @@ check_schedules() {
|
||||
log_section "P2-SCHEDULES"
|
||||
local out
|
||||
|
||||
if out=$(ssh_cmd "ollama@192.168.0.188" '
|
||||
if out=$(host_cmd "ollama@192.168.0.188" '
|
||||
now=$(date +%s)
|
||||
echo "CRON_188 $(systemctl is-active cron 2>/dev/null || systemctl is-active crond 2>/dev/null || true)"
|
||||
for f in /home/ollama/node_exporter_textfiles/backup.prom /home/ollama/node_exporter_textfiles/docker_restart_count.prom /home/ollama/node_exporter_textfiles/docker_stats.prom; do
|
||||
@@ -356,7 +421,7 @@ echo "SCHEDULER_REGISTERED $(docker logs --since 6h momo-scheduler 2>&1 | grep -
|
||||
echo "$out"
|
||||
fi
|
||||
|
||||
if out=$(ssh_cmd "wooo@192.168.0.110" '
|
||||
if out=$(host_cmd "wooo@192.168.0.110" '
|
||||
now=$(date +%s)
|
||||
echo "CRON_110 $(systemctl is-active cron 2>/dev/null || systemctl is-active crond 2>/dev/null || true)"
|
||||
echo "FAILED_UNITS_110 $(systemctl --failed --no-legend --plain 2>/dev/null | wc -l)"
|
||||
@@ -383,7 +448,7 @@ done
|
||||
echo "$out"
|
||||
fi
|
||||
|
||||
if out=$(ssh_cmd "wooo@192.168.0.120" '
|
||||
if out=$(host_cmd "wooo@192.168.0.120" '
|
||||
kcmd() {
|
||||
if [ -n "${REMOTE_SUDO_PASSWORD:-}" ]; then
|
||||
printf "%s\n" "$REMOTE_SUDO_PASSWORD" | sudo -S -p "" kubectl "$@"
|
||||
@@ -411,7 +476,7 @@ kcmd get pods -n awoooi-prod --no-headers 2>/dev/null | awk "\$3 !~ /^(Running|C
|
||||
echo "$out"
|
||||
fi
|
||||
|
||||
if out=$(ssh_cmd "wooo@192.168.0.121" '
|
||||
if out=$(host_cmd "wooo@192.168.0.121" '
|
||||
echo "CRON_121 $(systemctl is-active cron 2>/dev/null || systemctl is-active crond 2>/dev/null || true)"
|
||||
crontab -l 2>/dev/null | grep -q "dr-drill.sh" && echo "DR_DRILL_CRON present" || echo "DR_DRILL_CRON missing"
|
||||
' 2>&1); then
|
||||
|
||||
44
scripts/reboot-recovery/install-cold-start-monitor-110.sh
Executable file
@@ -0,0 +1,44 @@
|
||||
#!/usr/bin/env bash
|
||||
# Install the AWOOOI cold-start textfile monitor on host 110 using the wooo user cron.
|
||||
#
|
||||
# This installer does not require sudo. It copies read-only monitor scripts into
|
||||
# /home/wooo/scripts and maintains a marked crontab block.
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
TARGET_HOST="${TARGET_HOST:-wooo@192.168.0.110}"
|
||||
REMOTE_SCRIPT_DIR="${REMOTE_SCRIPT_DIR:-/home/wooo/scripts}"
|
||||
REMOTE_TEXTFILE_DIR="${REMOTE_TEXTFILE_DIR:-/home/wooo/node_exporter_textfiles}"
|
||||
REMOTE_LOG_DIR="${REMOTE_LOG_DIR:-/home/wooo/reboot-recovery}"
|
||||
CRON_SCHEDULE="${CRON_SCHEDULE:-*/10 * * * *}"
|
||||
CHECK_TIMEOUT_SECONDS="${CHECK_TIMEOUT_SECONDS:-240}"
|
||||
|
||||
SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
|
||||
CHECK_SCRIPT="$SCRIPT_DIR/full-stack-cold-start-check.sh"
|
||||
EXPORTER_SCRIPT="$SCRIPT_DIR/cold-start-textfile-exporter.sh"
|
||||
|
||||
if [ ! -f "$CHECK_SCRIPT" ] || [ ! -f "$EXPORTER_SCRIPT" ]; then
|
||||
echo "Required scripts not found beside installer" >&2
|
||||
exit 66
|
||||
fi
|
||||
|
||||
ssh "$TARGET_HOST" "mkdir -p '$REMOTE_SCRIPT_DIR' '$REMOTE_TEXTFILE_DIR' '$REMOTE_LOG_DIR'"
|
||||
scp "$CHECK_SCRIPT" "$TARGET_HOST:$REMOTE_SCRIPT_DIR/full-stack-cold-start-check.sh"
|
||||
scp "$EXPORTER_SCRIPT" "$TARGET_HOST:$REMOTE_SCRIPT_DIR/cold-start-textfile-exporter.sh"
|
||||
ssh "$TARGET_HOST" "chmod 0755 '$REMOTE_SCRIPT_DIR/full-stack-cold-start-check.sh' '$REMOTE_SCRIPT_DIR/cold-start-textfile-exporter.sh'"
|
||||
|
||||
ssh "$TARGET_HOST" bash -s <<REMOTE
|
||||
set -euo pipefail
|
||||
tmp=\$(mktemp)
|
||||
crontab -l 2>/dev/null | sed '/# AWOOOI cold-start monitor start/,/# AWOOOI cold-start monitor end/d' > "\$tmp" || true
|
||||
cat >> "\$tmp" <<'CRON'
|
||||
# AWOOOI cold-start monitor start
|
||||
$CRON_SCHEDULE PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin CHECK_SCRIPT=$REMOTE_SCRIPT_DIR/full-stack-cold-start-check.sh TEXTFILE_DIR=$REMOTE_TEXTFILE_DIR LOG_DIR=$REMOTE_LOG_DIR CHECK_TIMEOUT_SECONDS=$CHECK_TIMEOUT_SECONDS $REMOTE_SCRIPT_DIR/cold-start-textfile-exporter.sh >/tmp/awoooi-cold-start-monitor.cron.log 2>&1
|
||||
# AWOOOI cold-start monitor end
|
||||
CRON
|
||||
crontab "\$tmp"
|
||||
rm -f "\$tmp"
|
||||
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin CHECK_SCRIPT=$REMOTE_SCRIPT_DIR/full-stack-cold-start-check.sh TEXTFILE_DIR=$REMOTE_TEXTFILE_DIR LOG_DIR=$REMOTE_LOG_DIR CHECK_TIMEOUT_SECONDS=$CHECK_TIMEOUT_SECONDS "$REMOTE_SCRIPT_DIR/cold-start-textfile-exporter.sh"
|
||||
test -s "$REMOTE_TEXTFILE_DIR/cold_start_recovery.prom"
|
||||
tail -40 "$REMOTE_TEXTFILE_DIR/cold_start_recovery.prom"
|
||||
REMOTE
|
||||