fix(ops): monitor full-stack cold-start gates
All checks were successful
Code Review / ai-code-review (push) Successful in 11s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 18s

Your Name
2026-05-06 00:48:05 +08:00
parent a2c4b3d47e
commit 587551c1f1
8 changed files with 448 additions and 18 deletions


@@ -15,6 +15,22 @@
- `_push_decision_to_telegram()` and the stale READY token resend now share one dedup helper, so the two paths cannot drift apart again.
- Added `test_decision_manager_telegram_dedup.py`, which locks in that an alertname fingerprint is still produced when `Incident` lacks a `title` field.
## 2026-05-06 | cold-start gate promoted to persistent Prometheus monitor
**Background**: The reboot SOP, baseline, and one-shot script already let a manual rescue reach GREEN, but the commander requires that after the next reboot the stack is monitored and alerted on automatically, and that the AI must not restart stateful services before the gates pass.
**Persisted in this change**:
- Added `cold-start-textfile-exporter.sh`, which on every run converts the result of `full-stack-cold-start-check.sh --monitor-read-only --no-color` into node-exporter textfile metrics.
- Added `install-cold-start-monitor-110.sh`, which installs the monitor into the user cron on 110; every 10 minutes it writes `/home/wooo/node_exporter_textfiles/cold_start_recovery.prom` and `/home/wooo/reboot-recovery/cold-start-last.log`.
- `full-stack-cold-start-check.sh` gains `--monitor-read-only` / `--no-color`; the resident monitor does not POST an Alertmanager smoke event every 10 minutes, while the manual final gate must still use `--send-alert-test`.
- `ops/monitoring/alerts-unified.yml` gains a `cold_start_recovery_alerts` group with 5 rules: monitor missing, stale, blocked, degraded, last green too old.
- The monitor on 110 needs to query K3s on 120 and the DR cron on 121, so the existing `wooo` public key from 110 was added to `authorized_keys` on 120/121, with each host automatically backing up the original file as `authorized_keys.bak-cold-start-monitor-*`.
**Verification**:
- 110 textfile monitor live result: `awoooi_cold_start_last_result{result="green"} 1`, `warn_gates=0`, `blocked_gates=0`.
- Prometheus reload succeeded with `107` rules; all 5 `cold_start_recovery_alerts` rules are `inactive ok`.
- Formal final gate: `bash scripts/reboot-recovery/full-stack-cold-start-check.sh --watch --interval 1 --max-attempts 1 --send-alert-test --no-color` returned `PASS=52 WARN=0 BLOCKED=0` and `ALERTCHAIN_CODE 200`.
## 2026-05-06 | momo-scheduler cold-start noise cleanup after reboot recovery
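**Background**: The full-stack cold-start SOP has reached `PASS=51 WARN=0 BLOCKED=0`, but `momo-scheduler` on 188 still leaves three non-fatal sources of noise: the blank-page check still uses an old copy marker, the TokenReport query is missing the `ai_call_budgets` table, and the ElephantAlpha/Hermes legacy step lacks engine injection.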


@@ -545,7 +545,32 @@ This is the standard next-reboot release command. It checks every 60 seconds for
Use `--send-alert-test` for final release because the alert chain is not proven until the AWOOOI Alertmanager webhook accepts a real POST. Without `--send-alert-test`, the script intentionally leaves a warning so operators do not falsely mark alerting as complete.
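The three ways the webhook step can be reported (real POST, silent skip for the cron monitor, skip with a warning) can be sketched as a small decision function. This is an illustrative sketch only: `alert_test_outcome` and its return labels are hypothetical names, not part of `full-stack-cold-start-check.sh`; only the `--send-alert-test` and `--monitor-read-only` flags come from the script.

```bash
# Hypothetical helper (not in the script): decide how the webhook step is
# reported for a given combination of flags. Arguments are 0/1 values that
# mirror --send-alert-test and --monitor-read-only.
alert_test_outcome() {
  local send_alert_test="$1" monitor_read_only="$2"
  if [ "$send_alert_test" -eq 1 ]; then
    echo "post"        # real POST; the gate passes only on a 2xx response
  elif [ "$monitor_read_only" -eq 1 ]; then
    echo "skip-ok"     # cron/textfile monitor: skip without a warning
  else
    echo "skip-warn"   # manual run without the flag: leave a WARN
  fi
}
```

Checking `--send-alert-test` first means the final release path always performs the POST, even if both flags are passed together.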
### 13.3 Script-To-SOP Coverage Map
### 13.3 Persistent Read-Only Monitor
After recovery, host 110 should run the same gate as a node-exporter textfile monitor:
```bash
bash scripts/reboot-recovery/install-cold-start-monitor-110.sh
```
This installs two scripts under `/home/wooo/scripts/`, adds a marked user-cron block, and writes:
- `/home/wooo/node_exporter_textfiles/cold_start_recovery.prom`
- `/home/wooo/reboot-recovery/cold-start-last.log`
The cron path uses `--monitor-read-only`, so it does not POST Alertmanager smoke events every 10 minutes. It converts the cold-start gate into Prometheus metrics:
- `awoooi_cold_start_monitor_up`
- `awoooi_cold_start_pass_gates`
- `awoooi_cold_start_warn_gates`
- `awoooi_cold_start_blocked_gates`
- `awoooi_cold_start_last_run_timestamp`
- `awoooi_cold_start_last_green_timestamp`
- `awoooi_cold_start_last_result{result="green|degraded|blocked|check_failed"}`
Prometheus rules in `ops/monitoring/alerts-unified.yml` alert when the monitor is missing, stale, blocked, degraded, or has not been green for more than 6 hours.
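How a `PASS=… WARN=… BLOCKED=…` summary line maps onto the `result` label can be sketched as follows. This is a minimal sketch that follows the order of checks in the exporter (blocked, then degraded, then green, otherwise check_failed); the function name `classify_cold_start_summary` is hypothetical and does not exist in the repository.

```bash
# Hypothetical helper (not part of the commit): map one summary line plus the
# check's exit code to a result label.
classify_cold_start_summary() {
  local line="$1" exit_code="${2:-0}" warn blocked
  warn=$(printf '%s\n' "$line" | sed -n 's/.*WARN=\([0-9][0-9]*\).*/\1/p')
  blocked=$(printf '%s\n' "$line" | sed -n 's/.*BLOCKED=\([0-9][0-9]*\).*/\1/p')
  if [ -z "$warn" ] || [ -z "$blocked" ]; then
    echo "check_failed"   # no parseable summary line
  elif [ "$blocked" -gt 0 ]; then
    echo "blocked"
  elif [ "$warn" -gt 0 ]; then
    echo "degraded"
  elif [ "$exit_code" -eq 0 ]; then
    echo "green"
  else
    echo "check_failed"   # clean counts but non-zero exit, e.g. a timeout
  fi
}
```

For example, `classify_cold_start_summary "PASS=52 WARN=0 BLOCKED=0" 0` prints `green`, while the same line with exit code `124` prints `check_failed`.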
### 13.4 Script-To-SOP Coverage Map
| Script gate | SOP coverage | Blocks |
|-------------|--------------|--------|
@@ -557,7 +582,7 @@ Use `--send-alert-test` for final release because the alert chain is not proven
| `P2-PUBLIC-ROUTES` | external AWOOOI and momo URLs | external release |
| `P2-SCHEDULES` | cron, CronJobs, backups, textfile exporters, DR drill | final done criteria |
### 13.4 Next-Reboot Operator Contract
### 13.5 Next-Reboot Operator Contract
1. Run the watch command above.
2. If it stops at `BLOCKED`, repair the first blocked gate and rerun watch mode.
@@ -583,6 +608,7 @@ All must be true:
- High-load batch services are capped or delayed.
- Runners are guarded and released last.
- AI auto-remediation is not in full execution mode until all gates are green.
- 110 cold-start textfile monitor is fresh, and Prometheus has the cold-start alert rules loaded.
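
An operator can confirm the "monitor is fresh" criterion by hand with something like the sketch below. It assumes the textfile layout written by the exporter (one `name{labels} value` sample per line) and reuses the 1800-second threshold from the `ColdStartMonitorStale` rule; `metric_value` and `cold_start_monitor_fresh` are hypothetical helper names, not existing scripts.

```bash
# Hypothetical helper: print the value of the first sample of one metric
# from a node-exporter textfile (lines look like: name{labels} value).
metric_value() {
  local file="$1" name="$2"
  awk -v n="$name" 'index($0, n "{") == 1 { print $NF; exit }' "$file"
}

# Succeeds when the last monitor run is no older than max_age seconds.
# now and max_age are optional; max_age defaults to the 1800 s stale threshold.
cold_start_monitor_fresh() {
  local file="$1" now="${2:-$(date +%s)}" max_age="${3:-1800}" last_run
  last_run=$(metric_value "$file" awoooi_cold_start_last_run_timestamp)
  [ -n "$last_run" ] && [ $((now - last_run)) -le "$max_age" ]
}
```

On host 110 this could be called as `cold_start_monitor_fresh /home/wooo/node_exporter_textfiles/cold_start_recovery.prom && echo fresh`. A missing metric counts as not fresh.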
---


@@ -772,6 +772,103 @@ groups:
auto_repair_action: "ssh 192.168.0.{{ $labels.host }} 'systemctl show {{ $labels.unit }} -p CPUQuotaPerSecUSec -p MemoryMax -p ActiveState -p SubState'"
runbook: "Recommended baseline: CPUQuota=200% and MemoryMax=2G per runner, applied by /home/wooo/scripts/apply-runner-systemd-guardrails.sh; if still overloaded, limit concurrency or split the load."
# =========================================================================
# Full-stack reboot/cold-start gate monitor
# =========================================================================
- name: cold_start_recovery_alerts
interval: 60s
rules:
- alert: ColdStartMonitorMissing
# 2026-05-06 ogt + Codex: full-stack reboot recovery must have a durable signal,
# not only a one-off terminal transcript.
expr: absent(awoooi_cold_start_monitor_up{host="110"})
for: 20m
labels:
severity: warning
layer: host-110
team: ops
alert_category: infrastructure
notification_type: TYPE-1
auto_repair: "true"
mcp_provider: "ssh_host"
host_type: "bare_metal"
annotations:
summary: "Cold-start gate monitor has produced no metrics for 20 minutes"
description: "Host 110 is not exposing awoooi_cold_start_monitor_up, which means the full-stack cold-start gate is not being monitored by Prometheus."
auto_repair_action: "ssh 192.168.0.110 'crontab -l | sed -n \"/AWOOOI cold-start monitor start/,/AWOOOI cold-start monitor end/p\"; ls -l /home/wooo/node_exporter_textfiles/cold_start_recovery.prom /home/wooo/reboot-recovery/cold-start-last.log 2>/dev/null || true'"
runbook: "Run scripts/reboot-recovery/install-cold-start-monitor-110.sh; it installs only the read-only textfile exporter and does not need sudo."
- alert: ColdStartMonitorStale
expr: time() - awoooi_cold_start_last_run_timestamp{host="110"} > 1800
for: 10m
labels:
severity: warning
layer: host-110
team: ops
alert_category: infrastructure
notification_type: TYPE-1
auto_repair: "true"
mcp_provider: "ssh_host"
host_type: "bare_metal"
annotations:
summary: "Cold-start gate monitor has not updated for more than 30 minutes"
description: "cold_start_recovery.prom is stale; it cannot be confirmed whether the reboot gates for 110/120/121/188 are still healthy."
auto_repair_action: "ssh 192.168.0.110 'tail -80 /tmp/awoooi-cold-start-monitor.cron.log 2>/dev/null || true; tail -120 /home/wooo/reboot-recovery/cold-start-last.log 2>/dev/null || true'"
runbook: "Check the 110 user cron, the SSH key, and the permissions on /home/wooo/node_exporter_textfiles; do not treat a stale monitor as proof that services are available."
- alert: ColdStartRecoveryBlocked
expr: awoooi_cold_start_blocked_gates{host="110"} > 0 or awoooi_cold_start_last_result{host="110",result="blocked"} == 1
for: 5m
labels:
severity: critical
layer: full-stack
team: ops
alert_category: infrastructure
notification_type: TYPE-3
auto_repair: "true"
mcp_provider: "ssh_host"
host_type: "bare_metal"
annotations:
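summary: "Full-stack cold-start gate has BLOCKED gates"
description: "The full-stack cold-start check detected {{ $value }} blocked gate(s). AI auto-remediation may only collect evidence and notify; it must not release runner/CD or restart stateful services."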
auto_repair_action: "ssh 192.168.0.110 'tail -220 /home/wooo/reboot-recovery/cold-start-last.log'"
runbook: "Start repairing from the first BLOCKED gate; follow the phase order in docs/runbooks/FULL-STACK-COLD-START-SOP.md."
- alert: ColdStartRecoveryDegraded
expr: awoooi_cold_start_warn_gates{host="110"} > 0 or awoooi_cold_start_last_result{host="110",result="degraded"} == 1
for: 30m
labels:
severity: warning
layer: full-stack
team: ops
alert_category: infrastructure
notification_type: TYPE-1
auto_repair: "true"
mcp_provider: "ssh_host"
host_type: "bare_metal"
annotations:
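summary: "Full-stack cold-start gate remains degraded"
description: "The full-stack cold-start check has reported WARN for 30 consecutive minutes. In this state reboot recovery must not be declared complete, and high-load runner/CD must not be released."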
auto_repair_action: "ssh 192.168.0.110 'tail -180 /home/wooo/reboot-recovery/cold-start-last.log'"
runbook: "After clearing the WARN gates, rerun the final gate: bash scripts/reboot-recovery/full-stack-cold-start-check.sh --watch --interval 60 --max-attempts 30 --send-alert-test"
- alert: ColdStartLastGreenTooOld
expr: (time() - awoooi_cold_start_last_green_timestamp{host="110"} > 21600) and awoooi_cold_start_last_green_timestamp{host="110"} > 0
for: 30m
labels:
severity: warning
layer: full-stack
team: ops
alert_category: infrastructure
notification_type: TYPE-1
auto_repair: "false"
mcp_provider: "ssh_host"
host_type: "bare_metal"
annotations:
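summary: "Full-stack cold-start monitor has had no GREEN for more than 6 hours"
description: "The last GREEN was more than 6 hours ago, meaning the cold-start baseline has not fully passed for an extended period."
runbook: "Check /home/wooo/reboot-recovery/cold-start-last.log; if the only cause is that the read-only monitor lacks the final webhook POST, fix the monitor mode rather than disabling the alert."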
# =========================================================================
# MinIO / Kali alerts
# =========================================================================


@@ -295,6 +295,39 @@ ai_automation_gate:
- "runner/CD release"
- "generic container restart"
persistent_monitoring:
host: "110"
install_command: "bash scripts/reboot-recovery/install-cold-start-monitor-110.sh"
schedule: "*/10 * * * *"
mode: "read_only"
send_alert_test: false
scripts:
check: "/home/wooo/scripts/full-stack-cold-start-check.sh"
exporter: "/home/wooo/scripts/cold-start-textfile-exporter.sh"
outputs:
textfile: "/home/wooo/node_exporter_textfiles/cold_start_recovery.prom"
last_log: "/home/wooo/reboot-recovery/cold-start-last.log"
metrics:
- "awoooi_cold_start_monitor_up"
- "awoooi_cold_start_pass_gates"
- "awoooi_cold_start_warn_gates"
- "awoooi_cold_start_blocked_gates"
- "awoooi_cold_start_last_run_timestamp"
- "awoooi_cold_start_last_green_timestamp"
- "awoooi_cold_start_last_result"
prometheus_alerts:
- "ColdStartMonitorMissing"
- "ColdStartMonitorStale"
- "ColdStartRecoveryBlocked"
- "ColdStartRecoveryDegraded"
- "ColdStartLastGreenTooOld"
ai_contract:
monitor_missing: "diagnose cron/textfile path only"
stale: "collect cron log and last check log"
degraded: "collect evidence, do not release high-load work"
blocked: "follow first BLOCKED gate in phase order"
forbidden: "generic restart, stateful restart, destructive repair"
final_confirmation:
command: "bash scripts/reboot-recovery/full-stack-cold-start-check.sh --watch --interval 60 --max-attempts 30 --send-alert-test"
green_result:


@@ -21,7 +21,14 @@ if [ ! -f "$RULES_FILE" ]; then
fi
# Validate YAML syntax
python3 -c "import yaml; yaml.safe_load(open('$RULES_FILE'))" || { echo "ERROR: YAML syntax error"; exit 1; }
if python3 -c "import yaml; yaml.safe_load(open('$RULES_FILE'))" 2>/dev/null; then
:
elif ruby -e "require 'yaml'; YAML.load_file('$RULES_FILE')" 2>/dev/null; then
:
else
echo "ERROR: YAML syntax error or no YAML parser available"
exit 1
fi
log "✅ YAML syntax validation passed"
# Dry run mode


@@ -0,0 +1,142 @@
#!/usr/bin/env bash
# Export AWOOOI full-stack cold-start gate status as node-exporter textfile metrics.
#
# 2026-05-06 ogt + Codex: reboot recovery hardening.
# Intent: give Prometheus and the AI incident flow a durable, read-only signal
# for the 110/120/121/188 startup gates. This wrapper never sends the
# Alertmanager smoke event and never writes remote state.
set -uo pipefail
CHECK_SCRIPT="${CHECK_SCRIPT:-/home/wooo/scripts/full-stack-cold-start-check.sh}"
TEXTFILE_DIR="${TEXTFILE_DIR:-${NODE_EXPORTER_TEXTFILE_DIR:-/home/wooo/node_exporter_textfiles}}"
OUTPUT_NAME="${OUTPUT_NAME:-cold_start_recovery.prom}"
LOG_DIR="${LOG_DIR:-/home/wooo/reboot-recovery}"
CHECK_TIMEOUT_SECONDS="${CHECK_TIMEOUT_SECONDS:-240}"
HOST_LABEL="${AIOPS_HOST_LABEL:-110}"
SCOPE_LABEL="${AIOPS_SCOPE_LABEL:-110_120_121_188}"
LOCK_FILE="${LOCK_FILE:-/tmp/awoooi-cold-start-textfile-exporter.lock}"
escape_label() {
printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}
write_metric_file() {
local tmp="$1"
local now="$2"
local duration="$3"
local exit_code="$4"
local monitor_up="$5"
local pass="$6"
local warn="$7"
local blocked="$8"
local green="$9"
local degraded="${10}"
local blocked_state="${11}"
local check_failed="${12}"
local last_green="${13}"
local host scope
host=$(escape_label "$HOST_LABEL")
scope=$(escape_label "$SCOPE_LABEL")
cat >"$tmp" <<METRICS
# HELP awoooi_cold_start_monitor_up Whether the cold-start monitor produced a parseable summary.
# TYPE awoooi_cold_start_monitor_up gauge
awoooi_cold_start_monitor_up{host="$host",scope="$scope",mode="read_only"} $monitor_up
# HELP awoooi_cold_start_pass_gates Last cold-start check pass gate count.
# TYPE awoooi_cold_start_pass_gates gauge
awoooi_cold_start_pass_gates{host="$host",scope="$scope"} $pass
# HELP awoooi_cold_start_warn_gates Last cold-start check warning gate count.
# TYPE awoooi_cold_start_warn_gates gauge
awoooi_cold_start_warn_gates{host="$host",scope="$scope"} $warn
# HELP awoooi_cold_start_blocked_gates Last cold-start check blocked gate count.
# TYPE awoooi_cold_start_blocked_gates gauge
awoooi_cold_start_blocked_gates{host="$host",scope="$scope"} $blocked
# HELP awoooi_cold_start_last_run_timestamp Unix timestamp of the last cold-start monitor run.
# TYPE awoooi_cold_start_last_run_timestamp gauge
awoooi_cold_start_last_run_timestamp{host="$host",scope="$scope"} $now
# HELP awoooi_cold_start_last_green_timestamp Unix timestamp of the last GREEN cold-start monitor run.
# TYPE awoooi_cold_start_last_green_timestamp gauge
awoooi_cold_start_last_green_timestamp{host="$host",scope="$scope"} $last_green
# HELP awoooi_cold_start_last_run_duration_seconds Last cold-start monitor run duration in seconds.
# TYPE awoooi_cold_start_last_run_duration_seconds gauge
awoooi_cold_start_last_run_duration_seconds{host="$host",scope="$scope"} $duration
# HELP awoooi_cold_start_last_exit_code Last cold-start monitor process exit code.
# TYPE awoooi_cold_start_last_exit_code gauge
awoooi_cold_start_last_exit_code{host="$host",scope="$scope"} $exit_code
# HELP awoooi_cold_start_last_result Last cold-start result as one-hot labels.
# TYPE awoooi_cold_start_last_result gauge
awoooi_cold_start_last_result{host="$host",scope="$scope",result="green"} $green
awoooi_cold_start_last_result{host="$host",scope="$scope",result="degraded"} $degraded
awoooi_cold_start_last_result{host="$host",scope="$scope",result="blocked"} $blocked_state
awoooi_cold_start_last_result{host="$host",scope="$scope",result="check_failed"} $check_failed
METRICS
}
if [ -n "${BASH_VERSION:-}" ] && command -v flock >/dev/null 2>&1; then
exec 9>"$LOCK_FILE"
if ! flock -n 9; then
exit 0
fi
fi
mkdir -p "$TEXTFILE_DIR" "$LOG_DIR"
start_ts=$(date +%s)
log_tmp="$LOG_DIR/cold-start-last.log.tmp"
log_file="$LOG_DIR/cold-start-last.log"
state_file="$LOG_DIR/cold-start-last-green.timestamp"
if [ ! -x "$CHECK_SCRIPT" ]; then
end_ts=$(date +%s)
tmp_metric=$(mktemp "$TEXTFILE_DIR/.cold_start_recovery.XXXXXX")
last_green=$(cat "$state_file" 2>/dev/null || echo 0)
printf 'CHECK_SCRIPT not executable: %s\n' "$CHECK_SCRIPT" >"$log_file"
write_metric_file "$tmp_metric" "$end_ts" "$((end_ts - start_ts))" 127 0 0 0 1 0 0 0 1 "$last_green"
chmod 0644 "$tmp_metric"
mv "$tmp_metric" "$TEXTFILE_DIR/$OUTPUT_NAME"
exit 0
fi
timeout "$CHECK_TIMEOUT_SECONDS" bash "$CHECK_SCRIPT" --monitor-read-only --no-color >"$log_tmp" 2>&1
exit_code=$?
mv "$log_tmp" "$log_file"
summary_line=$(grep -E '^PASS=[0-9]+ WARN=[0-9]+ BLOCKED=[0-9]+' "$log_file" | tail -1 || true)
monitor_up=0
pass=0
warn=0
blocked=0
green=0
degraded=0
blocked_state=0
check_failed=0
if [ -n "$summary_line" ]; then
monitor_up=1
pass=$(printf '%s\n' "$summary_line" | sed -n 's/.*PASS=\([0-9][0-9]*\).*/\1/p')
warn=$(printf '%s\n' "$summary_line" | sed -n 's/.*WARN=\([0-9][0-9]*\).*/\1/p')
blocked=$(printf '%s\n' "$summary_line" | sed -n 's/.*BLOCKED=\([0-9][0-9]*\).*/\1/p')
if [ "$blocked" -gt 0 ]; then
blocked_state=1
elif [ "$warn" -gt 0 ]; then
degraded=1
elif [ "$exit_code" -eq 0 ]; then
green=1
else
check_failed=1
fi
else
check_failed=1
fi
end_ts=$(date +%s)
if [ "$green" -eq 1 ]; then
printf '%s\n' "$end_ts" >"$state_file"
fi
last_green=$(cat "$state_file" 2>/dev/null || echo 0)
tmp_metric=$(mktemp "$TEXTFILE_DIR/.cold_start_recovery.XXXXXX")
write_metric_file "$tmp_metric" "$end_ts" "$((end_ts - start_ts))" "$exit_code" "$monitor_up" "$pass" "$warn" "$blocked" "$green" "$degraded" "$blocked_state" "$check_failed" "$last_green"
chmod 0644 "$tmp_metric"
mv "$tmp_metric" "$TEXTFILE_DIR/$OUTPUT_NAME"


@@ -6,6 +6,7 @@ set -uo pipefail
SSH_OPTS=(-o BatchMode=yes -o ConnectTimeout=6)
SEND_ALERT_TEST=0
MONITOR_READ_ONLY=0
WATCH_MODE=0
WATCH_INTERVAL=60
WATCH_MAX_ATTEMPTS=30
@@ -16,9 +17,11 @@ Usage: bash scripts/reboot-recovery/full-stack-cold-start-check.sh [options]
Options:
--send-alert-test POST one Alertmanager webhook test after AWOOOI API is ready.
--monitor-read-only Skip the webhook POST without warning; intended for cron/textfile monitors.
--watch Repeat checks until all gates are GREEN or max attempts is reached.
--interval SECONDS Retry interval for --watch. Default: 60.
--max-attempts COUNT Max attempts for --watch. Default: 30. Use 0 for unlimited.
--no-color Disable ANSI colors in output.
-h, --help Show this help.
Default mode is read-only and does not POST an Alertmanager test event.
@@ -31,6 +34,12 @@ while [ "$#" -gt 0 ]; do
--send-alert-test)
SEND_ALERT_TEST=1
;;
--monitor-read-only)
MONITOR_READ_ONLY=1
;;
--no-color)
NO_COLOR=1
;;
--watch)
WATCH_MODE=1
;;
@@ -63,11 +72,19 @@ while [ "$#" -gt 0 ]; do
shift
done
RED=$'\033[0;31m'
GREEN=$'\033[0;32m'
YELLOW=$'\033[1;33m'
BLUE=$'\033[0;34m'
NC=$'\033[0m'
if [ -n "${NO_COLOR:-}" ]; then
RED=""
GREEN=""
YELLOW=""
BLUE=""
NC=""
else
RED=$'\033[0;31m'
GREEN=$'\033[0;32m'
YELLOW=$'\033[1;33m'
BLUE=$'\033[0;34m'
NC=$'\033[0m'
fi
PASS=0
WARN=0
@@ -121,6 +138,46 @@ ssh_cmd() {
ssh "${SSH_OPTS[@]}" "$user_host" "${prefix}${cmd}"
}
host_has_ip() {
local expected_ip="$1"
if command -v ip >/dev/null 2>&1; then
ip -o -4 addr show 2>/dev/null | awk '{print $4}' | grep -q "^${expected_ip}/" && return 0
fi
hostname -I 2>/dev/null | tr ' ' '\n' | grep -qx "$expected_ip"
}
host_cmd() {
local user_host="$1"
local cmd="$2"
case "$user_host" in
*@192.168.0.110)
if host_has_ip "192.168.0.110"; then
bash -lc "$cmd"
return
fi
;;
*@192.168.0.120)
if host_has_ip "192.168.0.120"; then
bash -lc "$cmd"
return
fi
;;
*@192.168.0.121)
if host_has_ip "192.168.0.121"; then
bash -lc "$cmd"
return
fi
;;
*@192.168.0.188)
if host_has_ip "192.168.0.188"; then
bash -lc "$cmd"
return
fi
;;
esac
ssh_cmd "$user_host" "$cmd"
}
probe_http_code() {
local url="$1"
local attempt code
@@ -165,13 +222,19 @@ check_network() {
fi
done
arp -an | grep -E '192\.168\.0\.(110|120|121|188)' || warn "no ARP rows printed for one or more hosts"
if arp -an | grep -E '192\.168\.0\.(110|120|121|188)'; then
ok "ARP evidence printed"
elif [ "$MONITOR_READ_ONLY" -eq 1 ]; then
ok "ARP evidence unavailable in monitor mode; ping and TCP gates passed"
else
warn "no ARP rows printed for one or more hosts"
fi
}
check_188() {
log_section "P0-188-DATA"
local out
if ! out=$(ssh_cmd "ollama@192.168.0.188" '
if ! out=$(host_cmd "ollama@192.168.0.188" '
echo "HOST $(hostname) $(uptime)"
echo "MEM $(free -h | awk "/Mem:/ {print \$2,\$3,\$7}")"
echo "SYSTEMD $(systemctl is-active containerd docker postgresql@14-main redis-server ollama nginx 2>/dev/null | tr "\n" " ")"
@@ -199,7 +262,7 @@ docker ps --format "DOCKER {{.Names}}\t{{.Status}}" | head -80
check_110() {
log_section "P0-110-REGISTRY-OBSERVABILITY"
local out
if ! out=$(ssh_cmd "wooo@192.168.0.110" '
if ! out=$(host_cmd "wooo@192.168.0.110" '
echo "HOST $(hostname) $(uptime)"
echo "MEM $(free -h | awk "/Mem:/ {print \$2,\$3,\$7}")"
echo "DOCKER_SYSTEMD $(systemctl is-active docker 2>/dev/null || true)"
@@ -231,7 +294,7 @@ docker ps --format "DOCKER {{.Names}}\t{{.Status}}" | head -120
check_k3s() {
log_section "P1-K3S"
local out local_kubectl_out
if ! out=$(ssh_cmd "wooo@192.168.0.120" '
if ! out=$(host_cmd "wooo@192.168.0.120" '
echo "HOST $(hostname) $(uptime)"
echo "PG188_PORT $(nc -z -w 2 192.168.0.188 5432 >/dev/null 2>&1 && echo OPEN || echo CLOSED)"
echo "SYSTEMD $(systemctl is-active k3s k3s-agent keepalived 2>/dev/null | tr "\n" " ")"
@@ -271,7 +334,7 @@ check_workload_and_alertchain() {
log_section "P2-WORKLOAD-ALERTCHAIN"
local api_code web_code alert_code
local out
if out=$(ssh_cmd "wooo@192.168.0.120" '
if out=$(host_cmd "wooo@192.168.0.120" '
api_code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 http://192.168.0.125:32334/api/v1/health 2>/dev/null || true)
web_code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 http://192.168.0.125:32335/ 2>/dev/null || true)
echo "API_CODE ${api_code:-000}"
@@ -292,12 +355,14 @@ WEB_CODE $web_code"
[[ "$web_code" =~ ^[23] ]] && ok "AWOOOI Web reachable" || warn "AWOOOI Web not confirmed"
if [ "$SEND_ALERT_TEST" -eq 1 ]; then
alert_code=$(ssh_cmd "wooo@192.168.0.120" 'curl -s -o /tmp/awoooi-alertchain.out -w "%{http_code}" --max-time 8 \
alert_code=$(host_cmd "wooo@192.168.0.120" 'curl -s -o /tmp/awoooi-alertchain.out -w "%{http_code}" --max-time 8 \
-X POST "http://192.168.0.125:32334/api/v1/webhooks/alertmanager" \
-H '"'"'Content-Type: application/json'"'"' \
-d '"'"'{"receiver":"cold-start-check","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"ColdStartCheck","severity":"info"},"annotations":{"summary":"Cold start check"},"startsAt":"2026-05-05T11:00:00Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":""}],"groupLabels":{},"commonLabels":{},"commonAnnotations":{},"externalURL":"","version":"4","groupKey":"cold-start-check"}'"'"' 2>/dev/null || echo "000"')
echo "ALERTCHAIN_CODE $alert_code"
[[ "$alert_code" =~ ^2 ]] && ok "Alertmanager webhook endpoint accepts POST" || warn "Alertmanager webhook E2E not confirmed"
elif [ "$MONITOR_READ_ONLY" -eq 1 ]; then
ok "Alertmanager webhook POST intentionally skipped in read-only monitor mode"
else
warn "Alertmanager webhook POST skipped; rerun with --send-alert-test after API is ready"
fi
@@ -326,7 +391,7 @@ check_schedules() {
log_section "P2-SCHEDULES"
local out
if out=$(ssh_cmd "ollama@192.168.0.188" '
if out=$(host_cmd "ollama@192.168.0.188" '
now=$(date +%s)
echo "CRON_188 $(systemctl is-active cron 2>/dev/null || systemctl is-active crond 2>/dev/null || true)"
for f in /home/ollama/node_exporter_textfiles/backup.prom /home/ollama/node_exporter_textfiles/docker_restart_count.prom /home/ollama/node_exporter_textfiles/docker_stats.prom; do
@@ -356,7 +421,7 @@ echo "SCHEDULER_REGISTERED $(docker logs --since 6h momo-scheduler 2>&1 | grep -
echo "$out"
fi
if out=$(ssh_cmd "wooo@192.168.0.110" '
if out=$(host_cmd "wooo@192.168.0.110" '
now=$(date +%s)
echo "CRON_110 $(systemctl is-active cron 2>/dev/null || systemctl is-active crond 2>/dev/null || true)"
echo "FAILED_UNITS_110 $(systemctl --failed --no-legend --plain 2>/dev/null | wc -l)"
@@ -383,7 +448,7 @@ done
echo "$out"
fi
if out=$(ssh_cmd "wooo@192.168.0.120" '
if out=$(host_cmd "wooo@192.168.0.120" '
kcmd() {
if [ -n "${REMOTE_SUDO_PASSWORD:-}" ]; then
printf "%s\n" "$REMOTE_SUDO_PASSWORD" | sudo -S -p "" kubectl "$@"
@@ -411,7 +476,7 @@ kcmd get pods -n awoooi-prod --no-headers 2>/dev/null | awk "\$3 !~ /^(Running|C
echo "$out"
fi
if out=$(ssh_cmd "wooo@192.168.0.121" '
if out=$(host_cmd "wooo@192.168.0.121" '
echo "CRON_121 $(systemctl is-active cron 2>/dev/null || systemctl is-active crond 2>/dev/null || true)"
crontab -l 2>/dev/null | grep -q "dr-drill.sh" && echo "DR_DRILL_CRON present" || echo "DR_DRILL_CRON missing"
' 2>&1); then


@@ -0,0 +1,44 @@
#!/usr/bin/env bash
# Install the AWOOOI cold-start textfile monitor on host 110 using the wooo user cron.
#
# This installer does not require sudo. It copies read-only monitor scripts into
# /home/wooo/scripts and maintains a marked crontab block.
set -euo pipefail
TARGET_HOST="${TARGET_HOST:-wooo@192.168.0.110}"
REMOTE_SCRIPT_DIR="${REMOTE_SCRIPT_DIR:-/home/wooo/scripts}"
REMOTE_TEXTFILE_DIR="${REMOTE_TEXTFILE_DIR:-/home/wooo/node_exporter_textfiles}"
REMOTE_LOG_DIR="${REMOTE_LOG_DIR:-/home/wooo/reboot-recovery}"
CRON_SCHEDULE="${CRON_SCHEDULE:-*/10 * * * *}"
CHECK_TIMEOUT_SECONDS="${CHECK_TIMEOUT_SECONDS:-240}"
SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
CHECK_SCRIPT="$SCRIPT_DIR/full-stack-cold-start-check.sh"
EXPORTER_SCRIPT="$SCRIPT_DIR/cold-start-textfile-exporter.sh"
if [ ! -f "$CHECK_SCRIPT" ] || [ ! -f "$EXPORTER_SCRIPT" ]; then
echo "Required scripts not found beside installer" >&2
exit 66
fi
ssh "$TARGET_HOST" "mkdir -p '$REMOTE_SCRIPT_DIR' '$REMOTE_TEXTFILE_DIR' '$REMOTE_LOG_DIR'"
scp "$CHECK_SCRIPT" "$TARGET_HOST:$REMOTE_SCRIPT_DIR/full-stack-cold-start-check.sh"
scp "$EXPORTER_SCRIPT" "$TARGET_HOST:$REMOTE_SCRIPT_DIR/cold-start-textfile-exporter.sh"
ssh "$TARGET_HOST" "chmod 0755 '$REMOTE_SCRIPT_DIR/full-stack-cold-start-check.sh' '$REMOTE_SCRIPT_DIR/cold-start-textfile-exporter.sh'"
ssh "$TARGET_HOST" bash -s <<REMOTE
set -euo pipefail
tmp=\$(mktemp)
crontab -l 2>/dev/null | sed '/# AWOOOI cold-start monitor start/,/# AWOOOI cold-start monitor end/d' > "\$tmp" || true
cat >> "\$tmp" <<'CRON'
# AWOOOI cold-start monitor start
$CRON_SCHEDULE PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin CHECK_SCRIPT=$REMOTE_SCRIPT_DIR/full-stack-cold-start-check.sh TEXTFILE_DIR=$REMOTE_TEXTFILE_DIR LOG_DIR=$REMOTE_LOG_DIR CHECK_TIMEOUT_SECONDS=$CHECK_TIMEOUT_SECONDS $REMOTE_SCRIPT_DIR/cold-start-textfile-exporter.sh >/tmp/awoooi-cold-start-monitor.cron.log 2>&1
# AWOOOI cold-start monitor end
CRON
crontab "\$tmp"
rm -f "\$tmp"
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin CHECK_SCRIPT=$REMOTE_SCRIPT_DIR/full-stack-cold-start-check.sh TEXTFILE_DIR=$REMOTE_TEXTFILE_DIR LOG_DIR=$REMOTE_LOG_DIR CHECK_TIMEOUT_SECONDS=$CHECK_TIMEOUT_SECONDS "$REMOTE_SCRIPT_DIR/cold-start-textfile-exporter.sh"
test -s "$REMOTE_TEXTFILE_DIR/cold_start_recovery.prom"
tail -40 "$REMOTE_TEXTFILE_DIR/cold_start_recovery.prom"
REMOTE