docs(ops): record 188 node exporter recovery [skip ci]

2026-06-24 02:28:16 +08:00
parent 8aeeadbde1
commit 271a9a526d
4 changed files with 107 additions and 2 deletions
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -1,3 +1,26 @@
+## 2026-06-24｜188 node-exporter recovery and backup-health-missing alert closure
+
+**Background**: Both cold-start and `backup-status` showed the 188 backup textfile as fresh, yet Alertmanager still had `BackupHealthMonitorMissing188` active. Investigation confirmed this was not a backup failure: Prometheus could not scrape `192.168.0.188:9100`, so it could not see 188's `backup_health.prom`.
+
+**Root cause**:
+- On 188 the `node_exporter` / `prometheus-node-exporter` systemd unit was absent or inactive, and the exporter ports `9100/9187/9121/9113` all returned connection refused.
+- Reading the textfile directory through the rootfs mount as `/host/home/ollama/node_exporter_textfiles` made the textfile collector fail with `permission denied`, because `/home/ollama` has mode `750`.
+
+**Fix**:
+- Started `quay.io/prometheus/node-exporter:v1.8.2` under Docker with `restart=unless-stopped` and `-p 9100:9100`.
+- The rootfs is still mounted at `/host`; the textfile directory is now bound directly as `/home/ollama/node_exporter_textfiles:/textfile:ro`, with `--collector.textfile.directory=/textfile`.
+- Added the repo helper `scripts/ops/188-node-exporter-restore.sh` so that recovery after the next reboot does not depend on remembering manual steps.
+
+**Live verification**:
+- 188 local `http://127.0.0.1:9100/metrics` now exposes `awoooi_backup_health_monitor_up{host="188"} 1`.
+- `node_textfile_scrape_error 0`.
+- Prometheus `up{job="node-exporter-188"}` returns `1`.
+- Prometheus `awoooi_backup_health_monitor_up{host="188"}` returns `1`.
+- `absent(awoooi_backup_health_monitor_up{host="188"})` returns an empty set.
+- `ALERTS{alertname="BackupHealthMonitorMissing188"}` returns an empty set; the alert has resolved.
+
+**Scope**: This restores the 188 exporter / textfile scrape; it does not silence the backup alert. The Docker daemon was not restarted, the MOMO / PostgreSQL / Redis business containers were not changed, the firewall was not changed, and no secrets were read.
+
+
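 ## 2026-06-24｜Telegram healthy-heartbeat noise reduction and cold-start MOMO detector fix

 **Background**: The Telegram group received `AWOOOI 心跳 / 告警鏈路: ✅ 正常` every 30 minutes, so healthy signals flooded the channel. In addition, the 2026-06-24 01:45 live cold-start already reported `BLOCKED=0` but still produced two WARNs because the MOMO scheduler log pattern and the DB read method were outdated. Both cause false-noise / false-warning in the reboot SOP and had to be fixed.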
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -1652,6 +1652,36 @@ SOP update:

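 For Worker / CronJob / queue services whose startup may take longer than the startup probe allows, a single failed `rollout status --timeout=60s` is not enough to declare production down. Check the deploy marker, image tag, pod readiness, container restart count, service health, and public route / API health together. If the pod eventually becomes ready but CD is red, this is CI timeout / probe tuning work, not a service restart incident; the startup probe or post-deploy timeout should be adjusted afterwards.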

+### 14.27 2026-06-24 188 node-exporter / backup health alert closure
+
+The second change on 2026-06-24 restored the 188 node-exporter textfile scrape. Both `backup-status` and cold-start could read a fresh 188 `backup_health.prom` over SSH, but with the Prometheus `node-exporter-188` scrape down, `BackupHealthMonitorMissing188` fired correctly. In this situation the alert must not be silenced; the exporter must be restored.
+
+| Item | 2026-06-24 188 exporter baseline |
+|------|----------------------------------|
+| SOP version | `v1.28` |
+| Root cause | 188 `9100` connection refused；`node_exporter` / `prometheus-node-exporter` unit absent/inactive；Prometheus could not scrape `backup_health.prom` |
+| False start | Mounting `/home/ollama/node_exporter_textfiles` via `/host/home/ollama/...` failed because `/home/ollama` is `750` and textfile collector saw `permission denied` |
+| Live restore | Docker container `node-exporter` uses `quay.io/prometheus/node-exporter:v1.8.2`, `restart=unless-stopped`, `-p 9100:9100`, rootfs mount `/host`, direct textfile bind `/home/ollama/node_exporter_textfiles:/textfile:ro` |
+| Repo helper | `scripts/ops/188-node-exporter-restore.sh` |
+| Local metrics | `awoooi_backup_health_monitor_up{host="188"} 1`; `node_textfile_scrape_error 0` |
+| Prometheus readback | `up{job="node-exporter-188"} 1`; `awoooi_backup_health_monitor_up{host="188"} 1`; `absent(awoooi_backup_health_monitor_up{host="188"})` empty |
+| Alert readback | `ALERTS{alertname="BackupHealthMonitorMissing188"}` empty |
+| Declaration limit | May claim that the 188 backup health scrape is restored; must not treat this as credential escrow complete or backup retention policy complete |
+
+If `BackupHealthMonitorMissing188` is active after a future reboot but SSH/backup-status shows `backup_health.prom` as fresh, check this first:
+
+```bash
+curl -fsS http://192.168.0.188:9100/metrics | grep -E 'awoooi_backup_health_monitor_up|node_textfile_scrape_error'
+```
+
+If `9100` returns connection refused or the textfile collector reports an error, first restore the exporter with the repo helper:
+
+```bash
+ssh ollama@192.168.0.188 'bash -s' < scripts/ops/188-node-exporter-restore.sh
+```
+
+After recovery, check Prometheus / Alertmanager again; do not silence the alert directly.
+
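 ### 14.22 Post-reboot timeline verification

 After every reboot, proceed along the timeline step by step; do not wait until the end to make a single judgment.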
--- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
+++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
@@ -13,9 +13,9 @@
 |------|--------|------------|----------|
 | Overall recovery readiness | SERVICE_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-24 02:08 live cold-start read-only gate returned `PASS=85 WARN=0 BLOCKED=0`, result `GREEN`. 110 / 120 / 121 / 188 ping and SSH port are OK, K3s `mon` / `mon1` are Ready, `NODE_FS_ERROR_EVENTS=0`, `NODE_READONLY_FILESYSTEM_TRUE=0`, `NODE_DISK_PRESSURE_TRUE=0`, public routes/TLS are green, 110 / 188 runtime checks are green. K8s schedule readback distinguishes `FAILED_JOBS=1` / `STALE_FAILED_JOBS=1` / `ACTIVE_FAILED_JOBS=0`; the retained failed Job is historical evidence, not an active service blocker. DR remains blocked because credential escrow evidence markers are still missing and must not be forged. |
 | P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-14 18:15 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, and cold-start P0/P1 checks are green. |
-| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 92% | 2026-06-15 03:11 `backup-status` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-15 02:40:13`. Offsite / escrow report shows `SCRIPT_MISSING_COUNT=0`, `OFFSITE_CONFIGURED=1`, `RCLONE_CONFIGURED=1`, `ESCROW_MISSING_COUNT=5`. Owner request package is ready; actual marker write remains blocked on real non-secret evidence IDs. |
+| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 93% | 2026-06-24 02:20 `backup-status` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-23 20:53:42`. At 02:24 the 188 `node-exporter` textfile scrape was restored; Prometheus now has `up{job="node-exporter-188"}=1` and `awoooi_backup_health_monitor_up{host="188"}=1`; `BackupHealthMonitorMissing188` resolved. DR remains blocked on real non-secret credential escrow evidence IDs. |
 | P2 service / data truth | VERIFIED_FULL_STACK_GREEN_FOR_SERVICE | 100% | 2026-06-24 02:08 cold-start verifies public route/TLS, API/Web route, momo health and current-month parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, backup exporters, schedules, K3s node readiness/storage conditions, VIP, and 110 / 188 runtime health. K8s active failed Job count is `0`, bad pods are `0`, and cold-start returns `PASS=85 WARN=0 BLOCKED=0`. |
-| P3 docs / automation contracts | DONE_WITH_HEARTBEAT_NOISE_AND_MOMO_DETECTOR_CLOSURE | 100% | Workplan, SOP v1.27, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, and 2026-06-24 live readback are updated. Production image `a84a5a0b` is live with API `2/2`, Web `2/2`, Worker `1/1`; CD `#3289` is a known false-negative caused by worker startup / rollout timeout after deploy marker `4a7b5329`. |
+| P3 docs / automation contracts | DONE_WITH_HEARTBEAT_NOISE_MOMO_AND_188_EXPORTER_CLOSURE | 100% | Workplan, SOP v1.28, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, and 2026-06-24 live readback are updated. Production image `a84a5a0b` is live with API `2/2`, Web `2/2`, Worker `1/1`; CD `#3289` is a known false-negative caused by worker startup / rollout timeout after deploy marker `4a7b5329`. |

 Full cold-start service readiness may be declared green for the latest verified evidence set. As of 2026-06-24 02:08, services are green with `PASS=85 WARN=0 BLOCKED=0`; the retained stale failed Job remains historical evidence only. Do not declare DR scorecard complete while credential escrow evidence remains blocked.

--- a/scripts/ops/188-node-exporter-restore.sh
+++ b/scripts/ops/188-node-exporter-restore.sh
@@ -0,0 +1,52 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+IMAGE="${NODE_EXPORTER_IMAGE:-quay.io/prometheus/node-exporter:v1.8.2}"
+CONTAINER_NAME="${NODE_EXPORTER_CONTAINER_NAME:-node-exporter}"
+TEXTFILE_DIR="${NODE_EXPORTER_TEXTFILE_DIR:-/home/ollama/node_exporter_textfiles}"
+PORT="${NODE_EXPORTER_PORT:-9100}"
+
+if [ ! -d "$TEXTFILE_DIR" ]; then
+  echo "TEXTFILE_DIR_MISSING $TEXTFILE_DIR" >&2
+  exit 1
+fi
+
+if docker ps -a --format '{{.Names}}' | grep -Fqx "$CONTAINER_NAME"; then
+  docker rm -f "$CONTAINER_NAME" >/dev/null
+fi
+
+docker run -d \
+  --name "$CONTAINER_NAME" \
+  --restart unless-stopped \
+  -p "${PORT}:9100" \
+  -v /:/host:ro,rslave \
+  -v "${TEXTFILE_DIR}:/textfile:ro" \
+  "$IMAGE" \
+  --path.rootfs=/host \
+  --collector.textfile.directory=/textfile \
+  '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc|run/credentials|var/lib/docker|run/docker)($|/)' \
+  '--collector.filesystem.fs-types-exclude=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs|tmpfs|shm)$' \
+  --no-collector.infiniband \
+  --no-collector.ipvs \
+  --no-collector.mdadm \
+  --no-collector.nfs \
+  --no-collector.nfsd \
+  --no-collector.xfs \
+  --no-collector.zfs \
+  --no-collector.processes \
+  --no-collector.arp \
+  --no-collector.netclass \
+  --no-collector.netdev >/dev/null
+
+for _ in $(seq 1 20); do
+  if metrics="$(curl -fsS --max-time 3 "http://127.0.0.1:${PORT}/metrics")" \
+    && grep -q '^awoooi_backup_health_monitor_up{host="188"} 1$' <<<"$metrics"; then
+    echo "NODE_EXPORTER_188_OK port=${PORT} textfile_dir=${TEXTFILE_DIR}"
+    exit 0
+  fi
+  sleep 1
+done
+
+echo "NODE_EXPORTER_188_BACKUP_HEALTH_METRIC_MISSING port=${PORT}" >&2
+docker logs --tail=80 "$CONTAINER_NAME" >&2 || true
+exit 1
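
The triage described in SOP section 14.27 (a textfile collector error versus a missing backup health metric) could be sketched as a small shell function that classifies `/metrics` output read from stdin. This is only an illustration under stated assumptions: the function name `check_188_textfile_metrics` and its three status strings are hypothetical and are not part of this commit; the two metric lines it looks for are the ones recorded in the live verification above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this commit): classify node-exporter
# /metrics text read from stdin.
# Prints one status word and returns 0 only when both checks pass.
check_188_textfile_metrics() {
  metrics="$(cat)"

  # The textfile collector reports 0 when every *.prom file was read cleanly.
  if ! printf '%s\n' "$metrics" | grep -q '^node_textfile_scrape_error 0$'; then
    echo "TEXTFILE_SCRAPE_ERROR"
    return 1
  fi

  # The backup health monitor metric must be present with value 1.
  if ! printf '%s\n' "$metrics" \
      | grep -q '^awoooi_backup_health_monitor_up{host="188"} 1$'; then
    echo "BACKUP_HEALTH_METRIC_MISSING"
    return 1
  fi

  echo "OK"
}
```

A possible use, assuming the exporter is reachable from the operator host: `curl -fsS http://192.168.0.188:9100/metrics | check_188_textfile_metrics`. If `curl` itself fails with connection refused, the function sees empty input and reports `TEXTFILE_SCRAPE_ERROR`, so the port should be checked separately first.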