diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index cd1ab7b6..ac0e6767 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -1,3 +1,26 @@
+## 2026-06-24|188 node-exporter restore and backup-health-missing alert closure
+
+**Background**: Both cold-start and `backup-status` showed the 188 backup textfile as fresh, yet Alertmanager still had `BackupHealthMonitorMissing188` active. Investigation confirmed that the backup had not failed; Prometheus could not scrape `192.168.0.188:9100` and therefore could not see 188's `backup_health.prom`.
+
+**Root cause**:
+- On 188, the `node_exporter` / `prometheus-node-exporter` systemd unit was absent or inactive, and the exporter ports `9100/9187/9121/9113` all returned connection refused.
+- Reaching the textfile directory through the rootfs mount as `/host/home/ollama/node_exporter_textfiles` fails because `/home/ollama` has mode `750`, so the textfile collector gets `permission denied`.
+
+**Fix**:
+- Start `quay.io/prometheus/node-exporter:v1.8.2` with Docker, using `restart=unless-stopped` and `-p 9100:9100`.
+- The rootfs is still mounted at `/host`; the textfile directory is now bound directly as `/home/ollama/node_exporter_textfiles:/textfile:ro`, with `--collector.textfile.directory=/textfile`.
+- Add the repo helper `scripts/ops/188-node-exporter-restore.sh` so that recovery after the next reboot does not depend on remembering manual steps.
+
+**Live verification**:
+- On 188, local `http://127.0.0.1:9100/metrics` now exposes `awoooi_backup_health_monitor_up{host="188"} 1`.
+- `node_textfile_scrape_error 0`.
+- Prometheus `up{job="node-exporter-188"}` returns `1`.
+- Prometheus `awoooi_backup_health_monitor_up{host="188"}` returns `1`.
+- `absent(awoooi_backup_health_monitor_up{host="188"})` returns an empty set.
+- `ALERTS{alertname="BackupHealthMonitorMissing188"}` returns an empty set; the alert has resolved.
+
+**Scope**: This restores the 188 exporter / textfile scrape; it does not silence the backup alert. The Docker daemon was not restarted, the MOMO / PostgreSQL / Redis business containers were not changed, the firewall was not changed, and no secrets were read.
+
 ## 2026-06-24|Telegram healthy-heartbeat noise reduction and cold-start MOMO detector fix
 
 **Background**: The Telegram group received `AWOOOI 心跳 / 告警鏈路: ✅ 正常` every 30 minutes, so healthy signals flooded the channel. In addition, the 2026-06-24 01:45 live cold-start reached `BLOCKED=0` but still produced two WARNs because the MOMO scheduler log pattern and the DB read method were outdated. Both introduce false noise / false warnings into the reboot SOP and had to be fixed.
diff --git a/docs/runbooks/FULL-STACK-COLD-START-SOP.md b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
index c8e61246..fc4b8c2f 100644
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -1652,6 +1652,36 @@
 
 SOP update: For Worker / CronJob / queue services whose startup time may exceed the startup probe, do not declare production down just because the first `rollout status --timeout=60s` fails. Also check the deploy marker, image tag, pod readiness, container restart count, service health, and public route / API health. If the pod eventually becomes ready but CD is red, that is CI timeout / probe tuning work, not a service restart incident; adjust the startup probe or post-deploy timeout afterwards.
 
+### 14.27 2026-06-24 188 node-exporter / backup health alert closure
+
+The second change on 2026-06-24 restores the 188 node-exporter textfile scrape. Both `backup-status` and cold-start can read a fresh 188 `backup_health.prom` over SSH, but while the Prometheus `node-exporter-188` scrape is down, `BackupHealthMonitorMissing188` fires correctly. In this situation, do not silence the alert; restore the exporter.
+
+| Item | 2026-06-24 188 exporter baseline |
+|------|----------------------------------|
+| SOP version | `v1.28` |
+| Root cause | 188 `9100` connection refused; `node_exporter` / `prometheus-node-exporter` unit absent/inactive; Prometheus could not scrape `backup_health.prom` |
+| False start | Mounting `/home/ollama/node_exporter_textfiles` via `/host/home/ollama/...` failed because `/home/ollama` is `750` and textfile collector saw `permission denied` |
+| Live restore | Docker container `node-exporter` uses `quay.io/prometheus/node-exporter:v1.8.2`, `restart=unless-stopped`, `-p 9100:9100`, rootfs mount `/host`, direct textfile bind `/home/ollama/node_exporter_textfiles:/textfile:ro` |
+| Repo helper | `scripts/ops/188-node-exporter-restore.sh` |
+| Local metrics | `awoooi_backup_health_monitor_up{host="188"} 1`; `node_textfile_scrape_error 0` |
+| Prometheus readback | `up{job="node-exporter-188"} 1`; `awoooi_backup_health_monitor_up{host="188"} 1`; `absent(awoooi_backup_health_monitor_up{host="188"})` empty |
+| Alert readback | `ALERTS{alertname="BackupHealthMonitorMissing188"}` empty |
+| Declaration limit | May declare the 188 backup health scrape restored; must not treat this as credential escrow complete or backup retention policy complete |
+
+If `BackupHealthMonitorMissing188` is active after a future reboot while SSH / `backup-status` shows a fresh `backup_health.prom`, check this first:
+
+```bash
+curl -fsS http://192.168.0.188:9100/metrics | grep -E 'awoooi_backup_health_monitor_up|node_textfile_scrape_error'
+```
+
+If `9100` refuses connections or the textfile collector reports an error, restore the exporter with the repo helper first:
+
+```bash
+ssh ollama@192.168.0.188 'bash -s' < scripts/ops/188-node-exporter-restore.sh
+```
+
+After the restore, re-check Prometheus / Alertmanager; do not silence the alert directly.
+
 ### 14.22 Post-reboot timeline verification
 
 After every reboot, proceed along the timeline; do not wait until the end to make a single judgment.
diff --git a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
index 29b9bea5..5e7692a0 100644
--- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
+++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
@@ -13,9 +13,9 @@
 |------|--------|------------|----------|
 | Overall recovery readiness | SERVICE_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-24 02:08 live cold-start read-only gate returned `PASS=85 WARN=0 BLOCKED=0`, result `GREEN`. 110 / 120 / 121 / 188 ping and SSH port are OK, K3s `mon` / `mon1` are Ready, `NODE_FS_ERROR_EVENTS=0`, `NODE_READONLY_FILESYSTEM_TRUE=0`, `NODE_DISK_PRESSURE_TRUE=0`, public routes/TLS are green, 110 / 188 runtime checks are green. K8s schedule readback distinguishes `FAILED_JOBS=1` / `STALE_FAILED_JOBS=1` / `ACTIVE_FAILED_JOBS=0`; the retained failed Job is historical evidence, not an active service blocker. DR remains blocked because credential escrow evidence markers are still missing and must not be forged. |
 | P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-14 18:15 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, and cold-start P0/P1 checks are green. |
-| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 92% | 2026-06-15 03:11 `backup-status` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-15 02:40:13`. Offsite / escrow report shows `SCRIPT_MISSING_COUNT=0`, `OFFSITE_CONFIGURED=1`, `RCLONE_CONFIGURED=1`, `ESCROW_MISSING_COUNT=5`. Owner request package is ready; actual marker write remains blocked on real non-secret evidence IDs. |
+| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 93% | 2026-06-24 02:20 `backup-status` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-23 20:53:42`. At 02:24 the 188 `node-exporter` textfile scrape was restored; Prometheus now has `up{job="node-exporter-188"}=1` and `awoooi_backup_health_monitor_up{host="188"}=1`; `BackupHealthMonitorMissing188` resolved. DR remains blocked on real non-secret credential escrow evidence IDs. |
 | P2 service / data truth | VERIFIED_FULL_STACK_GREEN_FOR_SERVICE | 100% | 2026-06-24 02:08 cold-start verifies public route/TLS, API/Web route, momo health and current-month parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, backup exporters, schedules, K3s node readiness/storage conditions, VIP, and 110 / 188 runtime health. K8s active failed Job count is `0`, bad pods are `0`, and cold-start returns `PASS=85 WARN=0 BLOCKED=0`. |
-| P3 docs / automation contracts | DONE_WITH_HEARTBEAT_NOISE_AND_MOMO_DETECTOR_CLOSURE | 100% | Workplan, SOP v1.27, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, and 2026-06-24 live readback are updated. Production image `a84a5a0b` is live with API `2/2`, Web `2/2`, Worker `1/1`; CD `#3289` is a known false-negative caused by worker startup / rollout timeout after deploy marker `4a7b5329`. |
+| P3 docs / automation contracts | DONE_WITH_HEARTBEAT_NOISE_MOMO_AND_188_EXPORTER_CLOSURE | 100% | Workplan, SOP v1.28, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, and 2026-06-24 live readback are updated. Production image `a84a5a0b` is live with API `2/2`, Web `2/2`, Worker `1/1`; CD `#3289` is a known false-negative caused by worker startup / rollout timeout after deploy marker `4a7b5329`. |
 
 Full cold-start service readiness may be declared green for the latest verified evidence set. As of 2026-06-24 02:08, services are green with `PASS=85 WARN=0 BLOCKED=0`; the retained stale failed Job remains historical evidence only. Do not declare DR scorecard complete while credential escrow evidence remains blocked.
 
diff --git a/scripts/ops/188-node-exporter-restore.sh b/scripts/ops/188-node-exporter-restore.sh
new file mode 100755
index 00000000..6fe7321c
--- /dev/null
+++ b/scripts/ops/188-node-exporter-restore.sh
@@ -0,0 +1,52 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+IMAGE="${NODE_EXPORTER_IMAGE:-quay.io/prometheus/node-exporter:v1.8.2}"
+CONTAINER_NAME="${NODE_EXPORTER_CONTAINER_NAME:-node-exporter}"
+TEXTFILE_DIR="${NODE_EXPORTER_TEXTFILE_DIR:-/home/ollama/node_exporter_textfiles}"
+PORT="${NODE_EXPORTER_PORT:-9100}"
+
+if [ ! -d "$TEXTFILE_DIR" ]; then
+  echo "TEXTFILE_DIR_MISSING $TEXTFILE_DIR" >&2
+  exit 1
+fi
+
+if docker ps -a --format '{{.Names}}' | grep -qx "$CONTAINER_NAME"; then
+  docker rm -f "$CONTAINER_NAME" >/dev/null
+fi
+
+docker run -d \
+  --name "$CONTAINER_NAME" \
+  --restart unless-stopped \
+  -p "${PORT}:9100" \
+  -v /:/host:ro,rslave \
+  -v "${TEXTFILE_DIR}:/textfile:ro" \
+  "$IMAGE" \
+  --path.rootfs=/host \
+  --collector.textfile.directory=/textfile \
+  '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc|run/credentials|var/lib/docker|run/docker)($|/)' \
+  '--collector.filesystem.fs-types-exclude=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs|tmpfs|shm)$' \
+  --no-collector.infiniband \
+  --no-collector.ipvs \
+  --no-collector.mdadm \
+  --no-collector.nfs \
+  --no-collector.nfsd \
+  --no-collector.xfs \
+  --no-collector.zfs \
+  --no-collector.processes \
+  --no-collector.arp \
+  --no-collector.netclass \
+  --no-collector.netdev >/dev/null
+
+for _ in $(seq 1 20); do
+  if curl -fsS --max-time 3 "http://127.0.0.1:${PORT}/metrics" \
+    | grep -q '^awoooi_backup_health_monitor_up{host="188"} 1$'; then
+    echo "NODE_EXPORTER_188_OK port=${PORT} textfile_dir=${TEXTFILE_DIR}"
+    exit 0
+  fi
+  sleep 1
+done
+
+echo "NODE_EXPORTER_188_BACKUP_HEALTH_METRIC_MISSING port=${PORT}" >&2
+docker logs --tail=80 "$CONTAINER_NAME" >&2 || true
+exit 1
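
Review note: the SOP text in this patch asks the operator to look at both `awoooi_backup_health_monitor_up` and `node_textfile_scrape_error` before touching Alertmanager, while the helper's readiness loop only checks the first metric. A minimal sketch of a combined check follows. It is not part of the patch: the function name `check_backup_health_metrics` and its three status strings are hypothetical, and it assumes the exporter prints both metrics exactly as quoted in the LOGBOOK entry.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of this patch: classify node-exporter
# /metrics output read from stdin. The function name and the status
# strings are illustrative assumptions.
check_backup_health_metrics() {
  local metrics
  metrics="$(cat)"

  # The backup health gauge written by the textfile collector must be 1.
  if ! printf '%s\n' "$metrics" \
    | grep '^awoooi_backup_health_monitor_up{host="188"} 1$' >/dev/null; then
    echo "BACKUP_HEALTH_METRIC_MISSING"
    return 1
  fi

  # The textfile collector itself must report no scrape error.
  if ! printf '%s\n' "$metrics" \
    | grep '^node_textfile_scrape_error 0$' >/dev/null; then
    echo "TEXTFILE_SCRAPE_ERROR"
    return 2
  fi

  echo "BACKUP_HEALTH_SCRAPE_OK"
}
```

One way to use it would be `curl -fsS http://192.168.0.188:9100/metrics | check_backup_health_metrics`. The sketch redirects `grep` output to `/dev/null` instead of using `-q`, so `grep` reads its whole input and the upstream writer is not cut off early, which can matter when a caller enables `pipefail`.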