docs(ops): record 188 node exporter recovery [skip ci]
@@ -1,3 +1,26 @@
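## 2026-06-24|188 node-exporter recovery and backup-health-missing alert closure

**Background**: Both the cold-start check and `backup-status` showed the 188 backup textfile as fresh, yet Alertmanager still had `BackupHealthMonitorMissing188` active. Investigation confirmed that the backup had not failed: Prometheus could not scrape `192.168.0.188:9100` and therefore could not see 188's `backup_health.prom`.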

**Root cause**:
- On 188, the `node_exporter` / `prometheus-node-exporter` systemd unit was absent or inactive, and the exporter ports `9100/9187/9121/9113` all returned connection refused.
- Reaching the textfile directory through the rootfs mount as `/host/home/ollama/node_exporter_textfiles` made the textfile collector fail with `permission denied`, because `/home/ollama` has mode `750`.

**Fix**:
- Started `quay.io/prometheus/node-exporter:v1.8.2` with Docker, using `restart=unless-stopped` and `-p 9100:9100`.
- The rootfs is still mounted at `/host`; the textfile directory is now bound directly as `/home/ollama/node_exporter_textfiles:/textfile:ro`, with `--collector.textfile.directory=/textfile`.
- Added a repo helper, `scripts/ops/188-node-exporter-restore.sh`, so the next post-reboot recovery does not depend on memory.
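
The `permission denied` false start can be reproduced offline with a small traversal check. The following is a minimal sketch, not part of the repo: the function name `first_untraversable` is hypothetical, it relies on GNU `stat -c '%a'`, and it assumes the exporter process is neither the owner nor in the group of the directories, so only the "other" permission bits apply.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): print the first directory under
# ROOT/RELPATH that users in the "other" class cannot traverse.
# Usage: first_untraversable ROOT RELPATH
first_untraversable() {
  local current="${1%/}" rest="$2" part mode
  while [ -n "$rest" ]; do
    # Peel off the next path component.
    part="${rest%%/*}"
    if [ "$part" = "$rest" ]; then rest=""; else rest="${rest#*/}"; fi
    [ -z "$part" ] && continue
    current="${current}/${part}"
    mode="$(stat -c '%a' "$current")" || return 2
    # The last octal digit holds the "other" bits; bit 1 is execute (traverse).
    if (( (8#${mode: -1} & 1) == 0 )); then
      printf '%s\n' "$current"
      return 0
    fi
  done
  return 1  # every component is traversable
}
```

Called as `first_untraversable /host home/ollama/node_exporter_textfiles` inside a container with the rootfs mounted at `/host`, this would be expected to point at `/host/home/ollama` when that directory is `750`. Binding the textfile directory directly at `/textfile` removes that parent directory from the lookup path.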

**Live verification**:
- On 188, local `http://127.0.0.1:9100/metrics` now exposes `awoooi_backup_health_monitor_up{host="188"} 1`.
- `node_textfile_scrape_error 0`.
- Prometheus `up{job="node-exporter-188"}` returns `1`.
- Prometheus `awoooi_backup_health_monitor_up{host="188"}` returns `1`.
- `absent(awoooi_backup_health_monitor_up{host="188"})` returns an empty set.
- `ALERTS{alertname="BackupHealthMonitorMissing188"}` returns an empty set; the alert has resolved.
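
The two local checks in the list above can be combined into one pass over the exporter output. This is a sketch only: the function name `check_backup_health_metrics` and its output tokens are hypothetical, and it assumes the metrics text arrives on stdin in the Prometheus text exposition format.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): read exporter output from stdin and
# confirm both local signals listed under "Live verification".
check_backup_health_metrics() {
  local metrics
  # Read the whole body first so the producer is never cut off mid-write.
  metrics="$(cat)"
  grep -q '^awoooi_backup_health_monitor_up{host="188"} 1$' <<<"$metrics" \
    || { echo "BACKUP_HEALTH_METRIC_MISSING"; return 1; }
  grep -q '^node_textfile_scrape_error 0$' <<<"$metrics" \
    || { echo "TEXTFILE_SCRAPE_ERROR"; return 1; }
  echo "LOCAL_METRICS_OK"
}
```

On 188 it could be fed with `curl -fsS http://127.0.0.1:9100/metrics | check_backup_health_metrics`.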
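
**Scope**: This restores the 188 exporter / textfile scrape; it does not silence the backup alert. The Docker daemon was not restarted, the MOMO / PostgreSQL / Redis business containers were not changed, the firewall was not changed, and no secrets were read.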
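
## 2026-06-24|Telegram healthy-heartbeat noise reduction and cold-start MOMO detector fix

**Background**: The Telegram group received `AWOOOI 心跳 / 告警鏈路: ✅ 正常` (AWOOOI heartbeat / alert chain: OK) every 30 minutes, so healthy signals flooded the channel. In addition, the live cold-start at 2026-06-24 01:45 had already reached `BLOCKED=0` but still produced two WARNs, because the MOMO scheduler log pattern and the DB read method were outdated. Both issues add false noise / false warnings to the reboot SOP and had to be fixed.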
@@ -1652,6 +1652,36 @@ SOP update:
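
For Worker / CronJob / queue services whose startup may take longer than the startup probe allows, a single failed `rollout status --timeout=60s` is not enough to declare production down. Check the deploy marker, image tag, pod readiness, container restart count, service health, and public route / API health together. If the pod eventually becomes ready while CD is red, this is CI timeout / probe tuning work, not a service restart incident; follow up by adjusting the startup probe or the post-deploy timeout.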
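
The judgment above can be written down as a small decision function. This is an illustrative sketch only: the function name, its three boolean inputs, and the verdict labels are hypothetical and not taken from the SOP, and the remaining signals (deploy marker, image tag, restart count) are assumed to be folded into the inputs by the caller.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repo). Each argument is "1" (true) or "0" (false).
# Usage: classify_rollout ROLLOUT_OK PODS_READY HEALTH_OK
classify_rollout() {
  local rollout_ok="$1" pods_ready="$2" health_ok="$3"
  if [ "$pods_ready" = 1 ] && [ "$health_ok" = 1 ]; then
    if [ "$rollout_ok" = 1 ]; then
      echo "DEPLOY_OK"
    else
      # Pods became ready after the rollout timeout: tune the probe or timeout.
      echo "CI_TIMEOUT_OR_PROBE_TUNING"
    fi
  else
    # Readiness or health is actually failing: treat as a service problem.
    echo "INVESTIGATE_SERVICE"
  fi
}
```

For example, a red CD run with ready pods and healthy routes would be `classify_rollout 0 1 1`, which yields the probe-tuning verdict instead of an outage.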

### 14.27 2026-06-24 188 node-exporter / backup health alert closure

The second change on 2026-06-24 restored the 188 node-exporter textfile scrape. `backup-status` and cold-start can both read a fresh 188 `backup_health.prom` over SSH, but when the Prometheus `node-exporter-188` scrape is down, `BackupHealthMonitorMissing188` fires correctly. In this situation the alert must not be silenced; the exporter must be restored.

| Item | 2026-06-24 188 exporter baseline |
|------|----------------------------------|
| SOP version | `v1.28` |
| Root cause | 188 `9100` connection refused; `node_exporter` / `prometheus-node-exporter` unit absent/inactive; Prometheus could not scrape `backup_health.prom` |
| False start | Mounting `/home/ollama/node_exporter_textfiles` via `/host/home/ollama/...` failed because `/home/ollama` is `750` and textfile collector saw `permission denied` |
| Live restore | Docker container `node-exporter` uses `quay.io/prometheus/node-exporter:v1.8.2`, `restart=unless-stopped`, `-p 9100:9100`, rootfs mount `/host`, direct textfile bind `/home/ollama/node_exporter_textfiles:/textfile:ro` |
| Repo helper | `scripts/ops/188-node-exporter-restore.sh` |
| Local metrics | `awoooi_backup_health_monitor_up{host="188"} 1`; `node_textfile_scrape_error 0` |
| Prometheus readback | `up{job="node-exporter-188"} 1`; `awoooi_backup_health_monitor_up{host="188"} 1`; `absent(awoooi_backup_health_monitor_up{host="188"})` empty |
| Alert readback | `ALERTS{alertname="BackupHealthMonitorMissing188"}` empty |
| Declaration limit | May declare the 188 backup health scrape restored; must not treat this as credential escrow complete or backup retention policy complete |

If `BackupHealthMonitorMissing188` is active after a future reboot while SSH / `backup-status` shows `backup_health.prom` as fresh, check this first:

```bash
curl -fsS http://192.168.0.188:9100/metrics | grep -E 'awoooi_backup_health_monitor_up|node_textfile_scrape_error'
```

If `9100` returns connection refused or the textfile collector reports an error, restore the exporter with the repo helper first:

```bash
ssh ollama@192.168.0.188 'bash -s' < scripts/ops/188-node-exporter-restore.sh
```

After the restore, check Prometheus / Alertmanager again; do not silence the alert directly.
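
A sketch of that Prometheus readback using the instant-query endpoint of the Prometheus HTTP API (`/api/v1/query`). The `PROM_URL` default is a placeholder assumption, the function names are hypothetical, and the empty-result check is a plain string match on the JSON reply, not a full parser.

```shell
#!/usr/bin/env bash
# Assumption: PROM_URL points at the Prometheus server; the default is a placeholder.
PROM_URL="${PROM_URL:-http://127.0.0.1:9090}"

# Hypothetical helper: run an instant query and print the raw JSON reply.
prom_query() {
  curl -fsS --max-time 5 --get "${PROM_URL}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Hypothetical helper: succeed when the JSON on stdin is a successful reply
# whose result vector is empty.
is_empty_result() {
  local body
  body="$(tr -d ' \n\t')"
  [[ "$body" == *'"status":"success"'* && "$body" == *'"result":[]'* ]]
}
```

With these, `prom_query 'ALERTS{alertname="BackupHealthMonitorMissing188"}' | is_empty_result && echo ALERT_RESOLVED` would confirm the alert has cleared without touching any silence.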

### 14.22 Post-reboot timeline verification

After every reboot, step through the timeline; do not wait until the end to make a single judgment.
@@ -13,9 +13,9 @@
|------|--------|------------|----------|
| Overall recovery readiness | SERVICE_GREEN_DR_ESCROW_BLOCKED | 99% | 2026-06-24 02:08 live cold-start read-only gate returned `PASS=85 WARN=0 BLOCKED=0`, result `GREEN`. 110 / 120 / 121 / 188 ping and SSH port are OK, K3s `mon` / `mon1` are Ready, `NODE_FS_ERROR_EVENTS=0`, `NODE_READONLY_FILESYSTEM_TRUE=0`, `NODE_DISK_PRESSURE_TRUE=0`, public routes/TLS are green, 110 / 188 runtime checks are green. K8s schedule readback distinguishes `FAILED_JOBS=1` / `STALE_FAILED_JOBS=1` / `ACTIVE_FAILED_JOBS=0`; the retained failed Job is historical evidence, not an active service blocker. DR remains blocked because credential escrow evidence markers are still missing and must not be forged. |
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-14 18:15 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, and cold-start P0/P1 checks are green. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 92% | 2026-06-15 03:11 `backup-status` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-15 02:40:13`. Offsite / escrow report shows `SCRIPT_MISSING_COUNT=0`, `OFFSITE_CONFIGURED=1`, `RCLONE_CONFIGURED=1`, `ESCROW_MISSING_COUNT=5`. Owner request package is ready; actual marker write remains blocked on real non-secret evidence IDs. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 93% | 2026-06-24 02:20 `backup-status` shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`, last aggregate `2026-06-23 20:53:42`. At 02:24 the 188 `node-exporter` textfile scrape was restored; Prometheus now has `up{job="node-exporter-188"}=1` and `awoooi_backup_health_monitor_up{host="188"}=1`; `BackupHealthMonitorMissing188` resolved. DR remains blocked on real non-secret credential escrow evidence IDs. |
| P2 service / data truth | VERIFIED_FULL_STACK_GREEN_FOR_SERVICE | 100% | 2026-06-24 02:08 cold-start verifies public route/TLS, API/Web route, momo health and current-month parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, backup exporters, schedules, K3s node readiness/storage conditions, VIP, and 110 / 188 runtime health. K8s active failed Job count is `0`, bad pods are `0`, and cold-start returns `PASS=85 WARN=0 BLOCKED=0`. |
| P3 docs / automation contracts | DONE_WITH_HEARTBEAT_NOISE_AND_MOMO_DETECTOR_CLOSURE | 100% | Workplan, SOP v1.27, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, and 2026-06-24 live readback are updated. Production image `a84a5a0b` is live with API `2/2`, Web `2/2`, Worker `1/1`; CD `#3289` is a known false-negative caused by worker startup / rollout timeout after deploy marker `4a7b5329`. |
| P3 docs / automation contracts | DONE_WITH_HEARTBEAT_NOISE_MOMO_AND_188_EXPORTER_CLOSURE | 100% | Workplan, SOP v1.28, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, and 2026-06-24 live readback are updated. Production image `a84a5a0b` is live with API `2/2`, Web `2/2`, Worker `1/1`; CD `#3289` is a known false-negative caused by worker startup / rollout timeout after deploy marker `4a7b5329`. |

Full cold-start service readiness may be declared green for the latest verified evidence set. As of 2026-06-24 02:08, services are green with `PASS=85 WARN=0 BLOCKED=0`; the retained stale failed Job remains historical evidence only. Do not declare DR scorecard complete while credential escrow evidence remains blocked.
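
The declaration rule can be sketched as a parser over the gate summary line. This is an assumption-laden sketch: the function name `gate_result` and the `NOT_GREEN` / `UNPARSEABLE` labels are hypothetical, and it assumes `GREEN` requires both `WARN=0` and `BLOCKED=0`, as in the readback above. It does not cover the separate DR escrow gate.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): classify a cold-start summary line.
# Usage: gate_result 'PASS=85 WARN=0 BLOCKED=0'
gate_result() {
  local summary="$1"
  local re='PASS=([0-9]+) WARN=([0-9]+) BLOCKED=([0-9]+)'
  if ! [[ "$summary" =~ $re ]]; then
    echo "UNPARSEABLE"
    return 2
  fi
  local warn="${BASH_REMATCH[2]}" blocked="${BASH_REMATCH[3]}"
  # Force base 10 so counts with leading zeros are not read as octal.
  if (( 10#$warn == 0 && 10#$blocked == 0 )); then
    echo "GREEN"
  else
    echo "NOT_GREEN"
    return 1
  fi
}
```

A service-level `GREEN` from this check says nothing about DR: `escrow_missing=5` still blocks the DR scorecard.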
52
scripts/ops/188-node-exporter-restore.sh
Executable file
@@ -0,0 +1,52 @@
#!/usr/bin/env bash
set -euo pipefail

IMAGE="${NODE_EXPORTER_IMAGE:-quay.io/prometheus/node-exporter:v1.8.2}"
CONTAINER_NAME="${NODE_EXPORTER_CONTAINER_NAME:-node-exporter}"
TEXTFILE_DIR="${NODE_EXPORTER_TEXTFILE_DIR:-/home/ollama/node_exporter_textfiles}"
PORT="${NODE_EXPORTER_PORT:-9100}"

if [ ! -d "$TEXTFILE_DIR" ]; then
  echo "TEXTFILE_DIR_MISSING $TEXTFILE_DIR" >&2
  exit 1
fi

if docker ps -a --format '{{.Names}}' | grep -qx "$CONTAINER_NAME"; then
  docker rm -f "$CONTAINER_NAME" >/dev/null
fi

docker run -d \
  --name "$CONTAINER_NAME" \
  --restart unless-stopped \
  -p "${PORT}:9100" \
  -v /:/host:ro,rslave \
  -v "${TEXTFILE_DIR}:/textfile:ro" \
  "$IMAGE" \
  --path.rootfs=/host \
  --collector.textfile.directory=/textfile \
  '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc|run/credentials|var/lib/docker|run/docker)($|/)' \
  '--collector.filesystem.fs-types-exclude=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs|tmpfs|shm)$' \
  --no-collector.infiniband \
  --no-collector.ipvs \
  --no-collector.mdadm \
  --no-collector.nfs \
  --no-collector.nfsd \
  --no-collector.xfs \
  --no-collector.zfs \
  --no-collector.processes \
  --no-collector.arp \
  --no-collector.netclass \
  --no-collector.netdev >/dev/null

for _ in $(seq 1 20); do
  if curl -fsS --max-time 3 "http://127.0.0.1:${PORT}/metrics" \
    | grep -q '^awoooi_backup_health_monitor_up{host="188"} 1$'; then
    echo "NODE_EXPORTER_188_OK port=${PORT} textfile_dir=${TEXTFILE_DIR}"
    exit 0
  fi
  sleep 1
done

echo "NODE_EXPORTER_188_BACKUP_HEALTH_METRIC_MISSING port=${PORT}" >&2
docker logs --tail=80 "$CONTAINER_NAME" >&2 || true
exit 1