docs(ops): record nginx exporter recovery [skip ci]

Your Name
2026-06-24 20:19:03 +08:00
parent 622bc37250
commit b07486b7f2
7 changed files with 133 additions and 14 deletions

@@ -1,3 +1,24 @@
## 2026-06-24: 188 nginx-exporter and CD monitoring coverage gate convergence
**Background**: `2ec7f6f4 fix(ops): harden heartbeat and momo alert noise` has been written back by CD as deploy marker `622bc372 chore(cd): deploy 2ec7f6f [skip ci]`, and production API health returns `200 healthy`. However, the `post-deploy-checks` step of Gitea CD `#3294` is still marked Failure. The root cause is not a failed API/Web rollout: `scripts/generate_monitoring.py --check` saw Prometheus job `nginx-exporter` down, with `192.168.0.188:9113` refusing connections.
**Completed**:
- On 188, live `nginx_status` at `127.0.0.1:8080/nginx_status` is healthy and the `/home/ollama/nginx-exporter.yml` compose config parses, so Nginx itself and stub_status do not need a reload.
- Restored the stateless `nginx-exporter` container from the existing 188 compose source-of-truth. No Nginx config was modified, `nginx -t` was not run, nothing was reloaded, the firewall was not changed, no secret was read, and no volume prune was performed.
- Added `scripts/ops/188-nginx-exporter-restore.sh`. The default `--check` mode verifies stub_status, compose config, container state, and metrics read-only; only an explicit `--apply` runs `docker compose -f /home/ollama/nginx-exporter.yml up -d`.
- `bash scripts/ops/188-nginx-exporter-restore.sh --check` reads back `nginx_up 1`, `nginx_connections_active`, and `nginx_http_requests_total`; container `nginx-exporter` is exposed via `0.0.0.0:9113->9113/tcp`.
- `python3 scripts/generate_monitoring.py --check --stabilization-sleep-seconds 0` is back to `Jobs 總數=14`, `全部 UP=14`, `真實問題=0`, `預期覆蓋率=100.0%` (total jobs 14, all UP 14, real problems 0, expected coverage 100.0%).
- `high-value-config-change-gate.py` adds `scripts/ops/**/*exporter*` to `monitoring_alerting_observability`, so exporter restore helpers are classified as P1 / C1 instead of falling into the gray zone of unmanaged scripts.
- The 20:17 live cold-start rerun is still `PASS=86 WARN=0 BLOCKED=1`. The only blocker is `MOMO_DAILY_FRESHNESS 7|2026-06-17`; public routes/TLS, K3s `mon` / `mon1` Ready, AWOOOI API/Web/Worker, 188 exporters, 110 / 188 backup textfiles, and CronJobs all pass.
- `/backup/scripts/backup-status.sh --no-notify --no-refresh` reads back 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`.
- MOMO public health is `V10.645`; current-month `daily_sales_snapshot` / `realtime_sales_monthly` still agree at `10936|2026-06-01..2026-06-17`. Over the last 4 hours the Google Drive scheduler repeatedly confirmed that `當日業績匯入` contains `0` Excel files, so this is an absent source file, not a service that failed to recover.
**Verdict**:
- May declare: `nginx-exporter` and the monitoring coverage gate are recovered; the next CD post-deploy coverage check should not fail again for the same `188:9113` reason.
- Must not declare: that the historical CD `#3294` run has turned green, or that full-stack / DR is therefore complete. That run has already recorded a Failure; this round fixed the actual monitoring target and the rerunnable SOP.
**Boundary**: live actions this round were limited to restoring the stateless exporter container. No SSH changes were made to Nginx / Docker volumes / firewall / K8s / ArgoCD, no secret was read, and nothing was force pushed.
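The exporter readback in this entry can also be verified mechanically. The following is a minimal Python sketch, assuming the exporter serves standard Prometheus text-exposition lines such as `nginx_up 1`. The names `parse_metrics`, `exporter_healthy`, and `REQUIRED_METRICS` are hypothetical and are not part of this commit, and the sample values are illustrative only.

```python
# Hypothetical helper: not part of the repo, only a sketch of the readback check.
REQUIRED_METRICS = (
    "nginx_up",
    "nginx_connections_active",
    "nginx_http_requests_total",
)


def parse_metrics(text):
    """Parse unlabeled Prometheus text-exposition samples into {name: float}."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        # Skip malformed lines and labeled series such as name{label="x"}.
        if len(parts) < 2 or "{" in parts[0]:
            continue
        try:
            samples[parts[0]] = float(parts[1])
        except ValueError:
            continue
    return samples


def exporter_healthy(text):
    """True when all required metrics are present and nginx_up equals 1."""
    samples = parse_metrics(text)
    if any(name not in samples for name in REQUIRED_METRICS):
        return False
    return samples["nginx_up"] == 1.0


# Illustrative payload; the numbers are made up for the example.
SAMPLE = """\
# HELP nginx_up Status of the last metric scrape
# TYPE nginx_up gauge
nginx_up 1
nginx_connections_active 3
nginx_http_requests_total 12345
"""

if __name__ == "__main__":
    print(exporter_healthy(SAMPLE))
```

Requiring all three metric names, and not only `nginx_up`, mirrors the readback listed in this entry: a reachable exporter that cannot scrape stub_status would fail the check.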
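## 2026-06-24: post-reboot alert-noise hardening
**Background**: after reboot recovery, the stale MOMO Pro 5-minute 502 alert and the AWOOOI 30-minute success heartbeat both interfere with judgment. The right strategy is not to turn alerts off, but to separate "actionable new anomalies" from "repeated reports of the same state / success heartbeats".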

@@ -1,6 +1,6 @@
# AWOOOI 全棧冷啟動與主機重啟 SOP
> Version: v1.33
> Version: v1.34
> Last updated: 2026-06-24 Asia/Taipei
> Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.
@@ -10,17 +10,18 @@
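This section is the first judgment anchor after every handover, power-on, shutdown, or reboot. If the date is not today, rerun the live check first, then update this section and `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md`.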
2026-06-24 18:32 notification-noise hardening supersedes the earlier 11:35 wording where it discusses heartbeat / MOMO alert behavior. The service and data readiness gates below remain unchanged until a fresh live cold-start scorecard says otherwise:
2026-06-24 20:17 notification-noise hardening and 188 `nginx-exporter` recovery supersede the earlier 11:35 wording where it discusses heartbeat / MOMO alert behavior and monitoring coverage. The service and data readiness gates below are refreshed by the 20:17 live cold-start scorecard:
```text
Repo-side reboot SOP / Plan B / automation contracts: COMPLETE, 100%.
Live cold-start read-only check: PASS=86 WARN=0 BLOCKED=1, Result=BLOCKED.
Service state: SERVICE_AVAILABLE_MOMO_SOURCE_BLOCKED_DR_ESCROW_BLOCKED; 110/120/121/188 reachable, K3s mon/mon1 Ready, ArgoCD awoooi-prod Synced/Healthy at revision 7db7800e399caed5487a705c81ec993dec76c70f, public routes/TLS green, 110/188 backup health fresh, 188 node-exporter / PostgreSQL exporter / Redis exporter restored, 188 MinIO endpoint and Velero BackupStorageLocation restored, 110 disk pressure cleared.
Runtime release state: API/Web/Worker are ready; image remains a84a5a0b because 7db7800e is docs-only and does not rebuild runtime images.
MOMO state: mo.wooo.work health is healthy on version V10.639; current-month daily_sales_snapshot and realtime_sales_monthly match, but both stop at 2026-06-17. MOMO_DAILY_FRESHNESS is 7 days, which is a hard blocker because business data is not current.
Runtime release state: API/Web/Worker are ready; latest deployment marker 622bc372 points runtime image to 2ec7f6f4 and production API health returns healthy. CD #3294 still has a historical Failure record because post-deploy monitoring coverage saw 188 nginx-exporter down before the exporter restore.
MOMO state: mo.wooo.work health is healthy on version V10.645; current-month daily_sales_snapshot and realtime_sales_monthly match, but both stop at 2026-06-17. MOMO_DAILY_FRESHNESS is 7 days, which is a hard blocker because business data is not current.
Google Drive state: momo scheduler token ownership is fixed for Docker userns, container-side Drive listing works, but folder 當日業績匯入 currently has no matching 即時業績_當日 Excel source file. Archive latest matching file is 2026-06-18T01:30:39Z and was already imported by job 56.
Backup / monitoring state: backup-status core blockers are 0, last aggregate is 2026-06-24 02:28:39, 188 MinIO is healthy, Velero BackupStorageLocation default is Available, one-off backup reboot-recovery-202606240456 completed, backup-health textfile reports Velero freshness green, and VeleroBackupNotRun / PostgreSQLDown / RedisDown / disk-pressure alerts resolved.
Backup / monitoring state: backup-status core blockers are 0, 110 is 13/13 fresh failed=0, 188 is 2/2 fresh failed=0, offsite_fresh=1, rclone_gdrive_fresh=1, last aggregate is 2026-06-24 02:28:39, 188 MinIO is healthy, Velero BackupStorageLocation default is Available, one-off backup reboot-recovery-202606240456 completed, backup-health textfile reports Velero freshness green, PostgreSQL / Redis exporters are green, 188 nginx-exporter is restored with nginx_up=1, monitoring coverage is 14/14 jobs UP, and VeleroBackupNotRun / PostgreSQLDown / RedisDown / disk-pressure / nginx-exporter target-down evidence is resolved.
Notification-noise state: healthy AWOOOI heartbeat is suppressed; heartbeat warning dedupe uses stable actionable fingerprints so HTTP status / timeout / latency drift does not create a new Telegram event every 30 minutes; MOMO Pro monitor uses https://mo.wooo.work/health as primary truth and no longer checks the 188 root path; MoWoooWorkDown now labels component=momo-pro-system and requires public/local/container/data-freshness triage instead of blind restart; docker-health-monitor keeps 5-minute repair cadence but has a separate 30-minute Telegram fallback cooldown; Bitan public-content check keeps failure alerting with same-fingerprint cooldown and one recovery notice.
Monitoring coverage recovery state: if CD post-deploy fails only because `scripts/generate_monitoring.py --check` reports `nginx-exporter` down on `192.168.0.188:9113`, first verify 188 `stub_status` and restore the stateless exporter with `scripts/ops/188-nginx-exporter-restore.sh`; do not reload Nginx or restart product containers for this symptom.
Allowed declaration: core hosts, routes, K3s, backup/exporter surfaces are recovered; MOMO data pipeline is blocked waiting for a newer source file or owner-provided source evidence.
Forbidden declaration: full-stack green, MOMO data current, DR complete, or runtime/security acceptance. Credential escrow evidence is still missing and must not be forged.
```
@@ -1931,6 +1932,47 @@ No safe import candidate exists.
Full-stack remains blocked by data freshness, not by service outage.
```
### 14.32 2026-06-24 188 nginx-exporter / CD monitoring coverage gate
The sixth change segment on 2026-06-24 brings CD post-deploy monitoring coverage failures into the reboot SOP. The runtime deploy of `2ec7f6f4` wrote back `622bc372` and production API health is healthy, but the post-deploy checks of CD `#3294` recorded a Failure because the `nginx-exporter` target was down. The root cause is that the 188 `nginx-exporter` container was not running; it is not a fault in the Nginx public gateway, the API/Web rollout, or the MOMO service.
| Item | 20:10 monitoring coverage baseline |
|------|------------------------------------|
| SOP version | `v1.34` |
| Affected CD run | Gitea CD `#3294` historical result remains Failure; deploy marker `622bc372` has been written |
| Root cause | Prometheus job `nginx-exporter` down; target `192.168.0.188:9113` connection refused |
| Non-root cause | Nginx `stub_status` is healthy; no need to reload Nginx, restart API/Web/MOMO, or change the firewall |
| Live restore source | `/home/ollama/nginx-exporter.yml` |
| Repo helper | `scripts/ops/188-nginx-exporter-restore.sh` |
| Check mode | `--check` only reads stub_status, compose config, container state, and metrics |
| Apply mode | `--apply` runs `docker compose -f /home/ollama/nginx-exporter.yml up -d` after stub_status and compose config pass |
| Exporter metrics | `nginx_up 1`, `nginx_connections_active`, `nginx_http_requests_total` |
| Monitoring coverage | `Jobs 總數=14`, `全部 UP=14`, `真實問題=0`, `預期覆蓋率=100.0%` |
| Declaration limit | May declare exporter / monitoring coverage recovered; must not relabel the historical CD run as success, nor declare full-stack green / DR complete |
Post-reboot / post-CD 188 nginx-exporter check:
```bash
bash scripts/ops/188-nginx-exporter-restore.sh --check
python3 scripts/generate_monitoring.py --check --stabilization-sleep-seconds 0
```
If `--check` fails only at the metrics stage while `stub_status` and compose config both pass, and the maintenance window allows restoring the stateless exporter:
```bash
bash scripts/ops/188-nginx-exporter-restore.sh --apply
python3 scripts/generate_monitoring.py --check --stabilization-sleep-seconds 0
```
Do not handle this symptom in any of the following ways:
```text
NO-GO: reload Nginx before stub_status / exporter metrics prove Nginx config is the issue.
NO-GO: restart product containers because monitoring coverage alone is red.
NO-GO: silence monitoring coverage or mark CD green without target recovery evidence.
NO-GO: prune Docker volumes or delete exporter state not owned by this SOP.
```
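The check / apply / NO-GO rules in this section can be summarized as a small decision function. This is a sketch under stated assumptions, not the helper's actual logic: `CheckResult`, `next_action`, and the returned action labels are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    """Outcome of the three read-only stages of a --check run (hypothetical model)."""
    stub_status_ok: bool
    compose_config_ok: bool
    metrics_ok: bool


def next_action(result, maintenance_window_open):
    """Map a check outcome to the next step described in this section."""
    if result.stub_status_ok and result.compose_config_ok and result.metrics_ok:
        # Exporter already serves metrics: nothing to restore.
        return "NO_ACTION"
    if not (result.stub_status_ok and result.compose_config_ok):
        # Preconditions for --apply are not met; investigate first and do not
        # reload Nginx or restart product containers on this signal alone.
        return "INVESTIGATE_BEFORE_ANY_CHANGE"
    # Only the metrics stage failed: the stateless exporter is the likely gap.
    if maintenance_window_open:
        return "RUN_APPLY_THEN_RECHECK_COVERAGE"
    return "WAIT_FOR_MAINTENANCE_WINDOW"


if __name__ == "__main__":
    only_metrics_failed = CheckResult(True, True, False)
    print(next_action(only_metrics_failed, maintenance_window_open=True))
```

The function deliberately has no branch that returns an Nginx reload or a product restart, which matches the NO-GO list: those actions need separate evidence that this symptom alone does not provide.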
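### 14.22 Post-reboot timeline verification
After every reboot, advance along the timeline; do not wait until the end to make a single judgment.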

@@ -26,6 +26,10 @@
This sync only corrects the consistency between the long-term coverage matrix and the change gate. It does not mean live evidence has been received, nor that `nginx -t`, reload, certbot renew, or DNS / TLS changes may be executed.
## 1.1a 2026-06-24 exporter restore helper coverage sync
`high-value-config-change-gate.py` adds `scripts/ops/**/*exporter*` to `monitoring_alerting_observability`, so exporter restore helpers such as `scripts/ops/188-nginx-exporter-restore.sh` fall under P1 / C1 owner response and monitoring evidence control. The fixed numbers in this snapshot remain `categories=14`, `c0=8`, average maturity `73%`, `runtime_gate=0`. The change only corrects path coverage; it does not mean owner response received / accepted, Prometheus reload, Alertmanager reload, host write, Docker action, route smoke, production write, or runtime gate has been authorized.
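How the new pattern is expected to cover the helper can be illustrated with a small matcher. This is a hedged sketch: the matching code inside `high-value-config-change-gate.py` is not shown in this commit, so `glob_matches` below is a hypothetical stand-in that assumes `**/` means "zero or more directories" and `*` stays within one path segment. The path `scripts/ops/sub/redis-exporter.sh` is invented for the example.

```python
import re


def glob_to_regex(pattern):
    """Translate a path glob into a regex (assumed semantics, see lead-in)."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            # Zero or more whole directory segments.
            out.append(r"(?:[^/]+/)*")
            i += 3
        elif pattern.startswith("**", i):
            # Trailing '**' matches anything, including slashes.
            out.append(r".*")
            i += 2
        elif pattern[i] == "*":
            # Single '*' never crosses a '/'.
            out.append(r"[^/]*")
            i += 1
        elif pattern[i] == "?":
            out.append(r"[^/]")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + r"\Z")


def glob_matches(path, pattern):
    """Return True when the whole path matches the glob pattern."""
    return glob_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    pattern = "scripts/ops/**/*exporter*"
    for path in (
        "scripts/ops/188-nginx-exporter-restore.sh",
        "scripts/ops/sub/redis-exporter.sh",  # hypothetical nested path
        "scripts/ops/188-minio-restore.sh",   # hypothetical non-exporter helper
    ):
        print(path, glob_matches(path, pattern))
```

The "zero or more directories" reading matters here: Python's `fnmatch.fnmatch` treats `**/` as requiring at least one further `/`, so it would not match a helper that sits directly under `scripts/ops/`. Whichever matcher the gate uses, the helper path is worth testing against it explicitly.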
## 1.2 2026-06-14 K8s / ArgoCD manifest repo-only inventory
Added `docs/security/K8S-ARGOCD-MANIFEST-INVENTORY.md` and `docs/security/k8s-argocd-manifest-inventory.snapshot.json`, which turn `k8s/awoooi-prod`, `k8s/argocd`, `k8s/velero`, and `k8s/monitoring` into a repo-only manifest inventory.

@@ -373,7 +373,8 @@
"ops/signoz/**",
"ops/sentry-self-hosted/**",
"infra/langfuse/**",
"k8s/monitoring/**"
"k8s/monitoring/**",
"scripts/ops/**/*exporter*"
],
"priority": "P1",
"required_gate": "monitoring_observability_owner_response_required",
@@ -673,8 +674,8 @@
"websocket_route_change_authorized": false,
"workflow_modification_authorized": false
},
"generated_at": "2026-06-18T18:30:00+08:00",
"git_commit": "9013fbdc",
"generated_at": "2026-06-24T20:14:02+08:00",
"git_commit": "2ec7f6f4",
"lowest_coverage_categories": [
{
"category_id": "ai_provider_model_routing",

@@ -11,13 +11,13 @@
| Area | Status | Completion | Evidence |
|------|--------|------------|----------|
| Overall recovery readiness | SERVICE_AVAILABLE_MOMO_SOURCE_BLOCKED_DR_ESCROW_BLOCKED | 98% | 2026-06-24 11:35 live cold-start read-only gate returned `PASS=86 WARN=0 BLOCKED=1`, result `BLOCKED`. 110 / 120 / 121 / 188 ping and SSH port are OK, K3s `mon` / `mon1` are Ready, ArgoCD `awoooi-prod` is `Synced / Healthy` at revision `7db7800e399caed5487a705c81ec993dec76c70f`, public routes/TLS are green, 110 / 188 runtime and backup checks are green. 188 `node-exporter`, PostgreSQL exporter, Redis exporter, and MinIO / Velero BSL are restored; 110 disk pressure cleared. Remaining service blocker is MOMO business data freshness: `MOMO_DAILY_FRESHNESS 7|2026-06-17`; Drive listing works from the scheduler container, but `當日業績匯入` has no newer `即時業績_當日` Excel source file. DR remains blocked because credential escrow evidence markers are still missing and must not be forged. |
| Overall recovery readiness | SERVICE_AVAILABLE_MOMO_SOURCE_BLOCKED_DR_ESCROW_BLOCKED | 98% | 2026-06-24 20:17 live cold-start returned `PASS=86 WARN=0 BLOCKED=1`, result `BLOCKED` because MOMO business data freshness remains stale. 110 / 120 / 121 / 188 ping and SSH port are OK, K3s `mon` / `mon1` are Ready, public routes/TLS are green, 110 / 188 runtime and backup checks are green. 188 `node-exporter`, PostgreSQL exporter, Redis exporter, `nginx-exporter`, and MinIO / Velero BSL are restored; monitoring coverage is now `14/14 UP`; 110 disk pressure cleared. Remaining service blocker is MOMO business data freshness: `MOMO_DAILY_FRESHNESS 7|2026-06-17`; Drive listing works from the scheduler container, but `當日業績匯入` has no newer `即時業績_當日` Excel source file. DR remains blocked because credential escrow evidence markers are still missing and must not be forged. |
| P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-14 18:15 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, and cold-start P0/P1 checks are green. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 96% | 2026-06-24 11:20 backup / alert readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`. 188 `node-exporter` textfile scrape, PostgreSQL exporter, Redis exporter, MinIO endpoint, Velero BSL and latest completed backup freshness are restored; `BackupHealthMonitorMissing188`, `PostgreSQLDown`, `RedisDown`, `VeleroBackupNotRun` and 110 disk-pressure alerts resolved. DR remains blocked on real non-secret credential escrow evidence IDs. |
| P2 service / data truth | BLOCKED_MOMO_DATA_FRESHNESS | 96% | Public route/TLS, API/Web route, momo health `V10.639`, current-month parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, backup exporters, schedules, K3s node readiness/storage conditions, VIP, and 110 / 188 runtime health are green. However MOMO latest business date is `2026-06-17`; stale age is `7` days as of 11:35. Drive pending folder has `0` matching files and archive latest `2026-06-18T01:30:39Z` is already imported by job `56`, so there is no safe newer source to import. |
| P3 docs / automation contracts | DONE_WITH_MOMO_SOURCE_ABSENCE_GATE | 100% | Workplan, SOP v1.32, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, 188 DB/Redis exporter restore helper, 188 MinIO/Velero restore helper, 110 Docker disk pressure cleanup boundary, MOMO Google Drive token userns readback, MOMO daily freshness blocker, MOMO Pro false-noise health monitor source-of-truth, docker-health direct Telegram fallback cooldown, Bitan public-content same-fingerprint cooldown, notification-noise readback, MOMO source-file absence GO/NO-GO gate, MacBook Pro Codex safe artifact sync readback, and MacBook Pro AwoooGo Gitea SSH / dev workspace readback are updated. Production image `a84a5a0b` remains live with API `2/2`, Web `2/2`, Worker `1/1`; `7db7800e` is docs-only and does not require runtime image rebuild. |
| P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 97% | 2026-06-24 11:20 backup / alert readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `integrity_stale=0`, `offsite_fresh=1`, `rclone_gdrive_fresh=1`, `escrow_missing=5`. 188 `node-exporter` textfile scrape, PostgreSQL exporter, Redis exporter, `nginx-exporter`, MinIO endpoint, Velero BSL and latest completed backup freshness are restored; monitoring coverage is `14/14 UP`; `BackupHealthMonitorMissing188`, `PostgreSQLDown`, `RedisDown`, `VeleroBackupNotRun` and 110 disk-pressure alerts resolved. DR remains blocked on real non-secret credential escrow evidence IDs. |
| P2 service / data truth | BLOCKED_MOMO_DATA_FRESHNESS | 96% | Public route/TLS, API/Web route, momo health `V10.645`, current-month parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, backup exporters, schedules, K3s node readiness/storage conditions, VIP, and 110 / 188 runtime health are green. However MOMO latest business date is `2026-06-17`; stale age is `7` days as of 20:17. Drive pending folder has `0` matching files in repeated scheduler checks and archive latest `2026-06-18T01:30:39Z` is already imported by job `56`, so there is no safe newer source to import. |
| P3 docs / automation contracts | DONE_WITH_MOMO_SOURCE_ABSENCE_GATE | 100% | Workplan, SOP v1.34, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, 188 DB/Redis exporter restore helper, 188 MinIO/Velero restore helper, 188 nginx-exporter restore helper, 110 Docker disk pressure cleanup boundary, MOMO Google Drive token userns readback, MOMO daily freshness blocker, MOMO Pro false-noise health monitor source-of-truth, docker-health direct Telegram fallback cooldown, Bitan public-content same-fingerprint cooldown, notification-noise readback, MOMO source-file absence GO/NO-GO gate, MacBook Pro Codex safe artifact sync readback, and MacBook Pro AwoooGo Gitea SSH / dev workspace readback are updated. Latest deploy marker `622bc372` points runtime image to `2ec7f6f4`; CD `#3294` retains a historical Failure because post-deploy monitoring coverage saw 188 `nginx-exporter` down before recovery, while manual coverage now passes `14/14 UP`. |
Full cold-start service readiness may not be declared green for the latest verified evidence set. As of 2026-06-24 11:35, routes/hosts/K3s/backups/exporters/Velero are available, but the scorecard is `PASS=86 WARN=0 BLOCKED=1` because MOMO business data freshness is stale beyond 3 days and no newer legitimate source file is available. Do not declare DR scorecard complete while credential escrow evidence remains blocked.
Full cold-start service readiness may not be declared green for the latest verified evidence set. As of 2026-06-24 20:17, routes/hosts/K3s/backups/exporters/Velero/monitoring coverage are available, but the latest cold-start scorecard remains `PASS=86 WARN=0 BLOCKED=1` because MOMO business data freshness is stale beyond 3 days and no newer legitimate source file is available. Do not declare DR scorecard complete while credential escrow evidence remains blocked.
2026-06-13 01:26 refresh: full cold-start is again green for the current evidence set. AWOOOI API/Web workload balancing survived the next normal CD deploy: Gitea main `e4a349bc`, ArgoCD revision `e4a349bc`, images from `414413a5`, API/Web split across `mon` / `mon1`, and global `known_hosts` retained 120 / 188 after CD fix `80e6ec1a`. Do not declare DR complete while credential escrow is missing. `km-vectorize` remediation is `90%`: schedule/label fix is live, and the remaining gate is the next official 03:00 CronJob success readback.
@@ -149,6 +149,7 @@ Next: <single next action>
| P1-013 | DONE_FOR_SERVICE_READINESS | 100 | Remediate `km-vectorize` CronJob health debt | The retained `km-vectorize-29689620` failed Job is now classified as stale evidence, not an active blocker, because later official `km-vectorize` Jobs completed successfully. 2026-06-18 13:43 cold-start reads `FAILED_JOBS=1`, `STALE_FAILED_JOBS=1`, `ACTIVE_FAILED_JOBS=0`, `BAD_PODS=0`, and returns `PASS=84 WARN=0 BLOCKED=0`. | Keep retained failed Job as evidence unless an explicit maintenance window authorizes cleanup. Reassert ArgoCD app health only with a fresh ArgoCD app readback, not from the cold-start scorecard alone. | Service readiness no longer warns on stale failed Job evidence; active failed Job detection remains guarded. |
| P1-015 | DONE | 100 | Restore 188 MinIO / Velero backup freshness and DB exporters | 2026-06-24 06:35 resolved real backup / exporter red lights: 188 PostgreSQL exporter and Redis exporter now expose `pg_up=1` / `redis_up=1`; 188 MinIO health is live; 120 Velero BSL is `Available`; one-off backup `reboot-recovery-202606240456` completed; 110 backup-health textfile reports latest Velero backup fresh. 110 disk pressure was reduced from 92% to 73% by Docker image/build-cache cleanup only. | Reconcile MinIO `userns_mode: host` override into formal source-of-truth or data ownership fix; keep Docker volume prune forbidden without explicit owner approval. | `VeleroBackupNotRun`, `PostgreSQLDown`, `RedisDown`, and 110 disk-pressure alerts are resolved, and SOP includes restore helpers. |
| P1-016 | DONE | 100 | Control repeated Telegram notification noise without hiding real alerts | 2026-06-24 confirmed MOMO Pro 5-minute spam came from a legacy 110 script checking `http://192.168.0.188/health`; live script now uses `https://mo.wooo.work/health` as primary truth. Heartbeat warning dedupe now hashes stable actionable fingerprints so HTTP status / timeout / latency drift does not create a new Telegram event every 30 minutes. `MoWoooWorkDown` now labels `component=momo-pro-system`, disables blind auto-repair, and requires public/local/container/data-freshness triage. Generic docker-health monitor keeps 5-minute repair checks but adds a separate 30-minute direct Telegram fallback cooldown. Bitan public-content cleanliness keeps failure notification with same-fingerprint cooldown and one recovery notice. | Fold remaining cross-product direct Telegram egress into the unified notification gateway over time; do not disable real warning/failure/recovery signals. Production deployment/readback must confirm the code and Prometheus rules are live before declaring runtime closure. | Healthy heartbeat is quiet, same actionable heartbeat warning is deduped, MOMO public health success produces no alert, repeated same-failure direct fallback paths are cooled, and real failure/recovery/new-warning notifications remain enabled. |
| P1-017 | DONE | 100 | Restore 188 nginx-exporter and post-CD monitoring coverage | CD `#3294` deployed marker `622bc372` but failed post-deploy checks because `scripts/generate_monitoring.py --check` saw Prometheus job `nginx-exporter` down at `192.168.0.188:9113`. 188 `stub_status` and compose config were healthy, so the correct fix was restoring the stateless exporter from `/home/ollama/nginx-exporter.yml`, not reloading Nginx or restarting products. New helper `scripts/ops/188-nginx-exporter-restore.sh` defaults to read-only `--check` and exposes explicit `--apply` for maintenance-window restore. `high-value-config-change-gate.py` now classifies `scripts/ops/**/*exporter*` as `monitoring_alerting_observability` P1 / C1. | Keep this check in post-reboot and post-CD recovery. Do not mark historical CD `#3294` as success; use the next CD run plus monitoring coverage as future proof. | `bash scripts/ops/188-nginx-exporter-restore.sh --check` reports `nginx_up 1`; `python3 scripts/generate_monitoring.py --check --stabilization-sleep-seconds 0` reports `Jobs=14`, `全部 UP=14`, `真實問題=0`, coverage `100.0%`; high-value gate matches the helper as P1 / C1, not unmanaged. |
---
@@ -178,7 +179,7 @@ Next: <single next action>
| P3-005 | DONE | 100 | Update cold-start SOP | SOP now includes start, shutdown, reboot, record, comparison, and 120 blocker handling. | Increment SOP version after each process change. | SOP has controlled power-operation sections and ledger template. |
| P3-006 | DONE | 100 | Update backup status | Backup status now reflects current cron, rclone latest-only, failure-only alert posture, and escrow blocker. | Refresh after 120 backup rerun. | Backup status no longer claims noisy success Telegram notifications. |
| P3-007 | DONE | 100 | Harden Gitea backup stale dump handling | 2026-06-05 manual Gitea backup failed because the container retained `/tmp/gitea-dump.zip` from the 02:00 failure. `scripts/backup/backup-gitea.sh` now renames stale container dump files to timestamped evidence before running a new dump, and the live 110 script is updated. | Watch the next 02:00 Gitea backup. | `bash -n` passes locally and on 110; manual Gitea backup completed after stale evidence rename. |
| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.32 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, stale-vs-active K8s failed Job classification, post-reboot / post-CD recovery anchors, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, 110 runaway browser / CI load diversion PlayBook, healthy-heartbeat suppression, 188 node-exporter restore, 188 DB/Redis exporter restore, 188 MinIO/Velero restore, 110 Docker disk cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, post-reboot notification noise gates, and MOMO source-file absence decision gate. | Use v1.32 for the next reboot record, then compare actual timing, Plan B trigger, degraded level, failed/stale/active Job counters, runaway-process metrics, CI load attribution, MOMO source availability, data freshness, Velero freshness, exporter scrape, disk usage, notification-noise state, and blockers against §1.4 plus §11.1 / §14.8 through §14.31. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow / runaway-process / notification-noise / MOMO source-file checks. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, and `B5_DR_COMPLETE`; live cold-start now returns `PASS=86 WARN=0 BLOCKED=1` when MOMO data freshness is stale because source file is absent, and repeated healthy/same-failure notification noise is controlled without hiding real alerts. |
| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.34 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, stale-vs-active K8s failed Job classification, post-reboot / post-CD recovery anchors, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, 110 runaway browser / CI load diversion PlayBook, healthy-heartbeat suppression, 188 node-exporter restore, 188 DB/Redis exporter restore, 188 MinIO/Velero restore, 188 nginx-exporter restore, 110 Docker disk cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, post-reboot notification noise gates, MOMO source-file absence decision gate, and CD monitoring coverage target-down classification. | Use v1.34 for the next reboot record, then compare actual timing, Plan B trigger, degraded level, failed/stale/active Job counters, runaway-process metrics, CI load attribution, MOMO source availability, data freshness, Velero freshness, exporter scrape, disk usage, notification-noise state, monitoring coverage, and blockers against §1.4 plus §11.1 / §14.8 through §14.32. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow / runaway-process / notification-noise / MOMO source-file / monitoring coverage checks. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, and `B5_DR_COMPLETE`; live cold-start now returns `PASS=86 WARN=0 BLOCKED=1` when MOMO data freshness is stale because source file is absent, repeated healthy/same-failure notification noise is controlled without hiding real alerts, and monitoring coverage target-down is routed through exporter restore before any product restart. |
| P3-009 | DONE | 100 | Assess 120/121 AA/AS role and host load balancing | 2026-06-12 15:19 live check confirms 120 and 121 are both `Ready control-plane`, `k3s active`, `k3s-agent inactive`, with no taints; however most AWOOOI / ArgoCD / Velero workload remains on 121 after 120 fsck recovery. New assessment defines control-plane AA vs workload AA, migration candidates from 110/188, and stateful migration blockers. | After P0 backup/offsite/cold-start green, implement topology spread for AWOOOI API/Web before moving additional services. | `docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md` exists; SOP v1.6 links AA/AS and load-balancing checks; migration implementation remains explicitly `0%`. |
| P3-010 | DONE | 100 | Update workload balancing docs with 2026-06-13 live truth | Host role assessment, workplan, SOP, backup status, and LOGBOOK are refreshed with current cold-start, backup, 188 certbot degraded, ArgoCD `km-vectorize` degraded, Gitea main `acaae999`, ArgoCD sync, and final pod placement evidence. | Keep updating this file after the next reboot or deploy. | Docs separate service-green status from DR escrow, workload rollout, and non-service governance debt. |
| P3-011 | DONE | 100 | Record `km-vectorize` remediation status | LOGBOOK, this workplan, and SOP now state the schedule/label fix, ArgoCD sync evidence, the invalid manual Job boundary, and the 90% waiting-for-next-schedule gate. | After next 03:00 run, update this row and the top verdict with `lastSuccessfulTime` / ArgoCD health evidence. | No document claims ArgoCD green before official CronJob success evidence exists. |

@@ -0,0 +1,49 @@
#!/usr/bin/env bash
# Check (default) or restore the stateless nginx-exporter container on host 188.
set -euo pipefail

# Every target can be overridden through the environment.
REMOTE="${NGINX_EXPORTER_REMOTE:-ollama@192.168.0.188}"
COMPOSE_FILE="${NGINX_EXPORTER_COMPOSE_FILE:-/home/ollama/nginx-exporter.yml}"
STUB_STATUS_URL="${NGINX_EXPORTER_STUB_STATUS_URL:-http://127.0.0.1:8080/nginx_status}"
METRICS_URL="${NGINX_EXPORTER_METRICS_URL:-http://127.0.0.1:9113/metrics}"

# Default to the read-only check; only an explicit --apply changes container state.
MODE="check"
if [[ "${1:-}" == "--apply" ]]; then
  MODE="apply"
elif [[ "${1:-}" != "" && "${1:-}" != "--check" ]]; then
  echo "usage: $0 [--check|--apply]" >&2
  exit 2
fi

# Executed on the remote host; positional arguments arrive via "bash -s --".
remote_script='
set -euo pipefail
mode="$1"
compose_file="$2"
stub_status_url="$3"
metrics_url="$4"

# Abort early if Nginx stub_status is unreachable.
echo "== nginx stub_status =="
curl -fsS --max-time 5 "$stub_status_url" | head -5

# The compose file must exist, be non-empty, and parse.
echo "== exporter compose =="
test -s "$compose_file"
docker compose -f "$compose_file" config >/dev/null

echo "== exporter state before =="
docker ps -a --filter name=nginx-exporter --format "{{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}" || true

if [[ "$mode" == "apply" ]]; then
  echo "== restore nginx-exporter =="
  docker compose -f "$compose_file" up -d
  sleep 3
fi

# With pipefail, a failed scrape or missing metrics fails the whole run.
echo "== exporter metrics =="
curl -fsS --max-time 5 "$metrics_url" | grep -E "^(nginx_up|nginx_connections_active|nginx_http_requests_total)" | head -10

echo "== exporter state after =="
docker ps --filter name=nginx-exporter --format "{{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}"
'

# Feed the script on stdin so nothing is written to the remote filesystem.
ssh -o BatchMode=yes -o ConnectTimeout=8 "$REMOTE" \
  "bash -s" -- "$MODE" "$COMPOSE_FILE" "$STUB_STATUS_URL" "$METRICS_URL" \
  <<<"$remote_script"

@@ -269,6 +269,7 @@ CATEGORIES = [
"ops/sentry-self-hosted/**",
"infra/langfuse/**",
"k8s/monitoring/**",
"scripts/ops/**/*exporter*",
),
required_validation=(
"rule_diff",