fix(ops): reduce post-reboot notification noise

2026-06-24 06:52:47 +08:00
parent 95f442adab
commit 35a3a59839
6 changed files with 249 additions and 10 deletions
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -1,3 +1,36 @@
+## 2026-06-24｜Telegram alert noise reduction and post-reboot notification gate
+
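+**Background**: After services recovered, Telegram still showed three kinds of noise: the AWOOOI healthy heartbeat flooding the channel every 30 minutes, the legacy MOMO Pro monitor reporting `HTTP 502` every 5 minutes, and the Bitan public content cleanliness failure, which had been re-sent every 30 minutes. These do not share one path: the AWOOOI heartbeat was already quieted in production code, while the MOMO and Bitan noise comes from cron scripts on 110. The fix principle is "stay quiet on normal success, de-duplicate the same failure, keep warning / failure / recovery", and real red alerts must not be silenced.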
+
+**MOMO Pro monitor**:
+- Root cause: the legacy `/home/wooo/scripts/docker_health_monitor.sh` on 110 used `http://192.168.0.188/health` as its primary check and kept getting `HTTP 502` during reboot recovery, so it reported a MOMO Pro service failure every 5 minutes even after public `https://mo.wooo.work/health` had recovered.
+- Added repo source-of-truth: `scripts/ops/momo-pro-health-monitor.sh`.
+- Live `/home/wooo/scripts/docker_health_monitor.sh` is now synced to check the public route first: `https://mo.wooo.work/health` is the primary truth; the 188 local `127.0.0.1:5003/health` and container state serve only as secondary evidence.
+- Live hash: `d7a6bc75549efa10176c42e6f9082c90b9856dbcbb335aba4a4fa4abb754eaef`.
+- 110 has `/home/wooo/awoooi-ops/notify-awoooi-ops.sh` deployed, hash `12bf9ae124c56bb7f31be15ebeb501671b0686d695492bc3fa1d9abb5b683b67`; the repo MOMO monitor goes through the AWOOOI Alertmanager wrapper, and `telegram-notification-egress-no-new-bypass-guard.py` stays at `new=0`.
+- Manual readback: `OK: public health 200; no alert`.
+
+**Docker health monitor fallback**:
+- `scripts/ops/docker-health-monitor.sh` keeps `ACTION_COOLDOWN_SECONDS=300`, so the automatic repair scan frequency is not reduced.
+- Added `NOTIFY_COOLDOWN_SECONDS=1800` and `TELEGRAM_COOLDOWN` log lines, applied only to the direct Telegram fallback used when the AWOOOI API is unreachable, so the same container/action is not sent directly every 5 minutes while the API path is down.
+- Live `/home/wooo/awoooi-ops/docker-health-monitor.sh` hash: `41d64f29048868c8e4c089132d299c8ee0e2b50ab2c513158d6d45cf92ea38e3`.
+
+**Bitan public content cleanliness check**:
+- Live `/home/wooo/apps/bitan-pharmacy-release/scripts/run-public-content-cleanliness-check.sh` now uses `public-content-cleanliness.notify.state`.
+- The same failure fingerprint is cooled for `21600s` (6 hours); a fail-to-pass transition sends exactly one recovery notice; the pass state sends no failure notification.
+- Live hash: `294ec7f75448c86688b8afc408c785efe4cf173d468ad0d82228ba638d3de2dc`.
+- Manual no-notify readback: DB, public APIs, products/news pages, and content contract all PASS.
+- The Bitan local repo currently has many pre-existing dirty / untracked changes, so this round only syncs the live hotfix; Bitan source-control reconciliation needs separate follow-up, and the whole dirty tree is not mixed into the AWOOOI commit.
+
+**SOP / workplan**:
+- `docs/runbooks/FULL-STACK-COLD-START-SOP.md` is bumped to v1.31 and adds §14.30 notification noise closure.
+- `docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md` adds P1-016 and updates P3 docs / automation contracts to `DONE_WITH_NOTIFICATION_NOISE_GATE`.
+
+**Boundaries**:
+- This does not silence real alerts; `ColdStartRecoveryBlocked`, `MOMO_DAILY_FRESHNESS`, `BackupCredentialEscrowEvidenceMissing`, and backup/exporter/down red alerts must still fire.
+- Full-stack green still cannot be declared: MOMO business data freshness is still stuck at `2026-06-17`, and the latest cold-start is still blocked by `MOMO_DAILY_FRESHNESS 6|2026-06-17`.
+- DR still cannot be declared complete: credential escrow evidence missing stays at `5`; markers must not be forged, and secret values must not be put into the repo or chat.
+
 ## 2026-06-24｜188 MinIO / Velero, DB exporter, and 110 disk pressure recovery

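 **Background**: The 02:44 cold-start had shown that hosts, K3s, public routes, and MOMO DB parity were mostly recovered, but Alertmanager later still exposed several real red alerts: 188 PostgreSQL / Redis exporters down, 110 disk pressure, and expired Velero backup freshness. None of these can be handled by silencing; the monitoring and backup paths must be restored.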
--- a/docs/runbooks/FULL-STACK-COLD-START-SOP.md
+++ b/docs/runbooks/FULL-STACK-COLD-START-SOP.md
@@ -1,6 +1,6 @@
 # AWOOOI Full-Stack Cold Start and Host Reboot SOP

-> Version: v1.30
+> Version: v1.31
 > Last updated: 2026-06-24 Asia/Taipei
 > Scope: 110 / 120 / 121 / 188 full-stack reboot recovery. 112 Kali is recorded as P3 optional and is not part of this recovery path.

@@ -19,6 +19,7 @@ Service state: SERVICE_AVAILABLE_MOMO_SOURCE_BLOCKED_DR_ESCROW_BLOCKED; 110/120/
 MOMO state: current-month daily_sales_snapshot and realtime_sales_monthly match, but both stop at 2026-06-17. MOMO_DAILY_FRESHNESS is 6 days, which is a hard blocker because business data is not current.
 Google Drive state: momo scheduler token ownership is fixed for Docker userns, Drive listing works, but folder 當日業績匯入 currently has no matching 即時業績_當日 Excel source file. Archive latest matching file is 2026-06-18 and was already imported.
 Backup / monitoring state: 188 MinIO is healthy, Velero BackupStorageLocation default is Available, one-off backup reboot-recovery-202606240456 completed, backup-health textfile reports Velero freshness green, and VeleroBackupNotRun / PostgreSQLDown / RedisDown / disk-pressure alerts resolved.
+Notification-noise state: healthy AWOOOI heartbeat is suppressed; MOMO Pro monitor now uses https://mo.wooo.work/health as primary truth and no longer treats http://192.168.0.188/health as the primary check; docker-health-monitor keeps 5-minute repair cadence but has a separate 30-minute Telegram fallback cooldown; Bitan public-content check keeps failure alerting with same-fingerprint cooldown and one recovery notice.
 Allowed declaration: core hosts, routes, K3s, backup/exporter surfaces are recovered; MOMO data pipeline is blocked waiting for a newer source file or owner-provided source evidence.
 Forbidden declaration: full-stack green, MOMO data current, DR complete, or runtime/security acceptance. Credential escrow evidence is still missing and must not be forged.
 ```
@@ -1803,6 +1804,42 @@ Forbidden without explicit owner approval: `docker volume prune`, deleting datab
 Done gate: filesystem use below 85%, no active disk-pressure alerts, and no service regression in cold-start scorecard.
 ```

+### 14.30 2026-06-24 notification noise closure after reboot recovery
+
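+The fifth change set on 2026-06-24 brings "services have recovered, but a legacy monitoring path or success heartbeat keeps flooding Telegram" into the reboot SOP. This is not silencing: failures, warnings, data freshness, and backup / exporter / escrow red alerts must still fire. The goal is to stop the same known failure from being re-sent every 5 or 30 minutes and to keep normal success heartbeats from filling the war room.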
+
+| Item | 2026-06-24 notification baseline |
+|------|-----------------------------------|
+| SOP version | `v1.31` |
+| AWOOOI healthy heartbeat | Production `a84a5a0b`: when healthy with no warnings, only Redis/log is updated and no Telegram is sent; a warning change is sent, and a warning returning to healthy sends one recovery notice |
+| MOMO false-noise root cause | The legacy `/home/wooo/scripts/docker_health_monitor.sh` on 110 hit `http://192.168.0.188/health` and kept getting `HTTP 502` during the reboot, producing a MOMO Pro alert every 5 minutes |
+| MOMO monitor source-of-truth | Added `scripts/ops/momo-pro-health-monitor.sh`; primary truth is `https://mo.wooo.work/health`, while the 188 local `127.0.0.1:5003/health` and container state serve only as secondary evidence |
+| MOMO live readback | `/home/wooo/scripts/docker_health_monitor.sh` hash `d7a6bc75549efa10176c42e6f9082c90b9856dbcbb335aba4a4fa4abb754eaef`; manual run returned `OK: public health 200; no alert` |
+| AWOOOI ops notify wrapper | `/home/wooo/awoooi-ops/notify-awoooi-ops.sh` hash `12bf9ae124c56bb7f31be15ebeb501671b0686d695492bc3fa1d9abb5b683b67`; repo MOMO monitor uses this wrapper instead of adding a new Telegram Bot API direct send |
+| Docker monitor fallback | `scripts/ops/docker-health-monitor.sh` keeps `ACTION_COOLDOWN_SECONDS=300` for repair cadence but adds `NOTIFY_COOLDOWN_SECONDS=1800` for direct Telegram fallback when AWOOOI API cannot receive the event |
+| Docker monitor live readback | `/home/wooo/awoooi-ops/docker-health-monitor.sh` hash `41d64f29048868c8e4c089132d299c8ee0e2b50ab2c513158d6d45cf92ea38e3` and exposes `TELEGRAM_COOLDOWN` lines for repeated fallback suppression |
+| Bitan public-content check | Live `/home/wooo/apps/bitan-pharmacy-release/scripts/run-public-content-cleanliness-check.sh` now writes `public-content-cleanliness.notify.state`, suppresses same failure fingerprint for `21600s`, and sends one recovery notice after a failed state becomes pass |
+| Bitan live readback | Script hash `294ec7f75448c86688b8afc408c785efe4cf173d468ad0d82228ba638d3de2dc`; manual no-notify run returned PASS for DB, public APIs, products/news pages, and content contract |
+| Declaration limit | Allowed: repeated healthy / same-failure notification noise is controlled for these paths. Not allowed: claiming all product alerts migrated to the unified notification gateway or that any real failure alert was disabled |
+
+Post-reboot notification gate:
+
+```bash
+ssh wooo@192.168.0.110 '/home/wooo/scripts/docker_health_monitor.sh'
+ssh wooo@192.168.0.110 'tail -n 120 /home/wooo/logs/docker_health.log'
+ssh wooo@192.168.0.110 'tail -n 120 /home/wooo/awoooi-ops/monitor.log'
+ssh wooo@192.168.0.110 'tail -n 120 /home/wooo/apps/bitan-pharmacy-release/logs/public-content-cleanliness-check.cron.log'
+```
+
+Done gate:
+
+```text
+MOMO monitor: public health 200 -> no Telegram.
+AWOOOI heartbeat: healthy + no warnings -> suppressed; warning/recovery still send.
+Generic docker-health monitor: API 200/202 path is primary; direct Telegram fallback is fingerprint-cooled.
+Bitan public content: pass -> no failure Telegram; repeated same failure -> cooled; recovery after prior failure -> one notice.
+```
+
 ### 14.22 Post-reboot timeline verification

 Advance along the timeline after every reboot; do not wait until the end to make a single judgment.
--- a/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
+++ b/docs/workplans/2026-06-04-reboot-cold-start-backup-recovery-workplan.md
@@ -15,7 +15,7 @@
 | P0 host / K3s recovery | DONE | 100% | 120 booted after console fsck at `2026-06-12 15:13`; latest 2026-06-14 18:15 readback shows 120 is reachable, K3s is active, `mon` and `mon1` are both `Ready control-plane`, and cold-start P0/P1 checks are green. |
 | P1 backup / alert / escrow | BLOCKED_DR_ESCROW | 96% | 2026-06-24 06:35 backup / alert readback shows 110 `13/13 fresh failed=0`, 188 `2/2 fresh failed=0`, `core_blockers=0`, `escrow_missing=5`。188 `node-exporter` textfile scrape、PostgreSQL exporter、Redis exporter、MinIO endpoint、Velero BSL and latest completed backup freshness are restored; `BackupHealthMonitorMissing188`、`PostgreSQLDown`、`RedisDown`、`VeleroBackupNotRun` and 110 disk-pressure alerts resolved. DR remains blocked on real non-secret credential escrow evidence IDs. |
 | P2 service / data truth | BLOCKED_MOMO_DATA_FRESHNESS | 96% | Public route/TLS, API/Web route, momo health, current-month parity `10936|10936|2026-06-01|2026-06-17|2026-06-01|2026-06-17`, backup exporters, schedules, K3s node readiness/storage conditions, VIP, and 110 / 188 runtime health are green. However MOMO latest business date is `2026-06-17`; stale age is `6` days. Drive pending folder has `0` matching files and archive latest is the already-imported 2026-06-18 file, so there is no safe newer source to import. |
-| P3 docs / automation contracts | DONE_WITH_VELERO_AND_EXPORTER_RECOVERY_GATE | 100% | Workplan, SOP v1.30, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, 188 DB/Redis exporter restore helper, 188 MinIO/Velero restore helper, 110 Docker disk pressure cleanup boundary, MOMO Google Drive token userns readback, MOMO daily freshness blocker, and 2026-06-24 06:35 live readback are updated. Production image `a84a5a0b` is live with API `2/2`, Web `2/2`, Worker `1/1`; CD `#3289` is a known false-negative caused by worker startup / rollout timeout after deploy marker `4a7b5329`. |
+| P3 docs / automation contracts | DONE_WITH_NOTIFICATION_NOISE_GATE | 100% | Workplan, SOP v1.31, BACKUP-STATUS, LOGBOOK, 120 console/fsck recovery, Gitea backup stale-dump hardening, reboot ledger/version-comparison SOP, escrow evidence audit, 188 nginx Ansible baseline, 110 cold-start detector script, startup judgment layers, GO/NO-GO tree, host recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline checks, host role / load-balancing assessment, CD `known_hosts` guardrail, `fwupd-refresh.timer` rollback note, K3s filesystem event blocker, AWOOOI backup no-direct-offsite-sync contract, 110/188 Ansible source-of-truth, Gitea self-hosted readiness validation workflow, post-CD no-regression readbacks, stale-vs-active K8s failed Job classification, 110 runaway browser / CI load AIOps exporter + alert + gated remediation PlayBook, Telegram / AI event packet mapping, healthy heartbeat Telegram suppression, MOMO scheduler / current-month detector fix, 188 node-exporter restore helper, 188 DB/Redis exporter restore helper, 188 MinIO/Velero restore helper, 110 Docker disk pressure cleanup boundary, MOMO Google Drive token userns readback, MOMO daily freshness blocker, MOMO Pro false-noise health monitor source-of-truth, docker-health direct Telegram fallback cooldown, Bitan public-content same-fingerprint cooldown, and 2026-06-24 notification-noise readback are updated. Production image `a84a5a0b` is live with API `2/2`, Web `2/2`, Worker `1/1`; CD `#3289` is a known false-negative caused by worker startup / rollout timeout after deploy marker `4a7b5329`. |

 Full cold-start service readiness may not be declared green for the latest verified evidence set. As of 2026-06-24 06:35, routes/hosts/K3s/backups/exporters/Velero are available, but the scorecard is `PASS=86 WARN=0 BLOCKED=1` because MOMO business data freshness is stale beyond 3 days. Do not declare DR scorecard complete while credential escrow evidence remains blocked.

@@ -148,6 +148,7 @@ Next: <single next action>
 | P1-014 | DONE | 100 | Publish credential escrow owner request package | 2026-06-13 13:10 live report confirms `SCRIPT_MISSING_COUNT=0`, `OFFSITE_CONFIGURED=1`, `RCLONE_CONFIGURED=1`, `ESCROW_MISSING_COUNT=5`, `PASS=8 WARN=5 BLOCKED=0`. New owner request package defines allowed evidence-id types, forbidden secret values, safe dry-run flow, write flow, and closeout gates. | Dispatch to the credential owners without collecting secret values; keep marker write gated until owner gives real non-secret evidence IDs. | `docs/security/CREDENTIAL-ESCROW-EVIDENCE-OWNER-REQUEST.md` and snapshot exist and validate. |
 | P1-013 | DONE_FOR_SERVICE_READINESS | 100 | Remediate `km-vectorize` CronJob health debt | The retained `km-vectorize-29689620` failed Job is now classified as stale evidence, not an active blocker, because later official `km-vectorize` Jobs completed successfully. 2026-06-18 13:43 cold-start reads `FAILED_JOBS=1`, `STALE_FAILED_JOBS=1`, `ACTIVE_FAILED_JOBS=0`, `BAD_PODS=0`, and returns `PASS=84 WARN=0 BLOCKED=0`. | Keep retained failed Job as evidence unless an explicit maintenance window authorizes cleanup. Reassert ArgoCD app health only with a fresh ArgoCD app readback, not from the cold-start scorecard alone. | Service readiness no longer warns on stale failed Job evidence; active failed Job detection remains guarded. |
 | P1-015 | DONE | 100 | Restore 188 MinIO / Velero backup freshness and DB exporters | 2026-06-24 06:35 resolved real backup / exporter red lights: 188 PostgreSQL exporter and Redis exporter now expose `pg_up=1` / `redis_up=1`; 188 MinIO health is live; 120 Velero BSL is `Available`; one-off backup `reboot-recovery-202606240456` completed; 110 backup-health textfile reports latest Velero backup fresh. 110 disk pressure was reduced from 92% to 73% by Docker image/build-cache cleanup only. | Reconcile MinIO `userns_mode: host` override into formal source-of-truth or data ownership fix; keep Docker volume prune forbidden without explicit owner approval. | `VeleroBackupNotRun`、`PostgreSQLDown`、`RedisDown`、110 disk-pressure alerts are resolved, and SOP includes restore helpers. |
+| P1-016 | DONE | 100 | Control repeated Telegram notification noise without hiding real alerts | 2026-06-24 confirmed MOMO Pro 5-minute spam came from a legacy 110 script checking `http://192.168.0.188/health`; live script now uses `https://mo.wooo.work/health` as primary truth and manual readback returned `OK: public health 200; no alert`. Generic docker-health monitor keeps 5-minute repair checks but adds a separate 30-minute direct Telegram fallback cooldown. Bitan public-content cleanliness keeps failure notification but suppresses the same failure fingerprint for 6 hours and emits one recovery notice. | Fold remaining cross-product direct Telegram egress into the unified notification gateway over time; do not disable real warning/failure/recovery signals. | Healthy heartbeat is quiet, MOMO public health success produces no alert, repeated same-failure direct fallback paths are cooled, and real failure/recovery notifications remain enabled. |

 ---

@@ -176,7 +177,7 @@ Next: <single next action>
 | P3-005 | DONE | 100 | Update cold-start SOP | SOP now includes start, shutdown, reboot, record, comparison, and 120 blocker handling. | Increment SOP version after each process change. | SOP has controlled power-operation sections and ledger template. |
 | P3-006 | DONE | 100 | Update backup status | Backup status now reflects current cron, rclone latest-only, failure-only alert posture, and escrow blocker. | Refresh after 120 backup rerun. | Backup status no longer claims noisy success Telegram notifications. |
 | P3-007 | DONE | 100 | Harden Gitea backup stale dump handling | 2026-06-05 manual Gitea backup failed because the container retained `/tmp/gitea-dump.zip` from the 02:00 failure. `scripts/backup/backup-gitea.sh` now renames stale container dump files to timestamped evidence before running a new dump, and the live 110 script is updated. | Watch the next 02:00 Gitea backup. | `bash -n` passes locally and on 110; manual Gitea backup completed after stale evidence rename. |
-| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.30 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, stale-vs-active K8s failed Job classification, post-reboot / post-CD recovery anchors, AA/AS 判定, workload 分散判定, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, 110 runaway browser / CI load 分流 PlayBook, healthy-heartbeat suppression, 188 node-exporter restore, 188 DB/Redis exporter restore, 188 MinIO/Velero restore, 110 Docker disk cleanup boundary, MOMO Google Drive token userns readback, and MOMO data freshness hard blocker. | Use v1.30 for the next reboot record, then compare actual timing, Plan B trigger, degraded level, failed/stale/active Job counters, runaway-process metrics, CI load attribution, MOMO source availability, data freshness, Velero freshness, exporter scrape, disk usage, and blockers against §1.4 plus §11.1 / §14.8 through §14.29. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow / runaway-process checks. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, and `B5_DR_COMPLETE`; live cold-start now returns `PASS=86 WARN=0 BLOCKED=1` when MOMO data freshness is stale, preventing false green. |
+| P3-008 | DONE | 100 | Continuously optimize host reboot SOP | SOP v1.31 adds startup judgment layers, GO/NO-GO decision tree, freeze execution checklist, host boot detection, 110/188/120/121 recovery cards, explicit Plan B degraded-operation path, machine-readable `plan_b` baseline, readiness-audit Plan B guard, B0-B5 service levels, T+0/T+120 fallback timeline, K3s filesystem event blocker, stale-vs-active K8s failed Job classification, post-reboot / post-CD recovery anchors, AA/AS determination, workload distribution determination, CD SSH trust guardrail, CronJob failure evidence retention rule, `fwupd-refresh.timer` rollback note, 110 runaway browser / CI load diversion PlayBook, healthy-heartbeat suppression, 188 node-exporter restore, 188 DB/Redis exporter restore, 188 MinIO/Velero restore, 110 Docker disk cleanup boundary, MOMO Google Drive token userns readback, MOMO data freshness hard blocker, and post-reboot notification noise gates. | Use v1.31 for the next reboot record, then compare actual timing, Plan B trigger, degraded level, failed/stale/active Job counters, runaway-process metrics, CI load attribution, MOMO source availability, data freshness, Velero freshness, exporter scrape, disk usage, notification-noise state, and blockers against §1.4 plus §11.1 / §14.8 through §14.30. Before any real reboot, rerun same-day live cold-start / backup / offsite / alert / escrow / runaway-process / notification-noise checks. | SOP distinguishes `HOST_BOOTED`, `HOST_READY`, `SERVICE_READY`, `FULL_STACK_GREEN`, `K3S_CONTROL_PLANE_AA`, `WORKLOAD_BALANCED`, `B0_ABORTED_BEFORE_REBOOT`, `B1_HOST_RECOVERY_ONLY`, `B2_CORE_SERVICE_READY`, `B3_SERVICE_AVAILABLE_DEGRADED`, `B4_FULL_STACK_GREEN`, and `B5_DR_COMPLETE`; live cold-start now returns `PASS=86 WARN=0 BLOCKED=1` when MOMO data freshness is stale, and repeated healthy/same-failure notification noise is controlled without hiding real alerts. |
 | P3-009 | DONE | 100 | Assess 120/121 AA/AS role and host load balancing | 2026-06-12 15:19 live check confirms 120 and 121 are both `Ready control-plane`, `k3s active`, `k3s-agent inactive`, with no taints; however most AWOOOI / ArgoCD / Velero workload remains on 121 after 120 fsck recovery. New assessment defines control-plane AA vs workload AA, migration candidates from 110/188, and stateful migration blockers. | After P0 backup/offsite/cold-start green, implement topology spread for AWOOOI API/Web before moving additional services. | `docs/runbooks/HOST-ROLE-LOAD-BALANCING-ASSESSMENT.md` exists; SOP v1.6 links AA/AS and load-balancing checks; migration implementation remains explicitly `0%`. |
 | P3-010 | DONE | 100 | Update workload balancing docs with 2026-06-13 live truth | Host role assessment, workplan, SOP, backup status, and LOGBOOK are refreshed with current cold-start, backup, 188 certbot degraded, ArgoCD `km-vectorize` degraded, Gitea main `acaae999`, ArgoCD sync, and final pod placement evidence. | Keep updating this file after the next reboot or deploy. | Docs separate service-green status from DR escrow, workload rollout, and non-service governance debt. |
 | P3-011 | DONE | 100 | Record `km-vectorize` remediation status | LOGBOOK, this workplan, and SOP now state the schedule/label fix, ArgoCD sync evidence, the invalid manual Job boundary, and the 90% waiting-for-next-schedule gate. | After next 03:00 run, update this row and the top verdict with `lastSuccessfulTime` / ArgoCD health evidence. | No document claims ArgoCD green before official CronJob success evidence exists. |
--- a/scripts/ops/deploy-docker-health-monitor.sh
+++ b/scripts/ops/deploy-docker-health-monitor.sh
@@ -88,6 +88,8 @@ AWOOOI_API_URL=https://awoooi.wooo.work
 TELEGRAM_BOT_TOKEN=CHANGE_ME
 SRE_GROUP_CHAT_ID=-1003711974679
 SEND_COOLDOWN_SECONDS=300
+ACTION_COOLDOWN_SECONDS=300
+NOTIFY_COOLDOWN_SECONDS=1800
 SECRETS_TEMPLATE
        echo '  ⚠️  請填寫 /etc/awoooi-ops/secrets.env.template 後重命名為 secrets.env'
    else
--- a/scripts/ops/docker-health-monitor.sh
+++ b/scripts/ops/docker-health-monitor.sh
@@ -25,12 +25,16 @@ fi
 : "${AWOOOI_API_URL:=https://awoooi.wooo.work}"
 : "${TELEGRAM_BOT_TOKEN:=}"
 : "${SRE_GROUP_CHAT_ID:=-1003711974679}"
+: "${TELEGRAM_ALERT_CHAT_ID:=${SRE_GROUP_CHAT_ID:-${TELEGRAM_CHAT_ID:-}}}"
 : "${LOG_FILE:=/var/log/docker-health-monitor.log}"
 : "${SEND_COOLDOWN_SECONDS:=300}"
+: "${ACTION_COOLDOWN_SECONDS:=${SEND_COOLDOWN_SECONDS}}"
+: "${NOTIFY_COOLDOWN_SECONDS:=1800}"
 : "${COOLDOWN_DIR:=/tmp/docker-health-monitor-cooldown}"
-: "${EXCLUDE_CONTAINERS:=signoz-telemetrystore-migrator,signoz-clickhouse,signoz-init-clickhouse,gitea-runner}"
+: "${NOTIFY_COOLDOWN_DIR:=${COOLDOWN_DIR}/notify}"
+: "${EXCLUDE_CONTAINERS:=signoz-telemetrystore-migrator,signoz-clickhouse,signoz-init-clickhouse,gitea-runner,vtuber-web,vtuber-admin,vtuber-api,vtuber-db,bitan-pharmacy-bitan-1}"

-mkdir -p "$COOLDOWN_DIR"
+mkdir -p "$COOLDOWN_DIR" "$NOTIFY_COOLDOWN_DIR"

 # ─── 禁止自動重啟的容器 (模式匹配) ─────────────────────────────────────────
 # DB / Cache / 監控棧核心 — 僅告警，不自動重啟
@@ -57,8 +61,8 @@ is_in_cooldown() {
        last_sent=$(cat "$cooldown_file")
        now=$(date +%s)
        elapsed=$(( now - last_sent ))
-        if (( elapsed < SEND_COOLDOWN_SECONDS )); then
-            log "COOLDOWN: ${container} 距上次處理 ${elapsed}s，跳過（冷卻 ${SEND_COOLDOWN_SECONDS}s）"
+        if (( elapsed < ACTION_COOLDOWN_SECONDS )); then
+            log "COOLDOWN: ${container} 距上次處理 ${elapsed}s，跳過（處理冷卻 ${ACTION_COOLDOWN_SECONDS}s）"
            return 0
        fi
    fi
@@ -70,6 +74,32 @@ set_cooldown() {
    date +%s > "${COOLDOWN_DIR}/${container}.cooldown"
 }

+safe_cooldown_key() {
+    tr -c 'A-Za-z0-9_.-' '_' <<< "$1"
+}
+
+should_send_direct_telegram() {
+    local fingerprint="$1"
+    local cooldown_key
+    local cooldown_file
+    cooldown_key="$(safe_cooldown_key "$fingerprint")"
+    cooldown_file="${NOTIFY_COOLDOWN_DIR}/${cooldown_key}.cooldown"
+
+    if [[ -f "$cooldown_file" ]]; then
+        local last_sent now elapsed
+        last_sent=$(cat "$cooldown_file" 2>/dev/null || echo 0)
+        now=$(date +%s)
+        elapsed=$(( now - last_sent ))
+        if (( elapsed < NOTIFY_COOLDOWN_SECONDS )); then
+            log "TELEGRAM_COOLDOWN: ${fingerprint} ${elapsed}s/${NOTIFY_COOLDOWN_SECONDS}s，跳過直發"
+            return 1
+        fi
+    fi
+
+    date +%s > "$cooldown_file"
+    return 0
+}
+
 # 判斷容器是否符合模式清單
 matches_pattern() {
    local name="$1"
@@ -86,10 +116,10 @@ matches_pattern() {
 # ─── Telegram 直發 Fallback ──────────────────────────────────────────────────
 send_telegram_direct() {
    local message="$1"
-    [[ -z "$TELEGRAM_BOT_TOKEN" || -z "$SRE_GROUP_CHAT_ID" ]] && return 0
+    [[ -z "$TELEGRAM_BOT_TOKEN" || -z "$TELEGRAM_ALERT_CHAT_ID" ]] && return 0
    curl -s -X POST "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage" \
        -H "Content-Type: application/json" \
-        -d "{\"chat_id\":\"${SRE_GROUP_CHAT_ID}\",\"text\":\"${message}\",\"parse_mode\":\"HTML\"}" \
+        -d "{\"chat_id\":\"${TELEGRAM_ALERT_CHAT_ID}\",\"text\":\"${message}\",\"parse_mode\":\"HTML\"}" \
        > /dev/null 2>&1 || true
 }

@@ -152,7 +182,9 @@ JSON
        local emoji="🔧"
        [[ "$repair_result" == "failed" ]] && emoji="❌"
        [[ "$repair_action" == "alert_only" ]] && emoji="⚠️"
-        send_telegram_direct "${emoji} [docker-health-monitor]&#10;主機: ${hostname}&#10;容器: ${container}&#10;狀態: ${detected_status}&#10;修復: ${repair_action} → ${repair_result}&#10;(API 不可達)"
+        if should_send_direct_telegram "${hostname}:${container}:${detected_status}:${repair_action}:${repair_result}:${http_code}"; then
+            send_telegram_direct "${emoji} [docker-health-monitor]&#10;主機: ${hostname}&#10;容器: ${container}&#10;狀態: ${detected_status}&#10;修復: ${repair_action} → ${repair_result}&#10;(API 不可達)"
+        fi
    fi
 }

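A note on the fingerprint key used by `should_send_direct_telegram` above: `safe_cooldown_key` feeds `tr` through a here-string, which appends a newline that `tr -c` also maps to `_`, so every key ends in an underscore. This is harmless for the cooldown lookup because the same fingerprint always yields the same key. A minimal sketch of that behaviour follows; `safe_cooldown_key_no_newline` and the sample fingerprint are hypothetical and not part of this commit.

```shell
# Same helper as in the diff: the here-string adds a trailing newline,
# which tr -c replaces with "_" like any other disallowed byte.
safe_cooldown_key() {
    tr -c 'A-Za-z0-9_.-' '_' <<< "$1"
}

# Hypothetical variant (not in the commit): printf emits no trailing newline,
# so the key carries no extra underscore.
safe_cooldown_key_no_newline() {
    printf '%s' "$1" | tr -c 'A-Za-z0-9_.-' '_'
}

# Sample fingerprint in the host:container:status:action:result:http_code shape.
fingerprint="host110:momo-pro-system:exited:restart:failed:000"
safe_cooldown_key "$fingerprint"; echo
safe_cooldown_key_no_newline "$fingerprint"; echo
```

Switching an existing deployment to the variant would change the cooldown file names, so already-recorded cooldowns would be ignored once.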
--- a/scripts/ops/momo-pro-health-monitor.sh
+++ b/scripts/ops/momo-pro-health-monitor.sh
@@ -0,0 +1,134 @@
+#!/usr/bin/env bash
+# MOMO Pro health monitor (110 cron -> 188 app).
+#
+# Public route health is the first source of truth. The 188 local endpoint and
+# container state are secondary evidence only when the public health endpoint
+# fails. This prevents the old reboot-era false positive that checked
+# http://192.168.0.188/health and sent Telegram every five minutes while the
+# real public service had already recovered.
+
+set -euo pipefail
+
+SECRETS_FILE="${SECRETS_FILE:-/etc/awoooi-ops/secrets.env}"
+if [[ -r "$SECRETS_FILE" ]]; then
+    # shellcheck source=/dev/null
+    source "$SECRETS_FILE"
+fi
+
+: "${HEALTH_URL:=https://mo.wooo.work/health}"
+: "${REMOTE_HOST:=192.168.0.188}"
+: "${REMOTE_USER:=ollama}"
+: "${REMOTE_LOCAL_URL:=http://127.0.0.1:5003/health}"
+: "${CONTAINER_NAME:=momo-pro-system}"
+: "${CURL_TIMEOUT:=15}"
+: "${SEND_COOLDOWN_SECONDS:=1800}"
+: "${COOLDOWN_FILE:=/tmp/momo-pro-health-monitor.cooldown}"
+: "${NOTIFY_AWOOI_OPS_SCRIPT:=/home/wooo/awoooi-ops/notify-awoooi-ops.sh}"
+: "${SSH_STRICT_HOST_KEY_CHECKING:=accept-new}"
+
+log() {
+    echo "[$(date '+%Y-%m-%d %H:%M:%S %z')] $*"
+}
+
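+http_code_for() {
+    local url="$1" code
+    # curl -w already prints 000 on failure; only fall back when output is empty.
+    code=$(curl -kLsS -o /dev/null -w "%{http_code}" \
+        --connect-timeout "$CURL_TIMEOUT" --max-time "$CURL_TIMEOUT" "$url" 2>/dev/null) || true
+    echo "${code:-000}"
+}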
+
+is_success_code() {
+    case "$1" in
+        2*|3*) return 0 ;;
+        *) return 1 ;;
+    esac
+}
+
+in_cooldown() {
+    [[ -f "$COOLDOWN_FILE" ]] || return 1
+    local last now elapsed
+    last=$(cat "$COOLDOWN_FILE" 2>/dev/null || echo 0)
+    now=$(date +%s)
+    elapsed=$(( now - last ))
+    if (( elapsed < SEND_COOLDOWN_SECONDS )); then
+        log "COOLDOWN: last alert ${elapsed}s ago; skip Telegram (${SEND_COOLDOWN_SECONDS}s window)"
+        return 0
+    fi
+    return 1
+}
+
+mark_cooldown() {
+    date +%s > "$COOLDOWN_FILE"
+}
+
+send_ops_alert() {
+    local alertname="$1"
+    local summary="$2"
+    local detail="$3"
+
+    if in_cooldown; then
+        return 0
+    fi
+
+    if [[ ! -x "$NOTIFY_AWOOI_OPS_SCRIPT" ]]; then
+        log "WARN: notify helper not executable: ${NOTIFY_AWOOI_OPS_SCRIPT}"
+        mark_cooldown
+        return 0
+    fi
+
+    AWOOI_OPS_ALERTNAME="$alertname" \
+    AWOOI_OPS_JOB_NAME="MOMO Pro health monitor" \
+    AWOOI_OPS_STATUS="failed" \
+    AWOOI_OPS_SEVERITY="warning" \
+    AWOOI_OPS_SOURCE="momo-pro-health-monitor" \
+    AWOOI_OPS_COMPONENT="momo-pro" \
+    AWOOI_OPS_SUMMARY="$summary" \
+    AWOOI_OPS_DETAIL="$detail" \
+    "$NOTIFY_AWOOI_OPS_SCRIPT" >/dev/null 2>&1 || log "WARN: AWOOI ops notification failed"
+
+    mark_cooldown
+    log "ALERT_REPORTED: cooldown marked"
+}
+
+remote_health_code() {
+    ssh -o ConnectTimeout=5 -o StrictHostKeyChecking="${SSH_STRICT_HOST_KEY_CHECKING}" "${REMOTE_USER}@${REMOTE_HOST}" \
+        "curl -kLsS -o /dev/null -w '%{http_code}' --connect-timeout ${CURL_TIMEOUT} --max-time ${CURL_TIMEOUT} '${REMOTE_LOCAL_URL}' 2>/dev/null || true" \
+        2>/dev/null || echo "ssh_failed"
+}
+
+remote_container_status() {
+    ssh -o ConnectTimeout=5 -o StrictHostKeyChecking="${SSH_STRICT_HOST_KEY_CHECKING}" "${REMOTE_USER}@${REMOTE_HOST}" \
+        "docker inspect -f '{{.State.Status}} {{if .State.Health}}{{.State.Health.Status}}{{else}}no-healthcheck{{end}}' '${CONTAINER_NAME}' 2>/dev/null || echo missing" \
+        2>/dev/null || echo "ssh_failed"
+}
+
+main() {
+    log "CHECK: MOMO public health ${HEALTH_URL}"
+    local public_code
+    public_code=$(http_code_for "$HEALTH_URL")
+    if is_success_code "$public_code"; then
+        log "OK: public health ${public_code}; no alert"
+        return 0
+    fi
+
+    log "WARN: public health ${public_code}; checking ${REMOTE_HOST} local app and container"
+    local local_code container_status
+    local_code=$(remote_health_code)
+    container_status=$(remote_container_status)
+
+    if is_success_code "$local_code"; then
+        send_ops_alert \
+            "MomoProPublicRouteUnhealthy" \
+            "MOMO Pro public route unhealthy but 188 local app is OK" \
+            "public=${HEALTH_URL} HTTP ${public_code}; local=${REMOTE_LOCAL_URL} HTTP ${local_code}; container=${CONTAINER_NAME} ${container_status}; likely gateway/proxy path; do not restart blindly."
+    else
+        send_ops_alert \
+            "MomoProHealthFailed" \
+            "MOMO Pro service health failed" \
+            "public=${HEALTH_URL} HTTP ${public_code}; local=${REMOTE_LOCAL_URL} HTTP ${local_code}; container=${CONTAINER_NAME} ${container_status}; host=${REMOTE_HOST}; manual triage required."
+    fi
+    return 2
+}
+
+main "$@"
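
The Bitan same-fingerprint cooldown described in the LOGBOOK and SOP entries lives in a live script that is not part of this diff. The sketch below is one possible shape for that state logic, not the live implementation. It assumes a hypothetical `decide_notification` helper, a hypothetical `status|fingerprint|epoch` state-file format, and a fingerprint that contains no `|`; only the `21600s` window and the state-file name come from the entries above.

```shell
# Hypothetical sketch: variable names, state format, and default path are assumptions.
: "${NOTIFY_STATE_FILE:=/tmp/public-content-cleanliness.notify.state}"
: "${FAILURE_COOLDOWN_SECONDS:=21600}"

# Prints one of: send_failure, suppress, send_recovery, quiet.
# Args: current status (pass|fail), failure fingerprint (empty on pass), now (epoch seconds).
decide_notification() {
    local status="$1" fingerprint="$2" now="$3"
    local prev_status="pass" prev_fingerprint="" prev_sent=0
    if [[ -r "$NOTIFY_STATE_FILE" ]]; then
        IFS='|' read -r prev_status prev_fingerprint prev_sent < "$NOTIFY_STATE_FILE" || true
    fi
    # Treat a missing or corrupt timestamp as "never sent".
    [[ "$prev_sent" =~ ^[0-9]+$ ]] || prev_sent=0

    if [[ "$status" == "pass" ]]; then
        printf 'pass||0\n' > "$NOTIFY_STATE_FILE"
        # Only a fail -> pass transition produces a recovery notice.
        if [[ "$prev_status" == "fail" ]]; then echo "send_recovery"; else echo "quiet"; fi
        return 0
    fi

    # Same failure inside the cooldown window: stay silent, keep the old timestamp.
    if [[ "$prev_status" == "fail" && "$prev_fingerprint" == "$fingerprint" ]] \
        && (( now - prev_sent < FAILURE_COOLDOWN_SECONDS )); then
        echo "suppress"
        return 0
    fi

    # New failure, changed fingerprint, or expired window: record and notify.
    printf 'fail|%s|%s\n' "$fingerprint" "$now" > "$NOTIFY_STATE_FILE"
    echo "send_failure"
}
```

A caller would run something like `decide_notification fail "$fingerprint" "$(date +%s)"` and send a notification only on `send_failure` or `send_recovery`. Passing the timestamp in as an argument keeps the decision testable without waiting for the window to elapse.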