feat(heartbeat): noise reduction — silent 6h + warnings hash dedup

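P0 #4 (thorough long-term fix series). Evidence from the Commander: the
"INFO | AWOOOI 系統報告" (system report) message is pushed every 30 minutes,
which is 48 identical messages per day. Even after I fixed the P0 #3 false
alarms, the repeated daily "all systems normal" push is itself noise and
leads the Commander to believe that alerts are still repeating.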

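Fix (without violating the iron rule "monitoring tools must themselves be monitored": the healthy state still pushes one "I'm alive" message every 6h):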

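| Situation | Push behavior |
|------|---------|
| Healthy (no warnings) | At most one "I'm alive" signal per 6h |
| Warnings with the same hash as last time | Skip |
| Warnings that differ from last time | Push immediately (new conditions are never missed) |
| Healthy ↔ warnings transition | Automatically clear the opposite marker |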
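
The table above can be read as a pure decision function. The following is a minimal sketch, not the code in this commit: `Decision` and `decide` are hypothetical names introduced for illustration, and only the two Redis key names come from the commit.

```python
from typing import NamedTuple, Optional

SILENT_KEY = "heartbeat:silent_last_sent"
HASH_KEY = "heartbeat:warnings_hash"


class Decision(NamedTuple):
    """Hypothetical result type: what the heartbeat tick should do."""
    send: bool                 # push the report?
    set_key: Optional[str]     # marker to (re)write when sending
    clear_key: Optional[str]   # opposite marker to delete when sending


def decide(has_warnings: bool, silent_active: bool,
           last_hash: Optional[str], current_hash: str) -> Decision:
    if not has_warnings:
        if silent_active:
            # Healthy and an "I'm alive" message went out within 6h: skip.
            return Decision(False, None, None)
        # Healthy and the silent window has elapsed (or never started).
        return Decision(True, SILENT_KEY, HASH_KEY)
    if last_hash == current_hash:
        # Same warnings as last time: skip.
        return Decision(False, None, None)
    # New or changed warnings: push immediately.
    return Decision(True, HASH_KEY, SILENT_KEY)
```

Keeping the decision separate from the Redis calls is one possible design. It makes each row of the table checkable without a Redis instance.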

Redis keys:
- `heartbeat:silent_last_sent`: silent marker for the healthy state, TTL=6h
- `heartbeat:warnings_hash`: md5[:12] of the previous warnings, TTL=24h
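
For reference, the `md5[:12]` fingerprint stored under `heartbeat:warnings_hash` can be reproduced in isolation. This sketch mirrors the expression used in the diff; the function name `warnings_fingerprint` is hypothetical.

```python
import hashlib


def warnings_fingerprint(warnings):
    """Order-independent 12-hex-character fingerprint of a list of warning strings."""
    joined = "|".join(sorted(warnings))
    return hashlib.md5(joined.encode()).hexdigest()[:12]


# Sorting first means the order in which warnings were collected does not matter.
same = warnings_fingerprint(["redis down", "disk full"]) == warnings_fingerprint(
    ["disk full", "redis down"]
)
```

One caveat of joining with `|`: a single warning that itself contains `|` produces the same fingerprint as the two warnings it would split into. That is probably harmless here, but worth knowing if warning texts ever include that character.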

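Effect: the Commander's daily heartbeat count drops from 48 messages to about 4 (healthy state, 4 × 6h), and any problem is still pushed immediately.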
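
The "48 to about 4" figure can be checked with a small simulation of one day of 30-minute ticks. This is a sketch under simplifying assumptions: ticks land exactly on schedule, the silent marker expires exactly 6h after it is set, and the 24h hash TTL is ignored because it cannot expire within a single simulated day. `simulate_day` is a hypothetical helper, not part of the commit.

```python
TICK_MIN = 30
SILENT_TTL_MIN = 6 * 60


def simulate_day(warnings_by_tick):
    """Return the minutes-of-day at which a heartbeat would be pushed.

    warnings_by_tick maps a tick index (0..47) to a warnings hash;
    ticks that are absent from the mapping are treated as healthy.
    """
    silent_expiry = None  # minute at which the silent marker expires
    last_hash = None      # last pushed warnings hash
    sent = []
    for tick in range(24 * 60 // TICK_MIN):
        now = tick * TICK_MIN
        current = warnings_by_tick.get(tick)
        if current is None:
            if silent_expiry is not None and now < silent_expiry:
                continue  # healthy, still inside the 6h silent window
            silent_expiry = now + SILENT_TTL_MIN
            last_hash = None
        else:
            if current == last_hash:
                continue  # same warnings as last time
            last_hash = current
            silent_expiry = None
        sent.append(now)
    return sent


healthy_day = simulate_day({})
incident_day = simulate_day({tick: "abc" for tick in range(10, 15)})
```

Under these assumptions a fully healthy day yields 4 pushes. A day with one unchanged incident yields 5: the incident itself, plus an immediate "I'm alive" on recovery that restarts the 6h window.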

Tests: 6 passed (test_heartbeat_dedup_p0_4.py)
- healthy_first_send_goes_through
- healthy_second_send_within_6h_skipped
- warnings_unchanged_skipped
- warnings_changed_pushes
- warnings_to_healthy_clears_warnings_hash
- healthy_to_warnings_clears_silent_marker
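
The test file itself is not shown in this view. As a rough illustration of how the six cases could be exercised without a live Redis, the sketch below uses a hypothetical in-memory `FakeRedis` (TTLs are accepted but never expire) and a hypothetical `should_send` helper that restates the gate from the diff. Neither name is taken from the repository.

```python
import asyncio
import hashlib

SILENT_KEY = "heartbeat:silent_last_sent"
HASH_KEY = "heartbeat:warnings_hash"


class FakeRedis:
    """Hypothetical in-memory stand-in for the async Redis client."""

    def __init__(self):
        self.store = {}

    async def exists(self, key):
        return int(key in self.store)

    async def get(self, key):
        return self.store.get(key)

    async def setex(self, key, ttl, value):
        self.store[key] = value  # TTL is ignored in this sketch

    async def delete(self, key):
        self.store.pop(key, None)


async def should_send(redis, warnings):
    """Hypothetical extraction of the dedup gate; True means push."""
    digest = hashlib.md5("|".join(sorted(warnings)).encode()).hexdigest()[:12]
    if not warnings:
        if await redis.exists(SILENT_KEY):
            return False
        await redis.setex(SILENT_KEY, 6 * 3600, "1")
        await redis.delete(HASH_KEY)
        return True
    if await redis.get(HASH_KEY) == digest:
        return False
    await redis.setex(HASH_KEY, 24 * 3600, digest)
    await redis.delete(SILENT_KEY)
    return True


async def scenario():
    redis = FakeRedis()
    results = [
        await should_send(redis, []),                           # healthy, first send
        await should_send(redis, []),                           # healthy again within 6h
        await should_send(redis, ["disk full"]),                # new warning
        await should_send(redis, ["disk full"]),                # unchanged warning
        await should_send(redis, ["disk full", "redis down"]),  # changed warnings
        await should_send(redis, []),                           # recovered
    ]
    return results, dict(redis.store)


results, final_store = asyncio.run(scenario())
```

After the last step only the silent marker remains in the store, which matches the "clear the opposite marker" row of the table.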

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Your Name
2026-05-03 01:48:57 +08:00
parent 2ce722bda9
commit 8fb0c5df33
2 changed files with 246 additions and 0 deletions

@@ -6211,6 +6211,55 @@ class TelegramGateway:
return True
report = await HeartbeatReportService().collect()
# 2026-05-03 Claude Opus 4.7 + Commander ogt P0 #4: heartbeat noise reduction
# Evidence: originally one report every 30 min = 48 per day; the Commander sees
# identical content every day, which amounts to repeated alerting in disguise.
# Fix (without violating the iron rule "monitoring tools must themselves be monitored"):
#   Healthy (no warnings) → at most one "I'm alive" signal per 6h
#   Warnings same as last time → skip (hash comparison)
#   Warnings differ from last time → push immediately (new conditions are never missed)
import hashlib
SILENT_REPORT_INTERVAL_HOURS = 6
WARNINGS_HASH_TTL = 24 * 3600
silent_key = "heartbeat:silent_last_sent"
warnings_hash_key = "heartbeat:warnings_hash"
# Sort before joining so the hash does not depend on collection order.
warnings_str = "|".join(sorted(report.warnings))
warnings_hash = hashlib.md5(warnings_str.encode()).hexdigest()[:12]
if not report.warnings:
    # Healthy state: one "I'm alive" signal per 6h
    if await redis_client.exists(silent_key):
        logger.debug(
            "telegram_heartbeat_skipped_silent_recent",
            slot_id=slot_id,
        )
        return True
    await redis_client.setex(
        silent_key, SILENT_REPORT_INTERVAL_HOURS * 3600, "1",
    )
    # Clear the old warnings hash (warnings → healthy: the next warning must be pushed immediately)
    await redis_client.delete(warnings_hash_key)
else:
    # Warnings present: skip if the hash matches the last one
    last_hash_raw = await redis_client.get(warnings_hash_key)
    # The client may return bytes or str depending on its decode settings.
    last_hash = (
        last_hash_raw.decode() if isinstance(last_hash_raw, bytes)
        else last_hash_raw
    )
    if last_hash == warnings_hash:
        logger.debug(
            "telegram_heartbeat_skipped_warnings_unchanged",
            slot_id=slot_id,
            warnings_hash=warnings_hash,
        )
        return True
    await redis_client.setex(
        warnings_hash_key, WARNINGS_HASH_TTL, warnings_hash,
    )
    # Clear the silent marker (healthy → warnings: the next healthy report is
    # sent right away and starts a fresh 6h window)
    await redis_client.delete(silent_key)
text = report_to_telegram_html(report)
# Send only to the SRE war-room group
@@ -6228,6 +6277,7 @@ class TelegramGateway:
logger.info(
    "telegram_heartbeat_sent",
    warnings=len(report.warnings),
    warnings_hash=warnings_hash,
    has_sre_group=bool(settings.SRE_GROUP_CHAT_ID),
)