Telegram 重複發告警鐵證(4 個 agent 真實數據):
- INC-6FE3BD (HostBackupFailed) 24h 內被推 15 次
- INC-FD6E21 (HostHighCpuLoad) 24h 內被推 6 次
- 06:44:18 同秒兩送 = pod 並發 race
根因:
1. `telegram_sent:{incident_id}` dedup key 綁 uuid4 隨機 INC ID,
同 fingerprint 換新 INC 完全不去重
2. dedup TTL=600s 比 incident_analysis_sweeper 重觸週期 1h、
alertmanager repeat_interval 4h 都短 → 每輪都過期通過
3. pod restart 走 _resend_unconfirmed_ready_tokens 用同一 incident_id key
→ 重啟必炸一波
修法(不消音、是「AI 認得這是同一事故」):
- decision_manager.py:207-225 dedup key 改 alertname+target fingerprint
- decision_manager.py:573-578 TTL 600s → 86400s (蓋住 sweeper 1h × alertmanager 4h)
- decision_manager.py:3189-3208 pod restart resend 路徑同步改 fingerprint
- incident_analysis_sweeper.py:37-42 sweeper_done TTL 3600s → 86400s
預期:同症狀 24h 內最多發 1 張 decision card;resolved 後 line 220-226
status check 會 early return,不影響復發偵測。
Tests: 35 passed (test_telegram_adr050 + test_decision_manager_docker_prune_routing)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>