docs(adr): ADR-109 telegram_gateway unified dedup layer (P0 #1 design doc)
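P0 #1 (thorough long-term fix series): the 33 send_xxx methods each implement their own dedup today. This design moves dedup into a single layer, `_send_request()`. Any send_xxx method added later inherits dedup by passing two kwargs (dedup_scope + dedup_fingerprint), which removes the design flaw where "one missed chain bombards the commander".

Status is Proposed, awaiting review by the chief architect. Tier 2 orange zone.

Contains:
- dedup_scope mapping for the 33 send_xxx methods
- 5-6 hour / 3-commit incremental refactoring plan
- How it coordinates with ADR-108 (incident_id fingerprint)

Both ADRs are in the design phase of the "thorough long-term fix" series and await the commander's approval before execution.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>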
144
docs/adr/ADR-109-telegram-gateway-unified-dedup.md
Normal file
@@ -0,0 +1,144 @@
# ADR-109: Unified dedup abstraction layer for the Telegram Gateway (P0 #1 thorough long-term fix)

| | |
|---|---|
| Status | Proposed (awaiting review by the chief architect) |
| Date | 2026-05-03 |
| Author | Claude Opus 4.7 + commander ogt |
| Tier | 🟠 Tier 2 orange zone (telegram_gateway.py) |
| Related | ADR-108 (incident_id fingerprint), `feedback_telegram_dedup.md`, `feedback_full_chain_first_then_fix.md` |

## 1. Problem

`telegram_gateway.py` has **33 `send_xxx_card / send_xxx_alert / send_xxx_notification` methods**.
The dedup logic is scattered across four separate code paths, each written its own way:

| Path | Location | dedup key | TTL |
|------|------|-----------|-----|
| Decision card | `decision_manager.py:218` | `telegram_sent:fp:{alertname}:{target}` | 86400s (just changed) |
| Escalation card | `emergency_escalation_service.py:37` | `auto_repair:emergency_escalated:fp:{alertname}:{target}` | 86400s (just changed) |
| Drift escalation | `emergency_escalation_service.py:116` | `drift:auto_adopt_emergency:{report_id}` | 3600s |
| Heartbeat | `telegram_gateway.py:6213` | `heartbeat:silent_last_sent` + `heartbeat:warnings_hash` | 6h + 24h (just changed) |
| The other 28 send_xxx | Scattered on the caller side, or no dedup | None / per caller | None / per caller |

**Result**: every new `send_xxx_card` method needs dedup added by hand in the caller. Miss one and the commander gets bombarded with duplicates.

## 2. Hard evidence
- 2026-05-02: the escalation card was missed in the fix → 4 ESCALATION messages fired back to back at 17:35-17:36 (same target=node-exporter-110)
- Codex earlier fixed the decision card but overlooked escalation; statq Claude missed it too

## 3. Recommended approach

### 3.1 Dedup in a single layer, `_send_request()` (kwargs omitted = no dedup)

```python
# telegram_gateway.py
async def _send_request(
    self,
    method: str,
    payload: dict,
    *,
    dedup_scope: str | None = None,        # "decision" / "escalation" / "drift" / ...
    dedup_fingerprint: str | None = None,  # alertname:target hash supplied by the caller
    dedup_ttl: int = 86400,
) -> bool:
    if dedup_scope and dedup_fingerprint:
        key = f"telegram_sent:{dedup_scope}:{dedup_fingerprint}"
        if await get_redis().exists(key):
            logger.debug("telegram_send_dedup_skipped", scope=dedup_scope, fp=dedup_fingerprint)
            return True  # silent skip

    response = await self._http.post(...)

    # Register the dedup key only after a successful send
    if response.ok and dedup_scope and dedup_fingerprint:
        await get_redis().setex(key, dedup_ttl, "1")

    return response.ok
```
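
To make the check-then-register flow above concrete, here is a minimal sketch that runs without Redis or Telegram. `FakeRedis`, `FakeGateway`, the `sent` list, and the `NodeDown` alert name are hypothetical stand-ins invented for illustration; only the key layout and the order of operations are taken from the snippet above.

```python
import asyncio
import time


class FakeRedis:
    """In-memory stand-in for the two Redis calls the gate uses (assumption)."""

    def __init__(self) -> None:
        self._expiry: dict[str, float] = {}

    async def exists(self, key: str) -> bool:
        deadline = self._expiry.get(key)
        return deadline is not None and deadline > time.monotonic()

    async def setex(self, key: str, ttl: int, value: str) -> None:
        self._expiry[key] = time.monotonic() + ttl


class FakeGateway:
    """Hypothetical gateway that records payloads instead of calling Telegram."""

    def __init__(self) -> None:
        self.redis = FakeRedis()
        self.sent: list[dict] = []

    async def _send_request(self, method, payload, *, dedup_scope=None,
                            dedup_fingerprint=None, dedup_ttl=86400) -> bool:
        key = None
        if dedup_scope and dedup_fingerprint:
            key = f"telegram_sent:{dedup_scope}:{dedup_fingerprint}"
            if await self.redis.exists(key):
                return True  # silent skip
        self.sent.append(payload)  # stands in for the HTTP POST, always "ok" here
        if key is not None:
            await self.redis.setex(key, dedup_ttl, "1")
        return True


async def demo() -> int:
    gw = FakeGateway()
    fp = "NodeDown:node-exporter-110"  # hypothetical alertname:target
    for _ in range(4):  # four identical escalations in a row
        await gw._send_request("sendMessage", {"text": "ESCALATION"},
                               dedup_scope="escalation", dedup_fingerprint=fp)
    # No dedup kwargs: both calls go out
    await gw._send_request("sendMessage", {"text": "repair report"})
    await gw._send_request("sendMessage", {"text": "repair report"})
    return len(gw.sent)


print(asyncio.run(demo()))  # prints 3: one escalation plus two repair reports
```

One design point this sketch makes visible: `exists` followed by `setex` is two separate steps, so two concurrent sends with the same fingerprint could both pass the check. If that turns out to matter, Redis `SET key value NX EX ttl` does the claim in one step, at the cost of registering the key before the send is known to have succeeded.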

### 3.2 The 33 send_xxx methods

Not every signature needs to change. **Only the send_xxx methods that need dedup** gain the kwargs and pass them through to `_send_request`:

| send_xxx | dedup_scope | Notes |
|----------|-------------|------|
| `send_approval_card` | "approval" | One card per INC + fingerprint |
| `send_drift_card` | "drift" | Same report_id |
| `send_meta_alert` | "meta" | Natural dedup |
| `send_secops_card` | "secops" | One card per event |
| `send_business_alert` | "business" | One card per metric |
| `send_escalation_card` | "escalation" | (Already has caller-side dedup; moved up to the gateway layer) |
| `send_resource_warning` | "resource" | Same host:resource |
| `send_repair_report` | None | No dedup (should be pushed every time) |
| `send_daily_summary` | "daily" | One card per day |
| `send_cicd_progress` | None | Progress is pushed in real time |
| `send_deploy_success` | None | Deploys are pushed in real time |
| `send_rate_limit_warning` | "rate_limit" | One card per chat per hour |
| `send_guardrail_blocked` | "guardrail" | Same event |
| `send_preflight_failed` | "preflight" | Same commit |
| `send_backup_result` | None | Pushed every time |
| `send_multisig_waiting/approved` | "multisig:{op_id}" | One card per operation |
| `send_change_applied` | None | Pushed in real time |
| `send_sentry_error` | "sentry:{error_id}" | One card per error per hour |
| `send_heartbeat` | (built-in) | Already uses `heartbeat:silent_last_sent` |
| `send_info_notification` | None | Controlled by the caller |
| `send_alert_notification` | "alert:{?}" | Caller passes the fingerprint |
| ... | ... | ... |

### 3.3 Caller-side changes

Callers currently compute the dedup key themselves. Change them to pass a fingerprint to the gateway instead:

```python
# Before (decision_manager.py:217-218)
dedup_key = f"telegram_sent:fp:{alertname}:{target}"
if await redis.exists(dedup_key):
    return

# After
await gateway.send_approval_card(
    ...,
    dedup_scope="decision",
    dedup_fingerprint=f"{alertname}:{target}",
)
```
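
The signature comment in section 3.1 describes the fingerprint as an "alertname:target hash", while the example above passes the raw string. Either works with the key layout; the sketch below shows one possible hashed variant. `make_fingerprint` and `dedup_key` are hypothetical helper names, and the whitespace stripping and 16-character truncation are illustrative assumptions rather than part of the design.

```python
import hashlib


def make_fingerprint(*parts: str) -> str:
    """Join the parts with ':' and hash them into a short, key-safe token."""
    raw = ":".join(p.strip() for p in parts)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


def dedup_key(scope: str, fingerprint: str) -> str:
    # Same key layout as _send_request in section 3.1
    return f"telegram_sent:{scope}:{fingerprint}"


fp = make_fingerprint("NodeDown", "node-exporter-110")  # hypothetical alert name
print(dedup_key("decision", fp))
```

Hashing keeps Redis keys at a bounded length and free of awkward characters, but it makes keys harder to read when debugging, which may be a reason to stay with the raw `alertname:target` string.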

## 4. Impact

| Category | Count | Effort |
|------|------|------|
| `_send_request` gains kwargs | 1 | Simple |
| send_xxx gains pass-through kwargs | ~20 | Mechanical |
| Callers switch from manual dedup to passing fp | ~10 | Moderate (callers must be located) |
| New unit tests (dedup behavior) | 8-10 | Moderate |

**Total effort: ~5-6 hours, 3 commits**:
- C1: add dedup kwargs to `_send_request`, plus the matching unit tests
- C2: add pass-through kwargs to 20 send_xxx methods (signatures change, but the None defaults keep them compatible)
- C3: callers switch from manual dedup to passing fp (including cleanup of the workarounds from b3a0f0d7 / 47342dfb)

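## 5. Coordination with ADR-108

- Once ADR-108 (turning incident_id into a fingerprint) is done → ADR-109 callers can simply pass `incident.fingerprint`, which is cleaner
- The two ADRs can proceed independently, but **ADR-108 first** is recommended: once incident_id is fixed at the root, the ADR-109 caller-side changes become more straightforward
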
## 6. Risks and mitigations

| Risk | Impact | Mitigation |
|------|------|------|
| Signature changes on 33 send_xxx | Medium | Everything defaults to None, so it stays backward compatible; callers migrate incrementally |
| After unification, a new "should not dedup" case cannot bypass dedup | Low | dedup_scope=None still disables dedup entirely |
| Extra Redis call latency on the hot path (`_send_request`) | Low | EXISTS is O(1), ~1ms |

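## 7. Acceptance criteria

1. Any newly added `send_xxx_card` method that passes no dedup arguments → behaves the same as today (backward compatible)
2. Passing `dedup_scope + dedup_fingerprint` → a second call with the same fp within 24h is silently skipped
3. The existing workaround dedup from b3a0f0d7 + 47342dfb can be cleaned up (callers no longer compute keys by hand)
4. Unit tests for every send_xxx cover both the "dedup hit" and "dedup miss" paths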
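
Criteria 1, 2 and 4 could be checked along the following lines. This is only a sketch: `send_with_dedup` is a hypothetical free-function mirror of the proposed `_send_request` gate, and Redis and the HTTP call are replaced with `unittest.mock.AsyncMock` objects so the checks run without any external service.

```python
import asyncio
from unittest.mock import AsyncMock


async def send_with_dedup(redis, post, payload, *, dedup_scope=None,
                          dedup_fingerprint=None, dedup_ttl=86400) -> bool:
    """Free-function mirror of the proposed gate (assumption, not the real method)."""
    key = None
    if dedup_scope and dedup_fingerprint:
        key = f"telegram_sent:{dedup_scope}:{dedup_fingerprint}"
        if await redis.exists(key):
            return True  # silent skip
    ok = await post(payload)
    if ok and key is not None:
        await redis.setex(key, dedup_ttl, "1")
    return ok


async def check_dedup_hit() -> None:
    redis, post = AsyncMock(), AsyncMock(return_value=True)
    redis.exists.return_value = True  # key already registered
    assert await send_with_dedup(redis, post, {}, dedup_scope="drift", dedup_fingerprint="r1")
    post.assert_not_awaited()         # nothing is sent
    redis.setex.assert_not_awaited()  # TTL is not refreshed


async def check_dedup_miss() -> None:
    redis, post = AsyncMock(), AsyncMock(return_value=True)
    redis.exists.return_value = False
    assert await send_with_dedup(redis, post, {}, dedup_scope="drift", dedup_fingerprint="r1")
    post.assert_awaited_once()
    redis.setex.assert_awaited_once_with("telegram_sent:drift:r1", 86400, "1")


async def check_no_kwargs_is_unchanged() -> None:
    redis, post = AsyncMock(), AsyncMock(return_value=True)
    assert await send_with_dedup(redis, post, {})
    post.assert_awaited_once()
    redis.exists.assert_not_awaited()  # Redis is never touched


for check in (check_dedup_hit, check_dedup_miss, check_no_kwargs_is_unchanged):
    asyncio.run(check())
```

In the real test suite the same three cases would be repeated per send_xxx method, most likely through a parametrized test rather than hand-written functions.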

## 8. Awaiting approval

Commander ogt: this ADR is the design doc for a Tier 2 orange-zone change. After approval, execute C1→C2→C3.

Suggest deciding the order together with ADR-108: **ADR-108 → ADR-109**, or **in parallel**?