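docs: record the Telegram alert-flood incident fix

Updates:
- ADR-027: add an emergency incident fix section
- LOGBOOK: record the 2026-03-26 incident timeline
- Skill 02 v1.6: add a Telegram deduplication section

Root cause: Phase 6.5 change + duplicated INC- prefix
Fix: Redis deduplication (10 minutes) + prefix check

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>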
@@ -10,10 +10,10 @@

| Field | Value |
|------|-----|
| **Version** | v1.5 |
| **Version** | v1.6 |
| **Created** | 2026-03-20 (Taipei) |
| **Created by** | Claude Code |
| **Last modified** | 2026-03-26 19:30 (Taipei) |
| **Last modified** | 2026-03-26 20:10 (Taipei) |
| **Modified by** | Claude Code |

### Changelog

@@ -26,6 +26,7 @@
| v1.3 | 2026-03-26 | Claude Code | 🔴🔴🔴 Added the mandatory modularization enforcement section (after an audit found 32 violations) |
| v1.4 | 2026-03-26 | Claude Code | 📊 Added the Langfuse LLMOps integration section (Phase 15.1) |
| v1.5 | 2026-03-26 | Claude Code | 🔴🔴 Added the UnitOfWork + Saga Pattern section (ADR-027) |
| v1.6 | 2026-03-26 | Claude Code | 🚨 Added the Telegram deduplication section (lessons from the alert-flood incident) |

---

@@ -446,6 +447,51 @@ apps/api/src/

---

## 🚨 Telegram Deduplication (Lessons from the 2026-03-26 Incident)

> **Incident**: Telegram alert flood. The same alert was sent dozens of times, with IDs of the form `INC-INC-INC-2026`
> **Root cause**: The Phase 6.5 change made every poll create a new decision → each one triggered a Telegram message

### Iron Rule 1: Telegram sends must be deduplicated

```python
# decision_manager.py - _push_to_telegram_on_decision()
async def _push_to_telegram_on_decision(self, incident, decision):
    # Dedup check (10-minute TTL)
    dedup_key = f"telegram_sent:{incident.incident_id}"
    if await redis_client.get(dedup_key):
        logger.debug("telegram_push_skipped_dedup", incident_id=incident.incident_id)
        return

    # Send the Telegram message...

    # Set the dedup key
    await redis_client.set(dedup_key, "1", ex=600)
```
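
The snippet above reads the key and writes it in two separate steps, so two pollers arriving at nearly the same moment could both pass the check. A minimal sketch of an atomic variant follows, assuming the client is redis-py's asyncio client, whose `set` accepts `nx=True` and returns `None` when the key already exists. `InMemoryRedis` and `try_claim_telegram_send` are hypothetical names used only so the example runs without a Redis server; they are not part of the commit.

```python
import asyncio
import time


class InMemoryRedis:
    """Hypothetical stand-in mimicking the subset of redis-py's async `set` used here."""

    def __init__(self):
        self._expiry = {}  # key -> expiry time in monotonic seconds

    async def set(self, key, value, ex=None, nx=False):
        now = time.monotonic()
        expiry = self._expiry.get(key)
        if expiry is not None and expiry <= now:
            del self._expiry[key]  # key has expired, treat it as absent
        if nx and key in self._expiry:
            return None  # same result redis-py gives when NX fails
        self._expiry[key] = (now + ex) if ex is not None else float("inf")
        return True


async def try_claim_telegram_send(redis_client, incident_id, ttl_seconds=600):
    """Return True only for the first caller within the TTL window."""
    dedup_key = f"telegram_sent:{incident_id}"
    # Check and write happen in one command, so there is no gap between them.
    claimed = await redis_client.set(dedup_key, "1", ex=ttl_seconds, nx=True)
    return bool(claimed)


async def demo():
    client = InMemoryRedis()
    # Five "polls" racing for the same incident.
    return await asyncio.gather(
        *(try_claim_telegram_send(client, "INC-2026ABCD") for _ in range(5))
    )


claims = asyncio.run(demo())
print(claims)
```

One trade-off of claiming before sending: if the Telegram call fails, the key still suppresses retries for the full TTL unless the caller deletes it on failure.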

### Iron Rule 2: The INC- prefix must never be duplicated

```python
# ❌ Forbidden: adding the prefix without checking
incident_id = f"INC-{self.approval_id[:8].upper()}"

# ✅ Correct: check first, then add
if self.approval_id.upper().startswith("INC-"):
    incident_id = self.approval_id.upper()
else:
    incident_id = f"INC-{self.approval_id[:8].upper()}"
```
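
The check above can be wrapped in a small helper so that every call site shares one implementation and applying it twice is harmless. This is a sketch only: `to_incident_id` mirrors the rule shown above, while `collapse_repeated_prefix` is a hypothetical cleanup helper for IDs that were already corrupted, not something the commit is known to include.

```python
import re

PREFIX = "INC-"


def to_incident_id(approval_id: str) -> str:
    """Add the INC- prefix once; leave already-prefixed IDs untouched."""
    normalized = approval_id.upper()
    if normalized.startswith(PREFIX):
        return normalized
    return f"{PREFIX}{normalized[:8]}"


def collapse_repeated_prefix(incident_id: str) -> str:
    """Hypothetical repair step: reduce a run of leading INC- prefixes to one."""
    return re.sub(r"^(?:INC-)+", PREFIX, incident_id.upper())


first = to_incident_id("abcd1234-ef56-7890")
second = to_incident_id(first)  # applying the helper again changes nothing
repaired = collapse_repeated_prefix("INC-INC-INC-2026")
print(first, second, repaired)
```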

### Checklist

| Item | Correct practice |
|------|----------|
| Telegram dedup | Redis key `telegram_sent:{id}` + 600-second TTL |
| INC- prefix | Check whether the prefix is already present before sending |
| Decision creation | Do not send a Telegram message when an existing decision is returned |
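
The last checklist row can be illustrated with a get-or-create function that reports whether it actually created something, so the caller notifies only on creation. This is a simplified sketch: `DecisionStore`, its fields, and the `PENDING` state are hypothetical, and it ignores the COMPLETED/INVESTIGATING state handling that the real Phase 6.5 logic deals with.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionStore:
    """Hypothetical in-memory store used only to illustrate the rule."""

    decisions: dict = field(default_factory=dict)
    telegram_outbox: list = field(default_factory=list)

    def get_or_create(self, incident_id):
        """Return (decision, created); created is False for an existing decision."""
        existing = self.decisions.get(incident_id)
        if existing is not None:
            return existing, False
        decision = {"incident_id": incident_id, "state": "PENDING"}
        self.decisions[incident_id] = decision
        return decision, True

    def handle_poll(self, incident_id):
        decision, created = self.get_or_create(incident_id)
        if created:  # notify only when a decision was really created
            self.telegram_outbox.append(f"New decision for {incident_id}")
        return decision


store = DecisionStore()
for _ in range(10):  # ten frontend polls for the same incident
    store.handle_poll("INC-2026ABCD")
print(len(store.telegram_outbox))
```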

---

## 🧰 MCP Tool Implementation Guidelines (Phase 13.2)

> **Goal**: Upgrade the mock MCP tools to real system connections

@@ -5,18 +5,46 @@

---

## 📍 Current Status (2026-03-26 19:30 Taipei)
## 📍 Current Status (2026-03-26 20:05 Taipei)

| Item | Status |
|------|------|
| **Current Phase** | **Phase 17.1 Incident-Approval sync** |
| **Day** | Day 10 |
| **AI Fallback** | ✅ **Gemini first (11/500 daily used)** |
| **AI Fallback** | ✅ **Ollama first (switched back because of the Gemini 404 issue)** |
| **ADR-027** | ✅ **Approved** (Incident-Approval sync architecture) |
| **ADR-026** | ✅ **Approved** (CoreDNS GitOps) |
| **Telegram alerts** | ✅ **Fixed** (NetworkPolicy + CoreDNS) |
| **Telegram alerts** | ✅ **Fixed** (INC-INC- bug + deduplication) |
| **Architecture review** | ✅ **Done** (5 key issues identified) |
| **Runner issue** | ⚠️ **actions/checkout file-write issue (needs investigation)** |
| **K8s deployment** | ✅ **CD automatically handles the immutable-selector issue** |

### 🔴🔴 2026-03-26 Telegram Alert Flood Emergency Fix (Day 10 evening, 19:30-20:00)

**Incident**: The same Telegram alert was sent repeatedly (IDs of the form `INC-INC-INC-`)

**Root cause analysis**:
1. Phase 6.5 (765ee39) change: COMPLETED decision + INVESTIGATING incident → create a new decision
2. Every frontend poll of `/api/v1/incidents` triggered a new decision → Telegram message
3. `telegram_gateway.py:161` added the INC- prefix again → `INC-INC-` form
4. The Gemini API returned 404 but the calls were still counted against quota (91/500 quota wasted)

**Fixes**:
| Fix | File | Commit |
|------|------|--------|
| Duplicated INC- prefix (decision) | decision_manager.py:83 | 35aa690 |
| Duplicated INC- prefix (telegram) | telegram_gateway.py:161 | 139ddc3 |
| Telegram dedup (10-minute Redis key) | decision_manager.py | 35aa690 |
| Ollama priority order | kubectl set env | - |
| Immutable K8s selector | cd.yaml | 6421af0 |
| TypeScript error | live-approval-panel.tsx | 0e6c381 |
| Lint error | services/__init__.py | df04254 |

**Lessons**:
- 🔴 The Phase 6.5 decision logic was changed without considering its effect on polling
- 🔴 There was no rate limit on Telegram sends
- 🔴 The Gemini API 404 problem was not detected in time
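
The second lesson points at a missing send-rate limit. As one possible shape for such a guard, here is a sliding-window limiter sketch. Everything in it (`SlidingWindowLimiter`, the limit of 3 messages per 60 seconds, passing `now` explicitly) is an assumption for illustration; the commit itself only adds the Redis dedup key.

```python
from collections import deque


class SlidingWindowLimiter:
    """Hypothetical limiter: allow at most `max_messages` per `window_seconds`."""

    def __init__(self, max_messages, window_seconds):
        self.max_messages = max_messages
        self.window_seconds = window_seconds
        self._sent_at = deque()  # timestamps of messages still inside the window

    def allow(self, now):
        # Drop timestamps that have fallen out of the window.
        while self._sent_at and now - self._sent_at[0] >= self.window_seconds:
            self._sent_at.popleft()
        if len(self._sent_at) >= self.max_messages:
            return False
        self._sent_at.append(now)
        return True


limiter = SlidingWindowLimiter(max_messages=3, window_seconds=60)
decisions = [limiter.allow(now=t) for t in (0, 1, 2, 3, 59, 60, 61)]
print(decisions)
```

Unlike the per-incident dedup key, a limiter like this caps total volume even when many distinct incident IDs are generated by a bug.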

---

### 🔴 2026-03-26 Chief Architect Full Review + ADR-027 Approval (Day 9 evening, 19:30)

@@ -231,12 +231,45 @@ APPROVAL_TO_INCIDENT_STATUS = {

---

## 🔴🔴 Emergency Incident Fix (2026-03-26 19:30-20:00)

### Incident: Telegram alert flood

**Symptom**: The same alert was sent dozens of times, with IDs of the form `INC-INC-INC-2026`

### Root cause

1. **Phase 6.5 (765ee39) change**: COMPLETED decision + INVESTIGATING incident → create a new decision
2. **Triggered on every poll**: the frontend polls `/api/v1/incidents` → a new decision is created → Telegram
3. **Duplicated INC- prefix**: both `decision_manager.py` and `telegram_gateway.py` added the prefix

### Fix commits

| Commit | Fix |
|--------|----------|
| 35aa690 | decision_manager.py: INC- prefix fix + Redis dedup (10 minutes) |
| 139ddc3 | telegram_gateway.py: INC- prefix check |
| 6421af0 | cd.yaml: handle the immutable K8s selector |

### New safeguards

```python
# decision_manager.py - Telegram deduplication
dedup_key = f"telegram_sent:{incident.incident_id}"
if await redis_client.get(dedup_key):
    return  # do not resend within 10 minutes
await redis_client.set(dedup_key, "1", ex=600)
```
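
To make the effect of the 600-second window concrete, the following sketch counts messages for a single incident that keeps being re-triggered. The 5-second poll interval and the 30-minute span are assumptions chosen for the arithmetic, not values from the incident, and `count_messages` is a hypothetical helper that models key expiry without Redis.

```python
def count_messages(poll_times, ttl_seconds):
    """Count sends when each send blocks further sends for `ttl_seconds`."""
    sent = 0
    expires_at = None  # time at which the dedup key would expire
    for t in poll_times:
        if expires_at is not None and t < expires_at:
            continue  # key still present: skip, as the guard above does
        sent += 1
        expires_at = t + ttl_seconds
    return sent


poll_times = range(0, 1800, 5)  # assumed: one poll every 5 s for 30 minutes
without_dedup = count_messages(poll_times, ttl_seconds=0)
with_dedup = count_messages(poll_times, ttl_seconds=600)
print(without_dedup, with_dedup)  # 360 3
```

Under these assumptions the guard cuts 360 messages down to 3. It bounds the repeat rate rather than eliminating repeats, which is why the prefix and decision-creation fixes are still needed.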

---

## Related documents

- ADR-011: NetworkPolicy change governance architecture
- ADR-026: CoreDNS GitOps management architecture
- [project_incident_approval_sync.md](~/.claude/projects/-Users-ogt-awoooi/memory/project_incident_approval_sync.md)
- [feedback_incident_approval_sync.md](~/.claude/projects/-Users-ogt-awoooi/memory/feedback_incident_approval_sync.md)
- [feedback_telegram_dedup.md](~/.claude/projects/-Users-ogt-awoooi/memory/feedback_telegram_dedup.md)

---