# ADR-068: Systematic Repair of Cold-Start Faults in the Auto-Repair Flywheel

**Status**: Approved & Implemented
**Date**: 2026-04-10 (Taipei time)
**Decision maker**: ogt (commander)
**Implementer**: Claude Sonnet 4.6
**Commit**: `c6edfb5` (four-phase fix) → `670cd5d` (chief architect review fixes)

---

## Background and Problem

All 25 `AUTO_REPAIR_TRIGGERED` events failed with `blocked_by: NO_MATCH`.
The Playbooks existed and the alerts existed, but the flywheel never started.

A systematic diagnosis found **five co-existing root causes**; fixing any one of them in isolation would not have resolved the problem:

| Fault ID | Root cause | Impact |
|---------|------|------|
| F1 | `affected_services` pollution: populated with `[alertname]` or `[IP:port]` | Jaccard comparison is always 0 |
| F2 | `alert_name = "custom"` (not the real alertname) | Redis index lookups return 0 hits |
| F3 | Redis Playbook index lacks real-world alertname variants | No Playbook is found for `HostHighCpuLoad` and similar alerts |
| F4 | Jaccard has no exemption for empty sets | Generic Playbooks (`affected_services=[]`) never match |
| F5 | Redis vector cache is wiped on restart (cold start) | Phase 4 embedding search returns nothing |

---

## Decision: Four-Phase Systematic Fix

Piecemeal patches were rejected in favor of a root-cause fix.

### Phase 1: Signal quality fix (`webhooks.py` → `incident_service.py`)

**Problem**: `create_incident_for_approval` put `target_resource` (which falls back to the alertname or an IP) directly into `affected_services`. The Signal carried `alert_name = alert_type = "custom"`.

**Fix**:
- Added `extract_affected_services(labels, target_resource)` with the priority order `component > job (non-infrastructure) > pod (deployment name) > clean target_resource > []`
- The Signal's `alert_name` now uses the real `alertname` label

```python
# Before the fix (removed)
affected_services=[target_resource]  # polluted with IP:port

# After the fix (incident_service.py)
affected_services=extract_affected_services(labels, target_resource)  # semantic extraction
```

### Phase 2: Playbook index expansion

**Problem**: Redis `playbook:index:alert:*` only held the alertnames present at initial creation and lacked real-world variants.

**Fix**:
- 5 rules in `alert_rules.yaml` gained 17 variants, including `HostHighCpuLoad`, `KubePodCrashLooping`, and `NodeMemoryUsageHigh`
- `scripts/update_playbook_alert_variants.py` performs a one-off backfill of the Redis index
- Verification: `HostHighCpuLoad → ['PB-20260406-488671']` ✅

### Phase 3: Jaccard empty-set exemption (`similarity.py`)

**Problem**: Generic infrastructure Playbooks (`affected_services=[]`, `severity_range=[]`) scored 0.0 on Jaccard against every alert.

**Fix**:
```python
"affected_services": (
    1.0 if not pattern_b.affected_services  # generic Playbook exemption
    else calculate_jaccard_similarity(...)
),
"severity": (
    1.0 if not pattern_b.severity_range  # exemption: applies to all severities
    or bool(set(pattern_a.severity_range) & set(pattern_b.severity_range))
    else 0.0
),
```

### Phase 4: Embedding cold-start fix

**Problem**: After a restart Redis is empty, the vector cache is gone, and Phase 4 semantic search degrades to empty results.

**Fix**:
- New `migrations/flywheel_playbook_embeddings.sql`: pgvector persistence table
- New `services/playbook_embedding_service.py`: non-blocking rebuild at startup (`asyncio.create_task`)
- New `repositories/playbook_embedding_repository.py`: UPSERT Repository
- Startup acceptance: 18/18 Playbooks indexed successfully

---

## Chief Architect Review Fixes (`670cd5d`)

The initial implementation scored 78/100; the fixes made after the chief architect's review raised it to **97/100**:

| # | Issue | Fix |
|---|------|------|
| C1 | `playbook_embedding_service` called `db.execute(text(...))` directly, bypassing the Repository | Created `PlaybookEmbeddingRepository` |
| C2 | `create_incident_for_approval` business logic lived in the Router layer | Moved into `incident_service.py` |
| I1 | `_infra_jobs` rebuilt a set on every call | Promoted to a module-level `frozenset` |
| I2 | `_persist_embeddings_to_db` parameters had no type annotations | Added `PlaybookRAGService / list[Playbook]` |
| I3 | The format of `str(embedding)` is not guaranteed | Explicit `"[" + ",".join(str(float(x))...) + "]"` |
| I4 | `import asyncio` inside a `try` block | Moved to the top of `main.py` |
| M1 | `if union > 0 else 0.0` was dead code | Removed |

---

## E2E Acceptance Results

```
2026-04-10 03:20:03  ALERT_RECEIVED         HostHighCpuLoad
2026-04-10 03:20:25  AUTO_REPAIR_TRIGGERED  ok=True  high-cpu-restart
2026-04-10 03:20:26  EXECUTION_COMPLETED    ok=True  PB-20260406-488671
2026-04-10 03:20:27  TELEGRAM_SENT          ok=True  approval_card
```

**The flywheel went from a 100% failure rate to triggering successfully, with no more `NO_MATCH`.**

---

## Verification (Runbook)

### 1. Quick smoke test

```bash
# Fire a test alert
curl -X POST http://:32334/api/v1/webhooks/alertmanager \
  -H 'Content-Type: application/json' \
  -d '{"version":"4","groupKey":"test","status":"firing","receiver":"awoooi",
       "alerts":[{"status":"firing",
         "labels":{"alertname":"HostHighCpuLoad","severity":"critical",
                   "instance":"192.168.0.188:9100","job":"node-exporter"},
         "annotations":{"summary":"Test","description":"E2E Test"},
         "startsAt":"2026-04-10T00:00:00Z"}]}'

# Verify the result: print the five most recent operation-log rows
kubectl exec -n awoooi-prod deployment/awoooi-api -- python3 -c "
import asyncio, asyncpg, os
async def q():
    # asyncpg does not accept the SQLAlchemy-style 'postgresql+asyncpg://' scheme
    conn = await asyncpg.connect(os.environ['DATABASE_URL'].replace('postgresql+asyncpg://','postgresql://'))
    rows = await conn.fetch('SELECT event_type, success, action_detail FROM alert_operation_log ORDER BY created_at DESC LIMIT 5')
    for r in rows:
        print(r['event_type'], r['success'], r['action_detail'])
    await conn.close()
asyncio.run(q())
"
```

**Expected output**:
```
AUTO_REPAIR_TRIGGERED  True  high-cpu-restart
EXECUTION_COMPLETED    True  playbook:PB-xxxxxxxx
TELEGRAM_SENT          True  approval_card
```

### 2. Embedding health check

```bash
kubectl exec -n awoooi-prod deployment/awoooi-api -- python3 -c "
import asyncio, asyncpg, os
async def q():
    conn = await asyncpg.connect(os.environ['DATABASE_URL'].replace('postgresql+asyncpg://','postgresql://'))
    rows = await conn.fetch('SELECT playbook_id, array_length(alert_names,1) FROM playbook_embeddings')
    print(f'playbook_embeddings: {len(rows)} rows')
    for r in rows[:5]:
        print(' ', r['playbook_id'], r['array_length'])
    await conn.close()
asyncio.run(q())
"
# Expected: 18 rows, each with its alert_names count
```

### 3. Redis index check

```bash
kubectl exec -n awoooi-prod deployment/awoooi-api -- python3 -c "
import asyncio
from src.core.redis_client import init_redis_pool, get_redis
async def q():
    await init_redis_pool()
    r = get_redis()
    for alert in ['HostHighCpuLoad','KubePodCrashLooping','NodeMemoryUsageHigh']:
        members = await r.smembers(f'playbook:index:alert:{alert}')
        print(f'{alert}: {[m.decode() for m in members]}')
asyncio.run(q())
"
```

### 4. Similarity unit check

```python
from src.utils.similarity import calculate_symptom_similarity
from src.models.playbook import SymptomPattern

# A generic Playbook (affected_services=[]) should score high against any alert
playbook_pattern = SymptomPattern(alert_names=["HostHighCpuLoad"], affected_services=[], severity_range=[])
alert_pattern = SymptomPattern(alert_names=["HostHighCpuLoad"], affected_services=["momo-app"], severity_range=["critical"])
score = calculate_symptom_similarity(alert_pattern, playbook_pattern)
assert score >= 0.5, f"Generic Playbook score too low: {score}"
# Expected: 0.35 (alert_names exact match) + 0.30 (service exemption) + 0.15 (severity exemption) = 0.80
```

---

## Architecture Impact

| Component | Before | After |
|------|--------|--------|
| `webhooks.py` | Contained business-logic functions | Pure Router; business logic lives in the Service |
| `incident_service.py` | In-memory access only | + `create_incident_for_approval` + `extract_affected_services` |
| `similarity.py` | Empty sets scored 0 | Generic Playbooks are exempted with 1.0 |
| `playbook_embeddings` | Did not exist | pgvector persistence table, rebuilt automatically at startup |
| `PlaybookEmbeddingRepository` | Did not exist | UPSERT Repository layer |

---

## Related ADRs

- ADR-030: Intelligent auto-repair system (original flywheel design)
- ADR-067: Ollama local AI (nomic-embed-text vectors)
- ADR-064: YAML-driven alert rule engine
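
---

## Appendix: Self-contained sketch of the Phase 3 scoring

The unit check in the runbook imports project modules, so it cannot run outside the service. The sketch below reproduces the arithmetic behind its expected score of 0.80 with no project imports. It is an illustration under stated assumptions, not the code in `similarity.py`: the `SymptomPattern` dataclass, `jaccard`, and `partial_similarity` are hypothetical stand-ins, and only the three weights quoted in the runbook comment (0.35, 0.30, 0.15) are used. The remaining 0.20 of the score belongs to components this ADR does not describe, so the function returns a partial score.

```python
from dataclasses import dataclass, field


@dataclass
class SymptomPattern:
    """Hypothetical stand-in for src.models.playbook.SymptomPattern."""
    alert_names: list[str] = field(default_factory=list)
    affected_services: list[str] = field(default_factory=list)
    severity_range: list[str] = field(default_factory=list)


# Weights quoted in the runbook's expected-score comment.
# Assumption: any other components (the remaining 0.20) are ignored here.
WEIGHTS = {"alert_names": 0.35, "affected_services": 0.30, "severity": 0.15}


def jaccard(a: list[str], b: list[str]) -> float:
    """|A ∩ B| / |A ∪ B|; this sketch returns 0.0 when both sets are empty."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def partial_similarity(alert: SymptomPattern, playbook: SymptomPattern) -> float:
    """Weighted score over the three components named in this ADR."""
    scores = {
        "alert_names": jaccard(alert.alert_names, playbook.alert_names),
        # An empty affected_services on the Playbook means "applies to any service".
        "affected_services": (
            1.0 if not playbook.affected_services
            else jaccard(alert.affected_services, playbook.affected_services)
        ),
        # An empty severity_range on the Playbook means "applies to all severities".
        "severity": (
            1.0 if not playbook.severity_range
            or bool(set(alert.severity_range) & set(playbook.severity_range))
            else 0.0
        ),
    }
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


generic_playbook = SymptomPattern(alert_names=["HostHighCpuLoad"])
scoped_playbook = SymptomPattern(["HostHighCpuLoad"], ["momo-app"], ["critical"])
clean_alert = SymptomPattern(["HostHighCpuLoad"], ["momo-app"], ["critical"])
polluted_alert = SymptomPattern(["HostHighCpuLoad"], ["192.168.0.188:9100"], ["critical"])

print(round(partial_similarity(clean_alert, generic_playbook), 2))    # 0.8
print(round(partial_similarity(polluted_alert, scoped_playbook), 2))  # 0.5
```

The second call shows the cost of fault F1 on its own: an `IP:port` value in `affected_services` zeroes the service component for a service-scoped Playbook, even when the alert name and severity agree.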