awoooi/docs/adr/ADR-068-flywheel-coldstart-fix.md
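OG T b52e2de968 docs(adr068): flywheel cold-start fix closing document + Skills v2.8
- ADR-068: full record of the five root causes, the four-phase fix, the chief architect review, E2E acceptance, and the verification Runbook
- LOGBOOK: updated the current status and marked the loop as fully closed
- Skill 02 v2.8: added the chapter "Six Iron Rules of the Auto-Repair Flywheel" (affected_services/alert_name/Router layer/Jaccard/alertname variants/dual-track Embedding)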

2026-04-10 Asia/Taipei — Claude Sonnet 4.6
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 11:39:42 +08:00

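# ADR-068: Systematic Fix for the Auto-Repair Flywheel Cold-Start Gap
**Status**: Approved & Implemented
**Date**: 2026-04-10 (Taipei time)
**Decision maker**: ogt (commander)
**Executor**: Claude Sonnet 4.6
**Commit**: `c6edfb5` (four-phase fix) → `670cd5d` (chief architect review fixes)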
---
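## Background and Problem
All 25 `AUTO_REPAIR_TRIGGERED` events failed with `blocked_by: NO_MATCH`.
The Playbooks existed and the alerts existed, yet the flywheel could never start.
A systematic diagnosis found **five root causes present at the same time**; fixing any one of them in isolation would not have resolved the problem:
| Fault ID | Root cause | Impact |
|---------|------|------|
| F1 | `affected_services` pollution: filled with `[alertname]` or `[IP:port]` | Jaccard comparison is always 0 |
| F2 | `alert_name = "custom"` (not the real alertname) | Redis index lookup returns 0 hits |
| F3 | Redis Playbook index lacks real alertname variants | No Playbook is found for `HostHighCpuLoad` and similar alerts |
| F4 | Jaccard empty set is not exempted | Generic Playbooks (`affected_services=[]`) never match |
| F5 | Redis vector cache is cleared after a restart (cold start) | Phase 4 embedding search returns nothing |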
---
## Decision: Four-Phase Systematic Fix
One-by-one patching was rejected in favor of a root-cause fix.
### Phase 1: Signal Quality Fix (`webhooks.py` → `incident_service.py`)
**Problem**: `create_incident_for_approval` took `target_resource` (which falls back to the alertname or an IP) and wrote it directly into `affected_services`. The Signal's `alert_name = alert_type = "custom"`.
**Fix**:
- Added `extract_affected_services(labels, target_resource)`, with this priority order: `component > job (non-infrastructure) > pod (take the deployment name) > clean target_resource > []`
- The Signal's `alert_name` now uses the real `alertname` label
```python
# Before the fix (removed)
affected_services=[target_resource]  # polluted with IP:port
# After the fix (incident_service.py)
affected_services=extract_affected_services(labels, target_resource)  # semantic extraction
```
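
The priority order above can be sketched as follows. This is a minimal illustration, not the code in `incident_service.py`: the contents of `_INFRA_JOBS`, the regular expressions, the assumed pod naming scheme (`<deployment>-<replicaset hash>-<pod suffix>`), and the definition of a "clean" `target_resource` are all assumptions made for the example.

```python
import re
from typing import Optional

# Hypothetical set of infrastructure scrape jobs; the real list is not shown in this ADR.
_INFRA_JOBS = frozenset({"node-exporter", "kube-state-metrics", "cadvisor"})
# Values that look like host:port or a bare IPv4 address are never service names.
_HOST_PORT = re.compile(r"[^\s:]+:\d+")
_IPV4 = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")
# Assumed Deployment pod naming: <deployment>-<replicaset hash>-<pod suffix>.
_DEPLOY_POD = re.compile(r"(?P<deploy>.+)-[a-z0-9]{5,10}-[a-z0-9]{5}")


def _is_clean(target: Optional[str], alertname: Optional[str]) -> bool:
    """A target is 'clean' if it is neither the alertname nor an address."""
    if not target or target == alertname:
        return False
    return not (_HOST_PORT.fullmatch(target) or _IPV4.fullmatch(target))


def extract_affected_services(labels: dict, target_resource: Optional[str] = None) -> list:
    """component > job (non-infrastructure) > pod > clean target_resource > []."""
    component = labels.get("component")
    if component:
        return [component]
    job = labels.get("job")
    if job and job not in _INFRA_JOBS:
        return [job]
    pod = labels.get("pod")
    if pod:
        match = _DEPLOY_POD.fullmatch(pod)
        return [match.group("deploy") if match else pod]
    if _is_clean(target_resource, labels.get("alertname")):
        return [target_resource]
    return []
```

With the labels from the smoke test in the Runbook (`job="node-exporter"`, `instance="192.168.0.188:9100"`), this sketch returns `[]`, so a generic Playbook can still match through the empty-set exemption of Phase 3.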
### Phase 2: Playbook Index Expansion
**Problem**: Redis `playbook:index:alert:*` held only the alertnames from initial creation and lacked real-world variants.
**Fix**:
- Added 17 variants such as `HostHighCpuLoad`, `KubePodCrashLooping`, and `NodeMemoryUsageHigh` to 5 rules in `alert_rules.yaml`
- Ran `scripts/update_playbook_alert_variants.py` for a one-time backfill of the Redis index
- Verification: `HostHighCpuLoad → ['PB-20260406-488671']`
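
A one-time backfill of this kind might look like the sketch below. It is not the contents of `scripts/update_playbook_alert_variants.py`. The function name and the `InMemorySetStore` stand-in are hypothetical, and the store only needs `sadd` and `smembers` methods shaped like those of a Redis client. The key prefix and the `HostHighCpuLoad` mapping come from this document.

```python
from collections import defaultdict


class InMemorySetStore:
    """Hypothetical stand-in for a Redis client, exposing only sadd/smembers."""

    def __init__(self):
        self._sets = defaultdict(set)

    def sadd(self, key, *members):
        before = len(self._sets[key])
        self._sets[key].update(members)
        return len(self._sets[key]) - before  # number of newly added members

    def smembers(self, key):
        return set(self._sets[key])


def backfill_alert_index(store, variants):
    """Register each playbook ID under every alertname variant it should match.

    `variants` maps playbook_id -> iterable of alertnames. Returns how many
    set members were newly added, so a second run reports 0 (idempotent).
    """
    added = 0
    for playbook_id, alert_names in variants.items():
        for name in alert_names:
            added += store.sadd(f"playbook:index:alert:{name}", playbook_id)
    return added


store = InMemorySetStore()
backfill_alert_index(store, {"PB-20260406-488671": ["HostHighCpuLoad"]})
print(store.smembers("playbook:index:alert:HostHighCpuLoad"))
```

Because set insertion is idempotent, rerunning such a script after adding more variants to `alert_rules.yaml` should be safe.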
### Phase 3: Jaccard Empty-Set Exemption (`similarity.py`)
**Problem**: Generic infrastructure Playbooks (`affected_services=[]`, `severity_range=[]`) scored 0.0 on Jaccard against any alert.
**Fix**:
```python
"affected_services": (
    1.0 if not pattern_b.affected_services  # exemption for generic Playbooks
    else calculate_jaccard_similarity(...)
),
"severity": (
    1.0 if not pattern_b.severity_range  # exemption: applies to every severity
    or bool(set(pattern_a.severity_range) & set(pattern_b.severity_range))
    else 0.0
),
```
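
To see why the exemption is needed, recall that Jaccard similarity is |A ∩ B| / |A ∪ B|, so any non-empty set compared against an empty one scores 0. The standalone sketch below illustrates this. The helper names are hypothetical and are not the project's `calculate_jaccard_similarity`.

```python
def jaccard(a: set, b: set) -> float:
    """Plain Jaccard similarity: |A & B| / |A | B|."""
    union = a | b
    if not union:  # both sets empty; guard kept so this helper is safe on its own
        return 0.0
    return len(a & b) / len(union)


def services_score(alert_services: list, playbook_services: list) -> float:
    """An empty Playbook list means 'applies to any service', so it scores 1.0."""
    if not playbook_services:
        return 1.0
    return jaccard(set(alert_services), set(playbook_services))


def severity_score(alert_severities: list, playbook_severities: list) -> float:
    """An empty Playbook range means 'applies to every severity'."""
    if not playbook_severities:
        return 1.0
    return 1.0 if set(alert_severities) & set(playbook_severities) else 0.0


# Without the exemption, a generic Playbook can never match on services:
print(jaccard({"momo-app"}, set()))        # 0.0
# With the exemption, the same comparison scores full marks:
print(services_score(["momo-app"], []))    # 1.0
```

The exemption is asymmetric by design: only an empty set on the Playbook side is treated as a wildcard, while an alert with no services compared against a specific Playbook still scores 0.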
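### Phase 4: Embedding Cold-Start Fix
**Problem**: After a restart Redis was empty, the vector cache was gone, and Phase 4 semantic search degraded to empty results.
**Fix**:
- New `migrations/flywheel_playbook_embeddings.sql`: pgvector persistence table
- New `services/playbook_embedding_service.py`: non-blocking rebuild at startup (`asyncio.create_task`)
- New `repositories/playbook_embedding_repository.py`: UPSERT Repository
- Startup acceptance: 18/18 Playbooks indexed successfully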
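
The non-blocking startup rebuild can be sketched with `asyncio.create_task` as below. This is an illustration under assumptions rather than the code in `playbook_embedding_service.py`: `rebuild_embeddings`, `schedule_rebuild`, and the dummy one-element vector are hypothetical placeholders for the real embedding call and pgvector upsert.

```python
import asyncio


async def rebuild_embeddings(playbook_ids, index):
    """Placeholder rebuild: the real service would embed and upsert each Playbook."""
    for playbook_id in playbook_ids:
        await asyncio.sleep(0)        # stands in for an embedding call or DB round trip
        index[playbook_id] = [0.0]    # dummy vector
    return len(index)


def _log_failure(task):
    """Surface errors from the background task instead of losing them silently."""
    if not task.cancelled() and task.exception() is not None:
        print(f"embedding rebuild failed: {task.exception()!r}")


async def schedule_rebuild(playbook_ids, index):
    """Schedule the rebuild and return at once, so startup is not blocked."""
    task = asyncio.create_task(rebuild_embeddings(playbook_ids, index))
    task.add_done_callback(_log_failure)
    return task  # the caller should keep this reference until the task finishes


async def main():
    index = {}
    task = await schedule_rebuild(["PB-1", "PB-2"], index)
    print("startup continues, indexed so far:", len(index))
    print("indexed after rebuild:", await task)


asyncio.run(main())
```

Two details matter in practice: the event loop holds only a weak reference to tasks, so the returned task should be stored somewhere, and a done callback (or an eventual `await`) is needed for failures to become visible.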
---
## Chief Architect Review Fixes (`670cd5d`)
The initial implementation scored 78/100; after the chief architect review fixes it reached **97/100**.
| # | Issue | Fix |
|---|------|------|
| C1 | `playbook_embedding_service` called `db.execute(text(...))` directly, bypassing the Repository | Created `PlaybookEmbeddingRepository` |
| C2 | `create_incident_for_approval` business logic lived in the Router layer | Moved into `incident_service.py` |
| I1 | `_infra_jobs` rebuilt its set on every call | Promoted to a module-level `frozenset` |
| I2 | `_persist_embeddings_to_db` parameters had no type annotations | Added `PlaybookRAGService / list[Playbook]` |
| I3 | The format of `str(embedding)` was not guaranteed | Explicit `"[" + ",".join(str(float(x))...) + "]"` |
| I4 | `import asyncio` was inside a `try` block | Moved to the top level of `main.py` |
| M1 | `if union > 0 else 0.0` was dead code | Removed |
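
Item I3 can be illustrated with a small helper. The function name `to_pgvector_literal` is hypothetical; the expression inside it is the one quoted in the table. The motivation, as far as this document states it, is that `str()` output depends on the container type (for instance, `str()` of a NumPy array omits commas), whereas the explicit join always yields a bracketed, comma-separated list of floats.

```python
def to_pgvector_literal(embedding) -> str:
    """Serialize any iterable of numbers as a bracketed, comma-separated literal."""
    return "[" + ",".join(str(float(x)) for x in embedding) + "]"


print(to_pgvector_literal([1, 0.5, -2]))  # [1.0,0.5,-2.0]
```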
---
## E2E Acceptance Results
```
2026-04-10 03:20:03 ALERT_RECEIVED HostHighCpuLoad
2026-04-10 03:20:25 AUTO_REPAIR_TRIGGERED ok=True high-cpu-restart
2026-04-10 03:20:26 EXECUTION_COMPLETED ok=True PB-20260406-488671
2026-04-10 03:20:27 TELEGRAM_SENT ok=True approval_card
```
**The flywheel went from 100% failure to triggering successfully, with no more NO_MATCH.**
---
## Verification Method (Runbook)
### 1. Quick Smoke Test
```bash
# Trigger a test alert
curl -X POST http://<node>:32334/api/v1/webhooks/alertmanager \
  -H 'Content-Type: application/json' \
  -d '{"version":"4","groupKey":"test","status":"firing","receiver":"awoooi",
       "alerts":[{"status":"firing",
         "labels":{"alertname":"HostHighCpuLoad","severity":"critical",
                   "instance":"192.168.0.188:9100","job":"node-exporter"},
         "annotations":{"summary":"Test","description":"E2E Test"},
         "startsAt":"2026-04-10T00:00:00Z"}]}'
# Verify the result
kubectl exec -n awoooi-prod deployment/awoooi-api -- python3 -c "
import asyncio, asyncpg, os
async def q():
    conn = await asyncpg.connect(os.environ['DATABASE_URL'].replace('postgresql+asyncpg://','postgresql://'))
    rows = await conn.fetch('SELECT event_type, success, action_detail FROM alert_operation_log ORDER BY created_at DESC LIMIT 5')
    for r in rows:
        print(r['event_type'], r['success'], r['action_detail'])
    await conn.close()
asyncio.run(q())
"
```
**Expected output**:
```
AUTO_REPAIR_TRIGGERED True high-cpu-restart
EXECUTION_COMPLETED True playbook:PB-xxxxxxxx
TELEGRAM_SENT True approval_card
```
### 2. Embedding Health Check
```bash
kubectl exec -n awoooi-prod deployment/awoooi-api -- python3 -c "
import asyncio, asyncpg, os
async def q():
    conn = await asyncpg.connect(os.environ['DATABASE_URL'].replace('postgresql+asyncpg://','postgresql://'))
    rows = await conn.fetch('SELECT playbook_id, array_length(alert_names,1) FROM playbook_embeddings')
    print(f'playbook_embeddings: {len(rows)} rows')
    for r in rows[:5]:
        print(' ', r['playbook_id'], r['array_length'])
    await conn.close()
asyncio.run(q())
"
# Expected: 18 rows, each showing its number of alert_names
```
### 3. Redis Index Check
```bash
kubectl exec -n awoooi-prod deployment/awoooi-api -- python3 -c "
import asyncio
from src.core.redis_client import init_redis_pool, get_redis
async def q():
    await init_redis_pool()
    r = get_redis()
    for alert in ['HostHighCpuLoad','KubePodCrashLooping','NodeMemoryUsageHigh']:
        members = await r.smembers(f'playbook:index:alert:{alert}')
        print(f'{alert}: {[m.decode() for m in members]}')
asyncio.run(q())
"
```
### 4. Similarity Calculation Unit Check
```python
from src.utils.similarity import calculate_symptom_similarity
from src.models.playbook import SymptomPattern
# A generic Playbook (affected_services=[]) should score high against any alert
playbook_pattern = SymptomPattern(alert_names=["HostHighCpuLoad"], affected_services=[], severity_range=[])
alert_pattern = SymptomPattern(alert_names=["HostHighCpuLoad"], affected_services=["momo-app"], severity_range=["critical"])
score = calculate_symptom_similarity(alert_pattern, playbook_pattern)
assert score >= 0.5, f"Generic Playbook score too low: {score}"
# Expected: 0.35 (exact alert_names match) + 0.30 (service exemption) + 0.15 (severity exemption) = 0.80
```
---
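## Architecture Impact
| Component | Before | After |
|------|--------|--------|
| `webhooks.py` | Contained business-logic functions | Pure Router; business logic lives in the Service |
| `incident_service.py` | Only in-memory access | + `create_incident_for_approval` + `extract_affected_services` |
| `similarity.py` | Empty sets scored 0 | Generic Playbooks are exempted and score 1.0 |
| `playbook_embeddings` | Did not exist | pgvector persistence table, rebuilt automatically at startup |
| `PlaybookEmbeddingRepository` | Did not exist | UPSERT Repository layer |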
---
## Related ADRs
- ADR-030: Intelligent auto-repair system (original flywheel design)
- ADR-067: Ollama local AI (nomic-embed-text vectors)
- ADR-064: YAML-driven alert rule engine