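docs(adr068): flywheel cold-start fix closing document + Skills v2.8

- ADR-068: full record of the five root causes, the four-phase fix, the chief architect review, E2E acceptance, and the verification runbook
- LOGBOOK: update the current status and mark the loop as fully closed
- Skill 02 v2.8: add the "Six Iron Rules of the Auto-Repair Flywheel" section (affected_services / alert_name / Router layer / Jaccard / alertname variants / dual-track embedding)

2026-04-10 Asia/Taipei - Claude Sonnet 4.6
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>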
@@ -38,6 +38,7 @@
| v2.5 | 2026-04-01 | Claude Code | ♻️ Phase R-R2 complete (legacy -971 lines) + R-R2.1 P0/P1 fixes + ADR-046 type unification |
| v2.6 | 2026-04-08 | Claude Code | 🛡️ Sprint 5.1 Data Safety Guardrails: Service Registry pattern + review-correction iron rule |
| v2.7 | 2026-04-09 | Claude Sonnet 4.6 | 🔧 ADR-066 approval-execution closed-loop fix: iron rule for backfilling Nemotron tool→kubectl_command |
| v2.8 | 2026-04-10 | Claude Sonnet 4.6 | 🚀 ADR-068 flywheel cold-start fix iron rules: affected_services / Router-layer business logic / Jaccard exemption / embedding persistence |

---

@@ -1051,6 +1052,89 @@ asyncio.create_task(auto_generate_rule(

---

## 🚀 Auto-Repair Flywheel Iron Rules (ADR-068, 2026-04-10)

> **Background**: all 25 AUTO_REPAIR_TRIGGERED events ended in NO_MATCH because five root causes were present at the same time.

### 1. affected_services extraction rule

**Never** put `target_resource` (which may be an IP:port or an alertname) directly into `affected_services`.

```python
# ❌ Strictly forbidden (pollutes Jaccard matching)
affected_services = [target_resource]  # may be "192.168.0.188:9100" or "HostHighCpuLoad"

# ✅ Correct: semantic extraction (in incident_service.py)
affected_services = extract_affected_services(labels, target_resource)
# Priority: component > job (non-infrastructure) > pod (deployment name) > clean target > []
```

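The priority order above could be implemented roughly as follows. This is a minimal sketch, not the actual `incident_service.py` code: the `INFRA_JOBS` set, the IP:port check, and the pod-name suffix pattern are all assumptions made for illustration.

```python
import re
from collections.abc import Mapping

# Hypothetical list of infrastructure jobs; the real list is not given in the source.
INFRA_JOBS = frozenset({"node-exporter", "kube-state-metrics", "prometheus", "kubelet"})

# Matches "192.168.0.188" or "192.168.0.188:9100".
_IP_PORT = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?$")
# Assumes ReplicaSet-style pod names: <deployment>-<replicaset hash>-<pod suffix>.
_POD_SUFFIX = re.compile(r"-[a-z0-9]{5,10}-[a-z0-9]{5}$")


def extract_affected_services(labels: Mapping[str, str], target_resource: str = "") -> list[str]:
    """Return service names in priority order: component > job > pod > clean target > []."""
    component = labels.get("component", "").strip()
    if component:
        return [component]

    job = labels.get("job", "").strip()
    if job and job not in INFRA_JOBS:
        return [job]

    pod = labels.get("pod", "").strip()
    if pod:
        # Strip the generated suffix to recover the deployment name.
        return [_POD_SUFFIX.sub("", pod)]

    target = target_resource.strip()
    # A "clean" target is neither an IP:port nor the alertname itself.
    if target and not _IP_PORT.match(target) and target != labels.get("alertname"):
        return [target]

    return []
```

Returning `[]` when nothing usable is found keeps noise such as `"192.168.0.188:9100"` out of the set that Jaccard matching later compares.
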
### 2. Signal alert_name rule

```python
# ❌ Forbidden: alert_name="custom" makes the Redis index lookup return zero hits
alert_name = alert_type  # "custom"

# ✅ Correct: use the real alertname label
alert_name = alertname or alert_type  # "HostHighCpuLoad"
```

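To illustrate why the generic value misses, here is a small sketch in which a plain dict stands in for the Redis index. `PLAYBOOK_INDEX`, `resolve_alert_name`, and `candidate_playbooks` are hypothetical names, not part of the codebase.

```python
# Hypothetical in-memory stand-in for the Redis index, keyed by alert name.
PLAYBOOK_INDEX = {"HostHighCpuLoad": ["pb-high-cpu"]}


def resolve_alert_name(labels: dict[str, str], alert_type: str) -> str:
    """Prefer the real alertname label; fall back to the generic alert type."""
    return (labels.get("alertname") or "").strip() or alert_type


def candidate_playbooks(alert_name: str) -> list[str]:
    """Look up playbook IDs by alert name; unknown names yield no candidates."""
    return PLAYBOOK_INDEX.get(alert_name, [])
```

Looking up `"custom"` returns no candidates, while the resolved name `"HostHighCpuLoad"` finds the indexed playbook.
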
### 3. Router-layer business logic rule

Functions such as `create_incident_for_approval`, which contain severity mapping, Signal creation, and Incident creation, **must** live in the Service layer:

```
# ✅ Correct location
apps/api/src/services/incident_service.py  ← create_incident_for_approval()
                                           ← extract_affected_services()

# ❌ Wrong location (already fixed)
apps/api/src/api/v1/webhooks.py            ← business logic does not belong in the Router
```

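A framework-free sketch of that split is shown below. The signature of `create_incident_for_approval`, the `Incident` dataclass, the `SEVERITY_MAP` values, and the `handle_alert_webhook` handler are assumptions for illustration. The real function also creates a Signal, which is omitted here.

```python
from dataclasses import dataclass

# Hypothetical severity mapping; the real table is not shown in the source.
SEVERITY_MAP = {"critical": "P1", "warning": "P2", "info": "P3"}


@dataclass
class Incident:
    alert_name: str
    severity: str


# Service layer (would live in incident_service.py): owns the business rules.
def create_incident_for_approval(labels: dict[str, str]) -> Incident:
    severity = SEVERITY_MAP.get(labels.get("severity", ""), "P3")
    return Incident(alert_name=labels.get("alertname", "custom"), severity=severity)


# Router layer (would live in webhooks.py): parse, delegate, shape the response.
def handle_alert_webhook(payload: dict) -> dict:
    incident = create_incident_for_approval(payload.get("labels", {}))
    return {"alert_name": incident.alert_name, "severity": incident.severity}
```

Because the handler only delegates, the severity mapping can be unit-tested and reused without going through the HTTP layer.
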
### 4. Jaccard empty-set exemption rule

A generic infrastructure Playbook (`affected_services=[]`, `severity_range=[]`) applies to every situation, so an empty set **must not** drive its Jaccard score to 0:

```python
# apps/api/src/utils/similarity.py: exemption rules
"affected_services": 1.0 if not pattern_b.affected_services else jaccard(...)
"severity": 1.0 if not pattern_b.severity_range or overlap else 0.0
```

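The exemption can be sketched as standalone functions. This is only an illustration under stated assumptions: the function names are hypothetical, and severity "overlap" is interpreted here as simple membership of the incident severity in the Playbook's range.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Plain Jaccard index: |A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def affected_services_score(incident_services: list[str], playbook_services: list[str]) -> float:
    # An empty Playbook list means "applies to any service": exempt from Jaccard.
    if not playbook_services:
        return 1.0
    return jaccard(set(incident_services), set(playbook_services))


def severity_score(incident_severity: str, playbook_severity_range: list[str]) -> float:
    # An empty range means "applies to any severity".
    if not playbook_severity_range:
        return 1.0
    return 1.0 if incident_severity in playbook_severity_range else 0.0
```

Note that the exemption is one-sided: only an empty set on the Playbook side scores 1.0. An incident with no extracted services still scores 0.0 against a Playbook that names specific services.
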
### 5. Playbook alertname variant rule

A Playbook's `symptom_pattern.alert_names` must include every alertname variant that occurs in the real world:

```yaml
# apps/api/alert_rules.yaml: add enough variants to every rule
- id: high_cpu
  match:
    alertname:
      - HighCPUUsage       # Prometheus rule name
      - HostHighCpuLoad    # node-exporter variant
      - CPUThrottlingHigh  # K8s variant
```

### 6. Embedding persistence rule

Playbook vectors **must** be stored in both Redis (hot cache) and `playbook_embeddings` (pgvector persistence) to prevent a cold-start gap after a restart:

```python
# main.py lifespan, at startup (non-blocking)
asyncio.create_task(ensure_playbook_embeddings_indexed())
```

The Repository layer is responsible for formatting:

```python
# ✅ Correct: PlaybookEmbeddingRepository.upsert()
vec_str = "[" + ",".join(str(float(x)) for x in embedding) + "]"  # pgvector-safe format

# ❌ Forbidden: str(embedding) may produce output containing spaces
```

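The formatting rule can be wrapped in a small helper, sketched below. `to_pgvector_literal` is a hypothetical name; the body is the same expression used in `PlaybookEmbeddingRepository.upsert()` above.

```python
from collections.abc import Iterable


def to_pgvector_literal(embedding: Iterable[float]) -> str:
    """Format a vector as '[x1,x2,...]' with no spaces, coercing every element to float."""
    return "[" + ",".join(str(float(x)) for x in embedding) + "]"


# For comparison: str() on a Python list inserts a space after each comma.
naive = str([0.1, 0.2])           # "[0.1, 0.2]"
safe = to_pgvector_literal([0.1, 0.2])  # "[0.1,0.2]"
```

Coercing through `float()` also normalizes integer or NumPy scalar inputs, so the output does not depend on the element type the embedding model returns.
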
---

## References

- `apps/api/src/core/config.py`: configuration center
@@ -1067,3 +1151,4 @@ asyncio.create_task(auto_generate_rule(
- ADR-008: Python modular independent building-block architecture
- ADR-027: Incident-Approval synchronization architecture (UnitOfWork + Saga)
- ADR-064: Alert Rule Engine, YAML-driven + AI auto-learning
- ADR-068: Flywheel cold-start gap fix, a four-phase systemic root-cause remediation of affected_services/Jaccard/Embedding