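docs(spec): ADR-073/074 AIOps flywheel full repair, integrated specification v1.0

Integrates a complete solution across four layers:
- Layer 1, ADR-073-A: emergency unblocking (CD fix / alertname / debounce / Playbook cold start / KM vectorization)
- Layer 2, ADR-073-B: routing fixes (triage station / SSH path / action parsing / KMConversionService)
- Layer 3, ADR-074: monitoring coverage (flywheel health exporter / network / DNS / Gitea CI / backup restore test)
- Layer 4, ADR-073-C: real-time frontend flywheel (real API / WebSocket / KPI panel)

Sources: ADR-073 inventory + v2.2 spec §14.11 ADR-071 work sequence + monitoring gap inventory + finalized flywheel screenshot

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>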
@@ -0,0 +1,554 @@
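# AWOOOI AIOps Full Repair and Long-Term Solution

## Flywheel Repair + Monitoring Coverage + Real-Time Frontend: Complete Integrated Specification

> **Document version**: v1.0
> **Created**: 2026-04-12 (Taipei time)
> **Status**: 🔴 Awaiting the Commander's approval before implementation
> **Sources**: ADR-073 (flywheel inventory) + v2.2 spec §14.11 (ADR-071 work sequence) + monitoring coverage inventory + finalized flywheel screenshot + MCP role analysis
> **Author**: Claude Sonnet 4.6
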
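---

## 1. Core Diagnosis: Why the Flywheel Has Never Turned

### 1.1 Six-Node Flywheel Architecture (finalized from the screenshot)

```
Monitoring → Ingestion & Dedup → Environment Diagnosis → Reasoning & Matching → Execution → Learning
     ↑                                                                                        ↓
     └─────────────────────────────── flywheel closed loop ───────────────────────────────────┘
```

Human intervention happens in only two cases:

- **TYPE-3**: the AI found a fix, but `risk_level=critical` or a DESTRUCTIVE_PATTERNS match occurred → execute after human approval
- **TYPE-4**: the AI cannot decide (`confidence < 0.5` or every MCP call failed) → a human fixes it manually → the steps are recorded → they are converted into a Playbook automatically
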
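The two escalation cases above can be sketched as a small decision function. This is a minimal illustration, not the project's implementation: the `Analysis` dataclass, its field names, and the choice to check TYPE-4 before TYPE-3 are assumptions; only the `0.5` threshold and the two conditions come from the text.

```python
from dataclasses import dataclass
from typing import Optional

CONFIDENCE_FLOOR = 0.5  # threshold quoted in the spec


@dataclass
class Analysis:
    """Hypothetical container for the fields the two rules depend on."""
    confidence: float
    risk_level: str
    destructive_match: bool = False  # a DESTRUCTIVE_PATTERNS rule matched
    mcp_all_failed: bool = False     # every MCP provider call failed


def human_intervention_type(analysis: Analysis) -> Optional[str]:
    """Return "TYPE-3", "TYPE-4", or None when no human is needed."""
    # Assumption: an undecidable analysis wins over a risky one, because a
    # low-confidence fix should not even be offered for approval.
    if analysis.confidence < CONFIDENCE_FLOOR or analysis.mcp_all_failed:
        return "TYPE-4"
    if analysis.risk_level == "critical" or analysis.destructive_match:
        return "TYPE-3"
    return None
```

Returning `None` for the remaining cases keeps the automatic path as the default, which matches the statement that humans are involved in only these two situations.
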
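### 1.2 The Role of MCP: The Only Mechanism Against Hallucination

MCP sits at the "Environment Diagnosis" node. It is the layer that collects a snapshot of the real environment before the LLM reasons:

```
Without MCP: the LLM guesses blindly → target=unknown → EXECUTION_FAILED (55%)
With MCP:    SSH/K8s query the real state → target=api-worker-7d8f9b → EXECUTION_SUCCESS
```

**Current problem**: `_collect_mcp_context()` was committed in `8be87b0`, but CD failed, so the Pods still run `a86ecf3` (the old version, which lacks this function). All 10 MCP Providers are enabled but have no effect.
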
### 1.3 Database Reality (2026-04-12)

| Metric | Current | Target |
|------|------|------|
| Playbooks | **0** | ≥ 20 |
| EXECUTION_SUCCESS rate | **0.5% (2/380)** | ≥ 30% |
| KM vectorized | **0% (all 699 entries False)** | ≥ 90% |
| alertname NULL rate | **100%** | 0% |
| incidents.outcome NULL | **100%** | 0% |
| debounce window | **5 minutes** | 30 minutes |
| Actions stored as 「未知操作\|」 ("unknown operation\|") | **93 rows** | 0 |

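### 1.4 Actual State of Each Flywheel Node

| Node | Status | Root problem |
|------|------|---------|
| Monitoring | ⚠️ Partially working | `signals->0->>'alertname'` is always NULL; alertname is written to the wrong location |
| Ingestion & Dedup | ⚠️ Abnormal | 5-minute debounce; the same problem recreates an Incident every 5 minutes |
| Environment Diagnosis | ❌ Not running | `8be87b0` is not deployed; `_collect_mcp_context()` does not exist in the Pods |
| Reasoning & Matching | ❌ Severe | Playbooks=0 (100% goes to manual review); alertname NULL; target parsed incorrectly |
| Execution | ❌ 0.5% | 93 「未知操作\|」 actions; Docker/Host alerts take the K8s path; Pod hash is hard-coded |
| Learning | ❌ Completely broken | No Playbook has ever been generated; 690 KM entries are not vectorized; outcome is always NULL |
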
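---

## 2. Integrated Solution Architecture

The solution has four layers, executed in priority order:

```
Layer 1: Emergency unblocking (ADR-073-A)  - get the flywheel moving for the first time
    ↓
Layer 2: Routing fixes (ADR-073-B)         - alerts take the right path and are classified correctly
    ↓
Layer 3: Monitoring coverage (ADR-074)     - widen the net and close the gaps
    ↓
Layer 4: Real-time frontend (ADR-073-C)    - let people see the flywheel turning
```

> **Iron rule**: do not pour water into a leaking bucket. Expanding monitoring (Layer 3) only makes sense after the flywheel itself is repaired (Layers 1 and 2).
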
---

## 3. Layer 1: Emergency Unblocking (ADR-073-A)

### Goal: get the flywheel genuinely moving for the first time

**Execution order**: A1 → A2 → A3 → A4 → A5 (A1 is the prerequisite for everything after it)

### A1: CD fix + deploy `8be87b0`

**Problem**: the CI build succeeded and the image is in Harbor, but the CD `git push` was rejected as non-fast-forward, so `kustomization.yaml` still points to `a86ecf3`.

**Fix**:
1. Confirm that Harbor holds the api/web/worker images for `8be87b0`
2. Manually update `newTag` in `kustomization.yaml` to `8be87b0`
3. `git push gitea main` (after rebasing onto the remote `main`, so that the push is a fast-forward) to trigger the ArgoCD sync
4. Verify that the Pod image is `8be87b0` and that `_collect_mcp_context` exists

**Acceptance**:
```bash
kubectl get pods -n awoooi-prod -o jsonpath='{.items[*].spec.containers[*].image}'
# The output should contain :8be87b0
```

**Tier 2 (may be executed autonomously; notify the Commander when done)**

---

### A2: Fix NULL alertname in signals

**Problem**: when `webhooks.py` creates an Incident it writes alertname into the `signals` JSONB array, yet `signals->0->>'alertname'` returns NULL at query time, which means the stored structure does not match the query path.

**Fix**:
- In `alertmanager_webhook()` in `webhooks.py`, make sure each signal is built with `{"alertname": alert["labels"]["alertname"], ...}`
- Also add a dedicated column `incidents.alertname VARCHAR(100)` (handled in A-DB-Migration)
- For the 132 existing rows with a NULL alertname: a one-off backfill script that recovers the value from the `signals` JSONB

**Acceptance**: the `alertname` column of every new Incident is not NULL

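For `signals->0->>'alertname'` to return a value, element 0 of the stored array has to be an object with a top-level `alertname` key. The sketch below shows one way to build such signals from an Alertmanager webhook payload. The function name `build_signals` and the extra keys kept on each signal are assumptions; only the `alerts[].labels.alertname` location comes from the Alertmanager payload format referenced in the fix.

```python
from typing import Any, Dict, List


def build_signals(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten an Alertmanager webhook payload into a list of signal objects."""
    signals: List[Dict[str, Any]] = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        signals.append({
            # Top-level key, so signals->0->>'alertname' resolves in SQL.
            "alertname": labels.get("alertname"),
            "severity": labels.get("severity"),
            "fingerprint": alert.get("fingerprint"),
            "labels": labels,  # keep the full label set for later routing
        })
    return signals
```

The same function could feed the dedicated `incidents.alertname` column by reading `signals[0]["alertname"]` at insert time, so the JSONB value and the column never disagree.
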
---

### A3: debounce window 5 → 30 minutes

**Problem**: Prometheus fires the alert once every minute. When the 5-minute TTL expires the Incident is recreated, so the same problem keeps sending Telegram messages.

**Fix**: in `webhooks.py`, change the TTL of the fingerprint Redis key from `300` (5 minutes) to `1800` (30 minutes)

**Acceptance**: within 30 minutes, the same alert fingerprint creates only one `INVESTIGATING` Incident

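The debounce rule can be illustrated with an in-memory stand-in for the Redis key. This is only a sketch: the class name and the injectable clock are hypothetical, and the dictionary imitates what a Redis `SET key value NX EX 1800` call would do in `webhooks.py`.

```python
import time
from typing import Callable, Dict

DEBOUNCE_TTL_SECONDS = 1800  # 30 minutes, the new value from A3


class FingerprintDebouncer:
    """In-memory imitation of a Redis key written with NX and a TTL."""

    def __init__(self, ttl: int = DEBOUNCE_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._expires_at: Dict[str, float] = {}

    def should_create_incident(self, fingerprint: str) -> bool:
        """True only for the first alert of a fingerprint within the window."""
        now = self._clock()
        expiry = self._expires_at.get(fingerprint)
        if expiry is not None and now < expiry:
            return False  # still inside the window: reuse the open Incident
        # Key absent or expired: start a new window, like SET ... NX EX ttl.
        self._expires_at[fingerprint] = now + self._ttl
        return True
```

Because the window is not extended by repeated alerts, a problem that is still firing after 30 minutes opens a fresh Incident, which mirrors plain TTL expiry of the key.
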
---

### A4: Cold-start Playbook generation script (one-off)

**Problem**: Playbooks = 0, so `evaluate_auto_repair()` always takes the manual-review path.

**Fix**: a one-off script, `scripts/cold_start_playbooks.py`:
1. Read the 2 rows in `approval_records` with `status=EXECUTION_SUCCESS`
2. Read the top 20 rows with `status=APPROVED` whose action is not 「未知操作|」
3. For each row, call the LLM to generate a Playbook draft (alertname + action + target → YAML)
4. Write the result to the `playbooks` table with `status=APPROVED` and `source=cold_start`

**Acceptance**: `SELECT count(*) FROM playbooks` ≥ 15

---

### A5: Batch vectorization script for the 690 KM entries (one-off)

**Problem**: all 690 `knowledge_entries` rows have `vectorized=False`, so RAG queries never find historical cases.

**Fix**: a one-off script, `scripts/batch_vectorize_km.py`:
1. Read entries with `vectorized=False` in batches (batch size = 50)
2. Call `EmbeddingService.embed()` to generate the vector
3. Write it back to the `embedding` column and set `vectorized=True`
4. Write failed records to the error log without interrupting the overall run

**Acceptance**: `SELECT count(*) FROM knowledge_entries WHERE vectorized=False` = 0

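The four steps above can be sketched as a loop that keeps going when a single entry fails. This is a minimal illustration on plain dictionaries: the `embed` callable stands in for `EmbeddingService.embed()`, and the entry keys `id` and `content` are assumptions about the row shape.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

BATCH_SIZE = 50  # batch size given in the spec


def batches(items: Sequence[dict], size: int = BATCH_SIZE) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def vectorize_entries(
    entries: List[dict],
    embed: Callable[[str], List[float]],
) -> Tuple[int, List[Tuple[int, str]]]:
    """Embed every unvectorized entry; return (success count, failures)."""
    done = 0
    failures: List[Tuple[int, str]] = []
    for batch in batches(entries):
        for entry in batch:
            if entry.get("vectorized"):
                continue  # already processed, e.g. on a re-run
            try:
                entry["embedding"] = embed(entry["content"])
                entry["vectorized"] = True
                done += 1
            except Exception as exc:  # log and continue, per step 4
                failures.append((entry["id"], str(exc)))
    return done, failures
```

Skipping entries that are already vectorized makes the script safe to run again after a partial failure, which is what lets the acceptance query eventually reach zero.
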
---

## 4. Layer 2: Routing Fixes (ADR-073-B + ADR-071 integration)

### Goal: alerts take the right path, are classified correctly, and notification quality meets the bar

**Execution order**: B-DB → B1 → B2 → B3 → B4 → B5 → B6 → B7 → B8

---

### B-DB: DB Migration (ADR-071-A)

**Add 9 columns to the incidents table**:

```sql
-- Note: a PgEnum ADD VALUE must run in its own transaction; the new value cannot be used within the same transaction
ALTER TABLE incidents ADD COLUMN alertname VARCHAR(100);
ALTER TABLE incidents ADD COLUMN notification_type VARCHAR(10);
ALTER TABLE incidents ADD COLUMN alert_category VARCHAR(50);
ALTER TABLE incidents ADD COLUMN context_bundle JSONB;
ALTER TABLE incidents ADD COLUMN metrics_before JSONB;
ALTER TABLE incidents ADD COLUMN metrics_after JSONB;
ALTER TABLE incidents ADD COLUMN verification_result JSONB;
ALTER TABLE incidents ADD COLUMN manual_fix_steps TEXT;
ALTER TABLE incidents ADD COLUMN manual_fix_by VARCHAR(100);

-- alert_operation_log: add 5 new event_type values
-- (whether this is an ALTER TYPE or a JSONB field depends on the existing EventType enum structure)
```

**Tier 1 (independent migration, can be executed directly)**

---

### B1: Move the triage station forward (ADR-071-A0 + ADR-071-B)

**Root cause**: `classify_notification()` used to classify only just before `telegram_gateway` pushed the message. By then, alerts such as HostBackupFailed and ConfigDrift had already gone through the whole repair analysis flow and produced invalid commands like "kubectl scale deployment unknown".

**Fix**: move the classification guard to the entry of `decision_manager._analyze()`, before step 5 (the automatic-execution decision):

```python
# decision_manager.py: new classification guard
alert_category, notification_type = classify_alert_early(incident)
incident.alert_category = alert_category
incident.notification_type = notification_type
await db.commit()

if notification_type == "TYPE-1":
    # Informational only: send a plain-text notification and skip the repair flow
    await telegram_gateway.send_info_notification(incident)
    return
elif notification_type == "TYPE-4D":
    # Config drift: send the drift card and skip the repair flow
    await telegram_gateway.send_drift_card(incident)
    return
# TYPE-3: continue to auto_approve → manual review or automatic execution
```

**Classification table**:

| alertname pattern | severity | Category | notification_type |
|---------------|----------|---------|-------------------|
| `*Backup*` / `*Heartbeat*` | any | backup / info | TYPE-1 |
| `*Info*` / severity=info/none | info/none | info | TYPE-1 |
| `ConfigurationDrift` / `KubeConfigDrift` | any | config_drift | TYPE-4D |
| `Docker*` / `Host*` | warning/critical | infrastructure | TYPE-3 (SSH path) |
| `Kube*` / `Pod*` / `Deploy*` | warning/critical | kubernetes | TYPE-3 (K8s path) |
| `Postgres*` / `Redis*` | warning/critical | database | TYPE-3 |
| others | warning/critical | general | TYPE-3 |

**Acceptance**: a `HostBackupFailed` alert enters `decision_manager` → `alert_category="backup"`, `notification_type="TYPE-1"`; it does not enter the auto_approve flow, and Telegram receives a plain-text card with no buttons

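The classification table is order-sensitive: `HostBackupFailed` must hit the `*Backup*` row before `Host*`, and `KubeConfigDrift` must hit the drift row before `Kube*`. The sketch below encodes the rows top to bottom. It is an illustration under assumptions: the signature takes `alertname` and `severity` directly instead of an `incident`, and mapping `*Backup*` to `backup` and `*Heartbeat*` to `info` is one reading of the combined first row.

```python
from fnmatch import fnmatchcase
from typing import Tuple

CONFIG_DRIFT_ALERTS = {"ConfigurationDrift", "KubeConfigDrift"}


def _matches(alertname: str, *patterns: str) -> bool:
    """Case-sensitive glob match against any of the given patterns."""
    return any(fnmatchcase(alertname, pattern) for pattern in patterns)


def classify_alert_early(alertname: str, severity: str) -> Tuple[str, str]:
    """Return (alert_category, notification_type); rows are checked in table order."""
    if _matches(alertname, "*Backup*"):
        return "backup", "TYPE-1"
    if _matches(alertname, "*Heartbeat*", "*Info*") or severity in ("info", "none"):
        return "info", "TYPE-1"
    if alertname in CONFIG_DRIFT_ALERTS:
        return "config_drift", "TYPE-4D"
    if _matches(alertname, "Docker*", "Host*"):
        return "infrastructure", "TYPE-3"  # SSH path
    if _matches(alertname, "Kube*", "Pod*", "Deploy*"):
        return "kubernetes", "TYPE-3"      # K8s path
    if _matches(alertname, "Postgres*", "Redis*"):
        return "database", "TYPE-3"
    return "general", "TYPE-3"
```

Keeping the rules as an ordered chain makes the precedence visible in code review, which matters because a misordered rule would silence a valid alert as TYPE-1.
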
---

### B2: Docker/Host alerts → SSH MCP path

**Problem**: target resolution for `DockerContainerUnhealthy` and `HostHighCpuLoad` alerts takes the K8s path; `deployment/unknown` does not exist → 100% failure.

**Fix**: the routing layer in `decision_manager._auto_execute()`:

```python
if alert_category == "infrastructure" and alertname.startswith(("Docker", "Host")):
    # SSH MCP path
    target_host = extract_host_from_labels(incident)  # take the IP from labels.instance
    await ssh_mcp.execute(target_host, action_command)
else:
    # K8s MCP path (existing logic)
    await k8s_mcp.execute(namespace, deployment, action_command)
```

**Tier 3 (modifies the core path of decision_manager; requires authorization from the chief architect)**

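A possible shape for the host extraction and the routing decision is sketched below. It is not the project's code: `extract_host` takes the labels mapping directly rather than an `incident`, `choose_route` is a hypothetical helper, and the sketch assumes `labels.instance` is either `host:port` or a bare hostname or IPv4 address (a bare IPv6 address without brackets is not handled).

```python
from typing import Mapping, Optional


def extract_host(labels: Mapping[str, str]) -> Optional[str]:
    """Return the host part of labels["instance"], or None if it is missing."""
    instance = labels.get("instance", "").strip()
    if not instance:
        return None  # caller should refuse to execute rather than guess
    host, sep, port = instance.rpartition(":")
    if sep and port.isdigit():
        return host.strip("[]") or None  # also unwraps "[ipv6]:port"
    return instance


def choose_route(alert_category: str, alertname: str) -> str:
    """Mirror the B2 condition: Docker*/Host* infrastructure alerts go to SSH."""
    if alert_category == "infrastructure" and alertname.startswith(("Docker", "Host")):
        return "ssh"
    return "k8s"
```

Returning `None` instead of a placeholder such as `unknown` gives the caller a clear signal to stop, which avoids repeating the `deployment/unknown` failure mode on the SSH side.
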
---

### B3: Fix action parsing (93 「未知操作|」 rows)

**Problem**: the action returned by the LLM contains the `|` separator; after parsing, the action part is an empty string, so it is displayed as 「未知操作|」 ("unknown operation|").

**Fix**: the action string handling in `_push_decision_to_telegram()`:
```python
# Clean the action returned by the LLM: keep the part before the first "|".
# `or ""` guards against None, and strip() also covers actions without a "|".
action_clean = (action or "").split("|", 1)[0].strip()
if not action_clean:
    action_clean = "NO_ACTION"
```

---

### B4: NO_ACTION → TYPE-1 (card without buttons)

**Problem**: alerts such as `HostBackupFailed` with `suggested_action=NO_ACTION` still show six action buttons.

**Fix**: add a check to `classify_notification()`:
```python
if analysis.suggested_action in ("NO_ACTION", "OBSERVE", "") or notification_type == "TYPE-1":
    return NotificationConfig(type="TYPE-1", buttons=[])
```

---

### B5: Three write points for outcome / alert_category / notification_type

**Problem**: all three columns are NULL, so KM records cannot be queried by category.

**The three write points**:
1. `webhooks.py` creates the Incident → write `alert_category` (derived from alertname/labels)
2. `decision_manager._auto_execute()` finishes → write `outcome` (SUCCESS/FAILED/SKIPPED)
3. `telegram_gateway._push_decision()` has sent the message → write `notification_type`

---

### B6: YAML risk_level takes precedence over the LLM

**Problem**: the YAML rule defines `HostHighCpuLoad risk=medium`, but the LLM habitually returns `risk=high`, which escalates the alert to manual review.

**Fix**: in `_dual_engine_analyze()`:
```python
# The risk_level from the YAML rule takes precedence over the LLM output
if rule_match and rule_match.risk_level:
    final_risk = rule_match.risk_level  # use the YAML value
else:
    final_risk = llm_result.risk_level  # fall back to the LLM
```

---

### B7: ADR-071-G, KMConversionService

**Problem**: a RESOLVED Incident is not converted into a KM entry automatically, so the Learning node is never triggered.

**Fix**: add `km_conversion_service.py`:
1. `incident_service.resolve()` calls `KMConversionService.convert(incident_id)`
2. Create a `KnowledgeEntryRecord` (alertname + action + outcome + context)
3. Create a `Playbook` draft (if outcome=SUCCESS)
4. Call `EmbeddingService.embed()` → `vectorized=True`

**Acceptance**: Incident RESOLVED → one new row is added to `knowledge_entries` automatically, with `vectorized=True`

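Since `resolve()` can be called more than once for the same Incident, the conversion needs to be idempotent. The sketch below shows the idea with an in-memory store keyed by incident id. The class name, the dictionary shapes, and the `embed` callable are hypothetical; in a real database the same guarantee would more likely come from a unique constraint on the incident id.

```python
from typing import Callable, Dict, List, Optional


class KMConversionSketch:
    """Idempotent Incident → KM entry conversion on plain dictionaries."""

    def __init__(self, embed: Callable[[str], List[float]]) -> None:
        self._embed = embed
        self.entries: Dict[str, dict] = {}  # keyed by incident id

    def convert(self, incident: dict) -> Optional[dict]:
        """Create the KM entry once; return None on a repeated call."""
        incident_id = incident["id"]
        if incident_id in self.entries:
            return None  # already converted: do not create a duplicate
        text = f'{incident.get("alertname")} {incident.get("action")}'
        entry = {
            "incident_id": incident_id,
            "alertname": incident.get("alertname"),
            "action": incident.get("action"),
            "outcome": incident.get("outcome"),
            # Step 3: only successful repairs produce a Playbook draft.
            "playbook_draft": incident.get("outcome") == "SUCCESS",
            "embedding": self._embed(text),
            "vectorized": True,
        }
        self.entries[incident_id] = entry
        return entry
```

Embedding inside `convert` keeps the acceptance condition (`vectorized=True` on the new row) true at the moment the row appears, instead of relying on a later batch job.
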
---

### B8: TYPE-1 informational card + TYPE-4 manual-fix record (ADR-071-C/H)

**TYPE-1 informational card**:
- Add `telegram_gateway.send_info_notification(incident)`
- Format: no buttons, only the alert name + severity + a short description

**TYPE-4 manual-fix record**:
- Add a `log_manual_fix` action to `handle_callback()`
- Bot prompt: "Please enter your fix steps (multiple lines)"
- After collecting them, call `LearningService.record_manual_fix()` → draft Playbook

---

## 5. Layer 3: Monitoring Coverage (ADR-074)

> **Prerequisite**: execute only after Layers 1 and 2 are complete, to avoid "pouring new alerts into a leaking bucket".
> **New ADR**: this layer corresponds to ADR-074 (monitoring coverage), an ADR first proposed in this document; it has no standalone document yet.

### 5.1 Current Monitoring Gaps

| Gap | Severity | Alert name | Auto-remediation path |
|------|--------|---------|------------|
| **The flywheel's own health** | 🔴 P0 | `FlywheelExecutionSuccessLow` `FlywheelPlaybookZero` `FlywheelKMVectorizationLow` | Automatic: Slack/Telegram P0 alert |
| **Inter-host network connectivity** | 🔴 P0 | `HostNetworkPartition` | SSH MCP: check routes, notify a human |
| **CoreDNS resolution failure** | 🔴 P0 | `CoreDNSResolutionFailed` | K8s MCP: restart the CoreDNS Pod |
| **Gitea CI/CD pipeline failure** | 🔴 P0 | `GiteaCIPipelineFailed` | Telegram notification (no auto-remediation) |
| **Backup restore usability test** | 🔴 P0 | `BackupRestoreTestFailed` | Weekly schedule: Velero restore dry-run |
| **Detailed health of Docker containers on host 188** | 🟠 P1 | `DockerContainerUnhealthyDetailed` | SSH MCP: `docker inspect` + restart |
| **Redis Streams backlog** | 🟠 P1 | `RedisStreamBacklogHigh` | Redis MCP: alert when XLEN exceeds the threshold |
| **PostgreSQL disk growth rate** | 🟠 P1 | `PostgreSQLDiskGrowthRate` | Alert: remind a human to clean up |
| **Gemini API error rate** | 🟠 P1 | `GeminiAPIErrorRateHigh` | AI Router switches to Ollama automatically |
| **OpenClaw LLM processing latency** | 🟡 P2 | `OpenClawLLMLatencyHigh` | Alert: monitor LLM queue depth |
| **TLS certificate expiry (7 days)** | 🟡 P2 | `TLSCertExpiryCritical` | Already exists; confirm it covers every domain |

### 5.2 Flywheel Health Prometheus Exporter (highest priority)

The flywheel must be able to monitor its own health. Add `flywheel_exporter.py` (a CronJob or a long-running exporter):

```python
# Query the DB every 5 minutes and expose the following metrics
awoooi_flywheel_playbook_count          # target ≥ 20
awoooi_flywheel_execution_success_rate  # target ≥ 0.3 (30%)
awoooi_flywheel_km_unvectorized_count   # target = 0
awoooi_flywheel_alertname_null_rate     # target = 0
awoooi_flywheel_incidents_stuck         # number of Incidents in INVESTIGATING for more than 24 hours
```

**Alert rules** (in a Prometheus rule file, `severity` has to sit under `labels`):
```yaml
- alert: FlywheelPlaybookZero
  expr: awoooi_flywheel_playbook_count == 0
  for: 1h
  labels:
    severity: critical

- alert: FlywheelExecutionSuccessLow
  expr: awoooi_flywheel_execution_success_rate < 0.1
  for: 2h
  labels:
    severity: warning

- alert: FlywheelKMVectorizationLow
  expr: awoooi_flywheel_km_unvectorized_count > 10
  for: 30m
  labels:
    severity: warning
```

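The exporter's core can be sketched with the standard library only: compute the five gauges from raw counts, then render them in the Prometheus text exposition format. The input keys (`approvals_total`, `incidents_total`, and so on) are hypothetical names for values the real exporter would read from the database; only the metric names come from the list above.

```python
from typing import Dict


def compute_metrics(stats: Dict[str, int]) -> Dict[str, float]:
    """Derive the five flywheel gauges from raw row counts."""
    approvals = stats["approvals_total"]
    incidents = stats["incidents_total"]
    return {
        "awoooi_flywheel_playbook_count": float(stats["playbooks"]),
        "awoooi_flywheel_execution_success_rate":
            stats["execution_success"] / approvals if approvals else 0.0,
        "awoooi_flywheel_km_unvectorized_count": float(stats["km_unvectorized"]),
        "awoooi_flywheel_alertname_null_rate":
            stats["alertname_null"] / incidents if incidents else 0.0,
        "awoooi_flywheel_incidents_stuck": float(stats["incidents_stuck"]),
    }


def render_exposition(metrics: Dict[str, float]) -> str:
    """Render gauges as Prometheus text format, one TYPE line per metric."""
    lines = []
    for name, value in metrics.items():
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

Guarding both ratios against a zero denominator keeps the exporter from crashing on an empty database, so the `FlywheelPlaybookZero` rule can still fire in exactly the situation it is meant to catch.
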
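### 5.3 Inter-Host Network Monitoring

Add a Prometheus Blackbox TCP probe so that the hosts test each other pairwise:

```yaml
- job_name: host-connectivity
  metrics_path: /probe
  params:
    module: [tcp_connect]
  static_configs:
    - targets:
        - 192.168.0.110:22    # 110 SSH
        - 192.168.0.188:22    # 188 SSH
        - 192.168.0.120:6443  # 120 K3s API
        - 192.168.0.121:6443  # 121 K3s API
  relabel_configs:
    # Standard Blackbox pattern: pass each target as ?target= and scrape the exporter itself
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox-exporter:9115  # assumed exporter address; replace with the real one
```
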
### 5.4 Gitea CI/CD Pipeline Failure Alert

Gitea supports webhooks; when a pipeline fails it POSTs to the AWOOOI API:

```yaml
# gitea webhook → /api/v1/webhooks/gitea
# Trigger: workflow run status = failure
# Alert name: GiteaCIPipelineFailed
# Classification: TYPE-1 (notification only, no auto-remediation)
```

### 5.5 Scheduled Backup Restore Verification

Add a weekly CronJob, `backup-restore-test` (Sundays at 02:00 Taipei time):
1. Trigger a Velero restore dry-run (`--dry-run`)
2. Verify that the PVC snapshots are readable
3. On failure → Prometheus textfile metrics → alert `BackupRestoreTestFailed`

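Step 3 relies on the node_exporter textfile collector, which reads `*.prom` files from a directory. A minimal sketch of the writer follows. The metric names and the file name are hypothetical, and the directory is whatever the collector is configured to scan; the write goes to a temporary file first and is then renamed so the collector never reads a half-written file.

```python
import os
import tempfile
import time
from typing import Optional


def write_backup_test_metric(directory: str, success: bool,
                             now: Optional[float] = None) -> str:
    """Atomically write the result of the restore test as a .prom file."""
    timestamp = time.time() if now is None else now
    body = (
        "# TYPE awoooi_backup_restore_test_success gauge\n"
        f"awoooi_backup_restore_test_success {1 if success else 0}\n"
        "# TYPE awoooi_backup_restore_test_timestamp_seconds gauge\n"
        f"awoooi_backup_restore_test_timestamp_seconds {int(timestamp)}\n"
    )
    final_path = os.path.join(directory, "backup_restore_test.prom")
    # Temp file in the same directory so os.replace is an atomic rename.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; the collector must read it
    os.replace(tmp_path, final_path)
    return final_path
```

With these two gauges, `BackupRestoreTestFailed` could alert both on a failed run (`success == 0`) and on a stale timestamp, which covers the case where the CronJob silently stops running.
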
---

## 6. Layer 4: Real-Time Frontend Flywheel (ADR-073-C)

> **Prerequisite**: Layer 1 is complete. The display only becomes meaningful once the flywheel has real activity.

### 6.1 New API Endpoints

**`GET /api/v1/stats/flywheel`**:
```json
{
  "nodes": {
    "monitoring": {"status": "active", "last_event": "2026-04-12T14:30:00+08:00", "count_1h": 12},
    "deduplication": {"status": "active", "dedup_rate": 0.73},
    "diagnosis": {"status": "active", "mcp_providers_used": ["k8s", "ssh"]},
    "reasoning": {"status": "active", "playbook_hits": 3, "llm_calls": 9},
    "execution": {"status": "active", "success_rate": 0.31},
    "learning": {"status": "active", "new_playbooks_today": 2}
  },
  "current_flow": [
    {"incident_id": "abc123", "alertname": "HostHighCpuLoad", "current_node": "execution", "ts": "..."}
  ]
}
```

**`GET /api/v1/stats/summary`**:
```json
{
  "playbook_count": 23,
  "execution_success_rate": 0.31,
  "today_processed": 18,
  "flywheel_conversions_today": 5,
  "km_vectorized_rate": 0.94
}
```

**WebSocket `ws://api/ws/flywheel`**:
- Every change of an Incident's node → push an event
- The frontend animation follows the real alert flow

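### 6.2 Frontend Flywheel Component Update

**Current problem**: the flywheel component uses fake data (a static mock).

**Fix**:
1. Call `GET /api/v1/stats/flywheel` to fetch node status
2. Connect to the WebSocket to receive real-time events
3. When an alert flows through a node, play the highlight animation on that node
4. For TYPE-3 (manual review): a red dashed line from "Reasoning & Matching" to the "Manual Handling Center"
5. For TYPE-4 (root-cause confirmation): an orange dashed line from "Reasoning & Matching" to the "TYPE-4 Root-Cause Confirmation" node
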
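### 6.3 Flywheel KPI Panel (finalized from the screenshot)

The three KPIs in the upper-right corner of the screenshot must be connected to real data:
- **Total alerts**: total row count of the `incidents` table
- **Automation rate**: `EXECUTION_SUCCESS / total_approvals` (currently 0.5%)
- **Flywheel conversions**: number of successful repairs that also generated a Playbook (currently 2)
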
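The automation-rate KPI is a plain ratio, and the quoted 0.5% follows from the 2 successes out of 380 approval records in section 1.3. A small sketch, with hypothetical function names, shows the calculation and the formatting the panel would need:

```python
def automation_rate(execution_success: int, total_approvals: int) -> float:
    """EXECUTION_SUCCESS / total_approvals, or 0.0 when there are no approvals."""
    if total_approvals <= 0:
        return 0.0
    return execution_success / total_approvals


def format_percent(rate: float) -> str:
    """Format a ratio as a percentage with one decimal place."""
    return f"{rate * 100:.1f}%"


print(format_percent(automation_rate(2, 380)))  # prints 0.5%
```

Handling the zero-approval case explicitly avoids a division error on a freshly reset environment, where the panel should show 0.0% instead of failing to render.
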
---

## 7. Full Execution Order and Dependency Chain

```
[Layer 1: Emergency unblocking]
A1 (CD + deploy 8be87b0)
  └→ A2 (fix NULL alertname)
  └→ A3 (debounce 30 minutes)
  └→ A4 (cold-start Playbook script)
  └→ A5 (KM batch vectorization script)

[Layer 2: Routing fixes]
B-DB (DB migration: independent, no dependencies)
  └→ B1 (move the triage station forward, A0+B)
  └→ B2 (Docker/Host SSH MCP path)  ← Tier 3, requires authorization
  └→ B3 (fix action parsing of "|")
  └→ B4 (NO_ACTION → TYPE-1)
  └→ B5 (write outcome/category/type)
  └→ B6 (YAML risk_level takes precedence)
  └→ B7 (KMConversionService)
  └→ B8 (TYPE-1 informational card + TYPE-4 manual-fix record)

[Layer 3: Monitoring coverage] (depends on Layer 1 being complete)
M1 (flywheel health exporter + alert rules)  ← highest priority
M2 (inter-host network Blackbox probe)
M3 (Gitea CI/CD failure webhook)
M4 (weekly backup restore test)
M5 (detailed Docker health on 188 + Redis Streams + PostgreSQL disk)

[Layer 4: Real-time frontend] (depends on Layer 1 being complete)
C1 (flywheel stats API)
C2 (frontend flywheel component wired to the real API)
C3 (WebSocket real-time push)
C4 (visualization of the human-intervention paths)
```

---

## 8. Acceptance Criteria (after 7 days)

| Metric | Current | Target | How to verify |
|------|------|------|---------|
| Number of Playbooks | 0 | ≥ 20 | `SELECT count(*) FROM playbooks` |
| EXECUTION_SUCCESS rate | 0.5% | ≥ 30% | `SELECT ... FROM approval_records` |
| alertname NULL rate | 100% | 0% | `SELECT count(*) FROM incidents WHERE alertname IS NULL` |
| KM vectorized | 0% | ≥ 90% | `SELECT count(*) FROM knowledge_entries WHERE vectorized=False` |
| Repeated alerts merged into the same Incident | < 10% | ≥ 70% | 30-minute debounce check |
| Flywheel health alert latency | Does not exist | < 5 minutes | Prometheus alert rule trigger test |
| Docker/Host alert success rate | 0% | ≥ 50% | SSH MCP path acceptance |
| Frontend flywheel shows real animation | Static | Active | Nodes highlight in the frontend when an alert arrives |
| Gitea CD failures raise an alert | No | Yes | Provoke a CD failure and verify the alert |
| Backup restore works | Untested | Weekly schedule passes | CronJob runs successfully |

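---

## 9. Tier 3 Red-Zone Items (require authorization from the chief architect)

The following items modify the flywheel's core decision path and may only be implemented after the chief architect authorizes them:

| Item | Core file modified | Risk |
|------|-------------|---------|
| B1 Move the triage station forward | `decision_manager.py`, before step 5 | A misclassification would silence valid alerts as TYPE-1 |
| B2 Docker/Host SSH MCP path | `decision_manager._auto_execute()` | A routing error could make SSH run dangerous commands |
| B5 outcome write point | `_auto_execute()` + `telegram_gateway` | Writing at the wrong time would affect the KM conversion trigger |
| B7 KMConversionService | New service + `incident_service.resolve()` | The RESOLVED trigger must be protected against repeated execution |
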
---

## 10. ADR Reference Map

| ADR | Content | Status |
|-----|------|------|
| ADR-068 | Flywheel cold-start repair (five root causes around alertname/affected_services) | ✅ Design complete |
| ADR-070 | Four MCP phases (Environment Diagnosis node) | ✅ Phases 1-4 complete |
| ADR-071 | Four notification types (TYPE-1/2/3/4) + work sequence A-J | ✅ Design complete, partly undeployed |
| ADR-073 | Full inventory + three-Sprint solution (source for this document) | 🔴 Awaiting implementation approval |
| **ADR-074** | **Monitoring coverage (new in this document)** | 🔴 Planning |

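---

*Document version v1.0 (2026-04-12, Taipei time). Integrates the ADR-073 flywheel inventory + v2.2 spec §14.11 ADR-071 work sequence + monitoring coverage gap inventory + finalized flywheel screenshot architecture + MCP role analysis. All implementation awaits the Commander's approval.*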