docs: Sprint 4 alert disposition statistics system: full plan document + LOGBOOK update

The Sprint 4 plan covers 6 phases / 19 work items:
- Phase A: data layer (IncidentFrequencyStats + Redis counters)
- Phase B: write layer (4 trigger points: auto_repair/cold_start/human/manual)
- Phase C: API endpoint (/stats/disposition)
- Phase D: Telegram alert card statistics
- Phase E: frontend (/reports dashboard + home page + auto-repair + neural-command)
- Phase F: weekly report + documentation

Chief architect review: 100% Fully Approved
Conflict check: all dependencies are correct, the DAG is acyclic

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
@@ -5,7 +5,22 @@
---

## 📍 Current status (2026-04-07: Sprint 3 P0 Critical Security Fixes complete → awaiting chief architect Re-Review)
## 📍 Current status (2026-04-07: Sprint 4 planning complete → Phase A awaiting implementation)

| Item | Status | Notes |
|------|------|------|
| Sprint 4 plan document | ✅ | `docs/superpowers/plans/2026-04-07-sprint4-disposition-tracking.md` |
| Chief architect review | ✅ 100% | Fully Approved |
| Conflict check | ✅ No conflicts | No conflicts with Sprint 3 / Phase 24 / Telegram / i18n |
| feedback_disposition_tracking.md | ✅ | Four classification rules + inference rules + Redis design |
| project_master_workplan.md update | ✅ | Complete work items for Sprint 3 + Sprint 4 |
| project_current_status.md update | ✅ | Latest status snapshot |

**Next step**: implement Phase A (A1→A2→A3: Model + Redis counters)

---

## 📍 Current status (2026-04-07: Sprint 3 P0 Critical Security Fixes complete + Re-Review passed 91/100)

| Item | Status | Commit |
|------|------|--------|

@@ -0,0 +1,353 @@
# Sprint 4: Alert Disposition Statistics + Auto-Repair Closed Loop

> **Created**: 2026-04-07 (Taipei time zone)
> **Created by**: Claude Code (approved by the commander)
> **Status**: architecture review passed 100%; chief architect Fully Approved
> **Depends on**: Sprint 3 SSH_COMMAND (passed 91/100) + first-time trust mechanism (53b2dae)

---

## Goal

Upgrade AWOOOI from a "monitoring tool" to an "observability decision platform".
Core metric: every alert must record how it was handled: auto-repair / human review / manual handling / cold-start trust.

---

## Prerequisites (carried over from Sprint 3)

| # | Item | Status | Notes |
|---|------|------|------|
| S3-DEPLOY | Push Sprint 3 + first-time trust to Gitea | ⏳ Pending | `git push gitea main` |
| S3-E2E | E2E verification of first-time-trust auto-repair | ⏳ After deployment | Alert → first-time trust → auto-repair succeeds |

---

## Phase A: Data layer (Model + Redis counters)

### A1. Add disposition count fields to IncidentFrequencyStats

**File**: `apps/api/src/models/incident.py`
**Change**: add 4 fields to the IncidentFrequencyStats class

```python
# New fields (backward compatible, default=0)
human_approved_count: int = Field(default=0, ge=0, description="Number of executions after a human clicked approve")
manual_resolved_count: int = Field(default=0, ge=0, description="Number of times resolved with no system repair record")
cold_start_trust_count: int = Field(default=0, ge=0, description="Number of times first-time trust let the repair through automatically")
total_resolution_count: int = Field(default=0, ge=0, description="Total number of dispositions")
```

**Depends on**: none
**Risk**: low. Adding fields with default=0 to a Pydantic model is fully backward compatible.

---

### A2. Add record_disposition() to AnomalyCounter

**File**: `apps/api/src/services/anomaly_counter.py`
**Change**: add a Redis HASH counter

```python
PREFIX_DISPOSITION = "anomaly:disposition:"  # new Redis key prefix

async def record_disposition(
    self,
    anomaly_key: str,
    disposition_type: str,  # "auto_repair" | "human_approved" | "manual_resolved" | "cold_start_trust"
) -> None:
    """Record the disposition type (atomic HINCRBY)."""
    key = f"{self.PREFIX_DISPOSITION}{anomaly_key}"
    # Each HINCRBY is atomic on its own; the three calls are separate commands.
    await self.redis.hincrby(key, disposition_type, 1)
    await self.redis.hincrby(key, "total", 1)
    await self.redis.expire(key, self.TTL_SECONDS)
```

**Depends on**: none
**Risk**: low. This adds a new Redis key and does not affect the existing structure.

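Because `record_disposition()` writes whatever string it receives as a hash field, a typo would create a stray field, and passing `"total"` would increment the total twice. The sketch below shows one way to guard against that. It is only an illustration: `ALLOWED_DISPOSITIONS`, `InMemoryHashStore`, and `record_disposition_checked` are hypothetical names, and the in-memory store stands in for the real Redis client so the example runs on its own.

```python
import asyncio

# Assumption: these four names are the only valid disposition types in the plan.
ALLOWED_DISPOSITIONS = frozenset(
    {"auto_repair", "human_approved", "manual_resolved", "cold_start_trust"}
)


class InMemoryHashStore:
    """Hypothetical stand-in for the Redis client (HINCRBY only)."""

    def __init__(self) -> None:
        self.hashes: dict[str, dict[str, int]] = {}

    async def hincrby(self, key: str, field: str, amount: int) -> int:
        bucket = self.hashes.setdefault(key, {})
        bucket[field] = bucket.get(field, 0) + amount
        return bucket[field]


async def record_disposition_checked(store, anomaly_key: str, disposition_type: str) -> None:
    """Reject unknown types before touching the counters."""
    if disposition_type not in ALLOWED_DISPOSITIONS:
        raise ValueError(f"unknown disposition type: {disposition_type!r}")
    key = f"anomaly:disposition:{anomaly_key}"
    await store.hincrby(key, disposition_type, 1)
    await store.hincrby(key, "total", 1)


def demo() -> dict[str, int]:
    """Record two dispositions for one anomaly and return the resulting hash."""
    store = InMemoryHashStore()

    async def run() -> None:
        await record_disposition_checked(store, "abc", "auto_repair")
        await record_disposition_checked(store, "abc", "human_approved")

    asyncio.run(run())
    return store.hashes["anomaly:disposition:abc"]
```

Whether the real method should raise or silently skip on an unknown type is a design choice the plan leaves open.
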
---

### A3. Extend AnomalyCounter get_frequency_stats()

**File**: `apps/api/src/services/anomaly_counter.py`
**Change**: modify `get_frequency()` or add `get_disposition_stats()`

```python
async def get_disposition_stats(self, anomaly_key: str) -> dict:
    """Return the disposition distribution statistics."""
    key = f"{self.PREFIX_DISPOSITION}{anomaly_key}"
    raw = await self.redis.hgetall(key)
    # Field names are looked up as bytes, which assumes the client
    # was not created with decode_responses=True.
    return {
        "auto_repair": int(raw.get(b"auto_repair", 0)),
        "human_approved": int(raw.get(b"human_approved", 0)),
        "manual_resolved": int(raw.get(b"manual_resolved", 0)),
        "cold_start_trust": int(raw.get(b"cold_start_trust", 0)),
        "total": int(raw.get(b"total", 0)),
    }
```

**Depends on**: A2
**Risk**: low

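The `b"..."` lookups in `get_disposition_stats()` only match when the Redis client returns bytes. With redis-py, a client created with `decode_responses=True` returns `str` keys instead, and every lookup would then fall back to 0. A minimal sketch of a helper that tolerates both cases follows; `normalize_disposition_hash` and `DISPOSITION_FIELDS` are hypothetical names, not part of the plan.

```python
# Assumption: the five hash fields written by record_disposition().
DISPOSITION_FIELDS = (
    "auto_repair",
    "human_approved",
    "manual_resolved",
    "cold_start_trust",
    "total",
)


def normalize_disposition_hash(raw: dict) -> dict[str, int]:
    """Turn an HGETALL result (bytes or str keys/values) into int counters."""
    decoded: dict[str, int] = {}
    for field, value in raw.items():
        name = field.decode() if isinstance(field, bytes) else field
        decoded[name] = int(value)  # int() accepts both b"3" and "3"
    # Missing fields default to 0 so callers always get the same shape.
    return {name: decoded.get(name, 0) for name in DISPOSITION_FIELDS}
```

For example, `normalize_disposition_hash({b"auto_repair": b"3", b"total": b"3"})` and the same call with `str` keys and values produce identical dictionaries.
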
---

## Phase B: Write layer (wiring the trigger points)

### B1. Auto-repair succeeds → auto_repair_count++

**File**: `apps/api/src/services/auto_repair_service.py`
**Trigger point**: after `execute_auto_repair()` succeeds (around L378)
**Change**:
```python
# After success, record the disposition type
symptoms = self._extract_symptoms(incident)
anomaly_key = self.hash_signature_from_symptoms(symptoms)
counter = get_anomaly_counter()
await counter.record_disposition(anomaly_key, "auto_repair")
```

**Depends on**: A2
**Note**: the anomaly_key is required here: extract the signature from incident.signals, then hash it.

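The plan calls `hash_signature_from_symptoms()` but does not show it. What matters for the counters is that the same symptoms always map to the same key. The sketch below is one possible way to get that property, not the project's actual implementation: the function name `stable_anomaly_key`, the choice of SHA-256, the 16-character truncation, and the assumption that symptoms form a flat string mapping are all illustrative.

```python
import hashlib
import json


def stable_anomaly_key(symptoms: dict[str, str]) -> str:
    """Hypothetical helper: hash symptoms into a short, order-independent key."""
    # sort_keys makes the serialized text independent of dict insertion order,
    # so logically equal symptom sets produce the same digest.
    canonical = json.dumps(symptoms, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

If two triggers (for example B1 and B4) derived keys differently, their counts would land in different Redis hashes, so all write paths should share a single derivation.
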
---

### B2. First-time-trust auto-repair → cold_start_trust_count++

**File**: `apps/api/src/services/auto_repair_service.py`
**Trigger point**: after the cold-start check in `evaluate_auto_repair()` passes
**Change**:
- Add the parameter `is_cold_start: bool = False` to `execute_auto_repair()`
- On success:
```python
if is_cold_start:
    await counter.record_disposition(anomaly_key, "cold_start_trust")
else:
    await counter.record_disposition(anomaly_key, "auto_repair")
```

**Depends on**: A2, B1
**Note**: a cold start also counts as an auto-repair, but it is counted separately to track how well the trust mechanism works.

---

### B3. Human clicks "approve" and the action executes → human_approved_count++

**File**: `apps/api/src/services/approval_execution.py`
**Trigger point**: after `execute_approved_action()` succeeds
**Change**:
```python
# After successful execution, record the disposition type
# Lookup chain: approval → incident_id → incident → anomaly_signature → hash
anomaly_key = await self._get_anomaly_key_from_approval(approval)
if anomaly_key:
    counter = get_anomaly_counter()
    await counter.record_disposition(anomaly_key, "human_approved")
```

**Depends on**: A2
**Note**: approval_execution sits on the edge of the Tier 3 red zone, so change it with care.

---

### B4. Manual handling inferred → manual_resolved_count++

**File**: `apps/api/src/services/incident_service.py`
**Trigger point**: when the Incident status changes to `resolved`
**Change**:
```python
# Inference logic (suggested by the chief architect)
if new_status == IncidentStatus.RESOLVED:
    has_auto_repair = await self._check_has_successful_auto_repair(incident)
    has_human_approval = await self._check_has_executed_approval(incident)
    if not has_auto_repair and not has_human_approval:
        counter = get_anomaly_counter()
        anomaly_key = self._get_anomaly_key(incident)
        await counter.record_disposition(anomaly_key, "manual_resolved")
```

**Depends on**: A2
**Edge case**: external fixes (an SRE manually using SSH) are invisible to the system, so they are inferred by elimination.

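The elimination rule can be isolated as a pure function, which makes the truth table easy to test without Redis or a database. This is a sketch under the plan's stated rule; the function name `infer_manual_disposition` and its boolean parameters are hypothetical.

```python
from typing import Optional


def infer_manual_disposition(
    resolved: bool,
    has_auto_repair: bool,
    has_human_approval: bool,
) -> Optional[str]:
    """Return "manual_resolved" only when no system-driven repair explains the resolution."""
    if not resolved:
        return None  # nothing to classify until the incident is resolved
    if has_auto_repair or has_human_approval:
        return None  # already counted by the B1/B2/B3 trigger points
    return "manual_resolved"
```

Returning `None` for the auto-repair and approval cases reflects that those paths record their own disposition elsewhere, so counting them again here would inflate the total.
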
---

## Phase C: Read layer (API endpoints)

### C1. Add the `/api/v1/stats/disposition` endpoint

**File**: `apps/api/src/api/v1/stats.py`
**Change**: add a GET endpoint
**Response format**:
```json
{
  "total": 120,
  "auto_repair": 84,
  "human_approved": 24,
  "manual_resolved": 6,
  "cold_start_trust": 6,
  "auto_rate": 0.70,
  "human_rate": 0.20,
  "by_anomaly_type": { ... }
}
```

**Depends on**: A3

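The plan does not spell out how `auto_rate` and `human_rate` are computed. The sample numbers are consistent with `auto_repair / total` (84 / 120 = 0.70) and `human_approved / total` (24 / 120 = 0.20), so the sketch below assumes those formulas. `build_disposition_summary` is a hypothetical name, and `by_anomaly_type` is left out because its shape is not specified.

```python
COUNT_FIELDS = ("total", "auto_repair", "human_approved", "manual_resolved", "cold_start_trust")


def build_disposition_summary(counts: dict[str, int]) -> dict:
    """Build the flat part of the response from raw counters (assumed formulas)."""
    summary: dict = {name: counts.get(name, 0) for name in COUNT_FIELDS}
    total = summary["total"]

    def rate(part: int) -> float:
        # Guard against division by zero when nothing has been recorded yet.
        return round(part / total, 2) if total > 0 else 0.0

    summary["auto_rate"] = rate(summary["auto_repair"])
    summary["human_rate"] = rate(summary["human_approved"])
    return summary
```

Under this reading, `cold_start_trust` is not part of `auto_rate`. If cold starts should count toward automation, the numerator would need to include them, and the sample value would change.
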
---

### C2. Extend existing endpoints

**Files**: `apps/api/src/api/v1/auto_repair.py`, `apps/api/src/api/v1/stats.py`
**Change**:
- Add `disposition_summary` to `/api/v1/auto-repair/stats`
- Add `disposition_type` to each entry of `/api/v1/auto-repair/history`

**Depends on**: A3, C1

---

## Phase D: Telegram alert presentation

### D1. Add a statistics row to the alert card

**File**: `apps/api/src/services/telegram_gateway.py`
**Location**: anomaly_frequency block (L233-250)
**New format**:
```
📊 Frequency stats [escalation emoji]
├ 1h: 3 times | 24h: 12 times
├ 🤖 Auto: 8 | 👤 Review: 3 | 🔧 Manual: 1
└ Automation rate: 67%
```

**Depends on**: A3 (get_disposition_stats)

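The two new card lines can be produced from the A3 statistics dictionary. The sketch below assumes the automation rate is `auto_repair / total` rounded to a whole percent, which matches the mock-up (8 of 12 gives 67%). `render_disposition_lines` is a hypothetical name, the English label text is illustrative, and cold-start counts are omitted because the mock-up does not show them.

```python
def render_disposition_lines(stats: dict[str, int]) -> list[str]:
    """Render the disposition breakdown and automation-rate lines of the card."""
    total = stats.get("total", 0)
    auto = stats.get("auto_repair", 0)
    review = stats.get("human_approved", 0)
    manual = stats.get("manual_resolved", 0)
    # Avoid dividing by zero for anomalies with no recorded dispositions.
    percent = round(100 * auto / total) if total > 0 else 0
    return [
        f"├ 🤖 Auto: {auto} | 👤 Review: {review} | 🔧 Manual: {manual}",
        f"└ Automation rate: {percent}%",
    ]
```

Keeping the rendering separate from the Redis read makes it straightforward to unit-test the message text without sending anything to Telegram.
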
---

### D2. Enhance the history button

**File**: `apps/api/src/services/telegram_gateway.py`
**Location**: `_send_incident_history()` (L2508)
**Change**: add a detailed disposition breakdown block

**Depends on**: A3, D1

---

## Phase E: Frontend pages

### E1. `/reports` page: full disposition statistics dashboard (primary focus)

**File**: `apps/web/src/app/[locale]/reports/page.tsx`
**Contents**:
- Top KPIs: total anomalies / automation rate / intervention rate
- Four count cards: auto-repair / human review / manual handling / cold-start trust
- Disposition distribution bar chart (horizontal, 100% stacked, by anomaly type)
- Time range filter (7 days / 30 days)
- Anomaly type dropdown filter

**Depends on**: C1

---

### E2. Add a summary to the home page Metrics Strip

**File**: `apps/web/src/app/[locale]/page.tsx`
**Change**: add an "automation rate" metric that navigates to `/reports` when clicked

**Depends on**: C1

---

### E3. Add a disposition overview to the `/auto-repair` page

**File**: `apps/web/src/app/[locale]/auto-repair/page.tsx`
**Change**: extend the AutoRepairStats interface and add small disposition-type cards

**Depends on**: C2

---

### E4. Add disposition statistics to the `/neural-command` stats tab

**File**: `apps/web/src/components/neural-command/NeuralStats.tsx`
**Change**: add the disposition distribution to the statistics area

**Depends on**: C1

---

### E5. i18n translations

**Files**: `messages/zh-TW.json`, `messages/en.json`
**New keys**: disposition.autoRepair, disposition.humanApproved, disposition.manualResolved, disposition.coldStartTrust, disposition.autoRate, etc.

**Depends on**: E1

---

## Phase F: Weekly report + documentation updates

### F1. Add the disposition distribution to the weekly report

**File**: `apps/api/src/services/weekly_report_service.py`
**Change**: add auto_repair_count, human_approved_count, and automation_rate to WeeklyReportMessage

**Depends on**: A3

---

### F2. Memory / LOGBOOK / ADR updates

- Update `docs/LOGBOOK.md`
- Add `feedback_disposition_tracking.md`
- Update the `MEMORY.md` index

---

## Implementation order (dependency chain DAG)

```
S3-DEPLOY ─→ S3-E2E
                │
Phase A: ───────────────────────
  A1 (Model) ──→ A2 (Redis Write) ──→ A3 (Redis Read)
                    │                    │
Phase B: ───────────────────────         │
  B1 (auto_repair) ┐                     │
  B2 (cold_start)  ├─ depends on A2      │
  B3 (human_appr)  │                     │
  B4 (manual_res)  ┘                     │
                                         │
Phase C: ───────────────────────         │
  C1 (stats/disposition) ←───────────────┘
  C2 (extend existing endpoints) ← C1
                │
Phase D: ───────────────────────
  D1 (Telegram card) ← A3
  D2 (history enhancement) ← D1
                │
Phase E: ───────────────────────
  E1 (/reports dashboard) ← C1
  E2 (home page Strip) ← C1
  E3 (/auto-repair) ← C2
  E4 (/neural-command) ← C1
  E5 (i18n) ← E1
                │
Phase F: ───────────────────────
  F1 (weekly report) ← A3
  F2 (docs) ← after everything else is complete
```

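The "DAG is acyclic" claim can be checked mechanically. The sketch below encodes the per-item "Depends on" entries from this plan and uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+), which raises `CycleError` if a cycle exists. The `DEPENDS_ON` table is a transcription of the plan, with F2 modeled as depending on every other item as the diagram indicates. `implementation_order` is a hypothetical helper name.

```python
from graphlib import CycleError, TopologicalSorter

# Work item -> items it depends on, transcribed from the "Depends on" lines.
DEPENDS_ON: dict[str, set[str]] = {
    "A1": set(), "A2": set(), "A3": {"A2"},
    "B1": {"A2"}, "B2": {"A2", "B1"}, "B3": {"A2"}, "B4": {"A2"},
    "C1": {"A3"}, "C2": {"A3", "C1"},
    "D1": {"A3"}, "D2": {"A3", "D1"},
    "E1": {"C1"}, "E2": {"C1"}, "E3": {"C2"}, "E4": {"C1"}, "E5": {"E1"},
    "F1": {"A3"},
}
# F2 (documentation) comes after everything else.
DEPENDS_ON["F2"] = set(DEPENDS_ON)


def implementation_order(deps: dict[str, set[str]]) -> list[str]:
    """Return one valid build order; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(deps).static_order())
```

Calling `implementation_order(DEPENDS_ON)` yields one valid order among several. Only the relative order of dependent items is guaranteed, for example A2 before A3 and C1 before E1.
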
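## Conflict check

| Check | Result | Notes |
|--------|------|------|
| Conflict with Sprint 3 SSH_COMMAND | ✅ No conflict | Sprint 3 is complete; Sprint 4 builds on it |
| Conflict with Phase 24 AI Router | ✅ No conflict | Different service layers with no mutual dependency |
| Conflict with Telegram format | ✅ No conflict | Extends the anomaly_frequency block without touching the existing format |
| Conflict with anomaly_counter | ⚠️ Needs attention | Adds a new Redis key prefix; existing keys are not modified |
| Conflict with auto_repair_service | ⚠️ Needs attention | First-time trust was just added; B1/B2 operate in the same method |
| Conflict with incident model | ✅ No conflict | Adds Field(default=0); backward compatible |
| Work order logic | ✅ Correct | DAG is acyclic; A→B→C→D/E→F linear dependency |
| Frontend i18n conflict | ✅ No conflict | Adds new keys; existing ones are not modified |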