Files
awoooi/apps/api/src/services/evidence_snapshot.py
Your Name 3668d49f2f
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m38s
feat(flywheel): W2 三件 + KMWriter critic 修法(1635 tests 全綠)
W2 (onboarder 4 週飛輪 80→90 路徑第二週) + critic PR review 5 個 critical/major
全部修完,default flag=false 安全無爆炸風險。

## W2 三件 PR

### PR-R2 — AOL → catalog confidence EWMA 回灌(修飛輪斷鏈 C2)
- 新檔 `apps/api/src/jobs/aol_to_catalog_writeback_job.py`
- 邏輯:每小時掃 AOL 計算 EWMA confidence (alpha=0.3) 回灌 alert_rule_catalog
- 失敗閾值 N=5 連續低成功率 → review_status='draft'
- Hermes _fetch_noisy_rules SQL 加 OR review_status='draft'
- ENABLE_AOL_WRITEBACK_JOB=false (default)
- 8 個測試(mock path 修正:lazy import → patch src.db.base.get_db_context)

### PR-V1 — self_healing_validator 串接 (修飛輪斷鏈 C6)
- 新檔 `apps/api/src/services/self_healing_validator.py`(純函數 assess_self_healing)
- post_execution_verifier.py step 5 串接(feature flag gate)
- evidence_snapshot.py 加 self_healing_score / self_healing_detail 欄位
- db/models.py + base.py ALTER IF NOT EXISTS
- score < 0.5 → 觸發 rollback 提案 Telegram alert(不自動執行)
- ENABLE_SELF_HEALING_VALIDATOR=false (default)
- 7 個測試

### PR-L1 — KM ↔ Playbook 雙向回路 (修飛輪斷鏈 C3+C4)
- learning_service.py 三條新邏輯:
  1. _write_playbook_evolution_km:promote/demote 寫 KM 演化條目
  2. _check_and_mark_playbook_review:N=5 累積觸發 review_required
  3. _demote_alert_rule_catalog_confidence:DEPRECATED → confidence×=0.5
- PlaybookRecord 加 review_required 欄位(schema migration via base.py)
- ENABLE_KM_PLAYBOOK_FEEDBACK_LOOP=false (default)
- KM_PLAYBOOK_REVIEW_THRESHOLD=5 可調
- 6 個測試

## KMWriter Critic 5 個 Critical/Major 修復(之前 critic PR review 發現)
之前 push commit c5753e1c 已修,本 commit 補回 stash 中的對應檔案:
- C1 km_writer.py:194 backfill 自打臉(已修:同步 await + DLQ)
- C2 km_writer.py:391 KM_WRITE_AWAIT=false 路徑收緊
- M1 decision_manager.py:2178/2203 移除 _fire_and_forget
- M2 incident_service.py:1099 自製 path 加 retry+DLQ
- M3 km_writer.py:166 冪等聲明對齊(UPSERT + partial unique index)

## 驗證
- 1635 unit tests 全綠(+27 from 1608)
- 與 fb0c72db (推翻 A2 Ollama primary) 共存無衝突
- 所有新 Job/Service default flag=false(不爆炸)

## 期望影響
飛輪斷鏈 C2 + C3 + C4 + C6 全修
飛輪自主化評分:65 → 85 預估(W2 完成後)

啟用順序(待 prod fb0c72db 驗證 OLLAMA primary 跑得起來後):
1. ENABLE_AOL_WRITEBACK_JOB=true
2. ENABLE_KM_PLAYBOOK_FEEDBACK_LOOP=true
3. ENABLE_SELF_HEALING_VALIDATOR=true

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:44:04 +08:00

400 lines
16 KiB
Python
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
"""
AWOOOI AIOps Phase 1 — 不可變事件證據快照
==========================================
EvidenceSnapshotPreDecisionInvestigator 的輸出契約。
設計原則:
1. 不可變Immutable— 建立後只讀;執行後補填 post_execution_state
2. 版本化Versioned— schema_version 確保 fine-tune pipeline 可過濾
3. 安全Sanitized— 所有感官文字必須過 SanitizationService
4. 降級友好Graceful Degradation— 部分感官失敗不阻塞決策
資料流:
PreDecisionInvestigator
→ EvidenceSnapshotPydantic model
→ save() 寫入 incident_evidence 表
→ 傳給 decision_manager._dual_engine_analyze()
PostExecutionVerifier
→ update_post_execution() 補填 post_execution_state
ADR-081: PreDecisionInvestigator + EvidenceSnapshot
2026-04-15 ogt + Claude Sonnet 4.6 (亞太): Phase 1 初始建立
"""
from __future__ import annotations
import uuid
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any
import structlog
from sqlalchemy import update
from src.db.base import get_db_context
from src.db.models import IncidentEvidence
from src.utils.timezone import now_taipei
logger = structlog.get_logger(__name__)
# EvidenceSnapshot schema 版本
SCHEMA_VERSION = "v1"
# Evidence summary 最大長度(防止超出 LLM token budget
MAX_SUMMARY_CHARS = 32_000 # ≈ 8K tokensUTF-8 中文 1 字 ≈ 4 chars
@dataclass
class EvidenceSnapshot:
"""
AI 決策前的不可變情報快照。
8D 感官維度:
D1 k8s_state — kubectl describe pod + events
D2 recent_logs — container stderr tail-50已 sanitize
D3 metrics_snapshot — Prometheus 5min vs 1h baseline
D4 recent_deployments — ArgoCD/Gitea 過去 1h 部署 diff
D5 business_metrics — 訂單量 / 登入成功率 / P0 SLI
D6 historical_context — 過去 30 天同 alertname 處置歷史
D7 peer_health — 同 Deployment 其他 replica 健康度
D8 dependency_topology — Istio/Service Mesh 上下游 latency
品質指標:
mcp_health — 各工具呼叫成敗 {tool_name: bool}
sensors_attempted / sensors_succeeded — 感官覆蓋率
Usage:
snapshot = EvidenceSnapshot(incident_id="INC-001")
snapshot.k8s_state = {"phase": "CrashLoopBackOff", ...}
snapshot_id = await snapshot.save()
"""
incident_id: str
# Identifiers
snapshot_id: str = field(default_factory=lambda: str(uuid.uuid4()))
schema_version: str = SCHEMA_VERSION
collected_at: datetime = field(default_factory=now_taipei)
# 告警基礎資訊sensors=0 時的最小情報2026-04-16 ogt + Claude Sonnet 4.6
alert_info: dict[str, Any] | None = None
# 8D 感官數據
k8s_state: dict[str, Any] | None = None # D1
recent_logs: str | None = None # D2 (sanitized)
metrics_snapshot: dict[str, Any] | None = None # D3
recent_deployments: list[dict] | None = None # D4
business_metrics: dict[str, Any] | None = None # D5
historical_context: str | None = None # D6
peer_health: dict[str, Any] | None = None # D7
dependency_topology: dict[str, Any] | None = None # D8
# Phase 4 ADR-084: 動態異常感官DynamicBaseline + LogAnomaly + TrendPredictor
# 2026-04-15 ogt + Claude Sonnet 4.6(亞太): Phase 4 8D 升級
anomaly_context: dict[str, Any] | None = None # Phase 4 動態異常上下文
# 2026-04-27 P3.1-T2-PathA by Claude — DiagAggregator 信號分類層in-memory only不持久化
# {"signal_count": int, "signals": [{"source", "signal_type", "severity", "message", ...}]}
extra_diagnosis: dict | None = None
# 感官品質
mcp_health: dict[str, bool] = field(default_factory=dict)
collection_duration_ms: int | None = None
sensors_attempted: int = 0
sensors_succeeded: int = 0
# LLM 輸入摘要(由 Investigator 組裝)
evidence_summary: str | None = None
# 執行前後 State
pre_execution_state: dict[str, Any] | None = None
post_execution_state: dict[str, Any] | None = None
verification_result: str | None = None
# W2 PR-V1: SelfHealingValidator 自愈品質評估 (2026-04-28 ogt + Claude Sonnet 4.6)
# ENABLE_SELF_HEALING_VALIDATOR=false 時永 None
self_healing_score: float | None = None
self_healing_detail: dict[str, Any] | None = None
# Phase 3 填充(目前永 null
matched_playbook_id: str | None = None
# ─────────────────────────────────────────────────────────────
# Derived helpers
# ─────────────────────────────────────────────────────────────
@property
def sensor_coverage_ratio(self) -> float:
"""感官覆蓋率0.0 ~ 1.0"""
if self.sensors_attempted == 0:
return 0.0
return self.sensors_succeeded / self.sensors_attempted
@property
def has_k8s_context(self) -> bool:
return self.k8s_state is not None
@property
def has_log_context(self) -> bool:
return self.recent_logs is not None and len(self.recent_logs) > 0
def build_summary(self) -> str:
"""
組裝 LLM-ready 情報摘要(< MAX_SUMMARY_CHARS
格式採用 <raw_evidence> 區塊隔離,防止 Prompt Injection。
"""
parts: list[str] = []
# 告警基礎資訊永遠放在最前sensors=0 時也要讓 AI 知道是什麼告警)
if self.alert_info:
parts.append(f"[告警資訊] {self.alert_info}")
if self.k8s_state:
parts.append(f"[K8s狀態] {self.k8s_state}")
if self.recent_logs:
parts.append(f"[近期日誌]\n{self.recent_logs[:2000]}")
if self.metrics_snapshot:
parts.append(f"[指標快照] {self.metrics_snapshot}")
if self.recent_deployments:
dep_str = "; ".join(
d.get("summary", str(d)) for d in self.recent_deployments[:3]
)
parts.append(f"[近期部署] {dep_str}")
if self.business_metrics:
parts.append(f"[業務指標] {self.business_metrics}")
if self.historical_context:
parts.append(f"[歷史脈絡] {self.historical_context[:500]}")
if self.peer_health:
parts.append(f"[同級副本健康度] {self.peer_health}")
if self.dependency_topology:
parts.append(f"[依賴拓撲] {self.dependency_topology}")
if self.anomaly_context:
parts.append(f"[動態異常偵測]\n{self.anomaly_context}")
# 2026-04-27 P3.1-T2-PathA by Claude — DiagAggregator 信號分類層(結構化 dict
if self.extra_diagnosis and self.extra_diagnosis.get("signals"):
signals_str = ", ".join(
s.get("signal_type", "?") for s in self.extra_diagnosis["signals"][:5]
)
parts.append(f"[Signal Classification] {signals_str}")
# 感官品質報告
failed_tools = [t for t, ok in self.mcp_health.items() if not ok]
if failed_tools:
parts.append(f"[感官警告] 以下工具呼叫失敗,情報可能不完整: {failed_tools}")
raw = "\n\n".join(parts)
summary = f"<raw_evidence>\n{raw}\n</raw_evidence>"
# Token budget 保護
if len(summary) > MAX_SUMMARY_CHARS:
summary = summary[:MAX_SUMMARY_CHARS] + "\n[...已截斷,超出 token budget]</raw_evidence>"
return summary
# ─────────────────────────────────────────────────────────────
# Persistence
# ─────────────────────────────────────────────────────────────
async def save(self) -> str:
"""
將快照持久化到 incident_evidence 表。
Returns:
str: snapshot_idUUID
"""
if self.evidence_summary is None:
self.evidence_summary = self.build_summary()
try:
async with get_db_context() as db:
record = IncidentEvidence(
id=self.snapshot_id,
incident_id=self.incident_id,
matched_playbook_id=self.matched_playbook_id,
schema_version=self.schema_version,
k8s_state=self.k8s_state,
recent_logs=self.recent_logs,
metrics_snapshot=self.metrics_snapshot,
recent_deployments=self.recent_deployments,
business_metrics=self.business_metrics,
historical_context=self.historical_context,
peer_health=self.peer_health,
dependency_topology=self.dependency_topology,
anomaly_context=self.anomaly_context,
mcp_health=self.mcp_health,
collection_duration_ms=self.collection_duration_ms,
sensors_attempted=self.sensors_attempted,
sensors_succeeded=self.sensors_succeeded,
evidence_summary=self.evidence_summary,
pre_execution_state=self.pre_execution_state,
post_execution_state=self.post_execution_state,
verification_result=self.verification_result,
collected_at=self.collected_at,
)
db.add(record)
await db.flush()
logger.info(
"evidence_snapshot_saved",
snapshot_id=self.snapshot_id,
incident_id=self.incident_id,
sensors_succeeded=self.sensors_succeeded,
collection_ms=self.collection_duration_ms,
)
return self.snapshot_id
except Exception:
logger.exception(
"evidence_snapshot_save_error",
snapshot_id=self.snapshot_id,
incident_id=self.incident_id,
)
raise
async def update_post_execution(
self,
post_state: dict[str, Any],
verification_result: str,
) -> None:
"""
PostExecutionVerifier 執行後補填 post_execution_state。
Args:
post_state: 執行後環境狀態
verification_result: "success" / "degraded" / "failed" / "timeout"
"""
self.post_execution_state = post_state
self.verification_result = verification_result
try:
async with get_db_context() as db:
stmt_result = await db.execute(
update(IncidentEvidence)
.where(IncidentEvidence.id == self.snapshot_id)
.values(
post_execution_state=post_state,
verification_result=verification_result,
)
)
# Gate 1 fix: 零行更新代表 snapshot 從未持久化save() 失敗)→ 學習數據將靜默丟失
if stmt_result.rowcount < 1:
logger.warning(
"evidence_snapshot_post_update_no_rows",
snapshot_id=self.snapshot_id,
verification_result=verification_result,
)
else:
logger.info(
"evidence_snapshot_post_execution_updated",
snapshot_id=self.snapshot_id,
verification_result=verification_result,
)
except Exception:
logger.exception(
"evidence_snapshot_post_update_error",
snapshot_id=self.snapshot_id,
)
raise
async def update_self_healing(
self,
score: float,
detail: dict[str, Any],
) -> None:
"""
W2 PR-V1: SelfHealingValidator 評估結果補填。
在 PostExecutionVerifier.verify() 完成 update_post_execution() 之後呼叫。
僅在 ENABLE_SELF_HEALING_VALIDATOR=True 且 snapshot 已持久化時有效。
Args:
score: 自愈品質分數0.0-1.0
detail: SelfHealingValidator.assess_self_healing() 返回的明細 dict
2026-04-28 ogt + Claude Sonnet 4.6: W2 PR-V1 初始建立
"""
self.self_healing_score = score
self.self_healing_detail = detail
try:
async with get_db_context() as db:
stmt_result = await db.execute(
update(IncidentEvidence)
.where(IncidentEvidence.id == self.snapshot_id)
.values(
self_healing_score=score,
self_healing_detail=detail,
)
)
if stmt_result.rowcount < 1:
logger.warning(
"evidence_snapshot_self_healing_update_no_rows",
snapshot_id=self.snapshot_id,
score=score,
)
else:
logger.info(
"evidence_snapshot_self_healing_updated",
snapshot_id=self.snapshot_id,
score=score,
)
except Exception:
logger.exception(
"evidence_snapshot_self_healing_update_error",
snapshot_id=self.snapshot_id,
)
raise
async def get_latest_snapshot(incident_id: str) -> EvidenceSnapshot | None:
"""
查詢某 Incident 最新的 EvidenceSnapshot由 snapshot_id 識別)。
主要供測試和 Phase 3 learning pipeline 使用。
"""
from sqlalchemy import desc, select
try:
async with get_db_context() as db:
result = await db.execute(
select(IncidentEvidence)
.where(IncidentEvidence.incident_id == incident_id)
.order_by(desc(IncidentEvidence.collected_at))
.limit(1)
)
row = result.scalar_one_or_none()
if row is None:
return None
snap = EvidenceSnapshot(
incident_id=row.incident_id,
snapshot_id=row.id,
schema_version=row.schema_version,
collected_at=row.collected_at,
k8s_state=row.k8s_state,
recent_logs=row.recent_logs,
metrics_snapshot=row.metrics_snapshot,
recent_deployments=row.recent_deployments,
business_metrics=row.business_metrics,
historical_context=row.historical_context,
peer_health=row.peer_health,
dependency_topology=row.dependency_topology,
anomaly_context=row.anomaly_context,
mcp_health=row.mcp_health or {},
collection_duration_ms=row.collection_duration_ms,
sensors_attempted=row.sensors_attempted or 0,
sensors_succeeded=row.sensors_succeeded or 0,
evidence_summary=row.evidence_summary,
pre_execution_state=row.pre_execution_state,
post_execution_state=row.post_execution_state,
verification_result=row.verification_result,
matched_playbook_id=row.matched_playbook_id,
)
return snap
except Exception:
logger.exception("evidence_snapshot_get_error", incident_id=incident_id)
return None