awoooi/apps/api/src/services/self_healing_validator.py
Your Name 3668d49f2f
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m38s
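feat(flywheel): W2 three PRs + KMWriter critic fixes (all 1635 tests green)
W2 (week two of the onboarder's 4-week flywheel 80→90 path) plus the 5 critical/major findings from the critic PR review.
All are fixed; every flag defaults to false, so there is no blast risk.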

## The three W2 PRs

### PR-R2: AOL → catalog confidence EWMA writeback (fixes flywheel broken link C2)
- New file `apps/api/src/jobs/aol_to_catalog_writeback_job.py`
- Logic: scan AOL hourly, compute an EWMA confidence (alpha=0.3), and write it back to alert_rule_catalog
- Failure threshold: N=5 consecutive low success rates → review_status='draft'
- Hermes _fetch_noisy_rules SQL adds OR review_status='draft'
- ENABLE_AOL_WRITEBACK_JOB=false (default)
- 8 tests (mock path fix: lazy import → patch src.db.base.get_db_context)
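
The writeback rule above can be sketched roughly as follows. This is a minimal illustration, not the job's actual code: `RuleState`, `apply_window`, the `LOW_SUCCESS_CUTOFF` value, and the `"approved"` starting status are all assumptions; only alpha=0.3, N=5, and `review_status='draft'` come from the PR description.

```python
from dataclasses import dataclass

ALPHA = 0.3               # EWMA smoothing factor (from the PR description)
FAILURE_STREAK_N = 5      # consecutive low-success windows before demotion
LOW_SUCCESS_CUTOFF = 0.5  # assumption: what counts as a "low" success rate


@dataclass
class RuleState:
    """Hypothetical in-memory view of one alert_rule_catalog row."""
    confidence: float
    low_streak: int = 0
    review_status: str = "approved"  # assumption: initial status name


def apply_window(state: RuleState, success_rate: float) -> RuleState:
    """Fold one hourly success rate into the rule's EWMA confidence."""
    state.confidence = ALPHA * success_rate + (1 - ALPHA) * state.confidence
    # Track consecutive low-success windows; any good window resets the streak.
    if success_rate < LOW_SUCCESS_CUTOFF:
        state.low_streak += 1
    else:
        state.low_streak = 0
    if state.low_streak >= FAILURE_STREAK_N:
        state.review_status = "draft"
    return state


state = RuleState(confidence=0.8)
for rate in [0.2, 0.2, 0.2, 0.2, 0.2]:
    apply_window(state, rate)
print(round(state.confidence, 4), state.review_status)
```

With alpha=0.3 each new window contributes 30% of the updated value, so five poor windows in a row pull a 0.8 confidence down to roughly 0.30 and, under the assumed cutoff, flip the rule to `draft`.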

### PR-V1: self_healing_validator integration (fixes flywheel broken link C6)
- New file `apps/api/src/services/self_healing_validator.py` (pure function assess_self_healing)
- Wired into post_execution_verifier.py step 5 (behind a feature flag gate)
- evidence_snapshot.py adds the self_healing_score / self_healing_detail fields
- db/models.py + base.py ALTER IF NOT EXISTS
- score < 0.5 → triggers a rollback proposal as a Telegram alert (not executed automatically)
- ENABLE_SELF_HEALING_VALIDATOR=false (default)
- 7 tests
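
To make the score < 0.5 rule concrete, here is a small sketch of the arithmetic, using the base scores, per-regression penalty, and penalty cap that appear in `self_healing_validator.py`. The function names `score_remediation` and `should_propose_rollback` are hypothetical helpers for illustration and do not exist in the PR.

```python
# Constants copied from self_healing_validator.py.
BASE_SCORES = {"success": 1.0, "degraded": 0.4, "failed": 0.0, "timeout": 0.2}
PENALTY_PER_REGRESSION = 0.15
MAX_PENALTY = 0.4
ROLLBACK_THRESHOLD = 0.5  # score < 0.5 → rollback proposal


def score_remediation(verification_result: str, regression_count: int) -> float:
    """Hypothetical helper: base score minus the capped regression penalty."""
    base = BASE_SCORES.get(verification_result, 0.0)
    penalty = min(regression_count * PENALTY_PER_REGRESSION, MAX_PENALTY)
    return round(max(0.0, base - penalty), 4)


def should_propose_rollback(score: float) -> bool:
    """Hypothetical helper: True when a rollback proposal would be sent."""
    return score < ROLLBACK_THRESHOLD


for result, regressions in [("success", 0), ("success", 2), ("success", 4), ("degraded", 0)]:
    score = score_remediation(result, regressions)
    print(result, regressions, score, should_propose_rollback(score))
```

One consequence of these constants: because the penalty is capped at 0.4, a `success` verdict can never score below 0.6, so with these values a rollback proposal is only reachable from `degraded`, `failed`, `timeout`, or an unknown verdict.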

### PR-L1: KM ↔ Playbook bidirectional loop (fixes flywheel broken links C3+C4)
- Three new pieces of logic in learning_service.py:
  1. _write_playbook_evolution_km: promote/demote writes a KM evolution entry
  2. _check_and_mark_playbook_review: accumulating N=5 triggers review_required
  3. _demote_alert_rule_catalog_confidence: DEPRECATED → confidence×=0.5
- PlaybookRecord adds a review_required field (schema migration via base.py)
- ENABLE_KM_PLAYBOOK_FEEDBACK_LOOP=false (default)
- KM_PLAYBOOK_REVIEW_THRESHOLD=5 is tunable
- 6 tests
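
The review threshold and the confidence demotion might look roughly like this. It is a sketch under stated assumptions: `PlaybookSketch`, `record_evolution`, and `demote_confidence` are hypothetical names, and the PR text does not say exactly which events accumulate toward N, so the counter here is a guess. Only the threshold of 5, the `review_required` field, and the ×0.5 factor come from the description.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 5     # mirrors the KM_PLAYBOOK_REVIEW_THRESHOLD default
DEPRECATED_FACTOR = 0.5  # DEPRECATED → confidence ×= 0.5


@dataclass
class PlaybookSketch:
    """Hypothetical stand-in for PlaybookRecord; only review_required is from the PR."""
    evolution_entries: int = 0  # assumption: what accumulates toward N
    review_required: bool = False


def record_evolution(playbook: PlaybookSketch) -> None:
    """Count one evolution entry and flag the playbook once the threshold is hit."""
    playbook.evolution_entries += 1
    if playbook.evolution_entries >= REVIEW_THRESHOLD:
        playbook.review_required = True


def demote_confidence(confidence: float) -> float:
    """Halve the catalog confidence when a playbook is marked DEPRECATED."""
    return confidence * DEPRECATED_FACTOR


playbook = PlaybookSketch()
for _ in range(4):
    record_evolution(playbook)
flag_after_four = playbook.review_required
record_evolution(playbook)
flag_after_five = playbook.review_required
print(flag_after_four, flag_after_five, demote_confidence(0.8))
```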

## KMWriter critic: 5 critical/major fixes (found in the earlier critic PR review)
Already fixed in the previously pushed commit c5753e1c; this commit restores the corresponding files from the stash:
- C1 km_writer.py:194 backfill contradicted itself (fixed: synchronous await + DLQ)
- C2 km_writer.py:391 tightened the KM_WRITE_AWAIT=false path
- M1 decision_manager.py:2178/2203 removed _fire_and_forget
- M2 incident_service.py:1099 added retry + DLQ to the hand-rolled path
- M3 km_writer.py:166 aligned the idempotency claim (UPSERT + partial unique index)

## Verification
- All 1635 unit tests green (+27 from 1608)
- Coexists with fb0c72db (overturns A2; Ollama primary) without conflict
- Every new job/service defaults to flag=false (no blast risk)

## Expected impact
Flywheel broken links C2 + C3 + C4 + C6 all fixed
Flywheel autonomy score: 65 → 85 estimated (after W2 completes)

Enablement order (once fb0c72db in prod has verified that the Ollama primary runs):
1. ENABLE_AOL_WRITEBACK_JOB=true
2. ENABLE_KM_PLAYBOOK_FEEDBACK_LOOP=true
3. ENABLE_SELF_HEALING_VALIDATOR=true

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:44:04 +08:00

164 lines
5.3 KiB
Python
"""
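AWOOOI AIOps: self-healing quality validator
============================================
W2 PR-V1: fix for flywheel broken link C6, wiring self-healing quality assessment into PostExecutionVerifier

Responsibilities:
1. Assess whether the system truly self-healed (root cause cleared vs. metric only temporarily recovered)
2. Regression detection (one metric fixed while other metrics got worse)
3. Remediation quality score (0.0 ~ 1.0)

Scoring logic:
- base_score is set by verification_result (success=1.0 / degraded=0.4 / failed=0.0 / timeout=0.2)
- regression_penalty is set by the number of degraded metrics found in the pre/post state diff
- final score = max(0.0, base_score - regression_penalty)

Thresholds:
- score < 0.5 → rollback proposal (Telegram alert, not executed automatically)
- score >= 0.5 → self-healing accepted, no further action

Design principles:
- Do not modify the internal logic of self_healing_validator (external integration layer)
- A validation failure must not block the main flow (fault-tolerant, fully wrapped in try/except)
- Feature flag: ENABLE_SELF_HEALING_VALIDATOR=false (off by default)

ADR-081 Phase 1 extension
2026-04-28 ogt + Claude Sonnet 4.6: W2 PR-V1 initial creation (C6 fix)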
"""
from __future__ import annotations

import re
from typing import TYPE_CHECKING, Any

import structlog

if TYPE_CHECKING:
    pass

logger = structlog.get_logger(__name__)

# Base remediation quality score, keyed by verification_result
_BASE_SCORES: dict[str, float] = {
    "success": 1.0,
    "degraded": 0.4,
    "failed": 0.0,
    "timeout": 0.2,
}

# Penalty per degraded metric
_REGRESSION_PENALTY_PER_METRIC = 0.15

# Penalty cap (avoids over-penalizing)
_MAX_REGRESSION_PENALTY = 0.4

# Root-cause-cleared signals (any of these in post_state → root cause cleared)
_ROOT_CAUSE_CLEARED_SIGNALS = ["running", "ready", "1/1", "2/2", "3/3", "healthy"]

# Regression signals (present in post_state but absent from pre_state → regression)
_REGRESSION_SIGNALS = [
    "crashloopbackoff",
    "oomkilled",
    "oomkill",
    "pending",
    "terminating",
    "error",
    "failed",
    "timeout",
    "evicted",
    "imagepullbackoff",
    "errimagepull",
]

# Numeric metric regression detection (regex finds percentages, compares the increase)
_NUMERIC_THRESHOLD_RATIO = 0.2  # a relative increase above 20% counts as a regression


def assess_self_healing(
    pre_state: dict[str, Any] | None,
    post_state: dict[str, Any] | None,
    verification_result: str,
    action_taken: str,
) -> dict[str, Any]:
    """
    Assess self-healing quality and return a structured result.

    Args:
        pre_state: environment state before execution (may be None)
        post_state: environment state after execution (may be None)
        verification_result: PostExecutionVerifier verdict (success/degraded/failed/timeout)
        action_taken: description of the action that was executed

    Returns:
        dict containing:
            score (float 0.0-1.0)
            root_cause_cleared (bool)
            regressions (list[str]: names of the metrics that got worse)
            detail (str: human-readable explanation)
            verification_result, action_taken (echoed back from the arguments)
    """
    # Unknown verdicts fall back to a base score of 0.0.
    base_score = _BASE_SCORES.get(verification_result, 0.0)
    # States are compared as lowercased string representations.
    pre_str = str(pre_state).lower() if pre_state else ""
    post_str = str(post_state).lower() if post_state else ""

    # 1. Was the root cause really cleared?
    root_cause_cleared = any(sig in post_str for sig in _ROOT_CAUSE_CLEARED_SIGNALS)
    if verification_result in ("failed", "timeout"):
        root_cause_cleared = False

    # 2. Regression detection: signals that appear in post but not in pre.
    # Note: matching is by substring, so "oomkilled" also matches "oomkill"
    # and an OOM-killed pod is counted as two regressions.
    regressions: list[str] = []
    for sig in _REGRESSION_SIGNALS:
        if sig in post_str and sig not in pre_str:
            regressions.append(sig)

    # 3. Numeric metric regression detection (simple version: relative increase in percentages)
    pre_nums = _extract_percentages(pre_str)
    post_nums = _extract_percentages(post_str)
    for key, pre_val in pre_nums.items():
        if key in post_nums:
            post_val = post_nums[key]
            if pre_val > 0 and (post_val - pre_val) / pre_val > _NUMERIC_THRESHOLD_RATIO:
                regressions.append(f"metric_increase:{key}")

    # 4. Compute the final score
    regression_penalty = min(
        len(regressions) * _REGRESSION_PENALTY_PER_METRIC,
        _MAX_REGRESSION_PENALTY,
    )
    score = max(0.0, base_score - regression_penalty)

    # 5. Assemble the explanation
    detail_parts = [f"base={base_score:.2f}"]
    if regressions:
        detail_parts.append(f"regression_penalty={regression_penalty:.2f} ({','.join(regressions[:5])})")
    if not root_cause_cleared and verification_result == "success":
        detail_parts.append("root_cause_unclear")
    detail = "; ".join(detail_parts)

    return {
        "score": round(score, 4),
        "root_cause_cleared": root_cause_cleared,
        "regressions": regressions,
        "detail": detail,
        "verification_result": verification_result,
        "action_taken": action_taken,
    }


def _extract_percentages(text: str) -> dict[str, float]:
    """
    Extract numeric percentages from a state string.

    e.g. "cpu_usage: 85%" → {"cpu_usage": 85.0}

    Used to detect metric regressions (simple heuristic, Phase 1 version).
    """
    result: dict[str, float] = {}
    # Format: word_key: N% or word_key=N%
    pattern = re.compile(r"(\w+)[:\s=]+(\d+(?:\.\d+)?)\s*%")
    for match in pattern.finditer(text):
        key = match.group(1)
        val = float(match.group(2))
        result[key] = val
    return result
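
The percentage heuristic above is easiest to understand from a few inputs. The sketch below copies the same regex and 20% threshold into standalone helpers (`extract_percentages` and `increased_metrics` are illustrative names, not part of the module) and shows one case that may be surprising: when a percentage is a quoted dict value, the `str()` representation puts a quote between the key and the colon, so the pattern does not match.

```python
import re

# Same pattern and threshold as in the module above.
PERCENT_PATTERN = re.compile(r"(\w+)[:\s=]+(\d+(?:\.\d+)?)\s*%")
THRESHOLD_RATIO = 0.2


def extract_percentages(text: str) -> dict[str, float]:
    """Standalone copy of the extraction logic, for illustration only."""
    return {m.group(1): float(m.group(2)) for m in PERCENT_PATTERN.finditer(text)}


def increased_metrics(pre_text: str, post_text: str) -> list[str]:
    """Keys whose percentage rose by more than 20% relative to the pre value."""
    pre = extract_percentages(pre_text)
    post = extract_percentages(post_text)
    return [
        key
        for key, pre_val in pre.items()
        if key in post and pre_val > 0 and (post[key] - pre_val) / pre_val > THRESHOLD_RATIO
    ]


# Free-text status lines match both "key: N%" and "key=N %".
free_text = extract_percentages("cpu_usage: 85%, mem=40.5 %")

# A quoted dict value does not match: the text is {'cpu_usage': '85%'}.
dict_repr = extract_percentages(str({"cpu_usage": "85%"}).lower())

# cpu rises 60 → 85 (about +42%), mem rises 40 → 42 (+5%).
flagged = increased_metrics(
    "cpu_usage: 60% mem_usage: 40%",
    "cpu_usage: 85% mem_usage: 42%",
)
print(free_text, dict_repr, flagged)
```

In other words, the numeric check only fires when the state dicts carry percentages inside free-text strings such as `"cpu_usage: 85%"`; callers that store them as separate keyed values would need a different extraction step.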