awoooi/docs/adr/ADR-083-learning-loop-reconstruction.md
OG T 7da64eaad2
feat(Phase 3): rebuild the learning loop (three root-cause fixes + 2x EWMA + Evolver Agent)
ADR-083 Phase 3 learning loop reconstruction:

**Three root-cause fixes**
- approval_execution.py: fire-and-forget create_task → await asyncio.wait_for(timeout=30) × 2
  (success path L265 + failure path L353; a timeout records the learning_trigger_timeout metric and the main flow does not crash)
- models/approval.py: add the matched_playbook_id field to ApprovalRequestBase
- decision_manager.py: _auto_execute populates matched_playbook_id when it creates the ApprovalRequest
- learning_service.py: dual-path lookup for _matched_pb_id (matched_playbook_id + metadata fallback)

**2x EWMA negative reinforcement**
- models/playbook.py: add trust_score: float = 0.3 (EWMA dynamic trust field)
- repositories/playbook_repository.py: update_stats now applies EWMA
  Success: trust = 0.9 × old + 0.1 × 1.0
  Failure: trust = 0.8 × old + 0.2 × 0.0 (decays 2x faster)
  trust < 0.1 → log a warning and wait for the Evolver to archive it

**Evolver Agent (new)**
- services/playbook_evolver.py: three functions, all static rules
  1. Low-trust archival: trust < 0.1 → DEPRECATED
  2. Dormancy archival: unused for 30d AND trust < 0.5 → DEPRECATED
  3. Similarity merge: symptom Jaccard > 0.9 → keep the higher-trust playbook, archive the lower-trust one
  AIOPS_P3_EVOLVER_ENABLED=False, off by default

**Documentation**
- ADR-083 learning loop reconstruction
- MASTER §8 Phase 3 completion record

AIOPS_P3_ENABLED=False (default); the skeleton is in place and awaits the commander's approval to enable it

Co-Authored-By: Claude Sonnet 4.6 (Asia-Pacific) <noreply@anthropic.com>
2026-04-15 14:01:37 +08:00

# ADR-083: Learning Loop Reconstruction and Playbook Evolution
**Status**: Completed
**Date**: 2026-04-15
**Author**: ogt + Claude Sonnet 4.6 (Asia-Pacific)
**Phase**: Phase 3
---
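## Background
With Phase 2 (ADR-082) complete, the AWOOOI AIOps system enters Phase 3.
According to the MASTER v2 architecture diagnosis, the L7 learning layer has three critical root causes that keep Playbook trust **permanently stuck at its initial value of 0.3**:
1. **fire-and-forget**: `approval_execution._trigger_learning()` is called through `asyncio.create_task()`. The main flow does not wait for it, and because no reference to the task is kept, the garbage collector may reclaim the coroutine before learning completes.
2. **matched_playbook_id is always null**: `decision_manager._auto_execute` never populates `matched_playbook_id` when it creates the `ApprovalRequest`, so the `if _matched_pb_id:` condition in `learning_service._update_playbook_stats` is always False.
3. **No negative reinforcement**: `playbook_repository.update_stats` only increments `success_count / failure_count`; it performs no EWMA dynamic trust calculation.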
## Decisions
### D1: fire-and-forget → await asyncio.wait_for
```python
# Old (fire-and-forget)
asyncio.create_task(self._trigger_learning(...))
# New (Phase 3)
try:
    await asyncio.wait_for(self._trigger_learning(...), timeout=30.0)
except asyncio.TimeoutError:
    logger.warning("learning_trigger_timeout", ...)
```
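- On a 30 s timeout, a metric is recorded and the main flow continues without crashing
- The fix is applied in two places: once on the success path and once on the failure path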
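The bounded-await pattern in D1 can be tried in isolation with the sketch below. It is a minimal illustration, not the project's code: `trigger_learning` and `execute_with_learning` are hypothetical stand-ins, and the sleep only simulates the learning work.

```python
import asyncio
import logging

logger = logging.getLogger("approval_execution")


async def trigger_learning(delay: float) -> str:
    """Hypothetical stand-in for _trigger_learning; the sleep simulates work."""
    await asyncio.sleep(delay)
    return "learned"


async def execute_with_learning(delay: float, timeout: float = 30.0) -> str:
    """Await the learning step with an upper bound; a timeout never fails the caller."""
    try:
        await asyncio.wait_for(trigger_learning(delay), timeout=timeout)
        return "learning_done"
    except asyncio.TimeoutError:
        # wait_for cancels the inner coroutine before raising, so nothing
        # keeps running in the background after the timeout.
        logger.warning("learning_trigger_timeout")
        return "learning_timeout"
```

Calling `asyncio.run(execute_with_learning(0.0, timeout=1.0))` takes the normal branch, while a delay longer than the timeout takes the warning branch and still returns to the caller. The trade-off is that the main flow can now be delayed by up to the timeout value, which is the price of guaranteeing the learning step is not silently dropped.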
### D2: Passing matched_playbook_id through
```python
# New field on ApprovalRequestBase
matched_playbook_id: str | None = Field(default=None)
# Populated in decision_manager._auto_execute
_matched_playbook_id = token.proposal_data.get("playbook_id")
approval = ApprovalRequest(..., matched_playbook_id=_matched_playbook_id)
# Dual-path lookup in learning_service
_matched_pb_id = (
    getattr(approval, "matched_playbook_id", None)
    or (approval.metadata or {}).get("matched_playbook_id")
    or (approval.metadata or {}).get("playbook_id")
)
```
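The lookup order in D2 can be exercised with a self-contained sketch. The `Approval` dataclass and `resolve_matched_playbook_id` below are hypothetical names introduced only for illustration; the real `ApprovalRequest` model has more fields.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Approval:
    """Hypothetical stand-in for ApprovalRequest, reduced to the fields used here."""
    matched_playbook_id: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = None


def resolve_matched_playbook_id(approval: Any) -> Optional[str]:
    """Prefer the dedicated field, then fall back to the two metadata keys."""
    metadata = getattr(approval, "metadata", None) or {}
    return (
        getattr(approval, "matched_playbook_id", None)
        or metadata.get("matched_playbook_id")
        or metadata.get("playbook_id")
    )
```

Because the chain uses `or`, an empty string in the dedicated field is treated the same as a missing value and the lookup falls through to `metadata`. An approval with neither the field nor the metadata keys resolves to `None`, which is the case the `if _matched_pb_id:` guard is meant to skip.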
### D3: 2x EWMA negative reinforcement
```
Playbook.trust_score initial value = 0.3 (new field)
Success: trust_new = 0.9 × trust_old + 0.1 × 1.0
Failure: trust_new = 0.8 × trust_old + 0.2 × 0.0  ← decay coefficient is 2x faster
```
trust < 0.1 → a warning is logged, and the Evolver Agent archives the playbook
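The asymmetric update rule can be written as a small pure function. This is a sketch under stated assumptions: `update_trust` and `failures_until_archive` are hypothetical helpers, not the body of `update_stats`, and the constants simply restate the numbers given in D3.

```python
INITIAL_TRUST = 0.3
SUCCESS_ALPHA = 0.1      # weight given to a success observation (target 1.0)
FAILURE_ALPHA = 0.2      # weight given to a failure observation (target 0.0)
ARCHIVE_THRESHOLD = 0.1  # below this the playbook is flagged for the Evolver


def update_trust(trust_old: float, success: bool) -> float:
    """Apply one EWMA step; failures use twice the smoothing weight of successes."""
    alpha = SUCCESS_ALPHA if success else FAILURE_ALPHA
    target = 1.0 if success else 0.0
    return (1.0 - alpha) * trust_old + alpha * target


def failures_until_archive(trust: float = INITIAL_TRUST) -> int:
    """Count consecutive failures needed to push trust below the archive threshold."""
    count = 0
    while trust >= ARCHIVE_THRESHOLD:
        trust = update_trust(trust, success=False)
        count += 1
    return count
```

Starting from 0.3, one success gives 0.9 × 0.3 + 0.1 = 0.37 and one failure gives 0.8 × 0.3 = 0.24. Five consecutive failures give 0.3 × 0.8⁵ ≈ 0.098, which is the first value below the 0.1 threshold.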
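### D4: Evolver Agent (playbook_evolver.py)
Three functions, all implemented as static rules (no LLM dependency):
1. **Low-trust archival**: trust_score < 0.1 → DEPRECATED
2. **Dormancy archival**: unused for 30d AND trust < 0.5 → DEPRECATED
3. **Similarity merge**: symptom Jaccard > 0.9 → keep the higher-trust playbook, archive the lower-trust one
Feature flag: `AIOPS_P3_EVOLVER_ENABLED=False` (off by default)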
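The three rules can be sketched as one selection pass over a list of playbooks. Everything below is an illustrative assumption: the `Playbook` fields, the `select_deprecated` function, the representation of symptoms as a set of strings, and the tie-break when two similar playbooks have equal trust are not taken from `playbook_evolver.py`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import FrozenSet, List, Set


@dataclass
class Playbook:
    """Hypothetical minimal playbook record used only for this sketch."""
    id: str
    trust_score: float
    last_used: datetime
    symptoms: FrozenSet[str]


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_deprecated(playbooks: List[Playbook], now: datetime) -> Set[str]:
    """Return the ids that the three static rules would mark DEPRECATED."""
    deprecated: Set[str] = set()
    for pb in playbooks:
        if pb.trust_score < 0.1:  # rule 1: low trust
            deprecated.add(pb.id)
        elif now - pb.last_used > timedelta(days=30) and pb.trust_score < 0.5:
            deprecated.add(pb.id)  # rule 2: dormant and not trusted enough
    survivors = [pb for pb in playbooks if pb.id not in deprecated]
    for i, first in enumerate(survivors):
        for second in survivors[i + 1:]:
            if first.id in deprecated or second.id in deprecated:
                continue
            if jaccard(first.symptoms, second.symptoms) > 0.9:  # rule 3: near-duplicates
                # Assumed tie-break: on equal trust the later entry is archived.
                loser = first if first.trust_score < second.trust_score else second
                deprecated.add(loser.id)
    return deprecated
```

The pairwise comparison in rule 3 is quadratic in the number of surviving playbooks, which is reasonable for a periodic batch job over a modest catalogue but would need indexing for a large one.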
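## Consequences
**Positive**
- The learning loop is connected end to end for the first time (node 9 trigger rate goes from 0% to nearly 100%)
- Playbook trust is updated dynamically, and failing playbooks are retired faster
- The Evolver prevents the Playbook set from bloating through repeated extraction
**Limitations**
- On the manual-review path, `matched_playbook_id` is still passed through the `metadata` fallback; a complete fix requires changing the approval DB schema and is deferred to Phase 3.5
- The Evolver uses Jaccard (not cosine) similarity; vectorized merging is deferred to Phase 4
- `AIOPS_P3_ENABLED = False` is the default: learning calls still run, and the EWMA calculation does not depend on this flag because it runs directly in the repository layer
## Acceptance criteria
- `matched_playbook_id` null rate = 0 (auto_execute path)
- Playbook trust_score is updated on every execution (verifiable through the `playbook_stats_updated` log)
- A learning timeout produces a `learning_trigger_timeout` log entry (no crash)
- A dry run of the Evolver's `run_evolver()` raises no exception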