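docs(adr+skills): ADR-092 AI Decision LLM layer + Skill 03 update with the unified LLM pattern
Full documentation following the chief architect's 2026-04-19 review (92/100, Grade A):
**ADR-092 added (AI Decision LLM expansion architecture)**:
- Background: 8 of the 14 scanners are pure threshold logic, violating feedback_ai_autonomous_direction
- Decision: 4 LLM services + a unified pattern (llm_json_parser)
- Five iron-rule constraints: never raise on failure / AI only advises, never acts / openclaw as the single entry point /
aol audit trail / Traditional Chinese + JSON schema
- Throttling: daily cron + event trigger (red_ratio > 30% and scanned >= 50)
- autonomy_score (0-100) for quantitative tracking
- Implementation results + remaining P1 items + rollback plan
**Skill 03 openclaw-cognitive-expert updated**:
- New section "2026-04-19 AI Decision LLM Expansion Layer"
- Pattern code template (so the 3-path parse is not rewritten every time)
- Comparison table of the 4 LLM services + required_key
- Five iron rules for adding a service
- Usage notes for autonomy_score tracking
When Claude picks up the next session it can find the LLM service pattern quickly and will not reinvent the wheel.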
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -567,3 +567,68 @@ match_rule(alert_context)
- `memory/project_phase13_enterprise_aiops.md`: Phase 13 planning
- Phase 6.0-6.3: Cognitive Awakening Plan
- ADR-064: Alert Rule Engine

---

## 🆕 2026-04-19 AI Decision LLM Expansion Layer (ADR-092)

### Unified LLM Service Pattern

**Helper**: `apps/api/src/services/llm_json_parser.py`

```python
from typing import Any

from src.services.llm_json_parser import parse_llm_json_response
from src.services.openclaw import get_openclaw

# Template: assumes a module-level `logger` and a `_PROMPT` format string.


async def _llm_analyze_xxx(input_data: dict[str, Any]) -> dict[str, Any] | None:
    """Ask the LLM for advice; return None on any failure (never raise)."""
    try:
        prompt = _PROMPT.format(**input_data)
        openclaw = get_openclaw()
        text, provider, success = await openclaw.call(prompt)
        if not success or not text:
            return None
        parsed = parse_llm_json_response(
            text,
            required_key="your_required_key",  # e.g. 'recommended_actions'
            logger_context="your_service_name",
        )
        if parsed:
            parsed["_llm_provider"] = provider  # record which provider answered
        return parsed
    except Exception as e:
        logger.warning("xxx_llm_error", error=str(e))
        return None
```

**The 3-path fallback is handled automatically**:
- Path 1: strip the markdown fence, then parse the JSON directly
- Path 2: NemoTron wrapper (JSON embedded in description/action_title/reasoning)
- Path 3: on failure, return None + logger.warning (never raise)

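The three paths listed above could look roughly like the following. This is a hypothetical sketch, not the contents of `llm_json_parser.py`: the function name `parse_llm_json_sketch`, the fence regex, and the way wrapper fields are searched are assumptions, and it uses the stdlib `logging` module rather than the project's logger.

```python
import json
import logging
import re
from typing import Any, Optional

logger = logging.getLogger("llm_json_sketch")

# Matches an optional ```json ... ``` fence wrapped around the payload.
_FENCE_RE = re.compile(r"^```(?:json)?\s*(.*?)\s*```$", re.DOTALL)
# Wrapper fields that may carry the real JSON as an embedded string.
_WRAPPER_FIELDS = ("description", "action_title", "reasoning")


def _loads_dict(text: str) -> Optional[dict[str, Any]]:
    """Return the decoded JSON object, or None if text is not a JSON object."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, dict) else None


def parse_llm_json_sketch(
    text: str, required_key: str, logger_context: str
) -> Optional[dict[str, Any]]:
    stripped = text.strip()
    match = _FENCE_RE.match(stripped)
    if match:
        stripped = match.group(1)

    # Path 1: the payload itself is JSON that contains required_key.
    outer = _loads_dict(stripped)
    if outer is not None and required_key in outer:
        return outer

    # Path 2: a wrapper object whose string fields embed the real JSON.
    if outer is not None:
        for field in _WRAPPER_FIELDS:
            inner_text = outer.get(field)
            if isinstance(inner_text, str):
                inner = _loads_dict(inner_text.strip())
                if inner is not None and required_key in inner:
                    return inner

    # Path 3: give up quietly; callers fall back to their hard-coded rules.
    logger.warning("%s: no JSON with key %r found", logger_context, required_key)
    return None
```

Returning `None` instead of raising keeps the helper compatible with the "never raise" rule, so a malformed LLM reply costs one warning line and nothing else.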
### The 4 Existing LLM Services (use this pattern as the reference when adding more)

| Service | required_key | Purpose | Trigger |
|---|---|---|---|
| `hermes_rule_quality_job` | `recommended_actions` | Root cause of false alerts from noisy rules | Daily 04:00 |
| `capacity_forecaster_job` | `priority_actions` | Remediation strategy for capacity forecasts | Daily 05:00 |
| `compliance_scanner_job` | `posture_grade` | Compliance posture grade A/B/C/D/F | Daily 03:00 |
| `coverage_evaluator_job` | `worst_dimension` | Suggestions for closing coverage gaps | red_ratio > 30% and scanned >= 50 |

### Iron Rules for Adding an LLM Service (ADR-092)

1. **Never raise on failure**: try/except and return None; the caller falls back to hard-coded rules
2. **AI only advises, never acts**: output must set `requires_human_decision=True`
3. **openclaw is the single entry point**: do not call Ollama/NVIDIA/Gemini directly
4. **aol audit trail**: write to `automation_operation_log.output.llm_analysis`
5. **Traditional Chinese + JSON schema**: the prompt states the required_key explicitly

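Rules 1 and 2 together imply a caller shaped roughly like the one below. It is a minimal sketch under assumptions: `build_capacity_advice`, `rule_based_actions`, the 7-day cutoff, and the `source` field are hypothetical; only `priority_actions`, `_llm_provider`, and `requires_human_decision` come from the text above.

```python
from typing import Any, Optional


def rule_based_actions(host: str, days_to_full: float) -> list[str]:
    """Hard-coded fallback used when the LLM path yields nothing (hypothetical)."""
    if days_to_full < 7:
        return [f"Free or expand storage on {host}: full in about {days_to_full:.0f} days"]
    return []


def build_capacity_advice(
    host: str, days_to_full: float, llm_result: Optional[dict[str, Any]]
) -> dict[str, Any]:
    """Merge LLM output with the fallback; the result is advice, never an action."""
    if llm_result and llm_result.get("priority_actions"):
        actions = list(llm_result["priority_actions"])
        source = llm_result.get("_llm_provider", "llm")
    else:
        # Rule 1: a failed LLM call arrives here as None, not as an exception.
        actions = rule_based_actions(host, days_to_full)
        source = "rule_fallback"
    return {
        "host": host,
        "priority_actions": actions,
        "source": source,
        # Rule 2: every output is flagged for a human to approve.
        "requires_human_decision": True,
    }
```

Because the flag is set in one place regardless of which branch produced the actions, a fallback result cannot accidentally skip human review.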
### autonomy_score Tracking

`GET /api/v1/aiops/kpi` → `ai_autonomy_score.total` (0-100)

5 sub-items × 20 points each:
- asset_coverage / rule_quality / capacity_health / automation_flow / ai_diversity

Grade: mature(90+) / in_progress(70-90) / starter(50-70) / initial(<50)

Measured 2026-04-19: **63/100 (starter)**; LLM decision services went from 1/9 to 4/9

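The grade bands can be expressed as a small lookup. This is a sketch, not the code in `aiops_kpi_service.py`: the function name `autonomy_grade` is hypothetical, and because 70 and 90 each appear in two bands, it assumes a boundary score belongs to the higher band.

```python
def autonomy_grade(total: float) -> str:
    """Map a 0-100 autonomy score to its grade label."""
    if not 0 <= total <= 100:
        raise ValueError(f"score out of range: {total}")
    if total >= 90:
        return "mature"
    if total >= 70:  # assumption: exactly 70 counts as in_progress
        return "in_progress"
    if total >= 50:  # assumption: exactly 50 counts as starter
        return "starter"
    return "initial"
```

With the measured value above, `autonomy_grade(63)` returns `"starter"`, matching the reported grade.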
docs/adr/ADR-092-ai-decision-llm-expansion-layer.md
@@ -0,0 +1,134 @@
# ADR-092: AI Decision Layer (LLM Expansion Architecture)

**Status**: Accepted
**Date**: 2026-04-19 (Taipei time zone)
**Authors**: ogt + Claude Opus 4.7 (1M context, Asia-Pacific)
**Related**: ADR-090 blind-spot governance / MASTER §3 D1-D6 / feedback_ai_autonomous_direction

---

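## Background

After the 2026-04-19 session completed Phase 7 blind-spot governance (14 new scanners; all 8 tables that previously had 0 writers are now active), the chief architect's review found a serious gap:

- **Shallow AI layer**: **8 of the 14 new scanners are pure threshold (rule-based)** logic; only Hermes actually uses an LLM to make decisions
- **Measured autonomy_score**: 63/100 (starter level); the `ai_diversity` sub-item scores only 6/20
- **Commander's iron rule**: "move toward AI autonomy"; infrastructure is not the same as AI

This violates the core principle of `feedback_ai_autonomous_direction.md`: "never let hard-coded rules make the final decision."

---
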
## Decision

### 1. The 4 LLM Services of the AI Decision Layer (completed this session)

| Service | required_key judged by the LLM | Trigger | Responsibility |
|---|---|---|---|
| **hermes_rule_quality** | `recommended_actions` | Daily 04:00 Taipei | Analyze the root cause of false alerts for rules with noise_rate > 0.5 + improvement suggestions |
| **capacity_forecaster** | `priority_actions` | Daily 05:00 Taipei (on predict_linear hit) | Analyze remediation strategy for hosts the 7d capacity forecast flags as high risk |
| **compliance_scanner** | `posture_grade` | Daily 03:00 Taipei (when warnings/violations exist) | Produce an overall compliance posture grade + top 3 actions |
| **coverage_evaluator** | `worst_dimension` | Every 1h (red_ratio > 30% and scanned >= 50) | Analyze red gaps across 7 dimensions + coverage suggestions |

### 2. Unified LLM Call Pattern (services/llm_json_parser.py)

**Original problem**: the 4 LLM scanners each duplicated the 3-path JSON parse logic (~80 lines × 4 = 320 lines)
**New pattern**: `parse_llm_json_response(text, required_key, logger_context)`

```python
# Unified 3-path fallback
# Path 1: strip the markdown fence, then parse JSON directly; it must contain required_key
# Path 2: NemoTron wrapper: JSON embedded in description/action_title/reasoning
# Path 3: everything failed: return None + logger.warning
```

Every LLM service uses this helper; future LLM services call it directly.

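### 3. LLM Service Architecture Constraints (Iron Rules)

**Must follow**:
1. **Never raise on failure**: LLM down or parse failed → return None → the caller falls back to hard-coded rules
2. **AI only advises, never acts**: every LLM output is written with `requires_human_decision=True` and pushed to Telegram to wait for a human
3. **openclaw is the single entry point**: do not call Ollama/NVIDIA/Gemini directly; go through `get_openclaw().call(prompt)` for multi-provider fallback
4. **aol audit trail**: LLM results go into `automation_operation_log.output.llm_analysis`
5. **Traditional Chinese + JSON schema**: the prompt states the required_key explicitly, and the system prompt tells the LLM to produce pure JSON

**Forbidden**:
- Hard-coded thresholds making the final decision (they may only "trigger the discussion")
- Executing destructive actions straight from LLM output (a human must approve)
- Inline multi-provider selection logic (always go through openclaw)
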
### 4. Trigger Design Principles

**Throttle to keep token costs from exploding**:
- Daily cron: once per day (compliance/forecaster/hermes)
- Event trigger: coverage runs the LLM only when `red_ratio > 30%` and `scanned >= 50`
- This avoids the wasted call that bootstrap would otherwise always trigger on its first run

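The dual condition for the coverage trigger can be sketched as a single gate. The function and parameter names here are hypothetical; only the two thresholds (ratio strictly above 30%, at least 50 scanned) come from the text above.

```python
def should_run_coverage_llm(
    red_count: int,
    scanned: int,
    min_red_ratio: float = 0.30,
    min_scanned: int = 50,
) -> bool:
    """Return True only when both throttle conditions hold."""
    if scanned < min_scanned:
        # Too few assets scanned (e.g. right after bootstrap): the ratio is noisy.
        return False
    return red_count / scanned > min_red_ratio
```

Checking the sample size first also guards against division by zero when nothing has been scanned yet.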
### 5. Quantitative Tracking with autonomy_score

The new endpoint `GET /api/v1/aiops/kpi` returns a total (0-100) made of 5 sub-items × 20 points:
- `asset_coverage`: green ratio × 20
- `rule_quality`: (1 - noisy/total) × 20
- `capacity_health`: deduct 10 for critical / 3 for warning
- `automation_flow`: log10(24h_ops), normalized, × 20
- `ai_diversity`: ai_generated rules + op_type diversity × 20

Grade: mature(90+) / in_progress(70-90) / starter(50-70) / initial(<50)

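As an illustration of how the sub-items add up, the three with fully stated formulas might be computed as below. This is a sketch under assumptions, not the logic of `aiops_kpi_service.py`: all function names are hypothetical, each sub-item is assumed to be clamped to 0-20, `capacity_health` is assumed to start at 20 and deduct per finding, and an empty denominator is assumed to score 0. `automation_flow` and `ai_diversity` are passed in precomputed because their normalization is not specified here.

```python
def clamp_20(value: float) -> float:
    """Keep a sub-score inside its 0-20 band (assumed behaviour)."""
    return max(0.0, min(20.0, value))


def asset_coverage(green: int, total: int) -> float:
    return clamp_20(green / total * 20) if total else 0.0


def rule_quality(noisy: int, total: int) -> float:
    return clamp_20((1 - noisy / total) * 20) if total else 0.0


def capacity_health(critical: int, warning: int) -> float:
    # Assumption: start from the full 20 points and deduct per finding.
    return clamp_20(20 - 10 * critical - 3 * warning)


def autonomy_total(sub_scores: dict[str, float]) -> float:
    """Sum the five sub-items into the 0-100 total."""
    return sum(clamp_20(score) for score in sub_scores.values())
```

For example, 1 noisy rule out of 4 gives `rule_quality(1, 4) == 15.0`, and one critical plus two warnings leaves `capacity_health(1, 2) == 4.0`.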
---

## Implementation Results

### Metrics Achieved
- LLM decision services: **1/9 → 4/9**
- autonomy_score: unquantified → **63/100 (starter)**, now trackable over time
- Duplicated LLM JSON parse code: **320 lines → 1 shared helper** (90 lines)

### Commits
- `ba18ad2` Hermes LLM upgrade (Hermes moved from threshold to LLM)
- `d6b854a` Gap 3.1 capacity_forecaster LLM
- `f6cb938` Gap 3.2 compliance_scanner LLM
- `2f5cab2` Gap 3.3 coverage_evaluator LLM
- `fa643eb` P1 refactor: llm_json_parser helper + coverage dual condition

### Related Files
- `apps/api/src/services/llm_json_parser.py` (shared helper)
- `apps/api/src/services/openclaw.py` (multi-provider)
- `apps/api/src/jobs/{hermes_rule_quality,capacity_forecaster,compliance_scanner,coverage_evaluator}_job.py`
- `apps/api/src/services/aiops_kpi_service.py` (autonomy_score calculation)
- `apps/api/src/api/v1/aiops_kpi.py` (KPI endpoint)

---

## Follow-up Work (next session)

### Remaining P1 Optimizations (raised in the chief architect's review)
1. **Extract a Prometheus HTTP helper**: 5 scanners duplicate the same httpx + error-handling pattern
2. **Stagger first_delay across the 14 scanners**: avoid contending for the DB pool on parallel startup (suggested steps: 60/90/120/180)
3. **LLM budget guard**: add an `llm_call_count_24h` metric to `aiops_kpi_service` and send a Telegram alert when it exceeds a threshold
4. **Split the 918-line asset_scanner**: into two layers, `providers/` + `writers/`

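One possible shape for the budget guard in item 3 is a sliding 24-hour counter. This is a sketch of an unimplemented idea: the class `LlmBudgetGuard`, its methods, and the in-memory storage are all assumptions; only the metric name `llm_call_count_24h` comes from the list above, and the Telegram alert itself is left to the caller.

```python
from collections import deque

WINDOW_SECONDS = 24 * 60 * 60


class LlmBudgetGuard:
    """Sliding 24 h counter of LLM calls (hypothetical helper)."""

    def __init__(self, max_calls_24h: int) -> None:
        self.max_calls_24h = max_calls_24h
        self._calls: deque[float] = deque()

    def _evict(self, now: float) -> None:
        # Drop timestamps that have fallen out of the 24 h window.
        while self._calls and now - self._calls[0] >= WINDOW_SECONDS:
            self._calls.popleft()

    def record_call(self, now: float) -> None:
        self._evict(now)
        self._calls.append(now)

    def llm_call_count_24h(self, now: float) -> int:
        self._evict(now)
        return len(self._calls)

    def over_budget(self, now: float) -> bool:
        """True when the caller should raise an alert."""
        return self.llm_call_count_24h(now) > self.max_calls_24h
```

Timestamps are passed in explicitly (for example from `time.time()`) so the guard can be tested without waiting. A counter held in process memory resets on restart, so a real version would probably count rows in the operation log instead.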
### Possible LLM Extensions
- `rule_catalog_sync`: have the LLM validate that expr is reasonable when new rules are imported
- `asset_change_tracker`: have the LLM assess the blast radius of major lifecycle changes
- `drift_interpreter` (existing): upgrade to a more fine-grained prompt

---

## Rollback Plan

If the LLM services cause cost or stability problems, they can be switched off layer by layer:

1. Set config `AIOPS_LLM_ENABLED=false` (needs to be added; not implemented yet)
2. Or `kubectl exec` into the Pod and `kill` an individual loop task
3. Each LLM call's `try/except return None → fallback` design already keeps the main flow unaffected

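Since step 1 is not implemented yet, the following is only one way the switch could be wired in. Assumptions: the flag is read from an environment variable of the same name, it defaults to enabled, and `llm_enabled` / `guarded_llm_call` are hypothetical helpers.

```python
import os
from typing import Any, Callable, Mapping, Optional

_FALSE_VALUES = {"0", "false", "no", "off"}


def llm_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read AIOPS_LLM_ENABLED; anything other than a false-like value means on."""
    env = os.environ if env is None else env
    return env.get("AIOPS_LLM_ENABLED", "true").strip().lower() not in _FALSE_VALUES


def guarded_llm_call(
    call: Callable[[], Any], env: Optional[Mapping[str, str]] = None
) -> Any:
    """Skip the LLM entirely when the switch is off; callers see None and fall back."""
    if not llm_enabled(env):
        return None
    return call()
```

Returning `None` when disabled reuses the existing fallback path from step 3, so turning the flag off needs no changes in the callers.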
---

## Related ADRs

- ADR-067 Phase 30 five major AI applications (OpenClaw + drift_narrator + runbook_generator and the other original LLM services)
- ADR-081 PreDecisionInvestigator + EvidenceSnapshot
- ADR-083 Learning loop rebuild
- ADR-090 Monitoring blind-spot governance + automation coverage matrix (11 tables + 8 new tables)