fix(arch-review): chief-architect review, all S1×3 S2×3 S3×3 findings fixed + ADR-064
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
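S1 Critical:
- S1-1: Move the asyncio trigger into the async context of _call_with_fallback; remove get_event_loop() from the sync path
- S1-2: _append_rule_to_yaml applies textwrap.dedent() to normalize the indentation of LLM output
- S1-3: _matches() returns False immediately for alertname=["*"] to prevent accidental hits
S2 Major:
- S2-1: auto_generate_rule() takes its settings through DI parameters (ollama_url/model/gemini_api_key); the settings import is removed
- S2-4: _generate_mock_response docstring clarified: it is the production rule-engine path, not fake data
- S2-5: suggested_action uses .strip() so a whitespace-only string cannot bypass the `or` fallback
S3 Minor:
- S3-2: priority upper bound min(next, 890)
- S3-3: sanitize alertname with re.sub([{}]) to guard against a format KeyError
- S3-4: model_registry.py last-modified timestamp updated
Docs:
- ADR-064: Alert Rule Engine, YAML-driven + AI auto-learning
- Skills 02: alert rule engine DI conventions + asyncio prohibitions
- Skills 03: _generate_mock_response semantics clarified + rule-engine degradation flow
- LOGBOOK: full record of this session
2026-04-09 ogt: chief-architect review fixes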
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -970,11 +970,58 @@ async def send_xxx_notification(self, ...) -> None:

---

## Alert Rule Engine (ADR-064, 2026-04-09)

**Module**: `apps/api/src/services/alert_rule_engine.py`
**Config**: `apps/api/alert_rules.yaml`

### Rule matching

```python
from src.services.alert_rule_engine import match_rule
result = match_rule(alert_context)  # dict | None
# result["rule_id"] == "generic_fallback" → triggers AI auto-learning
```

### AI automatic rule learning

When `generic_fallback` is hit, trigger learning from the calling **async** method:

```python
asyncio.create_task(auto_generate_rule(
    alert_context,
    ollama_url=settings.OLLAMA_URL,  # injected via DI
    model=settings.OPENCLAW_DEFAULT_MODEL,
    gemini_api_key=getattr(settings, "GEMINI_API_KEY", ""),
))
```

⚠️ **Never call asyncio.get_event_loop() from a sync method.** Use `asyncio.create_task()` from an async context instead.
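
The split between the sync matcher and the async trigger can be sketched as follows. This is a minimal illustration, not the project's code: `generate_rule_response`, `call_with_fallback`, the placeholder `auto_generate_rule` body, the `_background_tasks` set and the URL/model values are all assumptions made for the example. The sync function only returns `(json_str, rule_id)`; the async caller is the only place that schedules the task.

```python
import asyncio
import json

# Hypothetical stand-ins for the real functions; names and bodies are assumptions.
_background_tasks: set[asyncio.Task] = set()
generated: list[str] = []


def generate_rule_response(alert_context: dict) -> tuple[str, str]:
    """Sync rule matching: returns (json_str, rule_id) and never touches the event loop."""
    rule_id = "generic_fallback"  # pretend that only the catch-all rule matched
    return json.dumps({"rule_id": rule_id}), rule_id


async def auto_generate_rule(alert_context: dict, ollama_url: str, model: str,
                             gemini_api_key: str = "") -> None:
    """Placeholder learner: records the alertname instead of calling an LLM."""
    await asyncio.sleep(0)
    generated.append(alert_context["labels"]["alertname"])


async def call_with_fallback(alert_context: dict) -> str:
    """Async caller: the background task is scheduled here, inside a running loop."""
    body, rule_id = generate_rule_response(alert_context)
    if rule_id == "generic_fallback":
        task = asyncio.create_task(auto_generate_rule(
            alert_context,
            ollama_url="http://ollama.invalid",  # assumed value for the sketch
            model="example-model",               # assumed value for the sketch
        ))
        _background_tasks.add(task)  # keep a strong reference until the task finishes
        task.add_done_callback(_background_tasks.discard)
    return body


async def main() -> str:
    body = await call_with_fallback({"labels": {"alertname": "DiskFull"}})
    await asyncio.gather(*_background_tasks)  # demo only: wait for the background task
    return body


result = asyncio.run(main())
```

The event loop holds only weak references to tasks, so a fire-and-forget `create_task()` call is usually paired with a container such as the set above; whether the real service does this is not shown in the diff.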

### Priority scheme

| Range | Purpose |
|------|------|
| 1–499 | Hand-written rules (never overridden by AI) |
| 500–890 | AI-generated rules |
| 999 | `generic_fallback` catch-all |
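
A small sketch of how the next AI priority could be derived from this table, mirroring the `max(...) + 10` and `min(next, 890)` logic in the diff. The helper name `next_ai_priority` and the constants are hypothetical, introduced only for illustration.

```python
AI_MIN, AI_MAX, STEP = 500, 890, 10  # assumed constants matching the table above


def next_ai_priority(existing: list[int]) -> int:
    """Highest existing AI priority plus STEP, capped at AI_MAX."""
    # Only priorities inside the AI band count; hand-written (1-499) and the
    # catch-all (999) are ignored. With no AI rules yet, start from 499.
    highest = max((p for p in existing if AI_MIN <= p < 900), default=AI_MIN - 1)
    return min(highest + STEP, AI_MAX)


print(next_ai_priority([]))              # first AI rule
print(next_ai_priority([10, 509, 999]))  # one AI rule already present
print(next_ai_priority([885]))           # would exceed the band, so it is capped
```

Once the cap is reached, every later AI rule receives priority 890, so ordering among those rules no longer depends on priority alone.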

### Multi-Pod limitations (ADR-064 L1/L2)

The `_generating` set deduplicates only within a single process, so multiple Pods may generate the same rule more than once. After a new rule is appended, it takes effect immediately only on the Pod that wrote it; other Pods must be restarted.

### DI requirement

`auto_generate_rule()` receives its ollama/gemini settings as parameters. Using `from src.core.config import settings` inside the function is **forbidden**.

---

## Reference documents

- `apps/api/src/core/config.py`: configuration center
- `apps/api/src/main.py`: FastAPI application entry point
- `apps/api/src/plugins/mcp/mcp_bridge.py`: MCP Bridge core
- `apps/api/alert_rules.yaml`: alert rule configuration (new rules only need changes here)
- `packages/lewooogo-data/`: memory Provider building block
- `packages/lewooogo-brain/`: AI engine building block
- `memory/feedback_lewooogo_modular_enforcement.md`: iron rule enforcing modular building blocks
@@ -984,3 +1031,4 @@ async def send_xxx_notification(self, ...) -> None:
- ADR-006: AI fallback strategy
- ADR-008: Python modular independent building-block architecture
- ADR-027: Incident-Approval synchronization architecture (UnitOfWork + Saga)
- ADR-064: Alert Rule Engine, YAML-driven + AI auto-learning

@@ -526,11 +526,44 @@ NEMOTRON_ASYNC_UPDATE=true # async update mode

---

## Rule-engine degradation path (ADR-064, 2026-04-09)

`_generate_mock_response()` is **not fake data**; it is the official degraded path through the rule engine.

### Degradation flow

```
AI analysis fails (every Provider fails)
    ↓
_call_with_fallback() falls back to the rule engine
    ↓
match_rule(alert_context)
    ├── a specific rule matches → rule_id = "docker_container_unhealthy", etc.
    └── only generic_fallback matches → rule_id = "generic_fallback"
            ↓ asyncio.create_task (in an async context)
        auto_generate_rule() → Ollama → Gemini → append to alert_rules.yaml
```

### Key behaviors

- `confidence = 0.0`: a fixed value for rule matches; **fabricating a confidence is forbidden**
- The `suggested_action` shown in Telegram is the `kubectl_command` (the full command), not the enum string
- Auto-generated rules use priority 500–890 and never override hand-written rules (1–499)
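
The `suggested_action` behavior in the second bullet depends on falling back to the enum value when `kubectl_command` is missing or blank. A minimal sketch of that pattern, using a hypothetical helper `display_action` (the real code does this inline in the webhook handler):

```python
from typing import Optional


def display_action(kubectl_command: Optional[str], fallback: str) -> str:
    """Prefer the full command; fall back when it is None, empty, or only whitespace."""
    # "   " is truthy, so a plain `kubectl_command or fallback` would keep it.
    # Stripping first turns whitespace-only input into "", which is falsy.
    return (kubectl_command or "").strip() or fallback


print(display_action("kubectl rollout restart deploy/api", "RESTART_DEPLOYMENT"))
print(display_action("   ", "RESTART_DEPLOYMENT"))
print(display_action(None, "RESTART_DEPLOYMENT"))
```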

### Adding a rule

Edit `apps/api/alert_rules.yaml` and restart the Pod for the change to take effect; **no Python changes are needed**.

---

## Reference documents

- `apps/api/src/services/incident_engine.py`: aggregation engine
- `apps/api/src/services/multi_sig_redis.py`: distributed state
- `apps/api/src/workers/signal_worker.py`: Event Bus consumer
- `apps/api/src/plugins/mcp/mcp_bridge.py`: MCP Bridge
- `apps/api/alert_rules.yaml`: alert rule configuration
- `apps/api/src/services/alert_rule_engine.py`: rule engine
- `memory/project_phase13_enterprise_aiops.md`: Phase 13 planning
- Phase 6.0-6.3: Cognitive Awakening plan
- ADR-064: Alert Rule Engine

@@ -1396,7 +1396,7 @@ async def alertmanager_webhook(
     risk_level=risk_level.value,
     resource_name=target_resource,
     root_cause=root_cause,
-    suggested_action=analysis_result.kubectl_command or analysis_result.suggested_action.value,
+    suggested_action=(analysis_result.kubectl_command or "").strip() or analysis_result.suggested_action.value,
     estimated_downtime=estimated_downtime,
     hit_count=1,
     primary_responsibility=primary_responsibility,

@@ -99,14 +99,17 @@ def _load_rules() -> list[dict]:


 def _matches(rule: dict, alertname: str, alert_type: str, message: str) -> bool:
-    """Decide whether the rule matches."""
+    """Decide whether the rule matches. The catch-all rule (alertname=["*"]) always returns False and is handled separately by match_rule."""
     match = rule.get("match", {})

-    # exact alertname match
+    # S1-3 fix: the catch-all rule does not take part in _matches, so its alert_type/message keywords cannot hit by accident
     alertnames = match.get("alertname", [])
-    if alertnames and alertnames != ["*"]:
-        if alertname in alertnames:
-            return True
+    if alertnames == ["*"]:
+        return False
+
+    # exact alertname match
+    if alertnames and alertname in alertnames:
+        return True

     # partial alert_type match
     for kw in match.get("alert_type", []):
@@ -279,12 +282,15 @@ def _append_rule_to_yaml(rule_yaml: str, alertname: str) -> bool:
         logger.warning("auto_rule_empty_response", alertname=alertname)
         return False

+    # S1-2 fix: dedent normalizes any leading spaces the LLM may emit; the 2-space indent under rules: is added afterwards
+    import textwrap
+    normalized = textwrap.dedent(rule_yaml.strip())
+
     # append to the YAML file
     with RULES_FILE.open("a", encoding="utf-8") as f:
         now = datetime.now().strftime("%Y-%m-%d %H:%M")
         f.write(f"\n # AUTO-GENERATED {now} — alertname={alertname}\n")
-        # indent list item under rules:
-        for line in rule_yaml.strip().splitlines():
+        for line in normalized.splitlines():
             f.write(f"  {line}\n")

     # clear lru_cache so the new rule takes effect immediately
@@ -340,64 +346,84 @@ def _extract_yaml_block(text: str) -> str:
     return match.group(1).strip() if match else text


-async def auto_generate_rule(alert_context: dict) -> None:
+async def auto_generate_rule(
+    alert_context: dict,
+    ollama_url: str,
+    model: str,
+    gemini_api_key: str = "",
+) -> None:
     """
     Async background task: ask the AI to generate a rule for an unknown alert and write it to alert_rules.yaml.

     Trigger: match_rule() hits generic_fallback
-    Flow: Ollama (deepseek-r1:14b) → on failure Gemini → validate → append YAML → clear cache
-    """
-    from src.core.config import settings
+    Flow: Ollama → on failure Gemini → validate format → append YAML → clear lru_cache so the rule is live immediately
+
+    Args:
+        alert_context: alert context
+        ollama_url: Ollama endpoint (injected by the caller from settings; S2-1 DI fix)
+        model: Ollama model name
+        gemini_api_key: Gemini API key (an empty string skips the Gemini fallback)
+
+    Limitations:
+        - Process-level deduplication (_generating set); multi-Pod deployments may generate duplicates (recorded in ADR-064)
+        - lru_cache is cleared after the write, so the rule is live on the same Pod; other Pods need a restart
+    """
     labels = alert_context.get("labels", {})
     alertname = labels.get("alertname", alert_context.get("alert_type", "custom"))

+    # S3-3 fix: sanitize alertname so that an alertname containing { or } cannot raise KeyError in format()
+    alertname_safe = re.sub(r"[{}]", "", alertname)
+
     # Dedup: only one run at a time per alertname
-    if alertname in _generating:
+    if alertname_safe in _generating:
         return
-    if _rule_id_exists(alertname):
-        logger.debug("auto_rule_skip_exists", alertname=alertname)
+    if _rule_id_exists(alertname_safe):
+        logger.debug("auto_rule_skip_exists", alertname=alertname_safe)
         return

-    _generating.add(alertname)
+    _generating.add(alertname_safe)
     try:
-        rule_id = re.sub(r"[^a-z0-9_]", "_", alertname.lower()).strip("_")
-        # priority: 500~899 is reserved for AI-generated rules and does not interfere with hand-written rules (1-499)
+        rule_id = re.sub(r"[^a-z0-9_]", "_", alertname_safe.lower()).strip("_")
+
+        # S3-2 fix: cap priority at 890 so it cannot leave the AI-generated range
         existing = [r.get("priority", 0) for r in _load_rules() if not _is_generic(r)]
-        priority = max((p for p in existing if 500 <= p < 900), default=499) + 10
+        next_priority = max((p for p in existing if 500 <= p < 900), default=499) + 10
+        priority = min(next_priority, 890)

         prompt = _AUTO_RULE_PROMPT.format(
-            alertname=alertname,
+            alertname=alertname_safe,
             alert_type=alert_context.get("alert_type", "custom"),
             message=alert_context.get("message", "")[:200],
-            labels=json.dumps({k: v for k, v in labels.items() if k in
-                               ("job", "instance", "severity", "namespace", "container", "name")},
-                              ensure_ascii=False),
+            labels=json.dumps(
+                {k: v for k, v in labels.items()
+                 if k in ("job", "instance", "severity", "namespace", "container", "name")},
+                ensure_ascii=False,
+            ),
             rule_id=rule_id,
             priority=priority,
         )

-        logger.info("auto_rule_generating", alertname=alertname, rule_id=rule_id)
+        logger.info("auto_rule_generating", alertname=alertname_safe, rule_id=rule_id)

         # 1. Try Ollama first
-        raw = await _call_ollama(prompt, settings.OLLAMA_URL, settings.OPENCLAW_DEFAULT_MODEL)
+        raw = await _call_ollama(prompt, ollama_url, model)

         # 2. Ollama failed → Gemini
-        if not raw and settings.GEMINI_API_KEY:
-            raw = await _call_gemini(prompt, settings.GEMINI_API_KEY)
+        if not raw and gemini_api_key:
+            raw = await _call_gemini(prompt, gemini_api_key)

         if not raw:
-            logger.warning("auto_rule_no_response", alertname=alertname)
+            logger.warning("auto_rule_no_response", alertname=alertname_safe)
             return

         yaml_block = _extract_yaml_block(raw)
-        success = _append_rule_to_yaml(yaml_block, alertname)
+        success = _append_rule_to_yaml(yaml_block, alertname_safe)
         if success:
-            logger.info("auto_rule_success", alertname=alertname, rule_id=rule_id)
+            logger.info("auto_rule_success", alertname=alertname_safe, rule_id=rule_id)
         else:
-            logger.warning("auto_rule_failed_validation", alertname=alertname)
+            logger.warning("auto_rule_failed_validation", alertname=alertname_safe)

     except Exception as e:
-        logger.error("auto_rule_exception", alertname=alertname, error=str(e))
+        logger.error("auto_rule_exception", alertname=alertname_safe, error=str(e))
     finally:
-        _generating.discard(alertname)
+        _generating.discard(alertname_safe)

@@ -12,7 +12,7 @@ Model Registry - Phase 12 P1 fix
 Version: v1.0
 Created: 2026-03-26 23:00 (Taipei time zone)
 Created by: Claude Code
-Last modified: 2026-03-26 23:00 (Taipei time zone)
+Last modified: 2026-04-09 10:00 (Taipei time zone) - ogt: fallback config updated to deepseek-r1:14b + gemma3:4b
 Modified by: Claude Code
 """


@@ -572,12 +572,17 @@ class OpenClawService:
         signoz_metrics: GoldMetrics | None = None,
     ) -> str:
         """
-        Mock LLM response generator - rule-engine degradation (v8.0)
+        Rule-engine degraded response (v8.0): production use, not fake data

-        Loads rules from alert_rules.yaml, replacing the hard-coded if/elif.
-        Adding a rule only requires editing the YAML; no code change or redeploy is needed.
+        Loads rules from alert_rules.yaml and matches against them; this is the official degraded path when AI analysis fails.
+        On a generic_fallback hit it returns rule_id="generic_fallback",
+        and the calling async method (_call_with_fallback) triggers auto_generate_rule() to learn a new rule.
+
+        Returns:
+            (json_str, rule_id) tuple

         2026-04-09 ogt: refactored into a rule engine, removed the hard-coded if/elif
+        2026-04-09 ogt: S2-4 architect review - fixed the misleading "Mock" semantics, clarified as the production rule-engine path
         """
         from src.services.alert_rule_engine import match_rule

@@ -640,20 +645,9 @@ class OpenClawService:
             is_mock=True,
         )

-        # 2026-04-09 ogt: on a catch-all hit, auto-generate a dedicated rule in the background
-        if rule_id == "generic_fallback":
-            from src.services.alert_rule_engine import auto_generate_rule
-            import asyncio
-            try:
-                loop = asyncio.get_event_loop()
-                if loop.is_running():
-                    loop.create_task(auto_generate_rule(alert_context))
-                else:
-                    asyncio.run(auto_generate_rule(alert_context))
-            except Exception as _e:
-                logger.warning("auto_rule_trigger_failed", error=str(_e))
-
-        return json.dumps(mock_response)
+        # 2026-04-09 ogt: rule_id is returned so the calling async method can trigger automatic rule generation
+        # asyncio is not called from this sync method, to avoid mixing event loops (S1-1 architect review)
+        return json.dumps(mock_response), rule_id

     # =========================================================================
     # LLM Cache Layer (constitution requirement: never run without a cache)
@@ -871,7 +865,20 @@ class OpenClawService:
         # Mock Mode: for development and testing
         if settings.MOCK_MODE:
             logger.info("mock_mode_enabled", using="mock_llm")
-            return self._generate_mock_response(alert_context or {}, signoz_metrics), "mock", True, 0, 0.0
+            _mock_json, _rule_id = self._generate_mock_response(alert_context or {}, signoz_metrics)
+            if _rule_id == "generic_fallback":
+                import asyncio
+                from src.services.alert_rule_engine import auto_generate_rule
+                try:
+                    asyncio.create_task(auto_generate_rule(
+                        alert_context or {},
+                        ollama_url=settings.OLLAMA_URL,
+                        model=settings.OPENCLAW_DEFAULT_MODEL,
+                        gemini_api_key=getattr(settings, "GEMINI_API_KEY", ""),
+                    ))
+                except Exception as _e:
+                    logger.warning("auto_rule_trigger_failed", error=str(_e))
+            return _mock_json, "mock", True, 0, 0.0

         # Phase 15.1 + 15.3: Langfuse tracing integration + SignOz Deep Linking
         with langfuse_trace(
@@ -978,7 +985,20 @@ class OpenClawService:
             # When every Provider fails, fall back to Mock (graceful degradation)
             logger.warning("all_providers_failed_using_mock", fallback="mock_llm")
             trace.score(name="provider_success", value=0.0, comment="All providers failed, using mock")
-            return self._generate_mock_response(alert_context or {}, signoz_metrics), "mock_fallback", True, 0, 0.0
+            _mock_json, _rule_id = self._generate_mock_response(alert_context or {}, signoz_metrics)
+            if _rule_id == "generic_fallback":
+                import asyncio
+                from src.services.alert_rule_engine import auto_generate_rule
+                try:
+                    asyncio.create_task(auto_generate_rule(
+                        alert_context or {},
+                        ollama_url=settings.OLLAMA_URL,
+                        model=settings.OPENCLAW_DEFAULT_MODEL,
+                        gemini_api_key=getattr(settings, "GEMINI_API_KEY", ""),
+                    ))
+                except Exception as _e:
+                    logger.warning("auto_rule_trigger_failed", error=str(_e))
+            return _mock_json, "mock_fallback", True, 0, 0.0

     def _get_model_name(self, provider: str) -> str:
         """Return the model name for the given provider (from ModelRegistry)"""

@@ -6,6 +6,27 @@

---

## 📍 Current status (2026-04-09 Alert Rule Engine + Ollama M1 Pro + chief-architect review)

| Item | Status | Commit |
|------|------|--------|
| Ollama switched 188→111 (M1 Pro, 0.45→40+ tok/s) | ✅ | several |
| deepseek-r1:14b (RCA) + gemma3:4b (summary) | ✅ | f32b077 |
| Gemini fallback when NIM fails completely | ✅ | d80153b |
| Alert rule engine alert_rules.yaml + alert_rule_engine.py | ✅ | d1ede7f |
| AI automatic rule learning (triggered by generic_fallback) | ✅ | 71437db |
| Chief-architect review 63/100 → 6 issues fixed | ✅ | this commit |
| ADR-064 Alert Rule Engine | ✅ | this commit |
| Skills 02/03 updated | ✅ | this commit |
| model_registry fallback synced | ✅ | 89da2d2 |
| K8s deployment verified (image 89da2d2) | ✅ | 2 Pod Running |

**Known technical debt**: duplicate rule generation across Pods (ADR-064 L1); lru_cache not synchronized across Pods (ADR-064 L2)

**Next step**: front-end redesign, extracting integration-page Panels (resolves the double AppLayout)

---

## 📍 Current status (2026-04-09 Sprint 5 front-end redesign complete + deploying)

| Item | Status | Commit |

102
docs/adr/ADR-064-alert-rule-engine-yaml-driven.md
Normal file
@@ -0,0 +1,102 @@
# ADR-064: Alert Rule Engine, YAML-driven rule matching + AI auto-learning

**Status**: Approved
**Date**: 2026-04-09
**Authors**: ogt + Claude Sonnet 4.6
**Reviewed by**: Chief architect (2026-04-09)

---

## Background

`_generate_mock_response` in `openclaw.py` implemented rule matching with hard-coded if/elif, so every new alert type required changing Python code and redeploying. As the number of monitored targets grows, this pattern is not sustainable.

## Decision

### D1: Externalize rules to YAML (`apps/api/alert_rules.yaml`)

The rule engine loads its rules from a YAML file; changes take effect on Pod restart, with no code change.

**Rule structure**:
```yaml
rules:
  - id: docker_container_unhealthy
    priority: 10
    match:
      alertname: [DockerContainerUnhealthy]
      message: [unhealthy]
    response:
      kubectl_command: "ssh {host} 'docker inspect {container}...'"
      suggested_action: RESTART_DEPLOYMENT
      risk: medium
      responsibility: INFRA
```

**Priority scheme**:
| Range | Purpose |
|------|------|
| 1–499 | Hand-written rules (high priority, never overridden by AI) |
| 500–890 | AI-generated rules |
| 999 | `generic_fallback` catch-all |

### D2: AI automatic rule learning

When an alert hits `generic_fallback`, `auto_generate_rule()` is triggered:

1. Call Ollama (deepseek-r1:14b) to generate a YAML rule fragment
2. If Ollama fails, fall back to Gemini 2.0 Flash
3. Validate the format (id/match/response/kubectl_command are required)
4. Normalize indentation with `textwrap.dedent()` (guards against leading spaces in LLM output)
5. Append to `alert_rules.yaml`
6. `lru_cache.cache_clear()`: the rule takes effect immediately on the same Pod
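
Steps 4 and 5 can be sketched as below. This is an illustrative helper, not the project's `_append_rule_to_yaml`; the name `indent_rule_block` and the sample LLM output are assumptions. One detail the sketch makes explicit: `textwrap.dedent()` removes only the whitespace common to every line, so it has to run before any `strip()`. Stripping first removes the first line's indentation and leaves no common margin to remove.

```python
import textwrap


def indent_rule_block(rule_yaml: str, indent: str = "  ") -> str:
    """Remove the common left margin, then indent the block to sit under `rules:`."""
    # Dedent while every line still carries its original margin, then trim
    # blank lines at the edges without touching per-line indentation.
    normalized = textwrap.dedent(rule_yaml).strip("\n")
    return textwrap.indent(normalized, indent)


# Hypothetical LLM output that arrives with a 4-space margin.
llm_output = (
    "    - id: disk_full\n"
    "      priority: 509\n"
    "      match:\n"
    "        alertname: [DiskFull]\n"
)

print(indent_rule_block(llm_output))
```

With the 4-space margin removed and the 2-space indent applied, the list item lines up with the other entries under `rules:`.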

### D3: Module boundaries

- `alert_rule_engine.py` = Service layer; it only reads YAML and never accesses Redis/DB directly
- `auto_generate_rule()` uses DI parameter injection (`ollama_url`, `model`, `gemini_api_key`) and does not import the global settings singleton
- The asyncio trigger runs in the calling async `_call_with_fallback()`; the sync `_generate_mock_response()` never touches the event loop

### D4: Matching logic

Precedence: exact alertname match > partial alert_type match > message keyword

`generic_fallback` (`alertname: ["*"]`) always returns False in `_matches()` and is selected separately by the second loop in `match_rule()`, which keeps its alert_type/message keywords from matching by accident.
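
A minimal sketch of this two-pass selection follows. The rule data, the helper names (`is_generic`, `matches`), the flat `match_rule(alertname, alert_type, message)` signature and the case-insensitive substring comparison are all assumptions for illustration; the real `match_rule()` takes an `alert_context` dict and its loop details are not shown in the diff.

```python
from typing import Optional

# Hypothetical in-memory rules; the real engine loads them from alert_rules.yaml.
RULES = [
    {"id": "generic_fallback", "priority": 999,
     "match": {"alertname": ["*"], "message": ["error"]}},
    {"id": "docker_container_unhealthy", "priority": 10,
     "match": {"alertname": ["DockerContainerUnhealthy"], "message": ["unhealthy"]}},
]


def is_generic(rule: dict) -> bool:
    return rule.get("match", {}).get("alertname") == ["*"]


def matches(rule: dict, alertname: str, alert_type: str, message: str) -> bool:
    match = rule.get("match", {})
    if is_generic(rule):
        return False  # the catch-all never competes with specific rules
    if alertname in match.get("alertname", []):
        return True   # exact alertname match
    if any(kw.lower() in alert_type.lower() for kw in match.get("alert_type", [])):
        return True   # partial alert_type match
    return any(kw.lower() in message.lower() for kw in match.get("message", []))


def match_rule(alertname: str, alert_type: str, message: str) -> Optional[dict]:
    # Pass 1: specific rules, lowest priority number first.
    for rule in sorted(RULES, key=lambda r: r["priority"]):
        if matches(rule, alertname, alert_type, message):
            return rule
    # Pass 2: only now is the catch-all considered.
    for rule in RULES:
        if is_generic(rule):
            return rule
    return None
```

Because the catch-all is excluded from the first pass, its own `message` keywords (here `"error"`) can never shadow a specific rule.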
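
## Known limitations

### L1: Rules may be generated more than once in a multi-Pod environment

The `_generating` set deduplicates in process memory only, and each Pod keeps its own. The same alert can trigger generation on different Pods at the same time, appending duplicate rules.

**Mitigation**: `_rule_id_exists()` provides a second dedup check, but the lru_cache leaves a time window for a race condition.

**Plan**: if the Pod count ever exceeds 2, a Redis distributed lock will be needed. Prod currently runs 2 Pods, which is acceptable.

### L2: `lru_cache` is not synchronized across Pods

After a new rule is written, only the writing Pod clears its cache; other Pods must restart to load the rule. This is known behavior: the next alert on those Pods still goes through `generic_fallback`, but the rule is not generated again (`_rule_id_exists` reads the YAML directly to confirm).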
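
The cache behavior behind L2 can be reproduced in a single process. The sketch below is an assumption-laden stand-in: it uses a plain text file of rule ids instead of YAML (to avoid a YAML dependency), and `load_rule_ids` is a hypothetical loader, not the project's `_load_rules()`.

```python
import tempfile
from functools import lru_cache
from pathlib import Path

# Temporary stand-in for alert_rules.yaml: one rule id per line.
RULES_FILE = Path(tempfile.mkdtemp()) / "alert_rules.txt"
RULES_FILE.write_text("docker_container_unhealthy\n", encoding="utf-8")


@lru_cache(maxsize=1)
def load_rule_ids() -> tuple[str, ...]:
    """Cached loader: the file is read once per process until the cache is cleared."""
    return tuple(RULES_FILE.read_text(encoding="utf-8").split())


before = load_rule_ids()

# A new rule is appended to the file.
with RULES_FILE.open("a", encoding="utf-8") as f:
    f.write("disk_full\n")

stale = load_rule_ids()      # what a Pod that did not clear its cache still sees
load_rule_ids.cache_clear()  # what the writing Pod does after the append
fresh = load_rule_ids()      # the new rule is visible only after the clear
```

Each Pod is a separate process with its own cache, so a `cache_clear()` on one Pod has no effect on the others.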

## Testing strategy

Because `auto_generate_rule()` uses DI, it can be tested on its own without starting FastAPI:

```python
await auto_generate_rule(
    alert_context={"labels": {"alertname": "TestAlert"}},
    ollama_url="http://mock",
    model="test-model",
)
```

## Related files

- `apps/api/alert_rules.yaml`: rule definitions
- `apps/api/src/services/alert_rule_engine.py`: rule engine
- `apps/api/src/services/openclaw.py`: integration point for `_generate_mock_response` + `_call_with_fallback`
- `apps/api/Dockerfile`: COPY alert_rules.yaml

## References

- ADR-006: AI model routing configuration
- ADR-052: Phase 24 AIRouter
- `feedback_lewooogo_modular_enforcement.md`: the 5 modularization questions