fix(review): 首席架構師 Code Review 修補 — I1 get_incident_type 邏輯修正 + 測試補全
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 8m13s

Code Review 發現 2 個 Critical + 2 個 Important 問題:

Critical:
- rule.id 語意為「規則識別符」,與 incident_type 命名空間不同,不可混用
  移除 rule_id fallback 路徑,YAML 匹配無 incident_type 時 fall through 靜態 dict
- get_incident_type() 關鍵路徑無測試覆蓋
  新增 test_get_incident_type.py:11 測試、4 類別(靜態/YAML優先/YAML錯誤/custom)全過

Important:
- ALERTNAME_TO_TYPE deferred import 移至模組頂層(無 circular 風險)
- alert_types.py TODO 過期 → 更新為 I1 整合後正確說明

技術債記錄:NetworkPolicy ArgoCD egress ClusterIP 10.43.16.201/32 需 ArgoCD 重裝後更新

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
OG T
2026-04-11 21:33:04 +08:00
parent b2dfcf9b0d
commit d77b2add73
4 changed files with 224 additions and 11 deletions

View File

@@ -6,8 +6,9 @@ alertname → incident_type 靜態對應表
來源: BUG-008 修復 2026-04-119筆 → 56筆涵蓋 alerts-unified.yml 全部 alertname
遷移: M3 2026-04-11 — 從 webhooks.py 內聯 dict 抽至此模組
TODO (I1): 下 Sprint 整合 ADR-064 Rule Engine改用 YAML 規則動態推斷;
此靜態 dict 為可接受中間狀態
整合狀態 (I1, 2026-04-11): ADR-064 Rule Engine 已整合,見 alert_rule_engine.get_incident_type()。
此靜態 dict 為 Layer 2 fallbackYAML incident_typeLayer 1→ 此 dict → "custom"Layer 3
擴展方式:在 alert_rules.yaml 中新增 incident_type 欄位;此 dict 僅需補無 YAML 規則的 alertname。
"""
# alertname → incident_type 對應56 筆)

View File

@@ -31,6 +31,8 @@ import httpx
import structlog
import yaml
from src.constants.alert_types import ALERTNAME_TO_TYPE
logger = structlog.get_logger(__name__)
RULES_FILE = Path(__file__).parent.parent.parent / "alert_rules.yaml"
@@ -267,14 +269,13 @@ def get_incident_type(alertname: str) -> str:
從 YAML 規則動態推斷 incident_type取代 webhooks.py 靜態 dict。
優先順序:
1. alert_rules.yaml 中 match.alertname 完全匹配 → 使用 rule.incident_type
1. alert_rules.yaml 中 match.alertname 完全匹配 ruleincident_type 欄位 → 使用之
2. ALERTNAME_TO_TYPE 靜態 dict fallbackconstants/alert_types.py
3. "custom"(兜底)
rule.incident_type 可選欄位;若 YAML 規則未設則跳過進 fallback。
注意YAML rule.id 語意為「規則識別符」,不等於 incident_type
兩者命名空間不同不可混用。YAML 未設 incident_type 時一律走 fallback。
"""
from src.constants.alert_types import ALERTNAME_TO_TYPE
try:
rules = _load_rules()
for rule in rules:
@@ -282,14 +283,10 @@ def get_incident_type(alertname: str) -> str:
continue
alertnames = rule.get("match", {}).get("alertname", [])
if alertname in alertnames:
# YAML incident_type 欄位優先用
# 只有 YAML 明確設定 incident_type 才採用,否則 fall through 到靜態 dict
incident_type = rule.get("incident_type")
if incident_type:
return incident_type
# 無 incident_type 欄位 → 用 rule id 作為 incident_type語意一致
rule_id = rule.get("id", "")
if rule_id:
return rule_id
break
except Exception:
pass

View File

@@ -0,0 +1,167 @@
"""
I1 ADR-064 get_incident_type() 整合測試
========================================
測試策略 (feedback_no_mock_testing.md):
- 直接呼叫 get_incident_type(),不 Mock YAML / 靜態 dict
- 驗證三層降級YAML incident_type → 靜態 dict → "custom"
- 2026-04-11 Claude Sonnet 4.6: Code Review C1 修補
"""
import pytest
from src.constants.alert_types import ALERTNAME_TO_TYPE
from src.services.alert_rule_engine import get_incident_type
class TestGetIncidentTypeStaticFallback:
"""Layer 2: 靜態 dict fallbackYAML 無 incident_type 欄位時)"""
def test_known_alertname_returns_static_dict_value(self):
"""HostHighCpuLoad → 靜態 dict 應回傳 host_cpu"""
result = get_incident_type("HostHighCpuLoad")
assert result == "host_cpu"
def test_known_alertname_k8s(self):
"""KubePodCrashLooping → k8s_pod_crash"""
result = get_incident_type("KubePodCrashLooping")
assert result == "k8s_pod_crash"
def test_known_alertname_database(self):
"""PostgreSQLDown → database_down"""
result = get_incident_type("PostgreSQLDown")
assert result == "database_down"
def test_known_alertname_service(self):
"""SentryDown → service_down"""
result = get_incident_type("SentryDown")
assert result == "service_down"
def test_known_alertname_ssl(self):
"""ExternalSiteSSLExpiringSoon → ssl_expiry"""
result = get_incident_type("ExternalSiteSSLExpiringSoon")
assert result == "ssl_expiry"
class TestGetIncidentTypeCustomFallback:
"""Layer 3: unknown alertname → "custom" """
def test_unknown_alertname_returns_custom(self):
result = get_incident_type("SomeCompletelyUnknownAlert")
assert result == "custom"
def test_empty_string_returns_custom(self):
result = get_incident_type("")
assert result == "custom"
class TestGetIncidentTypeYamlPriority:
"""Layer 1: YAML rule 有 incident_type 欄位時優先使用"""
def test_yaml_incident_type_overrides_static_dict(self, tmp_path, monkeypatch):
"""若 YAML rule 明確設定 incident_type應回傳 YAML 值(而非靜態 dict 值)"""
import yaml
from src.services import alert_rule_engine as engine
yaml_content = {
"version": "1.0.0",
"rules": [
{
"id": "test_rule",
"priority": 1,
"description": "test",
"match": {"alertname": ["HostHighCpuLoad"]},
"incident_type": "custom_cpu_override",
"response": {
"action_title": "test",
"suggested_action": "NO_ACTION",
"risk": "low",
"responsibility": "INFRA",
"responsibility_reasoning": "test",
"secondary_teams": [],
"reasoning": "test",
},
}
],
}
rules_file = tmp_path / "alert_rules.yaml"
rules_file.write_text(yaml.dump(yaml_content))
# 替換 RULES_FILE 路徑並清除 lru_cache
monkeypatch.setattr(engine, "RULES_FILE", rules_file)
engine._load_rules.cache_clear()
result = get_incident_type("HostHighCpuLoad")
assert result == "custom_cpu_override"
# 清理 cache還原給後續測試
engine._load_rules.cache_clear()
def test_yaml_match_without_incident_type_falls_through_to_static(
self, tmp_path, monkeypatch
):
"""YAML rule 匹配到 alertname 但無 incident_type 欄位 → 必須 fallthrough 到靜態 dict"""
import yaml
from src.services import alert_rule_engine as engine
yaml_content = {
"version": "1.0.0",
"rules": [
{
"id": "host_cpu", # id 值與靜態 dict value 相同,但這是巧合,不應被依賴
"priority": 1,
"description": "test",
"match": {"alertname": ["HostHighCpuLoad"]},
# 故意不設 incident_type
"response": {
"action_title": "test",
"suggested_action": "NO_ACTION",
"risk": "low",
"responsibility": "INFRA",
"responsibility_reasoning": "test",
"secondary_teams": [],
"reasoning": "test",
},
}
],
}
rules_file = tmp_path / "alert_rules.yaml"
rules_file.write_text(yaml.dump(yaml_content))
monkeypatch.setattr(engine, "RULES_FILE", rules_file)
engine._load_rules.cache_clear()
result = get_incident_type("HostHighCpuLoad")
# 靜態 dict 應回傳 host_cpu不應回傳 rule.id
assert result == ALERTNAME_TO_TYPE["HostHighCpuLoad"]
assert result == "host_cpu"
engine._load_rules.cache_clear()
class TestGetIncidentTypeYamlError:
"""YAML 載入失敗時應 graceful fallback"""
def test_corrupted_yaml_falls_back_to_static(self, tmp_path, monkeypatch):
from src.services import alert_rule_engine as engine
rules_file = tmp_path / "alert_rules.yaml"
rules_file.write_text("{{invalid yaml{{{{")
monkeypatch.setattr(engine, "RULES_FILE", rules_file)
engine._load_rules.cache_clear()
result = get_incident_type("HostHighCpuLoad")
assert result == "host_cpu" # 靜態 dict fallback
engine._load_rules.cache_clear()
def test_missing_yaml_falls_back_to_static(self, tmp_path, monkeypatch):
from src.services import alert_rule_engine as engine
monkeypatch.setattr(engine, "RULES_FILE", tmp_path / "nonexistent.yaml")
engine._load_rules.cache_clear()
result = get_incident_type("KubeNodeNotReady")
assert result == "k8s_node_failure"
engine._load_rules.cache_clear()

View File

@@ -6,6 +6,54 @@
---
## 📍 當前狀態 (2026-04-11 深夜 — 首席架構師 Code Review 修補完成)
### Code Review 修補 — I1/M3/D1/ArgoCD 重審 (首席架構師觸發)
| 等級 | 問題 | 修復方式 | Commit |
|------|------|---------|--------|
| Critical | `get_incident_type()` rule.id 不應作為 incident_type fallback命名空間不同有語意污染風險| 移除 rule_id 路徑YAML 匹配無 incident_type → fall through 靜態 dict | 待 push |
| Critical | `get_incident_type()` 無測試覆蓋(關鍵路徑)| 新增 test_get_incident_type.py11 個測試 4 個類別全過 | 待 push |
| Important | ALERTNAME_TO_TYPE deferred import無 circular risk| 移至模組頂層 import | 待 push |
| Important | NetworkPolicy ClusterIP `10.43.16.201/32` 寫死(技術債)| 記錄為已知技術債podSelector+namespaceSelector 已在同一 rule下次 ArgoCD 重裝需更新 | LOGBOOK |
| Minor | alert_types.py TODO 已過期 | 更新為 I1 整合後的正確說明 | 待 push |
**技術債記錄**`k8s/awoooi-prod/02-network-policy.yaml` ArgoCD egress 規則中 `ipBlock: 10.43.16.201/32` 為 ClusterIP若 ArgoCD 重裝此 IP 會變更,須同步更新。
---
## 📍 當前狀態 (2026-04-11 深夜 — MCP Security Code Review P0~P2 全修補)
### MCP Security Code Review 修補 (commit f323633)
| 等級 | 問題 | 修復方式 |
|------|------|---------|
| P0 | `_fetch_metrics_snapshot` 呼叫型別錯誤str→dict| `{"query": ...}` + 結果解析修正 |
| P1 | alertname PromQL injection | `_RE_SAFE_ALERTNAME` 白名單正則 |
| P1 | kubectl action 危險字元注入 | `_ALLOWED_KUBECTL_PATTERN` 白名單 |
| P1 | 6 個 `create_task()` GC 風險 | `_background_tasks` set + `_fire_and_forget()` helper |
| P1 | SSH Group B 無 known_hosts 可執行寫入 | 強制驗證 `SSH_MCP_KNOWN_HOSTS_FILE` 存在 |
| P2 | Sentry query 無語意驗證 | `_RE_SAFE_SENTRY_QUERY` 拒絕特殊字元 |
| P2 | ArgoCD `verify=False` 寫死 | `ARGOCD_VERIFY_TLS` 環境變數開關 |
推送 Gitea → CI/CD 觸發中
---
## 📍 當前狀態 (2026-04-11 深夜 — M3+I1 Tech Debt 全清零)
### M3 alertname_to_type 抽至 constants + I1 ADR-064 Rule Engine 整合 (2026-04-11 深夜)
| 任務 | 說明 | Commit |
|------|------|--------|
| M3 | 56筆 alertname_to_type 從 webhooks.py 內聯 dict 抽至 `src/constants/alert_types.py` | 1ede9f9 |
| I1 | alert_rule_engine.py 新增 `get_incident_type(alertname)`YAML規則→靜態dict→"custom" 三層降級 | 615822d |
| I1 | webhooks.py call site 改用 `get_incident_type(alertname)`,正式整合 ADR-064 Rule Engine | 615822d |
**Backlog 全清零** ✅ — 唯一剩餘 B3 (Phase 15.5 Trace Context UI) 統帥裁示暫緩
---
## 📍 當前狀態 (2026-04-11 深夜 — D1 models.json 集中化完成)
### D1 models.json 集中化 (2026-04-11 深夜)