Files
awoooi/ops/monitoring/tests/test_slo_rules.yaml
Your Name 025a493f06
Some checks failed
run-migration / migrate (push) Failing after 12s
CD Pipeline / build-and-deploy (push) Has been cancelled
feat(p3.2+adr-100): Model Version Tracker + SLO 自治 + KB rot cleaner
Wave 8 P3.2 模型版本追蹤 + ADR-100 SLO 自我治理 + 配套:

P3.2 — Model Version Tracking:
- model_version_probe.py (268 行) — 探測 Ollama / OpenRouter 等 provider 的 model version
- model_version_tracker.py (101 行) — 對齊 PG provider_version_history 表
- migrations/p3_2_provider_version_history.sql + rollback — 25 行 schema
- db/models.py +32 行 — ProviderVersionHistory ORM

ADR-100 — AI 自主化 SLO:
- docs/adr/ADR-100-ai-autonomous-slo.md (167 行) — 飛輪 SLO 設計與閾值
- ops/monitoring/slo-rules.yml (254 行) — Prometheus SLO recording rules + alerts
- ops/monitoring/tests/test_slo_rules.yaml (242 行) — promtool unit tests

整合修改:
- main.py +72 行 — Lifespan 啟動 model_version_probe + KB rot cleaner schedule
- gitea_webhook.py +45 行 — webhook 接收 model 版本變化通知
- ci_auto_repair.py / evidence_snapshot.py / pre_decision_investigator.py — 配合接線

新測試:
- test_kb_rot_cleaner_schedule.py (120 行) — 9 tests pass
- test_slo_rules.yaml — promtool 驗收

Tests: 9 passed (test_kb_rot_cleaner_schedule)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Multiple Engineers (P3.2 + ADR-100) <noreply@anthropic.com>
2026-04-27 14:54:19 +08:00

243 lines
8.8 KiB
YAML
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# ops/monitoring/tests/test_slo_rules.yaml
# promtool unit tests for AI Autonomous SLO rules
# 2026-04-27 P3.4 by Claude — AI SLO
#
# 執行方式:
# promtool test rules ops/monitoring/tests/test_slo_rules.yaml
#
# 覆蓋範圍:
# - sli:autonomy_rate:5m recording rule 數值正確性
# - sli:decision_accuracy:5m recording rule
# - sli:km_growth_rate:24h recording rule
# - SLO_AutonomyRate_FastBurn alert 觸發與不觸發
# - SLO_DecisionAccuracy_FastBurn alert
# - SLO_KMGrowthRate_Critical alert
rule_files:
- ../slo-rules.yml
evaluation_interval: 1m
tests:
# ============================================================
# Recording Rule Tests
# ============================================================
# ---- SLI 1: 自主化率 = 80% (auto=8, human=2 per tick) ----
- interval: 1m
name: "sli:autonomy_rate:5m 應為 0.8auto_executed=8, total=10"
input_series:
- series: 'automation_operation_log_total{outcome="auto_executed"}'
values: "0+8x30"
- series: 'automation_operation_log_total{outcome="human_required"}'
values: "0+2x30"
promql_expr_test:
- expr: sli:autonomy_rate:5m
eval_time: 15m
exp_samples:
- value: 0.8
# ---- SLI 1: 自主化率 = 100%(無 human_required----
- interval: 1m
name: "sli:autonomy_rate:5m 應為 1.0(無人工)"
input_series:
- series: 'automation_operation_log_total{outcome="auto_executed"}'
values: "0+10x30"
promql_expr_test:
- expr: sli:autonomy_rate:5m
eval_time: 15m
exp_samples:
- value: 1.0
# ---- SLI 2: 決策準確率 = 90% (success=9, auto_executed=10) ----
- interval: 1m
name: "sli:decision_accuracy:5m 應為 0.9"
input_series:
- series: 'post_execution_verification_total{outcome="success"}'
values: "0+9x30"
- series: 'automation_operation_log_total{outcome="auto_executed"}'
values: "0+10x30"
promql_expr_test:
- expr: sli:decision_accuracy:5m
eval_time: 15m
exp_samples:
- value: 0.9
# ---- SLI 4: KM 增長率24h increase----
- interval: 1m
name: "sli:km_growth_rate:24h 應約為 1440每分鐘 +1 × 24h"
input_series:
- series: "knowledge_entries_total"
values: "0+1x1500"
promql_expr_test:
- expr: sli:km_growth_rate:24h
eval_time: 25h
exp_samples:
# increase over 24h = 1440 samples × 1/min
- value: 1440
# ============================================================
# Alert Tests — SLO 1: 自主化率
# ============================================================
# ---- 負測: 自主化率 = 80% → FastBurn 不觸發 ----
- interval: 1m
name: "SLO_AutonomyRate_FastBurn 不觸發(自主化率 = 80%,達標)"
input_series:
- series: 'automation_operation_log_total{outcome="auto_executed"}'
values: "0+8x30"
- series: 'automation_operation_log_total{outcome="human_required"}'
values: "0+2x30"
alert_rule_test:
- eval_time: 10m
alertname: SLO_AutonomyRate_FastBurn
exp_alerts: []
# ---- 正測: 自主化率 = 40%error_rate=0.6 > 0.20×14.4=2.88 → 不對)
# 注意0.20 * 14.4 = 2.88,但 error_rate 最大為 1.0,所以正確觸發條件:
# error_rate > 2.88 不可能,實際上 fast burn alert 只在 burn rate 非常高時觸發。
# 重新計算SLO=0.80, budget=0.20; 1h burn 2% = 消耗了 budget × 2/100 = 0.004
# 在 1h 內消耗了這麼多error_rate 需 > 0.20 × 14.4 = 2.88PromQL burn rate 係數)
# 由於 error_rate ∈ [0,1]2.88 > 1 → fast burn 永遠不觸發(正確行為)
# 改用 medium burn 測試threshold = 0.20 × 6 = 1.2 > 1 → 也不觸發)
# 實際上只有 slow burn 可觸發threshold = 0.20 × 1.1 = 0.22 < 1
# ---- 正測: 自主化率 = 50%error_rate=0.5 > 0.22)→ SlowBurn 觸發 ----
- interval: 1m
name: "SLO_AutonomyRate_SlowBurn 觸發(自主化率 = 50%error_rate 0.5 > 0.22"
input_series:
- series: 'automation_operation_log_total{outcome="auto_executed"}'
values: "0+5x120"
- series: 'automation_operation_log_total{outcome="human_required"}'
values: "0+5x120"
alert_rule_test:
- eval_time: 70m
alertname: SLO_AutonomyRate_SlowBurn
exp_alerts:
- exp_labels:
alertname: SLO_AutonomyRate_SlowBurn
severity: info
slo_name: autonomy_rate
burn_window: 3d
team: ai
auto_repair: "false"
# ---- 負測: 自主化率 = 85% → SlowBurn 不觸發 ----
- interval: 1m
name: "SLO_AutonomyRate_SlowBurn 不觸發(自主化率 = 85%"
input_series:
- series: 'automation_operation_log_total{outcome="auto_executed"}'
values: "0+85x120"
- series: 'automation_operation_log_total{outcome="human_required"}'
values: "0+15x120"
alert_rule_test:
- eval_time: 70m
alertname: SLO_AutonomyRate_SlowBurn
exp_alerts: []
# ============================================================
# Alert Tests — SLO 2: 決策準確率
# ============================================================
# ---- 正測: 決策準確率 = 75%error_rate=0.25 > 0.10×1.1=0.11)→ SlowBurn 觸發 ----
- interval: 1m
name: "SLO_DecisionAccuracy_SlowBurn 觸發(決策準確率 75%"
input_series:
- series: 'post_execution_verification_total{outcome="success"}'
values: "0+75x120"
- series: 'automation_operation_log_total{outcome="auto_executed"}'
values: "0+100x120"
alert_rule_test:
- eval_time: 70m
alertname: SLO_DecisionAccuracy_SlowBurn
exp_alerts:
- exp_labels:
alertname: SLO_DecisionAccuracy_SlowBurn
severity: info
slo_name: decision_accuracy
burn_window: 3d
team: ai
auto_repair: "false"
# ---- 負測: 決策準確率 = 92% → SlowBurn 不觸發 ----
- interval: 1m
name: "SLO_DecisionAccuracy_SlowBurn 不觸發(決策準確率 92%"
input_series:
- series: 'post_execution_verification_total{outcome="success"}'
values: "0+92x120"
- series: 'automation_operation_log_total{outcome="auto_executed"}'
values: "0+100x120"
alert_rule_test:
- eval_time: 70m
alertname: SLO_DecisionAccuracy_SlowBurn
exp_alerts: []
# ============================================================
# Alert Tests — SLO 4: KM 增長率
# ============================================================
# ---- 正測: KM 增長率 = 0 → Critical 觸發 ----
- interval: 1m
name: "SLO_KMGrowthRate_Critical 觸發KM 停止增長)"
input_series:
# counter 停止increase[24h] = 0
- series: "knowledge_entries_total"
values: "100x1600"
alert_rule_test:
- eval_time: 25h
alertname: SLO_KMGrowthRate_Critical
exp_alerts:
- exp_labels:
alertname: SLO_KMGrowthRate_Critical
severity: critical
slo_name: km_growth_rate
team: ai
auto_repair: "false"
# ---- 正測: KM 增長率 = 3/day → Critical 觸發(< 5----
- interval: 30m
name: "SLO_KMGrowthRate_Critical 觸發KM 增長 = 3/day"
input_series:
# 每 30min +0.0625 次 ≈ 3/day
- series: "knowledge_entries_total"
values: "0+0.0625x50"
alert_rule_test:
- eval_time: 25h
alertname: SLO_KMGrowthRate_Critical
exp_alerts:
- exp_labels:
alertname: SLO_KMGrowthRate_Critical
severity: critical
slo_name: km_growth_rate
team: ai
auto_repair: "false"
# ---- 負測: KM 增長率 = 30/day → Critical 不觸發 ----
- interval: 1m
name: "SLO_KMGrowthRate_Critical 不觸發KM 增長 = 30/day"
input_series:
# 每分鐘 +0.0208 次 = 30/day
- series: "knowledge_entries_total"
values: "0+0.0208x1600"
alert_rule_test:
- eval_time: 25h
alertname: SLO_KMGrowthRate_Critical
exp_alerts: []
# ---- 正測: KM 增長率 = 15/day → Low 觸發(< 20但 Critical 不觸發 ----
- interval: 1m
name: "SLO_KMGrowthRate_Low 觸發Critical 不觸發KM 增長 15/day"
input_series:
# 每分鐘 +0.0104 次 ≈ 15/day
- series: "knowledge_entries_total"
values: "0+0.0104x1600"
alert_rule_test:
- eval_time: 25h
alertname: SLO_KMGrowthRate_Low
exp_alerts:
- exp_labels:
alertname: SLO_KMGrowthRate_Low
severity: warning
slo_name: km_growth_rate
team: ai
auto_repair: "false"