Files
awoooi/ops/monitoring/tests/test_slo_rules.yaml
Your Name d2a4a17969
Some checks failed
Code Review / ai-code-review (push) Successful in 22s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 25s
CD Pipeline / tests (push) Successful in 1m11s
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
fix(governance): stabilize adr100 km growth slo
2026-05-14 19:33:52 +08:00

290 lines
12 KiB
YAML
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# ops/monitoring/tests/test_slo_rules.yaml
# promtool unit tests for AI Autonomous SLO rules
# 2026-04-27 P3.4 by Claude — AI SLO
#
# 執行方式:
# promtool test rules ops/monitoring/tests/test_slo_rules.yaml
#
# 覆蓋範圍:
# - sli:autonomy_rate:5m recording rule 數值正確性
# - sli:decision_accuracy:5m recording rule
# - sli:km_growth_rate:24h recording rule
# - SLO_AutonomyRate_FastBurn alert 觸發與不觸發
# - SLO_DecisionAccuracy_FastBurn alert
# - SLO_KMGrowthRate_Critical alert
rule_files:
- ../slo-rules.yml
evaluation_interval: 1m
tests:
# ============================================================
# Recording Rule Tests
# ============================================================
# ---- SLI 1: 自主化率 = 80% (auto=8, human=2 per tick) ----
- interval: 1m
name: "sli:autonomy_rate:5m 應為 0.8auto_executed=8, total=10"
input_series:
- series: 'automation_operation_log_total{outcome="auto_executed"}'
values: "0+8x30"
- series: 'automation_operation_log_total{outcome="human_required"}'
values: "0+2x30"
promql_expr_test:
- expr: sli:autonomy_rate:5m
eval_time: 15m
exp_samples:
- labels: '{__name__="sli:autonomy_rate:5m"}'
value: 0.8
# ---- SLI 1: 自主化率 = 100%(無 human_required----
- interval: 1m
name: "sli:autonomy_rate:5m 應為 1.0(無人工)"
input_series:
- series: 'automation_operation_log_total{outcome="auto_executed"}'
values: "0+10x30"
promql_expr_test:
- expr: sli:autonomy_rate:5m
eval_time: 15m
exp_samples:
- labels: '{__name__="sli:autonomy_rate:5m"}'
value: 1.0
# ---- SLI 2: 決策準確率 = 90% (success=9, auto_executed=10) ----
- interval: 1m
name: "sli:decision_accuracy:5m 應為 0.9"
input_series:
- series: 'post_execution_verification_total{outcome="success"}'
values: "0+9x30"
- series: 'automation_operation_log_total{outcome="auto_executed"}'
values: "0+10x30"
promql_expr_test:
- expr: sli:decision_accuracy:5m
eval_time: 15m
exp_samples:
- labels: '{__name__="sli:decision_accuracy:5m"}'
value: 0.9
# ---- SLI 4: KM 增長率24h increase----
- interval: 1m
name: "sli:km_growth_rate:24h 應約為 1440每分鐘 +1 × 24h"
input_series:
- series: "knowledge_entries_total"
values: "0+1x1500"
promql_expr_test:
- expr: sli:km_growth_rate:24h
eval_time: 25h
exp_samples:
# increase over 24h = 1440 samples × 1/min
- labels: '{__name__="sli:km_growth_rate:24h"}'
value: 1440
# ---- SLI 4: DB-derived gauge 優先,避免 counter 新上線暖機誤報 0 ----
- interval: 1m
name: "sli:km_growth_rate:24h 應優先使用 knowledge_entries_created_24h"
input_series:
- series: "knowledge_entries_created_24h"
values: "25x30"
- series: "knowledge_entries_total"
values: "100x30"
promql_expr_test:
- expr: sli:km_growth_rate:24h
eval_time: 15m
exp_samples:
- labels: '{__name__="sli:km_growth_rate:24h"}'
value: 25
# ============================================================
# Alert Tests — SLO 1: 自主化率
# ============================================================
# ---- 負測: 自主化率 = 80% → FastBurn 不觸發 ----
- interval: 1m
name: "SLO_AutonomyRate_FastBurn 不觸發(自主化率 = 80%,達標)"
input_series:
- series: 'automation_operation_log_total{outcome="auto_executed"}'
values: "0+8x30"
- series: 'automation_operation_log_total{outcome="human_required"}'
values: "0+2x30"
alert_rule_test:
- eval_time: 10m
alertname: SLO_AutonomyRate_FastBurn
exp_alerts: []
# ---- 正測: 自主化率 = 40%error_rate=0.6 > 0.20×14.4=2.88 → 不對)
# 注意0.20 * 14.4 = 2.88,但 error_rate 最大為 1.0,所以正確觸發條件:
# error_rate > 2.88 不可能,實際上 fast burn alert 只在 burn rate 非常高時觸發。
# 重新計算SLO=0.80, budget=0.20; 1h burn 2% = 消耗了 budget × 2/100 = 0.004
# 在 1h 內消耗了這麼多error_rate 需 > 0.20 × 14.4 = 2.88PromQL burn rate 係數)
# 由於 error_rate ∈ [0,1]2.88 > 1 → fast burn 永遠不觸發(正確行為)
# 改用 medium burn 測試threshold = 0.20 × 6 = 1.2 > 1 → 也不觸發)
# 實際上只有 slow burn 可觸發threshold = 0.20 × 1.1 = 0.22 < 1
# ---- 正測: 自主化率 = 50%error_rate=0.5 > 0.22)→ SlowBurn 觸發 ----
- interval: 1m
name: "SLO_AutonomyRate_SlowBurn 觸發(自主化率 = 50%error_rate 0.5 > 0.22"
input_series:
- series: 'automation_operation_log_total{outcome="auto_executed"}'
values: "0+5x120"
- series: 'automation_operation_log_total{outcome="human_required"}'
values: "0+5x120"
alert_rule_test:
- eval_time: 70m
alertname: SLO_AutonomyRate_SlowBurn
exp_alerts:
- exp_labels:
alertname: SLO_AutonomyRate_SlowBurn
severity: info
slo_name: autonomy_rate
burn_window: 3d
team: ai
auto_repair: "false"
exp_annotations:
summary: "SLO 自主化率 slow burn長期趨勢偏低"
description: "自主化率長期低於目標,累積 error budget 消耗率偏高,建議本週 review。"
runbook: "分析近 7d 數據,是否需要重訓或調整 confidence threshold。"
# ---- 負測: 自主化率 = 85% → SlowBurn 不觸發 ----
- interval: 1m
name: "SLO_AutonomyRate_SlowBurn 不觸發(自主化率 = 85%"
input_series:
- series: 'automation_operation_log_total{outcome="auto_executed"}'
values: "0+85x120"
- series: 'automation_operation_log_total{outcome="human_required"}'
values: "0+15x120"
alert_rule_test:
- eval_time: 70m
alertname: SLO_AutonomyRate_SlowBurn
exp_alerts: []
# ============================================================
# Alert Tests — SLO 2: 決策準確率
# ============================================================
# ---- 正測: 決策準確率 = 75%error_rate=0.25 > 0.10×1.1=0.11)→ SlowBurn 觸發 ----
- interval: 1m
name: "SLO_DecisionAccuracy_SlowBurn 觸發(決策準確率 75%"
input_series:
- series: 'post_execution_verification_total{outcome="success"}'
values: "0+75x120"
- series: 'automation_operation_log_total{outcome="auto_executed"}'
values: "0+100x120"
alert_rule_test:
- eval_time: 70m
alertname: SLO_DecisionAccuracy_SlowBurn
exp_alerts:
- exp_labels:
alertname: SLO_DecisionAccuracy_SlowBurn
severity: info
slo_name: decision_accuracy
burn_window: 3d
team: ai
auto_repair: "false"
exp_annotations:
summary: "SLO 決策準確率 slow burn長期趨勢偏低"
description: "決策準確率長期低於目標,累積 error budget 消耗偏高。"
runbook: "近 7d verifier 失敗分析,考慮 playbook fine-tune。"
# ---- 負測: 決策準確率 = 92% → SlowBurn 不觸發 ----
- interval: 1m
name: "SLO_DecisionAccuracy_SlowBurn 不觸發(決策準確率 92%"
input_series:
- series: 'post_execution_verification_total{outcome="success"}'
values: "0+92x120"
- series: 'automation_operation_log_total{outcome="auto_executed"}'
values: "0+100x120"
alert_rule_test:
- eval_time: 70m
alertname: SLO_DecisionAccuracy_SlowBurn
exp_alerts: []
# ============================================================
# Alert Tests — SLO 4: KM 增長率
# ============================================================
# ---- 正測: KM 增長率 = 0 → Critical 觸發 ----
- interval: 1m
name: "SLO_KMGrowthRate_Critical 觸發KM 停止增長)"
input_series:
# counter 停止increase[24h] = 0
- series: "knowledge_entries_total"
values: "100x1600"
alert_rule_test:
- eval_time: 25h
alertname: SLO_KMGrowthRate_Critical
exp_alerts:
- exp_labels:
alertname: SLO_KMGrowthRate_Critical
severity: critical
slo_name: km_growth_rate
team: ai
auto_repair: "false"
exp_annotations:
summary: "SLO KM 增長率嚴重不足(< 5 筆/day— 疑似 KM 鏈斷裂"
description: "過去 24h KM 新增 0 筆,遠低於目標 20 筆/day飛輪學習迴圈疑似中斷。"
runbook: |
1. 確認 knowledge_entries_created_24h gauge 與 knowledge_entries_total counter 是否正常遞增
2. 查 governance_agent 日誌中 governance_km_growth_slo_violation
3. 確認 auto_execute 後 KM 寫入路徑feedback_flywheel_km_write_gap.md
4. 手動執行 POST /api/v1/governance/check
# ---- 正測: KM 增長率 = 3/day → Critical 觸發(< 5----
- interval: 30m
name: "SLO_KMGrowthRate_Critical 觸發KM 增長 = 3/day"
input_series:
# 每 30min +0.0625 次 ≈ 3/day
- series: "knowledge_entries_total"
values: "0+0.0625x50"
alert_rule_test:
- eval_time: 25h
alertname: SLO_KMGrowthRate_Critical
exp_alerts:
- exp_labels:
alertname: SLO_KMGrowthRate_Critical
severity: critical
slo_name: km_growth_rate
team: ai
auto_repair: "false"
exp_annotations:
summary: "SLO KM 增長率嚴重不足(< 5 筆/day— 疑似 KM 鏈斷裂"
description: "過去 24h KM 新增 2.9999999999999996 筆,遠低於目標 20 筆/day飛輪學習迴圈疑似中斷。"
runbook: |
1. 確認 knowledge_entries_created_24h gauge 與 knowledge_entries_total counter 是否正常遞增
2. 查 governance_agent 日誌中 governance_km_growth_slo_violation
3. 確認 auto_execute 後 KM 寫入路徑feedback_flywheel_km_write_gap.md
4. 手動執行 POST /api/v1/governance/check
# ---- 負測: KM 增長率 = 30/day → Critical 不觸發 ----
- interval: 1m
name: "SLO_KMGrowthRate_Critical 不觸發KM 增長 = 30/day"
input_series:
# 每分鐘 +0.0208 次 = 30/day
- series: "knowledge_entries_total"
values: "0+0.0208x1600"
alert_rule_test:
- eval_time: 25h
alertname: SLO_KMGrowthRate_Critical
exp_alerts: []
# ---- 正測: KM 增長率 = 15/day → Low 觸發(< 20但 Critical 不觸發 ----
- interval: 1m
name: "SLO_KMGrowthRate_Low 觸發Critical 不觸發KM 增長 15/day"
input_series:
# 每分鐘 +0.0104 次 ≈ 15/day
- series: "knowledge_entries_total"
values: "0+0.0104x1600"
alert_rule_test:
- eval_time: 25h
alertname: SLO_KMGrowthRate_Low
exp_alerts:
- exp_labels:
alertname: SLO_KMGrowthRate_Low
severity: warning
slo_name: km_growth_rate
team: ai
auto_repair: "false"
exp_annotations:
summary: "SLO KM 增長率偏低(< 20 筆/day"
description: "過去 24h KM 新增 14.976000000000393 筆,低於目標 20 筆/day。"
runbook: "查 KM 寫入路徑auto_execute 後 _write_execution_result_to_km確認飛輪 KM 閉環正常。"