fix(monitoring): merge duplicate flywheel alert group + add notification_type: TYPE-8M
All checks were successful
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 44s
Merge awoooi_flywheel_health (duplicate) into awoooi_flywheel_meta_alerts:
- Add notification_type: TYPE-8M to all 5 rules
- Add FlywheelAlertnameNullHigh (previously only in the old group)
- Remove the duplicate group to eliminate the Prometheus same-name alert conflict

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
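For orientation, the rule that this commit carries over from the old group might look roughly like the sketch below once merged. This is a reconstruction from the diff lines, not the file's actual content: the indentation, the placement under awoooi_flywheel_meta_alerts, and the English annotation text (translated from the original) are assumptions, and any group-level settings such as interval are omitted because the diff does not show them.

```yaml
groups:
  - name: awoooi_flywheel_meta_alerts   # target group named in the commit message
    rules:
      # Assumed final form of the rule moved out of awoooi_flywheel_health.
      - alert: FlywheelAlertnameNullHigh
        expr: awoooi_flywheel_alertname_null_rate > 0.05
        for: 30m
        labels:
          severity: warning
          layer: k8s
          team: aiops
          auto_repair: "false"
          alert_category: flywheel_health
          notification_type: TYPE-8M      # label added to every rule in this commit
        annotations:
          # Annotation strings are English renderings of the original text.
          summary: "Flywheel alertname NULL rate exceeds 5%"
          description: "alertname NULL rate is {{ $value | humanizePercentage }}, which affects routing accuracy."
          runbook: "Run scripts/backfill_alertname.py to backfill"
```

Keeping each alert name in exactly one group means only one definition of each alert is evaluated, which is the conflict the commit message says it removes.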
@@ -755,9 +755,11 @@ groups:
          team: aiops
          auto_repair: "false"
          alert_category: flywheel_health
          notification_type: TYPE-8M
        annotations:
          summary: "Flywheel Playbook count is zero; AI repair relies entirely on the LLM"
          description: "No approved Playbook exists in Redis; auto-repair capability is greatly reduced"
          runbook: "Run scripts/cold_start_playbooks.py to cold-start"
      - alert: FlywheelExecutionSuccessLow
        expr: awoooi_flywheel_execution_success_rate < 0.1
        for: 2h
@@ -767,9 +769,11 @@ groups:
          team: aiops
          auto_repair: "false"
          alert_category: flywheel_health
          notification_type: TYPE-8M
        annotations:
          summary: "Flywheel execution success rate {{ $value | humanizePercentage }} is below 10%"
          description: "Execution success rate has stayed below 10% for 2 hours; Playbooks may be outdated"
          runbook: "Check the decision_manager logs; verify target resolution and SSH MCP status"
      - alert: FlywheelKMVectorizationLow
        expr: awoooi_flywheel_km_unvectorized_count > 10
        for: 30m
@@ -779,91 +783,38 @@ groups:
          team: aiops
          auto_repair: "false"
          alert_category: flywheel_health
          notification_type: TYPE-8M
        annotations:
          summary: "{{ $value }} KM entries are not vectorized; RAG query hit rate is dropping"
          description: "More than 10 rows in knowledge_entries have embedding IS NULL for 30 minutes"
      - alert: FlywheelIncidentsStuck
        expr: awoooi_flywheel_incidents_stuck > 5
        for: 5m
        labels:
          severity: warning
          layer: k8s
          team: aiops
          auto_repair: "false"
          alert_category: flywheel_health
        annotations:
          summary: "{{ $value }} Incidents stuck in INVESTIGATING for over 24h"
          description: "The flywheel inference-matching node may be blocked; clean up manually or re-trigger diagnosis"

  # =========================================================================
  # Detailed flywheel health alerts (awoooi_flywheel_health), restored from host 2026-04-12
  # =========================================================================
  - name: awoooi_flywheel_health
    interval: 5m
    rules:
      - alert: FlywheelPlaybookZero
        expr: awoooi_flywheel_playbook_count == 0
        for: 1h
        labels:
          severity: critical
          alert_category: flywheel_health
          notification_type: TYPE-8M
          auto_repair: "false"
        annotations:
          summary: "Flywheel Playbook count is 0"
          description: "Playbook count has been 0 for 1 hour; the flywheel learning node has failed completely."
          runbook: "Run scripts/cold_start_playbooks.py to cold-start"

      - alert: FlywheelExecutionSuccessLow
        expr: awoooi_flywheel_execution_success_rate < 0.1
        for: 2h
        labels:
          severity: warning
          alert_category: flywheel_health
          notification_type: TYPE-8M
          auto_repair: "false"
        annotations:
          summary: "Flywheel auto-repair success rate is below 10%"
          description: "Execution success rate is {{ $value | humanizePercentage }}, below the 10% health baseline."
          runbook: "Check the decision_manager logs; verify target resolution and SSH MCP status"

      - alert: FlywheelKMVectorizationLow
        expr: awoooi_flywheel_km_unvectorized_count > 10
        for: 30m
        labels:
          severity: warning
          alert_category: flywheel_health
          notification_type: TYPE-8M
          auto_repair: "false"
        annotations:
          summary: "Flywheel unvectorized KM count > 10"
          description: "{{ $value }} KM entries are not yet vectorized; RAG query quality is degraded."
          runbook: "Run scripts/batch_vectorize_km.py or check the daily CronJob status"

      - alert: FlywheelAlertnameNullHigh
        expr: awoooi_flywheel_alertname_null_rate > 0.05
        for: 30m
        labels:
          severity: warning
          layer: k8s
          team: aiops
          auto_repair: "false"
          alert_category: flywheel_health
          notification_type: TYPE-8M
          auto_repair: "false"
        annotations:
          summary: "Flywheel alertname NULL rate exceeds 5%"
          description: "alertname NULL rate is {{ $value | humanizePercentage }}, which affects routing accuracy."
          runbook: "Run scripts/backfill_alertname.py to backfill"

      - alert: FlywheelIncidentsStuck
        expr: awoooi_flywheel_incidents_stuck > 5
        for: 10m
        labels:
          severity: warning
          layer: k8s
          team: aiops
          auto_repair: "false"
          alert_category: flywheel_health
          notification_type: TYPE-8M
          auto_repair: "false"
        annotations:
          summary: "{{ $value }} Incidents stuck in INVESTIGATING for over 24 hours"
          description: "Many Incidents are not progressing; the decision engine or Telegram notifications may be blocked."
          summary: "{{ $value }} Incidents stuck in INVESTIGATING for over 24h"
          description: "The flywheel inference-matching node may be blocked; clean up manually or re-trigger diagnosis"

  # =========================================================================
  # Backup and restore alerts (awoooi_backup_restore), restored from host 2026-04-12
