awoooi/k8s/monitoring/flywheel-alerts.yaml
Your Name 577250a678
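fix(governance): fix anti-silencing W-3/W-4 guards + add Prometheus missing-data alert
[Commander's reprimand: violated the iron rule in feedback_full_chain_first_then_fix.md]

The previous commit f1362fcc swallowed alerts with skip conditions, which is a silencing fix:
  - W-3: total_exec<10 always skips → no alert even if Redis stays empty forever
  - W-4: playbooks total==0 always skips → no alert even if the table is wiped
  - Prometheus NaN sentinel stacked on the existing < 0.1 rule leaves no path that can alert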
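
The third point follows from ordinary IEEE 754 float semantics, which Prometheus sample values share. A minimal sketch, not code from the repository: `fires_low_rate` and `is_nan_sentinel` are hypothetical helper names that mimic the `< 0.1` comparison and a self-inequality test.

```python
def fires_low_rate(rate: float, threshold: float = 0.1) -> bool:
    """Mimic a `metric < threshold` comparison; NaN never satisfies it."""
    return rate < threshold


def is_nan_sentinel(rate: float) -> bool:
    """NaN is the only float value that is not equal to itself."""
    return rate != rate


nan = float("nan")
print(fires_low_rate(0.05))   # True: real data below the threshold
print(fires_low_rate(nan))    # False: the NaN sentinel slips past the rule
print(is_nan_sentinel(nan))   # True: self-inequality catches the sentinel
print(is_nan_sentinel(0.05))  # False
```

A rule that only compares against a threshold therefore stays silent on NaN, so a separate check is needed to make the sentinel visible.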

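The commander's reprimand: "the alerts were made to disappear again", "how many times has this been done already". This commit restores alert visibility.

[Fix: 30-minute startup grace period + once it expires, fire new "data pipeline broken" alerts instead]

- ai_slo_watchdog_job.py adds a module-level _PROCESS_START and a _grace_active() guard:
  - W-3a: metric has data + rate<0.30 → existing "flywheel success rate too low"
  - W-3b: rate=None and uptime>30min → new alert "flywheel data pipeline has no traffic"
  - W-4a: playbooks total>0 + approved=0 → existing "auto-remediation chain broken"
  - W-4b: playbooks total=0 and uptime>30min → new alert "Playbook table initialization failed"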

- 3 Prometheus rule files (k8s/monitoring/flywheel-alerts.yaml,
  ops/monitoring/alerts.yml, ops/monitoring/alerts-unified.yml) add
  FlywheelExecutionRateMissing: absent() or NaN sustained for 30 minutes → alert,
  a second layer of protection alongside watchdog W-3b

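[Added to memory]

feedback_silencing_alerts_recurring_violation.md locks in the red-line iron rule:
  "Using skip in a fresh deploy / init guard to swallow alerts = structural dereliction of duty; the guard must split off a grace period +
   once it expires, fire a new data-pipeline-broken alert instead"

[Verification]

All 106 governance-related unit tests pass: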
  test_trust_drift_watchdog / test_governance_agent / test_failover_alerter /
  test_check_trust_drift_commit_outside_context_poc /
  test_governance_remediation_dispatch / test_ai_governance_endpoints /
  test_governance_dispatcher

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 12:39:46 +08:00

# =============================================================================
# 🔴 [ARCHIVED / DEPRECATED] Do not use this file
# =============================================================================
# 2026-04-14 Claude Sonnet 4.6 (Backlog cleanup):
# - This file is in PrometheusRule CRD format and needs Prometheus Operator to be loaded
# - But AWOOOI's Prometheus is a Docker deployment (188) with no Operator
# - All 11 rules have been migrated into ops/monitoring/alerts-unified.yml (group: awoooi_flywheel_meta_alerts)
# - This file is kept for historical reference only; do not kubectl apply it
#
# Authoritative source: ops/monitoring/alerts-unified.yml
# =============================================================================
# Historical info: flywheel health alert rules, ADR-074 M1
# Data source: /api/v1/stats/flywheel/metrics (awoooi-flywheel scrape job)
# Created: 2026-04-12 ogt (ADR-074 M1)
# Archived: 2026-04-14 Claude Sonnet 4.6 (11/11 rules migrated into alerts-unified.yml)
# =============================================================================
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: flywheel-alerts
  namespace: monitoring
  labels:
    release: prometheus
    app: prometheus
spec:
  groups:
    - name: awoooi_flywheel_health
      interval: 5m
      rules:
        # P0: no Playbooks at all → flywheel learning node has failed
        - alert: FlywheelPlaybookZero
          expr: awoooi_flywheel_playbook_count == 0
          for: 1h
          labels:
            severity: critical
            alert_category: infrastructure
            notification_type: TYPE-3
          annotations:
            summary: "Flywheel Playbook count is 0"
            description: "Playbook count has been 0 for 1 hour; the flywheel learning node has failed completely."
            runbook: "Run scripts/cold_start_playbooks.py to cold-start"
        # P0: execution success rate extremely low (data present but below the threshold)
        # 2026-05-03 ogt + Claude Opus 4.7 (APAC): anti-silencing, paired alert added
        # The new flywheel_stats_service emits NaN when samples are insufficient
        # (in PromQL a NaN sample never satisfies `< 0.1`, so it drops out of the comparison),
        # so this rule fires only when data is present and the value is below 0.1; NaN cannot trigger it.
        # The real "data pipeline stalled" case is covered by FlywheelExecutionRateMissing below.
        - alert: FlywheelExecutionSuccessLow
          expr: awoooi_flywheel_execution_success_rate < 0.1
          for: 2h
          labels:
            severity: warning
            alert_category: infrastructure
            notification_type: TYPE-3
          annotations:
            summary: "Flywheel auto-remediation success rate is below 10%"
            description: "Execution success rate is {{ $value | humanizePercentage }}, below the 10% health baseline."
            runbook: "Check the decision_manager logs; verify target resolution and SSH MCP status"
        # P0: flywheel execution-rate data pipeline stalled (NaN sentinel + 30 minutes without data)
        # 2026-05-03 ogt + Claude Opus 4.7 (APAC): feedback_silencing_alerts_recurring_violation
        # Paired with FlywheelExecutionSuccessLow: when the metric is NaN (insufficient-samples sentinel)
        # for 30 consecutive minutes, data should be arriving but is not.
        # Watchdog W-3b fires on the same scenario as a second layer of protection.
        - alert: FlywheelExecutionRateMissing
          expr: absent(awoooi_flywheel_execution_success_rate) or (awoooi_flywheel_execution_success_rate != awoooi_flywheel_execution_success_rate)
          for: 30m
          labels:
            severity: warning
            alert_category: infrastructure
            notification_type: TYPE-3
          annotations:
            summary: "Flywheel execution-rate metric has had no data for 30 consecutive minutes"
            description: "execution_success_rate has been NaN or absent for 30 consecutive minutes, which indicates the Redis playbook statistics have stalled (broken data pipeline / Redis flush / FlywheelStatsService fault)."
            runbook: "1) Check whether Redis playbook:* keys exist 2) Check the FlywheelStatsService logs 3) Pull the /metrics endpoint directly to find the source of the NaN"
        # P0: many KM entries not vectorized → RAG cannot use historical cases
        - alert: FlywheelKMVectorizationLow
          expr: awoooi_flywheel_km_unvectorized_count > 10
          for: 30m
          labels:
            severity: warning
            alert_category: infrastructure
            notification_type: TYPE-3
          annotations:
            summary: "Flywheel KM unvectorized count > 10"
            description: "{{ $value }} KM entries have not been vectorized yet; RAG query quality is degraded."
            runbook: "Run scripts/batch_vectorize_km.py or check the daily CronJob status"
        # P1: abnormal alertname NULL rate
        - alert: FlywheelAlertnameNullHigh
          expr: awoooi_flywheel_alertname_null_rate > 0.05
          for: 30m
          labels:
            severity: warning
            alert_category: infrastructure
            notification_type: TYPE-3
          annotations:
            summary: "Flywheel alertname NULL rate exceeds 5%"
            description: "alertname NULL rate is {{ $value | humanizePercentage }}, which affects routing accuracy."
            runbook: "Run scripts/backfill_alertname.py to backfill"
        # P1: Incidents stuck for more than 24 hours
        - alert: FlywheelIncidentsStuck
          expr: awoooi_flywheel_incidents_stuck > 5
          for: 10m
          labels:
            severity: warning
            alert_category: infrastructure
            notification_type: TYPE-3
          annotations:
            summary: "{{ $value }} Incidents stuck in INVESTIGATING for more than 24 hours"
            description: "Many Incidents are not progressing; the decision engine or Telegram notifications may be blocked."
    - name: awoooi_backup_restore
      interval: 1h
      rules:
        # P0: backup restore dry-run failed
        - alert: BackupRestoreTestFailed
          expr: awoooi_backup_restore_test_success == 0
          for: 5m
          labels:
            severity: critical
            alert_category: infrastructure
            notification_type: TYPE-3
          annotations:
            summary: "Backup restore dry-run test failed"
            description: "Velero restore dry-run failed; the backup may not be restorable. Verify the backup state manually right away."
            runbook: "Run velero backup describe awoooi-daily and velero restore create --from-backup awoooi-daily --dry-run"
        # P1: backup restore test not run for more than 8 days (weekly schedule broken)
        - alert: BackupRestoreTestStale
          expr: (time() - awoooi_backup_restore_test_timestamp_seconds) > 691200
          for: 10m
          labels:
            severity: warning
            alert_category: infrastructure
            notification_type: TYPE-3
          annotations:
            summary: "Backup restore test has not run for more than 8 days"
            description: "The last backup test was {{ $value | humanizeDuration }} ago; the weekly CronJob schedule may be broken."
            runbook: "Check the status of the backup-restore-test CronJob in the velero namespace"
    - name: awoooi_infrastructure_detailed
      interval: 60s
      rules:
        # P1: unhealthy container on Docker host 188 (docker inspect health=unhealthy)
        # node-exporter + cAdvisor expose container_last_seen / container_tasks_state
        - alert: DockerContainerUnhealthyDetailed
          expr: |
            count by (name, instance) (
              container_tasks_state{state="running", instance=~"192.168.0.188.*"}
            ) == 0
            or
            container_last_seen{instance=~"192.168.0.188.*", name!=""} < (time() - 120)
          for: 5m
          labels:
            severity: warning
            alert_category: infrastructure
            notification_type: TYPE-3
          annotations:
            summary: "Container {{ $labels.name }} on host 188 is abnormal"
            description: "Container {{ $labels.name }} on {{ $labels.instance }} has been inactive for more than 2 minutes or is not in the running state."
            runbook: "SSH to 192.168.0.188 and run docker inspect {{ $labels.name }} to confirm the health status"
        # P1: Redis Streams backlog too high (alert stream or incident stream)
        - alert: RedisStreamBacklogHigh
          expr: awoooi_redis_stream_len > 500
          for: 10m
          labels:
            severity: warning
            alert_category: infrastructure
            notification_type: TYPE-3
          annotations:
            summary: "Redis Stream {{ $labels.stream }} backlog is {{ $value }} entries"
            description: "Redis Stream backlog exceeds 500 entries; the flywheel consumer may be blocked."
            runbook: "Check consumer group lag: XINFO GROUPS <stream-key>"
        # P1: PostgreSQL disk growth rate too fast (more than 500MB growth in 1 hour)
        - alert: PostgreSQLDiskGrowthRate
          expr: |
            (
              node_filesystem_size_bytes{instance="192.168.0.188:9100", mountpoint="/"}
              - node_filesystem_avail_bytes{instance="192.168.0.188:9100", mountpoint="/"}
            )
            - (
              node_filesystem_size_bytes{instance="192.168.0.188:9100", mountpoint="/"} offset 1h
              - node_filesystem_avail_bytes{instance="192.168.0.188:9100", mountpoint="/"} offset 1h
            )
            > 524288000
          for: 5m
          labels:
            severity: warning
            alert_category: infrastructure
            notification_type: TYPE-3
          annotations:
            summary: "Disk on host 188 grew by more than 500MB in 1 hour"
            description: "Disk usage grew by {{ $value | humanize1024 }}B over the past hour; the cause may be a PostgreSQL WAL or log surge."
            runbook: "SSH to 188: df -h / && du -sh /var/lib/postgresql/*/pg_wal"
    - name: awoooi_host_connectivity
      interval: 60s
      rules:
        # P0: network partition between hosts
        - alert: HostNetworkPartition
          expr: probe_success{job="host-connectivity"} == 0
          for: 5m
          labels:
            severity: critical
            alert_category: infrastructure
            notification_type: TYPE-3
          annotations:
            summary: "Host {{ $labels.instance }} is unreachable"
            description: "TCP probe to {{ $labels.instance }} has failed for more than 5 minutes; a network partition may have occurred."
            runbook: "Check routing and firewall rules over SSH"