awoooi/ops
OG T ba18ad2ef8
All checks were successful
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 40s
CD Pipeline / build-and-deploy (push) Successful in 8m37s
feat(hermes+rules): upgrade Hermes with LLM analysis + deprecate PostgreSQLDiskGrowthRate per Commander's decision
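Commander's decisions, 2026-04-19:
  - Rule 1 PostgreSQLDiskGrowthRate → option C: deprecate and replace with new rules
  - Rule 2 NoAlertsReceived2Hours → keep (it guards the real alerting pipeline)
  - Fix the noise_rate algorithm first (NO_ACTION does not count as fp), then tune dynamically after observation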

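1. rule_stats_updater v2 noise algorithm:
  Before: every EXPIRED approval counted as fp (false positive)
  Problem: NO_ACTION/OBSERVE/INVESTIGATE are pure AI observations and should not count as false positives
  Fix: WHERE ar.action NOT ILIKE '%NO_ACTION%' AND ar.action NOT ILIKE '%OBSERVE%' AND ...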
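
The filter above can be sketched in Python as follows. This is a minimal illustration, not the code in rule_stats_updater: the function names, the (status, action) tuple shape, and the definition of noise_rate as fp / total are all assumptions, and the marker list is taken from the "Problem" line rather than from the elided WHERE clause.

```python
# Hypothetical sketch; names are illustrative and not taken from rule_stats_updater.
# Markers come from the commit text (NO_ACTION / OBSERVE / INVESTIGATE).
OBSERVE_ONLY_MARKERS = ("NO_ACTION", "OBSERVE", "INVESTIGATE")


def counts_as_false_positive(status: str, action: str) -> bool:
    """An EXPIRED approval counts as fp unless its action is observe-only."""
    if status != "EXPIRED":
        return False
    # Case-insensitive substring match, mirroring ILIKE '%...%'.
    # Note: unlike SQL, a missing action is treated as an empty string here.
    upper = (action or "").upper()
    return not any(marker in upper for marker in OBSERVE_ONLY_MARKERS)


def noise_rate(approvals) -> float:
    """Assumed definition: fp / total over (status, action) pairs; 0.0 when empty."""
    if not approvals:
        return 0.0
    fp = sum(counts_as_false_positive(status, action) for status, action in approvals)
    return fp / len(approvals)
```

With this shape, an expired "NO_ACTION" approval no longer raises a rule's noise rate, while an expired actionable approval still does.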

2. hermes_rule_quality v2 LLM upgrade:
  New _llm_analyze_noisy_rule:
    - Uses OpenClaw (Ollama/NemoTron/Gemini) to analyze each noisy rule
    - JSON output: probable_root_causes/recommended_actions/confidence/should_deprecate
    - 3-way parse fallback (direct / NemoTron wrapper / nested in description)
  _write_advisory_aol adds llm_analysis to output_payload
  _send_telegram_summary adds the AI verdict + top 2 recommendations (capped at 8 entries to keep the message short)
  Complies with the Commander's iron rule: the AI analyzes but never acts automatically; decisions stay with a human
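
A rough sketch of what a 3-way parse fallback could look like. The actual wrapper formats are not described in this commit, so the shapes here are assumptions: "wrapper" is taken to mean a JSON object embedded in surrounding text or a code fence, and "nested in description" is taken to mean the analysis sits inside a "description" field, either as an object or as a JSON string. parse_llm_analysis and its helpers are hypothetical names.

```python
import json
import re

# Keys listed in the commit message for the LLM's JSON output.
REQUIRED_KEYS = {"probable_root_causes", "recommended_actions",
                 "confidence", "should_deprecate"}


def _loads(text):
    """json.loads that returns None instead of raising."""
    try:
        return json.loads(text)
    except (TypeError, ValueError):
        return None


def _as_analysis(obj):
    """Return obj if it is a dict carrying every required key, else None."""
    if isinstance(obj, dict) and REQUIRED_KEYS.issubset(obj):
        return obj
    return None


def parse_llm_analysis(raw: str):
    # Path 1: the reply is the JSON object itself.
    direct = _loads(raw)
    if _as_analysis(direct):
        return direct
    # Path 2 (assumed wrapper shape): JSON embedded in prose or a code fence.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    wrapped = _loads(match.group(0)) if match else None
    if _as_analysis(wrapped):
        return wrapped
    # Path 3 (assumed): analysis nested under a "description" field.
    for candidate in (direct, wrapped):
        if isinstance(candidate, dict):
            desc = candidate.get("description")
            nested = desc if isinstance(desc, dict) else _loads(desc)
            if _as_analysis(nested):
                return nested
    return None  # Caller decides how to report an unparseable reply.
```

Returning None on failure keeps the analysis advisory: a reply that cannot be parsed simply yields no llm_analysis, and nothing acts on it automatically.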

3. ops/monitoring/alerts-unified.yml replaces Rule 1:
  Removed PostgreSQLDiskGrowthRate (500MB/h growth → fired false positives on normal WAL behavior)
  Added HostDiskUsageHigh (>80% for 10m, warning)
  Added HostDiskUsageCritical (>90% for 5m, critical)
  Both carry labels.supersedes='PostgreSQLDiskGrowthRate' for traceability
  (pending the next deploy-alerts workflow run to apply them to Prometheus)
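
For illustration, the two replacement rules can be modeled as Prometheus rule entries built in Python. The alert names, thresholds, hold durations, severities, and the supersedes label come from this commit; the PromQL expression is an assumption based on the standard node_exporter filesystem metrics, and disk_usage_rule is a hypothetical helper, so the real alerts-unified.yml may differ.

```python
def disk_usage_rule(name: str, threshold_pct: int, hold: str, severity: str) -> dict:
    """Build one alerting-rule entry (alert / expr / for / labels)."""
    # Assumed expression: percentage of filesystem space in use, per node_exporter.
    expr = (
        "(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 "
        f"> {threshold_pct}"
    )
    return {
        "alert": name,
        "expr": expr,
        "for": hold,
        "labels": {
            "severity": severity,
            # Traceability back to the rule these alerts replace.
            "supersedes": "PostgreSQLDiskGrowthRate",
        },
    }


RULES = [
    disk_usage_rule("HostDiskUsageHigh", 80, "10m", "warning"),
    disk_usage_rule("HostDiskUsageCritical", 90, "5m", "critical"),
]
```

Alerting on absolute usage rather than growth rate avoids firing on bursts of WAL writes, which was the false-positive pattern that motivated the deprecation.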

4. Mark the rule deprecated in the DB immediately (so Hermes does not push it again before the alerts yaml is deployed):
  UPDATE alert_rule_catalog SET review_status='deprecated' WHERE rule_name='PostgreSQLDiskGrowthRate'

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 19:39:05 +08:00