awoooi/ops
Your Name 577250a678
Some checks failed
Code Review / ai-code-review (push) Successful in 52s
CD Pipeline / tests (push) Failing after 2m21s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 1m6s
fix(governance): stop W-3/W-4 guards from silencing alerts + add Prometheus missing-data alert
【Commander's reprimand: violation of the feedback_full_chain_first_then_fix.md iron rule】

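The previous commit f1362fcc used skip conditions to swallow alerts, which is a silencing fix:
  - W-3: total_exec<10 always skips → Redis can stay empty forever and no alert fires
  - W-4: playbooks total==0 always skips → the table can be wiped and no alert fires
  - Prometheus: the NaN sentinel stacked on the existing < 0.1 rule leaves no path
    that can alert (a NaN sample never satisfies the < 0.1 comparison)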

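The commander's reprimand: "You made the alerts disappear again" and "How many times has this happened already". This commit restores alert visibility.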

【Fix: 30-minute startup grace period; after it expires, fire a new "data pipeline broken" alert instead】

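- ai_slo_watchdog_job.py adds a module-level _PROCESS_START and a _grace_active() guard:
  - W-3a: metric has data + rate<0.30 → existing alert "flywheel success rate too low"
  - W-3b: rate=None and uptime>30min → new alert "flywheel data pipeline has no traffic"
  - W-4a: playbooks total>0 + approved=0 → existing alert "auto-remediation chain broken"
  - W-4b: playbooks total=0 and uptime>30min → new alert "Playbook table initialization failed"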

- The 3 Prometheus rule files (k8s/monitoring/flywheel-alerts.yaml,
  ops/monitoring/alerts.yml, ops/monitoring/alerts-unified.yml) add
  FlywheelExecutionRateMissing: absent() or NaN sustained for 30 minutes → alert,
  as a second safeguard alongside watchdog W-3b

【Added to memory】

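feedback_silencing_alerts_recurring_violation.md locks in the red-line iron rule:
  "Using skip in a fresh deploy / init guard to swallow alerts = structural dereliction
   of duty; the guard must branch into a grace period and, once it expires, fire a new
   data-pipeline-broken alert instead"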

【Verification】

All 106 governance-related unit tests pass:
  test_trust_drift_watchdog / test_governance_agent / test_failover_alerter /
  test_check_trust_drift_commit_outside_context_poc /
  test_governance_remediation_dispatch / test_ai_governance_endpoints /
  test_governance_dispatcher

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 12:39:46 +08:00