Your Name
715dc3cb91
fix(observability): P0 false-alarm stopgap + ConfigMap drift alignment + governance tooling
P0/P1 observability-layer fixes triggered by the 12-Agent panoramic diagnosis.
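## P0 false-alarm stopgap (root cause of the 4-SLO avalanche)
- governance_agent.py:306: an empty result no longer falls back to 0.0; the loop now does continue + log warning
  Root cause: a Prometheus query that returned no data (emitter not implemented / rule not deployed) was misread as SLO=0,
  which always set violated=True and fired 4 false alerts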
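
A minimal sketch of the guard described above, assuming the evaluator loops over SLOs and receives the `result` list of a Prometheus instant-query response. The names `evaluate_slos`, `targets`, and `query_fn` are hypothetical; the actual code at governance_agent.py:306 is not shown in this commit message.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger("governance_agent")


def evaluate_slos(
    targets: Dict[str, float],
    query_fn: Callable[[str], List[dict]],
) -> Dict[str, bool]:
    """Return {slo_name: violated} only for SLOs that actually have data.

    `targets` maps a (hypothetical) SLO name to its minimum acceptable value.
    `query_fn` stands in for the Prometheus call and returns the `result`
    list of an instant-query response (empty when no series matched).
    """
    violations: Dict[str, bool] = {}
    for name, target in targets.items():
        result = query_fn(name)
        if not result:
            # No data is not the same as SLO=0: skip instead of falling back to 0.0.
            logger.warning("SLO %s: empty Prometheus result, skipping", name)
            continue
        # Instant-vector samples look like {"metric": {...}, "value": [ts, "0.99"]}.
        value = float(result[0]["value"][1])
        violations[name] = value < target
    return violations
```

With this shape, an SLO whose emitter or recording rule is missing simply does not appear in the returned dict, so it cannot be reported as violated.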
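## P0 ghost-button guard
- telegram_gateway.py:1654: btn_list.clear() when Redis fails for the LLM dynamic buttons
  first_row (approve/reject, stateless HMAC nonce) is always kept by the caller at line 1488
  Aligns with the "three-missing-one" (三缺一) iron rule in feedback_no_ghost_buttons.md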
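
A rough sketch of the guard, assuming "Redis fails" means the write that stores each dynamic button's callback state raises an exception. `build_keyboard` and `persist` are hypothetical names, and buttons are modelled as plain strings rather than Telegram objects.

```python
from typing import Callable, List


def build_keyboard(
    first_row: List[str],
    dynamic_buttons: List[str],
    persist: Callable[[str], None],
) -> List[List[str]]:
    """Return keyboard rows; dynamic buttons appear only if all were persisted.

    `persist` is a stand-in for the Redis write that stores a button's
    callback state; it is expected to raise on failure.
    """
    btn_list = list(dynamic_buttons)
    try:
        for button in btn_list:
            persist(button)
    except Exception:
        # A button without stored state would do nothing when tapped,
        # so drop every dynamic button rather than show a ghost.
        btn_list.clear()
    rows = [list(first_row)]  # the stateless approve/reject row is always kept
    if btn_list:
        rows.append(btn_list)
    return rows
```

Because the first row carries its own state, it stays usable even when the dynamic row is dropped.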
## ConfigMap drift fixes (3 places)
- config.py:683 PROMETHEUS_URL: 188→110 (caught by the drift checker; partial root cause of SPF-4)
- config.py:705 ARGOCD_URL: 125→121 (known from T0 G3)
- config.py:375 AI_FALLBACK_ORDER: add nvidia to match the ConfigMap
## P1 Alertmanager upgrade (amtool SUCCESS)
- ops/alertmanager/alertmanager.yml: deprecated syntax → new v0.27+ syntax
  - match/match_re → matchers
  - source_match/target_match → source_matchers/target_matchers
- group_by adds the team label (prevents the 4 SLO-avalanche alerts from being pushed in the same second)
- PostgreSQL/Redis inhibit rules gain equal: ['instance'] (prevents over-broad inhibition)
- Add 3 causal inhibition groups:
  - OllamaInstanceDown → SLO_*/AI_* (30 minutes)
  - KMConverterDown → SLO_KMGrowthRate*
  - SLO_*_FastBurn → SLO_*_(Medium|Slow)Burn
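
To illustrate why `equal: ['instance']` matters, here is a simplified Python model of a single inhibit rule. It is a sketch, not Alertmanager's implementation: it matches on `alertname` only, ignores the self-inhibition exclusion, and the alert names in the usage lines are hypothetical.

```python
import re
from typing import Dict, List

Alert = Dict[str, str]  # label name -> label value


def is_inhibited(
    target: Alert,
    firing: List[Alert],
    source_re: str,
    target_re: str,
    equal: List[str],
) -> bool:
    """Simplified model of one inhibit rule keyed on alertname only."""
    if not re.fullmatch(target_re, target.get("alertname", "")):
        return False
    for source in firing:
        if not re.fullmatch(source_re, source.get("alertname", "")):
            continue
        # Every label listed in `equal` must agree between source and target;
        # a missing label is treated like an empty value.
        if all(source.get(k, "") == target.get(k, "") for k in equal):
            return True
    return False


down = {"alertname": "PostgreSQLDown", "instance": "db1"}
slow_db2 = {"alertname": "PostgreSQLSlowQueries", "instance": "db2"}

# Without `equal`, an outage on db1 also silences alerts for db2.
print(is_inhibited(slow_db2, [down], "PostgreSQLDown", "PostgreSQLSlow.*", []))
# With equal=["instance"], the db2 alert is left alone.
print(is_inhibited(slow_db2, [down], "PostgreSQLDown", "PostgreSQLSlow.*", ["instance"]))
```

The same model covers the FastBurn rule if the glob shorthand is read as regular expressions, for example `SLO_.*_FastBurn` as the source and `SLO_.*_(Medium|Slow)Burn` as the target.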
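## Governance tooling landed
- scripts/check_config_drift.py: detects drift between ConfigMap values and code defaults
  Found that PROMETHEUS_URL drift is the root cause of SPF-4 (governance_agent connected to 188 instead of 110)
- scripts/health_check_session.sh: panoramic verification of 11 services + 4 SSH + drift + git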
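
The core comparison behind a drift check can be sketched as follows. This is an assumption about the approach, not the contents of scripts/check_config_drift.py: `find_drift` is a hypothetical name, both sides are assumed to be already loaded into flat string dicts, and the URLs in the usage lines are placeholders.

```python
from typing import Dict, List, Tuple


def find_drift(
    configmap: Dict[str, str],
    code_defaults: Dict[str, str],
) -> List[Tuple[str, str, str]]:
    """Return (key, configmap_value, code_default) for every mismatched key.

    Only keys present on both sides are compared.
    """
    drift = []
    for key in sorted(configmap.keys() & code_defaults.keys()):
        if configmap[key] != code_defaults[key]:
            drift.append((key, configmap[key], code_defaults[key]))
    return drift


# Placeholder values for illustration only.
cm = {"PROMETHEUS_URL": "http://prom-a.example", "ARGOCD_URL": "http://argo.example"}
defaults = {"PROMETHEUS_URL": "http://prom-b.example", "ARGOCD_URL": "http://argo.example"}
for key, cm_value, default in find_drift(cm, defaults):
    print(f"DRIFT {key}: ConfigMap={cm_value} code default={default}")
```

A non-empty result is the signal to fix the code default or the ConfigMap before the mismatch surfaces as a misdirected connection.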
## Verification
- 1552 unit tests all green
- amtool check-config SUCCESS (8 inhibit_rules / 2 receivers)
- drift checker: all 4 fields aligned
- health check: all 11 services reachable
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 10:44:39 +08:00