awoooi/scripts/check_config_drift.py
Your Name 715dc3cb91 fix(observability): P0 false-alert mitigation + ConfigMap drift alignment + governance tooling
P0/P1 observability-layer fixes prompted by the 12-Agent full-landscape diagnosis.

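## P0 false-alert mitigation (root cause of the 4-SLO avalanche)
- governance_agent.py:306: an empty result no longer falls back to 0.0; the code now continues and logs a warning
  Root cause: a Prometheus query that returned no data (emitter not implemented / rule not deployed) was misread as SLO=0,
  which always set violated=True and fired 4 false alerts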
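
The empty-result handling described above can be sketched as follows. This is a minimal illustration, not the code in governance_agent.py: the function name `evaluate_slos`, the shape of `results` and `targets`, and the "latest sample below target means violated" rule are all assumptions made for the example.

```python
import logging

logger = logging.getLogger("governance_agent_sketch")


def evaluate_slos(
    results: dict[str, list[float]], targets: dict[str, float]
) -> dict[str, bool]:
    """Return {slo_name: violated} only for SLOs that actually have data.

    Hypothetical shapes: `results` maps an SLO name to the samples a
    Prometheus query returned; `targets` maps it to the minimum acceptable value.
    """
    violations: dict[str, bool] = {}
    for name, target in targets.items():
        samples = results.get(name) or []
        if not samples:
            # No data means "unknown", not 0.0: skip instead of raising a false alert.
            logger.warning("SLO %s: empty Prometheus result, skipping evaluation", name)
            continue
        violations[name] = samples[-1] < target
    return violations
```

With this shape, an SLO whose emitter or recording rule is missing simply does not appear in the returned mapping, so it cannot be reported as violated.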

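## P0 ghost-button guard
- telegram_gateway.py:1654: when the Redis write for LLM dynamic buttons fails, call btn_list.clear()
  first_row (approve/reject, stateless HMAC nonce) is always kept by the caller at line 1488
  Aligns with the "one of three missing" iron rule in feedback_no_ghost_buttons.md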
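
The guard above might look roughly like this. It is a sketch under stated assumptions: `build_keyboard`, its parameters, and the use of `ConnectionError` to signal a failed Redis write are hypothetical stand-ins, not the actual telegram_gateway.py code.

```python
from typing import Callable


def build_keyboard(
    first_row: list[str],
    dynamic_labels: list[str],
    store_callback: Callable[[str], None],
) -> list[list[str]]:
    """Build button rows; drop every dynamic button if any cannot be persisted.

    `store_callback(label)` stands in for the Redis write and is assumed to
    raise ConnectionError when Redis is unavailable.
    """
    btn_list: list[str] = []
    try:
        for label in dynamic_labels:
            store_callback(label)
            btn_list.append(label)
    except ConnectionError:
        # A button whose state was never stored would be a "ghost":
        # visible to the user but impossible to handle when pressed.
        btn_list.clear()
    # The stateless first row (approve/reject) is kept unconditionally.
    rows = [first_row]
    if btn_list:
        rows.append(btn_list)
    return rows
```

Clearing the whole dynamic list, rather than only the button that failed, keeps the behaviour easy to reason about: either every dynamic button is backed by stored state or none is shown.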

## ConfigMap drift fixes (3 places)
- config.py:683 PROMETHEUS_URL: 188→110 (caught by the drift checker; a partial root cause of SPF-4)
- config.py:705 ARGOCD_URL: 125→121 (already known from T0 G3)
- config.py:375 AI_FALLBACK_ORDER: add nvidia to align with the ConfigMap

## P1 Alertmanager upgrade (amtool SUCCESS)
- ops/alertmanager/alertmanager.yml: deprecated syntax → new v0.27+ syntax
  - match/match_re → matchers
  - source_match/target_match → source_matchers/target_matchers
  - group_by gains the team label (prevents the 4 SLO-avalanche alerts from being pushed in the same second)
  - PostgreSQL/Redis inhibit rules gain equal: ['instance'] (prevents overly broad inhibition)
- Added 3 causal inhibition groups:
  - OllamaInstanceDown → SLO_*/AI_* (30 minutes)
  - KMConverterDown → SLO_KMGrowthRate*
  - SLO_*_FastBurn → SLO_*_(Medium|Slow)Burn

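## Governance tooling landed
- scripts/check_config_drift.py: detects drift between the ConfigMap and the code defaults
  It found that the PROMETHEUS_URL drift is the SPF-4 root cause (governance_agent connected to 188 instead of 110)
- scripts/health_check_session.sh: full-landscape verification of 11 services + 4 SSH + drift + git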

## Verification
- 1552 unit tests all green
- amtool check-config SUCCESS (8 inhibit_rules / 2 receivers)
- drift checker: all 4 fields aligned
- health check: all 11 services reachable

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 10:44:39 +08:00

110 lines
3.5 KiB
Python
Executable File
#!/usr/bin/env python3
# 2026-04-28 ogt + Claude Opus 4.7: P2-1 ConfigMap vs code default drift checker
# Source: tool-expert unified governance plan
# Purpose: verify at CI / pre-commit time that the k8s ConfigMap matches the defaults in apps/api/src/core/config.py
# Cases that violated the "fact-driven" red line: AI_FALLBACK_ORDER and ARGOCD_URL have both drifted before
"""
ConfigMap vs Code Default Drift Checker

Usage:
    python3 scripts/check_config_drift.py

Exit codes:
    0 = all fields aligned
    1 = at least one field drifted (CI should fail)
    2 = ConfigMap or config.py not found

Can be added to .pre-commit-config.yaml:
    - repo: local
      hooks:
        - id: config-drift-check
          name: ConfigMap vs code default drift
          entry: python3 scripts/check_config_drift.py
          language: python
          pass_filenames: false
          additional_dependencies: [pyyaml]
"""
from __future__ import annotations

import json
import re
import sys
from pathlib import Path

import yaml  # pre-commit installs this via additional_dependencies

ROOT = Path(__file__).resolve().parent.parent
CONFIGMAP_PATH = ROOT / "k8s" / "awoooi-prod" / "04-configmap.yaml"
CONFIG_PY_PATH = ROOT / "apps" / "api" / "src" / "core" / "config.py"

# Fields to compare.
# code_pattern: regex (applied with re.DOTALL) that captures the default=... value in config.py
CHECK_FIELDS: dict[str, dict[str, str]] = {
    "AI_FALLBACK_ORDER": {
        "configmap_key": "AI_FALLBACK_ORDER",
        "code_pattern": r"AI_FALLBACK_ORDER:\s*list\[str\]\s*=\s*Field\([^)]*?default=(\[[^\]]+\])",
    },
    "ARGOCD_URL": {
        "configmap_key": "ARGOCD_URL",
        "code_pattern": r"ARGOCD_URL[^\n]*?\n[^)]*?default=[\"']([^\"']+)[\"']",
    },
    "PROMETHEUS_URL": {
        "configmap_key": "PROMETHEUS_URL",
        "code_pattern": r"PROMETHEUS_URL[^\n]*?\n[^)]*?default=[\"']([^\"']+)[\"']",
    },
    "OLLAMA_URL": {
        "configmap_key": "OLLAMA_URL",
        "code_pattern": r"OLLAMA_URL[^\n]*?\n[^)]*?default=[\"']([^\"']+)[\"']",
    },
}
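

def _normalize(raw: str) -> object:
    """Parse a JSON-style list string into a list; otherwise return the stripped string."""
    raw_strip = raw.strip().strip("'\"")
    if raw_strip.startswith("["):
        try:
            # Accept Python-style single quotes by converting them to JSON double quotes.
            return json.loads(raw_strip.replace("'", '"'))
        except json.JSONDecodeError:
            return raw_strip
    return raw_strip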


def main() -> int:
    if not CONFIGMAP_PATH.exists():
        print(f"[ERROR] ConfigMap not found: {CONFIGMAP_PATH}")
        return 2
    if not CONFIG_PY_PATH.exists():
        print(f"[ERROR] config.py not found: {CONFIG_PY_PATH}")
        return 2

    with CONFIGMAP_PATH.open(encoding="utf-8") as fh:
        # An empty YAML file loads as None, so guard before reading "data".
        cm_data: dict = (yaml.safe_load(fh) or {}).get("data") or {}
    py_src = CONFIG_PY_PATH.read_text(encoding="utf-8")

    exit_code = 0
    print("=== ConfigMap ↔ code.default Drift Check ===")
    for field, spec in CHECK_FIELDS.items():
        cm_raw = cm_data.get(spec["configmap_key"], "<MISSING_IN_CONFIGMAP>")
        m = re.search(spec["code_pattern"], py_src, re.DOTALL)
        py_raw = m.group(1) if m else "<NOT_FOUND_IN_CONFIG_PY>"
        # str() guards against YAML values that were parsed as non-strings.
        cm_val = _normalize(str(cm_raw))
        py_val = _normalize(py_raw)
        if cm_val == py_val:
            print(f"[OK] {field}: {cm_val}")
        else:
            print(f"[DRIFT] {field}:")
            print(f"    ConfigMap = {cm_val}")
            print(f"    config.py = {py_val}")
            exit_code = 1

    if exit_code == 0:
        print("=== All drift-check fields aligned ===")
    else:
        print("=== DRIFT detected, fix the inconsistency ===")
    return exit_code


if __name__ == "__main__":
    sys.exit(main())
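
To see how the `code_pattern` regexes pull defaults out of config.py, the following stand-alone sketch runs two of them against a small sample. The sample text, the host name `prometheus.example`, and the helper `extract_default` are invented for illustration; the real config.py is not shown here and may be laid out differently.

```python
import json
import re

# Hypothetical excerpt in the style the patterns expect (pydantic-like Field calls).
SAMPLE_CONFIG_PY = '''
    AI_FALLBACK_ORDER: list[str] = Field(
        default=["ollama", "nvidia"],
    )
    PROMETHEUS_URL: str = Field(
        default="http://prometheus.example:9090",
    )
'''

# Same regexes as in CHECK_FIELDS above.
LIST_PATTERN = r"AI_FALLBACK_ORDER:\s*list\[str\]\s*=\s*Field\([^)]*?default=(\[[^\]]+\])"
URL_PATTERN = r"PROMETHEUS_URL[^\n]*?\n[^)]*?default=[\"']([^\"']+)[\"']"


def extract_default(pattern: str, source: str) -> object:
    """Return the captured default (list or string), or None if nothing matches."""
    m = re.search(pattern, source, re.DOTALL)
    if m is None:
        return None
    raw = m.group(1)
    if raw.startswith("["):
        return json.loads(raw.replace("'", '"'))
    return raw


fallback_order = extract_default(LIST_PATTERN, SAMPLE_CONFIG_PY)
prometheus_url = extract_default(URL_PATTERN, SAMPLE_CONFIG_PY)
print(fallback_order, prometheus_url)
```

One limitation worth knowing: because the patterns use `[^)]*?` before `default=`, a closing parenthesis appearing earlier in the same `Field(...)` call (for example inside a description string) would stop the match, and the checker would report `<NOT_FOUND_IN_CONFIG_PY>` for that field.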