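The critic PR review surfaced 7 blockers in the already-pushed commits; this commit fixes all of them.
## C1 + C2 + M1 + M2 + M3: KMWriter contract actually unified (the critic's 5 most severe findings)
### C1 km_writer.py:194: fix the self-defeating backfill
- Bare asyncio.create_task(_backfill_path_a_approval) → await _backfill_path_a_approval_safe()
- Awaited synchronously, with a dedicated DLQ (km:backfill:dlq) and a try/except so a backfill failure does not block the main write
- Added km_backfill_reconciler_job.py (scans the DLQ every 5 minutes) plus the ENABLE_KM_BACKFILL_RECONCILER flag
- Prevents the race in which Path B finishes before Path A and related_approval_id stays NULL forever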
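
The C1 pattern (await the backfill, and park failures in a DLQ instead of raising) could look roughly like the sketch below. It is an illustration under assumptions, not the code in km_writer.py: the in-memory `DLQ` list stands in for the `km:backfill:dlq` queue, and the function bodies and argument names are hypothetical.

```python
import asyncio
import json

# Hypothetical in-memory stand-in for the km:backfill:dlq queue.
DLQ: list = []


async def backfill_path_a_approval(incident_id: str, approval_id: str) -> None:
    """Placeholder backfill; it always fails here so the DLQ branch is exercised."""
    raise ConnectionError("database unavailable")


async def backfill_path_a_approval_safe(incident_id: str, approval_id: str) -> bool:
    """Await the backfill; on failure, queue the payload instead of raising."""
    try:
        await backfill_path_a_approval(incident_id, approval_id)
        return True
    except Exception as exc:
        DLQ.append(json.dumps({
            "incident_id": incident_id,
            "approval_id": approval_id,
            "error": repr(exc),
        }))
        return False


async def write_entry(incident_id: str, approval_id: str) -> str:
    # The main KM write would happen here; the backfill is awaited afterwards,
    # so it can no longer be lost silently the way a bare create_task() could.
    ok = await backfill_path_a_approval_safe(incident_id, approval_id)
    return "written" if ok else "written, backfill queued"


result = asyncio.run(write_entry("inc-1", "apr-9"))
```

A reconciler job would then drain the queued payloads and retry them on its own schedule.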
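### C2 km_writer.py:391: tighten the KM_WRITE_AWAIT=false path
- Previously ensure_future (fire-and-forget, which is worse than the old synchronous write)
- Now await writer.write(retry=1, timeout=2.0) (still awaited, but a single attempt with a short timeout)
- Docstring now states explicitly: "for emergency rollback only; reliability is not guaranteed"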
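
A minimal sketch of "awaited, single attempt, short timeout" follows. The helper name `write_once` and the two fake writers are assumptions made for illustration; the real `writer.write(retry=1, timeout=2.0)` signature is only known from the bullet above.

```python
import asyncio


async def write_once(write, timeout: float = 2.0) -> bool:
    """Await exactly one write attempt; give up after `timeout` seconds."""
    try:
        await asyncio.wait_for(write(), timeout=timeout)
        return True
    except Exception:
        # Covers the timeout and any write error. There is no retry on this
        # path, which is why it is only suitable for emergency rollback.
        return False


async def fast_write() -> None:
    await asyncio.sleep(0)


async def slow_write() -> None:
    await asyncio.sleep(1)


async def main():
    fast = await write_once(fast_write)
    slow = await write_once(slow_write, timeout=0.05)
    return fast, slow


fast_ok, slow_ok = asyncio.run(main())
```

Unlike `ensure_future`, the caller learns whether the write landed before it moves on.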
### M1 decision_manager.py:2178/2203: remove the _fire_and_forget bypass
- Two call sites used _fire_and_forget(executor.write_execution_result_to_km(...))
- Now await asyncio.shield(...) with BaseException protection (so an upstream cancel cannot interrupt the write)
- KM_WRITE_AWAIT=true finally results in a real await on this path
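
The effect of `asyncio.shield` under cancellation can be sketched as follows. The function names and the `completed` list are hypothetical; the sketch catches `asyncio.CancelledError` (a `BaseException` subclass since Python 3.8), waits for the shielded write to finish, then re-raises.

```python
import asyncio

completed: list = []


async def write_execution_result(entry: str) -> None:
    """Stand-in for the KM write; takes long enough to be interrupted."""
    await asyncio.sleep(0.05)
    completed.append(entry)


async def handler(entry: str) -> None:
    inner = asyncio.ensure_future(write_execution_result(entry))
    try:
        await asyncio.shield(inner)
    except asyncio.CancelledError:
        # The caller was cancelled, but the shielded task keeps running.
        # Wait for it to finish, then propagate the cancellation.
        await inner
        raise


async def main() -> bool:
    task = asyncio.create_task(handler("exec-1"))
    await asyncio.sleep(0.01)  # let the handler start the write
    task.cancel()              # simulate an upstream cancel
    try:
        await task
    except asyncio.CancelledError:
        return True
    return False


was_cancelled = asyncio.run(main())
```

The handler still reports cancellation to its caller, yet the write itself completes.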
### M2 incident_service.py:1099: add retry + DLQ to the hand-rolled path
- Previously: if settings.KM_WRITE_AWAIT: await asyncio.wait_for, else create_task
- Now: 3 retries with exponential backoff plus DLQ protection (calls the private km_writer helper)
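
Retry with exponential backoff plus a dead-letter fallback can be sketched like this. The helper name, the delay values, and the in-memory `dead_letters` list are assumptions; the private km_writer helper that the commit calls is not shown in this message.

```python
import asyncio

dead_letters: list = []  # hypothetical stand-in for the real DLQ


async def write_with_retry(write, payload: dict, attempts: int = 3,
                           base_delay: float = 0.01) -> bool:
    """Try `write` up to `attempts` times, doubling the delay between tries."""
    last_error = ""
    for attempt in range(attempts):
        try:
            await write(payload)
            return True
        except Exception as exc:
            last_error = repr(exc)
            if attempt < attempts - 1:
                await asyncio.sleep(base_delay * 2 ** attempt)
    # Every attempt failed: keep the payload for later reconciliation.
    dead_letters.append({"payload": payload, "error": last_error})
    return False


calls = {"flaky": 0, "broken": 0}


async def flaky_write(payload: dict) -> None:
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise ConnectionError("transient")


async def broken_write(payload: dict) -> None:
    calls["broken"] += 1
    raise ConnectionError("down")


async def main():
    first = await write_with_retry(flaky_write, {"id": 1})
    second = await write_with_retry(broken_write, {"id": 2})
    return first, second


flaky_ok, broken_ok = asyncio.run(main())
```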
### M3 km_writer.py:166: align the idempotency claim with the implementation
- knowledge_repository.create() gains an UPSERT path (pg_insert ON CONFLICT DO UPDATE)
- KnowledgeEntryCreate / KnowledgeEntryRecord gain a path_type field
- migration: ADD COLUMN path_type + partial unique index uix_knowledge_incident_path
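
The idempotent write can be illustrated with the standard-library `sqlite3` module, which accepts the same `ON CONFLICT ... DO UPDATE` clause. This is a sketch, not the repository code: the table layout, the `path_a` / `path_b` values, and the column names other than `path_type` are assumptions, and the index here is a plain unique index because the predicate of the real partial index is not given in this message.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE knowledge ("
    " id INTEGER PRIMARY KEY,"
    " incident_id TEXT NOT NULL,"
    " path_type TEXT NOT NULL,"
    " content TEXT NOT NULL)"
)
# The real index is partial; its WHERE predicate is not shown, so it is omitted.
conn.execute(
    "CREATE UNIQUE INDEX uix_knowledge_incident_path"
    " ON knowledge (incident_id, path_type)"
)


def upsert(incident_id: str, path_type: str, content: str) -> None:
    """Insert the entry, or overwrite the existing row for the same key."""
    conn.execute(
        "INSERT INTO knowledge (incident_id, path_type, content)"
        " VALUES (?, ?, ?)"
        " ON CONFLICT (incident_id, path_type)"
        " DO UPDATE SET content = excluded.content",
        (incident_id, path_type, content),
    )


upsert("inc-1", "path_a", "first draft")
upsert("inc-1", "path_a", "retried write")    # same key: updates, no duplicate
upsert("inc-1", "path_b", "execution result")

rows = conn.execute(
    "SELECT incident_id, path_type, content FROM knowledge ORDER BY path_type"
).fetchall()
```

A retried write for the same incident and path leaves one row instead of two, which is what the idempotency claim requires.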
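## M4 alertmanager.yml: tighten equal: [] (critic: prevent blast-radius inhibition)
- OllamaInstanceDown / KMConverterDown inhibitions gain an equal: ['cluster'] constraint
- In a multi-cluster setup, prevents any single Ollama going down from wrongly inhibiting all AI/SLO alerts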
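
The sketch below models only the `equal` part of an Alertmanager inhibit rule, to show why an empty list is so broad. The alert names are hypothetical and the source/target matcher checks are left out. As in Alertmanager, a missing label and an empty label value are treated as the same thing.

```python
def equal_labels_match(source: dict, target: dict, equal: list) -> bool:
    """True when source and target agree on every label listed in `equal`.

    A label missing from an alert is treated as an empty string.
    """
    return all(source.get(name, "") == target.get(name, "") for name in equal)


ollama_down_a = {"alertname": "OllamaInstanceDown", "cluster": "a"}
slo_alert_a = {"alertname": "SLO_Example_FastBurn", "cluster": "a"}
slo_alert_b = {"alertname": "SLO_Example_FastBurn", "cluster": "b"}

# equal: [] places no constraint, so cluster b is inhibited by cluster a.
broad = equal_labels_match(ollama_down_a, slo_alert_b, [])

# equal: ['cluster'] limits inhibition to the same cluster.
same_cluster = equal_labels_match(ollama_down_a, slo_alert_a, ["cluster"])
cross_cluster = equal_labels_match(ollama_down_a, slo_alert_b, ["cluster"])

# If neither alert carries a cluster label, the rule still applies.
unlabelled = equal_labels_match(
    {"alertname": "OllamaInstanceDown"},
    {"alertname": "SLO_Example_FastBurn"},
    ["cluster"],
)
```

The last case matters for a single-cluster deployment: the constraint only narrows anything once both the source and target alerts actually carry a `cluster` label.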
## M5 Alertmanager version check (confirmed v0.31.1, well past v0.22+)
## M6 governance_agent.py: health score distinguishes skipped vs ok vs violated
- check_slo_compliance adds _meta {violated_count, skipped_count, ok_count, all_skipped, status}
- run_self_check: when every SLO is skipped, emits a separate governance_slo_data_gap alert
  (does not pollute the self_failure count, because no_data means the emitter is not implemented, not that the governance mechanism failed)
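
One way to build such a `_meta` block is sketched below. The field names come from the bullet above; the function name, the input shape, and the `status` values (`violated`, `data_gap`, `ok`) are assumptions.

```python
from collections import Counter


def summarize_slo_results(results: dict) -> dict:
    """Count per-SLO outcomes and flag the case where nothing could be checked."""
    counts = Counter(results.values())
    violated = counts.get("violated", 0)
    skipped = counts.get("skipped", 0)
    ok = counts.get("ok", 0)
    all_skipped = len(results) > 0 and skipped == len(results)

    if violated:
        status = "violated"
    elif all_skipped:
        status = "data_gap"  # assumed name: no SLO had data to evaluate
    else:
        status = "ok"

    return {
        "violated_count": violated,
        "skipped_count": skipped,
        "ok_count": ok,
        "all_skipped": all_skipped,
        "status": status,
    }


gap = summarize_slo_results({"latency": "skipped", "km_growth": "skipped"})
mixed = summarize_slo_results({"a": "ok", "b": "violated", "c": "skipped"})
```

Keeping `all_skipped` separate from `violated_count` is what lets the caller raise a data-gap alert without counting it as a governance failure.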
## M7 scripts/check_config_drift.py: switch to AST parsing
- Regex replaced by ast.parse, locating Settings ClassDef → AnnAssign → Field(default=...)
- Avoids false negatives on multi-line lists, default_factory=, and strings containing line breaks
- All 4 fields (AI_FALLBACK_ORDER / ARGOCD_URL / PROMETHEUS_URL / OLLAMA_URL) are aligned
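
The AST walk described above can be sketched as follows. The sample `Settings` source and its default values are invented for illustration, and `field_defaults` is a hypothetical name. Only literal defaults are handled, because `ast.literal_eval` raises `ValueError` on anything else.

```python
import ast

# Invented sample input; the real Settings class is not shown in this commit.
SOURCE = '''
class Settings(BaseSettings):
    OLLAMA_URL: str = Field(default="http://ollama.example:11434")
    AI_FALLBACK_ORDER: list = Field(
        default=[
            "ollama",
            "cloud",
        ]
    )
    TAGS: list = Field(default_factory=list)
'''


def field_defaults(source: str, class_name: str = "Settings") -> dict:
    """Collect Field(default=...) literals from annotated class attributes."""
    defaults = {}
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.ClassDef) and node.name == class_name):
            continue
        for stmt in node.body:
            if not (isinstance(stmt, ast.AnnAssign)
                    and isinstance(stmt.target, ast.Name)):
                continue
            call = stmt.value
            if not (isinstance(call, ast.Call)
                    and getattr(call.func, "id", None) == "Field"):
                continue
            for keyword in call.keywords:
                # default_factory= is skipped: it has no literal value to compare.
                if keyword.arg == "default":
                    defaults[stmt.target.id] = ast.literal_eval(keyword.value)
    return defaults


defaults = field_defaults(SOURCE)
```

Because the parser works on syntax nodes rather than text, a default spread over several lines is read the same way as a one-line default.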
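## New tests
- test_km_writer_backfill_reconciler.py: 7 cases (C1 reconciler + safe helper)
- test_km_writer_idempotent.py: 5 cases (M3 path_type injection + UPSERT branch)
## Verification
- 1585 unit tests all green (+13 from 1572)
- amtool check-config SUCCESS (8 inhibit_rules / 2 receivers)
- AST-based drift checker: all 4 fields aligned
- Alertmanager v0.31.1 confirmed to support the new syntax
## Expected impact
- KMWriter name and behavior now agree: every KM write path in the flywheel closed loop is awaited instead of fire-and-forget (the stated goal is a 100% reliable write path)
- M4 blast-radius inhibition risk removed
- The governance layer no longer stays silent on SLO no_data
- Drift checker false-negative risk removed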
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
143 lines
5.7 KiB
YAML
# AWOOOI Alertmanager configuration
# 2026-04-05 Claude Code: fixed the webhook URL
# Before: http://192.168.0.188:8088/api/v1/webhook/alertmanager (OpenClaw, legacy system, wrong)
# After: http://192.168.0.121:32334/api/v1/webhooks/alertmanager (AWOOOI API, plural, correct)
# Per the hard rule in feedback_alertmanager_awoooi_flow.md
# 2026-04-09 Claude Sonnet 4.6 Asia/Taipei: added Telegram fallback (ADR-035)
# Architecture: awoooi-webhook (primary path) + telegram-direct (fallback, independent route)
# When the AWOOOI API cannot respond, critical alerts go straight to the Telegram Bot API
# ⚠️ bot_token/chat_id are replaced from secrets at deploy time; this file is a template
#
# 2026-04-29 ogt + Claude Opus 4.7: P1-4 new-syntax upgrade + causal inhibition expansion
# Changes:
# 1. match/match_re → matchers (Alertmanager v0.27+ deprecation warning)
# 2. source_match/target_match/target_match_re → source_matchers/target_matchers
# 3. group_by gains the team label (guards against 4 SLOs firing in the same second, per the web-researcher document)
# 4. PostgreSQLDown / RedisDown inhibit rules gain equal: ['instance'] (prevents namespace-wide blast inhibition)
# 5. Added three causal inhibition groups: OllamaInstanceDown / KMConverterDown / SLO FastBurn
# Root cause: this 4-SLO avalanche confirmed that Ollama 111 going down → AI inference chain breaks → SLO cascade fires with no gatekeeper
# 6. Aligned with the naming rule in feedback_telegram_alert_format.md (labels team=ai/component/auto_repair)

global:
  resolve_timeout: 5m

route:
  receiver: 'awoooi-webhook'
  # 2026-04-29: added team, so SLO/AI PrometheusRules carrying team=ai are grouped and merged separately
  group_by: ['team', 'alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: 'awoooi-webhook'
      group_wait: 10s
      # continue: true lets critical alerts also reach telegram-direct (fallback)
      continue: true
    - matchers:
        - severity="critical"
      receiver: 'telegram-direct'
      group_wait: 10s
    - matchers:
        - severity="warning"
      receiver: 'awoooi-webhook'
    - matchers:
        - alertname=~"Zombie.*|Container.*"
      receiver: 'awoooi-webhook'
      group_wait: 1m

receivers:
  # Primary path: the AWOOOI API handles all alerts (AI analysis + dedup + Telegram)
  # 2026-04-16 ogt + Claude Sonnet 4.6: repointed to VIP 192.168.0.125
  # Root cause: 121:32334 Connection Refused, and 120:32334 refused as well
  # Only VIP 125:32334 is reachable (kube-proxy NodePort routing works)
  # ⚠️ SPF-1 risk: VIP 125 is a single point; if the VIP host goes down, the primary chain breaks
  # Mitigation plan: see the critic SPF governance design (medium option: multiple urls in webhook_configs, round-robin)
  - name: 'awoooi-webhook'
    webhook_configs:
      - url: 'http://192.168.0.125:32334/api/v1/webhooks/alertmanager'
        send_resolved: true

  # Fallback path: when the AWOOOI API is down, critical alerts go straight to Telegram
  # Only critical severity takes this path (avoids double notification for warnings)
  # ⚠️ bot_token / chat_id are injected from a K8s Secret by the CD pipeline at deploy time
  # Hard rule in feedback_telegram_secrets_injection.md: no PLACEHOLDER may reach production
  - name: 'telegram-direct'
    telegram_configs:
      - bot_token: 'TELEGRAM_BOT_TOKEN_PLACEHOLDER'
        chat_id: TELEGRAM_CHAT_ID_PLACEHOLDER
        parse_mode: 'HTML'
        message: |
          🚨 <b>[Alertmanager Fallback]</b>
          {{ range .Alerts }}
          ├ <b>{{ .Labels.alertname }}</b>
          ├ Severity: {{ .Labels.severity }}
          ├ Host: {{ .Labels.host }}{{ .Labels.instance }}
          └ {{ .Annotations.summary }}
          {{ end }}
          <i>⚠️ The AWOOOI API may be offline; this is a direct alert</i>
        send_resolved: false

inhibit_rules:
  # === Basic causal inhibition (existing rules, rewritten in the new syntax) ===
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'instance']

  - source_matchers:
      - alertname="HostDown"
    target_matchers:
      - alertname=~"HostHighCpuLoad|HostOutOfMemory|HostOutOfDiskSpace"
    equal: ['host']

  - source_matchers:
      - alertname="KubeNodeNotReady"
    target_matchers:
      - alertname=~"KubePodCrashLooping|KubePodNotReady|KubeDeploymentReplicasMismatch"
    equal: ['node']

  # 2026-04-29: added equal: ['instance']; it was missing, and PG going down on instance A
  # must not inhibit HighConnections on instance B (blast-inhibition bug)
  - source_matchers:
      - alertname="PostgreSQLDown"
    target_matchers:
      - alertname="PostgreSQLHighConnections"
    equal: ['instance']

  - source_matchers:
      - alertname="RedisDown"
    target_matchers:
      - alertname="RedisMemoryHigh"
    equal: ['instance']

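  # === New: AI-chain causal inhibition (2026-04-29 ADR-035 causal inhibition expansion) ===
  # Root cause: this 4-SLO avalanche confirmed that Ollama 111 going down → AI inference chain breaks → 4 SLOs fire in the same second
  # Without this inhibition, false alarms drown out the real one (Ollama down is the real signal)

  # Any Ollama instance down → inhibit all AI/SLO alerts for 30 minutes
  # 2026-04-29 ogt + Claude Opus 4.7: critic M4 fix: equal: [] was too broad and could inhibit across clusters
  # Added the ['cluster'] constraint (inhibit only within the same cluster)
  # Note: there is currently a single cluster; if the instance label is also added to the SLO rules, this can be tightened further
  - source_matchers:
      - alertname="OllamaInstanceDown"
    target_matchers:
      - alertname=~"SLO_.*|AI_.*"
    equal: ['cluster']

  # KM converter down → inhibit the KM Growth Rate SLO (so the KM write failure itself does not trigger the SLO)
  - source_matchers:
      - alertname="KMConverterDown"
    target_matchers:
      - alertname=~"SLO_KMGrowthRate.*"
    equal: ['cluster']

  # Within one SLO, the more severe alert inhibits the milder ones (FastBurn inhibits Medium/Slow Burn)
  # NOTE: the source alertname ends in _FastBurn and the target alertname ends in _MediumBurn or _SlowBurn,
  # so the two values are never equal and equal: ['alertname'] keeps this rule from ever taking effect.
  # A label shared by all burn-rate alerts of one SLO would be needed here instead.
  - source_matchers:
      - alertname=~"SLO_.+_FastBurn"
    target_matchers:
      - alertname=~"SLO_.+_(Medium|Slow)Burn"
    equal: ['alertname']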