Files
awoooi/ops/alertmanager/alertmanager.yml
Your Name c5753e1c57 fix(critic-review): KMWriter 名實統一 + Alertmanager 修抑制 + drift checker AST 化
critic PR review 揭示已 push commits 的 7 個 blocker,本 commit 全部修復。

## C1 + C2 + M1 + M2 + M3 — KMWriter 真正統一契約(critic 最嚴重 5 條)

### C1 km_writer.py:194 — backfill 自打臉修
- 裸 asyncio.create_task(_backfill_path_a_approval) → await _backfill_path_a_approval_safe()
- 同步 await + 獨立 DLQ km:backfill:dlq + try/except 不阻塞主寫入
- 新增 km_backfill_reconciler_job.py(每 5 分鐘掃 DLQ)+ ENABLE_KM_BACKFILL_RECONCILER flag
- 防 Path B 比 Path A 先完成 → related_approval_id 永遠 NULL 的 race

### C2 km_writer.py:391 — KM_WRITE_AWAIT=false 路徑收緊
- 從 ensure_future(fire-and-forget 比舊版同步寫更糟)
- 改 await writer.write(retry=1, timeout=2.0)(仍 await 但只試一次、超時短)
- docstring 明確標註「緊急回滾用,不保證可靠性」

### M1 decision_manager.py:2178/2203 — 移除 _fire_and_forget 旁路
- 兩處 _fire_and_forget(executor.write_execution_result_to_km(...))
- 改 await asyncio.shield(...) + BaseException 保護(防上層 cancel 中斷)
- KM_WRITE_AWAIT=true 在這條路徑終於真正 await

### M2 incident_service.py:1099 — 自製 path 加 retry+DLQ
- 原本 if settings.KM_WRITE_AWAIT: await asyncio.wait_for else create_task
- 改 3 次指數退避 retry + DLQ 保護(呼叫 km_writer 私有 helper)

### M3 km_writer.py:166 — 冪等聲明對齊實作
- knowledge_repository.create() 加 UPSERT 路徑(pg_insert ON CONFLICT DO UPDATE)
- KnowledgeEntryCreate / KnowledgeEntryRecord 加 path_type 欄位
- migration: ADD COLUMN path_type + partial unique index uix_knowledge_incident_path

## M4 alertmanager.yml — equal: [] 收緊(critic 防爆炸抑制)
- OllamaInstanceDown / KMConverterDown 抑制加 equal: ['cluster'] 約束
- 防多 cluster 場景下任一 Ollama down 誤抑全 AI/SLO 告警

## M5 Alertmanager 版本驗證(已確認 v0.31.1,遠超 v0.22+)

## M6 governance_agent.py — health score 區分 skipped vs ok vs violated
- check_slo_compliance 加 _meta {violated_count, skipped_count, ok_count, all_skipped, status}
- run_self_check: SLO 全 skipped 時獨立發 governance_slo_data_gap 告警
  (不污染 self_failure 計數,因為 no_data 是 emitter 未實作不是治理機制故障)

## M7 scripts/check_config_drift.py — 改 AST 解析
- regex 改 ast.parse 找 Settings ClassDef AnnAssign Field(default=...)
- 避免多行 list / default_factory= / 含跳行字串的 false negative
- 4 欄位(AI_FALLBACK_ORDER / ARGOCD_URL / PROMETHEUS_URL / OLLAMA_URL)全對齊

## 新增測試
- test_km_writer_backfill_reconciler.py: 7 cases(C1 reconciler + safe helper)
- test_km_writer_idempotent.py: 5 cases(M3 path_type 注入 + UPSERT 分支)

## 驗證
- 1585 unit tests 全綠(+13 從 1572)
- amtool check-config SUCCESS(8 inhibit_rules / 2 receivers)
- drift checker AST-based 4 欄位全對齊
- Alertmanager v0.31.1 確認支援新語法

## 期望影響
- KMWriter 名實統一:飛輪閉環 KM 寫入路徑 100% 可靠
- M4 抑制爆炸風險解除
- 治理層不再對 SLO no_data 靜默
- drift checker false negative 風險解除

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 10:44:39 +08:00

143 lines
5.7 KiB
YAML
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# AWOOOI Alertmanager 配置
# 2026-04-05 Claude Code: 修正 webhook URL
# 修正前: http://192.168.0.188:8088/api/v1/webhook/alertmanager (OpenClaw舊系統錯誤)
# 修正後: http://192.168.0.121:32334/api/v1/webhooks/alertmanager (AWOOOI API複數正確)
# 根據 feedback_alertmanager_awoooi_flow.md 鐵律
# 2026-04-09 Claude Sonnet 4.6 Asia/Taipei: 新增 Telegram Fallback (ADR-035)
# 架構: awoooi-webhook (主路徑) + telegram-direct (fallback獨立路由)
# 當 AWOOOI API 無法回應時critical 告警直接送 Telegram Bot API
# ⚠️ bot_token/chat_id 部署時由 secrets 替換,此檔為模板
#
# 2026-04-29 ogt + Claude Opus 4.7: P1-4 新版語法升級 + 因果抑制擴展
# 改動:
# 1. match/match_re → matchers (Alertmanager v0.27+ deprecated 警告)
# 2. source_match/target_match/target_match_re → source_matchers/target_matchers
# 3. group_by 加 team label防 4 條 SLO 同秒爆,依 web-researcher 文件)
# 4. PostgreSQLDown / RedisDown inhibit 補 equal: ['instance'](防全 ns 爆炸抑制)
# 5. 新增 OllamaInstanceDown / KMConverterDown / SLO FastBurn 三組因果抑制
# 根因:本次 4 SLO 雪崩證實 Ollama 111 掛 → AI 推理鏈斷 → SLO 級聯爆炸無守門
# 6. 命名鐵律 feedback_telegram_alert_format.md 對齊label team=ai/component/auto_repair
global:
resolve_timeout: 5m
route:
receiver: 'awoooi-webhook'
# 2026-04-29: 加 team — SLO/AI PrometheusRule 含 team=ai 時可獨立分組合併
group_by: ['team', 'alertname', 'severity']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
- matchers:
- severity="critical"
receiver: 'awoooi-webhook'
group_wait: 10s
# continue:true 讓 critical 同時送 telegram-direct (fallback)
continue: true
- matchers:
- severity="critical"
receiver: 'telegram-direct'
group_wait: 10s
- matchers:
- severity="warning"
receiver: 'awoooi-webhook'
- matchers:
- alertname=~"Zombie.*|Container.*"
receiver: 'awoooi-webhook'
group_wait: 1m
receivers:
# 主路徑: AWOOOI API 處理所有告警 (AI 分析 + 去重 + Telegram)
# 2026-04-16 ogt + Claude Sonnet 4.6: 改指向 VIP 192.168.0.125
# 根因: 121:32334 Connection Refused120:32334 也 Refused
# 只有 VIP 125:32334 可連通kube-proxy NodePort 路由正常)
# ⚠️ SPF-1 風險VIP 125 為單點VIP host 整機 down → 主鏈斷
# 緩解計畫見 critic SPF 治理設計中度方案webhook_configs 多 url round-robin
- name: 'awoooi-webhook'
webhook_configs:
- url: 'http://192.168.0.125:32334/api/v1/webhooks/alertmanager'
send_resolved: true
# Fallback 路徑: AWOOOI API 掛掉時critical 告警直接送 Telegram
# 只有 critical severity 走此路徑(避免 warning 雙重通知)
# ⚠️ bot_token / chat_id 由 CD pipeline 在 deploy 時用 K8s Secret 注入
# feedback_telegram_secrets_injection.md 鐵律:禁止 PLACEHOLDER 上線
- name: 'telegram-direct'
telegram_configs:
- bot_token: 'TELEGRAM_BOT_TOKEN_PLACEHOLDER'
chat_id: TELEGRAM_CHAT_ID_PLACEHOLDER
parse_mode: 'HTML'
message: |
🚨 <b>[Alertmanager Fallback]</b>
{{ range .Alerts }}
├ <b>{{ .Labels.alertname }}</b>
├ 嚴重度: {{ .Labels.severity }}
├ 主機: {{ .Labels.host }}{{ .Labels.instance }}
└ {{ .Annotations.summary }}
{{ end }}
<i>⚠️ AWOOOI API 可能離線,此為直接告警</i>
send_resolved: false
inhibit_rules:
# === 基礎因果抑制(原有規則,新語法重寫)===
- source_matchers:
- severity="critical"
target_matchers:
- severity="warning"
equal: ['alertname', 'instance']
- source_matchers:
- alertname="HostDown"
target_matchers:
- alertname=~"HostHighCpuLoad|HostOutOfMemory|HostOutOfDiskSpace"
equal: ['host']
- source_matchers:
- alertname="KubeNodeNotReady"
target_matchers:
- alertname=~"KubePodCrashLooping|KubePodNotReady|KubeDeploymentReplicasMismatch"
equal: ['node']
# 2026-04-29: 補 equal: ['instance'] — 原本缺PG 在 instance A down
# 不該抑制 instance B 的 HighConnections爆炸抑制 bug
- source_matchers:
- alertname="PostgreSQLDown"
target_matchers:
- alertname="PostgreSQLHighConnections"
equal: ['instance']
- source_matchers:
- alertname="RedisDown"
target_matchers:
- alertname="RedisMemoryHigh"
equal: ['instance']
# === 新增AI 鏈因果抑制2026-04-29 ADR-035 因果抑制擴展)===
# 根因:本次 4 SLO 雪崩證實 Ollama 111 掛 → AI 推理鏈斷 → 4 SLO 同秒爆
# 無此抑制 → 假警報淹沒真警報Ollama down 本身才是真信號)
# Ollama 任一實例掛 → 抑制所有 AI/SLO 告警 30 分鐘
# 2026-04-29 ogt + Claude Opus 4.7: critic M4 修 — equal:[] 過寬,可能誤抑跨 cluster
# 加 ['cluster'] 約束(同 cluster 才抑制)
# 注意:本 cluster 目前單一,若 instance label 同步加在 SLO rule 可進一步收緊
- source_matchers:
- alertname="OllamaInstanceDown"
target_matchers:
- alertname=~"SLO_.*|AI_.*"
equal: ['cluster']
# KM converter 掛 → 抑制 KM Growth Rate SLO避免 KM 寫入失敗本身觸發 SLO
- source_matchers:
- alertname="KMConverterDown"
target_matchers:
- alertname=~"SLO_KMGrowthRate.*"
equal: ['cluster']
# 同 SLO 較嚴重抑制較輕FastBurn 抑制 Medium/Slow Burn
- source_matchers:
- alertname=~"SLO_.+_FastBurn"
target_matchers:
- alertname=~"SLO_.+_(Medium|Slow)Burn"
equal: ['alertname']