fix(alertmanager): make live config deployment safe

OG T
2026-05-06 13:52:57 +08:00
parent c4f40235f4
commit 6e2ab7cedc
4 changed files with 163 additions and 6 deletions

@@ -3629,8 +3629,11 @@ Sentry consumers after reset
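| Main route | General critical alerts, including Docker/Sentry container restarts, go only through `awoooi-webhook` and return to the AWOOOI API main chain: deduplication, AI analysis, Approval, and Audit |
| Live webhook URL | `/home/wooo/monitoring/alertmanager.yml` changed from `192.168.0.121:32334` to match the repo's VIP `192.168.0.125:32334` |
| Config check | `docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml` succeeded; HUP reload completed |
| Drift prevention | Added `scripts/ops/deploy-alertmanager-config.sh`, which injects the Telegram token / `SRE_GROUP_CHAT_ID` from the K8s Secret, validates with amtool first, then backs up and reloads |
| Deploy safety | Fixed the deploy script to overwrite the bind-mounted config in place (same inode) and force `chmod 0644`, so the container does not enter a restart loop because the config is `0600` |
| Live firing state | After the fix, `ALERTS{alertname="DockerContainerRestartSpike",alertstate="firing"}` dropped to 0; Sentry consumers are back to healthy |
### Notes
- `DockerContainerRestartSpike` uses a 15-minute window, so a restart spike that has already occurred clears once the Prometheus window has passed; shortly after the fix, `ALERTS{alertname="DockerContainerRestartSpike"}` may still briefly be firing
- `DockerContainerRestartSpike` uses a 15-minute window, so a restart spike that has already occurred clears once the Prometheus window has passed; if stale messages still appear for a short time, first check whether the live `ALERTS{alertname="DockerContainerRestartSpike"}` has returned to zero
- Alertmanager itself has no "fall back to another receiver after a webhook send fails" semantics, so direct Telegram can only be used as an emergency gate driven by explicit API/AlertChain health alerts.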

@@ -293,7 +293,8 @@ For AI routing releases, also verify:
## 11. Immediate Next Items
1. Make Alertmanager config deployment deterministic: the live `telegram-direct` route is now emergency-only, but the inject/deploy path still needs a checked script so the 110 config cannot drift from `ops/alertmanager/alertmanager.yml`.
1. Wire `scripts/ops/deploy-alertmanager-config.sh` into the reboot/release checklist, then consider whether CD should run it for `ops/alertmanager/**` changes.
2. Continue Wave 1 with MCP Gateway bypass and MCP audit completeness, because production callers can still route around the gateway.
3. Keep GCP-A/GCP-B/111 Ollama routing verification in every alert-path release until EffectivePolicy becomes authoritative.
4. Add a Sentry/Snuba post-reboot health gate: ClickHouse table existence, Snuba migration status, and Kafka consumer offsets must be part of cold-start validation.
5. Add a post-deploy Alertmanager live check for `amtool check-config`, container status, and config-file mode; direct Telegram must remain emergency-only and target the SRE group.
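
The three checks named in item 5 could be sketched roughly as follows. This is a minimal sketch, not the project's actual gate: the function names `config_is_world_readable` and `alertmanager_live_check` are hypothetical, the container name and in-container config path are taken from the commands used elsewhere in this change, and the mode check assumes that "readable by the container user" means the world-read bit is set.

```shell
#!/usr/bin/env bash
set -eu

# Succeeds only if the path is a regular file with the "other" read bit set,
# which is what a non-root container user needs for a bind-mounted config.
config_is_world_readable() {
  [ -n "$(find "$1" -prune -type f -perm -004 2>/dev/null)" ]
}

# Hypothetical wrapper for the three checks. It is only defined here, not
# called, because it needs Docker and the alertmanager container on the host.
alertmanager_live_check() {
  config_path="$1"   # host path of the bind-mounted config (assumption)
  docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml || return 1
  [ "$(docker inspect alertmanager --format '{{.State.Status}}')" = "running" ] || return 1
  config_is_world_readable "$config_path" || return 1
}
```

On the 110 host this might be invoked as `alertmanager_live_check /home/wooo/monitoring/alertmanager.yml`; a non-zero exit would indicate that one of the three checks failed.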

@@ -96,7 +96,7 @@ K3s (depends on PostgreSQL@188)
【Alert path】
Prometheus → Alertmanager(110)
→ AWOOOI API(121:32334/api/v1/webhooks/alertmanager) ← direct, bypasses OpenClaw
→ AWOOOI API(VIP 192.168.0.125:32334/api/v1/webhooks/alertmanager) ← direct, bypasses OpenClaw
→ TelegramGateway → Telegram
【CD path】
@@ -396,17 +396,25 @@ cd /home/wooo/act-runner && docker compose up -d
Webhook URL correct?
grep 'url:' /home/wooo/monitoring/alertmanager.yml
Must be: http://192.168.0.121:32334/api/v1/webhooks/alertmanager
├── NO → fix the URL, then curl http://localhost:9093/-/reload
Must be: http://192.168.0.125:32334/api/v1/webhooks/alertmanager
├── NO → bash scripts/ops/deploy-alertmanager-config.sh
└── YES
Does a curl POST to the webhook from 110 succeed?
curl -X POST http://192.168.0.121:32334/api/v1/webhooks/alertmanager ...
curl -X POST http://192.168.0.125:32334/api/v1/webhooks/alertmanager ...
├── timeout → NetworkPolicy does not allow 110
│ kubectl apply -f k8s/awoooi-prod/02-network-policy.yaml
└── {"success":true} → check the Telegram Bot Token
```
Alertmanager config drift repair:
```bash
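# Run from the project root; injects the Telegram bot token and SRE_GROUP_CHAT_ID from the K8s Secret.
# Validates with amtool first, then backs up the live config on 110, overwrites it in place (same inode), sets 0644 permissions, and reloads via HUP.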
bash scripts/ops/deploy-alertmanager-config.sh
```
**Additional diagnostics: a specific service is abnormal but no alert fires (Alertmanager is healthy)**:
```bash
# Confirm the Prometheus rule count and that the key rules exist

@@ -0,0 +1,145 @@
#!/usr/bin/env bash
# Render and deploy ops/alertmanager/alertmanager.yml to the 110 Docker Alertmanager.
#
# This script keeps the live direct-Telegram emergency route aligned with Git:
# - inject Telegram bot token and SRE group chat id from K8s secret or env
# - validate with amtool before touching the live config
# - back up the live file
# - keep the bind-mounted live file inode and readable permissions intact
# - reload Alertmanager with SIGHUP
#
# Usage:
# bash scripts/ops/deploy-alertmanager-config.sh [--dry-run]
#
# Optional env:
# TARGET_HOST=192.168.0.110
# TARGET_PATH=/home/wooo/monitoring/alertmanager.yml
# K8S_HOST=192.168.0.120
# K8S_NAMESPACE=awoooi-prod
# K8S_SECRET=awoooi-secrets
# TELEGRAM_BOT_TOKEN=...
# SRE_GROUP_CHAT_ID=...
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "${SCRIPT_DIR}/../.." && pwd)"
CONFIG_TEMPLATE="${REPO_ROOT}/ops/alertmanager/alertmanager.yml"
TARGET_HOST="${TARGET_HOST:-192.168.0.110}"
TARGET_USER="${TARGET_USER:-wooo}"
TARGET_PATH="${TARGET_PATH:-/home/wooo/monitoring/alertmanager.yml}"
K8S_HOST="${K8S_HOST:-192.168.0.120}"
K8S_USER="${K8S_USER:-wooo}"
K8S_NAMESPACE="${K8S_NAMESPACE:-awoooi-prod}"
K8S_SECRET="${K8S_SECRET:-awoooi-secrets}"
DRY_RUN="${1:-}"
log() { printf '[%s] %s\n' "$(date '+%H:%M:%S')" "$*"; }
die() {
  echo "ERROR: $*" >&2
  exit 1
}
decode_b64() {
  python3 -c 'import base64,sys; print(base64.b64decode(sys.stdin.read()).decode().strip())'
}
secret_key_b64() {
  local key="$1"
  ssh -o BatchMode=yes -o ConnectTimeout=8 "${K8S_USER}@${K8S_HOST}" \
    "sudo -n kubectl -n '${K8S_NAMESPACE}' get secret '${K8S_SECRET}' -o jsonpath='{.data.${key}}'" 2>/dev/null
}
read_secret_first_available() {
  local env_value="$1"
  shift
  if [[ -n "$env_value" ]]; then
    printf '%s' "$env_value"
    return 0
  fi
  local key raw
  for key in "$@"; do
    raw="$(secret_key_b64 "$key" || true)"
    if [[ -n "$raw" ]]; then
      printf '%s' "$raw" | decode_b64
      return 0
    fi
  done
  return 1
}
[[ -f "$CONFIG_TEMPLATE" ]] || die "template not found: ${CONFIG_TEMPLATE}"
TELEGRAM_BOT_TOKEN="$(
  read_secret_first_available \
    "${TELEGRAM_BOT_TOKEN:-}" \
    OPENCLAW_TG_BOT_TOKEN \
    OPENCLAW_BOT_TOKEN \
    TELEGRAM_BOT_TOKEN \
    TG_BOT_TOKEN
)" || die "missing Telegram bot token; set TELEGRAM_BOT_TOKEN or add one of the known keys to ${K8S_SECRET}"
SRE_GROUP_CHAT_ID="$(
  read_secret_first_available \
    "${SRE_GROUP_CHAT_ID:-}" \
    SRE_GROUP_CHAT_ID \
    TELEGRAM_ALERT_CHAT_ID
)" || die "missing SRE_GROUP_CHAT_ID"
[[ "$SRE_GROUP_CHAT_ID" =~ ^-?[0-9]+$ ]] || die "SRE_GROUP_CHAT_ID must be a Telegram numeric chat id"
export TELEGRAM_BOT_TOKEN SRE_GROUP_CHAT_ID
tmp_rendered="$(mktemp)"
trap 'rm -f "$tmp_rendered"' EXIT
chmod 600 "$tmp_rendered"
python3 - "$CONFIG_TEMPLATE" "$tmp_rendered" <<'PY'
from pathlib import Path
import os
import sys
template = Path(sys.argv[1])
target = Path(sys.argv[2])
text = template.read_text()
text = text.replace("TELEGRAM_BOT_TOKEN_PLACEHOLDER", os.environ["TELEGRAM_BOT_TOKEN"])
text = text.replace("SRE_GROUP_CHAT_ID_PLACEHOLDER", os.environ["SRE_GROUP_CHAT_ID"])
if "TELEGRAM_BOT_TOKEN_PLACEHOLDER" in text or "SRE_GROUP_CHAT_ID_PLACEHOLDER" in text:
    raise SystemExit("unreplaced secret placeholder remains in rendered config")
target.write_text(text)
PY
log "Validating rendered config with live Alertmanager amtool on ${TARGET_HOST}"
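# Remove the rendered copy afterwards so the secrets do not linger in the container's /tmp.
ssh -o BatchMode=yes -o ConnectTimeout=8 "${TARGET_USER}@${TARGET_HOST}" \
  "docker exec -i alertmanager sh -c 'cat >/tmp/alertmanager-rendered.yml && amtool check-config /tmp/alertmanager-rendered.yml; rc=\$?; rm -f /tmp/alertmanager-rendered.yml; exit \$rc'" \
  < "$tmp_rendered"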
if [[ "$DRY_RUN" == "--dry-run" ]]; then
log "DRY RUN: rendered config validated; not deploying"
exit 0
fi
log "Uploading rendered config to ${TARGET_HOST}:${TARGET_PATH}"
ssh -o BatchMode=yes -o ConnectTimeout=8 "${TARGET_USER}@${TARGET_HOST}" \
"umask 077 && cat > /tmp/alertmanager.yml.new" < "$tmp_rendered"
ssh -o BatchMode=yes -o ConnectTimeout=8 "${TARGET_USER}@${TARGET_HOST}" "bash -s" <<REMOTE
set -euo pipefail
target='${TARGET_PATH}'
backup="\${target}.bak.\$(date +%Y%m%d%H%M%S)"
cp "\$target" "\$backup"
# Alertmanager bind-mounts a single file. Keep the existing inode instead of mv'ing
# a replacement over it, then restore readable permissions for the container user.
cat /tmp/alertmanager.yml.new > "\$target"
chmod 0644 "\$target"
rm -f /tmp/alertmanager.yml.new
docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml
docker kill -s HUP alertmanager >/dev/null
sleep 2
docker inspect alertmanager --format 'status={{.State.Status}} started={{.State.StartedAt}}'
echo "backup=\$backup"
REMOTE
log "Alertmanager config deployed and reloaded"
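
The remote step writes the new content with `cat ... > "$target"` rather than moving a replacement file over it. Docker binds a single-file mount to the file's inode, so a rename would leave the container looking at the old file. The difference can be illustrated locally with the minimal sketch below; the temp directory, file names, and helper names (`inode_of`, `overwrite_in_place`, `replace_by_rename`) are hypothetical and exist only for this demonstration.

```shell
#!/usr/bin/env bash
set -eu

# Print the inode number of a file (first field of `ls -i`).
inode_of() {
  ls -i "$1" | awk '{print $1}'
}

# Overwrite in place: the redirection truncates and rewrites the existing
# file, so the inode (and any bind mount attached to it) stays the same.
overwrite_in_place() {
  cat "$1" > "$2"
}

# Replace by rename: the target name now points at the source file's inode.
replace_by_rename() {
  mv "$1" "$2"
}

demo_dir="$(mktemp -d)"
printf 'old\n' > "$demo_dir/live.yml"
before="$(inode_of "$demo_dir/live.yml")"

printf 'new\n' > "$demo_dir/rendered.yml"
overwrite_in_place "$demo_dir/rendered.yml" "$demo_dir/live.yml"
after_cat="$(inode_of "$demo_dir/live.yml")"

printf 'newer\n' > "$demo_dir/rendered2.yml"
replace_by_rename "$demo_dir/rendered2.yml" "$demo_dir/live.yml"
after_mv="$(inode_of "$demo_dir/live.yml")"

echo "before=$before after_cat=$after_cat after_mv=$after_mv"
rm -rf "$demo_dir"
```

The inode printed for `after_cat` should match `before`, while `after_mv` should differ, which is the situation the deploy script avoids for the bind-mounted config.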