fix(alertmanager): make live config deployment safe
All checks were successful
Code Review / ai-code-review (push) Successful in 10s
@@ -3629,8 +3629,11 @@ Sentry consumers after reset
| Main route | General critical alerts (including Docker/Sentry container restarts) go only through `awoooi-webhook`, returning to the AWOOOI API main chain for deduplication, AI analysis, Approval, and Audit |
| Live webhook URL | `/home/wooo/monitoring/alertmanager.yml` changed from `192.168.0.121:32334` to match the repo's VIP `192.168.0.125:32334` |
| Config check | `docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml` succeeded; HUP reload completed |
| Drift prevention | Added `scripts/ops/deploy-alertmanager-config.sh`, which injects the Telegram token / `SRE_GROUP_CHAT_ID` from the K8s Secret and validates with amtool before backing up and reloading |
| Deploy safety | Fixed the deploy script to overwrite the bind-mounted config in place (same inode) and force `chmod 0644`, so the container does not enter a restart loop because the config is `0600` |
| Live firing state | After the fix, `ALERTS{alertname="DockerContainerRestartSpike",alertstate="firing"}` dropped to 0; Sentry consumers are back to healthy |

### Notes

- `DockerContainerRestartSpike` uses a 15-minute window, so a restart spike that has already occurred clears only after the Prometheus window has passed; shortly after the fix, `ALERTS{alertname="DockerContainerRestartSpike"}` may still be firing temporarily.
- `DockerContainerRestartSpike` uses a 15-minute window, so a restart spike that has already occurred clears only after the Prometheus window has passed; if old messages still appear shortly afterwards, first check whether the live `ALERTS{alertname="DockerContainerRestartSpike"}` has returned to zero.
- Alertmanager itself has no "fall back to another receiver after a webhook send fails" semantics, so direct Telegram can only be gated as an emergency path by explicit API/AlertChain health alerts.
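The "Deploy safety" row relies on the difference between overwriting a file in place and renaming a new file over it. The following is a minimal sketch on a scratch directory, not part of the commit; the `inode_of` helper and the file names are illustrative assumptions. It shows that `cat new > target` keeps the inode, while `mv new target` replaces it, which is why a single-file bind mount would keep serving the old content after a rename.

```bash
# Illustrative only: runs in a temporary directory, never touches a live config.
inode_of() {
  # Print the inode number of a file (first field of `ls -i`).
  ls -i "$1" | awk '{print $1}'
}

workdir="$(mktemp -d)"
printf 'old config\n' > "$workdir/alertmanager.yml"
before="$(inode_of "$workdir/alertmanager.yml")"

# In-place overwrite: the path keeps its inode, then gets readable permissions.
printf 'new config\n' > "$workdir/new.yml"
cat "$workdir/new.yml" > "$workdir/alertmanager.yml"
chmod 0644 "$workdir/alertmanager.yml"
after_cat="$(inode_of "$workdir/alertmanager.yml")"

# Rename-over: the path now points at a different inode.
printf 'newer config\n' > "$workdir/newer.yml"
mv "$workdir/newer.yml" "$workdir/alertmanager.yml"
after_mv="$(inode_of "$workdir/alertmanager.yml")"

echo "before=$before after_cat=$after_cat after_mv=$after_mv"
rm -rf "$workdir"
```

The first two numbers printed should match and the third should differ.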
@@ -293,7 +293,8 @@ For AI routing releases, also verify:

## 11. Immediate Next Items

1. Make Alertmanager config deployment deterministic: the live `telegram-direct` route is now emergency-only, but the inject/deploy path still needs a checked script so the 110 config cannot drift from `ops/alertmanager/alertmanager.yml`.
1. Wire `scripts/ops/deploy-alertmanager-config.sh` into the reboot/release checklist, then consider whether CD should run it for `ops/alertmanager/**` changes.
2. Continue Wave 1 with MCP Gateway bypass and MCP audit completeness, because production callers can still route around the gateway.
3. Keep GCP-A/GCP-B/111 Ollama routing verification in every alert-path release until EffectivePolicy becomes authoritative.
4. Add a Sentry/Snuba post-reboot health gate: ClickHouse table existence, Snuba migration status, and Kafka consumer offsets must be part of cold-start validation.
5. Add a post-deploy Alertmanager live check for `amtool check-config`, container status, and config-file mode; direct Telegram must remain emergency-only and target the SRE group.
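Item 5 could take roughly the following shape. This is a hedged sketch rather than the project's actual check: the function names `config_world_readable` and `alertmanager_live_check` are hypothetical, and `stat -c` assumes GNU coreutils on the host. The `docker` commands are the same ones the deploy script already runs.

```bash
# Hypothetical helper: succeed only if "other" has read permission on the file.
config_world_readable() {
  mode="$(stat -c '%a' "$1")" || return 1   # GNU stat, prints e.g. 644 or 600
  other=$((mode % 10))                      # last octal digit = "other" bits
  [ $((other & 4)) -ne 0 ]
}

# Hypothetical wrapper, meant to run on the Alertmanager host after a deploy.
alertmanager_live_check() {
  config_path="${1:-/home/wooo/monitoring/alertmanager.yml}"
  docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml || return 1
  status="$(docker inspect alertmanager --format '{{.State.Status}}')" || return 1
  [ "$status" = "running" ] || { echo "container status: $status" >&2; return 1; }
  config_world_readable "$config_path" || { echo "config not readable by container user" >&2; return 1; }
}
```

Checking the "other" read bit is deliberately narrow: it catches the `0600` case described in this commit without insisting on one exact mode.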
@@ -96,7 +96,7 @@ K3s (depends on PostgreSQL@188)

【Alert chain】
Prometheus → Alertmanager(110)
→ AWOOOI API(121:32334/api/v1/webhooks/alertmanager) ← direct, does not go through OpenClaw
→ AWOOOI API(VIP 192.168.0.125:32334/api/v1/webhooks/alertmanager) ← direct, does not go through OpenClaw
→ TelegramGateway → Telegram

【CD chain】
@@ -396,17 +396,25 @@ cd /home/wooo/act-runner && docker compose up -d
↓
Is the webhook URL correct?
grep 'url:' /home/wooo/monitoring/alertmanager.yml
Must be: http://192.168.0.121:32334/api/v1/webhooks/alertmanager
├── NO → fix the URL and curl http://localhost:9093/-/reload
Must be: http://192.168.0.125:32334/api/v1/webhooks/alertmanager
├── NO → bash scripts/ops/deploy-alertmanager-config.sh
└── YES
↓
Does a curl POST to the webhook from 110 succeed?
curl -X POST http://192.168.0.121:32334/api/v1/webhooks/alertmanager ...
curl -X POST http://192.168.0.125:32334/api/v1/webhooks/alertmanager ...
├── timeout → NetworkPolicy does not allow 110
│ kubectl apply -f k8s/awoooi-prod/02-network-policy.yaml
└── {"success":true} → check the Telegram Bot Token
```
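The "Is the webhook URL correct?" step in the tree can be scripted. The sketch below is an assumption-laden illustration: `webhook_url_ok` and `EXPECTED_URL` are hypothetical names, and the check only confirms that at least one `url:` line contains the expected VIP address.

```bash
# Expected webhook target, taken from the decision tree above.
EXPECTED_URL='http://192.168.0.125:32334/api/v1/webhooks/alertmanager'

# Hypothetical helper: succeed if some `url:` line in the file contains the expected URL.
webhook_url_ok() {
  grep 'url:' "$1" | grep -Fq "$EXPECTED_URL"   # -F: match the URL as a fixed string
}

# Example usage on the Alertmanager host:
#   webhook_url_ok /home/wooo/monitoring/alertmanager.yml \
#     || bash scripts/ops/deploy-alertmanager-config.sh
```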

Alertmanager config drift fix:

```bash
# Run from the project root; injects the Telegram bot token and SRE_GROUP_CHAT_ID from the K8s Secret,
# validates with amtool first, then backs up the live config on 110, overwrites it in place (same inode), sets 0644 permissions, and reloads via HUP.
bash scripts/ops/deploy-alertmanager-config.sh
```

**Additional diagnosis (a specific service is unhealthy but no alert fires; Alertmanager itself is healthy)**:
```bash
# Confirm the number of Prometheus rules and that the key rules exist
145
scripts/ops/deploy-alertmanager-config.sh
Executable file
@@ -0,0 +1,145 @@
#!/usr/bin/env bash
# Render and deploy ops/alertmanager/alertmanager.yml to the 110 Docker Alertmanager.
#
# This script keeps the live direct-Telegram emergency route aligned with Git:
#   - inject Telegram bot token and SRE group chat id from K8s secret or env
#   - validate with amtool before touching the live config
#   - back up the live file
#   - keep the bind-mounted live file inode and readable permissions intact
#   - reload Alertmanager with SIGHUP
#
# Usage:
#   bash scripts/ops/deploy-alertmanager-config.sh [--dry-run]
#
# Optional env:
#   TARGET_HOST=192.168.0.110
#   TARGET_PATH=/home/wooo/monitoring/alertmanager.yml
#   K8S_HOST=192.168.0.120
#   K8S_NAMESPACE=awoooi-prod
#   K8S_SECRET=awoooi-secrets
#   TELEGRAM_BOT_TOKEN=...
#   SRE_GROUP_CHAT_ID=...

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "${SCRIPT_DIR}/../.." && pwd)"
CONFIG_TEMPLATE="${REPO_ROOT}/ops/alertmanager/alertmanager.yml"

TARGET_HOST="${TARGET_HOST:-192.168.0.110}"
TARGET_USER="${TARGET_USER:-wooo}"
TARGET_PATH="${TARGET_PATH:-/home/wooo/monitoring/alertmanager.yml}"
K8S_HOST="${K8S_HOST:-192.168.0.120}"
K8S_USER="${K8S_USER:-wooo}"
K8S_NAMESPACE="${K8S_NAMESPACE:-awoooi-prod}"
K8S_SECRET="${K8S_SECRET:-awoooi-secrets}"
DRY_RUN="${1:-}"

log() { printf '[%s] %s\n' "$(date '+%H:%M:%S')" "$*"; }

die() {
  echo "ERROR: $*" >&2
  exit 1
}

decode_b64() {
  python3 -c 'import base64,sys; print(base64.b64decode(sys.stdin.read()).decode().strip())'
}

secret_key_b64() {
  local key="$1"
  ssh -o BatchMode=yes -o ConnectTimeout=8 "${K8S_USER}@${K8S_HOST}" \
    "sudo -n kubectl -n '${K8S_NAMESPACE}' get secret '${K8S_SECRET}' -o jsonpath='{.data.${key}}'" 2>/dev/null
}

read_secret_first_available() {
  local env_value="$1"
  shift
  if [[ -n "$env_value" ]]; then
    printf '%s' "$env_value"
    return 0
  fi

  local key raw
  for key in "$@"; do
    raw="$(secret_key_b64 "$key" || true)"
    if [[ -n "$raw" ]]; then
      printf '%s' "$raw" | decode_b64
      return 0
    fi
  done
  return 1
}

[[ -f "$CONFIG_TEMPLATE" ]] || die "template not found: ${CONFIG_TEMPLATE}"

TELEGRAM_BOT_TOKEN="$(
  read_secret_first_available \
    "${TELEGRAM_BOT_TOKEN:-}" \
    OPENCLAW_TG_BOT_TOKEN \
    OPENCLAW_BOT_TOKEN \
    TELEGRAM_BOT_TOKEN \
    TG_BOT_TOKEN
)" || die "missing Telegram bot token; set TELEGRAM_BOT_TOKEN or add one of the known keys to ${K8S_SECRET}"

SRE_GROUP_CHAT_ID="$(
  read_secret_first_available \
    "${SRE_GROUP_CHAT_ID:-}" \
    SRE_GROUP_CHAT_ID \
    TELEGRAM_ALERT_CHAT_ID
)" || die "missing SRE_GROUP_CHAT_ID"

[[ "$SRE_GROUP_CHAT_ID" =~ ^-?[0-9]+$ ]] || die "SRE_GROUP_CHAT_ID must be a Telegram numeric chat id"
export TELEGRAM_BOT_TOKEN SRE_GROUP_CHAT_ID

tmp_rendered="$(mktemp)"
trap 'rm -f "$tmp_rendered"' EXIT
chmod 600 "$tmp_rendered"

python3 - "$CONFIG_TEMPLATE" "$tmp_rendered" <<'PY'
from pathlib import Path
import os
import sys

template = Path(sys.argv[1])
target = Path(sys.argv[2])
text = template.read_text()
text = text.replace("TELEGRAM_BOT_TOKEN_PLACEHOLDER", os.environ["TELEGRAM_BOT_TOKEN"])
text = text.replace("SRE_GROUP_CHAT_ID_PLACEHOLDER", os.environ["SRE_GROUP_CHAT_ID"])
if "TELEGRAM_BOT_TOKEN_PLACEHOLDER" in text or "SRE_GROUP_CHAT_ID_PLACEHOLDER" in text:
    raise SystemExit("unreplaced secret placeholder remains in rendered config")
target.write_text(text)
PY

log "Validating rendered config with live Alertmanager amtool on ${TARGET_HOST}"
ssh -o BatchMode=yes -o ConnectTimeout=8 "${TARGET_USER}@${TARGET_HOST}" \
  "docker exec -i alertmanager sh -c 'cat >/tmp/alertmanager-rendered.yml && amtool check-config /tmp/alertmanager-rendered.yml'" \
  < "$tmp_rendered"

if [[ "$DRY_RUN" == "--dry-run" ]]; then
  log "DRY RUN: rendered config validated; not deploying"
  exit 0
fi

log "Uploading rendered config to ${TARGET_HOST}:${TARGET_PATH}"
ssh -o BatchMode=yes -o ConnectTimeout=8 "${TARGET_USER}@${TARGET_HOST}" \
  "umask 077 && cat > /tmp/alertmanager.yml.new" < "$tmp_rendered"

ssh -o BatchMode=yes -o ConnectTimeout=8 "${TARGET_USER}@${TARGET_HOST}" "bash -s" <<REMOTE
set -euo pipefail
target='${TARGET_PATH}'
backup="\${target}.bak.\$(date +%Y%m%d%H%M%S)"
cp "\$target" "\$backup"
# Alertmanager bind-mounts a single file. Keep the existing inode instead of mv'ing
# a replacement over it, then restore readable permissions for the container user.
cat /tmp/alertmanager.yml.new > "\$target"
chmod 0644 "\$target"
rm -f /tmp/alertmanager.yml.new
docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml
docker kill -s HUP alertmanager >/dev/null
sleep 2
docker inspect alertmanager --format 'status={{.State.Status}} started={{.State.StartedAt}}'
echo "backup=\$backup"
REMOTE

log "Alertmanager config deployed and reloaded"
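The script rejects any `SRE_GROUP_CHAT_ID` that is not an optionally negative integer (Telegram group chat ids are negative numbers). As a small standalone illustration of that guard, here is a sketch using `grep -E` instead of the bash `[[ =~ ]]` operator; the function name `is_numeric_chat_id` and the sample values are hypothetical and not taken from the repository.

```bash
# Hypothetical helper mirroring the script's ^-?[0-9]+$ check.
is_numeric_chat_id() {
  printf '%s\n' "$1" | grep -Eq '^-?[0-9]+$'
}

# Made-up sample values: a supergroup-style id, a plain id, a username, an empty string.
for candidate in "-1001234567890" "123456" "@sre_group" ""; do
  if is_numeric_chat_id "$candidate"; then
    echo "accept: '$candidate'"
  else
    echo "reject: '$candidate'"
  fi
done
```

Validating before rendering means a mistyped secret fails the deploy early instead of producing a config that passes `amtool` but sends the emergency route to the wrong chat.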