diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 74d929e2..68d047c4 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -3629,8 +3629,11 @@ Sentry consumers after reset
 | Main route | Ordinary critical alerts (including Docker/Sentry container restarts) go only through `awoooi-webhook`, returning to the AWOOOI API main chain of deduplication, AI analysis, Approval, and Audit |
 | Live webhook URL | `/home/wooo/monitoring/alertmanager.yml` changed from `192.168.0.121:32334` to match the repo's VIP `192.168.0.125:32334` |
 | Config check | `docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml` succeeded; HUP reload completed |
+| Drift prevention | Added `scripts/ops/deploy-alertmanager-config.sh`, which injects the Telegram token / `SRE_GROUP_CHAT_ID` from the K8s Secret, validates with amtool first, then backs up and reloads |
+| Deploy safety | Fixed the deploy script to overwrite the bind-mounted config through its original inode and force `chmod 0644`, so the container does not enter a restart loop because the config is `0600` |
+| Live firing state | After the fix, `ALERTS{alertname="DockerContainerRestartSpike",alertstate="firing"}` dropped to 0; Sentry consumers are back to healthy |
 
 ### Notes
 
-- `DockerContainerRestartSpike` uses a 15-minute window, so a restart spike that has already occurred clears once the Prometheus window has passed; shortly after the fix, `ALERTS{alertname="DockerContainerRestartSpike"}` may still be firing temporarily.
+- `DockerContainerRestartSpike` uses a 15-minute window, so a restart spike that has already occurred clears once the Prometheus window has passed; if old messages still show up shortly afterwards, first check whether live `ALERTS{alertname="DockerContainerRestartSpike"}` has already returned to zero.
 - Alertmanager itself does not support "fall back to another receiver after a webhook send fails" semantics; direct Telegram can therefore only serve as an emergency gate driven by explicit API/AlertChain health alerts.
diff --git a/docs/awooop/AWOOOI-AWOOOP-AI-AUTONOMOUS-FLYWHEEL-INTEGRATION-PLAN.md b/docs/awooop/AWOOOI-AWOOOP-AI-AUTONOMOUS-FLYWHEEL-INTEGRATION-PLAN.md
index 9760fcc0..c52f223b 100644
--- a/docs/awooop/AWOOOI-AWOOOP-AI-AUTONOMOUS-FLYWHEEL-INTEGRATION-PLAN.md
+++ b/docs/awooop/AWOOOI-AWOOOP-AI-AUTONOMOUS-FLYWHEEL-INTEGRATION-PLAN.md
@@ -293,7 +293,8 @@ For AI routing releases, also verify:
 
 ## 11. Immediate Next Items
 
-1. Make Alertmanager config deployment deterministic: the live `telegram-direct` route is now emergency-only, but the inject/deploy path still needs a checked script so the 110 config cannot drift from `ops/alertmanager/alertmanager.yml`.
+1. Wire `scripts/ops/deploy-alertmanager-config.sh` into the reboot/release checklist, then consider whether CD should run it for `ops/alertmanager/**` changes.
 2. Continue Wave 1 with MCP Gateway bypass and MCP audit completeness, because production callers can still route around the gateway.
 3. Keep GCP-A/GCP-B/111 Ollama routing verification in every alert-path release until EffectivePolicy becomes authoritative.
 4. Add a Sentry/Snuba post-reboot health gate: ClickHouse table existence, Snuba migration status, and Kafka consumer offsets must be part of cold-start validation.
+5. Add a post-deploy Alertmanager live check for `amtool check-config`, container status, and config-file mode; direct Telegram must remain emergency-only and target the SRE group.
diff --git a/docs/runbooks/REBOOT-RECOVERY-SOP.md b/docs/runbooks/REBOOT-RECOVERY-SOP.md
index 12e3174f..ad677526 100644
--- a/docs/runbooks/REBOOT-RECOVERY-SOP.md
+++ b/docs/runbooks/REBOOT-RECOVERY-SOP.md
@@ -96,7 +96,7 @@ K3s (depends on PostgreSQL@188)
 
 【Alert path】
 Prometheus → Alertmanager(110)
-  → AWOOOI API(121:32334/api/v1/webhooks/alertmanager) ← direct, bypasses OpenClaw
+  → AWOOOI API(VIP 192.168.0.125:32334/api/v1/webhooks/alertmanager) ← direct, bypasses OpenClaw
   → TelegramGateway → Telegram
 
 【CD path】
@@ -396,17 +396,25 @@ cd /home/wooo/act-runner && docker compose up -d
   ↓
 Webhook URL correct?
   grep 'url:' /home/wooo/monitoring/alertmanager.yml
-  Must be: http://192.168.0.121:32334/api/v1/webhooks/alertmanager
-  ├── NO → fix the URL and curl http://localhost:9093/-/reload
+  Must be: http://192.168.0.125:32334/api/v1/webhooks/alertmanager
+  ├── NO → bash scripts/ops/deploy-alertmanager-config.sh
   └── YES
   ↓
 Does a curl POST to the webhook from 110 succeed?
-  curl -X POST http://192.168.0.121:32334/api/v1/webhooks/alertmanager ...
+  curl -X POST http://192.168.0.125:32334/api/v1/webhooks/alertmanager ...
   ├── timeout → NetworkPolicy does not allow 110
   │   kubectl apply -f k8s/awoooi-prod/02-network-policy.yaml
   └── {"success":true} → check the Telegram Bot Token
 ```
 
+Alertmanager config drift repair:
+
+```bash
+# Run from the project root; injects the Telegram bot token and SRE_GROUP_CHAT_ID from the K8s Secret,
+# validates with amtool first, then backs up the 110 live config, overwrites it through the original inode, fixes permissions to 0644, and HUP reloads.
+bash scripts/ops/deploy-alertmanager-config.sh
+```
+
 **Additional diagnosis - a specific service is failing but no alert fired (Alertmanager healthy)**:
 ```bash
 # Confirm the Prometheus rule count and that the key rules exist
diff --git a/scripts/ops/deploy-alertmanager-config.sh b/scripts/ops/deploy-alertmanager-config.sh
new file mode 100755
index 00000000..01bd73cd
--- /dev/null
+++ b/scripts/ops/deploy-alertmanager-config.sh
@@ -0,0 +1,145 @@
+#!/usr/bin/env bash
+# Render and deploy ops/alertmanager/alertmanager.yml to the 110 Docker Alertmanager.
+#
+# This script keeps the live direct-Telegram emergency route aligned with Git:
+#   - inject Telegram bot token and SRE group chat id from K8s secret or env
+#   - validate with amtool before touching the live config
+#   - back up the live file
+#   - keep the bind-mounted live file inode and readable permissions intact
+#   - reload Alertmanager with SIGHUP
+#
+# Usage:
+#   bash scripts/ops/deploy-alertmanager-config.sh [--dry-run]
+#
+# Optional env:
+#   TARGET_HOST=192.168.0.110
+#   TARGET_PATH=/home/wooo/monitoring/alertmanager.yml
+#   K8S_HOST=192.168.0.120
+#   K8S_NAMESPACE=awoooi-prod
+#   K8S_SECRET=awoooi-secrets
+#   TELEGRAM_BOT_TOKEN=...
+#   SRE_GROUP_CHAT_ID=...
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="$(cd "${SCRIPT_DIR}/../.." && pwd)"
+CONFIG_TEMPLATE="${REPO_ROOT}/ops/alertmanager/alertmanager.yml"
+
+TARGET_HOST="${TARGET_HOST:-192.168.0.110}"
+TARGET_USER="${TARGET_USER:-wooo}"
+TARGET_PATH="${TARGET_PATH:-/home/wooo/monitoring/alertmanager.yml}"
+K8S_HOST="${K8S_HOST:-192.168.0.120}"
+K8S_USER="${K8S_USER:-wooo}"
+K8S_NAMESPACE="${K8S_NAMESPACE:-awoooi-prod}"
+K8S_SECRET="${K8S_SECRET:-awoooi-secrets}"
+DRY_RUN="${1:-}"
+
+log() { printf '[%s] %s\n' "$(date '+%H:%M:%S')" "$*"; }
+
+die() {
+  echo "ERROR: $*" >&2
+  exit 1
+}
+
+decode_b64() {
+  python3 -c 'import base64,sys; print(base64.b64decode(sys.stdin.read()).decode().strip())'
+}
+
+secret_key_b64() {
+  local key="$1"
+  ssh -o BatchMode=yes -o ConnectTimeout=8 "${K8S_USER}@${K8S_HOST}" \
+    "sudo -n kubectl -n '${K8S_NAMESPACE}' get secret '${K8S_SECRET}' -o jsonpath='{.data.${key}}'" 2>/dev/null
+}
+
+read_secret_first_available() {
+  local env_value="$1"
+  shift
+  if [[ -n "$env_value" ]]; then
+    printf '%s' "$env_value"
+    return 0
+  fi
+
+  local key raw
+  for key in "$@"; do
+    raw="$(secret_key_b64 "$key" || true)"
+    if [[ -n "$raw" ]]; then
+      printf '%s' "$raw" | decode_b64
+      return 0
+    fi
+  done
+  return 1
+}
+
+[[ -f "$CONFIG_TEMPLATE" ]] || die "template not found: ${CONFIG_TEMPLATE}"
+
+TELEGRAM_BOT_TOKEN="$(
+  read_secret_first_available \
+    "${TELEGRAM_BOT_TOKEN:-}" \
+    OPENCLAW_TG_BOT_TOKEN \
+    OPENCLAW_BOT_TOKEN \
+    TELEGRAM_BOT_TOKEN \
+    TG_BOT_TOKEN
+)" || die "missing Telegram bot token; set TELEGRAM_BOT_TOKEN or add one of the known keys to ${K8S_SECRET}"
+
+SRE_GROUP_CHAT_ID="$(
+  read_secret_first_available \
+    "${SRE_GROUP_CHAT_ID:-}" \
+    SRE_GROUP_CHAT_ID \
+    TELEGRAM_ALERT_CHAT_ID
+)" || die "missing SRE_GROUP_CHAT_ID"
+
+[[ "$SRE_GROUP_CHAT_ID" =~ ^-?[0-9]+$ ]] || die "SRE_GROUP_CHAT_ID must be a Telegram numeric chat id"
+export TELEGRAM_BOT_TOKEN SRE_GROUP_CHAT_ID
+
+tmp_rendered="$(mktemp)"
+trap 'rm -f "$tmp_rendered"' EXIT
+chmod 600 "$tmp_rendered"
+
+python3 - "$CONFIG_TEMPLATE" "$tmp_rendered" <<'PY'
+from pathlib import Path
+import os
+import sys
+
+template = Path(sys.argv[1])
+target = Path(sys.argv[2])
+text = template.read_text()
+text = text.replace("TELEGRAM_BOT_TOKEN_PLACEHOLDER", os.environ["TELEGRAM_BOT_TOKEN"])
+text = text.replace("SRE_GROUP_CHAT_ID_PLACEHOLDER", os.environ["SRE_GROUP_CHAT_ID"])
+if "TELEGRAM_BOT_TOKEN_PLACEHOLDER" in text or "SRE_GROUP_CHAT_ID_PLACEHOLDER" in text:
+    raise SystemExit("unreplaced secret placeholder remains in rendered config")
+target.write_text(text)
+PY
+
+log "Validating rendered config with live Alertmanager amtool on ${TARGET_HOST}"
+ssh -o BatchMode=yes -o ConnectTimeout=8 "${TARGET_USER}@${TARGET_HOST}" \
+  "docker exec -i alertmanager sh -c 'cat >/tmp/alertmanager-rendered.yml && amtool check-config /tmp/alertmanager-rendered.yml'" \
+  < "$tmp_rendered"
+
+if [[ "$DRY_RUN" == "--dry-run" ]]; then
+  log "DRY RUN: rendered config validated; not deploying"
+  exit 0
+fi
+
+log "Uploading rendered config to ${TARGET_HOST}:${TARGET_PATH}"
+ssh -o BatchMode=yes -o ConnectTimeout=8 "${TARGET_USER}@${TARGET_HOST}" \
+  "umask 077 && cat > /tmp/alertmanager.yml.new" < "$tmp_rendered"
+
+ssh -o BatchMode=yes -o ConnectTimeout=8 "${TARGET_USER}@${TARGET_HOST}" "bash -s" < "\$target"
+chmod 0644 "\$target"
+rm -f /tmp/alertmanager.yml.new
+docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml
+docker kill -s HUP alertmanager >/dev/null
+sleep 2
+docker inspect alertmanager --format 'status={{.State.Status}} started={{.State.StartedAt}}'
+echo "backup=\$backup"
+REMOTE
+
+log "Alertmanager config deployed and reloaded"
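
The "keep the bind-mounted live file inode" step in the script header can be illustrated locally. The sketch below is not part of `scripts/ops/deploy-alertmanager-config.sh`; the helper names `inode_of`, `overwrite_in_place`, and `replace_by_rename` and the temporary file names are hypothetical. It assumes the usual behavior of a single-file Docker bind mount, where the container keeps referring to the file that was mounted, so a rename-style replacement on the host may not be visible inside the container until it is recreated.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; helper names are hypothetical and are not taken
# from scripts/ops/deploy-alertmanager-config.sh.
set -euo pipefail

# Print the inode number of a file using POSIX `ls -i`.
inode_of() {
  ls -i "$1" | awk '{print $1}'
}

# Overwrite the target's contents in place: the redirection truncates and
# rewrites the existing file, so its inode stays the same. The mode is then
# forced to 0644 so a non-root process can still read the file.
overwrite_in_place() {
  local new_file="$1" target="$2"
  cat "$new_file" > "$target"
  chmod 0644 "$target"
}

# Replace the target by rename: the path now points at the new file's inode.
replace_by_rename() {
  local new_file="$1" target="$2"
  mv -f "$new_file" "$target"
}

workdir="$(mktemp -d)"
printf 'old\n' > "${workdir}/alertmanager.yml"
before="$(inode_of "${workdir}/alertmanager.yml")"

printf 'new\n' > "${workdir}/rendered.yml"
overwrite_in_place "${workdir}/rendered.yml" "${workdir}/alertmanager.yml"
after_overwrite="$(inode_of "${workdir}/alertmanager.yml")"

printf 'newer\n' > "${workdir}/rendered.yml"
replace_by_rename "${workdir}/rendered.yml" "${workdir}/alertmanager.yml"
after_rename="$(inode_of "${workdir}/alertmanager.yml")"

[[ "$before" == "$after_overwrite" ]] && echo "overwrite in place: inode kept"
[[ "$before" != "$after_rename" ]] && echo "rename: inode changed"
rm -rf "$workdir"
```

Under that assumption, `cat new > target` followed by `chmod 0644` is the safer choice for a bind-mounted config than `mv` or `cp` to a fresh file, at the cost of the write not being atomic; validating the rendered file with `amtool check-config` beforehand, as the script does, limits the risk of a half-good config being reloaded.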