fix(awooop): mirror ops notifications through api

Your Name
2026-05-12 14:43:09 +08:00
parent b437a33043
commit 1a74286dfa
6 changed files with 287 additions and 21 deletions

.gitea/workflows/cd.yaml

@@ -823,7 +823,8 @@ jobs:
       # 2026-04-09 Claude Sonnet 4.6: Sprint 5.2, sync ops scripts to 188 (ollama user)
       # DEPLOY_SSH_KEY_188 = gitea-cd-deploy-188 (ed25519; present only in 188's authorized_keys)
-      # Scripts: docker-health-monitor.sh + pg-backup.sh (sensing layer + backup)
+      # Scripts: docker-health-monitor.sh + pg-backup.sh + notify-awoooi-ops.sh
+      # Sensing-layer and backup notifications go through the AWOOI API / AwoooP first; direct Telegram sending is kept only as a fallback when the API is offline.
       - name: Sync Ops Scripts to 188
         continue-on-error: true
         env:
@@ -870,9 +871,16 @@ jobs:
             && echo "✅ pg-backup.sh synced" \
             || echo "⚠️ pg-backup.sh sync failed"
+          # Sync the ops notification helper
+          timeout -k 5s 60s scp "${SCP_188_OPTS[@]}" \
+            scripts/ops/notify-awoooi-ops.sh \
+            ollama@192.168.0.188:~/awoooi-ops/notify-awoooi-ops.sh \
+            && echo "✅ notify-awoooi-ops.sh synced" \
+            || echo "⚠️ notify-awoooi-ops.sh sync failed"
           # Ensure execute permissions
           timeout -k 5s 30s ssh "${SSH_188_OPTS[@]}" ollama@192.168.0.188 \
-            "chmod +x ~/awoooi-ops/docker-health-monitor.sh ~/awoooi-ops/pg-backup.sh && echo '✅ Permissions set'" \
+            "chmod +x ~/awoooi-ops/docker-health-monitor.sh ~/awoooi-ops/pg-backup.sh ~/awoooi-ops/notify-awoooi-ops.sh && echo '✅ Permissions set'" \
             || echo "⚠️ Failed to set permissions"
       - name: Notify Pipeline Failure


@@ -1,3 +1,73 @@
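## 2026-05-12 | Ops notification bypass converged onto AWOOI API / AwoooP

**Background**: CI/CD notifications already go through the AWOOI Alertmanager entry point first, and TelegramGateway mirrors them to the AwoooP Run Timeline. The ops scripts on 188, however, still had a direct Telegram send path. As a result, operational events such as backups, DR Drill, and host backup bypassed AwoooP governance and auditing and appeared only in the Telegram group.

**Changes in this fix**:

- Added `scripts/ops/notify-awoooi-ops.sh`
  - Wraps the ops job status in an Alertmanager payload.
  - Posts to `${AWOOOI_API_URL}/api/v1/webhooks/alertmanager` by default.
  - Supports both `AWOOI_OPS_*` and `AWOOOI_OPS_*` environment variables.
  - Supports `AWOOI_OPS_DRY_RUN=1`, which prints the JSON for verification before deployment.
- `pg-backup.sh`
  - DB backup success / failure goes through `notify-awoooi-ops.sh` first.
  - Alertname is `Backup.PG` with severity fixed at `info`, so backup status notifications do not stray into the LLM path and burn tokens.
  - Direct Telegram sending is kept only as a fallback when the API is unreachable.
- `dr-drill.sh`
  - DR dry-run / failure / monthly drill results go through the AWOOI API first.
  - Alertname is `DRDrillStatus`, and the elapsed time is included.
- `backup-from-110.sh`
  - Host backup failures go through the AWOOI API first; Telegram is sent directly only as a fallback.
  - Alertname is `HostBackupFailed` with severity fixed at `info`, so the script's immediate notification and the Prometheus long-duration backup alert do not both trigger the LLM.
- `.gitea/workflows/cd.yaml`
  - `Sync Ops Scripts to 188` now also syncs `notify-awoooi-ops.sh`.
  - The chmod step now includes the helper, so `pg-backup.sh` on 188 can use the helper in the same directory.
- The Telegram fallback now uses `--data-urlencode text=...`, so multi-line HTML messages no longer break the format inside a JSON string.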
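
The reason for the switch to `--data-urlencode` can be illustrated with a small sketch. It is not taken from the scripts: `msg` is a made-up message, and the string concatenation only imitates what happens when a shell variable is interpolated into a hand-written JSON body.

```python
import json
from urllib.parse import parse_qs, urlencode

# Hypothetical multi-line HTML message containing quotes (an assumption for illustration).
msg = '<b>DR Drill</b>\n├ backup: "daily-01"\n└ status: ok'

# Naive interpolation into a JSON template: the raw newline and the
# unescaped quotes make the request body invalid JSON.
naive_body = '{"chat_id":"-100","text":"' + msg + '","parse_mode":"HTML"}'
try:
    json.loads(naive_body)
    naive_ok = True
except json.JSONDecodeError:
    naive_ok = False

# Form encoding percent-encodes newlines and quotes, which is what
# curl's --data-urlencode does for the text field.
form_body = urlencode({"chat_id": "-100", "parse_mode": "HTML", "text": msg})
round_trip = parse_qs(form_body)["text"][0]

print("naive JSON parses:", naive_ok)
print("form body round-trips:", round_trip == msg)
```

With form encoding the message needs no manual escaping, so the fallback path does not depend on the content of the notification.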
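
**Verification**:

- `bash -n scripts/ops/notify-awoooi-ops.sh scripts/ops/pg-backup.sh scripts/ops/dr-drill.sh scripts/ops/backup-from-110.sh` → passed.
- `AWOOI_OPS_DRY_RUN=1 ... scripts/ops/notify-awoooi-ops.sh` → the JSON parses, and the multi-line detail is preserved.
- `ruby -e 'require "yaml"; YAML.load_file(".gitea/workflows/cd.yaml")'` → `yaml ok`.
- `git diff --check` → clean.

Assessment: this round closes the main bypass for 188 ops notifications. Official messages now enter AWOOI API / TelegramGateway / AwoooP first, and direct Telegram sending remains only as a last-resort fallback when the API is offline. As next steps, `backup-from-110.sh`, which is not yet covered by the CD sync, can be deployed onto 188 itself, and the direct Telegram fallbacks in other workflows can be cleaned up gradually.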
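
The API-first, fallback-second routing described in this entry can be sketched as follows. This is a minimal illustration, not the scripts' implementation: `notify_ops` here is a Python stand-in, and `send_api` / `send_fallback` are hypothetical callables that replace the helper script and the direct Telegram call.

```python
from typing import Callable, List

def notify_ops(status: str, msg: str,
               send_api: Callable[[str, str], bool],
               send_fallback: Callable[[str], None]) -> str:
    """Try the API path first; use the direct fallback only if it fails."""
    try:
        if send_api(status, msg):
            return "api"
    except OSError:
        # Connection problems are treated as "API unreachable".
        pass
    send_fallback(msg)
    return "fallback"

sent: List[str] = []

def api_up(status: str, msg: str) -> bool:
    sent.append(f"api:{msg}")
    return True

def api_down(status: str, msg: str) -> bool:
    raise ConnectionError("API unreachable")

def direct_send(msg: str) -> None:
    sent.append(f"direct:{msg}")

route_ok = notify_ops("success", "backup ok", api_up, direct_send)
route_down = notify_ops("failed", "backup failed", api_down, direct_send)
print(route_ok, route_down, sent)
```

The point of the design is that exactly one path delivers each message, so governance sees every event that the API could accept and the group chat is not notified twice.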
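
## 2026-05-12 | CI/CD outbound messages now formally enter the AwoooP Run Timeline

**Background**: CI/CD notifications had been switched to the AWOOI API, but in production they did not initially appear in the AwoooP Run Monitor. Tracing the logs confirmed that the legacy outbound mirror relied on DB defaults when creating `awooop_run_state`, while NOT NULL columns such as `attempt_count` in the production table never had those defaults applied. This caused `telegram_outbound_mirror_failed`.

**Changes in this fix**:

- `ensure_completed_shadow_run()` in `channel_hub.py` now writes these values explicitly:
  - `attempt_count = 0`
  - `max_attempts = 3`
  - `cost_usd = 0.0000`
  - `step_count = 0`
- `platform_operator_service.py` changes the outbound timeline title for messages containing `[AWOOOI CI/CD]` to `TELEGRAMCI/CD 狀態通知` (CI/CD status notification), instead of the generic `TELEGRAM處置結果` (disposition result).
- `.gitea/workflows/cd.yaml` fixes a self-matching problem in the Docker build lock check: `grep 'docker build'` could match its own shell script, so an orphan lock was never cleaned up automatically.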
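
The NOT NULL failure described in the background can be reproduced in miniature. The sketch below uses SQLite from the Python standard library as a stand-in for the production database, and `run_state` is a hypothetical, simplified table rather than the real `awooop_run_state` schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in table: NOT NULL columns that have no DEFAULT clause.
conn.execute(
    """
    CREATE TABLE run_state (
        run_id TEXT PRIMARY KEY,
        attempt_count INTEGER NOT NULL,
        max_attempts INTEGER NOT NULL,
        cost_usd NUMERIC NOT NULL,
        step_count INTEGER NOT NULL
    )
    """
)

def insert_relying_on_defaults(run_id: str) -> bool:
    """Insert only the key and hope the database fills in the rest."""
    try:
        conn.execute("INSERT INTO run_state (run_id) VALUES (?)", (run_id,))
        return True
    except sqlite3.IntegrityError:
        return False

def insert_with_explicit_values(run_id: str) -> bool:
    """Insert every NOT NULL column explicitly, as the fix does."""
    conn.execute(
        "INSERT INTO run_state "
        "(run_id, attempt_count, max_attempts, cost_usd, step_count) "
        "VALUES (?, 0, 3, 0.0, 0)",
        (run_id,),
    )
    return True

implicit_ok = insert_relying_on_defaults("run-a")
explicit_ok = insert_with_explicit_values("run-b")
row_count = conn.execute("SELECT COUNT(*) FROM run_state").fetchone()[0]
print(implicit_ok, explicit_ok, row_count)
```

Writing the values in application code keeps the insert independent of whether a migration applied the column defaults in a given environment.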

**Verification**:

- Gitea CD `#1885`: success
  - `tests`: success.
  - `build-and-deploy`: success.
  - `post-deploy-checks`: success.
- K8s live images:
  - `awoooi-api`: `192.168.0.110:5000/awoooi/api:03ba9678d54cd24038cbe3162b6c03c31956548c`
  - `awoooi-web`: `192.168.0.110:5000/awoooi/web:03ba9678d54cd24038cbe3162b6c03c31956548c`
  - `awoooi-worker`: `192.168.0.110:5000/awoooi/api:03ba9678d54cd24038cbe3162b6c03c31956548c`
- Production smoke:
  - `/api/v1/health` → 200.
  - `/zh-TW/awooop/runs` → 200.
  - `/api/v1/platform/runs/list?per_page=3` → `total=11`.
  - Run detail `5f422d51-f967-532b-9eaf-46c1616ef455`:
    - The timeline contains `TELEGRAMCI/CD 狀態通知`.
    - The content preview contains `[AWOOOI CI/CD] | post-deploy`.
- A short window of the production API log shows:
  - `alertmanager_cicd_detected`
  - `completed_shadow_run_created`
  - `outbound_message_recorded`
  - No further `telegram_outbound_mirror_failed`, `NotNullViolation`, or `IntegrityError`.

Assessment: CI/CD outbound messages are no longer just Telegram messages; they are governance events that can be looked up in the AwoooP Run Monitor / Timeline. This is the first verifiable closed loop in folding AWOOOP back into the control plane of the AI automation flywheel.
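
The lock-check self-match mentioned in this entry's fix list is a general pitfall: a process search for a string also finds the searcher, because the search string appears in its own command line. The sketch below illustrates the idea only. The process list is invented, and excluding the checker's own PID is one possible remedy; the actual change made in `cd.yaml` is not shown in this changelog.

```python
from typing import List, Tuple

Process = Tuple[int, str]  # (pid, command line)

def matching_builds(processes: List[Process], needle: str, own_pid: int) -> List[int]:
    """Return PIDs whose command line contains needle, excluding the checker."""
    return [pid for pid, cmd in processes if needle in cmd and pid != own_pid]

# Hypothetical snapshot: no real build is running, only the checker itself.
processes: List[Process] = [
    (101, "sshd: ollama"),
    (202, "bash -c ps aux | grep 'docker build'"),
]

naive = [pid for pid, cmd in processes if "docker build" in cmd]
fixed = matching_builds(processes, "docker build", own_pid=202)
print(naive, fixed)
```

In the naive version the checker always sees at least one match, so a stale lock looks permanently in use.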
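
## 2026-05-07 | AwoooP legacy Channel Event: adding a completed shadow run anchor

**Background**: Production `/api/v1/platform/runs/list` returned `total=0`, yet the system kept producing Telegram outbound messages and grouped child alerts. An audit confirmed that legacy Telegram outbound only wrote `awooop_outbound_message` with a soft `run_id` and no corresponding `awooop_run_state`; grouped child alerts likewise landed only in `awooop_conversation_event`. The AwoooP Console therefore had event / outbound data, but the Run Monitor main list had no aggregation anchor and looked like an empty shell.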

scripts/ops/backup-from-110.sh

@@ -31,6 +31,7 @@ TEXTFILE_DIR="${TEXTFILE_DIR:-/home/ollama/node_exporter_textfiles}"
 TEXTFILE_PROM="${TEXTFILE_DIR}/backup.prom"
 DATE=$(date +%Y%m%d-%H%M%S)
 ERRORS=0
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 log() {
     echo "[$DATE] $*" | tee -a "$LOG"
@@ -38,6 +39,45 @@ log() {
 log "=== Starting backup from 110 ==="
+notify_awoooi_ops() {
+    local status="$1"
+    local msg="$2"
+    local helper="${SCRIPT_DIR}/notify-awoooi-ops.sh"
+    [[ -x "$helper" ]] || return 1
+    AWOOI_OPS_ALERTNAME="HostBackupFailed" \
+    AWOOI_OPS_JOB_NAME="188 host-level backup" \
+    AWOOI_OPS_STATUS="$status" \
+    AWOOI_OPS_SEVERITY="info" \
+    AWOOI_OPS_SOURCE="backup-from-110" \
+    AWOOI_OPS_COMPONENT="host-backup" \
+    AWOOI_OPS_SUMMARY="188 host-level backup ${status}" \
+    AWOOI_OPS_DETAIL="$msg" \
+    "$helper" >/dev/null
+}
+notify_telegram_fallback() {
+    local msg="$1"
+    local tg_token="${TG_BOT_TOKEN:-${TELEGRAM_BOT_TOKEN:-}}"
+    local tg_chat="${TELEGRAM_ALERT_CHAT_ID:-${SRE_GROUP_CHAT_ID:--1003711974679}}"
+    if [ -n "$tg_token" ] && [ -n "$tg_chat" ]; then
+        curl -s -X POST "https://api.telegram.org/bot${tg_token}/sendMessage" \
+            -d "chat_id=${tg_chat}" \
+            --data-urlencode "text=${msg}" \
+            > /dev/null || true
+    fi
+}
+notify_ops() {
+    local status="$1"
+    local msg="$2"
+    # Official path: hand off to the AWOOI API first; TelegramGateway sends the message and mirrors it to AwoooP.
+    # Only when the API is unreachable or the helper is not deployed is the direct Telegram emergency bypass used.
+    notify_awoooi_ops "$status" "$msg" && return 0
+    notify_telegram_fallback "$msg"
+}
 # ── Harbor registry data ──────────────────────────────────────────────────────
 # 2026-04-17 ogt: read volumes through the docker socket instead; /var/lib/docker/volumes/ is 710 root:root
 # wooo is a member of the docker group and can mount volumes via docker run, but cannot read the filesystem path directly
@@ -100,15 +140,6 @@ EOF
     exit 0
 else
     log "=== Backup FAILED ($ERRORS errors) ==="
-    # Telegram alert: the official destination is the SRE war-room group.
-    TG_TOKEN="${TG_BOT_TOKEN:-}"
-    TG_CHAT="${TELEGRAM_ALERT_CHAT_ID:-${SRE_GROUP_CHAT_ID:--1003711974679}}"
-    if [ -n "$TG_TOKEN" ] && [ -n "$TG_CHAT" ]; then
-        curl -s -X POST "https://api.telegram.org/bot${TG_TOKEN}/sendMessage" \
-            -d "chat_id=${TG_CHAT}" \
-            -d "text=🚨 backup-from-110.sh FAILED on 188 - ${ERRORS} error(s) at ${DATE}" \
-            > /dev/null || true
-    fi
+    notify_ops "failed" "🚨 backup-from-110.sh FAILED on 188 - ${ERRORS} error(s) at ${DATE}"
     exit 1
 fi

scripts/ops/dr-drill.sh

@@ -22,6 +22,7 @@ DR_NAMESPACE="awoooi-dr-test"
 RESTORE_TIMEOUT="${RESTORE_TIMEOUT:-600}"  # 10 minutes
 SECRETS_FILE="${SECRETS_FILE:-/home/wooo/awoooi-ops-secrets/secrets.env}"
 DRY_RUN="${1:-}"
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 [[ -f "$SECRETS_FILE" ]] && source "$SECRETS_FILE"
@@ -31,13 +32,38 @@ START_TIME=$(date +%s)
 log() { echo "[$(date '+%Y-%m-%d %H:%M:%S %z')] $*"; }
+notify_awoooi_ops() {
+    local status="$1"
+    local msg="$2"
+    local helper="${SCRIPT_DIR}/notify-awoooi-ops.sh"
+    [[ -x "$helper" ]] || return 1
+    AWOOI_OPS_ALERTNAME="DRDrillStatus" \
+    AWOOI_OPS_JOB_NAME="DR Drill monthly exercise" \
+    AWOOI_OPS_STATUS="$status" \
+    AWOOI_OPS_SEVERITY="info" \
+    AWOOI_OPS_SOURCE="dr-drill" \
+    AWOOI_OPS_COMPONENT="disaster-recovery" \
+    AWOOI_OPS_SUMMARY="DR Drill ${status}" \
+    AWOOI_OPS_DETAIL="$msg" \
+    AWOOI_OPS_DURATION_SECONDS="$(elapsed)" \
+    "$helper" >/dev/null
+}
 notify_telegram() {
     local msg="$1"
+    local status="${2:-success}"
+    # Official path: hand off to the AWOOI API first; TelegramGateway sends the message and mirrors it to AwoooP.
+    # Only when the API is unreachable or the helper is not deployed is the direct Telegram emergency bypass used.
+    notify_awoooi_ops "$status" "$msg" && return 0
     local chat_id="${TELEGRAM_ALERT_CHAT_ID:-${SRE_GROUP_CHAT_ID:--1003711974679}}"
     if [[ -n "${TELEGRAM_BOT_TOKEN:-}" && -n "$chat_id" ]]; then
         curl -s -X POST "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage" \
-            -H "Content-Type: application/json" \
-            -d "{\"chat_id\":\"${chat_id}\",\"text\":\"${msg}\",\"parse_mode\":\"HTML\"}" \
+            -d "chat_id=${chat_id}" \
+            -d "parse_mode=HTML" \
+            --data-urlencode "text=${msg}" \
             > /dev/null 2>&1 || true
     fi
 }
@@ -189,18 +215,18 @@ main() {
     if [[ "$DRY_RUN" == "--dry-run" ]]; then
         log "🔍 DRY RUN mode: only checks the backup, no restore is performed"
         local backup
-        backup=$(find_latest_backup) || { notify_telegram "❌ DR Drill failed: no valid backup found"; exit 1; }
+        backup=$(find_latest_backup) || { notify_telegram "❌ DR Drill failed: no valid backup found" "failed"; exit 1; }
         log "✅ Latest backup: ${backup}"
         notify_telegram "🔍 <b>DR Drill DRY RUN</b>
 ├ Latest backup: ${backup}
-└ Status: Completed ✅ (restore not performed)"
+└ Status: Completed ✅ (restore not performed)" "success"
         return 0
     fi
     local backup
     backup=$(find_latest_backup) || {
         notify_telegram "❌ <b>DR Drill failed</b>
-└ No valid Velero backup found"
+└ No valid Velero backup found" "failed"
         exit 1
     }
     log "📦 Using backup: ${backup}"
@@ -233,12 +259,15 @@ main() {
     log "=== DR Drill finished: ${overall} (${minutes}m${seconds}s) ==="
+    local notify_status="success"
+    [[ "$overall" == *"FAIL"* ]] && notify_status="failed"
     notify_telegram "${overall} <b>DR Drill monthly exercise</b>
 ├ Backup: ${backup}
 ├ Restore: ${pod_status}
 ├ API Health: ${health_status}
 ├ Duration: ${minutes}m${seconds}s
-└ Time: $(date '+%Y-%m-%d %H:%M') +0800"
+└ Time: $(date '+%Y-%m-%d %H:%M') +0800" "$notify_status"
     [[ "$overall" == *"FAIL"* ]] && exit 1
     return 0

scripts/ops/notify-awoooi-ops.sh Executable file

@@ -0,0 +1,100 @@
+#!/usr/bin/env bash
+# 2026-05-12 Codex: ops notifications go through the AWOOI Alertmanager entry point first, so that TelegramGateway
+# sends them centrally and mirrors them to AwoooP. Callers keep a direct Telegram fallback as a backup for when the API is offline.
+set -euo pipefail
+API_BASE="${AWOOOI_API_URL:-https://awoooi.wooo.work}"
+ALERTMANAGER_URL="${AWOOOI_ALERTMANAGER_URL:-${API_BASE%/}/api/v1/webhooks/alertmanager}"
+JOB_NAME="${AWOOI_OPS_JOB_NAME:-${AWOOOI_OPS_JOB_NAME:-Ops Job}}"
+STATUS_RAW="${AWOOI_OPS_STATUS:-${AWOOOI_OPS_STATUS:-success}}"
+SEVERITY="${AWOOI_OPS_SEVERITY:-${AWOOOI_OPS_SEVERITY:-info}}"
+ALERTNAME="${AWOOI_OPS_ALERTNAME:-${AWOOOI_OPS_ALERTNAME:-OpsJobStatus}}"
+SOURCE="${AWOOI_OPS_SOURCE:-${AWOOOI_OPS_SOURCE:-ops-script}}"
+HOSTNAME_VALUE="${AWOOI_OPS_HOST:-${AWOOOI_OPS_HOST:-$(hostname 2>/dev/null || echo unknown)}}"
+COMPONENT="${AWOOI_OPS_COMPONENT:-${AWOOOI_OPS_COMPONENT:-ops}}"
+SUMMARY="${AWOOI_OPS_SUMMARY:-${AWOOOI_OPS_SUMMARY:-${JOB_NAME}}}"
+DETAIL="${AWOOI_OPS_DETAIL:-${AWOOOI_OPS_DETAIL:-}}"
+DURATION_SECONDS="${AWOOI_OPS_DURATION_SECONDS:-${AWOOOI_OPS_DURATION_SECONDS:-0}}"
+if ! command -v python3 >/dev/null 2>&1; then
+    echo "python3 missing; cannot build Alertmanager JSON payload" >&2
+    exit 2
+fi
+payload_file="$(mktemp)"
+trap 'rm -f "$payload_file"' EXIT
+JOB_NAME="$JOB_NAME" \
+STATUS_RAW="$STATUS_RAW" \
+SEVERITY="$SEVERITY" \
+ALERTNAME="$ALERTNAME" \
+SOURCE="$SOURCE" \
+HOSTNAME_VALUE="$HOSTNAME_VALUE" \
+COMPONENT="$COMPONENT" \
+SUMMARY="$SUMMARY" \
+DETAIL="$DETAIL" \
+DURATION_SECONDS="$DURATION_SECONDS" \
+python3 - <<'PY' > "$payload_file"
+from __future__ import annotations
+import datetime as dt
+import json
+import os
+import re
+status = (os.environ.get("STATUS_RAW") or "success").strip().lower()
+if status not in {"success", "failed", "warning", "running", "skipped"}:
+    status = "warning"
+severity = (os.environ.get("SEVERITY") or "info").strip().lower()
+if severity not in {"info", "warning", "critical"}:
+    severity = "info"
+alertname = (os.environ.get("ALERTNAME") or "OpsJobStatus").strip()
+safe_alertname = re.sub(r"[^A-Za-z0-9_.:-]+", "_", alertname).strip("_") or "OpsJobStatus"
+payload = {
+    "version": "4",
+    "status": "firing",
+    "receiver": "awoooi-ops",
+    "groupLabels": {"alertname": safe_alertname},
+    "commonLabels": {"alertname": safe_alertname, "severity": severity},
+    "commonAnnotations": {},
+    "alerts": [
+        {
+            "status": "firing",
+            "labels": {
+                "alertname": safe_alertname,
+                "severity": severity,
+                "status": status,
+                "source": os.environ.get("SOURCE", "ops-script"),
+                "job": os.environ.get("JOB_NAME", "Ops Job"),
+                "host": os.environ.get("HOSTNAME_VALUE", "unknown"),
+                "component": os.environ.get("COMPONENT", "ops"),
+                "duration_seconds": os.environ.get("DURATION_SECONDS", "0"),
+            },
+            "annotations": {
+                "summary": os.environ.get("SUMMARY", ""),
+                "description": os.environ.get("DETAIL", ""),
+            },
+            "startsAt": dt.datetime.now(dt.timezone.utc).isoformat().replace("+00:00", "Z"),
+        }
+    ],
+}
+print(json.dumps(payload, ensure_ascii=False))
+PY
+if [ "${AWOOI_OPS_DRY_RUN:-${AWOOOI_OPS_DRY_RUN:-0}}" = "1" ]; then
+    cat "$payload_file"
+    exit 0
+fi
+curl -fsS \
+    --connect-timeout "${AWOOI_OPS_CONNECT_TIMEOUT:-5}" \
+    --max-time "${AWOOI_OPS_MAX_TIME:-12}" \
+    -H "Content-Type: application/json" \
+    --data-binary "@${payload_file}" \
+    "$ALERTMANAGER_URL" >/dev/null
+echo "AwoooP-mirrored ops notification sent via ${ALERTMANAGER_URL}"
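
The input normalization performed by the embedded Python in this helper can be exercised on its own. The sketch below restates the same three rules (status, severity, alertname) as a standalone function; the function name `normalize` and the sample inputs are assumptions made for illustration, not part of the script.

```python
import re
from typing import Optional, Tuple

ALLOWED_STATUS = {"success", "failed", "warning", "running", "skipped"}
ALLOWED_SEVERITY = {"info", "warning", "critical"}

def normalize(status: Optional[str], severity: Optional[str],
              alertname: Optional[str]) -> Tuple[str, str, str]:
    """Apply the helper's fallback rules to raw environment values."""
    status = (status or "success").strip().lower()
    if status not in ALLOWED_STATUS:
        status = "warning"  # unknown statuses are downgraded, not rejected
    severity = (severity or "info").strip().lower()
    if severity not in ALLOWED_SEVERITY:
        severity = "info"
    name = (alertname or "OpsJobStatus").strip()
    # Runs of characters outside the allowed set collapse to one underscore.
    name = re.sub(r"[^A-Za-z0-9_.:-]+", "_", name).strip("_")
    return status, severity, name or "OpsJobStatus"

print(normalize("FAILED ", "info", "Backup.PG"))
print(normalize("done", "fatal", "DR Drill!"))
print(normalize("", "", "!!!"))
```

Note that an unrecognized status becomes `warning` while an unrecognized severity becomes `info`, so a typo in a caller cannot raise the severity of an alert.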

scripts/ops/pg-backup.sh

@@ -12,6 +12,7 @@ BACKUP_DIR="${BACKUP_DIR:-/home/ollama/backups}"
 SECRETS_FILE="${SECRETS_FILE:-/home/ollama/awoooi-ops-secrets/secrets.env}"
 RETAIN_DAYS="${RETAIN_DAYS:-7}"
 AWOOOI_API_URL="${AWOOOI_API_URL:-https://awoooi.wooo.work}"
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 # Load secrets (including the Telegram token used for the fallback)
 [[ -f "$SECRETS_FILE" ]] && source "$SECRETS_FILE"
@@ -21,13 +22,37 @@ LOG_PREFIX="[$(date '+%Y-%m-%d %H:%M:%S %z')]"
 log() { echo "${LOG_PREFIX} $*"; }
+notify_awoooi_ops() {
+    local status="$1"
+    local msg="$2"
+    local helper="${SCRIPT_DIR}/notify-awoooi-ops.sh"
+    [[ -x "$helper" ]] || return 1
+    AWOOI_OPS_ALERTNAME="Backup.PG" \
+    AWOOI_OPS_JOB_NAME="AWOOOI DB backup" \
+    AWOOI_OPS_STATUS="$status" \
+    AWOOI_OPS_SEVERITY="info" \
+    AWOOI_OPS_SOURCE="pg-backup" \
+    AWOOI_OPS_COMPONENT="postgres-backup" \
+    AWOOI_OPS_SUMMARY="AWOOOI DB backup ${status}" \
+    AWOOI_OPS_DETAIL="$msg" \
+    "$helper" >/dev/null
+}
 notify_telegram() {
     local msg="$1"
+    local status="${2:-success}"
+    # Official path: hand off to the AWOOI API first; TelegramGateway sends the message and mirrors it to AwoooP.
+    # Only when the API is unreachable or the helper is not deployed is the direct Telegram emergency bypass used.
+    notify_awoooi_ops "$status" "$msg" && return 0
     local chat_id="${TELEGRAM_ALERT_CHAT_ID:-${SRE_GROUP_CHAT_ID:--1003711974679}}"
     if [[ -n "${TELEGRAM_BOT_TOKEN:-}" && -n "$chat_id" ]]; then
         curl -s -X POST "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage" \
-            -H "Content-Type: application/json" \
-            -d "{\"chat_id\":\"${chat_id}\",\"text\":\"${msg}\",\"parse_mode\":\"HTML\"}" \
+            -d "chat_id=${chat_id}" \
+            -d "parse_mode=HTML" \
+            --data-urlencode "text=${msg}" \
             > /dev/null 2>&1 || true
     fi
 }
@@ -110,10 +135,13 @@ main() {
     local icon="✅"
     [[ $fail_count -gt 0 ]] && icon="⚠️"
+    local notify_status="success"
+    [[ $fail_count -gt 0 ]] && notify_status="failed"
     notify_telegram "${icon} <b>AWOOOI DB backup</b>
 ├ Time: $(date '+%Y-%m-%d %H:%M') +0800
 ├ Succeeded: ${success_count} | Failed: ${fail_count}
-${details}"
+${details}" "$notify_status"
     [[ $fail_count -gt 0 ]] && exit 1
     return 0