12-Agent 全景診斷觸發的 P0/P1 觀測層修復。 ## P0 假警報止血(4 SLO 雪崩根因) - governance_agent.py:306 — 空 result 不再 fallback 0.0,改 continue + log warning 根因:Prometheus 查無資料(emitter 未實作 / rule 未部署)被誤判為 SLO=0 必觸發 violated=True 噴 4 條假告警 ## P0 鬼魂按鈕守門 - telegram_gateway.py:1654 — LLM 動態按鈕 Redis 失敗時 btn_list.clear() first_row(批准/拒絕,HMAC nonce 無狀態)由 caller 1488 永遠保留 feedback_no_ghost_buttons.md 三缺一鐵律對齊 ## ConfigMap drift 修復(3 處) - config.py:683 PROMETHEUS_URL: 188→110(drift checker 揪出 = SPF-4 部分根因) - config.py:705 ARGOCD_URL: 125→121(T0 G3 已知) - config.py:375 AI_FALLBACK_ORDER: 補 nvidia 對齊 ConfigMap ## P1 Alertmanager 升級(amtool SUCCESS) - ops/alertmanager/alertmanager.yml: deprecated → v0.27+ 新語法 - match/match_re → matchers - source_match/target_match → source_matchers/target_matchers - group_by 加 team label(防 SLO 雪崩 4 條同秒推) - PostgreSQL/Redis inhibit 補 equal: ['instance'](防爆炸抑制) - 新增 3 組因果抑制: - OllamaInstanceDown → SLO_*/AI_*(30 分鐘) - KMConverterDown → SLO_KMGrowthRate* - SLO_*_FastBurn → SLO_*_(Medium|Slow)Burn ## 治理工具落地 - scripts/check_config_drift.py: ConfigMap vs code default drift 檢測 揪出 PROMETHEUS_URL drift 是 SPF-4 根因(governance_agent 連 188 而非 110) - scripts/health_check_session.sh: 11 服務 + 4 SSH + drift + git 全景驗證 ## 驗證 - 1552 unit tests 全綠 - amtool check-config SUCCESS(8 inhibit_rules / 2 receivers) - drift checker 4 欄位全對齊 - health check 11 服務全可達 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
111 lines
3.5 KiB
Bash
Executable File
111 lines
3.5 KiB
Bash
Executable File
#!/usr/bin/env bash
|
||
# 2026-04-28 ogt + Claude Opus 4.7: P2-2 Session 啟動健康驗證
|
||
# 來源:tool-expert 統一治理方案
|
||
# 目的:每次 Claude session 啟動時快速確認 5 主機 + 關鍵服務可達
|
||
# 純 read-only(curl + ssh -o BatchMode),不修改任何狀態
|
||
#
|
||
# 用法:
|
||
# bash scripts/health_check_session.sh
|
||
# 或加 alias: alias awoooi-health='bash ~/awoooi/scripts/health_check_session.sh'
|
||
|
||
set -uo pipefail # 不要 -e,個別 check 失敗不阻擋全部
|
||
|
||
RED='\033[0;31m'
|
||
GREEN='\033[0;32m'
|
||
YELLOW='\033[0;33m'
|
||
NC='\033[0m'
|
||
|
||
ok() { printf "${GREEN}[OK]${NC} %s\n" "$1"; }
|
||
fail() { printf "${RED}[FAIL]${NC} %s\n" "$1"; }
|
||
warn() { printf "${YELLOW}[WARN]${NC} %s\n" "$1"; }
|
||
|
||
check_url() {
|
||
local name=$1 url=$2
|
||
local code
|
||
code=$(curl -sk --max-time 3 -o /dev/null -w "%{http_code}" "$url" 2>/dev/null || echo "000")
|
||
if [[ "$code" =~ ^[23] ]]; then
|
||
ok "$name → $url ($code)"
|
||
return 0
|
||
elif [[ "$code" =~ ^[45] ]]; then
|
||
warn "$name → $url ($code, 服務有回應但非 2xx/3xx)"
|
||
return 0
|
||
else
|
||
fail "$name → $url (unreachable)"
|
||
return 1
|
||
fi
|
||
}
|
||
|
||
check_ssh() {
|
||
local name=$1 host=$2
|
||
if ssh -o ConnectTimeout=3 -o BatchMode=yes -o StrictHostKeyChecking=no \
|
||
"$host" "echo ok" 2>/dev/null | grep -q ok; then
|
||
ok "SSH $name ($host)"
|
||
return 0
|
||
else
|
||
fail "SSH $name ($host) — 無法連線(timeout / 認證失敗 / 主機不可達)"
|
||
return 1
|
||
fi
|
||
}
|
||
|
||
ROOT="$(cd "$(dirname "$0")/.." && pwd)"
|
||
TS="$(date '+%Y-%m-%d %H:%M %Z')"
|
||
|
||
echo "=========================================="
|
||
echo "AWOOOI Session Health Check $TS"
|
||
echo "=========================================="
|
||
|
||
echo ""
|
||
echo "--- K8s 控制平面 ---"
|
||
check_url "K3s VIP API" "https://192.168.0.125:6443/healthz"
|
||
check_url "ArgoCD (121)" "https://192.168.0.121:30443"
|
||
|
||
echo ""
|
||
echo "--- AI 推理層 ---"
|
||
check_url "Ollama 111 GPU" "http://192.168.0.111:11434/api/tags"
|
||
check_url "Ollama 188 Hub" "http://192.168.0.188:11434/api/tags"
|
||
|
||
echo ""
|
||
echo "--- 觀測層 ---"
|
||
check_url "Prometheus 110" "http://192.168.0.110:9090/-/healthy"
|
||
check_url "Alertmanager 110" "http://192.168.0.110:9093/-/healthy"
|
||
check_url "Gitea 110" "http://192.168.0.110:3001"
|
||
check_url "Langfuse 110" "http://192.168.0.110:3100"
|
||
|
||
echo ""
|
||
echo "--- AWOOOI 核心服務 (prod NodePort) ---"
|
||
check_url "AWOOOI API (125)" "http://192.168.0.125:32334/api/v1/health"
|
||
|
||
echo ""
|
||
echo "--- SSH 連通 ---"
|
||
check_ssh "awoooi-devops (110)" "wooo@192.168.0.110"
|
||
check_ssh "k3s-1 (120)" "wooo@192.168.0.120"
|
||
check_ssh "k3s-2 (121)" "wooo@192.168.0.121"
|
||
check_ssh "ollama-111-gpu (ProxyJump 110)" "ollama-111-gpu"
|
||
|
||
echo ""
|
||
echo "--- Config Drift Check ---"
|
||
if [ -x "$ROOT/scripts/check_config_drift.py" ]; then
|
||
python3 "$ROOT/scripts/check_config_drift.py" || warn "config drift detected (見上方 [DRIFT] 行)"
|
||
else
|
||
warn "drift checker 不存在 ($ROOT/scripts/check_config_drift.py)"
|
||
fi
|
||
|
||
echo ""
|
||
echo "--- Git 狀態 ---"
|
||
if [ -d "$ROOT/.git" ]; then
|
||
cd "$ROOT" || exit
|
||
branch=$(git branch --show-current 2>/dev/null || echo "<detached>")
|
||
upstream_diff=$(git rev-list --count "@{u}..HEAD" 2>/dev/null || echo "?")
|
||
echo " 分支: $branch (本地超前上游 $upstream_diff 個 commit)"
|
||
if ! git diff --quiet 2>/dev/null; then
|
||
warn " 有未 commit 的變更(git status 自查)"
|
||
else
|
||
ok " 工作目錄 clean"
|
||
fi
|
||
fi
|
||
|
||
echo ""
|
||
echo "=========================================="
|
||
echo "Session Health Check 結束"
|
||
echo "=========================================="
|