awoooi

Author	SHA1	Message	Date
Your Name	ae7b39d96a	fix(ops): harden reboot recovery and backup alerts	2026-05-29 12:41:34 +08:00
Your Name	ae9d0b7385	feat(monitoring): alert on stale source provider ingestion	2026-05-20 19:19:21 +08:00
Your Name	598f33ae8b	fix(monitoring): clarify alert chain smoke evidence	2026-05-20 13:11:44 +08:00
Your Name	587551c1f1	fix(ops): monitor full-stack cold-start gates	2026-05-06 00:48:05 +08:00
Your Name	23932773ef	fix(monitoring): route docker baseline alerts to ssh	2026-05-06 00:00:12 +08:00
Your Name	2f50c67f5c	fix(monitoring): keep host alert ssh diagnostics canonical	2026-05-05 23:57:53 +08:00
Your Name	2221fd3256	fix(ops): persist host resource guardrails	2026-05-05 16:13:19 +08:00
Your Name	1cc9de5722	fix(ops): point runner guardrail alerts to host script	2026-05-05 15:25:37 +08:00
Your Name	d08d1e4951	fix(ops): alert on missing docker resource limits	2026-05-05 15:01:31 +08:00
Your Name	72d66e4ae6	fix(ops): align stale job cleanup thresholds	2026-05-05 14:54:17 +08:00
Your Name	5e625f777d	fix(ops): add stale gitea job cleanup guard	2026-05-05 14:50:47 +08:00
Your Name	7d45f0cb58	fix(ops): alert on stale gitea actions jobs	2026-05-05 14:42:09 +08:00
Your Name	fe618960a8	fix(ops): monitor systemd runners in host baseline	2026-05-05 14:08:43 +08:00
Your Name	e8e6748f70	fix(ops): add docker host resource baseline guardrails	2026-05-05 13:45:09 +08:00
Your Name	577250a678	fix(governance): fix anti-silencing W-3/W-4 guards + add Prometheus missing-data alert [Commander's reprimand: violates the iron rule in feedback_full_chain_first_then_fix.md] The previous commit `f1362fcc` swallowed alerts with skip conditions, which is a silencing fix: - W-3: total_exec<10 always skips → no alert even if Redis stays empty forever - W-4: playbooks total==0 always skips → no alert even if the table is wiped - Prometheus NaN sentinel stacked on the existing < 0.1 rule leaves no path that alerts. The Commander's reprimand: "the alerts were made to disappear again", "how many times has this been done already". This commit restores alert visibility. [Fix: 30-minute startup grace period + after it expires, fire new data-pipeline-broken alerts] - ai_slo_watchdog_job.py adds module-level _PROCESS_START and a _grace_active() guard: - W-3a: metric has data + rate<0.30 → existing alert "flywheel success rate too low" - W-3b: rate=None and uptime>30min → new alert "flywheel data pipeline has no traffic" - W-4a: playbooks total>0 + approved=0 → existing alert "auto-repair chain broken" - W-4b: playbooks total=0 and uptime>30min → new alert "Playbook table initialization failed" - 3 Prometheus rule files (k8s/monitoring/flywheel-alerts.yaml, ops/monitoring/alerts.yml, ops/monitoring/alerts-unified.yml) add FlywheelExecutionRateMissing: absent() or NaN for 30 minutes → alert, a second safeguard alongside watchdog W-3b [Added to memory] feedback_silencing_alerts_recurring_violation.md locks in the red-line iron rule: "using skip to swallow alerts in a fresh deploy / init guard = structural dereliction; split into a grace period + after expiry fire a new data-pipeline-broken alert" [Verification] All 106 governance-related unit tests pass: test_trust_drift_watchdog / test_governance_agent / test_failover_alerter / test_check_trust_drift_commit_outside_context_poc / test_governance_remediation_dispatch / test_ai_governance_endpoints / test_governance_dispatcher 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 12:39:46 +08:00
Your Name	b371edb70c	fix host alert auto-repair routing and backup false positives	2026-05-02 23:44:12 +08:00
Your Name	ca22ec2fd2	fix(aiops): route backup failures rule-first	2026-05-01 10:11:10 +08:00
Your Name	f0d14ab6c4	fix(aiops): escalate blocked auto repair	2026-04-30 23:49:17 +08:00
Your Name	7cd53c0228	fix(monitoring): switch memory alerts to working_set to stop page-cache false alarms - alerts-unified.yml: - SentryClickHouseMemoryPressure: usage_bytes → working_set_bytes, 0.8 → 0.85 - GiteaMemoryPressure: same fix (same root cause: page cache inflating usage) - ops/monitoring/tests/clickhouse_memory_test.yml: promtool 4 cases - 04-awoooi-devops-commander.md v2.8: Prometheus metric selection guidelines + Gitea HMAC Webhook guidelines - LOGBOOK: record the five T0 parallel tasks (A button / B ClickHouse / C Gitea webhook / D ElephantAlpha / F Code review). Evidence: 2026-04-23 23:13 sentry-clickhouse usage_bytes=88.5% vs working_set=7.8%. Root cause: container_memory_usage_bytes includes OS page cache, which the OOM killer does not treat as pressure. Fix: use working_set_bytes (RSS + active cache), the metric recognized by K8s/cadvisor, with threshold 0.85 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 20:16:12 +08:00
OG T	ba18ad2ef8	feat(hermes+rules): upgrade Hermes with LLM analysis + Commander's decision to deprecate PostgreSQLDiskGrowthRate Commander's decisions, 2026-04-19: - Rule 1 PostgreSQLDiskGrowthRate → option C: deprecate + replace with new rules - Rule 2 NoAlertsReceived2Hours → keep (guards the real alert chain) - fix the noise_rate algorithm first (NO_ACTION does not count as fp), then adjust dynamically after observation 1. rule_stats_updater v2 noise algorithm: Before: any EXPIRED approval counted as fp. Problem: NO_ACTION/OBSERVE/INVESTIGATE are pure AI observations and should not count as false positives. Fix: WHERE ar.action NOT ILIKE '%NO_ACTION%' AND NOT ILIKE '%OBSERVE%' AND ... 2. hermes_rule_quality v2 LLM upgrade: add _llm_analyze_noisy_rule: - analyze each noisy rule with OpenClaw (Ollama/NemoTron/Gemini) - JSON output: probable_root_causes/recommended_actions/confidence/should_deprecate - 3-way parse fallback (direct / NemoTron wrapper / description nested); _write_advisory_aol adds llm_analysis to output_payload; _send_telegram_summary adds the AI verdict + top 2 recommendations (capped at 8 entries to keep it short). Consistent with the Commander's iron rule: AI analyzes but takes no automatic action, and decisions stay with humans 3. ops/monitoring/alerts-unified.yml replaces Rule 1: remove PostgreSQLDiskGrowthRate (500MB/h growth → false positives triggered by normal WAL behavior); add HostDiskUsageHigh (>80% for 10m, warning); add HostDiskUsageCritical (>90% for 5m, critical); both carry labels.supersedes='PostgreSQLDiskGrowthRate' for traceability (pending the next deploy-alerts workflow apply to Prometheus) 4. Mark deprecated in the DB immediately (so Hermes does not push it again before the alerts yaml is deployed): UPDATE alert_rule_catalog SET review_status='deprecated' WHERE rule_name='PostgreSQLDiskGrowthRate' Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 19:39:05 +08:00
OG T	eab3f527cd	feat(monitoring): Phase 7 blind-spot governance: L2 quotas + self-monitoring alerts (ADR-090) Situation: 110 load=17 for 13 days + 188 cadvisor at 321% CPU, restarts ineffective. Commander's iron rule: do not just reduce it, solve it for the long term → structural governance rather than patches. This commit covers: 1. k8s/monitoring/docker-compose-110.yml - cadvisor gets mem_limit 512M + cpus 1.0 (L2 blast-containment net) - note that the live config on 110 has drifted from this file (to be brought under CD next session) 2. ops/monitoring/alerts-unified.yml adds the infra_self_monitoring group: - CadvisorDown / MemoryPressure / CPUThrottled - NodeExporterDown / CPUThrottled - SentryClickHouseMemoryPressure / CPUThrottled - GiteaMemoryPressure / CPUThrottled - PrometheusDown (meta layer: monitoring the monitoring) → all evaluated dynamically as (memory usage / spec_memory_limit), with no hardcoded 80% or MB figures, so thresholds follow automatically when quotas change. Supporting changes (outside this repo, already patched over SSH on 110/188): - /home/ollama/wooo-aiops/docker-compose.yml: 188 cadvisor adds --disable_metrics / --docker_only / --housekeeping_interval + 1g/1.5c - /home/wooo/monitoring/docker-compose.yml: 110 cadvisor + node-exporter brought under management + metric-reduction flags + quotas - /opt/sentry/docker-compose.override.yml: Sentry L2 quotas (clickhouse 8g/4c, kafka 3g/2c, etc.) - /home/wooo/gitea/docker-compose.yml: Gitea 3g/3c - /home/wooo/act-runner/docker-compose.yml: Actions Runner 2g/2c. Maps to: - feedback_monitor_self_monitoring.md 🔴🔴🔴 monitoring tools must themselves be monitored - feedback_ai_autonomous_direction.md dynamic thresholds ≠ hardcoded rules - ADR-090 Layer 2 resource quota enforcement. Acceptance (48h): - 188 cadvisor CPU from 321% → <50% (quota enforced) - 110 load5 from 18 → <10 (after relieving Sentry/Gitea pressure) - no false positives from self-monitoring alerts Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 01:50:41 +08:00
OG T	946fe1fa7c	fix(monitoring): merge duplicate flywheel alert groups + add notification_type: TYPE-8M Merge awoooi_flywheel_health (duplicate) into awoooi_flywheel_meta_alerts: - add notification_type: TYPE-8M to all 5 rules - add FlywheelAlertnameNullHigh (previously only in the old group) - delete the duplicate group to eliminate Prometheus same-name alert conflicts Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 22:43:02 +08:00
OG T	bd75aca727	feat(adr-075): add the 2 missing Prometheus alert rules - MomoScraperSuccessLow: business scraper success rate <90% (business group) - CoreDNSResolutionFailed: CoreDNS SERVFAIL rate >5% (kubernetes group) ADR-075 Phase 3 complete Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 21:59:18 +08:00
OG T	edb97fd29b	fix(monitoring): restore 4 Prometheus rule groups that existed only on the host. During deployment, deploy-alerts.sh overwrote these 4 groups, which had never been committed to the repo: - awoooi_flywheel_health (5 rules: Playbook/Success/Vectorization/NullRate/Stuck) - awoooi_backup_restore (2 rules: RestoreTestFailed/TestStale) - awoooi_infrastructure_detailed (3 rules: Container/RedisStream/DiskGrowth) - awoooi_host_connectivity (1 rule: NetworkPartition) Restored from /home/wooo/monitoring/alerts.yml.bak_20260412_183835. PromQL offset modifiers now apply to each selector rather than to the whole expression. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 19:14:39 +08:00
OG T	f52dc459e6	feat(adr075): Step4 add 4 groups of Prometheus rules: secops/business/flywheel_meta New rule groups: - awoooi_secops_alerts: UnauthorizedSSHLogin (>10 failures in 5min) - awoooi_business_alerts: AITokenCostSpike + GeminiAPIErrorRateHigh - awoooi_flywheel_meta_alerts: FlywheelPlaybookZero / FlywheelExecutionSuccessLow / FlywheelKMVectorizationLow / FlywheelIncidentsStuck. Flywheel meta rules depend on ADR-074 Exporter metrics; secops/business rules depend on node_exporter/awoooi custom metrics Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 18:51:23 +08:00
OG T	43edff184d	feat(dr): Sprint C: host rsync backup + DR SOP docs C-1 Velero: confirmed running (daily-awoooi-prod schedule, 13d, MinIO Available) C-2 Host rsync backup: scripts/ops/backup-from-110.sh: 188 backs up 110 via rsync daily at 1:00 AM - Harbor registry data (highest priority) - Gitea repos - bitan-pharmacy.git (if present) - on success, writes /var/run/backup-110.last_success for Prometheus to monitor - Telegram alert on failure; ops/monitoring/alerts-unified.yml: add HostBackupFailed alert rule C-3 DR SOP docs: docs/runbooks/disaster-recovery/DR-K8s-awoooi.md (<15 min) docs/runbooks/disaster-recovery/DR-Nginx.md (<5 min) docs/runbooks/disaster-recovery/DR-Harbor.md (<30 min) docs/runbooks/disaster-recovery/DR-Bitan.md (<5 min) docs/runbooks/disaster-recovery/DR-Stock.md (<5 min) Backup script deployment instructions (must be run manually): scp scripts/ops/backup-from-110.sh ollama@192.168.0.188:~/bin/backup-from-110.sh ssh ollama@192.168.0.188 "chmod +x ~/bin/backup-from-110.sh && mkdir -p /backup/110/{harbor,gitea}" ssh ollama@192.168.0.188 "echo '0 1 * * * /home/ollama/bin/backup-from-110.sh' | crontab -" Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-11 03:04:18 +08:00
OG T	6351e9a0e9	feat(mcp-phase2): MCP Phase 2: Prometheus MCP + SSH MCP + alert labels MCP-2b: prometheus_provider.py - prometheus_query (live PromQL query) - prometheus_query_range (historical trends, default 15 minutes) - prometheus_get_alert_history (alert firing history) - config: PROMETHEUS_URL + PROMETHEUS_MCP_ENABLED MCP-2a: ssh_provider.py - Group A: 9 read-only diagnostic tools (top/disk/memory/logs/status/port/nginx/swap) - Group B: 6 safe operation tools (restart/compose/systemctl/clear-log/ssl/nginx-reload) - four-layer safety guard (whitelist/allowed_hosts/forbidden_patterns/trust_score) - config: SSH_MCP_ENABLED + SSH_MCP_ALLOWED_HOSTS K8s: 04-ssh-mcp-secret.example.yaml (ssh-mcp-key Secret template + creation steps) Alert labels: alerts-unified.yml adds mcp_provider/host_type/alert_category. Covers: HostHighCpuLoad/HostOutOfMemory/HostOutOfDiskSpace/DockerContainer* SignOzDown/SentryDown/HarborDown/GiteaDown Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-11 02:35:35 +08:00
OG T	e1dfbedf0e	fix(alerts): HostHighCpuLoad auto_repair: false → true Root cause of the flywheel being stuck in GUARDRAIL_BLOCKED: the Prometheus rule label auto_repair=false forced HITL Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-10 13:33:23 +08:00
OG T	85d4857d1b	fix(monitoring): RedisMemoryHigh false positive: fix division by zero when max_bytes=0 - add redis_memory_max_bytes > 0 as a precondition - prevent division by zero from yielding +Inf and firing forever when Redis has no maxmemory set - affects: alerts-unified.yml + database-alerts.yaml Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 11:41:10 +08:00
OG T	9799a14f54	feat(monitoring): Plan C external website alerts: 4 websites + SSL certificate expiry warning Add external_website_alerts group: - MoWoooWorkDown (mo.wooo.work, 188, momo-app) - TsenyangWebsiteDown (tsenyang.com, 188, tsenyang-website) - StockWoooWorkDown (stock.wooo.work, 110, stock-platform) - BitanWoooWorkDown (bitan.wooo.work, 188, bitan-app) - ExternalSiteSSLExpiringSoon (14-day warning, auto_repair:false) blackbox-http already covers all targets; this adds structured alert rules. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 08:53:08 +08:00
OG T	3c6807d79c	ops(monitoring): trigger deploy-alerts to redeploy the 6 database_detail_alerts rules. `d9e0fab` added 6 detailed DB alert rules, but deploy-alerts failed because pyyaml was not installed; `0f86c5c` fixed the workflow, and this commit triggers a redeploy Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 21:17:26 +08:00
OG T	d9e0fab3fe	feat(monitoring): Sprint 5.2 Plan B: detailed database alert rules Add database_detail_alerts rule group: PostgreSQL: - PostgreSQLSlowQueries: slow queries >60s - PostgreSQLDeadlocks: deadlocks occurring - PostgreSQLTooManyConnections: connections >50 Redis: - RedisKeyEviction: key evictions - RedisConnectionsHigh: connections >100 - RedisCommandLatencyHigh: command latency >10ms Prerequisites: postgres-exporter:9187 + redis-exporter:9121 already deployed on 188 ✅ Prometheus scrape config updated ✅ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 18:19:03 +08:00
OG T	0847fa3a60	feat(sprint5.1): L2-2: add DockerContainerUnhealthy/Exited rules to alerts-unified.yml Add docker_health_alerts group: - DockerContainerUnhealthy: container_health_status==0, for 2m, auto_repair=true - DockerContainerExited: container_running_status==0, for 1m, auto_repair=true The auto_repair=true label routes the alert into the AWOOOI API Guardrail decision chain; the actual repair action is decided by Service Registry tiering (ADR-062). docker-health-monitor.sh (pure sensing layer) sends the webhook, and these rules add the Prometheus path alongside it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 16:40:44 +08:00
OG T	dc27f8f811	ops(monitoring): unify Prometheus alert rules: 40+ rules with unified layer labels. Fixes: - ClawBotDown → OpenClawDown (old name retired) - add SentryDown/HarborDown/GiteaDown/AlertmanagerDown - fill in unified layer/component/host/auto_repair labels on all rules - consolidate k8s/monitoring/*.yaml → ops/monitoring/alerts-unified.yml Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 02:26:18 +08:00

34 Commits