eab3f527cd
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 1m21s
CD Pipeline / build-and-deploy (push) Failing after 9m24s
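feat(monitoring): Phase 7 blind-spot governance: L2 quotas + self-monitoring alerts (ADR-090)
Situation: 110 has held load=17 for 13 days, and cadvisor on 188 sits at 321% CPU; restarting it did not help
Commander's iron rule: don't just bring the numbers down, solve it for the long term → structural governance, not patches
This commit covers: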
1. k8s/monitoring/docker-compose-110.yml
- cadvisor: add mem_limit 512M + cpus 1.0 (L2 safety net against runaway usage)
- Note: the live config on 110 has drifted from this file (to be brought under CD next session)
2. ops/monitoring/alerts-unified.yml: new infra_self_monitoring group:
- CadvisorDown / MemoryPressure / CPUThrottled
- NodeExporterDown / CPUThrottled
- SentryClickHouseMemoryPressure / CPUThrottled
- GiteaMemoryPressure / CPUThrottled
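- PrometheusDown (the meta layer: monitoring the monitor itself)
→ All of them evaluate (memory usage / spec_memory_limit) dynamically,
with no hard-coded 80% or MB figures, so the threshold follows automatically when a quota changes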
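
The ratio-based rules described above could look roughly like the sketch below. It is not the content of alerts-unified.yml: the alert names CadvisorMemoryPressure and CadvisorCPUThrottled are guessed from the abbreviated list, and the `job`/`name` label values, `for` durations, severities and the 0.9 / 0.25 ratios are all placeholders. The metric names are the standard cAdvisor and Prometheus ones.

```yaml
# Sketch only: expressions, label values and ratios are assumptions.
groups:
  - name: infra_self_monitoring
    rules:
      - alert: CadvisorDown
        # Assumes the scrape job is called "cadvisor".
        expr: up{job="cadvisor"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "cadvisor on {{ $labels.instance }} is not being scraped"

      - alert: CadvisorMemoryPressure
        # Working set divided by the container's own limit, so editing the
        # quota moves the effective threshold with it. The "> 0" filter drops
        # containers that have no limit (the metric reports 0 for those).
        expr: |
          container_memory_working_set_bytes{name="cadvisor"}
            / (container_spec_memory_limit_bytes{name="cadvisor"} > 0)
          > 0.9
        for: 10m
        labels:
          severity: warning

      - alert: CadvisorCPUThrottled
        # Share of CFS periods in which the container was throttled.
        expr: |
          rate(container_cpu_cfs_throttled_periods_total{name="cadvisor"}[5m])
            / rate(container_cpu_cfs_periods_total{name="cadvisor"}[5m])
          > 0.25
        for: 10m
        labels:
          severity: warning
```

Because the denominator is read from the container itself, raising or lowering a quota in a compose file needs no matching edit in the rule file.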
Other supporting changes (outside this repo, already patched over SSH on 110/188):
- /home/ollama/wooo-aiops/docker-compose.yml: 188 cadvisor gets --disable_metrics / --docker_only / --housekeeping_interval + 1g/1.5c
- /home/wooo/monitoring/docker-compose.yml: 110 cadvisor + node-exporter brought under management + metric-reduction flags + quotas
- /opt/sentry/docker-compose.override.yml: Sentry L2 quotas (clickhouse 8g/4c, kafka 3g/2c, etc.)
- /home/wooo/gitea/docker-compose.yml: Gitea 3g/3c
- /home/wooo/act-runner/docker-compose.yml: Actions Runner 2g/2c
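
For orientation, a cadvisor service carrying the flags and the 1g/1.5c quota listed above might be written as in the fragment below. This is a hedged sketch, not the patched file: the image reference, the housekeeping interval and the list of disabled metric groups are assumptions, and the volume mounts cadvisor needs are left out.

```yaml
# Sketch only: flag values and image are assumptions; mounts omitted.
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor   # pin a tag in practice
    command:
      - --docker_only=true            # report Docker containers only
      - --housekeeping_interval=30s   # collect less often than the default
      # Drop metric groups that are expensive to collect. "cpu" and "memory"
      # stay enabled because the self-monitoring alerts rely on them.
      - --disable_metrics=percpu,sched,tcp,udp,disk,diskIO
    mem_limit: 1g                     # the "1g" in 1g/1.5c
    cpus: 1.5                         # the "1.5c" in 1g/1.5c
    restart: unless-stopped
```

The other services in the list would follow the same `mem_limit` / `cpus` pattern with their own values, for example 8g/4c for clickhouse.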
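Maps to:
- feedback_monitor_self_monitoring.md 🔴🔴🔴 monitoring tools must themselves be monitored
- feedback_ai_autonomous_direction.md dynamic thresholds ≠ hard-coded rules
- ADR-090 Layer 2 resource quota enforcement
Acceptance (48h):
- 188 cadvisor CPU from 321% → <50% (enforced by quota)
- 110 load5 from 18 → <10 (once Sentry/Gitea pressure is relieved)
- No false positives from the self-monitoring alerts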
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 01:50:41 +08:00