OG T
|
eab3f527cd
|
feat(monitoring): Phase 7 blind-spot governance, L2 quotas + self-monitoring alerts (ADR-090)
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 1m21s
CD Pipeline / build-and-deploy (push) Failing after 9m24s
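Situation: host 110 load=17 sustained for 13 days + cadvisor on host 188 at 321% CPU, restart had no effect
Commander's iron rule: do not just bring the numbers down, solve it for the long term → structural governance, not patches
This commit covers: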
1. k8s/monitoring/docker-compose-110.yml
- cadvisor: add mem_limit 512M + cpus 1.0 (L2 safety net against blow-ups)
- Note: the live config on 110 has drifted from this file (to be brought under CD in the next session)
2. ops/monitoring/alerts-unified.yml: add the infra_self_monitoring group:
- CadvisorDown / MemoryPressure / CPUThrottled
- NodeExporterDown / CPUThrottled
- SentryClickHouseMemoryPressure / CPUThrottled
- GiteaMemoryPressure / CPUThrottled
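- PrometheusDown (meta layer: monitoring the monitoring itself)
→ All rules evaluate dynamically on (memory usage / spec_memory_limit);
no hard-coded 80% or MB figures, so thresholds follow automatically when a quota changes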
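A minimal sketch of what one such quota-relative rule could look like, assuming the standard cAdvisor metrics `container_memory_usage_bytes` and `container_spec_memory_limit_bytes`. The expanded alert name `CadvisorMemoryPressure`, the `name="cadvisor"` label value, the `0.9` ratio, the `for` duration, and the severity label are all illustrative assumptions; the commit message does not state them.

```yaml
groups:
  - name: infra_self_monitoring
    rules:
      - alert: CadvisorMemoryPressure      # assumed expansion of "MemoryPressure"
        # Usage relative to the container's own configured limit, so the
        # rule needs no edit when the quota in docker-compose changes.
        # The "> 0" filter drops containers without a limit, for which
        # cAdvisor typically reports a limit of 0.
        expr: |
          container_memory_usage_bytes{name="cadvisor"}
            / (container_spec_memory_limit_bytes{name="cadvisor"} > 0)
            > 0.9
        for: 5m                            # assumed hold time
        labels:
          severity: warning                # assumed label
        annotations:
          summary: "cadvisor memory is close to its configured limit"
```

Because the denominator comes from the running container's spec, raising `mem_limit` from 512M to 1g would shift the absolute trigger point without touching the rule.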
Other supporting changes (outside this repo, already patched over SSH onto 110/188):
- /home/ollama/wooo-aiops/docker-compose.yml: 188 cadvisor gets --disable_metrics / --docker_only / --housekeeping_interval + 1g/1.5c
- /home/wooo/monitoring/docker-compose.yml: 110 cadvisor + node-exporter brought under management + metric-reduction flags + quotas
- /opt/sentry/docker-compose.override.yml: Sentry L2 quotas (clickhouse 8g/4c, kafka 3g/2c, etc.)
- /home/wooo/gitea/docker-compose.yml: Gitea 3g/3c
- /home/wooo/act-runner/docker-compose.yml: Actions Runner 2g/2c
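For illustration, the 188 cadvisor entry could be expressed roughly as the compose fragment below. This is a sketch, not the patched file: the image name is an assumption, the flag values are borrowed from the earlier fix for host 110 (this commit names the flags but not their values), and volumes, ports, and other required settings are omitted.

```yaml
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor   # assumed image; pin a tag in practice
    command:
      - --docker_only=true
      - --housekeeping_interval=30s   # value assumed from the 110 fix
      - --disable_metrics=disk,diskIO,tcp,udp,percpu,sched,process
    mem_limit: 1g                     # "1g" from the commit message
    cpus: 1.5                         # "1.5c" from the commit message
```

The other services listed above would follow the same `mem_limit` / `cpus` pattern with their own figures.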
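Maps to:
- feedback_monitor_self_monitoring.md 🔴🔴🔴 monitoring tools must themselves be monitored
- feedback_ai_autonomous_direction.md dynamic thresholds ≠ hard-coded rules
- ADR-090 Layer 2 resource quota enforcement
Acceptance (48h):
- 188 cadvisor CPU from 321% → <50% (quota enforced)
- 110 load5 from 18 → <10 (after Sentry/Gitea pressure is relieved)
- No false positives from the self-monitoring alerts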
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-19 01:50:41 +08:00 |
|
OG T
|
08db3580a7
|
fix(monitoring): fix high CPU load on host 110
CD Pipeline / build-and-deploy (push) Has started running
Root cause 1: cadvisor continuously scans overlay2 disk usage (1-4s per scan × N containers)
→ add --disable_metrics=disk,diskIO,tcp,udp,percpu,sched,process
→ --housekeeping_interval=30s --docker_only=true
→ CPU dropped from 239% to <1%
Root cause 2: node_exporter scrape_timeout defaults to 10s; under high load the scrape times out → broken pipe → frantic retries
→ add scrape_interval: 30s / scrape_timeout: 25s
→ CPU dropped from 48% to 0%
Overall load average: 20 → 9
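The scrape settings from root cause 2 might sit in the Prometheus configuration roughly as follows. This is a sketch only: the job name and target address are hypothetical, and only the two timing values come from the commit message.

```yaml
scrape_configs:
  - job_name: node-exporter               # hypothetical job name
    scrape_interval: 30s
    scrape_timeout: 25s                   # must not exceed scrape_interval
    static_configs:
      - targets: ["node-exporter:9100"]   # hypothetical target; 9100 is node_exporter's default port
```

Setting the values per job leaves the global defaults untouched for other targets.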
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 13:53:13 +08:00 |
|