Commit Graph

3 Commits

Author SHA1 Message Date
Your Name
6efd186750 docs(security): establish a control inventory for high-value configuration [skip ci] 2026-06-11 11:29:58 +08:00
OG T
eab3f527cd feat(monitoring): Phase 7 blind-spot governance: L2 quotas + self-monitoring alerts (ADR-090)
Some checks failed
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 1m21s
CD Pipeline / build-and-deploy (push) Failing after 9m24s
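Situation: host 110 has been at load=17 for 13 days, and cadvisor on host 188 sits at 321% CPU; restarting it did not help
Guiding rule: don't just bring the numbers down, solve the problem for the long term → structural governance rather than patches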

This commit covers:
1. k8s/monitoring/docker-compose-110.yml
   - cadvisor: add mem_limit 512M + cpus 1.0 (L2 blast guard)
   - Note that the live config on 110 has drifted from this file (to be brought under CD in the next session)

2. ops/monitoring/alerts-unified.yml: add the infra_self_monitoring group:
   - CadvisorDown / MemoryPressure / CPUThrottled
   - NodeExporterDown / CPUThrottled
   - SentryClickHouseMemoryPressure / CPUThrottled
   - GiteaMemoryPressure / CPUThrottled
   - PrometheusDown (meta layer: monitoring the monitoring)
   → All evaluated dynamically as (memory usage / spec_memory_limit),
     with no hard-coded 80% or MB figure; when a quota changes, the threshold follows automatically
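
The dynamic check described in item 2 might look roughly like the rule below. This is a sketch, not the contents of alerts-unified.yml: the alert name CadvisorMemoryPressure, the name="cadvisor" label matcher, the 0.9 ratio, the 5m hold time and the severity label are all assumptions. Only the group name and the usage-over-limit idea come from this commit. The two metrics are ones cadvisor exports.

```yaml
groups:
  - name: infra_self_monitoring
    rules:
      - alert: CadvisorMemoryPressure        # hypothetical alert name
        # Usage as a fraction of the container's own memory limit.
        # The "> 0" filter drops containers that have no limit set
        # (cadvisor reports a limit of 0 for those).
        expr: |
          container_memory_usage_bytes{name="cadvisor"}
            / (container_spec_memory_limit_bytes{name="cadvisor"} > 0)
            > 0.9
        for: 5m                               # assumed hold time
        labels:
          severity: warning                   # assumed label
        annotations:
          summary: "cadvisor is above 90% of its memory limit"
```

Because the denominator is read from the container's configured limit, raising or lowering the quota moves the byte value at which the alert fires without editing the rule.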

Other supporting changes (outside this repo, already patched over SSH onto 110/188):
- /home/ollama/wooo-aiops/docker-compose.yml: 188 cadvisor gets --disable_metrics / --docker_only / --housekeeping_interval + 1g/1.5c
- /home/wooo/monitoring/docker-compose.yml: 110 cadvisor + node-exporter brought under management + metric-reduction flags + quotas
- /opt/sentry/docker-compose.override.yml: Sentry L2 quotas (clickhouse 8g/4c, kafka 3g/2c, etc.)
- /home/wooo/gitea/docker-compose.yml: Gitea 3g/3c
- /home/wooo/act-runner/docker-compose.yml: Actions Runner 2g/2c
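
For readers unfamiliar with the shorthand, a quota such as "3g/3c" presumably reads as a 3 GB memory limit and 3 CPUs. A sketch of how that could be expressed with the same mem_limit and cpus keys this commit uses for cadvisor follows. The service names are assumptions, and the fragment is meant to be merged into existing service definitions rather than used on its own.

```yaml
# Sketch only: service names are assumed, not taken from the patched files.
services:
  gitea:
    mem_limit: 3g   # the "3g" half of 3g/3c
    cpus: 3.0       # the "3c" half of 3g/3c
  act-runner:
    mem_limit: 2g   # 2g/2c
    cpus: 2.0
```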

Maps to:
- feedback_monitor_self_monitoring.md 🔴🔴🔴 monitoring tools must themselves be monitored
- feedback_ai_autonomous_direction.md dynamic thresholds ≠ hard-coded rules
- ADR-090 Layer 2 resource quota enforcement

Acceptance (48h):
- 188 cadvisor CPU from 321% → <50% (enforced by quota)
- 110 load5 from 18 → <10 (once Sentry/Gitea pressure is relieved)
- No false positives from the self-monitoring alerts

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 01:50:41 +08:00
OG T
08db3580a7 fix(monitoring): fix high CPU load on host 110
Some checks are pending
CD Pipeline / build-and-deploy (push) Has started running
Root cause 1: cadvisor continuously scans overlay2 disk usage (1-4s per scan × N containers)
  → add --disable_metrics=disk,diskIO,tcp,udp,percpu,sched,process
  → --housekeeping_interval=30s --docker_only=true
  → CPU dropped from 239% to <1%
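
In a compose file, the flags above could be passed to cadvisor roughly as follows. This is a sketch: the service name is an assumption, and the rest of the service definition (image, volumes, ports) is omitted and would stay whatever the host already uses.

```yaml
services:
  cadvisor:                  # assumed service name
    command:
      # Skip the expensive collectors, including per-container disk usage scans.
      - --disable_metrics=disk,diskIO,tcp,udp,percpu,sched,process
      # Collect container stats every 30s.
      - --housekeeping_interval=30s
      # Only report Docker containers, not every raw cgroup.
      - --docker_only=true
```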

Root cause 2: the scrape_timeout for node_exporter defaults to 10s; under high load, scrapes time out → broken pipe → frantic retries
  → add scrape_interval: 30s / scrape_timeout: 25s
  → CPU dropped from 48% to 0%
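
The scrape settings above would sit in the Prometheus scrape job for node_exporter, roughly like this. The job name and target are placeholders, not values from the repo. Prometheus requires scrape_timeout to be no larger than scrape_interval, which 25s and 30s satisfy.

```yaml
scrape_configs:
  - job_name: node-exporter          # assumed job name
    scrape_interval: 30s             # scrape every 30s for this job
    scrape_timeout: 25s              # up from the 10s default
    static_configs:
      - targets: ["HOST_110:9100"]   # placeholder; 9100 is node_exporter's default port
```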

Overall load average: 20 → 9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 13:53:13 +08:00