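Battlefield: 110 has held load=17 for 13 days + cadvisor on 188 at 321% CPU, and restarting it did not help
Commander's iron rule: don't just bring the numbers down, solve it for the long term → structural remediation, not patches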
This commit covers:
1. k8s/monitoring/docker-compose-110.yml
   - add mem_limit 512M + cpus 1.0 to cadvisor (L2 blast guard)
   - note the drift between the live file on 110 and this file (bring it under CD next session)
2. ops/monitoring/alerts-unified.yml gains an infra_self_monitoring group:
   - CadvisorDown / MemoryPressure / CPUThrottled
   - NodeExporterDown / CPUThrottled
   - SentryClickHouseMemoryPressure / CPUThrottled
   - GiteaMemoryPressure / CPUThrottled
   - PrometheusDown (the meta layer: monitoring the monitoring)
   → All of them evaluate (memory usage / spec_memory_limit) dynamically,
     with no hard-coded 80% or MB figures, so the thresholds follow automatically when a quota changes
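A minimal sketch of what the cadvisor rules in the infra_self_monitoring group could look like, assuming cAdvisor's standard metrics (`container_memory_working_set_bytes`, `container_spec_memory_limit_bytes`, `container_cpu_cfs_throttled_periods_total`, `container_cpu_cfs_periods_total`). The expanded alert names, the `job="cadvisor"` and `name="cadvisor"` selectors, the 0.9 and 0.25 ratios, the `for` durations, and the severity labels are all illustrative assumptions, not values taken from alerts-unified.yml.

```yaml
groups:
  - name: infra_self_monitoring
    rules:
      - alert: CadvisorDown
        # Assumption: the scrape job is called "cadvisor".
        expr: up{job="cadvisor"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "cAdvisor on {{ $labels.instance }} is not being scraped"

      - alert: CadvisorMemoryPressure
        # Working set divided by the container's own limit, so the alert
        # follows the quota whenever mem_limit changes. The "> 0" filter
        # drops containers without a limit (cAdvisor reports 0 for those).
        expr: |
          container_memory_working_set_bytes{name="cadvisor"}
            / (container_spec_memory_limit_bytes{name="cadvisor"} > 0)
            > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "cadvisor on {{ $labels.instance }} is near its memory limit"

      - alert: CadvisorCPUThrottled
        # Share of CFS periods in which the container was throttled.
        expr: |
          rate(container_cpu_cfs_throttled_periods_total{name="cadvisor"}[5m])
            / rate(container_cpu_cfs_periods_total{name="cadvisor"}[5m])
            > 0.25
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "cadvisor on {{ $labels.instance }} is being CPU-throttled"
```

Because both sides of each ratio come from the container itself, raising or lowering a quota in the compose file moves the effective threshold without touching the rule.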
Other supporting changes (outside this repo, already patched over SSH onto 110/188):
- /home/ollama/wooo-aiops/docker-compose.yml: cadvisor on 188 gets --disable_metrics / --docker_only / --housekeeping_interval + 1g/1.5c
- /home/wooo/monitoring/docker-compose.yml: cadvisor + node-exporter on 110 brought under management + cardinality-reduction flags + quotas
- /opt/sentry/docker-compose.override.yml: Sentry L2 quotas (clickhouse 8g/4c, kafka 3g/2c, etc.)
- /home/wooo/gitea/docker-compose.yml: Gitea 3g/3c
- /home/wooo/act-runner/docker-compose.yml: Actions Runner 2g/2c
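As a rough illustration of the 188 change, the fragment below combines the three named flags with the 1g/1.5c quota. The message above only names the flags; the flag values here are copied from the 110 compose file in this commit and are an assumption for 188, as is the service name `cadvisor`.

```yaml
services:
  cadvisor:
    command:
      # Assumed: same disabled-metric list as the 110 file.
      - --disable_metrics=disk,diskIO,tcp,udp,percpu,sched,process
      - --docker_only=true
      # Assumed: same interval as the 110 file.
      - --housekeeping_interval=30s
    # Quota from the commit message: 1g memory, 1.5 CPUs.
    mem_limit: 1g
    cpus: 1.5
```

The other quotas listed above (Sentry, Gitea, Actions Runner) would follow the same `mem_limit` / `cpus` pattern on their respective services.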
Maps to:
- feedback_monitor_self_monitoring.md 🔴🔴🔴 monitoring tools must themselves be monitored
- feedback_ai_autonomous_direction.md dynamic thresholds ≠ hard-coded rules
- ADR-090 Layer 2 enforced resource quotas
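Acceptance (48h):
- cadvisor CPU on 188 goes from 321% → <50% (enforced by quota)
- load5 on 110 goes from 18 → <10 (once Sentry/Gitea pressure is relieved)
- no false positives from the self-monitoring alerts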
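One way to track the first two acceptance numbers over the 48h window is a pair of Prometheus recording rules, sketched below. The group name, the recorded series names, and the `name="cadvisor"` selector are hypothetical; `node_load5` assumes node-exporter is scraped on 110.

```yaml
groups:
  - name: acceptance_48h          # hypothetical group name
    rules:
      # cadvisor CPU as a percentage of one core, per scraped host.
      # Acceptance target on 188: below 50.
      - record: acceptance:cadvisor_cpu_percent
        expr: |
          sum by (instance) (
            rate(container_cpu_usage_seconds_total{name="cadvisor"}[5m])
          ) * 100

      # 5-minute load average from node-exporter.
      # Acceptance target on 110: below 10.
      - record: acceptance:node_load5
        expr: node_load5
```

Graphing both series in Grafana over the acceptance window gives a direct before/after comparison against the 321% and 18 baselines.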
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
149 lines
4.2 KiB
YAML
version: '3.8'

services:
  # ---------------------------------------------------------------------------
  # cAdvisor on 110: blast guard (2026-04-19, early morning Taipei time / Claude Opus 4.7 / Phase 7 blind-spot remediation)
  # Why: cadvisor on 110 currently sits at 0% CPU (flags already reduce metric cardinality), but it has no mem_limit/cpus guard
  #      Add quotas as a permanent L2 guard: if load ever explodes, the container is OOM-killed instead of dragging down the 110 host
  # Maps to: ADR-090 Layer 2 enforced resource quotas
  # Note: this compose file risks drifting from the live file on 110 (/home/wooo/monitoring/docker-compose.yml);
  #       no CD workflow syncs them yet → tracked as tech debt, to be brought under Git + CD next session
  # ---------------------------------------------------------------------------
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    command:
      - --logtostderr
      - --disable_metrics=disk,diskIO,tcp,udp,percpu,sched,process
      - --housekeeping_interval=30s
      - --max_housekeeping_interval=60s
      - --docker_only=true
    ports:
      - "9180:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
      - /etc/localtime:/etc/localtime:ro
    environment:
      - TZ=Asia/Taipei
    privileged: true
    devices:
      - /dev/kmsg
    networks:
      - monitoring
    mem_limit: 512m
    cpus: 1.0

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./alerts.yml:/etc/prometheus/alerts.yml:ro
      - prometheus_data:/prometheus
      - /etc/localtime:/etc/localtime:ro
    environment:
      - TZ=Asia/Taipei
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      - '--web.enable-lifecycle'
    extra_hosts:
      - "host.docker.internal:host-gateway"
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3002:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=WoooTech2026
      - GF_SERVER_ROOT_URL=http://192.168.0.110:3002
      - TZ=Asia/Taipei
    volumes:
      - grafana_data:/var/lib/grafana
      - /etc/localtime:/etc/localtime:ro
    networks:
      - monitoring
    depends_on:
      - prometheus

  blackbox-exporter:
    image: prom/blackbox-exporter:latest
    container_name: blackbox-exporter
    restart: unless-stopped
    ports:
      - "9115:9115"
    volumes:
      - ./blackbox.yml:/etc/blackbox_exporter/config.yml:ro
      - /etc/localtime:/etc/localtime:ro
    environment:
      - TZ=Asia/Taipei
    command:
      - '--config.file=/etc/blackbox_exporter/config.yml'
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
      - /etc/localtime:/etc/localtime:ro
    environment:
      - TZ=Asia/Taipei
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    networks:
      - monitoring

  # === Phase 5: GitHub Exporter (OPS.176) ===
  github-exporter:
    image: promhippie/github-exporter:latest
    container_name: github-exporter
    restart: unless-stopped
    ports:
      - '9504:9504'
    environment:
      - GITHUB_EXPORTER_TOKEN=${GITHUB_TOKEN}
      - GITHUB_EXPORTER_REPOS=owenhytsai/wooo-aiops,owenhytsai/clawbot-v5
      - GITHUB_EXPORTER_LOG_LEVEL=info
    networks:
      - monitoring
    labels:
      - 'com.wooo.service=github-exporter'
      - 'com.wooo.phase=phase-5'
    logging:
      driver: json-file
      options:
        max-size: '10m'
        max-file: '3'

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data: