awoooi/k8s/monitoring/docker-compose-110.yml
OG T eab3f527cd
Some checks failed
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 1m21s
CD Pipeline / build-and-deploy (push) Failing after 9m24s
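feat(monitoring): Phase 7 blind-spot remediation: L2 quotas + self-monitoring alerts (ADR-090)
Context: host 110 has run at load=17 for 13 days; cadvisor on host 188 sits at 321% CPU and restarting it did not help
Guiding rule: do not just bring the numbers down, solve it for the long term → structural remediation, not patches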

This commit covers:
1. k8s/monitoring/docker-compose-110.yml
   - cadvisor: add mem_limit 512M + cpus 1.0 (L2 safety net)
   - Note the drift between the live config on 110 and this file (to be brought under CD next session)

2. ops/monitoring/alerts-unified.yml: new infra_self_monitoring group:
   - CadvisorDown / MemoryPressure / CPUThrottled
   - NodeExporterDown / CPUThrottled
   - SentryClickHouseMemoryPressure / CPUThrottled
   - GiteaMemoryPressure / CPUThrottled
   - PrometheusDown (the meta layer: monitoring the monitoring)
   → All rules evaluate (memory usage / spec_memory_limit) dynamically,
     with no hard-coded 80% or MB figures, so the threshold follows automatically when a quota changes
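
The ratio-based approach described above can be sketched as a Prometheus rule fragment. This is a minimal illustration, not the contents of alerts-unified.yml: the alert names are inferred from the list above, the `name="cadvisor"` selector assumes the container name from the compose file, and the 0.9 / 0.25 cutoffs and `for` durations are placeholder assumptions that the commit does not state. The metrics used are the standard cAdvisor series.

```yaml
groups:
  - name: infra_self_monitoring
    rules:
      # Memory pressure relative to the container's own limit.
      # The "> 0" filter drops containers that have no limit set (spec value 0).
      - alert: CadvisorMemoryPressure        # name inferred, not confirmed
        expr: |
          container_memory_working_set_bytes{name="cadvisor"}
            / (container_spec_memory_limit_bytes{name="cadvisor"} > 0)
          > 0.9                              # assumed cutoff
        for: 10m                             # assumed duration
        labels:
          severity: warning
        annotations:
          summary: "cadvisor memory is close to its configured limit"

      # Share of CFS periods in which the container was throttled.
      - alert: CadvisorCPUThrottled          # name inferred, not confirmed
        expr: |
          rate(container_cpu_cfs_throttled_periods_total{name="cadvisor"}[5m])
            / rate(container_cpu_cfs_periods_total{name="cadvisor"}[5m])
          > 0.25                             # assumed cutoff
        for: 10m                             # assumed duration
        labels:
          severity: warning
        annotations:
          summary: "cadvisor is being CPU-throttled by its quota"
```

Because the denominator is read from the container's configured limit, raising mem_limit from 512m to 1g changes the absolute firing point without any edit to the rule.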

Other supporting changes (outside this repo, already patched over SSH on 110/188):
- /home/ollama/wooo-aiops/docker-compose.yml: 188 cadvisor gets --disable_metrics / --docker_only / --housekeeping_interval + 1g/1.5c
- /home/wooo/monitoring/docker-compose.yml: 110 cadvisor + node-exporter brought under management + metric-trimming flags + quotas
- /opt/sentry/docker-compose.override.yml: Sentry L2 quotas (clickhouse 8g/4c, kafka 3g/2c, etc.)
- /home/wooo/gitea/docker-compose.yml: Gitea 3g/3c
- /home/wooo/act-runner/docker-compose.yml: Actions Runner 2g/2c

Maps to:
- feedback_monitor_self_monitoring.md 🔴🔴🔴 monitoring tools must themselves be monitored
- feedback_ai_autonomous_direction.md dynamic thresholds ≠ hard-coded rules
- ADR-090 Layer 2 resource quota enforcement

Acceptance (48h):
- 188 cadvisor CPU from 321% → <50% (enforced by quota)
- 110 load5 from 18 → <10 (once Sentry/Gitea pressure is relieved)
- No false positives from the self-monitoring alerts

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 01:50:41 +08:00

149 lines
4.2 KiB
YAML
version: '3.8'

services:
  # ---------------------------------------------------------------------------
  # cAdvisor 110 safety net (2026-04-19 early morning, Taipei / Claude Opus 4.7 / Phase 7 blind-spot remediation)
  # Why: cadvisor on 110 currently sits at 0% CPU (flags already trim the metric set), but has no mem_limit/cpus safety net
  #      Add quotas as a permanent L2 guard: if load ever explodes, the container is OOM-killed instead of dragging down host 110
  # Maps to: ADR-090 Layer 2 resource quota enforcement
  # Note: this compose file risks drifting from the live config on 110 (/home/wooo/monitoring/docker-compose.yml);
  #       no CD workflow syncs them yet → tracked as tech debt, to be brought under Git + CD next session
  # ---------------------------------------------------------------------------
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    command:
      - --logtostderr
      - --disable_metrics=disk,diskIO,tcp,udp,percpu,sched,process
      - --housekeeping_interval=30s
      - --max_housekeeping_interval=60s
      - --docker_only=true
    ports:
      - "9180:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
      - /etc/localtime:/etc/localtime:ro
    environment:
      - TZ=Asia/Taipei
    privileged: true
    devices:
      - /dev/kmsg
    networks:
      - monitoring
    mem_limit: 512m
    cpus: 1.0

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./alerts.yml:/etc/prometheus/alerts.yml:ro
      - prometheus_data:/prometheus
      - /etc/localtime:/etc/localtime:ro
    environment:
      - TZ=Asia/Taipei
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      - '--web.enable-lifecycle'
    extra_hosts:
      - "host.docker.internal:host-gateway"
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3002:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=WoooTech2026
      - GF_SERVER_ROOT_URL=http://192.168.0.110:3002
      - TZ=Asia/Taipei
    volumes:
      - grafana_data:/var/lib/grafana
      - /etc/localtime:/etc/localtime:ro
    networks:
      - monitoring
    depends_on:
      - prometheus

  blackbox-exporter:
    image: prom/blackbox-exporter:latest
    container_name: blackbox-exporter
    restart: unless-stopped
    ports:
      - "9115:9115"
    volumes:
      - ./blackbox.yml:/etc/blackbox_exporter/config.yml:ro
      - /etc/localtime:/etc/localtime:ro
    environment:
      - TZ=Asia/Taipei
    command:
      - '--config.file=/etc/blackbox_exporter/config.yml'
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
      - /etc/localtime:/etc/localtime:ro
    environment:
      - TZ=Asia/Taipei
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    networks:
      - monitoring

  # === Phase 5: GitHub Exporter (OPS.176) ===
  github-exporter:
    image: promhippie/github-exporter:latest
    container_name: github-exporter
    restart: unless-stopped
    ports:
      - '9504:9504'
    environment:
      - GITHUB_EXPORTER_TOKEN=${GITHUB_TOKEN}
      - GITHUB_EXPORTER_REPOS=owenhytsai/wooo-aiops,owenhytsai/clawbot-v5
      - GITHUB_EXPORTER_LOG_LEVEL=info
    networks:
      - monitoring
    labels:
      - 'com.wooo.service=github-exporter'
      - 'com.wooo.phase=phase-5'
    logging:
      driver: json-file
      options:
        max-size: '10m'
        max-file: '3'

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:
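
A note on the cadvisor quota: the file declares `version: '3.8'` but sets the limit with the service-level `mem_limit` and `cpus` keys. Current Compose (the `docker compose` plugin) ignores `version` and accepts these keys, whereas the legacy `docker-compose` 1.x binary validating against the 3.x schema may reject them. If that matters on host 110 (an assumption; the commit does not say which Compose is installed), the same limit can be written in the `deploy.resources` form, sketched here as an override fragment:

```yaml
# Hypothetical override fragment; values mirror mem_limit: 512m and cpus: 1.0 above.
services:
  cadvisor:
    deploy:
      resources:
        limits:
          cpus: "1.0"     # same 1-CPU ceiling
          memory: 512M    # same 512 MiB ceiling
```

With the legacy binary outside Swarm, `deploy` limits are typically applied only when run with `--compatibility`, so the effective limit is worth confirming with `docker inspect cadvisor` after deployment.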