fix(ops): alert on missing docker resource limits
Some checks failed
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Successful in 23s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 38s

Your Name
2026-05-05 15:01:31 +08:00
parent e24c8ea051
commit d08d1e4951
5 changed files with 44 additions and 6 deletions

@@ -174,6 +174,7 @@ rules:
 - DockerContainerCpuSustainedHigh
 - DockerContainerCpuRunawayCritical
 - DockerContainerMemoryLimitPressure
+- DockerContainerMissingResourceLimit
 - DockerContainerRestartSpike
 - DockerGiteaActionsJobStale
 response:

@@ -1608,7 +1608,8 @@ psql $DATABASE_URL -f apps/api/migrations/cleanup_duplicate_deprecated_playbooks
 | `3b73cc7` | CD paths narrowed; workflow-only commits no longer trigger a full image build/deploy |
 | `7d45f0c` | Docker textfile adds `docker_container_started_seconds` + `DockerGiteaActionsJobStale` |
 | `5e625f7` | 110 stale Gitea Actions job dry-run cleanup script + runbook/alert annotation |
-| `pending` | stale job cleanup policy thresholds aligned with workflow/job timeout buffers |
+| `72d66e4` | stale job cleanup policy thresholds aligned with workflow/job timeout buffers |
+| `pending` | `DockerContainerMissingResourceLimit` alert routing for Compose services missing CPU/memory guardrails |
 ### Next steps

@@ -42,6 +42,7 @@ Use these thresholds for alerting and AI triage:
 | Docker container CPU | > 2 cores for 10m | Warning diagnosis; check limit, backlog, and workload type. |
 | Docker container CPU | > 4 cores for 15m | Critical diagnosis; never generic restart. |
 | Docker memory / limit | > 85% for 10m | Warning; raise memory or reduce workload, never lower the limit. |
+| Docker CPU or memory limit | missing for 30m | Warning; add service-specific caps before the next spike. |
 | Docker restarts | > 5 in 15m | Critical; pull logs and fix crash signature. |
 | Systemd runner restarts | > 2 in 15m | Critical; inspect watchdog/drop-ins and active CI jobs. |
 | Systemd runner WatchdogSec | > 0 for 10m | Warning; GitHub Actions runner should not be killed by systemd watchdog. |
@@ -57,11 +58,12 @@ Use these thresholds for alerting and AI triage:
 5. For Kafka/Snuba, treat high CPU as backlog digestion unless lag stops decreasing.
 6. For monitoring tools, caps are required, but every cap must be paired with self-monitoring.
 7. Every Docker Compose host must emit `docker_container_cpu_cores`, `docker_container_memory_*`, and a restart counter via node-exporter textfile.
-8. Disable node-exporter collectors that are slow or failing on each host; exporter scrape time is part of the resource baseline.
-9. Runner services are part of the resource baseline even when jobs run outside Docker. Export `systemd_unit_*` metrics for all `actions.runner.*` services.
-10. CI runner limits must be set at the systemd unit level. Docker container limits do not protect host load from runner listener, worker, and `act` helper processes.
-11. Workflow-only commits must not trigger full CD image builds; CD should run only when runtime inputs change.
-12. Stale Gitea Actions cleanup must be dry-run first. Use `--apply` only when the container is older than its workflow/job policy threshold and has no recent useful logs.
+8. Important Docker services must have both CPU and memory guardrails; missing limit alerts should drive service-specific compose fixes, not generic restarts.
+9. Disable node-exporter collectors that are slow or failing on each host; exporter scrape time is part of the resource baseline.
+10. Runner services are part of the resource baseline even when jobs run outside Docker. Export `systemd_unit_*` metrics for all `actions.runner.*` services.
+11. CI runner limits must be set at the systemd unit level. Docker container limits do not protect host load from runner listener, worker, and `act` helper processes.
+12. Workflow-only commits must not trigger full CD image builds; CD should run only when runtime inputs change.
+13. Stale Gitea Actions cleanup must be dry-run first. Use `--apply` only when the container is older than its workflow/job policy threshold and has no recent useful logs.
 ## Next Safe Rollout Order

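Rule 8 in the list above could be checked before a deploy rather than only after the 30m alert fires. The following is a minimal sketch, assuming the Compose file has already been parsed into a plain dict and that limits are written with the service-level `cpus` and `mem_limit` keys named in the alert runbook. The function name, service names, and images are hypothetical, and limits declared under `deploy.resources.limits` are deliberately not considered.

```python
from typing import Any, Dict, List


def services_missing_limits(compose: Dict[str, Any]) -> Dict[str, List[str]]:
    """Return {service_name: [missing keys]} for services lacking cpus or mem_limit."""
    missing: Dict[str, List[str]] = {}
    for name, spec in compose.get("services", {}).items():
        # A key that is absent, empty, or zero counts as "no guardrail".
        absent = [key for key in ("cpus", "mem_limit") if not spec.get(key)]
        if absent:
            missing[name] = absent
    return missing


# Hypothetical parsed Compose content; 2 cores / 2 GiB mirrors the runbook's starting point.
example = {
    "services": {
        "momo-app": {"image": "example/momo-app", "cpus": 2.0, "mem_limit": "2g"},
        "momo-scheduler": {"image": "example/momo-scheduler", "cpus": 2.0},
        "litellm": {"image": "example/litellm"},
    }
}

if __name__ == "__main__":
    for service, keys in services_missing_limits(example).items():
        print(f"{service}: missing {', '.join(keys)}")
```

A check like this only reports which keys are absent; choosing the actual values remains a per-service decision, as the rule requires.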
@@ -663,6 +663,23 @@ groups:
     auto_repair_action: "ssh 192.168.0.{{ $labels.host }} 'docker ps -a --filter name={{ $labels.container_name }}; docker logs --tail=120 {{ $labels.container_name }}'"
     runbook: "Capture the crash signature first; if it is a config/DB/network problem, fix the configuration instead of restarting endlessly."
+- alert: DockerContainerMissingResourceLimit
+  # 2026-05-05 ogt + Codex: catch Compose services that silently run with unlimited CPU/memory.
+  expr: (docker_container_cpu_limit_cores{container_name=~"momo-.*|litellm|gitea|sentry-self-hosted-(clickhouse|kafka|taskworker|process-spans).*"} == 0) or (docker_container_memory_limit_bytes{container_name=~"momo-.*|litellm|gitea|sentry-self-hosted-(clickhouse|kafka|taskworker|process-spans).*"} == 0)
+  for: 30m
+  labels:
+    severity: warning
+    layer: docker
+    team: ops
+    alert_category: infrastructure
+    notification_type: TYPE-1
+    auto_repair: "true"
+  annotations:
+    summary: "Container {{ $labels.container_name }} is missing a CPU or memory limit"
+    description: "{{ $labels.container_name }} has no complete CPU/memory guardrail on the Docker host; a sustained spike could take down 110/188."
+    auto_repair_action: "ssh 192.168.0.{{ $labels.host }} 'docker inspect {{ $labels.container_name }} --format \"NanoCpus={{`{{.HostConfig.NanoCpus}}`}} Memory={{`{{.HostConfig.Memory}}`}}\"; docker stats --no-stream {{ $labels.container_name }}'"
+    runbook: "Inventory the workload first, then add cpus/mem_limit in the compose/service-specific playbook. Do not blindly lower ClickHouse/Kafka limits; momo app/scheduler can start at 2 cores/2 GiB."
 - alert: DockerGiteaActionsJobStale
   # 2026-05-05 ogt + Codex: Gitea/act job containers can outlive workflow timeout and keep 110 hot.
   expr: time() - docker_container_started_seconds{container_name=~"GITEA-ACTIONS-.*"} > 1200

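The `== 0` comparisons in the rule above only work if the textfile exporter emits `0` for a container that has no limit. The commit does not show that exporter, so the following is a sketch under assumptions: it reads the JSON array printed by `docker inspect`, treats `HostConfig.NanoCpus == 0` and `HostConfig.Memory == 0` as "unlimited" (the same fields the `auto_repair_action` prints), and omits the `host` label that the alert templates expect. The function name and sample containers are hypothetical, and a CPU cap set through `CpuQuota`/`CpuPeriod` instead of `NanoCpus` would be reported as missing by this sketch.

```python
import json
from typing import Any, Dict, List


def limit_metrics(inspect_json: str) -> List[str]:
    """Render limit gauges in Prometheus text format from `docker inspect` JSON output."""
    lines: List[str] = []
    for container in json.loads(inspect_json):
        name = container["Name"].lstrip("/")  # docker inspect prefixes names with "/"
        host_config: Dict[str, Any] = container["HostConfig"]
        cpu_cores = host_config.get("NanoCpus", 0) / 1e9  # 0 means no CPU limit
        memory_bytes = host_config.get("Memory", 0)  # 0 means no memory limit
        lines.append(
            f'docker_container_cpu_limit_cores{{container_name="{name}"}} {cpu_cores:g}'
        )
        lines.append(
            f'docker_container_memory_limit_bytes{{container_name="{name}"}} {memory_bytes}'
        )
    return lines


# Hypothetical inspect output: one capped container and one without any limit.
sample = json.dumps([
    {"Name": "/momo-app", "HostConfig": {"NanoCpus": 2000000000, "Memory": 2147483648}},
    {"Name": "/litellm", "HostConfig": {"NanoCpus": 0, "Memory": 0}},
])

if __name__ == "__main__":
    print("\n".join(limit_metrics(sample)))
```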
@@ -669,6 +669,23 @@ groups:
     auto_repair_action: "ssh 192.168.0.{{ $labels.host }} 'docker ps -a --filter name={{ $labels.container_name }}; docker logs --tail=120 {{ $labels.container_name }}'"
     runbook: "Capture the crash signature first; if it is a config/DB/network problem, fix the configuration instead of restarting endlessly."
+- alert: DockerContainerMissingResourceLimit
+  # 2026-05-05 ogt + Codex: catch Compose services that silently run with unlimited CPU/memory.
+  expr: (docker_container_cpu_limit_cores{container_name=~"momo-.*|litellm|gitea|sentry-self-hosted-(clickhouse|kafka|taskworker|process-spans).*"} == 0) or (docker_container_memory_limit_bytes{container_name=~"momo-.*|litellm|gitea|sentry-self-hosted-(clickhouse|kafka|taskworker|process-spans).*"} == 0)
+  for: 30m
+  labels:
+    severity: warning
+    layer: docker
+    team: ops
+    alert_category: infrastructure
+    notification_type: TYPE-1
+    auto_repair: "true"
+  annotations:
+    summary: "Container {{ $labels.container_name }} is missing a CPU or memory limit"
+    description: "{{ $labels.container_name }} has no complete CPU/memory guardrail on the Docker host; a sustained spike could take down 110/188."
+    auto_repair_action: "ssh 192.168.0.{{ $labels.host }} 'docker inspect {{ $labels.container_name }} --format \"NanoCpus={{`{{.HostConfig.NanoCpus}}`}} Memory={{`{{.HostConfig.Memory}}`}}\"; docker stats --no-stream {{ $labels.container_name }}'"
+    runbook: "Inventory the workload first, then add cpus/mem_limit in the compose/service-specific playbook. Do not blindly lower ClickHouse/Kafka limits; momo app/scheduler can start at 2 cores/2 GiB."
 - alert: DockerGiteaActionsJobStale
   # 2026-05-05 ogt + Codex: Gitea/act job containers can outlive workflow timeout and keep 110 hot.
   expr: time() - docker_container_started_seconds{container_name=~"GITEA-ACTIONS-.*"} > 1200
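
Prometheus `=~` matchers are fully anchored, so the `container_name` selector in `DockerContainerMissingResourceLimit` covers only names that match the whole pattern. The sketch below mirrors that behaviour with Python's `re.fullmatch` to show which names fall inside the alert's scope. The container names are hypothetical examples, and Python's regex engine stands in for Prometheus's RE2 here, which behaves the same for this particular pattern.

```python
import re

# Pattern copied from the alert expression in this commit.
LIMIT_ALERT_SELECTOR = (
    r"momo-.*|litellm|gitea|"
    r"sentry-self-hosted-(clickhouse|kafka|taskworker|process-spans).*"
)


def is_covered(container_name: str) -> bool:
    """True if the name matches the selector in full, as a Prometheus =~ matcher requires."""
    return re.fullmatch(LIMIT_ALERT_SELECTOR, container_name) is not None


if __name__ == "__main__":
    # Hypothetical container names used only for illustration.
    for name in (
        "momo-api",
        "gitea",
        "gitea-runner",
        "litellm-proxy",
        "sentry-self-hosted-kafka-1",
        "sentry-self-hosted-web-1",
    ):
        print(f"{name}: {'covered' if is_covered(name) else 'not covered'}")
```

Because `litellm` and `gitea` have no trailing `.*`, a container named with a suffix (for example a Compose-generated `-1`) would not be matched by the selector as written.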