fix(ops): persist host resource guardrails
@@ -44,6 +44,8 @@ ARG NEXT_PUBLIC_SENTRY_DSN=
 ENV NEXT_PUBLIC_API_URL=${NEXT_PUBLIC_API_URL}
 ENV NEXT_PUBLIC_SENTRY_DSN=${NEXT_PUBLIC_SENTRY_DSN}
 ENV NEXT_TELEMETRY_DISABLED=1
+# 2026-05-05 ogt + Codex: keep self-hosted 110 runner builds from saturating CPU.
+ENV NEXT_PRIVATE_BUILD_WORKER_COUNT=1

 # 2026-04-06 ogt: --mount=type=cache persists .next/cache for incremental compilation across builds
 # Only changed pages are recompiled; unchanged pages come straight from cache, saving 3-4 min
@@ -51,7 +53,7 @@ ENV NEXT_TELEMETRY_DISABLED=1
 # /root/.cache/turbo holds turbo's task output cache so unchanged packages are not rebuilt on every run
 RUN --mount=type=cache,target=/app/apps/web/.next/cache \
 --mount=type=cache,target=/root/.cache/turbo \
-pnpm turbo build --filter=@awoooi/web
+pnpm turbo build --filter=@awoooi/web --concurrency=1

 FROM base AS runner
 WORKDIR /app

@@ -6,6 +6,39 @@

 ---

+## 2026-05-05 | 110 Sentry resource limits persistence gap closed
+
+**Background**: The 110 guardrail alerts have cleared, but host load still shows a long tail. The commander was concerned that Claude Code had only applied live `docker update`, so the configuration would be lost again after a rebuild.
+
+**On-site findings**:
+- 188 has stabilized: load is about `2.26 / 2.84 / 3.21`, and the core momo/litellm/SignOz containers all have live CPU/memory guardrails. `HostBackupFailed` is still present but is unrelated to CPU/load.
+- 110 is still a Sentry long tail, not a runner or momo-type incident: ClickHouse uses about 2.2-3.0 cores, Kafka about 0.6 core, and taskworker/taskbroker/taskscheduler/redis/uptime-checker together form the background load.
+- ClickHouse is not currently stuck on queries: `system.processes` shows no long-running queries, `system.mutations` has nothing pending, and `system.merges` shows only short transaction merges. The largest table is `eap_items_1_local` at about `6.68 GiB`.
+- The Kafka consumer lag query shows no backlog growth; lowering ClickHouse/Kafka memory or generic restarts should no longer be the remedy.
+- The real gap: live limits already exist on 110, but `/opt/sentry/docker-compose.yml` persisted only `process-spans`. ClickHouse/Kafka/taskworker/taskbroker/taskscheduler/redis could revert to unlimited on a compose recreate.
+
+**Live fixes in this round**:
+- On 110, `/opt/sentry/docker-compose.yml` was backed up as `docker-compose.yml.bak-20260505-155707-codex-resource-limits`.
+- Persisted the core Sentry guardrails: ClickHouse `2 CPU / 8 GiB / 16 GiB swap`, Kafka `2 CPU / 3 GiB / 6 GiB swap`, taskworker `2 CPU / 2 GiB / 4 GiB swap`, taskbroker `1 CPU / 512 MiB / 1 GiB swap`, taskscheduler `0.5 CPU / 512 MiB / 1 GiB swap`, redis `0.5 CPU / 512 MiB / 1 GiB swap`, uptime-checker `0.5 CPU / 512 MiB / 1 GiB swap`.
+- Applied a live `docker update` only to uptime-checker; Sentry/ClickHouse/Kafka were not restarted, and the containers remain `Up 5 days`.
+- On 110, `/opt/sentry/clickhouse/config.xml` was backed up as `config.xml.bak-20260505-160120-codex-merge-pool4`. The ClickHouse background merge pool was lowered from `8` to `4`, the three thresholds from `6/4/6` to `3/2/3`, and `max_bytes_to_merge_at_max_space_in_pool` from `512MiB` to `256MiB`.
+- `SYSTEM RELOAD CONFIG` does not hot-apply these settings on ClickHouse 25.3, so only `sentry-self-hosted-clickhouse-1` was restarted. Before the restart there was `1` active foreground process (the query itself) and `0` pending mutations.
+
+**Verification**:
+- `docker compose config` passed for `/opt/sentry/docker-compose.yml` (only the upstream obsolete `version` warning).
+- `docker inspect` shows that the live limits for ClickHouse/Kafka/taskworker/taskbroker/taskscheduler/redis/uptime-checker all match the compose baseline.
+- 110 load dropped from about `12.50 / 13.10 / 13.35` to `7.41 / 10.60 / 12.35`. `HostLoadAverageSustainedHigh` is not firing, and `DockerContainerCpuSustainedHigh` is pending only for Sentry ClickHouse.
+- ClickHouse became healthy 16 seconds after the restart. Runtime settings were confirmed as `background_pool_size=4`, three thresholds `3/2/3`, and a merge cap of `268435456` bytes; active merges `0`, pending mutations `0`, and ClickHouse CPU dropped from about `2.1-2.7 cores` to `0.67 core`.
+- Because 4 merge threads can still push ClickHouse briefly back to 2.7 cores, the live + compose CPU quota was tightened from `4` to `2`, with memory kept at `8 GiB`. Subsequent topk shows ClickHouse at about `2.0 cores`, with the CPU quota protecting the host.
+- Later host `ps` output showed that one of the main remaining causes of `HostHighCpuLoad` is the CD Web image build: `node /app/.../next build` at about `1.4 cores`, stacked on top of Gitea/ClickHouse/Kafka. `NEXT_PRIVATE_BUILD_WORKER_COUNT=1` was added to `apps/web/Dockerfile`, and `pnpm turbo build --filter=@awoooi/web` now runs with `--concurrency=1`, so the Web build no longer pushes 110 into prolonged high CPU.
+- The old `HostHighCpuLoad` was changed from `CPU >80% for 5m` to an early warning at `CPU >90% for 10m`. True sustained overload and automatic diagnosis are handled by `HostLoadAverageSustainedHigh` at `load5/core >1.5 for 15m`.
+- The only Prometheus alerts still firing are `FlywheelExecutionRateMissing` and 188 `HostBackupFailed`; the Docker/runner guardrail alerts are clean.
+
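+**Next steps**:
+- If ClickHouse sustained CPU on 110 stays pending beyond the drain window, the next step is to check whether EAP/profiling/replay/uptime need to be retained. Do not start by lowering ClickHouse memory or restarting.
+- Bring the other unlimited low-traffic containers into the baseline in batches rather than all at once, to avoid squeezing secondary Sentry/Harbor/monitoring services into a new incident.
+- On 188, prioritize fixing `HostBackupFailed` and the noise from the momo scheduler Google Drive/blank-page checks; CPU/load is not the current blocker.
+
 ## 2026-05-05 | 110/188 CPU/Mem quota full inventory + Docker baseline monitoring rollout

 **Background**: The commander was concerned that Claude Code had misconfigured CPU/memory limits for services on 110/188, causing services to hang or chronic overload. This round continues the inventory of live Docker inspect / docker stats / compose declarations.
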
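The verification entry above checks that live `docker inspect` limits match the compose baseline. A minimal Python sketch of that comparison follows, assuming the standard `HostConfig` fields `NanoCpus`, `CpuQuota`, `CpuPeriod`, and `Memory`. The `BASELINE` subset, the `SAMPLE` payloads, and the helper names `live_limits` and `drift` are illustrative assumptions, not the repository's tooling.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

# Subset of the persisted guardrails from the log entry (CPU cores, memory bytes).
BASELINE = {
    "clickhouse": {"cpus": 2.0, "memory": 8 * GIB},
    "kafka": {"cpus": 2.0, "memory": 3 * GIB},
    "taskbroker": {"cpus": 1.0, "memory": 512 * MIB},
}


def live_limits(host_config: dict) -> dict:
    """Extract CPU cores and memory bytes from a `docker inspect` HostConfig dict."""
    nano = host_config.get("NanoCpus") or 0
    quota = host_config.get("CpuQuota") or 0
    period = host_config.get("CpuPeriod") or 100_000  # default CFS period when unset
    if nano:
        cpus = nano / 1e9
    elif quota > 0:
        cpus = quota / period
    else:
        cpus = 0.0  # 0 means no CPU limit
    return {"cpus": cpus, "memory": host_config.get("Memory") or 0}


def drift(baseline: dict, inspected: dict) -> list:
    """Return the services whose live limits differ from the baseline."""
    mismatched = []
    for name, want in baseline.items():
        got = live_limits(inspected.get(name, {}))
        if abs(got["cpus"] - want["cpus"]) > 1e-6 or got["memory"] != want["memory"]:
            mismatched.append(name)
    return mismatched


# Hypothetical HostConfig payloads; taskbroker simulates a recreate that lost its limits.
SAMPLE = {
    "clickhouse": {"NanoCpus": 2_000_000_000, "Memory": 8 * GIB},
    "kafka": {"CpuQuota": 200_000, "CpuPeriod": 100_000, "Memory": 3 * GIB},
    "taskbroker": {"NanoCpus": 0, "Memory": 0},
}

print(drift(BASELINE, SAMPLE))
```

In practice the `inspected` mapping would be built from the JSON that `docker inspect` prints for each container; the sketch keeps that step out so it runs without a Docker daemon.
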
@@ -9,11 +9,13 @@

 | Service | Live Limit | Live Usage Snapshot | Verdict |
 |---|---:|---:|---|
-| Sentry ClickHouse | 4 CPU / 8 GiB | ~235-291% CPU / 3.3-3.4 GiB | CPU capped but still hottest. Do not lower memory; keep merge settings explicit. |
+| Sentry ClickHouse | 2 CPU / 8 GiB, merge pool 4 | capped near 2 cores after pool 8 -> 4 restart | Do not lower memory. CPU quota intentionally slows background merge so Sentry cannot dominate 110. If backlog grows, inspect `MergeMutate` and Sentry high-volume features before raising it. |
 | Sentry Kafka | 2 CPU / 3 GiB | ~40-55% CPU / 2.5 GiB (84%) | Memory is close to pressure. Do not reduce memory. |
 | Sentry taskworker | 2 CPU / 2 GiB, concurrency 2 | ~120-181% CPU after restart | Concurrency reduced from 4 to 2 after Kafka lag cleared. Watch Sentry task latency before further changes. |
 | Sentry taskbroker | 1 CPU / 512 MiB | ~70-98% CPU / 160 MiB | CPU is tight; increasing may improve backlog but can raise host load. |
 | Sentry taskscheduler | 0.5 CPU / 512 MiB | ~13% CPU / 387 MiB (76%) | Memory is tight; alert at 85% before it stalls. |
+| Sentry redis | 0.5 CPU / 512 MiB | ~15-30% CPU / 19 MiB | Live and compose cap are aligned. |
+| Sentry uptime-checker | 0.5 CPU / 512 MiB | ~26-30% CPU / 43-187 MiB | Capped after it showed sustained background CPU. |
 | Gitea | 3 CPU / 3 GiB | ~4% CPU / 2.18 GiB (73%) | Good cap; memory headroom is not huge. |
 | GitHub/Gitea runners | unlimited systemd services | one runner had WatchdogSec=5min and 8,490 restarts; `act` CI containers caused load spikes | Must be monitored outside Docker. Remove bad watchdog drop-in and apply per-runner CPU/Memory quotas. |
 | node-exporter | 1 CPU / 256 MiB | ~0-5% CPU / 8 MiB | Good after disabling expensive `arp`, `netclass`, and `netdev` collectors. |
@@ -28,8 +30,11 @@
 | SignOz ClickHouse | 4 CPU / 24 GiB | ~93-133% CPU / 1.1 GiB | Healthy enough; keep current cap. |
 | SignOz Zookeeper | 1 CPU / 2 GiB | ~8-18% CPU / 1.09 GiB | OK. |
 | cadvisor | 1.5 CPU / 1 GiB | ~0% CPU / 28 MiB | Good. |
-| litellm | unlimited | ~0.6-0.9% CPU / 780 MiB | Add modest cap after observing traffic; do not re-add DATABASE_URL. |
-| momo-pro-system / momo-db | unlimited | DB had short CPU bursts, then ~0.6% with no active long query | Needs service-specific limits after scheduler/schema pressure is controlled. |
+| litellm | 1 CPU / 1 GiB | ~0.5-0.9% CPU / 780 MiB | Good cap; keep stateless mode and do not re-add `DATABASE_URL`. |
+| momo-pro-system | 2 CPU / 2 GiB | ~1-2% CPU / 740 MiB | Good cap; startup cache prewarm must stay single-flight. |
+| momo-scheduler | 2 CPU / 2 GiB | ~0.3% CPU / 105-163 MiB after crawler burst | CPU cap is working. Next fix is crawler concurrency and failed background jobs, not lower CPU. |
+| momo-telegram-bot | 0.5 CPU / 512 MiB | ~0.7% CPU / 66 MiB | Good cap. |
+| momo-db | 2 CPU / 4 GiB | DB had short CPU bursts, then ~0.6-29% with no active long query | Good cap; current bursts are query/workload, not limit pressure. |
 | Monitoring tools / websites / exporters | mostly unlimited | low | Add caps gradually with textfile alerts watching pressure. |

 ## Baseline Policy
@@ -69,12 +74,13 @@ Use these thresholds for alerting and AI triage:

 1. Deploy `scripts/ops/docker-stats-textfile-exporter.py` to 110 and 188 textfile collector cron.
 2. Reload Prometheus rules with the new Docker CPU/memory/restart baseline alerts.
-3. Observe 110 for one drain window after node-exporter collector trim and taskworker concurrency 2. Kafka lag is now near zero; if ClickHouse remains high, tune merge/query behavior, not Kafka consumers.
-4. Tune `momo-scheduler` crawler concurrency on 188; keep 2 CPU / 2 GiB until success rate and latency prove it is too low.
-5. Fix 188 Elephant Alpha/OpenClaw allowed-action drift before enabling resource auto-repair beyond diagnosis.
-6. Add modest caps to currently unlimited low-risk services in small batches.
-7. Deploy `scripts/ops/stop-stale-gitea-actions-jobs.sh` to 110 as `/home/wooo/scripts/stop-stale-gitea-actions-jobs.sh`; keep Prometheus auto action in dry-run mode.
-8. Fix 110 runner services with sudo-capable host maintenance:
+3. Persist live limits in the owning compose files before considering the host repaired; live `docker update` alone is not durable.
+4. Observe 110 for one drain window after node-exporter collector trim and taskworker concurrency 2. Kafka lag is now near zero; if ClickHouse remains high, tune merge/query behavior or reduce high-volume Sentry features, not Kafka memory.
+5. Tune `momo-scheduler` crawler concurrency on 188; keep 2 CPU / 2 GiB until success rate and latency prove it is too low.
+6. Fix 188 Elephant Alpha/OpenClaw allowed-action drift before enabling resource auto-repair beyond diagnosis.
+7. Add modest caps to currently unlimited low-risk services in small batches. Do not alert every unlimited auxiliary container at once; promote candidates only after 24h usage data.
+8. Deploy `scripts/ops/stop-stale-gitea-actions-jobs.sh` to 110 as `/home/wooo/scripts/stop-stale-gitea-actions-jobs.sh`; keep Prometheus auto action in dry-run mode.
+9. Fix 110 runner services with sudo-capable host maintenance:

 ```bash
 sudo /home/wooo/scripts/apply-runner-systemd-guardrails.sh --apply
@@ -88,3 +94,4 @@ sudo /home/wooo/scripts/apply-runner-systemd-guardrails.sh --apply
 - Treating "no alert" as healthy when cAdvisor or textfile exporters are missing.
 - Letting monitoring collectors spend seconds per scrape; this turns observability into load.
 - Leaving self-hosted runners unlimited on the same host as Sentry/ClickHouse/Gitea.
+- Applying live `docker update` without persisting the same guardrail in compose/systemd/IaC.

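The baseline steps above rely on a textfile exporter feeding Docker limit metrics to node-exporter. The script itself is not part of this diff, so the following is only a minimal Python sketch of what such an exporter could write. The metric names and the `container_name` label come from the `DockerContainerMissingResourceLimit` rule, where `0` stands for unlimited; the `render_limits` helper, the gauge type, and the sample values are assumptions.

```python
def render_limits(limits: dict) -> str:
    """Render Prometheus text-format lines for per-container limits.

    `limits` maps container name -> (cpu_cores, memory_bytes); 0 means unlimited.
    """
    cpu_lines = ["# TYPE docker_container_cpu_limit_cores gauge"]
    mem_lines = ["# TYPE docker_container_memory_limit_bytes gauge"]
    for name in sorted(limits):
        cores, mem = limits[name]
        cpu_lines.append(
            f'docker_container_cpu_limit_cores{{container_name="{name}"}} {float(cores)}'
        )
        mem_lines.append(
            f'docker_container_memory_limit_bytes{{container_name="{name}"}} {int(mem)}'
        )
    return "\n".join(cpu_lines + mem_lines) + "\n"


# Hypothetical snapshot: litellm capped at 1 CPU / 1 GiB, one container still unlimited.
print(render_limits({"litellm": (1, 1024 ** 3), "some-exporter": (0, 0)}), end="")
```

A cron job would typically write this output to a temporary file in the node-exporter textfile directory and rename it into place, so a scrape never reads a half-written file.
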
@@ -33,8 +33,10 @@ groups:
 description: "Node Exporter unresponsive for more than 1 minute"

 - alert: HostHighCpuLoad
-expr: 100 - (avg by(host) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
-for: 5m
+# 2026-05-05 ogt + Codex: keep this as early warning only.
+# Sustained overload/root-cause automation is handled by HostLoadAverageSustainedHigh.
+expr: 100 - (avg by(host) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
+for: 10m
 labels:
 severity: warning
 layer: systemd-188
@@ -46,7 +48,7 @@ groups:
 alert_category: "host_resource"
 annotations:
 summary: "Host {{ $labels.host }} CPU load is high"
-description: "CPU usage exceeds 80%"
+description: "CPU usage has exceeded 90% for 10 minutes; if load5/core has not exceeded 1.5, treat this as capacity observation and diagnosis first, and do not repair directly."

 - alert: HostLoadAverageSustainedHigh
 # 2026-05-05 ogt + Codex: sustained-overload baseline for 110/188.
@@ -665,7 +667,7 @@ groups:

 - alert: DockerContainerMissingResourceLimit
 # 2026-05-05 ogt + Codex: catch Compose services that silently run with unlimited CPU/memory.
-expr: (docker_container_cpu_limit_cores{container_name=~"momo-.*|litellm|gitea|sentry-self-hosted-(clickhouse|kafka|taskworker|process-spans).*"} == 0) or (docker_container_memory_limit_bytes{container_name=~"momo-.*|litellm|gitea|sentry-self-hosted-(clickhouse|kafka|taskworker|process-spans).*"} == 0)
+expr: (docker_container_cpu_limit_cores{container_name=~"momo-.*|litellm|gitea|sentry-self-hosted-(clickhouse|kafka|taskworker|taskbroker|taskscheduler|redis|uptime-checker|process-spans).*"} == 0) or (docker_container_memory_limit_bytes{container_name=~"momo-.*|litellm|gitea|sentry-self-hosted-(clickhouse|kafka|taskworker|taskbroker|taskscheduler|redis|uptime-checker|process-spans).*"} == 0)
 for: 30m
 labels:
 severity: warning

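The `HostHighCpuLoad` and `HostLoadAverageSustainedHigh` thresholds in the rules above split host pressure into an early warning and a sustained-overload signal. A minimal Python sketch of that split on a single sample follows. It ignores the `for:` hold durations, and the `classify_host` helper and the 8-core host are assumptions; the diff does not state the core count of 110.

```python
def classify_host(cpu_busy_pct: float, load5: float, cores: int) -> str:
    """Classify one sample against the two thresholds from the rules.

    - load5 / cores > 1.5 -> sustained overload (HostLoadAverageSustainedHigh)
    - CPU busy > 90%      -> early warning only (HostHighCpuLoad)
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    if load5 / cores > 1.5:
        return "sustained-overload"
    if cpu_busy_pct > 90.0:
        return "early-warning"
    return "ok"


# Load5 values taken from the ops log; CPU percentages and core count are hypothetical.
for cpu, load5 in [(95.0, 10.60), (50.0, 13.10), (85.0, 2.84)]:
    print(cpu, load5, classify_host(cpu, load5, cores=8))
```

Checking the load ratio first mirrors the intent of the rule comments: a busy CPU alone only warrants observation, while diagnosis automation is tied to load per core.
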
@@ -33,8 +33,10 @@ groups:
 description: "Node Exporter unresponsive for more than 1 minute"

 - alert: HostHighCpuLoad
-expr: 100 - (avg by(host) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
-for: 5m
+# 2026-05-05 ogt + Codex: keep this as early warning only.
+# Sustained overload/root-cause automation is handled by HostLoadAverageSustainedHigh.
+expr: 100 - (avg by(host) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
+for: 10m
 labels:
 severity: warning
 layer: systemd-188
@@ -46,7 +48,7 @@ groups:
 alert_category: "host_resource"
 annotations:
 summary: "Host {{ $labels.host }} CPU load is high"
-description: "CPU usage exceeds 80%"
+description: "CPU usage has exceeded 90% for 10 minutes; if load5/core has not exceeded 1.5, treat this as capacity observation and diagnosis first, and do not repair directly."
 # 2026-05-02 ogt + Claude Sonnet 4.6: steer the LLM toward SSH diagnosis rather than kubectl
 auto_repair_action: "ssh {{ $labels.instance }} 'ps aux --sort=-%cpu | head -20' (host CPU diagnosis; kubectl restart awoooi-* is forbidden because the main cause is usually a third-party service such as Sentry/ClickHouse/Snuba)"
 runbook: "Host CPU high-load triage: first SSH in and run ps aux to see the top processes; if the cause is a third-party service (Sentry/ClickHouse, etc.), write an ADR to upgrade resources or adjust limits; cross-domain kubectl restart is forbidden"
@@ -671,7 +673,7 @@ groups:

 - alert: DockerContainerMissingResourceLimit
 # 2026-05-05 ogt + Codex: catch Compose services that silently run with unlimited CPU/memory.
-expr: (docker_container_cpu_limit_cores{container_name=~"momo-.*|litellm|gitea|sentry-self-hosted-(clickhouse|kafka|taskworker|process-spans).*"} == 0) or (docker_container_memory_limit_bytes{container_name=~"momo-.*|litellm|gitea|sentry-self-hosted-(clickhouse|kafka|taskworker|process-spans).*"} == 0)
+expr: (docker_container_cpu_limit_cores{container_name=~"momo-.*|litellm|gitea|sentry-self-hosted-(clickhouse|kafka|taskworker|taskbroker|taskscheduler|redis|uptime-checker|process-spans).*"} == 0) or (docker_container_memory_limit_bytes{container_name=~"momo-.*|litellm|gitea|sentry-self-hosted-(clickhouse|kafka|taskworker|taskbroker|taskscheduler|redis|uptime-checker|process-spans).*"} == 0)
 for: 30m
 labels:
 severity: warning

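The widened `container_name=~` selector in `DockerContainerMissingResourceLimit` decides which containers are checked for missing limits. Prometheus regex matchers are fully anchored, so Python's `re.fullmatch` is a reasonable stand-in for checking coverage offline. The sketch below compares the old and new patterns; apart from `sentry-self-hosted-clickhouse-1`, the container names are hypothetical examples.

```python
import re

OLD = r"momo-.*|litellm|gitea|sentry-self-hosted-(clickhouse|kafka|taskworker|process-spans).*"
NEW = (
    r"momo-.*|litellm|gitea|sentry-self-hosted-"
    r"(clickhouse|kafka|taskworker|taskbroker|taskscheduler|redis|uptime-checker|process-spans).*"
)


def covered(name: str, pattern: str) -> bool:
    """True if the selector would match this container name (anchored match)."""
    return re.fullmatch(pattern, name) is not None


for name in [
    "sentry-self-hosted-clickhouse-1",
    "sentry-self-hosted-redis-1",
    "sentry-self-hosted-uptime-checker-1",
    "momo-db",
    "gitea-runner",
    "cadvisor",
]:
    print(f"{name}: old={covered(name, OLD)} new={covered(name, NEW)}")
```

Because the `gitea` alternative has no trailing `.*`, a name such as `gitea-runner` stays outside the rule, which is consistent with the note that runners must be monitored outside Docker.
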