fix(ops): persist host resource guardrails
All checks were successful
CD Pipeline / tests (push) Successful in 5m25s
Code Review / ai-code-review (push) Successful in 25s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 37s
CD Pipeline / build-and-deploy (push) Successful in 7m31s
CD Pipeline / post-deploy-checks (push) Successful in 5m10s

Your Name
2026-05-05 16:13:02 +08:00
parent 84a661beaf
commit 2221fd3256
5 changed files with 64 additions and 18 deletions

View File

@@ -44,6 +44,8 @@ ARG NEXT_PUBLIC_SENTRY_DSN=
ENV NEXT_PUBLIC_API_URL=${NEXT_PUBLIC_API_URL}
ENV NEXT_PUBLIC_SENTRY_DSN=${NEXT_PUBLIC_SENTRY_DSN}
ENV NEXT_TELEMETRY_DISABLED=1
# 2026-05-05 ogt + Codex: keep self-hosted 110 runner builds from saturating CPU.
ENV NEXT_PRIVATE_BUILD_WORKER_COUNT=1
# 2026-04-06 ogt: --mount=type=cache persists .next/cache for incremental compilation across builds
# Only changed pages are recompiled; unchanged pages come straight from the cache → saves 3-4 min
@@ -51,7 +53,7 @@ ENV NEXT_TELEMETRY_DISABLED=1
# /root/.cache/turbo holds turbo's task output cache, so unchanged packages are not rebuilt on every run
RUN --mount=type=cache,target=/app/apps/web/.next/cache \
--mount=type=cache,target=/root/.cache/turbo \
pnpm turbo build --filter=@awoooi/web
pnpm turbo build --filter=@awoooi/web --concurrency=1
FROM base AS runner
WORKDIR /app

View File

@@ -6,6 +6,39 @@
---
## 2026-05-05 | 110 Sentry resource limits persistence gap closed
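**Background**: The 110 guardrail alerts have cleared, but host load still has a long tail. The Commander was concerned that Claude Code had only applied live `docker update` changes, which would be lost once the containers were recreated.
**On-site findings**:
- 188 has stabilized: load is about `2.26 / 2.84 / 3.21`; the momo/litellm/SignOz core containers all have live CPU/memory guardrails; `HostBackupFailed` is still firing but is unrelated to CPU/load.
- 110 is still a Sentry long tail, not a runner or momo-type incident: ClickHouse uses about 2.2-3.0 cores, Kafka about 0.6 core, and taskworker/taskbroker/taskscheduler/redis/uptime-checker together make up the background load.
- ClickHouse is not currently stuck on queries: `system.processes` shows no long-running queries, `system.mutations` has nothing pending, and `system.merges` shows only short transaction merges; the largest table is `eap_items_1_local` at `6.68 GiB`.
- Kafka consumer lag queries show no growing backlog; at this point we should not fall back on lowering ClickHouse/Kafka memory or on blanket restarts.
- The real gap: live limits already exist on 110, but `/opt/sentry/docker-compose.yml` only persisted `process-spans`; ClickHouse/Kafka/taskworker/taskbroker/taskscheduler/redis could revert to unlimited on a compose recreate.
**Live fixes in this round**:
- On 110, `/opt/sentry/docker-compose.yml` was backed up as `docker-compose.yml.bak-20260505-155707-codex-resource-limits`.
- Persisted the core Sentry guardrails: ClickHouse `2 CPU / 8 GiB / 16 GiB swap`, Kafka `2 CPU / 3 GiB / 6 GiB swap`, taskworker `2 CPU / 2 GiB / 4 GiB swap`, taskbroker `1 CPU / 512 MiB / 1 GiB swap`, taskscheduler `0.5 CPU / 512 MiB / 1 GiB swap`, redis `0.5 CPU / 512 MiB / 1 GiB swap`, uptime-checker `0.5 CPU / 512 MiB / 1 GiB swap`.
- Applied a live `docker update` to uptime-checker only; Sentry/ClickHouse/Kafka were not restarted, and the containers still show `Up 5 days`.
- On 110, `/opt/sentry/clickhouse/config.xml` was backed up as `config.xml.bak-20260505-160120-codex-merge-pool4`; the ClickHouse background merge pool was lowered from `8` to `4`, the three thresholds from `6/4/6` to `3/2/3`, and `max_bytes_to_merge_at_max_space_in_pool` from `512MiB` to `256MiB`.
- `SYSTEM RELOAD CONFIG` does not hot-apply these ClickHouse 25.3 settings, so only `sentry-self-hosted-clickhouse-1` was restarted; before the restart there was `1` active foreground process (the check query itself) and `0` pending mutations.
**Verification**:
- `docker compose config` against `/opt/sentry/docker-compose.yml` passed, with only the upstream `version` obsolete warning.
- `docker inspect` shows the live limits for ClickHouse/Kafka/taskworker/taskbroker/taskscheduler/redis/uptime-checker all match the compose baseline.
- 110 load dropped from about `12.50 / 13.10 / 13.35` to `7.41 / 10.60 / 12.35`; `HostLoadAverageSustainedHigh` is not firing, and `DockerContainerCpuSustainedHigh` is pending only for Sentry ClickHouse.
- ClickHouse was healthy 16 seconds after the restart; runtime settings confirmed `background_pool_size=4`, the three thresholds at `3/2/3`, and a merge cap of `268435456` bytes; active merges `0`, pending mutations `0`, and ClickHouse CPU fell from about `2.1-2.7 cores` to `0.67 core`.
- Because 4 merge threads can still briefly push ClickHouse back to 2.7 cores, the live and compose CPU quota was tightened from `4` to `2`, with memory kept at `8 GiB`; a later topk shows ClickHouse at about `2.0 cores`, with the CPU quota protecting the host.
- A later host `ps` showed that one of the main remaining causes of `HostHighCpuLoad` is the CD Web image build: `node /app/.../next build` at `1.4 cores`, stacked on top of Gitea/ClickHouse/Kafka. `apps/web/Dockerfile` now sets `NEXT_PRIVATE_BUILD_WORKER_COUNT=1`, and `pnpm turbo build --filter=@awoooi/web` now runs with `--concurrency=1`, so the Web build no longer pushes 110 into prolonged high CPU.
- `HostHighCpuLoad` was changed from `CPU >80% for 5m` to `CPU >90% for 10m` and now serves as an early warning; genuine sustained overload and automated diagnosis are left to `HostLoadAverageSustainedHigh` (`load5/core >1.5 for 15m`).
- The only Prometheus alerts still firing are `FlywheelExecutionRateMissing` and `HostBackupFailed` on 188; the Docker/runner guardrail alerts are clean.
**Next steps**:
- On 110, if ClickHouse sustained CPU is still pending beyond the drain window, next check whether EAP/profiling/replay/uptime need to be kept; do not start by lowering ClickHouse memory or restarting.
- Bring the other unlimited low-traffic containers into the baseline in batches rather than all at once, to avoid squeezing secondary Sentry/Harbor/monitoring services into a new incident.
- On 188, prioritize fixing `HostBackupFailed` and the momo scheduler Google Drive/blank-page check noise; CPU/load is not the current blocker.
## 2026-05-05 | 110/188 CPU/Mem quota full inventory + Docker baseline monitoring rollout
**Background**: The Commander was concerned that Claude Code had misconfigured CPU/memory limits for services on 110/188, causing services to hang or run chronically overloaded; this round continues the inventory of live Docker inspect / docker stats / compose declarations.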
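
The "live limits match the compose baseline" check in the entry above could be sketched roughly as follows. This is only an illustration, not the script used on 110: `BASELINE` and `find_drift` are hypothetical names, the CPU/memory values are copied from the changelog, and it assumes the limits appear in `docker inspect` as `HostConfig.NanoCpus` and `HostConfig.Memory` (limits set via `cpu_quota` would show up under other fields).

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

# (CPU cores, memory bytes) per service, copied from the changelog entry.
# The dict itself is a hypothetical structure for this sketch.
BASELINE = {
    "clickhouse": (2.0, 8 * GIB),
    "kafka": (2.0, 3 * GIB),
    "taskworker": (2.0, 2 * GIB),
    "taskbroker": (1.0, 512 * MIB),
    "taskscheduler": (0.5, 512 * MIB),
    "redis": (0.5, 512 * MIB),
    "uptime-checker": (0.5, 512 * MIB),
}


def find_drift(service, host_config):
    """Compare one container's HostConfig (from docker inspect) to the baseline.

    Returns a list of mismatch descriptions; an empty list means no drift.
    A value of 0 in NanoCpus or Memory means the container is unlimited.
    """
    want_cpus, want_mem = BASELINE[service]
    live_nano = host_config.get("NanoCpus", 0)
    live_mem = host_config.get("Memory", 0)
    problems = []
    if live_nano != int(want_cpus * 1_000_000_000):
        problems.append(f"{service}: cpu live={live_nano / 1e9:g} want={want_cpus:g}")
    if live_mem != want_mem:
        problems.append(f"{service}: memory live={live_mem} want={want_mem}")
    return problems
```

Feeding it the parsed `HostConfig` of each container after a compose recreate would surface any service that silently fell back to unlimited.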

View File

@@ -9,11 +9,13 @@
| Service | Live Limit | Live Usage Snapshot | Verdict |
|---|---:|---:|---|
| Sentry ClickHouse | 4 CPU / 8 GiB | ~235-291% CPU / 3.3-3.4 GiB | CPU capped but still hottest. Do not lower memory; keep merge settings explicit. |
| Sentry ClickHouse | 2 CPU / 8 GiB, merge pool 4 | capped near 2 cores after pool 8 -> 4 restart | Do not lower memory. CPU quota intentionally slows background merge so Sentry cannot dominate 110. If backlog grows, inspect `MergeMutate` and Sentry high-volume features before raising it. |
| Sentry Kafka | 2 CPU / 3 GiB | ~40-55% CPU / 2.5 GiB (84%) | Memory is close to pressure. Do not reduce memory. |
| Sentry taskworker | 2 CPU / 2 GiB, concurrency 2 | ~120-181% CPU after restart | Concurrency reduced from 4 to 2 after Kafka lag cleared. Watch Sentry task latency before further changes. |
| Sentry taskbroker | 1 CPU / 512 MiB | ~70-98% CPU / 160 MiB | CPU is tight; increasing may improve backlog but can raise host load. |
| Sentry taskscheduler | 0.5 CPU / 512 MiB | ~13% CPU / 387 MiB (76%) | Memory is tight; alert at 85% before it stalls. |
| Sentry redis | 0.5 CPU / 512 MiB | ~15-30% CPU / 19 MiB | Live and compose cap are aligned. |
| Sentry uptime-checker | 0.5 CPU / 512 MiB | ~26-30% CPU / 43-187 MiB | Capped after it showed sustained background CPU. |
| Gitea | 3 CPU / 3 GiB | ~4% CPU / 2.18 GiB (73%) | Good cap; memory headroom is not huge. |
| GitHub/Gitea runners | unlimited systemd services | one runner had WatchdogSec=5min and 8,490 restarts; `act` CI containers caused load spikes | Must be monitored outside Docker. Remove bad watchdog drop-in and apply per-runner CPU/Memory quotas. |
| node-exporter | 1 CPU / 256 MiB | ~0-5% CPU / 8 MiB | Good after disabling expensive `arp`, `netclass`, and `netdev` collectors. |
@@ -28,8 +30,11 @@
| SignOz ClickHouse | 4 CPU / 24 GiB | ~93-133% CPU / 1.1 GiB | Healthy enough; keep current cap. |
| SignOz Zookeeper | 1 CPU / 2 GiB | ~8-18% CPU / 1.09 GiB | OK. |
| cadvisor | 1.5 CPU / 1 GiB | ~0% CPU / 28 MiB | Good. |
| litellm | unlimited | ~0.6-0.9% CPU / 780 MiB | Add modest cap after observing traffic; do not re-add DATABASE_URL. |
| momo-pro-system / momo-db | unlimited | DB had short CPU bursts, then ~0.6% with no active long query | Needs service-specific limits after scheduler/schema pressure is controlled. |
| litellm | 1 CPU / 1 GiB | ~0.5-0.9% CPU / 780 MiB | Good cap; keep stateless mode and do not re-add `DATABASE_URL`. |
| momo-pro-system | 2 CPU / 2 GiB | ~1-2% CPU / 740 MiB | Good cap; startup cache prewarm must stay single-flight. |
| momo-scheduler | 2 CPU / 2 GiB | ~0.3% CPU / 105-163 MiB after crawler burst | CPU cap is working. Next fix is crawler concurrency and failed background jobs, not lower CPU. |
| momo-telegram-bot | 0.5 CPU / 512 MiB | ~0.7% CPU / 66 MiB | Good cap. |
| momo-db | 2 CPU / 4 GiB | DB had short CPU bursts, then ~0.6-29% with no active long query | Good cap; current bursts are query/workload, not limit pressure. |
| Monitoring tools / websites / exporters | mostly unlimited | low | Add caps gradually with textfile alerts watching pressure. |
## Baseline Policy
@@ -69,12 +74,13 @@ Use these thresholds for alerting and AI triage:
1. Deploy `scripts/ops/docker-stats-textfile-exporter.py` to 110 and 188 textfile collector cron.
2. Reload Prometheus rules with the new Docker CPU/memory/restart baseline alerts.
3. Observe 110 for one drain window after node-exporter collector trim and taskworker concurrency 2. Kafka lag is now near zero; if ClickHouse remains high, tune merge/query behavior, not Kafka consumers.
4. Tune `momo-scheduler` crawler concurrency on 188; keep 2 CPU / 2 GiB until success rate and latency prove it is too low.
5. Fix 188 Elephant Alpha/OpenClaw allowed-action drift before enabling resource auto-repair beyond diagnosis.
6. Add modest caps to currently unlimited low-risk services in small batches.
7. Deploy `scripts/ops/stop-stale-gitea-actions-jobs.sh` to 110 as `/home/wooo/scripts/stop-stale-gitea-actions-jobs.sh`; keep Prometheus auto action in dry-run mode.
8. Fix 110 runner services with sudo-capable host maintenance:
3. Persist live limits in the owning compose files before considering the host repaired; live `docker update` alone is not durable.
4. Observe 110 for one drain window after node-exporter collector trim and taskworker concurrency 2. Kafka lag is now near zero; if ClickHouse remains high, tune merge/query behavior or reduce high-volume Sentry features, not Kafka memory.
5. Tune `momo-scheduler` crawler concurrency on 188; keep 2 CPU / 2 GiB until success rate and latency prove it is too low.
6. Fix 188 Elephant Alpha/OpenClaw allowed-action drift before enabling resource auto-repair beyond diagnosis.
7. Add modest caps to currently unlimited low-risk services in small batches. Do not alert every unlimited auxiliary container at once; promote candidates only after 24h usage data.
8. Deploy `scripts/ops/stop-stale-gitea-actions-jobs.sh` to 110 as `/home/wooo/scripts/stop-stale-gitea-actions-jobs.sh`; keep Prometheus auto action in dry-run mode.
9. Fix 110 runner services with sudo-capable host maintenance:
```bash
sudo /home/wooo/scripts/apply-runner-systemd-guardrails.sh --apply
@@ -88,3 +94,4 @@ sudo /home/wooo/scripts/apply-runner-systemd-guardrails.sh --apply
- Treating "no alert" as healthy when cAdvisor or textfile exporters are missing.
- Letting monitoring collectors spend seconds per scrape; this turns observability into load.
- Leaving self-hosted runners unlimited on the same host as Sentry/ClickHouse/Gitea.
- Applying live `docker update` without persisting the same guardrail in compose/systemd/IaC.
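
To illustrate how an unlimited container becomes visible to the alert rules, here is a minimal sketch of rendering limit metrics in the Prometheus text format for the node-exporter textfile collector. It is not the contents of `scripts/ops/docker-stats-textfile-exporter.py`; the function name and input shape are assumptions, and only the metric and label names are taken from the alert expressions in this commit.

```python
def render_limit_metrics(containers):
    """Render CPU/memory limit gauges in Prometheus text exposition format.

    `containers` maps a container name to its HostConfig dict as returned by
    docker inspect. Unlimited containers report 0 for both fields, so they are
    emitted as 0 and can be caught by an `== 0` alert expression.
    """
    lines = []
    for name, host_config in sorted(containers.items()):
        cpus = host_config.get("NanoCpus", 0) / 1e9
        mem = host_config.get("Memory", 0)
        # Escape backslashes and double quotes in the label value.
        label = name.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(
            f'docker_container_cpu_limit_cores{{container_name="{label}"}} {cpus:g}'
        )
        lines.append(
            f'docker_container_memory_limit_bytes{{container_name="{label}"}} {mem}'
        )
    return "\n".join(lines) + "\n"
```

In practice the output would be written to a temporary file and renamed into the textfile collector directory, so node-exporter never reads a half-written file.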

View File

@@ -33,8 +33,10 @@ groups:
description: "Node Exporter has not responded for more than 1 minute"
- alert: HostHighCpuLoad
expr: 100 - (avg by(host) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
# 2026-05-05 ogt + Codex: keep this as early warning only.
# Sustained overload/root-cause automation is handled by HostLoadAverageSustainedHigh.
expr: 100 - (avg by(host) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
for: 10m
labels:
severity: warning
layer: systemd-188
@@ -46,7 +48,7 @@ groups:
alert_category: "host_resource"
annotations:
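summary: "Host {{ $labels.host }} CPU load is high"
description: "CPU usage above 80%"
description: "CPU usage above 90% for 10 minutes; if load5/core has not exceeded 1.5, treat it as capacity observation and diagnosis first, and do not repair directly."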
- alert: HostLoadAverageSustainedHigh
# 2026-05-05 ogt + Codex: 110/188 sustained-overload baseline.
@@ -665,7 +667,7 @@ groups:
- alert: DockerContainerMissingResourceLimit
# 2026-05-05 ogt + Codex: catch Compose services that silently run with unlimited CPU/memory.
expr: (docker_container_cpu_limit_cores{container_name=~"momo-.*|litellm|gitea|sentry-self-hosted-(clickhouse|kafka|taskworker|process-spans).*"} == 0) or (docker_container_memory_limit_bytes{container_name=~"momo-.*|litellm|gitea|sentry-self-hosted-(clickhouse|kafka|taskworker|process-spans).*"} == 0)
expr: (docker_container_cpu_limit_cores{container_name=~"momo-.*|litellm|gitea|sentry-self-hosted-(clickhouse|kafka|taskworker|taskbroker|taskscheduler|redis|uptime-checker|process-spans).*"} == 0) or (docker_container_memory_limit_bytes{container_name=~"momo-.*|litellm|gitea|sentry-self-hosted-(clickhouse|kafka|taskworker|taskbroker|taskscheduler|redis|uptime-checker|process-spans).*"} == 0)
for: 30m
labels:
severity: warning
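
The effect of widening the `container_name` matcher can be checked offline. The sketch below copies the old and new patterns from the rule and uses `re.fullmatch`, since Prometheus label regexes are fully anchored. The container names other than `sentry-self-hosted-clickhouse-1` are assumed to follow the same compose naming pattern, and `covered` is a hypothetical helper.

```python
import re

# Patterns copied from the old and new DockerContainerMissingResourceLimit rule.
OLD = (
    r"momo-.*|litellm|gitea|"
    r"sentry-self-hosted-(clickhouse|kafka|taskworker|process-spans).*"
)
NEW = (
    r"momo-.*|litellm|gitea|"
    r"sentry-self-hosted-(clickhouse|kafka|taskworker|taskbroker|"
    r"taskscheduler|redis|uptime-checker|process-spans).*"
)


def covered(pattern, container_name):
    """Return True if the alert's label matcher would select this container."""
    return re.fullmatch(pattern, container_name) is not None


if __name__ == "__main__":
    # Assumed container names, modeled on sentry-self-hosted-clickhouse-1.
    for name in (
        "sentry-self-hosted-clickhouse-1",
        "sentry-self-hosted-taskbroker-1",
        "sentry-self-hosted-redis-1",
        "sentry-self-hosted-uptime-checker-1",
    ):
        print(name, "old:", covered(OLD, name), "new:", covered(NEW, name))
```

Because the match is anchored, a name such as `gitea-runner` is selected by neither pattern; only the exact name `gitea` is.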

View File

@@ -33,8 +33,10 @@ groups:
description: "Node Exporter has not responded for more than 1 minute"
- alert: HostHighCpuLoad
expr: 100 - (avg by(host) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
# 2026-05-05 ogt + Codex: keep this as early warning only.
# Sustained overload/root-cause automation is handled by HostLoadAverageSustainedHigh.
expr: 100 - (avg by(host) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
for: 10m
labels:
severity: warning
layer: systemd-188
@@ -46,7 +48,7 @@ groups:
alert_category: "host_resource"
annotations:
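summary: "Host {{ $labels.host }} CPU load is high"
description: "CPU usage above 80%"
description: "CPU usage above 90% for 10 minutes; if load5/core has not exceeded 1.5, treat it as capacity observation and diagnosis first, and do not repair directly."
# 2026-05-02 ogt + Claude Sonnet 4.6: steer the LLM toward SSH diagnosis rather than kubectl
auto_repair_action: "ssh {{ $labels.instance }} 'ps aux --sort=-%cpu | head -20' (host CPU diagnosis; kubectl restart awoooi-* is forbidden, because the main cause is usually a third-party service such as Sentry/ClickHouse/Snuba)"
runbook: "Host high CPU load triage: first SSH in and run ps aux to see the top processes; if the cause is a third-party service (Sentry/ClickHouse, etc.), write an ADR to upgrade resources or adjust the limit; kubectl restart across domains is forbidden"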
@@ -671,7 +673,7 @@ groups:
- alert: DockerContainerMissingResourceLimit
# 2026-05-05 ogt + Codex: catch Compose services that silently run with unlimited CPU/memory.
expr: (docker_container_cpu_limit_cores{container_name=~"momo-.*|litellm|gitea|sentry-self-hosted-(clickhouse|kafka|taskworker|process-spans).*"} == 0) or (docker_container_memory_limit_bytes{container_name=~"momo-.*|litellm|gitea|sentry-self-hosted-(clickhouse|kafka|taskworker|process-spans).*"} == 0)
expr: (docker_container_cpu_limit_cores{container_name=~"momo-.*|litellm|gitea|sentry-self-hosted-(clickhouse|kafka|taskworker|taskbroker|taskscheduler|redis|uptime-checker|process-spans).*"} == 0) or (docker_container_memory_limit_bytes{container_name=~"momo-.*|litellm|gitea|sentry-self-hosted-(clickhouse|kafka|taskworker|taskbroker|taskscheduler|redis|uptime-checker|process-spans).*"} == 0)
for: 30m
labels:
severity: warning
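
The split between the two host alerts can be summarized as a small decision function. This is a sketch of the thresholds only (`CPU >90%` and `load5/core >1.5`); it does not model the `for: 10m` / `for: 15m` hold periods, and the function name, return labels, and the 8-core figure in the usage note are assumptions for illustration.

```python
def classify_host(cpu_busy_pct, load5, cores):
    """Classify a host sample using the thresholds from the two alert rules.

    - load5 per core above 1.5 corresponds to HostLoadAverageSustainedHigh,
      which owns sustained overload and automated diagnosis.
    - CPU busy above 90% alone corresponds to HostHighCpuLoad, which is only
      an early warning: observe and diagnose, do not repair directly.
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    if load5 / cores > 1.5:
        return "sustained-overload"
    if cpu_busy_pct > 90:
        return "early-warning"
    return "ok"
```

For example, on a hypothetical 8-core host, `classify_host(95, 10.60, 8)` falls into the early-warning branch because load5 per core is below 1.5, which matches the annotation's guidance to treat it as capacity observation first.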