diff --git a/apps/web/Dockerfile b/apps/web/Dockerfile
index 5d65a214..d2a0ee99 100644
--- a/apps/web/Dockerfile
+++ b/apps/web/Dockerfile
@@ -44,6 +44,8 @@ ARG NEXT_PUBLIC_SENTRY_DSN=
 ENV NEXT_PUBLIC_API_URL=${NEXT_PUBLIC_API_URL}
 ENV NEXT_PUBLIC_SENTRY_DSN=${NEXT_PUBLIC_SENTRY_DSN}
 ENV NEXT_TELEMETRY_DISABLED=1
+# 2026-05-05 ogt + Codex: keep self-hosted 110 runner builds from saturating CPU.
+ENV NEXT_PRIVATE_BUILD_WORKER_COUNT=1
 
 # 2026-04-06 ogt: --mount=type=cache persists .next/cache for incremental compilation across builds
 # Only changed pages are recompiled; unchanged pages come straight from cache → saves 3-4 min
@@ -51,7 +53,7 @@ ENV NEXT_TELEMETRY_DISABLED=1
 # /root/.cache/turbo holds turbo's task output cache, so unchanged packages are not rebuilt every time
 RUN --mount=type=cache,target=/app/apps/web/.next/cache \
     --mount=type=cache,target=/root/.cache/turbo \
-    pnpm turbo build --filter=@awoooi/web
+    pnpm turbo build --filter=@awoooi/web --concurrency=1
 
 FROM base AS runner
 WORKDIR /app
diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index d21cd773..faffb4dd 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -6,6 +6,39 @@
 
 ---
 
+## 2026-05-05 | 110 Sentry resource limits persistence gap closed
+
+**Background**: The 110 guardrail alerts have cleared, but host load still shows a long tail; the commander is concerned that Claude Code only applied live `docker update`, so the settings would be lost again after a recreate.
+
+**On-site findings**:
+- 188 has stabilized: load is about `2.26 / 2.84 / 3.21`, and the momo/litellm/SignOz core containers all have live CPU/memory guardrails; `HostBackupFailed` is still firing, but it is unrelated to CPU/load.
+- 110 is still a Sentry long tail, not a runner or momo-type incident: ClickHouse uses about 2.2-3.0 cores, Kafka about 0.6 core, and taskworker/taskbroker/taskscheduler/redis/uptime-checker together form the background load.
+- ClickHouse is not currently stuck on queries: `system.processes` shows no long-running queries, `system.mutations` has nothing pending, and `system.merges` shows only short transaction merges; the largest table is `eap_items_1_local` at about `6.68 GiB`.
+- The Kafka consumer lag query shows no backlog growth; at this point, do not fall back on lowering ClickHouse/Kafka memory or on generic restarts.
+- The real gap: live limits already exist on 110, but `/opt/sentry/docker-compose.yml` only persisted `process-spans`; ClickHouse/Kafka/taskworker/taskbroker/taskscheduler/redis could revert to unlimited on a compose recreate.
+
+**Live fixes this round**:
+- 110 `/opt/sentry/docker-compose.yml` was backed up as `docker-compose.yml.bak-20260505-155707-codex-resource-limits`.
+- Persisted the Sentry core guardrails: ClickHouse `2 CPU / 8 GiB / 16 GiB swap`, Kafka `2 CPU / 3 GiB / 6 GiB swap`, taskworker `2 CPU / 2 GiB / 4 GiB swap`, taskbroker `1 CPU / 512 MiB / 1 GiB swap`, taskscheduler `0.5 CPU / 512 MiB / 1 GiB swap`, redis `0.5 CPU / 512 MiB / 1 GiB swap`, uptime-checker `0.5 CPU / 512 MiB / 1 GiB swap`.
+- Applied a live `docker update` to uptime-checker only, without restarting Sentry/ClickHouse/Kafka; the containers still show `Up 5 days`.
+- 110 `/opt/sentry/clickhouse/config.xml` was backed up as `config.xml.bak-20260505-160120-codex-merge-pool4`; the ClickHouse background merge pool was lowered from `8` to `4`, the three thresholds from `6/4/6` to `3/2/3`, and `max_bytes_to_merge_at_max_space_in_pool` from `512MiB` to `256MiB`.
+- `SYSTEM RELOAD CONFIG` does not hot-apply these ClickHouse 25.3 settings, so only `sentry-self-hosted-clickhouse-1` was restarted; before the restart there was `1` active foreground process (the check query itself) and `0` pending mutations.
+
+**Verification**:
+- `docker compose config` passed for `/opt/sentry/docker-compose.yml` (only the upstream `version` obsolete warning).
+- `docker inspect` shows that the live limits for ClickHouse/Kafka/taskworker/taskbroker/taskscheduler/redis/uptime-checker all match the compose baseline.
+- 110 load dropped from about `12.50 / 13.10 / 13.35` to `7.41 / 10.60 / 12.35`; `HostLoadAverageSustainedHigh` is not firing, and `DockerContainerCpuSustainedHigh` is pending only for Sentry ClickHouse.
+- ClickHouse was healthy 16 seconds after the restart; runtime settings are confirmed as `background_pool_size=4`, thresholds `3/2/3`, and a merge cap of `268435456` bytes; active merges `0`, pending mutations `0`, and ClickHouse CPU dropped from about `2.1-2.7 cores` to `0.67 core`.
+- Because 4 merge threads can still push ClickHouse briefly back to 2.7 cores, the live + compose CPU quota was tightened from `4` to `2`, with memory kept at `8 GiB`; a later topk shows ClickHouse at about `2.0 cores`, with the CPU quota protecting the host.
+- A later host `ps` shows that one of the main remaining causes of `HostHighCpuLoad` is the CD Web image build: `node /app/.../next build` at about `1.4 cores`, stacked on top of Gitea/ClickHouse/Kafka; `NEXT_PRIVATE_BUILD_WORKER_COUNT=1` was added to `apps/web/Dockerfile` and `pnpm turbo build --filter=@awoooi/web` now runs with `--concurrency=1`, so the Web build no longer pushes 110 into prolonged high CPU.
+- The old `HostHighCpuLoad` was retuned from `CPU >80% for 5m` to an early warning at `CPU >90% for 10m`; true sustained overload and automated diagnosis are handled by `HostLoadAverageSustainedHigh` at `load5/core >1.5 for 15m`.
+- The only Prometheus alerts still firing are `FlywheelExecutionRateMissing` and 188 `HostBackupFailed`; Docker/runner guardrail alerts are clean.
+
+**Next steps**:
+- On 110, if ClickHouse sustained CPU is still pending beyond the drain window, next check whether EAP/profiling/replay/uptime need to be kept; do not lower ClickHouse memory or restart first.
+- Bring the other unlimited low-traffic containers into the baseline in batches rather than all at once, to avoid squeezing secondary Sentry/Harbor/monitoring services into new incidents.
+- On 188, prioritize fixing `HostBackupFailed` and the momo scheduler Google Drive/blank-page check noise; CPU/load is not the current blocker.
+
 ## 2026-05-05 | 110/188 CPU/Mem quota full inventory + Docker baseline monitoring rollout
 
 **Background**: The commander is concerned that Claude Code misconfigured CPU/memory limits for 110/188 services, causing services to hang or become chronically overloaded; this round continues the inventory of live Docker inspect / docker stats / compose declarations.
diff --git a/docs/runbooks/HOST-RESOURCE-BASELINE-110-188.md b/docs/runbooks/HOST-RESOURCE-BASELINE-110-188.md
index 9402a9de..4146eb47 100644
--- a/docs/runbooks/HOST-RESOURCE-BASELINE-110-188.md
+++ b/docs/runbooks/HOST-RESOURCE-BASELINE-110-188.md
@@ -9,11 +9,13 @@
 
 | Service | Live Limit | Live Usage Snapshot | Verdict |
 |---|---:|---:|---|
-| Sentry ClickHouse | 4 CPU / 8 GiB | ~235-291% CPU / 3.3-3.4 GiB | CPU capped but still hottest. Do not lower memory; keep merge settings explicit. |
+| Sentry ClickHouse | 2 CPU / 8 GiB, merge pool 4 | capped near 2 cores after pool 8 -> 4 restart | Do not lower memory. CPU quota intentionally slows background merge so Sentry cannot dominate 110. If backlog grows, inspect `MergeMutate` and Sentry high-volume features before raising it. |
 | Sentry Kafka | 2 CPU / 3 GiB | ~40-55% CPU / 2.5 GiB (84%) | Memory is close to pressure. Do not reduce memory. |
 | Sentry taskworker | 2 CPU / 2 GiB, concurrency 2 | ~120-181% CPU after restart | Concurrency reduced from 4 to 2 after Kafka lag cleared. Watch Sentry task latency before further changes. |
 | Sentry taskbroker | 1 CPU / 512 MiB | ~70-98% CPU / 160 MiB | CPU is tight; increasing may improve backlog but can raise host load. |
 | Sentry taskscheduler | 0.5 CPU / 512 MiB | ~13% CPU / 387 MiB (76%) | Memory is tight; alert at 85% before it stalls. |
+| Sentry redis | 0.5 CPU / 512 MiB | ~15-30% CPU / 19 MiB | Live and compose cap are aligned. |
+| Sentry uptime-checker | 0.5 CPU / 512 MiB | ~26-30% CPU / 43-187 MiB | Capped after it showed sustained background CPU. |
 | Gitea | 3 CPU / 3 GiB | ~4% CPU / 2.18 GiB (73%) | Good cap; memory headroom is not huge. |
 | GitHub/Gitea runners | unlimited systemd services | one runner had WatchdogSec=5min and 8,490 restarts; `act` CI containers caused load spikes | Must be monitored outside Docker. Remove bad watchdog drop-in and apply per-runner CPU/Memory quotas. |
 | node-exporter | 1 CPU / 256 MiB | ~0-5% CPU / 8 MiB | Good after disabling expensive `arp`, `netclass`, and `netdev` collectors. |
@@ -28,8 +30,11 @@
 | SignOz ClickHouse | 4 CPU / 24 GiB | ~93-133% CPU / 1.1 GiB | Healthy enough; keep current cap. |
 | SignOz Zookeeper | 1 CPU / 2 GiB | ~8-18% CPU / 1.09 GiB | OK. |
 | cadvisor | 1.5 CPU / 1 GiB | ~0% CPU / 28 MiB | Good. |
-| litellm | unlimited | ~0.6-0.9% CPU / 780 MiB | Add modest cap after observing traffic; do not re-add DATABASE_URL. |
-| momo-pro-system / momo-db | unlimited | DB had short CPU bursts, then ~0.6% with no active long query | Needs service-specific limits after scheduler/schema pressure is controlled. |
+| litellm | 1 CPU / 1 GiB | ~0.5-0.9% CPU / 780 MiB | Good cap; keep stateless mode and do not re-add `DATABASE_URL`. |
+| momo-pro-system | 2 CPU / 2 GiB | ~1-2% CPU / 740 MiB | Good cap; startup cache prewarm must stay single-flight. |
+| momo-scheduler | 2 CPU / 2 GiB | ~0.3% CPU / 105-163 MiB after crawler burst | CPU cap is working. Next fix is crawler concurrency and failed background jobs, not lower CPU. |
+| momo-telegram-bot | 0.5 CPU / 512 MiB | ~0.7% CPU / 66 MiB | Good cap. |
+| momo-db | 2 CPU / 4 GiB | DB had short CPU bursts, then ~0.6-29% with no active long query | Good cap; current bursts are query/workload, not limit pressure. |
 | Monitoring tools / websites / exporters | mostly unlimited | low | Add caps gradually with textfile alerts watching pressure. |
 
 ## Baseline Policy
@@ -69,12 +74,13 @@ Use these thresholds for alerting and AI triage:
 
 1. Deploy `scripts/ops/docker-stats-textfile-exporter.py` to 110 and 188 textfile collector cron.
 2. Reload Prometheus rules with the new Docker CPU/memory/restart baseline alerts.
-3. Observe 110 for one drain window after node-exporter collector trim and taskworker concurrency 2. Kafka lag is now near zero; if ClickHouse remains high, tune merge/query behavior, not Kafka consumers.
-4. Tune `momo-scheduler` crawler concurrency on 188; keep 2 CPU / 2 GiB until success rate and latency prove it is too low.
-5. Fix 188 Elephant Alpha/OpenClaw allowed-action drift before enabling resource auto-repair beyond diagnosis.
-6. Add modest caps to currently unlimited low-risk services in small batches.
-7. Deploy `scripts/ops/stop-stale-gitea-actions-jobs.sh` to 110 as `/home/wooo/scripts/stop-stale-gitea-actions-jobs.sh`; keep Prometheus auto action in dry-run mode.
-8. Fix 110 runner services with sudo-capable host maintenance:
+3. Persist live limits in the owning compose files before considering the host repaired; live `docker update` alone is not durable.
+4. Observe 110 for one drain window after node-exporter collector trim and taskworker concurrency 2. Kafka lag is now near zero; if ClickHouse remains high, tune merge/query behavior or reduce high-volume Sentry features, not Kafka memory.
+5. Tune `momo-scheduler` crawler concurrency on 188; keep 2 CPU / 2 GiB until success rate and latency prove it is too low.
+6. Fix 188 Elephant Alpha/OpenClaw allowed-action drift before enabling resource auto-repair beyond diagnosis.
+7. Add modest caps to currently unlimited low-risk services in small batches. Do not alert every unlimited auxiliary container at once; promote candidates only after 24h usage data.
+8. Deploy `scripts/ops/stop-stale-gitea-actions-jobs.sh` to 110 as `/home/wooo/scripts/stop-stale-gitea-actions-jobs.sh`; keep Prometheus auto action in dry-run mode.
+9. Fix 110 runner services with sudo-capable host maintenance:
 
 ```bash
 sudo /home/wooo/scripts/apply-runner-systemd-guardrails.sh --apply
@@ -88,3 +94,4 @@ sudo /home/wooo/scripts/apply-runner-systemd-guardrails.sh --apply
 - Treating "no alert" as healthy when cAdvisor or textfile exporters are missing.
 - Letting monitoring collectors spend seconds per scrape; this turns observability into load.
 - Leaving self-hosted runners unlimited on the same host as Sentry/ClickHouse/Gitea.
+- Applying live `docker update` without persisting the same guardrail in compose/systemd/IaC.
diff --git a/ops/monitoring/alerts-unified.yml b/ops/monitoring/alerts-unified.yml
index d5dad3b7..b97c4fad 100644
--- a/ops/monitoring/alerts-unified.yml
+++ b/ops/monitoring/alerts-unified.yml
@@ -33,8 +33,10 @@ groups:
           description: "Node Exporter unresponsive for more than 1 minute"
 
       - alert: HostHighCpuLoad
-        expr: 100 - (avg by(host) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
-        for: 5m
+        # 2026-05-05 ogt + Codex: keep this as early warning only.
+        # Sustained overload/root-cause automation is handled by HostLoadAverageSustainedHigh.
+        expr: 100 - (avg by(host) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
+        for: 10m
         labels:
           severity: warning
           layer: systemd-188
@@ -46,7 +48,7 @@ groups:
           alert_category: "host_resource"
         annotations:
           summary: "Host {{ $labels.host }} high CPU load"
-          description: "CPU usage above 80%"
+          description: "CPU usage above 90% for 10 minutes; if load5/core has not exceeded 1.5, treat it as capacity observation and diagnosis first, and do not repair directly."
 
       - alert: HostLoadAverageSustainedHigh
         # 2026-05-05 ogt + Codex: 110/188 sustained overload baseline.
@@ -665,7 +667,7 @@ groups:
 
       - alert: DockerContainerMissingResourceLimit
         # 2026-05-05 ogt + Codex: catch Compose services that silently run with unlimited CPU/memory.
-        expr: (docker_container_cpu_limit_cores{container_name=~"momo-.*|litellm|gitea|sentry-self-hosted-(clickhouse|kafka|taskworker|process-spans).*"} == 0) or (docker_container_memory_limit_bytes{container_name=~"momo-.*|litellm|gitea|sentry-self-hosted-(clickhouse|kafka|taskworker|process-spans).*"} == 0)
+        expr: (docker_container_cpu_limit_cores{container_name=~"momo-.*|litellm|gitea|sentry-self-hosted-(clickhouse|kafka|taskworker|taskbroker|taskscheduler|redis|uptime-checker|process-spans).*"} == 0) or (docker_container_memory_limit_bytes{container_name=~"momo-.*|litellm|gitea|sentry-self-hosted-(clickhouse|kafka|taskworker|taskbroker|taskscheduler|redis|uptime-checker|process-spans).*"} == 0)
         for: 30m
         labels:
           severity: warning
diff --git a/ops/monitoring/alerts.yml b/ops/monitoring/alerts.yml
index 19fb7afd..3a8d4596 100644
--- a/ops/monitoring/alerts.yml
+++ b/ops/monitoring/alerts.yml
@@ -33,8 +33,10 @@ groups:
           description: "Node Exporter unresponsive for more than 1 minute"
 
       - alert: HostHighCpuLoad
-        expr: 100 - (avg by(host) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
-        for: 5m
+        # 2026-05-05 ogt + Codex: keep this as early warning only.
+        # Sustained overload/root-cause automation is handled by HostLoadAverageSustainedHigh.
+        expr: 100 - (avg by(host) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
+        for: 10m
         labels:
           severity: warning
           layer: systemd-188
@@ -46,7 +48,7 @@ groups:
           alert_category: "host_resource"
         annotations:
           summary: "Host {{ $labels.host }} high CPU load"
-          description: "CPU usage above 80%"
+          description: "CPU usage above 90% for 10 minutes; if load5/core has not exceeded 1.5, treat it as capacity observation and diagnosis first, and do not repair directly."
           # 2026-05-02 ogt + Claude Sonnet 4.6: steer the LLM toward SSH diagnosis instead of kubectl
           auto_repair_action: "ssh {{ $labels.instance }} 'ps aux --sort=-%cpu | head -20' (host CPU diagnosis; kubectl restart awoooi-* is forbidden; the root cause is usually a third-party service such as Sentry/ClickHouse/Snuba)"
           runbook: "Host high CPU load triage: first SSH in and run ps aux to see the top processes; if it is a third-party service (Sentry/ClickHouse, etc.), write an ADR to upgrade resources or adjust limits; cross-domain kubectl restart is forbidden"
@@ -671,7 +673,7 @@ groups:
 
       - alert: DockerContainerMissingResourceLimit
         # 2026-05-05 ogt + Codex: catch Compose services that silently run with unlimited CPU/memory.
-        expr: (docker_container_cpu_limit_cores{container_name=~"momo-.*|litellm|gitea|sentry-self-hosted-(clickhouse|kafka|taskworker|process-spans).*"} == 0) or (docker_container_memory_limit_bytes{container_name=~"momo-.*|litellm|gitea|sentry-self-hosted-(clickhouse|kafka|taskworker|process-spans).*"} == 0)
+        expr: (docker_container_cpu_limit_cores{container_name=~"momo-.*|litellm|gitea|sentry-self-hosted-(clickhouse|kafka|taskworker|taskbroker|taskscheduler|redis|uptime-checker|process-spans).*"} == 0) or (docker_container_memory_limit_bytes{container_name=~"momo-.*|litellm|gitea|sentry-self-hosted-(clickhouse|kafka|taskworker|taskbroker|taskscheduler|redis|uptime-checker|process-spans).*"} == 0)
         for: 30m
         labels:
           severity: warning
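
Review note: the `DockerContainerMissingResourceLimit` rule above flags watched containers whose CPU or memory limit is 0. For spot-checking a host by hand, a rough local equivalent might look like the sketch below. It is not part of this change. It assumes the input is the JSON array printed by `docker inspect` and reads `HostConfig.NanoCpus`, `HostConfig.CpuQuota` / `HostConfig.CpuPeriod`, and `HostConfig.Memory`; the function names `cpu_limit_cores` and `missing_limits` are hypothetical.

```python
import json
import re

# Same container-name pattern as the alert expression. Prometheus regex
# matchers are fully anchored, so fullmatch() is used below.
WATCHED = re.compile(
    r"momo-.*|litellm|gitea|sentry-self-hosted-"
    r"(clickhouse|kafka|taskworker|taskbroker|taskscheduler|redis|"
    r"uptime-checker|process-spans).*"
)


def cpu_limit_cores(host_config):
    """Return the CPU limit in cores, or 0.0 when the container is uncapped."""
    nano = host_config.get("NanoCpus") or 0
    if nano > 0:
        return nano / 1e9
    # Fall back to the CFS quota/period pair; Docker reports 0 when unset.
    quota = host_config.get("CpuQuota") or 0
    period = host_config.get("CpuPeriod") or 100_000
    return quota / period if quota > 0 else 0.0


def missing_limits(inspect_json):
    """List (name, gaps) for watched containers lacking a CPU or memory cap."""
    findings = []
    for container in json.loads(inspect_json):
        name = container["Name"].lstrip("/")  # docker inspect prefixes "/"
        if not WATCHED.fullmatch(name):
            continue
        host_config = container.get("HostConfig", {})
        gaps = []
        if cpu_limit_cores(host_config) == 0:
            gaps.append("cpu")
        if (host_config.get("Memory") or 0) == 0:
            gaps.append("memory")
        if gaps:
            findings.append((name, gaps))
    return findings
```

One way to feed it is to pass the output of `docker inspect $(docker ps -q)` as the string argument. An empty result means every watched container carries both caps, but only live ones: the compose file still has to be checked separately, since live `docker update` alone is not durable.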