fix(ops): prioritize live gitea pressure routing
All checks were successful
CD Pipeline / workflow-shape (push) Successful in 0s
CD Pipeline / cancel-stale-cd (push) Has been skipped
CD Pipeline / tests (push) Successful in 57s
CD Pipeline / build-and-deploy (push) Successful in 4m34s
CD Pipeline / post-deploy-checks (push) Successful in 6m0s
@@ -52441,21 +52441,23 @@ production browser smoke:
- `scripts/ops/host-sustained-load-controller.py` adds live `ps` process-family classification and routes moderate pressure to the `control_plane_saturation`, `gitea_queue_or_hook_backlog`, and `stockplatform_hot_query_or_api_pressure` check-mode packets; the controller still only emits a controlled packet and performs no Docker / systemd / DB / Nginx restart.
- `ops/monitoring/alerts-unified.yml` extends `Host110SustainedModeratePressure` from watching only `load5/core` / container CPU to also watching `awoooi_host_process_family_cpu_percent{family=~"systemd_control_plane|ssh_control_plane|gitea_service|postgres"} > 50`; the Gitea / StockPlatform container early-triage threshold drops from `> 2.0 core` to `> 1.0 core`.
- `scripts/ops/host-sustained-load-controller.py` gains `--script-dir`, defaulting to `/home/wooo/scripts`, so that the `dry_run_command` / `post_apply_verifier` emitted for live alerts point at the helpers actually deployed on 110 instead of the repo-relative path `scripts/ops/...`, which does not resolve on the host.
- `scripts/ops/host-sustained-load-controller.py` / `host-sustained-load-evidence.py` both fix the routing priority: when fresh Docker stats show `gitea` or a key StockPlatform container above `1.0 core`, the service playbook is routed first, so the long-lived `systemd_control_plane` average CPU no longer pre-empts it and steers to the control-plane playbook.
- Deployed live to 110:
- `/home/wooo/scripts/host-runaway-process-exporter.py` SHA `d85d27c81ea76a8f2f370ee85c92381bec4440eea4fd37efb2efb9f43dbd1a8a`
- `/home/wooo/scripts/host-sustained-load-controller.py` SHA `7a11407c7df05427085982d6f6d11d1756f908591573a28c9b3267de32b94f3e`
- `/home/wooo/scripts/host-sustained-load-controller.py` SHA `1bf9c183fe8d89c30008e08db0903c24c609df31eeebe62f9d59b3a26a3bd1c0`
- `/home/wooo/scripts/host-sustained-load-evidence.py` SHA `2fd8e7d43a0249f97b35a865cfd4ce2aa45162729faebfc837cde8cf48beec38`
- `bash scripts/ops/deploy-alerts.sh` completed; Prometheus has loaded `159` rules.

**Live readback evidence**:
- The 110 textfile / Prometheus now exports the process-family metric: `systemd_control_plane=72.4`, `gitea_service=53.1`, `postgres=11.1`.
- Prometheus rule readback: `Host110SustainedModeratePressure state=firing health=ok`; the query now includes `docker_container_cpu_cores > 1` and `awoooi_host_process_family_cpu_percent > 50`.
- Alertmanager `/api/v2/alerts` has active alerts: `family=gitea_service` and `family=systemd_control_plane`, with `status.state=active` and `auto_repair=true`.
- Live controller readback: `classification=blocked_control_plane_saturation_requires_playbook`, `next_action=run_control_plane_saturation_playbook_check_mode`, `dry_run_command=/home/wooo/scripts/host-sustained-load-evidence.py ...`, `post_apply_verifier=/home/wooo/scripts/host-sustained-load-controller.py ...`, `controller_exit=75`.
- Live sanitized evidence readback: `recommendation=control_plane_saturation_playbook`, `controlled_apply_allowed=false`; `top_process_families` includes `systemd_control_plane=72.4`, `gitea_service=53.1`, `unknown=54.3`; `top_containers` is led by `gitea=1.5951 core`; the evidence explicitly states that it does not output raw command lines / URLs / secrets.
- Live controller readback: `classification=blocked_gitea_queue_or_hook_backlog_requires_playbook`, `next_action=run_gitea_queue_or_hook_backlog_playbook_check_mode`, `dry_run_command=/home/wooo/scripts/host-sustained-load-evidence.py ...`, `post_apply_verifier=/home/wooo/scripts/host-sustained-load-controller.py ...`, `controller_exit=75`.
- Live sanitized evidence readback: `recommendation=gitea_queue_or_hook_backlog_playbook`, `controlled_apply_allowed=false`; `top_process_families` includes `systemd_control_plane=72.5`, `gitea_service=53.1`, `unknown=53.7`; `top_containers` is led by `gitea=1.3055 core`; the evidence explicitly states that it does not output raw command lines / URLs / secrets.

**Local verification results**:
- `python3.11 -m py_compile scripts/ops/host-runaway-process-exporter.py scripts/ops/host-sustained-load-controller.py scripts/ops/host-sustained-load-evidence.py`: passed.
- `python3.11 -m pytest scripts/ops/tests/test_host_runaway_process_exporter.py scripts/ops/tests/test_host_pressure_alert_contract.py -q`: `27 passed`.
- `python3.11 -m pytest scripts/ops/tests/test_host_runaway_process_exporter.py scripts/ops/tests/test_host_pressure_alert_contract.py -q`: `29 passed`.
- `python3.11 -m pytest ops/runner/test_cd_controlled_runtime_profile.py scripts/ops/tests/test_host_runaway_process_exporter.py scripts/ops/tests/test_host_pressure_alert_contract.py -q`: `70 passed`.
- `python3.11 -c "import yaml; yaml.safe_load(open('ops/monitoring/alerts-unified.yml'))"`: passed.
- `python3.11 ops/runner/guard-gitea-runner-pressure.py --root .`: passed, `auto_branch_events_on_110=0`, `generic_runner_labels=0`.
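As a rough illustration of the widened alert condition described in the bullets above, the Python sketch below evaluates the two signal types against the live readback values. The function name `moderate_pressure_signals`, its argument shapes, and the container name matching are assumptions made for this example; the real rule is a PromQL expression in `alerts-unified.yml` that also considers `load5/core`, which is left out here.

```python
# Hypothetical sketch: names and data shapes are assumptions, not the real alert rule.
WATCHED_FAMILIES = {
    "systemd_control_plane",
    "ssh_control_plane",
    "gitea_service",
    "postgres",
}
FAMILY_CPU_PERCENT_THRESHOLD = 50.0  # mirrors the `> 50` clause
CONTAINER_CORES_THRESHOLD = 1.0  # early-triage threshold, lowered from 2.0
WATCHED_CONTAINER_HINTS = ("gitea", "stockplatform")  # assumed substring matching


def moderate_pressure_signals(
    family_cpu: dict[str, float],
    container_cores: dict[str, float],
) -> list[str]:
    """Return labels for every watched signal that exceeds its threshold."""
    signals = []
    for family, percent in sorted(family_cpu.items()):
        if family in WATCHED_FAMILIES and percent > FAMILY_CPU_PERCENT_THRESHOLD:
            signals.append(f"family={family}")
    for name, cores in sorted(container_cores.items()):
        watched = any(hint in name for hint in WATCHED_CONTAINER_HINTS)
        if watched and cores > CONTAINER_CORES_THRESHOLD:
            signals.append(f"container={name}")
    return signals


signals = moderate_pressure_signals(
    {"systemd_control_plane": 72.4, "gitea_service": 53.1, "postgres": 11.1},
    {"gitea": 1.5951},
)
print(signals)
```

With the readback values, `postgres=11.1` stays below the threshold while both `gitea_service` and `systemd_control_plane` trip it, which is consistent with the two active Alertmanager families listed above.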
@@ -94,6 +94,8 @@ v1.82 bounded summary rule: `post-start-quick-check.sh` and `188-host-hygiene-m

2026-07-02 12:15 added the controller command-path contract: the `dry_run_command` / `post_apply_verifier` produced by `host-sustained-load-controller.py` must use the paths actually deployed on 110: `/home/wooo/scripts/host-sustained-load-evidence.py`, `/home/wooo/scripts/host-sustained-load-controller.py`, and `/home/wooo/scripts/host-runaway-process-remediation.py`. If the controller on the live host outputs a repo-relative path such as `scripts/ops/...`, treat it as a break in the alert-automation chain: fix the controller / `--script-dir` first, then enter playbook check-mode. A "helper not found" result must not be treated as the CPU root cause having been handled.

2026-07-02 12:35 added the 110 CPU routing priority: if Docker stats are fresh and `gitea` or a key StockPlatform container has exceeded the early-triage threshold of `1.0 core`, controller / evidence must route to the corresponding service playbook first and must not be pre-empted into the control-plane playbook by the `systemd_control_plane` average taken from long-lived `ps %CPU`. Control-plane saturation remains the fallback path for situations with no known hot container / hot service family.

2026-06-25 20:25 orphan Chrome cleanup / scorecard refresh supersedes the 20:11 CPU wording. 110 high CPU was traced to two `stockplatform-review-bulk-ux` Chrome process groups `2756503` and `2829627` with root Chrome process `PPID=1`, elapsed about 5h, no active parent smoke, and sustained GPU/renderer CPU. With user approval, only those two process groups received targeted `SIGTERM` at 20:24. Post-check showed no remaining PGID entries; `vmstat` showed CPU idle around `85-90%`, `si/so=0`, and no immediate swap thrash. No Docker/systemd/Nginx/firewall/K8s action, CI cancellation, manual data ingestion, manual DB write, Wazuh/SOC runtime change, or secret read was performed. The 20:25 full post-start wrapper then returned cold-start `PASS=89 WARN=0 BLOCKED=0`, but overall `POST_START_QUICK_CHECK PASS=37 WARN=2 BLOCKED=1`, `RESULT=BLOCKED`, because StockPlatform data freshness was still blocked at that time and DR remained incomplete.
2026-06-25 20:11 StockPlatform cron-source recovery supersedes the 19:35 source-version wording. StockPlatform Gitea `main` and live `/home/wooo/stockplatform-v2` are now at `fb91aa4c6272469d1d26e0820169629eac17d28a fix(ops): restore production cron recovery entrypoints`; six missing production cron entrypoint scripts are restored, `run-intelligence-sync.sh` contains the Docker-backed `psql` shim, and live contract check confirms every `scripts/ops/*.sh` referenced by `install-production-cron.sh` exists. The only live write performed for StockPlatform recovery was a fast-forward `git pull --ff-only origin main` on 110; no Docker/systemd/Nginx/firewall/K8s restart, manual ingestion run, manual DB write, or secret read was performed. Natural cron evidence after the pull is now green for the repaired entrypoints: `source-remediation-queue` 19:56 and 20:00 succeeded, `market-index-ingestion` 20:00 succeeded, `price-ingestion` 20:02 succeeded, `margin-short-ingestion` 20:05 succeeded, `chips-ingestion` 20:06 succeeded, and `ai-recommendation-pipeline` 20:10 ran but correctly produced the internal blocker `core_margin_short_daily_incomplete,official_margin_short_daily_official_pending`. StockPlatform `/api/v1/system/freshness` therefore still returns `status=blocked` because the 2026-06-25 official margin-short source is pending and `ai.recommendations` must stay on 2026-06-24 until that gate clears. This is no longer a route, source-version, or missing-cron-script blocker; it is a product-data freshness blocker waiting on official source availability and the next valid AI pipeline run.
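The 12:15 command-path contract above lends itself to a small check. The sketch below is only an illustration of that rule: the helper name `helper_path_violations` and the shape of the `commands` mapping are assumptions for this example and are not taken from the controller; only the deployed directory `/home/wooo/scripts` and the rejected `scripts/ops/...` form come from the text.

```python
from pathlib import PurePosixPath

# Deployed helper directory on 110, as stated in the contract.
DEPLOYED_SCRIPT_DIR = PurePosixPath("/home/wooo/scripts")


def helper_path_violations(commands: dict[str, str]) -> list[str]:
    """Return the keys whose command does not start with a deployed helper path.

    Hypothetical helper: a command passes only if its first token is an
    absolute path located directly inside DEPLOYED_SCRIPT_DIR.
    """
    violations = []
    for key, command in commands.items():
        parts = command.split()
        helper = PurePosixPath(parts[0]) if parts else PurePosixPath("")
        if not helper.is_absolute() or helper.parent != DEPLOYED_SCRIPT_DIR:
            violations.append(key)
    return violations


good = {
    "dry_run_command": "/home/wooo/scripts/host-sustained-load-evidence.py --host 110 --json",
    "post_apply_verifier": "/home/wooo/scripts/host-sustained-load-controller.py --host 110 --json",
}
bad = {"dry_run_command": "scripts/ops/host-sustained-load-evidence.py --host 110 --json"}

print(helper_path_violations(good))
print(helper_path_violations(bad))
```

A non-empty result would correspond to the "automation chain is broken" case: the controller or its `--script-dir` needs fixing before any playbook check-mode run.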
@@ -537,15 +537,6 @@ def build_packet(
             f"--docker-stats-file {DEFAULT_DOCKER_STATS_FILE} --json"
         )
         next_action = "run_gitea_queue_or_hook_backlog_playbook_check_mode"
-    elif control_plane_cpu >= process_family_cpu_threshold:
-        classification = "blocked_control_plane_saturation_requires_playbook"
-        severity = "critical" if load5_per_core > load5_per_core_threshold else "warning"
-        dry_run_command = (
-            f"{evidence_script} "
-            f"--host {host} --metrics-file {DEFAULT_METRICS_FILE} "
-            f"--docker-stats-file {DEFAULT_DOCKER_STATS_FILE} --json"
-        )
-        next_action = "run_control_plane_saturation_playbook_check_mode"
     elif (
         "stockplatform-v2-postgres-1" in top_container_name
         and top_container_cpu >= hot_container_cpu_threshold
@@ -572,6 +563,15 @@ def build_packet(
             f"--docker-stats-file {DEFAULT_DOCKER_STATS_FILE} --json"
         )
         next_action = "run_gitea_queue_or_hook_backlog_playbook_check_mode"
+    elif control_plane_cpu >= process_family_cpu_threshold:
+        classification = "blocked_control_plane_saturation_requires_playbook"
+        severity = "critical" if load5_per_core > load5_per_core_threshold else "warning"
+        dry_run_command = (
+            f"{evidence_script} "
+            f"--host {host} --metrics-file {DEFAULT_METRICS_FILE} "
+            f"--docker-stats-file {DEFAULT_DOCKER_STATS_FILE} --json"
+        )
+        next_action = "run_control_plane_saturation_playbook_check_mode"
     elif load5_per_core > load5_per_core_threshold and swap_used_ratio >= 0.85:
         classification = "blocked_memory_or_swap_pressure_requires_service_playbook"
         severity = "critical"
@@ -281,11 +281,23 @@ def recommend_playbook(process_families: list[dict[str, Any]], containers: list[
     top_container_cpu = float(top_container.get("cpu_cores") or 0.0)
     top_family = process_families[0] if process_families else {}
     family = str(top_family.get("family") or "")
+    family_cpu = {
+        str(item.get("family") or ""): float(item.get("cpu_percent") or 0.0)
+        for item in process_families
+    }

-    if "gitea" in top_container_name and top_container_cpu >= 2.0:
+    if "gitea" in top_container_name and top_container_cpu >= 1.0:
         return "gitea_queue_or_hook_backlog_playbook"
-    if "postgres" in top_container_name or "postgres" in family:
+    if (
+        (
+            "postgres" in top_container_name
+            or "stockplatform-v2-postgres-1" in top_container_name
+        )
+        and top_container_cpu >= 1.0
+    ) or family_cpu.get("postgres", 0.0) >= 50.0:
         return "postgres_hot_query_or_backup_export_playbook"
+    if family_cpu.get("gitea_service", 0.0) >= 50.0:
+        return "gitea_queue_or_hook_backlog_playbook"
     if family in {"docker_build", "web_build", "gitea_actions_runner"}:
         return "build_or_runner_pressure_playbook"
     if family in {"systemd_control_plane", "ssh_control_plane"}:
@@ -526,6 +526,80 @@ def test_sustained_load_controller_routes_gitea_quota_pressure_even_when_load_is
     assert "scripts/ops/" not in payload["commands"]["dry_run"]


+def test_sustained_load_controller_prioritizes_hot_gitea_container_over_control_plane_average(
+    tmp_path: Path,
+) -> None:
+    metrics_file = tmp_path / "host.prom"
+    metrics_file.write_text(
+        "\n".join(
+            [
+                'awoooi_host_runaway_process_monitor_up{host="110",mode="read_only"} 1',
+                'awoooi_host_load5_per_core{host="110"} 0.70',
+                'awoooi_host_swap_used_ratio{host="110"} 0.1',
+                'awoooi_host_runaway_process_remediation_authorized{host="110"} 0',
+                'awoooi_host_gitea_actions_active_container_count{host="110"} 0',
+                'awoooi_host_gitea_actions_active_process_group_count{host="110"} 0',
+                'awoooi_host_runaway_browser_orphan_group_count{host="110",rule="stockplatform_headless_smoke",min_age_seconds="1800",min_cpu_percent="50"} 0',
+            ]
+        ),
+        encoding="utf-8",
+    )
+    docker_file = tmp_path / "docker.prom"
+    docker_file.write_text(
+        "\n".join(
+            [
+                'docker_container_cpu_cores{host="110",container_name="gitea"} 1.59',
+                'docker_container_cpu_cores{host="110",container_name="redis"} 0.2',
+            ]
+        ),
+        encoding="utf-8",
+    )
+    ps_file = tmp_path / "ps.txt"
+    ps_file.write_text(
+        "\n".join(
+            [
+                "100 1 100 75507 61.8 0.0 systemd /sbin/init",
+                "101 1 101 75469 6.7 0.0 dbus-daemon @dbus-daemon --system",
+                "200 1 200 75348 53.1 1.3 gitea /usr/local/bin/gitea web --config /home/wooo/gitea/app.ini",
+            ]
+        ),
+        encoding="utf-8",
+    )
+
+    result = subprocess.run(
+        [
+            sys.executable,
+            str(CONTROLLER_PATH),
+            "--host",
+            "110",
+            "--load5-per-core-threshold",
+            "0.75",
+            "--hot-container-cpu-threshold",
+            "1.0",
+            "--container-cpu-threshold",
+            "2.0",
+            "--metrics-file",
+            str(metrics_file),
+            "--docker-stats-file",
+            str(docker_file),
+            "--ps-file",
+            str(ps_file),
+            "--json",
+        ],
+        capture_output=True,
+        text=True,
+    )
+
+    assert result.returncode == 75
+    payload = json.loads(result.stdout)
+    assert payload["classification"] == "blocked_gitea_queue_or_hook_backlog_requires_playbook"
+    assert payload["next_action"] == "run_gitea_queue_or_hook_backlog_playbook_check_mode"
+    assert payload["readback"]["control_plane_process_cpu_percent"] == 68.5
+    assert payload["readback"]["top_container_cpu"]["container_name"] == "gitea"
+    assert "/home/wooo/scripts/host-sustained-load-evidence.py" in payload["commands"]["dry_run"]
+    assert "/home/wooo/gitea/app.ini" not in result.stdout
+
+
 def test_sustained_load_controller_ignores_stale_docker_stats_attribution(tmp_path: Path) -> None:
     metrics_file = tmp_path / "host.prom"
     metrics_file.write_text(
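The `68.5` asserted for `control_plane_process_cpu_percent` in the test above equals `61.8 + 6.7`, the CPU of the `systemd` and `dbus-daemon` fixture rows. The sketch below shows one way such a per-family total could be derived. The column layout (PID, PPID, PGID, elapsed seconds, %CPU, %MEM, command name, arguments) and the `FAMILY_BY_COMM` mapping are assumptions inferred from the fixture; the controller's actual classifier is not part of this diff.

```python
from collections import defaultdict

# Assumed mapping from command name to process family (illustrative only).
FAMILY_BY_COMM = {
    "systemd": "systemd_control_plane",
    "dbus-daemon": "systemd_control_plane",
    "gitea": "gitea_service",
}


def family_cpu_from_ps(text: str) -> dict[str, float]:
    """Sum %CPU per family, keeping only the command name and dropping arguments."""
    totals: dict[str, float] = defaultdict(float)
    for line in text.splitlines():
        fields = line.split(None, 7)  # 7 leading columns, then the raw arguments
        if len(fields) < 7:
            continue  # skip malformed rows
        cpu_percent = float(fields[4])
        comm = fields[6]
        totals[FAMILY_BY_COMM.get(comm, "unknown")] += cpu_percent
    return {family: round(total, 1) for family, total in totals.items()}


PS_FIXTURE = "\n".join(
    [
        "100 1 100 75507 61.8 0.0 systemd /sbin/init",
        "101 1 101 75469 6.7 0.0 dbus-daemon @dbus-daemon --system",
        "200 1 200 75348 53.1 1.3 gitea /usr/local/bin/gitea web --config /home/wooo/gitea/app.ini",
    ]
)
family_cpu = family_cpu_from_ps(PS_FIXTURE)
print(family_cpu)
```

Because the argument column is never copied into the result, paths such as the Gitea `app.ini` location cannot leak into the output, which is the property the test's final assertion checks on the real controller.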
@@ -842,6 +916,50 @@ def test_sustained_load_evidence_emits_sanitized_gitea_recommendation(tmp_path:
     assert "/home/wooo" not in result.stdout


+def test_sustained_load_evidence_prioritizes_hot_gitea_container_over_control_plane_average(
+    tmp_path: Path,
+) -> None:
+    ps_file = tmp_path / "ps.txt"
+    ps_file.write_text(
+        "\n".join(
+            [
+                "100 1 100 75507 61.8 0.0 systemd /sbin/init",
+                "101 1 101 75469 6.7 0.0 dbus-daemon @dbus-daemon --system",
+                "200 1 200 75348 53.1 1.3 gitea /usr/local/bin/gitea web --config /home/wooo/gitea/app.ini",
+            ]
+        ),
+        encoding="utf-8",
+    )
+    docker_file = tmp_path / "docker.prom"
+    docker_file.write_text(
+        'docker_container_cpu_cores{host="110",container_name="gitea"} 1.4591\n',
+        encoding="utf-8",
+    )
+
+    result = subprocess.run(
+        [
+            sys.executable,
+            str(EVIDENCE_PATH),
+            "--host",
+            "110",
+            "--ps-file",
+            str(ps_file),
+            "--docker-stats-file",
+            str(docker_file),
+            "--json",
+        ],
+        check=True,
+        capture_output=True,
+        text=True,
+    )
+
+    payload = json.loads(result.stdout)
+    assert payload["recommendation"] == "gitea_queue_or_hook_backlog_playbook"
+    assert payload["top_process_families"][0]["family"] == "systemd_control_plane"
+    assert payload["top_containers"][0]["container_name"] == "gitea"
+    assert "/home/wooo/gitea/app.ini" not in result.stdout
+
+
 def test_sustained_load_evidence_keeps_stale_container_samples_untrusted(tmp_path: Path) -> None:
     metrics_file = tmp_path / "host.prom"
     metrics_file.write_text(
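Both new tests feed the scripts a Prometheus text-format file of `docker_container_cpu_cores` samples. As a hedged illustration of how such a file can be turned into a ranked `top_containers` list, the sketch below uses two regular expressions. The function name `top_containers_from_prom` and the regexes are assumptions for this example; the evidence script's own parser, and its handling of stale samples, are not shown in this diff.

```python
import re

# Matches lines such as: docker_container_cpu_cores{host="110",container_name="gitea"} 1.59
SAMPLE_RE = re.compile(r'^docker_container_cpu_cores\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def top_containers_from_prom(text: str) -> list[dict[str, object]]:
    """Parse CPU-core samples and return them sorted by usage, highest first."""
    rows = []
    for line in text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match:
            continue  # ignore comments, blank lines, and other metrics
        labels = dict(LABEL_RE.findall(match.group("labels")))
        rows.append(
            {
                "container_name": labels.get("container_name", ""),
                "cpu_cores": float(match.group("value")),
            }
        )
    return sorted(rows, key=lambda row: row["cpu_cores"], reverse=True)


DOCKER_FIXTURE = "\n".join(
    [
        'docker_container_cpu_cores{host="110",container_name="redis"} 0.2',
        'docker_container_cpu_cores{host="110",container_name="gitea"} 1.59',
    ]
)
ranked = top_containers_from_prom(DOCKER_FIXTURE)
print(ranked[0])
```

The `container_name` and `cpu_cores` keys follow the field names visible in the diff; everything else in the sketch is illustrative.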