fix(ops): prioritize live gitea pressure routing
All checks were successful
CD Pipeline / workflow-shape (push) Successful in 0s
CD Pipeline / cancel-stale-cd (push) Has been skipped
CD Pipeline / tests (push) Successful in 57s
CD Pipeline / build-and-deploy (push) Successful in 4m34s
CD Pipeline / post-deploy-checks (push) Successful in 6m0s

Your Name
2026-07-02 12:26:27 +08:00
parent 12284c915a
commit ec563465e7
5 changed files with 149 additions and 15 deletions


@@ -52441,21 +52441,23 @@ production browser smoke:
- `scripts/ops/host-sustained-load-controller.py` adds live `ps` process-family classification and routes moderate pressure to the `control_plane_saturation`, `gitea_queue_or_hook_backlog`, and `stockplatform_hot_query_or_api_pressure` check-mode packets; the controller still only produces controlled packets and performs no Docker / systemd / DB / Nginx restart.
- `ops/monitoring/alerts-unified.yml`: `Host110SustainedModeratePressure` is extended from watching only `load5/core` / container CPU to also watching `awoooi_host_process_family_cpu_percent{family=~"systemd_control_plane|ssh_control_plane|gitea_service|postgres"} > 50`; the Gitea / StockPlatform container early-triage threshold is lowered from `> 2.0 core` to `> 1.0 core`.
- `scripts/ops/host-sustained-load-controller.py` gains `--script-dir`, defaulting to `/home/wooo/scripts`, so that the `dry_run_command` / `post_apply_verifier` emitted for live alerts point at the helpers actually deployed on 110 instead of the repo-relative path `scripts/ops/...`, which does not resolve on the host.
- `scripts/ops/host-sustained-load-controller.py` / `host-sustained-load-evidence.py` fix the routing priority in step: when fresh Docker stats show `gitea` or a key StockPlatform container above `1.0 core`, routing goes to the service playbook first and is no longer preempted into the control-plane playbook by the long-lived `systemd_control_plane` average CPU.
- Deployed live to 110:
- `/home/wooo/scripts/host-runaway-process-exporter.py` SHA `d85d27c81ea76a8f2f370ee85c92381bec4440eea4fd37efb2efb9f43dbd1a8a`
- `/home/wooo/scripts/host-sustained-load-controller.py` SHA `7a11407c7df05427085982d6f6d11d1756f908591573a28c9b3267de32b94f3e`
- `/home/wooo/scripts/host-sustained-load-controller.py` SHA `1bf9c183fe8d89c30008e08db0903c24c609df31eeebe62f9d59b3a26a3bd1c0`
- `/home/wooo/scripts/host-sustained-load-evidence.py` SHA `2fd8e7d43a0249f97b35a865cfd4ce2aa45162729faebfc837cde8cf48beec38`
- `bash scripts/ops/deploy-alerts.sh` completed; Prometheus has loaded `159` rules.
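
To illustrate the `--script-dir` change described in the list above, here is a minimal sketch of building a helper command rooted at the deployed script directory and rejecting a repo-relative one. The function names `build_helper_command` and `uses_repo_relative_path` are hypothetical and are not taken from the controller; only the default directory, the `scripts/ops/` prefix, and the `--host` / `--json` flags come from this commit.

```python
from pathlib import PurePosixPath

# Default taken from the changelog entry; everything else here is illustrative.
DEFAULT_SCRIPT_DIR = "/home/wooo/scripts"


def build_helper_command(script_name: str, host: str, script_dir: str = DEFAULT_SCRIPT_DIR) -> str:
    """Return a helper invocation rooted at an absolute script directory."""
    script_path = PurePosixPath(script_dir) / script_name
    if not script_path.is_absolute():
        # A relative directory would reproduce the broken repo-relative output.
        raise ValueError(f"script dir must be absolute, got {script_dir!r}")
    return f"{script_path} --host {host} --json"


def uses_repo_relative_path(command: str) -> bool:
    """Flag commands whose executable still starts with the repo-relative prefix."""
    return command.startswith("scripts/ops/")
```

A check like `uses_repo_relative_path` mirrors the intent of the test assertion `"scripts/ops/" not in payload["commands"]["dry_run"]` that appears later in this commit.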
**Live readback evidence**
- 110 textfile / Prometheus now exports the process-family metric: `systemd_control_plane=72.4`, `gitea_service=53.1`, `postgres=11.1`.
- Prometheus rule readback: `Host110SustainedModeratePressure state=firing health=ok`; the query now includes `docker_container_cpu_cores > 1` and `awoooi_host_process_family_cpu_percent > 50`.
- Alertmanager `/api/v2/alerts` has active alerts: `family=gitea_service`, `family=systemd_control_plane`, `status.state=active`, `auto_repair=true`.
- Live controller readback: `classification=blocked_control_plane_saturation_requires_playbook`, `next_action=run_control_plane_saturation_playbook_check_mode`, `dry_run_command=/home/wooo/scripts/host-sustained-load-evidence.py ...`, `post_apply_verifier=/home/wooo/scripts/host-sustained-load-controller.py ...`, `controller_exit=75`.
- Live sanitized evidence readback: `recommendation=control_plane_saturation_playbook`, `controlled_apply_allowed=false`; `top_process_families` includes `systemd_control_plane=72.4`, `gitea_service=53.1`, `unknown=54.3`; in `top_containers`, `gitea=1.5951 core` is the highest; the evidence explicitly states that it does not output raw command lines / URLs / secrets.
- Live controller readback: `classification=blocked_gitea_queue_or_hook_backlog_requires_playbook`, `next_action=run_gitea_queue_or_hook_backlog_playbook_check_mode`, `dry_run_command=/home/wooo/scripts/host-sustained-load-evidence.py ...`, `post_apply_verifier=/home/wooo/scripts/host-sustained-load-controller.py ...`, `controller_exit=75`.
- Live sanitized evidence readback: `recommendation=gitea_queue_or_hook_backlog_playbook`, `controlled_apply_allowed=false`; `top_process_families` includes `systemd_control_plane=72.5`, `gitea_service=53.1`, `unknown=53.7`; in `top_containers`, `gitea=1.3055 core` is the highest; the evidence explicitly states that it does not output raw command lines / URLs / secrets.
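
The process-family readback above can be pictured with a small sketch that sums `%CPU` per family from `ps`-style lines and returns only family names, never command lines. This is an assumption-heavy illustration: the column layout (PID, PPID, PGID, elapsed seconds, `%CPU`, `%MEM`, command name, arguments) is inferred from the test fixtures in this commit, and the `FAMILY_BY_COMM` mapping and `aggregate_family_cpu` name are hypothetical, since the real classifier is not shown in the diff.

```python
from collections import defaultdict

# Hypothetical mapping from command name to family; the real rules are not in this diff.
FAMILY_BY_COMM = {
    "systemd": "systemd_control_plane",
    "dbus-daemon": "systemd_control_plane",
    "sshd": "ssh_control_plane",
    "gitea": "gitea_service",
    "postgres": "postgres",
}


def aggregate_family_cpu(ps_lines: list[str]) -> dict[str, float]:
    """Sum %CPU per process family, highest first, without keeping raw command lines."""
    totals: dict[str, float] = defaultdict(float)
    for line in ps_lines:
        # Assumed columns: pid ppid pgid etimes %cpu %mem comm args...
        fields = line.split(None, 7)
        if len(fields) < 7:
            continue
        try:
            cpu_percent = float(fields[4])
        except ValueError:
            continue  # skip header or malformed rows
        family = FAMILY_BY_COMM.get(fields[6], "unknown")
        totals[family] += cpu_percent
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return {family: round(total, 1) for family, total in ranked}
```

Because only the family label and a rounded total leave the function, paths such as a service config file passed on the command line never reach the output, which matches the sanitization claim in the readback.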
**Local verification results**
- `python3.11 -m py_compile scripts/ops/host-runaway-process-exporter.py scripts/ops/host-sustained-load-controller.py scripts/ops/host-sustained-load-evidence.py`: passed.
- `python3.11 -m pytest scripts/ops/tests/test_host_runaway_process_exporter.py scripts/ops/tests/test_host_pressure_alert_contract.py -q`: `27 passed`.
- `python3.11 -m pytest scripts/ops/tests/test_host_runaway_process_exporter.py scripts/ops/tests/test_host_pressure_alert_contract.py -q`: `29 passed`.
- `python3.11 -m pytest ops/runner/test_cd_controlled_runtime_profile.py scripts/ops/tests/test_host_runaway_process_exporter.py scripts/ops/tests/test_host_pressure_alert_contract.py -q`: `70 passed`.
- `python3.11 -c "import yaml; yaml.safe_load(open('ops/monitoring/alerts-unified.yml'))"`: passed.
- `python3.11 ops/runner/guard-gitea-runner-pressure.py --root .`: passed, `auto_branch_events_on_110=0`, `generic_runner_labels=0`.


@@ -94,6 +94,8 @@ v1.82 bounded summary rule: `post-start-quick-check.sh` and `188-host-hygiene-m
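2026-07-02 12:15 adds the controller command path contract: the `dry_run_command` / `post_apply_verifier` produced by `host-sustained-load-controller.py` must use the paths actually deployed on 110: `/home/wooo/scripts/host-sustained-load-evidence.py`, `/home/wooo/scripts/host-sustained-load-controller.py`, and `/home/wooo/scripts/host-runaway-process-remediation.py`. If the controller emits repo-relative paths such as `scripts/ops/...` on the live host, treat it as a break in the alert automation chain: fix the controller / `--script-dir` first, then enter playbook check-mode. A "helper not found" result must not be treated as the CPU root cause having been handled.
2026-07-02 12:35 adds the 110 CPU routing priority: if Docker stats are fresh and `gitea` or a key StockPlatform container has exceeded the early-triage threshold of `1.0 core`, the controller / evidence must route to the corresponding service playbook first and must not be preempted into the control-plane playbook by the long-lived `ps %CPU` average of `systemd_control_plane`. Control-plane saturation remains as the fallback path, for situations with no known hot container / hot service family.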
2026-06-25 20:25 orphan Chrome cleanup / scorecard refresh supersedes the 20:11 CPU wording. 110 high CPU was traced to two `stockplatform-review-bulk-ux` Chrome process groups `2756503` and `2829627` with root Chrome process `PPID=1`, elapsed about 5h, no active parent smoke, and sustained GPU/renderer CPU. With user approval, only those two process groups received targeted `SIGTERM` at 20:24. Post-check showed no remaining PGID entries; `vmstat` showed CPU idle around `85-90%`, `si/so=0`, and no immediate swap thrash. No Docker/systemd/Nginx/firewall/K8s action, CI cancellation, manual data ingestion, manual DB write, Wazuh/SOC runtime change, or secret read was performed. The 20:25 full post-start wrapper then returned cold-start `PASS=89 WARN=0 BLOCKED=0`, but overall `POST_START_QUICK_CHECK PASS=37 WARN=2 BLOCKED=1`, `RESULT=BLOCKED`, because StockPlatform data freshness was still blocked at that time and DR remained incomplete.
2026-06-25 20:11 StockPlatform cron-source recovery supersedes the 19:35 source-version wording. StockPlatform Gitea `main` and live `/home/wooo/stockplatform-v2` are now at `fb91aa4c6272469d1d26e0820169629eac17d28a fix(ops): restore production cron recovery entrypoints`; six missing production cron entrypoint scripts are restored, `run-intelligence-sync.sh` contains the Docker-backed `psql` shim, and live contract check confirms every `scripts/ops/*.sh` referenced by `install-production-cron.sh` exists. The only live write performed for StockPlatform recovery was a fast-forward `git pull --ff-only origin main` on 110; no Docker/systemd/Nginx/firewall/K8s restart, manual ingestion run, manual DB write, or secret read was performed. Natural cron evidence after the pull is now green for the repaired entrypoints: `source-remediation-queue` 19:56 and 20:00 succeeded, `market-index-ingestion` 20:00 succeeded, `price-ingestion` 20:02 succeeded, `margin-short-ingestion` 20:05 succeeded, `chips-ingestion` 20:06 succeeded, and `ai-recommendation-pipeline` 20:10 ran but correctly produced the internal blocker `core_margin_short_daily_incomplete,official_margin_short_daily_official_pending`. StockPlatform `/api/v1/system/freshness` therefore still returns `status=blocked` because the 2026-06-25 official margin-short source is pending and `ai.recommendations` must stay on 2026-06-24 until that gate clears. This is no longer a route, source-version, or missing-cron-script blocker; it is a product-data freshness blocker waiting on official source availability and the next valid AI pipeline run.
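
As an illustration of the 12:35 routing-priority entry above, the following sketch orders the checks so that a fresh hot-container reading wins over the control-plane average. It is a simplified stand-in, not the controller's actual logic: the function name `route_playbook`, its parameters, and the `no_playbook` fallback value are assumptions, while the thresholds (`1.0 core`, `50` percent) and the playbook names are taken from this commit.

```python
def route_playbook(
    top_container_name: str,
    top_container_cpu: float,
    control_plane_cpu: float,
    docker_stats_fresh: bool,
    hot_container_threshold: float = 1.0,
    family_threshold: float = 50.0,
) -> str:
    """Pick a playbook, preferring a fresh hot-container signal over the ps average."""
    # 1. Service playbooks first, but only when the Docker sample is trusted.
    if docker_stats_fresh and top_container_cpu >= hot_container_threshold:
        if "gitea" in top_container_name:
            return "gitea_queue_or_hook_backlog_playbook"
        if "stockplatform-v2-postgres-1" in top_container_name:
            return "postgres_hot_query_or_backup_export_playbook"
    # 2. Control-plane saturation is the fallback when no hot service is known.
    if control_plane_cpu >= family_threshold:
        return "control_plane_saturation_playbook"
    # 3. Hypothetical placeholder for "nothing to route".
    return "no_playbook"
```

With the live readback values from this commit (`gitea=1.3055 core`, `systemd_control_plane=72.5`), a fresh sample routes to the Gitea playbook, and the same numbers with a stale Docker sample fall back to the control-plane playbook.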


@@ -537,15 +537,6 @@ def build_packet(
             f"--docker-stats-file {DEFAULT_DOCKER_STATS_FILE} --json"
         )
         next_action = "run_gitea_queue_or_hook_backlog_playbook_check_mode"
-    elif control_plane_cpu >= process_family_cpu_threshold:
-        classification = "blocked_control_plane_saturation_requires_playbook"
-        severity = "critical" if load5_per_core > load5_per_core_threshold else "warning"
-        dry_run_command = (
-            f"{evidence_script} "
-            f"--host {host} --metrics-file {DEFAULT_METRICS_FILE} "
-            f"--docker-stats-file {DEFAULT_DOCKER_STATS_FILE} --json"
-        )
-        next_action = "run_control_plane_saturation_playbook_check_mode"
     elif (
         "stockplatform-v2-postgres-1" in top_container_name
         and top_container_cpu >= hot_container_cpu_threshold
@@ -572,6 +563,15 @@ def build_packet(
             f"--docker-stats-file {DEFAULT_DOCKER_STATS_FILE} --json"
         )
         next_action = "run_gitea_queue_or_hook_backlog_playbook_check_mode"
+    elif control_plane_cpu >= process_family_cpu_threshold:
+        classification = "blocked_control_plane_saturation_requires_playbook"
+        severity = "critical" if load5_per_core > load5_per_core_threshold else "warning"
+        dry_run_command = (
+            f"{evidence_script} "
+            f"--host {host} --metrics-file {DEFAULT_METRICS_FILE} "
+            f"--docker-stats-file {DEFAULT_DOCKER_STATS_FILE} --json"
+        )
+        next_action = "run_control_plane_saturation_playbook_check_mode"
     elif load5_per_core > load5_per_core_threshold and swap_used_ratio >= 0.85:
         classification = "blocked_memory_or_swap_pressure_requires_service_playbook"
         severity = "critical"


@@ -281,11 +281,23 @@ def recommend_playbook(process_families: list[dict[str, Any]], containers: list[
     top_container_cpu = float(top_container.get("cpu_cores") or 0.0)
     top_family = process_families[0] if process_families else {}
     family = str(top_family.get("family") or "")
+    family_cpu = {
+        str(item.get("family") or ""): float(item.get("cpu_percent") or 0.0)
+        for item in process_families
+    }
-    if "gitea" in top_container_name and top_container_cpu >= 2.0:
+    if "gitea" in top_container_name and top_container_cpu >= 1.0:
         return "gitea_queue_or_hook_backlog_playbook"
-    if "postgres" in top_container_name or "postgres" in family:
+    if (
+        (
+            "postgres" in top_container_name
+            or "stockplatform-v2-postgres-1" in top_container_name
+        )
+        and top_container_cpu >= 1.0
+    ) or family_cpu.get("postgres", 0.0) >= 50.0:
         return "postgres_hot_query_or_backup_export_playbook"
+    if family_cpu.get("gitea_service", 0.0) >= 50.0:
+        return "gitea_queue_or_hook_backlog_playbook"
     if family in {"docker_build", "web_build", "gitea_actions_runner"}:
         return "build_or_runner_pressure_playbook"
     if family in {"systemd_control_plane", "ssh_control_plane"}:


@@ -526,6 +526,80 @@ def test_sustained_load_controller_routes_gitea_quota_pressure_even_when_load_is
     assert "scripts/ops/" not in payload["commands"]["dry_run"]
 
 
+def test_sustained_load_controller_prioritizes_hot_gitea_container_over_control_plane_average(
+    tmp_path: Path,
+) -> None:
+    metrics_file = tmp_path / "host.prom"
+    metrics_file.write_text(
+        "\n".join(
+            [
+                'awoooi_host_runaway_process_monitor_up{host="110",mode="read_only"} 1',
+                'awoooi_host_load5_per_core{host="110"} 0.70',
+                'awoooi_host_swap_used_ratio{host="110"} 0.1',
+                'awoooi_host_runaway_process_remediation_authorized{host="110"} 0',
+                'awoooi_host_gitea_actions_active_container_count{host="110"} 0',
+                'awoooi_host_gitea_actions_active_process_group_count{host="110"} 0',
+                'awoooi_host_runaway_browser_orphan_group_count{host="110",rule="stockplatform_headless_smoke",min_age_seconds="1800",min_cpu_percent="50"} 0',
+            ]
+        ),
+        encoding="utf-8",
+    )
+    docker_file = tmp_path / "docker.prom"
+    docker_file.write_text(
+        "\n".join(
+            [
+                'docker_container_cpu_cores{host="110",container_name="gitea"} 1.59',
+                'docker_container_cpu_cores{host="110",container_name="redis"} 0.2',
+            ]
+        ),
+        encoding="utf-8",
+    )
+    ps_file = tmp_path / "ps.txt"
+    ps_file.write_text(
+        "\n".join(
+            [
+                "100 1 100 75507 61.8 0.0 systemd /sbin/init",
+                "101 1 101 75469 6.7 0.0 dbus-daemon @dbus-daemon --system",
+                "200 1 200 75348 53.1 1.3 gitea /usr/local/bin/gitea web --config /home/wooo/gitea/app.ini",
+            ]
+        ),
+        encoding="utf-8",
+    )
+    result = subprocess.run(
+        [
+            sys.executable,
+            str(CONTROLLER_PATH),
+            "--host",
+            "110",
+            "--load5-per-core-threshold",
+            "0.75",
+            "--hot-container-cpu-threshold",
+            "1.0",
+            "--container-cpu-threshold",
+            "2.0",
+            "--metrics-file",
+            str(metrics_file),
+            "--docker-stats-file",
+            str(docker_file),
+            "--ps-file",
+            str(ps_file),
+            "--json",
+        ],
+        capture_output=True,
+        text=True,
+    )
+    assert result.returncode == 75
+    payload = json.loads(result.stdout)
+    assert payload["classification"] == "blocked_gitea_queue_or_hook_backlog_requires_playbook"
+    assert payload["next_action"] == "run_gitea_queue_or_hook_backlog_playbook_check_mode"
+    assert payload["readback"]["control_plane_process_cpu_percent"] == 68.5
+    assert payload["readback"]["top_container_cpu"]["container_name"] == "gitea"
+    assert "/home/wooo/scripts/host-sustained-load-evidence.py" in payload["commands"]["dry_run"]
+    assert "/home/wooo/gitea/app.ini" not in result.stdout
+
+
 def test_sustained_load_controller_ignores_stale_docker_stats_attribution(tmp_path: Path) -> None:
     metrics_file = tmp_path / "host.prom"
     metrics_file.write_text(
@@ -842,6 +916,50 @@ def test_sustained_load_evidence_emits_sanitized_gitea_recommendation(tmp_path:
     assert "/home/wooo" not in result.stdout
 
 
+def test_sustained_load_evidence_prioritizes_hot_gitea_container_over_control_plane_average(
+    tmp_path: Path,
+) -> None:
+    ps_file = tmp_path / "ps.txt"
+    ps_file.write_text(
+        "\n".join(
+            [
+                "100 1 100 75507 61.8 0.0 systemd /sbin/init",
+                "101 1 101 75469 6.7 0.0 dbus-daemon @dbus-daemon --system",
+                "200 1 200 75348 53.1 1.3 gitea /usr/local/bin/gitea web --config /home/wooo/gitea/app.ini",
+            ]
+        ),
+        encoding="utf-8",
+    )
+    docker_file = tmp_path / "docker.prom"
+    docker_file.write_text(
+        'docker_container_cpu_cores{host="110",container_name="gitea"} 1.4591\n',
+        encoding="utf-8",
+    )
+    result = subprocess.run(
+        [
+            sys.executable,
+            str(EVIDENCE_PATH),
+            "--host",
+            "110",
+            "--ps-file",
+            str(ps_file),
+            "--docker-stats-file",
+            str(docker_file),
+            "--json",
+        ],
+        check=True,
+        capture_output=True,
+        text=True,
+    )
+    payload = json.loads(result.stdout)
+    assert payload["recommendation"] == "gitea_queue_or_hook_backlog_playbook"
+    assert payload["top_process_families"][0]["family"] == "systemd_control_plane"
+    assert payload["top_containers"][0]["container_name"] == "gitea"
+    assert "/home/wooo/gitea/app.ini" not in result.stdout
+
+
 def test_sustained_load_evidence_keeps_stale_container_samples_untrusted(tmp_path: Path) -> None:
     metrics_file = tmp_path / "host.prom"
     metrics_file.write_text(