# Host Runaway Process AIOps PlayBook
> Last updated: 2026-06-18 Asia/Taipei
> Scope: host 110 CPU saturation, orphan Chrome / Playwright smoke, Gitea Actions CI load, and triage of Prisma / package-install resource pressure.
---
## 1. Goal
This PlayBook codifies the 2026-06-18 CPU saturation incident on host 110 into a closed AI Ops loop:
```text
read-only exporter -> Prometheus alert -> AI triage packet -> KM / PlayBook evidence -> gated remediation -> post-check / recurrence guard
```
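The problem it addresses is not the generic symptom "CPU is high" but telling the following cases apart precisely:
| Type | Criteria | Handling |
|------|------|------|
| orphan browser smoke | A headless Chrome / Chromium / Playwright process group has lived too long, has PPID=1 or a missing group leader, and its aggregate CPU is too high | Produce a dry-run remediation package; `SIGTERM` may be sent once the controlled apply receipt, evidence, and verifier are in place |
| Legitimate CI load | A Gitea Actions task container is running and no orphan browser metric fires | Watch the queue / timeout; do not kill by mistake |
| Docker / Sentry / Harbor incident | container restart, port down, journal error, cold-start gate blocked | Follow each service's own SOP; do not use this PlayBook to kill processes |
| Swap full but not thrashing | Swap ratio is high but `vmstat` / load classification shows no live thrash | Do not clear swap manually; reduce the high-CPU source first |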
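
The table above can be read as a small decision function. The following is a minimal sketch under stated assumptions: `HostSnapshot`, its field names, the lane strings, the order of precedence, and the 0.9 swap threshold are all hypothetical and are not taken from the deployed exporter or triager. Only the `>= 100` CPU threshold comes from the alert definition in this PlayBook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostSnapshot:
    """Hypothetical summary of one scrape; field names are illustrative."""
    orphan_group_count: int      # orphan browser process groups
    orphan_cpu_percent: float    # aggregate CPU of those groups
    active_ci_containers: int    # active Gitea Actions task containers
    service_incident: bool       # container restart, port down, journal error
    swap_used_ratio: float       # 0.0 to 1.0
    swap_thrashing: bool         # from vmstat / load classification


def classify(s: HostSnapshot) -> str:
    """Return a triage lane for the snapshot (precedence is an assumption)."""
    if s.service_incident:
        return "service_incident"         # follow the service's own SOP
    if s.orphan_group_count > 0 and s.orphan_cpu_percent >= 100.0:
        return "orphan_browser_smoke"     # dry-run package, gated SIGTERM
    if s.active_ci_containers > 0 and s.orphan_group_count == 0:
        return "legitimate_ci_load"       # watch queue / timeout, no kill
    if s.swap_used_ratio >= 0.9 and not s.swap_thrashing:
        return "swap_full_not_thrashing"  # reduce CPU source, leave swap alone
    return "unclassified"
```

Checking the service-incident case first reflects the rule that this PlayBook must never be used to kill processes during a Docker / Sentry / Harbor incident.
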
---
## 2. Metrics and Alerts
On host 110, the Ansible `host-textfile-exporters` role deploys:
```text
/home/wooo/scripts/host-runaway-process-exporter.py
/home/wooo/scripts/host-runaway-process-remediation.py
/home/wooo/node_exporter_textfiles/host_runaway_process.prom
```
Core metrics:
| Metric | Meaning |
|--------|------|
| `awoooi_host_runaway_process_monitor_up{host="110"}` | Whether the exporter is emitting output normally |
| `awoooi_host_runaway_browser_orphan_group_count{host="110",rule=...}` | Number of orphan browser process groups matching the rule |
| `awoooi_host_runaway_browser_orphan_cpu_percent{host="110",rule=...}` | Aggregate CPU of the orphan groups |
| `awoooi_host_gitea_actions_active_container_count{host="110"}` | Currently active Gitea Actions task containers |
| `awoooi_host_load5_per_core{host="110"}` | load5 / CPU cores |
| `awoooi_host_swap_used_ratio{host="110"}` | Swap usage ratio |
| `awoooi_host_runaway_process_remediation_authorized{host="110"}` | Must always be `0`; the exporter is not an executor |
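
To illustrate how a consumer might read these metrics from the textfile, here is a minimal sketch of a parser for the Prometheus text exposition format. It is an assumption-laden illustration, not the exporter's or triager's actual code: `parse_textfile` and `remediation_flag_is_safe` are hypothetical names, and the parser ignores timestamps and escaped characters in label values.

```python
import re

# name, optional {labels}, value. Timestamps and escaped quotes are not handled.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{([^}]*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_textfile(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Parse text-format samples into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP / TYPE / blank lines
            continue
        m = _SAMPLE.match(line)
        if m is None:
            continue
        name, raw_labels, value = m.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def remediation_flag_is_safe(samples) -> bool:
    """True only if the flag is present and every sample of it is exactly 0."""
    flags = [value for name, _, value in samples
             if name == "awoooi_host_runaway_process_remediation_authorized"]
    return bool(flags) and all(value == 0 for value in flags)
```

Treating a missing flag as unsafe is a design choice of this sketch: a monitor that silently stops reporting the flag should not be read as "authorized is 0".
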

Alerts:
| Alert | Condition | Action |
|-------|------|------|
| `HostOrphanBrowserSmokeHighCpu` | orphan browser group `> 0` and CPU `>= 100%`, sustained for 10 minutes | Generate the dry-run remediation package and supply the controlled apply id / evidence / post verifier |
| `HostCiRunnerLoadSaturation` | load5/core `> 1.0` and active Gitea Actions `> 0` | Mark as short-term CI load and check the runner queue; do not kill directly |
| `HostRunawayProcessMonitorMissing` / `Stale` | Exporter missing or not updated for more than 10 minutes | Fix the exporter / cron / textfile collector |
| `HostRunawayProcessRemediationUnexpectedlyAuthorized` | `remediation_authorized > 0` | Roll back immediately; turning the monitor into an executor is forbidden |
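
The "sustained for 10 minutes" condition behaves much like a Prometheus `for:` clause: the expression has to stay true for the whole window before the alert fires. The sketch below shows that behaviour in plain Python. `SustainedCondition`, `orphan_high_cpu`, and `ci_saturation` are hypothetical helper names used for illustration; the real alerts are evaluated by Prometheus, not by this code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SustainedCondition:
    """Fires once a predicate has been continuously true for hold_seconds."""
    hold_seconds: float
    _since: Optional[float] = field(default=None, init=False)

    def update(self, now: float, is_true: bool) -> bool:
        if not is_true:
            self._since = None       # any false sample resets the timer
            return False
        if self._since is None:
            self._since = now        # first true sample starts the timer
        return now - self._since >= self.hold_seconds


def orphan_high_cpu(group_count: float, cpu_percent: float) -> bool:
    """Instantaneous condition behind HostOrphanBrowserSmokeHighCpu."""
    return group_count > 0 and cpu_percent >= 100.0


def ci_saturation(load5_per_core: float, active_ci: float) -> bool:
    """Instantaneous condition behind HostCiRunnerLoadSaturation."""
    return load5_per_core > 1.0 and active_ci > 0
```

For example, a `SustainedCondition(600)` fed with true samples at `t=0` and `t=300` stays quiet and fires at `t=600`; a single false sample in between restarts the window.
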
Telegram / AI event packet contract:
| Alert / input | Telegram lane | Must display |
|---------------|---------------|----------|
| `HostOrphanBrowserSmokeHighCpu` | `orphan_browser_smoke_runaway_process` | alertname, host, rule, runaway dry-run, controlled apply id, evidence ref, post verifier, KM / PlayBook / Verifier write-back |
| `HostCiRunnerLoadSaturation` | `ci_runner_load_saturation` | Gitea Actions run, runner queue, load/core, swap trend, capacity / queue verdict, no process remediation |
| raw `CPU 警告` (CPU warning) / `ps aux` dump | `runner_build_resource_pressure`, `runner_prisma_generate_resource_pressure`, `host_resource_pressure_triage` | `host-sustained-load-evidence.py` produces sanitized top process / container evidence; never show raw workspace paths, hosted toolcache paths, `node_modules` paths, external URLs, JSON payloads, or full process dumps |

Every Telegram card must explicitly show `runtime_write_gate=controlled/0`, `controlled_apply_allowed`, the post verifier, and the forbidden actions, and must not turn an alert / card into a direct kill / restart / reload command.
Before a host / runner raw dump reaches Telegram, `TelegramGateway` must first compress it into a `P1/P2/P3 主機資源壓力` (host resource pressure) card. The first screen may show only CPU, load, root process count, AI lane, candidate gate, top evidence, and forbidden actions. Full command lines, package JSON, external check endpoints, internal workspace paths, and raw `ps aux` output must stay in the internal evidence / timeline and must not be sent out.
Every in-app `sendMessage` must pass through the final-exit normalization in `TelegramGateway._send_request()`. If a Gitea workflow, Alertmanager receiver, host cron job, or external bot calls the Telegram Bot API directly, it must be brought under separate configuration control and formatter convergence; such paths cannot be claimed to be fully governed by the API side.
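
As an illustration of the redaction rule above, the following sketch strips URLs, JSON payloads, and filesystem paths from a single evidence line before it could appear on a card. It is only a sketch: the `sanitize` function and its regular expressions are assumptions made for this example and do not describe the actual `TelegramGateway` formatter, which this PlayBook does not show.

```python
import re

_URL = re.compile(r"https?://\S+")
_JSON = re.compile(r"\{.*\}")  # greedy: everything from first "{" to last "}"
# An absolute path: "/" not preceded by a word character or ".", then segments.
_PATH = re.compile(r"(?<![\w.])/(?:[\w.@+-]+/)*[\w.@+-]+")


def sanitize(line: str) -> str:
    """Replace URLs, JSON payloads, and absolute paths with placeholders."""
    line = _URL.sub("[url]", line)    # URLs first, so their paths are not split
    line = _JSON.sub("[json]", line)
    line = _PATH.sub("[path]", line)
    return line
```

With these patterns, `chrome --headless --user-data-dir=/tmp/pw-123 https://internal.example/health` becomes `chrome --headless --user-data-dir=[path] [url]`, while a plain token such as `load5/core` is left untouched. A denylist like this is inherently incomplete, which is one reason to prefer emitting only process-family summaries over scrubbing raw command lines.
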
---
## 3. Required AI Triager Assessment
On receiving `HostOrphanBrowserSmokeHighCpu`, the AI / operator must first produce a dry-run:
```bash
python3 scripts/ops/host-runaway-process-remediation.py \
--rule stockplatform_headless_smoke \
--min-age-seconds 1800 \
--min-cpu-percent 50
```
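The dry-run must check:
1. `candidate_count > 0`.
2. `orphan_reason` is `ppid_1` or `missing_group_leader`.
3. `oldest_age_seconds` exceeds the PlayBook threshold.
4. The `active Gitea Actions` jobs and the candidate process group are not the same legitimate job that is still running.
5. The candidate is not the Docker daemon, Sentry, Harbor, PostgreSQL, ClickHouse, K3s, or a backup service itself.
6. A controlled apply id, evidence ref, and post verifier exist. Owner / maintenance window serve only as additional evidence and are not a default blocker for low-blast orphan cleanup.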
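
The six checks above can be sketched as one function that lists every blocker instead of stopping at the first. This is a hedged illustration: `candidate_count`, `orphan_reason`, and `oldest_age_seconds` appear in this PlayBook, but `overlaps_active_ci_job`, `process_families`, the gate key names, the `PROTECTED_FAMILIES` set, and the 1800-second default (borrowed from the `--min-age-seconds` example) are assumptions about a payload shape that is not documented here.

```python
ALLOWED_ORPHAN_REASONS = {"ppid_1", "missing_group_leader"}
# Hypothetical family labels for services that must never be signalled.
PROTECTED_FAMILIES = {
    "dockerd", "sentry", "harbor", "postgres", "clickhouse", "k3s", "backup",
}


def dry_run_blockers(payload: dict, min_age_seconds: int = 1800) -> list[str]:
    """Return the reasons a dry-run payload may not proceed to apply."""
    blockers = []
    if payload.get("candidate_count", 0) <= 0:
        blockers.append("no candidates")
    if payload.get("orphan_reason") not in ALLOWED_ORPHAN_REASONS:
        blockers.append("orphan_reason is not ppid_1 / missing_group_leader")
    if payload.get("oldest_age_seconds", 0) <= min_age_seconds:
        blockers.append("candidates are younger than the threshold")
    # Fail closed: an unknown CI overlap counts as an overlap.
    if payload.get("overlaps_active_ci_job", True):
        blockers.append("candidate may belong to a running CI job")
    if PROTECTED_FAMILIES & set(payload.get("process_families", [])):
        blockers.append("protected service in candidate set")
    for key in ("controlled_apply_id", "evidence_ref", "post_verifier"):
        if not payload.get(key):
            blockers.append(f"missing {key}")
    return blockers
```

An empty list means all six checks passed. Owner and maintenance-window fields are deliberately absent from the blockers, matching the rule that they are additional evidence only.
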
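If only `HostCiRunnerLoadSaturation` fires and the orphan group count is `0`, the default verdict is "legitimate short-term CI load". Do not kill processes automatically; only the runner queue verifier, the stale-run drain/cancel packet, and host pressure fail-closed are allowed.
If only `HostLoadAverageSustainedHigh` fires and none of orphan / active CI / swap is a clear hit, the AI must first run the read-only, sanitized evidence collector: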
```bash
python3 scripts/ops/host-sustained-load-evidence.py \
--host 110 \
--metrics-file /home/wooo/node_exporter_textfiles/host_runaway_process.prom \
--docker-stats-file /home/wooo/node_exporter_textfiles/docker_stats.prom \
--json
```
The collector outputs only the process family, container CPU, and the PlayBook recommendation. It does not output raw command lines, workspace paths, URLs, JSON payloads, or secrets.
---
## 4. Gated Remediation
When `SIGTERM` is actually sent, the full controlled apply gate must be supplied:
```bash
python3 scripts/ops/host-runaway-process-remediation.py \
--apply \
--confirm-apply \
--rule stockplatform_headless_smoke \
--controlled-apply-id CAP-REDACTED \
--evidence-ref INC-REDACTED \
--post-apply-verifier "scripts/ops/host-sustained-load-controller.py --host 110 --json" \
--wait-seconds 5
```
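Prohibited actions:
- Do not default to `SIGKILL`.
- Do not run `systemctl restart docker` just because CPU is high.
- Do not restart Sentry / Harbor / Gitea / Nginx.
- Do not change firewall / iptables / NetworkPolicy.
- Do not read or output secret values, tokens, hashes, or prefixes / suffixes.
- Do not treat route 200, container up, or a visible UI as proof that remediation is complete.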
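
For readers unfamiliar with signalling a process group, here is a minimal, Unix-only sketch of a `SIGTERM`-only termination step with a wait and a re-probe. It is not the implementation of `host-runaway-process-remediation.py`, whose internals are not shown in this PlayBook. `terminate_groups` and its injectable `send` / `sleep` parameters are hypothetical; only `os.killpg`, `signal.SIGTERM`, and the two result keys (which mirror the completion criteria in this section) are grounded.

```python
import os
import signal
import time
from typing import Callable, Iterable


def terminate_groups(
    pgids: Iterable[int],
    wait_seconds: float = 5.0,
    send: Callable[[int, int], None] = os.killpg,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Send SIGTERM (never SIGKILL) to each process group, then re-probe."""
    signaled = []
    for pgid in pgids:
        try:
            send(pgid, signal.SIGTERM)
            signaled.append(pgid)
        except ProcessLookupError:
            pass                  # group was already gone
    sleep(wait_seconds)           # give processes time to exit cleanly
    remaining = []
    for pgid in signaled:
        try:
            send(pgid, 0)         # signal 0 only probes for existence
            remaining.append(pgid)
        except ProcessLookupError:
            pass
    return {
        "signaled_process_group_count": len(signaled),
        "remaining_after_wait": remaining,
    }
```

Making `send` and `sleep` injectable lets the logic be exercised against a fake without touching real processes, which fits the dry-run-first posture of this PlayBook. Groups that survive are reported rather than escalated to `SIGKILL`.
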
Remediation completion criteria:
```text
signaled_process_group_count > 0
remaining_after_wait = []
awoooi_host_runaway_browser_orphan_group_count == 0
load5/core starts to fall or stays at an explainable level
if active Gitea Actions still exist, downgrade the alert to CI load rather than orphan smoke
```
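
The completion criteria can be expressed as a post-check. The sketch below is a simplification under stated assumptions: `remediation_complete` and `residual_lane` are hypothetical names, and "stays at an explainable level" is reduced to "did not rise", since the explainable case needs human or AI judgement that a boolean cannot capture.

```python
def remediation_complete(
    result: dict,
    orphan_group_count: float,
    load5_per_core_before: float,
    load5_per_core_after: float,
) -> bool:
    """Check the apply result and post-check metrics against the criteria."""
    return (
        result.get("signaled_process_group_count", 0) > 0
        and result.get("remaining_after_wait") == []
        and orphan_group_count == 0
        # Simplification: treat "not rising" as the acceptable load trend.
        and load5_per_core_after <= load5_per_core_before
    )


def residual_lane(active_ci_containers: float) -> str:
    """Remaining CI activity downgrades the alert to CI load, not orphan smoke."""
    return "ci_load" if active_ci_containers > 0 else "resolved"
```

Note that route 200, container up, and a visible UI play no part in this check, in line with the prohibited actions listed in this section.
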
---
## 5. KM / PlayBook Write-back Contract
Every trigger must be recorded as:
| Asset | Required fields |
|------|----------|
| Incident evidence | alert name, host, rule, pgid count, cpu percent, oldest age, active CI count, swap ratio |
| PlayBook run | dry-run payload, controlled apply id, optional owner / maintenance evidence, evidence ref, post verifier, actual signal summary |
| KM entry | root-cause classification, misclassification guard, remediation result, recurrence guard |
| Verifier | post-check metrics, load trend, orphan group count, runner queue state |
| Work item | If the controlled apply id / evidence ref / post verifier is missing, open a follow-up item; owner / maintenance is optional evidence only and must not falsely raise the runtime gate |
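
One way to enforce the contract above is a required-field check at write-back time. This is a minimal sketch: the snake_case keys are hypothetical renderings of the field names in the table, and `missing_fields` / `needs_work_item` are illustrative helpers, not part of any existing KM or PlayBook API.

```python
REQUIRED_FIELDS = {
    "incident_evidence": (
        "alert_name", "host", "rule", "pgid_count", "cpu_percent",
        "oldest_age", "active_ci_count", "swap_ratio",
    ),
    "playbook_run": (
        "dry_run_payload", "controlled_apply_id", "evidence_ref",
        "post_verifier", "actual_signal_summary",
    ),
    "km_entry": (
        "root_cause_class", "misclassification_guard",
        "remediation_result", "recurrence_guard",
    ),
    "verifier": (
        "post_check_metrics", "load_trend", "orphan_group_count",
        "runner_queue_state",
    ),
}

GATE_FIELDS = ("controlled_apply_id", "evidence_ref", "post_verifier")


def missing_fields(asset: str, record: dict) -> list[str]:
    """Return required fields that are absent or empty (0 counts as present)."""
    empty = (None, "", [], {})
    return [f for f in REQUIRED_FIELDS[asset] if record.get(f) in empty]


def needs_work_item(playbook_run: dict) -> bool:
    """A follow-up item is needed when any gate field is missing."""
    return any(not playbook_run.get(f) for f in GATE_FIELDS)
```

Owner and maintenance evidence are left out of `REQUIRED_FIELDS` on purpose, since the table marks them as optional and they must not raise the runtime gate.
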

Product-level conclusions must be reported separately:
```text
monitoring_ready=true
alert_ready=true
playbook_ready=true
km_writeback_required=true
runtime_write_gate=controlled for allowlisted orphan browser cleanup; 0/false is evidence only unless critical break-glass applies
```
---
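## 6. Relationship to the Restart SOP
After host 110 reboots, runner / CD / high-load batch jobs are released last. If service health is green but load stays high:
1. Read `host_runaway_process.prom` first.
2. Orphan browser metric is red: follow this PlayBook.
3. Active CI metric is red but the orphan count is 0: wait / drain / let the workflow time out; do not kill.
4. Docker / systemd / storage / backup metric is red: go back to the corresponding section of `FULL-STACK-COLD-START-SOP.md`.

This PlayBook is the host-CPU-specific closed loop of the AI automation product. It does not replace the cold-start scorecard and does not lift the credential escrow / DR gate.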