# Host Runaway Process AIOps PlayBook
> Last updated: 2026-06-18 Asia/Taipei
> Scope: triage of host 110 full-CPU load, orphan Chrome / Playwright smoke, Gitea Actions CI load, and Prisma / package install resource pressure.
---
## 1. Goal
This PlayBook turns the 2026-06-18 full-CPU incident on host 110 into a closed AI Ops loop:
```text
read-only exporter -> Prometheus alert -> AI triage packet -> KM / PlayBook evidence -> gated remediation -> post-check / recurrence guard
```
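The goal is not to handle the generic symptom "high CPU" but to distinguish precisely between these cases:

| Type | How to identify | Handling |
|------|------|------|
| Orphan browser smoke | A headless Chrome / Chromium / Playwright process group has lived too long, has PPID=1 or a missing group leader, and its combined CPU is too high | Follow the dry-run remediation package; `SIGTERM` may be sent after human approval |
| Legitimate CI load | A Gitea Actions task container is running and there is no orphan browser signal | Watch queue / timeout; do not kill by mistake |
| Docker / Sentry / Harbor incident | container restart, port down, journal error, cold-start gate blocked | Follow each service's own SOP; do not use this PlayBook to kill processes |
| Swap full but not thrashing | swap ratio is high but `vmstat` / load classification shows no active thrash | Do not clear swap manually; reduce the high-CPU source first |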
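
The decision table above can be sketched as a small classifier. This is a minimal illustration under stated assumptions, not the exporter's actual logic: `HostSnapshot`, `classify`, the `swap_high` threshold, and the order in which the cases are checked are all hypothetical and only mirror the table.

```python
from dataclasses import dataclass
from enum import Enum


class Triage(Enum):
    ORPHAN_BROWSER_SMOKE = "orphan_browser_smoke"
    LEGITIMATE_CI_LOAD = "legitimate_ci_load"
    SERVICE_INCIDENT = "service_incident"
    SWAP_FULL_NO_THRASH = "swap_full_no_thrash"
    UNCLASSIFIED = "unclassified"


@dataclass
class HostSnapshot:
    # Hypothetical field names; they only mirror the table columns.
    orphan_group_count: int = 0
    active_ci_containers: int = 0
    service_incident: bool = False  # container restart, port down, journal error, ...
    swap_used_ratio: float = 0.0
    thrashing: bool = False         # e.g. derived from vmstat / load classification


def classify(s: HostSnapshot, swap_high: float = 0.9) -> Triage:
    # Service incidents are checked first (assumption) so that they are
    # routed to their own SOP and never reach process remediation.
    if s.service_incident:
        return Triage.SERVICE_INCIDENT
    if s.orphan_group_count > 0:
        return Triage.ORPHAN_BROWSER_SMOKE
    if s.active_ci_containers > 0:
        return Triage.LEGITIMATE_CI_LOAD
    if s.swap_used_ratio >= swap_high and not s.thrashing:
        return Triage.SWAP_FULL_NO_THRASH
    return Triage.UNCLASSIFIED
```
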
---
## 2. Metrics and Alerts
Host 110 is provisioned by the Ansible `host-textfile-exporters` role, which deploys:
```text
/home/wooo/scripts/host-runaway-process-exporter.py
/home/wooo/scripts/host-runaway-process-remediation.py
/home/wooo/node_exporter_textfiles/host_runaway_process.prom
```
Core metrics:

| Metric | Meaning |
|--------|------|
| `awoooi_host_runaway_process_monitor_up{host="110"}` | Whether the exporter is producing output normally |
| `awoooi_host_runaway_browser_orphan_group_count{host="110",rule=...}` | Number of orphan browser process groups matching the rule |
| `awoooi_host_runaway_browser_orphan_cpu_percent{host="110",rule=...}` | Combined CPU of the orphan groups |
| `awoooi_host_gitea_actions_active_container_count{host="110"}` | Currently active Gitea Actions task containers |
| `awoooi_host_load5_per_core{host="110"}` | load5 / CPU cores |
| `awoooi_host_swap_used_ratio{host="110"}` | Swap usage ratio |
| `awoooi_host_runaway_process_remediation_authorized{host="110"}` | Must always be `0`; the exporter is not an executor |
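
To illustrate how these metrics can be read back from the textfile, here is a minimal sketch of a parser for the Prometheus text exposition format. The `parse_textfile` helper and the sample values in `EXAMPLE` are hypothetical and made up for illustration; the real exporter output may carry additional lines.

```python
import re

# Matches sample lines such as: metric_name{label="v",...} 1.5
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_textfile(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP / TYPE comments
            continue
        match = _SAMPLE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Illustrative content only; values are invented for this example.
EXAMPLE = """
# TYPE awoooi_host_runaway_process_monitor_up gauge
awoooi_host_runaway_process_monitor_up{host="110"} 1
awoooi_host_runaway_browser_orphan_group_count{host="110",rule="stockplatform_headless_smoke"} 2
awoooi_host_runaway_process_remediation_authorized{host="110"} 0
"""

for name, labels, value in parse_textfile(EXAMPLE):
    print(name, labels, value)
```

A post-check could read `host_runaway_process.prom` this way and confirm that the `remediation_authorized` sample is still `0`.
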
Alerts:

| Alert | Condition | Action |
|-------|------|------|
| `HostOrphanBrowserSmokeHighCpu` | orphan browser groups `> 0` and CPU `>= 100%` sustained for 10 minutes | Generate the dry-run remediation package; confirm owner / maintenance window / evidence |
| `HostCiRunnerLoadSaturation` | load5/core `> 1.0` and active Gitea Actions `> 0` | Mark as short-term CI load and check the runner queue; do not kill directly |
| `HostRunawayProcessMonitorMissing` / `Stale` | exporter missing or not updated for more than 10 minutes | Fix the exporter / cron / textfile collector |
| `HostRunawayProcessRemediationUnexpectedlyAuthorized` | `remediation_authorized > 0` | Roll back immediately; turning the monitor into an executor is forbidden |
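
The "sustained for 10 minutes" condition can be approximated outside Prometheus as follows. This is a rough sketch, not the alert rule itself: `sustained`, `orphan_high_cpu`, and the sample layout are assumptions made for illustration.

```python
def sustained(samples, predicate, hold_seconds=600):
    """Return True if predicate held on every sample over the latest hold_seconds.

    samples: list of (unix_timestamp, value) pairs in ascending time order.
    """
    streak_start = None
    latest = None
    for timestamp, value in samples:
        if predicate(value):
            if streak_start is None:
                streak_start = timestamp  # condition just became true
        else:
            streak_start = None           # any false sample resets the window
        latest = timestamp
    return streak_start is not None and latest - streak_start >= hold_seconds


def orphan_high_cpu(v):
    # Mirrors the table: orphan groups > 0 and combined CPU >= 100%.
    return v["groups"] > 0 and v["cpu_percent"] >= 100
```

A single sample below the threshold resets the window, so a brief CPU spike does not qualify on its own.
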
Telegram / AI event packet contract:

| Alert / input | Telegram lane | Must display |
|---------------|---------------|----------|
| `HostOrphanBrowserSmokeHighCpu` | `orphan_browser_smoke_runaway_process` | alertname, host, rule, runaway dry-run, owner / maintenance / evidence gate, KM / PlayBook / Verifier write-back |
| `HostCiRunnerLoadSaturation` | `ci_runner_load_saturation` | Gitea Actions run, runner queue, load/core, swap trend, capacity / queue verdict, no process remediation |
| raw `CPU 警告` ("CPU warning") / `ps aux` dump | `runner_build_resource_pressure`, `runner_prisma_generate_resource_pressure`, `host_resource_pressure_triage` | sanitized top process evidence; never raw workspace paths, hosted toolcache paths, `node_modules` paths, external URLs, JSON payloads, or full process dumps |

Every Telegram card must keep `runtime_write_gate=0`, and an alert/card must never be turned into a direct kill / restart / reload command.

Before a host / runner raw dump reaches Telegram, `TelegramGateway` must compress it into a `P1/P2/P3 主機資源壓力` ("host resource pressure") card. The first screen may show only CPU, load, root process count, AI lane, candidate gate, Top evidence, and the prohibited actions. Full command lines, package JSON, external check endpoints, internal workspace paths, and raw `ps aux` output must stay in internal evidence / timeline and must not be sent out.

Every in-application `sendMessage` must pass through the final-exit normalization in `TelegramGateway._send_request()`. When a Gitea workflow, Alertmanager receiver, host cron, or external bot calls the Telegram Bot API directly, it must be brought under configuration control and formatter convergence separately; it cannot be claimed as fully governed by the API side.
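
As an illustration of the sanitization requirement, the sketch below redacts URLs, JSON payloads, and paths from a single command line before it is shown as evidence. It is not the `TelegramGateway` implementation: `sanitize_command`, the placeholder tokens, and the `max_len` limit are assumptions.

```python
import re

_URL = re.compile(r'https?://\S+')
_JSON = re.compile(r'\{.*\}')
_PATH = re.compile(r'\S*/\S*')  # any whitespace-delimited token containing a slash


def sanitize_command(cmdline, max_len=80):
    """Redact URLs, JSON payloads and paths from one command line."""
    out = _URL.sub("[url]", cmdline)   # URLs first, since they contain slashes
    out = _JSON.sub("[json]", out)     # then JSON, which may embed paths
    # Keep only the last path component so the process stays recognisable.
    out = _PATH.sub(lambda m: "[path]/" + m.group(0).rsplit("/", 1)[-1], out)
    return out[:max_len]


print(sanitize_command(
    "node /home/wooo/ws/node_modules/.bin/prisma generate --schema=/srv/app/schema.prisma"
))
```
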
---
## 3. Required AI Triager Assessment
On receiving `HostOrphanBrowserSmokeHighCpu`, the AI / operator must first produce a dry-run:
```bash
python3 scripts/ops/host-runaway-process-remediation.py \
  --rule stockplatform_headless_smoke \
  --min-age-seconds 1800 \
  --min-cpu-percent 50
```
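The dry-run must check:
1. `candidate_count > 0`.
2. `orphan_reason` is `ppid_1` or `missing_group_leader`.
3. `oldest_age_seconds` exceeds the PlayBook threshold.
4. The candidate process group is not the same legitimate job that `active Gitea Actions` is still running.
5. The candidate is not the Docker daemon, Sentry, Harbor, PostgreSQL, ClickHouse, K3s, or a backup service itself.
6. An owner / maintenance window / evidence ref already exists.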
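
The six checks above can be sketched as a review function over the dry-run payload. Only `candidate_count`, `orphan_reason`, and `oldest_age_seconds` appear in this PlayBook; every other key (`candidates`, `pgid`, `in_active_ci_job`, `command`, `owner`, `maintenance_window`, `evidence_ref`), the `PROTECTED_HINTS` substrings, and the inclusive age threshold are assumptions made for illustration.

```python
ALLOWED_ORPHAN_REASONS = {"ppid_1", "missing_group_leader"}
# Substrings used to spot protected services; purely illustrative.
PROTECTED_HINTS = ("dockerd", "sentry", "harbor", "postgres", "clickhouse", "k3s", "backup")
REQUIRED_REFS = ("owner", "maintenance_window", "evidence_ref")


def review_dry_run(payload, min_age_seconds=1800):
    """Return the reasons the dry-run must not proceed (empty list = all checks pass)."""
    problems = []
    if payload.get("candidate_count", 0) <= 0:
        problems.append("candidate_count is not > 0")
    for c in payload.get("candidates", []):
        pgid = c.get("pgid")
        if c.get("orphan_reason") not in ALLOWED_ORPHAN_REASONS:
            problems.append(f"pgid {pgid}: unexpected orphan_reason")
        if c.get("oldest_age_seconds", 0) < min_age_seconds:
            problems.append(f"pgid {pgid}: younger than threshold")
        if c.get("in_active_ci_job"):
            problems.append(f"pgid {pgid}: belongs to a running CI job")
        if any(h in c.get("command", "").lower() for h in PROTECTED_HINTS):
            problems.append(f"pgid {pgid}: looks like a protected service")
    for ref in REQUIRED_REFS:
        if not payload.get(ref):
            problems.append(f"missing {ref}")
    return problems
```

Returning a list of reasons rather than a boolean keeps the evidence needed for the KM write-back.
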
If only `HostCiRunnerLoadSaturation` is firing and the orphan group count is `0`, the default verdict is "legitimate short-term CI load" and auto-remediation is not allowed.

---
## 4. Gated Remediation
Actually sending `SIGTERM` requires all three gates:
```bash
python3 scripts/ops/host-runaway-process-remediation.py \
  --apply \
  --confirm-apply \
  --rule stockplatform_headless_smoke \
  --owner-approval-id OWNER-APPROVAL-REDACTED \
  --maintenance-window-id MW-REDACTED \
  --evidence-ref INC-REDACTED \
  --wait-seconds 5
```
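Prohibited:
- Do not default to `SIGKILL`.
- Do not run `systemctl restart docker` just because CPU is high.
- Do not restart Sentry / Harbor / Gitea / Nginx.
- Do not change firewall / iptables / NetworkPolicy.
- Do not read or output secret values, tokens, hashes, or prefixes / suffixes.
- Do not treat route 200, container up, or a visible UI as proof that remediation is complete.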
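
The gate requirement and the "no default `SIGKILL`" rule can be sketched as below. This is a hedged illustration, not the remediation script's code: `missing_gates`, `terminate_group`, and the dict keys are hypothetical names derived from the CLI flags shown earlier. `os.killpg` is the standard-library call for signalling a process group on Unix.

```python
import os
import signal

REQUIRED_GATES = ("owner_approval_id", "maintenance_window_id", "evidence_ref")


def missing_gates(args):
    """Return the names of gates that are absent or empty in args (a dict)."""
    missing = [g for g in REQUIRED_GATES if not args.get(g)]
    if not (args.get("apply") and args.get("confirm_apply")):
        missing.append("apply+confirm_apply")
    return missing


def terminate_group(pgid, args, kill=os.killpg):
    """Send SIGTERM to a process group only when every gate is present."""
    missing = missing_gates(args)
    if missing:
        return {"signaled": False, "missing": missing}  # stay in dry-run
    kill(pgid, signal.SIGTERM)                           # SIGTERM only, never SIGKILL by default
    return {"signaled": True, "missing": []}
```

Passing `kill` as a parameter lets the gate logic be exercised with a stub, so nothing is signalled while testing.
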
Remediation is complete when:
```text
signaled_process_group_count > 0
remaining_after_wait = []
awoooi_host_runaway_browser_orphan_group_count == 0
load5/core starts to fall or stays at an explainable level
if active Gitea Actions still exist, downgrade the alert to CI load rather than orphan smoke
```
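
The completion conditions above can be expressed as a post-check. This is a minimal sketch: `remediation_complete` is a hypothetical helper, and "load starts to fall or stays explainable" is simplified here to "load5/core samples never rise", which is an assumption rather than the PlayBook's definition.

```python
def remediation_complete(result, orphan_group_count, load5_per_core):
    """result: dict from the apply run; load5_per_core: samples, oldest first."""
    signaled = result.get("signaled_process_group_count", 0) > 0
    none_left = result.get("remaining_after_wait") == []
    # Simplification (assumption): no sample is higher than the one before it.
    load_not_rising = all(
        later <= earlier
        for earlier, later in zip(load5_per_core, load5_per_core[1:])
    )
    return signaled and none_left and orphan_group_count == 0 and load_not_rising
```
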
---
## 5. KM / PlayBook Write-Back Contract
Every trigger must record:

| Asset | Required fields |
|------|----------|
| Incident evidence | alert name, host, rule, pgid count, cpu percent, oldest age, active CI count, swap ratio |
| PlayBook run | dry-run payload, owner approval id, maintenance window id, evidence ref, actual signal summary |
| KM entry | root-cause classification, false-positive guard, remediation result, recurrence guard |
| Verifier | post-check metrics, load trend, orphan group count, runner queue state |
| Work item | if owner / evidence / maintenance window is missing, create a follow-up item to supply it; do not falsely raise the runtime gate |

Product-facing conclusions must be reported separately:
```text
monitoring_ready=true
alert_ready=true
playbook_ready=true
km_writeback_required=true
runtime_remediation_authorized=false unless gated apply is executed
```
---
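## 6. Relationship to the Restart SOP
After host 110 restarts, runner / CD / high-load batch jobs are released last. If service health is green but load stays high:
1. Read `host_runaway_process.prom` first.
2. Orphan browser metrics red: follow this PlayBook.
3. Active CI metrics red but orphan count is 0: wait / drain / rely on workflow timeout; do not kill.
4. Docker / systemd / storage / backup metrics red: return to the corresponding section of `FULL-STACK-COLD-START-SOP.md`.

This PlayBook is the host-CPU-specific closed loop of the AI automation product. It does not replace the cold-start scorecard, nor does it lift the credential escrow / DR gate.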