# Host Runaway Process AIOps PlayBook

> Last updated: 2026-06-18 Asia/Taipei
> Scope: triage of CPU saturation on host 110, orphan Chrome / Playwright smoke processes, Gitea Actions CI load, and Prisma / package install resource pressure.

---

## 1. Goal

This PlayBook turns the 2026-06-18 CPU saturation incident on host 110 into a closed AI Ops loop:

```text
read-only exporter -> Prometheus alert -> AI triage packet -> KM / PlayBook evidence -> gated remediation -> post-check / recurrence guard
```

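The problem it solves is not the generic symptom of "high CPU" but telling the following cases apart precisely:

| Type | Criteria | Handling |
|------|----------|----------|
| orphan browser smoke | A headless Chrome / Chromium / Playwright process group has lived too long, has PPID=1 or a missing group leader, and its combined CPU is too high | Produce a dry-run remediation packet; `SIGTERM` may be sent after human approval |
| Legitimate CI load | A Gitea Actions task container is running and no orphan browser metric is firing | Watch queue / timeout; do not kill it by mistake |
| Docker / Sentry / Harbor incident | container restart, port down, journal error, cold-start gate blocked | Follow each service's own SOP; do not use this PlayBook to kill processes |
| Swap full but not thrashing | swap ratio is high but `vmstat` / load classification shows no live thrashing | Do not clear swap manually; reduce the high-CPU source first |
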
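
The table above can be read as a small decision function. The following is a minimal sketch, not the exporter's or triager's actual logic: the `HostSnapshot` fields, the evaluation order (service incidents first), and the `swap_threshold` default are all assumptions made for illustration. Only the CPU threshold of 100% is taken from the alert definition in this PlayBook.

```python
from dataclasses import dataclass


@dataclass
class HostSnapshot:
    """Hypothetical snapshot of the signals named in the classification table."""
    orphan_group_count: int     # orphan browser process groups
    orphan_cpu_percent: float   # combined CPU of those groups
    active_ci_containers: int   # running Gitea Actions task containers
    service_incident: bool      # container restart / port down / journal error / gate blocked
    swap_used_ratio: float      # 0.0 to 1.0
    thrashing: bool             # live thrash according to vmstat / load classification


def classify(s: HostSnapshot, cpu_threshold: float = 100.0,
             swap_threshold: float = 0.9) -> str:
    """Map a snapshot to one of the four types; the order is an assumption."""
    # Service incidents belong to each service's own SOP, so check them first.
    if s.service_incident:
        return "service_incident"
    if s.orphan_group_count > 0 and s.orphan_cpu_percent >= cpu_threshold:
        return "orphan_browser_smoke"
    if s.active_ci_containers > 0 and s.orphan_group_count == 0:
        return "legitimate_ci_load"
    if s.swap_used_ratio >= swap_threshold and not s.thrashing:
        return "swap_full_not_thrashing"
    return "unclassified"
```
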
---

## 2. Metrics and Alerts

On host 110, the Ansible `host-textfile-exporters` role deploys:

```text
/home/wooo/scripts/host-runaway-process-exporter.py
/home/wooo/scripts/host-runaway-process-remediation.py
/home/wooo/node_exporter_textfiles/host_runaway_process.prom
```

Core metrics:

| Metric | Meaning |
|--------|---------|
| `awoooi_host_runaway_process_monitor_up{host="110"}` | Whether the exporter is producing output normally |
| `awoooi_host_runaway_browser_orphan_group_count{host="110",rule=...}` | Number of orphan browser process groups matching the rule |
| `awoooi_host_runaway_browser_orphan_cpu_percent{host="110",rule=...}` | Combined CPU of the orphan groups |
| `awoooi_host_gitea_actions_active_container_count{host="110"}` | Currently active Gitea Actions task containers |
| `awoooi_host_load5_per_core{host="110"}` | load5 / CPU core count |
| `awoooi_host_swap_used_ratio{host="110"}` | Swap usage ratio |
| `awoooi_host_runaway_process_remediation_authorized{host="110"}` | Must always be `0`; the exporter is not an executor |

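
For readers unfamiliar with the textfile collector pattern, here is a minimal sketch of how a read-only exporter might emit such metrics. It is not the code of `host-runaway-process-exporter.py`; the helper names `render_metric` and `write_textfile` are hypothetical. The sketch writes to a temporary file in the same directory and renames it, so the collector never reads a half-written file, and it relaxes the file mode because `mkstemp` creates files readable only by the owner.

```python
import os
import tempfile


def render_metric(name: str, value: float, labels: dict[str, str]) -> str:
    """Format one sample in the Prometheus text exposition format."""
    def escape(raw: str) -> str:
        return raw.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    label_text = ",".join(f'{key}="{escape(val)}"' for key, val in sorted(labels.items()))
    return f"{name}{{{label_text}}} {value}"


def write_textfile(path: str, lines: list[str]) -> None:
    """Write the .prom file atomically via a temp file plus rename."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    # mkstemp uses mode 0600; the collector may run as a different user.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)


# The exporter only reports; this metric stays 0 by contract.
guard_line = render_metric(
    "awoooi_host_runaway_process_remediation_authorized", 0, {"host": "110"}
)
```
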
Alerts:

| Alert | Condition | Action |
|-------|-----------|--------|
| `HostOrphanBrowserSmokeHighCpu` | orphan browser group `> 0` and CPU `>= 100%` for 10 minutes | Produce a dry-run remediation packet; confirm owner / maintenance window / evidence |
| `HostCiRunnerLoadSaturation` | load5/core `> 1.0` and active Gitea Actions `> 0` | Mark as short-term CI load, check the runner queue, do not kill directly |
| `HostRunawayProcessMonitorMissing` / `Stale` | exporter missing or not updated for more than 10 minutes | Fix the exporter / cron / textfile collector |
| `HostRunawayProcessRemediationUnexpectedlyAuthorized` | `remediation_authorized > 0` | Roll back immediately; turning the monitor into an executor is forbidden |

Telegram / AI event packet contract:

| Alert / input | Telegram lane | Must display |
|---------------|---------------|--------------|
| `HostOrphanBrowserSmokeHighCpu` | `orphan_browser_smoke_runaway_process` | alertname, host, rule, runaway dry-run, owner / maintenance / evidence gate, KM / PlayBook / Verifier writeback |
| `HostCiRunnerLoadSaturation` | `ci_runner_load_saturation` | Gitea Actions run, runner queue, load/core, swap trend, capacity / queue verdict, no process remediation |
| raw `CPU 警告` (CPU warning) / `ps aux` dump | `runner_build_resource_pressure`, `runner_prisma_generate_resource_pressure`, or `host_resource_pressure_triage` | sanitized top process evidence; never show a raw workspace path, hosted toolcache path, `node_modules` path, external URL, JSON payload, or a full process dump |

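Every Telegram card must keep `runtime_write_gate=0`, and no alert / card may be turned into a direct kill / restart / reload command.

Before a host / runner raw dump reaches Telegram, `TelegramGateway` must compress it into a `P1/P2/P3 主機資源壓力` (host resource pressure) card. The first screen may show only CPU, load, root process count, AI lane, candidate gate, Top evidence, and the prohibited actions. Full command lines, package JSON, external check endpoints, internal workspace paths, and raw `ps aux` output must stay in the internal evidence / timeline and must not be sent out.
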
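
The redaction rule above can be illustrated with a small sketch. This is not the `TelegramGateway` implementation; the function name `sanitize_command`, the placeholder tokens, and the regular expressions are assumptions. The sketch redacts URLs first, then absolute paths, then truncates the result. It does not catch relative paths or inline JSON, so a real formatter would need more rules.

```python
import re

# Hypothetical redaction patterns. URLs are handled first because they
# also contain slashes that the path pattern would otherwise match.
_URL = re.compile(r"https?://\S+")
_ABSOLUTE_PATH = re.compile(r"(?<![\w.])/(?:[\w.@+-]+/)*[\w.@+-]+")


def sanitize_command(command: str, max_len: int = 60) -> str:
    """Redact URLs and absolute paths, then truncate the command line."""
    redacted = _URL.sub("<url>", command)
    redacted = _ABSOLUTE_PATH.sub("<path>", redacted)
    if len(redacted) > max_len:
        redacted = redacted[: max_len - 3] + "..."
    return redacted
```
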
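Every in-app `sendMessage` call must pass through the final-exit normalization in `TelegramGateway._send_request()`. When a Gitea workflow, Alertmanager receiver, host cron job, or external bot calls the Telegram Bot API directly, it must additionally be brought under configuration control and formatter convergence; do not claim that the API side already governs it fully.

---
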
## 3. Required AI Triager Checks

On receiving `HostOrphanBrowserSmokeHighCpu`, the AI / operator must first produce a dry-run:

```bash
python3 scripts/ops/host-runaway-process-remediation.py \
  --rule stockplatform_headless_smoke \
  --min-age-seconds 1800 \
  --min-cpu-percent 50
```

The dry-run must check that:

1. `candidate_count > 0`.
2. `orphan_reason` is `ppid_1` or `missing_group_leader`.
3. `oldest_age_seconds` exceeds the PlayBook threshold.
4. The `active Gitea Actions` and the candidate process group are not the same legitimate job that is still running.
5. The candidate is not the Docker daemon, Sentry, Harbor, PostgreSQL, ClickHouse, K3s, or a backup service itself.
6. An owner / maintenance window / evidence ref already exists.

If only `HostCiRunnerLoadSaturation` fires and the orphan group count is `0`, the default verdict is "legitimate short-term CI load" and automatic remediation is not allowed.

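
The six checks can be expressed as a fail-closed validator. This is a sketch under stated assumptions: `candidate_count`, `orphan_reason`, and `oldest_age_seconds` appear in this PlayBook, but the other payload keys (`overlaps_active_ci_job`, `process_name`, `owner`, `maintenance_window`, `evidence_ref`) and the process names in `PROTECTED_SERVICES` are hypothetical, and matching by a single process name is deliberately simplistic. A missing field counts as a failure.

```python
# Hypothetical process names standing in for the protected services.
PROTECTED_SERVICES = {"dockerd", "sentry", "harbor", "postgres", "clickhouse", "k3s", "backup"}
ALLOWED_ORPHAN_REASONS = {"ppid_1", "missing_group_leader"}


def dry_run_failures(payload: dict, min_age_seconds: int = 1800) -> list[str]:
    """Return the names of failed checks; an empty list means all six pass."""
    failures = []
    if payload.get("candidate_count", 0) <= 0:
        failures.append("candidate_count")
    if payload.get("orphan_reason") not in ALLOWED_ORPHAN_REASONS:
        failures.append("orphan_reason")
    if payload.get("oldest_age_seconds", 0) <= min_age_seconds:
        failures.append("oldest_age_seconds")
    # Fail closed: without explicit evidence, assume the job may still be running.
    if payload.get("overlaps_active_ci_job", True):
        failures.append("active_ci_overlap")
    if payload.get("process_name") in PROTECTED_SERVICES:
        failures.append("protected_service")
    if not all(payload.get(key) for key in ("owner", "maintenance_window", "evidence_ref")):
        failures.append("approval_gate")
    return failures
```
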
---

## 4. Gated Remediation

Actually sending `SIGTERM` requires all three gates:

```bash
python3 scripts/ops/host-runaway-process-remediation.py \
  --apply \
  --confirm-apply \
  --rule stockplatform_headless_smoke \
  --owner-approval-id OWNER-APPROVAL-REDACTED \
  --maintenance-window-id MW-REDACTED \
  --evidence-ref INC-REDACTED \
  --wait-seconds 5
```

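
To make the gating explicit, here is a sketch of how a command-line tool could refuse to apply unless every gate is present. The flag names mirror the command above, but the parser and the `resolve_mode` helper are hypothetical and are not the code of `host-runaway-process-remediation.py`. Without `--apply` the tool stays in dry-run mode; with it, any missing gate aborts before a signal could be sent.

```python
import argparse

GATE_FLAGS = ("owner_approval_id", "maintenance_window_id", "evidence_ref")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Hypothetical gate check sketch")
    parser.add_argument("--apply", action="store_true")
    parser.add_argument("--confirm-apply", action="store_true")
    parser.add_argument("--rule", required=True)
    parser.add_argument("--owner-approval-id")
    parser.add_argument("--maintenance-window-id")
    parser.add_argument("--evidence-ref")
    parser.add_argument("--wait-seconds", type=int, default=5)
    return parser


def resolve_mode(args: argparse.Namespace) -> str:
    """Return "dry-run" or "apply"; exit if apply lacks any required gate."""
    if not args.apply:
        return "dry-run"
    missing = [name for name in GATE_FLAGS if not getattr(args, name)]
    if not args.confirm_apply:
        missing.insert(0, "confirm_apply")
    if missing:
        raise SystemExit("refusing to apply, missing: " + ", ".join(missing))
    return "apply"
```
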
Prohibited:

- Do not default to `SIGKILL`.
- Do not run `systemctl restart docker` just because CPU is high.
- Do not restart Sentry / Harbor / Gitea / Nginx.
- Do not change firewall / iptables / NetworkPolicy.
- Do not read or output any secret value, token, hash, or prefix / suffix.
- Do not treat route 200, container up, or a visible UI as proof that remediation is complete.

Remediation is complete when:

```text
signaled_process_group_count > 0
remaining_after_wait = []
awoooi_host_runaway_browser_orphan_group_count == 0
load5/core starts falling or stays explainable
if active Gitea Actions remain, the alert is downgraded to CI load, not orphan smoke
```

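
A post-check could encode these conditions roughly as follows. This is a sketch, not the Verifier's actual logic: the function name, its parameters, and the returned labels are hypothetical. "Stays explainable" is a human judgment, so the sketch interprets it narrowly: if load has not fallen but CI containers are still active, the residual load is attributed to CI; otherwise it is flagged for a person to explain.

```python
def post_check(result: dict, orphan_group_count: int, load_before: float,
               load_after: float, active_ci_containers: int) -> str:
    """Return a verdict label for the completion conditions."""
    signaled = result.get("signaled_process_group_count", 0) > 0
    drained = result.get("remaining_after_wait") == []
    if not (signaled and drained and orphan_group_count == 0):
        return "incomplete"
    if load_after < load_before:
        return "complete"
    # Load did not fall: attribute it to CI only if CI is actually running.
    return "ci_load" if active_ci_containers > 0 else "needs_explanation"
```
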
---

## 5. KM / PlayBook Writeback Contract

Every trigger must record:

| Asset | Required fields |
|-------|-----------------|
| Incident evidence | alert name, host, rule, pgid count, cpu percent, oldest age, active CI count, swap ratio |
| PlayBook run | dry-run payload, owner approval id, maintenance window id, evidence ref, actual signal summary |
| KM entry | root-cause classification, false-positive guards, remediation result, recurrence guard |
| Verifier | post-check metrics, load trend, orphan group count, runner queue state |
| Work item | If owner / evidence / maintenance window is missing, create a follow-up item; do not falsely raise the runtime gate |

Product-facing conclusions must be reported as separate flags:

```text
monitoring_ready=true
alert_ready=true
playbook_ready=true
km_writeback_required=true
runtime_remediation_authorized=false unless gated apply is executed
```

---

## 6. Relationship to the Reboot SOP

After host 110 reboots, runner / CD / high-load batch jobs are released last. If service health is green but load stays high:

1. Read `host_runaway_process.prom` first.
2. Orphan browser metric is red: follow this PlayBook.
3. Active CI metric is red but the orphan count is 0: wait / drain / rely on the workflow timeout; do not kill.
4. Docker / systemd / storage / backup metric is red: return to the matching chapter of `FULL-STACK-COLD-START-SOP.md`.

This PlayBook is the host-CPU-specific closed loop of the AI automation product. It does not replace the cold-start scorecard, nor does it lift the credential escrow / DR gate.