fix(alert-rules): HostHighCpuLoad 獨立規則,停止 kubectl scale unknown
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled

根因: HostHighCpuLoad 是 node_exporter host 告警,無 pod/deployment label
      被分到 K8s high_cpu 規則 → {target}=unknown → auto-repair 安全攔截

修復: 新增 host_cpu_high 規則 (priority=45),NO_ACTION + 正確描述
     high_cpu 規則移除 HostHighCpuLoad/NodeCPUUsageHigh

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
OG T
2026-04-12 22:50:37 +08:00
parent 1a4b52ed28
commit f1face4e34

View File

@@ -107,16 +107,36 @@ rules:
command: "kubectl autoscale deployment {target} --memory-percent=80 --min=2 --max=5 -n {namespace}"
reasoning: "[規則匹配] Pod OOMKilled 後 ReplicaSet 將自動重建,但需同步修正資源配置防止復發。"
# 2026-04-12 ogt: Host CPU 告警獨立規則 — node_exporter 告警無 pod/deployment label
# 原本放在 high_cpu 規則導致 {target}="unknown" → auto-repair 安全攔截
# host 告警只能通知,不能 kubectl scale
- id: host_cpu_high
priority: 45
description: Host 主機 CPU 使用率過高 (node_exporter非 K8s workload)
match:
alertname:
- HostHighCpuLoad
- NodeCPUUsageHigh
- NodeHighCpuLoad
response:
action_title: "Host {host} CPU 過高 — 需排查高 CPU 進程"
description: "⚠️ 主機 {host} CPU 使用率超標。此為主機層告警,需 SSH 登入排查 (top / ps aux)。常見原因: Ollama 推理、DB 查詢、K3s GC。"
suggested_action: NO_ACTION
kubectl_command: ""
estimated_downtime: "N/A"
risk: low
responsibility: INFRA
reasoning: "[規則匹配] 主機 CPU 告警無法自動修復,需人工確認高 CPU 進程後決策。"
- id: high_cpu
priority: 40
description: Pod/Node CPU 使用率過高
description: K8s Pod/Deployment CPU 使用率過高
match:
# 2026-04-10 Claude Sonnet 4.6: Phase 2 飛輪修復 — 補齊 Prometheus alertname 變體
# 2026-04-12 ogt: 移除 HostHighCpuLoad/NodeCPUUsageHigh → 已獨立為 host_cpu_high 規則
alertname:
- HighCPUUsage
- ContainerCpuUsageSecondsTotal
- HostHighCpuLoad
- NodeCPUUsageHigh
- CPUThrottlingHigh
- KubeCPUOvercommit
alert_type: