When HostHighCpuLoad / HostOutOfMemory fire on a bare-metal host
(192.168.0.110 et al, where Sentry / ClickHouse / Snuba are eating
CPU), the LLM kept proposing "kubectl rollout restart awoooi-api",
which is a wrong-domain action — restarting awoooi cannot fix a
third-party process's CPU usage on the host. Auto-execute would then
either run the no-op kubectl restart (wasted) or escalate after
ssh_diagnose because no safe action was found, producing the
"AI 自動修復失敗" Telegram noise the user just complained about.
Adds a guard at the top of DecisionManager._auto_execute: if the
incident's primary signal carries host_type=bare_metal AND the
proposed action starts with "kubectl", refuse to execute. The
incident is marked READY with a clear blocked_reason so human
operators see why automation declined, and emergency_escalation
records the event in AOL for audit.
Also patches /home/wooo/monitoring/alerts.yml on 110 (and the new
ops/monitoring/alerts.yml in repo) to add an explicit
auto_repair_action annotation on HostHighCpuLoad / HostOutOfMemory
that hints LLM toward `ssh ... ps aux` rather than kubectl restart.
Prometheus reload returned 200.
Tests: tests/test_decision_manager_bare_metal_kubectl_guard.py
covers (1) bare_metal+kubectl blocked, (2) kubectl get also blocked,
(3) bare_metal+ssh NOT blocked, (4) k8s host_type+kubectl NOT
blocked, (5) missing host_type label NOT blocked.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>