When HostHighCpuLoad / HostOutOfMemory fire on a bare-metal host
(192.168.0.110 et al, where Sentry / ClickHouse / Snuba are eating
CPU), the LLM kept proposing "kubectl rollout restart awoooi-api",
which is a wrong-domain action — restarting awoooi cannot fix a
third-party process's CPU usage on the host. Auto-execute would then
either run the no-op kubectl restart (wasted) or escalate after
ssh_diagnose because no safe action was found, producing the
"AI 自動修復失敗" Telegram noise the user just complained about.
Adds a guard at the top of DecisionManager._auto_execute: if the
incident's primary signal carries host_type=bare_metal AND the
proposed action starts with "kubectl", refuse to execute. The
incident is marked READY with a clear blocked_reason so human
operators see why automation declined, and emergency_escalation
records the event in AOL for audit.
Also patches /home/wooo/monitoring/alerts.yml on 110 (and the new
ops/monitoring/alerts.yml in repo) to add an explicit
auto_repair_action annotation on HostHighCpuLoad / HostOutOfMemory
that hints LLM toward `ssh ... ps aux` rather than kubectl restart.
Prometheus reload returned 200.
Tests: tests/test_decision_manager_bare_metal_kubectl_guard.py
covers (1) bare_metal+kubectl blocked, (2) kubectl get also blocked,
(3) bare_metal+ssh NOT blocked, (4) k8s host_type+kubectl NOT
blocked, (5) missing host_type label NOT blocked.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When ssh-keyscan partially fails (e.g. one host is unreachable for a
moment) the previous logic still considered the file non-empty, so it
patched ssh-mcp-key/known_hosts with an incomplete set. asyncssh then
rejected any SSH to the missing host with "Host key is not trusted",
which routed every host disk-full / docker alert into the emergency
escalation channel and spammed Telegram (today's regression for 110).
Now we explicitly verify all four target IPs (110/120/121/188) appear
in the scan output before patching. Missing any of them aborts the
patch and keeps the previously-good secret untouched, plus logs the
ssh-keyscan stderr to help debug intermittent network issues.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Empty commits don't match cd.yaml paths filter (apps/** etc).
This adds a comment to .dockerignore to trigger build for sha
84ba3216's commits stack.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previous build (task#1396) failed when act_runner daemon was restarted
to clear stuck job state.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Per user request: every AI-driven repair must surface a Telegram trace
even when it succeeds, so nobody can later deny what the autonomy did.
Adds 🤖 [AUTO] markers and an explicit `Actor: leWOOOgo (autonomous)`
line to both success and failure status messages emitted by
_push_auto_repair_result, making them clearly distinguishable from
human-clicked approval cards.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
trust_drift previously fired alerts forever for playbooks stuck below
the 0.2 threshold. With user authorization for governance-class
auto-fixes, check_trust_drift now retires playbooks that have been
unused for 30+ days (or never used and created 30+ days ago) by
flipping status to 'deprecated' before alerting.
Alerts now report drifted_count, auto_deprecated_count, and the kept
playbook_ids that still need human review (those in their 30d trial
window). Existing alert noise from the four currently-drifted
playbooks should drop to whatever fraction is genuinely in trial.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
parse_operation_from_action only knew kubectl and Chinese restart phrases,
so any "ssh host '...'" action approved via Telegram fell through to
"Could not parse operation type" and reported a fake failure even though
the LLM had proposed a valid host repair.
Adds OperationType.SSH_HOST, makes the parser detect ssh prefixes (with
optional flags / user@host) before kubectl patterns, and routes the
SSH_HOST branch in approval_execution.execute_in_background through
SSHProvider with the same tool keywords decision_manager uses
(ssh_docker_prune / ssh_docker_restart / ssh_systemctl_restart /
ssh_diagnose). Unroutable SSH actions now fail loudly with a descriptive
error instead of silently breaking.
Trigger: 2026-05-02 incidents INC-20260502-D6D0B7 / E12EE4 / 557055
were approved by the user but executor reported "Could not parse" and
left the alerts pending.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds Group B SSH MCP tool ssh_docker_prune (image+volume+builder prune
with ≥75% disk usage gate) and routes "docker prune" actions through it.
Flips HostDiskUsageHigh from auto_repair=false to true with mcp_provider
routing labels so the flywheel can self-heal next disk-full event without
hitting the emergency_channel Telegram path.
Trigger: 2026-05-01 → 05-02 Telegram alert storm (peak 53/hr) caused by
empty ssh-mcp-key/known_hosts secret rejecting all SSH and forcing every
disk-full alert through "Host key is not trusted → escalate" loop.
known_hosts patched live; this commit closes the playbook gap so the
next occurrence resolves without manual intervention.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>