wooo/awoooi

Your Name da772a1605

Code Review / ai-code-review (push) Successful in 54s

CD Pipeline / tests (push) Successful in 3m47s

CD Pipeline / build-and-deploy (push) Successful in 13m26s

CD Pipeline / post-deploy-checks (push) Successful in 5m45s

fix(decision): block kubectl actions on bare_metal host alerts

When HostHighCpuLoad / HostOutOfMemory fire on a bare-metal host
(192.168.0.110 et al., where Sentry / ClickHouse / Snuba are eating
CPU), the LLM kept proposing "kubectl rollout restart awoooi-api",
which is a wrong-domain action: restarting awoooi cannot fix a
third-party process's CPU usage on the host. Auto-execute would then
either run the no-op kubectl restart (wasted) or escalate after
ssh_diagnose because no safe action was found, producing the
"AI 自動修復失敗" ("AI auto-repair failed") Telegram noise the user
just complained about.

Adds a guard at the top of DecisionManager._auto_execute: if the
incident's primary signal carries host_type=bare_metal AND the
proposed action starts with "kubectl", refuse to execute. The
incident is marked READY with a clear blocked_reason so human
operators see why automation declined, and emergency_escalation
records the event in AOL for audit.
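
A minimal sketch of the guard described above, for illustration only. The
`Incident` dataclass, the `labels` dict, the function names, and the exact
`blocked_reason` wording are assumptions, not the repository's actual code;
the AOL recording via emergency_escalation is omitted.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Incident:
    # Hypothetical stand-in for the real incident model.
    labels: dict = field(default_factory=dict)  # labels of the primary signal
    status: str = "PENDING"
    blocked_reason: Optional[str] = None


def is_wrong_domain_action(labels: dict, action: str) -> bool:
    """True when a kubectl action is proposed for a bare_metal host alert."""
    on_bare_metal = labels.get("host_type") == "bare_metal"
    return on_bare_metal and action.startswith("kubectl")


def auto_execute(incident: Incident, action: str) -> bool:
    """Return True if the action may proceed, False if the guard refused it."""
    if is_wrong_domain_action(incident.labels, action):
        # Refuse to execute; leave the incident for a human operator.
        incident.status = "READY"
        incident.blocked_reason = (
            "kubectl action proposed for a bare_metal host alert"
        )
        return False
    return True
```

Because the check is a plain prefix match, read-only commands such as
"kubectl get" are refused as well, and an incident with no host_type label
passes through unchanged.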

Also patches /home/wooo/monitoring/alerts.yml on 110 (and the new
ops/monitoring/alerts.yml in the repo) to add an explicit
auto_repair_action annotation on HostHighCpuLoad / HostOutOfMemory
that hints the LLM toward `ssh ... ps aux` rather than kubectl restart.
Prometheus reload returned 200.

Tests: tests/test_decision_manager_bare_metal_kubectl_guard.py
covers (1) bare_metal+kubectl blocked, (2) kubectl get also blocked,
(3) bare_metal+ssh NOT blocked, (4) k8s host_type+kubectl NOT
blocked, (5) missing host_type label NOT blocked.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-02 17:41:28 +08:00

generated

fix(cd): add the previously uncommitted ops/monitoring scripts

2026-03-29 15:45:42 +08:00

grafana

feat(p3.2-tests+ci-schema): model_version tests + CI test_schema alignment + Grafana SLO Dashboard

2026-04-27 14:57:16 +08:00

tests

feat(p3.2+adr-100): Model Version Tracker + SLO self-governance + KB rot cleaner

2026-04-27 14:54:19 +08:00

alerts-unified.yml

fix(aiops): route backup failures rule-first