awoooi

Author	SHA1	Message	Date
Your Name	ae9d0b7385	feat(monitoring): alert on stale source provider ingestion (All checks were successful)	2026-05-20 19:19:21 +08:00
Your Name	598f33ae8b	fix(monitoring): clarify alert chain smoke evidence (All checks were successful)	2026-05-20 13:11:44 +08:00
Your Name	2221fd3256	fix(ops): persist host resource guardrails (All checks were successful)	2026-05-05 16:13:19 +08:00
Your Name	1cc9de5722	fix(ops): point runner guardrail alerts to host script (All checks were successful)	2026-05-05 15:25:37 +08:00
Your Name	d08d1e4951	fix(ops): alert on missing docker resource limits (Some checks failed)	2026-05-05 15:01:31 +08:00
Your Name	72d66e4ae6	fix(ops): align stale job cleanup thresholds (All checks were successful)	2026-05-05 14:54:17 +08:00
Your Name	5e625f777d	fix(ops): add stale gitea job cleanup guard (Some checks failed)	2026-05-05 14:50:47 +08:00
Your Name	7d45f0cb58	fix(ops): alert on stale gitea actions jobs (Some checks failed)	2026-05-05 14:42:09 +08:00
Your Name	fe618960a8	fix(ops): monitor systemd runners in host baseline (Some checks failed)	2026-05-05 14:08:43 +08:00
Your Name	e8e6748f70	fix(ops): add docker host resource baseline guardrails (Some checks failed)	2026-05-05 13:45:09 +08:00
Your Name	577250a678	fix(governance): fix the anti-silencing W-3/W-4 guards + add Prometheus missing-data alert (Some checks failed) [Commander's reprimand: violation of the feedback_full_chain_first_then_fix.md iron rule] The previous commit `f1362fcc` swallowed alerts with skip conditions, which is a silencing workaround: - W-3: total_exec<10 always skips → no alert even if Redis stays empty forever - W-4: playbooks total==0 always skips → no alert even if the table is wiped - Prometheus NaN sentinel stacked on the existing < 0.1 rule leaves no path that can fire an alert. The commander's reprimand: "the alerts have been made to disappear again", "how many times has this been done already". This commit restores alert visibility. [Fix: 30-minute startup grace period; once it expires, fire new data-pipeline-broken alerts instead] - ai_slo_watchdog_job.py adds a module-level _PROCESS_START and a _grace_active() guard: - W-3a: metric has data + rate<0.30 → existing alert "flywheel success rate too low" - W-3b: rate=None and uptime>30min → new alert "flywheel data pipeline has no traffic" - W-4a: playbooks total>0 + approved=0 → existing alert "auto-repair chain broken" - W-4b: playbooks total=0 and uptime>30min → new alert "Playbook table initialization failed" - Three Prometheus rule files (k8s/monitoring/flywheel-alerts.yaml, ops/monitoring/alerts.yml, ops/monitoring/alerts-unified.yml) add FlywheelExecutionRateMissing: absent() or NaN sustained for 30 minutes → alert, a second safeguard alongside watchdog W-3b. [Added to memory] feedback_silencing_alerts_recurring_violation.md locks in a red-line rule: "using skip in a fresh deploy / init guard to swallow alerts = structural dereliction of duty; the logic must be split into a grace period and, once it expires, a new data-pipeline-broken alert". [Verification] All 106 governance-related unit tests pass: test_trust_drift_watchdog / test_governance_agent / test_failover_alerter / test_check_trust_drift_commit_outside_context_poc / test_governance_remediation_dispatch / test_ai_governance_endpoints / test_governance_dispatcher 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 12:39:46 +08:00
Your Name	b371edb70c	fix host alert auto-repair routing and backup false positives	2026-05-02 23:44:12 +08:00
Your Name	da772a1605	fix(decision): block kubectl actions on bare_metal host alerts (All checks were successful) When HostHighCpuLoad / HostOutOfMemory fire on a bare-metal host (192.168.0.110 et al, where Sentry / ClickHouse / Snuba are eating CPU), the LLM kept proposing "kubectl rollout restart awoooi-api", which is a wrong-domain action: restarting awoooi cannot fix a third-party process's CPU usage on the host. Auto-execute would then either run the no-op kubectl restart (wasted) or escalate after ssh_diagnose because no safe action was found, producing the "AI auto-repair failed" Telegram noise the user just complained about. Adds a guard at the top of DecisionManager._auto_execute: if the incident's primary signal carries host_type=bare_metal AND the proposed action starts with "kubectl", refuse to execute. The incident is marked READY with a clear blocked_reason so human operators see why automation declined, and emergency_escalation records the event in AOL for audit. Also patches /home/wooo/monitoring/alerts.yml on 110 (and the new ops/monitoring/alerts.yml in repo) to add an explicit auto_repair_action annotation on HostHighCpuLoad / HostOutOfMemory that hints the LLM toward `ssh ... ps aux` rather than kubectl restart. Prometheus reload returned 200. Tests: tests/test_decision_manager_bare_metal_kubectl_guard.py covers (1) bare_metal+kubectl blocked, (2) kubectl get also blocked, (3) bare_metal+ssh NOT blocked, (4) k8s host_type+kubectl NOT blocked, (5) missing host_type label NOT blocked. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 17:41:28 +08:00
Your Name	3156ff1c69	feat(aiops): add ssh_docker_prune to auto-repair flywheel for disk-full alerts Adds Group B SSH MCP tool ssh_docker_prune (image+volume+builder prune with ≥75% disk usage gate) and routes "docker prune" actions through it. Flips HostDiskUsageHigh from auto_repair=false to true with mcp_provider routing labels so the flywheel can self-heal next disk-full event without hitting the emergency_channel Telegram path. Trigger: 2026-05-01 → 05-02 Telegram alert storm (peak 53/hr) caused by empty ssh-mcp-key/known_hosts secret rejecting all SSH and forcing every disk-full alert through "Host key is not trusted → escalate" loop. known_hosts patched live; this commit closes the playbook gap so the next occurrence resolves without manual intervention. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 12:31:37 +08:00
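
The guard described in commit da772a1605 can be sketched roughly as follows. This is a minimal illustration, not the actual DecisionManager._auto_execute code: the function name check_bare_metal_kubectl_guard, the GuardResult type, and the flat labels dict are assumptions made for the example. Only the host_type=bare_metal condition, the "starts with kubectl" check, and the blocked_reason field come from the commit message.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class GuardResult:
    """Hypothetical result type: whether automation declined, and why."""
    blocked: bool
    blocked_reason: Optional[str] = None


def check_bare_metal_kubectl_guard(
    labels: Mapping[str, str], proposed_action: str
) -> GuardResult:
    """Refuse any kubectl action when the primary signal is a bare-metal host.

    `labels` stands in for the labels of the incident's primary signal
    (an assumption; the real incident model is not shown in the log).
    """
    host_type = labels.get("host_type")  # a missing label means "do not block"
    if host_type == "bare_metal" and proposed_action.startswith("kubectl"):
        return GuardResult(
            blocked=True,
            blocked_reason=(
                "kubectl action refused: primary signal has "
                "host_type=bare_metal"
            ),
        )
    return GuardResult(blocked=False)
```

Under these assumptions, the five cases listed in the commit message map to: bare_metal with "kubectl rollout restart ..." is blocked, bare_metal with "kubectl get ..." is blocked, bare_metal with an ssh action passes, a k8s host_type with kubectl passes, and a signal with no host_type label passes.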

14 Commits