awoooi

Author	SHA1	Message	Date
Your Name	da772a1605	fix(decision): block kubectl actions on bare_metal host alerts All checks were successful Code Review / ai-code-review (push) Successful in 54s Details CD Pipeline / tests (push) Successful in 3m47s Details CD Pipeline / build-and-deploy (push) Successful in 13m26s Details CD Pipeline / post-deploy-checks (push) Successful in 5m45s Details When HostHighCpuLoad / HostOutOfMemory fire on a bare-metal host (192.168.0.110 et al, where Sentry / ClickHouse / Snuba are eating CPU), the LLM kept proposing "kubectl rollout restart awoooi-api", which is a wrong-domain action: restarting awoooi cannot fix a third-party process's CPU usage on the host. Auto-execute would then either run the no-op kubectl restart (wasted) or escalate after ssh_diagnose because no safe action was found, producing the "AI 自動修復失敗" ("AI auto-repair failed") Telegram noise the user just complained about. Adds a guard at the top of DecisionManager._auto_execute: if the incident's primary signal carries host_type=bare_metal AND the proposed action starts with "kubectl", refuse to execute. The incident is marked READY with a clear blocked_reason so human operators see why automation declined, and emergency_escalation records the event in AOL for audit. Also patches /home/wooo/monitoring/alerts.yml on 110 (and the new ops/monitoring/alerts.yml in repo) to add an explicit auto_repair_action annotation on HostHighCpuLoad / HostOutOfMemory that hints LLM toward `ssh ... ps aux` rather than kubectl restart. Prometheus reload returned 200. Tests: tests/test_decision_manager_bare_metal_kubectl_guard.py covers (1) bare_metal+kubectl blocked, (2) kubectl get also blocked, (3) bare_metal+ssh NOT blocked, (4) k8s host_type+kubectl NOT blocked, (5) missing host_type label NOT blocked. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 17:41:28 +08:00
Your Name	47342dfb34	fix(escalation): dedup escalation card by fingerprint + 24h TTL Some checks failed Code Review / ai-code-review (push) Successful in 55s Details CD Pipeline / build-and-deploy (push) Has been cancelled Details CD Pipeline / post-deploy-checks (push) Has been cancelled Details CD Pipeline / tests (push) Has been cancelled Details Follow-up to b3a0f0d7 (decision card dedup). The commander's hard evidence at 17:35: four ESCALATION P0 cards fired back to back (HostOutOfDiskSpace + 3×HostDiskUsageHigh, all with target=node-exporter-110, all with different INC IDs C9CD6E/FB7944/559B54/C1BBF3). The decision card was fixed, but the escalation card takes a separate path with the same root cause: - emergency_escalation_service.py:31 dedup key is bound to incident_id (random uuid4) - TTL 900s is shorter than the sweeper re-trigger period of 1h Fix: - escalate_auto_repair_unavailable() now dedups on the alertname+target fingerprint - TTL 900s → 86400s, aligned with decision_manager.py:574 The drift_auto_adopt path is left untouched for now (its TTL is already 3600s and report_id is not random, so it is not the current problem). Tests: 7 passed (escalation/emergency related cases) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 17:38:54 +08:00
AWOOOI CD	697e13b23a	chore(cd): deploy `297afb6` [skip ci]	2026-05-02 17:28:56 +08:00
Your Name	297afb6998	fix(ci): require all 4 host keys before overwriting ssh-mcp-key secret All checks were successful Code Review / ai-code-review (push) Successful in 44s Details CD Pipeline / tests (push) Successful in 2m17s Details CD Pipeline / build-and-deploy (push) Successful in 12m44s Details CD Pipeline / post-deploy-checks (push) Successful in 4m26s Details When ssh-keyscan partially fails (e.g. one host is unreachable for a moment) the previous logic still considered the file non-empty, so it patched ssh-mcp-key/known_hosts with an incomplete set. asyncssh then rejected any SSH to the missing host with "Host key is not trusted", which routed every host disk-full / docker alert into the emergency escalation channel and spammed Telegram (today's regression for 110). Now we explicitly verify all four target IPs (110/120/121/188) appear in the scan output before patching. Missing any of them aborts the patch and keeps the previously-good secret untouched, plus logs the ssh-keyscan stderr to help debug intermittent network issues. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 17:14:30 +08:00
AWOOOI CD	a6409c39e2	chore(cd): deploy `b3a0f0d` [skip ci]	2026-05-02 16:49:00 +08:00
Your Name	b3a0f0d766	fix(telegram): dedup by fingerprint + 24h TTL to stop repeat alerts All checks were successful CD Pipeline / tests (push) Successful in 2m22s Details Code Review / ai-code-review (push) Successful in 57s Details CD Pipeline / build-and-deploy (push) Successful in 21m3s Details CD Pipeline / post-deploy-checks (push) Successful in 5m2s Details Hard evidence of repeated Telegram alerts (real data from 4 agents): - INC-6FE3BD (HostBackupFailed) was pushed 15 times within 24h - INC-FD6E21 (HostHighCpuLoad) was pushed 6 times within 24h - two sends in the same second at 06:44:18 = concurrent pod race Root causes: 1. The `telegram_sent:{incident_id}` dedup key is bound to the random uuid4 INC ID, so a new INC with the same fingerprint is never deduplicated 2. The dedup TTL=600s is shorter than both the incident_analysis_sweeper re-trigger period (1h) and the alertmanager repeat_interval (4h), so the key has already expired on every round and the alert passes 3. On pod restart, _resend_unconfirmed_ready_tokens uses the same incident_id key, so every restart sets off a burst Fix (this does not mute alerts; the AI now recognizes that this is the same incident): - decision_manager.py:207-225 dedup key changed to the alertname+target fingerprint - decision_manager.py:573-578 TTL 600s → 86400s (covers sweeper 1h × alertmanager 4h) - decision_manager.py:3189-3208 pod restart resend path switched to the fingerprint as well - incident_analysis_sweeper.py:37-42 sweeper_done TTL 3600s → 86400s Expected: at most 1 decision card per 24h for the same symptom; after resolution, the status check at lines 220-226 returns early, so recurrence detection is unaffected. Tests: 35 passed (test_telegram_adr050 + test_decision_manager_docker_prune_routing) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 16:25:48 +08:00
Your Name	202071f7a8	chore(ci): force CD rebuild via .dockerignore touch Some checks failed CD Pipeline / tests (push) Successful in 2m17s Details CD Pipeline / build-and-deploy (push) Failing after 31m17s Details CD Pipeline / post-deploy-checks (push) Has been skipped Details Empty commits don't match cd.yaml paths filter (apps/** etc). This adds a comment to .dockerignore to trigger build for sha 84ba3216's commits stack. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 15:46:05 +08:00
Your Name	5c27bac686	chore(ci): retrigger build after runner restart Previous build (task#1396) failed when act_runner daemon was restarted to clear stuck job state. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 15:44:42 +08:00
Your Name	899bfdb6d1	chore(ci): trigger build after Gitea restart	2026-05-02 15:38:24 +08:00
Your Name	1a09b0250a	chore(ci): trigger Gitea Actions again	2026-05-02 15:32:55 +08:00
Your Name	ed726253e2	chore(ci): trigger Gitea Actions	2026-05-02 15:20:54 +08:00
Your Name	ec5eaef31c	chore(ci): enable Gitea Actions workflows	2026-05-02 15:20:01 +08:00
Your Name	84ba3216ee	feat(notifications): tag autonomous repair actions with [AUTO] prefix Some checks failed Code Review / ai-code-review (push) Successful in 57s Details CD Pipeline / tests (push) Successful in 2m36s Details CD Pipeline / build-and-deploy (push) Failing after 31m11s Details CD Pipeline / post-deploy-checks (push) Has been skipped Details Per user request: every AI-driven repair must surface a Telegram trace even when it succeeds, so nobody can later deny what the autonomy did. Adds 🤖 [AUTO] markers and an explicit `Actor: leWOOOgo (autonomous)` line to both success and failure status messages emitted by _push_auto_repair_result, making them clearly distinguishable from human-clicked approval cards. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 12:49:43 +08:00
Your Name	3059897318	feat(governance): auto-deprecate low-trust unused playbooks (>30d) Some checks failed Code Review / ai-code-review (push) Successful in 41s Details CD Pipeline / tests (push) Successful in 3m29s Details CD Pipeline / post-deploy-checks (push) Has been cancelled Details CD Pipeline / build-and-deploy (push) Has been cancelled Details trust_drift previously fired alerts forever for playbooks stuck below the 0.2 threshold. With user authorization for governance-class auto-fixes, check_trust_drift now retires playbooks that have been unused for 30+ days (or never used and created 30+ days ago) by flipping status to 'deprecated' before alerting. Alerts now report drifted_count, auto_deprecated_count, and the kept playbook_ids that still need human review (those in their 30d trial window). Existing alert noise from the four currently-drifted playbooks should drop to whatever fraction is genuinely in trial. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 12:31:37 +08:00
Your Name	607358c4dd	fix(approval): route SSH actions through SSHProvider on manual approve parse_operation_from_action only knew kubectl and Chinese restart phrases, so any "ssh host '...'" action approved via Telegram fell through to "Could not parse operation type" and reported a fake failure even though the LLM had proposed a valid host repair. Adds OperationType.SSH_HOST, makes the parser detect ssh prefixes (with optional flags / user@host) before kubectl patterns, and routes the SSH_HOST branch in approval_execution.execute_in_background through SSHProvider with the same tool keywords decision_manager uses (ssh_docker_prune / ssh_docker_restart / ssh_systemctl_restart / ssh_diagnose). Unroutable SSH actions now fail loudly with a descriptive error instead of silently breaking. Trigger: 2026-05-02 incidents INC-20260502-D6D0B7 / E12EE4 / 557055 were approved by the user but executor reported "Could not parse" and left the alerts pending. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 12:31:37 +08:00
Your Name	3156ff1c69	feat(aiops): add ssh_docker_prune to auto-repair flywheel for disk-full alerts Adds Group B SSH MCP tool ssh_docker_prune (image+volume+builder prune with ≥75% disk usage gate) and routes "docker prune" actions through it. Flips HostDiskUsageHigh from auto_repair=false to true with mcp_provider routing labels so the flywheel can self-heal next disk-full event without hitting the emergency_channel Telegram path. Trigger: 2026-05-01 → 05-02 Telegram alert storm (peak 53/hr) caused by empty ssh-mcp-key/known_hosts secret rejecting all SSH and forcing every disk-full alert through "Host key is not trusted → escalate" loop. known_hosts patched live; this commit closes the playbook gap so the next occurrence resolves without manual intervention. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 12:31:37 +08:00
Your Name	8cf559215c	docs(awooop): add Phase 1 Isolation Foundation implementation plan (ADR-106 P1)	2026-05-02 12:28:33 +08:00
Your Name	443947ffa1	fix(ci): avoid code review sigpipe on large diffs [skip ci]	2026-05-01 20:59:14 +08:00
AWOOOI CD	329849a559	chore(cd): deploy `7795f02` [skip ci]	2026-05-01 20:53:02 +08:00
Your Name	7795f027d2	fix(aiops): persist emergency intervention traces Some checks failed CD Pipeline / tests (push) Successful in 2m56s Details Code Review / ai-code-review (push) Failing after 39s Details CD Pipeline / build-and-deploy (push) Successful in 12m54s Details CD Pipeline / post-deploy-checks (push) Successful in 4m40s Details	2026-05-01 20:34:33 +08:00
Your Name	8e49f2ea88	fix(ci): preserve ssh mcp known hosts [skip ci]	2026-05-01 17:18:32 +08:00
AWOOOI CD	b72eac0712	chore(cd): deploy `433f7b0` [skip ci]	2026-05-01 17:08:42 +08:00
Your Name	433f7b068e	fix(aiops): close ssh and telegram remediation gaps All checks were successful CD Pipeline / tests (push) Successful in 2m7s Details Code Review / ai-code-review (push) Successful in 42s Details CD Pipeline / build-and-deploy (push) Successful in 13m14s Details CD Pipeline / post-deploy-checks (push) Successful in 4m29s Details	2026-05-01 16:53:02 +08:00
Your Name	3650fc727a	docs(ci): record runner user service takeover state All checks were successful Code Review / ai-code-review (push) Successful in 45s Details	2026-05-01 16:30:54 +08:00
Your Name	e7991b8e6c	fix(ci): keep runner installer idempotent without restart All checks were successful Code Review / ai-code-review (push) Successful in 42s Details	2026-05-01 16:27:37 +08:00
Your Name	bc295eaec2	fix(ci): allow user service for gitea host runner Some checks failed Code Review / ai-code-review (push) Has been cancelled Details	2026-05-01 16:24:45 +08:00
Your Name	cb5ab900c4	fix(ci): preserve gitea runner jobs on shutdown All checks were successful Code Review / ai-code-review (push) Successful in 46s Details	2026-05-01 16:16:27 +08:00
AWOOOI CD	f72419dd17	chore(cd): deploy `b0da6da` [skip ci]	2026-05-01 15:27:48 +08:00
Your Name	b0da6da1e9	feat(aiops): structure agent loop shadow output Some checks failed CD Pipeline / tests (push) Successful in 2m50s Details Code Review / ai-code-review (push) Successful in 33s Details CD Pipeline / build-and-deploy (push) Failing after 25m48s Details CD Pipeline / post-deploy-checks (push) Has been cancelled Details	2026-05-01 15:09:57 +08:00
AWOOOI CD	f53d7e5584	chore(cd): deploy `f8e4497` [skip ci]	2026-05-01 14:41:18 +08:00
Your Name	f8e44971c1	feat(aiops): enable read-only agent loop canary All checks were successful CD Pipeline / tests (push) Successful in 1m43s Details Code Review / ai-code-review (push) Successful in 31s Details CD Pipeline / build-and-deploy (push) Successful in 10m22s Details CD Pipeline / post-deploy-checks (push) Successful in 4m3s Details	2026-05-01 14:20:16 +08:00
AWOOOI CD	33a7148916	chore(cd): deploy `b6cf616` [skip ci]	2026-05-01 14:02:59 +08:00
Your Name	b6cf616707	fix(aiops): harden agent tool permission names All checks were successful CD Pipeline / tests (push) Successful in 1m32s Details Code Review / ai-code-review (push) Successful in 27s Details CD Pipeline / build-and-deploy (push) Successful in 8m26s Details CD Pipeline / post-deploy-checks (push) Successful in 3m37s Details	2026-05-01 13:52:33 +08:00
AWOOOI CD	1fe75e9f99	chore(cd): deploy `6ec3f11` [skip ci]	2026-05-01 13:45:55 +08:00
Your Name	6ec3f116fd	fix(ci): normalize migration database url for psql All checks were successful CD Pipeline / tests (push) Successful in 1m30s Details Code Review / ai-code-review (push) Successful in 27s Details CD Pipeline / build-and-deploy (push) Successful in 13m20s Details CD Pipeline / post-deploy-checks (push) Successful in 3m36s Details	2026-05-01 13:30:32 +08:00
Your Name	7e4d995e4b	feat(aiops): add mcp agent loop foundation Some checks failed CD Pipeline / tests (push) Successful in 1m59s Details Code Review / ai-code-review (push) Successful in 28s Details run-migration / migrate (push) Failing after 24s Details CD Pipeline / post-deploy-checks (push) Has been cancelled Details CD Pipeline / build-and-deploy (push) Has been cancelled Details	2026-05-01 13:21:19 +08:00
Your Name	9db87f177e	fix(aiops): suppress repeated llm alert loops Some checks failed CD Pipeline / tests (push) Successful in 1m37s Details Code Review / ai-code-review (push) Successful in 28s Details CD Pipeline / post-deploy-checks (push) Has been cancelled Details CD Pipeline / build-and-deploy (push) Has been cancelled Details	2026-05-01 13:02:07 +08:00
Your Name	3691402561	chore(cd): deploy `11673d80` api [skip ci]	2026-05-01 12:52:23 +08:00
Your Name	11673d80ea	fix(aiops): route backup decisions through ssh Some checks failed CD Pipeline / tests (push) Successful in 1m35s Details Code Review / ai-code-review (push) Successful in 34s Details CD Pipeline / post-deploy-checks (push) Has been cancelled Details CD Pipeline / build-and-deploy (push) Has been cancelled Details	2026-05-01 12:50:01 +08:00
Your Name	337bcb912e	fix(db): tolerate knowledge enum owner mismatch Some checks failed CD Pipeline / tests (push) Successful in 1m48s Details Code Review / ai-code-review (push) Successful in 27s Details run-migration / migrate (push) Successful in 22s Details CD Pipeline / build-and-deploy (push) Failing after 31m4s Details CD Pipeline / post-deploy-checks (push) Has been skipped Details	2026-05-01 11:08:21 +08:00
Your Name	3a6acae408	fix(km): add phase25 knowledge enum labels Some checks failed CD Pipeline / tests (push) Successful in 2m14s Details Code Review / ai-code-review (push) Successful in 26s Details run-migration / migrate (push) Failing after 24s Details CD Pipeline / build-and-deploy (push) Has been cancelled Details CD Pipeline / post-deploy-checks (push) Has been cancelled Details	2026-05-01 11:03:03 +08:00
Your Name	ce4cf4c94b	chore(cd): deploy `2c12bce` api [skip ci]	2026-05-01 10:58:55 +08:00
Your Name	2c12bce135	fix(aiops): use existing escalation event type Some checks failed CD Pipeline / tests (push) Successful in 1m54s Details Code Review / ai-code-review (push) Successful in 29s Details CD Pipeline / post-deploy-checks (push) Has been cancelled Details CD Pipeline / build-and-deploy (push) Has been cancelled Details	2026-05-01 10:56:59 +08:00
Your Name	78bcc090ad	chore(cd): deploy `97be5de` api [skip ci]	2026-05-01 10:52:31 +08:00
Your Name	97be5dedd7	fix(aiops): escalate failed host verification Some checks failed CD Pipeline / tests (push) Successful in 1m27s Details Code Review / ai-code-review (push) Successful in 29s Details CD Pipeline / post-deploy-checks (push) Has been cancelled Details CD Pipeline / build-and-deploy (push) Has been cancelled Details	2026-05-01 10:47:42 +08:00
AWOOOI CD	046d598e88	chore(cd): deploy `e4aef6a` [skip ci]	2026-05-01 10:43:56 +08:00
Your Name	fa6a78af2a	chore(cd): deploy `e4aef6a` api [skip ci]	2026-05-01 10:42:07 +08:00
Your Name	e4aef6ac4e	fix(aiops): block k8s playbooks for host repair All checks were successful CD Pipeline / tests (push) Successful in 1m27s Details Code Review / ai-code-review (push) Successful in 26s Details CD Pipeline / build-and-deploy (push) Successful in 8m6s Details CD Pipeline / post-deploy-checks (push) Successful in 3m31s Details	2026-05-01 10:33:52 +08:00
AWOOOI CD	7472eb2fcd	chore(cd): deploy `ca22ec2` [skip ci]	2026-05-01 10:24:48 +08:00
Your Name	ca22ec2fd2	fix(aiops): route backup failures rule-first All checks were successful CD Pipeline / tests (push) Successful in 1m51s Details Code Review / ai-code-review (push) Successful in 30s Details Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 42s Details CD Pipeline / build-and-deploy (push) Successful in 8m21s Details CD Pipeline / post-deploy-checks (push) Successful in 4m18s Details	2026-05-01 10:11:10 +08:00
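
The dedup change described in commits b3a0f0d766 and 47342dfb34 (key on an alertname+target fingerprint instead of the random incident ID, and raise the TTL to 86400s) can be illustrated with a minimal Python sketch. This is not the repository's code: `fingerprint`, `TtlDedup`, and the in-memory dictionary are hypothetical names chosen for illustration, and the commits do not say which store backs the real dedup keys.

```python
import hashlib
import time
from typing import Callable, Dict

DEDUP_TTL_SECONDS = 86400  # 24h, the TTL named in the commit messages


def fingerprint(alertname: str, target: str) -> str:
    """Stable key for 'the same symptom on the same target' (hypothetical helper).

    Unlike a uuid4 incident ID, this value is identical every time the
    same alert fires for the same target.
    """
    raw = f"{alertname}|{target}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]


class TtlDedup:
    """In-memory stand-in for a TTL-based dedup store (assumption)."""

    def __init__(self, ttl: float = DEDUP_TTL_SECONDS,
                 clock: Callable[[], float] = time.time) -> None:
        self._ttl = ttl
        self._clock = clock
        self._expires_at: Dict[str, float] = {}

    def should_send(self, key: str) -> bool:
        """Return True once per key per TTL window, False for repeats."""
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False  # still inside the window: suppress the repeat
        self._expires_at[key] = now + self._ttl
        return True


if __name__ == "__main__":
    dedup = TtlDedup()
    key = fingerprint("HostDiskUsageHigh", "node-exporter-110")
    print(dedup.should_send(key))  # first card is sent
    print(dedup.should_send(key))  # a repeat inside 24h is suppressed
```

A per-process dictionary like this would not prevent the same-second double send across pods mentioned in b3a0f0d766; that would need a shared store with an atomic set-if-absent operation, which is assumed here rather than shown.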

1892 Commits
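
The guard described in commit da772a1605 (refuse any action that starts with "kubectl" when the incident's primary signal carries host_type=bare_metal) could look roughly like the sketch below. The function name `kubectl_block_reason`, the label mapping, and the wording of the reason string are assumptions; only the `host_type=bare_metal` condition, the "starts with kubectl" check, and the five covered cases come from the commit message.

```python
from typing import Mapping, Optional


def kubectl_block_reason(labels: Mapping[str, str], action: str) -> Optional[str]:
    """Return a blocked_reason string if the action must not auto-execute.

    Hypothetical helper: returns None when the guard does not apply.
    """
    # A missing host_type label is not treated as bare metal.
    if labels.get("host_type") != "bare_metal":
        return None
    # Any kubectl command is refused, including read-only ones such as
    # `kubectl get`, because it targets the wrong domain for a host alert.
    if not action.strip().startswith("kubectl"):
        return None
    return (
        "kubectl action refused: alert is on a bare_metal host, "
        "so a Kubernetes operation cannot address it"
    )


if __name__ == "__main__":
    labels = {"host_type": "bare_metal"}
    print(kubectl_block_reason(labels, "kubectl rollout restart awoooi-api"))
    print(kubectl_block_reason(labels, "ssh 192.168.0.110 'ps aux'"))
```

Returning a reason string rather than a boolean keeps the explanation next to the decision, which matches the commit's stated goal that operators see why automation declined.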