Commit Graph

12 Commits

Author SHA1 Message Date
Your Name
f0d14ab6c4 fix(aiops): escalate blocked auto repair
Some checks failed
CD Pipeline / tests (push) Successful in 1m33s
Code Review / ai-code-review (push) Successful in 28s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 40s
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-04-30 23:49:17 +08:00
Your Name
c1a1be61bd fix(ssh-auto): authorize SSH auto-diagnosis for host alerts (HostHighCpuLoad no longer stuck in manual review)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 9m7s
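Root cause: SSH_MCP_ALLOWED_HOSTS was not set → _ssh_execute() blocked every call
      + auto_approve recognized only kubectl, not ssh → host alerts always fell back to manual review

Fix:
- ConfigMap: add SSH_MCP_ALLOWED_HOSTS with a four-host allowlist
- alert_rules: change HostHighCpuLoad and related rules from NO_ACTION to the ssh_diagnose command
- auto_approve: _has_executable now recognizes commands starting with ssh
- decision_manager: _ssh_execute() adds routing for ssh_diagnose
- ssh_provider: add the ssh_diagnose tool (ps aux + free -h + df -h, read-only)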
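
As a rough illustration of the two checks this commit touches, the sketch below pairs an allowlist lookup that rejects every host when the variable is unset with an executable check that accepts an ssh prefix alongside kubectl. The function bodies, the comma-separated allowlist format, and the example addresses are assumptions, not code from the repository.

```python
from typing import FrozenSet, Optional

# Assumed prefixes; the commit only says that "ssh" is now recognized next to kubectl.
EXECUTABLE_PREFIXES = ("kubectl", "ssh")


def has_executable(command: str) -> bool:
    """Return True when the command starts with a recognized executable prefix."""
    return command.strip().startswith(EXECUTABLE_PREFIXES)


def parse_allowed_hosts(raw: Optional[str]) -> FrozenSet[str]:
    """Parse a comma-separated allowlist (assumed format); unset means nothing is allowed."""
    if not raw:
        return frozenset()
    return frozenset(part.strip() for part in raw.split(",") if part.strip())


def is_host_allowed(host: str, raw_allowlist: Optional[str]) -> bool:
    """Deny by default: a host passes only if it appears in the allowlist."""
    return host in parse_allowed_hosts(raw_allowlist)
```

With an unset allowlist every host is rejected, which matches the failure mode described in the root cause above.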

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-27 20:13:07 +08:00
Your Name
04ff22563e fix(aiops-p1): fix all 5 breakpoints in the Playbook learning loop + DB migration (ADR-092 B4)
Some checks failed
run-migration / migrate (push) Failing after 14s
CD Pipeline / build-and-deploy (push) Failing after 2m7s
[P0.4 patch] pre_decision_investigator: Prometheus query field missing
- _build_tool_params() adds the "query" field (required parameter of the prometheus_query tool)
- Add _build_prometheus_query(), which generates PromQL by alert type (CPU/Memory/Crash/Disk/HTTP/Pod/fallback)
- After the fix, the D3_METRICS sensing dimension actually returns data (previously 100% returned missing_query_parameter)
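
One way to picture the alert-type dispatch described above is a keyword table with a fallback. This is only a sketch: the keyword matching, the PromQL expressions, and the fallback query are illustrative assumptions and are not taken from the repository.

```python
from typing import Dict, List, Tuple

# Keyword → PromQL table. Order matters: "crash" is listed before "pod" so that
# an alert such as KubePodCrashLooping maps to the restart-count query.
_QUERY_BY_KEYWORD: List[Tuple[str, str]] = [
    ("cpu", '100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))'),
    ("memory", "node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes"),
    ("crash", "increase(kube_pod_container_status_restarts_total[15m])"),
    ("disk", "node_filesystem_avail_bytes / node_filesystem_size_bytes"),
    ("http", "probe_success"),
    ("pod", "kube_pod_status_phase"),
]
FALLBACK_QUERY = "up"  # assumed fallback


def build_prometheus_query(alert_name: str) -> str:
    """Pick a PromQL expression from the first keyword found in the alert name."""
    lowered = alert_name.lower()
    for keyword, query in _QUERY_BY_KEYWORD:
        if keyword in lowered:
            return query
    return FALLBACK_QUERY


def build_tool_params(alert_name: str) -> Dict[str, str]:
    """Always include the "query" key that the prometheus_query tool requires."""
    return {"query": build_prometheus_query(alert_name)}
```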

[P1 Playbook learning loop: B1-B5 all fixed]
- B2 db/models.py: ApprovalRecord adds the matched_playbook_id column + ix_approval_matched_playbook index
- B2 db/models.py: TimelineEvent adds the incident_id column (for MCP auditing) + index
- B3 approval_db.py: the record→ApprovalRequest conversion restores incident_id + matched_playbook_id
- B4 approval_repository.py: same as B3 (the two conversion functions must stay in sync)
- B5 approval_db.py: approval_request_to_record_data adds matched_playbook_id → so the DB can store the value

[P1.5 KM write] approval_execution.py: fire-and-forget → await wait_for(30s)
- Root cause: the asyncio.create_task task was killed on Pod recycle, so KM writes were silently lost
- Fix: await asyncio.wait_for(..., timeout=30.0) + log on TimeoutError
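
The change from fire-and-forget to a bounded await can be sketched as follows. The write_km_entry coroutine and its payload shape are hypothetical stand-ins for the real KM write; only asyncio.wait_for and the 30-second timeout come from the commit message.

```python
import asyncio
import logging
from typing import Any, Awaitable, Optional

logger = logging.getLogger(__name__)


async def write_km_entry(payload: dict, delay: float = 0.01) -> str:
    """Hypothetical stand-in for the KM write; sleeps to simulate I/O."""
    await asyncio.sleep(delay)
    return f"stored:{payload['id']}"


async def persist_with_timeout(write: Awaitable[Any], timeout: float = 30.0) -> Optional[Any]:
    """Await the write with an upper bound instead of detaching it with create_task."""
    try:
        return await asyncio.wait_for(write, timeout=timeout)
    except asyncio.TimeoutError:
        # The failure is logged instead of being lost silently.
        logger.error("KM write timed out after %.1fs", timeout)
        return None
```

Because the caller now awaits the write, a failure or timeout surfaces in the request path rather than disappearing with a detached task. The trade-off is that the caller can block for up to the timeout.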

[Migration file] adr092_p1_learning_chain_fix.sql
- ALTER TABLE approval_records ADD COLUMN matched_playbook_id VARCHAR(36)
- ALTER TABLE timeline_events ADD COLUMN incident_id VARCHAR(64)
- Run: psql $DATABASE_URL -f apps/api/migrations/adr092_p1_learning_chain_fix.sql

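[Incidental Agent changes]
- decision_manager: Phase 2 YAML NO_ACTION priority gate (host-layer/external-service alerts skip Agent Debate)
- alert_rules.yaml: new rules for Sentry/ClickHouse + HostDiskUsageHigh/Critical
- solver_agent: action_title falls back to semantic synthesis (instead of being silently dropped)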

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 15:41:35 +08:00
Your Name
7f4088bcd0 fix(aiops-p0): full P0 fix for the six root problems (ADR-092 B4)
[P0.1] knowledge_extractor_service.py:210: fix AttributeError
- The Signal.description field does not exist (100% failure; root cause of the KM "+5 per day" symptom)
- Build the text from alert_name + annotations.summary instead

[P0.2+P0.3] Gate 9+11: relax restrictions on read-only commands
- blast_radius_calculator: kubectl get/top/describe/logs/version → score=1 (not 50)
- operation_parser: recognize the INVESTIGATE type (read-only kubectl no longer returns None)
- executor.py: OperationType adds the INVESTIGATE enum member
- approval_execution.py: the INVESTIGATE path calls execute_kubectl_command directly
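
The read-only relaxation above could look roughly like this. The verb list and the scores 1 and 50 come from the commit message. The parsing approach, the function names, and the use of 50 as the score for every other command are assumptions.

```python
import shlex
from enum import Enum
from typing import Optional


class OperationType(Enum):
    INVESTIGATE = "investigate"


READ_ONLY_VERBS = frozenset({"get", "top", "describe", "logs", "version"})
ASSUMED_DEFAULT_SCORE = 50  # assumption: applied here to everything that is not read-only


def parse_read_only(command: str) -> Optional[OperationType]:
    """Return INVESTIGATE for `kubectl <read-only verb> ...`, otherwise None."""
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes: treat as not read-only
        return None
    if len(tokens) >= 2 and tokens[0] == "kubectl" and tokens[1] in READ_ONLY_VERBS:
        return OperationType.INVESTIGATE
    return None


def blast_radius_score(command: str) -> int:
    """Score read-only kubectl commands as 1; everything else keeps the default."""
    if parse_read_only(command) is OperationType.INVESTIGATE:
        return 1
    return ASSUMED_DEFAULT_SCORE
```

The sketch is deliberately conservative: a command with flags before the verb (for example `kubectl -n prod get pods`) is not classified as read-only.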

[P0.4] MCP SSH/K8s Provider fixes
- decision_manager: params= → parameters= (matches the MCPToolProvider.execute signature)
- decision_manager: MCPToolResult .get() → .success/.output (dataclass usage)
- decision_manager + ssh_provider: add hosts 120/121 (missing from the original defaults)
- auto_approve: the phase2_agent_debate source bypasses the confidence threshold

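[P0.5] Fix semantic contradictions in alert rules
- alert_rules.yaml: 8 kubectl query rules changed from RESTART_DEPLOYMENT → NO_ACTION
  (CrashLoopBackOff/PostgreSQL connections/slow queries/MinIO disk/K3s node/alert chain/SSL/CoreDNS, etc.)
- incident_service.py: cAdvisor/CoreDNS split out of general into their own categories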

[P0.6] proactive_inspector: fix all dynamic-baseline PromQL
- All 5 MONITORED_METRICS PromQL expressions corrected (cadvisor label/datname/blackbox)
- db_connection_pool: datname="awoooi" → "awoooi_prod"
- http_error_rate: invalid http_requests_total → blackbox probe_success
- cpu/memory: namespace label → name=~"k8s_api_awoooi-api.*"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 15:32:23 +08:00
OG T
54d6818b8d fix(sensors+rules+dedup): three root-cause fixes across the pipeline: asyncssh missing / gaps in YAML rules / duplicate cards
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Fix 1: asyncssh not installed → sensors_succeeded always = 0
  - Add asyncssh>=2.14.0 to apps/api/pyproject.toml
  - Root cause: import asyncssh in ssh_provider.py sits outside try/except, so the ImportError propagates
  - Effect: all 15 SSH tools are available again
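
The commit itself fixes the problem by adding the dependency. As a complementary illustration of the root cause, a guarded optional import keeps a missing package from taking down the whole module. The helper names below are hypothetical.

```python
import importlib
from types import ModuleType
from typing import Optional


def load_optional(module_name: str) -> Optional[ModuleType]:
    """Import a module if it is installed; return None instead of raising."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# The provider can then report a clear status instead of failing at import time.
asyncssh = load_optional("asyncssh")
SSH_AVAILABLE = asyncssh is not None


def ssh_tools_status() -> str:
    """Describe whether SSH tools can be used in this environment."""
    if SSH_AVAILABLE:
        return "available"
    return "unavailable: asyncssh is not installed"
```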

Fix 2: gaps in YAML rules → HostHighLoadAverage/PostgreSQLDiskGrowthRate fell through to generic_fallback → restart
  - Merge host_cpu_high into host_resource_alert, covering 25+ host-layer alertnames
  - Add the postgresql_disk_monitoring rule, covering disk-growth/exporter/vacuum alerts
  - All host-layer/disk-monitoring alerts → NO_ACTION; kubectl restart is forbidden

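Fix 3: the same incident was processed by multiple pods at once → 3 duplicate Telegram cards were sent
  - decision_manager.get_or_create_decision(): add an early return for the ANALYZING state
  - Root cause: ANALYZING was not in the (READY/EXECUTING/COMPLETED) condition → pod-B/C each created a new token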
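
The early return described in Fix 3 amounts to treating ANALYZING as an in-flight state. The sketch below uses an in-memory dict to show only that state condition. The class, the token format, and the return shape are assumptions, and a real multi-pod deployment would also need a shared store with an atomic check-and-set.

```python
import uuid
from enum import Enum
from typing import Dict, Tuple


class DecisionState(Enum):
    ANALYZING = "analyzing"
    READY = "ready"
    EXECUTING = "executing"
    COMPLETED = "completed"


# ANALYZING is included so that a second caller reuses the existing decision.
IN_FLIGHT_STATES = frozenset(
    {DecisionState.ANALYZING, DecisionState.READY,
     DecisionState.EXECUTING, DecisionState.COMPLETED}
)


class DecisionStore:
    """Hypothetical in-memory stand-in for the shared decision store."""

    def __init__(self) -> None:
        self._by_incident: Dict[str, dict] = {}

    def get_or_create_decision(self, incident_id: str) -> Tuple[dict, bool]:
        """Return (decision, created); reuse any decision already in flight."""
        existing = self._by_incident.get(incident_id)
        if existing is not None and existing["state"] in IN_FLIGHT_STATES:
            return existing, False
        decision = {"token": uuid.uuid4().hex, "state": DecisionState.ANALYZING}
        self._by_incident[incident_id] = decision
        return decision, True
```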

2026-04-16 ogt + Claude Sonnet 4.6 (Taipei time zone)
2026-04-16 22:23:49 +08:00
OG T
cc42aa0bdb feat(adr-076): Task 2.2 + 2.3: rule expansion + kubectl injection protection
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Task 2.2: alert_rules.yaml adds 3 rule categories (priority 125-127)
  - gitea_down: Gitea CI/CD down → NO_ACTION (priority 125, critical)
  - ssl_cert_expiring: SSL certificate expiring → NO_ACTION (priority 126, medium)
  - external_site_down: MoWoooWork/Dev/Blackbox probe → NO_ACTION (priority 127, medium)
  Total rules: 21 → 24

Task 2.3: alert_rule_engine.py kubectl injection protection
  - _RULE_ENGINE_DESTRUCTIVE_RE: blocks delete pvc/namespace/statefulset/deployment,
    drain/cordon, --replicas=0, rm -rf, DROP TABLE, $() and backticks
  - validate_kubectl_command(): public API; SSH commands and empty strings pass straight through
  - match_rule() integration: validate after variable substitution; when blocked, clear the command + log a warning
  - test_alert_rule_engine_validation.py: 34 tests (100% passing)
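
A deny-list validator along the lines of Task 2.3 might be sketched as below. The blocked patterns follow the list in the commit message, but the exact regular expression, the case-insensitive matching, and the way SSH commands are detected are assumptions.

```python
import re

# Deny-list of destructive fragments, written in verbose mode for readability.
_DESTRUCTIVE_RE = re.compile(
    r"""
      \bdelete\s+(pvc|namespace|statefulset|deployment)\b
    | \b(drain|cordon)\b
    | --replicas=0\b
    | \brm\s+-rf\b
    | \bDROP\s+TABLE\b
    | \$\(          # command substitution
    | `             # backtick substitution
    """,
    re.IGNORECASE | re.VERBOSE,
)


def validate_kubectl_command(command: str) -> bool:
    """Return True when the command may run, False when it must be blocked."""
    stripped = command.strip()
    # Per the commit message, empty strings and SSH commands pass straight through.
    if not stripped or stripped.startswith("ssh"):
        return True
    return _DESTRUCTIVE_RE.search(stripped) is None
```

Running the check after variable substitution, as the commit describes, matters because a substituted label value could otherwise smuggle a blocked fragment past the check.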

Tests: 776 passed, 26 skipped, 0 failed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 15:10:10 +08:00
OG T
f1face4e34 fix(alert-rules): give HostHighCpuLoad its own rule, stop kubectl scale unknown
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Root cause: HostHighCpuLoad is a node_exporter host alert with no pod/deployment label
      it was matched by the K8s high_cpu rule → {target}=unknown → blocked by the auto-repair safety check

Fix: add the host_cpu_high rule (priority=45), NO_ACTION + an accurate description
     remove HostHighCpuLoad/NodeCPUUsageHigh from the high_cpu rule

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:50:37 +08:00
OG T
e447f97616 fix(telegram): wire up classify_notification + fix HostBackupFailed sending spurious buttons
Three issues fixed together:

1. Wire up the dead code classify_notification()
   - _push_decision_to_telegram() now calls classify_notification() first
   - TYPE-1 (informational only) → send_info_notification(), no buttons
   - TYPE-4D (Config Drift) → send_drift_card()
   - All other TYPE-2/3/4 → send_approval_card() (existing buttons)
   - decision_state + auto_executed are injected into proposal_data by the caller
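
The routing in item 1 can be pictured as a small dispatch table. The type labels and sender names come from the commit message. The classification criteria inside classify_notification below are invented for illustration and are not the repository's logic.

```python
from typing import Dict

# Notification type → sender name; anything not listed falls back to the approval card.
SENDER_BY_TYPE: Dict[str, str] = {
    "TYPE-1": "send_info_notification",
    "TYPE-4D": "send_drift_card",
}
DEFAULT_SENDER = "send_approval_card"


def classify_notification(proposal: dict) -> str:
    """Hypothetical classifier: drift first, then informational, else actionable."""
    if proposal.get("config_drift"):
        return "TYPE-4D"
    if proposal.get("suggested_action") in (None, "", "NO_ACTION"):
        return "TYPE-1"
    return "TYPE-2"


def pick_sender(proposal: dict) -> str:
    """Classify first, then choose which card to send."""
    return SENDER_BY_TYPE.get(classify_notification(proposal), DEFAULT_SENDER)
```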

2. alert_rules.yaml adds the host_backup_failed rule
   - HostBackupFailed / VeleroBackupFailed / VeleroBackupNotRun → NO_ACTION
   - No longer falls through to generic_fallback → no longer generates kubectl rollout restart deployment/backup

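3. _verify_k8s_deployment_exists() no longer conservatively lets host-layer alerts through
   - Alerts prefixed Host*/Docker*/Backup*/Velero*/SSH* → return False when K8s MCP is unavailable
   - _auto_execute() returns early without executing when it receives NO_ACTION or an empty kubectl_command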

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 20:35:48 +08:00
OG T
c6edfb5614 fix(flywheel): four-phase systematic fix for the AUTO_REPAIR NO_MATCH gap
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Phase 1: eliminate affected_services pollution at the root
  - webhooks.py: _extract_affected_services() extracts service names precisely from labels
    (component > job > pod deployment name > clean target_resource > [])
  - create_incident_for_approval: alert_labels are preserved in full in the Signal
  - alert_name is taken from alertname instead of "custom"
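
The precedence chain in Phase 1 (component > job > pod deployment name > clean target_resource > []) might be implemented along these lines. The pod-suffix regular expression and the rule that "unknown" is not a clean target are assumptions.

```python
import re
from typing import Dict, List

# Assumed Deployment pod naming: <deployment>-<replicaset hash>-<5-char suffix>.
_POD_SUFFIX_RE = re.compile(r"-[a-z0-9]{5,10}-[a-z0-9]{5}$")


def extract_affected_services(labels: Dict[str, str], target_resource: str = "") -> List[str]:
    """Return the first usable service name following the precedence chain."""
    for key in ("component", "job"):
        value = labels.get(key, "").strip()
        if value:
            return [value]
    pod = labels.get("pod", "").strip()
    if pod:
        # Strip the generated suffix to recover the deployment name.
        return [_POD_SUFFIX_RE.sub("", pod)]
    cleaned = target_resource.strip()
    if cleaned and cleaned.lower() != "unknown":
        return [cleaned]
    return []
```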

Phase 2: expand Playbook alertname variants
  - alert_rules.yaml: 5 rules add variants such as HostHighCpuLoad and KubePodCrashLooping
  - scripts/update_playbook_alert_variants.py: Redis index update already run

Phase 3: Jaccard exemption for generic Playbooks
  - similarity.py: affected_services=[] → exempt, scored 1.0 (infrastructure Playbooks do not target a specific service)
  - severity_range=[] → exempt, scored 1.0 (applies to all severities)
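
The exemption in Phase 3 is a short-circuit in front of the usual Jaccard ratio |A ∩ B| / |A ∪ B|. The function below is a minimal sketch. Its name and signature are assumptions, and how the score is combined with other similarity terms is not shown.

```python
from typing import Iterable


def jaccard_with_exemption(playbook_values: Iterable[str], incident_values: Iterable[str]) -> float:
    """Jaccard similarity, except that an empty playbook set matches everything."""
    playbook_set = set(playbook_values)
    if not playbook_set:
        # A generic playbook declares no specific services, so it is never penalized.
        return 1.0
    incident_set = set(incident_values)
    union = playbook_set | incident_set  # non-empty because playbook_set is non-empty
    return len(playbook_set & incident_set) / len(union)
```

Without the short-circuit, an empty playbook set compared against any non-empty incident set would score 0.0, which is presumably how generic playbooks ended up as NO_MATCH.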

Phase 4: persist Playbook embeddings (cold-start fix)
  - migrations/flywheel_playbook_embeddings.sql: pgvector persistence table
  - services/playbook_embedding_service.py: on startup, rebuild the Redis vector cache + sync the DB
  - main.py: run non-blocking via asyncio.create_task during lifespan startup

2026-04-10 Asia/Taipei — Claude Sonnet 4.6
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 11:04:56 +08:00
OG T
2554ac1e60 fix: E2E test alert recognition + Telegram notification of auto-repair results
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
**alert_rule_engine.py**
- _matches() adds instance_prefix matching (highest priority)
- match_rule() passes the instance label to _matches
- Purpose: e2e-final-* / e2e-test-* instances can be recognized by YAML rules

**alert_rules.yaml**
- Add the e2e_smoke_test rule (priority=120)
- alertname: E2E_SMOKE_TEST / instance_prefix: e2e-final- / e2e-test- / test-host
- suggested_action: NO_ACTION, displays "alert chain verified successfully"
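
The instance_prefix check can be sketched as a prefix test that runs before the alertname comparison. The rule dictionary mirrors the fields listed for e2e_smoke_test, while the function body and the dictionary layout are assumptions about how the engine represents a rule.

```python
from typing import Dict, List

# Field values taken from the e2e_smoke_test rule in the commit message.
E2E_SMOKE_TEST_RULE: Dict[str, List[str]] = {
    "alertname": ["E2E_SMOKE_TEST"],
    "instance_prefix": ["e2e-final-", "e2e-test-", "test-host"],
}


def matches(rule: Dict[str, List[str]], alertname: str, instance: str = "") -> bool:
    """Check instance prefixes first, then fall back to the alertname list."""
    prefixes = rule.get("instance_prefix", [])
    if instance and any(instance.startswith(prefix) for prefix in prefixes):
        return True
    return alertname in rule.get("alertname", [])
```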

**decision_manager.py**
- _auto_execute() sends a Telegram result notification on success
- _auto_execute() sends a Telegram failure notification on failure
- Add the _push_auto_repair_result() function

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 14:16:15 +08:00
OG T
b43e1f1818 feat(rules): L2-2 alerts-unified: add 14 Prometheus alert rules + target_down auto-repair
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
New rules:
- postgresql_down / postgresql_connection_pool / postgresql_slow_queries
- redis_down / ollama_down / minio_down / minio_disk_high / harbor_down
- k3s_node_down / awoooi_api_down / alert_chain_broken / nvidia_circuit_breaker

Fixes:
- target_down: kubectl_command changed from diagnosis to automatically restarting the exporter (docker restart / systemctl)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 11:49:28 +08:00
OG T
d1ede7f989 feat(openclaw): alert rule engine: alert_rules.yaml replaces hard-coded if/elif
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
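- Add alert_rules.yaml: 6 rules (docker/target_down/oom/cpu/5xx/crash) + a generic fallback
- Add alert_rule_engine.py: YAML loading, matching logic, variable filling
- openclaw.py _generate_mock_response: refactored to call the rule engine (v8.0)
- Adding a rule only requires editing the YAML and restarting the Pod; no code change needed
- 2026-04-09 ogt: architecture refactor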
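
To show what a data-driven engine replaces, the sketch below matches rules in list order, falls back to a generic rule, and fills template variables from alert labels. The rules are written as Python dicts to stay dependency-free (YAML loading is omitted). The rule contents, the first-match ordering, and the "unknown" default are assumptions.

```python
from typing import Dict, List

# Hypothetical rules; the real ones live in alert_rules.yaml.
RULES: List[dict] = [
    {"name": "oom", "alertname": ["ExampleOOM"],
     "kubectl_command": "kubectl describe pod {target}"},
    {"name": "crash", "alertname": ["ExampleCrashLoop"],
     "kubectl_command": "kubectl logs {target} --previous"},
]
FALLBACK_RULE = {"name": "generic_fallback", "alertname": [], "kubectl_command": ""}


class _DefaultUnknown(dict):
    """Mapping that substitutes "unknown" for any missing template variable."""

    def __missing__(self, key: str) -> str:
        return "unknown"


def match_rule(alertname: str, labels: Dict[str, str]) -> Dict[str, str]:
    """Pick the first rule listing the alertname, then fill in its command."""
    rule = next((r for r in RULES if alertname in r["alertname"]), FALLBACK_RULE)
    command = rule["kubectl_command"].format_map(_DefaultUnknown(labels))
    return {"rule": rule["name"], "kubectl_command": command}
```

Adding a rule then means appending an entry to the data rather than another if/elif branch, which is the property the commit message highlights.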

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 09:05:23 +08:00