OG T
|
e447f97616
|
fix(telegram): 接通 classify_notification + 修復 HostBackupFailed 亂送按鈕
三個問題同時修復:
1. classify_notification() 死程式碼接通
- _push_decision_to_telegram() 現在先呼叫 classify_notification()
- TYPE-1 (純資訊) → send_info_notification(),無按鈕
- TYPE-4D (Config Drift) → send_drift_card()
- 其餘 TYPE-2/3/4 → send_approval_card()(原有按鈕)
- decision_state + auto_executed 從呼叫端注入 proposal_data
2. alert_rules.yaml 補 host_backup_failed 規則
- HostBackupFailed / VeleroBackupFailed / VeleroBackupNotRun → NO_ACTION
- 不再走 generic_fallback → 不再產生 kubectl rollout restart deployment/backup
3. _verify_k8s_deployment_exists() 主機層告警不再保守放行
- Host*/Docker*/Backup*/Velero*/SSH* 前綴告警 → K8s MCP 不可用時 return False
- _auto_execute() 收到 NO_ACTION 或空 kubectl_command → 早退,不執行
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-11 20:35:48 +08:00 |
|
OG T
|
c6edfb5614
|
fix(flywheel): 四階段系統性修復 AUTO_REPAIR NO_MATCH 斷層
CD Pipeline / build-and-deploy (push) Has been cancelled
Phase 1 — affected_services 污染根治
- webhooks.py: _extract_affected_services() 從 labels 精準萃取服務名
(component > job > pod deployment name > clean target_resource > [])
- create_incident_for_approval: alert_labels 完整保留進 Signal
- alert_name 從 alertname 取,不再用 "custom"
Phase 2 — Playbook alertname 變體擴充
- alert_rules.yaml: 5 條規則新增 HostHighCpuLoad、KubePodCrashLooping 等變體
- scripts/update_playbook_alert_variants.py: Redis index 已執行更新 ✅
Phase 3 — Jaccard 通用型 Playbook 豁免
- similarity.py: affected_services=[] → 1.0 豁免(基礎設施 Playbook 不針對特定服務)
- severity_range=[] → 1.0 豁免(適用所有嚴重度)
Phase 4 — Playbook Embedding 持久化(冷啟動修復)
- migrations/flywheel_playbook_embeddings.sql: pgvector 持久化表
- services/playbook_embedding_service.py: 啟動時重建 Redis 向量快取 + 同步 DB
- main.py: lifespan 啟動時 asyncio.create_task 非阻塞執行
2026-04-10 Asia/Taipei — Claude Sonnet 4.6
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-10 11:04:56 +08:00 |
|
OG T
|
2554ac1e60
|
fix: E2E test 告警識別 + 自動修復結果 Telegram 通知
CD Pipeline / build-and-deploy (push) Has been cancelled
**alert_rule_engine.py**
- _matches() 加入 instance_prefix 匹配(最高優先)
- match_rule() 傳入 instance label 至 _matches
- 用途: e2e-final-* / e2e-test-* instance 可被 YAML 規則識別
**alert_rules.yaml**
- 新增 e2e_smoke_test 規則 (priority=120)
- alertname: E2E_SMOKE_TEST / instance_prefix: e2e-final- / e2e-test- / test-host
- suggested_action: NO_ACTION,顯示「告警鏈路驗證成功」
**decision_manager.py**
- _auto_execute() 成功後發 Telegram 結果通知 ✅
- _auto_execute() 失敗後發 Telegram 失敗通知 ❌
- 新增 _push_auto_repair_result() 函數
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 14:16:15 +08:00 |
|
OG T
|
b43e1f1818
|
feat(rules): L2-2 alerts-unified — 補充 14 條 Prometheus 告警規則 + target_down 自動修復
CD Pipeline / build-and-deploy (push) Has been cancelled
新增規則:
- postgresql_down / postgresql_connection_pool / postgresql_slow_queries
- redis_down / ollama_down / minio_down / minio_disk_high / harbor_down
- k3s_node_down / awoooi_api_down / alert_chain_broken / nvidia_circuit_breaker
修正:
- target_down: kubectl_command 從診斷改為自動重啟 exporter (docker restart / systemctl)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 11:49:28 +08:00 |
|
OG T
|
d1ede7f989
|
feat(openclaw): 告警規則引擎 — alert_rules.yaml 取代硬編碼 if/elif
CD Pipeline / build-and-deploy (push) Has been cancelled
- 新增 alert_rules.yaml: 6 條規則 (docker/target_down/oom/cpu/5xx/crash) + 通用兜底
- 新增 alert_rule_engine.py: YAML 載入、匹配邏輯、變數填充
- openclaw.py _generate_mock_response: 重構為呼叫規則引擎 (v8.0)
- 新增規則只需修改 YAML,重啟 Pod 即可,不需改代碼
- 2026-04-09 ogt: 架構重構
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 09:05:23 +08:00 |
|