OG T
|
c6edfb5614
|
fix(flywheel): four-phase systematic fix for the AUTO_REPAIR NO_MATCH gap
CD Pipeline / build-and-deploy (push) Has been cancelled
Phase 1: root-cause fix for affected_services pollution
- webhooks.py: _extract_affected_services() extracts the service name precisely from labels
(component > job > pod deployment name > clean target_resource > [])
- create_incident_for_approval: alert_labels are preserved in full in the Signal
- alert_name is taken from alertname and no longer set to "custom"
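The fallback chain in Phase 1 can be sketched as follows. This is a minimal illustration, not the code in webhooks.py: the helper names, the pod-name regex, and the rules for what counts as a "clean" target_resource are all assumptions made for the example.

```python
import re
from typing import List, Mapping, Optional

# Assumption: pods created by a Deployment are named
# <deployment>-<replicaset hash>-<5-char suffix>.
_POD_NAME = re.compile(r"^(?P<deploy>.+)-[a-z0-9]{5,10}-[a-z0-9]{5}$")


def deployment_from_pod(pod: str) -> Optional[str]:
    """Return the deployment part of a pod name, or None if it does not fit."""
    m = _POD_NAME.match(pod)
    return m.group("deploy") if m else None


def clean_target_resource(value: str) -> Optional[str]:
    """Hypothetical cleaning rule: reject empty values, placeholders,
    and anything that looks like an address or path."""
    value = value.strip()
    if not value or value in {"unknown", "custom"}:
        return None
    if re.search(r"[\s:/]", value) or re.fullmatch(r"[\d.]+", value):
        return None
    return value


def extract_affected_services(labels: Mapping[str, str],
                              target_resource: str = "") -> List[str]:
    """Priority: component > job > pod deployment name > clean target_resource > []."""
    for key in ("component", "job"):
        value = labels.get(key, "").strip()
        if value:
            return [value]
    deploy = deployment_from_pod(labels.get("pod", ""))
    if deploy:
        return [deploy]
    cleaned = clean_target_resource(target_resource)
    return [cleaned] if cleaned else []
```

Returning an empty list when nothing usable is found keeps noisy values out of affected_services, which matters for the Phase 3 exemption.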
Phase 2: Playbook alertname variant expansion
- alert_rules.yaml: 5 rules gain variants such as HostHighCpuLoad and KubePodCrashLooping
- scripts/update_playbook_alert_variants.py: Redis index update has been run ✅
Phase 3: Jaccard exemption for generic Playbooks
- similarity.py: affected_services=[] → exempt, scores 1.0 (infrastructure Playbooks do not target a specific service)
- severity_range=[] → exempt, scores 1.0 (applies to all severities)
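A rough sketch of the Phase 3 exemption, assuming the service score is a plain Jaccard index over service-name sets and the severity score is a simple membership test. The function names and the severity scoring are illustrative assumptions; only the two "empty means 1.0" rules come from the commit message.

```python
from typing import Iterable


def service_similarity(playbook_services: Iterable[str],
                       incident_services: Iterable[str]) -> float:
    """Jaccard index, with an exemption for generic playbooks."""
    playbook = set(playbook_services)
    incident = set(incident_services)
    if not playbook:
        # Generic (infrastructure) playbook: not tied to any service.
        return 1.0
    # playbook is non-empty here, so the union is never empty.
    return len(playbook & incident) / len(playbook | incident)


def severity_similarity(severity_range: Iterable[str], severity: str) -> float:
    """Assumed scoring: 1.0 if the severity is covered, else 0.0."""
    allowed = set(severity_range)
    if not allowed:
        # Empty range means the playbook applies to every severity.
        return 1.0
    return 1.0 if severity in allowed else 0.0
```

Without the exemption, a playbook with an empty service list would score 0.0 against any incident that names a service, which is one way a generic playbook could end up as NO_MATCH.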
Phase 4: Playbook embedding persistence (cold-start fix)
- migrations/flywheel_playbook_embeddings.sql: pgvector persistence table
- services/playbook_embedding_service.py: rebuilds the Redis vector cache and syncs the DB at startup
- main.py: launched with asyncio.create_task during lifespan startup, non-blocking
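The non-blocking startup pattern in Phase 4 might look roughly like this. The sketch uses only the standard library; the web framework, the real rebuild routine, and the `state` dictionary are not shown in the log and are stand-ins here.

```python
import asyncio
from contextlib import asynccontextmanager


async def rebuild_embedding_cache(state: dict) -> None:
    """Hypothetical stand-in for the Redis/pgvector warm-up job."""
    await asyncio.sleep(0)  # placeholder for embedding and I/O work
    state["cache_ready"] = True
    state["done"].set()


@asynccontextmanager
async def lifespan(state: dict):
    # Schedule the rebuild without awaiting it, so startup is not delayed.
    task = asyncio.create_task(rebuild_embedding_cache(state))
    try:
        yield  # the application would start serving requests here
    finally:
        # On shutdown, cancel the job if it is still running and wait for it.
        if not task.done():
            task.cancel()
        await asyncio.gather(task, return_exceptions=True)


async def demo() -> tuple:
    state = {"cache_ready": False, "done": asyncio.Event()}
    async with lifespan(state):
        ready_at_startup = state["cache_ready"]  # rebuild has not run yet
        await state["done"].wait()               # let the background task finish
        ready_later = state["cache_ready"]
    return ready_at_startup, ready_later
```

Keeping a reference to the task and cleaning it up on shutdown avoids the task being garbage-collected mid-run or left dangling when the process exits.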
2026-04-10 Asia/Taipei — Claude Sonnet 4.6
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-10 11:04:56 +08:00 |
|
OG T
|
2554ac1e60
|
fix: E2E test alert recognition + Telegram notification of auto-repair results
CD Pipeline / build-and-deploy (push) Has been cancelled
**alert_rule_engine.py**
- _matches() adds instance_prefix matching (highest precedence)
- match_rule() passes the instance label to _matches
- Purpose: e2e-final-* / e2e-test-* instances can be recognized by the YAML rules
**alert_rules.yaml**
- Add e2e_smoke_test rule (priority=120)
- alertname: E2E_SMOKE_TEST / instance_prefix: e2e-final- / e2e-test- / test-host
- suggested_action: NO_ACTION, displays "Alert chain verified successfully"
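A small sketch of how instance_prefix matching could interact with rule priority. The rule dictionaries mirror the e2e_smoke_test fields listed above; the second rule, its action, and the assumption that a higher priority number wins are illustrative only.

```python
from typing import Mapping, Optional

RULES = [
    {"name": "e2e_smoke_test", "priority": 120,
     "alertname": ["E2E_SMOKE_TEST"],
     "instance_prefix": ["e2e-final-", "e2e-test-", "test-host"],
     "suggested_action": "NO_ACTION"},
    # Hypothetical lower-priority rule, present only for contrast.
    {"name": "cpu", "priority": 50,
     "alertname": ["HostHighCpuLoad"],
     "instance_prefix": [],
     "suggested_action": "INVESTIGATE"},
]


def matches(rule: Mapping, alertname: str, instance: str = "") -> bool:
    """Check the instance prefix first, then fall back to the alert name."""
    if instance and any(instance.startswith(p) for p in rule["instance_prefix"]):
        return True
    return alertname in rule["alertname"]


def match_rule(alertname: str, labels: Mapping[str, str]) -> Optional[Mapping]:
    """Return the first matching rule, highest priority first (assumed order)."""
    instance = labels.get("instance", "")
    for rule in sorted(RULES, key=lambda r: r["priority"], reverse=True):
        if matches(rule, alertname, instance):
            return rule
    return None
```

With this ordering, an alert fired from an e2e-test-* instance resolves to the smoke-test rule even when its alertname belongs to another rule, so test traffic never triggers a real repair action.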
**decision_manager.py**
- _auto_execute() sends a Telegram result notification on success ✅
- _auto_execute() sends a Telegram failure notification on failure ❌
- Add _push_auto_repair_result() function
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 14:16:15 +08:00 |
|
OG T
|
b43e1f1818
|
feat(rules): L2-2 alerts-unified - add 14 Prometheus alert rules + target_down auto-repair
CD Pipeline / build-and-deploy (push) Has been cancelled
New rules:
- postgresql_down / postgresql_connection_pool / postgresql_slow_queries
- redis_down / ollama_down / minio_down / minio_disk_high / harbor_down
- k3s_node_down / awoooi_api_down / alert_chain_broken / nvidia_circuit_breaker
Fixes:
- target_down: kubectl_command changed from diagnostics to automatically restarting the exporter (docker restart / systemctl)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 11:49:28 +08:00 |
|
OG T
|
d1ede7f989
|
feat(openclaw): alert rule engine - alert_rules.yaml replaces hardcoded if/elif
CD Pipeline / build-and-deploy (push) Has been cancelled
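- Add alert_rules.yaml: 6 rules (docker/target_down/oom/cpu/5xx/crash) + a generic fallback
- Add alert_rule_engine.py: YAML loading, matching logic, variable substitution
- openclaw.py _generate_mock_response: refactored to call the rule engine (v8.0)
- Adding a rule now only requires editing the YAML and restarting the Pod; no code changes needed
- 2026-04-09 ogt: architecture refactor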
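The data-driven shape described above (rules as data, a generic fallback, variable substitution) can be illustrated with a short sketch. The rule contents, the `{name}` placeholder syntax, and the function names are assumptions; the real alert_rules.yaml schema is not shown in this log, and the rules are inlined as Python data instead of being loaded from YAML.

```python
from typing import Mapping

# Hypothetical rules; in the real engine these would come from alert_rules.yaml.
RULES = [
    {"name": "target_down", "alertname": ["TargetDown"],
     "command": "docker restart {container}"},
]
FALLBACK = {"name": "generic", "alertname": [],
            "command": "echo 'no rule for {alertname}'"}


class _KeepMissing(dict):
    """Leave unknown placeholders untouched instead of raising KeyError."""

    def __missing__(self, key: str) -> str:
        return "{" + key + "}"


def select_rule(alertname: str) -> Mapping:
    """Replace an if/elif chain with a lookup over rule data."""
    for rule in RULES:
        if alertname in rule["alertname"]:
            return rule
    return FALLBACK


def render(rule: Mapping, labels: Mapping[str, str]) -> str:
    """Fill the rule's command template from the alert labels."""
    return rule["command"].format_map(_KeepMissing(labels))
```

Leaving unresolved placeholders visible makes a missing label obvious in the rendered command, so it is not silently replaced with an empty string.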
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 09:05:23 +08:00 |
|