Your Name
|
61f5a6a419
|
fix(telegram): route alerts to SRE war room
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
|
2026-04-30 15:01:23 +08:00 |
|
OG T
|
02a276127e
|
fix(sensors+drift+repair-card): comprehensive fix for three node issues
CD Pipeline / build-and-deploy (push) Successful in 1h1m39s
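Fix 1: sensors 7/8 failing: SSH host short-name expansion (pre_decision_investigator.py)
Root cause: the Prometheus instance label is "110:9100", so split(":")[0] yields "110"
SSH_MCP_ALLOWED_HOSTS stores the full IP "192.168.0.110" → all 7 SSH tools fail
Fix: add _SHORT_HOST_MAP, "110"→"192.168.0.110", covering all four hosts
Fix 2: Config Drift false positives: allowlist K8s default fields (drift_detector.py)
Root cause: after kubectl rollout restart, the restartedAt annotation is detected as "medium" drift
K8s auto-populated fields such as restartPolicy/dnsPolicy/terminationGracePeriodSeconds were not allowlisted
Fix: add 13 fields that K8s auto-populates at runtime to _DEFAULT_ALLOWLIST_FIELDS
Fix 3: repair request card content was useless: fallback now carries real error context (failure_watcher.py)
Root cause: when LLM analysis fails, root_cause = "規則引擎分類: K8S_ERROR" ("Rule engine classification: K8S_ERROR", no useful information at all)
Fix: fallback changed to "[K8S_ERROR] {operation_type} 在 {target_resource} 失敗\n錯誤:{error_message[:200]}" ("[K8S_ERROR] {operation_type} failed on {target_resource}", then "Error: {error_message[:200]}")
2026-04-16 ogt + Claude Sonnet 4.6 (Asia-Pacific)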
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
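The short-name expansion described in Fix 1 could look roughly like the sketch below. Only the "110" mapping is stated in the commit message; the function name `expand_host` and the fall-through behaviour for unmapped hosts are assumptions for illustration, not the code in pre_decision_investigator.py.

```python
from typing import Mapping

# Only the "110" entry is given in the commit message; the other three
# hosts are not listed there, so they are intentionally left out here.
_SHORT_HOST_MAP: Mapping[str, str] = {"110": "192.168.0.110"}


def expand_host(instance_label: str,
                short_host_map: Mapping[str, str] = _SHORT_HOST_MAP) -> str:
    """Turn a Prometheus instance label such as "110:9100" into a full host.

    Hypothetical helper: strips the port, then expands a known short name.
    Hosts that are not in the map are returned unchanged.
    """
    host = instance_label.split(":")[0]
    return short_host_map.get(host, host)
```

With this shape, a label that already carries the full IP passes through untouched, so the lookup against SSH_MCP_ALLOWED_HOSTS sees the same value in both cases.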
|
2026-04-16 20:50:06 +08:00 |
|
OG T
|
67f437043a
|
fix(prod): fix four fatal production bugs: outcome writes / OpenClaw / Telegram notifications / LLM rule display
CD Pipeline / build-and-deploy (push) Has been cancelled
1. decision_manager: remove the verification_result column from UPDATE incidents
(the incidents table has no such column → every outcome write failed with outcome_write_failed)
2. failure_watcher: get_openclaw_service → get_openclaw
(wrong function name → every OpenClaw analysis crashed with ImportError)
3. failure_watcher: tg.send_message → tg.send_notification
(TelegramGateway has no send_message method → repair notifications could not be sent)
4. decision_manager: expert_analyze now also provides the initial_diagnosis / diagnosis_description keys
(openclaw.py reads these two keys, but expert_analyze only provided matched_rule / description
→ the LLM always saw Matched Rule=unknown and could not analyze correctly)
2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific): emergency production fix
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
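Item 4 is a key-mismatch bug between a producer and a consumer of the same dict. A minimal sketch of one way to bridge it follows; the helper name `with_diagnosis_keys` and the pairing of matched_rule with initial_diagnosis (and description with diagnosis_description) are assumptions inferred from the commit message, not the actual decision_manager code.

```python
from typing import Any, Dict


def with_diagnosis_keys(expert_result: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the result that also carries the keys the reader expects.

    Hypothetical helper. Existing initial_diagnosis / diagnosis_description
    values are kept; missing ones are filled from matched_rule / description.
    """
    enriched = dict(expert_result)
    enriched.setdefault("initial_diagnosis",
                        expert_result.get("matched_rule", "unknown"))
    enriched.setdefault("diagnosis_description",
                        expert_result.get("description", ""))
    return enriched
```

Copying instead of mutating keeps the original result intact for any caller that still relies on the old key names.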
|
2026-04-15 21:41:31 +08:00 |
|
OG T
|
138a56a432
|
fix(api): Phase 18 P0 fixes - global circuit breaker + dry-run validation
E2E Health Check / e2e-health (push) Has been cancelled
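2026-03-31 Required by the chief architect review (91/100, conditional pass)
P0-1 fix: global auto-repair circuit breaker (ADR-040)
- Integrate check_global_repair_cooldown() as a pre-check
- Blocklist for stateful services (PostgreSQL/Redis/ClickHouse)
- Freeze when there are more than 5 repairs within a 15-minute window
- Call record_global_repair_action() after a successful repair
P0-2 fix: dry-run validation
- Verify the Deployment exists before restart_deployment
- Verify the Pod exists before delete_pod
- Return immediately on validation failure without executing the dangerous operation
Safety loop:
Global circuit breaker → per-resource cooldown → dry-run → execute → record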
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
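The global breaker in P0-1 can be pictured as a sliding-window counter plus a blocklist. The sketch below is an in-memory illustration under stated assumptions: the class `GlobalRepairBreaker`, its method names, the exact-match blocklist, and the injected clock are all hypothetical, and the real check_global_repair_cooldown() / record_global_repair_action() may be implemented quite differently.

```python
import time
from collections import deque
from typing import Callable, Deque

# Stateful services named in the commit message; exact-name matching is assumed.
STATEFUL_BLOCKLIST = frozenset({"postgresql", "redis", "clickhouse"})


class GlobalRepairBreaker:
    """Freeze auto-repair when more than max_repairs occur within the window."""

    def __init__(self, window_seconds: float = 15 * 60, max_repairs: int = 5,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window_seconds = window_seconds
        self.max_repairs = max_repairs
        self._clock = clock
        self._events: Deque[float] = deque()

    def allow(self, service: str) -> bool:
        """Pre-check: False for blocklisted services or when the breaker is open."""
        if service.lower() in STATEFUL_BLOCKLIST:
            return False
        now = self._clock()
        # Drop repair timestamps that have aged out of the window.
        while self._events and now - self._events[0] >= self.window_seconds:
            self._events.popleft()
        return len(self._events) <= self.max_repairs

    def record(self) -> None:
        """Call after a successful repair."""
        self._events.append(self._clock())
```

Injecting the clock keeps the window logic testable without sleeping; a production version would more likely keep the counter in shared storage so that every worker sees the same state.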
|
2026-03-31 12:23:02 +08:00 |
|
OG T
|
d6f37853c5
|
feat(api): Phase 18.4 OpenClaw deep analysis integration
E2E Health Check / e2e-health (push) Successful in 17s
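2026-03-31 Claude Code (approved by the commander)
New features:
- _llm_analyze() integrates OpenClawService
- Uses analyze_alert() for AI RCA analysis
- Integrates SignOz monitoring data
- Supports token/cost tracking
- _map_severity_to_risk(): maps severity → risk level
- critical/高 (high) → CRITICAL
- warning/medium/中 (medium) → MEDIUM
- anything else → LOW
- _extract_repair_action(): extracts an executable action from the AI recommendation
- restart/重啟 (restart) → restart_deployment/restart_pod
- clear/清理 (clean up)/cache → clear_cache
- scale/擴展 (scale out) → scale_up (requires human authorization)
Closed-loop reinforcement:
Rule engine initial classification → OpenClaw AI deep analysis → more precise repair recommendations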
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
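The two mapping helpers listed above amount to keyword lookups. The following is a sketch under assumptions: the commit does not say how restart_deployment and restart_pod are told apart, nor in what order the keywords are checked, so the "pod" test and the check order here are illustrative choices only.

```python
from typing import Optional


def map_severity_to_risk(severity: str) -> str:
    """Map an alert severity string to a risk level, per the table above."""
    s = severity.strip().lower()
    if s in ("critical", "高"):
        return "CRITICAL"
    if s in ("warning", "medium", "中"):
        return "MEDIUM"
    return "LOW"


def extract_repair_action(suggestion: str) -> Optional[str]:
    """Pick an executable action from free-text AI advice, or None.

    Assumption: a restart that mentions "pod" targets the Pod, otherwise
    the Deployment. scale_up still needs human authorization downstream.
    """
    text = suggestion.lower()
    if "restart" in text or "重啟" in text:
        return "restart_pod" if "pod" in text else "restart_deployment"
    if "clear" in text or "清理" in text or "cache" in text:
        return "clear_cache"
    if "scale" in text or "擴展" in text:
        return "scale_up"
    return None
```

Returning None for unrecognized advice leaves the decision to the caller instead of guessing an action.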
|
2026-03-31 12:14:54 +08:00 |
|
OG T
|
770586dd85
|
feat(api): Phase 18.3 K8s Executor integration for auto-repair
E2E Health Check / e2e-health (push) Successful in 17s
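2026-03-31 Claude Code (approved by the commander)
New features:
- execute_auto_repair() actually performs K8s operations
- restart_deployment: rollout restart
- restart_pod: delete the Pod to trigger recreation
- clear_cache: safely clear the Redis cache
Safety mechanisms:
- _check_repair_cooldown(): prevents repair storms
- At most 3 repairs of the same resource within 5 minutes
- Exceeding the limit escalates to MEDIUM risk
- Redis counter + automatic expiry
Full repair loop:
Execution failure → FailureWatcher → AI analysis → risk assessment
├─ LOW + within the cooldown limit → auto-repair → disclosure notification
└─ MEDIUM/CRITICAL or over the limit → request authorization via Telegram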
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
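The per-resource cooldown reads like the usual "increment a counter, set an expiry on first hit" pattern. The sketch below imitates that pattern with an in-memory dict in place of Redis so that it runs standalone; `RepairCooldown`, `try_acquire`, and `route` are hypothetical names, and the real _check_repair_cooldown() is not shown in the commit.

```python
import time
from typing import Callable, Dict, Tuple


class RepairCooldown:
    """Fixed-window counter: at most `limit` repairs per resource per window."""

    def __init__(self, limit: int = 3, window_seconds: float = 5 * 60,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._clock = clock
        # resource -> (count, expires_at); stands in for a Redis key with a TTL.
        self._counters: Dict[str, Tuple[int, float]] = {}

    def try_acquire(self, resource: str) -> bool:
        """Count one repair attempt; True while the resource is within the limit."""
        now = self._clock()
        count, expires_at = self._counters.get(resource, (0, 0.0))
        if now >= expires_at:
            # Counter expired (or never existed): start a fresh window.
            count, expires_at = 0, now + self.window_seconds
        count += 1
        self._counters[resource] = (count, expires_at)
        return count <= self.limit


def route(risk: str, within_limit: bool) -> str:
    """Mirror the two branches of the flow above."""
    if risk == "LOW" and within_limit:
        return "auto_repair"
    return "request_authorization"
```

A fixed window is simpler than a sliding one but permits a burst across a window boundary; whether that matters depends on how strictly repair storms need to be bounded.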
|
2026-03-31 12:10:52 +08:00 |
|
OG T
|
8e2d7c3706
|
feat(api): Phase 18.2 FailureWatcher closed loop for automatic failure repair
E2E Health Check / e2e-health (push) Successful in 17s
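2026-03-31 Claude Code (approved by the commander)
Added:
- IFailureWatcher Protocol (interfaces.py)
- FailureWatcherService, a failure-listening service
- AI analysis of failure causes (rule engine + LLM deep analysis)
- Risk level assessment (LOW/MEDIUM/CRITICAL)
- Auto-repair for LOW risk (actual execution lands in Phase 18.3)
- MEDIUM/CRITICAL pushed to Telegram to request authorization
Integration:
- executor._write_audit_log() triggers FailureWatcher on failure
- Failure classification is written to AuditLog.failure_classification
- Auto-repair result is written to AuditLog.auto_repair_result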
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
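To illustrate how a Protocol-typed watcher can be triggered from an audit-log write, here is a minimal sketch. The interface name IFailureWatcher comes from the commit; the method `on_failure`, its parameters, the `RecordingFailureWatcher` stub, the placeholder classification, and the free function `write_audit_log` are all assumptions for illustration and do not reflect the actual interfaces.py or executor code.

```python
from typing import Any, Dict, List, Protocol, runtime_checkable


@runtime_checkable
class IFailureWatcher(Protocol):
    """Structural interface: anything with a matching on_failure() qualifies."""

    def on_failure(self, operation_type: str, target_resource: str,
                   error_message: str) -> Dict[str, Any]: ...


class RecordingFailureWatcher:
    """Stub implementation that records each failure it is told about."""

    def __init__(self) -> None:
        self.seen: List[Dict[str, Any]] = []

    def on_failure(self, operation_type: str, target_resource: str,
                   error_message: str) -> Dict[str, Any]:
        record = {
            "operation_type": operation_type,
            "target_resource": target_resource,
            "error_message": error_message[:200],
            "classification": "UNCLASSIFIED",  # placeholder, no real rules here
        }
        self.seen.append(record)
        return record


def write_audit_log(success: bool, watcher: IFailureWatcher, *,
                    operation_type: str, target_resource: str,
                    error_message: str = "") -> Dict[str, Any]:
    """Build an audit entry; on failure, consult the watcher first."""
    entry: Dict[str, Any] = {"success": success, "failure_classification": None}
    if not success:
        result = watcher.on_failure(operation_type, target_resource, error_message)
        entry["failure_classification"] = result["classification"]
    return entry
```

Because the Protocol is structural, the executor depends only on the method shape, which makes it straightforward to swap in a test double like the recording stub.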
|
2026-03-31 12:01:56 +08:00 |
|