Commit Graph

7 Commits

Author SHA1 Message Date
Your Name
61f5a6a419 fix(telegram): route alerts to SRE war room
Some checks failed
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
2026-04-30 15:01:23 +08:00
OG T
02a276127e fix(sensors+drift+repair-card): comprehensive fix for three node issues
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 1h1m39s
Fix 1: 7 of 8 sensors failing: expand SSH short host names (pre_decision_investigator.py)
  Root cause: the Prometheus instance label is "110:9100", so split(":")[0] yields "110";
        SSH_MCP_ALLOWED_HOSTS stores the full IP "192.168.0.110", so all 7 SSH tools failed
  Fix: add _SHORT_HOST_MAP ("110" → "192.168.0.110"), covering all four hosts
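
The short-name expansion in Fix 1 can be sketched roughly as follows. This is a minimal illustration, not the code in pre_decision_investigator.py: the function name resolve_ssh_host is hypothetical, and only the "110" entry of the map is known from the message above.

```python
# Only the "110" entry is given in the commit message; the other three hosts
# would be added the same way.
_SHORT_HOST_MAP = {"110": "192.168.0.110"}


def resolve_ssh_host(instance_label: str) -> str:
    """Turn a Prometheus instance label such as "110:9100" into an SSH host."""
    host = instance_label.split(":")[0]
    # Hosts that are not known short names are returned unchanged.
    return _SHORT_HOST_MAP.get(host, host)
```

With this shape, a label that already carries the full IP passes through untouched, so the lookup is safe to apply to every label.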

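Fix 2: Config Drift false positives: allowlist K8s default fields (drift_detector.py)
  Root cause: after kubectl rollout restart, the restartedAt annotation was flagged as "medium" drift;
        restartPolicy/dnsPolicy/terminationGracePeriodSeconds and other fields K8s fills in automatically were not allowlisted
  Fix: add 13 fields that K8s populates at runtime to _DEFAULT_ALLOWLIST_FIELDS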
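
One way the allowlist in Fix 2 might be applied is shown below. This is a sketch under assumptions: the set contains only the four fields named above (the real list has 13), and the dotted-path diff format and the filter_drift helper are hypothetical.

```python
import re

# Only the fields named in the commit message; the real allowlist has 13 entries.
_DEFAULT_ALLOWLIST_FIELDS = {
    "restartedAt",
    "restartPolicy",
    "dnsPolicy",
    "terminationGracePeriodSeconds",
}


def filter_drift(diff_paths: list[str]) -> list[str]:
    """Drop diff paths whose final segment is an allowlisted runtime field."""
    kept = []
    for path in diff_paths:
        # Split on "." and "/" so annotation keys such as
        # "kubectl.kubernetes.io/restartedAt" resolve to "restartedAt".
        leaf = re.split(r"[./]", path)[-1]
        if leaf not in _DEFAULT_ALLOWLIST_FIELDS:
            kept.append(path)
    return kept
```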

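Fix 3: repair request card showed useless content: the fallback now carries the real error context (failure_watcher.py)
  Root cause: when LLM analysis failed, root_cause = "規則引擎分類: K8S_ERROR" ("Rule engine classification: K8S_ERROR"), which carries no useful information
  Fix: the fallback is now "[K8S_ERROR] {operation_type} 在 {target_resource} 失敗\n錯誤:{error_message[:200]}" (in English: "[K8S_ERROR] {operation_type} failed on {target_resource}\nError: {error_message[:200]}")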
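
The fallback format in Fix 3 could look like the sketch below. The function name fallback_root_cause is hypothetical, and the English wording is a rendering of the template above rather than the literal string in failure_watcher.py.

```python
def fallback_root_cause(operation_type: str, target_resource: str, error_message: str) -> str:
    """Build a root-cause line from the real error when LLM analysis fails."""
    return (
        f"[K8S_ERROR] {operation_type} failed on {target_resource}\n"
        # Truncate to 200 characters, as in the template from the commit message.
        f"Error: {error_message[:200]}"
    )
```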

2026-04-16 ogt + Claude Sonnet 4.6 (Asia-Pacific)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 20:50:06 +08:00
OG T
67f437043a fix(prod): fix four fatal production bugs: outcome writes / OpenClaw / Telegram notifications / LLM rule display
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
1. decision_manager: remove the verification_result column from UPDATE incidents
   (the incidents table has no such column, so every outcome write failed with outcome_write_failed)

2. failure_watcher: get_openclaw_service → get_openclaw
   (wrong function name, so every OpenClaw analysis crashed with ImportError)

3. failure_watcher: tg.send_message → tg.send_notification
   (TelegramGateway has no send_message method, so repair notifications could not be sent)

4. decision_manager: expert_analyze now also provides the initial_diagnosis / diagnosis_description keys
   (openclaw.py reads these two keys, but expert_analyze only supplied matched_rule / description,
    so the LLM always saw Matched Rule=unknown and could not analyze correctly)
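
The key mismatch in item 4 can be illustrated with a small sketch. The helper name with_openclaw_keys is hypothetical, and the assumption that initial_diagnosis mirrors matched_rule and diagnosis_description mirrors description is inferred from the message, not confirmed by it.

```python
def with_openclaw_keys(expert_result: dict) -> dict:
    """Return a copy that also carries the keys the consumer reads."""
    enriched = dict(expert_result)
    # Assumed mapping: the consumer's keys mirror the producer's existing ones.
    enriched.setdefault("initial_diagnosis", expert_result.get("matched_rule", "unknown"))
    enriched.setdefault("diagnosis_description", expert_result.get("description", ""))
    return enriched
```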

2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific): emergency production fix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:41:31 +08:00
OG T
138a56a432 fix(api): Phase 18 P0 fixes - global circuit breaker + dry-run validation
Some checks failed
E2E Health Check / e2e-health (push) Has been cancelled
2026-03-31 Chief architect review requirements (91/100, conditional pass)

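P0-1 fix: global auto-repair circuit breaker (ADR-040)
- Integrate the check_global_repair_cooldown() pre-check
- Blacklist for stateful services (PostgreSQL/Redis/ClickHouse)
- Freeze when more than 5 repairs occur within a 15-minute window
- Call record_global_repair_action() after a successful repair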
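
The global breaker described in P0-1 could be sketched as a sliding window over recent repairs. This is an in-memory illustration only: the class GlobalRepairBreaker, its method names, and the lowercase blacklist matching are assumptions, and the real check_global_repair_cooldown() may be implemented differently.

```python
import time
from collections import deque

WINDOW_SECONDS = 15 * 60
MAX_REPAIRS = 5
STATEFUL_BLACKLIST = {"postgresql", "redis", "clickhouse"}


class GlobalRepairBreaker:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._events = deque()  # timestamps of successful repairs

    def check(self, service: str) -> bool:
        """Return True when an automatic repair may proceed."""
        if service.lower() in STATEFUL_BLACKLIST:
            return False
        cutoff = self._clock() - WINDOW_SECONDS
        # Discard repairs that have fallen out of the 15-minute window.
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        # Frozen once more than MAX_REPAIRS repairs sit inside the window.
        return len(self._events) <= MAX_REPAIRS

    def record(self) -> None:
        """Call after a successful repair."""
        self._events.append(self._clock())
```

Injecting the clock keeps the window logic testable without sleeping.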

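P0-2 fix: dry-run validation
- Verify the Deployment exists before restart_deployment
- Verify the Pod exists before delete_pod
- Return immediately on validation failure without running the dangerous operation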
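
The dry-run check in P0-2 amounts to verifying existence before acting. A minimal sketch, with the Kubernetes calls injected as plain callables so it stays self-contained; restart_deployment_safely and both callback parameters are hypothetical names, not the project's API.

```python
from typing import Callable


def restart_deployment_safely(
    name: str,
    deployment_exists: Callable[[str], bool],
    do_restart: Callable[[str], None],
) -> dict:
    """Restart a Deployment only after confirming that it exists."""
    if not deployment_exists(name):
        # Validation failed: return at once and never reach the restart call.
        return {"success": False, "error": f"Deployment {name} not found; nothing executed"}
    do_restart(name)
    return {"success": True, "error": None}
```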

Safety loop:
Global circuit breaker → per-resource cooldown → dry-run → execute → record

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-31 12:23:02 +08:00
OG T
d6f37853c5 feat(api): Phase 18.4 OpenClaw deep analysis integration
All checks were successful
E2E Health Check / e2e-health (push) Successful in 17s
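2026-03-31 Claude Code (approved by the Commander)

New features:
- _llm_analyze() integrates OpenClawService
  - Uses analyze_alert() for AI root cause analysis (RCA)
  - Integrates SignOz monitoring data
  - Supports token/cost tracking

- _map_severity_to_risk(): maps severity to risk level
  - critical/高 (high) → CRITICAL
  - warning/medium/中 (medium) → MEDIUM
  - anything else → LOW

- _extract_repair_action(): extracts an executable action from the AI recommendation
  - restart/重啟 (restart) → restart_deployment/restart_pod
  - clear/清理 (clean up)/cache → clear_cache
  - scale/擴展 (scale out) → scale_up (requires manual authorization)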
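
The two mappings listed above can be sketched as keyword lookups. This is an illustration under assumptions: substring matching, the check order, and the target_is_pod flag (used here to choose between restart_deployment and restart_pod) are not stated in the commit message.

```python
def map_severity_to_risk(severity: str) -> str:
    """Map a free-form severity string to a risk level."""
    s = severity.lower()
    if "critical" in s or "高" in s:
        return "CRITICAL"
    if "warning" in s or "medium" in s or "中" in s:
        return "MEDIUM"
    return "LOW"


def extract_repair_action(suggestion: str, target_is_pod: bool = False):
    """Return (action, needs_manual_authorization), or (None, False) if nothing matches."""
    text = suggestion.lower()
    if "restart" in text or "重啟" in text:
        return ("restart_pod" if target_is_pod else "restart_deployment", False)
    if any(keyword in text for keyword in ("clear", "清理", "cache")):
        return ("clear_cache", False)
    if "scale" in text or "擴展" in text:
        # Scaling is never executed automatically; a human must approve it.
        return ("scale_up", True)
    return (None, False)
```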

Closed-loop improvement:
Rule engine initial classification → OpenClaw AI deep analysis → more precise repair recommendations

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-31 12:14:54 +08:00
OG T
770586dd85 feat(api): Phase 18.3 K8s Executor integration for auto-repair
All checks were successful
E2E Health Check / e2e-health (push) Successful in 17s
2026-03-31 Claude Code (approved by the Commander)

New features:
- execute_auto_repair() performs the actual K8s operations
  - restart_deployment: rollout restart
  - restart_pod: delete the Pod to trigger re-creation
  - clear_cache: safely clear the Redis cache

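Safety mechanisms:
- _check_repair_cooldown(): prevents repair storms
  - At most 3 repairs of the same resource within 5 minutes
  - Exceeding the limit escalates to MEDIUM risk
  - Redis counter + automatic expiry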
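
The per-resource cooldown can be sketched with an in-memory stand-in for the Redis counter and its expiry. The RepairCooldown class and its allow() method are hypothetical; a Redis-backed version would typically increment a key and set a TTL when the key is first created.

```python
import time

COOLDOWN_SECONDS = 5 * 60
MAX_REPAIRS_PER_RESOURCE = 3


class RepairCooldown:
    """In-memory stand-in for a Redis counter with a TTL."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._counters = {}  # resource -> (count, expires_at)

    def allow(self, resource: str) -> bool:
        """Count one repair attempt; return False once the limit is exceeded."""
        now = self._clock()
        entry = self._counters.get(resource)
        if entry is None or now >= entry[1]:
            # No counter yet, or it expired: start a fresh 5-minute window.
            count, expires_at = 0, now + COOLDOWN_SECONDS
        else:
            count, expires_at = entry
        count += 1
        self._counters[resource] = (count, expires_at)
        return count <= MAX_REPAIRS_PER_RESOURCE
```

When allow() returns False, the caller would escalate the incident to MEDIUM risk instead of repairing automatically.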

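Full repair loop:
Execution failure → FailureWatcher → AI analysis → risk assessment
├─ LOW + within the cooldown limit → auto-repair → disclosure notification
└─ MEDIUM/CRITICAL or over the limit → request authorization via Telegram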
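
The branch at the end of the loop reduces to a single decision. A minimal sketch; the function name route_repair and the returned labels are hypothetical.

```python
def route_repair(risk: str, within_cooldown_limit: bool) -> str:
    """Pick the branch of the repair loop for a classified failure."""
    if risk == "LOW" and within_cooldown_limit:
        return "auto_repair_then_notify"
    # MEDIUM/CRITICAL, or a LOW-risk failure that exceeded the cooldown limit.
    return "request_telegram_authorization"
```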

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-31 12:10:52 +08:00
OG T
8e2d7c3706 feat(api): Phase 18.2 FailureWatcher closed loop for automatic failure repair
All checks were successful
E2E Health Check / e2e-health (push) Successful in 17s
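2026-03-31 Claude Code (approved by the Commander)

Added:
- IFailureWatcher Protocol (interfaces.py)
- FailureWatcherService, a failure-listening service
  - AI analysis of failure causes (rule engine + LLM deep analysis)
  - Risk level assessment (LOW/MEDIUM/CRITICAL)
  - Auto-repair for LOW risk (actual execution in Phase 18.3)
  - MEDIUM/CRITICAL pushed to Telegram to request authorization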

Integration:
- executor._write_audit_log() triggers FailureWatcher on failure
- Failure classification is written to AuditLog.failure_classification
- Auto-repair result is written to AuditLog.auto_repair_result
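
The integration above can be sketched with a typing.Protocol and a hook in the audit-log writer. Everything beyond the names IFailureWatcher, failure_classification and auto_repair_result is an assumption: the on_failure method, its parameters, the dict-based audit entry, and RecordingWatcher are illustrative only.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class IFailureWatcher(Protocol):
    # Hypothetical method; the real Protocol in interfaces.py may differ.
    def on_failure(self, operation_type: str, target_resource: str, error_message: str) -> dict: ...


class RecordingWatcher:
    """Toy watcher that records failures and returns a fixed classification."""

    def __init__(self):
        self.seen = []

    def on_failure(self, operation_type, target_resource, error_message):
        self.seen.append((operation_type, target_resource))
        return {"failure_classification": "K8S_ERROR", "auto_repair_result": None}


def write_audit_log(entry: dict, watcher: IFailureWatcher) -> dict:
    """Return the audit entry, enriched by the watcher when the operation failed."""
    if not entry.get("success", False):
        result = watcher.on_failure(
            entry["operation_type"], entry["target_resource"], entry.get("error_message", "")
        )
        entry = {**entry, **result}
    return entry
```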

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-31 12:01:56 +08:00