awoooi

Author	SHA1	Message	Date
Your Name	61f5a6a419	fix(telegram): route alerts to SRE war room [Checks failed: CD Pipeline / tests, CD Pipeline / build-and-deploy, CD Pipeline / post-deploy-checks, and Code Review / ai-code-review (push) were cancelled]	2026-04-30 15:01:23 +08:00
OG T	02a276127e	fix(sensors+drift+repair-card): fix three node issues across the board [All checks successful: CD Pipeline / build-and-deploy (push) succeeded in 1h1m39s] Fix 1: sensors 7/8 failing; expand SSH host short names (pre_decision_investigator.py). Root cause: the Prometheus instance label is "110:9100", so split(":")[0] = "110", while SSH_MCP_ALLOWED_HOSTS stores the full IP "192.168.0.110", so all 7 SSH tools failed. Fix: added _SHORT_HOST_MAP ("110" → "192.168.0.110"), covering all four hosts. Fix 2: Config Drift false positives; K8s default fields added to the allowlist (drift_detector.py). Root cause: after kubectl rollout restart, the restartedAt annotation was detected as "medium" drift, and fields that K8s fills in automatically (restartPolicy, dnsPolicy, terminationGracePeriodSeconds, etc.) were not allowlisted. Fix: added 13 fields that K8s fills in automatically at runtime to _DEFAULT_ALLOWLIST_FIELDS. Fix 3: junk content in repair-request cards; the fallback now carries real error context (failure_watcher.py). Root cause: when LLM analysis failed, root_cause = "Rule engine classification: K8S_ERROR" (no useful information at all). Fix: fallback changed to "[K8S_ERROR] {operation_type} failed on {target_resource}\nError: {error_message[:200]}". 2026-04-16 ogt + Claude Sonnet 4.6 (Asia-Pacific) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 20:50:06 +08:00
OG T	67f437043a	fix(prod): fix four production-critical bugs: outcome writes / OpenClaw / Telegram notifications / LLM rule display [Checks failed: CD Pipeline / build-and-deploy (push) was cancelled] 1. decision_manager: removed the verification_result column from UPDATE incidents (the incidents table has no such column, so every outcome write failed with outcome_write_failed). 2. failure_watcher: get_openclaw_service → get_openclaw (wrong function name, so every OpenClaw analysis crashed with ImportError). 3. failure_watcher: tg.send_message → tg.send_notification (TelegramGateway has no send_message method, so repair notifications could not be sent). 4. decision_manager: expert_analyze now also supplies the initial_diagnosis / diagnosis_description keys (openclaw.py reads these two keys, but expert_analyze only provided matched_rule / description, so the LLM always saw Matched Rule=unknown and could not analyze correctly). 2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific): emergency production fix Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 21:41:31 +08:00
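OG T	138a56a432	fix(api): Phase 18 P0 fixes: global circuit breaker + dry-run validation [Checks failed: E2E Health Check / e2e-health (push) was cancelled] 2026-03-31 Required by the chief architect's review (91/100, conditional pass). P0-1 fix: global auto-repair circuit breaker (ADR-040): integrates check_global_repair_cooldown() as a pre-check; blacklists stateful services (PostgreSQL/Redis/ClickHouse); freezes auto-repair when more than 5 repairs occur within a 15-minute window; calls record_global_repair_action() after a successful repair. P0-2 fix: dry-run validation: verifies the Deployment exists before restart_deployment; verifies the Pod exists before delete_pod; returns immediately on validation failure without executing the dangerous operation. Safety loop: global circuit breaker → per-resource cooldown → dry-run → execute → record. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-03-31 12:23:02 +08:00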
OG T	d6f37853c5	feat(api): Phase 18.4 OpenClaw deep-analysis integration [All checks successful: E2E Health Check / e2e-health (push) succeeded in 17s] 2026-03-31 Claude Code (approved by the Commander). New features: _llm_analyze() integrates OpenClawService: uses analyze_alert() for AI RCA analysis; integrates SignOz monitoring data; supports token/cost tracking. _map_severity_to_risk() maps severity to risk level: critical/高 ("high") → CRITICAL; warning/medium/中 ("medium") → MEDIUM; anything else → LOW. _extract_repair_action() extracts an executable action from the AI recommendation: restart/重啟 ("restart") → restart_deployment/restart_pod; clear/清理 ("clean up")/cache → clear_cache; scale/擴展 ("scale out") → scale_up (requires human authorization). Loop reinforcement: initial classification by the rule engine → OpenClaw AI deep analysis → more precise repair recommendations. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-03-31 12:14:54 +08:00
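OG T	770586dd85	feat(api): Phase 18.3 K8s Executor integration for auto-repair [All checks successful: E2E Health Check / e2e-health (push) succeeded in 17s] 2026-03-31 Claude Code (approved by the Commander). New features: execute_auto_repair() actually executes K8s operations: restart_deployment performs a rollout restart; restart_pod deletes the Pod to trigger recreation; clear_cache safely clears the Redis cache. Safety mechanism: _check_repair_cooldown() prevents repair storms: the same resource is repaired at most 3 times within 5 minutes; exceeding the limit escalates the risk to MEDIUM; implemented with a Redis counter that expires automatically. Complete repair loop: execution failure → FailureWatcher → AI analysis → risk assessment; LOW and within the cooldown limit → auto-repair → disclosure notification; MEDIUM/CRITICAL or over the limit → authorization request via Telegram. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-03-31 12:10:52 +08:00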
OG T	8e2d7c3706	feat(api): Phase 18.2 FailureWatcher closed loop for automatic failure repair [All checks successful: E2E Health Check / e2e-health (push) succeeded in 17s] 2026-03-31 Claude Code (approved by the Commander). Added: IFailureWatcher Protocol (interfaces.py); FailureWatcherService, a failure-listening service; AI analysis of failure causes (rule engine + LLM deep analysis); risk level assessment (LOW/MEDIUM/CRITICAL); automatic repair for LOW risk (actual execution arrives in Phase 18.3); MEDIUM/CRITICAL pushed to Telegram to request authorization. Integration: executor._write_audit_log() triggers FailureWatcher on failure; the failure classification is written to AuditLog.failure_classification; the auto-repair result is written to AuditLog.auto_repair_result. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-03-31 12:23:02 +08:00

7 Commits
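
The safety chain described in commits 770586dd85 and 138a56a432 (a per-resource cooldown of at most 3 repairs in 5 minutes, plus a global breaker over a 15-minute window) could be sketched roughly as below. This is an in-memory illustration only: the commit messages say the real counters live in Redis with automatic expiry, and the names SlidingWindowLimiter, may_auto_repair, and record_repair are hypothetical, not taken from the repository. The global threshold is read here as "allow up to 5 repairs per window", which is one interpretation of "more than 5 freezes".

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_actions` recorded actions per key in `window_seconds`."""

    def __init__(self, window_seconds: float, max_actions: int) -> None:
        self.window_seconds = window_seconds
        self.max_actions = max_actions
        # key -> timestamps (seconds) of recorded actions, oldest first
        self._events = defaultdict(deque)

    def _prune(self, key: str, now: float) -> None:
        # Drop timestamps that have aged out of the window.
        events = self._events[key]
        while events and now - events[0] >= self.window_seconds:
            events.popleft()

    def allowed(self, key: str, now: float) -> bool:
        self._prune(key, now)
        return len(self._events[key]) < self.max_actions

    def record(self, key: str, now: float) -> None:
        self._prune(key, now)
        self._events[key].append(now)


# Thresholds quoted in the commit messages; the wiring itself is an assumption.
per_resource = SlidingWindowLimiter(window_seconds=5 * 60, max_actions=3)
global_breaker = SlidingWindowLimiter(window_seconds=15 * 60, max_actions=5)


def may_auto_repair(resource: str, now: float) -> bool:
    """Global breaker first, then the per-resource cooldown."""
    return global_breaker.allowed("global", now) and per_resource.allowed(resource, now)


def record_repair(resource: str, now: float) -> None:
    """Count a completed repair against both limiters."""
    global_breaker.record("global", now)
    per_resource.record(resource, now)
```

Passing the clock in as `now` keeps the sketch deterministic to test; a caller would typically pass `time.monotonic()`. When `may_auto_repair` returns False, the flow in the log would fall through to the Telegram authorization request rather than silently dropping the repair.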