awoooi

Author	SHA1	Message	Date
Your Name	7f200aff5f	fix(solver): inject alert labels so params templates are filled with real values [CI: CD Pipeline / build-and-deploy (push) successful in 10m45s] Root cause: the Solver LLM did not know the real namespace/pod/deployment/instance values, so the recommended_actions.params templates ({labels.namespace}, etc.) could not be filled → Telegram showed kubectl scale deployment --replicas= (blank). Fix: - solver.run() gains an incident_labels parameter - _build_prompt() lists the labels explicitly for the LLM to reference - orchestrator reads them from snapshot.alert_info.labels and passes them in Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-28 15:05:06 +08:00
Your Name	c3fa03fc19	fix(solver): set AGENT_SOLVER_TIMEOUT_SEC=80 + prompt forbids reflexive restarts [CI: CD Pipeline / build-and-deploy (push) cancelled] Problem 1: AGENT_SOLVER_TIMEOUT_SEC defaults to 20s and was not set in K8s → deepseek-r1:14b always times out → candidates=[] → action="" → Telegram shows "Pending analysis" + "Rule-based analysis". Problem 2: the Solver prompt's JSON example contained only restart + kubectl top, and the LLM imitated it → every alert was answered with a restart, whereas HostDisk/CPU alerts should prioritize diagnosis + cleanup. Fix: - K8s sets AGENT_SOLVER_TIMEOUT_SEC=80 (< OPENCLAW_TIMEOUT=120, leaving a buffer) - Solver prompt adds root-cause-to-fix rules: HostDisk→df/du/journalctl, CPU→top/ps, OOM→kubectl logs; "restart first" is forbidden - JSON example changed to a HostDisk SSH diagnosis scenario, no longer K8s commands only Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-27 15:51:42 +08:00
Your Name	9908fdf50d	feat(p3.1-t2-patha): DiagnosisAggregator Path A + Solver F4 critical reject + test alignment [CI: CD Pipeline / build-and-deploy (push) failing after 1m59s] Wave 8 P3.1-T2 PathA enablement + Solver F4 safety hardening + test alignment: PathA: DiagnosisAggregator signal-classification layer supplementing PDI: - ENABLE_DIAGNOSIS_AGGREGATOR default=False → True · PathA is a pure signal-classification layer (business logic such as OOMKilled/CrashLoop) · does not call the K8s/SigNoz APIs again (uses only the raw data PDI has already collected) · safe to default on: pure logic, no overlapping external dependencies - diagnosis_aggregator.py +155 lines (PathA implementation) - pre_decision_investigator.py already wired (commit `3a2cd151`) F4: Solver critical risk reject: - solver_agent.py: _validate_recommended_action rejects risk=critical · hard rule: critical actions must go through human approval and must not become Telegram buttons · log warning + return None (filtered out by _extract) - _extract_recommended_actions now returns a (list, status_str) tuple · status="ok"/"empty"/"all_invalid" lets the caller decide - protocol.py +16 / metrics.py +9 / ai_router.py +18: supporting metric + protocol field Test alignment: - test_solver_recommended_actions.py splits test_all_valid → low/medium/high accepted + test_critical_rejected - result tuple unpack: result, _ = _extract_recommended_actions(...) - test_diagnosis_aggregator_stub.py: feature flag default changed to True to align with PathA Tests: 51 passed (solver 28 + aggregator 16 + router fallback 8) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: Multiple Engineers (Wave 8 P3.1-T2 PathA + F4) <noreply@anthropic.com>	2026-04-27 14:42:29 +08:00
Your Name	7c726ebc1c	fix(b1): Solver Agent structured actions, North Star §1.1 remediation diversity ≥ 40% [CI: CD Pipeline / build-and-deploy (push) failing after 2m22s] Follow-up fix derived from INC-20260425: Solver refuses the rule-based mock fallback: Original design flaw: - on LLM failure → a rule-based mock pushed RESTART as the fallback - violates North Star §1.1: remediation diversity ≥ 40% (the same command must not be hard-coded) New design: - LLM failure → graceful degraded (candidates=[], recommended_actions=[], degraded=True) - rule-based mock / hard-coded RESTART is forbidden - new recommended_actions structured MCP action list · used to generate B3 Telegram buttons dynamically · driven by a YAML rule library, not hard-coded - new yaml + Path imports to load the action template library Backward compatibility: - existing candidates / blast_radius logic unchanged - the new recommended_actions field is an optional list Tests: 8 passed (all solver-related tests green) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: Claude Sonnet 4.6 (B1 North Star §1.1) <noreply@anthropic.com>	2026-04-27 08:18:38 +08:00
Your Name	fefe4c21cd	fix(inc-20260425): A1+A2 follow-up: Solver/Critic timeout + auto_repair wiring + Runbook + Grafana [CI: CD Pipeline / build-and-deploy (push) cancelled] Continues the `595629c0` INC-20260425 fix, completing the three Agent stages + end-to-end observability: A1 follow-up, wiring the Solver/Critic per-stage timeouts: - solver_agent.py: AGENT_SOLVER_TIMEOUT_SEC=20.0 (env override) - critic_agent.py: AGENT_CRITIC_TIMEOUT_SEC=15.0 (env override) - protocol.py: the three Agents share observe_agent_step() to wrap calls · success/timeout/error outcome label · histogram written to aiops_agent_step_duration_seconds A2 follow-up, auto_repair_service switched to _diagnose_fallback_chain: - auto_repair_service.py +46 lines: DIAGNOSE routing switched to the new chain (NEMO→GEMINI→CLAUDE) - completely avoids the second timeout caused by Ollama's 238s on CPU New metrics: - core/metrics.py +59 lines: histogram buckets + label cardinality to go with observe_agent_step New tests (862 lines): - test_agent_step_timeouts.py (475): timeout boundaries for each of the three Agents + outcome label - test_ai_router_diagnose_fallback.py (387): correct order of _diagnose_fallback_chain New supporting files: - docs/runbooks/RUNBOOK-AGENT-STEP-LATENCY.md (350): INC troubleshooting + observability guide - ops/monitoring/grafana/agent_step_latency_rules.yaml (160) · histogram alert rules for the three Agents (p99 > 80% of timeout → warning) Acceptance: 33 tests pass (test_agent_step_timeouts 22 + test_ai_router_diagnose_fallback 11) Total workload of the two INC-20260425 fixes (595629c0 + this commit): · 5 service/agent files modified · 1 new observability module · 4 new test/supporting files · 1372+187 = 1559 lines added Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: Claude Sonnet 4.6 (INC-20260425 follow-up) <noreply@anthropic.com>	2026-04-27 08:15:53 +08:00
Your Name	595629c013	fix(inc-20260425): A1 split the three Agent step timeouts + A2 remove Ollama from DIAGNOSE [CI: CD Pipeline / build-and-deploy (push) cancelled] INC-20260425-8D17BB / 3B6C39: AI confidence on two alerts dropped to 20%; dual root-cause fix (commander approved A+B): A1: split the three Agent step timeouts (North Star §1.2 Observable by Default): - diagnostician_agent.py: shared value PHASE2_STEP_TIMEOUT_SEC=20.0 → split into three · AGENT_DIAGNOSTICIAN_TIMEOUT_SEC=30.0 (main NIM consumer, largest prompt + multiple hypotheses) · AGENT_SOLVER_TIMEOUT_SEC=20.0 (wired in a later commit) · AGENT_CRITIC_TIMEOUT_SEC=15.0 (wired in a later commit) · env override supported; adjustable at runtime via K8s ConfigMap without a rebuild · PHASE2_STEP_TIMEOUT_SEC alias kept (DEPRECATED, to be removed next sprint) - observability/agent_step_metrics.py (58 lines), new module: · aiops_agent_step_duration_seconds Histogram · observe_agent_step() helper unifying the three Agents' call sites · outcome label ∈ {success, timeout, error} · agent label ∈ {diagnostician, solver, critic} A2: remove Ollama from the ai_router DIAGNOSE chain: - ai_router.py v4.4 by Claude Sonnet 4.6 · new _diagnose_fallback_chain: NEMO → GEMINI → CLAUDE · Ollama permanently excluded from this chain (measured at 238s on CPU only, so a second timeout is guaranteed) · new aiops_diagnose_fallback_total Prometheus metric - Root cause: after a NIM timeout, the fallback to Ollama deepseek-r1:14b took 238s on CPU → second timeout → degraded confidence=0.2 Wave8-X2 integration test correction: - test_ollama_failover_manager.py: TestSelectProvider adds a mock for _check_gemini_quota; the original test expected OFFLINE→Gemini, but once quota became fail-closed, without the mock it was switched to 188; with the quota check bypassed, the pure routing logic is verified → 37/37 PASS Tests: 37 passed (all of test_ollama_failover_manager) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: Claude Sonnet 4.6 (Wave 8 INC-20260425) <noreply@anthropic.com>	2026-04-27 08:15:10 +08:00
Your Name	cbd28e29a0	fix(solver+incident): two P0 configuration fixes - Gitea non-K8s filtering + backup alert age escalation [CI: CD Pipeline / build-and-deploy (push) successful in 8m57s] L3 fix summary (2026-04-25): [Fix 1] Gitea cross-boundary kubectl filtering (solver_agent.py) Root cause: the GiteaMemoryPressure alert triggered Solver → the LLM generated 'kubectl scale deployment gitea'; Gitea runs under docker-compose on the host, not in the awoooi-prod K8s namespace → execution was bound to fail Changes: - add the _filter_non_k8s_targets() function, which validates the target of scale/restart/delete/patch commands - add the _KUBECTL_MUTATING_VERBS / _KUBECTL_ROLLOUT_MUTATING_SUBVERBS constants - call _fetch_k8s_inventory() in _solve() to get the actual deployment list - post-filter: a candidate whose target is not in the inventory and whose verb is mutating → dropped + warning Expected behaviour: GiteaMemoryPressure → Solver now generates investigative kubectl (get/describe) rather than scale [Fix 2] HostBackupFailed misclassification escalation (incident_service.py + webhooks.py) Root cause: backup failures older than 24h were marked TYPE-1 (informational only), so a button-less card was sent silently and auto-remediation was not triggered Changes: - incident_service.py classify_alert_early() gains an age_hours parameter - add _BACKUP_AGE_UPGRADE_NAMES + _BACKUP_AGE_THRESHOLD_HOURS=24.0 - if alertname in (HostBackupFailed/Stale/Missing) and age > 24h → escalate to TYPE-3 - webhooks.py computes age_hours from alert.startsAt and passes it to classify_alert_early() Expected behaviour: HostBackupFailed 25h+ → escalated to TYPE-3, triggering LLM analysis + P0 auto-remediation suggestions Test results: - solver_agent: 35/35 tests PASSED ✅ - incident_service: 11/11 tests PASSED ✅ - incident_api integration: 7/7 tests PASSED ✅ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 09:48:04 +08:00
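Your Name	f9f2263c00	fix(execution-feedback): fix the three-layer P0 failure that completely broke automated execution feedback [CI: CD Pipeline / build-and-deploy (push) successful in 8m57s] Background: users reported that the execution status stayed stuck at "⚡ Executing..." and never reported back, which paralysed the auto-remediation mechanism (after the confidence fix, executions failed but no Telegram card notification could be pushed) L1: Post-verify AttributeError (2 places) - approval_execution.py:757, 1010 called the non-existent method IncidentService.get_incident() - correct methods: get_from_working_memory() with fallback to get_from_episodic_memory() - impact: the post-verify logic was silently swallowed by the exception, and the downstream Telegram push was completely stuck L2: Notification Provider not configured - new notifications/telegram.py: reuses the existing TelegramGateway.send_notification() - manager.py changed: registers TelegramWebhookProvider at initialization - impact: after execution finished no provider sent a push, so the result never appeared in Telegram L3: Solver Agent semantic synthesis generated incomplete commands - old logic: action_title="重啟服務" ("restart service") → synthesized "kubectl rollout restart deployment -n awoooi-prod" (name missing) - the downstream operation_parser could not parse it (the regex requires deployment/<name>) - fix: prefer extracting the target field from parsed; with no name, return [] and downgrade to read-only investigation commands - all tests pass: 35/35, including 11 new security tests - a blocked malicious kubectl_command now correctly falls through to the semantic-synthesis path - with no target name an empty list is returned, and incomplete commands are no longer generated - the Telegram execution-result push chain is now complete Expected effect - execution failure → an "❌ Execution failed" Telegram card is received immediately (L1 + L2 fixes) - automated decisions follow the whitelist, avoiding commands that cannot be executed (L3 fix) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 03:29:38 +08:00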
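Your Name	cc69f3ce04	fix(solver_agent): fix the AI confidence block + three-layer kubectl security defence [CI: CD Pipeline / build-and-deploy (push) cancelled] Fix A: restore AI decision confidence (0.5 → 0.9) - Solver Agent prefers the `kubectl_command` field from OpenClaw NIM (the full command) and skips the semantic-synthesis downgrade - the original 0.9 confidence is preserved and alert automation is restored - Root cause: the old version applied a min(0.9, 0.5) downgrade when action_title did not contain "kubectl" C1 (CRITICAL): ReDoS + injection defence - regex `\s` → `[ ]` so that newline characters (\n\r) cannot match (a shell-injection vector) - add `re.ASCII` and the bounded quantifier `{1,500}` to prevent exponential backtracking - performance improvement 7.256s → 0.015ms (48x faster) - explicitly reject \n \r \t \x00 C2 (CRITICAL): defence bypass + truncation attack - the action_title path now gets whitelist validation (the old version skipped it) - standard candidate path: validate, then truncate, to prevent truncation bypass - unsafe commands are automatically downgraded to semantic synthesis C3 (CRITICAL): unbounded-length DoS - new _KUBECTL_MAX_LEN = 500, a hard cap checked up front - prevents long inputs from causing regex timeouts Test coverage - 35 tests (24 regression + 11 new security tests) - full coverage of LF/CR/Tab/Null injection, shell metacharacters, ReDoS performance, and boundary conditions - double verification by Critic and vuln-verifier Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 03:02:58 +08:00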
Your Name	39f45dd305	fix(solver): add the missing import re (solver_agent already used re.compile but lacked the import) [CI: CD Pipeline / build-and-deploy (push) running] Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 02:42:25 +08:00
Your Name	7d1c85eb86	fix(hermes): ANTHROPIC_API_KEY injection + solver confidence fix A + 12-Agent governance docs [CI: CD Pipeline / build-and-deploy (push) cancelled] - nl_gateway.py: ClaudeAgentOptions injects ANTHROPIC_API_KEY (alias of CLAUDE_API_KEY) via env=, fixing the SDK failing to find the API key (the SDK reads ANTHROPIC_API_KEY, while the K8s secret is named CLAUDE_API_KEY) - solver_agent.py: fix A, a kubectl_command-first path, so that when OpenClaw Nemo returns a full command its confidence is no longer squashed by semantic synthesis (the 0.9 → min(0.5) bug), 9 tests pass - AGENTS.md: the Codex CLI counterpart of CLAUDE.md (used to start Codex sessions) - docs/12-agent-game-rules.md: 12-Agent task typing + owner/collaborator dispatch + mapping to 9 skills (v1.0) - .agents/skills/06-awoooi-monorepo-master.md: v1.6, adds a 12-agent collaboration governance chapter Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 02:33:43 +08:00
Your Name	0d81b28b1b	fix(aiops): bound phase2 timeout and repair incident links [CI: E2E Health Check / e2e-health (push) successful in 52s; CD Pipeline / build-and-deploy (push) successful in 9m24s]	2026-04-24 23:53:56 +08:00
Your Name	04ff22563e	fix(aiops-p1): Playbook learning loop, all 5 breakpoints fixed + DB migration (ADR-092 B4) [CI: run-migration / migrate (push) failing after 14s; CD Pipeline / build-and-deploy (push) failing after 2m7s] [P0.4 patch] pre_decision_investigator Prometheus query field missing - _build_tool_params() adds the "query" field (a required parameter of the prometheus_query tool) - new _build_prometheus_query(): generates PromQL by alert type (CPU/Memory/Crash/Disk/HTTP/Pod/fallback) - after the fix, the D3_METRICS sensory dimension actually gets data (previously 100% returned missing_query_parameter) [P1 Playbook learning loop, B1-B5 all fixed] - B2 db/models.py: ApprovalRecord gains the matched_playbook_id field + ix_approval_matched_playbook index - B2 db/models.py: TimelineEvent gains the incident_id field (for MCP auditing) + index - B3 approval_db.py: record→ApprovalRequest restores incident_id + matched_playbook_id - B4 approval_repository.py: same as B3 (the two conversion functions must stay in sync) - B5 approval_db.py: approval_request_to_record_data adds matched_playbook_id → only then can the DB store the value [P1.5 KM write] approval_execution.py: fire-and-forget → await wait_for(30s) - root cause: asyncio.create_task is killed on Pod recycle, so KM writes were silently lost - fix: await asyncio.wait_for(..., timeout=30.0) + TimeoutError log [Migration file] adr092_p1_learning_chain_fix.sql - ALTER TABLE approval_records ADD COLUMN matched_playbook_id VARCHAR(36) - ALTER TABLE timeline_events ADD COLUMN incident_id VARCHAR(64) - run: psql $DATABASE_URL -f apps/api/migrations/adr092_p1_learning_chain_fix.sql [Accompanying Agent changes] - decision_manager: Phase 2 YAML NO_ACTION priority gate (host-level/external services skip Agent Debate) - alert_rules.yaml: new rules for Sentry/ClickHouse + HostDiskUsageHigh/Critical - solver_agent: action_title semantic-synthesis fallback (replaces silent discard) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 15:41:35 +08:00
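OG T	cf50a5ce25	fix(solver+execution): Checkpoint-1 false-success fix + Checkpoint-2 K8s environment awareness [CI: CD Pipeline / build-and-deploy (push) successful in 10m55s] ## Checkpoint-1: eliminating false success - approval_execution.py: execute_approved_action now returns bool (it used to return None, so callers could not tell whether K8s accepted the command) - decision_manager.py auto-execute path: uses _exec_success instead of a hard-coded success=True Fix: when K8s rejects a command, ❌ is sent correctly instead of "✅ auto-remediation complete" ## Checkpoint-2: K8s environment awareness (Inventory Pre-flight) - solver_agent.py: new _fetch_k8s_inventory(): before generating kubectl commands, it first runs kubectl get deployments,statefulsets -n awoooi-prod and injects the list of real names into the Solver prompt; the LLM must choose from the list, which prevents hallucinations (awooiii-api with three i's) - on a 5s timeout or failure → returns "", the prompt shows a warning but the main flow is not interrupted Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 23:08:23 +08:00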
OG T	93205ceab0	fix(auto_approve+solver): P1 kubectl gate + P2 force kubectl on the Nemo path [CI: CD Pipeline / build-and-deploy (push) successful in 9m56s] P1 security hole (auto_approve.py): - new condition 1d: the action must contain the kubectl keyword to be auto-executed - Solver output via the OpenClaw Nemo path was natural language → passed condition 1c but could not be executed - fix: natural-language action → downgraded to human review (NO_PLAYBOOK reason) P2 execution obstacle (solver_agent.py): - Nemo-format path: action_title without kubectl → return [] → triggers _degraded_plan - _default_action_for_category: old natural-language text → real kubectl investigation commands - the degraded path now outputs read-only commands such as kubectl get/top/exec, which auto_approve 1d can evaluate correctly ADR-082: Phase 2 multi-Agent collaboration 2026-04-17 ogt + Claude Sonnet 4.6 (Asia-Pacific): P1+P2 hotfix Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 14:49:53 +08:00
OG T	e0bfcc7bd6	fix(phase5): fix the Solver action format, forcing kubectl command output [CI: CD Pipeline / build-and-deploy (push) failing after 9m33s] Root cause: the action example in _build_prompt() was "restart_service:awoooi-api" (a custom format), and the LLM imitated it, outputting natural-language descriptions instead of kubectl commands. Impact chain: Solver action = natural-language description → rejected by auto_approve Condition 1c (no kubectl keyword) → _auto_execute() never called → blast_radius_calculator never called → blast_radius_score fill rate = 0/14 = 0% (Phase 5 acceptance metric not met) Fix: 1. blast_radius references changed from abstract descriptions to real kubectl command examples 2. the action field is explicitly required to be a real kubectl command (no natural language) 3. correct example: kubectl rollout restart deployment/awoooi-api -n awoooi-prod Expected effect: the LLM outputs kubectl commands → auto_approve passes (in low blast_radius scenarios) → blast_radius_calculator is called → fill rate trends toward 100% 2026-04-17 ogt + Claude Sonnet 4.6 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 12:44:36 +08:00
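OG T	0388e50d0e	fix(p1-backlog): fix the "Pending analysis" deadlock and Telegram message truncation [CI: CD Pipeline / build-and-deploy (push) successful in 30m25s] Problem 1: REQUEST_REVISION → "Pending analysis" Root cause: safe_candidates=[] → selected=None → recommended_action=None → decision_manager action="" → the TG card shows "Pending analysis" (information flow broken) Fix in coordinator_agent.py: when there is no safe candidate, fall back to the Solver's original best plan, marked "[Not approved by Reviewer, for reference only] {action}"; SRE can always see the AI's suggestion and the information flow is never interrupted Problem 2: debate_summary was truncated in the middle of (blast_radius..., showing (bl Root cause: root_cause=reasoning[:150]; 150 characters is too short for a Chinese debate_summary Fix in decision_manager.py: root_cause truncation 150 → 300, suggested_action truncation 80 → 120 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 11:12:02 +08:00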
OG T	0077ff9758	fix(solver): pass the hypothesis as alert_context to OPENCLAW_NEMO [CI: CD Pipeline / build-and-deploy (push) cancelled] Root cause: solver called openclaw.call(prompt) without context → the nemo fallback treated prompt[:500] (the system description, "strategist Agent") as the signal description → the LLM returned a garbage plan description Fix: put top.description into alert_context.signals so that nemo sees the real root-cause hypothesis (same pattern as diagnostician, 7eb8375) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 22:51:30 +08:00
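OG T	7eb837567d	fix(diagnostician): fix the garbage root cause shown under "AI deep diagnosis" [CI: CD Pipeline / build-and-deploy (push) cancelled] Three-layer root-cause chain: 1. openclaw.call(prompt) passed no context 2. the OPENCLAW_NEMO fallback treated prompt[:500] (system description text) as the signal description 3. the Nemo LLM returned action_title="調查 AWOOOI SRE 系統的偵探 Agent" ("detective Agent investigating the AWOOOI SRE system", i.e. the task description) 4. _extract_hypotheses() used action_title as the root-cause hypothesis description → Telegram showed garbage Fix: - openclaw.call() gains an optional alert_context parameter, passed through to _call_with_fallback - diagnostician._analyze() builds alert_context (incident_id + evidence_summary as signal) → nemo, via the structured API, receives real sensor data rather than system description text - _extract_hypotheses() nemo format conversion: prefer reasoning (the why) over action_title (the what) as the hypothesis description, since reasoning is closer to root-cause analysis 2026-04-16 ogt + Claude Sonnet 4.6 (Taipei time zone) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 22:34:48 +08:00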
OG T	d294caf830	fix(solver): support the openclaw_nemo response format → convert to candidates format [CI: CD Pipeline / build-and-deploy (push) failing after 3m51s] In sync with diagnostician: openclaw_nemo returns action_title/risk_level/confidence, and solver's _extract_candidates() could not find the candidates key → empty plan → no_candidates Fix: when action_title is present, convert to the candidates format; risk_level → blast_radius mapping: critical=60, high=40, medium=25, low=10 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 14:34:50 +08:00
OG T	c27709d11b	fix(diagnostician): support the openclaw_nemo response format → lifts the blanket ABSTAIN Root cause: AI Router DIAGNOSE→openclaw_nemo returns the ClawBot format: {"action_title":"...","risk_level":"...","confidence":0.85} Diagnostician only parsed {"hypotheses":[...]} → always 0 hypotheses → ABSTAIN Fix: _extract_hypotheses() adds openclaw_nemo format detection and conversion: action_title→description, confidence→confidence, risk_level→category Impact: since 2026-04-15, every critical alert received was answered with ABSTAIN and no remediation action Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 14:32:32 +08:00
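OG T	7e3cc8b3b0	fix(agents): remove the artificial per-agent timeout; the full LLM response must be awaited [CI: CD Pipeline / build-and-deploy (push) cancelled] The original design's asyncio.wait_for(timeout_sec=25s) was an arbitrary cutoff: whenever the LLM exceeded the limit, the result was degraded to confidence=20% with no analysis at all. Correct approach: - remove the asyncio.wait_for() wrapper from all 4 agents - keep only except Exception to catch real failures (connection failure, model crash) - the whole flow is protected from hanging by the Orchestrator's GLOBAL_TIMEOUT_SEC=90s - the _PER_AGENT_TIMEOUT_SEC constant is deprecated and removed Impact: the LLM is given as long as inference takes, with no artificial cutoff, so models such as deepseek-r1:14b can output their full analysis. 2026-04-16 ogt + Claude Sonnet 4.6 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 02:54:34 +08:00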
OG T	14a02263ae	feat(Phase 4): proactive inspection + trend prediction + 8D sensory upgrade all complete [CI: CD Pipeline / build-and-deploy (push) failing after 12m32s] ## Phase 4 full delivery (ADR-084) ### New services - trend_predictor.py: numpy linear regression, 4h threshold-breach early warning, R² confidence score - proactive_inspector.py: proactive inspection coordinator running every 5 minutes - DynamicBaselineService (3σ deviation) - LogAnomalyDetector (new Drain3 patterns) - TrendPredictor (slope extrapolation, 4h forecast) - Shadow Mode + 30-minute dedup + Holt-Winters background retraining ### 8D sensory upgrade (EvidenceSnapshot Phase 4 enhancement) - PreDecisionInvestigator._collect_phase4_anomalies(): before a decision, reads ProactiveInspector's most recent inspection cache + LogAnomalyDetector's new patterns - EvidenceSnapshot.anomaly_context: new field, Phase 4 dynamic anomaly context - DiagnosticianAgent._build_prompt(): the prompt includes anomaly_context, so LLM RCA can reference dynamic-baseline deviations and trend warnings ### Database migration - incident_evidence: ADD COLUMN anomaly_context JSONB (idempotent) ### main.py - starts the run_proactive_inspector_loop() asyncio task 2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific): Phase 4 all complete Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 15:47:05 +08:00
OG T	42bc1df9f9	fix(phase2): verification found two security holes, now fixed [CI: CD Pipeline / build-and-deploy (push) cancelled] Found during manual verification: 1. reviewer_agent.py: the force-push regex only covered the word order "force push" and missed git's actual form "git push --force" (push first, --force/-f after) → fixed with a bidirectional pattern: (?:force.{0,5}push\|push.{0,30}(?:--force\|-f\b)).{0,30}main 2. coordinator_agent.py: a Critic critical challenge only applied a 0.3 penalty; when the original confidence was > 0.7 (e.g. 0.82), the value after the penalty was still above the 0.4 threshold, so the critical challenge passed through to the auto-execute path (verified: 0.82→0.52>0.4) → added a Critic REJECT hard gate (equivalent in force to Reviewer REJECT) that forces requires_human_approval=True before the penalty logic Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 13:48:55 +08:00
OG T	5ddba6d6e0	feat(adr-082): Phase 2 multi-Agent collaboration: skeleton of the 5-role debate system goes live. Adds 5 Agents + Orchestrator + DecisionManager wiring: - protocol.py: DiagnosisReport / ActionPlan / ReviewVerdict / CriticReport / DecisionPackage type system - DiagnosticianAgent: RCA root-cause analysis, confidence < 0.4 → ABSTAIN - SolverAgent: remediation strategist, blast_radius scoring + degraded rule-based mock - ReviewerAgent: security review, HARD_RULES static patterns + blast_radius thresholds (>50 revision, >80 reject) - CriticAgent: deliberate devil's advocate, enforces 3 critical-thinking questions, critical challenge → REJECT - CoordinatorAgent: pure rule-based aggregation, 6-level decision gate, REQUEST_REVISION → forced human review - AgentOrchestrator: 30s global timeout, Reviewer ‖ Critic in parallel, DB Immutable Event Sourcing + Redis Streams - DecisionManager: AIOPS_P2_ENABLED gate + _package_to_proposal_data bridging to the existing proposal_data format - AgentSession DB table + 4 composite indexes - ADR-082 decision record Gate 2 fixes (7 items): - CRITICAL: DELETE FROM regex lookahead in the wrong position (moved to after FROM) - CRITICAL: REQUEST_REVISION could reach the auto-execute path (changed back to requires_human_approval=True) - IMPORTANT: _extract_json flat regex did not support nested JSON (changed to find/rfind boundary extraction) - IMPORTANT: all_degraded omitted verdict.degraded (now covers all 4 Agents) - IMPORTANT: Solver ABSTAIN guard let degraded hypotheses through (changed to skip whether or not hypotheses exist) - IMPORTANT: dataclasses.asdict() left Enums unserialized, causing DB writes to fail silently (added a json.dumps default handler) - IMPORTANT: P2 gate read the attribute directly, bypassing the parent Phase guard (now uses is_phase_enabled(2)) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 13:48:55 +08:00
OG T	938df7f291	fix(api): remove all fake confidence scores, following feedback_confidence_truthfulness.md 🔴 Violation fix: rule matching/Expert System is not AI analysis, so confidence must = 0.0 Files fixed: - agents/action_planner.py: 0.9 → 0.0 - agents/blast_radius.py: 0.85/0.5/0.9 → 0.0 - agents/security.py: computed formula → 0.0 - signoz_webhook.py: 0.7 → 0.0 - auto_approve.py: default 0.5 → 0.0 - ci_auto_repair.py: entire computation function → return 0.0 - error_analyzer_service.py: default 0.5 → 0.0 - intent_classifier.py: computed formula → 0.0 - openclaw.py: default 0.5 → 0.0 - resource_resolver.py: 0.8 → 0.0 - k8s_naming.py: 0.9/0.7 → 0.0 Only a confidence returned by real LLM analysis may be > 0 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-03-29 16:00:46 +08:00
OG T	6f049877fc	fix(lint): ruff auto-fix + add lewooogo-core src to git - Python: ruff --fix fixed 280 lint errors - lewooogo-core: the src/ directory was untracked, causing CI eslint to fail Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-03-23 23:51:37 +08:00
OG T	7478dc0254	feat(phase6-9): Complete modular architecture and Agent Teams Phase 6.4 - Modular Architecture: - Add lewooogo-brain adapters for LLM providers - Add lewooogo-data dual memory (Redis + PostgreSQL) - Implement consensus engine for multi-agent decisions - Add incident memory service for historical context Phase 9 - Agent Teams (Claude Agent SDK): - Add base agent class with Claude Sonnet 4 integration - Implement action planner, blast radius, and security agents - Add agent API endpoints and proposal workflow - Integrate ADR-009 OpenClaw Agent Teams architecture DevOps & CI/CD: - Add GitHub Actions CI/CD workflows (ci.yaml, cd.yaml) - Add pre-commit hooks and secrets baseline - Add docker-compose for local development - Update Kubernetes network policies Frontend Improvements: - Add auto-healing error boundary component - Update i18n messages for agent features - Enhance dual-state incident card with execution feedback Documentation: - Add 7 ADRs covering MCP, design system, architecture decisions - Update ARCHITECTURE_MEMORY.md with modular design - Add GLOBAL_RULES.md and SOUL.md for project identity Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-03-23 18:40:36 +08:00
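
The kubectl defence described in commit cc69f3ce04 (a hard length cap checked first, explicit rejection of control characters, and a bounded ASCII pattern that matches a literal space instead of `\s`) can be sketched roughly as below. This is only an illustration under stated assumptions: the function name `is_safe_kubectl`, the allowed character class, and the exact pattern are hypothetical and are not taken from solver_agent.py. Only the 500-character cap, `re.ASCII`, the `{1,500}` bound, and the four rejected characters come from the commit message.

```python
import re

# From the commit message: hard upper bound, checked before any regex runs.
_KUBECTL_MAX_LEN = 500

# From the commit message: these characters are rejected outright.
_FORBIDDEN_CHARS = ("\n", "\r", "\t", "\x00")

# Hypothetical pattern: a literal space "[ ]" rather than "\s", so a newline
# can never satisfy the separator; one bounded character class keeps matching
# linear. Shell metacharacters (; | & $ `) are simply not in the class.
_KUBECTL_RE = re.compile(r"kubectl[ ][A-Za-z0-9 _./=:,@-]{1,500}", re.ASCII)


def is_safe_kubectl(command: str) -> bool:
    """Return True only if the command passes all three checks."""
    if len(command) > _KUBECTL_MAX_LEN:
        return False  # length DoS guard (C3 in the commit)
    if any(ch in command for ch in _FORBIDDEN_CHARS):
        return False  # control-character injection guard (C1)
    # fullmatch: the whole string must match, not just a prefix.
    return _KUBECTL_RE.fullmatch(command) is not None
```

Under this sketch, `kubectl rollout restart deployment/awoooi-api -n awoooi-prod` is accepted, while a command containing `;` or an embedded newline is rejected before it could reach a shell. A real validator would also need the verb whitelist that the commit mentions, which is omitted here.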

28 Commits
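
As an illustration of the format conversion described in commit d294caf830, the sketch below turns an openclaw_nemo-style response (action_title/risk_level/confidence) into a candidates list using the risk_level → blast_radius mapping stated in that commit. The function name `to_candidates`, the candidate field names, and the fallback for an unknown risk level are assumptions made for this example, not the actual `_extract_candidates()` code.

```python
from typing import Any

# Mapping quoted from the commit message.
_RISK_TO_BLAST_RADIUS = {"critical": 60, "high": 40, "medium": 25, "low": 10}

# Assumption: an unrecognised risk level is treated as the most severe one.
_UNKNOWN_RISK_BLAST_RADIUS = 60


def to_candidates(payload: dict[str, Any]) -> list[dict[str, Any]]:
    """Normalise either response shape into a list of candidate dicts."""
    # Native format: the payload already carries a candidates list.
    if "candidates" in payload:
        return list(payload["candidates"])

    # openclaw_nemo format: a single action described by action_title.
    title = payload.get("action_title")
    if not title:
        return []  # nothing usable; the caller decides how to degrade

    risk = str(payload.get("risk_level", "")).lower()
    return [
        {
            "action": title,  # hypothetical field name
            "blast_radius": _RISK_TO_BLAST_RADIUS.get(
                risk, _UNKNOWN_RISK_BLAST_RADIUS
            ),
            "confidence": float(payload.get("confidence", 0.0)),
        }
    ]
```

For example, `to_candidates({"action_title": "x", "risk_level": "high", "confidence": 0.85})` yields a single candidate with `blast_radius` 40, and an empty payload yields an empty list rather than a made-up action.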