awoooi

Author	SHA1	Message	Date
Your Name	d0e98192de	fix(ai): keep openclaw before gemini in alert fallback All checks were successful Code Review / ai-code-review (push) Successful in 10s Details CD Pipeline / tests (push) Successful in 1m9s Details CD Pipeline / build-and-deploy (push) Successful in 3m28s Details CD Pipeline / post-deploy-checks (push) Successful in 1m19s Details	2026-05-06 20:20:58 +08:00
Your Name	1a1ab0df6e	fix(ai): route alerts through openclaw before gemini All checks were successful Code Review / ai-code-review (push) Successful in 11s Details CD Pipeline / tests (push) Successful in 1m5s Details CD Pipeline / build-and-deploy (push) Successful in 3m42s Details CD Pipeline / post-deploy-checks (push) Successful in 1m36s Details	2026-05-06 20:11:24 +08:00
Your Name	2ef54ccc94	fix(ai): enforce ollama first for drift governance All checks were successful Code Review / ai-code-review (push) Successful in 16s Details CD Pipeline / tests (push) Successful in 1m17s Details CD Pipeline / build-and-deploy (push) Successful in 4m54s Details CD Pipeline / post-deploy-checks (push) Successful in 3m10s Details	2026-05-06 19:26:09 +08:00
Your Name	4111ea4f9f	fix(ai): remove 188 ollama provider All checks were successful Code Review / ai-code-review (push) Successful in 12s Details CD Pipeline / tests (push) Successful in 1m13s Details CD Pipeline / build-and-deploy (push) Successful in 3m36s Details CD Pipeline / post-deploy-checks (push) Successful in 1m20s Details	2026-05-06 14:34:48 +08:00
Your Name	9ef9633aff	fix(alerts): bypass proxy timeout for GCP Ollama	2026-05-06 08:55:14 +08:00
Your Name	2aa31c205a	fix(ai): require 111 before alert cloud fallback All checks were successful CD Pipeline / tests (push) Successful in 54s Details Code Review / ai-code-review (push) Successful in 10s Details CD Pipeline / build-and-deploy (push) Successful in 3m21s Details CD Pipeline / post-deploy-checks (push) Successful in 2m2s Details	2026-05-06 00:05:51 +08:00
Your Name	e208798531	fix(ai): keep GCP Ollama lane on safe models All checks were successful CD Pipeline / tests (push) Successful in 54s Details Code Review / ai-code-review (push) Successful in 14s Details CD Pipeline / build-and-deploy (push) Successful in 3m25s Details CD Pipeline / post-deploy-checks (push) Successful in 1m50s Details	2026-05-05 23:37:33 +08:00
Your Name	bf847ad045	fix(ai): stabilize GCP Ollama alert lane Some checks failed Code Review / ai-code-review (push) Successful in 10s Details CD Pipeline / build-and-deploy (push) Has been cancelled Details CD Pipeline / post-deploy-checks (push) Has been cancelled Details CD Pipeline / tests (push) Has been cancelled Details	2026-05-05 22:20:27 +08:00
Your Name	ee5e3bc94f	fix(openclaw): gate alert cloud fallback behind flag Some checks failed Code Review / ai-code-review (push) Successful in 27s Details CD Pipeline / tests (push) Successful in 5m17s Details CD Pipeline / build-and-deploy (push) Failing after 5m35s Details CD Pipeline / post-deploy-checks (push) Has been skipped Details	2026-05-05 20:54:47 +08:00
Your Name	2ff0ef3bb6	fix(openclaw): route legacy ollama through failover endpoints Some checks failed CD Pipeline / tests (push) Failing after 1m49s Details CD Pipeline / build-and-deploy (push) Has been skipped Details CD Pipeline / post-deploy-checks (push) Has been skipped Details Code Review / ai-code-review (push) Successful in 24s Details	2026-05-05 13:55:52 +08:00
Your Name	b0da6da1e9	feat(aiops): structure agent loop shadow output Some checks failed CD Pipeline / tests (push) Successful in 2m50s Details Code Review / ai-code-review (push) Successful in 33s Details CD Pipeline / build-and-deploy (push) Failing after 25m48s Details CD Pipeline / post-deploy-checks (push) Has been cancelled Details	2026-05-01 15:09:57 +08:00
Your Name	f8e44971c1	feat(aiops): enable read-only agent loop canary All checks were successful CD Pipeline / tests (push) Successful in 1m43s Details Code Review / ai-code-review (push) Successful in 31s Details CD Pipeline / build-and-deploy (push) Successful in 10m22s Details CD Pipeline / post-deploy-checks (push) Successful in 4m3s Details	2026-05-01 14:20:16 +08:00
Your Name	9db87f177e	fix(aiops): suppress repeated llm alert loops Some checks failed CD Pipeline / tests (push) Successful in 1m37s Details Code Review / ai-code-review (push) Successful in 28s Details CD Pipeline / post-deploy-checks (push) Has been cancelled Details CD Pipeline / build-and-deploy (push) Has been cancelled Details	2026-05-01 13:02:07 +08:00
Your Name	d845d53257	fix(security): keep Gemini key out of request URLs All checks were successful CD Pipeline / build-and-deploy (push) Successful in 15m5s Details	2026-04-29 22:56:12 +08:00
Your Name	fe2b8f4571	fix(flywheel): fallback on OpenClaw degraded responses All checks were successful CD Pipeline / build-and-deploy (push) Successful in 9m56s Details	2026-04-29 22:38:57 +08:00
Your Name	dccdcdbaf5	fix(flywheel): unblock action safety and Claude fallback All checks were successful CD Pipeline / build-and-deploy (push) Successful in 9m45s Details	2026-04-29 21:51:18 +08:00
Your Name	fb0c72db42	feat(ai-router): overturn iron rule A2: DIAGNOSE primary switches to local-first Ollama Some checks failed CD Pipeline / build-and-deploy (push) Failing after 2m26s Details Commander's iron rule 2026-04-29: "Prefer Ollama on host 111 as the primary" + feedback_ai_autonomous_direction.md: rely mainly on local free LLMs + feedback_ollama_111_only.md: the only Ollama host is 111 ## Factual basis for overturning A2 (2026-04-27 INC-20260425) Old fact: Ollama = CPU-only deepseek-r1:14b @ 238s (unusable) New fact: prod Ollama 111 = M1 Pro Apple Silicon GPU + qwen2.5:7b-instruct fully loaded in 8.2GB VRAM, ctx 32k, measured 0.54s on a "hi" prompt All cloud providers are down (evidence from the 2026-04-29 prod log): - OpenClaw 188:8088 → 500 Internal Server Error - Gemini → 429 Too Many Requests (quota exhausted) - Claude → 404 Not Found (model claude-3-haiku-20240307 expired) Without overturning A2 → 100% of incidents end in llm_failed → AI auto-repair never starts ## Scope of changes (minimal, safe, verifiable) ### ai_router.py - `_diagnose_fallback_chain`: OLLAMA in first position (replaces the old "permanently excluded" comment) Order: [OLLAMA, OPENCLAW_NEMO, GEMINI, CLAUDE] - `_intent_provider_overrides[DIAGNOSE]`: OPENCLAW_NEMO → OLLAMA - _full_fallback_chain untouched (to avoid affecting RESTART/SCALE/CONFIG/DELETE) - _tool_calling_fallback_chain untouched - complexity_map untouched (critic M2 deferred) ### openclaw.py - Inject task_type="diagnose" into alert_context (critic C2, the true root cause) - Fix the timeout alignment issue at ai_providers/ollama.py:77: - with task_type → OLLAMA_DIAGNOSE_TIMEOUT_SECONDS=200s - without → OPENCLAW_TIMEOUT=30s (not enough for qwen2.5:7b inference) - root cause of the latency_ms=120014 seen in the prod log - Copy with dict(alert_context) so the original context is not polluted ## Regression tests updated in sync (5) All A2 guard tests now reflect the new rule: - test_p0_diagnose_routing.py::test_diagnose_override_is_ollama (formerly test_diagnose_override_is_openclaw_nemo) - test_ai_router_diagnose_fallback.py::test_diagnose_fallback_chain_ollama_primary (formerly test_diagnose_fallback_chain_no_ollama) - test_ai_router_diagnose_fallback.py::test_diagnose_route_primary_is_ollama (formerly test_diagnose_route_fallback_chain_excludes_ollama) - test_ai_router_diagnose_fallback.py::test_diagnose_route_sync_primary_is_ollama (formerly test_diagnose_route_sync_fallback_chain_excludes_ollama) - test_ai_router_diagnose_fallback.py::test_build_fallback_chain_for_intent_diagnose_with_ollama_primary (formerly test_build_fallback_chain_for_intent_diagnose_no_ollama) - test_ai_router_failover_integration.py::test_router_uses_failover_for_diagnose_ollama_primary (formerly test_router_does_not_use_failover_for_openclaw_nemo) Each test docstring records the historical context and the reason for the overturn. ## Verification - 1608 unit tests all green - 16 LLM-path tests all green (including the 6 updated A2 guard tests) - complexity_scorer / failover_manager / intent_classifier unaffected ## Expected prod behavior (to verify after deploy) incident arrives → DIAGNOSE intent → primary OLLAMA (qwen2.5:7b on M1 Pro GPU) fall back only on failure → OpenClaw 188 → Gemini → Claude Ollama uses a 200s timeout (the previous 30s was not enough) → AI auto-repair can finally start, no more 100% llm_failed ## Known debt (to handle later) - models.json:21 ollama.default is still deepseek-r1:14b (critic C1, but prod already auto-routes to the model actually loaded) - complexity 4/5 still hard-codes gemini/claude (critic M2) - Gemini API key appears in plaintext in the prod log (needs rotation + sanitization) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 11:39:36 +08:00
Your Name	bb5f16f8ef	fix(aiops-p2): P2.1 three LLM quality fixes: Evidence-First + consensus confidence + raw_evidence injection Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details Root causes: - consensus_engine: all four ExpertAgents had confidence=0.0 → weighted vote total=0 → always returned NO_ACTION - prompts.py had no Evidence-First instruction → the LLM reasoned from memory with no real-environment constraints - openclaw.py analyze_alert built the prompt without injecting MCP evidence (diagnosis_context) Fixes: - consensus_engine: SRE/Security/Cost/Performance set confidence to 0.45~0.80 according to signal strength - consensus_engine: _normalize_action adds the alias 「重新啟動」 ("restart") → RESTART - consensus_engine: SecurityAgent removes the unused _target variable - prompts.py: add Evidence-First Protocol + Skepticism Rules blocks - openclaw.py: analyze_alert extracts diagnosis_context → injected into full_prompt as <raw_evidence> Verification: consensus score went from 0.0 → 0.744 (CrashLoop test case) P2.1 fix 2026-04-24 ogt + Claude Sonnet 4.6 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 15:52:25 +08:00
Your Name	4fc1f49dca	fix(pipeline): three break-point fixes: SLO formula + NO_ACTION backlog + hallucination downgrade risk All checks were successful CD Pipeline / build-and-deploy (push) Successful in 14m3s Details D1 flywheel_stats_service: the execution_count field does not exist → read success_count+failure_count instead; removes the false impression that the flywheel execution success rate is always 0.0% D2 openclaw._validate_deployment_inventory: after a hallucinated deployment was downgraded, the original HIGH/CRITICAL risk was not cleared → add result.risk_level = AIRiskLevel.LOW D3 webhooks.py (two alert paths): the three non-destructive actions NO_ACTION/INVESTIGATE/OBSERVE are forced to risk_level = LOW, skipping Telegram approval and auto-approving directly → the NO_ACTION handler in approval_execution.py immediately marks EXECUTION_SUCCESS Root cause chain: after the BUTTON_DATA_INVALID fix the TG buttons could be sent, but the backlog of 35 PENDING NO_ACTION records arose because HIGH risk could not take the auto-approve path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-21 22:26:07 +08:00
OG T	4b8be32610	fix(telegram+approval): TG-1 + AP-1/2/3: 4 Telegram UX fixes Some checks failed CD Pipeline / build-and-deploy (push) Failing after 25m27s Details Ansible Lint / lint (push) Has been cancelled Details 2026-04-19 early morning (Taipei time), ogt + Claude Opus 4.7 (1M) ## TG-1: add view to INFO_ACTIONS security_interceptor.py: the 'view' button now uses the 2-part read format and no longer mistakenly triggers the 4-part nonce write format. ## AP-1: persist approval_records.telegram_message_id After telegram_gateway.send_approval_card sends successfully, UPDATE approval_records SET telegram_message_id, telegram_chat_id at the DB layer (not only in Redis, so the original card can still be found after a Pod restart). ## AP-2: edit the original card when approval execution completes + KM/Playbook delta approval_execution._push_execution_result_to_alert, besides replying to the original card, also calls editMessageReplyMarkup to remove the buttons (fixes cards stuck on "executing forever"). - Also queries knowledge_entries/playbooks for additions within the last 2 min and appends "📚 KM +N 🎯 Playbook 更新×M" ("Playbook updates×M") to the message - Success: ✅ execution succeeded + action + KM delta - Failure: ❌ execution failed + reason + KM delta ## AP-3: normalize primary_responsibility to lower the share of 「❓ 未知」 ("unknown") openclaw._parse_analysis_result: if the LLM leaves the field empty, None, or outside the whitelist (FE/BE/INFRA/DB/COLLAB), force a fallback: kubectl keyword present → INFRA, otherwise BE. Previously only "not in data" was checked, so None or an empty string slipped through. ## Skipped: TG-3 (refactor) + TG-5 (the webhook is a deprecated endpoint; the design uses Long Polling) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 01:15:58 +08:00
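OG T	68a42a3c97	fix(openclaw): hallucination validation covers both paths + extract a shared helper Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details 2026-04-19 early morning (Taipei time), ogt + Claude Opus 4.7 (1M) Root cause: the Python hallucination validator from commit `7e9448f` was installed only in `analyze_alert` (webhook path), but the incident sweeper goes through `generate_incident_proposal` (line 1552), which had no validation → at 00:23 the PostgreSQLDiskGrowthRate card showed the hallucinated "deployment/awoooi-prod" uncaught. Fix: 1. Extract the shared method `_validate_deployment_inventory(result, inventory, ns)` 2. `analyze_alert` (line 1322 area) calls this helper; the original inline logic is removed 3. `generate_incident_proposal` (line 1552) fetches the inventory dynamically + calls the helper 4. Helper additions: - result.action_title = '[安全降級] 調查 {ns} 真實資源狀態' ("[Safety downgrade] investigate the real resource state of {ns}"; previously only description changed and action_title did not → the DB action field still held the old text) - each field assignment is wrapped in try/except as a safeguard, so a single field failure does not affect the others Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 01:11:09 +08:00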
OG T	898145d68e	refactor(openclaw): move SuggestedAction to a top-level import (avoid a triple-nested inline import) Some checks failed CD Pipeline / build-and-deploy (push) Successful in 11m17s Details Ansible Lint / lint (push) Has been cancelled Details The IDE falsely flagged the inline "from src.models.ai import" (though it ran fine). A top-level import satisfies the IDE and is also more Pythonic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 23:28:19 +08:00
OG T	e6e484c1dc	fix(openclaw): fix import path: src.models.ai (not openclaw_schema) Some checks are pending CD Pipeline / build-and-deploy (push) Has started running Details A bug the IDE caught correctly (not a false positive): SuggestedAction lives in src/models/ai.py. _SA.NO_ACTION now downgrades correctly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 23:26:45 +08:00
OG T	7e9448f6d0	fix(openclaw): two-layer defense against hallucinated deployment names: Prompt + Python validator Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details 2026-04-18 evening (Taipei time), ogt + Claude Opus 4.7 (1M) Production incident (approval f763bedf, 22:58): - Alert: KubePodCrashLooping, labels.deployment="awoooi-api" - NEMOTRON, despite receiving the inventory "awoooi-api, awoooi-web, awoooi-worker", still output kubectl_command="kubectl rollout restart deployment/awoooi-prod" (mistaking the namespace for a deployment name) - Execution result: "Deployment 'awoooi-prod' not found in namespace 'awoooi-prod'" ## Layer 1: strengthen NEMOTRON_SYSTEM_PROMPT (prompts.py) Add a "🔒 DEPLOYMENT NAME RULE (STRICTLY ENFORCED)" block: - namespace NEVER is a deployment name - "awoooi-prod" is a NAMESPACE; never write deployment/awoooi-prod - if an inventory is present, the deployment must be an exact match - prefer labels.deployment, unknown → NO_ACTION ## Layer 2: Python post-validation (openclaw.py:1322+) After the LLM response is parsed, a regex extracts the deployment name and checks it against _k8s_inventory: - in the list → pass - not in the list → downgrade: * kubectl_command → "kubectl get deploy -n {ns}" (investigation only) * suggested_action → NO_ACTION * target_resource → "unknown(hallucinated)" * confidence → 0.0 * description is annotated with [安全降級] ("safety downgrade") and lists the valid inventory - logged as 'openclaw_deployment_hallucination_detected' Effect: even if the LLM ignores the prompt, the Python layer blocks it. Destructive kubectl is never executed against a nonexistent deployment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 23:26:09 +08:00
OG T	4f2e122fd2	fix(openclaw): Checkpoint-2 webhook path K8s inventory injection: prevent NemoTron from hallucinating awoooi-service All checks were successful CD Pipeline / build-and-deploy (push) Successful in 11m39s Details Root cause: on the webhook path (analyze_alert) NemoTron has no cluster context → blindly guesses deployment/awoooi-service → kubectl not found → EXECUTION_FAILED → trust score stays 0 forever Fix: - analyze_alert() Step 0.5: call _fetch_k8s_inventory_for_openclaw() to fetch the real Deployment list - inject a 「🔒 叢集實際資源清單」 ("actual cluster resource inventory") section into full_prompt, forcing the LLM to pick resource names from the list - failure/timeout → return an empty string → inject a warning hint; the main flow is not interrupted - the available_len calculation includes the k8s_section length to prevent 4K truncation Impact: - the Solver Agent path (solver_agent.py) was already fixed in `cf50a5c` - this commit fixes the Alertmanager webhook path (analyze_alert → NemoTron) - both paths are now K8s-environment-aware, so the LLM no longer hallucinates resource names ADR-082: Phase 2 multi-agent collaboration 2026-04-17 ogt + Claude Sonnet 4.6 (Checkpoint-2 webhook path completion) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-18 00:53:27 +08:00
OG T	604d8eea37	fix(schema-drift): fill in prompts.py + sync the Claude API schema enum (ADR-090) All checks were successful CD Pipeline / build-and-deploy (push) Successful in 12m27s Details Problem: `fe77e6d` expanded the models/ai.py enum to 8 values, but two places were not synced: 1. core/prompts.py L77: missing INVESTIGATE, OBSERVE 2. core/prompts.py L176 (NEMOTRON_SYSTEM_PROMPT): missing APPLY_HPA, INVESTIGATE, OBSERVE 3. openclaw.py L564 (_call_claude tools schema): still constrained by the old 4-value enum Impact: the LLM did not know it could output INVESTIGATE/OBSERVE and could only choose among the old 4 values Fix: align all three places on the 8 suggested_action values RESTART_DEPLOYMENT|DELETE_POD|SCALE_DEPLOYMENT|APPLY_HPA|TUNE_RESOURCES|INVESTIGATE|OBSERVE|NO_ACTION Closes: ADR-090 Prompt-Model three-layer sync iron rule 2026-04-17 ogt + Claude Sonnet 4.6 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 22:10:18 +08:00
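OG T	7eb837567d	fix(diagnostician): fix the garbage root cause shown under 'AI 深度診斷' ("AI deep diagnosis") Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details Three-layer root cause chain: 1. openclaw.call(prompt) does not pass context 2. the OPENCLAW_NEMO fallback uses prompt[:500] (system instruction text) as the signal description 3. the Nemo LLM returns action_title="調查 AWOOOI SRE 系統的偵探 Agent" ("Detective Agent investigating the AWOOOI SRE system", a task description) 4. _extract_hypotheses() uses action_title as the root cause hypothesis description → Telegram shows garbage Fix: - openclaw.call() adds an optional alert_context parameter, passed through to _call_with_fallback - diagnostician._analyze() builds alert_context (incident_id + evidence_summary as signal) → nemo receives real sensor data through the structured API instead of system instruction text - _extract_hypotheses() nemo format conversion: prefer reasoning (the why) over action_title (the what) as the hypothesis description, since reasoning is closer to a root cause analysis 2026-04-16 ogt + Claude Sonnet 4.6 (Taipei time) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 22:34:48 +08:00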
OG T	a258d87767	fix(webhooks+prompts): fix the root problem of the LLM outputting 「重啟 AWOOOI 服務」 ("restart the AWOOOI service") for every alert Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details Root cause (INC-20260416-C365D0 postgres disk alert incident): 1. alertname was buried deep in labels within alert_context; the LLM saw alert_type="custom" → did not know what the alert was 2. the cache key used alert_type:target_resource → different alertnames shared one cache entry → all returned the first LLM result 3. the system prompt had no alert-category guidance → the LLM always output kubectl rollout restart Fix: - webhooks.py: put alertname + alert_category + annotations at the top of alert_context - openclaw.py: cache key changed to alertname:target_resource (the alert name is the primary identifier) - prompts.py: OPENCLAW_SYSTEM_PROMPT + NEMOTRON_SYSTEM_PROMPT add Alert-Specific Analysis Rules: database/storage alerts → NO_ACTION + investigation commands; K8s alerts → the corresponding restart command; outputting kubectl rollout restart deployment/awoooi-prod for non-K8s alerts is forbidden 2026-04-16 ogt + Claude Sonnet 4.6 (Asia-Pacific) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 19:56:13 +08:00
OG T	cd1c0ffdb8	fix(openclaw): call() 5 values → 3 values: fix the root cause of all AI analysis being degraded Root cause: _call_with_fallback() returns a 5-tuple, but call() returned it directly, so callers (diagnostician/solver/critic agents) doing a 3-variable unpack hit ValueError: too many values to unpack → everything degraded to 20% Fix: call() explicitly unpacks and returns (response, provider, success) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-16 03:27:48 +08:00
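OG T	3ce5025ca7	fix(alerts): 3 silent flywheel nodes: DIAGNOSE routing + heartbeat disabled + notification format Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details 1. openclaw.py: remove require_local=True from DIAGNOSE - v4.3 already decided that NIM is the main provider with no privacy concerns - require_local=True caused every provider to be skipped with privacy_skip → alerts always failed - after the fix DIAGNOSE uses _full_fallback_chain (NIM → Gemini → Claude) 2. ai_router.py: the require_local failure notification switches to the ADR-075 TYPE-1 format - plain-text raw notifications are forbidden (commander's iron rule: every message must follow the format template) - use a ├─ / └─ tree structure + semantic labels instead 3. main.py: disable Telegram heartbeat monitoring - the heartbeat is already forwarded to another Telegram group, so it need not be repeated in this channel Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 19:49:43 +08:00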
OG T	36754a8a84	fix: Bug A diagnostics + Bug B real fix: hard-coded LLM 120s/130s → OPENCLAW_TIMEOUT Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details Handling two remaining deep bugs: Bug A (approval.incident_id still NULL): add diagnostics - update_incident_id adds a rowcount check - if the UPDATE affects 0 rows → warning log (id type mismatch or session out of sync) - a manual UPDATE test passed → DB/permissions are fine; the problem is in the application layer - after CD deploys, observe the logs under live-fire to diagnose the true cause Bug B (LLM still 2m6s >> 30s): real fix Two hard-coded timeouts in openclaw.py: - line 146 httpx client default: 120.0s → settings.OPENCLAW_TIMEOUT (30s) - line 348 /analyze/incident POST: 130.0s → settings.OPENCLAW_TIMEOUT (30s) GAP-B4 commit `dd0a778` only fixed ai_providers/ollama.py, but openclaw.py's own httpx client and endpoint call were not changed. This is the true reason Live-fire #2-#7 all stalled at 120s+ Regression tests: 125/125 (dispatcher + a4 + classify + grouping) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-04-14 20:38:00 +08:00
OG T	c0ba1000f3	Revert "fix(auto-repair): medium/low risk + no kubectl_command → TYPE-1 info only, no approval buttons shown" This reverts commit abf1ffa91e7327a36af93be2742d53dac1933f0d.	2026-04-14 13:33:24 +08:00
OG T	2df4945880	fix(auto-repair): medium/low risk + no kubectl_command → TYPE-1 info only, no approval buttons shown Problem: host-level alerts such as HostHighCpuLoad have affected_services=[] → OpenClaw generates kubectl unknown → the safety guard intercepts → falls back to READY + a TYPE-3 card with buttons. Users kept seeing medium/low-risk alerts with buttons that could not fix anything Three fixes: 1. openclaw.py: _call_openclaw_analyze() returns the suggested_action field + the target_resource default becomes "" (keeps "unknown" out of the safety guard) 2. decision_manager.py: classify_notification() is passed suggested_action / risk_level / has_kubectl_command 3. telegram_gateway.py: new classify_notification() rule: no kubectl_command + risk=low/medium + action=investigate/no_action → TYPE-1 (info only, no buttons) Takes effect together with clawbot-v5 f4b84d7 (OpenClaw prompt CRITICAL RULES) 2026-04-14 Claude Sonnet 4.6 Asia/Taipei Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 13:33:24 +08:00
OG T	a7f2b9c0f5	fix(display): rule matches show ✅ instead of 🔴 0% + fix parsing of LLM string confidence Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details - telegram_gateway.py: confidence==0 (rule match/Expert fallback) no longer shows 「🔴 0%」 and shows 「⚙️ 規則匹配 ✅」 ("rule match") instead; fixed for both card types - openclaw.py: NIM/Ollama sometimes return the string "0.85" instead of a float, so isinstance(str, int|float)=False → confidence was forced to 0.0. Now float() parsing is tried first, falling back to 0.0 only if parsing fails Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 20:50:53 +08:00
OG T	896bef94ee	fix(web): pending-approvals-card adds double-click protection + loading state Automatic hardening by the linter: an actioningId state prevents repeated actions on the same card - disabled + opacity 0.6 + cursor not-allowed - buttons show '...' while loading - finally() ensures actioningId is cleared Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 18:38:08 +08:00
OG T	2c7d5d049c	fix(openclaw): backfill kubectl_command from Nemotron tool calls so the executor can parse it after approval Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details Root problem: the restart_deployment(deployment_name=sentry) tool call generated by Nemotron existed only in nemotron_tools[] and was not backfilled into proposal["kubectl_command"]. proposal_service received an empty kubectl_command, approval_records.action stored an empty value, parse_operation_from_action always returned None, and execute_approved_action always skipped. Fix: after Nemotron (or the Gemini fallback) succeeds, convert the tool call into a kubectl command and backfill proposal["kubectl_command"] so that proposal_service can obtain an executable command. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 18:15:01 +08:00
OG T	d8c2969341	feat(telegram): AI chain transparency: alert messages show the OpenClaw + Tool Calling model/backend All checks were successful CD Pipeline / build-and-deploy (push) Successful in 12m12s Details - nemotron.py: detect OllamaToolProvider vs NvidiaProvider, record tool_model/tool_backend - openclaw.py: propagate nemotron_tool_model/nemotron_tool_backend to the proposal - decision_manager.py: extract them from proposal_data and pass them to send_approval_card() - telegram_gateway.py: TelegramMessage adds two fields; format_with_nemotron shows "🔧 Tool Calling: llama3.1:8b (Ollama 本機)" ("Ollama local") or "NVIDIA 雲端" ("NVIDIA cloud") Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 15:05:16 +08:00
OG T	d467fc11be	fix(nemotron): fix the deployment_name placeholder problem Root cause: Nemotron tool calling received target_resource=DockerContainerUnhealthy (not a real K8s deployment name) and filled in <deployment_name> when unsure Fix: 1. the prompt states explicitly that deployment_name must be filled with target_resource 2. after the tool call result is received, detect the placeholder and overwrite it with target_resource Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 11:44:25 +08:00
OG T	428e66c111	fix(arch-review): chief architect review S1×3 S2×3 S3×3 all fixed + ADR-064 Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details S1 Critical: - S1-1: move the asyncio trigger into the _call_with_fallback async context and remove get_event_loop() from sync code - S1-2: _append_rule_to_yaml adds textwrap.dedent() to normalize the indentation of LLM output - S1-3: _matches() returns False directly for alertname=["*"] to prevent accidental hits S2 Major: - S2-1: auto_generate_rule() switches to DI parameter injection (ollama_url/model/gemini_api_key) and removes import settings - S2-4: the _generate_mock_response docstring clarifies that it is the rule engine's production path, not fake data - S2-5: suggested_action .strip() prevents whitespace-only strings from bypassing the or S3 Minor: - S3-2: priority upper bound min(next, 890) - S3-3: alertname sanitize re.sub([{}]) prevents a format KeyError - S3-4: model_registry.py last-modified timestamp updated Docs: - ADR-064: Alert Rule Engine, YAML-driven + AI auto-learning - Skills 02: alert rule engine DI conventions + asyncio prohibitions - Skills 03: _generate_mock_response semantics clarified + rule engine degradation flow - LOGBOOK: full record of this session 2026-04-09 ogt: chief architect review fixes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 10:52:40 +08:00
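OG T	71437db0e9	feat(rule-engine): automatic rule generation: generic_fallback triggers AI learning All checks were successful CD Pipeline / build-and-deploy (push) Successful in 11m25s Details Flow: 1. an alert hits the generic_fallback rule 2. auto_generate_rule() is triggered in the background 3. Ollama (deepseek-r1:14b) generates a YAML rule fragment 4. Ollama fails → Gemini as backup 5. validate the format → append to alert_rules.yaml → clear the lru_cache 6. the next alert of the same kind hits its dedicated rule directly and no longer falls through to the catch-all Dedup: each alertname is generated only once per process Hand-written rules priority 1-499, AI-generated 500-899, catch-all 999 2026-04-09 ogt: AI self-learning rule engine Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 09:20:33 +08:00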
OG T	d1ede7f989	feat(openclaw): alert rule engine: alert_rules.yaml replaces hard-coded if/elif Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details - add alert_rules.yaml: 6 rules (docker/target_down/oom/cpu/5xx/crash) + a generic catch-all - add alert_rule_engine.py: YAML loading, matching logic, variable substitution - openclaw.py _generate_mock_response: refactored to call the rule engine (v8.0) - adding a rule only requires editing the YAML and restarting the Pod; no code change needed - 2026-04-09 ogt: architecture refactor Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 09:05:23 +08:00
OG T	3abc7c2f85	fix(openclaw): dedicated rule matching for DockerContainerUnhealthy + TargetDown Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details - DockerContainerUnhealthy: ssh docker inspect + docker restart, including verification of the healthcheck command - TargetDown / IP:port instance: ssh to confirm the exporter is alive - fix the target mistakenly using alertname as the deployment name - alertname/labels are extracted from alert_context for rule evaluation - 2026-04-09 ogt: added two dedicated rules Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 09:00:31 +08:00
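OG T	d80153bdce	fix(openclaw): fall back to Gemini to produce the execution plan after NIM fails completely Some checks failed CD Pipeline / build-and-deploy (push) Failing after 1m34s Details After NIM tool calling times out repeatedly, an empty execution plan is no longer shown; Gemini instead acts as a proxy and produces the kubectl commands (parsed from JSON). This is triggered only when NIM fails completely, in line with the commander's "must wait until there is a response" principle. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 22:55:25 +08:00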
OG T	14cb015826	fix(openclaw): Nemotron retry logic + exhausted log key (previously uncommitted changes) Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details - generate_incident_proposal_with_tools: single try/except → 2-attempt retry loop - failure log key: nemotron_collaboration_failed → nemotron_collaboration_exhausted - on failure nemotron_enabled=True (so the commander can see the failed state) - _call_nemotron_tools: a timeout now raises an exception (so the outer loop retries) - these are local changes from an earlier session, fixing a mismatch between the tests and the actual implementation Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 21:16:34 +08:00
OG T	d0f09705e5	fix(auto-repair): fix three root causes blocking auto-repair Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details 1. playbook_rag: the Ollama embedding http_client is is_closed after a rolling restart - add _get_http_client() to detect is_closed and rebuild automatically - the singleton get_playbook_rag_service() adds an is_closed rebuild check 2. telegram: add an ai_model field showing the underlying decision model - TelegramMessage.ai_model field - format() / format_with_nemotron() show "Nemotron (nemotron-70b)" - openclaw proposal_dict adds a model field - wired through decision_manager / send_approval_card 3. DB: purge 9 zombie PENDING records from 3/26 (mock_fallback CRITICAL test records) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 11:46:25 +08:00
OG T	b6105b8214	fix(ai): chief architect review fixes C1+C2 (Phase 24 C) C1: telegram_gateway.py Fail-Closed whitelist: when the whitelist is empty, 'if whitelist and ...' is False → anyone can run /ai Fix: 'if not whitelist or user_id not in whitelist' (Fail-Closed); add a whitelist_empty field to the warning log C2: openclaw.py await syntax error in a list comprehension: Python 3.11 does not support await inside a list comprehension; 'if not await is_provider_disabled(p)' → SyntaxError Fix: use a for loop with an explicit await I4: silent except changed to logger.warning Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-03 00:42:02 +08:00
OG T	dbe71f82e3	feat(ai): Phase 24 C: Telegram /ai dynamic control + Redis state management Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details Add ai_control.py: - /ai status: status of all Providers + routing mode - /ai router on/off: toggle AIRouter dynamically (overrides the env var) - /ai primary <provider>: set the primary Provider - /ai enable/disable <provider>: enable or disable a Provider - /ai cost: cost statistics - whitelist: protected by OPENCLAW_TG_USER_WHITELIST telegram_gateway.py: - _handle_chat_message adds interception and routing of /ai commands - users not on the whitelist get a warning openclaw.py: - Redis state overrides env USE_AI_ROUTER (/ai router on/off takes effect) - Redis primary_provider overrides the routing decision (/ai primary takes effect) - Redis disabled-provider filtering (/ai disable takes effect) Redis Keys: ai:control:use_router ai:control:primary_provider ai:control:disabled:<provider> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-03 00:34:14 +08:00
OG T	b4b3a457c5	refactor(openclaw): Phase 24 B4: archive the old fallback Provider methods Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details [ARCHIVED] _call_ollama / _call_gemini / _call_claude - these three methods are kept as the rollback path for USE_AI_ROUTER=false - new path: USE_AI_ROUTER=true → AIRouterExecutor (ai_router.py) - new Providers: ai_providers/ollama.py / gemini.py / claude.py - archived rather than deleted: full removal waits until Phase 24 is fully accepted (ADR-052 D11) R3 observation results (passed ✅): - openclaw_nemo provider: 12/12 incidents all routed correctly - confidence: 0.8~0.9, normal - USE_AI_ROUTER=true confirmed in effect Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-03 00:29:56 +08:00
OG T	5a8aae89c4	fix(phase24): chief architect review C1/C2/C3/I4 fixes All checks were successful CD Pipeline / build-and-deploy (push) Successful in 7m12s Details E2E Health Check / e2e-health (push) Successful in 18s Details C1 (P0): AIRouterExecutor.execute() adds a Langfuse Trace (D5) - create langfuse_trace("ai_router_execute") wrapping the whole execution chain - on success record a generation (model/input/output/tokens/cost) - all prod AI calls now have LLMOps tracing C2 (P0): the strangler now calls AIRouter.route() for smart routing - first obtain a RoutingDecision (intent classification + complexity score) - provider_order is generated dynamically from selected_provider + fallback_chain - the D1 intent routing matrix and D7 privacy protection (DIAGNOSE forced local) take effect C3 (P1): type annotation typo fixes - AIProviderEnumEnum → AIProviderEnum - AIProviderEnumProtocol → AIProviderProtocol I4 (P1): interfaces.py AIProvider Protocol adds a close() definition S1: ai_router.py module version header updated to v4.0 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-02 21:47:06 +08:00
OG T	73e8f8ab77	feat(ai): Phase 24-A+B1: AI Provider Registry + strangler wrapper (ADR-052) Some checks failed E2E Health Check / e2e-health (push) Successful in 16s Details CD Pipeline / build-and-deploy (push) Has been cancelled Details Brain Layer dual-track Registry architecture: - create the src/services/ai_providers/ directory (interfaces + 4 providers) - OllamaProvider (local, rca/chat/code_review) - GeminiProvider (cloud, rca/chat) - ClaudeProvider (cloud, rca/chat/code_review) - OpenClawNemoProvider (cloud, rca, delegates 188→NIM) - extend ai_router.py with: - AIProviderRegistry (dynamic registration/enable/disable) - AIRouterExecutor (Cache + gates CB/RL/Sem + execution) - openclaw.py strangler wrapper: USE_AI_ROUTER=true takes the new path - config.py + ConfigMap add USE_AI_ROUTER=false (safe default) - ADR-052 formal document (14 decisions D1-D14) - HARD_RULES v1.7 adds the AI Router rules Safety: USE_AI_ROUTER=false is the default (not enabled); it must be turned on manually for observation Rollback: kubectl set env deployment/awoooi-api USE_AI_ROUTER=false 2026-04-02 ogt: first Phase 24 implementation batch Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 13:16:09 +08:00
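
The Layer 2 post-validation described in commit 7e9448f6d0 (and extracted into a shared helper in 68a42a3c97) could look roughly like the sketch below. This is an illustration under stated assumptions, not the project's code: the `AnalysisResult` dataclass, the regex, and the exact function signature are hypothetical. Only the downgrade values (`NO_ACTION`, `unknown(hallucinated)`, confidence 0.0, and the read-only `kubectl get deploy -n {ns}` command) are taken from the commit message, and the English `[safety downgrade]` tag stands in for the original `[安全降級]`.

```python
import re
from dataclasses import dataclass


@dataclass
class AnalysisResult:
    """Hypothetical stand-in for the project's parsed LLM result."""
    kubectl_command: str
    suggested_action: str
    target_resource: str
    confidence: float
    description: str = ""


# Assumed pattern: captures <name> in "deployment/<name>" or "deployment <name>".
_DEPLOYMENT_RE = re.compile(r"\bdeploy(?:ment)?[/ ]([a-z0-9][a-z0-9.-]*)")


def validate_deployment_inventory(
    result: AnalysisResult, inventory: list[str], ns: str
) -> AnalysisResult:
    """Downgrade the result if it targets a deployment missing from the inventory."""
    match = _DEPLOYMENT_RE.search(result.kubectl_command)
    if match is None or match.group(1) in inventory:
        return result  # no deployment named, or the name really exists

    # Hallucinated name: swap the destructive command for a read-only one.
    result.kubectl_command = f"kubectl get deploy -n {ns}"
    result.suggested_action = "NO_ACTION"
    result.target_resource = "unknown(hallucinated)"
    result.confidence = 0.0
    result.description = (
        f"[safety downgrade] valid deployments: {', '.join(inventory)}. "
        + result.description
    )
    return result


# Reproduces the incident from the commit: the namespace used as a deployment name.
inventory = ["awoooi-api", "awoooi-web", "awoooi-worker"]
bad = AnalysisResult(
    kubectl_command="kubectl rollout restart deployment/awoooi-prod",
    suggested_action="RESTART_DEPLOYMENT",
    target_resource="awoooi-prod",
    confidence=0.9,
)
checked = validate_deployment_inventory(bad, inventory, ns="awoooi-prod")
print(checked.kubectl_command)
```

Running the validator after parsing, rather than relying on the prompt alone, means a destructive command naming an unknown deployment is replaced before it can reach the approval or execution path, whatever the model outputs.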

85 Commits