Commit Graph

85 Commits

Author SHA1 Message Date
Your Name
d0e98192de fix(ai): keep openclaw before gemini in alert fallback
All checks were successful
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / tests (push) Successful in 1m9s
CD Pipeline / build-and-deploy (push) Successful in 3m28s
CD Pipeline / post-deploy-checks (push) Successful in 1m19s
2026-05-06 20:20:58 +08:00
Your Name
1a1ab0df6e fix(ai): route alerts through openclaw before gemini
All checks were successful
Code Review / ai-code-review (push) Successful in 11s
CD Pipeline / tests (push) Successful in 1m5s
CD Pipeline / build-and-deploy (push) Successful in 3m42s
CD Pipeline / post-deploy-checks (push) Successful in 1m36s
2026-05-06 20:11:24 +08:00
Your Name
2ef54ccc94 fix(ai): enforce ollama first for drift governance
All checks were successful
Code Review / ai-code-review (push) Successful in 16s
CD Pipeline / tests (push) Successful in 1m17s
CD Pipeline / build-and-deploy (push) Successful in 4m54s
CD Pipeline / post-deploy-checks (push) Successful in 3m10s
2026-05-06 19:26:09 +08:00
Your Name
4111ea4f9f fix(ai): remove 188 ollama provider
All checks were successful
Code Review / ai-code-review (push) Successful in 12s
CD Pipeline / tests (push) Successful in 1m13s
CD Pipeline / build-and-deploy (push) Successful in 3m36s
CD Pipeline / post-deploy-checks (push) Successful in 1m20s
2026-05-06 14:34:48 +08:00
Your Name
9ef9633aff fix(alerts): bypass proxy timeout for GCP Ollama 2026-05-06 08:55:14 +08:00
Your Name
2aa31c205a fix(ai): require 111 before alert cloud fallback
All checks were successful
CD Pipeline / tests (push) Successful in 54s
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / build-and-deploy (push) Successful in 3m21s
CD Pipeline / post-deploy-checks (push) Successful in 2m2s
2026-05-06 00:05:51 +08:00
Your Name
e208798531 fix(ai): keep GCP Ollama lane on safe models
All checks were successful
CD Pipeline / tests (push) Successful in 54s
Code Review / ai-code-review (push) Successful in 14s
CD Pipeline / build-and-deploy (push) Successful in 3m25s
CD Pipeline / post-deploy-checks (push) Successful in 1m50s
2026-05-05 23:37:33 +08:00
Your Name
bf847ad045 fix(ai): stabilize GCP Ollama alert lane
Some checks failed
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled
2026-05-05 22:20:27 +08:00
Your Name
ee5e3bc94f fix(openclaw): gate alert cloud fallback behind flag
Some checks failed
Code Review / ai-code-review (push) Successful in 27s
CD Pipeline / tests (push) Successful in 5m17s
CD Pipeline / build-and-deploy (push) Failing after 5m35s
CD Pipeline / post-deploy-checks (push) Has been skipped
2026-05-05 20:54:47 +08:00
Your Name
2ff0ef3bb6 fix(openclaw): route legacy ollama through failover endpoints
Some checks failed
CD Pipeline / tests (push) Failing after 1m49s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 24s
2026-05-05 13:55:52 +08:00
Your Name
b0da6da1e9 feat(aiops): structure agent loop shadow output
Some checks failed
CD Pipeline / tests (push) Successful in 2m50s
Code Review / ai-code-review (push) Successful in 33s
CD Pipeline / build-and-deploy (push) Failing after 25m48s
CD Pipeline / post-deploy-checks (push) Has been cancelled
2026-05-01 15:09:57 +08:00
Your Name
f8e44971c1 feat(aiops): enable read-only agent loop canary
All checks were successful
CD Pipeline / tests (push) Successful in 1m43s
Code Review / ai-code-review (push) Successful in 31s
CD Pipeline / build-and-deploy (push) Successful in 10m22s
CD Pipeline / post-deploy-checks (push) Successful in 4m3s
2026-05-01 14:20:16 +08:00
Your Name
9db87f177e fix(aiops): suppress repeated llm alert loops
Some checks failed
CD Pipeline / tests (push) Successful in 1m37s
Code Review / ai-code-review (push) Successful in 28s
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-05-01 13:02:07 +08:00
Your Name
d845d53257 fix(security): keep Gemini key out of request URLs
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 15m5s
2026-04-29 22:56:12 +08:00
Your Name
fe2b8f4571 fix(flywheel): fallback on OpenClaw degraded responses
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 9m56s
2026-04-29 22:38:57 +08:00
Your Name
dccdcdbaf5 fix(flywheel): unblock action safety and Claude fallback
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 9m45s
2026-04-29 21:51:18 +08:00
Your Name
fb0c72db42 feat(ai-router): 推翻 A2 鐵律 — DIAGNOSE primary 改 Ollama 本地優先
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 2m26s
統帥鐵律 2026-04-29:「主要優先用 111 主機的 Ollama」
+ feedback_ai_autonomous_direction.md:以本地免費 LLM 為主
+ feedback_ollama_111_only.md:Ollama 唯一主機 = 111

## 推翻 A2 (2026-04-27 INC-20260425) 的事實基礎

**舊事實**:Ollama = CPU-only deepseek-r1:14b @ 238s(不可用)
**新事實**:prod Ollama 111 = M1 Pro Apple Silicon GPU + qwen2.5:7b-instruct
           VRAM 8.2GB 全載入,ctx 32k,實測 hi prompt 0.54s

**雲端全死**(2026-04-29 prod log 證據):
- OpenClaw 188:8088 → 500 Internal Server Error
- Gemini → 429 Too Many Requests(配額爆)
- Claude → 404 Not Found(model claude-3-haiku-20240307 過期)

**不推翻 A2 → 100% incident llm_failed → AI 自動修復永遠不啟動**

## 修改範圍(最小、安全、可驗證)

### ai_router.py
- `_diagnose_fallback_chain`: OLLAMA 第一順位(取代「永久排除」舊註解)
  順序:[OLLAMA, OPENCLAW_NEMO, GEMINI, CLAUDE]
- `_intent_provider_overrides[DIAGNOSE]`: OPENCLAW_NEMO → OLLAMA
- 不動 _full_fallback_chain(避免影響 RESTART/SCALE/CONFIG/DELETE)
- 不動 _tool_calling_fallback_chain
- 不動 complexity_map(critic M2 留待後續)

### openclaw.py
- 注入 task_type="diagnose" 到 alert_context(critic C2 真根因)
- 修復 ai_providers/ollama.py:77 timeout 對齊問題:
  - 有 task_type → OLLAMA_DIAGNOSE_TIMEOUT_SECONDS=200s
  - 沒有 → OPENCLAW_TIMEOUT=30s(不夠 qwen2.5:7b 推理)
- prod log 看到 latency_ms=120014 的根因
- 用 dict(alert_context) 複製,不污染原 context

## Regression Test 同步更新(5 個)

A2 鐵律守門 test 全部反映新鐵律:

- test_p0_diagnose_routing.py::test_diagnose_override_is_ollama
  (原 test_diagnose_override_is_openclaw_nemo)
- test_ai_router_diagnose_fallback.py::test_diagnose_fallback_chain_ollama_primary
  (原 test_diagnose_fallback_chain_no_ollama)
- test_ai_router_diagnose_fallback.py::test_diagnose_route_primary_is_ollama
  (原 test_diagnose_route_fallback_chain_excludes_ollama)
- test_ai_router_diagnose_fallback.py::test_diagnose_route_sync_primary_is_ollama
  (原 test_diagnose_route_sync_fallback_chain_excludes_ollama)
- test_ai_router_diagnose_fallback.py::test_build_fallback_chain_for_intent_diagnose_with_ollama_primary
  (原 test_build_fallback_chain_for_intent_diagnose_no_ollama)
- test_ai_router_failover_integration.py::test_router_uses_failover_for_diagnose_ollama_primary
  (原 test_router_does_not_use_failover_for_openclaw_nemo)

每個 test docstring 都記載歷史脈絡 + 推翻原因。

## 驗證

- 1608 unit tests 全綠
- LLM 路徑 16 個 test 全綠(含 6 個 A2 守門 test 更新版)
- complexity_scorer / failover_manager / intent_classifier 不受影響

## 期望 prod 行為(部署後驗證)

incident 進入 → DIAGNOSE intent → primary OLLAMA (qwen2.5:7b on M1 Pro GPU)
  失敗才 fallback → OpenClaw 188 → Gemini → Claude
  Ollama 用 200s timeout(之前 30s 不夠)
  → AI 自動修復終於可以啟動,不再 100% llm_failed

## 已知債(後續處理)

- models.json:21 ollama.default 仍是 deepseek-r1:14b(critic C1,但 prod 已自動 route 到實載 model)
- complexity 4/5 仍寫死 gemini/claude(critic M2)
- Gemini API key 在 prod log 明文(需輪換 + sanitize)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 11:39:36 +08:00
Your Name
bb5f16f8ef fix(aiops-p2): P2.1 LLM品質三修 — Evidence-First + consensus confidence + raw_evidence注入
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
根因:
- consensus_engine 四 ExpertAgent confidence=0.0 → 加權投票 total=0 → 永遠返回 NO_ACTION
- prompts.py 無 Evidence-First 指令 → LLM 靠記憶推理,無真實環境約束
- openclaw.py analyze_alert 建 prompt 未注入 MCP evidence (diagnosis_context)

修復:
- consensus_engine: SRE/Security/Cost/Performance 依訊號強度設 0.45~0.80 confidence
- consensus_engine: _normalize_action 加「重新啟動」別名 → RESTART
- consensus_engine: SecurityAgent 移除未使用的 _target 變數
- prompts.py: 加 Evidence-First Protocol + Skepticism Rules 區塊
- openclaw.py: analyze_alert 提取 diagnosis_context → <raw_evidence> 注入 full_prompt

驗證: consensus score 從 0.0 → 0.744(CrashLoop 測試案例)

P2.1 fix 2026-04-24 ogt + Claude Sonnet 4.6

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 15:52:25 +08:00
Your Name
4fc1f49dca fix(pipeline): 三斷點修復 — SLO公式+NO_ACTION堆積+幻覺降級風險
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 14m3s
D1 flywheel_stats_service: execution_count 欄位不存在 → 改讀
    success_count+failure_count;消除飛輪執行成功率永遠 0.0% 假象

D2 openclaw._validate_deployment_inventory: 幻覺 deployment 降級後
    原 HIGH/CRITICAL risk 未清零 → 加 result.risk_level = AIRiskLevel.LOW

D3 webhooks.py (兩處 alert path): NO_ACTION/INVESTIGATE/OBSERVE 三類
    非破壞性動作強制 risk_level = LOW,跳過 Telegram 批准直接 auto-approve
    → approval_execution.py 的 NO_ACTION handler 立即標 EXECUTION_SUCCESS

Root cause 鏈:BUTTON_DATA_INVALID 修復後 TG 按鈕可發,但 NO_ACTION
積壓的 35 筆 PENDING 是因 HIGH risk 無法走 auto-approve 路徑導致。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 22:26:07 +08:00
OG T
4b8be32610 fix(telegram+approval): TG-1 + AP-1/2/3 — 4 修 Telegram UX
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 25m27s
Ansible Lint / lint (push) Has been cancelled
2026-04-19 凌晨(台北時區)— ogt + Claude Opus 4.7 (1M)

## TG-1: INFO_ACTIONS 加 view
security_interceptor.py — 'view' 按鈕現在走 2-part 讀格式,
不再誤觸發 4-part nonce 寫格式。

## AP-1: approval_records.telegram_message_id 持久化
telegram_gateway.send_approval_card send 成功後,在 DB 層 UPDATE
approval_records SET telegram_message_id, telegram_chat_id
(不只 Redis, Pod 重啟仍可找回原卡片)。

## AP-2: approval 執行完成原卡片 edit + KM/Playbook 增量
approval_execution._push_execution_result_to_alert 除了 reply 原卡片,
還 editMessageReplyMarkup 移除按鈕(修「永遠執行中」卡片問題)。
  - 同步查 knowledge_entries/playbooks 2min 內增量,附加到訊息
    顯示 "📚 KM +N  🎯 Playbook 更新×M"
  - 成功:  執行成功 + action + KM 增量
  - 失敗:  執行失敗 + 原因 + KM 增量

## AP-3: primary_responsibility 正規化降「 未知」比例
openclaw._parse_analysis_result: 若 LLM 填空/None/不在白名單
(FE/BE/INFRA/DB/COLLAB),強制 fallback: kubectl 關鍵字有 → INFRA,
否則 BE。之前只檢查 "not in data" 但 None 或空字串會穿過。

## 跳過: TG-3 (refactor) + TG-5 (webhook 為棄用 endpoint,design 採 Long Polling)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 01:15:58 +08:00
OG T
68a42a3c97 fix(openclaw): 幻覺驗證雙路徑覆蓋 + 抽出共用 helper
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-04-19 凌晨(台北時區)— ogt + Claude Opus 4.7 (1M)

根因:
  commit 7e9448f 的 Python hallucination validator 只裝在
  `analyze_alert` (webhook path),但 incident sweeper 走
  `generate_incident_proposal` (line 1552) 沒裝驗證 → 00:23
  PostgreSQLDiskGrowthRate 卡片出現 "deployment/awoooi-prod"
  幻覺未攔截。

修:
1. 抽出 `_validate_deployment_inventory(result, inventory, ns)` 共用方法
2. `analyze_alert` (line 1322 area) 呼叫此 helper — 原行內邏輯消除
3. `generate_incident_proposal` (line 1552) 動態抓 inventory + 呼叫 helper
4. helper 補:
   - result.action_title = '[安全降級] 調查 {ns} 真實資源狀態'
     (之前只改 description,action_title 沒變 → DB action 欄位仍殘留舊文字)
   - 每個欄位賦值 try/except 保底,單欄失敗不影響其他

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 01:11:09 +08:00
OG T
898145d68e refactor(openclaw): SuggestedAction 改用頂部 import (避免 inline 三重巢狀)
Some checks failed
CD Pipeline / build-and-deploy (push) Successful in 11m17s
Ansible Lint / lint (push) Has been cancelled
IDE 對 inline "from src.models.ai import" 誤報(但運行正常)。
改為頂部 import 既滿足 IDE 也更 Pythonic。

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 23:28:19 +08:00
OG T
e6e484c1dc fix(openclaw): import path 修正 — src.models.ai (非 openclaw_schema)
Some checks are pending
CD Pipeline / build-and-deploy (push) Has started running
IDE 正確抓到的 bug(非 false positive),SuggestedAction 在 src/models/ai.py。
_SA.NO_ACTION 現在能正確降級。

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 23:26:45 +08:00
OG T
7e9448f6d0 fix(openclaw): 幻覺 deployment 名雙層防禦 — Prompt + Python validator
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-04-18 晚(台北時區)— ogt + Claude Opus 4.7 (1M)

生產事件 (approval f763bedf, 22:58):
- Alert: KubePodCrashLooping, labels.deployment="awoooi-api"
- NEMOTRON 雖收 inventory "awoooi-api, awoooi-web, awoooi-worker"
  仍輸出 kubectl_command="kubectl rollout restart deployment/awoooi-prod"
  (把 namespace 誤當 deployment 名)
- 執行結果: "Deployment 'awoooi-prod' not found in namespace 'awoooi-prod'"

## Layer 1: NEMOTRON_SYSTEM_PROMPT 強化 (prompts.py)
新增「🔒 DEPLOYMENT NAME RULE (STRICTLY ENFORCED)」區塊:
- namespace NEVER is a deployment name
- "awoooi-prod" 是 NAMESPACE,不可寫 deployment/awoooi-prod
- 若有 inventory,deployment 必須 exact match
- 優先用 labels.deployment,unknown → NO_ACTION

## Layer 2: Python 後驗證 (openclaw.py:1322+)
LLM 回應解析後 regex 抽出 deployment 名,對照 _k8s_inventory:
- 在清單內 → 通過
- 不在清單內 → 降級:
    * kubectl_command → "kubectl get deploy -n {ns}"(純調查)
    * suggested_action → NO_ACTION
    * target_resource → "unknown(hallucinated)"
    * confidence → 0.0
    * description 加註 [安全降級] 並列出合法 inventory
- log 'openclaw_deployment_hallucination_detected' 記錄

效果: 就算 LLM 無視 prompt,Python 層也會擋下。
破壞性 kubectl 絕不執行於不存在的 deployment。

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 23:26:09 +08:00
OG T
4f2e122fd2 fix(openclaw): Checkpoint-2 webhook path K8s inventory injection — 防止 NemoTron 幻覺 awoooi-service
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 11m39s
根因:NemoTron 在 webhook path(analyze_alert)無叢集上下文
→ 盲猜 deployment/awoooi-service → kubectl not found → EXECUTION_FAILED → trust score 0 永遠

修復:
- analyze_alert() Step 0.5: 呼叫 _fetch_k8s_inventory_for_openclaw() 拉取真實 Deployment 清單
- 注入「🔒 叢集實際資源清單」section 到 full_prompt,強制 LLM 從清單選擇資源名
- 失敗/超時 → 返回空字串 → 注入警示提示,主流程不中斷
- available_len 計算納入 k8s_section 長度防止 4K 截斷

影響:
- Solver Agent path (solver_agent.py) 已在 cf50a5c 修復
- 本 commit 修復 Alertmanager webhook path(analyze_alert → NemoTron)
- 兩條路徑均有 K8s 環境感知,LLM 不再幻覺資源名

ADR-082: Phase 2 多 Agent 協作
2026-04-17 ogt + Claude Sonnet 4.6(Checkpoint-2 webhook path completion)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 00:53:27 +08:00
OG T
604d8eea37 fix(schema-drift): 補齊 prompts.py + Claude API schema enum 同步 (ADR-090)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 12m27s
問題: fe77e6d 擴充了 models/ai.py enum 至 8 值,但兩個地方未同步:
  1. core/prompts.py L77: 缺 INVESTIGATE、OBSERVE
  2. core/prompts.py L176 (NEMOTRON_SYSTEM_PROMPT): 缺 APPLY_HPA、INVESTIGATE、OBSERVE
  3. openclaw.py L564 (_call_claude tools schema): 舊 4 值 enum 約束

影響: LLM 不知道可以輸出 INVESTIGATE/OBSERVE,只能選舊 4 值

修復: 三處統一對齊 8 個 suggested_action 值
  RESTART_DEPLOYMENT|DELETE_POD|SCALE_DEPLOYMENT|APPLY_HPA|TUNE_RESOURCES|INVESTIGATE|OBSERVE|NO_ACTION

Closes: ADR-090 Prompt-Model 三層同步鐵律

2026-04-17 ogt + Claude Sonnet 4.6

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 22:10:18 +08:00
OG T
7eb837567d fix(diagnostician): 修復 'AI 深度診斷' 垃圾根因顯示
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
根因三層鏈:
1. openclaw.call(prompt) 不傳 context
2. OPENCLAW_NEMO fallback 把 prompt[:500](系統說明文字)當 signal description
3. Nemo LLM 回傳 action_title="調查 AWOOOI SRE 系統的偵探 Agent"(任務描述)
4. _extract_hypotheses() 用 action_title 作為根因假設描述 → Telegram 顯示垃圾

修復:
- openclaw.call() 新增 alert_context 可選參數,透傳給 _call_with_fallback
- diagnostician._analyze() 建立 alert_context(incident_id + evidence_summary as signal)
  → nemo 使用結構化 API 收到真實感應器資料而非系統說明文字
- _extract_hypotheses() nemo 格式轉換:優先用 reasoning(為什麼)作為假設描述
  而非 action_title(做什麼)— reasoning 更接近根因分析

2026-04-16 ogt + Claude Sonnet 4.6 (台北時區)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 22:34:48 +08:00
OG T
a258d87767 fix(webhooks+prompts): 修復 LLM 對所有告警一律輸出「重啟 AWOOOI 服務」的根本問題
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
根因 (INC-20260416-C365D0 postgres 磁碟告警事故):
1. alert_context 中 alertname 埋在 labels 深處,LLM 看到 alert_type="custom" → 不知道是什麼告警
2. 快取鍵用 alert_type:target_resource → 不同 alertname 共用同一快取 → 全部回傳第一個 LLM 結果
3. 系統 Prompt 無 alert-category 指導 → LLM 永遠輸出 kubectl rollout restart

修復:
- webhooks.py: alert_context 置頂加入 alertname + alert_category + annotations
- openclaw.py: 快取鍵改用 alertname:target_resource(告警名稱才是主要識別符)
- prompts.py: OPENCLAW_SYSTEM_PROMPT + NEMOTRON_SYSTEM_PROMPT 加入 Alert-Specific Analysis Rules
  database/storage 告警 → NO_ACTION + 調查指令;K8s 告警 → 對應重啟指令
  禁止對非 K8s 告警輸出 kubectl rollout restart deployment/awoooi-prod

2026-04-16 ogt + Claude Sonnet 4.6(亞太)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 19:56:13 +08:00
OG T
cd1c0ffdb8 fix(openclaw): call() 5值→3值 — 修復全部 AI 分析降級根因
根因: _call_with_fallback() 回傳 5-tuple,但 call() 直接 return
導致呼叫端 (diagnostician/solver/critic agents) 3-var unpack
→ ValueError: too many values to unpack → 全部 降級 20%

修復: call() 明確解包再回傳 (response, provider, success)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 03:27:48 +08:00
OG T
3ce5025ca7 fix(alerts): 3 個飛輪沉默節點 — DIAGNOSE routing + 心跳停用 + 通知格式
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
1. openclaw.py: DIAGNOSE 移除 require_local=True
   - v4.3 已決定 NIM 為主力且無隱私問題
   - require_local=True 導致所有 provider 被 privacy_skip → 告警永遠失敗
   - 修後 DIAGNOSE 走 _full_fallback_chain(NIM → Gemini → Claude)

2. ai_router.py: require_local 失敗通知改為 ADR-075 TYPE-1 格式
   - 禁止純文字 raw notification(統帥鐵律:所有訊息必須符合格式模板)
   - 改用 ├─ / └─ 樹狀結構 + 語義化標籤

3. main.py: 停用 Telegram 心跳監控
   - 心跳已轉發到另一個 Telegram 群組,不需在此頻道重複發送

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 19:49:43 +08:00
OG T
36754a8a84 fix: Bug A 診斷 + Bug B 真修 — LLM 120s/130s 硬編 → OPENCLAW_TIMEOUT
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
殘留兩個深層 bug 處理:

Bug A (approval.incident_id 仍 NULL) — 加診斷
  - update_incident_id 加 rowcount 檢查
  - 若 UPDATE 0 rows affected → warning log (id 型別 mismatch 或 session 不同步)
  - 手動 UPDATE 測試通過 → DB/permissions 正常,問題在應用層
  - 等 CD 部署後 live-fire 觀察 log 診斷真因

Bug B (LLM 仍 2m6s >> 30s) — 真修
  openclaw.py 兩處硬編 timeout:
  - line 146 httpx client default: 120.0s → settings.OPENCLAW_TIMEOUT (30s)
  - line 348 /analyze/incident POST: 130.0s → settings.OPENCLAW_TIMEOUT (30s)
  GAP-B4 commit dd0a778 只修了 ai_providers/ollama.py
  但 openclaw.py 自己的 httpx client 和 endpoint call 沒改
  這就是為什麼 Live-fire #2-#7 都卡 120s+ 的真因

回歸測試: 125/125 (dispatcher + a4 + classify + grouping)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-14 20:38:00 +08:00
OG T
c0ba1000f3 Revert "fix(auto-repair): 中低風險+無kubectl_command → TYPE-1 純資訊,不顯示審核按鈕"
This reverts commit abf1ffa91e7327a36af93be2742d53dac1933f0d.
2026-04-14 13:33:24 +08:00
OG T
2df4945880 fix(auto-repair): 中低風險+無kubectl_command → TYPE-1 純資訊,不顯示審核按鈕
問題: HostHighCpuLoad 等主機層告警 affected_services=[] → OpenClaw 生成
kubectl unknown → safety guard 攔截 → 退回 READY + TYPE-3 帶按鈕卡片
用戶一直看到帶按鈕的中/低風險告警,按鈕無法修復任何東西

修復三處:
1. openclaw.py: _call_openclaw_analyze() 回傳 suggested_action 欄位
   + target_resource 預設改為 "" (避免 "unknown" 進入 safety guard)
2. decision_manager.py: classify_notification() 傳入
   suggested_action / risk_level / has_kubectl_command
3. telegram_gateway.py: classify_notification() 新規則 —
   無 kubectl_command + risk=low/medium + action=investigate/no_action
   → TYPE-1 (純資訊,無按鈕)

搭配 clawbot-v5 f4b84d7 (OpenClaw prompt CRITICAL RULES) 一起生效

2026-04-14 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 13:33:24 +08:00
OG T
a7f2b9c0f5 fix(display): 規則匹配改顯示 取代 🔴 0% + 修復 LLM 字串 confidence 解析
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- telegram_gateway.py: confidence==0 (規則匹配/Expert fallback) 不再顯示
  「🔴 0%」,改顯示「⚙️ 規則匹配 」,兩個 card 類型都修正
- openclaw.py: NIM/Ollama 有時回傳字串 "0.85" 而非 float,導致
  isinstance(str, int|float)=False → confidence 被強制設 0.0。
  現在先嘗試 float() 解析,解析失敗才 fallback 0.0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 20:50:53 +08:00
OG T
896bef94ee fix(web): pending-approvals-card 加防重複點擊 + loading 狀態
linter 自動強化: actioningId state 防止同一張卡重複操作
- disabled + opacity 0.6 + cursor not-allowed
- loading 時按鈕顯示 '...'
- finally() 確保 actioningId 清除

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:38:08 +08:00
OG T
2c7d5d049c fix(openclaw): Nemotron tool call 回填 kubectl_command,讓批准後執行器能解析
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
根本問題:Nemotron 產生的 restart_deployment(deployment_name=sentry) tool call
只存在 nemotron_tools[],沒有回填到 proposal["kubectl_command"]。
proposal_service 拿到的 kubectl_command 是空的,approval_records.action 存空值,
parse_operation_from_action 永遠返回 None,execute_approved_action 永遠 skip。

修正:Nemotron (和 Gemini fallback) 成功後,將 tool call 轉換為 kubectl 指令
並回填 proposal["kubectl_command"],讓 proposal_service 能取到可執行指令。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:15:01 +08:00
OG T
d8c2969341 feat(telegram): AI 鏈路透明化 — 告警訊息顯示 OpenClaw + Tool Calling 模型/後端
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 12m12s
- nemotron.py: 偵測 OllamaToolProvider vs NvidiaProvider,記錄 tool_model/tool_backend
- openclaw.py: 傳播 nemotron_tool_model/nemotron_tool_backend 到 proposal
- decision_manager.py: 從 proposal_data 提取並傳給 send_approval_card()
- telegram_gateway.py: TelegramMessage 新增兩個欄位,format_with_nemotron 顯示
  "🔧 Tool Calling: llama3.1:8b (Ollama 本機)" 或 "NVIDIA 雲端"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 15:05:16 +08:00
OG T
d467fc11be fix(nemotron): 修復 deployment_name placeholder 問題
根因: Nemotron tool calling 收到 target_resource=DockerContainerUnhealthy
  (非真實 K8s deployment name),不確定時填 <deployment_name>

修復:
1. prompt 明確標注 deployment_name 必須填入 target_resource
2. 收到 tool call 結果後,偵測 placeholder 並用 target_resource 覆蓋

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 11:44:25 +08:00
OG T
428e66c111 fix(arch-review): 首席架構師審查 S1×3 S2×3 S3×3 全修復 + ADR-064
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
S1 Critical:
- S1-1: asyncio 觸發移至 _call_with_fallback async 上下文,移除 sync 中的 get_event_loop()
- S1-2: _append_rule_to_yaml 加 textwrap.dedent() 正規化 LLM 輸出縮排
- S1-3: _matches() 對 alertname=["*"] 直接回傳 False,防意外命中

S2 Major:
- S2-1: auto_generate_rule() 改為 DI 參數注入 (ollama_url/model/gemini_api_key),移除 import settings
- S2-4: _generate_mock_response docstring 澄清為規則引擎生產路徑,非假數據
- S2-5: suggested_action .strip() 防空白字串繞過 or

S3 Minor:
- S3-2: priority 上界 min(next, 890)
- S3-3: alertname sanitize re.sub([{}]) 防 format KeyError
- S3-4: model_registry.py 最後修改時間戳更新

文件:
- ADR-064: Alert Rule Engine YAML 驅動 + AI 自動學習
- Skills 02: 告警規則引擎 DI 規範 + asyncio 禁止事項
- Skills 03: _generate_mock_response 語意澄清 + 規則引擎降級流程
- LOGBOOK: 本次 Session 完整記錄

2026-04-09 ogt: 首席架構師審查修正

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 10:52:40 +08:00
OG T
71437db0e9 feat(rule-engine): 自動規則生成 — generic_fallback 觸發 AI 學習
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 11m25s
流程:
1. 告警命中 generic_fallback 規則
2. 背景觸發 auto_generate_rule()
3. Ollama (deepseek-r1:14b) 生成 YAML 規則片段
4. Ollama 失敗 → Gemini 備援
5. 驗證格式 → append alert_rules.yaml → 清除 lru_cache
6. 下次同類告警直接命中專屬規則,不再走兜底

去重: 同一 alertname 進程內只生成一次
手寫規則 priority 1-499,AI 生成 500-899,兜底 999

2026-04-09 ogt: AI 自學規則引擎

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 09:20:33 +08:00
OG T
d1ede7f989 feat(openclaw): 告警規則引擎 — alert_rules.yaml 取代硬編碼 if/elif
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- 新增 alert_rules.yaml: 6 條規則 (docker/target_down/oom/cpu/5xx/crash) + 通用兜底
- 新增 alert_rule_engine.py: YAML 載入、匹配邏輯、變數填充
- openclaw.py _generate_mock_response: 重構為呼叫規則引擎 (v8.0)
- 新增規則只需修改 YAML,重啟 Pod 即可,不需改代碼
- 2026-04-09 ogt: 架構重構

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 09:05:23 +08:00
OG T
3abc7c2f85 fix(openclaw): DockerContainerUnhealthy + TargetDown 專屬規則匹配
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- DockerContainerUnhealthy: ssh docker inspect + docker restart,含 healthcheck 指令驗證
- TargetDown / IP:port instance: ssh 確認 exporter 存活
- 修正 target 混用 alertname 作為 deployment 名稱的問題
- alertname/labels 從 alert_context 提取供規則判斷
- 2026-04-09 ogt: 新增兩條專屬規則

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 09:00:31 +08:00
OG T
d80153bdce fix(openclaw): NIM 完全失敗後 fallback 到 Gemini 產生執行方案
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m34s
NIM tool calling 多次 timeout 後,不再顯示空白執行方案,
改由 Gemini 代理產生 kubectl 操作指令(JSON 解析)。
只有 NIM 完全失敗才觸發,符合統帥「必須等到有回應」原則。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 22:55:25 +08:00
OG T
14cb015826 fix(openclaw): Nemotron 重試邏輯 + exhausted log key (未提交的修改)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- generate_incident_proposal_with_tools: 單次 try/except → 2次重試迴圈
- 失敗 log key: nemotron_collaboration_failed → nemotron_collaboration_exhausted
- 失敗時 nemotron_enabled=True (讓統帥看到失敗狀態)
- _call_nemotron_tools: timeout 超時改為拋出異常(讓外層重試)
- 這是之前 Session 的本地修改,修正測試與實際實作不一致問題

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 21:16:34 +08:00
OG T
d0f09705e5 fix(auto-repair): 修復三個阻礙自動修復的根本原因
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
1. playbook_rag: Ollama embedding http_client 滾動重啟後 is_closed
   - 新增 _get_http_client() 偵測 is_closed 自動重建
   - singleton get_playbook_rag_service() 加 is_closed 重建判斷

2. telegram: 加入 ai_model 欄位顯示底層判斷模型
   - TelegramMessage.ai_model 欄位
   - format() / format_with_nemotron() 顯示 "Nemotron (nemotron-70b)"
   - openclaw proposal_dict 加入 model 欄位
   - decision_manager / send_approval_card 串接

3. DB: 清除 9 筆 3/26 殭屍 PENDING (mock_fallback CRITICAL 測試記錄)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 11:46:25 +08:00
OG T
b6105b8214 fix(ai): 首席架構師審查修復 C1+C2 (Phase 24 C)
C1 — telegram_gateway.py Fail-Closed 白名單:
  白名單為空時 'if whitelist and ...' 為 False → 任何人可執行 /ai
  修復: 'if not whitelist or user_id not in whitelist' Fail-Closed
  加入 whitelist_empty 欄位到 warning log

C2 — openclaw.py list comprehension await 語法錯誤:
  Python 3.11 不支援 list comprehension 中使用 await
  'if not await is_provider_disabled(p)' → SyntaxError
  修復: 改為 for loop 明確 await
  I4: 靜默 except 改為 logger.warning

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 00:42:02 +08:00
OG T
dbe71f82e3 feat(ai): Phase 24 C — Telegram /ai 動態控制 + Redis 狀態管理
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
新增 ai_control.py:
- /ai status: 所有 Provider 狀態 + 路由模式
- /ai router on/off: 動態切換 AIRouter (覆蓋 env var)
- /ai primary <provider>: 設定主要 Provider
- /ai enable/disable <provider>: 控制 Provider 啟停
- /ai cost: 費用統計
- 白名單: OPENCLAW_TG_USER_WHITELIST 保護

telegram_gateway.py:
- _handle_chat_message 加入 /ai 指令攔截路由
- 白名單未授權返回警告

openclaw.py:
- Redis 狀態覆蓋 env USE_AI_ROUTER (/ai router on/off 生效)
- Redis primary_provider 覆蓋路由決策 (/ai primary 生效)
- Redis disabled provider 過濾 (/ai disable 生效)

Redis Keys:
  ai:control:use_router
  ai:control:primary_provider
  ai:control:disabled:<provider>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 00:34:14 +08:00
OG T
b4b3a457c5 refactor(openclaw): Phase 24 B4 — 封存舊 fallback Provider 方法
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
[ARCHIVED] _call_ollama / _call_gemini / _call_claude
- 這三個方法為 USE_AI_ROUTER=false 回滾保留路徑
- 新路徑: USE_AI_ROUTER=true → AIRouterExecutor (ai_router.py)
- 新 Provider: ai_providers/ollama.py / gemini.py / claude.py
- 封存而非刪除: 完整移除等 Phase 24 全驗收後 (ADR-052 D11)

R3 觀察結果 (通過 ):
- openclaw_nemo provider: 12/12 incidents 全部正確路由
- 信心度: 0.8~0.9 正常
- USE_AI_ROUTER=true 生效確認

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 00:29:56 +08:00
OG T
5a8aae89c4 fix(phase24): 首席架構師 Review C1/C2/C3/I4 修復
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m12s
E2E Health Check / e2e-health (push) Successful in 18s
C1 (P0): AIRouterExecutor.execute() 補 Langfuse Trace (D5)
  - 建立 langfuse_trace("ai_router_execute") 包住整個執行鏈
  - 成功時記錄 generation (model/input/output/tokens/cost)
  - prod 所有 AI 呼叫現在有 LLMOps 追蹤

C2 (P0): 絞殺者改為呼叫 AIRouter.route() 智慧路由
  - 先取得 RoutingDecision (意圖分類 + 複雜度評分)
  - provider_order 從 selected_provider + fallback_chain 動態生成
  - D1 意圖路由矩陣、D7 隱私保護 (DIAGNOSE 強制 local) 生效

C3 (P1): 型別標注 typo 修復
  - AIProviderEnumEnum → AIProviderEnum
  - AIProviderEnumProtocol → AIProviderProtocol

I4 (P1): interfaces.py AIProvider Protocol 補 close() 定義

S1: ai_router.py 模組版本標頭更新至 v4.0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 21:47:06 +08:00
OG T
73e8f8ab77 feat(ai): Phase 24-A+B1 — AI Provider Registry + 絞殺者包裝 (ADR-052)
Some checks failed
E2E Health Check / e2e-health (push) Successful in 16s
CD Pipeline / build-and-deploy (push) Has been cancelled
Brain Layer 雙軌 Registry 架構:
- 新建 src/services/ai_providers/ 目錄 (interfaces + 4 providers)
  - OllamaProvider (local, rca/chat/code_review)
  - GeminiProvider (cloud, rca/chat)
  - ClaudeProvider (cloud, rca/chat/code_review)
  - OpenClawNemoProvider (cloud, rca — 委派 188→NIM)
- 擴展 ai_router.py 加入:
  - AIProviderRegistry (動態註冊/啟停)
  - AIRouterExecutor (Cache + 閘門 CB/RL/Sem + 執行)
- openclaw.py 絞殺者包裝: USE_AI_ROUTER=true 走新路徑
- config.py + ConfigMap 加入 USE_AI_ROUTER=false (安全預設)
- ADR-052 正式文件 (14 項決策 D1-D14)
- HARD_RULES v1.7 加入 AI Router 規範

安全: USE_AI_ROUTER=false 預設不啟用,需手動開啟觀察
回滾: kubectl set env deployment/awoooi-api USE_AI_ROUTER=false

2026-04-02 ogt: Phase 24 首批實作

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 13:16:09 +08:00