Commit Graph

327 Commits

Author SHA1 Message Date
OG T
96d5e18924 fix(p0): benchmark-driven corrections: tune timeouts to measured latency, remove cloud Nemotron from _local_fallback_chain
- config.py: NEMOTRON_DIAGNOSE_TIMEOUT_SECONDS=60s (NIM measured at 11-45s + 15s buffer)
- config.py: OLLAMA_DIAGNOSE_TIMEOUT_SECONDS=200s (Ollama measured at ~173s + 27s buffer)
- ollama.py: add a per-task timeout (diagnose/force_local use 200s)
- ai_router.py: remove Nemotron from _local_fallback_chain (NIM is cloud-hosted and must not enter the local chain)
- ai_router.py: v4.2, Option C scenario-based routing formally adopted
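
The timeout values above follow a "measured worst case plus buffer" rule. A minimal Python sketch of that arithmetic and of a per-task lookup is shown here. Only the two constant names and the measured figures come from the commit message; `select_timeout`, the lookup table, and the 45s default are hypothetical and are not taken from config.py or ollama.py.

```python
# Measured latencies and buffers quoted in the commit message.
NIM_MEASURED_MAX_S = 45       # NIM measured at 11-45s
NIM_BUFFER_S = 15
OLLAMA_MEASURED_S = 173       # Ollama measured at ~173s
OLLAMA_BUFFER_S = 27

NEMOTRON_DIAGNOSE_TIMEOUT_SECONDS = NIM_MEASURED_MAX_S + NIM_BUFFER_S    # 60
OLLAMA_DIAGNOSE_TIMEOUT_SECONDS = OLLAMA_MEASURED_S + OLLAMA_BUFFER_S    # 200

# Hypothetical default for tasks that have no dedicated timeout.
DEFAULT_TIMEOUT_SECONDS = 45

# Hypothetical lookup table keyed by (provider, task_type).
_TASK_TIMEOUTS = {
    ("nemotron", "diagnose"): NEMOTRON_DIAGNOSE_TIMEOUT_SECONDS,
    ("ollama", "diagnose"): OLLAMA_DIAGNOSE_TIMEOUT_SECONDS,
    ("ollama", "force_local"): OLLAMA_DIAGNOSE_TIMEOUT_SECONDS,
}


def select_timeout(provider: str, task_type: str) -> int:
    """Return the timeout in seconds for a provider/task pair."""
    return _TASK_TIMEOUTS.get((provider, task_type), DEFAULT_TIMEOUT_SECONDS)
```

Deriving each constant from its measurement keeps the buffer visible when the benchmark is rerun.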

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 00:29:09 +08:00
OG T
4912c7f307 fix(phase25): Chief Architect Review R2 fixes (I1/I2/I3/I4/C3/M1)
I1: auto_repair_service: add _pending_tasks GC protection for the anti_pattern task on the failure branch
C3: drift_remediator: _kubectl_apply() now filters by resource_key scope (fixes the bug where the parameter was accepted but ignored)
M1: drift_remediator: mark _git_push() as DISABLED to prevent accidental enablement
I2: drift.py: remove the dead adopt() endpoint link from the Telegram notification
I3: drift/page.tsx: handleScan POST body namespace→namespaces (aligns with the backend DriftScanRequest)
I4: drift/page.tsx: replace hardcoded English strings with t('loading')/t('highCount')/t('mediumCount')
i18n: add the drift.loading key to zh-TW.json + en.json

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 00:22:38 +08:00
OG T
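a562db4048 fix(phase25): Chief Architect Review C1/C2/I1/I3 fixes
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 57s
C1: NemotronProvider.privacy_level cloud→local
    NIM is deployed on the internal network at 192.168.0.188, not the official cloud API
    It can therefore sit inside the DIAGNOSE _local_fallback_chain privacy boundary

C2: adopt() endpoint suspended, returns 501
    Running git add -A from the API Pod is a security risk
    To be reimplemented with the Gitea PR API once ADR-057 is drafted

I1: timeout log fix: log the timeout value actually applied
    Previously always logged NEMOTRON_TIMEOUT_SECONDS=45
    Now logs the correct value selected by task_type

I3: route_sync(): add the DIAGNOSE privacy boundary
    async route() already had _local_fallback_chain
    The sync version omitted it; added in this commit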

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 18:00:05 +08:00
OG T
c4eafd2a5b fix(ai-router): exclude selected_model from fallback_models to avoid duplicates
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 6m58s
After the DIAGNOSE intent is routed to Nemotron, OPENCLAW_NEMO in the fallback_chain
uses the same model string, so fallback_models contained the already-selected model.

Fix: filter out any model string identical to selected_model.
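
The filter described above can be sketched in a few lines of Python. This is an illustration only: `build_fallback_models` and the model strings are hypothetical, and the real logic in ai_router.py is not shown in this log.

```python
def build_fallback_models(fallback_chain: list[str], selected_model: str) -> list[str]:
    """Drop every entry whose model string equals the already-selected model."""
    return [model for model in fallback_chain if model != selected_model]


# Hypothetical model strings, for illustration only.
chain = ["nemotron-model", "ollama-model", "nemotron-model"]
fallbacks = build_fallback_models(chain, selected_model="nemotron-model")
```

Comparing on the model string rather than on the provider name matters here, because two providers can resolve to the same model.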

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 17:43:44 +08:00
OG T
8056be5847 feat(ai-router): upgrade the DIAGNOSE intent override to Nemotron (P0) 2026-04-04 17:41:45 +08:00
OG T
3455044457 feat(phase25): Nemotron proactive defense, full implementation of all three tracks (P0+P1+P2)
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 38s
Type Sync Check / check-type-sync (push) Failing after 35s
P0 - DIAGNOSE Privacy-First Routing:
- ai_router.py: _local_fallback_chain [NEMOTRON→OLLAMA→REJECT]
- DIAGNOSE intent override changed to NEMOTRON (was OLLAMA)
- DIAGNOSE fallback uses the local-only chain and never touches the cloud
- If every provider fails: REJECT + Telegram notification
- config.py: NEMOTRON_DIAGNOSE_TIMEOUT_SECONDS=30, OLLAMA_DIAGNOSE_TIMEOUT_SECONDS=60
- nemotron.py: select the timeout based on context[task_type]
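
A local-only chain that ends in REJECT, as listed under P0, might look roughly like the following Python sketch. `run_local_chain`, `ChainResult`, and the `notify` callback are hypothetical names; the provider callables stand in for the real Nemotron and Ollama clients, whose interfaces are not shown in this log.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ChainResult:
    provider: str                      # provider name, or "REJECT"
    answer: str = ""
    errors: list = field(default_factory=list)


def run_local_chain(
    prompt: str,
    providers: list[tuple[str, Callable[[str], str]]],
    notify: Callable[[str], None],
) -> ChainResult:
    """Try each local provider in order; reject and notify if all fail."""
    errors = []
    for name, call in providers:
        try:
            return ChainResult(provider=name, answer=call(prompt), errors=errors)
        except Exception as exc:  # any provider failure moves on to the next one
            errors.append(f"{name}: {exc}")
    # No cloud provider is ever tried: the request is rejected instead.
    notify("DIAGNOSE rejected, all local providers failed: " + "; ".join(errors))
    return ChainResult(provider="REJECT", errors=errors)
```

Rejecting instead of falling through to a cloud model is what keeps DIAGNOSE data inside the privacy boundary.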

P1 - Knowledge Auto-Harvesting:
- models/knowledge.py: EntryType.AUTO_RUNBOOK + ANTI_PATTERN + symptoms_hash
- EntryStatus.PUBLISHED (ANTI_PATTERN is published directly, no review needed)
- models/playbook.py: SymptomPattern.compute_hash() (16-character deterministic hash)
- services/runbook_generator.py: NemotronRunbookGenerator (v1.1)
  - generate_runbook() → AUTO_RUNBOOK (DRAFT) + Telegram review card
  - generate_anti_pattern() → ANTI_PATTERN (PUBLISHED) + Telegram notification
  - Uses nvidia.chat() (the correct interface); Minimal fallback when Nemotron times out
- knowledge_service.py: check_anti_pattern(symptoms_hash, days=7)
- db/models.py: symptoms_hash VARCHAR(16) + ix_knowledge_symptoms_hash
- repositories/knowledge_repository.py: create() supports symptoms_hash + status
- auto_repair_service.py: anti_pattern_gate in decide() + runbook hook in execute()
- migrations/phase8_symptoms_hash.sql: ALTER TABLE + partial index + PUBLISHED constraint
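
The P1 list mentions a 16-character deterministic hash that fits the VARCHAR(16) column. One way to get that property is sketched below in Python. This is an assumption, not the actual SymptomPattern.compute_hash(): the choice of SHA-256, the normalization, and the function name `compute_symptoms_hash` are all hypothetical.

```python
import hashlib


def compute_symptoms_hash(symptoms: list[str]) -> str:
    """Return a 16-character hex digest that ignores order, case, and duplicates."""
    normalized = sorted({s.strip().lower() for s in symptoms if s.strip()})
    digest = hashlib.sha256("\n".join(normalized).encode("utf-8")).hexdigest()
    return digest[:16]  # truncated to fit a VARCHAR(16) column
```

Sorting and de-duplicating before hashing means the same set of symptoms always maps to the same key, which is what a lookup such as check_anti_pattern(symptoms_hash, days=7) relies on.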

P2 - Config Drift Detection:
- models/drift.py: DriftItem/DriftReport/DriftLevel/DriftIntent/DriftStatus
- services/drift_detector.py: GitStateReader + K8sStateReader + DriftDetector
- services/drift_analyzer.py: whitelist filtering + DriftLevel classification
- services/drift_interpreter.py: NemotronDriftInterpreter (intent analysis only; does not generate remediation commands)
- services/drift_remediator.py: rollback(kubectl apply) + adopt(git push gitea)
- api/v1/drift.py: POST /scan, GET /reports, POST /rollback, POST /adopt
- migrations/phase9_drift_reports.sql: drift_reports table
- k8s/drift-cronjob.yaml: CronJob for an automatic hourly scan

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 12:35:05 +08:00
OG T
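df3ef9006c fix(auto-repair): Chief Architect Review: 4 Critical/Important fixes
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m2s
Critical #1: move the KM write task out of try/except
- The KM write in _trigger_learning sat inside the try block, so nothing was written to KM when learning failed
- Moved after the except so it is written on both success and failure
- Removed the redundant import asyncio (already imported at module top level)
- Minor: approval.incident_id or None guards against empty strings

Important #2: add PRIMARY KEY in the migration
- playbook_id promoted from UNIQUE to PRIMARY KEY
- ALTER TABLE ADD PRIMARY KEY already executed on the prod DB

Important #3: s.sequence→s.step_number, s.description→s.command
- embed_playbook() used nonexistent field names, so RAG vector indexing failed silently
- Correct RepairStep fields: step_number, command

Important #1: PlaybookService._get_rag_service no longer caches at the Service layer
- Now calls the factory get_playbook_rag_service() on every call
- Prevents a stale instance from bypassing the factory's is_closed rebuild logic

Cold-start fix (Chief Architect suggestions B+C):
- After a successful run, _trigger_playbook_extraction automatically sets
  execution_success=True, effectiveness_score=4, status=RESOLVED
- skip path logger.debug → logger.info for better observability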

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 12:02:03 +08:00
OG T
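72d7536ead feat(auto-repair): complete auto-repair closed loop + KM capture integration
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
1. DB Migration: playbooks table (phase7_playbooks_table.sql)
   - This was the root cause of auto-repair failing to start: the table had never been created
   - 5 indexes: status/tags/alert_names/source_incidents/created_at
   - Already executed on the prod DB

2. playbook_service: automatically persist to KM after extraction
   - fire-and-forget _write_to_km() after extract_from_incident() completes
   - Content includes the symptom pattern, repair steps, confidence, and source Incident

3. approval_execution: persist execution results to KM
   - fire-and-forget _write_execution_result_to_km() after _trigger_learning()
   - Both success and failure records are written, category=execution_result

Complete closed loop:
Alert → AI analysis → Playbook lookup → Decision → Execution → Write result to KM
                                                                 ↓
                                        Incident resolved → KM(knowledge_extractor)
                                                          → Playbook extraction → KM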

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 11:54:15 +08:00
OG T
429d81d29b fix(knowledge): I2+I3 Chief Architect Important fixes: dependency injection + finer-grained exception handling
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
I2: KnowledgeService is now injected in DecisionManager.__init__
    _query_kb_context_inner uses self._knowledge_svc, removing the in-function import coupling

I3: finer-grained exception handling in _query_kb_context
    - asyncio.TimeoutError → warning (expected degradation)
    - ConnectionError/OSError → warning (Ollama connectivity issue, expected degradation)
    - Exception → error (unexpected; raises monitoring visibility)
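
The three-way split in I3 can be expressed as a small classifier. The Python sketch below is illustrative: `classify_kb_failure` is a hypothetical helper, and the real handler in _query_kb_context presumably uses except clauses rather than isinstance checks.

```python
import asyncio
import logging


def classify_kb_failure(exc: BaseException) -> int:
    """Map a KB lookup failure to the logging level described in I3."""
    # Checked first so timeouts keep their own branch. On Python 3.11+
    # asyncio.TimeoutError is an alias of the built-in TimeoutError,
    # which is itself a subclass of OSError.
    if isinstance(exc, asyncio.TimeoutError):
        return logging.WARNING          # expected degradation
    if isinstance(exc, (ConnectionError, OSError)):
        return logging.WARNING          # connectivity problem, expected degradation
    return logging.ERROR                # unexpected, should be visible in monitoring
```

Keeping expected failures at warning level means an error-level log line signals something that was not anticipated.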

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 11:51:43 +08:00
OG T
f846000c8c fix(knowledge): C1 Chief Architect mandatory fix: 5-second hard timeout on _query_kb_context
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
C1 fix (Chief Architect Review 74/100 → conditional pass):
- Extract _query_kb_context_inner to hold the actual query logic
- _query_kb_context wraps it in asyncio.wait_for(timeout=5.0)
- An Ollama hang or slow response costs at most 5s, protecting the 30s decision SLA
- On timeout, logger.warning("kb_rag_timeout") and degrade silently
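
The wrapper pattern described in C1 can be sketched as follows in Python. The function signature and the empty-string fallback are assumptions; only asyncio.wait_for, the 5.0s value, and the "kb_rag_timeout" log key come from the commit message.

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger("decision_manager")

KB_QUERY_TIMEOUT_SECONDS = 5.0


async def query_kb_context(
    inner: Callable[[], Awaitable[str]],
    timeout: float = KB_QUERY_TIMEOUT_SECONDS,
) -> str:
    """Run the inner KB query with a hard timeout; return "" on timeout."""
    try:
        return await asyncio.wait_for(inner(), timeout=timeout)
    except asyncio.TimeoutError:
        # Degrade silently: the main analysis continues without KB context.
        logger.warning("kb_rag_timeout")
        return ""
```

asyncio.wait_for cancels the inner coroutine when the timeout fires, so a hung lookup does not keep running in the background.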

Also removes the emoji from the LLM prompt (## 📚 → ## Knowledge Base)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 11:48:57 +08:00
OG T
860dc1d892 feat(knowledge): KB Phase 2: OpenClaw RAG integration
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
_dual_engine_analyze adds _query_kb_context():
- Semantic search for relevant KB entries before Incident analysis (top-3, threshold=0.4)
- Inject the KB context into expert_context.diagnosis_context passed to the LLM
- Degrade silently on failure without affecting the main analysis flow
- dual_engine_llm_win log gains a kb_rag field for observing the RAG hit rate

Architecture: _query_kb_context calls the Service layer via get_knowledge_service(),
in line with leWOOOgo modularity: decision_manager does not access DB/pgvector directly

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 11:46:47 +08:00
OG T
d0f09705e5 fix(auto-repair): fix three root causes blocking auto-repair
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
1. playbook_rag: the Ollama embedding http_client is is_closed after a rolling restart
   - Add _get_http_client(), which detects is_closed and rebuilds automatically
   - singleton get_playbook_rag_service() adds an is_closed rebuild check

2. telegram: add an ai_model field showing the underlying model that made the decision
   - TelegramMessage.ai_model field
   - format() / format_with_nemotron() display "Nemotron (nemotron-70b)"
   - openclaw proposal_dict gains a model field
   - Wired through decision_manager / send_approval_card

3. DB: purge 9 zombie PENDING rows from 3/26 (mock_fallback CRITICAL test records)
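
Item 1 above describes rebuilding a client once it reports is_closed. A minimal Python sketch of that lazy-rebuild pattern follows. `FakeHttpClient` is a hypothetical stand-in for the real HTTP client, and this class body is an assumption about the shape of the service, not the project's code.

```python
class FakeHttpClient:
    """Hypothetical stand-in for an HTTP client that exposes is_closed."""

    def __init__(self) -> None:
        self.is_closed = False

    def close(self) -> None:
        self.is_closed = True


class PlaybookRagService:
    def __init__(self, client_factory=FakeHttpClient) -> None:
        self._client_factory = client_factory
        self._http_client = None

    def _get_http_client(self):
        """Return a usable client, rebuilding it if missing or closed."""
        if self._http_client is None or self._http_client.is_closed:
            self._http_client = self._client_factory()
        return self._http_client
```

Routing every request through the getter means a closed client is replaced on the next call rather than failing until the process restarts.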

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 11:46:25 +08:00
OG T
12bc94796a fix(knowledge): asyncpg does not support :param::type; use CAST(:param AS vector)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
asyncpg uses $1 positional parameters, so the :emb::vector syntax raised PostgresSyntaxError.
save_embedding and semantic_search both now use the CAST(:emb AS vector) syntax.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 11:43:59 +08:00
OG T
cddc4cb1fc fix(knowledge): Chief Architect Review fixes C1+C2+I1+I2 (71→~88/100)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m16s
C1: IKnowledgeRepository Protocol gains save_embedding + semantic_search +
    list_unembedded_entries, restoring the interface-first protection layer

C2: raw SQL in the Service-layer embed_all_entries moved to Repository.list_unembedded_entries()
    The Service now calls through the Protocol, in line with the leWOOOgo modularity principle

I1: asyncio.create_task results are held in a _pending_tasks set, preventing GC collection and
    Task loss at shutdown; each task is discarded automatically once done

I2: OllamaEmbeddingService is injected in KnowledgeService.__init__ instead of being created on every call,
    so a single instance is reused
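
The I1 fix follows the usual asyncio pattern of keeping a strong reference to each background task. A minimal Python sketch is shown below; the `BackgroundTasks` class and its method names are hypothetical, and only the `_pending_tasks` set and the discard-on-done behavior come from the commit message.

```python
import asyncio


class BackgroundTasks:
    """Hold strong references to fire-and-forget tasks until they finish."""

    def __init__(self) -> None:
        self._pending_tasks: set = set()

    def spawn(self, coro) -> asyncio.Task:
        task = asyncio.create_task(coro)
        self._pending_tasks.add(task)                        # keeps the task alive
        task.add_done_callback(self._pending_tasks.discard)  # drop it once done
        return task

    async def drain(self) -> None:
        """Wait for outstanding tasks, e.g. during shutdown."""
        if self._pending_tasks:
            await asyncio.gather(*self._pending_tasks, return_exceptions=True)
```

The event loop itself keeps only weak references to tasks, which is why a task with no other reference can be garbage-collected before it completes.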

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 11:22:38 +08:00
OG T
8960bba7fe feat(knowledge): pgvector RAG: semantic search + background embedding pipeline
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- repository: save_embedding (raw SQL pgvector cast) + semantic_search (cosine <=>)
- service: create_entry background embed + semantic_search + embed_all_entries batch backfill
- router: GET /semantic-search (q/limit/threshold) + POST /embed-all admin endpoint

Embedding model: nomic-embed-text (Ollama 192.168.0.188, 768 dims)
Index: ivfflat cosine (knowledge_entries.embedding vector(768))

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 11:17:24 +08:00
OG T
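200c382ca4 feat(metrics): wire sparklines to real data + move TOOL_LINKS to the API (2026-04-04 ogt)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m6s
Frontend page.tsx:
- Today's incidents sparkline: hourly incident counts over the past 6 hours (computed from incidents)
- MTTR sparkline: time-to-repair series for each resolved incident (computed from incidents)
- No sparkline is shown when there is no data (undefined renders nothing)
- Remove hardcoded TOOL_LINKS; read tool.url from the API response instead

Backend monitoring.py:
- The dict returned by each probe function gains a "url" field
- Frontend tool links are managed centrally by the backend, solving the multi-environment problem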

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 11:09:04 +08:00
OG T
9e78d5222a feat(group-chat): Option B slash commands: /status /incidents /cost /pods /help (2026-04-03 ogt)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m5s
E2E Health Check / e2e-health (push) Successful in 17s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 20:02:27 +08:00
OG T
e833065043 feat(group-chat): when replying to a Bot message, only the Bot being replied to responds (2026-04-03 ogt)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m0s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 19:48:10 +08:00
OG T
8d09b18477 fix(group-chat): remove dual-AI mutual commentary: a single @mention gets a reply from that AI only, and the dual-AI path no longer cross-comments
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 19:44:49 +08:00
OG T
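79a770ffe5 feat(group): remove automatic AI analysis of alerts, per the Boss's instruction
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m3s
Alerts posted to the group show only the card and no longer trigger OpenClaw/NemoClaw analysis automatically
The Boss and the AIs can discuss alert content manually in the group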

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 19:36:40 +08:00
OG T
b62d7d3eb0 feat(chat): switch OpenClaw to Gemini 2.0 Flash-Lite (cheapest)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Input $0.075/1M, Output $0.30/1M (25% cheaper than Flash)
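
The 25% figure can be checked against the 2.0 Flash prices quoted in commit 10ad2a67c7 further down this log (Input $0.10/1M, Output $0.40/1M). The Python sketch below does that arithmetic and shows a per-call cost calculation; the helper names are hypothetical, and Decimal is used only to keep the cent-level arithmetic exact.

```python
from decimal import Decimal

# USD per 1M tokens, as quoted in the commit messages of this log.
FLASH_LITE = {"input": Decimal("0.075"), "output": Decimal("0.30")}
FLASH = {"input": Decimal("0.10"), "output": Decimal("0.40")}

ONE_MILLION = Decimal(1_000_000)


def call_cost_usd(pricing: dict, input_tokens: int, output_tokens: int) -> Decimal:
    """Cost of one call given token counts and per-million-token prices."""
    return (pricing["input"] * input_tokens + pricing["output"] * output_tokens) / ONE_MILLION


def saving_ratio(cheaper: dict, baseline: dict, kind: str) -> Decimal:
    """Fraction saved on 'input' or 'output' tokens relative to the baseline."""
    return 1 - cheaper[kind] / baseline[kind]
```

Both ratios come out to 0.25: 0.075 / 0.10 and 0.30 / 0.40 are each 0.75 of the Flash price.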

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 19:35:13 +08:00
OG T
6cd4280168 feat(chat): add token + cost statistics for the NemoClaw Claude API
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Claude Haiku 4.5: Input $0.80/1M, Output $4.00/1M
Each reply shows: token count | cost of this call | month-to-date total
Redis key: claude_cost:YYYY-MM, TTL 40 days

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 19:29:22 +08:00
OG T
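781a6dac3e feat(chat): NemoClaw→Claude Haiku API + alerts analyzed by OpenClaw only
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m20s
Boss's instructions (2026-04-03):
1. NemoClaw switches to the Claude API (claude-haiku-4-5) for fast Chinese conversation
2. Group alert analysis triggers OpenClaw only; NemoClaw does not analyze alerts
3. OpenClaw/NemoClaw two-way natural-language conversation is retained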

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 19:19:56 +08:00
OG T
10ad2a67c7 fix(chat): gemini-2.0-flash fix + full-width 小O support + NemoClaw back to NIM
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
1. Gemini model name: gemini-1.5-flash → gemini-2.0-flash (fixes the 404)
2. Cost calculation: 2.0 Flash pricing Input $0.10/1M, Output $0.40/1M
3. Full-width/half-width unification: unicodedata.normalize NFKC, supports full-width input of 「小O」
4. NemoClaw: Ollama on 188 times out under high load; temporarily back to NIM nemotron-mini-4b
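
Item 3 above relies on NFKC normalization folding full-width Latin letters into their ASCII forms. A short Python sketch of that step follows; `mentions_openclaw_alias` is a hypothetical helper, and the lowercase comparison is an assumption based on the 小O / 小o aliases listed in commit 63929a5e87.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Fold full-width characters (e.g. U+FF2F) into their half-width forms."""
    return unicodedata.normalize("NFKC", text)


def mentions_openclaw_alias(text: str) -> bool:
    """True if the message contains the alias in full-width or half-width form."""
    return "小o" in normalize_text(text).lower()
```

NFKC leaves the CJK character 小 unchanged and only rewrites the compatibility form of the letter, so one comparison covers both input styles.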

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 19:17:08 +08:00
OG T
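08b02280f8 feat(chat): Gemini monthly cost cap of $10 USD + cumulative tracking in Redis
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 6m55s
- Check the month-to-date cost before every call; refuse the call once it exceeds $10 USD
- Redis key: gemini_cost:YYYY-MM, TTL 40 days
- Each reply shows: token count | cost of this call | month-to-date total
- When the cap is exceeded, return a warning message to inform the Boss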
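
The check-then-record flow in the list above can be sketched in Python as follows. A plain dict stands in for Redis so the example runs on its own, which means the 40-day TTL is not modeled; `MonthlyBudget` and `BudgetExceeded` are hypothetical names.

```python
from datetime import date
from decimal import Decimal

MONTHLY_CAP_USD = Decimal("10")


class BudgetExceeded(Exception):
    """Raised when the month-to-date cost is already over the cap."""


class MonthlyBudget:
    def __init__(self, store=None, cap: Decimal = MONTHLY_CAP_USD) -> None:
        # A dict stands in for Redis here; key expiry (TTL) is not modeled.
        self._store = {} if store is None else store
        self._cap = cap

    @staticmethod
    def key_for(day: date) -> str:
        return f"gemini_cost:{day:%Y-%m}"

    def check(self, day: date) -> Decimal:
        """Return the month-to-date cost, or raise if it exceeds the cap."""
        spent = self._store.get(self.key_for(day), Decimal("0"))
        if spent > self._cap:
            raise BudgetExceeded(f"month-to-date ${spent} exceeds ${self._cap}")
        return spent

    def record(self, day: date, cost: Decimal) -> Decimal:
        """Add the cost of one call and return the new month-to-date total."""
        key = self.key_for(day)
        self._store[key] = self._store.get(key, Decimal("0")) + cost
        return self._store[key]
```

Because the key includes the year and month, the running total starts from zero each month without any reset job.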

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 19:01:21 +08:00
OG T
2828cd897a feat(chat): OpenClaw→Gemini Flash + NemoClaw→Ollama llama3.2:3b
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Boss's instructions (2026-04-03):
- OpenClaw: Gemini 1.5 Flash API, with token + cost statistics on every reply
- NemoClaw: Ollama llama3.2:3b, fast local responses (3-8s)
- Cost control: Gemini monthly cap of $10 USD

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 18:59:28 +08:00
OG T
fbf122fa1f fix(chat): OpenClaw chats via NIM llama-3.1-8b + NemoClaw timeout 120s + "Boss" form of address
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m9s
1. _call_openclaw: switch to NIM meta/llama-3.1-8b-instruct
   The old analyze/incident is an alert API whose replies are in alert format, unsuitable for conversation
2. _call_nemotron: remove the Ollama fallback, back to NIM only
3. NEMOTRON_TIMEOUT_SECONDS: 55 → 120 (ConfigMap updated)
4. Correct 「統帥」 (Commander) → 「老闆」 (Boss)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 18:41:15 +08:00
OG T
2da8da5a25 fix(chat): OpenClaw chats via Ollama qwen2.5 + NemoClaw gains an Ollama fallback
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 6m51s
Problem: _call_openclaw used the analyze/incident API → replies came back in alert format, not natural language
Fix:
  1. OpenClaw chat → Ollama qwen2.5:7b-instruct (local, fast, no format contamination)
  2. NemoClaw → NIM first, falling back to Ollama llama3.2:3b on timeout

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 18:30:31 +08:00
OG T
d1436157b7 fix(polling): set httpx client timeouts separately, read=50s > getUpdates 40s
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Root cause: httpx.AsyncClient(timeout=30.0) has a 30s read timeout,
     shorter than the 40s long-polling timeout of getUpdates,
     so the client cut off every getUpdates call → the polling loop could not receive messages

Fix: httpx.Timeout(connect=10s, read=50s) lets long polling wait normally

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 18:29:22 +08:00
OG T
dfc1e19c07 fix(group): mutual-commentary follow-ups also set reply_to_message_id to quote the original message
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 18:24:51 +08:00
OG T
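09241f102e fix(group): route group messages before the security interceptor, fixing the whitelist dropping all group messages
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m10s
Root cause: the whitelist in intercept_telegram() holds strings while user_id is an int;
      the type mismatch → exception → telegram_chat_unauthorized → every group message was dropped
Fix: SRE group messages are routed first and bypass the personal whitelist
     (group membership is controlled by the Telegram group admins, so a security boundary already exists)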
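
The str-versus-int mismatch named in the root cause is easy to reproduce in Python. The sketch below shows the membership test failing and one way to avoid it by parsing the whitelist into ints once. `parse_whitelist` and `is_authorized` are hypothetical helpers; the commit itself fixed the symptom by routing group messages ahead of the interceptor, and the exact exception raised in intercept_telegram() is not shown in this log.

```python
def parse_whitelist(raw: str) -> set[int]:
    """Parse a comma-separated config value into a set of integer IDs."""
    return {int(part) for part in raw.split(",") if part.strip()}


def is_authorized(user_id: int, whitelist: set[int]) -> bool:
    return user_id in whitelist


# An int never matches its string form, so a whitelist of strings rejects everyone.
string_whitelist = {"123", "456"}
mismatch = 123 in string_whitelist                             # False

typed_whitelist = parse_whitelist("123, 456")
match = is_authorized(123, typed_whitelist)                    # True
```

Normalizing types at load time keeps every later comparison on the same type.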

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 18:17:22 +08:00
OG T
203855a56e debug(group): add a group_routing_check log to diagnose the chat_id mismatch
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 18:12:07 +08:00
OG T
63929a5e87 feat(group): aliases 小O→OpenClaw 小賀→NemoClaw + force Traditional Chinese for NemoClaw
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m6s
1. telegram_gateway.py: _handle_group_message adds alias routing
   - 小O / 小o → only OpenClaw responds
   - 小賀 / 小贺 → only NemoClaw responds
   - clean_text also strips the alias token

2. chat_manager.py: NEMOCLAW_PERSONA strengthens the Traditional Chinese directive
   - Explicitly states "no English or other languages" to stop Nemotron from defaulting to English replies

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 18:00:51 +08:00
OG T
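699e61ac87 feat(group): two-way group conversation + format Option C + "Boss" form of address
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m11s
1. _handle_group_message: SRE group message routing
   - @OpenClawAwoooI_Bot → only OpenClaw responds
   - @NemoTronAwoooI_Bot → only NemoClaw responds
   - General messages → parallel replies + a second round of mutual commentary
   - Bot messages are ignored automatically (prevents infinite loops)

2. Alert format changed to Option C (Boss's instruction)
   - 【🔴 HIGH】resource_name
   - Block style, without the long ═══ separator lines

3. AI persona now addresses the user as 「老闆」 (Boss)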

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 17:51:48 +08:00
OG T
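d2f02999b7 fix(alert-format): remove the [LLM_OPENCLAW_NEMO] prefix + raise the root-cause/suggestion length limits
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m4s
- root_cause: drop the [source.upper()] prefix and show the AI analysis text directly
- root_cause truncation: 80→150 characters
- suggested_action truncation: 50→80 characters
- The AI provider source already appears in the message header 「🤖 OpenClaw Nemo 仲裁」 (arbitration), so it need not be repeated in the root cause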

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 17:43:19 +08:00
OG T
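50457675ef feat(group): OpenClaw + NemoClaw analyze alerts in parallel (Commander's instruction)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- Both AIs analyze simultaneously without influencing each other (more objective)
- Total wait time = max(OpenClaw, NemoClaw) rather than the sum
- Both reply to the same alert message, appearing side by side in the group
- Fix the noqa for the unused message_id parameter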
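
The "max rather than sum" property comes from awaiting both analyses concurrently. A small Python sketch with asyncio.gather illustrates it; `analyze` uses asyncio.sleep as a stand-in for a model call, and the function names and delays are made up for the example.

```python
import asyncio
import time


async def analyze(name: str, delay: float) -> str:
    """Stand-in for one AI analysis; sleeps instead of calling a model."""
    await asyncio.sleep(delay)
    return f"{name}: done"


async def analyze_in_parallel(delays: dict) -> tuple:
    """Run all analyses concurrently and report the total wall-clock time."""
    start = time.perf_counter()
    results = await asyncio.gather(*(analyze(n, d) for n, d in delays.items()))
    return list(results), time.perf_counter() - start
```

asyncio.gather returns results in the order the awaitables were passed, so the two replies can still be attributed to the right AI even if the slower one was listed first.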

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 17:41:50 +08:00
OG T
209fb8d4dc fix(group): supergroup cross-Bot replies use reply_parameters (Bot API v6.7+)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
The old reply_to_message_id returned 400 on cross-Bot replies in a supergroup
Switch to reply_parameters + allow_sending_without_reply: true

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 17:39:53 +08:00
OG T
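890d438cdf fix(group): align the group alert format with the TelegramMessage template + fix the AI discussion trigger
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- Group alerts now use the ═══ separator format, consistent with personal chat
- Add the "OpenClaw and NemoClaw are analyzing..." notice
- Add a warning log when group_msg_id is empty
- clawbot-v5 STANDBY_MODE: fix the check condition in main.py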

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 17:36:01 +08:00
OG T
c65ed5b1c9 feat(telegram): SRE war-room group Triumvirate (ADR-053)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m6s
- config.py: add OPENCLAW_BOT_TOKEN / NEMOTRON_BOT_TOKEN / SRE_GROUP_CHAT_ID
- telegram_gateway.py: send_to_group / send_as_openclaw / send_as_nemotron / trigger_group_ai_discussion / _send_approval_card_to_group
- send_approval_card asynchronously triggers the group's two-way AI discussion after the alert is sent
- configmap: SRE_GROUP_CHAT_ID=-1003711974679
- secrets: OPENCLAW_BOT_TOKEN / NEMOTRON_BOT_TOKEN with CHANGE_ME placeholders

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 17:16:05 +08:00
OG T
ff5a77f7a9 fix(telegram): enable Polling + fix the InfraAlertMessage format
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 6m52s
1. TELEGRAM_ENABLE_POLLING: false→true
   - clawbot-v5 has stopped polling (STANDBY_MODE)
   - The AWOOOI API takes over, so the Commander can talk with both AIs, OpenClaw and NemoClaw

2. InfraAlertMessage.format() adds a note field
   - NIM slowness is normal and is no longer shown as "auto-repair failed"
   - Shown instead as a 💡 informational hint

3. NIM probe endpoint changed to /v1/models (lightweight, does not trigger billing)
   timeout: 10s → 25s (NIM free-tier cold start)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 16:43:40 +08:00
OG T
15aabd6ac5 fix(chat+nim): fix Chief Architect Review items I1-I4 + S3 (four Important issues)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m9s
I1: chat_manager._call_openclaw timeout=30.0 → read from settings.OPENCLAW_TIMEOUT
I2: nvidia_provider.py stale comment "45" → "55", aligned with the ConfigMap
I3: remove asyncio.shield: after the shield timed out, the task kept running with nothing awaiting it (silent leak)
I4: ChatManager.__init__ removes the repo instance (leWOOOgo forbids a Service from holding a repository)
S3: _check_nemotron_health probe 10s → 25s + lightweight /v1/models endpoint

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 16:36:16 +08:00
OG T
be247d6c5c fix(chat): OpenClaw timeout 30→40s, NemoClaw 50→60s
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 6m51s
The k8s/DB queries in get_system_context() plus the 30s of _call_openclaw
together exceeded the outer 30s shield, so every OpenClaw call timed out.
Relax the timeouts so both AIs have enough time to respond.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 16:27:08 +08:00
OG T
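d8c9e29485 fix(heartbeat): revert the mistaken logic that auto-disabled Nemotron
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 6m53s
Previously, when Nemotron was detected as slow, the code mistakenly ran
ENABLE_NEMOTRON_COLLABORATION=false automatically,
which amounted to switching off a core product feature automatically.

The 11-45s latency of the Nemotron NIM free tier is a known characteristic (recorded in Memory),
not an anomaly that needs automatic repair.

Now: detecting slowness only sends an alert notification and performs no automatic repair.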

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 15:34:34 +08:00
OG T
1430b1283d fix(chat+nvidia): restore the OpenClaw+Nemotron architecture + fix the root cause of the 30s timeout
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
ChatManager restored:
- OpenClaw (188:8088) handles RCA arbitration and is not switched to Gemini (not approved)
- NemoClaw (NVIDIA NIM nemotron-mini-4b) handles supplementary commentary
- Both AIs run in parallel, OpenClaw 30s / NemoClaw 50s timeout
- Supports @openclaw / @nemo to address a specific AI

nvidia_provider.py timeout root-cause fix:
- NVIDIA_TIMEOUT now reads NEMOTRON_TIMEOUT_SECONDS (45s) instead of a hardcoded 30.0
- Memory records NIM free-tier latency of 11-45s; the hardcoded 30s made every slow request time out

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 15:34:02 +08:00
OG T
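d522c51deb fix(infra-alert): Nemotron anomaly alerts use the standard template + real automatic repair
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
1. Add the InfraAlertMessage dataclass, a standard alert format for infrastructure anomalies
   (previously the Nemotron alert was hardcoded text that used no template)

2. Automatically run a repair when a Nemotron anomaly is detected:
   kubectl set env ENABLE_NEMOTRON_COLLABORATION=false
   (previously the command was only printed in the message and never executed)

3. The alert shows the auto-repair result (auto-repaired / failed)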

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 15:29:20 +08:00
OG T
e93ada0452 fix(chat): route OpenClaw through Gemini Flash, remove the Ollama dependency
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m18s
Ollama on 188 is completely stuck (0 bytes/30s timeout) and cannot serve as the chat backend.
Both AIs use Gemini Flash, differentiated by persona and temperature:
- OpenClaw: temperature=0.5 (precise and decisive)
- NemoClaw: temperature=0.9 (divergent analysis)

Also ran kubectl set env ENABLE_NEMOTRON_COLLABORATION=false
so each incident stops wasting 30s waiting on the Nemotron timeout

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 15:20:23 +08:00
OG T
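d9007e6855 feat(chat+monitor): dual-AI chat rewrite + Nemotron health-monitoring alerts
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 6m56s
ChatManager rewrite (Phase 22.6):
- @openclaw <msg> → only OpenClaw responds (Ollama qwen2.5:7b)
- @nemo <msg>     → only NemoClaw responds (Gemini Flash)
- No prefix       → OpenClaw answers first, NemoClaw comments/rebuts

NemoClaw switches to Gemini Flash (dropping NIM nemotron-mini-4b because of its 15s+ response times)

TelegramGateway heartbeat adds a Nemotron health probe:
- Probe the NVIDIA NIM API on every heartbeat (10s timeout)
- On anomaly, immediately send a Telegram alert + mitigation command
- Closes the monitoring blind spot where Nemotron timed out 100% of the time with no alert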

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 14:59:06 +08:00
OG T
c1834a7156 feat(kb+apm): KB Phase 2-A auto-extraction + KB-D Markdown detail panel + APM trend chart
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m28s
- KB-A: add knowledge_extractor_service.py (local inference with Ollama llama3.2:3b)
- KB-A: incident_service.py resolve hook (fire-and-forget asyncio.create_task)
- KB-D: introduce react-markdown + remark-gfm for Markdown rendering in the knowledge-base detail panel
- KB-D: wire the approve/archive buttons to the API (POST /knowledge/{id}/approve, PATCH status)
- KB-D: i18n adds approving/archiving loading-state text
- APM: apm/page.tsx integrates the TimeSeriesChart sparkline (uses the trend[] field)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 14:40:27 +08:00
OG T
e60225ea29 fix(ai): I1+I3: Redis TTL + openclaw_nemo naming alignment
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 36s
I1: ai_control.py adds a 30-day TTL to every key it writes to Redis
    Prevents ai:control:* keys from accumulating forever and leaking memory

I3: ai_rate_limiter.py "nvidia" key → "openclaw_nemo"
    Aligns with the Phase 24 AIProviderEnum so the rate limit takes effect correctly

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 13:22:36 +08:00
OG T
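e7b4f43b60 fix(knowledge): use a route without a trailing slash to avoid the 307 redirect
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 6m49s
Using GET "" instead of "/" lets /api/v1/knowledge respond directly
without triggering FastAPI's trailing-slash 307 redirect.
This fix provides a second safeguard alongside ProxyHeadersMiddleware.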

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 12:55:18 +08:00