Commit Graph

76 Commits

Author SHA1 Message Date
OG T
cf5eb71ea6 fix(phase34): polling loop 補圖片路由 — _handle_chat_message photo handler
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
text=None 時直接 return,導致圖片訊息被丟棄
在 text 檢查前插入 photo 路由,呼叫 image_analysis_service

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 13:58:05 +08:00
OG T
7768924fea fix(flywheel): 自動修復後移除 Telegram 按鈕 + 心跳告警排除飛輪
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 6m56s
問題: 自動修復成功後 Telegram 卡片仍顯示批准/拒絕/靜默按鈕

Fix 1 — Telegram 卡片回饋閉環 (積木化合規):
- telegram_gateway.send_approval_card: 發送後自動存 tg_approval:{id} 到 Redis
- telegram_gateway.mark_auto_repaired(): 新方法 — 移除按鈕 + reply 結果
- _try_auto_repair_background: 改呼叫 gateway.mark_auto_repaired() (Service 層)

Fix 2 — 心跳/看門狗告警排除飛輪:
- constants.py: is_heartbeat_alertname() + HEARTBEAT_ALERT_NAMES
- NoAlertsReceived2Hours 等不觸發 _try_auto_repair_background

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 11:52:04 +08:00
OG T
88ac1c7f50 feat(phase27): 歷史按鈕雙層頻率統計 + DB frequency_snapshot 持久化
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m44s
- telegram_gateway: _send_incident_history 改為 Phase 27 雙層策略
  Layer 1: DB frequency_snapshot (建立時刻永久快照)
  Layer 2: Redis AnomalyCounter disposition 累積統計 (35d TTL)
  修復舊版呼叫 record_anomaly() 導致誤計數的 bug
- 新增 migration: phase27_incident_frequency_snapshot.sql (已在 prod 執行)
- CLAUDE.md: 精簡至 123 行,減少 Token 消耗

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 01:06:51 +08:00
OG T
ae90c36cd7 fix(telegram): _send_incident_history 加入 freq=None fallback — 無頻率統計資料
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
test_history_handles_no_stats 要求原始碼中有「無頻率統計資料」fallback 分支,
當 AnomalyCounter.record_anomaly() 回傳 None 時顯示此訊息而非繼續處理。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 01:01:19 +08:00
OG T
e59f8181b3 fix(telegram): 歷史按鈕改從 AnomalyCounter(Redis) 讀頻率,修復永遠顯示「無頻率統計資料」
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m45s
根本原因: frequency_stats 從未持久化到 DB,get_by_id() 回傳永遠是 None
修復: 用 AnomalyCounter.derive_key_from_incident() 推導 anomaly_key,
直接從 Redis 查即時頻率與處置分佈統計

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 00:56:23 +08:00
OG T
d324dd7aed fix(telegram): 移除所有告警訊息欄位截斷限制,放寬至 Telegram 4096 字元硬限
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
問題:root_cause[:50/100]、suggested_action[:80]、suggestion[:50]、
note[:150]、fix_action[:100]、impact[:150]、hypothesis[:200]
以及 message[:900]/[:1000] 導致告警內容顯示不完整。

修復:移除欄位截斷,整體上限改為 4096(Telegram API 硬限制)。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 23:32:51 +08:00
OG T
eb46079b4a fix(telegram): root_cause 顯示長度 50→100 字元,符合 SOUL.md 鐵律
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
SOUL.md 明定根因摘要上限 100 字元,但程式碼兩處 IncidentApprovalCard
均截在 [:50],導致告警卡片訊息被截斷。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 23:30:58 +08:00
OG T
896bef94ee fix(web): pending-approvals-card 加防重複點擊 + loading 狀態
linter 自動強化: actioningId state 防止同一張卡重複操作
- disabled + opacity 0.6 + cursor not-allowed
- loading 時按鈕顯示 '...'
- finally() 確保 actioningId 清除

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:38:08 +08:00
OG T
1483218bab feat(approval): 批准/拒絕後立即回應 Telegram + 持久化 message_id 到 DB
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 13m9s
問題:按下 TG 批准/拒絕按鈕後完全沒有任何回應,使用者不知道是否成功。
      Telegram message_id 只存 Redis 24h TTL,過期後無法追蹤。

修正:
- approval_records 加 telegram_message_id / telegram_chat_id 欄位(已 ALTER TABLE)
- approval_db.update_telegram_message() — 持久化 message_id 到 DB
- decision_manager: 發送告警卡片後同時寫 Redis + DB
- telegram_gateway._notify_approval_result() — 批准/拒絕後:
    1. editMessageReplyMarkup 移除批准/拒絕按鈕,保留資訊按鈕
    2. sendMessage reply_to 在原訊息下回覆狀態行
    3. fallback: send_notification 發新訊息
- _handle_group_command: chat_id 改為 _chat_id 消除 IDE lint

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:19:31 +08:00
OG T
d8c2969341 feat(telegram): AI 鏈路透明化 — 告警訊息顯示 OpenClaw + Tool Calling 模型/後端
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 12m12s
- nemotron.py: 偵測 OllamaToolProvider vs NvidiaProvider,記錄 tool_model/tool_backend
- openclaw.py: 傳播 nemotron_tool_model/nemotron_tool_backend 到 proposal
- decision_manager.py: 從 proposal_data 提取並傳給 send_approval_card()
- telegram_gateway.py: TelegramMessage 新增兩個欄位,format_with_nemotron 顯示
  "🔧 Tool Calling: llama3.1:8b (Ollama 本機)" 或 "NVIDIA 雲端"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 15:05:16 +08:00
OG T
4f80ba38c0 feat: 告警狀態變更在原訊息延續 (方案 B)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 12m28s
**telegram_gateway.py**
- 新增 append_incident_update(incident_id, status_line)
  - 從 Redis tg_msg:{id} 取 message_id
  - editMessageReplyMarkup: 移除 Row1(批准/拒絕/靜默),保留 Row2(詳情/重診/歷史)
  - sendMessage reply_to_message_id: 在原訊息下方追加狀態行
  - 找不到 message_id 回傳 False(呼叫方自行 fallback)

**decision_manager.py**
- _push_decision_to_telegram: send_approval_card 後存 tg_msg:{id}=message_id (TTL 24h)
- _push_auto_repair_result: 改用 append_incident_update,找不到 message_id 才 fallback 新訊息

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 14:21:33 +08:00
OG T
c5e475121a fix(telegram): 修復建議指令被截斷 + decision_manager enum string 補正
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
根因 1: telegram_gateway.py suggested_action[:35] 剛好截到 deployment/ 後
  → 改為 [:80],完整顯示 kubectl command

根因 2: 舊 Incident proposal_data 存 enum string (RESTART_DEPLOYMENT)
  → decision_manager.py 加入偵測,用規則引擎重新查 kubectl command

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 11:14:30 +08:00
OG T
0f5fecfef5 fix(sprint5.1): 首席架構師審查修正 — S1×4 S2×2 S3×1
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m40s
S1-1: service_registry/velero_client/preflight_service 改用 structlog
S1-2: velero_client datetime.now(UTC) 改用 now_taipei()(台北時區鐵律)
S1-3: Guardrail 失敗改為保守拒絕(原放行方向與安全目標相悖)
S1-4: service_registry import 移至模組頂部(移除函數內 import)
S2-1: telegram_gateway T1-T6 六個通知方法補齊 try/except
S2-2: webhooks.py Langfuse URL 改用 settings.LANGFUSE_URL(移除硬寫內網 IP)
S3-3: velero_client trigger_emergency_backup 改為 kubectl apply Backup CRD
      (原 kubectl create backup 語法不存在,審查發現靜默失敗風險)

審查評分: 70/100 → 修正後預計 90+/100

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 16:36:18 +08:00
OG T
88696dba9b feat(sprint5.1): Data Safety Guardrails 全鏈路整合 (L1-L5)
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m33s
Type Sync Check / check-type-sync (push) Failing after 58s
Layer 0 - K8s RBAC:
  - k8s/rbac/api-velero-reader.yaml: awoooi-executor SA Velero backup reader

Layer 1 - DB Migration (已在 188 執行):
  - M-002: approval_records 新增 approval_level/votes/required_votes
  - M-003: alert_event_type ENUM 新增 8 個值

Layer 2 - IaC:
  - ops/config/service-registry.yaml: 全服務 Stateful 分級清單 (BLOCK/CRITICAL_HITL/STANDARD_HITL/AUTO)

Layer 3 - Python Services:
  - service_registry.py: 讀取 YAML,提供 is_blocked/requires_multisig/get_required_votes
  - velero_client.py: kubectl 查詢 Velero 備份年齡,失敗 fallback 999h
  - preflight_service.py: Pre-flight 安全檢查 (Q2/Q4 決策)

Layer 1-M001 - Playbook model:
  - playbook.py: 新增 requires_approval_level/stateful_targets/requires_pre_backup

Layer 4 - 業務邏輯:
  - alert_operation_log_repository.py: 新增 8 個 event_type (Guardrail/Pre-flight/MultiSig/備份)
  - auto_repair_service.py: 注入 Service Registry Guardrail 檢查 (BLOCK → 直接拒絕)
  - webhooks.py: ALERT_RECEIVED 溯源記錄 + auto_repair flag Q9 + Langfuse trace_id Q10
  - db/models.py: ApprovalRecord 同步 approval_level/votes/required_votes 欄位
  - docker-health-monitor.sh: 純感知層改造(移除所有 docker restart 邏輯)

Layer 5 - Telegram 通知:
  - telegram_gateway.py: T1-T6 六個新通知方法 (Guardrail/Pre-flight/Backup/MultiSig/ChangeApplied)

參考: ADR-062 Data Safety Guardrails, ADR-063 Service Registry IaC

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 16:24:09 +08:00
OG T
de3935d1d4 feat: Sprint 4 Phase E+F — 前端處置統計 + 週報處置分佈
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m26s
Type Sync Check / check-type-sync (push) Failing after 1m2s
Phase E: 前端頁面
- E1: /reports 完整處置統計儀表板 (已在 Sprint F 完成)
- E2: 首頁 Metrics Strip — 從 disposition API 取得真實自動化率
  優先使用 /stats/disposition auto_rate,fallback 到 incidents 推算
- E3: /auto-repair 處置概況卡片 (已在 Sprint F 完成)
- E4: /neural-command stats tab 處置分佈 (已在 Sprint F 完成)
- E5: i18n 翻譯 zh-TW + en (已在 Sprint F 完成)

Phase F: 週報 + 文件
- F1: WeeklyReportMessage 新增 disposition 5 欄位
  週報格式加「📋 處置分佈」區塊 (自動/冷啟動/人工/手動 + 自動化率)
  weekly_report_service 整合 get_all_disposition_stats()
- message 字數上限從 900 提升到 1200 (適應處置區塊)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 13:02:20 +08:00
OG T
a85e9ced08 feat(api+telegram): Sprint 4 Phase C+D — API 端點 + Telegram 處置統計
Phase C: API 端點
- C1: GET /api/v1/stats/disposition — 完整處置分佈統計
  - DispositionSummary: auto/human/manual/cold_start + auto_rate
  - DispositionByAnomaly: 按異常類型明細 (最多 20 筆)
  - Redis SCAN + HGETALL 聚合
- C2: GET /api/v1/auto-repair/stats 擴充 disposition_summary

Phase D: Telegram 告警格式
- D1: 告警卡片加處置統計行
  - 🤖 自動: N | 👤 審核: N | 🔧 手動: N
  - 自動化率百分比
- D2: 歷史按鈕強化處置分佈明細
  - 完整 5 項計數 + 自動化率

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 12:17:20 +08:00
OG T
f6332b4b2f fix(telegram): 修正 approval_id UUID 轉換錯誤 — 支援 INC-xxx 格式
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 12m24s
_execute_approval_action 用 UUID(approval_id) 但 approval_id 是 INC-xxx,
導致 'badly formed hexadecimal UUID string' 錯誤,簽核無法執行。

修正: 先嘗試 UUID 轉換,失敗則用 incident_id 查出對應的 pending approval UUID。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 11:53:48 +08:00
OG T
09d965dab5 fix(telegram): 修正 editMessageText 400 錯誤 — 先移除按鈕再更新文字
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 12m46s
原因: original_text 來自 message.text (純文字),含 <>&等字符,
     用 parse_mode=HTML 發送時 Telegram 返回 400。

修正:
1. 先呼叫 editMessageReplyMarkup 移除按鈕 (確保按鈕一定消失)
2. 再 html.escape(original_text) 後嘗試更新文字
3. 文字更新失敗不影響整體流程 (按鈕已移除為首要目標)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 22:13:54 +08:00
OG T
8220027298 fix(telegram+cd): 兩個顯示 bug 修正
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
1. Nemotron args 顯示 Python dict 字串問題
   - restart_deployment: {'deployment_name': 'awoo'} → restart_deployment: deployment_name=awoooi-api
   - 改用 key=value 格式化,不再使用 str(dict)[:25]

2. CD 通知 ${MINUTES}/${SECONDS} 等變數未展開
   - TG_MSG 從 env: 移到 run: shell 中組裝
   - env: 中的 shell 變數在 bash 執行前是靜態字串,無法展開

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:47:52 +08:00
OG T
22ee9b2fe3 fix(telegram): answerCallbackQuery result=true 導致 bool is not iterable
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 13m3s
Telegram answerCallbackQuery 成功時返回 {"ok": true, "result": true},
_send_request 中 "message_id" in result["result"] 對 bool 做 in 操作
報 "argument of type 'bool' is not iterable"。

修正:加 isinstance(result_val, dict) 防禦後再做 in 檢查。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:20:54 +08:00
OG T
4b4007db6c feat(telegram): SRE 群組告警格式升級為完整 v7.0
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
_send_approval_card_to_group 改用與個人 chat 相同的 TelegramMessage.format()
格式,包含 SignOz metrics、AI provider/model、Nemotron 協作、異常頻率統計等全部欄位。

統帥指示:群組收到的告警訊息要與個人 chat 格式完全一致。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:11:59 +08:00
OG T
665f93e83f fix(telegram): 首席架構師 R1 修正 — I-1/I-2/M-1/M-2
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
I-1: webhooks/sentry_webhook/signoz_webhook 三個呼叫者補 TODO 說明
     無 incident_id 是已知限制(Approval 路徑未建 Incident 關聯)
I-2: TestPushRequest 新增 incident_id 欄位,使 QA 可驗證按鈕渲染
M-1: 移除 _build_inline_keyboard 呼叫中多餘的 `or message.incident_id`
M-2: 補充 900/1000 截斷長度差異說明

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:07:42 +08:00
OG T
4935cfc346 fix(telegram): 重設計訊息格式 + 修復 detail/reanalyze/history 按鈕失效
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m26s
- format() / format_with_nemotron(): 移除 ═══ 分隔符,改為簡潔換行佈局
- send_approval_card(): 新增 incident_id 參數,傳入 _build_inline_keyboard()
- decision_manager.py: 呼叫 send_approval_card() 時傳入 incident.incident_id
- 問題根因: incident_id 未傳入 _build_inline_keyboard() 導致第二排按鈕從未渲染

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 12:44:13 +08:00
OG T
d0f09705e5 fix(auto-repair): 修復三個阻礙自動修復的根本原因
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
1. playbook_rag: Ollama embedding http_client 滾動重啟後 is_closed
   - 新增 _get_http_client() 偵測 is_closed 自動重建
   - singleton get_playbook_rag_service() 加 is_closed 重建判斷

2. telegram: 加入 ai_model 欄位顯示底層判斷模型
   - TelegramMessage.ai_model 欄位
   - format() / format_with_nemotron() 顯示 "Nemotron (nemotron-70b)"
   - openclaw proposal_dict 加入 model 欄位
   - decision_manager / send_approval_card 串接

3. DB: 清除 9 筆 3/26 殭屍 PENDING (mock_fallback CRITICAL 測試記錄)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 11:46:25 +08:00
OG T
9e78d5222a feat(group-chat): 方案B slash commands — /status /incidents /cost /pods /help (2026-04-03 ogt)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m5s
E2E Health Check / e2e-health (push) Successful in 17s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 20:02:27 +08:00
OG T
e833065043 feat(group-chat): Reply Bot 訊息時只有被Reply的Bot回應 (2026-04-03 ogt)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m0s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 19:48:10 +08:00
OG T
8d09b18477 fix(group-chat): 移除雙AI互相評論 — 單獨@只有該AI回覆,雙AI路徑不再互評
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 19:44:49 +08:00
OG T
79a770ffe5 feat(group): 移除告警自動 AI 分析 — 老闆指示
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m3s
告警發到群組只顯示卡片,不自動觸發 OpenClaw/NemoClaw 分析
老闆和 AI 可手動在群組討論告警內容

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 19:36:40 +08:00
OG T
781a6dac3e feat(chat): NemoClaw→Claude Haiku API + 告警只由 OpenClaw 分析
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m20s
老闆指示 (2026-04-03):
1. NemoClaw 改接 Claude API (claude-haiku-4-5),快速中文對話
2. 群組告警分析只觸發 OpenClaw,NemoClaw 不分析告警
3. OpenClaw/NemoClaw 雙向自然語言對話維持

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 19:19:56 +08:00
OG T
10ad2a67c7 fix(chat): gemini-2.0-flash 修正 + 全形小O支援 + NemoClaw 回 NIM
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
1. Gemini 模型名稱: gemini-1.5-flash → gemini-2.0-flash (404修復)
2. 費用計算: 2.0 Flash 定價 Input $0.10/1M, Output $0.40/1M
3. 全形/半形統一: unicodedata.normalize NFKC,支援「小O」全形輸入
4. NemoClaw: Ollama 188 負載高超時,暫回 NIM nemotron-mini-4b

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 19:17:08 +08:00
OG T
d1436157b7 fix(polling): httpx client timeout 改為分開設定,read=50s > getUpdates 40s
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
根因: httpx.AsyncClient(timeout=30.0) 的 read timeout 30s
     < getUpdates 的 long polling timeout 40s
     導致每次 getUpdates 都被 client 打斷 → polling loop 無法正常收訊息

修法: httpx.Timeout(connect=10s, read=50s) 讓 long polling 正常等待

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 18:29:22 +08:00
OG T
dfc1e19c07 fix(group): 互相評論補充也加 reply_to_message_id 引用原訊息
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 18:24:51 +08:00
OG T
09241f102e fix(group): 群組訊息移到 security interceptor 前 — 修復 whitelist 擋掉所有群組訊息
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m10s
根因: intercept_telegram() 的 whitelist 是字串,user_id 是 int
      型別不匹配 → exception → telegram_chat_unauthorized → 群組訊息全被丟棄
修法: SRE 群組訊息優先路由,不走個人 whitelist
     (群組成員由 Telegram 群組管理員控制,安全邊界已存在)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 18:17:22 +08:00
OG T
203855a56e debug(group): 加 group_routing_check log 診斷 chat_id 不匹配
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 18:12:07 +08:00
OG T
63929a5e87 feat(group): 別名 小O→OpenClaw 小賀→NemoClaw + NemoClaw 強制繁中
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m6s
1. telegram_gateway.py: _handle_group_message 加入別名路由
   - 小O / 小o → 只有 OpenClaw 回應
   - 小賀 / 小贺 → 只有 NemoClaw 回應
   - clean_text 同步移除別名 token

2. chat_manager.py: NEMOCLAW_PERSONA 加強繁體中文強制指令
   - 明確「禁止使用英文或其他語言」防止 Nemotron 自動英文回應

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 18:00:51 +08:00
OG T
699e61ac87 feat(group): 群組雙向對話 + 格式選項C + 老闆稱謂
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m11s
1. _handle_group_message: SRE 群組訊息路由
   - @OpenClawAwoooI_Bot → 只有 OpenClaw 回應
   - @NemoTronAwoooI_Bot → 只有 NemoClaw 回應
   - 一般訊息 → 並行回應 + 互相評論第二輪
   - Bot 訊息自動忽略(防無限循環)

2. 告警格式改選項 C (老闆指示)
   - 【🔴 HIGH】resource_name
   - 區塊式,去掉 ═══ 長分隔線

3. AI persona 改稱呼用戶為「老闆」

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 17:51:48 +08:00
OG T
50457675ef feat(group): OpenClaw + NemoClaw 並行分析告警 (統帥指示)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- 兩個 AI 同時分析,不互相影響(更客觀)
- 總等待時間 = max(OpenClaw, NemoClaw) 而非相加
- 兩者都 reply 同一條告警訊息,並排出現在群組
- 修正 unused message_id parameter noqa

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 17:41:50 +08:00
OG T
209fb8d4dc fix(group): supergroup 跨 Bot reply 改用 reply_parameters (Bot API v6.7+)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
舊的 reply_to_message_id 在 supergroup 跨 Bot 回覆會 400
改用 reply_parameters + allow_sending_without_reply: true

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 17:39:53 +08:00
OG T
890d438cdf fix(group): 群組告警格式對齊 TelegramMessage 模板 + 修復 AI 討論觸發
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- 群組告警改用 ═══ 分隔線格式,與個人 chat 一致
- 加入「OpenClaw 與 NemoClaw 正在分析中...」提示
- 加 group_msg_id 為空時的 warning log
- clawbot-v5 STANDBY_MODE: main.py 檢查條件修正

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 17:36:01 +08:00
OG T
c65ed5b1c9 feat(telegram): SRE 戰情室群組三頭政治 Triumvirate (ADR-053)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m6s
- config.py: 新增 OPENCLAW_BOT_TOKEN / NEMOTRON_BOT_TOKEN / SRE_GROUP_CHAT_ID
- telegram_gateway.py: send_to_group / send_as_openclaw / send_as_nemotron / trigger_group_ai_discussion / _send_approval_card_to_group
- send_approval_card 告警發送後非同步觸發群組 AI 雙向討論
- configmap: SRE_GROUP_CHAT_ID=-1003711974679
- secrets: OPENCLAW_BOT_TOKEN / NEMOTRON_BOT_TOKEN CHANGE_ME 佔位

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 17:16:05 +08:00
OG T
ff5a77f7a9 fix(telegram): 啟用 Polling + 修正 InfraAlertMessage 格式
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 6m52s
1. TELEGRAM_ENABLE_POLLING: false→true
   - clawbot-v5 已停止 polling (STANDBY_MODE)
   - AWOOOI API 接管,統帥可與 OpenClaw/NemoClaw 雙 AI 對話

2. InfraAlertMessage.format() 加入 note 欄位
   - NIM 慢屬正常不再顯示「自動修復失敗」
   - 改為 💡 資訊性提示

3. NIM 探測端點改為 /v1/models (輕量,不觸發計費)
   timeout: 10s → 25s (NIM 免費 tier 冷啟動)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 16:43:40 +08:00
OG T
15aabd6ac5 fix(chat+nim): 修復首席架構師 Review I1-I4 + S3 四項重要問題
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m9s
I1: chat_manager._call_openclaw timeout=30.0 → 讀 settings.OPENCLAW_TIMEOUT
I2: nvidia_provider.py stale comment "45" → "55" 對齊 ConfigMap
I3: asyncio.shield 移除 — shield 超時後 task 繼續跑但無人等待 (silent leak)
I4: ChatManager.__init__ 移除 repo 實例 (leWOOOgo 禁 Service 持有 repository)
S3: _check_nemotron_health probe 10s → 25s + /v1/models 輕量端點

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 16:36:16 +08:00
OG T
d8c9e29485 fix(heartbeat): 撤銷錯誤的 Nemotron 自動關閉邏輯
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 6m53s
之前錯誤地在偵測到 Nemotron 慢時自動執行
ENABLE_NEMOTRON_COLLABORATION=false,
這等於自動關掉產品核心功能。

Nemotron NIM 免費 tier 延遲 11-45s 是已知特性(Memory 有記載),
不是需要自動修復的異常。

現在:偵測慢只發告警通知,不執行任何自動修復。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 15:34:34 +08:00
OG T
d522c51deb fix(infra-alert): Nemotron 異常告警套用標準模板 + 真正自動修復
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
1. 新增 InfraAlertMessage dataclass — 基礎設施異常的標準告警格式
   (之前 Nemotron 告警是硬編碼文字,不走任何模板)

2. 偵測 Nemotron 異常時自動執行修復:
   kubectl set env ENABLE_NEMOTRON_COLLABORATION=false
   (之前只是把指令印在訊息裡,從未執行)

3. 告警顯示自動修復結果 ( 已自動修復 /  失敗)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 15:29:20 +08:00
OG T
d9007e6855 feat(chat+monitor): 雙 AI 對話重寫 + Nemotron 健康監控告警
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 6m56s
ChatManager 重寫 (Phase 22.6):
- @openclaw <msg> → 只有 OpenClaw 回應 (Ollama qwen2.5:7b)
- @nemo <msg>     → 只有 NemoClaw 回應 (Gemini Flash)
- 無前綴           → OpenClaw 先答,NemoClaw 評論/反駁

NemoClaw 改用 Gemini Flash (棄 NIM nemotron-mini-4b 因為 15s+ 回應時間)

TelegramGateway 心跳新增 Nemotron 健康探測:
- 每次心跳探測 NVIDIA NIM API (10s timeout)
- 異常時立刻發 Telegram 告警 + 緩解指令
- 補足 Nemotron 100% 超時卻無告警的監控盲區

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 14:59:06 +08:00
OG T
b6105b8214 fix(ai): 首席架構師審查修復 C1+C2 (Phase 24 C)
C1 — telegram_gateway.py Fail-Closed 白名單:
  白名單為空時 'if whitelist and ...' 為 False → 任何人可執行 /ai
  修復: 'if not whitelist or user_id not in whitelist' Fail-Closed
  加入 whitelist_empty 欄位到 warning log

C2 — openclaw.py list comprehension await 語法錯誤:
  Python 3.11 不支援 list comprehension 中使用 await
  'if not await is_provider_disabled(p)' → SyntaxError
  修復: 改為 for loop 明確 await
  I4: 靜默 except 改為 logger.warning

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 00:42:02 +08:00
OG T
dbe71f82e3 feat(ai): Phase 24 C — Telegram /ai 動態控制 + Redis 狀態管理
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
新增 ai_control.py:
- /ai status: 所有 Provider 狀態 + 路由模式
- /ai router on/off: 動態切換 AIRouter (覆蓋 env var)
- /ai primary <provider>: 設定主要 Provider
- /ai enable/disable <provider>: 控制 Provider 啟停
- /ai cost: 費用統計
- 白名單: OPENCLAW_TG_USER_WHITELIST 保護

telegram_gateway.py:
- _handle_chat_message 加入 /ai 指令攔截路由
- 白名單未授權返回警告

openclaw.py:
- Redis 狀態覆蓋 env USE_AI_ROUTER (/ai router on/off 生效)
- Redis primary_provider 覆蓋路由決策 (/ai primary 生效)
- Redis disabled provider 過濾 (/ai disable 生效)

Redis Keys:
  ai:control:use_router
  ai:control:primary_provider
  ai:control:disabled:<provider>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 00:34:14 +08:00
OG T
d32d84efce feat(telegram): 接通 Phase 22 Nemotron 雙軌顯示 (ADR-044)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
E2E Health Check / e2e-health (push) Has been cancelled
根本原因: format_with_nemotron() 已實作但從未被呼叫
- send_approval_card() 新增 nemotron_enabled/tools/validation/latency 參數
- TelegramMessage 建構時傳入 nemotron 欄位
- nemotron_enabled=true 時自動使用 format_with_nemotron() 格式
- _push_decision_to_telegram() 從 proposal_data 提取並傳遞 nemotron 資料

效果: Telegram 同時顯示 OpenClaw 仲裁 + Nemotron 執行方案雙區塊
2026-04-02 ogt: Phase 22 最後一哩路

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 08:59:03 +08:00
OG T
de04de1d4f fix(telegram): 新增 openclaw_nemo/nvidia_nim 顯示名稱映射
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
E2E Health Check / e2e-health (push) Has been cancelled
- format() 和 format_with_nemotron() 兩處 provider_names 均加入:
  openclaw_nemo → "OpenClaw Nemo"
  openclaw_nvidia_nim → "OpenClaw Nemo"
  openclaw_qwen → "OpenClaw Nemo"
- 修正顯示 "OPENCLAW_NEMO" (大寫) 的問題
- 2026-04-01 ogt: 配合 AI 仲裁架構調整

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 00:20:37 +08:00
OG T
60d2fbaf8c feat(telegram): implement reanalyze button handler, replace placeholder (ADR-050)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 21:08:44 +08:00