OG T
72dd0c5875
fix: Telegram 簽核 gate + 執行結果 reply — 打通人工審核閉環
...
CD Pipeline / build-and-deploy (push) Successful in 14m7s
3 處修復(統帥盤查發現):
1. telegram_gateway.py:4890 — gate 從 execution_triggered 改 approval.status==APPROVED
- 原 gate 靠樂觀鎖旗標,race 時失效(REST+Telegram 同時簽核)
- 與 REST API approvals.py:360 路徑對齊
- 加 Redis lock exec:{approval_id} 60s TTL 防重入
2. telegram_gateway.py:4772 — 拿掉「👀 等待執行」誤導文案
- 批准後一律顯示「⚡ 執行中...」,實際結果由 #3 reply 補上
3. approval_execution.py — 新增 _push_execution_result_to_alert()
- 成功/失敗兩處 fire-and-forget 呼叫
- requested_by=="auto_approve" skip(避免與 _push_auto_repair_result 衝突)
- Redis tg_msg:{incident_id} 查原告警 message_id → reply_to
- 找不到 message_id 靜默不發,不影響執行主流程
防破壞性檢查:
- ✅ 自動執行路徑不受影響(skip via requested_by)
- ✅ Reject 路徑完全不動
- ✅ Redis lock 防重入
- ✅ 132 回歸測試全過
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2026-04-14 19:03:38 +08:00
OG T
c0ba1000f3
Revert "fix(auto-repair): 中低風險+無kubectl_command → TYPE-1 純資訊,不顯示審核按鈕"
...
This reverts commit abf1ffa91e7327a36af93be2742d53dac1933f0d.
2026-04-14 13:33:24 +08:00
OG T
2df4945880
fix(auto-repair): 中低風險+無kubectl_command → TYPE-1 純資訊,不顯示審核按鈕
...
問題: HostHighCpuLoad 等主機層告警 affected_services=[] → OpenClaw 生成
kubectl unknown → safety guard 攔截 → 退回 READY + TYPE-3 帶按鈕卡片
用戶一直看到帶按鈕的中/低風險告警,按鈕無法修復任何東西
修復三處:
1. openclaw.py: _call_openclaw_analyze() 回傳 suggested_action 欄位
+ target_resource 預設改為 "" (避免 "unknown" 進入 safety guard)
2. decision_manager.py: classify_notification() 傳入
suggested_action / risk_level / has_kubectl_command
3. telegram_gateway.py: classify_notification() 新規則 —
無 kubectl_command + risk=low/medium + action=investigate/no_action
→ TYPE-1 (純資訊,無按鈕)
搭配 clawbot-v5 f4b84d7 (OpenClaw prompt CRITICAL RULES) 一起生效
2026-04-14 Claude Sonnet 4.6 Asia/Taipei
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-14 13:33:24 +08:00
OG T
09134f5c47
fix(openclaw): 修復 incident.title + DIAGNOSE→NEMOTRON confidence=0.0
...
CD Pipeline / build-and-deploy (push) Failing after 2m10s
1. telegram_gateway.py:1169 — classify_notification() 仍用 incident.title
改用 alertname + signal annotations 組合 (同 decision_manager.py 修法)
2. ai_router.py — DIAGNOSE 路由暫停 NEMOTRON
NIM tool_call 返回無 confidence → openclaw_analysis_complete confidence=0.0
改為 None (複雜度路由),讓 Gemini/openclaw_nemo 處理 DIAGNOSE
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-12 22:12:55 +08:00
OG T
3d8b0e4f90
fix(adr075): TYPE-3 格式改用 spec 模板 — ACTION REQUIRED + AI深度診斷 + 建議修復動作
...
CD Pipeline / build-and-deploy (push) Failing after 2m15s
- 標頭改為 "{emoji} ACTION REQUIRED | {severity_zh}"
- 新增 "🧠 AI 深度診斷" 區塊 (分析/責任/AI來源)
- 新增 "⚡ 建議修復動作" 區塊 (<code> 格式)
- confidence=0 顯示 "📋 規則分析" 取代誤導性 "🔴 0%"
- SignOz 指標區塊補回 Trace 連結
2026-04-12 ogt: ADR-075 TYPE-3 格式標準化
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-12 21:00:28 +08:00
OG T
a7f2b9c0f5
fix(display): 規則匹配改顯示 ✅ 取代 🔴 0% + 修復 LLM 字串 confidence 解析
...
CD Pipeline / build-and-deploy (push) Has been cancelled
- telegram_gateway.py: confidence==0 (規則匹配/Expert fallback) 不再顯示
「🔴 0%」,改顯示「⚙️ 規則匹配 ✅ 」,兩個 card 類型都修正
- openclaw.py: NIM/Ollama 有時回傳字串 "0.85" 而非 float,導致
isinstance(str, int|float)=False → confidence 被強制設 0.0。
現在先嘗試 float() 解析,解析失敗才 fallback 0.0
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-12 20:50:53 +08:00
OG T
24c1b5677b
feat(adr075): Step1-3 classify補丁+新按鈕+TYPE-5S/6B/7E格式函數
...
Step-1 incident_service.py classify_alert_early():
- 新增 secops (TYPE-5S): UnauthorizedSSH/KubeAudit/CVE/WAFAttack/PodAbnormal
- 新增 business (TYPE-6B): AITokenCost/GeminiAPIError/SLOBurn/MomoScraper
- 新增 flywheel_health MCPProvider/OllamaDown/NemotronDown 前綴
- ssl_cert: 依 days_remaining 決定 TYPE-1(≥14d) vs TYPE-3(<14d)
Step-2 telegram_gateway.py _build_inline_keyboard():
- 新增 secops: [隔離] [封鎖IP] [驅逐] [確認授權]
- 新增 business: [暫停1h] [查SignOz] [忽略]
- 新增 flywheel_health: [觸發診斷] [飛輪面板] [靜默]
Step-3 telegram_gateway.py 新增格式化函數 (Tier 2):
- send_secops_card() — TYPE-5S 防禦按鈕+nonce
- send_business_alert() — TYPE-6B 業務損失速率
- send_escalation_card() — TYPE-7E P0/P1 升級,發 DM+群組
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-12 18:50:37 +08:00
OG T
1cb654cf59
fix(adr-075): CR P0/P1 修補 — TYPE_8M enum + 死碼清理 + docstring 更新
...
CD Pipeline / build-and-deploy (push) Has been cancelled
P0-2: NotificationType 新增 TYPE_8M = "TYPE-8M"
classify_notification 早期回傳 TYPE-8M
decision_manager 改用 NotificationType.TYPE_8M enum 比較(移除字串字面量)
P1-1: 移除 _CATEGORY_BUTTONS 中不可達的 alertchain_health/flywheel_health 條目
P1-4: test_classify_alert_early.py docstring 更新為 13 條規則/10 分類
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-12 18:44:12 +08:00
OG T
561c1d806b
feat(adr-075): Phase 2 — TYPE-8M 飛輪/告警鏈路健康通知格式與路由
...
CD Pipeline / build-and-deploy (push) Failing after 4m0s
新增 send_meta_alert() — ⚙️ META SYSTEM 卡片(觸發診斷/查看面板/靜默)
decision_manager 新增 TYPE-8M elif 分支(在 TYPE-4D 後)
_alert_category 提取提前至 if 鏈前,三個分支共用
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-12 18:39:04 +08:00
OG T
2cef2098d3
feat(adr-075): 修復 Telegram 動態按鈕 4 個斷點 + 新增 7 種告警分類
...
CD Pipeline / build-and-deploy (push) Has been cancelled
斷點 A: decision_manager 提取 alert_category/notification_type 傳入 send_approval_card
斷點 B: send_approval_card 新增參數並傳遞至 _build_inline_keyboard
斷點 C: 互動型通知 (TYPE-3/4/4D/8M) 禁止發 SRE 群組,防 nonce 洩漏
斷點 D: _CATEGORY_BUTTONS k8s_workload → kubernetes + 新增 6 類按鈕組
classify_alert_early 新增: alertchain_health, flywheel_health, storage,
devops_tool, external_site, ssl_cert, host_resource (從 infrastructure 分離)
Test: 52 classify + 664 total passed
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-12 18:35:56 +08:00
OG T
99b489ca63
fix(flywheel): 修補剩餘 P0/P1 缺陷
...
CD Pipeline / build-and-deploy (push) Has been cancelled
- CRITICAL-1: TYPE-1 path approval_id=str(alert_id) → uuid.uuid4(),
避免 UUID(approval_id) 拋 ValueError 導致所有 Heartbeat/Info 告警崩潰
- CRITICAL-2: asyncio.create_task() 結果存入 _exec_task 並加 done_callback,
防止 GC 在執行中途回收任務
- FORMAT: _push_to_telegram_background 新增 notification_type + diff_summary 參數,
TYPE-4D → send_drift_card(),其他 → send_approval_card()(修正 ConfigDrift 顯示錯誤卡片)
- 傳遞 notification_type 至 Alertmanager 兩個呼叫點
ADR-073 四斷點修補最終收尾
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-12 17:14:57 +08:00
OG T
f0e14136ca
fix(flywheel): 修補飛輪四個核心斷點,讓完整流程真正串接起來
...
CD Pipeline / build-and-deploy (push) Has been cancelled
1. incident_service.py: save_to_episodic_memory() 補寫 alertname/notification_type/alert_category
→ 之前這3欄在DB永遠NULL,LLM無alertname,Playbook匹配全失敗
2. telegram_gateway.py: Telegram批准後呼叫 execute_approved_action()
→ 之前sign_approval()只改DB狀態,380筆批准0筆真正執行kubectl指令
3. approval_execution.py: 執行成功後呼叫 resolve_incident()
webhooks.py: auto-repair成功後呼叫 resolve_incident()
→ 之前Incident永遠停在INVESTIGATING,KM轉換永遠不觸發,Playbook=0
4. webhooks.py: TYPE-1告警短路,不進LLM
→ 之前Heartbeat/Backup/Info仍燒LLM token,產生垃圾修復建議
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-12 17:01:10 +08:00
OG T
93f9522d5a
fix(heartbeat): 對齊整點發送避免多replica各自發 + KM向量化改查embedding欄位
...
CD Pipeline / build-and-deploy (push) Successful in 14m10s
- _heartbeat_loop: 先 sleep 到下一個整點倍數再開始循環
避免 3 個 replica 啟動時間不同導致短時間內收到多條心跳
- heartbeat_report_service: km_vectorized 改查 KnowledgeEntryRecord.embedding IS NOT NULL
原本錯誤查 IncidentRecord.vectorized 導致顯示 0/714 (0%)
2026-04-12 ogt (ADR-073 heartbeat fix)
2026-04-12 16:33:15 +08:00
OG T
effd78807e
fix(heartbeat): blocking_timeout 5→0,多 replica 不排隊等鎖避免重複發送
...
CD Pipeline / build-and-deploy (push) Successful in 14m0s
3 個 replica 各自跑 loop,blocking_timeout=5.0 導致鎖釋放後
其他 replica 依序拿鎖,每次心跳最多發 3 條。
改為 blocking_timeout=0:拿不到鎖立刻跳過,同週期只發一條。
2026-04-12 ogt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-12 16:13:41 +08:00
OG T
00a31abb85
feat(heartbeat): ADR-073 P2 心跳整合重構 — HeartbeatReportService + RedisLock
...
CD Pipeline / build-and-deploy (push) Has been cancelled
- 新增 HeartbeatReportService:11 個並行探針(Ollama/Nemotron/Gemini/Claude/MCP×4/ArgoCD/Velero)
- 重寫 send_heartbeat():RedisLock 防重發 + 統一發送 SRE_GROUP_CHAT_ID
- 簡化 _heartbeat_loop():移除散落的 silence 多次發送
- config.py:新增 OLLAMA_REQUIRED_MODELS 欄位
- 03-secrets.example.yaml:補 SRE_GROUP_CHAT_ID 確保 CD Inject 不遺漏
2026-04-12 ogt (ADR-073 Phase 2-3/4)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-12 15:35:13 +08:00
OG T
c09521a1c6
fix(cr): Code Review P0/P1 全修補 — 積木化+SSH路由+安全守衛順序
...
CD Pipeline / build-and-deploy (push) Failing after 2m30s
P0-1: classify_alert_early 移至 incident_service (Service層),webhooks.py import 修正
P0-2: _ssh_execute() 改用 self._ssh,移除冗餘 SSHProvider() 實例化
P1-1: infrastructure SSH routing 移至 kubectl safety guard 之前,docker指令不再被攔截
P1-2: alert_rule_engine 新增 get_risk_for_alertname() public API
P1-3: classify_notification() docstring 修正 ORM→Pydantic
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-12 14:51:12 +08:00
OG T
dbc77c5e62
feat(flywheel): Phase 3 — decision_manager Tier 3 七大修復 (首席架構師授權)
...
CD Pipeline / build-and-deploy (push) Has been cancelled
ADR-073 Phase 3 全部完成:
3-1: TYPE-1 triage guard
- get_or_create_decision() 入口: notification_type=TYPE-1 直接 bypass LLM 分析
- classify_notification() 優先讀 incident.notification_type (早期分診結果)
- ConfigurationDrift/KubeConfigDrift 補入 TYPE-4D 匹配清單
3-2: infrastructure → SSH MCP routing
- _auto_execute() 中 alert_category=infrastructure + 非 kubectl action → _ssh_execute()
- _ssh_execute(): docker_restart / service_restart tool 路由
- 取 instance label 對應 SSH_MCP_ALLOWED_HOSTS 白名單主機
3-3: send_info_notification() TYPE-1 已存在,classify_notification 修復確保正確呼叫
3-4: Dynamic button builder 已存在 _build_inline_keyboard + _CATEGORY_BUTTONS
3-5: action | parse fix
- _auto_execute() 開頭: action 含 | 時取第一段 (LLM 有時輸出 "kubectl X | kubectl get")
3-6: risk_level YAML priority override LLM
- dual_engine_analyze() LLM 結果返回後,用 alert_rules.yaml 對應 rule.risk 覆蓋
3-7: send_drift_card() TYPE-4D 已存在,classify_notification 修復確保正確觸發
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-12 14:39:19 +08:00
OG T
2af4dffcc6
fix(security): Architecture Review 修復 5 項高信心問題
...
安全修復 (P0):
1. ssh_provider: 新增 _validate_param() 白名單驗證,防止 command injection
- container_name/service/filter_name: [a-zA-Z0-9._-]{1,128}
- compose_dir: 必須以 /opt/ 或 /srv/ 開頭,禁止 ..
- domain: FQDN 白名單
- tail/port/lines: int() 轉換 + 上下限夾緊
2. ssh_provider: known_hosts=None 改為讀 SSH_MCP_KNOWN_HOSTS_FILE 環境變數
- 預設仍 None(內網快速啟動),但啟動時寫入 warning log
- 設定文件:ops/runbooks/ssh-mcp-setup.md (待補)
模組化修復 (P1):
3. km_conversion_service: 移除 import 時的 ALERT_EVENT_TYPES.update() 副作用
- ADR-071 event types 移入 alert_operation_log_repository.py 靜態集合
4. telegram_gateway: create_task() 改為 await + try/except
- 避免 DB session 關閉後的競爭條件
- KM 轉換失敗記錄 warning log,不中斷主流程
5. km_conversion_service: 新增頂層 try/except,錯誤一律 error log 後 re-raise
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-11 02:50:26 +08:00
OG T
325b3851b5
feat(adr-071): 告警通知四類型第一批 B/C/E/F/G/H 全實作
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Type Sync Check / check-type-sync (push) Failing after 1m7s
ADR-071-B: classify_notification() — 五型分類器 (TYPE-1/2/3/4/4D)
ADR-071-C: send_info_notification() — TYPE-1 純資訊無按鈕卡片
ADR-071-E: _build_inline_keyboard() — 依 alert_category 動態組合 TYPE-3 按鈕
ADR-071-F: send_drift_card() — TYPE-4D Config Drift 卡片 + Diff 截斷
ADR-071-G: km_conversion_service.py — Incident RESOLVED 自動轉 KM
ADR-071-H: handle_manual_fix_done() — TYPE-4 手動修復 Bot 對話閉環
前批已完成: ADR-071-A (DB Migration) + ADR-071-D (狀態機守衛)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-11 02:24:20 +08:00
OG T
dabc62e0f8
fix(telegram): append_incident_update — 儲存告警卡片 message_id 到 Redis
...
CD Pipeline / build-and-deploy (push) Successful in 14m31s
_send_approval_card_to_group 發出告警卡片後,將 Telegram message_id
存入 Redis tg_msg:{incident_id}(TTL 24h),供後續 append_incident_update
換掉批准按鈕 + reply 狀態。
修復前:tg_msg key 從未被寫入,append 永遠 fallback 發新訊息,
批准按鈕永遠無法被移除。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-10 22:41:30 +08:00
OG T
797c7c749e
fix(nemotron): deepseek-r1 num_predict 400→1200,避免 <think> block 截斷後空回覆
...
CD Pipeline / build-and-deploy (push) Failing after 28s
deepseek-r1:14b 思考 token 超過 400 會在 </think> 前截斷,導致
清理後 body 為空,Telegram 顯示空訊息。
- chat_manager: num_predict 400 → 1200
- telegram_gateway: _clean_ai_reply 空值加 fallback 錯誤提示
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-10 22:35:37 +08:00
OG T
100e4d9b89
fix(chat): AI 回覆截斷問題 — 強制 persona + Markdown 清理 + 600字上限
...
CD Pipeline / build-and-deploy (push) Successful in 14m39s
問題: OpenClaw/NemoClaw 回覆 Markdown 語法 + 超長,Telegram 顯示截斷
修正:
1. chat_manager: _call_openclaw/_call_nemotron 強制前置 persona (含不超過300字規範)
2. telegram_gateway: _clean_ai_reply() 移除 **bold** *italic* # header 語法
移除 deepseek-r1 <think> 標籤,截斷 > 600 字並在段落邊界截
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-10 21:26:15 +08:00
OG T
b9dbbb3575
feat(rag): Telegram /rag 指令 + /rag/optimize ivfflat 端點
...
CD Pipeline / build-and-deploy (push) Successful in 14m9s
- telegram_gateway: /rag <query> → KnowledgeRAGService.query()
_handle_group_command 加 full_text 參數取得完整指令文字
/help 更新加入 /rag 說明
- rag.py: POST /rag/optimize → rag_repo.create_ivfflat_index()
- rag_chunk_repository: create_ivfflat_index() — ivfflat with lists=100
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-10 14:47:21 +08:00
OG T
cf5eb71ea6
fix(phase34): polling loop 補圖片路由 — _handle_chat_message photo handler
...
CD Pipeline / build-and-deploy (push) Has been cancelled
text=None 時直接 return,導致圖片訊息被丟棄
在 text 檢查前插入 photo 路由,呼叫 image_analysis_service
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-10 13:58:05 +08:00
OG T
7768924fea
fix(flywheel): 自動修復後移除 Telegram 按鈕 + 心跳告警排除飛輪
...
CD Pipeline / build-and-deploy (push) Failing after 6m56s
問題: 自動修復成功後 Telegram 卡片仍顯示批准/拒絕/靜默按鈕
Fix 1 — Telegram 卡片回饋閉環 (積木化合規):
- telegram_gateway.send_approval_card: 發送後自動存 tg_approval:{id} 到 Redis
- telegram_gateway.mark_auto_repaired(): 新方法 — 移除按鈕 + reply 結果
- _try_auto_repair_background: 改呼叫 gateway.mark_auto_repaired() (Service 層)
Fix 2 — 心跳/看門狗告警排除飛輪:
- constants.py: is_heartbeat_alertname() + HEARTBEAT_ALERT_NAMES
- NoAlertsReceived2Hours 等不觸發 _try_auto_repair_background
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-10 11:52:04 +08:00
OG T
88ac1c7f50
feat(phase27): 歷史按鈕雙層頻率統計 + DB frequency_snapshot 持久化
...
CD Pipeline / build-and-deploy (push) Failing after 1m44s
- telegram_gateway: _send_incident_history 改為 Phase 27 雙層策略
Layer 1: DB frequency_snapshot (建立時刻永久快照)
Layer 2: Redis AnomalyCounter disposition 累積統計 (35d TTL)
修復舊版呼叫 record_anomaly() 導致誤計數的 bug
- 新增 migration: phase27_incident_frequency_snapshot.sql (已在 prod 執行)
- CLAUDE.md: 精簡至 123 行,減少 Token 消耗
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-10 01:06:51 +08:00
OG T
ae90c36cd7
fix(telegram): _send_incident_history 加入 freq=None fallback — 無頻率統計資料
...
CD Pipeline / build-and-deploy (push) Has been cancelled
test_history_handles_no_stats 要求原始碼中有「無頻率統計資料」fallback 分支,
當 AnomalyCounter.record_anomaly() 回傳 None 時顯示此訊息而非繼續處理。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-10 01:01:19 +08:00
OG T
e59f8181b3
fix(telegram): 歷史按鈕改從 AnomalyCounter(Redis) 讀頻率,修復永遠顯示「無頻率統計資料」
...
CD Pipeline / build-and-deploy (push) Failing after 1m45s
根本原因: frequency_stats 從未持久化到 DB,get_by_id() 回傳永遠是 None
修復: 用 AnomalyCounter.derive_key_from_incident() 推導 anomaly_key,
直接從 Redis 查即時頻率與處置分佈統計
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-10 00:56:23 +08:00
OG T
d324dd7aed
fix(telegram): 移除所有告警訊息欄位截斷限制,放寬至 Telegram 4096 字元硬限
...
CD Pipeline / build-and-deploy (push) Has been cancelled
問題:root_cause[:50/100]、suggested_action[:80]、suggestion[:50]、
note[:150]、fix_action[:100]、impact[:150]、hypothesis[:200]
以及 message[:900]/[:1000] 導致告警內容顯示不完整。
修復:移除欄位截斷,整體上限改為 4096(Telegram API 硬限制)。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 23:32:51 +08:00
OG T
eb46079b4a
fix(telegram): root_cause 顯示長度 50→100 字元,符合 SOUL.md 鐵律
...
CD Pipeline / build-and-deploy (push) Has been cancelled
SOUL.md 明定根因摘要上限 100 字元,但程式碼兩處 IncidentApprovalCard
均截在 [:50],導致告警卡片訊息被截斷。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 23:30:58 +08:00
OG T
896bef94ee
fix(web): pending-approvals-card 加防重複點擊 + loading 狀態
...
linter 自動強化: actioningId state 防止同一張卡重複操作
- disabled + opacity 0.6 + cursor not-allowed
- loading 時按鈕顯示 '...'
- finally() 確保 actioningId 清除
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 18:38:08 +08:00
OG T
1483218bab
feat(approval): 批准/拒絕後立即回應 Telegram + 持久化 message_id 到 DB
...
CD Pipeline / build-and-deploy (push) Successful in 13m9s
問題:按下 TG 批准/拒絕按鈕後完全沒有任何回應,使用者不知道是否成功。
Telegram message_id 只存 Redis 24h TTL,過期後無法追蹤。
修正:
- approval_records 加 telegram_message_id / telegram_chat_id 欄位(已 ALTER TABLE)
- approval_db.update_telegram_message() — 持久化 message_id 到 DB
- decision_manager: 發送告警卡片後同時寫 Redis + DB
- telegram_gateway._notify_approval_result() — 批准/拒絕後:
1. editMessageReplyMarkup 移除批准/拒絕按鈕,保留資訊按鈕
2. sendMessage reply_to 在原訊息下回覆狀態行
3. fallback: send_notification 發新訊息
- _handle_group_command: chat_id 改為 _chat_id 消除 IDE lint
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 18:19:31 +08:00
OG T
d8c2969341
feat(telegram): AI 鏈路透明化 — 告警訊息顯示 OpenClaw + Tool Calling 模型/後端
...
CD Pipeline / build-and-deploy (push) Successful in 12m12s
- nemotron.py: 偵測 OllamaToolProvider vs NvidiaProvider,記錄 tool_model/tool_backend
- openclaw.py: 傳播 nemotron_tool_model/nemotron_tool_backend 到 proposal
- decision_manager.py: 從 proposal_data 提取並傳給 send_approval_card()
- telegram_gateway.py: TelegramMessage 新增兩個欄位,format_with_nemotron 顯示
"🔧 Tool Calling: llama3.1:8b (Ollama 本機)" 或 "NVIDIA 雲端"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 15:05:16 +08:00
OG T
4f80ba38c0
feat: 告警狀態變更在原訊息延續 (方案 B)
...
CD Pipeline / build-and-deploy (push) Successful in 12m28s
**telegram_gateway.py**
- 新增 append_incident_update(incident_id, status_line)
- 從 Redis tg_msg:{id} 取 message_id
- editMessageReplyMarkup: 移除 Row1(批准/拒絕/靜默),保留 Row2(詳情/重診/歷史)
- sendMessage reply_to_message_id: 在原訊息下方追加狀態行
- 找不到 message_id 回傳 False(呼叫方自行 fallback)
**decision_manager.py**
- _push_decision_to_telegram: send_approval_card 後存 tg_msg:{id}=message_id (TTL 24h)
- _push_auto_repair_result: 改用 append_incident_update,找不到 message_id 才 fallback 新訊息
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 14:21:33 +08:00
OG T
c5e475121a
fix(telegram): 修復建議指令被截斷 + decision_manager enum string 補正
...
CD Pipeline / build-and-deploy (push) Has been cancelled
根因 1: telegram_gateway.py suggested_action[:35] 剛好截到 deployment/ 後
→ 改為 [:80],完整顯示 kubectl command
根因 2: 舊 Incident proposal_data 存 enum string (RESTART_DEPLOYMENT)
→ decision_manager.py 加入偵測,用規則引擎重新查 kubectl command
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 11:14:30 +08:00
OG T
0f5fecfef5
fix(sprint5.1): 首席架構師審查修正 — S1×4 S2×2 S3×1
...
CD Pipeline / build-and-deploy (push) Failing after 1m40s
S1-1: service_registry/velero_client/preflight_service 改用 structlog
S1-2: velero_client datetime.now(UTC) 改用 now_taipei()(台北時區鐵律)
S1-3: Guardrail 失敗改為保守拒絕(原放行方向與安全目標相悖)
S1-4: service_registry import 移至模組頂部(移除函數內 import)
S2-1: telegram_gateway T1-T6 六個通知方法補齊 try/except
S2-2: webhooks.py Langfuse URL 改用 settings.LANGFUSE_URL(移除硬寫內網 IP)
S3-3: velero_client trigger_emergency_backup 改為 kubectl apply Backup CRD
(原 kubectl create backup 語法不存在,審查發現靜默失敗風險)
審查評分: 70/100 → 修正後預計 90+/100
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-08 16:36:18 +08:00
OG T
88696dba9b
feat(sprint5.1): Data Safety Guardrails 全鏈路整合 (L1-L5)
...
CD Pipeline / build-and-deploy (push) Failing after 1m33s
Type Sync Check / check-type-sync (push) Failing after 58s
Layer 0 - K8s RBAC:
- k8s/rbac/api-velero-reader.yaml: awoooi-executor SA Velero backup reader
Layer 1 - DB Migration (已在 188 執行):
- M-002: approval_records 新增 approval_level/votes/required_votes
- M-003: alert_event_type ENUM 新增 8 個值
Layer 2 - IaC:
- ops/config/service-registry.yaml: 全服務 Stateful 分級清單 (BLOCK/CRITICAL_HITL/STANDARD_HITL/AUTO)
Layer 3 - Python Services:
- service_registry.py: 讀取 YAML,提供 is_blocked/requires_multisig/get_required_votes
- velero_client.py: kubectl 查詢 Velero 備份年齡,失敗 fallback 999h
- preflight_service.py: Pre-flight 安全檢查 (Q2/Q4 決策)
Layer 1-M001 - Playbook model:
- playbook.py: 新增 requires_approval_level/stateful_targets/requires_pre_backup
Layer 4 - 業務邏輯:
- alert_operation_log_repository.py: 新增 8 個 event_type (Guardrail/Pre-flight/MultiSig/備份)
- auto_repair_service.py: 注入 Service Registry Guardrail 檢查 (BLOCK → 直接拒絕)
- webhooks.py: ALERT_RECEIVED 溯源記錄 + auto_repair flag Q9 + Langfuse trace_id Q10
- db/models.py: ApprovalRecord 同步 approval_level/votes/required_votes 欄位
- docker-health-monitor.sh: 純感知層改造(移除所有 docker restart 邏輯)
Layer 5 - Telegram 通知:
- telegram_gateway.py: T1-T6 六個新通知方法 (Guardrail/Pre-flight/Backup/MultiSig/ChangeApplied)
參考: ADR-062 Data Safety Guardrails, ADR-063 Service Registry IaC
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-08 16:24:09 +08:00
OG T
de3935d1d4
feat: Sprint 4 Phase E+F — 前端處置統計 + 週報處置分佈
...
CD Pipeline / build-and-deploy (push) Failing after 1m26s
Type Sync Check / check-type-sync (push) Failing after 1m2s
Phase E: 前端頁面
- E1: /reports 完整處置統計儀表板 (已在 Sprint F 完成)
- E2: 首頁 Metrics Strip — 從 disposition API 取得真實自動化率
優先使用 /stats/disposition auto_rate,fallback 到 incidents 推算
- E3: /auto-repair 處置概況卡片 (已在 Sprint F 完成)
- E4: /neural-command stats tab 處置分佈 (已在 Sprint F 完成)
- E5: i18n 翻譯 zh-TW + en (已在 Sprint F 完成)
Phase F: 週報 + 文件
- F1: WeeklyReportMessage 新增 disposition 5 欄位
週報格式加「📋 處置分佈」區塊 (自動/冷啟動/人工/手動 + 自動化率)
weekly_report_service 整合 get_all_disposition_stats()
- message 字數上限從 900 提升到 1200 (適應處置區塊)
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2026-04-07 13:02:20 +08:00
OG T
a85e9ced08
feat(api+telegram): Sprint 4 Phase C+D — API 端點 + Telegram 處置統計
...
Phase C: API 端點
- C1: GET /api/v1/stats/disposition — 完整處置分佈統計
- DispositionSummary: auto/human/manual/cold_start + auto_rate
- DispositionByAnomaly: 按異常類型明細 (最多 20 筆)
- Redis SCAN + HGETALL 聚合
- C2: GET /api/v1/auto-repair/stats 擴充 disposition_summary
Phase D: Telegram 告警格式
- D1: 告警卡片加處置統計行
- 🤖 自動: N | 👤 審核: N | 🔧 手動: N
- 自動化率百分比
- D2: 歷史按鈕強化處置分佈明細
- 完整 5 項計數 + 自動化率
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2026-04-07 12:17:20 +08:00
OG T
f6332b4b2f
fix(telegram): 修正 approval_id UUID 轉換錯誤 — 支援 INC-xxx 格式
...
CD Pipeline / build-and-deploy (push) Successful in 12m24s
_execute_approval_action 用 UUID(approval_id) 但 approval_id 是 INC-xxx,
導致 'badly formed hexadecimal UUID string' 錯誤,簽核無法執行。
修正: 先嘗試 UUID 轉換,失敗則用 incident_id 查出對應的 pending approval UUID。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-06 11:53:48 +08:00
OG T
09d965dab5
fix(telegram): 修正 editMessageText 400 錯誤 — 先移除按鈕再更新文字
...
CD Pipeline / build-and-deploy (push) Failing after 12m46s
原因: original_text 來自 message.text (純文字),含 <>&等字符,
用 parse_mode=HTML 發送時 Telegram 返回 400。
修正:
1. 先呼叫 editMessageReplyMarkup 移除按鈕 (確保按鈕一定消失)
2. 再 html.escape(original_text) 後嘗試更新文字
3. 文字更新失敗不影響整體流程 (按鈕已移除為首要目標)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-05 22:13:54 +08:00
OG T
8220027298
fix(telegram+cd): 兩個顯示 bug 修正
...
CD Pipeline / build-and-deploy (push) Has been cancelled
1. Nemotron args 顯示 Python dict 字串問題
- restart_deployment: {'deployment_name': 'awoo'} → restart_deployment: deployment_name=awoooi-api
- 改用 key=value 格式化,不再使用 str(dict)[:25]
2. CD 通知 ${MINUTES}/${SECONDS} 等變數未展開
- TG_MSG 從 env: 移到 run: shell 中組裝
- env: 中的 shell 變數在 bash 執行前是靜態字串,無法展開
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-05 14:47:52 +08:00
OG T
22ee9b2fe3
fix(telegram): answerCallbackQuery result=true 導致 bool is not iterable
...
CD Pipeline / build-and-deploy (push) Successful in 13m3s
Telegram answerCallbackQuery 成功時返回 {"ok": true, "result": true},
_send_request 中 "message_id" in result["result"] 對 bool 做 in 操作
報 "argument of type 'bool' is not iterable"。
修正:加 isinstance(result_val, dict) 防禦後再做 in 檢查。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-05 14:20:54 +08:00
OG T
4b4007db6c
feat(telegram): SRE 群組告警格式升級為完整 v7.0
...
CD Pipeline / build-and-deploy (push) Has been cancelled
_send_approval_card_to_group 改用與個人 chat 相同的 TelegramMessage.format()
格式,包含 SignOz metrics、AI provider/model、Nemotron 協作、異常頻率統計等全部欄位。
統帥指示:群組收到的告警訊息要與個人 chat 格式完全一致。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-05 14:11:59 +08:00
OG T
665f93e83f
fix(telegram): 首席架構師 R1 修正 — I-1/I-2/M-1/M-2
...
CD Pipeline / build-and-deploy (push) Has been cancelled
I-1: webhooks/sentry_webhook/signoz_webhook 三個呼叫者補 TODO 說明
無 incident_id 是已知限制(Approval 路徑未建 Incident 關聯)
I-2: TestPushRequest 新增 incident_id 欄位,使 QA 可驗證按鈕渲染
M-1: 移除 _build_inline_keyboard 呼叫中多餘的 `or message.incident_id`
M-2: 補充 900/1000 截斷長度差異說明
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-05 13:07:42 +08:00
OG T
4935cfc346
fix(telegram): 重設計訊息格式 + 修復 detail/reanalyze/history 按鈕失效
...
CD Pipeline / build-and-deploy (push) Failing after 1m26s
- format() / format_with_nemotron(): 移除 ═══ 分隔符,改為簡潔換行佈局
- send_approval_card(): 新增 incident_id 參數,傳入 _build_inline_keyboard()
- decision_manager.py: 呼叫 send_approval_card() 時傳入 incident.incident_id
- 問題根因: incident_id 未傳入 _build_inline_keyboard() 導致第二排按鈕從未渲染
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-05 12:44:13 +08:00
OG T
d0f09705e5
fix(auto-repair): 修復三個阻礙自動修復的根本原因
...
CD Pipeline / build-and-deploy (push) Has been cancelled
1. playbook_rag: Ollama embedding http_client 滾動重啟後 is_closed
- 新增 _get_http_client() 偵測 is_closed 自動重建
- singleton get_playbook_rag_service() 加 is_closed 重建判斷
2. telegram: 加入 ai_model 欄位顯示底層判斷模型
- TelegramMessage.ai_model 欄位
- format() / format_with_nemotron() 顯示 "Nemotron (nemotron-70b)"
- openclaw proposal_dict 加入 model 欄位
- decision_manager / send_approval_card 串接
3. DB: 清除 9 筆 3/26 殭屍 PENDING (mock_fallback CRITICAL 測試記錄)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-04 11:46:25 +08:00
OG T
9e78d5222a
feat(group-chat): 方案B slash commands — /status /incidents /cost /pods /help (2026-04-03 ogt)
...
CD Pipeline / build-and-deploy (push) Successful in 7m5s
E2E Health Check / e2e-health (push) Successful in 17s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-03 20:02:27 +08:00
OG T
e833065043
feat(group-chat): Reply Bot 訊息時只有被Reply的Bot回應 (2026-04-03 ogt)
...
CD Pipeline / build-and-deploy (push) Successful in 7m0s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-03 19:48:10 +08:00
OG T
8d09b18477
fix(group-chat): 移除雙AI互相評論 — 單獨@只有該AI回覆,雙AI路徑不再互評
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-03 19:44:49 +08:00