OG T
|
9ea1f77e41
|
fix(telegram): 移除 7 個 ghost button (3-part/無handler)
違規 buttons 一覽:
- flywheel_diag / flywheel_dashboard (META告警卡)
- pause_1h / ignore (業務告警卡)
- postmortem / escalation_ack / dr_manual (升級通知卡)
- secops_block_ip / secops_evict (SecOps 卡,spec=nonce 但用 2-part)
所有 buttons 均無 callback handler,點擊無回應 = 鬼魂按鈕
鐵律: 寧可沒按鈕,不可有死按鈕 (feedback_no_ghost_buttons.md)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-16 03:29:41 +08:00 |
|
OG T
|
62bcc50770
|
fix(tg+km): 補齊 Telegram 操作紀錄揭露與 KM 分類修復 (ADR-076)
CD Pipeline / build-and-deploy (push) Has been cancelled
Telegram 訊息新增欄位:
- alert_category 分類標籤(🏷️ 主機/K8s/資料庫/服務等)
- playbook_name 顯示匹配到的 Playbook 名稱
- 頻率統計從 count_24h>1 降至 >=1(初次告警也顯示)
- TelegramMessage 新增 alert_category/playbook_name 欄位
- decision_manager → send_approval_card 穿透 playbook_name
KM 修復:
- EntryType.PLAYBOOK → EntryType.AUTO_RUNBOOK(前者不存在,會 AttributeError)
- category "auto_generated" → "ai_system"(前端 i18n 有對應翻譯)
- runbook_generator 同步修正 category
- KM 建立後推 Telegram 通知(best-effort)
DB decision_chain 補欄位:
- 新增 playbook_id / playbook_name / alert_category
2026-04-16 ogt + Claude Sonnet 4.6
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-16 02:46:17 +08:00 |
|
OG T
|
f5e33da2fc
|
fix(telegram): 修正 _make_request → _send_request 方法名稱不一致
CD Pipeline / build-and-deploy (push) Successful in 15m48s
7 處呼叫 _make_request 但方法實際名稱為 _send_request,
導致 sweeper 分析完後 telegram_decision_push_failed 錯誤。
影響方法:send_push_notification, send_drift_card 等 ADR-071 系列。
_send_request 定義於 line 1272,OTEL 追蹤已含括。
2026-04-16 Claude Sonnet 4.6 Asia/Taipei
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-16 01:44:29 +08:00 |
|
OG T
|
cae9833e5d
|
fix(heartbeat): 修復多 replica 重複發送系統報告 bug
根因:RedisLock 在 async with 結束後立即 release,
兩個 pod 對齊同一 slot 但 offset 不同,第一個 pod
發完釋放鎖後 ~10s,第二個 pod 剛好 wake 並搶到空鎖
→ 同一個 30min slot 發出兩條相同報告。
修復:改用 slot-based key (heartbeat:slot:{slot_id})
SET NX EX interval_seconds,不主動 release,讓 TTL
自然過期。整個 30min slot 只有第一個搶到的 pod 能發。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-15 13:17:10 +08:00 |
|
OG T
|
a92562d65c
|
feat(Phase 5 Sprint 5.4): 分類按鈕從 registry 動態產生 — 按鈕重啟上線
CD Pipeline / build-and-deploy (push) Successful in 17m11s
_build_inline_keyboard() 改寫:
- 原 hardcode _CATEGORY_BUTTONS dict (28 按鈕) 已下架
- 改從 callback_action_spec.yaml registry 動態產生
- spec.callback_format 決定格式:
* nonce (寫類) → self._security.generate_callback_nonce(approval_id, action_name)
* info (查類) → {action_name}:{incident_id}
- 新按鈕只需改 yaml,零改 code
分類覆蓋 (從 yaml 自動推算):
- kubernetes: 6 按鈕 (4 寫 + 2 查)
- host_resource: 3 按鈕 (1 查 + 2 寫)
- secops: 4 按鈕 (全寫類 + Multi-Sig)
- database: 3 按鈕
- storage: 2 按鈕
- network: 3 按鈕
- devops_tool: 2 按鈕
- external_site: 2 按鈕
- business: 1 按鈕
- flywheel_health: 1 按鈕
- ssl_cert: 1 按鈕
這次按鈕不是鬼魂 — 每個都有:
✅ callback_format 正確 (4-part nonce / 2-part info)
✅ Sprint 5.3 dispatch handler 接收
✅ Sprint 5.2 MCP registry 執行
✅ audit log + reply_to 原卡片
回歸: 188/188
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 21:40:20 +08:00 |
|
OG T
|
de8bbd8ab9
|
feat(Phase 5 Sprint 5.3): 寫類分類按鈕 nonce action 路由 + audit log
CD Pipeline / build-and-deploy (push) Has been cancelled
插入點: _handle_callback_query Step 1.9 (nonce 驗證後, Step 2 approve/reject 前)
邏輯:
1. 從 spec registry 查 action 是否為註冊的寫類動作
2. 若 action in (approve/reject/silence/tune/log_manual_fix) → skip 走既有流程
3. 若 spec.requires_multi_sig=True 且 current_signatures < 2 → 提示「需 2 人簽核」
4. Audit log (category_write_action_audit_start) 含 user/risk/provider/tool
5. Ack Telegram (emoji + label + 執行中...)
6. 從 incident 取 labels 供模板替換
7. dispatch_action() → MCP 執行
8. Reply 結果到原告警卡片(Redis tg_msg lookup)
9. Audit log (category_write_action_audit_complete) 含 success/error/duration
支援的寫類 action:
- k8s_restart/scale_up/scale_down/rollback (kubernetes)
- host_restart_service/clear_log (host_resource)
- docker_restart/minio_restart (devops_tool/storage)
- reload_nginx/renew_cert (network/ssl_cert)
- kill_slow_query/clear_conn_pool (database)
- pause_1h/trigger_diagnose (business/flywheel)
Multi-Sig 支援 (Sprint 5.4 預留):
- secops_isolate/block_ip/evict → requires_multi_sig=True
- 簽核數未達 2 → 提示 + 不執行
回歸: 129/129
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 21:39:16 +08:00 |
|
OG T
|
581b244ad1
|
feat(Phase 5 Sprint 5.1): Telegram callback_handler 接上 dispatcher
CD Pipeline / build-and-deploy (push) Has been cancelled
整合點: _handle_callback_query 未知 action fallback 路徑
變更:
1. Line 2601 原「⚠️ 未知操作」改呼叫 _dispatch_category_action()
2. 新增 _dispatch_category_action() method:
- 查 callback_action_spec registry
- 若 action 不存在 → 回「未知操作」(行為不變)
- 若存在 → acknowledge + 從 incident 取 labels + dispatch + reply 原卡片
效果:
- check_process / check_port / check_log_* / check_health / open_signoz /
open_flywheel 等 10 個查類按鈕現在有完整 flow(雖 Sprint 5.2 還沒接 MCP,但 stub 會 reply)
- 當 CD 部署 + Sprint 5.2 實裝 MCP 接線後,查類按鈕自動上線
Sprint 5.1 DOD:
- ✅ callback_handler 接線 _dispatch_category_action
- ✅ Dispatcher 讀 incident labels 替換模板變數
- ✅ Reply to 原告警卡片(Redis tg_msg lookup)
- ⏳ MCP 實際執行(Sprint 5.2)
回歸測試: 109/109
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 20:41:22 +08:00 |
|
OG T
|
10e3043ce8
|
fix(UX): 下架 28 個鬼魂分類按鈕 + ADR-079 Phase 5 補完計畫
CD Pipeline / build-and-deploy (push) Has been cancelled
統帥 2026-04-14 20:00 完整 audit 揭露:
_CATEGORY_BUTTONS 28 個按鈕全死 3 天(從 2026-04-11 commit 325b3851)
- callback_data 格式全錯(3-part 不符 parser 4-part/2-part)
- grep apps/api/src 無任何 dispatch handler
- 統帥今天真踩到:點「查程序」沒反應 → 信任破壞
首席架構師裁示 (C 分級):
A. 立刻下架(本 commit):_CATEGORY_BUTTONS = {} fallback 通用按鈕
B. Phase 5 完整化(ADR-079 規劃,3-5 天,另 Sprint 實作)
保留通用按鈕(全 ✅):
- 批准 / 拒絕 / 靜默(4-part nonce)
- 詳情 / 歷史 / 重診(2-part info)
新增防禦性文件:
- ADR-079 — Phase 5 工作分解 + 每按鈕 checklist
- feedback_no_ghost_buttons.md(memory)— 鬼魂按鈕鐵律
設計原則永久入檔: 寧可沒按鈕,不可有死按鈕
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 20:19:25 +08:00 |
|
OG T
|
72dd0c5875
|
fix: Telegram 簽核 gate + 執行結果 reply — 打通人工審核閉環
CD Pipeline / build-and-deploy (push) Successful in 14m7s
3 處修復(統帥盤查發現):
1. telegram_gateway.py:4890 — gate 從 execution_triggered 改 approval.status==APPROVED
- 原 gate 靠樂觀鎖旗標,race 時失效(REST+Telegram 同時簽核)
- 與 REST API approvals.py:360 路徑對齊
- 加 Redis lock exec:{approval_id} 60s TTL 防重入
2. telegram_gateway.py:4772 — 拿掉「👀 等待執行」誤導文案
- 批准後一律顯示「⚡ 執行中...」,實際結果由 #3 reply 補上
3. approval_execution.py — 新增 _push_execution_result_to_alert()
- 成功/失敗兩處 fire-and-forget 呼叫
- requested_by=="auto_approve" skip(避免與 _push_auto_repair_result 衝突)
- Redis tg_msg:{incident_id} 查原告警 message_id → reply_to
- 找不到 message_id 靜默不發,不影響執行主流程
防破壞性檢查:
- ✅ 自動執行路徑不受影響(skip via requested_by)
- ✅ Reject 路徑完全不動
- ✅ Redis lock 防重入
- ✅ 132 回歸測試全過
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 19:03:38 +08:00 |
|
OG T
|
c0ba1000f3
|
Revert "fix(auto-repair): 中低風險+無kubectl_command → TYPE-1 純資訊,不顯示審核按鈕"
This reverts commit abf1ffa91e7327a36af93be2742d53dac1933f0d.
|
2026-04-14 13:33:24 +08:00 |
|
OG T
|
2df4945880
|
fix(auto-repair): 中低風險+無kubectl_command → TYPE-1 純資訊,不顯示審核按鈕
問題: HostHighCpuLoad 等主機層告警 affected_services=[] → OpenClaw 生成
kubectl unknown → safety guard 攔截 → 退回 READY + TYPE-3 帶按鈕卡片
用戶一直看到帶按鈕的中/低風險告警,按鈕無法修復任何東西
修復三處:
1. openclaw.py: _call_openclaw_analyze() 回傳 suggested_action 欄位
+ target_resource 預設改為 "" (避免 "unknown" 進入 safety guard)
2. decision_manager.py: classify_notification() 傳入
suggested_action / risk_level / has_kubectl_command
3. telegram_gateway.py: classify_notification() 新規則 —
無 kubectl_command + risk=low/medium + action=investigate/no_action
→ TYPE-1 (純資訊,無按鈕)
搭配 clawbot-v5 f4b84d7 (OpenClaw prompt CRITICAL RULES) 一起生效
2026-04-14 Claude Sonnet 4.6 Asia/Taipei
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-14 13:33:24 +08:00 |
|
OG T
|
09134f5c47
|
fix(openclaw): 修復 incident.title + DIAGNOSE→NEMOTRON confidence=0.0
CD Pipeline / build-and-deploy (push) Failing after 2m10s
1. telegram_gateway.py:1169 — classify_notification() 仍用 incident.title
改用 alertname + signal annotations 組合 (同 decision_manager.py 修法)
2. ai_router.py — DIAGNOSE 路由暫停 NEMOTRON
NIM tool_call 返回無 confidence → openclaw_analysis_complete confidence=0.0
改為 None (複雜度路由),讓 Gemini/openclaw_nemo 處理 DIAGNOSE
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 22:12:55 +08:00 |
|
OG T
|
3d8b0e4f90
|
fix(adr075): TYPE-3 格式改用 spec 模板 — ACTION REQUIRED + AI深度診斷 + 建議修復動作
CD Pipeline / build-and-deploy (push) Failing after 2m15s
- 標頭改為 "{emoji} ACTION REQUIRED | {severity_zh}"
- 新增 "🧠 AI 深度診斷" 區塊 (分析/責任/AI來源)
- 新增 "⚡ 建議修復動作" 區塊 (<code> 格式)
- confidence=0 顯示 "📋 規則分析" 取代誤導性 "🔴 0%"
- SignOz 指標區塊補回 Trace 連結
2026-04-12 ogt: ADR-075 TYPE-3 格式標準化
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 21:00:28 +08:00 |
|
OG T
|
a7f2b9c0f5
|
fix(display): 規則匹配改顯示 ✅ 取代 🔴 0% + 修復 LLM 字串 confidence 解析
CD Pipeline / build-and-deploy (push) Has been cancelled
- telegram_gateway.py: confidence==0 (規則匹配/Expert fallback) 不再顯示
「🔴 0%」,改顯示「⚙️ 規則匹配 ✅」,兩個 card 類型都修正
- openclaw.py: NIM/Ollama 有時回傳字串 "0.85" 而非 float,導致
isinstance(str, int|float)=False → confidence 被強制設 0.0。
現在先嘗試 float() 解析,解析失敗才 fallback 0.0
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 20:50:53 +08:00 |
|
OG T
|
24c1b5677b
|
feat(adr075): Step1-3 classify補丁+新按鈕+TYPE-5S/6B/7E格式函數
Step-1 incident_service.py classify_alert_early():
- 新增 secops (TYPE-5S): UnauthorizedSSH/KubeAudit/CVE/WAFAttack/PodAbnormal
- 新增 business (TYPE-6B): AITokenCost/GeminiAPIError/SLOBurn/MomoScraper
- 新增 flywheel_health MCPProvider/OllamaDown/NemotronDown 前綴
- ssl_cert: 依 days_remaining 決定 TYPE-1(≥14d) vs TYPE-3(<14d)
Step-2 telegram_gateway.py _build_inline_keyboard():
- 新增 secops: [隔離] [封鎖IP] [驅逐] [確認授權]
- 新增 business: [暫停1h] [查SignOz] [忽略]
- 新增 flywheel_health: [觸發診斷] [飛輪面板] [靜默]
Step-3 telegram_gateway.py 新增格式化函數 (Tier 2):
- send_secops_card() — TYPE-5S 防禦按鈕+nonce
- send_business_alert() — TYPE-6B 業務損失速率
- send_escalation_card() — TYPE-7E P0/P1 升級,發 DM+群組
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 18:50:37 +08:00 |
|
OG T
|
1cb654cf59
|
fix(adr-075): CR P0/P1 修補 — TYPE_8M enum + 死碼清理 + docstring 更新
CD Pipeline / build-and-deploy (push) Has been cancelled
P0-2: NotificationType 新增 TYPE_8M = "TYPE-8M"
classify_notification 早期回傳 TYPE-8M
decision_manager 改用 NotificationType.TYPE_8M enum 比較(移除字串字面量)
P1-1: 移除 _CATEGORY_BUTTONS 中不可達的 alertchain_health/flywheel_health 條目
P1-4: test_classify_alert_early.py docstring 更新為 13 條規則/10 分類
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 18:44:12 +08:00 |
|
OG T
|
561c1d806b
|
feat(adr-075): Phase 2 — TYPE-8M 飛輪/告警鏈路健康通知格式與路由
CD Pipeline / build-and-deploy (push) Failing after 4m0s
新增 send_meta_alert() — ⚙️ META SYSTEM 卡片(觸發診斷/查看面板/靜默)
decision_manager 新增 TYPE-8M elif 分支(在 TYPE-4D 後)
_alert_category 提取提前至 if 鏈前,三個分支共用
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 18:39:04 +08:00 |
|
OG T
|
2cef2098d3
|
feat(adr-075): 修復 Telegram 動態按鈕 4 個斷點 + 新增 7 種告警分類
CD Pipeline / build-and-deploy (push) Has been cancelled
斷點 A: decision_manager 提取 alert_category/notification_type 傳入 send_approval_card
斷點 B: send_approval_card 新增參數並傳遞至 _build_inline_keyboard
斷點 C: 互動型通知 (TYPE-3/4/4D/8M) 禁止發 SRE 群組,防 nonce 洩漏
斷點 D: _CATEGORY_BUTTONS k8s_workload → kubernetes + 新增 6 類按鈕組
classify_alert_early 新增: alertchain_health, flywheel_health, storage,
devops_tool, external_site, ssl_cert, host_resource (從 infrastructure 分離)
Test: 52 classify + 664 total passed
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 18:35:56 +08:00 |
|
OG T
|
99b489ca63
|
fix(flywheel): 修補剩餘 P0/P1 缺陷
CD Pipeline / build-and-deploy (push) Has been cancelled
- CRITICAL-1: TYPE-1 path approval_id=str(alert_id) → uuid.uuid4(),
避免 UUID(approval_id) 拋 ValueError 導致所有 Heartbeat/Info 告警崩潰
- CRITICAL-2: asyncio.create_task() 結果存入 _exec_task 並加 done_callback,
防止 GC 在執行中途回收任務
- FORMAT: _push_to_telegram_background 新增 notification_type + diff_summary 參數,
TYPE-4D → send_drift_card(),其他 → send_approval_card()(修正 ConfigDrift 顯示錯誤卡片)
- 傳遞 notification_type 至 Alertmanager 兩個呼叫點
ADR-073 四斷點修補最終收尾
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 17:14:57 +08:00 |
|
OG T
|
f0e14136ca
|
fix(flywheel): 修補飛輪四個核心斷點,讓完整流程真正串接起來
CD Pipeline / build-and-deploy (push) Has been cancelled
1. incident_service.py: save_to_episodic_memory() 補寫 alertname/notification_type/alert_category
→ 之前這3欄在DB永遠NULL,LLM無alertname,Playbook匹配全失敗
2. telegram_gateway.py: Telegram批准後呼叫 execute_approved_action()
→ 之前sign_approval()只改DB狀態,380筆批准0筆真正執行kubectl指令
3. approval_execution.py: 執行成功後呼叫 resolve_incident()
webhooks.py: auto-repair成功後呼叫 resolve_incident()
→ 之前Incident永遠停在INVESTIGATING,KM轉換永遠不觸發,Playbook=0
4. webhooks.py: TYPE-1告警短路,不進LLM
→ 之前Heartbeat/Backup/Info仍燒LLM token,產生垃圾修復建議
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 17:01:10 +08:00 |
|
OG T
|
93f9522d5a
|
fix(heartbeat): 對齊整點發送避免多replica各自發 + KM向量化改查embedding欄位
CD Pipeline / build-and-deploy (push) Successful in 14m10s
- _heartbeat_loop: 先 sleep 到下一個整點倍數再開始循環
避免 3 個 replica 啟動時間不同導致短時間內收到多條心跳
- heartbeat_report_service: km_vectorized 改查 KnowledgeEntryRecord.embedding IS NOT NULL
原本錯誤查 IncidentRecord.vectorized 導致顯示 0/714 (0%)
2026-04-12 ogt (ADR-073 heartbeat fix)
|
2026-04-12 16:33:15 +08:00 |
|
OG T
|
effd78807e
|
fix(heartbeat): blocking_timeout 5→0,多 replica 不排隊等鎖避免重複發送
CD Pipeline / build-and-deploy (push) Successful in 14m0s
3 個 replica 各自跑 loop,blocking_timeout=5.0 導致鎖釋放後
其他 replica 依序拿鎖,每次心跳最多發 3 條。
改為 blocking_timeout=0:拿不到鎖立刻跳過,同週期只發一條。
2026-04-12 ogt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 16:13:41 +08:00 |
|
OG T
|
00a31abb85
|
feat(heartbeat): ADR-073 P2 心跳整合重構 — HeartbeatReportService + RedisLock
CD Pipeline / build-and-deploy (push) Has been cancelled
- 新增 HeartbeatReportService:11 個並行探針(Ollama/Nemotron/Gemini/Claude/MCP×4/ArgoCD/Velero)
- 重寫 send_heartbeat():RedisLock 防重發 + 統一發送 SRE_GROUP_CHAT_ID
- 簡化 _heartbeat_loop():移除散落的 silence 多次發送
- config.py:新增 OLLAMA_REQUIRED_MODELS 欄位
- 03-secrets.example.yaml:補 SRE_GROUP_CHAT_ID 確保 CD Inject 不遺漏
2026-04-12 ogt (ADR-073 Phase 2-3/4)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 15:35:13 +08:00 |
|
OG T
|
c09521a1c6
|
fix(cr): Code Review P0/P1 全修補 — 積木化+SSH路由+安全守衛順序
CD Pipeline / build-and-deploy (push) Failing after 2m30s
P0-1: classify_alert_early 移至 incident_service (Service層),webhooks.py import 修正
P0-2: _ssh_execute() 改用 self._ssh,移除冗餘 SSHProvider() 實例化
P1-1: infrastructure SSH routing 移至 kubectl safety guard 之前,docker指令不再被攔截
P1-2: alert_rule_engine 新增 get_risk_for_alertname() public API
P1-3: classify_notification() docstring 修正 ORM→Pydantic
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 14:51:12 +08:00 |
|
OG T
|
dbc77c5e62
|
feat(flywheel): Phase 3 — decision_manager Tier 3 七大修復 (首席架構師授權)
CD Pipeline / build-and-deploy (push) Has been cancelled
ADR-073 Phase 3 全部完成:
3-1: TYPE-1 triage guard
- get_or_create_decision() 入口: notification_type=TYPE-1 直接 bypass LLM 分析
- classify_notification() 優先讀 incident.notification_type (早期分診結果)
- ConfigurationDrift/KubeConfigDrift 補入 TYPE-4D 匹配清單
3-2: infrastructure → SSH MCP routing
- _auto_execute() 中 alert_category=infrastructure + 非 kubectl action → _ssh_execute()
- _ssh_execute(): docker_restart / service_restart tool 路由
- 取 instance label 對應 SSH_MCP_ALLOWED_HOSTS 白名單主機
3-3: send_info_notification() TYPE-1 已存在,classify_notification 修復確保正確呼叫
3-4: Dynamic button builder 已存在 _build_inline_keyboard + _CATEGORY_BUTTONS
3-5: action | parse fix
- _auto_execute() 開頭: action 含 | 時取第一段 (LLM 有時輸出 "kubectl X | kubectl get")
3-6: risk_level YAML priority override LLM
- dual_engine_analyze() LLM 結果返回後,用 alert_rules.yaml 對應 rule.risk 覆蓋
3-7: send_drift_card() TYPE-4D 已存在,classify_notification 修復確保正確觸發
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 14:39:19 +08:00 |
|
OG T
|
2af4dffcc6
|
fix(security): Architecture Review 修復 5 項高信心問題
安全修復 (P0):
1. ssh_provider: 新增 _validate_param() 白名單驗證,防止 command injection
- container_name/service/filter_name: [a-zA-Z0-9._-]{1,128}
- compose_dir: 必須以 /opt/ 或 /srv/ 開頭,禁止 ..
- domain: FQDN 白名單
- tail/port/lines: int() 轉換 + 上下限夾緊
2. ssh_provider: known_hosts=None 改為讀 SSH_MCP_KNOWN_HOSTS_FILE 環境變數
- 預設仍 None(內網快速啟動),但啟動時寫入 warning log
- 設定文件:ops/runbooks/ssh-mcp-setup.md (待補)
模組化修復 (P1):
3. km_conversion_service: 移除 import 時的 ALERT_EVENT_TYPES.update() 副作用
- ADR-071 event types 移入 alert_operation_log_repository.py 靜態集合
4. telegram_gateway: create_task() 改為 await + try/except
- 避免 DB session 關閉後的競爭條件
- KM 轉換失敗記錄 warning log,不中斷主流程
5. km_conversion_service: 新增頂層 try/except,錯誤一律 error log 後 re-raise
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-11 02:50:26 +08:00 |
|
OG T
|
325b3851b5
|
feat(adr-071): 告警通知四類型第一批 B/C/E/F/G/H 全實作
CD Pipeline / build-and-deploy (push) Has been cancelled
Type Sync Check / check-type-sync (push) Failing after 1m7s
ADR-071-B: classify_notification() — 五型分類器 (TYPE-1/2/3/4/4D)
ADR-071-C: send_info_notification() — TYPE-1 純資訊無按鈕卡片
ADR-071-E: _build_inline_keyboard() — 依 alert_category 動態組合 TYPE-3 按鈕
ADR-071-F: send_drift_card() — TYPE-4D Config Drift 卡片 + Diff 截斷
ADR-071-G: km_conversion_service.py — Incident RESOLVED 自動轉 KM
ADR-071-H: handle_manual_fix_done() — TYPE-4 手動修復 Bot 對話閉環
前批已完成: ADR-071-A (DB Migration) + ADR-071-D (狀態機守衛)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-11 02:24:20 +08:00 |
|
OG T
|
dabc62e0f8
|
fix(telegram): append_incident_update — 儲存告警卡片 message_id 到 Redis
CD Pipeline / build-and-deploy (push) Successful in 14m31s
_send_approval_card_to_group 發出告警卡片後,將 Telegram message_id
存入 Redis tg_msg:{incident_id}(TTL 24h),供後續 append_incident_update
換掉批准按鈕 + reply 狀態。
修復前:tg_msg key 從未被寫入,append 永遠 fallback 發新訊息,
批准按鈕永遠無法被移除。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-10 22:41:30 +08:00 |
|
OG T
|
797c7c749e
|
fix(nemotron): deepseek-r1 num_predict 400→1200,避免 <think> block 截斷後空回覆
CD Pipeline / build-and-deploy (push) Failing after 28s
deepseek-r1:14b 思考 token 超過 400 會在 </think> 前截斷,導致
清理後 body 為空,Telegram 顯示空訊息。
- chat_manager: num_predict 400 → 1200
- telegram_gateway: _clean_ai_reply 空值加 fallback 錯誤提示
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-10 22:35:37 +08:00 |
|
OG T
|
100e4d9b89
|
fix(chat): AI 回覆截斷問題 — 強制 persona + Markdown 清理 + 600字上限
CD Pipeline / build-and-deploy (push) Successful in 14m39s
問題: OpenClaw/NemoClaw 回覆 Markdown 語法 + 超長,Telegram 顯示截斷
修正:
1. chat_manager: _call_openclaw/_call_nemotron 強制前置 persona (含不超過300字規範)
2. telegram_gateway: _clean_ai_reply() 移除 **bold** *italic* # header 語法
移除 deepseek-r1 <think> 標籤,截斷 > 600 字並在段落邊界截
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-10 21:26:15 +08:00 |
|
OG T
|
b9dbbb3575
|
feat(rag): Telegram /rag 指令 + /rag/optimize ivfflat 端點
CD Pipeline / build-and-deploy (push) Successful in 14m9s
- telegram_gateway: /rag <query> → KnowledgeRAGService.query()
_handle_group_command 加 full_text 參數取得完整指令文字
/help 更新加入 /rag 說明
- rag.py: POST /rag/optimize → rag_repo.create_ivfflat_index()
- rag_chunk_repository: create_ivfflat_index() — ivfflat with lists=100
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-10 14:47:21 +08:00 |
|
OG T
|
cf5eb71ea6
|
fix(phase34): polling loop 補圖片路由 — _handle_chat_message photo handler
CD Pipeline / build-and-deploy (push) Has been cancelled
text=None 時直接 return,導致圖片訊息被丟棄
在 text 檢查前插入 photo 路由,呼叫 image_analysis_service
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-10 13:58:05 +08:00 |
|
OG T
|
7768924fea
|
fix(flywheel): 自動修復後移除 Telegram 按鈕 + 心跳告警排除飛輪
CD Pipeline / build-and-deploy (push) Failing after 6m56s
問題: 自動修復成功後 Telegram 卡片仍顯示批准/拒絕/靜默按鈕
Fix 1 — Telegram 卡片回饋閉環 (積木化合規):
- telegram_gateway.send_approval_card: 發送後自動存 tg_approval:{id} 到 Redis
- telegram_gateway.mark_auto_repaired(): 新方法 — 移除按鈕 + reply 結果
- _try_auto_repair_background: 改呼叫 gateway.mark_auto_repaired() (Service 層)
Fix 2 — 心跳/看門狗告警排除飛輪:
- constants.py: is_heartbeat_alertname() + HEARTBEAT_ALERT_NAMES
- NoAlertsReceived2Hours 等不觸發 _try_auto_repair_background
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-10 11:52:04 +08:00 |
|
OG T
|
88ac1c7f50
|
feat(phase27): 歷史按鈕雙層頻率統計 + DB frequency_snapshot 持久化
CD Pipeline / build-and-deploy (push) Failing after 1m44s
- telegram_gateway: _send_incident_history 改為 Phase 27 雙層策略
Layer 1: DB frequency_snapshot (建立時刻永久快照)
Layer 2: Redis AnomalyCounter disposition 累積統計 (35d TTL)
修復舊版呼叫 record_anomaly() 導致誤計數的 bug
- 新增 migration: phase27_incident_frequency_snapshot.sql (已在 prod 執行)
- CLAUDE.md: 精簡至 123 行,減少 Token 消耗
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-10 01:06:51 +08:00 |
|
OG T
|
ae90c36cd7
|
fix(telegram): _send_incident_history 加入 freq=None fallback — 無頻率統計資料
CD Pipeline / build-and-deploy (push) Has been cancelled
test_history_handles_no_stats 要求原始碼中有「無頻率統計資料」fallback 分支,
當 AnomalyCounter.record_anomaly() 回傳 None 時顯示此訊息而非繼續處理。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-10 01:01:19 +08:00 |
|
OG T
|
e59f8181b3
|
fix(telegram): 歷史按鈕改從 AnomalyCounter(Redis) 讀頻率,修復永遠顯示「無頻率統計資料」
CD Pipeline / build-and-deploy (push) Failing after 1m45s
根本原因: frequency_stats 從未持久化到 DB,get_by_id() 回傳永遠是 None
修復: 用 AnomalyCounter.derive_key_from_incident() 推導 anomaly_key,
直接從 Redis 查即時頻率與處置分佈統計
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-10 00:56:23 +08:00 |
|
OG T
|
d324dd7aed
|
fix(telegram): 移除所有告警訊息欄位截斷限制,放寬至 Telegram 4096 字元硬限
CD Pipeline / build-and-deploy (push) Has been cancelled
問題:root_cause[:50/100]、suggested_action[:80]、suggestion[:50]、
note[:150]、fix_action[:100]、impact[:150]、hypothesis[:200]
以及 message[:900]/[:1000] 導致告警內容顯示不完整。
修復:移除欄位截斷,整體上限改為 4096(Telegram API 硬限制)。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 23:32:51 +08:00 |
|
OG T
|
eb46079b4a
|
fix(telegram): root_cause 顯示長度 50→100 字元,符合 SOUL.md 鐵律
CD Pipeline / build-and-deploy (push) Has been cancelled
SOUL.md 明定根因摘要上限 100 字元,但程式碼兩處 IncidentApprovalCard
均截在 [:50],導致告警卡片訊息被截斷。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 23:30:58 +08:00 |
|
OG T
|
896bef94ee
|
fix(web): pending-approvals-card 加防重複點擊 + loading 狀態
linter 自動強化: actioningId state 防止同一張卡重複操作
- disabled + opacity 0.6 + cursor not-allowed
- loading 時按鈕顯示 '...'
- finally() 確保 actioningId 清除
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 18:38:08 +08:00 |
|
OG T
|
1483218bab
|
feat(approval): 批准/拒絕後立即回應 Telegram + 持久化 message_id 到 DB
CD Pipeline / build-and-deploy (push) Successful in 13m9s
問題:按下 TG 批准/拒絕按鈕後完全沒有任何回應,使用者不知道是否成功。
Telegram message_id 只存 Redis 24h TTL,過期後無法追蹤。
修正:
- approval_records 加 telegram_message_id / telegram_chat_id 欄位(已 ALTER TABLE)
- approval_db.update_telegram_message() — 持久化 message_id 到 DB
- decision_manager: 發送告警卡片後同時寫 Redis + DB
- telegram_gateway._notify_approval_result() — 批准/拒絕後:
1. editMessageReplyMarkup 移除批准/拒絕按鈕,保留資訊按鈕
2. sendMessage reply_to 在原訊息下回覆狀態行
3. fallback: send_notification 發新訊息
- _handle_group_command: chat_id 改為 _chat_id 消除 IDE lint
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 18:19:31 +08:00 |
|
OG T
|
d8c2969341
|
feat(telegram): AI 鏈路透明化 — 告警訊息顯示 OpenClaw + Tool Calling 模型/後端
CD Pipeline / build-and-deploy (push) Successful in 12m12s
- nemotron.py: 偵測 OllamaToolProvider vs NvidiaProvider,記錄 tool_model/tool_backend
- openclaw.py: 傳播 nemotron_tool_model/nemotron_tool_backend 到 proposal
- decision_manager.py: 從 proposal_data 提取並傳給 send_approval_card()
- telegram_gateway.py: TelegramMessage 新增兩個欄位,format_with_nemotron 顯示
"🔧 Tool Calling: llama3.1:8b (Ollama 本機)" 或 "NVIDIA 雲端"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 15:05:16 +08:00 |
|
OG T
|
4f80ba38c0
|
feat: 告警狀態變更在原訊息延續 (方案 B)
CD Pipeline / build-and-deploy (push) Successful in 12m28s
**telegram_gateway.py**
- 新增 append_incident_update(incident_id, status_line)
- 從 Redis tg_msg:{id} 取 message_id
- editMessageReplyMarkup: 移除 Row1(批准/拒絕/靜默),保留 Row2(詳情/重診/歷史)
- sendMessage reply_to_message_id: 在原訊息下方追加狀態行
- 找不到 message_id 回傳 False(呼叫方自行 fallback)
**decision_manager.py**
- _push_decision_to_telegram: send_approval_card 後存 tg_msg:{id}=message_id (TTL 24h)
- _push_auto_repair_result: 改用 append_incident_update,找不到 message_id 才 fallback 新訊息
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 14:21:33 +08:00 |
|
OG T
|
c5e475121a
|
fix(telegram): 修復建議指令被截斷 + decision_manager enum string 補正
CD Pipeline / build-and-deploy (push) Has been cancelled
根因 1: telegram_gateway.py suggested_action[:35] 剛好截到 deployment/ 後
→ 改為 [:80],完整顯示 kubectl command
根因 2: 舊 Incident proposal_data 存 enum string (RESTART_DEPLOYMENT)
→ decision_manager.py 加入偵測,用規則引擎重新查 kubectl command
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 11:14:30 +08:00 |
|
OG T
|
0f5fecfef5
|
fix(sprint5.1): 首席架構師審查修正 — S1×4 S2×2 S3×1
CD Pipeline / build-and-deploy (push) Failing after 1m40s
S1-1: service_registry/velero_client/preflight_service 改用 structlog
S1-2: velero_client datetime.now(UTC) 改用 now_taipei()(台北時區鐵律)
S1-3: Guardrail 失敗改為保守拒絕(原放行方向與安全目標相悖)
S1-4: service_registry import 移至模組頂部(移除函數內 import)
S2-1: telegram_gateway T1-T6 六個通知方法補齊 try/except
S2-2: webhooks.py Langfuse URL 改用 settings.LANGFUSE_URL(移除硬寫內網 IP)
S3-3: velero_client trigger_emergency_backup 改為 kubectl apply Backup CRD
(原 kubectl create backup 語法不存在,審查發現靜默失敗風險)
審查評分: 70/100 → 修正後預計 90+/100
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-08 16:36:18 +08:00 |
|
OG T
|
88696dba9b
|
feat(sprint5.1): Data Safety Guardrails 全鏈路整合 (L1-L5)
CD Pipeline / build-and-deploy (push) Failing after 1m33s
Type Sync Check / check-type-sync (push) Failing after 58s
Layer 0 - K8s RBAC:
- k8s/rbac/api-velero-reader.yaml: awoooi-executor SA Velero backup reader
Layer 1 - DB Migration (已在 188 執行):
- M-002: approval_records 新增 approval_level/votes/required_votes
- M-003: alert_event_type ENUM 新增 8 個值
Layer 2 - IaC:
- ops/config/service-registry.yaml: 全服務 Stateful 分級清單 (BLOCK/CRITICAL_HITL/STANDARD_HITL/AUTO)
Layer 3 - Python Services:
- service_registry.py: 讀取 YAML,提供 is_blocked/requires_multisig/get_required_votes
- velero_client.py: kubectl 查詢 Velero 備份年齡,失敗 fallback 999h
- preflight_service.py: Pre-flight 安全檢查 (Q2/Q4 決策)
Layer 1-M001 - Playbook model:
- playbook.py: 新增 requires_approval_level/stateful_targets/requires_pre_backup
Layer 4 - 業務邏輯:
- alert_operation_log_repository.py: 新增 8 個 event_type (Guardrail/Pre-flight/MultiSig/備份)
- auto_repair_service.py: 注入 Service Registry Guardrail 檢查 (BLOCK → 直接拒絕)
- webhooks.py: ALERT_RECEIVED 溯源記錄 + auto_repair flag Q9 + Langfuse trace_id Q10
- db/models.py: ApprovalRecord 同步 approval_level/votes/required_votes 欄位
- docker-health-monitor.sh: 純感知層改造(移除所有 docker restart 邏輯)
Layer 5 - Telegram 通知:
- telegram_gateway.py: T1-T6 六個新通知方法 (Guardrail/Pre-flight/Backup/MultiSig/ChangeApplied)
參考: ADR-062 Data Safety Guardrails, ADR-063 Service Registry IaC
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-08 16:24:09 +08:00 |
|
OG T
|
de3935d1d4
|
feat: Sprint 4 Phase E+F — 前端處置統計 + 週報處置分佈
CD Pipeline / build-and-deploy (push) Failing after 1m26s
Type Sync Check / check-type-sync (push) Failing after 1m2s
Phase E: 前端頁面
- E1: /reports 完整處置統計儀表板 (已在 Sprint F 完成)
- E2: 首頁 Metrics Strip — 從 disposition API 取得真實自動化率
優先使用 /stats/disposition auto_rate,fallback 到 incidents 推算
- E3: /auto-repair 處置概況卡片 (已在 Sprint F 完成)
- E4: /neural-command stats tab 處置分佈 (已在 Sprint F 完成)
- E5: i18n 翻譯 zh-TW + en (已在 Sprint F 完成)
Phase F: 週報 + 文件
- F1: WeeklyReportMessage 新增 disposition 5 欄位
週報格式加「📋 處置分佈」區塊 (自動/冷啟動/人工/手動 + 自動化率)
weekly_report_service 整合 get_all_disposition_stats()
- message 字數上限從 900 提升到 1200 (適應處置區塊)
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-07 13:02:20 +08:00 |
|
OG T
|
a85e9ced08
|
feat(api+telegram): Sprint 4 Phase C+D — API 端點 + Telegram 處置統計
Phase C: API 端點
- C1: GET /api/v1/stats/disposition — 完整處置分佈統計
- DispositionSummary: auto/human/manual/cold_start + auto_rate
- DispositionByAnomaly: 按異常類型明細 (最多 20 筆)
- Redis SCAN + HGETALL 聚合
- C2: GET /api/v1/auto-repair/stats 擴充 disposition_summary
Phase D: Telegram 告警格式
- D1: 告警卡片加處置統計行
- 🤖 自動: N | 👤 審核: N | 🔧 手動: N
- 自動化率百分比
- D2: 歷史按鈕強化處置分佈明細
- 完整 5 項計數 + 自動化率
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-07 12:17:20 +08:00 |
|
OG T
|
f6332b4b2f
|
fix(telegram): 修正 approval_id UUID 轉換錯誤 — 支援 INC-xxx 格式
CD Pipeline / build-and-deploy (push) Successful in 12m24s
_execute_approval_action 用 UUID(approval_id) 但 approval_id 是 INC-xxx,
導致 'badly formed hexadecimal UUID string' 錯誤,簽核無法執行。
修正: 先嘗試 UUID 轉換,失敗則用 incident_id 查出對應的 pending approval UUID。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-06 11:53:48 +08:00 |
|
OG T
|
09d965dab5
|
fix(telegram): 修正 editMessageText 400 錯誤 — 先移除按鈕再更新文字
CD Pipeline / build-and-deploy (push) Failing after 12m46s
原因: original_text 來自 message.text (純文字),含 <>&等字符,
用 parse_mode=HTML 發送時 Telegram 返回 400。
修正:
1. 先呼叫 editMessageReplyMarkup 移除按鈕 (確保按鈕一定消失)
2. 再 html.escape(original_text) 後嘗試更新文字
3. 文字更新失敗不影響整體流程 (按鈕已移除為首要目標)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-05 22:13:54 +08:00 |
|
OG T
|
8220027298
|
fix(telegram+cd): 兩個顯示 bug 修正
CD Pipeline / build-and-deploy (push) Has been cancelled
1. Nemotron args 顯示 Python dict 字串問題
- restart_deployment: {'deployment_name': 'awoo'} → restart_deployment: deployment_name=awoooi-api
- 改用 key=value 格式化,不再使用 str(dict)[:25]
2. CD 通知 ${MINUTES}/${SECONDS} 等變數未展開
- TG_MSG 從 env: 移到 run: shell 中組裝
- env: 中的 shell 變數在 bash 執行前是靜態字串,無法展開
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-05 14:47:52 +08:00 |
|