OG T
|
a142e6e937
|
fix(db): use create_all checkfirst=True to fix CrashLoopBackOff
CD Pipeline / build-and-deploy (push) Failing after 12m19s
During a rolling update, create_all tried to recreate an existing index, so startup
failed with "ix_incident_evidence_incident_id already exists".
checkfirst=True makes SQLAlchemy skip tables/indexes that already exist,
so init_db() is now idempotent and no longer causes CrashLoopBackOff.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-15 15:00:49 +08:00 |
|
OG T
|
7da64eaad2
|
feat(Phase 3): rebuild the learning loop: three root-cause fixes + 2x EWMA + Evolver Agent
CD Pipeline / build-and-deploy (push) Failing after 19m7s
Type Sync Check / check-type-sync (push) Failing after 1m18s
ADR-083 Phase 3 learning-loop rebuild:
**Three root-cause fixes**
- approval_execution.py: fire-and-forget create_task → await asyncio.wait_for(timeout=30) × 2
(success path L265 + failure path L353; a timeout records the learning_trigger_timeout metric and does not crash the main flow)
- models/approval.py: add the matched_playbook_id field to ApprovalRequestBase
- decision_manager.py: _auto_execute populates matched_playbook_id when it creates an ApprovalRequest
- learning_service.py: dual-path lookup for _matched_pb_id (matched_playbook_id + metadata fallback)
**2x EWMA negative reinforcement**
- models/playbook.py: add trust_score: float = 0.3 (EWMA dynamic trust field)
- repositories/playbook_repository.py: update_stats applies EWMA
Success: trust = 0.9 × old + 0.1 × 1.0
Failure: trust = 0.8 × old + 0.2 × 0.0 (decays 2x faster)
trust < 0.1 → log warning, wait for the Evolver to archive
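The two EWMA update rules above can be sketched as follows. This is a minimal illustration, not the code in playbook_repository.py: the function names `update_trust` and `failures_until_archive` are hypothetical, and only the coefficients, the 0.3 starting score, and the 0.1 threshold come from the commit message.

```python
SUCCESS_ALPHA = 0.1      # weight of a new success observation (from the commit)
FAILURE_ALPHA = 0.2      # failures weigh twice as much: the "2x" in the title
ARCHIVE_THRESHOLD = 0.1  # below this the playbook is flagged for the Evolver


def update_trust(old: float, success: bool) -> float:
    """Return the new EWMA trust score after one execution outcome."""
    if success:
        return (1 - SUCCESS_ALPHA) * old + SUCCESS_ALPHA * 1.0
    return (1 - FAILURE_ALPHA) * old + FAILURE_ALPHA * 0.0


def failures_until_archive(start: float = 0.3) -> int:
    """Count the consecutive failures needed to fall below the threshold.

    Hypothetical helper, used only to show how fast the score decays.
    """
    trust, count = start, 0
    while trust >= ARCHIVE_THRESHOLD:
        trust = update_trust(trust, success=False)
        count += 1
    return count
```

Starting from the default 0.3, a success moves the score to 0.37 and a failure to 0.24, so one failure costs more than one success earns back.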
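**Evolver Agent (new)**
- services/playbook_evolver.py: three functions, all static rules
1. Low-trust archival: trust < 0.1 → DEPRECATED
2. Dormant archival: unused for 30d AND trust < 0.5 → DEPRECATED
3. Similarity merge: symptom Jaccard > 0.9 → keep the higher-trust playbook, archive the lower-trust one
AIOPS_P3_EVOLVER_ENABLED=False (off by default)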
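A rough sketch of the three static Evolver rules, assuming playbooks can be reduced to a trust score, a last-used timestamp, and a set of symptom strings. The `Playbook` shape and the `select_deprecated` function are hypothetical; only the thresholds (0.1, 30 days, 0.5, Jaccard 0.9) are taken from the commit message.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Playbook:
    """Hypothetical minimal shape, not the project's real model."""
    playbook_id: str
    trust: float
    last_used: datetime
    symptoms: FrozenSet


def jaccard(a: FrozenSet, b: FrozenSet) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B| (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_deprecated(playbooks: List[Playbook], now: datetime) -> Set[str]:
    """Return the ids the three rules would mark DEPRECATED."""
    deprecated: Set[str] = set()
    for pb in playbooks:
        dormant = now - pb.last_used > timedelta(days=30)
        # Rule 1: low trust. Rule 2: dormant and not yet trusted.
        if pb.trust < 0.1 or (dormant and pb.trust < 0.5):
            deprecated.add(pb.playbook_id)
    # Rule 3: among the survivors, near-duplicates keep the higher trust.
    active = [pb for pb in playbooks if pb.playbook_id not in deprecated]
    for i, left in enumerate(active):
        for right in active[i + 1:]:
            if jaccard(left.symptoms, right.symptoms) > 0.9:
                loser = min(left, right, key=lambda pb: pb.trust)
                deprecated.add(loser.playbook_id)
    return deprecated
```

The pairwise comparison is quadratic in the number of active playbooks, which is acceptable for a periodic batch job over a small catalogue but would need indexing at larger scale.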
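**Docs**
- ADR-083 learning-loop rebuild
- MASTER §8 Phase 3 completion record
AIOPS_P3_ENABLED=False (default); the skeleton is in place, awaiting the Commander's approval to enable it
Co-Authored-By: Claude Sonnet 4.6 (Asia-Pacific) <noreply@anthropic.com>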
|
2026-04-15 14:01:37 +08:00 |
|
OG T
|
42bc1df9f9
|
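fix(phase2): fix two security holes found during verification
CD Pipeline / build-and-deploy (push) Has been cancelled
Found during manual verification:
1. reviewer_agent.py: the force-push regex only covered the word order "force push"
and missed git's actual syntax "git push --force" (push first, --force/-f after)
→ fixed with a bidirectional pattern: (?:force.{0,5}push|push.{0,30}(?:--force|-f\b)).{0,30}main
2. coordinator_agent.py: a Critic critical challenge only applied a 0.3 penalty;
when the original confidence is > 0.7 (e.g. 0.82), the value after the penalty is still above the 0.4 threshold,
so the critical challenge leaked through to the auto-execute path (verified: 0.82→0.52>0.4)
→ added a Critic REJECT hard gate (same force as Reviewer REJECT)
that forces requires_human_approval=True before the penalty logic runs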
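Both fixes can be illustrated in a few lines. The regex is copied from the commit message; the `re.IGNORECASE` flag, the function names, and the exact comparison against the 0.4 threshold are assumptions made for this sketch rather than the behaviour of reviewer_agent.py or coordinator_agent.py.

```python
import re

# Pattern from the commit; IGNORECASE is an assumption of this sketch.
FORCE_PUSH_MAIN = re.compile(
    r"(?:force.{0,5}push|push.{0,30}(?:--force|-f\b)).{0,30}main",
    re.IGNORECASE,
)

AUTO_EXECUTE_THRESHOLD = 0.4  # from the commit
CRITIC_PENALTY = 0.3          # from the commit


def is_force_push_to_main(text: str) -> bool:
    """True when the text describes a force push targeting main."""
    return FORCE_PUSH_MAIN.search(text) is not None


def requires_human_approval(confidence: float, critical_challenge: bool,
                            hard_gate: bool = True) -> bool:
    """Hypothetical coordinator gate.

    With hard_gate=False this reproduces the old penalty-only behaviour.
    """
    if critical_challenge and hard_gate:
        return True  # the hard gate fires before any penalty arithmetic
    adjusted = confidence - CRITIC_PENALTY if critical_challenge else confidence
    return adjusted <= AUTO_EXECUTE_THRESHOLD
```

Calling `requires_human_approval(0.82, True, hard_gate=False)` shows the original leak: the penalised score of roughly 0.52 still clears the threshold, so no human is asked.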
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-15 13:48:55 +08:00 |
|
OG T
|
5ddba6d6e0
|
feat(adr-082): Phase 2 multi-agent collaboration: skeleton of the 5-role dialectic system goes live
Added 5 agents + Orchestrator + DecisionManager wiring:
- protocol.py: DiagnosisReport / ActionPlan / ReviewVerdict / CriticReport / DecisionPackage type system
- DiagnosticianAgent: root cause analysis (RCA); confidence < 0.4 → ABSTAIN
- SolverAgent: remediation planner; blast_radius scoring + degraded rule-based mock
- ReviewerAgent: safety review; HARD_RULES static patterns + blast_radius thresholds (>50 revision, >80 reject)
- CriticAgent: deliberate devil's advocate; forces 3 critical-thinking questions; critical challenge → REJECT
- CoordinatorAgent: pure rule aggregation; 6-level decision gate; REQUEST_REVISION → forced human review
- AgentOrchestrator: 30s global timeout; Reviewer ‖ Critic in parallel; DB immutable event sourcing + Redis Streams
- DecisionManager: AIOPS_P2_ENABLED gate + _package_to_proposal_data bridging to the existing proposal_data format
- AgentSession DB table + 4 composite indexes
- ADR-082 decision record
Gate 2 fixes (7 items):
- CRITICAL: DELETE FROM regex lookahead was in the wrong position (moved after FROM)
- CRITICAL: REQUEST_REVISION could reach the auto-execute path (restored requires_human_approval=True)
- IMPORTANT: the flat regex in _extract_json did not support nested JSON (switched to find/rfind boundary extraction)
- IMPORTANT: all_degraded omitted verdict.degraded (now covers all 4 agents)
- IMPORTANT: the Solver ABSTAIN guard let degraded hypotheses through (now skips whether or not hypotheses exist)
- IMPORTANT: dataclasses.asdict() left Enums unserialized, causing silent DB write failures (added a json.dumps default handler)
- IMPORTANT: the P2 gate read the attribute directly, bypassing the parent-Phase guard (now uses is_phase_enabled(2))
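The find/rfind boundary extraction named in the _extract_json fix can be sketched like this. It is an assumed reconstruction of the idea, not the project's function: take everything from the first "{" to the last "}" and let the JSON parser deal with nesting.

```python
import json
from typing import Any, Optional


def extract_json(text: str) -> Optional[Any]:
    """Pull one JSON object out of free-form LLM output.

    Slicing from the first '{' to the last '}' keeps nested objects intact,
    which a flat pattern such as r"\{[^{}]*\}" cannot do.
    """
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
```

The trade-off is that output containing two separate top-level objects fails to parse and yields `None`; this sketch treats that as "no usable JSON" instead of guessing.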
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-15 13:48:55 +08:00 |
|
OG T
|
cae9833e5d
|
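fix(heartbeat): fix duplicate system reports when running multiple replicas
Root cause: RedisLock released the lock as soon as the async with block exited.
Two pods align to the same slot but with different offsets; about 10s after the first pod
sent its report and released the lock, the second pod woke up and grabbed the free lock
→ two identical reports were sent for the same 30min slot.
Fix: use a slot-based key (heartbeat:slot:{slot_id}) with
SET NX EX interval_seconds and no explicit release; the TTL
expires on its own. Within a 30min slot, only the first pod to claim the key can send.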
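The slot-based claim can be sketched without a Redis server. `FakeRedis` below is an in-memory stand-in written for this example (it models only "SET key NX EX ttl"), and `try_claim_slot` is a hypothetical name; the key format and the no-release behaviour follow the commit message.

```python
from typing import Dict


class FakeRedis:
    """In-memory stand-in for SET key value NX EX ttl (illustration only)."""

    def __init__(self) -> None:
        self._expiry: Dict[str, float] = {}

    def set_nx_ex(self, key: str, ttl: int, now: float) -> bool:
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False          # key still alive: someone else owns the slot
        self._expiry[key] = now + ttl
        return True


def try_claim_slot(store: FakeRedis, now: float,
                   interval_seconds: int = 1800) -> bool:
    """Return True only for the first caller inside a given time slot."""
    slot_id = int(now // interval_seconds)
    # No release on purpose: the key simply expires with its TTL.
    return store.set_nx_ex(f"heartbeat:slot:{slot_id}", interval_seconds, now)
```

With redis-py the same claim would be a single `client.set(key, "1", nx=True, ex=interval_seconds)` call, which returns a truthy value only for the first writer.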
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-15 13:17:10 +08:00 |
|
OG T
|
f1cbf6db7d
|
feat(adr-081): Phase 1 sensory depth: 8D intelligence gathering + post-execution verification
Deliverables:
- IncidentEvidence DB model (8D senses + pre/post execution state)
- EvidenceSnapshot dataclass (build_summary → LLM context)
- SanitizationService (zero tolerance for prompt injection, 12 patterns)
- MCPToolRegistry (dynamic tool registration; suggest_tools does not hardcode alert types)
- PreDecisionInvestigator (8D parallel senses, P99 < 8s, Redis 30s cache)
- PostExecutionVerifier (warmup 10s → post-state evaluation success/degraded/failed)
- decision_manager + approval_execution wiring (guarded by feature flags)
Gate 1 fixes: added sanitize_dict_values to D4/D5/D7/D8; removed the bare "error" failure
signal to avoid false positives on the error_rate key; warn when evidence_snapshot rowcount is zero.
Tests: 130 passed (+111 new)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-15 13:08:38 +08:00 |
|
OG T
|
db9e304a14
|
feat(adr-080): Phase 0 guardrails established: AI autonomy flywheel kickoff
- docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md
(1456 lines, §0-§8 fully filled in: 42-cell tactical matrix, 7-Phase plan, 7 ADR summaries,
15 KPIs, 21 feature flags, 10 risk scenarios)
- docs/adr/ADR-080-ai-autonomy-flywheel-overview.md
(7-Phase structure + 4 north stars + 7 architect Review Gates + Phase exit conditions)
- apps/api/src/core/feature_flags.py
(AIOpsFeatureFlags: P1~P6 master switches all False + 15 fine-grained sub-flags,
is_phase_enabled() / is_sub_flag_enabled() + safe bool cast)
- apps/api/src/jobs/__init__.py + baseline_snapshot.py
(Phase 0 baseline snapshot job: MCP calls / Playbook confidence / general ratio
/ learning loop rate / auto_repair, written to aiops:baseline:latest)
- apps/api/tests/test_feature_flags.py (21 tests, all green)
- docs/HARD_RULES.md → v1.9
(new iron rule on Phase exit conditions: a Phase must not be declared complete before it passes its exit conditions)
- CLAUDE.md anti-amnesia gate 1: mandatory read of MASTER §0 Session Resume Protocol
Gate 0 Pass: 21/21 tests green
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-15 12:44:53 +08:00 |
|
OG T
|
6c7f648b60
|
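fix: 3 silent, unconnected flywheel nodes, traced from the Commander's screenshots
CD Pipeline / build-and-deploy (push) Successful in 18m56s
Screenshot evidence from the Commander (Telegram MEDIUM alerts still going to manual review):
INC-20260411-A03B2E / A2BB29 show "[rule match]" + action=unknown-service
Node 1: AutoApprovePolicy blocks rule matches (main cause of the stalled flywheel)
- ADR-073 rule matches have confidence=0.0 (anti-forgery)
- AutoApprovePolicy.min_confidence=0.50 → blocked
- Result: MEDIUM rule matches always go to manual review, so the flywheel never turns
Fix: auto_approve.py adds an _is_rule_based check
(is_rule_based / source=expert_system / rule_id / matched_rule)
→ bypasses the min_confidence check
→ verified: should_auto_approve=True ✅
Node 2: _is_bad_target missed the unknown-service magic string
- the _resolve_target_from_k8s fallback produces unknown-service / unknown-pod
- GAP-A4 Phase 1/2 only blocked the exact string 'unknown', not the prefix
Fix: alert_rule_engine.py adds an unknown-/none-/null-/undefined- prefix blacklist
→ verified: all 4 magic strings are flagged as bad ✅
Node 3: stale_ready_tokens_resend had no age filter
- the screenshots show alerts from 2026-04-11 (4 days earlier)
- the old labels have expired, so reprocessing cannot produce a new target
- it overloads Ollama and pollutes the Telegram cards
Fix: decision_manager.py skips stale incidents older than 3 days
→ skip + log stale_ready_token_skipped_too_old
Regression: 113/113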
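The Node 1 fix can be pictured as a gate that lets rule-based proposals skip the confidence floor. This sketch models only that one check; the dict-shaped proposal and both function names are assumptions, and the real AutoApprovePolicy presumably applies further conditions (risk level and so on) that are not shown here.

```python
from typing import Any, Mapping

MIN_CONFIDENCE = 0.50  # AutoApprovePolicy.min_confidence from the commit


def is_rule_based(proposal: Mapping[str, Any]) -> bool:
    """Detect a rule-engine proposal via any of the four markers listed above."""
    return bool(
        proposal.get("is_rule_based")
        or proposal.get("source") == "expert_system"
        or proposal.get("rule_id")
        or proposal.get("matched_rule")
    )


def passes_confidence_gate(proposal: Mapping[str, Any]) -> bool:
    """Rule matches carry confidence=0.0 by design, so they bypass the floor."""
    if is_rule_based(proposal):
        return True
    return float(proposal.get("confidence", 0.0)) >= MIN_CONFIDENCE
```

The bypass is safe only as long as the rule markers cannot be forged by LLM output, which is the reason the commit gives for pinning rule-match confidence to 0.0.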
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-15 10:56:48 +08:00 |
|
OG T
|
a92562d65c
|
feat(Phase 5 Sprint 5.4): generate category buttons dynamically from the registry; buttons relaunched
CD Pipeline / build-and-deploy (push) Successful in 17m11s
_build_inline_keyboard() rewritten:
- the hardcoded _CATEGORY_BUTTONS dict (28 buttons) has been retired
- buttons are now generated dynamically from the callback_action_spec.yaml registry
- spec.callback_format decides the format:
* nonce (write actions) → self._security.generate_callback_nonce(approval_id, action_name)
* info (read actions) → {action_name}:{incident_id}
- a new button only needs a yaml change, with zero code changes
Category coverage (derived automatically from the yaml):
- kubernetes: 6 buttons (4 write + 2 read)
- host_resource: 3 buttons (1 read + 2 write)
- secops: 4 buttons (all write + Multi-Sig)
- database: 3 buttons
- storage: 2 buttons
- network: 3 buttons
- devops_tool: 2 buttons
- external_site: 2 buttons
- business: 1 button
- flywheel_health: 1 button
- ssl_cert: 1 button
This time the buttons are not ghosts; each one has:
✅ a correct callback_format (4-part nonce / 2-part info)
✅ a Sprint 5.3 dispatch handler that receives it
✅ execution through the Sprint 5.2 MCP registry
✅ an audit log + reply_to the original card
Regression: 188/188
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 21:40:20 +08:00 |
|
OG T
|
de8bbd8ab9
|
feat(Phase 5 Sprint 5.3): nonce action routing for write-category buttons + audit log
CD Pipeline / build-and-deploy (push) Has been cancelled
Insertion point: _handle_callback_query Step 1.9 (after nonce verification, before Step 2 approve/reject)
Logic:
1. Look up in the spec registry whether the action is a registered write action
2. If action in (approve/reject/silence/tune/log_manual_fix) → skip and use the existing flow
3. If spec.requires_multi_sig=True and current_signatures < 2 → show "2 signers required"
4. Audit log (category_write_action_audit_start) with user/risk/provider/tool
5. Ack Telegram (emoji + label + "executing...")
6. Read labels from the incident for template substitution
7. dispatch_action() → MCP execution
8. Reply with the result to the original alert card (Redis tg_msg lookup)
9. Audit log (category_write_action_audit_complete) with success/error/duration
Supported write actions:
- k8s_restart/scale_up/scale_down/rollback (kubernetes)
- host_restart_service/clear_log (host_resource)
- docker_restart/minio_restart (devops_tool/storage)
- reload_nginx/renew_cert (network/ssl_cert)
- kill_slow_query/clear_conn_pool (database)
- pause_1h/trigger_diagnose (business/flywheel)
Multi-Sig support (reserved for Sprint 5.4):
- secops_isolate/block_ip/evict → requires_multi_sig=True
- fewer than 2 signatures → show the prompt and do not execute
Regression: 129/129
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 21:39:16 +08:00 |
|
OG T
|
208c28ed09
|
feat(Phase 5 Sprint 5.2): connect the callback dispatcher to the real MCP registry
CD Pipeline / build-and-deploy (push) Successful in 14m38s
dispatch_action() upgrade:
- upgraded from the Sprint 5.0 stub to real MCP calls
- internal provider: URL builder + authorization record (does not go through MCP)
- other providers: from src.plugins.mcp.registry import get_provider → execute
- asyncio.wait_for wraps timeout_sec (set per spec, different for each button)
Graceful degradation:
- Provider not registered → returns success=False + 'provider_not_found' error
- MCP returned success=False → the reply includes the error message
- asyncio.TimeoutError → reply "timed out after Xs" + log
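The three degradation paths above can be sketched with plain asyncio. Providers are modelled here as a dict of async callables, which is an assumption made for brevity; the real code resolves them through get_provider in src.plugins.mcp.registry, whose interface is not shown in this log.

```python
import asyncio
from typing import Awaitable, Callable, Dict

Provider = Callable[[], Awaitable[dict]]  # hypothetical provider shape


async def dispatch_action(providers: Dict[str, Provider], name: str,
                          timeout_sec: float) -> dict:
    """Run one provider call and never raise to the caller."""
    provider = providers.get(name)
    if provider is None:
        return {"success": False, "error": "provider_not_found"}
    try:
        # Per-button timeout, as configured in the action spec.
        result = await asyncio.wait_for(provider(), timeout=timeout_sec)
    except asyncio.TimeoutError:
        return {"success": False, "error": f"timed out after {timeout_sec}s"}
    if not result.get("success"):
        return {"success": False, "error": result.get("error", "unknown error")}
    return result
```

Returning a uniform `{"success": ..., "error": ...}` dict on every path lets the Telegram reply code format failures the same way regardless of which layer produced them.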
New _handle_internal_action():
- build_signoz_url → https://signoz.wooo.work/services/{service}
- build_flywheel_url → https://awoooi.wooo.work/flywheel
- record_authorization → 24h same-origin silence confirmation
Test coverage (26/26):
- 3 new internal action tests (open_signoz/open_flywheel/secops_authorize)
- 1 MCP failure graceful test
- the existing 22 tests are kept (2 Sprint 5.0 stub tests updated to Sprint 5.2 graceful behavior)
Sprint 5.2 DOD:
✅ dispatch path complete for the 10 read buttons
✅ 3 internal actions implemented
✅ Graceful failure (no crash)
✅ asyncio.wait_for timeout protection
⏳ actual end-to-end test (requires every prod MCP provider to be registered)
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 20:43:40 +08:00 |
|
OG T
|
581b244ad1
|
feat(Phase 5 Sprint 5.1): wire the Telegram callback_handler to the dispatcher
CD Pipeline / build-and-deploy (push) Has been cancelled
Integration point: the unknown-action fallback path in _handle_callback_query
Changes:
1. Line 2601: the former "⚠️ Unknown action" reply now calls _dispatch_category_action()
2. New _dispatch_category_action() method:
- looks up the callback_action_spec registry
- if the action does not exist → replies "Unknown action" (behavior unchanged)
- if it exists → acknowledge + read labels from the incident + dispatch + reply to the original card
Effect:
- the 10 read buttons (check_process / check_port / check_log_* / check_health / open_signoz /
open_flywheel, etc.) now have a complete flow (Sprint 5.2 has not wired MCP yet, but the stub replies)
- once CD deploys and Sprint 5.2 wires in MCP, the read buttons go live automatically
Sprint 5.1 DOD:
- ✅ callback_handler wired to _dispatch_category_action
- ✅ Dispatcher reads incident labels to substitute template variables
- ✅ Reply to the original alert card (Redis tg_msg lookup)
- ⏳ actual MCP execution (Sprint 5.2)
Regression tests: 109/109
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 20:41:22 +08:00 |
|
OG T
|
36754a8a84
|
fix: Bug A diagnostics + Bug B real fix: hardcoded LLM 120s/130s → OPENCLAW_TIMEOUT
CD Pipeline / build-and-deploy (push) Has been cancelled
Handling the two remaining deep bugs:
Bug A (approval.incident_id still NULL): added diagnostics
- update_incident_id now checks rowcount
- if the UPDATE affects 0 rows → warning log (id type mismatch or out-of-sync session)
- a manual UPDATE test passed → DB/permissions are fine; the problem is in the application layer
- after CD deploys, watch the live-fire logs to diagnose the real cause
Bug B (LLM still takes 2m6s >> 30s): real fix
openclaw.py had two hardcoded timeouts:
- line 146 httpx client default: 120.0s → settings.OPENCLAW_TIMEOUT (30s)
- line 348 /analyze/incident POST: 130.0s → settings.OPENCLAW_TIMEOUT (30s)
GAP-B4 commit dd0a778 only fixed ai_providers/ollama.py;
openclaw.py's own httpx client and endpoint call were left unchanged.
That is the real reason Live-fire #2-#7 all stalled at 120s+.
Regression tests: 125/125 (dispatcher + a4 + classify + grouping)
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 20:38:00 +08:00 |
|
OG T
|
2e2f5a1881
|
feat(Phase 5 Sprint 5.0): Callback Dispatcher spec + skeleton + 22 tests
CD Pipeline / build-and-deploy (push) Has been cancelled
The Commander approved all Phase 5 Sprints. Sprint 5.0 deliverables:
1. callback_action_spec.yaml (24 actions)
- 10 read actions (info 2-part callback, no side effects): check_process, check_port,
check_log_*, check_health, check_pod_logs, describe_pod, open_signoz,
open_flywheel
- 10 write actions (nonce 4-part, with side effects): k8s_restart/scale_up/scale_down/rollback,
host_restart_service/clear_log, docker_restart, minio_restart,
reload_nginx, renew_cert
- 4 secops (Multi-Sig CRITICAL): secops_isolate/block_ip/evict/authorize
2. callback_dispatcher.py
- Registry pattern (lru_cache): get_action_spec / list_actions_for_category
- Template variable substitution: {incident_id} / {labels.xxx} / {signals[0].xxx}
- dispatch_action() skeleton (MCP wiring in Sprint 5.2+)
- _format_reply: 4 formats: text/code/truncated/url
3. test_callback_dispatcher.py (22 tests, all passing)
- Registry loading correctness
- Category filtering
- Template resolution (including nested list index)
- dispatch stub returns the correct spec hint
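The template variable substitution listed under callback_dispatcher.py ({incident_id} / {labels.xxx} / {signals[0].xxx}) might look roughly like the following. The function names and the decision to leave unresolved placeholders untouched are assumptions of this sketch, not a description of the real dispatcher.

```python
import re
from typing import Any, Mapping

_PLACEHOLDER = re.compile(r"\{([^{}]+)\}")
# A path token is either a name (labels, pod) or a list index ([0]).
_TOKEN = re.compile(r"([A-Za-z_]\w*)|\[(\d+)\]")


def _lookup(context: Mapping[str, Any], path: str) -> Any:
    """Walk a dotted/indexed path such as 'signals[0].labels' through context."""
    value: Any = context
    for name, index in _TOKEN.findall(path):
        value = value[name] if name else value[int(index)]
    return value


def resolve_template(template: str, context: Mapping[str, Any]) -> str:
    """Replace every {path} that resolves; leave the rest visible."""

    def replace(match: "re.Match[str]") -> str:
        try:
            return str(_lookup(context, match.group(1)))
        except (KeyError, IndexError, TypeError):
            return match.group(0)

    return _PLACEHOLDER.sub(replace, template)
```

Leaving an unresolved `{placeholder}` in the output, instead of substituting an empty string, keeps the failure detectable by a later guard that rejects commands still containing braces.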
Next step, Sprint 5.1: connect the MCP registry + integrate the telegram callback_handler
The underlying MCP capabilities already exist (k8s 10+ tools, ssh 15 tools)
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 20:34:14 +08:00 |
|
OG T
|
10e3043ce8
|
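fix(UX): retire 28 ghost category buttons + ADR-079 Phase 5 completion plan
CD Pipeline / build-and-deploy (push) Has been cancelled
The Commander's full audit at 2026-04-14 20:00 revealed:
all 28 _CATEGORY_BUTTONS buttons had been dead for 3 days (since commit 325b3851 on 2026-04-11)
- the callback_data format was entirely wrong (3-part, matching neither the parser's 4-part nor its 2-part format)
- a grep of apps/api/src finds no dispatch handler at all
- the Commander hit this for real today: tapping "Check process" did nothing → trust damaged
Chief architect ruling (grade C):
A. Retire immediately (this commit): _CATEGORY_BUTTONS = {}, falling back to the generic buttons
B. Complete in Phase 5 (planned in ADR-079, 3-5 days, implemented in separate Sprints)
Generic buttons kept (all ✅):
- Approve / Reject / Silence (4-part nonce)
- Details / History / Re-diagnose (2-part info)
New defensive documents:
- ADR-079: Phase 5 work breakdown + per-button checklist
- feedback_no_ghost_buttons.md (memory): the iron rule on ghost buttons
Design principle permanently on record: better no button than a dead button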
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 20:19:25 +08:00 |
|
OG T
|
ca862c5575
|
fix(GAP-A4 Phase 2): target rescue on the LLM path; unblocks 12 flywheel interceptions
CD Pipeline / build-and-deploy (push) Has been cancelled
Diagnosis from the Commander's overview report (2026-04-14 20:00):
12 auto_execute_blocked_unresolved_placeholder events within 2h,
all caused by the LLM directly emitting `kubectl ... deployment HostHighCpuLoad`
GAP-A4 Phase 1 only fixed alert_rule_engine._extract_vars,
but the LLM path in decision_manager had no equivalent check → 12 blocks → 0 KM entries, 0 flywheel turns
Fix (after placeholder substitution in decision_manager._auto_execute):
1. Extract the deployment name from the action with a regex (kubectl ... deployment XXX)
2. Validate it with alert_rule_engine._is_bad_target()
3. If it is garbage (==alertname/unknown/IP) → re-derive it from incident.signals[0].labels
(using the same multi-layer logic as _extract_vars)
4. If a valid target is found → action.replace(llm_target, good_target)
5. If the labels cannot rescue it either → log target_rescue_failed → handled by the safety guard
Effect:
- KubePodCrashLooping (has a deployment label) → rescued even if the LLM fills in the wrong target
- HostHighCpuLoad (pure host alert, no K8s label) → still goes to the safety guard,
but the log is now target_rescue_failed instead of unresolved_placeholder
- the 12 flywheel interceptions should drop substantially
Regression: 66/66 (GAP-A4 + kubectl validation) all passing
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 20:06:05 +08:00 |
|
OG T
|
914c7e7a90
|
fix: NoneAttr bug introduced by 9b9ff5b; move incident_id up to Base
CD Pipeline / build-and-deploy (push) Has started running
Type Sync Check / check-type-sync (push) Failing after 1m17s
bug: 'ApprovalRequestCreate' object has no attribute 'incident_id'
In Live-fire #6 the whole webhook failed with a 500.
Root cause: 9b9ff5b writes request.incident_id in approval_db,
but ApprovalRequestCreate inherits from Base, which lacks this field (only ApprovalRequest has it).
Fix: move incident_id up to ApprovalRequestBase
- ApprovalRequestCreate inherits it automatically → the webhook can create a request carrying incident_id
- ApprovalRequest no longer defines it a second time
- 786/786 regression tests pass
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 20:01:47 +08:00 |
|
OG T
|
8b7e9cbfb8
|
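fix(BLOCKER): consecutive LLM failures; all 4 design violations fixed
CD Pipeline / build-and-deploy (push) Successful in 14m21s
The Commander's review found the real cause of the silent flywheel: 4 bugs that violate the established architecture, all colliding at once.
P0a: Ollama timeout violates the GAP-B4 design
config.py: OPENCLAW_TIMEOUT changed from 120s to 30s
The original 120s violated the ADR-052 GAP-B4 design (LLM 25s hard timeout),
so when Ollama was overloaded, threads starved for 120s before degrading
P0b: AI Router silent-skip observability fix
ai_router.py: not_registered/circuit_open/rate_limit/privacy_skip
are all accumulated into the errors array, so the all_providers_failed log shows why each provider was skipped
Previously errors=["ollama: Timeout"] while tried=4, which made diagnosis impossible
P1a: send_text method does not exist
ai_router.py:1005 tg.send_text() → tg.send_notification(parse_mode=HTML)
TelegramGateway only has send_notification, not send_text,
so the fallback-failure notification itself failed (double silence)
P1b: resend_stale_ready_tokens concurrency explosion
decision_manager.py: added asyncio.Semaphore(5) + 200ms throttle
Previously fire_and_forget ran N tasks at once; with N=108, the Ollama embedding calls
all timed out, and even my own live-fire was crowded out
Now: max 5 concurrent + a 200ms pause after each completion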
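The P1b fix (a semaphore of 5 plus a 200ms pause) can be sketched as below. The `run_throttled` name and the choice to hold the semaphore slot during the pause are assumptions; the commit only states the limit and the pause length.

```python
import asyncio
from typing import Awaitable, Callable, Iterable

Job = Callable[[], Awaitable[None]]


async def run_throttled(jobs: Iterable[Job], max_concurrency: int = 5,
                        pause_sec: float = 0.2) -> None:
    """Run all jobs, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(job: Job) -> None:
        async with semaphore:
            await job()
            # Breathe before freeing the slot so the backend is not
            # hit again the instant a call finishes.
            await asyncio.sleep(pause_sec)

    await asyncio.gather(*(guarded(job) for job in jobs))
```

All tasks are still created up front, but only five ever run their body at once, which bounds the load on the embedding backend no matter how many stale tokens are queued.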
CD process review (Blocker 1): fully consistent with the ADR-039 design; 10-15 min is expected.
No fix needed; the design requires this much time.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 19:37:03 +08:00 |
|
OG T
|
9b9ff5bec6
|
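fix(critical): approval_records.incident_id column never written; Telegram cards cannot find the INC number
CD Pipeline / build-and-deploy (push) Successful in 15m15s
🚨 Found by the Commander in live testing (live-fire #2 and #3 repeatedly could not find the card):
DB query evidence:
SELECT id, incident_id, telegram_message_id FROM approval_records
→ incident_id=NULL, telegram_message_id=NULL (all new approvals)
Yet the incidents table does contain the corresponding INC-20260414-3318E8 / 5C90CC.
Root cause:
The dict built by approval_db.approval_request_to_record_data() has no incident_id
field at all. The ApprovalRequestCreate schema clearly has incident_id: str | None at line 165,
but it is dropped when converting to a record → always NULL in the DB → the Telegram card shows a blank INC number.
Impact:
- users cannot tell on Telegram which incident an approval card belongs to
- the manual-review loop exists in name only (even an approval cannot be linked back to the incident)
- the update_telegram_message_id path cannot backfill via fallback either (a lookup on NULL finds nothing)
Fix (minimally invasive):
Add "incident_id": request.incident_id to the dict
Scope of impact, nothing broken:
- old approvals stay NULL (untouched)
- new approvals are written correctly from now on
- the DB schema already has this column (line 280 Mapped[str|None])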
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 19:21:11 +08:00 |
|
OG T
|
72dd0c5875
|
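fix: Telegram approval gate + execution-result reply; closes the manual-review loop
CD Pipeline / build-and-deploy (push) Successful in 14m7s
3 fixes (found in the Commander's review):
1. telegram_gateway.py:4890: gate changed from execution_triggered to approval.status==APPROVED
- the old gate relied on an optimistic-lock flag, which fails under a race (REST and Telegram approving at the same time)
- aligned with the REST API path at approvals.py:360
- added a Redis lock exec:{approval_id} with a 60s TTL to prevent re-entry
2. telegram_gateway.py:4772: removed the misleading "👀 Waiting to execute" text
- after approval, always show "⚡ Executing..."; the actual result is added by the reply in #3
3. approval_execution.py: new _push_execution_result_to_alert()
- fire-and-forget calls in both the success and failure paths
- skipped when requested_by=="auto_approve" (avoids clashing with _push_auto_repair_result)
- looks up the original alert message_id in Redis tg_msg:{incident_id} → reply_to
- if no message_id is found, it silently sends nothing and does not affect the main execution flow
Non-regression checks:
- ✅ auto-execute path unaffected (skip via requested_by)
- ✅ Reject path untouched
- ✅ Redis lock prevents re-entry
- ✅ 132 regression tests pass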
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 19:03:38 +08:00 |
|
OG T
|
aa4e5757a2
|
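fix: tech-debt cleanup: report_generation retry mechanism + GAP-A4 documentation
CD Pipeline / build-and-deploy (push) Successful in 15m46s
Tech debt #1: postmortem send failures were silently swallowed
- 3 exponential-backoff retries (2s → 4s → 6s)
- after all retries fail, send a simplified fallback notification to the SRE group
- prevents postmortems from silently disappearing
Tech debt #2 (QueryBuilder abstraction): DEFER
- only 1 place in the whole project uses an outcome JSON path query
- it would violate "Don't design for hypothetical future requirements"
- extract it once a second caller appears
Tech debt #3 (E2E tests): already covered
- test_gap_a4_placeholder_resolution.py TestMatchRuleRejection
- Mission C prod-chain live test (KubePodCrashLooping)
- Playwright K8s/Telegram staging tests are deferred until the staging environment is ready
New documents:
- ADR-078-gap-a4-placeholder-resolution.md
- LOGBOOK 2026-04-14 late-night wrap-up entry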
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 18:46:25 +08:00 |
|
OG T
|
10b74affcf
|
fix(GAP-A4): fix placeholder resolution in rule action templates; ends 8.3h of flywheel silence
CD Pipeline / build-and-deploy (push) Has been cancelled
🚨 Root-cause diagnosis (caught by the Commander):
API logs show a burst of auto_execute_blocked_unresolved_placeholder in the last hour:
- action: "kubectl rollout restart deployment HostHighCpuLoad" ← target=alertname
- action: "kubectl rollout restart deployment unknown"
- action: "kubectl scale deployment unknown --replicas=3"
Root cause: the target resolution in alert_rule_engine._extract_vars() is not robust enough.
When a Prometheus alert has no deployment label, it falls back to alertname or "unknown"
and produces garbage commands. The GAP-A1 injection guard correctly blocks them, but the auto-repair path is then stuck,
nothing is written to KM → the flywheel goes silent.
Fix (three layers of defense):
1. New _strip_pod_suffix(): recover the Deployment base name from a K8s Pod name
- Deployment format: awoooi-api-7d6b776f78-4sgjl → awoooi-api
- StatefulSet: postgresql-0 → postgresql
- Legacy: my-job-x2m4k → my-job
2. New _is_bad_target(): detects garbage targets
- empty string / "unknown" / "none" / "null"
- target == the alertname itself
- IP:port format, bare IP, contains whitespace/brackets/quotes
- unresolved {placeholder}
3. Rewritten _extract_vars(): multi-layer label lookup (most authoritative first):
deployment > app > statefulset > pod (suffix stripped) > container > service > target_resource
every layer is validated by _is_bad_target; if all fail → target="unknown"
4. Post-match double validation in match_rule():
- bad target → clear kubectl_command (degrade to the LLM)
- leftover { or } → clear kubectl_command (template not fully filled)
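The suffix stripping, bad-target check, and layered label lookup described above might be sketched as follows. The regular expressions are assumptions written for this example: in particular, the five-character suffix class uses the consonant-and-digit alphabet that Kubernetes uses for generated name suffixes, and the StatefulSet pattern would also strip any name that merely ends in "-<digits>". The real _strip_pod_suffix, _is_bad_target, and _extract_vars may differ.

```python
import re
from typing import Mapping

_K8S_RAND = "bcdfghjklmnpqrstvwxz2456789"  # alphabet of generated suffixes
_DEPLOYMENT_POD = re.compile(rf"^(.+)-[a-z0-9]{{5,10}}-[{_K8S_RAND}]{{5}}$")
_STATEFULSET_POD = re.compile(r"^(.+)-\d+$")
_LEGACY_POD = re.compile(rf"^(.+)-[{_K8S_RAND}]{{5}}$")
_IP = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?$")
_LABEL_PRIORITY = ("deployment", "app", "statefulset", "pod",
                   "container", "service", "target_resource")


def strip_pod_suffix(pod: str) -> str:
    """Recover the workload base name from a pod name, if a pattern fits."""
    for pattern in (_DEPLOYMENT_POD, _STATEFULSET_POD, _LEGACY_POD):
        match = pattern.match(pod)
        if match:
            return match.group(1)
    return pod


def is_bad_target(target: str, alertname: str) -> bool:
    """Reject empty, magic, alertname-equal, IP-like or malformed targets."""
    if not target or target.lower() in {"unknown", "none", "null"}:
        return True
    if target == alertname or _IP.match(target):
        return True
    # Whitespace, brackets, quotes, or an unresolved {placeholder}.
    return bool(re.search(r"[\s(){}\[\]'\"]", target))


def extract_target(labels: Mapping[str, str], alertname: str) -> str:
    """Walk labels in priority order and return the first usable target."""
    for key in _LABEL_PRIORITY:
        value = labels.get(key, "")
        if key == "pod":
            value = strip_pod_suffix(value)
        if not is_bad_target(value, alertname):
            return value
    return "unknown"
```

Returning the literal "unknown" when every layer fails keeps the contract stated in item 3, and the post-match validation in item 4 is what then discards the command.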
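Test coverage:
- 33 new unit tests (all four major GAP-A4 scenarios covered)
- 214/214 regression tests pass
Impact:
- the path that used to produce "kubectl rollout restart deployment HostHighCpuLoad"
→ now logs `rule_kubectl_command_discarded_bad_target` and degrades to the LLM
- if the LLM can infer the real deployment from the error logs, the flywheel resumes normal operation
- if the LLM cannot resolve it either, the alert goes to the TYPE-4 manual escalation ladder
2026-04-14 Claude Sonnet 4.6 (eliminating hidden bugs outside the MASTER blueprint)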
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 18:43:29 +08:00 |
|
OG T
|
f54dea48b1
|
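fix(GAP-D5): fix DB fields in the daily report
CD Pipeline / build-and-deploy (push) Has been cancelled
Two import/query errors fixed (found in the Commander's E2E preview):
1. _collect_repair_stats: ApprovalRequestRecord does not exist
→ now uses IncidentRecord + an outcome JSON path query on execution_success
2. _collect_playbook_count: PlaybookRecord does not exist
→ now uses playbook_service.list_playbooks() (stored in Redis)
Before the fix: repair success rate was always 0.0% and active Playbooks always 0
After the fix: report numbers reflect the real DB/Redis state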
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 18:32:29 +08:00 |
|
OG T
|
8de807c40d
|
feat(GAP-D5 Task 4.2): hook for automatic postmortem assembly
CD Pipeline / build-and-deploy (push) Has been cancelled
At the end of incident_service.resolve_incident(), a fire-and-forget call to
report_generation_service.trigger_postmortem() supplies the missing trigger path for this orphaned service.
Trigger conditions (decided inside trigger_postmortem):
- duration > POSTMORTEM_MIN_DURATION_MINUTES (10min)
- includes AI root_cause / resolution_action / provider / auto_repaired
Background:
- the 539-line report_generation_service.py was created in an earlier session
- main.py:322 already starts run_daily_report_loop (Task 4.1 ✅)
- trigger_postmortem had no caller under src/ → this commit adds one
MASTER blueprint Phase 4 is now fully complete:
✅ Task 4.1 daily inspection report (scheduled for 08:00 Taipei time, already running in production)
✅ Task 4.2 automatic postmortem assembly (this commit wires up the resolve hook)
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 18:25:15 +08:00 |
|
OG T
|
dd0a778e1f
|
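feat(GAP-B4): LLM timeout degradation ladder; tighter inner timeout
CD Pipeline / build-and-deploy (push) Successful in 14m19s
_dual_engine_analyze hardening (2026-04-14 Claude Sonnet 4.6):
- the OpenClaw LLM call gets its own 25s hard timeout (leaving 5s for downstream processing)
- on timeout, it logs an explicit llm_timeout_fallback and degrades to the Expert System immediately
- the NemoClaw second opinion gets a 3s timeout (advisory, so it must not hold up the main flow)
- the outer decide() 30s wait_for is kept as defence-in-depth
Why:
- with only the outer 30s, a hung LLM consumes the whole budget and the thread pool may starve
- the inner 25s degrades earlier → the Expert System can still respond within the SLA
- LLM timeouts and other exceptions get different log markers, which helps SLO-2 monitoring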
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 15:51:23 +08:00 |
|
OG T
|
dedd7c2c17
|
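feat(BP-1): refine KM extraction quality: distinguish auto/manual + enrich alert metadata
CD Pipeline / build-and-deploy (push) Has been cancelled
_write_execution_result_to_km() improvements:
- distinguishes [auto repair]/[manual repair] by approval.requested_by
- extracts alertname / alert_category / affected_services from the linked Incident
- Category changed from the hardcoded "execution_result" to the real alert_category
- Tags: auto_executed/human_approved + success/failure + alert_category
- Title includes alertname, improving RAG retrieval precision
- created_by is set to auto_execute / approval_execution according to the mode
Verification (2026-04-14 DB query):
- existing KM entries are indeed being written (creator: approval_execution)
- but every title is "[execution record] ❌ kubectl rollout restart deployment/xxx"
- Category is hardcoded to execution_result, and the tags are only execution/execution_failed
- after this change, KM entries carry full context for the next RAG retrieval
Created: 2026-04-14 Taipei time, Claude Sonnet 4.6 (MASTER blueprint BP-1 B.1 refinement)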
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 15:48:02 +08:00 |
|
OG T
|
aae7c12645
|
feat(adr-076): Task 3.3: KM extraction for SSH repairs (completing both hands of the flywheel)
CD Pipeline / build-and-deploy (push) Has been cancelled
Motivation: after a successful SSH MCP repair (docker restart/systemctl), KM could not learn from it,
because _extract_repair_steps only handled kubectl and missed the SSH path entirely.
approval_execution.py:
- _trigger_playbook_extraction: after a successful execution, writes approval.action into
incident.outcome.learning_notes for the Playbook extractor to read
playbook_service.py:
- _parse_ssh_command(): new module-level function that parses the ssh [user@]host 'cmd' format
- _extract_repair_steps(): step 2 extended with an SSH branch
ssh ... → ActionType.SSH_COMMAND + host recorded
kubectl ... → ActionType.KUBECTL (existing logic kept)
- _generate_name(): SSH repairs automatically get an [SSH] prefix
- _extract_tags(): SSH repairs automatically get the ssh + host_layer tags
test_playbook_ssh_extraction.py: 18 tests (100% passing)
Both hands of the flywheel aligned:
kubectl path: decision_chain.reasoning_steps → KM ✅ (existing)
SSH path: approval.action → learning_notes → KM ✅ (new in Task 3.3)
Tests: 794 passed, 26 skipped, 0 failed
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-14 15:19:54 +08:00 |
|
OG T
|
cc42aa0bdb
|
feat(adr-076): Task 2.2 + 2.3: rule expansion + kubectl injection protection
CD Pipeline / build-and-deploy (push) Has been cancelled
Task 2.2: alert_rules.yaml adds 3 rule categories (priority 125-127)
- gitea_down: Gitea CI/CD offline → NO_ACTION (priority 125, critical)
- ssl_cert_expiring: SSL certificate expiring → NO_ACTION (priority 126, medium)
- external_site_down: MoWoooWork/Dev/Blackbox probe → NO_ACTION (priority 127, medium)
Rule count: 21 → 24
Task 2.3: alert_rule_engine.py kubectl injection protection
- _RULE_ENGINE_DESTRUCTIVE_RE: blocks delete pvc/namespace/statefulset/deployment,
drain/cordon, --replicas=0, rm -rf, DROP TABLE, $(), and backticks
- validate_kubectl_command(): public API; SSH commands and empty strings pass straight through
- match_rule() integration: validates after variable substitution; on a block, clears the command + logs a warning
- test_alert_rule_engine_validation.py: 34 tests (100% passing)
Tests: 776 passed, 26 skipped, 0 failed
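A destructive-command guard of the kind Task 2.3 describes could look like this. The pattern below is a reconstruction from the list in the commit message, not the contents of _RULE_ENGINE_DESTRUCTIVE_RE, and the "starts with ssh" test for SSH commands is an assumption.

```python
import re

# Reconstructed from the blocked items listed in the commit message.
_DESTRUCTIVE_RE = re.compile(
    r"delete\s+(?:pvc|namespace|statefulset|deployment)\b"
    r"|\b(?:drain|cordon)\b"
    r"|--replicas=0\b"
    r"|\brm\s+-rf\b"
    r"|\bdrop\s+table\b"
    r"|\$\("      # command substitution
    r"|`",        # backtick substitution
    re.IGNORECASE,
)


def validate_kubectl_command(command: str) -> bool:
    """Return True when the command is allowed to proceed."""
    stripped = command.strip()
    # Empty strings and SSH commands are outside this guard's scope.
    if not stripped or stripped.startswith("ssh "):
        return True
    return _DESTRUCTIVE_RE.search(stripped) is None
```

Running the check after variable substitution, as the commit does, matters because a label value is the most likely place for an injected `$(...)` to enter the command.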
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-14 15:10:10 +08:00 |
|
OG T
|
684d6cfb43
|
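feat(adr-076): all four Tactic B tasks complete: alert grouping + retry + automatic reports
CD Pipeline / build-and-deploy (push) Successful in 17m34s
Task 2: AlertGroupingService: Redis 5-minute sliding window to prevent alert storms
- apps/api/src/services/alert_grouping_service.py (new)
- webhooks.py integration: short-circuits child alerts after fingerprint generation and before the LLM
- Threshold=3, graceful degradation, 16 tests
Task 3: approval_execution.py retry on execution failure
- MAX_RETRY=2, RETRY_DELAY_SECONDS=30
- _is_transient_error() classifies errors as transient or permanent; permanent errors are not retried
- the Timeline records retry progress and notes the retry count after a success, 29 tests
Task 4: report_generation_service.py automatic reports
- Daily inspection report: 08:00 Taipei time every day, pushed to the Telegram SRE group
- Postmortem: triggered automatically when an Incident is resolved and duration > 10 minutes
- main.py lifespan mounts run_daily_report_loop(), 30 tests
Tests: 600 → 675 passing (+75), 0 failed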
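Task 3's retry policy can be sketched as a small wrapper. MAX_RETRY and RETRY_DELAY_SECONDS come from the commit; the marker strings used to recognise a transient error, and the injectable `sleep` parameter, are hypothetical choices made so the example can run without waiting.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")

MAX_RETRY = 2
RETRY_DELAY_SECONDS = 30
# Hypothetical markers; the real _is_transient_error rules are not in the log.
_TRANSIENT_MARKERS = ("timeout", "connection refused", "temporarily unavailable")


def is_transient_error(error: Exception) -> bool:
    """Classify an error as worth retrying, based on its message."""
    message = str(error).lower()
    return any(marker in message for marker in _TRANSIENT_MARKERS)


def execute_with_retry(action: Callable[[], T],
                       sleep: Callable[[float], None] = time.sleep
                       ) -> Tuple[T, int]:
    """Run action; return (result, retries_used). Permanent errors re-raise."""
    attempt = 0
    while True:
        try:
            return action(), attempt
        except Exception as error:
            if attempt >= MAX_RETRY or not is_transient_error(error):
                raise
            attempt += 1
            sleep(RETRY_DELAY_SECONDS)
```

Returning the retry count alongside the result gives the caller what it needs to annotate the timeline after a success, as the commit describes.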
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 14:39:14 +08:00 |
|
OG T
|
c0ba1000f3
|
Revert "fix(auto-repair): medium/low risk + no kubectl_command → TYPE-1 info only, no approval buttons"
This reverts commit abf1ffa91e7327a36af93be2742d53dac1933f0d.
|
2026-04-14 13:33:24 +08:00 |
|
OG T
|
2df4945880
|
fix(auto-repair): medium/low risk + no kubectl_command → TYPE-1 info only, no approval buttons
Problem: host-level alerts such as HostHighCpuLoad have affected_services=[] → OpenClaw generates
kubectl unknown → blocked by the safety guard → falls back to READY + a TYPE-3 card with buttons.
Users keep seeing medium/low-risk alerts with buttons that cannot fix anything.
Three fixes:
1. openclaw.py: _call_openclaw_analyze() returns the suggested_action field
+ target_resource now defaults to "" (keeps "unknown" out of the safety guard)
2. decision_manager.py: classify_notification() is passed
suggested_action / risk_level / has_kubectl_command
3. telegram_gateway.py: new classify_notification() rule:
no kubectl_command + risk=low/medium + action=investigate/no_action
→ TYPE-1 (info only, no buttons)
Takes effect together with clawbot-v5 f4b84d7 (OpenClaw prompt CRITICAL RULES)
2026-04-14 Claude Sonnet 4.6 Asia/Taipei
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-14 13:33:24 +08:00 |
|
OG T
|
38ff2bb7a5
|
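fix(heartbeat): switch to the ADR-075 TYPE-1 format: 💚 INFO tree layout
CD Pipeline / build-and-deploy (push) Successful in 15m4s
Old flat text → ├─/└─ tree layout, matching the ACTION REQUIRED card style
- Title: 💚/⚠️ INFO | AWOOOI System Report
- added a ────── separator line
- the AI/MCP/flywheel/infrastructure sections all use the ├─/└─ format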
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 22:52:05 +08:00 |
|
OG T
|
f1face4e34
|
fix(alert-rules): dedicated HostHighCpuLoad rule; stop kubectl scale unknown
CD Pipeline / build-and-deploy (push) Has been cancelled
Root cause: HostHighCpuLoad is a node_exporter host alert with no pod/deployment label.
It was routed to the K8s high_cpu rule → {target}=unknown → blocked by the auto-repair safety guard
Fix: added a host_cpu_high rule (priority=45) with NO_ACTION + an accurate description;
the high_cpu rule no longer includes HostHighCpuLoad/NodeCPUUsageHigh
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 22:50:37 +08:00 |
|
OG T
|
1a4b52ed28
|
fix(alert): add alertname to the fingerprint to prevent cross-alert collisions + add missing heartbeat classifications
CD Pipeline / build-and-deploy (push) Has been cancelled
Root causes:
1. generate_fingerprint used alert_type (many alertnames fall into "custom")
→ different alert names on the same target shared a fingerprint → the 30-minute debounce made them block each other
2. classify_alert_early missed DeadMansSwitch / NoAlertsReceived /
PrometheusNotConnectedToAlertmanager → they fell into TYPE-3 general alerts
Fixes:
- alert_analyzer_service.py: fingerprint changed to namespace:deployment:alertname:target_resource;
alertname comes from labels (Alertmanager), falling back to alert_type (other sources)
- incident_service.py: DeadMansSwitch → backup/TYPE-1;
NoAlertsReceived + PrometheusNotConnectedToAlertmanager → alertchain_health/TYPE-8M
- added 2 tests; full suite 627 passed
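The new fingerprint composition can be illustrated as below. The field order and the alertname fallback follow the commit message; hashing the joined string with SHA-256 and truncating it to 16 hex characters is an assumption, since the log does not say how the raw key is encoded.

```python
import hashlib
from typing import Mapping


def generate_fingerprint(namespace: str, deployment: str, target_resource: str,
                         labels: Mapping[str, str], alert_type: str) -> str:
    """Build namespace:deployment:alertname:target_resource and hash it."""
    # Alertmanager supplies alertname in labels; other sources fall back
    # to the coarse alert_type.
    alertname = labels.get("alertname") or alert_type
    raw = f"{namespace}:{deployment}:{alertname}:{target_resource}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]
```

Two alerts that both map to alert_type "custom" now get distinct fingerprints as long as their alertname labels differ, so the debounce window no longer suppresses one because of the other.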
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 22:50:20 +08:00 |
|
OG T
|
b17a677b97
|
fix(gitea-webhook): analysis.model_dump() fails on a dict
CD Pipeline / build-and-deploy (push) Has been cancelled
_call_openclaw_push_review returns a dict, not a Pydantic model;
the code now uses hasattr to check whether model_dump() exists
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 22:45:09 +08:00 |
|
OG T
|
0c88f6702e
|
fix(ai-router): force DIAGNOSE to use deepseek-r1:14b, not gemma3:4b
CD Pipeline / build-and-deploy (push) Has been cancelled
gemma3:4b (summary model, complexity≤1) does not output structured JSON
→ _parse_llm_response cannot extract confidence → confidence=0.0
deepseek-r1:14b (default model) is verified to output confidence=0.8
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 22:43:49 +08:00 |
|
OG T
|
db4d4280f5
|
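test(ai-router): update DIAGNOSE routing tests to reflect that NEMOTRON is paused
CD Pipeline / build-and-deploy (push) Successful in 14m28s
NEMOTRON is paused because of the confidence=0.0 issue; DIAGNOSE now uses complexity routing (None)
To be restored once _parse_confidence() is fixed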
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 22:22:52 +08:00 |
|
OG T
|
09134f5c47
|
fix(openclaw): fix incident.title + DIAGNOSE→NEMOTRON confidence=0.0
CD Pipeline / build-and-deploy (push) Failing after 2m10s
1. telegram_gateway.py:1169: classify_notification() still used incident.title;
it now uses the alertname + signal annotations combination (same fix as in decision_manager.py)
2. ai_router.py: DIAGNOSE routing pauses NEMOTRON
The NIM tool_call returns no confidence → openclaw_analysis_complete confidence=0.0
Changed to None (complexity routing) so that Gemini/openclaw_nemo handle DIAGNOSE
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 22:12:55 +08:00 |
|
OG T
|
b3d4b9c8a9
|
test(telegram): fix the test_telegram_message_templates assertion
CD Pipeline / build-and-deploy (push) Successful in 14m24s
CRITICAL → 嚴重 ("critical"; ADR-075 Chinese risk levels)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 21:20:16 +08:00 |
|
OG T
|
01e6d75ee7
|
test(telegram): fix test assertions to match the ADR-075 Chinese risk levels
CD Pipeline / build-and-deploy (push) Failing after 1m55s
HIGH→高風險, MEDIUM→中風險 (test_sentry / test_github webhook)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 21:08:48 +08:00 |
|
OG T
|
9c8dde0951
|
fix(telegram): fix all Telegram pushes failing because Incident has no title field
CD Pipeline / build-and-deploy (push) Failing after 2m3s
Root cause: _push_decision_to_telegram() references incident.title in two places,
but the Incident model has never had this field, so every alert card push
raised AttributeError and failed silently as a telegram_decision_push_failed event.
Fix:
- line 188: message now uses the signal annotation summary/description/alert_name
- line 249: the TYPE-1 title now uses the alertname label / signal.alert_name
Impact: ever since these two lines were added to decision_manager, no Telegram notification has been sent
(including TYPE-1 info notifications and TYPE-3 approval cards)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 21:02:55 +08:00 |
|
OG T
|
3d8b0e4f90
|
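fix(adr075): TYPE-3 format now uses the spec template: ACTION REQUIRED + AI deep diagnosis + suggested repair action
CD Pipeline / build-and-deploy (push) Failing after 2m15s
- header changed to "{emoji} ACTION REQUIRED | {severity_zh}"
- new "🧠 AI deep diagnosis" block (analysis/responsibility/AI source)
- new "⚡ Suggested repair action" block (<code> format)
- confidence=0 shows "📋 Rule analysis" instead of the misleading "🔴 0%"
- the SignOz metrics block gets its Trace link back
2026-04-12 ogt: ADR-075 TYPE-3 format standardization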
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 21:00:28 +08:00 |
|
OG T
|
a7f2b9c0f5
|
fix(display): rule matches show ✅ instead of 🔴 0% + fix parsing of string confidence from the LLM
CD Pipeline / build-and-deploy (push) Has been cancelled
- telegram_gateway.py: confidence==0 (rule match/Expert fallback) no longer shows
"🔴 0%"; it now shows "⚙️ Rule match ✅", fixed for both card types
- openclaw.py: NIM/Ollama sometimes return the string "0.85" instead of a float, so
isinstance(str, int|float)=False → confidence was forced to 0.0.
It now tries float() first and falls back to 0.0 only if parsing fails
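The tolerant confidence parsing described for openclaw.py might be written like this. The function name is hypothetical, and rejecting booleans and out-of-range values is an extra assumption of this sketch; the commit only mentions trying float() and falling back to 0.0.

```python
from typing import Any


def parse_confidence(raw: Any) -> float:
    """Accept floats, ints and numeric strings; anything else becomes 0.0."""
    if isinstance(raw, bool):
        return 0.0  # bool is an int subclass; treat it as "no value"
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return 0.0
    # Assumed sanity check: confidence is a probability-like score.
    return value if 0.0 <= value <= 1.0 else 0.0
```

The range check also filters out NaN, because every comparison against NaN is false.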
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 20:50:53 +08:00 |
|
OG T
|
eda0cfd034
|
fix(adr075): drift notifications use send_drift_card; all call sites updated
CD Pipeline / build-and-deploy (push) Successful in 14m13s
- drift.py: removed the dead send_text() code; narrate_and_notify() now sends the card in one place
- drift_narrator_service: _send_telegram() now calls send_drift_card() with four buttons
- webhooks.py /alerts path: now passes alert_category to enable dynamic buttons
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 20:20:47 +08:00 |
|
OG T
|
c3fea26222
|
fix(adr075): webhooks send_approval_card now passes alert_category+notification_type
CD Pipeline / build-and-deploy (push) Has been cancelled
The real root cause of the break: _push_to_telegram_background called send_approval_card()
without passing alert_category and notification_type, so the dynamic buttons always
fell back to the generic [Approve][Reject][Silence].
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 20:07:12 +08:00 |
|
OG T
|
0a4b7e9609
|
fix(classify): add HostBackupFailed to backup/TYPE-1 with exact matching (tests pass)
CD Pipeline / build-and-deploy (push) Has been cancelled
The previous fix used 'backup' in alertname_lower, which was too broad: the BackupJobFailed warning
was classified as TYPE-1, breaking test_backup_keyword_warning_not_type1.
Changed to an exact whitelist:
_BACKUP_TYPE1_NAMES = {HostBackupFailed, HostBackupStale, HostBackupMissing,
BackupRestoreTestFailed, BackupRestoreTestStale}
+ alertname.startswith('HostBackup') as a catch-all
Result: 664 passed, 0 failed
2026-04-12 ogt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 20:03:46 +08:00 |
|
OG T
|
f25d82a88a
|
fix(adr075): patch breakpoint E: _push_to_telegram_background gets TYPE-8M routing
CD Pipeline / build-and-deploy (push) Has been cancelled
Breakpoint E: the alertmanager webhook goes through _push_to_telegram_background,
which had no TYPE-8M branch, so meta alerts were never sent.
- webhooks.py: added the alert_category parameter + a TYPE-8M branch
- incident_service.py: restored rule 5 to catch only watchdog/heartbeat and
removed the mistakenly added backup startswith rule (VeleroBackup is handled by the K8s rule)
Tests: 52/52 passed
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 20:01:51 +08:00 |
|
OG T
|
1f7975170a
|
fix(classify): add HostBackupFailed to the backup/TYPE-1 rule
CD Pipeline / build-and-deploy (push) Failing after 1m51s
The backup rule in classify_alert_early() only caught watchdog/Heartbeat;
HostBackupFailed was caught first by the Host prefix rule → host_resource/TYPE-3 → LLM run → approval card.
Fix: add backup keyword/prefix matching ahead of the Host prefix rule:
- HostBackup* / Backup* / VeleroBackup* / BackupRestore*
- alertname contains "backup" (case-insensitive)
Impact: all backup-related alerts go straight to TYPE-1 info notifications and skip the LLM.
Non-backup Host alerts such as HostHighCpu / HostDown are unaffected.
2026-04-12 ogt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 19:52:05 +08:00 |
|
OG T
|
a5f17cea79
|
fix(notification): TYPE-1 backup/info alerts no longer send approval cards
CD Pipeline / build-and-deploy (push) Has been cancelled
classify_notification() is unaware of alert_category, so for backup alerts
(confidence=0, auto_executed=False) it returned TYPE-3, overriding the
notification_type=TYPE-1 already set by classify_alert_early().
Fix: before the routing branch, an explicit incident.notification_type value
(TYPE-1 / TYPE-4D / TYPE-8M) overrides classify_notification().
Impact: backup/info/watchdog alerts only call send_info_notification()
and no longer push approval cards with buttons to Telegram.
2026-04-12 ogt (ADR-075 bugfix)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 19:49:31 +08:00 |
|
OG T
|
7f3e585d6d
|
fix(webhooks): alertmanager handler: out-of-range alert_type becomes custom
CD Pipeline / build-and-deploy (push) Has been cancelled
AlertPayload.alert_type accepts only 8 Literal values
The ALERTNAME_TO_TYPE mapping returns values such as host_cpu/backup_failure that are not in the whitelist → ValidationError
Fix: any alert_type outside the Literal whitelist falls back to "custom"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 19:22:35 +08:00 |
|