Commit Graph

889 Commits

Author SHA1 Message Date
OG T
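0ab92c20d6 fix(telegram): raise root_cause truncation limit 300→500; fixes the recurring phantom cut-off at "質疑:無(通"
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 10m31s
Root cause: debate_summary is structured as "diagnosis (≤220 chars); plan; security review; challenges"
      When the diagnosis hypothesis is long, the total exceeds 300 chars → root_cause is truncated at the character "通"
Fix: 300 → 500 (safe under Telegram's 4096-character limit per card)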

2026-04-17 ogt + Claude Sonnet 4.6

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 13:28:16 +08:00
OG T
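58d9c0637a fix(drift): drift_narrator now uses the OpenClaw AI Router; fixes the blank "研判原因" (assessed cause) field
Root cause: _generate_narrative() in drift_narrator_service.py called Ollama directly over
      httpx (192.168.0.111:11434), bypassing the AI Router, with no fallback.
      192.168.0.111 is a dead IP → httpx connection fails → degrades to fallback_narrative()
      → interpretation.explanation exists in the fallback but the display layer truncates it → blank

Fix: switch to get_openclaw().call(prompt) so everything goes through the AI Router
      Same fix as drift_interpreter.py (d952435)
      Remove the unused httpx import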

2026-04-17 ogt + Claude Sonnet 4.6

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 13:28:16 +08:00
OG T
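e0bfcc7bd6 fix(phase5): fix the Solver action format; force kubectl command output
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 9m33s
Root cause: the action example in _build_prompt() was "restart_service:awoooi-api" (a custom format),
and the LLM imitated it, emitting natural-language descriptions instead of kubectl commands.

Impact chain:
  Solver action = natural-language description
  → auto_approve Condition 1c rejects it (no kubectl keyword)
  → _auto_execute() is never called
  → blast_radius_calculator is never called
  → blast_radius_score fill rate = 0/14 = 0% (Phase 5 acceptance metric not met)

Fix:
  1. The blast_radius reference now uses real kubectl command examples instead of abstract descriptions
  2. Explicitly require the action field to be a real kubectl command (no natural language)
  3. Correct example: kubectl rollout restart deployment/awoooi-api -n awoooi-prod

Expected effect: LLM outputs kubectl commands → auto_approve passes (low blast_radius cases)
          → blast_radius_calculator is called → fill rate trends toward 100%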

2026-04-17 ogt + Claude Sonnet 4.6

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 12:44:36 +08:00
OG T
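b7c2b691bb fix(p2-backlog): fix suggested_action showing "待分析" (pending analysis); fall back to description when action is empty
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m26s
Root cause: suggested_action in _push_decision_to_telegram() had only two paths:
- action has a value → show action[:120]
- action is empty    → show "待分析"

However, _package_to_proposal_data() has already built description from the hypothesis
(containing "根因:...(信心 X%);方案:...", i.e. root cause, confidence, plan), yet with action="" the card still showed "待分析",
so SRE could not see the AI's diagnostic conclusion on the Telegram card.

Fix: when action is empty, prefer description[:120] as suggested_action
(description already contains the root-cause summary, which is more useful than "待分析")
fallback chain: action → description → "待分析"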
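
The fallback chain above can be sketched roughly as follows. This is a minimal illustration, not the code in _push_decision_to_telegram(); the function name pick_suggested_action and the constant PLACEHOLDER are hypothetical, and only the 120-character limit and the ordering come from the commit message.

```python
PLACEHOLDER = "待分析"  # "pending analysis", the card's last-resort text


def pick_suggested_action(action: str, description: str, limit: int = 120) -> str:
    """Return the first non-empty candidate, truncated to `limit` characters.

    Order follows the commit message: action -> description -> placeholder.
    """
    for candidate in (action, description):
        text = (candidate or "").strip()
        if text:
            return text[:limit]
    return PLACEHOLDER
```

With action="" and a populated description, the card would show the description summary instead of the placeholder.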

2026-04-17 ogt + Claude Sonnet 4.6

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 11:48:49 +08:00
OG T
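0388e50d0e fix(p1-backlog): fix the "待分析" (pending analysis) dead end and Telegram message truncation
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 30m25s
Problem 1: REQUEST_REVISION → "待分析"
  Root cause: safe_candidates=[] → selected=None → recommended_action=None
        → decision_manager action="" → TG card shows "待分析" (information flow broken)
  Fix in coordinator_agent.py:
    When there is no safe candidate, fall back to the Solver's original best plan
    Label it "[Reviewer 未核准,僅供參考] {action}" ("not approved by Reviewer, for reference only")
    SRE always sees the AI suggestion; the information flow is never interrupted

Problem 2: debate_summary is cut off in the middle of "(blast_radius...", showing "(bl"
  Root cause: root_cause=reasoning[:150]; 150 characters is too short for a Chinese debate_summary
  Fix in decision_manager.py:
    root_cause truncation 150 → 300
    suggested_action truncation 80 → 120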

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 11:12:02 +08:00
OG T
d952435b60 fix(drift): use the OpenClaw AI Router instead of a direct Ollama httpx connection
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 32m34s
Root cause: _call_nemotron() called Ollama directly over httpx (settings.OLLAMA_URL),
      bypassing the AI Router, with no fallback → "All connection attempts failed"
      → Telegram card shows "意圖分析失敗:All connection attempts failed" ("intent analysis failed")

Fix: go through get_openclaw().call(prompt)
      This automatically gains provider degradation and fallback (consistent with the other Agents)

Deprecated: the BUG-001 direct-httpx workaround (the nvidia_provider interface is now stable)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 10:27:39 +08:00
OG T
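0c15fa5988 refactor(decision): state machine refactor; move the YAML NO_ACTION gate up into the decision routing hub
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Architect directive (2026-04-17): the notification layer must not query business logic.
Revert the inline YAML lookup from c05bcdb (a spaghetti patch) and
move the NO_ACTION / INVALID_TARGET checks to the right place.

Refactoring:
① Remove the inline YAML lookup from _push_decision_to_telegram()
  → the notification layer only translates blocked_reason → NotificationType (Single Responsibility)

② Add step 4c to decide(): the YAML NO_ACTION routing gate
  Position: after _dual_engine_analyze() returns, before auto_approve.evaluate()
  Logic:
    - NO_ACTION → blocked_reason="YAML: NO_ACTION" + is_informational_only=True
      → short-circuit past auto_approve + Blast Radius → TYPE-1 (or critical → TYPE-4)
    - INVALID_TARGET → blocked_reason="INVALID_TARGET-..." → short-circuit → TYPE-4
    - Gate lookup failure → degrade silently and continue the normal flow

Checkpoint coverage:
  CP1 YAML evaluation layer moved up
  CP2 short-circuit past auto_approve
  CP3 notification layer is pure translation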

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 10:20:01 +08:00
OG T
c05bcdbbd4 fix(decision): add an inline YAML NO_ACTION lookup; fixes a blind spot on the Phase 2 path
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 4m0s
Root cause: Phase 2 (agent debate → auto_approve rejects → push straight to TG) does not pass through
      the YAML check in auto_execute(), and the Coordinator does not set blocked_reason.
      Alerts matching NO_ACTION rules, such as PostgreSQL disk / host resource, still showed
      the "ACTION REQUIRED" card (TYPE-3) on the Phase 2 path instead of the TYPE-1 info card.

Fix: when blocked_reason is empty, _push_decision_to_telegram() performs one extra
      inline YAML lookup by alertname, so every path (Phase 2 / Expert / Webhook)
      correctly detects NO_ACTION → TYPE-1 / critical NO_ACTION → TYPE-4.

Found in production: INC-20260416-C365D0, a PostgreSQL disk alert, showed ACTION REQUIRED
             instead of TYPE-1, confirming that the full Code Review missed this execution path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 10:15:28 +08:00
OG T
149065e3de perf(e2e): switch the CI smoke test to retain-on-failure to cut recording overhead
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 51m8s
E2E Health Check / e2e-health (push) Successful in 3m18s
video/screenshot changed from 'on' to retain-on-failure/only-on-failure
The remote CI smoke test is expected to drop from 13min+ to ~1min

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 23:44:20 +08:00
OG T
83ab5e32d7 fix(happy-path): harden the whole Happy Path: INVALID_TARGET + critical NO_ACTION + empty-command interception
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Problem 1 (P0): invalid restart of deployment/unknown:
- alert_rule_engine: track the _invalid_target flag and return blocked_reason="INVALID_TARGET-..."
- decision_manager: the auto_execute path detects INVALID_TARGET → early return + TYPE-4 manual confirmation
- auto_approve: add condition 1c; an empty-string action is rejected outright, preventing a false "about to execute" notice

Problem 2 (P1): critical+NO_ACTION goes silent:
- decision_manager: refactor the blocked_reason awareness layer
  ① INVALID_TARGET → TYPE-4
  ② NO_ACTION + critical → TYPE-4 (escalated; SRE must not miss it)
  ③ NO_ACTION + non-critical → TYPE-1 (remains a pure info card)

Problem 3 (P1): rule-match confidence black hole:
- auto_approve condition 1c ensures an empty action never passes auto-approve
  Even with is_rule_based=True, nothing can auto-execute without a command
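
The three blocked_reason rules (①②③) can be sketched as a small pure function. This is an illustrative sketch only: the name notification_type_for and the default value "TYPE-3" for unblocked decisions are assumptions, and the real decision_manager may structure this differently.

```python
def notification_type_for(blocked_reason: str, severity: str,
                          default: str = "TYPE-3") -> str:
    """Translate a blocked_reason into a notification type.

    Mirrors the rules in the commit message:
      INVALID_TARGET           -> TYPE-4
      NO_ACTION + critical     -> TYPE-4
      NO_ACTION + non-critical -> TYPE-1
    `default` (an assumption) is returned when nothing is blocked.
    """
    reason = blocked_reason or ""
    if reason.startswith("INVALID_TARGET"):
        return "TYPE-4"
    if "NO_ACTION" in reason:
        return "TYPE-4" if severity == "critical" else "TYPE-1"
    return default
```

Keeping the mapping free of lookups means the notification layer never has to query rules itself.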

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 22:57:50 +08:00
OG T
0077ff9758 fix(solver): pass the hypothesis to OPENCLAW_NEMO as alert_context
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Root cause: the solver called openclaw.call(prompt) without passing context
→ the nemo fallback treated prompt[:500] (the system description, "軍師 Agent" / "strategist Agent")
   as the signal description → the LLM returned a garbage plan description

Fix: put top.description into alert_context.signals
      so nemo sees the real root-cause hypothesis (same pattern as the diagnostician, 7eb8375)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 22:51:30 +08:00
OG T
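92b39ab840 fix(no-action-notify): YAML NO_ACTION alerts become TYPE-1 info notifications (remove the pointless approval buttons)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Root cause:
- The host_resource/postgresql_disk_monitoring YAML rules are set to NO_ACTION
- But classify_notification() does not know about NO_ACTION
- confidence=0.2 (no sensor data) → classified as TYPE-4 (low confidence, manual review needed)
- SRE sees "approve/reject" buttons with no automatic remediation to run → pointless

Fix:
- _push_decision_to_telegram detects blocked_reason containing "NO_ACTION"
- Force _notif_type = TYPE-1 (pure info notification, no approval buttons)
- SRE sees the info card "host CPU/load/disk alert (observe only)" instead of a fake approval request

2026-04-16 ogt + Claude Sonnet 4.6 (Taipei time zone)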

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 22:37:15 +08:00
OG T
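7eb837567d fix(diagnostician): fix garbage root cause shown under 'AI 深度診斷' (AI deep diagnosis)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Root-cause chain:
1. openclaw.call(prompt) does not pass context
2. The OPENCLAW_NEMO fallback treats prompt[:500] (the system description text) as the signal description
3. The Nemo LLM returns action_title="調查 AWOOOI SRE 系統的偵探 Agent" (the task description, "detective Agent investigating the AWOOOI SRE system")
4. _extract_hypotheses() uses action_title as the root-cause hypothesis description → Telegram shows garbage

Fix:
- openclaw.call() gains an optional alert_context parameter, passed through to _call_with_fallback
- diagnostician._analyze() builds alert_context (incident_id + evidence_summary as signal)
  → nemo receives real sensor data through the structured API rather than the system description text
- _extract_hypotheses() nemo format conversion: prefer reasoning (the why) as the hypothesis description
  over action_title (the what); reasoning is closer to a root-cause analysis

2026-04-16 ogt + Claude Sonnet 4.6 (Taipei time zone)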

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 22:34:48 +08:00
OG T
54d6818b8d fix(sensors+rules+dedup): three root-cause fixes across the board: missing asyncssh / YAML rule gaps / duplicate cards
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Fix 1: asyncssh not installed → sensors_succeeded is always 0
  - Add asyncssh>=2.14.0 to apps/api/pyproject.toml
  - Root cause: import asyncssh in ssh_provider.py sits outside try/except, so the ImportError propagates
  - Effect: all 15 SSH tools work again

Fix 2: YAML rule gaps → HostHighLoadAverage/PostgreSQLDiskGrowthRate fall into generic_fallback → restart
  - Merge host_cpu_high into host_resource_alert, covering 25+ host-level alertnames
  - Add the postgresql_disk_monitoring rule, covering disk growth/exporter/vacuum alerts
  - All host-level/disk-monitoring alerts → NO_ACTION; kubectl restart is forbidden

Fix 3: the same incident handled by multiple pods at once → 3 duplicate Telegram cards sent
  - decision_manager.get_or_create_decision(): add an early return for the ANALYZING state
  - Root cause: ANALYZING was not in the (READY/EXECUTING/COMPLETED) condition → pod-B/C each created a new token

2026-04-16 ogt + Claude Sonnet 4.6 (Taipei time zone)
2026-04-16 22:23:49 +08:00
OG T
02a276127e fix(sensors+drift+repair-card): fix three node-level problems across the board
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 1h1m39s
Fix 1: sensors 7/8 failing; expand short SSH host names (pre_decision_investigator.py)
  Root cause: the Prometheus instance label is "110:9100", so split(":")[0]="110"
        SSH_MCP_ALLOWED_HOSTS stores the full IP "192.168.0.110" → all 7 SSH tools fail
  Fix: add _SHORT_HOST_MAP, "110"→"192.168.0.110", covering all four hosts

Fix 2: Config Drift false positives; allowlist K8s default fields (drift_detector.py)
  Root cause: after kubectl rollout restart, the restartedAt annotation is detected as "medium" drift
        K8s auto-populated fields such as restartPolicy/dnsPolicy/terminationGracePeriodSeconds were not allowlisted
  Fix: add 13 K8s runtime auto-populated fields to _DEFAULT_ALLOWLIST_FIELDS

Fix 3: garbage content on the repair request card; the fallback now carries real error context (failure_watcher.py)
  Root cause: when LLM analysis fails, root_cause = "規則引擎分類: K8S_ERROR" ("rule engine classification", no useful information)
  Fix: the fallback becomes "[K8S_ERROR] {operation_type} 在 {target_resource} 失敗\n錯誤:{error_message[:200]}" (operation failed on target, followed by the error)
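
Fix 1 above can be illustrated with a short sketch. Only the "110" → "192.168.0.110" entry is stated in the commit message; the other three hosts are not listed there, so they are omitted here. The function name resolve_ssh_host is hypothetical.

```python
# Only the mapping named in the commit message is shown; the real
# _SHORT_HOST_MAP reportedly covers four hosts.
_SHORT_HOST_MAP = {"110": "192.168.0.110"}


def resolve_ssh_host(instance_label: str) -> str:
    """Strip the port from a Prometheus instance label and expand short names."""
    host = instance_label.split(":")[0]
    return _SHORT_HOST_MAP.get(host, host)
```

A label that already carries a full IP passes through unchanged, so the lookup is safe to apply to every instance label.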

2026-04-16 ogt + Claude Sonnet 4.6 (Asia-Pacific)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 20:50:06 +08:00
OG T
513232e90b fix(decision_manager): Agent analysis result overwrites the garbage Webhook action
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 28m30s
Root cause (full root-cause analysis of incident INC-20260416-C365D0):
- The Webhook inline LLM created ApprovalRecord.action = "kubectl rollout restart awoooi-prod"
- The Agent analysis was correct (postgres disk → NO_ACTION) but only sent a new Telegram card without overwriting the DB
- The user approved the Agent card → the system looked up incident_id → found the Webhook's old ApprovalRecord
  → executed the garbage action (a rollout restart for a disk alert!)

Fix:
- approval_db.py: add update_action_by_incident_id() (updates PENDING records by incident_id)
- decision_manager.py: overwrite the ApprovalRecord as soon as the Agent confirms the action
  If action="" (NO_ACTION): store "NO_ACTION - {description}" so the user knows the Agent recommends observation
  On approval, what runs is the Agent's correct recommendation, not the Webhook's generic action

2026-04-16 ogt + Claude Sonnet 4.6 (Asia-Pacific)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 20:07:15 +08:00
OG T
a258d87767 fix(webhooks+prompts): fix the underlying problem of the LLM answering every alert with "重啟 AWOOOI 服務" (restart the AWOOOI service)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Root cause (INC-20260416-C365D0 postgres disk alert incident):
1. In alert_context, alertname is buried deep in labels; the LLM sees alert_type="custom" → it cannot tell what the alert is
2. The cache key used alert_type:target_resource → different alertnames share one cache entry → all return the first LLM result
3. The system Prompt had no alert-category guidance → the LLM always output kubectl rollout restart

Fix:
- webhooks.py: put alertname + alert_category + annotations at the top of alert_context
- openclaw.py: the cache key becomes alertname:target_resource (the alert name is the primary identifier)
- prompts.py: add Alert-Specific Analysis Rules to OPENCLAW_SYSTEM_PROMPT + NEMOTRON_SYSTEM_PROMPT
  database/storage alerts → NO_ACTION + investigation commands; K8s alerts → the matching restart command
  Forbid kubectl rollout restart deployment/awoooi-prod for non-K8s alerts
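
The cache-key collision in root cause 2 can be shown with a small sketch. The helper names old_cache_key and new_cache_key are hypothetical; only the two key formats come from the commit message.

```python
def old_cache_key(alert_type: str, target_resource: str) -> str:
    # Pre-fix format: every webhook alert arrives with alert_type="custom".
    return f"{alert_type}:{target_resource}"


def new_cache_key(alertname: str, target_resource: str) -> str:
    # Post-fix format: the alert name is the primary identifier.
    return f"{alertname}:{target_resource}"


alerts = [
    {"alertname": "PostgreSQLDiskGrowthRate", "alert_type": "custom", "target": "awoooi-prod"},
    {"alertname": "HostHighLoadAverage", "alert_type": "custom", "target": "awoooi-prod"},
]
old_keys = {old_cache_key(a["alert_type"], a["target"]) for a in alerts}
new_keys = {new_cache_key(a["alertname"], a["target"]) for a in alerts}
```

Under the old format both alerts collapse to one key, so the second alert would be served the first alert's cached LLM answer.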

2026-04-16 ogt + Claude Sonnet 4.6 (Asia-Pacific)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 19:56:13 +08:00
OG T
8b2a3df64b fix(telegram): fix Telegram card description showing debug garbage
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 3m12s
Problem: description = debate_summary[:500], so users saw internal audit text
Fix: build a human-readable summary from diagnosis.top_hypothesis + action_plan.top_candidate
Format: "根因:[description](信心 X%);方案:[action]" (root cause, confidence X%, plan)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 18:42:13 +08:00
OG T
ded93cbba3 fix(aiops): fix blank evidence → AI ABSTAIN
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 32m33s
Problem:
- signal.alert_name is at the top level, but _get_alertname() reads labels["alertname"] → empty string
- When every sensor fails, evidence_summary is only 120 characters, so the AI cannot analyze → ABSTAIN
- With empty labels the AI has no idea what the alert is

Fix:
1. _get_alertname(): read signal.alert_name first, fall back to labels["alertname"]
2. _get_labels(): automatically add alertname to the labels dict
3. EvidenceSnapshot.alert_info: add basic alert fields (minimum intelligence when sensors=0)
4. build_summary(): alert_info always comes first, so the AI at least knows the alert type + severity

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 16:26:07 +08:00
OG T
588b0d745b fix(aiops): fix the root cause of sensors=0/0; MCPToolRegistry was never initialized at startup
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m44s
Three problems fixed together:

1. main.py: add the missing init_mcp_tool_registry() call
   - ADR-081 Phase 1 created MCPToolRegistry but it was never called in lifespan startup
   - This left PreDecisionInvestigator at sensors=0/0 and evidence_summary always blank
   - Blank evidence → Diagnostician always ABSTAINs

2. signal_producer.py: str(dict) → json.dumps()
   - labels/annotations were serialized with Python str() and could not be deserialized after being written to Redis

3. brain/incident_engine.py: add the _parse_dict_field() helper
   - labels/annotations read back from Redis may be JSON strings
   - The isinstance(..., dict) guard alone is insufficient; json.loads() is needed first

2026-04-16 ogt + Claude Sonnet 4.6 (Asia-Pacific): flywheel sensor repair

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 15:35:19 +08:00
OG T
d294caf830 fix(solver): support the openclaw_nemo response format → convert to candidates format
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 3m51s
In step with the diagnostician: openclaw_nemo returns action_title/risk_level/confidence,
and the solver's _extract_candidates() cannot find the candidates key → empty plan → no_candidates

Fix: when action_title is present, convert to candidates format
risk_level → blast_radius mapping: critical=60, high=40, medium=25, low=10
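
A rough sketch of that conversion, under stated assumptions: the four mapping values are from the commit message, while the function name to_candidates, the candidate field names ("action", "blast_radius", "confidence"), and the medium default for unknown risk levels are all hypothetical.

```python
RISK_TO_BLAST_RADIUS = {"critical": 60, "high": 40, "medium": 25, "low": 10}


def to_candidates(response: dict) -> list:
    """Normalize an LLM response into a list of candidate dicts."""
    if "candidates" in response:
        return list(response["candidates"])
    if "action_title" in response:  # openclaw_nemo-style payload
        risk = str(response.get("risk_level", "")).lower()
        return [{
            "action": response["action_title"],
            # Unknown risk levels fall back to "medium" (an assumption).
            "blast_radius": RISK_TO_BLAST_RADIUS.get(risk, RISK_TO_BLAST_RADIUS["medium"]),
            "confidence": response.get("confidence", 0.0),
        }]
    return []
```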

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 14:34:50 +08:00
OG T
c27709d11b fix(diagnostician): support the openclaw_nemo response format → lift the blanket ABSTAIN
Root cause: AI Router DIAGNOSE→openclaw_nemo returns the ClawBot format:
     {"action_title":"...","risk_level":"...","confidence":0.85}
     The Diagnostician only parsed {"hypotheses":[...]} → always 0 hypotheses → ABSTAIN

Fix: _extract_hypotheses() now detects and converts the openclaw_nemo format
     action_title→description, confidence→confidence, risk_level→category

Impact: every critical alert received since 2026-04-15 was ABSTAINed, with no remediation at all

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 14:32:32 +08:00
OG T
8582439d2d fix(kb): Signal has no description field; use alert_name + annotations instead
knowledge_extractor_service accessed s.description directly in two places:
- L87 signals_text assembly: now uses alert_name + annotations.summary/description
- L198 fallback title: now uses alert_name[:60]

The Signal model only has alert_name and annotations (dict); there is no description attribute.
This fix prevents an AttributeError during KB extraction from blocking draft creation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 08:54:11 +08:00
OG T
9ea1f77e41 fix(telegram): remove 7 ghost buttons (3-part / no handler)
Offending buttons:
- flywheel_diag / flywheel_dashboard (META alert card)
- pause_1h / ignore (business alert card)
- postmortem / escalation_ack / dr_manual (escalation notice card)
- secops_block_ip / secops_evict (SecOps card; spec=nonce but uses 2-part)

None of these buttons has a callback handler, so clicking does nothing: a ghost button
Iron rule: better no button than a dead button (feedback_no_ghost_buttons.md)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 03:29:41 +08:00
OG T
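cd1c0ffdb8 fix(openclaw): call() 5 values → 3 values; fixes the root cause of all AI analysis degrading
Root cause: _call_with_fallback() returns a 5-tuple, but call() returned it as-is,
so callers (diagnostician/solver/critic agents) doing a 3-variable unpack hit
→ ValueError: too many values to unpack → everything degraded to 20%

Fix: call() explicitly unpacks and returns (response, provider, success)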
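
The arity mismatch can be reproduced in a few lines. In this sketch the stub _call_with_fallback() and the contents of its two extra tuple fields are invented for illustration, and it is an assumption that the three returned values are the first three positions of the 5-tuple.

```python
def _call_with_fallback(prompt: str):
    # Stub standing in for the real router; the last two fields are hypothetical.
    return ("analysis text", "nvidia", True, 1.7, None)


def call_buggy(prompt: str):
    # Pre-fix behavior: the 5-tuple leaks straight through to callers.
    return _call_with_fallback(prompt)


def call(prompt: str):
    # Post-fix behavior: unpack explicitly and expose a stable 3-tuple.
    response, provider, success, *_extra = _call_with_fallback(prompt)
    return response, provider, success


try:
    response, provider, success = call_buggy("diagnose")
    buggy_error = None
except ValueError as exc:
    buggy_error = str(exc)

response, provider, success = call("diagnose")
```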

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 03:27:48 +08:00
OG T
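62e2efda85 fix(heartbeat): restore the 30-minute heartbeat report to the SRE war room
The 2026-04-15 reason for disabling it (forwarded_to_separate_group) was wrong:
the SRE war room is SRE_GROUP_CHAT_ID and should not have been disabled.
Restore start_heartbeat_monitor(interval=30min).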

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 03:05:01 +08:00
OG T
27ba97e586 fix(ollama): remove every hard-coded 188:11434 fallback; all now point to the 111 GPU
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 15m59s
- decision_manager.py: two getattr fallbacks 188 → 111
- routes/agent.py: OLLAMA_BASE_URL 188 → 111
- knowledge_extractor_service.py: _OLLAMA_BASE 188 → 111

The config.py default was already 111; this removes the hard-coded 188 values left in the code.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 03:01:31 +08:00
OG T
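7e3cc8b3b0 fix(agents): remove the artificial per-agent timeout; the LLM must be allowed to finish its response
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
The original asyncio.wait_for(timeout_sec=25s) was an arbitrary cut-off:
whenever the LLM exceeded the limit, the result degraded to confidence=20% with no analysis at all.

Correct approach:
- Remove the asyncio.wait_for() wrapper from all 4 agents
- Keep only except Exception to catch real failures (connection failure, model crash)
- The whole flow is protected from hanging by the Orchestrator's GLOBAL_TIMEOUT_SEC=90s
- The deprecated _PER_AGENT_TIMEOUT_SEC constant is removed

Impact: wait as long as LLM inference takes, with no artificial cut-off,
      so models such as deepseek-r1:14b can emit their full analysis.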
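
The timeout layout described above (no per-agent limit, one global limit) might look roughly like this. It is a sketch under assumptions: run_agent and orchestrate are hypothetical names, agents are modeled as zero-argument coroutine functions run in sequence, and the degraded-result dict is invented for illustration.

```python
import asyncio

GLOBAL_TIMEOUT_SEC = 90  # value from the commit message


async def run_agent(name, work):
    """Run one agent with no per-agent timeout; only real errors degrade it."""
    try:
        return await work()
    except Exception as exc:  # connection failure, model crash, ...
        return {"agent": name, "degraded": True, "error": str(exc)}


async def orchestrate(agents, timeout=GLOBAL_TIMEOUT_SEC):
    """Run agents sequentially under a single global deadline."""
    async def pipeline():
        return [await run_agent(name, work) for name, work in agents]

    # The only wait_for in the flow; it raises asyncio.TimeoutError on expiry.
    return await asyncio.wait_for(pipeline(), timeout=timeout)
```

Because cancellation is not an Exception subclass in current Python versions, the global deadline still interrupts a slow agent even though each agent catches Exception.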

2026-04-16 ogt + Claude Sonnet 4.6

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 02:54:34 +08:00
OG T
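5a3a649f8a fix(decision): fix two root causes of repeated TYPE-1 alert spam
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Root cause 1: the TYPE-1 bypass ran before the existing_token check
→ every get_or_create_decision() call pushed to TG whether or not a token existed
→ Fix: move the existing_token check ahead of the TYPE-1 bypass (single entry point)

Root cause 2: the TYPE-1 token TTL was only 3600s
→ after 1h the token expired, and the next scan recreated it and pushed to TG again
→ Fix: raise the TYPE-1 token TTL to 86400s (24h)

Impact: TYPE-1 alerts such as HostBackupFailed are pushed only once per incident (within 24h)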

2026-04-16 ogt + Claude Sonnet 4.6

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 02:49:31 +08:00
OG T
62bcc50770 fix(tg+km): complete Telegram operation-record disclosure and fix KM categories (ADR-076)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
New fields in Telegram messages:
- alert_category label (🏷️ host/K8s/database/service, etc.)
- playbook_name shows the name of the matched Playbook
- Frequency stats threshold lowered from count_24h>1 to >=1 (shown for first-time alerts too)
- TelegramMessage gains alert_category/playbook_name fields
- decision_manager → send_approval_card passes playbook_name through

KM fixes:
- EntryType.PLAYBOOK → EntryType.AUTO_RUNBOOK (the former does not exist and raises AttributeError)
- category "auto_generated" → "ai_system" (the frontend i18n has a matching translation)
- runbook_generator category corrected to match
- Push a Telegram notification after KM creation (best-effort)

DB decision_chain new fields:
- Add playbook_id / playbook_name / alert_category

2026-04-16 ogt + Claude Sonnet 4.6

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 02:46:17 +08:00
OG T
9538f6cca4 fix(agents): fix the 5s Agent timeout that made all LLM inference fail
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 16m14s
Root cause: deepseek-r1:14b inference measured at 2.2-27.3s, avg 10.6s,
but Diagnostician/Critic/Solver/Reviewer all used timeout_sec=5.0 (a dev-machine test value)
→ 67% of Agent inferences timed out → degraded to confidence=20% → auto-remediation never triggered

Fix:
- _PER_AGENT_TIMEOUT_SEC: 5s → 25s (a buffer of about 2.3x the avg)
- GLOBAL_TIMEOUT_SEC: 30s → 90s (3 sequential Agents × 25s + buffer)
- Pass timeout_sec explicitly to all 4 Agent calls

Expected effect: AI analysis of normal alerts reaches confidence ≥ 0.5 → triggers auto-remediation

2026-04-16 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 02:28:05 +08:00
OG T
a07daf7e3f fix(incidents): add a 48h age filter to GET /incidents so old incidents stop re-triggering AI analysis
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Root cause: DECISION_TOKEN_TTL=3600s → old incident tokens expire every hour
→ GET /api/v1/incidents re-triggers get_or_create_decision → OPENCLAW_NEMO timeout
→ Expert System fallback (confidence=20%) → Telegram flood

Fix: trigger background AI analysis only for incidents whose created_at is within 48h
Incidents older than 48h are no longer triggered (still listed, just not re-analyzed)

2026-04-16 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 02:21:53 +08:00
OG T
f5e33da2fc fix(telegram): fix the _make_request → _send_request method name mismatch
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 15m48s
7 call sites used _make_request, but the method is actually named _send_request,
which caused telegram_decision_push_failed errors after the sweeper finished analysis.

Affected methods: send_push_notification, send_drift_card and the rest of the ADR-071 series.
_send_request is defined at line 1272 and already covered by OTEL tracing.

2026-04-16 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 01:44:29 +08:00
OG T
1e86cc2896 fix(flywheel): fix the wrong column name incidents.outcomes → outcome
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
GET /api/v1/stats/summary raised UndefinedColumnError: column "outcomes" does not exist
The actual column is incidents.outcome (json type)

2026-04-16 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 01:42:14 +08:00
OG T
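9bfa6fc045 fix(sweeper): scan only incidents within 48h to stop historical cases from spamming Telegram
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Problem:
  On first deployment the sweeper found 117 old incidents without a sweeper_done: marker
  (the oldest from 2026-04-09, 7 days earlier) → LLM analysis triggered for all of them
  Old incident data format → OPENCLAW_NEMO timeout → Expert System degradation
  confidence=0.2 "degraded" → Telegram spammed with identically formatted alerts

Fix:
  Add the _MAX_INCIDENT_AGE_HOURS=48 filter
  Process only INVESTIGATING incidents within 48h
  Make created_at timezone-safe (naive → UTC)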
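
A minimal sketch of such an age filter, assuming naive timestamps are stored in UTC as the commit message implies. The function name is_recent and the injectable `now` parameter are hypothetical; the 48-hour constant is from the source.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

_MAX_INCIDENT_AGE_HOURS = 48


def is_recent(created_at: datetime, now: Optional[datetime] = None) -> bool:
    """True when the incident is at most _MAX_INCIDENT_AGE_HOURS old."""
    if created_at.tzinfo is None:
        # Treat naive timestamps as UTC so the subtraction below is well defined.
        created_at = created_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return now - created_at <= timedelta(hours=_MAX_INCIDENT_AGE_HOURS)
```

Without the naive-to-UTC step, subtracting a naive datetime from an aware one raises TypeError, which is why the commit calls out timezone safety.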

2026-04-16 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 01:27:02 +08:00
OG T
0760315059 fix(decision-manager): fix CAST syntax + turn off shadow_mode
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Root cause of decision_chain_persist_failed:
  asyncpg does not support the :dc::json syntax (:: clashes with the : of named parameters)
  Changed to CAST(:dc AS jsonb), the standard form for asyncpg

configmap:
  AIOPS_P4_SHADOW_MODE: true → false
  Real proactive monitoring enabled (proactive_inspector output is read by PreDecisionInvestigator)

2026-04-16 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 01:24:48 +08:00
OG T
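20b3fefca7 fix(sweeper): fix the decision key format bug (decision:INC-* → sweeper_done:INC-*)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Root cause:
  The decision token's actual key format is decision:DEC-{HEX12}
  The sweeper wrongly queried decision:{incident_id} (always 0 matches)
  → every 90s all 186 incidents were listed as "unanalyzed"
  → triggering a flood of duplicate AI analysis requests (although get_or_create_decision has dedup protection)

Fix:
  Use a lightweight sweeper_done:{incident_id} marker (TTL 1h)
  Set the marker only after analysis completes, so failed incidents are retried next round
  get_or_create_decision already dedups on COMPLETED/READY internally: double protection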

2026-04-16 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 01:20:16 +08:00
OG T
457018c0f9 fix(decision-manager): write AI analysis results to incidents.decision_chain (long-term DB storage)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Gap fixed: the decision token lives only in Redis with a 12min TTL, so AI diagnosis history was lost permanently
  - Add the _persist_decision_to_db() method
  - get_or_create_decision() writes to PG fire-and-forget on completion
  - Written: ts / confidence / risk_level / provider / source / diagnosis[:200]
  - try/except swallows errors so the main flow is unaffected; tracked via warning log

DB/Cache layering:
  PG (long-term): incidents.decision_chain (history) + outcomes + KM entries
  Redis (short-term): decision token dedup + working memory + playbook cache

2026-04-16 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 01:08:30 +08:00
OG T
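ce1a4d286e feat(sweeper): add the Incident Analysis Sweeper; automatically trigger AI decisions for unanalyzed Incidents
Gap fixed:
  After the Signal Worker creates an Incident, AI analysis was triggered only when GET /api/v1/incidents was called
  If nobody visited the frontend, a new Incident never got AI analysis or a Telegram notification

Solution:
  Add src/jobs/incident_analysis_sweeper.py
  Scan every 90 seconds for INVESTIGATING incidents without a decision token
  Trigger get_or_create_decision() in the background; Semaphore(3) rate limit, at most 5 per batch
  Attached via asyncio.create_task() in the main.py lifespan startup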
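
One sweep pass with the limits named above (batch of 5, concurrency of 3) could be sketched like this. The function name sweep_once and the injected `analyze` coroutine are hypothetical stand-ins; the real sweeper calls get_or_create_decision() and reads incidents from storage.

```python
import asyncio


async def sweep_once(incident_ids, analyze, batch_size=5, max_concurrency=3):
    """Analyze at most `batch_size` incidents, `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(incident_id):
        async with semaphore:  # at most max_concurrency analyses in flight
            return await analyze(incident_id)

    batch = list(incident_ids)[:batch_size]
    # return_exceptions=True keeps one failed analysis from aborting the batch.
    return await asyncio.gather(*(guarded(i) for i in batch), return_exceptions=True)
```

A surrounding loop would call sweep_once() and then sleep for 90 seconds before the next scan.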

2026-04-16 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 01:08:30 +08:00
OG T
d258a1fb87 test(ai-router): update the DIAGNOSE routing test: None → OPENCLAW_NEMO
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 14m52s
test_diagnose_override_is_none → test_diagnose_override_is_openclaw_nemo
Matches the ai_router.py DIAGNOSE routing fix (root-cause fix for the Ollama 238s timeout)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 00:13:00 +08:00
OG T
d4fed639f6 fix(ai-router): restore the OPENCLAW_NEMO route for DIAGNOSE; fixes all timeout degradation
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 2m16s
Root cause: the 2026-04-12 patch changed DIAGNOSE to None (complexity routing)
→ falls into Rule 6 → Ollama deepseek-r1:14b (CPU 238s) → timeout
→ degraded to 20% confidence + "待分析" (pending analysis) → everything unknown

Fix:
1. ai_router.py: DIAGNOSE → OPENCLAW_NEMO (via 188:8088 NVIDIA NIM, 2-27s)
2. ai_router.py: remove the DIAGNOSE deepseek special case from Rule 6 (no longer used)
3. 04-configmap.yaml: AI_FALLBACK_ORDER now puts nvidia first
   gemini→ollama→nvidia (old) → nvidia→gemini→ollama (new)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 00:03:04 +08:00
OG T
c9efaa3740 fix(playbook-seed): fix the get_db_context import path (db.session → db.base)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Root cause of the seed failing silently at startup:
  from src.db.session import get_db_context  ← module does not exist
  from src.db.base import get_db_context     ← correct path

This bug made it impossible to create any yaml_rule playbooks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 23:49:56 +08:00
OG T
800ab1685f fix(playbook+flywheel): fix PlaybookSource enum + repair_steps compatibility + KM stats raw SQL
Some checks failed
CD Pipeline / build-and-deploy (push) Successful in 14m58s
Type Sync Check / check-type-sync (push) Failing after 1m17s
Fixes a chain of bugs so that the Playbook seed can run normally:

1. PlaybookSource gains the YAML_RULE enum (dedicated to alert_rules.yaml imports)
2. playbook_seed_service: source=YAML_RULE; dedup now uses raw SQL by name
   and no longer calls list_playbooks (old-format repair_steps cause a validation error)
3. playbook_repository._orm_to_pydantic: fill in the required step_number/action_type
   fields for old-format repair_steps (backward compatible)
4. flywheel_stats_service: embedding IS NULL now uses raw SQL,
   fixing the AttributeError from the KnowledgeEntryRecord ORM lacking an embedding attribute

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 23:32:04 +08:00
OG T
77a92eb469 feat(P6): commit offline_replay_service + model_rollback_service (previously missed)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 14m59s
The two core services of the Phase 6 ADR-087 governance loop
were never git add-ed after creation and had remained untracked.

2026-04-15 Claude Sonnet 4.6 Asia/Taipei
2026-04-15 22:29:09 +08:00
OG T
85c4e3b434 fix(km): fix the root cause of all KM writes being unknown (three nodes)
Some checks are pending
CD Pipeline / build-and-deploy (push) Has started running
Bug-A: approval_execution.py called the non-existent get_incident() → AttributeError
swallowed by except → alertname/alert_category/affected_services all took default values
Fix: use the dual path get_from_working_memory() + get_from_episodic_memory()

Bug-B: _record_to_incident() omitted the notification_type + alert_category
fields when restoring an Incident from PG → km_conversion read None
Fix: restore these two fields as well

Bug-C: working_memory_warmup in main.py likewise omitted
notification_type + alert_category when rebuilding an Incident
Fix: added in the same way

2026-04-15 Claude Sonnet 4.6 Asia/Taipei
2026-04-15 22:28:48 +08:00
OG T
256a24e843 fix(deps+startup): add drain3/statsmodels to pyproject + warmup skips old data
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 17m23s
- pyproject.toml: add drain3>=0.9.11, statsmodels>=0.14.0, sse-starlette
  → the Docker build installs from pyproject, so packages listed only in requirements.txt never made it into the image
  → clears the 400 drain3_not_available alerts from the P4 LogAnomalyDetector
- main.py: per-record try/except in the working_memory warmup
  → old incidents with an invalid source (node-exporter) → skipped, without crashing the whole warmup
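
The per-record guard in the warmup can be sketched as below. This is an illustration only: the names warmup and rebuild, and the returned (loaded, skipped) pair, are hypothetical, and the real main.py presumably logs each skipped record.

```python
def warmup(records, rebuild):
    """Rebuild each record independently so one bad row cannot abort the rest."""
    loaded, skipped = [], 0
    for record in records:
        try:
            loaded.append(rebuild(record))
        except Exception:
            # e.g. a validation error on a legacy `source` value; skip and continue
            skipped += 1
    return loaded, skipped
```

Placing try/except inside the loop, rather than around it, is what turns a crash of the whole warmup into a skip of a single record.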

2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 22:08:13 +08:00
OG T
c05bac6112 fix(playbook): seed tuple unpack + text[] → jsonb migration
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- playbook_seed_service.py: list_playbooks returns tuple[list, int];
  the missing unpack caused 'list' has no attribute 'source'
- fix_playbooks_array_to_jsonb.sql: source_incident_ids/tags text[] → jsonb
  (already applied manually to the prod DB)

2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 22:03:59 +08:00
OG T
da871fc149 chore(db): add the missing AIOps P1/P2/P6 migration SQL (already applied to prod)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Three tables: incident_evidence / agent_sessions / ai_governance_events
All use IF NOT EXISTS; confirmed manually as present and applied on the production DB.

2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 22:02:17 +08:00
OG T
76558a3cd9 feat(AIOps): enable all P1-P6 feature flags + Nemotron + offline replay loop
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- configmap: enable all AIOPS_P1~P6 master switches and sub-switches
- configmap: ENABLE_NEMOTRON_COLLABORATION=true (back to the 120s timeout)
- feature_flags.py: add the missing AIOPS_P6_GOVERNANCE_ENABLED field
- main.py: attach run_offline_replay_loop (ADR-087 Phase 6)

2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:59:51 +08:00
OG T
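ecfb7148bf fix(prod): connect the YAML rule engine to the auto-execution path; a core architectural break
Some checks are pending
CD Pipeline / build-and-deploy (push) Has started running
Root cause of the architectural break:
  The YAML rule engine (alert_rules.yaml) is the human-reviewed, authoritative source of actions,
  but the auto-execution path read only proposal_data["kubectl_command"] (LLM-generated).
  The two were completely disconnected → HostHighCpuLoad got a kubectl restart, and the SSH command
  for DockerContainerUnhealthy was overwritten by the LLM's kubectl.

Fix strategy:
  At the auto_execute entry point, look up the YAML match_rule first:
  1. YAML → NO_ACTION (e.g. HostHighCpuLoad) → return immediately and execute nothing
  2. YAML → a non-kubectl command (e.g. ssh docker restart) → overwrite the LLM action,
     so the downstream infrastructure SSH routing can take effect

Impact:
  - HostHighCpuLoad / NodeCPUUsageHigh → auto-execution stops; downgraded to manual review
  - DockerContainerUnhealthy → SSH docker restart (if labels contain host/container)

2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific): third batch of production emergency fixes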

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:50:25 +08:00