Commit Graph

28 Commits

Author SHA1 Message Date
Your Name
7f200aff5f fix(solver): 注入告警 labels 讓 params 模板填充真實值
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 10m45s
根因:Solver LLM 不知道 namespace/pod/deployment/instance 真實值,
      recommended_actions.params 模板({labels.namespace} 等)填不出來
      → Telegram 顯示 kubectl scale deployment  --replicas=(空白)

修復:
- solver.run() 加 incident_labels 參數
- _build_prompt() 把 labels 顯式列出給 LLM 參考
- orchestrator 從 snapshot.alert_info.labels 取出後傳入

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-28 15:05:06 +08:00
Your Name
c3fa03fc19 fix(solver): 補 AGENT_SOLVER_TIMEOUT_SEC=80 + prompt 禁無腦重啟
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
問題1:AGENT_SOLVER_TIMEOUT_SEC 預設 20s K8s 未設 → deepseek-r1:14b 必然
       timeout → candidates=[] → action="" → Telegram 顯示「待分析」+「規則分析」

問題2:Solver prompt JSON 範例只有 restart + kubectl top,LLM 模仿範例
       → 所有告警都推重啟,HostDisk/CPU 類應優先診斷+清理

修復:
- K8s 加 AGENT_SOLVER_TIMEOUT_SEC=80(< OPENCLAW_TIMEOUT=120,留 buffer)
- Solver prompt 加根因對應修復規則:HostDisk→df/du/journalctl,CPU→top/ps,
  OOM→kubectl logs,禁止「先重啟」
- JSON 範例改為 HostDisk SSH 診斷場景,不再只有 K8s 命令

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-27 15:51:42 +08:00
Your Name
9908fdf50d feat(p3.1-t2-patha): DiagnosisAggregator 路徑 A + Solver F4 critical reject + 對齊測試
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m59s
Wave 8 P3.1-T2 PathA 啟用 + Solver F4 安全強化 + test 對齊:

PathA — DiagnosisAggregator 信號分類層補 PDI:
- ENABLE_DIAGNOSIS_AGGREGATOR default=False → True
  · PathA 純信號分類層(OOMKilled/CrashLoop 等業務邏輯)
  · 不重複呼叫 K8s/SignOz API(只取 PDI 已收集的 raw 資料)
  · 安全 default on — 純邏輯處理,無外部依賴重疊
- diagnosis_aggregator.py +155 行(PathA 實作)
- pre_decision_investigator.py 已接 (commit 3a2cd151)

F4 — Solver critical risk reject:
- solver_agent.py: _validate_recommended_action 拒絕 risk=critical
  · 鐵律:critical 動作必須走人工審批,不可變 Telegram 按鈕
  · log warning + return None(被 _extract 過濾掉)
- _extract_recommended_actions 改返回 (list, status_str) tuple
  · status="ok"/"empty"/"all_invalid" 供呼叫端決策
- protocol.py +16 / metrics.py +9 / ai_router.py +18 — 配套 metric + protocol field

測試對齊:
- test_solver_recommended_actions.py 拆 test_all_valid → low/medium/high accepted +
  test_critical_rejected
- result tuple unpack: result, _ = _extract_recommended_actions(...)
- test_diagnosis_aggregator_stub.py: feature flag default 改 True 對齊 PathA

Tests: 51 passed (solver 28 + aggregator 16 + router fallback 8)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Multiple Engineers (Wave 8 P3.1-T2 PathA + F4) <noreply@anthropic.com>
2026-04-27 14:42:29 +08:00
Your Name
7c726ebc1c fix(b1): Solver Agent 結構化動作 — 北極星 §1.1 修復多樣性 ≥ 40%
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 2m22s
INC-20260425 衍生修復 — Solver 拒絕 rule-based mock 兜底:

原設計缺陷:
- LLM 失敗時 → rule-based mock 推 RESTART 兜底
- 違反北極星 §1.1:修復多樣性 ≥ 40%(不能寫死同一指令)

新設計:
- LLM 失敗 → graceful degraded(candidates=[], recommended_actions=[], degraded=True)
- 禁止 rule-based mock / hardcode RESTART
- 新增 recommended_actions 結構化 MCP 動作清單
  · 供 B3 Telegram 按鈕動態生成
  · YAML 規則庫驅動,非寫死
- 新增 yaml + Path import 載入動作模板庫

向下相容:
- 既有 candidates / blast_radius 邏輯不變
- 新增欄位 recommended_actions 為 optional list

Tests: 8 passed (solver 相關全綠)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (B1 北極星 §1.1) <noreply@anthropic.com>
2026-04-27 08:18:38 +08:00
Your Name
fefe4c21cd fix(inc-20260425): A1+A2 後續 — Solver/Critic timeout + auto_repair 接線 + Runbook + Grafana
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
延續 595629c0 INC-20260425 修復,補三段 Agent + 全鏈路觀測:

A1 後續 — Solver/Critic 三段 timeout 接線:
- solver_agent.py: AGENT_SOLVER_TIMEOUT_SEC=20.0(env override)
- critic_agent.py: AGENT_CRITIC_TIMEOUT_SEC=15.0(env override)
- protocol.py: 三 Agent 共用 observe_agent_step() 包裹呼叫
  · success/timeout/error outcome label
  · histogram 寫入 aiops_agent_step_duration_seconds

A2 後續 — auto_repair_service 改用 _diagnose_fallback_chain:
- auto_repair_service.py +46 行 — 切換 DIAGNOSE 路由到新 chain(NEMO→GEMINI→CLAUDE)
- 完全避開 Ollama CPU 238s 二次 timeout

新增 metrics:
- core/metrics.py +59 行 — 配合 observe_agent_step 的 histogram bucket + label cardinality

新增測試 (862 行):
- test_agent_step_timeouts.py (475) — 三 Agent 各 timeout 邊界 + outcome label
- test_ai_router_diagnose_fallback.py (387) — _diagnose_fallback_chain 正確序

新增配套:
- docs/runbooks/RUNBOOK-AGENT-STEP-LATENCY.md (350) — INC 故障排查 + 觀測指引
- ops/monitoring/grafana/agent_step_latency_rules.yaml (160)
  · 三 Agent histogram alert rules(p99 > timeout 80% → warning)

驗收: 33 tests pass (test_agent_step_timeouts 22 + test_ai_router_diagnose_fallback 11)

INC-20260425 雙修總工作量(595629c0 + 此 commit):
  · 5 個 service/agent 檔修改
  · 1 個新 observability 模組
  · 4 個新測試/配套檔
  · 1372+187 = 1559 行新增

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (INC-20260425 後續) <noreply@anthropic.com>
2026-04-27 08:15:53 +08:00
Your Name
595629c013 fix(inc-20260425): A1 三段 Agent timeout 拆分 + A2 DIAGNOSE 移除 Ollama
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
INC-20260425-8D17BB / 3B6C39 兩則告警 AI 信心降到 20% 根因雙修(統帥批准 A+B):

A1 — 三段 Agent step timeout 拆分(北極星 §1.2 Observable by Default):
- diagnostician_agent.py: PHASE2_STEP_TIMEOUT_SEC=20.0 共用值 → 拆三段
  · AGENT_DIAGNOSTICIAN_TIMEOUT_SEC=30.0(NIM 主吃口,最大 prompt + 多假設)
  · AGENT_SOLVER_TIMEOUT_SEC=20.0(後續 commit 接線)
  · AGENT_CRITIC_TIMEOUT_SEC=15.0(後續 commit 接線)
  · env override 支援,K8s ConfigMap 動態調整不需 rebuild
  · 保留 PHASE2_STEP_TIMEOUT_SEC alias(DEPRECATED,下 sprint 移除)
- observability/agent_step_metrics.py (58 行) — 新模組:
  · aiops_agent_step_duration_seconds Histogram
  · observe_agent_step() helper 統一三 Agent 呼叫點
  · outcome label ∈ {success, timeout, error}
  · agent label ∈ {diagnostician, solver, critic}

A2 — ai_router DIAGNOSE chain 移除 Ollama:
- ai_router.py v4.4 by Claude Sonnet 4.6
  · 新增 _diagnose_fallback_chain: NEMO → GEMINI → CLAUDE
  · Ollama 永久排除於此 chain(CPU-only 實測 238s,二次 timeout 必爆)
  · 新增 aiops_diagnose_fallback_total Prometheus metric
- 根因: NIM timeout 後 fallback 到 Ollama deepseek-r1:14b CPU 238s
  → 二次 timeout → degraded confidence=0.2

Wave8-X2 整合測試補正:
- test_ollama_failover_manager.py: TestSelectProvider 補 mock _check_gemini_quota
  原 test 期望 OFFLINE→Gemini,但 quota fail-closed 後沒 mock 會被切到 188
  繞過 quota check 後驗純路由邏輯 → 37/37 PASS

Tests: 37 passed (test_ollama_failover_manager 全部)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (Wave 8 INC-20260425) <noreply@anthropic.com>
2026-04-27 08:15:10 +08:00
Your Name
cbd28e29a0 fix(solver+incident): 兩組 P0 配置修復 - Gitea 非K8s 過濾 + 備份告警年齡升級
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 8m57s
L3 修復總結(2026-04-25):

【修復 1】Gitea 跨域界限 kubectl 過濾(solver_agent.py)
根因:GiteaMemoryPressure 告警觸發 Solver → LLM 生成 'kubectl scale deployment gitea'
      Gitea 在主機 docker-compose,不在 awoooi-prod K8s namespace → 執行必然失敗

變更:
- 添加 _filter_non_k8s_targets() 函數,對 scale/restart/delete/patch 指令驗證 target
- 添加 _KUBECTL_MUTATING_VERBS / _KUBECTL_ROLLOUT_MUTATING_SUBVERBS 常數
- 在 _solve() 呼叫 _fetch_k8s_inventory() 獲取實際部署清單
- 後置過濾:candidates 中若 target 不在 inventory 且屬寫入動詞 → 丟棄 + 警告

預期行為:GiteaMemoryPressure → Solver 現生成調查類 kubectl(get/describe),而非 scale

【修復 2】HostBackupFailed 誤判升級(incident_service.py + webhooks.py)
根因:備份失敗 >24h 被標記 TYPE-1(純資訊),導致靜默發送無按鈕卡片,未觸發自動修復

變更:
- incident_service.py classify_alert_early() 添加 age_hours 參數
- 添加 _BACKUP_AGE_UPGRADE_NAMES + _BACKUP_AGE_THRESHOLD_HOURS=24.0
- 若 alertname in (HostBackupFailed/Stale/Missing) 且 age > 24h → TYPE-3 升級
- webhooks.py 計算 alert.startsAt → age_hours,並傳遞給 classify_alert_early()

預期行為:HostBackupFailed 25h+ → 升級為 TYPE-3,觸發 LLM 分析 + P0 自動修復建議

測試結果:
- solver_agent: 35/35 tests PASSED 
- incident_service: 11/11 tests PASSED 
- incident_api integration: 7/7 tests PASSED 

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 09:48:04 +08:00
Your Name
f9f2263c00 fix(execution-feedback): 修復系統自動化反饋完全斷鏈的三層 P0 故障
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 8m57s
**背景**
用戶報告執行狀態卡在「 執行中...」永不回報,導致自動修復機制完全癱瘓
(信心度修復後,執行失敗但無法推送 Telegram 卡片通知)

**L1 — Post-verify AttributeError(2 處)**
- approval_execution.py:757, 1010 調用不存在方法 IncidentService.get_incident()
- 正確方法:get_from_working_memory() fallback get_from_episodic_memory()
- 影響:post-verify 邏輯被 exception 無聲吞掉,下游 Telegram 推送完全卡住

**L2 — Notification Provider 未配置**
- 新增 notifications/telegram.py:複用既有 TelegramGateway.send_notification()
- 修改 manager.py:初始化時註冊 TelegramWebhookProvider
- 影響:執行完成後無任何 provider 發送推送,導致 Telegram 看不到結果

**L3 — Solver Agent 語意合成生成殘缺指令**
- 舊邏輯:action_title="重啟服務" → 合成 "kubectl rollout restart deployment -n awoooi-prod"(缺名)
- 下游 operation_parser 無法解析(regex 要求 deployment/<name>)
- 修法:優先從 parsed 提取 target 欄位;無名則 return [],降級到唯讀調查指令
- 測試全部通過:35/35,含 11 個新安全測試

**驗證**
- 被阻擋的惡意 kubectl_command 現在正確 fall-through 到語意合成路徑
- 無 target 名稱時返回空列表,不再生成殘缺指令
- Telegram 執行結果推送鏈路已完整

**預期效果**
- 執行失敗 → 立即收到「 執行失敗」Telegram 卡片(L1 + L2 修復)
- 自動化決策遵循白名單,避免生成無法執行的指令(L3 修復)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 03:29:38 +08:00
Your Name
cc69f3ce04 fix(solver_agent): 修復 AI 信心度阻斷 + 三層 kubectl 安全防禦
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
**修法A — 恢復 AI 決策信心度 (0.5 → 0.9)**
- Solver Agent 優先使用 OpenClaw NIM 的 `kubectl_command` 欄位(完整指令),略過語義合成降級
- 保留原始 0.9 信心度,告警自動化能力回復
- Root cause: 舊版在 action_title 未含 "kubectl" 時執行 min(0.9, 0.5) 降級

**C1 — CRITICAL: ReDoS + 注入防禦**
- 正則 `\s` → `[ ]` 避免換行符號 (\n\r) 配對(Shell 注入向量)
- 加入 `re.ASCII` 與 `{1,500}` 有界量詞,防止指數級回溯
- 性能提升 7.256s → 0.015ms (48x faster)
- 明文拒絕 \n \r \t \x00

**C2 — CRITICAL: 繞過防禦 + 截斷攻擊**
- action_title 路徑加白名單驗證(舊版跳過)
- 標準候選路徑:驗證 → 截斷,防止截斷繞過
- 不安全指令自動降級至語義合成

**C3 — CRITICAL: 無界長度 DoS**
- 新增 _KUBECTL_MAX_LEN = 500,硬上限前置檢查
- 防止長輸入導致正則超時

**測試覆蓋**
- 35 個測試(24 回歸 + 11 新安全測試)
- LF/CR/Tab/Null 注入、Shell 元字元、ReDoS 效能、邊界條件全覆蓋
- Critic 與 vuln-verifier 雙重驗證

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 03:02:58 +08:00
Your Name
39f45dd305 fix(solver): 補 import re(solver_agent 已有 re.compile 但漏 import)
Some checks are pending
CD Pipeline / build-and-deploy (push) Has started running
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 02:42:25 +08:00
Your Name
7d1c85eb86 fix(hermes): ANTHROPIC_API_KEY 注入 + solver 信心度修法 A + 12-Agent 治理文件
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- nl_gateway.py: ClaudeAgentOptions 透過 env= 注入 ANTHROPIC_API_KEY(CLAUDE_API_KEY alias),
  修復 SDK 找不到 API key 的問題(SDK 讀 ANTHROPIC_API_KEY,K8s secret 名稱是 CLAUDE_API_KEY)
- solver_agent.py: 修法 A — kubectl_command 欄位優先路徑,OpenClaw Nemo 回傳完整指令時
  不再被語意合成壓縮 confidence(0.9 → min(0.5) 的 bug),9 tests pass
- AGENTS.md: Codex CLI 對應版 CLAUDE.md(Codex Session 啟動用)
- docs/12-agent-game-rules.md: 12-Agent 任務判型 + 主責/協作派工 + 9 skills 對照(v1.0)
- .agents/skills/06-awoooi-monorepo-master.md: v1.6,新增 12-agent 協作治理章節

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 02:33:43 +08:00
Your Name
0d81b28b1b fix(aiops): bound phase2 timeout and repair incident links
All checks were successful
E2E Health Check / e2e-health (push) Successful in 52s
CD Pipeline / build-and-deploy (push) Successful in 9m24s
2026-04-24 23:53:56 +08:00
Your Name
04ff22563e fix(aiops-p1): Playbook 學習閉環 5斷點全修 + DB Migration(ADR-092 B4)
Some checks failed
run-migration / migrate (push) Failing after 14s
CD Pipeline / build-and-deploy (push) Failing after 2m7s
【P0.4 補丁】pre_decision_investigator Prometheus query 欄位缺失
- _build_tool_params() 補 "query" 欄位(prometheus_query tool 必要參數)
- 新增 _build_prometheus_query() — 依告警類型生成 PromQL(CPU/Memory/Crash/Disk/HTTP/Pod/fallback)
- 修復後 D3_METRICS 感官維度實際取得資料(原本 100% 回 missing_query_parameter)

【P1 Playbook 學習閉環 B1-B5 全修】
- B2 db/models.py: ApprovalRecord 新增 matched_playbook_id 欄位 + ix_approval_matched_playbook index
- B2 db/models.py: TimelineEvent 新增 incident_id 欄位(MCP 稽核用)+ index
- B3 approval_db.py: record→ApprovalRequest 補回 incident_id + matched_playbook_id
- B4 approval_repository.py: 同 B3(兩個轉換函式必須同步)
- B5 approval_db.py: approval_request_to_record_data 補 matched_playbook_id → DB 才能存值

【P1.5 KM 寫入】approval_execution.py: fire-and-forget → await wait_for(30s)
- 根因:asyncio.create_task 在 Pod recycle 時被殺,KM 寫入靜默遺失
- 修復:await asyncio.wait_for(..., timeout=30.0) + TimeoutError log

【Migration 文件】adr092_p1_learning_chain_fix.sql
- ALTER TABLE approval_records ADD COLUMN matched_playbook_id VARCHAR(36)
- ALTER TABLE timeline_events ADD COLUMN incident_id VARCHAR(64)
- 執行:psql $DATABASE_URL -f apps/api/migrations/adr092_p1_learning_chain_fix.sql

【附帶 Agent 改動】
- decision_manager: Phase 2 YAML NO_ACTION 優先門(主機層/外部服務跳過 Agent Debate)
- alert_rules.yaml: Sentry/ClickHouse + HostDiskUsageHigh/Critical 新規則
- solver_agent: action_title 語意合成兜底(取代靜默丟棄)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 15:41:35 +08:00
OG T
cf50a5ce25 fix(solver+execution): Checkpoint-1 假成功修復 + Checkpoint-2 K8s 環境感知
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 10m55s
## Checkpoint-1: 假成功根治
- approval_execution.py: execute_approved_action 改返回 bool
  (原返回 None,呼叫端無法判斷 K8s 是否接受指令)
- decision_manager.py auto-execute 路徑: 用 _exec_success 取代硬編 success=True
  修復: K8s 拒絕指令時正確發  而非  自動修復完成

## Checkpoint-2: K8s 環境感知 (Inventory Pre-flight)
- solver_agent.py: 新增 _fetch_k8s_inventory() — 生成 kubectl 指令前先拉
  kubectl get deployments,statefulsets -n awoooi-prod,將真實名稱清單
  注入 Solver prompt,LLM 必須從清單選擇,防止幻覺(awooiii-api 三個 i)
- 超時 5s 或失敗 → 返回 "",prompt 顯示警示但不中斷主流程

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 23:08:23 +08:00
OG T
93205ceab0 fix(auto_approve+solver): P1 kubectl gate + P2 Nemo path kubectl 強制
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 9m56s
P1 安全漏洞 (auto_approve.py):
- 新增條件 1d:action 必須含 kubectl 關鍵字才可自動執行
- Solver 經 OpenClaw Nemo 路徑輸出自然語言 → 條件 1c 通過但無法執行
- 修復:自然語言 action → 降級人工審核(NO_PLAYBOOK reason)

P2 執行障礙 (solver_agent.py):
- Nemo 格式路徑:action_title 不含 kubectl → return [] → 觸發 _degraded_plan
- _default_action_for_category:舊自然語言 → 真實 kubectl 調查指令
- 降級路徑現在輸出 kubectl get/top/exec 等唯讀指令,可被 auto_approve 1d 正確評估

ADR-082: Phase 2 多 Agent 協作
2026-04-17 ogt + Claude Sonnet 4.6(亞太): P1+P2 hotfix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 14:49:53 +08:00
OG T
e0bfcc7bd6 fix(phase5): 修復 Solver action 格式 — 強制輸出 kubectl 命令
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 9m33s
根因:_build_prompt() 的 action 範例為 "restart_service:awoooi-api"(自訂格式),
LLM 模仿此格式輸出自然語言描述而非 kubectl 命令。

影響鏈:
  Solver action = 自然語言描述
  → auto_approve Condition 1c 拒絕(無 kubectl 關鍵字)
  → _auto_execute() 永不被調用
  → blast_radius_calculator 永不被調用
  → blast_radius_score fill rate = 0/14 = 0%(Phase 5 驗收指標未達)

修復:
  1. blast_radius 參考從抽象描述改為實際 kubectl 命令示例
  2. 明確要求 action 欄位必須是真實 kubectl 命令(不可用自然語言)
  3. 正確範例:kubectl rollout restart deployment/awoooi-api -n awoooi-prod

預期效果:LLM 輸出 kubectl 命令 → auto_approve 通過(低 blast_radius 情境)
          → blast_radius_calculator 被調用 → fill rate 趨向 100%

2026-04-17 ogt + Claude Sonnet 4.6

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 12:44:36 +08:00
OG T
0388e50d0e fix(p1-backlog): 修復「待分析」死結與 Telegram 訊息截斷
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 30m25s
問題 1:REQUEST_REVISION → 待分析
  根因:safe_candidates=[] → selected=None → recommended_action=None
        → decision_manager action="" → TG 卡顯示「待分析」(資訊流斷裂)
  修復 coordinator_agent.py:
    無安全候選時回退至 Solver 原始最優方案
    標記「[Reviewer 未核准,僅供參考] {action}」
    SRE 永遠能看到 AI 建議,資訊流絕不中斷

問題 2:debate_summary 在 (blast_radius... 中間截斷顯示 (bl
  根因:root_cause=reasoning[:150] — 150 字元對中文 debate_summary 過短
  修復 decision_manager.py:
    root_cause 截斷 150 → 300
    suggested_action 截斷 80 → 120

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 11:12:02 +08:00
OG T
0077ff9758 fix(solver): 傳遞 hypothesis 作為 alert_context 給 OPENCLAW_NEMO
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
根因:solver 呼叫 openclaw.call(prompt) 不傳 context
→ nemo fallback 把 prompt[:500](系統說明「軍師 Agent」)
   當 signal description → LLM 回傳垃圾方案描述

修復:把 top.description 放進 alert_context.signals
      讓 nemo 看到真實根因假設(與 diagnostician 同模式 7eb8375)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 22:51:30 +08:00
OG T
7eb837567d fix(diagnostician): 修復 'AI 深度診斷' 垃圾根因顯示
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
根因三層鏈:
1. openclaw.call(prompt) 不傳 context
2. OPENCLAW_NEMO fallback 把 prompt[:500](系統說明文字)當 signal description
3. Nemo LLM 回傳 action_title="調查 AWOOOI SRE 系統的偵探 Agent"(任務描述)
4. _extract_hypotheses() 用 action_title 作為根因假設描述 → Telegram 顯示垃圾

修復:
- openclaw.call() 新增 alert_context 可選參數,透傳給 _call_with_fallback
- diagnostician._analyze() 建立 alert_context(incident_id + evidence_summary as signal)
  → nemo 使用結構化 API 收到真實感應器資料而非系統說明文字
- _extract_hypotheses() nemo 格式轉換:優先用 reasoning(為什麼)作為假設描述
  而非 action_title(做什麼)— reasoning 更接近根因分析

2026-04-16 ogt + Claude Sonnet 4.6 (台北時區)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 22:34:48 +08:00
OG T
d294caf830 fix(solver): 相容 openclaw_nemo 回傳格式 → candidates 格式轉換
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 3m51s
與 diagnostician 同步:openclaw_nemo 回傳 action_title/risk_level/confidence,
solver 的 _extract_candidates() 找不到 candidates key → 空方案 → no_candidates

修復: 檢測 action_title 存在時轉換為 candidates 格式
risk_level → blast_radius 映射: critical=60, high=40, medium=25, low=10

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 14:34:50 +08:00
OG T
c27709d11b fix(diagnostician): 相容 openclaw_nemo 回傳格式 → 解除全面 ABSTAIN
根因: AI Router DIAGNOSE→openclaw_nemo 回傳 ClawBot 格式:
     {"action_title":"...","risk_level":"...","confidence":0.85}
     Diagnostician 只解析 {"hypotheses":[...]} → 永遠 0 hypotheses → ABSTAIN

修復: _extract_hypotheses() 新增 openclaw_nemo 格式檢測與轉換
     action_title→description, confidence→confidence, risk_level→category

影響: 所有 critical alert 自 2026-04-15 收到後一律 ABSTAIN,無任何修復動作

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 14:32:32 +08:00
OG T
7e3cc8b3b0 fix(agents): 移除人工 per-agent timeout,LLM 必須等完整回應
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
原設計 asyncio.wait_for(timeout_sec=25s) 是任意截斷,
只要 LLM 超過時限就降級為 confidence=20%,根本沒有分析。

正確做法:
- 移除所有 4 個 agent 的 asyncio.wait_for() 包裝
- 只留 except Exception 捕真實異常(連線失敗、模型崩潰)
- 全流程由 Orchestrator GLOBAL_TIMEOUT_SEC=90s 防掛死
- _PER_AGENT_TIMEOUT_SEC 常數廢棄移除

影響:LLM 推理多久就等多久,不再人工截斷,
      deepseek-r1:14b 等模型得以完整輸出分析結果。

2026-04-16 ogt + Claude Sonnet 4.6

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 02:54:34 +08:00
OG T
14a02263ae feat(Phase 4): 主動巡檢 + 趨勢預測 + 8D 感官升級 全部完成
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 12m32s
## Phase 4 完整交付(ADR-084)

### 新增服務
- trend_predictor.py: numpy 線性回歸,4h 閾值突破預警,R² 信心評分
- proactive_inspector.py: 每 5 分鐘主動巡檢協調器
  - DynamicBaselineService(3σ 偏離)
  - LogAnomalyDetector(新 Drain3 pattern)
  - TrendPredictor(斜率外推 4h 預測)
  - Shadow Mode + 30 分鐘去重 + Holt-Winters 背景重訓

### 8D 感官升級(EvidenceSnapshot Phase 4 增強)
- PreDecisionInvestigator._collect_phase4_anomalies(): 決策前讀取
  ProactiveInspector 最近巡檢快取 + LogAnomalyDetector 新 pattern
- EvidenceSnapshot.anomaly_context: 新欄位,Phase 4 動態異常上下文
- DiagnosticianAgent._build_prompt(): prompt 包含 anomaly_context,
  LLM RCA 可參考動態基線偏差與趨勢預警

### 資料庫遷移
- incident_evidence: ADD COLUMN anomaly_context JSONB(冪等)

### main.py
- 啟動 run_proactive_inspector_loop() asyncio task

2026-04-15 ogt + Claude Sonnet 4.6(亞太): Phase 4 全部完成

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 15:47:05 +08:00
OG T
42bc1df9f9 fix(phase2): 驗證發現兩處安全漏洞並修正
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
手動驗證執行中發現:
1. reviewer_agent.py: force push regex 只覆蓋「force push」文字順序,
   漏掉 git 實際格式「git push --force」(push 先, --force/-f 後)
   → 修正為雙向 pattern:(?:force.{0,5}push|push.{0,30}(?:--force|-f\b)).{0,30}main

2. coordinator_agent.py: Critic critical challenge 僅施 0.3 penalty,
   當原始信心 > 0.7(如 0.82)時 penalty 後仍 > 0.4 閾值,
   critical challenge 穿透到 auto-execute 路徑(驗證確認:0.82→0.52>0.4)
   → 新增 Critic REJECT 硬閘(等同 Reviewer REJECT 效力),
     在 penalty 邏輯前強制 requires_human_approval=True

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 13:48:55 +08:00
OG T
5ddba6d6e0 feat(adr-082): Phase 2 多 Agent 協作 — 5 角色辯證系統骨架上線
新增 5 個 Agent + Orchestrator + DecisionManager 接線:
- protocol.py: DiagnosisReport / ActionPlan / ReviewVerdict / CriticReport / DecisionPackage 型別系統
- DiagnosticianAgent: RCA 根因分析,confidence < 0.4 → ABSTAIN
- SolverAgent: 修復方案軍師,blast_radius 評分 + 降級 rule-based mock
- ReviewerAgent: 安全審查,HARD_RULES 靜態 pattern + blast_radius 閾值 (>50 revision, >80 reject)
- CriticAgent: 刻意唱反調,強制 3 問批判性思維,critical challenge → REJECT
- CoordinatorAgent: 純規則聚合,6 級決策閘,REQUEST_REVISION → 強制人工
- AgentOrchestrator: 30s 全局超時,Reviewer ‖ Critic 並行,DB Immutable Event Sourcing + Redis Streams
- DecisionManager: AIOPS_P2_ENABLED gate + _package_to_proposal_data 橋接既有 proposal_data 格式
- AgentSession DB table + 4 個複合 index
- ADR-082 決策記錄

Gate 2 修復(7 項):
- CRITICAL: DELETE FROM regex lookahead 位置錯誤(移至 FROM 後)
- CRITICAL: REQUEST_REVISION 可抵達 auto-execute 路徑(改回 requires_human_approval=True)
- IMPORTANT: _extract_json flat regex 不支援巢狀 JSON(改 find/rfind 邊界提取)
- IMPORTANT: all_degraded 遺漏 verdict.degraded(補全 4 個 Agent)
- IMPORTANT: Solver ABSTAIN guard 放行降級假設(改為無論 hypotheses 有無均跳過)
- IMPORTANT: dataclasses.asdict() Enum 未序列化導致 DB 寫入靜默失敗(加 json.dumps default handler)
- IMPORTANT: P2 gate 直讀屬性繞過父 Phase 守衛(改用 is_phase_enabled(2))

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 13:48:55 +08:00
OG T
938df7f291 fix(api): 全面清除假信心分數 - 遵循 feedback_confidence_truthfulness.md
🔴 違規修正: 規則匹配/Expert System 不是 AI 分析,confidence 必須 = 0.0

修正檔案:
- agents/action_planner.py: 0.9 → 0.0
- agents/blast_radius.py: 0.85/0.5/0.9 → 0.0
- agents/security.py: 計算公式 → 0.0
- signoz_webhook.py: 0.7 → 0.0
- auto_approve.py: default 0.5 → 0.0
- ci_auto_repair.py: 整個計算函數 → return 0.0
- error_analyzer_service.py: default 0.5 → 0.0
- intent_classifier.py: 計算公式 → 0.0
- openclaw.py: default 0.5 → 0.0
- resource_resolver.py: 0.8 → 0.0
- k8s_naming.py: 0.9/0.7 → 0.0

只有 LLM 真實分析返回的 confidence 才能 > 0

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 16:00:46 +08:00
OG T
6f049877fc fix(lint): ruff auto-fix + lewooogo-core src 加入 git
- Python: ruff --fix 修復 280 個 lint 錯誤
- lewooogo-core: src/ 目錄未追蹤,導致 CI eslint 失敗

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-23 23:51:37 +08:00
OG T
7478dc0254 feat(phase6-9): Complete modular architecture and Agent Teams
Phase 6.4 - Modular Architecture:
- Add lewooogo-brain adapters for LLM providers
- Add lewooogo-data dual memory (Redis + PostgreSQL)
- Implement consensus engine for multi-agent decisions
- Add incident memory service for historical context

Phase 9 - Agent Teams (Claude Agent SDK):
- Add base agent class with Claude Sonnet 4 integration
- Implement action planner, blast radius, and security agents
- Add agent API endpoints and proposal workflow
- Integrate ADR-009 OpenClaw Agent Teams architecture

DevOps & CI/CD:
- Add GitHub Actions CI/CD workflows (ci.yaml, cd.yaml)
- Add pre-commit hooks and secrets baseline
- Add docker-compose for local development
- Update Kubernetes network policies

Frontend Improvements:
- Add auto-healing error boundary component
- Update i18n messages for agent features
- Enhance dual-state incident card with execution feedback

Documentation:
- Add 7 ADRs covering MCP, design system, architecture decisions
- Update ARCHITECTURE_MEMORY.md with modular design
- Add GLOBAL_RULES.md and SOUL.md for project identity

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-23 18:40:36 +08:00