OG T
d952435b60
fix(drift): replace direct Ollama httpx calls with the OpenClaw AI Router
...
CD Pipeline / build-and-deploy (push) Successful in 32m34s
Root cause: _call_nemotron() called Ollama directly over httpx (settings.OLLAMA_URL),
bypassing the AI Router with no fallback → "All connection attempts failed"
→ the Telegram card showed "Intent analysis failed: All connection attempts failed"
Fix: route through get_openclaw().call(prompt),
which automatically gets provider degradation and fallback (consistent with the other Agents)
Deprecated: the BUG-001 direct-httpx bypass (the nvidia_provider interface is now stable)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 10:27:39 +08:00
OG T
0c15fa5988
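refactor(decision): state machine refactor: move the YAML NO_ACTION gate up into the decision routing hub
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Architect directive (2026-04-17): the notification layer must not query business logic.
Reverts the inline YAML lookup from c05bcdb (a spaghetti patch)
and moves the NO_ACTION / INVALID_TARGET checks to the correct place.
Refactoring plan:
① Remove the inline YAML lookup from _push_decision_to_telegram()
→ the notification layer only translates blocked_reason → NotificationType (Single Responsibility)
② Add step 4c to decide(): the YAML NO_ACTION routing gate
Position: after _dual_engine_analyze() returns, before auto_approve.evaluate()
Logic:
- NO_ACTION → blocked_reason="YAML: NO_ACTION" + is_informational_only=True
→ short-circuit past auto_approve + Blast Radius → TYPE-1 (or critical → TYPE-4)
- INVALID_TARGET → blocked_reason="INVALID_TARGET-..." → short-circuit → TYPE-4
- Gate lookup failure → degrade silently and continue the normal flow
Checkpoint coverage:
CP1 YAML evaluation layer moved up ✅
CP2 Short-circuit skips auto_approve ✅
CP3 Notification layer is pure translation ✅
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>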
2026-04-17 10:20:01 +08:00
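The "pure translation" role that commit 0c15fa5988 assigns to the notification layer could look roughly like the sketch below. The function name `translate_blocked_reason`, the `severity` argument, and the two-member enum are assumptions made for illustration; only the mapping rules (NO_ACTION → TYPE-1, critical NO_ACTION → TYPE-4, INVALID_TARGET → TYPE-4) come from the commit message.

```python
from enum import Enum
from typing import Optional


class NotificationType(Enum):
    TYPE_1 = "TYPE-1"  # informational card, no approval buttons
    TYPE_4 = "TYPE-4"  # manual confirmation required


def translate_blocked_reason(blocked_reason: str, severity: str) -> Optional[NotificationType]:
    """Map a blocked_reason set upstream to a notification type.

    Hypothetical helper: it performs no YAML lookup of its own and returns
    None when no gate fired, so the caller keeps its normal classification.
    """
    if blocked_reason.startswith("INVALID_TARGET"):
        return NotificationType.TYPE_4
    if "NO_ACTION" in blocked_reason:
        if severity == "critical":
            return NotificationType.TYPE_4  # escalated so SRE cannot miss it
        return NotificationType.TYPE_1
    return None
```

Because the function only reads `blocked_reason`, any path that sets it upstream (Phase 2, Expert, Webhook) gets the same card type without the notification layer knowing about rules.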
OG T
c05bcdbbd4
fix(decision): inline YAML NO_ACTION fallback lookup: fix the Phase 2 path blind spot
...
CD Pipeline / build-and-deploy (push) Failing after 4m0s
Root cause: Phase 2 (agent debate → auto_approve rejects → push straight to TG) does not pass through
the YAML check in auto_execute(), and the Coordinator does not set blocked_reason.
Alerts under NO_ACTION rules such as PostgreSQL disk / host resource still showed
an "ACTION REQUIRED" card (TYPE-3) on the Phase 2 path instead of a TYPE-1 informational card.
Fix: when blocked_reason is empty, _push_decision_to_telegram() performs one extra
inline YAML lookup by alertname, so every path (Phase 2 / Expert / Webhook)
correctly detects NO_ACTION → TYPE-1 / critical NO_ACTION → TYPE-4.
Production trigger: the INC-20260416-C365D0 PostgreSQL disk alert showed ACTION REQUIRED
instead of TYPE-1, confirming that the full code review missed this execution path.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 10:15:28 +08:00
OG T
83ab5e32d7
fix(happy-path): harden the entire Happy Path: INVALID_TARGET + critical NO_ACTION + empty-command interception
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Problem 1 (P0): invalid restart of deployment/unknown:
- alert_rule_engine: tracks the _invalid_target flag and returns blocked_reason="INVALID_TARGET-..."
- decision_manager: the auto_execute path detects INVALID_TARGET → returns early + TYPE-4 manual confirmation
- auto_approve: adds condition 1c: an empty-string action is rejected outright, preventing a false "about to execute" report
Problem 2 (P1): critical+NO_ACTION was silent:
- decision_manager: refactors the blocked_reason awareness layer
① INVALID_TARGET → TYPE-4
② NO_ACTION + critical → TYPE-4 (escalated; SRE must not miss it)
③ NO_ACTION + non-critical → TYPE-1 (stays a pure informational card)
Problem 3 (P1): rule-match confidence black hole:
- auto_approve condition 1c ensures an empty action never passes auto-approve;
even with is_rule_based=True, nothing is auto-executed without a command
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 22:57:50 +08:00
OG T
92b39ab840
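fix(no-action-notify): send YAML NO_ACTION alerts as TYPE-1 informational notifications (remove pointless approval buttons)
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Root cause:
- The host_resource/postgresql_disk_monitoring YAML rules are set to NO_ACTION
- but classify_notification() is unaware of NO_ACTION
- confidence=0.2 (no sensor data) → classified as TYPE-4 (low confidence, manual review required)
- SRE sees "Approve/Reject" buttons with no automated remediation to run → meaningless
Fix:
- _push_decision_to_telegram detects "NO_ACTION" in blocked_reason
- and forces _notif_type = TYPE-1 (informational only, no approval buttons)
- SRE sees the informational card "Host CPU/load/disk alert (observe only)" instead of a fake approval request
2026-04-16 ogt + Claude Sonnet 4.6 (Taipei time zone)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>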
2026-04-16 22:37:15 +08:00
OG T
7eb837567d
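fix(diagnostician): fix the garbage root cause shown under 'AI Deep Diagnosis'
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Three-layer root-cause chain:
1. openclaw.call(prompt) passes no context
2. The OPENCLAW_NEMO fallback uses prompt[:500] (system instruction text) as the signal description
3. The Nemo LLM returns action_title="Detective Agent investigating the AWOOOI SRE system" (a task description)
4. _extract_hypotheses() uses action_title as the root-cause hypothesis description → Telegram shows garbage
Fix:
- openclaw.call() gains an optional alert_context parameter, passed through to _call_with_fallback
- diagnostician._analyze() builds alert_context (incident_id + evidence_summary as signal)
→ nemo receives real sensor data through the structured API instead of system instruction text
- _extract_hypotheses() nemo format conversion: prefers reasoning (the why) as the hypothesis description
over action_title (the what), since reasoning is closer to root-cause analysis
2026-04-16 ogt + Claude Sonnet 4.6 (Taipei time zone)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>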
2026-04-16 22:34:48 +08:00
OG T
54d6818b8d
fix(sensors+rules+dedup): three root-cause fixes across the board: missing asyncssh / YAML rule gaps / duplicate cards
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Fix 1: asyncssh not installed → sensors_succeeded always = 0
- Add asyncssh>=2.14.0 to apps/api/pyproject.toml
- Root cause: import asyncssh in ssh_provider.py sits outside try/except, so the ImportError propagates
- Effect: all 15 SSH tools are usable again
Fix 2: YAML rule gaps → HostHighLoadAverage/PostgreSQLDiskGrowthRate fell to generic_fallback → restart
- Merge host_cpu_high into host_resource_alert, covering 25+ host-level alertnames
- Add the postgresql_disk_monitoring rule, covering disk growth/exporter/vacuum alerts
- All host-level/disk-monitoring alerts → NO_ACTION; kubectl restart is forbidden
Fix 3: the same incident handled by multiple pods at once → 3 duplicate Telegram cards
- decision_manager.get_or_create_decision(): add an early return for the ANALYZING state
- Root cause: ANALYZING was not in the (READY/EXECUTING/COMPLETED) condition → pod-B/C each created a new token
2026-04-16 ogt + Claude Sonnet 4.6 (Taipei time zone)
2026-04-16 22:23:49 +08:00
OG T
02a276127e
fix(sensors+drift+repair-card): fix three node issues across the board
...
CD Pipeline / build-and-deploy (push) Successful in 1h1m39s
Fix 1: sensors 7/8 failing: expand SSH short hostnames (pre_decision_investigator.py)
Root cause: the Prometheus instance label is "110:9100", so split(":")[0]="110",
but SSH_MCP_ALLOWED_HOSTS stores the full IP "192.168.0.110" → all 7 SSH tools failed
Fix: add _SHORT_HOST_MAP, "110"→"192.168.0.110", covering all four hosts
Fix 2: Config Drift false positives: allowlist K8s default fields (drift_detector.py)
Root cause: after kubectl rollout restart, the restartedAt annotation was detected as "medium" drift;
K8s auto-populated fields such as restartPolicy/dnsPolicy/terminationGracePeriodSeconds were not allowlisted
Fix: add 13 fields auto-populated by K8s at runtime to _DEFAULT_ALLOWLIST_FIELDS
Fix 3: garbage content in the repair request card: the fallback now carries the real error context (failure_watcher.py)
Root cause: when LLM analysis failed, root_cause = "Rule engine classification: K8S_ERROR" (no useful information)
Fix: the fallback becomes "[K8S_ERROR] {operation_type} failed on {target_resource}\nError: {error_message[:200]}"
2026-04-16 ogt + Claude Sonnet 4.6 (Asia-Pacific)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 20:50:06 +08:00
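A minimal sketch of the short-hostname expansion described in Fix 1 of commit 02a276127e. Only the `"110"` entry is taken from the commit message; the other three hosts are not named there, so they are left out, and the function name `expand_instance_host` is hypothetical.

```python
# Only the mapping named in the commit is shown; the remaining hosts are
# not listed in the commit message and are deliberately omitted here.
_SHORT_HOST_MAP = {"110": "192.168.0.110"}


def expand_instance_host(instance: str) -> str:
    """Turn a Prometheus instance label into a host usable for SSH.

    "110:9100" -> "110" -> "192.168.0.110". Hosts that are not short
    names pass through unchanged.
    """
    host = instance.split(":")[0]
    return _SHORT_HOST_MAP.get(host, host)
```

Falling back to the unmodified host keeps labels that already carry a full IP working with the allowlist check.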
OG T
513232e90b
fix(decision_manager): Agent analysis result overwrites the Webhook's garbage action
...
CD Pipeline / build-and-deploy (push) Successful in 28m30s
Root cause (full root-cause analysis of incident INC-20260416-C365D0):
- The Webhook inline LLM created ApprovalRecord.action = "kubectl rollout restart awoooi-prod"
- The Agent analysis was correct (postgres disk → NO_ACTION) but only sent a new Telegram card without overwriting the DB
- The user approved the Agent card → the system looked up incident_id → found the Webhook's old ApprovalRecord
→ executed the garbage action (a rollout restart for a disk alert!)
Fix:
- approval_db.py: add update_action_by_incident_id() (updates PENDING records by incident_id)
- decision_manager.py: overwrite the ApprovalRecord as soon as the Agent confirms the action
If action="" (NO_ACTION): store "NO_ACTION - {description}" so the user knows the Agent recommends observing
On approval, the Agent's correct recommendation is executed rather than the Webhook's generic action
2026-04-16 ogt + Claude Sonnet 4.6 (Asia-Pacific)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 20:07:15 +08:00
OG T
a258d87767
fix(webhooks+prompts): fix the root problem of the LLM outputting "restart the AWOOOI service" for every alert
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Root cause (INC-20260416-C365D0 postgres disk alert incident):
1. alertname was buried deep in labels within alert_context; the LLM saw alert_type="custom" → could not tell which alert it was
2. The cache key used alert_type:target_resource → different alertnames shared one cache entry → all returned the first LLM result
3. The system prompt had no alert-category guidance → the LLM always output kubectl rollout restart
Fix:
- webhooks.py: put alertname + alert_category + annotations at the top of alert_context
- openclaw.py: the cache key now uses alertname:target_resource (the alert name is the primary identifier)
- prompts.py: add Alert-Specific Analysis Rules to OPENCLAW_SYSTEM_PROMPT + NEMOTRON_SYSTEM_PROMPT
database/storage alerts → NO_ACTION + investigation commands; K8s alerts → the corresponding restart command
Outputting kubectl rollout restart deployment/awoooi-prod for non-K8s alerts is forbidden
2026-04-16 ogt + Claude Sonnet 4.6 (Asia-Pacific)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 19:56:13 +08:00
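The cache-key collision from root cause 2 of commit a258d87767 can be reproduced in a few lines. The helper names and the sample `target_resource` value are illustrative assumptions; the two key formats and the alert names come from the commit log.

```python
def old_cache_key(alert_type: str, target_resource: str) -> str:
    # Key format before the fix: alert_type:target_resource
    return f"{alert_type}:{target_resource}"


def new_cache_key(alertname: str, target_resource: str) -> str:
    # Key format after the fix: alertname:target_resource
    return f"{alertname}:{target_resource}"


# Two different alerts that both arrive with alert_type="custom".
alerts = [
    {"alertname": "PostgreSQLDiskGrowthRate", "alert_type": "custom", "target_resource": "awoooi-prod"},
    {"alertname": "HostHighLoadAverage", "alert_type": "custom", "target_resource": "awoooi-prod"},
]

old_keys = {old_cache_key(a["alert_type"], a["target_resource"]) for a in alerts}
new_keys = {new_cache_key(a["alertname"], a["target_resource"]) for a in alerts}
```

With the old format both alerts collapse onto one key, so the second alert would be served the first alert's cached LLM answer; the new format keeps them apart.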
OG T
8b2a3df64b
fix(telegram): fix the Telegram card description showing debug garbage
...
CD Pipeline / build-and-deploy (push) Failing after 3m12s
Problem: description = debate_summary[:500], so users saw internal audit text
Fix: build a human-readable summary from diagnosis.top_hypothesis + action_plan.top_candidate
Format: "Root cause: [description] (confidence X%); Plan: [action]"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 18:42:13 +08:00
OG T
ded93cbba3
fix(aiops): fix empty evidence → AI ABSTAIN
...
CD Pipeline / build-and-deploy (push) Successful in 32m33s
Problem:
- signal.alert_name is top-level, but _get_alertname() read from labels["alertname"] → empty string
- When every sensor fails, evidence_summary is only 120 characters; the AI cannot analyze → ABSTAIN
- With empty labels the AI has no idea which alert it is
Fix:
1. _get_alertname(): read signal.alert_name first, fall back to labels["alertname"]
2. _get_labels(): automatically add alertname to the labels dict
3. EvidenceSnapshot.alert_info: new basic alert fields (the minimum intelligence when sensors=0)
4. build_summary(): alert_info always comes first, so the AI at least knows the alert type + severity
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 16:26:07 +08:00
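Fixes 1 and 2 of commit ded93cbba3 amount to a lookup with a fallback plus a label backfill. The sketch below assumes a simplified `Signal` dataclass with just two fields; the real model and the private `_get_alertname()` / `_get_labels()` signatures are not shown in the commit, so these are stand-ins.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Signal:
    # Simplified stand-in for the project's Signal model (assumption).
    alert_name: str = ""
    labels: Dict[str, str] = field(default_factory=dict)


def get_alertname(signal: Signal) -> str:
    """Prefer the top-level alert_name, then fall back to labels["alertname"]."""
    return signal.alert_name or signal.labels.get("alertname", "")


def get_labels(signal: Signal) -> Dict[str, str]:
    """Return a copy of labels that always carries alertname when one is known."""
    labels = dict(signal.labels)
    name = get_alertname(signal)
    if name:
        labels.setdefault("alertname", name)
    return labels
```

Copying the dict avoids mutating the signal while still guaranteeing downstream consumers see an `alertname` label.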
OG T
588b0d745b
fix(aiops): fix the sensors=0/0 root cause: MCPToolRegistry was never initialized at startup
...
CD Pipeline / build-and-deploy (push) Failing after 1m44s
Three problems fixed together:
1. main.py: add the missing init_mcp_tool_registry() call
- ADR-081 Phase 1 created MCPToolRegistry, but it was never called in lifespan startup
- This left PreDecisionInvestigator at sensors=0/0 with evidence_summary always empty
- Empty evidence → Diagnostician always ABSTAIN
2. signal_producer.py: str(dict) → json.dumps()
- labels/annotations were serialized with Python str() and could not be deserialized after being written to Redis
3. brain/incident_engine.py: add the _parse_dict_field() helper
- labels/annotations read back from Redis may be JSON strings
- The isinstance(..., dict) guard was insufficient; json.loads() is needed first
2026-04-16 ogt + Claude Sonnet 4.6 (Asia-Pacific): flywheel sensor repair
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 15:35:19 +08:00
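Items 2 and 3 of commit 588b0d745b hinge on the difference between `str(dict)` and `json.dumps()`. The helper below is a sketch of what a `_parse_dict_field()` style function might do; its exact behaviour on bad input (returning an empty dict) is an assumption, not something the commit states.

```python
import json
from typing import Any, Dict


def parse_dict_field(value: Any) -> Dict[str, Any]:
    """Accept a dict, or a JSON-encoded dict as str/bytes, and return a dict.

    Anything else, including the Python repr produced by str(dict),
    yields an empty dict (assumed fallback).
    """
    if isinstance(value, dict):
        return value
    if isinstance(value, (str, bytes)):
        try:
            parsed = json.loads(value)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            return {}
        return parsed if isinstance(parsed, dict) else {}
    return {}


labels = {"alertname": "HostHighLoadAverage", "severity": "warning"}
good = json.dumps(labels)  # '{"alertname": ...}' is valid JSON
bad = str(labels)          # "{'alertname': ...}" uses single quotes, not valid JSON
```

`parse_dict_field(good)` round-trips the labels, while `parse_dict_field(bad)` cannot recover them, which is why the producer side had to switch to `json.dumps()`.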
OG T
8582439d2d
fix(kb): Signal has no description field; use alert_name + annotations instead
...
knowledge_extractor_service accessed s.description directly in two places:
- L87 signals_text assembly: now uses alert_name + annotations.summary/description
- L198 fallback title: now uses alert_name[:60]
The Signal model only has alert_name and annotations (dict); there is no description attribute.
This fix prevents an AttributeError during KB extraction from blocking draft creation.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 08:54:11 +08:00
OG T
9ea1f77e41
fix(telegram): remove 7 ghost buttons (3-part / no handler)
...
Offending buttons:
- flywheel_diag / flywheel_dashboard (META alert card)
- pause_1h / ignore (business alert card)
- postmortem / escalation_ack / dr_manual (escalation notification card)
- secops_block_ip / secops_evict (SecOps card; spec=nonce but uses 2-part)
None of these buttons has a callback handler; clicking does nothing = ghost button
Iron rule: better no button than a dead button (feedback_no_ghost_buttons.md)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 03:29:41 +08:00
OG T
cd1c0ffdb8
fix(openclaw): call() returns 3 values instead of 5: fixes the root cause of all AI analysis degradation
...
Root cause: _call_with_fallback() returns a 5-tuple, but call() returned it as-is,
so callers (diagnostician/solver/critic agents) unpacking into 3 variables hit
→ ValueError: too many values to unpack → everything degraded to 20%
Fix: call() explicitly unpacks and returns (response, provider, success)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 03:27:48 +08:00
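The arity mismatch fixed in commit cd1c0ffdb8 is easy to reproduce. In this sketch the stand-in `_call_with_fallback()` returns five made-up values; the commit does not say what the last two elements are or in which order the tuple is laid out, so placing `(response, provider, success)` first is an assumption.

```python
from typing import Tuple


def _call_with_fallback(prompt: str) -> Tuple[str, str, bool, float, int]:
    # Stand-in for the real function. The last two values are hypothetical
    # placeholders; only the "5-tuple" shape comes from the commit.
    return ("response text", "nvidia", True, 0.42, 1)


def call_broken(prompt: str):
    # Before the fix: the 5-tuple leaks straight through to callers.
    return _call_with_fallback(prompt)


def call(prompt: str) -> Tuple[str, str, bool]:
    # After the fix: unpack explicitly and return exactly three values.
    response, provider, success, *_ = _call_with_fallback(prompt)
    return response, provider, success


def caller_unpacks_three(fn) -> str:
    """Mimic an agent that expects exactly three values."""
    try:
        response, provider, success = fn("diagnose this")
    except ValueError as exc:
        return f"degraded: {exc}"
    return f"ok via {provider}"
```

Pinning the return arity inside `call()` means later changes to the fallback tuple cannot silently break every agent that unpacks three values.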
OG T
27ba97e586
fix(ollama): remove every hardcoded 188:11434 fallback; all now point to the 111 GPU
...
CD Pipeline / build-and-deploy (push) Successful in 15m59s
- decision_manager.py: two getattr fallbacks 188 → 111
- routes/agent.py: OLLAMA_BASE_URL 188 → 111
- knowledge_extractor_service.py: _OLLAMA_BASE 188 → 111
The config.py default was already 111; this removes the leftover hardcoded 188 values in code.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 03:01:31 +08:00
OG T
7e3cc8b3b0
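fix(agents): remove the artificial per-agent timeout; the LLM must be awaited until it responds in full
...
CD Pipeline / build-and-deploy (push) Has been cancelled
The original asyncio.wait_for(timeout_sec=25s) was an arbitrary cutoff:
whenever the LLM exceeded the limit, the result degraded to confidence=20% with no analysis at all.
Correct approach:
- Remove the asyncio.wait_for() wrapper from all 4 agents
- Keep only except Exception to catch real failures (connection failure, model crash)
- The Orchestrator's GLOBAL_TIMEOUT_SEC=90s guards the whole flow against hangs
- The _PER_AGENT_TIMEOUT_SEC constant is deprecated and removed
Impact: inference takes as long as it needs, with no artificial cutoff,
so models such as deepseek-r1:14b can emit their full analysis.
2026-04-16 ogt + Claude Sonnet 4.6
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>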
2026-04-16 02:54:34 +08:00
OG T
5a3a649f8a
fix(decision): fix two root causes of repeated TYPE-1 alert spam
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Root cause 1: the TYPE-1 bypass ran before the existing_token check
→ every get_or_create_decision() call pushed to TG whether or not a token existed
→ Fix: move the existing_token check ahead of the TYPE-1 bypass (single entry point)
Root cause 2: the TYPE-1 token TTL was only 3600s
→ after 1h the token expired, and the next scan recreated it and pushed to TG again
→ Fix: raise the TYPE-1 token TTL to 86400s (24h)
Impact: TYPE-1 alerts such as HostBackupFailed are pushed only once per incident (within 24h)
2026-04-16 ogt + Claude Sonnet 4.6
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 02:49:31 +08:00
OG T
62bcc50770
fix(tg+km): complete Telegram operation-record disclosure and fix KM categorization (ADR-076)
...
CD Pipeline / build-and-deploy (push) Has been cancelled
New Telegram message fields:
- alert_category category tag (🏷️ host/K8s/database/service, etc.)
- playbook_name shows the name of the matched Playbook
- Frequency statistics threshold lowered from count_24h>1 to >=1 (first-time alerts are shown too)
- TelegramMessage gains alert_category/playbook_name fields
- decision_manager → send_approval_card passes playbook_name through
KM fixes:
- EntryType.PLAYBOOK → EntryType.AUTO_RUNBOOK (the former does not exist and raises AttributeError)
- category "auto_generated" → "ai_system" (the frontend i18n has a matching translation)
- runbook_generator: category corrected to match
- Push a Telegram notification after a KM entry is created (best-effort)
DB decision_chain new fields:
- Add playbook_id / playbook_name / alert_category
2026-04-16 ogt + Claude Sonnet 4.6
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 02:46:17 +08:00
OG T
9538f6cca4
fix(agents): fix the 5s Agent timeout that made all LLM inference fail
...
CD Pipeline / build-and-deploy (push) Successful in 16m14s
Root cause: measured deepseek-r1:14b inference time is 2.2-27.3s, avg 10.6s,
but Diagnostician/Critic/Solver/Reviewer all used timeout_sec=5.0 (a dev-machine test value)
→ 67% of Agent inferences timed out → degraded to confidence=20% → auto-remediation never triggered
Fix:
- _PER_AGENT_TIMEOUT_SEC: 5s → 25s (about a 2.3x buffer over the avg)
- GLOBAL_TIMEOUT_SEC: 30s → 90s (3 sequential Agents × 25s + buffer)
- Pass timeout_sec explicitly to all 4 Agent calls
Expected effect: AI analysis confidence ≥ 0.5 for normal alerts → triggers auto-remediation
2026-04-16 Claude Sonnet 4.6 Asia/Taipei
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 02:28:05 +08:00
OG T
f5e33da2fc
fix(telegram): fix the _make_request → _send_request method name mismatch
...
CD Pipeline / build-and-deploy (push) Successful in 15m48s
Seven call sites used _make_request, but the method is actually named _send_request,
causing telegram_decision_push_failed errors after the sweeper finished its analysis.
Affected methods: send_push_notification, send_drift_card, and others in the ADR-071 series.
_send_request is defined at line 1272 and is already covered by OTEL tracing.
2026-04-16 Claude Sonnet 4.6 Asia/Taipei
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 01:44:29 +08:00
OG T
1e86cc2896
fix(flywheel): fix the incidents.outcomes → outcome column name error
...
CD Pipeline / build-and-deploy (push) Has been cancelled
GET /api/v1/stats/summary raised UndefinedColumnError: column "outcomes" does not exist
The actual column is incidents.outcome (json type)
2026-04-16 Claude Sonnet 4.6 Asia/Taipei
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 01:42:14 +08:00
OG T
0760315059
fix(decision-manager): fix the CAST syntax + disable shadow_mode
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Root cause of decision_chain_persist_failed:
asyncpg does not support the :dc::json syntax (:: conflicts with the : of named parameters)
Changed to CAST(:dc AS jsonb), the standard form for asyncpg
configmap:
AIOPS_P4_SHADOW_MODE: true → false
Real proactive monitoring enabled (proactive_inspector output is read by PreDecisionInvestigator)
2026-04-16 Claude Sonnet 4.6 Asia/Taipei
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 01:24:48 +08:00
OG T
457018c0f9
fix(decision-manager): write AI analysis results to incidents.decision_chain (long-term DB retention)
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Gap fixed: the decision token lived only in Redis with a 12min TTL, so AI diagnosis history was lost permanently
- Add the _persist_decision_to_db() method
- get_or_create_decision() writes to PG fire-and-forget once it completes
- Written fields: ts / confidence / risk_level / provider / source / diagnosis[:200]
- try/except swallows errors so the main flow is unaffected; a warning log tracks them
DB/Cache layering:
PG (long-term): incidents.decision_chain (history) + outcomes + KM entries
Redis (short-term): decision token dedup + working memory + playbook cache
2026-04-16 Claude Sonnet 4.6 Asia/Taipei
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 01:08:30 +08:00
OG T
d4fed639f6
fix(ai-router): restore OPENCLAW_NEMO routing for DIAGNOSE, fixing all timeout degradation
...
CD Pipeline / build-and-deploy (push) Failing after 2m16s
Root cause: the 2026-04-12 patch changed DIAGNOSE to None (complexity-based routing)
→ fell into Rule 6 → Ollama deepseek-r1:14b (CPU 238s) → timeout
→ degraded to 20% confidence + "pending analysis" → everything unknown
Fix:
1. ai_router.py: DIAGNOSE → OPENCLAW_NEMO (via 188:8088 NVIDIA NIM, 2-27s)
2. ai_router.py: remove the DIAGNOSE deepseek special case from Rule 6 (no longer needed)
3. 04-configmap.yaml: AI_FALLBACK_ORDER now puts nvidia first
gemini→ollama→nvidia (old) → nvidia→gemini→ollama (new)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 00:03:04 +08:00
OG T
c9efaa3740
fix(playbook-seed): fix the get_db_context import path (db.session → db.base)
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Root cause of the silent seed failure at startup:
from src.db.session import get_db_context ← module does not exist
from src.db.base import get_db_context ← correct path
This bug made it impossible to create any yaml_rule playbooks.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 23:49:56 +08:00
OG T
800ab1685f
fix(playbook+flywheel): fix the PlaybookSource enum + repair_steps compatibility + KM stats raw SQL
...
CD Pipeline / build-and-deploy (push) Successful in 14m58s
Type Sync Check / check-type-sync (push) Failing after 1m17s
Fixes three chained bugs so the Playbook seed can run:
1. PlaybookSource gains the YAML_RULE enum value (dedicated to alert_rules.yaml imports)
2. playbook_seed_service: source=YAML_RULE; dedup now uses raw SQL by name
and no longer calls list_playbooks (old-format repair_steps raise a validation error)
3. playbook_repository._orm_to_pydantic: fills in the required step_number/action_type fields
for old-format repair_steps (backward compatible)
4. flywheel_stats_service: embedding IS NULL now uses raw SQL,
fixing the AttributeError from the KnowledgeEntryRecord ORM having no embedding attribute
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 23:32:04 +08:00
OG T
77a92eb469
feat(P6): commit offline_replay_service + model_rollback_service (previously missed)
...
CD Pipeline / build-and-deploy (push) Successful in 14m59s
The two core services of the Phase 6 ADR-087 governance loop
were never git add-ed after creation and had stayed untracked.
2026-04-15 Claude Sonnet 4.6 Asia/Taipei
2026-04-15 22:29:09 +08:00
OG T
85c4e3b434
fix(km): fix the root cause of every KM write being unknown (three nodes)
...
CD Pipeline / build-and-deploy (push) Has started running
Bug-A: approval_execution.py called a nonexistent get_incident() → AttributeError,
swallowed by except → alertname/alert_category/affected_services all used default values
Fix: use the dual path get_from_working_memory() + get_from_episodic_memory()
Bug-B: _record_to_incident() omitted the notification_type + alert_category fields
when restoring an Incident from PG → km_conversion read None
Fix: restore both fields
Bug-C: main.py working_memory_warmup likewise omitted
notification_type + alert_category when rebuilding an Incident
Fix: add them there too
2026-04-15 Claude Sonnet 4.6 Asia/Taipei
2026-04-15 22:28:48 +08:00
OG T
c05bac6112
fix(playbook): seed tuple unpack + text[] → jsonb migration
...
CD Pipeline / build-and-deploy (push) Has been cancelled
- playbook_seed_service.py: list_playbooks returns tuple[list, int];
the missing unpack caused 'list' has no attribute 'source'
- fix_playbooks_array_to_jsonb.sql: source_incident_ids/tags text[] → jsonb
(already applied manually to the prod DB)
2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 22:03:59 +08:00
OG T
ecfb7148bf
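fix(prod): connect the YAML rule engine to the auto-execution path: a core architectural break
...
CD Pipeline / build-and-deploy (push) Has started running
Root cause of the architectural break:
The YAML rule engine (alert_rules.yaml) is the human-reviewed, authoritative source of actions,
but the auto-execution path only read proposal_data["kubectl_command"] (LLM-generated).
The two were completely disconnected → HostHighCpuLoad got a kubectl restart, and the SSH command
for DockerContainerUnhealthy was overwritten by the LLM's kubectl.
Fix strategy:
At the auto_execute entry point, look up the YAML match_rule first:
1. YAML → NO_ACTION (e.g. HostHighCpuLoad) → return immediately without executing anything
2. YAML → non-kubectl command (e.g. ssh docker restart) → override the LLM action,
so the downstream infrastructure SSH routing can take effect
Impact:
- HostHighCpuLoad / NodeCPUUsageHigh → auto-execution stops; downgraded to manual review
- DockerContainerUnhealthy → SSH docker restart (if labels contain host/container)
2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific): production emergency fix, batch 3
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>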
2026-04-15 21:50:25 +08:00
OG T
3696fb5938
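fix(prod): fix host_resource alerts wrongly issuing K8s kubectl + the auto-execution repeat storm
...
CD Pipeline / build-and-deploy (push) Has been cancelled
1. decision_manager: host_resource alerts (HostHighCpuLoad, etc.)
must not run kubectl operations → downgraded to manual review
Root cause: only infrastructure was blocked, so host_resource slipped into the K8s path
→ kubectl rollout restart deployment/HostHighCpuLoad was actually executed
2. decision_manager: add a Redis cooldown to the auto_execute path
At most 2 automatic executions per target within 5 minutes, preventing the awoooi-worker 3x storm
Root cause: the decision_manager auto-execution path had no cooldown protection at all
2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific): production emergency fix, batch 2
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>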
2026-04-15 21:45:46 +08:00
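The cooldown rule from item 2 of commit 3696fb5938 (at most 2 automatic executions per target within 5 minutes) can be illustrated with an in-memory sliding window. Note the swap: the commit uses Redis, whereas this sketch uses a per-process dictionary so it runs standalone. The class name, method name, and injectable clock are all assumptions.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class ExecutionCooldown:
    """In-memory stand-in for the Redis cooldown named in the commit."""

    def __init__(
        self,
        max_runs: int = 2,
        window_sec: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_runs = max_runs
        self._window_sec = window_sec
        self._clock = clock
        self._runs: Dict[str, Deque[float]] = defaultdict(deque)

    def try_acquire(self, target: str) -> bool:
        """Return True and record a run if the target is under its limit."""
        now = self._clock()
        runs = self._runs[target]
        # Drop executions that have aged out of the window.
        while runs and now - runs[0] >= self._window_sec:
            runs.popleft()
        if len(runs) >= self._max_runs:
            return False
        runs.append(now)
        return True
```

A per-process dictionary would not protect against several pods acting on the same target, which is presumably why the real fix keeps the counter in Redis.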
OG T
67f437043a
fix(prod): fix four fatal production bugs: outcome writes / OpenClaw / Telegram notifications / LLM rule display
...
CD Pipeline / build-and-deploy (push) Has been cancelled
1. decision_manager: remove the verification_result column from UPDATE incidents
(the incidents table has no such column → every outcome write failed with outcome_write_failed)
2. failure_watcher: get_openclaw_service → get_openclaw
(wrong function name → all OpenClaw analysis crashed with ImportError)
3. failure_watcher: tg.send_message → tg.send_notification
(TelegramGateway has no send_message method → repair notifications could not be sent)
4. decision_manager: expert_analyze now supplies the initial_diagnosis / diagnosis_description keys
(openclaw.py reads these two keys, but expert_analyze only had matched_rule / description
→ the LLM always saw Matched Rule=unknown and could not analyze correctly)
2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific): production emergency fix
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:41:31 +08:00
OG T
01fb531c02
fix(Phase 3): Evolver force=True bypass flag + remove unused imports
...
- run_evolver(force=True): the admin manual endpoint can bypass the feature flag
- Remove the unused typing.Any import
- Remove the redundant calculate_jaccard_similarity import in _merge_similar
ADR-083 Phase 3, 2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:09:01 +08:00
OG T
4718c7667c
feat(Phase 3): Evolver loop scheduling + manual trigger endpoint: merge drill gate complete
...
CD Pipeline / build-and-deploy (push) Has been cancelled
- playbook_evolver.py: add run_evolver_loop() (24h infinite loop)
- main.py: mount run_evolver_loop via asyncio.create_task
- api/v1/learning.py: POST /api/v1/learning/evolver/run (Phase 3 exit #6 drill endpoint)
- MASTER §8: backfill the record for 66c4eda AgentSession + the full Evolver exit-criteria list from this commit
ADR-083 Phase 3, 2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:07:56 +08:00
OG T
66c4eda27a
feat(Phase 3): AgentSession learning wiring: record_agent_session() + orchestrator debate signals
...
CD Pipeline / build-and-deploy (push) Has been cancelled
- learning_service.py: add record_agent_session(): 5-Agent debate results → Redis analytics
Critic challenge + matched_playbook_id → mild negative EWMA; all_agents_degraded records a governance event
- agent_orchestrator.py: best-effort call to record_agent_session() after run_agent_debate() completes
Phase 3 L7×D2 learning signals are now fully wired
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:00:18 +08:00
OG T
fb1bbd0e20
feat(Phase 3): complete the learning loop: Root cause 3 + diagnosis feedback + knowledge forgetting + fine-tune pipeline
...
CD Pipeline / build-and-deploy (push) Has been cancelled
- approval_execution.py: _run_post_execution_verify() now calls record_verification_result()
Closes Root cause 3: environment verification results (success/degraded/failed/timeout) are no longer orphaned
- learning_service.py: add record_verification_result(): verification result → Redis + Playbook EWMA
- learning_service.py: add record_diagnosis_outcome(): writes back the negative signal for misdiagnoses (L3×D4)
- jobs/knowledge_decay_job.py: new 30d knowledge-forgetting job (unreferenced draft/review → archived)
- services/finetune_exporter.py: new weekly JSONL export (EvidenceSnapshot × AgentSession)
- main.py: mount knowledge_decay_loop (24h) + finetune_export_loop (7d)
- MASTER §8: record that all Phase 3 core changes have landed
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 20:57:43 +08:00
OG T
65838708ce
fix(format): convert the remaining send_notification raw text to the ADR-075 TYPE-X format
...
CD Pipeline / build-and-deploy (push) Successful in 18m11s
- decision_manager.py: auto-remediation notifications now use the TYPE-2 ├─/└─ tree format
- gitea_webhook_service.py: Code Review notifications now use the TYPE-1 format; the ═══ border is removed
All 3 external send_notification callers now comply with the ADR-075 format specification:
1. ai_router.py: TYPE-1 AI Provider unavailable (fixed in 3ce5025)
2. decision_manager.py: TYPE-2 auto-remediation completed/failed (this commit)
3. gitea_webhook_service.py: TYPE-1 Code Review (this commit)
2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific): Phase 6 format enforcement
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 20:05:49 +08:00
OG T
14579ce149
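fix(heartbeat): system-silence threshold 2h → 24h, eliminating false-positive alerts
...
CD Pipeline / build-and-deploy (push) Has been cancelled
When there are no incidents the system normally writes no KM entries, so a 2h threshold is bound to misfire.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>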
2026-04-15 19:51:01 +08:00
OG T
3ce5025ca7
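fix(alerts): 3 silent flywheel nodes: DIAGNOSE routing + heartbeat disabled + notification format
...
CD Pipeline / build-and-deploy (push) Has been cancelled
1. openclaw.py: remove require_local=True from DIAGNOSE
- v4.3 already decided that NIM is the primary provider and poses no privacy issue
- require_local=True caused every provider to be privacy_skip-ped → alerts always failed
- After the fix DIAGNOSE uses _full_fallback_chain (NIM → Gemini → Claude)
2. ai_router.py: the require_local failure notification now uses the ADR-075 TYPE-1 format
- Plain-text raw notifications are forbidden (commander's iron rule: every message must follow the format template)
- Use the ├─ / └─ tree structure + semantic labels
3. main.py: disable Telegram heartbeat monitoring
- The heartbeat is already forwarded to another Telegram group, so there is no need to repeat it in this channel
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>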
2026-04-15 19:49:43 +08:00
OG T
f045506abd
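fix(flywheel): P2 overdue Approvals never closed → fix the broken KM learning chain
...
CD Pipeline / build-and-deploy (push) Failing after 12m11s
Root cause:
A PENDING approval left unhandled for more than 48h should become EXPIRED automatically,
but get_pending_approvals() only ran when a user opened the UI.
If nobody opened the UI → the Incident stayed PENDING forever → KM was never written
→ the Phase 6 SLO human_override_rate was underestimated and the EWMA lacked negative samples.
Fix:
1. anomaly_counter.py: add the "timeout_ignored" disposition type,
distinct from auto_repair / human_approved / manual_resolved
2. incident_service.py: resolve_incident() gains a resolution_type parameter;
with resolution_type="timeout" it records "timeout_ignored" instead of "manual_resolved"
3. jobs/approval_timeout_resolver.py (new): scans overdue PENDING approvals hourly,
marks them EXPIRED in batch, and calls resolve_incident("timeout") for each record with an incident_id
4. main.py: mount the approval_timeout_resolver schedule at startup (interval=3600s)
Effect:
- Alert unhandled for 48h → Incident closed automatically → KM written → EWMA gets a sample
- disposition="timeout_ignored" lets the SLO calculation correctly distinguish "AI recommendation ignored"
- The flywheel learning chain closes the loop for unhandled alerts
2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>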
2026-04-15 19:21:21 +08:00
OG T
f31b4e31ba
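fix(approval): create_approval_with_fingerprint now injects a default 48h expires_at
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Root cause (confirmed after a full audit):
Every webhook path that creates an approval (webhooks.py:908/1426/1566) omitted
expires_at, leaving the DB column NULL. The auto-expiry logic in get_pending_approvals(),
WHERE expires_at < now, is never true for NULL → zombie PENDING records were never cleaned up.
Fix strategy:
Inject a default 48h TTL in create_approval_with_fingerprint() (the single shared entry point
for alert approvals), covering all 3 webhook call sites at once.
Manual API creation (approvals.py) passes its own expires_at and is unaffected.
Works together with the 2026-04-15 24h PENDING_TTL_HOURS patch:
- 24h: find_by_fingerprint no longer converges onto expired PENDING records → new alerts trigger notifications again
- 48h: get_pending_approvals auto-expire → zombie records in the UI are cleared automatically
2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific): completed after a full audit
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>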
2026-04-15 19:08:17 +08:00
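The NULL pitfall behind commit f31b4e31ba can be mirrored in plain Python, where `None` plays the role of SQL NULL. Both function names and the `DEFAULT_APPROVAL_TTL` constant are hypothetical; only the 48h default and the `expires_at < now` comparison come from the commit.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DEFAULT_APPROVAL_TTL = timedelta(hours=48)  # the 48h default named in the commit


def resolve_expires_at(expires_at: Optional[datetime], now: datetime) -> datetime:
    """Inject the default TTL only when the caller passed no expiry."""
    return expires_at if expires_at is not None else now + DEFAULT_APPROVAL_TTL


def is_expired(expires_at: Optional[datetime], now: datetime) -> bool:
    """Mirror of `WHERE expires_at < now`: a NULL expiry never matches."""
    return expires_at is not None and expires_at < now


created = datetime(2026, 4, 12, 9, 0, tzinfo=timezone.utc)
three_days_later = created + timedelta(days=3)
```

A record created with no expiry is never selected by `is_expired`, however old it gets; once `resolve_expires_at` fills in the default, the same record expires after 48 hours.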
OG T
fab65e7d7a
fix(alerts): PENDING convergence had no TTL → old records permanently blocked Telegram alerts
...
CD Pipeline / build-and-deploy (push) Has started running
Root cause: the PENDING match condition in find_by_fingerprint had no time limit;
3 PENDING approval records created on 2026-04-12 (hit=77/30/17)
kept absorbing every alert with the same fingerprint, silencing Telegram for 2+ hours.
Fix (approval_db.py):
- PENDING_TTL_HOURS = 24: PENDING records older than 24h no longer converge new alerts
- Before: OR(status=PENDING, created_at >= 30 min ago)
- After: OR(PENDING AND created_at >= 24h ago, created_at >= 30 min ago)
Emergency fix: used kubectl exec to set 7 expired PENDING records to expired directly,
restoring the Telegram alert flow immediately (without waiting for a deploy).
Phase 6 AI self-governance loop (ADR-087):
- feat(db): add the ai_governance_events table + 3 indexes (base.py + models.py)
- feat(svc): ai_slo_calculator.py: 7d rolling SLO (success/override/false_neg)
- feat(svc): trust_drift_detector.py: detects extreme skew in Playbook trust
- feat(job): kb_rot_cleaner.py: cleans up rot in K8s API/Prom metric/stale incident_case entries
- feat(svc): decision_manager.py: self-degradation guard (SLO violation → raise thresholds/conservative mode)
2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 18:56:26 +08:00
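The before/after convergence condition in commit fab65e7d7a can be written as two small predicates. The real code presumably builds an SQL filter inside `find_by_fingerprint`; the plain-Python functions and their names below are an illustrative assumption.

```python
from datetime import datetime, timedelta, timezone

PENDING_TTL_HOURS = 24
RECENT_WINDOW = timedelta(minutes=30)


def converges_before_fix(status: str, created_at: datetime, now: datetime) -> bool:
    # OR(status=PENDING, created_at >= 30 min ago): PENDING matches forever.
    return status == "PENDING" or created_at >= now - RECENT_WINDOW


def converges_after_fix(status: str, created_at: datetime, now: datetime) -> bool:
    # OR(PENDING AND created_at >= 24h ago, created_at >= 30 min ago)
    live_pending = status == "PENDING" and created_at >= now - timedelta(hours=PENDING_TTL_HOURS)
    return live_pending or created_at >= now - RECENT_WINDOW


now = datetime(2026, 4, 15, 18, 0, tzinfo=timezone.utc)
stale_pending = datetime(2026, 4, 12, 9, 0, tzinfo=timezone.utc)  # 3+ days old
```

Under the old predicate the three-day-old PENDING record still absorbs new alerts; under the new one it no longer does, so a fresh alert creates a new record and a new notification.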
OG T
655d1a568a
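feat(Phase 5): Declarative remediation abstraction + Blast Radius tiered control, all complete
...
CD Pipeline / build-and-deploy (push) Has been cancelled
## Phase 5 deliverables (ADR-086)
### New services (4)
- blast_radius_calculator.py: blast radius calculator (0-100, pure function)
- Base scores for 18 kubectl action types + namespace multipliers + special-flag adjustments
- HARD_RULES always block: delete ns/pv/pvc/clusterrole + rm -rf + DROP TABLE
- Tiers: ≤10 auto / 11-50 human / 51-99 dual / 100 blocked
- declarative_remediation.py: DeclarativeSpec immutable spec (frozen dataclass)
- evaluate() wraps Blast Radius + dry-run + rollback_plan + constraints
- rollback_plan is derived automatically from the kubectl action type (no LLM call)
- gitops_pr_service.py: Gitea Issue review for high-risk remediation (tier=dual)
- Includes Blast Radius + target state + rollback plan + two-person review flow
- Guarded by the AIOPS_P5_GITOPS_PR flag
- rollback_manager.py: automatic rollback on verification failure
- First checks rollout history for ≥ 2 revisions, so a rollback always has a version to return to
- kubectl rollout undo + 120s convergence wait
### decision_manager.py wiring (AIOPS_P5_BLAST_RADIUS_CHECK)
- _auto_execute() inserts the tier guard after the safety guard and before ApprovalRequest
- blocked → always blocked + manual review notification
- dual → async GitOps Issue + escalate to manual review
- human → escalate to manual review (no automatic execution)
- auto (≤10) → the existing auto-execution flow
- Failure degradation: calculation error → conservatively escalate to manual review
### learning_service.py
- record_declarative_outcome(): records DeclarativeSpec execution results
anomaly_key=declarative:{incident_id}, including blast_radius_score/tier/rollback
2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific): Phase 5 all complete
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>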
2026-04-15 16:06:54 +08:00
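The tier boundaries listed for commit 655d1a568a (≤10 auto / 11-50 human / 51-99 dual / 100 blocked) and the "calculation error → manual review" rule could be sketched as below. The scoring itself (base scores, namespace multipliers, HARD_RULES) is not reproduced; `tier_for`, `tier_with_fallback`, and the `score_fn` parameter are hypothetical names.

```python
from typing import Callable


def tier_for(score: int) -> str:
    """Map a 0-100 blast radius score to its review tier."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score <= 10:
        return "auto"
    if score <= 50:
        return "human"
    if score <= 99:
        return "dual"
    return "blocked"


def tier_with_fallback(score_fn: Callable[[str], int], command: str) -> str:
    """Fail safe: any scoring error escalates to human review."""
    try:
        return tier_for(score_fn(command))
    except Exception:
        return "human"
```

Treating every exception as "human" matches the commit's conservative failure mode: a broken calculator can slow remediation down but cannot let a command run unattended.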
OG T
14a02263ae
feat(Phase 4): proactive inspection + trend prediction + 8D sensory upgrade, all complete
...
CD Pipeline / build-and-deploy (push) Failing after 12m32s
## Phase 4 full deliverables (ADR-084)
### New services
- trend_predictor.py: numpy linear regression, 4h threshold-breach early warning, R² confidence score
- proactive_inspector.py: proactive inspection coordinator, runs every 5 minutes
- DynamicBaselineService (3σ deviation)
- LogAnomalyDetector (new Drain3 patterns)
- TrendPredictor (slope extrapolation, 4h forecast)
- Shadow Mode + 30-minute dedup + Holt-Winters background retraining
### 8D sensory upgrade (EvidenceSnapshot Phase 4 enhancements)
- PreDecisionInvestigator._collect_phase4_anomalies(): before a decision, reads
the latest ProactiveInspector inspection cache + new LogAnomalyDetector patterns
- EvidenceSnapshot.anomaly_context: new field, Phase 4 dynamic anomaly context
- DiagnosticianAgent._build_prompt(): the prompt includes anomaly_context,
so LLM RCA can reference dynamic-baseline deviations and trend warnings
### Database migration
- incident_evidence: ADD COLUMN anomaly_context JSONB (idempotent)
### main.py
- Start the run_proactive_inspector_loop() asyncio task
2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific): Phase 4 all complete
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 15:47:05 +08:00
OG T
4a6aa16a94
fix(Phase 4): fix call sites that omitted arguments: promql and sample_log
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Found while checking related nodes:
- dynamic_baseline_service.py: _save_baseline() was called from train_baseline()
without promql/lookback_hours → PG records could not trace their training source
- log_anomaly_detector.py: _save_new_cluster() was called without sample_log →
the sample_log field was empty when PG recorded a LogCluster
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 15:34:33 +08:00
OG T
bf45b80bd2
feat(Phase 3.5 + Phase 4): persist AI learning results to PostgreSQL: fix the "AI amnesia" architectural flaw
...
CD Pipeline / build-and-deploy (push) Has been cancelled
ADR-085: AI learning results must not live in the cache
Architectural iron rules established:
- PostgreSQL = System of Record (the AI's permanent memory)
- Redis = Warm Cache (speeds up reads; restored from PG when the TTL expires)
Core changes:
1. models.py: add the PlaybookRecord / DynamicBaselineRecord / LogClusterRecord ORM models
2. base.py: ALTER TABLE playbooks adds trust_score / requires_approval_level and other columns
3. playbook_repository.py: full dual-write implementation (PG upsert + Redis cache)
4. dynamic_baseline_service.py: Holt-Winters training results are written to PG; Redis is only a 24h warm cache
5. log_anomaly_detector.py: Drain3 cluster templates are written to PG (UPSERT on cluster_id)
6. main.py: run backfill_redis_to_pg() at startup (idempotent Redis → PG backfill)
Problems fixed:
- The Playbook 7-day Redis TTL expired → the AI lost all remediation knowledge
- The trust_score EWMA reset with the Redis TTL → the AI fell back to the initial trust of 0.3
- Holt-Winters baseline 24h TTL → the AI relearned the definition of "normal" every day
- Drain3 clusters were not persisted → the AI kept treating known log patterns as new
Phase 4 new services (statsmodels + drain3 + numpy added to requirements.txt)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 15:34:04 +08:00
OG T
7da64eaad2
feat(Phase 3): rebuild the learning loop: three root-cause fixes + 2x EWMA + Evolver Agent
...
CD Pipeline / build-and-deploy (push) Failing after 19m7s
Type Sync Check / check-type-sync (push) Failing after 1m18s
ADR-083 Phase 3 learning loop rebuild:
**Three root-cause fixes**
- approval_execution.py: fire-and-forget create_task → await asyncio.wait_for(timeout=30) × 2
(success path L265 + failure path L353; a timeout records the learning_trigger_timeout metric and does not crash the main flow)
- models/approval.py: ApprovalRequestBase gains the matched_playbook_id field
- decision_manager.py: _auto_execute fills matched_playbook_id when creating the ApprovalRequest
- learning_service.py: dual-path lookup of _matched_pb_id (matched_playbook_id + metadata fallback)
**2x EWMA negative reinforcement**
- models/playbook.py: add trust_score: float = 0.3 (EWMA dynamic trust field)
- repositories/playbook_repository.py: update_stats applies the EWMA
Success: trust = 0.9 × old + 0.1 × 1.0
Failure: trust = 0.8 × old + 0.2 × 0.0 (decays 2x faster)
trust < 0.1 → log warning, await archiving by the Evolver
**Evolver Agent (new)**
- services/playbook_evolver.py: three functions, all static rules
1. Low-trust archiving: trust < 0.1 → DEPRECATED
2. Dormancy archiving: unused for 30d AND trust < 0.5 → DEPRECATED
3. Similarity merge: symptom Jaccard > 0.9 → keep the higher trust, archive the lower
AIOPS_P3_EVOLVER_ENABLED=False, off by default
**Docs**
- ADR-083 learning loop rebuild
- MASTER §8 Phase 3 completion record
AIOPS_P3_ENABLED=False (default); the skeleton is in place pending the commander's approval to enable
Co-Authored-By: Claude Sonnet 4.6 (Asia-Pacific) <noreply@anthropic.com>
2026-04-15 14:01:37 +08:00
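The asymmetric EWMA from commit 7da64eaad2 is small enough to work through numerically. The two update formulas, the 0.3 starting value, and the 0.1 archive threshold come from the commit; the function names and the helper that counts failures are illustrative assumptions.

```python
INITIAL_TRUST = 0.3
ARCHIVE_THRESHOLD = 0.1


def update_trust(old: float, success: bool) -> float:
    """Asymmetric EWMA: failures are weighted twice as heavily as successes."""
    if success:
        return 0.9 * old + 0.1 * 1.0
    return 0.8 * old + 0.2 * 0.0


def failures_until_archive(start: float = INITIAL_TRUST) -> int:
    """Count consecutive failures until trust drops below the threshold."""
    trust, count = start, 0
    while trust >= ARCHIVE_THRESHOLD:
        trust = update_trust(trust, success=False)
        count += 1
    return count
```

Starting from 0.3, one success raises trust to 0.37 and one failure lowers it to 0.24. Five consecutive failures give 0.3 × 0.8⁵ ≈ 0.098, the first value under the 0.1 threshold at which the Evolver archives a playbook.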
OG T
5ddba6d6e0
feat(adr-082): Phase 2 multi-Agent collaboration: 5-role debate system skeleton goes live
...
Adds 5 Agents + Orchestrator + DecisionManager wiring:
- protocol.py: DiagnosisReport / ActionPlan / ReviewVerdict / CriticReport / DecisionPackage type system
- DiagnosticianAgent: RCA root-cause analysis, confidence < 0.4 → ABSTAIN
- SolverAgent: remediation strategist, blast_radius scoring + degraded rule-based mock
- ReviewerAgent: safety review, HARD_RULES static patterns + blast_radius thresholds (>50 revision, >80 reject)
- CriticAgent: deliberate devil's advocate, 3 mandatory critical-thinking questions, critical challenge → REJECT
- CoordinatorAgent: pure rule aggregation, 6-level decision gate, REQUEST_REVISION → forced manual review
- AgentOrchestrator: 30s global timeout, Reviewer ‖ Critic in parallel, DB Immutable Event Sourcing + Redis Streams
- DecisionManager: AIOPS_P2_ENABLED gate + _package_to_proposal_data bridges the existing proposal_data format
- AgentSession DB table + 4 composite indexes
- ADR-082 decision record
Gate 2 fixes (7 items):
- CRITICAL: the DELETE FROM regex lookahead was in the wrong position (moved after FROM)
- CRITICAL: REQUEST_REVISION could reach the auto-execute path (restored requires_human_approval=True)
- IMPORTANT: the flat regex in _extract_json did not support nested JSON (switched to find/rfind boundary extraction)
- IMPORTANT: all_degraded omitted verdict.degraded (now covers all 4 Agents)
- IMPORTANT: the Solver ABSTAIN guard let degraded hypotheses through (now skips whether or not hypotheses exist)
- IMPORTANT: dataclasses.asdict() left Enums unserialized, so DB writes failed silently (added a json.dumps default handler)
- IMPORTANT: the P2 gate read the attribute directly, bypassing the parent Phase guard (now uses is_phase_enabled(2))
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 13:48:55 +08:00
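Two of the Gate 2 fixes in commit 5ddba6d6e0 lend themselves to a short illustration: find/rfind boundary extraction for nested JSON, and a `json.dumps` default handler for Enum members left behind by `dataclasses.asdict()`. The dataclass, enum members, and function names are simplified stand-ins, not the project's actual protocol types.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Any, Optional


def extract_json(text: str) -> Optional[Any]:
    """Parse the span from the first '{' to the last '}' in an LLM reply."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(text[start : end + 1])
    except json.JSONDecodeError:
        return None


class Verdict(Enum):  # hypothetical members
    APPROVE = "approve"
    REJECT = "reject"


@dataclass
class ReviewVerdict:  # simplified stand-in for the protocol type
    verdict: Verdict
    degraded: bool = False


def _enum_default(obj: Any) -> Any:
    # json.dumps calls this only for objects it cannot serialize itself.
    if isinstance(obj, Enum):
        return obj.value
    raise TypeError(f"not JSON serializable: {type(obj).__name__}")


def to_json(report: ReviewVerdict) -> str:
    # asdict() keeps Enum members as-is, so a default handler is required.
    return json.dumps(asdict(report), default=_enum_default)
```

Boundary extraction tolerates prose around the JSON and arbitrary nesting, at the cost of failing when a reply contains two separate objects; in that case the function returns None rather than guessing.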