awoooi

Author	SHA1	Message	Date
Your Name	551227f3bb	fix(ai): convert legacy manual gates to controlled automation [CI: some checks failed; CD Pipeline / tests successful in 1m45s; CD Pipeline / build-and-deploy successful in 4m55s; CD Pipeline / post-deploy-checks successful in 1m58s; Code Review / ai-code-review cancelled; Ansible / Reboot Recovery Contract / validate cancelled]	2026-06-27 19:35:41 +08:00
Your Name	ee2cc2bfc3	fix(alerts): consolidate Telegram alerts into the SRE war room [CI: some checks failed; CD Pipeline / tests failing after 1m23s; CD Pipeline / build-and-deploy skipped; CD Pipeline / post-deploy-checks skipped; Code Review / ai-code-review successful in 15s]	2026-06-12 11:06:16 +08:00
Your Name	32e4beca06	fix(api): connect approval execution truth chain [CI: all checks successful; CD Pipeline / tests successful in 1m27s; Code Review / ai-code-review successful in 14s; CD Pipeline / build-and-deploy successful in 4m24s; CD Pipeline / post-deploy-checks successful in 1m29s]	2026-06-11 13:03:54 +08:00
Your Name	16775bb4fa	feat(adr100): bridge playbook authoring approvals [CI: all checks successful; CD Pipeline / tests successful in 1m20s; Code Review / ai-code-review successful in 13s; CD Pipeline / build-and-deploy successful in 7m44s; CD Pipeline / post-deploy-checks successful in 2m49s]	2026-06-01 20:49:28 +08:00
Your Name	356e4d41cc	fix(telegram): link incident truth chain from alerts [CI: some checks failed; CD Pipeline / tests successful in 1m35s; Code Review / ai-code-review successful in 12s; CD Pipeline / build-and-deploy successful in 4m37s; CD Pipeline / post-deploy-checks cancelled]	2026-05-31 23:08:01 +08:00
Your Name	a21f94ced1	fix(alerts): clarify execution result verdict [CI: some checks failed; CD Pipeline / tests successful in 1m17s; Code Review / ai-code-review successful in 12s; CD Pipeline / build-and-deploy successful in 4m11s; CD Pipeline / post-deploy-checks cancelled]	2026-05-31 17:28:55 +08:00
Your Name	3d8b395032	fix(alerts): complete the disposition-result and manual-notification contract [CI: some checks failed; CD Pipeline / tests failing after 45s; CD Pipeline / build-and-deploy skipped; CD Pipeline / post-deploy-checks skipped; Code Review / ai-code-review successful in 12s]	2026-05-31 15:46:07 +08:00
Your Name	e2ab879636	fix(alerts): correct telegram execution truth [CI: some checks failed; CD Pipeline / tests failing after 52s; CD Pipeline / build-and-deploy skipped; CD Pipeline / post-deploy-checks skipped; Code Review / ai-code-review successful in 11s]	2026-05-31 13:58:39 +08:00
Your Name	596f2f6820	fix(awooop): link auto approved execution evidence [CI: all checks successful; Code Review / ai-code-review successful in 11s; CD Pipeline / tests successful in 1m17s; CD Pipeline / build-and-deploy successful in 3m42s; CD Pipeline / post-deploy-checks successful in 1m21s]	2026-05-13 19:14:17 +08:00
Your Name	34bfe56f53	fix(awooop): persist approved ssh gateway blocks [CI: all checks successful; Code Review / ai-code-review successful in 20s; CD Pipeline / tests successful in 3m58s; CD Pipeline / build-and-deploy successful in 3m47s; CD Pipeline / post-deploy-checks successful in 1m18s]	2026-05-13 11:15:54 +08:00
Your Name	a0a2a5b1f0	feat(awooop): gate approved ssh execution [CI: all checks successful; Code Review / ai-code-review successful in 10s; run-migration / migrate successful in 9s; CD Pipeline / tests successful in 1m22s; CD Pipeline / build-and-deploy successful in 6m36s; CD Pipeline / post-deploy-checks successful in 1m42s]	2026-05-13 11:02:24 +08:00
Your Name	80c36ba801	fix(incident): F2 NO_ACTION triggers resolve_incident + idempotency guard [CI: all checks successful; Code Review / ai-code-review successful in 11s; CD Pipeline / tests successful in 1m9s; CD Pipeline / build-and-deploy successful in 3m29s; CD Pipeline / post-deploy-checks successful in 1m30s] [Root cause] INC-20260507-99ADF2: the flywheel stalled, with 566+ stuck incidents (growing by 1 every 30 seconds). Core cause: the NO_ACTION path (approval_execution.py:251) returned True early and skipped the existing resolve_incident call at lines 482-495, so incidents stayed in INVESTIGATING forever. [Fix] - approval_execution.py: the NO_ACTION branch now calls resolve_incident and logs success/failure; the background log adds path="no_action" so the fix's effectiveness can be quantified in prod (debugger full-chain analysis + critic 1st/2nd review must-fix #1). - incident_service.py: resolve_incident adds a RESOLVED idempotency guard at line 1106, ahead of all side effects (status mutation / Redis / DB / postmortem / KB / KM / disposition); this also fixes a latent legacy risk where the success path at lines 482-495 re-triggered the postmortem (critic must-fix #2). [Respecting the Codex 5/6 design (feedback_respect_codex_design_intent.md)] - Does not touch flywheel_stats_service.py / heartbeat_report_service.py / auto_repair_service.record_auto_repair() / metrics_repository UPPER(status). - resolve_incident does not write to the auto_repair_executions table (the Codex 5/6 source of truth) and does not pollute the 24h KPI calculation. [Test coverage] - test_approval_execution_no_action.py: NO_ACTION → resolve is called once + True is still returned when resolve raises (NO_ACTION must not degrade to False because resolve failed, otherwise it pollutes the auto_execute KPI; see the contract in the comment at lines 207-208). - test_incident_service_resolve_idempotency.py: RESOLVED → return existing + save_to_working_memory is not called; not_found → return None. [Acceptance criteria (24h after deployment)] 1. Under grep `path="no_action"`, the count of incident_resolved_after_no_action_execution vs. the count of background_execution_noop must be 1:1 for the fix to count as successful. 2. awoooi_flywheel_incidents_stuck flattens out instead of growing by 1 every 30 seconds. 3. If more than 20 NO_ACTION postmortems flood the SRE group within 24h, trigger a follow-up to evaluate skipping the postmortem for resolution_type="no_action" (critic Minor #3, option B). Refs: INC-20260507-99ADF2, debugger root cause #1 (chain A), critic 1st must-fix #1 #2, critic 2nd must-fix #1 #2 #3 Co-Authored-By: Codex (aider) <noreply@anthropic.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-07 18:55:58 +08:00
Your Name	ecc65be6e1	fix(telegram): route direct sends through gateway [CI: all checks successful; Code Review / ai-code-review successful in 13s; CD Pipeline / tests successful in 1m10s; CD Pipeline / build-and-deploy successful in 3m29s; CD Pipeline / post-deploy-checks successful in 1m20s]	2026-05-07 00:33:27 +08:00
Your Name	a7a9ba996d	fix(mcp): audit approved ssh execution path [CI: all checks successful; Code Review / ai-code-review successful in 11s; CD Pipeline / tests successful in 1m5s; CD Pipeline / build-and-deploy successful in 3m45s; CD Pipeline / post-deploy-checks successful in 1m20s]	2026-05-06 16:34:39 +08:00
Your Name	607358c4dd	fix(approval): route SSH actions through SSHProvider on manual approve parse_operation_from_action only knew kubectl and Chinese restart phrases, so any "ssh host '...'" action approved via Telegram fell through to "Could not parse operation type" and reported a fake failure even though the LLM had proposed a valid host repair. Adds OperationType.SSH_HOST, makes the parser detect ssh prefixes (with optional flags / user@host) before kubectl patterns, and routes the SSH_HOST branch in approval_execution.execute_in_background through SSHProvider with the same tool keywords decision_manager uses (ssh_docker_prune / ssh_docker_restart / ssh_systemctl_restart / ssh_diagnose). Unroutable SSH actions now fail loudly with a descriptive error instead of silently breaking. Trigger: 2026-05-02 incidents INC-20260502-D6D0B7 / E12EE4 / 557055 were approved by the user but executor reported "Could not parse" and left the alerts pending. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 12:31:37 +08:00
Your Name	61f5a6a419	fix(telegram): route alerts to SRE war room [CI: some checks failed; CD Pipeline / tests cancelled; CD Pipeline / build-and-deploy cancelled; CD Pipeline / post-deploy-checks cancelled; Code Review / ai-code-review cancelled]	2026-04-30 15:01:23 +08:00
Your Name	4a57c2d04f	feat(flywheel): expose incident processing timeline [CI: all checks successful; CD Pipeline / build-and-deploy successful in 10m56s]	2026-04-29 23:38:30 +08:00
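Your Name	c22e5f334e	feat(km): P1-1 unified KMWriter contract + 5 callers switched + M4 reverse-lookup chain completed. A 12-Agent panoramic diagnosis found that the KM write chain had 5 entry points with no unified contract, and that fire-and-forget writes lost entries on Pod recycle. This commit extracts KMWriter to enforce 7 contract rules. ## 7 enforced contract rules 1. Synchronous baseline: mandatory await asyncio.wait_for(timeout) 2. Retry: 3 attempts with exponential backoff 1s/2s/4s (OperationalError / network-type exceptions) 3. Failure recovery: after 3 attempts, write to the Redis DLQ km:dlq + log 4. Observability: structlog event + reserved metric hook (emitter to be added in P1-3) 5. Idempotency: incident_id + path_type is the unique key 6. No swallowed exceptions: except must log + raise/DLQ 7. M4 reverse-lookup chain: when the payload contains approval_id, automatically fill related_approval_id and backfill Path A ## Caller switch (5 entry points, one unified interface) - incident_service.py:1086 Path A (KB extractor + km_conversion) - approval_execution.py:771 Path B-manual - decision_manager.py:2178 Path B-auto success (eliminates the cross-class private method call, M1) - decision_manager.py:2200 Path B-auto failure (fixes B2 early exception swallowing) - playbook_service.py:210 PlaybookKM (the third path that both T0 reports missed) ## M4 reverse-lookup chain completed - knowledge.py + models.py: add the related_approval_id ORM field - Aligned with the schema at phase26_incident_km_integration.sql:20 (partial index already exists) - approval↔KM bidirectional reverse-lookup chain complete (dual-path seam) ## Feature Flag (rollback insurance) - KM_WRITE_AWAIT=true (default): await + timeout + DLQ enforced - KM_WRITE_AWAIT=false: fire-and-forget (old behavior) ## Tests - apps/api/tests/test_km_writer.py: 18 tests all green, covering success / timeout / retry / DLQ / idempotency / KMWriteError / on_failure=raise / reverse-lookup backfill - 1552 unit tests all green (no regressions) ## Acceptance Flywheel closed-loop core: KM writes are no longer silently lost, and the AI learning chain no longer breaks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 10:44:39 +08:00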
Your Name	123d9c8a2e	fix(p3.1-t1): three Tier-1 service integration: model_rollback_service + resource_resolver [CI: some checks failed; CD Pipeline / build-and-deploy cancelled] P3.1-T1 wires two existing services into the main flow: offline_replay_service.py, model_rollback_service integration: - After replay events are written to the governance DB, trigger ModelRollbackService.check() degradation detection - The feature flag is evaluated by model_rollback_service itself (AIOPS_P6_GOVERNANCE_ENABLED) - retrain_recommended → log warning including streak / absolute_floor / conservative_mode - exception fail-soft (does not block the main replay flow) approval_execution.py, resource_resolver integration: - After the kubectl command is parsed, dynamically verify that the resource exists in K8s - If resolved_name != raw_name → log + apply normalized name - If it does not exist but there are candidates → log warning + suggestions (execution is not blocked, only recorded) - exception fail-soft (does not block the main flow) - RESOURCE_RESOLVE_TOTAL Prometheus counter labels: hit/suggestion/miss/error Tests: backend 1303 collected (no regressions); the corresponding dedicated tests were written in the previous commit Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: Claude Sonnet 4.6 (P3.1-T1) <noreply@anthropic.com>	2026-04-27 08:17:04 +08:00
Your Name	32affaffeb	fix(critic-hotfix): 4 fixes for critic BLOCKER + HIGH findings (CD blocked + flywheel idling) [CI: some checks pending; CD Pipeline / build-and-deploy has started running] A full Critic review of 6 commits found the following. CD blocker fix: - test_ai_router_failover_integration.py: 3 tests now use patch.object to mock _select_provider_and_model directly and force the initial provider to OLLAMA. The original IntentType.UNKNOWN mock was still reclassified inside the router as DIAGNOSE → openclaw_nemo, so failover never triggered. → 5/5 PASSED BLOCKER B1: Gitea Telegram notifications could never be sent: - apps/api/src/api/v1/gitea_webhook.py:399 redis = await get_redis() → redis = get_redis(). The original await raised a TypeError that the outer except swallowed → Task C PR merged + workflow_run failure notifications all failed (the green CI was an illusion; the test only checked HTTP 202, not actual delivery) BLOCKER B2: the P1.3+P1.4 learning-chain closed loop was idling (same bug in two places): - apps/api/src/api/v1/webhooks.py:261 - apps/api/src/services/approval_execution.py:771 (pre-existing) EvidenceSnapshot.get_latest_snapshot(...) is a module-level async function, not a classmethod → the AttributeError was swallowed by except and downgraded to a warning → the flywheel loop looked connected but actually ran empty (the feature flag defaults to off, so it had not blown up yet) HIGH H3: main.py lifespan ordering race: - apps/api/src/main.py: configure_alerter() moved before _recovery_svc.start(). Original order: start() triggers an immediate-check → may call alert_recovery, but the alerter has not yet had Redis injected → dedup fail-open, risk of duplicate alerts. HIGH H1: Gemini quota dedup swallowed alerts across days: - apps/api/src/services/failover_alerter.py:89 the dedup key gains a :{YYYY-MM-DD} suffix, giving each day an independent dedup window. Previously, an alert triggered at 22:00 yesterday and again at 21:30 today was swallowed because the dedup entry had not yet expired. Tests: 14 passed (failover_alerter + ai_router_failover_integration + lifespan_wiring) Deferred follow-up: - H2: proactive_inspector memory metric rename + baseline cleanup - H4: probe_success NaN fallback - M1-M4 / S1-S2: see the critic report Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 20:39:53 +08:00
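Your Name	f9f2263c00	fix(execution-feedback): fix the three-layer P0 failure that completely broke the automation feedback chain [CI: all checks successful; CD Pipeline / build-and-deploy successful in 8m57s] Background: users reported that the execution status stayed stuck at "⚡ Executing..." and never reported back, which completely paralyzed the auto-repair mechanism (after the confidence fix, executions failed but no Telegram card notification could be pushed). L1: Post-verify AttributeError (2 places) - approval_execution.py:757, 1010 called the nonexistent method IncidentService.get_incident() - Correct method: get_from_working_memory() with fallback to get_from_episodic_memory() - Impact: the post-verify logic was silently swallowed by the exception, and the downstream Telegram push was completely blocked L2: Notification Provider not configured - Added notifications/telegram.py: reuses the existing TelegramGateway.send_notification() - Modified manager.py: registers TelegramWebhookProvider at initialization - Impact: no provider sent a push after execution completed, so results never appeared in Telegram L3: Solver Agent semantic synthesis generated incomplete commands - Old logic: action_title="重啟服務" ("restart service") → synthesized "kubectl rollout restart deployment -n awoooi-prod" (missing the name) - The downstream operation_parser could not parse it (the regex requires deployment/<name>) - Fix: prefer extracting the target field from parsed; if there is no name, return [] and degrade to read-only investigation commands - All tests pass: 35/35, including 11 new security tests. Verification: - A blocked malicious kubectl_command now correctly falls through to the semantic synthesis path - With no target name, an empty list is returned and no incomplete command is generated - The Telegram execution-result push chain is complete. Expected effect: - Execution failure → a "❌ Execution failed" Telegram card arrives immediately (L1 + L2 fixes) - Automated decisions follow the allowlist and avoid generating commands that cannot be executed (L3 fix) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 03:29:38 +08:00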
Your Name	04ff22563e	fix(aiops-p1): fix all 5 breakpoints in the Playbook learning loop + DB migration (ADR-092 B4) [CI: some checks failed; run-migration / migrate failing after 14s; CD Pipeline / build-and-deploy failing after 2m7s] [P0.4 patch] pre_decision_investigator Prometheus query field missing - _build_tool_params() adds the "query" field (a required parameter of the prometheus_query tool) - Added _build_prometheus_query(), which generates PromQL by alert type (CPU/Memory/Crash/Disk/HTTP/Pod/fallback) - After the fix, the D3_METRICS sensing dimension actually retrieves data (previously it returned missing_query_parameter 100% of the time) [P1 Playbook learning loop: B1-B5 all fixed] - B2 db/models.py: ApprovalRecord gains the matched_playbook_id field + ix_approval_matched_playbook index - B2 db/models.py: TimelineEvent gains the incident_id field (for MCP auditing) + index - B3 approval_db.py: record→ApprovalRequest restores incident_id + matched_playbook_id - B4 approval_repository.py: same as B3 (the two conversion functions must stay in sync) - B5 approval_db.py: approval_request_to_record_data adds matched_playbook_id → only then can the DB store the value [P1.5 KM write] approval_execution.py: fire-and-forget → await wait_for(30s) - Root cause: an asyncio.create_task task is killed on Pod recycle, so KM writes are silently lost - Fix: await asyncio.wait_for(..., timeout=30.0) + TimeoutError log [Migration file] adr092_p1_learning_chain_fix.sql - ALTER TABLE approval_records ADD COLUMN matched_playbook_id VARCHAR(36) - ALTER TABLE timeline_events ADD COLUMN incident_id VARCHAR(64) - Run: psql $DATABASE_URL -f apps/api/migrations/adr092_p1_learning_chain_fix.sql [Accompanying Agent changes] - decision_manager: Phase 2 YAML NO_ACTION priority gate (host-layer/external services skip Agent Debate) - alert_rules.yaml: new rules for Sentry/ClickHouse + HostDiskUsageHigh/Critical - solver_agent: action_title semantic-synthesis fallback (replaces silent discard) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 15:41:35 +08:00
Your Name	7f4088bcd0	fix(aiops-p0): comprehensive P0 fix for six root causes (ADR-092 B4) [P0.1] knowledge_extractor_service.py:210: AttributeError fix - The Signal.description field does not exist (100% failure; root cause of KM +5 per day) - Now concatenates alert_name + annotations.summary as the text [P0.2+P0.3] Gate 9+11 loosened for read-only commands - blast_radius_calculator: kubectl get/top/describe/logs/version → score=1 (not 50) - operation_parser: recognizes the INVESTIGATE type (read-only kubectl no longer returns None) - executor.py: OperationType gains the INVESTIGATE enum - approval_execution.py: the INVESTIGATE path calls execute_kubectl_command directly [P0.4] MCP SSH/K8s Provider fixes - decision_manager: params= → parameters= (matches the MCPToolProvider.execute signature) - decision_manager: MCPToolResult .get() → .success/.output (dataclass usage) - decision_manager + ssh_provider: add hosts 120/121 (missing from the original default) - auto_approve: the phase2_agent_debate source bypasses the confidence threshold [P0.5] Alert rule semantic contradiction fix - alert_rules.yaml: 8 kubectl query rules RESTART_DEPLOYMENT → NO_ACTION (CrashLoopBackOff/PostgreSQL connections/slow queries/MinIO disk/K3s nodes/alert chain/SSL/CoreDNS, etc.) - incident_service.py: cAdvisor/CoreDNS split out of general into their own categories [P0.6] proactive_inspector dynamic-baseline PromQL fully fixed - All 5 MONITORED_METRICS PromQL queries corrected (cadvisor label/datname/blackbox) - db_connection_pool: datname="awoooi" → "awoooi_prod" - http_error_rate: invalid http_requests_total → blackbox probe_success - cpu/memory: namespace label → name=~"k8s_api_awoooi-api.*" Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 15:32:23 +08:00
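Your Name	54d60d04f5	feat(drift+target): P0.1+P0.2+P0.3 three fixes: drift pagination and classification + AI recommendation + target tracing. Commander's three-question decision: do all of it; the AI recommendation 0.85 threshold is display-only, not automatic; check aol first, then fix ## RCA: source of the awoooi-service failures - /api/v1/aiops/kpi shows 1 playbook_executed actor=approval_execution status=failed entry in the past 24h - grep codebase: no code hard-codes awoooi-service (only historical comments) - Most likely source: alert_rule_engine._extract_vars takes the value of labels.service as the Deployment name - cf5050c/4f2e122 (2026-04-18) already fixed the two NEMOTRON hallucination paths; this commit fixes the third path ## Fixes ### P0.3a alert_rule_engine._extract_vars - labels.service downgrade: a value ending in -service has the suffix stripped first and is treated as the base name - match_rule's return value gains a target_source field for tracing - If awoooi-service recurs, the source is directly visible (label.service(stripped), etc.) ### P0.3c approval_execution._log_aol_started.input - Adds the parsed_target/operation/namespace fields - Future aol lookups of failed entries can read the target directly, without guesswork ### P0.1 telegram_gateway._send_drift_diff_detail - Pagination (10 items/page) replaces flooding the chat with 30 items at once - Header counts in 3 buckets: manual high-risk / general modification / K8s automatic - ⬅️/➡️ paging buttons at the bottom (callback: drift_view_page:{report_id}_{page}) - security_interceptor INFO_ACTIONS allowlist gains drift_view_page ### P0.2 drift_narrator recommendation - The LLM prompt gains a recommendation field (action/confidence/reason) - action ∈ {adopt, revert, ignore, investigate} - The top of the card shows "🎯 AI recommendation: ⏪ Roll back (85%) - reason" - On LLM failure, falls back to _fallback_recommendation (rule-based, mapped from intent) - Card diff_summary limit 500 → 1500 characters to fit the recommendation + narrative + items - Commander's directive: display only, no automatic execution (the 0.85 threshold is reserved for the future) ## Verification - All 90 pytest tests pass (drift + rule_engine + approval_execution) - AST syntax check passes for 5 files ## Next acceptance 1. Next drift trigger → the card shows "🎯 AI recommendation" at the top 2. Pressing drift_view → 3-bucket header + ⬅️/➡️ 3. If awoooi-service recurs → look it up directly in automation_operation_log.input.parsed_target Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 04:04:13 +08:00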
OG T	e7ba8cb181	fix(aiops): connect the AI self-learning chain: verifier switched to await + aol action feedback [CI: some checks failed; CD Pipeline / build-and-deploy failing after 7m29s] The Commander's 2026-04-19 panoramic audit found: - automation_operation_log: 22 rows (all drift_narrator); of 33 approval actions in 7d, 0 were fed back - incident_evidence.verification_result: 1212 rows, 100% NULL; the verifier never wrote a result - Root cause: _run_post_execution_verify used asyncio.create_task fire-and-forget; the task was killed on Pod recycle, so verification_result could never be written. Fix (connects the full verifier→learning→Playbook EWMA→finetune chain): approval_execution.py: + _log_aol_started: INSERT aol(playbook_executed, pending) when the main flow starts + _log_aol_completed: the 4 return points UPDATE aol to success/failed + duration + stderr └ NO_ACTION / parse_fail / K8s success / K8s failure all leave a trace ~ _run_post_execution_verify in both places (success + failure path) changed from create_task to await + 60s timeout + on failure, stderr_feed_back is written into result.error → unblocks the E6 stderr feedback loop declarative_remediation.py: ~ the _log_remediation_event task is now named and has add_done_callback, so task failures are logged (the original fire-and-forget wrote 0 rows; now it is possible to diagnose why the task died) Expected effect: - aol playbook_executed visible in real time (33 actions/7d immediately have data) - incident_evidence.verification_result starts accumulating → the finetune_exporter 7d cron finally has material - Playbook EWMA trust_score starts changing dynamically - stderr_feed_back connected → failure signals feed back into retry/Playbook negative reinforcement. No impact: - background_task runs in the background; the +60s delay does not block the API - aol write failures only call logger.warning and do not block the main execution flow Refs: MASTER §3.1 L6×D1 (ADR-081 PostExecutionVerifier), MASTER §3.4 D4 (ADR-083 learning loop), ADR-090 monitoring blind-spot governance (2026-04-18 panoramic audit) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 12:07:29 +08:00
OG T	4b8be32610	fix(telegram+approval): TG-1 + AP-1/2/3: 4 Telegram UX fixes [CI: some checks failed; CD Pipeline / build-and-deploy failing after 25m27s; Ansible Lint / lint cancelled] 2026-04-19 early morning (Taipei time), ogt + Claude Opus 4.7 (1M) ## TG-1: INFO_ACTIONS gains view. security_interceptor.py: the 'view' button now uses the 2-part read format and no longer mistakenly triggers the 4-part nonce write format. ## AP-1: persist approval_records.telegram_message_id. After telegram_gateway.send_approval_card sends successfully, it runs UPDATE approval_records SET telegram_message_id, telegram_chat_id at the DB layer (not just Redis, so the original card can still be found after a Pod restart). ## AP-2: edit the original card when approval execution completes + KM/Playbook increments. approval_execution._push_execution_result_to_alert, besides replying to the original card, also calls editMessageReplyMarkup to remove the buttons (fixes the "forever executing" card problem). - Also queries knowledge_entries/playbooks for increments within the last 2 min and appends "📚 KM +N 🎯 Playbook updates×M" to the message - Success: ✅ Execution succeeded + action + KM increment - Failure: ❌ Execution failed + reason + KM increment ## AP-3: normalize primary_responsibility to lower the share of "❓ Unknown". openclaw._parse_analysis_result: if the LLM returns empty/None/a value outside the allowlist (FE/BE/INFRA/DB/COLLAB), force a fallback: INFRA if kubectl keywords are present, otherwise BE. Previously only "not in data" was checked, so None or an empty string slipped through. ## Skipped: TG-3 (refactor) + TG-5 (webhook is a deprecated endpoint; the design uses Long Polling) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 01:15:58 +08:00
OG T	fdce0a3ab9	fix(approval): NO_ACTION no longer mislabeled as EXECUTION_FAILED (fixes MASTER §7.1 #11) [CI: some checks failed; CD Pipeline / build-and-deploy cancelled] 2026-04-19 early morning (Taipei time), ogt + Claude Opus 4.7 (1M) Root cause: approval.action='NO_ACTION - 待分析' ("pending analysis", a product of the hallucination validator's downgrade) was passed to parse_operation_from_action → operation_type=None → background_execution_skip → update_execution_status(success=False) → marked as EXECUTION_FAILED. KPI pollution: MASTER §7.1 #11 auto_execute success rate = EXECUTION_SUCCESS / (SUCCESS+FAILED). NO_ACTION should never count as a failure, but it was counted and dragged the metric down. A large share of the measured 30d success rate of 0.9% was caused by mislabeled NO_ACTION. Fix: when parsing fails, first check whether the action is NO_ACTION-like (the action contains keywords such as NO_ACTION/OBSERVE/INVESTIGATE) → take a dedicated noop branch: - log event=background_execution_noop (info level) - update_execution_status(success=True) → EXECUTION_SUCCESS - timeline marked ✅ observation-only action completed - reply to the original alert card showing success - return True. Genuine parse failures (not NO_ACTION) keep the original failure path but now carry an error_message (extension of P0.2), so rejection_reason reads "Could not parse operation type from action: <action>" instead of an empty string. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 01:08:16 +08:00
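OG T	6ad73b4834	fix(flywheel): three fixes for the L5/L6 broken chain: RBAC expansion + failure reasons persisted + verifier also runs on failure [CI: all checks successful; CD Pipeline / build-and-deploy successful in 11m6s] 2026-04-18 evening (Taipei time), ogt + Claude Opus 4.7 (1M) The panoramic flywheel diagnosis exposed 3 real breaks: - L5 execution, 30d: EXECUTION_FAILED 216 / EXECUTION_SUCCESS 2 (99% failure rate) - L6 verification, 7d: verification_result all NULL (988 evidence rows never verified) - All rejection_reason / error_message fields empty (impossible to diagnose) Root cause: the awoooi-executor ServiceAccount had insufficient RBAC. Every kubectl get nodes/HPA in executor.py was Forbidden, so not even evidence could be gathered, every subsequent repair failed, the verifier never triggered because no execution succeeded, and the evidence verification result stayed NULL. One RBAC fix resolves 3 nodes. ## P0.1 RBAC expansion (k8s/awoooi-prod/07-rbac.yaml) Adds cluster-scope read permissions (list/get/watch only, zero writes): - nodes + nodes/status (required for evidence gathering) - horizontalpodautoscalers (HPA status) - metrics.k8s.io: nodes + pods (resource metrics) - statefulsets + daemonsets (complete workload view) Already applied with kubectl apply and smoke-tested: kubectl get nodes works. ## P0.2 Always write rejection_reason on failure (approval_db.py) update_execution_status() gains an error_message parameter, which is written to rejection_reason on failure (truncated to 2000 characters) → later diagnosis has something to go on. The approval_execution.py call sites are updated accordingly, and result.error is passed all the way into the DB. ## P0.3 Verifier also runs on failure (approval_execution.py) Old logic: the verifier was called only when result.success=True → with 99% failures it never ran. New logic: the failure path also runs the verifier via create_task, with a ":FAILED" suffix appended to action_taken. The verifier captures post_state and writes verification_result='failed' back to incident_evidence. L7 learning now has failure samples to learn from, and the 2x negative decay of playbook trust actually takes effect. Expected effect: - The 30d EXECUTION_FAILED rate should drop from 99% to <30% - The incident_evidence.verification_result NULL rate should drop from 100% to <10% - The approval_records.rejection_reason fill rate goes from 0% to 100% Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 20:12:57 +08:00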
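OG T	cf50a5ce25	fix(solver+execution): Checkpoint-1 false-success fix + Checkpoint-2 K8s environment awareness [CI: all checks successful; CD Pipeline / build-and-deploy successful in 10m55s] ## Checkpoint-1: root fix for false success - approval_execution.py: execute_approved_action now returns bool (it previously returned None, so callers could not tell whether K8s accepted the command) - decision_manager.py auto-execute path: uses _exec_success instead of the hard-coded success=True. Fix: when K8s rejects a command, ❌ is sent correctly instead of "✅ Auto-repair complete" ## Checkpoint-2: K8s environment awareness (Inventory Pre-flight) - solver_agent.py: adds _fetch_k8s_inventory(): before generating kubectl commands, it first runs kubectl get deployments,statefulsets -n awoooi-prod and injects the list of real names into the Solver prompt; the LLM must choose from the list, which prevents hallucinations (awooiii-api with three i's) - On a 5s timeout or failure → returns "", and the prompt shows a warning without interrupting the main flow Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 23:08:23 +08:00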
OG T	85c4e3b434	fix(km): fix the root cause of all KM writes being unknown (three nodes) [CI: some checks pending; CD Pipeline / build-and-deploy has started running] Bug-A: approval_execution.py called the nonexistent get_incident() → the AttributeError was swallowed by except → alertname/alert_category/affected_services all used default values. Fix: use the dual path get_from_working_memory() + get_from_episodic_memory(). Bug-B: _record_to_incident() omitted the notification_type + alert_category fields when restoring an Incident from PG → km_conversion read None. Fix: restore these two fields. Bug-C: main.py working_memory_warmup likewise omitted notification_type + alert_category when rebuilding Incidents. Fix: add them there as well. 2026-04-15 Claude Sonnet 4.6 Asia/Taipei	2026-04-15 22:28:48 +08:00
OG T	fb1bbd0e20	feat(Phase 3): complete the learning loop: Root cause 3 + diagnosis feedback + knowledge forgetting + fine-tune pipeline [CI: some checks failed; CD Pipeline / build-and-deploy cancelled] - approval_execution.py: _run_post_execution_verify() now calls record_verification_result(). Root cause 3 closed: environment verification results (success/degraded/failed/timeout) are no longer isolated - learning_service.py: adds record_verification_result(): verification result → Redis + Playbook EWMA - learning_service.py: adds record_diagnosis_outcome(): writes back negative signals for misdiagnoses (L3×D4) - jobs/knowledge_decay_job.py: new 30d knowledge-forgetting job (unreferenced draft/review → archived) - services/finetune_exporter.py: new weekly JSONL export (EvidenceSnapshot × AgentSession) - main.py: mounts knowledge_decay_loop (24h) + finetune_export_loop (7d) - MASTER §8: all Phase 3 core changes recorded as landed Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 20:57:43 +08:00
OG T	7da64eaad2	feat(Phase 3): rebuild the learning loop: three root-cause fixes + 2x EWMA + Evolver Agent [CI: some checks failed; CD Pipeline / build-and-deploy failing after 19m7s; Type Sync Check / check-type-sync failing after 1m18s] ADR-083 Phase 3 learning loop rebuild. Three root-cause fixes - approval_execution.py: fire-and-forget create_task → await asyncio.wait_for(timeout=30) × 2 (success path L265 + failure path L353; a timeout records the learning_trigger_timeout metric and the main flow does not crash) - models/approval.py: ApprovalRequestBase gains the matched_playbook_id field - decision_manager.py: _auto_execute fills matched_playbook_id when creating the ApprovalRequest - learning_service.py: dual-path lookup of _matched_pb_id (matched_playbook_id + metadata fallback) 2x EWMA negative reinforcement - models/playbook.py: adds trust_score: float = 0.3 (EWMA dynamic trust field) - repositories/playbook_repository.py: update_stats adds EWMA. Success: trust = 0.9 × old + 0.1 × 1.0 Failure: trust = 0.8 × old + 0.2 × 0.0 (decays 2x as fast) trust < 0.1 → log warning, wait for the Evolver to archive it Evolver Agent (new) - services/playbook_evolver.py: three functions, all static rules 1. Low-trust archiving: trust < 0.1 → DEPRECATED 2. Dormancy archiving: unused for 30d AND trust < 0.5 → DEPRECATED 3. Similarity merging: symptom Jaccard > 0.9 → keep the higher-trust one, archive the lower-trust one. AIOPS_P3_EVOLVER_ENABLED=False, off by default Documentation - ADR-083 learning loop rebuild - MASTER §8 Phase 3 completion record AIOPS_P3_ENABLED=False (default); the skeleton is in place, awaiting the Commander's approval to enable it Co-Authored-By: Claude Sonnet 4.6 (APAC) <noreply@anthropic.com>	2026-04-15 14:01:37 +08:00
OG T	f1cbf6db7d	feat(adr-081): Phase 1 sensory depth: 8D intelligence gathering + post-execution verification. Deliverables: - IncidentEvidence DB model (8D sensing + pre/post execution state) - EvidenceSnapshot dataclass (build_summary → LLM context) - SanitizationService (Prompt Injection 0-tolerance, 12 patterns) - MCPToolRegistry (dynamic tool registration; suggest_tools does not hard-code alert types) - PreDecisionInvestigator (8D parallel sensing, P99 < 8s, Redis 30s cache) - PostExecutionVerifier (warmup 10s → post-state evaluation success/degraded/failed) - decision_manager + approval_execution wiring (guarded by a feature flag) Gate 1 fixes: D4/D5/D7/D8 gain sanitize_dict_values; removed the bare "error" failure signal to prevent false matches on the error_rate key; evidence_snapshot warns on zero rowcount. Tests: 130 passed (+111 new) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 13:08:38 +08:00
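OG T	72dd0c5875	fix: Telegram sign-off gate + execution-result reply: close the manual review loop [CI: all checks successful; CD Pipeline / build-and-deploy successful in 14m7s] 3 fixes (found in the Commander's inspection): 1. telegram_gateway.py:4890: gate changed from execution_triggered to approval.status==APPROVED - The original gate relied on an optimistic-lock flag that failed under races (REST + Telegram signing off at the same time) - Aligned with the REST API path at approvals.py:360 - Adds a Redis lock exec:{approval_id} with a 60s TTL to prevent re-entry 2. telegram_gateway.py:4772: removed the misleading "👀 Waiting for execution" text - After approval the card always shows "⚡ Executing...", and the actual result is supplied by the reply in #3 3. approval_execution.py: adds _push_execution_result_to_alert() - Fire-and-forget calls in both the success and failure paths - Skipped when requested_by=="auto_approve" (avoids conflict with _push_auto_repair_result) - Looks up the original alert's message_id in Redis tg_msg:{incident_id} → reply_to - If no message_id is found, silently sends nothing and does not affect the main execution flow. Anti-breakage checks: - ✅ The auto-execution path is unaffected (skip via requested_by) - ✅ The Reject path is untouched - ✅ Redis lock prevents re-entry - ✅ All 132 regression tests pass Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-04-14 19:03:38 +08:00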
OG T	dedd7c2c17	feat(BP-1): refine KM extraction quality: distinguish automatic/manual + enrich alert metadata [CI: some checks failed; CD Pipeline / build-and-deploy cancelled] _write_execution_result_to_km() enhancements: - Distinguishes [Auto repair]/[Manual repair] by approval.requested_by - Extracts alertname / alert_category / affected_services from the associated Incident - Category changes from the hard-coded "execution_result" to the real alert_category - Tags: auto_executed/human_approved + success/failure + alert_category - Title includes alertname, improving RAG retrieval precision - created_by is marked auto_execute / approval_execution according to the mode. Verification (2026-04-14 DB query): - Existing KM entries are indeed being written (creator: approval_execution) - But the titles are all "[Execution record] ❌ kubectl rollout restart deployment/xxx" - Category is hard-coded to execution_result, and tags are only execution/execution_failed - After this change, KM entries will carry full context for the next RAG retrieval. Created: 2026-04-14 Taipei time, Claude Sonnet 4.6 (MASTER blueprint BP-1 B.1 refinement) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-04-14 15:48:02 +08:00
OG T	aae7c12645	feat(adr-076): Task 3.3: KM extraction for SSH repairs (completing the flywheel's two hands) [CI: some checks failed; CD Pipeline / build-and-deploy cancelled] Motivation: after a successful SSH MCP repair (docker restart/systemctl), KM could not learn from it because _extract_repair_steps only handled kubectl and the SSH path was missed entirely. approval_execution.py: - _trigger_playbook_extraction: after successful execution, writes approval.action into incident.outcome.learning_notes for the Playbook extractor to read playbook_service.py: - _parse_ssh_command(): new module-level function that parses the ssh [user@]host 'cmd' format - _extract_repair_steps(): step 2 extended with an SSH branch ssh ... → ActionType.SSH_COMMAND + host recorded kubectl ... → ActionType.KUBECTL (original logic kept) - _generate_name(): SSH repairs automatically get an [SSH] prefix - _extract_tags(): SSH repairs automatically get the ssh + host_layer tags test_playbook_ssh_extraction.py: 18 tests (100% passing) Flywheel two-hand alignment: kubectl path: decision_chain.reasoning_steps → KM ✅ (existing) SSH path: approval.action → learning_notes → KM ✅ (new in Task 3.3) Tests: 794 passed, 26 skipped, 0 failed Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-14 15:19:54 +08:00
OG T	684d6cfb43	feat(adr-076): all four Tactic B tasks complete: alert aggregation + retry + automatic reports [CI: all checks successful; CD Pipeline / build-and-deploy successful in 17m34s] Task 2: AlertGroupingService: Redis 5-minute sliding window to prevent alert storms - apps/api/src/services/alert_grouping_service.py (new) - webhooks.py integration: child alerts are short-circuited after fingerprint generation and before the LLM - Threshold=3, Graceful Degradation, 16 tests Task 3: approval_execution.py retry on execution failure - MAX_RETRY=2, RETRY_DELAY_SECONDS=30 - _is_transient_error() classifies transient vs. permanent errors; permanent errors are not retried - The timeline records retry progress and annotates the retry count on success, 29 tests Task 4: report_generation_service.py automatic reports - Daily inspection report: pushed to the Telegram SRE group every day at 08:00 Taipei time - Postmortem: triggered automatically when an Incident is resolved and its duration > 10 minutes - main.py lifespan mounts run_daily_report_loop(), 30 tests Tests: 600 → 675 passing (+75), 0 failed Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-04-14 14:39:14 +08:00
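OG T	f0e14136ca	fix(flywheel): patch four core breakpoints in the flywheel so the full flow actually connects end to end [CI: some checks failed; CD Pipeline / build-and-deploy cancelled] 1. incident_service.py: save_to_episodic_memory() now writes alertname/notification_type/alert_category → previously these 3 columns were always NULL in the DB, the LLM had no alertname, and every Playbook match failed 2. telegram_gateway.py: calls execute_approved_action() after Telegram approval → previously sign_approval() only changed the DB status; of 380 approvals, 0 actually executed a kubectl command 3. approval_execution.py: calls resolve_incident() after successful execution; webhooks.py: calls resolve_incident() after a successful auto-repair → previously Incidents stayed in INVESTIGATING forever, KM conversion never triggered, and Playbook=0 4. webhooks.py: TYPE-1 alerts are short-circuited and do not reach the LLM → previously Heartbeat/Backup/Info alerts still burned LLM tokens and produced junk repair suggestions Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-12 17:01:10 +08:00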
OG T	b7ea362efc	fix(api): Review #2 technical-debt cleanup: I1/S1/S2/S3 all fixed All checks were successful CD Pipeline / build-and-deploy (push) Successful in 12m13s Details I1: complete the error_type field - AnomalyCounter.derive_key_from_incident() now extracts reason/error_type from signal.labels, ensuring all four fields are populated S1: unify the signature-construction logic in three places - auto_repair_service._derive_anomaly_key() → delegates to derive_key_from_incident() - approval_execution._get_anomaly_key_from_approval() → same - incident_service.resolve_incident() B4 → same - removes 3 duplicated copies of the signature-construction code S2: Redis Pipeline batch queries - get_all_disposition_stats() changed from N+1 hgetall calls to 2 Pipelines - Pipeline 1: batch hgetall for all disposition keys - Pipeline 2: batch hget metadata (alert_name) - Redis round-trips drop from O(2N) to 2 S3: fix the get_incident AttributeError in auto_repair.py - get_incident() → get_from_working_memory() (pre-existing bug) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-04-07 13:13:42 +08:00
OG T	561bcb638b	fix(api): Sprint 4 chief architect Review P0 fixes: unified hash + modularization compliance P0-1: unify anomaly_key hash derivation - B1: add _derive_anomaly_key(), using AnomalyCounter.hash_signature() in place of symptoms.compute_hash() - B3/B4: namespace now uses signal.labels.get("namespace", ""), fixing getattr(signal, "namespace", ""), which always returned an empty string P0-2: Router-layer modularization compliance - C1/C2: encapsulate get_all_disposition_stats() in AnomalyCounter - the Router no longer accesses counter.redis directly - stats.py drops the unused days/stats parameters P1: get_frequency() populates the disposition fields - consistent with _record_anomaly_impl(), returns the full disposition statistics Chief architect score: 82/100 → all P0 items fixed Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-04-07 12:53:12 +08:00
OG T	9253281d46	feat(api): Sprint 4 Phase A+B: alert disposition statistics, data layer + write layer Phase A: data layer - A1: IncidentFrequencyStats gains 4 fields (human_approved/manual_resolved/cold_start_trust/total_resolution) - A2: AnomalyCounter.record_disposition(): atomic increment via Redis HINCRBY - A3: get_disposition_stats(): HGETALL returns the disposition distribution - AnomalyFrequency dataclass extended, to_dict() kept in sync - _record_anomaly_impl() integrates disposition stats Phase B: wiring the write-layer trigger points - B1: auto-repair succeeds → record_disposition("auto_repair") - B2: cold-start trust succeeds → record_disposition("cold_start_trust") - AutoRepairDecision gains an is_cold_start flag - execute_auto_repair() receives it and distinguishes the disposition type - B3: human-approved execution succeeds → record_disposition("human_approved") - adds the _get_anomaly_key_from_approval() helper - B4: manual-handling inference → resolve_incident() decides by elimination - if resolved and there is no auto/human/cold_start record → manual_resolved Safety design: every disposition record is wrapped in try/except, so a failure does not block the main flow Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-04-07 11:54:46 +08:00
OG T	658337ec18	fix(phase26): connect the full Incident→DB→KM chain + namespace fix Some checks failed CD Pipeline / build-and-deploy (push) Failing after 1m29s Details Type Sync Check / check-type-sync (push) Failing after 52s Details Root causes: 1. create_incident_for_approval stored only to Redis, not PostgreSQL → the record vanished after the 7-day TTL, so Playbook extraction could never find the Incident 2. ApprovalRecord had no incident_id field → _trigger_playbook_extraction relied on a regex scanning Chinese text for INC-, which always failed 3. the operation_parser namespace fallback was "default" → all deployments are in awoooi-prod, so all 203 executions failed Fixes: - Incident is written to both Redis and PostgreSQL (save_to_episodic_memory) - ApprovalRecord gains an incident_id field (model + ORM + migration) - alertmanager_webhook writes incident_id back after creating the Approval - _trigger_playbook_extraction uses approval.incident_id directly - operation_parser DEFAULT_NAMESPACE = "awoooi-prod" Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-06 11:46:05 +08:00
OG T	df3ef9006c	fix(auto-repair): chief architect Review: 4 Critical/Important fixes All checks were successful CD Pipeline / build-and-deploy (push) Successful in 7m2s Details Critical #1: move the KM write task out of try/except - the KM write in _trigger_learning was inside the try, so KM was not written when learning failed - moved after the except so it is written on both success and failure - removed a redundant import asyncio (already imported at the top level) - Minor: approval.incident_id or None guards against an empty string Important #2: migration adds PRIMARY KEY - playbook_id promoted from UNIQUE to PRIMARY KEY - ALTER TABLE ADD PRIMARY KEY already run on the prod DB Important #3: s.sequence→s.step_number, s.description→s.command - embed_playbook() used nonexistent field names, so RAG vector indexing failed silently - correct RepairStep fields: step_number, command Important #1: PlaybookService._get_rag_service no longer caches at the Service layer - now calls the factory get_playbook_rag_service() on every call - prevents a stale instance from bypassing the factory's is_closed rebuild logic Cold-start fix (chief architect suggestions B+C): - after a successful execution, _trigger_playbook_extraction automatically sets execution_success=True, effectiveness_score=4, status=RESOLVED - skip path logger.debug → logger.info for better observability Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 12:02:03 +08:00
OG T	72d7536ead	feat(auto-repair): complete auto-repair closed loop + KM persistence wiring Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details 1. DB Migration: playbooks table (phase7_playbooks_table.sql) - this was the root cause of auto-repair failing to start: the table was never created - 5 indexes: status/tags/alert_names/source_incidents/created_at - already run on the prod DB 2. playbook_service: automatically persist to KM after extraction - fire-and-forget _write_to_km() after extract_from_incident() completes - content includes the symptom pattern, repair steps, confidence, and source Incident 3. approval_execution: persist execution results to KM - fire-and-forget _write_execution_result_to_km() after _trigger_learning() - both success and failure records are written, category=execution_result Full closed loop: alert → AI analysis → Playbook lookup → decision → execution → result written to KM ↓ Incident resolved → KM(knowledge_extractor) → Playbook extraction → KM Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 11:54:15 +08:00
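OG T	3256142d29	feat(api): ADR-030 Phase 5 continuous learning loop Learns from execution results to continuously improve decisions: 1. learning_service.py - continuous learning service - process_execution_result(): handles execution results - process_human_feedback(): handles human feedback - adjusts trust automatically (success +1 / failure resets to zero) - updates Playbook statistics - automatically extracts Playbooks from successful cases 2. approval_execution.py - integrates learning triggers - triggers learning after a successful execution - triggers learning after a failed execution - _trigger_learning(): non-blocking call to the learning service Learning flow: execution completes → LearningService.process_execution_result() ├─ success: TrustEngine +1 point + Playbook statistics update └─ failure: TrustEngine reset to zero + failure reason recorded Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-03-26 22:19:41 +08:00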
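OG T	2e75a20150	feat(api): Phase 7.5-7.6 Playbook-integrated decisions and automatic extraction Phase 7.5: DecisionManager three-track decision - adds Playbook-first matching (similarity >= 85%) - three-track decision order: Playbook > LLM > Expert System - integrates the PlaybookService recommendation engine Phase 7.6: automatic extraction mechanism - approval_execution.py triggers extraction after a successful execution - condition: RESOLVED/CLOSED + effectiveness >= 4 - a full score (5) auto-approves the Playbook Tests: - all 13 Playbook unit tests pass - fixed the Incident model field mapping (reasoning_steps) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-03-26 11:09:25 +08:00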
OG T	716b94f60a	feat(api): Phase 16 R4.2 extract ApprovalExecutionService Strangler Fig Pattern: extract the execution orchestration logic from approvals.py New: - src/services/approval_execution.py (271 lines) - ApprovalExecutionService class - integrates OperationParser + Executor + Timeline + Notifications Slimming results: - approvals.py: 1097 → 787 lines (-310 lines) - R4 total: 310 lines of inline business logic removed CI/CD fixes: - removed the dangerous rm -f ~/actions-runner-* command - switched to checkout clean: true + cleanup inside the workspace Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-03-25 22:04:15 +08:00

47 Commits