awoooi

Author	SHA1	Message	Date
Your Name	c22e5f334e	feat(km): P1-1 KMWriter 統一契約 + 5 caller 切換 + M4 反查鏈補齊 12-Agent 全景診斷揪出 KM 寫入鏈路 5 條入口無統一契約，fire-and-forget 在 Pod recycle 時會丟失條目。本次抽 KMWriter 強制 7 條契約。 ## 7 條契約強制 1. 同步底線：強制 await asyncio.wait_for(timeout) 2. 重試：3 次指數退避 1s/2s/4s（OperationalError / 網路類例外） 3. 失敗回收：3 次後寫 Redis DLQ km:dlq + log 4. 觀測：structlog event + 預留 metric hook（P1-3 補 emitter） 5. 冪等：incident_id + path_type 為 unique key 6. 禁止吞例外：except 必須 log + raise/DLQ 7. M4 反查鏈：payload 含 approval_id 時自動填 related_approval_id 並回填 Path A ## Caller 切換（5 條入口統一介面） - incident_service.py:1086 Path A（KB extractor + km_conversion） - approval_execution.py:771 Path B-人工 - decision_manager.py:2178 Path B-自動成功（消除跨類私有方法調用 M1） - decision_manager.py:2200 Path B-自動失敗（修 B2 早期吞例外） - playbook_service.py:210 PlaybookKM（兩份 T0 報告都漏的第三條） ## M4 反查鏈補齊 - knowledge.py + models.py: 補 related_approval_id ORM 欄位 - 對齊 phase26_incident_km_integration.sql:20 schema（partial index 已存在） - approval↔KM 雙向反查鏈完整（dual-path 縫合線） ## Feature Flag (rollback 保險) - KM_WRITE_AWAIT=true (default): await + timeout + DLQ 強制 - KM_WRITE_AWAIT=false: fire-and-forget（舊行為） ## 測試 - apps/api/tests/test_km_writer.py: 18 測試全綠覆蓋 success / timeout / retry / DLQ / 冪等 / KMWriteError / on_failure=raise / 反查鏈回填 - 1552 unit tests 全綠（無回歸） ## 驗收飛輪閉環核心 — KM 寫入不再靜默丟失，AI 學習鏈不斷裂。 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 10:44:39 +08:00
Your Name	715dc3cb91	fix(observability): P0 假警報止血 + ConfigMap drift 對齊 + 治理工具 12-Agent 全景診斷觸發的 P0/P1 觀測層修復。 ## P0 假警報止血（4 SLO 雪崩根因） - governance_agent.py:306 — 空 result 不再 fallback 0.0，改 continue + log warning 根因：Prometheus 查無資料（emitter 未實作 / rule 未部署）被誤判為 SLO=0 必觸發 violated=True 噴 4 條假告警 ## P0 鬼魂按鈕守門 - telegram_gateway.py:1654 — LLM 動態按鈕 Redis 失敗時 btn_list.clear() first_row（批准/拒絕，HMAC nonce 無狀態）由 caller 1488 永遠保留 feedback_no_ghost_buttons.md 三缺一鐵律對齊 ## ConfigMap drift 修復（3 處） - config.py:683 PROMETHEUS_URL: 188→110（drift checker 揪出 = SPF-4 部分根因） - config.py:705 ARGOCD_URL: 125→121（T0 G3 已知） - config.py:375 AI_FALLBACK_ORDER: 補 nvidia 對齊 ConfigMap ## P1 Alertmanager 升級（amtool SUCCESS） - ops/alertmanager/alertmanager.yml: deprecated → v0.27+ 新語法 - match/match_re → matchers - source_match/target_match → source_matchers/target_matchers - group_by 加 team label（防 SLO 雪崩 4 條同秒推） - PostgreSQL/Redis inhibit 補 equal: ['instance']（防爆炸抑制） - 新增 3 組因果抑制： - OllamaInstanceDown → SLO_/AI_（30 分鐘） - KMConverterDown → SLO_KMGrowthRate* - SLO__FastBurn → SLO__(Medium\|Slow)Burn ## 治理工具落地 - scripts/check_config_drift.py: ConfigMap vs code default drift 檢測揪出 PROMETHEUS_URL drift 是 SPF-4 根因（governance_agent 連 188 而非 110） - scripts/health_check_session.sh: 11 服務 + 4 SSH + drift + git 全景驗證 ## 驗證 - 1552 unit tests 全綠 - amtool check-config SUCCESS（8 inhibit_rules / 2 receivers） - drift checker 4 欄位全對齊 - health check 11 服務全可達 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 10:44:39 +08:00
Your Name	143c15f052	feat(wave2+km): LLM 動態按鈕啟用 + KM 自動寫入 + AI Router dead code 標記 All checks were successful CD Pipeline / build-and-deploy (push) Successful in 9m52s Details - ConfigMap: USE_LLM_DYNAMIC_BUTTONS=true（B2/B3/B4 handler 全就緒） - decision_manager: auto_execute 失敗路徑補 KM fire-and-forget 寫入 - ai_router: _build_fallback_chain 標記 DEPRECATED 2026-04-28 - tests: test_golden_regression.py 新增 172 行 golden 回歸測試 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-28 15:27:33 +08:00
Your Name	7f200aff5f	fix(solver): 注入告警 labels 讓 params 模板填充真實值 All checks were successful CD Pipeline / build-and-deploy (push) Successful in 10m45s Details 根因：Solver LLM 不知道 namespace/pod/deployment/instance 真實值， recommended_actions.params 模板（{labels.namespace} 等）填不出來 → Telegram 顯示 kubectl scale deployment --replicas=（空白）修復： - solver.run() 加 incident_labels 參數 - _build_prompt() 把 labels 顯式列出給 LLM 參考 - orchestrator 從 snapshot.alert_info.labels 取出後傳入 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-28 15:05:06 +08:00
Your Name	c1a1be61bd	fix(ssh-auto): 主機告警 SSH 自動診斷授權（HostHighCpuLoad 不再卡人工審核） All checks were successful CD Pipeline / build-and-deploy (push) Successful in 9m7s Details 根因：SSH_MCP_ALLOWED_HOSTS 未設定 → _ssh_execute() 全部攔截 + auto_approve 只認 kubectl 不認 ssh → 主機告警永遠降級人工修復： - ConfigMap: 補 SSH_MCP_ALLOWED_HOSTS 四主機白名單 - alert_rules: HostHighCpuLoad 等從 NO_ACTION 改為 ssh_diagnose 指令 - auto_approve: _has_executable 加入 ssh 開頭識別 - decision_manager: _ssh_execute() 加入 ssh_diagnose 路由 - ssh_provider: 新增 ssh_diagnose tool（ps aux + free -h + df -h，只讀） Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-27 20:13:07 +08:00
Your Name	277808758d	fix(failover): 補 OllamaRoutingResult.health_188 optional 欄位（merge conflict 遺漏） Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details stash pop 時 --theirs 覆蓋掉了 health_188 dataclass 欄位定義，導致 to_dict() 拋出 AttributeError（health_188 只在方法內引用）。補上 health_188: HealthReport \| None = None，37 failover tests ✅ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-27 20:04:49 +08:00
Your Name	877c2651bf	feat(p3.2.3): provider版本變更Telegram告警 + Gemini quota訊息更新 Some checks failed CD Pipeline / build-and-deploy (push) Failing after 1m40s Details - FailoverAlerter.alert_provider_version_changed()： - 每個 provider 獨立 dedup key（TTL 3600s），避免頻繁重複告警 - 批次合併通知：同一輪變更一則訊息，標出哪些 provider 版本異動 - 例外由 tracker 層 try/except 攔截，不中斷探測排程 - ModelVersionTracker.run_probe_cycle()： - changed_providers 非空時呼叫 alert_provider_version_changed() - P3.2.3 整合完成，告警鏈路 probe → 比對 → DB → Telegram 全通 - Gemini quota 告警訊息更新：移除舊的 188 CPU 備援字眼，改為 Nemotron → Claude - 6 new tests, 1501 passed Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-27 20:00:03 +08:00
Your Name	ae5e33d254	feat(failover+dispatcher): 補齊 unstaged 服務變更 - callback_dispatcher: params 型別放寬支援 numeric - failover_alerter: alert TTL 修正 - model_version_tracker: 小調整 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 19:56:51 +08:00
Your Name	3e382a4225	fix(telegram): P0 async race + P1 short_id 碰撞 + P0 incident_id 修復 - _build_llm_action_buttons 改 async，await setex 在 return 前完成（消除「按鈕發出→點擊→Redis 未寫完」的 race） - short_id 從 4 bytes → 8 bytes（16-hex），64-bit 碰撞空間 - payload 加入 incident_id，callback handler 從 payload 還原真實 ID （修 P0-2：避免 short_id 進 context 造成 KM 學習鏈錯亂） - Redis 故障與按鈕過期分流回應（P1） - HTML escape 防 XSS（P2） - _build_inline_keyboard 改 async，兩個呼叫端加 await - tests 全部改 @pytest.mark.asyncio + AsyncMock redis （1495 passed in unit suite） Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 19:56:51 +08:00
Your Name	a0502b778e	feat(auto-execute): CS3 alertmanager AI path 高信心自動執行（修法3擴展） All checks were successful CD Pipeline / build-and-deploy (push) Successful in 9m41s Details - CS3（alertmanager AI path）補入與 CS1 相同的 5 safety gate 自動執行邏輯 - confidence >= 0.85 + !CRITICAL + kubectl非空 + !NO_ACTION + !DESTRUCTIVE - 使用 _cs3_destr_patterns（from auto_approve）做破壞性指令攔截 - 例外包覆 try/except，不影響主流程 - 新增 test_cs3_auto_execute.py，9 tests 全通過 - CS4（LLM fallback）action=OBSERVE/confidence=0.0 → 不需要 auto-execute，維持現狀 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-27 19:46:56 +08:00
Your Name	d0c24275d6	fix(incident): Alertmanager 告警補寫 frequency_stats → 歷史統計不再空白 Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details 根因：create_incident_for_approval 建立 Incident 時從未查詢 AnomalyCounter → frequency_snapshot 永遠 null → 歷史按鈕顯示「無建立時快照」 signoz/sentry webhook 有寫，Alertmanager 路徑漏掉修復：建立前 record_anomaly → 頻率快照存入 frequency_stats → PG 持久化失敗無害（try/except，不阻斷主流程） Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-27 19:41:10 +08:00
Your Name	e3bad58842	feat(auto-rate): CS1 LLM 高信心度路徑自動執行（confidence ≥ 0.85） All checks were successful CD Pipeline / build-and-deploy (push) Successful in 9m53s Details 繼 CS2 rule_engine 後，CS1 LLM 路徑也開啟自動執行： - confidence >= 0.85 + low/medium risk + kubectl 有值 → auto-execute - CRITICAL / DESTRUCTIVE_PATTERNS / NO_ACTION → 絕對不執行 - 例外降級到 PENDING，不 crash - 9 tests 驗收（1469 passed） Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-27 16:12:30 +08:00
Your Name	e5f8d90451	feat(auto-rate): rule_engine 路徑開啟自動執行，預計 42% → 70%+ Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details 修法 3（debugger 建議）：CS2 is_rule_based=True + kubectl 有值 + 非 CRITICAL/DESTRUCTIVE → 直接 auto-execute，不建 PENDING record 安全防線（5 層）： - CRITICAL risk → 絕對不自動執行 - _DESTRUCTIVE_PATTERNS 命中 → 絕對不自動執行 - NO_ACTION → 不執行 - kubectl 空字串 → 不執行 - 任何例外 → catch + 降級到 PENDING，不 crash 15 tests 驗收（1487 passed） Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-27 16:08:50 +08:00
Your Name	a184b82ed1	feat(webhook): shadow-run auto_approve.evaluate + 補 metadata kwarg Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details 4 個 webhook call site 問題修復（debugger 根因分析 2026-04-27）： - 補 metadata kwarg → extra_metadata 不再為 NULL（source/confidence_score/is_rule_based/playbook_id） - shadow-run policy.evaluate() → logger.info 觀測 should_auto_approve - 不改任何執行決策：status 仍 pending，Telegram 推送不變 - 9 tests 驗收 metadata 非 null + shadow log 格式 + 例外不 propagate 下一步：shadow 觀測 1-2 天後開啟修法 3（rule_based 路徑自動執行） Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-27 16:00:00 +08:00
Your Name	0fd71b3e33	fix(mcp/k8s): _kubectl_scale 補 validate_deployment_exists dry-run Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details 根因：_kubectl_restart 有 dry-run 驗證，_kubectl_scale 完全沒有 → gitea（docker-compose，不在 K8s）直接被 kubectl scale 執行 → Deployment 'gitea' not found in namespace 'awoooi-prod'（INC-20260425-3B6C39）修復：_kubectl_scale 在執行前加 validate_deployment_exists， K8s 找不到 deployment 時返回 error 而非繼續執行 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-27 15:59:37 +08:00
Your Name	c3fa03fc19	fix(solver): 補 AGENT_SOLVER_TIMEOUT_SEC=80 + prompt 禁無腦重啟 Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details 問題1：AGENT_SOLVER_TIMEOUT_SEC 預設 20s K8s 未設 → deepseek-r1:14b 必然 timeout → candidates=[] → action="" → Telegram 顯示「待分析」+「規則分析」問題2：Solver prompt JSON 範例只有 restart + kubectl top，LLM 模仿範例 → 所有告警都推重啟，HostDisk/CPU 類應優先診斷+清理修復： - K8s 加 AGENT_SOLVER_TIMEOUT_SEC=80（< OPENCLAW_TIMEOUT=120，留 buffer） - Solver prompt 加根因對應修復規則：HostDisk→df/du/journalctl，CPU→top/ps， OOM→kubectl logs，禁止「先重啟」 - JSON 範例改為 HostDisk SSH 診斷場景，不再只有 K8s 命令 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-27 15:51:42 +08:00
Your Name	b432becd4e	fix(failover): 188 完全移出 routing chain，備援只用 Gemini Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details 統帥鐵律 2026-04-26： - 唯一 Ollama = 111（M1 Pro Metal 加速） - 188 CPU-only (0.45 tok/s) 禁止即時回應，移出所有 fallback chain - 111 HEALTHY → fallback=[Gemini] - 111 非HEALTHY → primary=Gemini, fallback=[Nemotron, Claude] - Gemini quota exceeded → Nemotron → Claude（不落 188） - OllamaRoutingResult 移除 health_188 欄位 - select_provider 只 check 111（不再 asyncio.gather 兩節點） - 測試全部對齊新規則（1451 passed） Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-27 15:47:41 +08:00
Your Name	ea23972f7a	feat(dispatch): B2 LLM 動態 MCP 派發安全閘 + telegram_gateway LLM 按鈕流程 All checks were successful CD Pipeline / build-and-deploy (push) Successful in 9m10s Details ADR-082 §B2：dispatch_llm_action() 風險閘控 + allowlist + 模板渲染 23 tests pass Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 15:22:31 +08:00
Your Name	8d6e086254	fix(p3.2): model_version_tracker 改 pure unit test + probe 改善 Some checks failed CD Pipeline / build-and-deploy (push) Failing after 2m7s Details Engineer 重寫 test_model_version_tracker： - 用 _make_fake_ctx (asynccontextmanager) 完整 mock get_db_context - 移除 @pytest.mark.integration（整 class） - patch probe_all_providers + get_db_context 雙路徑 - 4 testcases 全綠，無真實 PG 依賴 model_version_probe.py 配套改善（match 新 test mock 預期） Tests: 19 passed (probe 15 + tracker 4) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 14:58:46 +08:00
Your Name	ed205489c1	feat(p3.2-tests+ci-schema): model_version 測試 + CI test_schema 對齊 + Grafana SLO Dashboard Some checks failed CD Pipeline / build-and-deploy (push) Failing after 1m20s Details P3.2 配套測試 + CI 環境同步 + ADR-100 Grafana 視覺化： CI test_schema 補齊（解 1162-1172 阻塞之延伸）: - setup_test_schema.sql 加 ai_provider_version_history 表 - 對齊 production p3_2_provider_version_history.sql（已 K8s exec 上線）新增測試 (636 行): - test_model_version_probe.py (387) — Provider 探測單元測試 - test_model_version_tracker.py (249) — Tracker 整合測試 · 4 個 DB-dependent tests 標 @pytest.mark.integration · 15 unit + 4 integration（unit step 跳過 integration class）新增配套: - ai-slo-dashboard.json (496 行) — Grafana 儀表板 · 對應 ADR-100 SLO 規則的 4 大面板：自主修復成功率 / 飛輪閉環延遲 / 治理事件 / Provider 健康度修改: - governance_agent.py +122 行 — SLO 指標暴露 + retrieve metric 整合 Tests: 15 passed (probe + tracker unit), 4 deselected (integration class) Production 部署狀態: - p2_decision_fusion_columns.sql ✅ K8s exec 完成（commit c58bdd0c） - p3_2_provider_version_history.sql ✅ K8s exec 完成（this commit） - 兩個 production migration 都已上線，CI test_schema 同步補齊 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 14:57:16 +08:00
Your Name	025a493f06	feat(p3.2+adr-100): Model Version Tracker + SLO 自治 + KB rot cleaner Some checks failed run-migration / migrate (push) Failing after 12s Details CD Pipeline / build-and-deploy (push) Has been cancelled Details Wave 8 P3.2 模型版本追蹤 + ADR-100 SLO 自我治理 + 配套： P3.2 — Model Version Tracking: - model_version_probe.py (268 行) — 探測 Ollama / OpenRouter 等 provider 的 model version - model_version_tracker.py (101 行) — 對齊 PG provider_version_history 表 - migrations/p3_2_provider_version_history.sql + rollback — 25 行 schema - db/models.py +32 行 — ProviderVersionHistory ORM ADR-100 — AI 自主化 SLO: - docs/adr/ADR-100-ai-autonomous-slo.md (167 行) — 飛輪 SLO 設計與閾值 - ops/monitoring/slo-rules.yml (254 行) — Prometheus SLO recording rules + alerts - ops/monitoring/tests/test_slo_rules.yaml (242 行) — promtool unit tests 整合修改: - main.py +72 行 — Lifespan 啟動 model_version_probe + KB rot cleaner schedule - gitea_webhook.py +45 行 — webhook 接收 model 版本變化通知 - ci_auto_repair.py / evidence_snapshot.py / pre_decision_investigator.py — 配合接線新測試: - test_kb_rot_cleaner_schedule.py (120 行) — 9 tests pass - test_slo_rules.yaml — promtool 驗收 Tests: 9 passed (test_kb_rot_cleaner_schedule) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: Multiple Engineers (P3.2 + ADR-100) <noreply@anthropic.com>	2026-04-27 14:54:19 +08:00
Your Name	9908fdf50d	feat(p3.1-t2-patha): DiagnosisAggregator 路徑 A + Solver F4 critical reject + 對齊測試 Some checks failed CD Pipeline / build-and-deploy (push) Failing after 1m59s Details Wave 8 P3.1-T2 PathA 啟用 + Solver F4 安全強化 + test 對齊： PathA — DiagnosisAggregator 信號分類層補 PDI: - ENABLE_DIAGNOSIS_AGGREGATOR default=False → True · PathA 純信號分類層（OOMKilled/CrashLoop 等業務邏輯） · 不重複呼叫 K8s/SignOz API（只取 PDI 已收集的 raw 資料） · 安全 default on — 純邏輯處理，無外部依賴重疊 - diagnosis_aggregator.py +155 行（PathA 實作） - pre_decision_investigator.py 已接 (commit `3a2cd151`) F4 — Solver critical risk reject: - solver_agent.py: _validate_recommended_action 拒絕 risk=critical · 鐵律：critical 動作必須走人工審批，不可變 Telegram 按鈕 · log warning + return None（被 _extract 過濾掉） - _extract_recommended_actions 改返回 (list, status_str) tuple · status="ok"/"empty"/"all_invalid" 供呼叫端決策 - protocol.py +16 / metrics.py +9 / ai_router.py +18 — 配套 metric + protocol field 測試對齊: - test_solver_recommended_actions.py 拆 test_all_valid → low/medium/high accepted + test_critical_rejected - result tuple unpack: result, _ = _extract_recommended_actions(...) - test_diagnosis_aggregator_stub.py: feature flag default 改 True 對齊 PathA Tests: 51 passed (solver 28 + aggregator 16 + router fallback 8) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: Multiple Engineers (Wave 8 P3.1-T2 PathA + F4) <noreply@anthropic.com>	2026-04-27 14:42:29 +08:00
Your Name	fb130c9a28	feat(p3.1-t2): DiagnosisAggregator stub tests + sanitization 補強 + metrics 補欄 Some checks failed CD Pipeline / build-and-deploy (push) Failing after 2m16s Details Wave 8 P3.1-T2 後續補測 + 配套：新增測試: - test_diagnosis_aggregator_stub.py (238 行) — 15 tests · stub fixture 驗證 _collect_diagnosis_aggregator 接線 · feature flag default off 不呼叫 · timeout 邊界 / exception fail-soft 修改: - core/metrics.py +23 — 新增 DiagnosisAggregator 相關 Prometheus 指標 - sanitization_service.py +24 — 補強 prompt sanitize 邊界（vuln #4 配套） - RUNBOOK-AGENT-STEP-LATENCY.md / agent_step_latency_rules.yaml — 微調 Tests: 15 passed Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 08:30:26 +08:00
Your Name	3a2cd15144	feat(p3.1-t2): Tier-2 三服務感知強化 — Sentry 簽章 + DiagnosisAggregator + Solver actions test Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details Wave 8 P3.1-T2 三項感知強化（多 engineer 補完）： Sentry Webhook 簽章驗證: - sentry_webhook.py: 接入 SentryWebhookService.verify_sentry_signature() - 拒絕無效 sentry-hook-signature → 401 → 防偽造攻擊 DiagnosisAggregator Pod 深診斷整合: - pre_decision_investigator.py: 新增 _collect_diagnosis_aggregator() - ENABLE_DIAGNOSIS_AGGREGATOR feature flag 守衛（default=False） - evidence_snapshot.py: extra_diagnosis 欄位 + build_summary 顯示 - timeout=3.0s + try/except 隔離（fail-soft） - Conservative 策略：待重疊分析確認 vs PreDecisionInvestigator 不重複 config.py: - 新增 ENABLE_DIAGNOSIS_AGGREGATOR Field（default=False，K8s ConfigMap 動態啟用） Solver B1 補測（commit `7c726ebc` 對應）: - test_solver_recommended_actions.py — 20 tests + 3 skipped - 驗證結構化 recommended_actions（北極星 §1.1 修復多樣性 ≥ 40%） - LLM 失敗 graceful degraded（candidates=[], degraded=True） Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: Multiple Engineers (Wave 8 P3.1-T2) <noreply@anthropic.com>	2026-04-27 08:24:15 +08:00
Your Name	7c726ebc1c	fix(b1): Solver Agent 結構化動作 — 北極星 §1.1 修復多樣性 ≥ 40% Some checks failed CD Pipeline / build-and-deploy (push) Failing after 2m22s Details INC-20260425 衍生修復 — Solver 拒絕 rule-based mock 兜底: 原設計缺陷: - LLM 失敗時 → rule-based mock 推 RESTART 兜底 - 違反北極星 §1.1：修復多樣性 ≥ 40%（不能寫死同一指令）新設計: - LLM 失敗 → graceful degraded（candidates=[], recommended_actions=[], degraded=True） - 禁止 rule-based mock / hardcode RESTART - 新增 recommended_actions 結構化 MCP 動作清單 · 供 B3 Telegram 按鈕動態生成 · YAML 規則庫驅動，非寫死 - 新增 yaml + Path import 載入動作模板庫向下相容: - 既有 candidates / blast_radius 邏輯不變 - 新增欄位 recommended_actions 為 optional list Tests: 8 passed (solver 相關全綠) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: Claude Sonnet 4.6 (B1 北極星 §1.1) <noreply@anthropic.com>	2026-04-27 08:18:38 +08:00
Your Name	123d9c8a2e	fix(p3.1-t1): 三 Tier-1 服務整合 — model_rollback_service + resource_resolver Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details P3.1-T1 接線兩個既有服務到主流程： offline_replay_service.py — model_rollback_service 整合: - 回放事件寫入治理 DB 後，觸發 ModelRollbackService.check() 衰退偵測 - feature flag 由 model_rollback_service 自行判斷（AIOPS_P6_GOVERNANCE_ENABLED） - retrain_recommended → log warning 含 streak / absolute_floor / conservative_mode - exception fail-soft（不阻斷 replay 主流程） approval_execution.py — resource_resolver 整合: - kubectl 指令解析後，動態驗證資源是否存在於 K8s - 若 resolved_name != raw_name → log + apply normalized name - 若不存在但有 candidates → log warning + suggestions（不攔截執行，只記錄） - exception fail-soft（不阻斷主流程） - RESOURCE_RESOLVE_TOTAL Prometheus counter labels: hit/suggestion/miss/error Tests: 後端 1303 collected（無回歸），對應 dedicated 測試在前次 commit 已寫 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: Claude Sonnet 4.6 (P3.1-T1) <noreply@anthropic.com>	2026-04-27 08:17:04 +08:00
Your Name	fefe4c21cd	fix(inc-20260425): A1+A2 後續 — Solver/Critic timeout + auto_repair 接線 + Runbook + Grafana Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details 延續 `595629c0` INC-20260425 修復，補三段 Agent + 全鏈路觀測： A1 後續 — Solver/Critic 三段 timeout 接線: - solver_agent.py: AGENT_SOLVER_TIMEOUT_SEC=20.0（env override） - critic_agent.py: AGENT_CRITIC_TIMEOUT_SEC=15.0（env override） - protocol.py: 三 Agent 共用 observe_agent_step() 包裹呼叫 · success/timeout/error outcome label · histogram 寫入 aiops_agent_step_duration_seconds A2 後續 — auto_repair_service 改用 _diagnose_fallback_chain: - auto_repair_service.py +46 行 — 切換 DIAGNOSE 路由到新 chain（NEMO→GEMINI→CLAUDE） - 完全避開 Ollama CPU 238s 二次 timeout 新增 metrics: - core/metrics.py +59 行 — 配合 observe_agent_step 的 histogram bucket + label cardinality 新增測試 (862 行): - test_agent_step_timeouts.py (475) — 三 Agent 各 timeout 邊界 + outcome label - test_ai_router_diagnose_fallback.py (387) — _diagnose_fallback_chain 正確序新增配套: - docs/runbooks/RUNBOOK-AGENT-STEP-LATENCY.md (350) — INC 故障排查 + 觀測指引 - ops/monitoring/grafana/agent_step_latency_rules.yaml (160) · 三 Agent histogram alert rules（p99 > timeout 80% → warning）驗收: 33 tests pass (test_agent_step_timeouts 22 + test_ai_router_diagnose_fallback 11) INC-20260425 雙修總工作量（595629c0 + 此 commit）: · 5 個 service/agent 檔修改 · 1 個新 observability 模組 · 4 個新測試/配套檔 · 1372+187 = 1559 行新增 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: Claude Sonnet 4.6 (INC-20260425 後續) <noreply@anthropic.com>	2026-04-27 08:15:53 +08:00
Your Name	595629c013	fix(inc-20260425): A1 三段 Agent timeout 拆分 + A2 DIAGNOSE 移除 Ollama Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details INC-20260425-8D17BB / 3B6C39 兩則告警 AI 信心降到 20% 根因雙修（統帥批准 A+B）： A1 — 三段 Agent step timeout 拆分（北極星 §1.2 Observable by Default）: - diagnostician_agent.py: PHASE2_STEP_TIMEOUT_SEC=20.0 共用值 → 拆三段 · AGENT_DIAGNOSTICIAN_TIMEOUT_SEC=30.0（NIM 主吃口，最大 prompt + 多假設） · AGENT_SOLVER_TIMEOUT_SEC=20.0（後續 commit 接線） · AGENT_CRITIC_TIMEOUT_SEC=15.0（後續 commit 接線） · env override 支援，K8s ConfigMap 動態調整不需 rebuild · 保留 PHASE2_STEP_TIMEOUT_SEC alias（DEPRECATED，下 sprint 移除） - observability/agent_step_metrics.py (58 行) — 新模組: · aiops_agent_step_duration_seconds Histogram · observe_agent_step() helper 統一三 Agent 呼叫點 · outcome label ∈ {success, timeout, error} · agent label ∈ {diagnostician, solver, critic} A2 — ai_router DIAGNOSE chain 移除 Ollama: - ai_router.py v4.4 by Claude Sonnet 4.6 · 新增 _diagnose_fallback_chain: NEMO → GEMINI → CLAUDE · Ollama 永久排除於此 chain（CPU-only 實測 238s，二次 timeout 必爆） · 新增 aiops_diagnose_fallback_total Prometheus metric - 根因: NIM timeout 後 fallback 到 Ollama deepseek-r1:14b CPU 238s → 二次 timeout → degraded confidence=0.2 Wave8-X2 整合測試補正: - test_ollama_failover_manager.py: TestSelectProvider 補 mock _check_gemini_quota 原 test 期望 OFFLINE→Gemini，但 quota fail-closed 後沒 mock 會被切到 188 繞過 quota check 後驗純路由邏輯 → 37/37 PASS Tests: 37 passed (test_ollama_failover_manager 全部) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: Claude Sonnet 4.6 (Wave 8 INC-20260425) <noreply@anthropic.com>	2026-04-27 08:15:10 +08:00
Your Name	cc547736ab	feat(wave6-8): P2.1 fusion + P2.2 governance + P2.4 consensus + Wave 7/8 BLOCKER 修復承接 Wave 6/7/8 多 engineer 在 agent 限額前完成的代碼，補 commit 解 production HEAD 隱性 import error（decision_fusion 已被 decision_manager 引用但檔案 untracked）。新增（後端核心）: - decision_fusion.py (562 行) — P2.1 方法 III（OpenClaw + Hermes + Elephant 三 LLM 融合） - aiops_timeline.py + aiops_timeline_service.py — critic B4 修復 /api/v1/aiops/timeline endpoint，DB 存取抽到 service 層遵守 leWOOOgo 積木化 - migrations/p2_decision_fusion_columns.sql + rollback — approval_records fusion 欄位修改（後端整合）: - decision_manager.py — fusion 三斷鏈修補（critic B1+B2+B3）： · B1: 寫 _evidence_snapshot_ref 到 token.proposal_data · B2: fusion 前計算 complexity_score 並寫 token · B3: fusion composite 寫 token.proposal_data["decision_fusion"] - auto_approve.py — fusion + consensus 認識（critic B3+B5）： · composite > 0.7 → auto_execute_eligible bypass min_confidence · source=consensus_engine + score>=0.6 → 規則可信路徑 - consensus_engine.py — db-fix _save_consensus 重用 agent_sessions - governance_agent.py — db-fix _alert PG 寫入 ai_governance_events - approval_db.py — fusion 3 欄位 + 2 partial index + CheckConstraint - db/models.py — schema 對齊 migration - core/config.py — vuln #1 修復：OLLAMA_URL/_FALLBACK_URL field_validator 拒絕公網 IP + 外部域名，僅允許私網/loopback/K8s SVC 白名單 - core/feature_flags.py — P2 fusion + consensus flags - main.py — governance_agent lifespan 啟動 - failover_alerter.py — Wave8-X2: in-memory dedup fallback（Redis 拒絕後不 fail-open） - ollama_*.py — metrics 整合 + recovery 改善 - auto_repair_service.py — verifier 接線新增（測試 2438 行）: - test_decision_fusion.py / test_governance_agent.py / test_consensus_integration.py - test_p2_db_fixes.py / test_wave8_fusion_fixes.py - test_config_url_validation.py（vuln #1 12 tests） - test_failover_alerter.py +Wave8-X2 in-memory dedup 補測驗收: 116 tests pass (decision_fusion + wave8_fusion + config_url + consensus + governance + p2_db_fixes + failover_alerter) Conflict resolution: - 3 檔（config.py + auto_approve.py + decision_manager.py）git stash pop 衝突保留 stashed (engineer 最終版)，補回 ValueError 「公網 IP」字樣對齊 test Note: 此 commit 解 production HEAD 隱性 import error 仍未修: vuln #4 prompt injection / debugger B14 quota fail-closed / B25-B26 drain_pending_tasks / B8 governance fail alert Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: Multiple Engineers (Wave 6/7/8) <noreply@anthropic.com>	2026-04-27 08:11:40 +08:00
Your Name	2c57b71db9	feat(wave5-p2): GovernanceAgent 4 項自檢 + Ollama 健康告警規則 + Prometheus metrics 整合 All checks were successful CD Pipeline / build-and-deploy (push) Successful in 10m45s Details MASTER plan_complete_v3.md Wave 5 P2.2 + P2.3 完成（multiple engineers 在限額前完成代碼，補 commit）： P2.2 — GovernanceAgent 4 項自檢: - governance_agent.py (342 行) — 每 1 小時自檢循環: · trust_drift（信任度漂移檢測） · knowledge_degradation（知識退化檢測） · llm_hallucination（LLM 幻覺檢測） · execution_blast_radius（執行爆炸半徑檢測） - main.py lifespan: asyncio.create_task(run_governance_loop()) 啟動 try/except 包裹，schedule 失敗不阻斷主流程 - failover_alerter.py: alert_governance(event_type, payload) 1h dedup 四類事件 → Telegram MarkdownV2 告警 P2.3 — Ollama 健康規則 + Prometheus Metrics: - ops/monitoring/ollama_health_rules.yaml (148 行): · OllamaHealthDegraded / OllamaPrimaryDown · OllamaFailoverTriggered / GeminiQuotaExceeded · 補 Prometheus 取資料的 alert rules - core/metrics.py (57 行): · GEMINI_DAILY_CALL_COUNT / GEMINI_DAILY_QUOTA Gauge · OLLAMA_FAILOVER_TRIGGERED_TOTAL Counter · OLLAMA_CURRENT_PRIMARY_IS_OLLAMA Gauge - ollama_failover_manager.py: · _check_gemini_quota: 每次 check 同步更新 Gauge（讓 Prometheus 取最新值） · select_provider: failover 時 inc Counter + 切 Primary Gauge · try/except 包裹，metric 失敗不阻斷主路由 E2E 測試: - test_failover_e2e_dispatch.py (365 行) 完整 dispatch 路徑：health check → failover decide → alerter → metrics Tests: 54 passed (e2e_dispatch + failover_manager + failover_alerter) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: Multiple Engineers (上 session Wave 5) <noreply@anthropic.com>	2026-04-26 20:56:19 +08:00
Your Name	02362eddcf	feat(wave4-5): P1.3+P1.4 真接線 + Ollama_188 provider 註冊 + quota atomic 修復 Some checks failed CD Pipeline / build-and-deploy (push) Failing after 2m0s Details 3 個 engineers 在限額前的 Wave 4/5 完成工作（補 commit）： Engineer-B3 — Wave 4 P1.3+P1.4 真飛輪閉環（auto_repair_service.py 才是正確接線位置）: - execute_auto_repair 成功後 fire-and-forget 啟動 PostExecutionVerifier - record_verification_result 觸發 EWMA trust_score 演化 - snapshot=None（不依賴 EvidenceSnapshot，避免我之前 webhooks.py 補丁的 B2 bug） - _pending_tasks 管理生命週期，Lifespan shutdown 時等任務完成 Engineer-A4 — Wave 5 B1-fix Ollama188Provider 註冊: - ai_providers/ollama.py: 新增 Ollama188Provider(OllamaProvider) 子類 - name="ollama_188", is_enabled 看 ENABLE_OLLAMA_188 + OLLAMA_FALLBACK_URL - analyze() 用 OLLAMA_FALLBACK_URL（192.168.0.188:11434）作為推理端點 - ai_router.py:_init_registry 補 registry.register(Ollama188Provider()) - 修復 BLOCKER：原本 failover_manager 決策返回 "ollama_188"，但 executor 查不到 → not_registered → 188 從未被打到。Wave 2 P1.1 整套容災系統前段卡住。 Engineer-A4 — Wave 5 B3-fix Gemini quota TOCTOU 修復: - ollama_failover_manager.py:_check_gemini_quota 改用 redis.pipeline() 原 GET → 判斷 → INCR → EXPIRE 四步分離，並行請求在 GET/INCR 間競爭超發修法：SET NX(首次設 TTL) + INCR atomic pipeline，用 INCR 後新值判斷 Engineer-B3 — test_learning_chain_e2e.py（377 行 No-Mock 整合測試）: - 純 Python Stub + monkeypatch（feedback_no_mock_testing.md 合規） - execute_auto_repair 成功 → verifier 被呼叫 ✓ - execute_auto_repair 失敗 → verifier 不被呼叫 ✓ - matched_playbook_id=None → log warning 不 crash ✓ - verifier 拋例外 → 修復回傳成功，trust 不更新 ✓ Tests: 42 passed (failover_manager + ai_router_failover_integration 全綠) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: Engineer-A4 + Engineer-B3 (上 session) <noreply@anthropic.com>	2026-04-26 20:44:19 +08:00
Your Name	75b404379b	fix(critic-h2-h4): proactive_inspector metric 改名 + probe_success fallback Some checks failed CD Pipeline / build-and-deploy (push) Failing after 2m7s Details H2 — metric semantic 切換污染 baseline: - cpu_usage_awoooi_api → cpu_usage_node_188 - memory_usage_awoooi_api → memory_usage_node_188 原 metric_name 對應 container working set，新 PromQL 改為 node-level ratio （cadvisor 停止後的替代）。語意完全不同但保留同名 → 既有 DynamicBaseline 模型用舊單位訓練的 σ 對新值失真，5 分鐘 inspector 週期會狂報假 anomaly。改名後 baseline 從零學習，初期 sample 數不足會被 _has_enough_samples 守門跳過告警，安全度過 30 個週期暖機期。 H4 — probe_success 全部不可達假觸發: - 1 - avg(probe_success) + 1 - avg(probe_success or on() vector(1)) 原 expr 在 Blackbox 全部 target 失聯時 avg 回空 vector → _fetch_current_value 若把空當 0 → 1-0=1 遠超 0.05 threshold → 5min 一次假告警。 fallback 視為全部成功（值=1，1-1=0），真實 probe down 由獨立的 BlackboxProbeFailure rule 偵測，責任分離。部署後驗證: - baseline 表新增 metric_name='memory_usage_node_188' / 'cpu_usage_node_188' 的 row - 舊 metric_name='memory_usage_awoooi_api' / 'cpu_usage_awoooi_api' 的 row 30 天後可清理 - proactive_inspection_logs 30 個週期內看 _baseline_warmup_skipped 條目而非假 anomaly Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 20:40:57 +08:00
Your Name	32affaffeb	fix(critic-hotfix): 4 修補 critic BLOCKER + HIGH（CD 阻塞 + 飛輪空轉） Some checks are pending CD Pipeline / build-and-deploy (push) Has started running Details Critic 全面審查 6 個 commit 後抓出： CD 阻塞修復: - test_ai_router_failover_integration.py: 3 個 test 改用 patch.object 直接 mock _select_provider_and_model 強制初始 OLLAMA。原 IntentType.UNKNOWN mock 在 router 內仍被 reclassify 成 DIAGNOSE → openclaw_nemo，failover 不觸發。 → 5/5 PASSED BLOCKER B1 — Gitea Telegram 通知永遠發不出去: - apps/api/src/api/v1/gitea_webhook.py:399 redis = await get_redis() → redis = get_redis() 原 await 會 raise TypeError 被外層 except 吞 → Task C PR merged + workflow_run failure 通知全部失效（CI 綠燈是假象，test 只驗 HTTP 202 不驗實際送達） BLOCKER B2 — P1.3+P1.4 學習鏈閉環空轉（兩處同 bug）: - apps/api/src/api/v1/webhooks.py:261 - apps/api/src/services/approval_execution.py:771（pre-existing） EvidenceSnapshot.get_latest_snapshot(...) 是 module-level async function 不是 classmethod → AttributeError 被 except 吞成 warning → 飛輪閉環假性接通實際空跑（feature flag default off 暫時免爆） HIGH H3 — main.py lifespan 順序競爭: - apps/api/src/main.py: configure_alerter() 移到 _recovery_svc.start() 之前原順序：start() 觸發 immediate-check → 可能呼叫 alert_recovery，但 alerter 尚未注入 Redis → dedup fail-open，重複告警風險。 HIGH H1 — Gemini quota dedup 跨日吞告警: - apps/api/src/services/failover_alerter.py:89 dedup key 加 :{YYYY-MM-DD} 後綴，每日獨立 dedup window 原昨 22:00 觸發，今 21:30 再觸發時 dedup 還沒過期會被吞掉 Tests: 14 passed (failover_alerter + ai_router_failover_integration + lifespan_wiring) 延後 follow-up: - H2: proactive_inspector memory metric 改名 + baseline 清理 - H4: probe_success NaN fallback - M1-M4 / S1-S2: 見 critic 報告 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 20:39:53 +08:00
Your Name	dcf2750b2b	feat(p1.5): FailoverAlerter 整合點 3+4 + 6 個 testcase 補完 Some checks failed CD Pipeline / build-and-deploy (push) Failing after 1m32s Details P1.5 收尾（status 文件 line 96-99 指定）：整合點 3 — failover_manager Gemini quota 告警觸發: - ollama_failover_manager.py: _check_gemini_quota 返回 False 時呼叫 alerter.alert_gemini_quota_exceeded({quota, current_count}) - 從 Redis 讀 ollama:gemini_daily_count:{date} 取 current_count（fail-soft） - alerter 內 24h dedup（QUOTA_DEDUP_TTL_SEC=86400），每日只發一次 - try/except 包裹：告警失敗 fail-open，不阻斷 routing 整合點 4 — main.py lifespan 注入 Redis client: - 在 _recovery_svc.start() 之後、yield 之前 - 呼叫 configure_alerter(get_redis()) 替換 singleton 注入 dedup 能力 - try/except 包裹：注入失敗 fail-open（alerter 仍可工作但 dedup 失效）新測試 (174 行, 6/6 pass): - test_alert_failover_dedup: 同 to_provider 第二次被 10min dedup ✅ - test_alert_recovery_send: 正常發送 + Markdown 訊息 + 連續 N 次 HEALTHY ✅ - test_no_telegram_chat_id_noop: chat_id 缺時 fail-soft 不 raise ✅ - test_quota_alert_dedup_24h: TTL=86400s，訊息含 quota+count ✅ - test_configure_alerter_replaces_singleton: lifespan 注入後 redis 可用 ✅ - test_dedup_fail_open_when_no_redis: Redis None → 允許送出 ✅ Mock 注意：_send() inline import telegram_gateway/get_settings， mock target 必須是 src.services.telegram_gateway / src.core.config 而非 alerter module 自己。回歸：原 37 ollama_failover_manager + 3 lifespan_wiring 測試全綠。飛輪自主化分數：~75 → 預估 ~80（配額耗盡有告警，運維可見性 +5） Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 20:28:29 +08:00
Your Name	fd40b79db4	feat(p0.6+p1.3+p1.4): 飛輪閉環最後一哩 + ProactiveInspector PromQL 三修 Some checks failed run-migration / migrate (push) Failing after 17s Details Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 47s Details CD Pipeline / build-and-deploy (push) Failing after 1m50s Details P0.6 ProactiveInspector PromQL labels 修正 (Engineer-B): - http_error_rate: blackbox_probe_success → probe_success（實測 metric 名稱） - cpu_usage_awoooi_api: cadvisor up=0（停止）→ 改 node-exporter node_cpu_seconds_total - memory_usage_awoooi_api: cadvisor 停止 → node-exporter 記憶體使用率比例 P1.3+P1.4 飛輪閉環最後一哩 (Engineer-B2): - webhooks.py:_try_auto_repair_background 補 PostExecutionVerifier 接線 - feature flag AIOPS_P1_POST_EXECUTION_VERIFIER 守住（default off，可漸進啟用） - 60s timeout + try/except 三重防護（timeout / 一般 exception / outer exception） - asyncio.wait_for + EvidenceSnapshot.get_latest_snapshot - 補 learning_service.record_verification_result 呼叫 - matched_playbook_id 從 result.playbook_id 帶入 - 觸發 EWMA trust_score 演化（飛輪閉環） - 對稱於人工審核路徑 approval_execution._run_post_execution_verify ADR 對應: ADR-081 Phase 1 (Verifier) + ADR-083 Phase 3 (Learning) plan_complete_v3.md L5/L6 階段：⚠️ → ✅（飛輪自主化分數預估 +12 分） Note: feature flag default off → 不會立即影響 production 行為；啟用前需 critic 審查 + production E2E 驗證。 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 20:20:11 +08:00
Your Name	e96055eef9	fix(p0.4): Playbook 學習鏈三道修復 — partial index + race防護 + 手動路徑接線 ADR-092 P0.4 Playbook EWMA 學習閉環的 DB / Repository / Service 三層修補。 DB 層 (db-expert-fix by Engineer-B): - ApprovalRecord.matched_playbook_id 移除 index=True，改 __table_args__ partial index (WHERE matched_playbook_id IS NOT NULL) — 多數列 NULL，full index 浪費空間 - adr092_p1_learning_chain_rollback.sql: 純 ROLLBACK SQL（DBA 手動執行） Repository 層: - playbook_repository.py: SELECT FOR UPDATE 防 lost update 避免並發 EWMA 更新覆蓋彼此 Service 層 (P0.4 修復): - proposal_service.py: 手動審核路徑補 _try_playbook_match_id 呼叫 decision_manager auto_execute 路徑已有此邏輯（行 2035），此處補手動路徑缺口，使 matched_playbook_id 可寫入 DB → EWMA 才能演化測試: - test_playbook_repository_race_condition.py: 3 cases SELECT FOR UPDATE 防 race 正確阻擋並發 EWMA 更新（pass） Note: migration SQL 待 DBA 手動執行（feedback_dev_prod_separation.md），不執行 alembic upgrade（statu 文件禁忌條款）。 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 20:19:46 +08:00
Your Name	55c6b4e2d9	feat(p1): Ollama 多層容災系統 — P1.1 健康檢測 + P1.2 ai_router 整合 + P1.5 容災告警 ADR-092 P1 飛輪閉環的 Ollama 失敗轉移子系統，全部 Engineer-A2/C/C2 補上。新服務 (1581 行)： - ollama_health_monitor.py (356)：3 層健康檢測（TCP/HTTP/推理） - ollama_failover_manager.py (571)：111→188 自動切換 + Redis 持久化 + recovery callback - ollama_auto_recovery.py (436)：30s 背景監控 + 連續 3 次 HEALTHY → 切回 + clear_cache - failover_alerter.py (218)：P1.5 Telegram 容災告警服務整合： - ai_router.py: AIProviderEnum.OLLAMA_188 + 120s budget + failover fallback chain - main.py lifespan: 啟動時 wire callback + start recovery，關閉時優雅 stop - config.py: OLLAMA_FALLBACK_URL / OLLAMA_HEALTH_CHECK_MODEL / GEMINI_DAILY_QUOTA（帳單熔斷） K8s 配置： - 04-configmap.yaml.patch-188-fallback：注入 OLLAMA_FALLBACK_URL=http://192.168.0.188:11434 測試 (2082 行)： - test_ollama_health_monitor.py (402) - test_ollama_failover_manager.py (707) - test_ollama_auto_recovery.py (580) - test_ai_router_failover_integration.py (257) - test_lifespan_failover_wiring.py (136) 依賴鏈：service 三件套 + ai_router + main.py 一起 commit，缺一就 ImportError。 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 20:18:33 +08:00
Your Name	d3a4fb4d15	feat(t0): Task A 按鈕一致性測試 + Task C Gitea→Telegram 通知收尾 Task A — Telegram 按鈕鬼魂鐵律測試（補測 production telegram_gateway.py） - test_telegram_button_consistency.py 新增 14 測試 - send_info_notification 兩鍵 [📋 詳情][📊 歷史] - _send_approval_card_to_group reply_markup - callback_data 對齊 INFO_ACTIONS 白名單 - parse_callback_data + handler 完整性 Task C — Gitea CI/CD → Telegram 告警轉發 - GiteaPullRequest.merged 欄位（HasMerged bool json:"merged"） - _send_gitea_notification helper：Redis SET NX EX 600s 去重 - handle_pull_request: closed+merged → PR Merged Telegram 卡片 - handle_workflow_run: status=failure → 部署/構建失敗卡片 - 不加按鈕（feedback_no_ghost_buttons.md 合規） - test_gitea_webhook.py +247 行新測試驗收: K8s GITEA_WEBHOOK_SECRET 64 bytes ✅ Gitea hook #4 events: pull_request + push + workflow_run ✅ 端點 HMAC 401 驗簽 ✅ Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 20:17:17 +08:00
Your Name	bb12647e8d	feat(telegram): 群組告警卡片加入完整互動按鈕（批准/拒絕/暫默/詳情/重診/歷史） All checks were successful CD Pipeline / build-and-deploy (push) Successful in 9m7s Details - _send_approval_card_to_group 加 alert_category + notification_type 參數 - 群組卡片改用 _build_inline_keyboard（與 DM 相同的完整六鍵佈局） - send_approval_card → _send_approval_card_to_group 傳遞兩參數 - TYPE-1 通知補 read-only 詳情/歷史按鈕（鬼魂按鈕鐵律合規） Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 10:31:27 +08:00
Your Name	cbd28e29a0	fix(solver+incident): 兩組 P0 配置修復 - Gitea 非K8s 過濾 + 備份告警年齡升級 All checks were successful CD Pipeline / build-and-deploy (push) Successful in 8m57s Details L3 修復總結（2026-04-25）：【修復 1】Gitea 跨域界限 kubectl 過濾（solver_agent.py）根因：GiteaMemoryPressure 告警觸發 Solver → LLM 生成 'kubectl scale deployment gitea' Gitea 在主機 docker-compose，不在 awoooi-prod K8s namespace → 執行必然失敗變更： - 添加 _filter_non_k8s_targets() 函數，對 scale/restart/delete/patch 指令驗證 target - 添加 _KUBECTL_MUTATING_VERBS / _KUBECTL_ROLLOUT_MUTATING_SUBVERBS 常數 - 在 _solve() 呼叫 _fetch_k8s_inventory() 獲取實際部署清單 - 後置過濾：candidates 中若 target 不在 inventory 且屬寫入動詞 → 丟棄 + 警告預期行為：GiteaMemoryPressure → Solver 現生成調查類 kubectl（get/describe），而非 scale 【修復 2】HostBackupFailed 誤判升級（incident_service.py + webhooks.py）根因：備份失敗 >24h 被標記 TYPE-1（純資訊），導致靜默發送無按鈕卡片，未觸發自動修復變更： - incident_service.py classify_alert_early() 添加 age_hours 參數 - 添加 _BACKUP_AGE_UPGRADE_NAMES + _BACKUP_AGE_THRESHOLD_HOURS=24.0 - 若 alertname in (HostBackupFailed/Stale/Missing) 且 age > 24h → TYPE-3 升級 - webhooks.py 計算 alert.startsAt → age_hours，並傳遞給 classify_alert_early() 預期行為：HostBackupFailed 25h+ → 升級為 TYPE-3，觸發 LLM 分析 + P0 自動修復建議測試結果： - solver_agent: 35/35 tests PASSED ✅ - incident_service: 11/11 tests PASSED ✅ - incident_api integration: 7/7 tests PASSED ✅ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 09:48:04 +08:00
Your Name	6baa5054bc	fix(auto-execute): 修復 kubectl pattern 攔截 + 補 auto_execute KM 寫入 Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details 問題 1：_ALLOWED_KUBECTL_PATTERN 不允許 resource type keyword 根因：LLM 輸出 "kubectl rollout restart deployment clickhouse" 但 pattern 只允許 "kubectl rollout restart clickhouse"（無 deployment 關鍵字）結果：_action_safe=False → auto_execute_blocked_unresolved_placeholder → 所有 low/medium risk 告警降為人工審核，飛輪完全停轉修法：pattern 新增可選的 resource type group（deployment/pod/service/...） + re.ASCII flag 防 unicode bypass，12/12 test cases 通過問題 2：auto_execute 路徑 KM 寫入斷鏈根因：_write_execution_result_to_km 只在人工審核路徑呼叫修法：auto_execute 完成後補 _fire_and_forget(executor._write_execution_result_to_km) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 09:47:35 +08:00
Your Name	f9f2263c00	fix(execution-feedback): 修復系統自動化反饋完全斷鏈的三層 P0 故障 All checks were successful CD Pipeline / build-and-deploy (push) Successful in 8m57s Details 背景用戶報告執行狀態卡在「⚡ 執行中...」永不回報，導致自動修復機制完全癱瘓（信心度修復後，執行失敗但無法推送 Telegram 卡片通知） L1 — Post-verify AttributeError（2 處） - approval_execution.py:757, 1010 調用不存在方法 IncidentService.get_incident() - 正確方法：get_from_working_memory() fallback get_from_episodic_memory() - 影響：post-verify 邏輯被 exception 無聲吞掉，下游 Telegram 推送完全卡住 L2 — Notification Provider 未配置 - 新增 notifications/telegram.py：複用既有 TelegramGateway.send_notification() - 修改 manager.py：初始化時註冊 TelegramWebhookProvider - 影響：執行完成後無任何 provider 發送推送，導致 Telegram 看不到結果 L3 — Solver Agent 語意合成生成殘缺指令 - 舊邏輯：action_title="重啟服務" → 合成 "kubectl rollout restart deployment -n awoooi-prod"（缺名） - 下游 operation_parser 無法解析（regex 要求 deployment/<name>） - 修法：優先從 parsed 提取 target 欄位；無名則 return []，降級到唯讀調查指令 - 測試全部通過：35/35，含 11 個新安全測試驗證 - 被阻擋的惡意 kubectl_command 現在正確 fall-through 到語意合成路徑 - 無 target 名稱時返回空列表，不再生成殘缺指令 - Telegram 執行結果推送鏈路已完整預期效果 - 執行失敗 → 立即收到「❌ 執行失敗」Telegram 卡片（L1 + L2 修復） - 自動化決策遵循白名單，避免生成無法執行的指令（L3 修復） Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 03:29:38 +08:00
Your Name	7b6df17dee	feat(hermes): 升級 Ollama 模型路由 — qwen3:8b 取代雙模型 Some checks are pending CD Pipeline / build-and-deploy (push) Has started running Details - qwen2.5-coder:7b + qwen2.5:7b-instruct → qwen3:8b (Hybrid Thinking) - qwen3:8b 同時勝任程式碼與通用指令，單一模型涵蓋 9 個 agent - deepseek-r1:14b 保留 debugger / vuln-verifier 推理任務 - gemma4 尚未在 Ollama registry 釋出，暫保留 gemma3:4b - 已在 111 主機 pull qwen3:8b (4.9GB) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 03:24:16 +08:00
Your Name	250eca99c6	fix(hermes): 改用 Ollama 本地模型（111），零費用，按 agent 類型選模型 Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details 模型路由： debugger / vuln-verifier → deepseek-r1:14b (強推理，找根因/安全分析) critic / db-expert / coder 群 → qwen2.5-coder:7b (程式碼專用) planner / onboarder / web → qwen2.5:7b-instruct (通用指令) default → deepseek-r1:14b - _strip_think_tags(): 去除 deepseek-r1 <think> 推理塊，只留最終回答 - timeout=90s (deepseek-r1 推理較慢) - log 加 model 欄位供 latency 監控 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 03:13:59 +08:00
Your Name	d467cac709	fix(hermes): 改用 anthropic Python SDK 直呼，棄用需要 claude CLI 的 claude-agent-sdk Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details 根因：claude-agent-sdk 需要 spawn claude CLI，prod pod 沒有 CLI 所以 SDK 回空。修法：改用 anthropic.AsyncAnthropic().messages.create() 直呼 API。 model: claude-haiku-4-5-20251001（快速低成本，適合 Telegram QA） Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 03:08:51 +08:00
Your Name	c14f23b33a	feat(k8s+notification): TG_GROUP_CUTOVER=true — 所有告警全切 SRE 群組 notification_matrix TYPE-5S: DM → GROUP（SignOz 事件補齊） prod/dev ConfigMap TG_GROUP_CUTOVER: false → true Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 03:07:28 +08:00
Your Name	cc69f3ce04	fix(solver_agent): 修復 AI 信心度阻斷 + 三層 kubectl 安全防禦 Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details 修法A — 恢復 AI 決策信心度 (0.5 → 0.9) - Solver Agent 優先使用 OpenClaw NIM 的 `kubectl_command` 欄位（完整指令），略過語義合成降級 - 保留原始 0.9 信心度，告警自動化能力回復 - Root cause: 舊版在 action_title 未含 "kubectl" 時執行 min(0.9, 0.5) 降級 C1 — CRITICAL: ReDoS + 注入防禦 - 正則 `\s` → `[ ]` 避免換行符號 (\n\r) 配對（Shell 注入向量） - 加入 `re.ASCII` 與 `{1,500}` 有界量詞，防止指數級回溯 - 性能提升 7.256s → 0.015ms (48x faster) - 明文拒絕 \n \r \t \x00 C2 — CRITICAL: 繞過防禦 + 截斷攻擊 - action_title 路徑加白名單驗證（舊版跳過） - 標準候選路徑：驗證 → 截斷，防止截斷繞過 - 不安全指令自動降級至語義合成 C3 — CRITICAL: 無界長度 DoS - 新增 _KUBECTL_MAX_LEN = 500，硬上限前置檢查 - 防止長輸入導致正則超時測試覆蓋 - 35 個測試（24 回歸 + 11 新安全測試） - LF/CR/Tab/Null 注入、Shell 元字元、ReDoS 效能、邊界條件全覆蓋 - Critic 與 vuln-verifier 雙重驗證 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 03:02:58 +08:00
Your Name	39f45dd305	fix(solver): 補 import re（solver_agent 已有 re.compile 但漏 import） Some checks are pending CD Pipeline / build-and-deploy (push) Has started running Details Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 02:42:25 +08:00
Your Name	a49554c5a0	feat(hermes): 接入 polling 路徑 — @tsenyangbot @mention → Hermes NL (ADR-094) Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details _handle_group_message() 新增 Hermes NL 路由： HERMES_NL_ENABLED=true + @tsenyangbot @mention → process_nl_message() → send_hermes_reply()，不影響既有 OpenClaw/NemoClaw 路徑 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 02:42:03 +08:00
Your Name	7d1c85eb86	fix(hermes): ANTHROPIC_API_KEY 注入 + solver 信心度修法 A + 12-Agent 治理文件 Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details - nl_gateway.py: ClaudeAgentOptions 透過 env= 注入 ANTHROPIC_API_KEY（CLAUDE_API_KEY alias），修復 SDK 找不到 API key 的問題（SDK 讀 ANTHROPIC_API_KEY，K8s secret 名稱是 CLAUDE_API_KEY） - solver_agent.py: 修法 A — kubectl_command 欄位優先路徑，OpenClaw Nemo 回傳完整指令時不再被語意合成壓縮 confidence（0.9 → min(0.5) 的 bug），9 tests pass - AGENTS.md: Codex CLI 對應版 CLAUDE.md（Codex Session 啟動用） - docs/12-agent-game-rules.md: 12-Agent 任務判型 + 主責/協作派工 + 9 skills 對照（v1.0） - .agents/skills/06-awoooi-monorepo-master.md: v1.6，新增 12-agent 協作治理章節 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 02:33:43 +08:00

1 2 3 4 5 ...

789 Commits