awoooi

Author	SHA1	Message	Date
Your Name	cc547736ab	feat(wave6-8): P2.1 fusion + P2.2 governance + P2.4 consensus + Wave 7/8 BLOCKER fixes. Picks up code that multiple Wave 6/7/8 engineers finished before hitting the agent quota, and commits it to resolve a latent import error on production HEAD (decision_fusion was already imported by decision_manager but the file was untracked). Added (backend core): - decision_fusion.py (562 lines): P2.1 method III (fusion of three LLMs: OpenClaw + Hermes + Elephant) - aiops_timeline.py + aiops_timeline_service.py: critic B4 fix for the /api/v1/aiops/timeline endpoint; DB access moved into the service layer to follow leWOOOgo modularization - migrations/p2_decision_fusion_columns.sql + rollback: fusion columns on approval_records. Modified (backend integration): - decision_manager.py: repairs three broken fusion links (critic B1+B2+B3): · B1: write _evidence_snapshot_ref to token.proposal_data · B2: compute complexity_score before fusion and write it to the token · B3: write the fusion composite to token.proposal_data["decision_fusion"] - auto_approve.py: fusion + consensus awareness (critic B3+B5): · composite > 0.7 → auto_execute_eligible bypasses min_confidence · source=consensus_engine + score>=0.6 → trusted rule path - consensus_engine.py: db-fix, _save_consensus reuses agent_sessions - governance_agent.py: db-fix, _alert writes to ai_governance_events in PG - approval_db.py: 3 fusion columns + 2 partial indexes + CheckConstraint - db/models.py: schema aligned with the migration - core/config.py: vuln #1 fix: a field_validator on OLLAMA_URL/_FALLBACK_URL rejects public IPs and external domains, allowing only a private-network/loopback/K8s SVC whitelist - core/feature_flags.py: P2 fusion + consensus flags - main.py: governance_agent started in lifespan - failover_alerter.py: Wave8-X2: in-memory dedup fallback (no fail-open after Redis is refused) - ollama_*.py: metrics integration + recovery improvements - auto_repair_service.py: verifier wiring. Added (tests, 2438 lines): - test_decision_fusion.py / test_governance_agent.py / test_consensus_integration.py - test_p2_db_fixes.py / test_wave8_fusion_fixes.py - test_config_url_validation.py (vuln #1, 12 tests) - test_failover_alerter.py + Wave8-X2 in-memory dedup tests. Acceptance: 116 tests pass (decision_fusion + wave8_fusion + config_url + consensus + governance + p2_db_fixes + failover_alerter) Conflict resolution: - 3 files (config.py + auto_approve.py + decision_manager.py): git stash pop conflicts resolved by keeping the stashed version (the engineers' final version), restoring the 「公網 IP」 (public IP) wording in the ValueError to match the tests. Note: this commit resolves the latent import error on production HEAD. Still unfixed: vuln #4 prompt injection / debugger B14 quota fail-closed / B25-B26 drain_pending_tasks / B8 governance fail alert Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: Multiple Engineers (Wave 6/7/8) <noreply@anthropic.com>	2026-04-27 08:11:40 +08:00
Your Name	2c57b71db9	feat(wave5-p2): GovernanceAgent 4 self-checks + Ollama health alert rules + Prometheus metrics integration [CI: All checks were successful; CD Pipeline / build-and-deploy (push) Successful in 10m45s] MASTER plan_complete_v3.md Wave 5 P2.2 + P2.3 complete (code finished by multiple engineers before the quota limit; committed after the fact): P2.2, GovernanceAgent 4 self-checks: - governance_agent.py (342 lines): self-check loop every 1 hour: · trust_drift (trust drift detection) · knowledge_degradation (knowledge degradation detection) · llm_hallucination (LLM hallucination detection) · execution_blast_radius (execution blast radius detection) - main.py lifespan: started via asyncio.create_task(run_governance_loop()), wrapped in try/except so a scheduling failure does not block the main flow - failover_alerter.py: alert_governance(event_type, payload) with 1h dedup; four event types → Telegram MarkdownV2 alerts P2.3, Ollama health rules + Prometheus metrics: - ops/monitoring/ollama_health_rules.yaml (148 lines): · OllamaHealthDegraded / OllamaPrimaryDown · OllamaFailoverTriggered / GeminiQuotaExceeded · adds the alert rules for data scraped by Prometheus - core/metrics.py (57 lines): · GEMINI_DAILY_CALL_COUNT / GEMINI_DAILY_QUOTA Gauge · OLLAMA_FAILOVER_TRIGGERED_TOTAL Counter · OLLAMA_CURRENT_PRIMARY_IS_OLLAMA Gauge - ollama_failover_manager.py: · _check_gemini_quota: updates the Gauge on every check (so Prometheus reads the latest value) · select_provider: on failover, inc the Counter + flip the Primary Gauge · wrapped in try/except so a metric failure does not block main routing E2E tests: - test_failover_e2e_dispatch.py (365 lines), full dispatch path: health check → failover decide → alerter → metrics Tests: 54 passed (e2e_dispatch + failover_manager + failover_alerter) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: Multiple Engineers (previous session, Wave 5) <noreply@anthropic.com>	2026-04-26 20:56:19 +08:00
Your Name	02362eddcf	feat(wave4-5): P1.3+P1.4 real wiring + Ollama_188 provider registration + quota atomicity fix [CI: Some checks failed; CD Pipeline / build-and-deploy (push) Failing after 2m0s] Wave 4/5 work completed by 3 engineers before the quota limit (committed after the fact): Engineer-B3, Wave 4 P1.3+P1.4 real flywheel closed loop (auto_repair_service.py is the correct place for the wiring): - after execute_auto_repair succeeds, start PostExecutionVerifier fire-and-forget - record_verification_result triggers the EWMA trust_score evolution - snapshot=None (no dependency on EvidenceSnapshot, avoiding the B2 bug from my earlier webhooks.py patch) - _pending_tasks manages task lifecycles; lifespan shutdown waits for the tasks to finish Engineer-A4, Wave 5 B1-fix, Ollama188Provider registration: - ai_providers/ollama.py: new subclass Ollama188Provider(OllamaProvider) - name="ollama_188"; is_enabled depends on ENABLE_OLLAMA_188 + OLLAMA_FALLBACK_URL - analyze() uses OLLAMA_FALLBACK_URL (192.168.0.188:11434) as the inference endpoint - ai_router.py:_init_registry adds registry.register(Ollama188Provider()) - Fixes a BLOCKER: failover_manager decided on "ollama_188", but the executor could not find it → not_registered → 188 was never actually called. The front half of the whole Wave 2 P1.1 disaster-recovery system was stuck. Engineer-A4, Wave 5 B3-fix, Gemini quota TOCTOU fix: - ollama_failover_manager.py:_check_gemini_quota now uses redis.pipeline(). Previously GET → check → INCR → EXPIRE were four separate steps, so concurrent requests raced between GET and INCR and overshot the quota. Fix: SET NX (sets the TTL on first use) + INCR in an atomic pipeline, checking against the value returned by INCR. Engineer-B3, test_learning_chain_e2e.py (377-line No-Mock integration test): - pure Python stubs + monkeypatch (compliant with feedback_no_mock_testing.md) - execute_auto_repair succeeds → verifier is called ✓ - execute_auto_repair fails → verifier is not called ✓ - matched_playbook_id=None → logs a warning, does not crash ✓ - verifier raises → repair still returns success, trust is not updated ✓ Tests: 42 passed (failover_manager + ai_router_failover_integration all green) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: Engineer-A4 + Engineer-B3 (previous session) <noreply@anthropic.com>	2026-04-26 20:44:19 +08:00
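Your Name	75b404379b	fix(critic-h2-h4): rename proactive_inspector metrics + probe_success fallback [CI: Some checks failed; CD Pipeline / build-and-deploy (push) Failing after 2m7s] H2, metric semantics switch polluted the baseline: - cpu_usage_awoooi_api → cpu_usage_node_188 - memory_usage_awoooi_api → memory_usage_node_188 The original metric_name mapped to the container working set; the new PromQL is a node-level ratio (the replacement after cadvisor stopped). The semantics are completely different but the name was kept, so the σ that the existing DynamicBaseline model learned in the old units is distorted for the new values, and the 5-minute inspector cycle would spam false anomalies. After the rename the baseline learns from scratch; while the sample count is too low, the _has_enough_samples guard skips alerting, safely covering the 30-cycle warm-up period. H4, false trigger when every probe_success target is unreachable: old expr: 1 - avg(probe_success); new expr: 1 - avg(probe_success or on() vector(1)). With the original expr, when all Blackbox targets are lost, avg returns an empty vector → if _fetch_current_value treats empty as 0 → 1-0=1, far above the 0.05 threshold → a false alert every 5 min. The fallback treats this case as all-success (value=1, 1-1=0); a real probe-down is detected by the separate BlackboxProbeFailure rule (separation of responsibilities). Post-deploy verification: - the baseline table gains rows with metric_name='memory_usage_node_188' / 'cpu_usage_node_188' - rows with the old metric_name='memory_usage_awoooi_api' / 'cpu_usage_awoooi_api' can be cleaned up after 30 days - within 30 cycles, proactive_inspection_logs should show _baseline_warmup_skipped entries rather than false anomalies Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 20:40:57 +08:00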
Your Name	32affaffeb	fix(critic-hotfix): 4 fixes for critic BLOCKER + HIGH findings (CD blocked + flywheel idling) [CI: Some checks are pending; CD Pipeline / build-and-deploy (push) Has started running] After a full review of 6 commits, the critic found: CD blocker fix: - test_ai_router_failover_integration.py: 3 tests now use patch.object to mock _select_provider_and_model directly and force the initial provider to OLLAMA. The original IntentType.UNKNOWN mock was still reclassified inside the router as DIAGNOSE → openclaw_nemo, so failover never triggered. → 5/5 PASSED BLOCKER B1, Gitea Telegram notifications could never be sent: - apps/api/src/api/v1/gitea_webhook.py:399 redis = await get_redis() → redis = get_redis(). The original await raised a TypeError that the outer except swallowed → Task C PR-merged + workflow_run failure notifications all failed (the green CI was an illusion: the test only checked HTTP 202, not actual delivery) BLOCKER B2, P1.3+P1.4 learning-chain loop idling (same bug in two places): - apps/api/src/api/v1/webhooks.py:261 - apps/api/src/services/approval_execution.py:771 (pre-existing) EvidenceSnapshot.get_latest_snapshot(...) is a module-level async function, not a classmethod → the AttributeError was swallowed by except as a warning → the flywheel loop appeared connected but actually ran empty (the feature flag defaults to off, so nothing blew up for now) HIGH H3, main.py lifespan ordering race: - apps/api/src/main.py: configure_alerter() moved before _recovery_svc.start(). Original order: start() triggers an immediate check → may call alert_recovery while the alerter has not yet been given Redis → dedup fails open, risking duplicate alerts. HIGH H1, Gemini quota dedup swallowed alerts across days: - apps/api/src/services/failover_alerter.py:89 the dedup key gets a :{YYYY-MM-DD} suffix, giving each day its own dedup window. Previously, after an alert at 22:00 yesterday, a new trigger at 21:30 today was swallowed because the dedup had not yet expired. Tests: 14 passed (failover_alerter + ai_router_failover_integration + lifespan_wiring) Deferred follow-ups: - H2: proactive_inspector memory metric rename + baseline cleanup - H4: probe_success NaN fallback - M1-M4 / S1-S2: see the critic report Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 20:39:53 +08:00
Your Name	dcf2750b2b	feat(p1.5): FailoverAlerter integration points 3+4 + 6 test cases completed [CI: Some checks failed; CD Pipeline / build-and-deploy (push) Failing after 1m32s] P1.5 wrap-up (specified in the status document, lines 96-99): Integration point 3, failover_manager triggers the Gemini quota alert: - ollama_failover_manager.py: when _check_gemini_quota returns False, call alerter.alert_gemini_quota_exceeded({quota, current_count}) - current_count is read from the Redis key ollama:gemini_daily_count:{date} (fail-soft) - 24h dedup inside the alerter (QUOTA_DEDUP_TTL_SEC=86400), sent only once per day - wrapped in try/except: alert failures fail open and do not block routing Integration point 4, main.py lifespan injects the Redis client: - after _recovery_svc.start() and before yield - calls configure_alerter(get_redis()) to replace the singleton and enable dedup - wrapped in try/except: an injection failure fails open (the alerter still works but dedup is disabled) New tests (174 lines, 6/6 pass): - test_alert_failover_dedup: a second alert for the same to_provider is suppressed by the 10min dedup ✅ - test_alert_recovery_send: normal send + Markdown message + N consecutive HEALTHY ✅ - test_no_telegram_chat_id_noop: fail-soft without raising when chat_id is missing ✅ - test_quota_alert_dedup_24h: TTL=86400s, message contains quota+count ✅ - test_configure_alerter_replaces_singleton: redis is available after lifespan injection ✅ - test_dedup_fail_open_when_no_redis: Redis None → send allowed ✅ Mock note: _send() imports telegram_gateway/get_settings inline, so the mock target must be src.services.telegram_gateway / src.core.config, not the alerter module itself. Regression: the original 37 ollama_failover_manager + 3 lifespan_wiring tests are all green. Flywheel autonomy score: ~75 → estimated ~80 (quota exhaustion now raises an alert; ops visibility +5) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 20:28:29 +08:00
Your Name	fd40b79db4	feat(p0.6+p1.3+p1.4): last mile of the flywheel closed loop + three ProactiveInspector PromQL fixes [CI: Some checks failed; run-migration / migrate (push) Failing after 17s; Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 47s; CD Pipeline / build-and-deploy (push) Failing after 1m50s] P0.6 ProactiveInspector PromQL label fixes (Engineer-B): - http_error_rate: blackbox_probe_success → probe_success (the metric name actually observed) - cpu_usage_awoooi_api: cadvisor up=0 (stopped) → switched to node-exporter node_cpu_seconds_total - memory_usage_awoooi_api: cadvisor stopped → node-exporter memory usage ratio P1.3+P1.4 last mile of the flywheel closed loop (Engineer-B2): - webhooks.py:_try_auto_repair_background wires in PostExecutionVerifier - guarded by the feature flag AIOPS_P1_POST_EXECUTION_VERIFIER (default off, can be enabled gradually) - 60s timeout + triple try/except protection (timeout / general exception / outer exception) - asyncio.wait_for + EvidenceSnapshot.get_latest_snapshot - adds the learning_service.record_verification_result call - matched_playbook_id taken from result.playbook_id - triggers the EWMA trust_score evolution (closing the flywheel loop) - symmetric with the manual-approval path approval_execution._run_post_execution_verify ADR mapping: ADR-081 Phase 1 (Verifier) + ADR-083 Phase 3 (Learning) plan_complete_v3.md L5/L6 stages: ⚠️ → ✅ (flywheel autonomy score estimated +12 points) Note: the feature flag defaults to off → no immediate effect on production behavior; critic review + production E2E verification are required before enabling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 20:20:11 +08:00
Your Name	e96055eef9	fix(p0.4): three fixes for the Playbook learning chain: partial index + race protection + manual-path wiring. A DB / Repository / Service three-layer patch for the ADR-092 P0.4 Playbook EWMA learning loop. DB layer (db-expert-fix by Engineer-B): - ApprovalRecord.matched_playbook_id drops index=True in favor of a partial index in __table_args__ (WHERE matched_playbook_id IS NOT NULL); most rows are NULL, so a full index wastes space - adr092_p1_learning_chain_rollback.sql: pure ROLLBACK SQL (run manually by the DBA) Repository layer: - playbook_repository.py: SELECT FOR UPDATE prevents lost updates, so concurrent EWMA updates do not overwrite each other Service layer (P0.4 fix): - proposal_service.py: the manual-approval path now calls _try_playbook_match_id. The decision_manager auto_execute path already has this logic (line 2035); this closes the gap on the manual path so matched_playbook_id can be written to the DB, and only then can the EWMA evolve. Tests: - test_playbook_repository_race_condition.py: 3 cases; the SELECT FOR UPDATE race protection correctly blocks concurrent EWMA updates (pass) Note: the migration SQL is to be run manually by the DBA (feedback_dev_prod_separation.md); do not run alembic upgrade (prohibited by the status document). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 20:19:46 +08:00
Your Name	55c6b4e2d9	feat(p1): Ollama multi-layer disaster-recovery system: P1.1 health checks + P1.2 ai_router integration + P1.5 failover alerts. The Ollama failover subsystem of the ADR-092 P1 flywheel loop, all added by Engineer-A2/C/C2. New services (1581 lines): - ollama_health_monitor.py (356): 3-layer health check (TCP/HTTP/inference) - ollama_failover_manager.py (571): automatic 111→188 switchover + Redis persistence + recovery callback - ollama_auto_recovery.py (436): 30s background monitoring + 3 consecutive HEALTHY → switch back + clear_cache - failover_alerter.py (218): P1.5 Telegram failover alerts. Service integration: - ai_router.py: AIProviderEnum.OLLAMA_188 + 120s budget + failover fallback chain - main.py lifespan: wires the callback + starts recovery at startup, stops gracefully at shutdown - config.py: OLLAMA_FALLBACK_URL / OLLAMA_HEALTH_CHECK_MODEL / GEMINI_DAILY_QUOTA (billing circuit breaker) K8s config: - 04-configmap.yaml.patch-188-fallback: injects OLLAMA_FALLBACK_URL=http://192.168.0.188:11434 Tests (2082 lines): - test_ollama_health_monitor.py (402) - test_ollama_failover_manager.py (707) - test_ollama_auto_recovery.py (580) - test_ai_router_failover_integration.py (257) - test_lifespan_failover_wiring.py (136) Dependency chain: the three services + ai_router + main.py must be committed together; missing any one causes an ImportError. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 20:18:33 +08:00
Your Name	bb12647e8d	feat(telegram): group alert cards get the full set of interactive buttons (approve/reject/silence/details/re-diagnose/history) [CI: All checks were successful; CD Pipeline / build-and-deploy (push) Successful in 9m7s] - _send_approval_card_to_group gains alert_category + notification_type parameters - group cards now use _build_inline_keyboard (the same full six-button layout as DMs) - send_approval_card → _send_approval_card_to_group passes both parameters - TYPE-1 notifications get read-only details/history buttons (compliant with the ghost-button iron rule) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 10:31:27 +08:00
Your Name	cbd28e29a0	fix(solver+incident): two P0 configuration fixes: Gitea non-K8s filtering + backup alert age escalation [CI: All checks were successful; CD Pipeline / build-and-deploy (push) Successful in 8m57s] L3 fix summary (2026-04-25): [Fix 1] Gitea cross-boundary kubectl filtering (solver_agent.py) Root cause: the GiteaMemoryPressure alert triggered the Solver → the LLM generated 'kubectl scale deployment gitea', but Gitea runs under docker-compose on the host, not in the awoooi-prod K8s namespace → execution was bound to fail. Changes: - add _filter_non_k8s_targets(), which validates the target of scale/restart/delete/patch commands - add the constants _KUBECTL_MUTATING_VERBS / _KUBECTL_ROLLOUT_MUTATING_SUBVERBS - _solve() calls _fetch_k8s_inventory() to get the actual deployment inventory - post-filter: candidates whose target is not in the inventory and that use a mutating verb are dropped with a warning. Expected behavior: GiteaMemoryPressure → the Solver now generates investigative kubectl commands (get/describe) instead of scale [Fix 2] HostBackupFailed misclassification escalation (incident_service.py + webhooks.py) Root cause: backup failures older than 24h were tagged TYPE-1 (informational only), so a card without buttons was sent silently and no auto-repair was triggered. Changes: - incident_service.py classify_alert_early() gains an age_hours parameter - add _BACKUP_AGE_UPGRADE_NAMES + _BACKUP_AGE_THRESHOLD_HOURS=24.0 - if alertname in (HostBackupFailed/Stale/Missing) and age > 24h → escalate to TYPE-3 - webhooks.py computes age_hours from alert.startsAt and passes it to classify_alert_early(). Expected behavior: HostBackupFailed 25h+ → escalated to TYPE-3, triggering LLM analysis + a P0 auto-repair suggestion. Test results: - solver_agent: 35/35 tests PASSED ✅ - incident_service: 11/11 tests PASSED ✅ - incident_api integration: 7/7 tests PASSED ✅ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 09:48:04 +08:00
Your Name	6baa5054bc	fix(auto-execute): fix kubectl pattern blocking + add the missing auto_execute KM write [CI: Some checks failed; CD Pipeline / build-and-deploy (push) Has been cancelled] Problem 1: _ALLOWED_KUBECTL_PATTERN did not allow a resource type keyword. Root cause: the LLM emits "kubectl rollout restart deployment clickhouse" but the pattern only allowed "kubectl rollout restart clickhouse" (no deployment keyword). Result: _action_safe=False → auto_execute_blocked_unresolved_placeholder → all low/medium risk alerts were downgraded to manual approval, and the flywheel stopped completely. Fix: the pattern adds an optional resource type group (deployment/pod/service/...) + the re.ASCII flag to prevent unicode bypass; 12/12 test cases pass. Problem 2: KM write broken on the auto_execute path. Root cause: _write_execution_result_to_km was only called on the manual-approval path. Fix: after auto_execute completes, add _fire_and_forget(executor._write_execution_result_to_km) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 09:47:35 +08:00
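Your Name	f9f2263c00	fix(execution-feedback): fix three layers of P0 failures that completely broke automated execution feedback [CI: All checks were successful; CD Pipeline / build-and-deploy (push) Successful in 8m57s] Background: users reported the execution status stuck at 「⚡ 執行中...」 (executing) and never reporting back, paralyzing the auto-repair mechanism (after the confidence fix, executions failed but no Telegram card could be pushed) L1, post-verify AttributeError (2 places) - approval_execution.py:757, 1010 called the nonexistent method IncidentService.get_incident() - correct methods: get_from_working_memory() with fallback to get_from_episodic_memory() - impact: the post-verify logic was silently swallowed by an exception, stalling the downstream Telegram push L2, notification provider not configured - new notifications/telegram.py: reuses the existing TelegramGateway.send_notification() - manager.py: registers TelegramWebhookProvider at initialization - impact: no provider sent a push after execution finished, so results never showed up in Telegram L3, Solver Agent semantic synthesis generated incomplete commands - old logic: action_title="重啟服務" (restart service) → synthesized "kubectl rollout restart deployment -n awoooi-prod" (name missing) - the downstream operation_parser could not parse it (the regex requires deployment/<name>) - fix: prefer the target field extracted from parsed; with no name, return [] and fall back to read-only investigation commands - all tests pass: 35/35, including 11 new security tests - blocked malicious kubectl_command values now correctly fall through to the semantic synthesis path - with no target name an empty list is returned instead of an incomplete command - the Telegram execution-result push chain is now complete. Expected effect: - execution failure → an 「❌ 執行失敗」 (execution failed) Telegram card arrives immediately (L1 + L2 fixes) - automated decisions follow the whitelist and avoid generating unexecutable commands (L3 fix) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 03:29:38 +08:00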
Your Name	c14f23b33a	feat(k8s+notification): TG_GROUP_CUTOVER=true, all alerts switched to the SRE group. notification_matrix TYPE-5S: DM → GROUP (fills in SignOz events); prod/dev ConfigMap TG_GROUP_CUTOVER: false → true Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 03:07:28 +08:00
Your Name	a49554c5a0	feat(hermes): hook into the polling path: @tsenyangbot @mention → Hermes NL (ADR-094) [CI: Some checks failed; CD Pipeline / build-and-deploy (push) Has been cancelled] _handle_group_message() adds Hermes NL routing: HERMES_NL_ENABLED=true + @tsenyangbot @mention → process_nl_message() → send_hermes_reply(), without affecting the existing OpenClaw/NemoClaw path Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 02:42:03 +08:00
Your Name	86ee013cdf	feat(hermes-complete): three Hermes NL enhancements + ConsensusEngine + ADR wrap-up [CI: All checks were successful; CD Pipeline / build-and-deploy (push) Successful in 9m32s] ## Hermes NL enhancements (nl_gateway.py) - T1 hermes_dispatch_log DB write (asyncio.create_task, non-blocking) - T2 Redis rate limit: 20 req/min per chat_id, fail-open - T3 multi-turn session: hermes:session:{chat_id}:{user_id} TTL=300s, last 3 turns ## ConsensusEngine (ADR-095 declarative design) - consensus_engine.py: CONSENSUS_WEIGHTS class attribute, security=0.4 locked, the remaining 0.6 split across 9 Claude Code agents - config.py: ENABLE_12AGENT_CONSENSUS=False feature flag ## ADR status - ADR-093/094/095: Proposed → 🟡 approved, implementation in progress - each ADR gains a v1.1 changelog ## K8s ConfigMap - prod 04-configmap.yaml: adds 3 feature flags (all false) - dev 02-configmap.yaml: added in sync ## LOGBOOK - records WS0–WS6 + enhancements as complete, with guidance on enabling the feature flags Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 02:22:40 +08:00
Your Name	2572ec46d2	feat(ws4): Hermes NL natural-language interface: 12-Agent Claude SDK integration (ADR-094/095) ## hermes/ package (5 new modules) ### display_names.py - visual identity table for the 12 agents (emoji + hashtag + handle + short_name) - format_response_header() produces the Telegram prefix ### agent_loader.py - parses .claude/agents/*.md frontmatter → system prompt - lru_cache avoids re-reading files ### safety_hooks.py - ports the 20 HARD BLOCK rules from awoooi-guard.js (DENY_PATTERNS) - 5 MUTATE_PATTERNS → must go through the approval flow ### nl_gateway.py - Layer 1: keyword regex routing (12 rules, <10ms) - Layer 3: DEFAULT_AGENT = "debugger" - Claude Agent SDK query() async streaming, taking ResultMessage.result - safe degradation: SDK error → friendly error message ### telegram_webhook.py - WS4 Hermes NL hook (triggered by an @tsenyangbot mention or a private message) - HERMES_NL_ENABLED=False (feature-flag protected, off by default) ## telegram_gateway.py - send_hermes_reply(text, chat_id, reply_to_message_id): no 500-character truncation, supports long Agent replies ## config.py - HERMES_NL_ENABLED: bool = False - TELEGRAM_BOT_USERNAME: str = "tsenyangbot" Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 02:10:06 +08:00
Your Name	294e0e3387	feat(ws3): ADR-093 Callback User-ID Binding + ADR-094 webhook entry point ## T3.1/T3.2 Bound User Check (security_interceptor.py) - verify_callback() Step 0: check Redis cb_bind:{nonce} → if a binding exists and caller != bound_user_id → UserNotWhitelistedError → if the key does not exist (old format) → fall back to the whitelist (backward compatible) → if Redis is unavailable → continue in degraded mode (safe degradation) - bind_callback_user(nonce, user_id): async method, TTL=48h ## T3.3 Telegram webhook entry point (ADR-094) - apps/api/src/api/v1/telegram_webhook.py (new) POST /api/v1/telegram/webhook - validates the X-Telegram-Bot-Api-Secret-Token header - TELEGRAM_WEBHOOK_SECRET="" → skipped in dev (does not break existing tests) - placeholder reserved for the WS4 Hermes NL hook ## T3.4 config.py - new TELEGRAM_WEBHOOK_SECRET field (default empty string) ## main.py - mounts telegram_webhook_v1.router at /api/v1 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 02:10:06 +08:00
Your Name	6d5fd3c124	feat(ws2): ADR-093 routing unification: BIGINT + NotificationMatrix + feature flag ## Fixes ### T2.1 BigInteger overflow fix - `db/models.py`: telegram_chat_id Integer → BigInteger (the original int32 cannot hold the group ID -1003711974679) ### T2.2 Remove the CAST workaround - `approval_db.py:739`: remove CAST(:telegram_chat_id AS BIGINT); the ORM now uses BigInteger correctly, so the workaround can be retired ### T2.3 Redis key consistency fix - `heartbeat_report_service.py:575`: telegram:polling_leader → telegram:polling:leader (telegram_gateway.py uses colon separators; the underscore in heartbeat was a bug) ## Added ### T2.4 notification_matrix.py - `services/notification_matrix.py`: ADR-093 routing matrix - Destination(DM/GROUP/BOTH) + RoutingRule dataclass - NOTIFICATION_ROUTING dict (full mapping for TYPE-1 through TYPE-8M) - resolve_chat_ids(type, dm, group, *, tg_group_cutover=False): gradual cutover API ### T2.5 telegram_gateway.py feature-flag protection - line 43: add the notification_matrix import - line 1827-1834: with TG_GROUP_CUTOVER=False the old behavior is kept; with TG_GROUP_CUTOVER=True the _interactive_types blacklist is lifted and the matrix takes control ### T2.6 Migration SQL - `migrations/adr093_notification_routing.sql`: - CREATE TABLE approval_records (telegram_chat_id BIGINT) - CREATE ROLE awoooi_migrator (IF NOT EXISTS) - includes ALTER COLUMN int→bigint protection for old environments ## Test sync - `tests/integration/setup_test_schema.sql`: telegram_chat_id BIGINT Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 02:10:06 +08:00
Your Name	55f111e0e3	fix(aiops): correct host alert fallback and resolved stamp [CI: All checks were successful; CD Pipeline / build-and-deploy (push) Successful in 8m54s]	2026-04-25 00:14:07 +08:00
Your Name	0d81b28b1b	fix(aiops): bound phase2 timeout and repair incident links [CI: All checks were successful; E2E Health Check / e2e-health (push) Successful in 52s; CD Pipeline / build-and-deploy (push) Successful in 9m24s]	2026-04-24 23:53:56 +08:00
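Your Name	e75e4678a9	feat(p2.4): Telegram intermediate-state push: "analyzing" placeholder card, auto-deleted on completion [CI: Some checks failed; CD Pipeline / build-and-deploy (push) Has been cancelled] P2.4 implementation 2026-04-24 ogt + Claude Sonnet 4.6 Problem: LLM analysis takes 10-30s, during which Telegram shows no response and users cannot tell that the system is working. Fix: - telegram_gateway.py: new send_analyzing_placeholder(), sends an 「AI 正在分析中...」 (AI is analyzing) placeholder card - telegram_gateway.py: new delete_message(), deletes the placeholder card - webhooks.py: send the placeholder card within 3s before LLM analysis (a timeout does not block the main flow) - webhooks.py: _push_to_telegram_background receives placeholder_message_id → deletes the placeholder after the full card is sent - webhooks.py: import asyncio (was missing) Effect: users see the 「分析中...」 (analyzing) message within <3s of the alert arriving; it is cleared automatically once the full card appears Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 15:56:26 +08:00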
Your Name	bb5f16f8ef	fix(aiops-p2): P2.1 three LLM quality fixes: Evidence-First + consensus confidence + raw_evidence injection [CI: Some checks failed; CD Pipeline / build-and-deploy (push) Has been cancelled] Root causes: - all four consensus_engine ExpertAgents had confidence=0.0 → weighted vote total=0 → always returned NO_ACTION - prompts.py had no Evidence-First directive → the LLM reasoned from memory with no real-environment constraints - openclaw.py analyze_alert built the prompt without injecting the MCP evidence (diagnosis_context) Fixes: - consensus_engine: SRE/Security/Cost/Performance set confidence to 0.45~0.80 according to signal strength - consensus_engine: _normalize_action adds the alias 「重新啟動」 (restart) → RESTART - consensus_engine: SecurityAgent removes the unused _target variable - prompts.py: adds Evidence-First Protocol + Skepticism Rules sections - openclaw.py: analyze_alert extracts diagnosis_context → injected into full_prompt as <raw_evidence> Verification: consensus score went from 0.0 → 0.744 (CrashLoop test case) P2.1 fix 2026-04-24 ogt + Claude Sonnet 4.6 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 15:52:25 +08:00
Your Name	04ff22563e	fix(aiops-p1): all 5 breakpoints of the Playbook learning loop fixed + DB migration (ADR-092 B4) [CI: Some checks failed; run-migration / migrate (push) Failing after 14s; CD Pipeline / build-and-deploy (push) Failing after 2m7s] [P0.4 patch] pre_decision_investigator Prometheus query field missing - _build_tool_params() adds the "query" field (required parameter of the prometheus_query tool) - new _build_prometheus_query(): generates PromQL by alert type (CPU/Memory/Crash/Disk/HTTP/Pod/fallback) - after the fix the D3_METRICS sensing dimension actually gets data (previously 100% returned missing_query_parameter) [P1 Playbook learning loop B1-B5 all fixed] - B2 db/models.py: ApprovalRecord adds the matched_playbook_id column + ix_approval_matched_playbook index - B2 db/models.py: TimelineEvent adds the incident_id column (for MCP auditing) + index - B3 approval_db.py: record→ApprovalRequest restores incident_id + matched_playbook_id - B4 approval_repository.py: same as B3 (the two conversion functions must stay in sync) - B5 approval_db.py: approval_request_to_record_data adds matched_playbook_id → only then can the DB store the value [P1.5 KM write] approval_execution.py: fire-and-forget → await wait_for(30s) - root cause: the asyncio.create_task was killed on Pod recycle, silently losing KM writes - fix: await asyncio.wait_for(..., timeout=30.0) + TimeoutError log [Migration file] adr092_p1_learning_chain_fix.sql - ALTER TABLE approval_records ADD COLUMN matched_playbook_id VARCHAR(36) - ALTER TABLE timeline_events ADD COLUMN incident_id VARCHAR(64) - run: psql $DATABASE_URL -f apps/api/migrations/adr092_p1_learning_chain_fix.sql [Incidental Agent changes] - decision_manager: Phase 2 YAML NO_ACTION priority gate (host-level/external services skip Agent Debate) - alert_rules.yaml: new rules for Sentry/ClickHouse + HostDiskUsageHigh/Critical - solver_agent: action_title semantic-synthesis fallback (replaces silent dropping) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 15:41:35 +08:00
Your Name	7f4088bcd0	fix(aiops-p0): comprehensive P0 fix of six root problems (ADR-092 B4) [P0.1] knowledge_extractor_service.py:210, AttributeError fix - the Signal.description field does not exist (100% failure; root cause of KM +5 per day) - use text concatenated from alert_name + annotations.summary instead [P0.2+P0.3] Gate 9+11 relaxed for read-only commands - blast_radius_calculator: kubectl get/top/describe/logs/version → score=1 (not 50) - operation_parser: recognizes the INVESTIGATE type (read-only kubectl no longer returns None) - executor.py: OperationType adds the INVESTIGATE enum - approval_execution.py: the INVESTIGATE path calls execute_kubectl_command directly [P0.4] MCP SSH/K8s Provider fixes - decision_manager: params= → parameters= (matches the MCPToolProvider.execute signature) - decision_manager: MCPToolResult .get() → .success/.output (dataclass usage) - decision_manager + ssh_provider: add hosts 120/121 (missing from the original default) - auto_approve: source phase2_agent_debate bypasses the confidence threshold [P0.5] alert rule semantic contradictions fixed - alert_rules.yaml: 8 kubectl query rules changed from RESTART_DEPLOYMENT → NO_ACTION (CrashLoopBackOff/PostgreSQL connections/slow queries/MinIO disk/K3s nodes/alert pipeline/SSL/CoreDNS, etc.) - incident_service.py: cAdvisor/CoreDNS split out of general into their own categories [P0.6] proactive_inspector dynamic baseline PromQL fully fixed - all 5 MONITORED_METRICS PromQL expressions corrected (cadvisor label/datname/blackbox) - db_connection_pool: datname="awoooi" → "awoooi_prod" - http_error_rate: invalid http_requests_total → blackbox probe_success - cpu/memory: namespace label → name=~"k8s_api_awoooi-api.*" Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 15:32:23 +08:00
Your Name	45dbe07188	fix(flywheel): six capability fixes for the automation flywheel (ADR-092 B3) [CI: Some checks failed; run-migration / migrate (push) Failing after 22s; Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 53s; Type Sync Check / check-type-sync (push) Successful in 2m54s; CD Pipeline / build-and-deploy (push) Has been cancelled; Ansible Lint / lint (push) Has been cancelled] [Root-cause chain fix] MCP Provider bugs → PreDecisionInvestigator fails → Agent Debate has no context → LLM times out → description="待分析" (pending analysis) → blocked by the ADR-091 iron gate → tg_sent not set → the W-2 Watchdog falsely reports a "silent failure" [Six fixes] 1. Three MCP Provider bug fixes - ssh_provider: asyncssh.run() → conn.run() - prometheus_provider: KeyError 'query' → tolerant .get() - k8s_provider: empty pod_name → early return of an error dict 2. Agent Debate / decision quality - decision_manager: the timeout fallback text is changed to an explicit description (gets past the ADR-091 iron gate) - intent_classifier: on LLM timeout, fall back to keyword classification (not None) 3. Watchdog false-positive fix (ADR-092 B3) - W-2: tg_sent Redis TTL → telegram_message_id IS NULL (DB ground truth) - W-5 new: suggested_action IN empty/待分析/NO_ACTION + tg_id IS NULL - approval_timeout_resolver: 60min → 15min, batch 50 → 200 4. Config drift automation - drift_adopt_service: auto_adopt_if_safe() six-condition safety gate - drift.py: the background task tries auto-adoption first, then sends the manual Telegram card 5. Playbook flywheel stability - playbook_seed_service: idempotency fixed (deprecated entries are not treated as missing) - playbook_evolver: loads only DRAFT+APPROVED (not all 294 entries) 6. Observability - alert_rule_engine: auto_rule structured logging + Redis counters (pipeline) - auto_approve: Redis counters for reject reasons - heartbeat_report_service: new 「⚙️ 自動化統計（今日）」 (automation stats, today) section [Pending manual execution] psql $DATABASE_URL -f apps/api/migrations/cleanup_duplicate_deprecated_playbooks.sql Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 10:55:50 +08:00
Your Name	9244c5e845	feat(heartbeat): system report gains 5 dynamic sections [CI: All checks were successful; CD Pipeline / build-and-deploy (push) Successful in 13m50s] Adds alert pipeline (24h), DB/Redis status, K8s Pods, Scanner status, and Telegram Bot sections. Each section is probed in parallel with asyncio.gather(return_exceptions=True), so one failure does not affect the others. Adds the AlertPipelineStats/DbRedisStats/PodInfo/ScannerStats/TelegramBotStats dataclasses. _build_warnings() adds checks for DB/Redis anomalies, PENDING>10, and Pods that are not ready or have high restart counts. report_to_telegram_html() outputs the 5 corresponding new HTML sections. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 09:29:16 +08:00
Your Name	88af639651	fix(report): fix inconsistent casing of approval_records.status [CI: All checks were successful; CD Pipeline / build-and-deploy (push) Successful in 9m46s] The DB stores the enum name via SQLEnum (EXECUTION_FAILED, uppercase), not the enum value (execution_failed, lowercase). The SQL adds UPPER(status::text) so it matches regardless of case. Verification: a live DB query gives success=0, failed=2 (previously always 0/0) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 09:10:39 +08:00
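Your Name	6810ab359d	fix(report): two root-cause fixes: duplicate daily report + auto-repair rate stuck at 0% [CI: Some checks failed; CD Pipeline / build-and-deploy (push) Has been cancelled] Problem 1: the daily inspection report was sent multiple times (each Pod ran its own daily job) - root cause: run_daily_report_loop was not wired to the leader lock. Every other scanner (capacity/hermes/compliance) calls try_acquire_daily_lock; only the daily report loop lacked it - fix: add try_acquire_daily_lock("daily_report") after asyncio.sleep; Pods that fail to get the lock continue and wait for the next 08:00 Problem 2: the auto-repair success rate was always 0.0% - root cause: _collect_repair_stats queried incidents.outcome->>'execution_success', but the whole execution chain (approval_execution.py NO_ACTION + real execution) never wrote execution_success back into the incidents.outcome JSON, so the query always returned 0 - fix: query approval_records.status (EXECUTION_SUCCESS / EXECUTION_FAILED) instead; it is the only source of truth that is written reliably Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 09:03:44 +08:00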
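Your Name	1625e7bd19	fix(telegram): two root-cause fixes for silent button replies [CI: All checks were successful; CD Pipeline / build-and-deploy (push) Successful in 17m40s] Problem 1: ai_advisory_* buttons (capacity forecast/compliance, etc.) - pressing them only showed a toast (gone in 2-3 seconds); the group never got a reply - fix: _handle_ai_advisory_action gains a message_id parameter; after answer_callback it also sends a sendMessage reply to the original card. Problem 2: tapping 「批准」 (approve) again on an already-resolved alert - sign_approval early-returns (status != pending) but _notify_approval_result still sent 「⚡ 執行中...」 (executing) → with no follow-up ever - fix: send 「執行中...」 only when approval.status == APPROVED; other terminal states send 「ℹ️ 此告警已處理（狀態：...）」 (this alert has already been handled, status: ...) and return Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 01:57:55 +08:00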
Your Name	479f8d8971	refactor(tests): tech debt cleared: remove FakeRepo/FakeSession mock-DB violations [CI: Some checks failed; CD Pipeline / build-and-deploy (push) Failing after 35s] ## ai_router.py - extract the pure function _aggregate_feedback_stats(), which feedback_from_aider_events calls ## aider_event_processor.py - _process_one gains a _session_factory=None DI parameter (defaults to get_session_factory()) - a test factory can be injected without changing existing production logic ## test_ai_router_feedback.py (fully rewritten) - FakeRepo/FakeSession removed; tests the pure function _aggregate_feedback_stats directly - adds the test_feedback_skips_missing_model edge case - the DB-failure degradation test is kept (it only patches get_session_factory, no FakeRepo) ## test_aider_event_processor.py (fully rewritten) - FakeRepo/FakeSession removed; uses real PostgreSQL (real_factory fixture) - Redis xack + IncidentEngine remain mocked (external broker/AI service, an allowed exception) - rollback after each test, so the dev DB is not polluted ## setup_test_schema.sql - adds the aider_events_payload_gin GIN index (consistent with the adr091 production migration) ## integration/conftest.py - adds a comment explaining the historical confusion around the password name awoooi_prod_2026 - fixes the assert logic: it checks the DB name rather than the URL string, avoiding false positives when the password contains prod Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 01:33:30 +08:00
Your Name	4fc1f49dca	fix(pipeline): fix three break points: SLO formula + NO_ACTION backlog + hallucination-downgrade risk [Checks passed: CD Pipeline / build-and-deploy (push) succeeded in 14m3s] D1 flywheel_stats_service: the execution_count column does not exist → read success_count+failure_count instead; removes the false impression that the flywheel execution success rate is always 0.0%. D2 openclaw._validate_deployment_inventory: after a hallucinated deployment is downgraded, the original HIGH/CRITICAL risk was not reset → add result.risk_level = AIRiskLevel.LOW. D3 webhooks.py (two alert paths): the three non-destructive actions NO_ACTION/INVESTIGATE/OBSERVE are forced to risk_level = LOW, skip Telegram approval and are auto-approved directly → the NO_ACTION handler in approval_execution.py immediately marks EXECUTION_SUCCESS. Root-cause chain: after the BUTTON_DATA_INVALID fix the TG buttons can be sent, but the 35 PENDING NO_ACTION records had piled up because HIGH risk could not take the auto-approve path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-21 22:26:07 +08:00
Your Name	8fd31eca66	fix(telegram): compress the nonce UUID with base64url: fully resolves BUTTON_DATA_INVALID [Checks passed: CD Pipeline / build-and-deploy (push) succeeded in 9m45s] The previous fix (truncate random) was incomplete: host_restart_service (20 chars) is still 68 bytes > the 64 limit even without random. Fundamental fix: UUID (36 chars) → base64url-encode the UUID bytes → 22 chars. Nonce format: {action}:{b64url_uuid}:{timestamp}:{random}. Longest case: host_restart_service(20)+22+10+8+3 colons = 63 bytes. generate_callback_nonce: UUID → 22-char base64url. parse_callback_data: 22-char b64url → restores the full UUID, so handlers need no change. All actions verified: approve/silence/reject/docker_restart/host_restart_service/renew_cert are all ≤ 63 bytes, and the UUID round-trips correctly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-21 21:30:20 +08:00
Your Name	bd735482f7	fix(telegram): BUTTON_DATA_INVALID: root-cause fix for a nonce exceeding 64 bytes [Checks failed: CD Pipeline / build-and-deploy (push) was cancelled] Root cause: Telegram callback_data is limited to 64 bytes. Five long action names (docker_restart/host_restart_service, etc.) + UUID approval_id = 71-77 bytes → BUTTON_DATA_INVALID. Fix: 1. security_interceptor.generate_callback_nonce: if the nonce > 63 bytes, switch to the 3-part format (drop random); the timestamp still provides time uniqueness. 2. security_interceptor.parse_callback_data: accept the 3-part or 4-part format. 3. telegram_gateway: remove debug payload logging (diagnosis complete). Affected actions: docker_restart / host_restart_service / host_clear_log / reload_nginx / renew_cert (all > 7 chars + UUID = 64 bytes or more). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-21 21:17:49 +08:00
Your Name	685f5c684f	debug(telegram): log full payload on 4xx to diagnose BUTTON_DATA_INVALID [Checks passed: CD Pipeline / build-and-deploy (push) succeeded in 13m29s] The previous response_body already confirmed the error code; this commit logs the full payload (payload_preview, first 1000 bytes) to find the exact field that triggers BUTTON_DATA_INVALID. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-21 20:56:28 +08:00
Your Name	acab1cd95e	fix(gitea-review): PR/push AI analysis always failing: fix two root causes [Checks passed: CD Pipeline / build-and-deploy (push) succeeded in 17m26s] Root cause 1 (push review): local_code_review_service.review_push() returns a dict, but the caller accesses analysis.issues directly → AttributeError. Fix: _call_openclaw_push_review converts the dict to a CodeReviewResult. Root cause 2 (PR review): openclaw_http_service calls /api/v1/analyze/code-review, but OpenClaw never implemented this endpoint (404). Fix: _call_openclaw_code_review now goes through local_code_review_service.review_pr() (Ollama qwen2.5-coder + Gemini fallback). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-21 15:19:14 +08:00
Your Name	3323a9052c	debug: log telegram 400 response body to diagnose card send failure [Checks passed: CD Pipeline / build-and-deploy (push) succeeded in 12m38s]	2026-04-21 01:05:21 +08:00
Your Name	9e9bd8679f	fix(aider-watch): code-review fixes (4 issues) [Checks failed: CD Pipeline / build-and-deploy (push) was cancelled] 1. aiderw: session_end now includes model+cwd (unblocks the AI Router feedback loop) 2. repository: model_stats_since SQL now uses COALESCE(session_end, session_start) model 3. aider_event_service: classify_severity no longer raises alerts on error_count (prevents false positives) 4. worker: run_aider_event_processor_loop wraps proc.start() in try/except (prevents silent crashes) 2026-04-20 @ Asia/Taipei	2026-04-21 00:59:21 +08:00
Your Name	de2d34d4cd	fix(playbook): wire up the full C1-C4 flow: evolver guard + seeder revival + immediate rule creation + watchdog W-4 [Checks failed: CD Pipeline / build-and-deploy (push) was cancelled] C1: playbook_evolver: add a YAML_RULE guard for playbooks whose source is yaml_rule; the evolver no longer archives APPROVED playbooks created by the seeder, which protects the auto-repair chain. C2: playbook_seed_service: the idempotency SQL excludes DEPRECATED records, so yaml_rule playbooks can be revived on restart after the evolver archives them. C3: alert_rule_engine: once an AI-generated rule succeeds, call seed_playbooks_from_rules() immediately, so the matching APPROVED Playbook is created without waiting for the next restart. C4: ai_slo_watchdog_job: add W-4, an alert when the number of APPROVED playbooks is 0; a broken chain immediately raises TYPE-8M; total checks go from 3 to 4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:18:11 +08:00
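Your Name	156a52f807	fix(aiops): ADR-092 three fixes: Playbook enum crash + Telegram permanent silence + adoption failure + AI self-health check [Checks passed: CD Pipeline / build-and-deploy (push) succeeded in 13m33s] B1 playbook_service.py: the evolver's setattr passed a str instead of a PlaybookStatus enum → _pg_upsert blew up on playbook.status.value (163 times/48h); fix: update_with_validation forces conversion to the enum. B2 approval_db.py + webhooks.py: find_by_fingerprint wrongly converged on PENDING records → PENDING ≠ already sent to Telegram; fix: after a successful push, mark tg_sent:{fingerprint} in Redis (24h TTL) → outside the debounce window, find_by_fingerprint converges on a PENDING record only if Redis confirms it. drift_adopt_service.py: telegram_gateway calls adopt_drift(report_id) but the method does not exist → add an adopt_drift() wrapper that loads the DriftReport from the DB and delegates to adopt(), fixing the adoption failure. B3 ai_slo_watchdog_job.py + main.py: the AI cannot perceive its own failures (MASTER §1.1 blind spot) → add a self-health check every 15 minutes: W-1 SLO violation, W-2 TG silence detection, W-3 flywheel success rate → any anomaly → TYPE-8M send_meta_alert; Redis dedup 1h. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:00:06 +08:00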
Your Name	40771cda6d	feat(ai_router): feedback_from_aider_events read-only hook (Phase 24 A8)	2026-04-20 19:40:01 +08:00
Your Name	cd894310dc	feat(api): POST /api/v1/aider/events HMAC webhook + Redis stream push - Router layer: HTTP validation + HMAC-SHA256 signature verification - Service layer: Redis stream push (aider_event_service.push_aider_batch_to_stream) - Follows leWOOOgo modularization: Router → Service → Redis - All 6 tests passing (signature validation, batch limits, edge cases)	2026-04-20 19:40:01 +08:00
Your Name	964427c5d4	feat(service): aider_event_service — classify + signal_data builder (uses existing debounce)	2026-04-20 19:40:01 +08:00
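Your Name	54d60d04f5	feat(drift+target): P0.1+P0.2+P0.3 three fixes: drift pagination and categorization + AI recommendation + target tracing. Decision on the commander's three questions: do all of them; the AI recommendation's 0.85 threshold is display-only, not automatic; check aol first, then fix ## RCA: source of the awoooi-service failure - /api/v1/aiops/kpi shows 1 playbook_executed record in the past 24h with actor=approval_execution status=failed - grep of the codebase: no code hard-codes awoooi-service (only historical comments) - Most likely source: alert_rule_engine._extract_vars takes the value of labels.service as the Deployment name - cf5050c/4f2e122 (2026-04-18) already fixed two NEMOTRON hallucination paths; this commit fixes the third path ## Fixes ### P0.3a alert_rule_engine._extract_vars - labels.service downgrade: for values ending in -service, strip the suffix first and treat the rest as the base name - match_rule's return value adds a target_source field for tracing - If awoooi-service recurs, the source is directly visible (label.service(stripped), etc.) ### P0.3c approval_execution._log_aol_started.input - Add parsed_target/operation/namespace fields - Future aol queries for failed records show the target directly, with no guesswork ### P0.1 telegram_gateway._send_drift_diff_detail - Pagination (10 items/page) replaces flooding the chat with 30 items at once - Header counts in 3 buckets: manual high-risk / general modification / K8s automatic - ⬅️/➡️ paging buttons at the bottom (callback: drift_view_page:{report_id}_{page}) - security_interceptor INFO_ACTIONS adds drift_view_page to the whitelist ### P0.2 drift_narrator recommendation - LLM prompt adds a recommendation field (action/confidence/reason) - action ∈ {adopt, revert, ignore, investigate} - The top of the card shows "🎯 AI recommendation: ⏪ Revert (85%) - reason" - On LLM failure, fall back to _fallback_recommendation (rule-based, mapped from intent) - Card diff_summary limit raised from 500 → 1500 characters to fit recommendation + narrative + items - Commander's directive: display only, no automatic execution (the 0.85 threshold is kept for future use) ## Verification - All 90 pytest tests pass (drift + rule_engine + approval_execution) - AST syntax check passes for 5 files ## Next acceptance checks 1. On the next drift trigger → the top of the card shows "🎯 AI recommendation" 2. Pressing drift_view → 3-bucket header + ⬅️/➡️ 3. If awoooi-service recurs → query automation_operation_log.input.parsed_target directly Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 04:04:13 +08:00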
Your Name	f572561467	feat(ai_advisory): P0 fixes: leader lock + inline keyboard + callback handler [Checks passed: CD Pipeline / build-and-deploy (push) succeeded in 8m31s] Commander's screenshot feedback of 2026-04-19: 1. The same alert was pushed twice in a row at 22:44 (multiple Pods all run the daily loop) 2. Plain text with no buttons (no feedback loop / AI only suggests and does not execute). New services/ai_advisory_helpers.py (~240 lines): - try_acquire_daily_lock(job_name): Redis SETNX key 'aiops:daily_lock:{job}:{date}', TTL 25h, fail-open (if Redis is down, push anyway without blocking). - try_acquire_hourly_lock(job_name): hourly version of the same (used by coverage_evaluator). - is_snoozed / set_snooze: Redis key 'aiops:snooze:{type}:{target}', TTL 24h. - build_ai_advisory_keyboard: a unified set of 4 buttons: ✅ Handled / 😴 Ignore 24h / 🔍 View details / 📋 Generate kubectl command; callback_data format: 'ai_advisory_{action}:{type}:{id}' - handle_ai_advisory_callback: handles the two actions handled/snooze and writes aol.output.human_feedback; view/produce_cmd are left for P1. 4 LLM scanners now use the helpers: - capacity_forecaster: daily_lock + snooze check per host + buttons - compliance_scanner: daily_lock (cron only) + snooze per date + buttons - coverage_evaluator: hourly_lock + snooze per worst_dimension + buttons - hermes_rule_quality: daily_lock + snooze per primary rule + buttons. telegram_gateway.py: handle_callback adds an 'ai_advisory_*' route (after step 1.85 drift). New _handle_ai_advisory_action method: parse payload 'type:id' → call handle_ai_advisory_callback → answer_callback (Telegram toast feedback) → return dict (info_action=True for view/produce_cmd). Alignment with the commander's iron rules: ✅ In multi-Pod setups only the leader pushes (Redis SETNX guarantees idempotency) ✅ Failures fail open and do not block the main business flow (still works when Redis is down) ✅ aol.output adds human_feedback for AI learning ✅ snooze avoids repeated alerts (24h TTL) ✅ Reuses the original drift button pattern (non-breaking). Tomorrow morning AI will receive: - A single message (no duplicates) - With 4 buttons (manual feedback loop) - After snooze, the same topic is not pushed again for 24h. view/produce_cmd (P1) are left for the next session (AI-initiated MCP evidence gathering + LLM-generated kubectl command). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 23:02:57 +08:00
Your Name	fa643ebdc7	refactor(p1): extract LLM JSON parse helper + dual-condition coverage threshold (architect Review P1) [Checks passed: CD Pipeline / build-and-deploy (push) succeeded in 8m52s] The chief architect's 2026-04-19 Review (92/100, Grade A) pointed out P1 optimizations: 1. The LLM JSON 3-path parse logic is duplicated across 4 scanners (~80 lines × 4 = 320 lines) 2. The coverage trigger threshold red>=20 is too low; a production bootstrap always triggers it and wastes tokens. P1.1+1.2 New services/llm_json_parser.py (~90 lines): parse_llm_json_response(text, required_key, logger_context) with a 3-path fallback: Path 1: strip the markdown fence + direct JSON containing required_key; Path 2: NemoTron wrapper (JSON embedded in description/action_title/reasoning); Path 3: everything failed, return None + logger.warning. It never raises on failure; the caller decides the fallback. 4 LLM scanners now use the helper: - hermes_rule_quality_job: required_key='recommended_actions' - capacity_forecaster_job: required_key='priority_actions' - compliance_scanner_job: required_key='posture_grade' - coverage_evaluator_job: required_key='worst_dimension'. Each loses about 20 lines of duplication. P1.3 The coverage trigger becomes a dual condition: Before: total_red >= 20 (always triggered at bootstrap); After: red_ratio > 30% AND total_scanned >= 50; _fetch_red_summary additionally returns total_scanned for the calculation. 5/5 unit tests for parse_llm_json_response: ✅ direct / markdown fence / NemoTron wrapper / invalid / missing key. P1.4 capacity_scanner + rule_catalog_sync: on inspection they already have complete author comments (Review misjudgment). Other P1 items (Prom HTTP helper / staggered first_delay / LLM budget guard) are left for the next session. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 22:39:40 +08:00
Your Name	37b6c9ba56	chore: remove empty ai_orchestrator.py (empty file accidentally committed) [Checks passed: CD Pipeline / build-and-deploy (push) succeeded in 13m6s] The previous commit (`86d9b22` LOGBOOK) accidentally brought in the 0-line empty file ai_orchestrator.py via a stash pop; it was not created on purpose. This commit deletes it to keep services/ clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 22:22:53 +08:00
Your Name	86d9b22125	docs(logbook): session wrap-up: Gap Review + full record of AI autonomy 1/9→4/9 [Checks failed: CD Pipeline / build-and-deploy (push) was cancelled] Session closed out with 35 commits: - Phase 7 foundation (scanners + evaluator + tracker + advisor + forecaster) - KPI Dashboard API (autonomy_score 63/100, quantifiable) - Audit honestly reports 3 Gaps - Gap 1: strict host IPv4 + cleanup of 266 duplicate records - Gap 2: true cause confirmed not to be a bug - Gap 3: LLM upgrade 3/8 (capacity_forecaster/compliance/coverage). AI autonomy achieved: 1/9 LLM (Hermes only) → 4/9 LLM decision; all 8 tables that had 0 writers are now active; 7/7 coverage dimensions complete. Tonight the AI will autonomously push 4 kinds of Telegram analysis reports. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 22:22:42 +08:00
OG T	0004554bc6	feat(api): AIOps KPI Dashboard: full view of AI autonomy maturity (modular refactor) [Checks passed: CD Pipeline / build-and-deploy (push) succeeded in 8m47s] GET /api/v1/aiops/kpi → integrates all KPIs of MASTER §7.1 in one call. Alignment with the leWOOOgo modularization rule: - Router (api/v1/aiops_kpi.py) handles HTTP routing only and does not touch the DB - Service (services/aiops_kpi_service.py) owns all SQL + computation - The previous commit was blocked by a hook (Router imported get_db_context directly); fixed here. services/aiops_kpi_service.py (~230 lines): AiopsKpiService.get_snapshot() returns 6 sections: 1. asset_inventory: by_type + total + last_scan (run_id/ended_at/total/new/modified) 2. coverage_kpi: 7 dimensions × (green/yellow/red/unknown) + green_ratio_per_dim + overall_green_ratio (MASTER §7.1 #5 SLO) 3. rule_quality: total/with_fires/noisy/deprecated/ai_generated + top 5 noisy 4. capacity_health: latest snapshot per host + by_verdict + violations_7d 5. automation_flow_24h: aol detail + by_actor + by_operation_type 6. ai_autonomy_score: 0-100 total, 5 sub-items × 20: asset_coverage / rule_quality / capacity_health / automation_flow / ai_diversity; grade: mature(90+) / in_progress(70-90) / starter(50-70) / initial(<50). api/v1/aiops_kpi.py (~35-line lean router): only router = APIRouter() + @router.get delegating to the service. main.py: include_router(aiops_kpi_v1.router, prefix='/api/v1', tags=['AIOps KPI']). Commander usage: curl http://192.168.0.121:32334/api/v1/aiops/kpi \| jq . shows the full view of AI autonomy maturity in one call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 21:21:46 +08:00
OG T	c0f3509d39	fix(drift-card): Drift Diff HTTP 400: accumulate length item by item to avoid cutting HTML [Checks failed: CD Pipeline / build-and-deploy (push) failing after 2m0s] The commander reported that clicking [View Diff] at 14:18 returned 'Drift Diff query failed: HTTP error: 400'. True cause (telegram_gateway.py:2087 _send_drift_diff_detail): - report_id=7ffe78ae has 48 items; the longest single git_value is 1794 characters (env array) - The accumulated _full far exceeds 4096, so the _full[:3950] truncation runs - The cut can land in the middle of an HTML tag (<code>... or in the middle of a < entity) - Telegram with parse_mode='HTML' rejects incomplete HTML → 400. Fix: - Accumulate length item by item; each item counts as the length of its _block + 1 - Reserve a 3800 cap (4096 - 250 buffer for the header + the '… X more items' hint) - Ensures _full is always a complete HTML structure. Verification: the next time a drift report appears and the commander clicks [View Diff], it should display normally (next cycle of this session). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 14:26:29 +08:00
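The item-by-item accumulation described in commit c0f3509d39 can be sketched roughly as follows. This is a minimal illustration, not the repository's `_send_drift_diff_detail`: the function names `render_item` and `build_diff_message`, the header text, and the wording of the hint are assumptions, and only the 3800 cap and the "block length + 1" accounting come from the commit message.

```python
from html import escape

TELEGRAM_LIMIT = 4096  # Telegram's maximum message length
BODY_CAP = 3800        # cap quoted in the commit message


def render_item(key: str, value: str) -> str:
    """Render one diff item as a self-contained HTML block (hypothetical layout)."""
    return f"<code>{escape(key)}</code>: <code>{escape(value)}</code>"


def build_diff_message(header: str, blocks: list[str], cap: int = BODY_CAP) -> str:
    """Append whole blocks until the cap would be exceeded, never slicing a block."""
    parts = [header]
    used = len(header)
    shown = 0
    for block in blocks:
        cost = len(block) + 1  # +1 for the joining newline
        if used + cost > cap:
            break  # stop before this block instead of cutting it mid-tag
        parts.append(block)
        used += cost
        shown += 1
    remaining = len(blocks) - shown
    if remaining:
        parts.append(f"… {remaining} more items")
    return "\n".join(parts)


if __name__ == "__main__":
    items = [render_item(f"env[{i}]", "x" * 200) for i in range(48)]
    message = build_diff_message("Drift Diff", items)
    print(len(message), message.count("<code>") == message.count("</code>"))
```

Because every block is appended whole or not at all, each `<code>` tag in the output keeps its closing tag, which is the property the truncation `_full[:3950]` could not guarantee.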

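The nonce compression from commit 8fd31eca66 (36-character UUID to 22-character base64url) can be illustrated with the standard library. This is a sketch under assumptions: `compress_uuid`, `expand_uuid`, and `build_callback_data` are hypothetical names, not the repository's `generate_callback_nonce` / `parse_callback_data`, and the timestamp and random parts are placeholder values matching the 10 and 8 character widths quoted in the commit.

```python
import base64
import uuid

CALLBACK_DATA_LIMIT = 64  # Telegram's callback_data limit in bytes


def compress_uuid(value: str) -> str:
    """Encode the 16 raw UUID bytes as unpadded base64url (22 characters)."""
    raw = uuid.UUID(value).bytes
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def expand_uuid(token: str) -> str:
    """Restore the canonical 36-character UUID from a 22-character token."""
    raw = base64.urlsafe_b64decode(token + "==")  # re-add the stripped padding
    return str(uuid.UUID(bytes=raw))


def build_callback_data(action: str, approval_id: str, timestamp: int, random_part: str) -> str:
    """Assemble {action}:{b64url_uuid}:{timestamp}:{random} and enforce the byte limit."""
    data = f"{action}:{compress_uuid(approval_id)}:{timestamp}:{random_part}"
    if len(data.encode("utf-8")) > CALLBACK_DATA_LIMIT:
        raise ValueError("callback_data exceeds 64 bytes")
    return data


if __name__ == "__main__":
    approval_id = str(uuid.uuid4())
    data = build_callback_data("host_restart_service", approval_id, 1776778220, "a1b2c3d4")
    token = data.split(":")[1]
    print(len(data), expand_uuid(token) == approval_id)
```

The base64url alphabet contains no colon, so splitting the payload on `:` stays unambiguous after compression.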

552 Commits