awoooi

Author	SHA1	Message	Date
Your Name	86ee013cdf	feat(hermes-complete): Hermes NL three hardening items + ConsensusEngine + ADR wrap-up ## Hermes NL hardening (nl_gateway.py) - T1 hermes_dispatch_log DB write (asyncio.create_task, non-blocking) - T2 Redis rate limiting: 20 req/min per chat_id, fail-open - T3 Multi-turn session: hermes:session:{chat_id}:{user_id} TTL=300s, last 3 turns ## ConsensusEngine (ADR-095 declarative design) - consensus_engine.py: CONSENSUS_WEIGHTS class attribute, security=0.4 locked, the remaining 0.6 allocated across 9 Claude Code agents - config.py: ENABLE_12AGENT_CONSENSUS=False feature flag ## ADR status - ADR-093/094/095: Proposed → 🟡 approved, implementation in progress - v1.1 changelog added to each ADR ## K8s ConfigMap - prod 04-configmap.yaml: add 3 feature flags (all false) - dev 02-configmap.yaml: synced ## LOGBOOK - Record WS0–WS6 + hardening completion, plus a guide to enabling the feature flags Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 02:22:40 +08:00
Your Name	2572ec46d2	feat(ws4): Hermes NL natural-language interface: 12-Agent Claude SDK integration (ADR-094/095) ## hermes/ package (5 new modules) ### display_names.py - Visual identity table for the 12 agents (emoji + hashtag + handle + short_name) - format_response_header() builds the Telegram prefix ### agent_loader.py - Parses .claude/agents/*.md frontmatter → system prompt - lru_cache avoids re-reading files ### safety_hooks.py - Ports the 20 HARD BLOCK rules from awoooi-guard.js (DENY_PATTERNS) - 5 MUTATE_PATTERNS → must go through the approval flow ### nl_gateway.py - Layer 1: keyword regex routing (12 rules, <10ms) - Layer 3: DEFAULT_AGENT = "debugger" - Claude Agent SDK query() async streaming, reads ResultMessage.result - Safe degradation: SDK error → friendly error message ### telegram_webhook.py - WS4 Hermes NL hookup (triggered by an @tsenyangbot mention or a direct message) - HERMES_NL_ENABLED=False (feature-flag guarded, off by default) ## telegram_gateway.py - send_hermes_reply(text, chat_id, reply_to_message_id): no 500-character truncation, supports long agent replies ## config.py - HERMES_NL_ENABLED: bool = False - TELEGRAM_BOT_USERNAME: str = "tsenyangbot" Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 02:10:06 +08:00
Your Name	294e0e3387	feat(ws3): ADR-093 Callback User-ID Binding + ADR-094 webhook entry point ## T3.1/T3.2 Bound User Check (security_interceptor.py) - verify_callback() Step 0: check Redis cb_bind:{nonce} → if a binding exists and caller != bound_user_id → UserNotWhitelistedError → if the key does not exist (legacy format) → fall back to the whitelist (backward compatible) → if Redis is unavailable → degrade and continue (safe degradation) - bind_callback_user(nonce, user_id): async method, TTL=48h ## T3.3 Telegram webhook entry point (ADR-094) - apps/api/src/api/v1/telegram_webhook.py (new) POST /api/v1/telegram/webhook - X-Telegram-Bot-Api-Secret-Token header verification - TELEGRAM_WEBHOOK_SECRET="" → verification skipped in dev (does not break existing tests) - Placeholder reserved for the WS4 Hermes NL hookup ## T3.4 config.py - Add TELEGRAM_WEBHOOK_SECRET field (default empty string) ## main.py - Mount telegram_webhook_v1.router under /api/v1 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 02:10:06 +08:00
Your Name	6d5fd3c124	feat(ws2): ADR-093 routing unification: BIGINT + NotificationMatrix + feature flag ## Fixes ### T2.1 BigInteger overflow fix - `db/models.py`: telegram_chat_id Integer → BigInteger (the original int32 cannot hold the group ID -1003711974679) ### T2.2 Remove CAST workaround - `approval_db.py:739`: remove CAST(:telegram_chat_id AS BIGINT); the ORM now uses BigInteger correctly, so the workaround can be retired ### T2.3 Redis key consistency fix - `heartbeat_report_service.py:575`: telegram:polling_leader → telegram:polling:leader (telegram_gateway.py uses colon separators; the underscore in heartbeat was a bug) ## Added ### T2.4 notification_matrix.py - `services/notification_matrix.py`: ADR-093 routing matrix - Destination(DM/GROUP/BOTH) + RoutingRule dataclass - NOTIFICATION_ROUTING dict (full mapping for TYPE-1 through TYPE-8M) - resolve_chat_ids(type, dm, group, *, tg_group_cutover=False) gradual-cutover API ### T2.5 telegram_gateway.py feature-flag guard - line 43: add notification_matrix import - line 1827-1834: with TG_GROUP_CUTOVER=False the old behavior is kept; with TG_GROUP_CUTOVER=True the _interactive_types blacklist is lifted and the matrix decides ### T2.6 Migration SQL - `migrations/adr093_notification_routing.sql`: - CREATE TABLE approval_records (telegram_chat_id BIGINT) - CREATE ROLE awoooi_migrator (IF NOT EXISTS) - Includes an ALTER COLUMN int→bigint guard for legacy environments ## Test sync - `tests/integration/setup_test_schema.sql`: telegram_chat_id BIGINT Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-25 02:10:06 +08:00
Your Name	55f111e0e3	fix(aiops): correct host alert fallback and resolved stamp	2026-04-25 00:14:07 +08:00
Your Name	0d81b28b1b	fix(aiops): bound phase2 timeout and repair incident links	2026-04-24 23:53:56 +08:00
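Your Name	e75e4678a9	feat(p2.4): Telegram intermediate-state push: "analyzing" placeholder card, deleted automatically on completion. P2.4 implemented 2026-04-24, ogt + Claude Sonnet 4.6. Problem: LLM analysis takes 10-30s, during which Telegram shows no response and users cannot tell that the system is working. Fix: - telegram_gateway.py: add send_analyzing_placeholder(), which sends an "AI is analyzing..." placeholder card - telegram_gateway.py: add delete_message() to delete the placeholder card - webhooks.py: send the placeholder card within 3s before LLM analysis (a timeout does not block the main flow) - webhooks.py: _push_to_telegram_background receives placeholder_message_id → deletes the placeholder once the full card has been sent - webhooks.py: import asyncio (previously missing). Effect: users see an "analyzing..." message within <3s of the alert arriving, and it is cleared automatically once the full card appears Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 15:56:26 +08:00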
Your Name	bb5f16f8ef	fix(aiops-p2): P2.1 three LLM quality fixes: Evidence-First + consensus confidence + raw_evidence injection. Root causes: - consensus_engine: all four ExpertAgents had confidence=0.0 → weighted vote total=0 → always returned NO_ACTION - prompts.py had no Evidence-First instruction → the LLM reasoned from memory, unconstrained by the real environment - openclaw.py: analyze_alert built the prompt without injecting MCP evidence (diagnosis_context). Fixes: - consensus_engine: SRE/Security/Cost/Performance set confidence to 0.45~0.80 according to signal strength - consensus_engine: _normalize_action adds the alias "重新啟動" ("restart") → RESTART - consensus_engine: SecurityAgent drops the unused _target variable - prompts.py: add Evidence-First Protocol + Skepticism Rules sections - openclaw.py: analyze_alert extracts diagnosis_context → injects it into full_prompt as <raw_evidence>. Verification: consensus score went from 0.0 → 0.744 (CrashLoop test case). P2.1 fix 2026-04-24 ogt + Claude Sonnet 4.6 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 15:52:25 +08:00
Your Name	04ff22563e	fix(aiops-p1): Playbook learning loop, all 5 breakpoints fixed + DB migration (ADR-092 B4) [P0.4 patch] pre_decision_investigator Prometheus query field missing - _build_tool_params() adds the "query" field (a required parameter of the prometheus_query tool) - Add _build_prometheus_query(): generates PromQL by alert type (CPU/Memory/Crash/Disk/HTTP/Pod/fallback) - After the fix the D3_METRICS sensing dimension actually returns data (previously 100% returned missing_query_parameter) [P1 Playbook learning loop B1-B5 all fixed] - B2 db/models.py: ApprovalRecord gains a matched_playbook_id column + ix_approval_matched_playbook index - B2 db/models.py: TimelineEvent gains an incident_id column (for MCP auditing) + index - B3 approval_db.py: record→ApprovalRequest restores incident_id + matched_playbook_id - B4 approval_repository.py: same as B3 (the two conversion functions must stay in sync) - B5 approval_db.py: approval_request_to_record_data adds matched_playbook_id → so the DB can store the value [P1.5 KM write] approval_execution.py: fire-and-forget → await wait_for(30s) - Root cause: the asyncio.create_task is killed on Pod recycle, so KM writes were silently lost - Fix: await asyncio.wait_for(..., timeout=30.0) + TimeoutError log [Migration file] adr092_p1_learning_chain_fix.sql - ALTER TABLE approval_records ADD COLUMN matched_playbook_id VARCHAR(36) - ALTER TABLE timeline_events ADD COLUMN incident_id VARCHAR(64) - Run: psql $DATABASE_URL -f apps/api/migrations/adr092_p1_learning_chain_fix.sql [Incidental agent changes] - decision_manager: Phase 2 YAML NO_ACTION priority gate (host-level/external services skip Agent Debate) - alert_rules.yaml: new Sentry/ClickHouse + HostDiskUsageHigh/Critical rules - solver_agent: action_title semantic-synthesis fallback (replaces silent discard) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 15:41:35 +08:00
Your Name	7f4088bcd0	fix(aiops-p0): six root causes, full P0 fix (ADR-092 B4) [P0.1] knowledge_extractor_service.py:210: AttributeError fix - The Signal.description field does not exist (100% failure; root cause of KM growing by 5 per day) - Use alert_name + annotations.summary concatenated instead [P0.2+P0.3] Gate 9+11 relaxed for read-only commands - blast_radius_calculator: kubectl get/top/describe/logs/version → score=1 (not 50) - operation_parser: recognize an INVESTIGATE type (read-only kubectl no longer returns None) - executor.py: OperationType gains an INVESTIGATE enum member - approval_execution.py: the INVESTIGATE path calls execute_kubectl_command directly [P0.4] MCP SSH/K8s Provider fixes - decision_manager: params= → parameters= (matches the MCPToolProvider.execute signature) - decision_manager: MCPToolResult .get() → .success/.output (dataclass usage) - decision_manager + ssh_provider: add hosts 120/121 (missing from the original default) - auto_approve: the phase2_agent_debate source bypasses the confidence threshold [P0.5] Alert rule semantic contradiction fix - alert_rules.yaml: 8 kubectl query rules RESTART_DEPLOYMENT → NO_ACTION (CrashLoopBackOff/PostgreSQL connections/slow queries/MinIO disk/K3s nodes/alert pipeline/SSL/CoreDNS, etc.) - incident_service.py: cAdvisor/CoreDNS split out of general into their own categories [P0.6] proactive_inspector dynamic-baseline PromQL fully fixed - All 5 MONITORED_METRICS PromQL expressions corrected (cadvisor label/datname/blackbox) - db_connection_pool: datname="awoooi" → "awoooi_prod" - http_error_rate: invalid http_requests_total → blackbox probe_success - cpu/memory: namespace label → name=~"k8s_api_awoooi-api.*" Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 15:32:23 +08:00
Your Name	45dbe07188	fix(flywheel): six automation-flywheel capability fixes (ADR-092 B3) [Root-cause chain fix] MCP Provider bugs → PreDecisionInvestigator fails → Agent Debate has no context → LLM times out → description="待分析" ("pending analysis") → blocked by the ADR-091 iron gate → tg_sent not set → W-2 Watchdog falsely reports a "silent failure" [Six fixes] 1. MCP Provider three-bug fix - ssh_provider: asyncssh.run() → conn.run() - prometheus_provider: KeyError 'query' → tolerant .get() - k8s_provider: empty pod_name → early return of an error dict 2. Agent Debate / decision quality - decision_manager: the timeout-degradation text is now an explicit description (gets past the ADR-091 iron gate) - intent_classifier: an LLM timeout degrades to keyword classification (not None) 3. Watchdog false-positive fix (ADR-092 B3) - W-2: tg_sent Redis TTL → telegram_message_id IS NULL (DB ground truth) - W-5 added: suggested_action IN empty/待分析/NO_ACTION + tg_id IS NULL - approval_timeout_resolver: 60min → 15min, batch 50 → 200 4. Config Drift automation - drift_adopt_service: auto_adopt_if_safe() six-condition safety gate - drift.py: the background task tries automatic adoption first, then sends the manual Telegram card 5. Playbook flywheel stability - playbook_seed_service: fix idempotency (deprecated is not treated as missing) - playbook_evolver: load only DRAFT+APPROVED (not all 294 records) 6. Observability - alert_rule_engine: auto_rule structured logging + Redis counters (pipeline) - auto_approve: Redis counters for reject reasons - heartbeat_report_service: new "⚙️ Automation stats (today)" section [Pending manual execution] psql $DATABASE_URL -f apps/api/migrations/cleanup_duplicate_deprecated_playbooks.sql Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 10:55:50 +08:00
Your Name	9244c5e845	feat(heartbeat): system report gains 5 dynamic sections. Adds alert pipeline (24h), DB/Redis status, K8s Pods, Scanner status, and Telegram Bot. Each section is probed in parallel via asyncio.gather(return_exceptions=True), so a failure in one does not affect the others. Adds the AlertPipelineStats/DbRedisStats/PodInfo/ScannerStats/TelegramBotStats dataclasses. _build_warnings() now checks for DB/Redis anomalies, PENDING>10, and Pods that are not ready or have high restart counts. report_to_telegram_html() emits the 5 corresponding HTML sections Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 09:29:16 +08:00
Your Name	88af639651	fix(report): fix inconsistent case in approval_records.status. The DB stores the enum name via SQLEnum (EXECUTION_FAILED, uppercase) rather than the enum value (execution_failed, lowercase). The SQL now applies UPPER(status::text) so that matches succeed regardless of case. Verification: a live DB query returns success=0, failed=2 (previously always 0/0) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 09:10:39 +08:00
Your Name	6810ab359d	fix(report): two root-cause fixes: duplicate daily report + auto-repair 0%. Problem 1: the daily inspection report was sent multiple times (each Pod ran its own daily job) - Root cause: run_daily_report_loop was not wired to the leader lock; the other scanners (capacity/hermes/compliance) all call try_acquire_daily_lock, and only the daily report loop did not - Fix: call try_acquire_daily_lock("daily_report") after asyncio.sleep; Pods that fail to get the lock continue and wait for the next 08:00. Problem 2: the auto-repair success rate was always 0.0% - Root cause: _collect_repair_stats queried incidents.outcome->>'execution_success', but the whole execution chain (approval_execution.py NO_ACTION + real execution) never wrote execution_success back to the incidents.outcome JSON, so the query always returned 0 - Fix: query approval_records.status (EXECUTION_SUCCESS / EXECUTION_FAILED) instead, the only source of truth that is reliably written Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 09:03:44 +08:00
Your Name	1625e7bd19	fix(telegram): two root-cause fixes for silent button replies. Problem 1: ai_advisory_* buttons (capacity forecast/compliance, etc.) - Pressing one only showed a toast (gone in 2-3 seconds); the group never received a reply - Fix: _handle_ai_advisory_action takes a message_id parameter and, after answer_callback, also sends a sendMessage reply to the original card. Problem 2: pressing "Approve" again on an already-resolved alert - sign_approval returns early (status != pending), but _notify_approval_result still sent "⚡ Executing..." → with no follow-up ever - Fix: send "Executing..." only when approval.status == APPROVED; other terminal states send "ℹ️ This alert has already been handled (status: ...)" and return Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 01:57:55 +08:00
Your Name	479f8d8971	refactor(tests): clear technical debt: remove FakeRepo/FakeSession mock-DB violations ## ai_router.py - Extract _aggregate_feedback_stats() as a pure function, called by feedback_from_aider_events ## aider_event_processor.py - _process_one gains a _session_factory=None DI parameter (defaults to get_session_factory()) - A test factory can be injected without changing existing production logic ## test_ai_router_feedback.py (full rewrite) - Remove FakeRepo/FakeSession; test the pure function _aggregate_feedback_stats directly - Add the test_feedback_skips_missing_model edge case - Keep the DB-failure degradation test (patches only get_session_factory, no FakeRepo) ## test_aider_event_processor.py (full rewrite) - Remove FakeRepo/FakeSession; use real PostgreSQL (real_factory fixture) - Redis xack + IncidentEngine stay mocked (external broker/AI service, an allowed exception) - Roll back after each test so the dev DB is not polluted ## setup_test_schema.sql - Add the aider_events_payload_gin GIN index (matches the adr091 production migration) ## integration/conftest.py - Add a comment explaining the historical confusion around the password name awoooi_prod_2026 - Fix the assert logic: check the DB name rather than the URL string, so a password containing prod does not trigger a false positive Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 01:33:30 +08:00
Your Name	4fc1f49dca	fix(pipeline): three breakpoint fixes: SLO formula + NO_ACTION backlog + hallucination-downgrade risk. D1 flywheel_stats_service: the execution_count column does not exist → read success_count+failure_count instead; removes the false impression that the flywheel execution success rate is always 0.0%. D2 openclaw._validate_deployment_inventory: after a hallucinated deployment was downgraded, the original HIGH/CRITICAL risk was not reset → add result.risk_level = AIRiskLevel.LOW. D3 webhooks.py (two alert paths): the three non-destructive action types NO_ACTION/INVESTIGATE/OBSERVE are forced to risk_level = LOW, skip Telegram approval, and are auto-approved directly → the NO_ACTION handler in approval_execution.py marks EXECUTION_SUCCESS immediately. Root-cause chain: after the BUTTON_DATA_INVALID fix TG buttons could be sent, but the backlog of 35 NO_ACTION PENDING records existed because HIGH risk could not take the auto-approve path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-21 22:26:07 +08:00
Your Name	8fd31eca66	fix(telegram): nonce UUID base64url compression, resolving BUTTON_DATA_INVALID for good. The previous fix (dropping random) was incomplete: host_restart_service (20 chars) was still 68 bytes, over the 64-byte limit, even without random. Root fix: UUID (36 chars) → base64url-encode the UUID bytes → 22 chars. Nonce format: {action}:{b64url_uuid}:{timestamp}:{random}. Longest case: host_restart_service(20)+22+10+8+3 colons = 63 bytes. generate_callback_nonce: UUID → base64url, 22 chars. parse_callback_data: 22-char b64url → restores the full UUID, so handlers need no changes. Verified across all actions: approve/silence/reject/docker_restart/host_restart_service/renew_cert are all ≤ 63 bytes, and the UUID round-trips correctly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-21 21:30:20 +08:00
Your Name	bd735482f7	fix(telegram): BUTTON_DATA_INVALID: root-cause fix for nonces exceeding 64 bytes. Root cause: Telegram callback_data is capped at 64 bytes. The 5 long action names (docker_restart/host_restart_service, etc.) + UUID approval_id = 71-77 bytes → BUTTON_DATA_INVALID. Fix: 1. security_interceptor.generate_callback_nonce: if the nonce is > 63 bytes, use the 3-part format (drop random); the timestamp still provides uniqueness in time. 2. security_interceptor.parse_callback_data: accept the 3-part or 4-part format. 3. telegram_gateway: remove the debug payload logging (diagnosis complete). Affected actions: docker_restart / host_restart_service / host_clear_log / reload_nginx / renew_cert (all > 7 chars + UUID = 64 bytes or more). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-21 21:17:49 +08:00
Your Name	685f5c684f	debug(telegram): log full payload on 4xx to diagnose BUTTON_DATA_INVALID. The previous response_body logging confirmed the error code; this commit logs the full payload (payload_preview, first 1000 bytes) to find the exact field that triggers BUTTON_DATA_INVALID. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-21 20:56:28 +08:00
Your Name	acab1cd95e	fix(gitea-review): PR/push AI analysis always failing: two root-cause fixes. Root cause 1 (push review): local_code_review_service.review_push() returns a dict, but the caller accesses analysis.issues directly → AttributeError. Fix: _call_openclaw_push_review converts the dict into a CodeReviewResult. Root cause 2 (PR review): openclaw_http_service calls /api/v1/analyze/code-review, but OpenClaw never implemented this endpoint (404). Fix: _call_openclaw_code_review now goes through local_code_review_service.review_pr() (Ollama qwen2.5-coder + Gemini fallback). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-21 15:19:14 +08:00
Your Name	3323a9052c	debug: log telegram 400 response body to diagnose card send failure	2026-04-21 01:05:21 +08:00
Your Name	9e9bd8679f	fix(aider-watch): code-review fixes (4 issues) 1. aiderw: session_end now includes model+cwd (unblocks the AI Router feedback loop) 2. repository: model_stats_since SQL uses COALESCE(session_end, session_start) model 3. aider_event_service: classify_severity no longer raises alerts on error_count (prevents false positives) 4. worker: run_aider_event_processor_loop wraps proc.start() in try/except (prevents silent crashes) 2026-04-20 @ Asia/Taipei	2026-04-21 00:59:21 +08:00
Your Name	de2d34d4cd	fix(playbook): C1-C4 end-to-end wiring: evolver guard + seeder revival + immediate rule creation + watchdog W-4. C1: playbook_evolver: playbooks with source yaml_rule get a YAML_RULE guard, so the evolver no longer archives APPROVED playbooks created by the seeder, protecting the auto-repair chain. C2: playbook_seed_service: the idempotency SQL excludes DEPRECATED records, so yaml_rule playbooks archived by the evolver can be revived on restart. C3: alert_rule_engine: once an AI-generated rule succeeds, seed_playbooks_from_rules() is called immediately, creating the matching APPROVED Playbook without waiting for the next restart. C4: ai_slo_watchdog_job: add W-4, an alert when the APPROVED playbook count is 0, raising TYPE-8M as soon as the chain breaks; total checks go from 3 to 4 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:18:11 +08:00
Your Name	156a52f807	fix(aiops): ADR-092 three fixes: Playbook enum crash + Telegram permanent silence + adoption failure + AI self-health check. B1 playbook_service.py: the evolver's setattr passed a str instead of a PlaybookStatus enum → playbook.status.value in _pg_upsert blew up (163 times/48h); fix: update_with_validation forces enum conversion. B2 approval_db.py + webhooks.py: find_by_fingerprint wrongly converged on PENDING records → PENDING does not mean the Telegram card was sent; fix: after a successful push, mark tg_sent:{fingerprint} in Redis (24h TTL) → outside the debounce window, find_by_fingerprint converges on a PENDING record only with Redis confirmation. drift_adopt_service.py: telegram_gateway called adopt_drift(report_id) but the method did not exist → add an adopt_drift() wrapper that loads the DriftReport from the DB and delegates to adopt(), fixing the adoption failure. B3 ai_slo_watchdog_job.py + main.py: the AI could not perceive its own failures (MASTER §1.1 blind spot) → add a self-health check every 15 minutes: W-1 SLO violation, W-2 TG silence detection, W-3 flywheel success rate → any anomaly → TYPE-8M send_meta_alert; Redis dedup 1h Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:00:06 +08:00
Your Name	40771cda6d	feat(ai_router): feedback_from_aider_events read-only hook (Phase 24 A8)	2026-04-20 19:40:01 +08:00
Your Name	cd894310dc	feat(api): POST /api/v1/aider/events HMAC webhook + Redis stream push - Router layer: HTTP validation + HMAC-SHA256 signature verification - Service layer: Redis stream push (aider_event_service.push_aider_batch_to_stream) - Follows the leWOOOgo modular layering: Router → Service → Redis - All 6 tests passing (signature validation, batch limits, edge cases)	2026-04-20 19:40:01 +08:00
Your Name	964427c5d4	feat(service): aider_event_service — classify + signal_data builder (uses existing debounce)	2026-04-20 19:40:01 +08:00
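Your Name	54d60d04f5	feat(drift+target): P0.1+P0.2+P0.3 three fixes: drift pagination and bucketing + AI recommendation + target tracing. Commander's decision on the three questions: do all of them; the AI recommendation's 0.85 threshold is display-only with no auto-execution; check aol first, then fix ## RCA: source of the awoooi-service failures - /api/v1/aiops/kpi shows 1 playbook_executed record in the past 24h with actor=approval_execution status=failed - grep codebase: no code hard-codes awoooi-service (only historical comments) - Most likely source: alert_rule_engine._extract_vars takes the labels.service value as the Deployment name - cf5050c/4f2e122 (2026-04-18) already fixed the two NEMOTRON hallucination paths; this commit fixes the third ## Fixes ### P0.3a alert_rule_engine._extract_vars - labels.service fallback: a trailing -service suffix is stripped first and the remainder treated as the base name - The match_rule return value gains a target_source field for tracing - If awoooi-service recurs, the source is directly visible (label.service(stripped), etc.) ### P0.3c approval_execution._log_aol_started.input - Add parsed_target/operation/namespace fields - Future aol lookups of failed runs show the target directly, with no guesswork ### P0.1 telegram_gateway._send_drift_diff_detail - Pagination (10 items/page) replaces flooding the chat with 30 items at once - Header counts in 3 buckets: manual high-risk / general change / K8s automatic - ⬅️/➡️ pagination buttons at the bottom (callback: drift_view_page:{report_id}_{page}) - security_interceptor INFO_ACTIONS whitelists drift_view_page ### P0.2 drift_narrator recommendation - The LLM prompt gains a recommendation field (action/confidence/reason) - action ∈ {adopt, revert, ignore, investigate} - The top of the card shows "🎯 AI recommendation: ⏪ Revert (85%), reason" - On LLM failure, _fallback_recommendation is used (rule-based mapping by intent) - Card diff_summary cap 500 → 1500 characters to fit recommendation + narrative + items - Commander's directive: display only, no auto-execution (the 0.85 threshold is reserved for the future) ## Verification - All 90 pytest tests pass (drift + rule_engine + approval_execution) - AST syntax check passes for 5 files ## Next acceptance checks 1. Next drift trigger → the top of the card shows "🎯 AI recommendation" 2. Pressing drift_view → 3-bucket header + ⬅️/➡️ 3. If awoooi-service recurs → query automation_operation_log.input.parsed_target directly Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 04:04:13 +08:00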
Your Name	f572561467	feat(ai_advisory): P0 fix: leader lock + inline keyboard + callback handler. Commander's screenshot feedback, 2026-04-19: 1. The same alert was pushed twice in a row at 22:44 (multiple Pods all running the daily loop) 2. Plain text with no buttons (no feedback loop / the AI only advises and does not execute). New services/ai_advisory_helpers.py (~240 lines): - try_acquire_daily_lock(job_name): Redis SETNX key 'aiops:daily_lock:{job}:{date}', TTL 25h, fail-open (pushes anyway if Redis is down, never blocks). - try_acquire_hourly_lock(job_name): hourly version of the same (used by coverage_evaluator). - is_snoozed / set_snooze: Redis key 'aiops:snooze:{type}:{target}' TTL 24h. - build_ai_advisory_keyboard: a unified set of 4 buttons, ✅ Handled / 😴 Snooze 24h / 🔍 View details / 📋 Generate kubectl command; callback_data format: 'ai_advisory_{action}:{type}:{id}' - handle_ai_advisory_callback: handles the handled/snooze actions and writes aol.output.human_feedback; view/produce_cmd are left for P1. The 4 LLM scanners now use the helper: - capacity_forecaster: daily_lock + snooze check per host + buttons - compliance_scanner: daily_lock (cron only) + snooze per date + buttons - coverage_evaluator: hourly_lock + snooze per worst_dimension + buttons - hermes_rule_quality: daily_lock + snooze per primary rule + buttons. telegram_gateway.py: handle_callback adds 'ai_advisory_*' routing (after step 1.85 drift). New _handle_ai_advisory_action method: parse payload 'type:id' → call handle_ai_advisory_callback → answer_callback (Telegram toast feedback) → return dict (info_action=True for view/produce_cmd). Alignment with the Commander's iron rules: ✅ In multi-Pod setups only the leader pushes (Redis SETNX guarantees idempotency) ✅ Failures are fail-open and do not block the main business flow (still works if Redis is down) ✅ aol.output gains human_feedback for AI learning ✅ snooze avoids repeated alerts (24h TTL) ✅ Reuses the existing drift button pattern (non-breaking). Tomorrow morning's AI push will be: - A single message (no duplicates) - With 4 buttons (manual feedback loop) - After snooze, no more pushes on the same topic for 24h. view/produce_cmd (P1) is left for the next session (AI-initiated MCP evidence gathering + LLM-generated kubectl command). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 23:02:57 +08:00
Your Name	fa643ebdc7	refactor(p1): extract LLM JSON parse helper + dual-condition coverage threshold (architect review P1). The chief architect's 2026-04-19 review (92/100 Grade A) flagged these P1 optimizations: 1. The LLM JSON 3-path parse logic is duplicated across 4 scanners (~80 lines × 4 = 320 lines) 2. The coverage red>=20 trigger threshold is too low; a production bootstrap always trips it and wastes tokens. P1.1+1.2 New services/llm_json_parser.py (~90 lines): parse_llm_json_response(text, required_key, logger_context) with a 3-path fallback: Path 1: strip the markdown fence + direct JSON containing required_key. Path 2: NemoTron wrapper (JSON embedded in description/action_title/reasoning). Path 3: everything failed, return None + logger.warning. Failures never raise; the caller decides the fallback. The 4 LLM scanners now use the helper: - hermes_rule_quality_job: required_key='recommended_actions' - capacity_forecaster_job: required_key='priority_actions' - compliance_scanner_job: required_key='posture_grade' - coverage_evaluator_job: required_key='worst_dimension'. Each loses about 20 duplicated lines. P1.3 The coverage trigger becomes a dual condition: Before: total_red >= 20 (always trips on bootstrap). After: red_ratio > 30% AND total_scanned >= 50. _fetch_red_summary also returns total_scanned for the calculation. 5/5 unit tests for parse_llm_json_response: ✅ direct / markdown fence / NemoTron wrapper / invalid / missing key. P1.4 capacity_scanner + rule_catalog_sync: on inspection they already have full author comments (the review was mistaken). Other P1 items (Prom HTTP helper / staggered first_delay / LLM budget guard) are left for the next session. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 22:39:40 +08:00
Your Name	37b6c9ba56	chore: remove empty ai_orchestrator.py (an empty file that slipped into a commit). The previous commit (`86d9b22` LOGBOOK) accidentally brought in the 0-line empty file ai_orchestrator.py via stash pop; it was not created on purpose. This commit deletes it to keep services/ clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 22:22:53 +08:00
Your Name	86d9b22125	docs(logbook): session wrap-up: Gap Review + AI autonomy 1/9→4/9, full record. Close-out of the session's 35 commits: - Phase 7 foundation (scanners + evaluator + tracker + advisor + forecaster) - KPI Dashboard API (autonomy_score 63/100, quantifiable) - Audit: 3 honestly reported Gaps - Gap 1: strict host IPv4 + cleanup of 266 duplicate records - Gap 2: root cause confirmed, not a bug - Gap 3: LLM upgrade 3/8 (capacity_forecaster/compliance/coverage). AI autonomy achieved: 1/9 LLM (Hermes only) → 4/9 LLM decision; all 8 tables that had 0 writers are now active; 7/7 coverage dimensions complete. Tonight the AI will autonomously push 4 kinds of Telegram analysis reports Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 22:22:42 +08:00
OG T	0004554bc6	feat(api): AIOps KPI Dashboard: full view of AI autonomy maturity (modular refactor). GET /api/v1/aiops/kpi → integrates all MASTER §7.1 KPIs in one call. Alignment with the leWOOOgo modular iron rule: - Router (api/v1/aiops_kpi.py) does HTTP routing only and does not touch the DB - Service (services/aiops_kpi_service.py) owns all SQL + computation - The previous commit was blocked by a hook (the Router imported get_db_context directly); fixed here. services/aiops_kpi_service.py (~230 lines): AiopsKpiService.get_snapshot() returns 6 sections: 1. asset_inventory: by_type + total + last_scan (run_id/ended_at/total/new/modified) 2. coverage_kpi: 7 dimensions × (green/yellow/red/unknown) + green_ratio_per_dim + overall_green_ratio (MASTER §7.1 #5 SLO) 3. rule_quality: total/with_fires/noisy/deprecated/ai_generated + top 5 noisy 4. capacity_health: latest snapshot per host + by_verdict + violations_7d 5. automation_flow_24h: aol detail + by_actor + by_operation_type 6. ai_autonomy_score: 0-100 total, 5 sub-scores × 20: asset_coverage / rule_quality / capacity_health / automation_flow / ai_diversity; grade: mature(90+) / in_progress(70-90) / starter(50-70) / initial(<50). api/v1/aiops_kpi.py (~35-line slim router): only router = APIRouter() + @router.get delegating to the service. main.py: include_router(aiops_kpi_v1.router, prefix='/api/v1', tags=['AIOps KPI']). Commander usage: curl http://192.168.0.121:32334/api/v1/aiops/kpi | jq . shows the full AI autonomy maturity picture in one call Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 21:21:46 +08:00
OG T	c0f3509d39	fix(drift-card): Drift Diff HTTP 400: accumulate length item by item to avoid cutting HTML. The Commander reported that clicking [View Diff] at 14:18 returned 'Drift Diff query failed: HTTP error: 400'. Root cause (telegram_gateway.py:2087 _send_drift_diff_detail): - report_id=7ffe78ae has 48 items; the longest single git_value is 1794 characters (env array) - The accumulated _full far exceeds 4096, so the _full[:3950] truncation runs - The truncation can cut in the middle of an HTML tag (<code>... or in the middle of a < entity) - Telegram with parse_mode='HTML' rejects incomplete HTML → 400. Fix: - Accumulate length item by item, counting each item's _block length + 1 - Reserve a 3800 cap (4096 - 250 buffer for the header + the '… X more items' notice) - Ensure _full is always a complete HTML structure. Verification: the next time a drift report appears and the Commander clicks [View Diff], it should display correctly (next cycle of this session) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 14:26:29 +08:00
OG T	e7ba8cb181	fix(aiops): unblock the AI self-learning chain: verifier switched to await + aol action write-back. The Commander's 2026-04-19 full audit found: - automation_operation_log: 22 rows (all drift_narrator); of 33 approval actions/7d, 0 were written back - incident_evidence.verification_result: 1212 rows, 100% NULL; the verifier never wrote anything - Root cause: _run_post_execution_verify used asyncio.create_task fire-and-forget; on Pod recycle the task is killed, so verification_result is never written. Fix (unblocks the full verifier→learning→Playbook EWMA→finetune chain): approval_execution.py: + _log_aol_started: INSERT aol(playbook_executed, pending) when the main flow starts + _log_aol_completed: at the 4 return points, UPDATE aol to success/failed + duration + stderr └ NO_ACTION / parse_fail / K8s success / K8s failure all leave a trace ~ _run_post_execution_verify in two places (success + failure path) changed from create_task to await + 60s timeout + on failure, stderr_feed_back is written to result.error → closes the E6 stderr feedback loop. declarative_remediation.py: ~ the _log_remediation_event task is now named + has add_done_callback, so task failures are logged (the original fire-and-forget wrote 0 rows; it is now possible to diagnose why the task died). Expected effect: - aol playbook_executed visible immediately (the 33 actions/7d have data right away) - incident_evidence.verification_result starts accumulating → the finetune_exporter 7d cron finally has material - Playbook EWMA trust_score starts changing dynamically - stderr_feed_back connected → failure signals feed back into retry/Playbook negative reinforcement. No impact on: - background_task runs in the background, so the +60s delay does not block the API - aol write failures only logger.warning and do not block the main execution flow. Refs: MASTER §3.1 L6×D1 (ADR-081 PostExecutionVerifier), MASTER §3.4 D4 (ADR-083 learning loop), ADR-090 monitoring blind-spot governance (2026-04-18 full audit) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 12:07:29 +08:00
OG T	2abc91e360	fix(drift-card): fix 2 drift card bugs: AI assessment copy styling + Diff button AttributeError. Bug 1: pressing "🔍 View Diff" failed. Error: 'DriftReportRepository' object has no attribute 'get_by_id'. Root cause: the DriftReportRepository method is called get(), while the other repos use get_by_id(). Fix: add a get_by_id() alias to match the repo interface convention. Bug 2: the AI assessment text was rendered as a code block with a copy button. Root cause: telegram_gateway line 1962 wraps diff_summary in <pre>, but diff_summary is an AI assessment narrative + emoji list, not code. Fix: remove <pre>; display as plain text with a divider + html.escape. Acceptance: - Next drift card: the AI assessment paragraph is plain text (no purple code block + copy) - Pressing "🔍 View Diff" → sends the full diff details (no AttributeError) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 11:27:13 +08:00
OG T	4b8be32610	fix(telegram+approval): TG-1 + AP-1/2/3: 4 Telegram UX fixes. Early morning 2026-04-19 (Taipei time), ogt + Claude Opus 4.7 (1M) ## TG-1: add view to INFO_ACTIONS. security_interceptor.py: the 'view' button now uses the 2-part read format and no longer wrongly triggers the 4-part nonce write format. ## AP-1: persist approval_records.telegram_message_id. After telegram_gateway.send_approval_card sends successfully, it runs UPDATE approval_records SET telegram_message_id, telegram_chat_id at the DB layer (not only Redis, so the original card can still be found after a Pod restart). ## AP-2: edit the original card after approval execution + KM/Playbook increments. approval_execution._push_execution_result_to_alert, besides replying to the original card, also calls editMessageReplyMarkup to remove the buttons (fixes cards stuck on "Executing" forever). - Also queries knowledge_entries/playbooks for increments within the last 2min and appends "📚 KM +N 🎯 Playbook updates×M" to the message - Success: ✅ Execution succeeded + action + KM increment - Failure: ❌ Execution failed + reason + KM increment ## AP-3: normalize primary_responsibility to lower the "❓ Unknown" ratio. openclaw._parse_analysis_result: if the LLM leaves the field blank/None/outside the whitelist (FE/BE/INFRA/DB/COLLAB), force a fallback: INFRA if a kubectl keyword is present, otherwise BE. Previously only "not in data" was checked, so None or an empty string slipped through. ## Skipped: TG-3 (refactor) + TG-5 (webhook is a deprecated endpoint; the design uses Long Polling) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 01:15:58 +08:00
OG T	68a42a3c97	fix(openclaw): hallucination validation covers both paths + shared helper extracted. Early morning 2026-04-19 (Taipei time), ogt + Claude Opus 4.7 (1M). Root cause: the Python hallucination validator from commit `7e9448f` was installed only in `analyze_alert` (webhook path), but the incident sweeper goes through `generate_incident_proposal` (line 1552), which had no validation → at 00:23 a PostgreSQLDiskGrowthRate card showed the hallucinated "deployment/awoooi-prod" without being intercepted. Fix: 1. Extract the shared method `_validate_deployment_inventory(result, inventory, ns)` 2. `analyze_alert` (line 1322 area) calls this helper; the original inline logic is removed 3. `generate_incident_proposal` (line 1552) fetches the inventory dynamically + calls the helper 4. The helper additionally: - sets result.action_title = '[Safe downgrade] Investigate the real resource state of {ns}' (previously only description was changed and action_title was not → the DB action column still kept the old text) - wraps each field assignment in try/except as a safeguard, so a single field failure does not affect the others Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 01:11:09 +08:00
OG T	fdce0a3ab9	fix(approval): NO_ACTION no longer mislabeled EXECUTION_FAILED (fix for MASTER §7.1 #11) Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details 2026-04-19 early morning (Taipei time), ogt + Claude Opus 4.7 (1M) Root cause: approval.action='NO_ACTION - 待分析' (a product of the hallucination validator's downgrade) is passed into parse_operation_from_action → operation_type=None → background_execution_skip → update_execution_status(success=False) → marked EXECUTION_FAILED. KPI pollution: MASTER §7.1 #11 auto_execute success rate = EXECUTION_SUCCESS / (SUCCESS+FAILED). NO_ACTION should never count as a failure, yet it was counted and dragged the metric down. A large share of the measured 30d success rate of 0.9% comes from mislabeled NO_ACTION. Fix: when parsing fails, first check whether the action is of the NO_ACTION class (action contains keywords such as NO_ACTION/OBSERVE/INVESTIGATE) → take a dedicated noop branch: - log event=background_execution_noop (info level) - update_execution_status(success=True) → EXECUTION_SUCCESS - timeline marked "✅ 純觀察類動作完成" (observation-only action completed) - reply to the original alert card showing success - return True Genuine parse failures (non NO_ACTION) keep the original failure path but now carry error_message (extension of P0.2), so rejection_reason holds "Could not parse operation type from action: <action>" instead of an empty string. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 01:08:16 +08:00
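The branching that commit fdce0a3ab9 describes (parse failure, then a NO_ACTION check before marking failure) might look like the following minimal sketch. The helper names `is_noop_action` and `resolve_unparsed_action` are hypothetical, and the keyword tuple holds only the three keywords the message names; the message says "such as", so the real list is probably longer.

```python
# Hypothetical sketch; only the three keywords named in the commit are listed.
NOOP_KEYWORDS = ("NO_ACTION", "OBSERVE", "INVESTIGATE")


def is_noop_action(action):
    """True when the action text belongs to the observation-only class."""
    if not action:
        return False
    upper = action.upper()
    return any(keyword in upper for keyword in NOOP_KEYWORDS)


def resolve_unparsed_action(action):
    """Decide the status for an action whose operation type could not be parsed.

    Returns (status, error_message).
    """
    if is_noop_action(action):
        # Observation-only actions count as success, so they no longer pollute the KPI.
        return ("EXECUTION_SUCCESS", None)
    # Genuine parse failure: keep the failure path, but record why.
    return ("EXECUTION_FAILED",
            f"Could not parse operation type from action: {action}")
```

With this split, the denominator of the auto_execute success rate only grows on failures that carry a diagnosable reason.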
OG T	2e988bdb81	fix(telegram): post drift execution result back to the card + audit log user_id Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details The IDE flagged _stamp as unused (the result was never sent) and user_id as unused (audit gap). Fix: 1. _edit_drift_card_outcome no longer only removes the buttons; it also sends a sign-off stamp message (reply_to the original card if msg_id exists), format: ✅ 已採納 by @username (成功) Drift <report_id> 2. _handle_drift_action adds a drift_callback_dispatched log (audit) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 01:07:13 +08:00
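OG T	877c8479e0	fix(telegram): TG-2 + TG-4 fix the drift button black hole Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details 2026-04-19 early morning (Taipei time), ogt + Claude Opus 4.7 (1M) A screenshot from the commander showed the problem directly: pressing "查看 Diff" (View Diff) turned the card into "執行中" (executing), and the remaining 21 items could not be seen. A full audit found 9 bugs in the Telegram subsystem; this commit fixes the 2 most painful ones: ## TG-2: the 3 buttons drift_view/drift_adopt/drift_revert have no handler Click → fallthrough → UX black hole / approve path triggered by mistake. Fix: handle_callback adds a Step 1.85 offroute after the state guard (after line 2752): the 3 drift_* actions → handled exclusively by _handle_drift_action, bypassing the nonce approve/reject dispatch so the execution flow is not triggered by mistake. Implementation of the 3 buttons: - drift_view: read drift_reports → send a new message showing all items (HIGH/MEDIUM/INFO emoji + Git vs K8s original values side by side, capped at 50 items and 4000 characters) - drift_adopt: call drift_adopt_service.adopt_drift() - drift_revert: call drift_remediator.revert() ## TG-4: drift card message_id was not stored in Redis → the card could not be edited afterwards Fix: after send_drift_card succeeds, setex f"tg_drift:{incident_id}" with TTL 24h, so that _edit_drift_card_outcome can update the original card after adopt/revert runs (first remove the buttons, then add the sign-off stamp "XX by @username (成功/失敗)"). ## Not included (follow-up): TG-1 INFO_ACTIONS expansion (view): next commit TG-3 duplicate handler dispatch: under evaluation TG-5 Bot webhook URL not set: needs the commander's decision on a public URL approval card NO_ACTION mislabeled FAILED: next commit approval card description contradiction / responsibility unknown / post-execution edit The full list of 9 bugs is in project_phase7_round3_telegram_subsystem_audit (to be created). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 01:06:30 +08:00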
OG T	98aef55b31	feat(kpi): ADR-090-D repair all 5 broken links in the MASTER §7.1 North Star KPIs Some checks failed CD Pipeline / build-and-deploy (push) Successful in 11m49s Details run-migration / migrate (push) Failing after 15s Details 2026-04-18 evening (Taipei time), ogt + Claude Opus 4.7 (1M) Benchmarking the 15 MASTER §7.1 North Star KPIs against measured data found 5 broken links: #3 fine-tune JSONL /week: the finetune_exports table does not exist #4 MCP calls/24h: timeline_events has no mcp_call event_type #6 Declarative remediation usage rate: the remediation_events table does not exist #7 general fallback 17.3%: classify_alert_early misses 5 categories #10 notification_outcomes /week: the table does not exist This commit fixes all of them. ## 1. Migration: adr090d_kpi_data_sources.sql (3 tables) - finetune_exports: P3 Fine-tune JSONL tracking - remediation_events: P5 Declarative remediation tracking - notification_outcomes: notification quality + RLHF corpus Idempotent (CREATE TABLE IF NOT EXISTS), already applied to prod. ## 2. classify_alert_early gains 4 rule classes (lowers the general fallback) - test interception: Test/FPTest/FingerprintTest/ADR089Test/L4Closure/FreshUniq* → category='test', TYPE-1 notification only - HighCPU/Memory/Disk/Load → host_resource - TLS/SSL/ProbeFailure* → ssl_cert - PostgreSQL/MySQL/MongoDB/DiskGrowthRate → database Expected general 17.3% → 3-5% (meets the <10% target). ## 3. finetune_exporter DB write _run_export() writes one finetune_exports row at the end, including checksum/size/record_count. ## 4. declarative_remediation DB write After evaluate(), a fire-and-forget _log_remediation_event() writes to remediation_events (status='pending'; remediation_type is set automatically to declarative/imperative/gitops_pr according to tier). ## 5. telegram_gateway DB write (send_approval_card) After _send_request succeeds and returns message_id, one notification_outcomes row is written with channel='telegram', delivery_status='delivered\|failed'. Later, when a human presses a button, user_action is updated → prime RLHF training data. ## 6. pre_decision_investigator MCP call tracking The finally block of _call_single_tool() writes a timeline_events row with event_type='mcp_call', including provider/tool/status/duration_ms/error. MCP calls within 24h become measurable via SQL. ## Expected quantitative improvement \| KPI \| Before fix \| Expected 24h after fix \| \|-----\|------\|----------------\| \| #3 fine-tune /week \| 0 (table does not exist) \| >=10 (weekly cron run) \| \| #4 MCP calls/24h \| 0 \| >0 (real calls will be written to timeline) \| \| #6 declarative usage rate \| table does not exist \| data present (pending/success/failed distribution) \| \| #7 general fallback \| 17.3% \| <10% \| \| #10 notification_outcomes \| 0 \| one row per approval card \| Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 00:00:31 +08:00
OG T	898145d68e	refactor(openclaw): SuggestedAction moved to a top-level import (avoids a triple-nested inline import) Some checks failed CD Pipeline / build-and-deploy (push) Successful in 11m17s Details Ansible Lint / lint (push) Has been cancelled Details The IDE raised a false positive on the inline "from src.models.ai import" (it ran fine). Switching to a top-level import satisfies the IDE and is also more Pythonic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 23:28:19 +08:00
OG T	e6e484c1dc	fix(openclaw): import path fix: src.models.ai (not openclaw_schema) Some checks are pending CD Pipeline / build-and-deploy (push) Has started running Details A bug the IDE caught correctly (not a false positive): SuggestedAction lives in src/models/ai.py. _SA.NO_ACTION can now downgrade correctly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 23:26:45 +08:00
OG T	7e9448f6d0	fix(openclaw): two-layer defense against hallucinated deployment names: Prompt + Python validator Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details 2026-04-18 evening (Taipei time), ogt + Claude Opus 4.7 (1M) Production incident (approval f763bedf, 22:58): - Alert: KubePodCrashLooping, labels.deployment="awoooi-api" - Although NEMOTRON received the inventory "awoooi-api, awoooi-web, awoooi-worker", it still output kubectl_command="kubectl rollout restart deployment/awoooi-prod" (mistaking the namespace for a deployment name) - Execution result: "Deployment 'awoooi-prod' not found in namespace 'awoooi-prod'" ## Layer 1: NEMOTRON_SYSTEM_PROMPT hardening (prompts.py) New "🔒 DEPLOYMENT NAME RULE (STRICTLY ENFORCED)" block: - namespace NEVER is a deployment name - "awoooi-prod" is a NAMESPACE; never write deployment/awoooi-prod - if an inventory is given, the deployment must be an exact match - prefer labels.deployment; unknown → NO_ACTION ## Layer 2: Python post-validation (openclaw.py:1322+) After the LLM response is parsed, a regex extracts the deployment name and checks it against _k8s_inventory: - in the list → pass - not in the list → downgrade: * kubectl_command → "kubectl get deploy -n {ns}" (investigation only) * suggested_action → NO_ACTION * target_resource → "unknown(hallucinated)" * confidence → 0.0 * description is annotated with [安全降級] and lists the valid inventory - log 'openclaw_deployment_hallucination_detected' Effect: even if the LLM ignores the prompt, the Python layer blocks it. A destructive kubectl command never runs against a nonexistent deployment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 23:26:09 +08:00
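A minimal sketch of the Layer 2 post-validation from commit 7e9448f6d0, under stated assumptions: the `Proposal` dataclass, the `validate_deployment` function, and the exact regex are hypothetical stand-ins, since the commit message only says that a regex extracts the deployment name and compares it with the inventory.

```python
import re
from dataclasses import dataclass

# Assumed pattern: captures the name in "deployment/<name>" inside a kubectl command.
DEPLOYMENT_RE = re.compile(r"deployment/([a-z0-9][a-z0-9.-]*)")


@dataclass
class Proposal:
    """Hypothetical stand-in for the parsed LLM analysis result."""
    kubectl_command: str
    suggested_action: str
    target_resource: str
    confidence: float
    description: str


def validate_deployment(proposal, inventory, ns):
    """Downgrade the proposal when its deployment name is not in the inventory."""
    match = DEPLOYMENT_RE.search(proposal.kubectl_command)
    if match is None or match.group(1) in inventory:
        return proposal  # nothing to check, or the name is valid
    # Hallucinated name: replace the command with a read-only investigation.
    proposal.kubectl_command = f"kubectl get deploy -n {ns}"
    proposal.suggested_action = "NO_ACTION"
    proposal.target_resource = "unknown(hallucinated)"
    proposal.confidence = 0.0
    valid = ", ".join(sorted(inventory))
    proposal.description = f"[安全降級] {proposal.description} (valid deployments: {valid})"
    return proposal
```

Because the check runs after parsing, it holds regardless of whether the model followed the prompt rule.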
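OG T	6ad73b4834	fix(flywheel): three fixes for the L5/L6 broken links: RBAC expansion + failure reasons persisted + verifier also runs on failure All checks were successful CD Pipeline / build-and-deploy (push) Successful in 11m6s Details 2026-04-18 evening (Taipei time), ogt + Claude Opus 4.7 (1M) A full flywheel diagnosis exposed 3 real broken links: - L5 execution 30d: EXECUTION_FAILED 216 / EXECUTION_SUCCESS 2 (failure rate 99%) - L6 verification 7d: verification_result all NULL (none of the 988 evidence rows verified) - all rejection_reason / error_message columns empty (nothing to diagnose with) Root cause: the awoooi-executor ServiceAccount has insufficient RBAC; every kubectl get nodes/HPA in executor.py returns Forbidden, so not even evidence can be collected, every subsequent repair fails, the verifier never triggers because execution never succeeds, and the evidence verification result stays NULL. One RBAC fix resolves 3 nodes. ## P0.1 RBAC expansion (k8s/awoooi-prod/07-rbac.yaml) New cluster-scope read permissions (list/get/watch only, zero writes): - nodes + nodes/status (required for evidence gathering) - horizontalpodautoscalers (HPA status) - metrics.k8s.io: nodes + pods (resource metrics) - statefulsets + daemonsets (complete workload view) Applied with kubectl apply + smoke test: kubectl get nodes runs. ## P0.2 always write rejection_reason on failure (approval_db.py) update_execution_status() gains an error_message parameter; on failure it is written to rejection_reason (truncated to 2000 characters) → later diagnosis has something to go on. The approval_execution.py call sites are updated to match, passing result.error all the way into the DB. ## P0.3 verifier also runs on failure (approval_execution.py) Old logic: the verifier was called only when result.success=True → at a 99% failure rate it never ran. New logic: the failure path also runs the verifier via create_task, with ":FAILED" appended to action_taken as a marker. The verifier captures post_state and writes verification_result='failed' back to incident_evidence. L7 learning now has failure samples to learn from, and the 2x negative decay of playbook trust actually takes effect. Expected effect: - EXECUTION_FAILED rate should drop from 99% to <30% within 30d - incident_evidence.verification_result NULL rate should drop from 100% to <10% - approval_records.rejection_reason fill rate goes from 0% to 100% Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 20:12:57 +08:00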
OG T	b0d560dbb3	fix(drift-narrator): shortener uses replace to tolerate the LLM's hallucinated 'Resource/Name:' prefix All checks were successful CD Pipeline / build-and-deploy (push) Successful in 10m50s Details 2026-04-18 afternoon (Taipei time), ogt + Claude Opus 4.7 Round 4 The LLM itself prepends a resource identifier to the field: 'Deployment/awoooi-web: spec.template.spec.containers', which defeats the startswith-based shortener (the prefix is no longer at the start). Defensive fix: if startswith does not match → use replace to strip the prefix at any position. Result: 'Deployment/awoooi-web: containers' ✅ Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 18:12:15 +08:00
OG T	b63aed72df	fix(drift-narrator): strip the spec.template.spec. prefix to fix ugly Telegram auto-wrapping All checks were successful CD Pipeline / build-and-deploy (push) Successful in 12m1s Details 2026-04-18 afternoon (Taipei time), ogt + Claude Opus 4.7 Visual feedback from the commander after three rounds of live testing: the field name 'spec.template.spec.volumes' is 24 characters long; with emoji + ': ' + summary it exceeds the visual width of a Telegram <pre> block, and auto-wrapping splits the emoji from the field name, leaving it alone on its own line. Fix: _shorten_field_path() strips 3 common prefixes: - 'spec.template.spec.' → '' - 'spec.template.' → '' (fallback) - 'spec.' → '' (fallback) Before/after: Before: '🟡 spec.template.spec.affinity.podAntiAffinity.preferredDuringS: [清單 3 項]' After: '🟡 affinity.podAntiAffinity.preferredDuringS: [清單 3 項]' Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 17:10:20 +08:00
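The prefix stripping from commit b63aed72df, together with the replace-based fallback added later in b0d560dbb3, could be sketched as follows. The function name mirrors `_shorten_field_path` from the messages, but the body is an assumption and not the repository's code.

```python
# Prefixes from the commit message, longest first so the most specific one wins.
PREFIXES = ("spec.template.spec.", "spec.template.", "spec.")


def shorten_field_path(field):
    """Strip a common spec prefix; a sketch of the behavior described in the commits."""
    # First pass: the prefix sits at the start of the field path.
    for prefix in PREFIXES:
        if field.startswith(prefix):
            return field[len(prefix):]
    # Fallback: the LLM prepended something like "Deployment/name: ",
    # so the prefix appears later in the string; remove its first occurrence.
    for prefix in PREFIXES:
        if prefix in field:
            return field.replace(prefix, "", 1)
    return field
```

One caveat of this sketch: the substring fallback would also hit a field whose own name happens to contain "spec.", so a stricter implementation might anchor the match on a separator.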
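OG T	f3960f36d2	fix(drift-narrator): stronger fallback: label K8s default-value backfill + count actionable items separately All checks were successful CD Pipeline / build-and-deploy (push) Successful in 10m37s Details 2026-04-18 afternoon (Taipei time), ogt + Claude Opus 4.7 (1M) Feedback from the commander's live test: the card showed "securityContext: (未設) → {物件 0 欄位}" (unset → object with 0 fields), which is meaningless. Root cause: _fallback_items treated the noise of "K8s controller auto-filling empty objects" as real changes. In addition, the "還有 29 項" (29 more items) figure included whitelisted + trivial items. 3 fixes: 1. _is_trivial_drift() new predicate None/empty string/{}/[]/false/0 etc. are treated as equivalent to each other ("no substantive change"), catching the K8s controller auto-fill scenario 2. _summarize_item() replaces the former smart_shorten - trivial → "K8s 預設值補齊 (無實質變更)" (K8s default backfill, no substantive change) - None → value → "新增 xxx" (added) - value → None → "已刪除 (原: xxx)" (deleted) - other → "from → to" 3. _fallback_items() improvements - sorted by level (HIGH first) - whitelist + HPA allowlist filtered first 4. _count_nontrivial_drift() + Telegram rendering - new "actionable" count (excluding whitelist + trivial) - "還有 N 項" uses the actionable count, so it no longer misleads - when items is empty, shows "全為白名單或預設值補齊" (all whitelisted or default backfill) Expected effect: Before: "... 還有 29 項" (only 1 was a real drift) Now: "... 還有 0 項" or "(全部為白名單或預設值補齊)" Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 16:29:49 +08:00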
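The trivial-drift predicate described in commit f3960f36d2 might be sketched as below. This is an illustration only: the helper names, the `(git_value, live_value)` pair representation, and the exact set of values treated as empty are assumptions (the message lists None, empty string, {}, [], false and 0 "etc."), and whitelist filtering is left out.

```python
def is_empty_value(value):
    """True for values the commit treats as 'nothing set' (assumed list)."""
    if value is None or value is False:
        return True
    if isinstance(value, (str, dict, list)):
        return len(value) == 0
    # Exclude bool here so that True is never mistaken for the number 1 or 0.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return value == 0
    return False


def is_trivial_drift(git_value, live_value):
    """Both sides empty means a controller default backfill, not a real change."""
    return is_empty_value(git_value) and is_empty_value(live_value)


def count_nontrivial_drift(items):
    """Count (git_value, live_value) pairs that represent a real change."""
    return sum(1 for git_value, live_value in items
               if not is_trivial_drift(git_value, live_value))
```

Using the non-trivial count for the "N more items" footer is what keeps the card from reporting 29 items when only one is a real drift.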


537 Commits