awoooi

Author	SHA1	Message	Date
Your Name	bb5f16f8ef	fix(aiops-p2): P2.1 three LLM quality fixes: Evidence-First + consensus confidence + raw_evidence injection [Checks failed: CD Pipeline / build-and-deploy (push) cancelled] Root causes: - consensus_engine: all four ExpertAgents reported confidence=0.0 → weighted vote total=0 → always returned NO_ACTION - prompts.py had no Evidence-First instruction → the LLM reasoned from memory, with no constraints from the real environment - openclaw.py: analyze_alert built the prompt without injecting MCP evidence (diagnosis_context) Fixes: - consensus_engine: SRE/Security/Cost/Performance now set confidence between 0.45 and 0.80 according to signal strength - consensus_engine: _normalize_action adds the alias 「重新啟動」 ("restart") → RESTART - consensus_engine: SecurityAgent drops the unused _target variable - prompts.py: add the Evidence-First Protocol + Skepticism Rules blocks - openclaw.py: analyze_alert extracts diagnosis_context → injects it into full_prompt as <raw_evidence> Verification: consensus score went from 0.0 → 0.744 (CrashLoop test case) P2.1 fix 2026-04-24 ogt + Claude Sonnet 4.6 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 15:52:25 +08:00
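The zero-confidence failure described in bb5f16f8ef can be illustrated with a small confidence-weighted vote. This is a minimal sketch, not the project's consensus_engine: `Vote`, `weighted_consensus`, and the sample confidences are hypothetical and chosen only to show why a total weight of 0 forces the fallback.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    agent: str
    action: str
    confidence: float  # expected range 0.0 .. 1.0


def weighted_consensus(votes: list[Vote], fallback: str = "NO_ACTION") -> tuple[str, float]:
    """Return (winning action, winner's share of the total confidence)."""
    totals: dict[str, float] = defaultdict(float)
    for vote in votes:
        totals[vote.action] += vote.confidence
    total = sum(totals.values())
    if total <= 0.0:
        # Every agent reported zero confidence, so there is nothing to weigh.
        return fallback, 0.0
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / total


if __name__ == "__main__":
    silent = [Vote("sre", "RESTART", 0.0), Vote("security", "NO_ACTION", 0.0)]
    print(weighted_consensus(silent))
    voiced = [Vote("sre", "RESTART", 0.8), Vote("security", "NO_ACTION", 0.45)]
    print(weighted_consensus(voiced))
```

With every confidence at 0.0 the function can only return the fallback, which matches the symptom in the commit; once agents report non-zero confidence the vote can produce a real action.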
Your Name	359a6ee495	fix(test-schema): add the matched_playbook_id column to approval_records [Checks failed: CD Pipeline / build-and-deploy (push) cancelled] Root cause of the CI B5 integration test failure: 04ff225 added matched_playbook_id to the ORM model, but tests/integration/setup_test_schema.sql was not updated to match, so test_approval_lifecycle / test_incident_approval_association raised UndefinedColumnError and blocked CD Pipeline build-and-deploy. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 15:48:37 +08:00
Your Name	04ff22563e	fix(aiops-p1): fix all 5 breakpoints in the Playbook learning loop + DB migration (ADR-092 B4) [Checks failed: run-migration / migrate (push) failing after 14s; CD Pipeline / build-and-deploy (push) failing after 2m7s] [P0.4 patch] pre_decision_investigator was missing the Prometheus query field - _build_tool_params() now supplies the "query" field (a required parameter of the prometheus_query tool) - add _build_prometheus_query(), which generates PromQL by alert type (CPU/Memory/Crash/Disk/HTTP/Pod/fallback) - after the fix, the D3_METRICS sensing dimension actually retrieves data (previously 100% returned missing_query_parameter) [P1 Playbook learning loop: B1-B5 all fixed] - B2 db/models.py: ApprovalRecord gains the matched_playbook_id column + ix_approval_matched_playbook index - B2 db/models.py: TimelineEvent gains the incident_id column (for MCP auditing) + index - B3 approval_db.py: the record→ApprovalRequest conversion restores incident_id + matched_playbook_id - B4 approval_repository.py: same as B3 (the two conversion functions must stay in sync) - B5 approval_db.py: approval_request_to_record_data adds matched_playbook_id so the DB can store the value [P1.5 KM write] approval_execution.py: fire-and-forget → await wait_for(30s) - Root cause: the asyncio.create_task task was killed on Pod recycle, so KM writes were silently lost - Fix: await asyncio.wait_for(..., timeout=30.0) + log on TimeoutError [Migration file] adr092_p1_learning_chain_fix.sql - ALTER TABLE approval_records ADD COLUMN matched_playbook_id VARCHAR(36) - ALTER TABLE timeline_events ADD COLUMN incident_id VARCHAR(64) - Run: psql $DATABASE_URL -f apps/api/migrations/adr092_p1_learning_chain_fix.sql [Incidental Agent changes] - decision_manager: Phase 2 YAML NO_ACTION priority gate (host-level / external-service alerts skip Agent Debate) - alert_rules.yaml: new rules for Sentry/ClickHouse + HostDiskUsageHigh/Critical - solver_agent: action_title falls back to semantic synthesis (instead of being silently dropped) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 15:41:35 +08:00
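The P1.5 change in 04ff22563e (awaiting the write with a deadline instead of scheduling it with asyncio.create_task) follows a common asyncio pattern. The sketch below assumes a generic coroutine; `write_with_deadline` and the log message are hypothetical names, not the code in approval_execution.py.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def write_with_deadline(write_coro, timeout: float = 30.0) -> bool:
    """Await the write so it completes (or times out) before the caller returns.

    A task started with asyncio.create_task and never awaited can be lost when
    the process shuts down; awaiting it ties its lifetime to the request.
    """
    try:
        await asyncio.wait_for(write_coro, timeout=timeout)
        return True
    except asyncio.TimeoutError:
        # wait_for cancels the inner coroutine before raising.
        logger.warning("knowledge write timed out after %.1fs", timeout)
        return False


async def _demo() -> None:
    async def slow_write() -> None:
        await asyncio.sleep(1.0)

    print(await write_with_deadline(slow_write(), timeout=0.05))


if __name__ == "__main__":
    asyncio.run(_demo())
```

The trade-off is latency: the caller now waits up to `timeout` seconds, which the commit accepts in exchange for not losing writes silently.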
Your Name	7f4088bcd0	fix(aiops-p0): comprehensive P0 fix for the six root problems (ADR-092 B4) [P0.1] knowledge_extractor_service.py:210: AttributeError fix - the Signal.description field does not exist (100% failure; root cause of the KM +5/day issue) - use alert_name + annotations.summary concatenated as the text instead [P0.2+P0.3] Gates 9+11: relax restrictions on read-only commands - blast_radius_calculator: kubectl get/top/describe/logs/version → score=1 (not 50) - operation_parser: recognize the INVESTIGATE type (read-only kubectl no longer returns None) - executor.py: add INVESTIGATE to the OperationType enum - approval_execution.py: the INVESTIGATE path calls execute_kubectl_command directly [P0.4] MCP SSH/K8s Provider fixes - decision_manager: params= → parameters= (matches the MCPToolProvider.execute signature) - decision_manager: MCPToolResult .get() → .success/.output (dataclass usage) - decision_manager + ssh_provider: add hosts 120/121 (missing from the original default) - auto_approve: the phase2_agent_debate source bypasses the confidence threshold [P0.5] Fix semantic contradictions in alert rules - alert_rules.yaml: 8 kubectl query rules changed from RESTART_DEPLOYMENT → NO_ACTION (CrashLoopBackOff / PostgreSQL connections / slow queries / MinIO disk / K3s nodes / alert pipeline / SSL / CoreDNS, etc.) - incident_service.py: cAdvisor/CoreDNS split out of general into their own categories [P0.6] proactive_inspector: all dynamic-baseline PromQL fixed - all 5 MONITORED_METRICS PromQL expressions corrected (cadvisor label/datname/blackbox) - db_connection_pool: datname="awoooi" → "awoooi_prod" - http_error_rate: invalid http_requests_total → blackbox probe_success - cpu/memory: namespace label → name=~"k8s_api_awoooi-api.*" Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 15:32:23 +08:00
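The Gate 9+11 relaxation in 7f4088bcd0 (read-only kubectl verbs score 1 instead of 50) can be sketched as a verb check. This is an illustration under assumptions: `blast_radius_score` and `READ_ONLY_VERBS` are hypothetical names, and the sketch only recognizes a verb that directly follows `kubectl`, so commands with leading flags conservatively keep the default score.

```python
import shlex

# Verbs listed in the commit message as read-only.
READ_ONLY_VERBS = frozenset({"get", "top", "describe", "logs", "version"})


def blast_radius_score(command: str, default: int = 50) -> int:
    """Return 1 for a read-only kubectl command, otherwise the default score."""
    tokens = shlex.split(command)
    if len(tokens) >= 2 and tokens[0] == "kubectl" and tokens[1] in READ_ONLY_VERBS:
        return 1
    return default


if __name__ == "__main__":
    print(blast_radius_score("kubectl get pods -n prod"))
    print(blast_radius_score("kubectl delete pod api-0"))
```

Falling back to the higher score whenever the verb cannot be identified keeps the gate safe by default.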
Your Name	45dbe07188	fix(flywheel): fix six automation-flywheel capabilities (ADR-092 B3) [Checks failed: run-migration / migrate (push) failing after 22s; Deploy Alert Rules / Deploy Prometheus Alert Rules (push) successful in 53s; Type Sync Check / check-type-sync (push) successful in 2m54s; CD Pipeline / build-and-deploy (push) cancelled; Ansible Lint / lint (push) cancelled] [Root-cause chain fix] MCP Provider bugs → PreDecisionInvestigator fails → Agent Debate has no context → LLM times out → description="待分析" ("pending analysis") → blocked by the ADR-091 iron gate → tg_sent not set → W-2 Watchdog falsely reports a "silent failure" [Six fixes] 1. Three MCP Provider bugs - ssh_provider: asyncssh.run() → conn.run() - prometheus_provider: KeyError 'query' → tolerant .get() - k8s_provider: empty pod_name → return an error dict early 2. Agent Debate / decision quality - decision_manager: the timeout-degradation text is now an explicit description (bypasses the ADR-091 iron gate) - intent_classifier: on LLM timeout, degrade to keyword classification (not None) 3. Watchdog false-positive fix (ADR-092 B3) - W-2: tg_sent Redis TTL → telegram_message_id IS NULL (DB ground truth) - W-5 added: suggested_action IN empty/待分析/NO_ACTION + tg_id IS NULL - approval_timeout_resolver: 60min → 15min, batch 50 → 200 4. Config Drift automation - drift_adopt_service: auto_adopt_if_safe() six-condition safety gate - drift.py: the background task tries auto-adoption first, then sends the manual Telegram card 5. Playbook flywheel stability - playbook_seed_service: fix idempotency (deprecated is not treated as missing) - playbook_evolver: load only DRAFT+APPROVED (not all 294 records) 6. Observability - alert_rule_engine: auto_rule structured logging + Redis counters (pipeline) - auto_approve: Redis counters for reject reasons - heartbeat_report_service: add the "⚙️ Automation stats (today)" block [Pending manual execution] psql $DATABASE_URL -f apps/api/migrations/cleanup_duplicate_deprecated_playbooks.sql Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 10:55:50 +08:00
Your Name	9244c5e845	feat(heartbeat): add 5 dynamic sections to the system report [All checks successful: CD Pipeline / build-and-deploy (push) successful in 13m50s] Adds alert pipeline (24h), DB/Redis status, K8s Pods, Scanner status, and Telegram Bot sections. The sections are probed in parallel with asyncio.gather(return_exceptions=True), so a failure in one does not affect the others. Adds the AlertPipelineStats/DbRedisStats/PodInfo/ScannerStats/TelegramBotStats dataclasses. _build_warnings() now checks for DB/Redis anomalies, PENDING>10, and Pods that are not ready or have high restart counts. report_to_telegram_html() emits the 5 corresponding new HTML blocks. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 09:29:16 +08:00
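The parallel probing described in 9244c5e845 relies on asyncio.gather(return_exceptions=True), which returns exceptions as values instead of propagating the first one. A minimal sketch, assuming hypothetical probe names and a hypothetical `probe_all` helper (the real report builder is not shown in the log):

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Optional

Probe = Callable[[], Awaitable[Any]]


async def probe_all(probes: Dict[str, Probe]) -> Dict[str, Optional[Any]]:
    """Run every probe concurrently; a failed probe yields None for its section."""
    names = list(probes)
    results = await asyncio.gather(
        *(probes[name]() for name in names),
        return_exceptions=True,  # failures come back as exception objects
    )
    return {
        name: (None if isinstance(result, Exception) else result)
        for name, result in zip(names, results)
    }


async def _demo() -> None:
    async def pods() -> int:
        return 3

    async def redis() -> str:
        raise ConnectionError("redis unreachable")

    print(await probe_all({"k8s_pods": pods, "db_redis": redis}))


if __name__ == "__main__":
    asyncio.run(_demo())
```

Mapping a failure to None lets the report render the remaining sections; a fuller version would probably keep the exception text for the warnings block.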
Your Name	88af639651	fix(report): fix inconsistent casing of approval_records.status [All checks successful: CD Pipeline / build-and-deploy (push) successful in 9m46s] The DB stores the enum name via SQLEnum (EXECUTION_FAILED, uppercase), not the enum value (execution_failed, lowercase). The SQL now applies UPPER(status::text) so it matches regardless of case. Verification: the live DB query returns success=0, failed=2 (previously always 0/0) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 09:10:39 +08:00
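Your Name	6810ab359d	fix(report): fix two root causes: duplicate daily reports + auto-repair stuck at 0% [Checks failed: CD Pipeline / build-and-deploy (push) cancelled] Problem 1: the daily inspection report was sent multiple times (each Pod ran its own daily job) - Root cause: run_daily_report_loop was not wired to the leader lock; the other scanners (capacity/hermes/compliance) all call try_acquire_daily_lock, and only the daily report loop lacked it - Fix: after asyncio.sleep, call try_acquire_daily_lock("daily_report"); a Pod that fails to get the lock simply continues and waits for the next 08:00 Problem 2: the auto-repair success rate was always 0.0% - Root cause: _collect_repair_stats queried incidents.outcome->>'execution_success', but the entire execution path (approval_execution.py NO_ACTION + real execution) never wrote execution_success back into the incidents.outcome JSON, so the query always returned 0 - Fix: query approval_records.status (EXECUTION_SUCCESS / EXECUTION_FAILED) instead; this is the only source of truth that is reliably written Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 09:03:44 +08:00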
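Your Name	1625e7bd19	fix(telegram): fix two root causes of silent button replies [All checks successful: CD Pipeline / build-and-deploy (push) successful in 17m40s] Problem 1: ai_advisory_* buttons (capacity forecast / compliance, etc.) - pressing one only showed a toast (gone in 2-3 seconds); the group never got a reply - Fix: _handle_ai_advisory_action takes a message_id parameter and, after answer_callback, additionally sends a sendMessage reply to the original card Problem 2: clicking "Approve" again on an already-resolved alert - sign_approval returns early (status != pending), but _notify_approval_result still sent "⚡ Executing..." → nothing ever followed - Fix: send "Executing..." only when approval.status == APPROVED; for other terminal states send "ℹ️ This alert has already been handled (status: ...)" and return Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 01:57:55 +08:00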
Your Name	a6788c2baa	fix(tests): move DB tests to the integration layer to fix the CI asyncpg password error [Checks failed: CD Pipeline / build-and-deploy (push) failing after 1m55s] The three real-DB tests in test_aider_event_processor.py failed in the CI unit-test layer (tests/) because they could not connect to the awoooi_dev DB (password mismatch). Correct layout: tests/ holds unit tests, run directly by CI, no DB; tests/integration/ holds integration tests, skipped by CI with --ignore and covered by K8s E2E. Fix: - tests/test_aider_event_processor.py keeps only the DB-free malformed payload test - the three DB tests move to tests/integration/test_aider_event_processor_integration.py and use the conftest db_session fixture instead of building their own engine (avoids a hard-coded password) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 01:41:34 +08:00
Your Name	479f8d8971	refactor(tests): clear technical debt: remove the FakeRepo/FakeSession mock-DB violations [Checks failed: CD Pipeline / build-and-deploy (push) failing after 35s] ## ai_router.py - extract the pure function _aggregate_feedback_stats(); feedback_from_aider_events calls it ## aider_event_processor.py - _process_one gains a _session_factory=None DI parameter (defaults to get_session_factory()) - a test factory can be injected without changing existing production logic ## test_ai_router_feedback.py (full rewrite) - remove FakeRepo/FakeSession; test the pure function _aggregate_feedback_stats directly - add the test_feedback_skips_missing_model edge case - keep the test for degraded behavior on DB failure (only patches get_session_factory, no FakeRepo) ## test_aider_event_processor.py (full rewrite) - remove FakeRepo/FakeSession; use real PostgreSQL (real_factory fixture) - Redis xack + IncidentEngine stay mocked (external broker/AI services, a permitted exception) - roll back after each test so the dev DB is not polluted ## setup_test_schema.sql - add the aider_events_payload_gin GIN index (consistent with the adr091 production migration) ## integration/conftest.py - add a comment explaining the historical confusion around the password name awoooi_prod_2026 - fix the assert logic: check the DB name rather than the URL string, so a password containing "prod" does not trigger a false positive Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 01:33:30 +08:00
Your Name	d0591c54b0	fix(security): health-check remediation: all 7 Critical/Major security issues fixed [Checks failed: CD Pipeline / build-and-deploy (push) failing after 35s] ## Critical fixes (C1-C5) - C1: git rm --cached 03-secrets.yaml (the CHANGE_ME template is no longer tracked) - C2: git rm --cached awoooi.db + add *.db to .gitignore (SQLite HARD_RULES violation) - C3: sentry-tunnel SENTRY_HOST changed to a process.env fallback - C4: config.py DATABASE_URL drops the changeme default and becomes required - C5: run_migration.py switches to os.environ["DATABASE_URL"] ## Major fixes (M1-M4) - M1: auto_repair /execute gains CSRF protection + AutoRepairPanel.tsx updated to match - M2: drift /rollback /adopt gain CSRF protection (/internal/scan stays without CSRF) - M3: terminal /intent gains CSRF protection + terminal.store.ts updated to match - M4: live-dashboard HOST_IPS + host-grid VIP moved to env vars ## Other - add apps/web/.env.example (documents 6 env vars) - K8s deployment-web gains 3 new env vars - integration tests: add real-DB tests for aider_event_repository + ai_router_feedback - fix the CSRF dependency override in test_terminal.py Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 01:27:39 +08:00
Your Name	4fc1f49dca	fix(pipeline): three breakpoint fixes: SLO formula + NO_ACTION backlog + hallucination-downgrade risk [All checks successful: CD Pipeline / build-and-deploy (push) successful in 14m3s] D1 flywheel_stats_service: the execution_count field does not exist → read success_count+failure_count instead; removes the false impression that the flywheel execution success rate is always 0.0% D2 openclaw._validate_deployment_inventory: after a hallucinated deployment was downgraded, the original HIGH/CRITICAL risk was not cleared → add result.risk_level = AIRiskLevel.LOW D3 webhooks.py (two alert paths): the three non-destructive actions NO_ACTION/INVESTIGATE/OBSERVE are forced to risk_level = LOW and skip Telegram approval for direct auto-approve → the NO_ACTION handler in approval_execution.py immediately marks EXECUTION_SUCCESS Root-cause chain: after the BUTTON_DATA_INVALID fix, TG buttons could be sent, but the 35 backlogged NO_ACTION PENDING records existed because HIGH risk could not take the auto-approve path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-21 22:26:07 +08:00
Your Name	8fd31eca66	fix(telegram): compress the nonce UUID with base64url: a definitive fix for BUTTON_DATA_INVALID [All checks successful: CD Pipeline / build-and-deploy (push) successful in 9m45s] The previous fix (truncating random) was incomplete: host_restart_service (20 chars) was still 68 bytes > the 64 limit even without random. Root fix: UUID (36 chars) → base64url-encode the UUID bytes → 22 chars. nonce format: {action}:{b64url_uuid}:{timestamp}:{random} Longest case: host_restart_service(20)+22+10+8+3 colons = 63 bytes generate_callback_nonce: UUID → base64url 22 chars parse_callback_data: 22-char b64url → restores the full UUID, so handlers need no changes. Verified for all actions: approve/silence/reject/docker_restart/host_restart_service/renew_cert are all ≤ 63 bytes, and the UUID round-trips correctly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-21 21:30:20 +08:00
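The compression in 8fd31eca66 works because a UUID is 16 raw bytes, and 16 bytes encode to 22 base64url characters once the two padding characters are dropped. A minimal sketch using only the standard library; the function names `compress_uuid`, `expand_uuid`, and `build_callback_data` are hypothetical stand-ins for the project's generate_callback_nonce / parse_callback_data, and the sample timestamp and random suffix are made up.

```python
import base64
import uuid


def compress_uuid(approval_id: str) -> str:
    """36-char UUID string -> 22-char base64url token (padding stripped)."""
    raw = uuid.UUID(approval_id).bytes  # 16 bytes
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def expand_uuid(token: str) -> str:
    """22-char token -> canonical 36-char UUID string."""
    raw = base64.urlsafe_b64decode(token + "==")  # restore the stripped padding
    return str(uuid.UUID(bytes=raw))


def build_callback_data(action: str, approval_id: str, timestamp: int, rand: str) -> str:
    # Format taken from the commit: {action}:{b64url_uuid}:{timestamp}:{random}
    return f"{action}:{compress_uuid(approval_id)}:{timestamp}:{rand}"


if __name__ == "__main__":
    approval_id = str(uuid.uuid4())
    data = build_callback_data("host_restart_service", approval_id, 1776778220, "a1b2c3d4")
    print(len(data.encode("utf-8")), expand_uuid(data.split(":")[1]) == approval_id)
```

Because the token decodes back to the full UUID, downstream handlers can keep working with the original identifier, which is the property the commit relies on.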
Your Name	bd735482f7	fix(telegram): BUTTON_DATA_INVALID: root-cause fix for nonces exceeding 64 bytes [Checks failed: CD Pipeline / build-and-deploy (push) cancelled] Root cause: Telegram callback_data is limited to 64 bytes. The 5 long action names (docker_restart/host_restart_service, etc.) + UUID approval_id = 71-77 bytes → BUTTON_DATA_INVALID. Fix: 1. security_interceptor.generate_callback_nonce: if the nonce is > 63 bytes, use the 3-part format (drop random); the timestamp still provides uniqueness in time. 2. security_interceptor.parse_callback_data: accept the 3-part or 4-part format. 3. telegram_gateway: remove debug payload logging (diagnosis complete). Affected actions: docker_restart / host_restart_service / host_clear_log / reload_nginx / renew_cert (all > 7 chars + UUID = 64 bytes or more). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-21 21:17:49 +08:00
Your Name	685f5c684f	debug(telegram): log full payload on 4xx to diagnose BUTTON_DATA_INVALID [All checks successful: CD Pipeline / build-and-deploy (push) successful in 13m29s] The previous response_body logging confirmed the error code; this time the full payload is logged (payload_preview, first 1000 bytes) to find the exact field that triggers BUTTON_DATA_INVALID. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-21 20:56:28 +08:00
Your Name	acab1cd95e	fix(gitea-review): PR/push AI analysis always failing: two root-cause fixes [All checks successful: CD Pipeline / build-and-deploy (push) successful in 17m26s] Root cause 1 (push review): local_code_review_service.review_push() returns a dict, but the caller accessed analysis.issues directly → AttributeError. Fix: _call_openclaw_push_review converts the dict into a CodeReviewResult. Root cause 2 (PR review): openclaw_http_service calls /api/v1/analyze/code-review, but OpenClaw never implemented this endpoint (404). Fix: _call_openclaw_code_review now goes through local_code_review_service.review_pr() (Ollama qwen2.5-coder + Gemini fallback). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-21 15:19:14 +08:00
Your Name	3323a9052c	debug: log telegram 400 response body to diagnose card send failure [All checks successful: CD Pipeline / build-and-deploy (push) successful in 12m38s]	2026-04-21 01:05:21 +08:00
Your Name	9e9bd8679f	fix(aider-watch): code-review fixes (4 issues) [Checks failed: CD Pipeline / build-and-deploy (push) cancelled] 1. aiderw: session_end now includes model+cwd (unblocks the AI Router feedback loop) 2. repository: model_stats_since SQL uses COALESCE(session_end, session_start) model 3. aider_event_service: classify_severity no longer raises alerts on error_count (prevents false positives) 4. worker: run_aider_event_processor_loop wraps proc.start() in try/except (prevents silent crashes) 2026-04-20 @ Asia/Taipei	2026-04-21 00:59:21 +08:00
Your Name	9a44516bf8	fix(aider-processor): init_worker_redis_pool before XREADGROUP [All checks successful: CD Pipeline / build-and-deploy (push) successful in 9m35s] The worker pool was not initialized in the main.py lifespan (signal_worker has the same problem). AiderEventProcessor.start() now calls init_worker_redis_pool() idempotently, ensuring that get_worker_redis() in _consume_loop() does not raise RuntimeError. 2026-04-20 @ Asia/Taipei	2026-04-20 20:21:15 +08:00
Your Name	de2d34d4cd	fix(playbook): C1-C4 end-to-end wiring: evolver protection + seeder revival + immediate rule creation + watchdog W-4 [Checks failed: CD Pipeline / build-and-deploy (push) cancelled] C1: playbook_evolver: playbooks with source yaml_rule get a YAML_RULE guard; the evolver no longer archives APPROVED playbooks created by the seeder, protecting the auto-repair path C2: playbook_seed_service: the idempotency SQL excludes DEPRECATED records, so yaml_rule playbooks archived by the evolver can be revived on restart C3: alert_rule_engine: after an AI-generated rule succeeds, call seed_playbooks_from_rules() immediately, so the corresponding APPROVED Playbook is created without waiting for the next restart C4: ai_slo_watchdog_job: add the W-4 alert for zero APPROVED playbooks; a broken chain immediately raises TYPE-8M; total checks go from 3 to 4 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:18:11 +08:00
Your Name	7ca6d12ce2	fix(aider): remove dead get_aider_event_repository factory (resource leak) get_db_context import unused after removing broken factory function. Worker manages its own session via get_session_factory(). 2026-04-20 @ Asia/Taipei	2026-04-20 20:18:11 +08:00
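Your Name	156a52f807	fix(aiops): ADR-092 three fixes: Playbook enum crash + permanent Telegram silence + adoption failure + AI self-health-check [All checks successful: CD Pipeline / build-and-deploy (push) successful in 13m33s] B1 playbook_service.py: the evolver's setattr passed a str instead of a PlaybookStatus enum → playbook.status.value blew up in _pg_upsert (163 times/48h); fix: update_with_validation forces enum coercion B2 approval_db.py + webhooks.py: find_by_fingerprint wrongly converged on PENDING → PENDING does not mean the Telegram message was sent; fix: after a successful push, mark tg_sent:{fingerprint} in Redis (24h TTL) → outside the debounce window, find_by_fingerprint converges on a PENDING record only after Redis confirms it. drift_adopt_service.py: telegram_gateway called adopt_drift(report_id) but the method did not exist → add an adopt_drift() wrapper that loads the DriftReport from the DB and delegates to adopt(), fixing the adoption failure B3 ai_slo_watchdog_job.py + main.py: the AI could not perceive its own failures (MASTER §1.1 blind spot) → add a self-health-check every 15 minutes: W-1 SLO violation, W-2 TG silence detection, W-3 flywheel success rate → any anomaly → TYPE-8M send_meta_alert; Redis dedup for 1h Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:00:06 +08:00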
Your Name	1744b1e923	fix(aider): stdlib logging → structlog + typing-extensions dep (E2E fix) [Checks failed: CD Pipeline / build-and-deploy (push) cancelled] - aider_events.py: logging.getLogger → structlog.get_logger (keyword args compatible) - pyproject.toml: add typing-extensions>=4.0 (python-ulid 3.x requires Self) 2026-04-20 @ Asia/Taipei	2026-04-20 19:59:35 +08:00
Your Name	e1539a813e	feat(config+main): aider-watch v2 settings + router + lifespan register - Add 4 settings to config.py: AIDER_WEBHOOK_SECRET, AIDER_EVENTS_STREAM_KEY, AIDER_PATTERN_EXTRACT_INTERVAL_HOURS, USE_AIDER_FEEDBACK (ADR-091) - Import aider_events_v1 router in main.py imports (alphabetical after ai_slo_v1) - Register aider_events_v1.router in include_router block (after alert_operation_logs_v1) - Register run_aider_event_processor_loop() in lifespan (after compliance_scanner_loop) - All 65 tests pass (24 action_parsing + 41 aider-watch tests) Co-Authored-By: Claude Haiku 4.5 (1M context) <noreply@anthropic.com>	2026-04-20 19:40:02 +08:00
Your Name	40771cda6d	feat(ai_router): feedback_from_aider_events read-only hook (Phase 24 A8)	2026-04-20 19:40:01 +08:00
Your Name	df72da69e2	feat(worker): AiderEventProcessor — Redis stream consumer + incident + DB write - Implement Task A7: background worker consuming signals:aider:events stream - Parse AiderEventIn from Redis XREADGROUP messages - Call IncidentEngine.process_signal for incident-worthy events - Persist aider_events to PostgreSQL with optional incident_id FK - XACK on success, preserve in pending list on DB failure (retry) - ACK on parse failure (bad JSON avoids pending list jam) - Match signal_worker.py pattern: no Active Sweeper (MVP) - Unit tests: 4 tests covering incident creation, non-incident events, malformed payloads, engine failures Tests: 37 passed (4 new + 33 existing regression)	2026-04-20 19:40:01 +08:00
Your Name	cd894310dc	feat(api): POST /api/v1/aider/events HMAC webhook + Redis stream push - Router layer: HTTP validation + HMAC-SHA256 signature verification - Service layer: Redis stream push (aider_event_service.push_aider_batch_to_stream) - Follows the leWOOOgo building-block layering: Router → Service → Redis - All 6 tests passing (signature validation, batch limits, edge cases)	2026-04-20 19:40:01 +08:00
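The HMAC-SHA256 signature check mentioned in cd894310dc typically looks like the sketch below. The log does not say how the signature is encoded or which header carries it, so the hex encoding and the names `sign_body` / `verify_signature` are assumptions made for illustration.

```python
import hashlib
import hmac


def sign_body(secret: bytes, body: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw request body (encoding is an assumption)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, signature: str) -> bool:
    """Compare in constant time so the check does not leak timing information."""
    expected = sign_body(secret, body)
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))


if __name__ == "__main__":
    secret = b"example-secret"  # placeholder, not a real AIDER_WEBHOOK_SECRET
    body = b'{"events": []}'
    signature = sign_body(secret, body)
    print(verify_signature(secret, body, signature))
    print(verify_signature(secret, body + b" ", signature))
```

Verifying against the raw bytes, before any JSON parsing, avoids mismatches caused by re-serialization.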
Your Name	964427c5d4	feat(service): aider_event_service — classify + signal_data builder (uses existing debounce)	2026-04-20 19:40:01 +08:00
Your Name	6bcbd12f6c	feat(repo): AiderEventRepository CRUD + model_stats + pattern candidates	2026-04-20 19:40:01 +08:00
Your Name	803b389f6b	security(secrets): replace the real TG bot token in test fixtures with a fake value [Checks failed: run-migration / migrate (push) failing after 20s; CD Pipeline / build-and-deploy (push) successful in 9m10s] ## Incident The aider-watch v1 session wrote the real production TG bot token (NEMOTRON_BOT_TOKEN) as a test fixture into the following tracked files (all already pushed to Gitea): - apps/api/tests/test_secret_redactor.py - docs/superpowers/plans/2026-04-19-aider-watch.md (3 places) - docs/superpowers/plans/2026-04-20-aider-watch-v2.md This violates feedback_secrets_leak_incidents_2026-04-18.md L2 zero trust (no secrets in source control). ## Response - Commander's decision: do not revoke the token (risk accepted) - replaced with the fake value 111222333:A*35 (an obvious placeholder that still matches the redactor's detection format) - reduces future exposure via search engines / forks (but the token remains in git history) ## Verification All 8 secret_redactor.py tests pass; the telegram regex still recognizes the new fake value format. ## P1 backlog - git history cleanup (git filter-repo) needs the Commander's approval for a force push - pre-commit hook to prevent future leaks (grep for the TG token format / detect-secrets) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 04:23:09 +08:00
Your Name	23fb5c4aaa	feat(migration): adr091 rollback SQL Added after the Commander's full review: applying the adr091 migration directly to awoooi_prod violated feedback_dev_prod_separation and should have come with a rollback path. Adds DROP TABLE / DROP INDEX scripts as a standby. Data is unrecoverable; emergency use only. K8s Secret AIDER_WEBHOOK_SECRET has been added to awoooi-prod.awoooi-secrets (26 keys now, via kubectl patch). The v1 repo ~/aider-watch README is marked DEPRECATED and tagged v1-final. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 04:23:09 +08:00
Your Name	4188df6fcc	fix(imports): unify CI import paths to src. (remove the apps.api.src. PEP 420 pseudo-dependency) [Checks pending: Type Sync Check / check-type-sync (push) successful in 2m37s; CD Pipeline / build-and-deploy (push) has started running] ## Root cause `apps.api.src.` resolves via PEP 420 namespace packages only when the repo root is on sys.path (because apps/ and apps/api/ have no __init__.py). - CI rootdir=repo root → resolvable (but a fragile dependency) - local pytest rootdir=apps/api → resolution fails → all of src.models.__init__ blows up - CI error: `test_secret_redactor.py` cannot import module ## Fix 3 occurrences of `apps.api.src.` in src.models.__init__ changed to `src.`; 1 occurrence in src.models.incident changed to `src.`; tests/test_aider_event_models.py import path unified; tests/test_secret_redactor.py import path unified ## Verification All 138 pytest tests pass (drift + rule_engine + approval_execution + aider_event + incident + secret_redactor). All tests use the `from src.` style (the existing codebase convention; pytest rootdir=apps/api provides src/ as the import root) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 04:13:02 +08:00
Your Name	14fb08bcfe	revert(models): restore src.* imports in __init__.py + incident.py The Task A3 implementer mistakenly changed the existing `from src.models.` to `from apps.api.src.models.`, which made existing tests such as tests/test_action_parsing.py fail at collection (ModuleNotFoundError: No module named 'apps.api.src.models'). pytest rootdir=apps/api (set by pyproject.toml testpaths=["tests"]), so the awoooi convention is absolute `from src.*` paths; do not change this. The A3 test file (test_aider_event_models.py) already uses the correct src.models.aider and needs no change. 15 tests (A2+A3) pass, and the existing tests are restored (test_action_parsing: 24 collected). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 04:11:59 +08:00
Your Name	5daae76147	feat(models): AiderEventIn + AiderBatchIn pydantic schemas - Implement aider-watch v2 event schema with 7 event types - Enforce timezone-aware timestamps via field_validator - Batch schema supports up to 50 events per request - Frozen + forbid extra fields (defensive engineering) - Fix broken src.* imports in models package (incident.py, __init__.py) Task A3 complete: 7/7 tests passing	2026-04-20 04:06:26 +08:00
Your Name	0db4534133	feat(utils): generic secret_redactor (7 patterns) [Checks failed: run-migration / migrate (push) failing after 12s; CD Pipeline / build-and-deploy (push) failing after 1m36s]	2026-04-20 04:04:13 +08:00
Your Name	60b06ac54c	feat(migration): adr091 aider_events table	2026-04-20 04:04:13 +08:00
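Your Name	54d60d04f5	feat(drift+target): P0.1+P0.2+P0.3 three fixes: drift pagination and classification + AI recommendation + target tracing Commander's decision on the three questions: do all of them; the AI recommendation's 0.85 threshold is display-only, not automatic; check aol first, then fix ## RCA: source of the awoooi-service failure - /api/v1/aiops/kpi shows 1 playbook_executed record in the past 24h with actor=approval_execution status=failed - grep codebase: no code hard-codes awoooi-service (only historical comments) - most likely source: alert_rule_engine._extract_vars takes the value of labels.service as the Deployment name - cf5050c/4f2e122 (2026-04-18) already fixed two NEMOTRON hallucination paths; this commit fixes the third path ## Fixes ### P0.3a alert_rule_engine._extract_vars - labels.service downgraded: values ending in -service have the suffix stripped first and are treated as the base name - match_rule's return value gains a target_source field for tracing - if awoooi-service recurs, the source is directly visible (label.service(stripped), etc.) ### P0.3c approval_execution._log_aol_started.input - add parsed_target/operation/namespace fields - future aol queries for failed records show the target directly, with no guesswork ### P0.1 telegram_gateway._send_drift_diff_detail - pagination (10 items/page) replaces flooding the chat with 30 items at once - the header shows counts in 3 buckets: manual high-risk / general changes / K8s automatic - ⬅️/➡️ pagination buttons at the bottom (callback: drift_view_page:{report_id}_{page}) - security_interceptor INFO_ACTIONS adds drift_view_page to the allowlist ### P0.2 drift_narrator recommendation - the LLM prompt gains a recommendation field (action/confidence/reason) - action ∈ {adopt, revert, ignore, investigate} - the top of the card shows "🎯 AI recommendation: ⏪ Revert (85%) - reason" - on LLM failure, use _fallback_recommendation (rule-based, mapped from intent) - the card diff_summary limit goes from 500 → 1500 characters to fit recommendation + narrative + items - Commander's instruction: display only, no automatic execution (the 0.85 threshold is reserved for the future) ## Verification - all 90 pytest tests pass (drift + rule_engine + approval_execution) - AST syntax check passes for 5 files ## Next acceptance checks 1. next drift trigger → the top of the card shows "🎯 AI recommendation" 2. pressing drift_view → 3-bucket header + ⬅️/➡️ 3. if awoooi-service recurs → check automation_operation_log.input.parsed_target directly Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 04:04:13 +08:00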
Your Name	f572561467	feat(ai_advisory): P0 fixes: leader lock + inline keyboard + callback handler [All checks successful: CD Pipeline / build-and-deploy (push) successful in 8m31s] Commander's screenshot feedback from 2026-04-19: 1. the same alert was pushed twice in a row at 22:44 (multiple Pods all ran the daily loop) 2. plain text with no buttons (no feedback loop / AI only suggests, does not execute) New services/ai_advisory_helpers.py (~240 lines): - try_acquire_daily_lock(job_name): Redis SETNX key 'aiops:daily_lock:{job}:{date}', TTL 25h, fail-open (if Redis is down, push anyway; never block). - try_acquire_hourly_lock(job_name): hourly version of the same (used by coverage_evaluator). - is_snoozed / set_snooze: Redis key 'aiops:snooze:{type}:{target}' TTL 24h. - build_ai_advisory_keyboard: a unified set of 4 buttons ✅ Handled / 😴 Snooze 24h / 🔍 View details / 📋 Generate kubectl command; callback_data format: 'ai_advisory_{action}:{type}:{id}' - handle_ai_advisory_callback: handles the handled/snooze actions and writes aol.output.human_feedback; view/produce_cmd are left for P1. 4 LLM scanners now use the helpers: - capacity_forecaster: daily_lock + snooze check per host + buttons - compliance_scanner: daily_lock (cron only) + snooze per date + buttons - coverage_evaluator: hourly_lock + snooze per worst_dimension + buttons - hermes_rule_quality: daily_lock + snooze per primary rule + buttons telegram_gateway.py: handle_callback adds 'ai_advisory_*' routing (after step 1.85 drift). New _handle_ai_advisory_action method: parse payload 'type:id' → call handle_ai_advisory_callback → answer_callback (Telegram toast feedback) → return dict (info_action=True for view/produce_cmd) Alignment with the Commander's iron rules: ✅ in multi-Pod deployments only the leader pushes (Redis SETNX guarantees idempotency) ✅ failures are fail-open and do not block the main business flow (still works when Redis is down) ✅ aol.output gains human_feedback for AI learning ✅ snooze avoids repeated alerts (24h TTL) ✅ reuses the existing drift button pattern (non-breaking) Tomorrow morning's AI push will be: - a single message (no duplicates) - with 4 buttons (manual feedback loop) - after snooze, no further pushes on the same topic for 24h. view/produce_cmd are left for a P1 session (AI proactively gathers evidence via MCP + LLM generates the kubectl command). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 23:02:57 +08:00
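The leader lock described in f572561467 (SET with NX, 25h TTL, fail-open) can be sketched as follows. This is a synchronous illustration under assumptions: the project's helper may be async, the date format in the key is assumed to be ISO, and the only thing required of `client` is a redis-py style `set(name, value, nx=..., ex=...)` method.

```python
import datetime as dt
from typing import Any, Optional


def try_acquire_daily_lock(
    client: Any,
    job_name: str,
    today: Optional[dt.date] = None,
    ttl_seconds: int = 25 * 3600,
) -> bool:
    """Return True if this process should run today's job.

    Key format follows the commit message: aiops:daily_lock:{job}:{date}.
    """
    day = (today or dt.date.today()).isoformat()  # ISO date is an assumption
    key = f"aiops:daily_lock:{job_name}:{day}"
    try:
        # SET key value NX EX ttl: only the first caller gets a truthy reply.
        return bool(client.set(key, "1", nx=True, ex=ttl_seconds))
    except Exception:
        # Fail-open: if the lock store is unreachable, run the job anyway.
        return True


class InMemoryClient:
    """Tiny stand-in for a Redis client, for local experiments only."""

    def __init__(self) -> None:
        self.store: dict = {}

    def set(self, key: str, value: str, nx: bool = False, ex: Optional[int] = None):
        if nx and key in self.store:
            return None
        self.store[key] = value
        return True


if __name__ == "__main__":
    client = InMemoryClient()
    print(try_acquire_daily_lock(client, "daily_report"))
    print(try_acquire_daily_lock(client, "daily_report"))
```

Fail-open trades a possible duplicate push during a Redis outage for never losing the report, which is the priority the commit states.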
Your Name	fa643ebdc7	refactor(p1): extract the LLM JSON parse helper + dual-condition coverage threshold (architect review P1) [All checks successful: CD Pipeline / build-and-deploy (push) successful in 8m52s] The chief architect's 2026-04-19 review (92/100, Grade A) pointed out P1 optimizations: 1. the LLM JSON 3-path parse logic was duplicated across 4 scanners (~80 lines × 4 = 320 lines) 2. the coverage trigger threshold red>=20 was too low; production bootstrap would always trigger it and waste tokens P1.1+1.2 New services/llm_json_parser.py (~90 lines): parse_llm_json_response(text, required_key, logger_context) 3-path fallback: Path 1: strip the markdown fence + direct JSON containing required_key Path 2: NemoTron wrapper (JSON embedded in description/action_title/reasoning) Path 3: everything failed: return None + logger.warning. Failures never raise; the caller decides the fallback. 4 LLM scanners now use the helper: - hermes_rule_quality_job: required_key='recommended_actions' - capacity_forecaster_job: required_key='priority_actions' - compliance_scanner_job: required_key='posture_grade' - coverage_evaluator_job: required_key='worst_dimension' Each loses about 20 lines of duplication. P1.3 coverage trigger changed to a dual condition: Before: total_red >= 20 (always triggered at bootstrap) After: red_ratio > 30% AND total_scanned >= 50 _fetch_red_summary now also returns total_scanned for the calculation. 5/5 unit tests for parse_llm_json_response: ✅ direct / markdown fence / NemoTron wrapper / invalid / missing key P1.4 capacity_scanner + rule_catalog_sync: on inspection they already have complete author comments (the review was mistaken). Other P1 items (Prom HTTP helper / staggered first_delay / LLM budget guard) are left for the next session. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 22:39:40 +08:00
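The three parse paths listed in fa643ebdc7 can be sketched with the standard library. This is an illustration, not services/llm_json_parser.py: the wrapper field names come from the commit message, while the fence-stripping regex, the two-argument signature (the real helper also takes logger_context), and the log wording are assumptions.

```python
import json
import logging
import re
from typing import Any, Dict, Optional

logger = logging.getLogger(__name__)

FENCE = "`" * 3  # a markdown code fence, built indirectly for readability
_FENCE_RE = re.compile(rf"^{FENCE}[a-zA-Z]*\s*|\s*{FENCE}$")
WRAPPER_FIELDS = ("description", "action_title", "reasoning")


def _loads_dict(text: str) -> Optional[Dict[str, Any]]:
    """Parse text as a JSON object; return None on any failure."""
    try:
        value = json.loads(text)
    except (TypeError, ValueError):
        return None
    return value if isinstance(value, dict) else None


def parse_llm_json_response(text: Any, required_key: str) -> Optional[Dict[str, Any]]:
    """Return the JSON object containing required_key, or None. Never raises."""
    if not isinstance(text, str):
        return None
    # Path 1: strip a surrounding markdown fence, then parse directly.
    outer = _loads_dict(_FENCE_RE.sub("", text.strip()))
    if outer is not None and required_key in outer:
        return outer
    # Path 2: the payload is a JSON string nested in a wrapper field.
    if outer is not None:
        for field in WRAPPER_FIELDS:
            inner = outer.get(field)
            if isinstance(inner, str):
                parsed = _loads_dict(_FENCE_RE.sub("", inner.strip()))
                if parsed is not None and required_key in parsed:
                    return parsed
    # Path 3: give up and let the caller choose a fallback.
    logger.warning("LLM response lacks usable JSON with key %r", required_key)
    return None
```

Returning None instead of raising keeps each scanner's own fallback (for example hard-coded suggestions) in the caller, as the commit describes.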
Your Name	37b6c9ba56	chore: remove empty ai_orchestrator.py (an empty file that slipped into a commit) [All checks successful: CD Pipeline / build-and-deploy (push) successful in 13m6s] The previous commit (`86d9b22` LOGBOOK) accidentally picked up the 0-line empty file ai_orchestrator.py through a stash pop; it was not created on purpose. This commit deletes it to keep services/ clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 22:22:53 +08:00
Your Name	86d9b22125	docs(logbook): session wrap-up: Gap Review + full record of AI autonomy going from 1/9 → 4/9 [Checks failed: CD Pipeline / build-and-deploy (push) cancelled] Session of 35 commits fully closed out: - Phase 7 foundations (scanners + evaluator + tracker + advisor + forecaster) - KPI Dashboard API (autonomy_score 63/100, quantifiable) - Audit: 3 honest Gaps - Gap 1: strict host IPv4 + cleanup of 266 duplicate records - Gap 2: root cause confirmed, not a bug - Gap 3: LLM upgrade 3/8 (capacity_forecaster/compliance/coverage) AI autonomy achieved: 1/9 LLM (Hermes only) → 4/9 LLM decision; all 8 tables that had 0 writers are now active; 7/7 coverage dimensions complete. Tonight the AI will autonomously push 4 kinds of Telegram analysis reports Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 22:22:42 +08:00
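Your Name	2f5cab2e45	feat(coverage_evaluator): Gap 3.3 LLM upgrade: gap analysis + coverage remediation suggestions [All checks successful: CD Pipeline / build-and-deploy (push) successful in 10m14s] Gap 3 progress: 4/9 services upgraded to LLM (a reasonable ceiling; the other 4 are pure data movement and do not need an LLM) coverage_evaluator previously upgraded 7 dimensions from unknown→green/yellow/red but offered no proactive suggestions. Added: 1. _fetch_red_summary: fetch the red distribution of the latest run + the top 10 assets marked red 2. _llm_analyze_coverage_gaps (~50 lines): runs the LLM only when there are >= 20 reds (avoids wasting tokens on well-covered clusters) LLM JSON output: - worst_dimension: the dimension to fix first - root_cause: the real cause of the red concentration (Traditional Chinese) - top_remediation_actions[3]: priority/target/action/effort - estimated_weeks_to_close: 1-52 - confidence: 0-1 3. _send_telegram_gaps: push a coverage-gap Telegram summary: total red + worst dimension + weeks to close + top 3 remediation actions. Post-scan flow: evaluate 7 dimensions → fetch red summary → LLM analysis (if total_red >= 20) → Telegram. Alignment with the Commander's iron rules: ✅ remediation priority is not hard-coded (the LLM derives it from the actual red distribution) ✅ AI suggests + human decides (last line of the Telegram message: 'manually evaluate the coverage remediation schedule') ✅ includes estimated completion time + confidence (trackable) Session total: 35 commits, 9 new scanners, 4 using LLM: - Hermes (rule quality) - capacity_forecaster (capacity forecasting) - compliance_scanner (compliance posture) - coverage_evaluator (coverage gaps) The remaining 5 are pure data movement and unsuited to LLM (asset_scanner/rule_catalog_sync/ rule_stats_updater/asset_change_tracker/capacity_scanner) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 22:02:36 +08:00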
Your Name	f6cb938dc3	feat(compliance_scanner): Gap 3.2 LLM upgrade: compliance posture analysis + Telegram summary [Checks failed: CD Pipeline / build-and-deploy (push) cancelled] Toward AI autonomy: the 9 new scanners go from 2/9 LLM to 3/9. compliance_scanner previously wrote 273 snapshots to the DB on every scan with no human-visible summary. Added: 1. _write_compliance_for_asset_v2 (wrapper): the original _write_compliance_for_asset is unchanged; the v2 version additionally returns an asset_warning dict for upstream LLM analysis, and only when violations/warnings > 0 2. _llm_analyze_compliance_posture (~50 lines): when there are warnings, use OpenClaw to analyze the overall posture. Output JSON: - posture_grade: A/B/C/D/F - posture_summary: a 3-sentence overall posture narrative in Traditional Chinese - top_priorities[3]: priority + action + rationale - risk_level: low/medium/high/critical - confidence: 0-1 3-path JSON parse fallback (direct / NemoTron wrapper / nested in description) 3. _send_telegram_posture (~40 lines): push the daily compliance summary to the SRE group, with grade emoji (🟢A / 🟡B / 🟠C / 🔴D / ⛔F), the asset_type distribution (statistics for the top 5 problem types), and the AI's top 3 priority actions + rationale. scan_once flow: scan assets × 7 dimensions → collect warning_assets → LLM analysis → Telegram push. Alignment with the Commander's iron rules: ✅ AI analyzes + human decides (last line of the Telegram message: 'manually evaluate the priority of each fix') ✅ priority order is not hard-coded (the LLM derives it from the actual distribution of warnings) ✅ asset_type distribution statistics help the Commander locate problems quickly Gap 3 progress: 3/8 services upgraded to LLM (Hermes + capacity_forecaster + compliance_scanner) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 21:59:38 +08:00
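Your Name	d6b854a25e	feat(capacity_forecaster): Gap 3 LLM upgrade: from thresholds to AI decisions. [CI: some checks failed; CD Pipeline / build-and-deploy (push) has been cancelled] An audit found that 8 of the 9 new scanners are pure threshold logic; only Hermes uses an LLM. The Commander's directive is to "move toward AI autonomy", so Gap 3 starts upgrading threshold logic to LLM. First upgrade: capacity_forecaster (highest strategic value). The original _derive_actions logic is a hard-coded keyword → action mapping: disk → "clean up /var/log, /var/lib/docker, PG WAL"; mem → "check the top memory consumers, consider adding memory"; cpu → "analyze the top CPU processes, consider adding vCPUs". Added _llm_analyze_risk (~60 lines): uses OpenClaw to run an LLM analysis for each high-risk host. The prompt contains: - host + findings (Prometheus predict_linear results) - a description of the host architecture (110 Harbor / 120-121 K3s / 188 PG, etc.). LLM JSON output: - root_causes (3 candidate root causes, Traditional Chinese) - priority_actions (high/medium/low + concrete command hints) - urgency_days (0-30) - confidence (0-1). 3-path JSON parse fallback (direct / NemoTron wrapper / nested under description). _write_recommendation_aol: adds llm_analysis to output_payload. _send_telegram_forecast: includes the AI verdict (urgency days + confidence + top 2 actions). On LLM failure, falls back to the hard-coded _derive_actions suggestions. Alignment with the Commander's iron rules: ✅ AI analysis + human decision (still requires_human_decision=True) ✅ remediation actions are not hard-coded (the LLM generates them from the host's actual state) ✅ root_causes take the host architecture context into account. Gap 3 progress: 1/8 services upgraded to LLM (capacity_forecaster); the remaining 7, including compliance_scanner / coverage_evaluator, are left for follow-up Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 21:52:34 +08:00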
OG T	97154d12fa	fix(asset_scanner): Gap 1 fix: strict IPv4 check + cleanup of duplicate host assets. [CI: some checks failed; CD Pipeline / build-and-deploy (push) has been cancelled] Bug found in Audit 1: the original code used if host_ip.replace('.', '').isdigit() as its IP check, so labels.host='125' (a short name) was mistaken for an IP → host/125 was created (wrong). In addition, blackbox-icmp instance='192.168.0.112' has no port → the split failed → the asset was never created. Added _is_valid_ipv4(s): strictly 4 segments, each an integer in 0-255, which avoids misclassifying the short name '125' / the hostname 'cadvisor-110' / the out-of-range '256'. _collect_prometheus_targets flow changes: 1. Extract from instance first (IP:port form or bare IP): instance_host = instance.split(':')[0] if ':' in instance else instance 2. Validate strictly with _is_valid_ipv4 3. labels.host is no longer used as a fallback (short names are unreliable). DB cleanup (266 rows): - 10 asset_relationship rows pointing to short-name hosts - 140 asset_coverage_snapshot rows (7 dimensions × 4 short-name hosts) - 112 asset_compliance_snapshot rows (7 dimensions × 4 short names × several runs) - 4 asset_inventory short-name hosts (host/110 / 112 / 125 / 188). Expected at the next scan (1h): - host/192.168.0.112 + host/192.168.0.121 are added (previously missed because blackbox-icmp has no port) - no more short-name host assets. 6/6 unit tests pass: _is_valid_ipv4('192.168.0.125')=True _is_valid_ipv4('125')=False ← the key fix _is_valid_ipv4('cadvisor-110')=False _is_valid_ipv4('192.168.0.256')=False (out of range) _is_valid_ipv4('')=False _is_valid_ipv4('192.168.1')=False (3 segments) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 21:46:22 +08:00
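
The strict check described in 97154d12fa can be sketched roughly as follows. This is a minimal illustration, not the repository's actual `_is_valid_ipv4`; the function names `is_valid_ipv4` and `instance_host` are assumptions chosen for the example, and the ASCII-only digit test is an extra guard the commit message does not mention.

```python
def is_valid_ipv4(s: str) -> bool:
    """Return True only for a dotted quad whose four parts are integers in 0-255."""
    parts = s.split(".")
    if len(parts) != 4:
        # Rejects '', short names such as '125', and 3-segment strings.
        return False
    for part in parts:
        # str.isdigit() alone accepts some non-ASCII digit characters,
        # so require ASCII as well (an assumption of this sketch).
        if not (part.isascii() and part.isdigit()):
            return False
        if int(part) > 255:
            return False
    return True


def instance_host(instance: str) -> str:
    """Strip an optional ':port' suffix from a Prometheus instance label."""
    return instance.split(":")[0] if ":" in instance else instance


if __name__ == "__main__":
    for candidate in ("192.168.0.125:32334", "192.168.0.112", "125", "cadvisor-110"):
        host = instance_host(candidate)
        print(candidate, "->", host, is_valid_ipv4(host))
```

Note that this sketch accepts leading zeros (for example '192.168.000.001'); whether the real function does is not stated in the commit message.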
OG T	0004554bc6	feat(api): AIOps KPI Dashboard: a full view of AI autonomy maturity (modular refactor). [CI: all checks successful; CD Pipeline / build-and-deploy (push) successful in 8m47s] GET /api/v1/aiops/kpi → consolidates all KPIs from MASTER §7.1 in one call. Alignment with the leWOOOgo modular (building-block) iron rule: - Router (api/v1/aiops_kpi.py) handles HTTP routing only and does not touch the DB - Service (services/aiops_kpi_service.py) owns all SQL + computation - The previous commit was blocked by a hook (the Router imported get_db_context directly); fixed in this commit. services/aiops_kpi_service.py (~230 lines): AiopsKpiService.get_snapshot() returns 6 sections: 1. asset_inventory: by_type + total + last_scan (run_id/ended_at/total/new/modified) 2. coverage_kpi: 7 dimensions × (green/yellow/red/unknown) + green_ratio_per_dim + overall_green_ratio (MASTER §7.1 #5 SLO) 3. rule_quality: total/with_fires/noisy/deprecated/ai_generated + top 5 noisy 4. capacity_health: latest snapshot per host + by_verdict + violations_7d 5. automation_flow_24h: aol detail + by_actor + by_operation_type 6. ai_autonomy_score: total score 0-100, 5 sub-items × 20: asset_coverage / rule_quality / capacity_health / automation_flow / ai_diversity; grade: mature(90+) / in_progress(70-90) / starter(50-70) / initial(<50). api/v1/aiops_kpi.py (~35 lines, slim router): only router = APIRouter() + @router.get, delegating to the service. main.py: include_router(aiops_kpi_v1.router, prefix='/api/v1', tags=['AIOps KPI']). Usage for the Commander: curl http://192.168.0.121:32334/api/v1/aiops/kpi | jq . shows the full AI autonomy maturity picture at a glance Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 21:21:46 +08:00
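
The ai_autonomy_score arithmetic in 0004554bc6 (5 sub-items worth up to 20 each, mapped to a grade) could look something like the sketch below. The function names, the clamping of each sub-score to 0-20, and the treatment of the exact boundaries 70 and 50 as belonging to the higher grade are assumptions; the commit message only gives the ranges.

```python
SUB_SCORE_MAX = 20.0
SUB_SCORE_KEYS = (
    "asset_coverage",
    "rule_quality",
    "capacity_health",
    "automation_flow",
    "ai_diversity",
)


def autonomy_score(sub_scores: dict[str, float]) -> float:
    """Sum the five sub-scores, clamping each to the 0-20 range (assumed)."""
    total = 0.0
    for key in SUB_SCORE_KEYS:
        value = sub_scores.get(key, 0.0)  # a missing sub-item counts as 0
        total += min(max(value, 0.0), SUB_SCORE_MAX)
    return total


def autonomy_grade(score: float) -> str:
    """Map a 0-100 score to the grade names listed in the commit message."""
    if score >= 90:
        return "mature"
    if score >= 70:
        return "in_progress"
    if score >= 50:
        return "starter"
    return "initial"


if __name__ == "__main__":
    example = {key: 15.0 for key in SUB_SCORE_KEYS}  # hypothetical sub-scores
    score = autonomy_score(example)
    print(score, autonomy_grade(score))
```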
OG T	7db8845cbb	fix(asset_scanner+coverage): host_service→monitoring_target (fixes CHECK violation) + log adds the 4 missing dimensions. [CI: all checks successful; CD Pipeline / build-and-deploy (push) successful in 12m59s] 2 bug fixes + empirical verification: 1. asset_scanner: host_service is not in the asset_inventory CHECK allow-list. After `ceb61c3` was deployed, the Pod log showed CheckViolationError 'asset_inventory_type_valid'. Detail: writing '192.168.0.125:32334' with asset_type='host_service' was rejected. Allowed list: host/container/k8s_workload/k8s_resource/database/... monitoring_target/third_party_service/... (27 types). Fix: host_service → monitoring_target (originally reserved for scrape targets in the ADR-090 schema) 2. coverage_evaluator logger: logged only the original 3 dimensions (monitoring/alerting/km), which gave the false impression that the 4 new dimensions from `c1f23cf` had not taken effect (the DB actually already had auto_playbook/remediation/rule_matching/rule_creation data). Fix: logger.info adds 4 kwargs: playbook/remediation/rule_matching/rule_creation. Verified distribution of the 7 coverage dimensions in the DB (in effect): auto_alerting: 22 green / 78 red / 52 unknown; auto_km_creation: 5 green / 17 yellow / 130 unknown; auto_monitoring: 1 green / 1 red / 150 unknown; auto_playbook: 3 green / 19 yellow / 130 unknown ← new dimension; auto_remediation: 0 / 0 / 98 red / 54 unknown ← new dimension; auto_rule_creation: 0 / 0 / 100 red / 52 unknown ← new dimension; auto_rule_matching: 4 green / 96 yellow / 52 unknown ← new dimension. Governance insights: 98 red remediation = most assets had no remediation action in the past 30d (a remediation capability gap); 100 red rule_creation = no AI rules (all yaml_hardcoded); 96 yellow rule_matching = no alerts fired in the past 30d (possibly no problems, possibly no coverage) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 20:27:48 +08:00
OG T	ceb61c3c8e	feat(asset_scanner): Gap 1 fix: Prometheus targets fill in host-install services. [CI: all checks successful; CD Pipeline / build-and-deploy (push) successful in 13m32s] An audit found that asset_inventory covered only K8s (mon=120, mon1=121; 2 nodes + 78 pods in total) and completely missed the host-install services on 4 hosts: 110 (Harbor/Gitea/monitoring) + 112 (security) + 188 (PG/Redis/Ollama) + 125 (mon backup/standby). Of the user's 5-host architecture (110/112/120/121/188), only 2/5 = 40% was covered. Added _collect_prometheus_targets: GET /api/v1/targets?state=active → automatically discovers everything being monitored: - host_service (targets in IP form → postgres-110/redis-110/minio-188/node-exporter, etc.) - third_party_service (non-IP targets such as alertmanager/argocd-server) - host (one asset_type='host' asset per unique IP) - a depends_on relationship from each target to its host. Expected additions to asset_inventory: - host: 6 (110/112/120/121/125/188; fully covered by the blackbox-icmp targets Prometheus sees) - host_service: ~15 (postgres/redis/minio/node-exporter/cadvisor, etc.) - third_party_service: ~5 (alertmanager/argocd/prometheus/velero, etc.). Unlocks: - host-install services on 110/112/188 enter asset_inventory - coverage_evaluator can evaluate these assets (7 dimensions: monitoring/alerting/playbook, etc.) - blast_radius_calculator can answer "which services does PostgreSQL on 110 affect" - the suggestion scope of Hermes/forecaster expands to non-K8s services. Alignment with the Commander's iron rules: toward AI autonomy, with no hard-coded host list; hosts are discovered dynamically from Prometheus Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 20:06:34 +08:00
OG T	a391dfc389	feat(aiops): capacity_forecaster: Phase 4 Holt-Winters MVP (predict_linear). [CI: some checks failed; CD Pipeline / build-and-deploy (push) has been cancelled] One of the 4 next-phase candidates approved by the Commander is complete: AI capacity forecasting. Added capacity_forecaster_job.py (~220 lines): runs the forecast daily at 05:00 Taipei time (02:00 scanner → 03:00 compliance → 04:00 Hermes → 05:00 forecaster, forming a complete daily chain). Forecast methodology (MVP): Prometheus predict_linear(metric[7d], 86400*7), a linear extrapolation based on the past 7d. 3 forecast queries: 1. disk_saturation_7d: predict_linear(node_filesystem_avail_bytes[7d], 7d) < 0 2. mem_saturation_7d: predict_linear(MemAvailable[7d], 7d) / MemTotal < 10% 3. cpu_high_7d_trend: avg_over_time(cpu_used_pct[7d]) > 70%. When a high-risk host is found → write aol(capacity_recommendation) + push to Telegram - input: host + horizon + findings count - output: findings list + proposed_actions + requires_human_decision=true. proposed_actions are derived from the findings: - disk: clean up logs/docker/PG WAL or expand capacity - mem: top consumers / JVM tuning - cpu: scale out / add vCPUs. Alignment with the Commander's iron rules: ✅ suggestions only, no automatic scale-up ✅ the 7d window provides enough samples ✅ AI forecast + human decision. Future TODO: - true Holt-Winters (with seasonality), which requires Python statsmodels - business-cycle adjustment (Monday peak / weekend trough). Wired into the main.py lifespan via asyncio.create_task() Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 20:00:36 +08:00
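
To make the predict_linear idea in a391dfc389 concrete, here is a small pure-Python sketch of the same kind of extrapolation: fit a least-squares line to (timestamp, value) samples and evaluate it some seconds past the current time. It illustrates the technique only; it is not Prometheus's implementation and not the repository's code, and the function name, sample data, and `now` parameter are assumptions made for the example.

```python
def predict_linear(samples: list[tuple[float, float]], horizon_s: float, now: float) -> float:
    """Extrapolate a least-squares line through samples to time now + horizon_s.

    samples: (unix_timestamp_seconds, value) pairs from the lookback window.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("all samples share the same timestamp")
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return slope * (now + horizon_s) + intercept


if __name__ == "__main__":
    # Hypothetical free-space readings (bytes) that shrink by 10 per second.
    disk_avail = [(0.0, 100.0), (1.0, 90.0), (2.0, 80.0)]
    predicted = predict_linear(disk_avail, horizon_s=10.0, now=2.0)
    # Mirrors the disk_saturation check: a predicted value below 0 flags the host.
    print(predicted, "saturated" if predicted < 0 else "ok")
```

A plain linear fit ignores seasonality, which is why the commit lists true Holt-Winters as a future TODO.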

815 Commits