awoooi

Author	SHA1	Message	Date
Your Name	479f8d8971	refactor(tests): clear technical debt by removing the FakeRepo/FakeSession mock-DB violations [CI failed: CD Pipeline / build-and-deploy (push), failing after 35s] ## ai_router.py - Extract the pure function _aggregate_feedback_stats(); feedback_from_aider_events now calls it ## aider_event_processor.py - _process_one gains a _session_factory=None DI parameter (defaults to get_session_factory()) - A test factory can be injected without changing existing production logic ## test_ai_router_feedback.py (full rewrite) - Remove FakeRepo/FakeSession; test the pure function _aggregate_feedback_stats directly - Add the edge-case test test_feedback_skips_missing_model - Keep the DB-failure degradation test (patches only get_session_factory, no FakeRepo) ## test_aider_event_processor.py (full rewrite) - Remove FakeRepo/FakeSession; use real PostgreSQL (real_factory fixture) - Redis xack and IncidentEngine stay mocked (external broker/AI service, a permitted exception) - Roll back after every test so the dev DB is not polluted ## setup_test_schema.sql - Add the aider_events_payload_gin GIN index (matching the adr091 production migration) ## integration/conftest.py - Add a comment explaining the historical confusion over the password name awoooi_prod_2026 - Fix the assert logic: check the DB name rather than the URL string, so a password containing "prod" no longer triggers a false positive Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 01:33:30 +08:00
Your Name	d0591c54b0	fix(security): health-check fixes, all 7 Critical/Major security issues fixed [CI failed: CD Pipeline / build-and-deploy (push), failing after 35s] ## Critical fixes (C1-C5) - C1: git rm --cached 03-secrets.yaml (the CHANGE_ME template is no longer tracked) - C2: git rm --cached awoooi.db + add *.db to .gitignore (SQLite HARD_RULES violation) - C3: sentry-tunnel SENTRY_HOST changed to a process.env fallback - C4: config.py DATABASE_URL: remove the changeme default and make it required - C5: run_migration.py now reads os.environ["DATABASE_URL"] ## Major fixes (M1-M4) - M1: auto_repair /execute gains CSRF protection, with AutoRepairPanel.tsx updated to match - M2: drift /rollback /adopt gain CSRF protection (/internal/scan stays without CSRF) - M3: terminal /intent gains CSRF protection, with terminal.store.ts updated to match - M4: live-dashboard HOST_IPS + host-grid VIP moved to env vars ## Other - Add apps/web/.env.example (documents 6 env vars) - K8s deployment-web adds 3 new env vars - Integration tests: add real-DB tests for aider_event_repository + ai_router_feedback - test_terminal.py CSRF dependency override fixed Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-22 01:27:39 +08:00
Your Name	4fc1f49dca	fix(pipeline): three breakpoints fixed: SLO formula + NO_ACTION backlog + hallucination-downgrade risk [CI passed: CD Pipeline / build-and-deploy (push), successful in 14m3s] D1 flywheel_stats_service: the execution_count column does not exist → read success_count+failure_count instead; removes the false appearance that the flywheel execution success rate is always 0.0% D2 openclaw._validate_deployment_inventory: after a hallucinated deployment was downgraded, the original HIGH/CRITICAL risk was not reset → add result.risk_level = AIRiskLevel.LOW D3 webhooks.py (two alert paths): the three non-destructive actions NO_ACTION/INVESTIGATE/OBSERVE are forced to risk_level = LOW, skipping Telegram approval and auto-approving directly → the NO_ACTION handler in approval_execution.py marks EXECUTION_SUCCESS immediately Root-cause chain: after the BUTTON_DATA_INVALID fix, TG buttons could be sent, but the 35 backlogged NO_ACTION PENDING records remained because HIGH risk could not take the auto-approve path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-21 22:26:07 +08:00
Your Name	8fd31eca66	fix(telegram): compress the nonce UUID with base64url to resolve BUTTON_DATA_INVALID for good [CI passed: CD Pipeline / build-and-deploy (push), successful in 9m45s] The previous fix (truncating random) was incomplete: with host_restart_service (20 chars) the nonce is still 68 bytes, over the 64-byte limit, even without random. Root fix: UUID (36 chars) → base64url-encode the UUID bytes → 22 chars. Nonce format: {action}:{b64url_uuid}:{timestamp}:{random} Longest case: host_restart_service(20)+22+10+8+3 colons = 63 bytes generate_callback_nonce: UUID → base64url, 22 chars parse_callback_data: 22-char b64url → restores the full UUID, so handlers need no changes Verified for all actions: approve/silence/reject/docker_restart/host_restart_service/renew_cert are all ≤ 63 bytes and the UUID round-trips correctly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-21 21:30:20 +08:00
Your Name	bd735482f7	fix(telegram): BUTTON_DATA_INVALID: root-cause fix for nonces exceeding 64 bytes [CI failed: CD Pipeline / build-and-deploy (push), cancelled] Root cause: Telegram callback_data is limited to 64 bytes. The 5 long action names (docker_restart/host_restart_service, etc.) + UUID approval_id = 71-77 bytes → BUTTON_DATA_INVALID. Fix: 1. security_interceptor.generate_callback_nonce: if the nonce is > 63 bytes, use the 3-part format (dropping random); timestamp still provides temporal uniqueness. 2. security_interceptor.parse_callback_data: accept the 3-part or 4-part format. 3. telegram_gateway: remove the debug payload logging (diagnosis complete). Affected actions: docker_restart / host_restart_service / host_clear_log / reload_nginx / renew_cert (all > 7 chars + UUID = 64 bytes or more). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-21 21:17:49 +08:00
Your Name	685f5c684f	debug(telegram): log full payload on 4xx to diagnose BUTTON_DATA_INVALID [CI passed: CD Pipeline / build-and-deploy (push), successful in 13m29s] The previous response_body logging already confirmed the error code; this commit logs the full payload (payload_preview, first 1000 bytes) to find the exact field that triggers BUTTON_DATA_INVALID. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-21 20:56:28 +08:00
Your Name	acab1cd95e	fix(gitea-review): PR/push AI analysis always failing, two root causes fixed [CI passed: CD Pipeline / build-and-deploy (push), successful in 17m26s] Root cause 1 (push review): local_code_review_service.review_push() returns a dict, but the caller accesses analysis.issues directly → AttributeError. Fix: _call_openclaw_push_review converts the dict to a CodeReviewResult. Root cause 2 (PR review): openclaw_http_service calls /api/v1/analyze/code-review, but OpenClaw never implemented this endpoint (404). Fix: _call_openclaw_code_review now goes through local_code_review_service.review_pr() (Ollama qwen2.5-coder + Gemini fallback). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-21 15:19:14 +08:00
Your Name	3323a9052c	debug: log telegram 400 response body to diagnose card send failure [CI passed: CD Pipeline / build-and-deploy (push), successful in 12m38s]	2026-04-21 01:05:21 +08:00
Your Name	9e9bd8679f	fix(aider-watch): code-review fixes (4 issues) [CI failed: CD Pipeline / build-and-deploy (push), cancelled] 1. aiderw: session_end now includes model+cwd (unblocks the AI Router feedback loop) 2. repository: model_stats_since SQL changed to COALESCE(session_end, session_start) model 3. aider_event_service: classify_severity no longer lets error_count trigger alerts (prevents false positives) 4. worker: run_aider_event_processor_loop wraps proc.start() in try/except (prevents silent crashes) 2026-04-20 @ Asia/Taipei	2026-04-21 00:59:21 +08:00
Your Name	9a44516bf8	fix(aider-processor): init_worker_redis_pool before XREADGROUP [CI passed: CD Pipeline / build-and-deploy (push), successful in 9m35s] The worker pool was not initialized in the main.py lifespan (the same problem as signal_worker). AiderEventProcessor.start() now calls init_worker_redis_pool() idempotently, ensuring that get_worker_redis() in _consume_loop() does not raise RuntimeError. 2026-04-20 @ Asia/Taipei	2026-04-20 20:21:15 +08:00
Your Name	de2d34d4cd	fix(playbook): C1-C4 end-to-end wiring: evolver guard + seeder revival + immediate rule creation + watchdog W-4 [CI failed: CD Pipeline / build-and-deploy (push), cancelled] C1: playbook_evolver: add a YAML_RULE guard for playbooks whose source is yaml_rule; the evolver no longer archives APPROVED playbooks created by the seeder, protecting the auto-remediation chain C2: playbook_seed_service: the idempotency SQL excludes DEPRECATED records, so yaml_rule playbooks can be revived on restart after the evolver archives them C3: alert_rule_engine: after the AI successfully auto-generates a rule, call seed_playbooks_from_rules() immediately, so the matching APPROVED Playbook is created without waiting for the next restart C4: ai_slo_watchdog_job: add W-4, an alert when the APPROVED playbook count is 0; a broken chain raises TYPE-8M immediately; total checks rise from 3 to 4 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:18:11 +08:00
Your Name	7ca6d12ce2	fix(aider): remove dead get_aider_event_repository factory (resource leak) get_db_context import unused after removing broken factory function. Worker manages its own session via get_session_factory(). 2026-04-20 @ Asia/Taipei	2026-04-20 20:18:11 +08:00
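Your Name	156a52f807	fix(aiops): ADR-092 three fixes: Playbook enum crash + Telegram permanent silence + adoption failure + AI self-check [CI passed: CD Pipeline / build-and-deploy (push), successful in 13m33s] B1 playbook_service.py: the evolver's setattr passed a str instead of a PlaybookStatus enum → _pg_upsert blew up on playbook.status.value (163 times/48h). Fix: update_with_validation forces enum conversion B2 approval_db.py + webhooks.py: find_by_fingerprint wrongly converged on PENDING records → PENDING ≠ already sent to Telegram. Fix: after a successful push, mark tg_sent:{fingerprint} in Redis (24h TTL) → in find_by_fingerprint, a PENDING record outside the debounce window must be confirmed in Redis before converging drift_adopt_service.py: telegram_gateway called adopt_drift(report_id) but the method did not exist → add an adopt_drift() wrapper that loads the DriftReport from the DB and delegates to adopt(), fixing the adoption failure B3 ai_slo_watchdog_job.py + main.py: the AI could not perceive its own failures (MASTER §1.1 blind spot) → add a self-check every 15 minutes: W-1 SLO violation, W-2 TG silence detection, W-3 flywheel success rate → any anomaly → TYPE-8M send_meta_alert; Redis dedup for 1h Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:00:06 +08:00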
Your Name	1744b1e923	fix(aider): stdlib logging → structlog + typing-extensions dep (E2E fix) [CI failed: CD Pipeline / build-and-deploy (push), cancelled] - aider_events.py: logging.getLogger → structlog.get_logger (keyword args compatible) - pyproject.toml: add typing-extensions>=4.0 (python-ulid 3.x requires Self) 2026-04-20 @ Asia/Taipei	2026-04-20 19:59:35 +08:00
Your Name	e1539a813e	feat(config+main): aider-watch v2 settings + router + lifespan register - Add 4 settings to config.py: AIDER_WEBHOOK_SECRET, AIDER_EVENTS_STREAM_KEY, AIDER_PATTERN_EXTRACT_INTERVAL_HOURS, USE_AIDER_FEEDBACK (ADR-091) - Import aider_events_v1 router in main.py imports (alphabetical after ai_slo_v1) - Register aider_events_v1.router in include_router block (after alert_operation_logs_v1) - Register run_aider_event_processor_loop() in lifespan (after compliance_scanner_loop) - All 65 tests pass (24 action_parsing + 41 aider-watch tests) Co-Authored-By: Claude Haiku 4.5 (1M context) <noreply@anthropic.com>	2026-04-20 19:40:02 +08:00
Your Name	40771cda6d	feat(ai_router): feedback_from_aider_events read-only hook (Phase 24 A8)	2026-04-20 19:40:01 +08:00
Your Name	df72da69e2	feat(worker): AiderEventProcessor — Redis stream consumer + incident + DB write - Implement Task A7: background worker consuming signals:aider:events stream - Parse AiderEventIn from Redis XREADGROUP messages - Call IncidentEngine.process_signal for incident-worthy events - Persist aider_events to PostgreSQL with optional incident_id FK - XACK on success, preserve in pending list on DB failure (retry) - ACK on parse failure (bad JSON avoids pending list jam) - Match signal_worker.py pattern: no Active Sweeper (MVP) - Unit tests: 4 tests covering incident creation, non-incident events, malformed payloads, engine failures Tests: 37 passed (4 new + 33 existing regression)	2026-04-20 19:40:01 +08:00
Your Name	cd894310dc	feat(api): POST /api/v1/aider/events HMAC webhook + Redis stream push - Router layer: HTTP validation + HMAC-SHA256 signature verification - Service layer: Redis stream push (aider_event_service.push_aider_batch_to_stream) - Follows the leWOOOgo modular layering: Router → Service → Redis - All 6 tests passing (signature validation, batch limits, edge cases)	2026-04-20 19:40:01 +08:00
Your Name	964427c5d4	feat(service): aider_event_service — classify + signal_data builder (uses existing debounce)	2026-04-20 19:40:01 +08:00
Your Name	6bcbd12f6c	feat(repo): AiderEventRepository CRUD + model_stats + pattern candidates	2026-04-20 19:40:01 +08:00
Your Name	14fb08bcfe	revert(models): restore src.* imports in __init__.py + incident.py The Task A3 implementer mistakenly changed the existing `from src.models.` to `from apps.api.src.models.`, which made existing tests such as tests/test_action_parsing.py fail at collection (ModuleNotFoundError: No module named 'apps.api.src.models'). pytest rootdir=apps/api (from pyproject.toml testpaths=["tests"]), so the awoooi convention is `from src.*` absolute paths; do not change it. The A3 test file (test_aider_event_models.py) already uses the correct src.models.aider and needs no change. 15 tests (A2+A3) pass and the existing tests are restored (test_action_parsing: 24 collected). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 04:11:59 +08:00
Your Name	5daae76147	feat(models): AiderEventIn + AiderBatchIn pydantic schemas - Implement aider-watch v2 event schema with 7 event types - Enforce timezone-aware timestamps via field_validator - Batch schema supports up to 50 events per request - Frozen + forbid extra fields (defensive engineering) - Fix broken src.* imports in models package (incident.py, __init__.py) Task A3 complete: 7/7 tests passing	2026-04-20 04:06:26 +08:00
Your Name	0db4534133	feat(utils): generic secret_redactor (7 patterns) [CI failed: run-migration / migrate (push), failing after 12s; CD Pipeline / build-and-deploy (push), failing after 1m36s]	2026-04-20 04:04:13 +08:00
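Your Name	54d60d04f5	feat(drift+target): P0.1+P0.2+P0.3 three fixes: drift pagination and categorization + AI recommendation + target tracing Commander's decision on the three questions: do all of them; the AI recommendation's 0.85 threshold is display-only, not automatic; check aol before fixing ## RCA: source of the awoooi-service failures - /api/v1/aiops/kpi shows 1 playbook_executed record in the past 24h with actor=approval_execution status=failed - grep of the codebase: no code hard-codes awoooi-service (only historical comments) - Most likely source: alert_rule_engine._extract_vars uses the value of labels.service as the Deployment name - cf5050c/4f2e122 (2026-04-18) already fixed the two NEMOTRON hallucination paths; this commit fixes the third path ## Fixes ### P0.3a alert_rule_engine._extract_vars - labels.service downgraded: a trailing -service suffix is stripped first and the rest is treated as the base name - match_rule's return value adds a target_source field for tracing - If awoooi-service recurs, the source is directly visible (label.service(stripped), etc.) ### P0.3c approval_execution._log_aol_started.input - Add parsed_target/operation/namespace fields - Future aol lookups of failures can read the target directly, with no guesswork ### P0.1 telegram_gateway._send_drift_diff_detail - Pagination (10 items/page) replaces flooding the chat with 30 items at once - Header counts in 3 buckets: manual high-risk / general changes / K8s automatic - ⬅️/➡️ paging buttons at the bottom (callback: drift_view_page:{report_id}_{page}) - security_interceptor INFO_ACTIONS adds drift_view_page to the whitelist ### P0.2 drift_narrator recommendation - The LLM prompt adds a recommendation field (action/confidence/reason) - action ∈ {adopt, revert, ignore, investigate} - The card top shows "🎯 AI recommendation: ⏪ roll back (85%) - reason" - On LLM failure, fall back to _fallback_recommendation (rule-based, mapped from intent) - Card diff_summary limit raised from 500 → 1500 characters to fit recommendation + narrative + items - Commander's directive: display only, no automatic execution (the 0.85 threshold is reserved for the future) ## Verification - All 90 pytest tests pass (drift + rule_engine + approval_execution) - AST syntax check passes for 5 files ## Next acceptance checks 1. On the next drift trigger → the card top shows "🎯 AI recommendation" 2. Pressing drift_view → 3-bucket header + ⬅️/➡️ 3. If awoooi-service recurs → look it up directly in automation_operation_log.input.parsed_target Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 04:04:13 +08:00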
Your Name	f572561467	feat(ai_advisory): P0 fixes: leader lock + inline keyboard + callback handler [CI passed: CD Pipeline / build-and-deploy (push), successful in 8m31s] Commander's screenshot feedback of 2026-04-19: 1. The same alert was pushed twice at 22:44 (every Pod runs the daily loop) 2. Plain text with no buttons (no feedback loop / the AI only advises and never executes) New services/ai_advisory_helpers.py (~240 lines): - try_acquire_daily_lock(job_name): Redis SETNX key 'aiops:daily_lock:{job}:{date}', TTL 25h, fail-open (if Redis is down, push anyway without blocking). - try_acquire_hourly_lock(job_name): the hourly equivalent (used by coverage_evaluator). - is_snoozed / set_snooze: Redis key 'aiops:snooze:{type}:{target}', TTL 24h. - build_ai_advisory_keyboard: a unified set of 4 buttons ✅ Handled / 😴 Ignore 24h / 🔍 View details / 📋 Generate kubectl command; callback_data format: 'ai_advisory_{action}:{type}:{id}' - handle_ai_advisory_callback: handles the handled/snooze actions and writes aol.output.human_feedback; view/produce_cmd are left for P1. 4 LLM scanners switch to the helper: - capacity_forecaster: daily_lock + snooze check per host + buttons - compliance_scanner: daily_lock (cron only) + snooze per date + buttons - coverage_evaluator: hourly_lock + snooze per worst_dimension + buttons - hermes_rule_quality: daily_lock + snooze per primary rule + buttons telegram_gateway.py: handle_callback adds 'ai_advisory_*' routing (step 1.85, after drift) New method _handle_ai_advisory_action: parse the payload 'type:id' → call handle_ai_advisory_callback → answer_callback (Telegram toast feedback) → return a dict (info_action=True for view/produce_cmd) Alignment with the commander's iron rules: ✅ In multi-Pod setups only the leader pushes (Redis SETNX guarantees idempotency) ✅ Failures are fail-open and do not block the main business flow (still works when Redis is down) ✅ aol.output gains human_feedback for AI learning ✅ snooze avoids repeated alerts (24h TTL) ✅ Reuses the existing drift button pattern (non-breaking) Expected from tomorrow morning's AI push: - a single message (no duplicates) - with 4 buttons (manual feedback loop) - after snooze, no further pushes on the same topic for 24h view/produce_cmd (P1) are left for the next session (AI proactively gathers evidence via MCP + LLM generates the kubectl command). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 23:02:57 +08:00
Your Name	fa643ebdc7	refactor(p1): extract the LLM JSON parse helper + dual-condition coverage threshold (architect review P1) [CI passed: CD Pipeline / build-and-deploy (push), successful in 8m52s] The chief architect's 2026-04-19 review (92/100, Grade A) pointed out P1 optimizations: 1. The LLM JSON 3-path parse logic is duplicated across 4 scanners (~80 lines × 4 = 320 lines) 2. The coverage trigger threshold red>=20 is too low; production bootstrap always triggers it and wastes tokens P1.1+1.2 New services/llm_json_parser.py (~90 lines): parse_llm_json_response(text, required_key, logger_context) 3-path fallback: Path 1: strip the markdown fence + direct JSON containing required_key Path 2: NemoTron wrapper (JSON embedded in description/action_title/reasoning) Path 3: everything failed, return None + logger.warning Failures never raise; the caller decides the fallback. 4 LLM scanners switch to the helper: - hermes_rule_quality_job: required_key='recommended_actions' - capacity_forecaster_job: required_key='priority_actions' - compliance_scanner_job: required_key='posture_grade' - coverage_evaluator_job: required_key='worst_dimension' Each loses about 20 duplicated lines. P1.3 The coverage trigger becomes a dual condition: Before: total_red >= 20 (always triggered at bootstrap) After: red_ratio > 30% AND total_scanned >= 50 _fetch_red_summary also returns total_scanned for the calculation. 5/5 unit tests for parse_llm_json_response: ✅ direct / markdown fence / NemoTron wrapper / invalid / missing key P1.4 capacity_scanner + rule_catalog_sync: on inspection they already have complete author comments (the review was mistaken). Other P1 items (Prom HTTP helper / staggered first_delay / LLM budget guard) are left for the next session. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 22:39:40 +08:00
Your Name	37b6c9ba56	chore: remove empty ai_orchestrator.py (an empty file that slipped into a commit) [CI passed: CD Pipeline / build-and-deploy (push), successful in 13m6s] The previous commit (`86d9b22` LOGBOOK) accidentally brought in the 0-line file ai_orchestrator.py because of a stash pop; it was not created on purpose. This commit deletes it to keep services/ clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 22:22:53 +08:00
Your Name	86d9b22125	docs(logbook): session wrap-up: Gap Review + full record of AI autonomy going from 1/9 to 4/9 [CI failed: CD Pipeline / build-and-deploy (push), cancelled] Session of 35 commits fully closed: - Phase 7 foundation (scanners + evaluator + tracker + advisor + forecaster) - KPI Dashboard API (autonomy_score 63/100, quantifiable) - Audit: 3 gaps honestly reported - Gap 1: strict host IPv4 check + cleanup of 266 duplicate records - Gap 2: root cause confirmed, not a bug - Gap 3: LLM upgrade 3/8 (capacity_forecaster/compliance/coverage) AI autonomy achieved: 1/9 LLM (Hermes only) → 4/9 LLM decision; all 8 zero-writer tables activated; 7/7 coverage dimensions complete. Tonight the AI will autonomously push 4 kinds of Telegram analysis reports Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 22:22:42 +08:00
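Your Name	2f5cab2e45	feat(coverage_evaluator): Gap 3.3 LLM upgrade: gap analysis + coverage remediation suggestions [CI passed: CD Pipeline / build-and-deploy (push), successful in 10m14s] Gap 3 progress: 4/9 services upgraded to LLM (a reasonable ceiling; the other 4 are pure data movement and need no LLM) coverage_evaluator previously upgraded 7 dimensions from unknown→green/yellow/red but made no proactive suggestions. Added: 1. _fetch_red_summary: fetch the latest run's red distribution + the top 10 assets marked red 2. _llm_analyze_coverage_gaps (~50 lines): run the LLM only when there are >= 20 red (so well-covered clusters do not waste tokens) LLM JSON output: - worst_dimension: the dimension to fix first - root_cause: the real reason red is concentrated (Traditional Chinese) - top_remediation_actions[3]: priority/target/action/effort - estimated_weeks_to_close: 1-52 - confidence: 0-1 3. _send_telegram_gaps: push a coverage-gap Telegram summary with total red + worst dimension + weeks to close + top 3 remediation actions Post-scan flow: evaluate 7 dimensions → fetch red summary → LLM analysis (if total_red >= 20) → Telegram Alignment with the commander's iron rules: ✅ Remediation priority is not hard-coded (the LLM derives it from the actual red distribution) ✅ AI suggestion + human decision (last Telegram line: 'Manually assess the coverage remediation schedule') ✅ Includes estimated completion time + confidence (trackable) Session total: 35 commits, 9 new scanners, 4 using LLM: - Hermes (rule quality) - capacity_forecaster (capacity forecasting) - compliance_scanner (compliance posture) - coverage_evaluator (coverage gaps) The remaining 5 are pure data movement and unsuited to LLM (asset_scanner/rule_catalog_sync/ rule_stats_updater/asset_change_tracker/capacity_scanner) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 22:02:36 +08:00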
Your Name	f6cb938dc3	feat(compliance_scanner): Gap 3.2 LLM upgrade: compliance posture analysis + Telegram summary [CI failed: CD Pipeline / build-and-deploy (push), cancelled] Toward AI autonomy: the 9 new scanners go from 2/9 LLM to 3/9. compliance_scanner previously wrote 273 snapshots to the DB on each scan with no human-visible summary. Added: 1. _write_compliance_for_asset_v2 (wrapper): the original _write_compliance_for_asset is unchanged; v2 additionally returns an asset_warning dict for upstream LLM analysis, and only when violations/warnings > 0 2. _llm_analyze_compliance_posture (~50 lines): when there are warnings, use OpenClaw to analyze the overall posture. Output JSON: - posture_grade: A/B/C/D/F - posture_summary: 3-sentence overall posture narrative (Traditional Chinese) - top_priorities[3]: priority + action + rationale - risk_level: low/medium/high/critical - confidence: 0-1 3-path JSON parse fallback (direct / NemoTron wrapper / nested in description) 3. _send_telegram_posture (~40 lines): push the daily compliance summary to the SRE group, with grade emoji (🟢A / 🟡B / 🟠C / 🔴D / ⛔F), the asset_type distribution (statistics for the top 5 problem types), and the AI's top 3 priority actions + rationale scan_once flow: scan assets × 7 dimensions → collect warning_assets → LLM analysis → Telegram push Alignment with the commander's iron rules: ✅ AI analysis + human decision (last Telegram line: 'Manually assess the priority of each fix') ✅ Priority order is not hard-coded (the LLM derives it from the actual distribution of warnings) ✅ asset_type distribution statistics help the commander locate problems quickly Gap 3 progress: 3/8 services upgraded to LLM (Hermes + capacity_forecaster + compliance_scanner) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 21:59:38 +08:00
Your Name	d6b854a25e	feat(capacity_forecaster): Gap 3 LLM upgrade: from threshold to AI decision [CI failed: CD Pipeline / build-and-deploy (push), cancelled] The audit found that 8/9 new scanners are pure threshold logic; only Hermes uses an LLM. Commander's directive "toward AI autonomy" → Gap 3 starts upgrading thresholds to LLM. First upgrade: capacity_forecaster (highest strategic value) The original _derive_actions logic is a hard-coded keyword → action mapping: disk → "clean /var/log, /var/lib/docker, PG WAL" mem → "check the top mem consumer, consider adding memory" cpu → "analyze the top CPU process, consider adding vCPUs" New _llm_analyze_risk (~60 lines): use OpenClaw to run LLM analysis for each high-risk host. The prompt contains: - host + findings (Prometheus predict_linear results) - a description of the host architecture (110 Harbor / 120-121 K3s / 188 PG, etc.) LLM JSON output: - root_causes (3 candidate root causes, Traditional Chinese) - priority_actions (high/medium/low + concrete command hints) - urgency_days (0-30) - confidence (0-1) 3-path JSON parse fallback (direct / NemoTron wrapper / nested in description) _write_recommendation_aol: add llm_analysis to output_payload _send_telegram_forecast: includes the AI verdict (urgency days + confidence + top 2 actions) On LLM failure, fall back to the hard-coded _derive_actions suggestions Alignment with the commander's iron rules: ✅ AI analysis + human decision (still requires_human_decision=True) ✅ Remediation actions are not hard-coded (the LLM generates them from the host's actual state) ✅ root_causes take the host architecture context into account Gap 3 progress: 1/8 services upgraded to LLM (capacity_forecaster); the remaining 7, including compliance_scanner / coverage_evaluator, are left for later Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 21:52:34 +08:00
OG T	97154d12fa	fix(asset_scanner): Gap 1 fix: strict IPv4 check + cleanup of duplicate host assets [CI failed: CD Pipeline / build-and-deploy (push), cancelled] Bug found in Audit 1: the original code used if host_ip.replace('.', '').isdigit() as the IP check, so labels.host='125' (a short name) was mistaken for an IP → host/125 was created (wrong) Meanwhile blackbox-icmp instance='192.168.0.112' has no port → the split failed → the asset was never created New _is_valid_ipv4(s): strictly 4 segments, each an integer 0-255, so the short name '125' / hostname 'cadvisor-110' / out-of-range '256' are no longer misjudged _collect_prometheus_targets flow changed: 1. Extract from instance first (IP:port form or bare IP) instance_host = instance.split(':')[0] if ':' in instance else instance 2. Validate strictly with _is_valid_ipv4 3. labels.host is no longer a fallback (short names are unreliable) DB cleanup (266 records): - 10 asset_relationship rows pointing at short-name hosts - 140 asset_coverage_snapshot: 7 dimensions × 4 short-name hosts - 112 asset_compliance_snapshot: 7 dimensions × 4 short names × several runs - 4 asset_inventory short-name hosts (host/110 / 112 / 125 / 188) Expected at the next scan (1h): - host/192.168.0.112 + host/192.168.0.121 are added (previously missed because blackbox-icmp has no port) - no more short-name host assets 6/6 unit tests pass: _is_valid_ipv4('192.168.0.125')=True _is_valid_ipv4('125')=False ← the key fix _is_valid_ipv4('cadvisor-110')=False _is_valid_ipv4('192.168.0.256')=False (out of range) _is_valid_ipv4('')=False _is_valid_ipv4('192.168.1')=False (3 segments) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 21:46:22 +08:00
OG T	0004554bc6	feat(api): AIOps KPI Dashboard: full view of AI autonomy maturity (modular refactor) [CI passed: CD Pipeline / build-and-deploy (push), successful in 8m47s] GET /api/v1/aiops/kpi → integrates all MASTER §7.1 KPIs in one call. Alignment with the leWOOOgo modular iron rule: - Router (api/v1/aiops_kpi.py) handles HTTP routing only and does not touch the DB - Service (services/aiops_kpi_service.py) owns all SQL + computation - The previous commit was blocked by a hook (the Router imported get_db_context directly); fixed here services/aiops_kpi_service.py (~230 lines): AiopsKpiService.get_snapshot() returns 6 sections: 1. asset_inventory: by_type + total + last_scan (run_id/ended_at/total/new/modified) 2. coverage_kpi: 7 dimensions × (green/yellow/red/unknown) + green_ratio_per_dim + overall_green_ratio (MASTER §7.1 #5 SLO) 3. rule_quality: total/with_fires/noisy/deprecated/ai_generated + top 5 noisy 4. capacity_health: latest snapshot per host + by_verdict + violations_7d 5. automation_flow_24h: aol detail + by_actor + by_operation_type 6. ai_autonomy_score: 0-100 total, 5 sub-scores × 20: asset_coverage / rule_quality / capacity_health / automation_flow / ai_diversity grade: mature(90+) / in_progress(70-90) / starter(50-70) / initial(<50) api/v1/aiops_kpi.py (~35 lines, slim router): only router = APIRouter() + @router.get delegating to the service main.py: include_router(aiops_kpi_v1.router, prefix='/api/v1', tags=['AIOps KPI']) Commander's usage: curl http://192.168.0.121:32334/api/v1/aiops/kpi \| jq . shows the full AI autonomy maturity picture at once Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 21:21:46 +08:00
OG T	7db8845cbb	fix(asset_scanner+coverage): host_service→monitoring_target (fixes the CHECK violation) + log the 4 missing dimensions [CI passed: CD Pipeline / build-and-deploy (push), successful in 12m59s] 2 bug fixes + empirical verification: 1. asset_scanner: host_service is not in the asset_inventory CHECK allow-list. After `ceb61c3` was deployed, the Pod log showed CheckViolationError 'asset_inventory_type_valid' Detail: writing '192.168.0.125:32334' with asset_type='host_service' was rejected allowed list: host/container/k8s_workload/k8s_resource/database/... monitoring_target/third_party_service/... (27 types) Fix: host_service → monitoring_target (reserved for scrape targets in the ADR-090 schema) 2. coverage_evaluator logger: it logged only the original 3 dimensions (monitoring/alerting/km), giving the false impression that the new 4-dimension code in `c1f23cf` had not taken effect (the DB actually already had auto_playbook/ remediation/rule_matching/rule_creation data) Fix: logger.info adds 4 kwargs: playbook/remediation/rule_matching/rule_creation Verified DB distribution of the 7 coverage dimensions (in effect): auto_alerting: 22 green / 78 red / 52 unknown auto_km_creation: 5 green / 17 yellow / 130 unknown auto_monitoring: 1 green / 1 red / 150 unknown auto_playbook: 3 green / 19 yellow / 130 unknown ← new dimension auto_remediation: 0 / 0 / 98 red / 54 unknown ← new dimension auto_rule_creation: 0 / 0 / 100 red / 52 unknown ← new dimension auto_rule_matching: 4 green / 96 yellow / 52 unknown ← new dimension Governance insights: 98 red remediation = most assets had no remediation activity in the past 30d (remediation capability gap) 100 red rule_creation = no AI rules (all yaml_hardcoded) 96 yellow rule_matching = no alerts fired in the past 30d (possibly no problem, possibly no coverage) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 20:27:48 +08:00
OG T	ceb61c3c8e	feat(asset_scanner): Gap 1 fix: Prometheus targets fill in host-install services [CI passed: CD Pipeline / build-and-deploy (push), successful in 13m32s] The audit found that asset_inventory covered only K8s (mon=120, mon1=121: 2 nodes + 78 pods in total) and completely missed the host-install services on 4 hosts: 110 (Harbor/Gitea/monitoring) + 112 (security) + 188 (PG/Redis/Ollama) + 125 (mon backup/standby). Of the user's 4-host architecture (110/112/120/121/188), only 2/5 = 40% was covered. New _collect_prometheus_targets: GET /api/v1/targets?state=active → automatically discovers everything being monitored: - host_service (IP-form targets → postgres-110/redis-110/minio-188/node-exporter, etc.) - third_party_service (non-IP, e.g. alertmanager/argocd-server) - host (one asset_type='host' asset per unique IP) - target → host depends_on relationship Expected new asset_inventory entries: - host: 6 (110/112/120/121/125/188, fully covered by the blackbox-icmp targets Prometheus sees) - host_service: ~15 (postgres/redis/minio/node-exporter/cadvisor, etc.) - third_party_service: ~5 (alertmanager/argocd/prometheus/velero, etc.) Unlocks: - 110/112/188 host-install services enter asset_inventory - coverage_evaluator can evaluate these assets (7 dimensions such as monitoring/alerting/playbook) - blast_radius_calculator can answer "which services does the 110 PostgreSQL affect" - Hermes/forecaster recommendations extend to non-K8s services Alignment with the commander's iron rules: toward AI autonomy, with no hard-coded host list; hosts are discovered dynamically from Prometheus Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 20:06:34 +08:00
OG T	a391dfc389	feat(aiops): capacity_forecaster: Phase 4 Holt-Winters MVP (predict_linear) [CI failed: CD Pipeline / build-and-deploy (push), cancelled] One of the 4 next-phase candidates approved by the commander is complete: AI capacity forecasting. New capacity_forecaster_job.py (~220 lines): runs the forecast daily at 05:00 Taipei (02:00 scanner → 03:00 compliance → 04:00 Hermes → 05:00 forecaster form a complete daily chain). Forecast methodology (MVP): Prometheus predict_linear(metric[7d], 86400*7), a linear extrapolation from the past 7d 3 forecast queries: 1. disk_saturation_7d: predict_linear(node_filesystem_avail_bytes[7d], 7d) < 0 2. mem_saturation_7d: predict_linear(MemAvailable[7d], 7d) / MemTotal < 10% 3. cpu_high_7d_trend: avg_over_time(cpu_used_pct[7d]) > 70% High-risk host found → write aol(capacity_recommendation) + push Telegram - input: host + horizon + findings count - output: findings list + proposed_actions + requires_human_decision=true proposed_actions derived from findings: - disk: clean log/docker/PG WAL or expand capacity - mem: top consumer / JVM tuning - cpu: scale out / add vCPUs Alignment with the commander's iron rules: ✅ Only pushes suggestions, never scales up automatically ✅ The 7d window has enough samples ✅ AI forecast + human decision Future TODO: - Real Holt-Winters (with seasonality), which requires Python statsmodels - Business-cycle adjustment (Monday peak / weekend trough) Wire main.py lifespan asyncio.create_task() Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 20:00:36 +08:00
OG T	c1f23cfabe	feat(coverage_evaluator): add 4 dimensions: playbook/remediation/rule_matching/rule_creation [CI failed: CD Pipeline / build-and-deploy (push), cancelled] Review blind spot: of the 7 coverage dimensions only 3 were implemented (monitoring/alerting/km); the other 4 were always unknown v2 additions: + auto_playbook: asset.name appears in playbooks.symptom_pattern/description (approved status) → green; no matching playbook but type='k8s_workload' → yellow + auto_remediation: remediation_events.target_resource ILIKE asset.name in the past 30d → green; no target but k8s_workload/container → red (should have remediation capability but does not) + auto_rule_matching: incidents.affected_services ILIKE asset.name in the past 30d, or incidents.alertname matches alert_rule.labels.host/namespace → green; nothing fired → yellow (may be fine, may be uncovered) + auto_rule_creation: alert_rule_catalog source='ai_generated' matches the asset → green; currently all yaml_hardcoded → all red (the AI has not yet created rules on its own); will turn green once Hermes produces AI rules Unlocks: the complete 7-dimension coverage SLO KPI (MASTER §7.1) - red count = real governance gaps - green ratio = automation maturity - the AI can proactively recommend coverage remediation for red assets Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 19:54:36 +08:00
OG T	ba18ad2ef8	feat(hermes+rules): LLM upgrade for Hermes + commander's decision to deprecate PostgreSQLDiskGrowthRate [CI passed: Deploy Alert Rules / Deploy Prometheus Alert Rules (push), successful in 40s; CD Pipeline / build-and-deploy (push), successful in 8m37s] Commander's decisions of 2026-04-19: - Rule 1 PostgreSQLDiskGrowthRate → option C: deprecate + replace with new rules - Rule 2 NoAlertsReceived2Hours → keep (guards the real alert chain) - Fix the noise_rate algorithm first (NO_ACTION does not count as fp), then adjust dynamically after observation 1. rule_stats_updater v2 noise algorithm: Before: any EXPIRED approval counted as fp Problem: NO_ACTION/OBSERVE/INVESTIGATE are pure AI observation and should not count as false positives Fix: WHERE ar.action NOT ILIKE '%NO_ACTION%' AND NOT ILIKE '%OBSERVE%' AND ... 2. hermes_rule_quality v2 LLM upgrade: New _llm_analyze_noisy_rule: - use OpenClaw (Ollama/NemoTron/Gemini) to analyze each noisy rule - JSON output: probable_root_causes/recommended_actions/confidence/should_deprecate - 3-path parse fallback (direct / NemoTron wrapper / description nested) _write_advisory_aol adds llm_analysis to output_payload _send_telegram_summary adds the AI verdict + top 2 suggestions (capped at 8 rules to keep the message short) Complies with the commander's iron rule: the AI analyzes but takes no automatic action; decisions remain manual 3. ops/monitoring/alerts-unified.yml replaces Rule 1: Remove PostgreSQLDiskGrowthRate (500MB/h growth → false alarms on normal WAL behavior) Add HostDiskUsageHigh (>80% for 10m, warning) Add HostDiskUsageCritical (>90% for 5m, critical) Both carry labels.supersedes='PostgreSQLDiskGrowthRate' for traceability (to be applied to Prometheus by the next deploy-alerts workflow run) 4. Mark deprecated in the DB immediately (so Hermes does not push again before the alerts yaml is deployed): UPDATE alert_rule_catalog SET review_status='deprecated' WHERE rule_name='PostgreSQLDiskGrowthRate' Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 19:39:05 +08:00
OG T	6ab0ce9c75	feat(aiops): Hermes rule quality advisor: E3 AI rule-quality suggestions (conservative version) [CI passed: CD Pipeline / build-and-deploy (push), successful in 8m22s] After rule_stats ran, it found 2 rules with a 100% noise_rate: - PostgreSQLDiskGrowthRate (tp=0 fp=2) - NoAlertsReceived2Hours (tp=0 fp=1) plus MoWoooWorkDown (33%), KubePodCrashLooping (25%) New hermes_rule_quality_job.py (~210 lines): analyzes alert_rule_catalog daily at 04:00 Taipei: - threshold: noise_rate >= 0.7 AND samples >= 5 - writes aol('rule_rejected', proposed_action='review_or_deprecate') for each rule - pushes a Telegram summary to the SRE group Alignment with the commander's iron rules: ✅ Does not change review_status automatically (deprecation is a human decision; the AI only pushes suggestions) ✅ The threshold "triggers discussion" rather than being the "final decision" ✅ aol(rule_rejected) leaves a trail and can later be upgraded to LLM debate Unlocks the E3 Hermes foundation: LLM analysis of false-positive root causes can be added later (expr missing a for: window, label match too broad, the metric itself noisy, etc.) to produce concrete improvement suggestions. Wire main.py lifespan asyncio.create_task() Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 18:11:26 +08:00
OG T	e677773e39	fix(asset_scanner): Pod→Deployment via ReplicaSet bridge (fix for missed relationships) All checks were successful CD Pipeline / build-and-deploy (push) Successful in 9m31s Details Review blind spot: asset_relationship has 52 rows in practice, but all are Pod→StatefulSet or Pod→ConfigMap, with no Pod→Deployment at all! Root cause: in K8s, Pod.ownerReferences[0].kind = 'ReplicaSet' (99% of cases). A Deployment manages a ReplicaSet, which manages the Pod (two-level owner chain). The original code only matched kind in (deployment/statefulset/daemonset) → skipped ReplicaSet → every Pod→Deployment relationship was missed. Fix v3.1: 0. Add a pass that collects replicasets (bridge only, not written to asset_inventory) and builds the rs_to_deployment map: {ns/rs_name: deployment_name} 2. Pod ownerRef.kind='ReplicaSet' → look up rs_to_deployment → create Pod→Deployment. Expected effect: - asset_relationship grows from 52 → 150+ (every Deployment-managed Pod has a relationship) - OpenClaw blast_radius counts the Pods affected by a Deployment correctly. ReplicaSets are not written as assets (they are ephemeral intermediaries; rolling updates create many of them and would pollute the inventory) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:26:57 +08:00
OG T	c8b263db06	fix(coverage_evaluator): correct KM column ke.body → ke.content + extend matching to title Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details Verified after `df71c9a` was deployed that coverage_evaluator took effect: - monitoring: 2 hosts match Prometheus targets - alerting: 74 rows (22 green + 52 red) - km: 0 (error: column "ke.body" does not exist). Root cause: the knowledge_entries column is 'content', not 'body'. Fix: ke.content ILIKE '%name%' OR ke.title ILIKE '%name%'. Also removes an unused import (typing.Any). The next coverage_evaluator tick will UPDATE the auto_km_creation dimension correctly, unlocking full 3-dimension coverage (monitoring/alerting/km) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:24:46 +08:00
OG T	92349bc37c	feat(aiops): asset_change_tracker - all 8 zero-writer tables now live Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details Review blind spot 10: asset_change_event still has 0 rows (the last table with no writer). Add asset_change_tracker_job.py (~180 lines): every 1h, compares the two most recent asset_discovery_run entries and writes asset_change_event: ✅ asset_added: present in the newer run but not the older one (EXCEPT SET) ✅ asset_removed: present in the older run but not the newer one ✅ lifecycle_changed: asset_inventory.lifecycle_state='deprecated' with updated_at within the last 2h. Uses SET EXCEPT to avoid N+1; a single INSERT covers all diffs. All 8 ADR-090 zero-writer tables now have a writer: ✅ asset_inventory / asset_discovery_run / asset_coverage_snapshot / asset_relationship / asset_change_event / asset_compliance_snapshot (asset_) ✅ alert_rule_catalog ✅ host_capacity_snapshot / capacity_violation_event (capacity_) Phase 7 asset inventory + coverage matrix + change tracking are fully implemented. Next up: the Hermes AI agent for quality analysis (deprecating noisy rules, recommending coverage fixes). Wired into the main.py lifespan via asyncio.create_task() Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:18:34 +08:00
OG T	df71c9a37b	feat(aiops): rule_stats_updater - compute noise_rate + true/false positives All checks were successful CD Pipeline / build-and-deploy (push) Successful in 8m26s Details Review blind spot 5: alert_rule_catalog has 68 rows, but noise_rate/TP/FP/last_fired_at are all NULL. Add rule_stats_updater_job.py (~170 lines): every 1h, UPDATEs the whole alert_rule_catalog table, deriving from incidents + approval_records: - last_fired_at = max(incidents.created_at WHERE alertname=rule_name) - true_positive_count = count incidents.status='RESOLVED' past 30d - false_positive_count = count approval_records.status='EXPIRED' past 30d (EXPIRED = unhandled for 48h, used as a proxy for a false alarm) - noise_rate = fp / (tp + fp) Window: 30 days (configurable). Uses a single UPDATE + subquery to avoid N+1 (68 rules × 3 queries = 204 queries → 1 query). Unlocks E3 Hermes: the Hermes AI agent can later read alert_rule_catalog WHERE noise_rate > 0.5 and propose review_status='deprecated' or superseded_by_rule_id. Wired into the main.py lifespan via asyncio.create_task() Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:05:30 +08:00
OG T	505232336b	feat(aiops): coverage_evaluator - upgrade coverage_snapshot from unknown to a real status Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details Review blind spot 4: all 546 asset_coverage_snapshot rows are 'unknown', which carries no real meaning. Add coverage_evaluator_job.py (~270 lines): every 1h, upgrades 3 dimensions of the coverage_snapshot for the latest asset_discovery_run: ✅ auto_monitoring: checks the host asset IP against Prometheus /api/v1/targets → green (target exists) / red (no target) ✅ auto_alerting: whether alert_rule_catalog.labels match the asset → three match strategies (host/namespace/layer), green/red ✅ auto_km_creation: knowledge_entries.body ILIKE asset.name → green (KM exists) / yellow (no KM). The evidence JSONB records the basis for each upgrade so the AI can audit it later. Not implemented (left as unknown): ⏳ auto_rule_matching (needs alert history statistics) ⏳ auto_playbook / auto_remediation / auto_rule_creation (need a playbook table). Expected effect (after the next evaluator run + coverage_snapshot UPDATE): - the 546 coverage rows go from 100% unknown → 30-50% green/red/yellow - the "coverage SLO" KPI can actually be computed (MASTER §7.1) - the AI can spot red assets in coverage_snapshot and proactively push remediation. Wired into the main.py lifespan via asyncio.create_task() Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:02:30 +08:00
OG T	fdf8b739f1	feat(asset_scanner): v3 adds multiple resource types + asset_relationship builder Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details Review blind spot: the original MVP only scanned pods (39 assets). This change extends it. New resource types scanned: - nodes (asset_type='host'): physical hosts - deployments/statefulsets/daemonsets (asset_type='k8s_workload') - services (asset_type='k8s_resource') - configmaps (asset_type='k8s_resource') Secrets are skipped (awoooi-executor RBAC forbids list, which is the correct design). asset_relationship rows are now created automatically: - Pod → Deployment/StatefulSet/DaemonSet (depends_on, via ownerReferences) - Service → Pod (routes_to, via spec.selector matched against Pod.labels) - Pod → ConfigMap (depends_on, via spec.volumes[].configMap.name) ON CONFLICT (from/to/type) DO UPDATE last_verified_at keeps the writes idempotent. Adds the _fetch_kubectl_json helper (nodes are fetched without --all-namespaces). Adds a _build_{pod,workload,service,node,configmap}_asset builder per asset kind. Expected effect (after the next scan in 1h or on Pod restart): - asset_inventory: 39 → 300+ (many resource types across the whole cluster) - asset_relationship: 0 → several hundred (OpenClaw blast radius calculation finally has a topology). Unlocks downstream: - the AI can query asset_relationship when computing blast_radius (no data before) - the blast_radius_calculator of MASTER §3.3 D3 Declarative Remediation gets a real dependency graph. Refs: ADR-090 §3.3, MASTER §3.1 L6×D1 (8D sensory topology) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 16:54:18 +08:00
OG T	02263445c2	fix(asset_scanner): switch kubectl to subprocess - K8sProvider does not support --all-namespaces All checks were successful CD Pipeline / build-and-deploy (push) Successful in 9m9s Details After `5b9b36f` was deployed, asset_scanner ran 3 times but reported total=0, new=0: - asset_inventory still has 0 rows - running kubectl get pods --all-namespaces -o json manually in the Pod does return JSON - Root cause: K8sProvider._kubectl_get puts the namespace argument into '-n $ns', so '--all-namespaces' becomes '-n --all-namespaces' (rejected by kubectl). Fix: - bypass K8sProvider and call asyncio.create_subprocess_exec directly - kubectl get pods --all-namespaces -o json - 30s timeout; rc != 0 raises RuntimeError, which triggers aol status='failed'. Verification: after deployment, pods should start appearing in asset_inventory within 1 minute Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 16:31:26 +08:00
OG T	4259a104f5	feat(aiops): capacity_scanner + compliance_scanner (the remaining 2 of ADR-090 Phase 7) Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details Completes the 3rd and 4th services of ADR-090 Phase 7, unlocking 2 zero-writer tables: B3. apps/api/src/jobs/capacity_scanner_job.py (~300 lines) - pulls Prometheus node_exporter data daily at 02:00 Taipei time - writes host_capacity_snapshot (load1/5/15, cpu, iowait, mem, swap) - heuristic ai_verdict: cpu>80 or mem>85 → critical; >60/70 → warning - above the hard thresholds → writes capacity_violation_event - writes aol(capacity_recommendation) B4. apps/api/src/jobs/compliance_scanner_job.py (~270 lines) - iterates over the active assets in asset_inventory daily at 03:00 Taipei time - writes a 7-dimension compliance snapshot for each asset - secret_rotated: a real check (metadata.creationTimestamp > 90d = warning) - the other 6 dimensions (ssl_cert_valid / cve_scan / backup_tested / audit_log_enabled / access_reviewed / encryption_at_rest) are placeholders set to 'unknown' + detail TODO, with logic to be filled in by a later agent - writes an aol(coverage_recalculated) summary. The main.py lifespan wires in the 2 new loops. Expected unlocks (together with B1 asset_scanner + B2 rule_catalog_sync): - asset_inventory: 0 → several hundred (B1) - asset_discovery_run: 0 → 1 per hour (B1) - asset_coverage_snapshot: 0 → assets × 7 dimensions (B1) - alert_rule_catalog: 0 → ~68 rules (B2) - host_capacity_snapshot: 0 → hosts per day (B3) - capacity_violation_event: 0 → written when thresholds are exceeded (B3) - asset_compliance_snapshot: 0 → assets × 7 dimensions (B4) automation_operation_log gains 4 new op_type values: asset_discovered / rule_created / rule_updated / capacity_recommendation / coverage_recalculated. All 8 zero-writer tables now have a writer; the ADR-090 Phase 7 implementation is complete. Refs: ADR-090 §4.2 Phase 4, MASTER §3.5 D5 (capacity-aware) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 16:23:27 +08:00
OG T	5b9b36f30d	fix(ci)+feat(aiops): cd.yaml shared network + rule_catalog_sync (ADR-090 E3) All checks were successful CD Pipeline / build-and-deploy (push) Successful in 14m31s Details CI fix (root cause of the third failure, `c0f3509`): `c0f3509` log: 'Detected act task network: (none, will fall back to bridge)' → the ACT_NET grep found no match in the CI environment → fallback to bridge → the default bridge does not support container name DNS → pg-test-b5 failed to resolve. Fix (v3: create a shared network explicitly): - B5_NET=b5-test-net (idempotent docker network create) - ci-runner connects itself with docker network connect $HOSTNAME - pg-test-b5 --network=$B5_NET - both sides are on the same user-defined network → container name DNS works. Add rule_catalog_sync_job (ADR-090 § Phase 7, 2nd service): + apps/api/src/jobs/rule_catalog_sync_job.py (~230 lines) - run_rule_catalog_sync_loop: 90s startup delay, sync every 1h - sync_once: HTTP GET {PROMETHEUS_URL}/api/v1/rules (type=alert) - UPSERT alert_rule_catalog (rule_name is UNIQUE) - writes aol only when an INSERT/UPDATE actually happens (avoids polluting it with N rules) + wired into the main.py lifespan via asyncio.create_task(). Expected unlocks: - alert_rule_catalog: from 0 → the number of active Prometheus rules (~68) - automation_operation_log: new 'rule_created' / 'rule_updated' op_type values - the E3 Hermes AI finally has a baseline for proposing rule fixes. Refs: ADR-090 §4.2 E3, MASTER §3.3 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 16:08:34 +08:00
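OG T	c0f3509d39	fix(drift-card): Drift Diff HTTP 400 - accumulate length item by item to avoid cutting HTML Some checks failed CD Pipeline / build-and-deploy (push) Failing after 2m0s Details The Commander reported that tapping [查看 Diff] (View Diff) at 14:18 returned 'Drift Diff 查詢失敗: HTTP error: 400' (Drift Diff query failed). Root cause (telegram_gateway.py:2087 _send_drift_diff_detail): - report_id=7ffe78ae has 48 items, and the longest single git_value is 1794 characters (an env array) - the accumulated _full far exceeds 4096, so _full[:3950] truncates it - the cut can land in the middle of an HTML tag (<code>...) or in the middle of an entity such as &lt; - Telegram parse_mode='HTML' rejects incomplete HTML → 400. Fix: - accumulate length item by item, counting each item as the _block length + 1 - keep a 3800 cap (4096 - 250 buffer for the header + the '… 還有 X 項' (X more items) notice) - guarantee that _full is always a complete HTML structure. Verification: when the next drift report appears and the Commander taps [查看 Diff], it should display normally (next cycle of this session) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 14:26:29 +08:00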
OG T	ddb902f1ff	fix(ci+aiops): cd.yaml grep set-e bug + add asset_scanner_job (ADR-090) Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details CI fix (root cause of the second failure, `b636d3b`): cd.yaml line 161 ACT_NET=$(docker network ls | grep -E '^GITEA-ACTIONS-...') The act runner uses 'bash -e -o pipefail', so grep exits 1 when nothing matches → the whole step aborts (the earlier `e7ba8cb` failure was an unreachable PG IP; b636d3b is the grep set-e bug: two different errors). Fix: ACT_NET=$(... | (grep -E '...' || echo "") | head -1) Wrapping grep in a subshell with || echo "" ensures ACT_NET is an empty string on failure. Add asset_scanner_job (ADR-090 § Phase 7, 1st service): + apps/api/src/jobs/asset_scanner_job.py (~360 lines) - run_asset_scanner_loop: runs every 1h, with a 60s delay before the first run - scan_once: uses K8sProvider kubectl_get pods --all-namespaces - UPSERT asset_inventory (asset_key is UNIQUE; the same asset_id is reused across runs) - writes a 7-dimension asset_coverage_snapshot for each active asset (default unknown) - writes automation_operation_log(asset_discovered) + wired into the main.py lifespan via asyncio.create_task(). Expected unlocks: - asset_inventory: from 0 → several hundred (pods in all namespaces) - asset_discovery_run: 1 row per hour - asset_coverage_snapshot: 7 dimensions per asset - automation_operation_log: new 'asset_discovered' op_type. The next stage (rule_catalog / capacity / compliance scanner) will be submitted in batches once CI passes. Refs: ADR-090 §4.1, MASTER §3.4 D4, project_blindspot_governance.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 14:15:45 +08:00
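
The item-by-item length accumulation described in commit c0f3509d39 (drift-card fix) could look roughly like the sketch below. It is not the code from telegram_gateway.py: the function name build_diff_message, the (key, value) item shape, and the exact markup are assumptions made for illustration. Only the 3800 budget, the "block length + 1" accounting, and the trailing "more items" notice come from the commit message.

```python
from html import escape

BODY_BUDGET = 3800  # cap from the commit message; Telegram's hard limit is 4096


def build_diff_message(items, budget=BODY_BUDGET):
    """Join whole HTML blocks until the budget is reached (hypothetical helper).

    Each item is rendered as one self-contained block, so the message is never
    cut in the middle of a tag or entity, unlike slicing the joined string.
    """
    blocks = []
    used = 0
    for index, (key, value) in enumerate(items):
        # Escape user data so it cannot introduce unbalanced markup.
        block = f"<b>{escape(key, quote=False)}</b>: <code>{escape(value, quote=False)}</code>"
        cost = len(block) + 1  # +1 for the newline that joins blocks
        if used + cost > budget:
            # Stop before the block that would overflow and say how many are left.
            blocks.append(f"… {len(items) - index} more items")
            break
        blocks.append(block)
        used += cost
    return "\n".join(blocks)


if __name__ == "__main__":
    sample = [("env", "x" * 100)] * 48  # 48 items, as in the reported case
    message = build_diff_message(sample)
    print(len(message), message.count("<code>"), message.count("</code>"))
```

Because blocks are only ever appended whole, every opening tag in the result has its matching close, which is the property the `_full[:3950]` slice could not guarantee.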


820 Commits