awoooi

Author	SHA1	Message	Date
Your Name	9080ba3670	feat(awooop): run ansible check-mode evidence worker All checks were successful CD Pipeline / tests (push) Successful in 1m28s Details Code Review / ai-code-review (push) Successful in 12s Details CD Pipeline / build-and-deploy (push) Successful in 5m9s Details CD Pipeline / post-deploy-checks (push) Successful in 1m30s Details	2026-05-31 12:53:22 +08:00
Your Name	9e093a9525	fix(api): reconcile inactive stale incidents All checks were successful CD Pipeline / tests (push) Successful in 1m26s Details Code Review / ai-code-review (push) Successful in 11s Details CD Pipeline / build-and-deploy (push) Successful in 4m23s Details CD Pipeline / post-deploy-checks (push) Successful in 2m17s Details	2026-05-29 11:43:19 +08:00
Your Name	8342cfa460	fix(governance): stop km healthcheck requeue All checks were successful Code Review / ai-code-review (push) Successful in 11s Details CD Pipeline / tests (push) Successful in 2m1s Details CD Pipeline / build-and-deploy (push) Successful in 4m45s Details CD Pipeline / post-deploy-checks (push) Successful in 1m46s Details	2026-05-19 23:01:03 +08:00
Your Name	edf97ad8ca	feat(governance): process hermes km healthchecks All checks were successful Code Review / ai-code-review (push) Successful in 10s Details CD Pipeline / tests (push) Successful in 2m13s Details CD Pipeline / build-and-deploy (push) Successful in 5m14s Details CD Pipeline / post-deploy-checks (push) Successful in 1m55s Details	2026-05-19 22:32:55 +08:00
Your Name	1d285dd9d4	fix(api): suppress batch reconcile postmortems Some checks failed Code Review / ai-code-review (push) Successful in 11s Details CD Pipeline / tests (push) Successful in 1m18s Details CD Pipeline / post-deploy-checks (push) Has been cancelled Details CD Pipeline / build-and-deploy (push) Has been cancelled Details	2026-05-19 12:18:17 +08:00
Your Name	d0835a7be1	fix(api): reconcile completed stuck incidents All checks were successful Code Review / ai-code-review (push) Successful in 11s Details CD Pipeline / tests (push) Successful in 1m2s Details CD Pipeline / build-and-deploy (push) Successful in 3m34s Details CD Pipeline / post-deploy-checks (push) Successful in 1m35s Details	2026-05-19 11:45:15 +08:00
Your Name	ff30c61c4c	fix(rls): 收斂 API DB access context All checks were successful Code Review / ai-code-review (push) Successful in 21s Details CD Pipeline / tests (push) Successful in 1m20s Details CD Pipeline / build-and-deploy (push) Successful in 4m15s Details CD Pipeline / post-deploy-checks (push) Successful in 1m58s Details	2026-05-12 19:55:13 +08:00
Your Name	c1ac157aaf	fix(km): keep backfill reconciler loop alive All checks were successful Code Review / ai-code-review (push) Successful in 11s Details CD Pipeline / tests (push) Successful in 1m12s Details CD Pipeline / build-and-deploy (push) Successful in 4m2s Details CD Pipeline / post-deploy-checks (push) Successful in 1m18s Details	2026-05-06 17:03:22 +08:00
Your Name	4111ea4f9f	fix(ai): remove 188 ollama provider All checks were successful Code Review / ai-code-review (push) Successful in 12s Details CD Pipeline / tests (push) Successful in 1m13s Details CD Pipeline / build-and-deploy (push) Successful in 3m36s Details CD Pipeline / post-deploy-checks (push) Successful in 1m20s Details	2026-05-06 14:34:48 +08:00
Your Name	bf847ad045	fix(ai): stabilize GCP Ollama alert lane Some checks failed Code Review / ai-code-review (push) Successful in 10s Details CD Pipeline / build-and-deploy (push) Has been cancelled Details CD Pipeline / post-deploy-checks (push) Has been cancelled Details CD Pipeline / tests (push) Has been cancelled Details	2026-05-05 22:20:27 +08:00
Your Name	8e22110030	fix(governance): keep trust drift watchdog on governance agent Some checks failed CD Pipeline / tests (push) Successful in 2m51s Details Code Review / ai-code-review (push) Successful in 24s Details CD Pipeline / build-and-deploy (push) Has started running Details CD Pipeline / post-deploy-checks (push) Has been cancelled Details	2026-05-05 14:00:13 +08:00
Your Name	aa4ccec429	fix(watchdog): ADR-092 B4 — 三層修復消除 META SYSTEM 重複告警 + Ollama 路由強化 All checks were successful Code Review / ai-code-review (push) Successful in 7m16s Details 問題根因（debugger 全景徹查）： 1. Prod 仍跑舊版代碼（ec013f66 後的修法未部署 → 告警字串仍含舊格式） 2. replicas=2 時 Pod 間 grace period 不共享 → violation_codes 分歧 → 不同 SHA256 → dedup 失效 3. 新 Pod 啟動立即執行 _check_once() → rollout 時多發一波 4. W6 violation_codes 含動態 low_count → count 微變繞過 dedup 修復（A2/A3/W6/C1/C2）： - A2：run_ai_slo_watchdog_loop 加 90s leading sleep，避免 rollout 立即觸發 - A3：_grace_active() 改為 Redis cluster-shared（watchdog:cluster_grace, ex=1800s, nx=True）消除 Pod 間 grace period 不一致；Redis 故障時 fallback 為 process-local monotonic - W6：violation_codes 移除動態 low_count，改為穩定 "W6:trust_drift" - C1：ollama_auto_recovery.py recovered_host 改動態 label（依 URL port 判斷 GCP-A/B/Local） - C2：ConfigMap OLLAMA_FALLBACK_URL 改走 110:11437 nginx proxy，三層容災統一架構 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-05 10:31:53 +08:00
Your Name	10e665a540	fix(watchdog): 修復 META SYSTEM 重複告警 — violation_codes 穩定 dedup All checks were successful Code Review / ai-code-review (push) Successful in 1m3s Details 根因：violations 字串含動態浮點數（mean_trust/low_ratio），每次微變 → SHA256 不同 → dedup 失效修法：新增 violation_codes list（穩定 W-code 格式），dedup 計算只用 violation_codes violations 保持含動態值（顯示用），Telegram 通知照常顯示完整資訊 W-6 Trust Drift dedup key: W6:trust_drift:low_count={N}（不含浮點數） Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-05 00:06:38 +08:00
Your Name	ec013f662d	fix(watchdog): 修复 Trust Drift 重复告警 + 建立 GCP Ollama nginx proxy Some checks failed Code Review / ai-code-review (push) Successful in 45s Details Ansible Lint / lint (push) Has been cancelled Details - ai_slo_watchdog_job: 改用 trust_drift_detector 纯统计 lib 避免与 governance_agent 每小时自检查重复触发 Telegram - infra/ansible: 建立 110 nginx proxy 转发到 GCP-A/B 端口 11435 -> 34.143.170.20:11434 (GCP-A) 端口 11436 -> 34.21.145.224:11434 (GCP-B) - docs/runbooks: DEPLOY-GCP-OLLAMA-PROXY.md 完整部署指南 - ops/nginx: 手动部署脚本供 110 直接执行 ADR-110 三层容灾启用前提：先部署 proxy，再改 ConfigMap	2026-05-04 23:12:35 +08:00
Your Name	45f6f17558	fix(watchdog): dedup hash 非確定性 bug — 改用 hashlib.sha256 + setnx atomic All checks were successful Code Review / ai-code-review (push) Successful in 56s Details 根因：Python 內建 hash() 受 PYTHONHASHSEED 影響，每次 process 重啟值不同。每次 kubectl rollout restart → 新 pod 算出不同 dedup_hash → 繞過 1h TTL → 洗版。症狀：連續 rollout 4-5 次後，META SYSTEM 每分鐘一條狂發（19:39/40/41/42 截圖）。修法： 1. hash() → hashlib.sha256(content.encode()).hexdigest()[:12]（跨 pod/重啟確定性） 2. redis.exists+setex → redis.set(nx=True) atomic setnx（防多 replica 並發多發） 2026-05-04 ogt Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-04 19:47:42 +08:00
Your Name	f6b698c873	fix(aiops): Critic 修復 — PromQL 注入防線 + flag=False escalation bug + 計數虛報 All checks were successful Code Review / ai-code-review (push) Successful in 53s Details Bug 1 (drift.py): DRIFT_AUTO_ADOPT_ENABLED=false 時仍設 auto_block_reason → 導致 escalation 被觸發，把「停用」誤判為「阻擋事故」修法: flag=False 不設 auto_block_reason，視為靜默停用 Bug 2 (coverage_evaluator_job.py): asset name/host/namespace/ip 直接 f-string 進 PromQL，無白名單驗證 → 髒資料可生成語意污染規則或讓 Prometheus reload 失敗修法: 加 _safe_label_val 正規表達式白名單（^[a-zA-Z0-9._\-]+$），不合法直接 skip + debug log Bug 3 (coverage_evaluator_job.py): ON CONFLICT DO NOTHING 衝突時 created 仍 +1 → stats["rules_auto_created"] 計數虛高，Redis 冷卻被誤設修法: 改用 INSERT ... RETURNING rule_name，fetchone() 確認實際插入才計數和設冷卻附加: Redis RuntimeError 單獨 catch + log（不再靜默 pass） Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-04 14:31:53 +08:00
Your Name	72cd79ed8b	fix(aiops): Task2 drift auto-adopt 根因修復 + Task3 coverage gap 規則自動生成 All checks were successful Code Review / ai-code-review (push) Successful in 48s Details Task 2 — Drift 自動採納修根因: 根因: _analyze_and_notify() 中 report 是 in-memory 物件， update_interpretation() 只更新 DB，不回寫 report.interpretation，導致 auto_adopt_if_safe() 永遠看到 None → 觸發「尚無 Nemotron 意圖分析」 → Drift 自動採納 0 筆修法: report.interpretation = interpretation（DB 寫入後立即回寫記憶體）附加: DRIFT_AUTO_ADOPT_ENABLED flag（default=True，回滾: kubectl set env ...=false） Task 3 — Coverage Gap → AI 規則自動生成執行器: 根因: evaluate_once() 只分析 red 缺口，但無執行器將分析轉為實際規則 → alert_rule_catalog 的 ai_generated source 永遠為 0 條修法: 新增 _auto_create_rules_for_uncovered_assets(run_id) · 查 auto_alerting=red 的 top 5 host/k8s_workload asset · 依 asset_type 生成範本化 PromQL rule（host→up, k8s→replicas_available） · UPSERT 進 alert_rule_catalog（source='ai_generated', review_status='pending_review'） · Redis 24h 冷卻防重複，Redis 不可用時降級繼續附加: COVERAGE_AUTO_RULE_ENABLED flag（default=True，回滾: kubectl set env ...=false） Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-04 14:22:51 +08:00
Your Name	577250a678	fix(governance): 修反消音化 W-3/W-4 守衛 + Prometheus 補資料缺失告警 Some checks failed Code Review / ai-code-review (push) Successful in 52s Details CD Pipeline / tests (push) Failing after 2m21s Details CD Pipeline / build-and-deploy (push) Has been skipped Details CD Pipeline / post-deploy-checks (push) Has been skipped Details Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 1m6s Details 【統帥怒訓 — 違反 feedback_full_chain_first_then_fix.md 鐵律】前次 commit `f1362fcc` 用 skip 條件把告警吞掉，是消音化解法： - W-3：total_exec<10 永遠 skip → Redis 永遠空也不會告警 - W-4：playbooks total==0 永遠 skip → 表被清空也不會告警 - Prometheus NaN sentinel + 既有 < 0.1 規則疊加後沒任何路徑會告警統帥怒訓「又把告警給消失了」「已經這樣做幾次了」。本 commit 救回告警可見性。【修法 — 啟動 30 分鐘寬限 + 過期改打資料管線斷新告警】 - ai_slo_watchdog_job.py 新增模組層 _PROCESS_START 與 _grace_active() 守衛： - W-3a：metric 有資料 + rate<0.30 → 既有「飛輪成功率過低」 - W-3b：rate=None 且 uptime>30min → 新告警「飛輪資料管線無流量」 - W-4a：playbooks total>0 + approved=0 → 既有「自動修復鏈路斷裂」 - W-4b：playbooks total=0 且 uptime>30min → 新告警「Playbook 表初始化失敗」 - 3 份 Prometheus rule（k8s/monitoring/flywheel-alerts.yaml、 ops/monitoring/alerts.yml、ops/monitoring/alerts-unified.yml）新增 FlywheelExecutionRateMissing：absent() 或 NaN 持續 30 分鐘 → 告警，與 watchdog W-3b 雙保險【已加入 memory】 feedback_silencing_alerts_recurring_violation.md 鎖入紅線鐵律：「fresh deploy / init guard 用 skip 吞告警 = 結構性失職，必須分流寬限期 + 過期改打資料管線斷新告警」【驗證】 106 個治理相關 unit test 全過： test_trust_drift_watchdog / test_governance_agent / test_failover_alerter / test_check_trust_drift_commit_outside_context_poc / test_governance_remediation_dispatch / test_ai_governance_endpoints / test_governance_dispatcher 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 12:39:46 +08:00
Your Name	f1362fcc8d	fix(governance): 修治理告警 4 個 silent failure + Prom sentinel 連鎖 Some checks failed Code Review / ai-code-review (push) Successful in 49s Details CD Pipeline / tests (push) Successful in 2m9s Details CD Pipeline / build-and-deploy (push) Failing after 31m11s Details CD Pipeline / post-deploy-checks (push) Has been skipped Details 【全景檢測：12-agent 並行掃描定位 4 大 bug 與 1 個 P0 連鎖回歸】 Bug 1（P0 silent failure）— governance_agent.check_trust_drift 原 `await db.commit()` 縮排錯在 async with 區塊外（8 空格 vs 12）， session 已 auto-commit 關閉，二次 commit 拋 InvalidRequestError 被吞， governance_trust_drift_auto_deprecated log 從不出現。修：commit/log 移回 with 內。附 AST regression guard test 擋退化。 Bug 2 — flywheel_stats_service / W-3 fresh deploy 假告警 Redis 空時 total_exec=0 → rate=0.0 → watchdog `< 0.30` 立即觸發「飛輪成功率 0%」假告警。修：total_exec < FLYWHEEL_MIN_SAMPLE(10) 回 None， watchdog 判 None 跳過 W-3。Prometheus sentinel 用 NaN（非 -1.0）避免觸發 ops/monitoring/alerts.yml:775 等 3 份 prom rule 的 `< 0.1` 條件造成 2h 後假告警連鎖。前端 type 同步 number \| null。 Bug 3 — failover_alerter dedup key 原 key 只看 event_type 不看 payload，trust_drift 4→25 IDs 變動全被 1h dedup 吞掉。修：dedup key 加 sha256(impact subdict)[:8]，event_type sanitize 防特殊字元污染 Redis key。 Bug 4 — ai_slo_watchdog_job W-4 evolver 全封存初始化誤報原邏輯 approved==0 即告警，未排除「playbooks 表初始化中」場景。修：_count_approved_playbooks 回 (approved, total)，total==0 → skip。【執行結果】 - 39 個相關 unit test 全過（test_failover_alerter / test_governance_agent / test_trust_drift_watchdog / test_check_trust_drift_commit_outside_context_poc） - 6 個關鍵路徑實測：NaN sentinel / float 渲染 / hash 區分性 / dedup 同 impact 相同 hash / datetime 容錯 / 4 檔 py_compile 全過【調度教訓 — 留作未來改進】 - 12-agent 並行調度時，vuln-verifier 與 fullstack-engineer 競態導致 vuln-verifier 讀到已修代碼誤判 NOT REPRODUCIBLE。未來：vuln-verifier 應在 fullstack 之前執行，或用 git show HEAD~1 對比修復前。 - fullstack-engineer 引入 P0 regression（f-string 內嵌 ternary 非法 format spec）， critic 抓到 + Prom sentinel 連鎖 — 證明 critic 審查必要不可省。 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 00:18:57 +08:00
Your Name	b710f3f38f	feat(governance): normalize AI治理告警輸出與元告警解析度 Some checks failed CD Pipeline / tests (push) Failing after 25s Details CD Pipeline / build-and-deploy (push) Has been skipped Details CD Pipeline / post-deploy-checks (push) Has been skipped Details Code Review / ai-code-review (push) Successful in 46s Details	2026-05-02 23:49:59 +08:00
Your Name	dedb12085b	chore(governance,watchdog): enrich alerts and enable prometheus multiproc Some checks failed CD Pipeline / tests (push) Failing after 1m22s Details CD Pipeline / build-and-deploy (push) Has been skipped Details CD Pipeline / post-deploy-checks (push) Has been skipped Details Code Review / ai-code-review (push) Successful in 43s Details Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 57s Details	2026-05-02 23:44:12 +08:00
Your Name	b3a0f0d766	fix(telegram): dedup by fingerprint + 24h TTL to stop repeat alerts All checks were successful CD Pipeline / tests (push) Successful in 2m22s Details Code Review / ai-code-review (push) Successful in 57s Details CD Pipeline / build-and-deploy (push) Successful in 21m3s Details CD Pipeline / post-deploy-checks (push) Successful in 5m2s Details Telegram 重複發告警鐵證（4 個 agent 真實數據）： - INC-6FE3BD (HostBackupFailed) 24h 內被推 15 次 - INC-FD6E21 (HostHighCpuLoad) 24h 內被推 6 次 - 06:44:18 同秒兩送 = pod 並發 race 根因： 1. `telegram_sent:{incident_id}` dedup key 綁 uuid4 隨機 INC ID，同 fingerprint 換新 INC 完全不去重 2. dedup TTL=600s 比 incident_analysis_sweeper 重觸週期 1h、 alertmanager repeat_interval 4h 都短 → 每輪都過期通過 3. pod restart 走 _resend_unconfirmed_ready_tokens 用同一 incident_id key → 重啟必炸一波修法（不消音、是「AI 認得這是同一事故」）： - decision_manager.py:207-225 dedup key 改 alertname+target fingerprint - decision_manager.py:573-578 TTL 600s → 86400s (蓋住 sweeper 1h × alertmanager 4h) - decision_manager.py:3189-3208 pod restart resend 路徑同步改 fingerprint - incident_analysis_sweeper.py:37-42 sweeper_done TTL 3600s → 86400s 預期：同症狀 24h 內最多發 1 張 decision card；resolved 後 line 220-226 status check 會 early return，不影響復發偵測。 Tests: 35 passed (test_telegram_adr050 + test_decision_manager_docker_prune_routing) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 16:25:48 +08:00
Your Name	f154ac022e	feat(playbook): version generated playbooks All checks were successful CD Pipeline / tests (push) Successful in 1m34s Details Code Review / ai-code-review (push) Successful in 28s Details Type Sync Check / check-type-sync (push) Successful in 1m10s Details CD Pipeline / build-and-deploy (push) Successful in 10m19s Details CD Pipeline / post-deploy-checks (push) Successful in 3m1s Details	2026-04-30 23:59:39 +08:00
Your Name	6e04fe9c8a	feat(playbook): generate drafts with local llm Some checks failed CD Pipeline / tests (push) Successful in 1m28s Details Code Review / ai-code-review (push) Successful in 29s Details Type Sync Check / check-type-sync (push) Failing after 2m41s Details CD Pipeline / build-and-deploy (push) Successful in 8m40s Details CD Pipeline / post-deploy-checks (push) Successful in 3m10s Details	2026-04-30 23:04:58 +08:00
Your Name	61f5a6a419	fix(telegram): route alerts to SRE war room Some checks failed CD Pipeline / tests (push) Has been cancelled Details CD Pipeline / build-and-deploy (push) Has been cancelled Details CD Pipeline / post-deploy-checks (push) Has been cancelled Details Code Review / ai-code-review (push) Has been cancelled Details	2026-04-30 15:01:23 +08:00
Your Name	3668d49f2f	feat(flywheel): W2 三件 + KMWriter critic 修法（1635 tests 全綠） Some checks failed CD Pipeline / build-and-deploy (push) Failing after 1m38s Details W2 (onboarder 4 週飛輪 80→90 路徑第二週) + critic PR review 5 個 critical/major 全部修完，default flag=false 安全無爆炸風險。 ## W2 三件 PR ### PR-R2 — AOL → catalog confidence EWMA 回灌（修飛輪斷鏈 C2） - 新檔 `apps/api/src/jobs/aol_to_catalog_writeback_job.py` - 邏輯：每小時掃 AOL 計算 EWMA confidence (alpha=0.3) 回灌 alert_rule_catalog - 失敗閾值 N=5 連續低成功率 → review_status='draft' - Hermes _fetch_noisy_rules SQL 加 OR review_status='draft' - ENABLE_AOL_WRITEBACK_JOB=false (default) - 8 個測試（mock path 修正：lazy import → patch src.db.base.get_db_context） ### PR-V1 — self_healing_validator 串接 (修飛輪斷鏈 C6) - 新檔 `apps/api/src/services/self_healing_validator.py`（純函數 assess_self_healing） - post_execution_verifier.py step 5 串接（feature flag gate） - evidence_snapshot.py 加 self_healing_score / self_healing_detail 欄位 - db/models.py + base.py ALTER IF NOT EXISTS - score < 0.5 → 觸發 rollback 提案 Telegram alert（不自動執行） - ENABLE_SELF_HEALING_VALIDATOR=false (default) - 7 個測試 ### PR-L1 — KM ↔ Playbook 雙向回路 (修飛輪斷鏈 C3+C4) - learning_service.py 三條新邏輯： 1. _write_playbook_evolution_km：promote/demote 寫 KM 演化條目 2. _check_and_mark_playbook_review：N=5 累積觸發 review_required 3. _demote_alert_rule_catalog_confidence：DEPRECATED → confidence×=0.5 - PlaybookRecord 加 review_required 欄位（schema migration via base.py） - ENABLE_KM_PLAYBOOK_FEEDBACK_LOOP=false (default) - KM_PLAYBOOK_REVIEW_THRESHOLD=5 可調 - 6 個測試 ## KMWriter Critic 5 個 Critical/Major 修復（之前 critic PR review 發現）之前 push commit `c5753e1c` 已修，本 commit 補回 stash 中的對應檔案： - C1 km_writer.py:194 backfill 自打臉（已修：同步 await + DLQ） - C2 km_writer.py:391 KM_WRITE_AWAIT=false 路徑收緊 - M1 decision_manager.py:2178/2203 移除 _fire_and_forget - M2 incident_service.py:1099 自製 path 加 retry+DLQ - M3 km_writer.py:166 冪等聲明對齊（UPSERT + partial unique index） ## 驗證 - 1635 unit tests 全綠（+27 from 1608） - 與 `fb0c72db` (推翻 A2 Ollama primary) 共存無衝突 - 所有新 Job/Service default flag=false（不爆炸） ## 期望影響飛輪斷鏈 C2 + C3 + C4 + C6 全修飛輪自主化評分：65 → 85 預估（W2 完成後）啟用順序（待 prod `fb0c72db` 驗證 OLLAMA primary 跑得起來後）： 1. ENABLE_AOL_WRITEBACK_JOB=true 2. ENABLE_KM_PLAYBOOK_FEEDBACK_LOOP=true 3. ENABLE_SELF_HEALING_VALIDATOR=true Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 19:44:04 +08:00
Your Name	c5753e1c57	fix(critic-review): KMWriter 名實統一 + Alertmanager 修抑制 + drift checker AST 化 critic PR review 揭示已 push commits 的 7 個 blocker，本 commit 全部修復。 ## C1 + C2 + M1 + M2 + M3 — KMWriter 真正統一契約（critic 最嚴重 5 條） ### C1 km_writer.py:194 — backfill 自打臉修 - 裸 asyncio.create_task(_backfill_path_a_approval) → await _backfill_path_a_approval_safe() - 同步 await + 獨立 DLQ km:backfill:dlq + try/except 不阻塞主寫入 - 新增 km_backfill_reconciler_job.py（每 5 分鐘掃 DLQ）+ ENABLE_KM_BACKFILL_RECONCILER flag - 防 Path B 比 Path A 先完成 → related_approval_id 永遠 NULL 的 race ### C2 km_writer.py:391 — KM_WRITE_AWAIT=false 路徑收緊 - 從 ensure_future（fire-and-forget 比舊版同步寫更糟） - 改 await writer.write(retry=1, timeout=2.0)（仍 await 但只試一次、超時短） - docstring 明確標註「緊急回滾用，不保證可靠性」 ### M1 decision_manager.py:2178/2203 — 移除 _fire_and_forget 旁路 - 兩處 _fire_and_forget(executor.write_execution_result_to_km(...)) - 改 await asyncio.shield(...) + BaseException 保護（防上層 cancel 中斷） - KM_WRITE_AWAIT=true 在這條路徑終於真正 await ### M2 incident_service.py:1099 — 自製 path 加 retry+DLQ - 原本 if settings.KM_WRITE_AWAIT: await asyncio.wait_for else create_task - 改 3 次指數退避 retry + DLQ 保護（呼叫 km_writer 私有 helper） ### M3 km_writer.py:166 — 冪等聲明對齊實作 - knowledge_repository.create() 加 UPSERT 路徑（pg_insert ON CONFLICT DO UPDATE） - KnowledgeEntryCreate / KnowledgeEntryRecord 加 path_type 欄位 - migration: ADD COLUMN path_type + partial unique index uix_knowledge_incident_path ## M4 alertmanager.yml — equal: [] 收緊（critic 防爆炸抑制） - OllamaInstanceDown / KMConverterDown 抑制加 equal: ['cluster'] 約束 - 防多 cluster 場景下任一 Ollama down 誤抑全 AI/SLO 告警 ## M5 Alertmanager 版本驗證（已確認 v0.31.1，遠超 v0.22+） ## M6 governance_agent.py — health score 區分 skipped vs ok vs violated - check_slo_compliance 加 _meta {violated_count, skipped_count, ok_count, all_skipped, status} - run_self_check: SLO 全 skipped 時獨立發 governance_slo_data_gap 告警（不污染 self_failure 計數，因為 no_data 是 emitter 未實作不是治理機制故障） ## M7 scripts/check_config_drift.py — 改 AST 解析 - regex 改 ast.parse 找 Settings ClassDef AnnAssign Field(default=...) - 避免多行 list / default_factory= / 含跳行字串的 false negative - 4 欄位（AI_FALLBACK_ORDER / ARGOCD_URL / PROMETHEUS_URL / OLLAMA_URL）全對齊 ## 新增測試 - test_km_writer_backfill_reconciler.py: 7 cases（C1 reconciler + safe helper） - test_km_writer_idempotent.py: 5 cases（M3 path_type 注入 + UPSERT 分支） ## 驗證 - 1585 unit tests 全綠（+13 從 1572） - amtool check-config SUCCESS（8 inhibit_rules / 2 receivers） - drift checker AST-based 4 欄位全對齊 - Alertmanager v0.31.1 確認支援新語法 ## 期望影響 - KMWriter 名實統一：飛輪閉環 KM 寫入路徑 100% 可靠 - M4 抑制爆炸風險解除 - 治理層不再對 SLO no_data 靜默 - drift checker false negative 風險解除 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 10:44:39 +08:00
Your Name	123d9c8a2e	fix(p3.1-t1): 三 Tier-1 服務整合 — model_rollback_service + resource_resolver Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details P3.1-T1 接線兩個既有服務到主流程： offline_replay_service.py — model_rollback_service 整合: - 回放事件寫入治理 DB 後，觸發 ModelRollbackService.check() 衰退偵測 - feature flag 由 model_rollback_service 自行判斷（AIOPS_P6_GOVERNANCE_ENABLED） - retrain_recommended → log warning 含 streak / absolute_floor / conservative_mode - exception fail-soft（不阻斷 replay 主流程） approval_execution.py — resource_resolver 整合: - kubectl 指令解析後，動態驗證資源是否存在於 K8s - 若 resolved_name != raw_name → log + apply normalized name - 若不存在但有 candidates → log warning + suggestions（不攔截執行，只記錄） - exception fail-soft（不阻斷主流程） - RESOURCE_RESOLVE_TOTAL Prometheus counter labels: hit/suggestion/miss/error Tests: 後端 1303 collected（無回歸），對應 dedicated 測試在前次 commit 已寫 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: Claude Sonnet 4.6 (P3.1-T1) <noreply@anthropic.com>	2026-04-27 08:17:04 +08:00
Your Name	cc547736ab	feat(wave6-8): P2.1 fusion + P2.2 governance + P2.4 consensus + Wave 7/8 BLOCKER 修復承接 Wave 6/7/8 多 engineer 在 agent 限額前完成的代碼，補 commit 解 production HEAD 隱性 import error（decision_fusion 已被 decision_manager 引用但檔案 untracked）。新增（後端核心）: - decision_fusion.py (562 行) — P2.1 方法 III（OpenClaw + Hermes + Elephant 三 LLM 融合） - aiops_timeline.py + aiops_timeline_service.py — critic B4 修復 /api/v1/aiops/timeline endpoint，DB 存取抽到 service 層遵守 leWOOOgo 積木化 - migrations/p2_decision_fusion_columns.sql + rollback — approval_records fusion 欄位修改（後端整合）: - decision_manager.py — fusion 三斷鏈修補（critic B1+B2+B3）： · B1: 寫 _evidence_snapshot_ref 到 token.proposal_data · B2: fusion 前計算 complexity_score 並寫 token · B3: fusion composite 寫 token.proposal_data["decision_fusion"] - auto_approve.py — fusion + consensus 認識（critic B3+B5）： · composite > 0.7 → auto_execute_eligible bypass min_confidence · source=consensus_engine + score>=0.6 → 規則可信路徑 - consensus_engine.py — db-fix _save_consensus 重用 agent_sessions - governance_agent.py — db-fix _alert PG 寫入 ai_governance_events - approval_db.py — fusion 3 欄位 + 2 partial index + CheckConstraint - db/models.py — schema 對齊 migration - core/config.py — vuln #1 修復：OLLAMA_URL/_FALLBACK_URL field_validator 拒絕公網 IP + 外部域名，僅允許私網/loopback/K8s SVC 白名單 - core/feature_flags.py — P2 fusion + consensus flags - main.py — governance_agent lifespan 啟動 - failover_alerter.py — Wave8-X2: in-memory dedup fallback（Redis 拒絕後不 fail-open） - ollama_*.py — metrics 整合 + recovery 改善 - auto_repair_service.py — verifier 接線新增（測試 2438 行）: - test_decision_fusion.py / test_governance_agent.py / test_consensus_integration.py - test_p2_db_fixes.py / test_wave8_fusion_fixes.py - test_config_url_validation.py（vuln #1 12 tests） - test_failover_alerter.py +Wave8-X2 in-memory dedup 補測驗收: 116 tests pass (decision_fusion + wave8_fusion + config_url + consensus + governance + p2_db_fixes + failover_alerter) Conflict resolution: - 3 檔（config.py + auto_approve.py + decision_manager.py）git stash pop 衝突保留 stashed (engineer 最終版)，補回 ValueError 「公網 IP」字樣對齊 test Note: 此 commit 解 production HEAD 隱性 import error 仍未修: vuln #4 prompt injection / debugger B14 quota fail-closed / B25-B26 drain_pending_tasks / B8 governance fail alert Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-Authored-By: Multiple Engineers (Wave 6/7/8) <noreply@anthropic.com>	2026-04-27 08:11:40 +08:00
Your Name	c995fe4008	fix(watchdog-w5): suggested_action 欄位不存在 → 改用 action All checks were successful CD Pipeline / build-and-deploy (push) Successful in 13m30s Details ApprovalRecord ORM 只有 action 欄位，suggested_action 僅存於 Pydantic ApprovalRequest 層。新 Pod 啟動後 W-5 拋 AttributeError： "type object 'ApprovalRecord' has no attribute 'suggested_action'"。 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 20:40:42 +08:00
Your Name	97ce5ea658	feat(p2.6): trust_drift_detector 接入 ai_slo_watchdog_job W-6 All checks were successful CD Pipeline / build-and-deploy (push) Successful in 9m10s Details P2.6 接入 2026-04-24 ogt + Claude Sonnet 4.6 問題: trust_drift_detector.py 是孤立服務（零引用），Playbook 信任度偏態（盲目樂觀/學習鎖死）從未被任何監控機制感知修復: ai_slo_watchdog_job._check_once() 新增 W-6 Trust Drift 檢查 - 呼叫 get_trust_drift_detector().run()（偵測 + 寫 ai_governance_events） - 偵測到偏態時加入 violations 清單 → 觸發 TYPE-8M Meta-System 告警 - checks 計數從 5 → 6 覆蓋案例: - optimism_bias: >70% Playbook trust_score >0.9 → PostExecutionVerifier 可能失效 - confidence_collapse: >70% Playbook trust_score <0.3 → EWMA 計算/執行誤判 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 15:57:30 +08:00
Your Name	45dbe07188	fix(flywheel): 自動化飛輪六大能力修復（ADR-092 B3） Some checks failed run-migration / migrate (push) Failing after 22s Details Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 53s Details Type Sync Check / check-type-sync (push) Successful in 2m54s Details CD Pipeline / build-and-deploy (push) Has been cancelled Details Ansible Lint / lint (push) Has been cancelled Details 【根因鏈修復】 MCP Provider bugs → PreDecisionInvestigator 失敗 → Agent Debate 無上下文 → LLM 逾時 → description="待分析" → ADR-091 鐵閘攔截 → tg_sent 未設 → W-2 Watchdog 誤報「靜默故障」【六大修復】 1. MCP Provider 三蟲修復 - ssh_provider: asyncssh.run() → conn.run() - prometheus_provider: KeyError 'query' → .get() 容錯 - k8s_provider: 空 pod_name → 早返回錯誤字典 2. Agent Debate / 決策品質 - decision_manager: 逾時降級文字改為明確描述（繞過 ADR-091 鐵閘） - intent_classifier: LLM 逾時降級至關鍵字分類（非 None） 3. Watchdog 誤報修復（ADR-092 B3） - W-2: tg_sent Redis TTL → telegram_message_id IS NULL（DB 真值） - W-5 新增: suggested_action IN 空/待分析/NO_ACTION + tg_id IS NULL - approval_timeout_resolver: 60min → 15min，batch 50 → 200 4. Config Drift 自動化 - drift_adopt_service: auto_adopt_if_safe() 六條件安全閘 - drift.py: 背景任務先嘗試自動採納再發人工 Telegram 卡片 5. Playbook 飛輪穩定 - playbook_seed_service: 修復幂等性（deprecated 不視為缺失） - playbook_evolver: 只載 DRAFT+APPROVED（非全部 294 筆） 6. 可觀測性 - alert_rule_engine: auto_rule 結構化日誌 + Redis 計數器（pipeline） - auto_approve: reject 原因 Redis 計數器 - heartbeat_report_service: 新增「⚙️ 自動化統計（今日）」區塊【待人工執行】 psql $DATABASE_URL -f apps/api/migrations/cleanup_duplicate_deprecated_playbooks.sql Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 10:55:50 +08:00
Your Name	de2d34d4cd	fix(playbook): C1-C4 全流程串接 — evolver保護+seeder復活+規則即時建立+watchdog W-4 Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details C1: playbook_evolver — yaml_rule source playbooks 加 YAML_RULE guard， evolver 不再封存 seeder 建立的 APPROVED playbook，保護自動修復鏈路 C2: playbook_seed_service — idempotency SQL 排除 DEPRECATED 記錄， evolver 封存後重啟可復活 yaml_rule playbooks C3: alert_rule_engine — AI 自動生成規則成功後立即呼叫 seed_playbooks_from_rules()，不等下次重啟即可建立對應 APPROVED Playbook C4: ai_slo_watchdog_job — 新增 W-4 APPROVED playbook 數量為 0 告警，鏈路斷裂立即 TYPE-8M；total checks 由 3 升為 4 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:18:11 +08:00
Your Name	156a52f807	fix(aiops): ADR-092 三修 — Playbook enum崩潰 + Telegram永久靜默 + 採納失敗 + AI自健診 All checks were successful CD Pipeline / build-and-deploy (push) Successful in 13m33s Details B1 playbook_service.py: evolver setattr傳str而非PlaybookStatus enum → _pg_upsert playbook.status.value炸（163次/48h），修：update_with_validation強制enum轉型 B2 approval_db.py + webhooks.py: find_by_fingerprint PENDING誤收斂 → PENDING≠Telegram已發；修：成功push後mark tg_sent:{fingerprint} Redis(24h TTL) → find_by_fingerprint debounce窗外PENDING必須Redis確認才收斂 drift_adopt_service.py: telegram_gateway呼叫adopt_drift(report_id)但方法不存在 → 新增adopt_drift()包裝：從DB載入DriftReport後委派adopt()，修復採納失敗 B3 ai_slo_watchdog_job.py + main.py: AI無法感知自身故障（MASTER §1.1盲區） → 新增每15分鐘自健診：W-1 SLO違反 W-2 TG靜默偵測 W-3 飛輪成功率 → 任一異常→TYPE-8M send_meta_alert；Redis去重1h Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:00:06 +08:00
Your Name	f572561467	feat(ai_advisory): P0 修 leader lock + inline keyboard + callback handler All checks were successful CD Pipeline / build-and-deploy (push) Successful in 8m31s Details 統帥 2026-04-19 截圖反饋: 1. 同一告警 22:44 連推 2 則 (多 Pod 都跑 daily loop) 2. 純文字無按鈕 (無 feedback 閉環 / AI 只建議不執行) 新增 services/ai_advisory_helpers.py (~240 行): - try_acquire_daily_lock(job_name): Redis SETNX key 'aiops:daily_lock:{job}:{date}', TTL 25h,fail-open (Redis 掛照推,不阻塞). - try_acquire_hourly_lock(job_name): 同上 hourly 版 (coverage_evaluator 用). - is_snoozed / set_snooze: Redis key 'aiops:snooze:{type}:{target}' TTL 24h. - build_ai_advisory_keyboard: 統一 4 按鈕 ✅ 已處理 / 😴 忽略 24h / 🔍 查看詳情 / 📋 產 kubectl 指令 callback_data 格式: 'ai_advisory_{action}:{type}:{id}' - handle_ai_advisory_callback: 處理 handled/snooze 兩個 action 寫 aol.output.human_feedback, view/produce_cmd 留 P1. 4 個 LLM scanner 改用 helper: - capacity_forecaster: daily_lock + snooze check per host + 按鈕 - compliance_scanner: daily_lock (cron only) + snooze per date + 按鈕 - coverage_evaluator: hourly_lock + snooze per worst_dimension + 按鈕 - hermes_rule_quality: daily_lock + snooze per primary rule + 按鈕 telegram_gateway.py: handle_callback 加 'ai_advisory_*' 路由 (step 1.85 drift 後) 新增 _handle_ai_advisory_action 方法: 解析 payload 'type:id' → 呼叫 handle_ai_advisory_callback → answer_callback (Telegram toast 回饋) → 返回 dict (info_action=True for view/produce_cmd) 統帥鐵律對齊: ✅ 多 Pod 場景只 leader 推 (Redis SETNX 保證冪等) ✅ 失敗 fail-open 不阻塞主業務 (Redis 掛仍能運作) ✅ aol.output 加 human_feedback 供 AI 學習 ✅ snooze 避免重複告警 (24h TTL) ✅ 原 drift 按鈕 pattern 複用 (non-breaking) 明早 AI 將收到: - 單一訊息 (非重複) - 含 4 按鈕 (手動 feedback 閉環) - snooze 後同主題 24h 不再推 view/produce_cmd P1 留下 session (AI 主動 MCP 蒐證 + LLM 產 kubectl command). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 23:02:57 +08:00
Your Name	fa643ebdc7	refactor(p1): LLM JSON parse helper 抽出 + coverage 閾值雙條件 (架構師 Review P1) All checks were successful CD Pipeline / build-and-deploy (push) Successful in 8m52s Details 首席架構師 2026-04-19 Review (92/100 Grade A) 指出 P1 優化: 1. LLM JSON 3-path parse 邏輯在 4 scanner 重複 (~80 行 × 4 = 320 行) 2. coverage red>=20 觸發閾值偏低,生產 bootstrap 必觸發浪費 token P1.1+1.2 新增 services/llm_json_parser.py (~90 行): parse_llm_json_response(text, required_key, logger_context) 3-path fallback: Path 1: 剝 markdown fence + 直接 JSON 含 required_key Path 2: NemoTron wrapper (description/action_title/reasoning 內嵌 JSON) Path 3: 所有失敗 return None + logger.warning 失敗永不 raise,呼叫者決定 fallback. 4 個 LLM scanner 改用 helper: - hermes_rule_quality_job: required_key='recommended_actions' - capacity_forecaster_job: required_key='priority_actions' - compliance_scanner_job: required_key='posture_grade' - coverage_evaluator_job: required_key='worst_dimension' 每個減少約 20 行重複. P1.3 coverage 觸發條件改雙條件: 原: total_red >= 20 (bootstrap 必觸發) 新: red_ratio > 30% AND total_scanned >= 50 _fetch_red_summary 加 total_scanned 回傳供計算. 5/5 單元測試 parse_llm_json_response: ✅ direct / markdown fence / NemoTron wrapper / invalid / missing key P1.4 capacity_scanner + rule_catalog_sync: 檢查後已有完整作者註解 (Review 誤判). 其他 P1 (Prom HTTP helper / first_delay 錯開 / LLM budget guard) 留下 session. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 22:39:40 +08:00
Your Name	2f5cab2e45	feat(coverage_evaluator): Gap 3.3 LLM 升級 — 缺口分析 + 補覆蓋建議 All checks were successful CD Pipeline / build-and-deploy (push) Successful in 10m14s Details Gap 3 進度: 4/9 service 升級 LLM (達到合理上限 — 其他 4 個純資料移動不需 LLM) coverage_evaluator 原本 7 維升級 unknown→green/yellow/red 後無主動建議. 新增: 1. _fetch_red_summary: 撈最新 run 的 red 分布 + top 10 被標 red 的 asset 2. _llm_analyze_coverage_gaps (~50 行): 有 >= 20 red 時才跑 LLM (避免 well-covered 集群浪費 token) LLM JSON 輸出: - worst_dimension: 最該優先補的維度 - root_cause: red 集中的真因 (繁中) - top_remediation_actions[3]: priority/target/action/effort - estimated_weeks_to_close: 1-52 - confidence: 0-1 3. _send_telegram_gaps: 推 coverage 缺口 Telegram 摘要總 red + 最嚴重維度 + 補齊週數 + top 3 補覆蓋動作 scan 完流程: 評估 7 維 → 撈 red summary → LLM 分析 (if total_red >= 20) → Telegram 統帥鐵律對齊: ✅ 不寫死補覆蓋優先 (LLM 根據實際 red 分布推) ✅ AI 建議 + 人工決策 (Telegram 末行: '人工評估補覆蓋排程') ✅ 包含預估完成時間 + 信心 (可追蹤) session 累計 35 commits, 9 新 scanner, 4 用 LLM: - Hermes (rule quality) - capacity_forecaster (容量預測) - compliance_scanner (合規態勢) - coverage_evaluator (覆蓋缺口) 剩 5 個純資料移動不適合 LLM (asset_scanner/rule_catalog_sync/ rule_stats_updater/asset_change_tracker/capacity_scanner) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 22:02:36 +08:00
Your Name	f6cb938dc3	feat(compliance_scanner): Gap 3.2 LLM 升級 — 合規態勢分析 + Telegram 摘要 Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details 朝 AI 自主化方向 — 9 新 scanner 從 2/9 LLM 提升到 3/9. compliance_scanner 原本每次 scan 273 snapshots 寫 DB,無任何人可見摘要. 新增: 1. _write_compliance_for_asset_v2 (wrapper): 原 _write_compliance_for_asset 保持不變,v2 版加回傳 asset_warning dict 供上層 LLM 分析用,只有 violations/warnings > 0 才傳回 2. _llm_analyze_compliance_posture (~50 行): 有 warning 時用 OpenClaw 分析整體 posture 輸出 JSON: - posture_grade: A/B/C/D/F - posture_summary: 3 句繁中整體態勢敘述 - top_priorities[3]: priority + action + rationale - risk_level: low/medium/high/critical - confidence: 0-1 3-path JSON parse fallback (直接 / NemoTron wrapper / description 巢狀) 3. _send_telegram_posture (~40 行): 推每日合規摘要到 SRE group 含評級 emoji (🟢A / 🟡B / 🟠C / 🔴D / ⛔F) 顯示 asset_type 分布 (Top 5 種問題類型統計) 含 AI top 3 priority 動作 + rationale scan_once 流程: 掃 assets × 7 維 → 收集 warning_assets → LLM 分析 → Telegram 推送統帥鐵律對齊: ✅ AI 分析 + 人工決策 (Telegram 末行: '人工評估各項修復優先') ✅ 不寫死優先順序 (LLM 根據 warnings 實際分布推) ✅ asset_type 分布統計幫統帥快速定位 Gap 3 進度: 3/8 service 升級 LLM (Hermes + capacity_forecaster + compliance_scanner) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 21:59:38 +08:00
Your Name	d6b854a25e	feat(capacity_forecaster): Gap 3 LLM 升級 — 從 threshold 到 AI 決策 Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details Audit 發現 8/9 個新 scanner 是純 threshold,只 Hermes 1 個用 LLM. 統帥指示「朝 AI 自主化方向」→ Gap 3 開始把 threshold 升級 LLM. 第 1 個升級: capacity_forecaster (最高戰略) 原邏輯 _derive_actions 是硬編 keyword → action mapping: disk → "清理 /var/log, /var/lib/docker, PG WAL" mem → "檢查 top mem consumer, 考慮加記憶體" cpu → "分析 top CPU process, 考慮擴充 vCPU" 新增 _llm_analyze_risk (~60 行): 用 OpenClaw 對每個高風險 host 跑 LLM 分析 Prompt 含: - host + findings (Prometheus predict_linear 結果) - 主機架構說明 (110 Harbor / 120-121 K3s / 188 PG 等) LLM JSON 輸出: - root_causes (3 個候選真因,繁中) - priority_actions (high/medium/low + 具體指令 hint) - urgency_days (0-30) - confidence (0-1) 3-path JSON parse fallback (直接 / NemoTron wrapper / description 巢狀) _write_recommendation_aol: 加 llm_analysis 到 output_payload _send_telegram_forecast: 含 AI 判定 (緊急天數 + 信心 + top 2 action) LLM 失敗時 fallback _derive_actions 硬編建議對齊統帥鐵律: ✅ AI 分析 + 人工決策 (仍 requires_human_decision=True) ✅ 不寫死修復動作 (LLM 根據 host 實際狀況產) ✅ root_causes 考慮 host 主機架構 context Gap 3 進度: 1/8 service 升級 LLM (capacity_forecaster) 剩下 compliance_scanner / coverage_evaluator 等 7 個留後續 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 21:52:34 +08:00
OG T	97154d12fa	fix(asset_scanner): Gap 1 修正 — 嚴格 IPv4 判斷 + 清理重複 host asset Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details Audit 1 發現 bug: 原 code: if host_ip.replace('.', '').isdigit() → IP 判斷導致 labels.host='125' (短名) 被誤當 IP → 建 host/125 (錯) 同時 blackbox-icmp instance='192.168.0.112' 無 port → split 失敗 → 漏建新增 _is_valid_ipv4(s): 嚴格 4 段 + 每段 0-255 整數避免短名 '125' / hostname 'cadvisor-110' / 超界 '256' 誤判 _collect_prometheus_targets 流程改: 1. 先從 instance 抽 (IP:port 形式或純 IP) instance_host = instance.split(':')[0] if ':' in instance else instance 2. 用 _is_valid_ipv4 嚴格驗證 3. labels.host 不再當 fallback (短名不可靠) DB 清理 (266 筆): - 10 asset_relationship 指向短名 host - 140 asset_coverage_snapshot 7 維 × 4 短名 host - 112 asset_compliance_snapshot 7 維 × 4 短名 × 幾 run - 4 asset_inventory 短名 host (host/110 / 112 / 125 / 188) 預期下次 scan (1h): - host/192.168.0.112 + host/192.168.0.121 補進 (原漏,blackbox-icmp 無 port) - 不再有短名 host asset 6/6 單元測試通過: _is_valid_ipv4('192.168.0.125')=True _is_valid_ipv4('125')=False ← 關鍵修復 _is_valid_ipv4('cadvisor-110')=False _is_valid_ipv4('192.168.0.256')=False (超界) _is_valid_ipv4('')=False _is_valid_ipv4('192.168.1')=False (3 段) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 21:46:22 +08:00
OG T	7db8845cbb	fix(asset_scanner+coverage): host_service→monitoring_target (CHECK violation 修) + log 補 4 維 All checks were successful CD Pipeline / build-and-deploy (push) Successful in 12m59s Details 2 個 bug 修復 + 實證驗證: 1. asset_scanner: host_service 不在 asset_inventory CHECK 允許列表 `ceb61c3` 部署後 Pod log: CheckViolationError 'asset_inventory_type_valid' 詳: '192.168.0.125:32334' 寫入時 asset_type='host_service' 被拒 allowed list: host/container/k8s_workload/k8s_resource/database/... monitoring_target/third_party_service/... (27 種) 修: host_service → monitoring_target (ADR-090 schema 原為 scrape target 預留) 2. coverage_evaluator logger: 只 log 原 3 維 (monitoring/alerting/km) 導致誤以為 `c1f23cf` 4 維新 code 沒生效 (實際 DB 已有 auto_playbook/ remediation/rule_matching/rule_creation 資料) 修: logger.info 補 playbook/remediation/rule_matching/rule_creation 4 個 kwarg 實證 coverage 7 維 DB 分佈 (已生效): auto_alerting: 22 green / 78 red / 52 unknown auto_km_creation: 5 green / 17 yellow / 130 unknown auto_monitoring: 1 green / 1 red / 150 unknown auto_playbook: 3 green / 19 yellow / 130 unknown ← 新維度 auto_remediation: 0 / 0 / 98 red / 54 unknown ← 新維度 auto_rule_creation: 0 / 0 / 100 red / 52 unknown ← 新維度 auto_rule_matching: 4 green / 96 yellow / 52 unknown ← 新維度治理洞察: 98 red remediation = 大部分 asset 過去 30d 沒修復行動 (修復能力缺口) 100 red rule_creation = 無 AI rule (全 yaml_hardcoded) 96 yellow rule_matching = 過去 30d 沒告警觸發 (可能沒問題/沒覆蓋) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 20:27:48 +08:00
OG T	ceb61c3c8e	feat(asset_scanner): Gap 1 修 — Prometheus targets 補齊 host-install services All checks were successful CD Pipeline / build-and-deploy (push) Successful in 13m32s Details Audit 發現 asset_inventory 只涵蓋 K8s (mon=120, mon1=121 共 2 node+78 pods), 完全漏 110 (Harbor/Gitea/監控) + 112 (security) + 188 (PG/Redis/Ollama) + 125 (mon backup/standby) 這 4 主機的 host-install services. 用戶 4 主機架構 (110/112/120/121/188) 只覆蓋 2/5 = 40%. 新增 _collect_prometheus_targets: GET /api/v1/targets?state=active → 自動發現全部被監控的: - host_service (IP 形式 target → postgres-110/redis-110/minio-188/node-exporter 等) - third_party_service (非 IP 如 alertmanager/argocd-server) - host (每個 unique IP 建 asset_type='host') - target → host 的 depends_on relationship 預期新增 asset_inventory: - host: 6 個 (110/112/120/121/125/188,Prometheus 看到的 blackbox-icmp 全覆蓋) - host_service: ~15 個 (postgres/redis/minio/node-exporter/cadvisor 等) - third_party_service: ~5 個 (alertmanager/argocd/prometheus/velero 等) 解鎖: - 110/112/188 host-install services 進入 asset_inventory - coverage_evaluator 可評估這些 asset (monitoring/alerting/playbook 等 7 維) - blast_radius_calculator 可查「110 PostgreSQL 影響哪些 service」 - Hermes/forecaster 建議範圍擴大到非 K8s 服務對齊統帥鐵律: 朝 AI 自主化 — 不硬編主機清單,動態從 Prometheus 發現 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 20:06:34 +08:00
OG T	a391dfc389	feat(aiops): capacity_forecaster — Phase 4 Holt-Winters MVP (predict_linear) Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details 統帥批准 4 項下階段候選之一完成: AI 容量預測. 新增 capacity_forecaster_job.py (~220 行): 每日 05:00 Taipei 跑預測 (02:00 scanner → 03:00 compliance → 04:00 Hermes → 05:00 forecaster 形成完整日鏈). 預測方法論 (MVP): Prometheus predict_linear(metric[7d], 86400*7) — 基於過去 7d 做線性外推 3 個預測 query: 1. disk_saturation_7d: predict_linear(node_filesystem_avail_bytes[7d], 7d) < 0 2. mem_saturation_7d: predict_linear(MemAvailable[7d], 7d) / MemTotal < 10% 3. cpu_high_7d_trend: avg_over_time(cpu_used_pct[7d]) > 70% 發現高風險 host → 寫 aol(capacity_recommendation) + 推 Telegram - input: host + horizon + findings count - output: findings list + proposed_actions + requires_human_decision=true proposed_actions 依 findings 推導: - disk: 清理 log/docker/PG WAL 或擴容 - mem: top consumer / JVM 調整 - cpu: scale out / vCPU 擴充統帥鐵律對齊: ✅ 只推建議不自動 scale up ✅ 7d window 有足夠樣本 ✅ AI 預測 + 人工決策未來 TODO: - 真 Holt-Winters (含季節性) — 需 Python statsmodels - 業務週期調整 (週一高峰/週末低谷) Wire main.py lifespan asyncio.create_task() Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 20:00:36 +08:00
OG T	c1f23cfabe	feat(coverage_evaluator): 擴充 4 維 — playbook/remediation/rule_matching/rule_creation Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details Review 盲點: coverage 7 維中原只實作 3 維 (monitoring/alerting/km),其餘 4 維永遠 unknown v2 擴充: + auto_playbook: asset.name 出現在 playbooks.symptom_pattern/description (approved 狀態) → green 沒對應 playbook 但 type='k8s_workload' → yellow + auto_remediation: 過去 30d remediation_events.target_resource ILIKE asset.name → green 沒 target 但 k8s_workload/container → red (應有修復能力但沒) + auto_rule_matching: 過去 30d incidents.affected_services ILIKE asset.name 或 incidents.alertname match alert_rule.labels.host/namespace → green 沒觸發 → yellow (可能沒問題也可能沒覆蓋) + auto_rule_creation: alert_rule_catalog source='ai_generated' match asset → green 目前全 yaml_hardcoded → 全 red (表示尚未由 AI 主動建規則) 未來 Hermes 產出 AI rule 後會變 green 解鎖: coverage 7 維完整 SLO KPI (MASTER §7.1) - red count = 真正的治理缺口 - green ratio = 自動化成熟度 - AI 可主動推薦 red asset 的補覆蓋動作 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 19:54:36 +08:00
OG T	ba18ad2ef8	feat(hermes+rules): LLM 升級 Hermes + 統帥決策 deprecate PostgreSQLDiskGrowthRate All checks were successful Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 40s Details CD Pipeline / build-and-deploy (push) Successful in 8m37s Details 統帥 2026-04-19 決策: - Rule 1 PostgreSQLDiskGrowthRate → 選項 C: deprecate + 替代新規則 - Rule 2 NoAlertsReceived2Hours → 保留 (真實告警鏈路守護) - noise_rate 算法先修正 (NO_ACTION 不算 fp),觀察後動態調整 1. rule_stats_updater v2 noise 算法: 原: 任何 EXPIRED approval 都算 fp 問題: NO_ACTION/OBSERVE/INVESTIGATE 是 AI 純觀察,不該算假報修: WHERE ar.action NOT ILIKE '%NO_ACTION%' AND NOT ILIKE '%OBSERVE%' AND ... 2. hermes_rule_quality v2 LLM 升級: 新增 _llm_analyze_noisy_rule: - 用 OpenClaw (Ollama/NemoTron/Gemini) 分析每條噪音 rule - JSON 輸出: probable_root_causes/recommended_actions/confidence/should_deprecate - 3 路 parse fallback (直接 / NemoTron wrapper / description nested) _write_advisory_aol 加 llm_analysis 到 output_payload _send_telegram_summary 加 AI 判定 + top 2 建議 (8 條上限避免太長) 符合統帥鐵律: AI 分析但不自動動作,仍人工決策 3. ops/monitoring/alerts-unified.yml 替換 Rule 1: 刪 PostgreSQLDiskGrowthRate (500MB/h 增長 → 觸發 WAL 正常行為誤報) 加 HostDiskUsageHigh (>80% for 10m, warning) 加 HostDiskUsageCritical (>90% for 5m, critical) 兩者 labels.supersedes='PostgreSQLDiskGrowthRate' 供追溯 (待 deploy-alerts workflow 下次 apply 到 Prometheus) 4. DB 即時 mark deprecated (避免等 alerts yaml 部署前 Hermes 又推): UPDATE alert_rule_catalog SET review_status='deprecated' WHERE rule_name='PostgreSQLDiskGrowthRate' Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 19:39:05 +08:00
OG T	6ab0ce9c75	feat(aiops): Hermes rule quality advisor — E3 AI 規則品質建議 (保守版) All checks were successful CD Pipeline / build-and-deploy (push) Successful in 8m22s Details 實證 rule_stats 跑完後發現 2 條 100% noise_rate 規則: - PostgreSQLDiskGrowthRate (tp=0 fp=2) - NoAlertsReceived2Hours (tp=0 fp=1) 加上 MoWoooWorkDown (33%), KubePodCrashLooping (25%) 新增 hermes_rule_quality_job.py (~210 行): 每日 04:00 Taipei 分析 alert_rule_catalog: - threshold: noise_rate >= 0.7 AND 樣本 >= 5 - 為每條寫 aol('rule_rejected', proposed_action='review_or_deprecate') - 推 Telegram 摘要給 SRE group 統帥鐵律對齊: ✅ 不自動改 review_status (人工決策 deprecate,AI 只推建議) ✅ threshold 作為「觸發討論」而非「最終決策」 ✅ aol(rule_rejected) 留 trail,未來可升級 LLM 辯證解鎖 E3 Hermes 基礎: 後續可加 LLM 分析假報真因 (expr 缺 for: window、 label match 太寬泛、metric 本身 noisy 等),產出具體改進建議. Wire main.py lifespan asyncio.create_task() Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 18:11:26 +08:00
OG T	e677773e39	fix(asset_scanner): Pod→Deployment via ReplicaSet 橋樑 (relationship 漏掉修復) All checks were successful CD Pipeline / build-and-deploy (push) Successful in 9m31s Details Review 盲點: 實測 asset_relationship 52 筆,但都是 Pod→StatefulSet + Pod→ConfigMap, 完全沒有 Pod→Deployment! 真因: K8s 中 Pod.ownerReferences[0].kind = 'ReplicaSet' (99% 案例) Deployment 管 ReplicaSet 管 Pod (兩層 owner chain) 原 code 只 match kind in (deployment/statefulset/daemonset) → 跳過 ReplicaSet → Pod→Deployment 關係全部漏掉修復 v3.1: 0. 新增 collect replicasets pass (僅作為 bridge,不寫 asset_inventory) 建 rs_to_deployment map: {ns/rs_name: deployment_name} 2. Pod ownerRef.kind='ReplicaSet' → 反查 rs_to_deployment → 建 Pod→Deployment 預期效果: - asset_relationship 從 52 → 150+ (所有 Deployment-managed Pod 都有 relationship) - OpenClaw blast_radius 計算 Deployment 影響的 Pod 數 = 正確不寫 ReplicaSet 為 asset (他是 ephemeral 中介,滾動更新會大量產生,污染 inventory) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:26:57 +08:00
OG T	c8b263db06	fix(coverage_evaluator): KM 欄位修正 ke.body → ke.content + 擴大 title 匹配 Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details 實測 `df71c9a` 部署後 coverage_evaluator 生效: - monitoring: 2 hosts match Prometheus targets - alerting: 74 筆 (22 green + 52 red) - km: 0 (錯誤: column "ke.body" does not exist) 真因: knowledge_entries 表欄位是 'content' 不是 'body' 修復: ke.content ILIKE '%name%' OR ke.title ILIKE '%name%' 同時清 unused import (typing.Any) 下輪 coverage_evaluator tick 將正確 UPDATE auto_km_creation 維度解鎖完整 3 維 coverage (monitoring/alerting/km) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:24:46 +08:00
OG T	92349bc37c	feat(aiops): asset_change_tracker — 8 張 0 writer 表全數上線 Some checks failed CD Pipeline / build-and-deploy (push) Has been cancelled Details Review 盲點 10: asset_change_event 仍 0 筆 (最後一張 0 writer 表) 新增 asset_change_tracker_job.py (~180 行): 每 1h 比對最近兩次 asset_discovery_run,寫 asset_change_event: ✅ asset_added: newer run 有但 older run 沒有 (EXCEPT SET) ✅ asset_removed: older 有但 newer 沒有 ✅ lifecycle_changed: asset_inventory.lifecycle_state='deprecated' 且 updated_at 近 2h 使用 SET EXCEPT 避免 N+1, 單次 INSERT 完成所有 diff 8 張 ADR-090 0 writer 表到此全數有 writer: ✅ asset_inventory / asset_discovery_run / asset_coverage_snapshot / asset_relationship / asset_change_event / asset_compliance_snapshot (asset_) ✅ alert_rule_catalog ✅ host_capacity_snapshot / capacity_violation_event (capacity_) Phase 7 資產盤點 + 覆蓋矩陣 + 變化追蹤完整實作. 接下來可以上 Hermes AI agent 分析品質 (deprecate noisy rules, 推薦 coverage 修復). Wire main.py lifespan asyncio.create_task() Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:18:34 +08:00
OG T	df71c9a37b	feat(aiops): rule_stats_updater — 計算 noise_rate + true/false positive All checks were successful CD Pipeline / build-and-deploy (push) Successful in 8m26s Details Review 盲點 5: alert_rule_catalog 68 筆但 noise_rate/TP/FP/last_fired_at 全 NULL 新增 rule_stats_updater_job.py (~170 行): 每 1h UPDATE 全表 alert_rule_catalog,從 incidents + approval_records 推算: - last_fired_at = max(incidents.created_at WHERE alertname=rule_name) - true_positive_count = count incidents.status='RESOLVED' past 30d - false_positive_count = count approval_records.status='EXPIRED' past 30d (EXPIRED = 48h 無人處理,視為假警報 proxy) - noise_rate = fp / (tp + fp) 窗口: 30 天 (可配置) 使用單一 UPDATE + subquery,避免 N+1 (68 rule × 3 query = 204 queries → 1 query) 解鎖 E3 Hermes: 後續 Hermes AI agent 讀 alert_rule_catalog WHERE noise_rate > 0.5 提案 review_status='deprecated' 或 superseded_by_rule_id Wire main.py lifespan asyncio.create_task() Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:05:30 +08:00

1 2

64 Commits