Your Name
|
9080ba3670
|
feat(awooop): run ansible check-mode evidence worker
CD Pipeline / tests (push) Successful in 1m28s
Code Review / ai-code-review (push) Successful in 12s
CD Pipeline / build-and-deploy (push) Successful in 5m9s
CD Pipeline / post-deploy-checks (push) Successful in 1m30s
|
2026-05-31 12:53:22 +08:00 |
|
Your Name
|
9e093a9525
|
fix(api): reconcile inactive stale incidents
CD Pipeline / tests (push) Successful in 1m26s
Code Review / ai-code-review (push) Successful in 11s
CD Pipeline / build-and-deploy (push) Successful in 4m23s
CD Pipeline / post-deploy-checks (push) Successful in 2m17s
|
2026-05-29 11:43:19 +08:00 |
|
Your Name
|
8342cfa460
|
fix(governance): stop km healthcheck requeue
Code Review / ai-code-review (push) Successful in 11s
CD Pipeline / tests (push) Successful in 2m1s
CD Pipeline / build-and-deploy (push) Successful in 4m45s
CD Pipeline / post-deploy-checks (push) Successful in 1m46s
|
2026-05-19 23:01:03 +08:00 |
|
Your Name
|
edf97ad8ca
|
feat(governance): process hermes km healthchecks
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / tests (push) Successful in 2m13s
CD Pipeline / build-and-deploy (push) Successful in 5m14s
CD Pipeline / post-deploy-checks (push) Successful in 1m55s
|
2026-05-19 22:32:55 +08:00 |
|
Your Name
|
1d285dd9d4
|
fix(api): suppress batch reconcile postmortems
Code Review / ai-code-review (push) Successful in 11s
CD Pipeline / tests (push) Successful in 1m18s
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
|
2026-05-19 12:18:17 +08:00 |
|
Your Name
|
d0835a7be1
|
fix(api): reconcile completed stuck incidents
Code Review / ai-code-review (push) Successful in 11s
CD Pipeline / tests (push) Successful in 1m2s
CD Pipeline / build-and-deploy (push) Successful in 3m34s
CD Pipeline / post-deploy-checks (push) Successful in 1m35s
|
2026-05-19 11:45:15 +08:00 |
|
Your Name
|
ff30c61c4c
|
fix(rls): 收斂 API DB access context
Code Review / ai-code-review (push) Successful in 21s
CD Pipeline / tests (push) Successful in 1m20s
CD Pipeline / build-and-deploy (push) Successful in 4m15s
CD Pipeline / post-deploy-checks (push) Successful in 1m58s
|
2026-05-12 19:55:13 +08:00 |
|
Your Name
|
c1ac157aaf
|
fix(km): keep backfill reconciler loop alive
Code Review / ai-code-review (push) Successful in 11s
CD Pipeline / tests (push) Successful in 1m12s
CD Pipeline / build-and-deploy (push) Successful in 4m2s
CD Pipeline / post-deploy-checks (push) Successful in 1m18s
|
2026-05-06 17:03:22 +08:00 |
|
Your Name
|
4111ea4f9f
|
fix(ai): remove 188 ollama provider
Code Review / ai-code-review (push) Successful in 12s
CD Pipeline / tests (push) Successful in 1m13s
CD Pipeline / build-and-deploy (push) Successful in 3m36s
CD Pipeline / post-deploy-checks (push) Successful in 1m20s
|
2026-05-06 14:34:48 +08:00 |
|
Your Name
|
bf847ad045
|
fix(ai): stabilize GCP Ollama alert lane
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled
|
2026-05-05 22:20:27 +08:00 |
|
Your Name
|
8e22110030
|
fix(governance): keep trust drift watchdog on governance agent
CD Pipeline / tests (push) Successful in 2m51s
Code Review / ai-code-review (push) Successful in 24s
CD Pipeline / build-and-deploy (push) Has started running
CD Pipeline / post-deploy-checks (push) Has been cancelled
|
2026-05-05 14:00:13 +08:00 |
|
Your Name
|
aa4ccec429
|
fix(watchdog): ADR-092 B4 — 三層修復消除 META SYSTEM 重複告警 + Ollama 路由強化
Code Review / ai-code-review (push) Successful in 7m16s
問題根因(debugger 全景徹查):
1. Prod 仍跑舊版代碼(ec013f66 後的修法未部署 → 告警字串仍含舊格式)
2. replicas=2 時 Pod 間 grace period 不共享 → violation_codes 分歧 → 不同 SHA256 → dedup 失效
3. 新 Pod 啟動立即執行 _check_once() → rollout 時多發一波
4. W6 violation_codes 含動態 low_count → count 微變繞過 dedup
修復(A2/A3/W6/C1/C2):
- A2:run_ai_slo_watchdog_loop 加 90s leading sleep,避免 rollout 立即觸發
- A3:_grace_active() 改為 Redis cluster-shared(watchdog:cluster_grace, ex=1800s, nx=True)
消除 Pod 間 grace period 不一致;Redis 故障時 fallback 為 process-local monotonic
- W6:violation_codes 移除動態 low_count,改為穩定 "W6:trust_drift"
- C1:ollama_auto_recovery.py recovered_host 改動態 label(依 URL port 判斷 GCP-A/B/Local)
- C2:ConfigMap OLLAMA_FALLBACK_URL 改走 110:11437 nginx proxy,三層容災統一架構
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-05 10:31:53 +08:00 |
|
Your Name
|
10e665a540
|
fix(watchdog): 修復 META SYSTEM 重複告警 — violation_codes 穩定 dedup
Code Review / ai-code-review (push) Successful in 1m3s
根因:violations 字串含動態浮點數(mean_trust/low_ratio),每次微變 → SHA256 不同 → dedup 失效
修法:新增 violation_codes list(穩定 W-code 格式),dedup 計算只用 violation_codes
violations 保持含動態值(顯示用),Telegram 通知照常顯示完整資訊
W-6 Trust Drift dedup key: W6:trust_drift:low_count={N}(不含浮點數)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-05 00:06:38 +08:00 |
|
Your Name
|
ec013f662d
|
fix(watchdog): 修复 Trust Drift 重复告警 + 建立 GCP Ollama nginx proxy
Code Review / ai-code-review (push) Successful in 45s
Ansible Lint / lint (push) Has been cancelled
- ai_slo_watchdog_job: 改用 trust_drift_detector 纯统计 lib
避免与 governance_agent 每小时自检查重复触发 Telegram
- infra/ansible: 建立 110 nginx proxy 转发到 GCP-A/B
端口 11435 -> 34.143.170.20:11434 (GCP-A)
端口 11436 -> 34.21.145.224:11434 (GCP-B)
- docs/runbooks: DEPLOY-GCP-OLLAMA-PROXY.md 完整部署指南
- ops/nginx: 手动部署脚本供 110 直接执行
ADR-110 三层容灾启用前提:先部署 proxy,再改 ConfigMap
|
2026-05-04 23:12:35 +08:00 |
|
Your Name
|
45f6f17558
|
fix(watchdog): dedup hash 非確定性 bug — 改用 hashlib.sha256 + setnx atomic
Code Review / ai-code-review (push) Successful in 56s
根因:Python 內建 hash() 受 PYTHONHASHSEED 影響,每次 process 重啟值不同。
每次 kubectl rollout restart → 新 pod 算出不同 dedup_hash → 繞過 1h TTL → 洗版。
症狀:連續 rollout 4-5 次後,META SYSTEM 每分鐘一條狂發(19:39/40/41/42 截圖)。
修法:
1. hash() → hashlib.sha256(content.encode()).hexdigest()[:12](跨 pod/重啟確定性)
2. redis.exists+setex → redis.set(nx=True) atomic setnx(防多 replica 並發多發)
2026-05-04 ogt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 19:47:42 +08:00 |
|
Your Name
|
f6b698c873
|
fix(aiops): Critic 修復 — PromQL 注入防線 + flag=False escalation bug + 計數虛報
Code Review / ai-code-review (push) Successful in 53s
Bug 1 (drift.py): DRIFT_AUTO_ADOPT_ENABLED=false 時仍設 auto_block_reason
→ 導致 escalation 被觸發,把「停用」誤判為「阻擋事故」
修法: flag=False 不設 auto_block_reason,視為靜默停用
Bug 2 (coverage_evaluator_job.py): asset name/host/namespace/ip 直接 f-string
進 PromQL,無白名單驗證
→ 髒資料可生成語意污染規則或讓 Prometheus reload 失敗
修法: 加 _safe_label_val 正規表達式白名單(^[a-zA-Z0-9._\-]+$),
不合法直接 skip + debug log
Bug 3 (coverage_evaluator_job.py): ON CONFLICT DO NOTHING 衝突時 created 仍 +1
→ stats["rules_auto_created"] 計數虛高,Redis 冷卻被誤設
修法: 改用 INSERT ... RETURNING rule_name,fetchone() 確認實際插入才計數和設冷卻
附加: Redis RuntimeError 單獨 catch + log(不再靜默 pass)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 14:31:53 +08:00 |
|
Your Name
|
72cd79ed8b
|
fix(aiops): Task2 drift auto-adopt 根因修復 + Task3 coverage gap 規則自動生成
Code Review / ai-code-review (push) Successful in 48s
Task 2 — Drift 自動採納修根因:
根因: _analyze_and_notify() 中 report 是 in-memory 物件,
update_interpretation() 只更新 DB,不回寫 report.interpretation,
導致 auto_adopt_if_safe() 永遠看到 None → 觸發「尚無 Nemotron 意圖分析」
→ Drift 自動採納 0 筆
修法: report.interpretation = interpretation(DB 寫入後立即回寫記憶體)
附加: DRIFT_AUTO_ADOPT_ENABLED flag(default=True,回滾: kubectl set env ...=false)
Task 3 — Coverage Gap → AI 規則自動生成執行器:
根因: evaluate_once() 只分析 red 缺口,但無執行器將分析轉為實際規則
→ alert_rule_catalog 的 ai_generated source 永遠為 0 條
修法: 新增 _auto_create_rules_for_uncovered_assets(run_id)
· 查 auto_alerting=red 的 top 5 host/k8s_workload asset
· 依 asset_type 生成範本化 PromQL rule(host→up, k8s→replicas_available)
· UPSERT 進 alert_rule_catalog(source='ai_generated', review_status='pending_review')
· Redis 24h 冷卻防重複,Redis 不可用時降級繼續
附加: COVERAGE_AUTO_RULE_ENABLED flag(default=True,回滾: kubectl set env ...=false)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 14:22:51 +08:00 |
|
Your Name
|
577250a678
|
fix(governance): 修反消音化 W-3/W-4 守衛 + Prometheus 補資料缺失告警
Code Review / ai-code-review (push) Successful in 52s
CD Pipeline / tests (push) Failing after 2m21s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 1m6s
【統帥怒訓 — 違反 feedback_full_chain_first_then_fix.md 鐵律】
前次 commit f1362fcc 用 skip 條件把告警吞掉,是消音化解法:
- W-3:total_exec<10 永遠 skip → Redis 永遠空也不會告警
- W-4:playbooks total==0 永遠 skip → 表被清空也不會告警
- Prometheus NaN sentinel + 既有 < 0.1 規則疊加後沒任何路徑會告警
統帥怒訓「又把告警給消失了」「已經這樣做幾次了」。本 commit 救回告警可見性。
【修法 — 啟動 30 分鐘寬限 + 過期改打資料管線斷新告警】
- ai_slo_watchdog_job.py 新增模組層 _PROCESS_START 與 _grace_active() 守衛:
- W-3a:metric 有資料 + rate<0.30 → 既有「飛輪成功率過低」
- W-3b:rate=None 且 uptime>30min → 新告警「飛輪資料管線無流量」
- W-4a:playbooks total>0 + approved=0 → 既有「自動修復鏈路斷裂」
- W-4b:playbooks total=0 且 uptime>30min → 新告警「Playbook 表初始化失敗」
- 3 份 Prometheus rule(k8s/monitoring/flywheel-alerts.yaml、
ops/monitoring/alerts.yml、ops/monitoring/alerts-unified.yml)新增
FlywheelExecutionRateMissing:absent() 或 NaN 持續 30 分鐘 → 告警,
與 watchdog W-3b 雙保險
【已加入 memory】
feedback_silencing_alerts_recurring_violation.md 鎖入紅線鐵律:
「fresh deploy / init guard 用 skip 吞告警 = 結構性失職,必須分流寬限期 +
過期改打資料管線斷新告警」
【驗證】
106 個治理相關 unit test 全過:
test_trust_drift_watchdog / test_governance_agent / test_failover_alerter /
test_check_trust_drift_commit_outside_context_poc /
test_governance_remediation_dispatch / test_ai_governance_endpoints /
test_governance_dispatcher
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-05-03 12:39:46 +08:00 |
|
Your Name
|
f1362fcc8d
|
fix(governance): 修治理告警 4 個 silent failure + Prom sentinel 連鎖
Code Review / ai-code-review (push) Successful in 49s
CD Pipeline / tests (push) Successful in 2m9s
CD Pipeline / build-and-deploy (push) Failing after 31m11s
CD Pipeline / post-deploy-checks (push) Has been skipped
【全景檢測:12-agent 並行掃描定位 4 大 bug 與 1 個 P0 連鎖回歸】
Bug 1(P0 silent failure)— governance_agent.check_trust_drift
原 `await db.commit()` 縮排錯在 async with 區塊外(8 空格 vs 12),
session 已 auto-commit 關閉,二次 commit 拋 InvalidRequestError 被吞,
governance_trust_drift_auto_deprecated log 從不出現。修:commit/log 移回 with 內。
附 AST regression guard test 擋退化。
Bug 2 — flywheel_stats_service / W-3 fresh deploy 假告警
Redis 空時 total_exec=0 → rate=0.0 → watchdog `< 0.30` 立即觸發
「飛輪成功率 0%」假告警。修:total_exec < FLYWHEEL_MIN_SAMPLE(10) 回 None,
watchdog 判 None 跳過 W-3。Prometheus sentinel 用 NaN(非 -1.0)
避免觸發 ops/monitoring/alerts.yml:775 等 3 份 prom rule 的 `< 0.1`
條件造成 2h 後假告警連鎖。前端 type 同步 number | null。
Bug 3 — failover_alerter dedup key
原 key 只看 event_type 不看 payload,trust_drift 4→25 IDs 變動全被
1h dedup 吞掉。修:dedup key 加 sha256(impact subdict)[:8],event_type
sanitize 防特殊字元污染 Redis key。
Bug 4 — ai_slo_watchdog_job W-4 evolver 全封存初始化誤報
原邏輯 approved==0 即告警,未排除「playbooks 表初始化中」場景。
修:_count_approved_playbooks 回 (approved, total),total==0 → skip。
【執行結果】
- 39 個相關 unit test 全過(test_failover_alerter / test_governance_agent /
test_trust_drift_watchdog / test_check_trust_drift_commit_outside_context_poc)
- 6 個關鍵路徑實測:NaN sentinel / float 渲染 / hash 區分性 / dedup 同 impact
相同 hash / datetime 容錯 / 4 檔 py_compile 全過
【調度教訓 — 留作未來改進】
- 12-agent 並行調度時,vuln-verifier 與 fullstack-engineer 競態
導致 vuln-verifier 讀到已修代碼誤判 NOT REPRODUCIBLE。
未來:vuln-verifier 應在 fullstack 之前執行,或用 git show HEAD~1 對比修復前。
- fullstack-engineer 引入 P0 regression(f-string 內嵌 ternary 非法 format spec),
critic 抓到 + Prom sentinel 連鎖 — 證明 critic 審查必要不可省。
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-05-03 00:18:57 +08:00 |
|
Your Name
|
b710f3f38f
|
feat(governance): normalize AI治理告警輸出與元告警解析度
CD Pipeline / tests (push) Failing after 25s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 46s
|
2026-05-02 23:49:59 +08:00 |
|
Your Name
|
dedb12085b
|
chore(governance,watchdog): enrich alerts and enable prometheus multiproc
CD Pipeline / tests (push) Failing after 1m22s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 43s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 57s
|
2026-05-02 23:44:12 +08:00 |
|
Your Name
|
b3a0f0d766
|
fix(telegram): dedup by fingerprint + 24h TTL to stop repeat alerts
CD Pipeline / tests (push) Successful in 2m22s
Code Review / ai-code-review (push) Successful in 57s
CD Pipeline / build-and-deploy (push) Successful in 21m3s
CD Pipeline / post-deploy-checks (push) Successful in 5m2s
Telegram 重複發告警鐵證(4 個 agent 真實數據):
- INC-6FE3BD (HostBackupFailed) 24h 內被推 15 次
- INC-FD6E21 (HostHighCpuLoad) 24h 內被推 6 次
- 06:44:18 同秒兩送 = pod 並發 race
根因:
1. `telegram_sent:{incident_id}` dedup key 綁 uuid4 隨機 INC ID,
同 fingerprint 換新 INC 完全不去重
2. dedup TTL=600s 比 incident_analysis_sweeper 重觸週期 1h、
alertmanager repeat_interval 4h 都短 → 每輪都過期通過
3. pod restart 走 _resend_unconfirmed_ready_tokens 用同一 incident_id key
→ 重啟必炸一波
修法(不消音、是「AI 認得這是同一事故」):
- decision_manager.py:207-225 dedup key 改 alertname+target fingerprint
- decision_manager.py:573-578 TTL 600s → 86400s (蓋住 sweeper 1h × alertmanager 4h)
- decision_manager.py:3189-3208 pod restart resend 路徑同步改 fingerprint
- incident_analysis_sweeper.py:37-42 sweeper_done TTL 3600s → 86400s
預期:同症狀 24h 內最多發 1 張 decision card;resolved 後 line 220-226
status check 會 early return,不影響復發偵測。
Tests: 35 passed (test_telegram_adr050 + test_decision_manager_docker_prune_routing)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-05-02 16:25:48 +08:00 |
|
Your Name
|
f154ac022e
|
feat(playbook): version generated playbooks
CD Pipeline / tests (push) Successful in 1m34s
Code Review / ai-code-review (push) Successful in 28s
Type Sync Check / check-type-sync (push) Successful in 1m10s
CD Pipeline / build-and-deploy (push) Successful in 10m19s
CD Pipeline / post-deploy-checks (push) Successful in 3m1s
|
2026-04-30 23:59:39 +08:00 |
|
Your Name
|
6e04fe9c8a
|
feat(playbook): generate drafts with local llm
CD Pipeline / tests (push) Successful in 1m28s
Code Review / ai-code-review (push) Successful in 29s
Type Sync Check / check-type-sync (push) Failing after 2m41s
CD Pipeline / build-and-deploy (push) Successful in 8m40s
CD Pipeline / post-deploy-checks (push) Successful in 3m10s
|
2026-04-30 23:04:58 +08:00 |
|
Your Name
|
61f5a6a419
|
fix(telegram): route alerts to SRE war room
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
|
2026-04-30 15:01:23 +08:00 |
|
Your Name
|
3668d49f2f
|
feat(flywheel): W2 三件 + KMWriter critic 修法(1635 tests 全綠)
CD Pipeline / build-and-deploy (push) Failing after 1m38s
W2 (onboarder 4 週飛輪 80→90 路徑第二週) + critic PR review 5 個 critical/major
全部修完,default flag=false 安全無爆炸風險。
## W2 三件 PR
### PR-R2 — AOL → catalog confidence EWMA 回灌(修飛輪斷鏈 C2)
- 新檔 `apps/api/src/jobs/aol_to_catalog_writeback_job.py`
- 邏輯:每小時掃 AOL 計算 EWMA confidence (alpha=0.3) 回灌 alert_rule_catalog
- 失敗閾值 N=5 連續低成功率 → review_status='draft'
- Hermes _fetch_noisy_rules SQL 加 OR review_status='draft'
- ENABLE_AOL_WRITEBACK_JOB=false (default)
- 8 個測試(mock path 修正:lazy import → patch src.db.base.get_db_context)
### PR-V1 — self_healing_validator 串接 (修飛輪斷鏈 C6)
- 新檔 `apps/api/src/services/self_healing_validator.py`(純函數 assess_self_healing)
- post_execution_verifier.py step 5 串接(feature flag gate)
- evidence_snapshot.py 加 self_healing_score / self_healing_detail 欄位
- db/models.py + base.py ALTER IF NOT EXISTS
- score < 0.5 → 觸發 rollback 提案 Telegram alert(不自動執行)
- ENABLE_SELF_HEALING_VALIDATOR=false (default)
- 7 個測試
### PR-L1 — KM ↔ Playbook 雙向回路 (修飛輪斷鏈 C3+C4)
- learning_service.py 三條新邏輯:
1. _write_playbook_evolution_km:promote/demote 寫 KM 演化條目
2. _check_and_mark_playbook_review:N=5 累積觸發 review_required
3. _demote_alert_rule_catalog_confidence:DEPRECATED → confidence×=0.5
- PlaybookRecord 加 review_required 欄位(schema migration via base.py)
- ENABLE_KM_PLAYBOOK_FEEDBACK_LOOP=false (default)
- KM_PLAYBOOK_REVIEW_THRESHOLD=5 可調
- 6 個測試
## KMWriter Critic 5 個 Critical/Major 修復(之前 critic PR review 發現)
之前 push commit c5753e1c 已修,本 commit 補回 stash 中的對應檔案:
- C1 km_writer.py:194 backfill 自打臉(已修:同步 await + DLQ)
- C2 km_writer.py:391 KM_WRITE_AWAIT=false 路徑收緊
- M1 decision_manager.py:2178/2203 移除 _fire_and_forget
- M2 incident_service.py:1099 自製 path 加 retry+DLQ
- M3 km_writer.py:166 冪等聲明對齊(UPSERT + partial unique index)
## 驗證
- 1635 unit tests 全綠(+27 from 1608)
- 與 fb0c72db (推翻 A2 Ollama primary) 共存無衝突
- 所有新 Job/Service default flag=false(不爆炸)
## 期望影響
飛輪斷鏈 C2 + C3 + C4 + C6 全修
飛輪自主化評分:65 → 85 預估(W2 完成後)
啟用順序(待 prod fb0c72db 驗證 OLLAMA primary 跑得起來後):
1. ENABLE_AOL_WRITEBACK_JOB=true
2. ENABLE_KM_PLAYBOOK_FEEDBACK_LOOP=true
3. ENABLE_SELF_HEALING_VALIDATOR=true
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-29 19:44:04 +08:00 |
|
Your Name
|
c5753e1c57
|
fix(critic-review): KMWriter 名實統一 + Alertmanager 修抑制 + drift checker AST 化
critic PR review 揭示已 push commits 的 7 個 blocker,本 commit 全部修復。
## C1 + C2 + M1 + M2 + M3 — KMWriter 真正統一契約(critic 最嚴重 5 條)
### C1 km_writer.py:194 — backfill 自打臉修
- 裸 asyncio.create_task(_backfill_path_a_approval) → await _backfill_path_a_approval_safe()
- 同步 await + 獨立 DLQ km:backfill:dlq + try/except 不阻塞主寫入
- 新增 km_backfill_reconciler_job.py(每 5 分鐘掃 DLQ)+ ENABLE_KM_BACKFILL_RECONCILER flag
- 防 Path B 比 Path A 先完成 → related_approval_id 永遠 NULL 的 race
### C2 km_writer.py:391 — KM_WRITE_AWAIT=false 路徑收緊
- 從 ensure_future(fire-and-forget 比舊版同步寫更糟)
- 改 await writer.write(retry=1, timeout=2.0)(仍 await 但只試一次、超時短)
- docstring 明確標註「緊急回滾用,不保證可靠性」
### M1 decision_manager.py:2178/2203 — 移除 _fire_and_forget 旁路
- 兩處 _fire_and_forget(executor.write_execution_result_to_km(...))
- 改 await asyncio.shield(...) + BaseException 保護(防上層 cancel 中斷)
- KM_WRITE_AWAIT=true 在這條路徑終於真正 await
### M2 incident_service.py:1099 — 自製 path 加 retry+DLQ
- 原本 if settings.KM_WRITE_AWAIT: await asyncio.wait_for else create_task
- 改 3 次指數退避 retry + DLQ 保護(呼叫 km_writer 私有 helper)
### M3 km_writer.py:166 — 冪等聲明對齊實作
- knowledge_repository.create() 加 UPSERT 路徑(pg_insert ON CONFLICT DO UPDATE)
- KnowledgeEntryCreate / KnowledgeEntryRecord 加 path_type 欄位
- migration: ADD COLUMN path_type + partial unique index uix_knowledge_incident_path
## M4 alertmanager.yml — equal: [] 收緊(critic 防爆炸抑制)
- OllamaInstanceDown / KMConverterDown 抑制加 equal: ['cluster'] 約束
- 防多 cluster 場景下任一 Ollama down 誤抑全 AI/SLO 告警
## M5 Alertmanager 版本驗證(已確認 v0.31.1,遠超 v0.22+)
## M6 governance_agent.py — health score 區分 skipped vs ok vs violated
- check_slo_compliance 加 _meta {violated_count, skipped_count, ok_count, all_skipped, status}
- run_self_check: SLO 全 skipped 時獨立發 governance_slo_data_gap 告警
(不污染 self_failure 計數,因為 no_data 是 emitter 未實作不是治理機制故障)
## M7 scripts/check_config_drift.py — 改 AST 解析
- regex 改 ast.parse 找 Settings ClassDef AnnAssign Field(default=...)
- 避免多行 list / default_factory= / 含跳行字串的 false negative
- 4 欄位(AI_FALLBACK_ORDER / ARGOCD_URL / PROMETHEUS_URL / OLLAMA_URL)全對齊
## 新增測試
- test_km_writer_backfill_reconciler.py: 7 cases(C1 reconciler + safe helper)
- test_km_writer_idempotent.py: 5 cases(M3 path_type 注入 + UPSERT 分支)
## 驗證
- 1585 unit tests 全綠(+13 從 1572)
- amtool check-config SUCCESS(8 inhibit_rules / 2 receivers)
- drift checker AST-based 4 欄位全對齊
- Alertmanager v0.31.1 確認支援新語法
## 期望影響
- KMWriter 名實統一:飛輪閉環 KM 寫入路徑 100% 可靠
- M4 抑制爆炸風險解除
- 治理層不再對 SLO no_data 靜默
- drift checker false negative 風險解除
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-29 10:44:39 +08:00 |
|
Your Name
|
123d9c8a2e
|
fix(p3.1-t1): 三 Tier-1 服務整合 — model_rollback_service + resource_resolver
CD Pipeline / build-and-deploy (push) Has been cancelled
P3.1-T1 接線兩個既有服務到主流程:
offline_replay_service.py — model_rollback_service 整合:
- 回放事件寫入治理 DB 後,觸發 ModelRollbackService.check() 衰退偵測
- feature flag 由 model_rollback_service 自行判斷(AIOPS_P6_GOVERNANCE_ENABLED)
- retrain_recommended → log warning 含 streak / absolute_floor / conservative_mode
- exception fail-soft(不阻斷 replay 主流程)
approval_execution.py — resource_resolver 整合:
- kubectl 指令解析後,動態驗證資源是否存在於 K8s
- 若 resolved_name != raw_name → log + apply normalized name
- 若不存在但有 candidates → log warning + suggestions(不攔截執行,只記錄)
- exception fail-soft(不阻斷主流程)
- RESOURCE_RESOLVE_TOTAL Prometheus counter labels: hit/suggestion/miss/error
Tests: 後端 1303 collected(無回歸),對應 dedicated 測試在前次 commit 已寫
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (P3.1-T1) <noreply@anthropic.com>
|
2026-04-27 08:17:04 +08:00 |
|
Your Name
|
cc547736ab
|
feat(wave6-8): P2.1 fusion + P2.2 governance + P2.4 consensus + Wave 7/8 BLOCKER 修復
承接 Wave 6/7/8 多 engineer 在 agent 限額前完成的代碼,補 commit 解 production
HEAD 隱性 import error(decision_fusion 已被 decision_manager 引用但檔案 untracked)。
新增(後端核心):
- decision_fusion.py (562 行) — P2.1 方法 III(OpenClaw + Hermes + Elephant 三 LLM 融合)
- aiops_timeline.py + aiops_timeline_service.py — critic B4 修復
/api/v1/aiops/timeline endpoint,DB 存取抽到 service 層遵守 leWOOOgo 積木化
- migrations/p2_decision_fusion_columns.sql + rollback — approval_records fusion 欄位
修改(後端整合):
- decision_manager.py — fusion 三斷鏈修補(critic B1+B2+B3):
· B1: 寫 _evidence_snapshot_ref 到 token.proposal_data
· B2: fusion 前計算 complexity_score 並寫 token
· B3: fusion composite 寫 token.proposal_data["decision_fusion"]
- auto_approve.py — fusion + consensus 認識(critic B3+B5):
· composite > 0.7 → auto_execute_eligible bypass min_confidence
· source=consensus_engine + score>=0.6 → 規則可信路徑
- consensus_engine.py — db-fix _save_consensus 重用 agent_sessions
- governance_agent.py — db-fix _alert PG 寫入 ai_governance_events
- approval_db.py — fusion 3 欄位 + 2 partial index + CheckConstraint
- db/models.py — schema 對齊 migration
- core/config.py — vuln #1 修復:OLLAMA_URL/_FALLBACK_URL field_validator
拒絕公網 IP + 外部域名,僅允許私網/loopback/K8s SVC 白名單
- core/feature_flags.py — P2 fusion + consensus flags
- main.py — governance_agent lifespan 啟動
- failover_alerter.py — Wave8-X2: in-memory dedup fallback(Redis 拒絕後不 fail-open)
- ollama_*.py — metrics 整合 + recovery 改善
- auto_repair_service.py — verifier 接線
新增(測試 2438 行):
- test_decision_fusion.py / test_governance_agent.py / test_consensus_integration.py
- test_p2_db_fixes.py / test_wave8_fusion_fixes.py
- test_config_url_validation.py(vuln #1 12 tests)
- test_failover_alerter.py +Wave8-X2 in-memory dedup 補測
驗收: 116 tests pass (decision_fusion + wave8_fusion + config_url + consensus +
governance + p2_db_fixes + failover_alerter)
Conflict resolution:
- 3 檔(config.py + auto_approve.py + decision_manager.py)git stash pop 衝突
保留 stashed (engineer 最終版),補回 ValueError 「公網 IP」字樣對齊 test
Note: 此 commit 解 production HEAD 隱性 import error
仍未修: vuln #4 prompt injection / debugger B14 quota fail-closed
/ B25-B26 drain_pending_tasks / B8 governance fail alert
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Multiple Engineers (Wave 6/7/8) <noreply@anthropic.com>
|
2026-04-27 08:11:40 +08:00 |
|
Your Name
|
c995fe4008
|
fix(watchdog-w5): suggested_action 欄位不存在 → 改用 action
CD Pipeline / build-and-deploy (push) Successful in 13m30s
ApprovalRecord ORM 只有 action 欄位,suggested_action 僅存於 Pydantic
ApprovalRequest 層。新 Pod 啟動後 W-5 拋 AttributeError:
"type object 'ApprovalRecord' has no attribute 'suggested_action'"。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-24 20:40:42 +08:00 |
|
Your Name
|
97ce5ea658
|
feat(p2.6): trust_drift_detector 接入 ai_slo_watchdog_job W-6
CD Pipeline / build-and-deploy (push) Successful in 9m10s
P2.6 接入 2026-04-24 ogt + Claude Sonnet 4.6
問題: trust_drift_detector.py 是孤立服務(零引用),Playbook 信任度
偏態(盲目樂觀/學習鎖死)從未被任何監控機制感知
修復: ai_slo_watchdog_job._check_once() 新增 W-6 Trust Drift 檢查
- 呼叫 get_trust_drift_detector().run()(偵測 + 寫 ai_governance_events)
- 偵測到偏態時加入 violations 清單 → 觸發 TYPE-8M Meta-System 告警
- checks 計數從 5 → 6
覆蓋案例:
- optimism_bias: >70% Playbook trust_score >0.9 → PostExecutionVerifier 可能失效
- confidence_collapse: >70% Playbook trust_score <0.3 → EWMA 計算/執行誤判
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-24 15:57:30 +08:00 |
|
Your Name
|
45dbe07188
|
fix(flywheel): 自動化飛輪六大能力修復(ADR-092 B3)
run-migration / migrate (push) Failing after 22s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 53s
Type Sync Check / check-type-sync (push) Successful in 2m54s
CD Pipeline / build-and-deploy (push) Has been cancelled
Ansible Lint / lint (push) Has been cancelled
【根因鏈修復】
MCP Provider bugs → PreDecisionInvestigator 失敗 → Agent Debate 無上下文
→ LLM 逾時 → description="待分析" → ADR-091 鐵閘攔截 → tg_sent 未設
→ W-2 Watchdog 誤報「靜默故障」
【六大修復】
1. MCP Provider 三蟲修復
- ssh_provider: asyncssh.run() → conn.run()
- prometheus_provider: KeyError 'query' → .get() 容錯
- k8s_provider: 空 pod_name → 早返回錯誤字典
2. Agent Debate / 決策品質
- decision_manager: 逾時降級文字改為明確描述(繞過 ADR-091 鐵閘)
- intent_classifier: LLM 逾時降級至關鍵字分類(非 None)
3. Watchdog 誤報修復(ADR-092 B3)
- W-2: tg_sent Redis TTL → telegram_message_id IS NULL(DB 真值)
- W-5 新增: suggested_action IN 空/待分析/NO_ACTION + tg_id IS NULL
- approval_timeout_resolver: 60min → 15min,batch 50 → 200
4. Config Drift 自動化
- drift_adopt_service: auto_adopt_if_safe() 六條件安全閘
- drift.py: 背景任務先嘗試自動採納再發人工 Telegram 卡片
5. Playbook 飛輪穩定
- playbook_seed_service: 修復幂等性(deprecated 不視為缺失)
- playbook_evolver: 只載 DRAFT+APPROVED(非全部 294 筆)
6. 可觀測性
- alert_rule_engine: auto_rule 結構化日誌 + Redis 計數器(pipeline)
- auto_approve: reject 原因 Redis 計數器
- heartbeat_report_service: 新增「⚙️ 自動化統計(今日)」區塊
【待人工執行】
psql $DATABASE_URL -f apps/api/migrations/cleanup_duplicate_deprecated_playbooks.sql
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-24 10:55:50 +08:00 |
|
Your Name
|
de2d34d4cd
|
fix(playbook): C1-C4 全流程串接 — evolver保護+seeder復活+規則即時建立+watchdog W-4
CD Pipeline / build-and-deploy (push) Has been cancelled
C1: playbook_evolver — yaml_rule source playbooks 加 YAML_RULE guard,
evolver 不再封存 seeder 建立的 APPROVED playbook,保護自動修復鏈路
C2: playbook_seed_service — idempotency SQL 排除 DEPRECATED 記錄,
evolver 封存後重啟可復活 yaml_rule playbooks
C3: alert_rule_engine — AI 自動生成規則成功後立即呼叫 seed_playbooks_from_rules(),
不等下次重啟即可建立對應 APPROVED Playbook
C4: ai_slo_watchdog_job — 新增 W-4 APPROVED playbook 數量為 0 告警,
鏈路斷裂立即 TYPE-8M;total checks 由 3 升為 4
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-20 20:18:11 +08:00 |
|
Your Name
|
156a52f807
|
fix(aiops): ADR-092 三修 — Playbook enum崩潰 + Telegram永久靜默 + 採納失敗 + AI自健診
CD Pipeline / build-and-deploy (push) Successful in 13m33s
B1 playbook_service.py: evolver setattr傳str而非PlaybookStatus enum
→ _pg_upsert playbook.status.value炸(163次/48h),修:update_with_validation強制enum轉型
B2 approval_db.py + webhooks.py: find_by_fingerprint PENDING誤收斂
→ PENDING≠Telegram已發;修:成功push後mark tg_sent:{fingerprint} Redis(24h TTL)
→ find_by_fingerprint debounce窗外PENDING必須Redis確認才收斂
drift_adopt_service.py: telegram_gateway呼叫adopt_drift(report_id)但方法不存在
→ 新增adopt_drift()包裝:從DB載入DriftReport後委派adopt(),修復採納失敗
B3 ai_slo_watchdog_job.py + main.py: AI無法感知自身故障(MASTER §1.1盲區)
→ 新增每15分鐘自健診:W-1 SLO違反 W-2 TG靜默偵測 W-3 飛輪成功率
→ 任一異常→TYPE-8M send_meta_alert;Redis去重1h
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-20 20:00:06 +08:00 |
|
Your Name
|
f572561467
|
feat(ai_advisory): P0 修 leader lock + inline keyboard + callback handler
CD Pipeline / build-and-deploy (push) Successful in 8m31s
統帥 2026-04-19 截圖反饋:
1. 同一告警 22:44 連推 2 則 (多 Pod 都跑 daily loop)
2. 純文字無按鈕 (無 feedback 閉環 / AI 只建議不執行)
新增 services/ai_advisory_helpers.py (~240 行):
- try_acquire_daily_lock(job_name): Redis SETNX key 'aiops:daily_lock:{job}:{date}',
TTL 25h,fail-open (Redis 掛照推,不阻塞).
- try_acquire_hourly_lock(job_name): 同上 hourly 版 (coverage_evaluator 用).
- is_snoozed / set_snooze: Redis key 'aiops:snooze:{type}:{target}' TTL 24h.
- build_ai_advisory_keyboard: 統一 4 按鈕
✅ 已處理 / 😴 忽略 24h / 🔍 查看詳情 / 📋 產 kubectl 指令
callback_data 格式: 'ai_advisory_{action}:{type}:{id}'
- handle_ai_advisory_callback: 處理 handled/snooze 兩個 action 寫 aol.output.human_feedback,
view/produce_cmd 留 P1.
4 個 LLM scanner 改用 helper:
- capacity_forecaster: daily_lock + snooze check per host + 按鈕
- compliance_scanner: daily_lock (cron only) + snooze per date + 按鈕
- coverage_evaluator: hourly_lock + snooze per worst_dimension + 按鈕
- hermes_rule_quality: daily_lock + snooze per primary rule + 按鈕
telegram_gateway.py:
handle_callback 加 'ai_advisory_*' 路由 (step 1.85 drift 後)
新增 _handle_ai_advisory_action 方法:
解析 payload 'type:id' → 呼叫 handle_ai_advisory_callback
→ answer_callback (Telegram toast 回饋)
→ 返回 dict (info_action=True for view/produce_cmd)
統帥鐵律對齊:
✅ 多 Pod 場景只 leader 推 (Redis SETNX 保證冪等)
✅ 失敗 fail-open 不阻塞主業務 (Redis 掛仍能運作)
✅ aol.output 加 human_feedback 供 AI 學習
✅ snooze 避免重複告警 (24h TTL)
✅ 原 drift 按鈕 pattern 複用 (non-breaking)
明早 AI 將收到:
- 單一訊息 (非重複)
- 含 4 按鈕 (手動 feedback 閉環)
- snooze 後同主題 24h 不再推
view/produce_cmd P1 留下 session (AI 主動 MCP 蒐證 + LLM 產 kubectl command).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-19 23:02:57 +08:00 |
|
Your Name
|
fa643ebdc7
|
refactor(p1): LLM JSON parse helper 抽出 + coverage 閾值雙條件 (架構師 Review P1)
CD Pipeline / build-and-deploy (push) Successful in 8m52s
首席架構師 2026-04-19 Review (92/100 Grade A) 指出 P1 優化:
1. LLM JSON 3-path parse 邏輯在 4 scanner 重複 (~80 行 × 4 = 320 行)
2. coverage red>=20 觸發閾值偏低,生產 bootstrap 必觸發浪費 token
P1.1+1.2 新增 services/llm_json_parser.py (~90 行):
parse_llm_json_response(text, required_key, logger_context)
3-path fallback:
Path 1: 剝 markdown fence + 直接 JSON 含 required_key
Path 2: NemoTron wrapper (description/action_title/reasoning 內嵌 JSON)
Path 3: 所有失敗 return None + logger.warning
失敗永不 raise,呼叫者決定 fallback.
4 個 LLM scanner 改用 helper:
- hermes_rule_quality_job: required_key='recommended_actions'
- capacity_forecaster_job: required_key='priority_actions'
- compliance_scanner_job: required_key='posture_grade'
- coverage_evaluator_job: required_key='worst_dimension'
每個減少約 20 行重複.
P1.3 coverage 觸發條件改雙條件:
原: total_red >= 20 (bootstrap 必觸發)
新: red_ratio > 30% AND total_scanned >= 50
_fetch_red_summary 加 total_scanned 回傳供計算.
5/5 單元測試 parse_llm_json_response:
✅ direct / markdown fence / NemoTron wrapper / invalid / missing key
P1.4 capacity_scanner + rule_catalog_sync: 檢查後已有完整作者註解 (Review 誤判).
其他 P1 (Prom HTTP helper / first_delay 錯開 / LLM budget guard) 留下 session.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-19 22:39:40 +08:00 |
|
Your Name
|
2f5cab2e45
|
feat(coverage_evaluator): Gap 3.3 LLM 升級 — 缺口分析 + 補覆蓋建議
CD Pipeline / build-and-deploy (push) Successful in 10m14s
Gap 3 進度: 4/9 service 升級 LLM (達到合理上限 — 其他 4 個純資料移動不需 LLM)
coverage_evaluator 原本 7 維升級 unknown→green/yellow/red 後無主動建議.
新增:
1. _fetch_red_summary: 撈最新 run 的 red 分布 + top 10 被標 red 的 asset
2. _llm_analyze_coverage_gaps (~50 行):
有 >= 20 red 時才跑 LLM (避免 well-covered 集群浪費 token)
LLM JSON 輸出:
- worst_dimension: 最該優先補的維度
- root_cause: red 集中的真因 (繁中)
- top_remediation_actions[3]: priority/target/action/effort
- estimated_weeks_to_close: 1-52
- confidence: 0-1
3. _send_telegram_gaps: 推 coverage 缺口 Telegram 摘要
總 red + 最嚴重維度 + 補齊週數 + top 3 補覆蓋動作
scan 完流程:
評估 7 維 → 撈 red summary → LLM 分析 (if total_red >= 20) → Telegram
統帥鐵律對齊:
✅ 不寫死補覆蓋優先 (LLM 根據實際 red 分布推)
✅ AI 建議 + 人工決策 (Telegram 末行: '人工評估補覆蓋排程')
✅ 包含預估完成時間 + 信心 (可追蹤)
session 累計 35 commits, 9 新 scanner, 4 用 LLM:
- Hermes (rule quality)
- capacity_forecaster (容量預測)
- compliance_scanner (合規態勢)
- coverage_evaluator (覆蓋缺口)
剩 5 個純資料移動不適合 LLM (asset_scanner/rule_catalog_sync/
rule_stats_updater/asset_change_tracker/capacity_scanner)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-19 22:02:36 +08:00 |
|
Your Name
|
f6cb938dc3
|
feat(compliance_scanner): Gap 3.2 LLM 升級 — 合規態勢分析 + Telegram 摘要
CD Pipeline / build-and-deploy (push) Has been cancelled
朝 AI 自主化方向 — 9 新 scanner 從 2/9 LLM 提升到 3/9.
compliance_scanner 原本每次 scan 273 snapshots 寫 DB,無任何人可見摘要.
新增:
1. _write_compliance_for_asset_v2 (wrapper):
原 _write_compliance_for_asset 保持不變,v2 版加回傳 asset_warning dict
供上層 LLM 分析用,只有 violations/warnings > 0 才傳回
2. _llm_analyze_compliance_posture (~50 行):
有 warning 時用 OpenClaw 分析整體 posture
輸出 JSON:
- posture_grade: A/B/C/D/F
- posture_summary: 3 句繁中整體態勢敘述
- top_priorities[3]: priority + action + rationale
- risk_level: low/medium/high/critical
- confidence: 0-1
3-path JSON parse fallback (直接 / NemoTron wrapper / description 巢狀)
3. _send_telegram_posture (~40 行):
推每日合規摘要到 SRE group
含評級 emoji (🟢A / 🟡B / 🟠C / 🔴D / ⛔F)
顯示 asset_type 分布 (Top 5 種問題類型統計)
含 AI top 3 priority 動作 + rationale
scan_once 流程:
掃 assets × 7 維 → 收集 warning_assets → LLM 分析 → Telegram 推送
統帥鐵律對齊:
✅ AI 分析 + 人工決策 (Telegram 末行: '人工評估各項修復優先')
✅ 不寫死優先順序 (LLM 根據 warnings 實際分布推)
✅ asset_type 分布統計幫統帥快速定位
Gap 3 進度: 3/8 service 升級 LLM (Hermes + capacity_forecaster + compliance_scanner)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-19 21:59:38 +08:00 |
|
Your Name
|
d6b854a25e
|
feat(capacity_forecaster): Gap 3 LLM 升級 — 從 threshold 到 AI 決策
CD Pipeline / build-and-deploy (push) Has been cancelled
Audit 發現 8/9 個新 scanner 是純 threshold,只 Hermes 1 個用 LLM.
統帥指示「朝 AI 自主化方向」→ Gap 3 開始把 threshold 升級 LLM.
第 1 個升級: capacity_forecaster (最高戰略)
原邏輯 _derive_actions 是硬編 keyword → action mapping:
disk → "清理 /var/log, /var/lib/docker, PG WAL"
mem → "檢查 top mem consumer, 考慮加記憶體"
cpu → "分析 top CPU process, 考慮擴充 vCPU"
新增 _llm_analyze_risk (~60 行):
用 OpenClaw 對每個高風險 host 跑 LLM 分析
Prompt 含:
- host + findings (Prometheus predict_linear 結果)
- 主機架構說明 (110 Harbor / 120-121 K3s / 188 PG 等)
LLM JSON 輸出:
- root_causes (3 個候選真因,繁中)
- priority_actions (high/medium/low + 具體指令 hint)
- urgency_days (0-30)
- confidence (0-1)
3-path JSON parse fallback (直接 / NemoTron wrapper / description 巢狀)
_write_recommendation_aol: 加 llm_analysis 到 output_payload
_send_telegram_forecast: 含 AI 判定 (緊急天數 + 信心 + top 2 action)
LLM 失敗時 fallback _derive_actions 硬編建議
對齊統帥鐵律:
✅ AI 分析 + 人工決策 (仍 requires_human_decision=True)
✅ 不寫死修復動作 (LLM 根據 host 實際狀況產)
✅ root_causes 考慮 host 主機架構 context
Gap 3 進度: 1/8 service 升級 LLM (capacity_forecaster)
剩下 compliance_scanner / coverage_evaluator 等 7 個留後續
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-19 21:52:34 +08:00 |
|
OG T
|
97154d12fa
|
fix(asset_scanner): Gap 1 修正 — 嚴格 IPv4 判斷 + 清理重複 host asset
CD Pipeline / build-and-deploy (push) Has been cancelled
Audit 1 發現 bug:
原 code: if host_ip.replace('.', '').isdigit() → IP 判斷
導致 labels.host='125' (短名) 被誤當 IP → 建 host/125 (錯)
同時 blackbox-icmp instance='192.168.0.112' 無 port → split 失敗 → 漏建
新增 _is_valid_ipv4(s):
嚴格 4 段 + 每段 0-255 整數
避免短名 '125' / hostname 'cadvisor-110' / 超界 '256' 誤判
_collect_prometheus_targets 流程改:
1. 先從 instance 抽 (IP:port 形式 或純 IP)
instance_host = instance.split(':')[0] if ':' in instance else instance
2. 用 _is_valid_ipv4 嚴格驗證
3. labels.host 不再當 fallback (短名不可靠)
DB 清理 (266 筆):
- 10 asset_relationship 指向短名 host
- 140 asset_coverage_snapshot 7 維 × 4 短名 host
- 112 asset_compliance_snapshot 7 維 × 4 短名 × 幾 run
- 4 asset_inventory 短名 host (host/110 / 112 / 125 / 188)
預期下次 scan (1h):
- host/192.168.0.112 + host/192.168.0.121 補進 (原漏,blackbox-icmp 無 port)
- 不再有短名 host asset
6/6 單元測試通過:
_is_valid_ipv4('192.168.0.125')=True
_is_valid_ipv4('125')=False ← 關鍵修復
_is_valid_ipv4('cadvisor-110')=False
_is_valid_ipv4('192.168.0.256')=False (超界)
_is_valid_ipv4('')=False
_is_valid_ipv4('192.168.1')=False (3 段)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-19 21:46:22 +08:00 |
|
OG T
|
7db8845cbb
|
fix(asset_scanner+coverage): host_service→monitoring_target (CHECK violation 修) + log 補 4 維
CD Pipeline / build-and-deploy (push) Successful in 12m59s
2 個 bug 修復 + 實證驗證:
1. asset_scanner: host_service 不在 asset_inventory CHECK 允許列表
ceb61c3 部署後 Pod log: CheckViolationError 'asset_inventory_type_valid'
詳: '192.168.0.125:32334' 寫入時 asset_type='host_service' 被拒
allowed list: host/container/k8s_workload/k8s_resource/database/...
monitoring_target/third_party_service/... (27 種)
修: host_service → monitoring_target (ADR-090 schema 原為 scrape target 預留)
2. coverage_evaluator logger: 只 log 原 3 維 (monitoring/alerting/km)
導致誤以為 c1f23cf 4 維新 code 沒生效 (實際 DB 已有 auto_playbook/
remediation/rule_matching/rule_creation 資料)
修: logger.info 補 playbook/remediation/rule_matching/rule_creation 4 個 kwarg
實證 coverage 7 維 DB 分佈 (已生效):
auto_alerting: 22 green / 78 red / 52 unknown
auto_km_creation: 5 green / 17 yellow / 130 unknown
auto_monitoring: 1 green / 1 red / 150 unknown
auto_playbook: 3 green / 19 yellow / 130 unknown ← 新維度
auto_remediation: 0 / 0 / 98 red / 54 unknown ← 新維度
auto_rule_creation: 0 / 0 / 100 red / 52 unknown ← 新維度
auto_rule_matching: 4 green / 96 yellow / 52 unknown ← 新維度
治理洞察:
98 red remediation = 大部分 asset 過去 30d 沒修復行動 (修復能力缺口)
100 red rule_creation = 無 AI rule (全 yaml_hardcoded)
96 yellow rule_matching = 過去 30d 沒告警觸發 (可能沒問題/沒覆蓋)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-19 20:27:48 +08:00 |
|
OG T
|
ceb61c3c8e
|
feat(asset_scanner): Gap 1 修 — Prometheus targets 補齊 host-install services
CD Pipeline / build-and-deploy (push) Successful in 13m32s
Audit 發現 asset_inventory 只涵蓋 K8s (mon=120, mon1=121 共 2 node+78 pods),
完全漏 110 (Harbor/Gitea/監控) + 112 (security) + 188 (PG/Redis/Ollama) +
125 (mon backup/standby) 這 4 主機的 host-install services.
用戶 4 主機架構 (110/112/120/121/188) 只覆蓋 2/5 = 40%.
新增 _collect_prometheus_targets:
GET /api/v1/targets?state=active → 自動發現全部被監控的:
- host_service (IP 形式 target → postgres-110/redis-110/minio-188/node-exporter 等)
- third_party_service (非 IP 如 alertmanager/argocd-server)
- host (每個 unique IP 建 asset_type='host')
- target → host 的 depends_on relationship
預期新增 asset_inventory:
- host: 6 個 (110/112/120/121/125/188,Prometheus 看到的 blackbox-icmp 全覆蓋)
- host_service: ~15 個 (postgres/redis/minio/node-exporter/cadvisor 等)
- third_party_service: ~5 個 (alertmanager/argocd/prometheus/velero 等)
解鎖:
- 110/112/188 host-install services 進入 asset_inventory
- coverage_evaluator 可評估這些 asset (monitoring/alerting/playbook 等 7 維)
- blast_radius_calculator 可查「110 PostgreSQL 影響哪些 service」
- Hermes/forecaster 建議範圍擴大到非 K8s 服務
對齊統帥鐵律: 朝 AI 自主化 — 不硬編主機清單,動態從 Prometheus 發現
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-19 20:06:34 +08:00 |
|
OG T
|
a391dfc389
|
feat(aiops): capacity_forecaster — Phase 4 Holt-Winters MVP (predict_linear)
CD Pipeline / build-and-deploy (push) Has been cancelled
統帥批准 4 項下階段候選之一完成: AI 容量預測.
新增 capacity_forecaster_job.py (~220 行):
每日 05:00 Taipei 跑預測 (02:00 scanner → 03:00 compliance →
04:00 Hermes → 05:00 forecaster 形成完整日鏈).
預測方法論 (MVP):
Prometheus predict_linear(metric[7d], 86400*7) — 基於過去 7d 做線性外推
3 個預測 query:
1. disk_saturation_7d: predict_linear(node_filesystem_avail_bytes[7d], 7d) < 0
2. mem_saturation_7d: predict_linear(MemAvailable[7d], 7d) / MemTotal < 10%
3. cpu_high_7d_trend: avg_over_time(cpu_used_pct[7d]) > 70%
發現高風險 host → 寫 aol(capacity_recommendation) + 推 Telegram
- input: host + horizon + findings count
- output: findings list + proposed_actions + requires_human_decision=true
proposed_actions 依 findings 推導:
- disk: 清理 log/docker/PG WAL 或擴容
- mem: top consumer / JVM 調整
- cpu: scale out / vCPU 擴充
統帥鐵律對齊:
✅ 只推建議不自動 scale up
✅ 7d window 有足夠樣本
✅ AI 預測 + 人工決策
未來 TODO:
- 真 Holt-Winters (含季節性) — 需 Python statsmodels
- 業務週期調整 (週一高峰/週末低谷)
Wire main.py lifespan asyncio.create_task()
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-19 20:00:36 +08:00 |
|
OG T
|
c1f23cfabe
|
feat(coverage_evaluator): 擴充 4 維 — playbook/remediation/rule_matching/rule_creation
CD Pipeline / build-and-deploy (push) Has been cancelled
Review 盲點: coverage 7 維中原只實作 3 維 (monitoring/alerting/km),其餘 4 維永遠 unknown
v2 擴充:
+ auto_playbook: asset.name 出現在 playbooks.symptom_pattern/description (approved 狀態) → green
沒對應 playbook 但 type='k8s_workload' → yellow
+ auto_remediation: 過去 30d remediation_events.target_resource ILIKE asset.name → green
沒 target 但 k8s_workload/container → red (應有修復能力但沒)
+ auto_rule_matching: 過去 30d incidents.affected_services ILIKE asset.name
或 incidents.alertname match alert_rule.labels.host/namespace → green
沒觸發 → yellow (可能沒問題也可能沒覆蓋)
+ auto_rule_creation: alert_rule_catalog source='ai_generated' match asset → green
目前全 yaml_hardcoded → 全 red (表示尚未由 AI 主動建規則)
未來 Hermes 產出 AI rule 後會變 green
解鎖: coverage 7 維完整 SLO KPI (MASTER §7.1)
- red count = 真正的治理缺口
- green ratio = 自動化成熟度
- AI 可主動推薦 red asset 的補覆蓋動作
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-19 19:54:36 +08:00 |
|
OG T
|
ba18ad2ef8
|
feat(hermes+rules): LLM 升級 Hermes + 統帥決策 deprecate PostgreSQLDiskGrowthRate
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 40s
CD Pipeline / build-and-deploy (push) Successful in 8m37s
統帥 2026-04-19 決策:
- Rule 1 PostgreSQLDiskGrowthRate → 選項 C: deprecate + 替代新規則
- Rule 2 NoAlertsReceived2Hours → 保留 (真實告警鏈路守護)
- noise_rate 算法先修正 (NO_ACTION 不算 fp),觀察後動態調整
1. rule_stats_updater v2 noise 算法:
原: 任何 EXPIRED approval 都算 fp
問題: NO_ACTION/OBSERVE/INVESTIGATE 是 AI 純觀察,不該算假報
修: WHERE ar.action NOT ILIKE '%NO_ACTION%' AND NOT ILIKE '%OBSERVE%' AND ...
2. hermes_rule_quality v2 LLM 升級:
新增 _llm_analyze_noisy_rule:
- 用 OpenClaw (Ollama/NemoTron/Gemini) 分析每條噪音 rule
- JSON 輸出: probable_root_causes/recommended_actions/confidence/should_deprecate
- 3 路 parse fallback (直接 / NemoTron wrapper / description nested)
_write_advisory_aol 加 llm_analysis 到 output_payload
_send_telegram_summary 加 AI 判定 + top 2 建議 (8 條上限避免太長)
符合統帥鐵律: AI 分析但不自動動作,仍人工決策
3. ops/monitoring/alerts-unified.yml 替換 Rule 1:
刪 PostgreSQLDiskGrowthRate (500MB/h 增長 → 觸發 WAL 正常行為誤報)
加 HostDiskUsageHigh (>80% for 10m, warning)
加 HostDiskUsageCritical (>90% for 5m, critical)
兩者 labels.supersedes='PostgreSQLDiskGrowthRate' 供追溯
(待 deploy-alerts workflow 下次 apply 到 Prometheus)
4. DB 即時 mark deprecated (避免等 alerts yaml 部署前 Hermes 又推):
UPDATE alert_rule_catalog SET review_status='deprecated' WHERE rule_name='PostgreSQLDiskGrowthRate'
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-19 19:39:05 +08:00 |
|
OG T
|
6ab0ce9c75
|
feat(aiops): Hermes rule quality advisor — E3 AI 規則品質建議 (保守版)
CD Pipeline / build-and-deploy (push) Successful in 8m22s
實證 rule_stats 跑完後發現 2 條 100% noise_rate 規則:
- PostgreSQLDiskGrowthRate (tp=0 fp=2)
- NoAlertsReceived2Hours (tp=0 fp=1)
加上 MoWoooWorkDown (33%), KubePodCrashLooping (25%)
新增 hermes_rule_quality_job.py (~210 行):
每日 04:00 Taipei 分析 alert_rule_catalog:
- threshold: noise_rate >= 0.7 AND 樣本 >= 5
- 為每條寫 aol('rule_rejected', proposed_action='review_or_deprecate')
- 推 Telegram 摘要給 SRE group
統帥鐵律對齊:
✅ 不自動改 review_status (人工決策 deprecate,AI 只推建議)
✅ threshold 作為「觸發討論」而非「最終決策」
✅ aol(rule_rejected) 留 trail,未來可升級 LLM 辯證
解鎖 E3 Hermes 基礎: 後續可加 LLM 分析假報真因 (expr 缺 for: window、
label match 太寬泛、metric 本身 noisy 等),產出具體改進建議.
Wire main.py lifespan asyncio.create_task()
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-19 18:11:26 +08:00 |
|
OG T
|
e677773e39
|
fix(asset_scanner): Pod→Deployment via ReplicaSet 橋樑 (relationship 漏掉修復)
CD Pipeline / build-and-deploy (push) Successful in 9m31s
Review 盲點: 實測 asset_relationship 52 筆,但都是 Pod→StatefulSet + Pod→ConfigMap,
完全沒有 Pod→Deployment!
真因:
K8s 中 Pod.ownerReferences[0].kind = 'ReplicaSet' (99% 案例)
Deployment 管 ReplicaSet 管 Pod (兩層 owner chain)
原 code 只 match kind in (deployment/statefulset/daemonset) → 跳過 ReplicaSet
→ Pod→Deployment 關係全部漏掉
修復 v3.1:
0. 新增 collect replicasets pass (僅作為 bridge,不寫 asset_inventory)
建 rs_to_deployment map: {ns/rs_name: deployment_name}
2. Pod ownerRef.kind='ReplicaSet' → 反查 rs_to_deployment → 建 Pod→Deployment
預期效果:
- asset_relationship 從 52 → 150+ (所有 Deployment-managed Pod 都有 relationship)
- OpenClaw blast_radius 計算 Deployment 影響的 Pod 數 = 正確
不寫 ReplicaSet 為 asset (他是 ephemeral 中介,滾動更新會大量產生,污染 inventory)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-19 17:26:57 +08:00 |
|
OG T
|
c8b263db06
|
fix(coverage_evaluator): KM 欄位修正 ke.body → ke.content + 擴大 title 匹配
CD Pipeline / build-and-deploy (push) Has been cancelled
實測 df71c9a 部署後 coverage_evaluator 生效:
- monitoring: 2 hosts match Prometheus targets
- alerting: 74 筆 (22 green + 52 red)
- km: 0 (錯誤: column "ke.body" does not exist)
真因: knowledge_entries 表欄位是 'content' 不是 'body'
修復: ke.content ILIKE '%name%' OR ke.title ILIKE '%name%'
同時清 unused import (typing.Any)
下輪 coverage_evaluator tick 將正確 UPDATE auto_km_creation 維度
解鎖完整 3 維 coverage (monitoring/alerting/km)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-19 17:24:46 +08:00 |
|
OG T
|
92349bc37c
|
feat(aiops): asset_change_tracker — 8 張 0 writer 表全數上線
CD Pipeline / build-and-deploy (push) Has been cancelled
Review 盲點 10: asset_change_event 仍 0 筆 (最後一張 0 writer 表)
新增 asset_change_tracker_job.py (~180 行):
每 1h 比對最近兩次 asset_discovery_run,寫 asset_change_event:
✅ asset_added: newer run 有但 older run 沒有 (EXCEPT SET)
✅ asset_removed: older 有但 newer 沒有
✅ lifecycle_changed: asset_inventory.lifecycle_state='deprecated' 且 updated_at 近 2h
使用 SET EXCEPT 避免 N+1, 單次 INSERT 完成所有 diff
8 張 ADR-090 0 writer 表到此全數有 writer:
✅ asset_inventory / asset_discovery_run / asset_coverage_snapshot
/ asset_relationship / asset_change_event / asset_compliance_snapshot (asset_*)
✅ alert_rule_catalog
✅ host_capacity_snapshot / capacity_violation_event (capacity_*)
Phase 7 資產盤點 + 覆蓋矩陣 + 變化追蹤完整實作.
接下來可以上 Hermes AI agent 分析品質 (deprecate noisy rules, 推薦 coverage 修復).
Wire main.py lifespan asyncio.create_task()
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-19 17:18:34 +08:00 |
|
OG T
|
df71c9a37b
|
feat(aiops): rule_stats_updater — 計算 noise_rate + true/false positive
CD Pipeline / build-and-deploy (push) Successful in 8m26s
Review 盲點 5: alert_rule_catalog 68 筆但 noise_rate/TP/FP/last_fired_at 全 NULL
新增 rule_stats_updater_job.py (~170 行):
每 1h UPDATE 全表 alert_rule_catalog,從 incidents + approval_records 推算:
- last_fired_at = max(incidents.created_at WHERE alertname=rule_name)
- true_positive_count = count incidents.status='RESOLVED' past 30d
- false_positive_count = count approval_records.status='EXPIRED' past 30d
(EXPIRED = 48h 無人處理,視為假警報 proxy)
- noise_rate = fp / (tp + fp)
窗口: 30 天 (可配置)
使用單一 UPDATE + subquery,避免 N+1 (68 rule × 3 query = 204 queries → 1 query)
解鎖 E3 Hermes:
後續 Hermes AI agent 讀 alert_rule_catalog WHERE noise_rate > 0.5
提案 review_status='deprecated' 或 superseded_by_rule_id
Wire main.py lifespan asyncio.create_task()
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-19 17:05:30 +08:00 |
|