Commit Graph

11 Commits

Author SHA1 Message Date
OG T
9d6aa7ea45 feat(trust): ADR-088 Trust Score persistence: core of L4 auto-approval
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 10m40s
TrustScoreManager is upgraded from in-memory storage to PostgreSQL persistence,
so trust scores no longer reset to zero after a Pod restart and the AI can actually accumulate enough score to reach the L4 auto-approval threshold.

Changes:
- migrations/adr088_trust_score_persistence.sql: trust_records table
- db/models.py: TrustRecordDB ORM model
- repositories/interfaces.py: ITrustRepository Protocol
- repositories/trust_repository.py: PG upsert ON CONFLICT DO UPDATE
- services/trust_engine.py: bulk_load() startup warm-up
- services/learning_service.py: _persist_trust() + 2 call sites
- main.py: load_all() → bulk_load() at startup

Flow: 5 approvals → score=5 written to DB → Pod restart → warm-up reads it back
      → evaluate_adjusted_risk MEDIUM→LOW → automatic execution
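
The persistence flow above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this commit: `InMemoryTrustRepository` stands in for the PostgreSQL repository, `record_approval` and the method signatures are hypothetical, and the threshold of 5 is taken from the "5 approvals" step in the flow.

```python
from dataclasses import dataclass, field
from typing import Dict, Protocol

# Assumption: the auto-approval threshold equals the "5 approvals" in the flow.
AUTO_APPROVE_THRESHOLD = 5


class ITrustRepository(Protocol):
    """Hypothetical shape of the repository interface."""

    def upsert(self, key: str, score: int) -> None: ...
    def load_all(self) -> Dict[str, int]: ...


@dataclass
class InMemoryTrustRepository:
    """Stand-in for the PostgreSQL-backed repository (illustration only)."""

    rows: Dict[str, int] = field(default_factory=dict)

    def upsert(self, key: str, score: int) -> None:
        # Mimics INSERT ... ON CONFLICT DO UPDATE: the last write wins per key.
        self.rows[key] = score

    def load_all(self) -> Dict[str, int]:
        return dict(self.rows)


class TrustScoreManager:
    def __init__(self, repo: ITrustRepository) -> None:
        self._repo = repo
        self._scores: Dict[str, int] = {}

    def bulk_load(self, records: Dict[str, int]) -> None:
        # Startup warm-up: restore the scores persisted before the restart.
        self._scores.update(records)

    def record_approval(self, key: str) -> int:
        # Update the in-memory score, then persist it.
        score = self._scores.get(key, 0) + 1
        self._scores[key] = score
        self._repo.upsert(key, score)
        return score

    def evaluate_adjusted_risk(self, key: str, risk: str) -> str:
        # Only MEDIUM is downgraded, and only once enough trust has accumulated.
        if risk == "MEDIUM" and self._scores.get(key, 0) >= AUTO_APPROVE_THRESHOLD:
            return "LOW"
        return risk
```

A new `TrustScoreManager` built after a simulated restart returns `MEDIUM` until `bulk_load(repo.load_all())` runs, which is the gap the warm-up step closes.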

2026-04-17 ogt + Claude Sonnet 4.6 (Asia-Pacific): ADR-088

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-17 16:14:44 +08:00
OG T
66c4eda27a feat(Phase 3): AgentSession learning wiring: record_agent_session() + orchestrator debate signal
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- learning_service.py: add record_agent_session(): 5-Agent debate result → Redis analytics
  Critic challenge + matched_playbook_id → mild negative EWMA; all_agents_degraded is recorded as a governance event
- agent_orchestrator.py: best-effort call to record_agent_session() after run_agent_debate() completes
  All Phase 3 L7×D2 learning signals are now wired up

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:00:18 +08:00
OG T
fb1bbd0e20 feat(Phase 3): complete the learning loop: Root cause 3 + diagnosis feedback + knowledge forgetting + fine-tune pipeline
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- approval_execution.py: _run_post_execution_verify() now calls record_verification_result()
  Closes Root cause 3: environment verification results (success/degraded/failed/timeout) are no longer orphaned
- learning_service.py: add record_verification_result(): verification result → Redis + Playbook EWMA
- learning_service.py: add record_diagnosis_outcome(): writes back a negative signal for misdiagnoses (L3×D4)
- jobs/knowledge_decay_job.py: new 30d knowledge-forgetting job (unreferenced draft/review → archived)
- services/finetune_exporter.py: new weekly JSONL export (EvidenceSnapshot × AgentSession)
- main.py: mount knowledge_decay_loop (24h) + finetune_export_loop (7d)
- MASTER §8: record that all Phase 3 core changes have landed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 20:57:43 +08:00
OG T
655d1a568a feat(Phase 5): Declarative remediation abstraction + Blast Radius tiered control, all complete
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
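## Phase 5 deliverables (ADR-086)

### New services (4)
- blast_radius_calculator.py: blast radius calculator (0-100, pure function)
  - base scores for 18 kubectl actions + namespace multiplier + special-flag adjustments
  - HARD_RULES always block: delete ns/pv/pvc/clusterrole + rm -rf + DROP TABLE
  - tiers: ≤10 auto / 11-50 human / 51-99 dual / 100 blocked
- declarative_remediation.py: DeclarativeSpec immutable spec (frozen dataclass)
  - evaluate() wraps Blast Radius + dry-run + rollback_plan + constraints
  - rollback_plan is derived automatically from the kubectl action type (no LLM call)
- gitops_pr_service.py: Gitea Issue review for high-risk remediation (tier=dual)
  - includes Blast Radius + target state + rollback plan + dual-review workflow
  - guarded by the AIOPS_P5_GITOPS_PR flag
- rollback_manager.py: automatic rollback on verification failure
  - first checks that rollout history has ≥ 2 revisions, so there is always a version to roll back to
  - kubectl rollout undo + 120s convergence wait

### decision_manager.py wiring (AIOPS_P5_BLAST_RADIUS_CHECK)
- _auto_execute() inserts a tier guard after the safety guard and before ApprovalRequest
- blocked → always blocked + human-review notification
- dual → asynchronous GitOps Issue + escalation to human review
- human → escalation to human review (no automatic execution)
- auto (≤10) → existing automatic execution flow
- Failure fallback: calculation error → conservatively escalate to human review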
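
The tier boundaries and the conservative fallback above can be sketched as follows. This is a minimal illustration, not the code in decision_manager.py: `classify_tier`, `decide`, and the action strings are hypothetical names, and the calculator is passed in as a plain callable rather than the real blast_radius_calculator.py.

```python
from typing import Callable, Dict

# Hypothetical action names, one per tier described in the commit message.
TIER_ACTIONS: Dict[str, str] = {
    "blocked": "block_and_notify",
    "dual": "open_gitops_issue_and_escalate",
    "human": "escalate_to_human",
    "auto": "auto_execute",
}


def classify_tier(score: int) -> str:
    """Map a 0-100 blast radius score to a tier: ≤10 / 11-50 / 51-99 / 100."""
    if score >= 100:
        return "blocked"
    if score >= 51:
        return "dual"
    if score >= 11:
        return "human"
    return "auto"


def decide(command: str, calculate: Callable[[str], int]) -> str:
    """Return the action for a command, escalating to a human on any error."""
    try:
        tier = classify_tier(calculate(command))
    except Exception:
        # Failure fallback: a broken calculator must never lead to auto-execution.
        return TIER_ACTIONS["human"]
    return TIER_ACTIONS[tier]
```

Catching every exception here is deliberate: the guard fails closed, so a calculator bug degrades to human review instead of unreviewed execution.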

### learning_service.py
- record_declarative_outcome(): records the DeclarativeSpec execution result
  anomaly_key=declarative:{incident_id}, including blast_radius_score/tier/rollback

2026-04-15 ogt + Claude Sonnet 4.6 (Asia-Pacific): Phase 5 fully complete

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 16:06:54 +08:00
OG T
7da64eaad2 feat(Phase 3): learning loop rebuild: three root-cause fixes + 2x EWMA + Evolver Agent
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 19m7s
Type Sync Check / check-type-sync (push) Failing after 1m18s
ADR-083 Phase 3 learning loop rebuild:

**Three root-cause fixes**
- approval_execution.py: fire-and-forget create_task → await asyncio.wait_for(timeout=30) × 2
  (success path L265 + failure path L353; a timeout records the learning_trigger_timeout metric and does not crash the main flow)
- models/approval.py: ApprovalRequestBase gains a matched_playbook_id field
- decision_manager.py: _auto_execute fills matched_playbook_id when it creates the ApprovalRequest
- learning_service.py: dual-path lookup of _matched_pb_id (matched_playbook_id + metadata fallback)
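
The switch from fire-and-forget to a bounded await, described in the first bullet above, might look roughly like this. It is a sketch under assumptions: `trigger_learning` and the `metrics` dict are hypothetical stand-ins, and only `asyncio.wait_for` and the 30-second timeout come from the commit message.

```python
import asyncio
from typing import Awaitable, Callable, Dict

# Hypothetical in-process counter standing in for a real metrics client.
metrics: Dict[str, int] = {"learning_trigger_timeout": 0}


async def trigger_learning(
    learn: Callable[[], Awaitable[None]], timeout: float = 30.0
) -> bool:
    """Await the learning call, but never let a slow call break the caller."""
    try:
        await asyncio.wait_for(learn(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        # The learning call is cancelled by wait_for; record it and move on.
        metrics["learning_trigger_timeout"] += 1
        return False
```

Compared with `asyncio.create_task`, the caller now knows whether learning finished, at the cost of waiting up to the timeout on each path.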

**2x EWMA negative reinforcement**
- models/playbook.py: add trust_score: float = 0.3 (EWMA dynamic trust field)
- repositories/playbook_repository.py: update_stats adds EWMA
  success: trust = 0.9 × old + 0.1 × 1.0
  failure: trust = 0.8 × old + 0.2 × 0.0 (decays 2x faster)
  trust < 0.1 → log warning, wait for Evolver to archive
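
The two EWMA formulas above reduce to one update with an outcome-dependent smoothing factor. The sketch below is illustrative only: `update_trust` and the constant names are hypothetical, while the coefficients and the 0.3 default come from the commit message.

```python
SUCCESS_ALPHA = 0.1  # success: trust = 0.9 × old + 0.1 × 1.0
FAILURE_ALPHA = 0.2  # failure: trust = 0.8 × old + 0.2 × 0.0
DEFAULT_TRUST = 0.3
WARN_THRESHOLD = 0.1  # below this the commit logs a warning


def update_trust(old: float, success: bool) -> float:
    """Apply one asymmetric EWMA step toward 1.0 (success) or 0.0 (failure)."""
    alpha = SUCCESS_ALPHA if success else FAILURE_ALPHA
    target = 1.0 if success else 0.0
    return (1.0 - alpha) * old + alpha * target


def failures_until_warning(start: float = DEFAULT_TRUST) -> int:
    """Count consecutive failures needed to push trust below the threshold."""
    trust, count = start, 0
    while trust >= WARN_THRESHOLD:
        trust = update_trust(trust, success=False)
        count += 1
    return count
```

From the default of 0.3, one success gives 0.37 and one failure gives 0.24; five consecutive failures bring trust to about 0.098, just under the warning threshold.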

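**Evolver Agent (new)**
- services/playbook_evolver.py: three functions, all static rules
  1. Low-trust archival: trust < 0.1 → DEPRECATED
  2. Dormancy archival: unused for 30d AND trust < 0.5 → DEPRECATED
  3. Similarity merge: symptom Jaccard > 0.9 → keep the higher-trust Playbook, archive the lower-trust one
  AIOPS_P3_EVOLVER_ENABLED=False, off by default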
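
The three static rules above can be expressed compactly. This sketch assumes a simplified `Playbook` record with hypothetical field names, treats "unused for 30d" as `days_unused >= 30`, and defines Jaccard similarity over symptom sets; none of these details are confirmed by the commit.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Playbook:
    """Simplified stand-in for the real Playbook model (hypothetical fields)."""

    id: str
    trust: float
    days_unused: int
    symptoms: FrozenSet[str]


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Intersection over union; two empty sets are treated as dissimilar."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def should_deprecate(pb: Playbook) -> bool:
    # Rule 1: low trust. Rule 2: dormant for 30 days and below 0.5 trust.
    return pb.trust < 0.1 or (pb.days_unused >= 30 and pb.trust < 0.5)


def merge_loser(a: Playbook, b: Playbook) -> Optional[Playbook]:
    """Rule 3: for near-duplicates, return the lower-trust Playbook to archive."""
    if jaccard(a.symptoms, b.symptoms) > 0.9:
        return a if a.trust < b.trust else b
    return None
```

Because every rule is a pure function of the Playbook fields, the whole pass can be tested without touching a database or an LLM.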

**Docs**
- ADR-083 learning loop rebuild
- MASTER §8 Phase 3 completion record

AIOPS_P3_ENABLED=False (default); the skeleton is in place, pending the commander's approval to enable it

Co-Authored-By: Claude Sonnet 4.6 (Asia-Pacific) <noreply@anthropic.com>
2026-04-15 14:01:37 +08:00
OG T
e1e3bba296 refactor(api): Phase 22 technical-debt fixes - business logic layering
Some checks failed
E2E Health Check / e2e-health (push) Has been cancelled
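P2.3: LearningService.get_learning_summary() business logic moved to the Service layer
- Repository only provides raw statistics
- Service computes best_action and learning_status

P2.6: Playbook similarity calculation extracted
- Add src/utils/similarity.py
- Repository imports from utils and no longer defines the algorithm

2026-03-31 Claude Code (chief architect technical-debt fixes)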

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-31 18:55:06 +08:00
OG T
f5b19cf108 feat(learning): implement Playbook confidence adjustment mechanism (ADR-030)
- Add _promote_playbook: a high rating raises confidence by 0.1
- Add _demote_playbook: a low rating lowers confidence by 0.15
- Add find_by_source_incident: look up a Playbook by incident_id
- Add adjust_confidence: confidence adjustment + automatic status transition
- Add Playbook.failure_rate property

Automatic status transitions:
- ai_confidence >= 0.9 + DRAFT → automatically APPROVED
- ai_confidence < 0.3 + failure_rate > 50% → automatically DEPRECATED
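
The adjustment and transition rules above can be sketched as follows. The `Playbook` fields, the clamping of confidence to [0, 1], and the `failure_rate` definition are assumptions made for illustration; only the deltas and thresholds come from the commit message.

```python
from dataclasses import dataclass

PROMOTE_DELTA = 0.1
DEMOTE_DELTA = -0.15


@dataclass
class Playbook:
    """Simplified stand-in for the real model (hypothetical fields)."""

    ai_confidence: float
    status: str = "DRAFT"
    success_count: int = 0
    failure_count: int = 0

    @property
    def failure_rate(self) -> float:
        # Assumption: failures divided by total executions, 0.0 when unused.
        total = self.success_count + self.failure_count
        return self.failure_count / total if total else 0.0


def adjust_confidence(pb: Playbook, delta: float) -> Playbook:
    """Apply a confidence delta, then run the automatic status transitions."""
    # Assumption: confidence is clamped to the [0, 1] range.
    pb.ai_confidence = min(1.0, max(0.0, pb.ai_confidence + delta))
    if pb.ai_confidence >= 0.9 and pb.status == "DRAFT":
        pb.status = "APPROVED"
    elif pb.ai_confidence < 0.3 and pb.failure_rate > 0.5:
        pb.status = "DEPRECATED"
    return pb
```

Demotion is larger than promotion (0.15 versus 0.1), so a Playbook loses confidence faster than it gains it.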

Tests: all 13 cases pass

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 22:10:49 +08:00
OG T
b97f9364fb feat(k8s): add Worker HPA + fix non-AI confidence values
Wave 2 Deployment:
- Worker HPA: min:1 max:3, CPU 70%, Memory 80%
- Prerequisites: XCLAIM + terminationGracePeriodSeconds:90 (Wave 1)
- More conservative scaling policy than API/Web (120s up, 600s down)

Confidence Fix:
- Non-AI analysis sources (fallback/playbook/historical/consensus) set confidence=0.0
- Avoids conflating AI confidence with other metrics (success rate/similarity)
- Affects: github_webhook, decision_manager, intent_classifier, learning_service

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 16:09:37 +08:00
OG T
50c055b547 feat(api): Phase D-G P0 fix - modular Learning Repository
Added:
- ILearningRepository Protocol (interfaces.py)
- LearningRepository (Redis persistence layer)
- Learning API endpoints (/api/v1/learning/*)
- LearningService.get_recommended_fix() method
- LearningService.get_learning_summary() method

Fixed:
- Service no longer depends directly on the Redis Client (it goes through the Repository)
- Conforms to the leWOOOgo building-block principle
- Chief architect review: 74/100 → 92/100

Updated:
- ADR-030: add Phase D-G P0 fix section
- Skill 02: v1.9 → v2.0
- Runner fix: sequential builds resolve the _runner_file_commands conflict

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 11:03:51 +08:00
OG T
138ef0c2db fix(api): fix 7 lint errors (unused imports + zip strict + dict comprehension)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-27 14:42:47 +08:00
OG T
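3256142d29 feat(api): ADR-030 Phase 5 continuous learning loop
Learn from execution results and continuously optimize decisions:

1. learning_service.py - continuous learning service
   - process_execution_result(): handles execution results
   - process_human_feedback(): handles human feedback
   - Automatically adjusts trust (success +1 / failure resets to zero)
   - Updates Playbook statistics
   - Automatically extracts a Playbook from successful cases

2. approval_execution.py - integrates the learning trigger
   - Triggers learning after a successful execution
   - Triggers learning after a failed execution
   - _trigger_learning(): non-blocking call to the learning service

Learning flow:
Execution complete → LearningService.process_execution_result()
  ├─ Success: TrustEngine +1 point + Playbook statistics update
  └─ Failure: TrustEngine reset to zero + failure reason recorded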
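
The two branches of the learning flow above can be sketched as a small in-memory model. This is an assumption-laden illustration: the `TrustEngine` methods, the `action_key` parameter, and the returned dict are hypothetical, and Playbook statistics are omitted.

```python
from typing import Dict, List, Optional


class TrustEngine:
    """In-memory trust counter (hypothetical interface)."""

    def __init__(self) -> None:
        self._scores: Dict[str, int] = {}

    def score(self, key: str) -> int:
        return self._scores.get(key, 0)

    def reward(self, key: str) -> int:
        self._scores[key] = self.score(key) + 1
        return self._scores[key]

    def reset(self, key: str) -> int:
        self._scores[key] = 0
        return 0


def process_execution_result(
    engine: TrustEngine,
    failures: List[str],
    action_key: str,
    success: bool,
    failure_reason: Optional[str] = None,
) -> Dict[str, object]:
    """Success adds one trust point; failure resets trust and logs the reason."""
    if success:
        return {"trust": engine.reward(action_key), "recorded_failure": False}
    failures.append(failure_reason or "unknown")
    return {"trust": engine.reset(action_key), "recorded_failure": True}
```

Resetting to zero on a single failure is a strict policy: trust earned over many successes is lost at once, which keeps automatic execution conservative.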

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 22:19:41 +08:00