Commit Graph

1426 Commits

Author SHA1 Message Date
OG T
da871fc149 chore(db): 補齊 AIOps P1/P2/P6 migration SQL(已套用到 prod)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
incident_evidence / agent_sessions / ai_governance_events 三表
IF NOT EXISTS,production DB 已手動確認存在並 apply。

2026-04-15 ogt + Claude Sonnet 4.6(亞太)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 22:02:17 +08:00
OG T
76558a3cd9 feat(AIOps): 全開 P1-P6 feature flags + Nemotron + offline replay loop
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- configmap: 啟用 AIOPS_P1~P6 全部總開關與子開關
- configmap: ENABLE_NEMOTRON_COLLABORATION=true(回歸 120s timeout)
- feature_flags.py: 補齊 AIOPS_P6_GOVERNANCE_ENABLED 缺失欄位
- main.py: 掛載 run_offline_replay_loop(ADR-087 Phase 6)

2026-04-15 ogt + Claude Sonnet 4.6(亞太)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:59:51 +08:00
OG T
ecfb7148bf fix(prod): 接通 YAML 規則引擎與自動執行路徑 — 架構核心斷點
Some checks are pending
CD Pipeline / build-and-deploy (push) Has started running
架構斷點根因:
  YAML 規則引擎(alert_rules.yaml)是人工審閱的權威動作來源,
  但自動執行路徑只讀 proposal_data["kubectl_command"](LLM 生成),
  兩者完全脫節 → HostHighCpuLoad 得到 kubectl restart,DockerContainerUnhealthy
  的 SSH 指令被 LLM 的 kubectl 覆蓋。

修復策略:
  在 auto_execute 入口,先查 YAML match_rule:
  1. YAML → NO_ACTION(如 HostHighCpuLoad)→ 立即返回,不執行任何操作
  2. YAML → 非 kubectl 指令(如 ssh docker restart)→ 覆蓋 LLM action,
     後續 infrastructure SSH 路由才能生效

影響:
  - HostHighCpuLoad / NodeCPUUsageHigh → 停止自動執行,降級人工審核
  - DockerContainerUnhealthy → SSH docker restart(若 labels 有 host/container)

2026-04-15 ogt + Claude Sonnet 4.6(亞太): 生產緊急修復第三批

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:50:25 +08:00
OG T
3696fb5938 fix(prod): 修復 host_resource 誤發 K8s kubectl + 自動執行重複風暴
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
1. decision_manager: host_resource 告警(HostHighCpuLoad 等)
   不得執行 kubectl 操作 → 降級人工審核
   根因:原本只擋 infrastructure,host_resource 漏進 K8s 路徑
   → 導致 kubectl rollout restart deployment/HostHighCpuLoad 被真實執行

2. decision_manager: auto_execute 路徑補入 Redis cooldown
   同一 target 5 分鐘內最多自動執行 2 次,防止 awoooi-worker 3x 風暴
   根因:decision_manager 自動執行路徑完全無冷卻保護

2026-04-15 ogt + Claude Sonnet 4.6(亞太): 生產緊急修復第二批

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:45:46 +08:00
OG T
67f437043a fix(prod): 修復四個生產致命 bug — outcome 寫入 / OpenClaw / Telegram 通知 / LLM 規則顯示
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
1. decision_manager: 移除 UPDATE incidents 中的 verification_result 欄位
   (incidents 表無此欄位 → 所有 outcome 寫入失敗 outcome_write_failed)

2. failure_watcher: get_openclaw_service → get_openclaw
   (函數名錯誤 → OpenClaw 分析全部 ImportError 崩潰)

3. failure_watcher: tg.send_message → tg.send_notification
   (TelegramGateway 無 send_message 方法 → 修復通知無法送出)

4. decision_manager: expert_analyze 補齊 initial_diagnosis / diagnosis_description key
   (openclaw.py 讀這兩個 key,但 expert_analyze 只有 matched_rule / description
    → LLM 永遠看到 Matched Rule=unknown,無法正確分析)

2026-04-15 ogt + Claude Sonnet 4.6(亞太): 生產緊急修復

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:41:31 +08:00
OG T
e465ee1936 docs(Phase 3): Evolver 演練完成 — exit condition #6 通過
- MASTER spec §3/§7/§8:三處 Evolver 演練勾選完成
- LOGBOOK:演練結果記錄 + 下一步更新為 7 天生產監控

演練結果:POST /api/v1/learning/evolver/run → HTTP 200 errors:[] 2026-04-15

ADR-083 Phase 3 — 2026-04-15 ogt + Claude Sonnet 4.6(亞太)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:24:33 +08:00
AWOOOI CD
e449b275aa chore(cd): deploy e5e94f5 [skip ci] 2026-04-15 13:19:00 +00:00
OG T
5f86da52d9 docs(LOGBOOK): Phase 3 全部落地記錄 — 6 個元件 + 退出條件清單
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:10:47 +08:00
OG T
e5e94f5fda fix(Phase 3): 管理員端點傳 force=True — 確保 Evolver 演練不受 flag 阻擋
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 14m56s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:09:13 +08:00
OG T
01fb531c02 fix(Phase 3): Evolver force=True bypass flag + 清理未使用 import
- run_evolver(force=True):管理員手動端點可繞過 feature flag
- 移除 typing.Any 未使用 import
- 移除 _merge_similar 中冗餘的 calculate_jaccard_similarity import

ADR-083 Phase 3 — 2026-04-15 ogt + Claude Sonnet 4.6(亞太)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:09:01 +08:00
OG T
4718c7667c feat(Phase 3): Evolver loop 排程 + 手動觸發端點 — 合併演練閘道完工
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- playbook_evolver.py: 新增 run_evolver_loop()(24h 無限迴圈)
- main.py: 掛載 run_evolver_loop asyncio.create_task
- api/v1/learning.py: POST /api/v1/learning/evolver/run(Phase 3 exit #6 演練端點)
- MASTER §8: 補錄 66c4eda AgentSession + 本次 Evolver 完整退出條件清單

ADR-083 Phase 3 — 2026-04-15 ogt + Claude Sonnet 4.6(亞太)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:07:56 +08:00
OG T
66c4eda27a feat(Phase 3): AgentSession 學習接線 — record_agent_session() + orchestrator 辯證訊號
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- learning_service.py: 新增 record_agent_session() — 5-Agent 辯證結果 → Redis analytics
  Critic 質疑 + matched_playbook_id → 輕度負向 EWMA;all_agents_degraded 記錄治理事件
- agent_orchestrator.py: run_agent_debate() 完成後 best-effort 呼叫 record_agent_session()
  Phase 3 L7×D2 學習訊號全部接線完成

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:00:18 +08:00
OG T
fb1bbd0e20 feat(Phase 3): 學習閉環補完 — Root cause 3 + 診斷 feedback + 知識遺忘 + Fine-tune 管線
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- approval_execution.py: _run_post_execution_verify() 補接 record_verification_result()
  Root cause 3 終結:環境驗證結果(success/degraded/failed/timeout)不再孤立
- learning_service.py: 新增 record_verification_result() — 驗證結果 → Redis + Playbook EWMA
- learning_service.py: 新增 record_diagnosis_outcome() — 誤診負向訊號回寫(L3×D4)
- jobs/knowledge_decay_job.py: 新建 30d 知識遺忘 Job(未引用 draft/review → archived)
- services/finetune_exporter.py: 新建每週 JSONL 匯出(EvidenceSnapshot × AgentSession)
- main.py: 掛載 knowledge_decay_loop(24h)+ finetune_export_loop(7d)
- MASTER §8: Phase 3 核心改造項全部落地記錄

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 20:57:43 +08:00
AWOOOI CD
e23e49c13b chore(cd): deploy ff448ad [skip ci] 2026-04-15 12:47:59 +00:00
OG T
ff448ad282 fix(incidents): 修復兩個 DB 完整性問題
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 14m52s
1. alertname IS NULL(4 筆歷史修復 + code fallback)
   - incident_repository.py: alertname 補 labels["alertname"] fallback
   - SQL UPDATE: 用 signals->0->>'alert_name' 修補存量 4 筆 NULL 記錄

2. TYPE-1 incidents 永遠卡 INVESTIGATING(18 筆修復 + code fix)
   - webhooks.py: TYPE-1 短路後立即加 resolve_incident background task
   - SQL UPDATE: 批次將存量 TYPE-1 INVESTIGATING → RESOLVED

根因: ADR-073 TYPE-1 短路設計只發通知,未關閉 incident 狀態
      backup/heartbeat 告警每小時觸發 → 無限累積 INVESTIGATING 記錄

2026-04-15 ogt + Claude Sonnet 4.6(亞太)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 20:38:08 +08:00
AWOOOI CD
d46f230c1f chore(cd): deploy 6583870 [skip ci] 2026-04-15 12:18:44 +00:00
OG T
65838708ce fix(format): 剩餘 send_notification raw text 改為 ADR-075 TYPE-X 格式
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 18m11s
- decision_manager.py: 自動修復通知改為 TYPE-2 ├─/└─ 樹狀格式
- gitea_webhook_service.py: Code Review 通知改為 TYPE-1 格式,移除 ═══ border

至此所有 3 個外部 send_notification 呼叫者均符合 ADR-075 格式規範:
  1. ai_router.py — TYPE-1 AI Provider 不可用(已於 3ce5025 修復)
  2. decision_manager.py — TYPE-2 自動修復完成/失敗(本 commit)
  3. gitea_webhook_service.py — TYPE-1 Code Review(本 commit)

2026-04-15 ogt + Claude Sonnet 4.6(亞太): Phase 6 format enforcement

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 20:05:49 +08:00
OG T
ee486fbd2b docs(logbook): 2026-04-15 深夜收官 — P0/P2 RCA + Phase 6 閉環
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 19:58:09 +08:00
OG T
05b774386b feat(Phase 6): AI SLO REST API — GET /api/v1/ai/slo 收官
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
ADR-087 Phase 6 自我治理閉環最後一塊拼圖:

1. api/v1/ai_slo.py — GET /api/v1/ai/slo
   - Service 層快取優先(TTL 5min,AiSloCalculator.get_cached_report)
   - force_refresh=true 強制重算(AiSloCalculator.run)
   - Router 層零 Redis 直接存取(leWOOOgo 積木化鐵律)

2. main.py — 路由掛載 ai_slo_v1.router(prefix=/api/v1)

3. MASTER §8 Living Changelog 追加:
   - P0 告警靜默 3 根因 RCA 完整紀錄
   - P2 飛輪斷鏈修復摘要
   - Phase 6 全元件完成清單

Phase 6 退出條件 5/6 已達(生產驗證待 image 上線)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 19:57:26 +08:00
OG T
14579ce149 fix(heartbeat): 系統沉默閾值 2h → 24h,消除假陽性告警
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
無事故期間系統正常不寫 KM,2h 必然誤報。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 19:51:01 +08:00
OG T
3ce5025ca7 fix(alerts): 3 個飛輪沉默節點 — DIAGNOSE routing + 心跳停用 + 通知格式
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
1. openclaw.py: DIAGNOSE 移除 require_local=True
   - v4.3 已決定 NIM 為主力且無隱私問題
   - require_local=True 導致所有 provider 被 privacy_skip → 告警永遠失敗
   - 修後 DIAGNOSE 走 _full_fallback_chain(NIM → Gemini → Claude)

2. ai_router.py: require_local 失敗通知改為 ADR-075 TYPE-1 格式
   - 禁止純文字 raw notification(統帥鐵律:所有訊息必須符合格式模板)
   - 改用 ├─ / └─ 樹狀結構 + 語義化標籤

3. main.py: 停用 Telegram 心跳監控
   - 心跳已轉發到另一個 Telegram 群組,不需在此頻道重複發送

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 19:49:43 +08:00
AWOOOI CD
2d85b49cc0 chore(cd): deploy f9ba200 [skip ci] 2026-04-15 11:47:40 +00:00
OG T
f9ba200638 fix(db): Phase 6 migration 三條 CREATE INDEX 拆開各自 execute
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
asyncpg 不支援 prepared statement 內多條 SQL 指令,
原本一個 text("""...""") 包含三條 CREATE INDEX 導致 CrashLoopBackOff。
拆成三個獨立 conn.execute() 呼叫。

2026-04-15 ogt + Claude Sonnet 4.6(亞太)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 19:37:58 +08:00
AWOOOI CD
160689a110 chore(cd): deploy f045506 [skip ci] 2026-04-15 11:31:02 +00:00
OG T
f045506abd fix(flywheel): P2 Approval 逾期不結案 → KM 學習鏈斷鏈修復
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 12m11s
問題根因:
  PENDING approval 無人處置超過 48h 後應自動 EXPIRED,
  但 get_pending_approvals() 只在用戶開 UI 時觸發,
  若無人開 UI → Incident 永遠 PENDING → KM 永遠不寫入
  → Phase 6 SLO human_override_rate 低估,EWMA 缺少負向樣本。

修復:
  1. anomaly_counter.py: 新增 "timeout_ignored" disposition 類型,
     與 auto_repair / human_approved / manual_resolved 區分
  2. incident_service.py: resolve_incident() 新增 resolution_type 參數,
     resolution_type="timeout" 時記錄 "timeout_ignored" 而非 "manual_resolved"
  3. jobs/approval_timeout_resolver.py (新): 每小時掃描逾期 PENDING approval,
     批次標記 EXPIRED,對每筆有 incident_id 的記錄呼叫 resolve_incident("timeout")
  4. main.py: startup 掛載 approval_timeout_resolver 排程(interval=3600s)

效果:
  - 告警無人處置 48h → Incident 自動結案 → KM 寫入 → EWMA 取得樣本
  - disposition="timeout_ignored" 讓 SLO 計算正確區分「AI 建議被忽略」
  - 飛輪學習鏈對「無人處置告警」閉環

2026-04-15 ogt + Claude Sonnet 4.6(亞太)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 19:21:21 +08:00
AWOOOI CD
586602e7ff chore(cd): deploy f31b4e3 [skip ci] 2026-04-15 11:18:56 +00:00
OG T
f31b4e31ba fix(approval): create_approval_with_fingerprint 補注 48h expires_at 預設值
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
根因(盤點後確認):
  所有 webhook 建立 approval 的路徑(webhooks.py:908/1426/1566)均未傳
  expires_at,DB 欄位為 NULL。get_pending_approvals() 的自動過期邏輯
  WHERE expires_at < now 對 NULL 永遠為 False → 殭屍 PENDING 永不清理。

修正策略:
  在 create_approval_with_fingerprint()(告警 approval 唯一共用入口)
  注入預設 48h TTL,一次覆蓋全部 3 個 webhook 呼叫點。
  手動 API 建立(approvals.py)自行傳 expires_at,不受影響。

與 2026-04-15 24h PENDING_TTL_HOURS 補丁協同工作:
  - 24h: find_by_fingerprint 不再收斂過期 PENDING → 新告警重新觸發通知
  - 48h: get_pending_approvals auto-expire → UI 殭屍記錄自動清除

2026-04-15 ogt + Claude Sonnet 4.6(亞太):完整盤點後補完

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 19:08:17 +08:00
AWOOOI CD
1d22376b86 chore(cd): deploy fab65e7 [skip ci] 2026-04-15 11:06:22 +00:00
OG T
fab65e7d7a fix(alerts): PENDING 收斂無 TTL → 老記錄永久封鎖 Telegram 告警
Some checks are pending
CD Pipeline / build-and-deploy (push) Has started running
根因:find_by_fingerprint 的 PENDING 匹配條件無時間上限,
2026-04-12 建立的 3 筆 PENDING approval records(hit=77/30/17)
持續吃掉所有同指紋告警,造成 2+ 小時 Telegram 靜音。

修正(approval_db.py):
  - PENDING_TTL_HOURS = 24:PENDING 記錄逾 24h 不再收斂新告警
  - 原本:OR(status=PENDING, created_at>=30min前)
  - 修正:OR(PENDING AND created_at>=24h前, created_at>=30min前)

緊急修復:kubectl exec 直接將 7 筆過期 PENDING 記錄設為 expired,
即時恢復 Telegram 告警流(不等部署)。

Phase 6 AI 自我治理閉環(ADR-087):
  - feat(db): 新增 ai_governance_events 表 + 3 個 index(base.py + models.py)
  - feat(svc): ai_slo_calculator.py — 7d 滾動 SLO(success/override/false_neg)
  - feat(svc): trust_drift_detector.py — Playbook 信任度極端偏態偵測
  - feat(job): kb_rot_cleaner.py — K8s API/Prom metric/老舊 incident_case 腐爛清理
  - feat(svc): decision_manager.py — 自我降級守衛(SLO 違反 → 提高門檻/保守模式)

2026-04-15 ogt + Claude Sonnet 4.6(亞太)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 18:56:26 +08:00
AWOOOI CD
37f4553349 chore(cd): deploy 4e2e665 [skip ci] 2026-04-15 08:22:53 +00:00
OG T
4e2e6652e3 fix(db): 移除 IncidentEvidence.incident_id 的重複 index 定義
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 14m50s
根本原因:incident_id 同時設定 index=True(mapped_column)
與 __table_args__ 中的 Index("ix_incident_evidence_incident_id"),
導致 table.create 生成重複的 CREATE INDEX,
觸發 "already exists" 被靜默捕捉,整個 CREATE TABLE transaction 回滾。
直接效果:Pod 啟動時 incident_evidence 表永遠不會被建立,
導致後續 ALTER TABLE 失敗 → CrashLoopBackOff。

修法:移除 mapped_column 中的 index=True,
索引由 __table_args__ 統一管理。

注意:已在 PostgreSQL 直接建立 incident_evidence 表解除 CrashLoop。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 16:13:18 +08:00
OG T
655d1a568a feat(Phase 5): Declarative 修復抽象化 + Blast Radius 分控 全部完成
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
## Phase 5 交付(ADR-086)

### 新增服務(4 個)
- blast_radius_calculator.py: 爆炸半徑計算器(0-100 純函數)
  - 18 種 kubectl 動作基礎分 + 命名空間倍率 + 特殊 flag 修正
  - HARD_RULES 永擋:delete ns/pv/pvc/clusterrole + rm -rf + DROP TABLE
  - 分級:≤10 auto / 11-50 human / 51-99 dual / 100 blocked
- declarative_remediation.py: DeclarativeSpec 不可變規格(frozen dataclass)
  - evaluate() 封裝 Blast Radius + dry-run + rollback_plan + constraints
  - rollback_plan 從 kubectl 動作類型自動推導(不呼叫 LLM)
- gitops_pr_service.py: Gitea Issue 高風險修復審核(tier=dual)
  - 含 Blast Radius + 目標狀態 + 回滾計畫 + 雙人審核流程
  - AIOPS_P5_GITOPS_PR flag 守衛
- rollback_manager.py: 驗證失敗自動回滾
  - 先驗 rollout history ≥ 2 revision,防止無版本可回滾
  - kubectl rollout undo + 120s 收斂等待

### decision_manager.py 接線(AIOPS_P5_BLAST_RADIUS_CHECK)
- _auto_execute() 在安全守衛後、ApprovalRequest 前插入分級守衛
- blocked → 永擋 + 人工審核通知
- dual → 非同步 GitOps Issue + 升級人工審核
- human → 升級人工審核(不自動執行)
- auto(≤10)→ 原有自動執行流程
- 失敗降級:計算異常 → 保守升人工

### learning_service.py
- record_declarative_outcome(): 記錄 DeclarativeSpec 執行結果
  anomaly_key=declarative:{incident_id},含 blast_radius_score/tier/rollback

2026-04-15 ogt + Claude Sonnet 4.6(亞太): Phase 5 全部完成

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 16:06:54 +08:00
AWOOOI CD
53344c201e chore(cd): deploy 14a0226 [skip ci] 2026-04-15 07:57:10 +00:00
OG T
14a02263ae feat(Phase 4): 主動巡檢 + 趨勢預測 + 8D 感官升級 全部完成
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 12m32s
## Phase 4 完整交付(ADR-084)

### 新增服務
- trend_predictor.py: numpy 線性回歸,4h 閾值突破預警,R² 信心評分
- proactive_inspector.py: 每 5 分鐘主動巡檢協調器
  - DynamicBaselineService(3σ 偏離)
  - LogAnomalyDetector(新 Drain3 pattern)
  - TrendPredictor(斜率外推 4h 預測)
  - Shadow Mode + 30 分鐘去重 + Holt-Winters 背景重訓

### 8D 感官升級(EvidenceSnapshot Phase 4 增強)
- PreDecisionInvestigator._collect_phase4_anomalies(): 決策前讀取
  ProactiveInspector 最近巡檢快取 + LogAnomalyDetector 新 pattern
- EvidenceSnapshot.anomaly_context: 新欄位,Phase 4 動態異常上下文
- DiagnosticianAgent._build_prompt(): prompt 包含 anomaly_context,
  LLM RCA 可參考動態基線偏差與趨勢預警

### 資料庫遷移
- incident_evidence: ADD COLUMN anomaly_context JSONB(冪等)

### main.py
- 啟動 run_proactive_inspector_loop() asyncio task

2026-04-15 ogt + Claude Sonnet 4.6(亞太): Phase 4 全部完成

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 15:47:05 +08:00
OG T
952c10955b fix(db): 多 replica 並行啟動競爭 — 每 table 獨立 tx + DROP INDEX IF EXISTS
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
根因:單一大 transaction 內兩個 pod 同時建同一個 table,
其中一個 CREATE INDEX 失敗 → 整個 transaction ROLLBACK
→ table 也消失 → 下次重啟同樣情況 → 無限 CrashLoop。

修法三層:
1. 每個 table 用獨立 transaction 建立(失敗不影響其他)
2. 建 table 前先 DROP INDEX IF EXISTS 清殘留孤兒 index
3. 捕捉 "already exists" 讓並行 pod 優雅跳過(不 crash)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 15:38:43 +08:00
OG T
4a6aa16a94 fix(Phase 4): 修正呼叫點遺漏傳入參數 — promql 和 sample_log
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
關聯節點檢查發現:
- dynamic_baseline_service.py: _save_baseline() 在 train_baseline() 中
  未傳入 promql/lookback_hours → PG 記錄無法追蹤訓練來源
- log_anomaly_detector.py: _save_new_cluster() 未傳入 sample_log →
  PG 記錄 LogCluster 時 sample_log 欄位為空

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 15:34:33 +08:00
OG T
bf45b80bd2 feat(Phase 3.5 + Phase 4): AI 學習成果持久化到 PostgreSQL — 修正「AI 失憶」架構缺陷
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
ADR-085: AI 學習成果不可存在 Cache

架構鐵律確立:
- PostgreSQL = System of Record(AI 的永久記憶)
- Redis = Warm Cache(加速讀取,TTL 到期從 PG 復原)

核心變更:
1. models.py: 新增 PlaybookRecord / DynamicBaselineRecord / LogClusterRecord ORM
2. base.py: ALTER TABLE playbooks 補加 trust_score / requires_approval_level 等欄位
3. playbook_repository.py: 完整雙寫實作(PG upsert + Redis cache)
4. dynamic_baseline_service.py: Holt-Winters 訓練結果寫入 PG,Redis 只作 24h warm cache
5. log_anomaly_detector.py: Drain3 cluster template 寫入 PG(UPSERT on cluster_id)
6. main.py: 啟動時執行 backfill_redis_to_pg()(Redis → PG 冪等補救)

修正的問題:
- Playbook 7天 Redis TTL 到期 → AI 失去所有修復知識
- trust_score EWMA 隨 Redis TTL 歸零 → AI 重新回到初始信任度 0.3
- Holt-Winters 基線 24h TTL → AI 每天重新學習「正常」的定義
- Drain3 cluster 沒有持久化 → AI 把已知 log pattern 反覆當新 pattern

Phase 4 新服務(requirements.txt 已加入 statsmodels + drain3 + numpy)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 15:34:04 +08:00
AWOOOI CD
9126c594a4 chore(cd): deploy 0f2ec79 [skip ci] 2026-04-15 07:28:25 +00:00
OG T
0f2ec7987c fix(db): 改用 inspect 跳過現有 table,根治 CrashLoopBackOff
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 14m42s
checkfirst=True 只跳過 CREATE TABLE,SQLAlchemy 2.0 仍對
__table_args__ Index 物件發出獨立 CREATE INDEX → duplicate error。
改法:先 inspect 取得現有 tables,只對不存在的 table 呼叫
table.create(),index 永遠只隨新 table 建立,不再 duplicate。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 15:18:25 +08:00
AWOOOI CD
8997ba70cb chore(cd): deploy a142e6e [skip ci] 2026-04-15 07:11:37 +00:00
OG T
a142e6e937 fix(db): create_all checkfirst=True 修復 CrashLoopBackOff
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 12m19s
rolling update 時 create_all 嘗試重建既有 index 導致
"ix_incident_evidence_incident_id already exists" 啟動失敗。
checkfirst=True 讓 SQLAlchemy 跳過已存在的 table/index,
init_db() 從此冪等,不再造成 CrashLoopBackOff。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 15:00:49 +08:00
OG T
777e40d618 Merge remote-tracking branch 'gitea/main'
All checks were successful
Type Sync Check / check-type-sync (push) Successful in 1m8s
2026-04-15 15:00:48 +08:00
OG T
83e0fd882d chore(types): 重新生成 shared-types — Playbook.trust_score + IncidentId3
因 Phase 0/1 新增 Playbook.trust_score 欄位,
IncidentId 型別索引序號更新為 IncidentId3,
重新執行 pnpm generate 同步 API schema → TypeScript 型別。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 15:00:44 +08:00
AWOOOI CD
d493fb9b78 chore(cd): deploy 7da64ea [skip ci] 2026-04-15 06:18:11 +00:00
OG T
7da64eaad2 feat(Phase 3): 學習閉環重建 — 三根因修復 + 2x EWMA + Evolver Agent
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 19m7s
Type Sync Check / check-type-sync (push) Failing after 1m18s
ADR-083 Phase 3 學習閉環重建:

**三根因修復**
- approval_execution.py: fire-and-forget create_task → await asyncio.wait_for(timeout=30) × 2
  (成功路徑 L265 + 失敗路徑 L353,超時記錄 learning_trigger_timeout metric,主流程不 crash)
- models/approval.py: ApprovalRequestBase 新增 matched_playbook_id 欄位
- decision_manager.py: _auto_execute 建立 ApprovalRequest 時填充 matched_playbook_id
- learning_service.py: 雙路徑查找 _matched_pb_id(matched_playbook_id + metadata fallback)

**2x EWMA 負向強化**
- models/playbook.py: 新增 trust_score: float = 0.3(EWMA 動態信任度欄位)
- repositories/playbook_repository.py: update_stats 加 EWMA
  成功: trust = 0.9 × old + 0.1 × 1.0
  失敗: trust = 0.8 × old + 0.2 × 0.0(衰減速度 2x)
  trust < 0.1 → log warning,等 Evolver 封存

**Evolver Agent(新建)**
- services/playbook_evolver.py: 三功能全靜態規則
  1. 低信任封存: trust < 0.1 → DEPRECATED
  2. 休眠封存: 30d 未使用 AND trust < 0.5 → DEPRECATED
  3. 相似合併: 症狀 Jaccard > 0.9 → 保留高 trust,封存低 trust
  AIOPS_P3_EVOLVER_ENABLED=False 預設關閉

**文件**
- ADR-083 學習閉環重建
- MASTER §8 Phase 3 完工記錄

AIOPS_P3_ENABLED=False(預設),骨架就位等統帥批准開啟

Co-Authored-By: Claude Sonnet 4.6(亞太)<noreply@anthropic.com>
2026-04-15 14:01:37 +08:00
AWOOOI CD
7edb298a75 chore(cd): deploy 42bc1df [skip ci] 2026-04-15 05:58:38 +00:00
OG T
42bc1df9f9 fix(phase2): 驗證發現兩處安全漏洞並修正
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
手動驗證執行中發現:
1. reviewer_agent.py: force push regex 只覆蓋「force push」文字順序,
   漏掉 git 實際格式「git push --force」(push 先, --force/-f 後)
   → 修正為雙向 pattern:(?:force.{0,5}push|push.{0,30}(?:--force|-f\b)).{0,30}main

2. coordinator_agent.py: Critic critical challenge 僅施 0.3 penalty,
   當原始信心 > 0.7(如 0.82)時 penalty 後仍 > 0.4 閾值,
   critical challenge 穿透到 auto-execute 路徑(驗證確認:0.82→0.52>0.4)
   → 新增 Critic REJECT 硬閘(等同 Reviewer REJECT 效力),
     在 penalty 邏輯前強制 requires_human_approval=True

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 13:48:55 +08:00
OG T
5ddba6d6e0 feat(adr-082): Phase 2 多 Agent 協作 — 5 角色辯證系統骨架上線
新增 5 個 Agent + Orchestrator + DecisionManager 接線:
- protocol.py: DiagnosisReport / ActionPlan / ReviewVerdict / CriticReport / DecisionPackage 型別系統
- DiagnosticianAgent: RCA 根因分析,confidence < 0.4 → ABSTAIN
- SolverAgent: 修復方案軍師,blast_radius 評分 + 降級 rule-based mock
- ReviewerAgent: 安全審查,HARD_RULES 靜態 pattern + blast_radius 閾值 (>50 revision, >80 reject)
- CriticAgent: 刻意唱反調,強制 3 問批判性思維,critical challenge → REJECT
- CoordinatorAgent: 純規則聚合,6 級決策閘,REQUEST_REVISION → 強制人工
- AgentOrchestrator: 30s 全局超時,Reviewer ‖ Critic 並行,DB Immutable Event Sourcing + Redis Streams
- DecisionManager: AIOPS_P2_ENABLED gate + _package_to_proposal_data 橋接既有 proposal_data 格式
- AgentSession DB table + 4 個複合 index
- ADR-082 決策記錄

Gate 2 修復(7 項):
- CRITICAL: DELETE FROM regex lookahead 位置錯誤(移至 FROM 後)
- CRITICAL: REQUEST_REVISION 可抵達 auto-execute 路徑(改回 requires_human_approval=True)
- IMPORTANT: _extract_json flat regex 不支援巢狀 JSON(改 find/rfind 邊界提取)
- IMPORTANT: all_degraded 遺漏 verdict.degraded(補全 4 個 Agent)
- IMPORTANT: Solver ABSTAIN guard 放行降級假設(改為無論 hypotheses 有無均跳過)
- IMPORTANT: dataclasses.asdict() Enum 未序列化導致 DB 寫入靜默失敗(加 json.dumps default handler)
- IMPORTANT: P2 gate 直讀屬性繞過父 Phase 守衛(改用 is_phase_enabled(2))

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 13:48:55 +08:00
AWOOOI CD
d51705b4ec chore(cd): deploy b6cb199 [skip ci] 2026-04-15 05:40:15 +00:00
OG T
b6cb1999a9 Merge remote-tracking branch 'gitea/main'
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 16m36s
2026-04-15 13:28:36 +08:00