Commit Graph

2 Commits

Author SHA1 Message Date
Your Name
45dbe07188 fix(flywheel): 自動化飛輪六大能力修復(ADR-092 B3)
Some checks failed
run-migration / migrate (push) Failing after 22s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 53s
Type Sync Check / check-type-sync (push) Successful in 2m54s
CD Pipeline / build-and-deploy (push) Has been cancelled
Ansible Lint / lint (push) Has been cancelled
【根因鏈修復】
MCP Provider bugs → PreDecisionInvestigator 失敗 → Agent Debate 無上下文
→ LLM 逾時 → description="待分析" → ADR-091 鐵閘攔截 → tg_sent 未設
→ W-2 Watchdog 誤報「靜默故障」

【六大修復】
1. MCP Provider 三蟲修復
   - ssh_provider: asyncssh.run() → conn.run()
   - prometheus_provider: KeyError 'query' → .get() 容錯
   - k8s_provider: 空 pod_name → 早返回錯誤字典

2. Agent Debate / 決策品質
   - decision_manager: 逾時降級文字改為明確描述(繞過 ADR-091 鐵閘)
   - intent_classifier: LLM 逾時降級至關鍵字分類(非 None)

3. Watchdog 誤報修復(ADR-092 B3)
   - W-2: tg_sent Redis TTL → telegram_message_id IS NULL(DB 真值)
   - W-5 新增: suggested_action IN 空/待分析/NO_ACTION + tg_id IS NULL
   - approval_timeout_resolver: 60min → 15min,batch 50 → 200

4. Config Drift 自動化
   - drift_adopt_service: auto_adopt_if_safe() 六條件安全閘
   - drift.py: 背景任務先嘗試自動採納再發人工 Telegram 卡片

5. Playbook 飛輪穩定
   - playbook_seed_service: 修復幂等性(deprecated 不視為缺失)
   - playbook_evolver: 只載 DRAFT+APPROVED(非全部 294 筆)

6. 可觀測性
   - alert_rule_engine: auto_rule 結構化日誌 + Redis 計數器(pipeline)
   - auto_approve: reject 原因 Redis 計數器
   - heartbeat_report_service: 新增「⚙️ 自動化統計(今日)」區塊

【待人工執行】
psql $DATABASE_URL -f apps/api/migrations/cleanup_duplicate_deprecated_playbooks.sql

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 10:55:50 +08:00
OG T
f045506abd fix(flywheel): P2 Approval 逾期不結案 → KM 學習鏈斷鏈修復
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 12m11s
問題根因:
  PENDING approval 無人處置超過 48h 後應自動 EXPIRED,
  但 get_pending_approvals() 只在用戶開 UI 時觸發,
  若無人開 UI → Incident 永遠 PENDING → KM 永遠不寫入
  → Phase 6 SLO human_override_rate 低估,EWMA 缺少負向樣本。

修復:
  1. anomaly_counter.py: 新增 "timeout_ignored" disposition 類型,
     與 auto_repair / human_approved / manual_resolved 區分
  2. incident_service.py: resolve_incident() 新增 resolution_type 參數,
     resolution_type="timeout" 時記錄 "timeout_ignored" 而非 "manual_resolved"
  3. jobs/approval_timeout_resolver.py (新): 每小時掃描逾期 PENDING approval,
     批次標記 EXPIRED,對每筆有 incident_id 的記錄呼叫 resolve_incident("timeout")
  4. main.py: startup 掛載 approval_timeout_resolver 排程(interval=3600s)

效果:
  - 告警無人處置 48h → Incident 自動結案 → KM 寫入 → EWMA 取得樣本
  - disposition="timeout_ignored" 讓 SLO 計算正確區分「AI 建議被忽略」
  - 飛輪學習鏈對「無人處置告警」閉環

2026-04-15 ogt + Claude Sonnet 4.6(亞太)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 19:21:21 +08:00