Commit Graph

6 Commits

Author SHA1 Message Date
OG T
658337ec18 fix(phase26): 打通 Incident→DB→KM 完整鏈路 + namespace 修正
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m29s
Type Sync Check / check-type-sync (push) Failing after 52s
問題根因:
1. create_incident_for_approval 只存 Redis,不存 PostgreSQL
   → TTL 7天後消失,Playbook 萃取永遠找不到 Incident
2. ApprovalRecord 無 incident_id 欄位
   → _trigger_playbook_extraction 靠 regex 掃中文文字找 INC-,永遠失敗
3. operation_parser namespace fallback 是 "default"
   → 所有 deployment 在 awoooi-prod,203 次執行全失敗

修復:
- Incident 同時寫入 Redis + PostgreSQL (save_to_episodic_memory)
- ApprovalRecord 加入 incident_id 欄位 (model + ORM + migration)
- alertmanager_webhook 建立 Approval 後回寫 incident_id
- _trigger_playbook_extraction 直接用 approval.incident_id
- operation_parser DEFAULT_NAMESPACE = "awoooi-prod"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 11:46:05 +08:00
OG T
3455044457 feat(phase25): Nemotron 主動防禦三方向 P0+P1+P2 完整實作
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 38s
Type Sync Check / check-type-sync (push) Failing after 35s
P0 - DIAGNOSE Privacy-First Routing:
- ai_router.py: _local_fallback_chain [NEMOTRON→OLLAMA→REJECT]
- DIAGNOSE 意圖 override 改為 NEMOTRON (原 OLLAMA)
- DIAGNOSE fallback 使用 local-only 鏈,不觸碰雲端
- 全部失敗時 REJECT + Telegram 通知
- config.py: NEMOTRON_DIAGNOSE_TIMEOUT_SECONDS=30, OLLAMA_DIAGNOSE_TIMEOUT_SECONDS=60
- nemotron.py: 根據 context[task_type] 選擇 timeout

P1 - Knowledge Auto-Harvesting:
- models/knowledge.py: EntryType.AUTO_RUNBOOK + ANTI_PATTERN + symptoms_hash
- EntryStatus.PUBLISHED (ANTI_PATTERN 直接發布,無需審核)
- models/playbook.py: SymptomPattern.compute_hash() (16字元確定性 hash)
- services/runbook_generator.py: NemotronRunbookGenerator (v1.1)
  - generate_runbook() → AUTO_RUNBOOK (DRAFT) + Telegram 審核 card
  - generate_anti_pattern() → ANTI_PATTERN (PUBLISHED) + Telegram 通知
  - 使用 nvidia.chat() (正確介面),Nemotron 超時時 Minimal fallback
- knowledge_service.py: check_anti_pattern(symptoms_hash, days=7)
- db/models.py: symptoms_hash VARCHAR(16) + ix_knowledge_symptoms_hash
- repositories/knowledge_repository.py: create() 支援 symptoms_hash + status
- auto_repair_service.py: anti_pattern_gate 在 decide() + runbook hook 在 execute()
- migrations/phase8_symptoms_hash.sql: ALTER TABLE + partial index + PUBLISHED constraint

P2 - Config Drift Detection:
- models/drift.py: DriftItem/DriftReport/DriftLevel/DriftIntent/DriftStatus
- services/drift_detector.py: GitStateReader + K8sStateReader + DriftDetector
- services/drift_analyzer.py: 白名單過濾 + DriftLevel 分級
- services/drift_interpreter.py: NemotronDriftInterpreter(意圖分析,不生成修復指令)
- services/drift_remediator.py: rollback(kubectl apply) + adopt(git push gitea)
- api/v1/drift.py: POST /scan, GET /reports, POST /rollback, POST /adopt
- migrations/phase9_drift_reports.sql: drift_reports 表
- k8s/drift-cronjob.yaml: 每小時自動掃描 CronJob

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 12:35:05 +08:00
OG T
df3ef9006c fix(auto-repair): 首席架構師 Review — 4 Critical/Important 修復
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m2s
Critical #1: KM write task 移出 try/except
- _trigger_learning 的 KM 寫入原在 try 內,learning 失敗時不寫 KM
- 移至 except 後確保成功/失敗都寫入
- 移除冗餘 import asyncio(已在頂層 import)
- Minor: approval.incident_id or None 防空字串

Important #2: migration 加 PRIMARY KEY
- playbook_id 從 UNIQUE 升為 PRIMARY KEY
- prod DB 已執行 ALTER TABLE ADD PRIMARY KEY

Important #3: s.sequence→s.step_number, s.description→s.command
- embed_playbook() 使用不存在的欄位名,RAG 向量索引靜默失敗
- RepairStep 正確欄位: step_number, command

Important #1: PlaybookService._get_rag_service 不再 Service 層快取
- 改為每次呼叫工廠 get_playbook_rag_service()
- 避免舊實例繞過工廠的 is_closed 重建邏輯

冷啟動修復 (首席架構師建議B+C):
- _trigger_playbook_extraction 執行成功後自動設定
  execution_success=True, effectiveness_score=4, status=RESOLVED
- skip 路徑 logger.debug → logger.info 提升可觀測性

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 12:02:03 +08:00
OG T
72d7536ead feat(auto-repair): 完整自動修復閉環 + KM 沉澱串接
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
1. DB Migration: playbooks 資料表 (phase7_playbooks_table.sql)
   - 這是自動修復無法啟動的根本原因 — table 從未建立
   - 5 個索引: status/tags/alert_names/source_incidents/created_at
   - 已在 prod DB 執行

2. playbook_service: 萃取後自動沉澱 KM
   - extract_from_incident() 完成後 fire-and-forget _write_to_km()
   - 內容含症狀模式、修復步驟、信心度、來源 Incident

3. approval_execution: 執行結果沉澱 KM
   - _trigger_learning() 後 fire-and-forget _write_execution_result_to_km()
   - 成功/失敗記錄都寫入,category=execution_result

完整閉環:
告警 → AI分析 → 查Playbook → 決策 → 執行 → 結果寫KM
                                              ↓
                              Incident解決 → KM(knowledge_extractor)
                                          → Playbook萃取 → KM

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 11:54:15 +08:00
OG T
a1f7d1f495 fix(db): 固化 risklevel ADD VALUE 'high' 為正式 migration
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 6m58s
E2E Health Check / e2e-health (push) Successful in 18s
Phase 23 緊急修復已在 prod/dev 手動執行,此檔作為正式記錄
使用 DO 塊防止重複執行錯誤

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 21:36:15 +08:00
OG T
30153496d1 fix(api): 修復全部 lint 錯誤 (ruff --fix)
- Import sorting (I001)
- Unused imports (F401)
- f-string without placeholders (F541)
- Loop variable unused (B007)
- zip() strict parameter (B905)
- Exception chaining (B904)
- collections.abc imports (UP035)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 16:06:20 +08:00