OG T
|
325b3851b5
|
feat(adr-071): 告警通知四類型第一批 B/C/E/F/G/H 全實作
CD Pipeline / build-and-deploy (push) Has been cancelled
Type Sync Check / check-type-sync (push) Failing after 1m7s
ADR-071-B: classify_notification() — 五型分類器 (TYPE-1/2/3/4/4D)
ADR-071-C: send_info_notification() — TYPE-1 純資訊無按鈕卡片
ADR-071-E: _build_inline_keyboard() — 依 alert_category 動態組合 TYPE-3 按鈕
ADR-071-F: send_drift_card() — TYPE-4D Config Drift 卡片 + Diff 截斷
ADR-071-G: km_conversion_service.py — Incident RESOLVED 自動轉 KM
ADR-071-H: handle_manual_fix_done() — TYPE-4 手動修復 Bot 對話閉環
前批已完成: ADR-071-A (DB Migration) + ADR-071-D (狀態機守衛)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-11 02:24:20 +08:00 |
|
OG T
|
c6edfb5614
|
fix(flywheel): 四階段系統性修復 AUTO_REPAIR NO_MATCH 斷層
CD Pipeline / build-and-deploy (push) Has been cancelled
Phase 1 — affected_services 污染根治
- webhooks.py: _extract_affected_services() 從 labels 精準萃取服務名
(component > job > pod deployment name > clean target_resource > [])
- create_incident_for_approval: alert_labels 完整保留進 Signal
- alert_name 從 alertname 取,不再用 "custom"
Phase 2 — Playbook alertname 變體擴充
- alert_rules.yaml: 5 條規則新增 HostHighCpuLoad、KubePodCrashLooping 等變體
- scripts/update_playbook_alert_variants.py: Redis index 已執行更新 ✅
Phase 3 — Jaccard 通用型 Playbook 豁免
- similarity.py: affected_services=[] → 1.0 豁免(基礎設施 Playbook 不針對特定服務)
- severity_range=[] → 1.0 豁免(適用所有嚴重度)
Phase 4 — Playbook Embedding 持久化(冷啟動修復)
- migrations/flywheel_playbook_embeddings.sql: pgvector 持久化表
- services/playbook_embedding_service.py: 啟動時重建 Redis 向量快取 + 同步 DB
- main.py: lifespan 啟動時 asyncio.create_task 非阻塞執行
2026-04-10 Asia/Taipei — Claude Sonnet 4.6
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-10 11:04:56 +08:00 |
|
OG T
|
af7b1591c1
|
feat(rag): phase35 ivfflat 向量索引 — 5814 chunks 已建立
CD Pipeline / build-and-deploy (push) Has been cancelled
已在 prod 執行: idx_rag_chunks_embedding (lists=100, cosine)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-10 10:33:32 +08:00 |
|
OG T
|
63e840ae42
|
feat(ollama): Phase 31-34 ADR-067 — Log摘要/PR審查/RAG知識庫/圖片分析
CD Pipeline / build-and-deploy (push) Has started running
Phase 31: log_summary_service.py — deepseek-r1:14b K8s Pod日誌異常摘要
- 觸發: signoz_webhook 告警時背景呼叫
- Redis快取 log_summary:{pod}:{date} TTL 24h
- 敏感資料regex遮蔽
Phase 32: local_code_review_service.py — qwen2.5-coder:7b PR自動審查
- Fallback: Gemini (diff > 50KB 或 Ollama超時)
- semaphore 最多2個同時審查
- 雙寫: Redis TTL 7d + pr_reviews表 (phase29 migration)
Phase 33: knowledge_rag_service.py — nomic-embed-text 768維 pgvector RAG
- 向量化(188) + 生成(111) 雙Ollama
- rag_chunks表 (phase28 migration)
- 初期線性搜尋,>100筆啟用ivfflat索引
Phase 34: image_analysis_service.py — llava:latest Telegram圖片分析
- download_and_analyze: Bot API getFile → 下載 → llava → 回應
- Rate limit: 每chat_id每分鐘3次 (Redis sliding window)
- telegram.py webhook新增photo分支
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-10 01:50:22 +08:00 |
|
OG T
|
89015d4527
|
feat(phase30): Drift 報告 AI 人話摘要 (ADR-067)
CD Pipeline / build-and-deploy (push) Has been cancelled
- 新增 DriftNarratorService — qwen2.5:7b-instruct (Ollama 111)
- 觸發條件: high >= 1 or medium >= 3(HPA replicas 白名單)
- Redis 快取: drift_narrative:{report_id} TTL 1h
- LLM 失敗時 graceful fallback 結構化文字
- drift.py _analyze_and_notify: 接入 narrator(Phase 30 標記)
- Migration: drift_reports.narrative_text TEXT (已在 prod 執行)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-10 01:37:43 +08:00 |
|
OG T
|
88ac1c7f50
|
feat(phase27): 歷史按鈕雙層頻率統計 + DB frequency_snapshot 持久化
CD Pipeline / build-and-deploy (push) Failing after 1m44s
- telegram_gateway: _send_incident_history 改為 Phase 27 雙層策略
Layer 1: DB frequency_snapshot (建立時刻永久快照)
Layer 2: Redis AnomalyCounter disposition 累積統計 (35d TTL)
修復舊版呼叫 record_anomaly() 導致誤計數的 bug
- 新增 migration: phase27_incident_frequency_snapshot.sql (已在 prod 執行)
- CLAUDE.md: 精簡至 123 行,減少 Token 消耗
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-10 01:06:51 +08:00 |
|
OG T
|
896bef94ee
|
fix(web): pending-approvals-card 加防重複點擊 + loading 狀態
linter 自動強化: actioningId state 防止同一張卡重複操作
- disabled + opacity 0.6 + cursor not-allowed
- loading 時按鈕顯示 '...'
- finally() 確保 actioningId 清除
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-09 18:38:08 +08:00 |
|
OG T
|
88696dba9b
|
feat(sprint5.1): Data Safety Guardrails 全鏈路整合 (L1-L5)
CD Pipeline / build-and-deploy (push) Failing after 1m33s
Type Sync Check / check-type-sync (push) Failing after 58s
Layer 0 - K8s RBAC:
- k8s/rbac/api-velero-reader.yaml: awoooi-executor SA Velero backup reader
Layer 1 - DB Migration (已在 188 執行):
- M-002: approval_records 新增 approval_level/votes/required_votes
- M-003: alert_event_type ENUM 新增 8 個值
Layer 2 - IaC:
- ops/config/service-registry.yaml: 全服務 Stateful 分級清單 (BLOCK/CRITICAL_HITL/STANDARD_HITL/AUTO)
Layer 3 - Python Services:
- service_registry.py: 讀取 YAML,提供 is_blocked/requires_multisig/get_required_votes
- velero_client.py: kubectl 查詢 Velero 備份年齡,失敗 fallback 999h
- preflight_service.py: Pre-flight 安全檢查 (Q2/Q4 決策)
Layer 1-M001 - Playbook model:
- playbook.py: 新增 requires_approval_level/stateful_targets/requires_pre_backup
Layer 4 - 業務邏輯:
- alert_operation_log_repository.py: 新增 8 個 event_type (Guardrail/Pre-flight/MultiSig/備份)
- auto_repair_service.py: 注入 Service Registry Guardrail 檢查 (BLOCK → 直接拒絕)
- webhooks.py: ALERT_RECEIVED 溯源記錄 + auto_repair flag Q9 + Langfuse trace_id Q10
- db/models.py: ApprovalRecord 同步 approval_level/votes/required_votes 欄位
- docker-health-monitor.sh: 純感知層改造(移除所有 docker restart 邏輯)
Layer 5 - Telegram 通知:
- telegram_gateway.py: T1-T6 六個新通知方法 (Guardrail/Pre-flight/Backup/MultiSig/ChangeApplied)
參考: ADR-062 Data Safety Guardrails, ADR-063 Service Registry IaC
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-08 16:24:09 +08:00 |
|
OG T
|
f20121ad41
|
feat(audit): Phase 11 告警操作完整溯源 — alert_operation_log + 歷史回填
CD Pipeline / build-and-deploy (push) Failing after 1m29s
統帥指令「所有告警訊息通通寫入資料庫,並記錄相關操作」
變更:
- phase11_alert_operation_log.sql: 新表 (Event Sourcing,不可變)
- phase11b_backfill_alert_operation_log.sql: 歷史回填 654 筆
- 14 筆 ALERT_RECEIVED (incidents)
- 265 筆 TELEGRAM_SENT (approval_records)
- 265 筆 USER_ACTION (approval_records)
- 110 筆 EXECUTION_COMPLETED (audit_logs)
- db/models.py: AlertOperationLog SQLAlchemy model
- repositories/alert_operation_log_repository.py: append/list_by_incident/get_stats
- webhooks.py: _try_auto_repair_background 寫入 AUTO_REPAIR_TRIGGERED + EXECUTION_COMPLETED + TELEGRAM_RESULT_SENT
- webhooks.py: _push_to_telegram_background 寫入 TELEGRAM_SENT
- telegram.py: handle_callback 寫入 USER_ACTION (approve/reject)
已執行 migration: awoooi_prod@192.168.0.188 ✅
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
2026-04-08 11:22:03 +08:00 |
|
OG T
|
eee6f06215
|
feat(auto-repair): 所有操作強制寫入 DB — auto_repair_executions 表
CD Pipeline / build-and-deploy (push) Failing after 1m32s
統帥指令: 所有自動修復操作(成功/失敗)必須持久化
變更:
- migrations/phase10_auto_repair_executions.sql: 新增表 + 4 個索引
- db/models.py: 新增 AutoRepairExecution SQLAlchemy model
- repositories/audit_log_repository.py: 新增 AutoRepairExecutionRepository (create/list_by_incident/get_stats)
- auto_repair_service.py: execute_auto_repair 成功/失敗分支都寫入 DB
- 新增 similarity_score 參數傳遞
- AutoRepairDecision 新增 similarity_score 欄位
- webhooks.py: 傳入 similarity_score 到 execute_auto_repair
已執行 migration: awoooi_prod@192.168.0.188:5432 ✅
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
2026-04-08 11:16:37 +08:00 |
|
OG T
|
658337ec18
|
fix(phase26): 打通 Incident→DB→KM 完整鏈路 + namespace 修正
CD Pipeline / build-and-deploy (push) Failing after 1m29s
Type Sync Check / check-type-sync (push) Failing after 52s
問題根因:
1. create_incident_for_approval 只存 Redis,不存 PostgreSQL
→ TTL 7天後消失,Playbook 萃取永遠找不到 Incident
2. ApprovalRecord 無 incident_id 欄位
→ _trigger_playbook_extraction 靠 regex 掃中文文字找 INC-,永遠失敗
3. operation_parser namespace fallback 是 "default"
→ 所有 deployment 在 awoooi-prod,203 次執行全失敗
修復:
- Incident 同時寫入 Redis + PostgreSQL (save_to_episodic_memory)
- ApprovalRecord 加入 incident_id 欄位 (model + ORM + migration)
- alertmanager_webhook 建立 Approval 後回寫 incident_id
- _trigger_playbook_extraction 直接用 approval.incident_id
- operation_parser DEFAULT_NAMESPACE = "awoooi-prod"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-06 11:46:05 +08:00 |
|
OG T
|
3455044457
|
feat(phase25): Nemotron 主動防禦三方向 P0+P1+P2 完整實作
CD Pipeline / build-and-deploy (push) Failing after 38s
Type Sync Check / check-type-sync (push) Failing after 35s
P0 - DIAGNOSE Privacy-First Routing:
- ai_router.py: _local_fallback_chain [NEMOTRON→OLLAMA→REJECT]
- DIAGNOSE 意圖 override 改為 NEMOTRON (原 OLLAMA)
- DIAGNOSE fallback 使用 local-only 鏈,不觸碰雲端
- 全部失敗時 REJECT + Telegram 通知
- config.py: NEMOTRON_DIAGNOSE_TIMEOUT_SECONDS=30, OLLAMA_DIAGNOSE_TIMEOUT_SECONDS=60
- nemotron.py: 根據 context[task_type] 選擇 timeout
P1 - Knowledge Auto-Harvesting:
- models/knowledge.py: EntryType.AUTO_RUNBOOK + ANTI_PATTERN + symptoms_hash
- EntryStatus.PUBLISHED (ANTI_PATTERN 直接發布,無需審核)
- models/playbook.py: SymptomPattern.compute_hash() (16字元確定性 hash)
- services/runbook_generator.py: NemotronRunbookGenerator (v1.1)
- generate_runbook() → AUTO_RUNBOOK (DRAFT) + Telegram 審核 card
- generate_anti_pattern() → ANTI_PATTERN (PUBLISHED) + Telegram 通知
- 使用 nvidia.chat() (正確介面),Nemotron 超時時 Minimal fallback
- knowledge_service.py: check_anti_pattern(symptoms_hash, days=7)
- db/models.py: symptoms_hash VARCHAR(16) + ix_knowledge_symptoms_hash
- repositories/knowledge_repository.py: create() 支援 symptoms_hash + status
- auto_repair_service.py: anti_pattern_gate 在 decide() + runbook hook 在 execute()
- migrations/phase8_symptoms_hash.sql: ALTER TABLE + partial index + PUBLISHED constraint
P2 - Config Drift Detection:
- models/drift.py: DriftItem/DriftReport/DriftLevel/DriftIntent/DriftStatus
- services/drift_detector.py: GitStateReader + K8sStateReader + DriftDetector
- services/drift_analyzer.py: 白名單過濾 + DriftLevel 分級
- services/drift_interpreter.py: NemotronDriftInterpreter(意圖分析,不生成修復指令)
- services/drift_remediator.py: rollback(kubectl apply) + adopt(git push gitea)
- api/v1/drift.py: POST /scan, GET /reports, POST /rollback, POST /adopt
- migrations/phase9_drift_reports.sql: drift_reports 表
- k8s/drift-cronjob.yaml: 每小時自動掃描 CronJob
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-04 12:35:05 +08:00 |
|
OG T
|
df3ef9006c
|
fix(auto-repair): 首席架構師 Review — 4 Critical/Important 修復
CD Pipeline / build-and-deploy (push) Successful in 7m2s
Critical #1: KM write task 移出 try/except
- _trigger_learning 的 KM 寫入原在 try 內,learning 失敗時不寫 KM
- 移至 except 後確保成功/失敗都寫入
- 移除冗餘 import asyncio(已在頂層 import)
- Minor: approval.incident_id or None 防空字串
Important #2: migration 加 PRIMARY KEY
- playbook_id 從 UNIQUE 升為 PRIMARY KEY
- prod DB 已執行 ALTER TABLE ADD PRIMARY KEY
Important #3: s.sequence→s.step_number, s.description→s.command
- embed_playbook() 使用不存在的欄位名,RAG 向量索引靜默失敗
- RepairStep 正確欄位: step_number, command
Important #1: PlaybookService._get_rag_service 不再 Service 層快取
- 改為每次呼叫工廠 get_playbook_rag_service()
- 避免舊實例繞過工廠的 is_closed 重建邏輯
冷啟動修復 (首席架構師建議B+C):
- _trigger_playbook_extraction 執行成功後自動設定
execution_success=True, effectiveness_score=4, status=RESOLVED
- skip 路徑 logger.debug → logger.info 提升可觀測性
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-04 12:02:03 +08:00 |
|
OG T
|
72d7536ead
|
feat(auto-repair): 完整自動修復閉環 + KM 沉澱串接
CD Pipeline / build-and-deploy (push) Has been cancelled
1. DB Migration: playbooks 資料表 (phase7_playbooks_table.sql)
- 這是自動修復無法啟動的根本原因 — table 從未建立
- 5 個索引: status/tags/alert_names/source_incidents/created_at
- 已在 prod DB 執行
2. playbook_service: 萃取後自動沉澱 KM
- extract_from_incident() 完成後 fire-and-forget _write_to_km()
- 內容含症狀模式、修復步驟、信心度、來源 Incident
3. approval_execution: 執行結果沉澱 KM
- _trigger_learning() 後 fire-and-forget _write_execution_result_to_km()
- 成功/失敗記錄都寫入,category=execution_result
完整閉環:
告警 → AI分析 → 查Playbook → 決策 → 執行 → 結果寫KM
↓
Incident解決 → KM(knowledge_extractor)
→ Playbook萃取 → KM
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-04 11:54:15 +08:00 |
|
OG T
|
a1f7d1f495
|
fix(db): 固化 risklevel ADD VALUE 'high' 為正式 migration
CD Pipeline / build-and-deploy (push) Successful in 6m58s
E2E Health Check / e2e-health (push) Successful in 18s
Phase 23 緊急修復已在 prod/dev 手動執行,此檔作為正式記錄
使用 DO 塊防止重複執行錯誤
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-01 21:36:15 +08:00 |
|
OG T
|
30153496d1
|
fix(api): 修復全部 lint 錯誤 (ruff --fix)
- Import sorting (I001)
- Unused imports (F401)
- f-string without placeholders (F541)
- Loop variable unused (B007)
- zip() strict parameter (B905)
- Exception chaining (B904)
- collections.abc imports (UP035)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-26 16:06:20 +08:00 |
|