Commit Graph

4 Commits

Author SHA1 Message Date
Your Name
c5753e1c57 fix(critic-review): KMWriter 名實統一 + Alertmanager 修抑制 + drift checker AST 化
critic PR review 揭示已 push commits 的 7 個 blocker,本 commit 全部修復。

## C1 + C2 + M1 + M2 + M3 — KMWriter 真正統一契約(critic 最嚴重 5 條)

### C1 km_writer.py:194 — backfill 自打臉修
- 裸 asyncio.create_task(_backfill_path_a_approval) → await _backfill_path_a_approval_safe()
- 同步 await + 獨立 DLQ km:backfill:dlq + try/except 不阻塞主寫入
- 新增 km_backfill_reconciler_job.py(每 5 分鐘掃 DLQ)+ ENABLE_KM_BACKFILL_RECONCILER flag
- 防 Path B 比 Path A 先完成 → related_approval_id 永遠 NULL 的 race

### C2 km_writer.py:391 — KM_WRITE_AWAIT=false 路徑收緊
- 從 ensure_future(fire-and-forget 比舊版同步寫更糟)
- 改 await writer.write(retry=1, timeout=2.0)(仍 await 但只試一次、超時短)
- docstring 明確標註「緊急回滾用,不保證可靠性」

### M1 decision_manager.py:2178/2203 — 移除 _fire_and_forget 旁路
- 兩處 _fire_and_forget(executor.write_execution_result_to_km(...))
- 改 await asyncio.shield(...) + BaseException 保護(防上層 cancel 中斷)
- KM_WRITE_AWAIT=true 在這條路徑終於真正 await

### M2 incident_service.py:1099 — 自製 path 加 retry+DLQ
- 原本 if settings.KM_WRITE_AWAIT: await asyncio.wait_for else create_task
- 改 3 次指數退避 retry + DLQ 保護(呼叫 km_writer 私有 helper)

### M3 km_writer.py:166 — 冪等聲明對齊實作
- knowledge_repository.create() 加 UPSERT 路徑(pg_insert ON CONFLICT DO UPDATE)
- KnowledgeEntryCreate / KnowledgeEntryRecord 加 path_type 欄位
- migration: ADD COLUMN path_type + partial unique index uix_knowledge_incident_path

## M4 alertmanager.yml — equal: [] 收緊(critic 防爆炸抑制)
- OllamaInstanceDown / KMConverterDown 抑制加 equal: ['cluster'] 約束
- 防多 cluster 場景下任一 Ollama down 誤抑全 AI/SLO 告警

## M5 Alertmanager 版本驗證(已確認 v0.31.1,遠超 v0.22+)

## M6 governance_agent.py — health score 區分 skipped vs ok vs violated
- check_slo_compliance 加 _meta {violated_count, skipped_count, ok_count, all_skipped, status}
- run_self_check: SLO 全 skipped 時獨立發 governance_slo_data_gap 告警
  (不污染 self_failure 計數,因為 no_data 是 emitter 未實作不是治理機制故障)

## M7 scripts/check_config_drift.py — 改 AST 解析
- regex 改 ast.parse 找 Settings ClassDef AnnAssign Field(default=...)
- 避免多行 list / default_factory= / 含跳行字串的 false negative
- 4 欄位(AI_FALLBACK_ORDER / ARGOCD_URL / PROMETHEUS_URL / OLLAMA_URL)全對齊

## 新增測試
- test_km_writer_backfill_reconciler.py: 7 cases(C1 reconciler + safe helper)
- test_km_writer_idempotent.py: 5 cases(M3 path_type 注入 + UPSERT 分支)

## 驗證
- 1585 unit tests 全綠(+13 從 1572)
- amtool check-config SUCCESS(8 inhibit_rules / 2 receivers)
- drift checker AST-based 4 欄位全對齊
- Alertmanager v0.31.1 確認支援新語法

## 期望影響
- KMWriter 名實統一:飛輪閉環 KM 寫入路徑 100% 可靠
- M4 抑制爆炸風險解除
- 治理層不再對 SLO no_data 靜默
- drift checker false negative 風險解除

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 10:44:39 +08:00
Your Name
c22e5f334e feat(km): P1-1 KMWriter 統一契約 + 5 caller 切換 + M4 反查鏈補齊
12-Agent 全景診斷揪出 KM 寫入鏈路 5 條入口無統一契約,fire-and-forget
在 Pod recycle 時會丟失條目。本次抽 KMWriter 強制 7 條契約。

## 7 條契約強制
1. 同步底線:強制 await asyncio.wait_for(timeout)
2. 重試:3 次指數退避 1s/2s/4s(OperationalError / 網路類例外)
3. 失敗回收:3 次後寫 Redis DLQ km:dlq + log
4. 觀測:structlog event + 預留 metric hook(P1-3 補 emitter)
5. 冪等:incident_id + path_type 為 unique key
6. 禁止吞例外:except 必須 log + raise/DLQ
7. M4 反查鏈:payload 含 approval_id 時自動填 related_approval_id 並回填 Path A

## Caller 切換(5 條入口統一介面)
- incident_service.py:1086 Path A(KB extractor + km_conversion)
- approval_execution.py:771 Path B-人工
- decision_manager.py:2178 Path B-自動成功(消除跨類私有方法調用 M1)
- decision_manager.py:2200 Path B-自動失敗(修 B2 早期吞例外)
- playbook_service.py:210 PlaybookKM(兩份 T0 報告都漏的第三條)

## M4 反查鏈補齊
- knowledge.py + models.py: 補 related_approval_id ORM 欄位
- 對齊 phase26_incident_km_integration.sql:20 schema(partial index 已存在)
- approval↔KM 雙向反查鏈完整(dual-path 縫合線)

## Feature Flag (rollback 保險)
- KM_WRITE_AWAIT=true (default): await + timeout + DLQ 強制
- KM_WRITE_AWAIT=false: fire-and-forget(舊行為)

## 測試
- apps/api/tests/test_km_writer.py: 18 測試全綠
  覆蓋 success / timeout / retry / DLQ / 冪等 / KMWriteError /
  on_failure=raise / 反查鏈回填
- 1552 unit tests 全綠(無回歸)

## 驗收
飛輪閉環核心 — KM 寫入不再靜默丟失,AI 學習鏈不斷裂。

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 10:44:39 +08:00
OG T
3455044457 feat(phase25): Nemotron 主動防禦三方向 P0+P1+P2 完整實作
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 38s
Type Sync Check / check-type-sync (push) Failing after 35s
P0 - DIAGNOSE Privacy-First Routing:
- ai_router.py: _local_fallback_chain [NEMOTRON→OLLAMA→REJECT]
- DIAGNOSE 意圖 override 改為 NEMOTRON (原 OLLAMA)
- DIAGNOSE fallback 使用 local-only 鏈,不觸碰雲端
- 全部失敗時 REJECT + Telegram 通知
- config.py: NEMOTRON_DIAGNOSE_TIMEOUT_SECONDS=30, OLLAMA_DIAGNOSE_TIMEOUT_SECONDS=60
- nemotron.py: 根據 context[task_type] 選擇 timeout

P1 - Knowledge Auto-Harvesting:
- models/knowledge.py: EntryType.AUTO_RUNBOOK + ANTI_PATTERN + symptoms_hash
- EntryStatus.PUBLISHED (ANTI_PATTERN 直接發布,無需審核)
- models/playbook.py: SymptomPattern.compute_hash() (16字元確定性 hash)
- services/runbook_generator.py: NemotronRunbookGenerator (v1.1)
  - generate_runbook() → AUTO_RUNBOOK (DRAFT) + Telegram 審核 card
  - generate_anti_pattern() → ANTI_PATTERN (PUBLISHED) + Telegram 通知
  - 使用 nvidia.chat() (正確介面),Nemotron 超時時 Minimal fallback
- knowledge_service.py: check_anti_pattern(symptoms_hash, days=7)
- db/models.py: symptoms_hash VARCHAR(16) + ix_knowledge_symptoms_hash
- repositories/knowledge_repository.py: create() 支援 symptoms_hash + status
- auto_repair_service.py: anti_pattern_gate 在 decide() + runbook hook 在 execute()
- migrations/phase8_symptoms_hash.sql: ALTER TABLE + partial index + PUBLISHED constraint

P2 - Config Drift Detection:
- models/drift.py: DriftItem/DriftReport/DriftLevel/DriftIntent/DriftStatus
- services/drift_detector.py: GitStateReader + K8sStateReader + DriftDetector
- services/drift_analyzer.py: 白名單過濾 + DriftLevel 分級
- services/drift_interpreter.py: NemotronDriftInterpreter(意圖分析,不生成修復指令)
- services/drift_remediator.py: rollback(kubectl apply) + adopt(git push gitea)
- api/v1/drift.py: POST /scan, GET /reports, POST /rollback, POST /adopt
- migrations/phase9_drift_reports.sql: drift_reports 表
- k8s/drift-cronjob.yaml: 每小時自動掃描 CronJob

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 12:35:05 +08:00
OG T
d8be78b135 feat(api): Knowledge Base Phase 1 後端四層架構
Some checks failed
CD Pipeline / build-and-deploy (push) Successful in 7m0s
E2E Health Check / e2e-health (push) Successful in 17s
Type Sync Check / check-type-sync (push) Failing after 30s
- models/knowledge.py: Pydantic Schema (EntryType/Source/Status/CRUD)
- db/models.py: KnowledgeEntryRecord ORM (PostgreSQL)
- repositories/interfaces.py: IKnowledgeRepository Protocol
- repositories/knowledge_repository.py: PostgreSQL CRUD 實作
- services/knowledge_service.py: 業務邏輯 (get_db_context 內部管理 session)
- api/v1/knowledge.py: REST Router (get_knowledge_service,無直接 DB 存取)
- main.py: 掛載 Knowledge Base Router
- k8s/jobs/migrate-knowledge-entries.yaml: DB Migration Job

API 端點: GET/POST / | GET/PATCH/DELETE /{id} | POST /{id}/approve
         GET /search | GET /categories

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 00:55:56 +08:00