awoooi/apps/api/tests/test_km_writer_backfill_reconciler.py
Your Name c5753e1c57 fix(critic-review): make KMWriter's contract match its name + fix Alertmanager inhibition + AST-based drift checker
The critic PR review surfaced 7 blockers in commits that were already pushed; this commit fixes all of them.

## C1 + C2 + M1 + M2 + M3: KMWriter contract truly unified (the critic's 5 most severe findings)

### C1 km_writer.py:194: fix the self-contradicting backfill
- Bare asyncio.create_task(_backfill_path_a_approval) → await _backfill_path_a_approval_safe()
- Synchronous await + dedicated DLQ km:backfill:dlq + try/except so the main write is never blocked
- Add km_backfill_reconciler_job.py (scans the DLQ every 5 minutes) + ENABLE_KM_BACKFILL_RECONCILER flag
- Prevents the race where Path B finishes before Path A and related_approval_id stays NULL forever
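
A minimal sketch of the "await, and park failures in a DLQ" shape described in C1. It is not the code from km_writer.py: `FakeRedis`, `backfill_safe`, and the two stub backfill coroutines are hypothetical names for illustration; only the key `km:backfill:dlq` and the record fields come from this commit.

```python
import asyncio
import json

DLQ_KEY = "km:backfill:dlq"  # key name taken from the commit message


class FakeRedis:
    """Hypothetical in-memory stand-in for the one Redis command used here."""

    def __init__(self):
        self.lists = {}

    async def lpush(self, key, value):
        self.lists.setdefault(key, []).insert(0, value)


async def backfill_safe(backfill, redis, incident_id, approval_id):
    """Await the backfill; on failure park the pair in the DLQ and return False."""
    try:
        await backfill(incident_id, approval_id)
        return True
    except Exception:
        # The caller's main write must not fail because the backfill did.
        record = json.dumps({"incident_id": incident_id, "approval_id": approval_id})
        await redis.lpush(DLQ_KEY, record)
        return False


async def _ok_backfill(incident_id, approval_id):
    return None


async def _failing_backfill(incident_id, approval_id):
    raise RuntimeError("db error")


def demo():
    redis = FakeRedis()
    ok = asyncio.run(backfill_safe(_ok_backfill, redis, "INC-1", "AP-1"))
    failed = asyncio.run(backfill_safe(_failing_backfill, redis, "INC-2", "AP-2"))
    return ok, failed, redis.lists.get(DLQ_KEY, [])
```

Awaiting instead of spawning a bare task means the caller knows the outcome before it returns, and anything that fails is left for the reconciler job to retry.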

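### C2 km_writer.py:391: tighten the KM_WRITE_AWAIT=false path
- Was ensure_future (fire-and-forget, worse than the old synchronous write)
- Now await writer.write(retry=1, timeout=2.0) (still awaited, but a single attempt with a short timeout)
- Docstring explicitly states "for emergency rollback only; reliability not guaranteed"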
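
A rough sketch of what "one awaited attempt with a short timeout" can look like, using asyncio.wait_for. `write_once` and the stub writers are hypothetical; the real writer.write(retry=1, timeout=2.0) signature is only known from the line above.

```python
import asyncio


async def write_once(write, timeout=2.0):
    """Make one awaited write attempt; return False on timeout or error."""
    try:
        await asyncio.wait_for(write(), timeout=timeout)
        return True
    except Exception:
        # Covers asyncio.TimeoutError as well as errors raised by the write.
        return False


def demo():
    async def fast():
        return None

    async def slow():
        await asyncio.sleep(1.0)

    return (
        asyncio.run(write_once(fast, timeout=0.5)),
        asyncio.run(write_once(slow, timeout=0.01)),
    )
```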

### M1 decision_manager.py:2178/2203: remove the _fire_and_forget bypass
- Two call sites used _fire_and_forget(executor.write_execution_result_to_km(...))
- Now await asyncio.shield(...) + BaseException guard (so an upstream cancel cannot interrupt the write)
- KM_WRITE_AWAIT=true finally results in a real await on this path
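
One possible shape of the shield-plus-guard pattern from M1, as a sketch only. `guarded_write` is a hypothetical helper, and it catches asyncio.CancelledError specifically, whereas the commit speaks of a broader BaseException guard.

```python
import asyncio


async def guarded_write(write):
    """Run `write` so that cancelling the caller does not cancel the write."""
    task = asyncio.ensure_future(write())
    try:
        await asyncio.shield(task)
    except asyncio.CancelledError:
        # The caller was cancelled: let the inner write finish, then propagate.
        await task
        raise


async def demo():
    done = []

    async def slow_write():
        await asyncio.sleep(0.05)
        done.append("written")

    outer = asyncio.ensure_future(guarded_write(slow_write))
    await asyncio.sleep(0.01)  # give the write time to start
    outer.cancel()             # simulate an upstream cancel
    try:
        await outer
    except asyncio.CancelledError:
        pass
    return done
```

The cancellation still reaches the caller, but only after the write has completed.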

### M2 incident_service.py:1099: add retry + DLQ to the hand-rolled path
- Was: if settings.KM_WRITE_AWAIT: await asyncio.wait_for, else create_task
- Now: retry 3 times with exponential backoff + DLQ protection (calls km_writer's private helper)
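
A minimal sketch of retry with exponential backoff and a DLQ fallback. `write_with_retry`, the list used as a DLQ, and the delay values are assumptions for illustration; the private km_writer helper that M2 actually calls is not shown in this commit message.

```python
import asyncio


async def write_with_retry(write, dlq, payload, attempts=3, base_delay=0.01):
    """Try `write` up to `attempts` times, doubling the delay; DLQ on exhaustion."""
    for attempt in range(attempts):
        try:
            await write(payload)
            return True
        except Exception:
            if attempt < attempts - 1:
                # Delays grow as base_delay * 1, 2, 4, ...
                await asyncio.sleep(base_delay * (2 ** attempt))
    dlq.append(payload)  # every attempt failed: park the payload for later
    return False


def demo():
    calls = {"n": 0}

    async def flaky(payload):
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("transient")

    async def broken(payload):
        raise ConnectionError("down")

    dlq = []
    recovered = asyncio.run(write_with_retry(flaky, dlq, {"id": "INC-1"}))
    lost = asyncio.run(write_with_retry(broken, dlq, {"id": "INC-2"}))
    return recovered, lost, calls["n"], dlq
```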

### M3 km_writer.py:166: align the idempotency claim with the implementation
- knowledge_repository.create() gains an UPSERT path (pg_insert ON CONFLICT DO UPDATE)
- KnowledgeEntryCreate / KnowledgeEntryRecord gain a path_type field
- migration: ADD COLUMN path_type + partial unique index uix_knowledge_incident_path
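
To illustrate why ON CONFLICT DO UPDATE makes a repeated write idempotent, here is a sketch that swaps in SQLite (stdlib sqlite3) for PostgreSQL/pg_insert. The table layout and the (incident_id, path_type) key are assumptions inferred from the index name, and the index is a plain unique index because the commit does not state the partial predicate.

```python
import sqlite3


def upsert_entry(conn, incident_id, path_type, content):
    """Insert a row, or update its content if the key already exists."""
    conn.execute(
        """
        INSERT INTO knowledge (incident_id, path_type, content)
        VALUES (?, ?, ?)
        ON CONFLICT (incident_id, path_type)
        DO UPDATE SET content = excluded.content
        """,
        (incident_id, path_type, content),
    )


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE knowledge (incident_id TEXT, path_type TEXT, content TEXT)"
    )
    conn.execute(
        "CREATE UNIQUE INDEX uix_knowledge_incident_path "
        "ON knowledge (incident_id, path_type)"
    )
    upsert_entry(conn, "INC-1", "A", "first")
    upsert_entry(conn, "INC-1", "A", "second")  # same key: updates, no duplicate
    upsert_entry(conn, "INC-1", "B", "other")   # different path_type: new row
    return conn.execute(
        "SELECT incident_id, path_type, content FROM knowledge ORDER BY path_type"
    ).fetchall()
```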

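## M4 alertmanager.yml: tighten equal: [] (critic: prevent runaway inhibition)
- OllamaInstanceDown / KMConverterDown inhibit rules gain an equal: ['cluster'] constraint
- Prevents a single Ollama outage from wrongly inhibiting all AI/SLO alerts in a multi-cluster setup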

## M5 Alertmanager version check (confirmed v0.31.1, well above the required v0.22+)

## M6 governance_agent.py: health score distinguishes skipped vs ok vs violated
- check_slo_compliance adds _meta {violated_count, skipped_count, ok_count, all_skipped, status}
- run_self_check: when every SLO is skipped, emit a separate governance_slo_data_gap alert
  (this does not pollute the self_failure count, because no_data means the emitter is not implemented, not that the governance mechanism failed)
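
A sketch of how such a _meta summary could be computed. The field names come from the bullet above; `summarize_slo`, the input shape, and the status values ("violated", "no_data", "ok") are assumptions, not the governance_agent.py implementation.

```python
def summarize_slo(results):
    """Summarize per-SLO states.

    `results` maps an SLO name to one of "ok", "violated", "skipped".
    """
    counts = {"ok": 0, "violated": 0, "skipped": 0}
    for state in results.values():
        counts[state] += 1

    # An empty result set is not treated as "all skipped".
    all_skipped = bool(results) and counts["skipped"] == len(results)

    if counts["violated"]:
        status = "violated"
    elif all_skipped:
        status = "no_data"  # hypothetical label for the data-gap case
    else:
        status = "ok"

    return {
        "violated_count": counts["violated"],
        "skipped_count": counts["skipped"],
        "ok_count": counts["ok"],
        "all_skipped": all_skipped,
        "status": status,
    }
```

Keeping all_skipped separate from violated_count lets the caller raise a data-gap alert without counting it as a governance failure.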

## M7 scripts/check_config_drift.py: switch to AST parsing
- Replace the regex with ast.parse to locate Settings ClassDef AnnAssign Field(default=...)
- Avoids false negatives on multi-line lists / default_factory= / strings containing line breaks
- All 4 fields (AI_FALLBACK_ORDER / ARGOCD_URL / PROMETHEUS_URL / OLLAMA_URL) are aligned
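
A minimal sketch of the AST approach, assuming a Settings class whose fields are declared as `NAME: type = Field(default=...)`. `extract_defaults` and the sample source (including its URL and list values) are hypothetical; the real script is not reproduced here.

```python
import ast

# Hypothetical sample input; values are placeholders, not the project's config.
SOURCE = '''
class Settings(BaseSettings):
    OLLAMA_URL: str = Field(default="http://ollama:11434")
    AI_FALLBACK_ORDER: list = Field(
        default=[
            "ollama",
            "openai",
        ]
    )
    TAGS: list = Field(default_factory=list)
'''


def extract_defaults(source, class_name="Settings"):
    """Return {field_name: literal default} for Field(default=...) assignments."""
    defaults = {}
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.ClassDef) and node.name == class_name):
            continue
        for stmt in node.body:
            if not (isinstance(stmt, ast.AnnAssign) and isinstance(stmt.target, ast.Name)):
                continue
            call = stmt.value
            if not (
                isinstance(call, ast.Call)
                and isinstance(call.func, ast.Name)
                and call.func.id == "Field"
            ):
                continue
            for kw in call.keywords:
                # default_factory= is skipped: it has no literal value to compare.
                if kw.arg == "default":
                    defaults[stmt.target.id] = ast.literal_eval(kw.value)
    return defaults
```

Because the parser works on syntax nodes rather than lines, a default spread over several lines is read the same way as a one-liner.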

## New tests
- test_km_writer_backfill_reconciler.py: 7 cases (C1 reconciler + safe helper)
- test_km_writer_idempotent.py: 5 cases (M3 path_type injection + UPSERT branch)
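
The reconciler behaviour that the first test file pins down can be sketched roughly as follows. `FakeRedis` and `reconcile` are hypothetical; whether a malformed record counts toward "processed" is an assumption (here it does not), since the tests do not assert it.

```python
import asyncio
import json


class FakeRedis:
    """Hypothetical in-memory stand-in for LRANGE / LREM on one list."""

    def __init__(self, items):
        self.items = list(items)

    async def lrange(self, key, start, stop):
        return list(self.items)

    async def lrem(self, key, count, value):
        self.items.remove(value)


async def reconcile(redis, do_backfill, key="km:backfill:dlq"):
    stats = {"processed": 0, "success": 0, "failed": 0}
    for raw in await redis.lrange(key, 0, -1):
        try:
            record = json.loads(raw)
            incident_id, approval_id = record["incident_id"], record["approval_id"]
        except (ValueError, KeyError, TypeError):
            await redis.lrem(key, 1, raw)  # unrecoverable: drop it
            continue
        stats["processed"] += 1
        try:
            await do_backfill(incident_id, approval_id)
        except Exception:
            stats["failed"] += 1  # keep the record for the next run
            continue
        stats["success"] += 1
        await redis.lrem(key, 1, raw)
    return stats


def demo():
    good = json.dumps({"incident_id": "INC-1", "approval_id": "AP-1"}).encode()
    stuck = json.dumps({"incident_id": "INC-2", "approval_id": "AP-2"}).encode()
    redis = FakeRedis([good, b"not-json-at-all", stuck])

    async def do_backfill(incident_id, approval_id):
        if incident_id == "INC-2":
            raise ConnectionError("db connection refused")

    stats = asyncio.run(reconcile(redis, do_backfill))
    return stats, redis.items
```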

## Verification
- All 1585 unit tests pass (+13, up from 1572)
- amtool check-config SUCCESS (8 inhibit_rules / 2 receivers)
- AST-based drift checker: all 4 fields aligned
- Alertmanager v0.31.1 confirmed to support the new syntax

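## Expected impact
- KMWriter lives up to its name: the flywheel closed-loop KM write path is 100% reliable
- M4 runaway-inhibition risk removed
- The governance layer no longer stays silent on SLO no_data
- Drift checker false-negative risk removed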

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 10:44:39 +08:00

196 lines
7.3 KiB
Python
"""
KM Backfill Reconciler unit tests
=================================
P1-1 C1 fix 2026-04-28 ogt + Claude Sonnet 4.6

Test scope:
1. reconciler recovers a record from the DLQ → removed via LREM
2. reconciler DB failure → record kept in the DLQ (not removed)
3. reconciler malformed DLQ record → removed (cannot be recovered)
4. reconciler empty DLQ → 0 processed
5. ENABLE_KM_BACKFILL_RECONCILER=false → skipped
6. _backfill_path_a_approval_safe: success path does not write to the DLQ
7. _backfill_path_a_approval_safe: failure writes to km:backfill:dlq

Created: 2026-04-28 (Taipei time zone) ogt + Claude Sonnet 4.6
"""
import json
from unittest.mock import AsyncMock, MagicMock, patch

import pytest

from src.jobs.km_backfill_reconciler_job import (
    run_km_backfill_reconciler,
)
from src.services.km_writer import (
    KM_BACKFILL_DLQ_KEY,
    _backfill_path_a_approval_safe,
)


# =============================================================================
# Helper
# =============================================================================
def _make_dlq_record(incident_id: str = "INC-001", approval_id: str = "AP-001") -> bytes:
    # JSON-encoded bytes, matching what the mocked LRANGE returns in the tests.
    return json.dumps({"incident_id": incident_id, "approval_id": approval_id}).encode()


# =============================================================================
# 1. Reconciler recovers a record successfully
# =============================================================================
@pytest.mark.asyncio
async def test_reconciler_success_removes_from_dlq():
    """After a successful backfill the record should be removed from the DLQ via LREM."""
    record = _make_dlq_record("INC-R1", "AP-R1")
    mock_redis = AsyncMock()
    mock_redis.lrange = AsyncMock(return_value=[record])
    mock_redis.lrem = AsyncMock()

    with patch("src.jobs.km_backfill_reconciler_job.settings") as mock_settings, \
         patch("src.core.redis_client.get_redis", return_value=mock_redis), \
         patch("src.jobs.km_backfill_reconciler_job._do_backfill", new_callable=AsyncMock) as mock_do:
        mock_settings.ENABLE_KM_BACKFILL_RECONCILER = True
        result = await run_km_backfill_reconciler()

    assert result["processed"] == 1
    assert result["success"] == 1
    assert result["failed"] == 0
    mock_do.assert_called_once_with("INC-R1", "AP-R1")
    mock_redis.lrem.assert_called_once_with(KM_BACKFILL_DLQ_KEY, 1, record)


# =============================================================================
# 2. Reconciler DB failure → record kept in the DLQ
# =============================================================================
@pytest.mark.asyncio
async def test_reconciler_db_failure_preserves_dlq():
    """On DB failure LREM must not be called (the record stays in the DLQ for the next run)."""
    record = _make_dlq_record("INC-FAIL", "AP-FAIL")
    mock_redis = AsyncMock()
    mock_redis.lrange = AsyncMock(return_value=[record])
    mock_redis.lrem = AsyncMock()

    with patch("src.jobs.km_backfill_reconciler_job.settings") as mock_settings, \
         patch("src.core.redis_client.get_redis", return_value=mock_redis), \
         patch("src.jobs.km_backfill_reconciler_job._do_backfill",
               side_effect=Exception("db connection refused")):
        mock_settings.ENABLE_KM_BACKFILL_RECONCILER = True
        result = await run_km_backfill_reconciler()

    assert result["processed"] == 1
    assert result["success"] == 0
    assert result["failed"] == 1
    # No LREM on failure
    mock_redis.lrem.assert_not_called()


# =============================================================================
# 3. Reconciler malformed record → removed (cannot be recovered)
# =============================================================================
@pytest.mark.asyncio
async def test_reconciler_malformed_record_removed():
    """A malformed DLQ record should be removed (it must not clog the DLQ)."""
    malformed = b"not-json-at-all"
    mock_redis = AsyncMock()
    mock_redis.lrange = AsyncMock(return_value=[malformed])
    mock_redis.lrem = AsyncMock()

    with patch("src.jobs.km_backfill_reconciler_job.settings") as mock_settings, \
         patch("src.core.redis_client.get_redis", return_value=mock_redis), \
         patch("src.jobs.km_backfill_reconciler_job._do_backfill", new_callable=AsyncMock) as mock_do:
        mock_settings.ENABLE_KM_BACKFILL_RECONCILER = True
        await run_km_backfill_reconciler()

    # The malformed record is removed
    mock_redis.lrem.assert_called_once_with(KM_BACKFILL_DLQ_KEY, 1, malformed)
    # No DB backfill is attempted
    mock_do.assert_not_called()


# =============================================================================
# 4. Empty DLQ → 0 processed
# =============================================================================
@pytest.mark.asyncio
async def test_reconciler_empty_dlq():
    """An empty DLQ should yield 0 processed."""
    mock_redis = AsyncMock()
    mock_redis.lrange = AsyncMock(return_value=[])

    with patch("src.jobs.km_backfill_reconciler_job.settings") as mock_settings, \
         patch("src.core.redis_client.get_redis", return_value=mock_redis):
        mock_settings.ENABLE_KM_BACKFILL_RECONCILER = True
        result = await run_km_backfill_reconciler()

    assert result["processed"] == 0
    assert result["success"] == 0
    assert result["failed"] == 0


# =============================================================================
# 5. ENABLE_KM_BACKFILL_RECONCILER=false → skipped
# =============================================================================
@pytest.mark.asyncio
async def test_reconciler_disabled_skips():
    """With the feature flag off, return 0 immediately without touching Redis."""
    with patch("src.jobs.km_backfill_reconciler_job.settings") as mock_settings, \
         patch("src.core.redis_client.get_redis") as mock_get_redis:
        mock_settings.ENABLE_KM_BACKFILL_RECONCILER = False
        result = await run_km_backfill_reconciler()

    assert result["processed"] == 0
    mock_get_redis.assert_not_called()


# =============================================================================
# 6. _backfill_path_a_approval_safe: success path does not write to the DLQ
# =============================================================================
@pytest.mark.asyncio
async def test_backfill_safe_success_no_dlq():
    """On success nothing should be written to km:backfill:dlq."""
    with patch("src.services.km_writer._backfill_path_a_approval", new_callable=AsyncMock) as mock_bf, \
         patch("src.core.redis_client.get_redis") as mock_get_redis:
        await _backfill_path_a_approval_safe("INC-OK", "AP-OK")

    mock_bf.assert_called_once_with("INC-OK", "AP-OK")
    mock_get_redis.assert_not_called()


# =============================================================================
# 7. _backfill_path_a_approval_safe: failure writes to km:backfill:dlq
# =============================================================================
@pytest.mark.asyncio
async def test_backfill_safe_failure_writes_dlq():
    """On failure the record should be written to km:backfill:dlq and no exception raised."""
    captured_keys = []
    mock_redis = AsyncMock()

    async def _capture_lpush(key, value):
        # Record which Redis key the helper pushed to.
        captured_keys.append(key)

    mock_redis.lpush.side_effect = _capture_lpush
    mock_redis.ltrim = AsyncMock()

    with patch("src.services.km_writer._backfill_path_a_approval",
               side_effect=Exception("db error")), \
         patch("src.core.redis_client.get_redis", return_value=mock_redis):
        # Must not raise
        await _backfill_path_a_approval_safe("INC-ERR", "AP-ERR")

    assert KM_BACKFILL_DLQ_KEY in captured_keys