awoooi/apps/api/tests/test_incident_service_resolve_idempotency.py
Your Name 80c36ba801
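fix(incident): F2 NO_ACTION triggers resolve_incident + idempotency guard
[Root cause] INC-20260507-99ADF2: the flywheel stalled, with 566+ stuck
incidents (growing by 1 every 30 seconds). The core cause: the NO_ACTION path
(approval_execution.py:251) returns True early and skips the existing
resolve_incident call at lines 482-495, so the incident stays stuck in
INVESTIGATING forever.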

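[Fix]
- approval_execution.py: the NO_ACTION branch now calls resolve_incident and
  logs success/failure. The background log adds path="no_action" so the fix's
  effectiveness can be quantified in prod (debugger full-chain analysis +
  critic 1st/2nd review must-fix #1).
- incident_service.py: resolve_incident gains a RESOLVED idempotency guard at
  line 1106, ahead of all side effects (status mutation / Redis / DB /
  postmortem / KB / KM / disposition). This also fixes a latent old risk where
  the success path at lines 482-495 could re-trigger the postmortem (critic
  must-fix #2).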
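
The two changes above can be sketched together as follows. This is a minimal in-memory illustration, not the code in approval_execution.py or incident_service.py: `SketchIncidentService`, `Incident`, `Status`, `handle_no_action`, and the `side_effects` list are hypothetical names invented for the example.

```python
import asyncio
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Status(Enum):
    INVESTIGATING = "investigating"
    RESOLVED = "resolved"


@dataclass
class Incident:
    incident_id: str
    status: Status


class SketchIncidentService:
    """Hypothetical in-memory stand-in, not the real IncidentService."""

    def __init__(self) -> None:
        self.store: Dict[str, Incident] = {}
        # Records postmortem-like side effects so repeats are visible.
        self.side_effects: List[str] = []

    async def resolve_incident(self, incident_id: str) -> Optional[Incident]:
        incident = self.store.get(incident_id)
        if incident is None:
            return None
        # Idempotency guard: placed before the status mutation and every
        # other side effect, so a repeated resolve is a no-op.
        if incident.status is Status.RESOLVED:
            return incident
        incident.status = Status.RESOLVED
        self.side_effects.append(f"postmortem:{incident_id}")
        return incident


async def handle_no_action(svc: SketchIncidentService, incident_id: str) -> bool:
    """NO_ACTION branch: attempt the resolve, but always report True."""
    try:
        await svc.resolve_incident(incident_id)
    except Exception:
        # Assumption: the real branch logs the failure here with
        # path="no_action"; the sketch only swallows the error.
        pass
    return True
```

Calling `handle_no_action` and then `resolve_incident` again on the same ID leaves exactly one entry in `side_effects`, which is the property the guard is meant to protect.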

[Respecting the Codex 5/6 design (feedback_respect_codex_design_intent.md)]
- No changes to flywheel_stats_service.py / heartbeat_report_service.py /
  auto_repair_service.record_auto_repair() / metrics_repository UPPER(status).
- resolve_incident does not write to the auto_repair_executions table (the
  Codex 5/6 source of truth), so the 24h KPI calculation is not polluted.

[Test coverage]
- test_approval_execution_no_action.py: NO_ACTION → resolve is called exactly
  once, and when resolve raises the path still returns True (NO_ACTION must
  not degrade to False because resolve failed; that would pollute the
  auto_execute KPI, per the contract in the comment at lines 207-208).
- test_incident_service_resolve_idempotency.py: RESOLVED → returns the
  existing incident and save_to_working_memory is not called; not_found →
  returns None.
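
The NO_ACTION contract described above (resolve called once, True returned even when resolve raises) could be checked roughly like this. The sketch does not reproduce test_approval_execution_no_action.py: `execute_no_action` is a hypothetical stand-in for the real branch, and the checks use `asyncio.run` instead of pytest so the block runs on its own.

```python
import asyncio
from unittest.mock import AsyncMock


async def execute_no_action(resolve_incident, incident_id: str) -> bool:
    """Hypothetical stand-in for the NO_ACTION branch under test."""
    try:
        await resolve_incident(incident_id)
    except Exception:
        pass  # the real code would log the failure instead
    return True


def check_resolve_called_once() -> bool:
    resolve = AsyncMock(return_value=None)
    result = asyncio.run(execute_no_action(resolve, "INC-TEST-001"))
    # Fails loudly if resolve was skipped or awaited more than once.
    resolve.assert_awaited_once_with("INC-TEST-001")
    return result


def check_still_true_when_resolve_raises() -> bool:
    # side_effect makes the mock raise when it is awaited.
    resolve = AsyncMock(side_effect=RuntimeError("simulated failure"))
    result = asyncio.run(execute_no_action(resolve, "INC-TEST-001"))
    resolve.assert_awaited_once_with("INC-TEST-001")
    return result
```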

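[Acceptance criteria (24h after deploy)]
1. Among the log lines matched by grep `path="no_action"`, compare the count
   of incident_resolved_after_no_action_execution against the count of
   background_execution_noop; the fix counts as successful only at 1:1.
2. awoooi_flywheel_incidents_stuck goes from growing by 1 every 30 seconds to
   flat.
3. If the SRE group receives >20 NO_ACTION postmortems within 24h, trigger a
   follow-up to evaluate having resolution_type="no_action" skip the
   postmortem (critic Minor #3, option B).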
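
Criterion 1 could be checked with a small script along these lines. The log line layout is an assumption: the sketch only requires that each relevant line contains the literal text `path="no_action"` and one of the two event names from the commit message. The function names are hypothetical.

```python
from collections import Counter
from typing import Iterable

RESOLVED_EVENT = "incident_resolved_after_no_action_execution"
NOOP_EVENT = "background_execution_noop"


def no_action_counts(lines: Iterable[str]) -> Counter:
    """Count the two events among lines tagged path="no_action"."""
    counts: Counter = Counter()
    for line in lines:
        if 'path="no_action"' not in line:
            continue  # same filter as the grep in criterion 1
        for event in (RESOLVED_EVENT, NOOP_EVENT):
            if event in line:
                counts[event] += 1
    return counts


def fix_is_effective(lines: Iterable[str]) -> bool:
    """True only when noop events exist and resolves match them 1:1."""
    counts = no_action_counts(lines)
    return counts[NOOP_EVENT] > 0 and counts[RESOLVED_EVENT] == counts[NOOP_EVENT]
```

Requiring at least one noop event keeps an empty log window from being reported as a pass.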

Refs: INC-20260507-99ADF2, debugger root cause #1 (chain A), critic 1st
must-fix #1 #2, critic 2nd must-fix #1 #2 #3

Co-Authored-By: Codex (aider) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 18:55:58 +08:00

65 lines
2.2 KiB
Python
"""
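test_incident_service_resolve_idempotency
==========================================

Verifies that `IncidentService.resolve_incident` is idempotent for an incident
that is already RESOLVED:

- returns the existing incident directly
- does not call save_to_working_memory (avoids a duplicate Redis write)
- does not call incident_repository.update_status (avoids a duplicate DB write)
- does not trigger the postmortem / KB extract / KM convert / disposition
  side effects

Corresponds to critic must-fix #2: without this unit test, someone who later
moves the guard could silently break it and re-amplify the old risk of
"resolve_incident repeatedly triggers postmortem spam".
"""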
from types import SimpleNamespace
from unittest.mock import AsyncMock

import pytest

from src.models.incident import IncidentStatus
from src.services.incident_service import IncidentService


@pytest.mark.asyncio
async def test_resolve_incident_skips_when_already_resolved(monkeypatch):
    """Resolving an already RESOLVED incident again should be idempotent."""
    fake_incident = SimpleNamespace(
        incident_id="INC-IDEMPO-001",
        status=IncidentStatus.RESOLVED,
    )
    svc = IncidentService()

    # Mock the entry-point read so it returns the RESOLVED incident.
    monkeypatch.setattr(
        svc, "get_from_working_memory", AsyncMock(return_value=fake_incident)
    )

    # Mock the downstream side effect with an AsyncMock to observe any call.
    save_mock = AsyncMock(return_value=True)
    monkeypatch.setattr(svc, "save_to_working_memory", save_mock)

    result = await svc.resolve_incident("INC-IDEMPO-001")

    # Should return the existing incident.
    assert result is fake_incident
    # No side effect may fire: the guard must run before the status mutation
    # at line 1117.
    save_mock.assert_not_called()


@pytest.mark.asyncio
async def test_resolve_incident_returns_none_when_not_found(monkeypatch):
    """Returns None when the incident does not exist.

    Ensures the guard does not affect the not-found path.
    """
    svc = IncidentService()
    monkeypatch.setattr(
        svc, "get_from_working_memory", AsyncMock(return_value=None)
    )
    save_mock = AsyncMock(return_value=True)
    monkeypatch.setattr(svc, "save_to_working_memory", save_mock)

    result = await svc.resolve_incident("INC-NOT-EXIST")

    assert result is None
    save_mock.assert_not_called()