Your Name
|
16775bb4fa
|
feat(adr100): bridge playbook authoring approvals
CD Pipeline / tests (push) Successful in 1m20s
Code Review / ai-code-review (push) Successful in 13s
CD Pipeline / build-and-deploy (push) Successful in 7m44s
CD Pipeline / post-deploy-checks (push) Successful in 2m49s
|
2026-06-01 20:49:28 +08:00 |
|
Your Name
|
e9a8a2b3e9
|
test(alerts): 對齊 no-action 修復語意測試
CD Pipeline / tests (push) Successful in 1m19s
Code Review / ai-code-review (push) Successful in 12s
CD Pipeline / build-and-deploy (push) Successful in 3m29s
CD Pipeline / post-deploy-checks (push) Successful in 1m47s
|
2026-05-31 15:53:14 +08:00 |
|
Your Name
|
e2ab879636
|
fix(alerts): correct telegram execution truth
CD Pipeline / tests (push) Failing after 52s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 11s
|
2026-05-31 13:58:39 +08:00 |
|
Your Name
|
80c36ba801
|
fix(incident): F2 NO_ACTION 觸發 resolve_incident + 冪等 guard
Code Review / ai-code-review (push) Successful in 11s
CD Pipeline / tests (push) Successful in 1m9s
CD Pipeline / build-and-deploy (push) Successful in 3m29s
CD Pipeline / post-deploy-checks (push) Successful in 1m30s
【根因】INC-20260507-99ADF2 飛輪斷流,566+ stuck incidents(30秒漲 1)核心
原因:NO_ACTION 路徑 (approval_execution.py:251) 提前 return True,跳過
line 482-495 已有的 resolve_incident 呼叫,incident 永遠卡 INVESTIGATING。
【修法】
- approval_execution.py NO_ACTION 分支補 resolve_incident 呼叫 + 成功/失敗
log,背景 log 加 path="no_action" 用於 prod 量化修法生效率(debugger
全鏈分析 + critic 1st/2nd 審查必修 #1)。
- incident_service.py resolve_incident 在 line 1106 加 RESOLVED 冪等 guard,
早於所有副作用(status mutation / Redis / DB / postmortem / KB / KM /
disposition),順帶修 success path line 482-495 重觸 postmortem 的潛在
老風險(critic 必修 #2)。
【遵守 Codex 5/6 設計(feedback_respect_codex_design_intent.md)】
- 不動 flywheel_stats_service.py / heartbeat_report_service.py /
auto_repair_service.record_auto_repair() / metrics_repository UPPER(status)。
- resolve_incident 不寫 auto_repair_executions 表(Codex 5/6 source of
truth),不污染 24h KPI 計算。
【Test 覆蓋】
- test_approval_execution_no_action.py:NO_ACTION → resolve 被呼叫一次 +
resolve raise 時仍 return True(NO_ACTION 不能因 resolve 失敗退化成 False,
否則污染 auto_execute KPI line 207-208 註解契約)。
- test_incident_service_resolve_idempotency.py:RESOLVED → return existing +
save_to_working_memory 不被呼叫;not_found → return None。
【驗收條件(部署後 24h)】
1. grep `path="no_action"` 中 incident_resolved_after_no_action_execution
數量 vs background_execution_noop 數量,1:1 才算修復成功。
2. awoooi_flywheel_incidents_stuck 從每 30 秒漲 1 變平緩。
3. SRE 群 24h 內若湧入 >20 份 NO_ACTION postmortem 觸發 follow-up 評估
resolution_type="no_action" 跳過 postmortem(critic Minor #3 方案 B)。
Refs: INC-20260507-99ADF2, debugger root cause #1 (鏈 A), critic 1st 必修
#1 #2, critic 2nd 必修 #1 #2 #3
Co-Authored-By: Codex (aider) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-07 18:55:58 +08:00 |
|