awoooi/apps/api
Your Name 2ce722bda9
Some checks failed
Code Review / ai-code-review (push) Successful in 51s
CD Pipeline / tests (push) Successful in 2m59s
CD Pipeline / build-and-deploy (push) Has started running
CD Pipeline / post-deploy-checks (push) Has been cancelled
feat(heartbeat): full K8s pod lifecycle state machine + regression tests
P0 #3 (thorough long-term fix series): upgrade the daily report's pod health check
from "always alert on ready=False" to a full K8s pod lifecycle state machine:

| Phase | Behavior |
|-------|----------|
| Succeeded / Completed | Skip (CronJob/Job finished normally) |
| Failed | Always alert |
| Unknown | Always alert |
| Pending <5min | Skip (just scheduled, which is expected) |
| Pending >=5min | Alert: "image pull / scheduling stuck" |
| Running ready=True | Healthy, skip |
| Running ready=False <2min | Skip (just started, probe has not passed yet) |
| Running ready=False >=2min | Alert: "readiness probe fail / abnormal startup" |
| restarts >=3 | Always alert (regardless of phase) |
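
The table can be read as a single decision function. The following is a minimal sketch of that reading, not the code in this commit: the constant names, the `pod_warning` helper, the check order (restarts first), and the treatment of unrecognized phases are all assumptions. Only the thresholds, the phase behaviors, and the conservative alert on a missing `start_time` come from the commit message.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Threshold values come from the table; the constant names are hypothetical.
PENDING_GRACE = timedelta(minutes=5)
NOT_READY_GRACE = timedelta(minutes=2)
RESTART_LIMIT = 3


@dataclass
class PodInfo:
    name: str
    phase: str
    ready: bool
    restarts: int
    start_time: Optional[str] = None  # RFC 3339 string from .status.startTime


def _age(pod: PodInfo, now: datetime) -> Optional[timedelta]:
    """Return the pod's age, or None when start_time is missing or unparsable."""
    if not pod.start_time:
        return None
    try:
        started = datetime.fromisoformat(pod.start_time.replace("Z", "+00:00"))
    except ValueError:
        return None
    if started.tzinfo is None:
        # Assumption: timestamps without an offset are UTC.
        started = started.replace(tzinfo=timezone.utc)
    return now - started


def pod_warning(pod: PodInfo, now: datetime) -> Optional[str]:
    """Return a warning string for an unhealthy pod, or None to skip it."""
    # Assumption: the restart check runs first so it applies to every phase.
    if pod.restarts >= RESTART_LIMIT:
        return f"{pod.name}: {pod.restarts} restarts"
    if pod.phase in ("Succeeded", "Completed"):
        return None
    if pod.phase in ("Failed", "Unknown"):
        return f"{pod.name}: phase {pod.phase}"
    age = _age(pod, now)  # None means "unknown age", which alerts conservatively
    if pod.phase == "Pending":
        if age is not None and age < PENDING_GRACE:
            return None
        return f"{pod.name}: image pull / scheduling stuck"
    if pod.phase == "Running":
        if pod.ready or (age is not None and age < NOT_READY_GRACE):
            return None
        return f"{pod.name}: readiness probe fail / abnormal startup"
    # Assumption: phases outside the table are treated as alerts.
    return f"{pod.name}: unrecognized phase {pod.phase}"
```

Passing `now` explicitly rather than calling `datetime.now()` inside the function keeps the boundary cases (exactly 2 minutes, exactly 5 minutes) deterministic in tests.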

Implementation:
- PodInfo gains start_time: Optional[str] (read from .status.startTime)
- _get_pod_status adds STARTTIME to the kubectl custom-columns
- _build_warnings implements the full state machine plus threshold constants
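
As an illustration of the `STARTTIME` column, here is a hedged sketch of building the `kubectl` arguments and parsing one output row. Only `STARTTIME` mapped to `.status.startTime` is taken from the commit message; the other column names and JSONPath expressions, the `kubectl_args` and `parse_pod_line` helpers, and the choice to sum restart counts across containers are assumptions. The sketch relies on `kubectl` printing `<none>` for a missing field and comma-separating multi-container values.

```python
from dataclasses import dataclass
from typing import List, Optional

# Only STARTTIME / .status.startTime comes from the commit message;
# the remaining columns are assumed for the sake of a complete example.
CUSTOM_COLUMNS = ",".join([
    "NAME:.metadata.name",
    "PHASE:.status.phase",
    "READY:.status.containerStatuses[*].ready",
    "RESTARTS:.status.containerStatuses[*].restartCount",
    "STARTTIME:.status.startTime",
])


def kubectl_args(namespace: str) -> List[str]:
    """Build the argument list for listing pods with the columns above."""
    return [
        "kubectl", "get", "pods", "-n", namespace,
        "--no-headers", "-o", f"custom-columns={CUSTOM_COLUMNS}",
    ]


@dataclass
class PodInfo:
    name: str
    phase: str
    ready: bool
    restarts: int
    start_time: Optional[str] = None


def parse_pod_line(line: str) -> PodInfo:
    """Parse one whitespace-separated row of the custom-columns output."""
    name, phase, ready, restarts, start = line.split()
    ready_flags = [] if ready == "<none>" else ready.split(",")
    restart_counts = [] if restarts == "<none>" else restarts.split(",")
    return PodInfo(
        name=name,
        phase=phase,
        # A pod counts as ready only if it has containers and all are ready.
        ready=bool(ready_flags) and all(f == "true" for f in ready_flags),
        # Assumption: restarts are summed across containers.
        restarts=sum(int(c) for c in restart_counts),
        start_time=None if start == "<none>" else start,
    )
```

A pod that has not been scheduled yet has no `.status.startTime`, so `start_time` stays `None` and the state machine can fall back to its conservative alert.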

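Regression tests (test_heartbeat_pod_state_machine.py, 13 tests) cover every phase
plus boundary conditions, including a reproduction of the commander's 2026-05-02
screenshot evidence (three Succeeded drift-scanner pods must not trigger the
false "3 items need attention" alert).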

Tests: 13 passed (new file: test_heartbeat_pod_state_machine.py)

Follows a38d9112 (plain Succeeded skip). This change fully handles Pending/Failed/Unknown,
the time thresholds, and a conservative alert when start_time is missing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 01:44:58 +08:00