awoooi

wooo/awoooi

Commit Graph

Author	SHA1	Message	Date

Your Name

2ce722bda9

feat(heartbeat): full K8s pod lifecycle state machine + regression tests

Code Review / ai-code-review (push) Successful in 51s

CD Pipeline / tests (push) Successful in 2m59s

CD Pipeline / build-and-deploy (push) Has started running

CD Pipeline / post-deploy-checks (push) Has been cancelled

P0 #3 (thorough long-term fix series): upgrade the daily report's pod health
check from "always alert when ready=False" to a full K8s pod lifecycle state machine:

| Phase | Behavior |
|-------|----------|
| Succeeded / Completed | Skip (CronJob/Job finished normally) |
| Failed | Always alert |
| Unknown | Always alert |
| Pending <5min | Skip (just scheduled, which is expected) |
| Pending >=5min | Alert: "image pull / scheduling stuck" |
| Running ready=True | Healthy, skip |
| Running ready=False <2min | Skip (just started, probe has not passed yet) |
| Running ready=False >=2min | Alert: "readiness probe fail / abnormal startup" |
| restarts >=3 | Always alert (regardless of phase) |
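
The table can be read as a small decision function. The following is a minimal sketch of that reading, not the code in this commit: the constant names, the `classify` and `pod_age` helpers, the warning strings, and the handling of an unlisted phase are all assumptions made for illustration. A missing or unparsable start time is treated as "too old", which matches the conservative alert the commit message describes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Thresholds come from the table; the constant names are hypothetical.
PENDING_GRACE = timedelta(minutes=5)
NOT_READY_GRACE = timedelta(minutes=2)
RESTART_ALERT_THRESHOLD = 3


@dataclass
class PodInfo:
    name: str
    phase: str
    ready: bool
    restarts: int
    start_time: Optional[str] = None  # ISO 8601 string from .status.startTime


def pod_age(pod: PodInfo, now: datetime) -> Optional[timedelta]:
    """Return the pod's age, or None when start_time is missing or unparsable."""
    if not pod.start_time:
        return None
    try:
        started = datetime.fromisoformat(pod.start_time.replace("Z", "+00:00"))
    except ValueError:
        return None
    if started.tzinfo is None:
        # Assumption: a timestamp without an offset is UTC.
        started = started.replace(tzinfo=timezone.utc)
    return now - started


def classify(pod: PodInfo, now: datetime) -> Optional[str]:
    """Return a warning string, or None when the pod needs no attention."""
    # The restart rule applies regardless of phase, so it is checked first.
    if pod.restarts >= RESTART_ALERT_THRESHOLD:
        return f"{pod.name}: {pod.restarts} restarts"
    if pod.phase in ("Succeeded", "Completed"):
        return None
    if pod.phase in ("Failed", "Unknown"):
        return f"{pod.name}: phase {pod.phase}"
    age = pod_age(pod, now)
    if pod.phase == "Pending":
        if age is not None and age < PENDING_GRACE:
            return None
        # Unknown age falls through to the alert (conservative).
        return f"{pod.name}: image pull / scheduling stuck"
    if pod.phase == "Running":
        if pod.ready:
            return None
        if age is not None and age < NOT_READY_GRACE:
            return None
        return f"{pod.name}: readiness probe fail / abnormal startup"
    # Assumption: a phase not listed in the table is surfaced rather than hidden.
    return f"{pod.name}: unexpected phase {pod.phase}"
```

Passing `now` in explicitly, instead of reading the clock inside the function, keeps the 5-minute and 2-minute boundaries easy to pin down in tests.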

Implementation:
- PodInfo gains start_time: Optional[str] (from .status.startTime)
- _get_pod_status adds STARTTIME to the kubectl custom-columns
- _build_warnings implements the full state machine plus threshold constants
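
As a rough illustration of the first two items, the sketch below parses one line of `kubectl get pods --no-headers -o custom-columns=...` output into a `PodInfo`. Only the `STARTTIME` column and the `.status.startTime` path come from the commit message. The other column names and JSONPath expressions, the `parse_pod_line` helper, and the field layout of `PodInfo` are assumptions. The sketch relies on kubectl printing `<none>` for a missing value.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical column spec; the commit only names STARTTIME:.status.startTime.
CUSTOM_COLUMNS = (
    "NAME:.metadata.name,"
    "PHASE:.status.phase,"
    "READY:.status.containerStatuses[*].ready,"
    "RESTARTS:.status.containerStatuses[*].restartCount,"
    "STARTTIME:.status.startTime"
)
# Argument list one might pass to subprocess.run (not executed here).
KUBECTL_ARGS = [
    "kubectl", "get", "pods", "--no-headers",
    "-o", f"custom-columns={CUSTOM_COLUMNS}",
]


@dataclass
class PodInfo:
    name: str
    phase: str
    ready: bool
    restarts: int
    start_time: Optional[str] = None


def parse_pod_line(line: str) -> PodInfo:
    """Parse one whitespace-separated row with the five columns above."""
    name, phase, ready, restarts, start_time = line.split()
    return PodInfo(
        name=name,
        phase=phase,
        # Multi-container pods yield comma-separated values; all must be ready.
        ready=all(flag == "true" for flag in ready.split(",")),
        # Sum restart counts across containers; "<none>" contributes nothing.
        restarts=sum(int(n) for n in restarts.split(",") if n.isdigit()),
        start_time=None if start_time == "<none>" else start_time,
    )
```

Keeping `start_time` as the raw string, as the `Optional[str]` annotation in the commit suggests, defers timestamp parsing to the code that applies the thresholds.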

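Regression tests (test_heartbeat_pod_state_machine.py, 13 tests) cover every phase
plus boundary conditions, including a reproduction of the commander's 2026-05-02
screenshot evidence (3 drift-scanner Succeeded pods must not trigger a false
"3 items need attention" alert).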

Tests: 13 passed (new file test_heartbeat_pod_state_machine.py)

Follows up on a38d9112 (which only skipped Succeeded pods). This change fully handles
Pending/Failed/Unknown, the time thresholds, and a conservative alert when start_time is missing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-03 01:44:58 +08:00

1 Commits