docs(logbook): record alert grouping threshold deploy [skip ci]
@@ -4702,3 +4702,76 @@ API log:
```
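
Assessment: Runbook Review, Escalation, and follow-up summaries for the same incident now first try to thread back onto the original alert card, so Telegram no longer posts all of these as top-level messages. The next items are cross-incident aggregation by fingerprint and the AwoooP Timeline UI, which moves the full execution history into the Console and leaves Telegram with only summaries and the manual-action entry points.
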
### 01:32 Alert Grouping threshold tightened

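**Background**:

- Telegram Incident follow-up threading already ensures that follow-up messages for the same Incident attach to the original alert card.
- The other source of flooding is an alert storm across different Incidents that share the same alertname/namespace. The existing `AlertGroupingService` has a 5-minute sliding window, but its threshold was 3, so the first two alerts in a group could still each go to AI / Telegram on their own.
- For scenarios such as DockerContainerRestartSpike and HostLoadAverageSustainedHigh, where multiple containers or targets fire at once, a threshold of 3 is too loose: Telegram gets flooded by two main cards before grouping kicks in.
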
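**Changes**:

- `alert_grouping_service.py`: `GROUP_THRESHOLD` changed from `3` to `2`.
  - The first alert in a group is kept as the Parent Alert / main card.
  - From the second alert on, alerts within the 5-minute window are folded in as child alerts, short-circuiting the LLM and the Telegram main card.
- `test_alert_grouping_service.py`: updated the threshold tests and their semantic descriptions.
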
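
The threshold semantics described in this entry can be sketched as follows. This is a minimal in-memory illustration, not the actual `AlertGroupingService`: the class name `InMemoryAlertGrouper`, the `classify` method, the `"parent"`/`"child"` return values, and the `"prod"` namespace are all hypothetical, and the real service may store its window elsewhere. Only the threshold values (3 before, 2 after), the 5-minute window, and the alertname/namespace grouping come from the log. Passing `threshold=3` reproduces the old behaviour, where the first two firings each get their own card.

```python
from collections import defaultdict, deque

GROUP_THRESHOLD = 2      # from the log entry: lowered from 3 to 2
WINDOW_SECONDS = 5 * 60  # from the log entry: 5-minute sliding window


class InMemoryAlertGrouper:
    """Hypothetical stand-in for AlertGroupingService; the real API may differ."""

    def __init__(self, threshold=GROUP_THRESHOLD, window=WINDOW_SECONDS):
        self.threshold = threshold
        self.window = window
        # (alertname, namespace) -> timestamps of firings still inside the window
        self._seen = defaultdict(deque)

    def classify(self, alertname, namespace, now):
        """Return "parent" if the alert keeps its own card, else "child"."""
        firings = self._seen[(alertname, namespace)]
        # Drop firings that have slid out of the window.
        while firings and now - firings[0] >= self.window:
            firings.popleft()
        firings.append(now)
        # Once the count inside the window reaches the threshold, group the alert.
        return "child" if len(firings) >= self.threshold else "parent"


# Usage with timestamps in seconds: the second firing is already grouped.
grouper = InMemoryAlertGrouper()
print(grouper.classify("DockerContainerRestartSpike", "prod", now=0))
print(grouper.classify("DockerContainerRestartSpike", "prod", now=30))
```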
**Verification**:

```text
py_compile:
apps/api/src/services/alert_grouping_service.py
apps/api/tests/test_alert_grouping_service.py
# passed

pytest:
apps/api/tests/test_alert_grouping_service.py
# 16 passed
```

**Production deployment**:

```text
Commit:
c49246b8 fix(alerts): group repeated alerts from second firing

Gitea workflows:
1841 CD Pipeline:
- tests -> success
- build-and-deploy -> success
- post-deploy-checks -> success
1842 Code Review -> success

CD deploy marker:
8485d993 chore(cd): deploy c49246b [skip ci]

awoooi-api image:
192.168.0.110:5000/awoooi/api:c49246b8c629ee23b89b55c916ab51bed7c73516

awoooi-web image:
192.168.0.110:5000/awoooi/web:c49246b8c629ee23b89b55c916ab51bed7c73516

awoooi-worker image:
192.168.0.110:5000/awoooi/api:c49246b8c629ee23b89b55c916ab51bed7c73516

K8s rollout:
awoooi-api -> successfully rolled out
awoooi-web -> successfully rolled out
awoooi-worker -> successfully rolled out
awoooi-api pods -> 2/2 Running, 0 restarts
awoooi-worker pod -> 1/1 Running, 0 restarts

HTTP:
/api/v1/health -> 200
/zh-TW -> 200
/zh-TW/awooop -> 200
/zh-TW/awooop/runs -> 200
/zh-TW/awooop/approvals -> 200

API log:
No alert_grouping_redis_error / alertmanager_grouping_failed / telegram_request_failed / telegram_api_error seen in the last 5 minutes.
```

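
The API log check above could be repeated with a small script along these lines. This is a sketch under stated assumptions: the four event names come from the log entry, but the helper `recent_error_lines`, the line layout (an ISO 8601 timestamp, a space, then the message), and the sample lines and dates are hypothetical and do not reflect the real API log format.

```python
from datetime import datetime, timedelta

# Event names taken from the log entry above.
ERROR_EVENTS = (
    "alert_grouping_redis_error",
    "alertmanager_grouping_failed",
    "telegram_request_failed",
    "telegram_api_error",
)


def recent_error_lines(lines, now, window=timedelta(minutes=5)):
    """Return lines no older than `window` that mention a watched event.

    Assumes each line starts with an ISO 8601 timestamp followed by a space;
    that layout is an assumption, not the real API log format.
    """
    hits = []
    for line in lines:
        stamp, _, rest = line.partition(" ")
        try:
            ts = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if now - ts <= window and any(event in rest for event in ERROR_EVENTS):
            hits.append(line)
    return hits


# Usage with made-up sample lines: an empty result means the check is clean.
sample = [
    "2024-01-01T01:20:00 event=telegram_api_error",  # outside the window
    "2024-01-01T01:39:00 event=alert_grouped",       # not a watched event
]
print(recent_error_lines(sample, now=datetime(2024, 1, 1, 1, 40, 0)))
```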
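Assessment: Telegram noise control now has two layers of defense: follow-up messages for the same Incident thread back onto the original card, and same-group alerts across Incidents are folded in from the second firing onward. The next step is to write grouped child alert summaries and counts into the AwoooP Timeline / Run Monitor, so that Telegram stays quiet while the Console keeps the full context.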