From 1a1dea00eb12aa4b1daba2056239ecfc21ecdc62 Mon Sep 17 00:00:00 2001
From: Your Name
Date: Thu, 7 May 2026 01:27:09 +0800
Subject: [PATCH] docs(logbook): record alert grouping threshold deploy [skip ci]

---
 docs/LOGBOOK.md | 73 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)

diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 27f3a4a4..71c18d1d 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -4702,3 +4702,76 @@ API log:
 ```
 
 Assessment: Runbook Review, Escalation, and follow-up summaries for the same incident now first try to attach to the original alert card, so Telegram no longer posts all of these messages at the top level. Next to tackle are "cross-incident aggregation by fingerprint" and the AwoooP Timeline UI, which move the full execution history into the Console and leave Telegram with only summaries and manual entry points.
+
+### 01:32 Alert Grouping threshold tightened
+
+**Background**:
+
+- Telegram Incident follow-up threading already solved the problem of attaching follow-up messages for the same Incident back to the original alert card.
+- The other source of flooding is alert storms that span different Incidents but share the same alertname/namespace. The existing `AlertGroupingService` has a 5-minute sliding window, but its threshold was 3, so the first two alerts in a group could each still go to AI / Telegram.
+- For alerts such as DockerContainerRestartSpike and HostLoadAverageSustainedHigh, where many containers or targets fire at once, a threshold of 3 is too loose: Telegram gets flooded with two main cards before grouping takes effect.
+
+**Changes**:
+
+- `alert_grouping_service.py`: `GROUP_THRESHOLD` changed from `3` to `2`.
+  - The first alert in a group is kept as the Parent Alert / main card.
+  - From the second alert onward, alerts within the 5-minute window are collapsed into child alerts, short-circuiting the LLM and the Telegram main card.
+- `test_alert_grouping_service.py`: updated the threshold tests and their semantic descriptions.
+
+**Verification**:
+
+```text
+py_compile:
+apps/api/src/services/alert_grouping_service.py
+apps/api/tests/test_alert_grouping_service.py
+# passed
+
+pytest:
+apps/api/tests/test_alert_grouping_service.py
+# 16 passed
+```
+
+**Production deployment**:
+
+```text
+Commit:
+c49246b8 fix(alerts): group repeated alerts from second firing
+
+Gitea workflows:
+1841 CD Pipeline:
+- tests -> success
+- build-and-deploy -> success
+- post-deploy-checks -> success
+1842 Code Review -> success
+
+CD deploy marker:
+8485d993 chore(cd): deploy c49246b [skip ci]
+
+awoooi-api image:
+192.168.0.110:5000/awoooi/api:c49246b8c629ee23b89b55c916ab51bed7c73516
+
+awoooi-web image:
+192.168.0.110:5000/awoooi/web:c49246b8c629ee23b89b55c916ab51bed7c73516
+
+awoooi-worker image:
+192.168.0.110:5000/awoooi/api:c49246b8c629ee23b89b55c916ab51bed7c73516
+
+K8s rollout:
+awoooi-api -> successfully rolled out
+awoooi-web -> successfully rolled out
+awoooi-worker -> successfully rolled out
+awoooi-api pods -> 2/2 Running, 0 restarts
+awoooi-worker pod -> 1/1 Running, 0 restarts
+
+HTTP:
+/api/v1/health -> 200
+/zh-TW -> 200
+/zh-TW/awooop -> 200
+/zh-TW/awooop/runs -> 200
+/zh-TW/awooop/approvals -> 200
+
+API log:
+No alert_grouping_redis_error / alertmanager_grouping_failed / telegram_request_failed / telegram_api_error seen in the last 5 minutes.
+```
+
+Assessment: Telegram noise control now has two layers of defense: follow-up messages for the same Incident attach to the original card, and cross-Incident alerts in the same group are collapsed starting from the second firing. The next step is to write grouped child alert summaries and counts into the AwoooP Timeline / Run Monitor, so that Telegram stays quiet while the Console keeps the full context.
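
Reviewer note (not part of the patch): the threshold semantics recorded in this entry can be illustrated with a minimal in-memory sketch. It is not the real `AlertGroupingService`; the class name `GroupingSketch`, the `classify` method, the `"prod"` namespace, and the in-memory deque storage are all hypothetical, and the log names suggest the real service keeps its window state in Redis. Only `GROUP_THRESHOLD`, the 5-minute window, and the alertname/namespace grouping key come from the entry itself.

```python
from collections import defaultdict, deque

GROUP_THRESHOLD = 2      # value recorded in this entry (previously 3)
WINDOW_SECONDS = 5 * 60  # 5-minute sliding window


class GroupingSketch:
    """Hypothetical in-memory stand-in, not the real AlertGroupingService."""

    def __init__(self, threshold=GROUP_THRESHOLD, window=WINDOW_SECONDS):
        self.threshold = threshold
        self.window = window
        # (alertname, namespace) -> timestamps of firings still inside the window
        self._seen = defaultdict(deque)

    def classify(self, alertname, namespace, now):
        """Return 'parent' or 'child' for an alert firing at `now` (seconds)."""
        stamps = self._seen[(alertname, namespace)]
        # Drop firings that have slid out of the window.
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        stamps.append(now)
        # Once the in-window count reaches the threshold, the alert is a child.
        return "child" if len(stamps) >= self.threshold else "parent"


sketch = GroupingSketch()
results = [
    sketch.classify("DockerContainerRestartSpike", "prod", 0),
    sketch.classify("DockerContainerRestartSpike", "prod", 30),
    sketch.classify("DockerContainerRestartSpike", "prod", 60),
    sketch.classify("HostLoadAverageSustainedHigh", "prod", 60),
    sketch.classify("DockerContainerRestartSpike", "prod", 1000),
]
```

Under these assumptions, passing `threshold=3` reproduces the old behaviour, where the first two alerts in a group are both treated as parents and only the third is collapsed.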