diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 71c18d1d..35f0bfb2 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -4775,3 +4775,102 @@ API log:
 ```
 
 Assessment: Telegram noise control now has two lines of defense in place: follow-up messages for the same Incident are attached back to the original card, and alerts in the same group across Incidents converge starting from the second firing. The next step is to write grouped child alert summaries and counts into the AwoooP Timeline / Run Monitor, so that Telegram is not flooded while the Console still keeps full context.
+
+### 01:42 AwoooP: persisting convergence events and showing them in Run monitoring
+
+**Background**:
+
+- After the previous round lowered the AlertGrouping threshold to 2, the second alert in a group short-circuits the LLM / Telegram path, which solved the flooding and token-cost problems.
+- However, that branch originally wrote only an `alertmanager_grouped_skip` log, so the Operator Console could not show which child alerts had been merged: Telegram went quiet, but the frontend lost context.
+- This round adds a control-plane record that is not sent to Telegram but is written to the AwoooP event stream.
+
+**Changes**:
+
+- `channel_hub.py` adds grouped alert event helpers:
+  - `build_grouped_alert_provider_event_id()`: generates the idempotent ID `alert-group:{alert_id}:{fingerprint}`.
+  - `format_grouped_alert_event_content()`: assembles alertname, severity, namespace, target, group count, and parent/child fingerprint.
+  - `record_grouped_alert_event()`: writes to `awooop_conversation_event` with `channel_type=internal`; fails open on DB errors and does not block the Alertmanager ACK.
+- `webhooks.py`: in the `grouping_result.is_grouped` branch, persists the event in the background via `background_tasks.add_task()`, still replies immediately with `converged=True`, and neither calls the LLM nor sends to Telegram.
+- The Platform API adds `GET /api/v1/platform/events/recent`, which returns recent events filtered by `project_id`, `channel_type`, and `provider_prefix`.
+- `/zh-TW/awooop/runs` adds a 「最近告警收斂」 (recent alert convergence) section that reads `channel_type=internal&provider_prefix=alert-group`, so grouped child alerts appear in the Operator Console rather than in Telegram.
+
+**Verification**:
+
+```text
+py_compile:
+apps/api/src/services/channel_hub.py
+apps/api/src/api/v1/webhooks.py
+apps/api/src/services/platform_operator_service.py
+apps/api/src/api/v1/platform/events.py
+apps/api/src/api/v1/platform/__init__.py
+apps/api/tests/test_channel_hub_grouped_alert_events.py
+apps/api/tests/test_platform_router_order.py
+# passed
+
+pytest:
+DATABASE_URL='postgresql+asyncpg://test:test@127.0.0.1:5432/test' \
+  /Users/ogt/awoooi/apps/api/.venv/bin/python -m pytest \
+  apps/api/tests/test_channel_hub_grouped_alert_events.py \
+  apps/api/tests/test_platform_router_order.py \
+  apps/api/tests/test_alert_grouping_service.py -q
+# 20 passed
+
+ruff import order:
+channel_hub.py / platform_operator_service.py / platform/events.py /
+platform/__init__.py / grouped alert tests / platform router tests
+# All checks passed
+
+frontend:
+pnpm --filter @awoooi/web typecheck
+# passed
+
+NEXT_PUBLIC_API_URL='https://awoooi.wooo.work' pnpm --filter @awoooi/web build
+# passed
+```
+
+**Production deployment**:
+
+```text
+Commit:
+251554c0 fix(awooop): record grouped alert events
+
+Gitea workflows:
+1843 CD Pipeline:
+- tests -> success
+- build-and-deploy -> success
+- post-deploy-checks -> success
+1844 Code Review -> success
+
+CD deploy marker:
+e5fd9395 chore(cd): deploy 251554c [skip ci]
+
+awoooi-api image:
+192.168.0.110:5000/awoooi/api:251554c0440f0b6c0f2668dcee7780495c873c57
+
+awoooi-web image:
+192.168.0.110:5000/awoooi/web:251554c0440f0b6c0f2668dcee7780495c873c57
+
+awoooi-worker image:
+192.168.0.110:5000/awoooi/api:251554c0440f0b6c0f2668dcee7780495c873c57
+
+K8s rollout:
+awoooi-api -> successfully rolled out
+awoooi-web -> successfully rolled out
+awoooi-worker -> successfully rolled out
+awoooi-api pods -> 2/2 Running, 0 restarts
+awoooi-web pods -> 2/2 Running, 0 restarts
+awoooi-worker pod -> 1/1 Running, 0 restarts
+
+HTTP:
+/api/v1/health -> 200
+/zh-TW/awooop -> 200, no Application error
+/zh-TW/awooop/runs -> 200, no Application error, contains 最近告警收斂
+/zh-TW/awooop/approvals -> 200, no Application error
+/api/v1/platform/events/recent?channel_type=internal&provider_prefix=alert-group&limit=1
+  -> 200, {"events": [], "total": 0, "limit": 1}
+
+API log:
+No grouped_alert_event_record_failed / Traceback seen in the last 10 minutes.
+```
+
+Assessment: the third layer of Telegram noise control is now live. From the second alert in a group onward, alerts converge: they no longer send a Telegram main card, and they enter AwoooP Run monitoring as internal channel events. If the group chat still feels noisy, the next step should be a periodically updated summary on the parent card or a war-room thread digest, not a return to sending every child alert.
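
The `channel_hub.py` helpers described in this entry are not part of this hunk. The idempotent ID and fail-open write they describe might look roughly like the sketch below. Only the two function names, the `alert-group:{alert_id}:{fingerprint}` format, `channel_type=internal`, and the `grouped_alert_event_record_failed` log name come from the entry; the signatures, the `save` callback, and `InMemoryStore` are hypothetical stand-ins, not the actual implementation.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Dict

logger = logging.getLogger("channel_hub_sketch")


def build_grouped_alert_provider_event_id(alert_id: str, fingerprint: str) -> str:
    """The same (alert_id, fingerprint) pair always maps to the same ID."""
    return f"alert-group:{alert_id}:{fingerprint}"


async def record_grouped_alert_event(
    save: Callable[[dict], Awaitable[None]],  # hypothetical persistence callback
    alert_id: str,
    fingerprint: str,
    content: str,
) -> bool:
    """Persist one grouped-alert event; return False instead of raising on failure."""
    event = {
        "provider_event_id": build_grouped_alert_provider_event_id(alert_id, fingerprint),
        "channel_type": "internal",
        "content": content,
    }
    try:
        await save(event)
        return True
    except Exception:
        # Fail-open: log the error and swallow it so the webhook ACK is never blocked.
        logger.exception("grouped_alert_event_record_failed")
        return False


class InMemoryStore:
    """Hypothetical store that keeps one event per provider_event_id."""

    def __init__(self) -> None:
        self.events: Dict[str, dict] = {}

    async def save(self, event: dict) -> None:
        # A repeated ID keeps the first event, so retries do not create duplicates.
        self.events.setdefault(event["provider_event_id"], event)


async def broken_save(event: dict) -> None:
    """Hypothetical callback that simulates a database outage."""
    raise RuntimeError("database unavailable")


if __name__ == "__main__":
    store = InMemoryStore()
    print(asyncio.run(record_grouped_alert_event(store.save, "a1", "fp9", "demo")))
    print(asyncio.run(record_grouped_alert_event(broken_save, "a1", "fp9", "demo")))
```

Keying the store on the provider event ID makes a redelivered webhook harmless, and returning a boolean rather than raising keeps the background task from surfacing errors to Alertmanager.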