5 rules: APIHighErrorLogRate/WorkerTaskFailed/PodOOMKilled/TelegramPollingFailed/NemotronAllTimeout
Includes SigNoz UI setup steps and a webhook verification command
Labels aligned with the unified Prometheus convention (layer/component/team)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# SigNoz Log-Based Alert Rules
# 2026-04-05 Claude Code: Sprint 2, log alerts
# Configure at: http://192.168.0.188:3301/alerts (SigNoz UI)
# Webhook: http://192.168.0.121:32334/api/v1/webhooks/signoz

## Setup steps

1. Open http://192.168.0.188:3301/alerts
2. Click "New Alert Rule" → "Logs Based Alert"
3. Fill in each field as listed in the rule tables below
4. Notification Channel: select awoooi-api (points to /api/v1/webhooks/signoz)
5. Save and enable the rule

Verify the webhook path:

```bash
curl -X POST http://192.168.0.121:32334/api/v1/webhooks/signoz \
  -H 'Content-Type: application/json' \
  -d '{"alerts":[{"labels":{"alertname":"SignozWebhookTest","severity":"info","layer":"k8s"},"annotations":{"summary":"Sprint 2 webhook verification, please ignore"},"status":"firing"}],"version":"4"}'
# Expected: {"success":true}
```
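
The same check can be scripted so it is repeatable. The following is a minimal sketch using only the Python standard library; the URL and payload come from the curl command above, while the helper names `build_test_payload`, `is_success`, and `send_test_alert` are hypothetical and not part of the AWOOOI codebase. It assumes the webhook answers with a JSON body containing `"success": true`, as the expected output above suggests. Call `send_test_alert()` from a host that can reach the webhook.

```python
import json
import urllib.request

# URL taken from this document; adjust for your environment.
WEBHOOK_URL = "http://192.168.0.121:32334/api/v1/webhooks/signoz"


def build_test_payload() -> dict:
    """Return a test alert body with the same shape as the curl payload."""
    return {
        "alerts": [
            {
                "labels": {
                    "alertname": "SignozWebhookTest",
                    "severity": "info",
                    "layer": "k8s",
                },
                "annotations": {
                    "summary": "Sprint 2 webhook verification, please ignore"
                },
                "status": "firing",
            }
        ],
        "version": "4",
    }


def is_success(response_body: str) -> bool:
    """True only when the body parses as a JSON object with "success" set to true."""
    try:
        parsed = json.loads(response_body)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and parsed.get("success") is True


def send_test_alert(url: str = WEBHOOK_URL, timeout: float = 5.0) -> bool:
    """POST the test alert and report whether the webhook acknowledged it."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_test_payload()).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return is_success(response.read().decode("utf-8"))
```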
---

## Rule 1: High API error log rate

| Field | Value |
|------|-----|
| Name | APIHighErrorLogRate |
| Type | Logs Based Alert |
| Query | `service.name = "awoooi-api" AND severity_text = "ERROR"` |
| Condition | Count > 10 per 5m |
| For | 5m |
| Severity | warning |
| Labels | `layer=k8s, component=api, team=backend` |
| Message | API error log rate is too high ({{ $value }} per 5 minutes) |

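
To make the condition concrete: `Count > 10 per 5m` is met when more than ten matching log lines fall inside a five-minute window. The sketch below is one way to check such a threshold offline against a list of timestamps. `exceeds_threshold` and its parameters are hypothetical names, and the code does not reproduce SigNoz's internal evaluation or the `For` hold period.

```python
from datetime import datetime, timedelta


def exceeds_threshold(timestamps, now, window=timedelta(minutes=5), threshold=10):
    """Return True when more than `threshold` events fall in (now - window, now]."""
    window_start = now - window
    count = sum(1 for ts in timestamps if window_start < ts <= now)
    return count > threshold


now = datetime(2026, 4, 5, 12, 0, 0)
# Eleven ERROR log timestamps, 20 seconds apart, all inside the last 5 minutes.
recent_errors = [now - timedelta(seconds=20 * i) for i in range(11)]
print(exceeds_threshold(recent_errors, now))       # True: 11 > 10
print(exceeds_threshold(recent_errors[:10], now))  # False: 10 is not > 10
```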
---

## Rule 2: Worker task failures

| Field | Value |
|------|-----|
| Name | WorkerTaskFailed |
| Type | Logs Based Alert |
| Query | `service.name = "awoooi-worker" AND (body CONTAINS "task_failed" OR body CONTAINS "Unhandled exception")` |
| Condition | Count > 5 per 5m |
| For | 5m |
| Severity | warning |
| Labels | `layer=k8s, component=worker, team=backend` |
| Message | Worker task failure count is too high ({{ $value }} per 5 minutes) |

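
The Rule 2 query combines a service filter with an OR over two body substrings. As a rough illustration of that logic (not SigNoz's query engine), the hypothetical `matches_worker_failure` below applies the same test to a log record represented as a plain dict. The record layout is an assumption, and `CONTAINS` is treated here as a case-sensitive substring match, which may differ from SigNoz's behavior.

```python
def matches_worker_failure(record: dict) -> bool:
    """Mirror of: service.name = "awoooi-worker" AND (body CONTAINS ... OR ...)."""
    if record.get("service.name") != "awoooi-worker":
        return False
    body = record.get("body", "")
    # Assumption: plain case-sensitive substring test stands in for CONTAINS.
    return "task_failed" in body or "Unhandled exception" in body


logs = [
    {"service.name": "awoooi-worker", "body": "event=task_failed id=42"},
    {"service.name": "awoooi-worker", "body": "task completed"},
    {"service.name": "awoooi-api", "body": "Unhandled exception in handler"},
]
matching = [record for record in logs if matches_worker_failure(record)]
print(len(matching))  # 1: only the first record passes both parts of the query
```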
---

## Rule 3: Pod OOM Kill

| Field | Value |
|------|-----|
| Name | PodOOMKilled |
| Type | Logs Based Alert |
| Query | `body CONTAINS "OOMKilled" OR body CONTAINS "OutOfMemory"` |
| Condition | Count > 0 per 1m |
| For | 1m |
| Severity | critical |
| Labels | `layer=k8s, component=k8s, team=ops` |
| Message | Pod OOM Kill event detected |

---

## Rule 4: Telegram polling failure

| Field | Value |
|------|-----|
| Name | TelegramPollingFailed |
| Type | Logs Based Alert |
| Query | `service.name = "awoooi-api" AND body CONTAINS "telegram_polling_error"` |
| Condition | Count > 3 per 5m |
| For | 5m |
| Severity | critical |
| Labels | `layer=k8s, component=api, team=platform` |
| Message | Telegram polling keeps failing; the bot may be unresponsive |

---

## Rule 5: All Nemotron calls timing out

| Field | Value |
|------|-----|
| Name | NemotronAllTimeout |
| Type | Logs Based Alert |
| Query | `service.name = "awoooi-api" AND body CONTAINS "nemotron_tool_call_timeout"` |
| Condition | Count > 5 per 5m |
| For | 5m |
| Severity | warning |
| Labels | `layer=k8s, component=ai, team=ai` |
| Message | Nemotron tool calls are timing out frequently; AI features may be degraded |

---

## Alignment with the Prometheus label convention

Every SigNoz alert must include:

- `layer`: k8s (logs from inside pods)
- `component`: the corresponding service
- `team`: the owning team
- `source`: signoz (used for auto_repair routing decisions)

This lets the AWOOOI API's auto_repair_service use the same layer/component logic to choose a repair path.

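
A small check can catch alerts that miss one of these labels before they reach the webhook. The sketch below is illustrative only: `REQUIRED_LABELS`, `missing_labels`, and `route_key` are hypothetical names, and the actual auto_repair_service logic is not shown in this document.

```python
REQUIRED_LABELS = ("layer", "component", "team", "source")


def missing_labels(labels: dict) -> list:
    """Return the required label names that are absent or empty, in a fixed order."""
    return [name for name in REQUIRED_LABELS if not labels.get(name)]


def route_key(labels: dict) -> tuple:
    """Build the (layer, component) pair a router could dispatch on."""
    return (labels["layer"], labels["component"])


# Labels as listed for APIHighErrorLogRate, plus the required source label.
complete = {"layer": "k8s", "component": "api", "team": "backend", "source": "signoz"}
# The same label set with source left out.
incomplete = {"layer": "k8s", "component": "api", "team": "backend"}

print(missing_labels(complete))    # []
print(missing_labels(incomplete))  # ['source']
print(route_key(complete))         # ('k8s', 'api')
```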