ops(signoz): add log-based alert rules documentation (Sprint 2)
5 rules: APIHighErrorLogRate/WorkerTaskFailed/PodOOMKilled/
TelegramPollingFailed/NemotronAllTimeout
Includes SigNoz UI setup steps and a webhook verification command
Labels aligned with the unified Prometheus convention (layer/component/team)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
107
ops/signoz/alerting/log-rules.md
Normal file
@@ -0,0 +1,107 @@
# SigNoz Log-Based Alert Rules
# 2026-04-05 Claude Code: Sprint 2, log alerting
# Configure at: http://192.168.0.188:3301/alerts (SigNoz UI)
# Webhook: http://192.168.0.121:32334/api/v1/webhooks/signoz

## Setup steps

1. Open http://192.168.0.188:3301/alerts
2. Click "New Alert Rule" → "Logs Based Alert"
3. Fill in each field according to the tables below
4. Notification Channel: select awoooi-api (points to /api/v1/webhooks/signoz)
5. Save and enable the rule

Verify the webhook path:

```bash
curl -X POST http://192.168.0.121:32334/api/v1/webhooks/signoz \
  -H 'Content-Type: application/json' \
  -d '{"alerts":[{"labels":{"alertname":"SignozWebhookTest","severity":"info","layer":"k8s"},"annotations":{"summary":"Sprint 2 webhook verification, please ignore"},"status":"firing"}],"version":"4"}'
# Expected: {"success":true}
```

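The same check can be scripted. The following is a minimal Python sketch, using only the standard library, that mirrors the structure of the curl test alert and posts it as JSON. The names `build_test_payload` and `post_alert` are illustrative and not part of SigNoz or the AWOOOI API. Calling `post_alert(WEBHOOK_URL, build_test_payload())` from a host that can reach the webhook should return the same success response as the curl check.

```python
import json
import urllib.request

WEBHOOK_URL = "http://192.168.0.121:32334/api/v1/webhooks/signoz"


def build_test_payload() -> dict:
    """Return a test alert with the same structure as the curl example."""
    return {
        "alerts": [
            {
                "labels": {
                    "alertname": "SignozWebhookTest",
                    "severity": "info",
                    "layer": "k8s",
                },
                "annotations": {
                    "summary": "Sprint 2 webhook verification, please ignore"
                },
                "status": "firing",
            }
        ],
        "version": "4",
    }


def post_alert(url: str, payload: dict, timeout: float = 5.0) -> dict:
    """POST the payload as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```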
---

## Rule 1: High API error log rate

| Field | Value |
|------|-----|
| Name | APIHighErrorLogRate |
| Type | Logs Based Alert |
| Query | `service.name = "awoooi-api" AND severity_text = "ERROR"` |
| Condition | Count > 10 per 5m |
| For | 5m |
| Severity | warning |
| Labels | `layer=k8s, component=api, team=backend` |
| Message | API error log rate is too high ({{ $value }} per 5 minutes) |

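To make the condition concrete: `Count > 10 per 5m` is read here as "more than 10 matching log lines in the trailing 5-minute window". The sketch below illustrates that arithmetic only. It is not how SigNoz evaluates rules internally, the helper names are hypothetical, and the `For` hold period is not modeled.

```python
from datetime import datetime, timedelta


def count_in_window(timestamps, now, window):
    """Count events whose timestamp falls in the interval (now - window, now]."""
    start = now - window
    return sum(1 for ts in timestamps if start < ts <= now)


def breaches(timestamps, now, window=timedelta(minutes=5), threshold=10):
    """True when the windowed count is strictly greater than the threshold."""
    return count_in_window(timestamps, now, window) > threshold


now = datetime(2026, 4, 5, 12, 0, 0)
# Eleven ERROR lines spaced 20 seconds apart, all inside the last 5 minutes.
errors = [now - timedelta(seconds=20 * i) for i in range(11)]
print(breaches(errors, now))       # True: 11 > 10
print(breaches(errors[:10], now))  # False: 10 is not > 10
```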
---

## Rule 2: Worker task failures

| Field | Value |
|------|-----|
| Name | WorkerTaskFailed |
| Type | Logs Based Alert |
| Query | `service.name = "awoooi-worker" AND (body CONTAINS "task_failed" OR body CONTAINS "Unhandled exception")` |
| Condition | Count > 5 per 5m |
| For | 5m |
| Severity | warning |
| Labels | `layer=k8s, component=worker, team=backend` |
| Message | Worker task failure count is too high ({{ $value }} per 5 minutes) |

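The Rule 2 query combines a service filter with an OR over two body substrings. As a rough illustration of that boolean logic (not the SigNoz query engine), the sketch below applies it to log records represented as plain dictionaries. The record shape and the case-sensitive substring match are assumptions.

```python
def matches_worker_failure(record: dict) -> bool:
    """Mirror Rule 2: service.name = "awoooi-worker" AND (body has either marker)."""
    body = record.get("body", "")
    return record.get("service.name") == "awoooi-worker" and (
        "task_failed" in body or "Unhandled exception" in body
    )


logs = [
    {"service.name": "awoooi-worker", "body": "event=task_failed id=42"},
    {"service.name": "awoooi-worker", "body": "task completed"},
    {"service.name": "awoooi-api", "body": "Unhandled exception in handler"},
]
matched = [r for r in logs if matches_worker_failure(r)]
print(len(matched))  # 1: only the first record satisfies both parts of the query
```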
---

## Rule 3: Pod OOM Kill

| Field | Value |
|------|-----|
| Name | PodOOMKilled |
| Type | Logs Based Alert |
| Query | `body CONTAINS "OOMKilled" OR body CONTAINS "OutOfMemory"` |
| Condition | Count > 0 per 1m |
| For | 1m |
| Severity | critical |
| Labels | `layer=k8s, component=k8s, team=ops` |
| Message | Pod OOM Kill event detected |

---

## Rule 4: Telegram polling failures

| Field | Value |
|------|-----|
| Name | TelegramPollingFailed |
| Type | Logs Based Alert |
| Query | `service.name = "awoooi-api" AND body CONTAINS "telegram_polling_error"` |
| Condition | Count > 3 per 5m |
| For | 5m |
| Severity | critical |
| Labels | `layer=k8s, component=api, team=platform` |
| Message | Telegram polling keeps failing; the bot may be unresponsive |

---

## Rule 5: All Nemotron calls timing out

| Field | Value |
|------|-----|
| Name | NemotronAllTimeout |
| Type | Logs Based Alert |
| Query | `service.name = "awoooi-api" AND body CONTAINS "nemotron_tool_call_timeout"` |
| Condition | Count > 5 per 5m |
| For | 5m |
| Severity | warning |
| Labels | `layer=k8s, component=ai, team=ai` |
| Message | Nemotron tool calls are timing out frequently; AI features may be degraded |

---

## Alignment with the Prometheus label convention

Every SigNoz alert must include:
- `layer`: k8s (logs from inside pods)
- `component`: the corresponding service name
- `team`: the owning team
- `source`: signoz (used for auto_repair routing decisions)

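A small check can catch rules that miss one of these labels. The sketch below parses a `Labels` value in the `key=value, key=value` form used in the rule tables and reports which required labels are absent. `parse_labels` and `missing_labels` are hypothetical helpers, not part of SigNoz. Note that the rule tables in this document list only `layer`, `component`, and `team`, so `source` is reported as missing for them as written.

```python
REQUIRED_LABELS = ("layer", "component", "team", "source")


def parse_labels(text: str) -> dict:
    """Parse 'k1=v1, k2=v2' into a dict, ignoring surrounding whitespace."""
    labels = {}
    for pair in text.split(","):
        pair = pair.strip()
        if not pair:
            continue
        key, _, value = pair.partition("=")
        labels[key.strip()] = value.strip()
    return labels


def missing_labels(labels: dict) -> list:
    """Return the required label names that are absent or empty."""
    return [name for name in REQUIRED_LABELS if not labels.get(name)]


rule1 = parse_labels("layer=k8s, component=api, team=backend")
print(missing_labels(rule1))                          # ['source']
print(missing_labels({**rule1, "source": "signoz"}))  # []
```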
Sharing these labels lets the AWOOOI API's auto_repair_service use the same layer/component logic to decide the repair path.
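How auto_repair_service actually routes alerts is not described in this document. Purely as an illustration of routing on shared labels, the sketch below looks up a repair path by `(layer, component)`. The route names, the `choose_repair_path` function, and the fallback value are all hypothetical, and the role of `source` in the decision is left out because it is not specified here.

```python
# Hypothetical route table: the path names are placeholders, not real AWOOOI code.
REPAIR_ROUTES = {
    ("k8s", "api"): "api_repair_path",
    ("k8s", "worker"): "worker_repair_path",
    ("k8s", "k8s"): "cluster_repair_path",
}


def choose_repair_path(labels: dict, default: str = "manual_review") -> str:
    """Pick a repair path from the layer and component labels only."""
    key = (labels.get("layer"), labels.get("component"))
    return REPAIR_ROUTES.get(key, default)


signoz_alert = {"layer": "k8s", "component": "worker", "team": "backend", "source": "signoz"}
prometheus_alert = {"layer": "k8s", "component": "worker", "team": "backend", "source": "prometheus"}
# Both alerts resolve to the same path because they share layer and component.
print(choose_repair_path(signoz_alert) == choose_repair_path(prometheus_alert))  # True
```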