awoooi/ops/signoz/alerting/log-rules.md
OG T e70ceaba61 ops(signoz): add log-based alert rules documentation (Sprint 2)
5 rules: APIHighErrorLogRate/WorkerTaskFailed/PodOOMKilled/
         TelegramPollingFailed/NemotronAllTimeout
Includes SigNoz UI setup steps + webhook verification command
Labels aligned with the unified Prometheus convention (layer/component/team)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 11:10:02 +08:00

# SigNoz Log-Based Alert Rules
# 2026-04-05 Claude Code: Sprint 2, log alerting
# Configured in: http://192.168.0.188:3301/alerts (SigNoz UI)
# Webhook: http://192.168.0.121:32334/api/v1/webhooks/signoz
## Setup steps
1. Open http://192.168.0.188:3301/alerts
2. Click "New Alert Rule" → "Logs Based Alert"
3. Fill in each field according to the tables below
4. Notification Channel: select awoooi-api (points to /api/v1/webhooks/signoz)
5. Save and enable the rule
Verify the webhook path end to end:
```bash
curl -X POST http://192.168.0.121:32334/api/v1/webhooks/signoz \
-H 'Content-Type: application/json' \
-d '{"alerts":[{"labels":{"alertname":"SignozWebhookTest","severity":"info","layer":"k8s"},"annotations":{"summary":"Sprint 2 webhook verification, please ignore"},"status":"firing"}],"version":"4"}'
# Expected: {"success":true}
```
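The same check can be scripted for a repeatable smoke test. The sketch below mirrors the structure of the curl payload above and posts it with the Python standard library. The function names `build_test_payload` and `post_alert` are hypothetical, the summary text is an English rendering of the test message, and the response is assumed to be JSON because the expected output above is `{"success":true}`.

```python
import json
import urllib.request

# Webhook endpoint taken from the header of this document.
WEBHOOK_URL = "http://192.168.0.121:32334/api/v1/webhooks/signoz"


def build_test_payload(
    alertname="SignozWebhookTest",
    severity="info",
    layer="k8s",
    summary="Sprint 2 webhook verification, please ignore",
):
    """Build a test body with the same fields the curl command sends."""
    return {
        "alerts": [
            {
                "labels": {"alertname": alertname, "severity": severity, "layer": layer},
                "annotations": {"summary": summary},
                "status": "firing",
            }
        ],
        "version": "4",
    }


def post_alert(payload, url=WEBHOOK_URL, timeout=5.0):
    """POST the payload as JSON and return the decoded JSON response.

    Assumption: the endpoint answers with a JSON body.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `post_alert(build_test_payload())` from a host that can reach the webhook should return the same body the curl command is expected to print. Keeping the payload builder separate from the network call makes it possible to check the payload shape offline.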
---
## Rule 1: High API error log rate
| Field | Value |
|------|-----|
| Name | APIHighErrorLogRate |
| Type | Logs Based Alert |
| Query | `service.name = "awoooi-api" AND severity_text = "ERROR"` |
| Condition | Count > 10 per 5m |
| For | 5m |
| Severity | warning |
| Labels | `layer=k8s, component=api, team=backend` |
| Message | API error log rate is too high ({{ $value }} errors per 5 minutes) |
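The Condition and For rows interact: the count must exceed the threshold, and it must keep doing so for the whole For duration before the alert fires. The sketch below illustrates one reading of that behavior. It is not SigNoz's evaluator; `Evaluation` and `is_firing` are hypothetical names, and the assumption is that each evaluation reports the number of matching logs in its window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evaluation:
    """One rule evaluation: when it ran and how many logs matched its window."""
    at_seconds: int
    matched: int


def is_firing(evaluations, threshold, for_seconds):
    """Return True if, at the latest evaluation, `matched > threshold`
    has held without interruption for at least `for_seconds`.

    `evaluations` must be ordered oldest first.
    """
    breach_started = None
    last_at = None
    for ev in evaluations:
        last_at = ev.at_seconds
        if ev.matched > threshold:
            if breach_started is None:
                breach_started = ev.at_seconds  # start of the current breach
        else:
            breach_started = None  # condition cleared, restart the hold timer
    return breach_started is not None and last_at - breach_started >= for_seconds


# Rule 1 values: Count > 10, For 5m, evaluated once a minute (assumed cadence).
steady = [Evaluation(t, 12) for t in range(0, 301, 60)]
dipped = [Evaluation(t, 10 if t == 120 else 12) for t in range(0, 301, 60)]
```

Under these assumptions `is_firing(steady, threshold=10, for_seconds=300)` is true, while `dipped` does not fire because the count fell back to exactly 10 at the two-minute mark and the comparison is strict.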
---
## Rule 2: Worker task failures
| Field | Value |
|------|-----|
| Name | WorkerTaskFailed |
| Type | Logs Based Alert |
| Query | `service.name = "awoooi-worker" AND (body CONTAINS "task_failed" OR body CONTAINS "Unhandled exception")` |
| Condition | Count > 5 per 5m |
| For | 5m |
| Severity | warning |
| Labels | `layer=k8s, component=worker, team=backend` |
| Message | Worker task failure count is too high ({{ $value }} failures per 5 minutes) |
---
## Rule 3: Pod OOM Kill
| Field | Value |
|------|-----|
| Name | PodOOMKilled |
| Type | Logs Based Alert |
| Query | `body CONTAINS "OOMKilled" OR body CONTAINS "OutOfMemory"` |
| Condition | Count > 0 per 1m |
| For | 1m |
| Severity | critical |
| Labels | `layer=k8s, component=k8s, team=ops` |
| Message | Pod OOM Kill event detected |
---
## Rule 4: Telegram polling failures
| Field | Value |
|------|-----|
| Name | TelegramPollingFailed |
| Type | Logs Based Alert |
| Query | `service.name = "awoooi-api" AND body CONTAINS "telegram_polling_error"` |
| Condition | Count > 3 per 5m |
| For | 5m |
| Severity | critical |
| Labels | `layer=k8s, component=api, team=platform` |
| Message | Telegram polling is failing repeatedly; the bot may be unresponsive |
---
## Rule 5: All Nemotron calls timing out
| Field | Value |
|------|-----|
| Name | NemotronAllTimeout |
| Type | Logs Based Alert |
| Query | `service.name = "awoooi-api" AND body CONTAINS "nemotron_tool_call_timeout"` |
| Condition | Count > 5 per 5m |
| For | 5m |
| Severity | warning |
| Labels | `layer=k8s, component=ai, team=ai` |
| Message | Nemotron tool calls are timing out frequently; AI features may be degraded |
---
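## Alignment with the Prometheus label convention
Every SigNoz alert must include:
- `layer`: k8s (logs from inside pods)
- `component`: the corresponding service name
- `team`: the owning team
- `source`: signoz (used for auto_repair routing decisions)
This lets the AWOOOI API's auto_repair_service use the same layer/component logic to choose a repair path.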
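A small check can catch rules that drift from this convention before they reach the webhook. The sketch below is a minimal illustration, not part of auto_repair_service; `parse_label_string` and `missing_labels` are hypothetical helpers, and the input format is assumed to be the `k=v, k=v` style used in the Labels rows of the rule tables.

```python
# Label names required by the convention listed above.
REQUIRED_LABELS = ("layer", "component", "team", "source")


def parse_label_string(text):
    """Parse a 'k=v, k=v' string, as written in the Labels rows, into a dict.

    Items without '=' are ignored.
    """
    labels = {}
    for item in text.split(","):
        if "=" not in item:
            continue
        key, value = item.split("=", 1)
        labels[key.strip()] = value.strip()
    return labels


def missing_labels(labels):
    """Return the required label names that are absent or empty, in a fixed order."""
    return [name for name in REQUIRED_LABELS if not labels.get(name)]


# Labels row from Rule 1, copied as written in the table.
rule1 = parse_label_string("layer=k8s, component=api, team=backend")
```

With the Rule 1 row as written, `missing_labels(rule1)` reports `source`, since the Labels rows list only `layer`, `component`, and `team`. Whether `source=signoz` is meant to be added in the SigNoz UI or attached by the webhook handler is not stated here, so treat that result as a prompt to confirm rather than as an error.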