ops(signoz): add documentation for log-based alert rules (Sprint 2)

5 rules: APIHighErrorLogRate/WorkerTaskFailed/PodOOMKilled/
         TelegramPollingFailed/NemotronAllTimeout
Includes SigNoz UI setup steps and a webhook verification command
Labels aligned with the unified Prometheus convention (layer/component/team)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OG T
2026-04-05 11:10:02 +08:00
parent 77f70125cb
commit e70ceaba61

@@ -0,0 +1,107 @@
# SigNoz Log-Based Alert Rules
# 2026-04-05 Claude Code: Sprint 2, log alerts
# Configuration location: http://192.168.0.188:3301/alerts (SigNoz UI)
# Webhook: http://192.168.0.121:32334/api/v1/webhooks/signoz
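## Setup steps
1. Open http://192.168.0.188:3301/alerts
2. Click "New Alert Rule" → "Logs Based Alert"
3. Fill in each field using the table for the corresponding rule below
4. Notification Channel: select awoooi-api (points to /api/v1/webhooks/signoz)
5. Save and enable the rule
To verify the webhook path: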
```bash
curl -X POST http://192.168.0.121:32334/api/v1/webhooks/signoz \
-H 'Content-Type: application/json' \
  -d '{"alerts":[{"labels":{"alertname":"SignozWebhookTest","severity":"info","layer":"k8s"},"annotations":{"summary":"Sprint 2 webhook verification, please ignore"},"status":"firing"}],"version":"4"}'
# Expected: {"success":true}
```
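The expected-response check above can be scripted so that a broken webhook path fails loudly instead of relying on reading the output. This is a minimal sketch, assuming the endpoint returns the `{"success":true}` body shown above; the function name `webhook_ok` and the file `payload.json` are hypothetical, and the network call is left commented out so the helper can be tried offline.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if the response body reports success.
# $1 is the raw response body returned by the webhook endpoint.
webhook_ok() {
  # Tolerate optional whitespace around the colon in the JSON text.
  printf '%s' "$1" | grep -Eq '"success"[[:space:]]*:[[:space:]]*true'
}

# Usage against the real endpoint (payload.json is an assumed file holding
# the same JSON body as the curl command above):
# resp=$(curl -sS -X POST http://192.168.0.121:32334/api/v1/webhooks/signoz \
#   -H 'Content-Type: application/json' -d @payload.json)
# if webhook_ok "$resp"; then echo "webhook OK"; else echo "webhook FAILED: $resp"; fi
```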
---
## Rule 1: High API error log rate
| Field | Value |
|------|-----|
| Name | APIHighErrorLogRate |
| Type | Logs Based Alert |
| Query | `service.name = "awoooi-api" AND severity_text = "ERROR"` |
| Condition | Count > 10 per 5m |
| For | 5m |
| Severity | warning |
| Labels | `layer=k8s, component=api, team=backend` |
| Message | API error log rate is too high ({{ $value }} errors per 5 minutes) |
---
## Rule 2: Worker task failures
| Field | Value |
|------|-----|
| Name | WorkerTaskFailed |
| Type | Logs Based Alert |
| Query | `service.name = "awoooi-worker" AND (body CONTAINS "task_failed" OR body CONTAINS "Unhandled exception")` |
| Condition | Count > 5 per 5m |
| For | 5m |
| Severity | warning |
| Labels | `layer=k8s, component=worker, team=backend` |
| Message | Worker task failure count is too high ({{ $value }} failures per 5 minutes) |
---
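Before enabling Rule 2, it can help to estimate how often its two body patterns already occur, so that the `Count > 5 per 5m` threshold is not crossed by normal traffic. The sketch below counts matching lines on stdin and mirrors the rule's comparison. The function names and the `kubectl` deployment name in the comment are assumptions, and counting lines with `grep` only approximates how SigNoz counts log records.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count log lines that would match Rule 2's body filter.
# Reads log lines on stdin and prints the number of matching lines.
count_worker_failures() {
  # -F: fixed strings, -c: count lines, -e: one pattern per flag (OR semantics).
  # grep exits 1 when the count is 0, so mask that with `|| true`.
  grep -F -c -e 'task_failed' -e 'Unhandled exception' || true
}

# Mirrors the rule's "Count > 5" comparison for a count taken over a 5m window.
exceeds_threshold() {
  [ "$1" -gt 5 ]
}

# Example against a live cluster (the deployment name is an assumption):
# n=$(kubectl logs deployment/awoooi-worker --since=5m | count_worker_failures)
# if exceeds_threshold "$n"; then echo "Rule 2 would fire: $n matches in 5m"; fi
```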
## Rule 3: Pod OOM Kill
| Field | Value |
|------|-----|
| Name | PodOOMKilled |
| Type | Logs Based Alert |
| Query | `body CONTAINS "OOMKilled" OR body CONTAINS "OutOfMemory"` |
| Condition | Count > 0 per 1m |
| For | 1m |
| Severity | critical |
| Labels | `layer=k8s, component=k8s, team=ops` |
| Message | Pod OOM Kill event detected |
---
## Rule 4: Telegram polling failures
| Field | Value |
|------|-----|
| Name | TelegramPollingFailed |
| Type | Logs Based Alert |
| Query | `service.name = "awoooi-api" AND body CONTAINS "telegram_polling_error"` |
| Condition | Count > 3 per 5m |
| For | 5m |
| Severity | critical |
| Labels | `layer=k8s, component=api, team=platform` |
| Message | Telegram polling keeps failing; the bot may be unresponsive |
---
## Rule 5: Nemotron all timeouts
| Field | Value |
|------|-----|
| Name | NemotronAllTimeout |
| Type | Logs Based Alert |
| Query | `service.name = "awoooi-api" AND body CONTAINS "nemotron_tool_call_timeout"` |
| Condition | Count > 5 per 5m |
| For | 5m |
| Severity | warning |
| Labels | `layer=k8s, component=ai, team=ai` |
| Message | Nemotron tool calls are timing out frequently; AI features may be degraded |
---
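The five rules above can be smoke-tested one at a time by sending a payload per rule to the webhook, in the same shape as the verification command in the setup steps. The sketch below only prints the payloads. The helper names are hypothetical, and it is an assumption that the webhook accepts `component`, `team`, and `source` labels in this position. Because these payloads carry real alert names with `status: firing`, sending them may trigger downstream handling, so treat this as a dry-run aid.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a test payload for one rule.
# Arguments: alertname severity component team
build_payload() {
  printf '{"alerts":[{"labels":{"alertname":"%s","severity":"%s","layer":"k8s","component":"%s","team":"%s","source":"signoz"},"annotations":{"summary":"test, please ignore"},"status":"firing"}],"version":"4"}\n' \
    "$1" "$2" "$3" "$4"
}

# name severity component team, copied from the five rule tables.
RULES='APIHighErrorLogRate warning api backend
WorkerTaskFailed warning worker backend
PodOOMKilled critical k8s ops
TelegramPollingFailed critical api platform
NemotronAllTimeout warning ai ai'

# Print one payload per rule, one JSON document per line.
print_all_payloads() {
  printf '%s\n' "$RULES" | while read -r name severity component team; do
    build_payload "$name" "$severity" "$component" "$team"
  done
}
```

Calling `print_all_payloads` writes five lines; each line can be passed to the `curl -d` command from the setup steps.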
## Alignment with the Prometheus label convention
Every SigNoz alert must include:
- `layer`: k8s (logs from inside pods)
- `component`: the corresponding service name
- `team`: the responsible team
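- `source`: signoz (used for auto_repair routing decisions); the Labels rows in the rule tables above do not list it, so add `source=signoz` when creating each rule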
This lets the AWOOOI API's auto_repair_service choose a repair path with the same layer/component logic it applies to Prometheus alerts.
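
The required-label list can be checked mechanically before a payload is sent. This is a rough sketch: the function name `missing_labels` is hypothetical, and it uses plain `grep` on the JSON text, so it only confirms that each key appears somewhere in the payload rather than parsing the `labels` object.

```bash
#!/usr/bin/env bash
# Hypothetical check: print the required labels that are absent from a payload.
# $1 is the JSON payload text; empty output means all four labels are present.
missing_labels() {
  local payload=$1 key missing=''
  for key in layer component team source; do
    # Look for the quoted key followed by a colon, allowing whitespace between.
    if ! printf '%s' "$payload" | grep -q "\"$key\"[[:space:]]*:"; then
      missing="$missing $key"
    fi
  done
  # Drop the leading space left by the concatenation above.
  printf '%s\n' "${missing# }"
}
```

Run against the `SignozWebhookTest` payload from the setup steps, which carries only `alertname`, `severity`, and `layer`, this prints `component team source`.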