From e70ceaba61c851b614d8a0c5da438f2ecc2e79f1 Mon Sep 17 00:00:00 2001
From: OG T
Date: Sun, 5 Apr 2026 11:10:02 +0800
Subject: [PATCH] ops(signoz): add log-based alert rules documentation (Sprint 2)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

5 rules: APIHighErrorLogRate/WorkerTaskFailed/PodOOMKilled/
TelegramPollingFailed/NemotronAllTimeout
Includes SigNoz UI setup steps + webhook verification command
Labels aligned with the unified Prometheus convention (layer/component/team)

Co-Authored-By: Claude Sonnet 4.6
---
 ops/signoz/alerting/log-rules.md | 107 +++++++++++++++++++++++++++++++
 1 file changed, 107 insertions(+)
 create mode 100644 ops/signoz/alerting/log-rules.md

diff --git a/ops/signoz/alerting/log-rules.md b/ops/signoz/alerting/log-rules.md
new file mode 100644
index 00000000..4c1a418e
--- /dev/null
+++ b/ops/signoz/alerting/log-rules.md
@@ -0,0 +1,107 @@
+# SigNoz Log-Based Alert Rules
+# 2026-04-05 Claude Code: Sprint 2, log-based alerting
+# Configured at: http://192.168.0.188:3301/alerts (SigNoz UI)
+# Webhook: http://192.168.0.121:32334/api/v1/webhooks/signoz
+
+## Setup steps
+
+1. Open http://192.168.0.188:3301/alerts
+2. Click "New Alert Rule" → "Logs Based Alert"
+3. Fill in each field according to the tables below
+4. Notification Channel: select awoooi-api (points to /api/v1/webhooks/signoz)
+5. Save and enable the rule
+
+Verify the webhook path:
+```bash
+curl -X POST http://192.168.0.121:32334/api/v1/webhooks/signoz \
+  -H 'Content-Type: application/json' \
+  -d '{"alerts":[{"labels":{"alertname":"SignozWebhookTest","severity":"info","layer":"k8s"},"annotations":{"summary":"Sprint 2 webhook verification, please ignore"},"status":"firing"}],"version":"4"}'
+# Expected: {"success":true}
+```
+
+---
+
+## Rule 1: High API Error Log Rate
+
+| Field | Value |
+|------|-----|
+| Name | APIHighErrorLogRate |
+| Type | Logs Based Alert |
+| Query | `service.name = "awoooi-api" AND severity_text = "ERROR"` |
+| Condition | Count > 10 per 5m |
+| For | 5m |
+| Severity | warning |
+| Labels | `layer=k8s, component=api, team=backend` |
+| Message | API error log rate is too high ({{ $value }} per 5 minutes) |
+
+---
+
+## Rule 2: Worker Task Failures
+
+| Field | Value |
+|------|-----|
+| Name | WorkerTaskFailed |
+| Type | Logs Based Alert |
+| Query | `service.name = "awoooi-worker" AND (body CONTAINS "task_failed" OR body CONTAINS "Unhandled exception")` |
+| Condition | Count > 5 per 5m |
+| For | 5m |
+| Severity | warning |
+| Labels | `layer=k8s, component=worker, team=backend` |
+| Message | Worker task failure count is too high ({{ $value }} per 5 minutes) |
+
+---
+
+## Rule 3: Pod OOM Kill
+
+| Field | Value |
+|------|-----|
+| Name | PodOOMKilled |
+| Type | Logs Based Alert |
+| Query | `body CONTAINS "OOMKilled" OR body CONTAINS "OutOfMemory"` |
+| Condition | Count > 0 per 1m |
+| For | 1m |
+| Severity | critical |
+| Labels | `layer=k8s, component=k8s, team=ops` |
+| Message | Pod OOM kill event detected |
+
+---
+
+## Rule 4: Telegram Polling Failures
+
+| Field | Value |
+|------|-----|
+| Name | TelegramPollingFailed |
+| Type | Logs Based Alert |
+| Query | `service.name = "awoooi-api" AND body CONTAINS "telegram_polling_error"` |
+| Condition | Count > 3 per 5m |
+| For | 5m |
+| Severity | critical |
+| Labels | `layer=k8s, component=api, team=platform` |
+| Message | Telegram polling keeps failing; the bot may be unresponsive |
+
+---
+
+## Rule 5: Nemotron All Timeouts
+
+| Field | Value |
+|------|-----|
+| Name | NemotronAllTimeout |
+| Type | Logs Based Alert |
+| Query | `service.name = "awoooi-api" AND body CONTAINS "nemotron_tool_call_timeout"` |
+| Condition | Count > 5 per 5m |
+| For | 5m |
+| Severity | warning |
+| Labels | `layer=k8s, component=ai, team=ai` |
+| Message | Nemotron tool calls are timing out frequently; AI features may be degraded |
+
+---
+
+## Alignment with the Prometheus label convention
+
+Every SigNoz alert must include:
+- `layer`: k8s (logs from inside pods)
+- `component`: the corresponding service name
+- `team`: the owning team
+- `source`: signoz (used for auto_repair routing decisions)
+
+This lets the AWOOOI API's auto_repair_service use the same layer/component logic to choose a repair path.
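
Review note (not part of the patch): the label section requires `layer`, `component`, `team`, and `source`, while the rule tables and the curl test payload list only a subset. A pre-flight check along the following lines could catch that before a rule is saved. This is a minimal sketch, not the actual auto_repair_service logic: the helper names `missing_labels` and `build_test_payload` are hypothetical, and the payload shape is assumed to mirror the curl example in the patch.

```python
import json

# Label keys the alignment section says every SigNoz alert must carry.
REQUIRED_LABELS = ("layer", "component", "team", "source")


def missing_labels(labels: dict) -> list:
    """Return the required label keys that are absent or empty, in declaration order."""
    return [key for key in REQUIRED_LABELS if not labels.get(key)]


def build_test_payload(alertname: str, severity: str, labels: dict, summary: str) -> str:
    """Serialize a body shaped like the curl example (assumed shape, hypothetical helper)."""
    alert = {
        "labels": {"alertname": alertname, "severity": severity, **labels},
        "annotations": {"summary": summary},
        "status": "firing",
    }
    return json.dumps({"alerts": [alert], "version": "4"}, ensure_ascii=False)


# Rule 1 labels from its table, plus the `source` label from the alignment section.
rule1_labels = {"layer": "k8s", "component": "api", "team": "backend", "source": "signoz"}
# Labels exactly as sent by the curl verification command.
curl_labels = {"layer": "k8s"}

print(missing_labels(rule1_labels))  # []
print(missing_labels(curl_labels))   # ['component', 'team', 'source']
```

Whether the webhook test payload should also carry the full label set depends on how the receiving endpoint validates alerts, which the patch does not specify.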