# SigNoz Log-Based Alert Rules
# 2026-04-05 Claude Code: Sprint 2, log alerts
# Configured at: http://192.168.0.188:3301/alerts (SigNoz UI)
# Webhook: http://192.168.0.121:32334/api/v1/webhooks/signoz

## Setup Steps

1. Open http://192.168.0.188:3301/alerts
2. Click "New Alert Rule" → "Logs Based Alert"
3. Fill in each field according to the tables below
4. Notification Channel: select awoooi-api (which points to /api/v1/webhooks/signoz)
5. Save and enable the rule

Verify the webhook path:

```bash
curl -X POST http://192.168.0.121:32334/api/v1/webhooks/signoz \
  -H 'Content-Type: application/json' \
  -d '{"alerts":[{"labels":{"alertname":"SignozWebhookTest","severity":"info","layer":"k8s"},"annotations":{"summary":"Sprint 2 webhook verification, please ignore"},"status":"firing"}],"version":"4"}'
# Expected: {"success":true}
```

---

## Rule 1: High API Error Log Rate

| Field | Value |
|------|-----|
| Name | APIHighErrorLogRate |
| Type | Logs Based Alert |
| Query | `service.name = "awoooi-api" AND severity_text = "ERROR"` |
| Condition | Count > 10 per 5m |
| For | 5m |
| Severity | warning |
| Labels | `layer=k8s, component=api, team=backend` |
| Message | API error log rate is too high ({{ $value }} per 5 min) |

---

## Rule 2: Worker Task Failures

| Field | Value |
|------|-----|
| Name | WorkerTaskFailed |
| Type | Logs Based Alert |
| Query | `service.name = "awoooi-worker" AND (body CONTAINS "task_failed" OR body CONTAINS "Unhandled exception")` |
| Condition | Count > 5 per 5m |
| For | 5m |
| Severity | warning |
| Labels | `layer=k8s, component=worker, team=backend` |
| Message | Worker task failure count is too high ({{ $value }} per 5 min) |

---

## Rule 3: Pod OOM Kill

| Field | Value |
|------|-----|
| Name | PodOOMKilled |
| Type | Logs Based Alert |
| Query | `body CONTAINS "OOMKilled" OR body CONTAINS "OutOfMemory"` |
| Condition | Count > 0 per 1m |
| For | 1m |
| Severity | critical |
| Labels | `layer=k8s, component=k8s, team=ops` |
| Message | Pod OOM kill event detected |

---

## Rule 4: Telegram Polling Failures

| Field | Value |
|------|-----|
| Name | TelegramPollingFailed |
| Type | Logs Based Alert |
| Query | `service.name = "awoooi-api" AND body CONTAINS "telegram_polling_error"` |
| Condition | Count > 3 per 5m |
| For | 5m |
| Severity | critical |
| Labels | `layer=k8s, component=api, team=platform` |
| Message | Telegram polling is failing repeatedly; the bot may be unresponsive |

---

## Rule 5: All Nemotron Calls Timing Out

| Field | Value |
|------|-----|
| Name | NemotronAllTimeout |
| Type | Logs Based Alert |
| Query | `service.name = "awoooi-api" AND body CONTAINS "nemotron_tool_call_timeout"` |
| Condition | Count > 5 per 5m |
| For | 5m |
| Severity | warning |
| Labels | `layer=k8s, component=ai, team=ai` |
| Message | Nemotron tool calls are timing out frequently; AI features may be degraded |

---

## Alignment with the Prometheus Label Convention

All SigNoz alerts must include:

- `layer`: k8s (logs from inside pods)
- `component`: the corresponding service name
- `team`: the owning team
- `source`: signoz (used for auto_repair routing decisions)

This lets the AWOOOI API's auto_repair_service apply the same layer/component logic when choosing a repair path.
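
The label convention above can be checked mechanically before a rule is saved. The following is a minimal Python sketch, assuming labels are written as comma-separated `key=value` pairs as in the Labels rows of the rule tables. The helper names `parse_labels` and `missing_labels` are hypothetical and are not part of SigNoz or the AWOOOI API. Note that the Labels rows as written list only `layer`, `component`, and `team`; this document does not say whether `source` is attached elsewhere (for example by the notification channel), so the check below simply reports it as missing.

```python
# Hypothetical helpers for illustration; not part of SigNoz or the AWOOOI API.
REQUIRED_LABELS = ("layer", "component", "team", "source")


def parse_labels(text: str) -> dict[str, str]:
    """Parse a 'k=v, k=v' string (the format of the Labels rows) into a dict."""
    labels = {}
    for pair in text.split(","):
        pair = pair.strip()
        if not pair:
            continue
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        labels[key.strip()] = value.strip()
    return labels


def missing_labels(labels: dict[str, str]) -> list[str]:
    """Return required label names that are absent or empty, in convention order."""
    return [name for name in REQUIRED_LABELS if not labels.get(name)]


if __name__ == "__main__":
    # Labels row of Rule 1, copied from the table.
    rule1 = parse_labels("layer=k8s, component=api, team=backend")
    print(missing_labels(rule1))                          # ['source']
    print(missing_labels({**rule1, "source": "signoz"}))  # []
```

Running the same check over each rule's Labels row before saving it in the SigNoz UI is one way to catch a rule that auto_repair_service could not route.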