awoooi/ops/signoz/alerting/log-rules.md
OG T e70ceaba61 ops(signoz): add log-based alert rules documentation (Sprint 2)
5 rules: APIHighErrorLogRate/WorkerTaskFailed/PodOOMKilled/
         TelegramPollingFailed/NemotronAllTimeout
Includes SigNoz UI setup steps + webhook verification command
Labels aligned with the unified Prometheus convention (layer/component/team)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 11:10:02 +08:00


SigNoz Log-Based Alert Rules

2026-04-05 Claude Code: Sprint 2, log alerting

Configuration location: http://192.168.0.188:3301/alerts (SigNoz UI)

Webhook: http://192.168.0.121:32334/api/v1/webhooks/signoz

Setup steps

  1. Open http://192.168.0.188:3301/alerts
  2. Click "New Alert Rule" → "Logs Based Alert"
  3. Fill in the fields using the table for each rule below
  4. Notification Channel: select awoooi-api (points to /api/v1/webhooks/signoz)
  5. Save and enable the rule

Verify the webhook path:

curl -X POST http://192.168.0.121:32334/api/v1/webhooks/signoz \
  -H 'Content-Type: application/json' \
  -d '{"alerts":[{"labels":{"alertname":"SignozWebhookTest","severity":"info","layer":"k8s"},"annotations":{"summary":"Sprint 2 webhook verification, please ignore"},"status":"firing"}],"version":"4"}'
# Expected: {"success":true}
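
The same check can be scripted. The following is a minimal Python sketch, assuming the webhook accepts the Alertmanager-style payload used in the curl command and replies with a JSON body. The names `build_test_payload` and `post_alert` are illustrative and not part of the AWOOOI codebase.

```python
import json
import urllib.request

WEBHOOK_URL = "http://192.168.0.121:32334/api/v1/webhooks/signoz"


def build_test_payload() -> dict:
    """Build a test alert that mirrors the one sent by the curl command."""
    return {
        "alerts": [
            {
                "labels": {
                    "alertname": "SignozWebhookTest",
                    "severity": "info",
                    "layer": "k8s",
                },
                "annotations": {
                    "summary": "Sprint 2 webhook verification, please ignore"
                },
                "status": "firing",
            }
        ],
        "version": "4",
    }


def post_alert(url: str, payload: dict, timeout: float = 5.0) -> dict:
    """POST the payload as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (needs network access to the webhook host):
# result = post_alert(WEBHOOK_URL, build_test_payload())
# print(result)
```

Keeping payload construction separate from the HTTP call makes it possible to inspect or unit-test the alert body without reaching the cluster.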

Rule 1: High API error log rate

| Field | Value |
| --- | --- |
| Name | APIHighErrorLogRate |
| Type | Logs Based Alert |
| Query | service.name = "awoooi-api" AND severity_text = "ERROR" |
| Condition | Count > 10 per 5m |
| For | 5m |
| Severity | warning |
| Labels | layer=k8s, component=api, team=backend |
| Message | API error log rate is too high ({{ $value }} errors per 5 minutes) |
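
To make the Query and Condition fields concrete, here is a minimal sketch of a count-over-window threshold check. It assumes log records are plain dicts with `timestamp`, `service.name`, and `severity_text` keys; that record shape and the function `count_exceeds_threshold` are hypothetical. SigNoz evaluates the real rule itself, and the `For` duration is not modeled here.

```python
from datetime import datetime, timedelta


def count_exceeds_threshold(records, now, window=timedelta(minutes=5), threshold=10):
    """Return True when more than `threshold` matching records fall in the window.

    Mirrors Rule 1: service.name = "awoooi-api" AND severity_text = "ERROR",
    Count > 10 per 5m. The record layout is an assumption for illustration.
    """
    window_start = now - window
    matches = [
        r for r in records
        if r["service.name"] == "awoooi-api"
        and r["severity_text"] == "ERROR"
        and window_start < r["timestamp"] <= now
    ]
    return len(matches) > threshold


now = datetime(2026, 4, 5, 11, 0, 0)
# 11 errors in the last 5 minutes, one every 20 seconds.
errors = [
    {
        "timestamp": now - timedelta(seconds=20 * i),
        "service.name": "awoooi-api",
        "severity_text": "ERROR",
    }
    for i in range(11)
]
print(count_exceeds_threshold(errors, now))       # True: 11 > 10
print(count_exceeds_threshold(errors[:10], now))  # False: 10 is not > 10
```

The comparison is strict, so exactly 10 matching entries in the window do not trigger the condition.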

Rule 2: Worker task failure

| Field | Value |
| --- | --- |
| Name | WorkerTaskFailed |
| Type | Logs Based Alert |
| Query | service.name = "awoooi-worker" AND (body CONTAINS "task_failed" OR body CONTAINS "Unhandled exception") |
| Condition | Count > 5 per 5m |
| For | 5m |
| Severity | warning |
| Labels | layer=k8s, component=worker, team=backend |
| Message | Worker task failure count is too high ({{ $value }} failures per 5 minutes) |

Rule 3: Pod OOM Kill

| Field | Value |
| --- | --- |
| Name | PodOOMKilled |
| Type | Logs Based Alert |
| Query | body CONTAINS "OOMKilled" OR body CONTAINS "OutOfMemory" |
| Condition | Count > 0 per 1m |
| For | 1m |
| Severity | critical |
| Labels | layer=k8s, component=k8s, team=ops |
| Message | Pod OOM Kill event detected |

Rule 4: Telegram polling failure

| Field | Value |
| --- | --- |
| Name | TelegramPollingFailed |
| Type | Logs Based Alert |
| Query | service.name = "awoooi-api" AND body CONTAINS "telegram_polling_error" |
| Condition | Count > 3 per 5m |
| For | 5m |
| Severity | critical |
| Labels | layer=k8s, component=api, team=platform |
| Message | Telegram polling keeps failing; the bot may be unresponsive |

Rule 5: All Nemotron calls timing out

| Field | Value |
| --- | --- |
| Name | NemotronAllTimeout |
| Type | Logs Based Alert |
| Query | service.name = "awoooi-api" AND body CONTAINS "nemotron_tool_call_timeout" |
| Condition | Count > 5 per 5m |
| For | 5m |
| Severity | warning |
| Labels | layer=k8s, component=ai, team=ai |
| Message | Nemotron tool calls are timing out frequently; AI features may be degraded |

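Alignment with the Prometheus label convention

All SigNoz alerts must include:

  • layer: k8s (logs from inside pods)
  • component: the corresponding service name
  • team: the owning team
  • source: signoz (used for auto_repair routing decisions)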
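
A small check can catch rules that are missing one of these labels before they are saved. This is a sketch only: `REQUIRED_LABELS` and `missing_labels` are hypothetical helpers, not part of SigNoz or the AWOOOI API.

```python
REQUIRED_LABELS = ("layer", "component", "team", "source")


def missing_labels(labels: dict) -> list:
    """Return the required label keys that are absent or empty, in a fixed order."""
    return [key for key in REQUIRED_LABELS if not labels.get(key)]


# Labels as listed for Rule 1, plus the source label required by the list above.
rule1_labels = {"layer": "k8s", "component": "api", "team": "backend", "source": "signoz"}
print(missing_labels(rule1_labels))                          # []
print(missing_labels({"layer": "k8s", "component": "api"}))  # ['team', 'source']
```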

This lets the AWOOOI API's auto_repair_service use the same layer/component logic to decide the repair path.
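
The idea of shared routing can be sketched as a lookup keyed on layer and component. Everything here is an assumption for illustration: the `REPAIR_ROUTES` table, the repair path names, the `prometheus` source value, and `choose_repair_path` are placeholders, not the actual auto_repair_service implementation. How `source` influences routing is not specified in this document, so the sketch leaves it out.

```python
from typing import Optional

# Hypothetical routing table: the repair path names are placeholders,
# not the real auto_repair_service configuration.
REPAIR_ROUTES = {
    ("k8s", "api"): "restart_api_deployment",
    ("k8s", "worker"): "restart_worker_deployment",
}


def choose_repair_path(labels: dict) -> Optional[str]:
    """Look up a repair path from the layer and component labels."""
    key = (labels.get("layer"), labels.get("component"))
    return REPAIR_ROUTES.get(key)


signoz_alert = {"layer": "k8s", "component": "api", "team": "backend", "source": "signoz"}
prometheus_alert = {"layer": "k8s", "component": "api", "team": "backend", "source": "prometheus"}

# Both alerts resolve to the same path because the lookup uses only layer/component.
print(choose_repair_path(signoz_alert) == choose_repair_path(prometheus_alert))  # True
```

An alert whose layer/component pair is not in the table returns `None`, which a caller could treat as "no automatic repair".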