Wave 8 P3.1-T2 follow-up tests and supporting changes.
New tests:
- test_diagnosis_aggregator_stub.py (238 lines), 15 tests: stub fixture verifies the _collect_diagnosis_aggregator wiring; feature flag defaults to off and makes no call; timeout boundary / exception fail-soft
Modified:
- core/metrics.py +23: adds DiagnosisAggregator-related Prometheus metrics
- sanitization_service.py +24: hardens prompt sanitization boundaries (companion to vuln #4)
- RUNBOOK-AGENT-STEP-LATENCY.md / agent_step_latency_rules.yaml: minor tweaks
Tests: 15 passed
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
202 lines
10 KiB
YAML
# ops/monitoring/grafana/agent_step_latency_rules.yaml
# AWOOOI Agent Step Latency alert rules
# 2026-04-27 Claude Sonnet 4.6: A3 - Agent step latency observability (config A+B Wave 1)
# 2026-04-27 Claude Sonnet 4.6: F7 - added the full deployment SOP (fix for critic H)
#
# =============================================================================
# Deployment SOP (F7): do not skip steps; do not apply it yourself (an SRE runs it manually)
# =============================================================================
#
# Deployment model: Prometheus runs under docker-compose on k8s 110:/home/wooo/monitoring/.
#   prometheus.yml has rule_files: ["alerts.yml"], so only that single file is loaded;
#   this yaml must be merged manually into alerts-unified.yml, which is deployed as alerts.yml.
#
# Step 1: local syntax validation (must pass; otherwise do not merge)
#   promtool check rules ops/monitoring/grafana/agent_step_latency_rules.yaml
#   # Expected output contains: Checking ... SUCCESS
#
# Step 2: merge into alerts-unified.yml (done manually by DBA/SRE)
#   # Paste the entire groups[0] (agent_step_latency) yaml block
#   # at the end of the groups: list in ops/monitoring/alerts-unified.yml
#   # After merging, validate the whole file again:
#   promtool check rules ops/monitoring/alerts-unified.yml
#
# Step 3: deploy (through the existing CD flow)
#   # Standard path (automatic via CD):
#   git push gitea main   # Gitea CI runs scripts/ops/deploy-alerts.sh
#   #
#   # Emergency manual path (requires SRE authorization):
#   scp ops/monitoring/alerts-unified.yml wooo@192.168.0.110:/home/wooo/monitoring/alerts.yml
#   # Then reload Prometheus (step 4)
#
# Step 4: Prometheus reload
#   curl -X POST http://192.168.0.110:9090/-/reload
#   # (the /-/reload endpoint only responds when Prometheus runs with --web.enable-lifecycle)
#   # Or, if Prometheus runs inside a Docker container:
#   docker exec -it prometheus kill -HUP 1
#
# Step 5: acceptance (complete only when all three alerts are listed)
#   curl -s http://192.168.0.110:9090/api/v1/rules \
#     | python3 -m json.tool \
#     | grep '"name"' \
#     | grep -E "AgentStepLatencyHigh|AgentStepTimeoutSpike|DiagnoseFallbackToCloud"
#   # Expected: all three lines appear
#
# =============================================================================
#
# Deployment target: deployed together with alerts-unified.yml to 192.168.0.110:/home/wooo/monitoring/alerts.yml
# Deployment method: manual merge into alerts-unified.yml (Prometheus rule_files is configured to load only alerts.yml)
#
# Required metrics (provided by A1 and A2; the rules only become ACTIVE once both are live):
#   - aiops_agent_step_duration_seconds{agent,outcome}: Histogram, A1 (apps/api/src/observability/agent_step_metrics.py)
#       agent ∈ {diagnostician, solver, critic}
#       outcome ∈ {success, timeout, error}
#       buckets: [0.5, 1.0, 2.0, 5.0, 10.0, 15.0, 20.0, 30.0, 45.0, 60.0]
#   - aiops_diagnose_fallback_total{from_provider,to_provider}: Counter, A2 (apps/api/src/services/ai_router.py)
#       from_provider ∈ {openclaw_nemo, gemini, claude}
#       to_provider ∈ {gemini, claude}
#
# Label conventions (aligned with alerts-unified.yml):
#   layer: k8s            (the API runs in the k8s awoooi-prod namespace)
#   team: ai
#   auto_repair: "false"  (NIM latency needs human judgment; auto_repair is not triggered)
#
# Background: INC-20260425-8D17BB / INC-20260425-3B6C39
#   Measured NIM (192.168.0.188:8088) latency was 2-27s; the tail hits PHASE2_STEP_TIMEOUT_SEC=20s
#   → confidence drops to 20% (degraded), and diagnostician/solver/critic all fail
#   → fallback to Gemini, consuming the daily quota
#   This rule group gives observability over the whole path: latency high → timeout spike → fallback
#
# ⚠️ Confirm before enabling:
#   [A1] aiops_agent_step_duration_seconds is exposed (after A1 is merged and the Pod restarted)
#   [A2] aiops_diagnose_fallback_total is exposed (after A2 is merged and the Pod restarted)
#   Verify with: curl http://192.168.0.188:8088/metrics | grep aiops_agent_step
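#   A possible way to check both metrics in one pass (a sketch, assuming the same
#   /metrics endpoint as the command above; matching on the "# TYPE" lines is a
#   choice made here, because labelled series may only appear after the first
#   observation for a given label set):
#     curl -s http://192.168.0.188:8088/metrics \
#       | grep -E '^# TYPE aiops_(agent_step_duration_seconds|diagnose_fallback)'
#     # Two matching lines would suggest that both the A1 and A2 metrics are registered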
# =============================================================================

groups:

  # ===========================================================================
  # Agent Step Latency (agent_step_latency)
  # Monitors latency and failures of LLM calls made by the three Phase 2 agents (diagnostician/solver/critic)
  # ===========================================================================
  - name: agent_step_latency
    interval: 60s
    rules:

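      # -------------------------------------------------------------------------
      # [ACTIVE: requires A1 to be live]
      # AgentStepLatencyHigh: p75 latency stays at a high level
      #
      # Trigger: the p75 step latency of any agent exceeds 25s for 10 minutes
      # Meaning: NIM inference tail latency is high, with a risk of hitting the timeout
      # Severity: warning (no timeout yet, but the trend is bad; SRE should step in early)
      #
      # PromQL design notes:
      #   - rate([5m]) computes the per-second rate of increase of each bucket over a 5-minute sliding window and handles counter resets
      #   - histogram_quantile(0.75, ...) computes the 75th percentile for the current window
      #   - sum by (agent, le) keeps the agent dimension so each agent triggers independently
      #   - > 25 aligns with the tail latency measured in INC-20260425 (max 27s) and stays below the widest timeout of 30s
      #   - for: 10m avoids false alarms from short spikes (notify only after 10 continuous minutes)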
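      #
      # Worked example of the quantile estimate (illustrative numbers, not measured data):
      #   histogram_quantile interpolates linearly inside the bucket that contains the target rank.
      #   Suppose a window holds 100 observations, 60 of them within le="20" and 90 within le="30".
      #   The p75 rank is 0.75 * 100 = 75, which falls in the (20, 30] bucket:
      #     20 + (30 - 20) * (75 - 60) / (90 - 60) = 25
      #   That sits exactly on the threshold, so the expression (> 25) would not match yet.
      #   Raw counts are used here for readability; the rule applies the same formula
      #   to the per-second bucket rates.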
      # -------------------------------------------------------------------------
      - alert: AgentStepLatencyHigh
        expr: |
          histogram_quantile(
            0.75,
            sum by (agent, le) (
              rate(aiops_agent_step_duration_seconds_bucket[5m])
            )
          ) > 25
        for: 10m
        labels:
          severity: warning
          layer: k8s
          team: ai
          auto_repair: "false"
          alert_category: "agent_step_latency"
        annotations:
          summary: "Agent {{ $labels.agent }} p75 step latency > 25s (sustained for 10m)"
          description: |
            The p75 latency of LLM calls by Phase 2 agent {{ $labels.agent }} over the past 5 minutes is
            {{ $value | humanizeDuration }}, above the 25s threshold for 10 minutes.
            NIM tail latency has been measured at up to 27s; if it keeps worsening, calls will hit the timeout and trigger Gemini fallback.
            Check NIM health (192.168.0.188:8088) and GPU load.
          runbook_url: "docs/runbooks/RUNBOOK-AGENT-STEP-LATENCY.md#agentsteplatencyhigh"

      # -------------------------------------------------------------------------
      # [ACTIVE: requires A1 to be live]
      # AgentStepTimeoutSpike: burst in timeout event frequency
      #
      # Trigger: any agent times out more than 3 times per minute for 5 minutes
      # Meaning: the agent is already degraded, confidence drops to 20%, and the flywheel is disabled
      # Severity: critical (already degraded; act immediately)
      #
      # PromQL design notes:
      #   - aiops_agent_step_duration_seconds_count is a counter series (the histogram's observation count), so it is queried with rate()
      #   - the outcome label of a prometheus_client Histogram is present on the _bucket, _count and _sum series alike,
      #     so _count{outcome="timeout"} counts only the steps that timed out
      #   - rate([1m]) * 60 = timeouts per minute (rate itself is per-second; multiplying by 60 converts it)
      #   - sum by (agent) makes each agent trigger independently, without mixing agents
      #   - > 3 matches the task spec (more than 3 per minute)
      #   - for: 5m confirms a sustained problem rather than a transient spike
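      #
      # For reference, the same per-minute figure can be written with increase(),
      # since increase(x[1m]) is rate(x[1m]) multiplied by the 60-second window
      # (an equivalent sketch for reading the expression, not a change to the rule below):
      #   sum by (agent) (
      #     increase(aiops_agent_step_duration_seconds_count{outcome="timeout"}[1m])
      #   ) > 3
      # Rough arithmetic with illustrative numbers: 6 timeouts within one minute
      # is about 0.1/s, and 0.1 * 60 = 6, which is above the threshold of 3.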
      # -------------------------------------------------------------------------
      - alert: AgentStepTimeoutSpike
        expr: |
          sum by (agent) (
            rate(aiops_agent_step_duration_seconds_count{outcome="timeout"}[1m])
          ) * 60 > 3
        for: 5m
        labels:
          severity: critical
          layer: k8s
          team: ai
          auto_repair: "false"
          alert_category: "agent_step_latency"
        annotations:
          summary: "Agent {{ $labels.agent }} timeout rate > 3/min (sustained for 5m, flywheel disabled)"
          description: |
            The timeout rate of Phase 2 agent {{ $labels.agent }} over the past minute is
            {{ $value | humanize }}/min, above the 3/min threshold for 5 minutes.
            The agent is degraded (confidence=20%); diagnosis, decision and remediation all fail.
            The root cause is usually high NIM load or network jitter. Check NIM immediately and consider switching provider.
          runbook_url: "docs/runbooks/RUNBOOK-AGENT-STEP-LATENCY.md#agentsteptimeoutspike"

      # -------------------------------------------------------------------------
      # [ACTIVE: requires A2 to be live]
      # DiagnoseFallbackToCloud: early warning on NEMO→Gemini fallback frequency
      #
      # Trigger: fallback from openclaw_nemo to gemini exceeds 5 per minute for 5 minutes
      # Meaning: NIM can no longer serve requests, the Gemini daily quota is being consumed quickly, and it may run out
      # Severity: warning (Gemini is still serving, but the quota burn rate needs attention)
      #
      # PromQL design notes:
      #   - aiops_diagnose_fallback_total is a Counter; rate() * 60 converts it to /min
      #   - filters exactly on from_provider="openclaw_nemo", to_provider="gemini"
      #     (the most critical hop, NEMO→Gemini; gives early warning of Gemini quota burn)
      #   - to monitor the second-stage Gemini→Claude fallback, create a separate alert (out of scope for this task)
      #   - for: 5m confirms sustained burn rather than a transient NIM restart
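      #
      # If a second-stage Gemini→Claude alert is ever added, it could look roughly like
      # the commented-out sketch below. The alert name, the 1/min threshold and the
      # severity are placeholders chosen for illustration, not values from the task spec;
      # only the metric, its label values and the label conventions come from this file.
      #   - alert: DiagnoseFallbackToClaude          # hypothetical name
      #     # the 1/min threshold below is a placeholder
      #     expr: |
      #       rate(
      #         aiops_diagnose_fallback_total{
      #           from_provider="gemini",
      #           to_provider="claude"
      #         }[1m]
      #       ) * 60 > 1
      #     for: 5m
      #     labels:
      #       severity: warning                       # hypothetical severity
      #       layer: k8s
      #       team: ai
      #       auto_repair: "false"
      #       alert_category: "agent_step_latency"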
      # -------------------------------------------------------------------------
      - alert: DiagnoseFallbackToCloud
        expr: |
          rate(
            aiops_diagnose_fallback_total{
              from_provider="openclaw_nemo",
              to_provider="gemini"
            }[1m]
          ) * 60 > 5
        for: 5m
        labels:
          severity: warning
          layer: k8s
          team: ai
          auto_repair: "false"
          alert_category: "agent_step_latency"
        annotations:
          summary: "NEMO→Gemini fallback > 5/min (sustained for 5m); Gemini quota is burning"
          description: |
            The rate of DIAGNOSE phase fallback from openclaw_nemo to gemini is
            {{ $value | humanize }}/min, above the 5/min threshold for 5 minutes.
            The primary NIM route can no longer serve requests; the Gemini daily quota is being consumed quickly.
            If the Gemini quota is exhausted, fallback moves on to Claude (higher cost).
            Confirm NIM status immediately and decide whether to rate-limit or raise the Gemini quota.
          runbook_url: "docs/runbooks/RUNBOOK-AGENT-STEP-LATENCY.md#diagnosefallbacktocloud"