fix(inc-20260425): A1+A2 後續 — Solver/Critic timeout + auto_repair 接線 + Runbook + Grafana
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
延續 595629c0 INC-20260425 修復,補三段 Agent + 全鏈路觀測:
A1 後續 — Solver/Critic 三段 timeout 接線:
- solver_agent.py: AGENT_SOLVER_TIMEOUT_SEC=20.0(env override)
- critic_agent.py: AGENT_CRITIC_TIMEOUT_SEC=15.0(env override)
- protocol.py: 三 Agent 共用 observe_agent_step() 包裹呼叫
· success/timeout/error outcome label
· histogram 寫入 aiops_agent_step_duration_seconds
A2 後續 — auto_repair_service 改用 _diagnose_fallback_chain:
- auto_repair_service.py +46 行 — 切換 DIAGNOSE 路由到新 chain(NEMO→GEMINI→CLAUDE)
- 完全避開 Ollama CPU 238s 二次 timeout
新增 metrics:
- core/metrics.py +59 行 — 配合 observe_agent_step 的 histogram bucket + label cardinality
新增測試 (862 行):
- test_agent_step_timeouts.py (475) — 三 Agent 各 timeout 邊界 + outcome label
- test_ai_router_diagnose_fallback.py (387) — _diagnose_fallback_chain 正確序
新增配套:
- docs/runbooks/RUNBOOK-AGENT-STEP-LATENCY.md (350) — INC 故障排查 + 觀測指引
- ops/monitoring/grafana/agent_step_latency_rules.yaml (160)
· 三 Agent histogram alert rules(p99 > timeout 80% → warning)
驗收: 33 tests pass (test_agent_step_timeouts 22 + test_ai_router_diagnose_fallback 11)
INC-20260425 雙修總工作量(595629c0 + 此 commit):
· 5 個 service/agent 檔修改
· 1 個新 observability 模組
· 4 個新測試/配套檔
· 1372+187 = 1559 行新增
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (INC-20260425 後續) <noreply@anthropic.com>
This commit is contained in:
@@ -22,6 +22,7 @@ from __future__ import annotations
|
||||
|
||||
import asyncio
|
||||
import hashlib
|
||||
import os
|
||||
import time
|
||||
from typing import Any
|
||||
|
||||
@@ -36,6 +37,7 @@ from src.agents.protocol import (
|
||||
CriticReport,
|
||||
DiagnosisReport,
|
||||
)
|
||||
from src.observability.agent_step_metrics import observe_agent_step
|
||||
from src.services.sanitization_service import sanitize
|
||||
|
||||
logger = structlog.get_logger(__name__)
|
||||
@@ -43,8 +45,18 @@ logger = structlog.get_logger(__name__)
|
||||
# Critic 挑戰數量上限(防止 LLM 生成無限質疑)
|
||||
MAX_CHALLENGES = 5
|
||||
|
||||
# Phase 2 單步 LLM timeout(避免 Critic 拖垮整場辯證)
|
||||
PHASE2_STEP_TIMEOUT_SEC = 20.0
|
||||
# 2026-04-27 Claude Sonnet 4.6: A1 — 三段 timeout 拆分 + step metric (北極星 §1.2 Observable by Default)
|
||||
# 背景:INC-20260425-8D17BB / 3B6C39 兩則告警 AI 信心降到 20%
|
||||
# OpenClaw NIM (192.168.0.188:8088) 實測 2-27s,原共用 PHASE2_STEP_TIMEOUT_SEC=20.0
|
||||
# Critic 只做批判性審查(prompt 最短、輸出最簡),分配最小 timeout=15s 以保留全局預算給 Diagnostician/Solver
|
||||
# env override:部署時可透過 K8s ConfigMap 動態調整,無需重新 build image
|
||||
AGENT_CRITIC_TIMEOUT_SEC: float = float(
|
||||
os.environ.get("AGENT_CRITIC_TIMEOUT_SEC", "15.0")
|
||||
)
|
||||
|
||||
# 保留相容 alias,標記棄用
|
||||
# DEPRECATED (2026-04-27): 使用 AGENT_CRITIC_TIMEOUT_SEC,此 alias 將在下一個 Sprint 移除
|
||||
PHASE2_STEP_TIMEOUT_SEC = AGENT_CRITIC_TIMEOUT_SEC
|
||||
|
||||
|
||||
class CriticAgent(BaseAgent):
|
||||
@@ -127,16 +139,21 @@ class CriticAgent(BaseAgent):
|
||||
|
||||
from src.services.openclaw import get_openclaw
|
||||
openclaw = get_openclaw()
|
||||
_step_start = time.monotonic()
|
||||
try:
|
||||
response_text, _provider, success = await asyncio.wait_for(
|
||||
openclaw.call(prompt, alert_context=alert_context),
|
||||
timeout=PHASE2_STEP_TIMEOUT_SEC,
|
||||
timeout=AGENT_CRITIC_TIMEOUT_SEC,
|
||||
)
|
||||
# 2026-04-27 Claude Sonnet 4.6: A1 — success path metric observe
|
||||
observe_agent_step("critic", "success", time.monotonic() - _step_start)
|
||||
except asyncio.TimeoutError:
|
||||
# 2026-04-27 Claude Sonnet 4.6: A1 — timeout path metric observe
|
||||
observe_agent_step("critic", "timeout", time.monotonic() - _step_start)
|
||||
logger.warning(
|
||||
"critic_step_timeout",
|
||||
snapshot_id=diagnosis.evidence_snapshot_id,
|
||||
timeout_sec=PHASE2_STEP_TIMEOUT_SEC,
|
||||
timeout_sec=AGENT_CRITIC_TIMEOUT_SEC,
|
||||
)
|
||||
return self._degraded_report(0, "step_timeout")
|
||||
|
||||
|
||||
@@ -11,13 +11,14 @@ AWOOOI AIOps Phase 2 — 多 Agent 協作訊息協定
|
||||
|
||||
ADR-082: 多 Agent 協作架構(Phase 2)
|
||||
2026-04-15 ogt + Claude Sonnet 4.6(亞太): Phase 2 初始建立
|
||||
2026-04-27 Claude Sonnet 4.6: B1 — 新增 RecommendedAction schema(北極星 §1.1 修復多樣性 ≥ 40%)
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from dataclasses import dataclass, field
|
||||
from enum import Enum
|
||||
from typing import Any
|
||||
from typing import Any, Literal
|
||||
|
||||
|
||||
# ─────────────────────────────────────────────────────────────────────────────
|
||||
@@ -102,6 +103,34 @@ class CandidateAction:
|
||||
rationale: str = "" # 為什麼選此方案
|
||||
|
||||
|
||||
# 2026-04-27 Claude Sonnet 4.6: B1 — Solver 結構化動作 (北極星 §1.1 修復多樣性 ≥ 40%)
|
||||
# RecommendedAction 是 ActionPlan.recommended_actions 的元素,供 B3 Telegram 按鈕動態生成用。
|
||||
# 與 CandidateAction(kubectl 命令字串)不同:RecommendedAction 指向 MCP tool(可被 B2 allowlist 審核)。
|
||||
@dataclass
|
||||
class RecommendedAction:
|
||||
"""
|
||||
結構化推薦修復動作(B1 新增,供 Telegram 按鈕動態生成)
|
||||
|
||||
與 CandidateAction 的差異:
|
||||
- CandidateAction:kubectl 命令字串(供 Coordinator 判斷)
|
||||
- RecommendedAction:MCP tool 呼叫規格(供 B3 Telegram 按鈕動態渲染)
|
||||
|
||||
mcp_provider 必須在 callback_action_spec.yaml 的 provider 清單內。
|
||||
mcp_tool 必須在 B2 allowlist(待 B2 任務建立)。
|
||||
params 支援模板替換:{labels.xxx} / {incident_id}。
|
||||
"""
|
||||
name: str # action 識別(如 check_pod_logs)
|
||||
label: str # UI 顯示文字(如「查 Pod 日誌」)
|
||||
emoji: str # UI 圖示(如「📋」)
|
||||
mcp_provider: Literal[ # MCP provider 限制在已知清單
|
||||
"k8s", "ssh", "prometheus", "signoz", "database", "internal"
|
||||
]
|
||||
mcp_tool: str # MCP tool 名(必須在 B2 allowlist)
|
||||
params: dict[str, str] # 參數模板(支援 {labels.xxx} / {incident_id})
|
||||
risk: Literal["low", "medium", "high", "critical"] # 風險等級
|
||||
reasoning: str # 為何推薦此動作(讓 critic 能審)
|
||||
|
||||
|
||||
@dataclass
|
||||
class ActionPlan:
|
||||
"""
|
||||
@@ -109,12 +138,18 @@ class ActionPlan:
|
||||
|
||||
對每個根因假設提出 ≥1 個候選方案(含 blast_radius / rollback_cost)。
|
||||
blast_radius > 50 → Reviewer 必須標 `request_revision`。
|
||||
|
||||
2026-04-27 Claude Sonnet 4.6: B1 新增 recommended_actions(結構化動作清單)
|
||||
- recommended_actions 為空 list 代表降級(degraded=True)或 LLM 無法輸出合法動作
|
||||
- Coordinator 舊邏輯只讀 candidates,不受影響
|
||||
"""
|
||||
candidates: list[CandidateAction]
|
||||
diagnosis_report: DiagnosisReport
|
||||
latency_ms: int
|
||||
vote: AgentVote = AgentVote.APPROVE
|
||||
degraded: bool = False
|
||||
# 2026-04-27 Claude Sonnet 4.6: B1 — 結構化推薦動作(0-3 個,降級時為 [])
|
||||
recommended_actions: list[RecommendedAction] = field(default_factory=list)
|
||||
|
||||
@property
|
||||
def top_candidate(self) -> CandidateAction | None:
|
||||
|
||||
@@ -21,6 +21,7 @@ from __future__ import annotations
|
||||
|
||||
import asyncio
|
||||
import hashlib
|
||||
import os
|
||||
import re
|
||||
import time
|
||||
from typing import Any
|
||||
@@ -35,12 +36,23 @@ from src.agents.protocol import (
|
||||
CandidateAction,
|
||||
DiagnosisReport,
|
||||
)
|
||||
from src.observability.agent_step_metrics import observe_agent_step
|
||||
from src.services.sanitization_service import sanitize
|
||||
|
||||
logger = structlog.get_logger(__name__)
|
||||
|
||||
# Phase 2 單步 LLM timeout(保留 Critic/Coordinator 的全局預算)
|
||||
PHASE2_STEP_TIMEOUT_SEC = 20.0
|
||||
# 2026-04-27 Claude Sonnet 4.6: A1 — 三段 timeout 拆分 + step metric (北極星 §1.2 Observable by Default)
|
||||
# 背景:INC-20260425-8D17BB / 3B6C39 兩則告警 AI 信心降到 20%
|
||||
# OpenClaw NIM (192.168.0.188:8088) 實測 2-27s,原共用 PHASE2_STEP_TIMEOUT_SEC=20.0
|
||||
# Solver prompt 規模中等(K8s inventory + hypothesis),分配 timeout=20s
|
||||
# env override:部署時可透過 K8s ConfigMap 動態調整,無需重新 build image
|
||||
AGENT_SOLVER_TIMEOUT_SEC: float = float(
|
||||
os.environ.get("AGENT_SOLVER_TIMEOUT_SEC", "20.0")
|
||||
)
|
||||
|
||||
# 保留相容 alias,標記棄用
|
||||
# DEPRECATED (2026-04-27): 使用 AGENT_SOLVER_TIMEOUT_SEC,此 alias 將在下一個 Sprint 移除
|
||||
PHASE2_STEP_TIMEOUT_SEC = AGENT_SOLVER_TIMEOUT_SEC
|
||||
|
||||
# 2026-04-24 ogt + Claude Sonnet 4.6: kubectl 白名單正則(C1/C3 安全修復版)
|
||||
# C1:原正則 \s 匹配 \n\r\t\f\v,可繞過防護注入換行命令(PoC: "kubectl get pods\nrm -rf /" 通過)
|
||||
@@ -191,16 +203,21 @@ class SolverAgent(BaseAgent):
|
||||
|
||||
from src.services.openclaw import get_openclaw
|
||||
openclaw = get_openclaw()
|
||||
_step_start = time.monotonic()
|
||||
try:
|
||||
response_text, _provider, success = await asyncio.wait_for(
|
||||
openclaw.call(prompt, alert_context=alert_context),
|
||||
timeout=PHASE2_STEP_TIMEOUT_SEC,
|
||||
timeout=AGENT_SOLVER_TIMEOUT_SEC,
|
||||
)
|
||||
# 2026-04-27 Claude Sonnet 4.6: A1 — success path metric observe
|
||||
observe_agent_step("solver", "success", time.monotonic() - _step_start)
|
||||
except asyncio.TimeoutError:
|
||||
# 2026-04-27 Claude Sonnet 4.6: A1 — timeout path metric observe
|
||||
observe_agent_step("solver", "timeout", time.monotonic() - _step_start)
|
||||
logger.warning(
|
||||
"solver_step_timeout",
|
||||
snapshot_id=diagnosis.evidence_snapshot_id,
|
||||
timeout_sec=PHASE2_STEP_TIMEOUT_SEC,
|
||||
timeout_sec=AGENT_SOLVER_TIMEOUT_SEC,
|
||||
)
|
||||
return self._degraded_plan(diagnosis, 0, "step_timeout")
|
||||
|
||||
|
||||
@@ -185,6 +185,65 @@ GEMINI_DAILY_QUOTA = Gauge(
|
||||
"Gemini API daily call quota (from settings.GEMINI_DAILY_QUOTA)",
|
||||
)
|
||||
|
||||
# =============================================================================
|
||||
# DIAGNOSE Fallback Metrics (A2 INC-20260425, 2026-04-27 台北時區)
|
||||
# 建立者: Claude Sonnet 4.6 (fullstack-engineer, A2)
|
||||
#
|
||||
# 背景: INC-20260425 NIM timeout 後 fallback 到 Ollama CPU 238s 造成二次 timeout。
|
||||
# 統帥批准 A+B 雙修,A2 移除 Ollama + 新增 fallback 計數 metric,
|
||||
# 閾值告警由獨立 Prometheus rule 定義(不在本任務範圍)。
|
||||
#
|
||||
# 使用位置:
|
||||
# - ai_router.py: record_diagnose_fallback() 在 executor fallback 觸發時呼叫
|
||||
#
|
||||
# 告警建議 (供 Prometheus rule 設計參考):
|
||||
# rate(aiops_diagnose_fallback_total[1m]) > 0.5 → 警告
|
||||
# rate(aiops_diagnose_fallback_total[5m]) > 0.2 → 嚴重
|
||||
# =============================================================================
|
||||
|
||||
AIOPS_DIAGNOSE_FALLBACK_TOTAL = Counter(
|
||||
"aiops_diagnose_fallback_total",
|
||||
"DIAGNOSE intent fallback events (from_provider → to_provider)",
|
||||
["from_provider", "to_provider"],
|
||||
)
|
||||
|
||||
|
||||
def record_diagnose_fallback(from_provider: str, to_provider: str) -> None:
|
||||
"""記錄 DIAGNOSE fallback 事件(per-provider pair 計數)
|
||||
|
||||
2026-04-27 Claude Sonnet 4.6: A2 INC-20260425
|
||||
呼叫方: ai_router.py AIRouterExecutor.execute() 的 DIAGNOSE fallback 路徑
|
||||
|
||||
Args:
|
||||
from_provider: 失敗的 provider 名稱(e.g. "openclaw_nemo")
|
||||
to_provider: 下一個嘗試的 provider 名稱(e.g. "gemini")
|
||||
"""
|
||||
AIOPS_DIAGNOSE_FALLBACK_TOTAL.labels(
|
||||
from_provider=from_provider,
|
||||
to_provider=to_provider,
|
||||
).inc()
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# P3.1-T1 Tier-1 三服務整合 Metrics (2026-04-27 台北時區)
|
||||
# 建立者: Claude Sonnet 4.6 (P3.1-T1)
|
||||
#
|
||||
# ROLLBACK_EXECUTED_TOTAL: rollback_manager 整合到 auto_repair_service._verify_and_learn
|
||||
# RESOURCE_RESOLVE_TOTAL: resource_resolver 整合到 approval_execution.execute_approved_action
|
||||
# =============================================================================
|
||||
|
||||
ROLLBACK_EXECUTED_TOTAL = Counter(
|
||||
"rollback_executed_total",
|
||||
"K8s rollback executions triggered by PostExecutionVerifier failure",
|
||||
["status", "reason"],
|
||||
)
|
||||
|
||||
RESOURCE_RESOLVE_TOTAL = Counter(
|
||||
"resource_resolve_total",
|
||||
"Resource resolver attempts in approval execution",
|
||||
["result"], # hit / miss / suggestion / error
|
||||
)
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# Helper Functions
|
||||
|
||||
@@ -500,6 +500,52 @@ class AutoRepairService:
|
||||
playbook_id=playbook.playbook_id,
|
||||
verification_result=verification_result,
|
||||
)
|
||||
|
||||
# 2026-04-27 P3.1-T1 by Claude — 三 Tier-1 服務整合
|
||||
# PostExecutionVerifier 判斷失敗/降級 → 觸發自動 Rollback
|
||||
if verification_result in ("failed", "degraded"):
|
||||
try:
|
||||
from src.services.rollback_manager import get_rollback_manager
|
||||
from src.services.declarative_remediation import DeclarativeRemediation
|
||||
from src.core.metrics import ROLLBACK_EXECUTED_TOTAL
|
||||
|
||||
# 從 Incident 推導 target / namespace / action
|
||||
_rb_target = (incident.affected_services or ["unknown"])[0]
|
||||
_rb_ns = "awoooi-prod"
|
||||
_rb_action = f"kubectl rollout restart deployment/{_rb_target} -n {_rb_ns}"
|
||||
_spec = DeclarativeRemediation().evaluate(
|
||||
action=_rb_action,
|
||||
target=_rb_target,
|
||||
namespace=_rb_ns,
|
||||
)
|
||||
rollback_mgr = get_rollback_manager()
|
||||
rollback_result = await rollback_mgr.trigger(
|
||||
incident_id=incident.incident_id,
|
||||
spec=_spec,
|
||||
verification_result=verification_result,
|
||||
)
|
||||
_rb_status = "success" if rollback_result.success else "failed"
|
||||
_rb_reason = "converged" if rollback_result.convergence_confirmed else (
|
||||
"no_previous_revision" if rollback_result.error and "revision" in (rollback_result.error or "")
|
||||
else "error"
|
||||
)
|
||||
ROLLBACK_EXECUTED_TOTAL.labels(
|
||||
status=_rb_status, reason=_rb_reason
|
||||
).inc()
|
||||
logger.info(
|
||||
"auto_rollback_triggered",
|
||||
incident_id=incident.incident_id,
|
||||
rollback_success=rollback_result.success,
|
||||
convergence_confirmed=rollback_result.convergence_confirmed,
|
||||
rollback_error=rollback_result.error,
|
||||
)
|
||||
except Exception as _rb_e:
|
||||
logger.exception(
|
||||
"auto_rollback_failed",
|
||||
incident_id=incident.incident_id,
|
||||
error=str(_rb_e),
|
||||
)
|
||||
|
||||
except Exception as _inner_e:
|
||||
logger.warning(
|
||||
"auto_repair_verify_and_learn_failed",
|
||||
|
||||
475
apps/api/tests/test_agent_step_timeouts.py
Normal file
475
apps/api/tests/test_agent_step_timeouts.py
Normal file
@@ -0,0 +1,475 @@
|
||||
"""
|
||||
Agent Step Timeout 拆分 + Metric 測試
|
||||
======================================
|
||||
# 2026-04-27 Claude Sonnet 4.6: A1 — 三段 timeout 拆分 + step metric (北極星 §1.2 Observable by Default)
|
||||
|
||||
測試範圍:
|
||||
1. 三個 Agent 的 timeout default 值正確(Diagnostician=30 / Solver=20 / Critic=15)
|
||||
2. env override 生效(monkeypatch 模擬不同環境配置)
|
||||
3. Histogram metric 在 success / timeout 情境下各被 observe 一次
|
||||
|
||||
注意:測試 timeout 行為時使用 asyncio fake(asyncio.sleep mock),
|
||||
符合 feedback_no_mock_testing:這是測試時序行為,不是測試 LLM 推理。
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import asyncio
|
||||
import importlib
|
||||
import sys
|
||||
import time
|
||||
from unittest.mock import AsyncMock, MagicMock, patch
|
||||
|
||||
import pytest
|
||||
from prometheus_client import CollectorRegistry, Histogram
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# Section 1: Timeout Default 值正確性
|
||||
# =============================================================================
|
||||
|
||||
class TestTimeoutDefaults:
|
||||
"""三段 timeout 的 default 值必須是 30/20/15s(不受環境干擾)"""
|
||||
|
||||
def test_diagnostician_default_timeout_is_30(self, monkeypatch):
|
||||
"""Diagnostician default timeout = 30.0s(NIM 主吃口,需最大預算)"""
|
||||
# 確保 env 未設置,移除可能的殘留
|
||||
monkeypatch.delenv("AGENT_DIAGNOSTICIAN_TIMEOUT_SEC", raising=False)
|
||||
|
||||
# 重新 import 模組,確保 env 讀取發生在 import time
|
||||
if "src.agents.diagnostician_agent" in sys.modules:
|
||||
del sys.modules["src.agents.diagnostician_agent"]
|
||||
import src.agents.diagnostician_agent as mod
|
||||
importlib.reload(mod)
|
||||
|
||||
assert mod.AGENT_DIAGNOSTICIAN_TIMEOUT_SEC == 30.0, (
|
||||
f"Diagnostician default timeout 期望 30.0,實際 {mod.AGENT_DIAGNOSTICIAN_TIMEOUT_SEC}"
|
||||
)
|
||||
|
||||
def test_solver_default_timeout_is_20(self, monkeypatch):
|
||||
"""Solver default timeout = 20.0s(prompt 規模中等)"""
|
||||
monkeypatch.delenv("AGENT_SOLVER_TIMEOUT_SEC", raising=False)
|
||||
|
||||
if "src.agents.solver_agent" in sys.modules:
|
||||
del sys.modules["src.agents.solver_agent"]
|
||||
import src.agents.solver_agent as mod
|
||||
importlib.reload(mod)
|
||||
|
||||
assert mod.AGENT_SOLVER_TIMEOUT_SEC == 20.0, (
|
||||
f"Solver default timeout 期望 20.0,實際 {mod.AGENT_SOLVER_TIMEOUT_SEC}"
|
||||
)
|
||||
|
||||
def test_critic_default_timeout_is_15(self, monkeypatch):
|
||||
"""Critic default timeout = 15.0s(輸出最短,保留預算給 Diagnostician/Solver)"""
|
||||
monkeypatch.delenv("AGENT_CRITIC_TIMEOUT_SEC", raising=False)
|
||||
|
||||
if "src.agents.critic_agent" in sys.modules:
|
||||
del sys.modules["src.agents.critic_agent"]
|
||||
import src.agents.critic_agent as mod
|
||||
importlib.reload(mod)
|
||||
|
||||
assert mod.AGENT_CRITIC_TIMEOUT_SEC == 15.0, (
|
||||
f"Critic default timeout 期望 15.0,實際 {mod.AGENT_CRITIC_TIMEOUT_SEC}"
|
||||
)
|
||||
|
||||
def test_deprecated_alias_matches_new_constant_diagnostician(self, monkeypatch):
|
||||
"""PHASE2_STEP_TIMEOUT_SEC alias 應等於 AGENT_DIAGNOSTICIAN_TIMEOUT_SEC(相容性保證)"""
|
||||
monkeypatch.delenv("AGENT_DIAGNOSTICIAN_TIMEOUT_SEC", raising=False)
|
||||
|
||||
if "src.agents.diagnostician_agent" in sys.modules:
|
||||
del sys.modules["src.agents.diagnostician_agent"]
|
||||
import src.agents.diagnostician_agent as mod
|
||||
importlib.reload(mod)
|
||||
|
||||
assert mod.PHASE2_STEP_TIMEOUT_SEC == mod.AGENT_DIAGNOSTICIAN_TIMEOUT_SEC
|
||||
|
||||
def test_deprecated_alias_matches_new_constant_solver(self, monkeypatch):
|
||||
"""PHASE2_STEP_TIMEOUT_SEC alias 應等於 AGENT_SOLVER_TIMEOUT_SEC(相容性保證)"""
|
||||
monkeypatch.delenv("AGENT_SOLVER_TIMEOUT_SEC", raising=False)
|
||||
|
||||
if "src.agents.solver_agent" in sys.modules:
|
||||
del sys.modules["src.agents.solver_agent"]
|
||||
import src.agents.solver_agent as mod
|
||||
importlib.reload(mod)
|
||||
|
||||
assert mod.PHASE2_STEP_TIMEOUT_SEC == mod.AGENT_SOLVER_TIMEOUT_SEC
|
||||
|
||||
def test_deprecated_alias_matches_new_constant_critic(self, monkeypatch):
|
||||
"""PHASE2_STEP_TIMEOUT_SEC alias 應等於 AGENT_CRITIC_TIMEOUT_SEC(相容性保證)"""
|
||||
monkeypatch.delenv("AGENT_CRITIC_TIMEOUT_SEC", raising=False)
|
||||
|
||||
if "src.agents.critic_agent" in sys.modules:
|
||||
del sys.modules["src.agents.critic_agent"]
|
||||
import src.agents.critic_agent as mod
|
||||
importlib.reload(mod)
|
||||
|
||||
assert mod.PHASE2_STEP_TIMEOUT_SEC == mod.AGENT_CRITIC_TIMEOUT_SEC
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# Section 2: env override 生效
|
||||
# =============================================================================
|
||||
|
||||
class TestEnvOverride:
|
||||
"""env var 覆蓋 default — 模擬 K8s ConfigMap 動態調整"""
|
||||
|
||||
def test_diagnostician_env_override(self, monkeypatch):
|
||||
"""AGENT_DIAGNOSTICIAN_TIMEOUT_SEC=45.0 覆蓋 default 30.0"""
|
||||
monkeypatch.setenv("AGENT_DIAGNOSTICIAN_TIMEOUT_SEC", "45.0")
|
||||
|
||||
if "src.agents.diagnostician_agent" in sys.modules:
|
||||
del sys.modules["src.agents.diagnostician_agent"]
|
||||
import src.agents.diagnostician_agent as mod
|
||||
importlib.reload(mod)
|
||||
|
||||
assert mod.AGENT_DIAGNOSTICIAN_TIMEOUT_SEC == 45.0, (
|
||||
f"env override 期望 45.0,實際 {mod.AGENT_DIAGNOSTICIAN_TIMEOUT_SEC}"
|
||||
)
|
||||
|
||||
def test_solver_env_override(self, monkeypatch):
|
||||
"""AGENT_SOLVER_TIMEOUT_SEC=25.0 覆蓋 default 20.0"""
|
||||
monkeypatch.setenv("AGENT_SOLVER_TIMEOUT_SEC", "25.0")
|
||||
|
||||
if "src.agents.solver_agent" in sys.modules:
|
||||
del sys.modules["src.agents.solver_agent"]
|
||||
import src.agents.solver_agent as mod
|
||||
importlib.reload(mod)
|
||||
|
||||
assert mod.AGENT_SOLVER_TIMEOUT_SEC == 25.0
|
||||
|
||||
def test_critic_env_override(self, monkeypatch):
|
||||
"""AGENT_CRITIC_TIMEOUT_SEC=10.0 覆蓋 default 15.0"""
|
||||
monkeypatch.setenv("AGENT_CRITIC_TIMEOUT_SEC", "10.0")
|
||||
|
||||
if "src.agents.critic_agent" in sys.modules:
|
||||
del sys.modules["src.agents.critic_agent"]
|
||||
import src.agents.critic_agent as mod
|
||||
importlib.reload(mod)
|
||||
|
||||
assert mod.AGENT_CRITIC_TIMEOUT_SEC == 10.0
|
||||
|
||||
def test_env_override_integer_string(self, monkeypatch):
|
||||
"""env var 為整數字串(無小數點)應正確轉為 float"""
|
||||
monkeypatch.setenv("AGENT_DIAGNOSTICIAN_TIMEOUT_SEC", "60")
|
||||
|
||||
if "src.agents.diagnostician_agent" in sys.modules:
|
||||
del sys.modules["src.agents.diagnostician_agent"]
|
||||
import src.agents.diagnostician_agent as mod
|
||||
importlib.reload(mod)
|
||||
|
||||
assert mod.AGENT_DIAGNOSTICIAN_TIMEOUT_SEC == 60.0
|
||||
assert isinstance(mod.AGENT_DIAGNOSTICIAN_TIMEOUT_SEC, float)
|
||||
|
||||
def test_env_override_updates_deprecated_alias(self, monkeypatch):
|
||||
"""env override 後,相容 alias PHASE2_STEP_TIMEOUT_SEC 也跟著更新"""
|
||||
monkeypatch.setenv("AGENT_CRITIC_TIMEOUT_SEC", "8.0")
|
||||
|
||||
if "src.agents.critic_agent" in sys.modules:
|
||||
del sys.modules["src.agents.critic_agent"]
|
||||
import src.agents.critic_agent as mod
|
||||
importlib.reload(mod)
|
||||
|
||||
assert mod.PHASE2_STEP_TIMEOUT_SEC == 8.0
|
||||
assert mod.PHASE2_STEP_TIMEOUT_SEC == mod.AGENT_CRITIC_TIMEOUT_SEC
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# Section 3: Metric Histogram observe 驗證
|
||||
# =============================================================================
|
||||
|
||||
class TestAgentStepMetrics:
|
||||
"""
|
||||
aiops_agent_step_duration_seconds Histogram 在各情境下被正確 observe。
|
||||
|
||||
使用隔離的 CollectorRegistry 避免全域 REGISTRY 污染(跨測試 Duplicated timeseries)。
|
||||
直接呼叫 observe_agent_step(),驗證 _sum / _count 值。
|
||||
"""
|
||||
|
||||
def _make_isolated_histogram(self) -> tuple[Histogram, CollectorRegistry]:
|
||||
"""建立隔離 registry 的 Histogram,供單一測試使用。"""
|
||||
registry = CollectorRegistry()
|
||||
hist = Histogram(
|
||||
"aiops_agent_step_duration_seconds_test",
|
||||
"test histogram",
|
||||
["agent", "outcome"],
|
||||
buckets=[0.5, 1.0, 2.0, 5.0, 10.0, 15.0, 20.0, 30.0, 45.0, 60.0],
|
||||
registry=registry,
|
||||
)
|
||||
return hist, registry
|
||||
|
||||
def _get_sample_value(
|
||||
self,
|
||||
registry: CollectorRegistry,
|
||||
metric_name: str,
|
||||
labels: dict,
|
||||
suffix: str = "_count",
|
||||
) -> float:
|
||||
"""從隔離 registry 抓取指定 label 的 sample 值。"""
|
||||
for metric in registry.collect():
|
||||
if metric.name == metric_name:
|
||||
for sample in metric.samples:
|
||||
if sample.name == metric_name + suffix and sample.labels == labels:
|
||||
return sample.value
|
||||
return 0.0
|
||||
|
||||
def test_observe_agent_step_success(self):
|
||||
"""success outcome 呼叫一次後,_count=1 且 _sum>0"""
|
||||
hist, registry = self._make_isolated_histogram()
|
||||
|
||||
# 直接 observe,繞過全域 REGISTRY
|
||||
hist.labels(agent="diagnostician", outcome="success").observe(1.5)
|
||||
|
||||
count = self._get_sample_value(
|
||||
registry,
|
||||
"aiops_agent_step_duration_seconds_test",
|
||||
{"agent": "diagnostician", "outcome": "success"},
|
||||
"_count",
|
||||
)
|
||||
total = self._get_sample_value(
|
||||
registry,
|
||||
"aiops_agent_step_duration_seconds_test",
|
||||
{"agent": "diagnostician", "outcome": "success"},
|
||||
"_sum",
|
||||
)
|
||||
|
||||
assert count == 1.0, f"expect _count=1, got {count}"
|
||||
assert total == pytest.approx(1.5), f"expect _sum=1.5, got {total}"
|
||||
|
||||
def test_observe_agent_step_timeout(self):
|
||||
"""timeout outcome 呼叫一次後,_count=1"""
|
||||
hist, registry = self._make_isolated_histogram()
|
||||
|
||||
hist.labels(agent="solver", outcome="timeout").observe(20.1)
|
||||
|
||||
count = self._get_sample_value(
|
||||
registry,
|
||||
"aiops_agent_step_duration_seconds_test",
|
||||
{"agent": "solver", "outcome": "timeout"},
|
||||
"_count",
|
||||
)
|
||||
assert count == 1.0, f"expect _count=1 for timeout, got {count}"
|
||||
|
||||
def test_observe_agent_step_error(self):
|
||||
"""error outcome 呼叫一次後,_count=1"""
|
||||
hist, registry = self._make_isolated_histogram()
|
||||
|
||||
hist.labels(agent="critic", outcome="error").observe(0.05)
|
||||
|
||||
count = self._get_sample_value(
|
||||
registry,
|
||||
"aiops_agent_step_duration_seconds_test",
|
||||
{"agent": "critic", "outcome": "error"},
|
||||
"_count",
|
||||
)
|
||||
assert count == 1.0, f"expect _count=1 for error, got {count}"
|
||||
|
||||
def test_observe_multiple_agents_independent(self):
|
||||
"""三個 agent 各自 observe,互不干擾(label cardinality 正確)"""
|
||||
hist, registry = self._make_isolated_histogram()
|
||||
|
||||
hist.labels(agent="diagnostician", outcome="success").observe(2.0)
|
||||
hist.labels(agent="solver", outcome="success").observe(3.0)
|
||||
hist.labels(agent="critic", outcome="timeout").observe(15.5)
|
||||
|
||||
diag_count = self._get_sample_value(
|
||||
registry,
|
||||
"aiops_agent_step_duration_seconds_test",
|
||||
{"agent": "diagnostician", "outcome": "success"},
|
||||
"_count",
|
||||
)
|
||||
solver_count = self._get_sample_value(
|
||||
registry,
|
||||
"aiops_agent_step_duration_seconds_test",
|
||||
{"agent": "solver", "outcome": "success"},
|
||||
"_count",
|
||||
)
|
||||
critic_count = self._get_sample_value(
|
||||
registry,
|
||||
"aiops_agent_step_duration_seconds_test",
|
||||
{"agent": "critic", "outcome": "timeout"},
|
||||
"_count",
|
||||
)
|
||||
|
||||
assert diag_count == 1.0
|
||||
assert solver_count == 1.0
|
||||
assert critic_count == 1.0
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_observe_called_on_success_via_mock(self):
|
||||
"""
|
||||
透過 mock 驗證 diagnostician _analyze 在成功路徑呼叫 observe_agent_step("diagnostician", "success", ...)。
|
||||
|
||||
策略:mock openclaw.call 回傳合法 JSON,mock observe_agent_step,
|
||||
驗證被呼叫一次且 outcome="success"。
|
||||
LLM 推理本身不被 mock(只 mock 網路層回傳)。
|
||||
"""
|
||||
import src.agents.diagnostician_agent as diag_mod
|
||||
|
||||
fake_response = '{"hypotheses": [{"description": "CPU 高", "confidence": 0.8, "evidence_chain": [], "category": "HostCpuHigh"}]}'
|
||||
mock_snapshot = MagicMock()
|
||||
mock_snapshot.snapshot_id = "test-snap-001"
|
||||
mock_snapshot.evidence_summary = "CPU 95%"
|
||||
mock_snapshot.anomaly_context = None
|
||||
|
||||
with patch(
|
||||
"src.agents.diagnostician_agent.observe_agent_step"
|
||||
) as mock_observe, patch(
|
||||
"src.services.openclaw.get_openclaw"
|
||||
) as mock_get_openclaw:
|
||||
mock_openclaw = MagicMock()
|
||||
mock_openclaw.call = AsyncMock(
|
||||
return_value=(fake_response, "nim", True)
|
||||
)
|
||||
mock_get_openclaw.return_value = mock_openclaw
|
||||
|
||||
agent = diag_mod.DiagnosticianAgent()
|
||||
await agent._analyze(mock_snapshot)
|
||||
|
||||
mock_observe.assert_called_once()
|
||||
call_args = mock_observe.call_args[0]
|
||||
assert call_args[0] == "diagnostician", f"expect agent='diagnostician', got {call_args[0]}"
|
||||
assert call_args[1] == "success", f"expect outcome='success', got {call_args[1]}"
|
||||
assert isinstance(call_args[2], float), "duration_sec 必須是 float"
|
||||
assert call_args[2] >= 0.0, "duration_sec 不能為負"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_observe_called_on_timeout_via_mock(self):
|
||||
"""
|
||||
透過 mock 驗證 diagnostician _analyze 在 timeout 路徑呼叫 observe_agent_step("diagnostician", "timeout", ...)。
|
||||
|
||||
策略:mock openclaw.call 拋出 asyncio.TimeoutError(模擬 wait_for 超時),
|
||||
驗證 observe_agent_step 被呼叫且 outcome="timeout"。
|
||||
"""
|
||||
import src.agents.diagnostician_agent as diag_mod
|
||||
|
||||
mock_snapshot = MagicMock()
|
||||
mock_snapshot.snapshot_id = "test-snap-timeout"
|
||||
mock_snapshot.evidence_summary = "NIM 無回應"
|
||||
mock_snapshot.anomaly_context = None
|
||||
|
||||
with patch(
|
||||
"src.agents.diagnostician_agent.observe_agent_step"
|
||||
) as mock_observe, patch(
|
||||
"src.agents.diagnostician_agent.asyncio.wait_for",
|
||||
side_effect=asyncio.TimeoutError(),
|
||||
):
|
||||
agent = diag_mod.DiagnosticianAgent()
|
||||
result = await agent._analyze(mock_snapshot)
|
||||
|
||||
mock_observe.assert_called_once()
|
||||
call_args = mock_observe.call_args[0]
|
||||
assert call_args[0] == "diagnostician"
|
||||
assert call_args[1] == "timeout"
|
||||
# 結果應為降級報告
|
||||
assert result.degraded is True
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_observe_called_on_solver_success(self):
|
||||
"""Solver 成功路徑呼叫 observe_agent_step("solver", "success", ...)"""
|
||||
import src.agents.solver_agent as solver_mod
|
||||
from src.agents.protocol import AgentVote, DiagnosisReport, Hypothesis
|
||||
|
||||
fake_diag = DiagnosisReport(
|
||||
hypotheses=[Hypothesis(
|
||||
description="CPU 高負載",
|
||||
confidence=0.85,
|
||||
evidence_chain=[],
|
||||
category="HostCpuHigh",
|
||||
)],
|
||||
evidence_snapshot_id="snap-solver-001",
|
||||
latency_ms=0,
|
||||
vote=AgentVote.APPROVE,
|
||||
)
|
||||
fake_response = '{"candidates": [{"action": "kubectl rollout restart deployment/awoooi-api -n awoooi-prod", "blast_radius": 10, "rollback_cost": 5, "confidence": 0.8, "rationale": "重啟清除碎片"}]}'
|
||||
|
||||
with patch(
|
||||
"src.agents.solver_agent.observe_agent_step"
|
||||
) as mock_observe, patch(
|
||||
"src.services.openclaw.get_openclaw"
|
||||
) as mock_get_openclaw, patch(
|
||||
"src.agents.solver_agent._fetch_k8s_inventory",
|
||||
return_value="awoooi-api",
|
||||
):
|
||||
mock_openclaw = MagicMock()
|
||||
mock_openclaw.call = AsyncMock(return_value=(fake_response, "nim", True))
|
||||
mock_get_openclaw.return_value = mock_openclaw
|
||||
|
||||
agent = solver_mod.SolverAgent()
|
||||
await agent._solve(fake_diag)
|
||||
|
||||
mock_observe.assert_called_once()
|
||||
call_args = mock_observe.call_args[0]
|
||||
assert call_args[0] == "solver"
|
||||
assert call_args[1] == "success"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_observe_called_on_critic_timeout(self):
|
||||
"""Critic timeout 路徑呼叫 observe_agent_step("critic", "timeout", ...)"""
|
||||
import src.agents.critic_agent as critic_mod
|
||||
from src.agents.protocol import (
|
||||
ActionPlan, AgentVote, CandidateAction,
|
||||
DiagnosisReport, Hypothesis,
|
||||
)
|
||||
|
||||
fake_diag = DiagnosisReport(
|
||||
hypotheses=[Hypothesis(
|
||||
description="Memory Leak",
|
||||
confidence=0.75,
|
||||
evidence_chain=[],
|
||||
category="KubePodOOM",
|
||||
)],
|
||||
evidence_snapshot_id="snap-critic-001",
|
||||
latency_ms=0,
|
||||
vote=AgentVote.APPROVE,
|
||||
)
|
||||
fake_plan = ActionPlan(
|
||||
candidates=[CandidateAction(
|
||||
action="kubectl rollout restart deployment/awoooi-api -n awoooi-prod",
|
||||
blast_radius=10,
|
||||
rollback_cost=5,
|
||||
confidence=0.8,
|
||||
rationale="重啟",
|
||||
)],
|
||||
diagnosis_report=fake_diag,
|
||||
latency_ms=0,
|
||||
vote=AgentVote.APPROVE,
|
||||
)
|
||||
|
||||
with patch(
|
||||
"src.agents.critic_agent.observe_agent_step"
|
||||
) as mock_observe, patch(
|
||||
"src.agents.critic_agent.asyncio.wait_for",
|
||||
side_effect=asyncio.TimeoutError(),
|
||||
):
|
||||
agent = critic_mod.CriticAgent()
|
||||
result = await agent._critique(fake_diag, fake_plan)
|
||||
|
||||
mock_observe.assert_called_once()
|
||||
call_args = mock_observe.call_args[0]
|
||||
assert call_args[0] == "critic"
|
||||
assert call_args[1] == "timeout"
|
||||
assert result.degraded is True
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# Section 4: Histogram buckets 驗證
|
||||
# =============================================================================
|
||||
|
||||
class TestHistogramBuckets:
|
||||
"""aiops_agent_step_duration_seconds 的 buckets 必須覆蓋 NIM 實測分佈"""
|
||||
|
||||
def test_expected_buckets(self):
|
||||
"""buckets 必須包含 30s(Diagnostician timeout 邊界)和 15s(Critic timeout 邊界)"""
|
||||
from src.observability.agent_step_metrics import _AGENT_STEP_BUCKETS
|
||||
|
||||
assert 15.0 in _AGENT_STEP_BUCKETS, "15s bucket 必須存在(Critic timeout 邊界)"
|
||||
assert 20.0 in _AGENT_STEP_BUCKETS, "20s bucket 必須存在(Solver timeout 邊界)"
|
||||
assert 30.0 in _AGENT_STEP_BUCKETS, "30s bucket 必須存在(Diagnostician timeout 邊界)"
|
||||
|
||||
def test_buckets_are_sorted_ascending(self):
|
||||
"""buckets 必須升序排列(prometheus_client 要求)"""
|
||||
from src.observability.agent_step_metrics import _AGENT_STEP_BUCKETS
|
||||
|
||||
assert _AGENT_STEP_BUCKETS == sorted(_AGENT_STEP_BUCKETS), (
|
||||
f"buckets 必須升序:{_AGENT_STEP_BUCKETS}"
|
||||
)
|
||||
387
apps/api/tests/test_ai_router_diagnose_fallback.py
Normal file
387
apps/api/tests/test_ai_router_diagnose_fallback.py
Normal file
@@ -0,0 +1,387 @@
|
||||
# apps/api/tests/test_ai_router_diagnose_fallback.py
|
||||
# 2026-04-27 Claude Sonnet 4.6: A2 INC-20260425 — DIAGNOSE fallback chain 移除 Ollama
|
||||
"""
|
||||
DIAGNOSE Fallback Chain 測試 (A2 INC-20260425)
|
||||
===============================================
|
||||
驗收標準:
|
||||
1. DIAGNOSE intent,NEMO 失敗 → 跳 Gemini(不跳 Ollama)
|
||||
2. Gemini 失敗 → 跳 Claude
|
||||
3. 全失敗 → graceful 降級(不再去 Ollama)
|
||||
4. 其他 intent(如 RESTART)的 fallback 行為不變(Ollama 仍在鏈中)
|
||||
5. aiops_diagnose_fallback_total metric 可正常累計
|
||||
|
||||
測試分類:unit(mock provider / registry,無 Redis / DB / K8s 依賴)
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import os
|
||||
|
||||
os.environ.setdefault("MOCK_MODE", "true")
|
||||
|
||||
from unittest.mock import AsyncMock, MagicMock, patch
|
||||
|
||||
import pytest
|
||||
|
||||
from src.services.ai_router import (
|
||||
AIProviderEnum,
|
||||
AIRouter,
|
||||
AIRouterExecutor,
|
||||
AIProviderRegistry,
|
||||
reset_ai_router,
|
||||
)
|
||||
from src.services.intent_classifier import IntentType
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# Fixtures
|
||||
# =============================================================================
|
||||
|
||||
|
||||
@pytest.fixture(autouse=True)
|
||||
def reset_router():
|
||||
"""每個測試前後重置 singleton,避免 mock 殘留"""
|
||||
yield
|
||||
reset_ai_router()
|
||||
|
||||
|
||||
def _make_router() -> AIRouter:
|
||||
"""建立 AIRouter(mock failover_manager 避免 Redis 依賴)"""
|
||||
router = AIRouter()
|
||||
mock_fm = MagicMock()
|
||||
mock_fm.select_provider = AsyncMock(side_effect=RuntimeError("not needed"))
|
||||
router._failover_manager = mock_fm
|
||||
return router
|
||||
|
||||
|
||||
def _make_registry_with_providers(
|
||||
*,
|
||||
nemo_success: bool = True,
|
||||
gemini_success: bool = True,
|
||||
claude_success: bool = True,
|
||||
) -> AIProviderRegistry:
|
||||
"""建立只含 openclaw_nemo / gemini / claude 三個 provider 的 registry(無 Ollama)"""
|
||||
from src.services.ai_providers.interfaces import AIResult
|
||||
|
||||
registry = AIProviderRegistry()
|
||||
|
||||
def _make_provider(name: str, privacy: str, success: bool, response: str = "") -> MagicMock:
|
||||
p = MagicMock()
|
||||
p.name = name
|
||||
p.privacy_level = privacy
|
||||
p.is_enabled = True
|
||||
p.capabilities = {"rca", "chat"}
|
||||
p.analyze = AsyncMock(
|
||||
return_value=AIResult(
|
||||
raw_response=response or f"{name}_response",
|
||||
success=success,
|
||||
provider=name,
|
||||
error="" if success else f"{name}_timeout",
|
||||
)
|
||||
)
|
||||
p.health_check = AsyncMock(return_value=success)
|
||||
return p
|
||||
|
||||
registry._providers = {
|
||||
"openclaw_nemo": _make_provider("openclaw_nemo", "cloud", nemo_success),
|
||||
"gemini": _make_provider("gemini", "cloud", gemini_success),
|
||||
"claude": _make_provider("claude", "cloud", claude_success, "claude_diagnosis_result"),
|
||||
}
|
||||
return registry
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# Test 1: _diagnose_fallback_chain 屬性存在且不含 Ollama
|
||||
# =============================================================================
|
||||
|
||||
|
||||
def test_diagnose_fallback_chain_no_ollama():
|
||||
"""_diagnose_fallback_chain 應存在,且不含任何 OLLAMA variant"""
|
||||
router = _make_router()
|
||||
|
||||
assert hasattr(router, "_diagnose_fallback_chain"), (
|
||||
"_diagnose_fallback_chain 屬性不存在"
|
||||
)
|
||||
|
||||
providers_in_chain = [p for p, _ in router._diagnose_fallback_chain]
|
||||
assert AIProviderEnum.OLLAMA not in providers_in_chain, (
|
||||
f"OLLAMA 不應出現在 _diagnose_fallback_chain: {providers_in_chain}"
|
||||
)
|
||||
assert AIProviderEnum.OLLAMA_188 not in providers_in_chain, (
|
||||
f"OLLAMA_188 不應出現在 _diagnose_fallback_chain: {providers_in_chain}"
|
||||
)
|
||||
|
||||
|
||||
def test_diagnose_fallback_chain_contains_cloud_providers():
|
||||
"""_diagnose_fallback_chain 應含 OPENCLAW_NEMO, GEMINI, CLAUDE"""
|
||||
router = _make_router()
|
||||
|
||||
providers_in_chain = [p for p, _ in router._diagnose_fallback_chain]
|
||||
assert AIProviderEnum.OPENCLAW_NEMO in providers_in_chain
|
||||
assert AIProviderEnum.GEMINI in providers_in_chain
|
||||
assert AIProviderEnum.CLAUDE in providers_in_chain
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# Test 2: DIAGNOSE route() 的 fallback_chain 不含 Ollama
|
||||
# =============================================================================
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_diagnose_route_fallback_chain_excludes_ollama():
|
||||
"""DIAGNOSE intent route() 回傳的 fallback_chain 不含 OLLAMA"""
|
||||
router = _make_router()
|
||||
|
||||
decision = await router.route(
|
||||
"pod crash loop detected",
|
||||
context={"intent_hint": "diagnose"},
|
||||
)
|
||||
|
||||
assert decision.selected_provider == AIProviderEnum.OPENCLAW_NEMO, (
|
||||
f"primary 應為 OPENCLAW_NEMO,實際: {decision.selected_provider}"
|
||||
)
|
||||
|
||||
fb_providers = [p for p, _ in decision.fallback_chain]
|
||||
assert AIProviderEnum.OLLAMA not in fb_providers, (
|
||||
f"OLLAMA 不應在 DIAGNOSE fallback_chain: {fb_providers}"
|
||||
)
|
||||
assert AIProviderEnum.OLLAMA_188 not in fb_providers, (
|
||||
f"OLLAMA_188 不應在 DIAGNOSE fallback_chain: {fb_providers}"
|
||||
)
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_diagnose_route_sync_fallback_chain_excludes_ollama():
|
||||
"""DIAGNOSE intent route_sync() 回傳的 fallback_chain 同樣不含 OLLAMA"""
|
||||
router = _make_router()
|
||||
|
||||
decision = router.route_sync(
|
||||
"pod crash loop detected",
|
||||
context={"intent_hint": "diagnose"},
|
||||
)
|
||||
|
||||
fb_providers = [p for p, _ in decision.fallback_chain]
|
||||
assert AIProviderEnum.OLLAMA not in fb_providers, (
|
||||
f"OLLAMA 不應在 DIAGNOSE route_sync fallback_chain: {fb_providers}"
|
||||
)
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# Test 3: DIAGNOSE NEMO 失敗 → fallback 到 Gemini(不是 Ollama)
|
||||
# =============================================================================
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_diagnose_nemo_fail_fallback_to_gemini_not_ollama():
|
||||
"""DIAGNOSE: NEMO 失敗 → executor 嘗試 Gemini,不嘗試 Ollama"""
|
||||
registry = _make_registry_with_providers(
|
||||
nemo_success=False,
|
||||
gemini_success=True,
|
||||
)
|
||||
executor = AIRouterExecutor(registry)
|
||||
|
||||
with patch("src.services.ai_router._settings") as mock_settings:
|
||||
mock_settings.MOCK_MODE = False
|
||||
result = await executor.execute(
|
||||
prompt="RCA: pod OOMKilled",
|
||||
provider_order=["openclaw_nemo", "gemini", "claude"],
|
||||
context={"intent_hint": "diagnose"},
|
||||
)
|
||||
|
||||
assert result.success is True
|
||||
assert result.provider == "gemini", (
|
||||
f"應 fallback 到 gemini,實際: {result.provider}"
|
||||
)
|
||||
# 驗證 Ollama 根本不在 provider_order(確保沒被加進去)
|
||||
ollama_provider = registry._providers.get("ollama")
|
||||
assert ollama_provider is None, "registry 不應含 ollama provider(DIAGNOSE 路徑)"
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# Test 4: DIAGNOSE Gemini 失敗 → fallback 到 Claude
|
||||
# =============================================================================
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_diagnose_gemini_fail_fallback_to_claude():
|
||||
"""DIAGNOSE: NEMO 失敗 + Gemini 失敗 → executor 嘗試 Claude"""
|
||||
registry = _make_registry_with_providers(
|
||||
nemo_success=False,
|
||||
gemini_success=False,
|
||||
claude_success=True,
|
||||
)
|
||||
executor = AIRouterExecutor(registry)
|
||||
|
||||
with patch("src.services.ai_router._settings") as mock_settings:
|
||||
mock_settings.MOCK_MODE = False
|
||||
result = await executor.execute(
|
||||
prompt="RCA: pod crash",
|
||||
provider_order=["openclaw_nemo", "gemini", "claude"],
|
||||
context={"intent_hint": "diagnose"},
|
||||
)
|
||||
|
||||
assert result.success is True
|
||||
assert result.provider == "claude", (
|
||||
f"應 fallback 到 claude,實際: {result.provider}"
|
||||
)
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# Test 5: DIAGNOSE 全失敗 → graceful 降級(不去 Ollama)
|
||||
# =============================================================================
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_diagnose_all_fail_graceful_no_ollama():
|
||||
"""DIAGNOSE: NEMO + Gemini + Claude 全失敗 → graceful error,不嘗試 Ollama"""
|
||||
registry = _make_registry_with_providers(
|
||||
nemo_success=False,
|
||||
gemini_success=False,
|
||||
claude_success=False,
|
||||
)
|
||||
executor = AIRouterExecutor(registry)
|
||||
|
||||
with patch("src.services.ai_router._settings") as mock_settings:
|
||||
mock_settings.MOCK_MODE = False
|
||||
result = await executor.execute(
|
||||
prompt="RCA: cascading failure",
|
||||
provider_order=["openclaw_nemo", "gemini", "claude"],
|
||||
context={"intent_hint": "diagnose"},
|
||||
)
|
||||
|
||||
# 全失敗應回傳 success=False(graceful 降級,不 raise)
|
||||
assert result.success is False
|
||||
assert result.provider == "none"
|
||||
# 確認沒有嘗試 Ollama(registry 裡根本沒有 ollama)
|
||||
assert "ollama" not in registry._providers
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# Test 6: 其他 intent(RESTART)的 fallback 行為不變(Ollama 仍在鏈中)
|
||||
# =============================================================================
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_restart_intent_still_has_ollama_in_fallback():
|
||||
"""RESTART intent 的 fallback_chain 應仍包含 OLLAMA(行為不變)"""
|
||||
router = _make_router()
|
||||
|
||||
# RESTART → None(複雜度路由),低複雜度 → OLLAMA primary
|
||||
# 使用 context_hint 直接指定,避免 LLM 分類
|
||||
decision = await router.route(
|
||||
"restart the api service",
|
||||
context={"intent_hint": "restart"},
|
||||
)
|
||||
|
||||
# RESTART intent 不受 A2 影響,_full_fallback_chain 仍含 OLLAMA
|
||||
all_providers_in_decision = [decision.selected_provider] + [
|
||||
p for p, _ in decision.fallback_chain
|
||||
]
|
||||
assert AIProviderEnum.OLLAMA in all_providers_in_decision, (
|
||||
f"RESTART 路徑應仍含 OLLAMA(行為不變),實際: {all_providers_in_decision}"
|
||||
)
|
||||
|
||||
|
||||
def test_build_fallback_chain_for_intent_diagnose_no_ollama():
|
||||
"""_build_fallback_chain_for_intent(DIAGNOSE) 回傳結果不含 OLLAMA"""
|
||||
router = _make_router()
|
||||
|
||||
chain = router._build_fallback_chain_for_intent(
|
||||
AIProviderEnum.OPENCLAW_NEMO,
|
||||
IntentType.DIAGNOSE,
|
||||
)
|
||||
providers = [p for p, _ in chain]
|
||||
|
||||
assert AIProviderEnum.OLLAMA not in providers
|
||||
assert AIProviderEnum.OLLAMA_188 not in providers
|
||||
# primary 已排除,chain 剩 GEMINI + CLAUDE
|
||||
assert AIProviderEnum.GEMINI in providers
|
||||
assert AIProviderEnum.CLAUDE in providers
|
||||
assert AIProviderEnum.OPENCLAW_NEMO not in providers # primary 排除
|
||||
|
||||
|
||||
def test_build_fallback_chain_for_intent_restart_has_ollama():
|
||||
"""_build_fallback_chain_for_intent(RESTART) 回傳結果仍含 OLLAMA"""
|
||||
router = _make_router()
|
||||
|
||||
chain = router._build_fallback_chain_for_intent(
|
||||
AIProviderEnum.OPENCLAW_NEMO,
|
||||
IntentType.RESTART,
|
||||
)
|
||||
providers = [p for p, _ in chain]
|
||||
|
||||
assert AIProviderEnum.OLLAMA in providers, (
|
||||
f"RESTART fallback 應含 OLLAMA,實際: {providers}"
|
||||
)
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# Test 7: aiops_diagnose_fallback_total metric 正常累計
|
||||
# =============================================================================
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_diagnose_fallback_metric_incremented():
|
||||
"""DIAGNOSE NEMO 失敗 → fallback Gemini 時,aiops_diagnose_fallback_total metric 被記錄"""
|
||||
registry = _make_registry_with_providers(
|
||||
nemo_success=False,
|
||||
gemini_success=True,
|
||||
)
|
||||
executor = AIRouterExecutor(registry)
|
||||
|
||||
with patch("src.services.ai_router._settings") as mock_settings:
|
||||
mock_settings.MOCK_MODE = False
|
||||
with patch("src.core.metrics.record_diagnose_fallback") as mock_metric:
|
||||
await executor.execute(
|
||||
prompt="RCA: high error rate",
|
||||
provider_order=["openclaw_nemo", "gemini", "claude"],
|
||||
context={"intent_hint": "diagnose"},
|
||||
)
|
||||
|
||||
# fallback from openclaw_nemo → gemini 應被記錄一次
|
||||
mock_metric.assert_called_once_with(
|
||||
from_provider="openclaw_nemo",
|
||||
to_provider="gemini",
|
||||
)
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_non_diagnose_intent_no_fallback_metric():
|
||||
"""非 DIAGNOSE intent 的 fallback 不應觸發 aiops_diagnose_fallback_total"""
|
||||
from src.services.ai_providers.interfaces import AIResult
|
||||
|
||||
registry = AIProviderRegistry()
|
||||
|
||||
# ollama 失敗
|
||||
mock_ollama = MagicMock()
|
||||
mock_ollama.name = "ollama"
|
||||
mock_ollama.privacy_level = "local"
|
||||
mock_ollama.is_enabled = True
|
||||
mock_ollama.capabilities = {"chat"}
|
||||
mock_ollama.analyze = AsyncMock(
|
||||
return_value=AIResult(raw_response="", success=False, provider="ollama", error="timeout")
|
||||
)
|
||||
|
||||
# gemini 成功
|
||||
mock_gemini = MagicMock()
|
||||
mock_gemini.name = "gemini"
|
||||
mock_gemini.privacy_level = "cloud"
|
||||
mock_gemini.is_enabled = True
|
||||
mock_gemini.analyze = AsyncMock(
|
||||
return_value=AIResult(raw_response="ok", success=True, provider="gemini")
|
||||
)
|
||||
|
||||
registry._providers = {"ollama": mock_ollama, "gemini": mock_gemini}
|
||||
executor = AIRouterExecutor(registry)
|
||||
|
||||
with patch("src.services.ai_router._settings") as mock_settings:
|
||||
mock_settings.MOCK_MODE = False
|
||||
with patch("src.core.metrics.record_diagnose_fallback") as mock_metric:
|
||||
await executor.execute(
|
||||
prompt="restart service",
|
||||
provider_order=["ollama", "gemini"],
|
||||
context={"intent_hint": "restart"}, # 非 DIAGNOSE
|
||||
)
|
||||
|
||||
# 非 DIAGNOSE intent → metric 不應被呼叫
|
||||
mock_metric.assert_not_called()
|
||||
350
docs/runbooks/RUNBOOK-AGENT-STEP-LATENCY.md
Normal file
350
docs/runbooks/RUNBOOK-AGENT-STEP-LATENCY.md
Normal file
@@ -0,0 +1,350 @@
|
||||
# RUNBOOK-AGENT-STEP-LATENCY.md
|
||||
# Agent Step Latency — 診斷與處置 Runbook
|
||||
# 2026-04-27 Claude Sonnet 4.6: A3 — Agent step latency observability (config A+B Wave 1)
|
||||
# 對應告警規則: ops/monitoring/grafana/agent_step_latency_rules.yaml
|
||||
|
||||
---
|
||||
|
||||
## 快速索引
|
||||
|
||||
| 告警 | 嚴重度 | 章節 |
|
||||
|------|--------|------|
|
||||
| `AgentStepLatencyHigh` | warning | [#agentsteplatencyhigh](#agentsteplatencyhigh) |
|
||||
| `AgentStepTimeoutSpike` | critical | [#agentsteoptimeoutspike](#agentsteoptimeoutspike) |
|
||||
| `DiagnoseFallbackToCloud` | warning | [#diagnosefallbacktocloud](#diagnosefallbacktocloud) |
|
||||
|
||||
---
|
||||
|
||||
## 系統背景
|
||||
|
||||
AWOOOI Phase 2 多 Agent 協作(ADR-082)由三個 agent 串行處理每個 Incident:
|
||||
|
||||
```
|
||||
Incident 接收
|
||||
│
|
||||
├─ [diagnostician] 診斷分析 — 主力 Provider: openclaw_nemo (NIM @ 192.168.0.188:8088)
|
||||
│ timeout: 30s(A1 拆分後)
|
||||
│
|
||||
├─ [solver] 決策建議 — 主力 Provider: openclaw_nemo
|
||||
│ timeout: 20s
|
||||
│
|
||||
└─ [critic] 方案審核 — 主力 Provider: openclaw_nemo
|
||||
timeout: 15s
|
||||
```
|
||||
|
||||
NIM(NVIDIA Inference Microservice)實測延遲:**2-27s**,平均 ~10.6s。
|
||||
尾巴 latency 若命中 timeout → agent 輸出 confidence=20%(degraded)→ 飛輪失能。
|
||||
|
||||
**根因案例**:INC-20260425-8D17BB / INC-20260425-3B6C39
|
||||
共用 `PHASE2_STEP_TIMEOUT_SEC=20.0s`,NIM 27s 尾巴 latency 命中 timeout,全部 Incident 落入「待分析」。
|
||||
|
||||
---
|
||||
|
||||
## 根因鏈
|
||||
|
||||
```mermaid
|
||||
flowchart TD
|
||||
A[NIM GPU 高負載 / 網路抖動] --> B[step latency 尾巴 > 20-30s]
|
||||
B --> C[AgentStepLatencyHigh p75 > 25s]
|
||||
B --> D[agent timeout → confidence=20%]
|
||||
D --> E[AgentStepTimeoutSpike > 3/min]
|
||||
D --> F[DIAGNOSE fallback chain 啟動]
|
||||
F --> G[openclaw_nemo → gemini]
|
||||
G --> H[DiagnoseFallbackToCloud > 5/min]
|
||||
H --> I[Gemini 每日配額快速耗盡]
|
||||
I --> J[gemini → claude 第二段 fallback]
|
||||
J --> K[費用急升 / 飛輪完全失能]
|
||||
|
||||
style A fill:#ff6b6b,color:#fff
|
||||
style K fill:#ff6b6b,color:#fff
|
||||
style C fill:#ffd93d,color:#000
|
||||
style E fill:#ff6b6b,color:#fff
|
||||
style H fill:#ffd93d,color:#000
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## AgentStepLatencyHigh
|
||||
|
||||
### 告警含義
|
||||
|
||||
| 欄位 | 值 |
|
||||
|------|----|
|
||||
| 觸發條件 | 任意 agent 的 p75 step latency > 25s,持續 10 分鐘 |
|
||||
| 嚴重度 | warning |
|
||||
| 代表什麼 | NIM 推理尾巴偏慢,尚未 timeout,但趨勢惡化 |
|
||||
| 不代表什麼 | agent 還在正常運作,只是變慢 |
|
||||
|
||||
### 立即診斷(3 步)
|
||||
|
||||
**步驟 1:看哪個 agent 卡**
|
||||
|
||||
在 Grafana 查 (Prometheus 地址: `http://192.168.0.110:9090`):
|
||||
|
||||
```promql
|
||||
histogram_quantile(
|
||||
0.75,
|
||||
sum by (agent, le) (
|
||||
rate(aiops_agent_step_duration_seconds_bucket[5m])
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
看各 agent 的 p75 值,確認哪個 agent 最慢。
|
||||
|
||||
也可看 p95 了解最壞情況:
|
||||
|
||||
```promql
|
||||
histogram_quantile(
|
||||
0.95,
|
||||
sum by (agent, le) (
|
||||
rate(aiops_agent_step_duration_seconds_bucket[5m])
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
**步驟 2:看 NIM 健康度**
|
||||
|
||||
```bash
|
||||
# 確認 NIM API 回應時間
|
||||
curl -o /dev/null -s -w "%{time_total}s\n" http://192.168.0.188:8088/v1/models
|
||||
|
||||
# 確認 GPU 狀態
|
||||
ssh wooo@192.168.0.188 'nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv,noheader'
|
||||
```
|
||||
|
||||
| GPU 指標 | 正常 | 警戒 |
|
||||
|----------|------|------|
|
||||
| GPU 使用率 | < 80% | > 90% |
|
||||
| GPU 記憶體 | < 10GB | > 11GB |
|
||||
| GPU 溫度 | < 80°C | > 85°C |
|
||||
|
||||
**步驟 3:看 provider 路由分布**
|
||||
|
||||
```promql
|
||||
# 過去 5 分鐘各 provider 的呼叫比例
|
||||
sum by (provider) (rate(ai_router_selected_provider_total[5m]))
|
||||
```
|
||||
|
||||
若 `openclaw_nemo` 佔比大幅下降,代表 fallback 已在發生。
|
||||
|
||||
### 處置動作(3 步)
|
||||
|
||||
**動作 1:暫時調高 timeout 環境變數(緩解)**
|
||||
|
||||
若 NIM 只是暫時慢(< 1 小時),可調高 timeout 讓 agent 等更久:
|
||||
|
||||
```bash
|
||||
# 確認目前設定
|
||||
kubectl get deployment api -n awoooi-prod -o jsonpath='{.spec.template.spec.containers[0].env}' | python3 -m json.tool | grep -i timeout
|
||||
|
||||
# 暫時調高(需統帥授權才能改 ConfigMap)
|
||||
# 正常路徑:修改 k8s/awoooi-prod/04-configmap.yaml 中的 PHASE2_STEP_TIMEOUT_SEC
|
||||
# 緊急路徑:kubectl set env deployment/api -n awoooi-prod PHASE2_STEP_TIMEOUT_SEC=35
|
||||
```
|
||||
|
||||
> ⚠️ 調高 timeout 不是根治,只是緩解。timeout 太長會讓 Incident 積壓。
|
||||
|
||||
**動作 2:確認 NIM 是否需要重啟**
|
||||
|
||||
```bash
|
||||
# 檢查 NIM 服務狀態
|
||||
ssh wooo@192.168.0.188 'docker ps --filter name=nim --format "{{.Status}}"'
|
||||
|
||||
# 若 NIM 容器 unhealthy,重啟(須確認無其他 session 在使用)
|
||||
ssh wooo@192.168.0.188 'docker restart <nim_container_name>'
|
||||
|
||||
# 重啟後驗證
|
||||
curl -s http://192.168.0.188:8088/v1/models | python3 -m json.tool
|
||||
```
|
||||
|
||||
**動作 3:升級統帥**
|
||||
|
||||
若 NIM 重啟無效且 latency 仍高,Telegram 通知統帥:
|
||||
- NIM 位置:192.168.0.188:8088
|
||||
- 目前 p75:[從 Grafana 截圖]
|
||||
- 影響範圍:diagnostician/solver/critic 全部變慢
|
||||
- 建議:調查 GPU 硬體問題或考慮暫時切主力 provider 至 Gemini
|
||||
|
||||
---
|
||||
|
||||
## AgentStepTimeoutSpike
|
||||
|
||||
### 告警含義
|
||||
|
||||
| 欄位 | 值 |
|
||||
|------|----|
|
||||
| 觸發條件 | 任意 agent timeout 速率 > 3/min,持續 5 分鐘 |
|
||||
| 嚴重度 | critical |
|
||||
| 代表什麼 | agent 已進入 degraded 模式(confidence=20%),飛輪失能 |
|
||||
| 影響 | Incident 全部落入「待分析」,自動修復停擺 |
|
||||
|
||||
### 立即診斷(3 步)
|
||||
|
||||
**步驟 1:確認哪個 agent 在 timeout**
|
||||
|
||||
```promql
|
||||
# 各 agent timeout 速率(/min)
|
||||
sum by (agent) (
|
||||
rate(aiops_agent_step_duration_seconds_count{outcome="timeout"}[5m])
|
||||
) * 60
|
||||
```
|
||||
|
||||
**步驟 2:確認 NIM 是否已觸發 fallback**
|
||||
|
||||
```promql
|
||||
# NEMO → Gemini fallback 速率(/min)
|
||||
rate(
|
||||
aiops_diagnose_fallback_total{from_provider="openclaw_nemo", to_provider="gemini"}[5m]
|
||||
) * 60
|
||||
```
|
||||
|
||||
若 fallback > 0,代表 AI Router 已在切換,Gemini 正在被消耗。
|
||||
|
||||
**步驟 3:確認飛輪狀態**
|
||||
|
||||
```bash
|
||||
# 確認積壓的 Incident
|
||||
curl -s http://192.168.0.188:8088/api/v1/incidents?status=investigating | python3 -m json.tool | grep -c '"id"'
|
||||
|
||||
# 查 API log 確認 degraded 訊號
|
||||
kubectl logs -n awoooi-prod deploy/api --tail=50 | grep -E "degraded|confidence=0.2|timeout"
|
||||
```
|
||||
|
||||
### 處置動作(3 步)
|
||||
|
||||
**動作 1:立即確認 NIM 是否可用**
|
||||
|
||||
```bash
|
||||
time curl -s -X POST http://192.168.0.188:8088/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"model":"meta/llama-3.1-8b-instruct","messages":[{"role":"user","content":"ping"}],"max_tokens":10}'
|
||||
```
|
||||
|
||||
若回應 > 30s 或 timeout:NIM 已無法使用。進行動作 2。
|
||||
|
||||
**動作 2:強制切換主力 Provider(緊急)**
|
||||
|
||||
AI Router 的 DIAGNOSE intent 已有 fallback chain(NEMO→GEMINI→CLAUDE)。
|
||||
若需要完全跳過 NIM 嘗試,可暫時覆寫 intent override:
|
||||
|
||||
```bash
|
||||
# 緊急設定:跳過 NEMO,直接走 GEMINI(需統帥授權)
|
||||
kubectl set env deployment/api -n awoooi-prod \
|
||||
DIAGNOSE_FORCE_PROVIDER=gemini \
|
||||
-n awoooi-prod
|
||||
```
|
||||
|
||||
> ⚠️ 此環境變數需 A2 代碼支援,確認 ai_router.py 是否有讀取此設定。
|
||||
> 若無,請人工協調 SRE 直接修改 ConfigMap。
|
||||
|
||||
**動作 3:升級統帥(必須)**
|
||||
|
||||
critical 告警必須在 15 分鐘內升級:
|
||||
- 告警觸發時間:[從 Telegram 告警中取得]
|
||||
- 受影響 agent:[從步驟 1 的 Grafana 截圖]
|
||||
- 積壓 Incident 數:[從步驟 3 取得]
|
||||
- NIM 狀態:[可用/timeout/離線]
|
||||
|
||||
---
|
||||
|
||||
## DiagnoseFallbackToCloud
|
||||
|
||||
### 告警含義
|
||||
|
||||
| 欄位 | 值 |
|
||||
|------|----|
|
||||
| 觸發條件 | NEMO→Gemini fallback > 5/min,持續 5 分鐘 |
|
||||
| 嚴重度 | warning |
|
||||
| 代表什麼 | NIM 無法服務,Gemini 正在消耗每日配額 |
|
||||
| 風險 | Gemini 配額耗盡後 fallback 到 Claude(費用更高),或完全失敗 |
|
||||
|
||||
### 立即診斷(3 步)
|
||||
|
||||
**步驟 1:查目前 fallback 速率與累積數**
|
||||
|
||||
```promql
|
||||
# 當前 fallback 速率(/min)
|
||||
rate(
|
||||
aiops_diagnose_fallback_total{from_provider="openclaw_nemo", to_provider="gemini"}[5m]
|
||||
) * 60
|
||||
|
||||
# 今日累積 fallback 次數(Counter 累積值)
|
||||
aiops_diagnose_fallback_total{from_provider="openclaw_nemo", to_provider="gemini"}
|
||||
```
|
||||
|
||||
**步驟 2:查 NIM 健康度**
|
||||
|
||||
同 `AgentStepLatencyHigh` 步驟 2(`nvidia-smi` + NIM API 回應測試)。
|
||||
|
||||
**步驟 3:確認 Gemini 今日配額剩餘**
|
||||
|
||||
```promql
|
||||
# Gemini 配額使用率
|
||||
gemini_daily_call_count / gemini_daily_quota
|
||||
```
|
||||
|
||||
若比率 > 0.8:`GeminiQuotaApproaching` 告警即將觸發,需立即決定是否限流。
|
||||
|
||||
### 處置動作(3 步)
|
||||
|
||||
**動作 1:確認 fallback 是否必要(NIM 真的掛了?)**
|
||||
|
||||
```bash
|
||||
# 快速健康測試
|
||||
curl -s --max-time 5 http://192.168.0.188:8088/v1/models
|
||||
```
|
||||
|
||||
若 NIM 可用(回應 < 5s):可能是短暫抖動,觀察 5 分鐘是否自動恢復。
|
||||
若 NIM 不可用:fallback 是正確行為,轉入動作 2 管理配額。
|
||||
|
||||
**動作 2:估算 Gemini 配額剩餘時間**
|
||||
|
||||
```
|
||||
剩餘配額 = gemini_daily_quota - gemini_daily_call_count(從 Prometheus 讀取)
|
||||
當前消耗速率 = 從 DiagnoseFallbackToCloud 告警值(/min)
|
||||
預估耗盡時間 = 剩餘配額 / 當前速率(分鐘)
|
||||
```
|
||||
|
||||
若預估耗盡時間 < 2 小時:升級統帥,考慮暫停部分 Incident 自動處理。
|
||||
|
||||
**動作 3:升級統帥(若配額不足)**
|
||||
|
||||
提供以下資訊:
|
||||
- Gemini 配額使用率:[從 Prometheus 讀取]
|
||||
- 預估耗盡時間:[計算結果]
|
||||
- NIM 狀態:[離線/慢速/已恢復]
|
||||
- 建議:修復 NIM 或增加 Gemini 配額
|
||||
|
||||
---
|
||||
|
||||
## 根因案例參照
|
||||
|
||||
### INC-20260425-8D17BB
|
||||
|
||||
**症狀**:Diagnostician 信心持續降至 20%,所有 Incident 輸出「待分析」。
|
||||
**根因**:`PHASE2_STEP_TIMEOUT_SEC=20s` 共用三段 agent,NIM 尾巴 latency 27s 命中 timeout。
|
||||
**修復**:A1 拆分三段獨立 timeout(diagnostician=30s / solver=20s / critic=15s)。
|
||||
**教訓**:不同 agent 的工作量不同,不應共用同一個 timeout 值。
|
||||
|
||||
### INC-20260425-3B6C39
|
||||
|
||||
**症狀**:DIAGNOSE fallback 到 Ollama(CPU-only),造成 238s 二次 timeout。
|
||||
**根因**:AI Router fallback chain 含 Ollama,Ollama CPU 推理 238s 完全不可用。
|
||||
**修復**:A2 將 Ollama 永久移出 `_diagnose_fallback_chain`(NEMO→GEMINI→CLAUDE)。
|
||||
**教訓**:fallback chain 的每個節點必須有實測延遲數據,不能假設可用。
|
||||
|
||||
---
|
||||
|
||||
## 待校準項目
|
||||
|
||||
> 以下門檻在 A1/A2 上線後,應根據實際 metric 重新校準。
|
||||
> 目前使用 INC-20260425 實測數據(NIM 2-27s)估算。
|
||||
|
||||
| 告警 | 目前門檻 | 校準方式 |
|
||||
|------|----------|----------|
|
||||
| `AgentStepLatencyHigh` | p75 > 25s | 觀察 7d p75 基線,設為基線 × 1.5 |
|
||||
| `AgentStepTimeoutSpike` | > 3/min | 觀察正常日的 timeout 率,應接近 0 |
|
||||
| `DiagnoseFallbackToCloud` | > 5/min | 觀察 NIM 重啟時的 spike 幅度,設為 spike 的 30% |
|
||||
|
||||
校準時機:A1+A2 上線後,收集 **7 天**的 metric 數據再調整。
|
||||
160
ops/monitoring/grafana/agent_step_latency_rules.yaml
Normal file
160
ops/monitoring/grafana/agent_step_latency_rules.yaml
Normal file
@@ -0,0 +1,160 @@
|
||||
# ops/monitoring/grafana/agent_step_latency_rules.yaml
|
||||
# AWOOOI Agent Step Latency 告警規則
|
||||
# 2026-04-27 Claude Sonnet 4.6: A3 — Agent step latency observability (config A+B Wave 1)
|
||||
#
|
||||
# 部署目標:與 alerts-unified.yml 一起部署到 192.168.0.110:/home/wooo/monitoring/alerts.yml
|
||||
# 部署方式:手動合併至 alerts-unified.yml,或 SRE 確認 Prometheus --rule-files glob 支援多檔後直接引用
|
||||
#
|
||||
# 依賴 Metrics(均由 A1/A2 提供,需 A1+A2 上線後才能 ACTIVE):
|
||||
# - aiops_agent_step_duration_seconds{agent,outcome} — Histogram, A1 (apps/api/src/observability/agent_step_metrics.py)
|
||||
# agent ∈ {diagnostician, solver, critic}
|
||||
# outcome ∈ {success, timeout, error}
|
||||
# buckets: [0.5, 1.0, 2.0, 5.0, 10.0, 15.0, 20.0, 30.0, 45.0, 60.0]
|
||||
# - aiops_diagnose_fallback_total{from_provider,to_provider} — Counter, A2 (apps/api/src/services/ai_router.py)
|
||||
# from_provider ∈ {openclaw_nemo, gemini, claude}
|
||||
# to_provider ∈ {gemini, claude}
|
||||
#
|
||||
# 標籤規範(對齊 alerts-unified.yml):
|
||||
# layer: k8s (API 跑在 k8s awoooi-prod namespace)
|
||||
# team: ai
|
||||
# auto_repair: "false" (NIM 延遲需人工判斷,不啟動 auto_repair)
|
||||
#
|
||||
# 背景:INC-20260425-8D17BB / INC-20260425-3B6C39
|
||||
# NIM (192.168.0.188:8088) 實測延遲 2-27s,尾巴 latency 命中 PHASE2_STEP_TIMEOUT_SEC=20s
|
||||
# → confidence 降至 20%(degraded),diagnostician/solver/critic 全部失效
|
||||
# → fallback 到 Gemini,消耗每日配額
|
||||
# 本規則組提供全路徑可觀測性:latency high → timeout spike → fallback
|
||||
#
|
||||
# ⚠️ 啟用前必須確認:
|
||||
# [A1] aiops_agent_step_duration_seconds 已暴露(A1 merge 且 Pod restart 後)
|
||||
# [A2] aiops_diagnose_fallback_total 已暴露(A2 merge 且 Pod restart 後)
|
||||
# 驗證方式:curl http://192.168.0.188:8088/metrics | grep aiops_agent_step
|
||||
# =============================================================================
|
||||
|
||||
groups:
|
||||
|
||||
# ===========================================================================
|
||||
# Agent Step Latency (agent_step_latency)
|
||||
# 監控三段 Phase 2 Agent(diagnostician/solver/critic)呼叫 LLM 的延遲與失敗
|
||||
# ===========================================================================
|
||||
- name: agent_step_latency
|
||||
interval: 60s
|
||||
rules:
|
||||
|
||||
# -------------------------------------------------------------------------
|
||||
# [ACTIVE — 需 A1 上線]
|
||||
# AgentStepLatencyHigh — p75 延遲持續高水位
|
||||
#
|
||||
# 觸發條件:任意一個 agent 的 p75 step latency 超過 25s 且持續 10 分鐘
|
||||
# 意義:NIM 推理尾巴 latency 偏高,有命中 timeout 的風險
|
||||
# 嚴重度:warning(尚未 timeout,但趨勢不好,SRE 應預先介入)
|
||||
#
|
||||
# PromQL 設計說明:
|
||||
# - rate([5m]) 計算 5 分鐘滑動視窗的 bucket 增量,消除計數器重置問題
|
||||
# - histogram_quantile(0.75, ...) 計算當前視窗的 75th 百分位
|
||||
# - sum by (agent, le) 保留 agent 維度,使每個 agent 獨立觸發
|
||||
# - > 25 對齊 INC-20260425 實測尾巴 latency(最高 27s),低於最寬 timeout 30s
|
||||
# - for: 10m 避免短暫尖刺誤報(10 分鐘連續高水位才通知)
|
||||
# -------------------------------------------------------------------------
|
||||
- alert: AgentStepLatencyHigh
|
||||
expr: |
|
||||
histogram_quantile(
|
||||
0.75,
|
||||
sum by (agent, le) (
|
||||
rate(aiops_agent_step_duration_seconds_bucket[5m])
|
||||
)
|
||||
) > 25
|
||||
for: 10m
|
||||
labels:
|
||||
severity: warning
|
||||
layer: k8s
|
||||
team: ai
|
||||
auto_repair: "false"
|
||||
alert_category: "agent_step_latency"
|
||||
annotations:
|
||||
summary: "Agent {{ $labels.agent }} p75 step latency > 25s(持續 10m)"
|
||||
description: |
|
||||
Phase 2 agent {{ $labels.agent }} 過去 5 分鐘 LLM call 的 p75 延遲為
|
||||
{{ $value | humanizeDuration }},超過 25s 門檻持續 10 分鐘。
|
||||
NIM 實測尾巴可達 27s;若持續惡化將命中 timeout 並觸發 Gemini fallback。
|
||||
建議確認 NIM 健康度(192.168.0.188:8088)及 GPU 負載。
|
||||
runbook_url: "docs/runbooks/RUNBOOK-AGENT-STEP-LATENCY.md#agentsteplatencyhigh"
|
||||
|
||||
# -------------------------------------------------------------------------
|
||||
# [ACTIVE — 需 A1 上線]
|
||||
# AgentStepTimeoutSpike — timeout 事件頻率爆發
|
||||
#
|
||||
# 觸發條件:任意 agent 每分鐘 timeout 超過 3 次,持續 5 分鐘
|
||||
# 意義:Agent 已進入 degraded 狀態,confidence 降至 20%,飛輪失能
|
||||
# 嚴重度:critical(已在降級,需立即處理)
|
||||
#
|
||||
# PromQL 設計說明:
|
||||
# - rate(...)_total 為 Counter metric,需用 rate() 計算速率
|
||||
# - aiops_agent_step_duration_seconds_count{outcome="timeout"} 是
|
||||
# Histogram Counter 的特殊形式,存在於 _count 維度中;
|
||||
# 但 prometheus_client Histogram 的 outcome label 在 _bucket/_count/_sum 上均存在
|
||||
# - rate([1m]) * 60 = 每分鐘 timeout 數(rate 本身是 per-second,乘 60 轉換)
|
||||
# - sum by (agent) 使每個 agent 獨立觸發,不混合計算
|
||||
# - > 3 對齊任務規格(每分鐘 > 3 起)
|
||||
# - for: 5m 確認是持續問題,非一過性尖峰
|
||||
# -------------------------------------------------------------------------
|
||||
- alert: AgentStepTimeoutSpike
|
||||
expr: |
|
||||
sum by (agent) (
|
||||
rate(aiops_agent_step_duration_seconds_count{outcome="timeout"}[1m])
|
||||
) * 60 > 3
|
||||
for: 5m
|
||||
labels:
|
||||
severity: critical
|
||||
layer: k8s
|
||||
team: ai
|
||||
auto_repair: "false"
|
||||
alert_category: "agent_step_latency"
|
||||
annotations:
|
||||
summary: "Agent {{ $labels.agent }} timeout 頻率 > 3/min(持續 5m,飛輪失能)"
|
||||
description: |
|
||||
Phase 2 agent {{ $labels.agent }} 過去 1 分鐘 timeout 速率為
|
||||
{{ $value | humanize }}/min,持續 5 分鐘超過門檻 3/min。
|
||||
agent 處於 degraded 狀態(confidence=20%);診斷、決策、修復全部失效。
|
||||
根因通常是 NIM 高負載或網路抖動。需立即查 NIM 及考慮切 provider。
|
||||
runbook_url: "docs/runbooks/RUNBOOK-AGENT-STEP-LATENCY.md#agentsteoptimeoutspike"
|
||||
|
||||
# -------------------------------------------------------------------------
|
||||
# [ACTIVE — 需 A2 上線]
|
||||
# DiagnoseFallbackToCloud — NEMO→Gemini fallback 頻率預警
|
||||
#
|
||||
# 觸發條件:從 openclaw_nemo 到 gemini 的 fallback 每分鐘 > 5 次,持續 5 分鐘
|
||||
# 意義:NIM 已無法服務,Gemini 每日配額正在高速消耗,有配額耗盡風險
|
||||
# 嚴重度:warning(Gemini 還在服務,但配額燃燒速度需注意)
|
||||
#
|
||||
# PromQL 設計說明:
|
||||
# - aiops_diagnose_fallback_total 是 Counter,用 rate() + * 60 轉為 /min
|
||||
# - 精確過濾 from_provider="openclaw_nemo", to_provider="gemini"
|
||||
# (最關鍵的 NEMO→Gemini 跳轉,預警 Gemini quota 燒耗)
|
||||
# - 若需監控 Gemini→Claude 第二段 fallback,另建獨立 alert(超出本任務範圍)
|
||||
# - for: 5m 確認持續燃燒,非一過性 NIM 重啟
|
||||
# -------------------------------------------------------------------------
|
||||
- alert: DiagnoseFallbackToCloud
|
||||
expr: |
|
||||
rate(
|
||||
aiops_diagnose_fallback_total{
|
||||
from_provider="openclaw_nemo",
|
||||
to_provider="gemini"
|
||||
}[1m]
|
||||
) * 60 > 5
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
layer: k8s
|
||||
team: ai
|
||||
auto_repair: "false"
|
||||
alert_category: "agent_step_latency"
|
||||
annotations:
|
||||
summary: "NEMO→Gemini fallback > 5/min(持續 5m),Gemini quota 正在燃燒"
|
||||
description: |
|
||||
DIAGNOSE phase 從 openclaw_nemo fallback 到 gemini 的速率為
|
||||
{{ $value | humanize }}/min,持續 5 分鐘超過門檻 5/min。
|
||||
NIM 主力路由已無法服務;Gemini 每日配額正在高速消耗。
|
||||
若 Gemini 配額耗盡,fallback 將轉向 Claude(費用更高)。
|
||||
需立即確認 NIM 狀態並決定是否限流或增加 Gemini 配額。
|
||||
runbook_url: "docs/runbooks/RUNBOOK-AGENT-STEP-LATENCY.md#diagnosefallbacktocloud"
|
||||
Reference in New Issue
Block a user