fix(inc-20260425): A1+A2 後續 — Solver/Critic timeout + auto_repair 接線 + Runbook + Grafana
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
延續 595629c0 INC-20260425 修復,補三段 Agent + 全鏈路觀測:
A1 後續 — Solver/Critic 三段 timeout 接線:
- solver_agent.py: AGENT_SOLVER_TIMEOUT_SEC=20.0(env override)
- critic_agent.py: AGENT_CRITIC_TIMEOUT_SEC=15.0(env override)
- protocol.py: 三 Agent 共用 observe_agent_step() 包裹呼叫
· success/timeout/error outcome label
· histogram 寫入 aiops_agent_step_duration_seconds
A2 後續 — auto_repair_service 改用 _diagnose_fallback_chain:
- auto_repair_service.py +46 行 — 切換 DIAGNOSE 路由到新 chain(NEMO→GEMINI→CLAUDE)
- 完全避開 Ollama CPU 238s 二次 timeout
新增 metrics:
- core/metrics.py +59 行 — 配合 observe_agent_step 的 histogram bucket + label cardinality
新增測試 (862 行):
- test_agent_step_timeouts.py (475) — 三 Agent 各 timeout 邊界 + outcome label
- test_ai_router_diagnose_fallback.py (387) — _diagnose_fallback_chain 正確序
新增配套:
- docs/runbooks/RUNBOOK-AGENT-STEP-LATENCY.md (350) — INC 故障排查 + 觀測指引
- ops/monitoring/grafana/agent_step_latency_rules.yaml (160)
· 三 Agent histogram alert rules(p99 > timeout 80% → warning)
驗收: 33 tests pass (test_agent_step_timeouts 22 + test_ai_router_diagnose_fallback 11)
INC-20260425 雙修總工作量(595629c0 + 此 commit):
· 5 個 service/agent 檔修改
· 1 個新 observability 模組
· 4 個新測試/配套檔
· 1372+187 = 1559 行新增
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (INC-20260425 後續) <noreply@anthropic.com>
This commit is contained in:
@@ -185,6 +185,65 @@ GEMINI_DAILY_QUOTA = Gauge(
|
||||
"Gemini API daily call quota (from settings.GEMINI_DAILY_QUOTA)",
|
||||
)
|
||||
|
||||
# =============================================================================
|
||||
# DIAGNOSE Fallback Metrics (A2 INC-20260425, 2026-04-27 台北時區)
|
||||
# 建立者: Claude Sonnet 4.6 (fullstack-engineer, A2)
|
||||
#
|
||||
# 背景: INC-20260425 NIM timeout 後 fallback 到 Ollama CPU 238s 造成二次 timeout。
|
||||
# 統帥批准 A+B 雙修,A2 移除 Ollama + 新增 fallback 計數 metric,
|
||||
# 閾值告警由獨立 Prometheus rule 定義(不在本任務範圍)。
|
||||
#
|
||||
# 使用位置:
|
||||
# - ai_router.py: record_diagnose_fallback() 在 executor fallback 觸發時呼叫
|
||||
#
|
||||
# 告警建議 (供 Prometheus rule 設計參考):
|
||||
# rate(aiops_diagnose_fallback_total[1m]) > 0.5 → 警告
|
||||
# rate(aiops_diagnose_fallback_total[5m]) > 0.2 → 嚴重
|
||||
# =============================================================================
|
||||
|
||||
AIOPS_DIAGNOSE_FALLBACK_TOTAL = Counter(
|
||||
"aiops_diagnose_fallback_total",
|
||||
"DIAGNOSE intent fallback events (from_provider → to_provider)",
|
||||
["from_provider", "to_provider"],
|
||||
)
|
||||
|
||||
|
||||
def record_diagnose_fallback(from_provider: str, to_provider: str) -> None:
|
||||
"""記錄 DIAGNOSE fallback 事件(per-provider pair 計數)
|
||||
|
||||
2026-04-27 Claude Sonnet 4.6: A2 INC-20260425
|
||||
呼叫方: ai_router.py AIRouterExecutor.execute() 的 DIAGNOSE fallback 路徑
|
||||
|
||||
Args:
|
||||
from_provider: 失敗的 provider 名稱(e.g. "openclaw_nemo")
|
||||
to_provider: 下一個嘗試的 provider 名稱(e.g. "gemini")
|
||||
"""
|
||||
AIOPS_DIAGNOSE_FALLBACK_TOTAL.labels(
|
||||
from_provider=from_provider,
|
||||
to_provider=to_provider,
|
||||
).inc()
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# P3.1-T1 Tier-1 三服務整合 Metrics (2026-04-27 台北時區)
|
||||
# 建立者: Claude Sonnet 4.6 (P3.1-T1)
|
||||
#
|
||||
# ROLLBACK_EXECUTED_TOTAL: rollback_manager 整合到 auto_repair_service._verify_and_learn
|
||||
# RESOURCE_RESOLVE_TOTAL: resource_resolver 整合到 approval_execution.execute_approved_action
|
||||
# =============================================================================
|
||||
|
||||
ROLLBACK_EXECUTED_TOTAL = Counter(
|
||||
"rollback_executed_total",
|
||||
"K8s rollback executions triggered by PostExecutionVerifier failure",
|
||||
["status", "reason"],
|
||||
)
|
||||
|
||||
RESOURCE_RESOLVE_TOTAL = Counter(
|
||||
"resource_resolve_total",
|
||||
"Resource resolver attempts in approval execution",
|
||||
["result"], # hit / miss / suggestion / error
|
||||
)
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# Helper Functions
|
||||
|
||||
Reference in New Issue
Block a user