feat(wave5-p2): GovernanceAgent 4 self-checks + Ollama health alert rules + Prometheus metrics integration
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 10m45s
MASTER plan_complete_v3.md Wave 5 P2.2 + P2.3 complete (multiple engineers finished the code before hitting the quota limit; this commit backfills it):

P2.2: GovernanceAgent 4 self-checks:
- governance_agent.py (342 lines): self-check loop that runs every hour:
  · trust_drift (trust drift detection)
  · knowledge_degradation (knowledge degradation detection)
  · llm_hallucination (LLM hallucination detection)
  · execution_blast_radius (execution blast radius detection)
- main.py lifespan: started with asyncio.create_task(run_governance_loop()),
  wrapped in try/except so a scheduling failure does not block the main flow
- failover_alerter.py: alert_governance(event_type, payload) with 1h dedup;
  the four event types are sent as Telegram MarkdownV2 alerts

P2.3: Ollama health rules + Prometheus metrics:
- ops/monitoring/ollama_health_rules.yaml (148 lines):
  · OllamaHealthDegraded / OllamaPrimaryDown
  · OllamaFailoverTriggered / GeminiQuotaExceeded
  · adds the alert rules for the data Prometheus scrapes
- core/metrics.py (57 lines):
  · GEMINI_DAILY_CALL_COUNT / GEMINI_DAILY_QUOTA Gauge
  · OLLAMA_FAILOVER_TRIGGERED_TOTAL Counter
  · OLLAMA_CURRENT_PRIMARY_IS_OLLAMA Gauge
- ollama_failover_manager.py:
  · _check_gemini_quota: updates the Gauge on every check (so Prometheus scrapes the latest value)
  · select_provider: on failover, increments the Counter and flips the primary Gauge
  · wrapped in try/except so a metric failure does not block the main routing path

E2E tests:
- test_failover_e2e_dispatch.py (365 lines):
  full dispatch path: health check → failover decide → alerter → metrics

Tests: 54 passed (e2e_dispatch + failover_manager + failover_alerter)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Multiple Engineers (previous session, Wave 5) <noreply@anthropic.com>
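A minimal sketch of how the hourly self-check cycle and the 1h alert dedup described above could fit together. Only the names run_governance_loop, alert_governance, and the four check names come from the commit message; the check bodies, the in-memory dedup dict, and the `sent_alerts` list standing in for the Telegram sender are assumptions for illustration, not the contents of governance_agent.py or failover_alerter.py.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, List, Optional

DEDUP_WINDOW_SECONDS = 3600  # 1h dedup window, as stated in the commit message

# Hypothetical in-memory dedup state: event_type -> time of the last alert.
_last_alert_at: Dict[str, float] = {}
sent_alerts: List[str] = []  # stand-in for the Telegram sender (assumption)


def alert_governance(event_type: str, payload: dict,
                     now: Optional[float] = None) -> bool:
    """Record an alert unless the same event_type fired within the window."""
    now = time.monotonic() if now is None else now
    last = _last_alert_at.get(event_type)
    if last is not None and now - last < DEDUP_WINDOW_SECONDS:
        return False  # suppressed as a duplicate
    _last_alert_at[event_type] = now
    sent_alerts.append(f"{event_type}: {payload}")
    return True


# Placeholder checks: each returns a finding dict, or None when all is well.
async def check_trust_drift() -> Optional[dict]:
    return {"drift": 0.31}  # hypothetical finding


async def check_knowledge_degradation() -> Optional[dict]:
    return None


async def check_llm_hallucination() -> Optional[dict]:
    raise RuntimeError("probe failed")  # simulates a broken check


async def check_execution_blast_radius() -> Optional[dict]:
    return None


CHECKS: Dict[str, Callable[[], Awaitable[Optional[dict]]]] = {
    "trust_drift": check_trust_drift,
    "knowledge_degradation": check_knowledge_degradation,
    "llm_hallucination": check_llm_hallucination,
    "execution_blast_radius": check_execution_blast_radius,
}


async def run_governance_cycle() -> int:
    """Run all four checks once and return the number of alerts sent."""
    alerts = 0
    for event_type, check in CHECKS.items():
        try:
            finding = await check()
        except Exception:
            continue  # one failing check must not stop the others
        if finding is not None and alert_governance(event_type, finding):
            alerts += 1
    return alerts


async def run_governance_loop(interval_seconds: float = 3600.0) -> None:
    """Repeat the cycle forever, sleeping one interval between runs."""
    while True:
        await run_governance_cycle()
        await asyncio.sleep(interval_seconds)
```

In an application lifespan hook, such a loop would typically be started with asyncio.create_task(run_governance_loop()) inside a try/except, so that a failure to schedule it leaves startup unaffected.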
@@ -129,6 +129,63 @@ LEARNING_SKIP_TOTAL = Counter(
)


# =============================================================================
# Ollama failover metrics (P2.3, 2026-04-26, Taipei time)
# Created by: Claude Sonnet 4.6 (tool-expert, P2.3)
#
# Matching alert rules: ops/monitoring/ollama_health_rules.yaml
#
# Used in:
#   - ollama_failover_manager.py: OLLAMA_FAILOVER_TRIGGERED_TOTAL, AI_ROUTER_PROVIDER_TOTAL
#   - ollama_auto_recovery.py: OLLAMA_RECOVERY_TRIGGERED_TOTAL
#   - ollama_health_monitor.py: OLLAMA_HEALTH_STATUS
#   - main.py lifespan / background task: GEMINI_DAILY_CALL_COUNT, GEMINI_DAILY_QUOTA
#
# Backlog (to be added separately once designed):
#   - ollama_inference_duration_seconds (Histogram): needs an observe() call in _check_inference()
#   - post_execution_verification_failed_total / _total: to be added in auto_repair_service.py
# =============================================================================

OLLAMA_FAILOVER_TRIGGERED_TOTAL = Counter(
    "ollama_failover_triggered_total",
    "Ollama failover events (primary switched away from ollama_111)",
    ["from_provider", "to_provider"],
)

OLLAMA_RECOVERY_TRIGGERED_TOTAL = Counter(
    "ollama_recovery_triggered_total",
    "Ollama auto-recovery events (primary switched back to ollama_111)",
    ["from_provider"],
)

OLLAMA_HEALTH_STATUS = Gauge(
    "ollama_health_status",
    "Ollama instance health (1=healthy, 0=not_healthy/offline)",
    ["host"],  # host: "111" or "188"
)

OLLAMA_CURRENT_PRIMARY_IS_OLLAMA = Gauge(
    "ollama_current_primary_is_ollama",
    "Whether the current primary AI provider is ollama_111 (1=yes, 0=no)",
)

AI_ROUTER_PROVIDER_TOTAL = Counter(
    "ai_router_selected_provider_total",
    "AI router provider selection count (all routing decisions)",
    ["provider"],
)

GEMINI_DAILY_CALL_COUNT = Gauge(
    "gemini_daily_call_count",
    "Gemini API calls made today (read from Redis ollama:gemini_daily_count:{date})",
)

GEMINI_DAILY_QUOTA = Gauge(
    "gemini_daily_quota",
    "Gemini API daily call quota (from settings.GEMINI_DAILY_QUOTA)",
)


# =============================================================================
# Helper Functions
# =============================================================================