feat(wave5-p2): GovernanceAgent 4 self-checks + Ollama health alert rules + Prometheus metrics integration

MASTER plan_complete_v3.md Wave 5 P2.2 + P2.3 complete (multiple engineers finished the code before hitting the quota limit; this is the catch-up commit):

P2.2: GovernanceAgent 4 self-checks:
- governance_agent.py (342 lines): self-check loop every 1 hour:
  · trust_drift (trust drift detection)
  · knowledge_degradation (knowledge degradation detection)
  · llm_hallucination (LLM hallucination detection)
  · execution_blast_radius (execution blast radius detection)
- main.py lifespan: asyncio.create_task(run_governance_loop()) startup
  is wrapped in try/except, so a scheduling failure does not block the main flow
- failover_alerter.py: alert_governance(event_type, payload) with 1h dedup
  four event types → Telegram MarkdownV2 alerts

P2.3: Ollama health rules + Prometheus metrics:
- ops/monitoring/ollama_health_rules.yaml (148 lines):
  · OllamaHealthDegraded / OllamaPrimaryDown
  · OllamaFailoverTriggered / GeminiQuotaExceeded
  · adds alert rules over the data Prometheus scrapes
- core/metrics.py (57 lines):
  · GEMINI_DAILY_CALL_COUNT / GEMINI_DAILY_QUOTA Gauge
  · OLLAMA_FAILOVER_TRIGGERED_TOTAL Counter
  · OLLAMA_CURRENT_PRIMARY_IS_OLLAMA Gauge
- ollama_failover_manager.py:
  · _check_gemini_quota: updates the Gauges on every check (so Prometheus scrapes the latest value)
  · select_provider: on failover, inc the Counter and switch the Primary Gauge
  · wrapped in try/except, so a metric failure does not block main routing

E2E tests:
- test_failover_e2e_dispatch.py (365 lines)
  full dispatch path: health check → failover decide → alerter → metrics

Tests: 54 passed (e2e_dispatch + failover_manager + failover_alerter)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Multiple Engineers (previous session, Wave 5) <noreply@anthropic.com>
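
Reviewer note: the message says twice that instrumentation is wrapped in try/except so a failure cannot block the main path. The changes to ollama_failover_manager.py are not shown in this part of the diff, so the following is only a minimal sketch of that pattern. `best_effort`, `select_provider_sketch` and the `"fallback"` provider name are hypothetical stand-ins, not names from the commit.

```python
import logging
from typing import Callable

logger = logging.getLogger("metrics_guard")


def best_effort(update: Callable[[], None], label: str) -> bool:
    """Run a metric update; log and swallow any error so the caller continues.

    Returns True if the update ran cleanly, False if it raised.
    """
    try:
        update()
        return True
    except Exception:  # deliberately broad: metrics must never break routing
        logger.warning("metric update failed: %s", label, exc_info=True)
        return False


def select_provider_sketch(primary_healthy: bool,
                           on_failover: Callable[[], None]) -> str:
    """Hypothetical stand-in for select_provider: decide first, record second."""
    if primary_healthy:
        return "ollama_111"
    # Assumed placement: this is where the Counter inc and Gauge switch would go.
    best_effort(on_failover, "ollama_failover_triggered_total")
    return "fallback"  # placeholder name; the real failover target is not shown
```

Returning a bool from the guard, rather than nothing, lets a test assert that a broken metric was actually swallowed instead of silently skipped.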
2026-04-26 20:56:19 +08:00
parent bddf99a002
commit 2c57b71db9
7 changed files with 970 additions and 0 deletions
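
Reviewer note: failover_alerter.py is not part of the hunk shown here, so how the "1h dedup" in alert_governance(event_type, payload) is implemented cannot be checked from this part of the diff. As a reference point for review, a minimal in-memory sketch follows. It assumes deduplication is keyed on event_type alone and that state lives in process memory; the real code may key on the payload as well or keep state in Redis. `AlertDeduper` and `should_send` are hypothetical names.

```python
import time
from typing import Callable, Dict

DEDUP_WINDOW_SECONDS = 3600  # the "1h dedup" from the commit message


class AlertDeduper:
    """Suppress repeats of the same event type inside a fixed time window."""

    def __init__(self, window: float = DEDUP_WINDOW_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._window = window
        self._clock = clock  # injectable so tests do not have to sleep
        self._last_sent: Dict[str, float] = {}

    def should_send(self, event_type: str) -> bool:
        """Return True and record the send time if the window has elapsed."""
        now = self._clock()
        last = self._last_sent.get(event_type)
        if last is not None and now - last < self._window:
            return False  # still inside the window: drop the duplicate
        self._last_sent[event_type] = now
        return True
```

With this shape, each of the four governance event types is throttled independently, so a noisy trust_drift alert would not hide an llm_hallucination alert raised in the same hour.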
--- a/apps/api/src/core/metrics.py
+++ b/apps/api/src/core/metrics.py
@@ -129,6 +129,63 @@ LEARNING_SKIP_TOTAL = Counter(
 )


+# =============================================================================
+# Ollama disaster-recovery metrics (P2.3, 2026-04-26, Taipei time zone)
+# Created by: Claude Sonnet 4.6 (tool-expert, P2.3)
+#
+# Corresponding alert rules: ops/monitoring/ollama_health_rules.yaml
+#
+# Used in:
+#   - ollama_failover_manager.py: OLLAMA_FAILOVER_TRIGGERED_TOTAL, AI_ROUTER_PROVIDER_TOTAL
+#   - ollama_auto_recovery.py:    OLLAMA_RECOVERY_TRIGGERED_TOTAL
+#   - ollama_health_monitor.py:   OLLAMA_HEALTH_STATUS
+#   - main.py lifespan / background task: GEMINI_DAILY_CALL_COUNT, GEMINI_DAILY_QUOTA
+#
+# Backlog (to be added separately once designed):
+#   - ollama_inference_duration_seconds (Histogram): needs observe() calls inside _check_inference()
+#   - post_execution_verification_failed_total / _total: must be added in auto_repair_service.py
+# =============================================================================
+
+OLLAMA_FAILOVER_TRIGGERED_TOTAL = Counter(
+    "ollama_failover_triggered_total",
+    "Ollama failover events (primary switched away from ollama_111)",
+    ["from_provider", "to_provider"],
+)
+
+OLLAMA_RECOVERY_TRIGGERED_TOTAL = Counter(
+    "ollama_recovery_triggered_total",
+    "Ollama auto-recovery events (primary switched back to ollama_111)",
+    ["from_provider"],
+)
+
+OLLAMA_HEALTH_STATUS = Gauge(
+    "ollama_health_status",
+    "Ollama instance health (1=healthy, 0=not_healthy/offline)",
+    ["host"],  # host: "111" or "188"
+)
+
+OLLAMA_CURRENT_PRIMARY_IS_OLLAMA = Gauge(
+    "ollama_current_primary_is_ollama",
+    "Whether the current primary AI provider is ollama_111 (1=yes, 0=no)",
+)
+
+AI_ROUTER_PROVIDER_TOTAL = Counter(
+    "ai_router_selected_provider_total",
+    "AI router provider selection count (all routing decisions)",
+    ["provider"],
+)
+
+GEMINI_DAILY_CALL_COUNT = Gauge(
+    "gemini_daily_call_count",
+    "Gemini API calls made today (read from Redis ollama:gemini_daily_count:{date})",
+)
+
+GEMINI_DAILY_QUOTA = Gauge(
+    "gemini_daily_quota",
+    "Gemini API daily call quota (from settings.GEMINI_DAILY_QUOTA)",
+)
+
+
 # =============================================================================
 # Helper Functions
 # =============================================================================