feat(wave5-p2): GovernanceAgent 4 self-checks + Ollama health alert rules + Prometheus metrics integration
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 10m45s

MASTER plan_complete_v3.md Wave 5 P2.2 + P2.3 complete (multiple engineers finished the code before hitting the quota limit; this commit backfills it):

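P2.2 - GovernanceAgent 4 self-checks:
- governance_agent.py (342 lines): self-check loop that runs every hour:
  · trust_drift (trust drift detection)
  · knowledge_degradation (knowledge degradation detection)
  · llm_hallucination (LLM hallucination detection)
  · execution_blast_radius (execution blast radius detection)
- main.py lifespan: started via asyncio.create_task(run_governance_loop()),
  wrapped in try/except so a scheduling failure does not block the main flow
- failover_alerter.py: alert_governance(event_type, payload) with 1h dedup;
  the four event types are sent as Telegram MarkdownV2 alerts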
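
A minimal sketch of how the hourly loop and the 1h dedup could fit together. This is not the code in governance_agent.py or failover_alerter.py: the GovernanceAlerter class, the run_governance_cycle helper, the loop's parameters, and the choice to dedup per event_type are all assumptions made for illustration, and the Telegram delivery is replaced by an injected send callable.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, Optional

DEDUP_WINDOW_SECONDS = 3600.0  # the 1h dedup window named in the commit message

# A check returns a payload dict when it finds a problem, or None when healthy.
Check = Callable[[], Awaitable[Optional[dict]]]


class GovernanceAlerter:
    """Hypothetical stand-in for failover_alerter.alert_governance."""

    def __init__(self, send: Callable[[str, dict], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._send = send          # e.g. a Telegram sender in the real service
        self._clock = clock        # injectable so the window can be tested
        self._last_sent: Dict[str, float] = {}

    def alert_governance(self, event_type: str, payload: dict) -> bool:
        """Send an alert unless the same event_type fired within the window."""
        now = self._clock()
        last = self._last_sent.get(event_type)
        if last is not None and now - last < DEDUP_WINDOW_SECONDS:
            return False  # suppressed as a duplicate
        self._last_sent[event_type] = now
        self._send(event_type, payload)
        return True


async def run_governance_cycle(checks: Dict[str, Check],
                               alerter: GovernanceAlerter) -> int:
    """Run every check once and return how many alerts were actually sent."""
    sent = 0
    for event_type, check in checks.items():
        try:
            finding = await check()
        except Exception:
            continue  # one crashing check must not stop the remaining checks
        if finding is not None and alerter.alert_governance(event_type, finding):
            sent += 1
    return sent


async def run_governance_loop(checks: Dict[str, Check],
                              alerter: GovernanceAlerter,
                              interval_seconds: float = 3600.0) -> None:
    """Repeat the cycle forever, sleeping interval_seconds between runs."""
    while True:
        await run_governance_cycle(checks, alerter)
        await asyncio.sleep(interval_seconds)
```

In a lifespan hook the loop would be scheduled with asyncio.create_task(run_governance_loop(checks, alerter)), with that call itself inside try/except so a scheduling error only logs. Keying the dedup on event_type alone means a second, different trust_drift finding within the hour is also suppressed; a finer key would need part of the payload.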

P2.3 - Ollama health rules + Prometheus metrics:
- ops/monitoring/ollama_health_rules.yaml (148 lines):
  · OllamaHealthDegraded / OllamaPrimaryDown
  · OllamaFailoverTriggered / GeminiQuotaExceeded
  · adds the alert rules for the data Prometheus scrapes
- core/metrics.py (57 lines):
  · GEMINI_DAILY_CALL_COUNT / GEMINI_DAILY_QUOTA Gauge
  · OLLAMA_FAILOVER_TRIGGERED_TOTAL Counter
  · OLLAMA_CURRENT_PRIMARY_IS_OLLAMA Gauge
- ollama_failover_manager.py:
  · _check_gemini_quota: updates the Gauges on every check (so Prometheus scrapes the latest value)
  · select_provider: on failover, increments the Counter and flips the primary Gauge
  · wrapped in try/except so a metric failure does not block the main routing path
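
The "metric failure must not block routing" rule can be sketched as below. This is an illustration under assumptions, not the body of ollama_failover_manager.py: metrics_guard, the simplified select_provider signature, and the "gemini" fallback name are hypothetical. The metric objects are passed in and only need the .labels(...).inc() and .set() methods that prometheus_client Counter and Gauge objects provide.

```python
import logging
from contextlib import contextmanager

logger = logging.getLogger("ollama_failover_manager")


@contextmanager
def metrics_guard(operation: str):
    """Swallow and log any exception raised while updating metrics."""
    try:
        yield
    except Exception:
        logger.warning("metric update failed during %s", operation, exc_info=True)


def select_provider(primary_healthy: bool, failover_counter, primary_gauge) -> str:
    """Pick a provider; the metric updates are best-effort side effects."""
    # Assumption: "gemini" is the fallback when ollama_111 is unhealthy.
    provider = "ollama_111" if primary_healthy else "gemini"
    with metrics_guard("select_provider"):
        if not primary_healthy:
            # Label names match the Counter declared in core/metrics.py.
            failover_counter.labels(
                from_provider="ollama_111", to_provider=provider
            ).inc()
        primary_gauge.set(1 if primary_healthy else 0)
    return provider
```

The routing decision is computed before the guarded block, so even if the metrics registry raises, the caller still gets a provider back and the failure is only logged.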

E2E tests:
- test_failover_e2e_dispatch.py (365 lines)
  full dispatch path: health check → failover decision → alerter → metrics

Tests: 54 passed (e2e_dispatch + failover_manager + failover_alerter)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Multiple Engineers (previous session, Wave 5) <noreply@anthropic.com>
Your Name
2026-04-26 20:56:19 +08:00
parent bddf99a002
commit 2c57b71db9
7 changed files with 970 additions and 0 deletions


@@ -129,6 +129,63 @@ LEARNING_SKIP_TOTAL = Counter(
)
# =============================================================================
# Ollama failover metrics (P2.3, 2026-04-26, Taipei time zone)
# Created by: Claude Sonnet 4.6 (tool-expert, P2.3)
#
# Corresponding alert rules: ops/monitoring/ollama_health_rules.yaml
#
# Used in:
# - ollama_failover_manager.py: OLLAMA_FAILOVER_TRIGGERED_TOTAL, AI_ROUTER_PROVIDER_TOTAL
# - ollama_auto_recovery.py: OLLAMA_RECOVERY_TRIGGERED_TOTAL
# - ollama_health_monitor.py: OLLAMA_HEALTH_STATUS
# - main.py lifespan / background task: GEMINI_DAILY_CALL_COUNT, GEMINI_DAILY_QUOTA
#
# Backlog (to be added separately once designed):
# - ollama_inference_duration_seconds (Histogram): needs an observe() call inside _check_inference()
# - post_execution_verification_failed_total / _total: needs to be added in auto_repair_service.py
# =============================================================================
OLLAMA_FAILOVER_TRIGGERED_TOTAL = Counter(
    "ollama_failover_triggered_total",
    "Ollama failover events (primary switched away from ollama_111)",
    ["from_provider", "to_provider"],
)
OLLAMA_RECOVERY_TRIGGERED_TOTAL = Counter(
    "ollama_recovery_triggered_total",
    "Ollama auto-recovery events (primary switched back to ollama_111)",
    ["from_provider"],
)
OLLAMA_HEALTH_STATUS = Gauge(
    "ollama_health_status",
    "Ollama instance health (1=healthy, 0=not_healthy/offline)",
    ["host"],  # host: "111" or "188"
)
OLLAMA_CURRENT_PRIMARY_IS_OLLAMA = Gauge(
    "ollama_current_primary_is_ollama",
    "Whether the current primary AI provider is ollama_111 (1=yes, 0=no)",
)
AI_ROUTER_PROVIDER_TOTAL = Counter(
    "ai_router_selected_provider_total",
    "AI router provider selection count (all routing decisions)",
    ["provider"],
)
GEMINI_DAILY_CALL_COUNT = Gauge(
    "gemini_daily_call_count",
    "Gemini API calls made today (read from Redis ollama:gemini_daily_count:{date})",
)
GEMINI_DAILY_QUOTA = Gauge(
    "gemini_daily_quota",
    "Gemini API daily call quota (from settings.GEMINI_DAILY_QUOTA)",
)
# =============================================================================
# Helper Functions
# =============================================================================