feat(ops): Ollama failover runbook + Grafana dashboard + Consensus K8s ConfigMap patch

Wave 6 P2.3 ops support + tool-expert deployment files:

Added:
- docs/runbooks/RUNBOOK-OLLAMA-FAILOVER.md (240 lines)
  · Verification steps for the three iron rules (auto switch to Gemini / auto switch back / quota circuit breaker)
  · Complete failover/recovery SOP
  · Troubleshooting checklist (Ollama 111/188 unreachable, Gemini quota overrun, etc.)
- ops/monitoring/grafana/dashboards/ollama_failover.json (295 lines)
  · 4 panels: current primary / failover events / quota usage / health status
  · Corresponding P2.3 metrics: OLLAMA_FAILOVER_TRIGGERED_TOTAL / GEMINI_DAILY_CALL_COUNT
- k8s/awoooi-prod/04-configmap.yaml.patch-consensus
  · ENABLE_12AGENT_CONSENSUS / ENABLE_AIOPS_P2_FUSION feature flag patch

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: tool-expert agent (Wave 6) <noreply@anthropic.com>
240
docs/runbooks/RUNBOOK-OLLAMA-FAILOVER.md
Normal file
@@ -0,0 +1,240 @@
# RUNBOOK-OLLAMA-FAILOVER.md
# Ollama Failover Monitoring Runbook
# 2026-04-26 P2.3 by Claude Sonnet 4.6 (tool-expert)
# Alert rules: ops/monitoring/ollama_health_rules.yaml
# Dashboard: ops/monitoring/grafana/dashboards/ollama_failover.json

---

## Using the Grafana Dashboard

Dashboard title: `Ollama 容災監控` (uid: `ollama-failover-p23`)
To import: Grafana UI → Dashboards → Import → Upload JSON file → select `ops/monitoring/grafana/dashboards/ollama_failover.json`

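### Panel 1 — Ollama Availability (Stat)

**What it shows**: `up{job=~"ollama_111|ollama_188"}` × 100, the scrape liveness of each Ollama host.

| Color | Meaning |
|------|------|
| Green 100% | Prometheus scrape succeeds, host is online |
| Yellow 50% | One host offline, the other online (failover in progress). Because `up` is 0 or 1 per job, this value appears only if the two series are averaged into one; with one stat per job, look for one green tile and one red tile instead |
| Red 0% | Both hosts offline, high risk |

**Note**: This panel reflects Prometheus scrape status and requires the scrape jobs to be named `ollama_111` / `ollama_188`.
The scrape configuration lives in `ops/monitoring/generated/prometheus-scrape-generated.yaml`.

---
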
### Panel 2 — Inference Latency P50 / P99 (Time Series)

**What it shows**: inference latency quantiles.

| Threshold | Meaning |
|------|------|
| < 10s (P50) | HEALTHY: normal, traffic uses 111 |
| 10–30s (P50) | SLOW: the system has switched to Gemini |
| > 30s (P99) | DEGRADED: failover should be triggered |
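
The thresholds in this table can be read as a small decision function. The sketch below is illustrative only: `classify_latency`, `LatencyState`, and the rule that the P99 check takes precedence over the P50 check are assumptions made for this example, not the health monitor's actual code.

```python
from enum import Enum


class LatencyState(Enum):
    HEALTHY = "HEALTHY"
    SLOW = "SLOW"
    DEGRADED = "DEGRADED"


# Threshold values come from the table; how the real monitor combines
# P50 and P99 is an assumption of this sketch.
SLOW_P50_SECONDS = 10.0
DEGRADED_P99_SECONDS = 30.0


def classify_latency(p50_seconds: float, p99_seconds: float) -> LatencyState:
    """Map the two latency quantiles to the states in the threshold table."""
    if p99_seconds > DEGRADED_P99_SECONDS:
        return LatencyState.DEGRADED
    if p50_seconds >= SLOW_P50_SECONDS:
        return LatencyState.SLOW
    return LatencyState.HEALTHY
```

For example, `classify_latency(15.0, 25.0)` returns `LatencyState.SLOW`, while a P99 of 45 seconds is classified as `DEGRADED` regardless of the P50.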

**⚠️ BACKLOG warning**: `ollama_inference_duration_seconds_bucket` is not yet exposed by the API (`_check_inference()` still needs a `Histogram.observe()` call).
Until that backlog item is done, "No Data" on this panel is expected.

---

### Panel 3 — AI Provider Routing Distribution (Pie Chart)

**What it shows**: the share of requests routed to each provider over the past 5 minutes.

| Distribution | Meaning |
|------|------|
| ollama > 90% | Normal, 111 is healthy |
| gemini is the majority | 111 is SLOW/DEGRADED/OFFLINE, failover in progress |
| ollama_188 appears | Fallback after Gemini quota exhaustion, or 111 and Gemini failed at the same time |
| all nemotron/claude | Extreme case, every primary provider has failed |
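
As a rough aid for reading the pie chart, the sketch below turns per-provider request counts into shares and applies the rows of this table. The function names and the order in which the rows are checked are assumptions for illustration; the provider label values are taken from the table.

```python
def provider_shares(counts: dict[str, float]) -> dict[str, float]:
    """Convert per-provider request counts into fractions of the total."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {provider: count / total for provider, count in counts.items()}


def describe_routing(counts: dict[str, float]) -> str:
    """Apply the table rows, most severe condition first (assumed order)."""
    shares = provider_shares(counts)
    active = {p for p, share in shares.items() if share > 0}
    if not active:
        return "no traffic"
    if active <= {"nemotron", "claude"}:
        return "all primary providers failed"
    if "ollama_188" in active:
        return "188 fallback in use"
    if shares.get("gemini", 0.0) > 0.5:
        return "failover to gemini in progress"
    if shares.get("ollama", 0.0) > 0.9:
        return "normal"
    return "mixed routing, inspect further"
```

A call such as `describe_routing({"ollama": 95, "gemini": 5})` returns `"normal"`.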

---

### Panel 4 — Failover / Recovery Trigger Count (Bar Chart)

**What it shows**: failover (orange) and recovery (green) triggers per hour.

| Pattern | Meaning |
|------|------|
| Both near 0 | Normal, 111 is running stably |
| Orange rises, then green follows | Auto recovery is working: traffic switched away and back |
| Orange rises, green stays flat | **`OllamaRecoveryStuck` alert**, see the runbook below |
| Orange stays high (> 5/h) | **`OllamaFailoverFrequent` alert**, 111 is unstable |
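
The four patterns can also be expressed as a simple check on the two hourly values. This is a reading aid only: `describe_failover_pattern` is a hypothetical helper, and the 0.5/h cutoff used for "near 0" is an assumption, since the runbook does not define it.

```python
NEAR_ZERO_PER_HOUR = 0.5   # assumed cutoff for "near 0"
FREQUENT_PER_HOUR = 5.0    # from the table: more than 5 failovers per hour


def describe_failover_pattern(failovers_per_hour: float,
                              recoveries_per_hour: float) -> str:
    """Map the two Panel 4 series to the pattern names in the table."""
    if failovers_per_hour > FREQUENT_PER_HOUR:
        return "OllamaFailoverFrequent"
    if failovers_per_hour < NEAR_ZERO_PER_HOUR and recoveries_per_hour < NEAR_ZERO_PER_HOUR:
        return "stable"
    if recoveries_per_hour < NEAR_ZERO_PER_HOUR:
        return "possible OllamaRecoveryStuck"
    return "auto recovery working"
```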

---

## Alert Runbook

### `OllamaInstanceDown` — Ollama Host Offline

**Trigger**: `up{job=~"ollama_111|ollama_188"} == 0` for 2 minutes.

**Impact assessment**:
- The system should already have switched to Gemini (confirm on Panel 3)
- Check Panel 4 for a rising failover count

**Troubleshooting steps**:

```bash
# Step 1: confirm the hosts are reachable
ping -c 3 192.168.0.111
ping -c 3 192.168.0.188

# Step 2: SSH in and check the ollama service status
ssh wooo@192.168.0.111 'systemctl status ollama'
ssh wooo@192.168.0.188 'systemctl status ollama'

# Step 3: check the most recent ollama journal log
ssh wooo@192.168.0.111 'journalctl -u ollama -n 50 --no-pager'

# Step 4: check GPU memory (111 is the GPU host)
ssh wooo@192.168.0.111 'nvidia-smi'

# Step 5: if the service is down, restart it
# (may need sudo, depending on how the host is configured)
ssh wooo@192.168.0.111 'systemctl restart ollama'
# Wait 30s, then confirm the service started
ssh wooo@192.168.0.111 'systemctl status ollama'
```

**Recovery confirmation**:
Panel 1 turns green and Panel 4 shows a rising recovery count, which means auto recovery has switched traffic back.

---

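### `OllamaFailoverFrequent` — Failover Frequency Too High

**Trigger**: `rate(ollama_failover_triggered_total[1h]) > 5` for 10 minutes, intended to mean more than 5 switches per hour. Note that `rate()` returns a per-second average, so the per-hour reading corresponds to `increase(ollama_failover_triggered_total[1h]) > 5` (or `rate(...) * 3600 > 5`, the scaling Panel 4 uses); check which form `ops/monitoring/ollama_health_rules.yaml` actually contains.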
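
To make the unit question concrete, the sketch below converts a counter delta into a per-second rate and then into a per-hour count. It is a simplification: Prometheus's `rate()` also handles counter resets and extrapolates at the window edges, which is ignored here, and both function names are hypothetical.

```python
SECONDS_PER_HOUR = 3600


def rate_per_second(first_value: float, last_value: float,
                    window_seconds: float) -> float:
    """Simplified rate(): counter increase divided by the window length."""
    return (last_value - first_value) / window_seconds


def per_hour(rate: float) -> float:
    """Scale a per-second rate to events per hour, as Panel 4 does with * 3600."""
    return rate * SECONDS_PER_HOUR


# Six failovers inside a one-hour window.
six_per_hour = rate_per_second(first_value=10, last_value=16,
                               window_seconds=SECONDS_PER_HOUR)

# The raw per-second rate is far below 5, so "rate(...) > 5" would not fire,
# even though six switches in an hour exceeds the intended limit.
fires_with_raw_rate = six_per_hour > 5
fires_with_hourly_count = per_hour(six_per_hour) > 5
```

Here `fires_with_raw_rate` is `False` and `fires_with_hourly_count` is `True`, which is why the comparison needs the per-hour form.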

**Impact assessment**:
- The service itself stays available (Gemini is taking over)
- Gemini quota is consumed faster, with a risk of triggering `GeminiQuotaApproaching`

**Troubleshooting steps**:

```bash
# Step 1: check the recent state of 111 (is it flapping between OFFLINE and HEALTHY?)
ssh wooo@192.168.0.111 'nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv'

# Step 2: search the API log for the failover cause
kubectl logs -n awoooi-prod deploy/api --since=30m | grep "ollama_failover_triggered"

# Step 3: check inference latency (is it hovering at the SLOW boundary?)
kubectl logs -n awoooi-prod deploy/api --since=30m | grep "ollama_health_checked"

# Step 4: if GPU memory is the problem, restart ollama to unload the cached models
ssh wooo@192.168.0.111 'systemctl restart ollama'
```

---

### `OllamaRecoveryStuck` — Auto Recovery Stalled

**Trigger**: `ollama_health_status{host="111"} == 1 AND ollama_current_primary_is_ollama == 0` for 5 minutes.
(111 is HEALTHY again but routing still goes to Gemini)
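
The "for 5 minutes" part means the condition must hold without interruption. The sketch below illustrates that logic on a list of samples; `Sample` and `recovery_stuck` are hypothetical names, and the real evaluation is done by the Prometheus alert rule, not by application code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sample:
    timestamp: float          # seconds
    health_status_111: int    # ollama_health_status{host="111"}
    primary_is_ollama: int    # ollama_current_primary_is_ollama


def recovery_stuck(samples: list[Sample], hold_seconds: float = 300.0) -> bool:
    """True if 111 was healthy but not primary continuously for hold_seconds.

    Samples are assumed to be sorted by timestamp.
    """
    stuck_since: Optional[float] = None
    for sample in samples:
        if sample.health_status_111 == 1 and sample.primary_is_ollama == 0:
            if stuck_since is None:
                stuck_since = sample.timestamp
        else:
            stuck_since = None  # condition broken, the timer restarts
    if stuck_since is None:
        return False
    return samples[-1].timestamp - stuck_since >= hold_seconds
```

A single sample in which routing is back on Ollama resets the timer, so a brief successful recovery keeps the alert from firing.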

**Impact assessment**:
- API functionality is normal (Gemini is serving)
- Gemini quota keeps being consumed while the 111 GPU sits idle

**Troubleshooting steps**:

```bash
# Step 1: confirm OllamaAutoRecoveryService is running
kubectl logs -n awoooi-prod deploy/api --since=10m | grep "ollama_auto_recovery"

# Step 2: check the recovery service lifecycle events
kubectl logs -n awoooi-prod deploy/api --since=10m | grep -E "ollama_auto_recovery_started|ollama_auto_recovery_stopped|ollama_auto_recovery_loop_error"

# Step 3: read the current_primary Redis key
kubectl exec -n awoooi-prod deploy/api -- python -c "
import asyncio
from src.core.redis_client import get_redis
async def check():
    r = get_redis()
    val = await r.get('ollama:current_primary')
    print('current_primary:', val)
asyncio.run(check())
"

# Step 4: if the recovery service has died, restart the API pod (this reruns lifespan startup)
kubectl rollout restart deployment/api -n awoooi-prod
kubectl rollout status deployment/api -n awoooi-prod
```

---

### `GeminiQuotaApproaching` — Gemini Quota > 80%

**Trigger**: `gemini_daily_call_count / gemini_daily_quota > 0.8` for 5 minutes.

**Note**: `gemini_daily_quota` comes from `settings.GEMINI_DAILY_QUOTA` (default 1000).
`gemini_daily_call_count` is read from the Redis key `ollama:gemini_daily_count:{YYYY-MM-DD}` and used to refresh the Gauge.
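
The key format and the 80% ratio can be sketched as follows. The helper names are hypothetical and no Redis call is made; the key pattern, the default quota of 1000, and the 0.8 threshold are the ones stated above.

```python
import datetime


def gemini_count_key(day: datetime.date) -> str:
    """Build the Redis key that holds the Gemini call count for one day."""
    return f"ollama:gemini_daily_count:{day.isoformat()}"


def quota_approaching(call_count: int, daily_quota: int = 1000,
                      threshold: float = 0.8) -> bool:
    """Mirror the alert expression: call count / quota > threshold."""
    if daily_quota <= 0:
        raise ValueError("daily_quota must be positive")
    return call_count / daily_quota > threshold
```

With the default quota, the ratio first exceeds the threshold at 801 calls, since exactly 800 calls gives 0.8, which is not greater than 0.8.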

**Impact assessment**:
- The daily Gemini quota is about to run out
- Once exhausted, the system automatically switches to the CPU-only fallback on 188 (qwen2.5:7b-instruct), which is slower

**Action steps**:

```bash
# Step 1: check today's Gemini usage
kubectl exec -n awoooi-prod deploy/api -- python -c "
import asyncio, datetime
from src.core.redis_client import get_redis
async def check():
    r = get_redis()
    today = datetime.date.today().isoformat()
    val = await r.get(f'ollama:gemini_daily_count:{today}')
    print(f'gemini_daily_count[{today}]:', val)
asyncio.run(check())
"

# Step 2: check whether 111 can recover quickly (so traffic switches back to Ollama)
ssh wooo@192.168.0.111 'systemctl status ollama && nvidia-smi'

# Step 3: to raise the quota, change the setting
# Look for GEMINI_DAILY_QUOTA in k8s/awoooi-prod/04-configmap.yaml.patch-*
# After editing: kubectl apply + rollout restart

# Step 4: emergency manual counter reset (use with care, only after confirming a miscount)
# kubectl exec -n awoooi-prod deploy/api -- redis-cli DEL "ollama:gemini_daily_count:$(date +%Y-%m-%d)"
```

---

## Metric List

| Metric | Type | Status | Description |
|--------|------|------|------|
| `up{job="ollama_111"}` | Gauge | ✅ Existing | Prometheus scrape liveness |
| `up{job="ollama_188"}` | Gauge | ✅ Existing | Prometheus scrape liveness |
| `ollama_failover_triggered_total` | Counter | ✅ Added in P2.3 | Number of failover switches, labels: from_provider, to_provider |
| `ollama_recovery_triggered_total` | Counter | ✅ Added in P2.3 | Number of recovery switch-backs, labels: from_provider |
| `ollama_health_status{host}` | Gauge | ✅ Added in P2.3 | Health status, 1=healthy, 0=not_healthy |
| `ollama_current_primary_is_ollama` | Gauge | ✅ Added in P2.3 | 1=primary is ollama, 0=failover in progress |
| `ai_router_selected_provider_total` | Counter | ✅ Added in P2.3 | AI router selection count, labels: provider |
| `gemini_daily_call_count` | Gauge | ✅ Added in P2.3 | Gemini calls made today |
| `gemini_daily_quota` | Gauge | ✅ Added in P2.3 | Gemini daily quota |
| `ollama_inference_duration_seconds` | Histogram | ⏳ BACKLOG | Inference latency distribution, needs an observe call in `_check_inference()` |
| `post_execution_verification_total` | Counter | ⏳ BACKLOG | Verifier run count, to be added in auto_repair_service.py |
| `post_execution_verification_failed_total` | Counter | ⏳ BACKLOG | Verifier failure count, to be added in auto_repair_service.py |

## Backlog Completion Guide

### `ollama_inference_duration_seconds`

At the end of the `_check_inference()` method in `apps/api/src/services/ollama_health_monitor.py`, add:

```python
from src.core.metrics import OLLAMA_INFERENCE_DURATION  # add the Histogram to metrics.py first
OLLAMA_INFERENCE_DURATION.labels(host=host_label).observe(latency_ms / 1000)
```
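
To show how observed latencies would end up as the P50/P99 lines on Panel 2, here is a standard-library sketch of a bucketed histogram with a linear-interpolation quantile estimate, the same general idea as PromQL's `histogram_quantile()`. It is not the project's metrics code: `LatencyHistogram` and the bucket boundaries are assumptions chosen for the example.

```python
import bisect
import math


class LatencyHistogram:
    """Bucketed latency histogram with an interpolated quantile estimate."""

    def __init__(self, upper_bounds: list[float]) -> None:
        # A final +Inf bucket catches every observation above the last bound.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.counts = [0] * len(self.upper_bounds)  # per bucket, not cumulative

    def observe(self, seconds: float) -> None:
        # First bucket whose upper bound is >= the observed value.
        index = bisect.bisect_left(self.upper_bounds, seconds)
        self.counts[index] += 1

    def quantile(self, q: float) -> float:
        total = sum(self.counts)
        if total == 0:
            return math.nan
        rank = q * total
        cumulative = 0
        for index, count in enumerate(self.counts):
            previous = cumulative
            cumulative += count
            if count > 0 and cumulative >= rank:
                upper = self.upper_bounds[index]
                lower = self.upper_bounds[index - 1] if index > 0 else 0.0
                if math.isinf(upper):
                    return lower  # cannot interpolate inside the +Inf bucket
                return lower + (upper - lower) * (rank - previous) / count
        return math.nan


histogram = LatencyHistogram([1, 5, 10, 30, 60])
for latency_ms in (2000, 3000, 4000, 8000, 40000):
    histogram.observe(latency_ms / 1000)  # same ms-to-seconds conversion as above
```

Because the estimate interpolates inside a bucket, its accuracy depends on the boundaries; placing bounds at 10s and 30s keeps the SLOW and DEGRADED thresholds on exact bucket edges.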

### `post_execution_verification_*`

In the verifier path of `apps/api/src/services/auto_repair_service.py`, add a Counter `inc()`.
First locate where the verifier runs (grep `post_execution` or `verif` to find the entry point).
28
k8s/awoooi-prod/04-configmap.yaml.patch-consensus
Normal file
@@ -0,0 +1,28 @@
# ============================================================================
# PATCH: P2.4 enable the 12-Agent ConsensusEngine
# Date: 2026-04-26 (Taipei time zone)
# Owner: ogt + Claude Sonnet 4.6
# ADR reference: ADR-095, plan_complete_v3.md P2.4
# Description:
# With ENABLE_12AGENT_CONSENSUS set to "true", the decision path for P0/P1 incidents
# calls ConsensusEngine, which combines the opinions of four experts: SRE/Security/Cost/Performance.
# Consensus score ≥0.6 → READY (may auto-execute); <0.6 → fallback to expert_analyze
# Scope of impact:
# - incident_analysis_sweeper: P0/P1 incidents call get_or_create_decision_with_consensus
# - decision_manager: gated by the ENABLE_12AGENT_CONSENSUS flag
# ⚠️ Note: ConsensusEngine calls Ollama/NIM; confirm the AI services are available before enabling
# ⚠️ This patch is for review only; apply it manually after the commander approves
# ============================================================================
#
# Add the following line to /Users/ogt/awoooi/k8s/awoooi-prod/04-configmap.yaml
# Suggested location: after the TG_GROUP_CUTOVER line
#
# --- Added content ---
# 2026-04-26 P2.4 ogt + Claude Sonnet 4.6: enable the 12-Agent ConsensusEngine (ADR-095)
# P0/P1 incidents go through ConsensusEngine → 4 experts vote in parallel → consensus ≥0.6 auto-executes
ENABLE_12AGENT_CONSENSUS: "true"
# --- End of added content ---
#
# Usage (apply manually after user review):
# kubectl -n awoooi-prod apply -f k8s/awoooi-prod/04-configmap.yaml
# kubectl -n awoooi-prod rollout restart deployment/awoooi-api
295
ops/monitoring/grafana/dashboards/ollama_failover.json
Normal file
@@ -0,0 +1,295 @@
{
  "__inputs": [],
  "__requires": [
    {
      "type": "grafana",
      "id": "grafana",
      "name": "Grafana",
      "version": "10.0.0"
    },
    {
      "type": "panel",
      "id": "stat",
      "name": "Stat",
      "version": ""
    },
    {
      "type": "panel",
      "id": "timeseries",
      "name": "Time series",
      "version": ""
    },
    {
      "type": "panel",
      "id": "piechart",
      "name": "Pie chart",
      "version": ""
    },
    {
      "type": "panel",
      "id": "barchart",
      "name": "Bar chart",
      "version": ""
    }
  ],
  "annotations": {
    "list": []
  },
  "description": "Ollama failover monitoring: availability, inference latency, AI routing distribution, failover/recovery triggers | P2.3 2026-04-26 Taipei time zone",
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 1,
  "id": null,
  "links": [],
  "panels": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "red", "value": null },
              { "color": "yellow", "value": 50 },
              { "color": "green", "value": 100 }
            ]
          },
          "unit": "percent",
          "min": 0,
          "max": 100
        },
        "overrides": []
      },
      "gridPos": { "h": 6, "w": 12, "x": 0, "y": 0 },
      "id": 1,
      "options": {
        "colorMode": "background",
        "graphMode": "none",
        "justifyMode": "center",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": ["lastNotNull"],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "title": "Ollama Availability",
      "description": "up{job=~\"ollama_111|ollama_188\"} × 100\n- Green 100% = host online\n- Red 0% = host offline (failover should already have triggered)\n\nData source: Prometheus scrape jobs ollama_111 / ollama_188",
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "${datasource}" },
          "expr": "up{job=~\"ollama_111|ollama_188\"} * 100",
          "legendFormat": "{{ job }}",
          "refId": "A"
        }
      ],
      "type": "stat"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "palette-classic" },
          "custom": {
            "axisCenteredZero": false,
            "axisColorMode": "text",
            "axisLabel": "Latency (s)",
            "axisPlacement": "auto",
            "barAlignment": 0,
            "drawStyle": "line",
            "fillOpacity": 10,
            "gradientMode": "none",
            "hideFrom": { "legend": false, "tooltip": false, "viz": false },
            "lineInterpolation": "smooth",
            "lineWidth": 2,
            "pointSize": 5,
            "scaleDistribution": { "type": "linear" },
            "showPoints": "never",
            "spanNulls": false
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 10 },
              { "color": "red", "value": 30 }
            ]
          },
          "unit": "s"
        },
        "overrides": []
      },
      "gridPos": { "h": 6, "w": 12, "x": 12, "y": 0 },
      "id": 2,
      "options": {
        "legend": {
          "calcs": ["lastNotNull", "max"],
          "displayMode": "table",
          "placement": "bottom"
        },
        "tooltip": { "mode": "multi", "sort": "none" }
      },
      "title": "Inference Latency P50 / P99",
      "description": "histogram_quantile(0.5/0.99, rate(ollama_inference_duration_seconds_bucket[5m]))\n- P50 > 10s = SLOW threshold\n- P99 > 30s = DEGRADED threshold, triggers failover\n\n⚠️ BACKLOG: ollama_inference_duration_seconds_bucket is not exposed yet; the panel shows No Data until the Part 3 backlog is completed",
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "${datasource}" },
          "expr": "histogram_quantile(0.5, rate(ollama_inference_duration_seconds_bucket[5m]))",
          "legendFormat": "P50",
          "refId": "A"
        },
        {
          "datasource": { "type": "prometheus", "uid": "${datasource}" },
          "expr": "histogram_quantile(0.99, rate(ollama_inference_duration_seconds_bucket[5m]))",
          "legendFormat": "P99",
          "refId": "B"
        }
      ],
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "palette-classic" },
          "custom": {
            "hideFrom": { "legend": false, "tooltip": false, "viz": false }
          },
          "mappings": [],
          "unit": "reqps"
        },
        "overrides": []
      },
      "gridPos": { "h": 6, "w": 12, "x": 0, "y": 6 },
      "id": 3,
      "options": {
        "displayLabels": ["name", "percent"],
        "legend": {
          "displayMode": "table",
          "placement": "right",
          "values": ["value", "percent"]
        },
        "pieType": "pie",
        "tooltip": { "mode": "single", "sort": "none" }
      },
      "title": "AI Provider Routing Distribution",
      "description": "sum by (provider) (rate(ai_router_selected_provider_total[5m]))\n- Normal: ollama takes the large majority\n- During failover: the gemini / ollama_188 share rises\n- All gemini = 111 completely offline\n\nData source: OLLAMA_FAILOVER_TRIGGERED_TOTAL + AI_ROUTER_PROVIDER_TOTAL (src/core/metrics.py)",
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "${datasource}" },
          "expr": "sum by (provider) (rate(ai_router_selected_provider_total[5m]))",
          "legendFormat": "{{ provider }}",
          "refId": "A"
        }
      ],
      "type": "piechart"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${datasource}"
      },
      "fieldConfig": {
        "defaults": {
          "color": { "mode": "palette-classic" },
          "custom": {
            "axisLabel": "Count/hour",
            "fillOpacity": 80,
            "gradientMode": "none",
            "hideFrom": { "legend": false, "tooltip": false, "viz": false },
            "lineWidth": 1
          },
          "mappings": [],
          "unit": "short"
        },
        "overrides": [
          {
            "matcher": { "id": "byName", "options": "Failover" },
            "properties": [
              { "id": "color", "value": { "fixedColor": "orange", "mode": "fixed" } }
            ]
          },
          {
            "matcher": { "id": "byName", "options": "Recovery" },
            "properties": [
              { "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }
            ]
          }
        ]
      },
      "gridPos": { "h": 6, "w": 12, "x": 12, "y": 6 },
      "id": 4,
      "options": {
        "barRadius": 0.05,
        "barWidth": 0.7,
        "groupWidth": 0.7,
        "legend": {
          "calcs": ["sum"],
          "displayMode": "list",
          "placement": "bottom"
        },
        "orientation": "auto",
        "tooltip": { "mode": "multi", "sort": "none" },
        "xTickLabelRotation": 0
      },
      "title": "Failover / Recovery Trigger Count",
      "description": "Orange = ollama_failover_triggered_total (switch away from 111)\nGreen = ollama_recovery_triggered_total (switch back to 111)\n\nIn the normal state both stay near 0.\nOrange rising followed closely by green = auto recovery is working.\nOrange rising without green = OllamaRecoveryStuck, check the alert.\n\nData source: src/core/metrics.py OLLAMA_FAILOVER_TRIGGERED_TOTAL / OLLAMA_RECOVERY_TRIGGERED_TOTAL",
      "targets": [
        {
          "datasource": { "type": "prometheus", "uid": "${datasource}" },
          "expr": "sum(rate(ollama_failover_triggered_total[1h])) * 3600",
          "legendFormat": "Failover",
          "refId": "A"
        },
        {
          "datasource": { "type": "prometheus", "uid": "${datasource}" },
          "expr": "sum(rate(ollama_recovery_triggered_total[1h])) * 3600",
          "legendFormat": "Recovery",
          "refId": "B"
        }
      ],
      "type": "barchart"
    }
  ],
  "refresh": "30s",
  "schemaVersion": 39,
  "tags": ["ollama", "failover", "aiops", "p2.3"],
  "templating": {
    "list": [
      {
        "current": {},
        "hide": 0,
        "includeAll": false,
        "label": "Datasource",
        "multi": false,
        "name": "datasource",
        "options": [],
        "query": "prometheus",
        "refresh": 1,
        "regex": "",
        "type": "datasource"
      }
    ]
  },
  "time": { "from": "now-3h", "to": "now" },
  "timepicker": {},
  "timezone": "Asia/Taipei",
  "title": "Ollama 容災監控",
  "uid": "ollama-failover-p23",
  "version": 1
}