diff --git a/docs/runbooks/RUNBOOK-OLLAMA-FAILOVER.md b/docs/runbooks/RUNBOOK-OLLAMA-FAILOVER.md
new file mode 100644
index 00000000..c6f65d39
--- /dev/null
+++ b/docs/runbooks/RUNBOOK-OLLAMA-FAILOVER.md
@@ -0,0 +1,240 @@
+# RUNBOOK-OLLAMA-FAILOVER.md
+# Ollama Failover Monitoring (容災監控) Runbook
+# 2026-04-26 P2.3 by Claude Sonnet 4.6 (tool-expert)
+# Alert rules: ops/monitoring/ollama_health_rules.yaml
+# Dashboard: ops/monitoring/grafana/dashboards/ollama_failover.json
+
+---
+
+## Grafana Dashboard Usage
+
+Dashboard name: `Ollama 容災監控` (uid: `ollama-failover-p23`)
+Import: Grafana UI → Dashboards → Import → Upload JSON file → select `ops/monitoring/grafana/dashboards/ollama_failover.json`
+
+### Panel 1: Ollama 可用性 (Availability, Stat)
+
+**What to look at**: `up{job=~"ollama_111|ollama_188"}` × 100, the scrape liveness of each Ollama host.
+
+| Color | Meaning |
+|------|------|
+| Green 100% | Prometheus probe OK, host online |
+| Yellow 50% | One host offline, the other online (failover in progress); the per-job query only returns 0 or 100, so this shows up as one red and one green tile |
+| Red 0% | Both hosts offline, high risk |
+
+**Note**: This panel reflects Prometheus scrape status and requires the scrape jobs to be named `ollama_111` / `ollama_188`.
+The scrape config lives in `ops/monitoring/generated/prometheus-scrape-generated.yaml`.
+
+---
+
+### Panel 2: 推理延遲 P50 / P99 (Inference Latency, Time Series)
+
+**What to look at**: inference latency quantiles.
+
+| Threshold | Meaning |
+|------|------|
+| < 10s (P50) | HEALTHY: normal, 111 in use |
+| 10–30s (P50) | SLOW: the system has switched to Gemini |
+| > 30s (P99) | DEGRADED: failover should trigger |
+
+**⚠️ BACKLOG warning**: `ollama_inference_duration_seconds_bucket` is not yet exposed by the API (a `Histogram.observe()` call must be added in `_check_inference()`).
+The panel showing "No Data" is expected; it becomes active once the backlog item is completed.
+
+---
+
+### Panel 3: AI Provider 路由分布 (Routing Distribution, Pie Chart)
+
+**What to look at**: the share of requests routed to each provider over the last 5 minutes.
+
+| Distribution | Meaning |
+|------|------|
+| ollama > 90% | Normal, 111 healthy |
+| gemini is the majority | 111 is SLOW/DEGRADED/OFFLINE, failover in progress |
+| ollama_188 appears | Fallback after Gemini quota exhaustion, or 111 and Gemini failing at the same time |
+| all nemotron/claude | Extreme case, every primary provider has failed |
+
+---
+
+### Panel 4: Failover / Recovery 觸發次數 (Trigger Counts, Bar Chart)
+
+**What to look at**: hourly trigger counts for failover (orange) and recovery (green).
+
+| Pattern | Meaning |
+|------|------|
+| Both near 0 | Normal, 111 running stably |
+| Orange rises, then green follows | Auto recovery works: traffic switched away, then back |
+| Orange rises, green stays flat | **`OllamaRecoveryStuck` alert**, see the runbook below |
+| Orange stays high (>5/h) | **`OllamaFailoverFrequent` alert**, 111 is unstable |
+
+---
+
+## Alert Runbook
+
+### `OllamaInstanceDown`: Ollama host offline
+
+**Trigger condition**: `up{job=~"ollama_111|ollama_188"} == 0` for 2 minutes.
+
+**Impact assessment**:
+- The system should already have switched to Gemini automatically (confirm in Panel 3)
+- Check Panel 4 for a rising Failover count
+
+**Troubleshooting steps**:
+
+```bash
+# Step 1: confirm the hosts are reachable
+ping -c 3 192.168.0.111
+ping -c 3 192.168.0.188
+
+# Step 2: SSH into the hosts and check the ollama service status
+ssh wooo@192.168.0.111 'systemctl status ollama'
+ssh wooo@192.168.0.188 'systemctl status ollama'
+
+# Step 3: read the most recent ollama journal log
+ssh wooo@192.168.0.111 'journalctl -u ollama -n 50 --no-pager'
+
+# Step 4: check GPU memory (111 is the GPU host)
+ssh wooo@192.168.0.111 'nvidia-smi'
+
+# Step 5: if the service is down, restart it
+ssh wooo@192.168.0.111 'systemctl restart ollama'
+# Wait 30s, then confirm the service has started
+ssh wooo@192.168.0.111 'systemctl status ollama'
+```
+
+**Recovery confirmation**:
+Panel 1 turns green and Panel 4 shows a rising Recovery count, which means auto recovery has switched traffic back.
+
+---
+
+### `OllamaFailoverFrequent`: failover frequency too high
+
+**Trigger condition**: `rate(ollama_failover_triggered_total[1h]) > 5` for 10 minutes (intended meaning: more than 5 switches per hour; because `rate()` is per-second, matching that intent requires `increase(...[1h]) > 5` or `rate(...) * 3600 > 5`, as the Panel 4 query does).
+
+**Impact assessment**:
+- The service itself stays available (Gemini is taking over)
+- Gemini quota is consumed faster, with a risk of triggering `GeminiQuotaApproaching`
+
+**Troubleshooting steps**:
+
+```bash
+# Step 1: check the recent state of 111 (flapping between OFFLINE and HEALTHY?)
+ssh wooo@192.168.0.111 'nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv'
+
+# Step 2: search the API log for the failover cause
+kubectl logs -n awoooi-prod deploy/api --since=30m | grep "ollama_failover_triggered"
+
+# Step 3: check inference latency (sitting at the SLOW boundary for a long time?)
+kubectl logs -n awoooi-prod deploy/api --since=30m | grep "ollama_health_checked"
+
+# Step 4: if it is a GPU memory problem, clear the model cache by restarting
+ssh wooo@192.168.0.111 'systemctl restart ollama'
+```
+
+---
+
+### `OllamaRecoveryStuck`: auto recovery stalled
+
+**Trigger condition**: `ollama_health_status{host="111"} == 1 AND ollama_current_primary_is_ollama == 0` for 5 minutes.
+(111 is HEALTHY again but routing still goes to Gemini)
+
+**Impact assessment**:
+- API functionality is normal (Gemini is serving)
+- Gemini quota keeps being consumed and the 111 GPU sits idle
+
+**Troubleshooting steps**:
+
+```bash
+# Step 1: confirm OllamaAutoRecoveryService is running
+kubectl logs -n awoooi-prod deploy/api --since=10m | grep "ollama_auto_recovery"
+
+# Step 2: check the recovery service state
+kubectl logs -n awoooi-prod deploy/api --since=10m | grep -E "ollama_auto_recovery_started|ollama_auto_recovery_stopped|ollama_auto_recovery_loop_error"
+
+# Step 3: read the current_primary Redis key
+kubectl exec -n awoooi-prod deploy/api -- python -c "
+import asyncio
+from src.core.redis_client import get_redis
+async def check():
+    r = get_redis()
+    val = await r.get('ollama:current_primary')
+    print('current_primary:', val)
+asyncio.run(check())
+"
+
+# Step 4: if the recovery service is down, restart the API pod (this re-runs lifespan startup)
+kubectl rollout restart deployment/api -n awoooi-prod
+kubectl rollout status deployment/api -n awoooi-prod
+```
+
+---
+
+### `GeminiQuotaApproaching`: Gemini quota >80%
+
+**Trigger condition**: `gemini_daily_call_count / gemini_daily_quota > 0.8` for 5 minutes.
+
+**Note**: `gemini_daily_quota` comes from `settings.GEMINI_DAILY_QUOTA` (default 1000).
+`gemini_daily_call_count` is read from the Redis key `ollama:gemini_daily_count:{YYYY-MM-DD}` and used to refresh the Gauge.
+
+**Impact assessment**:
+- The daily Gemini quota is about to run out
+- Once it is exhausted, the system automatically switches to the CPU-only fallback on 188 (qwen2.5:7b-instruct), which is slower
+
+**Action steps**:
+
+```bash
+# Step 1: check today's Gemini usage
+kubectl exec -n awoooi-prod deploy/api -- python -c "
+import asyncio, datetime
+from src.core.redis_client import get_redis
+async def check():
+    r = get_redis()
+    today = datetime.date.today().isoformat()
+    val = await r.get(f'ollama:gemini_daily_count:{today}')
+    print(f'gemini_daily_count[{today}]:', val)
+asyncio.run(check())
+"
+
+# Step 2: check whether 111 can recover quickly (so traffic switches back to Ollama)
+ssh wooo@192.168.0.111 'systemctl status ollama && nvidia-smi'
+
+# Step 3: to raise the quota, change the settings
+# look for GEMINI_DAILY_QUOTA in k8s/awoooi-prod/04-configmap.yaml.patch-*
+# after the change: kubectl apply + rollout restart
+
+# Step 4: emergency manual counter reset (use with care, only after confirming a miscount)
+# kubectl exec -n awoooi-prod deploy/api -- redis-cli DEL "ollama:gemini_daily_count:$(date +%Y-%m-%d)"
+```
+
+---
+
+## Metric List
+
+| Metric | Type | Status | Description |
+|--------|------|------|------|
+| `up{job="ollama_111"}` | Gauge | ✅ Existing | Prometheus scrape liveness |
+| `up{job="ollama_188"}` | Gauge | ✅ Existing | Prometheus scrape liveness |
+| `ollama_failover_triggered_total` | Counter | ✅ Added in P2.3 | Number of failover switches, labels: from_provider, to_provider |
+| `ollama_recovery_triggered_total` | Counter | ✅ Added in P2.3 | Number of recovery switch-backs, labels: from_provider |
+| `ollama_health_status{host}` | Gauge | ✅ Added in P2.3 | Health state, 1=healthy, 0=not_healthy |
+| `ollama_current_primary_is_ollama` | Gauge | ✅ Added in P2.3 | 1=primary is ollama, 0=failover in progress |
+| `ai_router_selected_provider_total` | Counter | ✅ Added in P2.3 | Number of AI router selections, labels: provider |
+| `gemini_daily_call_count` | Gauge | ✅ Added in P2.3 | Gemini calls made today |
+| `gemini_daily_quota` | Gauge | ✅ Added in P2.3 | Gemini daily quota |
+| `ollama_inference_duration_seconds` | Histogram | ⏳ BACKLOG | Inference latency distribution, needs an observe call in `_check_inference()` |
+| `post_execution_verification_total` | Counter | ⏳ BACKLOG | Number of verifier runs, to be added in auto_repair_service.py |
+| `post_execution_verification_failed_total` | Counter | ⏳ BACKLOG | Number of verifier failures, to be added in auto_repair_service.py |
+
+## Backlog Completion Guide
+
+### `ollama_inference_duration_seconds`
+
+At the end of the `_check_inference()` method in `apps/api/src/services/ollama_health_monitor.py`, add:
+
+```python
+from src.core.metrics import OLLAMA_INFERENCE_DURATION  # the Histogram must first be added to metrics.py
+OLLAMA_INFERENCE_DURATION.labels(host=host_label).observe(latency_ms / 1000)
+```
+
+### `post_execution_verification_*`
+
+In the verifier path of `apps/api/src/services/auto_repair_service.py`, add Counter `inc()` calls.
+First locate where the verifier runs (grep for `post_execution` or `verif` to find the entry point).
diff --git a/k8s/awoooi-prod/04-configmap.yaml.patch-consensus b/k8s/awoooi-prod/04-configmap.yaml.patch-consensus
new file mode 100644
index 00000000..42731e9f
--- /dev/null
+++ b/k8s/awoooi-prod/04-configmap.yaml.patch-consensus
@@ -0,0 +1,28 @@
+# ============================================================================
+# PATCH: P2.4 enable the 12-Agent ConsensusEngine
+# Date: 2026-04-26 (Taipei time zone)
+# Owner: ogt + Claude Sonnet 4.6
+# ADR reference: ADR-095, plan_complete_v3.md P2.4
+# Description:
+#   Once ENABLE_12AGENT_CONSENSUS is set to "true", the decision path for P0/P1 incidents
+#   calls ConsensusEngine, which combines the opinions of four experts: SRE/Security/Cost/Performance.
+#   Consensus score ≥0.6 → READY (may auto-execute); <0.6 → fallback to expert_analyze
+# Scope of impact:
+#   - incident_analysis_sweeper: P0/P1 incidents call get_or_create_decision_with_consensus
+#   - decision_manager: gated by the new ENABLE_12AGENT_CONSENSUS flag
+# ⚠️ Note: ConsensusEngine calls Ollama/NIM; confirm the AI services are available before enabling
+# ⚠️ This patch is for review only; apply it manually after the commander (統帥) approves
+# ============================================================================
+#
+# Add the following line to /Users/ogt/awoooi/k8s/awoooi-prod/04-configmap.yaml
+# Suggested position: after the TG_GROUP_CUTOVER line
+#
+# --- added content ---
+  # 2026-04-26 P2.4 ogt + Claude Sonnet 4.6: enable the 12-Agent ConsensusEngine (ADR-095)
+  # P0/P1 incidents go through ConsensusEngine → 4 experts vote in parallel → consensus ≥0.6 auto-executes
+  ENABLE_12AGENT_CONSENSUS: "true"
+# --- end of added content ---
+#
+# Usage (apply manually after user review):
+#   kubectl -n awoooi-prod apply -f k8s/awoooi-prod/04-configmap.yaml
+#   kubectl -n awoooi-prod rollout restart deployment/awoooi-api
diff --git a/ops/monitoring/grafana/dashboards/ollama_failover.json b/ops/monitoring/grafana/dashboards/ollama_failover.json
new file mode 100644
index 00000000..2fda28c1
--- /dev/null
+++ b/ops/monitoring/grafana/dashboards/ollama_failover.json
@@ -0,0 +1,295 @@
+{
+  "__inputs": [],
+  "__requires": [
+    {
+      "type": "grafana",
+      "id": "grafana",
+      "name": "Grafana",
+      "version": "10.0.0"
+    },
+    {
+      "type": "panel",
+      "id": "stat",
+      "name": "Stat",
"version": "" + }, + { + "type": "panel", + "id": "timeseries", + "name": "Time series", + "version": "" + }, + { + "type": "panel", + "id": "piechart", + "name": "Pie chart", + "version": "" + }, + { + "type": "panel", + "id": "barchart", + "name": "Bar chart", + "version": "" + } + ], + "annotations": { + "list": [] + }, + "description": "Ollama 容災監控 — 可用性、推理延遲、AI 路由分布、Failover/Recovery 觸發 | P2.3 2026-04-26 台北時區", + "editable": true, + "fiscalYearStartMonth": 0, + "graphTooltip": 1, + "id": null, + "links": [], + "panels": [ + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { "color": "red", "value": null }, + { "color": "yellow", "value": 50 }, + { "color": "green", "value": 100 } + ] + }, + "unit": "percent", + "min": 0, + "max": 100 + }, + "overrides": [] + }, + "gridPos": { "h": 6, "w": 12, "x": 0, "y": 0 }, + "id": 1, + "options": { + "colorMode": "background", + "graphMode": "none", + "justifyMode": "center", + "orientation": "auto", + "reduceOptions": { + "calcs": ["lastNotNull"], + "fields": "", + "values": false + }, + "textMode": "auto" + }, + "title": "Ollama 可用性", + "description": "up{job=~\"ollama_111|ollama_188\"} × 100\n- 綠色 100% = 主機在線\n- 紅色 0% = 主機離線(容災應已觸發)\n\n資料來源: Prometheus scrape job ollama_111 / ollama_188", + "targets": [ + { + "datasource": { "type": "prometheus", "uid": "${datasource}" }, + "expr": "up{job=~\"ollama_111|ollama_188\"} * 100", + "legendFormat": "{{ job }}", + "refId": "A" + } + ], + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "fieldConfig": { + "defaults": { + "color": { "mode": "palette-classic" }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "延遲 (秒)", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 10, + 
"gradientMode": "none", + "hideFrom": { "legend": false, "tooltip": false, "viz": false }, + "lineInterpolation": "smooth", + "lineWidth": 2, + "pointSize": 5, + "scaleDistribution": { "type": "linear" }, + "showPoints": "never", + "spanNulls": false + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { "color": "green", "value": null }, + { "color": "yellow", "value": 10 }, + { "color": "red", "value": 30 } + ] + }, + "unit": "s" + }, + "overrides": [] + }, + "gridPos": { "h": 6, "w": 12, "x": 12, "y": 0 }, + "id": 2, + "options": { + "legend": { + "calcs": ["lastNotNull", "max"], + "displayMode": "table", + "placement": "bottom" + }, + "tooltip": { "mode": "multi", "sort": "none" } + }, + "title": "推理延遲 P50 / P99", + "description": "histogram_quantile(0.5/0.99, rate(ollama_inference_duration_seconds_bucket[5m]))\n- P50 > 10s = SLOW 門檻\n- P99 > 30s = DEGRADED 門檻,觸發 failover\n\n⚠️ BACKLOG: ollama_inference_duration_seconds_bucket 尚未暴露,面板會顯示 No Data 直到 Part 3 backlog 補完", + "targets": [ + { + "datasource": { "type": "prometheus", "uid": "${datasource}" }, + "expr": "histogram_quantile(0.5, rate(ollama_inference_duration_seconds_bucket[5m]))", + "legendFormat": "P50", + "refId": "A" + }, + { + "datasource": { "type": "prometheus", "uid": "${datasource}" }, + "expr": "histogram_quantile(0.99, rate(ollama_inference_duration_seconds_bucket[5m]))", + "legendFormat": "P99", + "refId": "B" + } + ], + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "fieldConfig": { + "defaults": { + "color": { "mode": "palette-classic" }, + "custom": { + "hideFrom": { "legend": false, "tooltip": false, "viz": false } + }, + "mappings": [], + "unit": "reqps" + }, + "overrides": [] + }, + "gridPos": { "h": 6, "w": 12, "x": 0, "y": 6 }, + "id": 3, + "options": { + "displayLabels": ["name", "percent"], + "legend": { + "displayMode": "table", + "placement": "right", + "values": ["value", "percent"] + }, + 
"pieType": "pie", + "tooltip": { "mode": "single", "sort": "none" } + }, + "title": "AI Provider 路由分布", + "description": "sum by (provider) (rate(ai_router_selected_provider_total[5m]))\n- 正常狀態: ollama 佔大多數\n- failover 中: gemini / ollama_188 比例上升\n- 全走 gemini = 111 完全 offline\n\n資料來源: OLLAMA_FAILOVER_TRIGGERED_TOTAL + AI_ROUTER_PROVIDER_TOTAL (src/core/metrics.py)", + "targets": [ + { + "datasource": { "type": "prometheus", "uid": "${datasource}" }, + "expr": "sum by (provider) (rate(ai_router_selected_provider_total[5m]))", + "legendFormat": "{{ provider }}", + "refId": "A" + } + ], + "type": "piechart" + }, + { + "datasource": { + "type": "prometheus", + "uid": "${datasource}" + }, + "fieldConfig": { + "defaults": { + "color": { "mode": "palette-classic" }, + "custom": { + "axisLabel": "次數/小時", + "fillOpacity": 80, + "gradientMode": "none", + "hideFrom": { "legend": false, "tooltip": false, "viz": false }, + "lineWidth": 1 + }, + "mappings": [], + "unit": "short" + }, + "overrides": [ + { + "matcher": { "id": "byName", "options": "Failover" }, + "properties": [ + { "id": "color", "value": { "fixedColor": "orange", "mode": "fixed" } } + ] + }, + { + "matcher": { "id": "byName", "options": "Recovery" }, + "properties": [ + { "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } } + ] + } + ] + }, + "gridPos": { "h": 6, "w": 12, "x": 12, "y": 6 }, + "id": 4, + "options": { + "barRadius": 0.05, + "barWidth": 0.7, + "groupWidth": 0.7, + "legend": { + "calcs": ["sum"], + "displayMode": "list", + "placement": "bottom" + }, + "orientation": "auto", + "tooltip": { "mode": "multi", "sort": "none" }, + "xTickLabelRotation": 0 + }, + "title": "Failover / Recovery 觸發次數", + "description": "橘色 = ollama_failover_triggered_total (切離 111)\n綠色 = ollama_recovery_triggered_total (切回 111)\n\n正常狀態兩條都接近 0。\n橘升後緊跟綠升 = auto recovery 正常工作。\n橘升但綠不升 = OllamaRecoveryStuck,看 alert。\n\n資料來源: src/core/metrics.py OLLAMA_FAILOVER_TRIGGERED_TOTAL / OLLAMA_RECOVERY_TRIGGERED_TOTAL", + 
"targets": [ + { + "datasource": { "type": "prometheus", "uid": "${datasource}" }, + "expr": "sum(rate(ollama_failover_triggered_total[1h])) * 3600", + "legendFormat": "Failover", + "refId": "A" + }, + { + "datasource": { "type": "prometheus", "uid": "${datasource}" }, + "expr": "sum(rate(ollama_recovery_triggered_total[1h])) * 3600", + "legendFormat": "Recovery", + "refId": "B" + } + ], + "type": "barchart" + } + ], + "refresh": "30s", + "schemaVersion": 39, + "tags": ["ollama", "failover", "aiops", "p2.3"], + "templating": { + "list": [ + { + "current": {}, + "hide": 0, + "includeAll": false, + "label": "Datasource", + "multi": false, + "name": "datasource", + "options": [], + "query": "prometheus", + "refresh": 1, + "regex": "", + "type": "datasource" + } + ] + }, + "time": { "from": "now-3h", "to": "now" }, + "timepicker": {}, + "timezone": "Asia/Taipei", + "title": "Ollama 容災監控", + "uid": "ollama-failover-p23", + "version": 1 +}