feat(ops): Ollama failover runbook + Grafana dashboard + Consensus K8s ConfigMap patch

Wave 6 P2.3 ops support files + tool-expert deployment docs:

Added:
- docs/runbooks/RUNBOOK-OLLAMA-FAILOVER.md (240 lines)
  · Verification steps for the three iron rules (automatic switch to Gemini / automatic switch back / quota circuit breaker)
  · Complete failover/recovery SOP
  · Troubleshooting checklist (Ollama 111/188 unreachable, Gemini quota overrun, etc.)
- ops/monitoring/grafana/dashboards/ollama_failover.json (295 lines)
  · 4 panels: current primary / failover events / quota usage / health status
  · Corresponding P2.3 metrics: OLLAMA_FAILOVER_TRIGGERED_TOTAL / GEMINI_DAILY_CALL_COUNT
- k8s/awoooi-prod/04-configmap.yaml.patch-consensus
  · ENABLE_12AGENT_CONSENSUS / ENABLE_AIOPS_P2_FUSION feature flag patch

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: tool-expert agent (Wave 6) <noreply@anthropic.com>
Your Name
2026-04-27 08:11:20 +08:00
parent 1096da12ae
commit 1ab6786ce3
3 changed files with 563 additions and 0 deletions


@@ -0,0 +1,240 @@
# RUNBOOK-OLLAMA-FAILOVER.md
# Ollama Failover Monitoring Runbook
# 2026-04-26 P2.3 by Claude Sonnet 4.6 (tool-expert)
# Alert rules: ops/monitoring/ollama_health_rules.yaml
# Dashboard: ops/monitoring/grafana/dashboards/ollama_failover.json
---
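## Using the Grafana Dashboard
Dashboard: `Ollama 容災監控` (uid: `ollama-failover-p23`).
To import: Grafana UI → Dashboards → Import → Upload JSON file → select `ops/monitoring/grafana/dashboards/ollama_failover.json`
### Panel 1: Ollama Availability (Stat)
**What to look at**: `up{job=~"ollama_111|ollama_188"}` × 100, the scrape liveness of each Ollama host.
| Color | Meaning |
|------|------|
| Green 100% | Prometheus probe succeeds, host online |
| Yellow 50% | One host offline, the other online (failover in progress) |
| Red 0% | Both hosts offline, high risk |
**Note**: this panel reflects Prometheus scrape state and requires the scrape jobs to be named `ollama_111` / `ollama_188`. The query returns one series per job, so each stat reads 0% or 100% for its own host; the 50% row applies only when the two series are averaged.
The configuration file is at `ops/monitoring/generated/prometheus-scrape-generated.yaml`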
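The color table above can be read as a small threshold function. The sketch below is illustrative only: `availability_percent` and `availability_color` are hypothetical helper names, and averaging both hosts into a single percentage is an assumption made to reproduce the 50% row.

```python
from typing import Dict


def availability_percent(up_by_job: Dict[str, int]) -> float:
    """Average per-job `up` samples (each 0 or 1) into one percentage."""
    if not up_by_job:
        raise ValueError("no scrape targets supplied")
    return 100.0 * sum(up_by_job.values()) / len(up_by_job)


def availability_color(percent: float) -> str:
    """Apply the panel thresholds: red below 50, yellow from 50, green at 100."""
    if percent >= 100:
        return "green"
    if percent >= 50:
        return "yellow"
    return "red"


if __name__ == "__main__":
    sample = {"ollama_111": 0, "ollama_188": 1}  # 111 down, 188 up
    print(availability_color(availability_percent(sample)))
```

Keeping the thresholds in one function makes it easy to compare them against the `thresholds.steps` values in the dashboard JSON.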
---
### Panel 2: Inference Latency P50 / P99 (Time Series)
**What to look at**: inference latency quantiles.
| Threshold | Meaning |
|------|------|
| < 10s (P50) | HEALTHY: 111 in normal use |
| 10-30s (P50) | SLOW: the system has switched to Gemini |
| > 30s (P99) | DEGRADED: failover should trigger |
**⚠️ BACKLOG warning**: `ollama_inference_duration_seconds_bucket` is not yet exposed by the API (`_check_inference()` still needs a `Histogram.observe()` call).
A "No Data" panel is expected until that backlog item is completed.
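The threshold table can be expressed as a classifier. This is a minimal sketch, not the health monitor's actual logic: `classify_latency` is a hypothetical name, and checking the P99 rule before the P50 rule is an assumption about precedence.

```python
def classify_latency(p50_seconds: float, p99_seconds: float) -> str:
    """Map latency quantiles to the states used in the table above."""
    if p99_seconds > 30:       # P99 above 30s: failover should trigger
        return "DEGRADED"
    if p50_seconds >= 10:      # P50 in the 10-30s band: traffic goes to Gemini
        return "SLOW"
    return "HEALTHY"           # P50 below 10s: 111 serves normally


if __name__ == "__main__":
    for p50, p99 in [(2.0, 8.0), (12.0, 25.0), (15.0, 45.0)]:
        print(p50, p99, classify_latency(p50, p99))
```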
---
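### Panel 3: AI Provider Routing Distribution (Pie Chart)
**What to look at**: the share of requests routed to each provider over the past 5 minutes.
| Distribution | Meaning |
|------|------|
| ollama > 90% | Normal, 111 healthy |
| gemini is the majority | 111 is SLOW/DEGRADED/OFFLINE, failover in progress |
| ollama_188 appears | Fallback after Gemini quota exhaustion, or 111 and Gemini failed at the same time |
| all nemotron/claude | Extreme case: every primary provider has failed |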
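To make the distribution table concrete, the sketch below turns per-provider request rates into shares and applies the same rules. It assumes the `provider` label values are exactly `ollama`, `gemini`, `ollama_188`, `nemotron` and `claude` as written in the table; `provider_shares` and `routing_state` are hypothetical names.

```python
from typing import Dict

PRIMARY_PROVIDERS = ("ollama", "gemini", "ollama_188")  # assumed label values


def provider_shares(rates: Dict[str, float]) -> Dict[str, float]:
    """Convert per-provider request rates into fractions of the total."""
    total = sum(rates.values())
    if total <= 0:
        return {name: 0.0 for name in rates}
    return {name: rate / total for name, rate in rates.items()}


def routing_state(rates: Dict[str, float]) -> str:
    """Interpret the pie chart the way the table describes it."""
    if sum(rates.values()) <= 0:
        return "no traffic"
    shares = provider_shares(rates)
    if all(shares.get(p, 0.0) == 0.0 for p in PRIMARY_PROVIDERS):
        return "all primary providers failed"
    if shares.get("ollama_188", 0.0) > 0.0:
        return "188 fallback in use"
    if shares.get("ollama", 0.0) > 0.9:
        return "normal"
    if shares.get("gemini", 0.0) > 0.5:
        return "failover to gemini"
    return "mixed"


if __name__ == "__main__":
    print(routing_state({"ollama": 9.5, "gemini": 0.5}))
```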
---
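### Panel 4: Failover / Recovery Trigger Count (Bar Chart)
**What to look at**: the number of failover and recovery triggers per hour.
| Pattern | Meaning |
|------|------|
| Both series near 0 | Normal, 111 running stably |
| Orange rises, then green follows | Auto recovery is working: traffic switched away and then back |
| Orange rises, green stays flat | **`OllamaRecoveryStuck` alert**, see the runbook below |
| Orange stays high (>5/h) | **`OllamaFailoverFrequent` alert**, 111 is unstable |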
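The four patterns can be sketched as a decision function over the two hourly values. The function name and the `quiet_below` cutoff for "near 0" are assumptions for illustration; the real alerts are evaluated by the Prometheus rules, not by this code.

```python
def interpret_failover_panel(
    failovers_per_hour: float,
    recoveries_per_hour: float,
    quiet_below: float = 0.5,  # assumed cutoff for "near 0"
) -> str:
    """Map the orange (failover) and green (recovery) bars to a reading."""
    if failovers_per_hour > 5:
        return "failover frequent: 111 is unstable"
    if failovers_per_hour < quiet_below:
        return "normal"
    if recoveries_per_hour < quiet_below:
        return "recovery stuck: switched away but not back"
    return "auto recovery working"


if __name__ == "__main__":
    print(interpret_failover_panel(2.0, 0.0))
```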
---
## Alert Runbook
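### `OllamaInstanceDown`: Ollama host offline
**Trigger**: `up{job=~"ollama_111|ollama_188"} == 0` for 2 minutes.
**Impact assessment**:
- The system should already have switched to Gemini (check Panel 3 to confirm)
- Check Panel 4 for a rising failover count
**Troubleshooting steps**: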
```bash
# Step 1: confirm the hosts are reachable
ping -c 3 192.168.0.111
ping -c 3 192.168.0.188
# Step 2: SSH into each host and check the ollama service status
ssh wooo@192.168.0.111 'systemctl status ollama'
ssh wooo@192.168.0.188 'systemctl status ollama'
# Step 3: read the most recent ollama journal log
ssh wooo@192.168.0.111 'journalctl -u ollama -n 50 --no-pager'
# Step 4: check GPU memory (111 is the GPU host)
ssh wooo@192.168.0.111 'nvidia-smi'
# Step 5: if the service is down, restart it
ssh wooo@192.168.0.111 'systemctl restart ollama'
# Wait 30s, then confirm the service started
ssh wooo@192.168.0.111 'systemctl status ollama'
```
**Recovery confirmation**:
Panel 1 turns green and Panel 4 shows a rising recovery count, which means auto recovery has switched traffic back.
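Besides `ping`, it can help to check that the Ollama port itself accepts connections, since a host can answer ICMP while the service is down. The sketch below is a hedged example: `is_port_open` is a hypothetical helper, and port 11434 is Ollama's default, which these hosts may have overridden.

```python
import socket


def is_port_open(host: str, port: int = 11434, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, unreachable hosts
        return False


if __name__ == "__main__":
    for host in ("192.168.0.111", "192.168.0.188"):
        print(host, "open" if is_port_open(host) else "closed")
```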
---
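### `OllamaFailoverFrequent`: failover frequency too high
**Trigger**: `rate(ollama_failover_triggered_total[1h]) > 5` for 10 minutes (intended meaning: more than 5 switches per hour). `rate()` returns a per-second value, so a per-hour threshold corresponds to `increase(ollama_failover_triggered_total[1h]) > 5`; confirm the expression in `ops/monitoring/ollama_health_rules.yaml`.
**Impact assessment**:
- The service itself is still available (Gemini is taking over)
- Gemini quota is consumed faster, with a risk of triggering `GeminiQuotaApproaching`
**Troubleshooting steps**: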
```bash
# Step 1: check the recent state of 111 (is it flapping between OFFLINE and HEALTHY?)
ssh wooo@192.168.0.111 'nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv'
# Step 2: search the API log for the failover cause
kubectl logs -n awoooi-prod deploy/api --since=30m | grep "ollama_failover_triggered"
# Step 3: check inference latency (is it sitting near the SLOW boundary for long periods?)
kubectl logs -n awoooi-prod deploy/api --since=30m | grep "ollama_health_checked"
# Step 4: if it is a GPU memory problem, restart ollama to clear the model cache
ssh wooo@192.168.0.111 'systemctl restart ollama'
```
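Because PromQL `rate()` is expressed per second, comparing it with an hourly threshold needs a unit conversion, which is why the dashboard's Panel 4 query multiplies by 3600. A minimal arithmetic sketch (function names are hypothetical):

```python
SECONDS_PER_HOUR = 3600


def hourly_count(rate_per_second: float) -> float:
    """Convert a per-second rate into an expected count per hour."""
    return rate_per_second * SECONDS_PER_HOUR


def failover_frequent(rate_per_second: float, max_per_hour: float = 5.0) -> bool:
    """True when the per-second rate corresponds to more than max_per_hour."""
    return hourly_count(rate_per_second) > max_per_hour


if __name__ == "__main__":
    # Six failovers in one hour is a per-second rate of 6 / 3600.
    print(failover_frequent(6 / SECONDS_PER_HOUR))
```

Read this way, a raw `rate(...) > 5` would mean more than five failovers per second, far above the hourly threshold the alert describes.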
---
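### `OllamaRecoveryStuck`: auto recovery stalled
**Trigger**: `ollama_health_status{host="111"} == 1 AND ollama_current_primary_is_ollama == 0` for 5 minutes
(111 is HEALTHY again, but routing still goes to Gemini).
**Impact assessment**:
- API functionality is normal (Gemini is serving)
- Gemini quota keeps being consumed while the GPU on 111 sits idle
**Troubleshooting steps**: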
```bash
# Step 1: confirm OllamaAutoRecoveryService is running
kubectl logs -n awoooi-prod deploy/api --since=10m | grep "ollama_auto_recovery"
# Step 2: check the recovery service state
kubectl logs -n awoooi-prod deploy/api --since=10m | grep -E "ollama_auto_recovery_started|ollama_auto_recovery_stopped|ollama_auto_recovery_loop_error"
# Step 3: read the current_primary Redis key
kubectl exec -n awoooi-prod deploy/api -- python -c "
import asyncio
from src.core.redis_client import get_redis
async def check():
    r = get_redis()
    val = await r.get('ollama:current_primary')
    print('current_primary:', val)
asyncio.run(check())
"
# Step 4: if the recovery service has died, restart the API pod (this re-runs the lifespan startup)
kubectl rollout restart deployment/api -n awoooi-prod
kubectl rollout status deployment/api -n awoooi-prod
```
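The trigger combines two gauges with a hold time. The sketch below shows one way to evaluate that condition over a list of samples; the `Sample` layout and `recovery_stuck` name are assumptions for illustration, and Prometheus evaluates the real rule with its own `for:` semantics.

```python
from typing import Iterable, Optional, Tuple

# (unix_time, ollama_health_status for 111, ollama_current_primary_is_ollama)
Sample = Tuple[float, int, int]


def recovery_stuck(samples: Iterable[Sample], hold_seconds: float = 300.0) -> bool:
    """True if 111 was healthy but not primary for hold_seconds without a break."""
    stuck_since: Optional[float] = None
    for ts, healthy, primary_is_ollama in sorted(samples):
        if healthy == 1 and primary_is_ollama == 0:
            if stuck_since is None:
                stuck_since = ts          # condition just became true
            if ts - stuck_since >= hold_seconds:
                return True
        else:
            stuck_since = None            # condition broke, restart the clock
    return False


if __name__ == "__main__":
    minute_samples = [(60.0 * i, 1, 0) for i in range(6)]  # 0s through 300s
    print(recovery_stuck(minute_samples))
```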
---
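### `GeminiQuotaApproaching`: Gemini quota >80%
**Trigger**: `gemini_daily_call_count / gemini_daily_quota > 0.8` for 5 minutes.
**Note**: `gemini_daily_quota` comes from `settings.GEMINI_DAILY_QUOTA` (default 1000);
`gemini_daily_call_count` is read from the Redis key `ollama:gemini_daily_count:{YYYY-MM-DD}` and used to refresh the Gauge.
**Impact assessment**:
- The Gemini quota for the day is about to run out
- Once exhausted, the system automatically switches to the 188 CPU-only fallback (qwen2.5:7b-instruct), which is slower
**Action steps**: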
```bash
# Step 1: confirm today's Gemini usage
kubectl exec -n awoooi-prod deploy/api -- python -c "
import asyncio, datetime
from src.core.redis_client import get_redis
async def check():
    r = get_redis()
    today = datetime.date.today().isoformat()
    val = await r.get(f'ollama:gemini_daily_count:{today}')
    print(f'gemini_daily_count[{today}]:', val)
asyncio.run(check())
"
# Step 2: check whether 111 can recover quickly (so traffic switches back to Ollama)
ssh wooo@192.168.0.111 'systemctl status ollama && nvidia-smi'
# Step 3: to raise the quota, change the setting
# look for GEMINI_DAILY_QUOTA in k8s/awoooi-prod/04-configmap.yaml.patch-*
# then kubectl apply + rollout restart
# Step 4: emergency manual counter reset (use with care, only after confirming a miscount)
# kubectl exec -n awoooi-prod deploy/api -- redis-cli DEL "ollama:gemini_daily_count:$(date +%Y-%m-%d)"
```
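The quota check reduces to a key format and a ratio. The following sketch reproduces both under the values stated above (key pattern `ollama:gemini_daily_count:{YYYY-MM-DD}`, default quota 1000, threshold 0.8); the function names are hypothetical, and the 5-minute hold is left to the alert rule.

```python
import datetime


def gemini_count_key(day: datetime.date) -> str:
    """Build the Redis key that holds the Gemini call count for one day."""
    return f"ollama:gemini_daily_count:{day.isoformat()}"


def quota_ratio(call_count: int, daily_quota: int = 1000) -> float:
    """Fraction of the daily quota already used."""
    if daily_quota <= 0:
        raise ValueError("daily_quota must be positive")
    return call_count / daily_quota


def quota_approaching(call_count: int, daily_quota: int = 1000,
                      threshold: float = 0.8) -> bool:
    """True when usage is strictly above the alert threshold."""
    return quota_ratio(call_count, daily_quota) > threshold


if __name__ == "__main__":
    print(gemini_count_key(datetime.date.today()))
    print(quota_approaching(850))
```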
---
## Metric List
| Metric | Type | Status | Description |
|--------|------|------|------|
| `up{job="ollama_111"}` | Gauge | ✅ existing | Prometheus scrape liveness |
| `up{job="ollama_188"}` | Gauge | ✅ existing | Prometheus scrape liveness |
| `ollama_failover_triggered_total` | Counter | ✅ added in P2.3 | Number of failover switches; labels: from_provider, to_provider |
| `ollama_recovery_triggered_total` | Counter | ✅ added in P2.3 | Number of recovery switch-backs; labels: from_provider |
| `ollama_health_status{host}` | Gauge | ✅ added in P2.3 | Health status: 1=healthy, 0=not_healthy |
| `ollama_current_primary_is_ollama` | Gauge | ✅ added in P2.3 | 1=primary is ollama, 0=in failover |
| `ai_router_selected_provider_total` | Counter | ✅ added in P2.3 | Number of AI router selections; labels: provider |
| `gemini_daily_call_count` | Gauge | ✅ added in P2.3 | Gemini calls made today |
| `gemini_daily_quota` | Gauge | ✅ added in P2.3 | Gemini daily quota |
| `ollama_inference_duration_seconds` | Histogram | ⏳ BACKLOG | Inference latency distribution; needs an observe call in `_check_inference()` |
| `post_execution_verification_total` | Counter | ⏳ BACKLOG | Verifier run count; to be added in auto_repair_service.py |
| `post_execution_verification_failed_total` | Counter | ⏳ BACKLOG | Verifier failure count; to be added in auto_repair_service.py |
## Backlog Completion Guide
### `ollama_inference_duration_seconds`
At the end of the `_check_inference()` method in `apps/api/src/services/ollama_health_monitor.py`, add:
```python
from src.core.metrics import OLLAMA_INFERENCE_DURATION  # add the Histogram to metrics.py first
OLLAMA_INFERENCE_DURATION.labels(host=host_label).observe(latency_ms / 1000)
```
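To show what an `observe()` call records, here is a pure-Python sketch of cumulative `le` buckets. It is not the Prometheus client library: `LatencyHistogram` and the bucket bounds are assumptions chosen around the 10s and 30s thresholds used in this runbook.

```python
import bisect
from typing import List, Sequence


class LatencyHistogram:
    """Toy histogram with Prometheus-style cumulative `le` buckets."""

    def __init__(self, upper_bounds: Sequence[float]) -> None:
        self.upper_bounds = sorted(upper_bounds)
        # One counter per bound plus a final slot for +Inf.
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)
        self.total_seconds = 0.0

    def observe(self, seconds: float) -> None:
        # First bound that is >= the value, i.e. the smallest `le` it fits.
        index = bisect.bisect_left(self.upper_bounds, seconds)
        self.bucket_counts[index] += 1
        self.total_seconds += seconds

    def cumulative(self) -> List[int]:
        """Counts as exposed in `_bucket` series: each includes lower buckets."""
        running, result = 0, []
        for count in self.bucket_counts:
            running += count
            result.append(running)
        return result


if __name__ == "__main__":
    hist = LatencyHistogram([1, 5, 10, 30])  # assumed bounds, in seconds
    for latency_ms in (800, 12000, 45000):
        hist.observe(latency_ms / 1000)      # convert ms to seconds first
    print(hist.cumulative())
```

The division by 1000 matters because the metric name ends in `_seconds` and the panel thresholds are in seconds.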
### `post_execution_verification_*`
In the verifier path of `apps/api/src/services/auto_repair_service.py`, add a Counter `inc()` call.
First locate where the verifier runs (grep for `post_execution` or `verif` to find the entry point).


@@ -0,0 +1,28 @@
# ============================================================================
# PATCH: P2.4 enable the 12-Agent ConsensusEngine
# Date: 2026-04-26 (Taipei time)
# Owner: ogt + Claude Sonnet 4.6
# ADR reference: ADR-095, plan_complete_v3.md P2.4
# Description:
# With ENABLE_12AGENT_CONSENSUS set to "true", the decision path for P0/P1 incidents
# calls ConsensusEngine, which combines the opinions of four experts: SRE/Security/Cost/Performance.
# Consensus score ≥0.6 → READY (can auto-execute); <0.6 → fallback to expert_analyze
# Scope of impact:
# - incident_analysis_sweeper: P0/P1 incidents call get_or_create_decision_with_consensus
# - decision_manager: gated by the ENABLE_12AGENT_CONSENSUS flag
# ⚠️ Note: ConsensusEngine calls Ollama/NIM; confirm the AI services are available before enabling
# ⚠️ This patch is for review only; apply it manually after the commander (統帥) approves
# ============================================================================
#
# Add the following line to /Users/ogt/awoooi/k8s/awoooi-prod/04-configmap.yaml
# Suggested position: after the TG_GROUP_CUTOVER line
#
# --- Added content ---
# 2026-04-26 P2.4 ogt + Claude Sonnet 4.6: enable the 12-Agent ConsensusEngine (ADR-095)
# P0/P1 incidents go through ConsensusEngine → 4 experts vote in parallel → consensus ≥0.6 auto-executes
ENABLE_12AGENT_CONSENSUS: "true"
# --- End of added content ---
#
# Usage (apply manually after user review):
# kubectl -n awoooi-prod apply -f k8s/awoooi-prod/04-configmap.yaml
# kubectl -n awoooi-prod rollout restart deployment/awoooi-api


@@ -0,0 +1,295 @@
{
"__inputs": [],
"__requires": [
{
"type": "grafana",
"id": "grafana",
"name": "Grafana",
"version": "10.0.0"
},
{
"type": "panel",
"id": "stat",
"name": "Stat",
"version": ""
},
{
"type": "panel",
"id": "timeseries",
"name": "Time series",
"version": ""
},
{
"type": "panel",
"id": "piechart",
"name": "Pie chart",
"version": ""
},
{
"type": "panel",
"id": "barchart",
"name": "Bar chart",
"version": ""
}
],
"annotations": {
"list": []
},
"description": "Ollama failover monitoring: availability, inference latency, AI routing distribution, failover/recovery triggers | P2.3 2026-04-26 Taipei time",
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 1,
"id": null,
"links": [],
"panels": [
{
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "red", "value": null },
{ "color": "yellow", "value": 50 },
{ "color": "green", "value": 100 }
]
},
"unit": "percent",
"min": 0,
"max": 100
},
"overrides": []
},
"gridPos": { "h": 6, "w": 12, "x": 0, "y": 0 },
"id": 1,
"options": {
"colorMode": "background",
"graphMode": "none",
"justifyMode": "center",
"orientation": "auto",
"reduceOptions": {
"calcs": ["lastNotNull"],
"fields": "",
"values": false
},
"textMode": "auto"
},
"title": "Ollama Availability",
"description": "up{job=~\"ollama_111|ollama_188\"} × 100\n- Green 100% = host online\n- Red 0% = host offline (failover should already have triggered)\n\nData source: Prometheus scrape jobs ollama_111 / ollama_188",
"targets": [
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"expr": "up{job=~\"ollama_111|ollama_188\"} * 100",
"legendFormat": "{{ job }}",
"refId": "A"
}
],
"type": "stat"
},
{
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"color": { "mode": "palette-classic" },
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "Latency (s)",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 10,
"gradientMode": "none",
"hideFrom": { "legend": false, "tooltip": false, "viz": false },
"lineInterpolation": "smooth",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": { "type": "linear" },
"showPoints": "never",
"spanNulls": false
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 10 },
{ "color": "red", "value": 30 }
]
},
"unit": "s"
},
"overrides": []
},
"gridPos": { "h": 6, "w": 12, "x": 12, "y": 0 },
"id": 2,
"options": {
"legend": {
"calcs": ["lastNotNull", "max"],
"displayMode": "table",
"placement": "bottom"
},
"tooltip": { "mode": "multi", "sort": "none" }
},
"title": "Inference Latency P50 / P99",
"description": "histogram_quantile(0.5/0.99, rate(ollama_inference_duration_seconds_bucket[5m]))\n- P50 > 10s = SLOW threshold\n- P99 > 30s = DEGRADED threshold, triggers failover\n\n⚠ BACKLOG: ollama_inference_duration_seconds_bucket is not exposed yet; the panel shows No Data until the Part 3 backlog is completed",
"targets": [
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"expr": "histogram_quantile(0.5, rate(ollama_inference_duration_seconds_bucket[5m]))",
"legendFormat": "P50",
"refId": "A"
},
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"expr": "histogram_quantile(0.99, rate(ollama_inference_duration_seconds_bucket[5m]))",
"legendFormat": "P99",
"refId": "B"
}
],
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"color": { "mode": "palette-classic" },
"custom": {
"hideFrom": { "legend": false, "tooltip": false, "viz": false }
},
"mappings": [],
"unit": "reqps"
},
"overrides": []
},
"gridPos": { "h": 6, "w": 12, "x": 0, "y": 6 },
"id": 3,
"options": {
"displayLabels": ["name", "percent"],
"legend": {
"displayMode": "table",
"placement": "right",
"values": ["value", "percent"]
},
"pieType": "pie",
"tooltip": { "mode": "single", "sort": "none" }
},
"title": "AI Provider Routing Distribution",
"description": "sum by (provider) (rate(ai_router_selected_provider_total[5m]))\n- Normal: ollama takes the large majority\n- During failover: the gemini / ollama_188 share rises\n- All gemini = 111 fully offline\n\nData source: OLLAMA_FAILOVER_TRIGGERED_TOTAL + AI_ROUTER_PROVIDER_TOTAL (src/core/metrics.py)",
"targets": [
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"expr": "sum by (provider) (rate(ai_router_selected_provider_total[5m]))",
"legendFormat": "{{ provider }}",
"refId": "A"
}
],
"type": "piechart"
},
{
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"fieldConfig": {
"defaults": {
"color": { "mode": "palette-classic" },
"custom": {
"axisLabel": "Count/hour",
"fillOpacity": 80,
"gradientMode": "none",
"hideFrom": { "legend": false, "tooltip": false, "viz": false },
"lineWidth": 1
},
"mappings": [],
"unit": "short"
},
"overrides": [
{
"matcher": { "id": "byName", "options": "Failover" },
"properties": [
{ "id": "color", "value": { "fixedColor": "orange", "mode": "fixed" } }
]
},
{
"matcher": { "id": "byName", "options": "Recovery" },
"properties": [
{ "id": "color", "value": { "fixedColor": "green", "mode": "fixed" } }
]
}
]
},
"gridPos": { "h": 6, "w": 12, "x": 12, "y": 6 },
"id": 4,
"options": {
"barRadius": 0.05,
"barWidth": 0.7,
"groupWidth": 0.7,
"legend": {
"calcs": ["sum"],
"displayMode": "list",
"placement": "bottom"
},
"orientation": "auto",
"tooltip": { "mode": "multi", "sort": "none" },
"xTickLabelRotation": 0
},
"title": "Failover / Recovery Trigger Count",
"description": "Orange = ollama_failover_triggered_total (switch away from 111)\nGreen = ollama_recovery_triggered_total (switch back to 111)\n\nNormal: both series near 0.\nAn orange rise followed closely by a green rise = auto recovery is working.\nOrange rises but green does not = OllamaRecoveryStuck, check the alert.\n\nData source: src/core/metrics.py OLLAMA_FAILOVER_TRIGGERED_TOTAL / OLLAMA_RECOVERY_TRIGGERED_TOTAL",
"targets": [
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"expr": "sum(rate(ollama_failover_triggered_total[1h])) * 3600",
"legendFormat": "Failover",
"refId": "A"
},
{
"datasource": { "type": "prometheus", "uid": "${datasource}" },
"expr": "sum(rate(ollama_recovery_triggered_total[1h])) * 3600",
"legendFormat": "Recovery",
"refId": "B"
}
],
"type": "barchart"
}
],
"refresh": "30s",
"schemaVersion": 39,
"tags": ["ollama", "failover", "aiops", "p2.3"],
"templating": {
"list": [
{
"current": {},
"hide": 0,
"includeAll": false,
"label": "Datasource",
"multi": false,
"name": "datasource",
"options": [],
"query": "prometheus",
"refresh": 1,
"regex": "",
"type": "datasource"
}
]
},
"time": { "from": "now-3h", "to": "now" },
"timepicker": {},
"timezone": "Asia/Taipei",
"title": "Ollama 容災監控",
"uid": "ollama-failover-p23",
"version": 1
}