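Add the Phase 15 LLMOps observability architecture decision record, covering:

- Three-layer observability architecture (Langfuse + SignOz + Sentry)
- Langfuse integration and Deep Linking implementation
- Redis Streams trace context propagation mechanism
- Sampling-rate strategy and cost estimate

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
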
# ADR-017: LLMOps Observability

> **Status**: ✅ Approved
> **Date**: 2026-03-26
> **Decision maker**: Commander (統帥)
> **Phase**: 15

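## Background

AWOOOI is an AI-first product, so LLM calls are a core business flow. Before Phase 15, however, observability had the following blind spots:

### Problem 1: LLM Calls Are Not Traced

- Prompt content is not recorded, so issues cannot be reproduced
- Token consumption is not measured, so costs risk running out of control
- The model fallback chain is invisible
- AI hallucinations cannot be traced back to their source

### Problem 2: Broken Trace Context

- Redis Streams does not propagate the OTEL trace ID automatically
- LLM calls handled by workers are disconnected from the originating request
- A frontend error cannot be traced back to the AI decision process

### Problem 3: Siloed Observability Systems

- Sentry (frontend errors) and SignOz (backend traces) are not linked to each other
- The same trace has to be searched for manually across three systems
- Troubleshooting is inefficient

---
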
## Decision

Adopt a **three-layer observability architecture (The Holy Trinity)** so that no trace link is broken anywhere along the request path:

```
┌────────────────────────────────────────────────────────────────────────────────┐
│                   AWOOOI Layered Observability Architecture                    │
├────────────────────────────────────────────────────────────────────────────────┤
│                                                                                │
│  Layer 3: AI decision layer       ← Langfuse (100% sampling)                   │
│  ┌──────────────────────────────────────────────────────────────────────────┐  │
│  │ Prompt tracing │ Token cost │ Agent trajectory │ Hallucination detection │  │
│  └──────────────────────────────────────────────────────────────────────────┘  │
│                                       ▲                                        │
│                                ⚠️ Redis Streams                                │
│                           (manual context injection)                           │
│                                       │                                        │
│  Layer 2: Infrastructure layer    ← SignOz (10% sampling, errors 100%)         │
│  ┌──────────────────────────────────────────────────────────────────────────┐  │
│  │ API latency │ K8s resources │ DB queries │ Backend errors                │  │
│  └──────────────────────────────────────────────────────────────────────────┘  │
│                                       ▲                                        │
│                  (propagated automatically via HTTP headers)                   │
│                                       │                                        │
│  Layer 1: Frontend user layer     ← Sentry (10% sampling, errors 100%)         │
│  ┌──────────────────────────────────────────────────────────────────────────┐  │
│  │ JS errors │ Session replay │ Performance metrics │ Crash stack traces    │  │
│  └──────────────────────────────────────────────────────────────────────────┘  │
│                                                                                │
└────────────────────────────────────────────────────────────────────────────────┘
```

### Core Components

| Component | Deployed at | Purpose |
|-----------|-------------|---------|
| Langfuse | 192.168.0.110:3100 | LLM call tracing, token cost |
| SignOz | 192.168.0.188:3301 | API traces, metrics, logs |
| Sentry | 192.168.0.110:9000 | Frontend errors, Session Replay |

---

## Technical Implementation

### 1. Langfuse Client (`apps/api/src/services/langfuse_client.py`)

A dedicated LLM tracing client, used as follows:

```python
# Trace an LLM call chain with the context manager
from src.services.langfuse_client import langfuse_trace

async with langfuse_trace("openclaw_decision") as trace:
    # Record the LLM generation
    trace.generation(
        name="ollama_call",
        model="qwen2.5:7b-instruct",
        input=prompt,
        output=result,
        usage={"input": 500, "output": 200},
    )

    # Record a score (used to track prompt quality)
    trace.score("response_quality", value=0.95)
```

**Key features**:
- `LangfuseTraceContext`: integrates the OTEL trace_id automatically
- `langfuse_trace()`: context manager wrapper
- `langfuse_observe()`: decorator that traces a function automatically
- Automatically injects the SignOz deep link into metadata

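
How a `langfuse_trace()` style wrapper might attach the OTEL trace ID and the SignOz deep link can be sketched as follows. This is a stdlib-only illustration and not the contents of `langfuse_client.py`: `TraceRecord`, the `otel_trace_id` parameter, and the `SIGNOZ_BASE_URL` constant are hypothetical, and a real client would presumably forward the recorded data to the Langfuse SDK instead of keeping it in memory.

```python
import asyncio
import contextlib
import uuid
from dataclasses import dataclass, field

SIGNOZ_BASE_URL = "http://192.168.0.188:3301"  # from the core components table


@dataclass
class TraceRecord:
    """Hypothetical in-memory stand-in for a Langfuse trace."""
    name: str
    otel_trace_id: str
    metadata: dict = field(default_factory=dict)
    generations: list = field(default_factory=list)
    scores: dict = field(default_factory=dict)

    def generation(self, **kwargs):
        # Record one LLM call (model, input, output, usage, ...).
        self.generations.append(kwargs)

    def score(self, name, value):
        # Record a quality score for this trace.
        self.scores[name] = value


@contextlib.asynccontextmanager
async def langfuse_trace(name, otel_trace_id=None):
    # Assumption: the caller passes the active OTEL trace ID. A random
    # 32-hex-digit ID is generated only so the sketch runs standalone.
    trace_id = otel_trace_id or uuid.uuid4().hex
    record = TraceRecord(name=name, otel_trace_id=trace_id)
    # Inject the SignOz deep link so the trace can be opened from Langfuse.
    record.metadata["otel_trace_id"] = trace_id
    record.metadata["signoz_trace_url"] = f"{SIGNOZ_BASE_URL}/trace/{trace_id}"
    try:
        yield record
    finally:
        # A real client would flush the record to Langfuse here.
        record.metadata["closed"] = True


async def demo():
    async with langfuse_trace("openclaw_decision", otel_trace_id="abc123") as trace:
        trace.generation(name="ollama_call", model="qwen2.5:7b-instruct",
                         usage={"input": 500, "output": 200})
        trace.score("response_quality", value=0.95)
    return trace
```

Running `asyncio.run(demo())` returns the finished record, whose `metadata` carries both the OTEL trace ID and the SignOz URL built from it.
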
### 2. Deep Linking (`apps/api/src/core/deep_linking.py`)

A URL generator that links the three systems to one another:

```python
from src.core.deep_linking import DeepLinking

# Get all available deep links
links = DeepLinking.get_all_links(
    otel_trace_id="abc123...",
    langfuse_trace_id="lf-xyz...",
    sentry_issue_id="12345",
)
# {
#     "signoz_trace": "http://192.168.0.188:3301/trace/abc123...",
#     "langfuse_trace": "http://192.168.0.110:3100/project/awoooi-openclaw/traces/lf-xyz...",
#     "sentry_issue": "http://192.168.0.110:9000/organizations/sentry/issues/12345/",
# }
```

**URL formats**:

| System | URL format |
|--------|------------|
| SignOz Trace | `http://192.168.0.188:3301/trace/{trace_id}` |
| Langfuse Trace | `http://192.168.0.110:3100/project/awoooi-openclaw/traces/{id}` |
| Sentry Issue | `http://192.168.0.110:9000/organizations/sentry/issues/{id}/` |

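
The URL formats above are enough to sketch what a generator like `DeepLinking.get_all_links` could look like. The version below is an illustration only, not the actual `deep_linking.py`: the module-level constants are hypothetical names, and skipping links whose ID is missing is an assumption based on the "all available deep links" comment.

```python
from typing import Dict, Optional

# Base URLs and slugs are taken from the URL format table.
SIGNOZ_URL = "http://192.168.0.188:3301"
LANGFUSE_URL = "http://192.168.0.110:3100"
SENTRY_URL = "http://192.168.0.110:9000"
LANGFUSE_PROJECT = "awoooi-openclaw"
SENTRY_ORG = "sentry"


class DeepLinking:
    """Illustrative URL builder (hypothetical implementation)."""

    @staticmethod
    def get_all_links(
        otel_trace_id: Optional[str] = None,
        langfuse_trace_id: Optional[str] = None,
        sentry_issue_id: Optional[str] = None,
    ) -> Dict[str, str]:
        links: Dict[str, str] = {}
        # Only emit a link when the corresponding ID is known.
        if otel_trace_id:
            links["signoz_trace"] = f"{SIGNOZ_URL}/trace/{otel_trace_id}"
        if langfuse_trace_id:
            links["langfuse_trace"] = (
                f"{LANGFUSE_URL}/project/{LANGFUSE_PROJECT}/traces/{langfuse_trace_id}"
            )
        if sentry_issue_id:
            links["sentry_issue"] = (
                f"{SENTRY_URL}/organizations/{SENTRY_ORG}/issues/{sentry_issue_id}/"
            )
        return links
```

Keeping the hosts in one place means a host or port change touches a single constant rather than every call site.
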
### 3. Redis Trace Context Propagation

Fixes the broken trace link across Redis Streams:

```python
from src.core.telemetry import get_trace_context, restore_trace_context

# Producer (webhooks.py): inject the trace context
async def publish_to_redis(signal: Signal):
    trace_ctx = get_trace_context()  # {"trace_id": "xxx", "span_id": "yyy"}
    payload = {
        **signal.dict(),
        "_trace_id": trace_ctx["trace_id"],
        "_span_id": trace_ctx["span_id"],
    }
    await redis.xadd("stream:signals", payload)

# Consumer (signal_worker.py): restore the trace context
async def consume_from_redis():
    # xreadgroup returns [(stream, [(entry_id, fields), ...]), ...];
    # string keys assume the client was created with decode_responses=True
    for _stream, entries in await redis.xreadgroup(...):
        for _entry_id, message in entries:
            trace_id = message.pop("_trace_id", None)
            span_id = message.pop("_span_id", None)

            # Rebuild the OTEL context (W3C traceparent format)
            with restore_trace_context(trace_id, span_id):
                # Langfuse inherits this trace_id while the signal is processed
                await process_signal(message)
```

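
The consumer comment above refers to the W3C traceparent format. As a rough, stdlib-only sketch of that encoding (not the actual helpers in `telemetry.py`, which presumably rely on the OpenTelemetry SDK), the hypothetical functions `build_traceparent` and `parse_traceparent` below show how a trace ID and span ID map to and from a single header-style value. Only version `00` of the format is handled.

```python
import re
from typing import Dict, Optional

# W3C traceparent, version "00":
#   00-<32 hex trace-id>-<16 hex parent-id>-<2 hex trace-flags>
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format the IDs as a traceparent value; flag 01 means 'sampled'."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(value: Optional[str]) -> Optional[Dict[str, str]]:
    """Return the IDs from a traceparent value, or None if it is malformed."""
    if not value:
        return None
    match = _TRACEPARENT_RE.fullmatch(value)
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid according to the specification.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return {"trace_id": trace_id, "span_id": span_id, "flags": flags}
```

Returning `None` for malformed input lets a worker fall back to starting a fresh trace instead of failing on a message that was published without trace fields.
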
### 4. Link Direction Diagram

```
Sentry Error
    │ tags.otel_trace_id
    │ contexts.signoz.trace_url
    ▼
SignOz Trace
    │ span.langfuse.trace_id
    │ span.langfuse.trace_url
    ▼
Langfuse Generation
    │ metadata.otel_trace_id
    │ metadata.signoz_trace_url
    ▲
    └── Bidirectional linking complete
```

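
The fields in the diagram can be collected into one structure to make clear which system carries which link. The sketch below is illustrative only: `build_cross_links` and the constant names are hypothetical, the field names follow the diagram, and the URLs follow the formats documented in the Deep Linking section.

```python
SIGNOZ_URL = "http://192.168.0.188:3301"
LANGFUSE_URL = "http://192.168.0.110:3100"
LANGFUSE_PROJECT = "awoooi-openclaw"


def build_cross_links(otel_trace_id: str, langfuse_trace_id: str) -> dict:
    """Return the link fields each system would carry for one request."""
    signoz_trace_url = f"{SIGNOZ_URL}/trace/{otel_trace_id}"
    langfuse_trace_url = (
        f"{LANGFUSE_URL}/project/{LANGFUSE_PROJECT}/traces/{langfuse_trace_id}"
    )
    return {
        # Sentry event: points forward to the SignOz trace.
        "sentry": {
            "tags": {"otel_trace_id": otel_trace_id},
            "contexts": {"signoz": {"trace_url": signoz_trace_url}},
        },
        # SignOz span attributes: point forward to the Langfuse trace.
        "signoz_span": {
            "langfuse.trace_id": langfuse_trace_id,
            "langfuse.trace_url": langfuse_trace_url,
        },
        # Langfuse metadata: points back to the SignOz trace.
        "langfuse_metadata": {
            "otel_trace_id": otel_trace_id,
            "signoz_trace_url": signoz_trace_url,
        },
    }
```
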
---

## Deployment Configuration

### Langfuse Self-Hosted (192.168.0.110:3100)

| Item | Value |
|------|-------|
| Host | 192.168.0.110 (DevOps vault) |
| Port | 3100 |
| Image | langfuse/langfuse:2 |
| Database | PostgreSQL 15 (bundled container) |
| Deployment tier | Container tier (Docker Compose) |

### Environment Variables

```yaml
# K8s awoooi-secrets
LANGFUSE_ENABLED: "true"
LANGFUSE_URL: "http://192.168.0.110:3100"
LANGFUSE_PUBLIC_KEY: "pk-lf-..."
LANGFUSE_SECRET_KEY: "sk-lf-..."
```

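
One way the application might read these variables is sketched below. It is not the actual `config.py`: `LangfuseSettings` and `load_langfuse_settings` are hypothetical names, and both the `"false"` default for `LANGFUSE_ENABLED` and the fail-fast check on missing keys are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LangfuseSettings:
    enabled: bool
    url: str
    public_key: Optional[str]
    secret_key: Optional[str]


def load_langfuse_settings(env: Mapping[str, str] = os.environ) -> LangfuseSettings:
    """Build the settings from LANGFUSE_* variables (hypothetical loader)."""
    # Assumption: tracing stays off unless explicitly enabled.
    enabled = env.get("LANGFUSE_ENABLED", "false").strip().lower() == "true"
    settings = LangfuseSettings(
        enabled=enabled,
        url=env.get("LANGFUSE_URL", "http://192.168.0.110:3100"),
        public_key=env.get("LANGFUSE_PUBLIC_KEY"),
        secret_key=env.get("LANGFUSE_SECRET_KEY"),
    )
    # Fail fast when tracing is enabled but the credentials are missing.
    if settings.enabled and not (settings.public_key and settings.secret_key):
        raise ValueError("LANGFUSE_ENABLED is true but the API keys are not set")
    return settings
```

Passing the environment as a parameter keeps the loader easy to test without touching the real process environment.
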
### Project Configuration

| Item | Value |
|------|-------|
| Project name | awoooi-openclaw |
| Admin account | admin@awoooi.local |

---

## Sampling Strategy

### Cost-Control Principles

| Tool | Normal traffic | On error | Reason |
|------|----------------|----------|--------|
| Sentry | 10% | 100% | High frontend traffic volume |
| SignOz | 10% | 100% | High API traffic volume |
| Langfuse | **100%** | 100% | AI decisions must be recorded in full |

### Dynamic Sampling Configuration

```python
# Sentry traces_sampler
def traces_sampler(sampling_context):
    # "error" is not part of Sentry's default sampling context;
    # the application has to supply it.
    if sampling_context.get("error"):
        return 1.0  # Always record errors
    return 0.1  # 10% of normal traffic
```

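
The cost-control table can also be encoded once and shared by all three tools. The sketch below is illustrative: `SAMPLING_POLICY`, `sample_rate`, and `expected_kept` are hypothetical names, and any event counts passed to `expected_kept` are example inputs rather than measured traffic.

```python
# Sampling rates from the cost-control table: (normal traffic, on error).
SAMPLING_POLICY = {
    "sentry": (0.10, 1.0),
    "signoz": (0.10, 1.0),
    "langfuse": (1.0, 1.0),
}


def sample_rate(tool: str, is_error: bool) -> float:
    """Look up the sampling rate for a tool and traffic type."""
    normal, on_error = SAMPLING_POLICY[tool.lower()]
    return on_error if is_error else normal


def expected_kept(tool: str, normal_events: int, error_events: int) -> float:
    """Expected number of events retained under the policy."""
    return (normal_events * sample_rate(tool, False)
            + error_events * sample_rate(tool, True))
```
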
---

## Cost Estimate

| Item | Monthly cost | Notes |
|------|--------------|-------|
| Langfuse | $0 | Self-hosted |
| Additional storage | ~$5 | PostgreSQL volume |
| **Total** | **~$5/month** | Nearly free |

---

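## Consequences

### Pros

- **End-to-end traceability**: trace from a frontend error back to the AI decision
- **Cost visibility**: token consumption is monitored in real time
- **One-click navigation**: deep linking across the three systems
- **Prompt versioning**: supports A/B testing

### Cons

- **Added latency**: each Langfuse write adds ~5 ms
- **Storage cost**: 100% sampling accumulates data quickly

### Risks

- **Redis Streams broken links**: context injection has to be maintained by hand
- **Langfuse outage**: LLM calls keep working; only tracing is lost
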
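
The second risk implies a fail-open design: a tracing failure must never break the LLM call itself. A minimal sketch of that behaviour follows. `call_llm_with_tracing` and its callable parameters are hypothetical, and the real client may handle failures differently, for example by buffering writes in the background.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("llm_tracing")


def call_llm_with_tracing(
    call_llm: Callable[[str], str],
    record_trace: Callable[[str, str], Any],
    prompt: str,
) -> str:
    """Run the LLM call and never let a tracing failure break it."""
    result = call_llm(prompt)  # LLM errors still propagate to the caller
    try:
        record_trace(prompt, result)
    except Exception:
        # Fail open: log and continue, so only the trace is lost.
        logger.warning("Langfuse tracing failed; continuing without trace",
                       exc_info=True)
    return result
```
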
---

## Related Files

| File | Description |
|------|-------------|
| `apps/api/src/services/langfuse_client.py` | Langfuse client wrapper |
| `apps/api/src/core/deep_linking.py` | Deep-link URL generator |
| `apps/api/src/core/telemetry.py` | Trace context helper functions |
| `apps/api/src/core/config.py` | LANGFUSE_* settings |

---

## References

- [Langfuse Documentation](https://langfuse.com/docs)
- [W3C Trace Context](https://www.w3.org/TR/trace-context/)
- Memory: `project_phase15_llmops_observability.md`
- Memory: `project_phase15_langfuse.md`