Files
awoooi/k8s/monitoring/prometheus-remote-write-signoz.yaml
OG T 3f339110dd
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
E2E Health Check / e2e-health (push) Has been cancelled
fix(observability): 同步 .188 實際部署調整至 repo
與原始計畫的差異:

1. MinIO Bearer Token 認證
   - 原計畫: MINIO_PROMETHEUS_AUTH_TYPE=public (此版本不支援)
   - 實際: mc admin prometheus generate 產生 Bearer Token
   - 更新: prometheus-config-phase-o.yaml 加入 bearer_token

2. remote_write 廢棄 → OTEL Collector Prometheus scrape
   - 原計畫: Prometheus remote_write → SigNoz OTEL /api/v1/write
   - 實際: SigNoz OTEL Collector 不支援 Prometheus remote_write 格式 (404)
   - 改用: OTEL Collector prometheus receiver 直接 scrape node-exporter + kube-state-metrics
   - 新增: ops/signoz/otel-collector-config-phase-o.yaml (版本控管副本)

3. ADR-053 驗收清單更新為實際結果

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-04-02 21:23:47 +08:00

90 lines
3.8 KiB
YAML
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# =============================================================================
# Prometheus Remote Write → SigNoz - Phase O-3
# =============================================================================
# 建立者: Claude Code (首席架構師)
# 日期: 2026-04-02 (台北時間)
#
# ❌ 此方案已廢棄 (2026-04-02 實際部署時發現)
# 原因: SigNoz OTEL Collector 不支援 Prometheus remote_write 格式 (Protobuf)
# 端點 /api/v1/write 回傳 404 Not Found
#
# ✅ 改用方案: SigNoz OTEL Collector Prometheus Receiver 直接 scrape
# 設定檔: ops/signoz/otel-collector-config-phase-o.yaml
# 實際部署: .188:/home/ollama/signoz/deploy/docker/otel-collector-config.yaml
# 新增 jobs: node-from-signoz (node-exporter) + kube-state-from-signoz
#
# =============================================================================
# ===== 新增至 prometheus.yml 最外層 (與 scrape_configs 同級) =====
remote_write:
- url: http://localhost:24318/api/v1/write
# 只在同一台機器上 (188 → 188),無需跨網路
remote_timeout: 30s
# 佇列配置: 適合小型叢集
queue_config:
capacity: 2500
max_shards: 5
min_shards: 1
max_samples_per_send: 500
batch_send_deadline: 5s
# ===== 白名單過濾 (只轉關鍵指標) =====
# 預估 ~50 個 metric series90 天約 2-5 GB
write_relabel_configs:
# 1. 節點資源指標
- source_labels: [__name__]
regex: "node_cpu_seconds_total|node_memory_MemAvailable_bytes|node_memory_MemTotal_bytes|node_filesystem_avail_bytes|node_filesystem_size_bytes|node_load1|node_load5|node_load15|node_network_receive_bytes_total|node_network_transmit_bytes_total"
action: keep
# 2. 容器資源指標
- source_labels: [__name__]
regex: "container_cpu_usage_seconds_total|container_memory_working_set_bytes|container_memory_rss|container_network_receive_bytes_total|container_network_transmit_bytes_total"
action: keep
# 3. K8s 狀態指標 (Pod 重啟是 RCA 關鍵!)
- source_labels: [__name__]
regex: "kube_pod_container_status_restarts_total|kube_pod_status_phase|kube_deployment_status_replicas_available|kube_deployment_status_replicas_unavailable|kube_node_status_condition"
action: keep
# 4. API 效能指標
- source_labels: [__name__]
regex: "http_request_duration_seconds.*|http_requests_total"
action: keep
# 5. AWOOOI 自訂業務指標
- source_labels: [__name__]
regex: "awoooi_.*"
action: keep
# 6. 資料庫指標
- source_labels: [__name__]
regex: "pg_stat_activity_count|pg_stat_database_conflicts|pg_stat_database_deadlocks|pg_up|redis_connected_clients|redis_memory_used_bytes|redis_keyspace_hits_total|redis_keyspace_misses_total|redis_up"
action: keep
# 7. Probe 健康狀態
- source_labels: [__name__]
regex: "probe_success|probe_duration_seconds|probe_ssl_earliest_cert_expiry"
action: keep
# 8. UP 指標 (所有 target 存活狀態)
- source_labels: [__name__]
regex: "up"
action: keep
# =============================================================================
# ⚠️ 注意事項
# =============================================================================
#
# 1. write_relabel_configs 是 OR 邏輯:多個 keep 規則中,
# 任一匹配即保留。這是 Prometheus 內建行為。
#
# 2. 如果 SigNoz OTLP/HTTP 端點不支援 Prometheus remote_write 格式,
# 需要改用 victoria-metrics 或 cortex 作為中繼。
# 替代方案: 在 SigNoz 的 OTEL Collector 中啟用 prometheusremotewrite receiver。
#
# 3. 增加/移除指標後,用以下指令驗證:
# curl -s "http://192.168.0.188:9090/api/v1/query?query=up" | jq '.data.result | length'
#