fix(observability): sync the actual .188 deployment adjustments back into the repo

Differences from the original plan:

1. MinIO Bearer Token authentication
   - Original plan: MINIO_PROMETHEUS_AUTH_TYPE=public (not supported by this version)
   - Actual: generate a Bearer Token with mc admin prometheus generate
   - Updated: prometheus-config-phase-o.yaml now includes bearer_token

2. remote_write dropped → OTEL Collector Prometheus scrape
   - Original plan: Prometheus remote_write → SigNoz OTEL /api/v1/write
   - Actual: the SigNoz OTEL Collector does not accept the Prometheus remote_write format (404)
   - Replacement: the OTEL Collector prometheus receiver scrapes node-exporter and kube-state-metrics directly
   - Added: ops/signoz/otel-collector-config-phase-o.yaml (version-controlled copy)

3. ADR-053 acceptance checklist updated to reflect the actual results

Co-Authored-By: Claude Code <noreply@anthropic.com>
OG T
2026-04-02 21:23:47 +08:00
parent 93e3aa6811
commit 3f339110dd
4 changed files with 227 additions and 26 deletions


@@ -43,9 +43,13 @@
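→ ClickHouse (unified storage)
→ SigNoz UI (unified querying)
Prometheus (metrics collection)
remote_write (allowlist filter, ~50 series)
SigNoz ClickHouse (long-term, 90 days)
SigNoz OTEL Collector (Prometheus Receiver)
scrapes node-exporter directly (172.28.0.1:9100)
scrapes kube-state-metrics directly (192.168.0.121:30888)
→ SigNoz ClickHouse (long-term storage)
Note: the originally planned Prometheus remote_write was dropped because the SigNoz OTEL Collector does not accept the remote_write (Protobuf) format;
the OTEL Collector's built-in prometheus receiver scrapes the key metrics directly instead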
```
---
@@ -110,12 +114,14 @@ Warning/Error Events are retained in full. Normal/Scheduled/Pulling/Pulled/Created/Start
- [x] OTEL Collector 2 Pods Running (mon + mon1)
- [x] Event Exporter 1 Pod Running
- [x] Descheduler CronJob runs normally (Completed)
- [x] MinIO + Kali alert rules added to Prometheus
- [x] Alert Chain Smoke Test Script completed
- [x] MinIO monitoring up (Bearer Token auth via mc admin prometheus generate)
- [x] Kali Scanner TCP probe up
- [x] MinIO/Kali alert rules added to Prometheus (appended to alerts.yml, 7 groups)
- [x] SigNoz metrics flowing in (OTEL Collector prometheus receiver: node + kube-state)
- [x] Alert Chain Smoke Test 7/8 PASSED (1 non-critical: metrics had only just started)
- [x] CD Pipeline integrates Alert Chain Smoke Test + Sentry Token injection
- [ ] ClickHouse TTL configuration (pending work on .188)
- [ ] Prometheus remote_write deployment (pending work on .188)
- [ ] SigNoz alert rule deployment (pending work on .188)
- [ ] ClickHouse TTL configuration (pending work on .188: signoz_logs 30 days / signoz_metrics 90 days)
- [x] ~~Prometheus remote_write~~ → replaced by OTEL Collector prometheus receiver direct scrape (SigNoz does not accept the remote_write format)
---
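
One way to spot-check the "SigNoz metrics flowing in" item is to confirm that a scrape target actually exposes series before looking for them in SigNoz. The sketch below counts samples per metric name in a Prometheus text-exposition body, for example one saved from a target's /metrics endpoint. The function name and the sample payload are illustrative assumptions, not part of this commit.

```python
from collections import Counter


def samples_per_metric(exposition: str) -> Counter:
    """Count samples per metric name in Prometheus text exposition format."""
    counts = Counter()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first '{' (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts


# Hypothetical payload shaped like node-exporter output.
sample = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
node_cpu_seconds_total{cpu="0",mode="user"} 67.8
"""
print(samples_per_metric(sample))
```

An empty counter for a target would suggest the problem lies with the exporter or the network path rather than with the collector pipeline.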


@@ -4,24 +4,30 @@
# Author: Claude Code (Chief Architect)
# Date: 2026-04-02 (Taipei time)
# Purpose: MinIO monitoring + Kali health probing
# Deploy location: 192.168.0.188 /etc/prometheus/prometheus.yml
# Deploy location: 192.168.0.188 /home/ollama/momo-pro/monitoring/prometheus.yml
# Deployed: 2026-04-02, appended manually on .188
# =============================================================================
#
# Deployment steps:
# 1. SSH to 192.168.0.188 (as user ollama)
# 2. Edit /etc/prometheus/prometheus.yml
# 3. Add the following under the scrape_configs block
# 4. Run: sudo systemctl reload prometheus
# 2. Append to the end of scrape_configs in /home/ollama/momo-pro/monitoring/prometheus.yml
# 3. docker kill -s SIGHUP prometheus
#
# ⚠️ MinIO authentication notes:
# This MinIO version (RELEASE.2024-03-26) does not support MINIO_PROMETHEUS_AUTH_TYPE=public
# Bearer Token authentication is required
# Generate the token: docker exec minio mc admin prometheus generate local/
# Token expiry: ~2126 (exp: 4928730704)
# =============================================================================
# ===== MinIO monitoring (O-1.3) =====
# Prerequisite: the MinIO Prometheus endpoint must be enabled
# mc admin prometheus generate myminio
# or set the environment variable: MINIO_PROMETHEUS_AUTH_TYPE=public
# Prerequisite: the Bearer Token is generated by mc admin prometheus generate
#
# To regenerate the token:
# docker exec minio mc alias set local http://localhost:9000 minio_admin '<admin-password>'
# docker exec minio mc admin prometheus generate local/
# Verify:
# curl -s http://192.168.0.188:9000/minio/v2/metrics/cluster | head -5
# curl -H "Authorization: Bearer <token>" http://192.168.0.188:9000/minio/v2/metrics/cluster | head -5
  - job_name: minio
    honor_timestamps: true
@@ -29,6 +35,10 @@
    scrape_timeout: 10s
    metrics_path: /minio/v2/metrics/cluster
    scheme: http
    # ⚠️ Bearer Token authentication (generated by mc admin prometheus generate at deployment on 2026-04-02)
    # The token has been written directly into .188:/home/ollama/momo-pro/monitoring/prometheus.yml
    # To rotate it: docker exec minio mc admin prometheus generate local/
    bearer_token: <token>
    static_configs:
      - targets:
          - 192.168.0.188:9000
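
The comments above quote the token's exp claim as 4928730704. A quick way to see what date that corresponds to, and to re-check a token after rotation, is to decode the JWT payload. The sketch below uses only the standard library and does not verify the signature. The helper names and the unsigned sample token are illustrative assumptions; only the exp value comes from the config comments.

```python
import base64
import json
from datetime import datetime, timezone


def jwt_expiry(token: str) -> datetime:
    """Return the exp claim of a JWT as a UTC datetime (signature is NOT verified)."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return datetime.fromtimestamp(claims["exp"], tz=timezone.utc)


def make_unsigned_token(claims: dict) -> str:
    """Build a structurally valid but unsigned JWT, for illustration only."""
    def b64(obj: dict) -> str:
        raw = json.dumps(obj, separators=(",", ":")).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
    return f"{b64({'alg': 'none'})}.{b64(claims)}."


# Hypothetical token carrying the exp value quoted in the config comments.
sample = make_unsigned_token({"exp": 4928730704})
print(jwt_expiry(sample).year)  # 2126
```

To inspect a real token, pass the string printed by mc admin prometheus generate to jwt_expiry instead of the sample.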


@@ -3,18 +3,15 @@
# =============================================================================
# Author: Claude Code (Chief Architect)
# Date: 2026-04-02 (Taipei time)
# Purpose: store key metrics long-term in SigNoz ClickHouse (90 days)
# Deploy location: 192.168.0.188 /etc/prometheus/prometheus.yml
# =============================================================================
#
# Deployment steps:
# 1. SSH to 192.168.0.188 (as user ollama)
# 2. Edit /etc/prometheus/prometheus.yml
# 3. Add the following remote_write block at the top level
# 4. Run: sudo systemctl reload prometheus
# ❌ This approach is deprecated (discovered during the actual deployment on 2026-04-02)
# Reason: the SigNoz OTEL Collector does not accept the Prometheus remote_write format (Protobuf)
# The /api/v1/write endpoint returns 404 Not Found
#
# Verify:
# curl -s "http://192.168.0.188:9090/api/v1/status/config" | jq '.data.yaml' | grep remote_write
# ✅ Replacement: the SigNoz OTEL Collector Prometheus Receiver scrapes the targets directly
# Config file: ops/signoz/otel-collector-config-phase-o.yaml
# Deployed at: .188:/home/ollama/signoz/deploy/docker/otel-collector-config.yaml
# New jobs: node-from-signoz (node-exporter) + kube-state-from-signoz
#
# =============================================================================


@@ -0,0 +1,188 @@
# =============================================================================
# SigNoz OTEL Collector Config - Phase O-3 as-deployed version
# =============================================================================
# Author: Claude Code (Chief Architect)
# Date: 2026-04-02 (Taipei time)
# Deploy location: 192.168.0.188:/home/ollama/signoz/deploy/docker/otel-collector-config.yaml
#
# Phase O-3 additions (differences from the original config):
# New scrape jobs in the prometheus receiver:
# - node-from-signoz: node-exporter (172.28.0.1:9100, monitoring_monitoring bridge gateway)
# - kube-state-from-signoz: kube-state-metrics (192.168.0.121:30888)
#
# Note: signoz-otel-collector must be attached to the monitoring_monitoring Docker network:
# docker network connect monitoring_monitoring signoz-otel-collector
#
# The original remote_write approach is deprecated: the SigNoz OTEL Collector does not accept the Prometheus remote_write format
# Backup of the original: /home/ollama/signoz/deploy/docker/otel-collector-config.yaml.bak.phase-o
# =============================================================================
connectors:
  signozmeter:
    dimensions:
      - name: service.name
      - name: deployment.environment
      - name: host.name
    metrics_flush_interval: 1h
exporters:
  clickhouselogsexporter:
    dsn: tcp://clickhouse:9000/signoz_logs
    timeout: 10s
    use_new_schema: true
  clickhousetraces:
    datasource: tcp://clickhouse:9000/signoz_traces
    low_cardinal_exception_grouping: ${env:LOW_CARDINAL_EXCEPTION_GROUPING}
    use_new_schema: true
  metadataexporter:
    cache:
      provider: in_memory
    dsn: tcp://clickhouse:9000/signoz_metadata
    enabled: true
    timeout: 45s
  signozclickhousemeter:
    dsn: tcp://clickhouse:9000/signoz_meter
    sending_queue:
      enabled: false
    timeout: 45s
  signozclickhousemetrics:
    dsn: tcp://clickhouse:9000/signoz_metrics
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: 0.0.0.0:1777
processors:
  batch:
    send_batch_max_size: 11000
    send_batch_size: 10000
    timeout: 10s
  batch/meter:
    send_batch_max_size: 25000
    send_batch_size: 20000
    timeout: 1s
  resourcedetection:
    detectors:
      - env
      - system
    timeout: 2s
  signozspanmetrics/delta:
    aggregation_temporality: AGGREGATION_TEMPORALITY_DELTA
    dimensions:
      - default: default
        name: service.namespace
      - default: default
        name: deployment.environment
      - name: signoz.collector.id
      - name: service.version
      - name: browser.platform
      - name: browser.mobile
      - name: k8s.cluster.name
      - name: k8s.node.name
      - name: k8s.namespace.name
      - name: host.name
      - name: host.type
      - name: container.name
    dimensions_cache_size: 100000
    enable_exp_histogram: true
    latency_histogram_buckets:
      - 100us
      - 1ms
      - 2ms
      - 6ms
      - 10ms
      - 50ms
      - 100ms
      - 250ms
      - 500ms
      - 1000ms
      - 1400ms
      - 2000ms
      - 5s
      - 10s
      - 20s
      - 40s
      - 60s
    metrics_exporter: signozclickhousemetrics
    metrics_flush_interval: 60s
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      global:
        scrape_interval: 60s
      scrape_configs:
        - job_name: otel-collector
          static_configs:
            - labels:
                job_name: otel-collector
              targets:
                - localhost:8888
        - job_name: node-from-signoz
          metrics_path: /metrics
          scrape_interval: 60s
          static_configs:
            - targets:
                - 172.28.0.1:9100
        - job_name: kube-state-from-signoz
          metrics_path: /metrics
          scrape_interval: 60s
          static_configs:
            - targets:
                - 192.168.0.121:30888
service:
  extensions:
    - health_check
    - pprof
  pipelines:
    logs:
      exporters:
        - clickhouselogsexporter
        - metadataexporter
        - signozmeter
      processors:
        - batch
      receivers:
        - otlp
    metrics:
      exporters:
        - signozclickhousemetrics
        - metadataexporter
        - signozmeter
      processors:
        - batch
      receivers:
        - otlp
    metrics/meter:
      exporters:
        - signozclickhousemeter
      processors:
        - batch/meter
      receivers:
        - signozmeter
    metrics/prometheus:
      exporters:
        - signozclickhousemetrics
        - metadataexporter
        - signozmeter
      processors:
        - batch
      receivers:
        - prometheus
  telemetry:
    logs:
      encoding: json
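
Because this file is a hand-maintained copy of the deployed config, a cheap consistency check is to confirm that every component named in a pipeline is declared in its top-level section. The sketch below assumes the YAML has already been loaded into a dict (for example with a YAML parser of your choice). The function name and the trimmed-down sample dict are illustrative assumptions that mirror only the metrics/prometheus and metrics/meter wiring.

```python
def undefined_pipeline_refs(config: dict) -> list[str]:
    """List pipeline component references that are not declared in the config."""
    connectors = set(config.get("connectors", {}))
    declared = {
        # A connector can act as an exporter in one pipeline and a receiver in another.
        "receivers": set(config.get("receivers", {})) | connectors,
        "processors": set(config.get("processors", {})),
        "exporters": set(config.get("exporters", {})) | connectors,
    }
    problems = []
    for name, pipeline in config.get("service", {}).get("pipelines", {}).items():
        for kind, known in declared.items():
            for ref in pipeline.get(kind, []):
                if ref not in known:
                    problems.append(f"{name}: {kind[:-1]} '{ref}' is not defined")
    return problems


# Trimmed-down mirror of the config above (component settings omitted).
config = {
    "connectors": {"signozmeter": {}},
    "receivers": {"otlp": {}, "prometheus": {}},
    "processors": {"batch": {}, "batch/meter": {}},
    "exporters": {
        "signozclickhousemetrics": {},
        "metadataexporter": {},
        "signozclickhousemeter": {},
    },
    "service": {"pipelines": {
        "metrics/prometheus": {
            "receivers": ["prometheus"],
            "processors": ["batch"],
            "exporters": ["signozclickhousemetrics", "metadataexporter", "signozmeter"],
        },
        "metrics/meter": {
            "receivers": ["signozmeter"],
            "processors": ["batch/meter"],
            "exporters": ["signozclickhousemeter"],
        },
    }},
}
print(undefined_pipeline_refs(config))  # an empty list means the wiring is consistent
```

The check only covers naming. It does not validate component settings or confirm that the scrape targets are reachable from the collector container.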