Commit Graph

2 Commits

Author SHA1 Message Date
OG T
3f339110dd fix(observability): 同步 .188 實際部署調整至 repo
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
E2E Health Check / e2e-health (push) Has been cancelled
與原始計畫的差異:

1. MinIO Bearer Token 認證
   - 原計畫: MINIO_PROMETHEUS_AUTH_TYPE=public (此版本不支援)
   - 實際: mc admin prometheus generate 產生 Bearer Token
   - 更新: prometheus-config-phase-o.yaml 加入 bearer_token

2. remote_write 廢棄 → OTEL Collector Prometheus scrape
   - 原計畫: Prometheus remote_write → SigNoz OTEL /api/v1/write
   - 實際: SigNoz OTEL Collector 不支援 Prometheus remote_write 格式 (404)
   - 改用: OTEL Collector prometheus receiver 直接 scrape node-exporter + kube-state-metrics
   - 新增: ops/signoz/otel-collector-config-phase-o.yaml (版本控管副本)

3. ADR-053 驗收清單更新為實際結果

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-04-02 21:23:47 +08:00
OG T
41bf0681cf feat(observability): Phase O-2/O-3 OTEL Log管線 + Event Exporter + Remote Write
O-2.1: OTEL Collector DaemonSet (filelog receiver)
  - 收集所有 K3s 節點 Pod stdout/stderr → SigNoz ClickHouse
  - CRI log parser (Go time layout for +08:00 timezone)
  - filter processor 排除 kube-system debug noise
  - observability namespace PSA privileged (log 目錄需 root)
  - 資源限制: 50m-200m CPU / 64-128Mi Memory

O-2.2: kubernetes-event-exporter
  - K8s Event → 結構化 JSON Log → SigNoz
  - Warning/Error 全量保留, Normal 過濾高頻事件
  - 解決: Event 預設僅保留 ~1hr 的致命盲區

O-3: Prometheus remote_write 配置模板
  - 白名單: ~50 關鍵 metric series (node/container/kube/api/db)
  - 目標: 90 天長期儲存於 SigNoz ClickHouse

已部署驗證: 3 Pod Running, 0 error, filelog 正常監控所有 namespace

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 14:01:42 +08:00