Commit Graph

3 Commits

Author SHA1 Message Date
OG T
e70ceaba61 ops(signoz): 建立 log-based alert rules 文檔 (Sprint 2)
5 條規則: APIHighErrorLogRate/WorkerTaskFailed/PodOOMKilled/
         TelegramPollingFailed/NemotronAllTimeout
含 SigNoz UI 設定步驟 + webhook 驗證指令
標籤與 Prometheus 統一規範對齊 (layer/component/team)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 11:10:02 +08:00
OG T
3f339110dd fix(observability): 同步 .188 實際部署調整至 repo
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
E2E Health Check / e2e-health (push) Has been cancelled
與原始計畫的差異:

1. MinIO Bearer Token 認證
   - 原計畫: MINIO_PROMETHEUS_AUTH_TYPE=public (此版本不支援)
   - 實際: mc admin prometheus generate 產生 Bearer Token
   - 更新: prometheus-config-phase-o.yaml 加入 bearer_token

2. remote_write 廢棄 → OTEL Collector Prometheus scrape
   - 原計畫: Prometheus remote_write → SigNoz OTEL /api/v1/write
   - 實際: SigNoz OTEL Collector 不支援 Prometheus remote_write 格式 (404)
   - 改用: OTEL Collector prometheus receiver 直接 scrape node-exporter + kube-state-metrics
   - 新增: ops/signoz/otel-collector-config-phase-o.yaml (版本控管副本)

3. ADR-053 驗收清單更新為實際結果

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-04-02 21:23:47 +08:00
OG T
a5a6bd3408 feat(monitoring): K8s alert rules + Grafana dashboards + ops 腳本
- k8s/monitoring/alert-chain-monitor.yaml
- k8s/monitoring/database-alerts.yaml
- ops/grafana/ Grafana dashboards
- ops/signoz/ SignOz 配置
- ops/scripts/ 維運腳本

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 16:04:14 +08:00