docs(observability): ADR-053 SigNoz unified architecture + Phase O acceptance

- Add ADR-053: decision record for the unified observability architecture
- Update service-registry.yaml: add the missing MinIO/Kali monitoring entries
- Update LOGBOOK: Phase O completion status

Phase O acceptance checklist:
✅ kubectl from the local Mac without a password
✅ OTEL Collector 2 Pods Running
✅ Event Exporter 1 Pod Running
✅ Descheduler CronJob Completed
✅ MinIO + Kali alert rules
✅ Alert Chain Smoke Test
✅ CD pipeline integration
⏳ ClickHouse TTL / remote_write / SigNoz rules (pending manual work on .188)

Co-Authored-By: Claude Code <noreply@anthropic.com>
@@ -5,7 +5,30 @@

---

## 📍 Current Status (2026-04-01 17:30 Taipei)
## 📍 Current Status (2026-04-02 01:30 Taipei)

| Item | Status |
|------|--------|
| **Phase O: Observability Final Completion** | 📋 Design approved, pending implementation (O-1 → O-6, 12-15h) |
| **K3s health audit** | ✅ Cluster-wide: 23 Pods / 0 restarts / 0 pressure, but 6 major blind spots found |
| **Architecture decision: SigNoz unified approach** | ✅ Approved by the Commander (no separate Loki install) |
| **Design spec** | ✅ `docs/superpowers/specs/2026-04-02-observability-ultimate-completion-design.md` |

---

## 📍 Historical Status (2026-04-02 09:30 Taipei)

| Item | Status |
|------|--------|
| **Langfuse LLMOps full repair** | ✅ v2.60.10 deployed + 4 traces verified + Gitea Secrets injected |
| **Chief architect review fixes** | ✅ i18n (6 files) + CD self-hosted + time zone + Secret exit 1 + op:add |
| **Technical debt cleanup** | ✅ risklevel migration automated + whitelist updated + worktrees cleaned up |
| **Knowledge Base Phase 1** | ✅ Backend four-layer architecture + frontend pages |
| **dev branch sync** | ✅ Fast-forward to main |

---

## 📍 Historical Status (2026-04-01 17:30 Taipei)

| Item | Status |
|------|--------|

docs/adr/ADR-053-observability-signoz-unified-architecture.md (new file, 136 lines)
@@ -0,0 +1,136 @@

# ADR-053: Unified Observability Architecture (SigNoz Unified Approach)

| Field | Value |
|-------|-------|
| **Status** | Accepted |
| **Date** | 2026-04-02 (Taipei time) |
| **Decision maker** | Commander (ogt) |
| **Chief architect** | Claude Code |
| **Related ADRs** | ADR-035 (Telegram Alert), ADR-037 (Monitoring Enhancement), ADR-039 (Gitea CI/CD) |

---

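## Background

The K3s health audit on 2026-04-02 reported "0 Pod restarts, 0 anomalies". Deeper analysis confirmed that these figures were an illusion created by **observability blind spots**, not evidence of real stability:

1. K8s Events are retained for only ~1 hour by default
2. Pod logs are lost permanently when the container is removed
3. Prometheus retains ~15 days by default, so there are no long-term trends
4. The Descheduler violates PodSecurity, so its CronJob keeps failing
5. kubectl access requires a sudo password
6. MinIO/Kali have no monitoring coverage

These blind spots are a critical threat to OpenClaw AI's RCA (Root Cause Analysis) capability: **once the "crime scene" has been wiped clean, even the smartest AI can only guess.**

---
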
## Decision

### Route Selection

| Option | Decision | Rationale |
|--------|----------|-----------|
| **A. SigNoz unified approach** | **Adopted** | Reuses ClickHouse, adds no new services, single query interface |
| B. Standalone Loki approach | Rejected | .188 is overloaded, higher operational complexity, fragmented queries |

### Core Architecture

```
All log/event sources
  → OTEL Collector (DaemonSet, K3s nodes)
  → 192.168.0.188:24318 (SigNoz OTEL Collector HTTP)
  → ClickHouse (unified storage)
  → SigNoz UI (unified query)

Prometheus (metrics collection)
  → remote_write (whitelist filter, ~50 series)
  → SigNoz ClickHouse (long-term, 90 days)
```

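To make the log path above concrete, the following is a minimal sketch of the kind of OTLP/HTTP JSON payload a collector would post to the SigNoz endpoint. It assumes the collector on port 24318 exposes the standard OTLP/HTTP logs path `/v1/logs`; the function name, the pod name, and the choice of resource attributes are illustrative and are not taken from the deployed configuration. The sketch only builds the payload and does not send it.

```python
import json
import time

# Host and port come from the architecture diagram. The /v1/logs path is the
# standard OTLP/HTTP logs path and is ASSUMED to be enabled on this collector.
SIGNOZ_OTLP_LOGS_URL = "http://192.168.0.188:24318/v1/logs"


def build_otlp_log_payload(body, severity_text, pod_name, namespace,
                           time_unix_nano=None):
    """Build a minimal OTLP/HTTP JSON payload holding a single log record."""
    if time_unix_nano is None:
        time_unix_nano = time.time_ns()
    return {
        "resourceLogs": [
            {
                "resource": {
                    "attributes": [
                        {"key": "k8s.pod.name",
                         "value": {"stringValue": pod_name}},
                        {"key": "k8s.namespace.name",
                         "value": {"stringValue": namespace}},
                    ]
                },
                "scopeLogs": [
                    {
                        "logRecords": [
                            {
                                # OTLP JSON encodes 64-bit integers as strings.
                                "timeUnixNano": str(time_unix_nano),
                                "severityText": severity_text,
                                "body": {"stringValue": body},
                            }
                        ]
                    }
                ],
            }
        ]
    }


payload = build_otlp_log_payload(
    body="Back-off restarting failed container",
    severity_text="WARN",
    pod_name="example-pod",  # hypothetical pod name
    namespace="observability",
    time_unix_nano=1775088000000000000,  # fixed value for reproducibility
)
print(json.dumps(payload, indent=2))
```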
---

## Implementation (Phase O)

### O-1: Infrastructure Fixes

| Item | Fix |
|------|-----|
| Descheduler | securityContext made PSA `restricted` compliant + RBAC completed + namespace PSA label |
| kubectl | Local Mac kubeconfig points at the VIP, no SSH needed |
| MinIO monitoring | Prometheus scrape config + 4 alert rules |
| Kali monitoring | Blackbox TCP probe + 1 alert rule |

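For readers unfamiliar with what a TCP probe checks, here is a minimal sketch of the idea: the target counts as up if a TCP connection can be opened within a timeout. This is not the Blackbox exporter's implementation; the function name and default timeout are assumptions chosen for illustration.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `tcp_probe("192.168.0.112", 8080)` would check the Kali address and port listed in service-registry.yaml.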
### O-2: OTEL Log Pipeline

| Component | Function |
|-----------|----------|
| OTEL Collector DaemonSet | filelog receiver → stdout/stderr of all Pods → SigNoz |
| Event Exporter | K8s Warning/Error Events → structured JSON → SigNoz |
| Retention | Logs/Events 30 days, Metrics 90 days (ClickHouse TTL) |

### O-3: Long-Term Metrics

Prometheus `remote_write` → SigNoz, forwarding only 8 categories of key metric series:
node / container / kube-state / API / awoooi custom / DB / probe / up

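The whitelist idea can be sketched as a name filter. The prefixes below are hypothetical stand-ins for the 8 categories (the `pg_` prefix for DB in particular is an assumption). The real filter lives in the Prometheus `remote_write` configuration and presumably lists specific metric names to stay near ~50 series. The sketch uses full-string matching because Prometheus relabel regexes are fully anchored.

```python
import re

# HYPOTHETICAL name patterns for the 8 categories; the real whitelist in the
# Prometheus remote_write config may differ and is likely more specific.
WHITELIST_PATTERNS = [
    r"node_.*",       # node
    r"container_.*",  # container
    r"kube_.*",       # kube-state
    r"apiserver_.*",  # API
    r"awoooi_.*",     # awoooi custom
    r"pg_.*",         # DB (assumed PostgreSQL-style exporter prefix)
    r"probe_.*",      # probe
    r"up",            # up
]

WHITELIST_RE = re.compile("|".join(WHITELIST_PATTERNS))


def should_forward(metric_name: str) -> bool:
    """Return True if the whole metric name matches a whitelist pattern."""
    return WHITELIST_RE.fullmatch(metric_name) is not None


def filter_series(metric_names):
    """Keep only whitelisted metric names, preserving input order."""
    return [name for name in metric_names if should_forward(name)]
```

With these patterns, `filter_series(["up", "node_load1", "go_goroutines", "upstream_latency"])` keeps only `up` and `node_load1`; `upstream_latency` is rejected because the `up` pattern must match the whole name.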
### O-4: Wave A Alert Chain Wrap-Up

| Wave item | Status |
|-----------|--------|
| A.1 Sentry Token CD injection | ✅ |
| A.2 SigNoz alert rules | Pending manual deployment on .188 |
| A.3 SigNoz Webhook Handler | ✅ (implemented 2026-03-29) |
| A.4 Sentry Comment API | ✅ (implemented 2026-03-29) |
| A.5 Alert Chain Metrics | ✅ (implemented 2026-03-29) |
| A.6 Smoke Test Script | ✅ scripts/alert_chain_smoke_test.py |
| B.2 CD Smoke Test integration | ✅ cd.yaml |

---

## Key Design Decisions

### 1. observability namespace PSA = privileged

The OTEL Collector must run as root to read `/var/log/pods` (the directory is root:root 0750). This is standard industry practice (Promtail and Fluent Bit do the same) and cannot be avoided. **This applies to the observability namespace only**; all other namespaces stay restricted.

### 2. Prometheus whitelist remote_write

remote_write is not applied to the full metric set. Only ~50 key metrics are forwarded, to prevent a high-cardinality explosion in ClickHouse. The full metric set stays in local Prometheus for 15 days.

### 3. Event Filtering Strategy

Warning/Error Events are retained in full. High-frequency, low-value Events such as Normal/Scheduled/Pulling/Pulled/Created/Started are dropped to reduce storage pressure on ClickHouse.

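The event filtering rule in decision 3 can be sketched as a small predicate. This is an illustration of the stated policy only, not the Event Exporter's actual configuration; the dictionary shape and function name are assumptions.

```python
# Event types retained in full, as stated in decision 3.
KEPT_TYPES = {"Warning", "Error"}


def should_keep_event(event: dict) -> bool:
    """Return True if the event should be forwarded to SigNoz."""
    return event.get("type") in KEPT_TYPES


# Illustrative events; the field names mirror the K8s Event type/reason/message.
events = [
    {"type": "Normal", "reason": "Pulled",
     "message": "Container image already present"},
    {"type": "Warning", "reason": "BackOff",
     "message": "Back-off restarting failed container"},
    {"type": "Normal", "reason": "Started", "message": "Started container"},
]

kept = [e for e in events if should_keep_event(e)]
print([e["reason"] for e in kept])  # only the Warning event remains
```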
---

## Acceptance Criteria

- [x] kubectl can query all namespaces from the local Mac without a password
- [x] OTEL Collector: 2 Pods Running (mon + mon1)
- [x] Event Exporter: 1 Pod Running
- [x] Descheduler CronJob runs normally (Completed)
- [x] MinIO + Kali alert rules added to Prometheus
- [x] Alert Chain Smoke Test script complete
- [x] CD pipeline integrates the Alert Chain Smoke Test + Sentry Token injection
- [ ] ClickHouse TTL configuration (pending work on .188)
- [ ] Prometheus remote_write deployment (pending work on .188)
- [ ] SigNoz alert rule deployment (pending work on .188)

---

## Follow-Up Actions

1. SSH into .188 and deploy the remaining configuration (prometheus-config-phase-o.yaml / prometheus-remote-write-signoz.yaml / minio-kali-alerts.yaml / SigNoz rules)
2. Wave D: NVIDIA Grafana Dashboard + monitoring coverage report

---

## Impact Assessment

| Item | Impact |
|------|--------|
| .188 disk | +~10-15 GB (30-90 day retention, automatic ClickHouse TTL) |
| .188 CPU | +~50-200m (OTEL Collector forwarding) |
| K3s nodes | +50m/64Mi per node (OTEL Collector DaemonSet) |
| OpenClaw RCA capability | **Greatly improved**: 30 days of logs/events queryable, 90 days of metric trends |

@@ -374,11 +374,30 @@ nodes:

  - name: kali
    ip: 192.168.0.112
    port: 8080
    role: security
    monitoring:
      blackbox_tcp: true
      prometheus_scrape: false  # isolated environment; TCP probe only
      alerts:
        - node_down
        - service_down
    owner: security-team

  # Phase O-1.3 2026-04-02: MinIO backup storage (Phase O completion)
  - name: minio
    ip: 192.168.0.188
    port: 9000
    role: storage
    monitoring:
      prometheus_scrape: true
      metrics_path: /minio/v2/metrics/cluster
      alerts:
        - service_down
        - disk_space_low
    criticality: P1
    owner: devops-team

# =============================================================================
# Frontend pages
# =============================================================================