OG T
08db3580a7
fix(monitoring): fix high CPU load on host 110
CD Pipeline / build-and-deploy (push) Has started running
Root cause 1: cadvisor continuously scans overlay2 disk usage (1-4s per scan × N containers)
→ Add --disable_metrics=disk,diskIO,tcp,udp,percpu,sched,process
→ Add --housekeeping_interval=30s --docker_only=true
→ CPU dropped from 239% to <1%
Root cause 2: node_exporter scrape_timeout defaults to 10s; under high load the scrape times out → broken pipe → aggressive retries
→ Add scrape_interval: 30s / scrape_timeout: 25s
→ CPU dropped from 48% to 0%
Overall load average: 20 → 9
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 13:53:13 +08:00
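The scrape timing change in 08db3580a7 depends on a Prometheus constraint: scrape_timeout may not exceed scrape_interval, so raising the timeout to 25s only works because the interval is raised to 30s as well. A minimal Python sketch of that check, using the values from the commit; the function name is hypothetical and is not part of the repo.

```python
def validate_scrape_timing(interval_s: float, timeout_s: float) -> None:
    """Raise ValueError if the timeout could outlast the scrape interval."""
    if interval_s <= 0 or timeout_s <= 0:
        raise ValueError("interval and timeout must be positive")
    if timeout_s > interval_s:
        raise ValueError(
            f"scrape_timeout ({timeout_s}s) exceeds scrape_interval ({interval_s}s)"
        )


# Values from the commit: scrape_interval: 30s / scrape_timeout: 25s
validate_scrape_timing(30, 25)
```

Keeping the 10s default interval while setting a 25s timeout would fail this check, which is one reason to change both values together.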
OG T
5bd8a8a719
fix(monitoring): add missing blackbox-tcp scrape targets (11→15)
CD Pipeline / build-and-deploy (push) Has been cancelled
Alerts such as SentryDown/HarborDown/SignOzDown reference instances that were not in the scrape list,
so the metric was absent (= 0) and the alerts kept firing
Added the missing targets:
192.168.0.125:6443/32334/32335 (K3s)
192.168.0.110:9000/5000/3100 (Sentry/Harbor/Langfuse)
192.168.0.188:3301/5432/6380/11434/8089 (SigNoz/PG/Redis/Ollama/OpenClaw)
Prometheus reloaded on host 110; all 15 targets are UP
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 12:20:19 +08:00
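The target lines in 5bd8a8a719 use a host:port/port/port shorthand. A small Python sketch of how one such line expands into individual host:port scrape targets; the helper name is hypothetical and the shorthand is assumed to be a notation used only in the commit message, not a Prometheus syntax.

```python
def expand_targets(shorthand: str) -> list[str]:
    """Expand 'host:p1/p2/p3' into ['host:p1', 'host:p2', 'host:p3']."""
    host, _, ports = shorthand.partition(":")
    return [f"{host}:{port}" for port in ports.split("/") if port]


# The K3s line from the commit message
print(expand_targets("192.168.0.125:6443/32334/32335"))
```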
OG T
85d4857d1b
fix(monitoring): RedisMemoryHigh false positive: fix division by zero when max_bytes=0
CD Pipeline / build-and-deploy (push) Has started running
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 37s
- Add redis_memory_max_bytes > 0 as a precondition
- Prevents division by zero from producing +Inf, which fires permanently, when Redis has no maxmemory set
- Affects: alerts-unified.yml + database-alerts.yaml
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 11:41:10 +08:00
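The guard added in 85d4857d1b can be modelled in a few lines of Python. With maxmemory unset, redis_memory_max_bytes is 0, the used/max ratio becomes +Inf, and +Inf exceeds any threshold. The sketch below skips the comparison in that case; the function names and the 0.9 threshold are hypothetical, since the commit message does not state the real threshold.

```python
import math
from typing import Optional


def redis_memory_ratio(used_bytes: float, max_bytes: float) -> Optional[float]:
    """Return used/max, or None when maxmemory is unset (max_bytes == 0)."""
    if max_bytes <= 0:
        return None
    return used_bytes / max_bytes


def should_fire(used_bytes: float, max_bytes: float, threshold: float = 0.9) -> bool:
    """Fire only when a memory limit exists and usage exceeds the threshold."""
    ratio = redis_memory_ratio(used_bytes, max_bytes)
    return ratio is not None and ratio > threshold


# Without the guard, the ratio is +Inf and is always above the threshold.
assert math.inf > 0.9
print(should_fire(used_bytes=512, max_bytes=0))
```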
OG T
3f339110dd
fix(observability): sync the actual .188 deployment adjustments into the repo
CD Pipeline / build-and-deploy (push) Has been cancelled
E2E Health Check / e2e-health (push) Has been cancelled
Differences from the original plan:
1. MinIO Bearer Token authentication
- Original plan: MINIO_PROMETHEUS_AUTH_TYPE=public (not supported by this version)
- Actual: generate a Bearer Token with mc admin prometheus generate
- Update: add bearer_token to prometheus-config-phase-o.yaml
2. remote_write dropped → OTEL Collector Prometheus scrape
- Original plan: Prometheus remote_write → SigNoz OTEL /api/v1/write
- Actual: the SigNoz OTEL Collector does not support the Prometheus remote_write format (404)
- Replacement: the OTEL Collector prometheus receiver scrapes node-exporter + kube-state-metrics directly
- Added: ops/signoz/otel-collector-config-phase-o.yaml (version-controlled copy)
3. ADR-053 acceptance checklist updated with the actual results
Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-04-02 21:23:47 +08:00
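Item 1 of 3f339110dd switches MinIO metrics scraping to Bearer Token authentication. As an illustration of what the scraper sends, here is a Python sketch that builds (but does not send) such a request. The host name, the token value, and the choice of the /minio/v2/metrics/cluster path are assumptions; the commit does not say which metrics path the repo scrapes.

```python
import urllib.request


def build_metrics_request(base_url: str, token: str) -> urllib.request.Request:
    """Build an authenticated GET request for a MinIO metrics endpoint."""
    url = base_url.rstrip("/") + "/minio/v2/metrics/cluster"  # assumed path
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


# Placeholder host and token; a real token would come from
# `mc admin prometheus generate`.
req = build_metrics_request("http://minio.example.internal:9000", "TOKEN_FROM_MC")
print(req.full_url)
```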
OG T
99be215e83
fix(monitoring): R1 Review fixes: Blackbox DNS/PSA labels/alert thresholds
Critical: Blackbox Exporter replacement changed from K8s DNS to host IP (192.168.0.188:9115)
Important: Descheduler namespace explicitly declares PSA restricted labels
Suggestion: failedJobsHistoryLimit 3→1; add MinioDiskUsageCritical 5% alert
R1 Review by: Chief Architect (Phase O-1)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 14:02:50 +08:00
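The "replacement" in the Critical item of 99be215e83 refers to the usual Blackbox Exporter relabeling pattern, where the probed endpoint is moved into a URL parameter and the scrape address is replaced with the exporter's own address. A Python model of that label rewrite, assuming the repo follows the common three-step pattern; the function name is hypothetical and the probed target is taken from another commit purely as an example.

```python
def relabel_for_blackbox(target: str, exporter_addr: str) -> dict[str, str]:
    """Model the common three-step Blackbox Exporter relabeling."""
    labels = {"__address__": target}
    # 1. The probed endpoint becomes the ?target= URL parameter.
    labels["__param_target"] = labels["__address__"]
    # 2. Keep the probed endpoint as the instance label for alerting.
    labels["instance"] = labels["__param_target"]
    # 3. Prometheus actually scrapes the exporter, not the endpoint.
    labels["__address__"] = exporter_addr
    return labels


print(relabel_for_blackbox("192.168.0.110:9000", "192.168.0.188:9115"))
```

If step 3 uses a K8s DNS name that a Prometheus instance running outside the cluster cannot resolve, every probe fails; a host IP avoids that dependency.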
OG T
41bf0681cf
feat(observability): Phase O-2/O-3 OTEL log pipeline + Event Exporter + Remote Write
O-2.1: OTEL Collector DaemonSet (filelog receiver)
- Collects Pod stdout/stderr from all K3s nodes → SigNoz ClickHouse
- CRI log parser (Go time layout for +08:00 timezone)
- filter processor excludes kube-system debug noise
- observability namespace PSA privileged (the log directory requires root)
- Resource limits: 50m-200m CPU / 64-128Mi Memory
O-2.2: kubernetes-event-exporter
- K8s Event → structured JSON log → SigNoz
- Warning/Error events kept in full; high-frequency Normal events filtered out
- Fixes: the critical blind spot where Events are retained for only ~1hr by default
O-3: Prometheus remote_write config template
- Allowlist: ~50 key metric series (node/container/kube/api/db)
- Goal: 90-day long-term storage in SigNoz ClickHouse
Deployed and verified: 3 Pods Running, 0 errors, filelog monitoring all namespaces normally
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 14:01:42 +08:00
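The CRI log parser mentioned under O-2.1 in 41bf0681cf splits each container log line into a timestamp, a stream name, a partial/full tag, and the message. A Python sketch of that line format follows; it is not the OTEL Collector's operator configuration, and the regex and function name are hypothetical.

```python
import re
from typing import Optional

# CRI line layout: <RFC 3339 timestamp> <stdout|stderr> <P|F> <message>
CRI_LINE = re.compile(
    r"^(?P<time>\S+) (?P<stream>stdout|stderr) (?P<tag>[FP]) (?P<message>.*)$"
)


def parse_cri_line(line: str) -> Optional[dict[str, str]]:
    """Split one CRI log line into its fields, or return None if malformed."""
    match = CRI_LINE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


sample = "2026-04-02T14:01:42.123456789+08:00 stdout F hello world"
print(parse_cri_line(sample))
```

The tag is P for a partial line that continues in the next record and F for a full (final) line, which is why a log pipeline typically recombines P records before shipping them.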
OG T
6ce82ff883
fix(k3s): Phase O-1 infrastructure fixes: Descheduler + MinIO/Kali monitoring
O-1.1: Descheduler securityContext fix (PodSecurity restricted compliance)
- Add pod securityContext (runAsNonRoot, runAsUser:65534, seccompProfile)
- Add container securityContext (allowPrivilegeEscalation:false, drop ALL)
- Complete RBAC: list permission on namespaces + replicasets
- Deployed and verified: CronJob ran successfully (Status: Completed)
O-1.3: MinIO Prometheus scrape config + alert rules
O-1.4: Kali Blackbox TCP probe + alert rules
- MinioDown, MinioDiskUsageHigh, MinioOfflineDisk
- KaliScannerDown
Pending manual deployment: Prometheus config → .188, kubectl kubeconfig → 120/121
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 13:55:26 +08:00
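The securityContext fields listed under O-1.1 in 6ce82ff883 correspond to checks in the Pod Security "restricted" profile. The Python sketch below tests a simplified subset of those checks against a pod spec expressed as a dict. It is hypothetical, covers only the four fields named in the commit, and ignores that the real profile lets some fields be set at either pod or container level.

```python
def restricted_violations(pod_spec: dict) -> list[str]:
    """Return a list of (simplified) PSA restricted violations for a pod spec."""
    problems = []
    pod_sc = pod_spec.get("securityContext", {})
    if pod_sc.get("runAsNonRoot") is not True:
        problems.append("pod: runAsNonRoot must be true")
    if pod_sc.get("seccompProfile", {}).get("type") not in ("RuntimeDefault", "Localhost"):
        problems.append("pod: seccompProfile.type must be RuntimeDefault or Localhost")
    for container in pod_spec.get("containers", []):
        name = container.get("name", "?")
        sc = container.get("securityContext", {})
        if sc.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{name}: allowPrivilegeEscalation must be false")
        if "ALL" not in sc.get("capabilities", {}).get("drop", []):
            problems.append(f"{name}: capabilities.drop must include ALL")
    return problems


# Spec shaped like the fix described in the commit (container name is a placeholder)
fixed = {
    "securityContext": {
        "runAsNonRoot": True,
        "runAsUser": 65534,
        "seccompProfile": {"type": "RuntimeDefault"},
    },
    "containers": [{
        "name": "descheduler",
        "securityContext": {
            "allowPrivilegeEscalation": False,
            "capabilities": {"drop": ["ALL"]},
        },
    }],
}
print(restricted_violations(fixed))
```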
OG T
a5a6bd3408
feat(monitoring): K8s alert rules + Grafana dashboards + ops scripts
- k8s/monitoring/alert-chain-monitor.yaml
- k8s/monitoring/database-alerts.yaml
- ops/grafana/ Grafana dashboards
- ops/signoz/ SigNoz configuration
- ops/scripts/ operations scripts
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 16:04:14 +08:00
OG T
12e49d844a
feat(monitoring): ADR-037 Wave B - Database Exporters + Prometheus integration
- Deploy PostgreSQL Exporter (192.168.0.188:9187)
- Deploy Redis Exporter (192.168.0.188:9121)
- Update Prometheus scrape config
- Chief Architect review: 97% OUTSTANDING
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 15:18:54 +08:00
OG T
93c3280481
feat(monitoring): Phase 20 Nemotron full monitoring integration
Service registry:
- Add nvidia-nemotron AI service
- 3 Prometheus metric definitions
- 4 alert rules (circuit_breaker, timeout, error_rate, rate_limit)
- Fallback strategy (nvidia → gemini → ollama)
Alertmanager rules:
- NvidiaCircuitBreakerOpen (P1)
- NvidiaToolCallingHighLatency (P2)
- NvidiaToolCallingHighErrorRate (P0)
- NvidiaCircuitBreakerHalfOpen (Info)
- NvidiaCircuitBreakerClosed (Info)
- NvidiaNoRequests (P3)
Auto-remediation:
- fallback_to_gemini
- fallback_to_ollama
- switch_model
ADR: ADR-036
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 02:05:59 +08:00
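The fallback strategy in 93c3280481 (nvidia → gemini → ollama) is an ordered provider chain. A minimal Python sketch of such a chain follows; the provider callables are stand-in stubs, and the function name and error handling are hypothetical rather than the repo's implementation.

```python
from typing import Callable, Sequence

Provider = tuple[str, Callable[[str], str]]


def call_with_fallback(providers: Sequence[Provider], prompt: str) -> tuple[str, str]:
    """Try each provider in order; return (provider name, reply) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # any provider failure triggers the next fallback
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def nvidia_stub(prompt: str) -> str:
    raise TimeoutError("simulated timeout")


def gemini_stub(prompt: str) -> str:
    return f"gemini:{prompt}"


def ollama_stub(prompt: str) -> str:
    return f"ollama:{prompt}"


chain = [("nvidia", nvidia_stub), ("gemini", gemini_stub), ("ollama", ollama_stub)]
print(call_with_fallback(chain, "hi"))
```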
OG T
179e659f14
chore: clean up Playwright artifacts + expand kube-state-metrics alerts
Cleanup:
- Add playwright-report/ and test-results/ exclusions to .gitignore
- Keep the phase19/ reference screenshot directory
kube-state-metrics alert expansion (P3):
- CronJobLastRunFailed: Job execution failed
- DaemonSetMissingPods: DaemonSet is missing Pods
- StatefulSetReplicasMismatch: StatefulSet has too few replicas
- ContainerWaiting: detects ImagePullBackOff/CrashLoopBackOff
- PDBViolation: PDB has too few healthy Pods
- NodeUnschedulable: node is marked unschedulable
Added:
- apps/api/scripts/test_nemotron_tool_calling.py (E2E comparison test)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 01:28:35 +08:00
OG T
ee2bceefff
feat(monitoring): Phase 19.6 test documentation + P1-P3 improvements + Chief Architect review
Phase 19.6 test documentation wrap-up:
- E2E tests expanded to 18 cases (Terminal/GenUI verification)
- Add PHASE19-VERIFICATION-CHECKLIST.md (full verification checklist)
P1 verification:
- ArgoCD Metrics NodePort monitoring (30883/30884)
- TLS certificate monitoring (Blackbox Exporter 9115)
P2 improvements:
- waitForTimeout → waitForLoadState('networkidle')
- Cross-platform shortcuts (Meta+J / Control+J)
- SKIP_MULTISIG_TESTS environment variable control
- Prometheus GitOps deployment script
P3 improvements:
- HPA maxReplicas 4 → 6 (API/Web)
Chief Architect review: 47/50 OUTSTANDING (94%)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 01:19:26 +08:00
OG T
dc7daf5d81
docs(monitoring): update ArgoCD Metrics endpoint documentation
- ArgoCD Server Pod runs on mon1 (192.168.0.121)
- Update Prometheus target to 192.168.0.121:30883
- Mark the configuration as deployed and verified
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-28 23:59:46 +08:00
OG T
e75e578547
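feat(monitoring): P1/P2 improvements - ArgoCD Metrics + TLS certificate alerts
## P1: ArgoCD Metrics
- Add ArgoCD Metrics NodePort (30882, 30883)
- Update NetworkPolicy to allow Prometheus (188) to scrape
- Provide a Prometheus scrape config template
## P1: NetworkPolicy AI API
- Document that K8s NetworkPolicy does not support FQDN restrictions
- Keep the existing configuration to avoid interrupting AI features
## P2: TLS certificate alerts
- Add TLSCertExpiringIn30Days (30-day early warning)
- Add TLSCertExpiringIn7Days (7-day urgent warning)
- Add TLSCertExpired (already expired)
- Add TLSProbeFailure (probe failed)
## P2: Multi-Sig E2E tests
- Marked as conditional (skipped automatically when the API is unavailable)
- Avoids CI/CD failures caused by being unable to reach the production API
Chief Architect review: 2026-03-29 (Taipei time)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>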
2026-03-28 23:48:57 +08:00
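The three expiry alerts in e75e578547 form a tiered threshold on time remaining before a certificate expires. The Python sketch below maps seconds-until-expiry to the most severe matching alert name. Returning only one alert, the function name, and the exact boundary handling are assumptions; the real rules may let the 30-day and 7-day alerts fire at the same time.

```python
from typing import Optional

SECONDS_PER_DAY = 86400


def tls_alert(seconds_until_expiry: float) -> Optional[str]:
    """Return the most severe TLS alert name for the remaining lifetime, if any."""
    days = seconds_until_expiry / SECONDS_PER_DAY
    if days <= 0:
        return "TLSCertExpired"
    if days < 7:
        return "TLSCertExpiringIn7Days"
    if days < 30:
        return "TLSCertExpiringIn30Days"
    return None


print(tls_alert(3 * SECONDS_PER_DAY))
```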
OG T
a30f766eb1
feat(monitoring): full Chief Architect review + supplemental alert rules
## Chief Architect review result: 198/200 (99%) EXCEPTIONAL
### Review scope
- Architecture design: 50/50 ⭐
- Security: 49/50
- Modularity compliance: 50/50 ⭐
- Monitoring and alerting: 49/50
- E2E tests: 49/50
### New supplemental alerts (12)
- RedisDown, PostgreSQLDown, OllamaDown, OpenClawDown
- HarborDown, LangfuseDown
- HPAMaxedOut, HPAScalingDisabled
- WorkerUnavailable
- NodeHighCPU, NodeHighMemory, ContainerOOMKilled
### Files
- k8s/monitoring/k3s-alerts-supplemental.yaml
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-28 23:30:44 +08:00
OG T
0b68352fc2
feat(k3s): P2/P3 improvements - kube-state-metrics + Kured timezone fix + Descheduler tuning
P2 improvements:
- Add kube-state-metrics v2.10.1 (NodePort:30888)
- Add 7 kube-state-metrics alert rules (NPD integration)
P3 improvements:
- Fix Kured maintenance window timezone (18:00→02:00 Taipei time)
- Descheduler threshold 20%→30% (avoids excessive migration)
Action items recommended by the Chief Architect review
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-28 22:23:42 +08:00
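The Kured timezone fix in 0b68352fc2 pairs 18:00 with 02:00 Taipei time. These two values are 8 hours apart, which matches the UTC+8 offset. The Python sketch below shows the arithmetic under the assumption that the original 18:00 was a UTC value; the commit message does not say so explicitly, and the date used is arbitrary.

```python
from datetime import datetime, timedelta, timezone

TAIPEI = timezone(timedelta(hours=8))  # fixed UTC+8 offset, no DST


def utc_to_taipei(hour: int, minute: int = 0) -> str:
    """Convert a UTC wall-clock time to Taipei wall-clock time (HH:MM)."""
    utc_time = datetime(2026, 3, 28, hour, minute, tzinfo=timezone.utc)
    return utc_time.astimezone(TAIPEI).strftime("%H:%M")


print(utc_to_taipei(18))  # 02:00 (on the following day)
```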
OG T
1a4be7b18a
feat(k-mon): K3s monitoring integration (Phase K-MON)
- Add Velero metrics NodePort service (30885)
- Add K3s infrastructure alert rules:
  - VIP 6443 availability
  - Node ICMP checks
  - AWOOOI API/Web TCP checks
  - SigNoz/Sentry availability
- Add Velero backup alerts (failed/missing)
- Add ADR-034 for ArgoCD GitOps adoption
Deployed to:
- K3s: velero-metrics service
- 188: Prometheus + Alertmanager configs
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-28 21:57:57 +08:00