OG T
|
dc27f8f811
|
ops(monitoring): 統一 Prometheus 告警規則 — 40+條含統一 layer 標籤
修正:
- ClawBotDown → OpenClawDown (舊命名廢棄)
- 加入 SentryDown/HarborDown/GiteaDown/AlertmanagerDown
- 所有規則補齊 layer/component/host/auto_repair 統一標籤
- 整合 k8s/monitoring/*.yaml → ops/monitoring/alerts-unified.yml
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-05 02:26:18 +08:00 |
|
OG T
|
3e4612f259
|
docs(observability): ADR-053 SigNoz 統一架構 + Phase O 驗收
CD Pipeline / build-and-deploy (push) Failing after 36s
E2E Health Check / e2e-health (push) Successful in 16s
- 新增 ADR-053: 可觀測性統一架構決策記錄
- 更新 service-registry.yaml: 補齊 MinIO/Kali 監控入口
- 更新 LOGBOOK: Phase O 完成狀態
Phase O 驗收清單:
✅ kubectl Mac 本機免密碼
✅ OTEL Collector 2 Pod Running
✅ Event Exporter 1 Pod Running
✅ Descheduler CronJob Completed
✅ MinIO + Kali 告警規則
✅ Alert Chain Smoke Test
✅ CD Pipeline 整合
⏳ ClickHouse TTL / remote_write / SigNoz rules (待 .188 手動)
Co-Authored-By: Claude Code <noreply@anthropic.com>
|
2026-04-02 18:26:57 +08:00 |
|
OG T
|
c7f9c119e7
|
fix(cd): 補提交 ops/monitoring 腳本
遺漏文件導致 CD Monitoring Coverage 步驟失敗
新增:
- generate_monitoring.py - 監控覆蓋率檢查
- coverage_report.py - 覆蓋率報告
- discover_docker.py - Docker 服務發現
- deploy-exporters.sh - Exporter 部署腳本
- postgres-exporter-queries.yaml - PostgreSQL 查詢配置
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 15:45:42 +08:00 |
|
OG T
|
12e49d844a
|
feat(monitoring): ADR-037 Wave B - Database Exporters + Prometheus 整合
- 部署 PostgreSQL Exporter (192.168.0.188:9187)
- 部署 Redis Exporter (192.168.0.188:9121)
- 更新 Prometheus scrape config
- 首席架構師審查: 97% OUTSTANDING
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 15:18:54 +08:00 |
|
OG T
|
93c3280481
|
feat(monitoring): Phase 20 Nemotron 完整監控整合
服務註冊表:
- 新增 nvidia-nemotron AI 服務
- 3 個 Prometheus metrics 定義
- 4 個告警規則 (circuit_breaker, timeout, error_rate, rate_limit)
- fallback 策略 (nvidia → gemini → ollama)
Alertmanager 規則:
- NvidiaCircuitBreakerOpen (P1)
- NvidiaToolCallingHighLatency (P2)
- NvidiaToolCallingHighErrorRate (P0)
- NvidiaCircuitBreakerHalfOpen (Info)
- NvidiaCircuitBreakerClosed (Info)
- NvidiaNoRequests (P3)
自動修復:
- fallback_to_gemini
- fallback_to_ollama
- switch_model
ADR: ADR-036
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 02:05:59 +08:00 |
|
OG T
|
40163a51b5
|
feat(monitoring): 完整監控策略與自動整合架構
新增:
1. MONITORING_COMPLETE_STRATEGY.md - 完整監控策略
- 5 主機 × 60+ 服務監控矩陣
- P0/P1/P2 告警規則清單
- AI 自動修復閉環流程
- 安全護欄配置
2. MONITORING_INTEGRATION_ARCHITECTURE.md - 自動整合架構
- 服務註冊表 (Single Source of Truth)
- CI/CD 自動驗證監控覆蓋率
- 新服務自動獲得監控
3. ops/monitoring/service-registry.yaml - 服務清單
- K8s 工作負載 (API/Web/Worker/ArgoCD)
- Docker 容器 (Ollama/OpenClaw/Redis/Postgres)
- 前端頁面 SLO
- API 端點 SLO
- 告警模板與自動修復動作
4. ops/monitoring/validate_coverage.py - 覆蓋率驗證
- CI 階段執行
- 檢測未監控服務
- 生成覆蓋率報告
設計原則:
- 監控即代碼 (Monitoring as Code)
- 新服務必須在 registry 註冊才能部署
- 自動發現機制防止遺漏
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 01:52:08 +08:00 |
|