OG T
|
e70ceaba61
|
ops(signoz): add log-based alert rules documentation (Sprint 2)
5 rules: APIHighErrorLogRate/WorkerTaskFailed/PodOOMKilled/
TelegramPollingFailed/NemotronAllTimeout
Includes SigNoz UI setup steps and webhook verification commands
Labels aligned with the unified Prometheus convention (layer/component/team)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-05 11:10:02 +08:00 |
|
OG T
|
dc27f8f811
|
ops(monitoring): unify Prometheus alert rules: 40+ rules with a unified layer label
Fixes:
- ClawBotDown → OpenClawDown (old name deprecated)
- Add SentryDown/HarborDown/GiteaDown/AlertmanagerDown
- Add the unified layer/component/host/auto_repair labels to all rules
- Consolidate k8s/monitoring/*.yaml → ops/monitoring/alerts-unified.yml
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-05 02:26:18 +08:00 |
|
OG T
|
30b7b10f01
|
feat(grafana): Wave D - AI monitoring + infrastructure dashboards (Grafana 188:3002)
Add 2 dashboards and import the existing Nemotron dashboard:
1. ai-monitoring.json: LLM + NVIDIA AI monitoring
- LLM call rate (req/min)
- LLM P99/P50 latency
- Nemotron Tool Calling P99/P50 latency
- LLM cache hit rate (%)
- LLM fallback count
- Alert Chain health / last success time
2. infra-monitoring.json: Node + K3s infrastructure
- CPU/Memory usage
- K3s Pod count (by namespace)
- K3s Pod restart count
- Prometheus Targets UP/DOWN
- API request rate
3. nvidia-nemotron.json: existing 18-panel Nemotron dashboard (now under version control)
Deployment: 192.168.0.188:3002 (Grafana 12.4.1)
Provisioning: monitoring/grafana/provisioning/dashboards/
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-03 00:18:00 +08:00 |
|
OG T
|
3f339110dd
|
fix(observability): sync the actual .188 deployment adjustments back to the repo
CD Pipeline / build-and-deploy (push) Has been cancelled
E2E Health Check / e2e-health (push) Has been cancelled
Differences from the original plan:
1. MinIO Bearer Token authentication
- Original plan: MINIO_PROMETHEUS_AUTH_TYPE=public (not supported in this version)
- Actual: mc admin prometheus generate produces a Bearer Token
- Updated: prometheus-config-phase-o.yaml now includes bearer_token
2. remote_write dropped → OTEL Collector Prometheus scrape
- Original plan: Prometheus remote_write → SigNoz OTEL /api/v1/write
- Actual: the SigNoz OTEL Collector does not support the Prometheus remote_write format (404)
- Replacement: the OTEL Collector prometheus receiver scrapes node-exporter + kube-state-metrics directly
- Added: ops/signoz/otel-collector-config-phase-o.yaml (version-controlled copy)
3. ADR-053 acceptance checklist updated with the actual results
Co-Authored-By: Claude Code <noreply@anthropic.com>
|
2026-04-02 21:23:47 +08:00 |
|
OG T
|
3e4612f259
|
docs(observability): ADR-053 SigNoz unified architecture + Phase O acceptance
CD Pipeline / build-and-deploy (push) Failing after 36s
E2E Health Check / e2e-health (push) Successful in 16s
- Add ADR-053: decision record for the unified observability architecture
- Update service-registry.yaml: add the missing MinIO/Kali monitoring entries
- Update LOGBOOK: Phase O completion status
Phase O acceptance checklist:
✅ Passwordless kubectl on the local Mac
✅ OTEL Collector 2 Pod Running
✅ Event Exporter 1 Pod Running
✅ Descheduler CronJob Completed
✅ MinIO + Kali alert rules
✅ Alert Chain Smoke Test
✅ CD Pipeline integration
⏳ ClickHouse TTL / remote_write / SigNoz rules (pending manual work on .188)
Co-Authored-By: Claude Code <noreply@anthropic.com>
|
2026-04-02 18:26:57 +08:00 |
|
OG T
|
a5a6bd3408
|
feat(monitoring): K8s alert rules + Grafana dashboards + ops scripts
- k8s/monitoring/alert-chain-monitor.yaml
- k8s/monitoring/database-alerts.yaml
- ops/grafana/ Grafana dashboards
- ops/signoz/ SigNoz configuration
- ops/scripts/ operations scripts
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 16:04:14 +08:00 |
|
OG T
|
c7f9c119e7
|
fix(cd): commit the missing ops/monitoring scripts
The missing files caused the CD Monitoring Coverage step to fail
Added:
- generate_monitoring.py - monitoring coverage check
- coverage_report.py - coverage report
- discover_docker.py - Docker service discovery
- deploy-exporters.sh - exporter deployment script
- postgres-exporter-queries.yaml - PostgreSQL query configuration
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 15:45:42 +08:00 |
|
OG T
|
12e49d844a
|
feat(monitoring): ADR-037 Wave B - Database Exporters + Prometheus integration
- Deploy PostgreSQL Exporter (192.168.0.188:9187)
- Deploy Redis Exporter (192.168.0.188:9121)
- Update Prometheus scrape config
- Chief architect review: 97% OUTSTANDING
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 15:18:54 +08:00 |
|
OG T
|
d15fb7d9f4
|
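fix(cd): serialize builds to fix the Runner _runner_file_commands conflict
Root cause: the Set up job stages of parallel Jobs write to RUNNER_TEMP at the same time
Fix: build-api needs build-web, which forces sequential execution
Removed: Job-level concurrency groups (no longer needed)
Updated: ops/runner/README.md v1.0→v2.0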
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 10:29:11 +08:00 |
|
OG T
|
07114f9181
|
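fix(runner): v4 - enable cancel-in-progress to prevent parallel conflicts
Root cause confirmed:
- The _diag/pages conflict occurs in the "Set up job" stage
- That stage runs before any custom step executes
- It is an internal Runner bug that workflow-level cleanup cannot fix
Permanent solution:
- cancel-in-progress: true (ensures only one workflow runs at a time)
- No longer attempt to clean RUNNER_TEMP (doing so breaks other Jobs)
- Keep the _diag/pages cleanup as a supplementary measure
Update ops/runner/README.md:
- Full root cause analysis
- Description of the v3 final solution
- Warning: do not clean RUNNER_TEMP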
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 02:10:17 +08:00 |
|
OG T
|
93c3280481
|
feat(monitoring): Phase 20 full Nemotron monitoring integration
Service registry:
- Add the nvidia-nemotron AI service
- 3 Prometheus metric definitions
- 4 alert rules (circuit_breaker, timeout, error_rate, rate_limit)
- Fallback strategy (nvidia → gemini → ollama)
Alertmanager rules:
- NvidiaCircuitBreakerOpen (P1)
- NvidiaToolCallingHighLatency (P2)
- NvidiaToolCallingHighErrorRate (P0)
- NvidiaCircuitBreakerHalfOpen (Info)
- NvidiaCircuitBreakerClosed (Info)
- NvidiaNoRequests (P3)
Auto-repair:
- fallback_to_gemini
- fallback_to_ollama
- switch_model
ADR: ADR-036
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 02:05:59 +08:00 |
|
OG T
|
183776a34f
|
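fix(runner): permanently fix the _diag/pages file conflict
Problem: "file already exists" errors during parallel Runner execution cause CD to fail
Solution:
1. CD Workflow: delete the whole _diag/pages directory and recreate it (instead of rm -rf /*)
2. Systemd Timer: automatically clean up stale files every 5 minutes
3. flock locking: prevents races between cleanup processes
New files:
- ops/runner/cleanup-runner-diag.sh - cleanup script
- ops/runner/runner-diag-cleanup.service - Systemd service
- ops/runner/runner-diag-cleanup.timer - timer
- ops/runner/deploy-runner-cleanup.sh - deployment script
- ops/runner/README.md - documentation
Deployment commands: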
ssh wooo@192.168.0.110
bash awoooi/ops/runner/deploy-runner-cleanup.sh
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 02:04:35 +08:00 |
|
OG T
|
40163a51b5
|
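feat(monitoring): complete monitoring strategy and auto-integration architecture
Added:
1. MONITORING_COMPLETE_STRATEGY.md - complete monitoring strategy
- Monitoring matrix of 5 hosts × 60+ services
- P0/P1/P2 alert rule list
- AI auto-repair closed-loop flow
- Safety guardrail configuration
2. MONITORING_INTEGRATION_ARCHITECTURE.md - auto-integration architecture
- Service registry (Single Source of Truth)
- CI/CD automatically validates monitoring coverage
- New services get monitoring automatically
3. ops/monitoring/service-registry.yaml - service inventory
- K8s workloads (API/Web/Worker/ArgoCD)
- Docker containers (Ollama/OpenClaw/Redis/Postgres)
- Frontend page SLOs
- API endpoint SLOs
- Alert templates and auto-repair actions
4. ops/monitoring/validate_coverage.py - coverage validation
- Runs in the CI stage
- Detects unmonitored services
- Generates a coverage report
Design principles:
- Monitoring as Code
- New services must be registered in the registry before they can be deployed
- Auto-discovery mechanism prevents omissions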
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 01:52:08 +08:00 |
|
OG T
|
9bff46a1b0
|
feat: integrate Sentry + fix CI/CD issues
Sentry Integration (complements SigNoz):
- Add @sentry/nextjs for frontend error tracking + session replay
- Add sentry-sdk[fastapi] for backend error tracking
- Create sentry.client/server/edge.config.ts
- Integrate with next.config.js + instrumentation.ts
- Add Sentry exception capture in FastAPI error handler
- Create deployment scripts for Self-Hosted @ 192.168.0.110
CI/CD Fixes:
- Fix F821 Undefined name 'Field' in incidents.py
- Add NEXT_PUBLIC_API_URL env var to CI build step
- Add build-arg to Docker build verification
E2E Test Improvements:
- Fix strict mode violations in dashboard-acceptance tests
- Add timeout increase for Phase 4 demo tests
- Make tests more resilient to UI variations
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-24 15:19:52 +08:00 |
|
OG T
|
7478dc0254
|
feat(phase6-9): Complete modular architecture and Agent Teams
Phase 6.4 - Modular Architecture:
- Add lewooogo-brain adapters for LLM providers
- Add lewooogo-data dual memory (Redis + PostgreSQL)
- Implement consensus engine for multi-agent decisions
- Add incident memory service for historical context
Phase 9 - Agent Teams (Claude Agent SDK):
- Add base agent class with Claude Sonnet 4 integration
- Implement action planner, blast radius, and security agents
- Add agent API endpoints and proposal workflow
- Integrate ADR-009 OpenClaw Agent Teams architecture
DevOps & CI/CD:
- Add GitHub Actions CI/CD workflows (ci.yaml, cd.yaml)
- Add pre-commit hooks and secrets baseline
- Add docker-compose for local development
- Update Kubernetes network policies
Frontend Improvements:
- Add auto-healing error boundary component
- Update i18n messages for agent features
- Enhance dual-state incident card with execution feedback
Documentation:
- Add 7 ADRs covering MCP, design system, architecture decisions
- Update ARCHITECTURE_MEMORY.md with modular design
- Add GLOBAL_RULES.md and SOUL.md for project identity
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-23 18:40:36 +08:00 |
|