# AWOOOI Monitoring Integration Master Plan

> **Version**: v1.0
> **Created**: 2026-03-29
> **Status**: ✅ Approved by the Commander
> **Total effort**: 10.75h
> **Consolidated from**: `MONITORING_INTEGRATION_ARCHITECTURE.md` + `IMPLEMENTATION_STEPS_REMAINING_PHASES.md`

---

## 1. Executive Summary

This plan consolidates the following two documents:
1. **MONITORING_INTEGRATION_ARCHITECTURE.md** - monitoring-as-code architecture
2. **IMPLEMENTATION_STEPS_REMAINING_PHASES.md** - Phase D-G implementation steps

### Key Findings

| Category | Done | Remaining |
|------|--------|--------|
| **Service Registry** | ✅ Includes NVIDIA | - |
| **Coverage validation** | ✅ validate_coverage.py | generate_monitoring.py |
| **NVIDIA alerts** | ✅ 5 rules | Grafana Dashboard |
| **Sentry integration** | ✅ Webhook Handler | Comment write-back (TODO) |
| **SigNoz integration** | ❌ No webhook | Handler + Rules |
| **Alert chain verification** | ❌ None | Smoke Test + CD integration |

---

## 2. Task Dependencies

```
Layer 0: Infrastructure (no dependencies)
├── L0.1 Sentry API Token ────────────────────┐
└── L0.2 SigNoz alert rules ──────────────────┼──▶ Layer 1
                                              │
Layer 1: Webhook chain                        │
├── L1.1 SigNoz Webhook (←L0.2) ──────────────┤
├── L1.2 Sentry Comment (←L0.1) ──────────────┼──▶ Layer 2
└── L1.3 Alert Chain Metrics ─────────────────┘
                                              │
Layer 2: Alert chain verification             │
├── L2.1 Smoke Test (←L1.1,L1.2) ─────────────┤
├── L2.2 Alert Chain Rules (←L1.3) ───────────┼──▶ Layer 3
└── L2.3 CD Pipeline (←L2.1) ─────────────────┘
                                              │
Layer 3: Monitoring automation (independent)  │
├── L3.1 generate_monitoring.py ──────────────┤
├── L3.2 CI coverage check (←L3.1) ───────────┼──▶ Layer 4
└── L3.3 Docker auto-discovery ───────────────┘
                                              │
Layer 4: Visualization                        │
├── L4.1 NVIDIA Grafana Dashboard ────────────┤
└── L4.2 Monitoring coverage report (←L3.1) ──┘
```

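
The explicit `←` edges in the diagram form a small dependency graph, so a valid execution order can be derived mechanically. The following is a minimal sketch using Python's standard `graphlib` module; the `DEPENDENCIES` mapping and the `execution_batches` helper are illustrative names that do not come from the plan, and only the explicit `←` edges are transcribed (the layer-to-layer arrows are not modeled). Each returned batch contains tasks whose prerequisites are already complete, so they could run in parallel.

```python
from graphlib import TopologicalSorter

# Task -> set of prerequisite tasks, transcribed from the explicit
# "←" markers in the diagram (hypothetical representation).
DEPENDENCIES = {
    "L0.1": set(),
    "L0.2": set(),
    "L1.1": {"L0.2"},
    "L1.2": {"L0.1"},
    "L1.3": set(),
    "L2.1": {"L1.1", "L1.2"},
    "L2.2": {"L1.3"},
    "L2.3": {"L2.1"},
    "L3.1": set(),
    "L3.2": {"L3.1"},
    "L3.3": set(),
    "L4.1": set(),
    "L4.2": {"L3.1"},
}


def execution_batches(deps):
    """Return a list of batches; tasks within one batch are independent."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    for index, batch in enumerate(execution_batches(DEPENDENCIES)):
        print(f"batch {index}: {', '.join(batch)}")
```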
---

## 3. Wave Execution Plan

### Wave A: Alert chain completion (P0 - 3.5h)

| # | Task | Effort | Depends on | Parallelizable | File |
|---|------|------|------|--------|------|
| A.1 | Configure Sentry API token | 15min | - | ✅ | GitHub Secret + K8s |
| A.2 | Deploy SigNoz alert rules | 30min | - | ✅ | `signoz/alerting/rules.yaml` |
| A.3 | SigNoz Webhook Handler | 45min | A.2 | - | `signoz_webhook.py` |
| A.4 | Sentry comment write-back | 30min | A.1 | - | `sentry_webhook.py` |
| A.5 | Alert Chain Metrics | 30min | - | ✅ | `core/metrics.py` |
| A.6 | Smoke test script | 45min | A.3,A.4 | - | `alert_chain_smoke_test.py` |

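
Task A.3 has to turn an incoming alert webhook into a notification. The sketch below shows one possible shape of that step using only the standard library. It assumes the webhook body follows the Prometheus Alertmanager webhook layout (an `alerts` list whose entries carry `status`, `labels`, and `annotations`); that layout is an assumption about the SigNoz channel, not something this plan states, and `format_alert_message` is a hypothetical helper, not the contents of `signoz_webhook.py`.

```python
import json


def format_alert_message(payload: dict) -> str:
    """Render an Alertmanager-style webhook body as one text message.

    The payload layout is assumed, not taken from the plan.
    """
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown").upper()
        name = labels.get("alertname", "unnamed alert")
        severity = labels.get("severity", "n/a")
        line = f"[{status}] {name} (severity={severity})"
        summary = annotations.get("summary", "")
        if summary:
            line += f": {summary}"
        lines.append(line)
    return "\n".join(lines) if lines else "No alerts in payload"


if __name__ == "__main__":
    # Hypothetical sample body used only for illustration.
    sample = json.loads(
        '{"alerts": [{"status": "firing",'
        ' "labels": {"alertname": "HighErrorRate", "severity": "critical"},'
        ' "annotations": {"summary": "5xx rate above threshold"}}]}'
    )
    print(format_alert_message(sample))
```

Keeping the formatting in a pure function, separate from the HTTP handler and the Telegram call, would let the smoke test in A.6 exercise it without network access.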
### Wave B: Chain protection (P1 - 1.5h)

| # | Task | Effort | Depends on | File |
|---|------|------|------|------|
| B.1 | Alert Chain PrometheusRule | 30min | A.5 | `k8s/monitoring/alert-chain-monitor.yaml` |
| B.2 | CD pipeline integration | 30min | A.6 | `.github/workflows/cd.yaml` |
| B.3 | Deployment verification + docs update | 30min | B.1,B.2 | ADR + Memory |

### Wave C: Monitoring automation (P2 - 2.75h)

| # | Task | Effort | Depends on | File |
|---|------|------|------|------|
| C.1 | generate_monitoring.py | 1.5h | - | `ops/monitoring/generate_monitoring.py` |
| C.2 | CI monitoring coverage check | 30min | C.1 | `.github/workflows/cd.yaml` |
| C.3 | Docker container auto-discovery | 45min | - | `ops/monitoring/discover_docker.py` |

### Wave D: Visualization (P3 - 3h)

| # | Task | Effort | Depends on | File |
|---|------|------|------|------|
| D.1 | NVIDIA Grafana Dashboard | 2h | - | `ops/grafana/nvidia-dashboard.json` |
| D.2 | Monitoring coverage report | 1h | C.1 | `ops/monitoring/coverage_report.py` |

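
The effort figures in the wave tables can be cross-checked with simple arithmetic. The sketch below transcribes the per-task durations and the hours stated in each wave heading; `TASK_MINUTES`, `STATED_HOURS`, and the helper functions are illustrative names only. Under this transcription the stated per-wave hours sum to the 10.75h total, while the itemized Wave A tasks add up to 3.25h against the 3.5h in the Wave A heading; the plan does not say whether the 15-minute difference is intended as buffer.

```python
# Task durations in minutes, transcribed from the Wave A-D tables.
TASK_MINUTES = {
    "A": {"A.1": 15, "A.2": 30, "A.3": 45, "A.4": 30, "A.5": 30, "A.6": 45},
    "B": {"B.1": 30, "B.2": 30, "B.3": 30},
    "C": {"C.1": 90, "C.2": 30, "C.3": 45},
    "D": {"D.1": 120, "D.2": 60},
}

# Hours as stated in each wave heading.
STATED_HOURS = {"A": 3.5, "B": 1.5, "C": 2.75, "D": 3.0}


def itemized_hours(wave: str) -> float:
    """Sum the task durations of one wave and convert to hours."""
    return sum(TASK_MINUTES[wave].values()) / 60


def report() -> dict:
    """Map each wave to a (itemized hours, stated hours) pair."""
    return {wave: (itemized_hours(wave), STATED_HOURS[wave]) for wave in TASK_MINUTES}


if __name__ == "__main__":
    for wave, (itemized, stated) in report().items():
        note = "" if itemized == stated else "  <- differs from heading"
        print(f"Wave {wave}: itemized {itemized}h, stated {stated}h{note}")
    print(f"Stated total: {sum(STATED_HOURS.values())}h")
```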
---

## 4. Existing Integration Points

### Phase 20 (NVIDIA Nemotron) ✅

```yaml
Done:
  - nvidia_provider.py P3 Prometheus Metrics
  - k8s/monitoring/nvidia-alerts.yaml (5 rules)
  - ops/monitoring/service-registry.yaml (NVIDIA entry)

Integration:
  - Wave A.5 Alert Chain Metrics brings NVIDIA alerts under monitoring
```

### K-MON (K3s monitoring) ✅

```yaml
Done:
  - k8s/monitoring/k3s-alerts.yaml (20+ rules)
  - Blackbox Exporter
  - kube-state-metrics

Integration:
  - Wave B.1 extends the existing PrometheusRule
```

### ADR-034 (Telegram Secrets injection) ✅

```yaml
Done:
  - Pre-flight check
  - Automatic injection in CD

Integration:
  - Wave A.1 injects SENTRY_API_TOKEN using the same pattern
```

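
A pre-flight check of the kind ADR-034 describes could be reused for `SENTRY_API_TOKEN` in Wave A.1. The sketch below is one way such a check might look; it is not the ADR-034 implementation, and `REQUIRED_SECRETS`, `missing_secrets`, and `preflight` are hypothetical names. It only verifies that the variable is present and non-blank, not that the token has the right permissions.

```python
import os

# Secrets that must be present before deployment (illustrative list).
REQUIRED_SECRETS = ("SENTRY_API_TOKEN",)


def missing_secrets(env, required=REQUIRED_SECRETS):
    """Return the names of required secrets that are unset or blank."""
    return [name for name in required if not env.get(name, "").strip()]


def preflight(env=None):
    """Abort with a non-zero exit status if any required secret is missing."""
    env = os.environ if env is None else env
    missing = missing_secrets(env)
    if missing:
        raise SystemExit(f"pre-flight failed, missing secrets: {', '.join(missing)}")
    print("pre-flight passed")


if __name__ == "__main__":
    preflight()
```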
---

## 5. Phase D-G Task Mapping

| Original Phase | Task | Mapped Wave | Status |
|----------|------|-----------|------|
| **Phase D** | Sentry comment write-back | Wave A.4 | Pending |
| **Phase E** | SigNoz alert rules | Wave A.2 + A.3 | Pending |
| **Phase F** | Alert chain E2E verification | Wave A.6 + B.1 + B.2 | Pending |
| **Phase G** | Learning Service | ✅ Already exists | Integration only |

---

## 6. Execution Schedule

```
Day 1 (4h):
├── [parallel] A.1 Sentry Token (15min)
├── [parallel] A.2 SigNoz Rules (30min)
├── [parallel] A.5 Alert Chain Metrics (30min)
├── A.3 SigNoz Webhook (45min)
├── A.4 Sentry Comment (30min)
└── A.6 Smoke Test (45min)

Day 2 (2h):
├── B.1 Alert Chain Rules (30min)
├── B.2 CD Pipeline (30min)
└── B.3 Verification + docs (30min)

Day 3+ (deferrable):
├── Wave C: Monitoring automation (2.75h)
└── Wave D: Visualization (3h)
```

---

## 7. Acceptance Criteria

### Wave A completion criteria

- [ ] Sentry issues automatically receive an AI analysis comment
- [ ] SigNoz alerts can trigger Telegram notifications
- [ ] `alert_chain_smoke_test.py` passes in full

### Wave B completion criteria

- [ ] The smoke test runs automatically after each CD deployment
- [ ] A broken alert chain triggers an alert within 2 hours
- [ ] ADR-037 is created and passes review

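
The two-hour criterion implies some notion of when the chain counts as broken. Wave B.1 would express this as a PrometheusRule; the Python below only illustrates the logic under one possible reading, namely that the chain is considered broken when no successful delivery has been recorded within a configurable window. `MAX_SILENCE` and `chain_is_broken` are hypothetical names, and the two-hour default is taken from the criterion above rather than from any existing rule.

```python
from datetime import datetime, timedelta, timezone

# Longest tolerated gap since the last successful delivery (assumed reading
# of the two-hour acceptance criterion).
MAX_SILENCE = timedelta(hours=2)


def chain_is_broken(last_success, now, max_silence=MAX_SILENCE):
    """Return True if no successful delivery falls within max_silence of now."""
    if last_success is None:
        # Nothing has ever been delivered, so treat the chain as broken.
        return True
    return now - last_success >= max_silence


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(chain_is_broken(now - timedelta(minutes=30), now))
    print(chain_is_broken(now - timedelta(hours=3), now))
```

Passing `now` explicitly keeps the function deterministic, which makes it straightforward to cover in the smoke test.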
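### Wave C completion criteria

- [ ] CI fails when a new service is not registered
- [ ] New Docker containers are scanned for automatically every hour
- [ ] Generated configuration matches the existing configuration
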
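
The first criterion amounts to a set difference between the services that exist and the services listed in the registry. The following sketch assumes both are already available as lists of names; how they are obtained (for example by parsing `ops/monitoring/service-registry.yaml` or by Docker discovery) is outside its scope, and `unregistered_services` and `check_coverage` are illustrative names, not functions from `validate_coverage.py`.

```python
def unregistered_services(discovered, registered):
    """Return discovered service names missing from the registry, sorted."""
    return sorted(set(discovered) - set(registered))


def check_coverage(discovered, registered) -> int:
    """Return a process exit code: 0 when fully covered, 1 otherwise."""
    missing = unregistered_services(discovered, registered)
    for name in missing:
        print(f"unregistered service: {name}")
    return 1 if missing else 0


if __name__ == "__main__":
    # Hypothetical service names used only for illustration.
    code = check_coverage(["api", "worker", "nvidia"], ["api", "nvidia"])
    raise SystemExit(code)
```

Returning an exit code instead of raising lets a CI step fail simply by propagating the value.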
### Wave D completion criteria

- [ ] The NVIDIA Grafana Dashboard is accessible
- [ ] The daily coverage report is sent automatically

---
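
## 8. Risk Assessment

| Risk | Probability | Impact | Mitigation |
|------|------|------|----------|
| Sentry API token lacks required permissions | Low | Medium | Test an API call before deploying |
| SigNoz version does not support alerting | Low | High | Confirm the SigNoz version |
| Smoke test false positives | Medium | Low | Set reasonable timeouts + retries |
| CD pipeline slows down | Medium | Low | Run the smoke test in parallel |
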
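
The retry mitigation for smoke test false positives can be sketched as a small helper. This is an illustration rather than the contents of `alert_chain_smoke_test.py`: `run_with_retry` and its parameters are hypothetical, and the timeout itself is assumed to be enforced inside the `check` callable (for example through the timeout argument of whatever HTTP client it uses).

```python
import time


def run_with_retry(check, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call check() until it returns True or the attempts are exhausted.

    An exception raised by check() counts as a failed attempt.
    """
    for attempt in range(1, attempts + 1):
        try:
            if check():
                return True
        except Exception:
            pass  # fall through and retry
        if attempt < attempts:
            sleep(delay_seconds)
    return False


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_check():
        # Fails twice, then succeeds, to mimic a transient fault.
        state["calls"] += 1
        return state["calls"] >= 3

    print(run_with_retry(flaky_check, attempts=3, delay_seconds=0.1))
```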
---

## 9. Related Documents

- [MONITORING_INTEGRATION_ARCHITECTURE.md](MONITORING_INTEGRATION_ARCHITECTURE.md)
- [IMPLEMENTATION_STEPS_REMAINING_PHASES.md](IMPLEMENTATION_STEPS_REMAINING_PHASES.md)
- [ADR-034 Telegram Secrets Injection](../adr/ADR-034-telegram-secrets-injection.md)
- [ADR-036 Nemotron Tool Calling](../adr/ADR-036-nemotron-tool-calling-integration.md)

---

**End of document**

*Approved by the Commander on 2026-03-29*