docs: ADR-060/061 comprehensive monitoring + Event Sourcing architecture decision records

- ADR-060: comprehensive infrastructure monitoring plan (Plan A/B/C/D/E)
- ADR-061: Alert Operation Log Event Sourcing architecture
- LOGBOOK: 2026-04-08 milestone record update

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
4353
docs/LOGBOOK.md
File diff suppressed because it is too large
118
docs/adr/ADR-060-comprehensive-infrastructure-monitoring.md
Normal file
@@ -0,0 +1,118 @@
# ADR-060: Comprehensive Infrastructure Monitoring, Alerting, and Auto-Remediation Architecture

**Status**: Approved (Chief Architect ruling, 2026-04-08)
**Date**: 2026-04-08
**Proposer**: Claude Code
**Reviewer**: Commander (Chief Architect)

---

## Background

The AWOOOI platform's existing monitoring mainly covers the K3s layer (Prometheus + Alertmanager → AWOOOI API → Telegram).
Docker services, system services, and external websites across four hosts, **30+ services/containers** in total, have no monitoring or alerting.

## Problem Statement

| Layer | Services | Current monitoring | Alerting | Auto-remediation |
|-------|----------|--------------------|----------|------------------|
| K3s Pods | ~15 | ✅ Prometheus | ✅ Alertmanager | ✅ Partial |
| Docker containers (host 110) | ~35 | ❌ | ❌ | ❌ |
| Docker containers (host 188) | ~12 | ❌ | ❌ | ❌ |
| System services (host 188) | 3 (PG/Redis/Nginx) | ❌ | ❌ | ❌ |
| External websites | 4 | ❌ | ❌ | ❌ |

## Decision

### Three-tier implementation

**Plan A: Docker container health monitoring (P0)**
- Deploy the `docker-health-monitor.sh` cron script to hosts 110 and 188
- Actively check container state every 5 minutes (unhealthy / exited / dead)
- Three-stage notification flow: Intent → Action → Result
- Call the AWOOOI API endpoint `/api/v1/webhooks/alerts` with an HMAC-signed request
- Push directly through the Telegram Bot API as a fallback
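
The HMAC-signed webhook call in Plan A could look roughly like the sketch below. The ADR does not specify the digest algorithm, the header name, or the payload fields, so HMAC-SHA256, the `X-Signature` header, and the `container` / `status` fields are all assumptions made for illustration only.

```python
import hashlib
import hmac
import json
from typing import Dict, Tuple


def sign_payload(secret: bytes, body: bytes) -> str:
    """Return the hex HMAC-SHA256 digest of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, signature: str) -> bool:
    """Recompute the digest and compare it in constant time."""
    return hmac.compare_digest(sign_payload(secret, body), signature)


def build_alert_request(secret: bytes, container: str, status: str) -> Tuple[bytes, Dict[str, str]]:
    """Build the body and headers for a POST to /api/v1/webhooks/alerts.

    The payload fields and the X-Signature header name are hypothetical.
    """
    body = json.dumps({"container": container, "status": status}, sort_keys=True).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Signature": sign_payload(secret, body),  # assumed header name
    }
    return body, headers
```

The signature is computed over the exact bytes that are sent, so the receiver has to verify against the raw request body rather than a re-serialized copy.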

**Plan B: Fill in missing Prometheus exporters (P0)**
- `postgres_exporter` (9187) → in-depth metrics for PostgreSQL awoooi_prod
- `redis_exporter` (9121) → Redis connections / memory / latency
- `nginx-prometheus-exporter` (9113) → Nginx traffic / connections
- Deploy on host 188 and add to the Prometheus scrape config
- Add detailed rules in `alerts-unified.yml`

**Plan C: Fill in Blackbox probes for external websites (P1)**
- Add the 4 external product websites to the Blackbox HTTP probe
- mo.wooo.work / tsenyang.wooo.work / stock.wooo.work / bitan.wooo.work
- SSL certificate expiry warning (14 days)
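
The 14-day certificate warning in Plan C comes down to a timestamp comparison. The Blackbox exporter reports the earliest certificate expiry as a Unix timestamp (`probe_ssl_earliest_cert_expiry`), so the check can be sketched as below. The function name and the strict "less than" boundary are assumptions; the actual alert would presumably be written as a Prometheus rule rather than in Python.

```python
SECONDS_PER_DAY = 86_400


def cert_expiry_warning(expiry_epoch: float, now_epoch: float, warn_days: int = 14) -> bool:
    """Return True when the certificate expires in fewer than warn_days days.

    expiry_epoch plays the role of probe_ssl_earliest_cert_expiry;
    now_epoch plays the role of the current Unix time.
    """
    return expiry_epoch - now_epoch < warn_days * SECONDS_PER_DAY
```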

### Auto-remediation strategy (Chief Architect ruling)

| Service type | Strategy | Action |
|--------------|----------|--------|
| Docker containers (non-DB, non-monitoring) | **Auto-repair** | `docker restart` |
| Nginx systemd | **Auto-repair** | `systemctl restart nginx` |
| Prometheus/Grafana/Alertmanager | **Auto-start** | `docker start` (not restart, to protect the WAL) |
| PostgreSQL | **Alert only** | Automatic restart prohibited (data safety) |
| Redis | **Alert only** | Automatic restart prohibited (the alert chain depends on it) |
| External website down | **Auto-repair** | `docker restart` on the corresponding container |
| Inter-host network outage | **Alert only** | Network problems are not auto-repaired |
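
One way to keep this table enforceable in code is a single lookup that returns either a command or nothing. The sketch below is only an illustration: the service-type keys, the `plan_action` name, and the choice to treat unknown service types as alert-only are assumptions, not part of the ADR.

```python
from enum import Enum
from typing import Dict, Optional, Tuple


class Strategy(Enum):
    AUTO_REPAIR = "auto_repair"
    AUTO_START = "auto_start"
    ALERT_ONLY = "alert_only"


# Hypothetical service-type keys mapped to the strategy table above.
POLICY: Dict[str, Tuple[Strategy, Optional[str]]] = {
    "docker_container": (Strategy.AUTO_REPAIR, "docker restart {name}"),
    "nginx": (Strategy.AUTO_REPAIR, "systemctl restart nginx"),
    "monitoring_stack": (Strategy.AUTO_START, "docker start {name}"),  # start, not restart
    "postgresql": (Strategy.ALERT_ONLY, None),
    "redis": (Strategy.ALERT_ONLY, None),
    "external_site": (Strategy.AUTO_REPAIR, "docker restart {name}"),
    "host_network": (Strategy.ALERT_ONLY, None),
}


def plan_action(service_type: str, name: str = "") -> Optional[str]:
    """Return the command to run, or None when the policy is alert-only.

    Unknown service types fall back to alert-only (an assumed safe default).
    """
    strategy, template = POLICY.get(service_type, (Strategy.ALERT_ONLY, None))
    if strategy is Strategy.ALERT_ONLY or template is None:
        return None
    return template.format(name=name)
```

For example, `plan_action("docker_container", "web")` yields `docker restart web`, while `plan_action("postgresql", "pg")` yields `None`, so the caller only raises an alert.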

### Notification order (Chief Architect ruling)

**Required**: Intent → Action → Result

```
Telegram: [AI Decision] Detected an anomaly in <container>; preparing to run Docker Restart...
System: perform the restart
Telegram: [Result] <container> restart succeeded / failed, manual intervention required
```
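
The required ordering can be enforced by wrapping the action between two notifications. This is a minimal sketch under assumptions: `notify` and `restart` are hypothetical callables standing in for the Telegram push and the Docker restart, and the message wording mirrors the example above rather than any confirmed implementation.

```python
from typing import Callable


def remediate(
    container: str,
    notify: Callable[[str], None],
    restart: Callable[[str], bool],
) -> bool:
    """Send the intent, run the action, then send the result, in that order."""
    # 1. Intent: announce the decision before anything is executed.
    notify(f"[AI Decision] Detected an anomaly in {container}; preparing to run Docker Restart...")

    # 2. Action: a raised exception is treated the same as a failed restart.
    try:
        ok = restart(container)
    except Exception:
        ok = False

    # 3. Result: always reported, including on failure.
    if ok:
        notify(f"[Result] {container} restart succeeded")
    else:
        notify(f"[Result] {container} restart failed, manual intervention required")
    return ok
```

Because the result message is sent on every path, a failed or crashing restart still produces a visible request for manual intervention.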

## Backup and Restore

### PostgreSQL awoooi_prod
- Tool: Restic + GFS (grandfather-father-son) retention, using the existing backup-awoooi.sh
- Frequency: every 6 hours (frequent) + daily full backup at 02:00
- Retention: daily=14, weekly=8, monthly=12
- Restore RTO: < 30 minutes
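
The retention values map directly onto restic's `forget` flags (`--keep-daily`, `--keep-weekly`, `--keep-monthly`, `--prune`). The helper below only builds the argument list. Whether `backup-awoooi.sh` invokes restic this way is an assumption, and the function name is hypothetical.

```python
from typing import List


def restic_forget_args(daily: int = 14, weekly: int = 8, monthly: int = 12, prune: bool = True) -> List[str]:
    """Build a `restic forget` command line for the GFS retention policy."""
    args = [
        "restic", "forget",
        "--keep-daily", str(daily),
        "--keep-weekly", str(weekly),
        "--keep-monthly", str(monthly),
    ]
    if prune:
        # --prune also deletes data that no remaining snapshot references.
        args.append("--prune")
    return args
```

The resulting list can be passed to `subprocess.run(...)` with the repository and password supplied through restic's usual environment variables.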

### Docker volume backup
- momo-db, harbor-db, sentry-postgres
- Tool: `docker exec pg_dump` → Restic
- Frequency: daily at 03:00

### Backup verification
- Run a DR drill at 03:00 on the first Sunday of each month
- Restore the latest snapshot into a temporary container and run SQL verification
- Push the result to Telegram
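
"First Sunday of the month" is awkward to express in a plain crontab line, because standard cron runs a job when either the day-of-month or the day-of-week field matches once both are restricted. One common workaround is to schedule the job for every Sunday at 03:00 and let the script exit early unless a guard like the one below passes. This is a sketch; the function names are hypothetical.

```python
from datetime import date, datetime


def is_first_sunday(d: date) -> bool:
    """True when d is a Sunday that falls within the first 7 days of its month."""
    return d.weekday() == 6 and d.day <= 7  # Monday is 0, Sunday is 6


def is_drill_time(now: datetime) -> bool:
    """True at 03:00 on the first Sunday of the month."""
    return is_first_sunday(now.date()) and now.hour == 3 and now.minute == 0
```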

## Operation Log

All operations must be written to the `alert_operation_log` table (ADR-061):
- Repair records from Docker health monitoring
- Backup operations (BACKUP_COMPLETED / BACKUP_FAILED)
- Change records (CHANGE_APPLIED)
- DR drills (DR_DRILL_COMPLETED)

## Fallback

| Scenario | Fallback |
|----------|----------|
| AWOOOI API down | docker-health-monitor.sh calls the Telegram Bot API directly |
| Prometheus down | Alertmanager sends directly through telegram_configs |
| Redis down | Alert deduplication degrades to in-memory mode (60 s window); the webhook endpoint is unaffected |
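
The Redis-down row implies a process-local deduplicator with a 60-second window. A minimal sketch of that degraded mode follows. The class name, the use of an alert fingerprint string as the key, and the injectable clock are assumptions made for illustration.

```python
import time
from typing import Callable, Dict


class InMemoryDeduplicator:
    """Suppress repeats of the same alert fingerprint within a fixed window."""

    def __init__(self, window_seconds: float = 60.0, clock: Callable[[], float] = time.monotonic):
        self._window = window_seconds
        self._clock = clock
        self._first_seen: Dict[str, float] = {}

    def should_send(self, fingerprint: str) -> bool:
        now = self._clock()
        # Drop entries whose window has elapsed so the dict stays bounded.
        self._first_seen = {
            key: seen for key, seen in self._first_seen.items() if now - seen < self._window
        }
        if fingerprint in self._first_seen:
            return False  # duplicate inside the window
        self._first_seen[fingerprint] = now
        return True
```

The window is measured from the first occurrence and is not extended by suppressed duplicates. Because the state lives in one process, it is lost on restart and is not shared between workers, which is the cost of this fallback compared with the Redis-backed mode.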

## Results

- Monitoring coverage: 15/15 targets → **45+ targets** (adds Docker, system services, and external websites)
- Alerting blind spots: **0** (design goal)
- MTTR: < 2 minutes for Docker services (auto-repair)

## Related Documents

- ADR-025: Alert Chain E2E Validation
- ADR-030: Intelligent Auto-Remediation
- ADR-058: Host Auto-Repair SSH Whitelist
- ADR-061: Alert Operation Log Event Sourcing
- `ops/monitoring/alerts-unified.yml`
- `ops/monitoring/docker-compose.exporters.yaml`
- `scripts/ops/docker-health-monitor.sh` (to be created)
120
docs/adr/ADR-061-alert-operation-log-event-sourcing.md
Normal file
@@ -0,0 +1,120 @@
# ADR-061: Alert Operation Log (Event Sourcing for Operation Traceability)

**Status**: Implemented (2026-04-08)
**Date**: 2026-04-08
**Proposer**: Claude Code
**Reviewer**: Commander (Chief Architect)

---

## Background

Commander's directive: "All operations must be recorded and written to the database." "Write all previous alert messages to the database, and record the related operations as well."

Existing problems:
- `audit_logs` records only K8s operations, not the alert lifecycle
- `approval_records` records approval status but not the operation timeline
- Auto-repair operations (`auto_repair_executions`) were previously not written to the DB
- There is no unified view of the full alert lifecycle

## Decision

### Two new tables

**1. `auto_repair_executions`** (Phase 10)
- Records the success or failure of each auto-repair execution
- Columns: incident_id, playbook_id, playbook_name, success, executed_steps, similarity_score, risk_level, execution_time_ms

**2. `alert_operation_log`** (Phase 11): Event Sourcing
- Immutable: INSERT only, no UPDATE/DELETE
- One row is written for every event in each alert's lifecycle
- Supports full traceability: every step from the alert arriving to its resolution
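
The append-only rule can be enforced by the database itself rather than by convention. The sketch below uses SQLite from the Python standard library as a stand-in for the production PostgreSQL database, with triggers that reject UPDATE and DELETE. The column names are taken from the query examples in this ADR; the column types, the triggers, and the `open_log` helper are assumptions, and the real repository's `append()` / `list_by_incident()` signatures may differ.

```python
import json
import sqlite3
from typing import Any, Dict, List, Optional, Tuple

SCHEMA = """
CREATE TABLE IF NOT EXISTS alert_operation_log (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    incident_id   TEXT NOT NULL,
    event_type    TEXT NOT NULL,
    actor         TEXT NOT NULL,
    action_detail TEXT,
    success       INTEGER,
    context       TEXT,
    created_at    TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER IF NOT EXISTS alert_operation_log_no_update
BEFORE UPDATE ON alert_operation_log
BEGIN SELECT RAISE(ABORT, 'alert_operation_log is append-only'); END;
CREATE TRIGGER IF NOT EXISTS alert_operation_log_no_delete
BEFORE DELETE ON alert_operation_log
BEGIN SELECT RAISE(ABORT, 'alert_operation_log is append-only'); END;
"""


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open a connection and make sure the table and triggers exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def append(
    conn: sqlite3.Connection,
    incident_id: str,
    event_type: str,
    actor: str,
    action_detail: Optional[str] = None,
    success: Optional[bool] = None,
    context: Optional[Dict[str, Any]] = None,
) -> int:
    """Insert one immutable event row and return its id."""
    cur = conn.execute(
        "INSERT INTO alert_operation_log "
        "(incident_id, event_type, actor, action_detail, success, context) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (
            incident_id,
            event_type,
            actor,
            action_detail,
            None if success is None else int(success),
            None if context is None else json.dumps(context),
        ),
    )
    conn.commit()
    return cur.lastrowid


def list_by_incident(conn: sqlite3.Connection, incident_id: str) -> List[Tuple]:
    """Return the events for one incident in insertion order."""
    return conn.execute(
        "SELECT event_type, actor, action_detail, success "
        "FROM alert_operation_log WHERE incident_id = ? ORDER BY id ASC",
        (incident_id,),
    ).fetchall()
```

On PostgreSQL the same guarantee could be approached with a trigger or by revoking UPDATE and DELETE privileges from the application role. Which mechanism the migration actually uses is not stated in this ADR.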

### Event Types

```
ALERT_RECEIVED - an Alertmanager or external alert arrives
TELEGRAM_SENT - the Telegram review card is pushed
USER_ACTION - the user presses a button in Telegram (approve/reject/silence)
AUTO_REPAIR_TRIGGERED - auto-repair evaluation passed; execution is about to start
EXECUTION_STARTED - the K8s/SSH/Docker command starts executing
EXECUTION_COMPLETED - execution finished (success/failure)
TELEGRAM_RESULT_SENT - the auto-repair result is pushed to Telegram
RESOLVED - the alert is cleared
SILENCED - the alert is silenced
ESCALATED - the alert is escalated (P3→P2, etc.)
CHANGE_APPLIED - production change record
BACKUP_COMPLETED - backup completed
BACKUP_FAILED - backup failed
DR_DRILL_COMPLETED - DR drill completed
```
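
Since the event type is a closed vocabulary, an enum lets the writer reject typos before a row is inserted. The member names below are exactly the event types listed above; the class name `AlertEventType` is an assumption and may not match the model in `models.py`.

```python
from enum import Enum


class AlertEventType(str, Enum):
    """Closed set of event types accepted by alert_operation_log."""

    ALERT_RECEIVED = "ALERT_RECEIVED"
    TELEGRAM_SENT = "TELEGRAM_SENT"
    USER_ACTION = "USER_ACTION"
    AUTO_REPAIR_TRIGGERED = "AUTO_REPAIR_TRIGGERED"
    EXECUTION_STARTED = "EXECUTION_STARTED"
    EXECUTION_COMPLETED = "EXECUTION_COMPLETED"
    TELEGRAM_RESULT_SENT = "TELEGRAM_RESULT_SENT"
    RESOLVED = "RESOLVED"
    SILENCED = "SILENCED"
    ESCALATED = "ESCALATED"
    CHANGE_APPLIED = "CHANGE_APPLIED"
    BACKUP_COMPLETED = "BACKUP_COMPLETED"
    BACKUP_FAILED = "BACKUP_FAILED"
    DR_DRILL_COMPLETED = "DR_DRILL_COMPLETED"


def parse_event_type(raw: str) -> AlertEventType:
    """Convert a raw string to an event type; unknown values raise ValueError."""
    return AlertEventType(raw)
```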

### Historical data backfill

Running `phase11b_backfill_alert_operation_log.sql` backfilled:
- 14 ALERT_RECEIVED rows (from the incidents table)
- 265 TELEGRAM_SENT rows (from the approval_records table)
- 265 USER_ACTION rows (from the approval_records table)
- 110 EXECUTION_COMPLETED rows (from the audit_logs table)
- **654 historical records in total**

## Implementation

### Write points

| Event | Trigger location |
|-------|------------------|
| ALERT_RECEIVED | `webhooks.py` alertmanager_webhook → `alert_operation_log` |
| TELEGRAM_SENT | After `_push_to_telegram_background` succeeds |
| USER_ACTION | `telegram.py` handle_callback approve/reject |
| AUTO_REPAIR_TRIGGERED | After evaluate in `_try_auto_repair_background` |
| EXECUTION_COMPLETED | After result in `_try_auto_repair_background` |
| TELEGRAM_RESULT_SENT | After the Telegram push in `_try_auto_repair_background` |

### Code locations

```
apps/api/src/db/models.py - AlertOperationLog, AutoRepairExecution models
apps/api/src/repositories/
  alert_operation_log_repository.py - append() / list_by_incident() / get_stats()
  audit_log_repository.py - get_auto_repair_execution_repository()
apps/api/src/api/v1/webhooks.py - write-point integration
apps/api/src/api/v1/telegram.py - USER_ACTION writes
apps/api/migrations/
  phase10_auto_repair_executions.sql
  phase11_alert_operation_log.sql
  phase11b_backfill_alert_operation_log.sql
```

## Query Examples

```sql
-- Full operation timeline for one incident
SELECT event_type, actor, action_detail, success, created_at
FROM alert_operation_log
WHERE incident_id = 'INC-20260408-XXXXXX'
ORDER BY created_at ASC;

-- Auto-repair success/failure counts over the last 24 hours
-- (the success rate is successes divided by the total)
SELECT success, COUNT(*) FROM auto_repair_executions
WHERE created_at > NOW() - INTERVAL '24 hours'
GROUP BY success;

-- All user action records
SELECT actor, action_detail, context->>'username' AS username, created_at
FROM alert_operation_log
WHERE event_type = 'USER_ACTION'
ORDER BY created_at DESC;
```
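
The second query returns one row per `success` value rather than a single percentage, so the caller still has to combine the rows. A small sketch of that step follows, assuming the rows arrive as `(success, count)` tuples; the function name is hypothetical.

```python
from typing import Iterable, Optional, Tuple


def success_rate(rows: Iterable[Tuple[Optional[bool], int]]) -> Optional[float]:
    """Compute successes / total from GROUP BY success rows.

    Returns None when there were no executions in the window, so an
    empty result is not reported as a 0% success rate.
    """
    total = 0
    succeeded = 0
    for success, count in rows:
        total += count
        if success:
            succeeded += count
    return succeeded / total if total else None
```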

## Results

- 100% of alert operations are persisted to the DB
- Supports a complete audit trail
- 654 historical records have been backfilled
- Future: an operation log page can be built in AWOOOI Web

## Related ADRs

- ADR-030: Intelligent Auto-Remediation
- ADR-060: Comprehensive Infrastructure Monitoring