docs: ADR-060/061 comprehensive monitoring + Event Sourcing architecture decision records

- ADR-060: comprehensive infrastructure monitoring plan (Plan A/B/C/D/E)
- ADR-061: Alert Operation Log Event Sourcing architecture
- LOGBOOK: 2026-04-08 milestone record update

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 11:44:06 +08:00
parent f20121ad41
commit f525e657ca
3 changed files with 351 additions and 4240 deletions
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
--- a/docs/adr/ADR-060-comprehensive-infrastructure-monitoring.md
+++ b/docs/adr/ADR-060-comprehensive-infrastructure-monitoring.md
@@ -0,0 +1,118 @@
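+# ADR-060: Comprehensive Infrastructure Monitoring, Alerting, and Auto-Remediation Architecture
+
+**Status**: Approved (Chief Architect ruling, 2026-04-08)
+**Date**: 2026-04-08
+**Proposer**: Claude Code
+**Reviewer**: Commander (Chief Architect)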
+
+---
+
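+## Background
+
+Monitoring on the AWOOOI platform currently covers mainly the K3s layer (Prometheus + Alertmanager → AWOOOI API → Telegram).
+Across the four hosts, **30+ services/containers** (Docker services, system services, and external websites) have no monitoring or alerting.
+
+## Problem Statement
+
+| Layer | Service count | Current monitoring | Alerting | Auto-remediation |
+|-------|---------------|--------------------|----------|------------------|
+| K3s Pods | ~15 | ✅ Prometheus | ✅ Alertmanager | ✅ Partial |
+| Docker containers (host 110) | ~35 | ❌ | ❌ | ❌ |
+| Docker containers (host 188) | ~12 | ❌ | ❌ | ❌ |
+| System services (host 188) | 3 (PG/Redis/Nginx) | ❌ | ❌ | ❌ |
+| External websites | 4 | ❌ | ❌ | ❌ |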
+
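+## Decision
+
+### Implementation in Three Layers
+
+**Plan A: Docker container health monitoring (P0)**
+- Deploy the `docker-health-monitor.sh` cron script to hosts 110 and 188
+- Actively check container state every 5 minutes (unhealthy / exited / dead)
+- Three-stage notification flow: Intent → Action → Result
+- Call the AWOOOI API endpoint `/api/v1/webhooks/alerts` with an HMAC signature
+- Push directly through the Telegram Bot API as a fallback
+
+**Plan B: Fill in the missing Prometheus exporters (P0)**
+- `postgres_exporter` (9187) → in-depth metrics for PostgreSQL awoooi_prod
+- `redis_exporter` (9121) → Redis connections/memory/latency
+- `nginx-prometheus-exporter` (9113) → Nginx traffic/connections
+- Deploy on host 188 and add to the Prometheus scrape config
+- Add detailed rules in `alerts-unified.yml`
+
+**Plan C: Fill in Blackbox probes for external websites (P1)**
+- Add the 4 external product websites to the Blackbox HTTP probe
+- mo.wooo.work / tsenyang.wooo.work / stock.wooo.work / bitan.wooo.work
+- SSL certificate expiry warning (14 days)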
+
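+### Auto-Remediation Policy (Chief Architect ruling)
+
+| Service type | Policy | Action |
+|--------------|--------|--------|
+| Docker containers (non-DB/non-monitoring) | **Auto-repair** | `docker restart` |
+| Nginx systemd | **Auto-repair** | `systemctl restart nginx` |
+| Prometheus/Grafana/Alertmanager | **Auto-start** | `docker start` (not restart, to protect the WAL) |
+| PostgreSQL | **Alert only** | Automatic restart prohibited (data safety) |
+| Redis | **Alert only** | Automatic restart prohibited (the alert chain depends on it) |
+| External website down | **Auto-repair** | `docker restart` on the corresponding container |
+| Inter-host network outage | **Alert only** | Network problems are not auto-remediated |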
+
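+### Notification Order (Chief Architect ruling)
+
+**Required**: Intent → Action → Result
+
+```
+Telegram: [AI Decision] Detected an anomaly in <container>, preparing to run Docker Restart...
+System: performs the restart
+Telegram: [Execution Result] <container> restart succeeded / failed, manual intervention required
+```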
+
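+## Backup and Restore
+
+### PostgreSQL awoooi_prod
+- Tool: Restic + GFS (existing backup-awoooi.sh)
+- Frequency: every 6 hours (frequent) + daily full backup at 02:00
+- Retention: daily=14, weekly=8, monthly=12
+- Restore RTO: < 30 minutes
+
+### Docker Volume Backups
+- momo-db, harbor-db, sentry-postgres
+- Tool: `docker exec pg_dump` → Restic
+- Frequency: daily at 03:00
+
+### Backup Verification
+- Run a DR drill at 03:00 on the first Sunday of each month
+- Restore the latest snapshot into a temporary container and run SQL verification
+- Push the result to Telegram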
+
+## Operation Logging
+
+All operations must be written to the `alert_operation_log` table (ADR-061):
+- Repair records from Docker health monitoring
+- Backup operations (BACKUP_COMPLETED / BACKUP_FAILED)
+- Change records (CHANGE_APPLIED)
+- DR drills (DR_DRILL_COMPLETED)
+
+## Fallback Plan
+
+| Scenario | Fallback |
+|----------|----------|
+| AWOOOI API down | docker-health-monitor.sh calls the Telegram Bot API directly |
+| Prometheus down | Alertmanager sends directly through `telegram_configs` |
+| Redis down | Alert deduplication degrades to in-memory mode (60s window); the webhook endpoint is unaffected |
+
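+## Outcome
+
+- Monitoring coverage: 15/15 targets → **45+ targets** (adds Docker/system/external websites)
+- Alerting blind spots: **0** (design goal)
+- MTTR: < 2 minutes for Docker services (auto-remediation)
+
+## Related Documents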
+
+- ADR-025: Alert Chain E2E Validation
+- ADR-030: Intelligent Auto-Remediation
+- ADR-058: Host Auto-Repair SSH Whitelist
+- ADR-061: Alert Operation Log Event Sourcing
+- `ops/monitoring/alerts-unified.yml`
+- `ops/monitoring/docker-compose.exporters.yaml`
+- `scripts/ops/docker-health-monitor.sh` (to be created)
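The auto-remediation policy table and the Intent → Action → Result ordering in ADR-060 can be sketched in Python as below. This is only an illustration under stated assumptions, not the contents of `docker-health-monitor.sh` (which the ADR lists as not yet created). The policy keys, `remediate`, `run`, and `notify` are hypothetical names, the message wording is paraphrased, and treating unknown service types as alert-only is an assumption.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Policy table transcribed from ADR-060. The keys are hypothetical labels;
# "{name}" is replaced with the container name at call time.
POLICY: Dict[str, Tuple[str, Optional[List[str]]]] = {
    "docker_app": ("auto_repair", ["docker", "restart", "{name}"]),
    "nginx": ("auto_repair", ["systemctl", "restart", "nginx"]),
    "monitoring": ("auto_start", ["docker", "start", "{name}"]),
    "postgresql": ("alert_only", None),
    "redis": ("alert_only", None),
}


def remediate(
    service_type: str,
    name: str,
    run: Callable[[List[str]], bool],
    notify: Callable[[str], None],
) -> Optional[bool]:
    """Apply the policy for one unhealthy service.

    Returns None for alert-only services, otherwise the result of `run`.
    Unknown service types fall back to alert-only (an assumption).
    """
    _policy, template = POLICY.get(service_type, ("alert_only", None))
    if template is None:
        notify(f"[Alert] {name} is unhealthy; automatic restart is prohibited")
        return None

    command = [part.format(name=name) for part in template]
    # Intent: announce the action before anything is executed.
    notify(f"[AI Decision] {name} is unhealthy, preparing to run: {' '.join(command)}")
    # Action: the injected runner performs the command.
    ok = run(command)
    # Result: always report the outcome, success or failure.
    outcome = "succeeded" if ok else "failed, manual intervention required"
    notify(f"[Execution Result] {name}: {' '.join(command)} {outcome}")
    return ok
```

In a real deployment `run` might wrap `subprocess.run(command).returncode == 0` and `notify` might post to Telegram. Injecting both keeps the notification ordering testable without touching Docker.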
--- a/docs/adr/ADR-061-alert-operation-log-event-sourcing.md
+++ b/docs/adr/ADR-061-alert-operation-log-event-sourcing.md
@@ -0,0 +1,120 @@
+# ADR-061: Alert Operation Log (Event Sourcing for Operation Traceability)
+
+**Status**: Implemented (2026-04-08)
+**Date**: 2026-04-08
+**Proposer**: Claude Code
+**Reviewer**: Commander (Chief Architect)
+
+---
+
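+## Background
+
+Commander's directive: "All operations must be recorded and written to the database." "Write every earlier alert message into the database, and record the related operations as well."
+
+Existing problems:
+- `audit_logs` records only K8s operations, not the alert lifecycle
+- `approval_records` records approval status, but not the operation timeline
+- Auto-repair operations (`auto_repair_executions`) were not previously written to the DB
+- There is no unified view of the full alert lifecycle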
+
+## Decision
+
+### Two New Tables
+
+**1. `auto_repair_executions`** (Phase 10)
+- Records the success/failure of each auto-repair execution
+- Columns: incident_id, playbook_id, playbook_name, success, executed_steps, similarity_score, risk_level, execution_time_ms
+
+**2. `alert_operation_log`** (Phase 11): Event Sourcing
+- Immutable: INSERT only, no UPDATE/DELETE
+- One row is written for every event in each alert's lifecycle
+- Supports full traceability: every step from the alert arriving to its resolution
+
+### Event Types
+
+```
+ALERT_RECEIVED        - Alert arrives from Alertmanager or an external source
+TELEGRAM_SENT         - Telegram review card pushed
+USER_ACTION           - User presses a button in Telegram (approve/reject/silence)
+AUTO_REPAIR_TRIGGERED - Auto-repair evaluation passed, about to execute
+EXECUTION_STARTED     - K8s/SSH/Docker command execution starts
+EXECUTION_COMPLETED   - Execution finished (success/failure)
+TELEGRAM_RESULT_SENT  - Auto-repair result pushed to Telegram
+RESOLVED              - Alert cleared
+SILENCED              - Silenced
+ESCALATED             - Escalated (P3→P2, etc.)
+CHANGE_APPLIED        - Production change record
+BACKUP_COMPLETED      - Backup completed event
+BACKUP_FAILED         - Backup failed event
+DR_DRILL_COMPLETED    - DR drill completed
+```
+
+### Historical Data Backfill
+
+Running `phase11b_backfill_alert_operation_log.sql` backfills:
+- 14 ALERT_RECEIVED rows (from the incidents table)
+- 265 TELEGRAM_SENT rows (from the approval_records table)
+- 265 USER_ACTION rows (from the approval_records table)
+- 110 EXECUTION_COMPLETED rows (from the audit_logs table)
+- **654 historical records in total**
+
+## Implementation
+
+### Write Points
+
+| Event | Trigger location |
+|-------|------------------|
+| ALERT_RECEIVED | `webhooks.py` alertmanager_webhook → `alert_operation_log` |
+| TELEGRAM_SENT | After `_push_to_telegram_background` succeeds |
+| USER_ACTION | `telegram.py` handle_callback approve/reject |
+| AUTO_REPAIR_TRIGGERED | After evaluate in `_try_auto_repair_background` |
+| EXECUTION_COMPLETED | After result in `_try_auto_repair_background` |
+| TELEGRAM_RESULT_SENT | After the Telegram push in `_try_auto_repair_background` |
+
+### Code Locations
+
+```
+apps/api/src/db/models.py               - AlertOperationLog, AutoRepairExecution models
+apps/api/src/repositories/
+  alert_operation_log_repository.py     - append() / list_by_incident() / get_stats()
+  audit_log_repository.py               - get_auto_repair_execution_repository()
+apps/api/src/api/v1/webhooks.py         - write-point integration
+apps/api/src/api/v1/telegram.py         - USER_ACTION writes
+apps/api/migrations/
+  phase10_auto_repair_executions.sql
+  phase11_alert_operation_log.sql
+  phase11b_backfill_alert_operation_log.sql
+```
+
+## Query Examples
+
+```sql
+-- Full operation timeline for one incident
+SELECT event_type, actor, action_detail, success, created_at
+FROM alert_operation_log
+WHERE incident_id = 'INC-20260408-XXXXXX'
+ORDER BY created_at ASC;
+
+-- Auto-repair success/failure counts for the last 24 hours (basis for the success rate)
+SELECT success, COUNT(*) FROM auto_repair_executions
+WHERE created_at > NOW() - INTERVAL '24 hours'
+GROUP BY success;
+
+-- All user action records
+SELECT actor, action_detail, context->>'username' AS username, created_at
+FROM alert_operation_log
+WHERE event_type = 'USER_ACTION'
+ORDER BY created_at DESC;
+```
+
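+## Outcome
+
+- 100% of alert operations are persisted to the DB
+- Supports a complete audit trail
+- 654 historical records have been backfilled
+- Future: an operation log page can be built in AWOOOI Web
+
+## Related ADRs
+
+- ADR-030: Intelligent Auto-Remediation
+- ADR-060: Comprehensive Infrastructure Monitoring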
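The append-only rule for `alert_operation_log` in ADR-061 (INSERT only, no UPDATE/DELETE) can be illustrated with a small Python sketch. It uses the standard-library `sqlite3` module as a stand-in for the production PostgreSQL database, so the trigger syntax differs from what a real migration would contain. The column list is inferred from the query examples in the ADR, and `open_log` plus the exact signatures of `append` and `list_by_incident` are assumptions, not the repository's actual code.

```python
import json
import sqlite3
from typing import Any, Dict, List, Optional

# Hypothetical schema: columns inferred from the ADR's example queries.
SCHEMA = """
CREATE TABLE alert_operation_log (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    incident_id   TEXT NOT NULL,
    event_type    TEXT NOT NULL,
    actor         TEXT NOT NULL,
    action_detail TEXT,
    success       INTEGER,
    context       TEXT,
    created_at    TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- Enforce immutability: reject every UPDATE and DELETE.
CREATE TRIGGER alert_operation_log_no_update
BEFORE UPDATE ON alert_operation_log
BEGIN
    SELECT RAISE(ABORT, 'alert_operation_log is append-only');
END;
CREATE TRIGGER alert_operation_log_no_delete
BEFORE DELETE ON alert_operation_log
BEGIN
    SELECT RAISE(ABORT, 'alert_operation_log is append-only');
END;
"""


def open_log() -> sqlite3.Connection:
    """Create an in-memory database holding the append-only table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def append(
    conn: sqlite3.Connection,
    incident_id: str,
    event_type: str,
    actor: str,
    action_detail: Optional[str] = None,
    success: Optional[bool] = None,
    context: Optional[Dict[str, Any]] = None,
) -> int:
    """Insert one event row and return its id."""
    cur = conn.execute(
        "INSERT INTO alert_operation_log "
        "(incident_id, event_type, actor, action_detail, success, context) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (
            incident_id,
            event_type,
            actor,
            action_detail,
            None if success is None else int(success),
            None if context is None else json.dumps(context),
        ),
    )
    conn.commit()
    return cur.lastrowid


def list_by_incident(conn: sqlite3.Connection, incident_id: str) -> List[str]:
    """Return the event types for one incident in insertion order."""
    rows = conn.execute(
        "SELECT event_type FROM alert_operation_log "
        "WHERE incident_id = ? ORDER BY created_at ASC, id ASC",
        (incident_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

Enforcing immutability in the database rather than only in the repository layer means a stray UPDATE from any client fails loudly. Whether the actual migration does this is not stated in the ADR.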