Commit Graph

11 Commits

Author SHA1 Message Date
OG T
5ead01abf7 feat(ops): dr-drill.sh — 每月 DR Drill 自動演練
每月第一個週日 03:00 (121 cron) 執行:
1. 找最新 Velero backup (Completed)
2. 還原到 awoooi-dr-test namespace
3. 等待 Pod Ready + API health 驗證
4. 清理 dr-test namespace + restore 資源
5. Telegram 通知 PASS/FAIL + 耗時

支援 --dry-run 模式 (只檢查 backup,不還原)。
dry-run 驗證通過: daily-awoooi-prod-20260409020003

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 10:42:12 +08:00
OG T
ec4ebaf310 fix(ops): pg-backup momo_analytics 改用 docker exec (無對外 port)
momo-db 容器無 port binding,TCP 127.0.0.1:5432 連到 host PG 非容器。
改用 docker exec momo-db pg_dump,實際備份 91M。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 09:57:05 +08:00
OG T
f98be41517 feat(ops): pg-backup.sh — PostgreSQL 每 6h 自動備份
備份目標 (188):
- awoooi_prod (host PostgreSQL, TCP 127.0.0.1)
- momo_analytics (momo-db 容器)

功能:
- gzip 壓縮,保留 7 天自動清理
- Telegram 通知 (成功/失敗)
- cron 0 */6 * * * 已設定

驗證: 兩個 DB 備份成功 (awoooi_prod 206K, gz 完整)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 09:16:21 +08:00
OG T
0e6c4b83d4 feat(ops): docker-health-monitor 完成部署 110+188
- 增加 EXCLUDE_CONTAINERS 排除清單(signoz init containers)
- max-time 30→60 支援 API 首次 AI 分析所需時間
- 110: wooo/awoooi-ops, cron */5, secrets.env 已設定
- 188: ollama/awoooi-ops, cron */5, secrets.env 已設定
- 驗證: 188→API webhook 200, Telegram 已收到告警

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 22:59:45 +08:00
OG T
d80153bdce fix(openclaw): NIM 完全失敗後 fallback 到 Gemini 產生執行方案
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m34s
NIM tool calling 多次 timeout 後,不再顯示空白執行方案,
改由 Gemini 代理產生 kubectl 操作指令(JSON 解析)。
只有 NIM 完全失敗才觸發,符合統帥「必須等到有回應」原則。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 22:55:25 +08:00
OG T
170ce2f11d fix(ci): 修正測試與 Sprint 5.2 部署腳本
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m38s
tests/test_auto_repair_service.py:
  - 更新 3個測試符合 2026-04-07 統帥指令移除門檻
  - APPROVED Playbook 直接通過 (低相似度/低品質/高風險均通過)

tests/test_phase22_nemotron_collab.py:
  - 更新 log key: nemotron_collaboration_failed → exhausted

ops/monitoring/docker-compose.exporters.yaml:
  - 修正 postgres DSN: awoooi:awoooi_prod_2026@localhost:5432/awoooi_prod

Sprint 5.2 新增腳本:
  - scripts/sprint51_e2e_validation.py: L7 E2E 驗收腳本 (T1-T5)
  - scripts/ops/deploy-docker-health-monitor.sh: Plan A 一鍵部署腳本

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 18:17:48 +08:00
OG T
88696dba9b feat(sprint5.1): Data Safety Guardrails 全鏈路整合 (L1-L5)
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m33s
Type Sync Check / check-type-sync (push) Failing after 58s
Layer 0 - K8s RBAC:
  - k8s/rbac/api-velero-reader.yaml: awoooi-executor SA Velero backup reader

Layer 1 - DB Migration (已在 188 執行):
  - M-002: approval_records 新增 approval_level/votes/required_votes
  - M-003: alert_event_type ENUM 新增 8 個值

Layer 2 - IaC:
  - ops/config/service-registry.yaml: 全服務 Stateful 分級清單 (BLOCK/CRITICAL_HITL/STANDARD_HITL/AUTO)

Layer 3 - Python Services:
  - service_registry.py: 讀取 YAML,提供 is_blocked/requires_multisig/get_required_votes
  - velero_client.py: kubectl 查詢 Velero 備份年齡,失敗 fallback 999h
  - preflight_service.py: Pre-flight 安全檢查 (Q2/Q4 決策)

Layer 1-M001 - Playbook model:
  - playbook.py: 新增 requires_approval_level/stateful_targets/requires_pre_backup

Layer 4 - 業務邏輯:
  - alert_operation_log_repository.py: 新增 8 個 event_type (Guardrail/Pre-flight/MultiSig/備份)
  - auto_repair_service.py: 注入 Service Registry Guardrail 檢查 (BLOCK → 直接拒絕)
  - webhooks.py: ALERT_RECEIVED 溯源記錄 + auto_repair flag Q9 + Langfuse trace_id Q10
  - db/models.py: ApprovalRecord 同步 approval_level/votes/required_votes 欄位
  - docker-health-monitor.sh: 純感知層改造(移除所有 docker restart 邏輯)

Layer 5 - Telegram 通知:
  - telegram_gateway.py: T1-T6 六個新通知方法 (Guardrail/Pre-flight/Backup/MultiSig/ChangeApplied)

參考: ADR-062 Data Safety Guardrails, ADR-063 Service Registry IaC

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 16:24:09 +08:00
OG T
a421d2c5b8 feat(ops): Plan A docker-health-monitor.sh — Docker 容器健康監控自動修復
- 偵測 unhealthy / exited / dead 容器
- 排除清單: DB(PG/Redis)、Gitea、監控棧
- Prometheus/Grafana/Alertmanager exited → docker start (保護 WAL)
- 必須三段式通知: Intent→Action→Result (首席架構師裁示)
- HMAC-SHA256 簽章 → AWOOOI API /api/v1/webhooks/custom-alert
- Fallback: API down → 直接 Telegram Bot API
- 冷卻期 300s,防止重複修復

部署: cron */5 * * * * on 192.168.0.110 + 192.168.0.188
設定: /etc/awoooi-ops/secrets.env

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 11:48:39 +08:00
OG T
b688eeecb7 fix(ops): seed 腳本支援 API_BASE 環境變數
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 12:23:55 +08:00
OG T
76f7330c9d feat(api): POST /playbooks/ 建立端點 + seed-repair-playbooks.py (Task 14)
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 57s
CD Pipeline / Deploy Prometheus Alert Rules (push) Has been skipped
- playbooks.py: 新增 POST / 端點供直接建立 Playbook (seed/管理用)
- seed-repair-playbooks.py: 5個 Host Repair Playbooks (ssh_command)
  sentry/harbor/gitea/alertmanager (110) + openclaw (188)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 11:53:49 +08:00
OG T
7becdcbaf6 ops(scripts): 加入 deploy-alerts.sh 自動部署 Prometheus 規則
功能: 驗證 YAML → 備份 → scp → reload → 驗證規則數+關鍵規則
同步啟用 Prometheus --web.enable-lifecycle (110 docker-compose.yml)
部署驗證: 28 條規則全部 ,關鍵規則 SentryDown/HarborDown/GiteaDown/OpenClawDown/AlertmanagerDown/AlertChainUnhealthy 已上線

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 02:29:21 +08:00