Your Name
|
ee2cc2bfc3
|
fix(alerts): consolidate Telegram alerts into the SRE war room
CD Pipeline / tests (push) Failing after 1m23s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 15s
|
2026-06-12 11:06:16 +08:00 |
|
Your Name
|
3418e014bc
|
fix(security): remove immediate high-risk plaintext and SSH trust gaps [skip ci]
|
2026-06-11 11:10:26 +08:00 |
|
Your Name
|
cfb866d055
|
feat(governance): add agent market automation surfaces
Ansible Lint / lint (push) Successful in 35s
CD Pipeline / tests (push) Failing after 13s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Failing after 11s
|
2026-06-04 21:50:55 +08:00 |
|
Your Name
|
017dba8b00
|
docs(argocd): codify health persistence config [skip ci]
|
2026-06-04 09:33:45 +08:00 |
|
Your Name
|
d0163b2d69
|
docs(ops): document ollama 111 fallback diagnosis [skip ci]
|
2026-06-04 09:31:20 +08:00 |
|
Your Name
|
ae7b39d96a
|
fix(ops): harden reboot recovery and backup alerts
|
2026-05-29 12:41:34 +08:00 |
|
Your Name
|
d6d2719e02
|
fix(alerts): deploy drift guard with canonical rules
Code Review / ai-code-review (push) Has been cancelled
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 29s
|
2026-05-29 11:14:12 +08:00 |
|
Your Name
|
7d2128b53c
|
fix(alerts): keep prometheus canonical rules in sync
Code Review / ai-code-review (push) Successful in 11s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 32s
|
2026-05-29 11:09:33 +08:00 |
|
Your Name
|
ae9d0b7385
|
feat(monitoring): alert on stale source provider ingestion
Code Review / ai-code-review (push) Successful in 10s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 25s
CD Pipeline / tests (push) Successful in 3m26s
CD Pipeline / build-and-deploy (push) Successful in 3m38s
CD Pipeline / post-deploy-checks (push) Successful in 1m25s
|
2026-05-20 19:19:21 +08:00 |
|
Your Name
|
4956fbb849
|
fix(monitoring): verify alert rule deploy content
Code Review / ai-code-review (push) Successful in 11s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 23s
|
2026-05-20 13:26:24 +08:00 |
|
Your Name
|
d2a4a17969
|
fix(governance): stabilize adr100 km growth slo
Code Review / ai-code-review (push) Successful in 22s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 25s
CD Pipeline / tests (push) Successful in 1m11s
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
|
2026-05-14 19:33:52 +08:00 |
|
Your Name
|
a0a0731cd6
|
fix(auto-repair): preserve exact playbook candidates
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / tests (push) Successful in 5m46s
CD Pipeline / build-and-deploy (push) Successful in 4m6s
CD Pipeline / post-deploy-checks (push) Successful in 1m28s
|
2026-05-13 23:38:19 +08:00 |
|
Your Name
|
7a8cbb3241
|
fix(auto-repair): prefer exact playbooks and fail failed steps
Code Review / ai-code-review (push) Successful in 11s
CD Pipeline / tests (push) Successful in 1m3s
CD Pipeline / build-and-deploy (push) Successful in 3m31s
CD Pipeline / post-deploy-checks (push) Successful in 1m32s
|
2026-05-13 23:21:17 +08:00 |
|
Your Name
|
4ee57b710d
|
fix(ops): support API image path for T16 seed script
Code Review / ai-code-review (push) Successful in 10s
|
2026-05-13 23:03:40 +08:00 |
|
Your Name
|
1778a692e0
|
feat(awooop): add auto repair canary live-fire target
Code Review / ai-code-review (push) Successful in 11s
CD Pipeline / tests (push) Successful in 1m11s
CD Pipeline / build-and-deploy (push) Failing after 6m52s
CD Pipeline / post-deploy-checks (push) Has been skipped
|
2026-05-13 22:30:20 +08:00 |
|
Your Name
|
b4d367eeb4
|
feat(awooop): expose mcp bridge truth chain
Code Review / ai-code-review (push) Successful in 13s
CD Pipeline / tests (push) Successful in 1m17s
CD Pipeline / build-and-deploy (push) Successful in 3m55s
CD Pipeline / post-deploy-checks (push) Successful in 1m45s
|
2026-05-13 03:21:31 +08:00 |
|
Your Name
|
de16c88418
|
chore(rls): apply outbound message canary
Code Review / ai-code-review (push) Successful in 11s
|
2026-05-12 21:55:23 +08:00 |
|
Your Name
|
7d92f0acd7
|
chore(rls): stage projects canary path
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / tests (push) Successful in 1m8s
CD Pipeline / build-and-deploy (push) Successful in 3m49s
CD Pipeline / post-deploy-checks (push) Successful in 1m25s
|
2026-05-12 21:25:24 +08:00 |
|
Your Name
|
b7af597459
|
chore(rls): apply tool registry canary wave1.1
Code Review / ai-code-review (push) Successful in 10s
|
2026-05-12 21:15:14 +08:00 |
|
Your Name
|
8c4dc7a5a8
|
chore(rls): add manual script gate and canary wave1
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / tests (push) Successful in 1m5s
CD Pipeline / build-and-deploy (push) Failing after 10m6s
CD Pipeline / post-deploy-checks (push) Has been skipped
|
2026-05-12 20:23:27 +08:00 |
|
Your Name
|
ff30c61c4c
|
fix(rls): consolidate API DB access context
Code Review / ai-code-review (push) Successful in 21s
CD Pipeline / tests (push) Successful in 1m20s
CD Pipeline / build-and-deploy (push) Successful in 4m15s
CD Pipeline / post-deploy-checks (push) Successful in 1m58s
|
2026-05-12 19:55:13 +08:00 |
|
Your Name
|
f0255e0300
|
chore(ops): strengthen RLS role bootstrap gate
Code Review / ai-code-review (push) Successful in 10s
|
2026-05-12 18:36:35 +08:00 |
|
Your Name
|
0bc1878778
|
chore(ops): add RLS preflight and registry certbot repair package
Code Review / ai-code-review (push) Successful in 13s
|
2026-05-12 18:25:53 +08:00 |
|
Your Name
|
1a74286dfa
|
fix(awooop): mirror ops notifications through api
Code Review / ai-code-review (push) Successful in 10s
|
2026-05-12 14:43:09 +08:00 |
|
Your Name
|
d3e1b61096
|
fix(ops): persist 188 ollama localhost binding
Code Review / ai-code-review (push) Successful in 11s
|
2026-05-06 15:27:19 +08:00 |
|
Your Name
|
f88a3a846b
|
fix(ops): contain 188 ollama gateway exposure
Code Review / ai-code-review (push) Successful in 10s
|
2026-05-06 15:18:28 +08:00 |
|
Your Name
|
d441f70693
|
fix(ai): add 188 ollama retirement gate
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / tests (push) Successful in 1m2s
CD Pipeline / build-and-deploy (push) Successful in 9m2s
CD Pipeline / post-deploy-checks (push) Successful in 1m15s
|
2026-05-06 14:55:21 +08:00 |
|
OG T
|
6e2ab7cedc
|
fix(alertmanager): make live config deployment safe
Code Review / ai-code-review (push) Successful in 10s
|
2026-05-06 13:52:57 +08:00 |
|
Your Name
|
587551c1f1
|
fix(ops): monitor full-stack cold-start gates
Code Review / ai-code-review (push) Successful in 11s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 18s
|
2026-05-06 00:48:05 +08:00 |
|
Your Name
|
ed7c6946cb
|
docs(awooop): define private Ollama mesh gateway
Code Review / ai-code-review (push) Successful in 10s
|
2026-05-05 22:56:22 +08:00 |
|
Your Name
|
72d66e4ae6
|
fix(ops): align stale job cleanup thresholds
Code Review / ai-code-review (push) Successful in 28s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 36s
|
2026-05-05 14:54:17 +08:00 |
|
Your Name
|
5e625f777d
|
fix(ops): add stale gitea job cleanup guard
Code Review / ai-code-review (push) Has been cancelled
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Has been cancelled
|
2026-05-05 14:50:47 +08:00 |
|
Your Name
|
7d45f0cb58
|
fix(ops): alert on stale gitea actions jobs
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Has been cancelled
|
2026-05-05 14:42:09 +08:00 |
|
Your Name
|
34d1c76be9
|
fix(ops): route systemd runner baseline alerts
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
|
2026-05-05 14:19:58 +08:00 |
|
Your Name
|
fe618960a8
|
fix(ops): monitor systemd runners in host baseline
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 39s
|
2026-05-05 14:08:43 +08:00 |
|
Your Name
|
e8e6748f70
|
fix(ops): add docker host resource baseline guardrails
CD Pipeline / tests (push) Failing after 1m50s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 25s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 38s
|
2026-05-05 13:45:09 +08:00 |
|
Your Name
|
95110971f3
|
fix(telegram): close remaining DM alert routes
CD Pipeline / tests (push) Successful in 1m27s
Code Review / ai-code-review (push) Successful in 29s
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
|
2026-04-30 23:02:17 +08:00 |
|
Your Name
|
e27b462bef
|
fix(ops): keep disabled gitea runner stopped
Code Review / ai-code-review (push) Successful in 27s
|
2026-04-30 10:59:46 +08:00 |
|
OG T
|
fb1d101902
|
fix(backup): root-cause fix for HostBackupFailed P1: Prometheus textfile metric + reads via docker socket
Problem 1: the backup_110_last_success_timestamp metric never existed
Root cause: the script only wrote a plain-text last_success file and never emitted .prom format
Fix: on success, write /home/ollama/node_exporter_textfiles/backup.prom
node_exporter gains --collector.textfile.directory=/textfile_collector
volume: /home/ollama/node_exporter_textfiles:/textfile_collector
Problem 2: Harbor/Gitea rsync permission denied
Root cause: /var/lib/docker/volumes/ is 710 root:root, so the docker group cannot access the filesystem path directly
Fix: switch to docker run --rm -v <volume>:/source alpine tar czf -
read the volume contents through the docker socket (wooo is already in the docker group), then extract
Verification: all three backup script items OK; node_exporter 9100/metrics exposes the metric correctly
Prometheus absent(backup_110_last_success_timestamp) should clear after the next scrape
2026-04-18 ogt + Claude Sonnet 4.6
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
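
A minimal sketch of the textfile write described in Problem 1, assuming the success time is exported as a gauge. Only the metric name and the `backup.prom` filename come from the commit message; the function name `write_backup_metric`, the write-then-rename step, and the file mode are illustrative assumptions, not the actual script.

```python
import os
import tempfile
import time
from typing import Optional


def write_backup_metric(textfile_dir: str, timestamp: Optional[float] = None) -> str:
    """Write backup_110_last_success_timestamp in Prometheus text format.

    The content goes to a temporary file first and is then renamed into
    place, so the textfile collector never reads a half-written file.
    """
    ts = int(time.time() if timestamp is None else timestamp)
    body = (
        "# HELP backup_110_last_success_timestamp Unix time of the last successful backup.\n"
        "# TYPE backup_110_last_success_timestamp gauge\n"
        f"backup_110_last_success_timestamp {ts}\n"
    )
    target = os.path.join(textfile_dir, "backup.prom")
    # The temp file does not end in .prom, so the collector ignores it.
    fd, tmp_path = tempfile.mkstemp(dir=textfile_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    # mkstemp creates the file as 0600; relax it so a containerized
    # node_exporter running as another user can read it (assumption).
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, target)  # atomic on POSIX within one filesystem
    return target
```

Calling `write_backup_metric("/home/ollama/node_exporter_textfiles")` after a successful run would refresh the file that the mounted `/textfile_collector` directory exposes.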
|
2026-04-18 10:37:23 +08:00 |
|
OG T
|
de055778b3
|
fix(cd): CD_PUSH_TOKEN + backup paths use the BACKUP_ROOT environment variable
CD Pipeline / build-and-deploy (push) Has been cancelled
- cd.yaml: GITEA_CD_TOKEN → CD_PUSH_TOKEN (Gitea reserves the GITEA_ prefix)
- ADR-069: update the token name description to match
- backup-from-110.sh: use the BACKUP_ROOT environment variable (default /home/ollama/backup/110)
to avoid /var/log and /var/run, which require root privileges
- Deployed to 188; cron 0 1 * * * configured
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-11 09:07:47 +08:00 |
|
OG T
|
43edff184d
|
feat(dr): Sprint C: host rsync backup + DR SOP documents
C-1 Velero: confirmed running (daily-awoooi-prod schedule, 13d, MinIO Available)
C-2 Host rsync backup:
scripts/ops/backup-from-110.sh: 188 backs up 110 via rsync daily at 01:00
- Harbor registry data (highest priority)
- Gitea repos
- bitan-pharmacy.git (if present)
- On success, writes /var/run/backup-110.last_success for Prometheus monitoring
- On failure, sends a Telegram alert
ops/monitoring/alerts-unified.yml: adds the HostBackupFailed alert rule
C-3 DR SOP documents:
docs/runbooks/disaster-recovery/DR-K8s-awoooi.md (<15 minutes)
docs/runbooks/disaster-recovery/DR-Nginx.md (<5 minutes)
docs/runbooks/disaster-recovery/DR-Harbor.md (<30 minutes)
docs/runbooks/disaster-recovery/DR-Bitan.md (<5 minutes)
docs/runbooks/disaster-recovery/DR-Stock.md (<5 minutes)
Backup script deployment instructions (must be run manually):
scp scripts/ops/backup-from-110.sh ollama@192.168.0.188:~/bin/backup-from-110.sh
ssh ollama@192.168.0.188 "chmod +x ~/bin/backup-from-110.sh && mkdir -p /backup/110/{harbor,gitea}"
ssh ollama@192.168.0.188 "echo '0 1 * * * /home/ollama/bin/backup-from-110.sh' | crontab -"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-11 03:04:18 +08:00 |
|
OG T
|
8d0042ed29
|
feat(ops): Sprint 5.2 upgrade docker-health-monitor to auto-repair mode
Old version: sensing layer only (L4-6); it only sent a webhook, and the API executed repairs
New version: sensing + auto-repair + reporting
Repair tiers (ADR-060):
- Ordinary containers: docker restart
- Monitoring stack (prometheus/grafana/alertmanager): docker start (protects the WAL)
- DB/Redis/ClickHouse: alert only, restart forbidden
Deployed to:
- 192.168.0.110 ~/awoooi-ops/docker-health-monitor.sh
- 192.168.0.188 ~/awoooi-ops/docker-health-monitor.sh
- cron */5 * * * * running on both hosts
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
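
The repair tiers listed above can be sketched as a small classifier. This is an illustration only: the three tiers and their actions come from the commit message, while the name fragments used for matching, the `RepairAction` enum, and `choose_repair_action` are assumptions and not the deployed shell logic.

```python
from enum import Enum


class RepairAction(Enum):
    RESTART = "docker restart"
    START = "docker start"
    ALERT_ONLY = "alert only"


# Name fragments are assumptions; real container names may differ.
MONITORING_STACK = ("prometheus", "grafana", "alertmanager")
STATEFUL_STORES = ("postgres", "redis", "clickhouse")


def choose_repair_action(container_name: str) -> RepairAction:
    """Map a container name to the repair tier described in ADR-060."""
    name = container_name.lower()
    # Stateful stores are checked first so they can never be restarted.
    if any(fragment in name for fragment in STATEFUL_STORES):
        return RepairAction.ALERT_ONLY
    # The monitoring stack is started, not restarted, to protect the WAL.
    if any(fragment in name for fragment in MONITORING_STACK):
        return RepairAction.START
    return RepairAction.RESTART
```

Checking the stateful tier before the others keeps the most conservative rule in charge whenever a name matches more than one list.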
|
2026-04-09 11:59:11 +08:00 |
|
OG T
|
5ead01abf7
|
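feat(ops): dr-drill.sh: automated monthly DR drill
Runs on the first Sunday of each month at 03:00 (cron on 121):
1. Find the latest Velero backup (Completed)
2. Restore it into the awoooi-dr-test namespace
3. Wait for Pod Ready + verify API health
4. Clean up the dr-test namespace + restore resources
5. Send a Telegram notification with PASS/FAIL + elapsed time
Supports a --dry-run mode (checks the backup only, no restore).
dry-run verification passed: daily-awoooi-prod-20260409020003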
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
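
Step 1 of the drill (finding the latest Completed backup) might look like the sketch below. It assumes the backups have already been loaded as a list of dictionaries shaped like Velero Backup objects, with `metadata.name`, `status.phase`, and `status.completionTimestamp`; the function name `latest_completed_backup` is hypothetical and the real script may select backups differently.

```python
from datetime import datetime
from typing import Dict, List, Optional


def latest_completed_backup(backups: List[Dict]) -> Optional[str]:
    """Return the name of the newest backup whose phase is Completed."""
    completed = [
        b for b in backups
        if b.get("status", {}).get("phase") == "Completed"
        and b.get("status", {}).get("completionTimestamp")
    ]
    if not completed:
        return None  # nothing restorable; the drill should report FAIL
    newest = max(
        completed,
        key=lambda b: datetime.strptime(
            b["status"]["completionTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ),
    )
    return newest["metadata"]["name"]
```

Returning `None` instead of raising leaves it to the caller to decide how an empty result is reported, which suits a `--dry-run` check.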
|
2026-04-09 10:42:12 +08:00 |
|
OG T
|
ec4ebaf310
|
fix(ops): pg-backup momo_analytics now uses docker exec (no externally exposed port)
The momo-db container has no port binding, so TCP 127.0.0.1:5432 reaches the host PG, not the container.
Switched to docker exec momo-db pg_dump; the actual backup is 91M.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
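
The difference between the two dump paths can be sketched as a command builder. This is a hedged illustration: `pg_dump_command` and the default `postgres` user are assumptions; only the `momo-db` container name, the database names, and the use of `docker exec` come from the commit messages.

```python
from typing import List, Optional


def pg_dump_command(database: str, container: Optional[str] = None,
                    user: str = "postgres") -> List[str]:
    """Build the argv for dumping a database.

    With no container, pg_dump connects to the host PostgreSQL over
    loopback TCP. With a container, pg_dump runs inside it via
    docker exec, so no published port is needed.
    """
    if container is None:
        return ["pg_dump", "-h", "127.0.0.1", "-U", user, database]
    return ["docker", "exec", container, "pg_dump", "-U", user, database]
```

For example, `pg_dump_command("momo_analytics", container="momo-db")` yields a command that never touches 127.0.0.1:5432 on the host, which is the failure mode the fix avoids.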
|
2026-04-09 09:57:05 +08:00 |
|
OG T
|
f98be41517
|
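feat(ops): pg-backup.sh: automatic PostgreSQL backup every 6h
Backup targets (188):
- awoooi_prod (host PostgreSQL, TCP 127.0.0.1)
- momo_analytics (momo-db container)
Features:
- gzip compression, 7-day retention with automatic cleanup
- Telegram notification (success/failure)
- cron 0 */6 * * * configured
Verification: both DB backups succeeded (awoooi_prod 206K, gz intact)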
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
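
The 7-day retention rule could be expressed as below. This is a sketch under assumptions: it selects by file modification time and by a `.gz` suffix, and `expired_backups` is a hypothetical name; the actual script may use a different selection rule.

```python
import os
import time
from typing import List, Optional


def expired_backups(backup_dir: str, max_age_days: int = 7,
                    now: Optional[float] = None) -> List[str]:
    """List .gz files in backup_dir whose mtime is older than max_age_days."""
    reference = time.time() if now is None else now
    cutoff = reference - max_age_days * 86400
    expired = []
    for entry in os.scandir(backup_dir):
        if entry.is_file() and entry.name.endswith(".gz") and entry.stat().st_mtime < cutoff:
            expired.append(entry.path)
    return sorted(expired)
```

The function only reports candidates; deleting them (for example with `os.remove`) is left to the caller so the list can be logged or dry-run first.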
|
2026-04-09 09:16:21 +08:00 |
|
OG T
|
0e6c4b83d4
|
feat(ops): docker-health-monitor deployment complete on 110+188
- Add EXCLUDE_CONTAINERS exclusion list (signoz init containers)
- max-time 30→60 to allow for the API's first AI analysis
- 110: wooo/awoooi-ops, cron */5, secrets.env configured
- 188: ollama/awoooi-ops, cron */5, secrets.env configured
- Verification: 188→API webhook 200, Telegram alert received
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-08 22:59:45 +08:00 |
|
OG T
|
d80153bdce
|
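fix(openclaw): fall back to Gemini to generate the execution plan after NIM fails completely
CD Pipeline / build-and-deploy (push) Failing after 1m34s
After repeated NIM tool-calling timeouts, no longer show a blank execution plan;
Gemini instead generates the kubectl commands (parsed from JSON).
Triggered only when NIM fails completely, in line with the commander's "must wait until there is a response" principle.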
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
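
The fallback order described above can be sketched generically. Everything here is an assumption for illustration: `plan_with_fallback`, the retry count of 3, and the use of `TimeoutError` stand in for whatever the NIM and Gemini clients actually do.

```python
from typing import Callable, List


def plan_with_fallback(primary: Callable[[], List[str]],
                       fallback: Callable[[], List[str]],
                       max_attempts: int = 3) -> List[str]:
    """Try the primary planner several times, then use the fallback.

    The fallback runs only after every primary attempt has timed out
    or returned an empty plan, so a blank plan is never shown.
    """
    for _ in range(max_attempts):
        try:
            plan = primary()
        except TimeoutError:
            continue  # treat a timeout as a failed attempt and retry
        if plan:
            return plan
    return fallback()
```

In this sketch `primary` would wrap the NIM tool-calling request and `fallback` the Gemini request whose JSON output is parsed into kubectl commands.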
|
2026-04-08 22:55:25 +08:00 |
|
OG T
|
170ce2f11d
|
fix(ci): fix tests and Sprint 5.2 deployment scripts
CD Pipeline / build-and-deploy (push) Failing after 1m38s
tests/test_auto_repair_service.py:
- Update 3 tests to match the 2026-04-07 commander directive that removed the thresholds
- APPROVED Playbooks pass directly (low similarity/low quality/high risk all pass)
tests/test_phase22_nemotron_collab.py:
- Update log key: nemotron_collaboration_failed → exhausted
ops/monitoring/docker-compose.exporters.yaml:
- Fix postgres DSN: awoooi:awoooi_prod_2026@localhost:5432/awoooi_prod
Sprint 5.2 new scripts:
- scripts/sprint51_e2e_validation.py: L7 E2E acceptance script (T1-T5)
- scripts/ops/deploy-docker-health-monitor.sh: Plan A one-click deployment script
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-08 18:17:48 +08:00 |
|
OG T
|
88696dba9b
|
feat(sprint5.1): Data Safety Guardrails full-chain integration (L1-L5)
CD Pipeline / build-and-deploy (push) Failing after 1m33s
Type Sync Check / check-type-sync (push) Failing after 58s
Layer 0 - K8s RBAC:
- k8s/rbac/api-velero-reader.yaml: awoooi-executor SA Velero backup reader
Layer 1 - DB Migration (already run on 188):
- M-002: approval_records adds approval_level/votes/required_votes
- M-003: alert_event_type ENUM adds 8 values
Layer 2 - IaC:
- ops/config/service-registry.yaml: stateful tier list for all services (BLOCK/CRITICAL_HITL/STANDARD_HITL/AUTO)
Layer 3 - Python Services:
- service_registry.py: reads the YAML and provides is_blocked/requires_multisig/get_required_votes
- velero_client.py: queries Velero backup age via kubectl; falls back to 999h on failure
- preflight_service.py: pre-flight safety checks (Q2/Q4 decisions)
Layer 1-M001 - Playbook model:
- playbook.py: adds requires_approval_level/stateful_targets/requires_pre_backup
Layer 4 - Business logic:
- alert_operation_log_repository.py: adds 8 event_types (Guardrail/Pre-flight/MultiSig/backup)
- auto_repair_service.py: injects the Service Registry Guardrail check (BLOCK → rejected outright)
- webhooks.py: ALERT_RECEIVED provenance record + auto_repair flag Q9 + Langfuse trace_id Q10
- db/models.py: ApprovalRecord syncs the approval_level/votes/required_votes columns
- docker-health-monitor.sh: reworked into a pure sensing layer (all docker restart logic removed)
Layer 5 - Telegram notifications:
- telegram_gateway.py: six new notification methods T1-T6 (Guardrail/Pre-flight/Backup/MultiSig/ChangeApplied)
Reference: ADR-062 Data Safety Guardrails, ADR-063 Service Registry IaC
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
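
A rough sketch of how the Layer 3 registry lookups and the Layer 4 guardrail check might fit together. The tier names and the three method names come from the commit message; the vote counts, the fail-closed default for unknown services, the in-memory dictionary in place of the YAML file, and `may_auto_repair` are all assumptions made for illustration.

```python
from typing import Dict

# Vote counts per tier are assumptions, not values from the real registry.
REQUIRED_VOTES: Dict[str, int] = {"CRITICAL_HITL": 2, "STANDARD_HITL": 1, "AUTO": 0}


class ServiceRegistry:
    """In-memory stand-in for the tiers loaded from service-registry.yaml."""

    def __init__(self, tiers: Dict[str, str]) -> None:
        self._tiers = dict(tiers)

    def tier_of(self, service: str) -> str:
        # Assumption: unknown services fail closed and are treated as BLOCK.
        return self._tiers.get(service, "BLOCK")

    def is_blocked(self, service: str) -> bool:
        return self.tier_of(service) == "BLOCK"

    def requires_multisig(self, service: str) -> bool:
        return self.get_required_votes(service) > 1

    def get_required_votes(self, service: str) -> int:
        return REQUIRED_VOTES.get(self.tier_of(service), 0)


def may_auto_repair(registry: ServiceRegistry, service: str, votes: int) -> bool:
    """BLOCK services are rejected outright; others need enough approvals."""
    if registry.is_blocked(service):
        return False
    return votes >= registry.get_required_votes(service)
```

Keeping the BLOCK check separate from the vote comparison means no number of approvals can unlock a blocked service in this sketch.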
|
2026-04-08 16:24:09 +08:00 |
|
OG T
|
a421d2c5b8
|
feat(ops): Plan A docker-health-monitor.sh: Docker container health monitoring with auto-repair
- Detect unhealthy / exited / dead containers
- Exclusion list: DB (PG/Redis), Gitea, monitoring stack
- Prometheus/Grafana/Alertmanager exited → docker start (protects the WAL)
- Mandatory three-stage notification: Intent→Action→Result (chief architect's ruling)
- HMAC-SHA256 signature → AWOOOI API /api/v1/webhooks/custom-alert
- Fallback: API down → call the Telegram Bot API directly
- 300s cooldown to prevent repeated repairs
Deployment: cron */5 * * * * on 192.168.0.110 + 192.168.0.188
Configuration: /etc/awoooi-ops/secrets.env
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
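
The HMAC-SHA256 signing step can be illustrated with the standard library. This is a sketch under assumptions: the commit message does not say how the body is serialized or how the signature is transmitted, so the compact JSON encoding, the hex digest, and the names `sign_payload` and `verify_signature` are hypothetical.

```python
import hashlib
import hmac
import json
from typing import Dict, Tuple


def sign_payload(secret: bytes, payload: Dict) -> Tuple[bytes, str]:
    """Serialize the payload and return (body, hex HMAC-SHA256 signature)."""
    # The signature covers the exact bytes sent, so serialize once and reuse.
    body = json.dumps(payload, separators=(",", ":"), sort_keys=True).encode("utf-8")
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return body, signature


def verify_signature(secret: bytes, body: bytes, signature: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

The receiving side would call `verify_signature` on the raw request body before parsing it, so a tampered or unsigned alert is rejected early.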
|
2026-04-08 11:48:39 +08:00 |
|