AWOOOI CD
|
586602e7ff
|
chore(cd): deploy f31b4e3 [skip ci]
|
2026-04-15 11:18:56 +00:00 |
|
AWOOOI CD
|
1d22376b86
|
chore(cd): deploy fab65e7 [skip ci]
|
2026-04-15 11:06:22 +00:00 |
|
AWOOOI CD
|
37f4553349
|
chore(cd): deploy 4e2e665 [skip ci]
|
2026-04-15 08:22:53 +00:00 |
|
AWOOOI CD
|
53344c201e
|
chore(cd): deploy 14a0226 [skip ci]
|
2026-04-15 07:57:10 +00:00 |
|
AWOOOI CD
|
9126c594a4
|
chore(cd): deploy 0f2ec79 [skip ci]
|
2026-04-15 07:28:25 +00:00 |
|
AWOOOI CD
|
8997ba70cb
|
chore(cd): deploy a142e6e [skip ci]
|
2026-04-15 07:11:37 +00:00 |
|
AWOOOI CD
|
d493fb9b78
|
chore(cd): deploy 7da64ea [skip ci]
|
2026-04-15 06:18:11 +00:00 |
|
AWOOOI CD
|
7edb298a75
|
chore(cd): deploy 42bc1df [skip ci]
|
2026-04-15 05:58:38 +00:00 |
|
AWOOOI CD
|
d51705b4ec
|
chore(cd): deploy b6cb199 [skip ci]
|
2026-04-15 05:40:15 +00:00 |
|
AWOOOI CD
|
40aa7ceba8
|
chore(cd): deploy 6c7f648 [skip ci]
|
2026-04-15 03:10:45 +00:00 |
|
AWOOOI CD
|
a52b550607
|
chore(cd): deploy a92562d [skip ci]
|
2026-04-14 13:50:09 +00:00 |
|
AWOOOI CD
|
44545633a8
|
chore(cd): deploy 208c28e [skip ci]
|
2026-04-14 12:53:38 +00:00 |
|
AWOOOI CD
|
a120cc45b8
|
chore(cd): deploy 10e3043 [skip ci]
|
2026-04-14 12:29:34 +00:00 |
|
AWOOOI CD
|
094aa957b2
|
chore(cd): deploy ca862c5 [skip ci]
|
2026-04-14 12:16:45 +00:00 |
|
AWOOOI CD
|
2a37d1c06f
|
chore(cd): deploy 8b7e9cb [skip ci]
|
2026-04-14 11:46:35 +00:00 |
|
AWOOOI CD
|
35736315ce
|
chore(cd): deploy 9b9ff5b [skip ci]
|
2026-04-14 11:31:31 +00:00 |
|
AWOOOI CD
|
3f8d087aee
|
chore(cd): deploy 72dd0c5 [skip ci]
|
2026-04-14 11:13:00 +00:00 |
|
AWOOOI CD
|
e7171a4ac8
|
chore(cd): deploy aa4e575 [skip ci]
|
2026-04-14 10:56:28 +00:00 |
|
AWOOOI CD
|
88a33eb4d7
|
chore(cd): deploy f54dea4 [skip ci]
|
2026-04-14 10:42:20 +00:00 |
|
OG T
|
b8b124c917
|
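chore: end-of-day checkpoint: LOGBOOK + archive obsolete flywheel-alerts CRD
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-04-14 evening session wrap-up:
- LOGBOOK.md: record today's 6 commits + Mission C E2E verification + MASTER blueprint P0+P1 all green
- k8s/monitoring/flywheel-alerts.yaml: add 🔴 DEPRECATED marker
(11/11 rules already migrated to ops/monitoring/alerts-unified.yml; this CRD file has no Operator support)
Backlog cleanup inventory:
✅ C2 hasType4 hardcoded in the frontend (now wired to the real API)
✅ C3 WebSocket had no reconnect (exponential backoff + polling fallback)
✅ flywheel-alerts rewritten the Docker way (done; only the old file was left uncleaned → archived in this commit)
✅ risk_level YAML-first precedence logic (decision_manager:1663)
⏳ SMTP_USER/SMTP_PASSWORD CHANGE_ME (needs credentials from the commander)
⏳ Various E2E verifications (need real alerts to fire)
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>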
|
2026-04-14 18:21:08 +08:00 |
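The C3 backlog item in the commit above mentions exponential backoff with a polling fallback for the WebSocket client. The frontend code is not shown in this log, so the following is only a minimal sketch of that idea; the function names, the 1 s base delay, the 30 s cap, and the 5-attempt limit are all illustrative assumptions.

```python
def reconnect_delays(base=1.0, factor=2.0, cap=30.0, max_attempts=5):
    """Yield the wait in seconds before each reconnect attempt.

    All defaults are hypothetical; the commit does not state the real values.
    """
    delay = base
    for _ in range(max_attempts):
        # Never wait longer than the cap, however many attempts have failed.
        yield min(delay, cap)
        delay *= factor


def next_action(failed_attempts, max_attempts=5):
    """Keep reconnecting while attempts remain, then fall back to polling."""
    return "reconnect" if failed_attempts < max_attempts else "poll"


if __name__ == "__main__":
    print(list(reconnect_delays()))
    print(next_action(5))
```

Capping the delay keeps a long outage from pushing the retry interval out indefinitely, and switching to polling after the last attempt means the UI still receives updates while the socket stays down.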
|
AWOOOI CD
|
0f48a507c0
|
chore(cd): deploy dd0a778 [skip ci]
|
2026-04-14 08:01:04 +00:00 |
|
AWOOOI CD
|
a71f09e30a
|
chore(cd): deploy 2c6ed4e [skip ci]
|
2026-04-14 07:38:35 +00:00 |
|
OG T
|
2c6ed4e9cf
|
fix(k8s): fix ArgoCD probe failure + drift-scanner egress blocking
CD Pipeline / build-and-deploy (push) Successful in 14m36s
Issue 1: ArgoCD "All connection attempts failed":
- ARGOCD_URL pointed to 192.168.0.120:30443, but kube-proxy on node 120 has a
routing bug for 30443 (the ArgoCD pod runs on 121)
- Fix: ARGOCD_URL → 192.168.0.121:30443
- NetworkPolicy: add allowlist entry 192.168.0.121/32:30443
- NetworkPolicy: add allowlist entry 192.168.0.125/32:30443 (keepalived VIP)
Issue 2: drift-scanner Error x5 / system silent for 9.4h:
- The CronJob pod template was missing the system=awoooi label
- default-deny-all blocks all egress, and allow-required-egress only applies to
system=awoooi pods
- Fix: add system: awoooi to the drift-cronjob pod template
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-14 15:28:52 +08:00 |
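Issue 2 in the commit above comes down to how NetworkPolicy pod selectors combine: a policy with an empty selector applies to every pod, and a pod only gains egress if some policy that selects it allows the traffic. The sketch below models that rule in plain Python. The policy dictionaries, the `allows_required_egress` flag, and the `app: drift-scanner` label are simplified stand-ins invented for illustration, not the actual manifests.

```python
def selects(pod_labels, match_labels):
    """A selector matches when every key/value it lists is on the pod.

    An empty selector therefore matches every pod.
    """
    return all(pod_labels.get(k) == v for k, v in match_labels.items())


def egress_allowed(pod_labels, policies):
    """Return True if the pod may make the required egress connections."""
    selecting = [p for p in policies if selects(pod_labels, p["match_labels"])]
    if not selecting:
        # No policy selects the pod, so it is not isolated at all.
        return True
    # Policies are additive: one selecting policy that allows the traffic is enough.
    return any(p["allows_required_egress"] for p in selecting)


# Hypothetical, simplified versions of the two policies named in the commit.
POLICIES = [
    {"name": "default-deny-all", "match_labels": {}, "allows_required_egress": False},
    {"name": "allow-required-egress", "match_labels": {"system": "awoooi"},
     "allows_required_egress": True},
]

if __name__ == "__main__":
    before_fix = {"app": "drift-scanner"}  # label value is an assumption
    after_fix = {"app": "drift-scanner", "system": "awoooi"}
    print(egress_allowed(before_fix, POLICIES))
    print(egress_allowed(after_fix, POLICIES))
```

Without the `system: awoooi` label the pod is selected only by the deny-all policy, which matches the silent failures described in the commit; adding the label brings it under the allow policy as well.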
|
AWOOOI CD
|
dd378ac698
|
chore(cd): deploy 684d6cf [skip ci]
|
2026-04-14 06:50:00 +00:00 |
|
AWOOOI CD
|
5d8feaad2a
|
chore(cd): deploy 38ff2bb [skip ci]
|
2026-04-12 15:01:47 +00:00 |
|
AWOOOI CD
|
6dec8ce491
|
chore(cd): deploy db4d428 [skip ci]
|
2026-04-12 14:32:47 +00:00 |
|
OG T
|
3de45aa2c3
|
fix(k8s): sync deployment env + disable ENABLE_NEMOTRON_COLLABORATION
CD Pipeline / build-and-deploy (push) Has been cancelled
Sync the live-patched env: overrides back to Git to avoid ArgoCD drift:
- ENABLE_NEMOTRON_COLLABORATION: false (Ollama timeout fix)
- Bring the override values for USE_AI_ROUTER, OLLAMA_URL, OPENCLAW_*, etc. under GitOps management
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 22:06:10 +08:00 |
|
AWOOOI CD
|
b6caabd8e3
|
chore(cd): deploy b3d4b9c [skip ci]
|
2026-04-12 13:29:40 +00:00 |
|
OG T
|
efca6f816a
|
fix(nemotron): disable Nemotron collaboration: Ollama timeout caused confidence=0.0
CD Pipeline / build-and-deploy (push) Failing after 2m1s
Ollama 111 tool_call 60s×2 > asyncio.wait_for 30s
→ expert_system fallback → confidence=0.0 → alert card shows rule match 0%
Suspend ADR-044 until Ollama is stable; the OpenClaw LLM still works normally
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 21:06:27 +08:00 |
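The failure chain in the commit above (a call that takes longer than the `asyncio.wait_for` budget, followed by a fallback with confidence 0.0) can be reproduced with a small sketch. `slow_llm_call` and `analyze` are hypothetical names, the 0.9 confidence is a placeholder, and the durations are scaled down from the 60 s×2 versus 30 s in the commit so the example runs quickly.

```python
import asyncio


async def slow_llm_call(seconds):
    """Stand-in for the Ollama tool call; only simulates latency."""
    await asyncio.sleep(seconds)
    return {"source": "llm", "confidence": 0.9}


async def analyze(call_seconds, timeout_seconds):
    """Return the LLM result, or a zero-confidence fallback on timeout."""
    try:
        return await asyncio.wait_for(slow_llm_call(call_seconds), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # wait_for cancels the inner call, so the caller only sees the fallback.
        return {"source": "expert_system", "confidence": 0.0}


if __name__ == "__main__":
    # The call needs four times the budget, as 120 s against 30 s would.
    print(asyncio.run(analyze(call_seconds=0.2, timeout_seconds=0.05)))
```

Whenever the upstream call reliably exceeds the timeout, every request lands in the fallback branch, which is consistent with the alert cards showing a 0% rule match until the collaboration path was disabled.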
|
AWOOOI CD
|
f64393e4cb
|
chore(cd): deploy eda0cfd [skip ci]
|
2026-04-12 12:30:49 +00:00 |
|
AWOOOI CD
|
f4675872f9
|
chore(cd): deploy c3fea26 [skip ci]
|
2026-04-12 12:17:06 +00:00 |
|
AWOOOI CD
|
6490c6a885
|
chore(cd): deploy e5791b9 [skip ci]
|
2026-04-12 11:34:56 +00:00 |
|
AWOOOI CD
|
9f264ebad1
|
chore(cd): deploy e89d878 [skip ci]
|
2026-04-12 11:07:02 +00:00 |
|
AWOOOI CD
|
022b3cd7d4
|
chore(cd): deploy 7fc1e0a [skip ci]
|
2026-04-12 10:12:04 +00:00 |
|
AWOOOI CD
|
295869d6c7
|
chore(cd): deploy 99b489c [skip ci]
|
2026-04-12 09:25:11 +00:00 |
|
AWOOOI CD
|
cce55d560d
|
chore(cd): deploy f0e1413 [skip ci]
|
2026-04-12 09:10:35 +00:00 |
|
AWOOOI CD
|
d2286ca827
|
chore(cd): deploy 93f9522 [skip ci]
|
2026-04-12 08:42:45 +00:00 |
|
AWOOOI CD
|
c8e9fbb518
|
chore(cd): deploy effd788 [skip ci]
|
2026-04-12 08:23:16 +00:00 |
|
AWOOOI CD
|
444b17513d
|
chore(cd): deploy 9b1812c [skip ci]
|
2026-04-12 07:52:09 +00:00 |
|
OG T
|
ec6a341f3e
|
feat(m5): ADR-074 M5: Docker 188 / Redis Streams / PostgreSQL disk alerts
Add the awoooi_infrastructure_detailed alert group:
- DockerContainerUnhealthyDetailed: containers on 188 with no activity > 2min
- RedisStreamBacklogHigh: Stream backlog > 500 entries
- PostgreSQLDiskGrowthRate: disk growth > 500MB in 1h
2026-04-12 ogt (ADR-074 M5)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 15:36:54 +08:00 |
|
OG T
|
c1c96ab47b
|
feat(m4): ADR-074 M4: weekly scheduled backup-restore verification CronJob
- scripts/cron_backup_restore_test.sh: Velero restore dry-run script
- k8s/awoooi-prod/16-cronjob-backup-restore-test.yaml: runs every Sunday at 02:00 Taipei time
- k8s/awoooi-prod/17-configmap-backup-restore-scripts.yaml: script ConfigMap
- flywheel-alerts.yaml: BackupRestoreTestFailed + BackupRestoreTestStale alerts
On failure, write to the node-exporter textfile → Prometheus alert → TYPE-3 Incident
2026-04-12 ogt (ADR-074 M4)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 15:36:30 +08:00 |
|
OG T
|
00a31abb85
|
feat(heartbeat): ADR-073 P2 heartbeat consolidation refactor: HeartbeatReportService + RedisLock
CD Pipeline / build-and-deploy (push) Has been cancelled
- Add HeartbeatReportService: 11 parallel probes (Ollama/Nemotron/Gemini/Claude/MCP×4/ArgoCD/Velero)
- Rewrite send_heartbeat(): RedisLock prevents duplicate sends + unified delivery to SRE_GROUP_CHAT_ID
- Simplify _heartbeat_loop(): remove the scattered, repeated silence sends
- config.py: add the OLLAMA_REQUIRED_MODELS field
- 03-secrets.example.yaml: add SRE_GROUP_CHAT_ID so CD Inject does not miss it
2026-04-12 ogt (ADR-073 Phase 2-3/4)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 15:35:13 +08:00 |
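The parallel probes described in the commit above could be structured roughly as follows. This is a sketch under assumptions, not the actual HeartbeatReportService: `probe`, `run_probes`, the 1 s per-probe timeout, and the two example checks are invented for illustration. Each probe catches its own failure, so one unreachable dependency cannot abort the whole report.

```python
import asyncio


async def probe(name, check, timeout=1.0):
    """Run one health check and report 'ok' or the failure type."""
    try:
        await asyncio.wait_for(check(), timeout=timeout)
        return name, "ok"
    except Exception as exc:
        # Failures are turned into data so gather() never raises.
        return name, f"fail: {type(exc).__name__}"


async def run_probes(checks):
    """Run all checks concurrently and collect a name -> status mapping."""
    results = await asyncio.gather(*(probe(n, c) for n, c in checks.items()))
    return dict(results)


async def healthy():
    await asyncio.sleep(0.01)


async def unreachable():
    raise ConnectionError("connection refused")


if __name__ == "__main__":
    # Target names come from the commit; the check bodies are placeholders.
    print(asyncio.run(run_probes({"ArgoCD": healthy, "Velero": unreachable})))
```

Because the checks run concurrently, the total time is bounded by the slowest probe (or its timeout) rather than the sum of all probes.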
|
OG T
|
16d682346a
|
feat(adr-074): M1 flywheel health Exporter + M2 host network monitoring
CD Pipeline / build-and-deploy (push) Has been cancelled
ADR-074 M1:
- FlywheelStatsService: computes 6 flywheel metrics (Playbook count / success rate / KM vectorization / alertname NULL / stuck count)
- GET /api/v1/stats/flywheel: live status of the six nodes (used by the C1 frontend)
- GET /api/v1/stats/summary: KPI panel data (used by the C1 frontend)
- GET /api/v1/stats/flywheel/metrics: Prometheus text format
- flywheel-alerts.yaml: 5 alert rules (FlywheelPlaybookZero/ExecutionSuccessLow/KMVectorizationLow/AlertnameNullHigh/IncidentsStuck)
- prometheus.yml: awoooi-flywheel scrape job (5-minute interval)
ADR-074 M2:
- prometheus.yml: host-connectivity Blackbox TCP probe (110:22/188:22/120:6443/121:6443)
- flywheel-alerts.yaml: HostNetworkPartition alert rule
597 unit tests passed
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 15:31:01 +08:00 |
|
OG T
|
0d239838b4
|
fix(cr): Code Review P2: test coverage + CronJob script refactor
CD Pipeline / build-and-deploy (push) Has been cancelled
P2-1: Extract the CronJob inline Python into scripts/cron_km_vectorize.py
Dockerfile adds COPY scripts/; the CronJob YAML now uses the script path
P2-2: Add test_classify_alert_early.py: 23 tests covering 7 classification rules
Includes edge cases: VeleroBackupFailed (backup takes priority over k8s), priority-order verification
595 unit tests passed
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 15:14:44 +08:00 |
|
OG T
|
f2fc4712ad
|
feat(flywheel): Phase 4: KM conversion hook + daily vectorize cron
CD Pipeline / build-and-deploy (push) Has been cancelled
ADR-073 Phase 4-2: incident_service.resolve_incident() KM conversion hook
- On resolve, fire-and-forget KMConversionService.convert(incident)
- Resolved Incidents are automatically converted into structured KM entries, completing the flywheel's "learning consolidation" node
- KMConversionService (Phase 4-1) already exists (km_conversion_service.py, 336 lines)
ADR-073 Phase 4-3: 15-cronjob-km-vectorize.yaml
- Calls /api/v1/knowledge/embed-all daily at 03:00 Taipei time
- Automatically vectorizes the KM entries added that day so RAG queries do not miss new knowledge
- Added to kustomization.yaml resources
Phase 4-4: handle_callback log_manual_fix already exists (telegram_gateway.py:2468)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 14:40:33 +08:00 |
|
AWOOOI CD
|
59b7d8ea32
|
chore(cd): deploy 6dc03c9 [skip ci]
|
2026-04-12 06:30:20 +00:00 |
|
OG T
|
6dc03c9a55
|
fix(argocd)+feat(flywheel): Phase 1 complete: ArgoCD image break fix + cold-start scripts
CD Pipeline / build-and-deploy (push) Has been cancelled
1. k8s/argocd/awoooi-prod-app.yaml:
Remove the Deployment image ignoreDifferences
- The original design meant ArgoCD did not update the image after CD updated kustomization.yaml
- After the fix, the GitOps loop works normally again
2. scripts/cold_start_playbooks.py:
ADR-073 Phase 1 Step 8: generate 15 baseline Playbooks (K8s/Docker/DB/Infra)
Result: Playbooks 0 → 15
3. scripts/batch_vectorize_km.py:
ADR-073 Phase 1 Step 9: batch-vectorize KM
Result: 711/713 embedding IS NOT NULL
Phase 1 is fully complete and the flywheel is unblocked:
- Pod running 105998d (includes all fixes from 8be87b0)
- debounce 30min + alertname NULL fix + _collect_mcp_context enabled
- 15 Playbooks + 711 KM entries vectorized
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 14:20:52 +08:00 |
|
AWOOOI CD
|
b3fdabeb91
|
chore(cd): deploy 105998d [skip ci]
|
2026-04-12 06:06:03 +00:00 |
|
OG T
|
7c4b36c2cd
|
fix(flywheel): Phase 1: deploy 8be87b0 + debounce 30min + alertname NULL fix
CD Pipeline / build-and-deploy (push) Failing after 2m29s
ADR-073 Phase 1, three fixes:
1. k8s/kustomization.yaml: newTag a86ecf3 → 8be87b0
- Unblocks _collect_mcp_context + auto_approve + DESTRUCTIVE_PATTERNS
- This is the key to unblocking the flywheel
2. webhooks.py: DEBOUNCE_WINDOW_MINUTES 5 → 30
- Stops the same issue from recreating an Incident every 5 minutes; uses a 30-minute convergence window instead
3. incident_repository.py: add the alertname key to the signals JSONB
- signal.model_dump() only has alert_name, but the DB query uses signals->0->>'alertname'
- Add an alertname alias, fixing the root cause of 132 incidents with alertname = NULL
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 13:41:22 +08:00 |
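Fixes 2 and 3 in the commit above are small enough to illustrate directly. The sketch below is an assumption-laden reconstruction, not the code in webhooks.py or incident_repository.py: `is_debounced` and `with_alertname_alias` are hypothetical helper names, and a plain dict stands in for the output of `signal.model_dump()`.

```python
from datetime import datetime, timedelta

DEBOUNCE_WINDOW_MINUTES = 30  # value from the commit; previously 5


def is_debounced(last_seen, now, window_minutes=DEBOUNCE_WINDOW_MINUTES):
    """True if the same alert fired within the window, so no new Incident is created."""
    return last_seen is not None and now - last_seen < timedelta(minutes=window_minutes)


def with_alertname_alias(signal):
    """Copy the signal dict and add 'alertname' next to 'alert_name'.

    A query such as signals->0->>'alertname' returns NULL unless that exact
    key exists in the stored JSON, which is why the alias is needed.
    """
    out = dict(signal)
    if "alert_name" in out:
        out.setdefault("alertname", out["alert_name"])
    return out


if __name__ == "__main__":
    first = datetime(2026, 4, 12, 13, 0)
    repeat = first + timedelta(minutes=10)
    print(is_debounced(first, repeat))                    # inside the 30-minute window
    print(is_debounced(first, repeat, window_minutes=5))  # old 5-minute window
    print(with_alertname_alias({"alert_name": "HostNetworkPartition"}))
```

With the old 5-minute window, an alert repeating every 10 minutes would open a new Incident each time; with 30 minutes it folds into the existing one.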
|
AWOOOI CD
|
fdb8c2b97b
|
chore(cd): deploy a86ecf3 [skip ci]
|
2026-04-12 04:28:38 +00:00 |
|