Commit Graph

168 Commits

Author SHA1 Message Date
AWOOOI CD
586602e7ff chore(cd): deploy f31b4e3 [skip ci] 2026-04-15 11:18:56 +00:00
AWOOOI CD
1d22376b86 chore(cd): deploy fab65e7 [skip ci] 2026-04-15 11:06:22 +00:00
AWOOOI CD
37f4553349 chore(cd): deploy 4e2e665 [skip ci] 2026-04-15 08:22:53 +00:00
AWOOOI CD
53344c201e chore(cd): deploy 14a0226 [skip ci] 2026-04-15 07:57:10 +00:00
AWOOOI CD
9126c594a4 chore(cd): deploy 0f2ec79 [skip ci] 2026-04-15 07:28:25 +00:00
AWOOOI CD
8997ba70cb chore(cd): deploy a142e6e [skip ci] 2026-04-15 07:11:37 +00:00
AWOOOI CD
d493fb9b78 chore(cd): deploy 7da64ea [skip ci] 2026-04-15 06:18:11 +00:00
AWOOOI CD
7edb298a75 chore(cd): deploy 42bc1df [skip ci] 2026-04-15 05:58:38 +00:00
AWOOOI CD
d51705b4ec chore(cd): deploy b6cb199 [skip ci] 2026-04-15 05:40:15 +00:00
AWOOOI CD
40aa7ceba8 chore(cd): deploy 6c7f648 [skip ci] 2026-04-15 03:10:45 +00:00
AWOOOI CD
a52b550607 chore(cd): deploy a92562d [skip ci] 2026-04-14 13:50:09 +00:00
AWOOOI CD
44545633a8 chore(cd): deploy 208c28e [skip ci] 2026-04-14 12:53:38 +00:00
AWOOOI CD
a120cc45b8 chore(cd): deploy 10e3043 [skip ci] 2026-04-14 12:29:34 +00:00
AWOOOI CD
094aa957b2 chore(cd): deploy ca862c5 [skip ci] 2026-04-14 12:16:45 +00:00
AWOOOI CD
2a37d1c06f chore(cd): deploy 8b7e9cb [skip ci] 2026-04-14 11:46:35 +00:00
AWOOOI CD
35736315ce chore(cd): deploy 9b9ff5b [skip ci] 2026-04-14 11:31:31 +00:00
AWOOOI CD
3f8d087aee chore(cd): deploy 72dd0c5 [skip ci] 2026-04-14 11:13:00 +00:00
AWOOOI CD
e7171a4ac8 chore(cd): deploy aa4e575 [skip ci] 2026-04-14 10:56:28 +00:00
AWOOOI CD
88a33eb4d7 chore(cd): deploy f54dea4 [skip ci] 2026-04-14 10:42:20 +00:00
OG T
b8b124c917 chore: end-of-day checkpoint - LOGBOOK + archive the obsolete flywheel-alerts CRD
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
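2026-04-14 evening session wrap-up:
- LOGBOOK.md: record today's 6 commits + Mission C E2E verification + MASTER blueprint P0+P1 all green
- k8s/monitoring/flywheel-alerts.yaml: add 🔴 DEPRECATED marker
  (all 11/11 rules have been migrated into ops/monitoring/alerts-unified.yml; this CRD file has no Operator support)

Backlog sweep inventory:
 C2 hasType4 hard-coded in the frontend (now wired to the real API)
 C3 WebSocket had no reconnect (exponential backoff + polling fallback)
 flywheel-alerts rewritten for the Docker setup (done; only the old file was left uncleaned → archived in this commit)
 risk_level YAML-first priority logic (decision_manager:1663)
 SMTP_USER/SMTP_PASSWORD still CHANGE_ME (needs credentials from the commander)
 Various E2E verifications (need a real alert to trigger them)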

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-14 18:21:08 +08:00
AWOOOI CD
0f48a507c0 chore(cd): deploy dd0a778 [skip ci] 2026-04-14 08:01:04 +00:00
AWOOOI CD
a71f09e30a chore(cd): deploy 2c6ed4e [skip ci] 2026-04-14 07:38:35 +00:00
OG T
2c6ed4e9cf fix(k8s): fix ArgoCD probe failure + drift-scanner egress block
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 14m36s
Problem 1 - ArgoCD "All connection attempts failed":
- ARGOCD_URL pointed at 192.168.0.120:30443, but kube-proxy on node 120 has a
  routing bug for 30443 (the ArgoCD pod runs on 121)
- Fix: ARGOCD_URL → 192.168.0.121:30443
- NetworkPolicy: add allowlist entry 192.168.0.121/32:30443
- NetworkPolicy: add allowlist entry 192.168.0.125/32:30443 (keepalived VIP)

Problem 2 - drift-scanner Error x5 / system silent for 9.4h:
- The CronJob pod template was missing the system=awoooi label
- default-deny-all blocks all egress; allow-required-egress only applies to
  system=awoooi pods
- Fix: add system: awoooi to the drift-cronjob pod template
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 15:28:52 +08:00
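
The label problem described in commit 2c6ed4e9cf can be illustrated with a minimal sketch of how a `matchLabels` pod selector decides whether a NetworkPolicy applies to a pod. The `app: drift-scanner` label is a hypothetical placeholder; only `system: awoooi` comes from the commit message, and the selectors are assumed to be plain `matchLabels` maps.

```python
def selector_matches(pod_labels: dict[str, str], match_labels: dict[str, str]) -> bool:
    """Return True when every matchLabels pair is present on the pod.

    An empty selector matches every pod, which is why a default-deny
    policy applies to all pods in its namespace.
    """
    return all(pod_labels.get(key) == value for key, value in match_labels.items())


# Hypothetical pod labels before and after the fix.
before_fix = {"app": "drift-scanner"}
after_fix = {"app": "drift-scanner", "system": "awoooi"}

deny_selector: dict[str, str] = {}          # assumed shape of default-deny-all
allow_selector = {"system": "awoooi"}       # assumed shape of allow-required-egress

if __name__ == "__main__":
    for name, labels in (("before", before_fix), ("after", after_fix)):
        print(name,
              "deny applies:", selector_matches(labels, deny_selector),
              "allow applies:", selector_matches(labels, allow_selector))
```

Under these assumptions, the unlabeled pod is covered by the deny policy but not by the allow policy, which is consistent with the blocked egress the commit reports.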
AWOOOI CD
dd378ac698 chore(cd): deploy 684d6cf [skip ci] 2026-04-14 06:50:00 +00:00
AWOOOI CD
5d8feaad2a chore(cd): deploy 38ff2bb [skip ci] 2026-04-12 15:01:47 +00:00
AWOOOI CD
6dec8ce491 chore(cd): deploy db4d428 [skip ci] 2026-04-12 14:32:47 +00:00
OG T
3de45aa2c3 fix(k8s): sync deployment env + disable ENABLE_NEMOTRON_COLLABORATION
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Sync the live-patched env: overrides back to Git to avoid ArgoCD drift:
- ENABLE_NEMOTRON_COLLABORATION: false (Ollama timeout fix)
- Bring USE_AI_ROUTER, OLLAMA_URL, OPENCLAW_* and other override values under GitOps management

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 22:06:10 +08:00
AWOOOI CD
b6caabd8e3 chore(cd): deploy b3d4b9c [skip ci] 2026-04-12 13:29:40 +00:00
OG T
efca6f816a fix(nemotron): disable Nemotron collaboration - Ollama timeout caused confidence=0.0
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 2m1s
Ollama 111 tool_call 60s×2 > asyncio.wait_for 30s
→ expert_system fallback → confidence=0.0 → alert card shows rule match 0%

Pause ADR-044 until Ollama is stable; the OpenClaw LLM still works normally

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 21:06:27 +08:00
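
The failure chain in commit efca6f816a (a call that outlasts its `asyncio.wait_for` budget and falls back to a zero-confidence result) can be reproduced with a small sketch. The function names and result fields here are hypothetical, and the durations are scaled down from the 60s and 30s values in the commit message.

```python
import asyncio


async def slow_tool_call(delay: float) -> dict:
    """Stand-in for an LLM tool call that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return {"source": "llm", "confidence": 0.9}


async def analyze(delay: float, timeout: float) -> dict:
    """Return the LLM result, or a zero-confidence fallback on timeout."""
    try:
        return await asyncio.wait_for(slow_tool_call(delay), timeout=timeout)
    except asyncio.TimeoutError:
        # The call was cancelled because it exceeded the budget.
        return {"source": "expert_system", "confidence": 0.0}


if __name__ == "__main__":
    print(asyncio.run(analyze(delay=0.2, timeout=0.05)))   # call slower than budget
    print(asyncio.run(analyze(delay=0.01, timeout=0.5)))   # call within budget
```

Whenever the inner call is slower than the outer timeout, the fallback branch is the only reachable outcome, so the budget has to exceed the worst-case call time (including retries) for the LLM path to succeed.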
AWOOOI CD
f64393e4cb chore(cd): deploy eda0cfd [skip ci] 2026-04-12 12:30:49 +00:00
AWOOOI CD
f4675872f9 chore(cd): deploy c3fea26 [skip ci] 2026-04-12 12:17:06 +00:00
AWOOOI CD
6490c6a885 chore(cd): deploy e5791b9 [skip ci] 2026-04-12 11:34:56 +00:00
AWOOOI CD
9f264ebad1 chore(cd): deploy e89d878 [skip ci] 2026-04-12 11:07:02 +00:00
AWOOOI CD
022b3cd7d4 chore(cd): deploy 7fc1e0a [skip ci] 2026-04-12 10:12:04 +00:00
AWOOOI CD
295869d6c7 chore(cd): deploy 99b489c [skip ci] 2026-04-12 09:25:11 +00:00
AWOOOI CD
cce55d560d chore(cd): deploy f0e1413 [skip ci] 2026-04-12 09:10:35 +00:00
AWOOOI CD
d2286ca827 chore(cd): deploy 93f9522 [skip ci] 2026-04-12 08:42:45 +00:00
AWOOOI CD
c8e9fbb518 chore(cd): deploy effd788 [skip ci] 2026-04-12 08:23:16 +00:00
AWOOOI CD
444b17513d chore(cd): deploy 9b1812c [skip ci] 2026-04-12 07:52:09 +00:00
OG T
ec6a341f3e feat(m5): ADR-074 M5 - Docker 188 / Redis Streams / PostgreSQL disk alerts
Add the awoooi_infrastructure_detailed alert group:
- DockerContainerUnhealthyDetailed: containers on 188 inactive for > 2min
- RedisStreamBacklogHigh: stream backlog > 500 entries
- PostgreSQLDiskGrowthRate: disk growth > 500MB in 1h

2026-04-12 ogt (ADR-074 M5)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 15:36:54 +08:00
OG T
c1c96ab47b feat(m4): ADR-074 M4 - weekly scheduled backup-restore verification CronJob
- scripts/cron_backup_restore_test.sh: Velero restore dry-run script
- k8s/awoooi-prod/16-cronjob-backup-restore-test.yaml: runs every Sunday at 02:00 Taipei time
- k8s/awoooi-prod/17-configmap-backup-restore-scripts.yaml: script ConfigMap
- flywheel-alerts.yaml: BackupRestoreTestFailed + BackupRestoreTestStale alerts

On failure, write to the node-exporter textfile → Prometheus alert → TYPE-3 Incident

2026-04-12 ogt (ADR-074 M4)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 15:36:30 +08:00
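
As an illustration of the textfile step in commit c1c96ab47b, the sketch below writes a `.prom` file for the node-exporter textfile collector. The actual script in the commit is a shell script; this Python version, the metric names, and the file name are assumptions made for the example.

```python
import os
import tempfile
import time


def write_textfile_metric(directory: str, success: bool) -> str:
    """Atomically write a .prom file that the textfile collector can scrape."""
    # Metric names below are hypothetical, not taken from the repository.
    lines = [
        "# HELP backup_restore_test_success 1 if the last restore test passed.",
        "# TYPE backup_restore_test_success gauge",
        f"backup_restore_test_success {1 if success else 0}",
        "# HELP backup_restore_test_last_run_timestamp_seconds Unix time of the last run.",
        "# TYPE backup_restore_test_last_run_timestamp_seconds gauge",
        f"backup_restore_test_last_run_timestamp_seconds {int(time.time())}",
    ]
    final_path = os.path.join(directory, "backup_restore_test.prom")
    # Write to a temp file first so the collector never reads a partial file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    os.replace(tmp_path, final_path)
    return final_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = write_textfile_metric(tmp_dir, success=False)
        with open(path) as handle:
            print(handle.read())
```

Writing a timestamp alongside the success flag is one way to support both a "failed" alert and a "stale" alert, since staleness can be derived from the age of the last run.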
OG T
00a31abb85 feat(heartbeat): ADR-073 P2 heartbeat consolidation refactor - HeartbeatReportService + RedisLock
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- Add HeartbeatReportService: 11 concurrent probes (Ollama/Nemotron/Gemini/Claude/MCP×4/ArgoCD/Velero)
- Rewrite send_heartbeat(): RedisLock prevents duplicate sends + all messages go to SRE_GROUP_CHAT_ID
- Simplify _heartbeat_loop(): remove the scattered, repeated silence sends
- config.py: add the OLLAMA_REQUIRED_MODELS field
- 03-secrets.example.yaml: add SRE_GROUP_CHAT_ID so CD Inject does not miss it

2026-04-12 ogt (ADR-073 Phase 2-3/4)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 15:35:13 +08:00
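
The concurrent-probe idea in commit 00a31abb85 can be sketched with `asyncio.gather`. This is not the HeartbeatReportService implementation; the probe functions, names, and result strings are hypothetical, and the Redis lock is left out to keep the example self-contained.

```python
import asyncio
from typing import Awaitable, Callable

Probe = Callable[[], Awaitable[str]]


async def run_probes(probes: dict[str, Probe], timeout: float = 5.0) -> dict[str, str]:
    """Run all probes concurrently; one failing probe never breaks the report."""

    async def guarded(probe: Probe) -> str:
        try:
            return await asyncio.wait_for(probe(), timeout=timeout)
        except asyncio.TimeoutError:
            return "timeout"
        except Exception as exc:  # report the failure instead of raising
            return f"error: {exc}"

    names = list(probes)
    results = await asyncio.gather(*(guarded(probes[name]) for name in names))
    return dict(zip(names, results))


async def ok() -> str:
    """Hypothetical healthy probe."""
    return "ok"


async def broken() -> str:
    """Hypothetical probe whose target is unreachable."""
    raise ConnectionError("connection refused")


if __name__ == "__main__":
    print(asyncio.run(run_probes({"ollama": ok, "argocd": broken})))
```

Wrapping each probe individually keeps the total report time close to the slowest single probe rather than the sum of all probes.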
OG T
16d682346a feat(adr-074): M1 flywheel health Exporter + M2 host network monitoring
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
ADR-074 M1:
  - FlywheelStatsService: computes 6 flywheel metrics (Playbook count/success rate/KM vectorization/alertname NULL/stuck count)
  - GET /api/v1/stats/flywheel: real-time status of the six nodes (used by the C1 frontend)
  - GET /api/v1/stats/summary: KPI panel data (used by the C1 frontend)
  - GET /api/v1/stats/flywheel/metrics: Prometheus text format
  - flywheel-alerts.yaml: 5 alert rules (FlywheelPlaybookZero/ExecutionSuccessLow/KMVectorizationLow/AlertnameNullHigh/IncidentsStuck)
  - prometheus.yml: awoooi-flywheel scrape job (5-minute interval)

ADR-074 M2:
  - prometheus.yml: host-connectivity Blackbox TCP probe(110:22/188:22/120:6443/121:6443)
  - flywheel-alerts.yaml: HostNetworkPartition alert rule

597 unit tests passed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 15:31:01 +08:00
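
Commit 16d682346a mentions an endpoint that returns metrics in the Prometheus text format. A minimal sketch of rendering gauges in that format follows; the metric name and help text are hypothetical and are not taken from FlywheelStatsService.

```python
def render_metrics(metrics: dict[str, tuple[str, float]]) -> str:
    """Render gauges as Prometheus text exposition format.

    `metrics` maps a metric name to a (help text, value) pair.
    """
    lines = []
    for name, (help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    # The format expects a trailing newline after the last sample.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metrics({
        "flywheel_playbook_total": ("Number of playbooks.", 15),
    }), end="")
```

A scrape job such as the `awoooi-flywheel` job named in the commit would then poll an endpoint returning this text at its configured interval.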
OG T
0d239838b4 fix(cr): Code Review P2 - test coverage + CronJob script refactor
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
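P2-1: Extract the CronJob inline Python into scripts/cron_km_vectorize.py
      Add COPY scripts/ to the Dockerfile; the CronJob YAML now uses the script path
P2-2: Add test_classify_alert_early.py - 23 tests covering 7 classification rules
      Includes edge cases: VeleroBackupFailed (backup takes priority over k8s), priority-order verification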

595 unit tests passed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 15:14:44 +08:00
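
The priority edge case tested in commit 0d239838b4 (VeleroBackupFailed classified as backup rather than k8s) can be sketched as first-match-wins over an ordered rule list. The rule names and keywords below are hypothetical; the real classifier and its 7 rules are not shown in the commit message.

```python
# Ordered rules: earlier entries win when several keywords match.
# Categories and keywords are illustrative assumptions only.
RULES: list[tuple[str, tuple[str, ...]]] = [
    ("backup", ("backup", "restore")),
    ("k8s", ("kube", "pod", "velero")),
]


def classify(alert_name: str) -> str:
    """Return the category of the first rule with a matching keyword."""
    lowered = alert_name.lower()
    for category, keywords in RULES:
        if any(keyword in lowered for keyword in keywords):
            return category
    return "unknown"


if __name__ == "__main__":
    for name in ("VeleroBackupFailed", "KubePodCrashLooping", "DiskFull"):
        print(name, "->", classify(name))
```

Because "VeleroBackupFailed" matches keywords in both rules here, the result depends only on rule order, which is the property a priority-order test would pin down.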
OG T
f2fc4712ad feat(flywheel): Phase 4 — KM conversion hook + daily vectorize cron
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
ADR-073 Phase 4-2: incident_service.resolve_incident() KM conversion hook
- On resolve, fire-and-forget KMConversionService.convert(incident)
- Resolved Incidents are automatically converted into structured KM entries, completing the flywheel's "learning consolidation" node
- KMConversionService (Phase 4-1) already exists (km_conversion_service.py, 336 lines)

ADR-073 Phase 4-3: 15-cronjob-km-vectorize.yaml
- Calls /api/v1/knowledge/embed-all daily at 03:00 Taipei time
- Automatically vectorizes the day's new KM entries so RAG queries do not miss new knowledge
- Added to kustomization.yaml resources

Phase 4-4: handle_callback log_manual_fix already exists (telegram_gateway.py:2468)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 14:40:33 +08:00
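
The fire-and-forget hook described in commit f2fc4712ad can be sketched with `asyncio.create_task`. The function names, the incident ID, and the `sink` list are hypothetical stand-ins; this is not the code of `resolve_incident()` or `KMConversionService`.

```python
import asyncio

# Strong references keep pending tasks from being garbage-collected.
_background_tasks: set[asyncio.Task] = set()


async def convert_to_km(incident_id: str, sink: list[str]) -> None:
    """Stand-in for a slow conversion step that runs in the background."""
    await asyncio.sleep(0.01)
    sink.append(incident_id)


async def resolve_incident(incident_id: str, sink: list[str]) -> str:
    """Resolve immediately; schedule the conversion without awaiting it."""
    task = asyncio.create_task(convert_to_km(incident_id, sink))
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
    return "resolved"


async def demo() -> tuple[str, list[str], list[str]]:
    sink: list[str] = []
    status = await resolve_incident("INC-1", sink)
    snapshot = list(sink)          # taken before the background task has run
    await asyncio.sleep(0.05)      # give the background task time to finish
    return status, snapshot, sink


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The trade-off of this pattern is that a failed conversion does not fail the resolve call, so errors have to be logged or surfaced inside the background task itself.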
AWOOOI CD
59b7d8ea32 chore(cd): deploy 6dc03c9 [skip ci] 2026-04-12 06:30:20 +00:00
OG T
6dc03c9a55 fix(argocd)+feat(flywheel): Phase 1 complete - fix the broken ArgoCD image update path + cold-start scripts
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
1. k8s/argocd/awoooi-prod-app.yaml:
   Remove the Deployment image ignoreDifferences
   - The original design meant ArgoCD did not update the image after CD updated kustomization.yaml
   - With the fix, the GitOps loop is closed again

2. scripts/cold_start_playbooks.py:
   ADR-073 Phase 1 Step 8 - generate 15 baseline Playbooks (K8s/Docker/DB/Infra)
   Result: Playbooks 0 → 15

3. scripts/batch_vectorize_km.py:
   ADR-073 Phase 1 Step 9 - batch-vectorize KM
   Result: 711/713 embedding IS NOT NULL

Phase 1 is fully complete and the flywheel is unblocked:
- Pod running 105998d (includes all fixes from 8be87b0)
- debounce 30min + alertname NULL fix + _collect_mcp_context enabled
- 15 Playbooks + 711 KM entries vectorized

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 14:20:52 +08:00
AWOOOI CD
b3fdabeb91 chore(cd): deploy 105998d [skip ci] 2026-04-12 06:06:03 +00:00
OG T
7c4b36c2cd fix(flywheel): Phase 1 - deploy 8be87b0 + debounce 30min + alertname NULL fix
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 2m29s
ADR-073 Phase 1, three fixes:

1. k8s/kustomization.yaml: newTag a86ecf3 → 8be87b0
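   - Unblocks _collect_mcp_context + auto_approve + DESTRUCTIVE_PATTERNS
   - This is the key to unblocking the flywheel

2. webhooks.py: DEBOUNCE_WINDOW_MINUTES 5 → 30
   - Stops the same problem from recreating an Incident every 5 minutes; a 30-minute convergence window is used instead

3. incident_repository.py: add the alertname key to the signals JSONB
   - signal.model_dump() only has alert_name, while the DB query uses signals->0->>'alertname'
   - Add an alertname alias, fixing the root cause of 132 rows with incidents.alertname = NULL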

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 13:41:22 +08:00
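
Fixes 2 and 3 of commit 7c4b36c2cd can be sketched as two small helpers: a debounce check against a 30-minute window, and an alias that copies `alert_name` into `alertname` before the signal is stored. Only `DEBOUNCE_WINDOW_MINUTES`, `alert_name`, and `alertname` come from the commit message; the function names and signatures are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

DEBOUNCE_WINDOW_MINUTES = 30  # value after the fix (previously 5)


def should_create_incident(last_seen: Optional[datetime], now: datetime) -> bool:
    """Create a new incident only if the debounce window has elapsed."""
    if last_seen is None:
        return True
    return now - last_seen >= timedelta(minutes=DEBOUNCE_WINDOW_MINUTES)


def with_alertname_alias(signal: dict) -> dict:
    """Copy alert_name into alertname so key lookups on 'alertname' succeed."""
    enriched = dict(signal)  # leave the caller's dict untouched
    if "alert_name" in enriched and "alertname" not in enriched:
        enriched["alertname"] = enriched["alert_name"]
    return enriched


if __name__ == "__main__":
    first = datetime(2026, 4, 12, 12, 0)
    print(should_create_incident(first, first + timedelta(minutes=5)))
    print(should_create_incident(first, first + timedelta(minutes=30)))
    print(with_alertname_alias({"alert_name": "HostDown"}))
```

Keeping both keys in the stored signal means existing readers of `alert_name` keep working while queries on `alertname` stop returning NULL.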
AWOOOI CD
fdb8c2b97b chore(cd): deploy a86ecf3 [skip ci] 2026-04-12 04:28:38 +00:00