Your Name
a18e2f9c3f
fix(security): disable GitHub production deploy
2026-05-12 16:22:16 +08:00
Your Name
ec5eaef31c
chore(ci): enable Gitea Actions workflows
2026-05-02 15:20:01 +08:00
OG T
25e69e6870
feat(cicd): complete ADR-039 - disable GitHub Actions, make Gitea the primary repository
E2E Health Check / e2e-health (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
- Disable all GitHub Actions workflows (.disabled)
- Update CLAUDE.md with a Gitea CI/CD section
- Update LOGBOOK.md to record the migration status
- Gitea version: 1.25.5
- Runner version: v0.3.1 (host network mode)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-30 01:07:32 +08:00
OG T
f0933620e1
fix(cd): automatically restart the API Pod after a Secret update
K8s issue: after a Secret is patched, running Pods do not pick up the new values
Fix: add kubectl rollout restart to force a restart
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 21:16:40 +08:00
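Why the restart is needed, assuming the API reads the Secret through environment variables: those are resolved only when a container starts, so a patched Secret reaches the API only through new Pods. `kubectl rollout restart` gets there by changing the Deployment's pod template, which the controller treats as a new revision. The Python sketch below builds an equivalent patch body by hand, assuming the `kubectl.kubernetes.io/restartedAt` annotation that kubectl itself writes; `restart_patch` and `restart_patch_json` are hypothetical helpers, not part of the CD workflow.

```python
import json
from datetime import datetime, timezone


def restart_patch(now: datetime) -> dict:
    """Build a patch that stamps the pod template with a restart time.

    Hypothetical helper. Changing anything under spec.template makes the
    Deployment controller roll out new Pods, and new Pods read the current
    Secret values when their containers start.
    """
    stamp = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {"kubectl.kubernetes.io/restartedAt": stamp}
                }
            }
        }
    }


def restart_patch_json(now: datetime) -> str:
    # JSON text suitable for `kubectl patch deployment ... -p '<json>'`.
    return json.dumps(restart_patch(now))
```

In practice the single command `kubectl rollout restart deployment/<api-deployment> -n awoooi-prod` is simpler; `<api-deployment>` is a placeholder because the Deployment name does not appear in this log.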
OG T
6a8e1bfdd1
feat(cicd): Gitea Mirror B2 backup strategy
- Add Gitea remote (192.168.0.110:3001/wooo/awoooi)
- Automatically mirror to Gitea after a successful CD run
- Add the GITEA_MIRROR_TOKEN GitHub Secret
- Update the LOGBOOK record
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 20:28:21 +08:00
OG T
caaf12e41c
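fix(cd): P0 concurrency governance - separate concurrency group for force_deploy
Chief architect review 2026-03-29:
- Problem: cancel-in-progress: true let new pushes cancel force_deploy runs
- Force deploys have been cancelled 5+ times, leaving 25 commits undeployed
- Solution: force_deploy uses its own group, so regular pushes cannot cancel it
- Regular pushes still cancel each other (to prevent Runner file conflicts)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>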
2026-03-29 16:42:50 +08:00
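The effect of the split can be checked with a small simulation. This Python sketch models only the rule the commit relies on (with cancel-in-progress: true, a new run cancels the in-progress run in the same concurrency group); the `cd-force-<run_id>` and `cd-<ref>` group names are hypothetical, since the actual group expression is not in this log.

```python
from dataclasses import dataclass


@dataclass
class Run:
    run_id: int
    force_deploy: bool
    state: str = "in_progress"


def concurrency_group(run: Run, isolate_force: bool, ref: str = "refs/heads/main") -> str:
    # Hypothetical naming scheme; the real group expression is not in the log.
    if isolate_force and run.force_deploy:
        return f"cd-force-{run.run_id}"
    return f"cd-{ref}"


def simulate(runs: list[Run], isolate_force: bool = True) -> dict[int, str]:
    """Start runs in order while earlier ones are still running.

    Simplified model of cancel-in-progress: true: a new run cancels the
    in-progress run that shares its concurrency group.
    """
    latest: dict[str, Run] = {}
    for run in runs:
        group = concurrency_group(run, isolate_force)
        previous = latest.get(group)
        if previous is not None and previous.state == "in_progress":
            previous.state = "cancelled"
        latest[group] = run
    return {run.run_id: run.state for run in runs}
```

With isolation, a force deploy survives later pushes; without it, the next push cancels it. Giving every force deploy a unique group also means two force deploys could overlap on the same Runner; which variant the workflow actually uses is not recorded here.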
OG T
cf6cf1ff20
fix(cd): P0 double-skip guard - prevent ImagePullBackOff
Chief architect review 2026-03-29:
- Problem: when both the API and Web builds are skipped, the kustomize output still contains IMAGE_TAG_PLACEHOLDER
- Impact: kubectl apply deploys an invalid image → ImagePullBackOff
- Fix: detect the double skip, sync Secrets only, and skip the Deployment apply
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 16:18:14 +08:00
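The double-skip guard (and the earlier per-image fix in 708ea4686e) reduces to a decision on the upstream job results. A minimal Python sketch, assuming the Deploy job sees each build's result as GitHub Actions exposes it in `needs.<job>.result` (`success`, `skipped`, ...); `plan_deploy` and its return keys are hypothetical. How the un-built image is pinned when only one build runs is not described in these commits, so the sketch only records which images to update.

```python
def plan_deploy(api_result: str, web_result: str) -> dict:
    """Decide what Deploy does from the two build job results (hypothetical helper)."""
    built = [name for name, result in (("api", api_result), ("web", web_result))
             if result == "success"]
    return {
        "sync_secrets": True,              # Secrets sync runs even on a double skip
        "set_images": built,               # only images that were actually built
        "apply_deployments": bool(built),  # double skip: never apply the placeholder tag
    }
```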
OG T
12f7a83df8
fix(ci): fix Runner _diag/pages file conflicts (complete fix)
Root cause:
- 41 zombie Runner processes conflicting with each other
- The _diag/pages directory is never cleaned up automatically
Solution:
- Every workflow job cleans _diag/pages as its first step
- Covers all self-hosted runner jobs
Scope:
- runner-healthcheck.yml (2 jobs)
- daily-e2e-health.yaml (1 job)
- nightly-llm.yaml (1 job)
- ci.yaml (9 jobs)
- cd.yaml (already in place)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 15:09:13 +08:00
OG T
50c055b547
feat(api): Phase D-G P0 fixes - modularize the Learning Repository as a building block
Added:
- ILearningRepository Protocol (interfaces.py)
- LearningRepository (Redis persistence layer)
- Learning API endpoints (/api/v1/learning/*)
- LearningService.get_recommended_fix() method
- LearningService.get_learning_summary() method
Fixed:
- The Service no longer depends directly on the Redis client (it goes through the Repository)
- Complies with the leWOOOgo building-block principle
- Chief architect review: 74/100 → 92/100
Updated:
- ADR-030: add a Phase D-G P0 fixes section
- Skill 02: v1.9 → v2.0
- Runner fix: sequential builds resolve the _runner_file_commands conflict
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 11:03:51 +08:00
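The repository split is plain dependency inversion: `LearningService` talks to a `Protocol`, and only the repository implementation knows about Redis. A minimal Python sketch of that shape; the repository method names (`get_fix`, `record_fix`, `count`), the service method signatures, and the in-memory test double are hypothetical, since interfaces.py is not reproduced in this log.

```python
from typing import Optional, Protocol


class ILearningRepository(Protocol):
    # Hypothetical methods; the real interfaces.py is not shown in this log.
    def get_fix(self, incident_type: str) -> Optional[str]: ...
    def record_fix(self, incident_type: str, fix: str) -> None: ...
    def count(self) -> int: ...


class InMemoryLearningRepository:
    """Test double; the production LearningRepository is backed by Redis."""

    def __init__(self) -> None:
        self._fixes: dict[str, str] = {}

    def get_fix(self, incident_type: str) -> Optional[str]:
        return self._fixes.get(incident_type)

    def record_fix(self, incident_type: str, fix: str) -> None:
        self._fixes[incident_type] = fix

    def count(self) -> int:
        return len(self._fixes)


class LearningService:
    # Depends only on the Protocol, never on a Redis client.
    def __init__(self, repo: ILearningRepository) -> None:
        self._repo = repo

    def get_recommended_fix(self, incident_type: str) -> Optional[str]:
        return self._repo.get_fix(incident_type)

    def get_learning_summary(self) -> dict:
        return {"known_fixes": self._repo.count()}
```

Because the service never imports Redis, unit tests can pass the in-memory double while production wires in the Redis-backed repository.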
OG T
d15fb7d9f4
fix(cd): sequential builds fix the Runner _runner_file_commands conflict
Root cause: the "Set up job" phase of parallel jobs writes to RUNNER_TEMP at the same time
Fix: build-api needs build-web, ensuring sequential execution
Removed: job-level concurrency groups (no longer needed)
Updated: ops/runner/README.md v1.0→v2.0
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 10:29:11 +08:00
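Adding `build-api needs build-web` turns two parallel jobs into a chain. Python's `graphlib.TopologicalSorter` can show which jobs become eligible to start together; the `preflight` and `deploy` job IDs are assumptions, and only the build-api to build-web edge comes from the commit.

```python
from graphlib import TopologicalSorter

# Hypothetical job graph: each key lists the jobs it `needs`.
# Only the build-api -> build-web edge is taken from the commit.
JOBS = {
    "preflight": set(),
    "build-web": {"preflight"},
    "build-api": {"preflight", "build-web"},
    "deploy": {"build-api", "build-web"},
}


def execution_batches(graph: dict[str, set[str]]) -> list[list[str]]:
    """Group jobs into batches whose members may start concurrently."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

With the edge, every batch holds a single job; dropping it puts build-api and build-web in the same batch. The model covers only when jobs may start, not what the Runner does internally during "Set up job".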
OG T
6ddaf75260
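fix(runner): v5 - job-level mutex ensures strictly sequential execution
Root cause confirmed:
- Even with needs dependencies, jobs can still run the "Set up job" phase in parallel
- All jobs share the same Runner, so parallel writes to _diag/pages cause conflicts
Permanent solution:
- Add concurrency.group: runner-awoooi-cd-mutex to every job
- cancel-in-progress: false (wait instead of cancelling)
- Ensure only one job runs on the Runner at any time
Impact:
- CD becomes slower (jobs run strictly in sequence)
- But stability is guaranteed (no more file conflicts)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>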
2026-03-29 02:12:38 +08:00
OG T
07114f9181
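fix(runner): v4 - enable cancel-in-progress to prevent parallel conflicts
Root cause confirmed:
- The _diag/pages conflict occurs during the "Set up job" phase
- This happens before any custom step runs
- It is an internal Runner bug that workflow-level cleanup cannot fix
Permanent solution:
- cancel-in-progress: true (ensures only one workflow runs at a time)
- Stop trying to clean RUNNER_TEMP (it breaks other jobs)
- Keep the _diag/pages cleanup as a supplementary measure
Update ops/runner/README.md:
- Full root cause analysis
- Explanation of the v3 final solution
- Warning: do not clean RUNNER_TEMP
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>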
2026-03-29 02:10:17 +08:00
OG T
08fb6c59c8
fix(runner): v3 - clean only _diag/pages, leave RUNNER_TEMP alone
Root cause analysis:
- RUNNER_TEMP is shared by all jobs on the same Runner
- Cleaning RUNNER_TEMP deletes other jobs' _runner_file_commands
- This causes "Missing file at path: _runner_file_commands/set_output_xxx"
Fix:
- Remove all RUNNER_TEMP cleanup logic
- Clean only _diag/pages (the only directory that needs cleaning)
- Simplify the cleanup script and remove unnecessary complexity
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 02:08:50 +08:00
OG T
02c38c3a9b
fix(runner): keep _runner_file_commands to avoid checkout failures
Problem: the cleanup script deleted $RUNNER_TEMP/*, including _runner_file_commands
Result: "Missing file at path: _runner_file_commands/set_output_xxx"
Fix:
- Remove rm -rf $RUNNER_TEMP/* (it deletes critical files)
- Pre-flight: use find to exclude _runner_file_commands
- Other jobs: clean only _diag/pages
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 02:07:05 +08:00
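The Pre-flight variant keeps `_runner_file_commands` while clearing everything else at the top level of `$RUNNER_TEMP`. A Python sketch of that filter, assuming a top-level-only cleanup (the exact find expression is not shown in the log); `clean_runner_temp` is a hypothetical name. As the later v3 commit explains, RUNNER_TEMP is shared by every job on the Runner, so even this filtered cleanup can delete files a concurrently running job still needs.

```python
import shutil
from pathlib import Path

# Directory the runner relies on for set_output and similar file commands.
KEEP = {"_runner_file_commands"}


def clean_runner_temp(root: Path, keep: set[str] = KEEP) -> list[str]:
    """Delete top-level entries under root except those named in keep.

    Hypothetical helper, roughly the Python form of a find command that
    excludes _runner_file_commands. Returns the names it removed.
    """
    removed = []
    for entry in sorted(root.iterdir()):
        if entry.name in keep:
            continue
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed.append(entry.name)
    return removed
```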
OG T
183776a34f
fix(runner): permanently fix the _diag/pages file conflict
Problem: when the Runner executes jobs in parallel, "file already exists" errors make CD fail
Solution:
1. CD workflow: delete the entire _diag/pages directory and recreate it (not rm -rf /*)
2. systemd timer: automatically clean up stale files every 5 minutes
3. flock locking: prevents cleanup processes from racing each other
New files:
- ops/runner/cleanup-runner-diag.sh - cleanup script
- ops/runner/runner-diag-cleanup.service - systemd service
- ops/runner/runner-diag-cleanup.timer - timer
- ops/runner/deploy-runner-cleanup.sh - deployment script
- ops/runner/README.md - documentation
Deployment commands:
ssh wooo@192.168.0.110
bash awoooi/ops/runner/deploy-runner-cleanup.sh
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 02:04:35 +08:00
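Items 2 and 3 combine into a simple pattern: a periodic job deletes stale files, and a non-blocking `flock` makes a second cleaner back off instead of racing the first. A Python sketch using `fcntl.flock` (Linux/macOS only); the real cleanup-runner-diag.sh is a shell script, and the `cleanup_stale` name, the 300-second age threshold, and the lock-file location are assumptions.

```python
import fcntl
import time
from pathlib import Path


def cleanup_stale(pages_dir: Path, lock_path: Path, max_age_s: float = 300.0) -> list[str]:
    """Remove files in pages_dir older than max_age_s (hypothetical threshold).

    Returns the removed file names, or [] if another cleaner holds the lock.
    """
    with open(lock_path, "w") as lock:
        try:
            # Non-blocking exclusive lock: a concurrent cleaner backs off.
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return []
        cutoff = time.time() - max_age_s
        removed = []
        for entry in sorted(pages_dir.glob("*")):
            if entry.is_file() and entry.stat().st_mtime < cutoff:
                entry.unlink()
                removed.append(entry.name)
        return removed  # lock is released when the file is closed
```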
OG T
725392b578
fix(k8s): NetworkPolicy bypasses kustomize commonLabels
Problem: kustomize commonLabels are also added to NetworkPolicy egress[].to[].podSelector,
so the DNS rule required CoreDNS pods to carry system:awoooi + environment:prod,
but CoreDNS only has k8s-app:kube-dns, which broke DNS resolution
Fix:
- kustomization.yaml: remove 02-network-policy.yaml
- cd.yaml: add an "Apply NetworkPolicy" step that applies it separately
2026-03-29 ogt: root cause fix
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 01:27:29 +08:00
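The failure is a `matchLabels` subset check: every label in the selector must be present on the pod with the same value. A Python sketch of that rule with the labels from this commit; it ignores the namespaceSelector that a real cross-namespace DNS rule also needs, and `selector_matches` is a hypothetical name.

```python
def selector_matches(match_labels: dict[str, str], pod_labels: dict[str, str]) -> bool:
    """matchLabels semantics: every selector key/value must appear on the pod."""
    return all(pod_labels.get(key) == value for key, value in match_labels.items())


COREDNS_POD = {"k8s-app": "kube-dns"}                      # labels CoreDNS actually has
DNS_RULE = {"k8s-app": "kube-dns"}                         # selector as written
COMMON_LABELS = {"system": "awoooi", "environment": "prod"}  # injected by commonLabels
```

The hand-written selector matches CoreDNS, while the selector after commonLabels injection (`{**DNS_RULE, **COMMON_LABELS}`) does not, which is why DNS egress was blocked.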
OG T
2c968305c8
fix(cd): increase the build timeout to 20 minutes
API/Web builds were timing out and failing CD, so the timeouts are increased:
- Build API: 10m → 20m
- Build Web: 15m → 20m
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 00:23:44 +08:00
OG T
b77e151387
feat(ai): ADR-036 NVIDIA Nemotron Tool Calling integration
Phase 20 - raise Tool Calling accuracy from 50% to 83.3%
Added:
- src/models/nvidia.py: Pydantic schema
- src/services/nvidia_provider.py: NvidiaProvider class
- tests/test_nvidia_provider.py: 15 unit tests (all passing)
Changed:
- ai_router.py: AIProvider.NVIDIA + route_tool_calling()
- ai_rate_limiter.py: NVIDIA limits (5 RPM, 100/day)
- models.json: NVIDIA configuration
- cd.yaml: inject NVIDIA_API_KEY via Secrets
Routing strategy:
- Tool Calling: Nemotron → Gemini → Claude
- General chat: Ollama → Gemini → Claude (unchanged)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 00:00:08 +08:00
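The routing change is a fallback chain gated by per-provider limits. A minimal Python sketch under stated assumptions: `RateLimiter` and `route` are hypothetical names, not the ai_router.py or ai_rate_limiter.py APIs; the daily cap resets 24 hours after the window starts rather than at midnight; and only rate limiting triggers a fallback here (provider errors and timeouts are not modeled).

```python
import time
from collections import deque
from typing import Callable, Optional


class RateLimiter:
    """Sliding 60 s window plus a daily cap, e.g. 5 RPM and 100/day for NVIDIA."""

    def __init__(self, rpm: int, per_day: int, clock: Callable[[], float] = time.monotonic):
        self.rpm, self.per_day, self.clock = rpm, per_day, clock
        self.recent = deque()          # timestamps of calls in the last 60 s
        self.day_start = clock()
        self.day_count = 0

    def try_acquire(self) -> bool:
        now = self.clock()
        if now - self.day_start >= 86400:   # rolling 24 h window (assumption)
            self.day_start, self.day_count = now, 0
        while self.recent and now - self.recent[0] >= 60:
            self.recent.popleft()
        if len(self.recent) >= self.rpm or self.day_count >= self.per_day:
            return False
        self.recent.append(now)
        self.day_count += 1
        return True


# Provider order from the commit message; keys are hypothetical.
CHAINS = {
    "tool_calling": ["nvidia", "gemini", "claude"],
    "chat": ["ollama", "gemini", "claude"],
}


def route(task: str, limiters: dict[str, RateLimiter]) -> Optional[str]:
    """Return the first provider in the chain whose limiter admits the call."""
    for provider in CHAINS[task]:
        limiter = limiters.get(provider)
        if limiter is None or limiter.try_acquire():
            return provider
    return None
```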
OG T
6a38c0c968
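fix(cd): ADR-035 three-layer protection for automatic Telegram Secrets injection
🔴 Incident root cause: K8s Secrets were never injected, so Telegram alerts were broken for a long time
- kustomization.yaml said "handled by CI/CD", but CD never actually did it
🛡️ Three-layer protection:
- Layer 1: Pre-flight check that the GitHub Secrets exist
- Layer 2: inject them automatically with kubectl patch secret during Deploy
- Layer 3: Post-Deploy E2E test verifies alerting
📄 Documentation updates:
- ADR-035: docs/adr/ADR-035-telegram-alert-chain-enforcement.md
- DevOps Skill v1.9: add a mandatory rule for Secrets injection
- CLAUDE.md: add an alert-chain section
- LOGBOOK: incident record
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>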
2026-03-28 21:47:49 +08:00
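Layers 1 and 2 can be sketched in a few lines of Python. `OPENCLAW_TG_BOT_TOKEN` comes from this log; `TELEGRAM_CHAT_ID`, `missing_secrets`, and `secret_patch` are hypothetical. One detail that is easy to get wrong: values under a Secret's `data` field must be base64-encoded (`stringData` accepts plain strings).

```python
import base64
import json
from typing import Mapping

# OPENCLAW_TG_BOT_TOKEN appears in this log; TELEGRAM_CHAT_ID is hypothetical.
REQUIRED = ("OPENCLAW_TG_BOT_TOKEN", "TELEGRAM_CHAT_ID")


def missing_secrets(env: Mapping[str, str], required: tuple[str, ...] = REQUIRED) -> list[str]:
    """Layer 1: names that are absent or empty; the pre-flight job fails if any."""
    return [name for name in required if not env.get(name)]


def secret_patch(env: Mapping[str, str], required: tuple[str, ...] = REQUIRED) -> str:
    """Layer 2: JSON body for kubectl patch secret; .data values are base64."""
    data = {name: base64.b64encode(env[name].encode()).decode() for name in required}
    return json.dumps({"data": data})
```

In a workflow step the JSON would be passed as `kubectl patch secret <secret-name> -n awoooi-prod -p "$BODY"`, where `<secret-name>` is a placeholder.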
OG T
9fa996c9fe
fix(cicd): correct the OTEL endpoint configuration 192.168.0.121→188
Problem: the CI/CD workflows pointed to the wrong OTEL endpoint
- ci.yaml: 121:4318 → 188:24318
- cd.yaml: 121:4318 → 188:24318
SigNoz actually runs on 192.168.0.188 (AI+Web hub)
Updated:
- Skill 04 v1.8: add observability endpoint conventions
- LOGBOOK: record the configuration fix
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-28 18:47:23 +08:00
OG T
17ee8838be
revert: restore Telegram + CD to a working state
Restore files to their d071019 versions:
- decision_manager.py: remove the Redis dedup logic
- telegram_gateway.py: restore the INC- prefix logic
- cd.yaml: remove the selector-immutability handling and Token injection
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 22:10:52 +08:00
OG T
99809f4a33
fix(cd): inject the Telegram token into the K8s Secret
Problem: OPENCLAW_TG_BOT_TOKEN for the AWOOOI API was empty, so Telegram could not send messages
Fix: inject the token from GitHub Secrets during CD deployment
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 21:41:31 +08:00
OG T
6421af05f9
fix(cd): handle K8s selector immutability
Problem: a change to the kustomize labels configuration caused a selector mismatch
Fix: when a "field is immutable" error is detected, automatically delete and recreate the Deployment
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 19:53:51 +08:00
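A Python sketch of the delete-and-recreate fallback, with the kubectl runner injected so the logic can be exercised without a cluster. The overlay path `k8s/awoooi-prod` and namespace `awoooi-prod` appear elsewhere in this log; the Deployment name and the helper names are hypothetical. Deleting the Deployment takes its Pods down until the re-apply finishes, and this handling was later removed in the revert 17ee8838be.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], subprocess.CompletedProcess]


def run(cmd: Sequence[str]) -> subprocess.CompletedProcess:
    return subprocess.run(list(cmd), capture_output=True, text=True)


def apply_with_recreate(overlay: str, deployment: str, namespace: str,
                        runner: Runner = run) -> list[list[str]]:
    """kubectl apply; on 'field is immutable', delete the Deployment and apply again.

    Hypothetical helper. Returns the commands it issued.
    """
    issued: list[list[str]] = []

    def call(cmd: list[str]) -> subprocess.CompletedProcess:
        issued.append(cmd)
        return runner(cmd)

    result = call(["kubectl", "apply", "-k", overlay])
    if result.returncode != 0 and "field is immutable" in (result.stderr or ""):
        call(["kubectl", "delete", "deployment", deployment, "-n", namespace])
        result = call(["kubectl", "apply", "-k", overlay])
    if result.returncode != 0:
        raise RuntimeError(result.stderr)
    return issued
```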
OG T
0a9d94d82b
feat(k8s): CoreDNS GitOps architecture (ADR-026)
Problem: DNS configuration was not under version control, and manual changes were easily lost
Architecture:
- k8s/k3s-system/coredns-custom.yaml: HelmChartConfig
- CD workflow: k3s-system path detection + automatic apply
- ADR-026: CoreDNS GitOps governance architecture
DNS upstreams:
- Use 8.8.8.8 + 1.1.1.1
- Do not use /etc/resolv.conf (systemd-resolved)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 18:43:28 +08:00
OG T
31f554962e
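fix(ci): switch to cancel-in-progress: false to avoid Runner conflicts
When a run is cancelled, the Runner does not clean up _diag/pages, which causes file conflicts in the next run
Queue runs instead of cancelling them
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>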
2026-03-26 01:08:07 +08:00
OG T
ac294c1e3c
fix(ci): clean _diag/pages to avoid log file conflicts
_diag/pages/*.log files conflict when the Runner executes jobs in parallel
Add a step that cleans this directory
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 01:07:07 +08:00
OG T
8ee2437a7f
fix(ci): Runner temp directory cleanup - permanent fix
- Clean $RUNNER_TEMP/* before each job starts
- Add an hourly crontab cleanup
- Add the ~/bin/runner-cleanup.sh script
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 01:05:49 +08:00
OG T
716b94f60a
feat(api): Phase 16 R4.2 extract ApprovalExecutionService
Strangler Fig Pattern: extract the execution orchestration logic from approvals.py
Added:
- src/services/approval_execution.py (271 lines)
- ApprovalExecutionService class
- Integrates OperationParser + Executor + Timeline + Notifications
Size reduction:
- approvals.py: 1097 → 787 lines (-310 lines)
- R4 total: removed 310 lines of inline business logic
CI/CD fixes:
- Remove the dangerous rm -f ~/actions-runner-* command
- Use checkout clean: true plus in-workspace cleanup instead
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-25 22:04:15 +08:00
OG T
39eca4535b
fix(ci): clean Runner diag logs to avoid "file already exists" conflicts
Add cleanup steps to the Pre-flight Check:
- rm -f ~/actions-runner-awoooi/_diag/pages/*.log
- rm -f ~/actions-runner-awoooi-2/_diag/pages/*.log
Applied to both the CI and CD workflows
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-25 21:49:17 +08:00
OG T
0e22680547
fix(cd): clean the worktree directory to avoid submodule conflicts
Add an rm -rf .claude/worktrees cleanup step to the Deploy job
Fixes the "no submodule mapping found" error
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-25 21:46:51 +08:00
OG T
bfda353270
fix(ci): clean .claude/worktrees to prevent submodule errors
Problem: .claude/worktrees on the Runner is mistaken for a submodule
Solution: clean the worktrees directory before checkout
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-25 21:24:08 +08:00
OG T
708ea4686e
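fix(cd): fix ImagePullBackOff when a build is skipped
Problem: when the Web/API build was skipped, Deploy still updated the image tag to a version that does not exist
Solution: update images conditionally based on each build job's result
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>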
2026-03-25 16:02:44 +08:00
OG T
2bb76433f1
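feat(cd): improve the deployment notification format (user-friendly)
- Show the version description (first 50 characters of the commit message)
- Show the deployment time (Asia/Taipei time zone)
- Show the author
- Show the short SHA
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>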
2026-03-24 23:36:08 +08:00
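A Python sketch of the notification body. Asia/Taipei currently observes UTC+8 with no daylight saving time, so a fixed offset stands in for a tz database lookup; `format_notification`, the field labels, using only the first line of the commit message, and the 7-character short SHA are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+8 offset labelled Asia/Taipei (no DST), avoiding system tz data.
TAIPEI = timezone(timedelta(hours=8), "Asia/Taipei")


def format_notification(message: str, author: str, sha: str, deployed_at: datetime) -> str:
    """Hypothetical formatter for the deploy notification text."""
    subject = message.splitlines()[0] if message else ""
    local = deployed_at.astimezone(TAIPEI)
    return "\n".join([
        f"Version: {subject[:50]}",                          # first 50 characters
        f"Deployed: {local:%Y-%m-%d %H:%M:%S} (Asia/Taipei)",
        f"Author: {author}",
        f"SHA: {sha[:7]}",                                   # short SHA (assumed length)
    ])
```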
OG T
2337a03dfa
fix(cd): Use Python httpx for health check instead of curl
- Container uses python:3.11-slim without curl
- httpx is already installed as API dependency
- Fixes: "curl: executable file not found in $PATH"
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-24 18:24:18 +08:00
OG T
ab240c62ca
fix(cd): Improve health check with container name and fallback
- Add -c api to specify container name
- Increase sleep to 15s for pod startup
- Add fallback message to prevent workflow failure
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-24 17:44:05 +08:00
OG T
b20987e7b6
feat(sentry): Implement Sentry Tunnel to avoid local network permission dialog
- Add /api/sentry-tunnel API Route (Next.js)
- Update sentry.client.config.ts with tunnel option
- Re-enable NEXT_PUBLIC_SENTRY_DSN in CI/CD workflows
Resolves: #45 Sentry Tunnel
See: feedback_sentry_local_network.md
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-24 16:16:34 +08:00
OG T
cd7d63eeb1
feat(cicd): Add OTEL tracing to SigNoz for CI/CD monitoring
- CI: awoooi-ci service with sha + ci environment
- CD: awoooi-cd service with sha + production environment
- Exports to SigNoz at 192.168.0.121:4318
Approved: 2026-03-24 commander's directive
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-24 16:03:37 +08:00
OG T
bf702ffd10
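fix(sentry): temporarily disable the frontend Sentry DSN (local network permission issue)
Problem:
- The Sentry DSN uses the internal IP 192.168.0.110:9000
- When the browser tries to send errors, it triggers a "local network access" permission dialog
- The experience is very poor in incognito mode
Temporary workaround:
- Disable the NEXT_PUBLIC_SENTRY_DSN environment variable
- The frontend Sentry SDK will not initialize
- Backend Sentry keeps working normally
TODO:
- Implement a Sentry Tunnel (forwarding via a Next.js API Route)
- Or configure an Nginx reverse proxy
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>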
2026-03-24 15:55:25 +08:00
OG T
a280d71684
perf(ci/cd): v2.0 fully adopts the AIOPS best practices
Optimizations:
- Pre-flight Check (10s fail-fast)
- Runner labels [self-hosted, harbor, k8s]
- dorny/paths-filter for precise path detection
- Parallel API + Web builds
- timeout-minutes to prevent hung jobs
- Telegram + OpenClaw notifications
- force_deploy option to force a rebuild
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-24 15:45:04 +08:00
OG T
e25d7bd13f
feat(sentry): add Sentry DSN to CI/CD build process
- Add NEXT_PUBLIC_SENTRY_DSN to CI/CD workflows (build-time injection)
- Add SENTRY_DSN build arg to web Dockerfile
- Sentry Self-Hosted deployed on 192.168.0.110:9000
- GeoIP database configured (MaxMind GeoLite2-City 61MB)
- awoooi-web project: http://da02...@192.168.0.110:9000/2
- awoooi-api project: http://8c4a...@192.168.0.110:9000/3
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-24 15:33:36 +08:00
OG T
7a76f3e628
fix(cd): Add NEXT_PUBLIC_API_URL build-arg for Web build
Root cause: the frontend was compiled with the default localhost:8000
instead of the production URL https://awoooi.wooo.work.
This caused all API calls to fail in production because the
browser tried to call localhost:8000, which does not exist.
Next.js NEXT_PUBLIC_* variables are baked in at BUILD TIME,
not runtime, so they must be passed via --build-arg.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-24 14:36:46 +08:00
OG T
774290d333
fix(cd): Use kubectl for health check instead of external DNS
Problem: Self-hosted runner (192.168.0.110) cannot resolve
api.awoooi.wooo.work, causing health check to fail even though
deployments succeeded.
Solution:
- Use kubectl get pods to verify Pod is Running
- Use kubectl exec to test internal health endpoint (localhost:8000)
- More reliable than external DNS dependency
This follows mainstream K8s deployment practices.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-24 14:23:02 +08:00
OG T
515339f2a5
perf(cd): Optimize CD workflow based on wooo-aiops patterns
Changes:
- Add change detection (only build what changed)
- Add skip_api/skip_web manual inputs for selective builds
- Use native Docker BuildKit (remove buildx-action overhead)
- Add local Next.js cache (/home/wooo/build-cache/awoooi/)
- Split build-images into build-api and build-web jobs
Reference: wooo-aiops ci.yml and fast-deploy-uat.yml
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-24 14:13:56 +08:00
OG T
580c38de94
fix(cd): Fix kustomize image replacement with full image names
The kustomize edit set image command requires the OLD_IMAGE to match
exactly what's in the deployment YAML files, including the tag.
Changes:
- Use full image name with :IMAGE_TAG_PLACEHOLDER suffix
- Update kustomization.yaml to match deployment YAML format
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-24 14:05:31 +08:00
OG T
181d62a29e
fix(cd): add a kubeconfig verification step + fix PATH
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-24 13:15:17 +08:00
OG T
fb62aa06f0
fix(cd): install kubectl on the runner
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-24 12:48:59 +08:00
OG T
bff031fa8f
fix(cd): fix the kustomize install path (avoid sudo)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-24 12:31:26 +08:00
OG T
6bb1ab028d
fix(cd): fix namespace awoooi → awoooi-prod
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-24 12:14:29 +08:00
OG T
f4a6595839
fix(cd): install kustomize on the runner
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-24 12:08:31 +08:00
OG T
118a9aa329
fix(cd): fix the Kustomize path k8s/overlays/prod → k8s/awoooi-prod
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-24 11:53:21 +08:00