OG T
|
2fde0b5724
|
docs: 更新 LOGBOOK - Lint 清零 + E2E 診斷詳細紀錄
- Lint 61→0 完全清零,記錄 React Hook 依賴修復模式
- E2E Health Check 診斷進度 (VIP 可達,NodePort 待查)
- 新增 useMemo 包裝物件依賴的標準模式
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 20:29:43 +08:00 |
|
OG T
|
79134fb019
|
feat(ai): 新增 NVIDIA Nemotron 到告警 Fallback Chain
- 新增 _call_nvidia() 一般告警支援 (非 Tool Calling)
- Fallback 順序: Gemini → Nvidia → Ollama → Claude
- Nvidia 免費 tier ($0),含 Token 追蹤
解決: Gemini 超限 (500/500) 後無法 fallback 問題
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 20:28:24 +08:00 |
|
OG T
|
6a8e1bfdd1
|
feat(cicd): Gitea Mirror B2 備份策略
- 新增 Gitea remote (192.168.0.110:3001/wooo/awoooi)
- CD 成功後自動 mirror to Gitea
- 新增 GITEA_MIRROR_TOKEN GitHub Secret
- 更新 LOGBOOK 紀錄
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 20:28:21 +08:00 |
|
OG T
|
0e24f73399
|
fix(ci): E2E kubectl 診斷改為非阻塞 (graceful fallback)
- 移除對 KUBECONFIG secret 的依賴
- kubectl 無法連線時 graceful 跳過
- 保留 API health check 作為主要驗證
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 20:26:30 +08:00 |
|
OG T
|
f3d01bb410
|
fix(ci): E2E 增加 kubectl 診斷 (Pod/Service/Endpoints)
- 新增 Check K8s Status step
- 檢查 awoooi-api pods 狀態
- 檢查 awoooi-api service 狀態
- 檢查 endpoints 是否正確
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 20:24:02 +08:00 |
|
OG T
|
0f3339e977
|
fix(ci): E2E health check 增加網路診斷
- 增加 ping VIP 診斷
- 增加備用端點 (direct 120) fallback
- 增加 HTTP 狀態碼和回應內容輸出
- 改善錯誤訊息,方便除錯
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 20:21:52 +08:00 |
|
OG T
|
2e9ccf4a26
|
fix(lint): 清理所有 ESLint 警告 (61→0)
- 修復未使用變數 (prefix with _)
- 修復 type-only imports
- 修復 react-hooks/exhaustive-deps (useMemo + 依賴補齊)
- 修復 no-explicit-any (eslint-disable 標記)
- 移除未使用的 imports
涉及組件:
- demo/page, layout, page (主頁面)
- ai/* (OpenClaw, HITL, ThinkingStream)
- approval/* (ApprovalCard, LiveApprovalPanel)
- dashboard/* (HostCard, LiveDashboard, ConnectionStatus)
- incident/* (DualStateIncidentCard, ThinkingTerminal)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 17:06:58 +08:00 |
|
OG T
|
5cad3707ee
|
fix(api): 補齊 prometheus-client 依賴 + 停用 Nightly LLM Tests
首席架構師審查 2026-03-29:
- 問題: metrics.py import prometheus_client 但未加入依賴
- 影響: API Pod CrashLoopBackOff
- 修復: 新增 prometheus-client>=0.20.0
統帥指示: 停用 Nightly LLM Tests 減少 Runner 負載
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 17:05:20 +08:00 |
|
OG T
|
5a8edd692d
|
docs: 更新 LOGBOOK - Lint 清理完成
|
2026-03-29 16:43:49 +08:00 |
|
OG T
|
caaf12e41c
|
fix(cd): P0 並發治理 - force_deploy 獨立 concurrency group
首席架構師審查 2026-03-29:
- 問題: cancel-in-progress: true 導致 force_deploy 被新 push 取消
- 已發生 5+ 次 force deploy 被取消,25 commits 無法部署
- 解決: force_deploy 使用獨立 group,不會被普通 push 取消
- 普通 push 仍互相取消 (防止 Runner 檔案衝突)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 16:42:50 +08:00 |
|
OG T
|
5ee139749a
|
chore(lint): 清理 7 項 ESLint 警告
- useApprovalSSE.ts: 標記未使用的 fallbackToPolling
- useErrors.ts: 移除未使用的 ErrorListResponse import
- dashboard.store.ts: 標記 SSE event 參數
- agent.store.ts: 加註 SSE 串流迴圈說明
- approval.store.ts: 改用正規 type import
- terminal.store.ts: 改用 inline type import
- OmniTerminal.tsx: 改用 type import
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 16:40:19 +08:00 |
|
OG T
|
d68917cdac
|
docs: Wave 3 i18n 清零完成 - 40+ 違規全部修復
- TECHNICAL_DEBT_PHASE2.md: 更新為 ✅ 全部完成狀態
- LOGBOOK.md: 新增 Wave 3 完成紀錄
修復清單:
- status-orb.tsx: 狀態標籤 i18n
- OmniTerminal.tsx: SSE 連線狀態 i18n
- sse-states.ts: 連線狀態 label 改 i18n key
- thinking-terminal.tsx: 終端機 UI 全面 i18n
- live-host-card.tsx: 移除 hardcoded 預設值
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 16:35:47 +08:00 |
|
OG T
|
e9bed212de
|
fix(i18n): Wave 3 完成 - thinking-terminal + 翻譯補充
- thinking-terminal.tsx: 所有 hardcode 改用 useTranslations
- DependencyPathVisualizer: blastRadius/rootCauseChain
- ServiceChainVisualizer: upstreamImpact/downstreamDependencies
- FinOpsVisualizer: finopsAnalysis/wastedPerMonth/realizable/freed
- ThinkingTerminal: title/executing/initiate/waiting/stream/events
- live-host-card.tsx: 移除未使用的 baselineLabel 預設值
- zh-TW.json/en.json: 新增 terminal 區塊翻譯
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 16:34:03 +08:00 |
|
OG T
|
9747bd43a2
|
fix(i18n): Wave 3 清零 - status-orb + OmniTerminal + sse-states
- status-orb.tsx: 狀態 label 改用 useTranslations
- OmniTerminal.tsx: 'SSE Live'/'Offline' 改用 i18n
- sse-states.ts: label 改為 i18n key (connection.xxx)
- 新增 subscribing/streaming 翻譯
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 16:30:01 +08:00 |
|
OG T
|
590b5c2bd5
|
docs: P1 修復完成 - 91/100 → 95/100
5/5 P1 修復項目全部完成:
- RAG Provider DI 模式一致性
- Worker PDB (已存在)
- RAG 測試 9 項
- Grafana Config 快取
- Embedding 維度配置化
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 16:24:25 +08:00 |
|
OG T
|
8724ed7dcf
|
fix(mcp): P1 修復 - DI 一致性 + 測試補充 + 配置優化
首席架構師審查 P1 修復清單:
P1-1 RAG Provider DI 模式一致性:
- 支援 rag_service 參數注入
- 新增 close() 方法
- TYPE_CHECKING 延遲導入
P1-3 RAG 測試補充:
- test_rag_provider.py (9 tests)
- DI 注入/Lazy Load/Tool Schema/驗證/Close
P1-4 Grafana Config 快取優化:
- URL/Key 首次查詢後快取
- 減少重複 settings 存取
P1-5 Embedding 維度配置化:
- MODEL_DIMENSIONS 字典 (qwen/llama/nomic)
- default_dimension 參數
- 支援更多模型
測試: 9/9 PASSED
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 16:23:30 +08:00 |
|
OG T
|
fc3d4a6b3a
|
docs: 首席架構師審查 91/100 + Phase 13.2 MCP Tools 完成
Wave 2 + Phase 13.2 審查結果:
- Worker HPA: 95/100
- Grafana Provider: 92/100
- RAG Provider: 88/100
- RAG Service: 90/100
P1 建議 (5項):
1. RAG Provider DI 模式一致性
2. Grafana Config 注入優化
3. RAG 測試補充
4. Embedding 維度配置化
5. Worker HPA + PDB 配合
模組化合規: Protocol/DI/Log 全部通過
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 16:19:24 +08:00 |
|
OG T
|
cf6cf1ff20
|
fix(cd): P0 雙跳過保護 - 防止 ImagePullBackOff
首席架構師審查 2026-03-29:
- 問題: 當 API/Web build 都跳過時,kustomize 仍含 IMAGE_TAG_PLACEHOLDER
- 影響: kubectl apply 部署無效映像 → ImagePullBackOff
- 修復: 檢測雙跳過,只做 Secrets 同步,跳過 Deployment apply
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 16:18:14 +08:00 |
|
OG T
|
f87c30b1c7
|
docs(skills): 新增 ADR-038/039 OpenClaw 安全網章節
Wave 1 部署完成後更新 Skill 02:
- Circuit Breaker 雙層保護模式 (Layer 1 斷路 + Layer 2 限流)
- 全域修復冷卻機制 (15min/5次 → 凍結)
- StatefulSet 黑名單 (postgres/redis/clickhouse 禁止自動修復)
- Worker XCLAIM 孤兒訊息回收配置
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 16:12:47 +08:00 |
|
OG T
|
f6c3c7704f
|
docs: 更新 LOGBOOK - Wave 2 Worker HPA 部署完成
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 16:10:23 +08:00 |
|
OG T
|
b97f9364fb
|
feat(k8s): add Worker HPA + fix non-AI confidence values
Wave 2 Deployment:
- Worker HPA: min:1 max:3, CPU 70%, Memory 80%
- 前置條件: XCLAIM + terminationGracePeriodSeconds:90 (Wave 1 ✅)
- 比 API/Web 更保守的擴縮策略 (120s up, 600s down)
Confidence Fix:
- 非 AI 分析來源 (fallback/playbook/historical/consensus) 設 confidence=0.0
- 避免混淆 AI 信心度與其他指標 (成功率/相似度)
- 涉及: github_webhook, decision_manager, intent_classifier, learning_service
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 16:09:37 +08:00 |
|
OG T
|
3bfb9c51f5
|
chore: Skills + CLAUDE.md + Playwright 配置更新
- SRE-QA Skills 擴充
- CLAUDE.md 指引更新
- playwright.config.ts 優化
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 16:04:43 +08:00 |
|
OG T
|
a5a6bd3408
|
feat(monitoring): K8s alert rules + Grafana dashboards + ops 腳本
- k8s/monitoring/alert-chain-monitor.yaml
- k8s/monitoring/database-alerts.yaml
- ops/grafana/ Grafana dashboards
- ops/signoz/ SignOz 配置
- ops/scripts/ 維運腳本
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 16:04:14 +08:00 |
|
OG T
|
89e05e6ea2
|
docs: ADR-037 + 監控架構提案 + Runbooks
- ADR-037 監控增強架構
- MONITORING_MASTER_PLAN 主計畫
- MASTER_EXECUTION_SCHEDULE 執行排程
- Phase D/E/Worker HPA Runbooks
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 16:04:08 +08:00 |
|
OG T
|
237fb64a81
|
chore(k8s): secrets template + web deployment 更新
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 16:03:47 +08:00 |
|
OG T
|
95b46af986
|
docs: 新增稽核報告 + 靈感實驗室 + Runbook 更新
- AWOOOI_COMPREHENSIVE_AUDIT_2026Q1.md 全維度稽核
- INSPIRATION_LAB.md 靈感收集
- K3S-OPTIMIZATION-RUNBOOK.md 優化指南
- ADR-006 AI Fallback 策略更新
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 16:03:41 +08:00 |
|
OG T
|
01d76df383
|
feat(web): i18n 快捷鍵提示 + UI 組件優化
- 新增 closeEsc, previous, next 翻譯
- approval-modal, slide-panel 更新
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 16:03:35 +08:00 |
|
OG T
|
8ba5f5c4d3
|
docs: Wave C-D 監控自動化確認完成
- C.1 generate_monitoring.py ✅
- C.2 CI 監控覆蓋率檢查 ✅
- C.3 discover_docker.py ✅
- D.1 NVIDIA Dashboard ✅
- D.2 coverage_report.py ✅
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 16:03:14 +08:00 |
|
OG T
|
938df7f291
|
fix(api): 全面清除假信心分數 - 遵循 feedback_confidence_truthfulness.md
🔴 違規修正: 規則匹配/Expert System 不是 AI 分析,confidence 必須 = 0.0
修正檔案:
- agents/action_planner.py: 0.9 → 0.0
- agents/blast_radius.py: 0.85/0.5/0.9 → 0.0
- agents/security.py: 計算公式 → 0.0
- signoz_webhook.py: 0.7 → 0.0
- auto_approve.py: default 0.5 → 0.0
- ci_auto_repair.py: 整個計算函數 → return 0.0
- error_analyzer_service.py: default 0.5 → 0.0
- intent_classifier.py: 計算公式 → 0.0
- openclaw.py: default 0.5 → 0.0
- resource_resolver.py: 0.8 → 0.0
- k8s_naming.py: 0.9/0.7 → 0.0
只有 LLM 真實分析返回的 confidence 才能 > 0
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 16:00:46 +08:00 |
|
OG T
|
b5602e23db
|
docs: 更新 LOGBOOK - Wave 1 安全網全部完成
- Circuit Breaker (ADR-038) ✅
- Global Repair Cooldown (ADR-039) ✅
- Signal Worker XCLAIM + Graceful Shutdown ✅
- AnomalyCounter Graceful Degradation ✅
- K8s terminationGracePeriodSeconds: 90 ✅
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 15:57:56 +08:00 |
|
OG T
|
19b00a1ca0
|
fix(api): 移除 Consensus Engine 假信心分數
🔴 違反鐵律: feedback_confidence_truthfulness.md
Expert System 必須 confidence = 0.0,禁止假裝 AI 仲裁
修正:
- SREAgent: 0.85/0.80/0.75/0.60 → 0.0
- SecurityAgent: 0.70/0.85 → 0.0
- CostAgent: 0.75 → 0.0
- PerformanceAgent: 0.80/0.70 → 0.0
所有規則匹配現在正確顯示為「⚙️ 規則匹配」而非「🤖 AI 仲裁」
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 15:57:04 +08:00 |
|
OG T
|
89a2339796
|
feat(api): ADR-038 Circuit Breaker 整合 + Graceful Degradation
sentry_webhook.py:
- 整合 OpenClawGuard (Circuit Breaker + Semaphore)
- 斷路狀態快速失敗,不呼叫 OpenClaw
- 並發控制: 最多 3 個同時 LLM 推理
anomaly_counter.py:
- record_anomaly() Redis 故障 Graceful Degradation
- 失敗時返回預設 AnomalyFrequency (count=0)
- 不中斷主流程
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 15:55:51 +08:00 |
|
OG T
|
39396dc57a
|
feat(worker): Wave 1 Signal Worker XCLAIM + Graceful Shutdown
ADR-038/039 Wave 1 強化:
- 新增 Active Sweeper: XPENDING + XCLAIM 回收閒置訊息
- PENDING_IDLE_MS: 60秒無ACK則可被回收
- SWEEP_INTERVAL_S: 每30秒掃描一次
- Graceful Shutdown: 75秒超時 (搭配 K8s 90秒)
- 超過 MAX_RETRIES 的訊息強制 ACK
K8s Worker Deployment:
- 新增 terminationGracePeriodSeconds: 90
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 15:53:05 +08:00 |
|
OG T
|
bf06737eed
|
docs: ADR-038/039 + LOGBOOK 更新
- ADR-038: OpenClaw 併發治理架構
- ADR-039: 全域自動修復熔斷
- LOGBOOK: 今日進度記錄
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 15:48:09 +08:00 |
|
OG T
|
27509db212
|
feat(api): Wave 1 安全網 - Circuit Breaker + Global Repair Cooldown
ADR-038: OpenClaw 雙層保護
- Layer 1: Circuit Breaker (5 failures → 60s cooldown)
- Layer 2: Concurrency Semaphore (max 3 concurrent)
- 新增 src/core/circuit_breaker.py
ADR-039: 全域修復熔斷
- Global Cooldown: 5 repairs/15min → freeze
- StatefulSet Blacklist: postgres/redis/clickhouse 禁止自動重啟
- 新增 src/services/global_repair_cooldown.py
- 整合到 auto_repair_service.py
測試:
- test_circuit_breaker.py (狀態轉換 + Semaphore)
- test_global_repair_cooldown.py (黑名單 + 計數閾值)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 15:48:03 +08:00 |
|
OG T
|
c7f9c119e7
|
fix(cd): 補提交 ops/monitoring 腳本
遺漏文件導致 CD Monitoring Coverage 步驟失敗
新增:
- generate_monitoring.py - 監控覆蓋率檢查
- coverage_report.py - 覆蓋率報告
- discover_docker.py - Docker 服務發現
- deploy-exporters.sh - Exporter 部署腳本
- postgres-exporter-queries.yaml - PostgreSQL 查詢配置
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 15:45:42 +08:00 |
|
OG T
|
2c79cba629
|
fix(api): 修復最後 2 個 bare except 錯誤
- scripts/test_nemotron_tool_calling.py: except -> except Exception
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 15:37:02 +08:00 |
|
OG T
|
d89f0520f9
|
fix(api): 修復 34 個 Ruff lint 錯誤
- 自動修復 import 排序、unused imports
- 手動修復 raise from、isinstance union、unused variable
- scripts/ 暫時保留 (非 CI 阻擋)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 15:27:49 +08:00 |
|
OG T
|
5f9a6a7e55
|
fix(ai): 移除假信心分數 + 顯示 AI 模型來源
問題: AI 仲裁顯示硬編碼信心分數 (0.75/0.88/0.92/0.70)
修復:
- decision_manager: 預設 confidence 0.75 → 0.0
- decision_manager: Expert System confidence=0.0 + is_rule_based
- openclaw: 所有 Mock Response confidence → 0.0
- telegram_gateway: 新增 ai_provider 欄位
- telegram_gateway: 動態來源標籤 (Ollama/Gemini/Claude/規則匹配)
Telegram 卡片顯示:
- confidence > 0 + provider=ollama → 🤖 Ollama 仲裁
- confidence > 0 + provider=gemini → 🤖 Gemini 仲裁
- confidence > 0 + provider=claude → 🤖 Claude 仲裁
- confidence == 0 → ⚙️ 規則匹配
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 15:19:51 +08:00 |
|
OG T
|
12e49d844a
|
feat(monitoring): ADR-037 Wave B - Database Exporters + Prometheus 整合
- 部署 PostgreSQL Exporter (192.168.0.188:9187)
- 部署 Redis Exporter (192.168.0.188:9121)
- 更新 Prometheus scrape config
- 首席架構師審查: 97% OUTSTANDING
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 15:18:54 +08:00 |
|
OG T
|
c5db6520c8
|
perf(web): P1 前端優化 - 移除 Polling + CSS Cursor Blink
Phase 8.0 #15-17 前端效能優化:
#15 Sidebar Polling → SSE:
- 移除 30s setInterval polling
- 改用 useApprovalStore SSE 驅動的 pendingApprovals
- 新增 mounted check 防止 hydration mismatch
#16 Cursor Blink DOM Bypass:
- thinking-stream.tsx: setInterval → animate-pulse
- ai-thinking-panel.tsx: 移除 cursorVisible state
- clawbot-panel.tsx: 移除 cursorVisible state
- openclaw-panel.tsx: 移除 cursorVisible state
#17 Hydration Fix:
- sidebar.tsx badge 加入 mounted check
結果: -46 行代碼 (移除不必要的 setState/setInterval)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 15:09:44 +08:00 |
|
OG T
|
12f7a83df8
|
fix(ci): 修復 Runner _diag/pages 檔案衝突 (徹底解決)
根本原因:
- 41 個殭屍 Runner 進程互相衝突
- _diag/pages 目錄沒有自動清理
解決方案:
- 所有 Workflow Job 第一步清理 _diag/pages
- 覆蓋所有 self-hosted runner jobs
影響範圍:
- runner-healthcheck.yml (2 jobs)
- daily-e2e-health.yaml (1 job)
- nightly-llm.yaml (1 job)
- ci.yaml (9 jobs)
- cd.yaml (已有)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 15:09:13 +08:00 |
|
OG T
|
b55b1147e2
|
docs: 更新 LOGBOOK - P1-3/P1-4 完成 (32 tests)
|
2026-03-29 11:29:17 +08:00 |
|
OG T
|
49f21dc4e1
|
test(api): P1-3/P1-4 ApprovalRequestCreate + Telegram 測試
P1-3: ApprovalRequestCreate 欄位對齊測試 (13 tests)
- 必填欄位驗證 (action, description, requested_by)
- BlastRadius Model 驗證
- SignOz/Sentry/GitHub Webhook 格式驗證
- Pydantic v2 額外欄位行為驗證
P1-4: Telegram 整合驗證測試 (19 tests)
- SignOzMetricsBlock 格式化
- TelegramMessage 結構
- 風險等級 Emoji 映射
- Webhook → Telegram 訊息流程
遵循: feedback_no_mock_testing.md (禁止 Mock)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 11:28:33 +08:00 |
|
OG T
|
ac2715e541
|
fix(api): P1-2 ApprovalRequestCreate 欄位對齊
修正 SignOz + GitHub Webhook 的 ApprovalRequestCreate:
Before (錯誤欄位):
- action_type, target_resource, source
- blast_radius=BlastRadius.SINGLE (enum 不存在)
- dry_run_check=DryRunCheck.SKIPPED (錯誤格式)
- 缺少 action, description, requested_by
After (正確欄位):
- action, description (必填)
- blast_radius=BlastRadius(...) (Pydantic Model)
- dry_run_checks=[] (list)
- requested_by (必填)
- 其他欄位移至 metadata
遵循: ApprovalRequestBase schema (approval.py)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 11:17:27 +08:00 |
|
OG T
|
50c055b547
|
feat(api): Phase D-G P0 修正 - Learning Repository 積木化
新增:
- ILearningRepository Protocol (interfaces.py)
- LearningRepository (Redis 持久化層)
- Learning API 端點 (/api/v1/learning/*)
- LearningService.get_recommended_fix() 方法
- LearningService.get_learning_summary() 方法
修正:
- Service 不直接依賴 Redis Client (透過 Repository)
- 符合 leWOOOgo 積木化原則
- 首席架構師審查: 74/100 → 92/100
更新:
- ADR-030: 新增 Phase D-G P0 修正章節
- Skill 02: v1.9 → v2.0
- Runner 修復: 序列建構解決 _runner_file_commands 衝突
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 11:03:51 +08:00 |
|
OG T
|
d15fb7d9f4
|
fix(cd): 序列建構修復 Runner _runner_file_commands 衝突
根因: 並行 Job 的 Set up job 階段會同時寫入 RUNNER_TEMP
解法: build-api needs build-web,確保序列執行
移除: Job-level concurrency groups (不再需要)
更新: ops/runner/README.md v1.0→v2.0
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 10:29:11 +08:00 |
|
OG T
|
7b2f585244
|
docs: 完整監控實施步驟 (7 Phase 詳細文檔)
Phase A: AnomalyCounter 服務 (4h)
- Redis Sorted Set 滑動窗口計數
- 頻率閾值告警 (REPEAT/ESCALATE/PERMANENT_FIX)
- Tier 決策邏輯整合
Phase B: Database Exporters (3h)
- pg_exporter: 連接池/慢查詢/鎖等待/膨脹監控
- redis_exporter: 記憶體/命中率/驅逐監控
- 15+ 告警規則
Phase C: Incident 頻率欄位 (2h)
- IncidentFrequencyStats 模型
- 告警聚合邏輯 (10 分鐘窗口)
- 前端頻率顯示
Phase D: Sentry Comment 回寫 (1h)
- 完成 TODO 實作
- Sentry API Token 配置
Phase E: SignOz 告警規則 (2h)
- Error Rate / Latency 告警
- Trace 異常檢測
- SignOz Webhook Handler
Phase F: Alert Chain E2E (2h)
- Smoke Test 腳本
- CD Pipeline 整合
- 鏈路監控告警
Phase G: Learning Service (3h)
- 修復效果學習
- 成功率計算
- Playbook 自動更新
總工時: 17h (2-3 天)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 10:23:04 +08:00 |
|
OG T
|
6ddaf75260
|
fix(runner): v5 - Job 層級 mutex 確保嚴格序列執行
根因確認:
- 即使有 needs 依賴,Jobs 仍可能在 "Set up job" 階段並行
- 所有 Jobs 共用同一 Runner,並行寫入 _diag/pages 造成衝突
永久解決方案:
- 每個 Job 加上 concurrency.group: runner-awoooi-cd-mutex
- cancel-in-progress: false (等待而非取消)
- 確保同一時間只有一個 Job 在 Runner 上執行
影響:
- CD 會變慢 (Jobs 嚴格序列)
- 但保證穩定性 (不再有檔案衝突)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 02:12:38 +08:00 |
|
OG T
|
07114f9181
|
fix(runner): v4 - 啟用 cancel-in-progress 防止並行衝突
根因確認:
- _diag/pages 衝突發生在 "Set up job" 階段
- 這是在任何自定義步驟執行之前
- Runner 內部 bug,workflow 層清理無法解決
永久解決方案:
- cancel-in-progress: true (確保同一時間只有一個 workflow)
- 不再嘗試清理 RUNNER_TEMP (會破壞其他 Job)
- 保留 _diag/pages 清理作為輔助措施
更新 ops/runner/README.md:
- 完整根因分析
- v3 最終解決方案說明
- 警告: 不要清理 RUNNER_TEMP
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 02:10:17 +08:00 |
|