OG T
|
89e05e6ea2
|
docs: ADR-037 + 監控架構提案 + Runbooks
- ADR-037 監控增強架構
- MONITORING_MASTER_PLAN 主計畫
- MASTER_EXECUTION_SCHEDULE 執行排程
- Phase D/E/Worker HPA Runbooks
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 16:04:08 +08:00 |
|
OG T
|
95b46af986
|
docs: 新增稽核報告 + 靈感實驗室 + Runbook 更新
- AWOOOI_COMPREHENSIVE_AUDIT_2026Q1.md 全維度稽核
- INSPIRATION_LAB.md 靈感收集
- K3S-OPTIMIZATION-RUNBOOK.md 優化指南
- ADR-006 AI Fallback 策略更新
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 16:03:41 +08:00 |
|
OG T
|
8ba5f5c4d3
|
docs: Wave C-D 監控自動化確認完成
- C.1 generate_monitoring.py ✅
- C.2 CI 監控覆蓋率檢查 ✅
- C.3 discover_docker.py ✅
- D.1 NVIDIA Dashboard ✅
- D.2 coverage_report.py ✅
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 16:03:14 +08:00 |
|
OG T
|
b5602e23db
|
docs: 更新 LOGBOOK - Wave 1 安全網全部完成
- Circuit Breaker (ADR-038) ✅
- Global Repair Cooldown (ADR-039) ✅
- Signal Worker XCLAIM + Graceful Shutdown ✅
- AnomalyCounter Graceful Degradation ✅
- K8s terminationGracePeriodSeconds: 90 ✅
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 15:57:56 +08:00 |
|
OG T
|
bf06737eed
|
docs: ADR-038/039 + LOGBOOK 更新
- ADR-038: OpenClaw 併發治理架構
- ADR-039: 全域自動修復熔斷
- LOGBOOK: 今日進度記錄
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 15:48:09 +08:00 |
|
OG T
|
5f9a6a7e55
|
fix(ai): 移除假信心分數 + 顯示 AI 模型來源
問題: AI 仲裁顯示硬編碼信心分數 (0.75/0.88/0.92/0.70)
修復:
- decision_manager: 預設 confidence 0.75 → 0.0
- decision_manager: Expert System confidence=0.0 + is_rule_based
- openclaw: 所有 Mock Response confidence → 0.0
- telegram_gateway: 新增 ai_provider 欄位
- telegram_gateway: 動態來源標籤 (Ollama/Gemini/Claude/規則匹配)
Telegram 卡片顯示:
- confidence > 0 + provider=ollama → 🤖 Ollama 仲裁
- confidence > 0 + provider=gemini → 🤖 Gemini 仲裁
- confidence > 0 + provider=claude → 🤖 Claude 仲裁
- confidence == 0 → ⚙️ 規則匹配
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 15:19:51 +08:00 |
|
OG T
|
12e49d844a
|
feat(monitoring): ADR-037 Wave B - Database Exporters + Prometheus 整合
- 部署 PostgreSQL Exporter (192.168.0.188:9187)
- 部署 Redis Exporter (192.168.0.188:9121)
- 更新 Prometheus scrape config
- 首席架構師審查: 97% OUTSTANDING
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 15:18:54 +08:00 |
|
OG T
|
b55b1147e2
|
docs: 更新 LOGBOOK - P1-3/P1-4 完成 (32 tests)
|
2026-03-29 11:29:17 +08:00 |
|
OG T
|
50c055b547
|
feat(api): Phase D-G P0 修正 - Learning Repository 積木化
新增:
- ILearningRepository Protocol (interfaces.py)
- LearningRepository (Redis 持久化層)
- Learning API 端點 (/api/v1/learning/*)
- LearningService.get_recommended_fix() 方法
- LearningService.get_learning_summary() 方法
修正:
- Service 不直接依賴 Redis Client (透過 Repository)
- 符合 leWOOOgo 積木化原則
- 首席架構師審查: 74/100 → 92/100
更新:
- ADR-030: 新增 Phase D-G P0 修正章節
- Skill 02: v1.9 → v2.0
- Runner 修復: 序列建構解決 _runner_file_commands 衝突
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 11:03:51 +08:00 |
|
OG T
|
7b2f585244
|
docs: 完整監控實施步驟 (7 Phase 詳細文檔)
Phase A: AnomalyCounter 服務 (4h)
- Redis Sorted Set 滑動窗口計數
- 頻率閾值告警 (REPEAT/ESCALATE/PERMANENT_FIX)
- Tier 決策邏輯整合
Phase B: Database Exporters (3h)
- pg_exporter: 連接池/慢查詢/鎖等待/膨脹監控
- redis_exporter: 記憶體/命中率/驅逐監控
- 15+ 告警規則
Phase C: Incident 頻率欄位 (2h)
- IncidentFrequencyStats 模型
- 告警聚合邏輯 (10 分鐘窗口)
- 前端頻率顯示
Phase D: Sentry Comment 回寫 (1h)
- 完成 TODO 實作
- Sentry API Token 配置
Phase E: SignOz 告警規則 (2h)
- Error Rate / Latency 告警
- Trace 異常檢測
- SignOz Webhook Handler
Phase F: Alert Chain E2E (2h)
- Smoke Test 腳本
- CD Pipeline 整合
- 鏈路監控告警
Phase G: Learning Service (3h)
- 修復效果學習
- 成功率計算
- Playbook 自動更新
總工時: 17h (2-3 天)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 10:23:04 +08:00 |
|
OG T
|
1e7c5134fe
|
docs: 新增異常頻率統計與根本修復章節 (統帥反饋)
- 異常頻率追蹤架構 (Redis 計數器 + 滑動窗口)
- 修復策略分級 (Tier 1-4: 重啟→緩解→根因→架構)
- AI 學習服務 (LearningService + Playbook 自動更新)
- Telegram 頻率告警格式 (重複次數 + 成功率統計)
- 實作清單 (P0: 22h, P1: 12h, P2: 8h)
🔴 關鍵觀點: 重啟只是治標,不是治本
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 02:04:10 +08:00 |
|
OG T
|
56ae7290e3
|
docs: 更新 LOGBOOK - 完整監控策略 + Telegram 按鈕修復
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 02:01:06 +08:00 |
|
OG T
|
40163a51b5
|
feat(monitoring): 完整監控策略與自動整合架構
新增:
1. MONITORING_COMPLETE_STRATEGY.md - 完整監控策略
- 5 主機 × 60+ 服務監控矩陣
- P0/P1/P2 告警規則清單
- AI 自動修復閉環流程
- 安全護欄配置
2. MONITORING_INTEGRATION_ARCHITECTURE.md - 自動整合架構
- 服務註冊表 (Single Source of Truth)
- CI/CD 自動驗證監控覆蓋率
- 新服務自動獲得監控
3. ops/monitoring/service-registry.yaml - 服務清單
- K8s 工作負載 (API/Web/Worker/ArgoCD)
- Docker 容器 (Ollama/OpenClaw/Redis/Postgres)
- 前端頁面 SLO
- API 端點 SLO
- 告警模板與自動修復動作
4. ops/monitoring/validate_coverage.py - 覆蓋率驗證
- CI 階段執行
- 檢測未監控服務
- 生成覆蓋率報告
設計原則:
- 監控即代碼 (Monitoring as Code)
- 新服務必須在 registry 註冊才能部署
- 自動發現機制防止遺漏
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 01:52:08 +08:00 |
|
OG T
|
94d6a0c720
|
docs(ai): 更新 ADR-036 和 LOGBOOK - P3 優化記錄
- ADR-036 v1.4: P3 優化完成 (95/100)
- LOGBOOK: Phase 20 P1+P2+P3 全部完成
- 測試: 34/34 PASSED
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 01:51:35 +08:00 |
|
OG T
|
179e659f14
|
chore: 清理 Playwright 產物 + kube-state-metrics 告警擴充
清理工作:
- .gitignore 新增 playwright-report/ 和 test-results/ 排除
- 保留 phase19/ 參考截圖目錄
kube-state-metrics 告警擴充 (P3):
- CronJobLastRunFailed: Job 執行失敗
- DaemonSetMissingPods: DaemonSet 缺少 Pod
- StatefulSetReplicasMismatch: StatefulSet 副本不足
- ContainerWaiting: ImagePullBackOff/CrashLoopBackOff 偵測
- PDBViolation: PDB 健康 Pod 數不足
- NodeUnschedulable: 節點標記為不可排程
新增:
- apps/api/scripts/test_nemotron_tool_calling.py (E2E 比較測試)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 01:28:35 +08:00 |
|
OG T
|
ee2bceefff
|
feat(monitoring): Phase 19.6 測試文檔 + P1-P3 改進 + 首席架構師審查
Phase 19.6 測試文檔收尾:
- E2E 測試擴充至 18 項 (Terminal/GenUI 驗證)
- 新增 PHASE19-VERIFICATION-CHECKLIST.md (完整驗證清單)
P1 驗證:
- ArgoCD Metrics NodePort 監控 (30883/30884)
- TLS 證書監控 (Blackbox Exporter 9115)
P2 改進:
- waitForTimeout → waitForLoadState('networkidle')
- 跨平台快捷鍵 (Meta+J / Control+J)
- SKIP_MULTISIG_TESTS 環境變數控制
- Prometheus GitOps 部署腳本
P3 改進:
- HPA maxReplicas 4 → 6 (API/Web)
首席架構師審查: 47/50 OUTSTANDING (94%)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 01:19:26 +08:00 |
|
OG T
|
b77e151387
|
feat(ai): ADR-036 NVIDIA Nemotron Tool Calling 整合
Phase 20 - 提升 Tool Calling 精準度 50% → 83.3%
新增:
- src/models/nvidia.py: Pydantic Schema
- src/services/nvidia_provider.py: NvidiaProvider 類別
- tests/test_nvidia_provider.py: 15 項單元測試 (全部通過)
修改:
- ai_router.py: AIProvider.NVIDIA + route_tool_calling()
- ai_rate_limiter.py: NVIDIA 限制 (5 RPM, 100/day)
- models.json: NVIDIA 配置
- cd.yaml: Secrets 注入 NVIDIA_API_KEY
路由策略:
- Tool Calling: Nemotron → Gemini → Claude
- 一般對話: Ollama → Gemini → Claude (不變)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-29 00:00:08 +08:00 |
|
OG T
|
a30f766eb1
|
feat(monitoring): 首席架構師完整審查 + 補充告警規則
## 首席架構師審查結果: 198/200 (99%) EXCEPTIONAL
### 審查範圍
- 架構設計: 50/50 ⭐
- 安全性: 49/50
- 模組化合規: 50/50 ⭐
- 監控告警: 49/50
- E2E 測試: 49/50
### 新增補充告警 (12 條)
- RedisDown, PostgreSQLDown, OllamaDown, OpenClawDown
- HarborDown, LangfuseDown
- HPAMaxedOut, HPAScalingDisabled
- WorkerUnavailable
- NodeHighCPU, NodeHighMemory, ContainerOOMKilled
### 檔案
- k8s/monitoring/k3s-alerts-supplemental.yaml
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-28 23:30:44 +08:00 |
|
OG T
|
f0572ae906
|
feat(k4.3): Pod Security Standards + Grafana Dashboard
K4.3 Pod Security Standards:
- awoooi-prod: baseline
- kube-state-metrics: baseline
- kured: privileged (hostPID required)
- descheduler: restricted
- velero: baseline
- argocd: baseline
Grafana Dashboard:
- K3s Cluster Overview (9 panels)
- Nodes, Pods, HPA, Velero, Alerts
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-28 23:16:54 +08:00 |
|
OG T
|
863fc5a426
|
docs: 新增監控告警完整流程文檔 (2026-03-29 ogt)
內容:
- 8 層架構圖 (ASCII)
- 工具/服務清單表格
- 配置/代碼檔案清單
- 完整資料流說明
- E2E 驗證機制 (ADR-025/035)
- 故障排查指南
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-28 22:25:14 +08:00 |
|
OG T
|
1a4be7b18a
|
feat(k-mon): K3s monitoring integration (Phase K-MON)
- Add Velero metrics NodePort service (30885)
- Add K3s infrastructure alert rules:
- VIP 6443 availability
- Node ICMP checks
- AWOOOI API/Web TCP checks
- SignOz/Sentry availability
- Add Velero backup alerts (failed/missing)
- Add ADR-034 for ArgoCD GitOps adoption
Deployed to:
- K3s: velero-metrics service
- 188: Prometheus + Alertmanager configs
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-28 21:57:57 +08:00 |
|
OG T
|
6a38c0c968
|
fix(cd): ADR-035 Telegram Secrets 自動注入三層防護
🔴 事故根因: K8s Secrets 未注入,Telegram 告警長時間失效
- kustomization.yaml 說「由 CI/CD 處理」但 CD 從未執行
🛡️ 三層防護機制:
- Layer 1: Pre-flight 檢查 GitHub Secrets 存在
- Layer 2: Deploy 時 kubectl patch secret 自動注入
- Layer 3: Post-Deploy E2E 測試告警驗證
📄 文件更新:
- ADR-035: docs/adr/ADR-035-telegram-alert-chain-enforcement.md
- DevOps Skill v1.9: 新增 Secrets 注入鐵律
- CLAUDE.md: 新增告警鏈路章節
- LOGBOOK: 事故記錄
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-28 21:47:49 +08:00 |
|
OG T
|
66fb56c691
|
feat(k8s): Phase K2 自動化維運完成
- K2.4 NPD: Node Problem Detector (DaemonSet)
- K2.3 VPA: 3 Vertical Pod Autoscaler (Off 模式)
- K2.1 ArgoCD: v3.3.6 @ :30443 (GitOps)
- K2.2 Sealed Secrets: v0.26.0 (加密 Secrets)
新增檔案:
- k8s/npd/node-problem-detector.yaml
- k8s/awoooi-prod/11-vpa.yaml
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-28 21:27:05 +08:00 |
|
OG T
|
d3e6b59b86
|
docs: K1 Velero 備份系統完成
- MinIO 部署 (192.168.0.188:9000/9001)
- Velero v1.13.0 部署到 K3s
- daily-awoooi-prod Schedule (每日 02:00)
- 測試備份成功 (153 items / 30 天保留)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-28 21:16:27 +08:00 |
|
OG T
|
eea6e3acc3
|
feat(k8s): 新增 Velero 備份系統 (K1.1)
Phase K1 災難恢復:
- MinIO 部署在 192.168.0.188:9000/9001
- Velero v1.13.0 完整安裝 manifests
- velero-backups bucket 已建立
- README 含部署與使用指南
部署方式:
ssh wooo@192.168.0.120
sudo kubectl apply -f k8s/velero/velero-install-full.yaml
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-28 20:53:02 +08:00 |
|
OG T
|
269c81bdbb
|
fix(k8s): OpenClaw 端口統一 8088→8089
- ConfigMap: OPENCLAW_URL 更新為 8089
- NetworkPolicy: 允許 8089 出站
- SERVICE-ENDPOINTS.md: 移除 legacy 8088 引用
2026-03-28 清理舊配置,統一使用正式端口
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-28 20:32:30 +08:00 |
|
OG T
|
e03d99b871
|
docs(runbook): K3s 優化 Runbook v1.2 - 標記完成狀態
Phase 完成狀態:
- K0 ✅ Swap/PDB/備份/清理 (首席架構師 9.0/10)
- K-NET ✅ VIP 192.168.0.125 + CI/CD 整合
- K-CLEAN ✅ 9 RS + 1 Job 清理
K-HA 📋 另案規劃 (需維護窗口)
更新:
- 版本號 1.1 → 1.2
- 目錄標記完成狀態
- 各 Phase 加入執行結果
- 附錄 A 實際執行時間線
- 問題統計 (清理前後對照)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-28 18:52:13 +08:00 |
|
OG T
|
9fa996c9fe
|
fix(cicd): 修正 OTEL 端點配置 192.168.0.121→188
問題: CI/CD workflows 指向錯誤的 OTEL 端點
- ci.yaml: 121:4318 → 188:24318
- cd.yaml: 121:4318 → 188:24318
SignOz 實際運行在 192.168.0.188 (AI+Web 中心)
更新:
- Skill 04 v1.8 加入可觀測性端點規範
- LOGBOOK 記錄配置修正
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-28 18:47:23 +08:00 |
|
OG T
|
d206460751
|
feat(security): Phase 20 CSRF 防護實作
Phase 19 首席架構師審查指出: 核鑰 UX 安全性缺 CSRF 防護
後端:
- 新增 src/core/csrf.py (Double Submit Cookie 模式)
- 新增 src/api/v1/csrf.py (GET /api/v1/csrf/token)
- 新增 src/models/csrf.py (CSRFTokenResponse)
- 修改 approvals.py sign/reject/bulk 端點加入 CSRFToken 驗證
前端:
- 新增 hooks/useCSRF.ts (React Hook)
- 修改 approval.store.ts 整合 CSRF Token 參數
安全特性:
- 256-bit Token (secrets.token_hex)
- 時序安全比較 (secrets.compare_digest)
- SameSite=Strict Cookie
- 1 小時 Token 有效期
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-28 18:31:58 +08:00 |
|
OG T
|
7b9b0c490b
|
feat(phase19): Omni-Terminal 100% 完成 + 首席架構師審查 47/50
## Phase 19 Omni-Terminal (Wave 0-6 全部完成)
### 核心功能
- SSE 狀態機 (7-State 設計,10/10 分)
- GenUI 動態渲染 (6 張卡片 + Zod Schema 驗證)
- 核鑰 UX (長按授權 + 風險分級)
- Terminal Telemetry (Sentry 整合)
### P0-P2 修復
- P0: Singleton → FastAPI Depends 依賴注入
- P1: Zod Schema 升級 (7 個驗證 Schema)
- P1: 錯誤分類碼聚合 (Sentry fingerprint)
- P2: Slow Query 監控 (5s 警告 / 10s 嚴重)
### 測試
- test_terminal_service.py: 54 項測試全通過
- 意圖分類: 42 個測試案例 (9 種 IntentType)
### 文檔
- ADR-031: SSE 架構實作紀錄
- ADR-032: GenUI 渲染實作紀錄
- Skills: v1.9 (後端 Terminal 章節)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-28 18:04:12 +08:00 |
|
OG T
|
3e5315aaf8
|
docs(k3s): 首席架構師審查完成 46/50 (92%)
K3s 優化工作審查完成:
- ADR-033: Phase K0 + K-NET 標記為已完成
- 09-pdb.yaml: Worker PDB 設計說明註釋
- DevOps Skill: 新增 keepalived 快速操作參考
審查結果:
- 架構合規性: 9/10
- Runbook 完整性: 10/10 ⭐
- 模組化合規: 9/10
- 風險控制: 9/10
- 文檔完整性: 9/10
P2 問題已修復,無 P0/P1 阻擋項
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-28 18:00:07 +08:00 |
|
OG T
|
efb80b403e
|
feat(k8s): Phase K0.5 Startup Probe + PDB + revisionHistoryLimit
K3s 生產級優化 Phase K0 變更:
- 新增 startupProbe 到 API/Web/Worker Deployment (60s 啟動時間)
- 新增 revisionHistoryLimit: 3 (減少孤立 ReplicaSet)
- 新增 09-pdb.yaml (PodDisruptionBudget 保護)
- 新增 K3S-OPTIMIZATION-RUNBOOK.md (執行手冊)
- 修正 selector 對齊現有 Deployment (app+environment+system)
首席架構師審查: 9.0/10 ✅
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-28 11:13:44 +08:00 |
|
OG T
|
e5ded3b3f2
|
feat(phase19): OmniTerminal + GenUI + Hybrid SSE 架構實作 (Wave 0-2)
Phase 19 OmniTerminal MVP 完成:
- Wave 0: Backend (Hybrid SSE POST→GET 架構)
- Wave 1: Frontend (OmniTerminal 狀態機 + GenUI Registry)
- Wave 2: UI 組件 (8 個 GenUI 動態卡片)
ADR 文檔:
- ADR-031: OmniTerminal SSE 架構
- ADR-032: GenUI 動態渲染框架
- ADR-033: K3s HA 架構設計
GenUI 組件:
- GenUIRenderer, K8sPodStatusCard, SentryErrorCard
- MetricsSummaryCard, IncidentTimelineCard
- TraceWaterfallCard, ApprovalCard, NuclearKeyButton
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-28 00:17:26 +08:00 |
|
OG T
|
54061fb8be
|
docs: 更新 LOGBOOK - Sentry 首席架構師審查完成
- Sentry 整合驗證通過
- K3s Master 確認 192.168.0.120
- Phase 10 全部完成
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-27 14:57:03 +08:00 |
|
OG T
|
a579710982
|
fix(k8s): 補齊 Sentry DSN 配置 (首席架構師審查)
- 03-secrets.example.yaml: 新增 SENTRY_DSN
- 04-configmap.yaml: 新增 Sentry 元數據
- LOGBOOK: 新增 CD Lint 修復記錄
Phase 10 Sentry 整合 - DSN 配置補齊
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-27 14:51:41 +08:00 |
|
OG T
|
43e8ead0d2
|
docs: 更新 LOGBOOK - P1 模組化違規已修復
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-27 10:08:15 +08:00 |
|
OG T
|
30bed33401
|
docs: 更新 LOGBOOK - P1 按鈕優化完成
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-27 09:51:20 +08:00 |
|
OG T
|
4ee5376bd1
|
docs: 告警機制優化計畫 + ADR-030 Phase 6 + Skill 03 v1.5
- LOGBOOK: 新增告警機制完整審查記錄
- ADR-030: 新增 Phase 6 非同步分析優化章節
- Skill 03: v1.5 Stream Key 統一 + Telegram 去重
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-27 09:42:53 +08:00 |
|
OG T
|
d3a0ed4253
|
docs(adr): ADR-030 智能自動修復系統完整設計
五階段實施計畫:
- Phase 1: 智能診斷基礎 ✅ 已完成
- Phase 2: 資料收集強化 (K8s Events + SignOz 深度整合)
- Phase 3: Playbook RAG (向量化 + 語意搜尋)
- Phase 4: 自動執行機制 (信任度 + 風險評估)
- Phase 5: 持續學習迴圈 (反饋 + 信任度調整)
架構相容性分析:
- 介面擴展點定義
- 資料庫 Schema 變更
- 風險評估與回滾計畫
預計時程: 10-15 週
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-26 21:48:41 +08:00 |
|
OG T
|
309a019cc3
|
docs: 記錄 Telegram 告警轟炸事故修復
更新:
- ADR-027: 新增緊急事故修復章節
- LOGBOOK: 記錄 2026-03-26 事故時間線
- Skill 02 v1.6: 新增 Telegram 去重機制章節
根因: Phase 6.5 修改 + INC- 前綴重複
修復: Redis 去重 (10 分鐘) + 前綴檢查
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-26 20:13:07 +08:00 |
|
OG T
|
fb03430469
|
feat(api): ADR-027 Phase 2 - 簽核/拒絕後自動同步 Incident 狀態
Router 整合點:
- POST /approvals/{id}/sign → on_approval_status_change("approved")
- POST /approvals/{id}/reject → on_approval_status_change("rejected")
- POST /approvals/bulk-approve → 批次同步
變更:
- 移除舊的 resolve_incident_after_approval() 調用
- 改用 IncidentApprovalService.on_approval_status_change()
- 同步失敗不阻斷主流程 (容錯設計)
ADR-027 進度: Phase 1-2 ✅ 完成
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-26 19:44:59 +08:00 |
|
OG T
|
dd42e6b75b
|
chore: services export + meetings 文檔格式化
- services/__init__.py: 導出 IncidentApprovalService (ADR-027)
- meetings docs: 格式化更新
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-26 19:10:48 +08:00 |
|
OG T
|
a9f8ad56c1
|
chore: 未提交變更整理 (API core + docs + scripts)
API 核心:
- constants.py: 系統常量定義
- unit_of_work.py: Unit of Work 模式
- incident_approval_service.py: Incident-Approval 同步服務
文檔更新:
- LOGBOOK.md: 進度更新
- AWOOOI_AGENTIC_WORKSPACE_ROADMAP.md: 路線圖
- 2026-03-26_llm_testing_evaluation.md: LLM 測試評估
- phase5_telemetry_architecture.md: 遙測架構
- SECRETS_REFERENCE.md: 密鑰參考
配置/腳本:
- Skill 02 v1.x: leWOOOgo 後端更新
- .dependency-cruiser.cjs: 依賴規則
- demo-multisig-flow.sh: 演示腳本
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-26 19:10:12 +08:00 |
|
OG T
|
2f5986df5c
|
docs: ADR 整理與新增 (021-029)
ADR 編號修正:
- ADR-023 failure-auto-repair → ADR-028
- ADR-025 cicd-ai-integration → ADR-029
新增 ADR:
- ADR-021: Playbook 更新驗證
- ADR-022: Sentry 整合架構
- ADR-027: Incident-Approval 同步
- ADR-028: 失敗自動修復閉環
- ADR-029: CI/CD AI 整合 (原 ADR-025)
更新:
- ADR-018: LLM 測試策略狀態更新
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-26 19:09:08 +08:00 |
|
OG T
|
c7be68f800
|
docs: LOGBOOK 更新 Phase 13.2 #84 完成狀態
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-26 18:56:24 +08:00 |
|
OG T
|
0a9d94d82b
|
feat(k8s): CoreDNS GitOps 架構 (ADR-026)
問題: DNS 配置沒有版本控制,手動修改易遺失
架構:
- k8s/k3s-system/coredns-custom.yaml: HelmChartConfig
- CD workflow: k3s-system 路徑偵測 + 自動 apply
- ADR-026: CoreDNS GitOps 管控架構
DNS 上游:
- 使用 8.8.8.8 + 1.1.1.1
- 禁止 /etc/resolv.conf (systemd-resolved)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-26 18:43:28 +08:00 |
|
OG T
|
6e3a7fca20
|
docs: ADR-006 v1.2 Rate Limiter + LOGBOOK 更新
- ADR-006: 新增 Rate Limiter 實作章節 (v1.2)
- LOGBOOK: 記錄 Gemini 切換 + Rate Limiter 上線
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-26 18:16:45 +08:00 |
|
OG T
|
30145c7d7e
|
docs: ADR-025 CI/CD AI 整合架構 + Skill 07 更新
- ADR-025: 文檔化 Phase 13.1 CI/CD AI 整合架構決策
- GitHub Webhook 事件驅動流程
- 風險分級執行決策 (AUTO/TELEGRAM/APPROVAL/BLOCKED)
- SignOz Log 整合
- Skill 07 v1.3: 新增 Grafana MCP + SignOz query_logs
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-26 15:41:26 +08:00 |
|
OG T
|
14c81f728f
|
docs: 新增 ADR-025 告警鏈路 E2E 驗證 + 更新 Skills
新增:
- ADR-025: 告警鏈路 E2E 驗證架構 (2026-03-26 事故教訓)
更新:
- ADR-011: 新增 DNS 規則最佳實踐 (附錄 B)
- Skill 04: 新增 NetworkPolicy DNS 規則 + CoreDNS 設定
- Skill 05: 新增告警鏈路 Smoke Test 要求
- CLAUDE.md: 新增告警鏈路驗證到任務前必讀
事故根因:
1. URL 路徑錯誤 (webhook vs webhooks)
2. NetworkPolicy DNS 規則標籤不匹配
3. CoreDNS 上游 DNS 依賴 systemd-resolved
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-26 15:34:12 +08:00 |
|
OG T
|
579da38b8b
|
feat(api): Phase 13 智能路由 + CI/CD 整合 (#74-88)
Phase 13.1 CI/CD Integration:
- #76 workflow_run handler for CI failure diagnosis
- #77 SignOz log query (query_logs, error_logs_summary MCP)
- #78 CIAutoRepairService with risk-based execution decisions
Phase 13.3 Smart Routing:
- #85 Intent Classifier v2.0 (rule engine + LLM fallback)
- #86 Complexity Scorer (9-dimension scoring)
- #87 AI Router v3.0 (routing decision matrix)
- #88 Token Counter (OTEL + Langfuse integration)
New files:
- services/ci_auto_repair.py (risk stratification)
- services/model_registry.py (centralized model config)
- services/token_counter.py (677 lines)
- Skill 08: Model Router Expert
- Skill 09: Strangler Pattern Expert
- ADR-023: Smart Routing Architecture
- ADR-024: API Layer Architecture
Tests:
- phase11-conversational.spec.ts (E2E tests)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
2026-03-26 15:32:52 +08:00 |
|