Commit Graph

94 Commits

Author SHA1 Message Date
OG T
89e05e6ea2 docs: ADR-037 + monitoring architecture proposal + Runbooks
- ADR-037 monitoring enhancement architecture
- MONITORING_MASTER_PLAN master plan
- MASTER_EXECUTION_SCHEDULE execution schedule
- Phase D/E/Worker HPA Runbooks

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 16:04:08 +08:00
OG T
95b46af986 docs: add audit report + inspiration lab + Runbook updates
- AWOOOI_COMPREHENSIVE_AUDIT_2026Q1.md: full-dimension audit
- INSPIRATION_LAB.md: idea collection
- K3S-OPTIMIZATION-RUNBOOK.md: optimization guide
- ADR-006: AI Fallback strategy update

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 16:03:41 +08:00
OG T
8ba5f5c4d3 docs: Wave C-D monitoring automation confirmed complete
- C.1 generate_monitoring.py
- C.2 CI monitoring coverage check
- C.3 discover_docker.py
- D.1 NVIDIA Dashboard
- D.2 coverage_report.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 16:03:14 +08:00
OG T
b5602e23db docs: update LOGBOOK - Wave 1 safety net fully complete
- Circuit Breaker (ADR-038) 
- Global Repair Cooldown (ADR-039) 
- Signal Worker XCLAIM + Graceful Shutdown 
- AnomalyCounter Graceful Degradation 
- K8s terminationGracePeriodSeconds: 90 

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 15:57:56 +08:00
OG T
bf06737eed docs: ADR-038/039 + LOGBOOK update
- ADR-038: OpenClaw concurrency governance architecture
- ADR-039: global auto-repair circuit breaker
- LOGBOOK: today's progress record

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 15:48:09 +08:00
OG T
5f9a6a7e55 fix(ai): remove fake confidence scores + show AI model source
Problem: AI arbitration displayed hard-coded confidence scores (0.75/0.88/0.92/0.70)

Fix:
- decision_manager: default confidence 0.75 → 0.0
- decision_manager: Expert System confidence=0.0 + is_rule_based
- openclaw: all Mock Response confidence → 0.0
- telegram_gateway: add ai_provider field
- telegram_gateway: dynamic source label (Ollama/Gemini/Claude/rule match)

Telegram card display:
- confidence > 0 + provider=ollama → 🤖 Ollama arbitration
- confidence > 0 + provider=gemini → 🤖 Gemini arbitration
- confidence > 0 + provider=claude → 🤖 Claude arbitration
- confidence == 0 → ⚙️ rule match
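
The card-label rules above can be sketched as a small pure function. This is only an illustration, not the code in telegram_gateway: the function name `source_label`, the English label strings, and the choice to treat an unrecognised provider as a rule match are all assumptions.

```python
from typing import Optional

# Provider keys come from the commit message; the display names, the
# function name, and the English labels are illustrative assumptions.
_PROVIDER_NAMES = {"ollama": "Ollama", "gemini": "Gemini", "claude": "Claude"}


def source_label(confidence: float, ai_provider: Optional[str]) -> str:
    """Return the source label shown on a Telegram decision card."""
    name = _PROVIDER_NAMES.get((ai_provider or "").lower())
    if confidence > 0 and name is not None:
        return f"🤖 {name} arbitration"
    # confidence == 0 means the decision came from rules, not a model.
    # Assumption: an unrecognised provider is also shown as a rule match.
    return "⚙️ rule match"
```

Keeping the mapping in one function means a card can never show a model name next to a zero confidence score, which is the mismatch this commit removes.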

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 15:19:51 +08:00
OG T
12e49d844a feat(monitoring): ADR-037 Wave B - Database Exporters + Prometheus integration
- Deploy PostgreSQL Exporter (192.168.0.188:9187)
- Deploy Redis Exporter (192.168.0.188:9121)
- Update Prometheus scrape config
- Chief architect review: 97% OUTSTANDING

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 15:18:54 +08:00
OG T
b55b1147e2 docs: update LOGBOOK - P1-3/P1-4 complete (32 tests)
2026-03-29 11:29:17 +08:00
OG T
50c055b547 feat(api): Phase D-G P0 fixes - Learning Repository modularization
Added:
- ILearningRepository Protocol (interfaces.py)
- LearningRepository (Redis persistence layer)
- Learning API endpoints (/api/v1/learning/*)
- LearningService.get_recommended_fix() method
- LearningService.get_learning_summary() method

Fixed:
- Service no longer depends directly on the Redis client (it goes through the Repository)
- Complies with the leWOOOgo modularization principle
- Chief architect review: 74/100 → 92/100
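
A minimal sketch of the dependency direction described above, in which the service sees only a Protocol and never a Redis client. The class names `ILearningRepository` and `LearningService` and the method `get_recommended_fix()` are named in the commit. Everything else is hypothetical: the repository methods `record_outcome` and `get_best_fix`, the `anomaly_key` parameter, and the in-memory repository standing in for the Redis-backed one.

```python
from typing import Dict, List, Optional, Protocol


class ILearningRepository(Protocol):
    """Storage interface; the method names here are hypothetical."""

    def record_outcome(self, anomaly_key: str, fix: str, success: bool) -> None: ...

    def get_best_fix(self, anomaly_key: str) -> Optional[str]: ...


class InMemoryLearningRepository:
    """Stand-in for a Redis-backed repository, for illustration only."""

    def __init__(self) -> None:
        # anomaly_key -> fix -> [successes, attempts]
        self._stats: Dict[str, Dict[str, List[int]]] = {}

    def record_outcome(self, anomaly_key: str, fix: str, success: bool) -> None:
        counts = self._stats.setdefault(anomaly_key, {}).setdefault(fix, [0, 0])
        counts[0] += int(success)
        counts[1] += 1

    def get_best_fix(self, anomaly_key: str) -> Optional[str]:
        fixes = self._stats.get(anomaly_key)
        if not fixes:
            return None
        # Pick the fix with the highest observed success rate.
        return max(fixes, key=lambda f: fixes[f][0] / fixes[f][1])


class LearningService:
    """Depends only on the Protocol, never on a concrete storage client."""

    def __init__(self, repository: ILearningRepository) -> None:
        self._repository = repository

    def get_recommended_fix(self, anomaly_key: str) -> Optional[str]:
        return self._repository.get_best_fix(anomaly_key)
```

Because the service only needs an object that satisfies the Protocol, tests can pass the in-memory repository while production wiring passes the Redis-backed one.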

Updated:
- ADR-030: add Phase D-G P0 fixes section
- Skill 02: v1.9 → v2.0
- Runner fix: serialized builds resolve the _runner_file_commands conflict

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 11:03:51 +08:00
OG T
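7b2f585244 docs: complete monitoring implementation steps (detailed 7-Phase documentation)
Phase A: AnomalyCounter service (4h)
- Redis Sorted Set sliding-window counting
- Frequency threshold alerts (REPEAT/ESCALATE/PERMANENT_FIX)
- Tier decision logic integration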
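
The sliding-window count in Phase A can be sketched without a Redis server by keeping a sorted list of timestamps per key. This is an in-memory stand-in, not the AnomalyCounter implementation. The class name, the 600-second default window, and the `now` parameter are assumptions, and the comments name the Redis Sorted Set commands that each step would correspond to.

```python
import bisect
import time
from collections import defaultdict
from typing import DefaultDict, List, Optional


class SlidingWindowCounter:
    """Count events per key inside a trailing time window."""

    def __init__(self, window_seconds: float = 600.0) -> None:
        # Assumption: a 10-minute window; the commit does not state one
        # for Phase A.
        self.window_seconds = window_seconds
        self._events: DefaultDict[str, List[float]] = defaultdict(list)

    def record(self, key: str, now: Optional[float] = None) -> int:
        """Record one event and return the count still inside the window."""
        now = time.time() if now is None else now
        events = self._events[key]
        # Redis equivalent: ZADD key <now> <member>
        bisect.insort(events, now)
        # Redis equivalent: ZREMRANGEBYSCORE key -inf <cutoff>
        cutoff = now - self.window_seconds
        del events[: bisect.bisect_right(events, cutoff)]
        # Redis equivalent: ZCARD key
        return len(events)
```

The returned count is what a threshold check for REPEAT, ESCALATE, or PERMANENT_FIX would compare against. The threshold values themselves are not given in the commit, so none are shown here.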

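Phase B: Database Exporters (3h)
- pg_exporter: connection pool / slow query / lock wait / bloat monitoring
- redis_exporter: memory / hit rate / eviction monitoring
- 15+ alert rules

Phase C: Incident frequency fields (2h)
- IncidentFrequencyStats model
- Alert aggregation logic (10-minute window)
- Frontend frequency display

Phase D: Sentry Comment write-back (1h)
- Complete the TODO implementation
- Sentry API Token configuration

Phase E: SignOz alert rules (2h)
- Error Rate / Latency alerts
- Trace anomaly detection
- SignOz Webhook Handler

Phase F: Alert Chain E2E (2h)
- Smoke Test script
- CD Pipeline integration
- Alert chain monitoring alerts

Phase G: Learning Service (3h)
- Repair effectiveness learning
- Success rate calculation
- Automatic Playbook updates

Total effort: 17h (2-3 days)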

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 10:23:04 +08:00
OG T
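1e7c5134fe docs: add anomaly frequency statistics and root-cause fix sections (commander feedback)
- Anomaly frequency tracking architecture (Redis counters + sliding window)
- Tiered repair strategy (Tier 1-4: restart → mitigate → root cause → architecture)
- AI learning service (LearningService + automatic Playbook updates)
- Telegram frequency alert format (repeat count + success rate statistics)
- Implementation checklist (P0: 22h, P1: 12h, P2: 8h)

🔴 Key point: a restart only treats the symptom, not the root cause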

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 02:04:10 +08:00
OG T
56ae7290e3 docs: update LOGBOOK - complete monitoring strategy + Telegram button fix
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 02:01:06 +08:00
OG T
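40163a51b5 feat(monitoring): complete monitoring strategy and auto-integration architecture
Added:
1. MONITORING_COMPLETE_STRATEGY.md - complete monitoring strategy
   - Monitoring matrix of 5 hosts × 60+ services
   - P0/P1/P2 alert rule list
   - AI auto-repair closed-loop flow
   - Safety guardrail configuration

2. MONITORING_INTEGRATION_ARCHITECTURE.md - auto-integration architecture
   - Service registry (Single Source of Truth)
   - CI/CD automatically verifies monitoring coverage
   - New services get monitoring automatically

3. ops/monitoring/service-registry.yaml - service list
   - K8s workloads (API/Web/Worker/ArgoCD)
   - Docker containers (Ollama/OpenClaw/Redis/Postgres)
   - Frontend page SLOs
   - API endpoint SLOs
   - Alert templates and auto-repair actions

4. ops/monitoring/validate_coverage.py - coverage validation
   - Runs in the CI stage
   - Detects unmonitored services
   - Generates a coverage report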
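
The coverage check in item 4 amounts to a set difference between the services that are running and the services listed in the registry. The sketch below is not validate_coverage.py: the function name `coverage_report`, its parameters, and the use of plain name lists in place of a parsed service-registry.yaml are assumptions.

```python
from typing import Iterable, List, Tuple


def coverage_report(
    deployed: Iterable[str], registered: Iterable[str]
) -> Tuple[float, List[str]]:
    """Return (coverage percentage, sorted list of unmonitored services).

    A deployed service counts as monitored only if it appears in the
    registry. Both inputs are plain service names here.
    """
    deployed_set = set(deployed)
    unmonitored = sorted(deployed_set - set(registered))
    if not deployed_set:
        return 100.0, []
    covered = len(deployed_set) - len(unmonitored)
    return 100.0 * covered / len(deployed_set), unmonitored
```

A CI step built on this could fail the build whenever the returned list is non-empty, which would be one way to enforce the registration rule.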

Design principles:
- Monitoring as Code
- New services must be registered in the registry before they can be deployed
- An auto-discovery mechanism prevents omissions

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 01:52:08 +08:00
OG T
94d6a0c720 docs(ai): update ADR-036 and LOGBOOK - P3 optimization record
- ADR-036 v1.4: P3 optimization complete (95/100)
- LOGBOOK: Phase 20 P1+P2+P3 all complete
- Tests: 34/34 PASSED

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 01:51:35 +08:00
OG T
179e659f14 chore: clean up Playwright artifacts + expand kube-state-metrics alerts
Cleanup:
- .gitignore: exclude playwright-report/ and test-results/
- Keep the phase19/ reference screenshot directory

kube-state-metrics alert expansion (P3):
- CronJobLastRunFailed: Job run failed
- DaemonSetMissingPods: DaemonSet is missing Pods
- StatefulSetReplicasMismatch: StatefulSet has too few replicas
- ContainerWaiting: detects ImagePullBackOff/CrashLoopBackOff
- PDBViolation: too few healthy Pods for the PDB
- NodeUnschedulable: node marked as unschedulable

Added:
- apps/api/scripts/test_nemotron_tool_calling.py (E2E comparison test)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 01:28:35 +08:00
OG T
ee2bceefff feat(monitoring): Phase 19.6 test documentation + P1-P3 improvements + chief architect review
Phase 19.6 test documentation wrap-up:
- E2E tests expanded to 18 (Terminal/GenUI verification)
- Add PHASE19-VERIFICATION-CHECKLIST.md (full verification checklist)

P1 verification:
- ArgoCD Metrics NodePort monitoring (30883/30884)
- TLS certificate monitoring (Blackbox Exporter 9115)

P2 improvements:
- waitForTimeout → waitForLoadState('networkidle')
- Cross-platform shortcuts (Meta+J / Control+J)
- SKIP_MULTISIG_TESTS environment variable control
- Prometheus GitOps deployment script

P3 improvements:
- HPA maxReplicas 4 → 6 (API/Web)

Chief architect review: 47/50 OUTSTANDING (94%)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 01:19:26 +08:00
OG T
b77e151387 feat(ai): ADR-036 NVIDIA Nemotron Tool Calling integration
Phase 20 - improve Tool Calling accuracy 50% → 83.3%

Added:
- src/models/nvidia.py: Pydantic Schema
- src/services/nvidia_provider.py: NvidiaProvider class
- tests/test_nvidia_provider.py: 15 unit tests (all passing)

Changed:
- ai_router.py: AIProvider.NVIDIA + route_tool_calling()
- ai_rate_limiter.py: NVIDIA limits (5 RPM, 100/day)
- models.json: NVIDIA configuration
- cd.yaml: inject NVIDIA_API_KEY via Secrets

Routing strategy:
- Tool Calling: Nemotron → Gemini → Claude
- General chat: Ollama → Gemini → Claude (unchanged)
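
The two fallback chains above can be sketched as ordered lists that are tried until one provider succeeds. This is an illustration only and does not reproduce ai_router.py: the `ROUTES` table, the `route` function, and the callable-per-provider shape are assumptions, and only the provider order is taken from the commit.

```python
from typing import Callable, Dict, List, Tuple

# Provider order per task type, as listed in the commit message.
ROUTES: Dict[str, List[str]] = {
    "tool_calling": ["nemotron", "gemini", "claude"],
    "chat": ["ollama", "gemini", "claude"],
}


def route(
    task: str, prompt: str, providers: Dict[str, Callable[[str], str]]
) -> Tuple[str, str]:
    """Try each provider in order; return (provider name, response)."""
    errors: List[str] = []
    for name in ROUTES[task]:
        try:
            return name, providers[name](prompt)
        except Exception as exc:  # a missing or failing provider falls through
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the response lets the caller record which model actually answered.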

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 00:00:08 +08:00
OG T
a30f766eb1 feat(monitoring): full chief architect review + supplemental alert rules
## Chief architect review result: 198/200 (99%) EXCEPTIONAL

### Review scope
- Architecture design: 50/50
- Security: 49/50
- Modularity compliance: 50/50
- Monitoring and alerting: 49/50
- E2E testing: 49/50

### Supplemental alerts added (12)
- RedisDown, PostgreSQLDown, OllamaDown, OpenClawDown
- HarborDown, LangfuseDown
- HPAMaxedOut, HPAScalingDisabled
- WorkerUnavailable
- NodeHighCPU, NodeHighMemory, ContainerOOMKilled

### Files
- k8s/monitoring/k3s-alerts-supplemental.yaml

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-28 23:30:44 +08:00
OG T
f0572ae906 feat(k4.3): Pod Security Standards + Grafana Dashboard
K4.3 Pod Security Standards:
- awoooi-prod: baseline
- kube-state-metrics: baseline
- kured: privileged (hostPID required)
- descheduler: restricted
- velero: baseline
- argocd: baseline

Grafana Dashboard:
- K3s Cluster Overview (9 panels)
- Nodes, Pods, HPA, Velero, Alerts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-28 23:16:54 +08:00
OG T
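863fc5a426 docs: add complete monitoring and alerting flow documentation (2026-03-29 ogt)
Contents:
- 8-layer architecture diagram (ASCII)
- Tool/service inventory table
- Config/code file inventory
- Full data flow description
- E2E verification mechanism (ADR-025/035)
- Troubleshooting guide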

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-28 22:25:14 +08:00
OG T
1a4be7b18a feat(k-mon): K3s monitoring integration (Phase K-MON)
- Add Velero metrics NodePort service (30885)
- Add K3s infrastructure alert rules:
  - VIP 6443 availability
  - Node ICMP checks
  - AWOOOI API/Web TCP checks
  - SignOz/Sentry availability
- Add Velero backup alerts (failed/missing)
- Add ADR-034 for ArgoCD GitOps adoption

Deployed to:
- K3s: velero-metrics service
- 188: Prometheus + Alertmanager configs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-28 21:57:57 +08:00
OG T
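6a38c0c968 fix(cd): ADR-035 three-layer protection for automatic Telegram Secrets injection
🔴 Incident root cause: K8s Secrets were not injected, so Telegram alerts were down for an extended period
- kustomization.yaml said "handled by CI/CD", but CD never did it

🛡️ Three-layer protection:
- Layer 1: Pre-flight check that the GitHub Secrets exist
- Layer 2: automatic injection via kubectl patch secret at deploy time
- Layer 3: Post-Deploy E2E test verifies alerting

📄 Documentation updates:
- ADR-035: docs/adr/ADR-035-telegram-alert-chain-enforcement.md
- DevOps Skill v1.9: add the Secrets injection iron rule
- CLAUDE.md: add alert chain section
- LOGBOOK: incident record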

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-28 21:47:49 +08:00
OG T
66fb56c691 feat(k8s): Phase K2 automated operations complete
- K2.4 NPD: Node Problem Detector (DaemonSet)
- K2.3 VPA: 3 Vertical Pod Autoscalers (Off mode)
- K2.1 ArgoCD: v3.3.6 @ :30443 (GitOps)
- K2.2 Sealed Secrets: v0.26.0 (encrypted Secrets)

New files:
- k8s/npd/node-problem-detector.yaml
- k8s/awoooi-prod/11-vpa.yaml

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-28 21:27:05 +08:00
OG T
d3e6b59b86 docs: K1 Velero backup system complete
- MinIO deployed (192.168.0.188:9000/9001)
- Velero v1.13.0 deployed to K3s
- daily-awoooi-prod Schedule (daily at 02:00)
- Test backup succeeded (153 items / 30-day retention)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-28 21:16:27 +08:00
OG T
eea6e3acc3 feat(k8s): add Velero backup system (K1.1)
Phase K1 disaster recovery:
- MinIO deployed at 192.168.0.188:9000/9001
- Velero v1.13.0 full installation manifests
- velero-backups bucket created
- README with deployment and usage guide

Deployment:
  ssh wooo@192.168.0.120
  sudo kubectl apply -f k8s/velero/velero-install-full.yaml

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-28 20:53:02 +08:00
OG T
269c81bdbb fix(k8s): unify OpenClaw port 8088→8089
- ConfigMap: OPENCLAW_URL updated to 8089
- NetworkPolicy: allow egress on 8089
- SERVICE-ENDPOINTS.md: remove legacy 8088 references

2026-03-28: clean up old configuration and standardize on the official port

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-28 20:32:30 +08:00
OG T
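e03d99b871 docs(runbook): K3s optimization Runbook v1.2 - mark completion status
Phase completion status:
- K0: Swap/PDB/backup/cleanup (chief architect 9.0/10)
- K-NET: VIP 192.168.0.125 + CI/CD integration
- K-CLEAN: cleaned up 9 RS + 1 Job

K-HA 📋 planned separately (requires a maintenance window)

Updated:
- Version 1.1 → 1.2
- Table of contents marked with completion status
- Execution results added to each Phase
- Appendix A: actual execution timeline
- Issue statistics (before/after cleanup comparison)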

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-28 18:52:13 +08:00
OG T
9fa996c9fe fix(cicd): fix OTEL endpoint configuration 192.168.0.121→188
Problem: CI/CD workflows pointed at the wrong OTEL endpoint
- ci.yaml: 121:4318 → 188:24318
- cd.yaml: 121:4318 → 188:24318

SignOz actually runs on 192.168.0.188 (AI+Web hub)

Updated:
- Skill 04 v1.8: add observability endpoint conventions
- LOGBOOK: record the configuration fix

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-28 18:47:23 +08:00
OG T
d206460751 feat(security): Phase 20 CSRF protection implementation
The Phase 19 chief architect review pointed out that the nuclear-key UX lacked CSRF protection

Backend:
- Add src/core/csrf.py (Double Submit Cookie pattern)
- Add src/api/v1/csrf.py (GET /api/v1/csrf/token)
- Add src/models/csrf.py (CSRFTokenResponse)
- Modify the approvals.py sign/reject/bulk endpoints to add CSRFToken validation

Frontend:
- Add hooks/useCSRF.ts (React Hook)
- Modify approval.store.ts to integrate the CSRF Token parameter

Security properties:
- 256-bit Token (secrets.token_hex)
- Timing-safe comparison (secrets.compare_digest)
- SameSite=Strict Cookie
- 1-hour Token lifetime
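
The token handling listed under the security properties can be sketched with the standard-library `secrets` module. This is a minimal illustration of the Double Submit Cookie check, not the contents of src/core/csrf.py. The function names, the explicit `expires_at` value, and the `now` parameter are assumptions. How the real code tracks expiry is not stated in the commit.

```python
import secrets
import time
from typing import Optional, Tuple

TOKEN_TTL_SECONDS = 3600  # 1-hour lifetime, per the commit message


def issue_token(now: Optional[float] = None) -> Tuple[str, float]:
    """Return (token, expires_at). 32 random bytes give a 256-bit token."""
    now = time.time() if now is None else now
    return secrets.token_hex(32), now + TOKEN_TTL_SECONDS


def verify_token(
    cookie_token: str,
    header_token: str,
    expires_at: float,
    now: Optional[float] = None,
) -> bool:
    """Double Submit check: the cookie value must equal the header value."""
    now = time.time() if now is None else now
    if not cookie_token or not header_token or now >= expires_at:
        return False
    # Compare as bytes so that non-ASCII input cannot raise TypeError,
    # and in constant time so that timing does not leak the token.
    return secrets.compare_digest(
        cookie_token.encode("utf-8"), header_token.encode("utf-8")
    )
```

In a request handler, `cookie_token` would come from the SameSite=Strict cookie and `header_token` from whatever the frontend hook sends back with the request.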

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-28 18:31:58 +08:00
OG T
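7b9b0c490b feat(phase19): Omni-Terminal 100% complete + chief architect review 47/50
## Phase 19 Omni-Terminal (Wave 0-6 all complete)

### Core features
- SSE state machine (7-State design, scored 10/10)
- GenUI dynamic rendering (6 cards + Zod Schema validation)
- Nuclear-key UX (long-press authorization + risk tiers)
- Terminal Telemetry (Sentry integration)

### P0-P2 fixes
- P0: Singleton → FastAPI Depends dependency injection
- P1: Zod Schema upgrade (7 validation Schemas)
- P1: Error classification code aggregation (Sentry fingerprint)
- P2: Slow Query monitoring (5s warning / 10s critical)

### Tests
- test_terminal_service.py: all 54 tests passing
- Intent classification: 42 test cases (9 IntentTypes)

### Documentation
- ADR-031: SSE architecture implementation record
- ADR-032: GenUI rendering implementation record
- Skills: v1.9 (backend Terminal section)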

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-28 18:04:12 +08:00
OG T
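3e5315aaf8 docs(k3s): chief architect review complete 46/50 (92%)
K3s optimization work review complete:

- ADR-033: Phase K0 + K-NET marked as complete
- 09-pdb.yaml: comments explaining the Worker PDB design
- DevOps Skill: add keepalived quick operations reference

Review results:
- Architecture compliance: 9/10
- Runbook completeness: 10/10
- Modularity compliance: 9/10
- Risk control: 9/10
- Documentation completeness: 9/10

P2 issues fixed; no P0/P1 blockers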

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-28 18:00:07 +08:00
OG T
efb80b403e feat(k8s): Phase K0.5 Startup Probe + PDB + revisionHistoryLimit
K3s production-grade optimization, Phase K0 changes:

- Add startupProbe to the API/Web/Worker Deployments (60s startup time)
- Add revisionHistoryLimit: 3 (fewer orphaned ReplicaSets)
- Add 09-pdb.yaml (PodDisruptionBudget protection)
- Add K3S-OPTIMIZATION-RUNBOOK.md (execution handbook)
- Fix selector to match the existing Deployments (app+environment+system)

Chief architect review: 9.0/10

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-28 11:13:44 +08:00
OG T
e5ded3b3f2 feat(phase19): OmniTerminal + GenUI + Hybrid SSE architecture implementation (Wave 0-2)
Phase 19 OmniTerminal MVP complete:
- Wave 0: Backend (Hybrid SSE POST→GET architecture)
- Wave 1: Frontend (OmniTerminal state machine + GenUI Registry)
- Wave 2: UI components (8 GenUI dynamic cards)

ADR documents:
- ADR-031: OmniTerminal SSE architecture
- ADR-032: GenUI dynamic rendering framework
- ADR-033: K3s HA architecture design

GenUI components:
- GenUIRenderer, K8sPodStatusCard, SentryErrorCard
- MetricsSummaryCard, IncidentTimelineCard
- TraceWaterfallCard, ApprovalCard, NuclearKeyButton

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-28 00:17:26 +08:00
OG T
54061fb8be docs: update LOGBOOK - Sentry chief architect review complete
- Sentry integration verified
- K3s Master confirmed as 192.168.0.120
- Phase 10 fully complete

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-27 14:57:03 +08:00
OG T
a579710982 fix(k8s): fill in missing Sentry DSN configuration (chief architect review)
- 03-secrets.example.yaml: add SENTRY_DSN
- 04-configmap.yaml: add Sentry metadata
- LOGBOOK: add CD Lint fix record

Phase 10 Sentry integration - DSN configuration completed

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-27 14:51:41 +08:00
OG T
43e8ead0d2 docs: update LOGBOOK - P1 modularity violation fixed
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-27 10:08:15 +08:00
OG T
30bed33401 docs: update LOGBOOK - P1 button optimization complete
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-27 09:51:20 +08:00
OG T
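4ee5376bd1 docs: alert mechanism optimization plan + ADR-030 Phase 6 + Skill 03 v1.5
- LOGBOOK: add full alert mechanism review record
- ADR-030: add Phase 6 asynchronous analysis optimization section
- Skill 03: v1.5 Stream Key unification + Telegram deduplication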

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-27 09:42:53 +08:00
OG T
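d3a0ed4253 docs(adr): ADR-030 complete design of the intelligent auto-repair system
Five-phase implementation plan:
- Phase 1: intelligent diagnosis foundation (completed)
- Phase 2: enhanced data collection (K8s Events + deep SignOz integration)
- Phase 3: Playbook RAG (vectorization + semantic search)
- Phase 4: automatic execution mechanism (trust score + risk assessment)
- Phase 5: continuous learning loop (feedback + trust score adjustment)

Architecture compatibility analysis:
- Interface extension point definitions
- Database Schema changes
- Risk assessment and rollback plan

Estimated timeline: 10-15 weeks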

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 21:48:41 +08:00
OG T
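309a019cc3 docs: record the fix for the Telegram alert flood incident
Updated:
- ADR-027: add emergency incident fix section
- LOGBOOK: record the 2026-03-26 incident timeline
- Skill 02 v1.6: add Telegram deduplication mechanism section

Root cause: Phase 6.5 changes + duplicated INC- prefix
Fix: Redis deduplication (10 minutes) + prefix check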
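
The two-part fix can be sketched as a prefix normaliser plus a time-limited "seen" set. This is an in-memory illustration, not the gateway code. The names `normalize_incident_id` and `AlertDeduplicator` are hypothetical, the commit does not spell out exactly how the prefix was duplicated, and in Redis the same effect is commonly achieved with `SET key value NX EX 600`.

```python
import time
from typing import Dict, Optional

DEDUP_TTL_SECONDS = 600  # 10 minutes, per the commit message


def normalize_incident_id(raw_id: str) -> str:
    """Return the ID with exactly one INC- prefix."""
    core = raw_id
    while core.startswith("INC-"):
        core = core[len("INC-"):]
    return "INC-" + core


class AlertDeduplicator:
    """Allow one alert per incident ID per TTL window."""

    def __init__(self, ttl_seconds: float = DEDUP_TTL_SECONDS) -> None:
        self.ttl_seconds = ttl_seconds
        self._expires_at: Dict[str, float] = {}

    def should_send(self, raw_id: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        key = normalize_incident_id(raw_id)
        if self._expires_at.get(key, 0.0) > now:
            return False  # already alerted inside the window
        self._expires_at[key] = now + self.ttl_seconds
        return True
```

Normalising before the lookup matters: without it, `INC-42` and `INC-INC-42` would be treated as different incidents and both would be sent.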

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 20:13:07 +08:00
OG T
fb03430469 feat(api): ADR-027 Phase 2 - automatically sync Incident status after sign/reject
Router integration points:
- POST /approvals/{id}/sign → on_approval_status_change("approved")
- POST /approvals/{id}/reject → on_approval_status_change("rejected")
- POST /approvals/bulk-approve → batch sync

Changes:
- Remove the old resolve_incident_after_approval() call
- Use IncidentApprovalService.on_approval_status_change() instead
- Sync failures do not block the main flow (fault-tolerant design)

ADR-027 progress: Phase 1-2 complete

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 19:44:59 +08:00
OG T
dd42e6b75b chore: services export + meetings docs formatting
- services/__init__.py: export IncidentApprovalService (ADR-027)
- meetings docs: formatting update

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 19:10:48 +08:00
OG T
a9f8ad56c1 chore: tidy up uncommitted changes (API core + docs + scripts)
API core:
- constants.py: system constant definitions
- unit_of_work.py: Unit of Work pattern
- incident_approval_service.py: Incident-Approval sync service

Documentation updates:
- LOGBOOK.md: progress update
- AWOOOI_AGENTIC_WORKSPACE_ROADMAP.md: roadmap
- 2026-03-26_llm_testing_evaluation.md: LLM testing evaluation
- phase5_telemetry_architecture.md: telemetry architecture
- SECRETS_REFERENCE.md: secrets reference

Config/scripts:
- Skill 02 v1.x: leWOOOgo backend update
- .dependency-cruiser.cjs: dependency rules
- demo-multisig-flow.sh: demo script

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 19:10:12 +08:00
OG T
2f5986df5c docs: ADR cleanup and additions (021-029)
ADR numbering fixes:
- ADR-023 failure-auto-repair → ADR-028
- ADR-025 cicd-ai-integration → ADR-029

New ADRs:
- ADR-021: Playbook update verification
- ADR-022: Sentry integration architecture
- ADR-027: Incident-Approval sync
- ADR-028: failure auto-repair closed loop
- ADR-029: CI/CD AI integration (formerly ADR-025)

Updated:
- ADR-018: LLM testing strategy status update

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 19:09:08 +08:00
OG T
c7be68f800 docs: LOGBOOK update - Phase 13.2 #84 completion status
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 18:56:24 +08:00
OG T
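0a9d94d82b feat(k8s): CoreDNS GitOps architecture (ADR-026)
Problem: DNS configuration was not under version control, and manual changes were easily lost

Architecture:
- k8s/k3s-system/coredns-custom.yaml: HelmChartConfig
- CD workflow: k3s-system path detection + automatic apply
- ADR-026: CoreDNS GitOps management architecture

DNS upstream:
- Use 8.8.8.8 + 1.1.1.1
- /etc/resolv.conf is forbidden (systemd-resolved)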

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 18:43:28 +08:00
OG T
6e3a7fca20 docs: ADR-006 v1.2 Rate Limiter + LOGBOOK update
- ADR-006: add Rate Limiter implementation section (v1.2)
- LOGBOOK: record the Gemini switch + Rate Limiter go-live

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 18:16:45 +08:00
OG T
30145c7d7e docs: ADR-025 CI/CD AI integration architecture + Skill 07 update
- ADR-025: document the Phase 13.1 CI/CD AI integration architecture decisions
  - GitHub Webhook event-driven flow
  - Risk-tiered execution decisions (AUTO/TELEGRAM/APPROVAL/BLOCKED)
  - SignOz Log integration
- Skill 07 v1.3: add Grafana MCP + SignOz query_logs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 15:41:26 +08:00
OG T
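14c81f728f docs: add ADR-025 alert chain E2E verification + update Skills
Added:
- ADR-025: alert chain E2E verification architecture (lessons from the 2026-03-26 incident)

Updated:
- ADR-011: add DNS rule best practices (Appendix B)
- Skill 04: add NetworkPolicy DNS rules + CoreDNS settings
- Skill 05: add alert chain Smoke Test requirement
- CLAUDE.md: add alert chain verification to the pre-task required reading

Incident root causes:
1. Wrong URL path (webhook vs webhooks)
2. NetworkPolicy DNS rule label mismatch
3. CoreDNS upstream DNS depended on systemd-resolved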

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 15:34:12 +08:00
OG T
579da38b8b feat(api): Phase 13 smart routing + CI/CD integration (#74-88)
Phase 13.1 CI/CD Integration:
- #76 workflow_run handler for CI failure diagnosis
- #77 SignOz log query (query_logs, error_logs_summary MCP)
- #78 CIAutoRepairService with risk-based execution decisions

Phase 13.3 Smart Routing:
- #85 Intent Classifier v2.0 (rule engine + LLM fallback)
- #86 Complexity Scorer (9-dimension scoring)
- #87 AI Router v3.0 (routing decision matrix)
- #88 Token Counter (OTEL + Langfuse integration)

New files:
- services/ci_auto_repair.py (risk stratification)
- services/model_registry.py (centralized model config)
- services/token_counter.py (677 lines)
- Skill 08: Model Router Expert
- Skill 09: Strangler Pattern Expert
- ADR-023: Smart Routing Architecture
- ADR-024: API Layer Architecture

Tests:
- phase11-conversational.spec.ts (E2E tests)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 15:32:52 +08:00