Commit Graph

201 Commits

Author SHA1 Message Date
OG T
f51bf5a6a8 feat(backup): full-service backup coverage + alerting, 9/9 services covered
New backups (deployed to 110; all passed on first run):
- backup-langfuse.sh: Langfuse AI tracing/evaluation DB (7238 traces)
- backup-monitoring.sh: Prometheus + Grafana + Alertmanager volumes + configs
- backup-signoz.sh: SigNoz ClickHouse + SQLite (distributed tracing/logs)
- backup-open-webui.sh: Open-WebUI LLM conversation history (SSH 188 volume)
- backup-clawbot.sh: ClawBot Redis state/cache (SSH 188 volume)
- backup-all.sh v3.0: consolidated to 9/9 services

Alerting:
- common.sh: notify_clawbot now uses the correct /webhook/custom format
- failed → severity:critical → Telegram 🔴 immediate alert
- Alert test passed: {"status":"ok","alert_id":"878c4c59..."}

GFS retention: 30 daily / 12 weekly / 24 monthly (AWOOOI additionally keeps 28h of high-frequency backups)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 11:12:42 +08:00
OG T
91564c6ea3 docs(sop): REBOOT-RECOVERY-SOP.md v4.0
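Updates:
- Add Sentry /opt/sentry startup instructions (110 Step 7/9)
- Add a section on repairing Sentry corruption after a reboot (PostgreSQL WAL/Redis RDB/ClickHouse parts)
- Alert-silence diagnostic tree: add a "rules not deployed" diagnosis + the deploy-alerts.sh fix command
- E2E verification script: add Sentry + Prometheus rule-count checks (≥25)
- Architecture diagram: add Sentry :9000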

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 03:11:27 +08:00
OG T
0db9b41808 docs(plan): Observability + Auto-healing full implementation plan (15 Tasks, 3 Sprints)
Sprint 1 (P0): unified Prometheus alert rules + Sentry startup + CD sync
Sprint 2 (P1): SigNoz log alerting + Sentry SDK tags
Sprint 3 (P2): SSH HostRepairAgent infrastructure

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 02:24:23 +08:00
OG T
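de33abe0e3 docs(spec): system-wide self-healing closed-loop design spec v1.0
A complete solution covering three major problems:
1. Prometheus rules not deployed (13 rules → 40+ rules, including SentryDown/AlertChain)
2. Logs are collected but there is no log-based alerting
3. Auto-repair is limited to the K8s layer, with no Host Docker/systemd repair capability

Includes:
- Unified label convention (layer/component/team/host)
- Sprint 1: rule deployment + Sentry startup + CD sync
- Sprint 2: SigNoz log alert + Sentry integration
- Sprint 3: SSH HostRepairAgent + Playbooks
- SOP v4.0 integration update points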

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 02:14:01 +08:00
OG T
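2243a21b96 fix(ai-router): v4.3 NIM protection: timeouts do not count as CB failures; always try NIM first before switching to Gemini
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 20s
Requirement: wait for NIM until it responds before switching; it must not be blocked by the CB and routed to Gemini just because it is slow

Changes:
- Timeout exceptions no longer accumulate CB failures (only real connection errors count)
- NIM CB: failure_threshold=10, recovery_timeout=30s (more lenient than the defaults)
- Design doc v4.3: update Direction 2, remove the incorrect assumption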
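
The circuit-breaker behaviour described in this commit can be illustrated with a minimal sketch. It is not the project's code: the class and method names are hypothetical, and the built-in `TimeoutError` stands in for whatever timeout exception the real provider raises. Only `failure_threshold=10` and `recovery_timeout=30` come from the commit message.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker; timeouts are deliberately not counted (hypothetical sketch)."""

    def __init__(self, failure_threshold=10, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    def allow_request(self):
        if self._opened_at is None:
            return True
        # Once the recovery window has elapsed, let a trial request through.
        return self._clock() - self._opened_at >= self.recovery_timeout

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_error(self, exc):
        # A slow response must not trip the breaker.
        if isinstance(exc, TimeoutError):
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

With this shape, any number of timeouts leaves the breaker closed, while ten connection errors open it until the 30-second recovery window has passed.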

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 01:51:12 +08:00
OG T
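8f64affbdb docs(runbooks): REBOOT-RECOVERY-SOP v3.0 complete reboot automation plan
## Contents

A complete inventory of all hosts, services, tools, and monitoring, covering:
- Startup order and dependency graph
- Handling flows for normal vs. abnormal reboots
- Detailed startup sequence per host (188/110/120/121)
- Troubleshooting guide for common failures (alert silence / CD failure / data loss / NodePort)
- E2E verification script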

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 01:48:29 +08:00
OG T
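be3aa6069b feat(backup): AWOOOI high-frequency backup: back up awoooi_prod every 6 hours
awoooi_prod is the core production DB; a once-daily backup with up to 24 hours of data loss is unacceptable:

- backup-awoooi-frequent.sh: back up awoooi_prod every 6 hours (08:00/14:00/20:00)
- 02:00: full backup by backup-all.sh (including dev/k3s)
- 4 runs per day in total; maximum data loss ≤ 6 hours
- GFS retention: 28h high-frequency + 30 daily + 12 weekly + 24 monthly

First run: 680K, 4s, snapshot db050dbc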

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 01:14:50 +08:00
OG T
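3136fc5ea0 feat(backup): fully automated backups + AWOOOI DB + extended GFS retention
Chief Architect backup audit: automation fully complete:

- backup-awoooi.sh: new AWOOOI PostgreSQL backup script
  - awoooi_prod (KB/incidents/AutoRepair/Drift) + k3s_datastore
  - SSH from 110 to 188 to run pg_dump, integrated into restic
  - First run: 680K, 9s, snapshot 8750748f

- backup-all.sh v2.0: integrates a 4th service, AWOOOI DB

- Extended GFS retention policy:
  - Daily 7→30 copies (covers the last 30 days)
  - Weekly 4→12 copies (covers the last 3 months)
  - Monthly 6→24 copies (covers the last 2 years)

- BACKUP-STATUS.md: updated to a fully automated status overview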
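
To make the 30/12/24 GFS retention policy concrete, here is a simplified sketch of how such a policy picks snapshots to keep. It assumes "newest snapshot per day, ISO week, and month" semantics. The function name `gfs_keep` is hypothetical, and this is not a reimplementation of the pruning the backup tool itself performs.

```python
from datetime import datetime, timedelta


def gfs_keep(snapshots, daily=30, weekly=12, monthly=24):
    """Return the set of snapshot times a simple GFS policy would retain."""
    rules = [
        (daily, lambda t: t.date()),                      # one per calendar day
        (weekly, lambda t: tuple(t.isocalendar())[:2]),   # one per ISO (year, week)
        (monthly, lambda t: (t.year, t.month)),           # one per calendar month
    ]
    keep = set()
    ordered = sorted(snapshots, reverse=True)  # newest first
    for limit, bucket_of in rules:
        seen = set()
        for snap in ordered:
            bucket = bucket_of(snap)
            if bucket in seen:
                continue  # a newer snapshot already represents this bucket
            if len(seen) >= limit:
                break  # enough buckets retained for this rule
            seen.add(bucket)
            keep.add(snap)
    return keep


# Example: 60 nightly snapshots taken at 02:00.
snaps = [datetime(2026, 4, 5, 2) - timedelta(days=i) for i in range(60)]
kept = gfs_keep(snaps)
```

Snapshots outside `kept` are the ones such a policy would prune. A single snapshot can satisfy several rules at once, so the retained total is usually smaller than 30 + 12 + 24.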

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 01:11:31 +08:00
OG T
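84cfdb6195 docs(backup): complete backup audit inventory + new AWOOOI DB and Gitea DB backup scripts
Chief Architect backup audit findings:
- awoooi_prod PostgreSQL: no backup (P0 gap)
- Gitea SQLite DB: no backup (corrupted today; manual repair took 2h+)

Added:
- scripts/backup/backup-awoooi-db.sh (deployed on 188, 02:00 daily)
- scripts/backup/backup-gitea-db.sh (deployed on 110, 01:00 daily)
- docs/runbooks/BACKUP-STATUS.md (overview table + deployment steps + SOP)
- LOGBOOK.md backup audit section

Pending manual deployment: the Commander needs to scp the scripts to 188/110 and set up crontab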

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 01:01:58 +08:00
OG T
45458e8f33 docs(adr): ADR-057 status updated to approved and implemented
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 00:44:31 +08:00
OG T
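f4f454fd98 feat(api): automatically warm up Redis Working Memory from PostgreSQL after a reboot
- main.py lifespan: restore INVESTIGATING/MITIGATING incidents from the DB at startup
- scripts/reboot-recovery: automation scripts for 188 + 110 + systemd services
- scripts/reboot-recovery: automatically create aiops-network (required by ClawBot)
- docs/runbooks/REBOOT-RECOVERY-SOP.md: full rewrite, including automation script documentation

Why: after a reboot Redis is emptied, so the frontend shows 0 incidents (the DB retains everything)
Commander approval: "All data must be recorded for the long term"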

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 00:39:20 +08:00
OG T
ddb75b69c5 docs(logbook): Phase 25 Review R2 passed + ADR-054~057 recorded
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 00:25:31 +08:00
OG T
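15c7f6fcd3 docs(adr): draft ADR-054/055/056/057, architecture decisions for the three Phase 25 directions
ADR-054: DIAGNOSE Privacy-First Routing (approved)
  - _local_fallback_chain design decision
  - NEMOTRON privacy_level=local, Chief Architect ruling
  - All local providers fail → REJECT + Telegram

ADR-055: Knowledge Auto-Harvesting (approved)
  - Design rationale for AUTO_RUNBOOK DRAFT + ANTI_PATTERN PUBLISHED
  - compute_hash() collision risk notes
  - Mandatory fire-and-forget GC protection rule

ADR-056: Config Drift Detection four-layer architecture (approved)
  - Detector→Analyzer→Interpreter→Remediator responsibility boundaries
  - AI only performs intent analysis, not repair decisions
  - adopt() suspended + _recent_reports Phase 1 limitation

ADR-057: adopt() Gitea PR API implementation path (draft, pending approval)
  - Resolves the security risk of git add -A in the API Pod
  - Safeguarded by the PR review process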

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 00:24:50 +08:00
OG T
c4923b6908 docs(logbook): record that Phase 22.4 + Phase 25 verification all passed
- Phase 22.4 tests 18/18 PASSED (b6e12f7)
- embed-all 7/7 prod succeeded
- semantic-search E2E score=0.6867 verified
- drift /scan E2E responds normally
- drift-scanner CronJob runs hourly
- dev/prod DB migration (symptoms_hash + enum) complete
- 53 integration tests PASSED

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 18:00:33 +08:00
OG T
0c180dec86 docs(spec): Direction 2 implementation correction record: Nemotron privacy_level=cloud (P0)
2026-04-04 17:42:53 +08:00
OG T
0b41df45d6 docs(plans): three-direction implementation plan P0/P1/P2
- P0: DIAGNOSE Privacy-First Routing (local chain isolation + REJECT protection)
- P1: Knowledge Auto-Harvesting (Anti-Pattern closed loop + Runbook generation)
- P2: Config Drift Detection (GitOps gatekeeper + Nemotron intent analysis)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 12:31:36 +08:00
OG T
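035cb9cd0d docs(spec): Nemotron proactive defense three-direction design doc
- Direction 1: Knowledge Auto-Harvesting (Anti-Pattern closed loop + automatic Runbook generation)
- Direction 2: DIAGNOSE Privacy-First Routing (Local-Only Fallback Chain)
- Direction 3: Config Drift Detection (GitOps gatekeeper + Nemotron intent analysis)

100% technical endorsement from Chief Architect ogt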

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 12:18:11 +08:00
OG T
369413f87d docs: update LOGBOOK, KB Phase 2 fixes all complete + 5 tests PASSED
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 11:55:40 +08:00
OG T
69a9218723 docs: update LOGBOOK, KB Phase 2 + Chief Architect Review record
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 11:49:31 +08:00
OG T
cddc4cb1fc fix(knowledge): Chief Architect Review fixes C1+C2+I1+I2 (71→~88/100)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m16s
C1: IKnowledgeRepository Protocol now includes save_embedding + semantic_search +
    list_unembedded_entries, restoring the interface-first protection layer

C2: raw SQL in the embed_all_entries Service layer moved to Repository.list_unembedded_entries();
    the Service now calls through the Protocol, in line with the leWOOOgo building-block principle

I1: asyncio.create_task now adds tasks to a _pending_tasks set to hold a reference, preventing GC collection and
    task loss at shutdown; tasks are discarded automatically once done

I2: OllamaEmbeddingService changed from a new instance per call to injection via KnowledgeService.__init__,
    reusing a single instance

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 11:22:38 +08:00
OG T
15aabd6ac5 fix(chat+nim): fix Chief Architect Review I1-I4 (four Important issues) + S3
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m9s
I1: chat_manager._call_openclaw timeout=30.0 → read from settings.OPENCLAW_TIMEOUT
I2: nvidia_provider.py stale comment "45" → "55" to match the ConfigMap
I3: remove asyncio.shield; after the timeout the shielded task kept running with nothing awaiting it (silent leak)
I4: remove the repo instance from ChatManager.__init__ (leWOOOgo forbids a Service from holding a repository)
S3: _check_nemotron_health probe 10s → 25s + lightweight /v1/models endpoint

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 16:36:16 +08:00
OG T
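ce945fe89e rule(cost): 🔴🔴🔴 mandatory approval for cost changes: HARD_RULES v1.8 + CLAUDE.md
Commander directive, 2026-04-03:
Any change that incurs cost must stop and wait for the Commander's explicit approval before it is executed

Added:
- HARD_RULES.md v1.8: Cost Change Approval section
  - Defines the scope of cost-incurring changes
  - Mandatory flow: identify → stop → explain → wait for approval → execute
  - Records the lesson from today's violation
- CLAUDE.md: new cost-change entry under required pre-task reading

Memory synced:
- feedback_cost_change_approval.md (new)
- feedback_constitution_v2.md Chapter 5
- MEMORY.md index, top-priority iron rules section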

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 15:36:47 +08:00
OG T
dc232ebb49 docs: LOGBOOK update: KB Phase 1 + monitoring + I1/I3 complete
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 13:22:54 +08:00
OG T
0b83707697 feat(web): upgrade APM/Apps/Deployments/Tickets pages to use real API data
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- apm/page.tsx: Golden Signals real data (SigNoz ClickHouse)
- apps/page.tsx: host service status (real data from /api/v1/dashboard)
- deployments/page.tsx: wired to K8s deployment status
- tickets/page.tsx: wired to the Incidents list
- i18n: bilingual strings completed for the apm/apps/deployments/tickets namespaces

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 00:08:11 +08:00
OG T
2d5f1a71ad chore(observability): ClickHouse TTL configuration complete, Phase O fully accepted
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
E2E Health Check / e2e-health (push) Has been cancelled
signoz_logs: 30 days (built-in _retention_days DEFAULT 30)
signoz_metrics, 8 tables: 233280000s (2700 days) → 7776000s (90 days)
  - samples_v4, samples_v4_agg_5m, samples_v4_agg_30m
  - exp_hist, time_series_v4, time_series_v4_6hrs
  - time_series_v4_1day, time_series_v4_1week

Phase O acceptance checklist fully checked off

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-04-02 21:38:39 +08:00
OG T
08f73dfce8 docs: Phase O-5 Wave 5.4 alert chain E2E verification Runbook
Some checks failed
E2E Health Check / e2e-health (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
- Architecture diagram, manual test steps, smoke test checklist
- generate_monitoring.py usage notes
- Known-issue exemption list, rollback commands
- First acceptance record, 2026-04-02, 8/8

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 21:34:43 +08:00
OG T
48c65756da chore(config): write USE_AI_ROUTER=true into the ConfigMap (Phase 24 B2)
Some checks failed
E2E Health Check / e2e-health (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
Prevents the next CD deploy from overwriting the setting applied via kubectl set env.
B2 observation period: 48h, ending 2026-04-04 18:40 Taipei time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 21:26:53 +08:00
OG T
3f339110dd fix(observability): sync the actual .188 deployment adjustments into the repo
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
E2E Health Check / e2e-health (push) Has been cancelled
Differences from the original plan:

1. MinIO Bearer Token authentication
   - Original plan: MINIO_PROMETHEUS_AUTH_TYPE=public (not supported in this version)
   - Actual: mc admin prometheus generate produces a Bearer Token
   - Updated: prometheus-config-phase-o.yaml adds bearer_token

2. remote_write abandoned → OTEL Collector Prometheus scrape
   - Original plan: Prometheus remote_write → SigNoz OTEL /api/v1/write
   - Actual: the SigNoz OTEL Collector does not support the Prometheus remote_write format (404)
   - Replacement: the OTEL Collector prometheus receiver scrapes node-exporter + kube-state-metrics directly
   - Added: ops/signoz/otel-collector-config-phase-o.yaml (version-controlled copy)

3. ADR-053 acceptance checklist updated with the actual results

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-04-02 21:23:47 +08:00
OG T
3e4612f259 docs(observability): ADR-053 SigNoz unified architecture + Phase O acceptance
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 36s
E2E Health Check / e2e-health (push) Successful in 16s
- Add ADR-053: observability unified architecture decision record
- Update service-registry.yaml: add MinIO/Kali monitoring entry points
- Update LOGBOOK: Phase O completion status

Phase O acceptance checklist:
- kubectl passwordless on the local Mac
- OTEL Collector 2 Pods Running
- Event Exporter 1 Pod Running
- Descheduler CronJob Completed
- MinIO + Kali alert rules
- Alert Chain Smoke Test
- CD Pipeline integration
- ClickHouse TTL / remote_write / SigNoz rules (pending manual work on .188)

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-04-02 18:26:57 +08:00
OG T
51961b9f03 docs: Phase O observability final completion plan design spec
SigNoz-unified architecture, addressing 6 major blind spots (Event/Log/Metrics/Descheduler/kubectl/MinIO-Kali)
+ Monitoring Master Plan Wave A-D wrap-up
+ 5 Chief Architect Review checkpoints

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 13:45:23 +08:00
OG T
73e8f8ab77 feat(ai): Phase 24-A+B1: AI Provider Registry + Strangler wrapper (ADR-052)
Some checks failed
E2E Health Check / e2e-health (push) Successful in 16s
CD Pipeline / build-and-deploy (push) Has been cancelled
Brain Layer dual-track Registry architecture:
- New src/services/ai_providers/ directory (interfaces + 4 providers)
  - OllamaProvider (local, rca/chat/code_review)
  - GeminiProvider (cloud, rca/chat)
  - ClaudeProvider (cloud, rca/chat/code_review)
  - OpenClawNemoProvider (cloud, rca; delegates 188→NIM)
- Extend ai_router.py with:
  - AIProviderRegistry (dynamic registration/enable/disable)
  - AIRouterExecutor (Cache + gates CB/RL/Sem + execution)
- openclaw.py Strangler wrapper: USE_AI_ROUTER=true takes the new path
- config.py + ConfigMap add USE_AI_ROUTER=false (safe default)
- ADR-052 formal document (14 decisions, D1-D14)
- HARD_RULES v1.7 adds AI Router rules

Safety: USE_AI_ROUTER=false by default (disabled); must be enabled manually for observation
Rollback: kubectl set env deployment/awoooi-api USE_AI_ROUTER=false

2026-04-02 ogt: Phase 24 first implementation batch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 13:16:09 +08:00
OG T
db2a2852b8 docs: frontend refactor acceptance report 87/100
Some checks failed
E2E Health Check / e2e-health (push) Successful in 16s
CD Pipeline / build-and-deploy (push) Has been cancelled
Playwright browser screenshots + KB API endpoint tests + Console analysis
- 24/24 routes, zero 404s
- 7 complete pages + 15 ComingSoon
- All 7 KB API endpoints working
- 1 Low bug (archived entry still accessible via GET)
- Metrics Strip [object Object] still to be fixed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 10:20:27 +08:00
OG T
25889d4b8e docs: archive ADR-050 reanalyze implementation plan (completed)
Some checks failed
CD Pipeline (Dev) / build-and-deploy-dev (push) Failing after 9s
E2E Health Check / e2e-health (push) Successful in 18s
CD Pipeline / build-and-deploy (push) Has been cancelled
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 09:38:03 +08:00
OG T
5959855a71 feat(web): font system upgrade + NemoClaw SVG restored + Knowledge Base design doc
Some checks failed
E2E Health Check / e2e-health (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
- Fonts: Syne (headings) + DM Mono (body) + VT323 (brand bitmap), replacing Inter
- Tailwind: fontFamily updated + 5-level text color tokens (primary→disabled)
- Sidebar: NemoClaw white-porcelain lobster claw SVG + AWOOOI enlarged in VT323
- OpenClaw Panel: restore the NemoClaw 3D white-porcelain lobster claw (replacing NemoNodeAnimation)
- Knowledge Base design doc (B separation / A K8s Job / Phase 1 skips vector search)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 00:48:42 +08:00
OG T
8845377a6d docs: update AI Center redesign spec (deprecated components + authorization logic notes)
Some checks failed
E2E Health Check / e2e-health (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 22:28:32 +08:00
OG T
9cf73bda4f feat(llmops): enable Langfuse LLMOps tracing + automatic key injection in CD
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m6s
E2E Health Check / e2e-health (push) Successful in 18s
- 04-configmap.yaml: LANGFUSE_ENABLED=true (Phase 15.1 key already in K8s Secret)
- cd.yaml: add automatic CD injection of the Langfuse keys (LANGFUSE_PUBLIC/SECRET_KEY)
- LOGBOOK.md: naming correction ClawBot → OpenClaw
- .gitignore: add tsconfig.tsbuildinfo + .superpowers/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 22:19:22 +08:00
OG T
0b04abf990 docs(plan): add AI Center v6 redesign implementation plan (13 tasks)
2026-04-01 19:39:41 +08:00
OG T
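4b84e95723 docs: AI Center UI redesign spec v6
- Anthropic Warmth (#f5f4ed) + OpenClaw Blue (#4A90D9) color system
- 3-column layout: Sidebar(200px) | Feed(50%) | RightPanel(50%)
- Full sidebar: 4 sections, 19 items (consolidating all wooo-aiops menus)
- Event card flow diagram + chibi lobster (natural orange-red #E85530)
- NemoClaw white-background node animation (screenshot style)
- Rounded corners throughout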

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 19:19:03 +08:00
OG T
9913f5dc6d feat(infra): dev environment separation + BuildKit cache fix + circuit breaker tuning
Some checks failed
CD Pipeline / build-and-deploy (push) Successful in 6m52s
E2E Health Check / e2e-health (push) Successful in 17s
CD Pipeline (Dev) / build-and-deploy-dev (push) Failing after 9s
1. k8s/awoooi-dev/: new dev namespace (configs 01-05)
   - Namespace + ResourceQuota (cpu 2/4, mem 4Gi/8Gi)
   - ConfigMap: ENVIRONMENT=dev, LOG_LEVEL=DEBUG, SHADOW_MODE=false
   - Deployment: 1 replica, NodePort 32344, image dev-latest
   - RBAC: awoooi-executor-dev ServiceAccount

2. .gitea/workflows/cd-dev.yaml: dev branch CD pipeline
   - Trigger: dev branch push
   - Build: --no-cache (prevents cache poisoning)
   - Tag: dev-{sha} / dev-latest
   - Deploy: awoooi-dev namespace, health check 32344
   - Telegram: notifications prefixed with [DEV]

3. apps/api/Dockerfile: ARG CACHE_BUST=none (prevents BuildKit cache poisoning)
   - deps layer (pip install) can still be cached
   - src/ and models.json layers are rebuilt every time

4. .gitea/workflows/cd.yaml: production API build adds CACHE_BUST=git_sha
   - Ensures config changes such as models.json make it into the image

5. apps/api/src/services/nvidia_provider.py: timeouts no longer count toward the circuit breaker
   - TimeoutException → log only, no record_failure()
   - Only hard errors (auth/rate limit/exception) trip the breaker

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 16:22:21 +08:00
OG T
c9c60c3a61 feat(mcp-integrations): Phase S architecture fixes + MCP integration groundwork
Some checks failed
E2E Health Check / e2e-health (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
Type Sync Check / check-type-sync (push) Failing after 22s
Phase S tech-debt fixes (Chief Architect review 82→complete):
- S-01: generate_alert_fingerprint moved to the AlertAnalyzer.generate_fingerprint() staticmethod
- S-04: remove Pydantic v2 deprecated json_encoders (use native datetime serialization directly)

Sentry MCP integration (Phase 23):
- ADR-048: Sentry→OpenClaw AI Triage architecture decision
- sentry_webhook_service.py: parse/analyze/create_incident/build_message Service layer
- config.py: SENTRY_WEBHOOK_SECRET (Fail-Closed HMAC-SHA256)
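
"Fail-Closed HMAC-SHA256" can be read as: a request is rejected unless a secret is configured and the signature matches. The sketch below illustrates that reading using only the standard library. The function name `verify_signature` and the hex-encoded signature are assumptions, and how the signature reaches the service (header name, encoding) is not stated in the commit.

```python
import hashlib
import hmac


def verify_signature(secret, body, signature_hex):
    """Return True only when the HMAC-SHA256 signature matches; otherwise fail closed."""
    # Fail closed: a missing secret or signature rejects the request
    # instead of silently skipping verification.
    if not secret or not signature_hex:
        return False
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())
```

The fail-open alternative, accepting requests when no secret is configured, is convenient in development but lets unsigned webhooks through if the secret is ever dropped from the deployment.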

Playwright MCP integration (short term):
- smoke.spec.ts: 5-page E2E smoke test (home/dashboard/incidents/approvals/terminal)
- cd.yaml: E2E Smoke Test step + Telegram 🎭 Smoke status notification

Long-term planning ADRs:
- ADR-049: Figma Code Connect design system sync
- ADR-050: Telegram interactive Incident 2.0 (6-button Inline Keyboard)
- ADR-051: Context7 dependency upgrade advisor (Next.js 14→15, FastAPI 0.115→0.128)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 16:20:57 +08:00
OG T
5a46998689 docs: Secrets management handbook (ADR-035+ unified source of truth for Secrets)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 5m23s
E2E Health Check / e2e-health (push) Successful in 17s
Created docs/runbooks/SECRETS-MANAGEMENT.md:
- Complete list of 7 Gitea Secrets + 12 K8s Secrets
- Update SOP (API + Web UI)
- One-command status check
- Guide to obtaining/updating each key
- Emergency handling

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 15:40:48 +08:00
OG T
22de22c989 refactor(phase-s): Phase S tech-debt cleanup, five architecture improvements
S-01: generate_alert_fingerprint() moved to alert_analyzer_service (Router→Service)
S-02: remove the obsolete USE_NEW_ENGINE config (Phase R has served its purpose)
S-03: github_webhook.py linter cleanup (Field unused + delivery_id noqa)
S-04: Pydantic v2 migration, approval/incident models (class Config → ConfigDict)
S-05: Skill 09 v1.1 update (USE_NEW_ENGINE deprecation note)

Tests: 393 passed, zero failures

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 13:12:02 +08:00
OG T
59902f270d fix(tests): Chief Architect review fixes: test suite + DI hardening (96/100 OUTSTANDING)
P1 test fixes:
- test_smart_router.py: updated to the current API (IntentResult + DIAGNOSE/CONFIG normalization)
- test_auto_repair_service.py: inject a _no_cooldown fixture to isolate the Redis dependency
- test_global_repair_cooldown.py: add the @pytest.mark.integration marker

P2 architecture improvements:
- AutoRepairService: new cooldown_checker DI parameter (Callable | None)
- global_repair_cooldown: get_redis() moved inside try-except to prevent an uncaught RuntimeError
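
The `cooldown_checker` injection can be pictured with the sketch below. It reuses the class and parameter names from the commit, but the callable's signature, the `repair` method, and the default checker are assumptions made for illustration. The real service's API is not shown in this log.

```python
from typing import Callable, Optional


def _default_cooldown_checker(target: str) -> bool:
    """Stand-in for a Redis-backed check (assumed); always reports 'not cooling down'."""
    return False


class AutoRepairService:
    """Illustrative shape only; not the project's implementation."""

    def __init__(self, cooldown_checker: Optional[Callable[[str], bool]] = None):
        # Fall back to the default checker when nothing is injected.
        self._in_cooldown = cooldown_checker or _default_cooldown_checker

    def repair(self, target: str) -> str:
        if self._in_cooldown(target):
            return "skipped"
        return "repaired"


# In a unit test, a stub replaces the external dependency entirely.
always_cooling = AutoRepairService(cooldown_checker=lambda target: True)
```

Passing the checker in through the constructor lets unit tests run without Redis, which matches the stated goal of the `_no_cooldown` fixture.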

P3 configuration:
- pyproject.toml: register the integration pytest marker

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 11:11:50 +08:00
OG T
6fed8be8c4 docs(adr): ADR-024 R4 Router slimming marked complete
Some checks failed
E2E Health Check / e2e-health (push) Successful in 17s
Type Sync Check / check-type-sync (push) Failing after 22s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 09:27:40 +08:00
OG T
5086bafa36 docs: ADR-045 unify the Telegram Gateway on the K8s AWOOOI API
Records the architecture decision implemented on 2026-03-31:
- Unify Telegram on the K8s AWOOOI API in Webhook mode
- Resolve the dual-track contention with OpenClaw (188) Long Polling

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 09:17:08 +08:00
OG T
a94bb57d8b feat(types): ADR-046 IncidentConverter + IncidentEngineAdapter
Implements ADR-046 Option B: an IncidentConverter conversion layer that resolves
the type-boundary problem between BrainIncident (lewooogo-brain) and LocalIncident (apps/api).

Changes:
- Add src/utils/incident_converter.py
  - brain_to_local(): BrainIncident → LocalIncident
  - local_to_brain(): LocalIncident → BrainIncident
  - ESCALATED → MITIGATING mapping (brain has no ESCALATED)
- incident_engine.py: add the IncidentEngineAdapter wrapper layer
  - process_signal() / get_incident() output converted to LocalIncident
  - get_incident_engine() returns IncidentEngineAdapter
- incident_memory.py: add the brain_to_local import, update the _record_to_incident notes
- ADR-046: mark all three conversion points complete
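
The status mapping in the converter can be sketched as follows. The two enums and both function names are hypothetical stand-ins, and `RESOLVED` is an assumed member added for illustration. Only the ESCALATED → MITIGATING rule comes from the commit message.

```python
from enum import Enum


class LocalStatus(Enum):
    # Hypothetical status set on the apps/api side.
    INVESTIGATING = "INVESTIGATING"
    MITIGATING = "MITIGATING"
    ESCALATED = "ESCALATED"
    RESOLVED = "RESOLVED"


class BrainStatus(Enum):
    # Hypothetical status set on the brain side; note there is no ESCALATED.
    INVESTIGATING = "INVESTIGATING"
    MITIGATING = "MITIGATING"
    RESOLVED = "RESOLVED"


def local_to_brain_status(status):
    """Map a local status to the brain's narrower status set."""
    if status is LocalStatus.ESCALATED:
        return BrainStatus.MITIGATING  # brain has no ESCALATED
    return BrainStatus[status.name]


def brain_to_local_status(status):
    """Every brain status has a same-named local counterpart."""
    return LocalStatus[status.name]
```

The mapping is lossy in one direction: an ESCALATED incident that round-trips through the brain type comes back as MITIGATING, so callers that care about escalation would need to track it separately.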

Unblocks: #123 proposal_service.py cleanup (next step)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-31 22:47:54 +08:00
OG T
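2ba61acf72 fix(api): Phase R-R2.2 Chief Architect 72/100 P2 fixes
P2-01 signal_worker.py: persisted_to_pg now read via getattr to avoid a BrainIncident AttributeError
P2-02 IIncidentEngine Protocol: update_incident_status → update_status to match the brain implementation
P2-03 config.py USE_NEW_ENGINE: marked as no longer effective + rollback path corrected (git revert rather than kubectl)
ADR-046: Option B (IncidentConverter) decision finalized, pending-implementation list updated
ADR-024: review conclusions + official rollback command updated
Skill 02: v2.5 version record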

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-31 22:33:08 +08:00
OG T
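cd91560e0b docs: Phase R-R2 completion doc updates + ADR-046 type unification
- ADR-024: update execution progress (R1 R2 R3; R4 pending)
- ADR-046: add cross-package Incident type unification governance (decision pending)
  Recommended Option B: IncidentConverter conversion layer
- Skill 02: v2.5 records Phase R-R2 + R-R2.1 + ADR-046
- LOGBOOK: update current status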

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-31 22:17:44 +08:00
OG T
67ef98e737 docs: update LOGBOOK, Phase R-R2 complete (#121 #122)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-31 22:04:13 +08:00
OG T
a3bd0a4b45 docs: update LOGBOOK, Phase R-R1 Strangler pattern confirmed complete
Some checks failed
E2E Health Check / e2e-health (push) Successful in 16s
Type Sync Check / check-type-sync (push) Failing after 20s
Confirmed items:
- #117-119: Dockerfile + Strangler wrapper implemented
- USE_NEW_ENGINE switch configured (default False)
- Rollback mechanism: kubectl set env USE_NEW_ENGINE=false
- Phase 15.4 #113-114 sampling rate confirmed

Next steps:
- #120 E2E verification (test with USE_NEW_ENGINE=True enabled)
- Phase R-R2: remove duplicated logic

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-31 21:36:33 +08:00