Commit Graph

275 Commits

Author SHA1 Message Date
OG T
702350925a fix(monitoring+layout): restore the missing infrastructure section + bring all monitoring tools online
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 6m47s
- page.tsx: right panel overflow:hidden → overflowY:auto, so the infrastructure section renders again
- page.tsx: align monitoring tool cards with Figma (icon box + version/stats row + › arrow)
- monitoring.py: Gitea probe now uses /api/v1/version (/-/readiness returns 404)
- monitoring.py: add Basic auth to the Grafana dashboard count request
- NetworkPolicy: open egress on 3002/9090/3001 (Grafana/Prometheus/Gitea)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 00:50:53 +08:00
OG T
b6105b8214 fix(ai): chief architect review fixes C1+C2 (Phase 24 C)
C1: telegram_gateway.py fail-closed whitelist:
  With an empty whitelist, 'if whitelist and ...' evaluates to False → anyone could run /ai
  Fix: 'if not whitelist or user_id not in whitelist' (fail-closed)
  Add a whitelist_empty field to the warning log

C2: openclaw.py await syntax error in a list comprehension:
  Python 3.11 does not support await inside a list comprehension
  'if not await is_provider_disabled(p)' → SyntaxError
  Fix: switch to a for loop with an explicit await
  I4: replace the silent except with logger.warning

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 00:42:02 +08:00
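The fail-closed whitelist check described in C1 of b6105b8214 can be illustrated with a minimal sketch. The function names below are hypothetical; only the two `if` conditions come from the commit message.

```python
def is_authorized_fail_open(user_id: int, whitelist: set[int]) -> bool:
    """Buggy shape quoted in the commit: an empty whitelist skips the check."""
    if whitelist and user_id not in whitelist:
        return False
    return True


def is_authorized_fail_closed(user_id: int, whitelist: set[int]) -> bool:
    """Fixed shape: an empty whitelist authorizes nobody."""
    if not whitelist or user_id not in whitelist:
        return False
    return True
```

With an empty whitelist the first function lets every caller through, while the second rejects everyone until the whitelist is configured.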
OG T
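8bc086af58 feat(infra): complete monitoring tools + host service list + K3s Cluster highlight
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 6m50s
Monitoring tools (6):
- Add Grafana (110:3002), Sentry (110:9000), Langfuse (110:3100)
- Keep Prometheus, SigNoz, Gitea

Infrastructure:
- Static service catalog HOST_CATALOG: full services + ports + descriptions for every host
- K3s Server #2 (121): add a static card (not returned by the API)
- K3s Cluster HA in its own blue block, with ☸ title + VIP info
- Every service lists its port number and a functional description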

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 00:36:59 +08:00
OG T
dbe71f82e3 feat(ai): Phase 24 C - Telegram /ai dynamic control + Redis state management
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Add ai_control.py:
- /ai status: status of all Providers + routing mode
- /ai router on/off: toggle AIRouter at runtime (overrides the env var)
- /ai primary <provider>: set the primary Provider
- /ai enable/disable <provider>: enable or disable a Provider
- /ai cost: cost statistics
- Whitelist: protected by OPENCLAW_TG_USER_WHITELIST

telegram_gateway.py:
- _handle_chat_message now intercepts and routes /ai commands
- Return a warning when the user is not on the whitelist

openclaw.py:
- Redis state overrides env USE_AI_ROUTER (makes /ai router on/off take effect)
- Redis primary_provider overrides the routing decision (makes /ai primary take effect)
- Filter out providers disabled in Redis (makes /ai disable take effect)

Redis Keys:
  ai:control:use_router
  ai:control:primary_provider
  ai:control:disabled:<provider>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 00:34:14 +08:00
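The override order described in dbe71f82e3 (Redis state first, env var second) might look roughly like the sketch below. A plain dict stands in for Redis, and the value encodings ("true", "1") and helper names are assumptions; only the three key names come from the commit message.

```python
USE_ROUTER_KEY = "ai:control:use_router"
PRIMARY_KEY = "ai:control:primary_provider"


def disabled_key(provider: str) -> str:
    """Build the per-provider disable key listed in the commit."""
    return f"ai:control:disabled:{provider}"


def resolve_use_router(store: dict[str, str], env_default: bool) -> bool:
    """The stored value wins; fall back to the env default when the key is absent."""
    value = store.get(USE_ROUTER_KEY)
    if value is None:
        return env_default
    return value.lower() in {"1", "true", "on"}  # encoding is an assumption


def resolve_primary(store: dict[str, str], routed_default: str) -> str:
    """A stored primary provider overrides the router's own choice."""
    return store.get(PRIMARY_KEY, routed_default)


def enabled_providers(providers: list[str], store: dict[str, str]) -> list[str]:
    """Drop every provider that has a disable key set."""
    return [p for p in providers if disabled_key(p) not in store]
```

Keeping the env var as the fallback means that clearing the Redis keys restores the deployed configuration.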
OG T
b4b3a457c5 refactor(openclaw): Phase 24 B4 - archive the legacy fallback Provider methods
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
[ARCHIVED] _call_ollama / _call_gemini / _call_claude
- These three methods are kept as the rollback path for USE_AI_ROUTER=false
- New path: USE_AI_ROUTER=true → AIRouterExecutor (ai_router.py)
- New Providers: ai_providers/ollama.py / gemini.py / claude.py
- Archived rather than deleted: full removal waits until Phase 24 passes full acceptance (ADR-052 D11)

R3 observation results (passed):
- openclaw_nemo provider: 12/12 incidents routed correctly
- Confidence: 0.8~0.9, normal
- Confirmed that USE_AI_ROUTER=true is in effect

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 00:29:56 +08:00
OG T
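ce11fcdc3a feat(monitoring): monitoring tools section - Grafana/Prometheus/SigNoz/Gitea status
- Add GET /api/v1/monitoring/status, probing the four tools in parallel with asyncio.gather
- Frontend MonitoringTools component, polling every 60s to show status/version/stats
- Add the monitoringTools i18n key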

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 00:27:47 +08:00
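The parallel probing mentioned in ce11fcdc3a can be sketched with `asyncio.gather`. The probe below is a simulated stand-in, not the real HTTP check, and the response shape and the forced failure are assumptions for illustration.

```python
import asyncio

TOOLS = ["grafana", "prometheus", "signoz", "gitea"]


async def probe(name: str, fail: bool = False) -> dict:
    """Simulated probe; a real one would issue an HTTP request."""
    await asyncio.sleep(0.01)
    if fail:
        raise ConnectionError(f"{name} unreachable")
    return {"name": name, "status": "up"}


async def monitoring_status(failing: frozenset = frozenset()) -> dict:
    """Probe all tools concurrently; one failure must not hide the others."""
    results = await asyncio.gather(
        *(probe(name, fail=name in failing) for name in TOOLS),
        return_exceptions=True,  # exceptions come back as values
    )
    status = {}
    for name, result in zip(TOOLS, results):
        if isinstance(result, Exception):
            status[name] = {"name": name, "status": "down"}
        else:
            status[name] = result
    return status
```

Passing `return_exceptions=True` lets a single unreachable tool be reported as down while the other three results are still returned.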
OG T
97d86861ed fix(ai_router): C1 fix - align AIProviderEnum with the actual Provider names in the Registry
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 37s
Problem: AIProviderEnum.NVIDIA = "nvidia" has no matching Provider in the Registry
      OpenClawNemoProvider.name = "openclaw_nemo"
      NemotronProvider.name = "nemotron"
      → high-complexity/Tool Calling routes were always skipped, silently falling back to Gemini/Ollama

Fix:
- Add OPENCLAW_NEMO = "openclaw_nemo" (general inference, via .188 → NVIDIA NIM)
- Add NEMOTRON = "nemotron" (Tool Calling, direct NVIDIA NIM)
- Remove NVIDIA = "nvidia" (no match in the Registry)
- Rule 4 (complexity>=4/HIGH risk): NVIDIA → OPENCLAW_NEMO
- route_tool_calling: NVIDIA → NEMOTRON
- Rate Limiter check: "nvidia" → "openclaw_nemo"
- _full_fallback_chain: OPENCLAW_NEMO first
- _tool_calling_fallback_chain: NEMOTRON first

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 23:31:31 +08:00
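The mismatch fixed in 97d86861ed (an enum value with no registered provider behind it) is the kind of drift a small startup check can catch. This is a sketch under assumptions: the "ollama", "gemini" and "claude" values and the `unregistered` helper are hypothetical; "openclaw_nemo" and "nemotron" come from the commit message.

```python
from enum import Enum


class AIProviderEnum(str, Enum):
    OLLAMA = "ollama"                # assumed value
    GEMINI = "gemini"                # assumed value
    CLAUDE = "claude"                # assumed value
    OPENCLAW_NEMO = "openclaw_nemo"  # from the commit
    NEMOTRON = "nemotron"            # from the commit


# Names the providers would register under (stand-in for the real Registry).
REGISTRY_NAMES = {"ollama", "gemini", "claude", "openclaw_nemo", "nemotron"}


def unregistered(enum_cls, registry_names) -> list[str]:
    """Return enum values that no registered provider answers to."""
    return sorted(m.value for m in enum_cls if m.value not in registry_names)
```

Failing fast on a non-empty result would surface a stale member such as "nvidia" at startup instead of through a silent fallback.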
OG T
58002e6bf4 feat(phase24-b3): extract NemotronProvider + refactor incident-card
Some checks failed
E2E Health Check / e2e-health (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
Phase 24 B3:
- Add ai_providers/nemotron.py: NemotronProvider wraps K8s Tool Calling
  Moved from openclaw.py _call_nemotron_tools (L1623-1785)
  capabilities=tool_calling, privacy_level=cloud
- ai_router.py: register NemotronProvider in the Registry
- ai_providers/__init__.py: export NemotronProvider

Phase R-UI2 (architect Warning):
- incident-card.tsx: extract the useApprovalAction hook
  60 lines of duplicated handleApprove/handleReject logic → shared hook
  Behavior is unchanged; maintainability improves

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 23:12:42 +08:00
OG T
5a8aae89c4 fix(phase24): chief architect review fixes C1/C2/C3/I4
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m12s
E2E Health Check / e2e-health (push) Successful in 18s
C1 (P0): add Langfuse Trace to AIRouterExecutor.execute() (D5)
  - Create langfuse_trace("ai_router_execute") wrapping the whole execution chain
  - On success, record a generation (model/input/output/tokens/cost)
  - All AI calls in prod now have LLMOps tracing

C2 (P0): the strangler wrapper now calls AIRouter.route() for smart routing
  - First obtain a RoutingDecision (intent classification + complexity score)
  - provider_order is built dynamically from selected_provider + fallback_chain
  - D1 intent routing matrix and D7 privacy protection (DIAGNOSE forced to local) now take effect

C3 (P1): fix type annotation typos
  - AIProviderEnumEnum → AIProviderEnum
  - AIProviderEnumProtocol → AIProviderProtocol

I4 (P1): add the close() definition to the AIProvider Protocol in interfaces.py

S1: update the ai_router.py module version header to v4.0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 21:47:06 +08:00
OG T
04978995c1 fix(metrics): actually call record_alert_chain_success (Wave A.5)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 6m47s
E2E Health Check / e2e-health (push) Successful in 17s
The alert_chain_last_success_timestamp metric was defined but never set.
Call record_alert_chain_success() on the two main success paths in alertmanager_webhook:
- After a CI/CD alert is handled successfully
- After LLM analysis completes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 20:10:58 +08:00
OG T
f5b8738185 fix(wave-a): Wave A alert chain acceptance fixes
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
E2E Health Check / e2e-health (push) Has been cancelled
- sentry_webhook: add a GET /health endpoint (for smoke test probing)
- smoke_test: alertmanager path changed to /webhooks/health (already exists)
- smoke_test: Prometheus URL corrected to 110:9090
- smoke_test: mark the alert chain metric critical=False (normal during initialization)

Wave A.6 smoke test now passes 6/8 → 7/8 checks (8/8 once sentry health is deployed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 20:08:26 +08:00
OG T
3ad7b60f68 fix(ai): Phase 24 R1+R2 chief architect review fixes (C1-C3 + I1-I5)
Some checks failed
E2E Health Check / e2e-health (push) Successful in 18s
CD Pipeline / build-and-deploy (push) Has been cancelled
Critical fixes:
- C1: rename the AIProvider Enum to AIProviderEnum (avoids a name clash with the Protocol)
- C2: shared Circuit Breaker → per-provider _SimpleCircuitBreaker
  (so a Gemini outage no longer blocks Ollama as well)
- C3: move cache_key outside the try block (avoids UnboundLocalError)

Important fixes:
- I1: Claude hardcoded model → use get_model_registry()
- I2: Claude tracks tokens/cost (input_tokens + output_tokens)
- I3: Ollama tracks tokens (eval_count + prompt_eval_count)
- I4: Gemini temperature → use model_registry
- I5: AIProviderRegistry.close_all() shutdown hook

2026-04-02 ogt: fixes made after the Phase 24 chief architect review was passed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 13:40:58 +08:00
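The per-provider circuit breaker from C2 of 3ad7b60f68 can be sketched as one breaker instance per provider name. The class below is a simplified stand-in for `_SimpleCircuitBreaker`: the threshold value is an assumption and the recovery timeout a real breaker needs is omitted.

```python
from collections import defaultdict


class SimpleCircuitBreaker:
    """Counts consecutive failures; opens once the threshold is reached."""

    def __init__(self, threshold: int = 3):  # threshold is an assumed value
        self.threshold = threshold
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1

    def record_success(self) -> None:
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.threshold


# One independent breaker per provider, created on first use.
breakers: defaultdict = defaultdict(SimpleCircuitBreaker)

for _ in range(3):
    breakers["gemini"].record_failure()
```

After three Gemini failures only `breakers["gemini"]` is open; `breakers["ollama"]` stays closed, which is the isolation the fix describes.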
OG T
73e8f8ab77 feat(ai): Phase 24-A+B1 - AI Provider Registry + strangler wrapper (ADR-052)
Some checks failed
E2E Health Check / e2e-health (push) Successful in 16s
CD Pipeline / build-and-deploy (push) Has been cancelled
Brain Layer dual-track Registry architecture:
- Create the src/services/ai_providers/ directory (interfaces + 4 providers)
  - OllamaProvider (local, rca/chat/code_review)
  - GeminiProvider (cloud, rca/chat)
  - ClaudeProvider (cloud, rca/chat/code_review)
  - OpenClawNemoProvider (cloud, rca; delegates 188→NIM)
- Extend ai_router.py with:
  - AIProviderRegistry (dynamic registration/enable/disable)
  - AIRouterExecutor (Cache + gates CB/RL/Sem + execution)
- openclaw.py strangler wrapper: USE_AI_ROUTER=true takes the new path
- config.py + ConfigMap add USE_AI_ROUTER=false (safe default)
- ADR-052 formal document (14 decisions D1-D14)
- HARD_RULES v1.7 adds the AI Router rules

Safety: USE_AI_ROUTER=false by default (disabled); enable it manually for observation
Rollback: kubectl set env deployment/awoooi-api USE_AI_ROUTER=false

2026-04-02 ogt: first Phase 24 implementation batch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 13:16:09 +08:00
OG T
05cd9cbab4 fix(web): fix 6 issues from the acceptance report
Some checks failed
E2E Health Check / e2e-health (push) Successful in 17s
CD Pipeline / build-and-deploy (push) Has been cancelled
1. [Medium] Metrics Strip [object Object]: stop rendering the pendingApprovals array directly
   + replace hardcoded labels with i18n (activeIncidents/serviceHealth/todayIncidents, etc.)
2. [Low] KB GET /{id} did not filter archived entries: get_by_id adds status != ARCHIVED
3. [Low] favicon.ico 404: add the NemoClaw SVG favicon + layout metadata
4. [Medium] auto-repair console errors: fetchEval wrapped in try-catch to handle them silently

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 10:30:43 +08:00
OG T
db1aed81d9 fix(db): C1 timezone unification migration - utc_now → taipei_now (all 5 tables)
Some checks failed
E2E Health Check / e2e-health (push) Successful in 18s
CD Pipeline / build-and-deploy (push) Has been cancelled
🔴 Chief architect review C1: UTC is banned system-wide; the Taipei timezone (+8) is mandatory

- utc_now() → taipei_now() (calls src.utils.timezone.now_taipei)
- Affected: ApprovalRecord, TimelineEvent, AuditLog, IncidentRecord, KnowledgeEntryRecord
- All 13 default/onupdate occurrences replaced
- Remove the datetime.UTC import

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 09:13:36 +08:00
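A helper like the `now_taipei` named in db1aed81d9 could be as small as the sketch below. The real `src.utils.timezone` module is not shown in the log, so this body is an assumption; it uses a fixed +08:00 offset rather than a tz database lookup.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+8 offset; the label is informational only.
TAIPEI = timezone(timedelta(hours=8), name="Asia/Taipei")


def now_taipei() -> datetime:
    """Return the current time as a timezone-aware datetime at UTC+8."""
    return datetime.now(TAIPEI)


def taipei_now() -> datetime:
    """Alias matching the name used for the ORM default/onupdate callables."""
    return now_taipei()
```

Returning an aware datetime keeps the offset attached to the value, so stored timestamps stay unambiguous when compared with UTC sources.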
OG T
628387de8c fix: automate the risklevel migration + inject the Telegram Whitelist
Some checks failed
E2E Health Check / e2e-health (push) Successful in 17s
CD Pipeline / build-and-deploy (push) Has been cancelled
1. init_db() now ensures at startup that the risklevel enum contains the 'high' value
   (added in Phase 23; prevents InvalidTextRepresentation on older DBs missing the value)

2. CD Pipeline now injects OPENCLAW_TG_USER_WHITELIST automatically
   (previously CHANGE_ME; updated to the actual user ID 5619078117)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 09:13:13 +08:00
OG T
d2bad44173 fix(api): KB architecture review fixes I3-I5
Some checks failed
E2E Health Check / e2e-health (push) Successful in 17s
CD Pipeline / build-and-deploy (push) Has been cancelled
- I3: add IKnowledgeRepository Protocol type annotations in the Service layer
- I4: the search method now searches the tags JSONB (cast→String→ilike)
- I5: get_categories is a standalone method and no longer detours through list_entries(limit=0)

Chief architect review 87/100 → all Important issues fixed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 09:05:54 +08:00
OG T
48a0bc66f7 fix(api): KB chief architect review fixes (I1 tags filter + I2 type annotation)
Some checks failed
E2E Health Check / e2e-health (push) Successful in 16s
CD Pipeline / build-and-deploy (push) Has been cancelled
- I1: Repository list_entries implements the tags JSONB @> filter (previously declared but not implemented)
- I2: ORM tags type corrected from list[dict[str, Any]] to list[str]

Chief architect review: 87/100
C1 timezone (UTC→Taipei) is a pre-existing systemic issue; a separate task will handle the unified migration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 09:04:41 +08:00
OG T
e17248fd10 fix: chief architect review fixes - i18n/CD/timezone/dead code cleanup
Some checks failed
E2E Health Check / e2e-health (push) Successful in 16s
CD Pipeline / build-and-deploy (push) Has been cancelled
P0 frontend i18n compliance (6 files):
- settings/page.tsx: fully switched to useTranslations('settings')
- auto-repair/page.tsx: 30+ hardcoded strings switched to t('autoRepair.*')
- sidebar.tsx: sectionLabel uses tSection(); aria-label internationalized
- openclaw-panel.tsx: STATUS_MESSAGES uses tPanel(); Production uses tBrand
- alerts/page.tsx: StatPill label uses t('incident.severity.*')

P1 CD Pipeline:
- cd.yaml: runs-on changed to self-hosted (ADR-039)
- Telegram Secret injection failure now exits with 1 (ADR-035)
- kubectl patch op:replace → op:add (compatible with a first deployment)

P2 backend:
- langfuse_client.py: remove the dead v4.x branch (SDK pinned to <3.0.0)
- ai.py: mark TODO(R4) Router slimming

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 09:02:41 +08:00
OG T
d32d84efce feat(telegram): wire up the Phase 22 Nemotron dual-track display (ADR-044)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
E2E Health Check / e2e-health (push) Has been cancelled
Root cause: format_with_nemotron() was implemented but never called
- send_approval_card() adds nemotron_enabled/tools/validation/latency parameters
- Pass the nemotron fields when constructing TelegramMessage
- Use the format_with_nemotron() format automatically when nemotron_enabled=true
- _push_decision_to_telegram() extracts nemotron data from proposal_data and passes it on

Effect: Telegram shows two blocks together: OpenClaw arbitration + Nemotron execution plan
2026-04-02 ogt: the last mile of Phase 22

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 08:59:03 +08:00
OG T
d8be78b135 feat(api): Knowledge Base Phase 1 backend four-layer architecture
Some checks failed
CD Pipeline / build-and-deploy (push) Successful in 7m0s
E2E Health Check / e2e-health (push) Successful in 17s
Type Sync Check / check-type-sync (push) Failing after 30s
- models/knowledge.py: Pydantic Schema (EntryType/Source/Status/CRUD)
- db/models.py: KnowledgeEntryRecord ORM (PostgreSQL)
- repositories/interfaces.py: IKnowledgeRepository Protocol
- repositories/knowledge_repository.py: PostgreSQL CRUD implementation
- services/knowledge_service.py: business logic (get_db_context manages the session internally)
- api/v1/knowledge.py: REST Router (get_knowledge_service, no direct DB access)
- main.py: mount the Knowledge Base Router
- k8s/jobs/migrate-knowledge-entries.yaml: DB Migration Job

API endpoints: GET/POST / | GET/PATCH/DELETE /{id} | POST /{id}/approve
               GET /search | GET /categories

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 00:55:56 +08:00
OG T
077e1cd637 fix(telegram): fix ai_provider missing on all webhook paths → Telegram showed "AI arbitration verdict"
Some checks failed
E2E Health Check / e2e-health (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
Root cause: send_approval_card() has an ai_provider parameter, but none of the three webhook callers passed it:
- signoz_webhook.py: has an ai_provider parameter but did not forward it to send_approval_card
- sentry_webhook.py: has analysis.analyzed_by but did not pass it
- webhooks.py: _push_to_telegram_background lacked the ai_provider parameter

After the fix, Telegram shows "🤖 OpenClaw Nemo arbitration" instead of "🤖 AI arbitration verdict"

2026-04-02 ogt: unified fix across the three webhook paths

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 00:51:13 +08:00
OG T
de04de1d4f fix(telegram): add display-name mappings for openclaw_nemo/nvidia_nim
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
E2E Health Check / e2e-health (push) Has been cancelled
- The provider_names maps in both format() and format_with_nemotron() add:
  openclaw_nemo → "OpenClaw Nemo"
  openclaw_nvidia_nim → "OpenClaw Nemo"
  openclaw_qwen → "OpenClaw Nemo"
- Fix the name being displayed as "OPENCLAW_NEMO" (uppercase)
- 2026-04-01 ogt: follows the AI arbitration architecture adjustment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 00:20:37 +08:00
OG T
27b4d2a76a fix(telegram): strip <placeholder> placeholders to prevent HTML parse errors
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
E2E Health Check / e2e-health (push) Has been cancelled
The kubectl_command generated by OpenClaw contains <受影響服務名稱> ("affected service name"),
which causes 'Can't parse entities' in Telegram HTML parse mode
Strip all <...> placeholders with a regex

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 22:50:07 +08:00
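The regex strip described in 27b4d2a76a might look like the following sketch. The pattern and function name are assumptions, since the commit only says that all `<...>` placeholders are removed.

```python
import re

# Matches an angle-bracket span with no nested brackets inside it.
_PLACEHOLDER = re.compile(r"<[^<>]*>")


def strip_placeholders(text: str) -> str:
    """Remove every <...> span so Telegram's HTML parser sees no stray tags."""
    return _PLACEHOLDER.sub("", text)
```

A pattern this broad also removes legitimate HTML tags, so it suits only fields, such as a generated command string, that should never carry markup.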
OG T
88051388d4 fix(ai): fix _call_openclaw_analyze datetime serialization failure → fallback to Gemini
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
E2E Health Check / e2e-health (push) Has been cancelled
The signals dict contains datetime objects, which httpx json= cannot serialize
Add _to_serializable for recursive conversion, datetime → str

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 22:37:04 +08:00
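The recursive conversion named in 88051388d4 can be sketched as below. This is an illustrative version, not the project's `_to_serializable`; the choice of ISO 8601 as the string format is an assumption.

```python
import json
from datetime import datetime


def to_serializable(value):
    """Recursively replace datetime objects with ISO 8601 strings."""
    if isinstance(value, datetime):
        return value.isoformat()
    if isinstance(value, dict):
        return {key: to_serializable(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_serializable(item) for item in value]
    return value


def dumps_signals(signals: dict) -> str:
    """Serialize a signals dict that may contain nested datetimes."""
    return json.dumps(to_serializable(signals))
```

Converting before the HTTP call keeps the payload JSON-safe regardless of how deeply a datetime is nested.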
OG T
ff85b0581a fix(ai): fix the wrong method call in analyze_and_propose
Some checks failed
CD Pipeline (Dev) / build-and-deploy-dev (push) Failing after 9s
CD Pipeline / build-and-deploy (push) Has been cancelled
E2E Health Check / e2e-health (push) Has been cancelled
- OpenClawService has no analyze(); the correct method is analyze_alert(alert_context)
- Wrap host_statuses as alert_ctx and pass it in
- Unpack the return value (8-tuple), using *_ to ignore the trailing items

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 22:33:51 +08:00
OG T
5809d3e336 feat(ai): delegate Incident RCA to OpenClaw (Nemo) - architecture iron-rule correction
Some checks failed
E2E Health Check / e2e-health (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
Architecture iron rule: OpenClaw = the AI brain; the AWOOOI API delegates arbitration over HTTP
Changes:
- openclaw.py: add _call_openclaw_analyze(), calling OpenClaw first, before the LLM fallback
- 04-configmap.yaml: OPENCLAW_URL corrected to :8088 (new container port)
- AI_FALLBACK_ORDER changed to ["ollama","claude"] (removes the paid Gemini API)

OpenClaw /api/v1/analyze/incident → qwen2.5:7b on local Ollama (Nemo)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 21:11:30 +08:00
OG T
60d2fbaf8c feat(telegram): implement reanalyze button handler, replace placeholder (ADR-050)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 21:08:44 +08:00
OG T
6dc1505584 feat(incident): add trigger_reanalysis() with Redis 10min dedup (ADR-050)
2026-04-01 21:06:39 +08:00
OG T
a9d8fd9c3c feat(telegram): ADR-050 P2 - implement detail/history info actions
All checks were successful
CD Pipeline (Dev) / build-and-deploy-dev (push) Successful in 2m28s
- _send_incident_detail: fetch incident details + AI confidence bar chart; sent as a new message so the original approval card is kept
- _send_incident_history: frequency statistics (1h/24h/7d/30d + auto-repair count)
- reanalyze: kept as an in-development placeholder

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 18:48:04 +08:00
OG T
0bf0a1cea2 feat(telegram): ADR-050 P1 - 6-button Inline Keyboard + info actions skeleton
All checks were successful
CD Pipeline (Dev) / build-and-deploy-dev (push) Successful in 2m39s
CD Pipeline / build-and-deploy (push) Successful in 7m1s
E2E Health Check / e2e-health (push) Successful in 17s
Row 1: [Approve] [Reject] [🔕 Silence] (nonce prevents replay)
Row 2: [📋 Details] [🔄 Re-diagnose] [📊 History] (read-only, action:incident_id format)

- security_interceptor: parse_callback_data supports the 2-part info action format
- telegram_gateway: _build_inline_keyboard adds an incident_id parameter
- telegram.py: info_action short-circuits and triggers no DB operations

P2 to do: detail/reanalyze/history should return real data (currently returns "feature under development")

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 18:34:26 +08:00
OG T
43a370fc11 fix(model): IncidentOutcome compatibility with the legacy Redis string format
Some checks failed
CD Pipeline (Dev) / build-and-deploy-dev (push) Successful in 2m38s
CD Pipeline / build-and-deploy (push) Has been cancelled
E2E Health Check / e2e-health (push) Has been cancelled
Type Sync Check / check-type-sync (push) Failing after 22s
Old incidents stored outcome as the string "resolved", which Pydantic v2 cannot parse
→ INTERNAL_ERROR on /auto-repair/evaluate/{incident_id}

A field_validator with mode='before' converts strings to None (safely discarded)
This ensures legacy data no longer raises incident_parse_error

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 18:03:21 +08:00
OG T
9913f5dc6d feat(infra): separate dev environment + BuildKit cache fix + circuit breaker tuning
Some checks failed
CD Pipeline / build-and-deploy (push) Successful in 6m52s
E2E Health Check / e2e-health (push) Successful in 17s
CD Pipeline (Dev) / build-and-deploy-dev (push) Failing after 9s
1. k8s/awoooi-dev/: new dev namespace (configs 01-05)
   - Namespace + ResourceQuota (cpu 2/4, mem 4Gi/8Gi)
   - ConfigMap: ENVIRONMENT=dev, LOG_LEVEL=DEBUG, SHADOW_MODE=false
   - Deployment: 1 replica, NodePort 32344, image dev-latest
   - RBAC: awoooi-executor-dev ServiceAccount

2. .gitea/workflows/cd-dev.yaml: dev branch CD pipeline
   - Trigger: dev branch push
   - Build: --no-cache (prevents cache poisoning)
   - Tag: dev-{sha} / dev-latest
   - Deploy: awoooi-dev namespace, health check 32344
   - Telegram: notifications prefixed with [DEV]

3. apps/api/Dockerfile: ARG CACHE_BUST=none (prevents BuildKit cache poisoning)
   - The deps layer (pip install) can still be cached
   - The src/ and models.json layers are rebuilt every time

4. .gitea/workflows/cd.yaml: production API build adds CACHE_BUST=git_sha
   - Ensures config changes such as models.json make it into the image

5. apps/api/src/services/nvidia_provider.py: timeouts no longer count toward the circuit breaker
   - TimeoutException → log only, no record_failure()
   - Only hard errors (auth/rate limit/exception) trip the breaker

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 16:22:21 +08:00
OG T
c9c60c3a61 feat(mcp-integrations): Phase S architecture fixes + MCP integration groundwork
Some checks failed
E2E Health Check / e2e-health (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
Type Sync Check / check-type-sync (push) Failing after 22s
Phase S tech debt fixes (chief architect review 82→complete):
- S-01: move generate_alert_fingerprint to the AlertAnalyzer.generate_fingerprint() staticmethod
- S-04: remove the json_encoders deprecated in Pydantic v2 (use native datetime serialization directly)

Sentry MCP integration (Phase 23):
- ADR-048: Sentry→OpenClaw AI Triage architecture decision
- sentry_webhook_service.py: parse/analyze/create_incident/build_message Service layer
- config.py: SENTRY_WEBHOOK_SECRET (Fail-Closed HMAC-SHA256)

Playwright MCP integration (short term):
- smoke.spec.ts: E2E smoke test for 5 pages (home/dashboard/incidents/approvals/terminal)
- cd.yaml: E2E Smoke Test step + Telegram 🎭 Smoke status notification

Long-term planning ADRs:
- ADR-049: Figma Code Connect design system sync
- ADR-050: Telegram interactive Incident 2.0 (6-button Inline Keyboard)
- ADR-051: Context7 dependency upgrade advisor (Next.js 14→15, FastAPI 0.115→0.128)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 16:20:57 +08:00
OG T
394f85954e fix(api): fix Y/n 404 + disable Multi-Sig
Some checks failed
E2E Health Check / e2e-health (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
1. proposal_service._load_incident() now uses incident_service.get_from_working_memory()
   - The brain engine uses the awoooi:incidents: prefix, while the data actually lives under the incident: prefix
   - The prefix mismatch caused a permanent 404 (every Y/n button failed)
   - 2026-04-02 ogt

2. trust_engine CRITICAL required_signatures 2→1
   - Commander's decision: every approval needs only 1 sign-off
   - 2026-04-02 ogt

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 16:16:28 +08:00
OG T
419dc2f8e0 fix(nvidia): timeout 60s→30s; NVIDIA first to stay on the free tier, falling over to Gemini on failure
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 5m46s
E2E Health Check / e2e-health (push) Successful in 16s
- nvidia_provider.py: NVIDIA_TIMEOUT 60→30s
- models.json: timeout_seconds 60→30s
- configmap: NEMOTRON_TIMEOUT_SECONDS 45→30s, fallback order restored to nvidia first
Goal: Nemo gets enough time to respond (free), failures fall over to Gemini quickly (backup), and the overall mechanism works

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 16:05:19 +08:00
OG T
4c622813af fix(auto-repair): auto-repair thresholds that actually work (Phase 22 P1)
Some checks failed
E2E Health Check / e2e-health (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
Problem: all four locks were jammed, so auto-repair never triggered
1. configmap: Gemini goes first (100ms vs the 60s NVIDIA timeout)
2. auto_approve: confidence 0.90→0.65, trust 5→1, playbook 3→1
3. auto_approve: allow medium risk, require_playbook=False

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 16:02:16 +08:00
OG T
eccf61fbc9 fix(ai): fix fake confidence + lift Shadow Mode (Phase 22 P1)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
E2E Health Check / e2e-health (push) Has been cancelled
1. openclaw.py: confidence 0.82→0.0 when the LLM output is truncated (fabricated confidence is forbidden)
2. prompts.py: NEMOTRON schema example values replaced with placeholders, so the model does not copy 0.75
3. configmap: SHADOW_MODE_ENABLED=false, allowing automatic execution for low risk
   Thresholds: confidence≥90% + trust_score≥5 + playbook_success≥95%

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 15:59:42 +08:00
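The rule from eccf61fbc9, that a truncated response yields confidence 0.0 rather than a made-up value, can be sketched as follows. The function name and the JSON response shape are assumptions; the real parsing in openclaw.py is not shown in the log.

```python
import json


def extract_confidence(raw: str) -> float:
    """Return the model's confidence, or 0.0 when it cannot be trusted."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return 0.0  # truncated or malformed output: never invent a score
    value = data.get("confidence") if isinstance(data, dict) else None
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return 0.0
    return float(value) if 0.0 <= value <= 1.0 else 0.0
```

Returning 0.0 keeps a truncated answer below any auto-approval threshold, so it falls back to manual review instead of passing on a fabricated score.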
OG T
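0fd53422c6 fix(openclaw): move confidence/reasoning to the front of NEMOTRON_SYSTEM_PROMPT
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 5m36s
E2E Health Check / e2e-health (push) Successful in 17s
Nemo-4B is a 4B-parameter model with limited output length. With confidence/reasoning at the end of
the schema, they were often truncated, causing the fallback at openclaw.py:1045 to fill in a fake 0.82.

Fix: move confidence and reasoning to the first two fields of the schema, so the most critical
fields are still present when the output is truncated. Also explicitly forbid the model from copying example values.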

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 13:19:18 +08:00
OG T
22de22c989 refactor(phase-s): Phase S tech debt cleanup - five architecture improvements
S-01: move generate_alert_fingerprint() to alert_analyzer_service (Router→Service)
S-02: remove the obsolete USE_NEW_ENGINE config (Phase R has served its purpose)
S-03: github_webhook.py linter cleanup (Field unused + delivery_id noqa)
S-04: Pydantic v2 migration - approval/incident models (class Config → ConfigDict)
S-05: Skill 09 v1.1 update (USE_NEW_ENGINE deprecation note)

Tests: 393 passed, zero failures

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 13:12:02 +08:00
OG T
59902f270d fix(tests): chief architect review fixes - test suite + DI hardening (96/100 OUTSTANDING)
P1 test fixes:
- test_smart_router.py: update to the current API (IntentResult + DIAGNOSE/CONFIG normalization)
- test_auto_repair_service.py: inject the _no_cooldown fixture to isolate the Redis dependency
- test_global_repair_cooldown.py: add the @pytest.mark.integration marker

P2 architecture improvements:
- AutoRepairService: add a cooldown_checker DI parameter (Callable | None)
- global_repair_cooldown: move get_redis() inside try-except to prevent an uncaught RuntimeError

P3 configuration:
- pyproject.toml: register the integration pytest marker

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 11:11:50 +08:00
OG T
e6f6734f39 fix(telegram): Redis Leader Election resolves the multi-Pod 409 Conflict
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
E2E Health Check / e2e-health (push) Has been cancelled
Problem: 2 API Pods calling getUpdates at once → each gets a 409 from the other → both fail
Root cause: an explicit env TELEGRAM_ENABLE_POLLING=false had been set on the deployment
  via kubectl patch, overriding the ConfigMap's true (violates feedback_k8s_env_precedence.md)

Fix steps:
1. kubectl patch to remove the explicit env override from the deployment
2. Implement Redis Leader Election to prevent multi-Pod contention
   - Acquire the Leader Lock with SET NX EX=45
   - _leader_renewer(): renews every 20s so the Leader keeps the Lock
   - _leader_watcher(): non-Leader Pods try to take over every 30s
   - On a 409, release the Lock proactively and let the Watchers compete to take over

Result: one Pod polls normally; the other Pod waits in Watcher standby mode

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 11:04:10 +08:00
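The lock protocol from e6f6734f39 (SET NX EX=45, periodic renewal, takeover after expiry) can be traced with an in-memory stand-in. `FakeRedis`, the lock key and the helper names are hypothetical; only the NX/EX semantics and the 45s TTL come from the commit.

```python
import time


class FakeRedis:
    """In-memory stand-in that supports only what this sketch needs."""

    def __init__(self, clock=time.monotonic):
        self._data = {}  # key -> (value, expires_at)
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry and entry[1] > self._clock():
            return entry[0]
        self._data.pop(key, None)  # drop expired entries lazily
        return None

    def set(self, key, value, nx=False, ex=None):
        if nx and self.get(key) is not None:
            return None  # NX: refuse to overwrite a live key
        expires = self._clock() + ex if ex is not None else float("inf")
        self._data[key] = (value, expires)
        return True

    def delete(self, key):
        return 1 if self._data.pop(key, None) is not None else 0


LOCK_KEY = "telegram:polling:leader"  # hypothetical key name
LOCK_TTL = 45  # seconds, per the commit


def try_acquire(r, pod_id: str) -> bool:
    return r.set(LOCK_KEY, pod_id, nx=True, ex=LOCK_TTL) is True


def renew(r, pod_id: str) -> bool:
    """Extend the lock only while this pod still owns it."""
    if r.get(LOCK_KEY) != pod_id:
        return False
    r.set(LOCK_KEY, pod_id, ex=LOCK_TTL)
    return True


def release(r, pod_id: str) -> None:
    if r.get(LOCK_KEY) == pod_id:
        r.delete(LOCK_KEY)
```

Against a real Redis server the check-then-set in `renew` and `release` is not atomic, so a production version would need a server-side script or an equivalent atomic compare step.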
OG T
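411880842f refactor(router): R4 #129 migrate AlertAnalyzer to the services layer
ADR-024 Router layer slimming R4: move business logic out of the Router into the proper layer.

Changes:
- Add src/models/webhook.py: AlertPayload + AlertResponse moved to the models layer
- Add src/services/alert_analyzer_service.py: AlertAnalyzer (141 lines) moved to the services layer
  - RISK_MAPPING / ACTION_MAPPING / BLAST_RADIUS_MAPPING lookup tables
  - analyze() method includes K8s resource name normalization (ADR-016)
- webhooks.py: remove the duplicate definitions and import them instead, -243 lines

The Router-layer webhooks.py now complies with the ADR-024 prohibited list:
AlertAnalyzer no longer exists in the Router layer.

R4 status: #127 #128 #129 #130 (all done)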

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 09:27:23 +08:00
OG T
44840f5e73 fix(service): #123 proposal_service.py fix key prefix + remove duplicated logic
ADR-046 fix: proposal_service used the wrong Redis key prefix "incident:"
(brain uses "awoooi:incidents:"), which broke load/persist after R-R2.

Changes:
- _load_incident(): delegate to IncidentEngineAdapter.get_incident()
  (correct key prefix, includes the brain→local type conversion)
- _persist_incident(): the Redis part delegates to brain DualIncidentMemory,
  storing after conversion via local_to_brain() (consistent key prefix)
- Remove the duplicated _record_to_incident() logic (now handled by IncidentEngineAdapter)
- Remove the INCIDENT_KEY_PREFIX constant
- Remove the direct get_redis() dependency

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 09:11:57 +08:00
OG T
a94bb57d8b feat(types): ADR-046 IncidentConverter + IncidentEngineAdapter
Implement ADR-046 Option B: an IncidentConverter conversion layer that resolves the
type boundary between BrainIncident (lewooogo-brain) and LocalIncident (apps/api).

Changes:
- Add src/utils/incident_converter.py
  - brain_to_local(): BrainIncident → LocalIncident
  - local_to_brain(): LocalIncident → BrainIncident
  - ESCALATED → MITIGATING mapping (brain has no ESCALATED)
- incident_engine.py: add the IncidentEngineAdapter wrapper layer
  - process_signal() / get_incident() output converted to LocalIncident
  - get_incident_engine() returns IncidentEngineAdapter
- incident_memory.py: add the brain_to_local import, update the _record_to_incident description
- ADR-046: mark all three conversion points as done

Unblocks: #123 proposal_service.py cleanup (next step)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-31 22:47:54 +08:00
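The status mapping noted in a94bb57d8b (ESCALATED has no counterpart on the brain side) can be sketched with two enums. Apart from ESCALATED and MITIGATING, the member names and all string values below are assumptions, and the real converter handles whole incident objects rather than just the status field.

```python
from enum import Enum


class LocalStatus(str, Enum):
    INVESTIGATING = "investigating"  # assumed member
    MITIGATING = "mitigating"
    ESCALATED = "escalated"
    RESOLVED = "resolved"            # assumed member


class BrainStatus(str, Enum):
    INVESTIGATING = "investigating"  # assumed member
    MITIGATING = "mitigating"
    RESOLVED = "resolved"            # assumed member


def local_status_to_brain(status: LocalStatus) -> BrainStatus:
    """Brain has no ESCALATED, so it is folded into MITIGATING."""
    if status is LocalStatus.ESCALATED:
        return BrainStatus.MITIGATING
    return BrainStatus(status.value)


def brain_status_to_local(status: BrainStatus) -> LocalStatus:
    """Every brain status has a same-named local counterpart."""
    return LocalStatus(status.value)
```

The mapping is lossy in one direction: an ESCALATED incident that round-trips through the brain type comes back as MITIGATING.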
OG T
2ba61acf72 fix(api): Phase R-R2.2 chief architect 72/100 P2 fixes
P2-01 signal_worker.py: persisted_to_pg now uses getattr to guard against a BrainIncident AttributeError
P2-02 IIncidentEngine Protocol: update_incident_status → update_status to match the brain implementation
P2-03 config.py USE_NEW_ENGINE: marked as no longer effective + rollback path corrected (git revert rather than kubectl)
ADR-046: Option B (IncidentConverter) decision finalized, to-do list updated
ADR-024: review conclusions + official rollback command updated
Skill 02: v2.5 version record

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-31 22:33:08 +08:00
OG T
d17b67c823 fix(api): Phase R-R2.1 fix P0+P1 issues from the architecture review
P0-01: annotate the IncidentDbAdapter._record_to_incident return type as Any
       (it actually returns BrainIncident, not the local Incident; avoids false type errors)
P0-02: wrap get_incident_engine() in try/except ImportError
       (follows the get_incident_memory() error-handling pattern to preserve observability)
P1-01: remove IncidentMemoryAdapter dead code (-170 lines of Lua scripts + _ensure_lua_scripts)
       (confirmed that lewooogo-brain does not call this method)
P1-03: IncidentMemoryAdapter.save_incident() delegates to self._memory
       (fixes the key prefix mismatch: "incident:" vs "awoooi:incidents:")

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-31 22:15:06 +08:00
OG T
c7b3f8f2b3 refactor(api): Phase R-R2 remove embedded duplicate logic (#121 #122)
- incident_memory.py: remove ~480 lines of the embedded DualIncidentMemory + IIncidentMemory
  Keep IncidentDbAdapter (SQLAlchemy bridge) + the get_incident_memory() singleton
- incident_engine.py: remove ~405 lines of the old embedded IncidentEngine class
  Keep IncidentMemoryAdapter + BlastRadiusAdapter (lewooogo-brain bridge)
- Switch entirely to the lewooogo-brain package (USE_NEW_ENGINE=True verified stable)
- Test verification: 104 passed, 13 skipped (all Redis-independent tests pass)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-31 22:03:00 +08:00
OG T
cc6b18e3bc fix(phase22): fix three Telegram chat bugs (ADR-044)
All checks were successful
E2E Health Check / e2e-health (push) Successful in 18s
P0: add the intercept_telegram() method to security_interceptor.py
- Fixes the AttributeError in _handle_chat_message (fatal bug)
- Whitelist validation, no Nonce required (chat messages vs button callbacks)

P1: add the use_json_mode parameter to nvidia_provider.py chat()
- Chat scenarios default to False (natural-language replies)
- RCA/analysis scenarios pass True (structured JSON output)
- openclaw.py RCA calls add use_json_mode=True

P2: enable TELEGRAM_ENABLE_POLLING=true in the K8s ConfigMap
- The K8s AWOOOI API takes over @tsenyangbot Long Polling
- OpenClaw (188) stops handling Telegram and becomes a pure REST service

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-31 21:53:09 +08:00
OG T
1f9e94e78d refactor(ai-router): add the IAIRouter Protocol (P1 fix)
Chief architect review P1 fix:
- Add the IAIRouter Protocol to support DI test substitution
- Follows the IModelRegistry, IComplexityScorer implementation pattern
- Includes the route(), route_sync(), route_tool_calling() method signatures

Review score: 78/100 → 85/100

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-31 21:23:07 +08:00