Commit Graph

62 Commits

Author SHA1 Message Date
OG T
76558a3cd9 feat(AIOps): 全開 P1-P6 feature flags + Nemotron + offline replay loop
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- configmap: 啟用 AIOPS_P1~P6 全部總開關與子開關
- configmap: ENABLE_NEMOTRON_COLLABORATION=true(回歸 120s timeout)
- feature_flags.py: 補齊 AIOPS_P6_GOVERNANCE_ENABLED 缺失欄位
- main.py: 掛載 run_offline_replay_loop(ADR-087 Phase 6)

2026-04-15 ogt + Claude Sonnet 4.6(亞太)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 21:59:51 +08:00
OG T
bf45b80bd2 feat(Phase 3.5 + Phase 4): AI 學習成果持久化到 PostgreSQL — 修正「AI 失憶」架構缺陷
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
ADR-085: AI 學習成果不可存在 Cache

架構鐵律確立:
- PostgreSQL = System of Record(AI 的永久記憶)
- Redis = Warm Cache(加速讀取,TTL 到期從 PG 復原)

核心變更:
1. models.py: 新增 PlaybookRecord / DynamicBaselineRecord / LogClusterRecord ORM
2. base.py: ALTER TABLE playbooks 補加 trust_score / requires_approval_level 等欄位
3. playbook_repository.py: 完整雙寫實作(PG upsert + Redis cache)
4. dynamic_baseline_service.py: Holt-Winters 訓練結果寫入 PG,Redis 只作 24h warm cache
5. log_anomaly_detector.py: Drain3 cluster template 寫入 PG(UPSERT on cluster_id)
6. main.py: 啟動時執行 backfill_redis_to_pg()(Redis → PG 冪等補救)

修正的問題:
- Playbook 7天 Redis TTL 到期 → AI 失去所有修復知識
- trust_score EWMA 隨 Redis TTL 歸零 → AI 重新回到初始信任度 0.3
- Holt-Winters 基線 24h TTL → AI 每天重新學習「正常」的定義
- Drain3 cluster 沒有持久化 → AI 把已知 log pattern 反覆當新 pattern

Phase 4 新服務(requirements.txt 已加入 statsmodels + drain3 + numpy)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 15:34:04 +08:00
OG T
f1cbf6db7d feat(adr-081): Phase 1 感官縱深 — 8D 情報蒐集 + 執行後驗證
成品:
- IncidentEvidence DB model(8D 感官 + pre/post 執行狀態)
- EvidenceSnapshot dataclass(build_summary → LLM 上下文)
- SanitizationService(Prompt Injection 0-tolerance,12 pattern)
- MCPToolRegistry(動態工具登記,suggest_tools 不寫死告警類型)
- PreDecisionInvestigator(8D 並行感官,P99 < 8s,Redis 30s 快取)
- PostExecutionVerifier(warmup 10s → 後狀態評估 success/degraded/failed)
- decision_manager + approval_execution 接線(feature flag 守衛)

Gate 1 修復:D4/D5/D7/D8 補 sanitize_dict_values;移除裸 "error" failure
signal 防 error_rate key 誤判;evidence_snapshot rowcount 零行警告。

測試:130 passed(+111 新增)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 13:08:38 +08:00
OG T
db9e304a14 feat(adr-080): Phase 0 防護欄建立 — AI 自主化飛輪啟動
- docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md
  (1456 行,§0-§8 全填完:42-cell 戰術矩陣、7 Phase 計畫、7 ADR 摘要、
   15 KPI、21 Feature Flags、10 風險場景)

- docs/adr/ADR-080-ai-autonomy-flywheel-overview.md
  (7 Phase 結構 + 4 北極星 + 7 架構師 Review Gates + Phase 退出條件)

- apps/api/src/core/feature_flags.py
  (AIOpsFeatureFlags: P1~P6 總開關全 False + 15 細粒度子開關
   is_phase_enabled() / is_sub_flag_enabled() + bool cast 安全)

- apps/api/src/jobs/__init__.py + baseline_snapshot.py
  (Phase 0 基線快照 Job:MCP calls / Playbook confidence / general 比例
   / learning loop rate / auto_repair — 寫入 aiops:baseline:latest)

- apps/api/tests/test_feature_flags.py  (21 tests — 全綠)

- docs/HARD_RULES.md → v1.9
  (新增 Phase 退出條件鐵律:禁止未過 exit conditions 宣告 Phase 完成)

- CLAUDE.md 防失憶閘門 1:強制讀 MASTER §0 Session Resume Protocol

Gate 0 Pass — 21/21 tests green

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 12:44:53 +08:00
OG T
8b7e9cbfb8 fix(BLOCKER): LLM 連續失敗 — 4 個違反設計處全部修復
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 14m21s
統帥盤點發現飛輪沉默真因:4 個違反既定架構設計的 bug 同時撞車。

P0a — Ollama timeout 違反 GAP-B4 設計
  config.py:OPENCLAW_TIMEOUT 從 120s 改 30s
  原 120s 違反 ADR-052 GAP-B4 (LLM 25s hard timeout) 設計
  致 Ollama 過載時 thread 飢餓 120s 才降級

P0b — AI Router silent skip 觀測性修復
  ai_router.py: not_registered/circuit_open/rate_limit/privacy_skip
  全部累積到 errors 陣列,log all_providers_failed 時可知為何 skip
  原本 errors=["ollama: Timeout"] 但 tried=4 個,無法診斷

P1a — send_text 方法不存在 bug
  ai_router.py:1005 tg.send_text() → tg.send_notification(parse_mode=HTML)
  TelegramGateway 只有 send_notification 沒 send_text
  致 fallback 失敗通知本身失敗(雙重靜默)

P1b — resend_stale_ready_tokens 並發爆炸
  decision_manager.py: 加 asyncio.Semaphore(5) + 200ms throttle
  原本 fire_and_forget N 個 task 同時跑,N=108 時 Ollama embedding
  全部 timeout,包括我打的 live-fire 也被擠爆
  改:max 5 並發 + 每完成喘 200ms

CD 流程審查 (Blocker 1): 完全符合 ADR-039 設計,10-15 min 是預期
不需修,是設計就需要這時間。

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-14 19:37:03 +08:00
OG T
00a31abb85 feat(heartbeat): ADR-073 P2 心跳整合重構 — HeartbeatReportService + RedisLock
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- 新增 HeartbeatReportService:11 個並行探針(Ollama/Nemotron/Gemini/Claude/MCP×4/ArgoCD/Velero)
- 重寫 send_heartbeat():RedisLock 防重發 + 統一發送 SRE_GROUP_CHAT_ID
- 簡化 _heartbeat_loop():移除散落的 silence 多次發送
- config.py:新增 OLLAMA_REQUIRED_MODELS 欄位
- 03-secrets.example.yaml:補 SRE_GROUP_CHAT_ID 確保 CD Inject 不遺漏

2026-04-12 ogt (ADR-073 Phase 2-3/4)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 15:35:13 +08:00
OG T
184b37a8b1 refactor(decision_manager): I2 DI 化 MCP Providers + fix config list type bug
- DecisionManager.__init__ 注入 SSHProvider/K8sProvider,移除函數內 import+實例化
- config.get_tg_user_whitelist() 支援 list 輸入(monkeypatch/直接傳入),修復 AttributeError
- LOGBOOK 更新(test fix 6e0ee8b)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 13:04:46 +08:00
OG T
5d78c5492b feat(argocd-mcp): 啟用 ArgoCD MCP Provider + token 注入流程
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- config.py: ARGOCD_URL → https://192.168.0.125:30443(實際 HTTPS NodePort)
- config.py: ARGOCD_MCP_ENABLED=True + SENTRY_MCP_ENABLED=True(預設啟用)
- cd.yaml: 新增 ARGOCD_API_TOKEN Gitea Secret → K8s Secret 注入步驟
- K8s: ARGOCD_API_TOKEN 已手動注入 awoooi-secrets + API pods 已 rollout restart
- ArgoCD: 已開啟 admin account apiKey capability

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 09:32:28 +08:00
OG T
a2cc985f60 feat(mcp-phase3): ArgoCD MCP + Sentry MCP + 完整 Provider 註冊
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
ArgoCDProvider (3 工具):
  - argocd_list_apps: 列出所有 App + sync/health 狀態
  - argocd_get_app_status: 詳細狀態 + 問題資源清單
  - argocd_get_sync_history: 最近 N 筆部署記錄
  - 輸入驗證: app_name 白名單 regex
  - 需 ARGOCD_API_TOKEN + ARGOCD_MCP_ENABLED=true

SentryProvider (3 工具):
  - sentry_list_issues: 列出最近 Issues(狀態過濾)
  - sentry_get_issue: 詳情 + stacktrace 最後 5 frames
  - sentry_search_issues: PromQL 風格搜尋
  - issue_id 白名單驗證(只允許純數字)
  - 需 SENTRY_AUTH_TOKEN + SENTRY_MCP_ENABLED=true

providers/__init__.py: 補上 Prometheus + SSH + ArgoCD + Sentry 全部 10 個 providers
config.py: 新增 ARGOCD_URL / ARGOCD_API_TOKEN / ARGOCD_MCP_ENABLED / SENTRY_MCP_ENABLED

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 09:11:53 +08:00
OG T
6351e9a0e9 feat(mcp-phase2): MCP Phase 2 — Prometheus MCP + SSH MCP + alert labels
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 13m37s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 35s
MCP-2b: prometheus_provider.py
  - prometheus_query (PromQL 即時查詢)
  - prometheus_query_range (歷史趨勢,預設 15 分鐘)
  - prometheus_get_alert_history (告警觸發歷史)
  - config: PROMETHEUS_URL + PROMETHEUS_MCP_ENABLED

MCP-2a: ssh_provider.py
  - 群組A 9 個只讀診斷工具 (top/disk/memory/logs/status/port/nginx/swap)
  - 群組B 6 個安全操作工具 (restart/compose/systemctl/clear-log/ssl/nginx-reload)
  - 四層安全守衛 (白名單/allowed_hosts/forbidden_patterns/trust_score)
  - config: SSH_MCP_ENABLED + SSH_MCP_ALLOWED_HOSTS

K8s: 04-ssh-mcp-secret.example.yaml (ssh-mcp-key Secret 範本 + 建立步驟)

Alert labels: alerts-unified.yml 補充 mcp_provider/host_type/alert_category
  覆蓋: HostHighCpuLoad/HostOutOfMemory/HostOutOfDiskSpace/DockerContainer*
        SignOzDown/SentryDown/HarborDown/GiteaDown

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 02:35:35 +08:00
OG T
7768924fea fix(flywheel): 自動修復後移除 Telegram 按鈕 + 心跳告警排除飛輪
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 6m56s
問題: 自動修復成功後 Telegram 卡片仍顯示批准/拒絕/靜默按鈕

Fix 1 — Telegram 卡片回饋閉環 (積木化合規):
- telegram_gateway.send_approval_card: 發送後自動存 tg_approval:{id} 到 Redis
- telegram_gateway.mark_auto_repaired(): 新方法 — 移除按鈕 + reply 結果
- _try_auto_repair_background: 改呼叫 gateway.mark_auto_repaired() (Service 層)

Fix 2 — 心跳/看門狗告警排除飛輪:
- constants.py: is_heartbeat_alertname() + HEARTBEAT_ALERT_NAMES
- NoAlertsReceived2Hours 等不觸發 _try_auto_repair_background

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 11:52:04 +08:00
OG T
8c2983b70a fix(api+web): CORS 補 K3s NodePort origins + sign 補 signer_id/name
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
CORS (config.py):
- 補 http://192.168.0.125:32335 (K3s VIP NodePort)
- 補 http://192.168.0.120:32335 + 121:32335 (K3s nodes)
- 修前: 內網瀏覽器開 :32335 打 API 全 CORS blocked
  (incidents Failed to fetch / monitoring 無法連線根因)

sign body (pending-approvals-card.tsx):
- signer: 'web-ui' → signer_id: CURRENT_USER.id + signer_name: CURRENT_USER.name
- 修前: POST /approvals/{id}/sign 回 403 (缺必填欄位 422 誤報為 403)
  — 實際是 422 Field required signer_id + signer_name

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 19:50:48 +08:00
OG T
7857c25677 feat: Ollama 本機 Tool Calling 取代 NVIDIA 雲端 (44s→~5s)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- nvidia_provider.py: 新增 OllamaToolProvider
  - 實作 INvidiaProvider protocol,打 Ollama /v1/chat/completions
  - 模型: llama3.1:8b (tool calling 最穩定的 8B)
  - 延遲: 44s → ~5s(本機 M1 Pro 192.168.0.111)
  - get_nvidia_provider() 根據 USE_OLLAMA_TOOL_CALLING 切換
- config.py: USE_OLLAMA_TOOL_CALLING=True (預設開啟), OLLAMA_TOOL_MODEL=llama3.1:8b
- 回退: USE_OLLAMA_TOOL_CALLING=False → 恢復 NvidiaProvider 雲端

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 14:55:04 +08:00
OG T
8b5db2f58e feat(infra): 切換 Ollama 到 M1 Pro 192.168.0.111 + NetworkPolicy 更新
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- OLLAMA_URL: 188 → 111 (M1 Pro, 40+ tok/s vs 0.45 tok/s)
- OPENCLAW_DEFAULT_MODEL: qwen2.5:7b-instruct → deepseek-r1:14b (SRE最強推理)
- OPENCLAW_TIMEOUT: 90s → 120s (deepseek-r1:14b 實測最慢 54s)
- NetworkPolicy v1.3: 新增 192.168.0.111:11434 egress,移除 188 的 Ollama port

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 22:05:14 +08:00
OG T
84f1f9f021 refactor(config): GITHUB_WEBHOOK_SECRET → GITEA_WEBHOOK_SECRET (ADR-059) 2026-04-05 14:25:47 +08:00
OG T
5ad403b287 fix(p0): v4.3 — 實測確認 Ollama CPU-only 不可用,DIAGNOSE 統一走 NIM
實測依據 (2026-04-05):
- Ollama llama3.2:3b CPU-only: 238s 回 {"ok":true},生產不可用
- Nemotron NIM: 2.2s~27.3s,avg 10.6s,一直是主力(Phase 22 起)
- NIM 從未有隱私問題,Incident 資料一直送雲端 GPU

變更:
- ai_router.py: _local_fallback_chain 廢棄(空 list)
- ai_router.py: DIAGNOSE route/route_sync 改回 _full_fallback_chain
- config.py: 更新 timeout 說明反映實測結果
- test_p0_diagnose_routing.py: 更新 docstring

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 01:49:06 +08:00
OG T
a81bf50537 feat(drift): ADR-057 adopt() Gitea PR API 實作
- DriftAdoptService: 透過 Gitea REST API 建立 branch + commit + PR
  不在 API Pod 內執行 git(修復 C2 安全漏洞)
- adopt() 端點: 501 → 真實實作(呼叫 DriftAdoptService)
- config.py: 新增 GITEA_API_URL / GITEA_API_TOKEN / GITEA_REPO_OWNER / GITEA_REPO_NAME
- K8s secret awoooi-secrets 已注入 GITEA_API_TOKEN
- drift.py: 移除 trigger_drift_scan 中未使用的 interpreter 變數

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 00:39:29 +08:00
OG T
96d5e18924 fix(p0): 實測修正 — timeout 依 benchmark 調整,_local_fallback_chain 移除雲端 Nemotron
- config.py: NEMOTRON_DIAGNOSE_TIMEOUT_SECONDS=60s (NIM 實測 11-45s + 15s buffer)
- config.py: OLLAMA_DIAGNOSE_TIMEOUT_SECONDS=200s (Ollama 實測 ~173s + 27s buffer)
- ollama.py: 新增 per-task timeout (diagnose/force_local 用 200s)
- ai_router.py: _local_fallback_chain 移除 Nemotron (NIM=雲端,不可進 local chain)
- ai_router.py: v4.2 — Option C 分情境路由正式確立

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 00:29:09 +08:00
OG T
3455044457 feat(phase25): Nemotron 主動防禦三方向 P0+P1+P2 完整實作
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 38s
Type Sync Check / check-type-sync (push) Failing after 35s
P0 - DIAGNOSE Privacy-First Routing:
- ai_router.py: _local_fallback_chain [NEMOTRON→OLLAMA→REJECT]
- DIAGNOSE 意圖 override 改為 NEMOTRON (原 OLLAMA)
- DIAGNOSE fallback 使用 local-only 鏈,不觸碰雲端
- 全部失敗時 REJECT + Telegram 通知
- config.py: NEMOTRON_DIAGNOSE_TIMEOUT_SECONDS=30, OLLAMA_DIAGNOSE_TIMEOUT_SECONDS=60
- nemotron.py: 根據 context[task_type] 選擇 timeout

P1 - Knowledge Auto-Harvesting:
- models/knowledge.py: EntryType.AUTO_RUNBOOK + ANTI_PATTERN + symptoms_hash
- EntryStatus.PUBLISHED (ANTI_PATTERN 直接發布,無需審核)
- models/playbook.py: SymptomPattern.compute_hash() (16字元確定性 hash)
- services/runbook_generator.py: NemotronRunbookGenerator (v1.1)
  - generate_runbook() → AUTO_RUNBOOK (DRAFT) + Telegram 審核 card
  - generate_anti_pattern() → ANTI_PATTERN (PUBLISHED) + Telegram 通知
  - 使用 nvidia.chat() (正確介面),Nemotron 超時時 Minimal fallback
- knowledge_service.py: check_anti_pattern(symptoms_hash, days=7)
- db/models.py: symptoms_hash VARCHAR(16) + ix_knowledge_symptoms_hash
- repositories/knowledge_repository.py: create() 支援 symptoms_hash + status
- auto_repair_service.py: anti_pattern_gate 在 decide() + runbook hook 在 execute()
- migrations/phase8_symptoms_hash.sql: ALTER TABLE + partial index + PUBLISHED constraint

P2 - Config Drift Detection:
- models/drift.py: DriftItem/DriftReport/DriftLevel/DriftIntent/DriftStatus
- services/drift_detector.py: GitStateReader + K8sStateReader + DriftDetector
- services/drift_analyzer.py: 白名單過濾 + DriftLevel 分級
- services/drift_interpreter.py: NemotronDriftInterpreter(意圖分析,不生成修復指令)
- services/drift_remediator.py: rollback(kubectl apply) + adopt(git push gitea)
- api/v1/drift.py: POST /scan, GET /reports, POST /rollback, POST /adopt
- migrations/phase9_drift_reports.sql: drift_reports 表
- k8s/drift-cronjob.yaml: 每小時自動掃描 CronJob

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 12:35:05 +08:00
OG T
c65ed5b1c9 feat(telegram): SRE 戰情室群組三頭政治 Triumvirate (ADR-053)
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 7m6s
- config.py: 新增 OPENCLAW_BOT_TOKEN / NEMOTRON_BOT_TOKEN / SRE_GROUP_CHAT_ID
- telegram_gateway.py: send_to_group / send_as_openclaw / send_as_nemotron / trigger_group_ai_discussion / _send_approval_card_to_group
- send_approval_card 告警發送後非同步觸發群組 AI 雙向討論
- configmap: SRE_GROUP_CHAT_ID=-1003711974679
- secrets: OPENCLAW_BOT_TOKEN / NEMOTRON_BOT_TOKEN CHANGE_ME 佔位

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 17:16:05 +08:00
OG T
73e8f8ab77 feat(ai): Phase 24-A+B1 — AI Provider Registry + 絞殺者包裝 (ADR-052)
Some checks failed
E2E Health Check / e2e-health (push) Successful in 16s
CD Pipeline / build-and-deploy (push) Has been cancelled
Brain Layer 雙軌 Registry 架構:
- 新建 src/services/ai_providers/ 目錄 (interfaces + 4 providers)
  - OllamaProvider (local, rca/chat/code_review)
  - GeminiProvider (cloud, rca/chat)
  - ClaudeProvider (cloud, rca/chat/code_review)
  - OpenClawNemoProvider (cloud, rca — 委派 188→NIM)
- 擴展 ai_router.py 加入:
  - AIProviderRegistry (動態註冊/啟停)
  - AIRouterExecutor (Cache + 閘門 CB/RL/Sem + 執行)
- openclaw.py 絞殺者包裝: USE_AI_ROUTER=true 走新路徑
- config.py + ConfigMap 加入 USE_AI_ROUTER=false (安全預設)
- ADR-052 正式文件 (14 項決策 D1-D14)
- HARD_RULES v1.7 加入 AI Router 規範

安全: USE_AI_ROUTER=false 預設不啟用,需手動開啟觀察
回滾: kubectl set env deployment/awoooi-api USE_AI_ROUTER=false

2026-04-02 ogt: Phase 24 首批實作

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 13:16:09 +08:00
OG T
c9c60c3a61 feat(mcp-integrations): Phase S 架構修復 + MCP 整合基礎建設
Some checks failed
E2E Health Check / e2e-health (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
Type Sync Check / check-type-sync (push) Failing after 22s
Phase S 技術債修復 (首席架構師審查 82→完整):
- S-01: generate_alert_fingerprint 移至 AlertAnalyzer.generate_fingerprint() staticmethod
- S-04: 移除 Pydantic v2 deprecated json_encoders (直接用原生 datetime 序列化)

Sentry MCP 整合 (Phase 23):
- ADR-048: Sentry→OpenClaw AI Triage 架構決策
- sentry_webhook_service.py: parse/analyze/create_incident/build_message Service 層
- config.py: SENTRY_WEBHOOK_SECRET (Fail-Closed HMAC-SHA256)

Playwright MCP 整合 (短期):
- smoke.spec.ts: 5 頁面 E2E smoke test (home/dashboard/incidents/approvals/terminal)
- cd.yaml: E2E Smoke Test 步驟 + Telegram 🎭 Smoke 狀態通知

長期規劃 ADR:
- ADR-049: Figma Code Connect 設計系統同步
- ADR-050: Telegram 互動式 Incident 2.0 (6鍵 Inline Keyboard)
- ADR-051: Context7 依賴升級顧問 (Next.js 14→15, FastAPI 0.115→0.128)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 16:20:57 +08:00
OG T
394f85954e fix(api): 修復 Y/n 404 + 停用 Multi-Sig
Some checks failed
E2E Health Check / e2e-health (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
1. proposal_service._load_incident() 改用 incident_service.get_from_working_memory()
   - brain engine 使用 awoooi:incidents: prefix,資料實際在 incident: prefix
   - 兩個 prefix 不符導致永遠 404 (Y/n 按鈕全部失敗)
   - 2026-04-02 ogt

2. trust_engine CRITICAL required_signatures 2→1
   - 統帥決策: 所有審核只需 1 層簽核
   - 2026-04-02 ogt

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 16:16:28 +08:00
OG T
eccf61fbc9 fix(ai): 修復假信心度 + 解除 Shadow Mode (Phase 22 P1)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
E2E Health Check / e2e-health (push) Has been cancelled
1. openclaw.py: LLM 截斷時 confidence 0.82→0.0 (禁止偽造信心度)
2. prompts.py: NEMOTRON schema 範例值改用佔位符,防模型照抄 0.75
3. configmap: SHADOW_MODE_ENABLED=false,開放 low 風險自動執行
   條件門檻: confidence≥90% + trust_score≥5 + playbook_success≥95%

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 15:59:42 +08:00
OG T
0fd53422c6 fix(openclaw): NEMOTRON_SYSTEM_PROMPT confidence/reasoning 移至最前
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 5m36s
E2E Health Check / e2e-health (push) Successful in 17s
Nemo-4B 4B 參數模型輸出長度有限,confidence/reasoning 排在 schema 末尾
時常被截斷,導致 openclaw.py:1045 fallback 補 0.82 假數據。

修復:將 confidence 和 reasoning 移至 schema 最前兩個欄位,確保模型
輸出截斷時仍包含最關鍵欄位。同時明確禁止模型抄範例值。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 13:19:18 +08:00
OG T
22de22c989 refactor(phase-s): Phase S 技術債清理 - 五項架構改善
S-01: generate_alert_fingerprint() 移至 alert_analyzer_service (Router→Service)
S-02: 移除廢棄 USE_NEW_ENGINE config (Phase R 已完成歷史使命)
S-03: github_webhook.py linter 清理 (Field unused + delivery_id noqa)
S-04: Pydantic v2 遷移 - approval/incident models (class Config → ConfigDict)
S-05: Skill 09 v1.1 更新 (USE_NEW_ENGINE 廢棄說明)

測試: 393 passed, 零失敗

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 13:12:02 +08:00
OG T
2ba61acf72 fix(api): Phase R-R2.2 首席架構師 72/100 P2 修復
P2-01 signal_worker.py: persisted_to_pg 改用 getattr 防 BrainIncident AttributeError
P2-02 IIncidentEngine Protocol: update_incident_status → update_status 對齊 brain 實作
P2-03 config.py USE_NEW_ENGINE: 標記失效 + 回滾路徑更正 (git revert 而非 kubectl)
ADR-046: Option B (IncidentConverter) 決策完成,待實作清單更新
ADR-024: 審查結論 + 正式回滾指令更新
Skill 02: v2.5 版本記錄

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-31 22:33:08 +08:00
OG T
dd526684ab feat(ai): Phase 22 OpenClaw + Nemotron 協作架構 (ADR-044)
All checks were successful
E2E Health Check / e2e-health (push) Successful in 17s
統帥批准實作「仲裁-執行分工」架構:
- OpenClaw = 仲裁者 (Why + Risk Level)
- Nemotron = 執行者 (How + kubectl Command)

新增功能:
- config.py: ENABLE_NEMOTRON_COLLABORATION Feature Flag
- openclaw.py: generate_incident_proposal_with_tools()
- openclaw.py: _call_nemotron_tools() Nemotron 呼叫
- telegram_gateway.py: TelegramMessage Nemotron 欄位
- telegram_gateway.py: format_with_nemotron() 雙區塊格式
- decision_manager.py: 整合協作方法
- proposal_service.py: 整合協作方法

觸發條件:
- LOW 風險 → 僅 OpenClaw
- MEDIUM/HIGH/CRITICAL → OpenClaw + Nemotron 雙軌

首席架構師審查: 83/100 條件通過

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-31 18:52:53 +08:00
OG T
d03668669b fix(openclaw): optimize for Nemo-4B with lightweight prompt and resilient parsing
All checks were successful
E2E Health Check / e2e-health (push) Successful in 26s
2026-03-31 15:59:58 +08:00
OG T
fb0ddf305c fix(api): fix dockerfile to include models.json, remove huge prompt example to fit 4K limit
All checks were successful
E2E Health Check / e2e-health (push) Successful in 17s
2026-03-31 14:03:34 +08:00
OG T
46843c8e19 fix(nvidia): revert to nemotron-mini, truncate context for 4K limit, enforce precise confidence
All checks were successful
E2E Health Check / e2e-health (push) Successful in 17s
2026-03-31 13:57:10 +08:00
OG T
bb85d89874 refactor(api): Phase A P1 快速勝利 (3 項)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
E2E Health Check / e2e-health (push) Has been cancelled
1. 常數提取: SSE_DELAY_SECONDS, MAX_APPROVAL_DISPLAY
2. 錯誤訊息安全化: sanitize_error_message() 移除敏感資訊
3. CI/CD alertname 配置化: is_cicd_alertname() 函數

首席架構師審查 P1 改進 (非阻塞)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-30 01:44:42 +08:00
OG T
27509db212 feat(api): Wave 1 安全網 - Circuit Breaker + Global Repair Cooldown
ADR-038: OpenClaw 雙層保護
- Layer 1: Circuit Breaker (5 failures → 60s cooldown)
- Layer 2: Concurrency Semaphore (max 3 concurrent)
- 新增 src/core/circuit_breaker.py

ADR-039: 全域修復熔斷
- Global Cooldown: 5 repairs/15min → freeze
- StatefulSet Blacklist: postgres/redis/clickhouse 禁止自動重啟
- 新增 src/services/global_repair_cooldown.py
- 整合到 auto_repair_service.py

測試:
- test_circuit_breaker.py (狀態轉換 + Semaphore)
- test_global_repair_cooldown.py (黑名單 + 計數閾值)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 15:48:03 +08:00
OG T
d89f0520f9 fix(api): 修復 34 個 Ruff lint 錯誤
- 自動修復 import 排序、unused imports
- 手動修復 raise from、isinstance union、unused variable
- scripts/ 暫時保留 (非 CI 阻擋)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 15:27:49 +08:00
OG T
b77e151387 feat(ai): ADR-036 NVIDIA Nemotron Tool Calling 整合
Phase 20 - 提升 Tool Calling 精準度 50% → 83.3%

新增:
- src/models/nvidia.py: Pydantic Schema
- src/services/nvidia_provider.py: NvidiaProvider 類別
- tests/test_nvidia_provider.py: 15 項單元測試 (全部通過)

修改:
- ai_router.py: AIProvider.NVIDIA + route_tool_calling()
- ai_rate_limiter.py: NVIDIA 限制 (5 RPM, 100/day)
- models.json: NVIDIA 配置
- cd.yaml: Secrets 注入 NVIDIA_API_KEY

路由策略:
- Tool Calling: Nemotron → Gemini → Claude
- 一般對話: Ollama → Gemini → Claude (不變)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-29 00:00:08 +08:00
OG T
d206460751 feat(security): Phase 20 CSRF 防護實作
Phase 19 首席架構師審查指出: 核鑰 UX 安全性缺 CSRF 防護

後端:
- 新增 src/core/csrf.py (Double Submit Cookie 模式)
- 新增 src/api/v1/csrf.py (GET /api/v1/csrf/token)
- 新增 src/models/csrf.py (CSRFTokenResponse)
- 修改 approvals.py sign/reject/bulk 端點加入 CSRFToken 驗證

前端:
- 新增 hooks/useCSRF.ts (React Hook)
- 修改 approval.store.ts 整合 CSRF Token 參數

安全特性:
- 256-bit Token (secrets.token_hex)
- 時序安全比較 (secrets.compare_digest)
- SameSite=Strict Cookie
- 1 小時 Token 有效期

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-28 18:31:58 +08:00
OG T
e5ded3b3f2 feat(phase19): OmniTerminal + GenUI + Hybrid SSE 架構實作 (Wave 0-2)
Phase 19 OmniTerminal MVP 完成:
- Wave 0: Backend (Hybrid SSE POST→GET 架構)
- Wave 1: Frontend (OmniTerminal 狀態機 + GenUI Registry)
- Wave 2: UI 組件 (8 個 GenUI 動態卡片)

ADR 文檔:
- ADR-031: OmniTerminal SSE 架構
- ADR-032: GenUI 動態渲染框架
- ADR-033: K3s HA 架構設計

GenUI 組件:
- GenUIRenderer, K8sPodStatusCard, SentryErrorCard
- MetricsSummaryCard, IncidentTimelineCard
- TraceWaterfallCard, ApprovalCard, NuclearKeyButton

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-28 00:17:26 +08:00
OG T
801b08a4b7 fix(api): AI_FALLBACK_ORDER 無法正確解析 JSON 格式
根因: ConfigMap 用 JSON '["gemini","ollama","claude"]'
但 validator 用 split(",") 解析,導致無法匹配任何 provider
結果永遠用 default ["ollama","gemini","claude"]

影響: /api/v1/incidents 超時 (Ollama CPU 推理慢)

修復: 新增 JSON 格式支援,優先嘗試 json.loads()

這是根因修復,不是重啟!

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 20:10:56 +08:00
OG T
00e2c94a8e ci: API 分層檢查 + LLM 測試移至 Nightly
CI 強化:
- 新增 API Layer Check (#96): services/repositories/models 分層規則
- LLM 測試移至 nightly-llm.yaml (CPU 推理 ~300s/測試)

分層規則:
- services 禁止引用 api/routers
- repositories 禁止引用 services
- models 禁止引用業務層

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 19:10:30 +08:00
OG T
a9f8ad56c1 chore: 未提交變更整理 (API core + docs + scripts)
API 核心:
- constants.py: 系統常量定義
- unit_of_work.py: Unit of Work 模式
- incident_approval_service.py: Incident-Approval 同步服務

文檔更新:
- LOGBOOK.md: 進度更新
- AWOOOI_AGENTIC_WORKSPACE_ROADMAP.md: 路線圖
- 2026-03-26_llm_testing_evaluation.md: LLM 測試評估
- phase5_telemetry_architecture.md: 遙測架構
- SECRETS_REFERENCE.md: 密鑰參考

配置/腳本:
- Skill 02 v1.x: leWOOOgo 後端更新
- .dependency-cruiser.cjs: 依賴規則
- demo-multisig-flow.sh: 演示腳本

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 19:10:12 +08:00
OG T
30153496d1 fix(api): 修復全部 lint 錯誤 (ruff --fix)
- Import sorting (I001)
- Unused imports (F401)
- f-string without placeholders (F541)
- Loop variable unused (B007)
- zip() strict parameter (B905)
- Exception chaining (B904)
- collections.abc imports (UP035)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 16:06:20 +08:00
OG T
e58da5c534 feat(api): Phase 13.2 #83 Grafana MCP Tool
New MCP provider for Grafana dashboard integration:
- list_dashboards: List available dashboards with filtering
- get_dashboard: Get dashboard details by UID
- get_panel_data: Query panel data via Grafana Query API
- generate_dashboard_url: Generate shareable dashboard URLs

Security:
- API key authentication (Bearer token)
- Dashboard UID validation (alphanumeric + dash/underscore)
- Read-only operations only
- 30s request timeout

Config:
- GRAFANA_URL (default: http://192.168.0.188:3000)
- GRAFANA_API_KEY

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 15:36:17 +08:00
OG T
30f045bf28 feat: ADR-019 System Prompt 集中管理 + Nightly LLM Workflow
新增:
- docs/adr/ADR-019-system-prompt-management.md - System Prompt 規範
- apps/api/src/core/prompts.py - 集中管理 System Prompts
- .github/workflows/nightly-llm.yaml - 每夜 LLM 迴歸測試

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 12:27:47 +08:00
OG T
46ab6a838a fix(api): 修復 ruff lint 錯誤
- langfuse_client.py: import Callable from collections.abc
- telemetry.py: import block 格式化

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 09:27:00 +08:00
OG T
b6cff31653 feat(api): Phase 15.3 Deep Linking 三系統互連
實現 Sentry ↔ SignOz ↔ Langfuse 零斷鏈觀測:

新增 deep_linking.py:
- SignOz Trace URL 生成器
- Langfuse Trace URL 生成器
- Sentry Issue URL 生成器
- get_all_links() 統一取得所有連結

整合點:
- main.py: Sentry before_send 注入 otel_trace_id + signoz_trace_url
- langfuse_client.py: 自動注入 OTEL trace_id 到 metadata
- openclaw.py: SignOz span 記錄 langfuse.trace_id 反向連結

架構圖:
┌─────────┐ trace_id ┌─────────┐ trace_id ┌──────────┐
│ Sentry  │◄────────►│ SignOz  │◄────────►│ Langfuse │
│ Errors  │          │ Traces  │          │ LLMOps   │
└─────────┘          └─────────┘          └──────────┘

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 00:48:28 +08:00
OG T
0d31ccb911 feat(api): Phase 15.2 Redis Trace Context 傳遞
實現 Redis Streams 跨服務追蹤零斷鏈:
- telemetry.py: 新增 get_trace_context() + restore_trace_context()
- webhooks.py: Producer 注入 _trace_id, _span_id 到 Redis
- signal_worker.py: Consumer 還原 Trace Context 建立子 Span

架構: API → Redis Streams → Worker 完整追蹤鏈
格式: W3C Trace Context (traceparent)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 00:40:20 +08:00
OG T
1ac8965a7a feat(api): Phase 15.1 Langfuse LLMOps 整合 + 模型升級
## 新功能
- Langfuse 自建部署 (192.168.0.110:3100)
- langfuse_client.py - LLM 呼叫追蹤包裝
- OpenClaw 整合 Langfuse trace

## 模型升級 (統帥批准)
- 生產預設: llama3.2:3b → qwen2.5:7b-instruct
- 摘要任務: llama3.2:3b (速度優先)

## 配置更新
- requirements.txt: +langfuse>=2.0.0
- config.py: +LANGFUSE_* 設定
- models.json: 更新 Ollama 模型配置
- K8s: Secret + ConfigMap 更新

## 審查通過
- 模組化檢查 
- 核心測試 31/31 

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-26 00:32:19 +08:00
OG T
a202a2693a feat(api): Phase 16 R1.2 絞殺者模式 (Strangler Fig Pattern)
- 新增 USE_NEW_ENGINE 設定開關 (預設 False)
- incident_memory.py 雙軌切換: 內嵌版本 ↔ lewooogo-brain
- 自動降級: lewooogo-brain 不可用時回退內嵌版本
- 回滾指令: kubectl set env deployment/awoooi-api USE_NEW_ENGINE=false

統帥批准 2026-03-26 立即執行

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-25 15:23:03 +08:00
OG T
22cada563b fix(config): Share Redis DB 0 with OpenClaw
- Change REDIS_URL from DB 10 to DB 0
- AWOOOI and OpenClaw now share the same Redis database
- Incidents created by OpenClaw visible in AWOOOI UI

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-24 18:44:34 +08:00
OG T
ad05bbf64c feat(api): Add human feedback API (#6) + async_utils module
Phase 6.6 人類回饋 API:
- PUT /api/v1/incidents/{id}/feedback endpoint
- effectiveness_score (1-5), human_feedback, learning_notes fields
- Sync to Redis (Working Memory) + PostgreSQL (Episodic Memory)
- For stats aggregation at /api/v1/stats/feedback/summary

async_utils module:
- fire_and_forget() for safe background tasks
- Prevents swallowed exceptions in asyncio.create_task()
- Addresses P2 #8 tech debt

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-24 14:16:17 +08:00