OG T
76558a3cd9
feat(AIOps): 全開 P1-P6 feature flags + Nemotron + offline replay loop
...
CD Pipeline / build-and-deploy (push) Has been cancelled
- configmap: 啟用 AIOPS_P1~P6 全部總開關與子開關
- configmap: ENABLE_NEMOTRON_COLLABORATION=true(回歸 120s timeout)
- feature_flags.py: 補齊 AIOPS_P6_GOVERNANCE_ENABLED 缺失欄位
- main.py: 掛載 run_offline_replay_loop(ADR-087 Phase 6)
2026-04-15 ogt + Claude Sonnet 4.6(亞太)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-15 21:59:51 +08:00
OG T
bf45b80bd2
feat(Phase 3.5 + Phase 4): AI 學習成果持久化到 PostgreSQL — 修正「AI 失憶」架構缺陷
...
CD Pipeline / build-and-deploy (push) Has been cancelled
ADR-085: AI 學習成果不可存在 Cache
架構鐵律確立:
- PostgreSQL = System of Record(AI 的永久記憶)
- Redis = Warm Cache(加速讀取,TTL 到期從 PG 復原)
核心變更:
1. models.py: 新增 PlaybookRecord / DynamicBaselineRecord / LogClusterRecord ORM
2. base.py: ALTER TABLE playbooks 補加 trust_score / requires_approval_level 等欄位
3. playbook_repository.py: 完整雙寫實作(PG upsert + Redis cache)
4. dynamic_baseline_service.py: Holt-Winters 訓練結果寫入 PG,Redis 只作 24h warm cache
5. log_anomaly_detector.py: Drain3 cluster template 寫入 PG(UPSERT on cluster_id)
6. main.py: 啟動時執行 backfill_redis_to_pg()(Redis → PG 冪等補救)
修正的問題:
- Playbook 7天 Redis TTL 到期 → AI 失去所有修復知識
- trust_score EWMA 隨 Redis TTL 歸零 → AI 重新回到初始信任度 0.3
- Holt-Winters 基線 24h TTL → AI 每天重新學習「正常」的定義
- Drain3 cluster 沒有持久化 → AI 把已知 log pattern 反覆當新 pattern
Phase 4 新服務(requirements.txt 已加入 statsmodels + drain3 + numpy)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-15 15:34:04 +08:00
OG T
f1cbf6db7d
feat(adr-081): Phase 1 感官縱深 — 8D 情報蒐集 + 執行後驗證
...
成品:
- IncidentEvidence DB model(8D 感官 + pre/post 執行狀態)
- EvidenceSnapshot dataclass(build_summary → LLM 上下文)
- SanitizationService(Prompt Injection 0-tolerance,12 pattern)
- MCPToolRegistry(動態工具登記,suggest_tools 不寫死告警類型)
- PreDecisionInvestigator(8D 並行感官,P99 < 8s,Redis 30s 快取)
- PostExecutionVerifier(warmup 10s → 後狀態評估 success/degraded/failed)
- decision_manager + approval_execution 接線(feature flag 守衛)
Gate 1 修復:D4/D5/D7/D8 補 sanitize_dict_values;移除裸 "error" failure
signal 防 error_rate key 誤判;evidence_snapshot rowcount 零行警告。
測試:130 passed(+111 新增)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-15 13:08:38 +08:00
OG T
db9e304a14
feat(adr-080): Phase 0 防護欄建立 — AI 自主化飛輪啟動
...
- docs/superpowers/specs/2026-04-15-MASTER-ai-autonomous-flywheel-v2.md
(1456 行,§0-§8 全填完:42-cell 戰術矩陣、7 Phase 計畫、7 ADR 摘要、
15 KPI、21 Feature Flags、10 風險場景)
- docs/adr/ADR-080-ai-autonomy-flywheel-overview.md
(7 Phase 結構 + 4 北極星 + 7 架構師 Review Gates + Phase 退出條件)
- apps/api/src/core/feature_flags.py
(AIOpsFeatureFlags: P1~P6 總開關全 False + 15 細粒度子開關
is_phase_enabled() / is_sub_flag_enabled() + bool cast 安全)
- apps/api/src/jobs/__init__.py + baseline_snapshot.py
(Phase 0 基線快照 Job:MCP calls / Playbook confidence / general 比例
/ learning loop rate / auto_repair — 寫入 aiops:baseline:latest)
- apps/api/tests/test_feature_flags.py (21 tests — 全綠)
- docs/HARD_RULES.md → v1.9
(新增 Phase 退出條件鐵律:禁止未過 exit conditions 宣告 Phase 完成)
- CLAUDE.md 防失憶閘門 1:強制讀 MASTER §0 Session Resume Protocol
Gate 0 Pass — 21/21 tests green
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-15 12:44:53 +08:00
OG T
8b7e9cbfb8
fix(BLOCKER): LLM 連續失敗 — 4 個違反設計處全部修復
...
CD Pipeline / build-and-deploy (push) Successful in 14m21s
統帥盤點發現飛輪沉默真因:4 個違反既定架構設計的 bug 同時撞車。
P0a — Ollama timeout 違反 GAP-B4 設計
config.py:OPENCLAW_TIMEOUT 從 120s 改 30s
原 120s 違反 ADR-052 GAP-B4 (LLM 25s hard timeout) 設計
致 Ollama 過載時 thread 飢餓 120s 才降級
P0b — AI Router silent skip 觀測性修復
ai_router.py: not_registered/circuit_open/rate_limit/privacy_skip
全部累積到 errors 陣列,log all_providers_failed 時可知為何 skip
原本 errors=["ollama: Timeout"] 但 tried=4 個,無法診斷
P1a — send_text 方法不存在 bug
ai_router.py:1005 tg.send_text() → tg.send_notification(parse_mode=HTML)
TelegramGateway 只有 send_notification 沒 send_text
致 fallback 失敗通知本身失敗(雙重靜默)
P1b — resend_stale_ready_tokens 並發爆炸
decision_manager.py: 加 asyncio.Semaphore(5) + 200ms throttle
原本 fire_and_forget N 個 task 同時跑,N=108 時 Ollama embedding
全部 timeout,包括我打的 live-fire 也被擠爆
改:max 5 並發 + 每完成喘 200ms
CD 流程審查 (Blocker 1): 完全符合 ADR-039 設計,10-15 min 是預期
不需修,是設計就需要這時間。
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2026-04-14 19:37:03 +08:00
OG T
00a31abb85
feat(heartbeat): ADR-073 P2 心跳整合重構 — HeartbeatReportService + RedisLock
...
CD Pipeline / build-and-deploy (push) Has been cancelled
- 新增 HeartbeatReportService:11 個並行探針(Ollama/Nemotron/Gemini/Claude/MCP×4/ArgoCD/Velero)
- 重寫 send_heartbeat():RedisLock 防重發 + 統一發送 SRE_GROUP_CHAT_ID
- 簡化 _heartbeat_loop():移除散落的 silence 多次發送
- config.py:新增 OLLAMA_REQUIRED_MODELS 欄位
- 03-secrets.example.yaml:補 SRE_GROUP_CHAT_ID 確保 CD Inject 不遺漏
2026-04-12 ogt (ADR-073 Phase 2-3/4)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-12 15:35:13 +08:00
OG T
184b37a8b1
refactor(decision_manager): I2 DI 化 MCP Providers + fix config list type bug
...
- DecisionManager.__init__ 注入 SSHProvider/K8sProvider,移除函數內 import+實例化
- config.get_tg_user_whitelist() 支援 list 輸入(monkeypatch/直接傳入),修復 AttributeError
- LOGBOOK 更新(test fix 6e0ee8b)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-12 13:04:46 +08:00
OG T
5d78c5492b
feat(argocd-mcp): 啟用 ArgoCD MCP Provider + token 注入流程
...
CD Pipeline / build-and-deploy (push) Has been cancelled
- config.py: ARGOCD_URL → https://192.168.0.125:30443(實際 HTTPS NodePort)
- config.py: ARGOCD_MCP_ENABLED=True + SENTRY_MCP_ENABLED=True(預設啟用)
- cd.yaml: 新增 ARGOCD_API_TOKEN Gitea Secret → K8s Secret 注入步驟
- K8s: ARGOCD_API_TOKEN 已手動注入 awoooi-secrets + API pods 已 rollout restart
- ArgoCD: 已開啟 admin account apiKey capability
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-11 09:32:28 +08:00
OG T
a2cc985f60
feat(mcp-phase3): ArgoCD MCP + Sentry MCP + 完整 Provider 註冊
...
CD Pipeline / build-and-deploy (push) Has been cancelled
ArgoCDProvider (3 工具):
- argocd_list_apps: 列出所有 App + sync/health 狀態
- argocd_get_app_status: 詳細狀態 + 問題資源清單
- argocd_get_sync_history: 最近 N 筆部署記錄
- 輸入驗證: app_name 白名單 regex
- 需 ARGOCD_API_TOKEN + ARGOCD_MCP_ENABLED=true
SentryProvider (3 工具):
- sentry_list_issues: 列出最近 Issues(狀態過濾)
- sentry_get_issue: 詳情 + stacktrace 最後 5 frames
- sentry_search_issues: PromQL 風格搜尋
- issue_id 白名單驗證(只允許純數字)
- 需 SENTRY_AUTH_TOKEN + SENTRY_MCP_ENABLED=true
providers/__init__.py: 補上 Prometheus + SSH + ArgoCD + Sentry 全部 10 個 providers
config.py: 新增 ARGOCD_URL / ARGOCD_API_TOKEN / ARGOCD_MCP_ENABLED / SENTRY_MCP_ENABLED
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-11 09:11:53 +08:00
OG T
6351e9a0e9
feat(mcp-phase2): MCP Phase 2 — Prometheus MCP + SSH MCP + alert labels
...
CD Pipeline / build-and-deploy (push) Successful in 13m37s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 35s
MCP-2b: prometheus_provider.py
- prometheus_query (PromQL 即時查詢)
- prometheus_query_range (歷史趨勢,預設 15 分鐘)
- prometheus_get_alert_history (告警觸發歷史)
- config: PROMETHEUS_URL + PROMETHEUS_MCP_ENABLED
MCP-2a: ssh_provider.py
- 群組A 9 個只讀診斷工具 (top/disk/memory/logs/status/port/nginx/swap)
- 群組B 6 個安全操作工具 (restart/compose/systemctl/clear-log/ssl/nginx-reload)
- 四層安全守衛 (白名單/allowed_hosts/forbidden_patterns/trust_score)
- config: SSH_MCP_ENABLED + SSH_MCP_ALLOWED_HOSTS
K8s: 04-ssh-mcp-secret.example.yaml (ssh-mcp-key Secret 範本 + 建立步驟)
Alert labels: alerts-unified.yml 補充 mcp_provider/host_type/alert_category
覆蓋: HostHighCpuLoad/HostOutOfMemory/HostOutOfDiskSpace/DockerContainer*
SignOzDown/SentryDown/HarborDown/GiteaDown
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-11 02:35:35 +08:00
OG T
7768924fea
fix(flywheel): 自動修復後移除 Telegram 按鈕 + 心跳告警排除飛輪
...
CD Pipeline / build-and-deploy (push) Failing after 6m56s
問題: 自動修復成功後 Telegram 卡片仍顯示批准/拒絕/靜默按鈕
Fix 1 — Telegram 卡片回饋閉環 (積木化合規):
- telegram_gateway.send_approval_card: 發送後自動存 tg_approval:{id} 到 Redis
- telegram_gateway.mark_auto_repaired(): 新方法 — 移除按鈕 + reply 結果
- _try_auto_repair_background: 改呼叫 gateway.mark_auto_repaired() (Service 層)
Fix 2 — 心跳/看門狗告警排除飛輪:
- constants.py: is_heartbeat_alertname() + HEARTBEAT_ALERT_NAMES
- NoAlertsReceived2Hours 等不觸發 _try_auto_repair_background
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-10 11:52:04 +08:00
OG T
8c2983b70a
fix(api+web): CORS 補 K3s NodePort origins + sign 補 signer_id/name
...
CD Pipeline / build-and-deploy (push) Has been cancelled
CORS (config.py):
- 補 http://192.168.0.125:32335 (K3s VIP NodePort)
- 補 http://192.168.0.120:32335 + 121:32335 (K3s nodes)
- 修前: 內網瀏覽器開 :32335 打 API 全 CORS blocked
(incidents Failed to fetch / monitoring 無法連線根因)
sign body (pending-approvals-card.tsx):
- signer: 'web-ui' → signer_id: CURRENT_USER.id + signer_name: CURRENT_USER.name
- 修前: POST /approvals/{id}/sign 回 403 (缺必填欄位 422 誤報為 403)
— 實際是 422 Field required signer_id + signer_name
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 19:50:48 +08:00
OG T
7857c25677
feat: Ollama 本機 Tool Calling 取代 NVIDIA 雲端 (44s→~5s)
...
CD Pipeline / build-and-deploy (push) Has been cancelled
- nvidia_provider.py: 新增 OllamaToolProvider
- 實作 INvidiaProvider protocol,打 Ollama /v1/chat/completions
- 模型: llama3.1:8b (tool calling 最穩定的 8B)
- 延遲: 44s → ~5s(本機 M1 Pro 192.168.0.111)
- get_nvidia_provider() 根據 USE_OLLAMA_TOOL_CALLING 切換
- config.py: USE_OLLAMA_TOOL_CALLING=True (預設開啟), OLLAMA_TOOL_MODEL=llama3.1:8b
- 回退: USE_OLLAMA_TOOL_CALLING=False → 恢復 NvidiaProvider 雲端
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 14:55:04 +08:00
OG T
8b5db2f58e
feat(infra): 切換 Ollama 到 M1 Pro 192.168.0.111 + NetworkPolicy 更新
...
CD Pipeline / build-and-deploy (push) Has been cancelled
- OLLAMA_URL: 188 → 111 (M1 Pro, 40+ tok/s vs 0.45 tok/s)
- OPENCLAW_DEFAULT_MODEL: qwen2.5:7b-instruct → deepseek-r1:14b (SRE最強推理)
- OPENCLAW_TIMEOUT: 90s → 120s (deepseek-r1:14b 實測最慢 54s)
- NetworkPolicy v1.3: 新增 192.168.0.111:11434 egress,移除 188 的 Ollama port
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-08 22:05:14 +08:00
OG T
84f1f9f021
refactor(config): GITHUB_WEBHOOK_SECRET → GITEA_WEBHOOK_SECRET (ADR-059)
2026-04-05 14:25:47 +08:00
OG T
5ad403b287
fix(p0): v4.3 — 實測確認 Ollama CPU-only 不可用,DIAGNOSE 統一走 NIM
...
實測依據 (2026-04-05):
- Ollama llama3.2:3b CPU-only: 238s 回 {"ok":true},生產不可用
- Nemotron NIM: 2.2s~27.3s,avg 10.6s,一直是主力(Phase 22 起)
- NIM 從未有隱私問題,Incident 資料一直送雲端 GPU
變更:
- ai_router.py: _local_fallback_chain 廢棄(空 list)
- ai_router.py: DIAGNOSE route/route_sync 改回 _full_fallback_chain
- config.py: 更新 timeout 說明反映實測結果
- test_p0_diagnose_routing.py: 更新 docstring
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-05 01:49:06 +08:00
OG T
a81bf50537
feat(drift): ADR-057 adopt() Gitea PR API 實作
...
- DriftAdoptService: 透過 Gitea REST API 建立 branch + commit + PR
不在 API Pod 內執行 git(修復 C2 安全漏洞)
- adopt() 端點: 501 → 真實實作(呼叫 DriftAdoptService)
- config.py: 新增 GITEA_API_URL / GITEA_API_TOKEN / GITEA_REPO_OWNER / GITEA_REPO_NAME
- K8s secret awoooi-secrets 已注入 GITEA_API_TOKEN
- drift.py: 移除 trigger_drift_scan 中未使用的 interpreter 變數
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-05 00:39:29 +08:00
OG T
96d5e18924
fix(p0): 實測修正 — timeout 依 benchmark 調整,_local_fallback_chain 移除雲端 Nemotron
...
- config.py: NEMOTRON_DIAGNOSE_TIMEOUT_SECONDS=60s (NIM 實測 11-45s + 15s buffer)
- config.py: OLLAMA_DIAGNOSE_TIMEOUT_SECONDS=200s (Ollama 實測 ~173s + 27s buffer)
- ollama.py: 新增 per-task timeout (diagnose/force_local 用 200s)
- ai_router.py: _local_fallback_chain 移除 Nemotron (NIM=雲端,不可進 local chain)
- ai_router.py: v4.2 — Option C 分情境路由正式確立
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-05 00:29:09 +08:00
OG T
3455044457
feat(phase25): Nemotron 主動防禦三方向 P0+P1+P2 完整實作
...
CD Pipeline / build-and-deploy (push) Failing after 38s
Type Sync Check / check-type-sync (push) Failing after 35s
P0 - DIAGNOSE Privacy-First Routing:
- ai_router.py: _local_fallback_chain [NEMOTRON→OLLAMA→REJECT]
- DIAGNOSE 意圖 override 改為 NEMOTRON (原 OLLAMA)
- DIAGNOSE fallback 使用 local-only 鏈,不觸碰雲端
- 全部失敗時 REJECT + Telegram 通知
- config.py: NEMOTRON_DIAGNOSE_TIMEOUT_SECONDS=30, OLLAMA_DIAGNOSE_TIMEOUT_SECONDS=60
- nemotron.py: 根據 context[task_type] 選擇 timeout
P1 - Knowledge Auto-Harvesting:
- models/knowledge.py: EntryType.AUTO_RUNBOOK + ANTI_PATTERN + symptoms_hash
- EntryStatus.PUBLISHED (ANTI_PATTERN 直接發布,無需審核)
- models/playbook.py: SymptomPattern.compute_hash() (16字元確定性 hash)
- services/runbook_generator.py: NemotronRunbookGenerator (v1.1)
- generate_runbook() → AUTO_RUNBOOK (DRAFT) + Telegram 審核 card
- generate_anti_pattern() → ANTI_PATTERN (PUBLISHED) + Telegram 通知
- 使用 nvidia.chat() (正確介面),Nemotron 超時時 Minimal fallback
- knowledge_service.py: check_anti_pattern(symptoms_hash, days=7)
- db/models.py: symptoms_hash VARCHAR(16) + ix_knowledge_symptoms_hash
- repositories/knowledge_repository.py: create() 支援 symptoms_hash + status
- auto_repair_service.py: anti_pattern_gate 在 decide() + runbook hook 在 execute()
- migrations/phase8_symptoms_hash.sql: ALTER TABLE + partial index + PUBLISHED constraint
P2 - Config Drift Detection:
- models/drift.py: DriftItem/DriftReport/DriftLevel/DriftIntent/DriftStatus
- services/drift_detector.py: GitStateReader + K8sStateReader + DriftDetector
- services/drift_analyzer.py: 白名單過濾 + DriftLevel 分級
- services/drift_interpreter.py: NemotronDriftInterpreter(意圖分析,不生成修復指令)
- services/drift_remediator.py: rollback(kubectl apply) + adopt(git push gitea)
- api/v1/drift.py: POST /scan, GET /reports, POST /rollback, POST /adopt
- migrations/phase9_drift_reports.sql: drift_reports 表
- k8s/drift-cronjob.yaml: 每小時自動掃描 CronJob
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-04 12:35:05 +08:00
OG T
c65ed5b1c9
feat(telegram): SRE 戰情室群組三頭政治 Triumvirate (ADR-053)
...
CD Pipeline / build-and-deploy (push) Successful in 7m6s
- config.py: 新增 OPENCLAW_BOT_TOKEN / NEMOTRON_BOT_TOKEN / SRE_GROUP_CHAT_ID
- telegram_gateway.py: send_to_group / send_as_openclaw / send_as_nemotron / trigger_group_ai_discussion / _send_approval_card_to_group
- send_approval_card 告警發送後非同步觸發群組 AI 雙向討論
- configmap: SRE_GROUP_CHAT_ID=-1003711974679
- secrets: OPENCLAW_BOT_TOKEN / NEMOTRON_BOT_TOKEN CHANGE_ME 佔位
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-03 17:16:05 +08:00
OG T
73e8f8ab77
feat(ai): Phase 24-A+B1 — AI Provider Registry + 絞殺者包裝 (ADR-052)
...
E2E Health Check / e2e-health (push) Successful in 16s
CD Pipeline / build-and-deploy (push) Has been cancelled
Brain Layer 雙軌 Registry 架構:
- 新建 src/services/ai_providers/ 目錄 (interfaces + 4 providers)
- OllamaProvider (local, rca/chat/code_review)
- GeminiProvider (cloud, rca/chat)
- ClaudeProvider (cloud, rca/chat/code_review)
- OpenClawNemoProvider (cloud, rca — 委派 188→NIM)
- 擴展 ai_router.py 加入:
- AIProviderRegistry (動態註冊/啟停)
- AIRouterExecutor (Cache + 閘門 CB/RL/Sem + 執行)
- openclaw.py 絞殺者包裝: USE_AI_ROUTER=true 走新路徑
- config.py + ConfigMap 加入 USE_AI_ROUTER=false (安全預設)
- ADR-052 正式文件 (14 項決策 D1-D14)
- HARD_RULES v1.7 加入 AI Router 規範
安全: USE_AI_ROUTER=false 預設不啟用,需手動開啟觀察
回滾: kubectl set env deployment/awoooi-api USE_AI_ROUTER=false
2026-04-02 ogt: Phase 24 首批實作
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-04-02 13:16:09 +08:00
OG T
c9c60c3a61
feat(mcp-integrations): Phase S 架構修復 + MCP 整合基礎建設
...
E2E Health Check / e2e-health (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
Type Sync Check / check-type-sync (push) Failing after 22s
Phase S 技術債修復 (首席架構師審查 82→完整):
- S-01: generate_alert_fingerprint 移至 AlertAnalyzer.generate_fingerprint() staticmethod
- S-04: 移除 Pydantic v2 deprecated json_encoders (直接用原生 datetime 序列化)
Sentry MCP 整合 (Phase 23):
- ADR-048: Sentry→OpenClaw AI Triage 架構決策
- sentry_webhook_service.py: parse/analyze/create_incident/build_message Service 層
- config.py: SENTRY_WEBHOOK_SECRET (Fail-Closed HMAC-SHA256)
Playwright MCP 整合 (短期):
- smoke.spec.ts: 5 頁面 E2E smoke test (home/dashboard/incidents/approvals/terminal)
- cd.yaml: E2E Smoke Test 步驟 + Telegram 🎭 Smoke 狀態通知
長期規劃 ADR:
- ADR-049: Figma Code Connect 設計系統同步
- ADR-050: Telegram 互動式 Incident 2.0 (6鍵 Inline Keyboard)
- ADR-051: Context7 依賴升級顧問 (Next.js 14→15, FastAPI 0.115→0.128)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-01 16:20:57 +08:00
OG T
394f85954e
fix(api): 修復 Y/n 404 + 停用 Multi-Sig
...
E2E Health Check / e2e-health (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
1. proposal_service._load_incident() 改用 incident_service.get_from_working_memory()
- brain engine 使用 awoooi:incidents: prefix,資料實際在 incident: prefix
- 兩個 prefix 不符導致永遠 404 (Y/n 按鈕全部失敗)
- 2026-04-02 ogt
2. trust_engine CRITICAL required_signatures 2→1
- 統帥決策: 所有審核只需 1 層簽核
- 2026-04-02 ogt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-01 16:16:28 +08:00
OG T
eccf61fbc9
fix(ai): 修復假信心度 + 解除 Shadow Mode (Phase 22 P1)
...
CD Pipeline / build-and-deploy (push) Has been cancelled
E2E Health Check / e2e-health (push) Has been cancelled
1. openclaw.py: LLM 截斷時 confidence 0.82→0.0 (禁止偽造信心度)
2. prompts.py: NEMOTRON schema 範例值改用佔位符,防模型照抄 0.75
3. configmap: SHADOW_MODE_ENABLED=false,開放 low 風險自動執行
條件門檻: confidence≥90% + trust_score≥5 + playbook_success≥95%
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-01 15:59:42 +08:00
OG T
0fd53422c6
fix(openclaw): NEMOTRON_SYSTEM_PROMPT confidence/reasoning 移至最前
...
CD Pipeline / build-and-deploy (push) Failing after 5m36s
E2E Health Check / e2e-health (push) Successful in 17s
Nemo-4B 4B 參數模型輸出長度有限,confidence/reasoning 排在 schema 末尾
時常被截斷,導致 openclaw.py:1045 fallback 補 0.82 假數據。
修復:將 confidence 和 reasoning 移至 schema 最前兩個欄位,確保模型
輸出截斷時仍包含最關鍵欄位。同時明確禁止模型抄範例值。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-01 13:19:18 +08:00
OG T
22de22c989
refactor(phase-s): Phase S 技術債清理 - 五項架構改善
...
S-01: generate_alert_fingerprint() 移至 alert_analyzer_service (Router→Service)
S-02: 移除廢棄 USE_NEW_ENGINE config (Phase R 已完成歷史使命)
S-03: github_webhook.py linter 清理 (Field unused + delivery_id noqa)
S-04: Pydantic v2 遷移 - approval/incident models (class Config → ConfigDict)
S-05: Skill 09 v1.1 更新 (USE_NEW_ENGINE 廢棄說明)
測試: 393 passed, 零失敗
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-01 13:12:02 +08:00
OG T
2ba61acf72
fix(api): Phase R-R2.2 首席架構師 72/100 P2 修復
...
P2-01 signal_worker.py: persisted_to_pg 改用 getattr 防 BrainIncident AttributeError
P2-02 IIncidentEngine Protocol: update_incident_status → update_status 對齊 brain 實作
P2-03 config.py USE_NEW_ENGINE: 標記失效 + 回滾路徑更正 (git revert 而非 kubectl)
ADR-046: Option B (IncidentConverter) 決策完成,待實作清單更新
ADR-024: 審查結論 + 正式回滾指令更新
Skill 02: v2.5 版本記錄
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-03-31 22:33:08 +08:00
OG T
dd526684ab
feat(ai): Phase 22 OpenClaw + Nemotron 協作架構 (ADR-044)
...
E2E Health Check / e2e-health (push) Successful in 17s
統帥批准實作「仲裁-執行分工」架構:
- OpenClaw = 仲裁者 (Why + Risk Level)
- Nemotron = 執行者 (How + kubectl Command)
新增功能:
- config.py: ENABLE_NEMOTRON_COLLABORATION Feature Flag
- openclaw.py: generate_incident_proposal_with_tools()
- openclaw.py: _call_nemotron_tools() Nemotron 呼叫
- telegram_gateway.py: TelegramMessage Nemotron 欄位
- telegram_gateway.py: format_with_nemotron() 雙區塊格式
- decision_manager.py: 整合協作方法
- proposal_service.py: 整合協作方法
觸發條件:
- LOW 風險 → 僅 OpenClaw
- MEDIUM/HIGH/CRITICAL → OpenClaw + Nemotron 雙軌
首席架構師審查: 83/100 條件通過
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2026-03-31 18:52:53 +08:00
OG T
d03668669b
fix(openclaw): optimize for Nemo-4B with lightweight prompt and resilient parsing
E2E Health Check / e2e-health (push) Successful in 26s
2026-03-31 15:59:58 +08:00
OG T
fb0ddf305c
fix(api): fix dockerfile to include models.json, remove huge prompt example to fit 4K limit
E2E Health Check / e2e-health (push) Successful in 17s
2026-03-31 14:03:34 +08:00
OG T
46843c8e19
fix(nvidia): revert to nemotron-mini, truncate context for 4K limit, enforce precise confidence
E2E Health Check / e2e-health (push) Successful in 17s
2026-03-31 13:57:10 +08:00
OG T
bb85d89874
refactor(api): Phase A P1 快速勝利 (3 項)
...
CD Pipeline / build-and-deploy (push) Has been cancelled
E2E Health Check / e2e-health (push) Has been cancelled
1. 常數提取: SSE_DELAY_SECONDS, MAX_APPROVAL_DISPLAY
2. 錯誤訊息安全化: sanitize_error_message() 移除敏感資訊
3. CI/CD alertname 配置化: is_cicd_alertname() 函數
首席架構師審查 P1 改進 (非阻塞)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2026-03-30 01:44:42 +08:00
OG T
27509db212
feat(api): Wave 1 安全網 - Circuit Breaker + Global Repair Cooldown
...
ADR-038: OpenClaw 雙層保護
- Layer 1: Circuit Breaker (5 failures → 60s cooldown)
- Layer 2: Concurrency Semaphore (max 3 concurrent)
- 新增 src/core/circuit_breaker.py
ADR-039: 全域修復熔斷
- Global Cooldown: 5 repairs/15min → freeze
- StatefulSet Blacklist: postgres/redis/clickhouse 禁止自動重啟
- 新增 src/services/global_repair_cooldown.py
- 整合到 auto_repair_service.py
測試:
- test_circuit_breaker.py (狀態轉換 + Semaphore)
- test_global_repair_cooldown.py (黑名單 + 計數閾值)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2026-03-29 15:48:03 +08:00
OG T
d89f0520f9
fix(api): 修復 34 個 Ruff lint 錯誤
...
- 自動修復 import 排序、unused imports
- 手動修復 raise from、isinstance union、unused variable
- scripts/ 暫時保留 (非 CI 阻擋)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2026-03-29 15:27:49 +08:00
OG T
b77e151387
feat(ai): ADR-036 NVIDIA Nemotron Tool Calling 整合
...
Phase 20 - 提升 Tool Calling 精準度 50% → 83.3%
新增:
- src/models/nvidia.py: Pydantic Schema
- src/services/nvidia_provider.py: NvidiaProvider 類別
- tests/test_nvidia_provider.py: 15 項單元測試 (全部通過)
修改:
- ai_router.py: AIProvider.NVIDIA + route_tool_calling()
- ai_rate_limiter.py: NVIDIA 限制 (5 RPM, 100/day)
- models.json: NVIDIA 配置
- cd.yaml: Secrets 注入 NVIDIA_API_KEY
路由策略:
- Tool Calling: Nemotron → Gemini → Claude
- 一般對話: Ollama → Gemini → Claude (不變)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2026-03-29 00:00:08 +08:00
OG T
d206460751
feat(security): Phase 20 CSRF 防護實作
...
Phase 19 首席架構師審查指出: 核鑰 UX 安全性缺 CSRF 防護
後端:
- 新增 src/core/csrf.py (Double Submit Cookie 模式)
- 新增 src/api/v1/csrf.py (GET /api/v1/csrf/token)
- 新增 src/models/csrf.py (CSRFTokenResponse)
- 修改 approvals.py sign/reject/bulk 端點加入 CSRFToken 驗證
前端:
- 新增 hooks/useCSRF.ts (React Hook)
- 修改 approval.store.ts 整合 CSRF Token 參數
安全特性:
- 256-bit Token (secrets.token_hex)
- 時序安全比較 (secrets.compare_digest)
- SameSite=Strict Cookie
- 1 小時 Token 有效期
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2026-03-28 18:31:58 +08:00
OG T
e5ded3b3f2
feat(phase19): OmniTerminal + GenUI + Hybrid SSE 架構實作 (Wave 0-2)
...
Phase 19 OmniTerminal MVP 完成:
- Wave 0: Backend (Hybrid SSE POST→GET 架構)
- Wave 1: Frontend (OmniTerminal 狀態機 + GenUI Registry)
- Wave 2: UI 組件 (8 個 GenUI 動態卡片)
ADR 文檔:
- ADR-031: OmniTerminal SSE 架構
- ADR-032: GenUI 動態渲染框架
- ADR-033: K3s HA 架構設計
GenUI 組件:
- GenUIRenderer, K8sPodStatusCard, SentryErrorCard
- MetricsSummaryCard, IncidentTimelineCard
- TraceWaterfallCard, ApprovalCard, NuclearKeyButton
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2026-03-28 00:17:26 +08:00
OG T
801b08a4b7
fix(api): AI_FALLBACK_ORDER 無法正確解析 JSON 格式
...
根因: ConfigMap 用 JSON '["gemini","ollama","claude"]'
但 validator 用 split(",") 解析,導致無法匹配任何 provider
結果永遠用 default ["ollama","gemini","claude"]
影響: /api/v1/incidents 超時 (Ollama CPU 推理慢)
修復: 新增 JSON 格式支援,優先嘗試 json.loads()
這是根因修復,不是重啟!
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2026-03-26 20:10:56 +08:00
OG T
00e2c94a8e
ci: API 分層檢查 + LLM 測試移至 Nightly
...
CI 強化:
- 新增 API Layer Check (#96 ): services/repositories/models 分層規則
- LLM 測試移至 nightly-llm.yaml (CPU 推理 ~300s/測試)
分層規則:
- services 禁止引用 api/routers
- repositories 禁止引用 services
- models 禁止引用業務層
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2026-03-26 19:10:30 +08:00
OG T
a9f8ad56c1
chore: 未提交變更整理 (API core + docs + scripts)
...
API 核心:
- constants.py: 系統常量定義
- unit_of_work.py: Unit of Work 模式
- incident_approval_service.py: Incident-Approval 同步服務
文檔更新:
- LOGBOOK.md: 進度更新
- AWOOOI_AGENTIC_WORKSPACE_ROADMAP.md: 路線圖
- 2026-03-26_llm_testing_evaluation.md: LLM 測試評估
- phase5_telemetry_architecture.md: 遙測架構
- SECRETS_REFERENCE.md: 密鑰參考
配置/腳本:
- Skill 02 v1.x: leWOOOgo 後端更新
- .dependency-cruiser.cjs: 依賴規則
- demo-multisig-flow.sh: 演示腳本
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2026-03-26 19:10:12 +08:00
OG T
30153496d1
fix(api): 修復全部 lint 錯誤 (ruff --fix)
...
- Import sorting (I001)
- Unused imports (F401)
- f-string without placeholders (F541)
- Loop variable unused (B007)
- zip() strict parameter (B905)
- Exception chaining (B904)
- collections.abc imports (UP035)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2026-03-26 16:06:20 +08:00
OG T
e58da5c534
feat(api): Phase 13.2 #83 Grafana MCP Tool
...
New MCP provider for Grafana dashboard integration:
- list_dashboards: List available dashboards with filtering
- get_dashboard: Get dashboard details by UID
- get_panel_data: Query panel data via Grafana Query API
- generate_dashboard_url: Generate shareable dashboard URLs
Security:
- API key authentication (Bearer token)
- Dashboard UID validation (alphanumeric + dash/underscore)
- Read-only operations only
- 30s request timeout
Config:
- GRAFANA_URL (default: http://192.168.0.188:3000 )
- GRAFANA_API_KEY
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2026-03-26 15:36:17 +08:00
OG T
30f045bf28
feat: ADR-019 System Prompt 集中管理 + Nightly LLM Workflow
...
新增:
- docs/adr/ADR-019-system-prompt-management.md - System Prompt 規範
- apps/api/src/core/prompts.py - 集中管理 System Prompts
- .github/workflows/nightly-llm.yaml - 每夜 LLM 迴歸測試
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2026-03-26 12:27:47 +08:00
OG T
46ab6a838a
fix(api): 修復 ruff lint 錯誤
...
- langfuse_client.py: import Callable from collections.abc
- telemetry.py: import block 格式化
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2026-03-26 09:27:00 +08:00
OG T
b6cff31653
feat(api): Phase 15.3 Deep Linking 三系統互連
...
實現 Sentry ↔ SignOz ↔ Langfuse 零斷鏈觀測:
新增 deep_linking.py:
- SignOz Trace URL 生成器
- Langfuse Trace URL 生成器
- Sentry Issue URL 生成器
- get_all_links() 統一取得所有連結
整合點:
- main.py: Sentry before_send 注入 otel_trace_id + signoz_trace_url
- langfuse_client.py: 自動注入 OTEL trace_id 到 metadata
- openclaw.py: SignOz span 記錄 langfuse.trace_id 反向連結
架構圖:
┌─────────┐ trace_id ┌─────────┐ trace_id ┌──────────┐
│ Sentry │◄────────►│ SignOz │◄────────►│ Langfuse │
│ Errors │ │ Traces │ │ LLMOps │
└─────────┘ └─────────┘ └──────────┘
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2026-03-26 00:48:28 +08:00
OG T
0d31ccb911
feat(api): Phase 15.2 Redis Trace Context 傳遞
...
實現 Redis Streams 跨服務追蹤零斷鏈:
- telemetry.py: 新增 get_trace_context() + restore_trace_context()
- webhooks.py: Producer 注入 _trace_id, _span_id 到 Redis
- signal_worker.py: Consumer 還原 Trace Context 建立子 Span
架構: API → Redis Streams → Worker 完整追蹤鏈
格式: W3C Trace Context (traceparent)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2026-03-26 00:40:20 +08:00
OG T
1ac8965a7a
feat(api): Phase 15.1 Langfuse LLMOps 整合 + 模型升級
...
## 新功能
- Langfuse 自建部署 (192.168.0.110:3100)
- langfuse_client.py - LLM 呼叫追蹤包裝
- OpenClaw 整合 Langfuse trace
## 模型升級 (統帥批准)
- 生產預設: llama3.2:3b → qwen2.5:7b-instruct
- 摘要任務: llama3.2:3b (速度優先)
## 配置更新
- requirements.txt: +langfuse>=2.0.0
- config.py: +LANGFUSE_* 設定
- models.json: 更新 Ollama 模型配置
- K8s: Secret + ConfigMap 更新
## 審查通過
- 模組化檢查 ✅
- 核心測試 31/31 ✅
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2026-03-26 00:32:19 +08:00
OG T
a202a2693a
feat(api): Phase 16 R1.2 絞殺者模式 (Strangler Fig Pattern)
...
- 新增 USE_NEW_ENGINE 設定開關 (預設 False)
- incident_memory.py 雙軌切換: 內嵌版本 ↔ lewooogo-brain
- 自動降級: lewooogo-brain 不可用時回退內嵌版本
- 回滾指令: kubectl set env deployment/awoooi-api USE_NEW_ENGINE=false
統帥批准 2026-03-26 立即執行
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2026-03-25 15:23:03 +08:00
OG T
22cada563b
fix(config): Share Redis DB 0 with OpenClaw
...
- Change REDIS_URL from DB 10 to DB 0
- AWOOOI and OpenClaw now share the same Redis database
- Incidents created by OpenClaw visible in AWOOOI UI
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2026-03-24 18:44:34 +08:00
OG T
ad05bbf64c
feat(api): Add human feedback API ( #6 ) + async_utils module
...
Phase 6.6 人類回饋 API:
- PUT /api/v1/incidents/{id}/feedback endpoint
- effectiveness_score (1-5), human_feedback, learning_notes fields
- Sync to Redis (Working Memory) + PostgreSQL (Episodic Memory)
- For stats aggregation at /api/v1/stats/feedback/summary
async_utils module:
- fire_and_forget() for safe background tasks
- Prevents swallowed exceptions in asyncio.create_task()
- Addresses P2 #8 tech debt
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2026-03-24 14:16:17 +08:00