OG T
eee6f06215
feat(auto-repair): force every operation to be written to the DB (auto_repair_executions table)
CD Pipeline / build-and-deploy (push) Failing after 1m32s
Commander's directive: every auto-repair operation (success or failure) must be persisted
Changes:
- migrations/phase10_auto_repair_executions.sql: add the table + 4 indexes
- db/models.py: add the AutoRepairExecution SQLAlchemy model
- repositories/audit_log_repository.py: add AutoRepairExecutionRepository (create/list_by_incident/get_stats)
- auto_repair_service.py: execute_auto_repair writes to the DB on both the success and failure branches
- Pass the new similarity_score parameter through
- AutoRepairDecision gains a similarity_score field
- webhooks.py: pass similarity_score to execute_auto_repair
Migration applied: awoooi_prod@192.168.0.188:5432 ✅
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 11:16:37 +08:00
OG T
68a2fff746
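feat(auto-repair): remove all blocking thresholds so every match goes straight to auto-repair
CD Pipeline / build-and-deploy (push) Failing after 1m38s
Commander's directive: every APPROVED Playbook executes directly, with no further checks on:
- the similarity threshold (MIN_SIMILARITY_SCORE 0.7 → 0.0)
- the is_high_quality quality threshold
- the cold-start trust mechanism
- the action risk-level threshold (both the evaluate and execute layers)
Retained: manual review for P0/P1 severity, the global cooldown circuit breaker, and the APPROVED status check
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>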
2026-04-08 11:10:09 +08:00
OG T
b7ea362efc
fix(api): Review #2 tech-debt cleanup: I1/S1/S2/S3 all fixed
CD Pipeline / build-and-deploy (push) Successful in 12m13s
I1: fill in the error_type field
- Add AnomalyCounter.derive_key_from_incident()
Extracts reason/error_type from signal.labels so that all four fields are populated
S1: unify the signature-building logic in three places
- auto_repair_service._derive_anomaly_key() → delegates to derive_key_from_incident()
- approval_execution._get_anomaly_key_from_approval() → same
- incident_service.resolve_incident() B4 → same
- Removes 3 duplicated copies of the signature-building code
S2: Redis Pipeline batch queries
- get_all_disposition_stats() changed from N+1 hgetall calls to 2 Pipelines
- Pipeline 1: batch hgetall for all disposition keys
- Pipeline 2: batch hget for metadata (alert_name)
- Performance: Redis round-trips drop from 2N to a constant 2
S3: fix the get_incident AttributeError in auto_repair.py
- get_incident() → get_from_working_memory() (pre-existing bug)
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 13:13:42 +08:00
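The S2 change in b7ea362efc (two pipelined batches instead of per-key calls) might look roughly like the sketch below. It is not the project's code: the `disposition:` and `meta:` key prefixes are hypothetical, and `FakeRedis`/`FakePipeline` are in-memory stand-ins that only mimic the `pipeline()` / `hgetall()` / `hget()` / `execute()` shape of a redis-py client so the example runs without a server.

```python
class FakePipeline:
    """In-memory stand-in that buffers commands the way a redis-py pipeline does."""

    def __init__(self, client):
        self._client = client
        self._queued = []

    def hgetall(self, key):
        self._queued.append(("hgetall", key, None))
        return self

    def hget(self, key, field):
        self._queued.append(("hget", key, field))
        return self

    def execute(self):
        self._client.round_trips += 1  # one simulated round-trip per execute()
        results = []
        for op, key, field in self._queued:
            stored = self._client.store.get(key, {})
            results.append(dict(stored) if op == "hgetall" else stored.get(field))
        self._queued = []
        return results


class FakeRedis:
    """Hypothetical stub; a real deployment would pass a redis-py client instead."""

    def __init__(self, store):
        self.store = store
        self.round_trips = 0

    def pipeline(self):
        return FakePipeline(self)


def get_all_disposition_stats(client, anomaly_keys):
    # Pipeline 1: batch hgetall for every disposition hash (key prefix is assumed).
    pipe = client.pipeline()
    for key in anomaly_keys:
        pipe.hgetall(f"disposition:{key}")
    all_counts = pipe.execute()

    # Pipeline 2: batch hget for the alert_name metadata (key prefix is assumed).
    pipe = client.pipeline()
    for key in anomaly_keys:
        pipe.hget(f"meta:{key}", "alert_name")
    alert_names = pipe.execute()

    return [
        {"anomaly_key": key, "alert_name": name,
         "counts": {field: int(value) for field, value in counts.items()}}
        for key, counts, name in zip(anomaly_keys, all_counts, alert_names)
    ]
```

However many anomaly keys are passed in, the function calls `execute()` exactly twice, which is the round-trip reduction the commit describes.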
OG T
de3935d1d4
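feat: Sprint 4 Phase E+F: frontend disposition stats + weekly-report disposition breakdown
CD Pipeline / build-and-deploy (push) Failing after 1m26s
Type Sync Check / check-type-sync (push) Failing after 1m2s
Phase E: frontend pages
- E1: /reports full disposition-stats dashboard (already completed in Sprint F)
- E2: home-page Metrics Strip: real automation rate taken from the disposition API
Prefers auto_rate from /stats/disposition, falling back to an estimate derived from incidents
- E3: /auto-repair disposition overview card (already completed in Sprint F)
- E4: /neural-command stats tab disposition breakdown (already completed in Sprint F)
- E5: i18n translations zh-TW + en (already completed in Sprint F)
Phase F: weekly report + docs
- F1: WeeklyReportMessage gains 5 disposition fields
The weekly report format gains a "📋 Disposition breakdown" block (auto / cold start / human-approved / manual + automation rate)
weekly_report_service integrates get_all_disposition_stats()
- message length limit raised from 900 to 1200 characters (to fit the disposition block)
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>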
2026-04-07 13:02:20 +08:00
OG T
561bcb638b
fix(api): Sprint 4 Chief Architect Review P0 fixes: unified hash + modular-design compliance
P0-1: unify anomaly_key hash derivation
- B1: add _derive_anomaly_key(), which uses AnomalyCounter.hash_signature()
instead of symptoms.compute_hash()
- B3/B4: namespace is now read via signal.labels.get("namespace", "")
Fixes getattr(signal, "namespace", "") always returning an empty string
P0-2: Router-layer modular-design compliance
- C1/C2: encapsulate get_all_disposition_stats() in AnomalyCounter
- The Router no longer accesses counter.redis directly
- stats.py: remove the unused days/stats parameters
P1: get_frequency() populates the disposition fields
- Consistent with _record_anomaly_impl(); returns the full disposition stats
Chief Architect score: 82/100 → all P0 items fixed
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 12:53:12 +08:00
OG T
a85e9ced08
feat(api+telegram): Sprint 4 Phase C+D: API endpoints + Telegram disposition stats
Phase C: API endpoints
- C1: GET /api/v1/stats/disposition: full disposition breakdown statistics
- DispositionSummary: auto/human/manual/cold_start + auto_rate
- DispositionByAnomaly: per-anomaly-type detail (up to 20 entries)
- Aggregated via Redis SCAN + HGETALL
- C2: GET /api/v1/auto-repair/stats extended with disposition_summary
Phase D: Telegram alert format
- D1: alert card gains a disposition-stats line
- 🤖 Auto: N | 👤 Reviewed: N | 🔧 Manual: N
- Automation rate as a percentage
- D2: history button enhanced with a detailed disposition breakdown
- Full 5-item counts + automation rate
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 12:17:20 +08:00
OG T
9253281d46
feat(api): Sprint 4 Phase A+B: alert disposition stats data layer + write layer
Phase A: data layer
- A1: IncidentFrequencyStats gains 4 fields (human_approved/manual_resolved/cold_start_trust/total_resolution)
- A2: AnomalyCounter.record_disposition(): atomic increment via Redis HINCRBY
- A3: get_disposition_stats(): returns the disposition breakdown via HGETALL
- AnomalyFrequency dataclass extended, with to_dict() kept in sync
- _record_anomaly_impl() integrates disposition stats
Phase B: wiring the write-layer trigger points
- B1: auto-repair succeeds → record_disposition("auto_repair")
- B2: cold-start trust succeeds → record_disposition("cold_start_trust")
- AutoRepairDecision gains an is_cold_start flag
- execute_auto_repair() receives it and distinguishes the disposition type
- B3: human-approved execution succeeds → record_disposition("human_approved")
- Add the _get_anomaly_key_from_approval() helper
- B4: manual handling is inferred → resolve_incident() decides by elimination
- If resolved with no auto/human/cold_start record → manual_resolved
Safety design: every disposition write is wrapped in try/except, so a failure never blocks the main flow
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 11:54:46 +08:00
OG T
53b2daeaca
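feat(api): first-time trust mechanism to break the auto-repair cold-start chicken-and-egg problem
Problem: a Playbook needs success_count >= 3 to count as is_high_quality,
but without auto-repair there are never any success records → the threshold can never be reached.
Option C: first-time trust (Cold Start Trust)
- APPROVED status + every step risk=LOW + execution count < 3 → allowed automatically
- A Redis counter caps first-time-trust auto-repairs at 5 per day
- After 3 accumulated successes, the Playbook reverts automatically to the normal is_high_quality threshold
Safety boundaries:
- Only LOW-risk steps qualify for first-time trust (container restarts and the like)
- HIGH/CRITICAL still require manual review
- P0/P1 severity still requires manual review
- The daily cap prevents runaway behavior
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>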
2026-04-07 11:21:00 +08:00
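The Cold Start Trust rules in 53b2daeaca can be read as a single predicate. The following is a minimal sketch under stated assumptions: `PlaybookSnapshot`, `allow_cold_start_trust`, and the `used_today` argument (standing in for the daily Redis counter) are hypothetical names for illustration, not the project's actual API.

```python
from dataclasses import dataclass, field

COLD_START_MAX_EXECUTIONS = 3   # from the commit: execution count < 3
COLD_START_DAILY_LIMIT = 5      # from the commit: at most 5 per day


@dataclass
class PlaybookSnapshot:
    """Hypothetical view of the Playbook fields the rule needs."""
    status: str
    execution_count: int
    step_risks: list = field(default_factory=list)


def allow_cold_start_trust(playbook, severity, used_today):
    """Return True only when every safety boundary listed in the commit holds."""
    if severity in ("P0", "P1"):
        return False  # P0/P1 severity still requires manual review
    if playbook.status != "APPROVED":
        return False
    if any(risk != "LOW" for risk in playbook.step_risks):
        return False  # HIGH/CRITICAL (or any non-LOW step) needs a human
    if playbook.execution_count >= COLD_START_MAX_EXECUTIONS:
        return False  # past cold start: the normal is_high_quality gate applies
    return used_today < COLD_START_DAILY_LIMIT
```

Keeping the check a pure function makes each boundary easy to unit-test; in a real service the daily count would presumably come from the Redis counter the commit mentions.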
OG T
2fe8062fb8
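refactor(api): Re-Review S1/S2/S3 improvements: deduplication + defensive validation + test isolation
S1: extract the shared _execute_and_observe() method
- Removes 3 duplicated copies of the execute+audit+langfuse logic in repair_by_uri
- Unifies the AuditLog + Langfuse trace write path
S2: defensive validation of the SSH username
- Add validate_ssh_user() + the _SSH_USER_RE regex
- Validate the user parameter on entry to _ssh_execute()
- Prevents unexpected behavior from user@host concatenation
- Add 8 username validation tests
S3: singleton reset for tests
- Add the _reset_for_test() classmethod
- Avoids state pollution across tests
- Add 2 singleton reset tests
Tests: 55/55 passing (45 existing + 10 new)
Chief Architect Re-Review: 91/100 ✅ passed; all 3 Suggestions implemented
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>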
2026-04-07 11:17:40 +08:00
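The S2 username check in 2fe8062fb8 could be sketched as below. The commit names `validate_ssh_user()` and `_SSH_USER_RE` but does not show the pattern, so the regex here is an assumption (a conservative POSIX-style login name), not the project's actual rule.

```python
import re

# Assumed pattern: lowercase letter or underscore first, then up to 31
# lowercase letters, digits, underscores, or hyphens.
_SSH_USER_RE = re.compile(r"[a-z_][a-z0-9_-]{0,31}")


def validate_ssh_user(user):
    """Raise ValueError unless `user` is safe to splice into user@host."""
    # fullmatch avoids the trailing-newline loophole of `$` with match().
    if not isinstance(user, str) or _SSH_USER_RE.fullmatch(user) is None:
        raise ValueError(f"invalid SSH username: {user!r}")
    return user
```

Rejecting anything outside the pattern means a value such as `root@evil` or `-oProxyCommand=x` never reaches the `user@host` string.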
OG T
78a8d3dfa5
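fix(api): add whitelist validation for the ansible control node to prevent an environment-variable bypass (Re-Review Important)
The Chief Architect's Re-Review pointed out that ANSIBLE_CONTROL_HOST comes from an environment variable (ConfigMap);
if the ConfigMap is tampered with, SSH_TARGET_WHITELIST can be bypassed.
Calling validate_ssh_target_host(host) at the start of _execute_ansible() closes the gap.
Re-Review score: 91/100 ✅ passed
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>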
2026-04-07 11:13:49 +08:00
OG T
f8d4772abf
fix(api): Sprint 3 P0-1/P0-2/P0-3/P0-4 Critical Security Fixes
P0-1: Complete shell metacharacter regex detection
- Enhanced _SHELL_METACHAR_RE to detect: >, <, \n, ${}, $()
- Prevents all shell injection vectors (redirects, variable expansion, newlines)
- Added 5 new validation tests
P0-2: Add shlex.quote() protection for ansible playbook path
- Wraps playbook_path in shlex.quote() before SSH command construction
- Prevents shell injection if path contains special characters
- Applied in _execute_ansible() method
P0-3: Add SSH target host whitelist validation
- Introduces validate_ssh_target_host() function
- Only allows SSH to: 192.168.0.110, 192.168.0.188
- Prevents unauthorized SSH target exploitation
- Added 5 new whitelist validation tests
P0-4: Convert HostRepairAgent to singleton pattern
- Implements __new__() singleton with shared _in_process_locks dict
- Ensures in-process locks persist across multiple auto_repair_service calls
- Previously created new instance per call, making locks ineffective
- Added singleton persistence test
Test Results: 45/45 passing (34 existing + 11 new P0 tests)
All security validations verified via comprehensive unit test coverage.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 11:09:45 +08:00
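P0-1 through P0-3 in f8d4772abf combine three defenses: a metacharacter regex, `shlex.quote()`, and a host whitelist. A minimal sketch follows; the regex body and the `build_ansible_command` helper are assumptions for illustration (the commit lists only some of the characters detected), while the two whitelisted addresses come from the commit message.

```python
import re
import shlex

# Assumed pattern covering the vectors the commit lists (>, <, newline, ${, $()
# plus a few common separators; the real _SHELL_METACHAR_RE may differ.
_SHELL_METACHAR_RE = re.compile(r"[;&|`<>\n]|\$\{|\$\(")

SSH_TARGET_WHITELIST = frozenset({"192.168.0.110", "192.168.0.188"})


def has_shell_metachar(value):
    """True when the string contains a character that could alter the shell command."""
    return _SHELL_METACHAR_RE.search(value) is not None


def validate_ssh_target_host(host):
    """Raise ValueError for any host outside the whitelist."""
    if host not in SSH_TARGET_WHITELIST:
        raise ValueError(f"SSH target not whitelisted: {host!r}")
    return host


def build_ansible_command(playbook_path):
    """Hypothetical helper: quote the path before it is embedded in an SSH command."""
    return f"ansible-playbook {shlex.quote(playbook_path)}"
```

The checks are layered on purpose: rejection by regex stops obviously hostile input early, and quoting still protects the command if a special character slips past the pattern.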
OG T
af07c23675
fix(k8s): mount known_hosts in the dedicated /etc/repair-known-hosts directory to fix a mount conflict
CD Pipeline / build-and-deploy (push) Successful in 12m11s
E2E Health Check / e2e-health (push) Successful in 34s
/etc/repair-ssh is already occupied by repair-ssh-key, so the subPath file mount conflicted.
Switched to the dedicated directory /etc/repair-known-hosts and updated KNOWN_HOSTS_PATH to match.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 15:06:28 +08:00
OG T
1644fe6474
feat(api): integrate repair_by_uri into auto_repair_service (Sprint 3 T6)
CD Pipeline / build-and-deploy (push) Has been cancelled
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 14:39:03 +08:00
OG T
a4e11bfa92
feat(api): AuditLog + Langfuse Trace for SSH_COMMAND (Sprint 3 T5)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 14:38:59 +08:00
OG T
4561f141bb
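feat(api): Redis idempotency lock to prevent duplicate repairs (Sprint 3 T4)
Two-layer lock design: in-process asyncio.Lock (always effective) + Redis distributed lock (cross-Pod, best-effort)
A second repair call for the same URI returns an "already running" error immediately
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>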
2026-04-06 14:26:53 +08:00
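The in-process half of the two-layer lock in 4561f141bb can be sketched as follows. This is an illustration only: `repair_once` and the return shape are hypothetical, and the Redis distributed layer is deliberately left out.

```python
import asyncio

# One asyncio.Lock per repair URI, shared by every caller in this process.
_in_process_locks = {}


async def repair_once(uri, work):
    """Run `work()` unless a repair for the same URI is already in flight."""
    lock = _in_process_locks.get(uri)
    if lock is None:
        lock = _in_process_locks[uri] = asyncio.Lock()
    if lock.locked():
        # Fail fast instead of queueing behind the running repair.
        return {"ok": False, "error": "already running"}
    async with lock:
        return {"ok": True, "result": await work()}
```

Because no `await` sits between the `locked()` check and acquiring the lock, two coroutines on the same event loop cannot both pass the check. The dict only helps if it outlives a single call, which is presumably why a later commit turns HostRepairAgent into a singleton.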
OG T
1a654aa37d
feat(api): HostRepairAgent three execution paths + known_hosts + Ansible whitelist (Sprint 3 T3)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 14:22:54 +08:00
OG T
5e8b2a6894
feat(api): URI scheme parser + shell injection protection (Sprint 3 T1)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 14:18:21 +08:00
OG T
f6332b4b2f
fix(telegram): fix the approval_id UUID conversion error by supporting the INC-xxx format
CD Pipeline / build-and-deploy (push) Successful in 12m24s
_execute_approval_action called UUID(approval_id), but approval_id is in INC-xxx form,
which raised 'badly formed hexadecimal UUID string' and blocked approvals from executing.
Fix: try the UUID conversion first; on failure, use incident_id to look up the UUID of the matching pending approval.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 11:53:48 +08:00
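The "try UUID first, then fall back" fix in f6332b4b2f might be sketched like this. `resolve_approval_uuid` and the `pending_by_incident` mapping are hypothetical stand-ins for the real lookup of a pending approval by incident ID.

```python
import uuid


def resolve_approval_uuid(approval_id, pending_by_incident):
    """Return a UUID for `approval_id`, which may be a UUID string or INC-xxx."""
    try:
        return uuid.UUID(approval_id)
    except ValueError:
        # Not a UUID (e.g. "INC-xxx"): treat it as an incident ID and look up
        # the pending approval for that incident. Returns None if none exists.
        return pending_by_incident.get(approval_id)
```

Catching only `ValueError` keeps unrelated failures visible rather than silently routing them into the fallback path.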
OG T
658337ec18
fix(phase26): connect the full Incident→DB→KM chain + namespace fix
CD Pipeline / build-and-deploy (push) Failing after 1m29s
Type Sync Check / check-type-sync (push) Failing after 52s
Root causes:
1. create_incident_for_approval only wrote to Redis, not PostgreSQL
→ the record vanished after the 7-day TTL, so Playbook extraction could never find the Incident
2. ApprovalRecord had no incident_id field
→ _trigger_playbook_extraction relied on a regex scan of Chinese text for INC-, which always failed
3. The operation_parser namespace fallback was "default"
→ every deployment lives in awoooi-prod, so all 203 executions failed
Fixes:
- Incident is written to both Redis and PostgreSQL (save_to_episodic_memory)
- ApprovalRecord gains an incident_id field (model + ORM + migration)
- alertmanager_webhook writes incident_id back after creating the Approval
- _trigger_playbook_extraction uses approval.incident_id directly
- operation_parser DEFAULT_NAMESPACE = "awoooi-prod"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 11:46:05 +08:00
OG T
286a96d1aa
fix(knowledge): fix the entrystatus enum case: 'archived' → 'ARCHIVED'
CD Pipeline / build-and-deploy (push) Successful in 12m47s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 11:25:44 +08:00
OG T
09d965dab5
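fix(telegram): fix the editMessageText 400 error by removing the buttons before updating the text
CD Pipeline / build-and-deploy (push) Failing after 12m46s
Cause: original_text comes from message.text (plain text) and can contain characters such as <, >, and &,
so Telegram returned 400 when it was sent with parse_mode=HTML.
Fix:
1. Call editMessageReplyMarkup first to remove the buttons (guarantees the buttons disappear)
2. Then try to update the text after html.escape(original_text)
3. A failed text update does not affect the overall flow (removing the buttons is the primary goal)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>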
2026-04-05 22:13:54 +08:00
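The escaping step in 09d965dab5 relies on the standard-library `html.escape()`. A minimal sketch, assuming a hypothetical `build_edit_payload` helper that only assembles the request parameters and sends nothing:

```python
import html


def build_edit_payload(chat_id, message_id, original_text, suffix):
    """Assemble editMessageText parameters with the plain text made HTML-safe."""
    # message.text is plain text, so <, > and & must be escaped before the
    # message is re-sent with parse_mode=HTML. `suffix` is assumed to be
    # trusted HTML written by the bot itself.
    safe_text = html.escape(original_text)
    return {
        "chat_id": chat_id,
        "message_id": message_id,
        "text": f"{safe_text}\n{suffix}",
        "parse_mode": "HTML",
    }
```

Only the user-visible original text is escaped; markup the bot adds on purpose stays outside the `html.escape()` call.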
OG T
8220027298
fix(telegram+cd): fix two display bugs
CD Pipeline / build-and-deploy (push) Has been cancelled
1. Nemotron args were displayed as a Python dict string
- restart_deployment: {'deployment_name': 'awoo'} → restart_deployment: deployment_name=awoooi-api
- Switched to key=value formatting; str(dict)[:25] is no longer used
2. Variables such as ${MINUTES}/${SECONDS} in the CD notification were not expanded
- TG_MSG assembly moved from env: into the run: shell
- Shell variables under env: are static strings before bash runs, so they cannot be expanded
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:47:52 +08:00
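The first fix in 8220027298 replaces a truncated `str(dict)` with key=value formatting. A rough sketch of the idea, where `format_tool_call` is a hypothetical name:

```python
def format_tool_call(name, args):
    """Render a tool call as 'name: k1=v1, k2=v2' instead of a dict repr."""
    rendered = ", ".join(f"{key}={value}" for key, value in args.items())
    return f"{name}: {rendered}" if rendered else name
```

Formatting each pair separately avoids both the braces and quotes of the dict repr and the mid-value cut that a fixed-width slice such as `[:25]` produces.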
OG T
23364423fa
feat(webhook): ADR-059 GitHub → Gitea webhook migration complete
- gitea_webhook.py: all headers changed to X-Gitea-*; workflow_run handler removed
- gitea_webhook_service.py: _fetch_pr_diff now calls httpx directly instead of depending on github_api_service
- Removed all leftover github_ log keys from both files; review_id prefix changed to gitea-
- test_gitea_webhook.py: 10/10 passing; docstring corrected
- 03-secrets.yaml: add a GITEA_WEBHOOK_SECRET placeholder
- cd.yaml: add a GITEA_WEBHOOK_SECRET injection step
- ADR-059: architecture decision record created
Pending Commander action: Gitea Actions secret + Gitea UI webhook configuration
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:44:32 +08:00
OG T
6777532534
feat(webhook): Task 1+2: config + service GitHub→Gitea migration (ADR-059)
- config.py: GITHUB_WEBHOOK_SECRET/ALLOWED_REPOS → GITEA_*
- Add gitea_webhook_service.py: PR/Push review only; CI diagnosis removed
- Remove CIFailureDiagnosis, diagnose_ci_failure, _call_openclaw_ci_diagnosis
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:33:58 +08:00
OG T
22ee9b2fe3
fix(telegram): answerCallbackQuery result=true caused "bool is not iterable"
CD Pipeline / build-and-deploy (push) Successful in 13m3s
On success, Telegram answerCallbackQuery returns {"ok": true, "result": true};
in _send_request, "message_id" in result["result"] applied the in operator to a bool
and raised "argument of type 'bool' is not iterable".
Fix: guard with isinstance(result_val, dict) before the in check.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:20:54 +08:00
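The guard described in 22ee9b2fe3 can be sketched as below. `extract_message_id` is a hypothetical helper; the two response shapes are the ones quoted in the commit message plus a typical sendMessage-style result.

```python
def extract_message_id(response):
    """Return result.message_id when result is an object, else None."""
    result_val = response.get("result")
    # answerCallbackQuery returns result=True, so `in` must not be applied
    # until the value is known to be a dict.
    if isinstance(result_val, dict) and "message_id" in result_val:
        return result_val["message_id"]
    return None
```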
OG T
4b4007db6c
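feat(telegram): upgrade the SRE group alert format to the full v7.0
CD Pipeline / build-and-deploy (push) Has been cancelled
_send_approval_card_to_group now uses the same TelegramMessage.format() as the personal chat,
including every field: SignOz metrics, AI provider/model, Nemotron collaboration, anomaly frequency stats, and so on.
Commander's directive: alert messages received by the group must match the personal chat format exactly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>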
2026-04-05 14:11:59 +08:00
OG T
76f3ffd7f7
fix(telegram): whitelist property returned a string, leaving buttons unresponsive
CD Pipeline / build-and-deploy (push) Successful in 13m0s
security_interceptor.whitelist returned settings.OPENCLAW_TG_USER_WHITELIST
(a string), but is_whitelisted evaluated user_id in whitelist (int in str),
so Python raised "requires string as left operand, not int".
Fix: call settings.get_tg_user_whitelist() instead, which returns list[int].
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:40:52 +08:00
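The type mismatch fixed in 76f3ffd7f7 is easy to reproduce. The sketch below assumes the setting is a comma-separated string of numeric IDs, which the commit does not state; `parse_tg_user_whitelist` is a hypothetical stand-in for `settings.get_tg_user_whitelist()`.

```python
def parse_tg_user_whitelist(raw):
    """Turn an assumed comma-separated ID string into list[int]."""
    return [int(part) for part in raw.split(",") if part.strip()]


def is_whitelisted(user_id, whitelist):
    """Membership test against a list of ints, never against the raw string."""
    return user_id in whitelist
```

With the raw string, `123 in "123,456"` raises `TypeError` because the left operand of `in` on a `str` must itself be a string; after parsing, the same check is an ordinary list membership test.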
OG T
4b24ecd67f
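fix(sprint3): Chief Architect Review fixes C1/C2/C3/M3/m1
C1: _ssh_execute takes key_path directly as a parameter instead of looking it up in LAYER_SSH_CONFIG
C2: PlaybookService.create() proxy; the Router no longer reaches through to _repository
C3: CD Step 1b replaces IMAGE_TAG_PLACEHOLDER with sed, removing the risk of a failure aborting the run
M3: repair-bot 110/188 regex unified to [a-z0-9][a-z0-9-]{0,30}; underscores forbidden
m1: add a comment explaining the octal defaultMode 0400
m2: _ssh_execute uses a deadline to compute the remaining timeout
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>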
2026-04-05 13:07:59 +08:00
OG T
665f93e83f
fix(telegram): Chief Architect R1 fixes: I-1/I-2/M-1/M-2
CD Pipeline / build-and-deploy (push) Has been cancelled
I-1: add TODO notes to the three callers webhooks/sentry_webhook/signoz_webhook
The missing incident_id is a known limitation (the Approval path does not create an Incident link)
I-2: TestPushRequest gains an incident_id field so QA can verify button rendering
M-1: remove the redundant `or message.incident_id` from the _build_inline_keyboard call
M-2: document the difference between the 900 and 1000 truncation lengths
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:07:42 +08:00
OG T
4935cfc346
fix(telegram): redesign the message format + fix the broken detail/reanalyze/history buttons
CD Pipeline / build-and-deploy (push) Failing after 1m26s
- format() / format_with_nemotron(): remove the ═══ separators in favor of a clean line-break layout
- send_approval_card(): add an incident_id parameter, passed to _build_inline_keyboard()
- decision_manager.py: pass incident.incident_id when calling send_approval_card()
- Root cause: incident_id was never passed to _build_inline_keyboard(), so the second row of buttons was never rendered
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 12:44:13 +08:00
OG T
53e1ae7ad7
fix(phase25): I2 NIM system prompt + I4 field_path regex matching fix
CD Pipeline / Deploy Prometheus Alert Rules (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
I2: nemotron.analyze() adds the system role (standard NIM message format)
- Old: messages=[{role:user, ...}]
- New: messages=[{role:system, ...}, {role:user, ...}]
- Effect: defines the K8s operator role, improving tool-calling quality
I4: drift_detector._is_allowlisted/_is_critical use a regex instead of strip
- Old: replace('[*]','') followed by startswith/in → could not match containers[0]
- New: [*] → \[\d+\] regex, correctly matching every index
- Fix: containers[*].image now matches containers[0].image
2026-04-05 12:11:05 +08:00
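The I4 change in 53e1ae7ad7 turns the `[*]` wildcard into a `\[\d+\]` regex fragment. A minimal sketch under one assumption: the whole path must match (`fullmatch`), whereas the real `_is_allowlisted`/`_is_critical` may use prefix or substring semantics. The function names are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Compile a field_path pattern where [*] stands for any numeric index."""
    # Escape the literal segments, then join them with the index regex.
    literal_parts = [re.escape(part) for part in pattern.split("[*]")]
    return re.compile(r"\[\d+\]".join(literal_parts))


def path_matches(pattern, field_path):
    """True when field_path is an instance of pattern."""
    return pattern_to_regex(pattern).fullmatch(field_path) is not None
```

Escaping each literal segment matters because `.` in a path such as `spec.containers` would otherwise match any character.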
OG T
73577f7c5d
chore(ai-router): v4.3 version number sync (trigger CD push event)
CD Pipeline / Deploy Prometheus Alert Rules (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-04-05 12:03:15 +08:00
OG T
bf4f81412c
feat(api): ActionType.SSH_COMMAND + auto_repair_service SSH branch (Task 12)
- playbook.py: add the SSH_COMMAND ActionType
- auto_repair_service._execute_step: SSH_COMMAND branch, format layer/component
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 11:47:00 +08:00
OG T
e7d8da85f6
feat(api): HostRepairAgent: SSH host-layer repair (Task 11)
- host_repair_agent.py: layer routing, command injection protection, asyncio SSH execution
- Tests: 12 cases, all passing (routing/sanitize/success/fail/timeout/denied)
- SSH key: /etc/repair-ssh/id_ed25519 (K8s secret mount)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 11:22:00 +08:00
OG T
2243a21b96
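fix(ai-router): v4.3 NIM protection: timeouts do not count as circuit breaker (CB) failures; always try NIM first before switching to Gemini
CD Pipeline / build-and-deploy (push) Failing after 20s
Requirement: NIM must be waited on until it responds; it must not be blocked by the CB and routed to Gemini merely for being slow
Changes:
- Timeout exceptions do not accumulate CB failures (only real connection errors count)
- NIM CB: failure_threshold=10, recovery_timeout=30s (more lenient than the defaults)
- Design doc v4.3: update direction two and remove the incorrect assumption
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>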
2026-04-05 01:51:12 +08:00
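The rule in 2243a21b96 that timeouts do not count toward the circuit breaker might be sketched as follows. The `CircuitBreaker` class is a simplified, hypothetical illustration: it keeps the `failure_threshold=10` value from the commit but omits `recovery_timeout` and any half-open state.

```python
import asyncio


class CircuitBreaker:
    """Counts only non-timeout failures toward opening the circuit."""

    def __init__(self, failure_threshold=10):
        self.failure_threshold = failure_threshold
        self.failures = 0

    @property
    def is_open(self):
        return self.failures >= self.failure_threshold

    def record_failure(self, exc):
        # A slow provider is not treated as a broken one: timeouts are ignored.
        # asyncio.TimeoutError is a distinct class before Python 3.11.
        if isinstance(exc, (TimeoutError, asyncio.TimeoutError)):
            return
        self.failures += 1

    def record_success(self):
        self.failures = 0
```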
OG T
5ad403b287
fix(p0): v4.3: measurements confirm CPU-only Ollama is unusable; DIAGNOSE goes through NIM only
Measurements (2026-04-05):
- Ollama llama3.2:3b CPU-only: 238s to return {"ok":true}; unusable in production
- Nemotron NIM: 2.2s~27.3s, avg 10.6s; it has been the workhorse all along (since Phase 22)
- NIM has never raised a privacy concern; Incident data has always been sent to the cloud GPU
Changes:
- ai_router.py: _local_fallback_chain deprecated (empty list)
- ai_router.py: DIAGNOSE route/route_sync switched back to _full_fallback_chain
- config.py: update the timeout notes to reflect the measurements
- test_p0_diagnose_routing.py: update the docstring
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 01:49:06 +08:00
OG T
a81bf50537
feat(drift): ADR-057 adopt() implemented via the Gitea PR API
- DriftAdoptService: creates the branch + commit + PR through the Gitea REST API
No git is executed inside the API Pod (fixes the C2 security vulnerability)
- adopt() endpoint: 501 → real implementation (calls DriftAdoptService)
- config.py: add GITEA_API_URL / GITEA_API_TOKEN / GITEA_REPO_OWNER / GITEA_REPO_NAME
- GITEA_API_TOKEN has been injected into the K8s secret awoooi-secrets
- drift.py: remove the unused interpreter variable in trigger_drift_scan
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 00:39:29 +08:00
OG T
96d5e18924
fix(p0): measurement-based fixes: timeouts tuned to benchmarks; cloud Nemotron removed from _local_fallback_chain
- config.py: NEMOTRON_DIAGNOSE_TIMEOUT_SECONDS=60s (NIM measured at 11-45s + 15s buffer)
- config.py: OLLAMA_DIAGNOSE_TIMEOUT_SECONDS=200s (Ollama measured at ~173s + 27s buffer)
- ollama.py: add a per-task timeout (diagnose/force_local use 200s)
- ai_router.py: remove Nemotron from _local_fallback_chain (NIM is cloud, so it must not be in the local chain)
- ai_router.py: v4.2: Option C scenario-based routing formally established
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 00:29:09 +08:00
OG T
4912c7f307
fix(phase25): Chief Architect Review R2 fixes (I1/I2/I3/I4/C3/M1)
I1: auto_repair_service: the failure-branch anti_pattern task now gets the _pending_tasks GC protection
C3: drift_remediator: _kubectl_apply() implements resource_key scope filtering (fixes a parameter that previously had no effect)
M1: drift_remediator: _git_push() marked DISABLED to prevent accidental enablement
I2: drift.py: the Telegram notification drops the dead adopt() endpoint link
I3: drift/page.tsx: handleScan POST body namespace→namespaces (aligned with the backend DriftScanRequest)
I4: drift/page.tsx: remove hard-coded English strings in favor of t('loading')/t('highCount')/t('mediumCount')
i18n: zh-TW.json + en.json gain the drift.loading key
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 00:22:38 +08:00
OG T
a562db4048
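fix(phase25): Chief Architect Review fixes C1/C2/I1/I3
CD Pipeline / build-and-deploy (push) Failing after 57s
C1: NemotronProvider.privacy_level cloud→local
NIM is deployed on the 192.168.0.188 intranet, not the official cloud API
It can be included within the DIAGNOSE _local_fallback_chain privacy boundary
C2: adopt() endpoint suspended; returns 501
Running git add -A in the API Pod is a security risk
To be reimplemented with the Gitea PR API once ADR-057 is drafted
I1: timeout log fixed to record the timeout value actually applied
Previously it always logged NEMOTRON_TIMEOUT_SECONDS=45
Now it logs the correct value selected by task_type
I3: route_sync() gains the DIAGNOSE privacy boundary
The async route() already had _local_fallback_chain
The sync version missed it; added in this commit
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>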
2026-04-04 18:00:05 +08:00
OG T
c4eafd2a5b
fix(ai-router): exclude selected_model from fallback_models to avoid duplicates
CD Pipeline / build-and-deploy (push) Successful in 6m58s
After the DIAGNOSE intent routes to Nemotron, OPENCLAW_NEMO in the fallback_chain
uses the same model string, so fallback_models contained the already-selected model.
Fix: filter out any model string equal to selected_model.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 17:43:44 +08:00
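The filter described in c4eafd2a5b amounts to a one-line list comprehension. A sketch with hypothetical names and made-up model strings:

```python
def build_fallback_models(selected_model, fallback_chain_models):
    """Keep chain order but drop every entry equal to the selected model."""
    return [model for model in fallback_chain_models if model != selected_model]
```

Comparing model strings rather than provider names is what catches the case in the commit, where two different providers resolve to the same model string.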
OG T
8056be5847
feat(ai-router): upgrade the DIAGNOSE intent override to Nemotron (P0)
2026-04-04 17:41:45 +08:00
OG T
3455044457
feat(phase25): Nemotron proactive defense, three directions P0+P1+P2 fully implemented
CD Pipeline / build-and-deploy (push) Failing after 38s
Type Sync Check / check-type-sync (push) Failing after 35s
P0 - DIAGNOSE Privacy-First Routing:
- ai_router.py: _local_fallback_chain [NEMOTRON→OLLAMA→REJECT]
- DIAGNOSE intent override changed to NEMOTRON (was OLLAMA)
- DIAGNOSE fallback uses the local-only chain and never touches the cloud
- On total failure: REJECT + Telegram notification
- config.py: NEMOTRON_DIAGNOSE_TIMEOUT_SECONDS=30, OLLAMA_DIAGNOSE_TIMEOUT_SECONDS=60
- nemotron.py: choose the timeout based on context[task_type]
P1 - Knowledge Auto-Harvesting:
- models/knowledge.py: EntryType.AUTO_RUNBOOK + ANTI_PATTERN + symptoms_hash
- EntryStatus.PUBLISHED (ANTI_PATTERN is published directly, no review required)
- models/playbook.py: SymptomPattern.compute_hash() (16-character deterministic hash)
- services/runbook_generator.py: NemotronRunbookGenerator (v1.1)
- generate_runbook() → AUTO_RUNBOOK (DRAFT) + Telegram review card
- generate_anti_pattern() → ANTI_PATTERN (PUBLISHED) + Telegram notification
- Uses nvidia.chat() (the correct interface), with a Minimal fallback when Nemotron times out
- knowledge_service.py: check_anti_pattern(symptoms_hash, days=7)
- db/models.py: symptoms_hash VARCHAR(16) + ix_knowledge_symptoms_hash
- repositories/knowledge_repository.py: create() supports symptoms_hash + status
- auto_repair_service.py: anti_pattern_gate in decide() + runbook hook in execute()
- migrations/phase8_symptoms_hash.sql: ALTER TABLE + partial index + PUBLISHED constraint
P2 - Config Drift Detection:
- models/drift.py: DriftItem/DriftReport/DriftLevel/DriftIntent/DriftStatus
- services/drift_detector.py: GitStateReader + K8sStateReader + DriftDetector
- services/drift_analyzer.py: whitelist filtering + DriftLevel grading
- services/drift_interpreter.py: NemotronDriftInterpreter (intent analysis; does not generate repair commands)
- services/drift_remediator.py: rollback(kubectl apply) + adopt(git push gitea)
- api/v1/drift.py: POST /scan, GET /reports, POST /rollback, POST /adopt
- migrations/phase9_drift_reports.sql: drift_reports table
- k8s/drift-cronjob.yaml: hourly automatic scan CronJob
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 12:35:05 +08:00
OG T
df3ef9006c
fix(auto-repair): Chief Architect Review: 4 Critical/Important fixes
CD Pipeline / build-and-deploy (push) Successful in 7m2s
Critical #1: move the KM write task out of try/except
- The KM write in _trigger_learning was inside the try, so nothing was written to KM when learning failed
- Moved after the except so it is written on both success and failure
- Remove the redundant import asyncio (already imported at the top level)
- Minor: approval.incident_id or None guards against an empty string
Important #2: migration adds a PRIMARY KEY
- playbook_id promoted from UNIQUE to PRIMARY KEY
- ALTER TABLE ADD PRIMARY KEY already executed on the prod DB
Important #3: s.sequence→s.step_number, s.description→s.command
- embed_playbook() used nonexistent field names, so RAG vector indexing failed silently
- Correct RepairStep fields: step_number, command
Important #1: PlaybookService._get_rag_service no longer caches at the Service layer
- Now calls the get_playbook_rag_service() factory on every call
- Prevents a stale instance from bypassing the factory's is_closed rebuild logic
Cold-start fix (Chief Architect suggestions B+C):
- After a successful execution, _trigger_playbook_extraction automatically sets
execution_success=True, effectiveness_score=4, status=RESOLVED
- skip path logger.debug → logger.info for better observability
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 12:02:03 +08:00
OG T
72d7536ead
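feat(auto-repair): complete auto-repair closed loop + KM capture wiring
CD Pipeline / build-and-deploy (push) Has been cancelled
1. DB Migration: playbooks table (phase7_playbooks_table.sql)
- This was the root cause of auto-repair failing to start: the table had never been created
- 5 indexes: status/tags/alert_names/source_incidents/created_at
- Already executed on the prod DB
2. playbook_service: capture to KM automatically after extraction
- fire-and-forget _write_to_km() once extract_from_incident() completes
- Content includes the symptom pattern, repair steps, confidence, and source Incident
3. approval_execution: capture execution results to KM
- fire-and-forget _write_execution_result_to_km() after _trigger_learning()
- Both success and failure records are written, category=execution_result
Complete closed loop:
Alert → AI analysis → Playbook lookup → decision → execution → result written to KM
↓
Incident resolved → KM (knowledge_extractor)
→ Playbook extraction → KM
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>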
2026-04-04 11:54:15 +08:00
OG T
429d81d29b
fix(knowledge): I2+I3 Chief Architect Important fixes: dependency injection + finer-grained exception handling
CD Pipeline / build-and-deploy (push) Has been cancelled
I2: KnowledgeService is now injected in DecisionManager.__init__
_query_kb_context_inner uses self._knowledge_svc, removing the in-function import coupling
I3: finer-grained exception handling in _query_kb_context
- asyncio.TimeoutError → warning (expected degradation)
- ConnectionError/OSError → warning (Ollama connection problem, expected degradation)
- Exception → error (unexpected; raises monitoring visibility)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 11:51:43 +08:00
OG T
f846000c8c
fix(knowledge): C1 Chief Architect must-fix: 5-second hard timeout for _query_kb_context
CD Pipeline / build-and-deploy (push) Has been cancelled
C1 fix (Chief Architect Review 74/100 → conditional pass):
- Extract _query_kb_context_inner to hold the actual query logic
- _query_kb_context wraps it in asyncio.wait_for(timeout=5.0)
- An Ollama hang or slow response costs at most 5s, protecting the 30s decision SLA
- On timeout, logger.warning("kb_rag_timeout") and degrade silently
Also removes the emoji from the LLM prompt (## 📚 → ## Knowledge Base)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 11:48:57 +08:00
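The hard timeout in f846000c8c wraps the real query in `asyncio.wait_for()`. A minimal sketch, in which `query_kb_context` takes the inner coroutine function as a parameter (a simplification of the method pair the commit describes) and returns `None` on timeout as the assumed "silent degradation" value:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def query_kb_context(inner, timeout=5.0):
    """Await `inner()` but give up after `timeout` seconds."""
    try:
        return await asyncio.wait_for(inner(), timeout=timeout)
    except asyncio.TimeoutError:
        # Expected degradation: log it and let the caller continue without KB context.
        logger.warning("kb_rag_timeout")
        return None
```

`wait_for` cancels the inner coroutine when the deadline passes, so a hung embedding request does not keep running in the background after the caller has moved on.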
OG T
860dc1d892
feat(knowledge): KB Phase 2: OpenClaw RAG integration
CD Pipeline / build-and-deploy (push) Has been cancelled
_dual_engine_analyze gains _query_kb_context():
- Semantic search for related KB entries before Incident analysis (top-3, threshold=0.4)
- Injects the KB context into expert_context.diagnosis_context for the LLM
- Degrades silently on failure without affecting the main analysis flow
- The dual_engine_llm_win log gains a kb_rag field, making the RAG hit rate observable
Architecture: _query_kb_context calls the Service layer through get_knowledge_service()
Complies with leWOOOgo modular design: decision_manager does not access the DB/pgvector directly
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 11:46:47 +08:00
OG T
d0f09705e5
fix(auto-repair): fix three root causes blocking auto-repair
CD Pipeline / build-and-deploy (push) Has been cancelled
1. playbook_rag: the Ollama embedding http_client was is_closed after a rolling restart
- Add _get_http_client(), which detects is_closed and rebuilds the client automatically
- The get_playbook_rag_service() singleton gains an is_closed rebuild check
2. telegram: add an ai_model field showing the underlying decision model
- TelegramMessage.ai_model field
- format() / format_with_nemotron() display "Nemotron (nemotron-70b)"
- openclaw proposal_dict gains a model field
- Wired through decision_manager / send_approval_card
3. DB: purge 9 zombie PENDING records from 3/26 (mock_fallback CRITICAL test records)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 11:46:25 +08:00
OG T
cddc4cb1fc
fix(knowledge): Chief Architect Review fixes C1+C2+I1+I2 (71→~88/100)
CD Pipeline / build-and-deploy (push) Successful in 7m16s
C1: the IKnowledgeRepository Protocol gains save_embedding + semantic_search +
list_unembedded_entries, restoring the interface-first protection layer
C2: the raw SQL in the Service-layer embed_all_entries moves to Repository.list_unembedded_entries();
the Service now calls through the Protocol, in line with the leWOOOgo modular-design principle
I1: asyncio.create_task results are held in a _pending_tasks set, preventing GC collection and
Task loss at shutdown; each task is discarded automatically once done
I2: OllamaEmbeddingService changes from a new instance per call to injection in KnowledgeService.__init__,
reusing a single instance
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 11:22:38 +08:00