OG T
de3935d1d4
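feat: Sprint 4 Phase E+F - frontend disposition statistics + weekly report disposition breakdown
...
CD Pipeline / build-and-deploy (push) Failing after 1m26s
Type Sync Check / check-type-sync (push) Failing after 1m2s
Phase E: Frontend pages
- E1: /reports full disposition statistics dashboard (already completed in Sprint F)
- E2: Home page Metrics Strip: fetch the real automation rate from the disposition API
Prefer auto_rate from /stats/disposition; fall back to an estimate derived from incidents
- E3: /auto-repair disposition overview card (already completed in Sprint F)
- E4: /neural-command stats tab disposition breakdown (already completed in Sprint F)
- E5: i18n translations zh-TW + en (already completed in Sprint F)
Phase F: Weekly report + documentation
- F1: WeeklyReportMessage adds 5 disposition fields
Weekly report format adds a "📋 Disposition breakdown" block (auto / cold start / human-approved / manual + automation rate)
weekly_report_service integrates get_all_disposition_stats()
- message length limit raised from 900 to 1200 characters (to fit the disposition block)
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>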
2026-04-07 13:02:20 +08:00
OG T
561bcb638b
fix(api): Sprint 4 chief architect Review P0 fixes - unified hash + modularization compliance
...
P0-1: Unify anomaly_key hash derivation
- B1: Add _derive_anomaly_key(), which uses AnomalyCounter.hash_signature()
in place of symptoms.compute_hash()
- B3/B4: read namespace via signal.labels.get("namespace", "")
fixes getattr(signal, "namespace", "") always returning an empty string
P0-2: Router-layer modularization compliance
- C1/C2: Encapsulate get_all_disposition_stats() in AnomalyCounter
- Router no longer accesses counter.redis directly
- stats.py: remove the unused days/stats parameters
P1: get_frequency() populates the disposition fields
- Consistent with _record_anomaly_impl(); returns the full disposition statistics
Chief architect score: 82/100 → all P0 items fixed
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 12:53:12 +08:00
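The namespace fix in 561bcb638b can be illustrated with a minimal sketch. The `Signal` class below is a hypothetical stand-in (only the `labels` mapping is taken from the commit message); it shows why `getattr` with a default silently returns an empty string when the value actually lives in a dict.

```python
from dataclasses import dataclass, field


@dataclass
class Signal:
    """Hypothetical stand-in for the alert signal; it has labels but no namespace attribute."""
    name: str
    labels: dict = field(default_factory=dict)


def namespace_via_getattr(signal: Signal) -> str:
    # Old behaviour: there is no `namespace` attribute, so the default always wins.
    return getattr(signal, "namespace", "")


def namespace_via_labels(signal: Signal) -> str:
    # Fixed behaviour: the namespace is an entry in the labels mapping.
    return signal.labels.get("namespace", "")


signal = Signal(name="PodCrashLooping", labels={"namespace": "awoooi-prod"})
old_value = namespace_via_getattr(signal)
new_value = namespace_via_labels(signal)
```

Because `getattr` never raises here, the old code gave no error, only an empty namespace in every derived key.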
OG T
a85e9ced08
feat(api+telegram): Sprint 4 Phase C+D - API endpoints + Telegram disposition statistics
...
Phase C: API endpoints
- C1: GET /api/v1/stats/disposition: full disposition breakdown statistics
- DispositionSummary: auto/human/manual/cold_start + auto_rate
- DispositionByAnomaly: per-anomaly-type detail (up to 20 entries)
- Redis SCAN + HGETALL aggregation
- C2: GET /api/v1/auto-repair/stats extended with disposition_summary
Phase D: Telegram alert format
- D1: Alert card adds a disposition statistics line
- 🤖 Auto: N | 👤 Reviewed: N | 🔧 Manual: N
- Automation rate percentage
- D2: History button enhanced with a disposition breakdown detail
- All 5 counts + automation rate
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 12:17:20 +08:00
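The aggregation step described in a85e9ced08 (SCAN over keys, HGETALL per key, then a summary) might look roughly like the sketch below. It works on plain dicts shaped like HGETALL results rather than a live Redis connection. The function name is hypothetical, and the `auto_rate` formula (auto plus cold_start over the total, as a percentage) is an assumption; the commit does not state it.

```python
DISPOSITION_FIELDS = ("auto", "human", "manual", "cold_start")


def summarize_dispositions(per_anomaly: dict) -> dict:
    """Sum per-anomaly counters (as HGETALL would return them) into one summary."""
    totals = {name: 0 for name in DISPOSITION_FIELDS}
    for counts in per_anomaly.values():
        for name in DISPOSITION_FIELDS:
            # Redis returns hash values as strings, so coerce before adding.
            totals[name] += int(counts.get(name, 0))
    total = sum(totals.values())
    # Assumption: auto and cold_start both count as automated dispositions.
    automated = totals["auto"] + totals["cold_start"]
    auto_rate = round(automated / total * 100, 1) if total else 0.0
    return {**totals, "total": total, "auto_rate": auto_rate}


summary = summarize_dispositions(
    {
        "anomaly-a": {"auto": "3", "human": "1"},
        "anomaly-b": {"manual": "1", "cold_start": "1"},
    }
)
```

Guarding the division keeps the endpoint usable before any disposition has been recorded.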
OG T
9253281d46
feat(api): Sprint 4 Phase A+B - alert disposition statistics data layer + write layer
...
Phase A: Data layer
- A1: IncidentFrequencyStats adds 4 fields (human_approved/manual_resolved/cold_start_trust/total_resolution)
- A2: AnomalyCounter.record_disposition(): atomic increment via Redis HINCRBY
- A3: get_disposition_stats(): returns the disposition breakdown via HGETALL
- AnomalyFrequency dataclass extended, with to_dict() kept in sync
- _record_anomaly_impl() integrates disposition stats
Phase B: Wiring the write-layer trigger points
- B1: Auto-repair succeeds → record_disposition("auto_repair")
- B2: Cold-start trust succeeds → record_disposition("cold_start_trust")
- AutoRepairDecision adds an is_cold_start flag
- execute_auto_repair() receives the flag and distinguishes the disposition type
- B3: Human-approved execution succeeds → record_disposition("human_approved")
- Add _get_anomaly_key_from_approval() helper
- B4: Manual handling is inferred → resolve_incident() decides by elimination
- If resolved and there is no auto/human/cold_start record → manual_resolved
Safety design: every disposition write is wrapped in try/except; a failure does not block the main flow
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 11:54:46 +08:00
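Two ideas from 9253281d46 can be sketched together: a disposition write that never blocks the main flow, and inferring `manual_resolved` by elimination. `InMemoryHashStore` is a hypothetical stand-in for Redis hashes, and the `disposition:` key prefix and function names are assumptions made for illustration.

```python
import logging
from collections import defaultdict

logger = logging.getLogger("disposition")


class InMemoryHashStore:
    """Hypothetical stand-in for Redis hashes (hincrby / hgetall only)."""

    def __init__(self) -> None:
        self._hashes = defaultdict(lambda: defaultdict(int))

    def hincrby(self, key: str, field: str, amount: int = 1) -> int:
        self._hashes[key][field] += amount
        return self._hashes[key][field]

    def hgetall(self, key: str) -> dict:
        return dict(self._hashes[key])


def record_disposition_safely(store, anomaly_key: str, disposition: str) -> bool:
    """Increment one counter; swallow and log any failure so the caller continues."""
    try:
        store.hincrby(f"disposition:{anomaly_key}", disposition, 1)
        return True
    except Exception:
        logger.warning("disposition write failed for %s", anomaly_key, exc_info=True)
        return False


def is_manual_resolution(stats: dict) -> bool:
    """Elimination rule: resolved with no auto/human/cold-start record means manual."""
    automated = ("auto_repair", "human_approved", "cold_start_trust")
    return not any(int(stats.get(name, 0)) for name in automated)


store = InMemoryHashStore()
record_disposition_safely(store, "abc123", "auto_repair")
```

Returning a boolean instead of raising lets the repair path log the outcome without wrapping every call site in its own try/except.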
OG T
53b2daeaca
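feat(api): first-time trust mechanism - break the auto-repair cold-start chicken-and-egg problem
...
Problem: a Playbook needs success_count >= 3 to count as is_high_quality,
but without auto-repair there are no success records → the threshold can never be reached.
Option C: first-time trust (Cold Start Trust)
- APPROVED status + every step risk=LOW + execution count < 3 → allowed automatically
- A Redis counter caps cold-start-trust auto-repairs at 5 per day
- After 3 accumulated successes, the normal is_high_quality threshold applies again
Safety boundaries:
- Only LOW-risk steps qualify for first-time trust (restarting a container, etc.)
- HIGH/CRITICAL still require human review
- P0/P1 severity still requires human review
- A daily cap prevents runaway behavior
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>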
2026-04-07 11:21:00 +08:00
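The rules listed in 53b2daeaca can be read as a single predicate. The sketch below is one possible encoding; `PlaybookView`, the function name, and the way the daily counter is passed in are all hypothetical, while the thresholds (3 executions, 5 per day, LOW risk only, no P0/P1) come from the commit message.

```python
from dataclasses import dataclass, field

DAILY_TRUST_LIMIT = 5      # from the commit: at most 5 cold-start repairs per day
TRUST_EXECUTION_CAP = 3    # from the commit: trust applies while executions < 3


@dataclass
class PlaybookView:
    """Hypothetical projection of the fields the rule needs."""
    status: str
    execution_count: int
    step_risks: list = field(default_factory=list)


def allows_cold_start_trust(playbook: PlaybookView, severity: str, trusted_today: int) -> bool:
    if severity in {"P0", "P1"}:
        return False  # high severity always goes to a human
    if playbook.status != "APPROVED":
        return False
    if not playbook.step_risks or any(risk != "LOW" for risk in playbook.step_risks):
        return False  # every step must be LOW risk
    if playbook.execution_count >= TRUST_EXECUTION_CAP:
        return False  # past cold start; the normal quality threshold applies
    return trusted_today < DAILY_TRUST_LIMIT


restart = PlaybookView(status="APPROVED", execution_count=0, step_risks=["LOW"])
```

In this sketch the daily count is an argument so the rule stays a pure function; the commit keeps that count in a Redis counter.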
OG T
2fe8062fb8
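refactor(api): Re-Review S1/S2/S3 improvements - remove duplication + defensive validation + test isolation
...
S1: Extract the shared method _execute_and_observe()
- Removes 3 duplicated copies of the execute+audit+langfuse logic in repair_by_uri
- Unifies the AuditLog + Langfuse trace write path
S2: Defensive validation of the SSH username
- Add validate_ssh_user() + the _SSH_USER_RE regex
- Validate the user parameter at the entry of _ssh_execute()
- Prevents unexpected behavior from user@host concatenation
- Add 8 username validation tests
S3: Singleton reset for tests
- Add _reset_for_test() classmethod
- Avoids state leaking across tests
- Add 2 singleton reset tests
Tests: 55/55 passing (45 existing + 10 new)
Chief architect Re-Review: 91/100 ✅ passed; all 3 Suggestions implemented
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>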
2026-04-07 11:17:40 +08:00
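A username check like the S2 item in 2fe8062fb8 could be sketched as follows. The names `validate_ssh_user` and `_SSH_USER_RE` appear in the commit, but the pattern itself is an assumption (a conservative POSIX-style login name); the real regex is not shown in the log.

```python
import re

# Assumed pattern: lowercase letter or underscore first, then up to 31 more
# lowercase letters, digits, underscores, or hyphens.
_SSH_USER_RE = re.compile(r"[a-z_][a-z0-9_-]{0,31}")


def validate_ssh_user(user: str) -> str:
    """Return the user unchanged if it is safe to place in `user@host`."""
    # fullmatch rejects trailing newlines and embedded '@', which `$` alone may not.
    if not _SSH_USER_RE.fullmatch(user):
        raise ValueError(f"invalid SSH username: {user!r}")
    return user


def build_ssh_destination(user: str, host: str) -> str:
    return f"{validate_ssh_user(user)}@{host}"
```

Rejecting a leading hyphen matters because a value such as `-oProxyCommand=...` would otherwise be parsed by ssh as an option.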
OG T
78a8d3dfa5
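fix(api): add whitelist validation for the ansible control node to prevent bypass via environment variable (Re-Review Important)
...
The chief architect's Re-Review pointed out that ANSIBLE_CONTROL_HOST comes from an environment variable (ConfigMap);
if the ConfigMap is tampered with, SSH_TARGET_WHITELIST can be bypassed.
Add validate_ssh_target_host(host) at the start of _execute_ansible() to close the gap.
Re-Review score: 91/100 ✅ passed
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>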
2026-04-07 11:13:49 +08:00
OG T
f8d4772abf
fix(api): Sprint 3 P0-1/P0-2/P0-3/P0-4 Critical Security Fixes
...
P0-1: Complete shell metacharacter regex detection
- Enhanced _SHELL_METACHAR_RE to detect: >, <, \n, ${}, $()
- Prevents all shell injection vectors (redirects, variable expansion, newlines)
- Added 5 new validation tests
P0-2: Add shlex.quote() protection for ansible playbook path
- Wraps playbook_path in shlex.quote() before SSH command construction
- Prevents shell injection if path contains special characters
- Applied in _execute_ansible() method
P0-3: Add SSH target host whitelist validation
- Introduces validate_ssh_target_host() function
- Only allows SSH to: 192.168.0.110, 192.168.0.188
- Prevents unauthorized SSH target exploitation
- Added 5 new whitelist validation tests
P0-4: Convert HostRepairAgent to singleton pattern
- Implements __new__() singleton with shared _in_process_locks dict
- Ensures in-process locks persist across multiple auto_repair_service calls
- Previously created new instance per call, making locks ineffective
- Added singleton persistence test
Test Results: 45/45 passing (34 existing + 11 new P0 tests)
All security validations verified via comprehensive unit test coverage.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 11:09:45 +08:00
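P0-2 and P0-3 in f8d4772abf combine naturally: check the target host against a fixed whitelist, then quote the playbook path before it reaches a shell. The sketch below assumes the remote command is `ansible-playbook <path>`, which the commit does not spell out; `build_ansible_command` is a hypothetical helper name.

```python
import shlex

# The two hosts named in the commit message.
SSH_TARGET_WHITELIST = frozenset({"192.168.0.110", "192.168.0.188"})


def validate_ssh_target_host(host: str) -> str:
    """Allow SSH only to explicitly whitelisted hosts."""
    if host not in SSH_TARGET_WHITELIST:
        raise PermissionError(f"SSH target not whitelisted: {host!r}")
    return host


def build_ansible_command(playbook_path: str) -> str:
    # shlex.quote turns the path into a single shell word, so metacharacters
    # inside it are treated as literal text rather than as shell syntax.
    return f"ansible-playbook {shlex.quote(playbook_path)}"


safe_command = build_ansible_command("playbooks/restart.yml")
hostile_command = build_ansible_command("x.yml; rm -rf /")
```

Splitting `hostile_command` with `shlex.split` yields exactly two words, which shows that the injected `; rm -rf /` stays part of the path argument.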
OG T
af07c23675
fix(k8s): mount known_hosts in a dedicated /etc/repair-known-hosts directory to fix the mount conflict
...
CD Pipeline / build-and-deploy (push) Successful in 12m11s
E2E Health Check / e2e-health (push) Successful in 34s
/etc/repair-ssh is already used by repair-ssh-key, so the subPath file mount conflicts with it
Switch to the dedicated directory /etc/repair-known-hosts and update KNOWN_HOSTS_PATH to match
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 15:06:28 +08:00
OG T
1644fe6474
feat(api): integrate repair_by_uri into auto_repair_service (Sprint 3 T6)
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 14:39:03 +08:00
OG T
a4e11bfa92
feat(api): AuditLog + Langfuse Trace for SSH_COMMAND (Sprint 3 T5)
...
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 14:38:59 +08:00
OG T
02510d3d93
feat: /api/v1/auto-repair/history endpoint + neural-command wired to the real API (Sprint 3)
...
CD Pipeline / build-and-deploy (push) Failing after 8m50s
- Add RepairHistoryItem/RepairHistoryResponse Pydantic models
- GET /api/v1/auto-repair/history?limit=N derives repair history from the incidents working memory
- Frontend fetchData() fetches history + approvals/pending together and removes the hard-coded pendingApprovals=0
- Wrapped in try/except so that any error returns an empty list without breaking the frontend
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 14:28:55 +08:00
OG T
4561f141bb
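feat(api): Redis idempotency lock to prevent duplicate repairs (Sprint 3 T4)
...
Two-layer lock design: in-process asyncio.Lock (always in effect) + Redis distributed lock (cross-Pod, best-effort)
A second repair call for the same URI immediately returns an "already running" error
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>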
2026-04-06 14:26:53 +08:00
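The in-process half of the two-layer lock in 4561f141bb can be sketched with one `asyncio.Lock` per URI. The Redis layer is left out here, and `repair_once`, the example URI, and the return strings are assumptions used only for illustration.

```python
import asyncio

# One lock per repair URI, shared by every caller in this process.
_in_process_locks: dict = {}


async def repair_once(uri: str, work) -> str:
    """Run `work` unless a repair for the same URI is already in flight."""
    lock = _in_process_locks.setdefault(uri, asyncio.Lock())
    if lock.locked():
        # No await between this check and the acquire below, so on a single
        # event loop two callers cannot both pass the check.
        return "already running"
    async with lock:
        return await work()


async def demo() -> list:
    async def slow_repair() -> str:
        await asyncio.sleep(0.05)
        return "repaired"

    # Hypothetical URI; both calls target the same resource concurrently.
    return await asyncio.gather(
        repair_once("ssh://host/service", slow_repair),
        repair_once("ssh://host/service", slow_repair),
    )
```

Running `asyncio.run(demo())` lets the first call finish while the second returns immediately. This only holds if the lock table outlives a single call, which is the reason f8d4772abf later made HostRepairAgent a singleton.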
OG T
1a654aa37d
feat(api): HostRepairAgent three execution paths + known_hosts + Ansible whitelist (Sprint 3 T3)
...
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 14:22:54 +08:00
OG T
5e8b2a6894
feat(api): URI scheme parser + shell injection protection (Sprint 3 T1)
...
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 14:18:21 +08:00
OG T
cd37befbe6
fix(models): replace datetime.UTC with timezone.utc everywhere for Python 3.10 compatibility
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Type Sync Check / check-type-sync (push) Successful in 59s
terminal.py, incident.py, and utils/timezone.py had the same problem.
The CI runner's Python 3.10 has no UTC constant, which made every model fail to import silently.
# 2026-04-06 ogt: complete fix; no stragglers left
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 12:40:27 +08:00
OG T
59c3dfb910
fix(models): approval.py switches to timezone.utc for Python 3.10 compatibility
...
CD Pipeline / build-and-deploy (push) Successful in 12m12s
Type Sync Check / check-type-sync (push) Failing after 52s
The CI runner uses Python 3.10; datetime.UTC was only added in 3.11.
Switch to datetime.timezone.utc, which works on all versions, fixing the across-the-board CI type-sync failure.
# 2026-04-06 ogt: root cause: CI Python 3.10 cannot import UTC
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 12:19:23 +08:00
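The compatibility point in 59c3dfb910 comes down to one import. `datetime.UTC` is an alias added in Python 3.11, while `datetime.timezone.utc` has been available since 3.2. The helper name below is hypothetical.

```python
from datetime import datetime, timezone

# `from datetime import UTC` raises ImportError on Python 3.10 and earlier;
# timezone.utc is the same object on 3.11+ and works on older versions too.


def utc_now() -> datetime:
    """Return the current time as a timezone-aware UTC datetime."""
    return datetime.now(timezone.utc)


now = utc_now()
```

Since the failing import sat at module level in the model files, one bad line was enough to make each whole module unimportable.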
OG T
f6332b4b2f
fix(telegram): fix approval_id UUID conversion error - support the INC-xxx format
...
CD Pipeline / build-and-deploy (push) Successful in 12m24s
_execute_approval_action calls UUID(approval_id), but approval_id is INC-xxx,
causing a 'badly formed hexadecimal UUID string' error, so approvals could not be executed.
Fix: try the UUID conversion first; on failure, use incident_id to look up the corresponding pending approval UUID.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 11:53:48 +08:00
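The try-then-fall-back approach from f6332b4b2f might be sketched as below. `resolve_approval_uuid` and the injected lookup callable are hypothetical; in the real service the lookup would query pending approvals by incident ID.

```python
from typing import Callable, Optional
from uuid import UUID


def resolve_approval_uuid(
    approval_id: str,
    lookup_pending_by_incident: Callable[[str], Optional[UUID]],
) -> Optional[UUID]:
    """Accept either a UUID string or an incident ID such as INC-xxx."""
    try:
        return UUID(approval_id)
    except ValueError:
        # Not a UUID: treat the value as an incident ID and look up its approval.
        return lookup_pending_by_incident(approval_id)


# Hypothetical in-memory table standing in for the pending-approval query.
pending = {"INC-001": UUID("12345678-1234-5678-1234-567812345678")}
resolved = resolve_approval_uuid("INC-001", pending.get)
```

A caller still has to handle `None`, which is what the lookup returns when no pending approval exists for the incident.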
OG T
658337ec18
fix(phase26): connect the full Incident→DB→KM chain + namespace fix
...
CD Pipeline / build-and-deploy (push) Failing after 1m29s
Type Sync Check / check-type-sync (push) Failing after 52s
Root causes:
1. create_incident_for_approval only writes to Redis, not PostgreSQL
→ the record disappears after the 7-day TTL, so Playbook extraction can never find the Incident
2. ApprovalRecord has no incident_id field
→ _trigger_playbook_extraction relies on a regex that scans Chinese text for INC-, which always fails
3. operation_parser namespace fallback is "default"
→ all deployments live in awoooi-prod, so all 203 executions failed
Fixes:
- Incident is written to both Redis and PostgreSQL (save_to_episodic_memory)
- ApprovalRecord gains an incident_id field (model + ORM + migration)
- alertmanager_webhook writes incident_id back after creating the Approval
- _trigger_playbook_extraction uses approval.incident_id directly
- operation_parser DEFAULT_NAMESPACE = "awoooi-prod"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 11:46:05 +08:00
OG T
286a96d1aa
fix(knowledge): entrystatus enum case fix 'archived' → 'ARCHIVED'
...
CD Pipeline / build-and-deploy (push) Successful in 12m47s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 11:25:44 +08:00
OG T
09d965dab5
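fix(telegram): fix editMessageText 400 error - remove the buttons first, then update the text
...
CD Pipeline / build-and-deploy (push) Failing after 12m46s
Cause: original_text comes from message.text (plain text) and contains characters such as <, >, and &;
when it is sent with parse_mode=HTML, Telegram returns 400.
Fix:
1. First call editMessageReplyMarkup to remove the buttons (guarantees the buttons disappear)
2. Then apply html.escape(original_text) and attempt to update the text
3. A failed text update does not affect the overall flow (removing the buttons is the primary goal)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>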
2026-04-05 22:13:54 +08:00
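The ordering in 09d965dab5 (buttons first, escaped text second, text failure tolerated) can be sketched with injected callables. `finalize_card` and its two parameters are hypothetical stand-ins for the calls to editMessageReplyMarkup and editMessageText.

```python
import html
from typing import Callable


def finalize_card(
    original_text: str,
    remove_buttons: Callable[[], None],
    update_text: Callable[[str], None],
) -> bool:
    """Remove the inline keyboard, then try to rewrite the text as escaped HTML."""
    remove_buttons()  # primary goal: the buttons must disappear
    try:
        # Plain text may contain <, >, and &; escape them before using parse_mode=HTML.
        update_text(html.escape(original_text))
        return True
    except Exception:
        # A failed text update is tolerated because the buttons are already gone.
        return False


calls = []
ok = finalize_card("cpu > 90% & rising <now>", lambda: calls.append("buttons"), calls.append)
```

Here `calls` ends up as `["buttons", "cpu &gt; 90% &amp; rising &lt;now&gt;"]`, in the same order as the fix.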
OG T
5499169996
feat(auto-repair): close the auto-repair loop (ADR-058)
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Type Sync Check / check-type-sync (push) Failing after 53s
Problem: the alert chain never called auto_repair_service, so the mechanism was a complete dead end
Fix:
1. webhooks.py: alertmanager_webhook triggers _try_auto_repair_background after creating the Incident
2. playbook.py: lower the is_high_quality threshold (cold-start period)
- success_count: 10 → 3
- success_rate: 95% → 80%
3. tests: update test_evaluate_not_high_quality to the new threshold
Flow: Alertmanager → API → Incident → evaluate → P2 or lower + high-quality Playbook → auto-execute → Telegram notification
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 22:08:08 +08:00
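The lowered threshold in 5499169996 can be written as a small predicate. The thresholds come from the commit; how success_rate is computed (successes over total executions) and the function signature are assumptions.

```python
MIN_SUCCESS_COUNT = 3     # lowered from 10
MIN_SUCCESS_RATE = 0.80   # lowered from 0.95


def is_high_quality(success_count: int, total_count: int) -> bool:
    """Assumed definition: enough successes and a high enough success ratio."""
    if total_count <= 0:
        return False
    success_rate = success_count / total_count
    return success_count >= MIN_SUCCESS_COUNT and success_rate >= MIN_SUCCESS_RATE
```

Under the old values a playbook with 4 successes out of 5 runs would not qualify; under the new ones it does.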
OG T
a83253da0e
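fix(gitea-webhook): X-Gitea-Signature is plain hex, without the sha256= prefix
...
CD Pipeline / build-and-deploy (push) Failing after 12m39s
The signature header Gitea sends is a plain hex digest with no "sha256=" prefix.
Fix the verification logic to accept both formats (strip a sha256= prefix if present; otherwise use the value as-is).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>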
2026-04-05 15:15:36 +08:00
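Verification that tolerates both header formats, as described in a83253da0e, might look like the following sketch. It assumes the signature is an HMAC-SHA256 of the raw request body keyed with the webhook secret; the function name is hypothetical.

```python
import hashlib
import hmac


def verify_gitea_signature(secret: bytes, body: bytes, header_value: str) -> bool:
    """Accept a plain hex digest or one carrying a `sha256=` prefix."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    received = header_value.strip()
    if received.startswith("sha256="):
        received = received[len("sha256="):]
    # Compare as bytes in constant time; bytes also avoid the ASCII-only
    # restriction that compare_digest places on str arguments.
    return hmac.compare_digest(expected.encode(), received.lower().encode())


secret = b"example-secret"   # placeholder value for the demo
body = b'{"action": "opened"}'
digest = hmac.new(secret, body, hashlib.sha256).hexdigest()
```

Both `digest` and `"sha256=" + digest` pass the check, while any other value is rejected.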
OG T
8220027298
fix(telegram+cd): fix two display bugs
...
CD Pipeline / build-and-deploy (push) Has been cancelled
1. Nemotron args displayed as a Python dict string
- restart_deployment: {'deployment_name': 'awoo'} → restart_deployment: deployment_name=awoooi-api
- Format as key=value instead of using str(dict)[:25]
2. Variables such as ${MINUTES}/${SECONDS} not expanded in the CD notification
- Move the TG_MSG assembly from env: into the run: shell
- Shell variables in env: are static strings before bash runs, so they cannot be expanded
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:47:52 +08:00
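The first bug in 8220027298 is easy to reproduce: slicing `str(dict)` cuts the value off mid-string. The sketch below contrasts that with key=value formatting; `format_tool_call` is a hypothetical helper name.

```python
def format_tool_call(name: str, args: dict) -> str:
    """Render tool arguments as key=value pairs instead of a truncated dict repr."""
    rendered = ", ".join(f"{key}={value}" for key, value in args.items())
    return f"{name}: {rendered}" if rendered else name


args = {"deployment_name": "awoooi-api"}
truncated = str(args)[:25]          # old approach: cuts the value mid-string
formatted = format_tool_call("restart_deployment", args)
```

The 25-character slice ends inside the deployment name, matching the truncated `'awoo` seen in the original alert text.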
OG T
59e7879dfb
feat(webhook): Task 5 — tests GitHub→Gitea (ADR-059)
...
CD Pipeline / build-and-deploy (push) Has been cancelled
- test_gitea_webhook.py: 10 tests, X-Gitea-* headers
- conftest.py: GITEA_WEBHOOK_SECRET / GITEA_ALLOWED_REPOS
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:45:32 +08:00
OG T
23364423fa
feat(webhook): ADR-059 GitHub → Gitea webhook migration complete
...
- gitea_webhook.py: all headers changed to X-Gitea-*; remove the workflow_run handler
- gitea_webhook_service.py: _fetch_pr_diff now uses httpx directly and no longer depends on github_api_service
- Remove all leftover github_ log keys from both files; review_id prefix changed to gitea-
- test_gitea_webhook.py: 10/10 passing; docstring fixed
- 03-secrets.yaml: add GITEA_WEBHOOK_SECRET placeholder
- cd.yaml: add GITEA_WEBHOOK_SECRET injection step
- ADR-059: create the architecture decision record
Pending action by the commander: Gitea Actions secret + Gitea UI webhook configuration
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:44:32 +08:00
OG T
b2c0148f2b
feat(webhook): Task 3 - gitea_webhook.py router (ADR-059)
...
- Add Gitea Webhook Router: X-Gitea-Event/Signature/Delivery
- Support pull_request / push / ping; remove workflow_run
- review_id prefix changed to gt-pr-* / gt-push-*
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:41:12 +08:00
OG T
6777532534
feat(webhook): Task 1+2 - config + service GitHub→Gitea migration (ADR-059)
...
- config.py: GITHUB_WEBHOOK_SECRET/ALLOWED_REPOS → GITEA_*
- Add gitea_webhook_service.py: PR/Push review only; CI diagnosis removed
- Remove CIFailureDiagnosis, diagnose_ci_failure, _call_openclaw_ci_diagnosis
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:33:58 +08:00
OG T
84f1f9f021
refactor(config): GITHUB_WEBHOOK_SECRET → GITEA_WEBHOOK_SECRET (ADR-059)
2026-04-05 14:25:47 +08:00
OG T
22ee9b2fe3
fix(telegram): answerCallbackQuery result=true causes "bool is not iterable"
...
CD Pipeline / build-and-deploy (push) Successful in 13m3s
On success, Telegram answerCallbackQuery returns {"ok": true, "result": true};
in _send_request, "message_id" in result["result"] applies the in operator to a bool,
which raises "argument of type 'bool' is not iterable".
Fix: add an isinstance(result_val, dict) guard before the in check.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:20:54 +08:00
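The guard from 22ee9b2fe3 can be isolated in a few lines. `extract_message_id` is a hypothetical helper; the two response shapes are the ones the commit describes (a dict for message-returning calls, a bare `true` for answerCallbackQuery).

```python
from typing import Optional


def extract_message_id(response: dict) -> Optional[int]:
    """Return result.message_id when result is an object; otherwise None."""
    result_val = response.get("result")
    # `"message_id" in True` raises TypeError, so check the type first.
    if isinstance(result_val, dict) and "message_id" in result_val:
        return result_val["message_id"]
    return None


sent = extract_message_id({"ok": True, "result": {"message_id": 42}})
answered = extract_message_id({"ok": True, "result": True})
```

With the guard in place the callback acknowledgement simply yields `None` instead of raising.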
OG T
4b4007db6c
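feat(telegram): upgrade the SRE group alert format to the full v7.0
...
CD Pipeline / build-and-deploy (push) Has been cancelled
_send_approval_card_to_group now uses the same TelegramMessage.format() as the personal chat,
including every field: SignOz metrics, AI provider/model, Nemotron collaboration, anomaly frequency statistics, and so on.
Commander's instruction: alert messages received by the group must match the personal chat format exactly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>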
2026-04-05 14:11:59 +08:00
OG T
76f3ffd7f7
fix(telegram): whitelist property returns a string, leaving the buttons unresponsive
...
CD Pipeline / build-and-deploy (push) Successful in 13m0s
security_interceptor.whitelist returns settings.OPENCLAW_TG_USER_WHITELIST
(a string), but is_whitelisted evaluates user_id in whitelist (int in str),
so Python raises "requires string as left operand, not int".
Fix: call settings.get_tg_user_whitelist() instead, which returns list[int].
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:40:52 +08:00
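The type mismatch in 76f3ffd7f7 is reproducible in isolation. The sketch assumes the setting is a comma-separated string of user IDs, which the commit does not confirm; `parse_user_whitelist` is a hypothetical equivalent of `get_tg_user_whitelist()`.

```python
def parse_user_whitelist(raw: str) -> list:
    """Parse a comma-separated string of Telegram user IDs into integers."""
    return [int(part) for part in raw.split(",") if part.strip()]


def is_whitelisted(user_id: int, whitelist: list) -> bool:
    return user_id in whitelist


raw_setting = "1001, 1002"   # placeholder IDs
try:
    1001 in raw_setting       # old behaviour: int in str
    old_error = None
except TypeError as exc:
    old_error = str(exc)

whitelist = parse_user_whitelist(raw_setting)
```

Parsing to integers also avoids a subtler failure mode of string membership, where a substring such as `"100"` would match `"1001"`.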
OG T
4b24ecd67f
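fix(sprint3): chief architect Review fixes C1/C2/C3/M3/m1
...
C1: _ssh_execute receives the key_path parameter directly instead of looking it up in LAYER_SSH_CONFIG
C2: PlaybookService.create() proxy; the Router no longer reaches through to call _repository
C3: CD Step 1b replaces IMAGE_TAG_PLACEHOLDER with sed, removing the risk of the run aborting on failure
M3: repair-bot 110/188 regex unified to [a-z0-9][a-z0-9-]{0,30}; underscores are disallowed
m1: add a comment explaining that defaultMode 0400 is octal
m2: _ssh_execute computes the remaining timeout from a deadline
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>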
2026-04-05 13:07:59 +08:00
OG T
665f93e83f
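fix(telegram): chief architect R1 fixes - I-1/I-2/M-1/M-2
...
CD Pipeline / build-and-deploy (push) Has been cancelled
I-1: add TODO notes to the three callers webhooks/sentry_webhook/signoz_webhook
The missing incident_id is a known limitation (the Approval path does not create an Incident association)
I-2: TestPushRequest adds an incident_id field so QA can verify button rendering
M-1: remove the redundant `or message.incident_id` from the _build_inline_keyboard call
M-2: add a note explaining the difference between the 900 and 1000 truncation lengths
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>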
2026-04-05 13:07:42 +08:00
OG T
4935cfc346
fix(telegram): redesign the message format + fix the broken detail/reanalyze/history buttons
...
CD Pipeline / build-and-deploy (push) Failing after 1m26s
- format() / format_with_nemotron(): remove the ═══ separators in favor of a simple line-break layout
- send_approval_card(): add an incident_id parameter and pass it to _build_inline_keyboard()
- decision_manager.py: pass incident.incident_id when calling send_approval_card()
- Root cause: incident_id was not passed to _build_inline_keyboard(), so the second row of buttons was never rendered
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 12:44:13 +08:00
OG T
53e1ae7ad7
fix(phase25): I2 NIM system prompt + I4 field_path regex matching fix
...
CD Pipeline / Deploy Prometheus Alert Rules (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
I2: nemotron.analyze() adds the system role (standard NIM message format)
- Old: messages=[{role:user, ...}]
- New: messages=[{role:system, ...}, {role:user, ...}]
- Effect: defines the K8s operator role, improving tool-calling quality
I4: drift_detector._is_allowlisted/_is_critical use a regex instead of strip
- Old: replace('[*]','') followed by startswith/in → cannot match containers[0]
- New: [*] → \[\d+\] regex, which correctly matches every index
- Fix: containers[*].image now matches containers[0].image
2026-04-05 12:11:05 +08:00
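The I4 change in 53e1ae7ad7 turns a `[*]` wildcard into an index-matching regex. A minimal sketch, with hypothetical function names and exact (full-string) matching assumed:

```python
import re


def field_pattern(rule: str) -> "re.Pattern":
    """Compile a rule such as `containers[*].image` into a regex for concrete paths."""
    escaped = re.escape(rule)  # containers\[\*\]\.image
    # Replace the escaped wildcard with a pattern for any numeric index.
    return re.compile(escaped.replace(re.escape("[*]"), r"\[\d+\]"))


def matches_rule(rule: str, field_path: str) -> bool:
    return field_pattern(rule).fullmatch(field_path) is not None
```

Escaping the whole rule first keeps the literal `.` and brackets from being interpreted as regex syntax.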
OG T
73577f7c5d
chore(ai-router): v4.3 version number sync (trigger CD push event)
CD Pipeline / Deploy Prometheus Alert Rules (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-04-05 12:03:15 +08:00
OG T
76f7330c9d
feat(api): POST /playbooks/ create endpoint + seed-repair-playbooks.py (Task 14)
...
CD Pipeline / build-and-deploy (push) Failing after 57s
CD Pipeline / Deploy Prometheus Alert Rules (push) Has been skipped
- playbooks.py: add a POST / endpoint for creating a Playbook directly (for seeding/administration)
- seed-repair-playbooks.py: 5 Host Repair Playbooks (ssh_command)
sentry/harbor/gitea/alertmanager (110) + openclaw (188)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 11:53:49 +08:00
OG T
bf4f81412c
feat(api): ActionType.SSH_COMMAND + auto_repair_service SSH branch (Task 12)
...
- playbook.py: add the SSH_COMMAND ActionType
- auto_repair_service._execute_step: SSH_COMMAND branch, format layer/component
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 11:47:00 +08:00
OG T
e7d8da85f6
feat(api): HostRepairAgent - host-level repair over SSH (Task 11)
...
- host_repair_agent.py: layer routing, command injection protection, asyncio SSH execution
- Tests: all 12 cases pass (routing/sanitize/success/fail/timeout/denied)
- SSH key: /etc/repair-ssh/id_ed25519 (K8s secret mount)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 11:22:00 +08:00
OG T
7a6fa6359e
feat(api): Sentry init adds unified layer/component tags
...
Aligns with the Prometheus alert label convention (layer/component/team)
Keeps Sentry events consistent with auto_repair routing decisions
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 11:10:40 +08:00
OG T
2243a21b96
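fix(ai-router): v4.3 NIM protection - timeouts do not count as CB failures; always try NIM first before switching to Gemini
...
CD Pipeline / build-and-deploy (push) Failing after 20s
Requirement: the router must wait for NIM to respond before switching; NIM must not be blocked by the CB and routed to Gemini just for being slow
Changes:
- Timeout exceptions do not accumulate CB failures (only real connection errors count)
- NIM CB: failure_threshold=10, recovery_timeout=30s (more lenient than the defaults)
- Design doc v4.3: update direction 2 and remove the incorrect assumption
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>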
2026-04-05 01:51:12 +08:00
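The rule in 2243a21b96, that timeouts do not count toward the circuit breaker (CB), can be sketched as a tiny failure counter. The class name and method names are hypothetical, and the recovery_timeout half-open logic is omitted; only the failure_threshold of 10 is taken from the commit.

```python
import asyncio


class NimCircuitBreaker:
    """Counts only non-timeout failures toward opening the circuit."""

    def __init__(self, failure_threshold: int = 10) -> None:
        self.failure_threshold = failure_threshold
        self.failures = 0

    def record_failure(self, error: BaseException) -> None:
        # asyncio.TimeoutError is a distinct class before Python 3.11, so check both.
        if isinstance(error, (TimeoutError, asyncio.TimeoutError)):
            return  # a slow response is not evidence that the provider is down
        self.failures += 1

    @property
    def is_open(self) -> bool:
        return self.failures >= self.failure_threshold
```

With this rule a run of slow responses leaves the circuit closed, so the next request still goes to NIM first.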
OG T
5ad403b287
fix(p0): v4.3 - measurements confirm CPU-only Ollama is unusable; DIAGNOSE always routes to NIM
...
Measurements (2026-04-05):
- Ollama llama3.2:3b CPU-only: 238s to return {"ok":true}; unusable in production
- Nemotron NIM: 2.2s~27.3s, avg 10.6s; it has been the primary provider all along (since Phase 22)
- NIM never had a privacy problem; Incident data has always been sent to cloud GPUs
Changes:
- ai_router.py: _local_fallback_chain deprecated (empty list)
- ai_router.py: DIAGNOSE route/route_sync switched back to _full_fallback_chain
- config.py: update the timeout notes to reflect the measurements
- test_p0_diagnose_routing.py: update docstring
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 01:49:06 +08:00
OG T
a81bf50537
feat(drift): ADR-057 adopt() implemented with the Gitea PR API
...
- DriftAdoptService: create branch + commit + PR through the Gitea REST API
No git is run inside the API Pod (fixes the C2 security vulnerability)
- adopt() endpoint: 501 → real implementation (calls DriftAdoptService)
- config.py: add GITEA_API_URL / GITEA_API_TOKEN / GITEA_REPO_OWNER / GITEA_REPO_NAME
- K8s secret awoooi-secrets already has GITEA_API_TOKEN injected
- drift.py: remove the unused interpreter variable in trigger_drift_scan
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 00:39:29 +08:00
OG T
f4f454fd98
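feat(api): automatically warm up Redis Working Memory from PostgreSQL after a reboot
...
- main.py lifespan: restore INVESTIGATING/MITIGATING incidents from the DB at startup
- scripts/reboot-recovery: 188 + 110 automation scripts + systemd services
- scripts/reboot-recovery: create aiops-network automatically (ClawBot dependency)
- docs/runbooks/REBOOT-RECOVERY-SOP.md: fully rewritten, including documentation of the automation scripts
Why: after a reboot Redis is emptied, so the frontend shows 0 incidents (the DB retains everything)
Approved by the commander: "All data must be recorded for the long term"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>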
2026-04-05 00:39:20 +08:00
OG T
96d5e18924
fix(p0): measurement-based fixes - adjust timeouts per benchmark; remove cloud Nemotron from _local_fallback_chain
...
- config.py: NEMOTRON_DIAGNOSE_TIMEOUT_SECONDS=60s (NIM measured 11-45s + 15s buffer)
- config.py: OLLAMA_DIAGNOSE_TIMEOUT_SECONDS=200s (Ollama measured ~173s + 27s buffer)
- ollama.py: add a per-task timeout (diagnose/force_local use 200s)
- ai_router.py: remove Nemotron from _local_fallback_chain (NIM = cloud, so it must not be in the local chain)
- ai_router.py: v4.2 - Option C scenario-based routing formally established
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 00:29:09 +08:00
OG T
4912c7f307
fix(phase25): chief architect Review R2 fixes (I1/I2/I3/I4/C3/M1)
...
I1: auto_repair_service: the failure-branch anti_pattern task now gets the _pending_tasks GC protection
C3: drift_remediator: _kubectl_apply() implements resource_key scope filtering (fixes the parameter that previously had no effect)
M1: drift_remediator: _git_push() marked DISABLED to prevent accidental enabling
I2: drift.py: remove the dead adopt() endpoint link from the Telegram notification
I3: drift/page.tsx: handleScan POST body namespace→namespaces (aligned with the backend DriftScanRequest)
I4: drift/page.tsx: remove hard-coded English strings in favor of t('loading')/t('highCount')/t('mediumCount')
i18n: add the drift.loading key to zh-TW.json + en.json
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 00:22:38 +08:00
OG T
a562db4048
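fix(phase25): chief architect Review C1/C2/I1/I3 fixes
...
CD Pipeline / build-and-deploy (push) Failing after 57s
C1: NemotronProvider.privacy_level cloud→local
NIM is deployed on the internal network at 192.168.0.188, not the official cloud API
so it can be included in the DIAGNOSE _local_fallback_chain privacy boundary
C2: adopt() endpoint suspended; returns 501
Running git add -A in the API Pod is a security risk
To be reimplemented with the Gitea PR API once ADR-057 is drafted
I1: timeout log fix: record the timeout value actually applied
Previously it always logged NEMOTRON_TIMEOUT_SECONDS=45
Now it logs the correct value selected by task_type
I3: route_sync() gains the DIAGNOSE privacy boundary
async route() already had _local_fallback_chain
the sync version missed it; added in this commit
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>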
2026-04-04 18:00:05 +08:00
OG T
c4eafd2a5b
fix(ai-router): fallback_models excludes selected_model to avoid duplicates
...
CD Pipeline / build-and-deploy (push) Successful in 6m58s
After the DIAGNOSE intent routes to Nemotron, OPENCLAW_NEMO in the fallback_chain
uses the same model string, so fallback_models contained the already-selected model.
Fix: filter out any model string equal to selected_model.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 17:43:44 +08:00
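The filter described in c4eafd2a5b amounts to a single comprehension. The function name and the model strings in the example are placeholders, not the project's real identifiers.

```python
def build_fallback_models(selected_model: str, fallback_chain: list) -> list:
    """Keep the chain order but drop any entry equal to the selected model."""
    return [model for model in fallback_chain if model != selected_model]


# Placeholder model strings for illustration only.
fallbacks = build_fallback_models(
    "nemotron-example",
    ["nemotron-example", "gemini-example", "ollama-example"],
)
```

Comparing model strings rather than provider names matters here, since the commit notes that two providers shared the same model string.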
OG T
8056be5847
feat(ai-router): upgrade DIAGNOSE intent override to Nemotron (P0)
2026-04-04 17:41:45 +08:00