awoooi

Author	SHA1	Message	Date
OG T	896bef94ee	fix(web): pending-approvals-card: add double-click protection + loading state. Automatic hardening by the linter: - actioningId state prevents repeated actions on the same card - disabled + opacity 0.6 + cursor not-allowed - buttons show '...' while loading - finally() ensures actioningId is cleared Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 18:38:08 +08:00
OG T	890e2a9568	fix(review): architecture review fixes: P0 import crash + zero hardcoded i18n strings + silent errors P0: - proposal_service.py: add the missing get_redis + INCIDENT_KEY_PREFIX imports (before the fix, resolve_incident_after_approval always crashed with NameError) P1 i18n: - page.tsx: remove emoji from topology groups, use tTopo() i18n keys - page.tsx: host labels (DevOps金庫 "DevOps vault", etc.) now use tTopo() i18n - ai-model-status.tsx: add useTranslations, hardcoded 'AI 模型狀態' (AI model status) → t('aiModelStatus') - disposition-mini.tsx: hardcoded '查看完整報表' (view full report) → t('viewAllReport') - recent-activity.tsx: hardcoded '查看活動串流' (view activity stream) → t('viewAllAlerts') P2 quality: - pending-approvals-card.tsx: approve/reject add an r.ok check + error display; "view all approvals" gets a route + i18n - page-tabs.tsx: TabSkeleton hardcoded '載入中...' (loading...) → t('loading') - page.tsx: ↑5% → tDashboard('trendUp', {pct}) dynamic value - page.tsx: hardcoded Prometheus '23' → '-- targets' New i18n keys (zh-TW + en kept in sync): - dashboard: viewAllAlerts/viewAllAuth/viewAllReport/aiModelStatus/loading/trendUp - topology: groupExternal/allReachable/investigating/hostDevops/hostAiData/hostK3sMaster/hostK3sWorker Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 18:34:50 +08:00
OG T	1483218bab	feat(approval): reply on Telegram immediately after approve/reject + persist message_id to the DB Problem: after pressing the Telegram approve/reject button there was no response at all, so the user could not tell whether it had succeeded. The Telegram message_id was stored only in Redis with a 24h TTL and could not be tracked after expiry. Fix: - approval_records gains telegram_message_id / telegram_chat_id columns (ALTER TABLE already applied) - approval_db.update_telegram_message(): persists message_id to the DB - decision_manager: after sending the alert card, write to both Redis and the DB - telegram_gateway._notify_approval_result(): after approve/reject: 1. editMessageReplyMarkup removes the approve/reject buttons and keeps the info buttons 2. sendMessage with reply_to posts a status line under the original message 3. fallback: send_notification sends a new message - _handle_group_command: rename chat_id to _chat_id to silence the IDE lint Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 18:19:31 +08:00
OG T	2c7d5d049c	fix(openclaw): backfill kubectl_command from the Nemotron tool call so the post-approval executor can parse it Root cause: the restart_deployment(deployment_name=sentry) tool call produced by Nemotron lived only in nemotron_tools[] and was never written back to proposal["kubectl_command"]. proposal_service therefore received an empty kubectl_command, approval_records.action stored an empty value, parse_operation_from_action always returned None, and execute_approved_action always skipped. Fix: after Nemotron (or the Gemini fallback) succeeds, convert the tool call into a kubectl command and write it back to proposal["kubectl_command"] so proposal_service gets an executable command. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 18:15:01 +08:00
OG T	ae9780837d	fix(proposal): action prefers kubectl_command, fixing the root bug where execution was always skipped after approval Root cause: approval_records.action stored the LLM action_title (a Chinese title such as 「重啟 sentry 服務」, "restart the sentry service"), which parse_operation_from_action() cannot parse, so execute_approved_action() skipped every time. Fix: action now takes llm_proposal["kubectl_command"] (an executable kubectl command) first and falls back to action_title only when kubectl_command is absent. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 18:13:22 +08:00
OG T	d8c2969341	feat(telegram): AI pipeline transparency: alert messages show the OpenClaw + Tool Calling model/backend - nemotron.py: detect OllamaToolProvider vs NvidiaProvider, record tool_model/tool_backend - openclaw.py: propagate nemotron_tool_model/nemotron_tool_backend into the proposal - decision_manager.py: extract them from proposal_data and pass them to send_approval_card() - telegram_gateway.py: TelegramMessage gains two fields; format_with_nemotron shows "🔧 Tool Calling: llama3.1:8b (Ollama 本機)" (Ollama local) or "NVIDIA 雲端" (NVIDIA cloud) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 15:05:16 +08:00
OG T	7857c25677	feat: local Ollama Tool Calling replaces NVIDIA cloud (44s→~5s) - nvidia_provider.py: add OllamaToolProvider - implements the INvidiaProvider protocol, calls Ollama /v1/chat/completions - model: llama3.1:8b (the 8B model with the most stable tool calling) - latency: 44s → ~5s (local M1 Pro, 192.168.0.111) - get_nvidia_provider() switches on USE_OLLAMA_TOOL_CALLING - config.py: USE_OLLAMA_TOOL_CALLING=True (on by default), OLLAMA_TOOL_MODEL=llama3.1:8b - rollback: USE_OLLAMA_TOOL_CALLING=False → restores the cloud NvidiaProvider Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 14:55:04 +08:00
OG T	4f80ba38c0	feat: alert status changes continue in the original message (Option B) telegram_gateway.py - add append_incident_update(incident_id, status_line) - read message_id from Redis tg_msg:{id} - editMessageReplyMarkup: remove Row1 (approve/reject/silence), keep Row2 (details/re-diagnose/history) - sendMessage reply_to_message_id: append a status line below the original message - return False when message_id is not found (the caller handles the fallback) decision_manager.py - _push_decision_to_telegram: after send_approval_card, store tg_msg:{id}=message_id (TTL 24h) - _push_auto_repair_result: use append_incident_update, falling back to a new message only when message_id is not found Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 14:21:33 +08:00
OG T	2554ac1e60	fix: E2E test alert recognition + Telegram notification of auto-repair results alert_rule_engine.py - _matches() adds instance_prefix matching (highest priority) - match_rule() passes the instance label to _matches - purpose: e2e-final-* / e2e-test-* instances can be recognized by YAML rules alert_rules.yaml - add the e2e_smoke_test rule (priority=120) - alertname: E2E_SMOKE_TEST / instance_prefix: e2e-final- / e2e-test- / test-host - suggested_action: NO_ACTION, displays 「告警鏈路驗證成功」 ("alert pipeline verified") decision_manager.py - _auto_execute() sends a Telegram result notification on success ✅ - _auto_execute() sends a Telegram failure notification on failure ❌ - add the _push_auto_repair_result() function Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 14:16:15 +08:00
OG T	1fb0c0ca90	fix(auto-repair): Bug #5+#6: SSH binary + affected_services matching fixes Bug #5 (webhooks.py): target_resource now prefers the component label - the SentryDown alert has labels.component="sentry" - old logic: labels.instance="192.168.0.110:9000" → no match against Playbook affected_services - new logic: component → pod → instance → alertname Bug #6 (Dockerfile): python:3.11-slim has no openssh-client - the SSH_COMMAND Playbook execution path calls asyncio.create_subprocess_exec("ssh", ...) - the image has no ssh binary → every SSH repair was bound to fail - fix: install openssh-client in the production stage Service list: add the main sentry service to service-registry.yaml (AUTO level) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 14:11:50 +08:00
OG T	1d88b7cd9d	fix(webhooks): add alertname to Signal.labels so playbook matching can read the original alertname Problem: when create_incident_for_approval built the Signal, labels contained only namespace/resource and no alertname, so _extract_symptoms read None from labels.alertname and fell back to alert_name="custom"; playbook Jaccard matching could never match the real alertname (e.g. SentryDown). Fix: add an alertname parameter and pass it into Signal.labels["alertname"]. Both call sites (LLM success + fallback) now pass alertname=alertname. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 13:54:42 +08:00
OG T	e4070b2f86	fix(webhooks): add the get_alert_operation_log_repository import in two places Cause of the alert_received_log_failed error: the alertmanager_webhook function called get_alert_operation_log_repository() directly without importing it in local scope, so the NameError was swallowed by except and ALERT_RECEIVED events could not be recorded. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 12:29:48 +08:00
OG T	fc03eb1f4d	fix(auto-repair): _extract_symptoms prefers labels.alertname to obtain the original alertname Problem: signal.alert_name holds the alert_type (e.g. "custom") rather than the Prometheus alertname (e.g. "SentryDown"), so playbook Jaccard matching always failed (NO_MATCH). Root cause: the webhook's alertname_to_type mapping converts unknown alertnames to "custom" and stores that in signal.alert_name, while the Playbook's symptom_pattern.alert_names stores the original names. Fix: read the original Prometheus alertname from signal.labels["alertname"], falling back to signal.alert_name (backward compatible). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 12:26:18 +08:00
OG T	af49a54728	fix(playbook): bypass the similarity threshold when alert_names match exactly Symptom: SentryDown/OllamaDown alerts triggered incidents, but the playbook search returned NO_MATCH even though alert_names were identical. Root cause: in the weighted Jaccard calculation, affected_services held the Prometheus instance IP (192.168.0.110:9000) while the Playbook held the service name (sentry), so the services dimension scored 0 and the final score was 0.35 < min_similarity=0.4. Fix: when alert_names intersect, the match passes directly and is no longer dragged down by the other dimensions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 12:05:07 +08:00
OG T	79a9a514dd	fix(rules): ADR-064 L1 Redis distributed lock prevents multiple Pods from generating duplicate rules Problem: the _generating set is process-level and independent per Pod, so the same alertname could be sent to Ollama/Gemini for rule generation by several Pods at once. Fix: SET NX EX lock_key: only the first Pod acquires the lock and the other Pods skip. Degradation: when Redis is unavailable, fall back to the process-level set (preserving the original behavior). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 12:03:51 +08:00
OG T	b66263ad36	fix(decision_manager): do not resend resolved Incidents to Telegram After the 10-minute dedup TTL expired, already-resolved Incidents were still pushed again. Add a status check: resolved/closed Incidents are skipped. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 12:00:39 +08:00
OG T	9361fd1fa7	fix(decision_manager): action must not go through strip_placeholders, to avoid truncating the deployment name _strip_placeholders removed <...>, turning kubectl rollout restart deployment/<name> into kubectl rollout restart deployment/, so the suggested command shown in Telegram was incomplete. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 11:45:33 +08:00
OG T	d467fc11be	fix(nemotron): fix the deployment_name placeholder problem Root cause: Nemotron tool calling received target_resource=DockerContainerUnhealthy (not a real K8s deployment name) and filled in <deployment_name> when unsure. Fix: 1. the prompt states explicitly that deployment_name must be filled with target_resource 2. after receiving the tool call result, detect the placeholder and overwrite it with target_resource Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 11:44:25 +08:00
OG T	c5e475121a	fix(telegram): fix the truncated suggested command + decision_manager enum string correction Root cause 1: telegram_gateway.py suggested_action[:35] cut exactly after deployment/ → changed to [:80] so the full kubectl command is shown Root cause 2: old Incident proposal_data stored an enum string (RESTART_DEPLOYMENT) → decision_manager.py now detects this and looks up the kubectl command again via the rule engine Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 11:14:30 +08:00
OG T	5ea6c3fb91	feat: alert_operation_log query API + frontend page (Sprint 5.2) Backend: - add the list_recent() pagination method (alert_operation_log_repository) - add the /api/v1/alert-operation-logs GET + /stats endpoints - main.py registers alert_operation_logs_v1.router Frontend: - /alert-operation-logs page with color coding for 18 event_types - pagination, event_type filter, incident_id filter - 24h stats cards (total / guardrail blocks / auto-repairs / resolved) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 10:57:40 +08:00
OG T	428e66c111	fix(arch-review): chief architect review, S1×3 S2×3 S3×3 all fixed + ADR-064 S1 Critical: - S1-1: move asyncio triggering into the _call_with_fallback async context, remove get_event_loop() from sync code - S1-2: _append_rule_to_yaml adds textwrap.dedent() to normalize the indentation of LLM output - S1-3: _matches() returns False directly for alertname=["*"] to prevent accidental hits S2 Major: - S2-1: auto_generate_rule() switches to DI parameter injection (ollama_url/model/gemini_api_key), removes import settings - S2-4: _generate_mock_response docstring clarifies that it is the rule-engine production path, not fake data - S2-5: suggested_action .strip() prevents whitespace-only strings from bypassing `or` S3 Minor: - S3-2: priority upper bound min(next, 890) - S3-3: alertname sanitize re.sub([{}]) prevents format KeyError - S3-4: model_registry.py last-modified timestamp updated Docs: - ADR-064: Alert Rule Engine, YAML-driven + AI self-learning - Skills 02: alert rule engine DI conventions + asyncio prohibitions - Skills 03: _generate_mock_response semantics clarified + rule-engine degradation flow - LOGBOOK: full record of this Session 2026-04-09 ogt: chief architect review fixes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 10:52:40 +08:00
OG T	4b3fdd82f9	fix(api): incidents list no longer waits synchronously for AI decisions (performance fix) Problem: GET /api/v1/incidents awaited AI analysis (120-180s) for every incident; with several active incidents the timeouts multiplied → the frontend could not load at all. Fix: - the list endpoint only reads decision tokens already cached in Redis (returns immediately) - on a cache miss it returns decision=null and triggers AI in the background, fire-and-forget - the frontend then GETs the single-item endpoint for the incidents it cares about to obtain the decision Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 10:49:30 +08:00
OG T	89da2d24be	fix(model-registry): update the fallback config to deepseek-r1:14b + gemma3:4b - model_registry._get_default_config: ollama summary llama3.2:3b → gemma3:4b - model_registry._get_default_config: ollama default/rca → deepseek-r1:14b - fix the failing test_smart_router::test_simple_context (asserts gemma3:4b) - alert_rule_engine: remove unused asyncio/time imports - 2026-04-09 ogt Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 09:52:47 +08:00
OG T	71437db0e9	feat(rule-engine): automatic rule generation: generic_fallback triggers AI learning Flow: 1. an alert hits the generic_fallback rule 2. auto_generate_rule() is triggered in the background 3. Ollama (deepseek-r1:14b) generates a YAML rule snippet 4. Ollama fails → Gemini fallback 5. validate format → append to alert_rules.yaml → clear lru_cache 6. the next alert of the same kind hits its dedicated rule directly and no longer falls through to the catch-all Dedup: each alertname is generated only once per process Priorities: hand-written rules 1-499, AI-generated 500-899, catch-all 999 2026-04-09 ogt: AI self-learning rule engine Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 09:20:33 +08:00
OG T	d1ede7f989	feat(openclaw): alert rule engine: alert_rules.yaml replaces hardcoded if/elif - add alert_rules.yaml: 6 rules (docker/target_down/oom/cpu/5xx/crash) + a generic catch-all - add alert_rule_engine.py: YAML loading, matching logic, variable substitution - openclaw.py _generate_mock_response: refactored to call the rule engine (v8.0) - adding a rule only requires editing the YAML and restarting the Pod, no code change - 2026-04-09 ogt: architecture refactor Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 09:05:23 +08:00
OG T	3abc7c2f85	fix(openclaw): dedicated rule matching for DockerContainerUnhealthy + TargetDown - DockerContainerUnhealthy: ssh docker inspect + docker restart, including healthcheck command verification - TargetDown / IP:port instance: ssh to confirm the exporter is alive - fix target mistakenly using alertname as the deployment name - alertname/labels are extracted from alert_context for rule evaluation - 2026-04-09 ogt: add two dedicated rules Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 09:00:31 +08:00
OG T	4b6f14d9a1	fix(webhook): alertmanager path suggested_action now uses kubectl_command - line 1399: suggested_action.value (RESTART_DEPLOYMENT) → kubectl_command - consistent with line 887 of the /alerts path - fixes Telegram showing "kubectl rollout restart deployment/" with nothing after it - 2026-04-09 ogt: bug fix Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 08:57:56 +08:00
OG T	d80153bdce	fix(openclaw): fall back to Gemini to produce the execution plan after NIM fails completely After repeated NIM tool calling timeouts, a blank execution plan is no longer shown; Gemini acts as a proxy and produces the kubectl commands (parsed from JSON). Triggered only when NIM fails completely, in line with the Commander's "must wait until there is a response" principle. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 22:55:25 +08:00
OG T	6f475000f6	fix(db): alert_operation_log.event_type String→PgEnum (create_type=False) Fixes DatatypeMismatchError: the DB column is the native enum alert_event_type, but the SQLAlchemy model wrongly used String(50), causing alert_operation_log writes to fail. Use PgEnum(create_type=False) so SQLAlchemy maps the existing DB enum without recreating the type. The 18 event_type values match the M-003 migration. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 22:42:36 +08:00
OG T	86ac6ed028	perf(api): HostAggregator performance optimization: shorter probe timeout + 30-second in-memory cache	2026-04-08 22:42:01 +08:00
OG T	2a6977343a	fix(telegram): pass incident_id at every _push_to_telegram_background call site Rule matches showed six buttons but the Ollama/OpenClaw path showed only three. Root cause: the alertmanager and fallback paths called _push_to_telegram_background without incident_id, so the details/re-diagnose/history buttons were not displayed. - _push_to_telegram_background: add the incident_id parameter - alertmanager main path: pass incident_id - alertmanager fallback path: store the return value and pass it - /alerts path: no incident exists yet, pass an empty string explicitly Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 22:40:22 +08:00
OG T	8b5db2f58e	feat(infra): switch Ollama to M1 Pro 192.168.0.111 + NetworkPolicy update - OLLAMA_URL: 188 → 111 (M1 Pro, 40+ tok/s vs 0.45 tok/s) - OPENCLAW_DEFAULT_MODEL: qwen2.5:7b-instruct → deepseek-r1:14b (strongest reasoning for SRE) - OPENCLAW_TIMEOUT: 90s → 120s (slowest measured deepseek-r1:14b run: 54s) - NetworkPolicy v1.3: add 192.168.0.111:11434 egress, remove the Ollama port for 188 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 22:05:14 +08:00
OG T	c9f1bcd122	fix(api): service_registry safe degradation: no crash when the Docker image has no YAML, fall back to AUTO	2026-04-08 21:47:38 +08:00
OG T	db4b28c49d	fix(ci): force-trigger CD: the service_registry.py Docker path fix is already included in `1f9eea5` Pod CrashLoopBackOff: IndexError parents[5] Fix: _find_registry_path() safe search (parents[4]/parents[3]/absolute path) `1f9eea5` already contained the fix but did not trigger CI; this commit forces a rebuild Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 21:37:49 +08:00
OG T	1f9eea5b74	fix(api): service_registry.py Path index fix: compatible with the Docker container environment	2026-04-08 21:34:40 +08:00
OG T	14cb015826	fix(openclaw): Nemotron retry logic + exhausted log key (previously uncommitted changes) - generate_incident_proposal_with_tools: single try/except → 2-attempt retry loop - failure log key: nemotron_collaboration_failed → nemotron_collaboration_exhausted - nemotron_enabled=True on failure (so the Commander sees the failed state) - _call_nemotron_tools: timeouts now raise an exception (so the outer loop retries) - these are local changes from an earlier Session, fixing a mismatch between the tests and the actual implementation Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 21:16:34 +08:00
OG T	0f5fecfef5	fix(sprint5.1): chief architect review fixes, S1×4 S2×2 S3×1 S1-1: service_registry/velero_client/preflight_service switch to structlog S1-2: velero_client datetime.now(UTC) replaced with now_taipei() (the Taipei-timezone hard rule) S1-3: a Guardrail failure now rejects conservatively (the original fail-open direction contradicted the safety goal) S1-4: service_registry import moved to module top (in-function import removed) S2-1: the six telegram_gateway T1-T6 notification methods all get try/except S2-2: webhooks.py Langfuse URL now uses settings.LANGFUSE_URL (hardcoded internal IP removed) S3-3: velero_client trigger_emergency_backup switched to kubectl apply of a Backup CRD (the original kubectl create backup syntax does not exist; the review found a silent-failure risk) Review score: 70/100 → expected 90+/100 after the fixes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 16:36:18 +08:00
OG T	88696dba9b	feat(sprint5.1): Data Safety Guardrails full-pipeline integration (L1-L5) Layer 0 - K8s RBAC: - k8s/rbac/api-velero-reader.yaml: awoooi-executor SA Velero backup reader Layer 1 - DB Migration (already run on 188): - M-002: approval_records adds approval_level/votes/required_votes - M-003: alert_event_type ENUM adds 8 values Layer 2 - IaC: - ops/config/service-registry.yaml: Stateful tier list for all services (BLOCK/CRITICAL_HITL/STANDARD_HITL/AUTO) Layer 3 - Python Services: - service_registry.py: reads the YAML, provides is_blocked/requires_multisig/get_required_votes - velero_client.py: queries Velero backup age via kubectl, falls back to 999h on failure - preflight_service.py: Pre-flight safety checks (Q2/Q4 decisions) Layer 1-M001 - Playbook model: - playbook.py: adds requires_approval_level/stateful_targets/requires_pre_backup Layer 4 - Business logic: - alert_operation_log_repository.py: adds 8 event_types (Guardrail/Pre-flight/MultiSig/backup) - auto_repair_service.py: injects the Service Registry Guardrail check (BLOCK → reject outright) - webhooks.py: ALERT_RECEIVED traceability record + auto_repair flag Q9 + Langfuse trace_id Q10 - db/models.py: ApprovalRecord syncs the approval_level/votes/required_votes columns - docker-health-monitor.sh: reworked as a pure sensing layer (all docker restart logic removed) Layer 5 - Telegram notifications: - telegram_gateway.py: six new T1-T6 notification methods (Guardrail/Pre-flight/Backup/MultiSig/ChangeApplied) References: ADR-062 Data Safety Guardrails, ADR-063 Service Registry IaC Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 16:24:09 +08:00
OG T	f20121ad41	feat(audit): Phase 11 full traceability of alert operations: alert_operation_log + historical backfill Commander directive: "write every alert message to the database and record the related operations" Changes: - phase11_alert_operation_log.sql: new table (Event Sourcing, immutable) - phase11b_backfill_alert_operation_log.sql: backfills 654 historical rows - 14 ALERT_RECEIVED (incidents) - 265 TELEGRAM_SENT (approval_records) - 265 USER_ACTION (approval_records) - 110 EXECUTION_COMPLETED (audit_logs) - db/models.py: AlertOperationLog SQLAlchemy model - repositories/alert_operation_log_repository.py: append/list_by_incident/get_stats - webhooks.py: _try_auto_repair_background writes AUTO_REPAIR_TRIGGERED + EXECUTION_COMPLETED + TELEGRAM_RESULT_SENT - webhooks.py: _push_to_telegram_background writes TELEGRAM_SENT - telegram.py: handle_callback writes USER_ACTION (approve/reject) Migration executed: awoooi_prod@192.168.0.188 ✅ Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 11:22:03 +08:00
OG T	eee6f06215	feat(auto-repair): all operations must be written to the DB: auto_repair_executions table Commander directive: every auto-repair operation (success or failure) must be persisted Changes: - migrations/phase10_auto_repair_executions.sql: new table + 4 indexes - db/models.py: add the AutoRepairExecution SQLAlchemy model - repositories/audit_log_repository.py: add AutoRepairExecutionRepository (create/list_by_incident/get_stats) - auto_repair_service.py: execute_auto_repair writes to the DB on both the success and failure branches - add similarity_score parameter passing - AutoRepairDecision gains a similarity_score field - webhooks.py: pass similarity_score into execute_auto_repair Migration executed: awoooi_prod@192.168.0.188:5432 ✅ Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 11:16:37 +08:00
OG T	68a2fff746	feat(auto-repair): remove all blocking thresholds: everything goes straight to auto-repair Commander directive: every APPROVED Playbook executes directly, no longer checking: - the similarity threshold (MIN_SIMILARITY_SCORE 0.7 → 0.0) - the is_high_quality quality threshold - the cold-start trust mechanism - the action risk-level threshold (both the evaluate and execute layers) Kept: manual review for P0/P1 severity, the global cooldown circuit breaker, the APPROVED status check Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 11:10:09 +08:00
OG T	b7ea362efc	fix(api): Review #2 tech-debt cleanup: I1/S1/S2/S3 all fixed I1: complete the error_type field - AnomalyCounter.derive_key_from_incident() now extracts reason/error_type from signal.labels, ensuring all four fields are present S1: unify signature construction in three places - auto_repair_service._derive_anomaly_key() → delegates to derive_key_from_incident() - approval_execution._get_anomaly_key_from_approval() → same - incident_service.resolve_incident() B4 → same - removes 3 duplicated copies of the signature construction code S2: Redis Pipeline batch queries - get_all_disposition_stats() changes from N+1 hgetall calls to 2 Pipelines - Pipeline 1: batch hgetall of all disposition keys - Pipeline 2: batch hget of metadata (alert_name) - Redis round-trips drop from about 2N to 2 S3: auto_repair.py get_incident AttributeError fix - get_incident() → get_from_working_memory() (pre-existing bug) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-04-07 13:13:42 +08:00
OG T	de3935d1d4	feat: Sprint 4 Phase E+F: frontend disposition stats + weekly-report disposition breakdown Phase E: frontend pages - E1: /reports full disposition statistics dashboard (already done in Sprint F) - E2: homepage Metrics Strip: take the real automation rate from the disposition API; prefer /stats/disposition auto_rate, fall back to an estimate derived from incidents - E3: /auto-repair disposition overview card (already done in Sprint F) - E4: /neural-command stats tab disposition breakdown (already done in Sprint F) - E5: i18n translations zh-TW + en (already done in Sprint F) Phase F: weekly report + docs - F1: WeeklyReportMessage adds 5 disposition fields; the weekly report format gains a 「📋 處置分佈」 (disposition breakdown) block (auto/cold-start/human/manual + automation rate); weekly_report_service integrates get_all_disposition_stats() - message length limit raised from 900 to 1200 characters (to fit the disposition block) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-04-07 13:02:20 +08:00
OG T	561bcb638b	fix(api): Sprint 4 chief architect Review P0 fixes: unified hash + modular ("building-block") compliance P0-1: unify anomaly_key hash derivation - B1: add _derive_anomaly_key() using AnomalyCounter.hash_signature() in place of symptoms.compute_hash() - B3/B4: namespace now uses signal.labels.get("namespace", ""), fixing getattr(signal, "namespace", "") always returning an empty string P0-2: Router-layer modular compliance - C1/C2: encapsulate get_all_disposition_stats() in AnomalyCounter - the Router no longer accesses counter.redis directly - stats.py removes the unused days/stats parameters P1: get_frequency() populates the disposition fields - consistent with _record_anomaly_impl(), returns the full disposition stats Chief architect score: 82/100 → all P0 items fixed Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-04-07 12:53:12 +08:00
OG T	a85e9ced08	feat(api+telegram): Sprint 4 Phase C+D: API endpoints + Telegram disposition stats Phase C: API endpoints - C1: GET /api/v1/stats/disposition: full disposition breakdown statistics - DispositionSummary: auto/human/manual/cold_start + auto_rate - DispositionByAnomaly: per-anomaly-type detail (up to 20 entries) - Redis SCAN + HGETALL aggregation - C2: GET /api/v1/auto-repair/stats extended with disposition_summary Phase D: Telegram alert format - D1: the alert card gains a disposition stats line - 🤖 Auto: N | 👤 Reviewed: N | 🔧 Manual: N - automation rate percentage - D2: the history button is enhanced with disposition breakdown detail - all 5 counts + automation rate Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-04-07 12:17:20 +08:00
OG T	9253281d46	feat(api): Sprint 4 Phase A+B: alert disposition statistics, data layer + write layer Phase A: data layer - A1: IncidentFrequencyStats adds 4 fields (human_approved/manual_resolved/cold_start_trust/total_resolution) - A2: AnomalyCounter.record_disposition(): atomic increment via Redis HINCRBY - A3: get_disposition_stats(): HGETALL returns the disposition breakdown - AnomalyFrequency dataclass extended + to_dict() kept in sync - _record_anomaly_impl() integrates disposition stats Phase B: wiring the write-layer trigger points - B1: auto-repair success → record_disposition("auto_repair") - B2: cold-start trust success → record_disposition("cold_start_trust") - AutoRepairDecision adds an is_cold_start flag - execute_auto_repair() receives it and distinguishes the disposition type - B3: human-approved execution success → record_disposition("human_approved") - add the _get_anomaly_key_from_approval() helper - B4: manual handling is inferred → resolve_incident() decides by elimination - if resolved and there is no auto/human/cold_start record → manual_resolved Safety design: all disposition recording is wrapped in try/except, so a failure never blocks the main flow Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-04-07 11:54:46 +08:00
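OG T	53b2daeaca	feat(api): first-time trust mechanism: breaking the auto-repair cold-start chicken-and-egg problem Problem: a Playbook needs success_count >= 3 to count as is_high_quality, but without auto-repair there are never any success records → the threshold can never be reached. Option C: first-time trust (Cold Start Trust) - APPROVED status + every step risk=LOW + fewer than 3 executions → allowed automatically - a Redis counter limits first-time-trust auto-repairs to 5 per day - after 3 accumulated successes it reverts automatically to the normal is_high_quality threshold Safety boundaries: - only LOW-risk steps qualify for first-time trust (restarting a container, etc.) - HIGH/CRITICAL still require manual review - P0/P1 severity still requires manual review - the daily cap prevents runaway behavior Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-04-07 11:21:00 +08:00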
OG T	2fe8062fb8	refactor(api): Re-Review S1/S2/S3 improvements: remove duplication + defensive validation + test isolation S1: extract the shared _execute_and_observe() method - removes 3 duplicated execute+audit+langfuse blocks in repair_by_uri - unifies the AuditLog + Langfuse trace write path S2: defensive SSH username validation - add validate_ssh_user() + the _SSH_USER_RE regex - validate the user parameter at the entry of _ssh_execute() - prevents unexpected behavior from user@host concatenation - add 8 username validation tests S3: Singleton test reset - add the _reset_for_test() classmethod - avoids state pollution across tests - add 2 singleton reset tests Tests: 55/55 passing (45 existing + 10 new) Chief architect Re-Review: 91/100 ✅ passed, all 3 Suggestions implemented Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-04-07 11:17:40 +08:00
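OG T	78a8d3dfa5	fix(api): add whitelist validation for the ansible control node to prevent bypass via environment variable (Re-Review Important) The chief architect's Re-Review pointed out that ANSIBLE_CONTROL_HOST comes from an environment variable (ConfigMap); if the ConfigMap is tampered with, SSH_TARGET_WHITELIST can be bypassed. Add validate_ssh_target_host(host) at the start of _execute_ansible() to close the loop. Re-Review score: 91/100 ✅ passed Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-04-07 11:13:49 +08:00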
OG T	f8d4772abf	fix(api): Sprint 3 P0-1/P0-2/P0-3/P0-4 Critical Security Fixes P0-1: Complete shell metacharacter regex detection - Enhanced _SHELL_METACHAR_RE to detect: >, <, \n, ${}, $() - Prevents all shell injection vectors (redirects, variable expansion, newlines) - Added 5 new validation tests P0-2: Add shlex.quote() protection for ansible playbook path - Wraps playbook_path in shlex.quote() before SSH command construction - Prevents shell injection if path contains special characters - Applied in _execute_ansible() method P0-3: Add SSH target host whitelist validation - Introduces validate_ssh_target_host() function - Only allows SSH to: 192.168.0.110, 192.168.0.188 - Prevents unauthorized SSH target exploitation - Added 5 new whitelist validation tests P0-4: Convert HostRepairAgent to singleton pattern - Implements __new__() singleton with shared _in_process_locks dict - Ensures in-process locks persist across multiple auto_repair_service calls - Previously created new instance per call, making locks ineffective - Added singleton persistence test Test Results: 45/45 passing (34 existing + 11 new P0 tests) All security validations verified via comprehensive unit test coverage. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2026-04-07 11:09:45 +08:00
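
The alert_names bypass described in af49a54728 can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the Symptoms dataclass, the two-dimension weighting, and the weight values are assumptions chosen so that the example reproduces the 0.35 < 0.4 case from the commit message. Only min_similarity=0.4 and the bypass rule itself come from the log.

```python
from __future__ import annotations

from dataclasses import dataclass, field

MIN_SIMILARITY = 0.4  # threshold quoted in the commit message

# Hypothetical weights: picked so that a perfect alert_names match with no
# service overlap scores 0.35, as in the commit message.
WEIGHTS = {"alert_names": 0.35, "affected_services": 0.65}


@dataclass
class Symptoms:
    """Illustrative stand-in for the symptom pattern compared during matching."""

    alert_names: set[str] = field(default_factory=set)
    affected_services: set[str] = field(default_factory=set)


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets score 0.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def weighted_score(incident: Symptoms, playbook: Symptoms) -> float:
    """Weighted sum of the per-dimension Jaccard similarities."""
    return (
        WEIGHTS["alert_names"] * jaccard(incident.alert_names, playbook.alert_names)
        + WEIGHTS["affected_services"]
        * jaccard(incident.affected_services, playbook.affected_services)
    )


def playbook_matches(incident: Symptoms, playbook: Symptoms) -> bool:
    """Pass directly when alert_names intersect; otherwise apply the threshold."""
    if incident.alert_names & playbook.alert_names:
        return True
    return weighted_score(incident, playbook) >= MIN_SIMILARITY


# The incident carries the Prometheus instance, the playbook the service name.
incident = Symptoms({"SentryDown"}, {"192.168.0.110:9000"})
playbook = Symptoms({"SentryDown"}, {"sentry"})
```

Here weighted_score(incident, playbook) stays below MIN_SIMILARITY because the services dimension contributes nothing, while playbook_matches(incident, playbook) still passes through the alert_names intersection.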


414 Commits
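
For the Sprint 3 hardening in f8d4772abf, a minimal sketch of the three checks might look like this. The names _SHELL_METACHAR_RE and validate_ssh_target_host and the two whitelisted hosts come from the commit message. The exact regex, the has_shell_metachar and build_ansible_ssh_argv helpers, and the ansible-playbook invocation are assumptions for illustration only.

```python
from __future__ import annotations

import re
import shlex

# Assumed pattern: the commit lists >, <, newline, ${ and $( as newly detected;
# the ; & | and backtick entries are a guess at what was already covered.
_SHELL_METACHAR_RE = re.compile(r"[;&|`<>\n]|\$\{|\$\(")

# Hosts named in the commit message as the only permitted SSH targets.
SSH_TARGET_WHITELIST = frozenset({"192.168.0.110", "192.168.0.188"})


def has_shell_metachar(value: str) -> bool:
    """Return True if the value contains a character usable for shell injection."""
    return _SHELL_METACHAR_RE.search(value) is not None


def validate_ssh_target_host(host: str) -> str:
    """Return the host unchanged if whitelisted, otherwise raise ValueError."""
    if host not in SSH_TARGET_WHITELIST:
        raise ValueError(f"SSH target not in whitelist: {host!r}")
    return host


def build_ansible_ssh_argv(host: str, playbook_path: str) -> list[str]:
    """Hypothetical helper: build an argv list for running a playbook over SSH.

    The host is checked against the whitelist first, and the path is wrapped
    in shlex.quote() because the remote side passes the string to a shell.
    """
    validate_ssh_target_host(host)
    remote_command = "ansible-playbook " + shlex.quote(playbook_path)
    return ["ssh", host, remote_command]
```

Passing an argv list rather than one command string keeps the local side free of shell parsing; the quoting only has to protect the remote shell.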