OG T
7e498621e0
fix(signoz_webhook): convert AIBlastRadius → BlastRadius explicitly
CD Pipeline / build-and-deploy (push) Has been cancelled
Passing an AIBlastRadius object into the blast_radius field caused a Pydantic validation error,
so the approval could not be saved to the DB (the Telegram message was still sent but could not be approved).
Fix: convert AIBlastRadius → BlastRadius explicitly, bridging the data_impact enum via .value.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 00:15:40 +08:00
OG T
a303b5ef91
feat(chat): switch NemoClaw to deepseek-r1:14b on Ollama host 111
CD Pipeline / build-and-deploy (push) Failing after 4m6s
2026-04-09 ogt: drop Claude Haiku in favor of the local deepseek-r1:14b
- Endpoint: http://192.168.0.111:11434
- Filter out <think>...</think> reasoning blocks and return only the conclusion
- timeout 120s (14b inference is slower)
- Completely free; not billed against the Claude API
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 23:38:57 +08:00
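The `<think>...</think>` filtering described in a303b5ef91 could be sketched as below. This is a minimal illustration, not the project's actual code: the function name `strip_think_blocks` is hypothetical, and it assumes the reasoning block is always closed by a matching `</think>` tag.

```python
import re

# Non-greedy match of one <think>...</think> block; DOTALL lets "." span newlines.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_think_blocks(text: str) -> str:
    """Drop reasoning blocks and return only the remaining conclusion text."""
    return _THINK_BLOCK.sub("", text).strip()
```

An unclosed `<think>` tag is left untouched by this sketch; whether to drop or keep such output is a separate design decision.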
OG T
62cb274735
feat(host_aggregator+k8s): add host monitoring for the K3s Worker on 121
CD Pipeline / build-and-deploy (push) Has been cancelled
Add 192.168.0.121 (K3s Worker) to HOST_CONFIGS:
- K3s API tcp:6443
- awoooi-api NodePort tcp:32334
- awoooi-web NodePort tcp:32335
NetworkPolicy: open egress to 121 on 6443/32334/32335
The NodePort services actually run on 121 (mon1), not 120 (mon)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 23:36:36 +08:00
OG T
fc9d0f9c1f
fix(host_aggregator): with total=1, total//2=0 reported unhealthy even when every service was up
CD Pipeline / build-and-deploy (push) Has been cancelled
112 (Kali) and 120 (K3s) each have only one service, so down_count=0 >= total//2=0
always held → always unhealthy. The >=half threshold now applies only when total > 1.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 23:35:37 +08:00
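The threshold bug in fc9d0f9c1f is easy to reproduce in isolation. The sketch below is a hypothetical `is_unhealthy` helper, not the real host_aggregator code. How a single-service host is judged is an assumption here: it is treated as unhealthy only when its one service is down.

```python
def is_unhealthy(down_count: int, total: int) -> bool:
    """Return True when a host should be reported as unhealthy."""
    if total <= 0:
        # Assumption: a host with no probed services is not flagged.
        return False
    if total == 1:
        # Assumption: with one service, the host is unhealthy only if it is down.
        # The old check (down_count >= total // 2) reduced to 0 >= 0 here.
        return down_count >= 1
    # The >=half threshold from the commit, applied only when total > 1.
    return down_count >= total // 2
```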
OG T
d324dd7aed
fix(telegram): remove all per-field truncation in alert messages; relax to Telegram's 4096-character hard limit
CD Pipeline / build-and-deploy (push) Has been cancelled
Problem: root_cause[:50/100], suggested_action[:80], suggestion[:50],
note[:150], fix_action[:100], impact[:150], hypothesis[:200]
as well as message[:900]/[:1000] caused alert content to be displayed incompletely.
Fix: remove per-field truncation and cap the whole message at 4096 (the Telegram API hard limit).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 23:32:51 +08:00
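The change in d324dd7aed moves the limit from individual fields to the assembled message. A minimal sketch of that idea follows; the function name and the `name: value` line layout are assumptions for illustration, not the real telegram_gateway formatting.

```python
TELEGRAM_MAX_MESSAGE_LENGTH = 4096  # hard limit cited in the commit message


def build_alert_message(fields: dict) -> str:
    """Join fields without per-field slicing, then cap the whole message once."""
    lines = [f"{name}: {value}" for name, value in fields.items()]
    message = "\n".join(lines)
    return message[:TELEGRAM_MAX_MESSAGE_LENGTH]
```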
OG T
eb46079b4a
fix(telegram): root_cause display length 50→100 characters, per the SOUL.md iron rule
CD Pipeline / build-and-deploy (push) Has been cancelled
SOUL.md sets a 100-character cap on the root-cause summary, but both IncidentApprovalCard
sites in the code truncated at [:50], cutting off the alert card message.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 23:30:58 +08:00
OG T
c92cdeea0f
feat(drift): B4 drift_reports DB persistence + CronJob fix
CD Pipeline / build-and-deploy (push) Successful in 12m17s
- drift_repository.py: DriftReportRepository (save/get/list/update)
- drift.py router: remove the in-memory dict and use the DB repository
- drift-cronjob.yaml: fix SA/NetworkPolicy/NodePort issues
- allow-intra-namespace NetworkPolicy (already applied to prod)
- migrate-phase8/9: symptoms_hash + drift_reports migration Job YAML
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 20:28:55 +08:00
OG T
b1e207ffae
fix(host_aggregator): correct HOST_CONFIGS after E2E verification: Ollama location + NodePort + Nginx
CD Pipeline / build-and-deploy (push) Has been cancelled
Corrected after confirming with Python socket probes from inside a K3s Pod:
- 110: add Prometheus(9090) and Grafana(3002); remove GH Runner (3000 refused)
- 112: remove SSH:22 (not opened in the K3s Pod NetworkPolicy)
- 120: remove awoooi NodePort (only on 121, not on 120)
- 188: remove Ollama (on 111, not 188) and Nginx:443 (unreachable from inside the Pod)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 20:27:46 +08:00
OG T
21567a7a6d
fix(host_aggregator): fix wrong probe endpoints on four hosts that made everything show unhealthy
CD Pipeline / build-and-deploy (push) Successful in 12m1s
- 110: Harbor http→tcp(5000), Docker 2375→Gitea tcp(3001)
- 120: K3s 6443 https (401 misjudged as down)→tcp; remove Traefik 80 (closed)
- 188: OpenClaw 8089→8088 (actual port)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 19:52:34 +08:00
OG T
8c2983b70a
fix(api+web): add K3s NodePort origins to CORS + add signer_id/name to sign
CD Pipeline / build-and-deploy (push) Has been cancelled
CORS (config.py):
- Add http://192.168.0.125:32335 (K3s VIP NodePort)
- Add http://192.168.0.120:32335 + 121:32335 (K3s nodes)
- Before: opening :32335 from an intranet browser had every API call CORS blocked
(root cause of incidents "Failed to fetch" / monitoring unable to connect)
sign body (pending-approvals-card.tsx):
- signer: 'web-ui' → signer_id: CURRENT_USER.id + signer_name: CURRENT_USER.name
- Before: POST /approvals/{id}/sign appeared to return 403 (a 422 for missing required fields misreported as 403);
it was actually 422 Field required for signer_id + signer_name
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 19:50:48 +08:00
OG T
34f0228d92
fix(executor): K8s ClusterIP 10.43.0.1 unreachable: add K8S_API_SERVER_URL override + migration job
CD Pipeline / build-and-deploy (push) Successful in 12m0s
Problem: the in-cluster config resolves to 10.43.0.1:443, but inside the K3s Pod iptables/kube-proxy
does not route the traffic to the real API server, causing Connection refused, so kubectl always failed after approval
Fix:
- executor.py: after load_incluster_config(), read the K8S_API_SERVER_URL env var to override the host
- 04-configmap.yaml: set K8S_API_SERVER_URL=https://192.168.0.120:6443
- migrate-sprint5r-telegram-message-id.yaml: migration job adding two columns to approval_records
E2E verification: kubectl rollout restart deployment/awoooi-worker success=True ✅
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 19:10:27 +08:00
OG T
ebccb88278
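fix(approval_db): incident_id filter looked in the metadata JSON instead of the DB column, breaking execution
CD Pipeline / build-and-deploy (push) Has been cancelled
get_all_approvals(incident_id=...) originally filtered at the application layer on
a.metadata.get("incident_id"), but ApprovalRecord.incident_id
is a direct column, not part of the extra_metadata JSON, so it always returned an empty list,
Telegram approvals produced telegram_approval_not_found_by_incident,
and the approval was never actually executed. Changed to .where(ApprovalRecord.incident_id == incident_id)
to filter directly in the DB, which also performs better.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>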
2026-04-09 19:05:48 +08:00
OG T
896bef94ee
fix(web): pending-approvals-card: add double-click protection + loading state
Automatic linter hardening: actioningId state prevents repeated actions on the same card
- disabled + opacity 0.6 + cursor not-allowed
- Button shows '...' while loading
- finally() ensures actioningId is cleared
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:38:08 +08:00
OG T
890e2a9568
fix(review): architecture review fixes: P0 import crash + i18n with zero hardcoding + silent errors
P0:
- proposal_service.py: add get_redis + INCIDENT_KEY_PREFIX imports
(before: resolve_incident_after_approval always crashed with NameError)
P1 i18n:
- page.tsx: remove emoji from topology groups; use tTopo() i18n keys
- page.tsx: host labels (e.g. "DevOps金庫") now use tTopo() i18n
- ai-model-status.tsx: add useTranslations; "AI 模型狀態" → t('aiModelStatus')
- disposition-mini.tsx: "查看完整報表" → t('viewAllReport')
- recent-activity.tsx: "查看活動串流" → t('viewAllAlerts')
P2 quality:
- pending-approvals-card.tsx: approve/reject gain an r.ok check + error display; "查看全部授權" gains a route + i18n
- page-tabs.tsx: TabSkeleton "載入中..." → t('loading')
- page.tsx: ↑5% → tDashboard('trendUp', {pct}) dynamic value
- page.tsx: Prometheus '23' hardcode → '-- targets'
New i18n keys (zh-TW + en kept in sync):
- dashboard: viewAllAlerts/viewAllAuth/viewAllReport/aiModelStatus/loading/trendUp
- topology: groupExternal/allReachable/investigating/hostDevops/hostAiData/hostK3sMaster/hostK3sWorker
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:34:50 +08:00
OG T
1483218bab
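feat(approval): respond in Telegram immediately after approve/reject + persist message_id to DB
CD Pipeline / build-and-deploy (push) Successful in 13m9s
Problem: pressing the TG approve/reject button produced no response at all, so users could not tell whether it succeeded.
The Telegram message_id was stored only in Redis with a 24h TTL and could not be tracked after expiry.
Fix:
- approval_records: add telegram_message_id / telegram_chat_id columns (ALTER TABLE already run)
- approval_db.update_telegram_message(): persist message_id to the DB
- decision_manager: write to both Redis and the DB after sending the alert card
- telegram_gateway._notify_approval_result(): after approve/reject:
1. editMessageReplyMarkup removes the approve/reject buttons and keeps the info buttons
2. sendMessage reply_to replies with a status line under the original message
3. fallback: send_notification sends a new message
- _handle_group_command: rename chat_id to _chat_id to silence the IDE lint
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>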
2026-04-09 18:19:31 +08:00
OG T
2c7d5d049c
fix(openclaw): backfill kubectl_command from the Nemotron tool call so the executor can parse it after approval
CD Pipeline / build-and-deploy (push) Has been cancelled
Root problem: the restart_deployment(deployment_name=sentry) tool call produced by Nemotron
lived only in nemotron_tools[] and was never backfilled into proposal["kubectl_command"].
proposal_service received an empty kubectl_command, approval_records.action stored an empty value,
parse_operation_from_action always returned None, and execute_approved_action always skipped.
Fix: after Nemotron (or the Gemini fallback) succeeds, convert the tool call into a kubectl command
and backfill proposal["kubectl_command"] so proposal_service gets an executable command.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:15:01 +08:00
OG T
ae9780837d
fix(proposal): action prefers kubectl_command, fixing the root bug where execution was always skipped after approval
Root problem: approval_records.action stored the LLM action_title (a Chinese title such as "重啟 sentry 服務", i.e. "restart the sentry service"),
which parse_operation_from_action() cannot parse, so execute_approved_action() skipped every time.
Fix: action takes llm_proposal["kubectl_command"] first (an executable kubectl command)
and falls back to action_title only when kubectl_command is absent.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:13:22 +08:00
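The selection rule from ae9780837d can be sketched as follows. `choose_action` is a hypothetical helper name. Treating an empty or whitespace-only kubectl_command as absent is an assumption, motivated by the empty-command case described in 2c7d5d049c.

```python
def choose_action(llm_proposal: dict) -> str:
    """Prefer the executable kubectl command; fall back to the display title."""
    command = (llm_proposal.get("kubectl_command") or "").strip()
    if command:
        return command
    # Fallback only: a human-readable title that the executor cannot parse.
    return llm_proposal.get("action_title", "")
```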
OG T
d8c2969341
feat(telegram): AI pipeline transparency: alert messages show the OpenClaw + Tool Calling model/backend
CD Pipeline / build-and-deploy (push) Successful in 12m12s
- nemotron.py: detect OllamaToolProvider vs NvidiaProvider and record tool_model/tool_backend
- openclaw.py: propagate nemotron_tool_model/nemotron_tool_backend to the proposal
- decision_manager.py: extract them from proposal_data and pass them to send_approval_card()
- telegram_gateway.py: TelegramMessage gains two fields; format_with_nemotron displays
"🔧 Tool Calling: llama3.1:8b (Ollama 本機)" (local) or "NVIDIA 雲端" (cloud)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 15:05:16 +08:00
OG T
7857c25677
feat: local Ollama Tool Calling replaces NVIDIA cloud (44s→~5s)
CD Pipeline / build-and-deploy (push) Has been cancelled
- nvidia_provider.py: add OllamaToolProvider
- Implements the INvidiaProvider protocol, calling Ollama /v1/chat/completions
- Model: llama3.1:8b (the most stable 8B model for tool calling)
- Latency: 44s → ~5s (local M1 Pro at 192.168.0.111)
- get_nvidia_provider() switches on USE_OLLAMA_TOOL_CALLING
- config.py: USE_OLLAMA_TOOL_CALLING=True (on by default), OLLAMA_TOOL_MODEL=llama3.1:8b
- Rollback: USE_OLLAMA_TOOL_CALLING=False → restores the cloud NvidiaProvider
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 14:55:04 +08:00
OG T
4f80ba38c0
feat: continue alert status changes on the original message (option B)
CD Pipeline / build-and-deploy (push) Successful in 12m28s
**telegram_gateway.py**
- Add append_incident_update(incident_id, status_line)
- Read message_id from Redis tg_msg:{id}
- editMessageReplyMarkup: remove Row1 (approve/reject/silence), keep Row2 (details/re-diagnose/history)
- sendMessage reply_to_message_id: append a status line under the original message
- Return False when message_id is not found (the caller handles the fallback)
**decision_manager.py**
- _push_decision_to_telegram: after send_approval_card, store tg_msg:{id}=message_id (TTL 24h)
- _push_auto_repair_result: use append_incident_update; fall back to a new message only when message_id is not found
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 14:21:33 +08:00
OG T
2554ac1e60
fix: E2E test alert recognition + Telegram notification of auto-repair results
CD Pipeline / build-and-deploy (push) Has been cancelled
**alert_rule_engine.py**
- _matches(): add instance_prefix matching (highest priority)
- match_rule(): pass the instance label to _matches
- Purpose: e2e-final-* / e2e-test-* instances can be recognized by YAML rules
**alert_rules.yaml**
- Add the e2e_smoke_test rule (priority=120)
- alertname: E2E_SMOKE_TEST / instance_prefix: e2e-final- / e2e-test- / test-host
- suggested_action: NO_ACTION, displaying "告警鏈路驗證成功" ("alert pipeline verified successfully")
**decision_manager.py**
- _auto_execute(): send a Telegram result notification on success ✅
- _auto_execute(): send a Telegram failure notification on failure ❌
- Add the _push_auto_repair_result() function
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 14:16:15 +08:00
OG T
1fb0c0ca90
fix(auto-repair): Bug #5+#6: SSH binary + affected_services matching fix
CD Pipeline / build-and-deploy (push) Has been cancelled
Bug #5 (webhooks.py): target_resource now prefers the component label
- The SentryDown alert has labels.component="sentry"
- Old logic: labels.instance="192.168.0.110:9000" → does not match Playbook affected_services
- New logic: component → pod → instance → alertname
Bug #6 (Dockerfile): python:3.11-slim has no openssh-client
- The SSH_COMMAND Playbook execution path calls asyncio.create_subprocess_exec("ssh", ...)
- The image has no ssh binary → every SSH repair was bound to fail
- Fix: install openssh-client in the production stage
Service list: add the main sentry service to service-registry.yaml (AUTO level)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 14:11:50 +08:00
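The label precedence introduced for Bug #5 in 1fb0c0ca90 (component → pod → instance → alertname) might look like the sketch below. The helper name and the empty-string default are assumptions, not the actual webhooks.py code.

```python
# Precedence order taken from the commit message.
_TARGET_LABEL_ORDER = ("component", "pod", "instance", "alertname")


def pick_target_resource(labels: dict) -> str:
    """Return the first non-empty label value in precedence order."""
    for key in _TARGET_LABEL_ORDER:
        value = labels.get(key)
        if value:
            return value
    return ""  # Assumption: no usable label yields an empty target.
```

With the SentryDown labels from the commit, this returns `sentry` rather than `192.168.0.110:9000`, which is what lets the Playbook's affected_services match.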
OG T
1d88b7cd9d
fix(webhooks): add alertname to Signal.labels so playbook matching can read the original alertname
CD Pipeline / build-and-deploy (push) Has been cancelled
Problem: when create_incident_for_approval built the Signal, labels contained only
namespace/resource and no alertname, so _extract_symptoms read
labels.alertname as None and fell back to alert_name="custom",
and playbook Jaccard matching could never match a real alertname (such as SentryDown).
Fix: add an alertname parameter and pass it into Signal.labels["alertname"].
Both call sites (LLM success + fallback) now pass alertname=alertname.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 13:54:42 +08:00
OG T
e4070b2f86
fix(webhooks): add the get_alert_operation_log_repository import in two places
CD Pipeline / build-and-deploy (push) Successful in 12m53s
Cause of the alert_received_log_failed error: inside the alertmanager_webhook function,
get_alert_operation_log_repository() was called directly without being imported in local scope,
so the NameError was swallowed by except and the ALERT_RECEIVED event could not be recorded.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 12:29:48 +08:00
OG T
fc03eb1f4d
fix(auto-repair): _extract_symptoms prefers labels.alertname to get the original alertname
CD Pipeline / build-and-deploy (push) Has been cancelled
Problem: signal.alert_name holds the alert_type (e.g. "custom") rather than the Prometheus
alertname (e.g. "SentryDown"), so playbook Jaccard matching always failed (NO_MATCH).
Root cause: the webhook's alertname_to_type mapping converts unknown alertnames to "custom"
and stores that in signal.alert_name, but the Playbook's symptom_pattern.alert_names stores the original names.
Fix: read the original Prometheus alertname from signal.labels["alertname"],
falling back to signal.alert_name (for backward compatibility).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 12:26:18 +08:00
OG T
af49a54728
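fix(playbook): bypass the similarity threshold when alert_names match exactly
CD Pipeline / build-and-deploy (push) Successful in 12m58s
Symptom: SentryDown/OllamaDown alerts triggered an incident, but the playbook search
returned NO_MATCH even though alert_names were identical.
Root cause: in the weighted Jaccard calculation, affected_services held the Prometheus
instance IP (192.168.0.110:9000) while the Playbook stored the service name (sentry),
so the services dimension scored 0 and the final 0.35 < min_similarity=0.4.
Fix: when alert_names intersect, pass directly, regardless of other dimensions dragging the score down.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>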
2026-04-09 12:05:07 +08:00
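The bypass described in af49a54728 can be illustrated with a two-dimension weighted Jaccard score. The weights below are hypothetical, chosen only so that a perfect alert_names match with a zero services score reproduces the 0.35 quoted in the commit. The real playbook matcher may use more dimensions and different weights.

```python
# Hypothetical weights; only min_similarity=0.4 and the 0.35 outcome come from the commit.
WEIGHTS = {"alert_names": 0.35, "affected_services": 0.65}


def jaccard(a: set, b: set) -> float:
    """Plain Jaccard similarity; two empty sets score 0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def weighted_score(incident: dict, playbook: dict) -> float:
    return sum(
        weight * jaccard(set(incident[dim]), set(playbook[dim]))
        for dim, weight in WEIGHTS.items()
    )


def playbook_matches(incident: dict, playbook: dict, min_similarity: float = 0.4) -> bool:
    # Bypass: any shared alert name passes regardless of the other dimensions.
    if set(incident["alert_names"]) & set(playbook["alert_names"]):
        return True
    return weighted_score(incident, playbook) >= min_similarity
```

For the SentryDown case, the score alone stays below the threshold because the instance IP never equals the service name, while the bypass still returns a match.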
OG T
79a9a514dd
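fix(rules): ADR-064 L1 Redis distributed lock prevents multiple Pods from generating duplicate rules
CD Pipeline / build-and-deploy (push) Has started running
Problem: the _generating set is process-level, so each Pod keeps its own copy and the same alertname could be
sent to Ollama/Gemini for rule generation by several Pods at once
Fix: SET NX EX lock_key: only the first Pod acquires the lock; the other Pods skip
Degradation: when Redis is unavailable, fall back to the process-level set (preserving the previous behavior)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>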
2026-04-09 12:03:51 +08:00
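The lock-with-degradation pattern from 79a9a514dd is sketched below. It assumes a redis-py style client whose `set(key, value, nx=True, ex=seconds)` returns a truthy value only when the key was newly created. The key prefix, TTL, and function name are hypothetical.

```python
_generating = set()  # process-level fallback, as in the pre-fix behavior


def try_acquire_rule_lock(client, alertname: str, ttl_seconds: int = 300) -> bool:
    """Return True if this process may generate the rule for alertname."""
    lock_key = f"rule_gen_lock:{alertname}"  # hypothetical key format
    try:
        # SET NX EX: succeeds only for the first caller until the TTL expires.
        return bool(client.set(lock_key, "1", nx=True, ex=ttl_seconds))
    except Exception:
        # Redis unavailable: degrade to the per-process set.
        if alertname in _generating:
            return False
        _generating.add(alertname)
        return True
```

The TTL matters here: if the Pod holding the lock crashes mid-generation, the key expires on its own and another Pod can retry later.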
OG T
b66263ad36
fix(decision_manager): do not resend resolved Incidents to Telegram
CD Pipeline / build-and-deploy (push) Has started running
After the 10-minute dedup TTL expired, already-resolved Incidents were still being pushed again
Added a status check; resolved/closed Incidents are skipped
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 12:00:39 +08:00
OG T
9361fd1fa7
fix(decision_manager): do not apply strip_placeholders to action, to avoid truncating the deployment name
CD Pipeline / build-and-deploy (push) Has been cancelled
_strip_placeholders removed <...>, turning kubectl rollout restart deployment/<name>
into kubectl rollout restart deployment/, so the suggested command shown in Telegram was incomplete
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 11:45:33 +08:00
OG T
d467fc11be
fix(nemotron): fix the deployment_name placeholder problem
Root cause: Nemotron tool calling received target_resource=DockerContainerUnhealthy
(not a real K8s deployment name) and filled in <deployment_name> when unsure
Fix:
1. The prompt explicitly states that deployment_name must be filled with target_resource
2. After receiving the tool call result, detect the placeholder and overwrite it with target_resource
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 11:44:25 +08:00
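Step 2 of the fix in d467fc11be, detecting a placeholder and overwriting it, could be sketched like this. The regex and the helper name are assumptions. Treating an empty deployment_name the same way as a placeholder is also an assumption.

```python
import re

# A value consisting solely of an angle-bracket token, e.g. "<deployment_name>".
_PLACEHOLDER = re.compile(r"^<[^<>]+>$")


def resolve_deployment_name(tool_args: dict, target_resource: str) -> dict:
    """Return tool call arguments with placeholder names replaced."""
    name = (tool_args.get("deployment_name") or "").strip()
    if not name or _PLACEHOLDER.match(name):
        # Copy rather than mutate, so the raw tool call stays available for logging.
        return {**tool_args, "deployment_name": target_resource}
    return tool_args
```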
OG T
c5e475121a
fix(telegram): fix truncated suggested command + decision_manager enum string correction
CD Pipeline / build-and-deploy (push) Has been cancelled
Root cause 1: telegram_gateway.py suggested_action[:35] cut off right after deployment/
→ changed to [:80] to show the full kubectl command
Root cause 2: old Incident proposal_data stored the enum string (RESTART_DEPLOYMENT)
→ decision_manager.py now detects this and looks up the kubectl command again via the rule engine
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 11:14:30 +08:00
OG T
5ea6c3fb91
feat: alert_operation_log query API + frontend page (Sprint 5.2)
CD Pipeline / build-and-deploy (push) Has been cancelled
Backend:
- Add the list_recent() pagination method (alert_operation_log_repository)
- Add /api/v1/alert-operation-logs GET + /stats endpoints
- main.py registers alert_operation_logs_v1.router
Frontend:
- /alert-operation-logs page with color tags for 18 event_type values
- Pagination, event_type filter, incident_id filter
- 24h stats cards (total / guardrail blocks / auto-repairs / resolved)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 10:57:40 +08:00
OG T
428e66c111
fix(arch-review): chief architect review: S1×3 S2×3 S3×3 all fixed + ADR-064
CD Pipeline / build-and-deploy (push) Has been cancelled
S1 Critical:
- S1-1: move the asyncio trigger into the _call_with_fallback async context; remove get_event_loop() from sync code
- S1-2: _append_rule_to_yaml adds textwrap.dedent() to normalize the indentation of LLM output
- S1-3: _matches() returns False directly for alertname=["*"] to prevent accidental hits
S2 Major:
- S2-1: auto_generate_rule() switched to DI parameter injection (ollama_url/model/gemini_api_key); remove import settings
- S2-4: _generate_mock_response docstring clarifies that it is the rule engine production path, not fake data
- S2-5: suggested_action .strip() prevents whitespace-only strings from bypassing `or`
S3 Minor:
- S3-2: priority upper bound min(next, 890)
- S3-3: alertname sanitize re.sub([{}]) prevents format KeyError
- S3-4: model_registry.py last-modified timestamp updated
Docs:
- ADR-064: Alert Rule Engine, YAML-driven + AI self-learning
- Skills 02: alert rule engine DI conventions + asyncio prohibitions
- Skills 03: _generate_mock_response semantic clarification + rule engine degradation flow
- LOGBOOK: full record of this session
2026-04-09 ogt: chief architect review fixes
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 10:52:40 +08:00
OG T
4b3fdd82f9
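fix(api): incidents list no longer waits synchronously for AI decisions (performance fix)
CD Pipeline / build-and-deploy (push) Has been cancelled
Problem: GET /api/v1/incidents awaited AI analysis (120-180s) for every incident
With several active incidents the timeouts compounded → the frontend could not load at all
Fix:
- The list endpoint only looks up decision tokens already cached in Redis (returns immediately)
- With no cached entry it returns decision=null and triggers the AI in the background, fire-and-forget
- The frontend then GETs the single-incident endpoint for incidents of interest to obtain the decision
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>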
2026-04-09 10:49:30 +08:00
OG T
89da2d24be
fix(model-registry): update fallback config to deepseek-r1:14b + gemma3:4b
CD Pipeline / build-and-deploy (push) Successful in 13m20s
- model_registry._get_default_config: ollama summary llama3.2:3b → gemma3:4b
- model_registry._get_default_config: ollama default/rca → deepseek-r1:14b
- Fix the failing test_smart_router::test_simple_context (it asserts gemma3:4b)
- alert_rule_engine: remove unused asyncio/time imports
- 2026-04-09 ogt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 09:52:47 +08:00
OG T
71437db0e9
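feat(rule-engine): automatic rule generation: generic_fallback triggers AI learning
CD Pipeline / build-and-deploy (push) Successful in 11m25s
Flow:
1. An alert hits the generic_fallback rule
2. auto_generate_rule() is triggered in the background
3. Ollama (deepseek-r1:14b) generates a YAML rule fragment
4. Ollama fails → Gemini as backup
5. Validate the format → append to alert_rules.yaml → clear lru_cache
6. The next alert of the same kind hits its dedicated rule directly and no longer goes through the fallback
Dedup: each alertname is generated only once per process
Hand-written rules use priority 1-499, AI-generated rules 500-899, the fallback 999
2026-04-09 ogt: self-learning AI rule engine
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>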
2026-04-09 09:20:33 +08:00
OG T
d1ede7f989
feat(openclaw): alert rule engine: alert_rules.yaml replaces hard-coded if/elif
CD Pipeline / build-and-deploy (push) Has been cancelled
- Add alert_rules.yaml: 6 rules (docker/target_down/oom/cpu/5xx/crash) + a generic fallback
- Add alert_rule_engine.py: YAML loading, matching logic, variable substitution
- openclaw.py _generate_mock_response: refactored to call the rule engine (v8.0)
- New rules only require editing the YAML and restarting the Pod; no code changes
- 2026-04-09 ogt: architecture refactor
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 09:05:23 +08:00
OG T
3abc7c2f85
fix(openclaw): dedicated rule matching for DockerContainerUnhealthy + TargetDown
CD Pipeline / build-and-deploy (push) Has been cancelled
- DockerContainerUnhealthy: ssh docker inspect + docker restart, including healthcheck command verification
- TargetDown / IP:port instance: ssh to confirm the exporter is alive
- Fix target mistakenly using the alertname as the deployment name
- Extract alertname/labels from alert_context for rule evaluation
- 2026-04-09 ogt: add two dedicated rules
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 09:00:31 +08:00
OG T
4b6f14d9a1
fix(webhook): alertmanager path suggested_action now uses kubectl_command
CD Pipeline / build-and-deploy (push) Failing after 1m43s
- Line 1399: suggested_action.value (RESTART_DEPLOYMENT) → kubectl_command
- Consistent with line 887 on the /alerts path
- Fixes Telegram showing "kubectl rollout restart deployment/" with nothing after it
- 2026-04-09 ogt: bug fix
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 08:57:56 +08:00
OG T
d80153bdce
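fix(openclaw): fall back to Gemini to produce the execution plan after NIM fails completely
CD Pipeline / build-and-deploy (push) Failing after 1m34s
After repeated NIM tool calling timeouts, the execution plan is no longer shown blank;
Gemini acts as a proxy and produces the kubectl commands (parsed from JSON).
Triggered only when NIM fails completely, in line with the commander's "must wait until there is a response" principle.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>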
2026-04-08 22:55:25 +08:00
OG T
6f475000f6
fix(db): alert_operation_log.event_type String→PgEnum (create_type=False)
CD Pipeline / build-and-deploy (push) Has been cancelled
Fixes DatatypeMismatchError: the DB column is the native enum alert_event_type,
but the SQLAlchemy model mistakenly used String(50), so writes to alert_operation_log failed.
PgEnum(create_type=False) lets SQLAlchemy map the existing DB enum
without recreating the type. The 18 event_type values match the M-003 migration.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 22:42:36 +08:00
OG T
86ac6ed028
perf(api): HostAggregator performance: shorter probe timeout + 30-second in-memory cache
2026-04-08 22:42:01 +08:00
OG T
2a6977343a
fix(telegram): pass incident_id at every _push_to_telegram_background call site
CD Pipeline / build-and-deploy (push) Has been cancelled
Rule matches showed six buttons but the Ollama/OpenClaw path showed only three; the root cause was that
the alertmanager and fallback paths called _push_to_telegram_background
without incident_id, so the details/re-diagnose/history buttons were not shown.
- _push_to_telegram_background: add an incident_id parameter
- alertmanager main path: pass incident_id
- alertmanager fallback path: store the return value and pass it
- /alerts path: no incident exists yet; explicitly pass an empty string
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 22:40:22 +08:00
OG T
8b5db2f58e
feat(infra): switch Ollama to the M1 Pro at 192.168.0.111 + NetworkPolicy update
CD Pipeline / build-and-deploy (push) Has been cancelled
- OLLAMA_URL: 188 → 111 (M1 Pro, 40+ tok/s vs 0.45 tok/s)
- OPENCLAW_DEFAULT_MODEL: qwen2.5:7b-instruct → deepseek-r1:14b (strongest reasoning for SRE)
- OPENCLAW_TIMEOUT: 90s → 120s (slowest measured deepseek-r1:14b run: 54s)
- NetworkPolicy v1.3: add 192.168.0.111:11434 egress; remove the Ollama port on 188
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 22:05:14 +08:00
OG T
c9f1bcd122
fix(api): service_registry safe degradation: no crash when the YAML is missing in Docker; fall back to AUTO
CD Pipeline / build-and-deploy (push) Successful in 11m37s
2026-04-08 21:47:38 +08:00
OG T
db4b28c49d
fix(ci): force-trigger CD: the service_registry.py Docker path fix is already in 1f9eea5
CD Pipeline / build-and-deploy (push) Failing after 8m45s
Pod CrashLoopBackOff: IndexError parents[5]
Fix: _find_registry_path() safe search (parents[4]/parents[3]/absolute path)
1f9eea5 already contained the fix but did not trigger CI; this commit forces a rebuild
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 21:37:49 +08:00
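The IndexError in db4b28c49d arises because a module path inside a container has fewer ancestors than in the source checkout. Below is a minimal sketch of a bounds-checked search in the spirit of the described fix. The signature, the candidate order, and the `None` return are assumptions, not the real `_find_registry_path()`.

```python
from pathlib import Path
from typing import Optional


def find_registry_path(start: Path, relative: str, absolute_fallback: Path) -> Optional[Path]:
    """Try parents[4], then parents[3], then an absolute path, without IndexError."""
    candidates = []
    for depth in (4, 3):
        # Guard the index: shallow container paths may not have this many parents.
        if depth < len(start.parents):
            candidates.append(start.parents[depth] / relative)
    candidates.append(absolute_fallback)
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None  # Caller decides how to degrade when no registry file exists.
```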
OG T
1f9eea5b74
fix(api): service_registry.py Path index fix for Docker container compatibility
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-04-08 21:34:40 +08:00
OG T
14cb015826
fix(openclaw): Nemotron retry logic + exhausted log key (previously uncommitted changes)
CD Pipeline / build-and-deploy (push) Has been cancelled
- generate_incident_proposal_with_tools: single try/except → 2-attempt retry loop
- Failure log key: nemotron_collaboration_failed → nemotron_collaboration_exhausted
- On failure nemotron_enabled=True (so the commander sees the failed state)
- _call_nemotron_tools: timeouts now raise an exception (so the outer loop retries)
- These are local changes from an earlier session, fixing a mismatch between the tests and the actual implementation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 21:16:34 +08:00
OG T
0f5fecfef5
fix(sprint5.1): chief architect review fixes: S1×4 S2×2 S3×1
CD Pipeline / build-and-deploy (push) Failing after 1m40s
S1-1: service_registry/velero_client/preflight_service switched to structlog
S1-2: velero_client datetime.now(UTC) → now_taipei() (Taipei timezone iron rule)
S1-3: Guardrail failure now rejects conservatively (the previous allow-on-failure ran counter to the safety goal)
S1-4: service_registry imports moved to module top (in-function imports removed)
S2-1: telegram_gateway T1-T6: all six notification methods wrapped in try/except
S2-2: webhooks.py Langfuse URL now uses settings.LANGFUSE_URL (hard-coded intranet IP removed)
S3-3: velero_client trigger_emergency_backup now uses kubectl apply with a Backup CRD
(the previous kubectl create backup syntax does not exist; the review found a silent-failure risk)
Review score: 70/100 → expected 90+/100 after the fixes
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 16:36:18 +08:00
OG T
88696dba9b
feat(sprint5.1): Data Safety Guardrails full-chain integration (L1-L5)
CD Pipeline / build-and-deploy (push) Failing after 1m33s
Type Sync Check / check-type-sync (push) Failing after 58s
Layer 0 - K8s RBAC:
- k8s/rbac/api-velero-reader.yaml: awoooi-executor SA Velero backup reader
Layer 1 - DB Migration (already run on 188):
- M-002: approval_records adds approval_level/votes/required_votes
- M-003: alert_event_type ENUM adds 8 values
Layer 2 - IaC:
- ops/config/service-registry.yaml: Stateful tiering list for all services (BLOCK/CRITICAL_HITL/STANDARD_HITL/AUTO)
Layer 3 - Python Services:
- service_registry.py: reads the YAML and provides is_blocked/requires_multisig/get_required_votes
- velero_client.py: queries Velero backup age via kubectl; falls back to 999h on failure
- preflight_service.py: Pre-flight safety checks (Q2/Q4 decisions)
Layer 1-M001 - Playbook model:
- playbook.py: adds requires_approval_level/stateful_targets/requires_pre_backup
Layer 4 - Business logic:
- alert_operation_log_repository.py: adds 8 event_type values (Guardrail/Pre-flight/MultiSig/backup)
- auto_repair_service.py: injects the Service Registry Guardrail check (BLOCK → reject outright)
- webhooks.py: ALERT_RECEIVED provenance logging + auto_repair flag Q9 + Langfuse trace_id Q10
- db/models.py: ApprovalRecord synced with the approval_level/votes/required_votes columns
- docker-health-monitor.sh: converted to a pure sensing layer (all docker restart logic removed)
Layer 5 - Telegram notifications:
- telegram_gateway.py: six new notification methods T1-T6 (Guardrail/Pre-flight/Backup/MultiSig/ChangeApplied)
References: ADR-062 Data Safety Guardrails, ADR-063 Service Registry IaC
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 16:24:09 +08:00