AWOOOI CD
|
e7171a4ac8
|
chore(cd): deploy aa4e575 [skip ci]
|
2026-04-14 10:56:28 +00:00 |
|
OG T
|
aa4e5757a2
|
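fix: tech-debt cleanup (report_generation retry mechanism + GAP-A4 documentation)
CD Pipeline / build-and-deploy (push) Successful in 15m46s
Tech debt #1: postmortem delivery failures were silently swallowed
- 3 retries with increasing backoff (2s → 4s → 6s)
- If every attempt fails, send a simplified fallback notification to the SRE group
- Prevents postmortems from silently disappearing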
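A minimal sketch of the retry-then-degrade flow described above, not the project's actual code. It assumes "3 retries" means one initial attempt followed by up to three retries; `send_with_retry`, `send`, and `send_fallback` are hypothetical names, and the `sleep` parameter exists only so the delays can be replaced in tests.

```python
import asyncio
from typing import Awaitable, Callable

# Delays quoted in the commit message (seconds before each retry).
RETRY_DELAYS_SECONDS = (2, 4, 6)


async def send_with_retry(
    send: Callable[[], Awaitable[None]],
    send_fallback: Callable[[], Awaitable[None]],
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> bool:
    """Return True if `send` succeeded, False if the fallback notification was used."""
    try:
        await send()
        return True
    except Exception:
        pass  # fall through to the retry loop
    for delay in RETRY_DELAYS_SECONDS:
        await sleep(delay)
        try:
            await send()
            return True
        except Exception:
            continue
    # Every attempt failed: send the simplified notification instead of dropping the report.
    await send_fallback()
    return False
```

Injecting `sleep` keeps the waiting policy separate from the delivery logic, so a test can verify the delays without actually waiting 12 seconds.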
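Tech debt #2 (QueryBuilder abstraction): DEFER
- Only 1 place in the whole project uses an outcome JSON path query
- Abstracting now would violate "Don't design for hypothetical future requirements"
- Extract it once a second caller appears
Tech debt #3 (E2E tests): already covered
- test_gap_a4_placeholder_resolution.py TestMatchRuleRejection
- Mission C prod pipeline verified live (KubePodCrashLooping)
- Playwright K8s/Telegram staging tests deferred until the staging environment is ready
New documents:
- ADR-078-gap-a4-placeholder-resolution.md
- LOGBOOK 2026-04-14 late-night wrap-up entry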
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 18:46:25 +08:00 |
|
OG T
|
10b74affcf
|
fix(GAP-A4): repair placeholder resolution in rule Action templates, ending 8.3h of flywheel silence
CD Pipeline / build-and-deploy (push) Has been cancelled
🚨 Root-cause diagnosis (caught by the commander):
The API log shows a burst of auto_execute_blocked_unresolved_placeholder events in the last hour:
- action: "kubectl rollout restart deployment HostHighCpuLoad" ← target=alertname
- action: "kubectl rollout restart deployment unknown"
- action: "kubectl scale deployment unknown --replicas=3"
Root cause: the target resolution logic in alert_rule_engine._extract_vars() is not robust enough.
When a Prometheus alert has no deployment label, it falls back to alertname or "unknown",
producing garbage commands. The GAP-A1 injection guard correctly blocks them, but the auto-repair
path then stalls, nothing is written to KM, and the flywheel goes silent.
Fix (three layers of protection):
1. Add _strip_pod_suffix(): recover the Deployment base name from a K8s Pod name
- Deployment format: awoooi-api-7d6b776f78-4sgjl → awoooi-api
- StatefulSet: postgresql-0 → postgresql
- Legacy: my-job-x2m4k → my-job
2. Add _is_bad_target(): detect garbage targets
- Empty string / "unknown" / "none" / "null"
- target == the alertname itself
- IP:port format, bare IP, or contains whitespace/brackets/quotes
- Unresolved {placeholder}
3. Rewrite _extract_vars(): multi-level label lookup (most authoritative first):
deployment > app > statefulset > pod (suffix stripped) > container > service > target_resource
Each level is validated by _is_bad_target; if all fail → target="unknown"
4. Post-match double validation in match_rule():
- bad target → clear kubectl_command (fall back to the LLM)
- leftover { or } → clear kubectl_command (template not fully filled)
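The layered target resolution above could look roughly like the following. This is a sketch under assumptions: the regular expressions, the suffix alphabet (taken to match the one Kubernetes uses for generated names), and the names `strip_pod_suffix`, `is_bad_target`, and `extract_target` are illustrative and may differ from the real `_strip_pod_suffix()`, `_is_bad_target()`, and `_extract_vars()`.

```python
import re

# Assumed alphabet for Kubernetes-generated 5-character name suffixes.
_K8S_SUFFIX = "[bcdfghjklmnpqrstvwxz2456789]{5}"
_DEPLOYMENT_POD_RE = re.compile(rf"^(?P<base>.+)-[a-z0-9]{{6,10}}-{_K8S_SUFFIX}$")
_STATEFULSET_POD_RE = re.compile(r"^(?P<base>.+)-\d+$")
_LEGACY_POD_RE = re.compile(rf"^(?P<base>.+)-{_K8S_SUFFIX}$")
_IP_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}(:\d+)?$")
_FORBIDDEN_CHARS_RE = re.compile(r"[\s(){}\[\]'\"]")
# Label priority quoted in the commit message, most authoritative first.
_LABEL_PRIORITY = (
    "deployment", "app", "statefulset", "pod",
    "container", "service", "target_resource",
)


def strip_pod_suffix(pod: str) -> str:
    """Heuristically reduce a Pod name to its workload base name."""
    for pattern in (_DEPLOYMENT_POD_RE, _STATEFULSET_POD_RE, _LEGACY_POD_RE):
        match = pattern.match(pod)
        if match:
            return match.group("base")
    return pod


def is_bad_target(target: str, alertname: str) -> bool:
    """True for values that must never be substituted into a kubectl template."""
    raw = target or ""
    stripped = raw.strip()
    if stripped.lower() in {"", "unknown", "none", "null"}:
        return True
    if stripped == alertname or _IP_RE.match(stripped):
        return True
    # Whitespace, brackets, quotes, or an unresolved {placeholder}.
    return bool(_FORBIDDEN_CHARS_RE.search(raw))


def extract_target(labels: dict, alertname: str) -> str:
    """Walk the labels in priority order and return the first usable target."""
    for key in _LABEL_PRIORITY:
        value = labels.get(key, "")
        if key == "pod":
            value = strip_pod_suffix(value)
        if not is_bad_target(value, alertname):
            return value
    return "unknown"
```

Because suffix stripping is a heuristic, a workload whose real name happens to end in `-<digits>` would also be shortened; validating every candidate with `is_bad_target` and clearing the command afterwards, as item 4 describes, limits the damage of a wrong guess.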
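Test coverage:
- 33 new unit tests (all four GAP-A4 scenarios covered)
- 214/214 regression tests pass
Impact:
- The path that used to produce "kubectl rollout restart deployment HostHighCpuLoad"
→ now logs `rule_kubectl_command_discarded_bad_target` and falls back to the LLM
- If the LLM can infer the real deployment from the error log, the flywheel resumes normal operation
- If the LLM has no answer either, the incident goes to the TYPE-4 human escalation ladder
2026-04-14 Claude Sonnet 4.6 (hidden bug eliminated, outside the MASTER blueprint)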
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 18:43:29 +08:00 |
|
AWOOOI CD
|
88a33eb4d7
|
chore(cd): deploy f54dea4 [skip ci]
|
2026-04-14 10:42:20 +00:00 |
|
OG T
|
6cac5071e4
|
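docs: MASTER blueprint closure report + ADR-077 + LOGBOOK wrap-up
Final close-out of today's session (9 commits, 11/11 Tasks, 52 new tests):
- docs/reports/2026-04-14-MASTER-BLUEPRINT-CLOSURE.md: full closure report
- docs/adr/ADR-077-master-blueprint-completion.md: architecture review + decision record
- docs/LOGBOOK.md: added late-night wrap-up entry
Review verdict: CONDITIONAL PASS
Communication channel: everything goes through Telegram; SMTP is not needed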
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 18:36:59 +08:00 |
|
OG T
|
f54dea48b1
|
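fix(GAP-D5): correct DB fields in the daily report
CD Pipeline / build-and-deploy (push) Has been cancelled
Fixed two import/query errors (found by the commander in an E2E preview):
1. _collect_repair_stats: ApprovalRequestRecord does not exist
→ use IncidentRecord + an outcome JSON path query on execution_success instead
2. _collect_playbook_count: PlaybookRecord does not exist
→ use playbook_service.list_playbooks() instead (stored in Redis)
Before: repair success rate was always 0.0% and active Playbooks always 0
After: report figures reflect the real DB/Redis state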
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 18:32:29 +08:00 |
|
OG T
|
8de807c40d
|
feat(GAP-D5 Task 4.2): automatic Postmortem assembly hook
CD Pipeline / build-and-deploy (push) Has been cancelled
At the end of incident_service.resolve_incident(), make a fire-and-forget call to
report_generation_service.trigger_postmortem(), completing the trigger path for the orphaned service.
Trigger conditions (evaluated inside trigger_postmortem):
- duration > POSTMORTEM_MIN_DURATION_MINUTES (10min)
- includes AI root_cause / resolution_action / provider / auto_repaired
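A fire-and-forget hook of this kind is commonly written with `asyncio.create_task`. The sketch below is an assumption about the shape of the call, not the code in `incident_service`; `fire_and_forget` and `should_trigger_postmortem` are hypothetical helpers.

```python
import asyncio

POSTMORTEM_MIN_DURATION_MINUTES = 10  # threshold quoted in the commit message

# Keep strong references so pending tasks are not garbage-collected mid-flight.
_background_tasks: set = set()


def should_trigger_postmortem(duration_minutes: float) -> bool:
    """Only incidents longer than the minimum duration get a postmortem."""
    return duration_minutes > POSTMORTEM_MIN_DURATION_MINUTES


def fire_and_forget(coro) -> asyncio.Task:
    """Schedule `coro` on the running loop without awaiting its result."""
    task = asyncio.create_task(coro)
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
    return task
```

Holding the task in a module-level set follows the guidance in the `asyncio.create_task` documentation: the event loop keeps only weak references to tasks, so an unreferenced task can disappear before it finishes.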
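Background:
- The 539-line report_generation_service.py service was created in an earlier session
- main.py:322 already starts run_daily_report_loop (Task 4.1 ✅)
- trigger_postmortem had no caller under src/ → this commit adds one
MASTER blueprint Phase 4 is now fully complete:
✅ Task 4.1 daily inspection report (scheduled for 08:00 Taipei time, already running in production)
✅ Task 4.2 automatic Postmortem assembly (this commit wires up the resolve hook)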
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 18:25:15 +08:00 |
|
OG T
|
b8b124c917
|
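chore: end-of-day checkpoint (LOGBOOK + archive the obsolete flywheel-alerts CRD)
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-04-14 evening session wrap-up:
- LOGBOOK.md: records today's 6 commits + Mission C E2E verification + MASTER blueprint P0+P1 all green
- k8s/monitoring/flywheel-alerts.yaml: added 🔴 DEPRECATED marker
(11/11 rules already migrated into ops/monitoring/alerts-unified.yml; this CRD file has no Operator support)
Backlog sweep inventory:
✅ C2 hasType4 hard-coded in the frontend (now wired to the real API)
✅ C3 WebSocket had no reconnect (exponential backoff + polling fallback)
✅ flywheel-alerts rewritten for the Docker setup (already done; only the old file was left behind → archived in this commit)
✅ risk_level YAML-first logic (decision_manager:1663)
⏳ SMTP_USER/SMTP_PASSWORD still CHANGE_ME (needs credentials from the commander)
⏳ Various E2E verifications (need real alerts to fire)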
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 18:21:08 +08:00 |
|
AWOOOI CD
|
0f48a507c0
|
chore(cd): deploy dd0a778 [skip ci]
|
2026-04-14 08:01:04 +00:00 |
|
OG T
|
dd0a778e1f
|
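feat(GAP-B4): LLM timeout fallback ladder with a tighter inner timeout
CD Pipeline / build-and-deploy (push) Successful in 14m19s
_dual_engine_analyze hardening (2026-04-14 Claude Sonnet 4.6):
- The OpenClaw LLM call gets its own 25s hard timeout (leaving 5s for downstream processing)
- On timeout, emit an explicit llm_timeout_fallback log and fall back to the Expert System immediately
- The NemoClaw second opinion gets a 3s timeout (it is advisory and must not slow the main flow)
- Keep the outer decide() 30s wait_for as defence-in-depth
Why:
- With only the outer 30s, a hung LLM consumes the whole budget and the thread pool may starve
- The inner 25s falls back earlier → the Expert System can still respond within the SLA
- LLM timeouts and other exceptions use distinct log markers, which helps SLO-2 monitoring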
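The inner-timeout pattern can be sketched with `asyncio.wait_for`. The 25s budget and the `llm_timeout_fallback` marker come from the commit message; `analyze_with_fallback`, `llm_call`, `expert_system`, and the `llm_error_fallback` marker are hypothetical names used only for illustration.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

LLM_TIMEOUT_SECONDS = 25.0  # inner budget, below the outer 30s wait_for


async def analyze_with_fallback(llm_call, expert_system, timeout=LLM_TIMEOUT_SECONDS):
    """Try the LLM within `timeout`; otherwise answer from the expert system."""
    try:
        return await asyncio.wait_for(llm_call(), timeout=timeout)
    except asyncio.TimeoutError:
        # Distinct marker so timeouts can be counted separately from other failures.
        logger.warning("llm_timeout_fallback")
    except Exception:
        logger.exception("llm_error_fallback")  # hypothetical marker name
    return await expert_system()
```

`asyncio.wait_for` cancels the pending LLM coroutine when the timeout expires, which is what frees the remaining seconds for the fallback path.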
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 15:51:23 +08:00 |
|
OG T
|
dedd7c2c17
|
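feat(BP-1): KM extraction quality refinement (distinguish automatic/manual + enrich alert metadata)
CD Pipeline / build-and-deploy (push) Has been cancelled
_write_execution_result_to_km() enhancements:
- Distinguish [自動修復] (auto repair) / [人工修復] (manual repair) based on approval.requested_by
- Extract alertname / alert_category / affected_services from the linked Incident
- Category changed from the hard-coded "execution_result" to the real alert_category
- Tags: auto_executed/human_approved + success/failure + alert_category
- Title includes alertname, improving RAG retrieval precision
- created_by is set by mode: auto_execute / approval_execution
Verification (2026-04-14 DB query):
- Existing KM entries are indeed being written (creator: approval_execution)
- But every title is "[執行記錄] ❌ kubectl rollout restart deployment/xxx" ([執行記錄] = execution record)
- Category is hard-coded to execution_result; tags are only execution/execution_failed
- After this change, KM entries carry full context for the next RAG retrieval
Created: 2026-04-14 Taipei time, Claude Sonnet 4.6 (MASTER blueprint BP-1 B.1 refinement)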
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 15:48:02 +08:00 |
|
AWOOOI CD
|
a71f09e30a
|
chore(cd): deploy 2c6ed4e [skip ci]
|
2026-04-14 07:38:35 +00:00 |
|
OG T
|
43c96890d1
|
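docs: add 4 governance documents (alert catalog / AI model cards / postmortem template / on-call handbook)
- docs/reference/ALERT-TAXONOMY-CATALOG.md: 16 categories, 56 alertnames, priority table for 24 Rules
- docs/ai/AI-MODEL-CARDS.md: 7 AI model governance cards (deepseek/qwen/gemini/claude/nemotron) + fallback order
- docs/templates/POSTMORTEM-TEMPLATE.md: aligned with report_generation_service, [AUTO] fields marked
- docs/operations/ON-CALL-HANDBOOK.md: P0/P1 SOPs, Kill Switch, SLO response, common command cheat sheet
Created: 2026-04-14 Taipei time, Claude Sonnet 4.6 (Tactic B Phase 1 fully wrapped up)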
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 15:29:12 +08:00 |
|
OG T
|
2c6ed4e9cf
|
fix(k8s): fix ArgoCD probe failure + drift-scanner egress block
CD Pipeline / build-and-deploy (push) Successful in 14m36s
Problem 1: ArgoCD "All connection attempts failed":
- ARGOCD_URL pointed to 192.168.0.120:30443, but kube-proxy on node 120 has a
routing bug for 30443 (the ArgoCD pod runs on 121)
- Fix: ARGOCD_URL → 192.168.0.121:30443
- NetworkPolicy: allowlist 192.168.0.121/32:30443
- NetworkPolicy: allowlist 192.168.0.125/32:30443 (keepalived VIP)
Problem 2: drift-scanner Error x5 / system silent for 9.4h:
- The CronJob pod template was missing the system=awoooi label
- default-deny-all blocks all egress, and allow-required-egress only applies to
system=awoooi pods
- Fix: add system: awoooi to the drift-cronjob pod template
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-14 15:28:52 +08:00 |
|
OG T
|
aae7c12645
|
feat(adr-076): Task 3.3, KM extraction for SSH repairs (completing both hands of the flywheel)
CD Pipeline / build-and-deploy (push) Has been cancelled
Motivation: after a successful SSH MCP repair (docker restart/systemctl), KM could not learn from it,
because _extract_repair_steps only handled kubectl and the SSH path was missed entirely.
approval_execution.py:
- _trigger_playbook_extraction: after a successful execution, write approval.action into
incident.outcome.learning_notes for the Playbook extractor to read
playbook_service.py:
- _parse_ssh_command(): new module-level function that parses the ssh [user@]host 'cmd' format
- _extract_repair_steps(): step 2 extended with an SSH branch
ssh ... → ActionType.SSH_COMMAND + host recorded
kubectl ... → ActionType.KUBECTL (existing logic kept)
- _generate_name(): SSH repairs automatically get an [SSH] prefix
- _extract_tags(): SSH repairs automatically get the ssh + host_layer tags
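Parsing the `ssh [user@]host 'cmd'` format could be done with `shlex`, as in this sketch. It is an illustration rather than the real `_parse_ssh_command()`: the `SshCommand` type and the decision to reject ssh options are assumptions, and the host names in the usage note are made up.

```python
import shlex
from typing import NamedTuple, Optional


class SshCommand(NamedTuple):
    user: Optional[str]
    host: str
    command: str


def parse_ssh_command(action: str) -> Optional[SshCommand]:
    """Parse "ssh [user@]host 'cmd'"; return None for anything else."""
    try:
        parts = shlex.split(action)
    except ValueError:  # unbalanced quotes
        return None
    if len(parts) < 3 or parts[0] != "ssh":
        return None
    target = parts[1]
    if target.startswith("-"):
        return None  # ssh options are outside the scope of this sketch
    user, _, host = target.rpartition("@")
    if not host:
        return None
    return SshCommand(user or None, host, " ".join(parts[2:]))
```

For example, `parse_ssh_command("ssh root@node-a 'docker restart web'")` yields the user, the host to record on the playbook step, and the remote command, while a `kubectl ...` action returns `None` and can stay on the existing kubectl branch.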
test_playbook_ssh_extraction.py: 18 tests (100% passing)
Both flywheel hands aligned:
kubectl path: decision_chain.reasoning_steps → KM ✅ (existing)
SSH path: approval.action → learning_notes → KM ✅ (added in Task 3.3)
Tests: 794 passed, 26 skipped, 0 failed
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-14 15:19:54 +08:00 |
|
OG T
|
cc42aa0bdb
|
feat(adr-076): Task 2.2 + 2.3, rule expansion + kubectl injection protection
CD Pipeline / build-and-deploy (push) Has been cancelled
Task 2.2: alert_rules.yaml gains 3 rule categories (priority 125-127)
- gitea_down: Gitea CI/CD offline → NO_ACTION (priority 125, critical)
- ssl_cert_expiring: SSL certificate expiring → NO_ACTION (priority 126, medium)
- external_site_down: MoWoooWork/Dev/Blackbox probe → NO_ACTION (priority 127, medium)
Total rules: 21 → 24
Task 2.3: alert_rule_engine.py kubectl injection protection
- _RULE_ENGINE_DESTRUCTIVE_RE: blocks delete pvc/namespace/statefulset/deployment,
drain/cordon, --replicas=0, rm -rf, DROP TABLE, $() and backticks
- validate_kubectl_command(): public API; SSH commands and empty strings pass straight through
- match_rule() integration: validate after variable substitution; when blocked, clear the command + log a warning
- test_alert_rule_engine_validation.py: 34 tests (100% passing)
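A deny-list validator along these lines might look as follows. The pattern is a guess assembled from the blocked items listed above, not the actual `_RULE_ENGINE_DESTRUCTIVE_RE`, and treating anything that starts with `ssh ` as an SSH command is an assumption.

```python
import re

# Illustrative deny-list built from the items named in the commit message.
_DESTRUCTIVE_RE = re.compile(
    r"""
    \bdelete\s+(pvc|namespace|statefulset|deployment)\b
    | \b(drain|cordon)\b
    | --replicas[=\s]0\b
    | \brm\s+-rf\b
    | \bdrop\s+table\b
    | \$\(
    | `
    """,
    re.IGNORECASE | re.VERBOSE,
)


def validate_kubectl_command(command: str) -> bool:
    """Return True when the command may run; SSH commands and empty strings pass through."""
    stripped = command.strip()
    if not stripped or stripped.startswith("ssh "):
        return True
    return _DESTRUCTIVE_RE.search(stripped) is None
```

Running the check after variable substitution matters: a template that looks harmless can become destructive once a label value such as `$(...)` is substituted into it.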
Tests: 776 passed, 26 skipped, 0 failed
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-14 15:10:10 +08:00 |
|
OG T
|
be2ec4d761
|
docs(logbook): update current status (P0 documents backfilled, moat deployed)
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 14:54:37 +08:00 |
|
OG T
|
e778e4d0c1
|
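docs(slo+ops): SLO-SLI definition document + Human-in-the-Loop spec v1.0
Backfill industry-standard P0 documents (the yardstick + the brakes):
SLO-SLI-DEFINITION.md:
- 5 SLI definitions (success rate / latency / availability / KM accumulation / delivery rate)
- SLO target table (passing line + excellence line)
- Error Budget rules (4 levels: ample / caution / alert / exhausted)
- SLO violation alert rules (linked to TYPE-8M flywheel alerts)
- Milestone targets (a 4-Phase evolution roadmap)
HUMAN-IN-THE-LOOP.md:
- 9 human-intervention trigger conditions (HITL-1 ~ HITL-9)
- List of destructive operations that always require a human (scale=0, delete pvc, etc.)
- Fail-safe timeout behavior (escalation at 0→15→30→35 minutes)
- Three ways to activate the Kill Switch (Telegram/API/EnvVar)
- Standard SOP for human takeover (scenarios A/B/C)
- Human-intervention record format (alert_operation_log format)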
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 14:54:18 +08:00 |
|
AWOOOI CD
|
dd378ac698
|
chore(cd): deploy 684d6cf [skip ci]
|
2026-04-14 06:50:00 +00:00 |
|
OG T
|
684d6cfb43
|
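feat(adr-076): all four Tactic B Tasks complete (alert grouping + retry + automatic reports)
CD Pipeline / build-and-deploy (push) Successful in 17m34s
Task 2: AlertGroupingService, a Redis 5-minute sliding window that guards against alert storms
- apps/api/src/services/alert_grouping_service.py (new)
- webhooks.py integration: short-circuit child alerts after fingerprint generation and before the LLM
- Threshold=3, Graceful Degradation, 16 tests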
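To illustrate the sliding-window idea, here is an in-memory stand-in that replaces Redis with a `deque` per fingerprint. It assumes Threshold=3 means the first three alerts in a window pass and later ones are suppressed as children; `AlertGrouper` and `should_suppress` are hypothetical names, and the real service may count differently.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 300  # 5-minute window from the commit message
THRESHOLD = 3


class AlertGrouper:
    """In-memory sketch; the real service keeps this state in Redis."""

    def __init__(self, window=WINDOW_SECONDS, threshold=THRESHOLD, clock=time.monotonic):
        self._window = window
        self._threshold = threshold
        self._clock = clock
        self._seen = defaultdict(deque)

    def should_suppress(self, fingerprint: str) -> bool:
        """Record one alert and report whether it exceeds the window threshold."""
        now = self._clock()
        hits = self._seen[fingerprint]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] > self._window:
            hits.popleft()
        hits.append(now)
        return len(hits) > self._threshold
```

With Redis, the same shape is usually expressed as a sorted set keyed by fingerprint and scored by timestamp, and graceful degradation would mean letting the alert through when Redis is unreachable.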
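Task 3: approval_execution.py retry on execution failure
- MAX_RETRY=2, RETRY_DELAY_SECONDS=30
- _is_transient_error() classifies errors as transient or permanent; permanent errors are not retried
- The Timeline records retry progress, and a success is annotated with the retry count; 29 tests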
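The transient-versus-permanent split might be sketched like this. The constants come from the commit message, but the marker strings used to recognise a transient error and the names `is_transient_error` and `execute_with_retry` are assumptions for illustration.

```python
import asyncio

MAX_RETRY = 2
RETRY_DELAY_SECONDS = 30
# Assumed markers; the real classifier may inspect exception types or exit codes.
_TRANSIENT_MARKERS = ("timeout", "connection refused", "temporarily unavailable")


def is_transient_error(message: str) -> bool:
    lowered = message.lower()
    return any(marker in lowered for marker in _TRANSIENT_MARKERS)


async def execute_with_retry(run, sleep=asyncio.sleep):
    """Return (result, retries_used); re-raise permanent or exhausted errors."""
    retries = 0
    while True:
        try:
            return await run(), retries
        except Exception as exc:
            if retries >= MAX_RETRY or not is_transient_error(str(exc)):
                raise
            retries += 1
            await sleep(RETRY_DELAY_SECONDS)
```

Returning the retry count alongside the result gives the caller what it needs to annotate the timeline entry on success.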
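Task 4: report_generation_service.py automatic reports
- Daily inspection report: every day at 08:00 Taipei time, pushed to the Telegram SRE group
- Postmortem: triggered automatically when an Incident is resolved and duration > 10 minutes
- main.py lifespan mounts run_daily_report_loop(), 30 tests
Tests: 600 → 675 passing (+75), 0 failed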
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 14:39:14 +08:00 |
|
OG T
|
c0ba1000f3
|
Revert "fix(auto-repair): medium/low risk + no kubectl_command → TYPE-1 info only, no approval buttons"
This reverts commit abf1ffa91e7327a36af93be2742d53dac1933f0d.
|
2026-04-14 13:33:24 +08:00 |
|
OG T
|
2df4945880
|
fix(auto-repair): medium/low risk + no kubectl_command → TYPE-1 info only, no approval buttons
Problem: host-level alerts such as HostHighCpuLoad have affected_services=[] → OpenClaw generates
kubectl unknown → the safety guard blocks it → falls back to READY + a TYPE-3 card with buttons.
Users kept seeing medium/low-risk alerts whose buttons could not fix anything.
Three fixes:
1. openclaw.py: _call_openclaw_analyze() now returns a suggested_action field
+ target_resource defaults to "" (so "unknown" never reaches the safety guard)
2. decision_manager.py: classify_notification() is passed
suggested_action / risk_level / has_kubectl_command
3. telegram_gateway.py: new classify_notification() rule:
no kubectl_command + risk=low/medium + action=investigate/no_action
→ TYPE-1 (info only, no buttons)
Takes effect together with clawbot-v5 f4b84d7 (OpenClaw prompt CRITICAL RULES)
2026-04-14 Claude Sonnet 4.6 Asia/Taipei
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-14 13:33:24 +08:00 |
|
AWOOOI CD
|
5d8feaad2a
|
chore(cd): deploy 38ff2bb [skip ci]
|
2026-04-12 15:01:47 +00:00 |
|
OG T
|
38ff2bb7a5
|
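fix(heartbeat): switch to the ADR-075 TYPE-1 format (💚 INFO tree layout)
CD Pipeline / build-and-deploy (push) Successful in 15m4s
Old flat text → ├─/└─ tree layout, matching the ACTION REQUIRED card style
- Title: 💚/⚠️ INFO | AWOOOI 系統報告 (AWOOOI System Report)
- Added a ────── separator line
- The AI/MCP/flywheel/infrastructure sections all use the ├─/└─ format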
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 22:52:05 +08:00 |
|
OG T
|
f1face4e34
|
fix(alert-rules): dedicated rule for HostHighCpuLoad, stop kubectl scale unknown
CD Pipeline / build-and-deploy (push) Has been cancelled
Root cause: HostHighCpuLoad is a node_exporter host alert with no pod/deployment label;
it was matched by the K8s high_cpu rule → {target}=unknown → blocked by the auto-repair safety guard
Fix: add a host_cpu_high rule (priority=45), NO_ACTION + an accurate description
Removed HostHighCpuLoad/NodeCPUUsageHigh from the high_cpu rule
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 22:50:37 +08:00 |
|
OG T
|
1a4b52ed28
|
fix(alert): add alertname to the fingerprint to prevent cross-alert collisions + add missing heartbeat classifications
CD Pipeline / build-and-deploy (push) Has been cancelled
Root causes:
1. generate_fingerprint used alert_type (many alertnames fall into "custom")
→ different alert names on the same target shared a fingerprint → the 30-minute debounce suppressed each other
2. classify_alert_early missed DeadMansSwitch / NoAlertsReceived /
PrometheusNotConnectedToAlertmanager → they fell through to TYPE-3 generic alerts
Fix:
- alert_analyzer_service.py: the fingerprint is now namespace:deployment:alertname:target_resource
alertname comes from labels (Alertmanager), falling back to alert_type (other sources)
- incident_service.py: DeadMansSwitch → backup/TYPE-1;
NoAlertsReceived + PrometheusNotConnectedToAlertmanager → alertchain_health/TYPE-8M
- Added 2 tests; full suite 627 passed
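The new fingerprint layout can be sketched as a simple join of the four fields. Whether the real `generate_fingerprint` hashes the result or reads the fields from somewhere other than a flat `labels` dict is not stated, so this signature is an assumption.

```python
def generate_fingerprint(labels: dict, alert_type: str = "custom") -> str:
    """Build namespace:deployment:alertname:target_resource from alert labels."""
    # Alertmanager alerts carry alertname in labels; other sources fall back to alert_type.
    alertname = labels.get("alertname") or alert_type
    parts = (
        labels.get("namespace", ""),
        labels.get("deployment", ""),
        alertname,
        labels.get("target_resource", ""),
    )
    return ":".join(parts)
```

Two different alerts on the same target now differ in the third field, so the 30-minute debounce no longer lets one mask the other.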
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 22:50:20 +08:00 |
|
OG T
|
b17a677b97
|
fix(gitea-webhook): analysis.model_dump() fails on a dict
CD Pipeline / build-and-deploy (push) Has been cancelled
_call_openclaw_push_review returns a dict, not a Pydantic model
Now uses hasattr to check whether model_dump() exists
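The `hasattr` guard amounts to a small normalising helper, sketched here with a hypothetical name (`to_plain_dict`); the actual fix may be written inline.

```python
from typing import Any


def to_plain_dict(analysis: Any) -> dict:
    """Accept either a Pydantic-style model or a plain dict."""
    if hasattr(analysis, "model_dump"):
        return analysis.model_dump()
    return dict(analysis)
```

Duck-typing on `model_dump` avoids importing Pydantic at the call site and keeps working if the review function later starts returning a model again.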
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 22:45:09 +08:00 |
|
OG T
|
0c88f6702e
|
fix(ai-router): force DIAGNOSE to use deepseek-r1:14b instead of gemma3:4b
CD Pipeline / build-and-deploy (push) Has been cancelled
gemma3:4b (summary model, complexity≤1) does not emit structured JSON
→ _parse_llm_response cannot extract confidence → confidence=0.0
deepseek-r1:14b (default model) is verified to emit confidence=0.8
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 22:43:49 +08:00 |
|
OG T
|
946fe1fa7c
|
fix(monitoring): merge duplicate flywheel alert groups + add notification_type: TYPE-8M
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 44s
awoooi_flywheel_health (duplicate) merged into awoooi_flywheel_meta_alerts:
- All 5 rules now carry notification_type: TYPE-8M
- Added FlywheelAlertnameNullHigh (previously only in the old group)
- Deleted the duplicate group, eliminating the Prometheus same-name alert conflict
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 22:43:02 +08:00 |
|
AWOOOI CD
|
6dec8ce491
|
chore(cd): deploy db4d428 [skip ci]
|
2026-04-12 14:32:47 +00:00 |
|
OG T
|
db4d4280f5
|
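test(ai-router): update DIAGNOSE routing tests to reflect that NEMOTRON is currently suspended
CD Pipeline / build-and-deploy (push) Successful in 14m28s
NEMOTRON is suspended because of the confidence=0.0 issue; DIAGNOSE now uses complexity routing (None)
To be restored once _parse_confidence() is fixed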
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 22:22:52 +08:00 |
|
OG T
|
09134f5c47
|
fix(openclaw): fix incident.title + DIAGNOSE→NEMOTRON confidence=0.0
CD Pipeline / build-and-deploy (push) Failing after 2m10s
1. telegram_gateway.py:1169: classify_notification() was still using incident.title
Now uses a combination of alertname + signal annotations (same fix as in decision_manager.py)
2. ai_router.py: DIAGNOSE routing suspends NEMOTRON
NIM tool_call returns no confidence → openclaw_analysis_complete confidence=0.0
Changed to None (complexity routing) so that Gemini/openclaw_nemo handle DIAGNOSE
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 22:12:55 +08:00 |
|
OG T
|
3de45aa2c3
|
fix(k8s): sync deployment env + disable ENABLE_NEMOTRON_COLLABORATION
CD Pipeline / build-and-deploy (push) Has been cancelled
Sync the live-patched env: overrides back into Git to avoid ArgoCD drift:
- ENABLE_NEMOTRON_COLLABORATION: false (Ollama timeout fix)
- USE_AI_ROUTER, OLLAMA_URL, OPENCLAW_* and other overrides are now under GitOps management
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 22:06:10 +08:00 |
|
OG T
|
bd75aca727
|
feat(adr-075): add the 2 missing Prometheus alert rules
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 49s
- MomoScraperSuccessLow: business scraper success rate <90% (business group)
- CoreDNSResolutionFailed: CoreDNS SERVFAIL rate >5% (kubernetes group)
ADR-075 Phase 3 complete
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 21:59:18 +08:00 |
|
AWOOOI CD
|
b6caabd8e3
|
chore(cd): deploy b3d4b9c [skip ci]
|
2026-04-12 13:29:40 +00:00 |
|
OG T
|
b3d4b9c8a9
|
test(telegram): fix assertions in test_telegram_message_templates
CD Pipeline / build-and-deploy (push) Successful in 14m24s
CRITICAL → 嚴重 (ADR-075 Chinese risk levels)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 21:20:16 +08:00 |
|
OG T
|
01e6d75ee7
|
test(telegram): fix test assertions to match the ADR-075 Chinese risk levels
CD Pipeline / build-and-deploy (push) Failing after 1m55s
HIGH→高風險, MEDIUM→中風險 (test_sentry / test_github webhook)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 21:08:48 +08:00 |
|
OG T
|
efca6f816a
|
fix(nemotron): disable Nemotron collaboration (Ollama timeout causes confidence=0.0)
CD Pipeline / build-and-deploy (push) Failing after 2m1s
Ollama 111 tool_call 60s×2 > asyncio.wait_for 30s
→ expert_system fallback → confidence=0.0 → the alert card shows a rule match of 0%
ADR-044 is suspended until Ollama is stable; the OpenClaw LLM still works normally
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 21:06:27 +08:00 |
|
OG T
|
9c8dde0951
|
fix(telegram): fix missing Incident title field that made every Telegram push fail
CD Pipeline / build-and-deploy (push) Failing after 2m3s
Root cause: _push_decision_to_telegram() references incident.title in two places,
but the Incident model never had this field, so every alert card push raised
AttributeError and failed silently in the telegram_decision_push_failed event.
Fix:
- line 188: message now uses signal annotation summary/description/alert_name
- line 249: TYPE-1 title now uses the alertname label / signal.alert_name
Impact: ever since decision_manager added these two lines, no Telegram notification had been sent
(including TYPE-1 info notifications and TYPE-3 approval cards)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 21:02:55 +08:00 |
|
OG T
|
3d8b0e4f90
|
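fix(adr075): TYPE-3 format now uses the spec template (ACTION REQUIRED + AI deep diagnosis + suggested repair action)
CD Pipeline / build-and-deploy (push) Failing after 2m15s
- Header changed to "{emoji} ACTION REQUIRED | {severity_zh}"
- Added the "🧠 AI 深度診斷" (AI deep diagnosis) block (analysis / responsibility / AI source)
- Added the "⚡ 建議修復動作" (suggested repair action) block (<code> format)
- confidence=0 shows "📋 規則分析" (rule analysis) instead of the misleading "🔴 0%"
- The SigNoz metrics block regains its Trace link
2026-04-12 ogt: ADR-075 TYPE-3 format standardization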
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 21:00:28 +08:00 |
|
OG T
|
a7f2b9c0f5
|
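fix(display): rule matches now show ✅ instead of 🔴 0% + fix parsing of string confidence from the LLM
CD Pipeline / build-and-deploy (push) Has been cancelled
- telegram_gateway.py: confidence==0 (rule match/Expert fallback) no longer shows
"🔴 0%"; it now shows "⚙️ 規則匹配 ✅" (rule match), fixed for both card types
- openclaw.py: NIM/Ollama sometimes return the string "0.85" instead of a float, so
isinstance(str, int|float)=False → confidence was forced to 0.0.
float() parsing is now attempted first, and only a parse failure falls back to 0.0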
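The tolerant parsing described above could be sketched as below. Rejecting booleans and non-finite values such as `"nan"` are additions of this sketch, not behaviour stated in the commit, and `parse_confidence` is a hypothetical name.

```python
import math


def parse_confidence(raw: object) -> float:
    """Accept numbers or numeric strings; fall back to 0.0 otherwise."""
    if isinstance(raw, bool):  # bool is a subclass of int; treat it as invalid here
        return 0.0
    if isinstance(raw, (int, float)):
        value = float(raw)
    elif isinstance(raw, str):
        try:
            value = float(raw.strip())
        except ValueError:
            return 0.0
    else:
        return 0.0
    return value if math.isfinite(value) else 0.0
```

Since 0.0 doubles as the "no LLM confidence" signal on the cards, keeping malformed input on that same fallback preserves the display logic from the first bullet.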
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 20:50:53 +08:00 |
|
AWOOOI CD
|
f64393e4cb
|
chore(cd): deploy eda0cfd [skip ci]
|
2026-04-12 12:30:49 +00:00 |
|
OG T
|
eda0cfd034
|
fix(adr075): drift notifications now use send_drift_card, all call sites covered
CD Pipeline / build-and-deploy (push) Successful in 14m13s
- drift.py: removed the dead send_text() code; narrate_and_notify() now sends the card in one place
- drift_narrator_service: _send_telegram() now calls send_drift_card() with four buttons
- webhooks.py /alerts path: now passes alert_category to enable dynamic buttons
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 20:20:47 +08:00 |
|
AWOOOI CD
|
f4675872f9
|
chore(cd): deploy c3fea26 [skip ci]
|
2026-04-12 12:17:06 +00:00 |
|
OG T
|
c3fea26222
|
fix(adr075): webhooks send_approval_card now passes alert_category+notification_type
CD Pipeline / build-and-deploy (push) Has been cancelled
The real root cause of the break: _push_to_telegram_background called send_approval_card()
without passing alert_category and notification_type, so the dynamic buttons always
fell back to the generic [批准][拒絕][靜默] (Approve / Reject / Silence).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 20:07:12 +08:00 |
|
OG T
|
0a4b7e9609
|
fix(classify): HostBackupFailed added precisely to backup/TYPE-1 (tests pass)
CD Pipeline / build-and-deploy (push) Has been cancelled
The previous fix used 'backup' in alertname_lower, which was too broad: the BackupJobFailed warning
was classified as TYPE-1, breaking test_backup_keyword_warning_not_type1.
Changed to an exact allowlist:
_BACKUP_TYPE1_NAMES = {HostBackupFailed, HostBackupStale, HostBackupMissing,
BackupRestoreTestFailed, BackupRestoreTestStale}
+ alertname.startswith('HostBackup') as a catch-all
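Spelled out as runnable Python, the allowlist check might read as follows; `is_backup_type1` is a hypothetical helper name and the real check may live inline in `classify_alert_early()`.

```python
# Names quoted in the commit message.
_BACKUP_TYPE1_NAMES = frozenset({
    "HostBackupFailed",
    "HostBackupStale",
    "HostBackupMissing",
    "BackupRestoreTestFailed",
    "BackupRestoreTestStale",
})


def is_backup_type1(alertname: str) -> bool:
    """Exact allowlist first, then the HostBackup prefix as a catch-all."""
    return alertname in _BACKUP_TYPE1_NAMES or alertname.startswith("HostBackup")
```

Unlike the earlier substring match, `BackupJobFailed` is neither in the set nor prefixed with `HostBackup`, so it keeps its normal classification.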
Result: 664 passed, 0 failed
2026-04-12 ogt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 20:03:46 +08:00 |
|
OG T
|
f25d82a88a
|
fix(adr075): patch breakpoint E (_push_to_telegram_background gains TYPE-8M routing)
CD Pipeline / build-and-deploy (push) Has been cancelled
Breakpoint E: the alertmanager webhook goes through _push_to_telegram_background,
which had no TYPE-8M branch, so meta alerts were never sent.
- webhooks.py: added the alert_category parameter + a TYPE-8M branch
- incident_service.py: restored rule 5 to catch only watchdog/heartbeat and
removed the mistakenly added backup startswith rule (VeleroBackup is handled by the K8s rule)
Tests: 52/52 passed
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 20:01:51 +08:00 |
|
OG T
|
1f7975170a
|
fix(classify): add HostBackupFailed to the backup/TYPE-1 rule
CD Pipeline / build-and-deploy (push) Failing after 1m51s
The backup rule in classify_alert_early() only caught watchdog/Heartbeat, so HostBackupFailed
was caught first by the Host prefix rule → host_resource/TYPE-3 → LLM run → approval card.
Fix: add backup keyword/prefix matching ahead of the Host prefix rule:
- HostBackup* / Backup* / VeleroBackup* / BackupRestore*
- alertname contains "backup" (case-insensitive)
Impact: all backup-related alerts go straight to a TYPE-1 info notification without entering the LLM.
Non-backup Host alerts such as HostHighCpu / HostDown are unaffected.
2026-04-12 ogt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 19:52:05 +08:00 |
|
OG T
|
a5f17cea79
|
fix(notification): TYPE-1 backup/info alerts no longer send approval cards
CD Pipeline / build-and-deploy (push) Has been cancelled
classify_notification() is unaware of alert_category and returns TYPE-3 for backup alerts
(confidence=0, auto_executed=False), overriding the notification_type=TYPE-1
already set by classify_alert_early().
Fix: before the routing branch, let an explicit incident.notification_type value
(TYPE-1 / TYPE-4D / TYPE-8M) override classify_notification().
Impact: backup/info/watchdog alerts only call send_info_notification() and
no longer push approval cards with buttons to Telegram.
2026-04-12 ogt (ADR-075 bugfix)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 19:49:31 +08:00 |
|
AWOOOI CD
|
6490c6a885
|
chore(cd): deploy e5791b9 [skip ci]
|
2026-04-12 11:34:56 +00:00 |
|