OG T
|
aae7c12645
|
feat(adr-076): Task 3.3 — SSH 修復 KM 萃取(補齊飛輪雙手)
CD Pipeline / build-and-deploy (push) Has been cancelled
動機: SSH MCP 修復(docker restart/systemctl)成功後,KM 無法學習
因為 _extract_repair_steps 只處理 kubectl,SSH 路徑完全漏失。
approval_execution.py:
- _trigger_playbook_extraction: 成功執行後將 approval.action 寫入
incident.outcome.learning_notes,供 Playbook 萃取器讀取
playbook_service.py:
- _parse_ssh_command(): 新增模組函式,解析 ssh [user@]host 'cmd' 格式
- _extract_repair_steps(): 步驟 2 擴充 SSH 路徑分支
ssh ... → ActionType.SSH_COMMAND + host 記錄
kubectl ... → ActionType.KUBECTL(保留原有邏輯)
- _generate_name(): SSH 修復自動加 [SSH] 前綴
- _extract_tags(): SSH 修復自動加 ssh + host_layer 標籤
test_playbook_ssh_extraction.py: 18 tests(100% 通過)
飛輪雙手對齊:
kubectl 路徑: decision_chain.reasoning_steps → KM ✅ (既有)
SSH 路徑: approval.action → learning_notes → KM ✅ (Task 3.3 新增)
測試: 794 passed, 26 skipped, 0 failed
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-14 15:19:54 +08:00 |
|
OG T
|
cc42aa0bdb
|
feat(adr-076): Task 2.2 + 2.3 — 規則擴充 + kubectl 注入防護
CD Pipeline / build-and-deploy (push) Has been cancelled
Task 2.2: alert_rules.yaml 新增 3 類規則 (priority 125-127)
- gitea_down: Gitea CI/CD 下線 → NO_ACTION (priority 125, critical)
- ssl_cert_expiring: SSL 憑證到期 → NO_ACTION (priority 126, medium)
- external_site_down: MoWoooWork/Dev/Blackbox probe → NO_ACTION (priority 127, medium)
規則總數: 21 → 24
Task 2.3: alert_rule_engine.py kubectl 注入防護
- _RULE_ENGINE_DESTRUCTIVE_RE: 阻擋 delete pvc/namespace/statefulset/deployment,
drain/cordon, --replicas=0, rm -rf, DROP TABLE, $() 反引號
- validate_kubectl_command(): 公開 API,SSH 指令/空字串直接通過
- match_rule() 整合: 變數替換後驗證,阻擋時清空 + log warning
- test_alert_rule_engine_validation.py: 34 tests (100% 通過)
測試: 776 passed, 26 skipped, 0 failed
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-14 15:10:10 +08:00 |
|
OG T
|
684d6cfb43
|
feat(adr-076): 戰術 B 四大 Task 全部完成 — 告警聚合+重試+自動報告
CD Pipeline / build-and-deploy (push) Successful in 17m34s
Task 2: AlertGroupingService — Redis 5分鐘滑動視窗,防告警風暴
- apps/api/src/services/alert_grouping_service.py (新增)
- webhooks.py 整合:指紋生成後/LLM前短路子告警
- Threshold=3,Graceful Degradation,16 tests
Task 3: approval_execution.py 執行失敗重試
- MAX_RETRY=2, RETRY_DELAY_SECONDS=30
- _is_transient_error() 瞬態/永久分類,永久錯誤不重試
- Timeline 記錄重試進度,成功後標注重試次數,29 tests
Task 4: report_generation_service.py 自動報告
- 日度巡檢報告:每日 08:00 台北時間,Telegram SRE 群組推送
- Postmortem:Incident resolved + duration > 10 分鐘自動觸發
- main.py lifespan 掛載 run_daily_report_loop(),30 tests
測試: 600 → 675 通過 (+75),0 failed
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
2026-04-14 14:39:14 +08:00 |
|
OG T
|
c0ba1000f3
|
Revert "fix(auto-repair): 中低風險+無kubectl_command → TYPE-1 純資訊,不顯示審核按鈕"
This reverts commit abf1ffa91e7327a36af93be2742d53dac1933f0d.
|
2026-04-14 13:33:24 +08:00 |
|
OG T
|
2df4945880
|
fix(auto-repair): 中低風險+無kubectl_command → TYPE-1 純資訊,不顯示審核按鈕
問題: HostHighCpuLoad 等主機層告警 affected_services=[] → OpenClaw 生成
kubectl unknown → safety guard 攔截 → 退回 READY + TYPE-3 帶按鈕卡片
用戶一直看到帶按鈕的中/低風險告警,按鈕無法修復任何東西
修復三處:
1. openclaw.py: _call_openclaw_analyze() 回傳 suggested_action 欄位
+ target_resource 預設改為 "" (避免 "unknown" 進入 safety guard)
2. decision_manager.py: classify_notification() 傳入
suggested_action / risk_level / has_kubectl_command
3. telegram_gateway.py: classify_notification() 新規則 —
無 kubectl_command + risk=low/medium + action=investigate/no_action
→ TYPE-1 (純資訊,無按鈕)
搭配 clawbot-v5 f4b84d7 (OpenClaw prompt CRITICAL RULES) 一起生效
2026-04-14 Claude Sonnet 4.6 Asia/Taipei
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-14 13:33:24 +08:00 |
|
OG T
|
38ff2bb7a5
|
fix(heartbeat): 改用 ADR-075 TYPE-1 格式 — 💚 INFO 樹狀結構
CD Pipeline / build-and-deploy (push) Successful in 15m4s
舊平鋪文字 → ├─/└─ 樹狀結構對齊 ACTION REQUIRED 卡片風格
- 標題: 💚/⚠️ INFO | AWOOOI 系統報告
- 加 ────── 分隔線
- AI/MCP/飛輪/基礎設施各節統一 ├─/└─ 格式
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 22:52:05 +08:00 |
|
OG T
|
f1face4e34
|
fix(alert-rules): HostHighCpuLoad 獨立規則,停止 kubectl scale unknown
CD Pipeline / build-and-deploy (push) Has been cancelled
根因: HostHighCpuLoad 是 node_exporter host 告警,無 pod/deployment label
被分到 K8s high_cpu 規則 → {target}=unknown → auto-repair 安全攔截
修復: 新增 host_cpu_high 規則 (priority=45),NO_ACTION + 正確描述
high_cpu 規則移除 HostHighCpuLoad/NodeCPUUsageHigh
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 22:50:37 +08:00 |
|
OG T
|
1a4b52ed28
|
fix(alert): fingerprint 加 alertname 防跨告警指紋衝突 + 補入缺漏心跳分類
CD Pipeline / build-and-deploy (push) Has been cancelled
問題根因:
1. generate_fingerprint 用 alert_type(大量 alertname 落入 "custom")
→ 不同告警名稱同目標共用指紋 → 30 分鐘 debounce 互相擋截
2. classify_alert_early 漏掉 DeadMansSwitch / NoAlertsReceived /
PrometheusNotConnectedToAlertmanager → 落入 TYPE-3 一般告警
修復:
- alert_analyzer_service.py: 指紋改為 namespace:deployment:alertname:target_resource
alertname 取自 labels(Alertmanager),fallback 到 alert_type(其他來源)
- incident_service.py: DeadMansSwitch → backup/TYPE-1;
NoAlertsReceived + PrometheusNotConnectedToAlertmanager → alertchain_health/TYPE-8M
- 補 2 個測試,全套 627 passed
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 22:50:20 +08:00 |
|
OG T
|
b17a677b97
|
fix(gitea-webhook): analysis.model_dump() 對 dict 失敗
CD Pipeline / build-and-deploy (push) Has been cancelled
_call_openclaw_push_review 回傳 dict,不是 Pydantic model
改用 hasattr 判斷是否有 model_dump()
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 22:45:09 +08:00 |
|
OG T
|
0c88f6702e
|
fix(ai-router): DIAGNOSE 強制用 deepseek-r1:14b,不用 gemma3:4b
CD Pipeline / build-and-deploy (push) Has been cancelled
gemma3:4b (summary model, complexity≤1) 不輸出結構化 JSON
→ _parse_llm_response 無法提取 confidence → confidence=0.0
deepseek-r1:14b (default model) 已驗證可輸出 confidence=0.8
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 22:43:49 +08:00 |
|
OG T
|
db4d4280f5
|
test(ai-router): 更新 DIAGNOSE routing 測試反映暫停 NEMOTRON 現況
CD Pipeline / build-and-deploy (push) Successful in 14m28s
NEMOTRON 因 confidence=0.0 問題暫停,改走複雜度路由(None)
待 _parse_confidence() 修復後恢復
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 22:22:52 +08:00 |
|
OG T
|
09134f5c47
|
fix(openclaw): 修復 incident.title + DIAGNOSE→NEMOTRON confidence=0.0
CD Pipeline / build-and-deploy (push) Failing after 2m10s
1. telegram_gateway.py:1169 — classify_notification() 仍用 incident.title
改用 alertname + signal annotations 組合 (同 decision_manager.py 修法)
2. ai_router.py — DIAGNOSE 路由暫停 NEMOTRON
NIM tool_call 返回無 confidence → openclaw_analysis_complete confidence=0.0
改為 None (複雜度路由),讓 Gemini/openclaw_nemo 處理 DIAGNOSE
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 22:12:55 +08:00 |
|
OG T
|
b3d4b9c8a9
|
test(telegram): 修正 test_telegram_message_templates 斷言
CD Pipeline / build-and-deploy (push) Successful in 14m24s
CRITICAL → 嚴重 (ADR-075 中文風險等級)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 21:20:16 +08:00 |
|
OG T
|
01e6d75ee7
|
test(telegram): 修正測試斷言符合 ADR-075 中文風險等級
CD Pipeline / build-and-deploy (push) Failing after 1m55s
HIGH→高風險, MEDIUM→中風險 (test_sentry / test_github webhook)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 21:08:48 +08:00 |
|
OG T
|
9c8dde0951
|
fix(telegram): 修復 Incident 無 title 欄位導致所有 Telegram 推送失敗
CD Pipeline / build-and-deploy (push) Failing after 2m3s
根因: _push_decision_to_telegram() 有兩處引用 incident.title,
但 Incident model 從來沒有此欄位,導致所有告警卡片推送都
拋 AttributeError,事件在 telegram_decision_push_failed 靜默失敗。
修法:
- line 188: message 改用 signal annotation summary/description/alert_name
- line 249: TYPE-1 title 改用 alertname label / signal.alert_name
影響: 自從 decision_manager 加入這兩行以來,所有 Telegram 通知都沒發出
(包含 TYPE-1 資訊通知和 TYPE-3 審批卡)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 21:02:55 +08:00 |
|
OG T
|
3d8b0e4f90
|
fix(adr075): TYPE-3 格式改用 spec 模板 — ACTION REQUIRED + AI深度診斷 + 建議修復動作
CD Pipeline / build-and-deploy (push) Failing after 2m15s
- 標頭改為 "{emoji} ACTION REQUIRED | {severity_zh}"
- 新增 "🧠 AI 深度診斷" 區塊 (分析/責任/AI來源)
- 新增 "⚡ 建議修復動作" 區塊 (<code> 格式)
- confidence=0 顯示 "📋 規則分析" 取代誤導性 "🔴 0%"
- SignOz 指標區塊補回 Trace 連結
2026-04-12 ogt: ADR-075 TYPE-3 格式標準化
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 21:00:28 +08:00 |
|
OG T
|
a7f2b9c0f5
|
fix(display): 規則匹配改顯示 ✅ 取代 🔴 0% + 修復 LLM 字串 confidence 解析
CD Pipeline / build-and-deploy (push) Has been cancelled
- telegram_gateway.py: confidence==0 (規則匹配/Expert fallback) 不再顯示
「🔴 0%」,改顯示「⚙️ 規則匹配 ✅」,兩個 card 類型都修正
- openclaw.py: NIM/Ollama 有時回傳字串 "0.85" 而非 float,導致
isinstance(str, int|float)=False → confidence 被強制設 0.0。
現在先嘗試 float() 解析,解析失敗才 fallback 0.0
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 20:50:53 +08:00 |
|
OG T
|
eda0cfd034
|
fix(adr075): drift 通知改用 send_drift_card,補齊所有呼叫點
CD Pipeline / build-and-deploy (push) Successful in 14m13s
- drift.py: 移除死碼 send_text(),改由 narrate_and_notify() 統一發卡片
- drift_narrator_service: _send_telegram() 改呼 send_drift_card() 帶四顆按鈕
- webhooks.py /alerts 路徑: 補傳 alert_category 啟用動態按鈕
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 20:20:47 +08:00 |
|
OG T
|
c3fea26222
|
fix(adr075): webhooks send_approval_card 補傳 alert_category+notification_type
CD Pipeline / build-and-deploy (push) Has been cancelled
斷點真正根因:_push_to_telegram_background 呼叫 send_approval_card()
時沒有傳入 alert_category 和 notification_type,導致動態按鈕永遠
fallback 到通用 [批准][拒絕][靜默]。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 20:07:12 +08:00 |
|
OG T
|
0a4b7e9609
|
fix(classify): HostBackupFailed 精確補入 backup/TYPE-1(測試通過)
CD Pipeline / build-and-deploy (push) Has been cancelled
前次修法用 'backup' in alertname_lower 太寬,導致 BackupJobFailed warning
被分到 TYPE-1,破壞 test_backup_keyword_warning_not_type1。
改為精確白名單:
_BACKUP_TYPE1_NAMES = {HostBackupFailed, HostBackupStale, HostBackupMissing,
BackupRestoreTestFailed, BackupRestoreTestStale}
+ alertname.startswith('HostBackup') 兜底
結果:664 passed, 0 failed
2026-04-12 ogt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 20:03:46 +08:00 |
|
OG T
|
f25d82a88a
|
fix(adr075): 修補斷點E — _push_to_telegram_background 補 TYPE-8M routing
CD Pipeline / build-and-deploy (push) Has been cancelled
斷點E:alertmanager webhook 走 _push_to_telegram_background,
未含 TYPE-8M branch,導致 meta alert 從未送出。
- webhooks.py: 新增 alert_category 參數 + TYPE-8M branch
- incident_service.py: 還原 rule 5 僅攔 watchdog/heartbeat,
移除誤加的 backup startswith 規則(VeleroBackup 由 K8s rule 接管)
Tests: 52/52 passed
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 20:01:51 +08:00 |
|
OG T
|
1f7975170a
|
fix(classify): HostBackupFailed 補入 backup/TYPE-1 規則
CD Pipeline / build-and-deploy (push) Failing after 1m51s
classify_alert_early() 的 backup 規則只攔 watchdog/Heartbeat,
HostBackupFailed 先被 Host prefix 規則攔走 → host_resource/TYPE-3 → 跑 LLM → 審批卡。
修法:在 Host prefix 前新增 backup 關鍵字/前綴攔截:
- HostBackup* / Backup* / VeleroBackup* / BackupRestore*
- alertname 含 "backup"(大小寫不敏感)
影響:所有備份相關告警直接走 TYPE-1 info 通知,不進 LLM。
HostHighCpu / HostDown 等非備份的 Host 告警不受影響。
2026-04-12 ogt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 19:52:05 +08:00 |
|
OG T
|
a5f17cea79
|
fix(notification): TYPE-1 backup/info 告警不再發審批卡
CD Pipeline / build-and-deploy (push) Has been cancelled
classify_notification() 不知道 alert_category,對 backup 告警
(confidence=0, auto_executed=False)返回 TYPE-3,覆蓋掉
classify_alert_early() 已設好的 notification_type=TYPE-1。
修法:在路由分支前,讓 incident.notification_type 明確值
(TYPE-1 / TYPE-4D / TYPE-8M)覆蓋 classify_notification()。
影響:backup/info/watchdog 告警只發 send_info_notification(),
不再噴帶按鈕的審批卡到 Telegram。
2026-04-12 ogt (ADR-075 bugfix)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 19:49:31 +08:00 |
|
OG T
|
7f3e585d6d
|
fix(webhooks): alertmanager handler — alert_type 超範圍改為 custom
CD Pipeline / build-and-deploy (push) Has been cancelled
AlertPayload.alert_type 只接受 8 個 Literal 值
ALERTNAME_TO_TYPE 映射回傳 host_cpu/backup_failure 等不在白名單 → ValidationError
修法:凡不在 Literal 白名單的 alert_type 一律 fallback 為 "custom"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 19:22:35 +08:00 |
|
OG T
|
bc2665ef6b
|
feat(adr075): Step-5 decision_manager TYPE-5S/TYPE-6B 路由分支
CD Pipeline / build-and-deploy (push) Has been cancelled
- 新增 secops elif:alert_category=secops → send_secops_card()
(resource, threat_behavior 從 incident.signals labels 提取)
- 新增 business elif:alert_category=business → send_business_alert()
(metric_name/current_value/threshold 從 Prometheus labels 提取)
- TYPE-7E escalation_monitor 標記 out-of-scope (ADR-075 範疇外)
- 兩分支均加 2026-04-12 ogt (ADR-075 Step-5) 變更標記
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 19:12:35 +08:00 |
|
OG T
|
24c1b5677b
|
feat(adr075): Step1-3 classify補丁+新按鈕+TYPE-5S/6B/7E格式函數
Step-1 incident_service.py classify_alert_early():
- 新增 secops (TYPE-5S): UnauthorizedSSH/KubeAudit/CVE/WAFAttack/PodAbnormal
- 新增 business (TYPE-6B): AITokenCost/GeminiAPIError/SLOBurn/MomoScraper
- 新增 flywheel_health MCPProvider/OllamaDown/NemotronDown 前綴
- ssl_cert: 依 days_remaining 決定 TYPE-1(≥14d) vs TYPE-3(<14d)
Step-2 telegram_gateway.py _build_inline_keyboard():
- 新增 secops: [隔離] [封鎖IP] [驅逐] [確認授權]
- 新增 business: [暫停1h] [查SignOz] [忽略]
- 新增 flywheel_health: [觸發診斷] [飛輪面板] [靜默]
Step-3 telegram_gateway.py 新增格式化函數 (Tier 2):
- send_secops_card() — TYPE-5S 防禦按鈕+nonce
- send_business_alert() — TYPE-6B 業務損失速率
- send_escalation_card() — TYPE-7E P0/P1 升級,發 DM+群組
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 18:50:37 +08:00 |
|
OG T
|
65a5220e16
|
feat(flywheel-c2-c3): C2 hasType4接真實API + C3 WebSocket指數退避重連
CD Pipeline / build-and-deploy (push) Failing after 3m41s
C2: flywheel_stats_service 加 type4_count query → API 回傳
flywheel-diagram.tsx hasType4 改由 type4Count prop 驅動(非 false)
flywheel-kpi-card.tsx 傳入 type4Count={flowData?.type4_count}
C3: WebSocket onclose 加指數退避重連 (1s→2s→4s→最大30s)
cancelled 旗標確保 unmount 後不重連
wsRetryTimer 加入 cleanup
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 18:45:40 +08:00 |
|
OG T
|
1cb654cf59
|
fix(adr-075): CR P0/P1 修補 — TYPE_8M enum + 死碼清理 + docstring 更新
CD Pipeline / build-and-deploy (push) Has been cancelled
P0-2: NotificationType 新增 TYPE_8M = "TYPE-8M"
classify_notification 早期回傳 TYPE-8M
decision_manager 改用 NotificationType.TYPE_8M enum 比較(移除字串字面量)
P1-1: 移除 _CATEGORY_BUTTONS 中不可達的 alertchain_health/flywheel_health 條目
P1-4: test_classify_alert_early.py docstring 更新為 13 條規則/10 分類
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 18:44:12 +08:00 |
|
OG T
|
561c1d806b
|
feat(adr-075): Phase 2 — TYPE-8M 飛輪/告警鏈路健康通知格式與路由
CD Pipeline / build-and-deploy (push) Failing after 4m0s
新增 send_meta_alert() — ⚙️ META SYSTEM 卡片(觸發診斷/查看面板/靜默)
decision_manager 新增 TYPE-8M elif 分支(在 TYPE-4D 後)
_alert_category 提取提前至 if 鏈前,三個分支共用
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 18:39:04 +08:00 |
|
OG T
|
2cef2098d3
|
feat(adr-075): 修復 Telegram 動態按鈕 4 個斷點 + 新增 7 種告警分類
CD Pipeline / build-and-deploy (push) Has been cancelled
斷點 A: decision_manager 提取 alert_category/notification_type 傳入 send_approval_card
斷點 B: send_approval_card 新增參數並傳遞至 _build_inline_keyboard
斷點 C: 互動型通知 (TYPE-3/4/4D/8M) 禁止發 SRE 群組,防 nonce 洩漏
斷點 D: _CATEGORY_BUTTONS k8s_workload → kubernetes + 新增 6 類按鈕組
classify_alert_early 新增: alertchain_health, flywheel_health, storage,
devops_tool, external_site, ssl_cert, host_resource (從 infrastructure 分離)
Test: 52 classify + 664 total passed
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 18:35:56 +08:00 |
|
OG T
|
587d745a50
|
fix(km): 修補 KMConversionService 兩個屬性錯誤
CD Pipeline / build-and-deploy (push) Failing after 28s
- incident.title → getattr(incident, 'title', None) or alertname
(Incident model 無 title 欄位)
- km_entry.entry_id → km_entry.id
(KnowledgeEntry model 主鍵為 id 非 entry_id)
- 補跑後 KM entries 714 → 821 (+107), incidents.vectorized 全部歸零
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 17:52:57 +08:00 |
|
OG T
|
38dddcc7a2
|
fix(heartbeat): KM向量化改用raw SQL + 格式優化去除空格對齊
CD Pipeline / build-and-deploy (push) Failing after 29s
- KM vectorized 改用 raw SQL (ORM 無 embedding 欄位)
- 移除 {display:<18} 空格對齊(非等寬字體Telegram會錯位)
- 格式: Name: value 每行一項,清楚易讀
- KM向量化加狀態icon (✅ ≥90% / ⚠️ <90%)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 17:36:01 +08:00 |
|
OG T
|
a1691c41d5
|
fix(flywheel-stats): 修補 FlywheelStatsService 三個欄位錯誤
CD Pipeline / build-and-deploy (push) Failing after 30s
- KnowledgeEntryRecord.vectorized → embedding.is_(None) (欄位不存在)
- IncidentRecord.id → IncidentRecord.incident_id (主鍵名稱)
- 修復後 /api/v1/stats/flywheel nodes 不再全部回傳 unknown
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 17:27:35 +08:00 |
|
OG T
|
99b489ca63
|
fix(flywheel): 修補剩餘 P0/P1 缺陷
CD Pipeline / build-and-deploy (push) Has been cancelled
- CRITICAL-1: TYPE-1 path approval_id=str(alert_id) → uuid.uuid4(),
避免 UUID(approval_id) 拋 ValueError 導致所有 Heartbeat/Info 告警崩潰
- CRITICAL-2: asyncio.create_task() 結果存入 _exec_task 並加 done_callback,
防止 GC 在執行中途回收任務
- FORMAT: _push_to_telegram_background 新增 notification_type + diff_summary 參數,
TYPE-4D → send_drift_card(),其他 → send_approval_card()(修正 ConfigDrift 顯示錯誤卡片)
- 傳遞 notification_type 至 Alertmanager 兩個呼叫點
ADR-073 四斷點修補最終收尾
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 17:14:57 +08:00 |
|
OG T
|
f0e14136ca
|
fix(flywheel): 修補飛輪四個核心斷點,讓完整流程真正串接起來
CD Pipeline / build-and-deploy (push) Has been cancelled
1. incident_service.py: save_to_episodic_memory() 補寫 alertname/notification_type/alert_category
→ 之前這3欄在DB永遠NULL,LLM無alertname,Playbook匹配全失敗
2. telegram_gateway.py: Telegram批准後呼叫 execute_approved_action()
→ 之前sign_approval()只改DB狀態,380筆批准0筆真正執行kubectl指令
3. approval_execution.py: 執行成功後呼叫 resolve_incident()
webhooks.py: auto-repair成功後呼叫 resolve_incident()
→ 之前Incident永遠停在INVESTIGATING,KM轉換永遠不觸發,Playbook=0
4. webhooks.py: TYPE-1告警短路,不進LLM
→ 之前Heartbeat/Backup/Info仍燒LLM token,產生垃圾修復建議
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 17:01:10 +08:00 |
|
OG T
|
93f9522d5a
|
fix(heartbeat): 對齊整點發送避免多replica各自發 + KM向量化改查embedding欄位
CD Pipeline / build-and-deploy (push) Successful in 14m10s
- _heartbeat_loop: 先 sleep 到下一個整點倍數再開始循環
避免 3 個 replica 啟動時間不同導致短時間內收到多條心跳
- heartbeat_report_service: km_vectorized 改查 KnowledgeEntryRecord.embedding IS NOT NULL
原本錯誤查 IncidentRecord.vectorized 導致顯示 0/714 (0%)
2026-04-12 ogt (ADR-073 heartbeat fix)
|
2026-04-12 16:33:15 +08:00 |
|
OG T
|
effd78807e
|
fix(heartbeat): blocking_timeout 5→0,多 replica 不排隊等鎖避免重複發送
CD Pipeline / build-and-deploy (push) Successful in 14m0s
3 個 replica 各自跑 loop,blocking_timeout=5.0 導致鎖釋放後
其他 replica 依序拿鎖,每次心跳最多發 3 條。
改為 blocking_timeout=0:拿不到鎖立刻跳過,同週期只發一條。
2026-04-12 ogt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 16:13:41 +08:00 |
|
OG T
|
a28625f088
|
fix(cr): 首席架構師 CR P0/P1/P2 全修補
CD Pipeline / build-and-deploy (push) Has been cancelled
P0-1: incident_service.py — 刪除 classify_alert_early 死碼 L131-132
P0-2: cron_backup_restore_test.sh — date +%s%3N→+%s,修正毫秒時間戳
P1-2: gitea_webhook.py — fingerprint 移除 sha_short,收斂同 branch 失敗
heartbeat: 還原原始空格對齊格式(統帥要求原本怎樣就怎樣)
P1-1(積木化)/P1-3(TYPE-4)/P2-1(timeZone)/P2-2(IP)/P2-3(WS重連) 待後續處理
2026-04-12 ogt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 16:10:46 +08:00 |
|
OG T
|
d72c7d5ac4
|
fix(P0): classify_alert_early 參數名稱修正 _labels→labels
CD Pipeline / build-and-deploy (push) Has been cancelled
webhooks.py 呼叫傳 labels= 但函數定義用 _labels,導致所有
Alertmanager webhook 500,告警鏈路完全中斷。
2026-04-12 ogt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 16:02:25 +08:00 |
|
OG T
|
36f285fb85
|
fix(heartbeat): 移除空格對齊,改用直接排版避免 Telegram 跑版
Telegram HTML 模式不渲染等寬字型,空格對齊無效。
改成不對齊但清晰的格式,每行直接顯示 label + value。
2026-04-12 ogt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 16:01:47 +08:00 |
|
OG T
|
0c2892ac19
|
feat(c3): ADR-073-C C3 — WebSocket 飛輪即時推送
後端:
- stats.py 新增 @router.websocket('/flywheel/ws')
- 每 10 秒推送 flywheel_summary JSON
前端 FlywheelKPICard:
- WebSocket 優先,WS 斷線自動降級到 30s HTTP 輪詢
- onopen 時停止 HTTP polling,onclose 時恢復
2026-04-12 ogt (ADR-073-C C3)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 15:40:20 +08:00 |
|
OG T
|
3489e05c84
|
feat(m3): ADR-074 M3 — Gitea CI/CD 管線失敗 webhook
新增 workflow_run 事件處理:
- GiteaWorkflowRun Pydantic model
- handle_workflow_run() — status/conclusion=failure → TYPE-1 Incident
- 透過 get_incident_service().create_incident_from_signal() 建立告警
- 純通知路徑,不觸發自動修復
2026-04-12 ogt (ADR-074 M3)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 15:35:25 +08:00 |
|
OG T
|
00a31abb85
|
feat(heartbeat): ADR-073 P2 心跳整合重構 — HeartbeatReportService + RedisLock
CD Pipeline / build-and-deploy (push) Has been cancelled
- 新增 HeartbeatReportService:11 個並行探針(Ollama/Nemotron/Gemini/Claude/MCP×4/ArgoCD/Velero)
- 重寫 send_heartbeat():RedisLock 防重發 + 統一發送 SRE_GROUP_CHAT_ID
- 簡化 _heartbeat_loop():移除散落的 silence 多次發送
- config.py:新增 OLLAMA_REQUIRED_MODELS 欄位
- 03-secrets.example.yaml:補 SRE_GROUP_CHAT_ID 確保 CD Inject 不遺漏
2026-04-12 ogt (ADR-073 Phase 2-3/4)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 15:35:13 +08:00 |
|
OG T
|
16d682346a
|
feat(adr-074): M1 飛輪健康度 Exporter + M2 主機網路監控
CD Pipeline / build-and-deploy (push) Has been cancelled
ADR-074 M1:
- FlywheelStatsService: 計算6項飛輪指標(Playbook數/成功率/KM向量化/alertname NULL/卡住數)
- GET /api/v1/stats/flywheel — 六節點即時狀態(C1 前端用)
- GET /api/v1/stats/summary — KPI 面板數據(C1 前端用)
- GET /api/v1/stats/flywheel/metrics — Prometheus text format
- flywheel-alerts.yaml: 5條告警規則(FlywheelPlaybookZero/ExecutionSuccessLow/KMVectorizationLow/AlertnameNullHigh/IncidentsStuck)
- prometheus.yml: awoooi-flywheel scrape job(5分鐘間隔)
ADR-074 M2:
- prometheus.yml: host-connectivity Blackbox TCP probe(110:22/188:22/120:6443/121:6443)
- flywheel-alerts.yaml: HostNetworkPartition 告警規則
597 unit tests passed
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 15:31:01 +08:00 |
|
OG T
|
1074936e54
|
fix(classify): backup/heartbeat severity=warning/critical 告警恢復告警卡片格式
CD Pipeline / build-and-deploy (push) Failing after 2m38s
根因:classify_alert_early() backup 規則無 severity 條件,導致
VeleroBackupFailed / HostBackupFailed (warning/critical) 被分為 TYPE-1
(純資訊無按鈕),告警卡片格式遺失。
修復:
- backup/heartbeat 關鍵字只在 severity=info/none 才命中 TYPE-1
- severity=warning/critical 的 backup 告警走正確 prefix 規則
(Velero→kubernetes TYPE-3, HostBackup→infrastructure TYPE-3)
- Watchdog (severity=none) 由 severity 規則先命中,維持 TYPE-1
- 補強測試:25 cases,含 VeleroBackupFailed critical → kubernetes TYPE-3
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 15:24:00 +08:00 |
|
OG T
|
0d239838b4
|
fix(cr): Code Review P2 — 測試覆蓋 + CronJob 腳本重構
CD Pipeline / build-and-deploy (push) Has been cancelled
P2-1: CronJob inline Python 抽成 scripts/cron_km_vectorize.py
Dockerfile 加入 COPY scripts/,CronJob YAML 改用腳本路徑
P2-2: 新增 test_classify_alert_early.py — 23 tests 覆蓋 7 條分類規則
含邊界情況:VeleroBackupFailed(backup優先於k8s)、優先順序驗證
595 unit tests passed
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 15:14:44 +08:00 |
|
OG T
|
c09521a1c6
|
fix(cr): Code Review P0/P1 全修補 — 積木化+SSH路由+安全守衛順序
CD Pipeline / build-and-deploy (push) Failing after 2m30s
P0-1: classify_alert_early 移至 incident_service (Service層),webhooks.py import 修正
P0-2: _ssh_execute() 改用 self._ssh,移除冗餘 SSHProvider() 實例化
P1-1: infrastructure SSH routing 移至 kubectl safety guard 之前,docker指令不再被攔截
P1-2: alert_rule_engine 新增 get_risk_for_alertname() public API
P1-3: classify_notification() docstring 修正 ORM→Pydantic
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 14:51:12 +08:00 |
|
OG T
|
f2fc4712ad
|
feat(flywheel): Phase 4 — KM conversion hook + daily vectorize cron
CD Pipeline / build-and-deploy (push) Has been cancelled
ADR-073 Phase 4-2: incident_service.resolve_incident() KM conversion hook
- resolve 時 fire-and-forget KMConversionService.convert(incident)
- 已解決的 Incident 自動轉換為結構化 KM 條目,完成飛輪「學習固化」節點
- KMConversionService (Phase 4-1) 已存在 (km_conversion_service.py, 336 lines)
ADR-073 Phase 4-3: 15-cronjob-km-vectorize.yaml
- 每日 03:00 台北時間呼叫 /api/v1/knowledge/embed-all
- 自動向量化當日新增 KM 條目,確保 RAG 查詢不遺漏新知識
- 加入 kustomization.yaml resources
Phase 4-4: handle_callback log_manual_fix 已存在 (telegram_gateway.py:2468)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 14:40:33 +08:00 |
|
OG T
|
dbc77c5e62
|
feat(flywheel): Phase 3 — decision_manager Tier 3 七大修復 (首席架構師授權)
CD Pipeline / build-and-deploy (push) Has been cancelled
ADR-073 Phase 3 全部完成:
3-1: TYPE-1 triage guard
- get_or_create_decision() 入口: notification_type=TYPE-1 直接 bypass LLM 分析
- classify_notification() 優先讀 incident.notification_type (早期分診結果)
- ConfigurationDrift/KubeConfigDrift 補入 TYPE-4D 匹配清單
3-2: infrastructure → SSH MCP routing
- _auto_execute() 中 alert_category=infrastructure + 非 kubectl action → _ssh_execute()
- _ssh_execute(): docker_restart / service_restart tool 路由
- 取 instance label 對應 SSH_MCP_ALLOWED_HOSTS 白名單主機
3-3: send_info_notification() TYPE-1 已存在,classify_notification 修復確保正確呼叫
3-4: Dynamic button builder 已存在 _build_inline_keyboard + _CATEGORY_BUTTONS
3-5: action | parse fix
- _auto_execute() 開頭: action 含 | 時取第一段 (LLM 有時輸出 "kubectl X | kubectl get")
3-6: risk_level YAML priority override LLM
- dual_engine_analyze() LLM 結果返回後,用 alert_rules.yaml 對應 rule.risk 覆蓋
3-7: send_drift_card() TYPE-4D 已存在,classify_notification 修復確保正確觸發
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 14:39:19 +08:00 |
|
OG T
|
5b956a9a47
|
feat(flywheel): Phase 2-3/2-5 — auto_repair outcome 寫入 + 134 筆 alertname 回填腳本
CD Pipeline / build-and-deploy (push) Has been cancelled
ADR-073 Phase 2-3: _try_auto_repair_background() 修復執行後寫入 Incident.outcome
- effectiveness_score: 5(成功) / 2(失敗)
- human_feedback: auto_repair:<playbook_id>:success|failed
- should_remember: True(成功) → KMConversionService 飛輪入口
- 讓 KMConversionService 可依 outcome 判斷 EXECUTION_SUCCESS
ADR-073 Phase 2-5: scripts/backfill_alertname.py
- UPDATE incidents SET alertname = COALESCE(signals->0->>'alertname', signals->0->>'alert_name')
- 已在 Pod 執行:134 筆 NULL → 0 筆 (2026-04-12 ogt)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-12 14:33:11 +08:00 |
|