Commit Graph

514 Commits

Author SHA1 Message Date
OG T
a1691c41d5 fix(flywheel-stats): fix three field errors in FlywheelStatsService
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 30s
- KnowledgeEntryRecord.vectorized → embedding.is_(None) (the field does not exist)
- IncidentRecord.id → IncidentRecord.incident_id (primary key name)
- After the fix, /api/v1/stats/flywheel nodes no longer all return unknown

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 17:27:35 +08:00
OG T
99b489ca63 fix(flywheel): fix the remaining P0/P1 defects
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- CRITICAL-1: TYPE-1 path approval_id=str(alert_id) → uuid.uuid4(),
  so UUID(approval_id) no longer raises ValueError and crashes every Heartbeat/Info alert
- CRITICAL-2: store the asyncio.create_task() result in _exec_task and add a done_callback,
  so the GC cannot collect the task mid-execution
- FORMAT: _push_to_telegram_background gains notification_type + diff_summary parameters;
  TYPE-4D → send_drift_card(), everything else → send_approval_card() (fixes ConfigDrift showing the wrong card)
- Pass notification_type at both Alertmanager call sites

Final wrap-up of the ADR-073 four-breakpoint fix
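
The CRITICAL-1 failure mode can be reproduced with the standard library alone. This is a minimal sketch, not the project's code: `parses_as_uuid`, `safe_approval_id`, and the sample alert id are hypothetical names chosen for illustration.

```python
import uuid


def parses_as_uuid(value: str) -> bool:
    """Return True when uuid.UUID() accepts the string, False on ValueError."""
    try:
        uuid.UUID(value)
        return True
    except ValueError:
        return False


def safe_approval_id() -> str:
    """Generate an id that uuid.UUID() can always parse back."""
    return str(uuid.uuid4())


# Hypothetical alert id: an arbitrary string is not a valid UUID.
alert_id = "HostDown-42"
print(parses_as_uuid(str(alert_id)))        # the old approach fails to parse
print(parses_as_uuid(safe_approval_id()))   # a uuid4-based id parses
```

Generating the id with uuid.uuid4() removes the dependency on whatever shape the upstream alert id happens to have.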

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 17:14:57 +08:00
OG T
f0e14136ca fix(flywheel): fix the flywheel's four core breakpoints so the full flow is actually wired end to end
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
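1. incident_service.py: save_to_episodic_memory() now also writes alertname/notification_type/alert_category
   → previously these 3 columns were always NULL in the DB, the LLM had no alertname, and every Playbook match failed

2. telegram_gateway.py: call execute_approved_action() after a Telegram approval
   → previously sign_approval() only changed DB state; of 380 approvals, 0 actually ran a kubectl command

3. approval_execution.py: call resolve_incident() after successful execution
   webhooks.py: call resolve_incident() after a successful auto-repair
   → previously Incidents stayed in INVESTIGATING forever, KM conversion never fired, Playbook=0

4. webhooks.py: short-circuit TYPE-1 alerts so they skip the LLM
   → previously Heartbeat/Backup/Info alerts still burned LLM tokens and produced junk repair suggestions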

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 17:01:10 +08:00
OG T
93f9522d5a fix(heartbeat): align sends to the hour so replicas do not each send + KM vectorization now queries the embedding column
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 14m10s
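- _heartbeat_loop: sleep until the next whole-hour multiple before starting the loop,
  so 3 replicas with different start times do not produce several heartbeats in a short window
- heartbeat_report_service: km_vectorized now queries KnowledgeEntryRecord.embedding IS NOT NULL;
  it previously queried IncidentRecord.vectorized by mistake, which showed 0/714 (0%)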
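
The alignment step amounts to one modulo operation on the current Unix time. The sketch below is an assumption about how such a delay could be computed; `seconds_until_next_boundary` and its `interval_s` parameter are hypothetical names, not taken from the repository.

```python
def seconds_until_next_boundary(now_ts: float, interval_s: int = 3600) -> float:
    """Seconds to sleep so the next wake-up lands on a multiple of interval_s.

    A caller that is already exactly on a boundary waits one full interval.
    """
    return interval_s - (now_ts % interval_s)


# Replicas that start at different moments all compute a delay
# that ends on the same boundary.
for start in (3601.0, 5400.0, 7199.0):
    print(start + seconds_until_next_boundary(start))
```

Because every replica wakes at the same boundary, the duplicate-suppression lock only has to cover one short window per cycle.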

2026-04-12 ogt (ADR-073 heartbeat fix)
2026-04-12 16:33:15 +08:00
OG T
effd78807e fix(heartbeat): blocking_timeout 5→0 so replicas do not queue for the lock and send duplicates
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 14m0s
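The 3 replicas each run the loop; with blocking_timeout=5.0, once the lock is released
the other replicas acquire it in turn, so each heartbeat goes out up to 3 times.
Changed to blocking_timeout=0: a replica that cannot get the lock skips immediately, so only one heartbeat is sent per cycle.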
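
The difference between waiting for a lock and skipping when it is taken can be illustrated without Redis. The sketch below uses threading.Lock as a stand-in for the distributed lock (an assumption made to keep the example self-contained); `try_send_heartbeat` and the replica names are hypothetical.

```python
import threading


def try_send_heartbeat(replica: str, lock: threading.Lock, sent: list) -> bool:
    """Send only if the lock is free right now; never wait for it."""
    if not lock.acquire(blocking=False):
        return False          # another replica owns this cycle: skip
    sent.append(replica)      # stand-in for the real send
    return True               # the lock stays held until the cycle ends


cycle_lock = threading.Lock()
sent: list = []
results = [
    try_send_heartbeat(name, cycle_lock, sent)
    for name in ("replica-0", "replica-1", "replica-2")
]
cycle_lock.release()          # end of cycle
print(results, sent)
```

With a blocking acquire and a timeout longer than the send, the second and third callers would eventually obtain the lock and send again, which matches the duplicate behaviour described above.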

2026-04-12 ogt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 16:13:41 +08:00
OG T
a28625f088 fix(cr): all chief architect CR P0/P1/P2 fixes
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
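P0-1: incident_service.py: delete the classify_alert_early dead code at L131-132
P0-2: cron_backup_restore_test.sh: date +%s%3N→+%s, fixing the millisecond timestamp
P1-2: gitea_webhook.py: remove sha_short from the fingerprint so failures on the same branch converge
heartbeat: restore the original space-aligned format (the commander asked to keep it exactly as it was)

P1-1 (modularization)/P1-3 (TYPE-4)/P2-1 (timeZone)/P2-2 (IP)/P2-3 (WS reconnect) deferred to follow-up work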

2026-04-12 ogt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 16:10:46 +08:00
OG T
d72c7d5ac4 fix(P0): fix the classify_alert_early parameter name _labels→labels
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
webhooks.py calls with labels= but the function definition uses _labels, so every
Alertmanager webhook returned 500 and the alert pipeline was completely down.
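
The failure is ordinary Python keyword-argument behaviour and can be shown in isolation. The function body below is a placeholder invented for the demonstration; only the parameter-name mismatch mirrors the commit.

```python
def classify_alert_early(_labels: dict) -> str:
    """Placeholder body (hypothetical rule), with the pre-fix parameter name."""
    return "TYPE-1" if _labels.get("severity") == "info" else "TYPE-3"


def call_with_wrong_keyword() -> str:
    """Call the function the way the webhook did and report what happens."""
    try:
        return classify_alert_early(labels={"severity": "info"})
    except TypeError as exc:
        return f"TypeError: {exc}"


print(call_with_wrong_keyword())
```

An unhandled TypeError inside a request handler typically surfaces as an HTTP 500, which is consistent with the symptom described in the commit.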

2026-04-12 ogt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 16:02:25 +08:00
OG T
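36f285fb85 fix(heartbeat): drop space alignment and use a plain layout to avoid broken formatting in Telegram
Telegram HTML mode does not render in a monospace font, so space alignment has no effect.
Switched to an unaligned but clear format in which each line shows label + value directly.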

2026-04-12 ogt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 16:01:47 +08:00
OG T
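0c2892ac19 feat(c3): ADR-073-C C3: real-time flywheel push over WebSocket
Backend:
- stats.py adds @router.websocket('/flywheel/ws')
- pushes the flywheel_summary JSON every 10 seconds

Frontend FlywheelKPICard:
- WebSocket first; falls back automatically to 30s HTTP polling when the WS disconnects
- stops HTTP polling on onopen and resumes it on onclose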

2026-04-12 ogt (ADR-073-C C3)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 15:40:20 +08:00
OG T
3489e05c84 feat(m3): ADR-074 M3: webhook for Gitea CI/CD pipeline failures
Adds workflow_run event handling:
- GiteaWorkflowRun Pydantic model
- handle_workflow_run(): status/conclusion=failure → TYPE-1 Incident
- creates the alert through get_incident_service().create_incident_from_signal()
- notification-only path; does not trigger auto-repair

2026-04-12 ogt (ADR-074 M3)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 15:35:25 +08:00
OG T
00a31abb85 feat(heartbeat): ADR-073 P2 heartbeat consolidation refactor: HeartbeatReportService + RedisLock
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- Add HeartbeatReportService: 11 concurrent probes (Ollama/Nemotron/Gemini/Claude/MCP×4/ArgoCD/Velero)
- Rewrite send_heartbeat(): RedisLock prevents duplicate sends + all sends go to SRE_GROUP_CHAT_ID
- Simplify _heartbeat_loop(): remove the scattered, repeated silence sends
- config.py: add the OLLAMA_REQUIRED_MODELS field
- 03-secrets.example.yaml: add SRE_GROUP_CHAT_ID so CD Inject does not miss it

2026-04-12 ogt (ADR-073 Phase 2-3/4)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 15:35:13 +08:00
OG T
16d682346a feat(adr-074): M1 flywheel health Exporter + M2 host network monitoring
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
ADR-074 M1:
  - FlywheelStatsService: computes 6 flywheel metrics (Playbook count/success rate/KM vectorization/alertname NULL/stuck count)
  - GET /api/v1/stats/flywheel: real-time status of the six nodes (used by the C1 frontend)
  - GET /api/v1/stats/summary: KPI panel data (used by the C1 frontend)
  - GET /api/v1/stats/flywheel/metrics: Prometheus text format
  - flywheel-alerts.yaml: 5 alert rules (FlywheelPlaybookZero/ExecutionSuccessLow/KMVectorizationLow/AlertnameNullHigh/IncidentsStuck)
  - prometheus.yml: awoooi-flywheel scrape job (5-minute interval)

ADR-074 M2:
  - prometheus.yml: host-connectivity Blackbox TCP probe (110:22/188:22/120:6443/121:6443)
  - flywheel-alerts.yaml: HostNetworkPartition alert rule

597 unit tests passed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 15:31:01 +08:00
OG T
1074936e54 fix(classify): restore the alert card format for backup/heartbeat alerts with severity=warning/critical
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 2m38s
Root cause: the classify_alert_early() backup rule had no severity condition, so
VeleroBackupFailed / HostBackupFailed (warning/critical) were classified as TYPE-1
(info only, no buttons) and lost the alert card format.

Fix:
- backup/heartbeat keywords match TYPE-1 only when severity=info/none
- backup alerts with severity=warning/critical go through the correct prefix rules
  (Velero→kubernetes TYPE-3, HostBackup→infrastructure TYPE-3)
- Watchdog (severity=none) is matched first by the severity rule and stays TYPE-1
- Stronger tests: 25 cases, including VeleroBackupFailed critical → kubernetes TYPE-3

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 15:24:00 +08:00
OG T
c09521a1c6 fix(cr): all Code Review P0/P1 fixes: modularization + SSH routing + safety guard ordering
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 2m30s
P0-1: classify_alert_early moved to incident_service (Service layer); webhooks.py import fixed
P0-2: _ssh_execute() now uses self._ssh; the redundant SSHProvider() instantiation is removed
P1-1: infrastructure SSH routing moved ahead of the kubectl safety guard, so docker commands are no longer blocked
P1-2: alert_rule_engine adds the get_risk_for_alertname() public API
P1-3: classify_notification() docstring corrected ORM→Pydantic

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 14:51:12 +08:00
OG T
f2fc4712ad feat(flywheel): Phase 4 — KM conversion hook + daily vectorize cron
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
ADR-073 Phase 4-2: incident_service.resolve_incident() KM conversion hook
- on resolve, fire-and-forget KMConversionService.convert(incident)
- resolved Incidents are converted automatically into structured KM entries, completing the flywheel's "learning consolidation" node
- KMConversionService (Phase 4-1) already exists (km_conversion_service.py, 336 lines)

ADR-073 Phase 4-3: 15-cronjob-km-vectorize.yaml
- calls /api/v1/knowledge/embed-all daily at 03:00 Taipei time
- automatically vectorizes the KM entries added that day so RAG queries do not miss new knowledge
- added to kustomization.yaml resources

Phase 4-4: handle_callback log_manual_fix already exists (telegram_gateway.py:2468)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 14:40:33 +08:00
OG T
dbc77c5e62 feat(flywheel): Phase 3: seven decision_manager Tier 3 fixes (authorized by the chief architect)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
ADR-073 Phase 3 fully complete:

3-1: TYPE-1 triage guard
- get_or_create_decision() entry: notification_type=TYPE-1 bypasses LLM analysis directly
- classify_notification() reads incident.notification_type first (the early triage result)
- ConfigurationDrift/KubeConfigDrift added to the TYPE-4D match list

3-2: infrastructure → SSH MCP routing
- in _auto_execute(), alert_category=infrastructure + non-kubectl action → _ssh_execute()
- _ssh_execute(): docker_restart / service_restart tool routing
- maps the instance label to a host in the SSH_MCP_ALLOWED_HOSTS whitelist

3-3: send_info_notification() TYPE-1 already exists; the classify_notification fix ensures it is called correctly

3-4: Dynamic button builder already exists: _build_inline_keyboard + _CATEGORY_BUTTONS

3-5: action | parse fix
- at the top of _auto_execute(): when action contains |, take the first segment (the LLM sometimes outputs "kubectl X | kubectl get")
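
Item 3-5 describes keeping only the text before the first pipe. A minimal sketch of that step follows; `first_command` is a hypothetical helper name.

```python
def first_command(action: str) -> str:
    """Keep only the text before the first '|' and trim whitespace."""
    return action.split("|", 1)[0].strip()


print(first_command("kubectl X | kubectl get"))
print(first_command("kubectl get pods"))
```

Actions without a pipe pass through unchanged, so the helper can be applied unconditionally.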

3-6: risk_level YAML priority override LLM
- after the dual_engine_analyze() LLM result returns, override it with the matching rule.risk from alert_rules.yaml

3-7: send_drift_card() TYPE-4D already exists; the classify_notification fix ensures it is triggered correctly

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 14:39:19 +08:00
OG T
5b956a9a47 feat(flywheel): Phase 2-3/2-5: write the auto_repair outcome + backfill script for 134 alertname rows
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
ADR-073 Phase 2-3: _try_auto_repair_background() writes Incident.outcome after the repair runs
- effectiveness_score: 5 (success) / 2 (failure)
- human_feedback: auto_repair:<playbook_id>:success|failed
- should_remember: True (success) → KMConversionService flywheel entry point
- lets KMConversionService decide EXECUTION_SUCCESS from the outcome

ADR-073 Phase 2-5: scripts/backfill_alertname.py
- UPDATE incidents SET alertname = COALESCE(signals->0->>'alertname', signals->0->>'alert_name')
- Already run in the Pod: 134 NULL rows → 0 rows (2026-04-12 ogt)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 14:33:11 +08:00
OG T
d4b8b1588b feat(flywheel): Phase 2-2/2-4: classify_alert_early + write alertname/notification_type/alert_category
ADR-073 Phase 2-2: early triage, deciding alert_category + notification_type before LLM analysis
- webhooks.py: add classify_alert_early() with 6 rules covering config_drift/info/backup/infra/k8s/db/general
- webhooks.py: alertmanager_webhook calls classify_alert_early() and passes the result to both create_incident_for_approval() call sites
- incident_service.py: create_incident_for_approval() gains notification_type/alert_category parameters, written to the Incident model
- incident_repository.py: _incident_to_record_data() now serializes alertname/notification_type/alert_category
- db/models.py: the IncidentRecord ORM adds three mapped_column fields: alertname/notification_type/alert_category

Prevents alerts such as HostBackupFailed from being misrouted to the K8s executor (ADR-073 Phase 2-4 completed at the same time)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 14:33:11 +08:00
OG T
7c4b36c2cd fix(flywheel): Phase 1: deploy 8be87b0 + debounce 30min + alertname NULL fix
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 2m29s
ADR-073 Phase 1, three fixes:

1. k8s/kustomization.yaml: newTag a86ecf38be87b0
   - unblocks _collect_mcp_context + auto_approve + DESTRUCTIVE_PATTERNS
   - this is the key to unblocking the flywheel

2. webhooks.py: DEBOUNCE_WINDOW_MINUTES 5 → 30
   - stops the same problem from recreating an Incident every 5 minutes; uses a 30-minute convergence window instead

3. incident_repository.py: add an alertname key to the signals JSONB
   - signal.model_dump() only has alert_name, while the DB query uses signals->0->>'alertname'
   - adds an alertname alias, fixing the root cause of 132 rows with incidents.alertname = NULL

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 13:41:22 +08:00
OG T
184b37a8b1 refactor(decision_manager): I2 inject MCP Providers via DI + fix config list type bug
- DecisionManager.__init__ injects SSHProvider/K8sProvider; the in-function import + instantiation is removed
- config.get_tg_user_whitelist() accepts list input (monkeypatch/passed directly), fixing an AttributeError
- LOGBOOK updated (test fix 6e0ee8b)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 13:04:46 +08:00
OG T
8be87b0f32 fix(review): chief architect Code Review: c439277 Tier 3 red-zone fixes
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 8m39s
Critical:
- C1: fixed the Python ternary precedence bug on the container variable in decision_manager _collect_mcp_context
  Before: `A or B or C[0] if list else ""` (the ternary governs the whole expression)
  After: `A or B or (C[0] if list else "")` (explicit parentheses)
- C2: all MCP calls wrapped in asyncio.wait_for timeout=5s so they cannot block the main decision path
  also adds an unknown host warning log (C4)
- C3+M1: _DESTRUCTIVE_PATTERNS completed and moved to a module-level constant
  Added: delete pods (plural)/kubectl drain/kubectl cordon/kubectl rollout undo/
        docker rm/docker stop/docker kill/rm -rf/"replicas": 0 (JSON patch)
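
The C1 precedence issue is easy to verify in isolation: a conditional expression binds more loosely than `or`, so without parentheses the condition guards the entire chain. The function and argument names below are hypothetical stand-ins for the variables in the commit.

```python
def pick_buggy(a: str, b: str, items: list) -> str:
    # Parsed as: (a or b or items[0]) if items else ""
    return a or b or items[0] if items else ""


def pick_fixed(a: str, b: str, items: list) -> str:
    # The conditional now only guards the items[0] lookup.
    return a or b or (items[0] if items else "")


# With an empty list, the buggy form discards a perfectly good first value.
print(repr(pick_buggy("web", "", [])))
print(repr(pick_fixed("web", "", [])))
```

Both forms agree whenever the list is non-empty, which is why the bug only shows up on alerts that carry no list entries.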

Important:
- I1: webhooks.py IP exclusion now uses is_internal_ip(), covering all of RFC-1918 (10.x/172.16-31.x/192.168.x)
- I4: add test_destructive_patterns.py; all 25 tests pass
  Covers: constant exists, blocking, false-positive regression, critical always blocked
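
An RFC-1918 membership check of the kind I1 describes can be written with the standard ipaddress module. The commit names is_internal_ip() but does not show its body, so the implementation below is an assumption, not the project's code.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
_RFC1918 = tuple(
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
)


def is_internal_ip(value: str) -> bool:
    """True only for strings that parse as an address inside an RFC 1918 range."""
    try:
        addr = ipaddress.ip_address(value)
    except ValueError:
        return False          # not an IP at all, e.g. a deployment name
    return any(addr in net for net in _RFC1918)


for candidate in ("192.168.0.125", "172.31.0.1", "172.32.0.1", "web-api"):
    print(candidate, is_internal_ip(candidate))
```

Listing the networks explicitly avoids the broader definition of `is_private` in the standard library, which also covers loopback and link-local addresses.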

🔴 Tier 3 red zone: pushed after passing the chief architect's Code Review

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 22:05:52 +08:00
OG T
c439277fc3 feat(aiops): ADR-070 full-automation direction: three fixes
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
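1. auto_approve.py: allow high risk to auto-execute (low/medium/high all open)
   - min_confidence 0.65→0.50 (lower confidence threshold)
   - add DESTRUCTIVE_PATTERNS to block truly dangerous commands
     (scale=0, delete deployment/pvc/namespace, drop table)
   - core rule: critical + destructive operation → human; everything else → fully automatic

2. decision_manager.py: add _collect_mcp_context()
   - collects the real environment state (SSH/K8s MCP) before LLM analysis
   - Host/Docker alerts → ssh_get_container_status + ssh_get_top_processes
   - K8s alerts → k8s_get_events
   - injected into diagnosis_context under the "Current environment state (MCP live query)" section

3. webhooks.py: fix target_resource extraction
   - add extraction from the name/container/job labels
   - DockerContainerUnhealthy no longer gets target=alertname
   - IP addresses are excluded automatically (values starting with 192.x are not used as target)

🔴 Tier 3 red zone: requires chief architect approval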
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 21:39:52 +08:00
OG T
d77b2add73 fix(review): chief architect Code Review fixes: I1 get_incident_type logic fix + added tests
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 8m13s
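Code Review found 2 Critical + 2 Important issues:

Critical:
- rule.id means "rule identifier", a different namespace from incident_type, so the two must not be mixed
  Removed the rule_id fallback path; a YAML match without incident_type falls through to the static dict
- the get_incident_type() critical path had no test coverage
  Added test_get_incident_type.py: 11 tests across 4 categories (static/YAML first/YAML error/custom), all passing

Important:
- ALERTNAME_TO_TYPE deferred import moved to module top level (no circular-import risk)
- the alert_types.py TODO was stale → updated to the correct description after the I1 integration

Tech debt note: the NetworkPolicy ArgoCD egress ClusterIP 10.43.16.201/32 must be updated after ArgoCD is reinstalled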

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 21:33:19 +08:00
OG T
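b2dfcf9b0d fix(telegram): a safety guard block now sends a manual review card instead of a failure message
Problem: when the AI could not confirm the deployment name, every alert sent a junk message reading
"Auto-repair failed kubectl scale deployment unknown"

Fix:
- safety guard block → token.state returns to READY (not ERROR)
- call _push_decision_to_telegram instead, sending a TYPE-4 manual review card
- mcp_all_failed=True makes classify_notification choose TYPE-4
- the path where K8s cannot find the target is handled the same way

Result: the commander sees a "review card requiring manual intervention" rather than a "repair failed" error message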

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 21:33:19 +08:00
OG T
615822dcf3 feat(I1): ADR-064 Rule Engine integration: infer incident_type dynamically
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 11m28s
- alert_rule_engine.py: add get_incident_type(alertname)
  looks up incident_type/rule_id from the YAML rule's match.alertname first
  Fallback: ALERTNAME_TO_TYPE static dict → "custom"
- webhooks.py: alert_type now uses get_incident_type(alertname)
  in place of the static ALERTNAME_TO_TYPE.get() lookup
- alertname coverage from the 19 YAML rules takes effect automatically (no manual dict edits)
- a new alertname triggers generic_fallback → auto_generate_rule(), after which it is added to the YAML automatically

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 21:21:41 +08:00
OG T
1ede9f933f refactor(M3): extract alertname_to_type into src/constants/alert_types.py
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- add src/constants/__init__.py + alert_types.py
- ALERTNAME_TO_TYPE constant (56 entries) moved from the inline dict in webhooks.py into the module
- webhooks.py now uses ALERTNAME_TO_TYPE.get(alertname, "custom")
- TODO I1: integrate ADR-064 Rule Engine dynamic inference next Sprint (this is an intermediate state)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 21:19:52 +08:00
OG T
f2c18c4e63 feat(D1): centralize models.json: ADR-067 removes hardcoded models across five Ollama applications
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 12m56s
- models.json v1.3.0: providers.ollama.models adds 9 purpose keys
  (drift_summary/drift_intent/log_anomaly/nemoclaw/playbook_draft/
   code_review/embedding/rag_generate/image_analysis)
- drift_narrator_service: NARRATOR_MODEL → get_model("ollama","drift_summary")
- drift_interpreter: MODEL → get_model("ollama","drift_intent")
- log_summary_service: SUMMARY_MODEL → get_model("ollama","log_anomaly")
- local_code_review_service: _MODEL_OLLAMA → get_model("ollama","code_review")
- image_analysis_service: _MODEL → get_model("ollama","image_analysis")
- decision_manager: both the nemoclaw and playbook_draft call sites → get_model()
- embedding_service: get_embedding_service() factory → get_model("ollama","embedding")
- knowledge_service: OllamaEmbeddingService(model=...) → get_model()

All model names are now managed in models.json, so changing a model means editing a single file.
LOGBOOK updated: D1 done + B2 confirmed done

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 20:45:53 +08:00
OG T
82e1c05df8 fix(review): Code Review C1/C2/I2/M2 fixes
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
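C1 drift_interpreter: hardcoded 192.168.0.111 → settings.OLLAMA_URL
  violated the feedback_frontend_internal_ip_ban hard rule (hardcoded internal IPs are also banned in the backend service layer)

C2 km_conversion_service: BUG-004 now also syncs the vectorized field in Redis Working Memory
  the original fix only updated the DB; vectorized in the Redis incident:{id} JSON was not synced
  → audits reading Redis still showed False, so the fly-wheel closed-loop metric was still inaccurate
  Fix: after the DB update, GET → JSON patch vectorized=True → SET (keeping the original TTL)

I2 decision_manager: _ALERTNAME_KEYWORDS HostHighDiskUsage→HostOutOfDiskSpace
  + add DockerContainerExited
  + add a debug log on the fallback path

M2 decision_manager: import json as _json moved from inside the for loop to the top of the method

docs: ADR-072 adds the Code Review findings and tech debt notes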

2026-04-11 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 20:36:59 +08:00
OG T
e447f97616 fix(telegram): wire up classify_notification + fix HostBackupFailed sending stray buttons
Three problems fixed together:

1. classify_notification() dead code wired up
   - _push_decision_to_telegram() now calls classify_notification() first
   - TYPE-1 (info only) → send_info_notification(), no buttons
   - TYPE-4D (Config Drift) → send_drift_card()
   - remaining TYPE-2/3/4 → send_approval_card() (existing buttons)
   - decision_state + auto_executed are injected into proposal_data by the caller

2. alert_rules.yaml adds the host_backup_failed rule
   - HostBackupFailed / VeleroBackupFailed / VeleroBackupNotRun → NO_ACTION
   - no longer goes through generic_fallback → no longer generates kubectl rollout restart deployment/backup

3. _verify_k8s_deployment_exists() no longer conservatively lets host-level alerts through
   - alerts prefixed Host*/Docker*/Backup*/Velero*/SSH* → return False when K8s MCP is unavailable
   - _auto_execute() returns early without executing when it gets NO_ACTION or an empty kubectl_command

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 20:35:48 +08:00
OG T
f34fe19134 fix(aiops): ADR-072 BUG-008 alertname_to_type 9→56 entries
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Expanded from a 9-entry static map to cover all 42 alertnames in alerts-unified.yml:
- host_alerts: HostDown/HostHighCpuLoad/HostOutOfMemory/HostOutOfDiskSpace/HostBackupFailed
- k8s: K3sNodeNotReady/KubePodCrashLooping/KubeDeploymentReplicasMismatch/Velero* (8 entries)
- database: PostgreSQL*/Redis* (10 entries)
- service_alerts: *Down (8 entries)
- external: *Down/SSLExpiring (5 entries)
- alert_chain: AlertChainBroken*/NoAlerts/Unhealthy (4 entries)
- docker_health: DockerContainerUnhealthy/Exited (2 entries)
- auto_repair: AutoRepairLowSuccessRate/PermanentFixRequired (2 entries)
- legacy compatibility: HighCPUUsage/HighMemoryUsage/DiskSpaceLow/SSLCertExpiringSoon/TargetDown

Expected effect: 69/112 incidents "custom" → far fewer; HostHighCpuLoad → "host_cpu"

BUG-007 confirmed as not needing a fix: all 42 rules in alerts-unified.yml already have a severity label

2026-04-11 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 20:29:34 +08:00
OG T
5aa0244c9a fix(aiops): ADR-072 P1 bug fixes: BUG-004/005/006
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
BUG-004 KM vectorization 108/112 = False:
  km_conversion_service: after the KM entry is created (embedding already triggered in the background),
  also write incidents.vectorized = True, so the flywheel closed-loop (ADR-068) learning metric reports correctly

BUG-005 15 ready decisions with no reviewer:
  decision_manager: add resend_stale_ready_tokens(),
  which scans Redis decision:* keys for tokens with state=ready and an expired dedup_key,
  and re-pushes the Telegram review card
  main.py: lifespan startup schedules resend_stale_ready_tokens() (non-blocking via asyncio.create_task)

BUG-006 outcome/verification_result all null:
  _push_auto_repair_result: before the Telegram push, write
  incidents.outcome + incidents.verification_result to the DB

2026-04-11 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 20:24:41 +08:00
OG T
2185e1755c fix(aiops): ADR-072 P0 bug fixes: BUG-001/002/003
Some checks are pending
CD Pipeline / build-and-deploy (push) Has started running
BUG-001 drift_interpreter: nvidia_provider was refactored to return a NvidiaProviderResult object (not a 4-tuple)
  → call qwen2.5:7b-instruct directly through Ollama httpx, bypassing nvidia_provider
  → removes the permanent "too many values to unpack" failure on every K8s config drift alert

BUG-002 deployment_name="unknown": host-level alerts (HostHighCpuLoad etc.) have no component/job/pod label
  → _auto_execute() adds _resolve_target_from_k8s() as a fallback
  → K8s MCP kubectl get pods looks up the affected Pod dynamically and strips the hash suffix to get the deployment name

BUG-003 invalid deployment passes the safety guard:
  → _auto_execute() adds a _verify_k8s_deployment_exists() existence check after the safety guard passes
  → deployment/pod not found in K8s → refuse to execute and write to DecisionToken.error
  → when K8s MCP is unavailable, conservatively allow (do not block the main flow)

2026-04-11 Claude Sonnet 4.6 Asia/Taipei

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 20:20:39 +08:00
OG T
f3236338a5 fix(security): all Code Review P0+P1+P2 fixes: MCP Phase 2b-3 + decision_manager
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
P0: decision_manager _fetch_metrics_snapshot wrong parameter type
  - prom._instant_query(str) → prom._instant_query({"query": str})
  - result parsing r.get("status")=="success" → r.get("result", [])

P1: prometheus_provider: guard against PromQL injection via alertname
  - add the _RE_SAFE_ALERTNAME whitelist regex

P1: decision_manager: guard against dangerous-character injection in kubectl actions
  - add the _ALLOWED_KUBECTL_PATTERN whitelist; malformed commands are rejected outright

P1: decision_manager: GC risk in 6 asyncio.create_task() calls
  - add _background_tasks: set + _fire_and_forget() helper
  - all bare create_task calls now use _fire_and_forget
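
The event loop keeps only weak references to tasks, so a task with no other reference can be garbage-collected before it finishes. The pattern the commit describes (a module-level set plus a helper) is commonly written as below; this is a sketch of the general technique, and the names `fire_and_forget` and `demo` are illustrative rather than copied from the repository.

```python
import asyncio

# Strong references keep pending tasks alive until they complete.
_background_tasks: set = set()


def fire_and_forget(coro) -> asyncio.Task:
    """Schedule coro and hold a reference until it is done."""
    task = asyncio.create_task(coro)
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)  # drop the reference afterwards
    return task


async def demo() -> tuple:
    results = []

    async def work() -> None:
        await asyncio.sleep(0)
        results.append("done")

    task = fire_and_forget(work())
    tracked_while_running = task in _background_tasks
    await task
    await asyncio.sleep(0)  # give the done callback a chance to run
    return tracked_while_running, results, len(_background_tasks)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The done callback removes each finished task, so the set does not grow over the lifetime of the process.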

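P1: ssh_provider: Group B write tools now require known_hosts
  - refuse to execute when known_hosts is unset or the file does not exist, to prevent MITM

P2: sentry_provider: semantic whitelist validation for query
  - add _RE_SAFE_SENTRY_QUERY, rejecting queries that contain special characters

P2: argocd_provider: verify=False replaced by the ARGOCD_VERIFY_TLS environment variable switch
  - add the _tls_verify() helper, default false (self-signed cert)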

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 20:10:33 +08:00
OG T
a6e6f389e2 chore: clean up the temporary comments used to trigger CD
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 8m9s
2026-04-11 19:15:04 +08:00
OG T
40d6536b62 ci: trigger CD: MCP Phase 3/4 + SSH MCP fully enabled (providers comments updated)
Some checks are pending
CD Pipeline / build-and-deploy (push) Waiting to run
2026-04-11 19:14:17 +08:00
OG T
5d78c5492b feat(argocd-mcp): enable the ArgoCD MCP Provider + token injection flow
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
- config.py: ARGOCD_URL → https://192.168.0.125:30443 (the actual HTTPS NodePort)
- config.py: ARGOCD_MCP_ENABLED=True + SENTRY_MCP_ENABLED=True (enabled by default)
- cd.yaml: add a step that injects the ARGOCD_API_TOKEN Gitea Secret → K8s Secret
- K8s: ARGOCD_API_TOKEN already injected manually into awoooi-secrets + API pods already rollout-restarted
- ArgoCD: the admin account apiKey capability is already enabled

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 09:32:28 +08:00
OG T
7eb49f9c20 feat(mcp-phase4c): AI dynamic rule generation: new alertnames automatically get a Playbook draft
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 8m29s
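_generate_playbook_draft_if_new():
  - triggered asynchronously when no Playbook matches (does not block the main decision flow)
  - first uses semantic_search(threshold=0.92) to confirm KM has no Playbook of the same name
  - calls qwen2.5:7b-instruct (Ollama 188) to generate a five-part structured draft
    (symptoms/root cause/diagnostic steps/repair actions/acceptance criteria)
  - writes KnowledgeEntry(type=PLAYBOOK, status=DRAFT, source=AI_EXTRACTED)
  - writes an AlertOperationLog PLAYBOOK_DRAFT_CREATED event
  - failures are silent (debug log only)

Completes all three MCP Phase 4 items:
  4a NemoClaw second opinion (confidence < 0.7)
  4b K8s state snapshot k8s_state_after
  4c AI dynamic Playbook draft generation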

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 09:16:39 +08:00
OG T
0fa3b35a1c feat(mcp-phase4b): capture the K8s Pod state after auto-repair and write it to k8s_state_after
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 24s
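After _push_auto_repair_result() succeeds:
  - calls K8sProvider.kubectl_get(pods, label=app=<service>)
  - the result is truncated to 500 characters and written to incidents.k8s_state_after
  - km_conversion_service._build_content() already supports displaying this field
  - failures are silent (debug log only) and do not block the main flow

Completes the KM three-part data loop: symptoms (labels) + context (metrics_before) + action (action) + effect (metrics_after + k8s_state_after)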

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 09:15:31 +08:00
OG T
f3ee577f9d feat(mcp-phase4a): NemoClaw second opinion: confidence < 0.7 triggers a deepseek-r1:14b re-review
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
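- _nemoclaw_second_opinion(): calls Ollama 188 deepseek-r1:14b for independent reasoning
  - parses the <think>...</think> CoT format and keeps only the body
  - 30s timeout; degrades silently on failure
  - output truncated to 300 characters
- _dual_engine_analyze(): triggers the second opinion asynchronously when LLM confidence < 0.7
  - the result is appended to proposal_data["advisory_note"]
- _push_decision_to_telegram(): advisory_note is sent as a follow-up message under the NemoClaw bot identity
  - format: "NemoClaw second opinion (confidence=0.xx)"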
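
Stripping the reasoning block and truncating the remainder can be sketched with a single regular expression. The helper below is an assumption about one way to do it: `clean_reply`, its `limit` default, and the fallback text are hypothetical, and the handling of an unterminated `<think>` is an extra safeguard not stated in this commit.

```python
import re

# Non-greedy match so several <think> blocks are removed independently.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def clean_reply(raw: str, limit: int = 300, fallback: str = "(empty model reply)") -> str:
    """Drop chain-of-thought blocks, trim, and cap the length."""
    body = _THINK_BLOCK.sub("", raw)
    # An unterminated <think> means the output was cut off mid-reasoning,
    # so everything from that tag onwards is discarded as well.
    body = body.split("<think>", 1)[0].strip()
    return body[:limit] if body else fallback


print(clean_reply("<think>check restarts\nthen memory</think>Restart the pod."))
print(clean_reply("<think>still thinking"))
```

Returning a fallback string instead of an empty body avoids sending a blank message downstream.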

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 09:14:54 +08:00
OG T
a2cc985f60 feat(mcp-phase3): ArgoCD MCP + Sentry MCP + full Provider registration
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
ArgoCDProvider (3 tools):
  - argocd_list_apps: list all Apps + sync/health status
  - argocd_get_app_status: detailed status + list of problem resources
  - argocd_get_sync_history: the most recent N deployment records
  - input validation: app_name whitelist regex
  - requires ARGOCD_API_TOKEN + ARGOCD_MCP_ENABLED=true

SentryProvider (3 tools):
  - sentry_list_issues: list recent Issues (status filter)
  - sentry_get_issue: details + last 5 stacktrace frames
  - sentry_search_issues: PromQL-style search
  - issue_id whitelist validation (digits only)
  - requires SENTRY_AUTH_TOKEN + SENTRY_MCP_ENABLED=true

providers/__init__.py: adds Prometheus + SSH + ArgoCD + Sentry, registering all 10 providers
config.py: add ARGOCD_URL / ARGOCD_API_TOKEN / ARGOCD_MCP_ENABLED / SENTRY_MCP_ENABLED

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 09:11:53 +08:00
OG T
1ec19656b5 feat(adr071-ij): TYPE-2 metrics snapshot card + KM three-part data integration
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 8m17s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 36s
Ansible Lint / lint (push) Has been cancelled
ADR-071-I: decision_manager captures Prometheus metrics once before and once after execution
  - _fetch_metrics_snapshot(): picks the CPU/Mem/Disk/Restart query by alertname
  - _format_metrics_delta(): outputs the "CPU 92%→23% | Mem 78%→45%" format
  - _push_auto_repair_result(): writes metrics_after to the DB + the TYPE-2 card shows the delta
  - _auto_execute(): writes metrics_before to the DB before execution (closing the loop)
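
The delta string quoted above can be produced from two snapshots with a short formatter. This sketch assumes each snapshot is a plain dict of metric name to percentage; the real snapshot structure is not shown in the commit, and `format_metrics_delta` here is an illustrative stand-in for the private helper.

```python
def format_metrics_delta(before: dict, after: dict) -> str:
    """Render 'name old%→new%' for every metric present in both snapshots."""
    parts = [
        f"{name} {old:.0f}%→{after[name]:.0f}%"
        for name, old in before.items()
        if name in after
    ]
    return " | ".join(parts)


print(format_metrics_delta({"CPU": 92, "Mem": 78}, {"CPU": 23, "Mem": 45}))
```

Metrics missing from either snapshot are skipped, so a partially failed query still yields a usable line.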

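ADR-071-J: km_conversion_service._build_content() uses the compact delta format
  - builds a human-readable delta from metrics_before/after (CPU/Mem/Disk/restart count)
  - appends k8s_state_after (if present)
  - format: symptoms + root cause + action + effect numbers (symptoms→context→action→effect)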

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 03:09:35 +08:00
OG T
a29e5e1de2 feat(mcp-phase1): K8s MCP hardening: 6 new tools + namespace whitelist
MCP Phase 1 (acceptance after ADR-069 Sprint B):
  k8s_get_pod_logs    - fetch Pod logs (tail 1-500, supports previous)
  k8s_watch_rollout   - watch rollout status until complete (timeout 10-300s)
  k8s_get_events      - K8s events (filterable by resource_name / event_type)
  k8s_describe_pod    - full Pod describe (Conditions/Volumes/Env)
  k8s_get_hpa_status  - HPA replica count/CPU utilization
  k8s_get_node_conditions - Node Ready/MemoryPressure/DiskPressure

Security hardening:
  - ALLOWED_NAMESPACES = {"awoooi-prod"} hardcoded whitelist
  - _validate_namespace() + _validate_name() parameter whitelist
  - numeric parameters clamped to their bounds (tail 1-500, timeout 10-300s)
  - event_type only allows Warning / Normal

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 03:01:38 +08:00
OG T
2af4dffcc6 fix(security): Architecture Review fixes for 5 high-confidence issues
Security fixes (P0):
1. ssh_provider: add _validate_param() whitelist validation to prevent command injection
   - container_name/service/filter_name: [a-zA-Z0-9._-]{1,128}
   - compose_dir: must start with /opt/ or /srv/; .. is forbidden
   - domain: FQDN whitelist
   - tail/port/lines: int() conversion + clamped to bounds
2. ssh_provider: known_hosts=None replaced by reading the SSH_MCP_KNOWN_HOSTS_FILE environment variable
   - still defaults to None (quick start on the internal network), but logs a warning at startup
   - setup doc: ops/runbooks/ssh-mcp-setup.md (to be added)
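
The three validation rules in item 1 translate directly into small helpers. The sketch below uses the character class and path prefixes quoted in the commit; the function names and error messages are hypothetical, and the FQDN rule is omitted because its pattern is not given.

```python
import re

# Character class and length bound as quoted in the commit message.
_SAFE_NAME = re.compile(r"[a-zA-Z0-9._-]{1,128}")


def validate_name(value: str) -> str:
    """Accept only whitelisted characters; reject anything else outright."""
    if not _SAFE_NAME.fullmatch(value):
        raise ValueError(f"unsafe parameter: {value!r}")
    return value


def validate_compose_dir(path: str) -> str:
    """Require an /opt/ or /srv/ prefix and forbid parent-directory hops."""
    if not path.startswith(("/opt/", "/srv/")) or ".." in path:
        raise ValueError(f"unsafe compose_dir: {path!r}")
    return path


def clamp_int(value, low: int, high: int) -> int:
    """Convert to int, then clamp into [low, high]."""
    return max(low, min(high, int(value)))


print(validate_name("api-server_1.2"))
print(validate_compose_dir("/opt/stack"))
print(clamp_int("9999", 1, 500))
```

Using fullmatch rather than search matters here: a partial match would let a string such as `name; rm -rf /` through on the strength of its first few characters.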

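Modularization fixes (P1):
3. km_conversion_service: remove the import-time ALERT_EVENT_TYPES.update() side effect
   - ADR-071 event types moved into the static set in alert_operation_log_repository.py
4. telegram_gateway: create_task() changed to await + try/except
   - avoids a race condition after the DB session closes
   - KM conversion failures are logged as warnings without interrupting the main flow
5. km_conversion_service: add a top-level try/except; errors are always logged at error level and then re-raised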

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 02:50:26 +08:00
OG T
6351e9a0e9 feat(mcp-phase2): MCP Phase 2 — Prometheus MCP + SSH MCP + alert labels
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 13m37s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 35s
MCP-2b: prometheus_provider.py
  - prometheus_query (PromQL instant query)
  - prometheus_query_range (historical trend, default 15 minutes)
  - prometheus_get_alert_history (alert firing history)
  - config: PROMETHEUS_URL + PROMETHEUS_MCP_ENABLED

MCP-2a: ssh_provider.py
  - Group A: 9 read-only diagnostic tools (top/disk/memory/logs/status/port/nginx/swap)
  - Group B: 6 safe operation tools (restart/compose/systemctl/clear-log/ssl/nginx-reload)
  - four-layer safety guard (whitelist/allowed_hosts/forbidden_patterns/trust_score)
  - config: SSH_MCP_ENABLED + SSH_MCP_ALLOWED_HOSTS

K8s: 04-ssh-mcp-secret.example.yaml (ssh-mcp-key Secret template + creation steps)

Alert labels: alerts-unified.yml adds mcp_provider/host_type/alert_category
  Covers: HostHighCpuLoad/HostOutOfMemory/HostOutOfDiskSpace/DockerContainer*
        SignOzDown/SentryDown/HarborDown/GiteaDown

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 02:35:35 +08:00
OG T
325b3851b5 feat(adr-071): four alert notification types, first batch B/C/E/F/G/H fully implemented
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Type Sync Check / check-type-sync (push) Failing after 1m7s
ADR-071-B: classify_notification(): five-type classifier (TYPE-1/2/3/4/4D)
ADR-071-C: send_info_notification(): TYPE-1 info-only card with no buttons
ADR-071-E: _build_inline_keyboard(): builds the TYPE-3 buttons dynamically by alert_category
ADR-071-F: send_drift_card(): TYPE-4D Config Drift card + Diff truncation
ADR-071-G: km_conversion_service.py: an Incident reaching RESOLVED converts to KM automatically
ADR-071-H: handle_manual_fix_done(): TYPE-4 manual-fix Bot conversation loop

Completed in the previous batch: ADR-071-A (DB Migration) + ADR-071-D (state machine guard)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 02:24:20 +08:00
OG T
68a3858ae4 fix(auto_execute): the guard adds a target==alertname check so the LLM cannot pass the alert name off as the deployment name
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 13m33s
For host alerts such as HostHighCpuLoad, NemoTron Tool Calling may put the
alertname into deployment_name, which leads to running
'kubectl rollout restart deployment HostHighCpuLoad'.

New guard: when _target == _alertname, refuse to execute and notify for manual intervention.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 01:13:24 +08:00
OG T
a4d655ea7f fix(auto_execute): safety guard: refuse to execute actions containing unknown or an unresolved placeholder
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 19m7s
E2E Health Check / e2e-health (push) Successful in 43s
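Host-level alerts (HostHighCpuLoad, DockerContainerUnhealthy, etc.) have no matching
K8s deployment name and affected_services=[], so _target='unknown' and the system
ran meaningless commands such as 'kubectl rollout restart deployment unknown'.

Fix: if the action still contains 'unknown' or a <...>/{...} pattern after substitution,
refuse to execute and notify for manual intervention; commands carrying placeholders are never allowed to run.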
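
A guard of the kind described can be expressed as a substring check plus one regular expression for angle-bracket and curly-brace placeholders. This is a sketch under that reading of the commit; `is_executable_action` and the exact pattern are assumptions.

```python
import re

# Matches <name>-style and {name}-style placeholders left unresolved.
_PLACEHOLDER = re.compile(r"<[^<>]*>|\{[^{}]*\}")


def is_executable_action(action: str) -> bool:
    """False when the command still contains 'unknown' or a placeholder."""
    if "unknown" in action:
        return False
    return _PLACEHOLDER.search(action) is None


for cmd in (
    "kubectl rollout restart deployment unknown",
    "kubectl rollout restart deployment <name>",
    "kubectl scale deployment {target} --replicas=2",
    "kubectl rollout restart deployment api",
):
    print(is_executable_action(cmd), cmd)
```

The substring check is deliberately broad: it would also reject a deployment whose real name contains "unknown", trading a rare false positive for a simple rule.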

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 23:57:17 +08:00
OG T
dabc62e0f8 fix(telegram): append_incident_update: store the alert card message_id in Redis
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 14m31s
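After _send_approval_card_to_group sends the alert card, it stores the Telegram message_id
in Redis under tg_msg:{incident_id} (TTL 24h), so a later append_incident_update can
replace the approve button + reply with the status.

Before the fix: the tg_msg key was never written, so append always fell back to sending a new message
and the approve button could never be removed.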

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 22:41:30 +08:00
OG T
797c7c749e fix(nemotron): deepseek-r1 num_predict 400→1200 to avoid empty replies when the <think> block is truncated
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 28s
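When deepseek-r1:14b uses more than 400 thinking tokens, the output is cut off before </think>, so
the body is empty after cleanup and Telegram shows an empty message.
- chat_manager: num_predict 400 → 1200
- telegram_gateway: _clean_ai_reply adds a fallback error notice for empty values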

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 22:35:37 +08:00
OG T
f8926bb70a ci: trigger CD: decision_manager fix marker
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 22:12:56 +08:00