Commit Graph

3678 Commits

Author SHA1 Message Date
OG T
f2b3a7129f docs(plan): Sprint 5 command center redesign: complete solution and detailed implementation steps 2026-04-08 12:01:14 +08:00
OG T
876aa9a441 docs(adr): ADR-060 React Flow + elkjs topology graph engine technology selection (Option D+ approved) 2026-04-08 11:56:58 +08:00
OG T
a421d2c5b8 feat(ops): Plan A docker-health-monitor.sh: Docker container health monitoring with automatic repair
- Detects unhealthy / exited / dead containers
- Exclusion list: databases (PG/Redis), Gitea, monitoring stack
- Prometheus/Grafana/Alertmanager exited → docker start (protects the WAL)
- Mandatory three-stage notification: Intent→Action→Result (Chief Architect ruling)
- HMAC-SHA256 signature → AWOOOI API /api/v1/webhooks/custom-alert
- Fallback: API down → call the Telegram Bot API directly
- 300s cooldown to prevent repeated repairs

Deployment: cron */5 * * * * on 192.168.0.110 + 192.168.0.188
Configuration: /etc/awoooi-ops/secrets.env

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 11:48:39 +08:00
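The HMAC-SHA256 signing step mentioned in commit a421d2c5b8 could look roughly like the sketch below. The actual monitor is a shell script; this Python version only illustrates the technique. The secret, the payload fields, and the function names are assumptions made for the example and do not come from the repository.

```python
import hashlib
import hmac
import json


def sign_payload(secret: bytes, body: bytes) -> str:
    """Return the hex HMAC-SHA256 digest of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, signature: str) -> bool:
    """Compare digests in constant time to avoid leaking information via timing."""
    return hmac.compare_digest(sign_payload(secret, body), signature)


# Hypothetical notification payload; the field names are illustrative only.
body = json.dumps(
    {"stage": "Intent", "container": "example", "action": "docker start"},
    sort_keys=True,
).encode("utf-8")
signature = sign_payload(b"example-secret", body)
```

The receiver recomputes the digest over the exact bytes it received, so the sender must sign the serialized body rather than the parsed object.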
OG T
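f525e657ca docs: ADR-060/061 architecture decision records for comprehensive monitoring + Event Sourcing
- ADR-060: comprehensive infrastructure monitoring plan (Plan A/B/C/D/E)
- ADR-061: Alert Operation Log Event Sourcing architecture
- LOGBOOK: 2026-04-08 milestone record update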

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 11:44:06 +08:00
OG T
f20121ad41 feat(audit): Phase 11 full traceability of alert operations: alert_operation_log + historical backfill
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m29s
Commander directive: "Write every alert message to the database and record the related operations"

Changes:
- phase11_alert_operation_log.sql: new table (Event Sourcing, immutable)
- phase11b_backfill_alert_operation_log.sql: historical backfill of 654 rows
  - 14 ALERT_RECEIVED rows (incidents)
  - 265 TELEGRAM_SENT rows (approval_records)
  - 265 USER_ACTION rows (approval_records)
  - 110 EXECUTION_COMPLETED rows (audit_logs)
- db/models.py: AlertOperationLog SQLAlchemy model
- repositories/alert_operation_log_repository.py: append/list_by_incident/get_stats
- webhooks.py: _try_auto_repair_background writes AUTO_REPAIR_TRIGGERED + EXECUTION_COMPLETED + TELEGRAM_RESULT_SENT
- webhooks.py: _push_to_telegram_background writes TELEGRAM_SENT
- telegram.py: handle_callback writes USER_ACTION (approve/reject)

Migration executed: awoooi_prod@192.168.0.188

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 11:22:03 +08:00
OG T
eee6f06215 feat(auto-repair): force all operations to be written to the DB: auto_repair_executions table
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m32s
Commander directive: all auto-repair operations (success or failure) must be persisted

Changes:
- migrations/phase10_auto_repair_executions.sql: new table + 4 indexes
- db/models.py: add AutoRepairExecution SQLAlchemy model
- repositories/audit_log_repository.py: add AutoRepairExecutionRepository (create/list_by_incident/get_stats)
- auto_repair_service.py: execute_auto_repair writes to the DB on both the success and failure branches
  - Passes through a new similarity_score parameter
  - AutoRepairDecision gains a similarity_score field
- webhooks.py: pass similarity_score to execute_auto_repair

Migration executed: awoooi_prod@192.168.0.188:5432

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 11:16:37 +08:00
OG T
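68a2fff746 feat(auto-repair): remove all blocking thresholds: everything goes straight to auto-repair
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m38s
Commander directive: all APPROVED Playbooks execute directly, no longer checking:
- Similarity threshold (MIN_SIMILARITY_SCORE 0.7 → 0.0)
- is_high_quality quality threshold
- Cold-start trust mechanism
- Action risk-level threshold (both the evaluate and execute layers)

Retained: manual review for P0/P1 severity, global cooldown circuit breaker, APPROVED status check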

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 11:10:09 +08:00
OG T
8fcb66eb52 chore(api): trigger CD — Sprint 3+4+F deploy
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 11m28s
E2E Health Check / e2e-health (push) Successful in 34s
2026-04-07 16:00:12 +08:00
OG T
4c45961c4f chore: trigger CD deploy (Sprint 3+4+F) 2026-04-07 13:25:36 +08:00
OG T
b7ea362efc fix(api): Review #2 technical debt cleanup: I1/S1/S2/S3 all fixed
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 12m13s
I1: complete the error_type field
- Add AnomalyCounter.derive_key_from_incident()
  Extracts reason/error_type from signal.labels so that all four fields are populated

S1: unify the signature construction logic in three places
- auto_repair_service._derive_anomaly_key() → delegates to derive_key_from_incident()
- approval_execution._get_anomaly_key_from_approval() → same
- incident_service.resolve_incident() B4 → same
- Removes 3 duplicated copies of the signature construction code

S2: Redis Pipeline batch queries
- get_all_disposition_stats() changed from N+1 hgetall calls to 2 Pipelines
- Pipeline 1: batch hgetall for all disposition keys
- Pipeline 2: batch hget metadata (alert_name)
- Performance: Redis round-trips drop from about 2N to 2

S3: fix the get_incident AttributeError in auto_repair.py
- get_incident() → get_from_working_memory() (pre-existing bug)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 13:13:42 +08:00
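The S2 change in commit b7ea362efc replaces per-key hgetall calls with pipelined batches. The sketch below shows the general idea under stated assumptions: the client is expected to expose a redis-py style pipeline()/hgetall()/execute() interface, and FakeRedis, FakePipeline, fetch_all_stats, and the key names are hypothetical stand-ins so the example runs without a server. It is not the repository's get_all_disposition_stats().

```python
from typing import Dict, List


class FakePipeline:
    """In-memory stand-in that mimics queue-then-execute pipeline semantics."""

    def __init__(self, client: "FakeRedis") -> None:
        self._client = client
        self._queued: List[str] = []

    def hgetall(self, key: str) -> "FakePipeline":
        self._queued.append(key)  # queued locally, nothing is sent yet
        return self

    def execute(self) -> List[Dict[str, str]]:
        self._client.round_trips += 1  # one round-trip per execute()
        return [dict(self._client.data.get(k, {})) for k in self._queued]


class FakeRedis:
    """Hypothetical client that only counts round-trips."""

    def __init__(self, data: Dict[str, Dict[str, str]]) -> None:
        self.data = data
        self.round_trips = 0

    def pipeline(self) -> FakePipeline:
        return FakePipeline(self)


def fetch_all_stats(client, keys: List[str]) -> Dict[str, Dict[str, str]]:
    """Fetch every hash in a single batch instead of one call per key."""
    pipe = client.pipeline()
    for key in keys:
        pipe.hgetall(key)
    return dict(zip(keys, pipe.execute()))


client = FakeRedis({
    "disposition:a": {"auto_repair": "3"},
    "disposition:b": {"human_approved": "1"},
})
stats = fetch_all_stats(client, ["disposition:a", "disposition:b"])
```

With N keys the naive loop costs N round-trips, while the pipelined version costs one per batch regardless of N.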
OG T
b20a619a3d fix(ci): CD fix: shared-types type sync + test cold-start conflict
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Type Sync Check / check-type-sync (push) Successful in 1m2s
1. pnpm shared-types generate: syncs the Pydantic models added in Sprint 4
2. Fix test_evaluate_not_high_quality: add a MEDIUM risk step to avoid
   accidentally taking the cold-start path (Redis not initialized → COLD_START_DAILY_LIMIT)

11/11 auto_repair tests pass

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 13:09:17 +08:00
OG T
3a3f9cf70c docs(logbook): Sprint 4 full-stack completion record: 6 Phases / 19 work items 2026-04-07 13:02:59 +08:00
OG T
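de3935d1d4 feat: Sprint 4 Phase E+F: frontend disposition statistics + weekly report disposition breakdown
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m26s
Type Sync Check / check-type-sync (push) Failing after 1m2s
Phase E: frontend pages
- E1: /reports full disposition statistics dashboard (already completed in Sprint F)
- E2: home page Metrics Strip: gets the real automation rate from the disposition API
  Prefers /stats/disposition auto_rate, falls back to an estimate derived from incidents
- E3: /auto-repair disposition overview cards (already completed in Sprint F)
- E4: /neural-command stats tab disposition breakdown (already completed in Sprint F)
- E5: i18n translations zh-TW + en (already completed in Sprint F)

Phase F: weekly report + documentation
- F1: WeeklyReportMessage gains 5 disposition fields
  Weekly report format adds a "📋 Disposition breakdown" block (auto/cold start/human review/manual + automation rate)
  weekly_report_service integrates get_all_disposition_stats()
- message length limit raised from 900 to 1200 characters (to fit the disposition block)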

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 13:02:20 +08:00
OG T
37bddbb430 docs(logbook): Sprint 4 Phase E frontend disposition statistics completion record
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 13:01:22 +08:00
OG T
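22bc384b28 feat(web): Sprint 4 Phase E: frontend disposition statistics dashboard
E1: /reports page upgraded to a full disposition statistics dashboard
- 3 KPIs at the top (total dispositions / automation rate / human intervention rate)
- Four counter cards (auto repair / human review / manual handling / cold-start trust)
- Stacked distribution bar (percentage visualization)
- Detail table by anomaly type
- Wired to GET /api/v1/stats/disposition

E3: /auto-repair page adds 4 disposition overview cards
E4: /neural-command stats tab adds a disposition breakdown block
E5: add 25+ i18n translation keys (zh-TW + en)

All pages pass next build. Commander's iron rule: no fake data; show '--' when there is no data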

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 13:00:41 +08:00
OG T
246587a401 fix(web): Sprint F frontend fake-data purge: all 29 instances of fake data removed (Chief Architect 98/100)
P0: remove all MOCK constants from the three Neural Command subcomponents and wire them to real API props
- NeuralLiveCenter: fake history/fake KPIs/fake radar → computed live from stats/history/incidents
- NeuralStats: MOCK_HISTORY/SCHEME_STATS/PLAYBOOK_RANKINGS → useMemo aggregation
- NeuralApprovalPanel: MOCK_PENDING → real /api/v1/approvals approval operations

P1: 10+ fake user identities (demo-user/user-001/War Room User) → unified under the CURRENT_USER constant
P2: delete 6 Demo exports (GlobalPulseChartDemo/MOCK_APPROVAL/DEMO_DECISION_CHAIN)
P3: /demo page guarded by the NEXT_PUBLIC_ENABLE_DEMO environment variable
i18n: add 22 translation keys (zh-TW + en)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 12:53:52 +08:00
OG T
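561bcb638b fix(api): Sprint 4 Chief Architect Review P0 fixes: unified hashing + modular architecture compliance
P0-1: unify anomaly_key hash derivation
- B1: add _derive_anomaly_key() using AnomalyCounter.hash_signature()
  in place of symptoms.compute_hash()
- B3/B4: namespace now uses signal.labels.get("namespace", "")
  Fixes getattr(signal, "namespace", "") always returning an empty string

P0-2: Router layer modular architecture compliance
- C1/C2: encapsulate get_all_disposition_stats() in AnomalyCounter
- Router no longer accesses counter.redis directly
- stats.py removes the unused days/stats parameters

P1: get_frequency() populates the disposition fields
- Consistent with _record_anomaly_impl(); returns the full disposition statistics

Chief Architect score: 82/100 → all P0 items fixed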

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 12:53:12 +08:00
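The B3/B4 fix in commit 561bcb638b comes down to where the namespace is stored. A minimal sketch of the bug, assuming a signal object that carries the namespace only inside its labels dict (the Signal class here is a hypothetical stand-in, not the project's model):

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Signal:
    """Hypothetical signal shape: namespace lives in labels, not as an attribute."""

    labels: Dict[str, str] = field(default_factory=dict)


signal = Signal(labels={"namespace": "awoooi-prod"})

# The object has no `namespace` attribute, so getattr silently returns the default.
buggy = getattr(signal, "namespace", "")

# Reading from the labels dict returns the real value.
fixed = signal.labels.get("namespace", "")
```

Because getattr with a default never raises, this kind of mismatch produces empty strings rather than an error, which is why it can go unnoticed.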
OG T
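a85e9ced08 feat(api+telegram): Sprint 4 Phase C+D: API endpoints + Telegram disposition statistics
Phase C: API endpoints
- C1: GET /api/v1/stats/disposition: full disposition breakdown statistics
  - DispositionSummary: auto/human/manual/cold_start + auto_rate
  - DispositionByAnomaly: detail by anomaly type (up to 20 entries)
  - Redis SCAN + HGETALL aggregation
- C2: GET /api/v1/auto-repair/stats extended with disposition_summary

Phase D: Telegram alert format
- D1: alert card adds a disposition statistics line
  - 🤖 Auto: N | 👤 Review: N | 🔧 Manual: N
  - Automation rate percentage
- D2: history button enhanced with disposition breakdown detail
  - Full 5 counters + automation rate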

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 12:17:20 +08:00
OG T
9253281d46 feat(api): Sprint 4 Phase A+B: alert disposition statistics data layer + write layer
Phase A: data layer
- A1: IncidentFrequencyStats gains 4 fields (human_approved/manual_resolved/cold_start_trust/total_resolution)
- A2: AnomalyCounter.record_disposition(): atomic increment via Redis HINCRBY
- A3: get_disposition_stats(): HGETALL returns the disposition breakdown
- AnomalyFrequency dataclass extended + to_dict() kept in sync
- _record_anomaly_impl() integrates disposition stats

Phase B: wiring the write-layer trigger points
- B1: auto repair succeeds → record_disposition("auto_repair")
- B2: cold-start trust succeeds → record_disposition("cold_start_trust")
  - AutoRepairDecision gains an is_cold_start flag
  - execute_auto_repair() receives it and distinguishes the disposition type
- B3: human-approved execution succeeds → record_disposition("human_approved")
  - Add the _get_anomaly_key_from_approval() helper
- B4: manual handling inferred → resolve_incident() decides by elimination
  - If resolved and there is no auto/human/cold_start record → manual_resolved

Safety design: every disposition record call is wrapped in try/except; a failure does not block the main flow

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 11:54:46 +08:00
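Two ideas from commit 9253281d46 lend themselves to a short sketch: the B4 "decide by elimination" rule and the try/except wrapper that keeps recording failures from blocking the main flow. The function names and the shape of the counts dict below are assumptions for illustration; only the disposition names come from the commit message.

```python
import logging
from typing import Callable, Dict, Optional

logger = logging.getLogger(__name__)

# Disposition names taken from the commit message.
NON_MANUAL = ("auto_repair", "human_approved", "cold_start_trust")


def infer_manual_disposition(resolved: bool, counts: Dict[str, int]) -> Optional[str]:
    """Return "manual_resolved" when a resolved incident has no other disposition record."""
    if not resolved:
        return None
    recorded = sum(counts.get(name, 0) for name in NON_MANUAL)
    return "manual_resolved" if recorded == 0 else None


def record_safely(record: Callable[[str], None], disposition: str) -> bool:
    """Call the recorder; log and swallow any error so the caller can continue."""
    try:
        record(disposition)
        return True
    except Exception:
        logger.exception("disposition recording failed: %s", disposition)
        return False
```

Swallowing the exception is a deliberate trade-off here: statistics may undercount during a Redis outage, but incident handling itself is not interrupted.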
OG T
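e82d3802c5 docs: Sprint 4 alert disposition statistics system: complete plan document + LOGBOOK update
The Sprint 4 plan comprises 6 Phases / 19 work items:
- Phase A: data layer (IncidentFrequencyStats + Redis counters)
- Phase B: write layer (4 trigger points: auto_repair/cold_start/human/manual)
- Phase C: API endpoint (/stats/disposition)
- Phase D: Telegram alert card statistics
- Phase E: frontend (/reports dashboard + home page + auto-repair + neural-command)
- Phase F: weekly report + documentation

Chief Architect review: 100% Fully Approved
Conflict check: all dependencies correct, DAG is acyclic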

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 11:37:21 +08:00
OG T
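53b2daeaca feat(api): first-time trust mechanism: breaking the auto-repair cold-start chicken-and-egg problem
Problem: a Playbook needs success_count >= 3 to count as is_high_quality,
but without auto repair there are never any success records → the threshold is never reached.

Option C: first-time trust (Cold Start Trust)
- APPROVED status + all steps risk=LOW + execution count < 3 → automatically allowed
- A Redis counter limits first-time-trust auto repairs to at most 5 per day
- After 3 accumulated successes, the Playbook reverts to the normal is_high_quality threshold

Safety boundaries:
- Only LOW risk steps qualify for first-time trust (e.g. restarting a container)
- HIGH/CRITICAL still require manual review
- P0/P1 severity still requires manual review
- The daily cap prevents runaway behavior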

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 11:21:00 +08:00
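The gating rules listed in commit 53b2daeaca can be expressed as a single predicate. This is a minimal sketch under stated assumptions: the Playbook dataclass, the function name, and the way the daily counter is passed in are hypothetical; the thresholds (3 executions, 5 per day) and the LOW / P0 / P1 rules come from the commit message.

```python
from dataclasses import dataclass
from typing import List

COLD_START_MAX_EXECUTIONS = 3  # from the commit: execution count must be < 3
COLD_START_DAILY_LIMIT = 5     # from the commit: at most 5 first-time-trust repairs per day


@dataclass
class Playbook:
    """Hypothetical, simplified view of a playbook."""

    status: str
    step_risks: List[str]
    execution_count: int


def qualifies_for_cold_start(playbook: Playbook, severity: str, used_today: int) -> bool:
    """Return True only when every safety boundary from the commit is satisfied."""
    if severity in ("P0", "P1"):
        return False  # high-severity incidents still need manual review
    if playbook.status != "APPROVED":
        return False
    if any(risk != "LOW" for risk in playbook.step_risks):
        return False  # a single non-LOW step disqualifies the playbook
    if playbook.execution_count >= COLD_START_MAX_EXECUTIONS:
        return False  # past cold start; the normal quality gate applies
    return used_today < COLD_START_DAILY_LIMIT
```

In the real service the daily count would come from the Redis counter the commit mentions; passing it in as an argument keeps the rule itself easy to test.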
OG T
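2fe8062fb8 refactor(api): Re-Review S1/S2/S3 improvements: deduplication + defensive validation + test isolation
S1: extract the shared _execute_and_observe() method
  - Removes 3 duplicated copies of the execute+audit+langfuse logic in repair_by_uri
  - Unifies the AuditLog + Langfuse trace write path

S2: defensive validation of the SSH username
  - Add validate_ssh_user() + the _SSH_USER_RE regex
  - Validate the user parameter at the entry of _ssh_execute()
  - Prevents unexpected behavior from user@host concatenation
  - Add 8 username validation tests

S3: Singleton test reset
  - Add the _reset_for_test() classmethod
  - Avoids state leaking across tests
  - Add 2 singleton reset tests

Tests: 55/55 passing (45 existing + 10 new)
Chief Architect Re-Review: 91/100, passed; all 3 Suggestions implemented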

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 11:17:40 +08:00
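The S2 username check in commit 2fe8062fb8 could be sketched as follows. The commit names validate_ssh_user() and _SSH_USER_RE but does not show the pattern, so the regex below is an assumption: a conservative POSIX-style login name of at most 32 characters.

```python
import re

# Assumed pattern (not from the repository): lowercase login names, 1-32 chars,
# which excludes '@', whitespace, leading '-', and shell metacharacters.
_SSH_USER_RE = re.compile(r"[a-z_][a-z0-9_-]{0,31}")


def validate_ssh_user(user: str) -> str:
    """Return the username unchanged, or raise ValueError if it looks unsafe."""
    # fullmatch (rather than match with '$') also rejects a trailing newline.
    if not isinstance(user, str) or not _SSH_USER_RE.fullmatch(user):
        raise ValueError(f"invalid SSH username: {user!r}")
    return user
```

Rejecting a leading '-' matters because a value such as -oProxyCommand=... would otherwise be parsed by ssh as an option when concatenated into user@host.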
OG T
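78a8d3dfa5 fix(api): add whitelist validation for the ansible control node to prevent bypass via environment variable (Re-Review Important)
Chief Architect Re-Review finding: ANSIBLE_CONTROL_HOST comes from an environment variable (ConfigMap),
so a tampered ConfigMap could bypass SSH_TARGET_WHITELIST.
Add validate_ssh_target_host(host) at the start of _execute_ansible() to close the loop.

Re-Review score: 91/100, passed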

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 11:13:49 +08:00
OG T
0dec007673 docs(logbook): record completion of Sprint 3 P0 critical security fixes
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 11m37s
2026-04-07 11:10:48 +08:00
OG T
f8d4772abf fix(api): Sprint 3 P0-1/P0-2/P0-3/P0-4 Critical Security Fixes
P0-1: Complete shell metacharacter regex detection
  - Enhanced _SHELL_METACHAR_RE to detect: >, <, \n, ${}, $()
  - Blocks additional shell injection vectors (redirects, variable expansion, newlines)
  - Added 5 new validation tests

P0-2: Add shlex.quote() protection for ansible playbook path
  - Wraps playbook_path in shlex.quote() before SSH command construction
  - Prevents shell injection if path contains special characters
  - Applied in _execute_ansible() method

P0-3: Add SSH target host whitelist validation
  - Introduces validate_ssh_target_host() function
  - Only allows SSH to: 192.168.0.110, 192.168.0.188
  - Prevents unauthorized SSH target exploitation
  - Added 5 new whitelist validation tests

P0-4: Convert HostRepairAgent to singleton pattern
  - Implements __new__() singleton with shared _in_process_locks dict
  - Ensures in-process locks persist across multiple auto_repair_service calls
  - Previously created new instance per call, making locks ineffective
  - Added singleton persistence test

Test Results: 45/45 passing (34 existing + 11 new P0 tests)
All security validations verified via comprehensive unit test coverage.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-07 11:09:45 +08:00
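The P0-2 and P0-3 fixes in commit f8d4772abf combine two standard defenses: an allow-list for SSH targets and shlex.quote() for paths placed on a shell command line. A minimal sketch, assuming the two hosts named in the commit; build_ansible_command is a hypothetical helper and does not reflect the actual _execute_ansible() implementation.

```python
import shlex

# Hosts named in the commit message; everything else is rejected.
SSH_TARGET_WHITELIST = frozenset({"192.168.0.110", "192.168.0.188"})


def validate_ssh_target_host(host: str) -> str:
    """Return the host unchanged, or raise ValueError if it is not allow-listed."""
    if host not in SSH_TARGET_WHITELIST:
        raise ValueError(f"SSH target not whitelisted: {host!r}")
    return host


def build_ansible_command(playbook_path: str) -> str:
    """Quote the path so the remote shell sees it as a single argument."""
    return f"ansible-playbook {shlex.quote(playbook_path)}"
```

For example, a path containing "; rm -rf /" ends up inside single quotes, so the shell treats it as part of the file name rather than as a second command.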
OG T
af07c23675 fix(k8s): mount known_hosts in its own /etc/repair-known-hosts directory to fix a mount conflict
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 12m11s
E2E Health Check / e2e-health (push) Successful in 34s
/etc/repair-ssh is already taken by repair-ssh-key, so the subPath file mount conflicts
Switched to a separate directory /etc/repair-known-hosts and updated KNOWN_HOSTS_PATH to match

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 15:06:28 +08:00
OG T
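d56aae135d fix(k8s): repair-known-hosts secret optional:true: Pod no longer blocks waiting for the secret to be created
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 8m35s
The secret is only created on the first CD run; optional lets the Pod start first
It is mounted automatically once CD has created the secret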

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 14:48:45 +08:00
OG T
93bcfb4ce8 docs: update LOGBOOK: Sprint 3 SSH_COMMAND chain of command complete 2026-04-06 14:48:11 +08:00
OG T
ee187dcb79 ci(cd): CD automatically creates the awoooi-repair-known-hosts Secret (closes Sprint 3 T2)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
On every deploy, run ssh-keyscan against .110/.188 and kubectl apply the secret
Replaces StrictHostKeyChecking=no (Security Fix A1)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 14:45:20 +08:00
OG T
1644fe6474 feat(api): auto_repair_service integrates repair_by_uri (Sprint 3 T6)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 14:39:03 +08:00
OG T
a4e11bfa92 feat(api): AuditLog + Langfuse Trace for SSH_COMMAND (Sprint 3 T5)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 14:38:59 +08:00
OG T
02510d3d93 feat: /api/v1/auto-repair/history endpoint + neural-command wired to the real API (Sprint 3)
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 8m50s
- Add RepairHistoryItem/RepairHistoryResponse Pydantic models
- GET /api/v1/auto-repair/history?limit=N derives repair history from the incidents working memory
- Frontend fetchData() fetches history + approvals/pending together and drops the hard-coded pendingApprovals=0
- Wrapped in try/except so that any error returns an empty list without breaking the frontend

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 14:28:55 +08:00
OG T
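4561f141bb feat(api): Redis idempotency lock to prevent duplicate repairs (Sprint 3 T4)
Two-layer lock design: in-process asyncio.Lock (always in effect) + Redis distributed lock (cross-Pod, best-effort)
A second repair call for the same URI returns an "already running" error immediately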

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 14:26:53 +08:00
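The in-process half of the two-layer lock from commit 4561f141bb can be sketched with one asyncio.Lock per URI. This is an illustration only: the Redis layer is omitted, and repair_once, the module-level dict, and the example URI are hypothetical names rather than the repository's code.

```python
import asyncio
from typing import Awaitable, Callable, Dict

# One lock per repair URI, shared by every caller in this process.
_in_process_locks: Dict[str, asyncio.Lock] = {}


async def repair_once(uri: str, do_repair: Callable[[], Awaitable[str]]) -> str:
    """Run do_repair unless a repair for the same URI is already in flight."""
    lock = _in_process_locks.setdefault(uri, asyncio.Lock())
    if lock.locked():
        return "already running"  # reject instead of queueing behind the lock
    async with lock:
        return await do_repair()


async def _demo() -> list:
    async def slow_repair() -> str:
        await asyncio.sleep(0.05)
        return "repaired"

    # Two concurrent calls for the same (hypothetical) URI.
    return await asyncio.gather(
        repair_once("ssh://host/restart", slow_repair),
        repair_once("ssh://host/restart", slow_repair),
    )


results = asyncio.run(_demo())
```

The check and the acquire happen without an intervening await, so within a single event loop no second caller can slip in between them. Across Pods that guarantee does not hold, which is presumably why the commit adds the Redis lock as a second layer.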
OG T
1a654aa37d feat(api): HostRepairAgent three execution paths + known_hosts + Ansible whitelist (Sprint 3 T3)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 14:22:54 +08:00
OG T
d4cb9a4ac5 ops(k8s): known_hosts Secret + Ansible whitelist ConfigMap (Sprint 3 T2)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 14:20:14 +08:00
OG T
5e8b2a6894 feat(api): URI scheme parser + Shell Injection protection (Sprint 3 T1)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 14:18:21 +08:00
OG T
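9197994d51 feat(neural-command): add Sprint 3 chain-of-command visualization + T1-T7 task progress monitoring
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 11m15s
- Node flow diagram: SSH Gateway → URI parser → Shell injection guard → Redis idempotency lock → Ansible Playbook DB
- T1-T7 task cards (T1/T2 marked complete, T3-T7 pending)
- 4 metric panels: implementation velocity / security level / observability / architecture health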

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 14:13:58 +08:00
OG T
1a8021bfaa docs(plans): Sprint 3 SSH_COMMAND chain-of-command implementation plan (7 tasks) 2026-04-06 14:08:28 +08:00
OG T
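0b1ceb8618 feat(web): add the Neural Command Center page /neural-command
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 12m22s
Sprint 3 SSH_COMMAND chain-of-command UI, full frontend implementation:

- Pre-Flight review panel: 8/8 security checks (categories A/B/C) + pass status + feature toggles
- Live command center: OpenClaw 🦞 + NemoTron status + neural conduction link animation + execution stream
- Statistics & history: 5 KPIs + URI scheme distribution + Playbook effectiveness ranking + timeline
- Nuclear key authorization panel: diagnoses from the two commanders + execution path details + NuclearKeyButton long-press confirmation

Technical details:
- Route: /neural-command (a separate new page, not a replacement for /auto-repair)
- sidebar: BrainCircuit icon, directly below auto-repair
- i18n: full zh-TW + en support (neuralCommand namespace)
- TypeScript: type definitions split out into components/neural-command/types.ts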

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 14:01:31 +08:00
OG T
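0da827beef perf(web): Dockerfile adds --mount=type=cache to persist the Next.js build cache
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 13m37s
CACHE_BUST still forces the source layer to be invalidated (so code changes make it into the bundle),
but .next/cache is persisted across builds on the runner host via a BuildKit cache mount.
Next.js incremental compilation rebuilds only the changed pages, an expected saving of 3-4 minutes.

# 2026-04-06 ogt: Web build drops from 5 min to ~1-2 min (from the second build on)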

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 12:45:43 +08:00
OG T
a4ae74f767 fix(cd): fix Playwright version detection path ../package.json → ./package.json
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
The step runs in the apps/web directory, where ../package.json does not exist, so it returned unknown every time
and every deploy re-downloaded the 110MB Chromium.
Use ./package.json to read the @playwright/test version of apps/web correctly.

# 2026-04-06 ogt: saves about 2 minutes of CD time

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 12:44:45 +08:00
OG T
cd37befbe6 fix(models): replace datetime.UTC → timezone.utc everywhere for Python 3.10 compatibility
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Type Sync Check / check-type-sync (push) Successful in 59s
terminal.py, incident.py, and utils/timezone.py had the same problem.
The CI runner's Python 3.10 has no UTC constant, so every model import failed silently.

# 2026-04-06 ogt: complete fix, no stragglers left

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 12:40:27 +08:00
OG T
59c3dfb910 fix(models): approval.py switches to timezone.utc for Python 3.10 compatibility
Some checks failed
CD Pipeline / build-and-deploy (push) Successful in 12m12s
Type Sync Check / check-type-sync (push) Failing after 52s
The CI runner uses Python 3.10; datetime.UTC was only added in 3.11.
Switch to datetime.timezone.utc, which works on all versions, fixing the across-the-board CI type-sync failure.

# 2026-04-06 ogt: root cause: CI Python 3.10 cannot import UTC

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 12:19:23 +08:00
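The compatibility fix in commit 59c3dfb910 rests on the fact that datetime.UTC only exists from Python 3.11, while datetime.timezone.utc is available on every Python 3 version. A minimal sketch of the portable form (utc_now is a hypothetical helper name):

```python
from datetime import datetime, timezone


def utc_now() -> datetime:
    """Return a timezone-aware current time without relying on datetime.UTC."""
    return datetime.now(timezone.utc)


stamp = utc_now()
```

On 3.11 and later, datetime.UTC is simply an alias for timezone.utc, so switching to the older spelling does not change behavior.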
OG T
b416ab6577 ci(debug): type-sync-check adds diff output to diagnose the cause of the CI failure
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 12:17:36 +08:00
OG T
8235f91bc6 fix(scripts): generate-schemas adds both apps/api and apps/api/src to sys.path
Some checks failed
Type Sync Check / check-type-sync (push) Failing after 56s
Problem: CI type-sync-check keeps failing
Cause: adding only apps/api/src is not enough; the model files use from src.utils.X import Y internally,
     so apps/api must be on the path for the src package to resolve
Result: all 51 types generated correctly

# 2026-04-06 ogt: fix CI type-sync blocking deployment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 12:00:18 +08:00
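The sys.path fix in commit 8235f91bc6 can be illustrated with a small helper. This is a sketch, not the generate-schemas script itself: add_import_roots and the /repo root are hypothetical, and only the two relative paths come from the commit message.

```python
import sys
from pathlib import Path
from typing import Iterable, List


def add_import_roots(repo_root: Path, relative_roots: Iterable[str]) -> List[str]:
    """Prepend each root to sys.path once and return the entries that were added."""
    added = []
    for rel in relative_roots:
        entry = str(repo_root / rel)
        if entry not in sys.path:
            sys.path.insert(0, entry)
            added.append(entry)
    return added


# Per the commit, apps/api must be on the path so that
# `from src.utils.X import Y` can resolve the `src` package;
# apps/api/src was the only root added before the fix.
added = add_import_roots(Path("/repo"), ["apps/api", "apps/api/src"])
```

An import such as `from src.utils.X import Y` is resolved relative to a directory on sys.path that contains `src`, which is why the parent directory is the one that has to be listed.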
OG T
f6332b4b2f fix(telegram): fix approval_id UUID conversion error: support the INC-xxx format
All checks were successful
CD Pipeline / build-and-deploy (push) Successful in 12m24s
_execute_approval_action calls UUID(approval_id), but approval_id is INC-xxx,
which raises 'badly formed hexadecimal UUID string' and prevents the approval from executing.

Fix: try the UUID conversion first; on failure, look up the matching pending approval UUID by incident_id.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 11:53:48 +08:00
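The fallback described in commit f6332b4b2f follows a common try-then-look-up pattern. A minimal sketch, assuming the lookup is injected as a callable; resolve_approval_uuid and the lookup parameter are hypothetical names, and the real code queries pending approvals by incident_id.

```python
from typing import Callable, Optional
from uuid import UUID


def resolve_approval_uuid(
    approval_id: str,
    lookup_pending_by_incident: Callable[[str], Optional[UUID]],
) -> Optional[UUID]:
    """Parse approval_id as a UUID, or treat it as an incident ID and look it up."""
    try:
        return UUID(approval_id)
    except ValueError:
        # Not a UUID (e.g. "INC-xxx"): fall back to the incident-based lookup.
        return lookup_pending_by_incident(approval_id)
```

Catching only ValueError keeps unrelated failures visible instead of silently routing them into the fallback path.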
OG T
71715506c3 chore(types): regenerate TypeScript types: Phase 26 ApprovalRequest + namespace fix
Some checks failed
Type Sync Check / check-type-sync (push) Failing after 51s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 11:50:43 +08:00
OG T
8d496e84e2 fix(test): update action_parsing tests: the default namespace without the -n flag is now awoooi-prod
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
action_planner.py default_namespace is already awoooi-prod; the test expectations are updated to match.
kubectl commands that explicitly specify -n default are unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 11:49:24 +08:00
OG T
b133631b2d feat(scripts): Phase 26 backfill script: rebuild KM entries from approval_records
All 225 historical alert handling records backfilled into knowledge_entries (INCIDENT_CASE)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 11:47:47 +08:00
OG T
658337ec18 fix(phase26): connect the full Incident→DB→KM chain + namespace fix
Some checks failed
CD Pipeline / build-and-deploy (push) Failing after 1m29s
Type Sync Check / check-type-sync (push) Failing after 52s
Root causes:
1. create_incident_for_approval stored only to Redis, not PostgreSQL
   → gone after the 7-day TTL, so Playbook extraction could never find the Incident
2. ApprovalRecord had no incident_id field
   → _trigger_playbook_extraction relied on a regex scan of Chinese text for INC-, which always failed
3. The operation_parser namespace fallback was "default"
   → all deployments live in awoooi-prod, so all 203 executions failed

Fixes:
- Incident is written to both Redis and PostgreSQL (save_to_episodic_memory)
- ApprovalRecord gains an incident_id field (model + ORM + migration)
- alertmanager_webhook writes incident_id back after creating the Approval
- _trigger_playbook_extraction uses approval.incident_id directly
- operation_parser DEFAULT_NAMESPACE = "awoooi-prod"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 11:46:05 +08:00