OG T
f9ba200638
fix(db): Phase 6 migration 三條 CREATE INDEX 拆開各自 execute
...
CD Pipeline / build-and-deploy (push) Has been cancelled
asyncpg 不支援 prepared statement 內多條 SQL 指令,
原本一個 text("""...""") 包含三條 CREATE INDEX 導致 CrashLoopBackOff。
拆成三個獨立 conn.execute() 呼叫。
2026-04-15 ogt + Claude Sonnet 4.6(亞太)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-15 19:37:58 +08:00
OG T
fab65e7d7a
fix(alerts): PENDING 收斂無 TTL → 老記錄永久封鎖 Telegram 告警
...
CD Pipeline / build-and-deploy (push) Has started running
根因:find_by_fingerprint 的 PENDING 匹配條件無時間上限,
2026-04-12 建立的 3 筆 PENDING approval records(hit=77/30/17)
持續吃掉所有同指紋告警,造成 2+ 小時 Telegram 靜音。
修正(approval_db.py):
- PENDING_TTL_HOURS = 24:PENDING 記錄逾 24h 不再收斂新告警
- 原本:OR(status=PENDING, created_at>=30min前)
- 修正:OR(PENDING AND created_at>=24h前, created_at>=30min前)
緊急修復:kubectl exec 直接將 7 筆過期 PENDING 記錄設為 expired,
即時恢復 Telegram 告警流(不等部署)。
Phase 6 AI 自我治理閉環(ADR-087):
- feat(db): 新增 ai_governance_events 表 + 3 個 index(base.py + models.py)
- feat(svc): ai_slo_calculator.py — 7d 滾動 SLO(success/override/false_neg)
- feat(svc): trust_drift_detector.py — Playbook 信任度極端偏態偵測
- feat(job): kb_rot_cleaner.py — K8s API/Prom metric/老舊 incident_case 腐爛清理
- feat(svc): decision_manager.py — 自我降級守衛(SLO 違反 → 提高門檻/保守模式)
2026-04-15 ogt + Claude Sonnet 4.6(亞太)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-15 18:56:26 +08:00
OG T
14a02263ae
feat(Phase 4): 主動巡檢 + 趨勢預測 + 8D 感官升級 全部完成
...
CD Pipeline / build-and-deploy (push) Failing after 12m32s
## Phase 4 完整交付(ADR-084)
### 新增服務
- trend_predictor.py: numpy 線性回歸,4h 閾值突破預警,R² 信心評分
- proactive_inspector.py: 每 5 分鐘主動巡檢協調器
- DynamicBaselineService(3σ 偏離)
- LogAnomalyDetector(新 Drain3 pattern)
- TrendPredictor(斜率外推 4h 預測)
- Shadow Mode + 30 分鐘去重 + Holt-Winters 背景重訓
### 8D 感官升級(EvidenceSnapshot Phase 4 增強)
- PreDecisionInvestigator._collect_phase4_anomalies(): 決策前讀取
ProactiveInspector 最近巡檢快取 + LogAnomalyDetector 新 pattern
- EvidenceSnapshot.anomaly_context: 新欄位,Phase 4 動態異常上下文
- DiagnosticianAgent._build_prompt(): prompt 包含 anomaly_context,
LLM RCA 可參考動態基線偏差與趨勢預警
### 資料庫遷移
- incident_evidence: ADD COLUMN anomaly_context JSONB(冪等)
### main.py
- 啟動 run_proactive_inspector_loop() asyncio task
2026-04-15 ogt + Claude Sonnet 4.6(亞太): Phase 4 全部完成
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-15 15:47:05 +08:00
OG T
952c10955b
fix(db): 多 replica 並行啟動競爭 — 每 table 獨立 tx + DROP INDEX IF EXISTS
...
CD Pipeline / build-and-deploy (push) Has been cancelled
根因:單一大 transaction 內兩個 pod 同時建同一個 table,
其中一個 CREATE INDEX 失敗 → 整個 transaction ROLLBACK
→ table 也消失 → 下次重啟同樣情況 → 無限 CrashLoop。
修法三層:
1. 每個 table 用獨立 transaction 建立(失敗不影響其他)
2. 建 table 前先 DROP INDEX IF EXISTS 清殘留孤兒 index
3. 捕捉 "already exists" 讓並行 pod 優雅跳過(不 crash)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-15 15:38:43 +08:00
OG T
bf45b80bd2
feat(Phase 3.5 + Phase 4): AI 學習成果持久化到 PostgreSQL — 修正「AI 失憶」架構缺陷
...
CD Pipeline / build-and-deploy (push) Has been cancelled
ADR-085: AI 學習成果不可存在 Cache
架構鐵律確立:
- PostgreSQL = System of Record(AI 的永久記憶)
- Redis = Warm Cache(加速讀取,TTL 到期從 PG 復原)
核心變更:
1. models.py: 新增 PlaybookRecord / DynamicBaselineRecord / LogClusterRecord ORM
2. base.py: ALTER TABLE playbooks 補加 trust_score / requires_approval_level 等欄位
3. playbook_repository.py: 完整雙寫實作(PG upsert + Redis cache)
4. dynamic_baseline_service.py: Holt-Winters 訓練結果寫入 PG,Redis 只作 24h warm cache
5. log_anomaly_detector.py: Drain3 cluster template 寫入 PG(UPSERT on cluster_id)
6. main.py: 啟動時執行 backfill_redis_to_pg()(Redis → PG 冪等補救)
修正的問題:
- Playbook 7天 Redis TTL 到期 → AI 失去所有修復知識
- trust_score EWMA 隨 Redis TTL 歸零 → AI 重新回到初始信任度 0.3
- Holt-Winters 基線 24h TTL → AI 每天重新學習「正常」的定義
- Drain3 cluster 沒有持久化 → AI 把已知 log pattern 反覆當新 pattern
Phase 4 新服務(requirements.txt 已加入 statsmodels + drain3 + numpy)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-15 15:34:04 +08:00
OG T
0f2ec7987c
fix(db): 改用 inspect 跳過現有 table,根治 CrashLoopBackOff
...
CD Pipeline / build-and-deploy (push) Failing after 14m42s
checkfirst=True 只跳過 CREATE TABLE,SQLAlchemy 2.0 仍對
__table_args__ Index 物件發出獨立 CREATE INDEX → duplicate error。
改法:先 inspect 取得現有 tables,只對不存在的 table 呼叫
table.create(),index 永遠只隨新 table 建立,不再 duplicate。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-15 15:18:25 +08:00
OG T
a142e6e937
fix(db): create_all checkfirst=True 修復 CrashLoopBackOff
...
CD Pipeline / build-and-deploy (push) Failing after 12m19s
rolling update 時 create_all 嘗試重建既有 index 導致
"ix_incident_evidence_incident_id already exists" 啟動失敗。
checkfirst=True 讓 SQLAlchemy 跳過已存在的 table/index,
init_db() 從此冪等,不再造成 CrashLoopBackOff。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-15 15:00:49 +08:00
OG T
896bef94ee
fix(web): pending-approvals-card 加防重複點擊 + loading 狀態
...
linter 自動強化: actioningId state 防止同一張卡重複操作
- disabled + opacity 0.6 + cursor not-allowed
- loading 時按鈕顯示 '...'
- finally() 確保 actioningId 清除
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 18:38:08 +08:00
OG T
628387de8c
fix: risklevel migration 自動化 + Telegram Whitelist 注入
...
E2E Health Check / e2e-health (push) Successful in 17s
CD Pipeline / build-and-deploy (push) Has been cancelled
1. init_db() 啟動時自動確保 risklevel enum 包含 'high' 值
(Phase 23 新增,避免舊 DB 缺值導致 InvalidTextRepresentation)
2. CD Pipeline 新增 OPENCLAW_TG_USER_WHITELIST 自動注入
(之前為 CHANGE_ME,已更新為實際 user ID 5619078117)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-04-02 09:13:13 +08:00
OG T
9bff46a1b0
feat: integrate Sentry + fix CI/CD issues
...
Sentry Integration (補強 SignOz):
- Add @sentry/nextjs for frontend error tracking + session replay
- Add sentry-sdk[fastapi] for backend error tracking
- Create sentry.client/server/edge.config.ts
- Integrate with next.config.js + instrumentation.ts
- Add Sentry exception capture in FastAPI error handler
- Create deployment scripts for Self-Hosted @ 192.168.0.110
CI/CD Fixes:
- Fix F821 Undefined name 'Field' in incidents.py
- Add NEXT_PUBLIC_API_URL env var to CI build step
- Add build-arg to Docker build verification
E2E Test Improvements:
- Fix strict mode violations in dashboard-acceptance tests
- Add timeout increase for Phase 4 demo tests
- Make tests more resilient to UI variations
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2026-03-24 15:19:52 +08:00
OG T
6f049877fc
fix(lint): ruff auto-fix + lewooogo-core src 加入 git
...
- Python: ruff --fix 修復 280 個 lint 錯誤
- lewooogo-core: src/ 目錄未追蹤,導致 CI eslint 失敗
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2026-03-23 23:51:37 +08:00
OG T
1576f2ab20
fix(db): eliminate SQLite brain-split, force PostgreSQL
...
Root cause: Worker used SQLITE_DATABASE_URL causing "no such table: incidents"
because each Pod had isolated SQLite file, not shared PostgreSQL.
Fixes:
- db/base.py: Use DATABASE_URL (PostgreSQL) instead of SQLITE_DATABASE_URL
- Added SQLite prohibition guard with logging
- Added pool_size and pool_pre_ping for production stability
New: packages/lewooogo-data PgMemoryProvider (Phase 6.4d)
- Episodic Memory implementation for PostgreSQL
- init_pg_engine() with auto table creation
- SQLite forbidden by Commander's decree
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2026-03-23 10:02:43 +08:00
OG T
196d269b92
feat: add all application source code
...
- apps/api: FastAPI backend with Dockerfile
- apps/web: Next.js frontend with Dockerfile
- apps/sensor: Signal collection agent
- packages: shared packages
Co-Authored-By: Claude <noreply@anthropic.com >
2026-03-22 18:57:44 +08:00