Commit Graph

19 Commits

Author SHA1 Message Date
Your Name
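6878e62af7 feat(flywheel): W1 PR-P1 + ADR-091 T1, flywheel 80→90 first wave
Based on the 10 broken links found by the onboarder's end-to-end closed-loop audit, plus the critic's full survey of iron-rule violations,
the first W1 wave fixes C1, the core broken link behind flywheel hard evidence 1 + 2.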

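## W1 PR-P1: guarding the four matched_playbook_id break points (C1 fix)
The fullstack investigation found that the 4 break points had already been fixed in an earlier session. This PR adds: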
- ENABLE_PLAYBOOK_MATCHING feature flag (default=true)
  rollback: kubectl set env deployment/awoooi-api ENABLE_PLAYBOOK_MATCHING=false
- a flag check at the entry of proposal_service._try_playbook_match_id
- 7 e2e tests as a safety net (there was previously no test coverage)
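A minimal sketch of the flag gate. It assumes the flag is read from an environment variable, which the `kubectl set env` rollback implies, and that only an explicit false-like value turns matching off. `flag_enabled`, `try_playbook_match_id` and the `match_fn` parameter are hypothetical names, not the actual `proposal_service` code.

```python
import os
from typing import Callable, Optional


def flag_enabled(name: str, default: bool = True) -> bool:
    """Read a boolean feature flag from the environment (hypothetical helper)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() not in {"false", "0", "no", "off"}


def try_playbook_match_id(
    incident: dict, match_fn: Callable[[dict], Optional[str]]
) -> Optional[str]:
    """Entry guard sketch: skip matching entirely when the flag is off."""
    if not flag_enabled("ENABLE_PLAYBOOK_MATCHING", default=True):
        return None
    return match_fn(incident)
```

Defaulting to `True` when the variable is unset matches the stated `default=true`. Setting the variable to `false` then acts as the rollback switch without a redeploy.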

Broken link C1 evidence chain: proposal_service.generate_proposal() → matched_playbook_id
→ approval_db → approval_repository → learning_service._update_playbook_stats.
After 24h, playbooks.trust_score should show real EWMA updates.
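The EWMA update that should move `playbooks.trust_score` can be sketched as follows. The smoothing factor `alpha=0.2` and the mapping of success/failure to 1.0/0.0 are assumptions for illustration. They are not values taken from `learning_service`.

```python
def ewma_update(previous: float, outcome: float, alpha: float = 0.2) -> float:
    """Exponentially weighted moving average: the new observation gets weight alpha."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    return alpha * outcome + (1.0 - alpha) * previous
```

With these assumed values, a success moves a score of 0.5 to about 0.6, and a following failure moves it to about 0.48. Older outcomes fade geometrically, so the score tracks recent playbook behavior.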

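## ADR-091 T1: auto_generate_rule dual-writes to the DB (hard evidence 1, first step)
Flywheel hard evidence 1: alert_rule_catalog.source='ai_generated' has 0 rows across the entire codebase.
auto_generate_rule() writes alert_rules.yaml but not the DB → the AI's self-learned rules and the catalog run on two decoupled tracks.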

Fix (per ADR-091 §1 D1):
- Add _insert_catalog_ai_generated(): dual-write after the YAML write succeeds
  source='ai_generated', confidence=0.5, review_status='draft', created_by_agent
- Add the _parse_for_to_seconds() helper ("30s"/"5m"/"2h" → seconds)
- ON CONFLICT (rule_name) DO NOTHING guarantees idempotency
- Transaction strategy: YAML and DB are not in the same transaction (YAML is already the SoT; a DB failure is only logged)
- ENABLE_AI_RULE_CATALOG_WRITE feature flag (default=true)
  rollback: kubectl set env deployment/awoooi-api ENABLE_AI_RULE_CATALOG_WRITE=false
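A minimal sketch of the dual-write ordering described above, using SQLite as a stand-in for the real database (SQLite 3.24+ accepts `ON CONFLICT ... DO NOTHING`). The table layout, the function signature, the status strings and the `write_yaml` callback are hypothetical. Only the column values and the YAML-first, log-on-DB-failure policy come from the commit message.

```python
import logging
import sqlite3
from typing import Callable

log = logging.getLogger("rule_catalog")

# Hypothetical schema: rule_name must be unique for ON CONFLICT (rule_name) to apply.
CATALOG_DDL = """
CREATE TABLE IF NOT EXISTS alert_rule_catalog (
    rule_name TEXT PRIMARY KEY,
    source TEXT NOT NULL,
    confidence REAL NOT NULL,
    review_status TEXT NOT NULL,
    created_by_agent TEXT
)
"""

INSERT_SQL = """
INSERT INTO alert_rule_catalog
    (rule_name, source, confidence, review_status, created_by_agent)
VALUES (?, 'ai_generated', 0.5, 'draft', ?)
ON CONFLICT (rule_name) DO NOTHING
"""


def auto_generate_rule(
    conn: sqlite3.Connection,
    rule_name: str,
    agent: str,
    write_yaml: Callable[[str], None],
    catalog_write_enabled: bool = True,
) -> str:
    """Write the YAML rule first (the SoT), then insert into the catalog on a best-effort basis.

    A YAML failure propagates. A DB failure is only logged, because YAML and DB
    are deliberately not in the same transaction.
    """
    write_yaml(rule_name)  # if this raises, nothing reaches the DB
    if not catalog_write_enabled:  # mirrors ENABLE_AI_RULE_CATALOG_WRITE=false
        return "flag_off"
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(INSERT_SQL, (rule_name, agent))
    except sqlite3.Error:
        log.exception("catalog dual-write failed for rule %s", rule_name)
        return "db_failed"
    return "catalog_written"
```

Calling it twice for the same rule leaves a single catalog row, which is the idempotency that `ON CONFLICT DO NOTHING` provides.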

13 tests: 8 for the parse helper + 5 for business logic (success/db_fail/idempotent/flag/SQL_lit)
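The behavior listed for `_parse_for_to_seconds()` ("30s"/"5m"/"2h" → seconds) can be sketched as follows. This is a standalone illustration, not the helper itself. Accepting only a single number-plus-unit token and rejecting everything else with `ValueError` are assumptions. The real helper may handle more of the Prometheus duration syntax, such as `1h30m`, `d`, `w` or `ms`.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}
_DURATION_RE = re.compile(r"^\s*(\d+)\s*([smh])\s*$")


def parse_for_to_seconds(value: str) -> int:
    """Convert an alert rule 'for' duration such as '30s', '5m' or '2h' to seconds."""
    match = _DURATION_RE.match(value)
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]
```

Raising on anything unrecognized avoids writing a misleading number into the catalog.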

## Verification
All 1572 unit tests green (+20 new: 7 for PR-P1 + 13 for ADR-091 T1)

## Expected impact
Flywheel autonomy score: 42 → 65 (+23 = C1 +3 + hard evidence 1 +20)

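## Known debt (surfaced by the critic PR review; to be handled in the next commit)
- 3 caller paths bypass the unified KMWriter contract (C1/M1/M2)
- KMWriter's idempotency claim does not match its implementation (M3: missing ON CONFLICT)
- Alertmanager equal:[] causes over-broad ("explosive") inhibition, and the version is unverified (M4/M5)
- drift checker regex is fragile (M7: should switch to AST)
- governance health score is distorted by skipped entries (M6)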

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 10:44:39 +08:00
Your Name
e96055eef9 fix(p0.4): three Playbook learning-chain fixes: partial index + race protection + manual-path wiring
Patches the DB / Repository / Service layers of the ADR-092 P0.4 Playbook EWMA learning loop.

DB layer (db-expert-fix by Engineer-B):
- ApprovalRecord.matched_playbook_id: drop index=True in favor of a partial index in __table_args__
  (WHERE matched_playbook_id IS NOT NULL); most rows are NULL, so a full index wastes space
- adr092_p1_learning_chain_rollback.sql: pure ROLLBACK SQL (run manually by the DBA)
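The effect of the partial index can be illustrated with SQLite, which also supports `CREATE INDEX ... WHERE`. This is only a stand-in sketch. The real change declares the index in `__table_args__` on a SQLAlchemy model, presumably against PostgreSQL. The table layout and index name here are hypothetical.

```python
import sqlite3
from typing import List, Sequence

INDEX_NAME = "ix_approval_matched_playbook_id"  # hypothetical name


def build_demo_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE approval_records ("
        " id INTEGER PRIMARY KEY,"
        " matched_playbook_id TEXT)"
    )
    # Only rows that reference a playbook are indexed; the NULL majority is skipped.
    conn.execute(
        f"CREATE INDEX {INDEX_NAME} ON approval_records (matched_playbook_id)"
        " WHERE matched_playbook_id IS NOT NULL"
    )
    return conn


def query_plan(conn: sqlite3.Connection, where_sql: str, params: Sequence = ()) -> List[str]:
    """Return the detail column of SQLite's EXPLAIN QUERY PLAN for a lookup."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT id FROM approval_records WHERE " + where_sql,
        params,
    ).fetchall()
    return [row[-1] for row in rows]
```

An equality lookup such as `matched_playbook_id = ?` implies `IS NOT NULL`, so the planner can use the partial index. A query for `IS NULL` cannot use it and falls back to a table scan.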

Repository layer:
- playbook_repository.py: SELECT FOR UPDATE prevents lost updates,
  so concurrent EWMA updates no longer overwrite each other
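SQLite has no `SELECT ... FOR UPDATE`. This sketch therefore uses `BEGIN IMMEDIATE`, which takes the database write lock before the read, to show the read-modify-write pattern that the row lock protects on the real database. The schema, the `sample_count` column and `ALPHA = 0.2` are hypothetical. The point is that every concurrent update lands instead of overwriting another one.

```python
import os
import sqlite3
import tempfile
import threading
from typing import Tuple

ALPHA = 0.2  # hypothetical EWMA smoothing factor


def init_db(path: str, start_score: float = 0.5) -> None:
    conn = sqlite3.connect(path)
    try:
        with conn:  # commit schema and seed row
            conn.execute(
                "CREATE TABLE playbooks (id INTEGER PRIMARY KEY,"
                " trust_score REAL NOT NULL, sample_count INTEGER NOT NULL)"
            )
            conn.execute("INSERT INTO playbooks VALUES (1, ?, 0)", (start_score,))
    finally:
        conn.close()


def record_outcome(path: str, playbook_id: int, outcome: float) -> None:
    conn = sqlite3.connect(path, timeout=30, isolation_level=None)
    try:
        conn.execute("BEGIN IMMEDIATE")  # lock first, then read: no lost update
        score, count = conn.execute(
            "SELECT trust_score, sample_count FROM playbooks WHERE id = ?",
            (playbook_id,),
        ).fetchone()
        conn.execute(
            "UPDATE playbooks SET trust_score = ?, sample_count = ? WHERE id = ?",
            (ALPHA * outcome + (1 - ALPHA) * score, count + 1, playbook_id),
        )
        conn.execute("COMMIT")
    except Exception:
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise
    finally:
        conn.close()


def demo(n: int = 20) -> Tuple[float, int]:
    """Apply n concurrent successful outcomes to a fresh database file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "playbooks.db")
        init_db(path)
        threads = [
            threading.Thread(target=record_outcome, args=(path, 1, 1.0))
            for _ in range(n)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        conn = sqlite3.connect(path)
        try:
            return conn.execute(
                "SELECT trust_score, sample_count FROM playbooks WHERE id = 1"
            ).fetchone()
        finally:
            conn.close()
```

Because each writer re-reads the row only after it holds the lock, `sample_count` ends at exactly `n`. If the read happened before the lock, two threads could read the same score and one update would be lost.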

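Service layer (P0.4 fix):
- proposal_service.py: add the missing _try_playbook_match_id call to the manual approval path.
  The decision_manager auto_execute path already had this logic (line 2035);
  this closes the gap in the manual path so matched_playbook_id can be written to the DB, which the EWMA needs in order to evolve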

Tests:
- test_playbook_repository_race_condition.py: 3 cases confirming that SELECT FOR UPDATE
  correctly blocks concurrent EWMA updates (pass)

Note: the migration SQL awaits manual execution by the DBA (feedback_dev_prod_separation.md);
      alembic upgrade is not run (prohibited by the status document).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 20:19:46 +08:00
Your Name
0d81b28b1b fix(aiops): bound phase2 timeout and repair incident links
All checks were successful
E2E Health Check / e2e-health (push) Successful in 52s
CD Pipeline / build-and-deploy (push) Successful in 9m24s
2026-04-24 23:53:56 +08:00
OG T
890e2a9568 fix(review): architecture review fixes: P0 import crash + zero hardcoded i18n strings + silent errors
P0:
- proposal_service.py: add the missing get_redis + INCIDENT_KEY_PREFIX imports
  (before the fix, resolve_incident_after_approval always crashed with NameError)

P1 i18n:
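- page.tsx: remove emoji from topology groups; use tTopo() i18n keys instead
- page.tsx: host labels (e.g. "DevOps金庫", DevOps vault) switched to tTopo() i18n
- ai-model-status.tsx: add useTranslations; hardcoded "AI 模型狀態" (AI model status) → t('aiModelStatus')
- disposition-mini.tsx: hardcoded "查看完整報表" (View full report) → t('viewAllReport')
- recent-activity.tsx: hardcoded "查看活動串流" (View activity stream) → t('viewAllAlerts')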

P2 quality:
- pending-approvals-card.tsx: approve/reject now check r.ok and display errors; "查看全部授權" (View all authorizations) gets a route + i18n
- page-tabs.tsx: TabSkeleton "載入中..." (Loading...) → t('loading')
- page.tsx: hardcoded ↑5% → dynamic value via tDashboard('trendUp', {pct})
- page.tsx: hardcoded Prometheus '23' → '-- targets'

New i18n keys (zh-TW and en kept in sync):
- dashboard: viewAllAlerts/viewAllAuth/viewAllReport/aiModelStatus/loading/trendUp
- topology: groupExternal/allReachable/investigating/hostDevops/hostAiData/hostK3sMaster/hostK3sWorker

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:34:50 +08:00
OG T
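ae9780837d fix(proposal): prefer kubectl_command for action, fixing the root bug where execution was always skipped after approval
Root cause: approval_records.action stored the LLM action_title (a Chinese title such as 「重啟 sentry 服務」, "restart sentry service"),
which parse_operation_from_action() cannot parse, so execute_approved_action() skipped every time.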

Fix: action now takes llm_proposal["kubectl_command"] (an executable kubectl command) first,
and falls back to action_title only when there is no kubectl_command.
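The selection rule can be sketched as a small function. `choose_action` is a hypothetical name. The dictionary keys `kubectl_command` and `action_title` come from the commit message. Treating an empty or whitespace-only command as missing is an assumption.

```python
from typing import Any, Mapping


def choose_action(llm_proposal: Mapping[str, Any]) -> str:
    """Prefer the executable kubectl command; fall back to the human-readable title."""
    command = llm_proposal.get("kubectl_command")
    if isinstance(command, str) and command.strip():
        return command.strip()
    return str(llm_proposal.get("action_title", ""))
```

The stored action is then something `parse_operation_from_action()` can actually parse whenever the LLM supplied a command. Only proposals without one still carry a free-text title.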

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 18:13:22 +08:00
OG T
394f85954e fix(api): fix Y/n 404 + disable Multi-Sig
Some checks failed
E2E Health Check / e2e-health (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
1. proposal_service._load_incident() now uses incident_service.get_from_working_memory()
   - the brain engine uses the awoooi:incidents: prefix, but the data actually lives under the incident: prefix
   - the prefix mismatch made every lookup return 404 (all Y/n buttons failed)
   - 2026-04-02 ogt

2. trust_engine CRITICAL required_signatures 2→1
   - Commander's decision: every approval needs only 1 level of sign-off
   - 2026-04-02 ogt

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 16:16:28 +08:00
OG T
44840f5e73 fix(service): #123 proposal_service.py: fix key prefix + remove duplicated logic
ADR-046 fix: proposal_service used the wrong Redis key prefix "incident:"
(brain uses "awoooi:incidents:"), which broke load/persist after R-R2.

Changes:
- _load_incident(): delegates to IncidentEngineAdapter.get_incident()
  (correct key prefix, including brain→local type conversion)
- _persist_incident(): the Redis part delegates to brain DualIncidentMemory,
  storing after conversion via local_to_brain() (consistent key prefix)
- Remove the duplicated _record_to_incident() logic (already handled by IncidentEngineAdapter)
- Remove the INCIDENT_KEY_PREFIX constant
- Remove the direct get_redis() dependency
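This failure mode is easy to reproduce with any key-value store. If the writer and the reader build keys with different prefixes, every lookup misses and the API answers 404. The sketch below uses a dict in place of Redis, and its helper names are hypothetical. The two prefixes are the ones named in this commit. Routing both paths through one key builder (`incident_key` here) follows the same idea as delegating to a single adapter.

```python
from typing import Dict, Optional

BRAIN_PREFIX = "awoooi:incidents:"


def incident_key(incident_id: str) -> str:
    """Single source of truth for the incident key (hypothetical helper)."""
    return f"{BRAIN_PREFIX}{incident_id}"


def persist(store: Dict[str, str], incident_id: str, payload: str) -> None:
    store[incident_key(incident_id)] = payload


def load(store: Dict[str, str], incident_id: str) -> Optional[str]:
    return store.get(incident_key(incident_id))


def load_with_wrong_prefix(store: Dict[str, str], incident_id: str) -> Optional[str]:
    """Reproduces the bug: reading with 'incident:' misses data written by brain."""
    return store.get(f"incident:{incident_id}")
```

Once the prefix exists in only one place, a mismatch between writer and reader can no longer occur.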

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-01 09:11:57 +08:00
OG T
dd526684ab feat(ai): Phase 22 OpenClaw + Nemotron collaboration architecture (ADR-044)
All checks were successful
E2E Health Check / e2e-health (push) Successful in 17s
The Commander approved implementing an "arbitration / execution split" architecture:
- OpenClaw = arbiter (Why + Risk Level)
- Nemotron = executor (How + kubectl Command)

New features:
- config.py: ENABLE_NEMOTRON_COLLABORATION Feature Flag
- openclaw.py: generate_incident_proposal_with_tools()
- openclaw.py: _call_nemotron_tools() for calling Nemotron
- telegram_gateway.py: Nemotron fields on TelegramMessage
- telegram_gateway.py: format_with_nemotron() two-section format
- decision_manager.py: integrate the collaboration methods
- proposal_service.py: integrate the collaboration methods

Trigger conditions:
- LOW risk → OpenClaw only
- MEDIUM/HIGH/CRITICAL → OpenClaw + Nemotron dual-track
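The trigger rule reduces to a routing decision on the risk level. The sketch assumes a `Risk` enum with the four levels named above. It also assumes that turning `ENABLE_NEMOTRON_COLLABORATION` off falls back to OpenClaw alone. The returned engine names are illustrative labels, not the real call sites in `openclaw.py`.

```python
from enum import Enum
from typing import List


class Risk(Enum):
    LOW = "LOW"
    MEDIUM = "MEDIUM"
    HIGH = "HIGH"
    CRITICAL = "CRITICAL"


def engines_for(risk: Risk, collaboration_enabled: bool = True) -> List[str]:
    """OpenClaw always arbitrates; Nemotron joins for MEDIUM and above."""
    engines = ["openclaw"]  # arbiter: why + risk level
    if collaboration_enabled and risk is not Risk.LOW:
        engines.append("nemotron")  # executor: how + kubectl command
    return engines
```

Keeping the LOW path single-engine saves a model call for incidents where a concrete kubectl plan adds little.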

Chief architect review: 83/100, conditionally approved

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-31 18:52:53 +08:00
OG T
ef54cf46c9 fix(api): fix mypy type errors; fill in missing Incident fields
2026-03-24 10:48:15 +08:00
OG T
ec7e45d538 fix(api): 修復 Incident-Approval 狀態同步 BUG
🔴 P0 core functionality fix:

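Problem: after approval, reloading the page makes the Y/n buttons reappear
Root cause: resolve_incident_after_approval silently skipped when the incident was missing from Redis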

Fix:
1. proposal_service.py - handle the case where the incident is missing from Redis
2. approvals.py - add detailed trace logging
3. Set the resolved_at timestamp

Defensive hardening:
- Log the metadata contents
- Log resolve success/failure status
- Warn when there is no incident_id

Long-term conventions:
- Add the feedback_incident_approval_sync.md memory file
- Update the API path conventions in HARD_RULES.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-24 10:39:22 +08:00
OG T
6f049877fc fix(lint): ruff auto-fix + add lewooogo-core src to git
- Python: ruff --fix fixes 280 lint errors
- lewooogo-core: the src/ directory was untracked, which made CI eslint fail

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-23 23:51:37 +08:00
OG T
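f78aab8b2a fix(api): DecisionToken state sync (Y/n persistence fix)
Root cause:
- resolve_incident_after_approval updated only Incident.decision.state
- it did not update the separately stored DecisionToken (decision:{token} key)
- so on the next poll, get_or_create_decision returned the old token still in READY state
- and the frontend kept showing the Y/n buttons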

Fix:
- resolve_incident_after_approval now also sets the DecisionToken state to COMPLETED
- keeps the state consistent across the whole decision chain
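This bug is a case of the same state living under two keys. The sketch below uses a dict as a stand-in for Redis. The `incident:{id}` key layout, the JSON payload shape, the function name and setting the embedded decision state to COMPLETED are assumptions. Only the `decision:{token}` key and the COMPLETED token state come from the message. Resolving must update both records so the next poll cannot return a READY token again.

```python
import json
from typing import Dict


def resolve_decision_state(store: Dict[str, str], incident_id: str) -> None:
    """Mark both the incident's embedded decision and the standalone token COMPLETED."""
    incident_key = f"incident:{incident_id}"  # hypothetical key layout
    incident = json.loads(store[incident_key])
    incident["decision"]["state"] = "COMPLETED"
    store[incident_key] = json.dumps(incident)

    # Without this second write, the next poll finds the old READY token.
    token_key = f"decision:{incident['decision']['token']}"
    token = json.loads(store[token_key])
    token["state"] = "COMPLETED"
    store[token_key] = json.dumps(token)
```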

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-23 23:46:21 +08:00
OG T
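c8558cda9e fix(api): treat a missing DB record as success on resolve
Root cause: an Incident may exist only in Redis if its DB write failed
Fix: count it as a success as long as the Redis update succeeds (the API reads only from Redis)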
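The new success rule makes Redis decide the outcome and treats the DB as best-effort. A minimal sketch, with hypothetical callables standing in for the two stores. Each callable returns whether it found and updated the record. Returning a boolean and logging a missing DB row are assumptions consistent with the message.

```python
import logging
from typing import Callable

log = logging.getLogger("resolve")


def resolve(update_redis: Callable[[], bool], update_db: Callable[[], bool]) -> bool:
    """Succeed if Redis was updated; a missing DB row is logged, not treated as failure."""
    if not update_redis():
        return False
    if not update_db():
        # The incident may exist only in Redis because its earlier DB write failed.
        log.warning("incident not found in DB; Redis update is authoritative")
    return True
```

Since the API reads only from Redis, this rule matches what users see: once Redis is resolved, the Y/n buttons stay gone.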

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-23 23:09:46 +08:00
OG T
d60cb54c08 fix(api): resolve_incident_after_approval uses direct update logic
Reason: the indirect update through _persist_incident failed
Fix: switch to direct Redis + DB updates (the same logic as the debug endpoint)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-23 22:31:18 +08:00
OG T
03ca124967 fix(api): _persist_incident adds an explicit commit + trace logging
Root cause: DB changes were never committed, so Incident status updates did not persist

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-23 22:02:00 +08:00
OG T
ac3bf97920 fix(api): set Incident status to RESOLVED after sign-off
Root cause: Incident.status was not updated after a successful sign-off, so the Y/n buttons reappeared after a page refresh

Fix:
- proposal_service.py: add the resolve_incident_after_approval() method
- approvals.py: call it to update the Incident status after sign_approval succeeds
- look up the related Incident via metadata.incident_id

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-23 21:37:50 +08:00
OG T
7478dc0254 feat(phase6-9): Complete modular architecture and Agent Teams
Phase 6.4 - Modular Architecture:
- Add lewooogo-brain adapters for LLM providers
- Add lewooogo-data dual memory (Redis + PostgreSQL)
- Implement consensus engine for multi-agent decisions
- Add incident memory service for historical context

Phase 9 - Agent Teams (Claude Agent SDK):
- Add base agent class with Claude Sonnet 4 integration
- Implement action planner, blast radius, and security agents
- Add agent API endpoints and proposal workflow
- Integrate ADR-009 OpenClaw Agent Teams architecture

DevOps & CI/CD:
- Add GitHub Actions CI/CD workflows (ci.yaml, cd.yaml)
- Add pre-commit hooks and secrets baseline
- Add docker-compose for local development
- Update Kubernetes network policies

Frontend Improvements:
- Add auto-healing error boundary component
- Update i18n messages for agent features
- Enhance dual-state incident card with execution feedback

Documentation:
- Add 7 ADRs covering MCP, design system, architecture decisions
- Update ARCHITECTURE_MEMORY.md with modular design
- Add GLOBAL_RULES.md and SOUL.md for project identity

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-23 18:40:36 +08:00
OG T
141df533cc feat(api): Phase 6.4 LLM-based proposal generation with cache
- Add _call_with_cache wrapper in OpenClaw (Redis-based LLM cache)
- Add generate_incident_proposal method for incident analysis
- Integrate ProposalService with OpenClaw LLM
- Fallback to template-based proposals if LLM fails
- Include LLM metadata (provider, confidence, cache status) in proposals
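The cache-then-fallback flow can be sketched with an in-memory dict in place of Redis. The key derivation (SHA-256 over provider and prompt), the TTL, the metadata field names and the template fallback string are assumptions for illustration. OpenClaw's `_call_with_cache` may differ.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


class LLMCache:
    """In-memory stand-in for a Redis-backed LLM response cache."""

    def __init__(self, ttl_seconds: float = 3600.0) -> None:
        self.ttl = ttl_seconds
        self._data: Dict[str, Tuple[float, str]] = {}  # key -> (expires_at, value)

    @staticmethod
    def key(provider: str, prompt: str) -> str:
        digest = hashlib.sha256(f"{provider}\n{prompt}".encode("utf-8")).hexdigest()
        return f"llm:{provider}:{digest}"

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None or entry[0] < time.monotonic():
            return None  # miss or expired
        return entry[1]

    def set(self, key: str, value: str) -> None:
        self._data[key] = (time.monotonic() + self.ttl, value)


def call_with_cache(
    cache: LLMCache,
    provider: str,
    prompt: str,
    llm: Callable[[str], str],
    template_fallback: str,
) -> Dict[str, object]:
    key = LLMCache.key(provider, prompt)
    cached = cache.get(key)
    if cached is not None:
        return {"text": cached, "provider": provider, "cache_hit": True}
    try:
        text = llm(prompt)
    except Exception:
        # LLM failure: fall back to a template-based proposal (not cached).
        return {"text": template_fallback, "provider": "template", "cache_hit": False}
    cache.set(key, text)
    return {"text": text, "provider": provider, "cache_hit": False}
```

Not caching the fallback means that a transient LLM outage does not pin the template answer for the whole TTL.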

Constitutional clause: caching must be used to protect compute resources; uncached bare calls are strictly forbidden

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-03-23 01:33:46 +08:00
OG T
196d269b92 feat: add all application source code
- apps/api: FastAPI backend with Dockerfile
- apps/web: Next.js frontend with Dockerfile
- apps/sensor: Signal collection agent
- packages: shared packages

Co-Authored-By: Claude <noreply@anthropic.com>
2026-03-22 18:57:44 +08:00