Your Name
8629ac709b
feat(awooop): Phase 1-8 完整實作 — AwoooP Agent Platform 六平面架構
...
run-migration / migrate (push) Failing after 59s
Code Review / ai-code-review (push) Successful in 1m8s
Type Sync Check / check-type-sync (push) Successful in 2m27s
## Phase 1-3: Control Plane + Contract System
- awooop_phase1_control_plane_2026-05-04.sql: 12 張核心表 + RLS
- awooop_phase1_batch1_rls_2026-05-04.sql: 全部 FORCE RLS + GRANT
- packages/awooop-contracts/: 六合約 JSON Schema + golden fixtures
- src/models/awooop_contracts.py: Pydantic v2 contract models(extra=forbid)
- src/repositories/contract_repository.py: contract lifecycle(draft→published→active)
- src/services/contract_service.py: HMAC publish sig + Redis multi-sig activate
- src/services/schema_validator.py: LLM output validator(retry×3, E-SCHEMA-001)
## Phase 2: Tenant Isolation
- awooop_phase2_budget_ledger_2026-05-04.sql: budget_ledger + RLS
- src/services/budget_service.py: Token Budget Hard Kill 三層防線
- src/core/context.py: PROJECT_ID ContextVar(31 background loop 自動繼承)
- src/db/base.py + models.py: project_id 欄位 + RLS set_config 注入
- src/hermes/nl_gateway.py: project_id Redis key 前綴(Phase A 雙寫)
- src/services/anomaly_counter.py: per-project 改造(Phase A fallback)
## Phase 4: Platform Shell in Shadow Mode
- awooop_phase4_run_state_2026-05-04.sql: run_state + step_journal + idempotency
- src/services/run_state_machine.py: 8-state FSM + SKIP LOCKED + stale reaper
- src/services/platform_runtime.py: UUID v7 + W3C trace_id + shadow_execute
- src/services/audit_sink.py: PII/secret redaction 9 patterns
- src/api/v1/platform/runs.py: POST/GET /v1/platform/runs(Router→Service 架構)
- src/workers/platform_worker.py: SKIP LOCKED worker + heartbeat + reaper loop
- src/main.py: platform router + lifespan worker start/stop
## Phase 5: MCP Gateway 五閘門
- awooop_phase5_mcp_gateway_2026-05-04.sql: 4 表 + RLS
- src/plugins/mcp/gateway.py: McpGateway(Gate 1~5, E-MCP-GATE-001~009)
- src/plugins/mcp/redaction_middleware.py: 雙層 redaction + 16K 截斷
- src/plugins/mcp/registry.py: __provider name mangling(ADR-116)
- src/plugins/mcp/credential_resolver.py: k8s secret ref 解析
- tests/test_mcp_credential_isolation.py: 10 個迴歸測試(secret leak 防再現)
## Phase 6-8: EwoooC + Channel Hub + Approval Token
- awooop_phase6_ewoooc_onboarding_2026-05-04.sql: ewoooc tenant + 4 read-only MCP tools
- awooop_phase7_channel_hub_2026-05-04.sql: conversation_event + outbound_message
- src/services/provider_proxy.py: ProviderProxy + PlatformEnvelope(ADR-115)
- src/services/channel_hub.py: Telegram inbound mirror + Progressive Feedback(30s)
- src/services/awooop_approval_token.py: HS256 + jti NX replay 防護 + suggest mode
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-05-04 19:31:53 +08:00
Your Name
13e51802fe
feat(awooop): Phase 0 全 ADR + Phase 1 control plane schema(含 critic 四項修正)
...
## Phase 0(文件層,全部 Accepted)
- ADR-106/107:AwoooP 平台架構 + 儲存策略
- ADR-111~118:Bootstrap → RLS 七項核心 ADR
- ADR-119~124:SAGA → Singleton Decomposition 六項 ADR
- ADR-UI-01~04:Operator Console 四個 UI ADR
## Phase 1(DB schema + migration)
- awooop_phase1_control_plane_2026-05-04.sql:7 張新表 + trigger + RLS
- Step 1:三角色(platform_admin/migration BYPASSRLS,awooop_app 受 RLS)
- Step 13:GRANT awooop_app 最小權限(7 條)
- Step 14:RLS fail-closed,移除 __platform__ 後門
- awooop_phase1_batch1_rls_2026-05-04.sql:高流量四表三步式 ADD COLUMN
- awooop_phase1_batch1_backfill.py:SKIP LOCKED 分批回填腳本
- awooop_models.py:7 個 SQLAlchemy 2.x models
## Critic 修正(4 Critical + 3 Major)
- C-1:ADD CONSTRAINT IF NOT EXISTS → DO 塊 + pg_constraint 查詢
- C-2:__mapper_args__ 字串 list → primary_key=True on mapped_column
- C-3:__platform__ RLS 後門 → 全移除,改用 BYPASSRLS role
- C-4:awooop_app role 從未建立 → Step 1 + 7 條 GRANT
- M-1:active_pointer_guard SECURITY DEFINER(FORCE RLS 跨租戶保護)
- M-2:pg_partman create_parent 加冪等防護
- M-3:immutability trigger 新增身份欄位保護(project_id/family/contract_id)
## Task 1.2 修補
- agent_loader.py:硬編碼 Mac 路徑 → AGENTS_DIR 環境變數
- Dockerfile:補 COPY .claude/agents/
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-05-04 13:37:11 +08:00
Your Name
b4055c5915
feat(embedding): ADR-110 升級 bge-m3:latest 1024 維向量
...
Code Review / ai-code-review (push) Successful in 57s
run-migration / migrate (push) Failing after 44s
GCP-A (34.143.170.20) 無 nomic-embed-text,改用 bge-m3:latest(專用
多語言 embedding 模型),產生 1024 維向量。
變更:
- embedding_service.py: 加入 bge-m3:latest=1024 維到 MODEL_DIMENSIONS,
預設模型改為 bge-m3:latest,更新文件說明
- playbook_embedding_repository.py + interfaces.py: 更新維度說明
- migrations/embedding_bge_m3_1024.sql: pgvector schema 遷移
rag_chunks + playbook_embeddings vector(768) → vector(1024)
- scripts/reembed_bge_m3.py: 遷移後重新嵌入現有資料的 script
遷移步驟:
1. 執行 embedding_bge_m3_1024.sql(清空現有 768 維向量,變更維度)
2. 執行 python scripts/reembed_bge_m3.py 重新嵌入
2026-05-04 ogt + Claude Sonnet 4.6
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-05-04 11:18:20 +08:00
Your Name
e45b055e0e
feat(governance): AI 治理事件處理鏈四軌交付(C/D/B/A)
...
Code Review / ai-code-review (push) Successful in 48s
run-migration / migrate (push) Failing after 45s
CD Pipeline / tests (push) Successful in 3m46s
Type Sync Check / check-type-sync (push) Successful in 2m8s
CD Pipeline / build-and-deploy (push) Failing after 31m14s
CD Pipeline / post-deploy-checks (push) Has been skipped
【十二人專家團隊全景掃描 + 並行四軌實施】
統帥質疑「有讓 12-agent 一起協作嗎」後,依照團隊規則完成全鏈路交付:
onboarder + critic + db-expert + debugger + frontend-designer 並行掃描,
找到 6 大 Gap,再由 fullstack-engineer × 4、refactor-specialist 協作落地。
【Track C — trust_drift 雙寫整併】
兩條獨立寫 event_type=trust_drift 路徑互不呼叫,下游 consumer 拿到雙份資料
無法判定 source-of-truth。整併保留 governance_agent.check_trust_drift(功能
更全:auto-deprecate + Telegram + PG),TrustDriftDetector 降為純統計 lib,
W-6 watchdog 改呼叫 governance_agent。新增 TestSinglePgWritePerDriftScenario
驗證同一 drift 場景只觸發一次 PG 寫入。
變更:
- apps/api/src/services/trust_drift_detector.py(lib only,不再寫 PG)
- apps/api/tests/test_trust_drift_watchdog.py(W-6 改 mock governance_agent)
【Track D — governance_remediation_dispatch 派遣表】
ai_governance_events 是不可變 Event Sourcing,不能塞執行狀態。新建派遣表
作為投影層:1 event → 0..N dispatches,狀態可變、可重試、可審計。
- PgEnum 5 種 event_type + 7 階段狀態機(pending → dispatched → executing →
succeeded/failed/cancelled/skipped)
- 失敗重試 INSERT 新 row(不改舊 row 的 status,保留審計痕跡)
- Partial unique index ux_grd_one_active_per_event 強制「同事件唯一活躍」
- 4 個複合 index 支援 worker poll、去重查詢、觀測面板
- FK 對應 ai_governance_events / playbooks / incidents / approval_records
全部 SET NULL(avoid cascade lock,但 governance_event 用 RESTRICT)
變更:
- apps/api/src/db/models.py(GovernanceRemediationDispatch ORM class)
- apps/api/migrations/governance_remediation_dispatch_2026-05-03.sql
- apps/api/src/repositories/governance_remediation_dispatch_repo.py
(6 個 async 函式 + 3 個自訂例外:DispatchAlreadyActive /
InvalidStatusTransition / DispatchNotFound)
- apps/api/src/models/governance_dispatch.py(DecisionContextV1 等 4 schema)
- apps/api/tests/test_governance_remediation_dispatch.py(29 tests)
【Track B — /governance 頁面】
後端 PR1 三個 endpoint + 前端 PR2-5 完整三 Tab。
PR1 後端:
- GET /api/v1/ai/governance/events(events_tab,含 event_type/severity/
狀態/時間範圍篩選 + 分頁)
- GET /api/v1/ai/governance/queue(queue_tab,含 graceful fallback:
dispatch 表不存在時回 table_pending=True 不拋 500)
- GET /api/v1/ai/governance/summary(slo_tab 30d 違反時序圖)
- severity 映射規則寫死(critic 建議未來移 settings)
PR2-5 前端:
- /governance 路由 + AppLayout + Compliance Badge 橫幅 + PageTabs
- SLO Tab:3 KPI 卡片(Syne 28px + StatusOrb + 7d sparkline)+
30d 違反 stacked BarChart
- Events Tab:篩選列 + 表格 + inline 展開行(JSON / 修復建議 / 派遣記錄)
- Queue Tab:HITL 待辦卡片 + 信任度進度條 + 批准/拒絕按鈕(本 PR console.log)
- Sidebar 加入「AI 治理」入口(ShieldCheck icon)
- i18n 雙語完整(governance namespace + nav.governance)
- 7 個新元件:slo-kpi-card / slo-violation-chart / events-table /
events-filter-bar / event-detail-drawer / queue-item-card / queue-history-tabs
變更:
- apps/api/src/api/v1/ai_governance.py(router)
- apps/api/src/services/governance_query_service.py
- apps/api/src/models/governance.py(Pydantic V2 schemas)
- apps/api/tests/test_ai_governance_endpoints.py(21 tests)
- apps/web/src/app/[locale]/governance/(page + 3 tabs)
- apps/web/src/components/governance/(7 元件)
- apps/web/messages/{zh-TW,en}.json(governance namespace)
- apps/web/src/components/layout/sidebar.tsx(+1 行)
- apps/api/src/main.py(router include)
【Track A — GovernanceDispatcher 決策融合】
把治理事件接到 remediation 執行器,走北極星方向決策融合(LLM × Playbook trust
× MCP),符合「禁寫死規則」鐵律。
- 設計鐵律:DecisionFusionAdapter 是新增 wrapper,**不修改任何 Tier 3 檔**
(decision_manager / learning_service / trust_engine),只 consume 既有 API
- 三維融合公式:confidence = 0.4×llm + 0.3×playbook_trust + 0.3×mcp_consistency
(權重加 TODO 標明未來由 AI 自學調整)
- 三分支決策路徑:
confidence ≥ 0.85 → auto_dispatch(status=dispatched)
0.65 ≤ confidence < 0.85 → pending_approval(HITL)
confidence < 0.65 → skip + log
- decision_context JSONB 完整記錄三維輸入快照(給未來 fine-tune 用)
- poll 30s 掃 unresolved 事件,仿 governance loop 模式
- 重複事件擋去重(呼叫 get_active_for_event)
變更:
- apps/api/src/services/governance_dispatcher.py
- apps/api/src/services/decision_fusion_adapter.py
- apps/api/tests/test_governance_dispatcher.py(14 tests)
- apps/api/src/main.py(lifespan task 接 run_governance_dispatcher_loop)
【驗證】
1836 個 unit test 全過(29 skipped 為既有 PG integration env 問題)
【調度教訓 — 已記入 memory】
- vuln-verifier 應在 fullstack-engineer **之前**跑(避免並行讀到已修代碼誤判)
- critic 雙輪審查不可省(第二輪抓到 NaN sentinel + Prom rule 連鎖)
- 北極星「禁寫死規則」搭配 decision-fusion 確實實施
【未動 Tier 3 — 已驗證】
git diff 確認本 commit 完全沒改 decision_manager.py / learning_service.py /
trust_engine.py,只新增 wrapper service consume 既有 API。
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-05-03 12:42:40 +08:00
Your Name
7e4d995e4b
feat(aiops): add mcp agent loop foundation
CD Pipeline / tests (push) Successful in 1m59s
Code Review / ai-code-review (push) Successful in 28s
run-migration / migrate (push) Failing after 24s
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-05-01 13:21:19 +08:00
Your Name
337bcb912e
fix(db): tolerate knowledge enum owner mismatch
CD Pipeline / tests (push) Successful in 1m48s
Code Review / ai-code-review (push) Successful in 27s
run-migration / migrate (push) Successful in 22s
CD Pipeline / build-and-deploy (push) Failing after 31m4s
CD Pipeline / post-deploy-checks (push) Has been skipped
2026-05-01 11:08:21 +08:00
Your Name
3a6acae408
fix(km): add phase25 knowledge enum labels
CD Pipeline / tests (push) Successful in 2m14s
Code Review / ai-code-review (push) Successful in 26s
run-migration / migrate (push) Failing after 24s
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
2026-05-01 11:03:03 +08:00
Your Name
474b913ac9
chore(db): add playbook versioning migration
CD Pipeline / tests (push) Successful in 1m32s
Code Review / ai-code-review (push) Successful in 27s
run-migration / migrate (push) Failing after 13s
CD Pipeline / build-and-deploy (push) Has started running
CD Pipeline / post-deploy-checks (push) Has been cancelled
E2E Health Check / e2e-health (push) Successful in 43s
2026-04-30 23:53:19 +08:00
Your Name
c5753e1c57
fix(critic-review): KMWriter 名實統一 + Alertmanager 修抑制 + drift checker AST 化
...
critic PR review 揭示已 push commits 的 7 個 blocker,本 commit 全部修復。
## C1 + C2 + M1 + M2 + M3 — KMWriter 真正統一契約(critic 最嚴重 5 條)
### C1 km_writer.py:194 — backfill 自打臉修
- 裸 asyncio.create_task(_backfill_path_a_approval) → await _backfill_path_a_approval_safe()
- 同步 await + 獨立 DLQ km:backfill:dlq + try/except 不阻塞主寫入
- 新增 km_backfill_reconciler_job.py(每 5 分鐘掃 DLQ)+ ENABLE_KM_BACKFILL_RECONCILER flag
- 防 Path B 比 Path A 先完成 → related_approval_id 永遠 NULL 的 race
### C2 km_writer.py:391 — KM_WRITE_AWAIT=false 路徑收緊
- 從 ensure_future(fire-and-forget 比舊版同步寫更糟)
- 改 await writer.write(retry=1, timeout=2.0)(仍 await 但只試一次、超時短)
- docstring 明確標註「緊急回滾用,不保證可靠性」
### M1 decision_manager.py:2178/2203 — 移除 _fire_and_forget 旁路
- 兩處 _fire_and_forget(executor.write_execution_result_to_km(...))
- 改 await asyncio.shield(...) + BaseException 保護(防上層 cancel 中斷)
- KM_WRITE_AWAIT=true 在這條路徑終於真正 await
### M2 incident_service.py:1099 — 自製 path 加 retry+DLQ
- 原本 if settings.KM_WRITE_AWAIT: await asyncio.wait_for else create_task
- 改 3 次指數退避 retry + DLQ 保護(呼叫 km_writer 私有 helper)
### M3 km_writer.py:166 — 冪等聲明對齊實作
- knowledge_repository.create() 加 UPSERT 路徑(pg_insert ON CONFLICT DO UPDATE)
- KnowledgeEntryCreate / KnowledgeEntryRecord 加 path_type 欄位
- migration: ADD COLUMN path_type + partial unique index uix_knowledge_incident_path
## M4 alertmanager.yml — equal: [] 收緊(critic 防爆炸抑制)
- OllamaInstanceDown / KMConverterDown 抑制加 equal: ['cluster'] 約束
- 防多 cluster 場景下任一 Ollama down 誤抑全 AI/SLO 告警
## M5 Alertmanager 版本驗證(已確認 v0.31.1,遠超 v0.22+)
## M6 governance_agent.py — health score 區分 skipped vs ok vs violated
- check_slo_compliance 加 _meta {violated_count, skipped_count, ok_count, all_skipped, status}
- run_self_check: SLO 全 skipped 時獨立發 governance_slo_data_gap 告警
(不污染 self_failure 計數,因為 no_data 是 emitter 未實作不是治理機制故障)
## M7 scripts/check_config_drift.py — 改 AST 解析
- regex 改 ast.parse 找 Settings ClassDef AnnAssign Field(default=...)
- 避免多行 list / default_factory= / 含跳行字串的 false negative
- 4 欄位(AI_FALLBACK_ORDER / ARGOCD_URL / PROMETHEUS_URL / OLLAMA_URL)全對齊
## 新增測試
- test_km_writer_backfill_reconciler.py: 7 cases(C1 reconciler + safe helper)
- test_km_writer_idempotent.py: 5 cases(M3 path_type 注入 + UPSERT 分支)
## 驗證
- 1585 unit tests 全綠(+13 從 1572)
- amtool check-config SUCCESS(8 inhibit_rules / 2 receivers)
- drift checker AST-based 4 欄位全對齊
- Alertmanager v0.31.1 確認支援新語法
## 期望影響
- KMWriter 名實統一:飛輪閉環 KM 寫入路徑 100% 可靠
- M4 抑制爆炸風險解除
- 治理層不再對 SLO no_data 靜默
- drift checker false negative 風險解除
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-04-29 10:44:39 +08:00
Your Name
025a493f06
feat(p3.2+adr-100): Model Version Tracker + SLO 自治 + KB rot cleaner
...
run-migration / migrate (push) Failing after 12s
CD Pipeline / build-and-deploy (push) Has been cancelled
Wave 8 P3.2 模型版本追蹤 + ADR-100 SLO 自我治理 + 配套:
P3.2 — Model Version Tracking:
- model_version_probe.py (268 行) — 探測 Ollama / OpenRouter 等 provider 的 model version
- model_version_tracker.py (101 行) — 對齊 PG provider_version_history 表
- migrations/p3_2_provider_version_history.sql + rollback — 25 行 schema
- db/models.py +32 行 — ProviderVersionHistory ORM
ADR-100 — AI 自主化 SLO:
- docs/adr/ADR-100-ai-autonomous-slo.md (167 行) — 飛輪 SLO 設計與閾值
- ops/monitoring/slo-rules.yml (254 行) — Prometheus SLO recording rules + alerts
- ops/monitoring/tests/test_slo_rules.yaml (242 行) — promtool unit tests
整合修改:
- main.py +72 行 — Lifespan 啟動 model_version_probe + KB rot cleaner schedule
- gitea_webhook.py +45 行 — webhook 接收 model 版本變化通知
- ci_auto_repair.py / evidence_snapshot.py / pre_decision_investigator.py — 配合接線
新測試:
- test_kb_rot_cleaner_schedule.py (120 行) — 9 tests pass
- test_slo_rules.yaml — promtool 驗收
Tests: 9 passed (test_kb_rot_cleaner_schedule)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
Co-Authored-By: Multiple Engineers (P3.2 + ADR-100) <noreply@anthropic.com >
2026-04-27 14:54:19 +08:00
Your Name
cc547736ab
feat(wave6-8): P2.1 fusion + P2.2 governance + P2.4 consensus + Wave 7/8 BLOCKER 修復
...
承接 Wave 6/7/8 多 engineer 在 agent 限額前完成的代碼,補 commit 解 production
HEAD 隱性 import error(decision_fusion 已被 decision_manager 引用但檔案 untracked)。
新增(後端核心):
- decision_fusion.py (562 行) — P2.1 方法 III(OpenClaw + Hermes + Elephant 三 LLM 融合)
- aiops_timeline.py + aiops_timeline_service.py — critic B4 修復
/api/v1/aiops/timeline endpoint,DB 存取抽到 service 層遵守 leWOOOgo 積木化
- migrations/p2_decision_fusion_columns.sql + rollback — approval_records fusion 欄位
修改(後端整合):
- decision_manager.py — fusion 三斷鏈修補(critic B1+B2+B3):
· B1: 寫 _evidence_snapshot_ref 到 token.proposal_data
· B2: fusion 前計算 complexity_score 並寫 token
· B3: fusion composite 寫 token.proposal_data["decision_fusion"]
- auto_approve.py — fusion + consensus 認識(critic B3+B5):
· composite > 0.7 → auto_execute_eligible bypass min_confidence
· source=consensus_engine + score>=0.6 → 規則可信路徑
- consensus_engine.py — db-fix _save_consensus 重用 agent_sessions
- governance_agent.py — db-fix _alert PG 寫入 ai_governance_events
- approval_db.py — fusion 3 欄位 + 2 partial index + CheckConstraint
- db/models.py — schema 對齊 migration
- core/config.py — vuln #1 修復:OLLAMA_URL/_FALLBACK_URL field_validator
拒絕公網 IP + 外部域名,僅允許私網/loopback/K8s SVC 白名單
- core/feature_flags.py — P2 fusion + consensus flags
- main.py — governance_agent lifespan 啟動
- failover_alerter.py — Wave8-X2: in-memory dedup fallback(Redis 拒絕後不 fail-open)
- ollama_*.py — metrics 整合 + recovery 改善
- auto_repair_service.py — verifier 接線
新增(測試 2438 行):
- test_decision_fusion.py / test_governance_agent.py / test_consensus_integration.py
- test_p2_db_fixes.py / test_wave8_fusion_fixes.py
- test_config_url_validation.py(vuln #1 12 tests)
- test_failover_alerter.py +Wave8-X2 in-memory dedup 補測
驗收: 116 tests pass (decision_fusion + wave8_fusion + config_url + consensus +
governance + p2_db_fixes + failover_alerter)
Conflict resolution:
- 3 檔(config.py + auto_approve.py + decision_manager.py)git stash pop 衝突
保留 stashed (engineer 最終版),補回 ValueError 「公網 IP」字樣對齊 test
Note: 此 commit 解 production HEAD 隱性 import error
仍未修: vuln #4 prompt injection / debugger B14 quota fail-closed
/ B25-B26 drain_pending_tasks / B8 governance fail alert
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
Co-Authored-By: Multiple Engineers (Wave 6/7/8) <noreply@anthropic.com >
2026-04-27 08:11:40 +08:00
Your Name
e96055eef9
fix(p0.4): Playbook 學習鏈三道修復 — partial index + race防護 + 手動路徑接線
...
ADR-092 P0.4 Playbook EWMA 學習閉環的 DB / Repository / Service 三層修補。
DB 層 (db-expert-fix by Engineer-B):
- ApprovalRecord.matched_playbook_id 移除 index=True,改 __table_args__ partial index
(WHERE matched_playbook_id IS NOT NULL) — 多數列 NULL,full index 浪費空間
- adr092_p1_learning_chain_rollback.sql: 純 ROLLBACK SQL(DBA 手動執行)
Repository 層:
- playbook_repository.py: SELECT FOR UPDATE 防 lost update
避免並發 EWMA 更新覆蓋彼此
Service 層 (P0.4 修復):
- proposal_service.py: 手動審核路徑補 _try_playbook_match_id 呼叫
decision_manager auto_execute 路徑已有此邏輯(行 2035),
此處補手動路徑缺口,使 matched_playbook_id 可寫入 DB → EWMA 才能演化
測試:
- test_playbook_repository_race_condition.py: 3 cases SELECT FOR UPDATE 防 race
正確阻擋並發 EWMA 更新(pass)
Note: migration SQL 待 DBA 手動執行(feedback_dev_prod_separation.md),
不執行 alembic upgrade(statu 文件禁忌條款)。
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-04-26 20:19:46 +08:00
Your Name
00443370ba
feat(ws6): Hermes observability — latency logging + dispatch audit table
...
run-migration / migrate (push) Failing after 16s
CD Pipeline / build-and-deploy (push) Has been cancelled
- nl_gateway.py: time.monotonic() 測量 SDK call 耗時
hermes_nl_dispatch log 加 latency_ms + success 欄位
- migrations/adr094_hermes_dispatch_log.sql
hermes_dispatch_log(bigserial + chat_id/user_id/agent/latency_ms/success)
已部署至 prod awoooi_prod
ADR-094 P95 latency 監控 + 幻覺追蹤用
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-25 02:10:06 +08:00
Your Name
ed3ba730a1
fix(ws2-migration): 補 enum types + 執行 prod migration
...
- CREATE TYPE approvalstatus / risklevel(SQLAlchemy native_enum)
- approval_records 已在 prod awoooi_prod 建立
- telegram_chat_id BIGINT(支援 -1003711974679)
- status approvalstatus enum(非 VARCHAR)
- awoooi_migrator 角色需 superuser 才能建,留 backlog
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-25 02:10:06 +08:00
Your Name
6d5fd3c124
feat(ws2): ADR-093 路由統一 — BIGINT + NotificationMatrix + feature flag
...
## 修復
### T2.1 BigInteger overflow 修復
- `db/models.py`: telegram_chat_id Integer → BigInteger
(原 int32 無法容納群組 ID -1003711974679)
### T2.2 移除 CAST workaround
- `approval_db.py:739`: 移除 CAST(:telegram_chat_id AS BIGINT)
ORM 已正確使用 BigInteger,workaround 可退役
### T2.3 Redis key 一致性修復
- `heartbeat_report_service.py:575`: telegram:polling_leader → telegram:polling:leader
(telegram_gateway.py 使用冒號分隔,heartbeat 用底線是 bug)
## 新增
### T2.4 notification_matrix.py
- `services/notification_matrix.py`: ADR-093 路由矩陣
- Destination(DM/GROUP/BOTH) + RoutingRule dataclass
- NOTIFICATION_ROUTING dict(TYPE-1 ~ TYPE-8M 完整映射)
- resolve_chat_ids(type, dm, group, *, tg_group_cutover=False) 灰階切流 API
### T2.5 telegram_gateway.py feature flag 保護
- line 43: 加 notification_matrix import
- line 1827-1834: TG_GROUP_CUTOVER=False 時維持舊行為
TG_GROUP_CUTOVER=True 時解除 _interactive_types 黑名單,由矩陣控制
### T2.6 Migration SQL
- `migrations/adr093_notification_routing.sql`:
- CREATE TABLE approval_records (telegram_chat_id BIGINT)
- CREATE ROLE awoooi_migrator (IF NOT EXISTS)
- 含舊環境 ALTER COLUMN int→bigint 保護
## 測試同步
- `tests/integration/setup_test_schema.sql`: telegram_chat_id BIGINT
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-25 02:10:06 +08:00
Your Name
04ff22563e
fix(aiops-p1): Playbook 學習閉環 5斷點全修 + DB Migration(ADR-092 B4)
...
run-migration / migrate (push) Failing after 14s
CD Pipeline / build-and-deploy (push) Failing after 2m7s
【P0.4 補丁】pre_decision_investigator Prometheus query 欄位缺失
- _build_tool_params() 補 "query" 欄位(prometheus_query tool 必要參數)
- 新增 _build_prometheus_query() — 依告警類型生成 PromQL(CPU/Memory/Crash/Disk/HTTP/Pod/fallback)
- 修復後 D3_METRICS 感官維度實際取得資料(原本 100% 回 missing_query_parameter)
【P1 Playbook 學習閉環 B1-B5 全修】
- B2 db/models.py: ApprovalRecord 新增 matched_playbook_id 欄位 + ix_approval_matched_playbook index
- B2 db/models.py: TimelineEvent 新增 incident_id 欄位(MCP 稽核用)+ index
- B3 approval_db.py: record→ApprovalRequest 補回 incident_id + matched_playbook_id
- B4 approval_repository.py: 同 B3(兩個轉換函式必須同步)
- B5 approval_db.py: approval_request_to_record_data 補 matched_playbook_id → DB 才能存值
【P1.5 KM 寫入】approval_execution.py: fire-and-forget → await wait_for(30s)
- 根因:asyncio.create_task 在 Pod recycle 時被殺,KM 寫入靜默遺失
- 修復:await asyncio.wait_for(..., timeout=30.0) + TimeoutError log
【Migration 文件】adr092_p1_learning_chain_fix.sql
- ALTER TABLE approval_records ADD COLUMN matched_playbook_id VARCHAR(36)
- ALTER TABLE timeline_events ADD COLUMN incident_id VARCHAR(64)
- 執行:psql $DATABASE_URL -f apps/api/migrations/adr092_p1_learning_chain_fix.sql
【附帶 Agent 改動】
- decision_manager: Phase 2 YAML NO_ACTION 優先門(主機層/外部服務跳過 Agent Debate)
- alert_rules.yaml: Sentry/ClickHouse + HostDiskUsageHigh/Critical 新規則
- solver_agent: action_title 語意合成兜底(取代靜默丟棄)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-24 15:41:35 +08:00
Your Name
45dbe07188
fix(flywheel): 自動化飛輪六大能力修復(ADR-092 B3)
...
run-migration / migrate (push) Failing after 22s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 53s
Type Sync Check / check-type-sync (push) Successful in 2m54s
CD Pipeline / build-and-deploy (push) Has been cancelled
Ansible Lint / lint (push) Has been cancelled
【根因鏈修復】
MCP Provider bugs → PreDecisionInvestigator 失敗 → Agent Debate 無上下文
→ LLM 逾時 → description="待分析" → ADR-091 鐵閘攔截 → tg_sent 未設
→ W-2 Watchdog 誤報「靜默故障」
【六大修復】
1. MCP Provider 三蟲修復
- ssh_provider: asyncssh.run() → conn.run()
- prometheus_provider: KeyError 'query' → .get() 容錯
- k8s_provider: 空 pod_name → 早返回錯誤字典
2. Agent Debate / 決策品質
- decision_manager: 逾時降級文字改為明確描述(繞過 ADR-091 鐵閘)
- intent_classifier: LLM 逾時降級至關鍵字分類(非 None)
3. Watchdog 誤報修復(ADR-092 B3)
- W-2: tg_sent Redis TTL → telegram_message_id IS NULL(DB 真值)
- W-5 新增: suggested_action IN 空/待分析/NO_ACTION + tg_id IS NULL
- approval_timeout_resolver: 60min → 15min,batch 50 → 200
4. Config Drift 自動化
- drift_adopt_service: auto_adopt_if_safe() 六條件安全閘
- drift.py: 背景任務先嘗試自動採納再發人工 Telegram 卡片
5. Playbook 飛輪穩定
- playbook_seed_service: 修復幂等性(deprecated 不視為缺失)
- playbook_evolver: 只載 DRAFT+APPROVED(非全部 294 筆)
6. 可觀測性
- alert_rule_engine: auto_rule 結構化日誌 + Redis 計數器(pipeline)
- auto_approve: reject 原因 Redis 計數器
- heartbeat_report_service: 新增「⚙️ 自動化統計(今日)」區塊
【待人工執行】
psql $DATABASE_URL -f apps/api/migrations/cleanup_duplicate_deprecated_playbooks.sql
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-24 10:55:50 +08:00
Your Name
23fb5c4aaa
feat(migration): adr091 rollback SQL
...
統帥全景檢查補:違反 feedback_dev_prod_separation — 直接對 awoooi_prod
套 adr091 migration 時應同時有回滾路徑。新增 DROP TABLE / DROP INDEX
腳本備用。資料不可復原,僅緊急用。
K8s Secret AIDER_WEBHOOK_SECRET 已加進 awoooi-prod.awoooi-secrets
(26 keys now, via kubectl patch)。
v1 repo ~/aider-watch README 標 DEPRECATED 並 tag v1-final。
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-04-20 04:23:09 +08:00
Your Name
60b06ac54c
feat(migration): adr091 aider_events table
2026-04-20 04:04:13 +08:00
OG T
98aef55b31
feat(kpi): ADR-090-D MASTER §7.1 北極星 KPI 5 斷鏈全修
...
CD Pipeline / build-and-deploy (push) Successful in 11m49s
run-migration / migrate (push) Failing after 15s
2026-04-18 晚(台北時區)— ogt + Claude Opus 4.7 (1M)
MASTER §7.1 15 個北極星 KPI 實測對標發現 5 個斷鏈:
#3 fine-tune JSONL /week — finetune_exports 表不存在
#4 MCP 呼叫/24h — timeline_events 沒 mcp_call event_type
#6 Declarative 修復使用率 — remediation_events 表不存在
#7 general 兜底 17.3% — classify_alert_early 漏 5 類
#10 notification_outcomes /week — 表不存在
本 commit 全修。
## 1. Migration: adr090d_kpi_data_sources.sql (3 張表)
- finetune_exports — P3 Fine-tune JSONL 追蹤
- remediation_events — P5 Declarative 修復追蹤
- notification_outcomes — 通知品質 + RLHF 語料
Idempotent (CREATE TABLE IF NOT EXISTS), 已 apply 進 prod。
## 2. classify_alert_early 擴 4 類規則 (降 general 兜底)
- test 攔截: Test*/FPTest/FingerprintTest/ADR089*Test/L4Closure*/*FreshUniq*
→ category='test', TYPE-1 純通知
- High*CPU/Memory/Disk/Load → host_resource
- TLS*/SSL*/*ProbeFailure* → ssl_cert
- PostgreSQL*/MySQL*/MongoDB*/*DiskGrowthRate → database
預期 general 17.3% → 3-5% (達標 <10%)。
## 3. finetune_exporter DB 寫入
_run_export() 結尾寫 finetune_exports 一筆,含 checksum/size/record_count。
## 4. declarative_remediation DB 寫入
evaluate() 後 fire-and-forget _log_remediation_event() 寫 remediation_events
(status='pending', remediation_type 依 tier 自動判為 declarative/imperative/gitops_pr)。
## 5. telegram_gateway DB 寫入 (send_approval_card)
_send_request 成功返回 message_id 後寫 notification_outcomes 一筆,
channel='telegram', delivery_status='delivered|failed'。未來人類按鈕時
update user_action → RLHF 訓料黃金。
## 6. pre_decision_investigator MCP 呼叫追蹤
_call_single_tool() finally 寫 timeline_events event_type='mcp_call',
含 provider/tool/status/duration_ms/error。24h 內 MCP 呼叫可 SQL 量測。
## 預期量化改善
| KPI | 修前 | 修後 24h 後應見 |
|-----|------|----------------|
| #3 fine-tune /week | 0 (表不存在) | >=10 (每週 cron 跑) |
| #4 MCP 呼叫/24h | 0 | >0 (實測將寫 timeline) |
| #6 declarative 使用率 | 表不存在 | 有資料 (pending/success/failed 分佈) |
| #7 general 兜底 | 17.3% | <10% |
| #10 notification_outcomes | 0 | 每次 approval card 寫一筆 |
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-04-19 00:00:31 +08:00
OG T
a156566b17
feat(drift-narrator): ADR-090-C L4 稽核閉環 — notification_formatted op 入庫
...
CD Pipeline / build-and-deploy (push) Successful in 10m47s
run-migration / migrate (push) Failing after 14s
2026-04-18 下午(台北時區)—— ogt + Claude Opus 4.7 (1M)
架構鐵律執行:
「沒有被記錄的 AI 決策,就等於沒有發生過。」
drift_narrator 每次呼叫 LLM 生成摘要,必須完整寫入
automation_operation_log + ai_collaboration_trace,形成 L4 稽核 + RLHF 語料。
本 commit 兩件事:
1. apps/api/migrations/adr090c_notification_formatted_op_type.sql
- 擴充 automation_operation_log.operation_type CHECK 加 'notification_formatted'
- DROP + ADD CONSTRAINT idempotent 模式
- 已用 awoooi(表 owner)apply 進 prod 驗證通過
2. apps/api/src/services/drift_narrator_service.py
- 新增 _log_ai_action_to_db() 負責 DB 稽核寫入
- 在 _generate_narrative_and_items() 結尾(success / fallback 都寫)呼叫
- automation_operation_log:
* operation_type='notification_formatted'
* actor='drift_narrator'
* input = {report_id, namespace, counts, items_scanned}
* output = {narrative, items, items_count}
* duration_ms, tags=['drift','type4d','llm_summary']
* parent_op_id 查詢 alert_fired 鏈路(未來 drift → alert 關聯)
- ai_collaboration_trace:
* agent='drift_narrator', model=provider (ollama / nemotron / 等)
* prompt(限 8000 字)+ response(JSONB)
* accepted = LLM JSON 解析成功 flag(未來 RLHF 訓料金礦)
- 錯誤處理: DB 寫入 try/except 包住,永不破壞 Telegram 通知主流程
P2.4 事件關聯:
- SELECT parent op via input->>'report_id' 或 'drift_report_id'
- 若找到則綁定 parent_op_id(形成 alert_fired → notification_formatted 追溯鏈)
- 目前 drift 本身不經 alert_fired,parent 為 NULL(等未來鏈路接通)
P2.5 RLHF 語料:
- ai_collaboration_trace.accepted=true 的紀錄即為「LLM 解析成功」樣本
- 未來統帥按 Telegram [✅ 採納變更] / [⏪ 回滾] 時,對應 trace 也可更新
outcome flag,形成完整 Human-in-the-loop 語料
技術細節:
- get_db_context() auto-commit(src/db/base.py:128),無需手動 commit
- prompt 最長 8000 字(一般 drift 約 2-3k)
- raw_response 保留前 500 字在 trace.response JSON 中
相關:
- feedback_ai_autonomous_direction.md L4 北極星
- feedback_secrets_leak_incidents_2026-04-18.md L1-L4 分層
- ADR-090 11 張神經網路表
- commit fb88512(B 方案視覺層)
IDE 可能顯示 src.db.base 找不到 —— 那是誤報(drift_repository.py 用同一條路徑)。
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-04-18 16:04:23 +08:00
OG T
2d43751729
feat(ops): ADR-090-B 零信任收尾範本 — wrapper / sudoers / migrator / CI
...
CD Pipeline / build-and-deploy (push) Successful in 12m17s
run-migration / migrate (push) Failing after 14s
2026-04-18 台北時區 —— ogt + Claude Opus 4.7 (1M)
本 commit 響應本 Session 兩次憑證外洩事故
(feedback_secrets_leak_incidents_2026-04-18.md),
交付統帥可直接部署的零信任基礎設施範本.
檔案清單:
1. scripts/host-ops/awoooi-hosts-add.sh
- 110 主機 /etc/hosts 白名單 wrapper
- 只允許預定義主機名,idempotent,帶 IP 格式驗證
- 安裝: /usr/local/bin/awoooi-hosts-add (root:root 0755)
2. scripts/host-ops/awoooi-wrapper.sudoers
- 配套 sudoers 規則 (NOPASSWD for wrapper + SIGHUP only)
- 安裝: /etc/sudoers.d/awoooi-wrapper (root:root 0440)
- 禁 tee / bash / sh 這類 generic shell access
3. apps/api/migrations/adr090b_awoooi_migrator_role.sql
- PG 限權角色 awoooi_migrator
- 只能 DDL (CREATE/ALTER/DROP/INDEX/COMMENT)
- 明確 REVOKE 所有 DML + default privileges 鎖死
- 本檔由統帥執行 (需 superuser),不由 Claude 執行
4. k8s/awoooi-prod/awoooi-migrator-secret.template.yaml
- K8s Secret patch 範本
- 新增 MIGRATION_DATABASE_URL key (awoooi_migrator 連線串)
- 與應用 DATABASE_URL 拆開
5. .gitea/workflows/run-migration.yml
- CI 自動套用新 migration (單 transaction + ON_ERROR_STOP)
- 用 Gitea secret MIGRATION_DATABASE_URL,不走明碼
- 每次成功寫一筆 asset_discovery_run (audit trail)
零信任三層防線 (對應 feedback_secrets_leak_incidents):
L1 對話無密碼 -> wrapper 內建白名單
L2 操作經 wrapper -> sudoers + awoooi_migrator
L3 顯示強制遮蔽 -> CI 走 secret,不走 env
本 Session 發現的 3 次憑證外洩全部在 feedback_secrets_leak
memory 登記,並有對應 P0 輪替計畫.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-04-18 13:23:39 +08:00
OG T
5ae82d1d1f
feat(db): ADR-090 L4 AIOps 地基 — 資產盤點 × 7 項自動化覆蓋矩陣永久化 DB
...
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-04-18 下午(台北時區)—— ogt + Claude Opus 4.7 (1M)
MoWoooWorkDown 假警報 RCA 暴露三重結構性失守:
- 110/188 主機 load 18/16 × 13 天 / cadvisor 288% / K3s 120/121 無監控
- Prometheus 僅 35 targets / 58 rules(覆蓋不到三成)
- HostHighCpuLoad 量錯維度(CPU idle vs load_avg)
統帥戰略指令:
- 全景資產 × 七大自動化 × 永久化 DB
- AI 四分工(OpenClaw × NemoTron × Hermes × Claude LLM)
- 所有自動化操作歷程必進 DB,不靠 MD(MD 會漂移)
本 commit 交付:
1. SQL migration (apps/api/migrations/adr090_asset_inventory_foundation.sql)
- 11 張表 + 33 indexes + 20 CHECK + 3 UNIQUE + 16 FK
- pgcrypto extension dependency
- 完整 idempotent(CREATE IF NOT EXISTS + single transaction)
- 已 apply 進 awoooi_prod(188 PG),驗證通過
2. ADR-090 (docs/adr/ADR-090-monitoring-blindspot-governance.md)
- 決策紀錄 + 7 引擎對映 + 4 替代方案否決
3. 主戰略文件 (docs/superpowers/specs/2026-04-18-blindspot-governance-capacity-l4.md)
- §0-§14: 背景 / 根因 / Schema DDL / 4 層防禦 / 7 Phase 實施 /
HARD_RULES / AI 分工矩陣 / 驗收指標 / 技術債 / 回滾 / 接手協議
4. MASTER §8 Living Changelog 追加 Phase 7 啟動條目
11 張表:
asset_inventory / asset_discovery_run / asset_coverage_snapshot /
asset_relationship / alert_rule_catalog / asset_change_event /
asset_compliance_snapshot / host_capacity_snapshot /
capacity_violation_event / automation_operation_log /
ai_collaboration_trace
首筆 bootstrap 記錄已 seed 進 asset_discovery_run
(run_id=6760c5bf-57e5-4a40-b82d-31b794464652)
相關 Memory (未 commit,存於 ~/.claude/...):
- project_blindspot_governance.md (跨 session 指針)
- feedback_monitor_self_monitoring.md (監控工具必須被監控)
- feedback_secrets_leak_incidents_2026-04-18.md (憑證外洩三防線)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-04-18 13:18:46 +08:00
OG T
9d6aa7ea45
feat(trust): ADR-088 Trust Score 持久化 — L4 自動放行核心
...
CD Pipeline / build-and-deploy (push) Successful in 10m40s
TrustScoreManager 從記憶體升級為 PostgreSQL 持久化,
Pod 重啟後信任分數不再歸零,AI 能真正累積到 L4 自動放行門檻。
變更:
- migrations/adr088_trust_score_persistence.sql: trust_records 表
- db/models.py: TrustRecordDB ORM model
- repositories/interfaces.py: ITrustRepository Protocol
- repositories/trust_repository.py: PG upsert ON CONFLICT DO UPDATE
- services/trust_engine.py: bulk_load() 啟動 warm-up
- services/learning_service.py: _persist_trust() + 2 call sites
- main.py: 啟動時 load_all() → bulk_load()
流程: 批准 5 次 → score=5 寫入 DB → Pod 重啟 → warm-up 讀回
→ evaluate_adjusted_risk MEDIUM→LOW → 自動執行
2026-04-17 ogt + Claude Sonnet 4.6(亞太): ADR-088
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-17 16:14:44 +08:00
OG T
c05bac6112
fix(playbook): seed tuple unpack + text[] → jsonb migration
...
CD Pipeline / build-and-deploy (push) Has been cancelled
- playbook_seed_service.py: list_playbooks 回傳 tuple[list, int],
缺少解包導致 'list' has no attribute 'source'
- fix_playbooks_array_to_jsonb.sql: source_incident_ids/tags text[] → jsonb
(已手動套用 prod DB)
2026-04-15 ogt + Claude Sonnet 4.6(亞太)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-15 22:03:59 +08:00
OG T
da871fc149
chore(db): 補齊 AIOps P1/P2/P6 migration SQL(已套用到 prod)
...
CD Pipeline / build-and-deploy (push) Has been cancelled
incident_evidence / agent_sessions / ai_governance_events 三表
IF NOT EXISTS,production DB 已手動確認存在並 apply。
2026-04-15 ogt + Claude Sonnet 4.6(亞太)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-15 22:02:17 +08:00
OG T
325b3851b5
feat(adr-071): 告警通知四類型第一批 B/C/E/F/G/H 全實作
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Type Sync Check / check-type-sync (push) Failing after 1m7s
ADR-071-B: classify_notification() — 五型分類器 (TYPE-1/2/3/4/4D)
ADR-071-C: send_info_notification() — TYPE-1 純資訊無按鈕卡片
ADR-071-E: _build_inline_keyboard() — 依 alert_category 動態組合 TYPE-3 按鈕
ADR-071-F: send_drift_card() — TYPE-4D Config Drift 卡片 + Diff 截斷
ADR-071-G: km_conversion_service.py — Incident RESOLVED 自動轉 KM
ADR-071-H: handle_manual_fix_done() — TYPE-4 手動修復 Bot 對話閉環
前批已完成: ADR-071-A (DB Migration) + ADR-071-D (狀態機守衛)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-11 02:24:20 +08:00
OG T
c6edfb5614
fix(flywheel): 四階段系統性修復 AUTO_REPAIR NO_MATCH 斷層
...
CD Pipeline / build-and-deploy (push) Has been cancelled
Phase 1 — affected_services 污染根治
- webhooks.py: _extract_affected_services() 從 labels 精準萃取服務名
(component > job > pod deployment name > clean target_resource > [])
- create_incident_for_approval: alert_labels 完整保留進 Signal
- alert_name 從 alertname 取,不再用 "custom"
Phase 2 — Playbook alertname 變體擴充
- alert_rules.yaml: 5 條規則新增 HostHighCpuLoad、KubePodCrashLooping 等變體
- scripts/update_playbook_alert_variants.py: Redis index 已執行更新 ✅
Phase 3 — Jaccard 通用型 Playbook 豁免
- similarity.py: affected_services=[] → 1.0 豁免(基礎設施 Playbook 不針對特定服務)
- severity_range=[] → 1.0 豁免(適用所有嚴重度)
Phase 4 — Playbook Embedding 持久化(冷啟動修復)
- migrations/flywheel_playbook_embeddings.sql: pgvector 持久化表
- services/playbook_embedding_service.py: 啟動時重建 Redis 向量快取 + 同步 DB
- main.py: lifespan 啟動時 asyncio.create_task 非阻塞執行
2026-04-10 Asia/Taipei — Claude Sonnet 4.6
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-10 11:04:56 +08:00
OG T
af7b1591c1
feat(rag): phase35 ivfflat 向量索引 — 5814 chunks 已建立
...
CD Pipeline / build-and-deploy (push) Has been cancelled
已在 prod 執行: idx_rag_chunks_embedding (lists=100, cosine)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-10 10:33:32 +08:00
OG T
63e840ae42
feat(ollama): Phase 31-34 ADR-067 — Log摘要/PR審查/RAG知識庫/圖片分析
...
CD Pipeline / build-and-deploy (push) Has started running
Phase 31: log_summary_service.py — deepseek-r1:14b K8s Pod日誌異常摘要
- 觸發: signoz_webhook 告警時背景呼叫
- Redis快取 log_summary:{pod}:{date} TTL 24h
- 敏感資料regex遮蔽
Phase 32: local_code_review_service.py — qwen2.5-coder:7b PR自動審查
- Fallback: Gemini (diff > 50KB 或 Ollama超時)
- semaphore 最多2個同時審查
- 雙寫: Redis TTL 7d + pr_reviews表 (phase29 migration)
Phase 33: knowledge_rag_service.py — nomic-embed-text 768維 pgvector RAG
- 向量化(188) + 生成(111) 雙Ollama
- rag_chunks表 (phase28 migration)
- 初期線性搜尋,>100筆啟用ivfflat索引
Phase 34: image_analysis_service.py — llava:latest Telegram圖片分析
- download_and_analyze: Bot API getFile → 下載 → llava → 回應
- Rate limit: 每chat_id每分鐘3次 (Redis sliding window)
- telegram.py webhook新增photo分支
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-10 01:50:22 +08:00
OG T
89015d4527
feat(phase30): Drift 報告 AI 人話摘要 (ADR-067)
...
CD Pipeline / build-and-deploy (push) Has been cancelled
- 新增 DriftNarratorService — qwen2.5:7b-instruct (Ollama 111)
- 觸發條件: high >= 1 or medium >= 3(HPA replicas 白名單)
- Redis 快取: drift_narrative:{report_id} TTL 1h
- LLM 失敗時 graceful fallback 結構化文字
- drift.py _analyze_and_notify: 接入 narrator(Phase 30 標記)
- Migration: drift_reports.narrative_text TEXT (已在 prod 執行)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-10 01:37:43 +08:00
OG T
88ac1c7f50
feat(phase27): 歷史按鈕雙層頻率統計 + DB frequency_snapshot 持久化
...
CD Pipeline / build-and-deploy (push) Failing after 1m44s
- telegram_gateway: _send_incident_history 改為 Phase 27 雙層策略
Layer 1: DB frequency_snapshot (建立時刻永久快照)
Layer 2: Redis AnomalyCounter disposition 累積統計 (35d TTL)
修復舊版呼叫 record_anomaly() 導致誤計數的 bug
- 新增 migration: phase27_incident_frequency_snapshot.sql (已在 prod 執行)
- CLAUDE.md: 精簡至 123 行,減少 Token 消耗
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-10 01:06:51 +08:00
OG T
896bef94ee
fix(web): pending-approvals-card 加防重複點擊 + loading 狀態
...
linter 自動強化: actioningId state 防止同一張卡重複操作
- disabled + opacity 0.6 + cursor not-allowed
- loading 時按鈕顯示 '...'
- finally() 確保 actioningId 清除
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-09 18:38:08 +08:00
OG T
88696dba9b
feat(sprint5.1): Data Safety Guardrails 全鏈路整合 (L1-L5)
...
CD Pipeline / build-and-deploy (push) Failing after 1m33s
Type Sync Check / check-type-sync (push) Failing after 58s
Layer 0 - K8s RBAC:
- k8s/rbac/api-velero-reader.yaml: awoooi-executor SA Velero backup reader
Layer 1 - DB Migration (已在 188 執行):
- M-002: approval_records 新增 approval_level/votes/required_votes
- M-003: alert_event_type ENUM 新增 8 個值
Layer 2 - IaC:
- ops/config/service-registry.yaml: 全服務 Stateful 分級清單 (BLOCK/CRITICAL_HITL/STANDARD_HITL/AUTO)
Layer 3 - Python Services:
- service_registry.py: 讀取 YAML,提供 is_blocked/requires_multisig/get_required_votes
- velero_client.py: kubectl 查詢 Velero 備份年齡,失敗 fallback 999h
- preflight_service.py: Pre-flight 安全檢查 (Q2/Q4 決策)
Layer 1-M001 - Playbook model:
- playbook.py: 新增 requires_approval_level/stateful_targets/requires_pre_backup
Layer 4 - 業務邏輯:
- alert_operation_log_repository.py: 新增 8 個 event_type (Guardrail/Pre-flight/MultiSig/備份)
- auto_repair_service.py: 注入 Service Registry Guardrail 檢查 (BLOCK → 直接拒絕)
- webhooks.py: ALERT_RECEIVED 溯源記錄 + auto_repair flag Q9 + Langfuse trace_id Q10
- db/models.py: ApprovalRecord 同步 approval_level/votes/required_votes 欄位
- docker-health-monitor.sh: 純感知層改造(移除所有 docker restart 邏輯)
Layer 5 - Telegram 通知:
- telegram_gateway.py: T1-T6 六個新通知方法 (Guardrail/Pre-flight/Backup/MultiSig/ChangeApplied)
參考: ADR-062 Data Safety Guardrails, ADR-063 Service Registry IaC
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-08 16:24:09 +08:00
OG T
f20121ad41
feat(audit): Phase 11 告警操作完整溯源 — alert_operation_log + 歷史回填
...
CD Pipeline / build-and-deploy (push) Failing after 1m29s
統帥指令「所有告警訊息通通寫入資料庫,並記錄相關操作」
變更:
- phase11_alert_operation_log.sql: 新表 (Event Sourcing,不可變)
- phase11b_backfill_alert_operation_log.sql: 歷史回填 654 筆
- 14 筆 ALERT_RECEIVED (incidents)
- 265 筆 TELEGRAM_SENT (approval_records)
- 265 筆 USER_ACTION (approval_records)
- 110 筆 EXECUTION_COMPLETED (audit_logs)
- db/models.py: AlertOperationLog SQLAlchemy model
- repositories/alert_operation_log_repository.py: append/list_by_incident/get_stats
- webhooks.py: _try_auto_repair_background 寫入 AUTO_REPAIR_TRIGGERED + EXECUTION_COMPLETED + TELEGRAM_RESULT_SENT
- webhooks.py: _push_to_telegram_background 寫入 TELEGRAM_SENT
- telegram.py: handle_callback 寫入 USER_ACTION (approve/reject)
已執行 migration: awoooi_prod@192.168 .0.188 ✅
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-04-08 11:22:03 +08:00
OG T
eee6f06215
feat(auto-repair): 所有操作強制寫入 DB — auto_repair_executions 表
...
CD Pipeline / build-and-deploy (push) Failing after 1m32s
統帥指令: 所有自動修復操作(成功/失敗)必須持久化
變更:
- migrations/phase10_auto_repair_executions.sql: 新增表 + 4 個索引
- db/models.py: 新增 AutoRepairExecution SQLAlchemy model
- repositories/audit_log_repository.py: 新增 AutoRepairExecutionRepository (create/list_by_incident/get_stats)
- auto_repair_service.py: execute_auto_repair 成功/失敗分支都寫入 DB
- 新增 similarity_score 參數傳遞
- AutoRepairDecision 新增 similarity_score 欄位
- webhooks.py: 傳入 similarity_score 到 execute_auto_repair
已執行 migration: awoooi_prod@192.168 .0.188:5432 ✅
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-04-08 11:16:37 +08:00
OG T
658337ec18
fix(phase26): 打通 Incident→DB→KM 完整鏈路 + namespace 修正
...
CD Pipeline / build-and-deploy (push) Failing after 1m29s
Type Sync Check / check-type-sync (push) Failing after 52s
問題根因:
1. create_incident_for_approval 只存 Redis,不存 PostgreSQL
→ TTL 7天後消失,Playbook 萃取永遠找不到 Incident
2. ApprovalRecord 無 incident_id 欄位
→ _trigger_playbook_extraction 靠 regex 掃中文文字找 INC-,永遠失敗
3. operation_parser namespace fallback 是 "default"
→ 所有 deployment 在 awoooi-prod,203 次執行全失敗
修復:
- Incident 同時寫入 Redis + PostgreSQL (save_to_episodic_memory)
- ApprovalRecord 加入 incident_id 欄位 (model + ORM + migration)
- alertmanager_webhook 建立 Approval 後回寫 incident_id
- _trigger_playbook_extraction 直接用 approval.incident_id
- operation_parser DEFAULT_NAMESPACE = "awoooi-prod"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-06 11:46:05 +08:00
OG T
3455044457
feat(phase25): Nemotron 主動防禦三方向 P0+P1+P2 完整實作
...
CD Pipeline / build-and-deploy (push) Failing after 38s
Type Sync Check / check-type-sync (push) Failing after 35s
P0 - DIAGNOSE Privacy-First Routing:
- ai_router.py: _local_fallback_chain [NEMOTRON→OLLAMA→REJECT]
- DIAGNOSE 意圖 override 改為 NEMOTRON (原 OLLAMA)
- DIAGNOSE fallback 使用 local-only 鏈,不觸碰雲端
- 全部失敗時 REJECT + Telegram 通知
- config.py: NEMOTRON_DIAGNOSE_TIMEOUT_SECONDS=30, OLLAMA_DIAGNOSE_TIMEOUT_SECONDS=60
- nemotron.py: 根據 context[task_type] 選擇 timeout
P1 - Knowledge Auto-Harvesting:
- models/knowledge.py: EntryType.AUTO_RUNBOOK + ANTI_PATTERN + symptoms_hash
- EntryStatus.PUBLISHED (ANTI_PATTERN 直接發布,無需審核)
- models/playbook.py: SymptomPattern.compute_hash() (16字元確定性 hash)
- services/runbook_generator.py: NemotronRunbookGenerator (v1.1)
- generate_runbook() → AUTO_RUNBOOK (DRAFT) + Telegram 審核 card
- generate_anti_pattern() → ANTI_PATTERN (PUBLISHED) + Telegram 通知
- 使用 nvidia.chat() (正確介面),Nemotron 超時時 Minimal fallback
- knowledge_service.py: check_anti_pattern(symptoms_hash, days=7)
- db/models.py: symptoms_hash VARCHAR(16) + ix_knowledge_symptoms_hash
- repositories/knowledge_repository.py: create() 支援 symptoms_hash + status
- auto_repair_service.py: anti_pattern_gate 在 decide() + runbook hook 在 execute()
- migrations/phase8_symptoms_hash.sql: ALTER TABLE + partial index + PUBLISHED constraint
P2 - Config Drift Detection:
- models/drift.py: DriftItem/DriftReport/DriftLevel/DriftIntent/DriftStatus
- services/drift_detector.py: GitStateReader + K8sStateReader + DriftDetector
- services/drift_analyzer.py: 白名單過濾 + DriftLevel 分級
- services/drift_interpreter.py: NemotronDriftInterpreter(意圖分析,不生成修復指令)
- services/drift_remediator.py: rollback(kubectl apply) + adopt(git push gitea)
- api/v1/drift.py: POST /scan, GET /reports, POST /rollback, POST /adopt
- migrations/phase9_drift_reports.sql: drift_reports 表
- k8s/drift-cronjob.yaml: 每小時自動掃描 CronJob
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-04 12:35:05 +08:00
OG T
df3ef9006c
fix(auto-repair): 首席架構師 Review — 4 Critical/Important 修復
...
CD Pipeline / build-and-deploy (push) Successful in 7m2s
Critical #1 : KM write task 移出 try/except
- _trigger_learning 的 KM 寫入原在 try 內,learning 失敗時不寫 KM
- 移至 except 後確保成功/失敗都寫入
- 移除冗餘 import asyncio(已在頂層 import)
- Minor: approval.incident_id or None 防空字串
Important #2 : migration 加 PRIMARY KEY
- playbook_id 從 UNIQUE 升為 PRIMARY KEY
- prod DB 已執行 ALTER TABLE ADD PRIMARY KEY
Important #3 : s.sequence→s.step_number, s.description→s.command
- embed_playbook() 使用不存在的欄位名,RAG 向量索引靜默失敗
- RepairStep 正確欄位: step_number, command
Important #1 : PlaybookService._get_rag_service 不再 Service 層快取
- 改為每次呼叫工廠 get_playbook_rag_service()
- 避免舊實例繞過工廠的 is_closed 重建邏輯
冷啟動修復 (首席架構師建議B+C):
- _trigger_playbook_extraction 執行成功後自動設定
execution_success=True, effectiveness_score=4, status=RESOLVED
- skip 路徑 logger.debug → logger.info 提升可觀測性
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-04 12:02:03 +08:00
OG T
72d7536ead
feat(auto-repair): 完整自動修復閉環 + KM 沉澱串接
...
CD Pipeline / build-and-deploy (push) Has been cancelled
1. DB Migration: playbooks 資料表 (phase7_playbooks_table.sql)
- 這是自動修復無法啟動的根本原因 — table 從未建立
- 5 個索引: status/tags/alert_names/source_incidents/created_at
- 已在 prod DB 執行
2. playbook_service: 萃取後自動沉澱 KM
- extract_from_incident() 完成後 fire-and-forget _write_to_km()
- 內容含症狀模式、修復步驟、信心度、來源 Incident
3. approval_execution: 執行結果沉澱 KM
- _trigger_learning() 後 fire-and-forget _write_execution_result_to_km()
- 成功/失敗記錄都寫入,category=execution_result
完整閉環:
告警 → AI分析 → 查Playbook → 決策 → 執行 → 結果寫KM
↓
Incident解決 → KM(knowledge_extractor)
→ Playbook萃取 → KM
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-04 11:54:15 +08:00
OG T
a1f7d1f495
fix(db): 固化 risklevel ADD VALUE 'high' 為正式 migration
...
CD Pipeline / build-and-deploy (push) Successful in 6m58s
E2E Health Check / e2e-health (push) Successful in 18s
Phase 23 緊急修復已在 prod/dev 手動執行,此檔作為正式記錄
使用 DO 塊防止重複執行錯誤
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-04-01 21:36:15 +08:00
OG T
30153496d1
fix(api): 修復全部 lint 錯誤 (ruff --fix)
...
- Import sorting (I001)
- Unused imports (F401)
- f-string without placeholders (F541)
- Loop variable unused (B007)
- zip() strict parameter (B905)
- Exception chaining (B904)
- collections.abc imports (UP035)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2026-03-26 16:06:20 +08:00