awoooi

Author	SHA1	Message	Date
Your Name	b00a7b050a	test(ollama): align inference connect errors with degraded health [Some checks failed: CD Pipeline / tests (push) Failing after 2m26s; CD Pipeline / build-and-deploy (push) Has been skipped; CD Pipeline / post-deploy-checks (push) Has been skipped; Code Review / ai-code-review (push) Successful in 28s]	2026-05-05 13:34:19 +08:00
Your Name	506744ba3a	test(ollama): keep slow gcp primary on ollama [Some checks failed: CD Pipeline / tests (push) Failing after 2m21s; CD Pipeline / build-and-deploy (push) Has been skipped; CD Pipeline / post-deploy-checks (push) Has been skipped; Code Review / ai-code-review (push) Successful in 26s]	2026-05-05 13:29:27 +08:00
Your Name	869646459c	fix(ollama): treat legacy primary as ollama [Some checks failed: CD Pipeline / tests (push) Failing after 1m48s; CD Pipeline / build-and-deploy (push) Has been skipped; CD Pipeline / post-deploy-checks (push) Has been skipped; Code Review / ai-code-review (push) Successful in 28s]	2026-05-05 13:25:27 +08:00
Your Name	33d4326cce	test(ollama): align slow recovery with gcp routing policy [Some checks failed: CD Pipeline / tests (push) Failing after 1m51s; CD Pipeline / build-and-deploy (push) Has been skipped; CD Pipeline / post-deploy-checks (push) Has been skipped; Code Review / ai-code-review (push) Successful in 33s]	2026-05-05 13:21:16 +08:00
Your Name	f78b1b0690	fix(ollama): honor provider endpoint selection [All checks were successful: Code Review / ai-code-review (push) Successful in 37s]	2026-05-05 13:14:46 +08:00
Your Name	8629ac709b	feat(awooop): Phase 1-8 full implementation: AwoooP Agent Platform six-plane architecture [Some checks failed: run-migration / migrate (push) Failing after 59s; Code Review / ai-code-review (push) Successful in 1m8s; Type Sync Check / check-type-sync (push) Successful in 2m27s]	2026-05-04 19:31:53 +08:00
## Phase 1-3: Control Plane + Contract System
- awooop_phase1_control_plane_2026-05-04.sql: 12 core tables + RLS
- awooop_phase1_batch1_rls_2026-05-04.sql: FORCE RLS + GRANT on all tables
- packages/awooop-contracts/: six-contract JSON Schema + golden fixtures
- src/models/awooop_contracts.py: Pydantic v2 contract models (extra=forbid)
- src/repositories/contract_repository.py: contract lifecycle (draft→published→active)
- src/services/contract_service.py: HMAC publish sig + Redis multi-sig activate
- src/services/schema_validator.py: LLM output validator (retry×3, E-SCHEMA-001)
## Phase 2: Tenant Isolation
- awooop_phase2_budget_ledger_2026-05-04.sql: budget_ledger + RLS
- src/services/budget_service.py: Token Budget Hard Kill, three lines of defense
- src/core/context.py: PROJECT_ID ContextVar (inherited automatically by the 31 background loops)
- src/db/base.py + models.py: project_id column + RLS set_config injection
- src/hermes/nl_gateway.py: project_id Redis key prefix (Phase A dual write)
- src/services/anomaly_counter.py: per-project rework (Phase A fallback)
## Phase 4: Platform Shell in Shadow Mode
- awooop_phase4_run_state_2026-05-04.sql: run_state + step_journal + idempotency
- src/services/run_state_machine.py: 8-state FSM + SKIP LOCKED + stale reaper
- src/services/platform_runtime.py: UUID v7 + W3C trace_id + shadow_execute
- src/services/audit_sink.py: PII/secret redaction, 9 patterns
- src/api/v1/platform/runs.py: POST/GET /v1/platform/runs (Router→Service architecture)
- src/workers/platform_worker.py: SKIP LOCKED worker + heartbeat + reaper loop
- src/main.py: platform router + lifespan worker start/stop
## Phase 5: MCP Gateway, five gates
- awooop_phase5_mcp_gateway_2026-05-04.sql: 4 tables + RLS
- src/plugins/mcp/gateway.py: McpGateway (Gate 1~5, E-MCP-GATE-001~009)
- src/plugins/mcp/redaction_middleware.py: two-layer redaction + 16K truncation
- src/plugins/mcp/registry.py: __provider name mangling (ADR-116)
- src/plugins/mcp/credential_resolver.py: k8s secret ref resolution
- tests/test_mcp_credential_isolation.py: 10 regression tests (guard against secret-leak recurrence)
## Phase 6-8: EwoooC + Channel Hub + Approval Token
- awooop_phase6_ewoooc_onboarding_2026-05-04.sql: ewoooc tenant + 4 read-only MCP tools
- awooop_phase7_channel_hub_2026-05-04.sql: conversation_event + outbound_message
- src/services/provider_proxy.py: ProviderProxy + PlatformEnvelope (ADR-115)
- src/services/channel_hub.py: Telegram inbound mirror + Progressive Feedback (30s)
- src/services/awooop_approval_token.py: HS256 + jti NX replay protection + suggest mode
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Your Name	14bf86a462	fix(awooop): Phase 2 first-batch P0 fixes + Phase 1 Task 1.7 integration tests	2026-05-04 13:46:19 +08:00
## P0 security / architecture fixes
### P0-08 telemetry.py: remove hard-coded IP assert (ADR-121)
- config.py: add OTEL_ALLOWED_ENDPOINTS (default 192.168.0.188) + OTEL_FORBIDDEN_ENDPOINTS
- telemetry.py: _validate_endpoint() now uses a config-driven allowlist/forbidlist
- EwoooC can override OTEL_ALLOWED_ENDPOINTS via env to point at its own SigNoz host
### P0-13 mcp_bridge.py: K8s namespace provided by settings
- config.py: add AWOOOI_K8S_NAMESPACE (default "awoooi-prod")
- mcp_bridge.py: 5 occurrences of parameters.get("namespace", "awoooi-prod") → settings.AWOOOI_K8S_NAMESPACE
- EwoooC/Tsenyang can set their own namespace
### P1-24 decision_manager.py: unify the silence key constant
- Add from src.services.telegram_gateway import SILENCE_KEY_PREFIX
- f"telegram_silence:{target}" → f"{SILENCE_KEY_PREFIX}{target}"
- Removes the definition duplicated in two places (ADR-118 No Island Coding principle)
## Phase 1 Task 1.7 Integration Tests
- tests/integration/test_awooop_phase1_schema.py: 31 test cases
- awooop_projects CHECK constraints (4 cases)
- revision immutability trigger (5 cases: draft editable, published locked, identity columns immutable, illegal transition, DELETE forbidden)
- awooop_published_revisions VIEW draft/published isolation (2 cases)
- active_pointer_guard (3 cases: cannot point to draft, can point to active, cross-tenant mismatch)
- RLS fail-closed (3 cases: project_id unset / wrongly set / correctly set)
- outbox FK + dedup (2 cases)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
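The config-driven allowlist/forbidlist described for `_validate_endpoint()` above might look roughly like the sketch below. This is an illustration only, not the code in telemetry.py: the function signature, the error messages, and the choice to compare on hostname alone are all assumptions.

```python
from urllib.parse import urlparse


def validate_endpoint(endpoint: str, allowed: list[str], forbidden: list[str]) -> str:
    """Return the endpoint if its host passes the lists, else raise ValueError.

    Hypothetical helper: `allowed` / `forbidden` stand in for the values of
    OTEL_ALLOWED_ENDPOINTS / OTEL_FORBIDDEN_ENDPOINTS after parsing.
    """
    host = urlparse(endpoint).hostname
    if host is None:
        raise ValueError(f"cannot parse host from {endpoint!r}")
    if host in forbidden:
        raise ValueError(f"{host} is on the forbidden list")
    # An empty allowlist is treated as "allow anything not forbidden" (assumption).
    if allowed and host not in allowed:
        raise ValueError(f"{host} is not on the allowlist")
    return endpoint
```

With this shape, a tenant would only need to change the two lists through environment settings rather than edit an assert in code.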
Your Name	2409d861fa	fix(test): update auto_recovery test assertions to ADR-110 (ollama_111 → ollama_gcp_a) [Some checks failed: Code Review / ai-code-review (push) Successful in 55s; CD Pipeline / tests (push) Failing after 1m22s; CD Pipeline / build-and-deploy (push) Has been skipped; CD Pipeline / post-deploy-checks (push) Has been skipped]	2026-05-03 22:57:58 +08:00
- notify_recovery assertions changed to "ollama_gcp_a" (3 places)
- alert_recovery payload["to"] changed to "ollama"
- test_full_recovery_flow now uses a mock alerter to avoid hitting the real Telegram Bot API
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Your Name	b1ef05fa8c	feat(ollama): ADR-110 GCP three-tier disaster-recovery architecture (GCP-A → GCP-B → Local → Gemini) [Some checks failed: Code Review / ai-code-review (push) Successful in 50s; CD Pipeline / tests (push) Failing after 1m14s; CD Pipeline / build-and-deploy (push) Has been skipped; CD Pipeline / post-deploy-checks (push) Has been skipped]	2026-05-03 22:49:23 +08:00
## Change summary
- Primary: http://34.143.170.20:11434 (GCP-A SSD, 9x load speed + 2x inference)
- Secondary: http://34.21.145.224:11434 (GCP-B SSD)
- Fallback: http://192.168.0.111:11434 (M1 Pro Local HDD, last line of defense)
- Retires ADR-105 (the "111 only" iron rule) and creates ADR-110
## Core changes
- config.py: add OLLAMA_SECONDARY_URL; validator adds a GCP IP allowlist (34.143.170.20, 34.21.145.224)
- ollama_failover_manager.py: three-tier Ollama decision matrix; parallel health checks on all three hosts; health_111 → health_gcp_a
- ollama_health_monitor.py: host label extraction generalized (supports GCP public IPs)
- failover_alerter.py: failed/recovered host shown dynamically, no longer hard-coded as "Ollama 111 (GPU)"
- ollama_auto_recovery.py: notify_recovery changed to ollama_gcp_a; recovered_host is dynamic
- k8s/awoooi-prod: configmap + deployment + network-policy updated in sync (egress adds GCP /32)
- Service layer: 10 service files that hard-coded 192.168.0.111 now read settings.OLLAMA_URL
- Tests: URL constants updated; new three-tier failover scenarios; GCP IP allowlist validation tests
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
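The tier order in ADR-110 (GCP-A → GCP-B → Local → Gemini) with parallel health checks can be sketched as follows. This is a minimal illustration, not the logic in ollama_failover_manager.py: the tier names `ollama_gcp_b` and `ollama_local`, the `pick_provider` function, and the injected `check` coroutine are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable

# URLs are the ones listed in the commit; the second and third names are assumed.
OLLAMA_TIERS = [
    ("ollama_gcp_a", "http://34.143.170.20:11434"),
    ("ollama_gcp_b", "http://34.21.145.224:11434"),
    ("ollama_local", "http://192.168.0.111:11434"),
]


async def pick_provider(check: Callable[[str], Awaitable[bool]]) -> str:
    """Probe all tiers concurrently, then pick the first healthy one by priority."""
    results = await asyncio.gather(
        *(check(url) for _, url in OLLAMA_TIERS), return_exceptions=True
    )
    for (name, _), healthy in zip(OLLAMA_TIERS, results):
        if healthy is True:  # False and raised exceptions both count as unhealthy
            return name
    return "gemini"  # no Ollama tier is healthy: fall through to the cloud fallback
```

Probing in parallel keeps the decision latency close to the slowest single probe instead of the sum of all three.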
Your Name	e45b055e0e	feat(governance): AI governance event-handling chain, four-track delivery (C/D/B/A) [Some checks failed: Code Review / ai-code-review (push) Successful in 48s; run-migration / migrate (push) Failing after 45s; CD Pipeline / tests (push) Successful in 3m46s; Type Sync Check / check-type-sync (push) Successful in 2m8s; CD Pipeline / build-and-deploy (push) Failing after 31m14s; CD Pipeline / post-deploy-checks (push) Has been skipped]	2026-05-03 12:42:40 +08:00
[Twelve-expert team panoramic scan + parallel four-track implementation]
After the commander asked "did you have the 12 agents collaborate?", the full chain was delivered according to the team rules: onboarder + critic + db-expert + debugger + frontend-designer scanned in parallel and found 6 major gaps, which fullstack-engineer × 4 and refactor-specialist then implemented together.
[Track C: trust_drift dual-write consolidation]
Two independent paths wrote event_type=trust_drift without calling each other, so downstream consumers received duplicate data and could not tell which was the source of truth. The consolidation keeps governance_agent.check_trust_drift (more complete: auto-deprecate + Telegram + PG), demotes TrustDriftDetector to a pure statistics lib, and switches the W-6 watchdog to call governance_agent. Adds TestSinglePgWritePerDriftScenario to verify that one drift scenario triggers exactly one PG write.
Changes:
- apps/api/src/services/trust_drift_detector.py (lib only, no longer writes to PG)
- apps/api/tests/test_trust_drift_watchdog.py (W-6 now mocks governance_agent)
[Track D: governance_remediation_dispatch table]
ai_governance_events is immutable Event Sourcing and cannot hold execution state. A new dispatch table acts as the projection layer: 1 event → 0..N dispatches, with status that is mutable, retryable, and auditable.
- PgEnum with 5 event_type values + 7-stage state machine (pending → dispatched → executing → succeeded/failed/cancelled/skipped)
- A retry after failure INSERTs a new row (the old row's status is left unchanged to preserve the audit trail)
- Partial unique index ux_grd_one_active_per_event enforces "one active dispatch per event"
- 4 composite indexes support worker polling, dedup queries, and observability dashboards
- FKs to ai_governance_events / playbooks / incidents / approval_records all use SET NULL (avoid cascade lock, but governance_event uses RESTRICT)
Changes:
- apps/api/src/db/models.py (GovernanceRemediationDispatch ORM class)
- apps/api/migrations/governance_remediation_dispatch_2026-05-03.sql
- apps/api/src/repositories/governance_remediation_dispatch_repo.py (6 async functions + 3 custom exceptions: DispatchAlreadyActive / InvalidStatusTransition / DispatchNotFound)
- apps/api/src/models/governance_dispatch.py (4 schemas, including DecisionContextV1)
- apps/api/tests/test_governance_remediation_dispatch.py (29 tests)
[Track B: /governance page]
Backend PR1 with three endpoints + frontend PR2-5 with the complete three tabs.
PR1 backend:
- GET /api/v1/ai/governance/events (events_tab, with event_type/severity/status/time-range filters + pagination)
- GET /api/v1/ai/governance/queue (queue_tab, with graceful fallback: when the dispatch table does not exist, returns table_pending=True instead of raising a 500)
- GET /api/v1/ai/governance/summary (slo_tab 30d violation time series)
- Severity mapping rules are hard-coded (critic suggests moving them to settings later)
PR2-5 frontend:
- /governance route + AppLayout + Compliance Badge banner + PageTabs
- SLO Tab: 3 KPI cards (Syne 28px + StatusOrb + 7d sparkline) + 30d violation stacked BarChart
- Events Tab: filter bar + table + inline expandable rows (JSON / remediation suggestions / dispatch records)
- Queue Tab: HITL to-do cards + trust progress bar + approve/reject buttons (console.log only in this PR)
- Sidebar adds an "AI Governance" entry (ShieldCheck icon)
- i18n complete in both languages (governance namespace + nav.governance)
- 7 new components: slo-kpi-card / slo-violation-chart / events-table / events-filter-bar / event-detail-drawer / queue-item-card / queue-history-tabs
Changes:
- apps/api/src/api/v1/ai_governance.py (router)
- apps/api/src/services/governance_query_service.py
- apps/api/src/models/governance.py (Pydantic V2 schemas)
- apps/api/tests/test_ai_governance_endpoints.py (21 tests)
- apps/web/src/app/[locale]/governance/ (page + 3 tabs)
- apps/web/src/components/governance/ (7 components)
- apps/web/messages/{zh-TW,en}.json (governance namespace)
- apps/web/src/components/layout/sidebar.tsx (+1 line)
- apps/api/src/main.py (router include)
[Track A: GovernanceDispatcher decision fusion]
Connects governance events to the remediation executor using the north-star decision fusion (LLM × Playbook trust × MCP), in line with the "no hard-coded rules" iron rule.
- Design rule: DecisionFusionAdapter is a new wrapper; it modifies no Tier 3 file (decision_manager / learning_service / trust_engine) and only consumes existing APIs
- Three-dimensional fusion formula: confidence = 0.4×llm + 0.3×playbook_trust + 0.3×mcp_consistency (the weights carry a TODO noting that AI self-learning will tune them later)
- Three-branch decision path: confidence ≥ 0.85 → auto_dispatch (status=dispatched); 0.65 ≤ confidence < 0.85 → pending_approval (HITL); confidence < 0.65 → skip + log
- decision_context JSONB records a full snapshot of the three inputs (for future fine-tuning)
- Polls every 30s for unresolved events, modeled on the governance loop
- Duplicate events are deduplicated (by calling get_active_for_event)
Changes:
- apps/api/src/services/governance_dispatcher.py
- apps/api/src/services/decision_fusion_adapter.py
- apps/api/tests/test_governance_dispatcher.py (14 tests)
- apps/api/src/main.py (lifespan task wires in run_governance_dispatcher_loop)
[Verification]
All 1836 unit tests pass (29 skipped because of a pre-existing PG integration env issue)
[Orchestration lessons: recorded in memory]
- vuln-verifier should run before fullstack-engineer (to avoid misjudging after reading already-fixed code in parallel)
- The two-round critic review cannot be skipped (the second round caught the NaN sentinel + Prom rule chain reaction)
- The north-star "no hard-coded rules" principle was actually carried out together with decision-fusion
[Tier 3 untouched: verified]
git diff confirms this commit does not change decision_manager.py / learning_service.py / trust_engine.py at all; it only adds wrapper services that consume existing APIs.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
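The Track A fusion formula and its three decision branches can be written down directly. The sketch below uses the weights and thresholds stated in the commit message; the function names and the assumption that all three inputs are already normalized to [0, 1] are mine, not taken from decision_fusion_adapter.py.

```python
def fuse_confidence(llm: float, playbook_trust: float, mcp_consistency: float) -> float:
    """Weighted sum from the commit: 0.4*llm + 0.3*playbook_trust + 0.3*mcp_consistency."""
    return 0.4 * llm + 0.3 * playbook_trust + 0.3 * mcp_consistency


def route(confidence: float) -> str:
    """Map a fused confidence to one of the three branches named in the commit."""
    if confidence >= 0.85:
        return "auto_dispatch"      # status=dispatched
    if confidence >= 0.65:
        return "pending_approval"   # human-in-the-loop
    return "skip"                   # skip + log
```

For example, inputs of 0.9, 0.6 and 0.5 fuse to roughly 0.69, which lands in the `pending_approval` band.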
Your Name	8fb0c5df33	feat(heartbeat): noise reduction: silent 6h + warnings hash dedup [Some checks failed: Code Review / ai-code-review (push) Successful in 47s; CD Pipeline / tests (push) Successful in 2m11s; CD Pipeline / build-and-deploy (push) Failing after 31m12s; CD Pipeline / post-deploy-checks (push) Has been skipped]	2026-05-03 01:48:57 +08:00
P0 #4 (thorough long-term fix series). Evidence from the commander: "INFO | AWOOOI system report" is pushed every 30 minutes, 48 identical messages a day. Even after I fixed the P0 #3 false alarm, the repeated daily "all systems normal" pushes are themselves noise and make the commander think alerts are still repeating.
Fix (does not violate the "monitoring tools must themselves be monitored" iron rule: a healthy state still pushes one "I'm alive" every 6h):
| Condition | Push behavior |
|------|---------|
| Healthy (no warnings) | at most 1 "I'm alive" signal per 6h |
| Warnings with the same hash as last time | skip |
| Warnings that differ from last time | push immediately (new conditions are not missed) |
| Healthy ↔ warning transition | automatically clear the opposite marker |
Redis keys:
- `heartbeat:silent_last_sent`: healthy-state silent marker, TTL=6h
- `heartbeat:warnings_hash`: md5[:12] of the previous warnings, TTL=24h
Effect: the commander's daily heartbeats drop from 48 to ~4 (healthy state, 4×6h), and anything wrong is pushed immediately.
Tests: 6 passed (test_heartbeat_dedup_p0_4.py)
- healthy_first_send_goes_through
- healthy_second_send_within_6h_skipped
- warnings_unchanged_skipped
- warnings_changed_pushes
- warnings_to_healthy_clears_warnings_hash
- healthy_to_warnings_clears_silent_marker
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
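The push rules in the heartbeat table above amount to a small decision function over two expiring markers. The sketch below is one way to express it; it replaces Redis with an in-memory `TtlStore` (hypothetical) and assumes the warnings hash is taken over the sorted, newline-joined warning strings, which the commit does not specify.

```python
import hashlib

SILENT_KEY = "heartbeat:silent_last_sent"   # key names from the commit
HASH_KEY = "heartbeat:warnings_hash"
SILENT_TTL = 6 * 3600
HASH_TTL = 24 * 3600


class TtlStore:
    """In-memory stand-in for Redis GET / SET-with-expiry / DEL (assumption)."""

    def __init__(self):
        self._data = {}

    def get(self, key, now):
        item = self._data.get(key)
        if item is None or item[1] <= now:
            self._data.pop(key, None)
            return None
        return item[0]

    def set(self, key, value, ttl, now):
        self._data[key] = (value, now + ttl)

    def delete(self, key):
        self._data.pop(key, None)


def should_push(store: TtlStore, warnings: list[str], now: float) -> bool:
    """Return True when a heartbeat message should be sent at time `now` (seconds)."""
    if not warnings:
        store.delete(HASH_KEY)                 # healthy: clear the warnings marker
        if store.get(SILENT_KEY, now) is not None:
            return False                       # already said "I'm alive" within 6h
        store.set(SILENT_KEY, "1", SILENT_TTL, now)
        return True
    store.delete(SILENT_KEY)                   # unhealthy: clear the silent marker
    digest = hashlib.md5("\n".join(sorted(warnings)).encode()).hexdigest()[:12]
    if store.get(HASH_KEY, now) == digest:
        return False                           # same warnings as last time
    store.set(HASH_KEY, digest, HASH_TTL, now)
    return True
```

Clearing the opposite marker on every call is what makes a healthy → warning → healthy sequence push on each transition.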
Your Name	2ce722bda9	feat(heartbeat): full K8s pod lifecycle state machine + regression tests [Some checks failed: Code Review / ai-code-review (push) Successful in 51s; CD Pipeline / tests (push) Successful in 2m59s; CD Pipeline / build-and-deploy (push) Has started running; CD Pipeline / post-deploy-checks (push) Has been cancelled]	2026-05-03 01:44:58 +08:00
P0 #3 (thorough long-term fix series): upgrades the daily report's pod health check from "always alert on ready=False" to a full K8s pod lifecycle state machine:
| Phase | Behavior |
|-------|------|
| Succeeded / Completed | skip (CronJob/Job finished normally) |
| Failed | always alert |
| Unknown | always alert |
| Pending <5min | skip (just scheduled, reasonable) |
| Pending >=5min | alert "image pull / scheduling stuck" |
| Running ready=True | healthy, skip |
| Running ready=False <2min | skip (just started, probe has not passed yet) |
| Running ready=False >=2min | alert "readiness probe fail / abnormal startup" |
| restarts >=3 | always alert (regardless of phase) |
Implementation:
- PodInfo adds start_time: Optional[str] (from .status.startTime)
- _get_pod_status kubectl custom-columns adds STARTTIME
- _build_warnings: full state machine + threshold constants
Regression tests (test_heartbeat_pod_state_machine.py, 13 tests) cover every phase + boundary conditions, including a reproduction of the commander's 2026-05-02 screenshot evidence (3 drift-scanner Succeeded pods must not trigger the "3 items need attention" false alarm).
Tests: 13 passed (new test_heartbeat_pod_state_machine.py)
Follows a38d9112 (simple Succeeded skip); this change thoroughly handles Pending/Failed/Unknown + time thresholds + a conservative alert when start_time is missing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
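The phase table above maps onto a single function. The following is a sketch under assumptions rather than the body of `_build_warnings`: the function name, its parameters (pod age in seconds instead of a raw start_time string), and the handling of unrecognized phases are hypothetical, while the thresholds are the ones from the table.

```python
from typing import Optional

PENDING_GRACE_S = 5 * 60      # Pending for less than 5 min is tolerated
NOT_READY_GRACE_S = 2 * 60    # Running but not ready for less than 2 min is tolerated
RESTART_LIMIT = 3


def pod_warning(phase: str, ready: bool, restarts: int,
                age_s: Optional[float]) -> Optional[str]:
    """Return a warning string, or None if the pod needs no attention.

    `age_s` is seconds since start; None means start_time is missing, which
    is treated conservatively as "old enough to alert".
    """
    if restarts >= RESTART_LIMIT:
        return f"restarts={restarts}"                 # alert regardless of phase
    if phase in ("Succeeded", "Completed"):
        return None                                   # finished Job/CronJob pod
    if phase in ("Failed", "Unknown"):
        return f"phase={phase}"
    if phase == "Pending":
        if age_s is None or age_s >= PENDING_GRACE_S:
            return "image pull / scheduling stuck"
        return None
    if phase == "Running":
        if ready:
            return None
        if age_s is None or age_s >= NOT_READY_GRACE_S:
            return "readiness probe fail / abnormal startup"
        return None
    return f"phase={phase}"   # unrecognized phase: alert (assumption)
```

Checking the restart count first is what makes that rule apply regardless of phase, as the last table row requires.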
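Your Name	f1362fcc8d	fix(governance): fix 4 silent failures in governance alerts + Prom sentinel chain reaction [Some checks failed: Code Review / ai-code-review (push) Successful in 49s; CD Pipeline / tests (push) Successful in 2m9s; CD Pipeline / build-and-deploy (push) Failing after 31m11s; CD Pipeline / post-deploy-checks (push) Has been skipped]	2026-05-03 00:18:57 +08:00
[Panoramic check: a 12-agent parallel scan located 4 major bugs and 1 chained P0 regression]
Bug 1 (P0 silent failure): governance_agent.check_trust_drift
The original `await db.commit()` was mis-indented outside the async with block (8 spaces vs 12). The session had already auto-committed and closed, so the second commit raised InvalidRequestError, which was swallowed, and the governance_trust_drift_auto_deprecated log never appeared. Fix: move commit/log back inside the with block. An AST regression guard test is included to prevent regressions.
Bug 2: flywheel_stats_service / W-3 false alarm on fresh deploy
When Redis is empty, total_exec=0 → rate=0.0 → the watchdog's `< 0.30` check immediately fires a false "flywheel success rate 0%" alert. Fix: return None when total_exec < FLYWHEEL_MIN_SAMPLE(10); the watchdog skips W-3 on None. The Prometheus sentinel uses NaN (not -1.0) so that it does not trip the `< 0.1` condition in 3 prom rules such as ops/monitoring/alerts.yml:775, which would cause a chain of false alerts 2h later. The frontend type is synced to number | null.
Bug 3: failover_alerter dedup key
The original key looked only at event_type, not at the payload, so the trust_drift change from 4 to 25 IDs was swallowed entirely by the 1h dedup. Fix: the dedup key adds sha256(impact subdict)[:8], and event_type is sanitized so special characters cannot pollute the Redis key.
Bug 4: ai_slo_watchdog_job W-4 evolver "all archived" false alarm during initialization
The original logic alerted whenever approved==0 and did not exclude the "playbooks table still initializing" case. Fix: _count_approved_playbooks returns (approved, total); total==0 → skip.
[Results]
- All 39 related unit tests pass (test_failover_alerter / test_governance_agent / test_trust_drift_watchdog / test_check_trust_drift_commit_outside_context_poc)
- 6 critical paths verified in practice: NaN sentinel / float rendering / hash distinctness / same impact yields the same dedup hash / datetime tolerance / py_compile passes on all 4 files
[Orchestration lessons: kept for future improvement]
- During 12-agent parallel orchestration, a race between vuln-verifier and fullstack-engineer caused vuln-verifier to read already-fixed code and wrongly report NOT REPRODUCIBLE. Going forward: vuln-verifier should run before fullstack, or compare against the pre-fix state with git show HEAD~1.
- fullstack-engineer introduced a P0 regression (a ternary embedded in an f-string, producing an invalid format spec); critic caught it along with the Prom sentinel chain reaction, which shows critic review is necessary and cannot be skipped.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>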
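Two of the fixes above (Bug 2 and Bug 3) are small enough to sketch. The code below is illustrative only: the `alert_dedup:` key prefix, the sanitizing regex, the JSON canonicalization, and both function names are assumptions; the commit only states `sha256(impact subdict)[:8]`, a sanitized event_type, and `FLYWHEEL_MIN_SAMPLE` of 10.

```python
import hashlib
import json
import re
from typing import Optional

FLYWHEEL_MIN_SAMPLE = 10


def success_rate(success: int, total_exec: int) -> Optional[float]:
    """Return None below the minimum sample so the watchdog can skip the check."""
    if total_exec < FLYWHEEL_MIN_SAMPLE:
        return None
    return success / total_exec


def dedup_key(event_type: str, impact: dict) -> str:
    """Build a dedup key that changes whenever the impact payload changes."""
    safe_type = re.sub(r"[^A-Za-z0-9_.-]", "_", event_type)   # keep the Redis key clean
    canonical = json.dumps(impact, sort_keys=True, separators=(",", ":"), default=str)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]
    return f"alert_dedup:{safe_type}:{digest}"
```

Serializing with `sort_keys=True` keeps the hash stable when the same impact dict is built with a different key order, so only a real payload change produces a new key.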
Your Name	314cb0e079	fix(test): align governance self_failure assertions with nested payload schema [Some checks failed: Code Review / ai-code-review (push) Successful in 48s; CD Pipeline / tests (push) Successful in 2m18s; CD Pipeline / post-deploy-checks (push) Has been cancelled; CD Pipeline / build-and-deploy (push) Has been cancelled]	2026-05-03 00:05:04 +08:00
Codex commits `dedb1208` + `b710f3f3` (governance enrich + normalize) restructured the payload of _alert("governance_self_failure", ...) into a nested form: {status, impact: {failed_checks, total_checks, errors}, remediation, actionable} (governance_agent.py:604-624, 2026-04-29 critic M6 fix). However, 3 tests still read the old path `payload["total_checks"]` directly, raising KeyError, after which the RuntimeError simulation failed in cascade.
Fix: 3 assertions now read the correct nested path:
- test_governance_agent.py:601 → payload["impact"]["total_checks"|"failed_checks"]
- test_wave8_remaining_blockers.py:223 → same
- test_wave8_remaining_blockers.py:268 → same
Tests: 30 passed (all of test_governance_agent + test_wave8_remaining_blockers)
Effect: unblocks the three commits `dedb1208` / `b710f3f3` / `a38d9112`, which were held up before build-and-deploy by the failing governance tests, and restores the CD chain.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Your Name	da772a1605	fix(decision): block kubectl actions on bare_metal host alerts [All checks were successful: Code Review / ai-code-review (push) Successful in 54s; CD Pipeline / tests (push) Successful in 3m47s; CD Pipeline / build-and-deploy (push) Successful in 13m26s; CD Pipeline / post-deploy-checks (push) Successful in 5m45s]	2026-05-02 17:41:28 +08:00
When HostHighCpuLoad / HostOutOfMemory fire on a bare-metal host (192.168.0.110 et al, where Sentry / ClickHouse / Snuba are eating CPU), the LLM kept proposing "kubectl rollout restart awoooi-api", which is a wrong-domain action: restarting awoooi cannot fix a third-party process's CPU usage on the host. Auto-execute would then either run the no-op kubectl restart (wasted) or escalate after ssh_diagnose because no safe action was found, producing the "AI auto-repair failed" Telegram noise the user just complained about.
Adds a guard at the top of DecisionManager._auto_execute: if the incident's primary signal carries host_type=bare_metal AND the proposed action starts with "kubectl", refuse to execute. The incident is marked READY with a clear blocked_reason so human operators see why automation declined, and emergency_escalation records the event in AOL for audit.
Also patches /home/wooo/monitoring/alerts.yml on 110 (and the new ops/monitoring/alerts.yml in repo) to add an explicit auto_repair_action annotation on HostHighCpuLoad / HostOutOfMemory that hints LLM toward `ssh ... ps aux` rather than kubectl restart. Prometheus reload returned 200.
Tests: tests/test_decision_manager_bare_metal_kubectl_guard.py covers (1) bare_metal+kubectl blocked, (2) kubectl get also blocked, (3) bare_metal+ssh NOT blocked, (4) k8s host_type+kubectl NOT blocked, (5) missing host_type label NOT blocked.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
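The guard condition described above reduces to a two-part predicate. A minimal sketch follows, assuming the host type and proposed action are available as plain values; the function name and the `"k8s"` label value used in the test cases are hypothetical, and the real guard lives inside `DecisionManager._auto_execute`.

```python
from typing import Optional


def is_wrong_domain_kubectl(host_type: Optional[str], action: str) -> bool:
    """True when a kubectl action is proposed for an alert on a bare-metal host."""
    return host_type == "bare_metal" and action.strip().startswith("kubectl")


# The five cases listed in the commit's test summary:
CASES = [
    ("bare_metal", "kubectl rollout restart awoooi-api", True),
    ("bare_metal", "kubectl get pods", True),
    ("bare_metal", "ssh 192.168.0.110 'ps aux'", False),
    ("k8s", "kubectl rollout restart awoooi-api", False),
    (None, "kubectl rollout restart awoooi-api", False),
]
```

A missing host_type label deliberately does not block, which matches case (5) in the commit.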
Your Name	3059897318	feat(governance): auto-deprecate low-trust unused playbooks (>30d) [Some checks failed: Code Review / ai-code-review (push) Successful in 41s; CD Pipeline / tests (push) Successful in 3m29s; CD Pipeline / post-deploy-checks (push) Has been cancelled; CD Pipeline / build-and-deploy (push) Has been cancelled]	2026-05-02 12:31:37 +08:00
trust_drift previously fired alerts forever for playbooks stuck below the 0.2 threshold. With user authorization for governance-class auto-fixes, check_trust_drift now retires playbooks that have been unused for 30+ days (or never used and created 30+ days ago) by flipping status to 'deprecated' before alerting.
Alerts now report drifted_count, auto_deprecated_count, and the kept playbook_ids that still need human review (those in their 30d trial window). Existing alert noise from the four currently-drifted playbooks should drop to whatever fraction is genuinely in trial.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Your Name	607358c4dd	fix(approval): route SSH actions through SSHProvider on manual approve	2026-05-02 12:31:37 +08:00
parse_operation_from_action only knew kubectl and Chinese restart phrases, so any "ssh host '...'" action approved via Telegram fell through to "Could not parse operation type" and reported a fake failure even though the LLM had proposed a valid host repair.
Adds OperationType.SSH_HOST, makes the parser detect ssh prefixes (with optional flags / user@host) before kubectl patterns, and routes the SSH_HOST branch in approval_execution.execute_in_background through SSHProvider with the same tool keywords decision_manager uses (ssh_docker_prune / ssh_docker_restart / ssh_systemctl_restart / ssh_diagnose). Unroutable SSH actions now fail loudly with a descriptive error instead of silently breaking.
Trigger: 2026-05-02 incidents INC-20260502-D6D0B7 / E12EE4 / 557055 were approved by the user but executor reported "Could not parse" and left the alerts pending.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Your Name	3156ff1c69	feat(aiops): add ssh_docker_prune to auto-repair flywheel for disk-full alerts	2026-05-02 12:31:37 +08:00
Adds Group B SSH MCP tool ssh_docker_prune (image+volume+builder prune with ≥75% disk usage gate) and routes "docker prune" actions through it. Flips HostDiskUsageHigh from auto_repair=false to true with mcp_provider routing labels so the flywheel can self-heal next disk-full event without hitting the emergency_channel Telegram path.
Trigger: 2026-05-01 → 05-02 Telegram alert storm (peak 53/hr) caused by empty ssh-mcp-key/known_hosts secret rejecting all SSH and forcing every disk-full alert through "Host key is not trusted → escalate" loop. known_hosts patched live; this commit closes the playbook gap so the next occurrence resolves without manual intervention.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
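The retirement rule for drifted playbooks (trust below 0.2, and either unused for 30+ days or never used and created 30+ days ago) can be expressed as a single predicate. This is a sketch with hypothetical names and parameters, not the code in `check_trust_drift`.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

TRUST_THRESHOLD = 0.2
TRIAL_WINDOW = timedelta(days=30)


def should_auto_deprecate(trust: float, created_at: datetime,
                          last_used_at: Optional[datetime], now: datetime) -> bool:
    """True when a low-trust playbook has sat idle past the 30-day window."""
    if trust >= TRUST_THRESHOLD:
        return False                          # not drifted at all
    # Never-used playbooks are measured from their creation time.
    reference = last_used_at if last_used_at is not None else created_at
    return now - reference >= TRIAL_WINDOW
```

Playbooks for which this returns False while trust is still below the threshold are the ones the alert keeps listing for human review.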
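Detecting an ssh action before falling back to kubectl patterns, as the parser fix above describes, could be sketched like this. Only `OperationType.SSH_HOST` is named in the commit; the other enum members, the `parse_operation` function, and the use of `shlex` are assumptions made for illustration.

```python
import shlex
from enum import Enum


class OperationType(Enum):
    SSH_HOST = "ssh_host"
    KUBECTL = "kubectl"      # hypothetical member name
    UNKNOWN = "unknown"      # hypothetical member name


def parse_operation(action: str) -> OperationType:
    """Classify an action string by its leading command word."""
    try:
        tokens = shlex.split(action)
    except ValueError:               # e.g. unbalanced quotes
        return OperationType.UNKNOWN
    if not tokens:
        return OperationType.UNKNOWN
    # ssh is checked first so flags and user@host after it do not matter.
    if tokens[0] == "ssh":
        return OperationType.SSH_HOST
    if tokens[0] == "kubectl":
        return OperationType.KUBECTL
    return OperationType.UNKNOWN
```

Returning an explicit `UNKNOWN` lets the caller raise a descriptive error instead of reporting a misleading failure.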
Your Name	7795f027d2	fix(aiops): persist emergency intervention traces [Some checks failed: CD Pipeline / tests (push) Successful in 2m56s; Code Review / ai-code-review (push) Failing after 39s; CD Pipeline / build-and-deploy (push) Successful in 12m54s; CD Pipeline / post-deploy-checks (push) Successful in 4m40s]	2026-05-01 20:34:33 +08:00
Your Name	433f7b068e	fix(aiops): close ssh and telegram remediation gaps [All checks were successful: CD Pipeline / tests (push) Successful in 2m7s; Code Review / ai-code-review (push) Successful in 42s; CD Pipeline / build-and-deploy (push) Successful in 13m14s; CD Pipeline / post-deploy-checks (push) Successful in 4m29s]	2026-05-01 16:53:02 +08:00
Your Name	b0da6da1e9	feat(aiops): structure agent loop shadow output [Some checks failed: CD Pipeline / tests (push) Successful in 2m50s; Code Review / ai-code-review (push) Successful in 33s; CD Pipeline / build-and-deploy (push) Failing after 25m48s; CD Pipeline / post-deploy-checks (push) Has been cancelled]	2026-05-01 15:09:57 +08:00
Your Name	f8e44971c1	feat(aiops): enable read-only agent loop canary [All checks were successful: CD Pipeline / tests (push) Successful in 1m43s; Code Review / ai-code-review (push) Successful in 31s; CD Pipeline / build-and-deploy (push) Successful in 10m22s; CD Pipeline / post-deploy-checks (push) Successful in 4m3s]	2026-05-01 14:20:16 +08:00
Your Name	b6cf616707	fix(aiops): harden agent tool permission names [All checks were successful: CD Pipeline / tests (push) Successful in 1m32s; Code Review / ai-code-review (push) Successful in 27s; CD Pipeline / build-and-deploy (push) Successful in 8m26s; CD Pipeline / post-deploy-checks (push) Successful in 3m37s]	2026-05-01 13:52:33 +08:00
Your Name	7e4d995e4b	feat(aiops): add mcp agent loop foundation [Some checks failed: CD Pipeline / tests (push) Successful in 1m59s; Code Review / ai-code-review (push) Successful in 28s; run-migration / migrate (push) Failing after 24s; CD Pipeline / post-deploy-checks (push) Has been cancelled; CD Pipeline / build-and-deploy (push) Has been cancelled]	2026-05-01 13:21:19 +08:00
Your Name	9db87f177e	fix(aiops): suppress repeated llm alert loops [Some checks failed: CD Pipeline / tests (push) Successful in 1m37s; Code Review / ai-code-review (push) Successful in 28s; CD Pipeline / post-deploy-checks (push) Has been cancelled; CD Pipeline / build-and-deploy (push) Has been cancelled]	2026-05-01 13:02:07 +08:00
Your Name	11673d80ea	fix(aiops): route backup decisions through ssh [Some checks failed: CD Pipeline / tests (push) Successful in 1m35s; Code Review / ai-code-review (push) Successful in 34s; CD Pipeline / post-deploy-checks (push) Has been cancelled; CD Pipeline / build-and-deploy (push) Has been cancelled]	2026-05-01 12:50:01 +08:00
Your Name	3a6acae408	fix(km): add phase25 knowledge enum labels [Some checks failed: CD Pipeline / tests (push) Successful in 2m14s; Code Review / ai-code-review (push) Successful in 26s; run-migration / migrate (push) Failing after 24s; CD Pipeline / build-and-deploy (push) Has been cancelled; CD Pipeline / post-deploy-checks (push) Has been cancelled]	2026-05-01 11:03:03 +08:00
Your Name	97be5dedd7	fix(aiops): escalate failed host verification [Some checks failed: CD Pipeline / tests (push) Successful in 1m27s; Code Review / ai-code-review (push) Successful in 29s; CD Pipeline / post-deploy-checks (push) Has been cancelled; CD Pipeline / build-and-deploy (push) Has been cancelled]	2026-05-01 10:47:42 +08:00
Your Name	e4aef6ac4e	fix(aiops): block k8s playbooks for host repair [All checks were successful: CD Pipeline / tests (push) Successful in 1m27s; Code Review / ai-code-review (push) Successful in 26s; CD Pipeline / build-and-deploy (push) Successful in 8m6s; CD Pipeline / post-deploy-checks (push) Successful in 3m31s]	2026-05-01 10:33:52 +08:00
Your Name	ca22ec2fd2	fix(aiops): route backup failures rule-first [All checks were successful: CD Pipeline / tests (push) Successful in 1m51s; Code Review / ai-code-review (push) Successful in 30s; Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 42s; CD Pipeline / build-and-deploy (push) Successful in 8m21s; CD Pipeline / post-deploy-checks (push) Successful in 4m18s]	2026-05-01 10:11:10 +08:00
Your Name	f154ac022e	feat(playbook): version generated playbooks [All checks were successful: CD Pipeline / tests (push) Successful in 1m34s; Code Review / ai-code-review (push) Successful in 28s; Type Sync Check / check-type-sync (push) Successful in 1m10s; CD Pipeline / build-and-deploy (push) Successful in 10m19s; CD Pipeline / post-deploy-checks (push) Successful in 3m1s]	2026-04-30 23:59:39 +08:00
Your Name	f0d14ab6c4	fix(aiops): escalate blocked auto repair [Some checks failed: CD Pipeline / tests (push) Successful in 1m33s; Code Review / ai-code-review (push) Successful in 28s; Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 40s; CD Pipeline / post-deploy-checks (push) Has been cancelled; CD Pipeline / build-and-deploy (push) Has been cancelled]	2026-04-30 23:49:17 +08:00
Your Name	6e04fe9c8a	feat(playbook): generate drafts with local llm [Some checks failed: CD Pipeline / tests (push) Successful in 1m28s; Code Review / ai-code-review (push) Successful in 29s; Type Sync Check / check-type-sync (push) Failing after 2m41s; CD Pipeline / build-and-deploy (push) Successful in 8m40s; CD Pipeline / post-deploy-checks (push) Successful in 3m10s]	2026-04-30 23:04:58 +08:00
Your Name	95110971f3	fix(telegram): close remaining DM alert routes [Some checks failed: CD Pipeline / tests (push) Successful in 1m27s; Code Review / ai-code-review (push) Successful in 29s; CD Pipeline / post-deploy-checks (push) Has been cancelled; CD Pipeline / build-and-deploy (push) Has been cancelled]	2026-04-30 23:02:17 +08:00
Your Name	61f5a6a419	fix(telegram): route alerts to SRE war room [Some checks failed: CD Pipeline / tests (push) Has been cancelled; CD Pipeline / build-and-deploy (push) Has been cancelled; CD Pipeline / post-deploy-checks (push) Has been cancelled; Code Review / ai-code-review (push) Has been cancelled]	2026-04-30 15:01:23 +08:00
Your Name	80defbed7c	fix(aiops): fallback and escalate automation blockers [Some checks failed: CD Pipeline / tests (push) Successful in 2m41s; Code Review / ai-code-review (push) Successful in 24s; CD Pipeline / build-and-deploy (push) Successful in 7m51s; CD Pipeline / post-deploy-checks (push) Failing after 2m15s]	2026-04-30 14:13:57 +08:00
Your Name	ed2a4838f2	fix(auto): use action parser for repair gates [Some checks failed: CD Pipeline / tests (push) Failing after 1m2s; CD Pipeline / build-and-deploy (push) Has been skipped; CD Pipeline / post-deploy-checks (push) Has been skipped; Code Review / ai-code-review (push) Successful in 24s]	2026-04-30 14:06:09 +08:00
Your Name	639bb64788	feat(flywheel): surface ai automation and code review [Some checks failed: Code Review / ai-code-review (push) Successful in 31s; CD Pipeline / build-and-deploy (push) Failing after 5m23s]	2026-04-30 00:09:25 +08:00
Your Name	4a57c2d04f	feat(flywheel): expose incident processing timeline [All checks were successful: CD Pipeline / build-and-deploy (push) Successful in 10m56s]	2026-04-29 23:38:30 +08:00
Your Name	d845d53257	fix(security): keep Gemini key out of request URLs [All checks were successful: CD Pipeline / build-and-deploy (push) Successful in 15m5s]	2026-04-29 22:56:12 +08:00
Your Name	fe2b8f4571	fix(flywheel): fallback on OpenClaw degraded responses [All checks were successful: CD Pipeline / build-and-deploy (push) Successful in 9m56s]	2026-04-29 22:38:57 +08:00
Your Name	dccdcdbaf5	fix(flywheel): unblock action safety and Claude fallback [All checks were successful: CD Pipeline / build-and-deploy (push) Successful in 9m45s]	2026-04-29 21:51:18 +08:00
Your Name	4115ddde48	fix(cd-blocker-2): add KM columns to setup_test_schema.sql (fixes the real CD root cause) [All checks were successful: CD Pipeline / build-and-deploy (push) Successful in 14m4s]	2026-04-29 20:54:54 +08:00
## Earlier `c5b18101` fixed the wrong place
I added an ALTER in db/base.py:init_db(), which did not solve the problem. CI does not run init_db().
## Actual CD flow
`.gitea/workflows/cd.yaml` Integration Tests step:
1. Start a temporary `pg-test-b5` container (fresh PG)
2. `psql -f tests/integration/setup_test_schema.sql` to create the tables
3. Run pytest tests/integration/test_b5_core_flows.py
The `knowledge_entries` table in setup_test_schema.sql lacks the `related_approval_id` + `path_type` columns → INSERT fails.
## Fix
setup_test_schema.sql:110 `CREATE TABLE knowledge_entries` adds:
- related_approval_id VARCHAR(64)
- path_type VARCHAR(50)
- uix_knowledge_incident_path partial unique index
- ix_knowledge_related_approval partial index
## Expected effect
CD #1119 (this commit) should succeed. It unblocks the deployment backlog of 4 stuck commits (1114-1118). `fb0c72db` (overturn A2, DIAGNOSE Ollama primary) finally reaches prod.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Your Name	3668d49f2f	feat(flywheel): three W2 PRs + KMWriter critic fixes (all 1635 tests green) Some checks failed CD Pipeline / build-and-deploy (push) Failing after 1m38s Details W2 (week two of the onboarder 4-week flywheel 80→90 path) plus all 5 critical/major findings from the critic PR review are fixed; default flag=false, so there is no blast risk. ## Three W2 PRs ### PR-R2: AOL → catalog confidence EWMA writeback (fixes flywheel broken link C2) - New file `apps/api/src/jobs/aol_to_catalog_writeback_job.py` - Logic: scan AOL hourly, compute EWMA confidence (alpha=0.3), and write it back to alert_rule_catalog - Failure threshold: N=5 consecutive low success rates → review_status='draft' - Hermes _fetch_noisy_rules SQL adds OR review_status='draft' - ENABLE_AOL_WRITEBACK_JOB=false (default) - 8 tests (mock path fix: lazy import → patch src.db.base.get_db_context) ### PR-V1: wire in self_healing_validator (fixes flywheel broken link C6) - New file `apps/api/src/services/self_healing_validator.py` (pure function assess_self_healing) - Wired into post_execution_verifier.py step 5 (feature flag gate) - evidence_snapshot.py adds self_healing_score / self_healing_detail fields - db/models.py + base.py ALTER IF NOT EXISTS - score < 0.5 → sends a rollback-proposal Telegram alert (not executed automatically) - ENABLE_SELF_HEALING_VALIDATOR=false (default) - 7 tests ### PR-L1: KM ↔ Playbook bidirectional loop (fixes flywheel broken links C3+C4) - learning_service.py gains three new pieces of logic: 1. _write_playbook_evolution_km: promote/demote writes a KM evolution entry 2. _check_and_mark_playbook_review: N=5 accumulated triggers review_required 3. _demote_alert_rule_catalog_confidence: DEPRECATED → confidence×=0.5 - PlaybookRecord adds a review_required field (schema migration via base.py) - ENABLE_KM_PLAYBOOK_FEEDBACK_LOOP=false (default) - KM_PLAYBOOK_REVIEW_THRESHOLD=5 is tunable - 6 tests ## KMWriter critic: 5 critical/major fixes (found in the earlier critic PR review) Already fixed in the previously pushed commit `c5753e1c`; this commit restores the corresponding files from the stash: - C1 km_writer.py:194 self-defeating backfill (fixed: synchronous await + DLQ) - C2 km_writer.py:391 tighten the KM_WRITE_AWAIT=false path - M1 decision_manager.py:2178/2203 remove _fire_and_forget - M2 incident_service.py:1099 add retry+DLQ to the hand-rolled path - M3 km_writer.py:166 align the idempotency claim (UPSERT + partial unique index) ## Verification - All 1635 unit tests green (+27 from 1608) - Coexists with `fb0c72db` (overturning A2, Ollama primary) without conflicts - All new Jobs/Services default to flag=false (no blast) ## Expected impact Flywheel broken links C2 + C3 + C4 + C6 all fixed. Flywheel autonomy score: 65 → 85 estimated (after W2 completes). Enablement order (once prod `fb0c72db` confirms the OLLAMA primary actually runs): 1. ENABLE_AOL_WRITEBACK_JOB=true 2. ENABLE_KM_PLAYBOOK_FEEDBACK_LOOP=true 3. ENABLE_SELF_HEALING_VALIDATOR=true Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 19:44:04 +08:00
Your Name	fb0c72db42	feat(ai-router): overturn iron rule A2: DIAGNOSE primary switched to Ollama, local first Some checks failed CD Pipeline / build-and-deploy (push) Failing after 2m26s Details Commander's iron rule 2026-04-29: "Primarily prefer the Ollama on host 111" + feedback_ai_autonomous_direction.md: rely mainly on local free LLMs + feedback_ollama_111_only.md: the only Ollama host = 111 ## Factual basis for overturning A2 (2026-04-27 INC-20260425) Old fact: Ollama = CPU-only deepseek-r1:14b @ 238s (unusable). New fact: prod Ollama 111 = M1 Pro Apple Silicon GPU + qwen2.5:7b-instruct fully loaded in 8.2GB VRAM, ctx 32k, measured 0.54s on a "hi" prompt. All cloud providers are down (evidence from the 2026-04-29 prod log): - OpenClaw 188:8088 → 500 Internal Server Error - Gemini → 429 Too Many Requests (quota exhausted) - Claude → 404 Not Found (model claude-3-haiku-20240307 expired) Without overturning A2 → 100% of incidents end in llm_failed → AI auto-remediation never starts ## Scope of change (minimal, safe, verifiable) ### ai_router.py - `_diagnose_fallback_chain`: OLLAMA first (replaces the old "permanently excluded" comment); order: [OLLAMA, OPENCLAW_NEMO, GEMINI, CLAUDE] - `_intent_provider_overrides[DIAGNOSE]`: OPENCLAW_NEMO → OLLAMA - _full_fallback_chain untouched (to avoid affecting RESTART/SCALE/CONFIG/DELETE) - _tool_calling_fallback_chain untouched - complexity_map untouched (critic M2 deferred) ### openclaw.py - Inject task_type="diagnose" into alert_context (critic C2, the true root cause) - Fix the timeout alignment problem at ai_providers/ollama.py:77: - with task_type → OLLAMA_DIAGNOSE_TIMEOUT_SECONDS=200s - without → OPENCLAW_TIMEOUT=30s (not enough for qwen2.5:7b inference) - root cause of the latency_ms=120014 seen in the prod log - Copy with dict(alert_context) so the original context is not polluted ## Regression tests updated in step (5) All A2 iron-rule guard tests now reflect the new rule: - test_p0_diagnose_routing.py::test_diagnose_override_is_ollama (formerly test_diagnose_override_is_openclaw_nemo) - test_ai_router_diagnose_fallback.py::test_diagnose_fallback_chain_ollama_primary (formerly test_diagnose_fallback_chain_no_ollama) - test_ai_router_diagnose_fallback.py::test_diagnose_route_primary_is_ollama (formerly test_diagnose_route_fallback_chain_excludes_ollama) - test_ai_router_diagnose_fallback.py::test_diagnose_route_sync_primary_is_ollama (formerly test_diagnose_route_sync_fallback_chain_excludes_ollama) - test_ai_router_diagnose_fallback.py::test_build_fallback_chain_for_intent_diagnose_with_ollama_primary (formerly test_build_fallback_chain_for_intent_diagnose_no_ollama) - test_ai_router_failover_integration.py::test_router_uses_failover_for_diagnose_ollama_primary (formerly test_router_does_not_use_failover_for_openclaw_nemo) Each test docstring records the historical context and the reason for the overturn. ## Verification - All 1608 unit tests green - All 16 LLM-path tests green (including the 6 updated A2 guard tests) - complexity_scorer / failover_manager / intent_classifier unaffected ## Expected prod behavior (verify after deploy) incident arrives → DIAGNOSE intent → primary OLLAMA (qwen2.5:7b on M1 Pro GPU); fall back only on failure → OpenClaw 188 → Gemini → Claude. Ollama uses a 200s timeout (the previous 30s was not enough) → AI auto-remediation can finally start, no longer 100% llm_failed ## Known debt (to be handled later) - models.json:21 ollama.default is still deepseek-r1:14b (critic C1, but prod already auto-routes to the model actually loaded) - complexity 4/5 still hard-codes gemini/claude (critic M2) - Gemini API key appears in plaintext in the prod log (needs rotation + sanitizing) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 11:39:36 +08:00
Your Name	681b5ac949	feat(flywheel): W1 PR-R1 rule→Playbook migration + PR-K1 defensive timeline ALTER Some checks failed run-migration / migrate (push) Failing after 12s Details Type Sync Check / check-type-sync (push) Successful in 1m25s Details CD Pipeline / build-and-deploy (push) Failing after 1m48s Details W1 second wave: the two remaining PRs on the onboarder flywheel 80→90 path. ## PR-R1: migrate 25 yaml rules → DRAFT Playbooks Broken-link background (onboarder C2): 68% of the 25 rules in alert_rules.yaml hard-code RESTART and have no matching Playbook → RAG always hits generic_fallback → rule hit rates never feed back into the catalog. Fix: - New services/rule_to_playbook_migrator.py - Automatically parses each rule from alert_rules.yaml - Produces a PlaybookRecord (status=DRAFT, ai_confidence=0.3, source=YAML_RULE) - Honestly labels confidence as 0.3 (not a fake 1.0, which would violate feedback_confidence_truthfulness) - INSERT ON CONFLICT idempotent (dedupes on name LIKE 'AutoMigrated: %', without disturbing the seed) - New scripts/migrate_rules_to_playbooks.py (CLI: --dry-run/--commit/--disable-flag) - ENABLE_RULE_MIGRATION_DRAFT=true (rollback flag) - 23 tests covering parse / build_dict / idempotent / dry_run / action_type / severity_map / feature_flag / wildcard_filter / partial_existing, etc. ## PR-K1: defensive ALTER on timeline_events (db-expert finding) The task's original premise was wrong: the C7 broken link reported by onboarder (the incident_id column) was already fixed in the ORM in P1.6 on 2026-04-24. However, if the production table was created before P1.6, create_all skips existing tables → ORM writes and SELECTs may still fail to find the column. Fix: - db/base.py:init_db() adds a defensive ALTER: ALTER TABLE timeline_events ADD COLUMN IF NOT EXISTS incident_id VARCHAR(64); CREATE INDEX IF NOT EXISTS ix_timeline_incident_id ON timeline_events(incident_id); - IF NOT EXISTS is a safe no-op (nothing happens when the column already exists) - The stage column is a hallucination in the task description (0 writers in the codebase), so it is not added Not done: - alembic migration (the project does not use alembic; this follows the existing init_db ALTER pattern) - onboarder C7 is already fixed at the ORM layer; this commit ensures the prod schema is aligned ## Verification - All 1608 unit tests green (+23 from 1585) - The 23 PR-R1 tests pass independently ## Expected impact - Flywheel RAG finally has 25 DRAFT Playbooks to query → +5 points - Insurance for prod schema alignment → prevents ORM SELECT failures Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 10:49:25 +08:00
Your Name	c5753e1c57	fix(critic-review): make KMWriter live up to its name + fix Alertmanager inhibition + move drift checker to AST The critic PR review revealed 7 blockers in already-pushed commits; this commit fixes all of them. ## C1 + C2 + M1 + M2 + M3: a truly unified KMWriter contract (the 5 most severe critic findings) ### C1 km_writer.py:194: fix self-defeating backfill - Bare asyncio.create_task(_backfill_path_a_approval) → await _backfill_path_a_approval_safe() - Synchronous await + dedicated DLQ km:backfill:dlq + try/except so the main write is not blocked - Added km_backfill_reconciler_job.py (scans the DLQ every 5 minutes) + ENABLE_KM_BACKFILL_RECONCILER flag - Prevents the race where Path B finishes before Path A → related_approval_id stays NULL forever ### C2 km_writer.py:391: tighten the KM_WRITE_AWAIT=false path - From ensure_future (fire-and-forget, worse than the old synchronous write) - To await writer.write(retry=1, timeout=2.0) (still awaited, but only one attempt and a short timeout) - Docstring explicitly notes "for emergency rollback only, reliability not guaranteed" ### M1 decision_manager.py:2178/2203: remove the _fire_and_forget bypass - Two call sites of _fire_and_forget(executor.write_execution_result_to_km(...)) - Changed to await asyncio.shield(...) + BaseException protection (guards against cancellation from upstream) - KM_WRITE_AWAIT=true finally truly awaits on this path ### M2 incident_service.py:1099: add retry+DLQ to the hand-rolled path - Originally: if settings.KM_WRITE_AWAIT: await asyncio.wait_for else create_task - Changed to 3 exponential-backoff retries + DLQ protection (calls km_writer's private helper) ### M3 km_writer.py:166: align the idempotency claim with the implementation - knowledge_repository.create() adds an UPSERT path (pg_insert ON CONFLICT DO UPDATE) - KnowledgeEntryCreate / KnowledgeEntryRecord add a path_type field - migration: ADD COLUMN path_type + partial unique index uix_knowledge_incident_path ## M4 alertmanager.yml: tighten equal: [] (critic: prevent runaway inhibition) - OllamaInstanceDown / KMConverterDown inhibitions add an equal: ['cluster'] constraint - Prevents any single Ollama going down from wrongly inhibiting all AI/SLO alerts in a multi-cluster setup ## M5 Alertmanager version check (confirmed v0.31.1, well above v0.22+) ## M6 governance_agent.py: health score distinguishes skipped vs ok vs violated - check_slo_compliance adds _meta {violated_count, skipped_count, ok_count, all_skipped, status} - run_self_check: when all SLOs are skipped, raises a separate governance_slo_data_gap alert (does not pollute the self_failure count, because no_data means the emitter is unimplemented, not that the governance mechanism failed) ## M7 scripts/check_config_drift.py: switch to AST parsing - Regex replaced by ast.parse locating Settings ClassDef AnnAssign Field(default=...) - Avoids false negatives from multi-line lists / default_factory= / strings containing line breaks - All 4 fields (AI_FALLBACK_ORDER / ARGOCD_URL / PROMETHEUS_URL / OLLAMA_URL) aligned ## New tests - test_km_writer_backfill_reconciler.py: 7 cases (C1 reconciler + safe helper) - test_km_writer_idempotent.py: 5 cases (M3 path_type injection + UPSERT branch) ## Verification - All 1585 unit tests green (+13 from 1572) - amtool check-config SUCCESS (8 inhibit_rules / 2 receivers) - AST-based drift checker: all 4 fields aligned - Alertmanager v0.31.1 confirmed to support the new syntax ## Expected impact - KMWriter lives up to its name: KM write paths in the flywheel closed loop are 100% reliable - M4 runaway-inhibition risk removed - Governance layer no longer silent on SLO no_data - drift checker false-negative risk removed Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 10:44:39 +08:00
Your Name	6878e62af7	feat(flywheel): W1 PR-P1 + ADR-091 T1: flywheel 80→90 first wave Based on the 10 broken links uncovered by the onboarder end-to-end closed-loop audit, plus the critic's full picture of iron-rule violations, the W1 first wave fixes core broken link C1 behind flywheel hard evidence 1 + 2. ## W1 PR-P1: matched_playbook_id four-breakpoint guard (C1 fix) Fullstack exploration found that the 4 breakpoints had already been fixed in an earlier session; this PR adds: - ENABLE_PLAYBOOK_MATCHING feature flag (default=true) rollback: kubectl set env deployment/awoooi-api ENABLE_PLAYBOOK_MATCHING=false - Flag check at the entry of proposal_service._try_playbook_match_id - 7 e2e tests as a safety net (previously no test coverage) Broken link C1 evidence chain: proposal_service.generate_proposal() → matched_playbook_id → approval_db → approval_repository → learning_service._update_playbook_stats. After 24h, playbooks.trust_score should show real EWMA updates. ## ADR-091 T1: auto_generate_rule dual-writes to the DB (hard evidence 1, step one) Flywheel hard evidence 1: alert_rule_catalog.source='ai_generated' has 0 entries across the whole codebase. auto_generate_rule() writes alert_rules.yaml but not the DB → the AI's self-learned results are decoupled from the catalog. Fix (per ADR-091 §1 D1): - Add _insert_catalog_ai_generated(): after the YAML write succeeds, dual-write source='ai_generated', confidence=0.5, review_status='draft', created_by_agent - Add _parse_for_to_seconds() helper ("30s"/"5m"/"2h" → seconds) - ON CONFLICT (rule_name) DO NOTHING guarantees idempotency - Transaction strategy: YAML + DB are not in the same transaction (YAML is already the SoT; a DB failure is only logged) - ENABLE_AI_RULE_CATALOG_WRITE feature flag (default=true) rollback: kubectl set env deployment/awoooi-api ENABLE_AI_RULE_CATALOG_WRITE=false 13 tests covering: parse helper 8 + business logic 5 (success/db_fail/idempotent/flag/SQL_lit) ## Verification All 1572 unit tests green (+20 new: PR-P1 7 + ADR-091 T1 13) ## Expected impact Flywheel autonomy score: 42 → 65 (+23 = C1 +3 + hard evidence 1 +20) ## Known debt (revealed by the critic PR review, handled in the next commit) - 3 caller paths bypass the KMWriter unified contract (C1/M1/M2) - KMWriter idempotency claim does not match the implementation (M3 lacks ON CONFLICT) - Alertmanager equal:[] runaway inhibition + unverified version (M4/M5) - drift checker regex is fragile (M7 should move to AST) - governance health score distorted by skipped checks (M6) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 10:44:39 +08:00
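Your Name	c22e5f334e	feat(km): P1-1 KMWriter unified contract + 5 callers switched + M4 reverse-lookup chain completed The 12-Agent panoramic diagnosis found that the 5 entry points of the KM write path had no unified contract, and that fire-and-forget loses entries on Pod recycle. This change extracts KMWriter to enforce 7 contract rules. ## 7 enforced contract rules 1. Synchronous baseline: mandatory await asyncio.wait_for(timeout) 2. Retry: 3 attempts with exponential backoff 1s/2s/4s (OperationalError / network-type exceptions) 3. Failure recovery: after 3 attempts, write to Redis DLQ km:dlq + log 4. Observability: structlog event + reserved metric hook (emitter to be added in P1-3) 5. Idempotency: incident_id + path_type form the unique key 6. No swallowed exceptions: except must log + raise/DLQ 7. M4 reverse-lookup chain: when the payload contains approval_id, automatically fill related_approval_id and backfill Path A ## Caller switch (unified interface for 5 entry points) - incident_service.py:1086 Path A (KB extractor + km_conversion) - approval_execution.py:771 Path B (manual) - decision_manager.py:2178 Path B (automatic, success; eliminates cross-class private method call M1) - decision_manager.py:2200 Path B (automatic, failure; fixes B2 early exception swallowing) - playbook_service.py:210 PlaybookKM (the third path, missed by both T0 reports) ## M4 reverse-lookup chain completed - knowledge.py + models.py: add the related_approval_id ORM field - Aligned with the phase26_incident_km_integration.sql:20 schema (partial index already exists) - approval↔KM bidirectional reverse-lookup chain complete (dual-path seam) ## Feature Flag (rollback insurance) - KM_WRITE_AWAIT=true (default): await + timeout + DLQ enforced - KM_WRITE_AWAIT=false: fire-and-forget (old behavior) ## Tests - apps/api/tests/test_km_writer.py: all 18 tests green, covering success / timeout / retry / DLQ / idempotency / KMWriteError / on_failure=raise / reverse-lookup backfill - All 1552 unit tests green (no regressions) ## Acceptance Core of the flywheel closed loop: KM writes are no longer silently lost, and the AI learning chain no longer breaks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 10:44:39 +08:00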
Your Name	143c15f052	feat(wave2+km): enable LLM dynamic buttons + automatic KM writes + mark AI Router dead code All checks were successful CD Pipeline / build-and-deploy (push) Successful in 9m52s Details - ConfigMap: USE_LLM_DYNAMIC_BUTTONS=true (B2/B3/B4 handlers all ready) - decision_manager: add a fire-and-forget KM write on the auto_execute failure path - ai_router: _build_fallback_chain marked DEPRECATED 2026-04-28 - tests: test_golden_regression.py adds 172 lines of golden regression tests Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-28 15:27:33 +08:00


202 Commits