Your Name
fa0e956c0e
fix(mcp): tag legacy provider calls with audit context
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / tests (push) Successful in 59s
CD Pipeline / build-and-deploy (push) Successful in 3m22s
CD Pipeline / post-deploy-checks (push) Successful in 1m19s
2026-05-06 17:18:52 +08:00
Your Name
c1ac157aaf
fix(km): keep backfill reconciler loop alive
Code Review / ai-code-review (push) Successful in 11s
CD Pipeline / tests (push) Successful in 1m12s
CD Pipeline / build-and-deploy (push) Successful in 4m2s
CD Pipeline / post-deploy-checks (push) Successful in 1m18s
2026-05-06 17:03:22 +08:00
Your Name
33f85ec8ca
fix(logging): redact telegram bot urls
Code Review / ai-code-review (push) Successful in 17s
CD Pipeline / tests (push) Successful in 1m14s
CD Pipeline / build-and-deploy (push) Successful in 3m19s
CD Pipeline / post-deploy-checks (push) Successful in 1m15s
2026-05-06 16:54:14 +08:00
Your Name
8f715fd3f2
fix(telegram): sanitize failover alert errors
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / tests (push) Successful in 1m5s
CD Pipeline / build-and-deploy (push) Successful in 3m25s
CD Pipeline / post-deploy-checks (push) Successful in 1m16s
2026-05-06 16:45:47 +08:00
Your Name
a7a9ba996d
fix(mcp): audit approved ssh execution path
Code Review / ai-code-review (push) Successful in 11s
CD Pipeline / tests (push) Successful in 1m5s
CD Pipeline / build-and-deploy (push) Successful in 3m45s
CD Pipeline / post-deploy-checks (push) Successful in 1m20s
2026-05-06 16:34:39 +08:00
Your Name
7ed8c95409
fix(mcp): persist blocked gateway audit rows
Code Review / ai-code-review (push) Successful in 16s
run-migration / migrate (push) Failing after 9s
CD Pipeline / tests (push) Successful in 1m8s
CD Pipeline / build-and-deploy (push) Successful in 3m59s
CD Pipeline / post-deploy-checks (push) Successful in 1m46s
2026-05-06 16:21:43 +08:00
Your Name
60c00d7a5d
fix(mcp): tolerate legacy tool DTO fields
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / tests (push) Successful in 1m9s
CD Pipeline / build-and-deploy (push) Successful in 3m18s
CD Pipeline / post-deploy-checks (push) Successful in 1m27s
2026-05-06 16:11:26 +08:00
Your Name
927c2a758d
fix(mcp): accept legacy tool result data alias
Code Review / ai-code-review (push) Successful in 12s
CD Pipeline / tests (push) Successful in 1m6s
CD Pipeline / build-and-deploy (push) Successful in 3m24s
CD Pipeline / post-deploy-checks (push) Successful in 1m17s
2026-05-06 16:02:27 +08:00
Your Name
22453161e9
fix(ai): restore dynamic baseline holt winters fit
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / tests (push) Successful in 59s
CD Pipeline / build-and-deploy (push) Successful in 8m20s
CD Pipeline / post-deploy-checks (push) Successful in 1m14s
2026-05-06 15:30:31 +08:00
Your Name
4111ea4f9f
fix(ai): remove 188 ollama provider
Code Review / ai-code-review (push) Successful in 12s
CD Pipeline / tests (push) Successful in 1m13s
CD Pipeline / build-and-deploy (push) Successful in 3m36s
CD Pipeline / post-deploy-checks (push) Successful in 1m20s
2026-05-06 14:34:48 +08:00
OG T
2c2bf9d665
fix(awooop): use shared redis for approval gates
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / tests (push) Successful in 1m0s
CD Pipeline / build-and-deploy (push) Failing after 4m6s
CD Pipeline / post-deploy-checks (push) Has been skipped
2026-05-06 13:18:43 +08:00
OG T
c696b99ccf
fix(awooop): authenticate approval decisions
Code Review / ai-code-review (push) Successful in 11s
CD Pipeline / tests (push) Successful in 1m3s
CD Pipeline / build-and-deploy (push) Successful in 3m28s
CD Pipeline / post-deploy-checks (push) Successful in 1m25s
2026-05-06 13:05:51 +08:00
Your Name
9ef9633aff
fix(alerts): bypass proxy timeout for GCP Ollama
2026-05-06 08:55:14 +08:00
Your Name
c2c0b1ec82
fix(alerts): let GCP Ollama finish before cloud fallback
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / tests (push) Successful in 1m9s
CD Pipeline / build-and-deploy (push) Successful in 4m21s
CD Pipeline / post-deploy-checks (push) Successful in 1m16s
2026-05-06 05:27:55 +08:00
Your Name
3b64d66836
fix(alerts): guard approval actions and wire playbook learning
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / tests (push) Successful in 42s
CD Pipeline / build-and-deploy (push) Successful in 3m31s
CD Pipeline / post-deploy-checks (push) Successful in 1m18s
2026-05-06 03:34:24 +08:00
Your Name
5ed396e390
fix(decision): derive telegram dedup from incident signals
CD Pipeline / tests (push) Successful in 58s
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / build-and-deploy (push) Successful in 3m30s
CD Pipeline / post-deploy-checks (push) Successful in 2m19s
2026-05-06 00:19:35 +08:00
Your Name
2aa31c205a
fix(ai): require 111 before alert cloud fallback
CD Pipeline / tests (push) Successful in 54s
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / build-and-deploy (push) Successful in 3m21s
CD Pipeline / post-deploy-checks (push) Successful in 2m2s
2026-05-06 00:05:51 +08:00
Your Name
e208798531
fix(ai): keep GCP Ollama lane on safe models
CD Pipeline / tests (push) Successful in 54s
Code Review / ai-code-review (push) Successful in 14s
CD Pipeline / build-and-deploy (push) Successful in 3m25s
CD Pipeline / post-deploy-checks (push) Successful in 1m50s
2026-05-05 23:37:33 +08:00
Your Name
c4854bb355
fix(ai): isolate heavy Ollama workloads from GCP alert lane
CD Pipeline / tests (push) Successful in 54s
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / build-and-deploy (push) Successful in 3m19s
CD Pipeline / post-deploy-checks (push) Successful in 3m12s
2026-05-05 23:06:07 +08:00
Your Name
ee5e3bc94f
fix(openclaw): gate alert cloud fallback behind flag
Code Review / ai-code-review (push) Successful in 27s
CD Pipeline / tests (push) Successful in 5m17s
CD Pipeline / build-and-deploy (push) Failing after 5m35s
CD Pipeline / post-deploy-checks (push) Has been skipped
2026-05-05 20:54:47 +08:00
Your Name
6b93c8f454
fix(chat): route OpenClaw chat through Ollama lane
CD Pipeline / tests (push) Successful in 5m26s
Code Review / ai-code-review (push) Successful in 25s
CD Pipeline / build-and-deploy (push) Successful in 8m11s
CD Pipeline / post-deploy-checks (push) Has been cancelled
2026-05-05 15:57:26 +08:00
Your Name
e24c8ea051
fix(ci): align B5 schema with tenant isolation
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
2026-05-05 15:00:07 +08:00
Your Name
fc1a6196df
fix(code-review): keep Gemini fallback opt-in
CD Pipeline / tests (push) Successful in 2m2s
Code Review / ai-code-review (push) Successful in 27s
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
2026-05-05 14:38:44 +08:00
Your Name
0e14935351
fix(ops): classify systemd runner alerts as host resources
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
2026-05-05 14:28:18 +08:00
Your Name
2ff0ef3bb6
fix(openclaw): route legacy ollama through failover endpoints
CD Pipeline / tests (push) Failing after 1m49s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 24s
2026-05-05 13:55:52 +08:00
Your Name
a57e3d3d75
test(consensus): expect redis namespace dual write
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
2026-05-05 13:41:41 +08:00
Your Name
b00a7b050a
test(ollama): align inference connect errors with degraded health
CD Pipeline / tests (push) Failing after 2m26s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 28s
2026-05-05 13:34:19 +08:00
Your Name
506744ba3a
test(ollama): keep slow gcp primary on ollama
CD Pipeline / tests (push) Failing after 2m21s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 26s
2026-05-05 13:29:27 +08:00
Your Name
869646459c
fix(ollama): treat legacy primary as ollama
CD Pipeline / tests (push) Failing after 1m48s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 28s
2026-05-05 13:25:27 +08:00
Your Name
33d4326cce
test(ollama): align slow recovery with gcp routing policy
CD Pipeline / tests (push) Failing after 1m51s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 33s
2026-05-05 13:21:16 +08:00
Your Name
f78b1b0690
fix(ollama): honor provider endpoint selection
Code Review / ai-code-review (push) Successful in 37s
2026-05-05 13:14:46 +08:00
Your Name
8629ac709b
feat(awooop): Phase 1-8 完整實作 — AwoooP Agent Platform 六平面架構
...
run-migration / migrate (push) Failing after 59s
Code Review / ai-code-review (push) Successful in 1m8s
Type Sync Check / check-type-sync (push) Successful in 2m27s
## Phase 1-3: Control Plane + Contract System
- awooop_phase1_control_plane_2026-05-04.sql: 12 張核心表 + RLS
- awooop_phase1_batch1_rls_2026-05-04.sql: 全部 FORCE RLS + GRANT
- packages/awooop-contracts/: 六合約 JSON Schema + golden fixtures
- src/models/awooop_contracts.py: Pydantic v2 contract models(extra=forbid)
- src/repositories/contract_repository.py: contract lifecycle(draft→published→active)
- src/services/contract_service.py: HMAC publish sig + Redis multi-sig activate
- src/services/schema_validator.py: LLM output validator(retry×3, E-SCHEMA-001)
## Phase 2: Tenant Isolation
- awooop_phase2_budget_ledger_2026-05-04.sql: budget_ledger + RLS
- src/services/budget_service.py: Token Budget Hard Kill 三層防線
- src/core/context.py: PROJECT_ID ContextVar(31 background loop 自動繼承)
- src/db/base.py + models.py: project_id 欄位 + RLS set_config 注入
- src/hermes/nl_gateway.py: project_id Redis key 前綴(Phase A 雙寫)
- src/services/anomaly_counter.py: per-project 改造(Phase A fallback)
## Phase 4: Platform Shell in Shadow Mode
- awooop_phase4_run_state_2026-05-04.sql: run_state + step_journal + idempotency
- src/services/run_state_machine.py: 8-state FSM + SKIP LOCKED + stale reaper
- src/services/platform_runtime.py: UUID v7 + W3C trace_id + shadow_execute
- src/services/audit_sink.py: PII/secret redaction 9 patterns
- src/api/v1/platform/runs.py: POST/GET /v1/platform/runs(Router→Service 架構)
- src/workers/platform_worker.py: SKIP LOCKED worker + heartbeat + reaper loop
- src/main.py: platform router + lifespan worker start/stop
## Phase 5: MCP Gateway 五閘門
- awooop_phase5_mcp_gateway_2026-05-04.sql: 4 表 + RLS
- src/plugins/mcp/gateway.py: McpGateway(Gate 1~5, E-MCP-GATE-001~009)
- src/plugins/mcp/redaction_middleware.py: 雙層 redaction + 16K 截斷
- src/plugins/mcp/registry.py: __provider name mangling(ADR-116)
- src/plugins/mcp/credential_resolver.py: k8s secret ref 解析
- tests/test_mcp_credential_isolation.py: 10 個迴歸測試(secret leak 防再現)
## Phase 6-8: EwoooC + Channel Hub + Approval Token
- awooop_phase6_ewoooc_onboarding_2026-05-04.sql: ewoooc tenant + 4 read-only MCP tools
- awooop_phase7_channel_hub_2026-05-04.sql: conversation_event + outbound_message
- src/services/provider_proxy.py: ProviderProxy + PlatformEnvelope(ADR-115)
- src/services/channel_hub.py: Telegram inbound mirror + Progressive Feedback(30s)
- src/services/awooop_approval_token.py: HS256 + jti NX replay 防護 + suggest mode
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-05-04 19:31:53 +08:00
Your Name
14bf86a462
fix(awooop): Phase 2 初批 P0 修正 + Phase 1 Task 1.7 integration tests
...
## P0 安全 / 架構修正
### P0-08 telemetry.py — 移除硬碼 IP assert(ADR-121)
- config.py:新增 OTEL_ALLOWED_ENDPOINTS(預設 192.168.0.188)+ OTEL_FORBIDDEN_ENDPOINTS
- telemetry.py:_validate_endpoint() 改為 config-driven allowlist/forbidlist
- EwoooC 可用 env 覆寫 OTEL_ALLOWED_ENDPOINTS 指向自己的 SigNoz host
### P0-13 mcp_bridge.py — K8s namespace 由 settings 提供
- config.py:新增 AWOOOI_K8S_NAMESPACE(預設 "awoooi-prod")
- mcp_bridge.py:5 處 parameters.get("namespace", "awoooi-prod") → settings.AWOOOI_K8S_NAMESPACE
- EwoooC/Tsenyang 可設自己的 namespace
### P1-24 decision_manager.py — silence key 常數統一
- 新增 from src.services.telegram_gateway import SILENCE_KEY_PREFIX
- f"telegram_silence:{target}" → f"{SILENCE_KEY_PREFIX}{target}"
- 消除跨兩處重複定義(ADR-118 No Island Coding 原則)
## Phase 1 Task 1.7 Integration Tests
- tests/integration/test_awooop_phase1_schema.py:31 個測試案例
- awooop_projects CHECK 約束(4 cases)
- revision 不可變性 trigger(5 cases:draft 可改、published 鎖住、身份欄不可改、非法流轉、DELETE 禁止)
- awooop_published_revisions VIEW draft/published 隔離(2 cases)
- active_pointer_guard(3 cases:不可指向 draft、可指向 active、跨租戶 mismatch)
- RLS fail-closed(3 cases:未設/錯設/正確設 project_id)
- outbox FK + dedup(2 cases)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-05-04 13:46:19 +08:00
Your Name
2409d861fa
fix(test): 更新 auto_recovery 測試斷言至 ADR-110(ollama_111 → ollama_gcp_a)
...
Code Review / ai-code-review (push) Successful in 55s
CD Pipeline / tests (push) Failing after 1m22s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
- notify_recovery 斷言改為 "ollama_gcp_a"(3 處)
- alert_recovery payload["to"] 改為 "ollama"
- test_full_recovery_flow 改用 mock alerter 避免打真實 Telegram Bot API
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-05-03 22:57:58 +08:00
Your Name
b1ef05fa8c
feat(ollama): ADR-110 GCP 三層容災架構(GCP-A → GCP-B → Local → Gemini)
...
Code Review / ai-code-review (push) Successful in 50s
CD Pipeline / tests (push) Failing after 1m14s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
## 變更摘要
- Primary: http://34.143.170.20:11434 (GCP-A SSD, 9x 載速 + 2x 推理)
- Secondary: http://34.21.145.224:11434 (GCP-B SSD)
- Fallback: http://192.168.0.111:11434 (M1 Pro Local HDD,最後防線)
- 廢止 ADR-105「111 唯一鐵律」,新建 ADR-110
## 核心改動
- config.py: 新增 OLLAMA_SECONDARY_URL;validator 加 GCP IP 白名單(34.143.170.20, 34.21.145.224)
- ollama_failover_manager.py: 三層 Ollama 決策矩陣;並行健康檢查三台;health_111 → health_gcp_a
- ollama_health_monitor.py: host label 萃取改為通用版(支援 GCP 公網 IP)
- failover_alerter.py: 故障/恢復主機動態顯示,不再硬編碼「Ollama 111 (GPU)」
- ollama_auto_recovery.py: notify_recovery 改為 ollama_gcp_a;recovered_host 動態
- k8s/awoooi-prod: configmap + deployment + network-policy 同步更新(egress 加 GCP /32)
- 服務層: 10 個服務檔案硬編碼 192.168.0.111 改為讀 settings.OLLAMA_URL
- 測試: URL 常數更新,新增三層容災場景,GCP IP 白名單驗證測試
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-05-03 22:49:23 +08:00
Your Name
e45b055e0e
feat(governance): AI 治理事件處理鏈四軌交付(C/D/B/A)
...
Code Review / ai-code-review (push) Successful in 48s
run-migration / migrate (push) Failing after 45s
CD Pipeline / tests (push) Successful in 3m46s
Type Sync Check / check-type-sync (push) Successful in 2m8s
CD Pipeline / build-and-deploy (push) Failing after 31m14s
CD Pipeline / post-deploy-checks (push) Has been skipped
【十二人專家團隊全景掃描 + 並行四軌實施】
統帥質疑「有讓 12-agent 一起協作嗎」後,依照團隊規則完成全鏈路交付:
onboarder + critic + db-expert + debugger + frontend-designer 並行掃描,
找到 6 大 Gap,再由 fullstack-engineer × 4、refactor-specialist 協作落地。
【Track C — trust_drift 雙寫整併】
兩條獨立寫 event_type=trust_drift 路徑互不呼叫,下游 consumer 拿到雙份資料
無法判定 source-of-truth。整併保留 governance_agent.check_trust_drift(功能
更全:auto-deprecate + Telegram + PG),TrustDriftDetector 降為純統計 lib,
W-6 watchdog 改呼叫 governance_agent。新增 TestSinglePgWritePerDriftScenario
驗證同一 drift 場景只觸發一次 PG 寫入。
變更:
- apps/api/src/services/trust_drift_detector.py(lib only,不再寫 PG)
- apps/api/tests/test_trust_drift_watchdog.py(W-6 改 mock governance_agent)
【Track D — governance_remediation_dispatch 派遣表】
ai_governance_events 是不可變 Event Sourcing,不能塞執行狀態。新建派遣表
作為投影層:1 event → 0..N dispatches,狀態可變、可重試、可審計。
- PgEnum 5 種 event_type + 7 階段狀態機(pending → dispatched → executing →
succeeded/failed/cancelled/skipped)
- 失敗重試 INSERT 新 row(不改舊 row 的 status,保留審計痕跡)
- Partial unique index ux_grd_one_active_per_event 強制「同事件唯一活躍」
- 4 個複合 index 支援 worker poll、去重查詢、觀測面板
- FK 對應 ai_governance_events / playbooks / incidents / approval_records
全部 SET NULL(avoid cascade lock,但 governance_event 用 RESTRICT)
變更:
- apps/api/src/db/models.py(GovernanceRemediationDispatch ORM class)
- apps/api/migrations/governance_remediation_dispatch_2026-05-03.sql
- apps/api/src/repositories/governance_remediation_dispatch_repo.py
(6 個 async 函式 + 3 個自訂例外:DispatchAlreadyActive /
InvalidStatusTransition / DispatchNotFound)
- apps/api/src/models/governance_dispatch.py(DecisionContextV1 等 4 schema)
- apps/api/tests/test_governance_remediation_dispatch.py(29 tests)
【Track B — /governance 頁面】
後端 PR1 三個 endpoint + 前端 PR2-5 完整三 Tab。
PR1 後端:
- GET /api/v1/ai/governance/events(events_tab,含 event_type/severity/
狀態/時間範圍篩選 + 分頁)
- GET /api/v1/ai/governance/queue(queue_tab,含 graceful fallback:
dispatch 表不存在時回 table_pending=True 不拋 500)
- GET /api/v1/ai/governance/summary(slo_tab 30d 違反時序圖)
- severity 映射規則寫死(critic 建議未來移 settings)
PR2-5 前端:
- /governance 路由 + AppLayout + Compliance Badge 橫幅 + PageTabs
- SLO Tab:3 KPI 卡片(Syne 28px + StatusOrb + 7d sparkline)+
30d 違反 stacked BarChart
- Events Tab:篩選列 + 表格 + inline 展開行(JSON / 修復建議 / 派遣記錄)
- Queue Tab:HITL 待辦卡片 + 信任度進度條 + 批准/拒絕按鈕(本 PR console.log)
- Sidebar 加入「AI 治理」入口(ShieldCheck icon)
- i18n 雙語完整(governance namespace + nav.governance)
- 7 個新元件:slo-kpi-card / slo-violation-chart / events-table /
events-filter-bar / event-detail-drawer / queue-item-card / queue-history-tabs
變更:
- apps/api/src/api/v1/ai_governance.py(router)
- apps/api/src/services/governance_query_service.py
- apps/api/src/models/governance.py(Pydantic V2 schemas)
- apps/api/tests/test_ai_governance_endpoints.py(21 tests)
- apps/web/src/app/[locale]/governance/(page + 3 tabs)
- apps/web/src/components/governance/(7 元件)
- apps/web/messages/{zh-TW,en}.json(governance namespace)
- apps/web/src/components/layout/sidebar.tsx(+1 行)
- apps/api/src/main.py(router include)
【Track A — GovernanceDispatcher 決策融合】
把治理事件接到 remediation 執行器,走北極星方向決策融合(LLM × Playbook trust
× MCP),符合「禁寫死規則」鐵律。
- 設計鐵律:DecisionFusionAdapter 是新增 wrapper,**不修改任何 Tier 3 檔**
(decision_manager / learning_service / trust_engine),只 consume 既有 API
- 三維融合公式:confidence = 0.4×llm + 0.3×playbook_trust + 0.3×mcp_consistency
(權重加 TODO 標明未來由 AI 自學調整)
- 三分支決策路徑:
confidence ≥ 0.85 → auto_dispatch(status=dispatched)
0.65 ≤ confidence < 0.85 → pending_approval(HITL)
confidence < 0.65 → skip + log
- decision_context JSONB 完整記錄三維輸入快照(給未來 fine-tune 用)
- poll 30s 掃 unresolved 事件,仿 governance loop 模式
- 重複事件擋去重(呼叫 get_active_for_event)
變更:
- apps/api/src/services/governance_dispatcher.py
- apps/api/src/services/decision_fusion_adapter.py
- apps/api/tests/test_governance_dispatcher.py(14 tests)
- apps/api/src/main.py(lifespan task 接 run_governance_dispatcher_loop)
【驗證】
1836 個 unit test 全過(29 skipped 為既有 PG integration env 問題)
【調度教訓 — 已記入 memory】
- vuln-verifier 應在 fullstack-engineer **之前**跑(避免並行讀到已修代碼誤判)
- critic 雙輪審查不可省(第二輪抓到 NaN sentinel + Prom rule 連鎖)
- 北極星「禁寫死規則」搭配 decision-fusion 確實實施
【未動 Tier 3 — 已驗證】
git diff 確認本 commit 完全沒改 decision_manager.py / learning_service.py /
trust_engine.py,只新增 wrapper service consume 既有 API。
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-05-03 12:42:40 +08:00
Your Name
8fb0c5df33
feat(heartbeat): noise reduction — silent 6h + warnings hash dedup
...
Code Review / ai-code-review (push) Successful in 47s
CD Pipeline / tests (push) Successful in 2m11s
CD Pipeline / build-and-deploy (push) Failing after 31m12s
CD Pipeline / post-deploy-checks (push) Has been skipped
P0 #4 (徹底長期修系列) — 統帥鐵證:「INFO | AWOOOI 系統報告」每 30 分鐘
推一次,一天 48 條同樣內容,即使我修了 P0 #3 假警報,每天的「全系統正常」
重複推送本身就是噪音,讓統帥誤以為告警還在重複。
修法(不違反「監控工具必須被監控」鐵律 — 健康狀態仍每 6h 推 1 次「我活著」):
| 狀況 | 推送行為 |
|------|---------|
| 健康(無 warnings)| 6h 內最多 1 次「我活著」訊號 |
| 有 warnings 跟上次同 hash | 跳過 |
| 有 warnings 跟上次不同 | 立即推送(新狀況不漏)|
| 健康 ↔ 有事 切換 | 自動清掉相反 marker |
Redis keys:
- `heartbeat:silent_last_sent` — 健康狀態 silent marker, TTL=6h
- `heartbeat:warnings_hash` — 上次 warnings 的 md5[:12], TTL=24h
效果:統帥每天從 48 條 heartbeat → ~4 條(健康狀態 4×6h),有事立即推。
Tests: 6 passed (test_heartbeat_dedup_p0_4.py)
- healthy_first_send_goes_through
- healthy_second_send_within_6h_skipped
- warnings_unchanged_skipped
- warnings_changed_pushes
- warnings_to_healthy_clears_warnings_hash
- healthy_to_warnings_clears_silent_marker
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-05-03 01:48:57 +08:00
Your Name
2ce722bda9
feat(heartbeat): full K8s pod lifecycle state machine + regression tests
...
Code Review / ai-code-review (push) Successful in 51s
CD Pipeline / tests (push) Successful in 2m59s
CD Pipeline / build-and-deploy (push) Has started running
CD Pipeline / post-deploy-checks (push) Has been cancelled
P0 #3 (徹底長期修系列) — 把 daily report 的 pod 健康判斷從「ready=False 一律告警」
升級到完整 K8s pod lifecycle state machine:
| Phase | 行為 |
|-------|------|
| Succeeded / Completed | 跳過(CronJob/Job 跑完正常) |
| Failed | 必告警 |
| Unknown | 必告警 |
| Pending <5min | 跳過(剛 schedule 合理) |
| Pending >=5min | 告警「image pull / scheduling 卡住」|
| Running ready=True | 健康,跳過 |
| Running ready=False <2min | 跳過(剛起來 probe 還沒過)|
| Running ready=False >=2min | 告警「readiness probe fail / 啟動異常」|
| restarts >=3 | 必告警(無論 phase)|
實作:
- PodInfo 加 start_time: Optional[str](從 .status.startTime)
- _get_pod_status kubectl custom-columns 加 STARTTIME
- _build_warnings 完整 state machine + 閾值常數
regression test (test_heartbeat_pod_state_machine.py 13 個) 覆蓋每個 phase
+ 邊界條件,含 2026-05-02 統帥截圖鐵證重現(3 個 drift-scanner Succeeded
pod 不該觸發「需關注 3 項」假警報)。
Tests: 13 passed (新增 test_heartbeat_pod_state_machine.py)
接續 a38d9112(單純 Succeeded skip),這次徹底處理 Pending/Failed/Unknown
+ 時間閾值 + 沒 start_time 的保守告警。
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-05-03 01:44:58 +08:00
Your Name
f1362fcc8d
fix(governance): 修治理告警 4 個 silent failure + Prom sentinel 連鎖
...
Code Review / ai-code-review (push) Successful in 49s
CD Pipeline / tests (push) Successful in 2m9s
CD Pipeline / build-and-deploy (push) Failing after 31m11s
CD Pipeline / post-deploy-checks (push) Has been skipped
【全景檢測:12-agent 並行掃描定位 4 大 bug 與 1 個 P0 連鎖回歸】
Bug 1(P0 silent failure)— governance_agent.check_trust_drift
原 `await db.commit()` 縮排錯在 async with 區塊外(8 空格 vs 12),
session 已 auto-commit 關閉,二次 commit 拋 InvalidRequestError 被吞,
governance_trust_drift_auto_deprecated log 從不出現。修:commit/log 移回 with 內。
附 AST regression guard test 擋退化。
Bug 2 — flywheel_stats_service / W-3 fresh deploy 假告警
Redis 空時 total_exec=0 → rate=0.0 → watchdog `< 0.30` 立即觸發
「飛輪成功率 0%」假告警。修:total_exec < FLYWHEEL_MIN_SAMPLE(10) 回 None,
watchdog 判 None 跳過 W-3。Prometheus sentinel 用 NaN(非 -1.0)
避免觸發 ops/monitoring/alerts.yml:775 等 3 份 prom rule 的 `< 0.1`
條件造成 2h 後假告警連鎖。前端 type 同步 number | null。
Bug 3 — failover_alerter dedup key
原 key 只看 event_type 不看 payload,trust_drift 4→25 IDs 變動全被
1h dedup 吞掉。修:dedup key 加 sha256(impact subdict)[:8],event_type
sanitize 防特殊字元污染 Redis key。
Bug 4 — ai_slo_watchdog_job W-4 evolver 全封存初始化誤報
原邏輯 approved==0 即告警,未排除「playbooks 表初始化中」場景。
修:_count_approved_playbooks 回 (approved, total),total==0 → skip。
【執行結果】
- 39 個相關 unit test 全過(test_failover_alerter / test_governance_agent /
test_trust_drift_watchdog / test_check_trust_drift_commit_outside_context_poc)
- 6 個關鍵路徑實測:NaN sentinel / float 渲染 / hash 區分性 / dedup 同 impact
相同 hash / datetime 容錯 / 4 檔 py_compile 全過
【調度教訓 — 留作未來改進】
- 12-agent 並行調度時,vuln-verifier 與 fullstack-engineer 競態
導致 vuln-verifier 讀到已修代碼誤判 NOT REPRODUCIBLE。
未來:vuln-verifier 應在 fullstack 之前執行,或用 git show HEAD~1 對比修復前。
- fullstack-engineer 引入 P0 regression(f-string 內嵌 ternary 非法 format spec),
critic 抓到 + Prom sentinel 連鎖 — 證明 critic 審查必要不可省。
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-05-03 00:18:57 +08:00
Your Name
314cb0e079
fix(test): align governance self_failure assertions with nested payload schema
...
Code Review / ai-code-review (push) Successful in 48s
CD Pipeline / tests (push) Successful in 2m18s
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
Codex commits dedb1208 + b710f3f3 (governance enrich + normalize) 把
_alert("governance_self_failure", ...) 的 payload structure 重構成嵌套:
{status, impact: {failed_checks, total_checks, errors}, remediation, actionable}
(governance_agent.py:604-624,2026-04-29 critic M6 修),
但 3 個 test 還用舊路徑 `payload["total_checks"]` 直讀,KeyError 後 RuntimeError 模擬 cascading 失敗。
修法:3 個 assertion 改為讀正確嵌套路徑:
- test_governance_agent.py:601 → payload["impact"]["total_checks"|"failed_checks"]
- test_wave8_remaining_blockers.py:223 → 同
- test_wave8_remaining_blockers.py:268 → 同
Tests: 30 passed (test_governance_agent + test_wave8_remaining_blockers 全部)
效果:解開 dedb1208 / b710f3f3 / a38d9112 三個 commit 因 governance test fail
被擋在 build-and-deploy 之前的卡點,恢復 CD 鏈通暢。
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com >
2026-05-03 00:05:04 +08:00
Your Name
da772a1605
fix(decision): block kubectl actions on bare_metal host alerts
...
Code Review / ai-code-review (push) Successful in 54s
CD Pipeline / tests (push) Successful in 3m47s
CD Pipeline / build-and-deploy (push) Successful in 13m26s
CD Pipeline / post-deploy-checks (push) Successful in 5m45s
When HostHighCpuLoad / HostOutOfMemory fire on a bare-metal host
(192.168.0.110 et al, where Sentry / ClickHouse / Snuba are eating
CPU), the LLM kept proposing "kubectl rollout restart awoooi-api",
which is a wrong-domain action — restarting awoooi cannot fix a
third-party process's CPU usage on the host. Auto-execute would then
either run the no-op kubectl restart (wasted) or escalate after
ssh_diagnose because no safe action was found, producing the
"AI 自動修復失敗" Telegram noise the user just complained about.
Adds a guard at the top of DecisionManager._auto_execute: if the
incident's primary signal carries host_type=bare_metal AND the
proposed action starts with "kubectl", refuse to execute. The
incident is marked READY with a clear blocked_reason so human
operators see why automation declined, and emergency_escalation
records the event in AOL for audit.
Also patches /home/wooo/monitoring/alerts.yml on 110 (and the new
ops/monitoring/alerts.yml in repo) to add an explicit
auto_repair_action annotation on HostHighCpuLoad / HostOutOfMemory
that hints LLM toward `ssh ... ps aux` rather than kubectl restart.
Prometheus reload returned 200.
Tests: tests/test_decision_manager_bare_metal_kubectl_guard.py
covers (1) bare_metal+kubectl blocked, (2) kubectl get also blocked,
(3) bare_metal+ssh NOT blocked, (4) k8s host_type+kubectl NOT
blocked, (5) missing host_type label NOT blocked.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-05-02 17:41:28 +08:00
Your Name
3059897318
feat(governance): auto-deprecate low-trust unused playbooks (>30d)
...
Code Review / ai-code-review (push) Successful in 41s
CD Pipeline / tests (push) Successful in 3m29s
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
trust_drift previously fired alerts forever for playbooks stuck below
the 0.2 threshold. With user authorization for governance-class
auto-fixes, check_trust_drift now retires playbooks that have been
unused for 30+ days (or never used and created 30+ days ago) by
flipping status to 'deprecated' before alerting.
Alerts now report drifted_count, auto_deprecated_count, and the kept
playbook_ids that still need human review (those in their 30d trial
window). Existing alert noise from the four currently-drifted
playbooks should drop to whatever fraction is genuinely in trial.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-05-02 12:31:37 +08:00
Your Name
607358c4dd
fix(approval): route SSH actions through SSHProvider on manual approve
...
parse_operation_from_action only knew kubectl and Chinese restart phrases,
so any "ssh host '...'" action approved via Telegram fell through to
"Could not parse operation type" and reported a fake failure even though
the LLM had proposed a valid host repair.
Adds OperationType.SSH_HOST, makes the parser detect ssh prefixes (with
optional flags / user@host) before kubectl patterns, and routes the
SSH_HOST branch in approval_execution.execute_in_background through
SSHProvider with the same tool keywords decision_manager uses
(ssh_docker_prune / ssh_docker_restart / ssh_systemctl_restart /
ssh_diagnose). Unroutable SSH actions now fail loudly with a descriptive
error instead of silently breaking.
Trigger: 2026-05-02 incidents INC-20260502-D6D0B7 / E12EE4 / 557055
were approved by the user but executor reported "Could not parse" and
left the alerts pending.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-05-02 12:31:37 +08:00
Your Name
3156ff1c69
feat(aiops): add ssh_docker_prune to auto-repair flywheel for disk-full alerts
...
Adds Group B SSH MCP tool ssh_docker_prune (image+volume+builder prune
with ≥75% disk usage gate) and routes "docker prune" actions through it.
Flips HostDiskUsageHigh from auto_repair=false to true with mcp_provider
routing labels so the flywheel can self-heal next disk-full event without
hitting the emergency_channel Telegram path.
Trigger: 2026-05-01 → 05-02 Telegram alert storm (peak 53/hr) caused by
empty ssh-mcp-key/known_hosts secret rejecting all SSH and forcing every
disk-full alert through "Host key is not trusted → escalate" loop.
known_hosts patched live; this commit closes the playbook gap so the
next occurrence resolves without manual intervention.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-05-02 12:31:37 +08:00
Your Name
7795f027d2
fix(aiops): persist emergency intervention traces
CD Pipeline / tests (push) Successful in 2m56s
Code Review / ai-code-review (push) Failing after 39s
CD Pipeline / build-and-deploy (push) Successful in 12m54s
CD Pipeline / post-deploy-checks (push) Successful in 4m40s
2026-05-01 20:34:33 +08:00
Your Name
433f7b068e
fix(aiops): close ssh and telegram remediation gaps
CD Pipeline / tests (push) Successful in 2m7s
Code Review / ai-code-review (push) Successful in 42s
CD Pipeline / build-and-deploy (push) Successful in 13m14s
CD Pipeline / post-deploy-checks (push) Successful in 4m29s
2026-05-01 16:53:02 +08:00
Your Name
b0da6da1e9
feat(aiops): structure agent loop shadow output
CD Pipeline / tests (push) Successful in 2m50s
Code Review / ai-code-review (push) Successful in 33s
CD Pipeline / build-and-deploy (push) Failing after 25m48s
CD Pipeline / post-deploy-checks (push) Has been cancelled
2026-05-01 15:09:57 +08:00
Your Name
f8e44971c1
feat(aiops): enable read-only agent loop canary
CD Pipeline / tests (push) Successful in 1m43s
Code Review / ai-code-review (push) Successful in 31s
CD Pipeline / build-and-deploy (push) Successful in 10m22s
CD Pipeline / post-deploy-checks (push) Successful in 4m3s
2026-05-01 14:20:16 +08:00
Your Name
b6cf616707
fix(aiops): harden agent tool permission names
CD Pipeline / tests (push) Successful in 1m32s
Code Review / ai-code-review (push) Successful in 27s
CD Pipeline / build-and-deploy (push) Successful in 8m26s
CD Pipeline / post-deploy-checks (push) Successful in 3m37s
2026-05-01 13:52:33 +08:00
Your Name
7e4d995e4b
feat(aiops): add mcp agent loop foundation
CD Pipeline / tests (push) Successful in 1m59s
Code Review / ai-code-review (push) Successful in 28s
run-migration / migrate (push) Failing after 24s
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-05-01 13:21:19 +08:00