Your Name
|
09256be62c
|
fix(rag): use bge embeddings on GCP Ollama lane
Code Review / ai-code-review (push) Successful in 11s
CD Pipeline / tests (push) Successful in 1m22s
CD Pipeline / build-and-deploy (push) Failing after 2h14m5s
CD Pipeline / post-deploy-checks (push) Has been cancelled
|
2026-05-06 05:49:37 +08:00 |
|
Your Name
|
c2c0b1ec82
|
fix(alerts): let GCP Ollama finish before cloud fallback
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / tests (push) Successful in 1m9s
CD Pipeline / build-and-deploy (push) Successful in 4m21s
CD Pipeline / post-deploy-checks (push) Successful in 1m16s
|
2026-05-06 05:27:55 +08:00 |
|
Your Name
|
3b64d66836
|
fix(alerts): guard approval actions and wire playbook learning
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / tests (push) Successful in 42s
CD Pipeline / build-and-deploy (push) Successful in 3m31s
CD Pipeline / post-deploy-checks (push) Successful in 1m18s
|
2026-05-06 03:34:24 +08:00 |
|
Your Name
|
a2c4b3d47e
|
fix(awooop): align console with flywheel execution metrics
Code Review / ai-code-review (push) Has been cancelled
CD Pipeline / tests (push) Successful in 2m22s
CD Pipeline / build-and-deploy (push) Successful in 3m54s
CD Pipeline / post-deploy-checks (push) Successful in 1m17s
|
2026-05-06 00:46:08 +08:00 |
|
Your Name
|
5ed396e390
|
fix(decision): derive telegram dedup from incident signals
CD Pipeline / tests (push) Successful in 58s
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / build-and-deploy (push) Successful in 3m30s
CD Pipeline / post-deploy-checks (push) Successful in 2m19s
|
2026-05-06 00:19:35 +08:00 |
|
Your Name
|
2aa31c205a
|
fix(ai): require 111 before alert cloud fallback
CD Pipeline / tests (push) Successful in 54s
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / build-and-deploy (push) Successful in 3m21s
CD Pipeline / post-deploy-checks (push) Successful in 2m2s
|
2026-05-06 00:05:51 +08:00 |
|
Your Name
|
e208798531
|
fix(ai): keep GCP Ollama lane on safe models
CD Pipeline / tests (push) Successful in 54s
Code Review / ai-code-review (push) Successful in 14s
CD Pipeline / build-and-deploy (push) Successful in 3m25s
CD Pipeline / post-deploy-checks (push) Successful in 1m50s
|
2026-05-05 23:37:33 +08:00 |
|
Your Name
|
1cc215ec30
|
fix(ops): keep Ollama health checks on alert fast model
CD Pipeline / tests (push) Successful in 52s
Code Review / ai-code-review (push) Successful in 9s
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
|
2026-05-05 23:16:21 +08:00 |
|
Your Name
|
c4854bb355
|
fix(ai): isolate heavy Ollama workloads from GCP alert lane
CD Pipeline / tests (push) Successful in 54s
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / build-and-deploy (push) Successful in 3m19s
CD Pipeline / post-deploy-checks (push) Successful in 3m12s
|
2026-05-05 23:06:07 +08:00 |
|
Your Name
|
86bd6432ee
|
fix(ops): make bge-m3 migration idempotent
Code Review / ai-code-review (push) Successful in 9s
run-migration / migrate (push) Successful in 7s
CD Pipeline / tests (push) Successful in 2m8s
CD Pipeline / build-and-deploy (push) Failing after 9s
CD Pipeline / post-deploy-checks (push) Has been skipped
|
2026-05-05 22:21:47 +08:00 |
|
Your Name
|
bf847ad045
|
fix(ai): stabilize GCP Ollama alert lane
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled
|
2026-05-05 22:20:27 +08:00 |
|
Your Name
|
a4e9a04982
|
fix(ops): harden cold-start schedule recovery
Code Review / ai-code-review (push) Successful in 10s
run-migration / migrate (push) Successful in 7s
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled
|
2026-05-05 22:17:10 +08:00 |
|
Your Name
|
ee5e3bc94f
|
fix(openclaw): gate alert cloud fallback behind flag
Code Review / ai-code-review (push) Successful in 27s
CD Pipeline / tests (push) Successful in 5m17s
CD Pipeline / build-and-deploy (push) Failing after 5m35s
CD Pipeline / post-deploy-checks (push) Has been skipped
|
2026-05-05 20:54:47 +08:00 |
|
Your Name
|
6b93c8f454
|
fix(chat): route OpenClaw chat through Ollama lane
CD Pipeline / tests (push) Successful in 5m26s
Code Review / ai-code-review (push) Successful in 25s
CD Pipeline / build-and-deploy (push) Successful in 8m11s
CD Pipeline / post-deploy-checks (push) Has been cancelled
|
2026-05-05 15:57:26 +08:00 |
|
Your Name
|
1cc9de5722
|
fix(ops): point runner guardrail alerts to host script
CD Pipeline / tests (push) Successful in 5m31s
Code Review / ai-code-review (push) Successful in 30s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 37s
CD Pipeline / build-and-deploy (push) Successful in 7m45s
CD Pipeline / post-deploy-checks (push) Successful in 5m4s
|
2026-05-05 15:25:37 +08:00 |
|
Your Name
|
d08d1e4951
|
fix(ops): alert on missing docker resource limits
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Successful in 23s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 38s
|
2026-05-05 15:01:31 +08:00 |
|
Your Name
|
e24c8ea051
|
fix(ci): align B5 schema with tenant isolation
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
|
2026-05-05 15:00:07 +08:00 |
|
Your Name
|
7d45f0cb58
|
fix(ops): alert on stale gitea actions jobs
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Has been cancelled
|
2026-05-05 14:42:09 +08:00 |
|
Your Name
|
fc1a6196df
|
fix(code-review): keep Gemini fallback opt-in
CD Pipeline / tests (push) Successful in 2m2s
Code Review / ai-code-review (push) Successful in 27s
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
|
2026-05-05 14:38:44 +08:00 |
|
Your Name
|
0e14935351
|
fix(ops): classify systemd runner alerts as host resources
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
|
2026-05-05 14:28:18 +08:00 |
|
Your Name
|
34d1c76be9
|
fix(ops): route systemd runner baseline alerts
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
|
2026-05-05 14:19:58 +08:00 |
|
Your Name
|
fe618960a8
|
fix(ops): monitor systemd runners in host baseline
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 39s
|
2026-05-05 14:08:43 +08:00 |
|
Your Name
|
8e22110030
|
fix(governance): keep trust drift watchdog on governance agent
CD Pipeline / tests (push) Successful in 2m51s
Code Review / ai-code-review (push) Successful in 24s
CD Pipeline / build-and-deploy (push) Has started running
CD Pipeline / post-deploy-checks (push) Has been cancelled
|
2026-05-05 14:00:13 +08:00 |
|
Your Name
|
2ff0ef3bb6
|
fix(openclaw): route legacy ollama through failover endpoints
CD Pipeline / tests (push) Failing after 1m49s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 24s
|
2026-05-05 13:55:52 +08:00 |
|
Your Name
|
bb1995f349
|
fix(awooop): use naive utc for run lease timestamps
CD Pipeline / tests (push) Failing after 1m48s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Has been cancelled
|
2026-05-05 13:53:07 +08:00 |
|
Your Name
|
e8e6748f70
|
fix(ops): add docker host resource baseline guardrails
CD Pipeline / tests (push) Failing after 1m50s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 25s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 38s
|
2026-05-05 13:45:09 +08:00 |
|
Your Name
|
a57e3d3d75
|
test(consensus): expect redis namespace dual write
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
|
2026-05-05 13:41:41 +08:00 |
|
Your Name
|
b00a7b050a
|
test(ollama): align inference connect errors with degraded health
CD Pipeline / tests (push) Failing after 2m26s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 28s
|
2026-05-05 13:34:19 +08:00 |
|
Your Name
|
506744ba3a
|
test(ollama): keep slow gcp primary on ollama
CD Pipeline / tests (push) Failing after 2m21s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 26s
|
2026-05-05 13:29:27 +08:00 |
|
Your Name
|
869646459c
|
fix(ollama): treat legacy primary as ollama
CD Pipeline / tests (push) Failing after 1m48s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 28s
|
2026-05-05 13:25:27 +08:00 |
|
Your Name
|
33d4326cce
|
test(ollama): align slow recovery with gcp routing policy
CD Pipeline / tests (push) Failing after 1m51s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 33s
|
2026-05-05 13:21:16 +08:00 |
|
Your Name
|
f78b1b0690
|
fix(ollama): honor provider endpoint selection
Code Review / ai-code-review (push) Successful in 37s
|
2026-05-05 13:14:46 +08:00 |
|
Your Name
|
2e17325c3f
|
fix(ollama): 更新 failover_manager URL 註解反映 ADR-110 nginx proxy 拓撲
Code Review / ai-code-review (push) Successful in 43s
url_primary/secondary/tertiary 的 comment 還是舊版(ADR-110 前的 IP),
更新為 110:11435→GCP-A / 11436→GCP-B / 11437→Local111 nginx proxy 格式。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-05 11:03:36 +08:00 |
|
Your Name
|
e22b8e7ab2
|
feat(awooop): Operator Console API + 前端(leWOOOgo 積木化修復)
Code Review / ai-code-review (push) Successful in 42s
後端:
- 新增 platform_operator_service.py(DB 存取集中 Service 層)
- Router 層移除 Depends(get_db),改呼叫 Service 函數
- tenants/contracts/operator_runs 三個 Router 符合 leWOOOgo 規範
- __init__.py 整合四個 platform router
前端:
- apps/web/src/app/[locale]/awooop/ 完整建立(7 個頁面)
- layout.tsx:四分頁導覽(tenants/contracts/runs/approvals)
- 全部使用 @/i18n/routing(Link/usePathname/useRouter)避免 i18n 路徑問題
- approvals page:10s 自動刷新、timeout 倒數、緊急紅色高亮
ADR-106/107/112/114/115/116
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-05 11:00:20 +08:00 |
|
Your Name
|
aa4ccec429
|
fix(watchdog): ADR-092 B4 — 三層修復消除 META SYSTEM 重複告警 + Ollama 路由強化
Code Review / ai-code-review (push) Successful in 7m16s
問題根因(debugger 全景徹查):
1. Prod 仍跑舊版代碼(ec013f66 後的修法未部署 → 告警字串仍含舊格式)
2. replicas=2 時 Pod 間 grace period 不共享 → violation_codes 分歧 → 不同 SHA256 → dedup 失效
3. 新 Pod 啟動立即執行 _check_once() → rollout 時多發一波
4. W6 violation_codes 含動態 low_count → count 微變繞過 dedup
修復(A2/A3/W6/C1/C2):
- A2:run_ai_slo_watchdog_loop 加 90s leading sleep,避免 rollout 立即觸發
- A3:_grace_active() 改為 Redis cluster-shared(watchdog:cluster_grace, ex=1800s, nx=True)
消除 Pod 間 grace period 不一致;Redis 故障時 fallback 為 process-local monotonic
- W6:violation_codes 移除動態 low_count,改為穩定 "W6:trust_drift"
- C1:ollama_auto_recovery.py recovered_host 改動態 label(依 URL port 判斷 GCP-A/B/Local)
- C2:ConfigMap OLLAMA_FALLBACK_URL 改走 110:11437 nginx proxy,三層容災統一架構
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-05 10:31:53 +08:00 |
|
Your Name
|
3f853accf2
|
fix(alerter): Ollama 恢復告警去重修復 — per-host key + 1h TTL
根因:
1. dedup_key 固定為 "alert:recovery",GCP-A 每 10min 健康閃爍就觸發重發
2. 三層容災下不同主機恢復共用同一個 key,互相污染
修法:
- dedup key 改為 "alert:recovery:{safe_host}",各主機獨立 dedup
- RECOVERY_DEDUP_TTL_SEC = 3600(1h),GCP 持續閃爍只報一次
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-05 01:22:01 +08:00 |
|
Your Name
|
10e665a540
|
fix(watchdog): 修復 META SYSTEM 重複告警 — violation_codes 穩定 dedup
Code Review / ai-code-review (push) Successful in 1m3s
根因:violations 字串含動態浮點數(mean_trust/low_ratio),每次微變 → SHA256 不同 → dedup 失效
修法:新增 violation_codes list(穩定 W-code 格式),dedup 計算只用 violation_codes
violations 保持含動態值(顯示用),Telegram 通知照常顯示完整資訊
W-6 Trust Drift dedup key: W6:trust_drift:low_count={N}(不含浮點數)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-05 00:06:38 +08:00 |
|
Your Name
|
ec013f662d
|
fix(watchdog): 修复 Trust Drift 重复告警 + 建立 GCP Ollama nginx proxy
Code Review / ai-code-review (push) Successful in 45s
Ansible Lint / lint (push) Has been cancelled
- ai_slo_watchdog_job: 改用 trust_drift_detector 纯统计 lib
避免与 governance_agent 每小时自检查重复触发 Telegram
- infra/ansible: 建立 110 nginx proxy 转发到 GCP-A/B
端口 11435 -> 34.143.170.20:11434 (GCP-A)
端口 11436 -> 34.21.145.224:11434 (GCP-B)
- docs/runbooks: DEPLOY-GCP-OLLAMA-PROXY.md 完整部署指南
- ops/nginx: 手动部署脚本供 110 直接执行
ADR-110 三层容灾启用前提:先部署 proxy,再改 ConfigMap
|
2026-05-04 23:12:35 +08:00 |
|
Your Name
|
a1b61289f5
|
fix(governance): 修復 skip 路徑無限迴圈 + MCP 評分偏低根因
Code Review / ai-code-review (push) Successful in 59s
根因一:GovernanceDispatcher skip 決策後未記錄任何狀態
- 事件永遠 resolved=False → 每 30s 重撈 → 每輪呼叫 LLM + Prometheus
- 4437 筆 stale 事件積壓,導致 governance_fusion_complete 每 20s 狂刷
修復:
1. Redis 90min 冷卻鍵(governance:skip:{event_id})防止重複 LLM 呼叫
2. 超過 2h 的 stale skip 事件自動標記 resolved=True
3. 直接 bulk-resolve 4437 筆 stale 事件 + 預設 105 筆冷卻鍵
根因二:MCP 評分 0.2 硬地板
- SLI recording rules 尚未在 Prometheus 生效 → result_list=[] → success_count=0
- 公式 0.2 + 0.7*0 = 0.2,融合信心度永遠 < 0.65 threshold
修復:
- 空結果(no_data)≠ MCP 故障,改給 0.5 中性貢獻
- 新公式:weighted = success_count + 0.5 * no_data_count;score = 0.2 + 0.7*(weighted/total)
- MCP 全無資料時:0.2 + 0.7*0.5 = 0.55(而非 0.2)
順帶修正 _score_llm 中過時的 GCP-A fallback URL 註解(實際已走 settings)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 20:00:54 +08:00 |
|
Your Name
|
45f6f17558
|
fix(watchdog): dedup hash 非確定性 bug — 改用 hashlib.sha256 + setnx atomic
Code Review / ai-code-review (push) Successful in 56s
根因:Python 內建 hash() 受 PYTHONHASHSEED 影響,每次 process 重啟值不同。
每次 kubectl rollout restart → 新 pod 算出不同 dedup_hash → 繞過 1h TTL → 洗版。
症狀:連續 rollout 4-5 次後,META SYSTEM 每分鐘一條狂發(19:39/40/41/42 截圖)。
修法:
1. hash() → hashlib.sha256(content.encode()).hexdigest()[:12](跨 pod/重啟確定性)
2. redis.exists+setex → redis.set(nx=True) atomic setnx(防多 replica 並發多發)
2026-05-04 ogt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 19:47:42 +08:00 |
|
Your Name
|
8629ac709b
|
feat(awooop): Phase 1-8 完整實作 — AwoooP Agent Platform 六平面架構
run-migration / migrate (push) Failing after 59s
Code Review / ai-code-review (push) Successful in 1m8s
Type Sync Check / check-type-sync (push) Successful in 2m27s
## Phase 1-3: Control Plane + Contract System
- awooop_phase1_control_plane_2026-05-04.sql: 12 張核心表 + RLS
- awooop_phase1_batch1_rls_2026-05-04.sql: 全部 FORCE RLS + GRANT
- packages/awooop-contracts/: 六合約 JSON Schema + golden fixtures
- src/models/awooop_contracts.py: Pydantic v2 contract models(extra=forbid)
- src/repositories/contract_repository.py: contract lifecycle(draft→published→active)
- src/services/contract_service.py: HMAC publish sig + Redis multi-sig activate
- src/services/schema_validator.py: LLM output validator(retry×3, E-SCHEMA-001)
## Phase 2: Tenant Isolation
- awooop_phase2_budget_ledger_2026-05-04.sql: budget_ledger + RLS
- src/services/budget_service.py: Token Budget Hard Kill 三層防線
- src/core/context.py: PROJECT_ID ContextVar(31 background loop 自動繼承)
- src/db/base.py + models.py: project_id 欄位 + RLS set_config 注入
- src/hermes/nl_gateway.py: project_id Redis key 前綴(Phase A 雙寫)
- src/services/anomaly_counter.py: per-project 改造(Phase A fallback)
## Phase 4: Platform Shell in Shadow Mode
- awooop_phase4_run_state_2026-05-04.sql: run_state + step_journal + idempotency
- src/services/run_state_machine.py: 8-state FSM + SKIP LOCKED + stale reaper
- src/services/platform_runtime.py: UUID v7 + W3C trace_id + shadow_execute
- src/services/audit_sink.py: PII/secret redaction 9 patterns
- src/api/v1/platform/runs.py: POST/GET /v1/platform/runs(Router→Service 架構)
- src/workers/platform_worker.py: SKIP LOCKED worker + heartbeat + reaper loop
- src/main.py: platform router + lifespan worker start/stop
## Phase 5: MCP Gateway 五閘門
- awooop_phase5_mcp_gateway_2026-05-04.sql: 4 表 + RLS
- src/plugins/mcp/gateway.py: McpGateway(Gate 1~5, E-MCP-GATE-001~009)
- src/plugins/mcp/redaction_middleware.py: 雙層 redaction + 16K 截斷
- src/plugins/mcp/registry.py: __provider name mangling(ADR-116)
- src/plugins/mcp/credential_resolver.py: k8s secret ref 解析
- tests/test_mcp_credential_isolation.py: 10 個迴歸測試(secret leak 防再現)
## Phase 6-8: EwoooC + Channel Hub + Approval Token
- awooop_phase6_ewoooc_onboarding_2026-05-04.sql: ewoooc tenant + 4 read-only MCP tools
- awooop_phase7_channel_hub_2026-05-04.sql: conversation_event + outbound_message
- src/services/provider_proxy.py: ProviderProxy + PlatformEnvelope(ADR-115)
- src/services/channel_hub.py: Telegram inbound mirror + Progressive Feedback(30s)
- src/services/awooop_approval_token.py: HS256 + jti NX replay 防護 + suggest mode
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 19:31:53 +08:00 |
|
Your Name
|
0a90dab1e9
|
fix(ollama): ADR-110 修正 — 111 升 primary,failover log 改用動態 URL 標識
Code Review / ai-code-review (push) Successful in 56s
根因:K8s pods → GCP-A/B:11434 = connection refused(外網路由不通),
但 ConfigMap 把 GCP-A 設為 OLLAMA_URL(primary),導致容災鏈最終才輪到 111。
ConfigMap (04-configmap.yaml):
- OLLAMA_URL: GCP-A → 192.168.0.111(K8s 內網可達的 primary)
- OLLAMA_SECONDARY_URL: GCP-B → 34.143.170.20(GCP-A,保留待 nginx proxy 後恢復)
- OLLAMA_FALLBACK_URL: 111 → 34.21.145.224(GCP-B,保留待 nginx proxy 後恢復)
- 長期目標:110 架設 nginx proxy 轉發 GCP,ConfigMap 改指向 110:11435/11436
health.py (check_ollama):
- 改為三層輪查(primary → secondary → tertiary)
- primary up → "up";fallback up → "degraded";全掛 → "down"
- 不再只看 OLLAMA_URL 一台,反映實際路由可用狀態
ollama_failover_manager.py (_decide_route / select_provider):
- 變數名改為 url_primary/secondary/tertiary(原 gcp_a/gcp_b/local 與實際 URL 脫鉤)
- routing_reason 改用動態 IP label,不再硬編碼 "GCP-A"/"GCP-B"/"Local"
- _write_failover_audit failed_host 同步改用實際 URL
2026-05-04 ogt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 19:17:07 +08:00 |
|
Your Name
|
855819652e
|
fix(ollama): 修復容災鏈四大 bug — OFFLINE cache 放大 + SLOW 路由缺失 + recovery 命名不一致 + 告警顯示
Code Review / ai-code-review (push) Successful in 48s
根因:NetworkPolicy reload/CNI 瞬態抖動導致三台 Ollama 同時 OFFLINE,被 30s Redis cache 放大
→ 後續 30s 所有請求誤走 Gemini,燒 quota
B1 ollama_health_monitor: OFFLINE TTL 從 30s 縮短至 5s,儘速重試
B3 ollama_health_monitor: inference ConnectError 改判 DEGRADED(connectivity 通了不算 OFFLINE)
B5/B6 ollama_auto_recovery: _current_primary 預設改 "ollama_gcp_a",比對改 startswith("ollama_")
SLOW 修復: failover_manager SLOW 節點視為可用(優於 Gemini quota 耗盡)
SLOW 修復: auto_recovery SLOW 也計入 recovery counter(GCP 高負載仍可切回)
告警顯示: _provider_display 加入 GCP-A/B/Local 具體伺服器識別
告警顯示: _format_automation_block 加入 Token 用量行
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 19:01:27 +08:00 |
|
Your Name
|
f6b698c873
|
fix(aiops): Critic 修復 — PromQL 注入防線 + flag=False escalation bug + 計數虛報
Code Review / ai-code-review (push) Successful in 53s
Bug 1 (drift.py): DRIFT_AUTO_ADOPT_ENABLED=false 時仍設 auto_block_reason
→ 導致 escalation 被觸發,把「停用」誤判為「阻擋事故」
修法: flag=False 不設 auto_block_reason,視為靜默停用
Bug 2 (coverage_evaluator_job.py): asset name/host/namespace/ip 直接 f-string
進 PromQL,無白名單驗證
→ 髒資料可生成語意污染規則或讓 Prometheus reload 失敗
修法: 加 _safe_label_val 正規表達式白名單(^[a-zA-Z0-9._\-]+$),
不合法直接 skip + debug log
Bug 3 (coverage_evaluator_job.py): ON CONFLICT DO NOTHING 衝突時 created 仍 +1
→ stats["rules_auto_created"] 計數虛高,Redis 冷卻被誤設
修法: 改用 INSERT ... RETURNING rule_name,fetchone() 確認實際插入才計數和設冷卻
附加: Redis RuntimeError 單獨 catch + log(不再靜默 pass)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 14:31:53 +08:00 |
|
Your Name
|
72cd79ed8b
|
fix(aiops): Task2 drift auto-adopt 根因修復 + Task3 coverage gap 規則自動生成
Code Review / ai-code-review (push) Successful in 48s
Task 2 — Drift 自動採納修根因:
根因: _analyze_and_notify() 中 report 是 in-memory 物件,
update_interpretation() 只更新 DB,不回寫 report.interpretation,
導致 auto_adopt_if_safe() 永遠看到 None → 觸發「尚無 Nemotron 意圖分析」
→ Drift 自動採納 0 筆
修法: report.interpretation = interpretation(DB 寫入後立即回寫記憶體)
附加: DRIFT_AUTO_ADOPT_ENABLED flag(default=True,回滾: kubectl set env ...=false)
Task 3 — Coverage Gap → AI 規則自動生成執行器:
根因: evaluate_once() 只分析 red 缺口,但無執行器將分析轉為實際規則
→ alert_rule_catalog 的 ai_generated source 永遠為 0 條
修法: 新增 _auto_create_rules_for_uncovered_assets(run_id)
· 查 auto_alerting=red 的 top 5 host/k8s_workload asset
· 依 asset_type 生成範本化 PromQL rule(host→up, k8s→replicas_available)
· UPSERT 進 alert_rule_catalog(source='ai_generated', review_status='pending_review')
· Redis 24h 冷卻防重複,Redis 不可用時降級繼續
附加: COVERAGE_AUTO_RULE_ENABLED flag(default=True,回滾: kubectl set env ...=false)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 14:22:51 +08:00 |
|
Your Name
|
54a4e59af9
|
fix(auto-approve): 主機告警 SSH 診斷指令豁免 bad_target 驗證 — 修復 no_executable_action
根因:host_resource_alert 規則使用 {host}(由 instance label 派生),
與 {target} 無關;但 host 告警缺少 K8s deployment label 導致 target=unknown,
_is_bad_target=True → kubectl_command 被清空 → auto_approve 以
no_executable_action 拒絕 → 每日 3 次人工攔截。
修復:
- alert_rule_engine.py: SSH 指令(startswith "ssh ")跳過 bad_target 驗證
- prompts.py: 主 + Nemo prompt 補 Host* 告警 SSH 診斷規則,防 LLM fallback 路徑輸出 kubectl
- ssh_command_whitelist.py: 新建唯讀 SSH 指令白名單模組(供 _ssh_execute() 執行前驗證)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 14:15:05 +08:00 |
|
Your Name
|
ccffaa5f3e
|
fix(telegram): 補 send_text 公開方法 — 修復 drift_adopt_telegram_failed
drift_adopt_service / drift_remediator / runbook_generator / signoz_webhook
均呼叫 tg.send_text(),但 TelegramGateway 缺少此公開方法,
導致每次呼叫拋出 AttributeError。
新增 send_text() 委派至 _send_request('sendMessage'),
預設 chat_id = alert_chat_id(SRE 群組),支援 HTML parse_mode。
不動任何呼叫方,不改 dedup / nonce 邏輯。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 14:11:32 +08:00 |
|
Your Name
|
f2f5148ca6
|
fix(awooop): Phase 2 第二批 P0 安全強化 + Redis key 命名空間修正
## P0-05 Callback Nonce 防偽造(ADR-116)
- security_interceptor.py:generate_callback_nonce() 新增 HMAC-SHA256[:16] 附加
- 新 5-part 格式:{action}:{short_id}:{ts}:{rand}:{hmac16}
- CALLBACK_HMAC_SECRET 未設定時降級 warning(向後相容)
- security_interceptor.py:parse_callback_data() 新增 5-part 分支 + HMAC 驗證
- config.py:新增 CALLBACK_HMAC_SECRET: str = Field(default="")
## P0-06 Webhook HMAC Replay 防護(ADR-116)
- security_interceptor.py:新增 check_webhook_nonce()(Service 層,get_redis 在此層合法)
- webhooks.py:verify_webhook_signature() 新增兩個可選 Header
- X-Webhook-Timestamp:±300s 窗口驗證(若提供)
- X-Webhook-Nonce:呼叫 check_webhook_nonce()(Redis NX dedup,fail open)
- 移除直接 get_redis import(leWOOOgo 積木化修正)
## P0-11 ollama:current_primary Redis key 遷移 Phase A(ADR-110)
- ollama_auto_recovery.py:_REDIS_PRIMARY_KEY = "platform:ollama:current_primary"
- 雙寫舊 key "ollama:current_primary"(Phase A 30 天)
- 讀取以新 key 為主,fallback 舊 key
## P0-12 consensus Redis key 加 project namespace Phase A
- consensus_engine.py:新增 _consensus_key() / _consensus_legacy_key() helper
- 新 key:{project_id}:consensus:{consensus_id}
- project_id=None 時 fallback __platform__:consensus:{consensus_id}
- Phase A 雙寫 + fallback 讀取,現有呼叫方零修改
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 13:54:38 +08:00 |
|
Your Name
|
2b2359e367
|
fix(ai-router): ADR-110 GCP 三層容災 — 修復 Ollama 直跳 Gemini 根因
Code Review / ai-code-review (push) Successful in 55s
run-migration / migrate (push) Successful in 41s
根因(所有告警 Ollama 失敗直接跳 Gemini 的原因):
AIProviderEnum 缺少 ollama_gcp_a / ollama_gcp_b / ollama_local
→ AIProviderEnum("ollama_gcp_a") 拋 ValueError
→ fallback chain 清空(所有 GCP 端點轉換全失敗)
→ failover_fallback = [](空 list,非 None)
→ fallback_chain 被覆寫為 [] 而非走 Gemini 備援
→ AIProviderRegistry.get("ollama_gcp_a") 回傳 None → not_registered → 跳過
→ 整條 Ollama 鏈(GCP-A → GCP-B → 111)全部略過,直接跳 Gemini
修復:
1. AIProviderEnum 新增 OLLAMA_GCP_A / OLLAMA_GCP_B / OLLAMA_LOCAL
2. PROVIDER_LATENCY_BUDGET 補齊三個新 enum
3. ollama.py 新增 OllamaGcpBProvider(OLLAMA_SECONDARY_URL = GCP-B 34.21.145.224)
4. _init_registry() 補登:
- "ollama_gcp_a" alias → OllamaProvider(GCP-A,OLLAMA_URL)
- OllamaGcpBProvider("ollama_gcp_b",OLLAMA_SECONDARY_URL)
- "ollama_local" alias → Ollama188Provider(111,OLLAMA_FALLBACK_URL)
修復後路由順序:GCP-A → GCP-B → Local(111) → Gemini → Claude
2026-05-04 ogt + Claude Sonnet 4.6
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 13:49:32 +08:00 |
|
Your Name
|
14bf86a462
|
fix(awooop): Phase 2 初批 P0 修正 + Phase 1 Task 1.7 integration tests
## P0 安全 / 架構修正
### P0-08 telemetry.py — 移除硬碼 IP assert(ADR-121)
- config.py:新增 OTEL_ALLOWED_ENDPOINTS(預設 192.168.0.188)+ OTEL_FORBIDDEN_ENDPOINTS
- telemetry.py:_validate_endpoint() 改為 config-driven allowlist/forbidlist
- EwoooC 可用 env 覆寫 OTEL_ALLOWED_ENDPOINTS 指向自己的 SigNoz host
### P0-13 mcp_bridge.py — K8s namespace 由 settings 提供
- config.py:新增 AWOOOI_K8S_NAMESPACE(預設 "awoooi-prod")
- mcp_bridge.py:5 處 parameters.get("namespace", "awoooi-prod") → settings.AWOOOI_K8S_NAMESPACE
- EwoooC/Tsenyang 可設自己的 namespace
### P1-24 decision_manager.py — silence key 常數統一
- 新增 from src.services.telegram_gateway import SILENCE_KEY_PREFIX
- f"telegram_silence:{target}" → f"{SILENCE_KEY_PREFIX}{target}"
- 消除跨兩處重複定義(ADR-118 No Island Coding 原則)
## Phase 1 Task 1.7 Integration Tests
- tests/integration/test_awooop_phase1_schema.py:31 個測試案例
- awooop_projects CHECK 約束(4 cases)
- revision 不可變性 trigger(5 cases:draft 可改、published 鎖住、身份欄不可改、非法流轉、DELETE 禁止)
- awooop_published_revisions VIEW draft/published 隔離(2 cases)
- active_pointer_guard(3 cases:不可指向 draft、可指向 active、跨租戶 mismatch)
- RLS fail-closed(3 cases:未設/錯設/正確設 project_id)
- outbox FK + dedup(2 cases)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 13:46:19 +08:00 |
|