1000 Commits

Author SHA1 Message Date
OG T
2c2bf9d665 fix(awooop): use shared redis for approval gates
Some checks failed
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / tests (push) Successful in 1m0s
CD Pipeline / build-and-deploy (push) Failing after 4m6s
CD Pipeline / post-deploy-checks (push) Has been skipped
2026-05-06 13:18:43 +08:00
OG T
c696b99ccf fix(awooop): authenticate approval decisions
All checks were successful
Code Review / ai-code-review (push) Successful in 11s
CD Pipeline / tests (push) Successful in 1m3s
CD Pipeline / build-and-deploy (push) Successful in 3m28s
CD Pipeline / post-deploy-checks (push) Successful in 1m25s
2026-05-06 13:05:51 +08:00
Your Name
9ef9633aff fix(alerts): bypass proxy timeout for GCP Ollama 2026-05-06 08:55:14 +08:00
Your Name
09256be62c fix(rag): use bge embeddings on GCP Ollama lane
Some checks failed
Code Review / ai-code-review (push) Successful in 11s
CD Pipeline / tests (push) Successful in 1m22s
CD Pipeline / build-and-deploy (push) Failing after 2h14m5s
CD Pipeline / post-deploy-checks (push) Has been cancelled
2026-05-06 05:49:37 +08:00
Your Name
c2c0b1ec82 fix(alerts): let GCP Ollama finish before cloud fallback
All checks were successful
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / tests (push) Successful in 1m9s
CD Pipeline / build-and-deploy (push) Successful in 4m21s
CD Pipeline / post-deploy-checks (push) Successful in 1m16s
2026-05-06 05:27:55 +08:00
Your Name
3b64d66836 fix(alerts): guard approval actions and wire playbook learning
All checks were successful
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / tests (push) Successful in 42s
CD Pipeline / build-and-deploy (push) Successful in 3m31s
CD Pipeline / post-deploy-checks (push) Successful in 1m18s
2026-05-06 03:34:24 +08:00
Your Name
a2c4b3d47e fix(awooop): align console with flywheel execution metrics
Some checks failed
Code Review / ai-code-review (push) Has been cancelled
CD Pipeline / tests (push) Successful in 2m22s
CD Pipeline / build-and-deploy (push) Successful in 3m54s
CD Pipeline / post-deploy-checks (push) Successful in 1m17s
2026-05-06 00:46:08 +08:00
Your Name
5ed396e390 fix(decision): derive telegram dedup from incident signals
All checks were successful
CD Pipeline / tests (push) Successful in 58s
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / build-and-deploy (push) Successful in 3m30s
CD Pipeline / post-deploy-checks (push) Successful in 2m19s
2026-05-06 00:19:35 +08:00
Your Name
2aa31c205a fix(ai): require 111 before alert cloud fallback
All checks were successful
CD Pipeline / tests (push) Successful in 54s
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / build-and-deploy (push) Successful in 3m21s
CD Pipeline / post-deploy-checks (push) Successful in 2m2s
2026-05-06 00:05:51 +08:00
Your Name
e208798531 fix(ai): keep GCP Ollama lane on safe models
All checks were successful
CD Pipeline / tests (push) Successful in 54s
Code Review / ai-code-review (push) Successful in 14s
CD Pipeline / build-and-deploy (push) Successful in 3m25s
CD Pipeline / post-deploy-checks (push) Successful in 1m50s
2026-05-05 23:37:33 +08:00
Your Name
1cc215ec30 fix(ops): keep Ollama health checks on alert fast model
Some checks failed
CD Pipeline / tests (push) Successful in 52s
Code Review / ai-code-review (push) Successful in 9s
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-05-05 23:16:21 +08:00
Your Name
c4854bb355 fix(ai): isolate heavy Ollama workloads from GCP alert lane
All checks were successful
CD Pipeline / tests (push) Successful in 54s
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / build-and-deploy (push) Successful in 3m19s
CD Pipeline / post-deploy-checks (push) Successful in 3m12s
2026-05-05 23:06:07 +08:00
Your Name
86bd6432ee fix(ops): make bge-m3 migration idempotent
Some checks failed
Code Review / ai-code-review (push) Successful in 9s
run-migration / migrate (push) Successful in 7s
CD Pipeline / tests (push) Successful in 2m8s
CD Pipeline / build-and-deploy (push) Failing after 9s
CD Pipeline / post-deploy-checks (push) Has been skipped
2026-05-05 22:21:47 +08:00
Your Name
bf847ad045 fix(ai): stabilize GCP Ollama alert lane
Some checks failed
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled
2026-05-05 22:20:27 +08:00
Your Name
a4e9a04982 fix(ops): harden cold-start schedule recovery
Some checks failed
Code Review / ai-code-review (push) Successful in 10s
run-migration / migrate (push) Successful in 7s
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled
2026-05-05 22:17:10 +08:00
Your Name
ee5e3bc94f fix(openclaw): gate alert cloud fallback behind flag
Some checks failed
Code Review / ai-code-review (push) Successful in 27s
CD Pipeline / tests (push) Successful in 5m17s
CD Pipeline / build-and-deploy (push) Failing after 5m35s
CD Pipeline / post-deploy-checks (push) Has been skipped
2026-05-05 20:54:47 +08:00
Your Name
6b93c8f454 fix(chat): route OpenClaw chat through Ollama lane
Some checks failed
CD Pipeline / tests (push) Successful in 5m26s
Code Review / ai-code-review (push) Successful in 25s
CD Pipeline / build-and-deploy (push) Successful in 8m11s
CD Pipeline / post-deploy-checks (push) Has been cancelled
2026-05-05 15:57:26 +08:00
Your Name
1cc9de5722 fix(ops): point runner guardrail alerts to host script
All checks were successful
CD Pipeline / tests (push) Successful in 5m31s
Code Review / ai-code-review (push) Successful in 30s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 37s
CD Pipeline / build-and-deploy (push) Successful in 7m45s
CD Pipeline / post-deploy-checks (push) Successful in 5m4s
2026-05-05 15:25:37 +08:00
Your Name
d08d1e4951 fix(ops): alert on missing docker resource limits
Some checks failed
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Successful in 23s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 38s
2026-05-05 15:01:31 +08:00
Your Name
e24c8ea051 fix(ci): align B5 schema with tenant isolation
Some checks failed
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
2026-05-05 15:00:07 +08:00
Your Name
7d45f0cb58 fix(ops): alert on stale gitea actions jobs
Some checks failed
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Has been cancelled
2026-05-05 14:42:09 +08:00
Your Name
fc1a6196df fix(code-review): keep Gemini fallback opt-in
Some checks failed
CD Pipeline / tests (push) Successful in 2m2s
Code Review / ai-code-review (push) Successful in 27s
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
2026-05-05 14:38:44 +08:00
Your Name
0e14935351 fix(ops): classify systemd runner alerts as host resources
Some checks failed
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
2026-05-05 14:28:18 +08:00
Your Name
34d1c76be9 fix(ops): route systemd runner baseline alerts
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
2026-05-05 14:19:58 +08:00
Your Name
fe618960a8 fix(ops): monitor systemd runners in host baseline
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 39s
2026-05-05 14:08:43 +08:00
Your Name
8e22110030 fix(governance): keep trust drift watchdog on governance agent
Some checks failed
CD Pipeline / tests (push) Successful in 2m51s
Code Review / ai-code-review (push) Successful in 24s
CD Pipeline / build-and-deploy (push) Has started running
CD Pipeline / post-deploy-checks (push) Has been cancelled
2026-05-05 14:00:13 +08:00
Your Name
2ff0ef3bb6 fix(openclaw): route legacy ollama through failover endpoints
Some checks failed
CD Pipeline / tests (push) Failing after 1m49s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 24s
2026-05-05 13:55:52 +08:00
Your Name
bb1995f349 fix(awooop): use naive utc for run lease timestamps
Some checks failed
CD Pipeline / tests (push) Failing after 1m48s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Has been cancelled
2026-05-05 13:53:07 +08:00
Your Name
e8e6748f70 fix(ops): add docker host resource baseline guardrails
Some checks failed
CD Pipeline / tests (push) Failing after 1m50s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 25s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 38s
2026-05-05 13:45:09 +08:00
Your Name
a57e3d3d75 test(consensus): expect redis namespace dual write
Some checks failed
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
2026-05-05 13:41:41 +08:00
Your Name
b00a7b050a test(ollama): align inference connect errors with degraded health
Some checks failed
CD Pipeline / tests (push) Failing after 2m26s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 28s
2026-05-05 13:34:19 +08:00
Your Name
506744ba3a test(ollama): keep slow gcp primary on ollama
Some checks failed
CD Pipeline / tests (push) Failing after 2m21s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 26s
2026-05-05 13:29:27 +08:00
Your Name
869646459c fix(ollama): treat legacy primary as ollama
Some checks failed
CD Pipeline / tests (push) Failing after 1m48s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 28s
2026-05-05 13:25:27 +08:00
Your Name
33d4326cce test(ollama): align slow recovery with gcp routing policy
Some checks failed
CD Pipeline / tests (push) Failing after 1m51s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 33s
2026-05-05 13:21:16 +08:00
Your Name
f78b1b0690 fix(ollama): honor provider endpoint selection
All checks were successful
Code Review / ai-code-review (push) Successful in 37s
2026-05-05 13:14:46 +08:00
Your Name
2e17325c3f fix(ollama): update failover_manager URL comments to reflect the ADR-110 nginx proxy topology
All checks were successful
Code Review / ai-code-review (push) Successful in 43s
The comments on url_primary/secondary/tertiary still showed the old values (pre-ADR-110 IPs);
updated them to the nginx proxy format: 110:11435→GCP-A / 11436→GCP-B / 11437→Local111.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 11:03:36 +08:00
Your Name
e22b8e7ab2 feat(awooop): Operator Console API + frontend (leWOOOgo building-block fix)
All checks were successful
Code Review / ai-code-review (push) Successful in 42s
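Backend:
- Add platform_operator_service.py (centralizes DB access in the Service layer)
- Remove Depends(get_db) from the Router layer; routers now call Service functions
- The three routers tenants/contracts/operator_runs now conform to the leWOOOgo convention
- __init__.py wires up the four platform routers

Frontend:
- apps/web/src/app/[locale]/awooop/ fully built out (7 pages)
- layout.tsx: four-tab navigation (tenants/contracts/runs/approvals)
- All pages use @/i18n/routing (Link/usePathname/useRouter) to avoid i18n path issues
- approvals page: 10s auto-refresh, timeout countdown, red highlight for urgent items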

ADR-106/107/112/114/115/116

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 11:00:20 +08:00
Your Name
aa4ccec429 fix(watchdog): ADR-092 B4: three-layer fix to eliminate duplicate META SYSTEM alerts + Ollama routing hardening
All checks were successful
Code Review / ai-code-review (push) Successful in 7m16s
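Root causes (full investigation by the debugger):
1. Prod was still running old code (fixes made after ec013f66 had not been deployed → alert strings still used the old format)
2. With replicas=2 the grace period was not shared between Pods → violation_codes diverged → different SHA256 → dedup failed
3. A new Pod ran _check_once() immediately on startup → an extra wave of alerts on every rollout
4. W6 violation_codes contained the dynamic low_count → small changes in the count bypassed dedup

Fixes (A2/A3/W6/C1/C2):
- A2: add a 90s leading sleep to run_ai_slo_watchdog_loop so a rollout does not trigger a check immediately
- A3: _grace_active() now uses a Redis cluster-shared key (watchdog:cluster_grace, ex=1800s, nx=True),
     which removes grace-period inconsistency between Pods; if Redis fails it falls back to a process-local monotonic clock
- W6: remove the dynamic low_count from violation_codes and use the stable "W6:trust_drift" instead
- C1: ollama_auto_recovery.py recovered_host now uses a dynamic label (GCP-A/B/Local derived from the URL port)
- C2: ConfigMap OLLAMA_FALLBACK_URL now goes through the 110:11437 nginx proxy, unifying the three-tier failover architecture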

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 10:31:53 +08:00
Your Name
3f853accf2 fix(alerter): fix Ollama recovery-alert dedup: per-host key + 1h TTL
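Root cause:
1. dedup_key was hard-coded to "alert:recovery", so each time GCP-A's health flapped (every 10min) the alert was re-sent
2. Under three-tier failover, recoveries of different hosts shared the same key and contaminated each other

Fix:
- dedup key changed to "alert:recovery:{safe_host}", so each host is deduplicated independently
- RECOVERY_DEDUP_TTL_SEC = 3600 (1h); while GCP keeps flapping, only one alert is sent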
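
The per-host key described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the sanitization rule behind `safe_host`, the helper names `recovery_dedup_key` and `should_send_recovery`, and the plain dict standing in for Redis are all assumptions.

```python
import re

RECOVERY_DEDUP_TTL_SEC = 3600  # 1h, the value stated in the commit message


def recovery_dedup_key(host: str) -> str:
    """Build a per-host dedup key of the form alert:recovery:{safe_host}."""
    # Assumption: "safe_host" means every character outside [A-Za-z0-9._-]
    # is replaced with "_" so the host cannot add extra key segments.
    safe_host = re.sub(r"[^A-Za-z0-9._-]", "_", host.strip().lower())
    return f"alert:recovery:{safe_host}"


def should_send_recovery(store: dict, host: str, now: float) -> bool:
    """Return True only for the first recovery alert per host per TTL window.

    `store` maps key -> expiry timestamp and stands in for Redis here.
    """
    key = recovery_dedup_key(host)
    expiry = store.get(key)
    if expiry is not None and expiry > now:
        return False  # still inside the dedup window for this host
    store[key] = now + RECOVERY_DEDUP_TTL_SEC
    return True
```

With this shape, a flapping GCP-A suppresses only its own repeats; a recovery on another host uses a different key and still goes out.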

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 01:22:01 +08:00
Your Name
10e665a540 fix(watchdog): fix duplicate META SYSTEM alerts: stable dedup via violation_codes
All checks were successful
Code Review / ai-code-review (push) Successful in 1m3s
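Root cause: the violations strings contained dynamic floats (mean_trust/low_ratio) that changed slightly on every run → different SHA256 → dedup failed
Fix: add a violation_codes list (stable W-code format); the dedup hash is computed from violation_codes only
     violations keeps its dynamic values (for display), so Telegram notifications still show the full details

W-6 Trust Drift dedup key: W6:trust_drift:low_count={N} (contains no floats)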
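
A small sketch of the idea, separating a stable code from the display text. The `Violation` dataclass, the function name `codes_dedup_hash`, the `|` separator, and the sample values in the usage note are hypothetical; only the principle (hash the codes, never the float-bearing strings) comes from the commit message.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Violation:
    code: str    # stable identifier used for dedup, e.g. "W6:trust_drift:low_count=3"
    detail: str  # human-readable text with dynamic values, used for display only


def codes_dedup_hash(violations: list[Violation]) -> str:
    """Hash only the stable codes; sorting makes the result order-independent."""
    joined = "|".join(sorted(v.code for v in violations))
    # Truncating to 12 hex characters is an assumption made for this sketch.
    return hashlib.sha256(joined.encode()).hexdigest()[:12]
```

Two runs that differ only in `detail` (say `mean_trust=0.412` versus `mean_trust=0.409`) produce the same hash, while a change in `low_count` still changes it, since that value remains part of the code at this point in the history.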

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 00:06:38 +08:00
Your Name
ec013f662d fix(watchdog): fix duplicate Trust Drift alerts + set up GCP Ollama nginx proxy
Some checks failed
Code Review / ai-code-review (push) Successful in 45s
Ansible Lint / lint (push) Has been cancelled
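- ai_slo_watchdog_job: switch to the pure-statistics trust_drift_detector lib
  so it no longer triggers Telegram a second time alongside governance_agent's hourly self-check

- infra/ansible: set up an nginx proxy on 110 that forwards to GCP-A/B
  port 11435 -> 34.143.170.20:11434 (GCP-A)
  port 11436 -> 34.21.145.224:11434 (GCP-B)

- docs/runbooks: DEPLOY-GCP-OLLAMA-PROXY.md, the full deployment guide
- ops/nginx: manual deployment script to run directly on 110

Prerequisite for enabling ADR-110 three-tier failover: deploy the proxy first, then change the ConfigMap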
2026-05-04 23:12:35 +08:00
Your Name
a1b61289f5 fix(governance): fix infinite loop on the skip path + root cause of low MCP scores
All checks were successful
Code Review / ai-code-review (push) Successful in 59s
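Root cause 1: GovernanceDispatcher recorded no state after a skip decision
- Events stayed resolved=False forever → re-fetched every 30s → LLM + Prometheus called on every round
- A backlog of 4437 stale events caused governance_fusion_complete to be logged every 20s

Fix:
1. A Redis 90min cooldown key (governance:skip:{event_id}) prevents repeated LLM calls
2. Stale skip events older than 2h are automatically marked resolved=True
3. Bulk-resolved the 4437 stale events directly + pre-set 105 cooldown keys

Root cause 2: MCP score pinned at the 0.2 hard floor
- SLI recording rules were not yet active in Prometheus → result_list=[] → success_count=0
- Formula 0.2 + 0.7*0 = 0.2, so the fused confidence was always < the 0.65 threshold

Fix:
- An empty result (no_data) ≠ MCP failure; it now gets a neutral 0.5 contribution
- New formula: weighted = success_count + 0.5 * no_data_count; score = 0.2 + 0.7*(weighted/total)
- When MCP has no data at all: 0.2 + 0.7*0.5 = 0.55 (instead of 0.2)

Also corrects the outdated GCP-A fallback URL comment in _score_llm (the code already reads from settings)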
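
The new scoring formula can be written out as a short function. The function name `mcp_score` and the handling of `total == 0` are assumptions for illustration; the formula itself is the one quoted above.

```python
def mcp_score(success_count: int, no_data_count: int, total: int) -> float:
    """score = 0.2 + 0.7 * (weighted / total), weighted = success + 0.5 * no_data."""
    if total <= 0:
        # Assumption: with no queries at all, fall back to the neutral value.
        return 0.2 + 0.7 * 0.5
    weighted = success_count + 0.5 * no_data_count
    return 0.2 + 0.7 * (weighted / total)
```

For four queries that all return no data this gives 0.2 + 0.7 * 0.5 = 0.55, which is still below the 0.65 threshold on its own but no longer drags the fused confidence down to the floor.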

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 20:00:54 +08:00
Your Name
45f6f17558 fix(watchdog): non-deterministic dedup hash bug: switch to hashlib.sha256 + atomic setnx
All checks were successful
Code Review / ai-code-review (push) Successful in 56s
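Root cause: Python's built-in hash() is affected by PYTHONHASHSEED, so its value differs after every process restart.
Every kubectl rollout restart → the new pod computes a different dedup_hash → bypasses the 1h TTL → alert flood.

Symptom: after 4-5 consecutive rollouts, META SYSTEM fired one alert per minute (screenshots at 19:39/40/41/42).

Fix:
1. hash() → hashlib.sha256(content.encode()).hexdigest()[:12] (deterministic across pods and restarts)
2. redis.exists+setex → redis.set(nx=True), an atomic setnx (prevents concurrent duplicate sends from multiple replicas)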
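
Both changes can be sketched together. The digest expression is taken from the commit message; everything else is an assumption for illustration: the `watchdog:dedup:` key prefix, the `claim_alert` helper, and `FakeRedis`, an in-memory stand-in that mimics only the `set(name, value, ex=..., nx=...)` call shape of redis-py and ignores expiry.

```python
import hashlib


def dedup_hash(content: str) -> str:
    """Deterministic across processes, unlike the built-in hash() on str."""
    return hashlib.sha256(content.encode()).hexdigest()[:12]


class FakeRedis:
    """In-memory stand-in with a redis-py-like set(); expiry is not simulated."""

    def __init__(self):
        self._data = {}

    def set(self, name, value, ex=None, nx=False):
        if nx and name in self._data:
            return None  # redis-py returns None when NX prevents the write
        self._data[name] = value
        return True


def claim_alert(client, content: str, ttl_sec: int = 3600) -> bool:
    """Return True for exactly one caller per content within the TTL window."""
    key = f"watchdog:dedup:{dedup_hash(content)}"  # key prefix is hypothetical
    # One SET with NX replaces the racy exists-then-setex pair.
    return bool(client.set(key, "1", ex=ttl_sec, nx=True))
```

Because the check and the write happen in a single command, two replicas evaluating the same violation at the same moment cannot both see "key missing" and both send.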

2026-05-04 ogt

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 19:47:42 +08:00
Your Name
8629ac709b feat(awooop): full Phase 1-8 implementation: AwoooP Agent Platform six-plane architecture
Some checks failed
run-migration / migrate (push) Failing after 59s
Code Review / ai-code-review (push) Successful in 1m8s
Type Sync Check / check-type-sync (push) Successful in 2m27s
## Phase 1-3: Control Plane + Contract System
- awooop_phase1_control_plane_2026-05-04.sql: 12 core tables + RLS
- awooop_phase1_batch1_rls_2026-05-04.sql: FORCE RLS + GRANT on all tables
- packages/awooop-contracts/: JSON Schema for the six contracts + golden fixtures
- src/models/awooop_contracts.py: Pydantic v2 contract models (extra=forbid)
- src/repositories/contract_repository.py: contract lifecycle (draft→published→active)
- src/services/contract_service.py: HMAC publish sig + Redis multi-sig activate
- src/services/schema_validator.py: LLM output validator (retry×3, E-SCHEMA-001)

## Phase 2: Tenant Isolation
- awooop_phase2_budget_ledger_2026-05-04.sql: budget_ledger + RLS
- src/services/budget_service.py: Token Budget Hard Kill, three lines of defense
- src/core/context.py: PROJECT_ID ContextVar (inherited automatically by 31 background loops)
- src/db/base.py + models.py: project_id column + RLS set_config injection
- src/hermes/nl_gateway.py: project_id Redis key prefix (Phase A dual write)
- src/services/anomaly_counter.py: per-project rework (Phase A fallback)

## Phase 4: Platform Shell in Shadow Mode
- awooop_phase4_run_state_2026-05-04.sql: run_state + step_journal + idempotency
- src/services/run_state_machine.py: 8-state FSM + SKIP LOCKED + stale reaper
- src/services/platform_runtime.py: UUID v7 + W3C trace_id + shadow_execute
- src/services/audit_sink.py: PII/secret redaction (9 patterns)
- src/api/v1/platform/runs.py: POST/GET /v1/platform/runs (Router→Service architecture)
- src/workers/platform_worker.py: SKIP LOCKED worker + heartbeat + reaper loop
- src/main.py: platform router + lifespan worker start/stop

## Phase 5: MCP Gateway, five gates
- awooop_phase5_mcp_gateway_2026-05-04.sql: 4 tables + RLS
- src/plugins/mcp/gateway.py: McpGateway (Gate 1~5, E-MCP-GATE-001~009)
- src/plugins/mcp/redaction_middleware.py: two-layer redaction + 16K truncation
- src/plugins/mcp/registry.py: __provider name mangling (ADR-116)
- src/plugins/mcp/credential_resolver.py: k8s secret ref resolution
- tests/test_mcp_credential_isolation.py: 10 regression tests (guard against secret-leak recurrence)

## Phase 6-8: EwoooC + Channel Hub + Approval Token
- awooop_phase6_ewoooc_onboarding_2026-05-04.sql: ewoooc tenant + 4 read-only MCP tools
- awooop_phase7_channel_hub_2026-05-04.sql: conversation_event + outbound_message
- src/services/provider_proxy.py: ProviderProxy + PlatformEnvelope (ADR-115)
- src/services/channel_hub.py: Telegram inbound mirror + Progressive Feedback (30s)
- src/services/awooop_approval_token.py: HS256 + jti NX replay protection + suggest mode

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 19:31:53 +08:00
Your Name
0a90dab1e9 fix(ollama): ADR-110 correction: promote 111 to primary; failover log now uses dynamic URL labels
All checks were successful
Code Review / ai-code-review (push) Successful in 56s
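Root cause: K8s pods → GCP-A/B:11434 = connection refused (no route to the external network),
but the ConfigMap set GCP-A as OLLAMA_URL (primary), so the failover chain reached 111 only as the last resort.

ConfigMap (04-configmap.yaml):
- OLLAMA_URL: GCP-A → 192.168.0.111 (a primary reachable from the K8s internal network)
- OLLAMA_SECONDARY_URL: GCP-B → 34.143.170.20 (GCP-A; kept, to be restored once the nginx proxy is in place)
- OLLAMA_FALLBACK_URL: 111 → 34.21.145.224 (GCP-B; kept, to be restored once the nginx proxy is in place)
- Long-term goal: set up an nginx proxy on 110 that forwards to GCP, then point the ConfigMap at 110:11435/11436

health.py (check_ollama):
- Now polls three tiers in order (primary → secondary → tertiary)
- primary up → "up"; fallback up → "degraded"; all down → "down"
- No longer checks only the single OLLAMA_URL host; reflects actual routing availability

ollama_failover_manager.py (_decide_route / select_provider):
- Variables renamed to url_primary/secondary/tertiary (the old gcp_a/gcp_b/local names no longer matched the actual URLs)
- routing_reason now uses a dynamic IP label instead of the hard-coded "GCP-A"/"GCP-B"/"Local"
- _write_failover_audit failed_host likewise uses the actual URL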
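
The three-tier status rule listed under health.py above might look like this in outline. The signature is an assumption: the real check_ollama presumably performs its own HTTP probes, whereas this sketch takes an injected `is_up` callable (hypothetical) so the status logic can be read and tested in isolation.

```python
from typing import Callable, Sequence


def check_ollama(urls: Sequence[str], is_up: Callable[[str], bool]) -> str:
    """Return "up", "degraded", or "down" for an ordered list of endpoints.

    urls[0] is the primary; the rest are fallbacks tried in order.
    `is_up` is an injected probe, e.g. an HTTP GET with a short timeout.
    """
    for index, url in enumerate(urls):
        if is_up(url):
            # Only a healthy primary counts as fully "up".
            return "up" if index == 0 else "degraded"
    return "down"
```

Passing the URLs as an ordered sequence keeps the function independent of which host happens to be primary, which matches the renaming to url_primary/secondary/tertiary in the same commit.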

2026-05-04 ogt

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 19:17:07 +08:00
Your Name
855819652e fix(ollama): fix four major failover-chain bugs: OFFLINE cache amplification + missing SLOW routing + inconsistent recovery naming + alert display
All checks were successful
Code Review / ai-code-review (push) Successful in 48s
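Root cause: transient jitter from a NetworkPolicy reload/CNI marked all three Ollama hosts OFFLINE at once, and the 30s Redis cache amplified it
  → for the next 30s every request was wrongly routed to Gemini, burning quota

B1 ollama_health_monitor: OFFLINE TTL shortened from 30s to 5s so retries happen sooner
B3 ollama_health_monitor: an inference ConnectError is now classified as DEGRADED (if connectivity works, the host is not OFFLINE)
B5/B6 ollama_auto_recovery: _current_primary default changed to "ollama_gcp_a"; the comparison now uses startswith("ollama_")
SLOW fix: failover_manager treats SLOW nodes as available (better than exhausting Gemini quota)
SLOW fix: auto_recovery also counts SLOW toward the recovery counter (can still switch back while GCP is under heavy load)
Alert display: _provider_display now identifies the specific server (GCP-A/B/Local)
Alert display: _format_automation_block now includes a token-usage line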

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 19:01:27 +08:00
Your Name
f6b698c873 fix(aiops): Critic fixes: PromQL injection guard + flag=False escalation bug + inflated counts
All checks were successful
Code Review / ai-code-review (push) Successful in 53s
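Bug 1 (drift.py): auto_block_reason was still set when DRIFT_AUTO_ADOPT_ENABLED=false
  → this triggered escalation, misclassifying "disabled" as a "blocking incident"
  Fix: with flag=False, do not set auto_block_reason; treat it as silently disabled

Bug 2 (coverage_evaluator_job.py): asset name/host/namespace/ip were interpolated
  into PromQL via f-strings with no whitelist validation
  → dirty data could generate semantically corrupted rules or make the Prometheus reload fail
  Fix: add the _safe_label_val regex whitelist (^[a-zA-Z0-9._\-]+$);
        invalid values are skipped with a debug log

Bug 3 (coverage_evaluator_job.py): created was still incremented when ON CONFLICT DO NOTHING hit a conflict
  → stats["rules_auto_created"] was inflated and the Redis cooldown was set by mistake
  Fix: use INSERT ... RETURNING rule_name; count and set the cooldown only when fetchone() confirms a row was actually inserted

Additionally: Redis RuntimeError is caught separately and logged (no longer a silent pass)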
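
The whitelist from Bug 2 can be illustrated as follows. The regex is the one quoted in the commit message; the helper names `safe_label_val` and `host_up_expr`, the return-None convention, and the rule template `up{instance="..."} == 0` are assumptions for this sketch, since the commit does not show the actual templates.

```python
import re
from typing import Optional

_SAFE_LABEL_RE = re.compile(r"^[a-zA-Z0-9._\-]+$")


def safe_label_val(value: str) -> Optional[str]:
    """Return value if it matches the whitelist, else None (caller skips it)."""
    # fullmatch also rejects a trailing newline, which "$" alone would allow.
    return value if _SAFE_LABEL_RE.fullmatch(value) else None


def host_up_expr(host: str) -> Optional[str]:
    """Build a PromQL expression only from a validated label value."""
    safe = safe_label_val(host)
    if safe is None:
        return None  # skip this asset instead of emitting a broken rule
    return f'up{{instance="{safe}"}} == 0'
```

A value such as `x"} or vector(1) #` is rejected before it ever reaches the f-string, so it cannot close the label matcher and alter the query.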

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 14:31:53 +08:00
Your Name
72cd79ed8b fix(aiops): Task2 drift auto-adopt root-cause fix + Task3 automatic rule generation for coverage gaps
All checks were successful
Code Review / ai-code-review (push) Successful in 48s
Task 2: root-cause fix for Drift auto-adopt:
  Root cause: in _analyze_and_notify(), report is an in-memory object;
        update_interpretation() only updates the DB and does not write back to report.interpretation,
        so auto_adopt_if_safe() always saw None → hit the "no Nemotron intent analysis yet" path
        → 0 drifts auto-adopted
  Fix: report.interpretation = interpretation (write back to memory immediately after the DB write)
  Additionally: DRIFT_AUTO_ADOPT_ENABLED flag (default=True; rollback: kubectl set env ...=false)

Task 3: Coverage Gap → executor for AI rule auto-generation:
  Root cause: evaluate_once() only analyzed red gaps; no executor turned the analysis into actual rules
        → the ai_generated source in alert_rule_catalog always had 0 rules
  Fix: add _auto_create_rules_for_uncovered_assets(run_id)
    · Query the top 5 host/k8s_workload assets with auto_alerting=red
    · Generate a templated PromQL rule by asset_type (host→up, k8s→replicas_available)
    · UPSERT into alert_rule_catalog (source='ai_generated', review_status='pending_review')
    · A Redis 24h cooldown prevents duplicates; when Redis is unavailable, degrade and continue
  Additionally: COVERAGE_AUTO_RULE_ENABLED flag (default=True; rollback: kubectl set env ...=false)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 14:22:51 +08:00
Your Name
54a4e59af9 fix(auto-approve): exempt SSH diagnostic commands for host alerts from bad_target validation: fixes no_executable_action
Root cause: host_resource_alert rules use {host} (derived from the instance label),
which is unrelated to {target}; but host alerts lack a K8s deployment label, so target=unknown,
_is_bad_target=True → kubectl_command was cleared → auto_approve rejected with
no_executable_action → 3 manual interventions per day.

Fix:
- alert_rule_engine.py: SSH commands (startswith "ssh ") skip bad_target validation
- prompts.py: add SSH diagnostic rules for Host* alerts to the main + Nemo prompts, so the LLM fallback path does not output kubectl
- ssh_command_whitelist.py: new read-only SSH command whitelist module (for validation before _ssh_execute() runs)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 14:15:05 +08:00
Your Name
ccffaa5f3e fix(telegram): add public send_text method: fixes drift_adopt_telegram_failed
drift_adopt_service / drift_remediator / runbook_generator / signoz_webhook
all call tg.send_text(), but TelegramGateway lacked this public method,
so every call raised AttributeError.

Adds send_text(), which delegates to _send_request('sendMessage'),
defaults chat_id to alert_chat_id (the SRE group), and supports HTML parse_mode.
No callers are touched, and the dedup / nonce logic is unchanged.
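
A rough sketch of the delegation described above. The class below is a simplified stand-in, not the project's TelegramGateway: the constructor, the synchronous call style, the `_send_request(method, payload)` signature, and the `sent` list that records payloads instead of calling the Bot API are all assumptions. Only the method name, the 'sendMessage' target, the alert_chat_id default, and the HTML parse mode come from the commit message.

```python
from typing import Optional


class TelegramGateway:
    def __init__(self, alert_chat_id: str):
        self.alert_chat_id = alert_chat_id
        self.sent: list = []  # records payloads instead of calling the Bot API

    def _send_request(self, method: str, payload: dict) -> dict:
        # Stand-in for the real HTTP call; its real signature is not shown in the commit.
        self.sent.append((method, payload))
        return {"ok": True}

    def send_text(self, text: str, chat_id: Optional[str] = None,
                  parse_mode: str = "HTML") -> dict:
        """Public wrapper that delegates to _send_request('sendMessage')."""
        payload = {
            "chat_id": chat_id or self.alert_chat_id,  # default: SRE alert group
            "text": text,
            "parse_mode": parse_mode,
        }
        return self._send_request("sendMessage", payload)
```

Adding the method on the gateway, rather than changing the four callers, keeps the fix to a single file, which is consistent with the "no callers are touched" note.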

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 14:11:32 +08:00