Your Name
|
bf847ad045
|
fix(ai): stabilize GCP Ollama alert lane
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled
|
2026-05-05 22:20:27 +08:00 |
|
Your Name
|
a4e9a04982
|
fix(ops): harden cold-start schedule recovery
Code Review / ai-code-review (push) Successful in 10s
run-migration / migrate (push) Successful in 7s
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled
|
2026-05-05 22:17:10 +08:00 |
|
Your Name
|
ee5e3bc94f
|
fix(openclaw): gate alert cloud fallback behind flag
Code Review / ai-code-review (push) Successful in 27s
CD Pipeline / tests (push) Successful in 5m17s
CD Pipeline / build-and-deploy (push) Failing after 5m35s
CD Pipeline / post-deploy-checks (push) Has been skipped
|
2026-05-05 20:54:47 +08:00 |
|
Your Name
|
6b93c8f454
|
fix(chat): route OpenClaw chat through Ollama lane
CD Pipeline / tests (push) Successful in 5m26s
Code Review / ai-code-review (push) Successful in 25s
CD Pipeline / build-and-deploy (push) Successful in 8m11s
CD Pipeline / post-deploy-checks (push) Has been cancelled
|
2026-05-05 15:57:26 +08:00 |
|
Your Name
|
1cc9de5722
|
fix(ops): point runner guardrail alerts to host script
CD Pipeline / tests (push) Successful in 5m31s
Code Review / ai-code-review (push) Successful in 30s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 37s
CD Pipeline / build-and-deploy (push) Successful in 7m45s
CD Pipeline / post-deploy-checks (push) Successful in 5m4s
|
2026-05-05 15:25:37 +08:00 |
|
Your Name
|
d08d1e4951
|
fix(ops): alert on missing docker resource limits
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Successful in 23s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 38s
|
2026-05-05 15:01:31 +08:00 |
|
Your Name
|
e24c8ea051
|
fix(ci): align B5 schema with tenant isolation
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
|
2026-05-05 15:00:07 +08:00 |
|
Your Name
|
7d45f0cb58
|
fix(ops): alert on stale gitea actions jobs
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Has been cancelled
|
2026-05-05 14:42:09 +08:00 |
|
Your Name
|
fc1a6196df
|
fix(code-review): keep Gemini fallback opt-in
CD Pipeline / tests (push) Successful in 2m2s
Code Review / ai-code-review (push) Successful in 27s
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
|
2026-05-05 14:38:44 +08:00 |
|
Your Name
|
0e14935351
|
fix(ops): classify systemd runner alerts as host resources
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
|
2026-05-05 14:28:18 +08:00 |
|
Your Name
|
34d1c76be9
|
fix(ops): route systemd runner baseline alerts
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
|
2026-05-05 14:19:58 +08:00 |
|
Your Name
|
fe618960a8
|
fix(ops): monitor systemd runners in host baseline
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 39s
|
2026-05-05 14:08:43 +08:00 |
|
Your Name
|
8e22110030
|
fix(governance): keep trust drift watchdog on governance agent
CD Pipeline / tests (push) Successful in 2m51s
Code Review / ai-code-review (push) Successful in 24s
CD Pipeline / build-and-deploy (push) Has started running
CD Pipeline / post-deploy-checks (push) Has been cancelled
|
2026-05-05 14:00:13 +08:00 |
|
Your Name
|
2ff0ef3bb6
|
fix(openclaw): route legacy ollama through failover endpoints
CD Pipeline / tests (push) Failing after 1m49s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 24s
|
2026-05-05 13:55:52 +08:00 |
|
Your Name
|
bb1995f349
|
fix(awooop): use naive utc for run lease timestamps
CD Pipeline / tests (push) Failing after 1m48s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Has been cancelled
|
2026-05-05 13:53:07 +08:00 |
|
Your Name
|
e8e6748f70
|
fix(ops): add docker host resource baseline guardrails
CD Pipeline / tests (push) Failing after 1m50s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 25s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 38s
|
2026-05-05 13:45:09 +08:00 |
|
Your Name
|
a57e3d3d75
|
test(consensus): expect redis namespace dual write
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
|
2026-05-05 13:41:41 +08:00 |
|
Your Name
|
b00a7b050a
|
test(ollama): align inference connect errors with degraded health
CD Pipeline / tests (push) Failing after 2m26s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 28s
|
2026-05-05 13:34:19 +08:00 |
|
Your Name
|
506744ba3a
|
test(ollama): keep slow gcp primary on ollama
CD Pipeline / tests (push) Failing after 2m21s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 26s
|
2026-05-05 13:29:27 +08:00 |
|
Your Name
|
869646459c
|
fix(ollama): treat legacy primary as ollama
CD Pipeline / tests (push) Failing after 1m48s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 28s
|
2026-05-05 13:25:27 +08:00 |
|
Your Name
|
33d4326cce
|
test(ollama): align slow recovery with gcp routing policy
CD Pipeline / tests (push) Failing after 1m51s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 33s
|
2026-05-05 13:21:16 +08:00 |
|
Your Name
|
f78b1b0690
|
fix(ollama): honor provider endpoint selection
Code Review / ai-code-review (push) Successful in 37s
|
2026-05-05 13:14:46 +08:00 |
|
Your Name
|
2e17325c3f
|
fix(ollama): 更新 failover_manager URL 註解反映 ADR-110 nginx proxy 拓撲
Code Review / ai-code-review (push) Successful in 43s
url_primary/secondary/tertiary 的 comment 還是舊版(ADR-110 前的 IP),
更新為 110:11435→GCP-A / 11436→GCP-B / 11437→Local111 nginx proxy 格式。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-05 11:03:36 +08:00 |
|
Your Name
|
e22b8e7ab2
|
feat(awooop): Operator Console API + 前端(leWOOOgo 積木化修復)
Code Review / ai-code-review (push) Successful in 42s
後端:
- 新增 platform_operator_service.py(DB 存取集中 Service 層)
- Router 層移除 Depends(get_db),改呼叫 Service 函數
- tenants/contracts/operator_runs 三個 Router 符合 leWOOOgo 規範
- __init__.py 整合四個 platform router
前端:
- apps/web/src/app/[locale]/awooop/ 完整建立(7 個頁面)
- layout.tsx:四分頁導覽(tenants/contracts/runs/approvals)
- 全部使用 @/i18n/routing(Link/usePathname/useRouter)避免 i18n 路徑問題
- approvals page:10s 自動刷新、timeout 倒數、緊急紅色高亮
ADR-106/107/112/114/115/116
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-05 11:00:20 +08:00 |
|
Your Name
|
aa4ccec429
|
fix(watchdog): ADR-092 B4 — 三層修復消除 META SYSTEM 重複告警 + Ollama 路由強化
Code Review / ai-code-review (push) Successful in 7m16s
問題根因(debugger 全景徹查):
1. Prod 仍跑舊版代碼(ec013f66 後的修法未部署 → 告警字串仍含舊格式)
2. replicas=2 時 Pod 間 grace period 不共享 → violation_codes 分歧 → 不同 SHA256 → dedup 失效
3. 新 Pod 啟動立即執行 _check_once() → rollout 時多發一波
4. W6 violation_codes 含動態 low_count → count 微變繞過 dedup
修復(A2/A3/W6/C1/C2):
- A2:run_ai_slo_watchdog_loop 加 90s leading sleep,避免 rollout 立即觸發
- A3:_grace_active() 改為 Redis cluster-shared(watchdog:cluster_grace, ex=1800s, nx=True)
消除 Pod 間 grace period 不一致;Redis 故障時 fallback 為 process-local monotonic
- W6:violation_codes 移除動態 low_count,改為穩定 "W6:trust_drift"
- C1:ollama_auto_recovery.py recovered_host 改動態 label(依 URL port 判斷 GCP-A/B/Local)
- C2:ConfigMap OLLAMA_FALLBACK_URL 改走 110:11437 nginx proxy,三層容災統一架構
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-05 10:31:53 +08:00 |
|
Your Name
|
3f853accf2
|
fix(alerter): Ollama 恢復告警去重修復 — per-host key + 1h TTL
根因:
1. dedup_key 固定為 "alert:recovery",GCP-A 每 10min 健康閃爍就觸發重發
2. 三層容災下不同主機恢復共用同一個 key,互相污染
修法:
- dedup key 改為 "alert:recovery:{safe_host}",各主機獨立 dedup
- RECOVERY_DEDUP_TTL_SEC = 3600(1h),GCP 持續閃爍只報一次
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-05 01:22:01 +08:00 |
|
Your Name
|
10e665a540
|
fix(watchdog): 修復 META SYSTEM 重複告警 — violation_codes 穩定 dedup
Code Review / ai-code-review (push) Successful in 1m3s
根因:violations 字串含動態浮點數(mean_trust/low_ratio),每次微變 → SHA256 不同 → dedup 失效
修法:新增 violation_codes list(穩定 W-code 格式),dedup 計算只用 violation_codes
violations 保持含動態值(顯示用),Telegram 通知照常顯示完整資訊
W-6 Trust Drift dedup key: W6:trust_drift:low_count={N}(不含浮點數)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-05 00:06:38 +08:00 |
|
Your Name
|
ec013f662d
|
fix(watchdog): 修复 Trust Drift 重复告警 + 建立 GCP Ollama nginx proxy
Code Review / ai-code-review (push) Successful in 45s
Ansible Lint / lint (push) Has been cancelled
- ai_slo_watchdog_job: 改用 trust_drift_detector 纯统计 lib
避免与 governance_agent 每小时自检查重复触发 Telegram
- infra/ansible: 建立 110 nginx proxy 转发到 GCP-A/B
端口 11435 -> 34.143.170.20:11434 (GCP-A)
端口 11436 -> 34.21.145.224:11434 (GCP-B)
- docs/runbooks: DEPLOY-GCP-OLLAMA-PROXY.md 完整部署指南
- ops/nginx: 手动部署脚本供 110 直接执行
ADR-110 三层容灾启用前提:先部署 proxy,再改 ConfigMap
|
2026-05-04 23:12:35 +08:00 |
|
Your Name
|
a1b61289f5
|
fix(governance): 修復 skip 路徑無限迴圈 + MCP 評分偏低根因
Code Review / ai-code-review (push) Successful in 59s
根因一:GovernanceDispatcher skip 決策後未記錄任何狀態
- 事件永遠 resolved=False → 每 30s 重撈 → 每輪呼叫 LLM + Prometheus
- 4437 筆 stale 事件積壓,導致 governance_fusion_complete 每 20s 狂刷
修復:
1. Redis 90min 冷卻鍵(governance:skip:{event_id})防止重複 LLM 呼叫
2. 超過 2h 的 stale skip 事件自動標記 resolved=True
3. 直接 bulk-resolve 4437 筆 stale 事件 + 預設 105 筆冷卻鍵
根因二:MCP 評分 0.2 硬地板
- SLI recording rules 尚未在 Prometheus 生效 → result_list=[] → success_count=0
- 公式 0.2 + 0.7*0 = 0.2,融合信心度永遠 < 0.65 threshold
修復:
- 空結果(no_data)≠ MCP 故障,改給 0.5 中性貢獻
- 新公式:weighted = success_count + 0.5 * no_data_count;score = 0.2 + 0.7*(weighted/total)
- MCP 全無資料時:0.2 + 0.7*0.5 = 0.55(而非 0.2)
順帶修正 _score_llm 中過時的 GCP-A fallback URL 註解(實際已走 settings)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 20:00:54 +08:00 |
|
Your Name
|
45f6f17558
|
fix(watchdog): dedup hash 非確定性 bug — 改用 hashlib.sha256 + setnx atomic
Code Review / ai-code-review (push) Successful in 56s
根因:Python 內建 hash() 受 PYTHONHASHSEED 影響,每次 process 重啟值不同。
每次 kubectl rollout restart → 新 pod 算出不同 dedup_hash → 繞過 1h TTL → 洗版。
症狀:連續 rollout 4-5 次後,META SYSTEM 每分鐘一條狂發(19:39/40/41/42 截圖)。
修法:
1. hash() → hashlib.sha256(content.encode()).hexdigest()[:12](跨 pod/重啟確定性)
2. redis.exists+setex → redis.set(nx=True) atomic setnx(防多 replica 並發多發)
2026-05-04 ogt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 19:47:42 +08:00 |
|
Your Name
|
8629ac709b
|
feat(awooop): Phase 1-8 完整實作 — AwoooP Agent Platform 六平面架構
run-migration / migrate (push) Failing after 59s
Code Review / ai-code-review (push) Successful in 1m8s
Type Sync Check / check-type-sync (push) Successful in 2m27s
## Phase 1-3: Control Plane + Contract System
- awooop_phase1_control_plane_2026-05-04.sql: 12 張核心表 + RLS
- awooop_phase1_batch1_rls_2026-05-04.sql: 全部 FORCE RLS + GRANT
- packages/awooop-contracts/: 六合約 JSON Schema + golden fixtures
- src/models/awooop_contracts.py: Pydantic v2 contract models(extra=forbid)
- src/repositories/contract_repository.py: contract lifecycle(draft→published→active)
- src/services/contract_service.py: HMAC publish sig + Redis multi-sig activate
- src/services/schema_validator.py: LLM output validator(retry×3, E-SCHEMA-001)
## Phase 2: Tenant Isolation
- awooop_phase2_budget_ledger_2026-05-04.sql: budget_ledger + RLS
- src/services/budget_service.py: Token Budget Hard Kill 三層防線
- src/core/context.py: PROJECT_ID ContextVar(31 background loop 自動繼承)
- src/db/base.py + models.py: project_id 欄位 + RLS set_config 注入
- src/hermes/nl_gateway.py: project_id Redis key 前綴(Phase A 雙寫)
- src/services/anomaly_counter.py: per-project 改造(Phase A fallback)
## Phase 4: Platform Shell in Shadow Mode
- awooop_phase4_run_state_2026-05-04.sql: run_state + step_journal + idempotency
- src/services/run_state_machine.py: 8-state FSM + SKIP LOCKED + stale reaper
- src/services/platform_runtime.py: UUID v7 + W3C trace_id + shadow_execute
- src/services/audit_sink.py: PII/secret redaction 9 patterns
- src/api/v1/platform/runs.py: POST/GET /v1/platform/runs(Router→Service 架構)
- src/workers/platform_worker.py: SKIP LOCKED worker + heartbeat + reaper loop
- src/main.py: platform router + lifespan worker start/stop
## Phase 5: MCP Gateway 五閘門
- awooop_phase5_mcp_gateway_2026-05-04.sql: 4 表 + RLS
- src/plugins/mcp/gateway.py: McpGateway(Gate 1~5, E-MCP-GATE-001~009)
- src/plugins/mcp/redaction_middleware.py: 雙層 redaction + 16K 截斷
- src/plugins/mcp/registry.py: __provider name mangling(ADR-116)
- src/plugins/mcp/credential_resolver.py: k8s secret ref 解析
- tests/test_mcp_credential_isolation.py: 10 個迴歸測試(secret leak 防再現)
## Phase 6-8: EwoooC + Channel Hub + Approval Token
- awooop_phase6_ewoooc_onboarding_2026-05-04.sql: ewoooc tenant + 4 read-only MCP tools
- awooop_phase7_channel_hub_2026-05-04.sql: conversation_event + outbound_message
- src/services/provider_proxy.py: ProviderProxy + PlatformEnvelope(ADR-115)
- src/services/channel_hub.py: Telegram inbound mirror + Progressive Feedback(30s)
- src/services/awooop_approval_token.py: HS256 + jti NX replay 防護 + suggest mode
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 19:31:53 +08:00 |
|
Your Name
|
0a90dab1e9
|
fix(ollama): ADR-110 修正 — 111 升 primary,failover log 改用動態 URL 標識
Code Review / ai-code-review (push) Successful in 56s
根因:K8s pods → GCP-A/B:11434 = connection refused(外網路由不通),
但 ConfigMap 把 GCP-A 設為 OLLAMA_URL(primary),導致容災鏈最終才輪到 111。
ConfigMap (04-configmap.yaml):
- OLLAMA_URL: GCP-A → 192.168.0.111(K8s 內網可達的 primary)
- OLLAMA_SECONDARY_URL: GCP-B → 34.143.170.20(GCP-A,保留待 nginx proxy 後恢復)
- OLLAMA_FALLBACK_URL: 111 → 34.21.145.224(GCP-B,保留待 nginx proxy 後恢復)
- 長期目標:110 架設 nginx proxy 轉發 GCP,ConfigMap 改指向 110:11435/11436
health.py (check_ollama):
- 改為三層輪查(primary → secondary → tertiary)
- primary up → "up";fallback up → "degraded";全掛 → "down"
- 不再只看 OLLAMA_URL 一台,反映實際路由可用狀態
ollama_failover_manager.py (_decide_route / select_provider):
- 變數名改為 url_primary/secondary/tertiary(原 gcp_a/gcp_b/local 與實際 URL 脫鉤)
- routing_reason 改用動態 IP label,不再硬編碼 "GCP-A"/"GCP-B"/"Local"
- _write_failover_audit failed_host 同步改用實際 URL
2026-05-04 ogt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 19:17:07 +08:00 |
|
Your Name
|
855819652e
|
fix(ollama): 修復容災鏈四大 bug — OFFLINE cache 放大 + SLOW 路由缺失 + recovery 命名不一致 + 告警顯示
Code Review / ai-code-review (push) Successful in 48s
根因:NetworkPolicy reload/CNI 瞬態抖動導致三台 Ollama 同時 OFFLINE,被 30s Redis cache 放大
→ 後續 30s 所有請求誤走 Gemini,燒 quota
B1 ollama_health_monitor: OFFLINE TTL 從 30s 縮短至 5s,儘速重試
B3 ollama_health_monitor: inference ConnectError 改判 DEGRADED(connectivity 通了不算 OFFLINE)
B5/B6 ollama_auto_recovery: _current_primary 預設改 "ollama_gcp_a",比對改 startswith("ollama_")
SLOW 修復: failover_manager SLOW 節點視為可用(優於 Gemini quota 耗盡)
SLOW 修復: auto_recovery SLOW 也計入 recovery counter(GCP 高負載仍可切回)
告警顯示: _provider_display 加入 GCP-A/B/Local 具體伺服器識別
告警顯示: _format_automation_block 加入 Token 用量行
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 19:01:27 +08:00 |
|
Your Name
|
f6b698c873
|
fix(aiops): Critic 修復 — PromQL 注入防線 + flag=False escalation bug + 計數虛報
Code Review / ai-code-review (push) Successful in 53s
Bug 1 (drift.py): DRIFT_AUTO_ADOPT_ENABLED=false 時仍設 auto_block_reason
→ 導致 escalation 被觸發,把「停用」誤判為「阻擋事故」
修法: flag=False 不設 auto_block_reason,視為靜默停用
Bug 2 (coverage_evaluator_job.py): asset name/host/namespace/ip 直接 f-string
進 PromQL,無白名單驗證
→ 髒資料可生成語意污染規則或讓 Prometheus reload 失敗
修法: 加 _safe_label_val 正規表達式白名單(^[a-zA-Z0-9._\-]+$),
不合法直接 skip + debug log
Bug 3 (coverage_evaluator_job.py): ON CONFLICT DO NOTHING 衝突時 created 仍 +1
→ stats["rules_auto_created"] 計數虛高,Redis 冷卻被誤設
修法: 改用 INSERT ... RETURNING rule_name,fetchone() 確認實際插入才計數和設冷卻
附加: Redis RuntimeError 單獨 catch + log(不再靜默 pass)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 14:31:53 +08:00 |
|
Your Name
|
72cd79ed8b
|
fix(aiops): Task2 drift auto-adopt 根因修復 + Task3 coverage gap 規則自動生成
Code Review / ai-code-review (push) Successful in 48s
Task 2 — Drift 自動採納修根因:
根因: _analyze_and_notify() 中 report 是 in-memory 物件,
update_interpretation() 只更新 DB,不回寫 report.interpretation,
導致 auto_adopt_if_safe() 永遠看到 None → 觸發「尚無 Nemotron 意圖分析」
→ Drift 自動採納 0 筆
修法: report.interpretation = interpretation(DB 寫入後立即回寫記憶體)
附加: DRIFT_AUTO_ADOPT_ENABLED flag(default=True,回滾: kubectl set env ...=false)
Task 3 — Coverage Gap → AI 規則自動生成執行器:
根因: evaluate_once() 只分析 red 缺口,但無執行器將分析轉為實際規則
→ alert_rule_catalog 的 ai_generated source 永遠為 0 條
修法: 新增 _auto_create_rules_for_uncovered_assets(run_id)
· 查 auto_alerting=red 的 top 5 host/k8s_workload asset
· 依 asset_type 生成範本化 PromQL rule(host→up, k8s→replicas_available)
· UPSERT 進 alert_rule_catalog(source='ai_generated', review_status='pending_review')
· Redis 24h 冷卻防重複,Redis 不可用時降級繼續
附加: COVERAGE_AUTO_RULE_ENABLED flag(default=True,回滾: kubectl set env ...=false)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 14:22:51 +08:00 |
|
Your Name
|
54a4e59af9
|
fix(auto-approve): 主機告警 SSH 診斷指令豁免 bad_target 驗證 — 修復 no_executable_action
根因:host_resource_alert 規則使用 {host}(由 instance label 派生),
與 {target} 無關;但 host 告警缺少 K8s deployment label 導致 target=unknown,
_is_bad_target=True → kubectl_command 被清空 → auto_approve 以
no_executable_action 拒絕 → 每日 3 次人工攔截。
修復:
- alert_rule_engine.py: SSH 指令(startswith "ssh ")跳過 bad_target 驗證
- prompts.py: 主 + Nemo prompt 補 Host* 告警 SSH 診斷規則,防 LLM fallback 路徑輸出 kubectl
- ssh_command_whitelist.py: 新建唯讀 SSH 指令白名單模組(供 _ssh_execute() 執行前驗證)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 14:15:05 +08:00 |
|
Your Name
|
ccffaa5f3e
|
fix(telegram): 補 send_text 公開方法 — 修復 drift_adopt_telegram_failed
drift_adopt_service / drift_remediator / runbook_generator / signoz_webhook
均呼叫 tg.send_text(),但 TelegramGateway 缺少此公開方法,
導致每次呼叫拋出 AttributeError。
新增 send_text() 委派至 _send_request('sendMessage'),
預設 chat_id = alert_chat_id(SRE 群組),支援 HTML parse_mode。
不動任何呼叫方,不改 dedup / nonce 邏輯。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 14:11:32 +08:00 |
|
Your Name
|
f2f5148ca6
|
fix(awooop): Phase 2 第二批 P0 安全強化 + Redis key 命名空間修正
## P0-05 Callback Nonce 防偽造(ADR-116)
- security_interceptor.py:generate_callback_nonce() 新增 HMAC-SHA256[:16] 附加
- 新 5-part 格式:{action}:{short_id}:{ts}:{rand}:{hmac16}
- CALLBACK_HMAC_SECRET 未設定時降級 warning(向後相容)
- security_interceptor.py:parse_callback_data() 新增 5-part 分支 + HMAC 驗證
- config.py:新增 CALLBACK_HMAC_SECRET: str = Field(default="")
## P0-06 Webhook HMAC Replay 防護(ADR-116)
- security_interceptor.py:新增 check_webhook_nonce()(Service 層,get_redis 在此層合法)
- webhooks.py:verify_webhook_signature() 新增兩個可選 Header
- X-Webhook-Timestamp:±300s 窗口驗證(若提供)
- X-Webhook-Nonce:呼叫 check_webhook_nonce()(Redis NX dedup,fail open)
- 移除直接 get_redis import(leWOOOgo 積木化修正)
## P0-11 ollama:current_primary Redis key 遷移 Phase A(ADR-110)
- ollama_auto_recovery.py:_REDIS_PRIMARY_KEY = "platform:ollama:current_primary"
- 雙寫舊 key "ollama:current_primary"(Phase A 30 天)
- 讀取以新 key 為主,fallback 舊 key
## P0-12 consensus Redis key 加 project namespace Phase A
- consensus_engine.py:新增 _consensus_key() / _consensus_legacy_key() helper
- 新 key:{project_id}:consensus:{consensus_id}
- project_id=None 時 fallback __platform__:consensus:{consensus_id}
- Phase A 雙寫 + fallback 讀取,現有呼叫方零修改
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 13:54:38 +08:00 |
|
Your Name
|
2b2359e367
|
fix(ai-router): ADR-110 GCP 三層容災 — 修復 Ollama 直跳 Gemini 根因
Code Review / ai-code-review (push) Successful in 55s
run-migration / migrate (push) Successful in 41s
根因(所有告警 Ollama 失敗直接跳 Gemini 的原因):
AIProviderEnum 缺少 ollama_gcp_a / ollama_gcp_b / ollama_local
→ AIProviderEnum("ollama_gcp_a") 拋 ValueError
→ fallback chain 清空(所有 GCP 端點轉換全失敗)
→ failover_fallback = [](空 list,非 None)
→ fallback_chain 被覆寫為 [] 而非走 Gemini 備援
→ AIProviderRegistry.get("ollama_gcp_a") 回傳 None → not_registered → 跳過
→ 整條 Ollama 鏈(GCP-A → GCP-B → 111)全部略過,直接跳 Gemini
修復:
1. AIProviderEnum 新增 OLLAMA_GCP_A / OLLAMA_GCP_B / OLLAMA_LOCAL
2. PROVIDER_LATENCY_BUDGET 補齊三個新 enum
3. ollama.py 新增 OllamaGcpBProvider(OLLAMA_SECONDARY_URL = GCP-B 34.21.145.224)
4. _init_registry() 補登:
- "ollama_gcp_a" alias → OllamaProvider(GCP-A,OLLAMA_URL)
- OllamaGcpBProvider("ollama_gcp_b",OLLAMA_SECONDARY_URL)
- "ollama_local" alias → Ollama188Provider(111,OLLAMA_FALLBACK_URL)
修復後路由順序:GCP-A → GCP-B → Local(111) → Gemini → Claude
2026-05-04 ogt + Claude Sonnet 4.6
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 13:49:32 +08:00 |
|
Your Name
|
14bf86a462
|
fix(awooop): Phase 2 初批 P0 修正 + Phase 1 Task 1.7 integration tests
## P0 安全 / 架構修正
### P0-08 telemetry.py — 移除硬碼 IP assert(ADR-121)
- config.py:新增 OTEL_ALLOWED_ENDPOINTS(預設 192.168.0.188)+ OTEL_FORBIDDEN_ENDPOINTS
- telemetry.py:_validate_endpoint() 改為 config-driven allowlist/forbidlist
- EwoooC 可用 env 覆寫 OTEL_ALLOWED_ENDPOINTS 指向自己的 SigNoz host
### P0-13 mcp_bridge.py — K8s namespace 由 settings 提供
- config.py:新增 AWOOOI_K8S_NAMESPACE(預設 "awoooi-prod")
- mcp_bridge.py:5 處 parameters.get("namespace", "awoooi-prod") → settings.AWOOOI_K8S_NAMESPACE
- EwoooC/Tsenyang 可設自己的 namespace
### P1-24 decision_manager.py — silence key 常數統一
- 新增 from src.services.telegram_gateway import SILENCE_KEY_PREFIX
- f"telegram_silence:{target}" → f"{SILENCE_KEY_PREFIX}{target}"
- 消除跨兩處重複定義(ADR-118 No Island Coding 原則)
## Phase 1 Task 1.7 Integration Tests
- tests/integration/test_awooop_phase1_schema.py:31 個測試案例
- awooop_projects CHECK 約束(4 cases)
- revision 不可變性 trigger(5 cases:draft 可改、published 鎖住、身份欄不可改、非法流轉、DELETE 禁止)
- awooop_published_revisions VIEW draft/published 隔離(2 cases)
- active_pointer_guard(3 cases:不可指向 draft、可指向 active、跨租戶 mismatch)
- RLS fail-closed(3 cases:未設/錯設/正確設 project_id)
- outbox FK + dedup(2 cases)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 13:46:19 +08:00 |
|
Your Name
|
13e51802fe
|
feat(awooop): Phase 0 全 ADR + Phase 1 control plane schema(含 critic 四項修正)
## Phase 0(文件層,全部 Accepted)
- ADR-106/107:AwoooP 平台架構 + 儲存策略
- ADR-111~118:Bootstrap → RLS 七項核心 ADR
- ADR-119~124:SAGA → Singleton Decomposition 六項 ADR
- ADR-UI-01~04:Operator Console 四個 UI ADR
## Phase 1(DB schema + migration)
- awooop_phase1_control_plane_2026-05-04.sql:7 張新表 + trigger + RLS
- Step 1:三角色(platform_admin/migration BYPASSRLS,awooop_app 受 RLS)
- Step 13:GRANT awooop_app 最小權限(7 條)
- Step 14:RLS fail-closed,移除 __platform__ 後門
- awooop_phase1_batch1_rls_2026-05-04.sql:高流量四表三步式 ADD COLUMN
- awooop_phase1_batch1_backfill.py:SKIP LOCKED 分批回填腳本
- awooop_models.py:7 個 SQLAlchemy 2.x models
## Critic 修正(4 Critical + 3 Major)
- C-1:ADD CONSTRAINT IF NOT EXISTS → DO 塊 + pg_constraint 查詢
- C-2:__mapper_args__ 字串 list → primary_key=True on mapped_column
- C-3:__platform__ RLS 後門 → 全移除,改用 BYPASSRLS role
- C-4:awooop_app role 從未建立 → Step 1 + 7 條 GRANT
- M-1:active_pointer_guard SECURITY DEFINER(FORCE RLS 跨租戶保護)
- M-2:pg_partman create_parent 加冪等防護
- M-3:immutability trigger 新增身份欄位保護(project_id/family/contract_id)
## Task 1.2 修補
- agent_loader.py:硬編碼 Mac 路徑 → AGENTS_DIR 環境變數
- Dockerfile:補 COPY .claude/agents/
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 13:37:11 +08:00 |
|
Your Name
|
b4055c5915
|
feat(embedding): ADR-110 升級 bge-m3:latest 1024 維向量
Code Review / ai-code-review (push) Successful in 57s
run-migration / migrate (push) Failing after 44s
GCP-A (34.143.170.20) 無 nomic-embed-text,改用 bge-m3:latest(專用
多語言 embedding 模型),產生 1024 維向量。
變更:
- embedding_service.py: 加入 bge-m3:latest=1024 維到 MODEL_DIMENSIONS,
預設模型改為 bge-m3:latest,更新文件說明
- playbook_embedding_repository.py + interfaces.py: 更新維度說明
- migrations/embedding_bge_m3_1024.sql: pgvector schema 遷移
rag_chunks + playbook_embeddings vector(768) → vector(1024)
- scripts/reembed_bge_m3.py: 遷移後重新嵌入現有資料的 script
遷移步驟:
1. 執行 embedding_bge_m3_1024.sql(清空現有 768 維向量,變更維度)
2. 執行 python scripts/reembed_bge_m3.py 重新嵌入
2026-05-04 ogt + Claude Sonnet 4.6
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 11:18:20 +08:00 |
|
Your Name
|
f7e5fc772e
|
feat(ai-models): ADR-110 GCP-A Primary + 全任務模型升級 (v1.4.0)
Code Review / ai-code-review (push) Failing after 18s
models.json v1.3.0 → v1.4.0:
- endpoint: 192.168.0.111 → GCP-A 34.143.170.20:11434 (ADR-110)
- rca/drift_summary/playbook_draft/rag_generate: qwen2.5:7b → qwen3:14b
- code_review: qwen2.5-coder:7b → qwen2.5-coder:32b (GCP SSD)
- embedding: nomic-embed-text → bge-m3:latest (多語言更佳)
- image_analysis: llava → minicpm-v:latest
- 新增: trust_scoring/alert_triage/intent_classify/governance 四任務
config.py:
- OLLAMA_REQUIRED_MODELS: 新增 qwen3:14b + hermes3:latest
- OLLAMA_TOOL_MODEL: llama3.1:8b → hermes3:latest
- OPENCLAW_DEFAULT_MODEL: qwen2.5:7b-instruct → qwen3:14b
111 背景安裝 minicpm-v + qwen3:14b (fallback 補齊)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 10:59:38 +08:00 |
|
Your Name
|
0068440388
|
fix(failover): Gemini 永遠附在 Ollama fallback 鏈尾(ADR-110 漏加)
Code Review / ai-code-review (push) Successful in 54s
CD Pipeline / tests (push) Successful in 1m55s
CD Pipeline / build-and-deploy (push) Successful in 41m6s
CD Pipeline / post-deploy-checks (push) Successful in 3m36s
GCP-A HEALTHY → fallback=[GCP-B, Local, Gemini]
GCP-B HEALTHY → fallback=[Local, Gemini]
與舊 111 HEALTHY → fallback=[Gemini] 行為一致,保留雲端最後防線。
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-05-03 23:03:34 +08:00 |
|
Your Name
|
2409d861fa
|
fix(test): 更新 auto_recovery 測試斷言至 ADR-110(ollama_111 → ollama_gcp_a)
Code Review / ai-code-review (push) Successful in 55s
CD Pipeline / tests (push) Failing after 1m22s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
- notify_recovery 斷言改為 "ollama_gcp_a"(3 處)
- alert_recovery payload["to"] 改為 "ollama"
- test_full_recovery_flow 改用 mock alerter 避免打真實 Telegram Bot API
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-05-03 22:57:58 +08:00 |
|
Your Name
|
4461c2778d
|
fix(model-probe): 補回 ollama_188 provider 判斷(ADR-110 漏刪)
Code Review / ai-code-review (push) Successful in 51s
CD Pipeline / tests (push) Failing after 1m13s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
188 CPU-only 主機雖移出 routing chain,但 probe 仍可被呼叫。
保留 192.168.0.188 → "ollama_188" 映射,避免 test_success_188_provider 失敗。
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-05-03 22:52:24 +08:00 |
|
Your Name
|
b1ef05fa8c
|
feat(ollama): ADR-110 GCP 三層容災架構(GCP-A → GCP-B → Local → Gemini)
Code Review / ai-code-review (push) Successful in 50s
CD Pipeline / tests (push) Failing after 1m14s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
## 變更摘要
- Primary: http://34.143.170.20:11434 (GCP-A SSD, 9x 載速 + 2x 推理)
- Secondary: http://34.21.145.224:11434 (GCP-B SSD)
- Fallback: http://192.168.0.111:11434 (M1 Pro Local HDD,最後防線)
- 廢止 ADR-105「111 唯一鐵律」,新建 ADR-110
## 核心改動
- config.py: 新增 OLLAMA_SECONDARY_URL;validator 加 GCP IP 白名單(34.143.170.20, 34.21.145.224)
- ollama_failover_manager.py: 三層 Ollama 決策矩陣;並行健康檢查三台;health_111 → health_gcp_a
- ollama_health_monitor.py: host label 萃取改為通用版(支援 GCP 公網 IP)
- failover_alerter.py: 故障/恢復主機動態顯示,不再硬編碼「Ollama 111 (GPU)」
- ollama_auto_recovery.py: notify_recovery 改為 ollama_gcp_a;recovered_host 動態
- k8s/awoooi-prod: configmap + deployment + network-policy 同步更新(egress 加 GCP /32)
- 服務層: 10 個服務檔案硬編碼 192.168.0.111 改為讀 settings.OLLAMA_URL
- 測試: URL 常數更新,新增三層容災場景,GCP IP 白名單驗證測試
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-05-03 22:49:23 +08:00 |
|
Your Name
|
e45b055e0e
|
feat(governance): AI 治理事件處理鏈四軌交付(C/D/B/A)
Code Review / ai-code-review (push) Successful in 48s
run-migration / migrate (push) Failing after 45s
CD Pipeline / tests (push) Successful in 3m46s
Type Sync Check / check-type-sync (push) Successful in 2m8s
CD Pipeline / build-and-deploy (push) Failing after 31m14s
CD Pipeline / post-deploy-checks (push) Has been skipped
【十二人專家團隊全景掃描 + 並行四軌實施】
統帥質疑「有讓 12-agent 一起協作嗎」後,依照團隊規則完成全鏈路交付:
onboarder + critic + db-expert + debugger + frontend-designer 並行掃描,
找到 6 大 Gap,再由 fullstack-engineer × 4、refactor-specialist 協作落地。
【Track C — trust_drift 雙寫整併】
兩條獨立寫 event_type=trust_drift 路徑互不呼叫,下游 consumer 拿到雙份資料
無法判定 source-of-truth。整併保留 governance_agent.check_trust_drift(功能
更全:auto-deprecate + Telegram + PG),TrustDriftDetector 降為純統計 lib,
W-6 watchdog 改呼叫 governance_agent。新增 TestSinglePgWritePerDriftScenario
驗證同一 drift 場景只觸發一次 PG 寫入。
變更:
- apps/api/src/services/trust_drift_detector.py(lib only,不再寫 PG)
- apps/api/tests/test_trust_drift_watchdog.py(W-6 改 mock governance_agent)
【Track D — governance_remediation_dispatch 派遣表】
ai_governance_events 是不可變 Event Sourcing,不能塞執行狀態。新建派遣表
作為投影層:1 event → 0..N dispatches,狀態可變、可重試、可審計。
- PgEnum 5 種 event_type + 7 階段狀態機(pending → dispatched → executing →
succeeded/failed/cancelled/skipped)
- 失敗重試 INSERT 新 row(不改舊 row 的 status,保留審計痕跡)
- Partial unique index ux_grd_one_active_per_event 強制「同事件唯一活躍」
- 4 個複合 index 支援 worker poll、去重查詢、觀測面板
- FK 對應 ai_governance_events / playbooks / incidents / approval_records
全部 SET NULL(avoid cascade lock,但 governance_event 用 RESTRICT)
變更:
- apps/api/src/db/models.py(GovernanceRemediationDispatch ORM class)
- apps/api/migrations/governance_remediation_dispatch_2026-05-03.sql
- apps/api/src/repositories/governance_remediation_dispatch_repo.py
(6 個 async 函式 + 3 個自訂例外:DispatchAlreadyActive /
InvalidStatusTransition / DispatchNotFound)
- apps/api/src/models/governance_dispatch.py(DecisionContextV1 等 4 schema)
- apps/api/tests/test_governance_remediation_dispatch.py(29 tests)
【Track B — /governance 頁面】
後端 PR1 三個 endpoint + 前端 PR2-5 完整三 Tab。
PR1 後端:
- GET /api/v1/ai/governance/events(events_tab,含 event_type/severity/
狀態/時間範圍篩選 + 分頁)
- GET /api/v1/ai/governance/queue(queue_tab,含 graceful fallback:
dispatch 表不存在時回 table_pending=True 不拋 500)
- GET /api/v1/ai/governance/summary(slo_tab 30d 違反時序圖)
- severity 映射規則寫死(critic 建議未來移 settings)
PR2-5 前端:
- /governance 路由 + AppLayout + Compliance Badge 橫幅 + PageTabs
- SLO Tab:3 KPI 卡片(Syne 28px + StatusOrb + 7d sparkline)+
30d 違反 stacked BarChart
- Events Tab:篩選列 + 表格 + inline 展開行(JSON / 修復建議 / 派遣記錄)
- Queue Tab:HITL 待辦卡片 + 信任度進度條 + 批准/拒絕按鈕(本 PR console.log)
- Sidebar 加入「AI 治理」入口(ShieldCheck icon)
- i18n 雙語完整(governance namespace + nav.governance)
- 7 個新元件:slo-kpi-card / slo-violation-chart / events-table /
events-filter-bar / event-detail-drawer / queue-item-card / queue-history-tabs
變更:
- apps/api/src/api/v1/ai_governance.py(router)
- apps/api/src/services/governance_query_service.py
- apps/api/src/models/governance.py(Pydantic V2 schemas)
- apps/api/tests/test_ai_governance_endpoints.py(21 tests)
- apps/web/src/app/[locale]/governance/(page + 3 tabs)
- apps/web/src/components/governance/(7 元件)
- apps/web/messages/{zh-TW,en}.json(governance namespace)
- apps/web/src/components/layout/sidebar.tsx(+1 行)
- apps/api/src/main.py(router include)
【Track A — GovernanceDispatcher 決策融合】
把治理事件接到 remediation 執行器,走北極星方向決策融合(LLM × Playbook trust
× MCP),符合「禁寫死規則」鐵律。
- 設計鐵律:DecisionFusionAdapter 是新增 wrapper,**不修改任何 Tier 3 檔**
(decision_manager / learning_service / trust_engine),只 consume 既有 API
- 三維融合公式:confidence = 0.4×llm + 0.3×playbook_trust + 0.3×mcp_consistency
(權重加 TODO 標明未來由 AI 自學調整)
- 三分支決策路徑:
confidence ≥ 0.85 → auto_dispatch(status=dispatched)
0.65 ≤ confidence < 0.85 → pending_approval(HITL)
confidence < 0.65 → skip + log
- decision_context JSONB 完整記錄三維輸入快照(給未來 fine-tune 用)
- poll 30s 掃 unresolved 事件,仿 governance loop 模式
- 重複事件擋去重(呼叫 get_active_for_event)
變更:
- apps/api/src/services/governance_dispatcher.py
- apps/api/src/services/decision_fusion_adapter.py
- apps/api/tests/test_governance_dispatcher.py(14 tests)
- apps/api/src/main.py(lifespan task 接 run_governance_dispatcher_loop)
【驗證】
1836 個 unit test 全過(29 skipped 為既有 PG integration env 問題)
【調度教訓 — 已記入 memory】
- vuln-verifier 應在 fullstack-engineer **之前**跑(避免並行讀到已修代碼誤判)
- critic 雙輪審查不可省(第二輪抓到 NaN sentinel + Prom rule 連鎖)
- 北極星「禁寫死規則」搭配 decision-fusion 確實實施
【未動 Tier 3 — 已驗證】
git diff 確認本 commit 完全沒改 decision_manager.py / learning_service.py /
trust_engine.py,只新增 wrapper service consume 既有 API。
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-05-03 12:42:40 +08:00 |
|
Your Name
|
577250a678
|
fix(governance): 修反消音化 W-3/W-4 守衛 + Prometheus 補資料缺失告警
Code Review / ai-code-review (push) Successful in 52s
CD Pipeline / tests (push) Failing after 2m21s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 1m6s
【統帥怒訓 — 違反 feedback_full_chain_first_then_fix.md 鐵律】
前次 commit f1362fcc 用 skip 條件把告警吞掉,是消音化解法:
- W-3:total_exec<10 永遠 skip → Redis 永遠空也不會告警
- W-4:playbooks total==0 永遠 skip → 表被清空也不會告警
- Prometheus NaN sentinel + 既有 < 0.1 規則疊加後沒任何路徑會告警
統帥怒訓「又把告警給消失了」「已經這樣做幾次了」。本 commit 救回告警可見性。
【修法 — 啟動 30 分鐘寬限 + 過期改打資料管線斷新告警】
- ai_slo_watchdog_job.py 新增模組層 _PROCESS_START 與 _grace_active() 守衛:
- W-3a:metric 有資料 + rate<0.30 → 既有「飛輪成功率過低」
- W-3b:rate=None 且 uptime>30min → 新告警「飛輪資料管線無流量」
- W-4a:playbooks total>0 + approved=0 → 既有「自動修復鏈路斷裂」
- W-4b:playbooks total=0 且 uptime>30min → 新告警「Playbook 表初始化失敗」
- 3 份 Prometheus rule(k8s/monitoring/flywheel-alerts.yaml、
ops/monitoring/alerts.yml、ops/monitoring/alerts-unified.yml)新增
FlywheelExecutionRateMissing:absent() 或 NaN 持續 30 分鐘 → 告警,
與 watchdog W-3b 雙保險
【已加入 memory】
feedback_silencing_alerts_recurring_violation.md 鎖入紅線鐵律:
「fresh deploy / init guard 用 skip 吞告警 = 結構性失職,必須分流寬限期 +
過期改打資料管線斷新告警」
【驗證】
106 個治理相關 unit test 全過:
test_trust_drift_watchdog / test_governance_agent / test_failover_alerter /
test_check_trust_drift_commit_outside_context_poc /
test_governance_remediation_dispatch / test_ai_governance_endpoints /
test_governance_dispatcher
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-05-03 12:39:46 +08:00 |
|
Your Name
|
8fb0c5df33
|
feat(heartbeat): noise reduction — silent 6h + warnings hash dedup
Code Review / ai-code-review (push) Successful in 47s
CD Pipeline / tests (push) Successful in 2m11s
CD Pipeline / build-and-deploy (push) Failing after 31m12s
CD Pipeline / post-deploy-checks (push) Has been skipped
P0 #4 (徹底長期修系列) — 統帥鐵證:「INFO | AWOOOI 系統報告」每 30 分鐘
推一次,一天 48 條同樣內容,即使我修了 P0 #3 假警報,每天的「全系統正常」
重複推送本身就是噪音,讓統帥誤以為告警還在重複。
修法(不違反「監控工具必須被監控」鐵律 — 健康狀態仍每 6h 推 1 次「我活著」):
| 狀況 | 推送行為 |
|------|---------|
| 健康(無 warnings)| 6h 內最多 1 次「我活著」訊號 |
| 有 warnings 跟上次同 hash | 跳過 |
| 有 warnings 跟上次不同 | 立即推送(新狀況不漏)|
| 健康 ↔ 有事 切換 | 自動清掉相反 marker |
Redis keys:
- `heartbeat:silent_last_sent` — 健康狀態 silent marker, TTL=6h
- `heartbeat:warnings_hash` — 上次 warnings 的 md5[:12], TTL=24h
效果:統帥每天從 48 條 heartbeat → ~4 條(健康狀態 4×6h),有事立即推。
Tests: 6 passed (test_heartbeat_dedup_p0_4.py)
- healthy_first_send_goes_through
- healthy_second_send_within_6h_skipped
- warnings_unchanged_skipped
- warnings_changed_pushes
- warnings_to_healthy_clears_warnings_hash
- healthy_to_warnings_clears_silent_marker
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-05-03 01:48:57 +08:00 |
|