Your Name
|
aa4ccec429
|
fix(watchdog): ADR-092 B4: three-layer fix to eliminate duplicate META SYSTEM alerts + Ollama routing hardening
Code Review / ai-code-review (push) Successful in 7m16s
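Root cause (full-scope debugger investigation):
1. Prod is still running old code (the fixes after ec013f66 were never deployed → alert strings still use the old format)
2. With replicas=2 the grace period is not shared between Pods → violation_codes diverge → different SHA256 → dedup fails
3. A new Pod runs _check_once() immediately on startup → an extra wave of alerts during rollout
4. W6 violation_codes include the dynamic low_count → small changes in the count bypass dedup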
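
Root causes 2 and 4 both come down to a dedup key derived from unstable input. A minimal sketch of the effect, assuming the key is a SHA256 over the sorted violation codes; the helper name `dedup_key` and the `low_count=...` code format are hypothetical, and only "W6:trust_drift" comes from this commit:

```python
import hashlib


def dedup_key(violation_codes: list[str]) -> str:
    """Hash the sorted violation codes into a dedup key (hypothetical helper)."""
    joined = "|".join(sorted(violation_codes))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# A code that embeds a changing count yields a new key on every small change.
dynamic_a = dedup_key(["W6:trust_drift:low_count=3"])
dynamic_b = dedup_key(["W6:trust_drift:low_count=4"])

# A stable code yields the same key no matter how the count moves.
stable_a = dedup_key(["W6:trust_drift"])
stable_b = dedup_key(["W6:trust_drift"])

print(dynamic_a == dynamic_b)  # False: the second alert slips past dedup
print(stable_a == stable_b)    # True: repeated alerts collapse into one
```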
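Fixes (A2/A3/W6/C1/C2):
- A2: add a 90s leading sleep to run_ai_slo_watchdog_loop so it does not fire immediately on rollout
- A3: _grace_active() now uses a Redis cluster-shared key (watchdog:cluster_grace, ex=1800s, nx=True),
eliminating inconsistent grace periods between Pods; on Redis failure it falls back to a process-local monotonic clock
- W6: remove the dynamic low_count from violation_codes and use the stable "W6:trust_drift" instead
- C1: ollama_auto_recovery.py recovered_host now uses a dynamic label (GCP-A/B/Local determined from the URL port)
- C2: ConfigMap OLLAMA_FALLBACK_URL now goes through the 110:11437 nginx proxy, unifying the three-tier failover architecture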
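
The A3 change can be sketched roughly as follows. The real `_grace_active()` is not shown in this commit, so this is only one plausible shape: each pod tries to claim the shared key once at startup with `SET ... NX EX`, and the grace period counts as active while the key exists. `claim_grace`, `grace_active`, `InMemoryRedis` and `BrokenRedis` are hypothetical names; the stand-in classes only mimic the `set(name, value, ex=..., nx=...)` and `exists(*names)` calls of a redis-py client and ignore expiry.

```python
import time

GRACE_KEY = "watchdog:cluster_grace"  # key name taken from the commit message
GRACE_SECONDS = 1800                  # ex=1800s, i.e. 30 minutes
_PROCESS_START = time.monotonic()     # process-local fallback reference


def claim_grace(client) -> None:
    """Call once at pod startup; only the first pod actually creates the key."""
    try:
        client.set(GRACE_KEY, "1", ex=GRACE_SECONDS, nx=True)
    except Exception:
        pass  # Redis is down: grace_active() will use the local clock instead


def grace_active(client) -> bool:
    """True while the shared key exists; process-local uptime if Redis fails."""
    try:
        return bool(client.exists(GRACE_KEY))
    except Exception:
        return time.monotonic() - _PROCESS_START < GRACE_SECONDS


class InMemoryRedis:
    """Tiny stand-in for a Redis client, for illustration only (ignores expiry)."""

    def __init__(self):
        self._data = {}

    def set(self, name, value, ex=None, nx=False):
        if nx and name in self._data:
            return None  # NX prevented the write
        self._data[name] = value
        return True

    def exists(self, *names):
        return sum(1 for name in names if name in self._data)


class BrokenRedis:
    """Stand-in that simulates a Redis outage."""

    def set(self, *args, **kwargs):
        raise ConnectionError("redis unavailable")

    def exists(self, *names):
        raise ConnectionError("redis unavailable")


shared = InMemoryRedis()
claim_grace(shared)  # pod 1 starts and creates the key
claim_grace(shared)  # pod 2 starts; NX makes this a no-op
shared_view = grace_active(shared)           # both pods read the same answer
fallback_view = grace_active(BrokenRedis())  # falls back to process-local uptime
print(shared_view, fallback_view)
```

Because every pod reads the same key, the pods agree on whether the grace period applies, which keeps their violation_codes (and therefore the dedup hash) aligned.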
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-05 10:31:53 +08:00 |
|
Your Name
|
40badc42cf
|
fix(ollama): restore GCP-first routing (ADR-110 official routing)
Code Review / ai-code-review (push) Successful in 54s
E2E Health Check / e2e-health (push) Successful in 2m59s
With the nginx proxy now in place, restore the original design:
GCP-A (110:11435 → 34.143.170.20:11434) → primary
GCP-B (110:11436 → 34.21.145.224:11434) → secondary
111 (192.168.0.111:11434) → last-resort fallback
OLLAMA_URL=http://192.168.0.110:11435
OLLAMA_SECONDARY_URL=http://192.168.0.110:11436
OLLAMA_FALLBACK_URL=http://192.168.0.111:11434
Hot-updated with kubectl set env; the image tag is unchanged.
Both GCP Ollama hosts return 200 OK (10 models each).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 23:37:42 +08:00 |
|
Your Name
|
8629ac709b
|
feat(awooop): Phase 1-8 full implementation: AwoooP Agent Platform six-plane architecture
run-migration / migrate (push) Failing after 59s
Code Review / ai-code-review (push) Successful in 1m8s
Type Sync Check / check-type-sync (push) Successful in 2m27s
## Phase 1-3: Control Plane + Contract System
- awooop_phase1_control_plane_2026-05-04.sql: 12 core tables + RLS
- awooop_phase1_batch1_rls_2026-05-04.sql: FORCE RLS + GRANT on all tables
- packages/awooop-contracts/: JSON Schema for the six contracts + golden fixtures
- src/models/awooop_contracts.py: Pydantic v2 contract models (extra=forbid)
- src/repositories/contract_repository.py: contract lifecycle (draft→published→active)
- src/services/contract_service.py: HMAC publish sig + Redis multi-sig activate
- src/services/schema_validator.py: LLM output validator (retry×3, E-SCHEMA-001)
## Phase 2: Tenant Isolation
- awooop_phase2_budget_ledger_2026-05-04.sql: budget_ledger + RLS
- src/services/budget_service.py: Token Budget Hard Kill, three lines of defense
- src/core/context.py: PROJECT_ID ContextVar (inherited automatically by the 31 background loops)
- src/db/base.py + models.py: project_id column + RLS set_config injection
- src/hermes/nl_gateway.py: project_id Redis key prefix (Phase A dual-write)
- src/services/anomaly_counter.py: per-project rework (Phase A fallback)
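
The automatic inheritance mentioned for PROJECT_ID follows from how asyncio handles context variables: a task copies the current context when it is created. A small sketch of that behavior, assuming the background loops are asyncio tasks; the default value and the project names used here are illustrative, not taken from src/core/context.py:

```python
import asyncio
from contextvars import ContextVar

# Hypothetical default; the real default is not given in the commit.
PROJECT_ID: ContextVar[str] = ContextVar("PROJECT_ID", default="default")


async def background_loop() -> str:
    # Reads the value captured when the task was created.
    await asyncio.sleep(0)
    return PROJECT_ID.get()


async def main() -> list[str]:
    results = []
    for project in ("awoooi", "ewoooc"):
        PROJECT_ID.set(project)
        # create_task copies the current context, so the task keeps this
        # project_id even after the caller sets a different one.
        task = asyncio.create_task(background_loop())
        PROJECT_ID.set("changed-after-spawn")
        results.append(await task)
    return results


results = asyncio.run(main())
print(results)  # ['awoooi', 'ewoooc']
```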
## Phase 4: Platform Shell in Shadow Mode
- awooop_phase4_run_state_2026-05-04.sql: run_state + step_journal + idempotency
- src/services/run_state_machine.py: 8-state FSM + SKIP LOCKED + stale reaper
- src/services/platform_runtime.py: UUID v7 + W3C trace_id + shadow_execute
- src/services/audit_sink.py: PII/secret redaction 9 patterns
- src/api/v1/platform/runs.py: POST/GET /v1/platform/runs (Router→Service architecture)
- src/workers/platform_worker.py: SKIP LOCKED worker + heartbeat + reaper loop
- src/main.py: platform router + lifespan worker start/stop
## Phase 5: MCP Gateway, five gates
- awooop_phase5_mcp_gateway_2026-05-04.sql: 4 tables + RLS
- src/plugins/mcp/gateway.py: McpGateway (Gate 1~5, E-MCP-GATE-001~009)
- src/plugins/mcp/redaction_middleware.py: two-layer redaction + 16K truncation
- src/plugins/mcp/registry.py: __provider name mangling (ADR-116)
- src/plugins/mcp/credential_resolver.py: k8s secret ref resolution
- tests/test_mcp_credential_isolation.py: 10 regression tests (guarding against a repeat of the secret leak)
## Phase 6-8: EwoooC + Channel Hub + Approval Token
- awooop_phase6_ewoooc_onboarding_2026-05-04.sql: ewoooc tenant + 4 read-only MCP tools
- awooop_phase7_channel_hub_2026-05-04.sql: conversation_event + outbound_message
- src/services/provider_proxy.py: ProviderProxy + PlatformEnvelope (ADR-115)
- src/services/channel_hub.py: Telegram inbound mirror + Progressive Feedback (30s)
- src/services/awooop_approval_token.py: HS256 + jti NX replay protection + suggest mode
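
The "HS256 + jti NX replay protection" idea can be illustrated with the standard library alone. This is a sketch under assumptions, not the contents of awooop_approval_token.py: the claim names, the `issue_token`/`verify_token` helpers and the in-memory `seen_jti` set are hypothetical, and the set stands in for a Redis `SET <jti> 1 NX EX <ttl>` that would reject a second use. The sketch also skips checking the header's `alg` field, which a real verifier should do.

```python
import base64
import hashlib
import hmac
import json
import time
import uuid


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret: bytes, signing_input: str) -> str:
    digest = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return _b64(digest)


def issue_token(secret: bytes, run_id: str, ttl: int = 300) -> str:
    """Issue a compact HS256 token carrying a one-time jti."""
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = {"run_id": run_id, "jti": uuid.uuid4().hex, "exp": int(time.time()) + ttl}
    payload = _b64(json.dumps(claims).encode())
    return f"{header}.{payload}.{_sign(secret, f'{header}.{payload}')}"


def verify_token(secret: bytes, token: str, seen_jti: set) -> dict:
    """Check signature and expiry, then reject any jti that was already used."""
    header, payload, signature = token.split(".")
    expected = _sign(secret, f"{header}.{payload}")
    if not hmac.compare_digest(signature, expected):
        raise ValueError("bad signature")
    claims = json.loads(_unb64(payload))
    if claims["exp"] < time.time():
        raise ValueError("expired")
    if claims["jti"] in seen_jti:  # with Redis: SET ... NX returned None
        raise ValueError("replayed")
    seen_jti.add(claims["jti"])
    return claims


secret = b"demo-secret"
seen: set = set()
token = issue_token(secret, run_id="run-123")
first = verify_token(secret, token, seen)
try:
    verify_token(secret, token, seen)
    replay_rejected = False
except ValueError:
    replay_rejected = True
print(first["run_id"], replay_rejected)
```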
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 19:31:53 +08:00 |
|
Your Name
|
0a90dab1e9
|
fix(ollama): ADR-110 correction: promote 111 to primary; failover log now uses dynamic URL labels
Code Review / ai-code-review (push) Successful in 56s
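Root cause: K8s pods → GCP-A/B:11434 = connection refused (no route to the external network),
but the ConfigMap set GCP-A as OLLAMA_URL (primary), so the failover chain reached 111 only at the very end.
ConfigMap (04-configmap.yaml):
- OLLAMA_URL: GCP-A → 192.168.0.111 (a primary reachable from the K8s internal network)
- OLLAMA_SECONDARY_URL: GCP-B → 34.143.170.20 (GCP-A, kept so it can be restored once the nginx proxy is up)
- OLLAMA_FALLBACK_URL: 111 → 34.21.145.224 (GCP-B, kept so it can be restored once the nginx proxy is up)
- Long-term goal: set up an nginx proxy on 110 that forwards to GCP, and point the ConfigMap at 110:11435/11436
health.py (check_ollama):
- Changed to a three-tier probe (primary → secondary → tertiary)
- primary up → "up"; fallback up → "degraded"; all down → "down"
- No longer checks only the OLLAMA_URL host; reflects the actual routing availability
ollama_failover_manager.py (_decide_route / select_provider):
- Variables renamed to url_primary/secondary/tertiary (the old gcp_a/gcp_b/local had drifted from the actual URLs)
- routing_reason now uses a dynamic IP label instead of hardcoding "GCP-A"/"GCP-B"/"Local"
- _write_failover_audit failed_host likewise now uses the actual URL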
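
The three-tier status rule and the URL-derived labels described above can be sketched as below. The function names `host_label`, `overall_status` and `select_route` are hypothetical, and the health flags are passed in directly rather than probed, so this only illustrates the decision logic, not the code in health.py or ollama_failover_manager.py:

```python
from urllib.parse import urlparse


def host_label(url: str) -> str:
    """Build a log label from the URL itself instead of a fixed name."""
    return urlparse(url).netloc or url


def overall_status(primary_up: bool, secondary_up: bool, tertiary_up: bool) -> str:
    """primary up -> "up"; any fallback up -> "degraded"; nothing up -> "down"."""
    if primary_up:
        return "up"
    if secondary_up or tertiary_up:
        return "degraded"
    return "down"


def select_route(urls: list[str], healthy: list[bool]) -> tuple[str, str]:
    """Pick the first healthy URL in priority order and explain the choice."""
    for index, (url, ok) in enumerate(zip(urls, healthy)):
        if ok:
            skipped = ", ".join(host_label(u) for u in urls[:index]) or "none"
            return url, f"selected {host_label(url)} (skipped: {skipped})"
    raise RuntimeError("no Ollama endpoint is healthy")


urls = [
    "http://192.168.0.111:11434",   # primary in this commit
    "http://34.143.170.20:11434",   # secondary
    "http://34.21.145.224:11434",   # tertiary
]
route, reason = select_route(urls, [False, True, True])
status = overall_status(False, True, True)
print(status)
print(reason)
```

Deriving the label from the URL means the log stays correct when the ConfigMap reorders the hosts, which is the drift the renamed variables were meant to remove.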
2026-05-04 ogt
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-05-04 19:17:07 +08:00 |
|
AWOOOI CD
|
035fe20e4d
|
chore(cd): deploy 0068440 [skip ci]
|
2026-05-03 23:45:12 +08:00 |
|
Your Name
|
b1ef05fa8c
|
feat(ollama): ADR-110 GCP three-tier failover architecture (GCP-A → GCP-B → Local → Gemini)
Code Review / ai-code-review (push) Successful in 50s
CD Pipeline / tests (push) Failing after 1m14s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
## Change summary
- Primary: http://34.143.170.20:11434 (GCP-A SSD, 9x load speed + 2x inference)
- Secondary: http://34.21.145.224:11434 (GCP-B SSD)
- Fallback: http://192.168.0.111:11434 (M1 Pro Local HDD, last line of defense)
- Retire the ADR-105 "111 only" hard rule; create ADR-110
## Core changes
- config.py: add OLLAMA_SECONDARY_URL; the validator gains a GCP IP allowlist (34.143.170.20, 34.21.145.224)
- ollama_failover_manager.py: three-tier Ollama decision matrix; parallel health checks of all three hosts; health_111 → health_gcp_a
- ollama_health_monitor.py: host label extraction generalized (supports GCP public IPs)
- failover_alerter.py: failed/recovered host shown dynamically, no longer hardcoded as "Ollama 111 (GPU)"
- ollama_auto_recovery.py: notify_recovery changed to ollama_gcp_a; recovered_host is dynamic
- k8s/awoooi-prod: configmap + deployment + network-policy updated together (egress adds the GCP /32 addresses)
- Service layer: hardcoded 192.168.0.111 in 10 service files replaced with reads of settings.OLLAMA_URL
- Tests: URL constants updated; added three-tier failover scenarios and GCP IP allowlist validation tests
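
A validator with a GCP IP allowlist might look like the sketch below. The two allowlisted addresses come from this commit; everything else is an assumption, including the function name `validate_ollama_url`, the rule that private addresses such as 192.168.0.111 are always accepted, and the requirement that the host be an IP literal. The actual rules in config.py are not shown here.

```python
import ipaddress
from urllib.parse import urlparse

GCP_ALLOWLIST = {"34.143.170.20", "34.21.145.224"}  # from the commit message


def validate_ollama_url(url: str) -> str:
    """Accept private hosts and allowlisted GCP IPs; reject everything else."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(f"invalid Ollama URL: {url!r}")
    try:
        ip = ipaddress.ip_address(parsed.hostname)
    except ValueError:
        raise ValueError(f"host must be an IP literal: {parsed.hostname!r}") from None
    if ip.is_private or str(ip) in GCP_ALLOWLIST:
        return url
    raise ValueError(f"host {ip} is neither private nor in the GCP allowlist")


accepted = [
    validate_ollama_url("http://34.143.170.20:11434"),
    validate_ollama_url("http://192.168.0.111:11434"),
]
try:
    validate_ollama_url("http://8.8.8.8:11434")
    public_rejected = False
except ValueError:
    public_rejected = True
print(accepted, public_rejected)
```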
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-05-03 22:49:23 +08:00 |
|
Your Name
|
577250a678
|
fix(governance): un-silence the W-3/W-4 guards + add a Prometheus missing-data alert
Code Review / ai-code-review (push) Successful in 52s
CD Pipeline / tests (push) Failing after 2m21s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 1m6s
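[Commander's reprimand: violation of the feedback_full_chain_first_then_fix.md hard rule]
The previous commit f1362fcc swallowed alerts with skip conditions, which is a fix by silencing:
- W-3: total_exec<10 always skips → no alert even if Redis stays empty forever
- W-4: playbooks total==0 always skips → no alert even if the table is wiped
- Prometheus NaN sentinel combined with the existing < 0.1 rule leaves no path that can alert
The commander's reprimand: "the alerts have been made to disappear again", "how many times has this happened already". This commit restores alert visibility.
[Fix: 30-minute startup grace period + a new data-pipeline-broken alert once it expires]
- ai_slo_watchdog_job.py adds a module-level _PROCESS_START and a _grace_active() guard:
- W-3a: metric has data + rate<0.30 → existing "flywheel success rate too low" alert
- W-3b: rate=None and uptime>30min → new alert "flywheel data pipeline has no traffic"
- W-4a: playbooks total>0 + approved=0 → existing "auto-remediation chain broken" alert
- W-4b: playbooks total=0 and uptime>30min → new alert "Playbook table initialization failed"
- 3 Prometheus rule files (k8s/monitoring/flywheel-alerts.yaml,
ops/monitoring/alerts.yml, ops/monitoring/alerts-unified.yml) add
FlywheelExecutionRateMissing: absent() or NaN for 30 minutes → alert,
a double safeguard alongside watchdog W-3b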
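
The split between "bad data" and "no data" alerts can be sketched as a pair of pure functions. The thresholds (rate<0.30, 30 minutes) and the W-3a/W-3b/W-4a/W-4b codes come from this commit; the function names and the way uptime is passed in are assumptions made for illustration:

```python
from typing import Optional

GRACE_SECONDS = 30 * 60   # 30-minute startup grace period
RATE_THRESHOLD = 0.30     # success-rate floor for W-3a


def classify_w3(rate: Optional[float], uptime_seconds: float) -> Optional[str]:
    """Return the W-3 alert code to raise, or None when nothing should fire."""
    if rate is not None:
        return "W-3a" if rate < RATE_THRESHOLD else None
    if uptime_seconds > GRACE_SECONDS:
        return "W-3b"  # no data after the grace period: the pipeline is broken
    return None        # no data yet, but still within the grace period


def classify_w4(total: int, approved: int, uptime_seconds: float) -> Optional[str]:
    """Return the W-4 alert code to raise, or None when nothing should fire."""
    if total > 0:
        return "W-4a" if approved == 0 else None
    if uptime_seconds > GRACE_SECONDS:
        return "W-4b"  # empty table after the grace period: initialization failed
    return None


print(classify_w3(None, uptime_seconds=60))      # None: fresh deploy, stay quiet
print(classify_w3(None, uptime_seconds=3600))    # W-3b: silence has expired
print(classify_w4(0, 0, uptime_seconds=3600))    # W-4b
```

The key property is that the missing-data branch never returns None forever: once the grace period has passed, absence of data becomes an alert of its own.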
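[Added to memory]
feedback_silencing_alerts_recurring_violation.md locks in the red-line hard rule:
"using skip in a fresh deploy / init guard to swallow alerts = structural dereliction; the grace period must be
split out, and once it expires a new data-pipeline-broken alert must be raised instead"
[Verification]
All 106 governance-related unit tests pass: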
test_trust_drift_watchdog / test_governance_agent / test_failover_alerter /
test_check_trust_drift_commit_outside_context_poc /
test_governance_remediation_dispatch / test_ai_governance_endpoints /
test_governance_dispatcher
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-05-03 12:39:46 +08:00 |
|
Your Name
|
dedb12085b
|
chore(governance,watchdog): enrich alerts and enable prometheus multiproc
CD Pipeline / tests (push) Failing after 1m22s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 43s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 57s
|
2026-05-02 23:44:12 +08:00 |
|
AWOOOI CD
|
68e182381f
|
chore(cd): deploy da772a1 [skip ci]
|
2026-05-02 17:58:22 +08:00 |
|
AWOOOI CD
|
697e13b23a
|
chore(cd): deploy 297afb6 [skip ci]
|
2026-05-02 17:28:56 +08:00 |
|
AWOOOI CD
|
a6409c39e2
|
chore(cd): deploy b3a0f0d [skip ci]
|
2026-05-02 16:49:00 +08:00 |
|
AWOOOI CD
|
329849a559
|
chore(cd): deploy 7795f02 [skip ci]
|
2026-05-01 20:53:02 +08:00 |
|
AWOOOI CD
|
b72eac0712
|
chore(cd): deploy 433f7b0 [skip ci]
|
2026-05-01 17:08:42 +08:00 |
|
Your Name
|
433f7b068e
|
fix(aiops): close ssh and telegram remediation gaps
CD Pipeline / tests (push) Successful in 2m7s
Code Review / ai-code-review (push) Successful in 42s
CD Pipeline / build-and-deploy (push) Successful in 13m14s
CD Pipeline / post-deploy-checks (push) Successful in 4m29s
|
2026-05-01 16:53:02 +08:00 |
|
AWOOOI CD
|
f72419dd17
|
chore(cd): deploy b0da6da [skip ci]
|
2026-05-01 15:27:48 +08:00 |
|
AWOOOI CD
|
f53d7e5584
|
chore(cd): deploy f8e4497 [skip ci]
|
2026-05-01 14:41:18 +08:00 |
|
Your Name
|
f8e44971c1
|
feat(aiops): enable read-only agent loop canary
CD Pipeline / tests (push) Successful in 1m43s
Code Review / ai-code-review (push) Successful in 31s
CD Pipeline / build-and-deploy (push) Successful in 10m22s
CD Pipeline / post-deploy-checks (push) Successful in 4m3s
|
2026-05-01 14:20:16 +08:00 |
|
AWOOOI CD
|
33a7148916
|
chore(cd): deploy b6cf616 [skip ci]
|
2026-05-01 14:02:59 +08:00 |
|
AWOOOI CD
|
1fe75e9f99
|
chore(cd): deploy 6ec3f11 [skip ci]
|
2026-05-01 13:45:55 +08:00 |
|
Your Name
|
3691402561
|
chore(cd): deploy 11673d80 api [skip ci]
|
2026-05-01 12:52:23 +08:00 |
|
Your Name
|
11673d80ea
|
fix(aiops): route backup decisions through ssh
CD Pipeline / tests (push) Successful in 1m35s
Code Review / ai-code-review (push) Successful in 34s
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
|
2026-05-01 12:50:01 +08:00 |
|
Your Name
|
ce4cf4c94b
|
chore(cd): deploy 2c12bce api [skip ci]
|
2026-05-01 10:58:55 +08:00 |
|
Your Name
|
78bcc090ad
|
chore(cd): deploy 97be5de api [skip ci]
|
2026-05-01 10:52:31 +08:00 |
|
AWOOOI CD
|
046d598e88
|
chore(cd): deploy e4aef6a [skip ci]
|
2026-05-01 10:43:56 +08:00 |
|
Your Name
|
fa6a78af2a
|
chore(cd): deploy e4aef6a api [skip ci]
|
2026-05-01 10:42:07 +08:00 |
|
AWOOOI CD
|
7472eb2fcd
|
chore(cd): deploy ca22ec2 [skip ci]
|
2026-05-01 10:24:48 +08:00 |
|
AWOOOI CD
|
3e0ab0f8c6
|
chore(cd): deploy f154ac0 [skip ci]
|
2026-05-01 00:14:36 +08:00 |
|
AWOOOI CD
|
f946e7b184
|
chore(cd): deploy 6e04fe9 [skip ci]
|
2026-04-30 23:18:20 +08:00 |
|
AWOOOI CD
|
64b09273f7
|
chore(cd): deploy e29aab5 [skip ci]
|
2026-04-30 15:58:18 +08:00 |
|
AWOOOI CD
|
a93fbe5d66
|
chore(cd): deploy 36967d0 [skip ci]
|
2026-04-30 15:44:46 +08:00 |
|
AWOOOI CD
|
38ffcf4395
|
chore(cd): deploy 712d3e5 [skip ci]
|
2026-04-30 15:20:33 +08:00 |
|
AWOOOI CD
|
ae52d51210
|
chore(cd): deploy 72945bf [skip ci]
|
2026-04-30 15:05:57 +08:00 |
|
AWOOOI CD
|
6e76c5dfd5
|
chore(cd): deploy c9393c3 [skip ci]
|
2026-04-30 14:41:46 +08:00 |
|
AWOOOI CD
|
19788302df
|
chore(cd): deploy 80defbe [skip ci]
|
2026-04-30 14:26:44 +08:00 |
|
Your Name
|
80defbed7c
|
fix(aiops): fallback and escalate automation blockers
CD Pipeline / tests (push) Successful in 2m41s
Code Review / ai-code-review (push) Successful in 24s
CD Pipeline / build-and-deploy (push) Successful in 7m51s
CD Pipeline / post-deploy-checks (push) Failing after 2m15s
|
2026-04-30 14:13:57 +08:00 |
|
AWOOOI CD
|
9ee3cc6242
|
chore(cd): deploy 4723499 [skip ci]
|
2026-04-30 11:11:04 +08:00 |
|
AWOOOI CD
|
a0be4ebb03
|
chore(cd): deploy 0f7e9d3 [skip ci]
|
2026-04-30 10:54:29 +08:00 |
|
Your Name
|
9f15f3cfe4
|
chore(cd): deploy 639bb64 [skip ci]
|
2026-04-30 09:41:20 +08:00 |
|
AWOOOI CD
|
d197e2785d
|
chore(cd): deploy 4a57c2d [skip ci]
|
2026-04-29 15:48:24 +00:00 |
|
AWOOOI CD
|
dae0aa2312
|
chore(cd): deploy d845d53 [skip ci]
|
2026-04-29 15:06:57 +00:00 |
|
AWOOOI CD
|
b857be0a64
|
chore(cd): deploy fe2b8f4 [skip ci]
|
2026-04-29 14:47:51 +00:00 |
|
AWOOOI CD
|
525a243550
|
chore(cd): deploy dccdcdb [skip ci]
|
2026-04-29 13:59:53 +00:00 |
|
AWOOOI CD
|
4c91d89dd2
|
chore(cd): deploy 4115ddd [skip ci]
|
2026-04-29 13:04:37 +00:00 |
|
Your Name
|
dc18b0ebd6
|
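fix(prometheus_url): follow-up fix for residual drift: kured gatekeeper + monitoring API
A debugger root-cause trace across the whole codebase found 5 residual PROMETHEUS_URL drift sites
(root cause: docs/reference/SERVICE-ENDPOINTS.md originally listed Prometheus on 188,
which is the source of the drift across the whole codebase).
This commit fixes the 2 most urgent ones:
## 🔴🔴 kured.yaml:132 (risk of gatekeeper failure)
- 188 → 110
- kured queries Prometheus alerts before a reboot; connecting to the wrong host = skipping the safeguard and rebooting the host directly
- Aligned with ConfigMap + config.py PROMETHEUS_URL
## 🟡 monitoring.py:67 (single source of truth)
- Hardcoded 110:9090 replaced with settings.PROMETHEUS_URL
- The host happened to be correct, but it bypassed the ConfigMap injection mechanism
- Avoids another drift if Prometheus is migrated again
## Not fixed for now
- k3s_monitor_service.py:38 uses 121:30090, which is a K3s NodePort internal endpoint;
it is conceptually distinct from the external PROMETHEUS_URL and needs a new PROMETHEUS_INTERNAL_URL setting
- Other docstring + documentation drift (SERVICE-ENDPOINTS.md etc.) is left for later
## Verification
All 1552 unit tests green (no regressions)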
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
2026-04-29 10:44:39 +08:00 |
|
AWOOOI CD
|
20009cddcf
|
chore(cd): deploy 143c15f [skip ci]
|
2026-04-28 07:36:19 +00:00 |
|
Your Name
|
143c15f052
|
feat(wave2+km): enable LLM dynamic buttons + automatic KM writes + mark AI Router dead code
CD Pipeline / build-and-deploy (push) Successful in 9m52s
- ConfigMap: USE_LLM_DYNAMIC_BUTTONS=true (B2/B3/B4 handlers are all ready)
- decision_manager: add a fire-and-forget KM write on the auto_execute failure path
- ai_router: _build_fallback_chain marked DEPRECATED 2026-04-28
- tests: test_golden_regression.py adds 172 lines of golden regression tests
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-28 15:27:33 +08:00 |
|
AWOOOI CD
|
2e6ae7fe84
|
chore(cd): deploy 7f200af [skip ci]
|
2026-04-28 07:14:34 +00:00 |
|
AWOOOI CD
|
b8a330f9e4
|
chore(cd): deploy c1a1be6 [skip ci]
|
2026-04-27 12:21:13 +00:00 |
|
Your Name
|
c1a1be61bd
|
fix(ssh-auto): authorize automatic SSH diagnosis for host alerts (HostHighCpuLoad no longer stuck on manual review)
CD Pipeline / build-and-deploy (push) Successful in 9m7s
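Root cause: SSH_MCP_ALLOWED_HOSTS was not set → _ssh_execute() blocked everything
+ auto_approve recognized only kubectl, not ssh → host alerts were always downgraded to manual handling
Fixes:
- ConfigMap: add the four-host SSH_MCP_ALLOWED_HOSTS allowlist
- alert_rules: HostHighCpuLoad and similar alerts changed from NO_ACTION to the ssh_diagnose command
- auto_approve: _has_executable now recognizes commands starting with ssh
- decision_manager: _ssh_execute() adds ssh_diagnose routing
- ssh_provider: add the ssh_diagnose tool (ps aux + free -h + df -h, read-only)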
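
The two gates described in the root cause can be sketched as follows. This assumes SSH_MCP_ALLOWED_HOSTS is a comma-separated string and that executable commands are recognized by prefix; both are guesses, as are the helper names `parse_allowed_hosts`, `ssh_allowed` and `has_executable` and the example host values:

```python
def parse_allowed_hosts(raw: str) -> set[str]:
    """Split a comma-separated allowlist, ignoring blanks and whitespace."""
    return {host.strip() for host in raw.split(",") if host.strip()}


def ssh_allowed(host: str, allowed: set[str]) -> bool:
    """An empty allowlist blocks every host, which matches the reported symptom."""
    return host in allowed


def has_executable(command: str) -> bool:
    """Treat kubectl and ssh-prefixed commands as auto-executable (assumed rule)."""
    return command.strip().startswith(("kubectl", "ssh"))


unset = parse_allowed_hosts("")  # the state before the fix
configured = parse_allowed_hosts("192.168.0.110, 192.168.0.111")

print(ssh_allowed("192.168.0.111", unset))       # False: everything is blocked
print(ssh_allowed("192.168.0.111", configured))  # True
print(has_executable("ssh_diagnose"))            # True
```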
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-04-27 20:13:07 +08:00 |
|
AWOOOI CD
|
ded17caca0
|
chore(cd): deploy a0502b7 [skip ci]
|
2026-04-27 11:55:33 +00:00 |
|