Commit Graph

1986 Commits

Author SHA1 Message Date
Your Name
337b2df60d chore(cd): deploy latest image tag for prod manifests 2026-06-04 00:13:51 +08:00
Your Name
ab21d8bad2 chore: execute W1-redline convergence updates and evidence log 2026-06-03 20:10:14 +08:00
Your Name
2d37383fc6 fix(monitoring): fix false positive NoAlertsReceived2Hours by filtering only alertmanager source 2026-05-28 15:33:17 +08:00
Your Name
3779f6f1e0 fix(metrics): 串入飛輪指標到 /metrics 主端點,修復 FlywheelExecutionRateMissing 死告警
INC-20260507-99ADF2 根因(feedback_full_chain_first_then_fix.md 全鏈分析):

【鏈路斷點】規則層(5/3 加)vs 指標層(5/6 改)vs scrape 層(從沒同步)
- 577250a6(5/3)「反消音化」commit 加了 FlywheelExecutionRateMissing
  rule,要求 110 Prom scrape 到 awoooi_flywheel_execution_success_rate;
- a2c4b3d4(5/6)Codex 改 FlywheelStatsService 用 auto_repair_executions
  作 source of truth(24h 樣本 1-9 筆回 None 給 W-3b watchdog 接管);
- 但 awoooi_flywheel_* 指標自始至終只在 /api/v1/stats/flywheel/metrics
  暴露,110 Prom awoooi-api job 抓的是 /metrics → absent() 永遠 1
  → 自 2026-05-06T04:14 UTC 起 firing 26h+ 屬 dead alert

【修法】只動 awoooi-api 一處,不碰 Codex 設計、不碰 110 Prom 配置:
- main.py /metrics endpoint 改 async,在 generate_latest() 後串入
  FlywheelStatsService.compute() → to_prometheus_lines()。
- 既有 awoooi-api scrape job 自動拿到飛輪指標。
- 完全保留 Codex a2c4b3d4 設計:1-9 筆回 None 讓 W-3b watchdog 雙保險。

【不碰的部分】
- flywheel_stats_service.py 不動:Codex 5/6 LOGBOOK 已明確說明
  「Redis playbook counter 失準 → 用 auto_repair_executions 為唯一信任源」,
  1-9 筆 return None 是配合 ai_slo_watchdog_job W-3b grace+30min 設計的
  反消音化雙保險,不是 bug。

驗證計畫(部署後):
1. curl /metrics | grep awoooi_flywheel  → 看到飛輪指標
2. Prom query awoooi_flywheel_execution_success_rate  → 非空
3. ALERTS{alertname="FlywheelExecutionRateMissing"}  → resolved
4. 30 分鐘觀察 Telegram 不再收 INC-20260507-99ADF2

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 15:32:47 +08:00
Your Name
c38227e945 fix(ai): remove 188 ollama provider 2026-05-06 14:33:16 +08:00
Your Name
1b4a6c1e8c fix(awooop): align console with flywheel execution metrics 2026-05-06 00:44:53 +08:00
Your Name
894174da5b fix(ops): harden cold-start schedule recovery 2026-05-05 22:14:54 +08:00
Your Name
10cd9fc025 fix(openclaw): gate alert cloud fallback behind flag 2026-05-05 20:53:12 +08:00
Your Name
8161ccf83f fix(ops): persist host resource guardrails 2026-05-05 16:13:02 +08:00
Your Name
6b93c8f454 fix(chat): route OpenClaw chat through Ollama lane
Some checks failed
CD Pipeline / tests (push) Successful in 5m26s
Code Review / ai-code-review (push) Successful in 25s
CD Pipeline / build-and-deploy (push) Successful in 8m11s
CD Pipeline / post-deploy-checks (push) Has been cancelled
2026-05-05 15:57:26 +08:00
AWOOOI CD
3a17a860a0 chore(cd): deploy 1cc9de5 [skip ci] 2026-05-05 15:41:33 +08:00
Your Name
6ec5c06bad docs(ops): record docker limit cleanup 2026-05-05 15:39:46 +08:00
Your Name
44d8322c4d docs(ops): record live runner guardrail fix 2026-05-05 15:34:00 +08:00
Your Name
819734f655 docs(ops): record runner guardrail follow-up 2026-05-05 15:28:31 +08:00
Your Name
1cc9de5722 fix(ops): point runner guardrail alerts to host script
All checks were successful
CD Pipeline / tests (push) Successful in 5m31s
Code Review / ai-code-review (push) Successful in 30s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 37s
CD Pipeline / build-and-deploy (push) Successful in 7m45s
CD Pipeline / post-deploy-checks (push) Successful in 5m4s
2026-05-05 15:25:37 +08:00
Your Name
96c1ba20da fix(ci): cap host-runner helper containers
All checks were successful
Code Review / ai-code-review (push) Successful in 27s
2026-05-05 15:09:44 +08:00
Your Name
855a39ad95 docs(ops): record docker limit alert deploy 2026-05-05 15:06:47 +08:00
Your Name
209da7ba33 chore(ops): deploy docker limit alert image
Some checks failed
CD Pipeline / tests (push) Successful in 5m24s
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-05-05 15:05:23 +08:00
Your Name
d08d1e4951 fix(ops): alert on missing docker resource limits
Some checks failed
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Successful in 23s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 38s
2026-05-05 15:01:31 +08:00
Your Name
e24c8ea051 fix(ci): align B5 schema with tenant isolation
Some checks failed
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
2026-05-05 15:00:07 +08:00
Your Name
72d66e4ae6 fix(ops): align stale job cleanup thresholds
All checks were successful
Code Review / ai-code-review (push) Successful in 28s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 36s
2026-05-05 14:54:17 +08:00
Your Name
5e625f777d fix(ops): add stale gitea job cleanup guard
Some checks failed
Code Review / ai-code-review (push) Has been cancelled
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Has been cancelled
2026-05-05 14:50:47 +08:00
Your Name
df72c77880 chore(ops): deploy stale gitea job alert image
Some checks failed
CD Pipeline / tests (push) Successful in 5m29s
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
2026-05-05 14:43:53 +08:00
Your Name
7d45f0cb58 fix(ops): alert on stale gitea actions jobs
Some checks failed
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Has been cancelled
2026-05-05 14:42:09 +08:00
Your Name
fc1a6196df fix(code-review): keep Gemini fallback opt-in
Some checks failed
CD Pipeline / tests (push) Successful in 2m2s
Code Review / ai-code-review (push) Successful in 27s
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
2026-05-05 14:38:44 +08:00
Your Name
3b73cc7f94 fix(ci): avoid cd on workflow-only changes
Some checks failed
Code Review / ai-code-review (push) Has been cancelled
2026-05-05 14:37:31 +08:00
Your Name
96b860dc2c docs(ops): record ci stale-run guard 2026-05-05 14:35:24 +08:00
Your Name
2e128f90db fix(ci): skip stale code review runs
Some checks failed
Code Review / ai-code-review (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
2026-05-05 14:35:09 +08:00
Your Name
228768ff68 docs(ops): record host baseline follow-up 2026-05-05 14:31:59 +08:00
Your Name
ab0f0a8a62 chore(ops): deploy runner classification image
Some checks failed
CD Pipeline / tests (push) Successful in 2m35s
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Successful in 26s
2026-05-05 14:29:55 +08:00
Your Name
0e14935351 fix(ops): classify systemd runner alerts as host resources
Some checks failed
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
2026-05-05 14:28:18 +08:00
Your Name
a5192d4e03 chore(ops): deploy runner alert routing image
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
2026-05-05 14:21:17 +08:00
Your Name
34d1c76be9 fix(ops): route systemd runner baseline alerts
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
2026-05-05 14:19:58 +08:00
Your Name
2b93975d37 chore(ops): deploy systemd runner baseline image
Some checks failed
CD Pipeline / tests (push) Successful in 2m6s
Code Review / ai-code-review (push) Successful in 26s
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-05-05 14:12:30 +08:00
Your Name
fe618960a8 fix(ops): monitor systemd runners in host baseline
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 39s
2026-05-05 14:08:43 +08:00
Your Name
8e22110030 fix(governance): keep trust drift watchdog on governance agent
Some checks failed
CD Pipeline / tests (push) Successful in 2m51s
Code Review / ai-code-review (push) Successful in 24s
CD Pipeline / build-and-deploy (push) Has started running
CD Pipeline / post-deploy-checks (push) Has been cancelled
2026-05-05 14:00:13 +08:00
Your Name
2ff0ef3bb6 fix(openclaw): route legacy ollama through failover endpoints
Some checks failed
CD Pipeline / tests (push) Failing after 1m49s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 24s
2026-05-05 13:55:52 +08:00
Your Name
bb1995f349 fix(awooop): use naive utc for run lease timestamps
Some checks failed
CD Pipeline / tests (push) Failing after 1m48s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Has been cancelled
2026-05-05 13:53:07 +08:00
Your Name
e8e6748f70 fix(ops): add docker host resource baseline guardrails
Some checks failed
CD Pipeline / tests (push) Failing after 1m50s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 25s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 38s
2026-05-05 13:45:09 +08:00
Your Name
a57e3d3d75 test(consensus): expect redis namespace dual write
Some checks failed
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
2026-05-05 13:41:41 +08:00
Your Name
b00a7b050a test(ollama): align inference connect errors with degraded health
Some checks failed
CD Pipeline / tests (push) Failing after 2m26s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 28s
2026-05-05 13:34:19 +08:00
Your Name
506744ba3a test(ollama): keep slow gcp primary on ollama
Some checks failed
CD Pipeline / tests (push) Failing after 2m21s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 26s
2026-05-05 13:29:27 +08:00
Your Name
869646459c fix(ollama): treat legacy primary as ollama
Some checks failed
CD Pipeline / tests (push) Failing after 1m48s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 28s
2026-05-05 13:25:27 +08:00
Your Name
33d4326cce test(ollama): align slow recovery with gcp routing policy
Some checks failed
CD Pipeline / tests (push) Failing after 1m51s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 33s
2026-05-05 13:21:16 +08:00
Your Name
b3d412f9eb fix(cd): restore gitea workflow yaml parsing
Some checks failed
CD Pipeline / tests (push) Failing after 2m20s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 31s
2026-05-05 13:17:15 +08:00
Your Name
f78b1b0690 fix(ollama): honor provider endpoint selection
All checks were successful
Code Review / ai-code-review (push) Successful in 37s
2026-05-05 13:14:46 +08:00
Your Name
0ebd0d8a92 fix(deploy): 緊急部署 API 2e17325c — governance skip cooldown + watchdog B4
All checks were successful
Code Review / ai-code-review (push) Successful in 54s
CI cancel-in-progress 導致 CD 未執行,手動更新 kustomization.yaml。

包含修復:
- governance_dispatcher skip 路徑 cooldown(消除 30s 重複處理)
- watchdog B4 A2/A3/W6 三層修復(消除 META SYSTEM 重複告警)
- Operator Console leWOOOgo 積木化修復(e22b8e7)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 12:09:29 +08:00
Your Name
2e17325c3f fix(ollama): 更新 failover_manager URL 註解反映 ADR-110 nginx proxy 拓撲
All checks were successful
Code Review / ai-code-review (push) Successful in 43s
url_primary/secondary/tertiary 的 comment 還是舊版(ADR-110 前的 IP),
更新為 110:11435→GCP-A / 11436→GCP-B / 11437→Local111 nginx proxy 格式。

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 11:03:36 +08:00
Your Name
e22b8e7ab2 feat(awooop): Operator Console API + 前端(leWOOOgo 積木化修復)
All checks were successful
Code Review / ai-code-review (push) Successful in 42s
後端:
- 新增 platform_operator_service.py(DB 存取集中 Service 層)
- Router 層移除 Depends(get_db),改呼叫 Service 函數
- tenants/contracts/operator_runs 三個 Router 符合 leWOOOgo 規範
- __init__.py 整合四個 platform router

前端:
- apps/web/src/app/[locale]/awooop/ 完整建立(7 個頁面)
- layout.tsx:四分頁導覽(tenants/contracts/runs/approvals)
- 全部使用 @/i18n/routing(Link/usePathname/useRouter)避免 i18n 路徑問題
- approvals page:10s 自動刷新、timeout 倒數、緊急紅色高亮

ADR-106/107/112/114/115/116

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 11:00:20 +08:00
Your Name
aa4ccec429 fix(watchdog): ADR-092 B4 — 三層修復消除 META SYSTEM 重複告警 + Ollama 路由強化
All checks were successful
Code Review / ai-code-review (push) Successful in 7m16s
問題根因(debugger 全景徹查):
1. Prod 仍跑舊版代碼(ec013f66 後的修法未部署 → 告警字串仍含舊格式)
2. replicas=2 時 Pod 間 grace period 不共享 → violation_codes 分歧 → 不同 SHA256 → dedup 失效
3. 新 Pod 啟動立即執行 _check_once() → rollout 時多發一波
4. W6 violation_codes 含動態 low_count → count 微變繞過 dedup

修復(A2/A3/W6/C1/C2):
- A2:run_ai_slo_watchdog_loop 加 90s leading sleep,避免 rollout 立即觸發
- A3:_grace_active() 改為 Redis cluster-shared(watchdog:cluster_grace, ex=1800s, nx=True)
     消除 Pod 間 grace period 不一致;Redis 故障時 fallback 為 process-local monotonic
- W6:violation_codes 移除動態 low_count,改為穩定 "W6:trust_drift"
- C1:ollama_auto_recovery.py recovered_host 改動態 label(依 URL port 判斷 GCP-A/B/Local)
- C2:ConfigMap OLLAMA_FALLBACK_URL 改走 110:11437 nginx proxy,三層容災統一架構

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 10:31:53 +08:00