Commit Graph

108 Commits

Author SHA1 Message Date
Your Name
2104f0f01a fix(recovery): harden runner failclosed authority copy [skip ci] 2026-06-28 16:32:28 +08:00
Your Name
f52ec0db26 fix(recovery): add runner failclosed cron authority [skip ci] 2026-06-28 16:32:27 +08:00
Your Name
d7f56351f2 fix(recovery): reopen controlled automation after failclosed regression
Some checks failed
CD Pipeline / workflow-shape (push) Successful in 0s
CD Pipeline / cancel-stale-cd (push) Has been skipped
Code Review / ai-code-review (push) Has been cancelled
CD Pipeline / tests (push) Failing after 14m8s
Type Sync Check / check-type-sync (push) Successful in 42s
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
2026-06-28 16:01:40 +08:00
Your Name
ba054e698d fix(recovery): seal runner failclosed disablers [skip ci] 2026-06-28 15:58:06 +08:00
Your Name
3c495bb472 fix(ci): preserve controlled cd drain lane
All checks were successful
Code Review / ai-code-review (push) Successful in 16s
2026-06-28 14:30:50 +08:00
Your Name
4414ec991f fix(ci): reopen hard-limited controlled cd lane
All checks were successful
CD Pipeline / workflow-shape (push) Successful in 0s
CD Pipeline / cancel-stale-cd (push) Has been skipped
CD Pipeline / tests (push) Successful in 1m42s
Code Review / ai-code-review (push) Successful in 15s
CD Pipeline / build-and-deploy (push) Successful in 6m33s
CD Pipeline / post-deploy-checks (push) Successful in 3m10s
2026-06-28 11:53:42 +08:00
Your Name
f109b11478 fix(recovery): seal 110 cd lane restore sources [skip ci] 2026-06-28 11:37:01 +08:00
Your Name
e97b252475 fix(cd): reopen controlled runtime deploy lane
Some checks failed
CD Pipeline / tests (push) Failing after 7s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 17s
2026-06-28 11:09:42 +08:00
Your Name
241cbe067e fix(recovery): freeze 110 cd lane and source-aware 188 gates [skip ci] 2026-06-28 10:58:41 +08:00
Your Name
0531050934 fix(runner): split controlled cd lane guard [skip ci] 2026-06-28 09:56:31 +08:00
Your Name
00db624e5f fix(reboot): fail closed direct cd lane pressure path [skip ci] 2026-06-28 09:46:46 +08:00
Your Name
3200f9af97 docs(runner): add direct runner pressure exception [skip ci] 2026-06-28 09:00:26 +08:00
Your Name
899635cc63 docs(runner): record 110 fail-closed pressure exception [skip ci] 2026-06-28 08:44:45 +08:00
Your Name
4c951b2996 fix(ci): keep 110 runner inactive until pressure clears 2026-06-27 20:15:01 +08:00
ogt
5e4887d15c fix(ops): gate reboot recovery on product freshness [skip ci] 2026-06-25 19:39:42 +08:00
ogt
6f5e22ba69 fix(ops): classify momo source absence in cold-start gate [skip ci] 2026-06-24 23:05:42 +08:00
Your Name
2ec7f6f440 fix(ops): harden heartbeat and momo alert noise
Some checks failed
Code Review / ai-code-review (push) Successful in 14s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 31s
CD Pipeline / tests (push) Successful in 1m59s
CD Pipeline / build-and-deploy (push) Successful in 7m36s
CD Pipeline / post-deploy-checks (push) Failing after 43s
Ansible / Reboot Recovery Contract / validate (push) Has been cancelled
2026-06-24 19:38:33 +08:00
Your Name
95f442adab fix(ops): harden 188 backup exporter recovery [skip ci] 2026-06-24 06:37:44 +08:00
Your Name
93ac6030cf fix(ops): sync source provider freshness alert rules
Some checks failed
Ansible / Reboot Recovery Contract / validate (push) Has been cancelled
Code Review / ai-code-review (push) Successful in 10s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 24s
2026-06-18 14:23:13 +08:00
Your Name
ff18872a23 feat(ops): add host runaway process aiops guard
Some checks failed
Code Review / ai-code-review (push) Successful in 14s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Failing after 26s
Ansible / Reboot Recovery Contract / validate (push) Has been cancelled
2026-06-18 14:17:03 +08:00
Your Name
b997016991 docs(ops): lock down reboot Plan B mechanism checks [skip ci] 2026-06-18 11:50:53 +08:00
Your Name
6efd186750 docs(security): establish high-value configuration control inventory [skip ci] 2026-06-11 11:29:58 +08:00
Your Name
ae7b39d96a fix(ops): harden reboot recovery and backup alerts 2026-05-29 12:41:34 +08:00
Your Name
6d2b0ed4cd ops(runner): add isolation readiness gate [skip ci] 2026-05-24 09:56:47 +08:00
Your Name
4407b46bb6 ops(runner): inventory workflow labels [skip ci] 2026-05-24 09:52:04 +08:00
Your Name
22b45006b7 ops(runner): add pool inventory audit [skip ci] 2026-05-24 09:47:02 +08:00
Your Name
9b465ee140 ci(runner): drain legacy docker act runner safely
All checks were successful
Code Review / ai-code-review (push) Successful in 11s
2026-05-21 18:53:45 +08:00
Your Name
b3ab4da03b ci(cd): wait for host web build pressure
All checks were successful
Code Review / ai-code-review (push) Successful in 17s
2026-05-21 15:51:36 +08:00
Your Name
ae9d0b7385 feat(monitoring): alert on stale source provider ingestion
All checks were successful
Code Review / ai-code-review (push) Successful in 10s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 25s
CD Pipeline / tests (push) Successful in 3m26s
CD Pipeline / build-and-deploy (push) Successful in 3m38s
CD Pipeline / post-deploy-checks (push) Successful in 1m25s
2026-05-20 19:19:21 +08:00
Your Name
598f33ae8b fix(monitoring): clarify alert chain smoke evidence
All checks were successful
Code Review / ai-code-review (push) Successful in 11s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 22s
CD Pipeline / tests (push) Successful in 3m55s
CD Pipeline / build-and-deploy (push) Successful in 3m31s
CD Pipeline / post-deploy-checks (push) Successful in 1m33s
2026-05-20 13:11:44 +08:00
Your Name
21dcfbd991 fix(governance): collapse km slo fallback series
All checks were successful
Code Review / ai-code-review (push) Successful in 11s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 22s
CD Pipeline / tests (push) Successful in 1m6s
CD Pipeline / build-and-deploy (push) Successful in 5m17s
CD Pipeline / post-deploy-checks (push) Successful in 1m38s
2026-05-14 19:37:15 +08:00
Your Name
d2a4a17969 fix(governance): stabilize adr100 km growth slo
Some checks failed
Code Review / ai-code-review (push) Successful in 22s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 25s
CD Pipeline / tests (push) Successful in 1m11s
CD Pipeline / post-deploy-checks (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
2026-05-14 19:33:52 +08:00
Your Name
4111ea4f9f fix(ai): remove 188 ollama provider
All checks were successful
Code Review / ai-code-review (push) Successful in 12s
CD Pipeline / tests (push) Successful in 1m13s
CD Pipeline / build-and-deploy (push) Successful in 3m36s
CD Pipeline / post-deploy-checks (push) Successful in 1m20s
2026-05-06 14:34:48 +08:00
OG T
c4f40235f4 fix(alertmanager): gate direct telegram to alertchain emergencies
All checks were successful
Code Review / ai-code-review (push) Successful in 10s
2026-05-06 13:45:33 +08:00
OG T
4753099155 fix(alertmanager): send direct alerts to sre group
All checks were successful
Code Review / ai-code-review (push) Successful in 10s
2026-05-06 13:38:47 +08:00
Your Name
587551c1f1 fix(ops): monitor full-stack cold-start gates
All checks were successful
Code Review / ai-code-review (push) Successful in 11s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 18s
2026-05-06 00:48:05 +08:00
Your Name
6e96623884 fix(ops): harden momo scheduler cold start gate
All checks were successful
Code Review / ai-code-review (push) Successful in 10s
2026-05-06 00:15:14 +08:00
Your Name
0315c2b510 docs(ops): codify full stack cold start recovery
All checks were successful
Code Review / ai-code-review (push) Successful in 7s
2026-05-06 00:07:57 +08:00
Your Name
23932773ef fix(monitoring): route docker baseline alerts to ssh
All checks were successful
Code Review / ai-code-review (push) Successful in 11s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 19s
2026-05-06 00:00:12 +08:00
Your Name
2f50c67f5c fix(monitoring): keep host alert ssh diagnostics canonical
All checks were successful
Code Review / ai-code-review (push) Successful in 10s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 20s
E2E Health Check / e2e-health (push) Successful in 2m35s
2026-05-05 23:57:53 +08:00
Your Name
2221fd3256 fix(ops): persist host resource guardrails
All checks were successful
CD Pipeline / tests (push) Successful in 5m25s
Code Review / ai-code-review (push) Successful in 25s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 37s
CD Pipeline / build-and-deploy (push) Successful in 7m31s
CD Pipeline / post-deploy-checks (push) Successful in 5m10s
2026-05-05 16:13:19 +08:00
Your Name
1cc9de5722 fix(ops): point runner guardrail alerts to host script
All checks were successful
CD Pipeline / tests (push) Successful in 5m31s
Code Review / ai-code-review (push) Successful in 30s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 37s
CD Pipeline / build-and-deploy (push) Successful in 7m45s
CD Pipeline / post-deploy-checks (push) Successful in 5m4s
2026-05-05 15:25:37 +08:00
Your Name
d08d1e4951 fix(ops): alert on missing docker resource limits
Some checks failed
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Successful in 23s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 38s
2026-05-05 15:01:31 +08:00
Your Name
72d66e4ae6 fix(ops): align stale job cleanup thresholds
All checks were successful
Code Review / ai-code-review (push) Successful in 28s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 36s
2026-05-05 14:54:17 +08:00
Your Name
5e625f777d fix(ops): add stale gitea job cleanup guard
Some checks failed
Code Review / ai-code-review (push) Has been cancelled
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Has been cancelled
2026-05-05 14:50:47 +08:00
Your Name
7d45f0cb58 fix(ops): alert on stale gitea actions jobs
Some checks failed
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Has been cancelled
2026-05-05 14:42:09 +08:00
Your Name
fe618960a8 fix(ops): monitor systemd runners in host baseline
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
CD Pipeline / tests (push) Has been cancelled
CD Pipeline / post-deploy-checks (push) Has been cancelled
Code Review / ai-code-review (push) Has been cancelled
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 39s
2026-05-05 14:08:43 +08:00
Your Name
e8e6748f70 fix(ops): add docker host resource baseline guardrails
Some checks failed
CD Pipeline / tests (push) Failing after 1m50s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 25s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 38s
2026-05-05 13:45:09 +08:00
Your Name
ec013f662d fix(watchdog): fix duplicate Trust Drift alerts + set up GCP Ollama nginx proxy
Some checks failed
Code Review / ai-code-review (push) Successful in 45s
Ansible Lint / lint (push) Has been cancelled
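- ai_slo_watchdog_job: switch to the trust_drift_detector pure-statistics lib
  to avoid triggering duplicate Telegram alerts alongside governance_agent's hourly self-check

- infra/ansible: set up an nginx proxy on 110 that forwards to GCP-A/B
  port 11435 -> 34.143.170.20:11434 (GCP-A)
  port 11436 -> 34.21.145.224:11434 (GCP-B)

- docs/runbooks: DEPLOY-GCP-OLLAMA-PROXY.md, the complete deployment guide
- ops/nginx: manual deployment script to run directly on 110

Prerequisite for enabling ADR-110 three-tier disaster recovery: deploy the proxy first, then change the ConfigMap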
2026-05-04 23:12:35 +08:00
Your Name
b1ef05fa8c feat(ollama): ADR-110 GCP three-tier disaster recovery architecture (GCP-A → GCP-B → Local → Gemini)
Some checks failed
Code Review / ai-code-review (push) Successful in 50s
CD Pipeline / tests (push) Failing after 1m14s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
## Change summary
- Primary: http://34.143.170.20:11434 (GCP-A SSD, 9x load speed + 2x inference)
- Secondary: http://34.21.145.224:11434 (GCP-B SSD)
- Fallback: http://192.168.0.111:11434 (M1 Pro local HDD, last line of defense)
- Retire ADR-105 (the "111 only" iron rule) and create ADR-110

## Core changes
- config.py: add OLLAMA_SECONDARY_URL; validator adds a GCP IP allowlist (34.143.170.20, 34.21.145.224)
- ollama_failover_manager.py: three-tier Ollama decision matrix; parallel health checks across all three hosts; health_111 → health_gcp_a
- ollama_health_monitor.py: host label extraction generalized (supports GCP public IPs)
- failover_alerter.py: failed/recovered host is shown dynamically instead of the hardcoded "Ollama 111 (GPU)"
- ollama_auto_recovery.py: notify_recovery changed to ollama_gcp_a; recovered_host is dynamic
- k8s/awoooi-prod: configmap + deployment + network-policy updated together (egress adds GCP /32)
- Service layer: 10 service files that hardcoded 192.168.0.111 now read settings.OLLAMA_URL
- Tests: URL constants updated; added three-tier disaster recovery scenarios and GCP IP allowlist validation tests

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 22:49:23 +08:00
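
The tier order and parallel health checks described in b1ef05fa8c can be sketched roughly as follows. This is an illustration only, not the code in ollama_failover_manager.py: the names `OLLAMA_TIERS`, `pick_ollama_tier`, and the injectable `check` callable are hypothetical, and the 3-second timeout is an assumed value. Only the three URLs and their priority order come from the commit message.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Priority order and URLs are from the commit message; the labels are illustrative.
OLLAMA_TIERS = [
    ("gcp_a", "http://34.143.170.20:11434"),   # Primary
    ("gcp_b", "http://34.21.145.224:11434"),   # Secondary
    ("local", "http://192.168.0.111:11434"),   # Fallback
]

# Hypothetical health-check signature: takes a base URL, returns True if healthy.
HealthCheck = Callable[[str], Awaitable[bool]]


async def pick_ollama_tier(check: HealthCheck, timeout: float = 3.0) -> Optional[str]:
    """Probe all tiers concurrently, then return the highest-priority healthy one.

    Returns None when no Ollama tier is healthy; the caller would then
    fall through to Gemini, the last step in the ADR-110 chain.
    """

    async def probe(url: str) -> bool:
        # A timeout or any check error counts as unhealthy.
        try:
            return await asyncio.wait_for(check(url), timeout)
        except Exception:
            return False

    # gather() preserves input order, so results line up with OLLAMA_TIERS.
    results = await asyncio.gather(*(probe(url) for _, url in OLLAMA_TIERS))
    for (name, _), healthy in zip(OLLAMA_TIERS, results):
        if healthy:
            return name
    return None
```

Probing all three hosts at once keeps the decision latency bounded by a single timeout rather than the sum of three, while the ordered scan afterwards still honors the GCP-A → GCP-B → Local priority.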