Your Name
|
ae7b39d96a
|
fix(ops): harden reboot recovery and backup alerts
|
2026-05-29 12:41:34 +08:00 |
|
Your Name
|
d441f70693
|
fix(ai): add 188 ollama retirement gate
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / tests (push) Successful in 1m2s
CD Pipeline / build-and-deploy (push) Successful in 9m2s
CD Pipeline / post-deploy-checks (push) Successful in 1m15s
|
2026-05-06 14:55:21 +08:00 |
|
Your Name
|
e8e6748f70
|
fix(ops): add docker host resource baseline guardrails
CD Pipeline / tests (push) Failing after 1m50s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 25s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 38s
|
2026-05-05 13:45:09 +08:00 |
|
OG T
|
16d682346a
|
feat(adr-074): M1 flywheel health exporter + M2 host network monitoring
CD Pipeline / build-and-deploy (push) Has been cancelled
ADR-074 M1:
- FlywheelStatsService: computes the 6 flywheel metrics (Playbook count / success rate / KM vectorization / alertname NULL / stuck count)
- GET /api/v1/stats/flywheel: real-time status of the six nodes (for the C1 frontend)
- GET /api/v1/stats/summary: KPI panel data (for the C1 frontend)
- GET /api/v1/stats/flywheel/metrics: Prometheus text format
- flywheel-alerts.yaml: 5 alert rules (FlywheelPlaybookZero/ExecutionSuccessLow/KMVectorizationLow/AlertnameNullHigh/IncidentsStuck)
- prometheus.yml: awoooi-flywheel scrape job (5-minute interval)
ADR-074 M2:
- prometheus.yml: host-connectivity Blackbox TCP probe (110:22/188:22/120:6443/121:6443)
- flywheel-alerts.yaml: HostNetworkPartition alert rule
597 unit tests passed
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
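For readers unfamiliar with the Prometheus text format that the /api/v1/stats/flywheel/metrics endpoint is said to return, the following is a minimal sketch of how gauges could be rendered in that format. It is not the FlywheelStatsService code; the function name, metric names, and values are hypothetical and chosen only for illustration.

```python
from typing import Mapping


def render_prometheus_text(metrics: Mapping[str, tuple[str, float]]) -> str:
    """Render gauges in the Prometheus text exposition format.

    `metrics` maps a metric name to a (help text, value) pair.
    """
    lines = []
    for name, (help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    # The exposition format expects a trailing newline after the last sample.
    return "\n".join(lines) + "\n"


# Hypothetical metric names and values, not taken from the commit.
sample = {
    "flywheel_playbook_count": ("Number of playbooks", 12),
    "flywheel_execution_success_ratio": ("Share of successful executions", 0.75),
}
print(render_prometheus_text(sample), end="")
```

A scrape job such as awoooi-flywheel would then read this plain-text body on each scrape.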
|
2026-04-12 15:31:01 +08:00 |
|
OG T
|
08db3580a7
|
fix(monitoring): fix high CPU load on host 110
CD Pipeline / build-and-deploy (push) Has started running
Root cause 1: cadvisor continuously scans overlay2 disk usage (1-4s per scan × N containers)
→ add --disable_metrics=disk,diskIO,tcp,udp,percpu,sched,process
→ --housekeeping_interval=30s --docker_only=true
→ CPU dropped from 239% to <1%
Root cause 2: node_exporter scrape_timeout defaults to 10s; under high load scrapes time out → broken pipe → aggressive retries
→ add scrape_interval: 30s / scrape_timeout: 25s
→ CPU dropped from 48% to 0%
Overall load average: 20 → 9
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
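The scrape settings above (scrape_interval: 30s, scrape_timeout: 25s) respect the Prometheus rule that a job's scrape_timeout must not exceed its scrape_interval. A minimal sketch of that check follows; the helper names are hypothetical, and the parser only handles single-unit durations such as 30s or 5m, not the full Prometheus duration syntax.

```python
import re

# Seconds per supported unit; compound durations like "1m30s" are not handled.
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_duration(text: str) -> float:
    """Parse a single-unit duration such as '30s' or '5m' into seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h)", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def timeout_fits_interval(scrape_interval: str, scrape_timeout: str) -> bool:
    """Return True when the timeout is no longer than the interval."""
    return parse_duration(scrape_timeout) <= parse_duration(scrape_interval)


# Values from the commit message: the timeout fits inside the interval.
print(timeout_fits_interval("30s", "25s"))
```

Raising only the timeout to 25s while leaving a shorter interval in place would fail this check, which is presumably why the commit sets both values together.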
|
2026-04-09 13:53:13 +08:00 |
|
OG T
|
5bd8a8a719
|
fix(monitoring): add missing blackbox-tcp scrape targets (11→15)
CD Pipeline / build-and-deploy (push) Has been cancelled
The instances referenced by alerts such as SentryDown/HarborDown/SignOzDown were not in the scrape list,
so the absent metric evaluated as 0 and the alerts kept firing
Added the missing targets:
192.168.0.125:6443/32334/32335 (K3s)
192.168.0.110:9000/5000/3100 (Sentry/Harbor/Langfuse)
192.168.0.188:3301/5432/6380/11434/8089 (SignOz/PG/Redis/Ollama/OpenClaw)
Prometheus reloaded on host 110; all 15 targets are UP
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
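The target list above uses a host:port/port/port shorthand, whereas a blackbox-tcp scrape job needs one host:port entry per target. A minimal sketch of the expansion follows; expand_targets is a hypothetical helper written for illustration and is not part of the repository.

```python
def expand_targets(shorthand: str) -> list[str]:
    """Expand 'host:p1/p2/p3' into ['host:p1', 'host:p2', 'host:p3']."""
    host, _, ports = shorthand.partition(":")
    if not host or not ports:
        raise ValueError(f"expected 'host:port[/port...]', got {shorthand!r}")
    return [f"{host}:{port}" for port in ports.split("/")]


# Shorthand copied from the commit message above.
for entry in expand_targets("192.168.0.125:6443/32334/32335"):
    print(entry)
```

Each resulting entry can be pasted into the job's static target list as-is.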
|
2026-04-09 12:20:19 +08:00 |
|