Commit Graph

6 Commits

Author SHA1 Message Date
Your Name
ae7b39d96a fix(ops): harden reboot recovery and backup alerts 2026-05-29 12:41:34 +08:00
Your Name
d441f70693 fix(ai): add 188 ollama retirement gate
All checks were successful
Code Review / ai-code-review (push) Successful in 10s
CD Pipeline / tests (push) Successful in 1m2s
CD Pipeline / build-and-deploy (push) Successful in 9m2s
CD Pipeline / post-deploy-checks (push) Successful in 1m15s
2026-05-06 14:55:21 +08:00
Your Name
e8e6748f70 fix(ops): add docker host resource baseline guardrails
Some checks failed
CD Pipeline / tests (push) Failing after 1m50s
CD Pipeline / build-and-deploy (push) Has been skipped
CD Pipeline / post-deploy-checks (push) Has been skipped
Code Review / ai-code-review (push) Successful in 25s
Deploy Alert Rules / Deploy Prometheus Alert Rules (push) Successful in 38s
2026-05-05 13:45:09 +08:00
OG T
16d682346a feat(adr-074): M1 flywheel health exporter + M2 host network monitoring
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
ADR-074 M1:
  - FlywheelStatsService: computes 6 flywheel metrics (playbook count / success rate / KM vectorization / alertname NULL / stuck count)
  - GET /api/v1/stats/flywheel: real-time status of the six nodes (for the C1 frontend)
  - GET /api/v1/stats/summary: KPI panel data (for the C1 frontend)
  - GET /api/v1/stats/flywheel/metrics: Prometheus text format
  - flywheel-alerts.yaml: 5 alert rules (FlywheelPlaybookZero/ExecutionSuccessLow/KMVectorizationLow/AlertnameNullHigh/IncidentsStuck)
  - prometheus.yml: awoooi-flywheel scrape job (5-minute interval)
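
For readers unfamiliar with the Prometheus text format that the /metrics endpoint above returns, here is a minimal sketch of how gauge values could be rendered. The function name, metric names, and sample values are hypothetical and are not taken from FlywheelStatsService.

```python
def render_prometheus_text(metrics):
    """Render {name: (help_text, value)} as Prometheus text exposition lines."""
    lines = []
    for name, (help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    # The exposition format expects the payload to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical metric names and values, for illustration only.
sample = {
    "flywheel_playbook_count": ("Number of playbooks", 12),
    "flywheel_execution_success_ratio": ("Execution success rate", 0.95),
}
text = render_prometheus_text(sample)
```

Each metric becomes a HELP line, a TYPE line, and a sample line, which is the shape a scrape job such as awoooi-flywheel would parse.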

ADR-074 M2:
  - prometheus.yml: host-connectivity Blackbox TCP probe (110:22/188:22/120:6443/121:6443)
  - flywheel-alerts.yaml: HostNetworkPartition alert rule
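
The TCP probe above boils down to "can a connection be opened to host:port within a timeout". A rough Python equivalent, assuming nothing about the actual Blackbox module configuration (the function name and the default timeout are illustrative):

```python
import socket


def tcp_probe(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable hosts, and timeouts.
        return False
```

A rule like HostNetworkPartition would then fire when such a probe keeps failing for a host; the real rule expression is not shown in this commit message.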

597 unit tests passed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-12 15:31:01 +08:00
OG T
08db3580a7 fix(monitoring): fix high CPU load on host 110
Some checks are pending
CD Pipeline / build-and-deploy (push) Has started running
Root cause 1: cadvisor continuously scans overlay2 disk usage (1-4s per scan × N containers)
  → add --disable_metrics=disk,diskIO,tcp,udp,percpu,sched,process
  → --housekeeping_interval=30s --docker_only=true
  → CPU dropped from 239% to <1%

Root cause 2: the node_exporter scrape_timeout defaults to 10s; under high load scrapes time out → broken pipe → aggressive retries
  → add scrape_interval: 30s / scrape_timeout: 25s
  → CPU dropped from 48% to 0%

Overall load average: 20 → 9
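
One constraint worth keeping in mind when tuning these values: Prometheus rejects a scrape job whose scrape_timeout exceeds its scrape_interval, so raising the timeout to 25s only works together with the 30s interval. A small sketch of that check, with hypothetical helper names and support for single-unit durations such as "30s" or "5m" only:

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def to_seconds(duration):
    """Convert a single-unit duration string like '30s' or '5m' to seconds."""
    return int(duration[:-1]) * UNIT_SECONDS[duration[-1]]


def timing_ok(scrape_interval, scrape_timeout):
    """True if the timeout does not exceed the interval."""
    return to_seconds(scrape_timeout) <= to_seconds(scrape_interval)


# Values from root cause 2: a 25s timeout fits inside a 30s interval.
new_settings_ok = timing_ok("30s", "25s")
```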

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 13:53:13 +08:00
OG T
5bd8a8a719 fix(monitoring): complete the blackbox-tcp scrape targets (11→15)
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled
Alerts such as SentryDown/HarborDown/SignOzDown reference instances that are not in the scrape list,
so absent metric = 0 and the alerts keep firing

Added the missing targets:
  192.168.0.125:6443/32334/32335 (K3s)
  192.168.0.110:9000/5000/3100 (Sentry/Harbor/Langfuse)
  192.168.0.188:3301/5432/6380/11434/8089 (SignOz/PG/Redis/Ollama/OpenClaw)

Reloaded Prometheus on host 110; all 15 targets are UP
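
The gap this commit closes can be caught mechanically by comparing the instances that alert rules reference against the scrape list. A minimal sketch, assuming both sets have already been extracted as host:port strings (the helper name and the sample lists are illustrative, not read from the real config files):

```python
def missing_targets(alert_instances, scrape_targets):
    """Return alert-referenced instances absent from the scrape list, sorted."""
    return sorted(set(alert_instances) - set(scrape_targets))


# Sample values based on the targets listed in this commit; which targets were
# already scraped before the fix is an assumption made for the example.
alert_instances = ["192.168.0.110:9000", "192.168.0.110:5000", "192.168.0.188:3301"]
scrape_targets = ["192.168.0.110:9000"]
gaps = missing_targets(alert_instances, scrape_targets)
```

A non-empty result means some alert would evaluate against a series that is never scraped, which is the failure mode described above.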

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 12:20:19 +08:00