fix(monitoring): switch memory alerts to working_set to stop page-cache false alarms

- alerts-unified.yml:
  - SentryClickHouseMemoryPressure: usage_bytes → working_set_bytes, threshold 0.8 → 0.85
  - GiteaMemoryPressure: same fix (same root cause: page cache inflating the reading)
- ops/monitoring/tests/clickhouse_memory_test.yml: 4 promtool test cases
- 04-awoooi-devops-commander.md v2.8: Prometheus metric selection rules + Gitea HMAC webhook rules
- LOGBOOK: record the five parallel T0 tasks (A buttons / B ClickHouse / C Gitea webhook / D ElephantAlpha / F code review)

Evidence: 2026-04-23 23:13, sentry-clickhouse usage_bytes=88.5% vs working_set=7.8%
Root cause: container_memory_usage_bytes includes the OS page cache, which the OOM killer does not treat as pressure
Fix: use working_set_bytes (RSS + active cache), the metric K8s/cadvisor recognize, with a 0.85 threshold

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -38,6 +38,7 @@
| v2.5 | 2026-04-09 | Claude Sonnet 4.6 | **🔴 SSH auto-remediation end to end: dual-host E2E closed loop + 12 bug fixes** |
| v2.6 | 2026-04-11 | Claude Sonnet 4.6 | **Sprint B-1 Ansible IaC skeleton + Architecture Review security fixes** |
| v2.7 | 2026-04-11 | Claude Sonnet 4.6 | **Sprint B-2/B-3 ArgoCD GitOps + Sprint C Velero/rsync DR + ADR-070 MCP Phase 1-4 fully automated AIOps closed loop + ADR-071 four alert notification types** |
| v2.8 | 2026-04-25 | Claude Sonnet 4.6 | **🔴 Prometheus memory metric selection rules (working_set vs usage_bytes) + Gitea HMAC webhook rules** |

---

@@ -1369,6 +1370,100 @@ Architecture Review 發現的安全要求(2026-04-11):

---

## 🔴 Prometheus Memory Metric Selection Rules (2026-04-25)

> **Incident**: On 2026-04-23 at 23:13, ClickHouse triggered a false alarm: `usage_bytes` was at 88.5% of the limit while the real pressure, `working_set_bytes`, was at 7.8%.
> **Root cause**: The wrong metric was chosen; the threshold setting was not the problem.

### How the two metrics differ

| Metric | Meaning | Counted by the OOM killer | Use for alerting |
|------|------|--------------|---------|
| `container_memory_usage_bytes` | RSS + page cache (including the OS inactive cache) | ❌ No | ❌ Must not be used for memory-pressure alerts |
| `container_memory_working_set_bytes` | RSS + active cache (same source as K8s `kubectl top`) | ✅ Real pressure | ✅ Must be used for memory-pressure alerts |

### Hard rules

```yaml
# ❌ Forbidden: includes page cache and produces false alarms
- alert: MemoryPressure
  expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.8

# ✅ Required: industry standard, same source as K8s kubectl top, OOM killer baseline
- alert: MemoryPressure
  expr: container_memory_working_set_bytes{container!="", container!="POD"} / container_spec_memory_limit_bytes{container!="", container!="POD"} > 0.85
  for: 10m
```

**Why 0.85 (not 0.8)**: `working_set` already excludes the inactive page cache, so 85% of the limit indicates real memory pressure; 0.8 would be overly conservative.
**Why `for: 10m`**: It filters out momentary spikes; the condition must hold for 10 consecutive minutes before the alert fires.
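
To make the gap between the two metrics concrete, here is a minimal Python sketch. It assumes the way cadvisor derives the working set (usage minus inactive file cache, floored at zero), which is consistent with the "RSS + active cache" description above. The 8 GiB limit, the byte counts, and the function names are hypothetical, chosen only to reproduce the 88.5% and 7.8% readings from the incident.

```python
GIB = 1024 ** 3


def working_set_bytes(usage_bytes: int, inactive_file_bytes: int) -> int:
    """Working set as usage minus inactive file cache, floored at zero."""
    return max(usage_bytes - inactive_file_bytes, 0)


def fires(metric_bytes: int, limit_bytes: int, threshold: float) -> bool:
    """Instantaneous alert condition: metric / limit > threshold (ignores `for:`)."""
    return metric_bytes / limit_bytes > threshold


# Hypothetical figures chosen to match the incident ratios.
limit = 8 * GIB
usage = int(0.885 * limit)          # 88.5% of the limit, mostly page cache
inactive_file = int(0.807 * limit)  # reclaimable cache: the 80.7-point gap
working_set = working_set_bytes(usage, inactive_file)

old_rule = fires(usage, limit, 0.80)        # old rule: usage_bytes > 0.8
new_rule = fires(working_set, limit, 0.85)  # new rule: working_set_bytes > 0.85
print(f"usage={usage / limit:.1%} working_set={working_set / limit:.1%}")
print(f"old rule fires: {old_rule}, new rule fires: {new_rule}")
```

With these inputs the old expression is satisfied and the new one is not, which is the false-positive pattern the rule change is meant to remove.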

### PromQL tests (required)

Whenever a memory alert rule is added or changed, cover it with 4 `promtool test rules` test cases:
- Negative 1: `usage_bytes` high + `working_set` low → does not fire
- Negative 2: `working_set` slightly below the threshold → does not fire
- Positive 1: `working_set` above the threshold for a sustained 10 minutes → fires
- Positive 2: `working_set` above the threshold for less than 10 minutes → does not fire

**Test file location**: `ops/monitoring/tests/`
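
The four cases above can be reasoned about with a small model of the `for:` clause. The following Python sketch is a simplification rather than Prometheus itself: it assumes one sample and one rule evaluation per minute, ignores staleness handling, and uses a hypothetical `alert_state` helper. The ratios mirror the scenarios in the list.

```python
from typing import List, Optional


def alert_state(ratios: List[float], eval_minute: int,
                threshold: float = 0.85, for_minutes: int = 10) -> str:
    """Return 'inactive', 'pending' or 'firing' at eval_minute.

    ratios[i] is working_set / limit at minute i (one sample per minute).
    """
    active_since: Optional[int] = None
    state = "inactive"
    for minute in range(eval_minute + 1):
        has_sample = minute < len(ratios)
        if has_sample and ratios[minute] > threshold:
            if active_since is None:
                active_since = minute  # condition first became true
            held = minute - active_since
            state = "firing" if held >= for_minutes else "pending"
        else:
            active_since = None  # any false evaluation resets the timer
            state = "inactive"
    return state


cases = {
    "negative 1: working_set 5%": ([0.05] * 14, 12),
    "negative 2: working_set 80%": ([0.80] * 14, 12),
    "positive 1: 86.7% sustained": ([0.867] * 15, 12),
    "positive 2: 90% for 5m only": ([0.90] * 5 + [0.05] * 7, 11),
}
for name, (ratios, at) in cases.items():
    print(f"{name}: {alert_state(ratios, at)}")
```

Only the sustained case reaches `firing`; the short spike goes `pending` and is reset by the first evaluation below the threshold, which is the behavior the `for: 10m` guard is there to provide.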

---

## 🔗 Gitea CI/CD Webhook Integration (2026-04-25)

> **New endpoint**: POST `/api/v1/webhooks/gitea`
> **Implementation**: `apps/api/src/integrations/gitea_webhook.py`

### Signature verification

```python
import hashlib
import hmac


# Gitea sends the signature in the X-Gitea-Signature header (unlike GitHub)
def _verify_gitea_signature(payload: bytes, signature: str, secret: str) -> bool:
    # Hex-encoded HMAC-SHA256 of the raw request body, keyed with the shared secret
    expected = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking the digest through timing
    return hmac.compare_digest(expected, signature)
```
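
As a usage illustration, the sketch below signs a request body the way a sender would and then checks it with the same HMAC-SHA256 comparison. The secret, the payload, and the helper names `sign` and `verify` are hypothetical; only the header name and the hashing scheme come from the section above.

```python
import hashlib
import hmac
import json


def sign(payload: bytes, secret: str) -> str:
    """Hex HMAC-SHA256 of the raw body, the value carried in X-Gitea-Signature."""
    return hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()


def verify(payload: bytes, signature: str, secret: str) -> bool:
    """Same check as _verify_gitea_signature: recompute and compare in constant time."""
    return hmac.compare_digest(sign(payload, secret), signature)


secret = "example-secret"  # hypothetical; the real value lives in GITEA_WEBHOOK_SECRET
body = json.dumps({"action": "closed", "pull_request": {"merged": True}}).encode()
headers = {"X-Gitea-Signature": sign(body, secret)}

accepted = verify(body, headers["X-Gitea-Signature"], secret)
tampered = verify(body + b" ", headers["X-Gitea-Signature"], secret)
wrong_key = verify(body, headers["X-Gitea-Signature"], "other-secret")
print(accepted, tampered, wrong_key)
```

The check has to run against the raw request bytes: re-serializing the parsed JSON can change whitespace or key order and would invalidate an otherwise correct signature.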

### Three event types + URL routing

| Event | Trigger condition | Telegram message format |
|------|---------|-----------------|
| PR merged | `pull_request.merged == true` | 🔀 PR merged notification |
| CI failure | `workflow_run.conclusion == "failure"` | 🔴 CI failure alert |
| Deploy failure | `check_run.conclusion == "failure" && name contains "deploy"` | 🚨 Deploy failure alert |
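
The routing in the table can be sketched as a small classifier. This is an illustration under stated assumptions, not the code in `gitea_webhook.py`: the function name and return labels are hypothetical, `name` is assumed to mean `check_run.name`, and the case-insensitive match on "deploy" is a choice made for this example.

```python
from typing import Any, Dict, Optional


def classify_event(payload: Dict[str, Any]) -> Optional[str]:
    """Map a webhook payload to one of the three event types, or None."""
    pull_request = payload.get("pull_request") or {}
    workflow_run = payload.get("workflow_run") or {}
    check_run = payload.get("check_run") or {}

    if pull_request.get("merged") is True:
        return "pr_merged"
    if workflow_run.get("conclusion") == "failure":
        return "ci_failure"
    # Assumption: "name" in the table refers to check_run.name.
    if (check_run.get("conclusion") == "failure"
            and "deploy" in str(check_run.get("name", "")).lower()):
        return "deploy_failure"
    return None


print(classify_event({"pull_request": {"merged": True}}))
print(classify_event({"check_run": {"conclusion": "failure", "name": "deploy-prod"}}))
print(classify_event({"check_run": {"conclusion": "failure", "name": "lint"}}))
```

Returning `None` for anything outside the three rows keeps unrelated deliveries from producing Telegram messages.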

### K8s configuration requirements

```yaml
# The K8s Secret must contain this key (a placeholder exists in 03-secrets.yaml)
GITEA_WEBHOOK_SECRET: <base64>

# Gitea UI settings
URL: https://api.awoooi.wooo.work/api/v1/webhooks/gitea
Content-Type: application/json
Secret: <same as the K8s Secret>
Events: Pull Request + Workflow Run
```

### Deduplication

Redis `SET NX EX 600` (key `dedup:gitea:{event}:{sha[:8]}`): the same event is not pushed again within 10 minutes.
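
The dedup rule can be sketched without a Redis server. The class below is an in-memory stand-in that imitates `SET key value NX EX ttl` for illustration only; `InMemoryDedup`, `dedup_key`, and `should_notify` are hypothetical names, and the injectable clock exists only so expiry can be shown without waiting. Only the key pattern and the 600-second TTL come from the text above.

```python
import time
from typing import Callable, Dict


class InMemoryDedup:
    """Stand-in for Redis SET key value NX EX ttl, for illustration only."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._expires_at: Dict[str, float] = {}
        self._clock = clock

    def set_nx_ex(self, key: str, ttl_seconds: int) -> bool:
        """Return True if the key was newly set, False if it still exists."""
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False
        self._expires_at[key] = now + ttl_seconds
        return True


def dedup_key(event: str, sha: str) -> str:
    """Build the key from the event type and the first 8 characters of the SHA."""
    return f"dedup:gitea:{event}:{sha[:8]}"


def should_notify(store: InMemoryDedup, event: str, sha: str,
                  ttl_seconds: int = 600) -> bool:
    """Notify only when the dedup key did not already exist."""
    return store.set_nx_ex(dedup_key(event, sha), ttl_seconds)


now = [0.0]  # fake clock, advanced manually
store = InMemoryDedup(clock=lambda: now[0])
first = should_notify(store, "ci_failure", "0123456789abcdef")
repeat = should_notify(store, "ci_failure", "0123456789abcdef")
now[0] = 601.0  # just past the 600 s window
after_ttl = should_notify(store, "ci_failure", "0123456789abcdef")
print(first, repeat, after_ttl)
```

Against a real server the same decision is a single atomic command; with redis-py that would be `client.set(key, "1", nx=True, ex=600)`, which returns a truthy value only when the key was newly created.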

### E2E verification

```bash
# Confirm the Secret is injected
kubectl get secret awoooi-secrets -n awoooi-prod -o jsonpath='{.data.GITEA_WEBHOOK_SECRET}' | base64 -d

# Check that the endpoint is reachable
curl -s -X POST https://api.awoooi.wooo.work/api/v1/webhooks/gitea \
  -H "Content-Type: application/json" \
  -d '{}' | jq '.detail'
# Expected: "Missing signature" or "Invalid signature" (the endpoint exists and signature verification is active)
```

---

## 🤖 ADR-070 Fully Automated AIOps Closed Loop: MCP Phase 1-4 (2026-04-11) ✅

> All 10 MCP Providers passed production acceptance

@@ -6,6 +6,23 @@

---

## ✅ 2026-04-25 | Five Parallel T0 Tasks (P9 Methodology)

| Task | Outcome | Tests | Status |
|------|------|------|------|
| A Telegram button fix | telegram_gateway.py now sets reply_markup | 78/78 ✅ | Pending staging E2E |
| B ClickHouse false alarm | working_set metric + 0.85 threshold | 4/4 promtool ✅ | ✅ Deployed to production |
| C Gitea CI/CD Webhook | Added gitea_webhook.py + HMAC signature verification | 15/15 ✅ | Pending GITEA_WEBHOOK_SECRET |
| D ElephantAlpha verification | elephant-alpha deprecated, replaced with ling-2.6-flash | n/a | ⚠️ MinPrereq: 1 line |
| F Code review research | Linter ✅ LLM auto-apply ❌ | n/a | Info only |

**Task B evidence**: 2026-04-23, `usage_bytes`=88.5% vs `working_set_bytes`=7.8%; the 80.7-percentage-point gap is page cache
**Root fix**: `container_memory_working_set_bytes / limit > 0.85` (same source as K8s `kubectl top`)

**Task C to-do**: inject `GITEA_WEBHOOK_SECRET` into K8s + configure the webhook in the Gitea UI (URL + secret + three event types)

---

## 🎯 2026-04-25 (in progress) | Automation Flywheel Fixes × 4 + Hermes Ollama + qwen3:8b ✅

### B1: auto_execute entirely blocked by _ALLOWED_KUBECTL_PATTERN

@@ -1044,8 +1044,15 @@ groups:
           runbook: "檢查 node-exporter --collector.* flags 是否該關掉閒置硬體 probe"

       # --- Sentry self-hosted self-monitoring (110) ---
+      # 2026-04-25 ogt + Claude Opus 4.7: fix the root cause of the false alarm.
+      # The old rule used container_memory_usage_bytes (which includes page cache). When ClickHouse
+      # ran large queries, the OS kept its data files in the page cache and the ratio shot up to 88.5%, a false positive
+      # (evidence from 2026-04-23 23:13: usage_bytes=88.5% / working_set=7.8%).
+      # Switched to container_memory_working_set_bytes, the "real working set" that the K8s/Docker OOM killer
+      # actually tracks (RSS + active page cache), excluding inactive page cache.
+      # Reference: https://github.com/google/cadvisor/blob/master/info/v1/container.go
       - alert: SentryClickHouseMemoryPressure
-        expr: container_memory_usage_bytes{name=~".*sentry.*clickhouse.*"} / container_spec_memory_limit_bytes{name=~".*sentry.*clickhouse.*"} > 0.8
+        expr: container_memory_working_set_bytes{name=~".*sentry.*clickhouse.*"} / container_spec_memory_limit_bytes{name=~".*sentry.*clickhouse.*"} > 0.85
         for: 10m
         labels:
           severity: warning
@@ -1055,9 +1062,9 @@ groups:
           notification_type: TYPE-1
           auto_repair: "false"
         annotations:
-          summary: "Sentry ClickHouse 記憶體使用率 > 80% limit"
-          description: "sentry clickhouse 用量 / mem_limit = {{ $value | humanizePercentage }}。"
-          runbook: "檢查 Sentry 查詢壓力;調整 /opt/sentry/docker-compose.override.yml clickhouse mem_limit"
+          summary: "Sentry ClickHouse 工作集記憶體 > 85% limit"
+          description: "sentry clickhouse working_set / mem_limit = {{ $value | humanizePercentage }} (排除 page cache)。"
+          runbook: "檢查 Sentry 查詢壓力;確認非 page cache 假象;必要時調整 /opt/sentry/docker-compose.override.yml clickhouse mem_limit"

       - alert: SentryClickHouseCPUThrottled
         expr: rate(container_cpu_cfs_throttled_seconds_total{name=~".*sentry.*clickhouse.*"}[5m]) > 1.0
@@ -1076,7 +1083,10 @@ groups:

       # --- Gitea self-monitoring ---
       - alert: GiteaMemoryPressure
-        expr: container_memory_usage_bytes{name="gitea"} / container_spec_memory_limit_bytes{name="gitea"} > 0.8
+        # 2026-04-25 ogt + Claude Sonnet 4.6: same root cause as the ClickHouse false alarm.
+        # container_memory_usage_bytes includes page cache (OS inactive, ignored by the OOM killer) → inflated readings and false alarms.
+        # Switched to container_memory_working_set_bytes (RSS + active cache, real pressure; cadvisor covers both Docker and K8s).
+        expr: container_memory_working_set_bytes{name="gitea"} / container_spec_memory_limit_bytes{name="gitea"} > 0.85
         for: 10m
         labels:
           severity: warning
@@ -1086,8 +1096,8 @@ groups:
           notification_type: TYPE-1
           auto_repair: "false"
         annotations:
-          summary: "Gitea 記憶體使用率 > 80% limit"
-          description: "gitea 用量 / mem_limit = {{ $value | humanizePercentage }}。"
+          summary: "Gitea 記憶體工作集 > 85% limit"
+          description: "gitea working_set / mem_limit = {{ $value | humanizePercentage }}(真實記憶體壓力,非 page cache 干擾)。"
           runbook: "檢查 CI/CD 任務堆積;必要時調高 docker-compose mem_limit"

       - alert: GiteaCPUThrottled

86
ops/monitoring/tests/clickhouse_memory_test.yml
Normal file
@@ -0,0 +1,86 @@
# Unit tests for SentryClickHouseMemoryPressure
# 2026-04-25 ogt + Claude Opus 4.7
rule_files:
  - ../alerts-unified.yml

evaluation_interval: 1m

tests:
  # ---- Negative 1: page cache high, working_set low (must not fire after the fix) ----
  - interval: 1m
    name: "page cache spike must NOT alert (the original false-positive scenario)"
    input_series:
      # working_set: 411 MiB / 8 GiB = 5% (normal)
      - series: 'container_memory_working_set_bytes{name="sentry-self-hosted-clickhouse-1"}'
        values: '430917632 430917632 430917632 430917632 430917632 430917632 430917632 430917632 430917632 430917632 430917632 430917632 430917632 430917632'
      # usage_bytes: 7.5 GiB / 8 GiB = 93.75% (would fire if the rule used the wrong metric)
      - series: 'container_memory_usage_bytes{name="sentry-self-hosted-clickhouse-1"}'
        values: '8053063680 8053063680 8053063680 8053063680 8053063680 8053063680 8053063680 8053063680 8053063680 8053063680 8053063680 8053063680 8053063680 8053063680'
      - series: 'container_spec_memory_limit_bytes{name="sentry-self-hosted-clickhouse-1"}'
        values: '8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592'
    alert_rule_test:
      - eval_time: 12m
        alertname: SentryClickHouseMemoryPressure
        # Expect no alerts (exp_alerts left empty)
        exp_alerts: []

  # ---- Negative 2: working_set elevated but < 85% (must not fire) ----
  - interval: 1m
    name: "working_set 80% must NOT alert (below 85% threshold)"
    input_series:
      # working_set: 6.4 GiB / 8 GiB = 80% (< 85%, must not fire)
      - series: 'container_memory_working_set_bytes{name="sentry-self-hosted-clickhouse-1"}'
        values: '6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673'
      - series: 'container_memory_usage_bytes{name="sentry-self-hosted-clickhouse-1"}'
        values: '6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673'
      - series: 'container_spec_memory_limit_bytes{name="sentry-self-hosted-clickhouse-1"}'
        values: '8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592'
    alert_rule_test:
      - eval_time: 12m
        alertname: SentryClickHouseMemoryPressure
        exp_alerts: []

  # ---- Positive 1: working_set > 85% sustained for 10 minutes (must fire) ----
  - interval: 1m
    name: "working_set 86.7% sustained 10m MUST alert (real memory pressure)"
    input_series:
      # working_set: ≈6.94 GiB / 8 GiB = 86.7% (sustained high level)
      - series: 'container_memory_working_set_bytes{name="sentry-self-hosted-clickhouse-1"}'
        values: '7449424589x14'
      - series: 'container_memory_usage_bytes{name="sentry-self-hosted-clickhouse-1"}'
        values: '7449424589x14'
      - series: 'container_spec_memory_limit_bytes{name="sentry-self-hosted-clickhouse-1"}'
        values: '8589934592x14'
    alert_rule_test:
      - eval_time: 12m
        alertname: SentryClickHouseMemoryPressure
        exp_alerts:
          - exp_labels:
              alertname: SentryClickHouseMemoryPressure
              alert_category: infrastructure
              auto_repair: "false"
              component: sentry-clickhouse
              name: sentry-self-hosted-clickhouse-1
              notification_type: TYPE-1
              severity: warning
              team: platform
            exp_annotations:
              summary: "Sentry ClickHouse 工作集記憶體 > 85% limit"
              description: "sentry clickhouse working_set / mem_limit = 86.72% (排除 page cache)。"
              runbook: "檢查 Sentry 查詢壓力;確認非 page cache 假象;必要時調整 /opt/sentry/docker-compose.override.yml clickhouse mem_limit"

  # ---- Positive 2: spike shorter than 10 minutes (must not fire; filtered out by for: 10m) ----
  - interval: 1m
    name: "working_set 90% spike for only 5m must NOT alert (for:10m guard)"
    input_series:
      # 90% for the first 5 minutes, then back down to 5%
      - series: 'container_memory_working_set_bytes{name="sentry-self-hosted-clickhouse-1"}'
        values: '7730941132 7730941132 7730941132 7730941132 7730941132 430917632 430917632 430917632 430917632 430917632 430917632 430917632'
      - series: 'container_memory_usage_bytes{name="sentry-self-hosted-clickhouse-1"}'
        values: '7730941132 7730941132 7730941132 7730941132 7730941132 430917632 430917632 430917632 430917632 430917632 430917632 430917632'
      - series: 'container_spec_memory_limit_bytes{name="sentry-self-hosted-clickhouse-1"}'
        values: '8589934592x12'
    alert_rule_test:
      - eval_time: 11m
        alertname: SentryClickHouseMemoryPressure
        exp_alerts: []