fix(monitoring): switch memory alerts to working_set to stop page-cache false alarms

- alerts-unified.yml:
  - SentryClickHouseMemoryPressure: usage_bytes → working_set_bytes, 0.8 → 0.85
  - GiteaMemoryPressure: same fix (same root cause: page cache inflating usage)
- ops/monitoring/tests/clickhouse_memory_test.yml: 4 promtool cases
- 04-awoooi-devops-commander.md v2.8: Prometheus metric selection rules + Gitea HMAC webhook rules
- LOGBOOK: record the five parallel T0 tasks (A button / B ClickHouse / C Gitea webhook / D ElephantAlpha / F Code review)

Evidence: 2026-04-23 23:13 sentry-clickhouse usage_bytes=88.5% vs working_set=7.8%
Root cause: container_memory_usage_bytes includes OS page cache, which the OOM killer does not treat as pressure
Fix: use working_set_bytes (RSS + active cache), the metric K8s/cadvisor report, with threshold 0.85
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Your Name
2026-04-26 20:16:12 +08:00
parent 4a8c3ca5c4
commit 7cd53c0228
4 changed files with 215 additions and 7 deletions


@@ -38,6 +38,7 @@
| v2.5 | 2026-04-09 | Claude Sonnet 4.6 | **🔴 SSH auto-remediation full chain: dual-host E2E closed loop + 12 bug fixes** |
| v2.6 | 2026-04-11 | Claude Sonnet 4.6 | **Sprint B-1 Ansible IaC skeleton + Architecture Review security fixes** |
| v2.7 | 2026-04-11 | Claude Sonnet 4.6 | **Sprint B-2/B-3 ArgoCD GitOps + Sprint C Velero/rsync DR + ADR-070 MCP Phase 1-4 fully automated AIOps closed loop + ADR-071 four alert notification types** |
| v2.8 | 2026-04-25 | Claude Sonnet 4.6 | **🔴 Prometheus memory metric selection rules (working_set vs usage_bytes) + Gitea HMAC webhook rules** |
---
@@ -1369,6 +1370,100 @@ Security requirements found by Architecture Review (2026-04-11)
---
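## 🔴 Prometheus memory metric selection rules (2026-04-25)
> **Incident**: ClickHouse fired a false alarm at 2026-04-23 23:13: `usage_bytes` was at 88.5% while the real pressure, `working_set_bytes`, was at 7.8%
> **Root cause**: the wrong metric was chosen; the threshold setting was not the problem
### Essential difference between the two metrics
| Metric | Meaning | Counted by OOM killer | Use in alerts |
|------|------|--------------|---------|
| `container_memory_usage_bytes` | RSS + page cache (including the OS's inactive cache) | ❌ No | ❌ Forbidden for memory-pressure alerts |
| `container_memory_working_set_bytes` | RSS + active cache (same source as K8s `kubectl top`) | ✅ Real pressure | ✅ Required for memory-pressure alerts |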
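
The gap between the two metrics can be reproduced with simple arithmetic. The sketch below is illustrative only: it assumes cAdvisor's derivation of the working set as usage minus inactive file cache, and it assumes an 8 GiB limit (the value used in the promtool tests; the real limit during the incident is not stated here). `working_set_bytes` is a hypothetical helper name.

```python
GIB = 1024 ** 3


def working_set_bytes(usage_bytes: int, inactive_file_bytes: int) -> int:
    """Working set = usage minus inactive file cache, floored at zero."""
    return max(usage_bytes - inactive_file_bytes, 0)


# Assumed 8 GiB limit; the ratios are the ones recorded for the incident.
limit = 8 * GIB
usage = int(0.885 * limit)                   # usage_bytes at 88.5% of the limit
inactive_file = usage - int(0.078 * limit)   # inactive cache sized so the working set is 7.8%

ws = working_set_bytes(usage, inactive_file)
usage_ratio = usage / limit
ws_ratio = ws / limit

print(f"usage={usage_ratio:.1%} working_set={ws_ratio:.1%}")
print("old rule fires:", usage_ratio > 0.8)
print("new rule fires:", ws_ratio > 0.85)
```

With these inputs the old `usage_bytes` rule crosses its 0.8 threshold while the `working_set` rule stays far below 0.85, which is the false-positive pattern described above.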
### Iron rule
```yaml
# ❌ Strictly forbidden: includes page cache and produces false alarms
- alert: MemoryPressure
  expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.8
# ✅ Required: the industry standard, same source as K8s kubectl top, the OOM killer's baseline
- alert: MemoryPressure
  expr: container_memory_working_set_bytes{container!="", container!="POD"} / container_spec_memory_limit_bytes{container!="", container!="POD"} > 0.85
  for: 10m
```
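**Why 0.85 (not 0.8)**: under `working_set` semantics, 85% is where real memory pressure begins; 0.8 is too conservative
**Why `for: 10m`**: filters out momentary jitter; real pressure has to persist for 10 minutes before the alert fires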
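
To make the `for: 10m` guard concrete, here is a minimal Python sketch of how a `for` clause delays firing. It assumes a 1-minute evaluation interval (as in the promtool tests), so 10 minutes equals 10 evaluation steps. `firing_states` is a hypothetical helper that models the behaviour; it is not Prometheus code.

```python
from typing import Iterable, List


def firing_states(
    ratios: Iterable[float], threshold: float = 0.85, for_steps: int = 10
) -> List[bool]:
    """Return, per evaluation step, whether the alert would be firing.

    The alert becomes pending at the first step where the ratio exceeds the
    threshold and fires once it has stayed above the threshold for
    `for_steps` evaluation intervals. Any dip resets the pending timer.
    """
    states: List[bool] = []
    pending_since = None
    for step, ratio in enumerate(ratios):
        if ratio > threshold:
            if pending_since is None:
                pending_since = step
            states.append(step - pending_since >= for_steps)
        else:
            pending_since = None
            states.append(False)
    return states


# Sustained pressure: above the threshold for 14 consecutive samples.
sustained = firing_states([0.867] * 14)
# Short spike: 5 samples at 90%, then back to 5%.
spike = firing_states([0.90] * 5 + [0.05] * 7)

print("sustained fires at 12m:", sustained[12])
print("spike ever fires:", any(spike))
```

The sustained series starts firing at the 10-minute step, while the 5-minute spike never leaves the pending state.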
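### PromQL tests (required)
When adding or changing a memory alert rule, add 4 test cases with `promtool test rules`:
- Negative test 1: `usage_bytes` high + `working_set` low → no alert
- Negative test 2: `working_set` slightly below the threshold → no alert
- Positive test 1: `working_set` above the threshold for 10 minutes → alert fires
- Positive test 2: `working_set` above the threshold for less than 10 minutes → no alert
**Test file location**: `ops/monitoring/tests/`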
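
When writing these cases, the byte values for `input_series` have to match the percentages the test claims to cover. The following sketch shows one way to derive them, assuming an 8 GiB limit; `bytes_for_ratio` and `series_values` are hypothetical helpers, not part of promtool.

```python
GIB = 1024 ** 3
LIMIT_BYTES = 8 * GIB  # assumed container memory limit


def bytes_for_ratio(ratio: float, limit_bytes: int = LIMIT_BYTES) -> int:
    """Byte count that corresponds to `ratio` of the limit, rounded down."""
    return int(ratio * limit_bytes)


def series_values(samples) -> str:
    """Render samples as a space-separated promtool `values` string."""
    return " ".join(str(sample) for sample in samples)


below = bytes_for_ratio(0.80)   # just under the 0.85 threshold
above = bytes_for_ratio(0.90)   # clearly over the threshold

# 5 minutes over the threshold, then 7 minutes back at a low level.
spike = series_values([above] * 5 + [bytes_for_ratio(0.05)] * 7)

print("below:", below, f"({below / LIMIT_BYTES:.1%})")
print("above:", above, f"({above / LIMIT_BYTES:.1%})")
print("spike:", spike)
```

Checking the ratio of each generated value against the limit before pasting it into the test file helps keep the comments and test names in line with the actual data.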
---
## 🔗 Gitea CI/CD webhook integration (2026-04-25)
> **New endpoint**: POST `/api/v1/webhooks/gitea`
> **Implementation**: `apps/api/src/integrations/gitea_webhook.py`
### Signature verification
```python
import hashlib
import hmac

# Gitea sends the signature in the X-Gitea-Signature header (unlike GitHub)
def _verify_gitea_signature(payload: bytes, signature: str, secret: str) -> bool:
    expected = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    # Constant-time comparison of the two hex digests
    return hmac.compare_digest(expected, signature)
```
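
A short usage sketch of the same check: the signature has to be computed over the raw request bytes, so even a one-byte change to the body invalidates it. The secret and payload below are made-up example values, and `sign` / `verify` are hypothetical names that mirror the function above.

```python
import hashlib
import hmac


def sign(payload: bytes, secret: str) -> str:
    """Hex HMAC-SHA256 of the raw payload, keyed with the shared secret."""
    return hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()


def verify(payload: bytes, signature: str, secret: str) -> bool:
    """Constant-time comparison of the expected and received signatures."""
    return hmac.compare_digest(sign(payload, secret), signature)


secret = "example-secret"                  # hypothetical value
body = b'{"action":"closed"}'              # hypothetical raw request body
signature = sign(body, secret)             # what the sender would put in the header

print("valid body:", verify(body, signature, secret))
print("tampered body:", verify(body + b" ", signature, secret))
print("wrong secret:", verify(body, signature, "other-secret"))
```

Because the digest covers raw bytes, the handler should verify before parsing or re-serialising the JSON body.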
### Three event types + URL routing
| Event | Trigger condition | Telegram message format |
|------|---------|-----------------|
| PR merged | `pull_request.merged == true` | 🔀 PR merged notification |
| CI failure | `workflow_run.conclusion == "failure"` | 🔴 CI failure alert |
| Deploy failure | `check_run.conclusion == "failure" && name contains "deploy"` | 🚨 Deploy failure alert |
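
The routing table can be read as a small classifier. The sketch below is an assumption about how the conditions might be evaluated: the payload shape, the case-sensitive match on `check_run.name`, the `classify_event` function and its return labels are all hypothetical and not taken from `gitea_webhook.py`.

```python
from typing import Optional


def classify_event(payload: dict) -> Optional[str]:
    """Map a webhook payload to one of the three event types, or None."""
    pull_request = payload.get("pull_request") or {}
    if pull_request.get("merged") is True:
        return "pr_merged"

    workflow_run = payload.get("workflow_run") or {}
    if workflow_run.get("conclusion") == "failure":
        return "ci_failure"

    check_run = payload.get("check_run") or {}
    if check_run.get("conclusion") == "failure" and "deploy" in str(
        check_run.get("name", "")
    ):
        return "deploy_failure"

    # Anything else is ignored and produces no Telegram message.
    return None


print(classify_event({"pull_request": {"merged": True}}))
print(classify_event({"workflow_run": {"conclusion": "failure"}}))
print(classify_event({"check_run": {"conclusion": "failure", "name": "deploy-prod"}}))
print(classify_event({"check_run": {"conclusion": "failure", "name": "lint"}}))
```

Returning `None` for unmatched payloads keeps unrelated events from generating notifications.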
### K8s configuration requirements
```yaml
# The K8s Secret must contain this key (a placeholder exists in 03-secrets.yaml)
GITEA_WEBHOOK_SECRET: <base64>
# Gitea UI settings
URL: https://api.awoooi.wooo.work/api/v1/webhooks/gitea
Content-Type: application/json
Secret: <same as the K8s Secret>
Events: Pull Request + Workflow Run
```
### Deduplication
Redis `SET NX EX 600` (key `dedup:gitea:{event}:{sha[:8]}`): the same event is not pushed again within 10 minutes.
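
The deduplication rule can be sketched without a Redis server. The class below is an in-memory stand-in for `SET key value NX EX 600`, written only to illustrate the semantics; `InMemoryDedup`, `first_seen` and `dedup_key` are hypothetical names, and the real implementation presumably talks to Redis instead.

```python
import time
from typing import Dict, Optional


def dedup_key(event: str, sha: str) -> str:
    """Build the key from the event type and the first 8 characters of the SHA."""
    return f"dedup:gitea:{event}:{sha[:8]}"


class InMemoryDedup:
    """Illustrative stand-in for Redis `SET key value NX EX ttl`."""

    def __init__(self, ttl_seconds: int = 600) -> None:
        self.ttl = ttl_seconds
        self._expires: Dict[str, float] = {}

    def first_seen(self, key: str, now: Optional[float] = None) -> bool:
        """Return True and record the key if it is absent or expired (NX)."""
        now = time.monotonic() if now is None else now
        expiry = self._expires.get(key)
        if expiry is not None and expiry > now:
            return False  # still within the TTL window: duplicate
        self._expires[key] = now + self.ttl  # EX: expire after ttl seconds
        return True


dedup = InMemoryDedup()
key = dedup_key("ci_failure", "7cd53c0228abcdef")

print(key)
print("first delivery:", dedup.first_seen(key, now=0))
print("retry after 5m:", dedup.first_seen(key, now=300))
print("after 10m:", dedup.first_seen(key, now=600))
```

Truncating the SHA to 8 characters keeps keys short; two commits whose SHAs share the first 8 characters would share a key for the same event type, which is an accepted trade-off in this sketch.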
### E2E verification
```bash
# Confirm the Secret is injected
kubectl get secret awoooi-secrets -n awoooi-prod -o jsonpath='{.data.GITEA_WEBHOOK_SECRET}' | base64 -d
# Test directly that the endpoint is reachable
curl -s -X POST https://api.awoooi.wooo.work/api/v1/webhooks/gitea \
  -H "Content-Type: application/json" \
  -d '{}' | jq '.detail'
# Expected: "Missing signature" or "Invalid signature" (the endpoint exists and signature verification is active)
```
---
## 🤖 ADR-070 fully automated AIOps closed loop: MCP Phase 1-4 (2026-04-11) ✅
> All 10 MCP Providers have passed production acceptance


@@ -6,6 +6,23 @@
---
## ✅ 2026-04-25 | Five parallel T0 tasks (P9 methodology)
| Task | Result | Tests | Status |
|------|------|------|------|
| A Telegram button fix | telegram_gateway.py now sets reply_markup | 78/78 ✅ | Pending Staging E2E |
| B ClickHouse false alarm | working_set metric + 0.85 threshold | 4/4 promtool ✅ | ✅ Deployed to production |
| C Gitea CI/CD Webhook | gitea_webhook.py added + HMAC signature verification | 15/15 ✅ | Waiting on GITEA_WEBHOOK_SECRET |
| D ElephantAlpha validation | elephant-alpha retired, replaced by ling-2.6-flash | n/a | ⚠️ MinPrereq: 1 line |
| F Code Review research | Linter ✅ LLM auto-apply ❌ | n/a | Info only |
**Task B evidence**: 2026-04-23 `usage_bytes`=88.5% vs `working_set_bytes`=7.8%; the gap of 80.7 percentage points is page cache
**Root Fix**: `container_memory_working_set_bytes / limit > 0.85` (same source as K8s `kubectl top`)
**Task C to-do**: inject `GITEA_WEBHOOK_SECRET` into K8s + configure the webhook in the Gitea UI (URL + secret + three event types)
---
## 🎯 2026-04-25 (in progress) | Automation flywheel fixes × 4 + Hermes Ollama + qwen3:8b ✅
### B1: auto_execute fully blocked by _ALLOWED_KUBECTL_PATTERN


@@ -1044,8 +1044,15 @@ groups:
runbook: "檢查 node-exporter --collector.* flags 是否該關掉閒置硬體 probe"
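# --- Sentry self-hosted self-monitoring (110) ---
# 2026-04-25 ogt + Claude Opus 4.7: fix the root cause of the false alarm
# The old rule used container_memory_usage_bytes (includes page cache). When ClickHouse
# runs a large query, the OS caches its on-disk data parts in the page cache, so the ratio
# jumped to 88.5% and fired a false alert (evidence from 2026-04-23 23:13: usage_bytes=88.5% / working_set=7.8%).
# Use container_memory_working_set_bytes instead: this is the "real working set"
# (RSS + active page cache, excluding inactive page cache) that the K8s/Docker OOM killer actually tracks.
# Reference: https://github.com/google/cadvisor/blob/master/info/v1/container.go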
- alert: SentryClickHouseMemoryPressure
expr: container_memory_usage_bytes{name=~".*sentry.*clickhouse.*"} / container_spec_memory_limit_bytes{name=~".*sentry.*clickhouse.*"} > 0.8
expr: container_memory_working_set_bytes{name=~".*sentry.*clickhouse.*"} / container_spec_memory_limit_bytes{name=~".*sentry.*clickhouse.*"} > 0.85
for: 10m
labels:
severity: warning
@@ -1055,9 +1062,9 @@ groups:
notification_type: TYPE-1
auto_repair: "false"
annotations:
summary: "Sentry ClickHouse 記憶體使用率 > 80% limit"
description: "sentry clickhouse 用量 / mem_limit = {{ $value | humanizePercentage }}。"
runbook: "檢查 Sentry 查詢壓力;調整 /opt/sentry/docker-compose.override.yml clickhouse mem_limit"
summary: "Sentry ClickHouse 工作集記憶體 > 85% limit"
description: "sentry clickhouse working_set / mem_limit = {{ $value | humanizePercentage }} (排除 page cache)。"
runbook: "檢查 Sentry 查詢壓力;確認非 page cache 假象;必要時調整 /opt/sentry/docker-compose.override.yml clickhouse mem_limit"
- alert: SentryClickHouseCPUThrottled
expr: rate(container_cpu_cfs_throttled_seconds_total{name=~".*sentry.*clickhouse.*"}[5m]) > 1.0
@@ -1076,7 +1083,10 @@ groups:
# --- Gitea self-monitoring ---
- alert: GiteaMemoryPressure
expr: container_memory_usage_bytes{name="gitea"} / container_spec_memory_limit_bytes{name="gitea"} > 0.8
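# 2026-04-25 ogt + Claude Sonnet 4.6: same root cause as the ClickHouse false alarm.
# container_memory_usage_bytes includes page cache (OS inactive, ignored by the OOM killer) → inflated value, false alarm
# Use container_memory_working_set_bytes instead (RSS + active cache, real pressure; cadvisor reports it for both Docker and K8s)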
expr: container_memory_working_set_bytes{name="gitea"} / container_spec_memory_limit_bytes{name="gitea"} > 0.85
for: 10m
labels:
severity: warning
@@ -1086,8 +1096,8 @@ groups:
notification_type: TYPE-1
auto_repair: "false"
annotations:
summary: "Gitea 記憶體使用率 > 80% limit"
description: "gitea 用量 / mem_limit = {{ $value | humanizePercentage }}。"
summary: "Gitea 記憶體工作集 > 85% limit"
description: "gitea working_set / mem_limit = {{ $value | humanizePercentage }}(真實記憶體壓力,非 page cache 干擾)。"
runbook: "檢查 CI/CD 任務堆積;必要時調高 docker-compose mem_limit"
- alert: GiteaCPUThrottled


@@ -0,0 +1,86 @@
# Unit tests for SentryClickHouseMemoryPressure
# 2026-04-25 ogt + Claude Opus 4.7
rule_files:
- ../alerts-unified.yml
evaluation_interval: 1m
tests:
# ---- Negative test 1: page cache high, working_set low (must not fire after the fix) ----
- interval: 1m
name: "page cache spike must NOT alert (the original false-positive scenario)"
input_series:
# working_set: 411 MiB / 8 GiB = 5% (normal)
- series: 'container_memory_working_set_bytes{name="sentry-self-hosted-clickhouse-1"}'
values: '430917632 430917632 430917632 430917632 430917632 430917632 430917632 430917632 430917632 430917632 430917632 430917632 430917632 430917632'
# usage_bytes: 7.5 GiB / 8 GiB = 93.75% (would fire falsely if the rule used the wrong metric)
- series: 'container_memory_usage_bytes{name="sentry-self-hosted-clickhouse-1"}'
values: '8053063680 8053063680 8053063680 8053063680 8053063680 8053063680 8053063680 8053063680 8053063680 8053063680 8053063680 8053063680 8053063680 8053063680'
- series: 'container_spec_memory_limit_bytes{name="sentry-self-hosted-clickhouse-1"}'
values: '8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592'
alert_rule_test:
- eval_time: 12m
alertname: SentryClickHouseMemoryPressure
# Expect no alerts (exp_alerts left empty)
exp_alerts: []
# ---- Negative test 2: working_set fairly high but < 85% (must not fire) ----
- interval: 1m
name: "working_set 80% must NOT alert (below 85% threshold)"
input_series:
# working_set: 6.4 GiB / 8 GiB = 80% (< 85%, must not fire)
- series: 'container_memory_working_set_bytes{name="sentry-self-hosted-clickhouse-1"}'
values: '6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673'
- series: 'container_memory_usage_bytes{name="sentry-self-hosted-clickhouse-1"}'
values: '6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673'
- series: 'container_spec_memory_limit_bytes{name="sentry-self-hosted-clickhouse-1"}'
values: '8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592'
alert_rule_test:
- eval_time: 12m
alertname: SentryClickHouseMemoryPressure
exp_alerts: []
# ---- Positive test 1: working_set > 85% sustained for 10 minutes (must fire) ----
- interval: 1m
name: "working_set 86.7% sustained 10m MUST alert (real memory pressure)"
input_series:
# working_set: 6.94 GiB / 8 GiB = 86.7% (sustained high level)
- series: 'container_memory_working_set_bytes{name="sentry-self-hosted-clickhouse-1"}'
values: '7449424589x14'
- series: 'container_memory_usage_bytes{name="sentry-self-hosted-clickhouse-1"}'
values: '7449424589x14'
- series: 'container_spec_memory_limit_bytes{name="sentry-self-hosted-clickhouse-1"}'
values: '8589934592x14'
alert_rule_test:
- eval_time: 12m
alertname: SentryClickHouseMemoryPressure
exp_alerts:
- exp_labels:
alertname: SentryClickHouseMemoryPressure
alert_category: infrastructure
auto_repair: "false"
component: sentry-clickhouse
name: sentry-self-hosted-clickhouse-1
notification_type: TYPE-1
severity: warning
team: platform
exp_annotations:
summary: "Sentry ClickHouse 工作集記憶體 > 85% limit"
description: "sentry clickhouse working_set / mem_limit = 86.72% (排除 page cache)。"
runbook: "檢查 Sentry 查詢壓力;確認非 page cache 假象;必要時調整 /opt/sentry/docker-compose.override.yml clickhouse mem_limit"
# ---- Positive test 2: a spike shorter than 10 minutes must not fire (filtered out by for: 10m) ----
- interval: 1m
name: "working_set 90% spike for only 5m must NOT alert (for:10m guard)"
input_series:
# 90% for the first 5 minutes, then back down to 5%
- series: 'container_memory_working_set_bytes{name="sentry-self-hosted-clickhouse-1"}'
values: '7730941132 7730941132 7730941132 7730941132 7730941132 430917632 430917632 430917632 430917632 430917632 430917632 430917632'
- series: 'container_memory_usage_bytes{name="sentry-self-hosted-clickhouse-1"}'
values: '7730941132 7730941132 7730941132 7730941132 7730941132 430917632 430917632 430917632 430917632 430917632 430917632 430917632'
- series: 'container_spec_memory_limit_bytes{name="sentry-self-hosted-clickhouse-1"}'
values: '8589934592x12'
alert_rule_test:
- eval_time: 11m
alertname: SentryClickHouseMemoryPressure
exp_alerts: []