diff --git a/.agents/skills/04-awoooi-devops-commander.md b/.agents/skills/04-awoooi-devops-commander.md
index 892519b9..5a340413 100644
--- a/.agents/skills/04-awoooi-devops-commander.md
+++ b/.agents/skills/04-awoooi-devops-commander.md
@@ -38,6 +38,7 @@
 | v2.5 | 2026-04-09 | Claude Sonnet 4.6 | **🔴 SSH auto-remediation full chain: dual-host E2E closed loop + 12 bug fixes** |
 | v2.6 | 2026-04-11 | Claude Sonnet 4.6 | **Sprint B-1 Ansible IaC skeleton + Architecture Review security fixes** |
 | v2.7 | 2026-04-11 | Claude Sonnet 4.6 | **Sprint B-2/B-3 ArgoCD GitOps + Sprint C Velero/rsync DR + ADR-070 MCP Phase 1-4 fully automated AIOps closed loop + ADR-071 four alert notification types** |
+| v2.8 | 2026-04-25 | Claude Sonnet 4.6 | **🔴 Prometheus memory metric selection rules (working_set vs usage_bytes) + Gitea HMAC webhook rules** |
 
 ---
 
@@ -1369,6 +1370,103 @@ Security requirements identified by the Architecture Review (2026-04-11):
 
 ---
 
+## 🔴 Prometheus Memory Metric Selection Rules (2026-04-25)
+
+> **Incident**: ClickHouse fired a false alert at 2026-04-23 23:13: `usage_bytes` was at 88.5% of the limit, while the real pressure, `working_set_bytes`, was at 7.8%
+> **Root cause**: the wrong metric was chosen; the threshold was not the problem
+
+### How the two metrics differ
+
+| Metric | Meaning | Tracked by OOM killer | Use in alerts |
+|------|------|--------------|---------|
+| `container_memory_usage_bytes` | RSS + page cache (including inactive OS cache) | ❌ No | ❌ Forbidden for memory pressure alerts |
+| `container_memory_working_set_bytes` | RSS + active cache (same source as K8s `kubectl top`) | ✅ Real pressure | ✅ Required for memory pressure alerts |
+
+### Hard rule
+
+```yaml
+# ❌ Strictly forbidden: includes page cache and produces false alerts
+- alert: MemoryPressure
+  expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.8
+
+# ✅ Required: industry standard, same source as K8s kubectl top, the basis the OOM killer uses
+- alert: MemoryPressure
+  expr: container_memory_working_set_bytes{container!="", container!="POD"} / container_spec_memory_limit_bytes{container!="", container!="POD"} > 0.85
+  for: 10m
+```
+
+**Why 0.85 (not 0.8)**: under `working_set` semantics, 85% is the level that indicates real memory pressure; 0.8 is overly conservative
+**Why `for: 10m`**: filters out momentary spikes; real pressure must persist for 10 minutes before the alert fires
+
+### PromQL tests (required)
+
+When adding or modifying a memory alert rule, you must add 4 test cases with `promtool test rules`:
+- Negative test 1: `usage_bytes` high + `working_set` low → no alert
+- Negative test 2: `working_set` slightly below the threshold → no alert
+- Positive test 1: `working_set` above the threshold for a sustained 10 minutes → alert
+- Positive test 2: `working_set` above the threshold for less than 10 minutes → no alert
+
+**Test file location**: `ops/monitoring/tests/`
+
+---
+
+## 🔗 Gitea CI/CD Webhook Integration (2026-04-25)
+
+> **New endpoint**: POST `/api/v1/webhooks/gitea`
+> **Implementation**: `apps/api/src/integrations/gitea_webhook.py`
+
+### Signature verification
+
+```python
+import hashlib
+import hmac
+
+# Gitea sends the signature in the X-Gitea-Signature header (unlike GitHub)
+def _verify_gitea_signature(payload: bytes, signature: str, secret: str) -> bool:
+    expected = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
+    return hmac.compare_digest(expected, signature)
+```
+
+### Three event types + URL routing
+
+| Event | Trigger condition | Telegram message format |
+|------|---------|-----------------|
+| PR merged | `pull_request.merged == true` | 🔀 PR merged notification |
+| CI failure | `workflow_run.conclusion == "failure"` | 🔴 CI failure alert |
+| Deploy failure | `check_run.conclusion == "failure" && name contains "deploy"` | 🚨 Deploy failure alert |
+
+### K8s configuration requirements
+
+```yaml
+# The K8s Secret must include this key (a placeholder exists in 03-secrets.yaml)
+GITEA_WEBHOOK_SECRET:
+
+# Gitea UI settings
+URL: https://api.awoooi.wooo.work/api/v1/webhooks/gitea
+Content-Type: application/json
+Secret: <same as the K8s Secret>
+Events: Pull Request + Workflow Run
+```
+
+### Deduplication
+
+Redis SET NX EX 600s (`dedup:gitea:{event}:{sha[:8]}`): the same event is not pushed again within 10 minutes.
+
+### E2E verification
+
+```bash
+# Confirm the Secret is injected
+kubectl get secret awoooi-secrets -n awoooi-prod -o jsonpath='{.data.GITEA_WEBHOOK_SECRET}' | base64 -d
+
+# Test directly that the endpoint is reachable
+curl -s -X POST https://api.awoooi.wooo.work/api/v1/webhooks/gitea \
+  -H "Content-Type: application/json" \
+  -d '{}' | jq '.detail'
+# Expected: "Missing signature" or "Invalid signature" (the endpoint exists and signature verification is active)
+```
+
+---
+
 ## 🤖 ADR-070 Fully Automated AIOps Closed Loop: MCP Phase 1-4 (2026-04-11) ✅
 
 > All 10 MCP Providers have passed production acceptance
diff --git a/docs/LOGBOOK.md b/docs/LOGBOOK.md
index 36176d87..838ccd44 100644
--- a/docs/LOGBOOK.md
+++ b/docs/LOGBOOK.md
@@ -6,6 +6,23 @@
 
 ---
 
+## ✅ 2026-04-25 | T0 five parallel tasks (P9 methodology)
+
+| Task | Outcome | Tests | Status |
+|------|------|------|------|
+| A Telegram button fix | telegram_gateway.py: added reply_markup | 78/78 ✅ | Pending Staging E2E |
+| B ClickHouse false alert | working_set metric + 0.85 threshold | 4/4 promtool ✅ | ✅ Deployed to production |
+| C Gitea CI/CD Webhook | gitea_webhook.py added + HMAC signature verification | 15/15 ✅ | Pending GITEA_WEBHOOK_SECRET |
+| D ElephantAlpha verification | elephant-alpha deprecated, replaced with ling-2.6-flash | n/a | ⚠️ MinPrereq: 1 line |
+| F Code Review research | Linter ✅ LLM auto-apply ❌ | n/a | Info only |
+
+**Task B evidence**: on 2026-04-23, `usage_bytes`=88.5% vs `working_set_bytes`=7.8%; the 80.7 percentage point gap is page cache
+**Root Fix**: `container_memory_working_set_bytes / limit > 0.85` (same source as K8s `kubectl top`)
+
+**Task C to-do**: inject `GITEA_WEBHOOK_SECRET` into K8s + configure the webhook in the Gitea UI (URL + secret + three event types)
+
+---
+
 ## 🎯 2026-04-25 (in progress) | Automation flywheel fixes × 4 + Hermes Ollama + qwen3:8b ✅
 
 ### B1: auto_execute fully blocked by _ALLOWED_KUBECTL_PATTERN
diff --git a/ops/monitoring/alerts-unified.yml b/ops/monitoring/alerts-unified.yml
index da659b91..7a601db4 100644
--- a/ops/monitoring/alerts-unified.yml
+++ b/ops/monitoring/alerts-unified.yml
@@ -1044,8 +1044,15 @@ groups:
           runbook: "Check whether node-exporter --collector.* flags should disable idle hardware probes"
 
       # --- Sentry self-hosted self-monitoring (110) ---
+      # 2026-04-25 ogt + Claude Opus 4.7: fix the root cause of the false alert
+      # The old rule used container_memory_usage_bytes (includes page cache). When ClickHouse
+      # ran large queries, the OS cached SSTables in page cache and the ratio jumped to 88.5%, firing a false alert
+      # (evidence from 2026-04-23 23:13: usage_bytes=88.5% / working_set=7.8%).
+      # Switch to container_memory_working_set_bytes: this is the "real working set" that the K8s/Docker OOM killer
+      # actually tracks (RSS + active page cache), excluding inactive page cache.
+      # Reference: https://github.com/google/cadvisor/blob/master/info/v1/container.go
       - alert: SentryClickHouseMemoryPressure
-        expr: container_memory_usage_bytes{name=~".*sentry.*clickhouse.*"} / container_spec_memory_limit_bytes{name=~".*sentry.*clickhouse.*"} > 0.8
+        expr: container_memory_working_set_bytes{name=~".*sentry.*clickhouse.*"} / container_spec_memory_limit_bytes{name=~".*sentry.*clickhouse.*"} > 0.85
         for: 10m
         labels:
           severity: warning
@@ -1055,9 +1062,9 @@ groups:
           notification_type: TYPE-1
           auto_repair: "false"
         annotations:
-          summary: "Sentry ClickHouse memory usage > 80% of limit"
-          description: "sentry clickhouse usage / mem_limit = {{ $value | humanizePercentage }}."
-          runbook: "Check Sentry query load; adjust clickhouse mem_limit in /opt/sentry/docker-compose.override.yml"
+          summary: "Sentry ClickHouse working-set memory > 85% of limit"
+          description: "sentry clickhouse working_set / mem_limit = {{ $value | humanizePercentage }} (excludes page cache)."
+          runbook: "Check Sentry query load; confirm it is not a page cache artifact; if needed, adjust clickhouse mem_limit in /opt/sentry/docker-compose.override.yml"
 
       - alert: SentryClickHouseCPUThrottled
         expr: rate(container_cpu_cfs_throttled_seconds_total{name=~".*sentry.*clickhouse.*"}[5m]) > 1.0
@@ -1076,7 +1083,10 @@ groups:
 
       # --- Gitea self-monitoring ---
       - alert: GiteaMemoryPressure
-        expr: container_memory_usage_bytes{name="gitea"} / container_spec_memory_limit_bytes{name="gitea"} > 0.8
+        # 2026-04-25 ogt + Claude Sonnet 4.6: same root cause as the ClickHouse false alert.
+        # container_memory_usage_bytes includes page cache (inactive OS cache, ignored by the OOM killer) → inflated ratio, false alerts
+        # Switch to container_memory_working_set_bytes (RSS + active cache, real pressure; cAdvisor covers Docker + K8s)
+        expr: container_memory_working_set_bytes{name="gitea"} / container_spec_memory_limit_bytes{name="gitea"} > 0.85
         for: 10m
         labels:
           severity: warning
@@ -1086,8 +1096,8 @@ groups:
           notification_type: TYPE-1
           auto_repair: "false"
         annotations:
-          summary: "Gitea memory usage > 80% of limit"
-          description: "gitea usage / mem_limit = {{ $value | humanizePercentage }}."
+          summary: "Gitea working-set memory > 85% of limit"
+          description: "gitea working_set / mem_limit = {{ $value | humanizePercentage }} (real memory pressure, not page cache noise)."
           runbook: "Check for a CI/CD job backlog; raise docker-compose mem_limit if needed"
 
       - alert: GiteaCPUThrottled
diff --git a/ops/monitoring/tests/clickhouse_memory_test.yml b/ops/monitoring/tests/clickhouse_memory_test.yml
new file mode 100644
index 00000000..565cbe34
--- /dev/null
+++ b/ops/monitoring/tests/clickhouse_memory_test.yml
@@ -0,0 +1,86 @@
+# Unit tests for SentryClickHouseMemoryPressure
+# 2026-04-25 ogt + Claude Opus 4.7
+rule_files:
+  - ../alerts-unified.yml
+
+evaluation_interval: 1m
+
+tests:
+  # ---- Negative test 1: page cache high, working_set low (must not fire after the fix) ----
+  - interval: 1m
+    name: "page cache spike must NOT alert (the original false-positive scenario)"
+    input_series:
+      # working_set: 411 MiB / 8 GiB = 5% (normal)
+      - series: 'container_memory_working_set_bytes{name="sentry-self-hosted-clickhouse-1"}'
+        values: '430917632 430917632 430917632 430917632 430917632 430917632 430917632 430917632 430917632 430917632 430917632 430917632 430917632 430917632'
+      # usage_bytes: 7.5 GiB / 8 GiB = 93.75% (would fire falsely if the rule used the wrong metric)
+      - series: 'container_memory_usage_bytes{name="sentry-self-hosted-clickhouse-1"}'
+        values: '8053063680 8053063680 8053063680 8053063680 8053063680 8053063680 8053063680 8053063680 8053063680 8053063680 8053063680 8053063680 8053063680 8053063680'
+      - series: 'container_spec_memory_limit_bytes{name="sentry-self-hosted-clickhouse-1"}'
+        values: '8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592'
+    alert_rule_test:
+      - eval_time: 12m
+        alertname: SentryClickHouseMemoryPressure
+        # Expect no alerts at all (exp_alerts left empty)
+        exp_alerts: []
+
+  # ---- Negative test 2: working_set elevated but < 85% (must not fire) ----
+  - interval: 1m
+    name: "working_set 80% must NOT alert (below 85% threshold)"
+    input_series:
+      # working_set: 6.4 GiB / 8 GiB = 80% (< 85%, must not fire)
+      - series: 'container_memory_working_set_bytes{name="sentry-self-hosted-clickhouse-1"}'
+        values: '6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673'
+      - series: 'container_memory_usage_bytes{name="sentry-self-hosted-clickhouse-1"}'
+        values: '6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673 6871947673'
+      - series: 'container_spec_memory_limit_bytes{name="sentry-self-hosted-clickhouse-1"}'
+        values: '8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592 8589934592'
+    alert_rule_test:
+      - eval_time: 12m
+        alertname: SentryClickHouseMemoryPressure
+        exp_alerts: []
+
+  # ---- Positive test 1: working_set > 85% sustained for 10 minutes (must fire) ----
+  - interval: 1m
+    name: "working_set 86.7% sustained 10m MUST alert (real memory pressure)"
+    input_series:
+      # working_set: 6.94 GiB / 8 GiB = 86.7% (sustained high watermark)
+      - series: 'container_memory_working_set_bytes{name="sentry-self-hosted-clickhouse-1"}'
+        values: '7449424589x14'
+      - series: 'container_memory_usage_bytes{name="sentry-self-hosted-clickhouse-1"}'
+        values: '7449424589x14'
+      - series: 'container_spec_memory_limit_bytes{name="sentry-self-hosted-clickhouse-1"}'
+        values: '8589934592x14'
+    alert_rule_test:
+      - eval_time: 12m
+        alertname: SentryClickHouseMemoryPressure
+        exp_alerts:
+          - exp_labels:
+              alertname: SentryClickHouseMemoryPressure
+              alert_category: infrastructure
+              auto_repair: "false"
+              component: sentry-clickhouse
+              name: sentry-self-hosted-clickhouse-1
+              notification_type: TYPE-1
+              severity: warning
+              team: platform
+            exp_annotations:
+              summary: "Sentry ClickHouse working-set memory > 85% of limit"
+              description: "sentry clickhouse working_set / mem_limit = 86.72% (excludes page cache)."
+              runbook: "Check Sentry query load; confirm it is not a page cache artifact; if needed, adjust clickhouse mem_limit in /opt/sentry/docker-compose.override.yml"
+
+  # ---- Positive test 2: spike shorter than 10 minutes (must not fire; filtered out by for: 10m) ----
+  - interval: 1m
+    name: "working_set 90% spike for only 5m must NOT alert (for:10m guard)"
+    input_series:
+      # First 5 minutes at 90%, then back down to 5%
+      - series: 'container_memory_working_set_bytes{name="sentry-self-hosted-clickhouse-1"}'
+        values: '7730941132 7730941132 7730941132 7730941132 7730941132 430917632 430917632 430917632 430917632 430917632 430917632 430917632'
+      - series: 'container_memory_usage_bytes{name="sentry-self-hosted-clickhouse-1"}'
+        values: '7730941132 7730941132 7730941132 7730941132 7730941132 430917632 430917632 430917632 430917632 430917632 430917632 430917632'
+      - series: 'container_spec_memory_limit_bytes{name="sentry-self-hosted-clickhouse-1"}'
+        values: '8589934592x12'
+    alert_rule_test:
+      - eval_time: 11m
+        alertname: SentryClickHouseMemoryPressure
+        exp_alerts: []
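
Review note: the four promtool cases above can be sanity-checked without Prometheus. The sketch below is a rough Python approximation, not the promtool implementation. It assumes one sample per minute and models `for:` as "pending from the first breaching sample, firing once the breach has lasted the hold time". The helper name `fires` and its parameters are hypothetical; only the byte values, thresholds, and eval times are taken from the diff.

```python
# Hypothetical helper: approximates how a `ratio > threshold` rule with a
# `for:` hold time behaves on evenly spaced 1m samples.
LIMIT = 8589934592  # 8 GiB, the container_spec_memory_limit_bytes fixture value


def fires(samples, eval_minute, threshold=0.85, hold_minutes=10, limit=LIMIT):
    """Return True if the alert would be firing at eval_minute."""
    pending_since = None
    for minute, value in enumerate(samples[: eval_minute + 1]):
        if value / limit > threshold:
            if pending_since is None:
                pending_since = minute  # alert enters the pending state
        else:
            pending_since = None  # condition cleared, pending state resets
    return pending_since is not None and eval_minute - pending_since >= hold_minutes


# Fixture series copied from clickhouse_memory_test.yml
neg1_working_set = [430917632] * 14                    # about 5% of the limit
neg1_usage = [8053063680] * 14                         # 93.75% of the limit
neg2_working_set = [6871947673] * 14                   # 80% of the limit
pos1_working_set = [7449424589] * 15                   # about 86.7% ('x14' expands to 15 samples)
pos2_working_set = [7730941132] * 5 + [430917632] * 7  # 90% for 5m, then 5%

print(fires(neg1_working_set, 12))                # new rule stays quiet on page cache spikes
print(fires(neg1_usage, 12, threshold=0.8))       # old usage_bytes rule would have fired
print(fires(neg2_working_set, 12))                # 80% is below the 0.85 threshold
print(fires(pos1_working_set, 12))                # sustained pressure fires
print(fires(pos2_working_set, 11))                # 5m spike is filtered by the hold time
```

The second call reproduces the original false positive: the same container that looks idle by working set exceeds the old 0.8 threshold when measured with `usage_bytes`.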