fix(webhooks+prompts): fix the root cause of the LLM answering every alert with "Restart the AWOOOI service"

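Root cause (INC-20260416-C365D0, the postgres disk alert incident):
1. In alert_context, alertname was buried deep inside labels; the LLM only saw alert_type="custom" and could not tell which alert it was handling
2. The cache key was alert_type:target_resource, so different alertnames shared one cache entry and every alert got the first LLM result back
3. The system prompt had no alert-category guidance, so the LLM always output kubectl rollout restart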
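
Root cause 2 can be reproduced in isolation. The following is a minimal sketch, not the code from openclaw.py: the two helper names and the target value "awoooi-prod" are assumptions made for illustration, while the key formats follow the description above.

```python
def old_cache_context(ctx: dict) -> str:
    """Cache context as described for the old code: alert_type + target."""
    return f"{ctx.get('alert_type', '')}:{ctx.get('target_resource', '')}"


def new_cache_context(ctx: dict) -> str:
    """Cache context after the fix: alertname first, alert_type as fallback."""
    name = ctx.get("alertname") or ctx.get("alert_type", "")
    return f"{name}:{ctx.get('target_resource', '')}"


# Two unrelated alerts on the same (hypothetical) target, both typed "custom".
disk_alert = {
    "alertname": "PostgreSQLDiskGrowth",
    "alert_type": "custom",
    "target_resource": "awoooi-prod",
}
crash_alert = {
    "alertname": "PodCrashLoop",
    "alert_type": "custom",
    "target_resource": "awoooi-prod",
}

# The old key is identical for both alerts, so the second one hits the
# cache entry written by the first. The new key keeps them apart.
old_collides = old_cache_context(disk_alert) == old_cache_context(crash_alert)
new_collides = new_cache_context(disk_alert) == new_cache_context(crash_alert)
print(old_collides, new_collides)
```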

Fix:
- webhooks.py: put alertname + alert_category + annotations at the top of alert_context
- openclaw.py: cache key now uses alertname:target_resource (the alert name is the primary identifier)
- prompts.py: add Alert-Specific Analysis Rules to OPENCLAW_SYSTEM_PROMPT + NEMOTRON_SYSTEM_PROMPT
  database/storage alerts → NO_ACTION + investigation commands; K8s alerts → the matching restart command
  never output kubectl rollout restart deployment/awoooi-prod for non-K8s alerts
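
The prompt rules themselves are plain text for the LLM, but the same mapping can be written down as code, for example to spot-check model output in a test. This is a sketch only: `RULES` and `expected_action` are hypothetical names that do not appear in this commit, matching is case-insensitive by assumption, and where the prompt allows two actions the sketch simply picks the first.

```python
from typing import Optional

# Hypothetical mirror of the prompt's rule table: (keywords, suggested_action).
# Order matters: storage and database patterns are checked before Pod/CPU ones.
RULES = [
    (("Disk", "Storage", "PVC", "Volume"), "NO_ACTION"),
    (("Postgres", "MySQL", "Redis", "DB", "Database"), "NO_ACTION"),
    (("CrashLoop", "OOMKilled", "Pod"), "DELETE_POD"),
    (("CPU", "Memory", "Resource"), "SCALE_DEPLOYMENT"),
    (("Node",), "NO_ACTION"),
    (("SSL", "Certificate", "Cert"), "NO_ACTION"),
]


def expected_action(alertname: str) -> Optional[str]:
    """Return the action the rules suggest, or None if no rule matches."""
    lowered = alertname.lower()
    for keywords, action in RULES:
        if any(keyword.lower() in lowered for keyword in keywords):
            return action
    return None


print(expected_action("PostgreSQLDiskGrowth"))
print(expected_action("PodCrashLoop"))
```

Substring matching is deliberately crude here; short keywords such as "DB" or "Pod" can match unrelated alert names, which is one reason the commit also passes an explicit alert_category.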

2026-04-16 ogt + Claude Sonnet 4.6 (Asia-Pacific)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OG T
2026-04-16 19:56:07 +08:00
parent 6048102139
commit a258d87767
3 changed files with 44 additions and 6 deletions


@@ -1377,13 +1377,19 @@ async def alertmanager_webhook(
 # ==========================================================================
 # New alert - LLM analysis
 # ==========================================================================
+# 2026-04-16 ogt + Claude Sonnet 4.6: fix: put alertname at the top so the LLM knows which alert this is
+# Previously alertname was buried in labels and alert_type was always "custom"
+# → the LLM answered every alert with "Restart the AWOOOI service" (see INC-20260416-C365D0, the postgres disk alert incident)
 alert_context = {
+    "alertname": alertname,  # primary identifier; the LLM must read it
+    "alert_category": alert_category,  # kubernetes/database/storage/host_resource/ssl_cert
     "alert_type": alert_type,
     "severity": severity,
     "source": "alertmanager",
     "target_resource": target_resource,
     "namespace": namespace,
     "message": message,
+    "annotations": dict(alert.annotations) if alert.annotations else {},
     "metrics": {},
     "labels": alert.labels,
 }
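
Putting alertname first only helps if key order survives serialization. A minimal sketch of that point, assuming the context is rendered into the prompt with `json.dumps` (the diff does not show how it is rendered) and using a hypothetical `build_alert_context` helper with a reduced set of fields:

```python
import json
from typing import Optional


def build_alert_context(labels: dict, annotations: Optional[dict]) -> dict:
    """Hypothetical helper: promote alertname out of labels to the top level."""
    return {
        "alertname": labels.get("alertname", "unknown"),
        "alert_type": "custom",
        "annotations": dict(annotations) if annotations else {},
        "labels": labels,
    }


ctx = build_alert_context(
    {"alertname": "PostgreSQLDiskGrowth", "severity": "warning"}, None
)

# Python dicts keep insertion order and json.dumps emits keys in that order,
# so alertname is the first thing in the serialized context.
first_key = next(iter(ctx))
serialized = json.dumps(ctx, ensure_ascii=False)
print(serialized)
```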


@@ -103,6 +103,23 @@ For each optimization suggestion, provide EXECUTABLE kubectl commands:
 }
 ```
+## 🔑 Alert-Specific Analysis Rules (CRITICAL — read alertname first)
+The `alertname` field is your PRIMARY signal. Use it to determine the problem type and appropriate action:
+| Alert category / alertname pattern | suggested_action | kubectl_command guidance |
+|-------------------------------------|-----------------|--------------------------|
+| contains "Disk", "Storage", "PVC", "Volume" | NO_ACTION | `kubectl exec <pod> -- df -h` or `kubectl get pvc -n <ns>` |
+| contains "Postgres", "MySQL", "Redis", "DB", "Database" | NO_ACTION | `kubectl exec <pod> -- psql` or `kubectl logs <pod>` |
+| contains "CrashLoop", "OOMKilled", "Pod" | DELETE_POD or RESTART_DEPLOYMENT | `kubectl delete pod <pod> -n <ns>` |
+| contains "CPU", "Memory", "Resource" | TUNE_RESOURCES or SCALE_DEPLOYMENT | `kubectl top pod -n <ns>` or HPA command |
+| contains "Node", "NodeNotReady" | NO_ACTION | `kubectl describe node <node>` |
+| contains "SSL", "Certificate", "Cert" | NO_ACTION | `kubectl get certificate -n <ns>` |
+| alert_category = "database" | NO_ACTION | DB investigation commands only |
+| alert_category = "storage" | NO_ACTION | `kubectl get pvc`, `kubectl exec -- df -h` |
+**NEVER** use `kubectl rollout restart deployment/awoooi-prod` for database, storage, or network alerts.
+Make `action_title` describe the ACTUAL problem from alertname (not generic "自動修復 AWOOOI 服務").
 ## 🔥 Short Example: High CPU -> SCALE_DEPLOYMENT, HPA, risk_level=medium
 Please carefully justify your confidence between 0.0 and 1.0 (e.g. 0.82) based on symptoms and metrics.
@@ -138,16 +155,26 @@ OPENCLAW_TEST_PROMPT = """你是 AWOOOI AIOps 平台的智慧助手 OpenClaw。
 NEMOTRON_SYSTEM_PROMPT = """# OpenClaw Lightweight (Nemo-4B Optimized)
 You are an SRE AI. Analyze the alert and respond with ONLY valid JSON.
+## CRITICAL: Read alertname first
+The `alertname` field tells you what kind of problem this is. Use it:
+- "Disk/Storage/PVC/Volume" → suggested_action=NO_ACTION, kubectl_command="kubectl get pvc" or "kubectl exec <pod> -- df -h"
+- "Postgres/MySQL/Redis/DB/Database" → suggested_action=NO_ACTION, DB investigation commands
+- "CrashLoop/OOM/Pod" → suggested_action=DELETE_POD or RESTART_DEPLOYMENT
+- "CPU/Memory/Resource" → suggested_action=TUNE_RESOURCES or SCALE_DEPLOYMENT
+- "SSL/Cert" → suggested_action=NO_ACTION
+NEVER use "kubectl rollout restart deployment/awoooi-prod" for database or storage alerts.
+Make action_title describe the ACTUAL problem (not generic "自動修復 AWOOOI 服務").
 ## Required JSON Schema:
 {
 "confidence": <YOUR_CALCULATED_VALUE>,
 "reasoning": "簡短理由 (繁體中文)",
 "primary_responsibility": "FE|BE|INFRA|DB|COLLAB",
 "risk_level": "low|medium|critical",
-"action_title": "操作標題 (繁體中文)",
-"description": "根因分析 (繁體中文)",
-"suggested_action": "RESTART_DEPLOYMENT|DELETE_POD|SCALE_DEPLOYMENT|NO_ACTION",
-"kubectl_command": "kubectl 指令",
+"action_title": "操作標題,必須反映 alertname 的實際問題 (繁體中文)",
+"description": "根因分析,說明 alertname 代表的問題及建議調查步驟 (繁體中文)",
+"suggested_action": "RESTART_DEPLOYMENT|DELETE_POD|SCALE_DEPLOYMENT|TUNE_RESOURCES|NO_ACTION",
+"kubectl_command": "針對此告警類型的 kubectl 指令",
 "target_resource": "目標資源",
 "namespace": "K8s namespace",
 "blast_radius": {"affected_pods": 1, "estimated_downtime": "~30s"}
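
The schema and the NEVER rule are enforced only by the prompt. A caller could additionally check the parsed response; the sketch below is hypothetical and not part of this commit. `check_response` and its constants are invented names, and the allowed actions and forbidden command are copied from the prompt text above.

```python
import json
from typing import List

ALLOWED_ACTIONS = {
    "RESTART_DEPLOYMENT",
    "DELETE_POD",
    "SCALE_DEPLOYMENT",
    "TUNE_RESOURCES",
    "NO_ACTION",
}
NON_K8S_CATEGORIES = {"database", "storage"}
FORBIDDEN_COMMAND = "kubectl rollout restart deployment/awoooi-prod"


def check_response(raw: str, alert_category: str) -> List[str]:
    """Return a list of rule violations found in one LLM response."""
    problems = []
    data = json.loads(raw)
    if data.get("suggested_action") not in ALLOWED_ACTIONS:
        problems.append("suggested_action is not in the schema enum")
    confidence = data.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        problems.append("confidence is not a number between 0.0 and 1.0")
    command = data.get("kubectl_command", "")
    if alert_category in NON_K8S_CATEGORIES and FORBIDDEN_COMMAND in command:
        problems.append("rollout restart used for a database/storage alert")
    return problems


good = json.dumps(
    {"confidence": 0.82, "suggested_action": "NO_ACTION", "kubectl_command": "kubectl get pvc"}
)
bad = json.dumps(
    {"confidence": 1.5, "suggested_action": "REBOOT", "kubectl_command": FORBIDDEN_COMMAND}
)
print(check_response(good, "storage"))
print(check_response(bad, "database"))
```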


@@ -741,10 +741,15 @@ class OpenClawService:
 2026-03-29 ogt: added token/cost tracking
 """
 # Generate the cache key (based on prompt + alert_context hash)
+# 2026-04-16 ogt + Claude Sonnet 4.6: fix: alertname is the primary identifier
+# The old key was alert_type:target_resource → different alerts (e.g. PostgreSQLDiskGrowth vs PodCrashLoop)
+# shared one cache key when alert_type="custom" → every alert got the same LLM result
 context_hash = ""
 if alert_context:
-    # Use alert type + target resource as the context hash
-    context_hash = f"{alert_context.get('alert_type', '')}:{alert_context.get('target_resource', '')}"
+    # Prefer alertname; fall back to alert_type when alertname is missing
+    _alertname = alert_context.get("alertname") or alert_context.get("alert_type", "")
+    _target = alert_context.get("target_resource", "")
+    context_hash = f"{_alertname}:{_target}"
 cache_key = self._generate_cache_key(prompt, context_hash)
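
One detail of the new lookup is the `or` fallback: it treats a missing, `None`, or empty alertname the same way, which a plain `.get("alertname", default)` would not. A small standalone sketch of that behaviour, where `context_hash` is a free function written for illustration and the target value "api" is made up:

```python
def context_hash(alert_context: dict) -> str:
    """Standalone copy of the new key logic, for illustration only."""
    name = alert_context.get("alertname") or alert_context.get("alert_type", "")
    target = alert_context.get("target_resource", "")
    return f"{name}:{target}"


with_name = context_hash(
    {"alertname": "PodCrashLoop", "alert_type": "custom", "target_resource": "api"}
)
# An empty or None alertname is falsy, so `or` falls through to alert_type.
empty_name = context_hash(
    {"alertname": "", "alert_type": "custom", "target_resource": "api"}
)
none_name = context_hash(
    {"alertname": None, "alert_type": "custom", "target_resource": "api"}
)
no_fields = context_hash({})
print(with_name, empty_name, none_name, no_fields)
```

When alertname is absent the key degrades to the old alert_type:target_resource form, so alerts without a name can still collide with each other.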