fix(solver): set AGENT_SOLVER_TIMEOUT_SEC=80 + forbid reflexive restarts in the prompt
Some checks failed
CD Pipeline / build-and-deploy (push) Has been cancelled

Problem 1: AGENT_SOLVER_TIMEOUT_SEC defaults to 20s and was not set in K8s → deepseek-r1:14b
       always times out → candidates=[] → action="" → Telegram shows "Pending analysis" (待分析) + "Rule-based analysis" (規則分析)
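
The failure chain in Problem 1 can be sketched as follows. This is a minimal illustration, not the project's code: the helper names (`solver_timeout_sec`, `solve_with_timeout`, `slow_llm`) and the shape of the fallback result are assumptions; only the variable name AGENT_SOLVER_TIMEOUT_SEC and the 20s default come from the message above.

```python
import asyncio
import os

# Assumed default, mirroring the 20 s described in the commit message.
DEFAULT_SOLVER_TIMEOUT_SEC = 20.0


def solver_timeout_sec(env=None):
    """Read AGENT_SOLVER_TIMEOUT_SEC, falling back to the default when unset."""
    env = os.environ if env is None else env
    return float(env.get("AGENT_SOLVER_TIMEOUT_SEC", DEFAULT_SOLVER_TIMEOUT_SEC))


async def solve_with_timeout(llm_call, timeout_sec):
    """Await the LLM call; on timeout return an empty result (hypothetical shape)."""
    try:
        return await asyncio.wait_for(llm_call(), timeout=timeout_sec)
    except asyncio.TimeoutError:
        # No candidates means downstream code has no action to display.
        return {"candidates": [], "action": ""}


async def slow_llm():
    # Stand-in for a slow model call (hypothetical).
    await asyncio.sleep(0.2)
    return {"candidates": [{"action": "df -h"}], "action": "df -h"}


timed_out = asyncio.run(solve_with_timeout(slow_llm, timeout_sec=0.05))
completed = asyncio.run(solve_with_timeout(slow_llm, timeout_sec=1.0))
```

With a budget shorter than the model's response time the call always falls into the empty branch, which matches the symptom described above.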

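Problem 2: the Solver prompt's JSON example only contained restart + kubectl top, and the LLM imitates the example
       → every alert gets a restart recommendation, whereas HostDisk/CPU alerts should prioritize diagnosis + cleanup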

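Fixes:
- K8s: add AGENT_SOLVER_TIMEOUT_SEC=80 (< OPENCLAW_TIMEOUT=120, leaving a buffer)
- Solver prompt: add rules mapping root causes to fixes: HostDisk→df/du/journalctl, CPU→top/ps,
  OOM→kubectl logs; "restart first" is forbidden
- JSON example changed to a HostDisk SSH diagnostic scenario, no longer K8s commands only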
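
The root-cause mapping in the fix list can be pictured as a lookup table. The sketch below is only an illustration under stated assumptions: the commit enforces this ordering through prompt text, not through code, and `DIAGNOSTICS_FIRST` and `plan_actions` are hypothetical names. The command lists follow the mapping above.

```python
# Diagnostics-first lookup: alert category -> commands to try before any restart.
# Table contents follow the commit message; the structure itself is hypothetical.
DIAGNOSTICS_FIRST = {
    "HostDisk": ["df -h", "du -sh /var/log/*", "journalctl --disk-usage"],
    "CPU": ["top -bn1", "ps aux"],
    "OOM": ["kubectl logs <pod> -n <namespace>"],
}


def plan_actions(category, restart_command):
    """Return diagnostics for the category first; a restart, if any, goes last."""
    diagnostics = DIAGNOSTICS_FIRST.get(category, [])
    return [*diagnostics, restart_command]


cpu_plan = plan_actions("CPU", "kubectl rollout restart deployment/<name>")
unknown_plan = plan_actions("Unknown", "kubectl rollout restart deployment/<name>")
```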

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Your Name
2026-04-27 15:51:42 +08:00
parent b432becd4e
commit c3fa03fc19
2 changed files with 44 additions and 33 deletions


@@ -667,66 +667,75 @@ class SolverAgent(BaseAgent):
Alert category: {_safe_category}
Diagnostic confidence: {_confidence_pct}
{_inventory_section}{_non_k8s_warning}{_mcp_section}
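Your job: propose 1-3 candidate fixes for this root cause, and also output 0-3 structured recommended_actions.
Your job: based on the root-cause hypothesis, propose 1-3 targeted fixes, and also output 0-3 structured recommended_actions.
⚠️ Core rule: fixes must match the root cause; reflexive restarts are forbidden
- HostDisk (disk full) → first find large files (du -sh), clean logs (journalctl --vacuum), check df -h; consider expanding capacity only as a last resort
- HostCPU / CPU contention → first find the offending process (top -bn1 / ps aux) and identify the specific process name before deciding whether to restart
- OOM / memory → first check Pod memory (kubectl top pods) and the OOM log, then restart
- NetworkLatency → first check connection state (ss -tp / ping / traceroute)
- DatabaseConnection → first check the connection pool (pg_stat_activity) and the DB log, then restart
- K8s Pod Crash → first check the Pod log (kubectl logs), then restart
- Restart only when the diagnosis clearly points to crash/OOM/deadlock; for resource problems (disk/CPU), prefer diagnostic + cleanup commands
candidates format rules:
- The action field must be a real kubectl command (no natural-language descriptions)
- Target resource format: deployment/<name>; always use the awoooi-prod namespace
- action field: use kubectl commands for K8s problems and ssh diagnostic commands for host-level problems (HostDisk/HostCPU, etc.)
- Host IPs: 192.168.0.188 (AI+Web) / 192.168.0.110 (main service) / 192.168.0.111 (Ollama)
- Every candidate must assess blast_radius (0-100, scope of impact) and rollback_cost (0-100, rollback difficulty)
blast_radius reference:
- Diagnostic commands (df/du/top/ps/kubectl top) = 0
- SSH log cleanup (journalctl --vacuum / find -delete) = 5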
- kubectl rollout restart deployment = 10
- kubectl scale deployment --replicas=N = 15
- kubectl rollout undo deployment = 25
- kubectl apply -f = 40
- kubectl delete deployment = 75
- kubectl delete pvc = 95
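recommended_actions rules (North Star §1.1, remediation diversity):
- Do not make every action a restart; at least 1 must be a log-viewing or diagnostic action (low risk)
- The first action must be a diagnostic/viewing action (low risk) so the SRE can confirm the situation first
- Do not make every action a restart
- mcp_provider must be one of: k8s | ssh | prometheus | signoz | database | internal
- risk must be one of: low | medium | high | critical
- Actions with critical risk must explain why in reasoning
- params may use templates: {{labels.namespace}} / {{labels.pod}} / {{incident_id}}
Reply in JSON:
Reply in JSON (the example is a HostDisk scenario; substitute according to the root-cause hypothesis):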
{{
"candidates": [
{{
"action": "kubectl rollout restart deployment/awoooi-api -n awoooi-prod",
"blast_radius": 10,
"rollback_cost": 5,
"confidence": 0.8,
"rationale": "A restart clears the memory fragmentation caused by OOM"
}},
{{
"action": "kubectl top pods -n awoooi-prod --sort-by=memory",
"action": "ssh user@192.168.0.110 'df -h && du -sh /var/log/* 2>/dev/null | sort -rh | head -20'",
"blast_radius": 0,
"rollback_cost": 0,
"confidence": 0.9,
"rationale": "First confirm which Pod uses the most memory before deciding what to do"
"confidence": 0.95,
"rationale": "First confirm disk usage and the largest directories; decide on a cleanup method once the root cause is found"
}},
{{
"action": "ssh user@192.168.0.110 'journalctl --vacuum-time=7d && find /tmp -mtime +7 -delete'",
"blast_radius": 5,
"rollback_cost": 5,
"confidence": 0.8,
"rationale": "Clean journal logs older than 7 days and stale files in /tmp to free disk space"
}}
],
"recommended_actions": [
{{
"name": "check_pod_logs",
"label": " Pod Log",
"emoji": "📋",
"mcp_provider": "k8s",
"mcp_tool": "k8s_get_pod_logs",
"params": {{"namespace": "awoooi-prod", "pod": "{{labels.pod}}", "tail_lines": "50"}},
"name": "check_disk_usage",
"label": "Disk usage",
"emoji": "💾",
"mcp_provider": "ssh",
"mcp_tool": "ssh_exec",
"params": {{"host": "192.168.0.110", "command": "df -h && du -sh /var/log/* 2>/dev/null | sort -rh | head -10"}},
"risk": "low",
"reasoning": "Check the log first to confirm the OOM root cause and avoid a blind restart"
"reasoning": "Diagnose how disk usage is distributed and find the biggest consumers"
}},
{{
"name": "k8s_restart",
"label": "Restart",
"emoji": "🔄",
"mcp_provider": "k8s",
"mcp_tool": "kubectl_restart",
"params": {{"namespace": "awoooi-prod", "deployment": "{{labels.deployment}}"}},
"risk": "medium",
"reasoning": "Restart to clear memory fragmentation once OOM is confirmed"
"name": "clean_old_logs",
"label": "Clean old logs",
"emoji": "🗑️",
"mcp_provider": "ssh",
"mcp_tool": "ssh_exec",
"params": {{"host": "192.168.0.110", "command": "journalctl --vacuum-time=7d"}},
"risk": "low",
"reasoning": "Clean journal logs older than 7 days; safe and can be rolled back"
}}
]
}}"""
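
The prompt states its output rules only in natural language, so a caller may want to check the model's reply against them. The following is a rough sketch under assumptions, not part of this commit: `validate_solver_reply` is a hypothetical helper, and detecting a restart by looking for "restart" in the action name is a heuristic chosen for illustration. The allowed provider and risk values are taken from the prompt.

```python
import json

ALLOWED_PROVIDERS = {"k8s", "ssh", "prometheus", "signoz", "database", "internal"}
ALLOWED_RISKS = {"low", "medium", "high", "critical"}


def validate_solver_reply(raw):
    """Return a list of rule violations; an empty list means the reply passes."""
    reply = json.loads(raw)
    problems = []

    candidates = reply.get("candidates", [])
    if not 1 <= len(candidates) <= 3:
        problems.append("expected 1-3 candidates")
    for candidate in candidates:
        for key in ("blast_radius", "rollback_cost"):
            value = candidate.get(key)
            if not isinstance(value, (int, float)) or not 0 <= value <= 100:
                problems.append(f"{key} out of range 0-100")

    actions = reply.get("recommended_actions", [])
    if len(actions) > 3:
        problems.append("expected 0-3 recommended_actions")
    for action in actions:
        if action.get("mcp_provider") not in ALLOWED_PROVIDERS:
            problems.append(f"unknown mcp_provider: {action.get('mcp_provider')}")
        if action.get("risk") not in ALLOWED_RISKS:
            problems.append(f"unknown risk: {action.get('risk')}")
        if action.get("risk") == "critical" and not action.get("reasoning"):
            problems.append("critical action lacks reasoning")
    if actions and actions[0].get("risk") != "low":
        problems.append("first recommended action must be low risk")
    # Heuristic (assumption): an action counts as a restart if its name says so.
    if actions and all("restart" in a.get("name", "") for a in actions):
        problems.append("recommended_actions are all restarts")
    return problems


good = json.dumps({
    "candidates": [{"action": "ssh user@host 'df -h'", "blast_radius": 0, "rollback_cost": 0}],
    "recommended_actions": [
        {"name": "check_disk_usage", "mcp_provider": "ssh", "risk": "low", "reasoning": "diagnose"}
    ],
})
bad = json.dumps({
    "candidates": [],
    "recommended_actions": [{"name": "k8s_restart", "mcp_provider": "shell", "risk": "medium"}],
})
good_problems = validate_solver_reply(good)
bad_problems = validate_solver_reply(bad)
```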


@@ -75,6 +75,8 @@ spec:
value: "120"
- name: AGENT_DIAGNOSTICIAN_TIMEOUT_SEC
value: "100"
- name: AGENT_SOLVER_TIMEOUT_SEC
value: "80"
# 2026-04-05 Claude Code: Sprint 3 - mount the SSH key for HostRepairAgent
volumeMounts:
- name: repair-ssh-key
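
The commit message requires the solver timeout to stay below OPENCLAW_TIMEOUT so a buffer remains. A startup check along the following lines could guard that ordering. It is a sketch only: `check_timeout_budget` is a hypothetical helper, and it assumes both values are available as environment variables under exactly these names.

```python
def check_timeout_budget(env):
    """Return the buffer in seconds, or raise if the solver budget is not smaller."""
    solver = int(env["AGENT_SOLVER_TIMEOUT_SEC"])
    outer = int(env["OPENCLAW_TIMEOUT"])  # assumed variable name, see lead-in
    if solver >= outer:
        raise ValueError(
            f"AGENT_SOLVER_TIMEOUT_SEC={solver} must be below OPENCLAW_TIMEOUT={outer}"
        )
    return outer - solver


buffer_sec = check_timeout_budget(
    {"AGENT_SOLVER_TIMEOUT_SEC": "80", "OPENCLAW_TIMEOUT": "120"}
)

try:
    check_timeout_budget({"AGENT_SOLVER_TIMEOUT_SEC": "120", "OPENCLAW_TIMEOUT": "120"})
    misconfigured_rejected = False
except ValueError:
    misconfigured_rejected = True
```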