# Drift Scanner Pods Fix Guide

## Problem Analysis

Multiple drift-scanner pods have finished in the Failed or Succeeded phase and are still present in the `momo` namespace:

- 6 Failed pods: drift-scanner-29602260-sns8t, drift-scanner-29602320-vf6dj, etc.
- 3 Succeeded pods: drift-scanner-29613600-x67jn, etc.

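## Root Cause

These drift-scanner pods appear to belong to Kubernetes Jobs that are not part of your codebase and are created by an external system or a CronJob. The numeric suffixes increase in steps of 60 (29602260, 29602320, 29602380, ...), which is consistent with the names a CronJob gives its Jobs: the suffix is the scheduled time in minutes since the Unix epoch, so these would be hourly runs.
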
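To check the scheduling theory, a suffix can be converted back to a timestamp. This is a minimal sketch, assuming GNU `date` and assuming the suffixes really are CronJob-style minutes since the Unix epoch; `suffix_to_utc` is a hypothetical helper name, not part of any existing script.

```bash
#!/usr/bin/env bash
# Convert a CronJob-style Job suffix (minutes since the Unix epoch) to a UTC timestamp.
# Assumes GNU date, which accepts the -d @SECONDS form.
suffix_to_utc() {
    local minutes="$1"
    date -u -d "@$((minutes * 60))" '+%Y-%m-%dT%H:%M:%SZ'
}

# Suffixes taken from the pod names listed in this guide
for suffix in 29602260 29602320; do
    printf '%s -> %s\n' "$suffix" "$(suffix_to_utc "$suffix")"
done
```

If consecutive suffixes decode to timestamps exactly one hour apart, that supports the hourly CronJob explanation.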
## Immediate Fix Actions

### Option 1: Manual Cleanup (Quick Fix)
```bash
# SSH to your K8s server
ssh wooo@192.168.0.110

# Delete failed pods (only five of the six are named here; list the rest with
# `sudo kubectl get pods -n momo | grep drift-scanner`)
sudo kubectl delete pod drift-scanner-29602260-sns8t -n momo --force --grace-period=0
sudo kubectl delete pod drift-scanner-29602320-vf6dj -n momo --force --grace-period=0
sudo kubectl delete pod drift-scanner-29602380-862vh -n momo --force --grace-period=0
sudo kubectl delete pod drift-scanner-29602440-mwd7m -n momo --force --grace-period=0
sudo kubectl delete pod drift-scanner-29602500-gpr27 -n momo --force --grace-period=0

# Delete old succeeded pods (optional)
sudo kubectl delete pod drift-scanner-29613600-x67jn -n momo --force --grace-period=0
sudo kubectl delete pod drift-scanner-29613660-7tk4d -n momo --force --grace-period=0
sudo kubectl delete pod drift-scanner-29613720-c7zcp -n momo --force --grace-period=0
```

### Option 2: Use Cleanup Script
```bash
# On K8s server
cd /home/wooo/scripts
./cleanup_drift_scanner_pods.sh cleanup-failed
```

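### Option 3: Batch Cleanup
```bash
# Delete all drift-scanner pods at once.
# This only matches pods that carry the label app=drift-scanner; verify with
# `sudo kubectl get pods -n momo --show-labels` before relying on it.
sudo kubectl delete pods -l app=drift-scanner -n momo --force --grace-period=0
```
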
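If the pods turn out not to carry an `app=drift-scanner` label, selecting by pod phase and name prefix is an alternative. The sketch below is illustrative: `filter_drift_scanner` is a hypothetical helper, and the `kubectl` pipeline is shown as a commented usage example with `--dry-run=client` so nothing is deleted until that flag is removed.

```bash
#!/usr/bin/env bash
# Keep only drift-scanner pods from a list of `pod/<name>` lines read on stdin.
# `|| true` keeps the exit status zero when nothing matches.
filter_drift_scanner() {
    grep '^pod/drift-scanner-' || true
}

# Usage (not executed here): preview which Failed drift-scanner pods would be deleted.
# Remove --dry-run=client to actually delete them.
#
#   sudo kubectl get pods -n momo --field-selector=status.phase=Failed -o name \
#     | filter_drift_scanner \
#     | xargs -r sudo kubectl delete -n momo --dry-run=client
```

Filtering on `status.phase` avoids depending on labels and leaves running pods untouched.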
## Prevention Strategies

### 1. Identify the Source
Find what's creating these drift-scanner jobs:
```bash
# Check for CronJobs
sudo kubectl get cronjobs -n momo
sudo kubectl get cronjobs --all-namespaces | grep drift

# Check for scheduled jobs
sudo kubectl get jobs -n momo | grep drift
sudo kubectl get jobs --all-namespaces | grep drift

# Check events
sudo kubectl get events -n momo --sort-by='.lastTimestamp' | grep drift
```

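Another way to trace the source is to follow each object's `ownerReferences` upward, from pod to Job to whatever created the Job. This is a sketch under assumptions: `owner_of` is a hypothetical helper, it assumes the resource has at least one owner, and the `KUBECTL` variable exists only so the command can be overridden.

```bash
#!/usr/bin/env bash
# Command used to reach the cluster; override KUBECTL to run without sudo.
KUBECTL="${KUBECTL:-sudo kubectl}"

# Print "<Kind>/<name>" of the first owner of a resource.
# Assumes the resource has at least one entry in metadata.ownerReferences.
owner_of() {
    local resource="$1" namespace="$2"
    $KUBECTL get "$resource" -n "$namespace" \
        -o jsonpath='{.metadata.ownerReferences[0].kind}/{.metadata.ownerReferences[0].name}'
}

# Usage (not executed here):
#   owner_of pod/drift-scanner-29602260-sns8t momo   # a Job, if the pod was created by one
#   owner_of job/drift-scanner-29602260 momo          # a CronJob, if the Job was scheduled by one
```

If the second call names a CronJob, that CronJob is the object to fix or reconfigure rather than the individual pods.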
### 2. Monitor and Auto-Cleanup
Add to your existing health monitoring script (`/home/wooo/scripts/k8s_health_monitor.sh`):

```bash
# Add this function to the script.
# It relies on MOMO_NAMESPACE and log(), which are assumed to be defined there.
check_drift_scanner_pods() {
    local failed_pods succeeded_pods failed_count succeeded_count

    # Select by pod phase. The STATUS column of `kubectl get pods` usually shows
    # Error/Completed rather than Failed/Succeeded, so grepping it is unreliable.
    failed_pods=$(sudo kubectl get pods -n "${MOMO_NAMESPACE}" -o name \
        --field-selector=status.phase=Failed | grep '^pod/drift-scanner-' || true)
    succeeded_pods=$(sudo kubectl get pods -n "${MOMO_NAMESPACE}" -o name \
        --field-selector=status.phase=Succeeded | grep '^pod/drift-scanner-' || true)

    # Count non-empty lines; `|| true` keeps a zero count from aborting under `set -e`
    failed_count=$(grep -c . <<< "$failed_pods" || true)
    succeeded_count=$(grep -c . <<< "$succeeded_pods" || true)

    if [[ $failed_count -gt 5 || $succeeded_count -gt 10 ]]; then
        log "WARNING: Too many drift-scanner pods (Failed: $failed_count, Succeeded: $succeeded_count)"

        # Auto-cleanup failed pods
        xargs -r sudo kubectl delete -n "${MOMO_NAMESPACE}" --force --grace-period=0 <<< "$failed_pods"

        # Auto-cleanup succeeded pods (all of them; no age filter is applied)
        xargs -r sudo kubectl delete -n "${MOMO_NAMESPACE}" --force --grace-period=0 <<< "$succeeded_pods"
    fi
}
```

### 3. Resource Limits
If these are legitimate jobs, consider setting resource limits and TTL:
```yaml
# Example Job template with TTL
apiVersion: batch/v1
kind: Job
metadata:
  name: drift-scanner
spec:
  ttlSecondsAfterFinished: 3600  # Clean up after 1 hour
  backoffLimit: 3                # Limit retries
  template:
    spec:
      containers:
        - name: drift-scanner
          image: your-image
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "512Mi"
              cpu: "500m"
      restartPolicy: OnFailure
```

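If the Jobs do come from a CronJob, the same retention settings can be applied to it in place with a merge patch. The sketch below assumes the CronJob is named `drift-scanner` (inferred from the pod name prefix, not confirmed) and uses a hypothetical helper, `build_retention_patch`; the limits shown are example values.

```bash
#!/usr/bin/env bash
# Build a JSON merge patch that caps how many finished Jobs a CronJob keeps
# and adds a TTL to the Jobs it creates.
build_retention_patch() {
    local keep_failed="$1" keep_succeeded="$2" ttl_seconds="$3"
    printf '{"spec":{"failedJobsHistoryLimit":%d,"successfulJobsHistoryLimit":%d,"jobTemplate":{"spec":{"ttlSecondsAfterFinished":%d}}}}\n' \
        "$keep_failed" "$keep_succeeded" "$ttl_seconds"
}

# Usage (not executed here); the CronJob name "drift-scanner" is an assumption:
#   sudo kubectl patch cronjob drift-scanner -n momo --type=merge \
#     -p "$(build_retention_patch 1 3 3600)"
```

If the CronJob is managed by another tool (for example a Helm release or a GitOps controller), a manual patch may be overwritten on the next sync, so the change would need to go into that tool's source instead.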
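## Monitoring Setup

### Add to Prometheus Monitoring
Add an alert rule to a rules file that `/home/wooo/monitoring/prometheus.yml` loads through `rule_files` (alerting rules live in rule files, under a group's `rules:` list, not in the main configuration itself):

```yaml
# Goes under `rules:` in a rule group. Requires kube-state-metrics.
- alert: TooManyFailedDriftScannerPods
  # kube_pod_status_phase is 0 or 1 per pod, so sum the series to count pods
  expr: sum(kube_pod_status_phase{phase="Failed", pod=~"drift-scanner-.*"}) > 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Too many failed drift-scanner pods"
    description: "More than 5 drift-scanner pods have failed"
```
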
### Telegram Alert Integration
The cleanup script already includes Telegram notifications when pods are cleaned up.

## Long-term Solutions

1. **Identify the Owner**: Find which system or team is creating these drift-scanner jobs
2. **Fix the Root Cause**: Address why these jobs are failing
3. **Implement TTL**: Add `ttlSecondsAfterFinished` to job specifications
4. **Resource Quotas**: Set limits to prevent resource exhaustion
5. **Regular Cleanup**: Schedule the cleanup script to run periodically

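For the regular cleanup item, one option is a crontab entry on the K8s server. The sketch below is an assumption-heavy example: the schedule (minute 15 of every hour) is made up, `ensure_cron_line` is a hypothetical helper, and the script path and `cleanup-failed` argument are the ones used in Option 2.

```bash
#!/usr/bin/env bash
# Hypothetical schedule: minute 15 of every hour.
CRON_LINE='15 * * * * /home/wooo/scripts/cleanup_drift_scanner_pods.sh cleanup-failed'

# Read a crontab on stdin and print it with the given line appended,
# unless that exact line is already present.
ensure_cron_line() {
    local line="$1" existing
    existing=$(cat)
    if grep -Fxq -- "$line" <<< "$existing"; then
        printf '%s\n' "$existing"
    elif [[ -z "$existing" ]]; then
        printf '%s\n' "$line"
    else
        printf '%s\n%s\n' "$existing" "$line"
    fi
}

# Usage (not executed here): install the entry for the current user.
#   crontab -l 2>/dev/null | ensure_cron_line "$CRON_LINE" | crontab -
```

Because the helper checks for an exact match first, running the install command repeatedly does not add duplicate entries.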
## Emergency Commands
```bash
# Quick check of drift-scanner status
sudo kubectl get pods -n momo | grep drift-scanner

# Force delete all drift-scanner pods
# (--all cannot be combined with --selector, so only the selector is used;
# this matches pods labeled app=drift-scanner)
sudo kubectl delete pods -n momo --force --grace-period=0 --selector=app=drift-scanner

# Check what's creating them
sudo kubectl get events -n momo --sort-by='.lastTimestamp' | tail -20
```

## Files Created
- `scripts/cleanup_drift_scanner_pods.sh` - Comprehensive cleanup script
- `DRIFT_SCANNER_FIX_GUIDE.md` - This guide

## Next Steps
1. Run the immediate cleanup commands
2. Identify the source of drift-scanner jobs
3. Implement prevention measures
4. Set up monitoring and auto-cleanup