# Drift Scanner Pods Fix Guide

## Problem Analysis

You have multiple drift-scanner pods in the `momo` namespace that have failed or completed and were never cleaned up:

- 6 Failed pods: drift-scanner-29602260-sns8t, drift-scanner-29602320-vf6dj, etc.
- 3 Succeeded pods: drift-scanner-29613600-x67jn, etc.

## Root Cause

These drift-scanner pods appear to belong to Kubernetes Jobs that are not part of your codebase and are created by an external system or a CronJob. The numeric suffixes (29602260, 29602320, ...) increase in steps of 60, which is consistent with the CronJob controller's naming scheme of `<cronjob-name>-<scheduled time in minutes since the Unix epoch>`, so the most likely source is a CronJob that runs hourly.

## Immediate Fix Actions

### Option 1: Manual Cleanup (Quick Fix)

```bash
# SSH to your K8s server
ssh wooo@192.168.0.110

# Delete failed pods (five are listed here; find any others with:
#   sudo kubectl get pods -n momo | grep drift-scanner)
sudo kubectl delete pod drift-scanner-29602260-sns8t -n momo --force --grace-period=0
sudo kubectl delete pod drift-scanner-29602320-vf6dj -n momo --force --grace-period=0
sudo kubectl delete pod drift-scanner-29602380-862vh -n momo --force --grace-period=0
sudo kubectl delete pod drift-scanner-29602440-mwd7m -n momo --force --grace-period=0
sudo kubectl delete pod drift-scanner-29602500-gpr27 -n momo --force --grace-period=0

# Delete old succeeded pods (optional)
sudo kubectl delete pod drift-scanner-29613600-x67jn -n momo --force --grace-period=0
sudo kubectl delete pod drift-scanner-29613660-7tk4d -n momo --force --grace-period=0
sudo kubectl delete pod drift-scanner-29613720-c7zcp -n momo --force --grace-period=0
```

### Option 2: Use Cleanup Script

```bash
# On K8s server
cd /home/wooo/scripts
./cleanup_drift_scanner_pods.sh cleanup-failed
```

### Option 3: Batch Cleanup

```bash
# Delete all drift-scanner pods at once.
# This assumes the pods carry the label app=drift-scanner; check first with:
#   sudo kubectl get pods -n momo --show-labels | grep drift-scanner
sudo kubectl delete pods -l app=drift-scanner -n momo --force --grace-period=0
```

## Prevention Strategies

### 1. Identify the Source

Find what is creating these drift-scanner jobs:

```bash
# Check for CronJobs
sudo kubectl get cronjobs -n momo
sudo kubectl get cronjobs --all-namespaces | grep drift

# Check for scheduled jobs
sudo kubectl get jobs -n momo | grep drift
sudo kubectl get jobs --all-namespaces | grep drift

# Check events
sudo kubectl get events -n momo --sort-by='.lastTimestamp' | grep drift
```

### 2. Monitor and Auto-Cleanup

Add to your existing health monitoring script (`/home/wooo/scripts/k8s_health_monitor.sh`):

```bash
# Add this function to the script
check_drift_scanner_pods() {
    local failed_pods succeeded_pods failed_count succeeded_count

    # Select by pod phase: the STATUS column of `kubectl get pods` shows
    # reasons such as Error or Completed, not the phase names.
    failed_pods=$(sudo kubectl get pods -n "${MOMO_NAMESPACE}" -o name \
        --field-selector=status.phase=Failed | grep '^pod/drift-scanner-' || true)
    succeeded_pods=$(sudo kubectl get pods -n "${MOMO_NAMESPACE}" -o name \
        --field-selector=status.phase=Succeeded | grep '^pod/drift-scanner-' || true)

    # Count non-empty lines; an empty list counts as 0
    failed_count=$(grep -c . <<< "$failed_pods" || true)
    succeeded_count=$(grep -c . <<< "$succeeded_pods" || true)

    if [[ $failed_count -gt 5 ]] || [[ $succeeded_count -gt 10 ]]; then
        log "WARNING: Too many drift-scanner pods (Failed: $failed_count, Succeeded: $succeeded_count)"

        # Auto-cleanup failed pods
        xargs -r sudo kubectl delete -n "${MOMO_NAMESPACE}" --force --grace-period=0 <<< "$failed_pods"

        # Auto-cleanup all succeeded pods (no age filter is applied)
        xargs -r sudo kubectl delete -n "${MOMO_NAMESPACE}" --force --grace-period=0 <<< "$succeeded_pods"
    fi
}
```

### 3. Resource Limits

If these are legitimate jobs, consider setting resource limits and a TTL. If the Jobs are generated by a CronJob, the same fields belong under `spec.jobTemplate.spec` of that CronJob.

```yaml
# Example Job template with TTL
apiVersion: batch/v1
kind: Job
metadata:
  name: drift-scanner
spec:
  ttlSecondsAfterFinished: 3600  # Clean up after 1 hour
  backoffLimit: 3                # Limit retries
  template:
    spec:
      containers:
      - name: drift-scanner
        image: your-image
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "512Mi"
            cpu: "500m"
      restartPolicy: OnFailure
```

## Monitoring Setup

### Add to Prometheus Monitoring

Add the alert rule to a rules file that is listed under `rule_files` in `/home/wooo/monitoring/prometheus.yml`. The rule relies on the `kube_pod_status_phase` metric from kube-state-metrics, which reports 0 or 1 per pod, so the series must be summed before comparing against the threshold:

```yaml
groups:
- name: drift-scanner
  rules:
  - alert: TooManyFailedDriftScannerPods
    # Each matching series is 0 or 1, so sum them to count failed pods
    expr: sum(kube_pod_status_phase{namespace="momo", phase="Failed", pod=~"drift-scanner-.*"}) > 5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Too many failed drift-scanner pods"
      description: "More than 5 drift-scanner pods have failed"
```

### Telegram Alert Integration

The cleanup script already sends Telegram notifications when pods are cleaned up.

## Long-term Solutions

1. **Identify the Owner**: Find which system or team is creating these drift-scanner jobs
2. **Fix the Root Cause**: Address why these jobs are failing
3. **Implement TTL**: Add `ttlSecondsAfterFinished` to job specifications
4. **Resource Quotas**: Set limits to prevent resource exhaustion
5. **Regular Cleanup**: Schedule the cleanup script to run periodically

## Emergency Commands

```bash
# Quick check of drift-scanner status
sudo kubectl get pods -n momo | grep drift-scanner

# Force delete all drift-scanner pods
# (kubectl rejects --all combined with --selector, so only the selector is used)
sudo kubectl delete pods -n momo --force --grace-period=0 --selector=app=drift-scanner

# Check what's creating them
sudo kubectl get events -n momo --sort-by='.lastTimestamp' | tail -20
```

## Files Created

- `scripts/cleanup_drift_scanner_pods.sh` - Comprehensive cleanup script
- `DRIFT_SCANNER_FIX_GUIDE.md` - This guide

## Next Steps

1. Run the immediate cleanup commands
2. Identify the source of drift-scanner jobs
3. Implement prevention measures
4. Set up monitoring and auto-cleanup
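
When identifying the source of the jobs, it can help to turn a job name into the time it was scheduled for. The sketch below assumes the jobs come from a Kubernetes CronJob, whose controller names each Job `<cronjob-name>-<scheduled time in minutes since the Unix epoch>`; that assumption is not confirmed for this cluster. The function name `job_scheduled_time` is hypothetical, and the `date -d @...` syntax requires GNU date. Pass the Job name, not the pod name (drop the trailing random pod suffix such as `-sns8t`).

```bash
#!/usr/bin/env bash

# Hypothetical helper: print the scheduled time (UTC) encoded in a
# CronJob-created Job name such as drift-scanner-29602260.
# Assumes the numeric suffix is minutes since the Unix epoch.
job_scheduled_time() {
    local job_name="$1"
    local minutes="${job_name##*-}"   # text after the last hyphen

    # Reject names whose last segment is not a plain number
    if [[ ! "$minutes" =~ ^[0-9]+$ ]]; then
        echo "no numeric suffix in: $job_name" >&2
        return 1
    fi

    # Convert minutes to seconds and format as an ISO 8601 UTC timestamp
    date -u -d "@$((minutes * 60))" '+%Y-%m-%dT%H:%M:%SZ'
}
```

For example, `job_scheduled_time drift-scanner-29602260` prints `2026-04-14T03:00:00Z`, and `drift-scanner-29602320` decodes to one hour later, which would match an hourly schedule. If the decoded times do not line up with when the pods actually appeared, the jobs are probably created by something other than a CronJob.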