Troubleshooting
Quick Commands (Kubernetes + Helm + Ingress + cert-manager)
Set these once so every command is copy/paste friendly:
export NS=erpnext
export REL=frappe-bench # change if your Helm release name differs
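Verify both point at real objects before continuing:
kubectl get ns $NS
helm list -n $NS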
1) Quick health check (is anything obviously broken?)
Cluster + namespace overview
kubectl get nodes -o wide
kubectl get ns | grep -E "erpnext|ingress-nginx|cert-manager" || true
kubectl get all -n $NS
kubectl get events -n $NS --sort-by=.metadata.creationTimestamp | tail -n 50
Watch pods live
kubectl get pods -n $NS -w
2) Helm sanity (release, values, revisions, rollback)
Find your release name if you forgot it
helm list -A | grep -i erp
helm list -n $NS
Current release status + history
helm status $REL -n $NS
helm history $REL -n $NS
See what Helm actually applied
helm get manifest $REL -n $NS | less
helm get values $REL -n $NS -a
Rollback quickly (if last deploy broke things)
helm history $REL -n $NS
helm rollback $REL <REVISION> -n $NS
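Tip: omit the revision to roll back to the release just before the current one:
helm rollback $REL -n $NS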
3) Ingress / DNS / LoadBalancer issues (site not opening, 404, 502, timeout)
Check ingress object
kubectl get ingress -n $NS -o wide
kubectl describe ingress -n $NS
Check ingress controller service (external IP/hostname)
kubectl get svc -n ingress-nginx
kubectl get svc ingress-nginx-controller -n ingress-nginx -o wide
Check ingress controller logs (very useful for 404/502)
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --tail=200
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller -f
Verify service endpoints exist (common 502 cause: no endpoints)
kubectl get svc -n $NS
kubectl get endpoints -n $NS
Quick internal curl test (bypass DNS, test service inside cluster)
1) Start a temp curl pod:
kubectl run -n $NS tmp-curl --rm -it --image=curlimages/curl -- sh
2) Inside it, test service (replace service name/port):
nslookup <service-name>
curl -sv http://<service-name>:<port>/health || true
curl -sv http://<service-name>:<port>/ || true
If the ingress works but DNS doesn’t: the DNS A record must point to the ingress LB address shown by the ingress-nginx-controller service.
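To verify, compare the DNS record with the LB address (domain is a placeholder; assumes dig is installed locally):
dig +short erp.example.com
kubectl get svc ingress-nginx-controller -n ingress-nginx -o jsonpath='{.status.loadBalancer.ingress[0].ip}'; echo
# On clouds that hand out hostnames instead of IPs (e.g. AWS), use .ingress[0].hostname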
4) TLS / Certificate problems (HTTPS not working, cert pending)
See certs and challenges
kubectl get certificate -n $NS
kubectl describe certificate -n $NS
kubectl get order,challenge -n $NS
kubectl describe challenge -n $NS
cert-manager logs
kubectl logs -n cert-manager deploy/cert-manager --tail=200
kubectl logs -n cert-manager deploy/cert-manager -f
Common root causes
- Ingress class mismatch (issuer expects nginx but ingress uses another class)
- DNS points to wrong load balancer
- Port 80 not reachable for HTTP-01 challenge
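Quick checks for each (issuer/domain names are placeholders):
kubectl get clusterissuer
kubectl get ingress -n $NS -o jsonpath='{.items[*].spec.ingressClassName}'; echo
# Port 80 reachability: a 404 here is fine (path doesn't exist), a timeout is the problem
curl -sv http://<your-domain>/.well-known/acme-challenge/test || true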
5) Pod stuck / CrashLoopBackOff / ImagePullBackOff
Identify failing pods fast
kubectl get pods -n $NS -o wide
kubectl get pods -n $NS | egrep -i "CrashLoopBackOff|Error|ImagePullBackOff|Pending" || true
Describe pod (shows events like image pull errors, probes failing, etc.)
kubectl describe pod -n $NS <pod-name>
Logs (current + previous)
kubectl logs -n $NS <pod-name> --tail=200
kubectl logs -n $NS <pod-name> --previous --tail=200
If a deployment is failing, check rollout
kubectl get deploy -n $NS
kubectl rollout status deploy/<deploy-name> -n $NS
kubectl describe deploy/<deploy-name> -n $NS
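If a bad image/config broke the rollout, undo it (note: Helm may re-apply the broken spec on the next upgrade, so fix your values too):
kubectl rollout undo deploy/<deploy-name> -n $NS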
6) PVC / Storage problems (pods Pending, RWX volume issues)
PVC status
kubectl get pvc -n $NS
kubectl describe pvc -n $NS <pvc-name>
Check StorageClasses (need RWX for worker/shared)
kubectl get storageclass
If PVC is Pending, check events
kubectl get events -n $NS --sort-by=.metadata.creationTimestamp | tail -n 80
Typical reasons
- StorageClass doesn’t support ReadWriteMany
- Provisioner missing / permissions issue
- Quota limits
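Inspect the StorageClass details (name is a placeholder; RWX support depends on the provisioner, e.g. NFS/CephFS-based ones support it):
kubectl get storageclass <sc-name> -o yaml | egrep -i "provisioner|reclaimPolicy|volumeBindingMode"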
7) ERPNext site not created / createSite job failed
List jobs + pods created by jobs
kubectl get jobs -n $NS
kubectl get pods -n $NS | grep -i job || true
Describe job + view job pod logs
kubectl describe job -n $NS <job-name>
# Find the pod for the job:
kubectl get pods -n $NS --selector=job-name=<job-name> -o name
# Logs:
kubectl logs -n $NS <job-pod-name> --tail=400
Rerun createSite (only if you understand the impact)
kubectl delete job -n $NS <job-name>
helm upgrade $REL <chart> -n $NS -f values-erpnext.yaml  # <chart> is required, e.g. frappe/erpnext
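Then watch the recreated job and follow its logs:
kubectl get jobs -n $NS -w
kubectl logs -n $NS -f job/<job-name>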
8) Database / Cache / Queue issues (500 errors, workers stuck)
Identify DB/cache pods/services
kubectl get pods -n $NS | egrep -i "mariadb|mysql|postgres|redis|dragonfly|queue|cache" || true
kubectl get svc -n $NS | egrep -i "mariadb|mysql|postgres|redis|dragonfly|queue|cache" || true
DB logs
kubectl logs -n $NS <db-pod> --tail=200
Test connectivity from inside cluster
kubectl run -n $NS tmp-net --rm -it --image=busybox -- sh
Inside:
nc -zv <db-service-name> 3306 || true
nc -zv <cache-service-name> 6379 || true
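A protocol-level check beats a bare TCP connect; these assume the client binaries exist in the respective images:
kubectl exec -n $NS -it <cache-pod> -- redis-cli ping          # expect PONG
kubectl exec -n $NS -it <db-pod> -- mysql -uroot -p -e "SELECT 1"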
9) ERPNext application logs (web/gunicorn, workers, scheduler)
List likely ERPNext pods
kubectl get pods -n $NS | egrep -i "web|gunicorn|worker|scheduler|socketio|nginx|frappe|bench" || true
Tail logs (replace pod name)
kubectl logs -n $NS <pod-name> --tail=300
kubectl logs -n $NS <pod-name> -f
If pod has multiple containers
kubectl get pod -n $NS <pod-name> -o jsonpath='{.spec.containers[*].name}'; echo
kubectl logs -n $NS <pod-name> -c <container-name> --tail=300
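Or grab all containers at once:
kubectl logs -n $NS <pod-name> --all-containers --tail=100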
10) Exec into ERP pod (inspect configs, run bench commands)
Shell into a running ERP pod
kubectl exec -n $NS -it <pod-name> -- bash
# or
kubectl exec -n $NS -it <pod-name> -- sh
Inside (varies by image):
env | sort | head
ls -la
bench --version || true
bench version || true
Don’t run migrations in PROD without a clear rollback plan.
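Useful read-only checks once inside (assumes a bench-capable container; site name is a placeholder):
bench --site <site-name> doctor || true        # scheduler/worker status
bench --site <site-name> show-config || true   # effective site config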
11) Resource issues (OOMKilled, CPU throttling, nodes full)
Check resource usage
kubectl top nodes
kubectl top pods -n $NS
Find OOMKilled / restarts
kubectl get pods -n $NS --sort-by=.status.containerStatuses[0].restartCount
kubectl describe pod -n $NS <pod-name> | egrep -in "OOMKilled|Killed|Reason|Last State" || true
Node pressure events
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp | egrep -i "evict|pressure|OOM|disk" | tail -n 60
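Compare live usage against requests/limits to spot undersized containers:
kubectl get pods -n $NS -o custom-columns=NAME:.metadata.name,MEM_REQ:.spec.containers[*].resources.requests.memory,MEM_LIM:.spec.containers[*].resources.limits.memory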
12) Service routing issues (Ingress OK but backend not reachable / endpoints empty)
Service + selectors
kubectl get svc -n $NS -o wide
kubectl describe svc -n $NS <service-name>
Endpoints must show pod IPs
kubectl get endpoints -n $NS <service-name> -o yaml
If endpoints are empty:
- labels/selector mismatch
- pods not Ready (readiness probe failing)
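Compare the selector against actual pod labels (service name is a placeholder):
kubectl get svc -n $NS <service-name> -o jsonpath='{.spec.selector}'; echo
kubectl get pods -n $NS --show-labels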
13) Port-forward debugging (bypass ingress)
Port-forward service to local machine
kubectl port-forward -n $NS svc/<service-name> 8080:<service-port>
# Open: http://localhost:8080
Port-forward a pod directly
kubectl port-forward -n $NS pod/<pod-name> 8080:<container-port>
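Then, from another terminal:
curl -sv http://localhost:8080/ || true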
14) Handy one-liners for fast triage
Show only problematic resources
kubectl get pods -n $NS | egrep -i "Pending|CrashLoopBackOff|ImagePullBackOff|Error" || true
kubectl get pvc -n $NS | egrep -i "Pending|Lost" || true
kubectl get events -n $NS --sort-by=.metadata.creationTimestamp | tail -n 30
Sort pods by restart count
kubectl get pods -n $NS \
-o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount,PHASE:.status.phase \
--sort-by=.status.containerStatuses[0].restartCount