Troubleshooting

Quick Commands (Kubernetes + Helm + Ingress + cert-manager)

Set these once so every command is copy/paste friendly:

export NS=erpnext
export REL=frappe-bench   # change if your Helm release name differs

1) Quick health check (is anything obviously broken?)

Cluster + namespace overview

kubectl get nodes -o wide
kubectl get ns | grep -E "erpnext|ingress-nginx|cert-manager" || true
kubectl get all -n $NS
kubectl get events -n $NS --sort-by=.metadata.creationTimestamp | tail -n 50

Watch pods live

kubectl get pods -n $NS -w

2) Helm sanity (release, values, revisions, rollback)

Find your release name if you forgot it

helm list -A | grep -i erp
helm list -n $NS

Current release status + history

helm status $REL -n $NS
helm history $REL -n $NS

See what Helm actually applied

helm get manifest $REL -n $NS | less
helm get values   $REL -n $NS -a

Rollback quickly (if last deploy broke things)

helm history $REL -n $NS
helm rollback $REL <REVISION> -n $NS
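
Before rolling back, you can inspect what a previous revision actually deployed:

helm get values   $REL -n $NS --revision <REVISION>
helm get manifest $REL -n $NS --revision <REVISION> | less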

3) Ingress / DNS / LoadBalancer issues (site not opening, 404, 502, timeout)

Check ingress object

kubectl get ingress -n $NS -o wide
kubectl describe ingress -n $NS

Check ingress controller service (external IP/hostname)

kubectl get svc -n ingress-nginx
kubectl get svc ingress-nginx-controller -n ingress-nginx -o wide

Check ingress controller logs (very useful for 404/502)

kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --tail=200
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller -f

Verify service endpoints exist (common 502 cause: no endpoints)

kubectl get svc -n $NS
kubectl get endpoints -n $NS

Quick internal curl test (bypass DNS, test service inside cluster)

1) Start a temp curl pod:

kubectl run -n $NS tmp-curl --rm -it --image=curlimages/curl -- sh

2) Inside it, test service (replace service name/port):

nslookup <service-name>
curl -sv http://<service-name>:<port>/health || true
curl -sv http://<service-name>:<port>/ || true

If the ingress works but DNS doesn’t: the DNS A record (or CNAME) must point to the load balancer address shown on the ingress-nginx-controller service. Quick check below.
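
A quick way to compare the two (erp.example.com is a placeholder for your domain):

# External address of the ingress controller (IP or hostname, depending on the cloud):
kubectl get svc ingress-nginx-controller -n ingress-nginx \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}{.status.loadBalancer.ingress[0].hostname}'; echo

# What DNS currently resolves for your site (replace the domain):
dig +short erp.example.com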


4) TLS / Certificate problems (HTTPS not working, cert pending)

See certs and challenges

kubectl get certificate -n $NS
kubectl describe certificate -n $NS

kubectl get order,challenge -n $NS
kubectl describe challenge -n $NS

cert-manager logs

kubectl logs -n cert-manager deploy/cert-manager --tail=200
kubectl logs -n cert-manager deploy/cert-manager -f

Common root causes (checks below):
- Ingress class mismatch (issuer expects nginx but the ingress uses another class)
- DNS points to the wrong load balancer
- Port 80 not reachable for the HTTP-01 challenge
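
Quick ways to verify each (the domain below is a placeholder; if you use a namespaced Issuer instead of a ClusterIssuer, query issuer instead):

# Ingress class used by each ingress vs the class the issuer's HTTP-01 solver expects:
kubectl get ingress -n $NS -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.ingressClassName}{"\n"}{end}'
kubectl get clusterissuer -o yaml | grep -B2 -A4 http01 || true

# Port 80 reachability from outside (HTTP-01 needs it; a 404 here still proves reachability):
curl -sv http://erp.example.com/.well-known/acme-challenge/test || true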


5) Pod stuck / CrashLoopBackOff / ImagePullBackOff

Identify failing pods fast

kubectl get pods -n $NS -o wide
kubectl get pods -n $NS | egrep -i "CrashLoopBackOff|Error|ImagePullBackOff|Pending" || true

Describe pod (shows events like image pull errors, probes failing, etc.)

kubectl describe pod -n $NS <pod-name>

Logs (current + previous)

kubectl logs -n $NS <pod-name> --tail=200
kubectl logs -n $NS <pod-name> --previous --tail=200

If a deployment is failing, check rollout

kubectl get deploy -n $NS
kubectl rollout status deploy/<deploy-name> -n $NS
kubectl describe deploy/<deploy-name> -n $NS
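
If a new rollout is the problem, you can revert the Deployment directly (note that Helm's recorded state will then be ahead of the cluster):

kubectl rollout undo deploy/<deploy-name> -n $NS
kubectl rollout history deploy/<deploy-name> -n $NS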

6) PVC / Storage problems (pods Pending, RWX volume issues)

PVC status

kubectl get pvc -n $NS
kubectl describe pvc -n $NS <pvc-name>

Check StorageClasses (need RWX for worker/shared)

kubectl get storageclass

If PVC is Pending, check events

kubectl get events -n $NS --sort-by=.metadata.creationTimestamp | tail -n 80

Typical reasons (checks below):
- StorageClass doesn’t support ReadWriteMany
- Provisioner missing / permissions issue
- Quota limits
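
To confirm which of these applies (PVC and StorageClass names are placeholders):

# Which access mode did the stuck PVC request?
kubectl get pvc -n $NS <pvc-name> -o jsonpath='{.spec.accessModes}'; echo

# Which provisioner backs the StorageClass? RWX support depends on it (NFS/CephFS typically yes, most block storage no):
kubectl describe storageclass <storageclass-name> | egrep -i "provisioner|parameters" || true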


7) ERPNext site not created / createSite job failed

List jobs + pods created by jobs

kubectl get jobs -n $NS
kubectl get pods -n $NS | grep -i job || true

Describe job + view job pod logs

kubectl describe job -n $NS <job-name>

# Find the pod for the job:
kubectl get pods -n $NS --selector=job-name=<job-name> -o name

# Logs:
kubectl logs -n $NS <job-pod-name> --tail=400

Rerun createSite (only if you understand the impact)

kubectl delete job -n $NS <job-name>
helm upgrade $REL <chart> -n $NS -f values-erpnext.yaml   # <chart> e.g. frappe/erpnext; use the chart your release was installed from
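
After the upgrade recreates the job, follow it directly (kubectl logs accepts a job/ reference and picks one of its pods):

kubectl get jobs -n $NS
kubectl logs -n $NS -f job/<job-name>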

8) Database / Cache / Queue issues (500 errors, workers stuck)

Identify DB/cache pods/services

kubectl get pods -n $NS | egrep -i "mariadb|mysql|postgres|redis|dragonfly|queue|cache" || true
kubectl get svc  -n $NS | egrep -i "mariadb|mysql|postgres|redis|dragonfly|queue|cache" || true

DB logs

kubectl logs -n $NS <db-pod> --tail=200

Test connectivity from inside cluster

kubectl run -n $NS tmp-net --rm -it --image=busybox -- sh

Inside:

nc -zv <db-service-name> 3306 || true
nc -zv <cache-service-name> 6379 || true
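
For a deeper check than a TCP connect, try an actual MariaDB login from a throwaway client pod (the image tag and root login here are assumptions; adjust to your deployment):

kubectl run -n $NS tmp-mysql --rm -it --image=mariadb:10.6 -- \
  mariadb -h <db-service-name> -u root -p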

9) ERPNext application logs (web/gunicorn, workers, scheduler)

List likely ERPNext pods

kubectl get pods -n $NS | egrep -i "web|gunicorn|worker|scheduler|socketio|nginx|frappe|bench" || true

Tail logs (replace pod name)

kubectl logs -n $NS <pod-name> --tail=300
kubectl logs -n $NS <pod-name> -f

If pod has multiple containers

kubectl get pod -n $NS <pod-name> -o jsonpath='{.spec.containers[*].name}'; echo
kubectl logs -n $NS <pod-name> -c <container-name> --tail=300

10) Exec into ERP pod (inspect configs, run bench commands)

Shell into a running ERP pod

kubectl exec -n $NS -it <pod-name> -- bash
# or
kubectl exec -n $NS -it <pod-name> -- sh

Inside (varies by image):

env | sort | head
ls -la
bench --version || true
bench version || true
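
If bench is available, a couple of read-only checks (the path below is the usual frappe_docker layout and may differ in your image):

cd /home/frappe/frappe-bench 2>/dev/null || true
ls sites/
bench doctor || true   # shows scheduler/worker status per site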

Don’t run migrations blindly in PROD unless you have a clear rollback plan.
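
A minimal first step for that rollback plan, run inside the pod (site name is a placeholder; assumes a standard bench layout):

bench --site <site-name> backup --with-files || true
# Backups land under sites/<site-name>/private/backups/ by default.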


11) Resource issues (OOMKilled, CPU throttling, nodes full)

Check resource usage

kubectl top nodes
kubectl top pods -n $NS

Find OOMKilled / restarts

kubectl get pods -n $NS --sort-by='.status.containerStatuses[0].restartCount'
kubectl describe pod -n $NS <pod-name> | egrep -in "OOMKilled|Killed|Reason|Last State" || true
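
OOMKilled usually means the container hit its memory limit; check requests/limits directly:

kubectl get pod -n $NS <pod-name> \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}'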

Node pressure events

kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp | egrep -i "evict|pressure|OOM|disk" | tail -n 60

12) Service routing issues (Ingress OK but backend not reachable / endpoints empty)

Service + selectors

kubectl get svc -n $NS -o wide
kubectl describe svc -n $NS <service-name>

Endpoints must show pod IPs

kubectl get endpoints -n $NS <service-name> -o yaml

If endpoints are empty:
- labels/selector mismatch
- pods not Ready (readiness probe failing)

Compare selector and labels with the commands below.
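
The selector on the Service must match labels on Ready pods:

kubectl get svc -n $NS <service-name> -o jsonpath='{.spec.selector}'; echo
kubectl get pods -n $NS --show-labels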


13) Port-forward debugging (bypass ingress)

Port-forward service to local machine

kubectl port-forward -n $NS svc/<service-name> 8080:<service-port>
# Open: http://localhost:8080

Port-forward a pod directly

kubectl port-forward -n $NS pod/<pod-name> 8080:<container-port>

14) Handy one-liners for fast triage

Show only problematic resources

kubectl get pods -n $NS | egrep -i "Pending|CrashLoopBackOff|ImagePullBackOff|Error" || true
kubectl get pvc  -n $NS | egrep -i "Pending|Lost" || true
kubectl get events -n $NS --sort-by=.metadata.creationTimestamp | tail -n 30

Sort pods by restart count

kubectl get pods -n $NS \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount,PHASE:.status.phase \
  --sort-by='.status.containerStatuses[0].restartCount'