We are running Jenkins on EKS. Most of the time, whenever the load goes beyond a threshold, the Jenkins controller pod goes down. Is it possible to stop accepting new (upcoming) jobs whenever the EKS cluster node CPU is above 80%?
Are you running your builds on the controller? Otherwise Jenkins doesn’t need much CPU normally.
In any case, it is recommended not to run builds on the controller but on agents, either permanent agents or agents created dynamically by a cloud plugin when needed.
We are not running builds on the controller. Here, “node” refers to the EKS cluster node and its health.
Hi @Dhanasekhar, the problem you describe is related to EKS/Kubernetes and has nothing to do with Jenkins.
I believe the following official Kubernetes documentation page would help you: Resource Management for Pods and Containers | Kubernetes.
The first step is to determine the root cause. Correlation (“when I have high load, my Jenkins goes down”) is not causation (“CPU at 100% evicts the controller pod, as per the logs”): unless you have a “describe pod” (or EKS) event message proving that high CPU load on the underlying node kills your pod, you cannot draw any conclusion about causality.
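As a rough illustration of that root-cause check, here is a minimal Python sketch that reads the termination reasons Kubernetes records in a pod's status. It assumes you feed it the JSON from `kubectl get pod <controller-pod> -n <namespace> -o json`; the sample pod below, including the container name `jenkins`, is made up for illustration.

```python
def termination_reasons(pod: dict) -> list[str]:
    """Collect the reasons Kubernetes recorded for a pod or its containers stopping."""
    status = pod.get("status", {})
    reasons = []
    # Pod-level reason, e.g. "Evicted" when the kubelet evicts the pod.
    if status.get("reason"):
        message = status.get("message", "no message")
        reasons.append(f"pod: {status['reason']} ({message})")
    # Container-level reasons, e.g. "OOMKilled" or "Error", current and previous run.
    for cs in status.get("containerStatuses", []):
        for state_key in ("state", "lastState"):
            terminated = cs.get(state_key, {}).get("terminated")
            if terminated:
                reason = terminated.get("reason", "unknown")
                code = terminated.get("exitCode")
                reasons.append(f"container {cs['name']}: {reason} (exit code {code})")
    return reasons


# Hypothetical sample; real input would come from `kubectl get pod ... -o json`.
SAMPLE_POD = {
    "status": {
        "reason": "Evicted",
        "message": "The node was low on resource: memory.",
        "containerStatuses": [
            {
                "name": "jenkins",
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            }
        ],
    }
}

for line in termination_reasons(SAMPLE_POD):
    print(line)
```

If this reports something like `Evicted` or `OOMKilled`, the cause is more likely memory or disk pressure than CPU, since Kubernetes throttles CPU rather than killing pods for it. An empty result means the pod status alone does not explain the restarts, and the pod events or node logs are the next place to look.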
You should make sure that:
- You set resource requests (reservations) and limits on your controller pod and also on agent pod templates, to tell the scheduler on EKS how to place pods on nodes (and to avoid packing too many pods onto the same node, the “noisy neighbor” case)
- Agent pods do not run on the same node as your controller (for performance and security reasons: “workflow isolation”), using taints, tolerations, and nodeSelector: Taints and Tolerations | Kubernetes
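The two points above could translate into a pod spec fragment along these lines. This is only a sketch: it builds the fragment as a Python dict and prints it as JSON (which `kubectl` also accepts). The field names are standard Pod spec fields, but the label `workload=jenkins-controller`, the taint `dedicated=jenkins-controller:NoSchedule`, the container name, and the CPU/memory figures are all assumptions to adapt to your cluster.

```python
import json


def controller_pod_overrides(cpu_request="1", cpu_limit="2", memory="4Gi"):
    """Return a pod spec fragment for the Jenkins controller (placeholder values)."""
    return {
        # Assumed label on the node(s) reserved for the controller.
        "nodeSelector": {"workload": "jenkins-controller"},
        # Assumed taint on those nodes; agent pods without this toleration
        # cannot be scheduled there.
        "tolerations": [
            {
                "key": "dedicated",
                "operator": "Equal",
                "value": "jenkins-controller",
                "effect": "NoSchedule",
            }
        ],
        "containers": [
            {
                "name": "jenkins",
                "resources": {
                    # Requests are what the scheduler reserves on the node.
                    "requests": {"cpu": cpu_request, "memory": memory},
                    # Limits cap usage; memory request == limit avoids overcommit.
                    "limits": {"cpu": cpu_limit, "memory": memory},
                },
            }
        ],
    }


print(json.dumps(controller_pod_overrides(), indent=2))
```

Agent pod templates would get their own requests and limits in the same `resources` shape, without the toleration, so they land on other nodes.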
Additionally, enabling Kubernetes Horizontal Pod Autoscaling might help.
You can set a target on resource usage; for example, setting the target average utilization to 60% will automatically spin up more pods once usage rises above it. Ensure that your cluster has autoscaling enabled so that nodes are available for the extra pods. I would also recommend using Karpenter on top of this to optimize EC2 resource utilization.
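To make the 60% figure concrete, here is a small sketch of the replica calculation described in the Kubernetes HPA documentation, for workloads that can run as multiple replicas. It is a simplification: it ignores min/max replica bounds and stabilization windows, and the function and parameter names are my own.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     tolerance=0.1):
    """Simplified HPA rule: ceil(currentReplicas * current / target)."""
    ratio = current_utilization / target_utilization
    # The HPA skips scaling when the ratio is close enough to 1.0
    # (the documented default tolerance is 0.1).
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Two pods averaging 90% CPU against a 60% target.
print(desired_replicas(2, 90, 60))
```

With two pods at 90% average utilization and a 60% target, the rule asks for a third pod; at 62% it would leave the count unchanged because the ratio is within the tolerance.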
You cannot use HPA on Jenkins controller pods. (Unless you are actually running CloudBees CI, in HA mode.)