Jenkins Default Load Balancing Behavior

Greetings Folks,

Our infrastructure consists of:

  • Jenkins Controller v2.289.1 (0 Executors)
  • Jenkins Agent 1 (4 Executors)
  • Jenkins Agent 2 (4 Executors)
  • Jenkins Agent 3 (4 Executors)

Through Grafana/Prometheus, we’ve noticed that the CPU usage of Agent 1 is 5 times higher than that of the other two agents. More specifically, Agent 3’s CPU usage sits at ~1% (7-day average graph), while Agent 1 crashed twice this week due to heavy builds. This caught our attention, so we did some research and found this:

https://docs.cloudbees.com/docs/admin-resources/latest/plugins/even-scheduler

"By default, Jenkins employs the algorithm known as consistent hashing to make this decision. More specifically, it hashes the name of the node, in numbers proportional to the number of available executors, then hashes the job name to create a probe point for the consistent hash.

This behavior is based on the assumption that it is preferable to use the same workspace as much as possible, because SCM updates are more efficient than SCM checkouts. In a typical continuous integration situation, each build only contains a limited number of changes, so indeed updates (which only fetch updated files) run substantially faster than checkouts (which refetch all files from scratch.)"
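To make the quoted description concrete, here is a minimal sketch of consistent hashing in Python. This is only an illustration of the concept (node names hashed into points proportional to executor count, job name hashed to a probe point); it is not Jenkins’ actual implementation, and the hash function, replica count, and names are all placeholders:

```python
import hashlib

def _hash(s):
    # Stable 64-bit hash of a string (MD5 here; Jenkins' real hash differs).
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

def build_ring(nodes):
    # nodes: {name: executor_count}. Each node gets ring points
    # proportional to its executor count, as the CloudBees doc describes.
    ring = []
    for name, executors in nodes.items():
        for i in range(executors * 100):  # 100 virtual points per executor
            ring.append((_hash(f"{name}#{i}"), name))
    ring.sort()
    return ring

def pick_node(ring, job_name):
    # Hash the job name to a probe point and walk clockwise to the
    # first node point at or after it.
    probe = _hash(job_name)
    for point, name in ring:
        if point >= probe:
            return name
    return ring[0][1]  # wrap around the ring

nodes = {"agent-1": 4, "agent-2": 4, "agent-3": 4}
ring = build_ring(nodes)

# The key property: the same job name always maps to the same agent,
# regardless of how loaded that agent currently is.
assert pick_node(ring, "my-heavy-build") == pick_node(ring, "my-heavy-build")
```

The sketch shows why the default scheduler is “sticky”: placement depends only on the job name and the node set, never on current load, so one agent can end up with all the heavy jobs.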

Since there’s no other official documentation about it (or at least none I’m aware of), I assume this is what causes most of our jobs to be built on Agent 1 while the other two agents stay nearly idle.

A temporary solution we considered was to pin heavy jobs to specific agents (e.g. agent { label 'agent-3' }), but that doesn’t sound very dynamic, as new jobs may arrive and occupy the other agents as well.
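For reference, this is what the pinning approach looks like in a declarative pipeline; the 'agent-3' label comes from the example above, and the build step is a placeholder:

```groovy
pipeline {
    // Pin this job to the node(s) carrying the 'agent-3' label.
    agent { label 'agent-3' }
    stages {
        stage('Build') {
            steps {
                sh 'make build'   // hypothetical build step
            }
        }
    }
}
```

Label expressions also allow combinations like `agent { label 'agent-2 || agent-3' }`, which at least spreads the pinned jobs over more than one node.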

Hence, we wanted to test the Even Scheduler Plugin, but unfortunately it’s for CloudBees CI only, which we don’t use. However, there’s also the Least Load Plugin for the regular Jenkins distribution, which reportedly does the same thing:

https://plugins.jenkins.io/leastload/

We tried it in a test environment, but we haven’t applied it in production yet.

My question here is: is it worth ‘sacrificing’ the default decision-making behavior so we can distribute jobs evenly and unload stressed agents? This means jobs may take longer, because if a job hasn’t been built before on a specific agent, the ‘refetch from scratch’ will happen more often. On the other hand, this promises no agent crashes, since the headroom of all three agents occupied evenly is more than enough.

If there are any suggestions or I’m mistaken about anything, please let me know!

Thanks for reading!

Most of this depends on the nature of your jobs and how they are configured.

For example, we have a cleanWs step in most jobs (either before checkout or at the end) and often have many parallel runs of the same job, so it is a fresh clone in most cases anyway. To reduce clone times, we instead use local mirror repos as the reference repository in the checkout, and we use the LeastLoad balancer.
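To illustrate the reference-repository trick mentioned above, a checkout step in a scripted pipeline might look like this; the repository URL and the mirror path /var/mirrors/myrepo.git are hypothetical placeholders, and this assumes the Git plugin’s CloneOption extension:

```groovy
checkout([
    $class: 'GitSCM',
    branches: [[name: '*/main']],
    userRemoteConfigs: [[url: 'https://example.com/myrepo.git']],
    // CloneOption's 'reference' points at a local bare mirror on the agent,
    // so even a from-scratch clone borrows most objects locally instead of
    // fetching everything over the network.
    extensions: [[$class: 'CloneOption', reference: '/var/mirrors/myrepo.git']]
])
```

With a reference repository on every agent, the cost of the ‘refetch from scratch’ you’re worried about drops considerably, which makes an even-load scheduler much less painful.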

So I would recommend just testing it and seeing what best fits your needs.

Thanks for replying. I’ll take that into account.

The problem with the default scheduler (which prefers the last used node) is exactly what you see.
If there is a slot free on the agent, it will build on the same agent as last time.
I don’t understand how the agent is chosen for a job that has never run before.
We’re using the “leastLoad” plugin because we do a full build all the time anyway.
You might want to test whether it helps to manually configure some jobs to run on Agent 2 or 3 once and then revert to your configuration, so that Jenkins hopefully prefers Agent 2 or 3 respectively for those jobs in the future.