Greetings Folks,
Our infrastructure consists of:
- Jenkins Controller v2.289.1 (0 Executors)
- Jenkins Agent 1 (4 Executors)
- Jenkins Agent 2 (4 Executors)
- Jenkins Agent 3 (4 Executors)
Through Grafana/Prometheus, we’ve noticed that Agent 1’s CPU usage is about 5 times higher than the other two agents’. More specifically, Agent 3’s CPU usage sits at ~1% (7-day average graph), while Agent 1 crashed twice this week due to a heavy build. This caught our attention, so we did some research and found this:
https://docs.cloudbees.com/docs/admin-resources/latest/plugins/even-scheduler
"By default, Jenkins employs the algorithm known as consistent hashing to make this decision. More specifically, it hashes the name of the node, in numbers proportional to the number of available executors, then hashes the job name to create a probe point for the consistent hash.
…
This behavior is based on the assumption that it is preferable to use the same workspace as much as possible, because SCM updates are more efficient than SCM checkouts. In a typical continuous integration situation, each build only contains a limited number of changes, so indeed updates (which only fetch updated files) run substantially faster than checkouts (which refetch all files from scratch.)"
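To check my understanding of the quote above, here is a rough Groovy sketch of the idea. To be clear, this is my own simplification, not Jenkins’ actual code; the node names, executor counts and hash function are made up:

```groovy
// Rough sketch of consistent hashing with executor-weighted nodes.
// Each node gets one point on the "ring" per executor; a job name is hashed
// to a probe point and the job lands on the owner of the next ring point.
def ring = new TreeMap<Integer, String>()

def nodes = ['agent-1': 4, 'agent-2': 4, 'agent-3': 4]  // node name -> executor count
nodes.each { name, executors ->
    (1..executors).each { i ->
        ring.put("${name}#${i}".toString().hashCode(), name)
    }
}

def assign = { String jobName ->
    int probe = jobName.hashCode()
    // first ring point at or after the probe, wrapping around if needed
    def entry = ring.ceilingEntry(probe) ?: ring.firstEntry()
    entry.value
}

// The key property: the same job name always maps to the same node,
// regardless of how loaded that node already is.
['app-build', 'lib-build', 'docs-build'].each { job ->
    println "${job} -> ${assign(job)}"
}
```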
Since there’s no other official documentation about it (or at least none that I’m aware of), I assume this is what causes most of our jobs to be built on Agent 1 while the other two agents stay nearly idle.
A temporary solution we considered was to pin heavy jobs to specific agents (e.g. agent { label 'agent-3' }, see the sketch below), but that doesn’t feel very dynamic, as new jobs may arrive and overload the other agents as well.
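For reference, the pinning we had in mind looks roughly like this in a declarative Jenkinsfile (the stage name and build script are placeholders):

```groovy
// Hypothetical Jenkinsfile: pinning a heavy job to one agent via a label.
pipeline {
    agent { label 'agent-3' }          // runs only on nodes carrying the 'agent-3' label
    stages {
        stage('Build') {
            steps {
                sh './heavy-build.sh'  // placeholder for the actual heavy build step
            }
        }
    }
}
```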
Hence, we wanted to test the Even Scheduler plugin, but unfortunately it’s available for CloudBees CI only, which we don’t use. However, there’s also the Least Load plugin for regular Jenkins distributions, which allegedly does the same thing:
https://plugins.jenkins.io/leastload/
We tried it out in a test environment, but we haven’t rolled it out to production yet.
My question here is: is it worth ‘sacrificing’ the default decision-making behavior so we can distribute jobs evenly and unload stressed agents? This means jobs may take longer, because when a job hasn’t been built on a specific agent before, the ‘refetch from scratch’ will happen more often. On the other hand, this should prevent agent crashes, since the headroom of all three agents occupied evenly is more than enough.
If there are any suggestions or I’m mistaken about anything, please let me know!
Thanks for reading!