Need some ideas for a very large instance

I have a rather large instance. A node was in the process of starting up 131 builds across the agents on this Jenkins controller, spaced out by about 5 seconds per job, much like pushing the Build Now button a bunch of times… It monitors the queue to match jobs to machines instead of dumping a thousand jobs into the queue at once, but it still seems to be a little too much for the server to handle. Does anyone have suggestions on how I can further tune this environment to keep the OOM killer at bay?

[Service]
TimeoutStartSec=108000
Environment="JAVA_OPTS=-XX:+AlwaysPreTouch -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/jenkins/heapjdump.log -Xms16g -Xmx16g -XX:+UseG1GC -XX:+UseStringDeduplication -XX:+ParallelRefProcEnabled -XX:+DisableExplicitGC -XX:+UnlockDiagnosticVMOptions -XX:+UnlockExperimentalVMOptions -Xlog:gc*,gc+heap=info,gc+heap=debug,gc+ref*=debug,gc+ergo*=trace,gc+age*=trace:file=/var/log/jenkins/gc.log:utctime,pid,level,tags:filecount=2,filesize=100M -XX:ErrorFile=/var/log/jenkins/hs_err_%p.log -XX:+LogVMOutput -XX:LogFile=/var/log/jenkins/jvm.log -DhistoryWidget.descriptionLimit=-1 -Dhudson.model.DirectoryBrowserSupport.CSP="default-src *; script-src * 'unsafe-inline' 'unsafe-eval'; style-src * 'unsafe-inline'; img-src *; font-src *; connect-src *; object-src *; media-src *; frame-src *;""
free -m
              total        used        free      shared  buff/cache   available
Mem:          64563       56465         470           2        7627        7374


[1296853.833621] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/jenkins.service,task=java,pid=3195418,uid=116
[1296853.839423] Out of memory: Killed process 3195418 (java) total-vm:27410168kB, anon-rss:18880016kB, file-rss:15908kB, shmem-rss:0kB, UID:116 pgtables:42924kB oom_score_adj:0

I’m not 100% sure how you are running Jenkins, so how to optimize will depend on many factors.

We run our Jenkins instances in Kubernetes (controllers and agents). The controllers' Java processes have a sufficient Java heap, but when many declarative pipelines trigger at the same time, they each trigger a git clone of the target repo in order to fetch the Jenkinsfile, which in turn spawns dozens of git processes that hoard all the memory on the container; the Java process is then the one that gets OOM killed.
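
If you want to confirm whether the same thing is happening on your box, a quick check (assuming the controller runs as the jenkins user, as in your systemd unit) is to look at the largest jenkins-owned processes and at what the controller JVM has forked:

    # Largest jenkins-owned processes by resident memory (RSS in KB)
    ps -u jenkins -o rss=,args= --sort=-rss | head -20

    # Processes forked by the controller JVM (git clones, etc.); adjust the
    # pgrep pattern to however java is started on your install
    ps -o pid,rss,etime,args --ppid "$(pgrep -u jenkins -f java | head -1)"

If most of the resident memory turns out to be in git (or other forked) processes rather than in the JVM itself, raising -Xmx will not help; the points below are what helped us on the git side.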

The way we deal with this is by staggering the pipelines with a wait, like you do, but we also did several things to reduce the memory, CPU and IO caused by the git processes on the controller.

If git is ultimately what causes your OOM issues, here are some of the things we have done:

  • use reference clones, so that the git clone for each pipeline is significantly faster and uses less memory overall (see the sketch after this list).
  • tune your git config to limit the number of threads and the amount of RAM used when cloning; by default each git command will try to grab as much RAM as it can and will starve your container of memory (also covered in the sketch after this list).
  • if it works for you, use Lightweight checkout in your pipelines; unfortunately it is incompatible with many setups, for example when you use the Git Parameter plugin to select which branch to build.
  • we added a custom git, a shell script wrapper pretending to be the real git. It adds further optimizations such as doing sparse clones on the controller so that only the root directory with the Jenkinsfiles is checked out there. This saves a lot of I/O, disk space, and inodes (our instances have run out of inodes because of this Jenkins behavior); a reduced sketch of the idea follows this list.
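
As a rough sketch of the first two points (the paths and values are only examples, not our exact settings): keep a mirror on the controller that you refresh periodically, point your clones at it as a reference repo (the Git plugin's advanced clone behaviours have an option for this), and cap git's packing threads and memory in the jenkins user's global config:

    # One-time setup of a shared reference repo on the controller (example path)
    sudo -u jenkins git clone --mirror https://example.com/org/repo.git /var/cache/git/repo.git

    # Refresh it from cron or a systemd timer so reference clones stay cheap
    sudo -u jenkins git -C /var/cache/git/repo.git fetch --prune

    # A clone that borrows objects from the reference repo instead of re-downloading them
    git clone --reference /var/cache/git/repo.git https://example.com/org/repo.git workdir

    # Limit how many threads and how much RAM git uses when packing (illustrative values)
    sudo -u jenkins git config --global pack.threads 2
    sudo -u jenkins git config --global pack.windowMemory 64m
    sudo -u jenkins git config --global pack.deltaCacheSize 64m
    sudo -u jenkins git config --global core.packedGitLimit 128m
    sudo -u jenkins git config --global core.packedGitWindowSize 32m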
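
And a very reduced sketch of the wrapper idea, just to show the shape of it (not our actual script): put it ahead of the real git on the controller's PATH, or configure it as the git tool the controller uses, and have it make controller-side clones cheap while passing everything else through. Depending on your Git plugin version the controller may run init/fetch rather than clone, so you may need to intercept different subcommands:

    #!/bin/sh
    # Hypothetical wrapper that pretends to be git; the real binary is assumed at /usr/bin/git
    REAL_GIT=/usr/bin/git

    if [ "$1" = "clone" ]; then
        shift
        # Blobless + sparse clone: only the top-level files are materialized (enough for the
        # controller to read the Jenkinsfile), and blobs are fetched lazily on demand
        exec "$REAL_GIT" clone --filter=blob:none --sparse "$@"
    fi

    exec "$REAL_GIT" "$@"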

If your problem is entirely caused by memory consumption inside the controller, it may come down to your plugins and the size of your console outputs. For example, the log-parser plugin is a nice option, but it has to apply regexes to the console output, and if your pipelines each generate several MB of console output your controller can run out of heap. Because of this, we banned that plugin from our instances. Other plugins abuse XML as a database, which does not scale; one such example is the Global Build Stats plugin.

Memory consumption also depends on the number of jobs and runs you have on your instance, and complex pipeline jobs consume more memory than freestyle jobs, I think.
Jenkins lazily loads runs, but some plugins can cause all runs to be loaded into memory.
So make sure you have configured build discarders for your jobs, or globally, so that you don't accumulate too many builds.
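
If you want a quick idea of which jobs are hoarding runs on disk (assuming the default /var/lib/jenkins home of a package install), something like this lists the builds directories with the most run folders:

    # Count the run directories under every builds/ folder and show the worst offenders
    find /var/lib/jenkins/jobs -type d -name builds | while read -r d; do
        printf '%6d %s\n' "$(find "$d" -mindepth 1 -maxdepth 1 -type d | wc -l)" "$d"
    done | sort -rn | head -20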