Hey Experts,
We are using Jenkins 2.246.3 with Java 11.0.23.
We have deployed Jenkins as a Podman container on an OL8 VM with 80 OCPUs and 1280 GB of memory.
We have set the Jenkins maximum heap size to 300 GB and the minimum heap size to 15000m.
Recently, we ran into an issue where pod provisioning failed with a timeout:
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"]}
io.fabric8.kubernetes.client.KubernetesClientTimeoutException: Timed out waiting for [300000] milliseconds for [Pod] with name:[docker-bench-oci-scan-zgq59-16hl9] in namespace [jenkins].
at org.csanchez.jenkins.plugins.kubernetes.AllContainersRunningPodWatcher.await(AllContainersRunningPodWatcher.java:95)
at org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher.launch(KubernetesLauncher.java:169)
at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:297)
at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
This eventually led to the following error:
13-Mar-2025 00:20:16.904 WARNING [Computer.threadPoolForRemoting [#16]] jenkins.util.ErrorLoggingExecutorService.lambda$wrap$0 null
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
at java.base/java.lang.Thread.start0(Native Method)
at java.base/java.lang.Thread.start(Thread.java:803)
at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937)
We also observed a jump in the number of processes on Jenkins around the same time (we do see GC doing its work during this period). As the number of processes increased, defunct git processes under the Java PID were lingering.
Can we correlate these events? Is there a way we can debug this further?
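To help correlate the two, a minimal sketch like the one below could be run while the issue is occurring. It assumes a Linux `/proc` filesystem; the helper names (`parse_status`, `summarize`, `read_all_statuses`) are made up for illustration and are not part of Jenkins or any plugin. It counts the threads of the Jenkins JVM and its defunct (zombie) child processes, since "unable to create native thread" usually points at a thread or process limit rather than at the Java heap.

```python
import os
import sys


def parse_status(text):
    """Parse the 'Key:<tab>value' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def summarize(statuses, jvm_pid):
    """Return (thread count of jvm_pid, number of zombie children of jvm_pid).

    `statuses` maps pid -> dict produced by parse_status().
    """
    threads = int(statuses[jvm_pid]["Threads"])
    zombies = sum(
        1
        for st in statuses.values()
        if st.get("PPid") == str(jvm_pid) and st.get("State", "").startswith("Z")
    )
    return threads, zombies


def read_all_statuses(proc="/proc"):
    """Read /proc/<pid>/status for every process visible to the current user."""
    statuses = {}
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "status")) as fh:
                statuses[int(entry)] = parse_status(fh.read())
        except OSError:
            continue  # the process exited while we were scanning
    return statuses


# Hypothetical usage: python3 count_threads.py <jenkins-java-pid>
if __name__ == "__main__" and len(sys.argv) > 1:
    jvm_pid = int(sys.argv[1])
    threads, zombies = summarize(read_all_statuses(), jvm_pid)
    print(f"pid={jvm_pid} threads={threads} zombie_children={zombies}")
```

The resulting numbers could then be compared against `ulimit -u` for the Jenkins user and the pids limit of the container's cgroup (`pids.max`). Which of these limits applies in our setup is something we have not confirmed yet.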
Sharing some GC stats below:
S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT CGC CGCT GCT
0.0 98304.0 0.0 95268.5 4849664.0 1703936.0 88834048.0 85957824.5 431712.0 272826.3 0.0 0.0 13410 499.225 1 5.132 2426 225.996 730.353
0.0 98304.0 0.0 95268.5 4849664.0 1933312.0 88834048.0 85957824.5 431712.0 272826.3 0.0 0.0 13410 499.225 1 5.132 2426 225.996 730.353
0.0 98304.0 0.0 95268.5 4849664.0 2162688.0 88834048.0 85957824.5 431712.0 272826.3 0.0 0.0 13410 499.225 1 5.132 2426 225.996 730.353
0.0 98304.0 0.0 95268.5 4849664.0 2392064.0 88834048.0 85957824.5 431712.0 272826.3 0.0 0.0 13410 499.225 1 5.132 2426 225.996 730.353
0.0 98304.0 0.0 95268.5 4849664.0 2523136.0 88834048.0 85957824.5 431712.0 272826.3 0.0 0.0 13410 499.225 1 5.132 2426 225.996 730.353
0.0 98304.0 0.0 95268.5 4849664.0 2719744.0 88834048.0 85957824.5 431712.0 272826.3 0.0 0.0 13410 499.225 1 5.132 2426 225.996 730.353
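For reference, the occupancy figures can be derived from the first row above with a small script. This is only a sketch: it assumes the output comes from `jstat -gc` (sizes in KB, times in seconds), and `parse_row`, `old_gen_percent` and `gc_time_total` are illustrative helper names.

```python
HEADER = "S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT CGC CGCT GCT"
ROW = ("0.0 98304.0 0.0 95268.5 4849664.0 1703936.0 88834048.0 85957824.5 "
       "431712.0 272826.3 0.0 0.0 13410 499.225 1 5.132 2426 225.996 730.353")


def parse_row(header, row):
    """Pair each jstat column name with its numeric value."""
    names = header.split()
    values = [float(v) for v in row.split()]
    if len(names) != len(values):
        raise ValueError("header and row have different column counts")
    return dict(zip(names, values))


def old_gen_percent(sample):
    """Old generation used (OU) as a percentage of its current capacity (OC)."""
    return 100.0 * sample["OU"] / sample["OC"]


def gc_time_total(sample):
    """Sum of young, full and concurrent GC time in seconds."""
    return sample["YGCT"] + sample["FGCT"] + sample["CGCT"]


sample = parse_row(HEADER, ROW)
print(f"old gen: {old_gen_percent(sample):.2f}% of {sample['OC'] / 1024:.0f} MB capacity")
print(f"survivor: {100.0 * sample['S1U'] / sample['S1C']:.2f}% used")
print(f"GC time: {gc_time_total(sample):.3f}s (jstat GCT: {sample['GCT']:.3f}s)")
```

Note that OC is the current old-generation capacity reported by jstat, not the 300 GB maximum heap, so the OU/OC ratio is relative to what the JVM has sized the old generation to at that moment.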
jmap heap output:
Attaching to process ID 3978360, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 11.0.23-ea+1-LTS-175
using thread-local object allocation.
Garbage-First (G1) GC with 103 thread(s)
Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 322122547200 (307200.0MB)
NewSize = 1363144 (1.2999954223632812MB)
MaxNewSize = 193273528320 (184320.0MB)
OldSize = 5452592 (5.1999969482421875MB)
NewRatio = 2
SurvivorRatio = 8
MetaspaceSize = 21807104 (20.796875MB)
CompressedClassSpaceSize = 1073741824 (1024.0MB)
MaxMetaspaceSize = 17592186044415 MB
G1HeapRegionSize = 33554432 (32.0MB)
Heap Usage:
G1 Heap:
regions = 9600
capacity = 322122547200 (307200.0MB)
used = 88288727304 (84198.69165802002MB)
free = 233833819896 (223001.30834197998MB)
27.40842827409506% used
G1 Young Generation:
Eden Space:
regions = 9
capacity = 4966055936 (4736.0MB)
used = 301989888 (288.0MB)
free = 4664066048 (4448.0MB)
6.081081081081081% used
Survivor Space:
regions = 2
capacity = 100663296 (96.0MB)
used = 93309664 (88.98703002929688MB)
free = 7353632 (7.012969970703125MB)
92.69482294718425% used
G1 Old Generation:
regions = 2624
capacity = 90966065152 (86752.0MB)
used = 87893427752 (83821.70462799072MB)
free = 3072637400 (2930.2953720092773MB)
96.62221577368905% used
The G1 Old Generation and the G1 Young Generation Survivor Space both show very high memory usage. I would like to understand this allocation better; any help narrowing the issue down or fine-tuning GC parameters would be appreciated.
Regards
Hema