Kubernetes Client Timeout Exception: Timed out waiting for pod provisioning

hpriya · March 14, 2025, 1:11pm

Hey Experts,
We are using Jenkins 2.246.3 with Java 11.0.23.

Jenkins is deployed as a Podman container on an OL8 VM with 80 OCPUs and 1280 GB of memory.

The Jenkins max heap size is set to 300 GB and the min heap size to 15000m.

Recently, we ran into pod provisioning failures caused by a timeout:

spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"]}
	io.fabric8.kubernetes.client.KubernetesClientTimeoutException: Timed out waiting for [300000] milliseconds for [Pod] with name:[docker-bench-oci-scan-zgq59-16hl9] in namespace [jenkins].
		at org.csanchez.jenkins.plugins.kubernetes.AllContainersRunningPodWatcher.await(AllContainersRunningPodWatcher.java:95)
		at org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher.launch(KubernetesLauncher.java:169)
		at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:297)
		at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
		at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
		at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
		at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
		at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
		at java.base/java.lang.Thread.run(Thread.java:834)

which eventually led to

13-Mar-2025 00:20:16.904 WARNING [Computer.threadPoolForRemoting [#16]] jenkins.util.ErrorLoggingExecutorService.lambda$wrap$0 null
        java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
                at java.base/java.lang.Thread.start0(Native Method)
                at java.base/java.lang.Thread.start(Thread.java:803)
                at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937)
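
"unable to create native thread" is raised when the OS refuses to start another thread, not when the Java heap is full, so the process and thread limits that apply to the JVM are worth comparing with its current thread count. A minimal sketch of that check follows, assuming a Linux /proc filesystem is visible inside the container; the function names are illustrative and not part of Jenkins or any plugin.

```python
import re
from pathlib import Path


def thread_count(status_text: str) -> int:
    """Extract the 'Threads:' value from the text of /proc/<pid>/status."""
    match = re.search(r"^Threads:\s+(\d+)$", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Threads:' line found")
    return int(match.group(1))


def max_processes_soft_limit(limits_text: str):
    """Return the soft 'Max processes' limit from /proc/<pid>/limits.

    Returns None when the limit is 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max processes"):
            soft = line.split()[2]  # columns: Max, processes, soft, hard, units
            return None if soft == "unlimited" else int(soft)
    raise ValueError("no 'Max processes' line found")


def report(pid: str) -> None:
    """Print the thread count and limits for one PID (hypothetical helper)."""
    proc = Path("/proc") / pid
    print("threads:", thread_count((proc / "status").read_text()))
    print("max processes (soft):",
          max_processes_soft_limit((proc / "limits").read_text()))
    # The cgroup PID limit lives in a different place on cgroup v1 and v2;
    # both candidate paths are assumptions about how the container is set up.
    for candidate in ("/sys/fs/cgroup/pids.max", "/sys/fs/cgroup/pids/pids.max"):
        path = Path(candidate)
        if path.exists():
            print(candidate, "=", path.read_text().strip())
```

Calling report("3978360") inside the container (the PID from the jmap output) would show how close the JVM is to each limit. The container's PID limit may be the tightest of these, since Podman can apply one per container depending on its configuration.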

We also observed a jump in the number of processes on the Jenkins host at the same time (GC was still doing its work during that period). As the process count increased, defunct processes from git operations, associated with the Java PID, kept lingering.

Can these events be correlated? Is there a way to debug this further?
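
To see whether the defunct processes really are children of the Jenkins JVM, and how many there are, the output of `ps -eo pid,ppid,stat,comm` can be grouped by parent PID. This is a rough sketch; the function name is illustrative, and it assumes the four-column ps format shown in the docstring.

```python
from collections import Counter


def defunct_by_parent(ps_output: str) -> Counter:
    """Count zombie (state Z) processes per (parent PID, command).

    Expects the text produced by: ps -eo pid,ppid,stat,comm
    """
    counts = Counter()
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue  # ignore blank or malformed lines
        _pid, ppid, stat, comm = fields
        if stat.startswith("Z"):
            counts[(int(ppid), comm.strip())] += 1
    return counts
```

A large count under the Java PID would suggest the JVM is not reaping its git child processes. Each defunct process still holds a PID until it is reaped, so it likely counts toward the same process limits as new threads.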

Sharing some GC stats (jstat -gc output, sizes in KB) below:

 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT    CGC    CGCT     GCT   
 0.0   98304.0  0.0   95268.5 4849664.0 1703936.0 88834048.0 85957824.5 431712.0 272826.3  0.0    0.0    13410  499.225   1      5.132 2426   225.996  730.353
 0.0   98304.0  0.0   95268.5 4849664.0 1933312.0 88834048.0 85957824.5 431712.0 272826.3  0.0    0.0    13410  499.225   1      5.132 2426   225.996  730.353
 0.0   98304.0  0.0   95268.5 4849664.0 2162688.0 88834048.0 85957824.5 431712.0 272826.3  0.0    0.0    13410  499.225   1      5.132 2426   225.996  730.353
 0.0   98304.0  0.0   95268.5 4849664.0 2392064.0 88834048.0 85957824.5 431712.0 272826.3  0.0    0.0    13410  499.225   1      5.132 2426   225.996  730.353
 0.0   98304.0  0.0   95268.5 4849664.0 2523136.0 88834048.0 85957824.5 431712.0 272826.3  0.0    0.0    13410  499.225   1      5.132 2426   225.996  730.353
 0.0   98304.0  0.0   95268.5 4849664.0 2719744.0 88834048.0 85957824.5 431712.0 272826.3  0.0    0.0    13410  499.225   1      5.132 2426   225.996  730.353
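
As a rough way to read these rows, the sketch below computes old generation occupancy and compares the currently committed heap with the configured maximum. It assumes the capacity and utilization columns are in KB, as the jstat documentation describes, and uses 307200 MB for the 300 GB max heap. The function name is illustrative.

```python
def parse_jstat_gc(header: str, row: str) -> dict:
    """Map jstat -gc column names to float values for one sample row."""
    return dict(zip(header.split(), map(float, row.split())))


header = ("S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU "
          "YGC YGCT FGC FGCT CGC CGCT GCT")
# First sample row from the output above.
row = ("0.0 98304.0 0.0 95268.5 4849664.0 1703936.0 88834048.0 85957824.5 "
       "431712.0 272826.3 0.0 0.0 13410 499.225 1 5.132 2426 225.996 730.353")

stats = parse_jstat_gc(header, row)

# Old generation occupancy relative to its *current* capacity.
old_pct = 100.0 * stats["OU"] / stats["OC"]

# Heap currently committed across survivor, eden and old spaces, in MB.
committed_mb = (stats["S0C"] + stats["S1C"] + stats["EC"] + stats["OC"]) / 1024

# Assumption: 300 GB max heap, i.e. 307200 MB.
max_heap_mb = 307200.0
committed_pct_of_max = 100.0 * committed_mb / max_heap_mb

print(f"old gen used: {old_pct:.1f}% of current old capacity")
print(f"committed heap: {committed_mb:.0f} MB "
      f"({committed_pct_of_max:.1f}% of max heap)")
```

If this reading is right, the old generation is nearly full relative to the capacity committed so far, while the committed heap is well below the configured maximum, so G1 still has room to grow the heap.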

jmap heap summary:

Attaching to process ID 3978360, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 11.0.23-ea+1-LTS-175

using thread-local object allocation.
Garbage-First (G1) GC with 103 thread(s)

Heap Configuration:
   MinHeapFreeRatio         = 40
   MaxHeapFreeRatio         = 70
   MaxHeapSize              = 322122547200 (307200.0MB)
   NewSize                  = 1363144 (1.2999954223632812MB)
   MaxNewSize               = 193273528320 (184320.0MB)
   OldSize                  = 5452592 (5.1999969482421875MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 33554432 (32.0MB)

Heap Usage:
G1 Heap:
   regions  = 9600
   capacity = 322122547200 (307200.0MB)
   used     = 88288727304 (84198.69165802002MB)
   free     = 233833819896 (223001.30834197998MB)
   27.40842827409506% used
G1 Young Generation:
Eden Space:
   regions  = 9
   capacity = 4966055936 (4736.0MB)
   used     = 301989888 (288.0MB)
   free     = 4664066048 (4448.0MB)
   6.081081081081081% used
Survivor Space:
   regions  = 2
   capacity = 100663296 (96.0MB)
   used     = 93309664 (88.98703002929688MB)
   free     = 7353632 (7.012969970703125MB)
   92.69482294718425% used
G1 Old Generation:
   regions  = 2624
   capacity = 90966065152 (86752.0MB)
   used     = 87893427752 (83821.70462799072MB)
   free     = 3072637400 (2930.2953720092773MB)
   96.62221577368905% used

Memory usage is very high for the G1 Old Generation (96.6%) and the G1 Young Generation Survivor Space (92.7%). I would like to understand this allocation better; any help narrowing the issue down or fine-tuning GC parameters would be appreciated.

Regards
Hema

hpriya · March 18, 2025, 8:43am

Adding to the query: how are Java's used memory and swap memory handled on Jenkins? Are there any metrics to validate that the G1 GC is functioning correctly?
