We upgraded to Jenkins 2.452.4 on JDK 17 (from 2.440.2 on JDK 11) and our memory consumption rose from 6 GB to 12 GB. We are running approximately 4000 jobs at once, which sit in the queue from the start. Analyzing a heap dump, we noticed 44 million instances of org.jenkinsci.plugins.workflow.support.concurrent.Timeout. From looking into the heap dump, I guess those objects are created by jobs waiting in the queue. Am I right?
We can’t downgrade because of security policy, so we can’t compare whether this is normal, but it does not look normal to me.
I would be thankful for any hint. We will continue to diagnose the issue in the meantime.
The significant increase in memory consumption and the large number of org.jenkinsci.plugins.workflow.support.concurrent.Timeout instances suggest that there might be an issue with how Jenkins is handling jobs in the queue.
This could be related to changes in the newer Jenkins version or the JDK version.
Here are some steps that could help you diagnose and potentially mitigate the issue:
Ensure all Jenkins plugins are up to date. Sometimes, plugin updates include performance improvements and bug fixes.
Check if there are any specific configurations or plugins that might be causing excessive memory usage. For example, certain pipeline configurations or plugins might not be optimized for the new Jenkins or JDK version.
Temporarily increase the heap size to accommodate the higher memory usage while you diagnose the issue. You can do this by modifying the JAVA_OPTS in the Jenkins startup script: export JAVA_OPTS="-Xms8g -Xmx16g"
Use tools like Eclipse MAT (Memory Analyzer Tool) to analyze the heap dump and identify the root cause of the memory consumption. Look for patterns or specific objects that are consuming a lot of memory.
If the issue is related to jobs waiting in the queue, consider optimizing the job queue management. This might involve adjusting the number of executors, using job throttling plugins, or distributing the load across multiple Jenkins instances.
Tune the garbage collection settings to improve memory management. You can add the following options to JAVA_OPTS: export JAVA_OPTS="-Xms8g -Xmx16g -XX:+UseG1GC -XX:MaxGCPauseMillis=200"
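To make the heap-size and GC suggestions above concrete, here is a rough sketch of startup options plus heap-dump capture commands. Paths, sizes, and the process id are placeholders to adapt to your environment; the logging and diagnostic flags are standard JDK 17 tooling (jcmd, jmap).

```shell
# Sketch only: adjust paths, <pid>, and heap sizes to your environment.
# Jenkins startup options: larger heap, G1 GC, and GC logging for diagnosis.
export JAVA_OPTS="-Xms8g -Xmx16g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 \
  -Xlog:gc*:file=/var/log/jenkins/gc.log:time,uptime,level,tags"

# Capture a heap dump of the running controller for Eclipse MAT
# (replace <pid> with the Jenkins JVM process id).
jcmd <pid> GC.heap_dump /tmp/jenkins.hprof

# Quick class histogram without a full dump; watch the Timeout counts.
jmap -histo:live <pid> | head -n 30
```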
Thanks for your feedback, @mchoma.
It looks like the fix for this issue has been merged; I have yet to find the JDK version it has been incorporated into.
I am continuing the investigation, as we still hit the issue.
Regarding JDK-8338765: I searched the heap dump and we do not have Long.MAX_VALUE or other large delays in the Timeout objects. Also, we are not experiencing hangs of tasks as described in that issue, but rather an accumulation of completed Timeouts. So I do not think we are facing JDK-8338765 here.
The fact that we see completed tasks in the interruptions structure makes me wonder: isn’t the Timeout class missing some cleanup logic for the interruptions structure?
When I consult the implementation with Gemini, I get this advice, which sounds reasonable to me considering the ScheduledExecutorService javadoc:
The class does not clean up the scheduled future. This means that if many timeouts are created and closed, the scheduled executor service will contain many scheduled tasks that do nothing. This is not ideal, but also not critical.
And it suggests cleanup logic like this:
public class Timeout implements AutoCloseable {
    // ... (other fields)
    private ScheduledFuture<?> future; // store the ScheduledFuture so close() can cancel it

    private void ping(final long time, final TimeUnit unit) {
        future = interruptions.schedule(() -> { // store the future
            if (completed) {
                LOGGER.log(Level.FINER, "{0} already finished, no need to interrupt", thread.getName());
                return;
            }
            // ... (interruption logic)
            ping(5, TimeUnit.SECONDS);
        }, time, unit);
    }

    @Override
    public void close() {
        completed = true;
        if (future != null) {
            future.cancel(false); // cancel the pending interrupt task
        }
        LOGGER.log(Level.FINER, "completed {0}", thread.getName());
    }
}
I will continue with some experiments on the behaviour of ScheduledExecutorService. But what do you think about it?
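One such experiment, as a minimal self-contained sketch (the class name ScheduledQueueDemo and its helper are mine, not Jenkins code): with the default policy, a cancelled task stays in the ScheduledThreadPoolExecutor work queue until its delay expires, whereas setRemoveOnCancelPolicy(true) removes it at cancel time. This matters if the proposed cleanup in close() relies on future.cancel(false) alone.

```java
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ScheduledQueueDemo {
    // Schedules a far-future no-op task, cancels it, and reports how many
    // tasks remain in the executor's work queue afterwards.
    static int queuedAfterCancel(boolean removeOnCancel) {
        ScheduledThreadPoolExecutor stpe = new ScheduledThreadPoolExecutor(1);
        stpe.setRemoveOnCancelPolicy(removeOnCancel);
        ScheduledFuture<?> f = stpe.schedule(() -> {}, 1, TimeUnit.HOURS);
        f.cancel(false);
        int size = stpe.getQueue().size();
        stpe.shutdownNow();
        return size;
    }

    public static void main(String[] args) {
        // Default policy: the cancelled task lingers until its delay expires.
        System.out.println("default policy: " + queuedAfterCancel(false));
        // removeOnCancelPolicy(true): cancel() removes the task immediately.
        System.out.println("removeOnCancel: " + queuedAfterCancel(true));
    }
}
```

If the experiment confirms this, calling setRemoveOnCancelPolicy(true) on the interruptions executor (where its concrete type allows it) might be worth considering alongside the cancel-on-close change.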
About trying this change: I do not have a reproducer on upstream Jenkins, only our internal company Jenkins, so we will try this code once it is propagated for internal use.