We upgraded to Jenkins 2.452.4 on JDK 17 (from 2.440.2 on JDK 11) and our memory consumption rose from 6 GB to 12 GB. We are running approximately 4000 jobs at once, which sit in the queue from the start. Analyzing a heap dump, we noticed 44 million instances of org.jenkinsci.plugins.workflow.support.concurrent.Timeout. From looking into the heap dump, I guess those objects are created by jobs waiting in the queue. Am I right?
We can’t downgrade because of security policy, so we can’t compare whether this is normal, but it does not look normal to me.
I would be thankful for any hint. We will continue to diagnose the issue in the meantime.
The significant increase in memory consumption and the large number of org.jenkinsci.plugins.workflow.support.concurrent.Timeout instances suggest that there might be an issue with how Jenkins is handling jobs in the queue.
This could be related to changes in the newer Jenkins version or the JDK version.
Here are some steps that could help you diagnose and potentially mitigate the issue:
Ensure all Jenkins plugins are up to date. Sometimes, plugin updates include performance improvements and bug fixes.
Check if there are any specific configurations or plugins that might be causing excessive memory usage. For example, certain pipeline configurations or plugins might not be optimized for the new Jenkins or JDK version.
Temporarily increase the heap size to accommodate the higher memory usage while you diagnose the issue. You can do this by modifying the JAVA_OPTS in the Jenkins startup script: export JAVA_OPTS="-Xms8g -Xmx16g"
Use tools like Eclipse MAT (Memory Analyzer Tool) to analyze the heap dump and identify the root cause of the memory consumption. Look for patterns or specific objects that are consuming a lot of memory.
If the issue is related to jobs waiting in the queue, consider optimizing the job queue management. This might involve adjusting the number of executors, using job throttling plugins, or distributing the load across multiple Jenkins instances.
Tune the garbage collection settings to improve memory management. You can add the following options to JAVA_OPTS: export JAVA_OPTS="-Xms8g -Xmx16g -XX:+UseG1GC -XX:MaxGCPauseMillis=200"
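To make the heap-size and GC suggestions above concrete, here is a rough sketch of startup options plus heap-dump capture commands. Paths, sizes, and the process id are placeholders to adapt to your environment; the logging and diagnostic flags are standard JDK 17 tooling (jcmd, jmap).

```shell
# Sketch only: adjust paths, <pid>, and heap sizes to your environment.
# Jenkins startup options: larger heap, G1 GC, and GC logging for diagnosis.
export JAVA_OPTS="-Xms8g -Xmx16g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 \
  -Xlog:gc*:file=/var/log/jenkins/gc.log:time,uptime,level,tags"

# Capture a heap dump of the running controller for Eclipse MAT
# (replace <pid> with the Jenkins JVM process id).
jcmd <pid> GC.heap_dump /tmp/jenkins.hprof

# Quick class histogram without a full dump; watch the Timeout counts.
jmap -histo:live <pid> | head -n 30
```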
Thanks for your feedback, @mchoma.
It looks like the fix for this issue has been merged; I have yet to find the JDK version it has been incorporated into.
I am continuing the investigation, as we still hit the issue.
Regarding JDK-8338765: I searched the heap dump and we do not have Long.MAX_VALUE or other large delays in the Timeout objects. Also, we are not experiencing hangs of tasks as described in that issue, but rather an accumulation of completed Timeouts. So I do not think we are facing JDK-8338765 here.
The fact that we see completed tasks in the interruptions structure makes me wonder: isn’t the Timeout class missing some cleanup logic for the interruptions structure?
When I consult the implementation with Gemini, I get this advice, which sounds reasonable to me considering the ScheduledExecutorService javadoc:
The class does not clean up the scheduled future. This means that if many timeouts are created and closed, the scheduled executor service will contain many scheduled tasks that do nothing. This is not ideal, but also not critical.
And it suggests cleanup logic like this:
public class Timeout implements AutoCloseable {
    // ... (other fields)
    private ScheduledFuture<?> future; // store the ScheduledFuture so close() can cancel it

    private void ping(final long time, final TimeUnit unit) {
        future = interruptions.schedule(() -> { // store the future
            if (completed) {
                LOGGER.log(Level.FINER, "{0} already finished, no need to interrupt", thread.getName());
                return;
            }
            // ... (interruption logic)
            ping(5, TimeUnit.SECONDS);
        }, time, unit);
    }

    @Override
    public void close() {
        completed = true;
        if (future != null) {
            future.cancel(false); // cancel the pending interrupt task
        }
        LOGGER.log(Level.FINER, "completed {0}", thread.getName());
    }
}
I will continue with some experiments on the behaviour of ScheduledExecutorService. But what do you think about it?
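One such experiment, as a minimal self-contained sketch (the class name ScheduledQueueDemo and its helper are mine, not Jenkins code): with the default policy, a cancelled task stays in the ScheduledThreadPoolExecutor work queue until its delay expires, whereas setRemoveOnCancelPolicy(true) removes it at cancel time. This matters if the proposed cleanup in close() relies on future.cancel(false) alone.

```java
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ScheduledQueueDemo {
    // Schedules a far-future no-op task, cancels it, and reports how many
    // tasks remain in the executor's work queue afterwards.
    static int queuedAfterCancel(boolean removeOnCancel) {
        ScheduledThreadPoolExecutor stpe = new ScheduledThreadPoolExecutor(1);
        stpe.setRemoveOnCancelPolicy(removeOnCancel);
        ScheduledFuture<?> f = stpe.schedule(() -> {}, 1, TimeUnit.HOURS);
        f.cancel(false);
        int size = stpe.getQueue().size();
        stpe.shutdownNow();
        return size;
    }

    public static void main(String[] args) {
        // Default policy: the cancelled task lingers until its delay expires.
        System.out.println("default policy: " + queuedAfterCancel(false));
        // removeOnCancelPolicy(true): cancel() removes the task immediately.
        System.out.println("removeOnCancel: " + queuedAfterCancel(true));
    }
}
```

If the experiment confirms this, calling setRemoveOnCancelPolicy(true) on the interruptions executor (where its concrete type allows it) might be worth considering alongside the cancel-on-close change.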
About trying this change: I do not have a reproducer on upstream Jenkins, only our internal company Jenkins, so we will try this code once it is propagated for internal use.