Parallel stages are being killed

Hi, we have a Jenkins declarative script that schedules a set of stages across a set of agents using:

parallel p
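
For context, p is a map of stage names to closures, built roughly like this before the parallel call above (the stage names, agent label, and command here are placeholders for illustration, not our real build):

// Build a map of stage names to closures; "parallel p" then runs them
// concurrently, each closure grabbing its own executor via node().
def p = [:]
['unit', 'integration', 'system'].each { name ->
    p[name] = {
        node('build-agent') {          // hypothetical agent label
            stage(name) {
                sh "./run-tests.sh ${name}"
            }
        }
    }
}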

This has worked well, but since early August stages are being randomly killed.
The console log shows:

Killed

but there is no explanation or reason.

How can I determine the reason? Might it be caused by the Jenkins Process Tree Killer?

I have seen in the logs that the jobs are killed because of:

Cannot contact <server>: java.lang.InterruptedException

The system log shows:

org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep ... waiting for ... unresponsive for 3 min 0 sec
...
org.jenkinsci.plugins.workflow.support.concurrent.Timeout lambda$ping$0

Any suggestions on how to investigate or fix this, please?

I have been advised on Jira that this is likely “a configuration issue on the system that is running the agent. It may be killing the process due to out of memory issues or due to hardware failures or for other reasons that are outside the control of Jenkins.”

Does anyone have any advice on how to diagnose this, please?

The issue you’re experiencing with Jenkins stages being randomly killed could indeed be related to the Jenkins Process Tree Killer or other system-level issues. :person_shrugging:

Here are some steps that may help you diagnose and potentially resolve the issue: :crossed_fingers:

  1. Check System Resources:
    • Monitor the system resources (CPU, memory, disk I/O) on the agents to see if they are running out of resources.
    • Use tools like top, htop, or vmstat to monitor resource usage.
  2. Disable Process Tree Killer:
    • I’ve heard the Jenkins Process Tree Killer could sometimes kill processes that it shouldn’t. Or that was the case a long time ago. :person_shrugging: In any case, you could disable it by setting the hudson.util.ProcessTreeKiller.disable system property to true (if it still exists, I’m not so sure).
      Add the following to your Jenkins startup options:
      -Dhudson.util.ProcessTreeKiller.disable=true
  3. Increase Logging Level:
    • Increase the logging level for the relevant Jenkins components to get more detailed logs. You can do this in the Jenkins UI under Manage Jenkins → System Log → Add recorder (see the Script Console sketch after this list for a quick alternative).
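
For a quick look before setting up a recorder, you could also raise a logger level from the Script Console (Manage Jenkins → Script Console). This is only a sketch: the logger names below are simply the ones that appear in your system-log excerpt, and FINE is a guess at how much detail you need.

import java.util.logging.Level
import java.util.logging.Logger

// Temporarily raise verbosity for the classes seen in the "unresponsive for 3 min" messages.
// A UI log recorder is the persistent way to do this; Script Console changes are lost on restart.
Logger.getLogger('org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep').setLevel(Level.FINE)
Logger.getLogger('org.jenkinsci.plugins.workflow.support.concurrent.Timeout').setLevel(Level.FINE)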

@poddingue Thanks very much for your help. Do you have a suggestion for which logger(s) to use?

The Jenkins declarative script schedules multiple tasks onto an agent, one per available executor. We have a 48-core machine with 40 executors. When running multiple tasks, the machine becomes unresponsive; the Jenkins SSHLauncher times out and the jobs are killed.

What measures could we take to ensure that the machine remains responsive to SSH?

The JVM options for the agent do not specify heap memory limits using -Xms and -Xmx. The machine has 4GB memory. Should I set heap limits?
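
If so, I assume it would be something like this in the agent's JVM options field (the values here are placeholders for illustration, not a recommendation):

-Xms512m -Xmx2g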

When you have 48 cores and configure 40 executors, you probably assume that each executor only starts a single-threaded process. But many build tools read how many CPUs are available on the machine and then start that many compile jobs, so if you have such build jobs you can easily overload your machine.
On my Jenkins we build a bigger C/C++ project with cmake; the machine has 120 CPUs, but we only have one executor, as the build will consume all 120 CPUs and all of the 400 GiB of memory. If we used 2 executors, we would need to take extra action to ensure we don't overload the machine.
The whole machine has only 4 GiB of memory with 48 CPUs? That is not much.
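
If a build tool is auto-detecting the CPU count, one mitigation is to cap its parallelism explicitly in the pipeline. A minimal sketch (the label, tool, and job count are made-up examples, not your setup):

pipeline {
    agent { label 'linux-48core' }   // hypothetical agent label
    stages {
        stage('Build') {
            steps {
                // An explicit "-j 2" caps this build at 2 compile jobs; "make -j$(nproc)"
                // or a tool that auto-detects CPUs would start 48 jobs per executor,
                // i.e. up to 40 x 48 processes on a 48-core, 40-executor machine.
                sh 'make -j 2'
            }
        }
    }
}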

@mawinter69 Thanks for your reply. Our builds are single-threaded so each has only one process.

I made a stupid mistake checking the system memory - in fact we have 264GB.

Should I specify limits for the heap in the agent’s configuration page?