Test cases are getting skipped due to a file descriptor leak

Hi,
I’m encountering an issue where jobs configured under a MultiPhase job setup are experiencing test case skips due to a file descriptor leak. Additionally, some jobs in the sequence are getting aborted, even though the preceding jobs run successfully.

I’ve tried increasing the file descriptor limit and adjusting settings, but the problem persists. Has anyone encountered a similar issue or have suggestions for troubleshooting this further?
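
For context, we raised the per-process limit along these lines before re-running the jobs (the value shown is just illustrative):

ulimit -n          # show the current soft limit on open files
ulimit -n 65536    # raise the soft limit for the current shell session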

Hello,
It sounds like you’re dealing with a challenging issue. Here are a few steps to troubleshoot the file descriptor leak and job abortion problems:

  1. Monitor File Descriptors: Use tools like lsof or strace to identify where the leak is occurring (see the sketch after this list).
  2. Update Software: Ensure all related software and dependencies are up to date.
  3. Review Job Configurations: Double-check your MultiPhase job setup for any misconfigurations.
  4. Implement Cleanup Routines: Ensure jobs properly close file descriptors after use.
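
For the first point, a minimal sketch of what that monitoring could look like (the job name is a placeholder; adjust the interval to taste):

PID=$(pgrep -f "your-job-name" | head -n 1)
while kill -0 "$PID" 2>/dev/null; do
    echo "$(date +%T): $(lsof -p "$PID" | wc -l) descriptors open"
    sleep 5
done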

If these steps don’t resolve the issue, consider consulting the documentation or reaching out to the support community for your job scheduler.

Best Regards,
Ellen Hogan


Hi Ellen,
Thank you for your reply on this issue! I’ve already checked the first three points, so I will try to implement the cleanup routines as you suggested.

Thanks again!

Did you read the information on the page from your picture? That page describes the general problem that causes a Jenkins job to report that it leaked file descriptors and offers several alternatives.

That page says:

The reason this problem happens is because of file descriptor leaks and how they are inherited from one process to another. Jenkins and the child process are connected by three pipes (stdin/stdout/stderr). This allows Jenkins to capture the output from the child process. Since the child process may write a lot of data to the pipe and quit immediately after that, Jenkins needs to make sure that it has drained the pipes before it considers the build to be over. Jenkins does this by waiting for EOF.

When a process terminates for whatever reason, the operating system closes all the file descriptors it owned. So even if the process didn’t close stdout/stderr, Jenkins will nevertheless get EOF.

The complication happens when those file descriptors are inherited by other processes. Let’s say the child process forks another process to the background. The background process (AKA daemon) inherits all the file descriptors of the parent, including the writing side of the stdout/stderr pipes that connect the child process and Jenkins. If the daemon forgets to close them, Jenkins won’t get EOF for the pipes even when the child process exits, because the daemon still has those descriptors open. That’s how this problem happens.

A good daemon program closes all file descriptors to avoid problems like this, but often there are bad ones that don’t follow the rule.
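
If you are on Linux, you can confirm this by inspecting the descriptors a suspected leftover daemon still holds (the PID below is a placeholder):

ls -l /proc/12345/fd
# Entries like "1 -> pipe:[98765]" are inherited stdout/stderr pipes. If the
# same pipe inode also appears under the Jenkins agent process, the daemon is
# holding the build's pipes open.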

Since you said that the problem is related to test cases, maybe one or more of your test cases is launching a separate process and that separately launched process is not closing file descriptors like it should.
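
If that is what's happening, a common workaround is to detach the daemon's standard streams at launch time so it never holds the build's pipes ("my_daemon" is a placeholder for whatever your test starts):

nohup my_daemon </dev/null >/tmp/my_daemon.log 2>&1 &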


Hi @ellen05898,
I appreciate your response on this query. Our last resort was to implement cleanup routines in our configuration:

JOB_NAME="S1_FT_Multibyte"
JOB_PID=$(pgrep -f "$JOB_NAME")

if [ -n "$JOB_PID" ]; then
    lsof -p "$JOB_PID" > /tmp/open_fds.txt

    # Column 9 of lsof output is NAME; collect names opened more than once.
    awk '{print $9}' /tmp/open_fds.txt | sort | uniq -d > /tmp/duplicate_fds.txt

    while read -r duplicate_fd; do
        grep "$duplicate_fd" /tmp/open_fds.txt | while read -r fd_info; do
            pid=$(echo "$fd_info" | awk '{print $2}')
            fd=$(echo "$fd_info" | awk '{print $4}')

            echo "Closing file descriptor $fd for process $pid related to $duplicate_fd"
            kill -9 "$pid"  # NOTE: this kills the whole process, not just the FD
        done
    done < /tmp/duplicate_fds.txt

    rm -f /tmp/open_fds.txt /tmp/duplicate_fds.txt
else
    echo "No running process found for job: $JOB_NAME"
fi

This is the code I tried implementing in our configuration, but it cannot fetch the process ID: the script is meant to close the unwanted open FDs after the job ends, and by that point the job process is gone, so pgrep finds nothing. Can you please suggest another approach we could try for this issue?
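
One direction we were considering (a rough sketch, reusing the names from the script above) is to register the cleanup as an EXIT trap inside the job's own shell step, so it runs while the job process still exists:

#!/bin/bash
JOB_NAME="S1_FT_Multibyte"

cleanup_fds() {
    # "$$" is this shell's own PID, so no pgrep lookup is needed.
    lsof -p "$$" > /tmp/open_fds.txt 2>/dev/null
    echo "Descriptors still open at job exit for $JOB_NAME:"
    cat /tmp/open_fds.txt
    rm -f /tmp/open_fds.txt
}
trap cleanup_fds EXIT

# ... the job's actual test steps run here ...

Would that be a reasonable approach, or is there something better?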