We are seeing Jenkins 504 errors

Hi, we have a fairly large Jenkins instance, but we have an annoying issue: when we click to start a task, it just freezes for 10 minutes or throws a 504 error.
The problem is that we cannot find the root cause of this behaviour: there is nothing in the Jenkins logs, and there are plenty of CPU, memory, disk, and IOPS resources, none of which hits its limits.
We have tried Jenkins-prometheus monitoring, but whenever the lag occurs, this plugin also stops sending metrics.

Any advice?

Hello and welcome to the community, @Explas! :wave:

The type of issue you’re experiencing, where Jenkins tasks freeze or throw 504 errors without any apparent resource bottleneck, can be challenging to debug. :person_shrugging: Could you let us know your operating system, Java version and vendor, and Jenkins version?

Here are a few areas to explore:

  1. Thread Blocking or Deadlocks:
    • If Jenkins threads are blocked or waiting on locks, it can lead to UI freezes or task initiation delays.

  2. Garbage Collection (GC) Issues:
    • Long GC pauses can halt Jenkins, especially if the JVM heap is not properly tuned.

  3. Plugin Issues:
    • Misbehaving or outdated plugins can cause delays, particularly during task execution.

  4. Network or Reverse Proxy Timeouts:
    • If Jenkins is operating behind a proxy (like NGINX, HAProxy, or Apache), misconfigured timeouts might cause 504 errors during prolonged requests.

  5. Too Many Concurrent Jobs:
    • A large number of builds or excessive job queue processing can overload Jenkins’ internal task scheduler.

  6. Controller-Agent Communication Issues:
    • Problems in communication between the controller and agents can cause tasks to hang.
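
For the first item, one way to check for blocked threads is to capture a thread dump while the hang is in progress (for example with `jcmd <pid> Thread.print` or `jstack <pid>`) and count the thread states. Below is a minimal Python sketch, assuming the dump has been saved as text in the usual HotSpot layout. The function names are illustrative and are not part of Jenkins or the JDK.

```python
import re
from collections import Counter

# Matches the state line HotSpot prints under each thread header, e.g.
#    java.lang.Thread.State: BLOCKED (on object monitor)
STATE_RE = re.compile(r"^\s*java\.lang\.Thread\.State:\s+(\w+)", re.MULTILINE)


def summarize_thread_states(dump_text):
    """Count threads per java.lang.Thread.State in a thread dump."""
    return Counter(STATE_RE.findall(dump_text))


def blocked_threads(dump_text):
    """Return the names of threads whose state is BLOCKED."""
    names = []
    current = None
    for line in dump_text.splitlines():
        if line.startswith('"'):
            # Thread header line: "name" #12 daemon prio=5 ...
            current = line.split('"')[1]
            continue
        match = STATE_RE.match(line)
        if match and match.group(1) == "BLOCKED" and current is not None:
            names.append(current)
    return names
```

Taking two or three dumps a few seconds apart during a freeze and comparing which threads stay BLOCKED usually says more than a single snapshot.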

Hi,

I’m running Jenkins on AWS Amazon Linux (m6a.xlarge):

NAME="Amazon Linux"
VERSION="2023"
ID="amzn"
ID_LIKE="fedora"
VERSION_ID="2023"
PLATFORM_ID="platform:al2023"
PRETTY_NAME="Amazon Linux 2023.6.20241212"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023"
HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/"
DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/"
SUPPORT_URL="https://aws.amazon.com/premiumsupport/"
BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023"
VENDOR_NAME="AWS"
VENDOR_URL="https://aws.amazon.com/"
SUPPORT_END="2028-03-15"
Amazon Linux release 2023.6.20241212 (Amazon Linux)

Java:

java --version
openjdk 17.0.13 2024-10-15 LTS
OpenJDK Runtime Environment Corretto-17.0.13.11.1 (build 17.0.13+11-LTS)
OpenJDK 64-Bit Server VM Corretto-17.0.13.11.1 (build 17.0.13+11-LTS, mixed mode, sharing)

Jenkins version: 2.486 (but I would not rely on it, since we have done a lot of upgrades already and the problem persists across all versions)

As I mentioned, we have monitoring from the Jenkins-prometheus plugin:


As the image shows, heap usage peaks at 700 MB at most, although we have allocated 8 GB (-Xmx8192m).

Jenkins is started with the following parameters:

-Dhudson.model.ParametersAction.keepUndefinedParameters=true -Dhudson.model.DirectoryBrowserSupport.CSP= -Djava.awt.headless=true -server -Xmx8192m -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -XX:+UseStringDeduplication -XX:+UnlockExperimentalVMOptions -XX:G1NewSizePercent=20 -XX:+UnlockDiagnosticVMOptions -XX:G1SummarizeRSetStatsPeriod=1 -jar /usr/share/java/jenkins.war --webroot=/var/cache/jenkins/war --logfile=/var/log/jenkins/jenkins.log --httpPort=8080
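
The flags above do not turn on GC logging. To rule GC pauses in or out independently of the Prometheus plugin, one option on Java 17 is JVM unified logging, for example by adding `-Xlog:gc*:file=<path>` (the path is yours to choose) and then scanning the log for long pauses. Below is a minimal Python sketch, assuming the default line layout such as `[12.3s][info][gc] GC(7) Pause Young (Normal) (G1 Evacuation Pause) 512M->128M(8192M) 15.250ms`. The function name and the threshold are illustrative.

```python
import re

# Matches GC summary lines that end in a pause duration, e.g.
#   ... GC(7) Pause Young (Normal) (G1 Evacuation Pause) 512M->128M(8192M) 15.250ms
PAUSE_RE = re.compile(r"\bGC\((\d+)\) (Pause .*?) (\d+(?:\.\d+)?)ms\s*$")


def long_pauses(log_lines, threshold_ms=1000.0):
    """Return (gc_id, description, pause_ms) for pauses at or above threshold_ms."""
    found = []
    for line in log_lines:
        match = PAUSE_RE.search(line)
        if match is None:
            continue  # not a pause summary line (phase detail, concurrent work, ...)
        pause_ms = float(match.group(3))
        if pause_ms >= threshold_ms:
            found.append((int(match.group(1)), match.group(2), pause_ms))
    return found
```

With `open(path) as fh`, the file object can be passed directly as `log_lines`. If no pause comes anywhere near the length of the freeze, GC is an unlikely cause.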

We constantly upgrade Jenkins and its plugins (we have over 200 of them).
We use an AWS ALB as the load balancer, but I’m not sure which settings could be tuned if it is causing the issue.
We also have a lot of agents running in different AWS accounts and regions, which could cause some delays between the controller and the agents, but all of them are in AWS, so the connections are quite reliable.
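
On the ALB side, the setting most likely to matter is the idle timeout (attribute `idle_timeout.timeout_seconds`, 60 seconds by default): if the controller takes longer than that to answer, the ALB returns a 504 even though Jenkins is still working on the request. Below is a minimal Python sketch for reading the current value. It assumes `boto3` is installed and AWS credentials and region are configured, and the helper names are illustrative. Raising the timeout would only hide the 504, not the delay itself.

```python
IDLE_TIMEOUT_KEY = "idle_timeout.timeout_seconds"


def idle_timeout_seconds(attributes):
    """Pick the idle timeout out of an ALB 'Attributes' list of Key/Value dicts."""
    for attr in attributes:
        if attr.get("Key") == IDLE_TIMEOUT_KEY:
            return int(attr["Value"])
    return None  # attribute not present in the response


def fetch_idle_timeout(load_balancer_arn):
    """Read the idle timeout of one load balancer via the ELBv2 API."""
    import boto3  # third-party; imported lazily so the helpers above work without it

    client = boto3.client("elbv2")
    response = client.describe_load_balancer_attributes(
        LoadBalancerArn=load_balancer_arn
    )
    return idle_timeout_seconds(response["Attributes"])


def would_time_out(request_seconds, idle_timeout_s):
    """True if a request of this duration outlasts the ALB idle timeout."""
    return request_seconds > idle_timeout_s
```

For example, `would_time_out(600, 60)` is true: a 10-minute hang comfortably exceeds the default timeout, which would be consistent with seeing a 504 in the browser while the controller is still busy.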


Basically, the whole Jenkins UI works fine, but it hangs when starting a new task/job: nothing happens for a few minutes, and then the new task number is created and the task starts…