Cannot recover job when Jenkins agent loses network connectivity

Hello, I am facing problems with the connectivity between a Jenkins agent and the controller node.

Configuration:
The agent is configured to connect to the controller node using WebSocket.

The behaviour I experience is the following.

  1. When the controller node loses connectivity, even for a long period (for example, when the controller Jenkins node is restarted), the job running on the agent resumes without problems, which is very good.
  2. On the other hand, if for some reason the agent loses network connectivity, even for a very short period (3-5 seconds), the job gets stuck and never resumes properly.
    Furthermore, if I restart the Java process on the agent that initiates the connection to the controller node, I get the following error:

Sept 06, 2023 6:17:03 PM hudson.remoting.jnlp.Main createEngine
INFO: Setting up agent: Tsiakos Node
Sept 06, 2023 6:17:03 PM hudson.remoting.Engine startEngine
INFO: Using Remoting version: 3107.v665000b_51092
Sept 06, 2023 6:17:03 PM hudson.remoting.Engine startEngine
WARNING: No Working Directory. Using the legacy JAR Cache location: /Users/panagiotistsiakos/.jenkins/cache/jars
Sept 06, 2023 6:17:04 PM hudson.remoting.jnlp.Main$CuiListener error
SEVERE: Handshake error.
io.jenkins.remoting.shaded.jakarta.websocket.DeploymentException: Handshake error.
at io.jenkins.remoting.shaded.org.glassfish.tyrus.client.ClientManager$3$1.run(ClientManager.java:658)
at io.jenkins.remoting.shaded.org.glassfish.tyrus.client.ClientManager$3.run(ClientManager.java:696)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at io.jenkins.remoting.shaded.org.glassfish.tyrus.client.ClientManager$SameThreadExecutorService.execute(ClientManager.java:849)
at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:123)
at io.jenkins.remoting.shaded.org.glassfish.tyrus.client.ClientManager.connectToServer(ClientManager.java:493)
at io.jenkins.remoting.shaded.org.glassfish.tyrus.client.ClientManager.connectToServer(ClientManager.java:337)
at hudson.remoting.Engine.runWebSocket(Engine.java:678)
at hudson.remoting.Engine.run(Engine.java:499)
Caused by: io.jenkins.remoting.shaded.org.glassfish.tyrus.core.HandshakeException: Response code was not 101: 500.
at io.jenkins.remoting.shaded.org.glassfish.tyrus.client.TyrusClientEngine.processResponse(TyrusClientEngine.java:301)
at io.jenkins.remoting.shaded.org.glassfish.tyrus.container.jdk.client.ClientFilter.processRead(ClientFilter.java:167)
at io.jenkins.remoting.shaded.org.glassfish.tyrus.container.jdk.client.Filter.onRead(Filter.java:111)
at io.jenkins.remoting.shaded.org.glassfish.tyrus.container.jdk.client.Filter.onRead(Filter.java:113)
at io.jenkins.remoting.shaded.org.glassfish.tyrus.container.jdk.client.SslFilter.handleRead(SslFilter.java:402)
at io.jenkins.remoting.shaded.org.glassfish.tyrus.container.jdk.client.SslFilter.processRead(SslFilter.java:365)
at io.jenkins.remoting.shaded.org.glassfish.tyrus.container.jdk.client.Filter.onRead(Filter.java:111)
at io.jenkins.remoting.shaded.org.glassfish.tyrus.container.jdk.client.Filter.onRead(Filter.java:113)
at io.jenkins.remoting.shaded.org.glassfish.tyrus.container.jdk.client.TransportFilter$4.completed(TransportFilter.java:295)
at io.jenkins.remoting.shaded.org.glassfish.tyrus.container.jdk.client.TransportFilter$4.completed(TransportFilter.java:279)
at java.base/sun.nio.ch.Invoker.invokeUnchecked(Invoker.java:129)
at java.base/sun.nio.ch.Invoker$2.run(Invoker.java:221)
at java.base/sun.nio.ch.AsynchronousChannelGroupImpl$1.run(AsynchronousChannelGroupImpl.java:113)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)

It seems that the controller still thinks the agent is connected. After some time and several retries, the agent is able to reconnect.

I have tried various combinations of values for the following Jenkins system properties, without success:

-Djenkins.websocket.idleTimeout
-Djenkins.websocket.pingInterval
-Dhudson.slaves.ChannelPinger.pingIntervalSeconds
-Dhudson.slaves.ChannelPinger.pingTimeoutSeconds
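
For illustration only, here is a minimal Python sketch of how one such combination could be assembled into a controller launch command. The property names are the ones listed above. The numeric values, the `jenkins.war` file name, and the `controller_command` helper are hypothetical placeholders, not recommended settings. The sketch assumes the properties are set on the controller JVM, and it places the `-D` flags before `-jar`, since the JVM only treats them as system properties in that position.

```python
# Property names come from the post; the values are hypothetical placeholders.
PROPS = {
    "jenkins.websocket.idleTimeout": 120,
    "jenkins.websocket.pingInterval": 15,
    "hudson.slaves.ChannelPinger.pingIntervalSeconds": 30,
    "hudson.slaves.ChannelPinger.pingTimeoutSeconds": 60,
}


def controller_command(props, war="jenkins.war"):
    """Build a java command line with every -D flag placed before -jar."""
    flags = [f"-D{name}={value}" for name, value in props.items()]
    return ["java", *flags, "-jar", war]


# Print the command so it can be compared with what the service actually runs.
print(" ".join(controller_command(PROPS)))
```

Comparing the printed command with the real process arguments (for example in the service unit or container entrypoint) is a quick way to confirm that the flags are actually reaching the JVM.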

Is there something I am missing here? It is impossible to ensure that a Jenkins agent never loses network connectivity, even for a very short time, and this has caused a lot of sporadic instability in the jobs running on the agents.

Thank you in advance
Panagiotis

Hello @ptsiakos and welcome to this community. :wave:

It appears that you are encountering issues with Jenkins agent connectivity when there are brief network interruptions, and you’ve already tried adjusting some Jenkins system properties without success. To address this problem, you could consider the following steps:

  1. Increase WebSocket Timeout:
    Increase the WebSocket timeout by setting the -Djenkins.websocket.idleTimeout property to a longer value. This will give the agent more time to reconnect after a brief network interruption. I know your value is probably already longer than the observed network outage, but…
  2. Ping Interval and Timeout:
    Ensure that the -Djenkins.websocket.pingInterval and -Djenkins.websocket.pingTimeout values are appropriately configured. These properties help in maintaining the WebSocket connection and detecting connectivity issues. You may need to experiment with different values to find the optimal settings for your network conditions.
  3. Channel Pinger Configuration:
    Similarly, check and adjust the -Dhudson.slaves.ChannelPinger.pingIntervalSeconds and -Dhudson.slaves.ChannelPinger.pingTimeoutSeconds properties. These settings control how often Jenkins checks the agent’s connectivity.
  4. Agent/Controller Versions:
    Ensure that both the Jenkins controller and the agent are running compatible versions. Sometimes, issues can arise when there is a version mismatch between the two components. Upgrading to the latest versions might help. The same applies to the Java version.
  5. Network Stability:
    If possible, investigate the root cause of these network interruptions. It could be due to network configurations, firewalls, or other factors that can be addressed to improve network stability.
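
As a starting point for step 5, here is a minimal Python sketch that could be run on the agent machine to log when the controller becomes unreachable at the TCP level. It is only a rough diagnostic, not a Jenkins tool: the `probe` and `watch` helpers are hypothetical names, and the host and port in the usage comment are placeholders for your own controller address.

```python
import socket
import time


def probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def watch(host, port, interval=1.0, checks=10):
    """Probe repeatedly and return a list of (timestamp, reachable) samples."""
    samples = []
    for _ in range(checks):
        samples.append((time.time(), probe(host, port)))
        time.sleep(interval)
    return samples


# Example usage (hypothetical address, replace with your controller):
# for ts, ok in watch("jenkins.example.com", 443, checks=60):
#     print(time.strftime("%H:%M:%S", time.localtime(ts)), "up" if ok else "DOWN")
```

Matching the timestamps of the "DOWN" samples against the agent log can help tell whether the stuck jobs coincide with real network drops or with something else, such as a proxy closing idle connections.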