Hello, I am facing problems with the connectivity between a Jenkins agent and the controller node.
Configuration:
The agent is configured to connect to the controller node over WebSocket.
The behaviour I experience is the following:
- When the controller node loses connectivity, even for a long period (for example, when the controller Jenkins node is restarted), the job running on the agent resumes without problems, which is very good.
- On the other hand, if for some reason the agent loses network connectivity, even for a very short period (3-5 seconds), the job gets stuck and never resumes properly.
Furthermore, if I restart the Java process on the agent that initiates the connection to the controller node, I get the following error:
Sept 06, 2023 6:17:03 PM hudson.remoting.jnlp.Main createEngine
INFO: Setting up agent: Tsiakos Node
Sept 06, 2023 6:17:03 PM hudson.remoting.Engine startEngine
INFO: Using Remoting version: 3107.v665000b_51092
Sept 06, 2023 6:17:03 PM hudson.remoting.Engine startEngine
WARNING: No Working Directory. Using the legacy JAR Cache location: /Users/panagiotistsiakos/.jenkins/cache/jars
Sept 06, 2023 6:17:04 PM hudson.remoting.jnlp.Main$CuiListener error
SEVERE: Handshake error.
io.jenkins.remoting.shaded.jakarta.websocket.DeploymentException: Handshake error.
at io.jenkins.remoting.shaded.org.glassfish.tyrus.client.ClientManager$3$1.run(ClientManager.java:658)
at io.jenkins.remoting.shaded.org.glassfish.tyrus.client.ClientManager$3.run(ClientManager.java:696)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at io.jenkins.remoting.shaded.org.glassfish.tyrus.client.ClientManager$SameThreadExecutorService.execute(ClientManager.java:849)
at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:123)
at io.jenkins.remoting.shaded.org.glassfish.tyrus.client.ClientManager.connectToServer(ClientManager.java:493)
at io.jenkins.remoting.shaded.org.glassfish.tyrus.client.ClientManager.connectToServer(ClientManager.java:337)
at hudson.remoting.Engine.runWebSocket(Engine.java:678)
at hudson.remoting.Engine.run(Engine.java:499)
Caused by: io.jenkins.remoting.shaded.org.glassfish.tyrus.core.HandshakeException: Response code was not 101: 500.
at io.jenkins.remoting.shaded.org.glassfish.tyrus.client.TyrusClientEngine.processResponse(TyrusClientEngine.java:301)
at io.jenkins.remoting.shaded.org.glassfish.tyrus.container.jdk.client.ClientFilter.processRead(ClientFilter.java:167)
at io.jenkins.remoting.shaded.org.glassfish.tyrus.container.jdk.client.Filter.onRead(Filter.java:111)
at io.jenkins.remoting.shaded.org.glassfish.tyrus.container.jdk.client.Filter.onRead(Filter.java:113)
at io.jenkins.remoting.shaded.org.glassfish.tyrus.container.jdk.client.SslFilter.handleRead(SslFilter.java:402)
at io.jenkins.remoting.shaded.org.glassfish.tyrus.container.jdk.client.SslFilter.processRead(SslFilter.java:365)
at io.jenkins.remoting.shaded.org.glassfish.tyrus.container.jdk.client.Filter.onRead(Filter.java:111)
at io.jenkins.remoting.shaded.org.glassfish.tyrus.container.jdk.client.Filter.onRead(Filter.java:113)
at io.jenkins.remoting.shaded.org.glassfish.tyrus.container.jdk.client.TransportFilter$4.completed(TransportFilter.java:295)
at io.jenkins.remoting.shaded.org.glassfish.tyrus.container.jdk.client.TransportFilter$4.completed(TransportFilter.java:279)
at java.base/sun.nio.ch.Invoker.invokeUnchecked(Invoker.java:129)
at java.base/sun.nio.ch.Invoker$2.run(Invoker.java:221)
at java.base/sun.nio.ch.AsynchronousChannelGroupImpl$1.run(AsynchronousChannelGroupImpl.java:113)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
This suggests that the controller still considers the agent to be connected. After some time and several retries, the agent is able to reconnect.
I have tried various combinations of values for the following Jenkins system properties, without luck:
-Djenkins.websocket.idleTimeout
-Djenkins.websocket.pingInterval
-Dhudson.slaves.ChannelPinger.pingIntervalSeconds
-Dhudson.slaves.ChannelPinger.pingTimeoutSeconds
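For anyone trying to reproduce this, here is a minimal sketch of how such properties are rendered as `-Dname=value` JVM flags. The values are placeholders for illustration only, not tested or recommended settings; the helper name `to_jvm_flags` is hypothetical, and the sketch assumes the flags are placed on the controller's JVM command line.

```python
# Placeholder values for illustration only; not tested or recommended settings.
EXAMPLE_PROPERTIES = {
    "jenkins.websocket.idleTimeout": 60,
    "jenkins.websocket.pingInterval": 10,
    "hudson.slaves.ChannelPinger.pingIntervalSeconds": 30,
    "hudson.slaves.ChannelPinger.pingTimeoutSeconds": 20,
}


def to_jvm_flags(properties):
    """Render each property as a -Dname=value JVM flag (hypothetical helper)."""
    return [f"-D{name}={value}" for name, value in properties.items()]


if __name__ == "__main__":
    # Assumption: the flags go on the controller's JVM command line, before -jar.
    command = ["java", *to_jvm_flags(EXAMPLE_PROPERTIES), "-jar", "jenkins.war"]
    print(" ".join(command))
```

System properties only take effect when they appear before `-jar` on the command line of the JVM that reads them, so it is worth checking which process (controller or agent) each flag was given to.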
Is there something I am missing here? It is impossible to guarantee that a Jenkins agent will never lose network connectivity, even for a very short time, and this has caused a lot of sporadic instability in the jobs running on the agents.
Thank you in advance
Panagiotis