We have several long-running jobs in Jenkins, some taking up to 5 hours (mainly database maintenance). We have been using this setup successfully for several years.
Recently we put Jenkins behind a load balancer (AWS ALB), which broke the agents' TCP connections (the DNS name now points to the load balancer). So we switched the agents to WebSocket instead.
However, we now experience frequent disconnects (not at set intervals, but several times per day). Worse, it seems the agent is not only restarting but also killing any running jobs. Previously (using TCP with an earlier Jenkins version), when the connection was lost, the job would at least run to completion, even if it showed up as failed in the Jenkins UI.
Is there a way to force the agents to try to reconnect instead of restarting? Or some other workaround?
Remoting version: 3063.v26e24490f041
Jenkins version: 2.374
The remoting log looks like this:
Oct 26, 2022 6:11:05 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: WebSocket connection open
Oct 26, 2022 6:11:05 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connected
Oct 26, 2022 8:02:50 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Read side closed
Oct 26, 2022 8:02:50 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Read side closed
Oct 26, 2022 8:02:50 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Terminated
Oct 26, 2022 8:02:50 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Performing onReconnect operation.
Oct 26, 2022 8:02:50 AM jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$EngineListenerAdapterImpl onReconnect
INFO: Restarting agent via jenkins.slaves.restarter.WinswSlaveRestarter@e3f7514
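To check whether the disconnects really are irregular, the connection lifetimes can be pulled out of the remoting log. Below is a minimal sketch in Python, assuming the log keeps the two-line format shown above (a timestamp header line followed by an `INFO:` message line). The function name `connection_lifetimes` is hypothetical and only used for illustration.

```python
import re
from datetime import datetime

# Header line as it appears in the remoting log above, e.g.
# "Oct 26, 2022 6:11:05 AM hudson.remoting.jnlp.Main$CuiListener status"
HEADER = re.compile(r"^(\w{3} \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M) ")


def connection_lifetimes(lines):
    """Return a list of (connected_at, terminated_at, duration) tuples.

    Assumes each message line is preceded by a timestamp header line,
    and that month names are in English (strptime's %b is locale-dependent).
    """
    sessions = []
    stamp = None          # timestamp of the most recent header line
    connected_at = None   # start of the currently open connection, if any
    for line in lines:
        match = HEADER.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%b %d, %Y %I:%M:%S %p")
            continue
        message = line.strip()
        if message == "INFO: Connected":
            connected_at = stamp
        elif message == "INFO: Terminated" and connected_at is not None:
            sessions.append((connected_at, stamp, stamp - connected_at))
            connected_at = None
    return sessions


# Sample input: the relevant lines from the log excerpt above.
sample = [
    "Oct 26, 2022 6:11:05 AM hudson.remoting.jnlp.Main$CuiListener status",
    "INFO: Connected",
    "Oct 26, 2022 8:02:50 AM hudson.remoting.jnlp.Main$CuiListener status",
    "INFO: Terminated",
]

for start, end, duration in connection_lifetimes(sample):
    print(f"{start} -> {end} ({duration})")
```

Running this over the full agent log would give one line per connection, which makes it easier to see whether the lifetimes cluster around a particular value or are spread out.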
The jenkins-slave.wrapper.log looks like this:
2022-10-26 08:02:50,737 DEBUG - Starting WinSW in the CLI mode
2022-10-26 08:02:50,819 INFO - Restarting the service with id 'jenkinsslave-C__JenkinsSlave'
2022-10-26 08:02:50,829 DEBUG - Completed. Exit code is 0
2022-10-26 08:02:50,908 DEBUG - Starting WinSW in the CLI mode
2022-10-26 08:02:50,989 INFO - Restarting the service with id 'jenkinsslave-C__JenkinsSlave'
2022-10-26 08:02:50,999 INFO - Stopping jenkinsslave-C__JenkinsSlave
2022-10-26 08:02:50,999 DEBUG - ProcessKill 7120
2022-10-26 08:02:51,089 INFO - Found child process: 4156 Name: conhost.exe
2022-10-26 08:02:51,120 INFO - Stopping process 4156
2022-10-26 08:02:51,136 INFO - Send SIGINT 4156
2022-10-26 08:02:51,136 WARN - SIGINT to 4156 failed - Killing as fallback
2022-10-26 08:02:51,136 INFO - Stopping process 7120
2022-10-26 08:02:51,136 INFO - Send SIGINT 7120
2022-10-26 08:02:51,136 WARN - SIGINT to 7120 failed - Killing as fallback
2022-10-26 08:02:51,136 INFO - Finished jenkinsslave-C__JenkinsSlave
2022-10-26 08:02:51,136 DEBUG - Completed. Exit code is 0
2022-10-26 08:02:52,284 DEBUG - Starting WinSW in the service mode
2022-10-26 08:02:52,299 DEBUG - Completed. Exit code is 0
2022-10-26 08:02:52,315 INFO - Starting C:\Program Files\Microsoft\jdk-11.0.16.8-hotspot\bin\java.exe -Xrs -jar "C:\JenkinsSlave\slave.jar" -jnlpUrl http://bob4.fc.local:8080/computer/sql-test-aws.fc.local/slave-agent.jnlp -secret e600548f34759a2ea7c8b6333d9c81a6a4ff9f1ea861bbef818e66fdbf052005
2022-10-26 08:02:52,315 INFO - Extension loaded: killOnStartup
2022-10-26 08:02:52,331 DEBUG - Checking the potentially runaway process with PID=7120
2022-10-26 08:02:52,331 DEBUG - No runaway process with PID=7120. The process has been already stopped.
2022-10-26 08:02:52,346 INFO - Started process 3016
2022-10-26 08:02:52,346 DEBUG - Forwarding logs of the process System.Diagnostics.Process (java) to winsw.SizeBasedRollingLogAppender
2022-10-26 08:02:52,362 INFO - Recording PID of the started process:3016. PID file destination is C:\JenkinsSlave\jenkins_agent.pid
2022-10-26 08:02:55,579 DEBUG - Starting WinSW in the CLI mode
2022-10-26 08:02:55,669 DEBUG - User requested the status of the process with id 'jenkinsslave-C__JenkinsSlave'
2022-10-26 08:02:55,671 DEBUG - Completed. Exit code is 0