Jenkins setup: Jenkins LTS 2.541.2, azure-vm-agents 1093.va_9cd2dd11158.
I am having issues with Ubuntu worker nodes terminating during use. The issue is chronic: the nodes terminate about every 2 hours, even while running a job. The controller log shows:
[id=22112] INFO h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel jenkins-worker7614f0
Mar 03 17:24:46 dev2-jenkins-master jenkins[945625]: java.io.EOFException
Mar 03 17:24:46 dev2-jenkins-master jenkins[945625]: at java.base/java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2933)
Mar 03 17:24:46 dev2-jenkins-master jenkins[945625]: at java.base/java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3428)
I am getting no logs other than the above. While I am logged into the worker node via SSH from the controller, there is no indication of an issue until the connection suddenly and quietly breaks and the node gets terminated.
I have tried running the workers with the agent template set to the Idle Retention strategy (retention time 0) and to the Pool retention strategy (retention time 0, pool size 1). Neither keeps the nodes up any longer.
Can anyone suggest anything else to try? I do not see any SSH connection flakiness. Does the plugin run any node health checks in the background, and if so, can they be configured? What other information would be helpful to debug the issue?
Thanks in advance.
EDIT: I should also note that I am using SSH to communicate with the workers, not JNLP. The nodes are in the same subnet.
I also see the following in the NodeProvisioning activity logs:
java.lang.Exception: Node ProvisioningActivity for workers/jenkins-worker/null (-533124035) has lost. Mark as failure
at PluginClassLoader for azure-vm-agents//com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.cleanCloudStatistics(AzureVMAgentCleanUpTask.java:607)
at PluginClassLoader for azure-vm-agents//com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.clean(AzureVMAgentCleanUpTask.java:626)
at PluginClassLoader for azure-vm-agents//com.microsoft.azure.vmagent.AzureVMAgentCleanUpTask.lambda$execute$1(AzureVMAgentCleanUpTask.java:634)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
I also see this logged before the node dies:
openjdk full version "21.0.10+7-Ubuntu-122.04"
<===[JENKINS REMOTING CAPACITY]===>Remoting version: 3352.v17a_fb_4b_2773f
Launcher: AzureVMAgentSSHLauncher
Communication Protocol: Standard in/out
This is a Unix agent
Agent successfully connected and online
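To check whether the disconnects really follow a fixed interval of about 2 hours (which would point at a timer or cleanup task rather than network flakiness), the EOFException timestamps could be pulled out of the controller journal. This is only a rough sketch: it assumes the log lines look exactly like the excerpt above (journal short format, English month names, zero-padded day), and because that format carries no year, the year has to be supplied. The function names `disconnect_times` and `gaps` are made up for illustration.

```python
import re
from datetime import datetime

# Matches controller journal lines of the form shown in the excerpt above, e.g.
#   Mar 03 17:24:46 dev2-jenkins-master jenkins[945625]: java.io.EOFException
# Stack-frame lines ("... at java.base/...") deliberately do not match.
EOF_LINE = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2}) \S+ jenkins\[\d+\]: "
    r"java\.io\.EOFException\s*$"
)


def disconnect_times(lines, year):
    """Return a datetime for every EOFException line, in input order.

    The journal's short timestamp has no year, so the caller passes one in
    (assumption: all lines fall within the same calendar year).
    """
    times = []
    for line in lines:
        match = EOF_LINE.match(line)
        if match:
            stamp = f"{year} {match['stamp']}"
            times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return times


def gaps(times):
    """Return the time elapsed between each pair of consecutive disconnects."""
    return [later - earlier for earlier, later in zip(times, times[1:])]
```

For example, the output of `journalctl -u jenkins` (assuming the controller service unit is named `jenkins`) could be saved to a file, its lines passed to `disconnect_times`, and the result passed to `gaps`. Gaps that cluster tightly around the same value would suggest something periodic is terminating the nodes.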