Jenkins job does not fail if the agent crashes or disconnects

Hi community,

Our team has identified a test scenario that can cause the Jenkins agent to crash or reboot. However, even if the agent crashes or reboots during execution, the Jenkins build itself does not fail or exit. To reproduce this, I’ve created a simple pipeline that simulates the scenario:

Pipeline:

node("192.168.111.134"){
    sh '''
        i=1
        while [ $i -le 600 ]
        do
            echo " $i"
            ((i++))
            sleep 5
        done
    '''
}

While the pipeline was running, I rebooted the agent “192.168.111.134”. My question is: how can I ensure that the Jenkins build fails/exits immediately when the agent crashes?

Here is the pipeline output:

15:34:41  Running on 192.168.111.134 in /home/tcnsh/k8s-workspace/workspace/test
15:34:41  [Pipeline] {
15:34:42  [Pipeline] sh
15:34:43  + i=1
15:34:43  + '[' 1 -le 600 ']'
15:34:43  + echo ' 1'
15:34:43   1
15:34:43  + (( i++ ))
15:34:43  + sleep 5
15:34:49  + '[' 2 -le 600 ']'
15:34:49  + echo ' 2'
15:34:49   2
15:34:49  + (( i++ ))
15:34:49  + sleep 5
15:34:54  + '[' 3 -le 600 ']'
15:34:54  + echo ' 3'
15:34:54   3
15:34:54  + (( i++ ))
15:34:54  + sleep 5
15:34:58  + '[' 4 -le 600 ']'
15:34:58  + echo ' 4'
15:34:58   4
15:34:58  + (( i++ ))
15:34:58  + sleep 5
15:35:00  Cannot contact 192.168.111.134: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@777efc9f:192.168.111.134": Remote call on 192.168.111.134 failed. The channel is closing down or has closed down
15:45:01  wrapper script does not seem to be touching the log file in /home/tcnsh/k8s-workspace/workspace/test@tmp/durable-85599b58
15:45:01  (JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)

Jenkins will only detect the crash when the agent comes back online, at least for static agents. The reason is that the execution of the sh step is decoupled from the Java process of the agent. This ensures that when Jenkins is restarted, or when just the Java agent process crashes, the build is not lost and Jenkins can recover. When the agent is back online, any logs that were written in the meantime are pushed to the controller. If the process has finished, that is reported as well. Jenkins cannot determine whether the connection was lost because the machine crashed or because only the agent process died.
This might be different for agents from a cloud provider plugin, where the cloud provider is able to detect the crash and can therefore tell Jenkins that the process stopped.


Hi @mawinter69, thank you for the thorough explanation. I appreciate understanding the underlying rules. I have one more question:

But what happens if it takes hours for the agent to come back online, or if it never comes back without manual intervention? Will the job continue to run indefinitely, or will it eventually time out (independently of any timeout block defined in the pipeline) and terminate?

It will wait forever if you have not wrapped it in a timeout step in your pipeline. I’ve seen this behaviour as well, so using a timeout step is a good idea.
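
For illustration, here is a minimal sketch of the pipeline from the question with the sh step wrapped in a timeout step. It assumes the workflow-basic-steps plugin is installed (it provides timeout), and the 2-minute limit is an arbitrary value chosen for this example, not a recommendation. With activity: true the limit counts time since the last log output rather than total run time, so a loop that prints every 5 seconds should only trip it once the output stops, for example after the agent goes away.

```groovy
node("192.168.111.134") {
    // Inactivity timeout: the clock restarts whenever the block writes to the log.
    // The 2-minute value is an assumption for this sketch; tune it to your job.
    timeout(time: 2, unit: 'MINUTES', activity: true) {
        sh '''
            i=1
            while [ $i -le 600 ]
            do
                echo " $i"
                # POSIX arithmetic, so this also works when sh is not bash
                i=$((i + 1))
                sleep 5
            done
        '''
    }
}
```

Note that this does not make the build stop immediately when the agent crashes; it only bounds how long the build waits. As far as I know, a build stopped by timeout is normally reported as aborted rather than failed, so if you need a failed result you would have to catch the interruption and set the result yourself.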
