Jenkins job will not fail if the agent crashes/disconnected

longkang · February 21, 2024, 8:03am

Hi community,

Our team has identified a test scenario that can cause the Jenkins agent to crash or reboot. However, even if the agent crashes or reboots during execution, the Jenkins build itself does not fail/exit. To reproduce this, I’ve created a similar pipeline that simulates the scenario:

Pipeline:

node("192.168.111.134"){
    sh '''
        i=1
        while [ $i -le 600 ]
        do
            echo " $i"
            ((i++))
            sleep 5
        done
    '''
}

While the pipeline was running, I rebooted the agent “192.168.111.134”. My question is: how can I ensure that the Jenkins build fails/exits immediately when the agent crashes? Here is the pipeline output:

15:34:41  Running on 192.168.111.134 in /home/tcnsh/k8s-workspace/workspace/test
15:34:41  [Pipeline] {
15:34:42  [Pipeline] sh
15:34:43  + i=1
15:34:43  + '[' 1 -le 600 ']'
15:34:43  + echo ' 1'
15:34:43   1
15:34:43  + (( i++ ))
15:34:43  + sleep 5
15:34:49  + '[' 2 -le 600 ']'
15:34:49  + echo ' 2'
15:34:49   2
15:34:49  + (( i++ ))
15:34:49  + sleep 5
15:34:54  + '[' 3 -le 600 ']'
15:34:54  + echo ' 3'
15:34:54   3
15:34:54  + (( i++ ))
15:34:54  + sleep 5
15:34:58  + '[' 4 -le 600 ']'
15:34:58  + echo ' 4'
15:34:58   4
15:34:58  + (( i++ ))
15:34:58  + sleep 5
15:35:00  Cannot contact 192.168.111.134: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@777efc9f:192.168.111.134": Remote call on 192.168.111.134 failed. The channel is closing down or has closed down
15:45:01  wrapper script does not seem to be touching the log file in /home/tcnsh/k8s-workspace/workspace/test@tmp/durable-85599b58
15:45:01  (JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)

mawinter69 · February 21, 2024, 8:48am

Jenkins will only detect the crash when the agent comes back online, at least for static agents. The reason is that the execution of the sh step is decoupled from the Java process of the agent. This ensures that when Jenkins is restarted, or when just the Java agent process crashes, the build is not lost and Jenkins can recover. When the agent is back online, any logs that were written in the meantime are pushed to the controller. If the process has finished, that is reported as well. Jenkins cannot determine whether the connection was lost because the machine crashed or because only the agent process did.
This might be different for agents from a cloud provider plugin where the cloud provider is able to detect the crash and thus can tell Jenkins that the process stopped.

longkang · February 21, 2024, 9:04am

Hi @mawinter69, thank you for the thorough explanation. I appreciate understanding the underlying rules. Based on what you said, I have one more question:

But what happens if it takes hours for the agent to come back online, or if it never comes back online without manual intervention? Will the job continue to run indefinitely, or will it eventually time out (not via the timeout block defined in the pipeline) and terminate?

mawinter69 · February 21, 2024, 9:17am

It will wait forever if you have not wrapped it in a timeout step in your pipeline. I’ve seen this behaviour as well, so using a timeout step is a good idea.
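
For illustration, here is a minimal sketch of the original test pipeline wrapped in a `timeout` step. The 60-minute limit is an assumed value, chosen only because the loop itself needs about 50 minutes (600 iterations × 5 s); pick a limit that fits your real job.

```groovy
node("192.168.111.134") {
    // Assumed limit: abort the body if it runs longer than 60 minutes.
    // The timer is tracked by the controller, so it still fires while
    // the agent is unreachable.
    timeout(time: 60, unit: 'MINUTES') {
        sh '''
            i=1
            while [ $i -le 600 ]
            do
                echo " $i"
                # POSIX arithmetic, equivalent to ((i++)) in the original script
                i=$((i + 1))
                sleep 5
            done
        '''
    }
}
```

Note that this does not make the build stop immediately when the agent crashes; it only bounds how long the build can hang. When the limit is reached the build is normally marked as aborted rather than failed. The `timeout` step also accepts `activity: true`, which measures time since the last log output instead of total run time, but whether that reacts sooner in this particular scenario is untested here.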
