Automatically stop the job on `hudson.remoting.ChannelClosedException`

Hi there,

I have an issue with a Jenkins pipeline.
When the agent disconnects while running a command (usually a code compilation command that can take some time, but the command doesn’t matter), the job gets stuck.
It never stops until a user cancels it, or until the pipeline timeout expires, if one is configured.

(edit) We have the option "Do not allow the pipeline to resume if the controller restarts" turned on, so there is no need to wait for the agent to reconnect: the job won’t be able to continue anyway.

I would like to find a way to make the job stop as soon as it receives a `hudson.remoting.ChannelClosedException`.

I tried adding a try/catch inside my step, but the catch block is never reached, and neither is the job’s post block.

I made you a pipeline to reproduce the bug:

pipeline {
	parameters {
		string(name: 'AGENT_NAME', defaultValue: 'generic', description: 'Specify the agent name where the job should run, by default it will take any agent with this generic label, but you can replace it with a specific agent name.')
	}
	agent {
		node {
			label "${params.AGENT_NAME}"
			customWorkspace "Sandbox"
		}
	}
	options {
		skipDefaultCheckout()
	}
	stages {
		stage('Stage 1') {
			steps {
				script {
					try {
						println("Stage 1")
						bat "echo %TIME% & echo. & echo %DATE%"
						bat "start /wait timeout 30"
						println("After sleep")
						bat "echo %TIME% & echo. & echo %DATE%"
					}
					catch (org.jenkinsci.plugins.workflow.steps.FlowInterruptedException e) {
						println("Catched !")
						println(e)
					}
				}
			}
		}
	}
	post {
		always {
			script {
				println("post reached!")
			}
		}
	}
}

To test it, I run it on my agent and, in the middle of the bat "start /wait timeout 30" call, I close the agent connection. The job then stays stuck forever, like this:

[2024-07-11T14:23:43.039Z] Started by user Ernest
[2024-07-11T14:23:43.053Z] Resume disabled by user, switching to high-performance, low-durability mode.
[2024-07-11T14:23:43.212Z] [Pipeline] Start of Pipeline
[2024-07-11T14:23:43.450Z] [Pipeline] node
[2024-07-11T14:23:43.457Z] Running on Agent-Win11-1 in e:\workspace\jenkins_sandbox_2
[2024-07-11T14:23:43.457Z] [Pipeline] {
[2024-07-11T14:23:43.470Z] [Pipeline] ws
[2024-07-11T14:23:43.472Z] Running in e:\Sandbox
[2024-07-11T14:23:43.472Z] [Pipeline] {
[2024-07-11T14:23:43.482Z] [Pipeline] stage
[2024-07-11T14:23:43.483Z] [Pipeline] { (Stage 1)
[2024-07-11T14:23:43.499Z] [Pipeline] script
[2024-07-11T14:23:43.503Z] [Pipeline] {
[2024-07-11T14:23:43.511Z] [Pipeline] echo
[2024-07-11T14:23:43.512Z] Stage 1
[2024-07-11T14:23:43.517Z] [Pipeline] bat
[2024-07-11T14:23:43.862Z] 
[2024-07-11T14:23:43.862Z] e:\Sandbox>echo 16:23:43.35   & echo.   & echo Thu 07/11/2024 
[2024-07-11T14:23:43.862Z] 16:23:43.35 
[2024-07-11T14:23:43.862Z]  
[2024-07-11T14:23:43.862Z] Thu 07/11/2024
[2024-07-11T14:23:43.880Z] [Pipeline] bat
[2024-07-11T14:23:44.226Z] 
[2024-07-11T14:23:44.226Z] e:\Sandbox>start /wait timeout 30 
[2024-07-11T14:23:47.601Z] Cannot contact Agent-Win11-1: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@9f67038:Agent-Win11-1": Remote call on Agent-Win11-1 failed. The channel is closing down or has closed down
[2024-07-11T14:24:43.577Z] [Pipeline] echo
[2024-07-11T14:24:43.578Z] After sleep
[2024-07-11T14:24:43.579Z] [Pipeline] bat

Any ideas?

Thank you!

Here is my Jenkins config:

Jenkins setup:

Jenkins: 2.440.1
OS: Linux - 6.1.85+
Java: 17.0.10 - Eclipse Adoptium (OpenJDK 64-Bit Server VM)
---
analysis-model-api:12.1.0
antisamy-markup-formatter:162.v0e6ec0fcfcf6
apache-httpcomponents-client-4-api:4.5.14-208.v438351942757
apache-httpcomponents-client-5-api:5.3.1-1.0
authentication-tokens:1.53.v1c90fd9191a_b_
authorize-project:1.7.1
blueocean:1.27.11
blueocean-autofavorite:1.2.5
blueocean-bitbucket-pipeline:1.27.11
blueocean-commons:1.27.11
blueocean-config:1.27.11
blueocean-core-js:1.27.11
blueocean-dashboard:1.27.11
blueocean-display-url:2.4.2
blueocean-events:1.27.11
blueocean-git-pipeline:1.27.11
blueocean-github-pipeline:1.27.11
blueocean-i18n:1.27.11
blueocean-jwt:1.27.11
blueocean-personalization:1.27.11
blueocean-pipeline-api-impl:1.27.11
blueocean-pipeline-editor:1.27.11
blueocean-pipeline-scm-api:1.27.11
blueocean-rest:1.27.11
blueocean-rest-impl:1.27.11
blueocean-web:1.27.11
bootstrap5-api:5.3.2-4
bouncycastle-api:2.30.1.77-225.v26ea_c9455fd9
branch-api:2.1152.v6f101e97dd77
build-blocker-plugin:1.7.9
build-failure-analyzer:2.5.0
build-monitor-plugin:1.14-860.vd06ef2568b_3f
build-timeout:1.32
build-timestamp:1.0.3
caffeine-api:3.1.8-133.v17b_1ff2e0599
checks-api:2.0.2
cloud-stats:336.v788e4055508b_
cloudbees-bitbucket-branch-source:877.vb_b_d5243f6794
cloudbees-disk-usage-simple:203.v3f46a_7462b_1a_
cloudbees-folder:6.901.vb_4c7a_da_75da_3
command-launcher:107.v773860566e2e
commons-compress-api:1.26.1-2
commons-httpclient3-api:3.1-3
commons-lang3-api:3.13.0-62.v7d18e55f51e2
commons-text-api:1.11.0-95.v22a_d30ee5d36
conditional-buildstep:1.4.3
configuration-as-code:1775.v810dc950b_514
credentials:1319.v7eb_51b_3a_c97b_
credentials-binding:657.v2b_19db_7d6e6d
data-tables-api:1.13.8-4
depgraph-view:1.0.5
discard-old-build:1.07
disk-usage:1.2
display-url-api:2.200.vb_9327d658781
docker-build-publish:1.4.0
docker-commons:439.va_3cb_0a_6a_fb_29
docker-java-api:3.3.4-86.v39b_a_5ede342c
docker-plugin:1.6
docker-workflow:572.v950f58993843
durable-task:550.v0930093c4b_a_6
echarts-api:5.4.3-4
email-ext:2.104
emailext-template:1.5
extended-read-permission:53.v6499940139e5
favorite:2.208.v91d65b_7792a_c
font-awesome-api:6.5.1-3
forensics-api:2.4.0
git:5.2.1
git-client:4.6.0
git-server:114.v068a_c7cc2574
github:1.38.0
github-api:1.318-461.v7a_c09c9fa_d63
github-branch-source:1772.va_69eda_d018d4
google-container-registry-auth:0.3
google-login:109.v022b_cf87b_e5b_
google-oauth-plugin:1.330.vf5e86021cb_ec
gson-api:2.10.1-15.v0d99f670e0a_7
handy-uri-templates-2-api:2.1.8-30.v7e777411b_148
htmlpublisher:1.32
instance-identity:185.v303dc7c645f9
ionicons-api:56.v1b_1c8c49374e
jackson2-api:2.16.1-373.ve709c6871598
jakarta-activation-api:2.0.1-3
jakarta-mail-api:2.0.1-3
javadoc:243.vb_b_503b_b_45537
javax-activation-api:1.2.0-6
javax-mail-api:1.6.2-9
jaxb:2.3.9-1
jdk-tool:73.vddf737284550
jenkins-design-language:1.27.11
jjwt-api:0.11.5-77.v646c772fddb_0
joda-time-api:2.12.7-29.v5a_b_e3a_82269a_
jquery3-api:3.7.1-2
jsch:0.2.16-86.v42e010d9484b_
json-api:20240205-27.va_007549e895c
json-path-api:2.9.0-33.v2527142f2e1d
junit:1259.v65ffcef24a_88
kubernetes:4186.v1d804571d5d4
kubernetes-client-api:6.10.0-240.v57880ce8b_0b_2
kubernetes-credentials:0.11
lockable-resources:1243.v346d600eea_24
mailer:463.vedf8358e006b_
markdown-formatter:167.v8a_428ca_49f89
matrix-auth:3.2.1
matrix-project:822.824.v14451b_c0fd42
maven-plugin:3.23
metrics:4.2.21-449.v6960d7c54c69
mina-sshd-api-common:2.12.0-90.v9f7fb_9fa_3d3b_
mina-sshd-api-core:2.12.0-90.v9f7fb_9fa_3d3b_
monitoring:1.98.0
oauth-credentials:0.646.v02b_66dc03d2e
okhttp-api:4.11.0-172.vda_da_1feeb_c6e
p4:1.15.1
parameterized-scheduler:262.v00f3d90585cc
parameterized-trigger:787.v665fcf2a_830b_
periodic-reincarnation:1.13
pipeline-build-step:540.vb_e8849e1a_b_d8
pipeline-github-lib:42.v0739460cda_c4
pipeline-graph-analysis:216.vfd8b_ece330ca_
pipeline-groovy-lib:704.vc58b_8890a_384
pipeline-input-step:491.vb_07d21da_1a_fb_
pipeline-milestone-step:111.v449306f708b_7
pipeline-model-api:2.2175.v76a_fff0a_2618
pipeline-model-definition:2.2175.v76a_fff0a_2618
pipeline-model-extensions:2.2175.v76a_fff0a_2618
pipeline-rest-api:2.34
pipeline-stage-step:305.ve96d0205c1c6
pipeline-stage-tags-metadata:2.2175.v76a_fff0a_2618
pipeline-stage-view:2.34
pipeline-utility-steps:2.17.0
plain-credentials:143.v1b_df8b_d3b_e48
plugin-util-api:4.1.0
pollscm:1.5
prism-api:1.29.0-13
prometheus:2.5.1
prqa-plugin:3.3.5
pubsub-light:1.18
resource-disposer:0.23
role-strategy:689.v731678c3e0eb_
run-condition:1.7
saml:4.464.vea_cb_75d7f5e0
scm-api:683.vb_16722fb_b_80b_
script-security:1326.vdb_c154de8669
slack:684.v833089650554
snakeyaml-api:2.2-111.vc6598e30cc65
sse-gateway:1.26
ssh-credentials:308.ve4497b_ccd8f4
ssh-slaves:2.948.vb_8050d697fec
sshd:3.322.v159e91f6a_550
structs:337.v1b_04ea_4df7c8
throttle-concurrents:2.14
timestamper:1.26
token-macro:400.v35420b_922dcb_
trilead-api:2.133.vfb_8a_7b_9c5dd1
validating-string-parameter:183.v3748e79b_9737
variant:60.v7290fc0eb_b_cd
warnings-ng:11.1.0
workflow-aggregator:596.v8c21c963d92d
workflow-api:1291.v51fd2a_625da_7
workflow-basic-steps:1042.ve7b_140c4a_e0c
workflow-cps:3880.vb_ef4b_5cfd270
workflow-durable-task-step:1331.vc8c2fed35334
workflow-job:1400.v7fd111b_ec82f
workflow-multibranch:773.vc4fe1378f1d5
workflow-scm-step:415.v434365564324
workflow-step-api:657.v03b_e8115821b_
workflow-support:865.v43e78cc44e0d
ws-cleanup:0.45

This behaviour is actually by design. It lets you restart Jenkins while builds are running: once Jenkins comes back, it automatically detects the running pipelines.
It also covers the case of a short network problem that disconnects the agent.

If you have an inbound agent and you restart it, you should normally see your job continue to run and finish successfully.
Outbound agents will usually try to restart automatically.

Yes, but in this case we have the option "Do not allow the pipeline to resume if the controller restarts" enabled, because we had too many issues with resumes and weird states.
With this option on, it shouldn’t wait for anything.

But you kill the agent process without restarting the controller, so this option is not relevant.

Oh OK, I didn’t understand this option properly. I thought it was about the agent, not the controller restart.

So is there no way to achieve what we need, i.e. stop the job if the agent disconnects?

A couple of times a month, we find jobs that have been stuck for 12+ hours because the agent shut down and no one on our team noticed. We have a lot of agents, so we would be fine with the job stopping when this happens and then restarting on a different agent. But since the job keeps waiting, and neither the catch block nor the post step is executed, it stays stuck.

Also, this job doesn’t allow parallel runs for another reason, so it also blocks future automatically scheduled runs.

The only workaround we found is to set a global timeout in the pipeline options, but it still makes the job wait quite a long time for nothing.
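
For reference, here is a minimal sketch of that timeout workaround, based on the reproduction pipeline above. The `timeout` entry in `options` is the global limit we use today. The inner `timeout` step around the `bat` call is an untested idea for shortening the wait: whether it fires while the agent is disconnected is an assumption, not something we have verified. The `generic` label comes from the reproduction pipeline, and the 1-hour and 10-minute values are placeholders.

```groovy
pipeline {
	agent {
		label 'generic'
	}
	options {
		skipDefaultCheckout()
		// Global upper bound for the whole run (placeholder value).
		// This is the workaround that currently ends a stuck build.
		timeout(time: 1, unit: 'HOURS')
	}
	stages {
		stage('Stage 1') {
			steps {
				// Tighter bound around the long-running command only
				// (placeholder value; untested against an agent disconnect).
				timeout(time: 10, unit: 'MINUTES') {
					bat "start /wait timeout 30"
				}
			}
		}
	}
	post {
		always {
			echo 'post reached!'
		}
	}
}
```

The step-level limit has to be sized per command, so it only helps if the expected duration of each long-running step is roughly known.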