Jenkins stuck waiting for agent communication when starting new executor

Any hints on how to fix this would be greatly appreciated.

We are running on GCP. Sometimes, when launching an agent, the process gets stuck. Under “Build Executor Status” there is an agent with a “launching…” annotation, seemingly forever. The agent log typically looks like this:

INFO: Connected via SSH.
May 25, 2024 8:18:28 PM null
INFO: Verifying: /opt/corretto_java11/bin/java -fullversion
openjdk full version "11.0.20.1+9-LTS"
May 25, 2024 8:18:28 PM null
INFO: Copying agent.jar to: /var/jenkins
May 25, 2024 8:18:29 PM null
INFO: Launching Jenkins agent via plugin SSH: /opt/corretto_java11/bin/java -jar /var/jenkins/agent.jar

and then nothing happens for hours or days. This happens only on arm instances, though I am not sure whether that is relevant. I don’t think I have ever seen Jenkins recover from this: pipelines requiring arm will not run, presumably because they are waiting for this executor to come online. Other pipelines run fine, and other (x64) agents are started on demand.

The executor server is healthy: I can log into it and start the agent manually using the command printed in the log. I can also manually ask Jenkins to “relaunch agent”, which usually works.

The stack trace on the Jenkins server side is:

"Computer.threadPoolForRemoting [#3335] for lab--rocky-linux-8--arm64... id=7" #607165 daemon prio=5 os_prio=0 cpu=125.54ms elapsed=170.79s tid=0x00007ba41db5bfd0 nid=0xe51c4 in Object.wait()  [0x00007ba36bafe000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(java.base@17.0.11/Native Method)
	- waiting on <no object reference available>
	at hudson.remoting.PipeWindow$Real.get(PipeWindow.java:232)
	- locked <0x0000000771dee088> (a hudson.remoting.PipeWindow$Real)
	at hudson.remoting.ProxyOutputStream.write(ProxyOutputStream.java:127)
	- locked <0x0000000771dee030> (a hudson.remoting.ProxyOutputStream)
	at hudson.remoting.RemoteOutputStream.write(RemoteOutputStream.java:112)
	at hudson.remoting.Util.copy(Util.java:53)
	at hudson.remoting.JarLoaderImpl.writeJarTo(JarLoaderImpl.java:57)
	at jdk.internal.reflect.GeneratedMethodAccessor572.invoke(Unknown Source)
	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@17.0.11/Unknown Source)
	at java.lang.reflect.Method.invoke(java.base@17.0.11/Unknown Source)
	at hudson.remoting.RemoteInvocationHandler$RPCRequest.perform(RemoteInvocationHandler.java:924)
	at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:902)
	at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:853)
	at hudson.remoting.UserRequest.perform(UserRequest.java:211)
	at hudson.remoting.UserRequest.perform(UserRequest.java:54)
	at hudson.remoting.Request$2.run(Request.java:377)
	at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
	at hudson.remoting.InterceptingExecutorService$$Lambda$1228/0x00000008013d3ad8.call(Unknown Source)
	at org.jenkinsci.remoting.CallableDecorator.call(CallableDecorator.java:18)
	at hudson.remoting.CallableDecoratorList.lambda$applyDecorator$0(CallableDecoratorList.java:19)
	at hudson.remoting.CallableDecoratorList$$Lambda$1229/0x00000008013d3d00.call(Unknown Source)
	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
	at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
	at java.util.concurrent.FutureTask.run(java.base@17.0.11/Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.11/Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.11/Unknown Source)
	at java.lang.Thread.run(java.base@17.0.11/Unknown Source)

which seems to wait in a loop for the other side to acknowledge data. It would terminate only if an exception is set in PipeWindow::dead. Another relevant stack trace:

"Channel reader thread: lab--rocky-linux-8--arm64-..." #607163 daemon prio=5 os_prio=0 cpu=4.15ms elapsed=171.03s tid=0x00007ba3d8757860 nid=0xe51c2 in Object.wait()  [0x00007ba37b428000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(java.base@17.0.11/Native Method)
	- waiting on <no object reference available>
	at java.lang.Object.wait(java.base@17.0.11/Unknown Source)
	at com.trilead.ssh2.channel.FifoBuffer.read(FifoBuffer.java:212)
	- locked <0x000000077275e628> (a com.trilead.ssh2.channel.Channel)
	at com.trilead.ssh2.channel.Channel$Output.read(Channel.java:127)
	at com.trilead.ssh2.channel.ChannelManager.getChannelData(ChannelManager.java:935)
	- locked <0x000000077275e628> (a com.trilead.ssh2.channel.Channel)
	at com.trilead.ssh2.channel.ChannelInputStream.read(ChannelInputStream.java:58)
	at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:102)
	at hudson.remoting.ChunkedInputStream.read(ChunkedInputStream.java:48)
	at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:99)
	at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:61)

This code waits on a lock:

synchronized (lock) {
    ...
    lock.wait(); // wait until the writer gives us something
}

which is probably not a problem, as it waits for commands from Jenkins (I think).
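To illustrate why the first thread can sit in TIMED_WAITING indefinitely, here is a minimal Java sketch of a flow-control wait loop of this kind. It is a simplified illustration based on my reading of the trace, not the actual hudson.remoting.PipeWindow source. The class name WindowSketch and the pollMillis parameter are hypothetical.

```java
import java.io.IOException;

/**
 * Simplified illustration of a flow-control window.
 * NOT the real hudson.remoting.PipeWindow; all names are hypothetical.
 */
class WindowSketch {
    private int available;   // bytes the writer may still send
    private Throwable dead;  // set once the channel is known to be broken

    WindowSketch(int initial) {
        this.available = initial;
    }

    /** The reading side acknowledges consumed bytes, which reopens the window. */
    synchronized void increase(int delta) {
        available += delta;
        notifyAll();
    }

    /** Marks the window as dead so that blocked writers fail instead of waiting. */
    synchronized void dead(Throwable cause) {
        dead = cause;
        notifyAll();
    }

    /**
     * Blocks until at least min bytes are available.
     * The timed wait only re-checks the condition; it never gives up by itself.
     */
    synchronized int get(int min, long pollMillis) throws IOException, InterruptedException {
        while (available < min) {
            if (dead != null) {
                throw new IOException("window is dead", dead);
            }
            wait(pollMillis); // appears as TIMED_WAITING in a thread dump
        }
        return available;
    }
}
```

In this sketch a writer that calls get() leaves the loop only when increase() or dead() is called. If the remote side never acknowledges anything and nobody reports the channel as broken, the thread keeps polling forever, which would match the symptom above.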

I understand this cuts across three components (remoting, google-compute-engine, and trilead-ssh2), and I have not been able to make sense of all the pieces yet.

Expected behaviour:

Whatever the problem with the connection is, I would expect a timeout somewhere that retries or kills stuck agents/executors. Am I missing a configuration option?
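To make the expectation concrete, the behaviour I have in mind looks roughly like the following generic Java sketch: each launch attempt gets a deadline, and a stuck attempt is cancelled and retried. This uses plain java.util.concurrent, not any Jenkins or plugin API. LaunchWatchdogSketch, launchWithTimeout and its parameters are hypothetical names for illustration only.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Generic deadline-and-retry wrapper; hypothetical, not part of Jenkins. */
class LaunchWatchdogSketch {

    /**
     * Runs the launch callable with a deadline, retrying up to maxAttempts times.
     * Returns the 1-based number of the attempt that succeeded,
     * or -1 if every attempt timed out or failed.
     */
    static int launchWithTimeout(Callable<Void> launch, long timeoutMillis, int maxAttempts)
            throws InterruptedException {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<Void> future = pool.submit(launch);
                try {
                    future.get(timeoutMillis, TimeUnit.MILLISECONDS);
                    return attempt;
                } catch (TimeoutException e) {
                    future.cancel(true); // interrupt the stuck attempt, then retry
                } catch (ExecutionException e) {
                    // the attempt failed outright; fall through and retry
                }
            }
            return -1;
        } finally {
            pool.shutdownNow(); // do not leave worker threads behind
        }
    }
}
```

Called as launchWithTimeout(task, 60_000, 3), this would give each attempt one minute before interrupting it and trying again. Note that cancellation relies on the stuck code reacting to thread interruption, which may not hold for every blocking call.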

Jenkins setup:

Jenkins: 2.459
OS: Linux - 5.15.146+
Java: 17.0.11 - Eclipse Adoptium (OpenJDK 64-Bit Server VM)

antisamy-markup-formatter:162.v0e6ec0fcfcf6
apache-httpcomponents-client-4-api:4.5.14-208.v438351942757
apache-httpcomponents-client-5-api:5.3.1-1.0
asm-api:9.7-33.v4d23ef79fcc8
authentication-tokens:1.53.v1c90fd9191a_b_
bootstrap5-api:5.3.3-1
bouncycastle-api:2.30.1.77-225.v26ea_c9455fd9
branch-api:2.1169.va_f810c56e895
caffeine-api:3.1.8-133.v17b_1ff2e0599
checks-api:2.2.0
cloud-stats:336.v788e4055508b_
cloudbees-folder:6.940.v7fa_03b_f14759
commons-lang3-api:3.13.0-62.v7d18e55f51e2
commons-text-api:1.11.0-109.vfe16c66636eb_
configuration-as-code:1810.v9b_c30a_249a_4c
copyartifact:722.v0662a_9b_e22a_c
credentials:1337.v60b_d7b_c7b_c9f
credentials-binding:677.vdc9d38cb_254d
display-url-api:2.204.vf6fddd8a_8b_e9
docker-commons:439.va_3cb_0a_6a_fb_29
docker-java-api:3.3.4-86.v39b_a_5ede342c
docker-plugin:1.6.1
durable-task:555.v6802fe0f0b_82
echarts-api:5.5.0-1
font-awesome-api:6.5.2-1
gcp-secrets-manager-credentials-provider:0.3.1
git:5.2.2
git-changelog:3.38
git-client:4.7.0
github:1.39.0
github-api:1.318-461.v7a_c09c9fa_d63
google-compute-engine:4.573.v7dcd6a_37a_ee2
google-kubernetes-engine:0.430.v4cc1fa_1847a_9
google-metadata-plugin:0.5
google-oauth-plugin:1.330.vf5e86021cb_ec
google-source-plugin:0.4
google-storage-plugin:1.360.v6ca_38618b_41f
gson-api:2.10.1-15.v0d99f670e0a_7
instance-identity:185.v303dc7c645f9
ionicons-api:74.v93d5eb_813d5f
jackson2-api:2.17.0-379.v02de8ec9f64c
jakarta-activation-api:2.1.3-1
jakarta-mail-api:2.1.3-1
javax-activation-api:1.2.0-6
javax-mail-api:1.6.2-9
jaxb:2.3.9-1
joda-time-api:2.12.7-29.v5a_b_e3a_82269a_
jquery3-api:3.7.1-2
jsch:0.2.16-86.v42e010d9484b_
json-api:20240303-41.v94e11e6de726
json-path-api:2.9.0-58.v62e3e85b_a_655
junit:1265.v65b_14fa_f12f0
kubernetes:4231.vb_a_6b_8936497d
kubernetes-client-api:6.10.0-240.v57880ce8b_0b_2
kubernetes-credentials:0.11
mailer:472.vf7c289a_4b_420
matrix-project:822.824.v14451b_c0fd42
metrics:4.2.21-449.v6960d7c54c69
mina-sshd-api-common:2.12.1-101.v85b_e08b_780dd
mina-sshd-api-core:2.12.1-101.v85b_e08b_780dd
monitoring:1.98.0
multibranch-scan-webhook-trigger:1.0.11
nested-view:1.33
oauth-credentials:0.646.v02b_66dc03d2e
okhttp-api:4.11.0-172.vda_da_1feeb_c6e
pipeline-build-step:540.vb_e8849e1a_b_d8
pipeline-graph-analysis:216.vfd8b_ece330ca_
pipeline-graph-view:287.v3ef017b_780d5
pipeline-groovy-lib:710.v4b_94b_077a_808
pipeline-input-step:495.ve9c153f6067b_
pipeline-milestone-step:119.vdfdc43fc3b_9a_
pipeline-model-api:2.2198.v41dd8ef6dd56
pipeline-model-definition:2.2198.v41dd8ef6dd56
pipeline-model-extensions:2.2198.v41dd8ef6dd56
pipeline-rest-api:2.34
pipeline-stage-step:312.v8cd10304c27a_
pipeline-stage-tags-metadata:2.2198.v41dd8ef6dd56
pipeline-stage-view:2.34
plain-credentials:182.v468b_97b_9dcb_8
plugin-util-api:4.1.0
prism-api:1.29.0-15
resource-disposer:0.23
saml:4.464.vea_cb_75d7f5e0
scm-api:690.vfc8b_54395023
scmskip:50.vfb_3a_f04242a_a_
script-security:1336.vf33a_a_9863911
slack:715.v1cfed1b_9c63c
snakeyaml-api:2.2-111.vc6598e30cc65
ssh-credentials:337.v395d2403ccd4
ssh-slaves:2.948.vb_8050d697fec
sshd:3.322.v159e91f6a_550
structs:337.v1b_04ea_4df7c8
test-results-analyzer:0.4.1
text-finder:1.27
timestamper:1.27
token-macro:400.v35420b_922dcb_
trilead-api:2.142.v748523a_76693
variant:60.v7290fc0eb_b_cd
view-job-filters:377.v66f4b_796e5fa_
workflow-aggregator:596.v8c21c963d92d
workflow-api:1311.v4250456a_e552
workflow-basic-steps:1058.vcb_fc1e3a_21a_9
workflow-cps:3894.3896.vca_2c931e7935
workflow-durable-task-step:1353.v1891a_b_01da_18
workflow-job:1415.v4f9c9131248b_
workflow-multibranch:791.v28fb_f74dfca_e
workflow-scm-step:427.v4ca_6512e7df1
workflow-step-api:657.v03b_e8115821b_
workflow-support:907.v6713a_ed8a_573
ws-cleanup:0.45