Extremely slow execution on dynamic agents on an external K8s cluster

Jenkins setup:

Operating system: Ubuntu 22.04 (64-bit) across the entire infrastructure.
Jenkins and plugin versions (only the relevant plugins are listed):

Jenkins: 2.426.3
OS: Linux - 5.15.0-33-generic
Java: 17.0.9 - Eclipse Adoptium (OpenJDK 64-Bit Server VM)
---
credentials:1319.v7eb_51b_3a_c97b_
credentials-binding:657.v2b_19db_7d6e6d
git:5.2.1
git-client:4.6.0
git-server:114.v068a_c7cc2574
github:1.37.3.1
github-api:1.318-461.v7a_c09c9fa_d63
github-branch-source:1772.va_69eda_d018d4
gradle:2.9
jackson2-api:2.16.1-373.ve709c6871598
javax-mail-api:1.6.2-9
jaxb:2.3.9-1
jdk-tool:73.vddf737284550
kubernetes:4179.v3b_88431df708
kubernetes-cli:1.12.1
kubernetes-client-api:6.10.0-240.v57880ce8b_0b_2
kubernetes-credentials:0.11

Jenkins runs as a pod on a Kubernetes cluster (let’s call it cluster A) on a VM inside server #1. On another cluster (cluster B), on a VM inside server #2, I have set up the required service account, and I have added cluster B as a cloud in Jenkins.

The dynamic agents reach Jenkins through HAProxy over a local network (192.168.50.0/24), which is a direct connection (a vSwitch in Hetzner’s terms). Reachability of cluster B has been verified with the “Test Connection” option on the Clouds page.
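For reference, the cloud definition is equivalent to the following script-console sketch (all URLs, names, and credential IDs here are placeholders, not the exact values in use):

```groovy
import jenkins.model.Jenkins
import org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud

// Sketch of the cluster B cloud; every value below is a placeholder.
def cloud = new KubernetesCloud('cluster-B')
cloud.setServerUrl('https://192.168.50.3:6443')   // cluster B API server (assumed address)
cloud.setCredentialsId('cluster-b-sa-token')      // service account token credential (assumed ID)
cloud.setNamespace('jenkins-agents')              // namespace for the agent pods (assumed)
cloud.setJenkinsUrl('http://192.168.50.2:8080/')  // URL the agents use to reach Jenkins via HAProxy (assumed)
cloud.setWebSocket(true)                          // agents dial back over WebSocket, as the logs below show
Jenkins.get().clouds.add(cloud)
```

Note that with WebSocket enabled, the jnlp container opens the connection back to the Jenkins URL above, and that is the connection that times out in problem #1 below.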

Pipeline information: I am using a podTemplate with three containers (jnlp, docker, and git). I install kubectl in the docker container using apk, with CPU set to 1000m and memory to 4Gi. I have verified with kubectl on cluster B that the resource limits are applied correctly.
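The pod template looks roughly like this (a minimal sketch, not the exact pipeline; the label, image tags, and container commands are assumptions):

```groovy
// Minimal sketch of the pod template described above; label and images are assumptions.
podTemplate(
    cloud: 'cluster-B',
    label: 'k8s-dynamic',
    containers: [
        containerTemplate(name: 'jnlp', image: 'jenkins/inbound-agent:latest',
            resourceRequestCpu: '1000m', resourceLimitCpu: '1000m',
            resourceRequestMemory: '4Gi', resourceLimitMemory: '4Gi'),
        containerTemplate(name: 'docker', image: 'docker:24-cli',
            command: 'sleep', args: '99d'),
        containerTemplate(name: 'git', image: 'alpine/git:latest',
            command: 'sleep', args: '99d')
    ]
) {
    node('k8s-dynamic') {
        container('docker') {
            sh 'apk add --no-cache kubectl'  // kubectl installed via apk, as described
        }
    }
}
```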

Problems:

Problem #1: Failure in dynamic pod creation

Sometimes Jenkins fails to provision a dynamic agent. This is intermittent but very frequent. I can see the pod being created on the K8s cluster (cluster B), but the jnlp container then fails to connect back to Jenkins and terminates with the following error:

- jnlp -- terminated (255)
-----Logs-------------
Jan 30, 2024 5:44:17 AM hudson.remoting.Launcher createEngine
INFO: Setting up agent: clusterB-138-cbt8q-n59wk-99wml
Jan 30, 2024 5:44:17 AM hudson.remoting.Engine startEngine
INFO: Using Remoting version: 3198.v03a_401881f3e
Jan 30, 2024 5:44:17 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using /home/jenkins/agent/remoting as a remoting work directory
Jan 30, 2024 5:46:27 AM hudson.remoting.Launcher$CuiListener error
SEVERE: Connection failed.
io.jenkins.remoting.shaded.jakarta.websocket.DeploymentException: Connection failed.
	at io.jenkins.remoting.shaded.org.glassfish.tyrus.container.jdk.client.JdkClientContainer$1.call(JdkClientContainer.java:187)
	at io.jenkins.remoting.shaded.org.glassfish.tyrus.container.jdk.client.JdkClientContainer$1.call(JdkClientContainer.java:107)
	at io.jenkins.remoting.shaded.org.glassfish.tyrus.container.jdk.client.JdkClientContainer.openClientSocket(JdkClientContainer.java:192)
	at io.jenkins.remoting.shaded.org.glassfish.tyrus.client.ClientManager$3$1.run(ClientManager.java:647)
	at io.jenkins.remoting.shaded.org.glassfish.tyrus.client.ClientManager$3.run(ClientManager.java:696)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at io.jenkins.remoting.shaded.org.glassfish.tyrus.client.ClientManager$SameThreadExecutorService.execute(ClientManager.java:849)
	at java.base/java.util.concurrent.AbstractExecutorService.submit(Unknown Source)
	at io.jenkins.remoting.shaded.org.glassfish.tyrus.client.ClientManager.connectToServer(ClientManager.java:493)
	at io.jenkins.remoting.shaded.org.glassfish.tyrus.client.ClientManager.connectToServer(ClientManager.java:337)
	at hudson.remoting.Engine.runWebSocket(Engine.java:731)
	at hudson.remoting.Engine.run(Engine.java:519)
Caused by: java.net.ConnectException: Connection timed out
	at java.base/sun.nio.ch.UnixAsynchronousSocketChannelImpl.checkConnect(Native Method)
	at java.base/sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishConnect(Unknown Source)
	at java.base/sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(Unknown Source)
	at java.base/sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(Unknown Source)
	at java.base/sun.nio.ch.EPollPort$EventHandlerTask.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)


[Pipeline] // node
[Pipeline] }
[Pipeline] // podTemplate
[Pipeline] End of Pipeline
Queue task was cancelled
org.jenkinsci.plugins.workflow.actions.ErrorAction$ErrorId: b8fa138f-652a-4c6d-b3f7-ed70d2dbc6c8
Finished: ABORTED

Problem #2: Extreme slowness

Even when I do get a dynamic pod running, execution is terribly slow, and various errors are thrown during the run, for example:

[Pipeline] sh
10:55:52  Failed to start websocket connection: io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
10:55:52  	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:129)
10:55:52  	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:122)
10:55:52  	at io.fabric8.kubernetes.client.utils.Utils.waitUntilReady(Utils.java:177)
10:55:52  	at io.fabric8.kubernetes.client.utils.Utils.waitUntilReadyOrFail(Utils.java:185)
10:55:52  	at io.fabric8.kubernetes.client.dsl.internal.core.v1.PodOperationsImpl.setupConnectionToPod(PodOperationsImpl.java:387)
10:55:52  	at io.fabric8.kubernetes.client.dsl.internal.core.v1.PodOperationsImpl.exec(PodOperationsImpl.java:293)

Simple operations such as docker login or installing kubectl sometimes take 3-5 minutes.
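Concretely, the slow steps have this shape (the registry and credential variables are placeholders). As I understand it, each sh step inside a container(...) block runs through a pod exec WebSocket to cluster B’s API server, which is where the setupConnectionToPod error above originates:

```groovy
container('docker') {
    // Each sh step opens an exec connection to the cluster B API server.
    sh 'docker login registry.example.com -u "$REG_USER" -p "$REG_PASS"'  // placeholder registry and vars
    sh 'apk add --no-cache kubectl'                                       // takes 3-5 minutes at times
}
```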

Troubleshooting done so far:

  1. Confirmed there is no packet loss between the two servers.
  2. Initially suspected the default resource allocation was the reason for the slowness, but even after giving the jnlp container more resources in the pod template, there is no noticeable change in behaviour.

Please let me know what else I can try, and whether you need any more information.