Kubernetes plugin: retries not working?

I’m using the Kubernetes plugin in a declarative pipeline. Below is a trimmed snippet with the (I assume) irrelevant parts removed:

stage('Run Simulation') {
	agent {
		kubernetes {
			cloud 'tools-openshift'
			yaml '''
				apiVersion: v1
				kind: Pod
				...
				'''
			retries 2
		}
	}
	steps {
		...
	}
}

Every now and then (~2% of runs) my Pod fails to connect:

09:27:26  Created Pod: tools-openshift cluster/platform-446-5ds4s-9wftj-88kxc
09:27:31  cluster/platform-446-5ds4s-9wftj-88kxc Container jnlp was terminated (Exit Code: 1, Reason: Error)
09:27:31  
09:27:31  - jnlp -- terminated (1)
09:27:31  -----Logs-------------
09:27:31  Mar 28, 2023 7:27:30 AM hudson.remoting.jnlp.Main$CuiListener error
09:27:31  SEVERE: Failed to connect to https://company.com/jenkins/tcpSlaveAgentListener/: Connection refused (Connection refused)
09:27:31  java.io.IOException: Failed to connect to https://company.com/jenkins/tcpSlaveAgentListener/: Connection refused (Connection refused)
09:27:31  	at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:216)
09:27:31  	at hudson.remoting.Engine.innerRun(Engine.java:755)
09:27:31  	at hudson.remoting.Engine.run(Engine.java:543)
09:27:31  Caused by: java.net.ConnectException: Connection refused (Connection refused)
09:27:31  	at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
09:27:31  	at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:412)
09:27:31  	at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:255)
09:27:31  	at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:237)
09:27:31  	at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
09:27:31  	at java.base/java.net.Socket.connect(Socket.java:609)
09:27:31  	at java.base/sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:305)
09:27:31  	at java.base/sun.net.NetworkClient.doConnect(NetworkClient.java:177)
09:27:31  	at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:507)
09:27:31  	at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:602)
09:27:31  	at java.base/sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:266)
09:27:31  	at java.base/sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:373)
09:27:31  	at java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:207)
09:27:31  	at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1187)
09:27:31  	at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1081)
09:27:31  	at java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:193)
09:27:31  	at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:168)
09:27:31  	at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:213)
09:27:31  	... 2 more
09:27:31  
09:27:31  2023/03/28 07:27:30 [go-init] Main command failed
09:27:31  2023/03/28 07:27:30 [go-init] exit status 255
09:27:31  2023/03/28 07:27:30 [go-init] No post-stop command defined, skip
09:27:31  
09:27:31  cluster/platform-446-5ds4s-9wftj-88kxc Pod just failed (Reason: null, Message: null)
[Pipeline] // node
09:27:31  
09:27:31  - jnlp -- terminated (1)
[Pipeline] }
09:27:31  Could not find a node block associated with node (source of error)

The exact error is not the point; sometimes I get other failures due to network fluctuation as well. The bottom line is that the client fails to connect.

I’ve set `retries` in the pipeline, yet no retry happens after this failure at all.
Shouldn’t this type of error trigger the retry? If not, any recommendation?

My last resort would be to switch to a script block with a podTemplate and a generic retry around the node (so, no declarative), catching only node-related failures. But that feels like too much of a workaround… Any ideas welcome. Thanks!
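For the record, this is roughly the scripted fallback I have in mind. It is only a sketch: it assumes the Kubernetes plugin’s `podTemplate` step and a `retry` step that accepts `conditions` with `kubernetesAgent()` / `nonresumable()` are available on the controller (the conditions need a reasonably recent Kubernetes plugin):

```groovy
// Sketch only: assumes podTemplate and the retry conditions below exist
// on this controller; cloud name and YAML are from my declarative snippet.
podTemplate(cloud: 'tools-openshift', yaml: '''
    apiVersion: v1
    kind: Pod
    ...
''') {
    // Retry only when the agent pod itself is lost, not on ordinary
    // build failures inside the node block.
    retry(count: 2, conditions: [kubernetesAgent(), nonresumable()]) {
        node(POD_LABEL) {
            // ... simulation steps ...
        }
    }
}
```

The `kubernetesAgent()` condition should restrict the retry to agent-related terminations, which would cover the “catch only node-related issues” part without retrying genuine test failures.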


Same for me; retry doesn’t seem to work.

jenkins/xx-xx-it-xx-1284-jenkins-retry-1-xftbf-z7w9v-1tlcr Container docker was terminated (Exit Code: 137, Reason: Error)
jenkins/xx-xx-it-xx-1284-jenkins-retry-1-xftbf-z7w9v-1tlcr Container docker-daemon was terminated (Exit Code: 0, Reason: Completed)
jenkins/xx-xx-it-xx-1284-jenkins-retry-1-xftbf-z7w9v-1tlcr Container jnlp was terminated (Exit Code: 137, Reason: Error)

  • docker – terminated (137)
    -----Logs-------------
    unable to retrieve container logs for containerd://6948e1c2c28178839116ab3a9d5575ce38a8055031105a766d0b977105e50076

  • docker-daemon – terminated (0)
    -----Logs-------------
    unable to retrieve container logs for containerd://bf49fee66674896780b934576f4b3990e237a73d2cd4233be50a38e668147941

  • jnlp – terminated (137)
    -----Logs-------------
    unable to retrieve container logs for containerd://aadc9e6dd997f04b7ab9784bc5a13519d9f8b5ed7a72dbbee84ca3c480b0c6a9
    jenkins/xx-xx-it-xx-1284-jenkins-retry-1-xftbf-z7w9v-1tlcr Pod just failed (Reason: null, Message: null)

  • docker – terminated (137)
    -----Logs-------------
    unable to retrieve container logs for containerd://6948e1c2c28178839116ab3a9d5575ce38a8055031105a766d0b977105e50076

  • docker-daemon – terminated (0)
    -----Logs-------------
    unable to retrieve container logs for containerd://bf49fee66674896780b934576f4b3990e237a73d2cd4233be50a38e668147941

  • jnlp – terminated (137)
    -----Logs-------------
    unable to retrieve container logs for containerd://aadc9e6dd997f04b7ab9784bc5a13519d9f8b5ed7a72dbbee84ca3c480b0c6a9
    ERROR: Failed to launch xx-xx-it-xx-1284-jenkins-retry-1-xftbf-z7w9v-jcz8z
    io.fabric8.kubernetes.client.KubernetesClientTimeoutException: Timed out waiting for [1000000] milliseconds for [Pod] with name:[xx-xx-it-xx-1284-jenkins-retry-1-xftbf-z7w9v-jcz8z] in namespace [jenkins].
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilCondition(BaseOperation.java:889)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilReady(BaseOperation.java:871)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilReady(BaseOperation.java:92)
    at org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher.launch(KubernetesLauncher.java:170)
    at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:297)
    at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
    at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
    Agent xx-xx-it-xx-1284-jenkins-retry-1-xftbf-z7w9v-1tlcr was deleted; cancelling node body
    [Pipeline] }
    [Pipeline] // script
    [Pipeline] }
    [Pipeline] // stage
    [Pipeline] }
    [Pipeline] // withEnv
    [Pipeline] }
    [Pipeline] // node
    [Pipeline] }
    Ignored termination reason(s) for xx-xx-it-xx-1284-jenkins-retry-1-xftbf-z7w9v-1tlcr for purposes of retry: [Completed, Error]

Any idea?

Did anyone find a workaround for this?

I have the same issue; retry only worked for a few specific jobs.