Kubernetes plugin: retries not working?

I’m using the Kubernetes plugin in a declarative pipeline. Below is a trimmed snippet with the (I assume) irrelevant parts removed:

stage('Run Simulation') {
	agent {
		kubernetes {
			cloud 'tools-openshift'
			yaml '''
				apiVersion: v1
				kind: Pod
				...
				'''
			retries 2
		}
	}
	steps {
		...
	}
}

Every now and then (~2% of runs) my Pod fails to connect:

09:27:26  Created Pod: tools-openshift cluster/platform-446-5ds4s-9wftj-88kxc
09:27:31  cluster/platform-446-5ds4s-9wftj-88kxc Container jnlp was terminated (Exit Code: 1, Reason: Error)
09:27:31  
09:27:31  - jnlp -- terminated (1)
09:27:31  -----Logs-------------
09:27:31  Mar 28, 2023 7:27:30 AM hudson.remoting.jnlp.Main$CuiListener error
09:27:31  SEVERE: Failed to connect to https://company.com/jenkins/tcpSlaveAgentListener/: Connection refused (Connection refused)
09:27:31  java.io.IOException: Failed to connect to https://company.com/jenkins/tcpSlaveAgentListener/: Connection refused (Connection refused)
09:27:31  	at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:216)
09:27:31  	at hudson.remoting.Engine.innerRun(Engine.java:755)
09:27:31  	at hudson.remoting.Engine.run(Engine.java:543)
09:27:31  Caused by: java.net.ConnectException: Connection refused (Connection refused)
09:27:31  	at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
09:27:31  	at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:412)
09:27:31  	at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:255)
09:27:31  	at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:237)
09:27:31  	at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
09:27:31  	at java.base/java.net.Socket.connect(Socket.java:609)
09:27:31  	at java.base/sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:305)
09:27:31  	at java.base/sun.net.NetworkClient.doConnect(NetworkClient.java:177)
09:27:31  	at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:507)
09:27:31  	at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:602)
09:27:31  	at java.base/sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:266)
09:27:31  	at java.base/sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:373)
09:27:31  	at java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:207)
09:27:31  	at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1187)
09:27:31  	at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1081)
09:27:31  	at java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:193)
09:27:31  	at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:168)
09:27:31  	at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:213)
09:27:31  	... 2 more
09:27:31  
09:27:31  2023/03/28 07:27:30 [go-init] Main command failed
09:27:31  2023/03/28 07:27:30 [go-init] exit status 255
09:27:31  2023/03/28 07:27:30 [go-init] No post-stop command defined, skip
09:27:31  
09:27:31  cluster/platform-446-5ds4s-9wftj-88kxc Pod just failed (Reason: null, Message: null)
[Pipeline] // node
09:27:31  
09:27:31  - jnlp -- terminated (1)
[Pipeline] }
09:27:31  Could not find a node block associated with node (source of error)

The exact error is not the point; sometimes I get other failures due to network fluctuation as well. The bottom line is that the client fails to connect.

I’ve set `retries` in the pipeline, yet no retry happens after this failure at all.
Shouldn’t this type of error trigger the retry? If not, any recommendation?

My last resort would be to switch to a script block with a podTemplate and a generic retry around the node (so, no declarative), catching only node-related failures. But that feels like too much of a workaround… Any ideas welcome. Thanks!
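For the record, this is roughly the scripted fallback I have in mind. It is only a sketch: it assumes the Kubernetes plugin’s `podTemplate` step and a `retry` step that accepts `conditions` with `kubernetesAgent()` / `nonresumable()` are available on the controller (the conditions need a reasonably recent Kubernetes plugin):

```groovy
// Sketch only: assumes podTemplate and the retry conditions below exist
// on this controller; cloud name and YAML are from my declarative snippet.
podTemplate(cloud: 'tools-openshift', yaml: '''
    apiVersion: v1
    kind: Pod
    ...
''') {
    // Retry only when the agent pod itself is lost, not on ordinary
    // build failures inside the node block.
    retry(count: 2, conditions: [kubernetesAgent(), nonresumable()]) {
        node(POD_LABEL) {
            // ... simulation steps ...
        }
    }
}
```

The `kubernetesAgent()` condition should restrict the retry to agent-related terminations, which would cover the “catch only node-related issues” part without retrying genuine test failures.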


Same for me; retry doesn’t seem to work.

jenkins/xx-xx-it-xx-1284-jenkins-retry-1-xftbf-z7w9v-1tlcr Container docker was terminated (Exit Code: 137, Reason: Error)
jenkins/xx-xx-it-xx-1284-jenkins-retry-1-xftbf-z7w9v-1tlcr Container docker-daemon was terminated (Exit Code: 0, Reason: Completed)
jenkins/xx-xx-it-xx-1284-jenkins-retry-1-xftbf-z7w9v-1tlcr Container jnlp was terminated (Exit Code: 137, Reason: Error)

  • docker – terminated (137)
    -----Logs-------------
    unable to retrieve container logs for containerd://6948e1c2c28178839116ab3a9d5575ce38a8055031105a766d0b977105e50076

  • docker-daemon – terminated (0)
    -----Logs-------------
    unable to retrieve container logs for containerd://bf49fee66674896780b934576f4b3990e237a73d2cd4233be50a38e668147941

  • jnlp – terminated (137)
    -----Logs-------------
    unable to retrieve container logs for containerd://aadc9e6dd997f04b7ab9784bc5a13519d9f8b5ed7a72dbbee84ca3c480b0c6a9
    jenkins/xx-xx-it-xx-1284-jenkins-retry-1-xftbf-z7w9v-1tlcr Pod just failed (Reason: null, Message: null)

  • docker – terminated (137)
    -----Logs-------------
    unable to retrieve container logs for containerd://6948e1c2c28178839116ab3a9d5575ce38a8055031105a766d0b977105e50076

  • docker-daemon – terminated (0)
    -----Logs-------------
    unable to retrieve container logs for containerd://bf49fee66674896780b934576f4b3990e237a73d2cd4233be50a38e668147941

  • jnlp – terminated (137)
    -----Logs-------------
    unable to retrieve container logs for containerd://aadc9e6dd997f04b7ab9784bc5a13519d9f8b5ed7a72dbbee84ca3c480b0c6a9
    ERROR: Failed to launch xx-xx-it-xx-1284-jenkins-retry-1-xftbf-z7w9v-jcz8z
    io.fabric8.kubernetes.client.KubernetesClientTimeoutException: Timed out waiting for [1000000] milliseconds for [Pod] with name:[xx-xx-it-xx-1284-jenkins-retry-1-xftbf-z7w9v-jcz8z] in namespace [jenkins].
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilCondition(BaseOperation.java:889)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilReady(BaseOperation.java:871)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.waitUntilReady(BaseOperation.java:92)
    at org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher.launch(KubernetesLauncher.java:170)
    at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:297)
    at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
    at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
    Agent xx-xx-it-xx-1284-jenkins-retry-1-xftbf-z7w9v-1tlcr was deleted; cancelling node body
    [Pipeline] }
    [Pipeline] // script
    [Pipeline] }
    [Pipeline] // stage
    [Pipeline] }
    [Pipeline] // withEnv
    [Pipeline] }
    [Pipeline] // node
    [Pipeline] }
    Ignored termination reason(s) for xx-xx-it-xx-1284-jenkins-retry-1-xftbf-z7w9v-1tlcr for purposes of retry: [Completed, Error]

Any idea?

Did anyone find a workaround for this?

I have the same issue; retry only worked for a few specific jobs.