I’m using the Kubernetes plugin in a declarative pipeline. Below is a stripped-down snippet, with the parts I assume are irrelevant removed:
stage('Run Simulation') {
    agent {
        kubernetes {
            cloud 'tools-openshift'
            yaml '''
                apiVersion: v1
                kind: Pod
                ...
            '''
            retries 2
        }
    }
    steps {
        ...
    }
}
Every now and then (in roughly 2% of runs) the Pod fails to connect to the controller:
09:27:26 Created Pod: tools-openshift cluster/platform-446-5ds4s-9wftj-88kxc
09:27:31 cluster/platform-446-5ds4s-9wftj-88kxc Container jnlp was terminated (Exit Code: 1, Reason: Error)
09:27:31
09:27:31 - jnlp -- terminated (1)
09:27:31 -----Logs-------------
09:27:31 Mar 28, 2023 7:27:30 AM hudson.remoting.jnlp.Main$CuiListener error
09:27:31 SEVERE: Failed to connect to https://company.com/jenkins/tcpSlaveAgentListener/: Connection refused (Connection refused)
09:27:31 java.io.IOException: Failed to connect to https://company.com/jenkins/tcpSlaveAgentListener/: Connection refused (Connection refused)
09:27:31 at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:216)
09:27:31 at hudson.remoting.Engine.innerRun(Engine.java:755)
09:27:31 at hudson.remoting.Engine.run(Engine.java:543)
09:27:31 Caused by: java.net.ConnectException: Connection refused (Connection refused)
09:27:31 at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
09:27:31 at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:412)
09:27:31 at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:255)
09:27:31 at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:237)
09:27:31 at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
09:27:31 at java.base/java.net.Socket.connect(Socket.java:609)
09:27:31 at java.base/sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:305)
09:27:31 at java.base/sun.net.NetworkClient.doConnect(NetworkClient.java:177)
09:27:31 at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:507)
09:27:31 at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:602)
09:27:31 at java.base/sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:266)
09:27:31 at java.base/sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:373)
09:27:31 at java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:207)
09:27:31 at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1187)
09:27:31 at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1081)
09:27:31 at java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:193)
09:27:31 at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:168)
09:27:31 at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:213)
09:27:31 ... 2 more
09:27:31
09:27:31 2023/03/28 07:27:30 [go-init] Main command failed
09:27:31 2023/03/28 07:27:30 [go-init] exit status 255
09:27:31 2023/03/28 07:27:30 [go-init] No post-stop command defined, skip
09:27:31
09:27:31 cluster/platform-446-5ds4s-9wftj-88kxc Pod just failed (Reason: null, Message: null)
[Pipeline] // node
09:27:31
09:27:31 - jnlp -- terminated (1)
[Pipeline] }
09:27:31 Could not find a node block associated with node (source of error)
The exact error is irrelevant, because I sometimes get other failures caused by network fluctuations. The bottom line is that the agent fails to connect.
I’ve set retries in the pipeline, yet no retry happens after this failure at all.
Doesn’t this type of error trigger the retry? If not, any recommendation?
My last resort would be to use a script block with a podTemplate instead, with a generic retry around the node (so, not declarative), and to try to catch only node-related issues. But that feels too much like a workaround… Any ideas welcome. Thanks!
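For reference, a rough sketch of that last-resort workaround. It assumes the installed Pipeline: Basic Steps and Kubernetes plugin versions support the conditions parameter of retry and the kubernetesAgent() / nonresumable() conditions. Whether those conditions actually match the connection failure shown above is not verified here, and the elided parts are left elided as in the original snippet.

```groovy
// Sketch only: scripted equivalent of the declarative stage above.
// Assumption: retry(conditions: ...) and kubernetesAgent() are available
// in the installed plugin versions.
stage('Run Simulation') {
    // Retry only on agent-related failures, not on ordinary build errors.
    retry(count: 2, conditions: [kubernetesAgent(), nonresumable()]) {
        podTemplate(cloud: 'tools-openshift', yaml: '''
            apiVersion: v1
            kind: Pod
            ...
        ''') {
            // POD_LABEL is set by podTemplate and selects the pod it created.
            node(POD_LABEL) {
                // ... same steps as in the declarative stage ...
            }
        }
    }
}
```

Inside a declarative pipeline, the same retry block could go in a script block of a stage that does not declare its own agent. Each retry attempt re-enters podTemplate and therefore requests a fresh Pod.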