Hi,
I am a newbie to Jenkins. I am trying to make the Jenkins master highly available, i.e. run it in Active/Standby mode, in a GKE cluster.
I have a load balancer in front of 2 Jenkins masters (run as a deployment) that share a persistent file system across nodes, and I am running a pipeline on a slave connected to one of the masters (say master1).
While the pipeline is running, I delete the master1 pod; the pipeline execution is paused and then it fails. Ideally, the pipeline should continue by re-establishing the connection with master2, i.e. it should be highly available.
When I deleted the master1 pod, the following error appeared in the console output of the pipeline executing on the slave connected to master1:
Waiting to resume part of slave1-pipeline #5: ‘slave1-jvz6z’ is offline
Waiting to resume part of slave1-pipeline #5
Resuming build at Tue Jan 25 12:16:12 UTC 2022 after Jenkins restart
Waiting to resume part of slave1-pipeline #5: Finished waiting
Waiting to resume part of slave1-pipeline #5: ‘slave1-jvz6z’ is offline
Waiting to resume part of slave1-pipeline #5: ‘slave1-jvz6z’ is offline
Waiting to resume part of slave1-pipeline #5: ‘slave1-jvz6z’ is offline
ees.groovy.cps.impl.SequenceBlock$ContinuationImpl.exp2
in object com.cloudbees.groovy.cps.impl.SequenceBlock$ContinuationImpl@5c43a50 …………………
Caused: java.io.IOException: Stale file handle
at java.base/java.io.FileInputStream.available0(Native Method)
at java.base/java.io.FileInputStream.available(FileInputStream.java:330) ………………………
Caused: java.io.IOException: Failed to load build state
at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$3.onSuccess(CpsFlowExecution.java:865) …………………………….
Finished: FAILURE
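One way to check whether the "Stale file handle" error comes from the shared volume itself rather than from Jenkins would be a small probe run from inside the master2 pod. This is only a sketch under assumptions: the function name `probe_shared_file` is hypothetical, and it relies on "Stale file handle" corresponding to the `ESTALE` errno on Linux.

```python
import errno


def probe_shared_file(path):
    """Try to read one byte from `path` and classify the outcome.

    Returns "ok", "stale" (ESTALE, i.e. "Stale file handle"),
    "missing", or "error:<errno name>" for any other OS error.
    """
    try:
        with open(path, "rb") as handle:
            handle.read(1)
        return "ok"
    except FileNotFoundError:
        return "missing"
    except OSError as exc:
        # ESTALE is the errno behind "Stale file handle" on Linux.
        if exc.errno == errno.ESTALE:
            return "stale"
        return "error:" + errno.errorcode.get(exc.errno, str(exc.errno))
```

It could be called with the path of a build file on the shared volume, for example a file under the Jenkins home directory of the failed build (the exact path depends on the setup and is not shown here). A "stale" result would suggest the problem is at the file system level.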
So, my questions are:
Why did the pipeline fail? I expected it to continue and finish. It looks like master1 (Active) is not checkpointing/syncing the pipeline state to master2 (Standby).
Can you please point me to the documentation/configuration for making one master Active and another master Standby? I didn’t find any documentation related to it.
PS: Just to mention, we also experimented with one master and one slave, and below is our observation:
We were running a pipeline on the slave, and during its execution we killed the master pod. The pipeline execution paused and then resumed when the master pod came back after the restart. This is a kind of high-availability behaviour.
Resuming build at Fri Jan 21 09:29:21 UTC 2022 after Jenkins restart
Waiting to resume part of pipeline-slave1 #3: ‘slave1-xdnjc’ is offline
Waiting to resume part of pipeline-slave1 #3: ‘slave1-xdnjc’ is offline
Waiting to resume part of pipeline-slave1 #3: ‘slave1-xdnjc’ is offline
Waiting to resume part of pipeline-slave1 #3: ‘slave1-xdnjc’ is offline
Waiting to resume part of pipeline-slave1 #3: ‘slave1-xdnjc’ is offline
Waiting to resume part of pipeline-slave1 #3: ‘slave1-xdnjc’ is offline
Waiting to resume part of pipeline-slave1 #3: ‘slave1-xdnjc’ is offline
Waiting to resume part of pipeline-slave1 #3: ‘Jenkins’ doesn’t have label ‘slave1-xdnjc’
‘slave1-xdnjc’ is offline
Ready to run at Fri Jan 21 09:30:16 UTC 2022
Ideally, we expect the same pipeline behaviour with 2 master pods.