Seeking suggestions: Zero-downtime Jenkins build ecosystem during VMware → AWS migration

We are currently planning a migration of our on-prem VMware infrastructure to AWS, and I’m looking for suggestions or best practices specifically around achieving (or getting as close as possible to) zero downtime for Jenkins-based build ecosystems.

Current setup

Our build ecosystem consists of:

  • Jenkins (controller + build agents)

  • GitHub as source control

  • JFrog Artifactory for artifacts

  • Network file shares used by pipelines (build outputs, intermediate artifacts, etc.)

All of this currently runs on VMware (on-prem).

Migration constraints

  • We plan to retain the same machine hostnames and service URLs (Jenkins, JFrog, file shares, agents) after moving to AWS, because changing a URL would require updating every place where it is referenced.

  • Only one instance of a given machine can be active at any time (either VMware or AWS, never both in parallel).

  • Because of this constraint, when any machine/system is migrated (for example, JFrog or File Share), the on-prem instance must be stopped before the AWS instance can come up using the same hostname/URL.

  • This creates a potential gap during which pipelines that depend on that system will be down, plus some time for whatever fixes are needed at the integration points.

Key concerns / questions

I’d love community input on strategies for transitioning the Jenkins ecosystem with zero downtime (for example, whether container creation and restore based strategies are possible, or whether any changes to the migration flow or constraints could achieve it).

When you say all of this runs on VMware (on-prem), does that include GitHub?

When you want to keep things with the same name, that means you have to switch the DNS entries to new IPs. Here you need to consider the TTL of the DNS entries. Ideally you lower it to a small value before you migrate anything, so that you don’t have to wait long for the new IPs to be known everywhere.
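
To make the DNS step concrete, here is a minimal sketch that polls until a name resolves to the new address. The hostname and IPs are made up, and the function only sees what the local resolver returns, so other caches may still hold the old record until its TTL expires.

```python
import socket
import time
from typing import Callable


def wait_for_dns(hostname: str, expected_ip: str,
                 resolve: Callable[[str], str] = socket.gethostbyname,
                 interval_s: float = 5.0, timeout_s: float = 600.0,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until hostname resolves to expected_ip; False if the timeout elapses."""
    waited = 0.0
    while True:
        try:
            if resolve(hostname) == expected_ip:
                return True
        except OSError:
            pass  # name may be briefly unresolvable while the record is switched
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s


# Hypothetical usage: block the cutover script until the new AWS IP is visible.
# wait_for_dns("jenkins.example.internal", "172.31.0.9")
```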

For the migration of agents and JFrog you can achieve that without interruption. For Jenkins agents I would assume it doesn’t matter if they have the same name or not, so you could just attach the new machines and assign them the correct labels. Then take the old agent temporarily offline, wait until no more jobs are running on it, and delete it. If you need to keep the same agent name, take the agent offline in Jenkins, wait for all jobs to finish, shut down the old machine (this will disconnect it), and adjust DNS. Then take the agent online in Jenkins again (it should connect automatically once the DNS change is known).
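
The "take offline, wait until idle" part for an agent could be scripted roughly like this. It is only a sketch: it assumes the agent's /computer/<name>/api/json output carries the `idle` and `temporarilyOffline` fields, and `fetch` is a hypothetical callable you supply to do the authenticated GET and return the parsed JSON.

```python
import time
from typing import Callable


def agent_drained(computer: dict) -> bool:
    """True once the agent is marked temporarily offline and runs no builds."""
    return bool(computer.get("temporarilyOffline")) and bool(computer.get("idle"))


def wait_until_drained(fetch: Callable[[str], dict], agent_url: str,
                       poll_s: float = 30.0, max_polls: int = 120,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll the agent's JSON API until it is drained; False if it never is."""
    for _ in range(max_polls):
        if agent_drained(fetch(f"{agent_url}/api/json")):
            return True
        sleep(poll_s)
    return False


# Hypothetical usage, with my_fetch doing the authenticated request:
# wait_until_drained(my_fetch, "https://jenkins.example.internal/computer/agent-1")
```

Only once this returns True would you shut down or delete the old machine.
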
For JFrog, put Jenkins as a whole into shutdown mode. That will stop basically everything from running, although steps in pipelines that are currently running, e.g. sh steps, will still finish. Once the sh step has finished, the pipeline will pause as well. You can also use the lenient-shutdown plugin (configure it so that it waits for all jobs to finish). After enabling lenient shutdown, any new jobs will wait in the queue. Once all running jobs have finished, you can switch Artifactory to the new host.
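
If you want to script this, a rough sketch for the plain shutdown (quiet-down) mode follows. It builds the POST to Jenkins' /quietDown endpoint using basic auth with an API token, and checks the `busyExecutors` count from /computer/api/json. The URL, user and token are placeholders. The lenient-shutdown plugin has its own controls, which are not shown here, and in plain quiet-down a paused pipeline may still hold an executor, so the count may not reach zero.

```python
import base64
import urllib.request


def quiet_down_request(base_url: str, user: str, api_token: str) -> urllib.request.Request:
    """Build (but do not send) the POST that puts Jenkins into quiet-down mode."""
    credentials = base64.b64encode(f"{user}:{api_token}".encode()).decode()
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/quietDown",
        method="POST",
        headers={"Authorization": f"Basic {credentials}"},
    )


def all_builds_finished(computer_api: dict) -> bool:
    """True when the JSON from <jenkins>/computer/api/json shows no busy executor."""
    return computer_api.get("busyExecutors") == 0


# Hypothetical usage:
# req = quiet_down_request("https://jenkins.example.internal", "admin", "<api token>")
# urllib.request.urlopen(req)  # sends the request
```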

For Jenkins itself there is no way to do this with zero downtime (unless you have CloudBees Jenkins, which offers an HA solution afaik). Here again use lenient shutdown and wait until all jobs have finished. Then stop Jenkins, copy JENKINS_HOME to the new host, change DNS, and start Jenkins on the new machine. Jenkins should pick up the queue and continue. What you might lose are events sent from GitHub to Jenkins while Jenkins was down; afaik GitHub does not retry, on its own, events that couldn’t be delivered. To reduce the downtime, prepare the copy of JENKINS_HOME upfront, so that at the time of the real switch you only have to copy the delta of changes and not everything.
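
One way to do the "copy upfront, then only the delta" part is rsync, run once while Jenkins is still up and again after it has been stopped. This is a sketch under assumptions: rsync is available on both sides, and the paths and target host are made up.

```python
import subprocess
from typing import List


def rsync_command(jenkins_home: str, target: str, dry_run: bool = False) -> List[str]:
    """Build an rsync call that mirrors JENKINS_HOME to the target.

    The trailing slash on the source copies the directory's contents;
    --delete removes files on the target that no longer exist at the source.
    """
    src = jenkins_home.rstrip("/") + "/"
    cmd = ["rsync", "-a", "--delete"]
    if dry_run:
        cmd.append("--dry-run")
    return cmd + [src, target]


def sync_jenkins_home(jenkins_home: str, target: str) -> None:
    """Run the sync and raise if rsync exits with an error."""
    subprocess.run(rsync_command(jenkins_home, target), check=True)


# Hypothetical usage:
# sync_jenkins_home("/var/lib/jenkins", "new-host:/var/lib/jenkins/")  # days before, Jenkins running
# sync_jenkins_home("/var/lib/jenkins", "new-host:/var/lib/jenkins/")  # after stopping Jenkins, delta only
```

The first run can take as long as it needs. The second one only transfers what changed in between, and that is the part that counts towards downtime.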

I strongly recommend setting up a test landscape where you can play around and test things before applying this to the productive landscape. Take backups where necessary.

If GitHub is hosted by yourself, then moving Jenkins and GitHub at the same time is probably better. First put Jenkins in lenient shutdown and wait until all jobs have finished (so that they have all reported back to GitHub). Then stop GitHub first and then Jenkins. Bring Jenkins online. Have an init.groovy script in Jenkins that puts Jenkins in shutdown mode (not lenient), so that it does not immediately start building but is able to receive events. Then start GitHub.