Builds running/failing before node is ready

I inherited a rather complex Jenkins setup, and now they’re telling me I’m the expert! I have a lot of CI/CD experience with other platforms, but only the most basic knowledge of Jenkins itself.

We’re using Jenkins 2.222.4 and I’d love to upgrade, but a test run of that path was full of tears so for now we’ve stayed on this old version.

We launch build nodes through an AWS ASG (EC2 Fleet), which might be through a plugin; I’m not sure which one. We build an AMI with most of what we need, then run Chef Zero on the nodes again after launching them, at which point they become available to Jenkins.

The new problem we’re seeing on one type of our nodes is that the build seems to be running a few minutes before the node is actually ready.

Here is an example error:

11:05:34 Building remotely on minion-2xl i-abcdef12345678 (main mysql bazel-worker frontend) in workspace /mnt/jenkins/workspace/main/master/bazel-build
11:05:34 [WS-CLEANUP] Deleting project workspace...
11:05:34 [WS-CLEANUP] Deferred wipeout is used...
11:05:34 Running Prebuild steps
11:05:34 [bazel-build] $ /bin/bash /tmp/jenkins1234567890.sh
11:05:34 fatal: not a git repository: '/mnt/jenkins/main.git'
11:05:34 Failed build for hudson.tasks.Shell@3b2ac32e
...

If we wait 3-4 more minutes and run again, it runs fine. The failure returns whenever we spin up new nodes, and lasts until they have also been up for several minutes. Since we launch our nodes through the ASG, we might only run a few builds on a node before it shuts down. So a high percentage of our builds land on “fresh” nodes and fail due to overfreshness.

As far as I understand it (and I don’t really), the job is running a git command against “/mnt/jenkins/main.git” and that path is not quite ready. /mnt/jenkins is set up by the Chef process before we build the AMI; it’s also verified/corrected when Chef runs again after launching the node. The Chef code doesn’t do anything with git in that directory. Forgive my ignorance, but is a link to that git repo something that is set up by Jenkins when it connects to a new node? Or is that something being set up in this particular job? My team uses Bazel and Groovy, which are also new to me, so I’m not too sure what steps the build process is actually executing.
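One way to narrow this down might be to check, at the very start of the job, whether the path already has the basic layout Git expects. The following is only a rough sketch: the helper name `looks_like_git_dir` is hypothetical, it assumes “/mnt/jenkins/main.git” is meant to be a bare repository or mirror, and it checks directory layout only, not whether the repository contents are complete.

```python
import os


def looks_like_git_dir(path):
    """Return True if path has the basic layout of a Git directory.

    Hypothetical diagnostic helper: it only checks for a HEAD file plus
    objects/ and refs/ directories. It does not validate repository contents.
    """
    return (
        os.path.isfile(os.path.join(path, "HEAD"))
        and os.path.isdir(os.path.join(path, "objects"))
        and os.path.isdir(os.path.join(path, "refs"))
    )
```

Logging the result of `looks_like_git_dir("/mnt/jenkins/main.git")` in a prebuild step on a fresh node and again a few minutes later could show whether the directory is simply not populated yet when the first build starts.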

The easiest thing would probably be to delay the node presenting as “ready” to the Jenkins server. Is there somewhere to set that? Or is there a check that presents the node as ready and is missing something?

Any other advice is welcome. My googles, they have done nothing.

Thanks in advance.


I’m sure these settings are related.

So my hope is that you are using one of the cloud plugins to spin up your agents on demand. There are a number of AWS integrations that should take care of that. I feel like that’s a good place to start looking into why an agent is marked ready before it’s actually ready.

My concern is that, since the setup was inherited, it might not be using the cloud functionality at all, and someone jury-rigged something using a job instead. In that case something would spin up an instance, connect, and then run scripts, but Jenkins thinks the agent is available as soon as it connects.

So assuming the first, and a reasonably recent version of Jenkins, you should be able to find Manage Clouds in your nodes section. Sadly I haven’t been at an instance to take a screenshot, but if you click the link at the top of the second box on the left it should take you there. I think the URL is /computer

Hi, thanks for reaching out!

For clarity, I’ll post the config under Manage Cloud:

Okay! I finally stumbled into the setting that was controlling this!

image

I had failed to expand the “Advanced” area here under Launcher. Once I did, I found that my other nodes had a field called “Prefix Start Agent Command” set, and I am able to configure it to wait for the Chef run to finish.
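For anyone landing here later, the waiting logic itself can be as simple as polling for a marker file. This is a minimal sketch and not the exact command we use: it assumes the Chef run is changed to write a marker file as its last step, and both the function name `wait_for_marker` and any marker path you pass in are hypothetical.

```python
import os
import time


def wait_for_marker(marker_path, timeout_s=600.0, poll_s=5.0):
    """Block until marker_path exists.

    Returns True if the marker appeared, False if timeout_s elapsed first.
    Assumes (hypothetically) that provisioning creates marker_path when done.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(marker_path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

A small script wrapping this could be run from the prefix command ahead of the agent launch, exiting non-zero when the function returns False so that a node whose Chef run never finishes does not start taking builds.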

I can only blame my eyes for passing over that popout section many times.