Run something when Jenkins is idle

I would like to run a system-wide clean-up/checking script but ONLY when all of Jenkins is completely idle, i.e. no nodes are running anything and no jobs are queued, so that my system-wide action cannot upset anything. How would I go about determining that Jenkins is idle?

I’ve seen a suggestion here where one can check that a node isn’t executing but I need to be sure that no nodes are executing and then run something while I am sure that this is still the case. I guess there might be a way of reserving nodes inside the for() loop and then un-reserving them afterwards but this is all getting terribly complex and potentially error prone when all I’m really looking for is an “idle” flag.

Is there such a thing?

Hi.

We had a similar problem. In our case we connect to and clean up all nodes every day at 7 PM. To avoid problems with running jobs and allocated executors, we use ‘lock()’ from the lockable-resources-plugin.
See also lock-nodes.md in the jenkinsci/lockable-resources-plugin repository on GitHub (at commit caacd42fabfe4aaad9cc08cfeae5f9c4f4b42922).

We do it in parallel stages, so when one node is stuck, all the other nodes can still do their tasks. One more tip: use the ‘timeout()’ step before you lock the node in your helper jobs. This prevents the job from getting stuck.
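
For illustration, the pattern described above could look roughly like this in a scripted pipeline. This is a sketch, not our actual job: the agent names and the 15-minute limit are made up, and it assumes each agent is guarded by a lockable resource with the same name as the agent.

```groovy
// Hypothetical agent names; replace with your own.
def agentNames = ['agent-1', 'agent-2']

def branches = [:]
for (name in agentNames) {
    def agentName = name   // capture the per-iteration value for the closure
    branches[agentName] = {
        // The timeout wraps the lock, so a branch that cannot get its lock
        // is aborted instead of waiting forever.
        timeout(time: 15, unit: 'MINUTES') {
            lock(agentName) {
                node(agentName) {
                    echo "Cleaning up ${agentName}"
                    // clean-up steps go here
                }
            }
        }
    }
}

// Each agent gets its own branch, so one stuck agent does not block the rest.
parallel branches
```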

Hi, and thanks for the swift response. I had thought about using the lockable resources plugin, but what to lock? I can’t use a single lock-point for all my nodes because I have a multiply-parallel pipeline that must “run freely”, so they can’t all wait on a single lock-point; in my case the clean-up I’m doing is on the Jenkins master and hence I really do need all nodes to be idle. Borrowing from the first link I gave above and the one you pointed at, I suppose I could call lock() inside the pipeline with the pattern:

node('label1 && label2 && label3') {
    lock("${env.NODE_NAME}") {
        // do stuff
    }
}

…and then in my clean-up script do something along the lines of:

for (Node node in jenkinsNodes) {
    // Make sure node is online
    if (!node.getComputer().isOffline()) {           
        lock.reserve(node.getNodeName())
    }
}

// TODO the clean-up

for (Node node in jenkinsNodes) {
    if (!node.getComputer().isOffline()) {           
        lock.unreserve(node.getNodeName())
    }
}

But this would be kinda duplicating the node() functionality in another way; what I really need is to be “reserving” a node so that it doesn’t get used for other stuff, which is what the other link I gave above does by adding an out_of_service (or removing an in_service) or some such label. Again, kind of a hack when all I need is one simple system flag.

That, and lock doesn’t seem to have a reserve capability, even though the documentation kind of implies (to me) that it has.

Hmmm, or maybe, for my particular case, the clean-up script could wait for each node, take it off-line, do the clean-up on the Jenkins master and then put all of the nodes back on-line again?

EDIT: no, can’t see how to do that since I’d need to put the “off-lining” code for an agent inside a node() {} block (to ensure the agent is locked) and code within a node block isn’t running on the master and so hasn’t got access to the off-lining commands.

“code within a node block isn’t running on the master”

That is not true. Any pipeline code runs on the controller unless the specific implementation of a step does something on an agent. If you use the Jenkins API, it runs on the controller.
The simple script below should work in a pipeline and will take your agent offline.
You might also consider using a freestyle job that runs on the controller.
Note that when an agent is offline you can’t run any jobs on it directly, but you can still use the Jenkins API to execute things; it’s just a bit of Groovy scripting.

import jenkins.model.Jenkins

def jenkins = Jenkins.get()
// getNode() returns null if no agent with this name exists
def agent = jenkins.getNode("myagent")
def computer = agent?.toComputer()
if (computer != null) {
  computer.setTemporarilyOffline(true)
}

Understood but the above will, I guess (?), take the agent off-line even if it is running something, whereas I need it to run only “cooperatively”. I [think] I need to attempt to reserve all nodes: if I can then my clean-up can run and then I can unreserve all nodes; if I cannot reserve all nodes I just unreserve all the nodes that I had reserved.

I’ve found the lockable resources manager API now, which has the reserve part in it, so hopefully I can make something happen with that.

Then ask the computer if it is idle

if (computer != null && computer.isIdle()) {
 computer.setTemporarilyOffline(true)
}
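
The same check can be widened to the whole instance. A minimal sketch, assuming it is run from the script console (a job running on the controller would itself occupy an executor and so never see "idle"); isJenkinsIdle is a hypothetical helper name, and the answer is only a snapshot of the moment it was taken:

```groovy
// Sketch only: true when nothing is queued and no executor on any
// computer (built-in node included) is running a build at this instant.
// `jenkins` is assumed to be the jenkins.model.Jenkins singleton.
boolean isJenkinsIdle(jenkins) {
    return jenkins.queue.isEmpty() && jenkins.computers.every { it.isIdle() }
}

// In the script console:
//   println isJenkinsIdle(jenkins.model.Jenkins.get())
```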

Are you able to define a dedicated maintenance window for all agents?
If so, you could use the agent-maintenance plugin, which I implemented.

I also have a script that will take all agents matching a label expression offline and wait until all agents have finished running.

Very useful indeed! Your first suggestion, the computer.isIdle() check, leads me to ask a question about Jenkins that has been bugging me for some time, and is the reason I’m trying to do all this locking stuff in the solution to this problem: what about race conditions? For instance, would it not be possible for an agent to become non-idle between the idle check and it being taken off-line on the next line of the script?

I had assumed that such a race condition was possible, hence the need to somehow try-reserve/try-lock the agent to prevent it happening. Have I been thinking wrongly?

Your assumption is correct: this could happen, so you need a bit more logic of course.
This is the script, which sits inside a freestyle job running a system Groovy step.
The job has 4 parameters:

  • serverlist - a string parameter; the input can be a comma-separated list of servers or label expressions.
  • Reason - a string parameter; added to the offline cause.
  • “set server state to” - a choice parameter with “offline” and “online” as values; sets the given serverlist to offline or online.

The job should be restricted to master or built-in.

import jenkins.model.Jenkins
import hudson.model.Hudson
import hudson.slaves.OfflineCause
import hudson.model.User
import hudson.model.Node
import hudson.model.Computer
import hudson.console.HyperlinkNote
import java.util.concurrent.*

def hudson = Hudson.getInstance()

class ExecResult
{
    final Computer computer
    final String output
    final int exitCode

    ExecResult(Computer computer, String output, int exitCode)
    {
        this.computer = computer
        this.output = output
        this.exitCode = exitCode
    }
}

class Exec implements Callable<ExecResult>
{
    final Computer computer

    public Exec(Computer computer)
    {
        this.computer = computer
    }

    public ExecResult call()
    {
        def output = ""
        int exitCode = 0
        int count = 0

        if (computer.isOnline())
        {
            output += "[Info] Nothing to do server is already online.\n"
        }
        else
        {
            computer.setTemporarilyOffline(false,null)
            output += "[Info] Set server to online mode.\n"
        }

        // Check if agent is connected to Jenkins

        while (!computer.isOnline() && count < 12)
        {
            computer.connect(true)
            output += "[Info] Server is online but not connected to Jenkins. Sleeping for 10 seconds...\n"
            sleep(10000)
            count++
        }
        if(computer.isOnline())
        {
            output += "[Info] computer is successfully connected to Jenkins.\n"
        }
        else
        {
            output += "[Info] computer cannot get connected back to Jenkins.\n"
            exitCode = 1
        }

        return new ExecResult(computer, output, exitCode)
    }
}

/**  METHOD   **/
String nodeLink(Node node)
{
    def nodeName = node.nodeName
    if ("".equals(nodeName))
    {
        HyperlinkNote.encodeTo("/computer/(master)","master")
    }
    else
    {
        HyperlinkNote.encodeTo("/computer/" + nodeName, nodeName)
    }
}

/**  METHOD   **/
String getNodeUrl( Object node )
{
    return HyperlinkNote.encodeTo('/' + node.getSearchUrl(), node.getDisplayName() )
}

/** INPUT PARAMS  **/
def resolver = build.buildVariableResolver
def serverlist = resolver.resolve("serverlist")
def offline_comment = resolver.resolve("Reason")
def state_request = resolver.resolve("set server state to")

zeroExitNodes = []
nonZeroExitNodes = []
errorNodes = []
cancelledNodes = []

println ("")
println ("Your input was: " + serverlist )
println ("Reason: " + offline_comment)
def nodes = new HashSet()

/** check input line **/
def input_items = serverlist.split(",")
for  ( item in input_items )
{
     item = item.trim()
     /** if label, get all nodes **/
     def item_nodes = hudson.getLabel(item).getNodes()
     if (item_nodes.isEmpty() )
     {
        println ("[Info] Nothing found for label: " + item )
     }
     else
     {
        nodes.addAll(item_nodes)
     }
}

println ("")
println ("[Info] Processing the following servers: ")
println ("----------------------------------------")
for (node in nodes)
{
    println ( getNodeUrl(node) )
}
println ("----------------------------------------")
println ()

/** collect server list **/
def busy_computers=new Vector()
def stopped_computers=new Vector()

ExecutorService executor

try
{
    executor = Executors.newFixedThreadPool(20)
    CompletionService<ExecResult> completionService = new ExecutorCompletionService<ExecResult>(executor);
    def futures = new IdentityHashMap()

    for (node in nodes)
    {
        def computer=node.getComputer()

        /** Set Server State to offline **/
        if ("offline".equals(state_request))
        {
            println ("Taking offline: " + getNodeUrl(node) )
            if (computer.isTemporarilyOffline())
            {
                println ("  Server state is offline, nothing to do.")
                println ("  Offline reason : " + computer.getOfflineCauseReason() )
            }
            else
            {
                def cause = build.getCause(hudson.model.Cause.UserIdCause.class);
                def user
                if(cause == null) {
                    user = hudson.model.User.get("SYSTEM")
                }
                else {
                    user = hudson.model.User.get(cause.getUserId())
                }

                def offline_cause = new OfflineCause.UserCause(user,offline_comment)
                computer.setTemporarilyOffline(true,offline_cause)
                stopped_computers.add(computer)
                println ( "  State offline = " + computer.isOffline())
                println ()
            }
            /** fill list **/
            if (computer.countBusy() > 0 )
            {
                busy_computers.add(computer)
            }
        }

        /** Set Server State to online **/
        if ("online".equals(state_request))  {
            try
            {
                println("Taking online: " + getNodeUrl(node) )
                def future = completionService.submit(new Exec(computer))
                futures[future] = node
            }
            catch (Exception e)
            {
                println("problem occurred: " + e.getMessage())
            }
        }
    }

    if("online".equals(state_request))
    {
        println("command executions scheduled on ${nodes.size()} node(s)")
        println()
    }

    /** wait as long as busy servers remain **/
    if (!busy_computers.isEmpty()) {
        println()
        println("Waiting for busy servers")
        println("------------------------")
        for (computer in busy_computers) {
            println(computer.getDisplayName())
        }
        def printBusy = true

        while (!busy_computers.isEmpty() )
        {
            if (printBusy) {
                 println ()
                 println ("Remaining busy servers: " + busy_computers.size() )
                 printBusy = false
            }
            sleep(10000)
            def remaining_computer = new Vector()
            for (computer in busy_computers )
            {
                if (computer.countBusy() > 0 )
                {
                    remaining_computer.add(computer)
                }
                else
                {
                    println ("Server finished processing: " + computer.getDisplayName())
                    printBusy = true
                }
            }
            busy_computers=remaining_computer
        }
    }

    if (!stopped_computers.isEmpty()) {
        println ("")
        println ("[Info] Computers which were set to offline")
        println ("------------------------------------------")
        for (computer in stopped_computers)
        {
            print(computer.getDisplayName()+",")
        }

        println()
        println ("------------------------------------------")
    }

    // process results
    if(futures)
    {
        lastIntermediateStatusAtIndex = -1
        for (int i = 0; i < nodes.size(); i++)
        {
            def future
            while (!future)
            {
                future = completionService.poll(15, TimeUnit.SECONDS)
                if (!future && i != lastIntermediateStatusAtIndex)
                {
                    lastIntermediateStatusAtIndex = i

                    def unfinished = futures.values().sort(false) {a, b -> a.nodeName <=> b.nodeName}
                    println()
                    println("still running on ${unfinished.size()} node(s) (${unfinished.collect({nodeLink(it)}).join(", ")})")
                    println()
                }
            }

            node = futures[future]
            futures.remove(future)  // keep only unfinished futures

            StringBuilder message = new StringBuilder()
            message << "*** node " << nodeLink(node) << " ***\n"

            try
            {
                ExecResult result = future.get()

                (result.exitCode ? nonZeroExitNodes : zeroExitNodes) << node

                message << "exit code: " << result.exitCode << "\n"

                message << "output:\n"

                message << result.output << "\n"
            }
            catch (CancellationException ex)
            {
                cancelledNodes << node
                message << "cancelled: " << (ex.message ?: "cause unknown") << "\n"
            }
            catch (ExecutionException ex)
            {
                errorNodes << node
                message << "error: " << ex.cause << "\n"
            }

            println()
            println(message)
            println()
        }
    }
}
finally {
  executor?.shutdownNow()
}

if (errorNodes || cancelledNodes || nonZeroExitNodes)
{
    return 1
}
return 0
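
Stripped of the reporting, the offline half of the script comes down to "mark offline first, then wait for the busy count to reach zero". A condensed sketch of that idea follows. It is not the script above: withAgentsDrained, timeoutMs and pollMs are hypothetical names, and `computers` is assumed to hold hudson.model.Computer objects that do not include the node this code runs on.

```groovy
// Sketch only: take the given computers offline first, then wait for running
// builds to drain, run the clean-up, and always restore what was changed.
// Returns true if the clean-up ran, false if the wait timed out.
def withAgentsDrained(Collection computers, long timeoutMs, long pollMs, Closure cleanup) {
    def takenOffline = []
    try {
        for (c in computers) {
            // Leave computers alone that somebody else already took offline.
            if (!c.isTemporarilyOffline()) {
                c.setTemporarilyOffline(true, null)
                takenOffline << c
            }
        }
        // Offline computers accept no new builds, so the busy count can only fall.
        long deadline = System.currentTimeMillis() + timeoutMs
        while (computers.any { it.countBusy() > 0 }) {
            if (System.currentTimeMillis() >= deadline) {
                return false
            }
            sleep(pollMs)
        }
        cleanup()
        return true
    } finally {
        // Bring back only the computers this call took offline.
        takenOffline.each { it.setTemporarilyOffline(false, null) }
    }
}
```

Because the offline flag is set before the busy check, a build that slips in between "check" and "offline" is no longer possible; the price is that builds already running must be waited for, hence the timeout.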

Coo, that is good. Lots of logic :-). Many thanks for this, let me see if I can make what I need from it.