process apparently never started in /some_workspace/test@tmp/durable-z79a0e86
After turning on the diagnostic flag, I see one extra log line:
[2024-10-11T08:11:39.648Z] touch: cannot touch '/some_workspace/test@tmp/durable-z79a0e86/test@tmp/durable-f64d9d00/jenkins-log.txt': No such file or directory
[2024-10-11T08:13:54.978Z] process apparently never started in /some_workspace/test@tmp/durable-z79a0e86
I have some questions:
Can I get some help to understand how the durable task plugin works in general?
For example: how is it executed on the remote swarm-client, how does it start the script and keep monitoring it, what is the purpose of jenkins-log.txt, and how does it relate to the Jenkins timeout?
Does the log above mean there were issues accessing the underlying file system?
The first thing the durable task does is create 2 scripts on the remote server in the mentioned temp directory. One script is the content of the sh step that you have defined in your pipeline.
The other script is the actual starter script. It starts the first script and redirects its output to jenkins-log.txt. In parallel, it touches the log file every second or so, so that Jenkins knows the script is still running. This starter script is decoupled from the Java process, so if the agent's Java process dies, Jenkins can recover and continue monitoring the sh step's execution.
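To make that concrete, here is a rough sketch of the idea, not the plugin's real wrapper. The temp directory, the script.sh name, the jenkins-result.txt file used to hand back the exit code, and the 1 second interval are all assumptions made for this illustration. Only jenkins-log.txt and the touch heartbeat come from the description above.

```shell
#!/bin/sh
# Simplified illustration of the starter-script idea. NOT the plugin's actual code.
# Hypothetical names: control_dir, script.sh, jenkins-result.txt.

control_dir=$(mktemp -d)

# Stand-in for the script generated from the pipeline's sh step.
cat > "$control_dir/script.sh" <<'EOF'
echo "hello from the sh step"
sleep 1
exit 3
EOF

# The "starter": runs in the background, detached from the caller's control flow.
(
  cd "$control_dir" || exit 1

  # Heartbeat: keep touching the log until a result has been recorded.
  while [ -d "$control_dir" ] && [ ! -f jenkins-result.txt ]; do
    touch jenkins-log.txt
    sleep 1
  done &

  # Run the user script with all output redirected to the log file.
  sh script.sh > jenkins-log.txt 2>&1

  # Record the exit code atomically: write to a temp name, then rename.
  echo $? > jenkins-result.txt.tmp
  mv jenkins-result.txt.tmp jenkins-result.txt

  wait
) &

# The "watcher" side: poll the control directory until a result shows up.
while [ ! -f "$control_dir/jenkins-result.txt" ]; do
  sleep 1
done

exit_code=$(cat "$control_dir/jenkins-result.txt")
echo "step exit code: $exit_code"
cat "$control_dir/jenkins-log.txt"
```

Because the watcher only looks at files in the directory, whoever does the polling can be restarted and pick up where it left off, which is the point of decoupling the starter from the Java process.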
There is another mode for this that uses a binary wrapper instead of a shell script. From my experience the binary wrapper is more reliable.
You can activate it by running this in the script console: org.jenkinsci.plugins.durabletask.BourneShellScript.USE_BINARY_WRAPPER=true
If you want this to be permanent, run it in an init.groovy script or set it as a Java system property when starting Jenkins: java -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.USE_BINARY_WRAPPER=true ... -jar jenkins.war ...
@mawinter69 Thanks a lot for the detailed response!!
So if the Durable Task plugin is installed on the Jenkins controller, how does the process you described happen on the remote machine (with swarm-client installed)?
I didn't see those scripts when I checked after the job finished. I guess those 2 scripts get cleaned up once the job finishes?
I saw it's started by Launcher.ProcStarter.start(), and I'm guessing this is an async call? Also, could you help me understand what triggers FileMonitoringTask to run periodically to check the script's liveness (I assume this is the watcher)?
Yes, the complete tmp directory gets deleted once the step is finished. The script is called asynchronously here. The launcher object has a reference to the remoting channel (that is, a connection to the Java process that runs on the agent). This allows the command to be executed on the agent. This is independent of how the agent is started, be it a swarm, inbound, or outbound agent.