Random durable task step termination

Dear Community,

In our Jenkins pipelines, durable task steps are terminated seemingly at random, and we cannot work out who or which job terminates them, or why.

We use Jenkins 2.346.3.

On Linux nodes, we use the durable task step sh() to execute long-running shell scripts (builds mostly take between 5 and 20 minutes, tests up to multiple hours). These scripts in turn start multiple threads, either through build parallelization with ninja or through parallel test execution. When a termination occurs, the log reads " [2022-11-28T22:29:18.081Z] Terminated ", the threads are terminated with signal 15, i.e. SIGTERM (we know this because our test suite catches signals and reports them to the log), and the step ends with a hudson.AbortException. We believe that "Terminated" is shell output.
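To make the signal reporting described above concrete, here is a minimal sketch of a shell wrapper that logs a caught signal together with some process context. It is not our actual test suite code: the function names report_signal and run_with_signal_report are made up for illustration, and the exit codes 143 and 130 simply follow the common 128 + signal number convention.

```shell
#!/bin/sh
# Hypothetical diagnostic wrapper; the function names are illustrative only.

# Print one line describing the caught signal and the shell's process context.
report_signal() {
    # $1 is the signal name without the SIG prefix, e.g. TERM
    printf '%s caught SIG%s pid=%s ppid=%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$$" "$PPID"
}

# Install handlers, then run the given command in the background and wait
# for it, so that the trap can fire while the command is still running.
run_with_signal_report() {
    child=""
    # On a signal: report it, pass SIGTERM on to the child, then exit.
    trap 'report_signal TERM; kill -TERM "$child" 2>/dev/null; exit 143' TERM
    trap 'report_signal INT; kill -TERM "$child" 2>/dev/null; exit 130' INT
    "$@" &
    child=$!
    wait "$child"
}

# Example call (the script name is a placeholder):
#   run_with_signal_report ./run_tests.sh
```

Running the command in the background and waiting for it matters here: a shell only runs its trap handler once the current foreground command has returned, so a handler installed around a foreground build would report the signal too late.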

On Windows nodes, a similar (but maybe not identical) event occurs for the durable task step bat(). Here, the log reads " [2022-11-30T15:01:05.915Z] ^CTerminate batch job (Y/N)? [2022-11-30T15:01:05.947Z] ^C " and the task also ends with a hudson.AbortException.

What we have already tried to track down this problem:

  • Add log recorders to:
    • hudson.AbortException with log level ALL. This reports nothing; I looked at the source code and, indeed, nothing is logged there.
    • hudson.model.Executor with log level FINE. For the Windows termination, the following was logged. To me this does not look like a termination through a timeout or an interrupt by another job, since the latter, for example, is logged explicitly:
     2022-11-30 15:00:02.349+0000 [id=127979]        FINE    hudson.model.Executor#run: Executor #0 for winnode4 : executing PlaceholderExecutable:ExecutorStepExecution.PlaceholderTask{runId=pull_request_pipeline/PR-12566#7,label=windows && x64 && build_farm,context=CpsStepContext[317:node]:Owner[pull_request_pipeline/PR-12566/7:pull_request_pipeline/PR-12566 #7],cookie=null,auth=null} is now executing PlaceholderExecutable:ExecutorStepExecution.PlaceholderTask{runId=pull_request_pipeline/PR-12566#7,label=windows && x64 && build_farm,context=CpsStepContext[317:node]:Owner[pull_request_pipeline/PR-12566/7:pull_request_pipeline/PR-12566 #7],cookie=null,auth=null} as UsernamePasswordAuthenticationToken [Principal=SYSTEM, Credentials=[PROTECTED], Authenticated=false, Details=null, Granted Authorities=[]]
     ...
     2022-11-30 15:01:09.313+0000 [id=70]    FINE    hudson.model.Executor#finish1: Executor #0 for winnode4 : executing PlaceholderExecutable:ExecutorStepExecution.PlaceholderTask{runId=pull_request_pipeline/PR-12566#7,label=windows && x64 && build_farm,context=CpsStepContext[317:node]:Owner[pull_request_pipeline/PR-12566/7:pull_request_pipeline/PR-12566 #7],cookie=null,auth=null} completed PlaceholderExecutable:ExecutorStepExecution.PlaceholderTask{runId=pull_request_pipeline/PR-12566#7,label=winnode4,context=CpsStepContext[317:node]:Owner[pull_request_pipeline/PR-12566/7:pull_request_pipeline/PR-12566 #7],cookie=6c72915a-c59c-48ac-8154-79781c7ed4fe,auth=null} in 66,965ms
    • hudson.util.ProcessTree with log level FINEST. This logged nothing at the time of the termination, so I would rule out stage timeouts: those kill processes via the ProcessTreeKiller, which is logged at FINEST.
    • org.jenkinsci.plugins.workflow.steps.timeout with log level ALL. We suspected timeouts at first, but this also logged nothing when the termination occurred.
  • Checking the kernel logs with dmesg to see whether the OOM-killer decided to kill a thread, which might somehow take down the whole process group. No OOM-killer events are logged. Normally, when a thread of our test suite is killed by the OOM-killer, this is logged quite accurately and the test suite worker thread is properly restarted.
  • Checking connection statistics of the Jenkins master and the nodes on which the terminations occurred. Here, too, we did not see anything suspicious such as reconnection attempts. Also, since we now catch and report the hudson.AbortException, the stage continues to run on the same node and executes the following tasks (some of which are also durable task steps).
  • Another change occurred: we changed "Pipeline Speed / Durability" from "None: use pipeline default (MAX_SURVIVABILITY)" to "Performance-optimized: much faster" to see how it affects the load on our master server, because several of the points listed under Scaling Pipelines apply to us. However, the terminations occurred both before and after this change.
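For reference, the dmesg check from the list above looks roughly like the following sketch. The function name filter_oom_events is made up for illustration, and the search patterns are an assumption based on the kernel messages we usually see for OOM kills ("invoked oom-killer", "Out of memory: Killed process ..."); other kernel versions may word them differently.

```shell
#!/bin/sh
# Hypothetical helper: keep only kernel log lines on stdin that look like
# OOM-killer activity. The patterns are assumptions, not an exhaustive list.
filter_oom_events() {
    # -i: ignore case, -E: extended regex with alternation.
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -i -E 'out of memory|oom-killer|killed process' || true
}

# Example call on a node (-T prints human-readable timestamps):
#   dmesg -T | filter_oom_events
```

An empty result around the termination timestamps is what we mean by "no OOM-killer events are logged".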

I found a report online suggesting that CPS might have an internal timeout of 5 minutes ([JENKINS-42561] Users should be able to custom configure the timeout on pipeline build wrappers/steps - Jenkins Jira). However, there they get an InterruptedException instead of an AbortException, so it does not feel 100% applicable. We also do not see the terminations happen after 5 minutes; see the hudson.model.Executor log example above, where the executor finished after about 67 seconds.

Questions:

  1. Do you have any idea what is going on? Any pointer can help!
  2. Do you have suggestions for where we could add more log recorders to narrow this down?

Thank you very much!
