We have cases where pipeline builds get stuck waiting for an executor, especially when the controller is under high load. The build is waiting for a node with certain labels, but it never gets scheduled onto any node. Even more interesting, the build no longer shows up on the main page; it is only visible if you go to the job page. The build log says "Waiting for next available executor on ‘maven&&java11’" and does not proceed any further.
The agents that can provide the requested labels are ECS tasks that are started dynamically. Once you abort the stuck build and start a new one, it usually runs through without any problems.
This is either a problem in the core code itself or in the ECS agents plug-in. The plug-in does not seem to receive any requests for the requested node labels from the Jenkins core (at least no corresponding log messages are printed), so it is more likely an issue in the core code. Any hints on how this can be investigated further?
We are running Jenkins LTS 2.361.1, but the problem already occurred with previous LTS versions as well.
Is there anything in the Jenkins log about failures with ECS? Perhaps something is getting stuck during the provisioning process of the agent?
With ECS there are often such messages, because when no capacity is available the task gets rejected one or more times until capacity for the container becomes available. So this is more or less expected. Or do you mean a problem after the ECS task has been started, while the Jenkins agent process is starting up? Would a failure there lead to this strange state? I would expect the whole process to be retried if an error happens during that phase, not the build vanishing completely from the Jenkins main page.
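One way to narrow this down might be to check whether the queue item for the stuck build still exists at all, using the queue endpoint of the Jenkins remote access API (/queue/api/json). The following is only a rough sketch: the controller URL, user name, and API token are placeholders, the helper names are made up for illustration, and it assumes that the "why" text of the queue item contains the label expression.

```python
import base64
import json
import urllib.request


def fetch_queue(base_url, user, api_token):
    """Fetch the build queue as JSON from <base_url>/queue/api/json."""
    request = urllib.request.Request(base_url.rstrip("/") + "/queue/api/json")
    # Basic authentication with a user name and an API token.
    credentials = base64.b64encode(f"{user}:{api_token}".encode()).decode()
    request.add_header("Authorization", "Basic " + credentials)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def items_waiting_for(queue, label):
    """Return (id, task name, why) for queue items whose reason mentions the label.

    Assumption: the label expression appears verbatim in the "why" text.
    """
    matches = []
    for item in queue.get("items", []):
        why = item.get("why") or ""
        if label in why:
            task_name = (item.get("task") or {}).get("name", "<unknown>")
            matches.append((item.get("id"), task_name, why))
    return matches


def report(base_url, user, api_token, label):
    """Print every queue item that is waiting for the given label."""
    for item_id, name, why in items_waiting_for(fetch_queue(base_url, user, api_token), label):
        print(item_id, name, why)
```

Calling report("https://jenkins.example.com", "someuser", "some-api-token", "maven&&java11") while a build is stuck should show whether a matching item is still in the queue. If nothing is listed although the build log still says it is waiting, that would suggest the queue item was lost on the controller side rather than the ECS provisioning failing.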