Nodes/config.xml not cleaned up after failed provisioning

Hi,

  • Environment:
    Plugin name: kubernetes-plugin
    Plugin tags:
      • 3646.va_b_469a_7666b_7
      • 3651.v908e7db_10d06

Jenkins version: 2.332.2

We observed the following behaviour:

  1. create a new job in Jenkins and specify an invalid pod template (e.g. use more memory than is currently available in your resource quota)
  2. start a new build for this job
  3. Jenkins will create a new /var/jenkins_home/nodes/$pod/config.xml in its filesystem
  4. the Kubernetes API will reject the pod as expected
  5. Jenkins will retry creating the pod every time an executor becomes available (if you have 10 executors, it will retry ten times per second)
  6. now trigger this job multiple times so that the build queue increases (e.g. 50 times)
  7. Jenkins will again create the config.xml files but now it will no longer clean up any failed ../nodes/$pod/config.xml files
  8. after a while, the number of config.xml files from failed builds has increased to an unreasonable number (in our case: 138 thousand)
  9. once you restart Jenkins, it will try to load all nodes from disk, which will not succeed in time (it takes a while to load 138k obsolete configs)

We are severely impacted by this issue: our Jenkins master fails to restart with an OOM error because it tries to load all of the (undeleted) node config.xml files from disk. Every time this happens, we have to clean up the node config.xml files manually to get Jenkins up and running again.
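For reference, the manual cleanup we do can be sketched roughly as follows. This is only a minimal sketch, not anything provided by the plugin: the function names and the `name_prefix` filter are hypothetical, and it assumes that Jenkins is stopped while it runs and that the failed pods' node directories share a recognisable name prefix under `nodes/`. It defaults to a dry run so nothing is deleted by accident.

```python
import shutil
from pathlib import Path


def find_stale_node_dirs(nodes_dir, name_prefix):
    """Return node directories under nodes_dir whose name starts with
    name_prefix and that contain a config.xml file.

    name_prefix is an assumption: it stands for whatever prefix the
    failed pod agents share in your setup.
    """
    root = Path(nodes_dir)
    if not root.is_dir():
        return []
    return sorted(
        d for d in root.iterdir()
        if d.is_dir()
        and d.name.startswith(name_prefix)
        and (d / "config.xml").is_file()
    )


def remove_node_dirs(dirs, dry_run=True):
    """Delete the given node directories and return how many were handled.

    With dry_run=True (the default) nothing is deleted; the return value
    is then the number of directories that would be removed.
    """
    handled = 0
    for d in dirs:
        if not dry_run:
            shutil.rmtree(d)
        handled += 1
    return handled
```

With Jenkins stopped, calling `find_stale_node_dirs("/var/jenkins_home/nodes", "<your pod prefix>")` and passing the result to `remove_node_dirs(..., dry_run=False)` would remove the matching directories. Check the dry-run count first so that permanent agents are not caught by the prefix.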

Does anyone know of a workaround until there is a permanent fix for this issue?


Please report bugs in Jira, if at all possible with complete, self-contained instructions to reproduce the issue from scratch.


Thanks, Jesse.
Of course, I will do that.