GCP custom agents get deleted halfway through the build

Hi,
I am running Jenkins 2.426.2 on GKE. We have a GCP cloud configured in Jenkins to provision dynamic agents during builds. The agents have a node retention time of 3 minutes.
Since we recently upgraded Jenkins to 2.426, about 1 in every 4 to 5 builds gets stuck.
My guess is that the agent somehow stays idle for 3 minutes and is then deleted as configured. We never had this issue before, and all builds ran smoothly. Let me know if you need any more details.

Hi @spoorthyredi2 and welcome to this community. :wave:

Could the issue you’re facing be related to the node retention configuration?

It’s a wild guess, but if a build doesn’t start within the 3-minute window, the agent could be deleted before the build has a chance to run. :thinking:

One possible solution could be to increase the node retention time. This would give your builds more time to start before the agent is deleted. However, it could also lead to higher costs if you’re paying for the time that the agents are running.

I found the fix. The IP address quota for the region had been exceeded, so the agent was never created and Jenkins kept waiting for it to come online. Also, yes, the retention time needs to be set to a value appropriate for the agent launch time. During troubleshooting I set it to zero, which turned out to be a mistake and caused many issues.
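For anyone hitting the same symptom, here is a minimal sketch for spotting an exhausted regional quota before Jenkins ends up waiting on an agent that never appears. It assumes you have saved the output of `gcloud compute regions describe REGION --format=json` (with your own region in place of `REGION`) and that the JSON contains a `quotas` list with `metric`, `limit`, and `usage` fields. The function name, the 90% threshold, and the sample numbers are all made up for illustration and are not values from this thread.

```python
import json


def quotas_near_limit(region, threshold=0.9):
    """Return (metric, usage, limit) for every quota at or above `threshold` of its limit."""
    flagged = []
    for quota in region.get("quotas", []):
        limit = float(quota.get("limit", 0))
        usage = float(quota.get("usage", 0))
        # Skip quotas with a zero limit to avoid dividing by zero.
        if limit > 0 and usage / limit >= threshold:
            flagged.append((quota["metric"], usage, limit))
    return flagged


# Hypothetical sample shaped like the gcloud JSON output; the numbers are invented.
SAMPLE_REGION = json.loads("""
{
  "name": "example-region",
  "quotas": [
    {"metric": "CPUS", "limit": 72.0, "usage": 12.0},
    {"metric": "IN_USE_ADDRESSES", "limit": 8.0, "usage": 8.0}
  ]
}
""")

if __name__ == "__main__":
    for metric, usage, limit in quotas_near_limit(SAMPLE_REGION):
        print(f"{metric}: {usage:.0f} of {limit:.0f} used")
```

With real data you would load the saved file with `json.load(open("region.json"))` instead of the sample and run the check on a schedule or before a large batch of builds.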

@poddingue Today, another build agent was deleted by the plugin. This time the deletion came 10 minutes after launch, while the build was in progress. I am not sure why the plugin is deleting it. The Jenkins logs look fine up to the point where the agent connects to the controller, and then it suddenly reports that the resource was not found.

When an active build is running, why does the plugin consider the agent idle and delete it as per the node retention timeout?

At 13:57:51 the machine was inserted into Compute Engine, and at 14:07:54 the delete request was received.

The machine was in the RUNNING state when Compute Engine received the delete request.


2024-01-19 13:57:52.590+0000 [id=188695]        INFO    c.g.j.p.c.ComputeEngineCloud#log: Launching instance: leap-jen-custom-agent-hxm758
2024-01-19 13:57:52.591+0000 [id=188695]        INFO    c.g.j.p.c.ComputeEngineCloud#log: bootstrap
2024-01-19 13:57:52.591+0000 [id=188695]        INFO    c.g.j.p.c.ComputeEngineCloud#log: Getting private key...
2024-01-19 13:57:52.591+0000 [id=188695]        INFO    c.g.j.p.c.ComputeEngineCloud#log: Using custom ssh private key
2024-01-19 13:57:52.591+0000 [id=188695]        INFO    c.g.j.p.c.ComputeEngineCloud#log: Authenticating as jenkins
2024-01-19 13:57:52.667+0000 [id=188695]        INFO    c.g.j.p.c.ComputeEngineCloud#log: Connecting to 10.100.0.76 on port 22, with timeout 10000.
2024-01-19 13:58:02.668+0000 [id=188695]        INFO    c.g.j.p.c.ComputeEngineCloud#log: Failed to connect via ssh: The kexTimeout (10000 ms) expired.
2024-01-19 13:58:02.669+0000 [id=188695]        INFO    c.g.j.p.c.ComputeEngineCloud#log: Waiting for SSH to come up. Sleeping 5.
2024-01-19 13:58:07.770+0000 [id=188695]        INFO    c.g.j.p.c.ComputeEngineCloud#log: Connecting to 10.100.0.76 on port 22, with timeout 10000.
2024-01-19 13:58:07.965+0000 [id=188695]        INFO    c.g.j.p.c.ComputeEngineCloud#log: Connected via SSH.
2024-01-19 13:58:08.132+0000 [id=188695]        INFO    c.g.j.p.c.ComputeEngineCloud#log: Verifying: java -fullversion
2024-01-19 13:58:08.538+0000 [id=188695]        INFO    c.g.j.p.c.ComputeEngineCloud#log: Copying agent.jar to: /tmp
2024-01-19 13:58:08.662+0000 [id=188695]        INFO    c.g.j.p.c.ComputeEngineCloud#log: Launching Jenkins agent via plugin SSH: java -jar /tmp/agent.jar
2024-01-19 13:58:12.031+0000 [id=188694]        INFO    c.g.j.p.c.ComputeEngineCloud#lambda$getPlannedNodeFuture$0: 39891ms elapsed waiting for node leap-jen-custom-agent-hxm758 to connect
2024-01-19 14:07:54.956+0000 [id=42]    INFO    c.g.j.p.c.CleanLostNodesWork#terminateInstance: Remote instance leap-jen-custom-agent-hxm758 not found locally, removing it
2024-01-19 14:07:55.955+0000 [id=42]    INFO    c.g.j.p.c.CleanLostNodesWork#terminateInstance: Remote instance leap-jen-custom-agent-hxm758 not found locally, removing it
2024-01-19 14:08:26.363+0000 [id=188722]        INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel leap-jen-custom-agent-hxm758
java.io.EOFException
request: {
  @type: "type.googleapis.com/compute.instances.delete"
}
requestMetadata: {
  callerIp: "104.196.163.248"
  callerNetwork: "//compute.googleapis.com/projects/leap-metrics-dev/global/networks/__unknown__"
  callerSuppliedUserAgent: "jenkins-google-compute-plugin Google-HTTP-Java-Client/1.42.2 (gzip),gzip(gfe)"
  destinationAttributes: {

@poddingue Sometimes a node gets deleted within a few seconds, literally 20 seconds after it is created. So I don't think it's the retention time.
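To compare agent lifetimes against the retention setting, a rough sketch like the one below can pull the launch and removal timestamps out of the controller log. It relies only on the two message shapes that appear in the log quoted earlier in this thread (`Launching instance: ...` and `Remote instance ... not found locally, removing it`). The function name and regexes are illustrative, and other plugin versions may word these messages differently. Note that the removal message in that log comes from `CleanLostNodesWork`, which may point away from the retention timeout, although that is only an inference from the log and not something confirmed here.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the controller log, e.g. "2024-01-19 13:57:52.590+0000".
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\+0000\s+(.*)$")
LAUNCH_RE = re.compile(r"Launching instance: (\S+)")
REMOVE_RE = re.compile(r"Remote instance (\S+) not found locally, removing it")


def agent_lifetimes(lines):
    """Map instance name -> seconds between its launch line and its first removal line."""
    launched = {}
    lifetimes = {}
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # stack traces and other lines without a timestamp
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        rest = match.group(2)
        launch = LAUNCH_RE.search(rest)
        removal = REMOVE_RE.search(rest)
        if launch:
            launched[launch.group(1)] = stamp
        elif removal:
            name = removal.group(1)
            # Only the first removal line counts; the log repeats it a second later.
            if name in launched and name not in lifetimes:
                lifetimes[name] = (stamp - launched[name]).total_seconds()
    return lifetimes


# Three lines copied from the log posted in this thread.
SAMPLE_LOG = [
    "2024-01-19 13:57:52.590+0000 [id=188695] INFO c.g.j.p.c.ComputeEngineCloud#log: Launching instance: leap-jen-custom-agent-hxm758",
    "2024-01-19 14:07:54.956+0000 [id=42] INFO c.g.j.p.c.CleanLostNodesWork#terminateInstance: Remote instance leap-jen-custom-agent-hxm758 not found locally, removing it",
    "2024-01-19 14:07:55.955+0000 [id=42] INFO c.g.j.p.c.CleanLostNodesWork#terminateInstance: Remote instance leap-jen-custom-agent-hxm758 not found locally, removing it",
]

if __name__ == "__main__":
    for name, seconds in agent_lifetimes(SAMPLE_LOG).items():
        print(f"{name}: removed {seconds:.1f}s after launch")
```

For the sample lines this reports a lifetime of roughly 602 seconds, which matches the 10 minutes described above. Running it over a full log would show whether the short-lived agents (around 20 seconds) and the long-lived ones are removed by the same code path.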