Jenkins losing connection to Google Cloud Platform worker on long running jobs

Heya,

I have a really strange problem and it’s super hard to debug, since it’s not consistent.

We were using Jenkins with GCP “one shot” VMs. This worked fine for several months. After upgrading Jenkins from 2.440.2 LTS to 2.452.1 LTS (and all plugins, including the GCP/GCE plugin), problems began to arise: jobs occasionally (10-20% of runs are affected, 80-90% are not) lose the connection to the GCP VM while running unit tests. I’m not sure whether it only happens there, but it’s our main use case and the only long-running job we have. Other jobs also run on GCP and don’t have this problem, but they also finish much faster.

(the tests that ran were successful, but the test suite did not complete)
[2024-06-18T06:04:44.494Z] Cannot contact gcp-rre-unittest-debian12-jkt5ag: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@1362ebbd:gcp-rre-unittest-debian12-jkt5ag": Remote call on gcp-rre-unittest-debian12-jkt5ag failed. The channel is closing down or has closed down
[2024-06-18T06:05:02.226Z] Could not connect to gcp-rre-unittest-debian12-jkt5ag to send interrupt signal to process

I tried several things (rolling back the GCP plugin version, upgrading even further, so Jenkins is now on 2.452.2 and all plugins are on their most recent versions), but with no success. The only thing I have not done (yet) is rolling back to the previously working version, since a “yes, it works again” would cause other trouble, like not being able to upgrade.

Since Jenkins is pretty much a combination of several plugins: what may cause this effect? I would think it is most likely something in the GCE/GCP plugin, but since I did not get any response, maybe one of you can pin it down to “this type of stuff is done exclusively inside the cloud plugin(s)” or “xyz is triggering this and that”.

Thank you!

Anyone? :frowning:

I’m just looking for an idea what may cause this.

Hello Jens,

The issue you’re experiencing with intermittent connection losses to GCP VMs during Jenkins jobs, especially after upgrading Jenkins and its plugins, can be challenging to diagnose due to its inconsistent nature.

Let’s try nonetheless to outline an approach to troubleshooting and identifying potential causes:

  1. Make sure the network connection between Jenkins and GCP is stable. Intermittent network issues could cause the described behavior.
  2. Check if there are any GCP resource limits being hit, such as API rate limits or VM quotas, which might affect the ability to maintain connections.
  3. Review the configuration settings for the GCP/GCE plugin to ensure they are correct and optimal for your use case.
  4. Verify the configuration of the Jenkins agents running on GCP VMs, especially any changes related to timeouts or disconnection handling.
  5. Examine Jenkins system logs and job logs for any errors or warnings that occur around the time of the disconnections.
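
For point 1, a small probe run from the Jenkins controller host may help tell intermittent network drops apart from the VM actually being removed. The following is only a minimal sketch: the function names `probe_tcp` and `summarize` are made up for illustration, and the target (for example an agent's internal IP on port 22) is an assumption to adapt.

```python
import socket
import time


def probe_tcp(host, port, timeout=3.0):
    """Return the TCP connect time in seconds, or None if the connect failed."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return None


def summarize(results):
    """Count failed probes and report the slowest successful connect."""
    ok = [r for r in results if r is not None]
    return {
        "probes": len(results),
        "failed": len(results) - len(ok),
        "slowest_s": max(ok) if ok else None,
    }
```

Calling `probe_tcp` every few seconds while a test job runs and passing the collected values to `summarize` would show whether connects fail at all during that window. A clean result does not prove the network is fine, but failures clustered around a disconnect would point away from the plugin.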

Hi Bruno (@poddingue),

thank you so much for your response.

Yes, it’s difficult to diagnose; that’s why I’m asking for ideas and which component is actually “responsible” for tearing down the VMs. Since Jenkins is pretty much a complex puzzle of plugins, I don’t know which plugin might be triggering the destruction of the VM. (I assume it’s the GCP plugin, since I think it’s the only component with access to GCP and its API.) Interestingly enough, it started to happen one day after the upgrade. I have not yet rolled back Jenkins and the plugins, as this is the last and most time-consuming resort, and a confirmation that “it works again” would achieve nothing except leave us with the “choice” between “using an outdated version that may never be updated” and “not using GCP because it is too error-prone”.

About your points:

  1. I think this is fine, but it is also very hard to “prove”. We have no known problems at all communicating with the “outside world”, and how do you test whether “intermittent network issues” happened in this particular case between our network and GCP? We could create a static, regular Jenkins worker on a static VM at GCP (circumventing the GCP plugin), but have hesitated so far due to the costs.
  2. This is not the case. Due to this problem, we have largely stopped using GCP runners for now and use it only for a single job type, while quotas remained the same.
  3. Do you have any idea what to look for? These settings haven’t been changed for quite some time (read: months) and worked before without any issues.
  4. This is also very hard to do, since the agent is provisioned by the GCP plugin.
  5. I did this for the latest error and will add this below.

The Jenkins log (it’s dockerized, times are UTC) says this:

Note:

  • this run crashed very early, which is rare but on the other hand reduces the noise in the logs
  • the dots indicate unrelated lines from a different plugin running in a different job on a different node - I left most of the lines in place when it was not obvious that they are unrelated
{"log":"2024-07-08 10:01:54.329+0000 [id=58]\u0009INFO\u0009c.g.j.p.c.ComputeEngineCloud#provision: Provisioning node from configs [com.google.jenkins.plugins.computeengine.InstanceConfiguration@48d1ea4f] for excess workload of 1 units of label 'cloud-rre-test'\n","stream":"stderr","time":"2024-07-08T10:01:54.329494965Z"}
{"log":"2024-07-08 10:01:55.018+0000 [id=58]\u0009INFO\u0009c.g.j.p.c.ComputeEngineCloud#availableNodeCapacity: Found capacity for 2 nodes in cloud Google Compute Engine (RRE-Unittest)\n","stream":"stderr","time":"2024-07-08T10:01:55.018920863Z"}
{"log":"2024-07-08 10:01:55.018+0000 [id=58]\u0009INFO\u0009c.g.j.p.c.InstanceConfiguration#instance: User selected to use an autogenerated ssh key pair\n","stream":"stderr","time":"2024-07-08T10:01:55.019104005Z"}
{"log":"2024-07-08 10:01:56.550+0000 [id=58]\u0009INFO\u0009c.g.j.p.c.InstanceConfiguration#provision: Sent insert request for instance configuration [Debian12 agent for RRE unittests]\n","stream":"stderr","time":"2024-07-08T10:01:56.550470623Z"}
{"log":"2024-07-08 10:01:56.552+0000 [id=783581]\u0009INFO\u0009c.g.j.p.c.ComputeEngineComputerLauncher#launch: Launch will wait 300000 for operation operation-1720432915238-61cb980c5cd7c-8a962697-28ce0d6f to complete...\n","stream":"stderr","time":"2024-07-08T10:01:56.552382895Z"}
{"log":"2024-07-08 10:01:56.556+0000 [id=783469]\u0009INFO\u0009c.g.j.p.c.ComputeEngineCloud#lambda$getPlannedNodeFuture$0: Waiting 300000ms for node gcp-rre-unittest-debian12-m5eb8g to connect\n","stream":"stderr","time":"2024-07-08T10:01:56.556557076Z"}
{"log":"2024-07-08 10:02:00.329+0000 [id=30]\u0009WARNING\u0009c.c.h.p.f.c.PeriodicFolderTrigger#run: Queue refused to schedule org.jenkinsci.plugins.workflow.multibranch.WorkflowMultiBranchProject@6c58483f[kw-selenium]\n","stream":"stderr","time":"2024-07-08T10:02:00.329832915Z"}
{"log":"2024-07-08 10:02:22.711+0000 [id=783581]\u0009INFO\u0009c.g.j.p.c.ComputeEngineCloud#log: Launching instance: gcp-rre-unittest-debian12-m5eb8g\n","stream":"stderr","time":"2024-07-08T10:02:22.711568522Z"}
{"log":"2024-07-08 10:02:22.711+0000 [id=783581]\u0009INFO\u0009c.g.j.p.c.ComputeEngineCloud#log: bootstrap\n","stream":"stderr","time":"2024-07-08T10:02:22.711668279Z"}
{"log":"2024-07-08 10:02:22.711+0000 [id=783581]\u0009INFO\u0009c.g.j.p.c.ComputeEngineCloud#log: Getting keypair...\n","stream":"stderr","time":"2024-07-08T10:02:22.711761673Z"}
{"log":"2024-07-08 10:02:22.711+0000 [id=783581]\u0009INFO\u0009c.g.j.p.c.ComputeEngineCloud#log: Using autogenerated ssh keypair\n","stream":"stderr","time":"2024-07-08T10:02:22.711836925Z"}
{"log":"2024-07-08 10:02:22.711+0000 [id=783581]\u0009INFO\u0009c.g.j.p.c.ComputeEngineCloud#log: Authenticating as jenkins\n","stream":"stderr","time":"2024-07-08T10:02:22.711887497Z"}
{"log":"2024-07-08 10:02:22.880+0000 [id=783581]\u0009INFO\u0009c.g.j.p.c.ComputeEngineCloud#log: No public address found. Fall back to internal address.\n","stream":"stderr","time":"2024-07-08T10:02:22.880551815Z"}
{"log":"2024-07-08 10:02:22.880+0000 [id=783581]\u0009INFO\u0009c.g.j.p.c.ComputeEngineCloud#log: Connecting to 192.168.75.42 on port 22, with timeout 10000.\n","stream":"stderr","time":"2024-07-08T10:02:22.88077559Z"}
{"log":"2024-07-08 10:02:23.163+0000 [id=783581]\u0009INFO\u0009c.g.j.p.c.ComputeEngineCloud#log: Connected via SSH.\n","stream":"stderr","time":"2024-07-08T10:02:23.164052242Z"}
{"log":"2024-07-08 10:02:23.277+0000 [id=783581]\u0009INFO\u0009c.g.j.p.c.ComputeEngineCloud#log: Verifying: java -fullversion\n","stream":"stderr","time":"2024-07-08T10:02:23.277623908Z"}
{"log":"2024-07-08 10:02:23.606+0000 [id=783581]\u0009INFO\u0009c.g.j.p.c.ComputeEngineCloud#log: Copying agent.jar to: /tmp\n","stream":"stderr","time":"2024-07-08T10:02:23.606668525Z"}
{"log":"2024-07-08 10:02:23.982+0000 [id=783581]\u0009INFO\u0009c.g.j.p.c.ComputeEngineCloud#log: Launching Jenkins agent via plugin SSH: java -jar /tmp/agent.jar\n","stream":"stderr","time":"2024-07-08T10:02:23.982918345Z"}
{"log":"2024-07-08 10:02:30.603+0000 [id=783469]\u0009INFO\u0009c.g.j.p.c.ComputeEngineCloud#lambda$getPlannedNodeFuture$0: 34047ms elapsed waiting for node gcp-rre-unittest-debian12-m5eb8g to connect\n","stream":"stderr","time":"2024-07-08T10:02:30.604095585Z"}
...
{"log":"2024-07-08 10:04:00.329+0000 [id=30]\u0009WARNING\u0009c.c.h.p.f.c.PeriodicFolderTrigger#run: Queue refused to schedule org.jenkinsci.plugins.workflow.multibranch.WorkflowMultiBranchProject@67d2865b[kw-commons]\n","stream":"stderr","time":"2024-07-08T10:04:00.330001085Z"}
{"log":"2024-07-08 10:04:00.330+0000 [id=30]\u0009WARNING\u0009c.c.h.p.f.c.PeriodicFolderTrigger#run: Queue refused to schedule org.jenkinsci.plugins.workflow.multibranch.WorkflowMultiBranchProject@663667cf[kw-log]\n","stream":"stderr","time":"2024-07-08T10:04:00.330467116Z"}
{"log":"2024-07-08 10:04:14.683+0000 [id=48]\u0009INFO\u0009c.g.j.p.c.CleanLostNodesWork#terminateInstance: Remote instance gcp-rre-unittest-debian12-m5eb8g not found locally, removing it\n","stream":"stderr","time":"2024-07-08T10:04:14.684097424Z"}
{"log":"2024-07-08 10:04:45.898+0000 [id=783666]\u0009INFO\u0009h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel gcp-rre-unittest-debian12-m5eb8g\n","stream":"stderr","time":"2024-07-08T10:04:45.899208784Z"}
{"log":"java.io.EOFException\n","stream":"stderr","time":"2024-07-08T10:04:45.899252468Z"}
{"log":"\u0009at java.base/java.io.ObjectInputStream$PeekInputStream.readFully(Unknown Source)\n","stream":"stderr","time":"2024-07-08T10:04:45.899262945Z"}
{"log":"\u0009at java.base/java.io.ObjectInputStream$BlockDataInputStream.readShort(Unknown Source)\n","stream":"stderr","time":"2024-07-08T10:04:45.899269998Z"}
{"log":"\u0009at java.base/java.io.ObjectInputStream.readStreamHeader(Unknown Source)\n","stream":"stderr","time":"2024-07-08T10:04:45.899276416Z"}
{"log":"\u0009at java.base/java.io.ObjectInputStream.\u003cinit\u003e(Unknown Source)\n","stream":"stderr","time":"2024-07-08T10:04:45.899282898Z"}
{"log":"\u0009at hudson.remoting.ObjectInputStreamEx.\u003cinit\u003e(ObjectInputStreamEx.java:50)\n","stream":"stderr","time":"2024-07-08T10:04:45.899303866Z"}
{"log":"\u0009at hudson.remoting.Command.readFrom(Command.java:142)\n","stream":"stderr","time":"2024-07-08T10:04:45.899310967Z"}
{"log":"\u0009at hudson.remoting.Command.readFrom(Command.java:128)\n","stream":"stderr","time":"2024-07-08T10:04:45.899317391Z"}
{"log":"\u0009at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)\n","stream":"stderr","time":"2024-07-08T10:04:45.899323809Z"}
{"log":"\u0009at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:61)\n","stream":"stderr","time":"2024-07-08T10:04:45.899330449Z"}
{"log":"Caused: java.io.IOException: Unexpected termination of the channel\n","stream":"stderr","time":"2024-07-08T10:04:45.899337437Z"}
{"log":"\u0009at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:75)\n","stream":"stderr","time":"2024-07-08T10:04:45.899345438Z"}
{"log":"2024-07-08 10:05:56.773+0000 [id=783469]\u0009INFO\u0009c.g.j.p.c.ComputeEngineComputerLauncher#launch: Launch will wait 300000 for operation operation-1720432915238-61cb980c5cd7c-8a962697-28ce0d6f to complete...\n","stream":"stderr","time":"2024-07-08T10:05:56.773533098Z"}
{"log":"2024-07-08 10:06:02.359+0000 [id=783469]\u0009INFO\u0009o.j.p.cloudstats.CloudStatistics#getIdFor: No support for cloud-stats-plugin by class com.google.jenkins.plugins.computeengine.ComputeEngineInstance\n","stream":"stderr","time":"2024-07-08T10:06:02.360249626Z"}

Please note: the instances are running through a wireguard setup so an internal IP can be used. To determine if problems might be related to this I set up a different (Jenkins-)“cloud” some weeks ago with default network and external IP - same issue.

In GCP I cannot find much for this VM:

(NOTICE) 2024-07-08 12:01:55.288 CEST Compute Engine insert europe-west3-c:gcp-rre-unittest-debian12-m5eb8g jenkins@jenkins-303020.iam.gserviceaccount.com {@type: type.googleapis.com/google.cloud.audit.AuditLog, authenticationInfo: {…}, authorizationInfo: […], metadata: {…}, methodName: v1.compute.instances.insert, request: {…}, requestMetadata: {…}, resourceLocation: {…}, resourceName: projects/jenkins-303020/zones/europe-west3-c/instances/gcp-rre-un… 
(NOTICE) 2024-07-08 12:02:20.765 CEST Compute Engine insert europe-west3-c:gcp-rre-unittest-debian12-m5eb8g jenkins@jenkins-303020.iam.gserviceaccount.com {@type: type.googleapis.com/google.cloud.audit.AuditLog, authenticationInfo: {…}, methodName: v1.compute.instances.insert, request: {…}, requestMetadata: {…}, resourceName: projects/jenkins-303020/zones/europe-west3-c/instances/gcp-rre-unittest-debian12-m5eb8g, serviceName: compute.googleapis.com} 
(NOTICE) 2024-07-08 12:04:14.776 CEST Compute Engine delete europe-west3-c:gcp-rre-unittest-debian12-m5eb8g jenkins@jenkins-303020.iam.gserviceaccount.com {@type: type.googleapis.com/google.cloud.audit.AuditLog, authenticationInfo: {…}, authorizationInfo: […], methodName: v1.compute.instances.delete, request: {…}, requestMetadata: {…}, resourceLocation: {…}, resourceName: projects/jenkins-303020/zones/europe-west3-c/instances/gcp-rre-unittest-debian12… 
(NOTICE) 2024-07-08 12:05:06.011 CEST Compute Engine delete europe-west3-c:gcp-rre-unittest-debian12-m5eb8g jenkins@jenkins-303020.iam.gserviceaccount.com {@type: type.googleapis.com/google.cloud.audit.AuditLog, authenticationInfo: {…}, methodName: v1.compute.instances.delete, request: {…}, requestMetadata: {…}, resourceName: projects/jenkins-303020/zones/europe-west3-c/instances/gcp-rre-unittest-debian12-m5eb8g, serviceName: compute.googleapis.com} 
(ERROR) 2024-07-08 12:06:02.199 CEST Compute Engine delete europe-west3-c:gcp-rre-unittest-debian12-m5eb8g jenkins@jenkins-303020.iam.gserviceaccount.com {@type: type.googleapis.com/google.cloud.audit.AuditLog, authenticationInfo: {…}, authorizationInfo: […], methodName: v1.compute.instances.delete, request: {…

Note: the (NOTICE) and (ERROR) are added by myself to “translate” the icons.
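
To line up the GCP audit entries (shown in CEST) with the Jenkins log (UTC), the timestamps can be converted to a common time zone. A small sketch, assuming the GCP console was displaying CEST, i.e. UTC+2; the helper names `cest_to_utc` and `jenkins_utc` are made up for illustration, and the two timestamps are taken from the logs above.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the audit log timestamps are CEST, a fixed offset of UTC+2.
CEST = timezone(timedelta(hours=2), "CEST")


def cest_to_utc(stamp):
    """Parse a 'YYYY-MM-DD HH:MM:SS.mmm' CEST timestamp and convert it to UTC."""
    local = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f").replace(tzinfo=CEST)
    return local.astimezone(timezone.utc)


def jenkins_utc(stamp):
    """Parse a Jenkins log timestamp such as '2024-07-08 10:04:14.683+0000'."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f%z")


# First delete entry in the GCP audit log for the failed run.
gcp_delete = cest_to_utc("2024-07-08 12:04:14.776")
# CleanLostNodesWork#terminateInstance line in the Jenkins log.
jenkins_cleanup = jenkins_utc("2024-07-08 10:04:14.683+0000")

delta = (gcp_delete - jenkins_cleanup).total_seconds()
print(f"GCP delete arrives {delta:.3f}s after the Jenkins cleanup line")
```

With these two values the delete request reaches GCP roughly 0.09 s after the `CleanLostNodesWork#terminateInstance` line, which suggests (but does not prove) that the delete was issued by that cleanup task.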

Here is the next job, which succeeded. It’s a lot of noise, so I only included the intro and the ending.

{"log":"2024-07-10 08:22:54.329+0000 [id=56]\u0009INFO\u0009c.g.j.p.c.ComputeEngineCloud#provision: Provisioning node from configs [com.google.jenkins.plugins.computeengine.InstanceConfiguration@48d1ea4f] for excess workload of 1 units of label 'cloud-rre-test'\n","stream":"stderr","time":"2024-07-10T08:22:54.329940581Z"}
{"log":"2024-07-10 08:22:55.133+0000 [id=56]\u0009INFO\u0009c.g.j.p.c.ComputeEngineCloud#availableNodeCapacity: Found capacity for 2 nodes in cloud Google Compute Engine (RRE-Unittest)\n","stream":"stderr","time":"2024-07-10T08:22:55.133675724Z"}
{"log":"2024-07-10 08:22:55.133+0000 [id=56]\u0009INFO\u0009c.g.j.p.c.InstanceConfiguration#instance: User selected to use an autogenerated ssh key pair\n","stream":"stderr","time":"2024-07-10T08:22:55.133960062Z"}
{"log":"2024-07-10 08:22:56.653+0000 [id=56]\u0009INFO\u0009c.g.j.p.c.InstanceConfiguration#provision: Sent insert request for instance configuration [Debian12 agent for RRE unittests]\n","stream":"stderr","time":"2024-07-10T08:22:56.653755626Z"}
{"log":"2024-07-10 08:22:56.656+0000 [id=844097]\u0009INFO\u0009c.g.j.p.c.ComputeEngineComputerLauncher#launch: Launch will wait 300000 for operation operation-1720599775343-61ce05a68f2aa-664a4e4c-166433ad to complete...\n","stream":"stderr","time":"2024-07-10T08:22:56.656801906Z"}
{"log":"2024-07-10 08:22:56.663+0000 [id=844073]\u0009INFO\u0009c.g.j.p.c.ComputeEngineCloud#lambda$getPlannedNodeFuture$0: Waiting 300000ms for node gcp-rre-unittest-debian12-7lubbu to connect\n","stream":"stderr","time":"2024-07-10T08:22:56.664182584Z"}
...
{"log":"2024-07-10 08:23:22.965+0000 [id=844097]\u0009INFO\u0009c.g.j.p.c.ComputeEngineCloud#log: Launching instance: gcp-rre-unittest-debian12-7lubbu\n","stream":"stderr","time":"2024-07-10T08:23:22.965394355Z"}
{"log":"2024-07-10 08:23:22.965+0000 [id=844097]\u0009INFO\u0009c.g.j.p.c.ComputeEngineCloud#log: bootstrap\n","stream":"stderr","time":"2024-07-10T08:23:22.96572794Z"}
{"log":"2024-07-10 08:23:22.965+0000 [id=844097]\u0009INFO\u0009c.g.j.p.c.ComputeEngineCloud#log: Getting keypair...\n","stream":"stderr","time":"2024-07-10T08:23:22.965844455Z"}
{"log":"2024-07-10 08:23:22.965+0000 [id=844097]\u0009INFO\u0009c.g.j.p.c.ComputeEngineCloud#log: Using autogenerated ssh keypair\n","stream":"stderr","time":"2024-07-10T08:23:22.965912973Z"}
{"log":"2024-07-10 08:23:22.965+0000 [id=844097]\u0009INFO\u0009c.g.j.p.c.ComputeEngineCloud#log: Authenticating as jenkins\n","stream":"stderr","time":"2024-07-10T08:23:22.965926627Z"}
{"log":"2024-07-10 08:23:23.130+0000 [id=844097]\u0009INFO\u0009c.g.j.p.c.ComputeEngineCloud#log: No public address found. Fall back to internal address.\n","stream":"stderr","time":"2024-07-10T08:23:23.131270049Z"}
{"log":"2024-07-10 08:23:23.131+0000 [id=844097]\u0009INFO\u0009c.g.j.p.c.ComputeEngineCloud#log: Connecting to 192.168.75.56 on port 22, with timeout 10000.\n","stream":"stderr","time":"2024-07-10T08:23:23.131311384Z"}
{"log":"2024-07-10 08:23:23.460+0000 [id=844097]\u0009INFO\u0009c.g.j.p.c.ComputeEngineCloud#log: Connected via SSH.\n","stream":"stderr","time":"2024-07-10T08:23:23.460739089Z"}
{"log":"2024-07-10 08:23:23.577+0000 [id=844097]\u0009INFO\u0009c.g.j.p.c.ComputeEngineCloud#log: Verifying: java -fullversion\n","stream":"stderr","time":"2024-07-10T08:23:23.578107611Z"}
{"log":"2024-07-10 08:23:23.873+0000 [id=844097]\u0009INFO\u0009c.g.j.p.c.ComputeEngineCloud#log: Copying agent.jar to: /tmp\n","stream":"stderr","time":"2024-07-10T08:23:23.873778743Z"}
{"log":"2024-07-10 08:23:24.230+0000 [id=844097]\u0009INFO\u0009c.g.j.p.c.ComputeEngineCloud#log: Launching Jenkins agent via plugin SSH: java -jar /tmp/agent.jar\n","stream":"stderr","time":"2024-07-10T08:23:24.230942715Z"}
... (tons of logs) ...
{"log":"2024-07-10 08:42:14.337+0000 [id=844676]\u0009INFO\u0009o.j.p.cloudstats.CloudStatistics#getIdFor: No support for cloud-stats-plugin by class com.google.jenkins.plugins.computeengine.ComputeEngineInstance\n","stream":"stderr","time":"2024-07-10T08:42:14.337888449Z"}
{"log":"2024-07-10 08:42:14.348+0000 [id=844718]\u0009WARNING\u0009hudson.remoting.Request$2#run: Failed to send back a reply to the request RPCRequest:hudson.remoting.RemoteClassLoader$IClassLoader.fetch3[java.lang.String](2)\n","stream":"stderr","time":"2024-07-10T08:42:14.349294714Z"}
{"log":"java.io.IOException\n","stream":"stderr","time":"2024-07-10T08:42:14.349336324Z"}
{"log":"\u0009at hudson.remoting.Channel.close(Channel.java:1494)\n","stream":"stderr","time":"2024-07-10T08:42:14.349344303Z"}
{"log":"\u0009at hudson.remoting.Channel.close(Channel.java:1450)\n","stream":"stderr","time":"2024-07-10T08:42:14.349351276Z"}
{"log":"\u0009at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:949)\n","stream":"stderr","time":"2024-07-10T08:42:14.349357738Z"}
{"log":"\u0009at hudson.slaves.SlaveComputer$2.run(SlaveComputer.java:823)\n","stream":"stderr","time":"2024-07-10T08:42:14.349364518Z"}
{"log":"\u0009at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)\n","stream":"stderr","time":"2024-07-10T08:42:14.34937135Z"}
{"log":"\u0009at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)\n","stream":"stderr","time":"2024-07-10T08:42:14.349378474Z"}
{"log":"\u0009at jenkins.util.ErrorLoggingExecutorService.lambda$wrap$0(ErrorLoggingExecutorService.java:51)\n","stream":"stderr","time":"2024-07-10T08:42:14.34938507Z"}
{"log":"\u0009at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)\n","stream":"stderr","time":"2024-07-10T08:42:14.349391804Z"}
{"log":"Caused: hudson.remoting.ChannelClosedException: Channel \"hudson.remoting.Channel@44e85a1b:gcp-rre-unittest-debian12-7lubbu\": channel is already closed\n","stream":"stderr","time":"2024-07-10T08:42:14.349398442Z"}
{"log":"\u0009at hudson.remoting.Channel.send(Channel.java:764)\n","stream":"stderr","time":"2024-07-10T08:42:14.349405636Z"}
{"log":"\u0009at hudson.remoting.Request$2.run(Request.java:390)\n","stream":"stderr","time":"2024-07-10T08:42:14.349412185Z"}
{"log":"\u0009at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)\n","stream":"stderr","time":"2024-07-10T08:42:14.349418796Z"}
{"log":"\u0009at org.jenkinsci.remoting.CallableDecorator.call(CallableDecorator.java:18)\n","stream":"stderr","time":"2024-07-10T08:42:14.349425623Z"}
{"log":"\u0009at hudson.remoting.CallableDecoratorList.lambda$applyDecorator$0(CallableDecoratorList.java:19)\n","stream":"stderr","time":"2024-07-10T08:42:14.349432617Z"}
{"log":"\u0009at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)\n","stream":"stderr","time":"2024-07-10T08:42:14.349439194Z"}
{"log":"\u0009at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)\n","stream":"stderr","time":"2024-07-10T08:42:14.349445975Z"}
{"log":"\u0009at java.base/java.util.concurrent.FutureTask.run(Unknown Source)\n","stream":"stderr","time":"2024-07-10T08:42:14.349452525Z"}
{"log":"\u0009at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)\n","stream":"stderr","time":"2024-07-10T08:42:14.349459061Z"}
{"log":"\u0009at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)\n","stream":"stderr","time":"2024-07-10T08:42:14.349465663Z"}
{"log":"\u0009at java.base/java.lang.Thread.run(Unknown Source)\n","stream":"stderr","time":"2024-07-10T08:42:14.349472307Z"}

This is a bit weird (channel is already closed), but it contains additional log lines that are missing in the failed run (e.g. hudson.slaves.SlaveComputer.closeChannel).

It looks like this at GCP:

(NOTICE) 2024-07-10 10:22:55.391 CEST Compute Engine insert europe-west3-c:gcp-rre-unittest-debian12-7lubbu jenkins@jenkins-303020.iam.gserviceaccount.com {@type: type.googleapis.com/google.cloud.audit.AuditLog, authenticationInfo: {…}, authorizationInfo: […], metadata: {…}, methodName: v1.compute.instances.insert, request: {…}, requestMetadata: {…}, resourceLocation: {…}, resourceName: projects/jenkins-303020/zones/europe-west3-c/instances/gcp-rre-un… 
(NOTICE) 2024-07-10 10:23:20.245 CEST Compute Engine insert europe-west3-c:gcp-rre-unittest-debian12-7lubbu jenkins@jenkins-303020.iam.gserviceaccount.com {@type: type.googleapis.com/google.cloud.audit.AuditLog, authenticationInfo: {…}, methodName: v1.compute.instances.insert, request: {…}, requestMetadata: {…}, resourceName: projects/jenkins-303020/zones/europe-west3-c/instances/gcp-rre-unittest-debian12-7lubbu, serviceName: compute.googleapis.com} ...
(NOTICE) 2024-07-10 10:42:14.127 CEST Compute Engine delete europe-west3-c:gcp-rre-unittest-debian12-7lubbu jenkins@jenkins-303020.iam.gserviceaccount.com {@type: type.googleapis.com/google.cloud.audit.AuditLog, authenticationInfo: {…}, authorizationInfo: […], methodName: v1.compute.instances.delete, request: {…}, requestMetadata: {…}, resourceLocation: {…}, resourceName: projects/jenkins-303020/zones/europe-west3-c/instances/gcp-rre-unittest-debian12… 
(NOTICE) 2024-07-10 10:42:59.687 CEST Compute Engine delete europe-west3-c:gcp-rre-unittest-debian12-7lubbu jenkins@jenkins-303020.iam.gserviceaccount.com {@type: type.googleapis.com/google.cloud.audit.AuditLog, authenticationInfo: {…}, methodName: v1.compute.instances.delete, request: {…}, requestMetadata: {…}, ...

We made the log level more verbose and added a dedicated logger for this.

It looks like the plugin is sometimes “forgetting” about its VMs. No idea what may be causing this.

Plugin log:

Aug 26, 2024 7:00:46 AM INFO com.google.jenkins.plugins.computeengine.InstanceConfiguration provision
Sent insert request for instance configuration [Debian12 agent for RRE unittests]
Aug 26, 2024 7:00:46 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineComputerLauncher launch
Launch will wait 300000 for operation operation-1724648445057-6208f01ee1a77-f70c6933-7f833c72 to complete...
Aug 26, 2024 7:00:46 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud lambda$getPlannedNodeFuture$0
Waiting 300000ms for node gcp-rre-unittest-debian12-xqphpu to connect
Aug 26, 2024 7:01:12 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Launching instance: gcp-rre-unittest-debian12-xqphpu
Aug 26, 2024 7:01:12 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
bootstrap
Aug 26, 2024 7:01:12 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Getting keypair...
Aug 26, 2024 7:01:12 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Using autogenerated ssh keypair
Aug 26, 2024 7:01:12 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Authenticating as jenkins
Aug 26, 2024 7:01:12 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
No public address found. Fall back to internal address.
Aug 26, 2024 7:01:12 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Connecting to 192.168.75.37 on port 22, with timeout 10000.
Aug 26, 2024 7:01:12 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Failed to connect via ssh: There was a problem while connecting to 192.168.75.37:22
Aug 26, 2024 7:01:12 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Waiting for SSH to come up. Sleeping 5.
Aug 26, 2024 7:01:17 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
No public address found. Fall back to internal address.
Aug 26, 2024 7:01:17 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Connecting to 192.168.75.37 on port 22, with timeout 10000.
Aug 26, 2024 7:01:17 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Connected via SSH.
Aug 26, 2024 7:01:17 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Verifying: java -fullversion
Aug 26, 2024 7:01:18 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Copying agent.jar to: /tmp
Aug 26, 2024 7:01:18 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Launching Jenkins agent via plugin SSH: java -jar /tmp/agent.jar
Aug 26, 2024 7:01:26 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud lambda$getPlannedNodeFuture$0
40479ms elapsed waiting for node gcp-rre-unittest-debian12-xqphpu to connect
Aug 26, 2024 7:04:12 AM INFO com.google.jenkins.plugins.computeengine.CleanLostNodesWork terminateInstance
Remote instance gcp-rre-unittest-debian12-xqphpu not found locally, removing it
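
The plugin log above uses two lines per entry (a header with time, level, logger and method, then the message), so measuring how long a VM lived before `CleanLostNodesWork` removed it needs a little pairing logic. A rough sketch follows; the function name `parse_plugin_log` is made up, the sample is a shortened copy of the log above, and the month and AM/PM parsing assumes an English (or default C) locale.

```python
import re
from datetime import datetime

# Header line: "Aug 26, 2024 7:00:46 AM INFO <logger> <method>"
HEADER = re.compile(
    r"^(\w{3} \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M) (\w+) (\S+) (\S+)$"
)


def parse_plugin_log(text):
    """Pair each header line with the message line(s) that follow it."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        match = HEADER.match(line)
        if match:
            records.append({
                "time": datetime.strptime(match.group(1), "%b %d, %Y %I:%M:%S %p"),
                "level": match.group(2),
                "logger": match.group(3),
                "method": match.group(4),
                "message": "",
            })
        elif records and line:
            rec = records[-1]
            rec["message"] = (rec["message"] + "\n" + line).strip()
    return records


SAMPLE = """\
Aug 26, 2024 7:00:46 AM INFO com.google.jenkins.plugins.computeengine.InstanceConfiguration provision
Sent insert request for instance configuration [Debian12 agent for RRE unittests]
Aug 26, 2024 7:01:18 AM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud log
Launching Jenkins agent via plugin SSH: java -jar /tmp/agent.jar
Aug 26, 2024 7:04:12 AM INFO com.google.jenkins.plugins.computeengine.CleanLostNodesWork terminateInstance
Remote instance gcp-rre-unittest-debian12-xqphpu not found locally, removing it
"""

records = parse_plugin_log(SAMPLE)
inserted = next(r for r in records if r["message"].startswith("Sent insert request"))
removed = next(r for r in records if r["logger"].endswith("CleanLostNodesWork"))
lifetime = (removed["time"] - inserted["time"]).total_seconds()
print(f"Removed {lifetime:.0f}s after the insert request")
```

For this run the removal comes 206 seconds after the insert request. Collecting that number over several failed runs might show whether the removals follow a fixed schedule or a fixed VM age.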

At the same time the VM was doing work:

...
07:02:21  Agent: gcp-rre-unittest-debian12-xqphpu
...
07:03:22  + yarn run test:ci
07:03:22  yarn run v1.22.22
07:03:22  $ yarn generate && craco test --coverage
07:03:22  $ yarn dependency test && yarn create-plugin-list && yarn create-view-model && yarn create-template-list && yarn create-themes && yarn bundle-messages
07:03:22  $ cross-env TS_NODE_PROJECT=./tsconfig.buildConf.json node -r ts-node/register build-config/dependency.ts test
07:03:25  Successfully checked view/edit dependencies array in package.json.
07:03:25  $ cross-env TS_NODE_PROJECT=./tsconfig.buildConf.json node -r ts-node/register build-config/createPluginList.ts
07:03:27  $ cross-env TS_NODE_PROJECT=./tsconfig.buildConf.json node -r ts-node/register build-config/createViewModel.ts
07:03:30  $ cross-env TS_NODE_PROJECT=./tsconfig.buildConf.json node -r ts-node/register build-config/createTemplateList.ts
07:03:32  $ cross-env TS_NODE_PROJECT=./tsconfig.buildConf.json node -r ts-node/register build-config/createThemes.ts
07:04:56  Cannot contact gcp-rre-unittest-debian12-xqphpu: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@70fb0402:gcp-rre-unittest-debian12-xqphpu": Remote call on gcp-rre-unittest-debian12-xqphpu failed. The channel is closing down or has closed down
07:05:02  Agent gcp-rre-unittest-debian12-xqphpu was deleted; cancelling node body
07:05:02  Could not connect to gcp-rre-unittest-debian12-xqphpu to send interrupt signal to process