Hello,
I’m trying to find the root cause of, and possible fixes for, an issue with my Jenkins setup.
I have a configuration where on-premise agents connect to Jenkins over WebSocket. The agents run on Linux (Ubuntu Server) and macOS.
Jenkins is deployed to an EKS cluster via a Helm chart and is exposed through an NGINX Ingress Controller, which is behind an AWS Network Load Balancer (NLB).
Jenkins version: 2.492.3
Problem: An agent’s connection is abruptly terminated (TCP RST or timeout), and the agent immediately attempts to reconnect. However, the Jenkins controller still considers the old connection active and rejects the new one with:
org.jenkinsci.remoting.protocol.impl.ConnectionRefusalException: <agent-name> is already connected to this controller. Rejecting this connection.
at jenkins.agents.WebSocketAgents.doIndex(WebSocketAgents.java:107)
This results in HTTP 500 errors on the /wsagents/ endpoint. Agents retry with exponential backoff (1s, 3s, 7s, then capped at 10s) but keep failing until Jenkins eventually detects the stale connection, which takes 2-6 minutes.
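To give a sense of how many rejected reconnects (HTTP 500s) a single agent produces per incident, here is a rough sketch. It assumes the observed delays follow a 2^n - 1 pattern capped at 10s, that retries are counted after the initial immediate reconnect, and that request latency is negligible. The function names are mine and this is not the Remoting implementation, only an estimate based on the delays I see in the logs.

```python
from itertools import count


def backoff_delays(cap_s=10):
    """Yield retry delays in seconds: 1, 3, 7, then cap_s repeatedly.

    The 2**n - 1 shape is an assumption inferred from the observed delays.
    """
    for n in count(1):
        yield min(2 ** n - 1, cap_s)


def failed_attempts(stale_window_s, cap_s=10):
    """Count retries sent while the controller still holds the stale session.

    stale_window_s is how long Jenkins takes to notice the dead connection.
    """
    elapsed = 0
    attempts = 0
    for delay in backoff_delays(cap_s):
        elapsed += delay
        if elapsed >= stale_window_s:
            break  # this retry lands after the stale session is cleaned up
        attempts += 1
    return attempts


if __name__ == "__main__":
    for window_s in (120, 360):  # the 2 and 6 minute detection times
        print(window_s, failed_attempts(window_s))
```

Under these assumptions, a 2-minute stale window gives 13 rejected retries per agent and a 6-minute window gives 37.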
I have multiple sites with on-premise agents, and the issue reproduces across all of them, so it’s unlikely to be a network problem isolated to a single location.
The behavior is sporadic. For instance, a site may have 30+ on-premise agents, but during an incident, only a subset of agents is impacted.
Agent-side error:
java.io.IOException: Connection reset
at java.base/sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishRead(UnixAsynchronousSocketChannelImpl.java:425)
at java.base/sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:195)
at java.base/sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:217)
at java.base/sun.nio.ch.KQueuePort$EventHandlerTask.run(KQueuePort.java:312)
at java.base/java.lang.Thread.run(Thread.java:1583)
Because the Linux agents run Ubuntu Server, I first suspected an issue I had found where the Java service can be restarted without user approval, which would trigger a reconnect. However, the same thing happens on the macOS agents, so that does not explain it.
Impact:
Recovery time: 2 seconds to 6+ minutes, depending on when Jenkins detects a stale connection
Affects agents across multiple offices (not location-specific)
I would appreciate any help with this behaviour.