Jenkins controller <---> agent communication with Docker Swarm

Jenkins setup:

  • Jenkins 2.440.2-lts w/ JDK17
  • Agents using JDK11
  • Docker Swarm cluster w/ 3 masters and 6 workers
  • Everything runs in AWS on the same VPC and connectivity has been successfully tested

Please help! My last post was hidden for some reason. We have a serious production issue: our controller node(s) have been unable to communicate with our worker nodes since we restored the system after a botched upgrade of the controller to 2.479.1. We had to create 3 new controller nodes and got the controller up and running, but whenever a job is scheduled, it reports that the agents are offline. This is the error:

[4:36:23 PM] Creating Service with Name : agt-_portal_PR_69_2-1246
java.net.SocketException: Broken pipe
at java.base/sun.nio.ch.NioSocketImpl.implWrite(Unknown Source)
at java.base/sun.nio.ch.NioSocketImpl.write(Unknown Source)
at java.base/sun.nio.ch.NioSocketImpl$2.write(Unknown Source)
at java.base/java.net.Socket$SocketOutputStream.write(Unknown Source)
at java.base/sun.security.ssl.SSLSocketOutputRecord.encodeChangeCipherSpec(Unknown Source)
at java.base/sun.security.ssl.OutputRecord.changeWriteCiphers(Unknown Source)
at java.base/sun.security.ssl.ChangeCipherSpec$T10ChangeCipherSpecProducer.produce(Unknown Source)
at java.base/sun.security.ssl.Finished$T12FinishedProducer.onProduceFinished(Unknown Source)

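A java.net.SocketException: Broken pipe means the other end closed the connection while this side was still writing to it. In your trace it happens inside the TLS handshake (the sun.security.ssl frames), right after the "Creating Service" step, so the problem is in communication rather than in the job itself. Here is a checklist for narrowing it down: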

  1. Check Network Connectivity: Verify connectivity between the Jenkins controller and the agents on the specific ports in use, not only at the host level. ping only shows that a host is reachable; telnet (or nc) against the actual port shows whether that port accepts connections.
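  2. Firewall and Security Groups: Confirm that your firewalls and AWS security groups allow traffic on the required ports. Inbound (JNLP) agents commonly connect to the controller on TCP port 50000, but check which port is actually configured in the controller's security settings. Because the controller nodes were recreated, also confirm that the new instances have the same security groups as the old ones.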
  3. Agent Configuration: Make sure the agents are configured to connect to the new controller nodes. If the controller's address or URL changed during the rebuild, the agent configuration has to be updated to match.
  4. Check the Jenkins Logs: Look at both the controller log and the agent logs around the time of the failure. They often show more detail about which connection was dropped and why.
  5. Docker Swarm Configuration: Confirm that the Swarm cluster itself is healthy and that services are running as expected, for example with docker node ls and docker service ls on a manager node. The Swarm nodes also need to be able to reach each other.
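  6. SSL/TLS Configuration: Since the stack trace fails inside the TLS handshake, verify the certificates on both ends of that connection. An expired or misconfigured certificate, a broken chain, or keys and truststores that were not carried over to the rebuilt controllers can all cause the peer to drop the connection.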
  7. Restart Jenkins Services: If none of the above turns up a cause, restart the Jenkins controller and the agents to rule out a transient issue.
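
If it helps, the port and TLS checks from steps 1, 2 and 6 can be scripted instead of done by hand with telnet. The following is only a minimal sketch in Python (standard library only). The function names check_tcp and check_tls are made up for this example, and you would need to substitute your own hosts and ports. Note that check_tls does not present a client certificate, so an endpoint that requires mutual TLS may still reject it even when the network path is fine.

```python
import socket
import ssl


def check_tcp(host, port, timeout=3.0):
    """Return True if a plain TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def check_tls(host, port, timeout=3.0, verify=True):
    """Attempt a TLS handshake with host:port.

    Returns (True, negotiated_protocol) on success,
    or (False, error_description) on failure.
    """
    context = ssl.create_default_context()
    if not verify:
        # Assumption for illustration: skip certificate validation so that
        # handshake problems can be told apart from trust problems.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                return True, tls.version()
    except (OSError, ssl.SSLError) as exc:
        return False, f"{type(exc).__name__}: {exc}"
```

Run it from the controller host against whatever endpoint the controller talks to when it creates the service, for example check_tcp("swarm-manager.internal", 2376) followed by check_tls("swarm-manager.internal", 2376, verify=False). The hostname is a placeholder, and 2376 is only the conventional Docker TLS port, so use whatever your setup actually exposes. If check_tcp succeeds but check_tls fails, the network path is fine and the certificates or TLS settings are the more likely culprit.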

Hope these steps help you get back on track!