[GSoC 2026] OpenTelemetry scaling strategies

Heyy everyone!! :smiley:

I am Pratham, and I'm currently very interested in the OpenTelemetry project. While researching the OTel docs, I found they recommend deployment patterns for different scenarios, but I'd like to hear more from the people considering this project who have done the research themselves. I'll go first:

The architecture I propose is multi-layered: an agent layer (DaemonSet) and a gateway layer (Deployment and Service). The OTel Collector on the agent layer should be lightweight, with minimal processing (batching and memory limiting); its only job is to receive telemetry from Jenkins agent pods, fetch system metrics of the node, and immediately export the data to the gateway.
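To make this concrete, here's a minimal sketch of what the agent-layer collector config could look like. This is not final; the Service name, namespace, intervals, and limits are placeholders I'm using for illustration:

```yaml
# Agent-layer (DaemonSet) collector: receive, limit memory, batch, forward.
# Names, limits, and intervals below are placeholders, not final values.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  hostmetrics:                      # node-level system metrics
    collection_interval: 30s
    scrapers:
      cpu: {}
      memory: {}
      filesystem: {}

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 20
  batch:
    timeout: 5s

exporters:
  otlp:
    # hypothetical ClusterIP Service in front of the gateway Deployment
    endpoint: otel-gateway.observability.svc.cluster.local:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp, hostmetrics]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```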

The collector on the gateway layer is what does the heavy work: processing like tail sampling, aggregation, attribute control, filtering, etc., and finally asynchronous export to any backend. The gateway collectors should be horizontally scalable and are exposed via a ClusterIP Service. We'd use a HorizontalPodAutoscaler to scale the collectors when required, based on CPU/memory usage, the collector's queue length, or request throughput. I'll sketch a rough gateway config below, and then paste the part of my proposal that justifies this architecture.
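As a rough sketch (the sampling policies, thresholds, and backend endpoint below are illustrative placeholders, not final values from my proposal), the gateway config could look like this. It assumes the contrib distribution of the collector, since tail_sampling ships there:

```yaml
# Gateway-layer (Deployment) collector: heavy processing, then async export.
# Policies, thresholds, and the backend endpoint are placeholders.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
  tail_sampling:
    decision_wait: 30s              # how long to buffer a trace before deciding
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow-builds
        type: latency
        latency: {threshold_ms: 5000}
      - name: sample-the-rest
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
  attributes:
    actions:
      - key: ci.environment         # hypothetical attribute, just for illustration
        value: ci.jenkins.io
        action: insert
  batch: {}

exporters:
  otlphttp:
    endpoint: https://backend.example.com:4318   # placeholder backend
    sending_queue:                  # async export; queue depth is also a useful scaling signal
      enabled: true
      queue_size: 5000
    retry_on_failure:
      enabled: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, attributes, batch]
      exporters: [otlphttp]
```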

While I believe this is a well-thought-out architecture, one may ask:

  • Why have an agent layer at all? Why not cut out the middleman and emit telemetry directly from the pods to the gateway?

  • While it is possible to route telemetry directly to the gateway collectors, introducing a node-local layer would improve scalability, reduce network overhead, isolate failures, etc. This becomes critical for high-concurrency CI environments like ci.jenkins.io.

  • Without an agent layer, we face massive fan-in to the gateway pods. Introducing the agent layer smooths the network flow, basically turning N noisy sources into 1 controlled stream per node.

  • It enhances the scalability model. Without the agent layer, the gateway must scale for both ingestion and processing. With agents, we get a clean scaling boundary:

    • Agent layer → fixed (per node)

    • Gateway layer → elastic

  • Why not have the DaemonSets do the processing and exporting, and rule out the gateway layer?

  • While technically the agent collectors can handle ingestion, processing, and export, separating out a centralized gateway layer enables more efficient resource utilization.

  • Without the gateway, resource utilization suffers: if the agent layer does everything (tail sampling, filtering, aggregation), it leads to wasteful duplication and high CPU/memory usage on every node. With a gateway, the heavy work is centralized and fewer pods do the expensive processing.

  • Without a gateway, scaling gets tied to the number of nodes, which is bad coupling. A multi-layered system lets us scale selectively, for example with an autoscaler that targets only the gateway Deployment (sketched right after this list).
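To illustrate that boundary, here's a minimal HPA sketch that scales only the gateway Deployment and leaves the DaemonSet alone. The Deployment name, namespace, replica counts, and CPU target are hypothetical:

```yaml
# Hypothetical HPA for the gateway tier only; the agent DaemonSet stays fixed per node.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-gateway
  namespace: observability
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-gateway
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```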

Also, here's a diagram to make this easier to follow: Excalidraw Whiteboard
What do y’all think??!

In the future, we could use better scaling options such as event-driven scaling (KEDA) based on the collector's queue length, export frequency, etc. That would give us faster reaction to traffic spikes! We could also use historical patterns and scale adaptively around predictable high-load periods, for example increased contributor activity during GSoC or Hacktoberfest.
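To illustrate the KEDA idea, a ScaledObject keyed on the collector's exporter queue size could look roughly like this. It assumes Prometheus is already scraping the collector's own metrics; the addresses, names, and threshold are placeholders:

```yaml
# Hypothetical KEDA ScaledObject scaling the gateway on exporter queue depth.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: otel-gateway-queue
  namespace: observability
spec:
  scaleTargetRef:
    name: otel-gateway              # placeholder Deployment name
  minReplicaCount: 2
  maxReplicaCount: 15
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # placeholder
        query: avg(otelcol_exporter_queue_size)                # collector self-telemetry metric
        threshold: "3000"
```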

Strong architectural thinking, Pratham! The agent + gateway pattern makes sense
for large-scale K8s environments.

One scoping question for mentors: given the project goal is “to help enhance
observability of Jenkins jobs on ci.jenkins.io” within a 175-hour medium project,
should we prioritize:

a) Full multi-tier architecture (agent DaemonSets + gateway + HPA/KEDA) from day one
b) Simpler gateway pattern first, with the agent layer as a future scaling enhancement

In my PoC, the gateway-only approach achieved 80%+ data reduction with tail-based sampling while keeping deployment straightforward.

It would be worth hearing from the infrastructure team about their Kubernetes setup
and whether node-level collectors are needed from day one.

cc: @shivaylamba @krisstern

Hey salman! That's a very fair concern; this architecture can look quite heavy at first glance, especially for a 3-month timeline. My approach is to build it incrementally rather than all at once. The idea is:

  • First, establish a minimal end-to-end pipeline (Jenkins → Agent → Gateway → Backend)
  • Then gradually introduce sampling and filtering strategies, scaling, and optimization, divided across my 12-week plan.
  • Finally, refine it for production readiness: autoscaling, failure handling, configurability.

Basically, the goal is to evolve the system step by step within the timeline. I have crafted my proposal around this progression and submitted the draft, so I'm quite optimistic that it's feasible within the standard GSoC period. I'm curious to hear what the mentors think of it, and I'll let you know if they have any feedback on my architectural design.

Hi,

The gateway pattern is the way to go; we have scaled it to thousands of pods. The daemon agents do the heavy work of collecting the metrics, logs, and traces, and also enrich that data with node information. You can scale this layer horizontally if the load on your nodes requires it. Then you send all the data to the gateway layer, which can do some processing if needed and send the data to the backend. This layer can also scale horizontally if needed.
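For reference, that node enrichment on the agent side is usually done with the resourcedetection and k8sattributes processors; a rough sketch (the metadata fields are just examples, and k8sattributes needs RBAC to read pod metadata):

```yaml
# Sketch of agent-side enrichment processors (contrib distribution).
processors:
  resourcedetection:
    detectors: [env, system]        # adds host/OS resource attributes
  k8sattributes:
    extract:
      metadata:
        - k8s.node.name
        - k8s.namespace.name
        - k8s.pod.name
```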


Appreciate your input on this. I have a couple of questions about the gateway collectors' deployment. I've been looking into how stateful operations like tail-based sampling behave under horizontal scaling, since tail sampling needs all spans of a trace to land on the same collector instance before a decision is made. In cases of slow/flaky pipelines, where traces may take longer to complete, what happens if the collector instance holding that pipeline's trace is terminated (scaled down by the autoscaler, crashed, or anything else)? Are there recommended strategies in practice to mitigate scenarios like these and avoid hurting sampling accuracy?