(RFC) Architecture Design: Scalable OpenTelemetry for Jenkins

Hi everyone,

I am exploring the ‘Use OpenTelemetry for Jenkins Jobs’ project for GSoC 2026. Following a discussion with maintainers on Gitter, I am opening this thread to document the architectural design and trade-offs for ci.jenkins.io.

The Challenge

Enabling the OTel plugin at the scale of the Jenkins infrastructure is not just a matter of installation; it requires an engineering strategy for handling high cardinality and storage costs.

Proposed Architecture

I have been researching a backend-agnostic OTLP pipeline in which the Collector acts as a gateway. The key feature is tail-based sampling, which allows us to:

  • Retain 100% of failed build traces (for debugging).

  • Aggressively sample successful builds (to save storage).

  • Switch backends (Jaeger, Tempo, Prometheus) without changing Jenkins configuration.
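To make the sampling policy in the list above concrete, here is a minimal Python sketch of the decision logic. It is only an illustration, not Collector configuration: in a real deployment this decision would be made by a Collector processor (for example the contrib tail_sampling processor). The function names, the 5% success sample rate, and the 10% failure rate used in the usage note are assumptions chosen for the example.

```python
import hashlib


def keep_trace(trace_id: str, build_failed: bool,
               success_sample_rate: float = 0.05) -> bool:
    """Decide whether to retain a finished build's trace.

    Tail-based: the decision is made after the build has ended, so the
    build result is already known. The 5% default is a hypothetical value.
    """
    if build_failed:
        return True  # retain 100% of failed build traces
    # Map the trace id to a deterministic bucket in [0, 1), so that every
    # collector replica reaches the same decision for the same trace.
    digest = hashlib.sha256(trace_id.encode("ascii")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < success_sample_rate


def retained_fraction(failure_rate: float, success_sample_rate: float) -> float:
    """Expected share of all traces that end up in storage."""
    return failure_rate + (1.0 - failure_rate) * success_sample_rate
```

As a rough usage example under those assumed numbers, retained_fraction(0.10, 0.05) comes out at about 0.145, meaning roughly 14.5% of traces would be stored while every failed build stays available for debugging.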

Proof of Concept & Trade-off Analysis

I have set up a local lab (Jenkins → Collector → Jaeger) to validate the probabilistic_sampler and load generation. I have compiled my findings, including a detailed analysis of cardinality vs. query speed, in the draft: (RFC) Architecture Design: Scalable OpenTelemetry for Jenkins (Google Docs)

I would appreciate any feedback from the Infra team, specifically regarding the retention policies for successful builds.

Amanraz Thakur

Update on the Context Propagation Fix

I have implemented and validated the fix for the TRACEPARENT context issue.

The Issue: Previously, the TRACEPARENT environment variable was not updated correctly for steps inside a stage. This broke the trace hierarchy: steps such as sh or bat were not nested under their parent stage.

The Fix: I modified OtelEnvironmentContributor.java so that it explicitly respects the currently active span when setting TRACEPARENT. You can view the implementation here: [Link to The PR]
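For readers unfamiliar with the variable, the sketch below shows in Python what a correctly propagated TRACEPARENT looks like in the W3C Trace Context format (version, trace id, parent span id, flags). It is not the plugin's Java code; the helper names and the example ids are hypothetical, and it only illustrates the intended behaviour: a step inside a stage should receive the stage's span id as its parent rather than the span id of the run.

```python
import re

# W3C Trace Context, version 00: 00-<32 hex trace id>-<16 hex span id>-<2 hex flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def format_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Build a traceparent value that points at the given span."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parent_span_id(traceparent: str) -> str:
    """Extract the parent span id that a child process would attach to."""
    match = TRACEPARENT_RE.match(traceparent)
    if match is None:
        raise ValueError(f"malformed traceparent: {traceparent!r}")
    return match.group(2)


def step_environment(trace_id: str, active_span_id: str) -> dict:
    """Environment for a step, derived from the currently active span.

    Hypothetical helper: passing the stage's span id here (instead of the
    run's root span id) is what keeps sh/bat steps nested under the stage.
    """
    return {"TRACEPARENT": format_traceparent(trace_id, active_span_id)}


# Example ids (made up for illustration).
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
run_span_id = "00f067aa0ba902b7"
stage_span_id = "53995c3f42cd8ad8"

env = step_environment(trace_id, stage_span_id)
```

With these example values, parent_span_id(env["TRACEPARENT"]) returns the stage's span id, so a tool launched by the step would create its spans under the stage rather than directly under the run.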

Visual Verification: I spun up a local Jenkins instance connected to a Jaeger backend to verify the fix. As seen in the screenshot below, the sh step is now correctly indented as a child of the Verification stage, confirming that the trace context is being propagated correctly down the pipeline.