[GSoC 2026 PROPOSAL] Ayush Singh (Flamki) — Continue AI-Powered Chatbot for Jenkins Resources

Hi everyone, I’m Ayush Singh (GitHub: Flamki), a final-year B.Tech Information Technology
student with a Data Science major. I’m applying for GSoC 2026 for the
Continue AI-Powered Chatbot for Quick Access to Jenkins Resources project.


Contributions to the Chatbot Codebase

jenkinsci/resources-ai-chatbot-plugin:

PR #172 (open) — Implemented search_stackoverflow_threads, replacing the no-op stub
that was returning hardcoded “Nothing relevant” regardless of the query. The fix uses the
existing retrieve_documents + extract_top_chunks helpers to stay consistent with the
rest of the retrieval architecture. Also fixed the request-logger flow: berviantoleo
confirmed the logger should be passed in by the caller rather than created at module level. sharma-sugurthi
flagged that the other three search tools accept query, keywords, and logger while this
tool only accepts query. Prompt and keyword contract alignment is a planned follow-up
in Phase 1.

PR #179 (open, approved by berviantoleo) — The /message endpoint persists session
state after every reply. The /message/upload endpoint didn’t. This meant uploaded-file
conversations could be silently lost after a restart. Fixed by injecting BackgroundTasks
and enqueuing persist_session after get_chatbot_reply returns — the same pattern the
standard endpoint uses. Approved and pending merge.

Other Jenkins contributions:

  • Jenkins Core PR #26331 (merged) — Encode log recorder name on redirect (security)
  • jenkins.io PR #8831 (merged) — Document hasPermission for secure form validation
  • OpenTelemetry Plugin PR #1240 (open) — Fix queue metrics visibility under restricted
    ACL using ACL.SYSTEM2
  • OpenTelemetry Plugin PR #1241 (open) — Make build-step span purge non-fatal with
    STRICT_MODE toggle

What the Codebase Actually Needs

Working on these PRs gave me a clear picture of where the real gaps are: not from reading
the README, but from being inside the code.

Gap 1: Tool interface inconsistency
Three tools accept query, keywords, and logger. search_stackoverflow_threads only accepts
query. Keywords matter because the retrieval pipeline uses them for more precise matching —
skipping them degrades search quality silently. PR #172 fixes the logger flow. Full
keyword and prompt contract alignment across all tools is still needed.
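
To make the target contract concrete, here is a hypothetical sketch of the unified signature. The tool and helper names (search_stackoverflow_threads, retrieve_documents, extract_top_chunks) come from this thread; the signatures and helper bodies are illustrative stand-ins, not the plugin's actual API.

```python
# Hypothetical sketch of the unified (query, keywords, logger) contract.
import logging

# Stand-ins for the plugin's real retrieval helpers (assumed behavior).
def retrieve_documents(query: str, keywords: list[str]) -> list[str]:
    return [f"doc matching {kw}" for kw in keywords] or [f"doc for {query}"]

def extract_top_chunks(docs: list[str], k: int = 3) -> str:
    return "\n".join(docs[:k])

def search_stackoverflow_threads(query: str, keywords: list[str],
                                 logger: logging.Logger) -> str:
    """Every search tool accepts the same triple, so the Retriever
    Agent can dispatch to any of them interchangeably."""
    logger.info("stackoverflow search: query=%r keywords=%r", query, keywords)
    docs = retrieve_documents(query, keywords)
    return extract_top_chunks(docs)
```

With all four tools on this shape, the agent prompt and dispatch code only need to know one calling convention.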

Gap 2: No-op stub on main
search_stackoverflow_threads is still returning hardcoded “Nothing relevant” on current
main because PR #172 is not yet merged. Once it merges, the fix lands — but this also
raises the question of whether other tools have similar stub behavior worth auditing.

Gap 3: Endpoint test coverage is uneven
The session persistence bug in #179 existed because the upload endpoint had no test
asserting this behavior. The fix was straightforward but the gap in test coverage
across endpoints is broader than one file.

Gap 4: No way to measure retrieval quality
There is no eval, evaluation, or benchmark folder in chatbot-core on current main, and
no files containing evaluate or metrics logic. If someone improves the chunking strategy
or switches embedding models, there is no baseline to compare against. Improvements are
essentially unverifiable. This is the most important gap to close.

Gap 5: Indexes go stale
There is no workflow in .github/workflows with a cron or schedule trigger, and no
refresh, update, or index pipeline exists. Jenkins documentation and community threads
get updated constantly. Without automated refresh, the chatbot slowly drifts out of date.
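
For reference, a minimal scheduled-refresh trigger could look like the following GitHub Actions sketch. The workflow name, cadence, and refresh command are placeholders to be settled with mentors, not a committed design.

```yaml
# Hypothetical .github/workflows/refresh-index.yml -- illustrative only
name: refresh-index
on:
  schedule:
    - cron: "0 3 * * 1"   # weekly, Mondays 03:00 UTC (cadence TBD)
  workflow_dispatch: {}    # allow manual runs while tuning the pipeline
jobs:
  refresh:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Rebuild indexes
        run: echo "invoke the (future) index refresh pipeline here"
```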


Proposed Plan (~350 hours)

Community Bonding — May (~30 hrs)

  • Full audit of tools.py — flag every interface inconsistency and stub
  • Set up local evaluation baseline using existing query patterns
  • Align with mentors on phase priorities before writing a single line of GSoC code

Phase 1 — Tool Completion & Interface Standardization — Weeks 1–4 (~80 hrs)

Every tool should follow the same contract: query, keywords, logger. Right now they
don’t. This phase makes the tool layer clean, tested, and documented so every future
change is building on solid ground.

  • Audit and fix every tool for interface consistency
  • Update Retriever Agent prompt in prompts.py to pass keywords for tools that
    currently don’t receive it — completing the alignment flagged during PR #172 review
  • Replace any remaining stubs with real retrieval implementations
  • Standardize error handling and fallback behavior
  • Write unit + integration tests per tool
  • Document the interface contract clearly for future contributors

Deliverable: All tools working, fully tested, consistent interface


Phase 2 — Retrieval Quality & Evaluation — Weeks 5–8 (~100 hrs)

This is where my Data Science background is directly relevant. Without measurement,
every retrieval improvement is just a guess. This phase builds the evaluation
infrastructure that makes all future improvements verifiable.

  • Build an offline golden evaluation dataset of Jenkins Q&A pairs
    drawn from real community questions and documentation
  • Measure retrieval precision and recall per source —
    docs, plugins, community threads, stackoverflow
  • Improve keyword extraction in the retrieval pipeline
  • Tune chunking strategy per content type — docs chunk differently from forum
    threads and plugin readmes
  • Add confidence scoring to surface low-confidence answers to the user
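
As a rough sketch of the per-source scoring, the eval item format and field names below are assumptions for illustration, not the project's actual schema:

```python
# Toy per-source precision/recall for retrieval: each eval item records
# which chunk ids are relevant; the retriever returns a ranked list.
from collections import defaultdict

def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    hits = sum(1 for c in retrieved if c in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def score_per_source(items: list[dict]) -> dict[str, tuple[float, float]]:
    """items: [{"source": "docs", "retrieved": [...], "relevant": {...}}, ...]
    Returns mean (precision, recall) per source bucket."""
    buckets = defaultdict(list)
    for it in items:
        buckets[it["source"]].append(
            precision_recall(it["retrieved"], it["relevant"]))
    return {src: (sum(p for p, _ in prs) / len(prs),
                  sum(r for _, r in prs) / len(prs))
            for src, prs in buckets.items()}
```

Bucketing by source is what makes per-content-type chunking changes measurable in isolation.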

Deliverable: Evaluation framework with measurable baselines per retrieval source


Phase 3 — Index Freshness & Session Robustness — Weeks 9–11 (~80 hrs)

  • Automated index refresh pipeline for documentation and community content
  • Systematic session management tests across all endpoints —
    the #179 fix revealed this gap, now close it properly
  • Session expiry and cleanup logic
  • Observability improvements using request-level logger pattern
    established in #172 review

Deliverable: Automated freshness pipeline, full session endpoint coverage


Phase 4 — Polish & Documentation — Weeks 12–13 (~60 hrs)

  • End-to-end integration tests covering the full retrieval → response flow
  • Contributor documentation for the tool interface pattern so the inconsistency
    issues I found don’t happen again
  • Performance profiling of the retrieval pipeline under realistic query load
  • Final evaluation run comparing retrieval quality at project start vs end —
    concrete numbers, not just “it feels better”

Deliverable: Production-ready chatbot, documented architecture, final report


Stack

Based on what mentors have confirmed in this forum:

  • LLM: Open-source / self-hosted (Ollama, Llama) — no proprietary dependency
  • Orchestration: LangChain
  • Backend: Decoupled FastAPI — existing architecture, not changing it
  • Evaluation: Offline golden dataset approach, no external API required

One Question for Mentors

During Phase 1, I’ll need to update the Retriever Agent prompt in prompts.py to pass
keywords for tools that currently don’t receive it — the gap sharma-sugurthi flagged
during PR #172 review.

Do you have a preference for how keyword extraction should be handled — should the LLM
extract keywords inline as part of the tool call, or should there be a dedicated
extraction step before tool dispatch?

This will shape the Phase 1 design and feed directly into retrieval improvements in
Phase 2.


Why This Plan

A “Continue” project isn’t about adding the most features. It’s about taking something
that was started and making it genuinely reliable and maintainable. My PRs showed me
exactly where the foundations need work before new features make sense. This proposal
is the plan to fix that foundation first, measure the results, then build on top of it
with confidence.

Looking forward to feedback from @krisstern and @berviantoleo and the rest of the
mentor team — Vutukuri Sreenivas and Chirag Gupta.

Thanks,
Ayush Singh (Flamki)


Welcome to the community.

Looks good.

I’ll give some detailed comments later. However, I recommend including some testing or measurement of what you’ve done in each phase, as it’s not ideal to evaluate all of your work only in the last phase.


Thank you @berviantoleo — really appreciate the feedback.

That makes sense. I’ll update the plan so each phase has explicit testing/measurement and success criteria, instead of evaluating everything only at the end.

I’ll post the revised phase-by-phase measurement section shortly.

Thanks for the guidance. Adding the requested phase-wise validation plan:

Phase-wise Measurement Plan

  • Phase 1 (Tool standardization)

    • target: 100% tool contract compliance
    • validation: unit + integration pass rate for tool calls/fallbacks
    • artifact: tool-layer regression baseline report
  • Phase 2 (Retrieval quality)

    • target: measurable delta vs baseline for precision/recall per source
    • validation: offline faithfulness/relevance scoring on golden set
    • artifact: reproducible evaluation report per iteration
  • Phase 3 (Freshness + session robustness)

    • target: stable scheduled refresh runs + index recency checks
    • validation: session persistence/recovery coverage across endpoints
    • artifact: reliability report (freshness + session consistency)
  • Phase 4 (Final integration)

    • target: improved E2E quality/latency vs initial baseline
    • validation: end-to-end suite + comparative benchmark
    • artifact: final before/after metrics report

Hey Ayush, interesting read. I do have a question about the Phase 2 plan though. If we rely on an offline “golden dataset” for evaluation, how do we prevent that dataset from becoming stale almost immediately? Jenkins docs and plugins update constantly. Maintaining and updating the eval dataset manually might end up becoming a massive bottleneck. Wondering if there’s a more dynamic approach we should be looking at instead of static offline eval?


Great question — I agree a purely static offline golden dataset would go stale quickly for Jenkins.

My proposed hybrid evaluation setup is:

  1. Stable regression set (small, versioned, frozen)
  • A curated core set used only for before/after comparison across iterations.
  • It is not auto-overwritten, so metric deltas stay comparable over time.
  2. Rolling freshness set (auto-regenerated)
  • Regenerated from the latest corpus snapshots produced by pipeline outputs (data/collection/*, data/preprocessing/*, data/chunking/*).
  • Regeneration is triggered by planned freshness jobs (scheduled cadence) and by corpus/index refresh events.
  • Samples are stratified by source (docs/plugins/discourse, plus StackOverflow when refreshed).
  3. Freshness-safe scoring
  • Stratification is also used as a freshness guard: if a source snapshot is older than the freshness window, that source’s samples are flagged and excluded from freshness scoring (while still usable for regression tracking).
  4. Snapshot-aware reporting per run
  • Each evaluation run records snapshot metadata (timestamp, source counts, index/model config), so results remain reproducible and comparable as content evolves.
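
The freshness guard in point 3 can be sketched as a simple split. The field names, window length, and data shape here are placeholders for illustration:

```python
# Sketch of the freshness guard: samples whose source snapshot is older
# than the window are excluded from freshness scoring but kept for
# regression tracking. Field names and the 30-day window are assumptions.
from datetime import datetime, timedelta, timezone

FRESHNESS_WINDOW = timedelta(days=30)  # cadence to be tuned with mentors

def split_by_freshness(samples, snapshot_times, now=None):
    """samples: [{"source": ...}, ...]; snapshot_times: {source: datetime}."""
    now = now or datetime.now(timezone.utc)
    fresh, stale = [], []
    for s in samples:
        age = now - snapshot_times[s["source"]]
        (fresh if age <= FRESHNESS_WINDOW else stale).append(s)
    return fresh, stale
```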

I’ll add this to the proposal doc now and link it here once updated.

Hey Ayush, great approach with the hybrid setup. I have a question regarding the Rolling freshness set. If this set is auto-regenerated from the latest pipeline outputs, how are the ground-truth Q&A pairs being validated? If an LLM is automatically generating the evaluation answers from the new chunks, isn’t there a risk of it hallucinating or ingesting bad data (like prompt injections or broken formatting in the source docs) and establishing that as the new ‘correct’ baseline? How do we ensure the auto-evaluator itself doesn’t drift?

Great catch, Adarsh — that’s exactly the risky part.

For the rolling set, I’m not planning unconstrained LLM-generated Q&A as “ground truth.”
Question creation will be constrained: either rule/template-derived from chunk structure, or LLM-generated in extractive mode where the expected answer must be a verbatim (or near-verbatim) span from the same chunk.

So validation is span/citation-based, not free-form “judge says correct.”
If a generated item can’t be mapped back to a valid evidence span, it gets dropped from the rolling set.
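
As a minimal sketch of that evidence-span filter, assuming simple whitespace/case normalization (the real normalization rules would be decided during implementation):

```python
# Sketch of the evidence-span check: a generated eval item is kept only
# if its expected answer maps back to a verbatim-after-normalization span
# of its source chunk. Normalization here is a deliberate simplification.
import re

def _normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text.strip().lower())

def has_evidence_span(expected_answer: str, chunk: str) -> bool:
    return _normalize(expected_answer) in _normalize(chunk)

def filter_rolling_set(items: list[dict]) -> list[dict]:
    """Drop items whose answer cannot be grounded in their chunk."""
    return [it for it in items if has_evidence_span(it["answer"], it["chunk"])]
```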

The frozen regression set stays separate (human-curated/versioned) and remains the anti-drift anchor for before/after comparisons.

I’ll also run a manual audit on ~20–30 rolling samples per refresh cycle to estimate noise and catch drift/injection artifacts early.

Update: I revised the proposal to a hybrid evaluation framework (stable regression + rolling freshness sets) and made measurement/validation explicit per phase rather than deferred to the end.

Timeline and workload (unchanged overall): ~350 hours

Community Bonding (May, ~30h)

  • Tool/endpoint audit and milestone lock with mentors.
    Measurement: baseline checklist for tool contracts + endpoint reliability.
    Validation: mentor-aligned acceptance criteria.
    Artifact: baseline scope + validation plan.

Phase 1 (Weeks 1–4, ~80h): Tool completion + interface standardization

  • Standardize retrieval tool contract (query, keywords, logger).
  • Align retriever-agent tool-call contract for keyword-aware dispatch.
  • Replace remaining stubs; normalize error/fallback behavior.
  • Add unit + integration tests per tool.
    Measurement: tool contract compliance rate + pass rates.
    Validation: tool-layer regression and integration suite.
    Artifact: contract matrix + Phase 1 report.

Phase 2 (Weeks 5–8, ~100h): Retrieval quality + evaluation framework

  • Hybrid eval setup:
    • Stable regression set (frozen/versioned anchor).
    • Rolling freshness set (auto-regenerated from data/collection/*, data/preprocessing/*, data/chunking/*).
  • Ground-truth guardrails:
    • no unconstrained free-form LLM “truth”
    • citation/span-anchored expected answers
    • failed evidence mapping => sample dropped
    • malformed/unsafe chunks quarantined
  • Source-stratified scoring and per-source quality tracking.
    Measurement: retrieval quality delta vs baseline per source.
    Validation: reproducible eval runs with metadata + audit logs.
    Artifact: evaluation framework + per-iteration quality reports.

Phase 3 (Weeks 9–11, ~80h): Freshness automation + session robustness

  • Refresh triggers: scheduled cadence + corpus/index refresh events + retrieval/chunking config changes.
  • Expand session lifecycle tests across message/upload/stream flows.
  • Validate expiry/cleanup and persisted reload behavior.
    Measurement: refresh reliability + endpoint/session coverage targets.
    Validation: scheduled run health checks + robustness suite.
    Artifact: freshness reliability report + session robustness report.

Phase 4 (Weeks 12–13, ~60h): Integration, profiling, documentation

  • E2E retrieval→response tests.
  • Contributor docs for tool contracts + eval workflow.
  • Retrieval performance profiling.
  • Final before/after benchmark.
    Measurement: integrated quality + latency delta vs initial baseline.
    Validation: E2E suite + comparative benchmark review.
    Artifact: final metrics report + contributor docs.

Additional quality controls

  • Freshness-safe scoring: sources outside freshness window excluded from freshness score (retained for regression tracking).
  • Manual audit each refresh cycle (~20–30 rolling samples) to catch drift/injection/noise early.
  • Reproducibility metadata per run: snapshot timestamp, source counts, chunk/index/model config, eval seed.

Revisiting the open keyword-extraction question from the original post: given the updated hybrid eval design in Phase 2, do you prefer:

  1. inline keyword extraction during retriever-agent tool-call generation, or
  2. dedicated pre-dispatch keyword extraction?

@berviantoleo @krisstern I’ve posted a revised update with phase-wise measurement/validation and the hybrid eval design updates from this thread. I’d appreciate a final review pass to confirm this direction is aligned.

@berviantoleo @krisstern Planning to submit my final proposal in the next couple of days. On the open keyword-extraction question, I’ll go with pre-dispatch keyword extraction as the default design — it keeps the retriever-agent contract cleaner and makes the dispatch logic easier to test in isolation. Happy to revisit this post-submission if you’d prefer inline.
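
To illustrate why pre-dispatch extraction is easier to test in isolation, here is a minimal sketch. The stopword-filter extractor is a trivial stand-in — the real step could be rule-based or an LLM call — and all names here are hypothetical, not the plugin's code:

```python
# Sketch of the pre-dispatch design: keywords are extracted once, before
# the retriever agent dispatches, so every tool receives the same
# (query, keywords, logger) triple. The extractor is a toy stand-in.
import logging

STOPWORDS = {"how", "do", "i", "a", "the", "to", "in", "is", "my", "on"}

def extract_keywords(query: str, max_keywords: int = 5) -> list[str]:
    tokens = [t.strip("?.,!").lower() for t in query.split()]
    return [t for t in tokens if t and t not in STOPWORDS][:max_keywords]

def dispatch(tool, query: str, logger: logging.Logger) -> str:
    # Single extraction point: testable on its own, swappable later
    # without touching any individual tool.
    keywords = extract_keywords(query)
    return tool(query, keywords, logger)
```

Because extraction happens at one seam, swapping the toy extractor for an LLM step later would not change any tool's interface.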

Thanks for all the engagement on this thread — it directly shaped the hybrid eval design.