[GSoC 2026 PROPOSAL] Ayush Singh (Flamki) — Continue AI-Powered Chatbot for Jenkins Resources

Hi everyone, I’m Ayush Singh (GitHub: Flamki), a final-year B.Tech Information Technology
student with a Data Science major. I’m applying for GSoC 2026 for the
Continue AI-Powered Chatbot for Quick Access to Jenkins Resources project.


Contributions to the Chatbot Codebase

jenkinsci/resources-ai-chatbot-plugin:

PR #172 (open) — Implemented search_stackoverflow_threads, replacing the no-op stub
that was returning hardcoded “Nothing relevant” regardless of the query. The fix uses the
existing retrieve_documents + extract_top_chunks helpers to stay consistent with the
rest of the retrieval architecture. Also fixed the request-logger flow — berviantoleo
confirmed the logger should come from the caller, not module-level. sharma-sugurthi
flagged that the other three search tools accept query, keywords, and logger while this
tool only accepts query. Prompt and keyword contract alignment is a planned follow-up
in Phase 1.

PR #179 (open, approved by berviantoleo ✅) — The /message endpoint persists session
state after every reply. The /message/upload endpoint didn’t. This meant uploaded-file
conversations could be silently lost after a restart. Fixed by injecting BackgroundTasks
and enqueuing persist_session after get_chatbot_reply returns — the same pattern the
standard endpoint uses. Approved and pending merge.

Other Jenkins contributions:

  • Jenkins Core PR #26331 (merged) — Encode log recorder name on redirect (security)
  • jenkins.io PR #8831 (merged) — Document hasPermission for secure form validation
  • OpenTelemetry Plugin PR #1240 (open) — Fix queue metrics visibility under restricted
    ACL using ACL.SYSTEM2
  • OpenTelemetry Plugin PR #1241 (open) — Make build-step span purge non-fatal with
    STRICT_MODE toggle

What the Codebase Actually Needs

Working on these PRs gave me a clear picture of where the real gaps are. Not from reading
the readme — from being inside the code.

Gap 1: Tool interface inconsistency
Three tools accept query, keywords, and logger. search_stackoverflow_threads only accepts
query. Keywords matter because the retrieval pipeline uses them for more precise matching —
skipping them degrades search quality silently. PR #172 fixes the logger flow. Full
keyword and prompt contract alignment across all tools is still needed.
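As a sketch of what the standardized contract could look like after Phase 1 — the (query, keywords, logger) signature comes from the review discussion, but the bodies of `retrieve_documents` and `extract_top_chunks` below are toy stand-ins, not the repo's real implementations:

```python
import logging

# Hypothetical sketch of the shared contract all four search tools would
# follow: query + keywords + a caller-supplied logger, never module-level.
def search_stackoverflow_threads(query: str, keywords: list[str],
                                 logger: logging.Logger) -> str:
    logger.info("searching stackoverflow: %r (keywords=%r)", query, keywords)
    docs = retrieve_documents(query, keywords)
    if not docs:
        return "Nothing relevant"  # explicit fallback, not a hardcoded stub
    return "\n\n".join(extract_top_chunks(docs))


# Toy stand-ins so the sketch runs on its own; the real helpers query
# the vector store.
def retrieve_documents(query, keywords):
    corpus = ["How do I restart Jenkins safely?",
              "Jenkins pipeline syntax question"]
    terms = [query, *keywords]
    return [d for d in corpus
            if any(t.lower() in d.lower() for t in terms)]


def extract_top_chunks(docs):
    return docs[:3]
```

With every tool sharing this shape, the Retriever Agent prompt can describe one calling convention instead of four, which is exactly the alignment flagged in the PR #172 review.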

Gap 2: No-op stub on main
search_stackoverflow_threads is still returning hardcoded “Nothing relevant” on current
main because PR #172 is not yet merged. Once it merges, the fix lands — but this also
raises the question of whether other tools have similar stub behavior worth auditing.

Gap 3: Endpoint test coverage is uneven
The session persistence bug in #179 existed because the upload endpoint had no test
asserting this behavior. The fix was straightforward but the gap in test coverage
across endpoints is broader than one file.

Gap 4: No way to measure retrieval quality
There is no eval, evaluation, or benchmark folder in chatbot-core on current main, and
no files containing evaluate or metrics logic. If someone improves the chunking strategy
or switches embedding models, there is no baseline to compare against. Improvements are
essentially unverifiable. This is the most important gap to close.
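A minimal sketch of what closing this gap looks like, assuming a golden set that maps each question to the document ids a retriever should return (the golden set entries and `toy_retriever` below are invented for illustration):

```python
# Hypothetical offline evaluation sketch: score a retriever's top-k
# results against a golden set by precision and recall.
def precision_recall(retrieved: list[str], relevant: set[str]):
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


golden_set = {  # toy entries; real ones would come from community Q&A
    "how to restart jenkins": {"doc-restart", "doc-cli"},
}


def toy_retriever(question: str) -> list[str]:
    return ["doc-restart", "doc-plugins"]  # one hit, one miss


for question, relevant in golden_set.items():
    p, r = precision_recall(toy_retriever(question), relevant)
    print(f"{question!r}: precision={p:.2f} recall={r:.2f}")
```

Once numbers like these exist per source, a chunking change or embedding-model swap becomes a measurable delta instead of a guess.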

Gap 5: Indexes go stale
There is no workflow in .github/workflows with a cron or schedule trigger, and no
refresh, update, or index pipeline exists. Jenkins documentation and community threads
get updated constantly. Without automated refresh, the chatbot slowly drifts out of date.
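The refresh pipeline's core check can be sketched in a few lines — the per-source maximum ages below are assumptions, not values from the repo:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness check: compare each index's last build time
# against a per-source maximum age and report which need a rebuild.
MAX_AGE = {
    "docs": timedelta(days=7),          # assumed: docs change weekly
    "stackoverflow": timedelta(days=1), # assumed: threads change daily
}


def stale_indexes(last_built: dict[str, datetime],
                  now: datetime) -> list[str]:
    return [source for source, built in last_built.items()
            if now - built > MAX_AGE.get(source, timedelta(days=7))]


now = datetime(2026, 6, 1, tzinfo=timezone.utc)
last_built = {
    "docs": now - timedelta(days=3),           # still fresh
    "stackoverflow": now - timedelta(days=2),  # past its max age
}
print(stale_indexes(last_built, now))
```

A scheduled workflow would run a check like this and trigger re-indexing only for the sources it flags, instead of rebuilding everything on every run.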


Proposed Plan (~350 hours)

Community Bonding — May (~30 hrs)

  • Full audit of tools.py — flag every interface inconsistency and stub
  • Set up local evaluation baseline using existing query patterns
  • Align with mentors on phase priorities before writing a single line of GSoC code

Phase 1 — Tool Completion & Interface Standardization — Weeks 1–4 (~80 hrs)

Every tool should follow the same contract: query, keywords, logger. Right now they
don’t. This phase makes the tool layer clean, tested, and documented so every future
change builds on solid ground.

  • Audit and fix every tool for interface consistency
  • Update Retriever Agent prompt in prompts.py to pass keywords for tools that
    currently don’t receive it — completing the alignment flagged during PR #172 review
  • Replace any remaining stubs with real retrieval implementations
  • Standardize error handling and fallback behavior
  • Write unit + integration tests per tool
  • Document the interface contract clearly for future contributors

Deliverable: All tools working, fully tested, consistent interface


Phase 2 — Retrieval Quality & Evaluation — Weeks 5–8 (~100 hrs)

This is where my Data Science background is directly relevant. Without measurement,
every retrieval improvement is just a guess. This phase builds the evaluation
infrastructure that makes all future improvements verifiable.

  • Build an offline golden evaluation dataset of Jenkins Q&A pairs
    drawn from real community questions and documentation
  • Measure retrieval precision and recall per source —
    docs, plugins, community threads, stackoverflow
  • Improve keyword extraction in the retrieval pipeline
  • Tune chunking strategy per content type — docs chunk differently from forum
    threads and plugin readmes
  • Add confidence scoring to surface low-confidence answers to the user

Deliverable: Evaluation framework with measurable baselines per retrieval source
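For the confidence-scoring bullet above, here is one possible shape — the threshold, the damping formula, and both function names are assumptions to be tuned against the Phase 2 evaluation data, not a committed design:

```python
# Hypothetical confidence scoring: map the retriever's top similarity
# scores to one confidence value and flag low-confidence answers.
LOW_CONFIDENCE_THRESHOLD = 0.5  # assumed cutoff, to be tuned in Phase 2


def answer_confidence(similarities: list[float]) -> float:
    # Use the best chunk's similarity, dampened when support is thin:
    # one weak match should score lower than several strong ones.
    if not similarities:
        return 0.0
    top = max(similarities)
    support = min(len(similarities), 3) / 3
    return top * (0.5 + 0.5 * support)


def render_answer(text: str, similarities: list[float]) -> str:
    conf = answer_confidence(similarities)
    if conf < LOW_CONFIDENCE_THRESHOLD:
        return f"(low confidence) {text}"
    return text


print(render_answer("Use the /safeRestart endpoint.", [0.82, 0.78, 0.71]))
print(render_answer("Maybe check the wiki?", [0.41]))
```

Surfacing the flag to the user is deliberately simple; the evaluation framework from this same phase is what makes the threshold choice defensible.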


Phase 3 — Index Freshness & Session Robustness — Weeks 9–11 (~80 hrs)

  • Automated index refresh pipeline for documentation and community content
  • Systematic session management tests across all endpoints —
    the #179 fix revealed this gap; this phase closes it properly
  • Session expiry and cleanup logic
  • Observability improvements using request-level logger pattern
    established in #172 review

Deliverable: Automated freshness pipeline, full session endpoint coverage
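The session expiry bullet could be as simple as a TTL sweep — the TTL value and the dict-of-timestamps store below are assumptions for illustration, since the repo's actual session layout may differ:

```python
import time

# Hypothetical session expiry sketch: evict sessions idle longer than a
# TTL so abandoned conversations (including uploads) don't accumulate.
SESSION_TTL_SECONDS = 3600  # assumed TTL, not a value from the repo


def cleanup_sessions(sessions: dict[str, float],
                     now: float,
                     ttl: float = SESSION_TTL_SECONDS) -> list[str]:
    """Remove idle sessions in place; return the ids that were evicted."""
    expired = [sid for sid, last_seen in sessions.items()
               if now - last_seen > ttl]
    for sid in expired:
        del sessions[sid]
    return expired


now = time.time()
sessions = {"fresh": now - 60, "idle": now - 7200}
evicted = cleanup_sessions(sessions, now)
print(evicted)  # the idle session is gone, the fresh one stays
```

Running this sweep on the same schedule as persistence keeps the cleanup path covered by the endpoint tests added in this phase.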


Phase 4 — Polish & Documentation — Weeks 12–13 (~60 hrs)

  • End-to-end integration tests covering the full retrieval → response flow
  • Contributor documentation for the tool interface pattern so the inconsistency
    issues I found don’t happen again
  • Performance profiling of the retrieval pipeline under realistic query load
  • Final evaluation run comparing retrieval quality at project start vs end —
    concrete numbers, not just “it feels better”

Deliverable: Production-ready chatbot, documented architecture, final report


Stack

Based on what mentors have confirmed in this forum:

  • LLM: Open-source / self-hosted (Ollama, Llama) — no proprietary dependency
  • Orchestration: LangChain
  • Backend: Decoupled FastAPI — existing architecture, not changing it
  • Evaluation: Offline golden dataset approach, no external API required

One Question for Mentors

During Phase 1, I’ll need to update the Retriever Agent prompt in prompts.py to pass
keywords for tools that currently don’t receive it — the gap sharma-sugurthi flagged
during PR #172 review.

Do you have a preference for how keyword extraction should be handled — should the LLM
extract keywords inline as part of the tool call, or should there be a dedicated
extraction step before tool dispatch?

This will shape the Phase 1 design and feed directly into retrieval improvements in
Phase 2.


Why This Plan

A “Continue” project isn’t about adding the most features. It’s about taking something
that was started and making it genuinely reliable and maintainable. My PRs showed me
exactly where the foundations need work before new features make sense. This proposal
is the plan to fix that foundation first, measure the results, then build on top of it
with confidence.

Looking forward to feedback from @krisstern and @berviantoleo and the rest of the
mentor team — Vutukuri Sreenivas and Chirag Gupta.

Thanks,
Ayush Singh (Flamki)


Welcome to the community.

Looks good.

I’ll give some detailed comments later. However, I recommend including some testing or measurement of what you’ve done in each phase, since it’s not a good idea to evaluate all of your work only in the last phase.

Thank you @berviantoleo — really appreciate the feedback.

That makes sense. I’ll update the plan so each phase has explicit testing/measurement and success criteria, instead of evaluating everything only at the end.

I’ll post the revised phase-by-phase measurement section shortly.

Thanks for the guidance. Adding the requested phase-wise validation plan:

Phase-wise Measurement Plan

  • Phase 1 (Tool standardization)

    • target: 100% tool contract compliance
    • validation: unit + integration pass rate for tool calls/fallbacks
    • artifact: tool-layer regression baseline report
  • Phase 2 (Retrieval quality)

    • target: measurable delta vs baseline for precision/recall per source
    • validation: offline faithfulness/relevance scoring on golden set
    • artifact: reproducible evaluation report per iteration
  • Phase 3 (Freshness + session robustness)

    • target: stable scheduled refresh runs + index recency checks
    • validation: session persistence/recovery coverage across endpoints
    • artifact: reliability report (freshness + session consistency)
  • Phase 4 (Final integration)

    • target: improved E2E quality/latency vs initial baseline
    • validation: end-to-end suite + comparative benchmark
    • artifact: final before/after metrics report

Hey Ayush, interesting read. I do have a question about the Phase 2 plan though. If we rely on an offline “golden dataset” for evaluation, how do we prevent that dataset from becoming stale almost immediately? Jenkins docs and plugins update constantly. Maintaining and updating the eval dataset manually might end up becoming a massive bottleneck. Wondering if there’s a more dynamic approach we should be looking at instead of static offline eval?