Hi everyone, I’m Ayush Singh (GitHub: Flamki), a final-year B.Tech Information Technology
student with a Data Science major. I’m applying for GSoC 2026 for the
Continue AI-Powered Chatbot for Quick Access to Jenkins Resources project.
Contributions to the Chatbot Codebase
jenkinsci/resources-ai-chatbot-plugin:
PR #172 (open) — Implemented search_stackoverflow_threads, replacing the no-op stub
that was returning hardcoded “Nothing relevant” regardless of the query. The fix uses the
existing retrieve_documents + extract_top_chunks helpers to stay consistent with the
rest of the retrieval architecture. Also fixed the request-logger flow — berviantoleo
confirmed the logger should come from the caller, not module-level. sharma-sugurthi
flagged that the other three search tools accept query, keywords, and logger while this
tool only accepts query. Prompt and keyword contract alignment is a planned follow-up
in Phase 1.
PR #179 (open, approved by berviantoleo) — The /message endpoint persists session
state after every reply. The /message/upload endpoint didn’t. This meant uploaded-file
conversations could be silently lost after a restart. Fixed by injecting BackgroundTasks
and enqueuing persist_session after get_chatbot_reply returns — the same pattern the
standard endpoint uses. Approved and pending merge.
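The persistence pattern from that fix can be sketched in a few lines. This is a hedged stand-in, not the actual code: persist_session and get_chatbot_reply mirror the names used in the PR, but the queue class below is a minimal substitute for FastAPI's BackgroundTasks and the endpoint body is illustrative.

```python
# Minimal sketch of the PR #179 pattern: compute the reply first, then
# enqueue persistence so it runs after the response is sent. The queue
# class is a stand-in for fastapi.BackgroundTasks; all bodies are stubs.
from typing import Callable, List

class BackgroundTaskQueue:
    """Stand-in for fastapi.BackgroundTasks."""
    def __init__(self) -> None:
        self._tasks: List[Callable[[], None]] = []

    def add_task(self, func: Callable, *args) -> None:
        self._tasks.append(lambda: func(*args))

    def run_all(self) -> None:
        # FastAPI runs queued tasks after the response goes out.
        for task in self._tasks:
            task()

SESSIONS: dict = {}  # stand-in for durable session storage

def persist_session(session_id: str, state: dict) -> None:
    SESSIONS[session_id] = dict(state)

def get_chatbot_reply(message: str) -> str:
    return f"echo: {message}"  # stand-in for the real model call

def message_upload(session_id: str, file_text: str,
                   background: BackgroundTaskQueue) -> dict:
    reply = get_chatbot_reply(file_text)
    # The fix: enqueue persistence after the reply is computed, the same
    # pattern the standard /message endpoint already used.
    background.add_task(persist_session, session_id, {"last_reply": reply})
    return {"reply": reply}

queue = BackgroundTaskQueue()
result = message_upload("s1", "hello", queue)
queue.run_all()
```

The key property is ordering: the response is never blocked on storage, and a restart after the reply no longer loses the uploaded-file conversation.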
Other Jenkins contributions:
- Jenkins Core PR #26331 (merged) — Encode log recorder name on redirect (security)
- jenkins.io PR #8831 (merged) — Document hasPermission for secure form validation
- OpenTelemetry Plugin PR #1240 (open) — Fix queue metrics visibility under restricted ACL using ACL.SYSTEM2
- OpenTelemetry Plugin PR #1241 (open) — Make build-step span purge non-fatal with STRICT_MODE toggle
What the Codebase Actually Needs
Working on these PRs gave me a clear picture of where the real gaps are: not from reading
the README, but from working inside the code.
Gap 1: Tool interface inconsistency
Three tools accept query, keywords, and logger. search_stackoverflow_threads only accepts
query. Keywords matter because the retrieval pipeline uses them for more precise matching —
skipping them degrades search quality silently. PR #172 fixes the logger flow. Full
keyword and prompt contract alignment across all tools is still needed.
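The target contract can be sketched as follows. This is hypothetical: retrieve_documents and extract_top_chunks are the helper names mentioned above, but their signatures and the stub bodies here are assumptions for illustration, not the actual chatbot-core code.

```python
# Hypothetical sketch of the shared tool contract (query, keywords, logger).
# The two helpers are stand-ins; the real ones query the vector store and
# rank chunks.
import logging
from typing import List, Optional

def retrieve_documents(query: str, keywords: List[str]) -> List[str]:
    # Stand-in for the existing retrieval helper.
    return [f"chunk for {kw}" for kw in keywords] or [f"chunk for {query}"]

def extract_top_chunks(docs: List[str], k: int = 3) -> List[str]:
    # Stand-in for the existing ranking helper.
    return docs[:k]

def search_stackoverflow_threads(
    query: str,
    keywords: Optional[List[str]] = None,
    logger: Optional[logging.Logger] = None,
) -> List[str]:
    """Same contract as the other three search tools: the caller supplies
    the logger, and keywords sharpen retrieval instead of being dropped."""
    logger = logger or logging.getLogger(__name__)
    keywords = keywords or []
    logger.info("stackoverflow search: query=%r keywords=%r", query, keywords)
    return extract_top_chunks(retrieve_documents(query, keywords))
```

Once every tool exposes this shape, the Retriever Agent prompt can pass keywords uniformly and the request-scoped logger flows through without module-level state.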
Gap 2: No-op stub on main
search_stackoverflow_threads is still returning hardcoded “Nothing relevant” on current
main because PR #172 is not yet merged. Once it merges, the fix lands — but this also
raises the question of whether other tools have similar stub behavior worth auditing.
Gap 3: Endpoint test coverage is uneven
The session persistence bug in #179 existed because the upload endpoint had no test
asserting this behavior. The fix was straightforward, but the gap in test coverage
across endpoints is broader than one file.
Gap 4: No way to measure retrieval quality
There is no eval, evaluation, or benchmark folder in chatbot-core on current main, and
no files containing evaluate or metrics logic. If someone improves the chunking strategy
or switches embedding models, there is no baseline to compare against. Improvements are
essentially unverifiable. This is the most important gap to close.
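To make the idea concrete, here is a hedged sketch of what the core of such an evaluation could look like: score a retriever against a golden dataset of (question, relevant doc ids) pairs. The function name and dataset shape are assumptions for illustration, not existing chatbot-core code.

```python
# Sketch of offline retrieval evaluation: average precision@k and recall@k
# over a golden dataset. All names here are illustrative.
from typing import Callable, Dict, List, Set, Tuple

def precision_recall_at_k(
    golden: List[Tuple[str, Set[str]]],      # (question, relevant doc ids)
    retrieve: Callable[[str], List[str]],    # question -> ranked doc ids
    k: int = 5,
) -> Dict[str, float]:
    precisions, recalls = [], []
    for question, relevant in golden:
        top_k = retrieve(question)[:k]
        hits = sum(1 for doc_id in top_k if doc_id in relevant)
        precisions.append(hits / len(top_k) if top_k else 0.0)
        recalls.append(hits / len(relevant) if relevant else 0.0)
    n = len(golden) or 1
    return {
        "precision@k": sum(precisions) / n,
        "recall@k": sum(recalls) / n,
    }
```

Run once per source (docs, plugins, community threads, Stack Overflow) and any chunking or embedding change gets a before/after number instead of a guess.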
Gap 5: Indexes go stale
There is no workflow in .github/workflows with a cron or schedule trigger, and no
refresh, update, or index pipeline exists. Jenkins documentation and community threads
get updated constantly. Without automated refresh, the chatbot slowly drifts out of date.
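The refresh logic itself could be as simple as the following sketch: re-index any source whose index is older than a freshness window. The reindex_source hook and the timestamp map are hypothetical, not existing code; the scheduling would live in a cron-triggered workflow that calls something like this.

```python
# Illustrative staleness check for the proposed refresh pipeline.
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, List, Optional

def refresh_stale_indexes(
    last_indexed: Dict[str, datetime],        # source name -> last index time
    reindex_source: Callable[[str], None],    # hypothetical re-index hook
    max_age: timedelta = timedelta(days=7),
    now: Optional[datetime] = None,
) -> List[str]:
    """Re-index every source older than max_age; return what was refreshed."""
    now = now or datetime.now(timezone.utc)
    refreshed = []
    for source, indexed_at in sorted(last_indexed.items()):
        if now - indexed_at > max_age:
            reindex_source(source)
            last_indexed[source] = now
            refreshed.append(source)
    return refreshed
```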
Proposed Plan (~350 hours)
Community Bonding — May (~30 hrs)
- Full audit of tools.py — flag every interface inconsistency and stub
- Set up local evaluation baseline using existing query patterns
- Align with mentors on phase priorities before writing a single line of GSoC code
Phase 1 — Tool Completion & Interface Standardization — Weeks 1–4 (~80 hrs)
Every tool should follow the same contract: query, keywords, logger. Right now they
don’t. This phase makes the tool layer clean, tested, and documented so every future
change is building on solid ground.
- Audit and fix every tool for interface consistency
- Update Retriever Agent prompt in prompts.py to pass keywords for tools that currently don’t receive it — completing the alignment flagged during PR #172 review
- Replace any remaining stubs with real retrieval implementations
- Standardize error handling and fallback behavior
- Write unit + integration tests per tool
- Document the interface contract clearly for future contributors
Deliverable: All tools working, fully tested, consistent interface
Phase 2 — Retrieval Quality & Evaluation — Weeks 5–8 (~100 hrs)
This is where my Data Science background is directly relevant. Without measurement,
every retrieval improvement is just a guess. This phase builds the evaluation
infrastructure that makes all future improvements verifiable.
- Build an offline golden evaluation dataset of Jenkins Q&A pairs drawn from real community questions and documentation
- Measure retrieval precision and recall per source — docs, plugins, community threads, Stack Overflow
- Improve keyword extraction in the retrieval pipeline
- Tune chunking strategy per content type — docs chunk differently from forum threads and plugin readmes
- Add confidence scoring to surface low-confidence answers to the user
Deliverable: Evaluation framework with measurable baselines per retrieval source
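For the confidence-scoring bullet, a minimal sketch of the idea: map the best retrieval similarity score to a label so weak answers can be flagged to the user. Score semantics (0..1, higher is better) and the thresholds are assumptions, not decided design.

```python
# Sketch of confidence scoring over retrieval similarity scores.
# Thresholds here are placeholders to be tuned against the eval dataset.
from typing import List

def answer_confidence(scores: List[float],
                      strong: float = 0.75, floor: float = 0.4) -> str:
    """Label an answer by its strongest supporting retrieval score."""
    if not scores:
        return "no-answer"
    top = max(scores)
    if top >= strong:
        return "high"
    if top >= floor:
        return "medium"
    return "low"
```

Tying the thresholds to the Phase 2 evaluation baselines keeps the labels honest rather than arbitrary.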
Phase 3 — Index Freshness & Session Robustness — Weeks 9–11 (~80 hrs)
- Automated index refresh pipeline for documentation and community content
- Systematic session management tests across all endpoints — the #179 fix revealed this gap, now close it properly
- Session expiry and cleanup logic
- Observability improvements using the request-level logger pattern established in #172 review
Deliverable: Automated freshness pipeline, full session endpoint coverage
Phase 4 — Polish & Documentation — Weeks 12–13 (~60 hrs)
- End-to-end integration tests covering the full retrieval → response flow
- Contributor documentation for the tool interface pattern so the inconsistency issues I found don’t happen again
- Performance profiling of the retrieval pipeline under realistic query load
- Final evaluation run comparing retrieval quality at project start vs end — concrete numbers, not just “it feels better”
Deliverable: Production-ready chatbot, documented architecture, final report
Stack
Based on what mentors have confirmed in this forum:
- LLM: Open-source / self-hosted (Ollama, Llama) — no proprietary dependency
- Orchestration: LangChain
- Backend: Decoupled FastAPI — existing architecture, not changing it
- Evaluation: Offline golden dataset approach, no external API required
One Question for Mentors
During Phase 1, I’ll need to update the Retriever Agent prompt in prompts.py to pass
keywords for tools that currently don’t receive it — the gap sharma-sugurthi flagged
during PR #172 review.
Do you have a preference for how keyword extraction should be handled — should the LLM
extract keywords inline as part of the tool call, or should there be a dedicated
extraction step before tool dispatch?
This will shape the Phase 1 design and feed directly into retrieval improvements in
Phase 2.
Why This Plan
A “Continue” project isn’t about adding the most features. It’s about taking something
that was started and making it genuinely reliable and maintainable. My PRs showed me
exactly where the foundations need work before new features make sense. This proposal
is the plan to fix that foundation first, measure the results, then build on top of it
with confidence.
Looking forward to feedback from @krisstern and @berviantoleo and the rest of the
mentor team — Vutukuri Sreenivas and Chirag Gupta.
Thanks,
Ayush Singh (Flamki)