Hi everyone,
I’m Anshuman Singh, a second-year student and a gen AI + backend engineering enthusiast. I’m very interested in applying for GSoC 2026 with Jenkins, and after exploring the available ideas, the AI-Powered Chatbot for Quick Access to Jenkins Resources project really stood out to me.
Over the past few days, I’ve been digging into the repository (jenkinsci/resources-ai-chatbot-plugin) and trying to understand the current implementation end-to-end (retrieval pipeline, hybrid FAISS + BM25 setup, agentic tool routing, FastAPI backend, Jenkins UI integration, and testing infrastructure). The project already feels production-ready, and that makes it even more exciting because the GSoC work would be a meaningful upgrade rather than a prototype.
Why I’m interested in this GSoC project
From what I observed, the current hybrid RAG approach works very well for direct queries like:
- how to install a plugin
- where to configure a setting
- how to use a feature
But it struggles when the question requires relational or multi-hop reasoning, for example:
- which plugins conflict with Blue Ocean
- which plugin version fixed a security issue
- which dependencies are indirectly causing an issue
- what breaks after upgrading a plugin
These kinds of questions are extremely common for Jenkins users, and solving them would significantly improve the usefulness of the chatbot.
That’s why I’m particularly interested in the GraphRAG enhancement direction being discussed.
Similar project I’ve built (GraphRAG system)
Recently, I built a GraphRAG-based project called Nsure Graph AI, designed for domains like insurance/legal documents where relationships matter more than semantic similarity.
Repo: https://github.com/IND-Anshuman/Nsure_graph_AI
Live demo: https://nsure-graph-ai.onrender.com/agent
In that project, the key challenge was that traditional RAG retrieves isolated chunks but fails to connect controlling clauses, definitions, exceptions, and overrides. GraphRAG helped solve this by explicitly linking entities and relationships, enabling structured retrieval and multi-hop reasoning.
I mention this because Jenkins plugin knowledge (dependencies, conflicts, CVEs, fixes, configuration requirements) has a very similar structure: the answer often depends on how multiple components relate, not just on what a single document says.
My initial GSoC-oriented roadmap:
If selected, my focus would be to implement GraphRAG in a way that fits cleanly into the existing architecture without disrupting current users.
Phase 1: Knowledge Graph Construction (Data Pipeline)
- Extract entities such as:
  - Plugin, Plugin Version, CVE, Error, Configuration Key
- Extract relationships such as:
  - DEPENDS_ON, CONFLICTS_WITH, FIXED_IN, AFFECTED_BY, REQUIRES_CONFIG
- Use a hybrid extraction approach:
  - deterministic parsing for structured sources (pom.xml, plugin metadata)
  - LLM-assisted extraction for unstructured sources (plugin docs, release notes, security mentions)
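To make the deterministic half concrete, here is a rough sketch of how DEPENDS_ON edges could be pulled from a plugin's pom.xml. The triple shape and function name are just my illustration, not anything from the existing codebase:

```python
import xml.etree.ElementTree as ET

# Maven POMs declare a default XML namespace, so queries must carry it.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}

def extract_depends_on(pom_xml: str) -> list[tuple[str, str, str]]:
    """Emit (plugin, "DEPENDS_ON", dependency) triples from a pom.xml string."""
    root = ET.fromstring(pom_xml)
    plugin = root.findtext("m:artifactId", namespaces=POM_NS)
    triples = []
    for dep in root.findall(".//m:dependencies/m:dependency", POM_NS):
        dep_id = dep.findtext("m:artifactId", namespaces=POM_NS)
        if dep_id:
            triples.append((plugin, "DEPENDS_ON", dep_id))
    return triples
```

The LLM-assisted path for unstructured sources would emit triples in the same shape, so both extractors feed one merged graph.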
Phase 2: Graph-Aware Retrieval Integration
- Keep the current hybrid FAISS + BM25 retrieval as the baseline
- Add a graph expansion layer that can fetch connected context when queries are relational
- Ensure graceful fallback so the plugin never breaks if graph logic fails
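A minimal sketch of what I mean by "expansion with graceful fallback" (the entity-linking step and the graph structure are stand-ins; the real version would plug into the existing retrieval pipeline):

```python
def extract_entities(query: str, known) -> list[str]:
    # Naive placeholder for entity linking: match known node names in the query.
    return [name for name in known if name in query.lower()]

def retrieve(query: str, vector_hits: list[str], graph: dict) -> list[str]:
    """Baseline FAISS + BM25 hits, optionally enriched with graph neighbors."""
    context = list(vector_hits)
    try:
        for entity in extract_entities(query, graph):
            context.extend(graph.get(entity, []))
    except Exception:
        # Any graph-side failure silently falls back to the baseline context,
        # so existing users never see a regression.
        pass
    return context
```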
Phase 3: Precision Improvements
- Reduce graph fan-out using relation filtering + hop limits
- Prioritize controlling relationships like FIXED_IN / CONFLICTS_WITH over generic dependency expansion
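Concretely, the fan-out control could be a bounded breadth-first expansion that only follows whitelisted relations (the edge-list format and relation names here are illustrative):

```python
from collections import deque

def expand(start, edges, max_hops=2,
           allowed=frozenset({"FIXED_IN", "CONFLICTS_WITH"})):
    """BFS over (src, rel, dst) edges, bounded by hop count and relation type."""
    adj = {}
    for src, rel, dst in edges:
        adj.setdefault(src, []).append((rel, dst))
    seen, frontier, out = {start}, deque([(start, 0)]), []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # hop limit: stop expanding past this node
        for rel, dst in adj.get(node, []):
            if rel in allowed and dst not in seen:  # relation filtering
                seen.add(dst)
                out.append((node, rel, dst))
                frontier.append((dst, depth + 1))
    return out
```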
Phase 4: Evaluation and Benchmarking
- Build a small benchmark dataset focused on multi-hop queries
- Compare baseline hybrid RAG vs GraphRAG improvements
- Add non-regression checks so simple queries remain unaffected
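The comparison itself can stay simple, e.g. fact-recall over a hand-built question set (all names below are my own placeholders, not existing test infrastructure):

```python
def evaluate(retriever, benchmark) -> float:
    """Fraction of questions whose expected facts all appear in retrieved context.

    `benchmark` is a list of (query, expected_facts) pairs; `retriever`
    returns a list of context strings for a query.
    """
    hits = 0
    for query, expected in benchmark:
        context = " ".join(retriever(query))
        if all(fact in context for fact in expected):
            hits += 1
    return hits / len(benchmark)
```

Running the same benchmark against the baseline retriever and the graph-enabled one gives the Phase 4 comparison, and asserting the graph-enabled score never drops below baseline on the simple-query subset is the non-regression check.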
Questions / clarifications
I’d love guidance on a few points before I finalize a proposal:
- Graph artifact strategy: Would it be preferred to generate the knowledge graph in CI and ship it as an artifact (to avoid Jenkins startup overhead), or is runtime generation expected?
- Version-specific relationships: For Phase 1, should edges like CONFLICTS_WITH and FIXED_IN be version-specific, or is plugin-level granularity acceptable initially?
- Preferred data sources for Phase 1: Is plugin documentation + manifests sufficient for the first milestone, or should Jenkins security advisories / CVE feeds be included early?
- Offline-first expectations: Since the plugin supports local inference via llama.cpp, should graph extraction also be designed to run fully offline by default (with optional hosted LLM providers)?
Looking forward to learning from the community and contributing meaningfully.