[GSoC 2026] Continue AI-Powered Chatbot - Contributor Introduction + Questions

Hi Jenkins community,

I’m Guna Palanivel, applying for GSoC 2026: Continue AI-Powered Chatbot for Quick Access to Jenkins Resources (to clarify, this is the resources-ai-chatbot-plugin continuation, not the user workflow guidance project).

Background

I’ve been contributing to jenkinsci/resources-ai-chatbot-plugin since January 2026:

  • 5 merged PRs: Crawler fix (#62), WebSocket streaming (#68), file upload (#61), auth cleanup (#158), TBD
  • 4 PRs in review: Jenkins auth (#105), streaming UI (#91), pipeline config (#113), E2E tests (#261)
  • 13 issues filed: Memory serialization (#207), dead reformulation loop (#191), config routing (#221), and more

Full history: https://github.com/jenkinsci/resources-ai-chatbot-plugin/pulls?q=is:pr+author:GunaPalanivel

Proposal Focus

My proposal addresses the incomplete GSoC 2025 work in three phases:

Phase 1 (Weeks 1-4): Stabilize Core

  • Fix 9 identified bugs (StackOverflow stub, relevance scoring, dead code)
  • Add E2E test framework (currently 0% → 80%+ coverage)
  • Weekly measurement: test coverage (60% → 85%) and P95 latency tracking (a measurement sketch follows this list)
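
To make the weekly latency measurement concrete, here is a minimal Python sketch of how P95 could be computed from recorded per-request timings; the timings.json file name and its format are hypothetical placeholders for whatever the chatbot API ends up logging.

```python
import json
import statistics

# Hypothetical input: a JSON list of per-request latencies in milliseconds,
# e.g. [812.4, 1033.9, 955.1, ...] collected from the chatbot API.
with open("timings.json") as f:
    latencies_ms = json.load(f)

# statistics.quantiles with n=100 returns the 99 percentile cut points;
# index 94 is the 95th percentile, i.e. the latency 95% of requests stay under.
p95 = statistics.quantiles(latencies_ms, n=100)[94]
print(f"P95 latency: {p95:.1f} ms over {len(latencies_ms)} requests")
```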

Phase 2 (Weeks 5-8): Agentic Mode + Multi-Turn

  • Reflection-based retrieval (fix issue #191: dead query reformulation loop)
  • Sliding window memory (fix issue #207: unbounded growth); both mechanisms are sketched after this list
  • Weekly measurement: Reflection convergence rate, memory efficiency
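
To make the Phase 2 mechanisms concrete, here is a minimal Python sketch of a bounded conversation memory and a reflection loop with a hard iteration cap. All names here (SlidingWindowMemory, reflective_retrieve, and the retrieve/is_relevant/reformulate callables) are hypothetical illustrations for this post, not existing code in the plugin.

```python
from collections import deque


class SlidingWindowMemory:
    """Keep only the last `max_turns` exchanges so conversation state
    cannot grow without bound (the concern behind issue #207)."""

    def __init__(self, max_turns: int = 10):
        self.turns = deque(maxlen=max_turns)  # oldest turns are dropped automatically

    def add(self, user_msg: str, bot_msg: str) -> None:
        self.turns.append((user_msg, bot_msg))

    def as_context(self) -> str:
        return "\n".join(f"User: {u}\nBot: {b}" for u, b in self.turns)


def reflective_retrieve(query, retrieve, is_relevant, reformulate, max_rounds=3):
    """Reflection-based retrieval with a hard iteration bound, so a bad
    reformulation cannot spin forever (the concern behind issue #191)."""
    docs = []
    for round_no in range(1, max_rounds + 1):
        docs = retrieve(query)
        if is_relevant(query, docs):      # judge step: are these chunks good enough?
            return docs, round_no
        query = reformulate(query, docs)  # otherwise rewrite the query and retry
    return docs, max_rounds               # best effort after the cap
```

The round count returned here is what I would aggregate into the weekly reflection convergence rate, and deque(maxlen=...) is the simplest way to keep memory bounded; summarizing evicted turns instead of dropping them is a possible refinement.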

Phase 3 (Weeks 9-12): Evaluation Pipeline

  • Dataset: 100+ Q/A pairs (JSON + CSV) covering Jenkins Core, Plugins, and Errors categories
  • Framework: Ragas (Faithfulness >0.85, Context Recall >0.80, Answer Relevance >0.75); a wiring sketch follows this list
  • Judge LLM: Mistral 7B Instruct Q5_K_M (avoids self-evaluation bias relative to the chatbot’s Q4_K_M model)
  • CI trigger: run-eval label on PRs (not every push)
  • Jenkins auth integration (#78)
  • Dataset versioning: weekly CI check for stale URLs, quarterly refresh
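
To show how the Phase 3 gate could be wired, here is a minimal sketch against the Ragas 0.1-style Python API (evaluate plus the faithfulness, context_recall, and answer_relevancy metrics). The dataset path, record fields, and threshold handling are my own assumptions, and the judge-LLM plumbing (local Mistral Q5_K_M, or the feature-flagged Groq fallback) is omitted because it depends on how the local model gets wrapped.

```python
import json

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_recall, faithfulness

# Hypothetical dataset file: a list of records with the fields Ragas expects, e.g.
# [{"question": "...", "answer": "...", "contexts": ["..."], "ground_truth": "..."}, ...]
with open("eval/dataset.json") as f:
    records = json.load(f)

dataset = Dataset.from_list(records)

# By default Ragas scores with an OpenAI judge; pointing it at a local
# Mistral judge requires passing an explicit llm= wrapper (not shown here).
result = evaluate(dataset, metrics=[faithfulness, context_recall, answer_relevancy])

# Aggregate per-metric means and compare them against the proposal's gates.
scores = result.to_pandas()[["faithfulness", "context_recall", "answer_relevancy"]].mean()
thresholds = {"faithfulness": 0.85, "context_recall": 0.80, "answer_relevancy": 0.75}
failed = [name for name, gate in thresholds.items() if scores[name] < gate]
if failed:
    raise SystemExit(f"Evaluation gate failed for: {', '.join(failed)}")
print("All evaluation thresholds met:", scores.round(3).to_dict())
```

In CI, this script would only run on PRs carrying the run-eval label, so the judge-LLM cost is paid only when a reviewer opts in.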

Questions for Mentors

  1. Phase-by-phase measurement: I saw @berviantoleo's feedback about evaluating in each phase. I’ve added weekly metrics (coverage %, latency, reflection quality). Does this meet expectations?

  2. Judge LLM quality: Is a local Mistral Q5_K_M sufficient as the judge, or should I target 13B+ parameter models? The Groq API is feature-flagged as a fallback for labeled PRs.

  3. Dataset validation: I’m treating 100 queries as the minimum. Is mentor validation of a subset (e.g., 20 queries) feasible during community bonding?

  4. Stretch goal: Issue #69 (log analysis agent) involves knowledge graph construction. Should I defer it to post-GSoC, or attempt it in Week 13 if I’m ahead of schedule?

Looking forward to feedback from @krisstern as well.

Draft proposal: submitted via a Google Docs link

Thanks,
Guna

I won’t push any candidates to achieve very high test coverage. I consider 80% nice to have, but not mandatory. It’s not easy to ensure all of the functionality is testable.

I can’t answer the other questions right now. For detailed feedback, I will review the draft proposal once I receive the link you submitted.

Hi @berviantoleo, thanks for the reply! That makes total sense. Testing LLM/RAG pipelines is definitely tricky, so I’ll focus on getting the core features stable and reliable in each phase rather than chasing a perfect coverage number.

No rush at all on the proposal :slight_smile: