Hi Jenkins community,
I’m Guna Palanivel, applying for GSoC 2026: Continue AI-Powered Chatbot for Quick Access to Jenkins Resources (to clarify, this is the resources-ai-chatbot-plugin continuation, not the user workflow guidance project).
Background
I’ve been contributing to jenkinsci/resources-ai-chatbot-plugin since January 2026:
- 5 merged PRs: Crawler fix (#62), WebSocket streaming (#68), file upload (#61), auth cleanup (#158), TBD
- 4 PRs in review: Jenkins auth (#105), streaming UI (#91), pipeline config (#113), E2E tests (#261)
- 13 issues filed: Memory serialization (#207), dead reformulation loop (#191), config routing (#221), and more
Full history: https://github.com/jenkinsci/resources-ai-chatbot-plugin/pulls?q=is:pr+author:GunaPalanivel
Proposal Focus
My proposal addresses the incomplete GSoC 2025 work with three phases:
Phase 1 (Weeks 1-4): Stabilize Core
- Fix 9 identified bugs (StackOverflow stub, relevance scoring, dead code)
- Add an E2E test framework (E2E coverage: currently 0%, targeting 80%+)
- Weekly measurements: overall test coverage (60% → 85%), P95 latency tracking
Phase 2 (Weeks 5-8): Agentic Mode + Multi-Turn
- Reflection-based retrieval (fix issue #191: query reformulation loop)
- Sliding window memory (fix issue #207: unbounded growth)
- Weekly measurement: Reflection convergence rate, memory efficiency
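To make the Phase 2 memory fix concrete, here is a minimal sketch of what I mean by sliding-window memory for issue #207. The class name and method names are hypothetical illustrations, not the plugin's actual API; the point is that a bounded deque evicts the oldest turn automatically, so memory can no longer grow without limit.

```python
from collections import deque

class SlidingWindowMemory:
    """Hypothetical sketch of bounded conversation memory (issue #207)."""

    def __init__(self, max_turns: int = 10):
        # deque with maxlen silently evicts the oldest turn when full,
        # so total retained history is bounded by max_turns
        self.turns = deque(maxlen=max_turns)

    def add_turn(self, user_msg: str, assistant_msg: str) -> None:
        self.turns.append((user_msg, assistant_msg))

    def as_context(self) -> str:
        # Flatten the retained turns into a prompt-context string
        return "\n".join(f"User: {u}\nAssistant: {a}" for u, a in self.turns)

mem = SlidingWindowMemory(max_turns=2)
for i in range(5):
    mem.add_turn(f"q{i}", f"a{i}")
assert len(mem.turns) == 2  # only the two newest turns survive
```

The "memory efficiency" metric above would then reduce to verifying that retained-turn count stays at or below the configured window across long conversations.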
Phase 3 (Weeks 9-12): Evaluation Pipeline
- Dataset: 100+ Q/A pairs (JSON + CSV), Jenkins Core/Plugins/Errors
- Framework: Ragas (Faithfulness >0.85, Context Recall >0.80, Answer Relevance >0.75)
- Judge LLM: Mistral 7B Instruct Q5_K_M (avoids self-evaluation bias vs chatbot’s Q4_K_M)
- CI trigger: run-eval label on PRs (not every push)
- Jenkins auth integration (#78)
- Dataset versioning: weekly CI check for stale URLs, quarterly refresh
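As a sketch of how the run-eval CI gate could consume the Phase 3 thresholds: the snippet below is plain Python (not the Ragas API itself), assuming the eval job has already produced per-metric scores; the metric names and threshold values mirror the targets listed above, and the `gate` function is a hypothetical helper.

```python
# Thresholds named in the proposal (Ragas-style metric names)
THRESHOLDS = {
    "faithfulness": 0.85,
    "context_recall": 0.80,
    "answer_relevancy": 0.75,
}

def gate(scores: dict[str, float]) -> list[str]:
    """Return the metrics that missed their threshold; empty list means pass."""
    return [m for m, t in THRESHOLDS.items() if scores.get(m, 0.0) < t]

# Example: context recall misses its 0.80 target, so CI would fail this run
failures = gate({"faithfulness": 0.90,
                 "context_recall": 0.70,
                 "answer_relevancy": 0.80})
assert failures == ["context_recall"]
```

Failing the labeled-PR job only when `gate` returns a non-empty list keeps the expensive judge-LLM run off every push while still blocking regressions.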
Questions for Mentors
- Phase-by-phase measurement: I saw @berviantoleo's feedback about evaluating in each phase, and I've added weekly metrics (coverage %, latency, reflection quality). Does this meet expectations?
- Judge LLM quality: Is a local Mistral Q5_K_M sufficient as a judge, or should I target 13B+ parameter models? The Groq API is feature-flagged as a fallback for labeled PRs.
- Dataset validation: With 100 queries as the minimum, is mentor validation of a subset (e.g., 20 queries) feasible during community bonding?
- Stretch goal: Issue #69 (log analysis agent) involves knowledge graph construction. Should I defer it to post-GSoC, or attempt it in Week 13 if ahead of schedule?
Looking forward to feedback from @krisstern as well.
Draft proposal: submitted via a Google Docs link.
Thanks,
Guna