Evaluating GenAI responses in production

Kumar Rangarajan (~kumar9)




One of the biggest challenges with GenAI has been dealing with the uncertainties associated with its generative nature. This is not just about hallucination, but also about whether an answer is relevant in the context of the app and its restrictions. The same tech that makes it magical is also central to the lack of trust that enterprises have in it. This is especially true for one of the biggest use cases for GenAI - AI Assistants or Copilots that can be added to apps.

How do you evaluate what is actually happening in production? Are the responses to users' queries appropriate? How do you keep improving the quality of the system? And most importantly, how do you do all this in a way that is automated and scales across millions of queries?

In this talk, we will cover a tool we developed called “G-eval”, inspired by a paper from Microsoft on using LLMs to evaluate the output of other LLMs. The tool was used to evaluate the Copilot built using our CONVA.ai platform, which is used by popular e-commerce apps like Tata Neu. The talk will cover why we created the tool, the challenges we faced with it, and how it can be adapted to other use cases easily.
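To make the core idea concrete: an LLM-as-judge evaluator sends the user's query and the assistant's response to a second LLM with a scoring rubric, then parses the score. The sketch below is a minimal, hypothetical illustration of that pattern (not the actual G-eval implementation); `call_llm` stands in for whatever chat-completion API you use, and the prompt and score format are assumptions for the example.

```python
import re
from typing import Callable

# Hypothetical rubric prompt; a real evaluator would include app context,
# restrictions, and chain-of-thought instructions per the G-Eval paper.
JUDGE_PROMPT = """You are evaluating an AI shopping assistant.
User query: {query}
Assistant response: {response}
Rate the response's relevance to the query on a scale of 1-5.
Reply in the form "Score: <n>"."""

def judge_response(query: str, response: str,
                   call_llm: Callable[[str], str]) -> int:
    """Ask a judge LLM to rate a response, then parse the 1-5 score."""
    raw = call_llm(JUDGE_PROMPT.format(query=query, response=response))
    match = re.search(r"Score:\s*([1-5])", raw)
    if not match:
        raise ValueError(f"Unparseable judge output: {raw!r}")
    return int(match.group(1))

# Stub judge so the sketch runs offline; in production this would be
# a call to an actual LLM API.
def fake_llm(prompt: str) -> str:
    return "Score: 4"

print(judge_response("show me red running shoes",
                     "Here are some red running shoes in your size.",
                     fake_llm))  # prints 4
```

Injecting the model call as a parameter keeps the scoring logic testable and lets the same evaluator run against different judge models.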


Basic understanding of Python and GenAI prompt engineering

Speaker Info:

Co-founder @ Slang Labs. Previously worked at Meta (Facebook) in the Bay Area after they acquired my previous startup, Little Eye Labs. Before that, I worked at companies like HP, Rational/IBM, GE, BlueCoat & S7.

I am passionate about Developer tools and making tech easier to use in general.

Speaker Links:

https://voicecon.net/voice-business-strategy/the-rise-of-voice-commerce-in-india/
https://www.youtube.com/watch?v=UpoM7mC9a1c
https://medium.com/@kumarrangarajan

Section: Artificial Intelligence and Machine Learning
Type: Talk
Target Audience: Intermediate
Last Updated: