JuliaCon 2024

Evaluating LLM Frameworks
2024-07-11, Else (1.3)

Large Language Models are everywhere these days. But how can you objectively evaluate whether a model or a prompt is performing properly? Let's dive into the world of LLM evaluation frameworks!


At CM.com we released a new GenAI product in 2023 (built in Python), which is currently used by over 50 clients in various countries. GenAI is a chatbot that leverages the power of LLMs while protecting against their common pitfalls, such as incorrectness & inappropriateness, by using a Retrieval-Augmented Generation (RAG) framework.
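To give a flavour of the RAG principle, the sketch below retrieves the passages most relevant to a question and grounds the prompt in them. This is only a minimal illustration, not CM.com's implementation: the knowledge base, the toy lexical similarity, and helper names like retrieve and build_prompt are all made up, and a real system would use embeddings and an actual LLM call.

```python
# Minimal sketch of the RAG idea: retrieve relevant passages first,
# then let the LLM answer strictly from that context.
# All names and data here are illustrative, not a real implementation.

from collections import Counter
import math

knowledge_base = [
    "Our support desk is open on weekdays from 09:00 to 17:00.",
    "Refunds are processed within five business days.",
    "The mobile app supports iOS and Android.",
]

def cosine_similarity(a: str, b: str) -> float:
    """Toy lexical similarity; a production system would use embeddings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k passages most similar to the question."""
    return sorted(knowledge_base, key=lambda p: cosine_similarity(question, p), reverse=True)[:k]

def build_prompt(question: str) -> str:
    """Ground the model in retrieved context to curb incorrect or off-topic replies."""
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("When is support available?"))
```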

As newer & better models rapidly arise, and our clients continue to provide feedback on the product, our own product development cannot lag behind. But how do we know whether switching from, e.g., ChatGPT to Gemini or Llama improves the replies for all conversations? And how can you do prompt optimization if you don't know what you're optimizing against? To help us maintain our current chatbot quality while investigating other models and prompts, we have developed an evaluation framework in Python that can objectively evaluate several scenarios across a variety of metrics.
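As a rough illustration of what such a comparison could look like (the metric, data, and names below are invented and far simpler than a production evaluation framework), one can score each model variant's replies against reference replies and average the result per model:

```python
# Illustrative sketch: score two hypothetical model variants against
# reference replies using a simple token-level F1 metric.
# A real framework would use many metrics (faithfulness, relevance, ...).

def token_f1(candidate: str, reference: str) -> float:
    """Token-level F1 between a candidate reply and a reference reply."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    common = len(set(cand) & set(ref))
    if common == 0:
        return 0.0
    precision, recall = common / len(cand), common / len(ref)
    return 2 * precision * recall / (precision + recall)

def evaluate(replies: dict[str, list[str]], references: list[str]) -> dict[str, float]:
    """Average the metric over all test conversations, per model."""
    return {
        model: sum(token_f1(c, r) for c, r in zip(cands, references)) / len(references)
        for model, cands in replies.items()
    }

references = ["Refunds take five business days.", "Support is open on weekdays."]
replies = {
    "model_a": ["Refunds take five business days.", "We are open weekdays."],
    "model_b": ["Please contact support.", "Support is open on weekdays."],
}
print(evaluate(replies, references))  # higher score = closer to the reference replies
```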

During these 30 minutes, I'll explain the principle behind RAG, highlight the rapid development happening in this field across the globe, give examples of several evaluation metrics, and finally explain how we use these to move forward with our product. The talk is most interesting for Data Scientists and ML/AI/Prompt Engineers, but can be followed by anyone with some background knowledge on LLMs.