Devconf.US

Who Watches the Watchmen? Understanding LLM Benchmark Quality
08-16, 11:50–12:25 (US/Eastern), Conference Auditorium (capacity 260)

The ecosystem of Large Language Models (LLMs) is extremely active, with new models being released every week. LLM leaderboards have emerged as a popular resource on model hubs such as Hugging Face, where purveyors of new models can see how they measure up against the competition, and model users can evaluate new alternatives for their business needs.

Leaderboards rank models using one or more popular LLM benchmarks: data sets of queries and expected answers that LLMs can be tested against. But how well do these benchmarks really measure model effectiveness? There are many ways for a user to ask a question, and many ways to express a correct (or incorrect) answer! There are also requirements for LLM outputs beyond factual correctness, such as avoiding responses that harm human users and handling socially sensitive topics appropriately. Measuring model quality in any of these ways must contend with the practically infinite variations of human language. How robust is a model with respect to changes in a query? How well does a benchmark cover the full range of conceivable human inputs? Does a good score on a benchmark translate into good performance for your specific application?
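To make the concern concrete, here is a minimal sketch (not taken from the talk) of the simplest kind of benchmark scoring, exact string match against an expected answer. The question set and the model_answer callable are hypothetical stand-ins; real benchmarks and harnesses vary widely.

    # Minimal sketch (hypothetical): exact-match scoring, the simplest way a
    # benchmark harness might grade model answers against expected answers.
    benchmark = [
        {"query": "What is the capital of Australia?", "expected": "Canberra"},
        {"query": "Who wrote 'Moby-Dick'?", "expected": "Herman Melville"},
    ]

    def exact_match_score(model_answer, items):
        """Return the fraction of items where the model's answer equals the expected answer."""
        hits = 0
        for item in items:
            # model_answer is a stand-in for any function that calls an LLM
            answer = model_answer(item["query"])
            if answer.strip().lower() == item["expected"].strip().lower():
                hits += 1
        return hits / len(items)

    # A model that answers "The capital is Canberra." is correct but scores 0 here,
    # which illustrates why a single leaderboard number can understate (or misstate)
    # how well a model serves a specific application.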

In this talk, Erik Erlandson will take the audience on a tour of the multiple dimensions of model performance and quality, and the popular benchmarks for measuring them. He will explain how benchmarks work, what they are measuring, and what they might not be measuring. Attendees will leave armed with the knowledge to go beyond the LLM leaderboards and ask smart questions about the models they are choosing.

See also: Slide Deck for This Talk (2.9 MB)

Erik Erlandson works in Red Hat's Emerging Technologies group, where he leads a team of data scientists and software engineers who evaluate new technologies at the intersection of data science, AI, and cloud-native development.