PyCon DE & PyData 2025

Is your LLM any good at writing? Benchmarking on creative writing and editing tasks
2025-04-25, Platinum3

Many LLM benchmarks focus on reasoning and coding tasks. These are exciting tasks! But the majority of LLM usage is still in writing- and editing-related tasks, and there's a surprising lack of benchmarks for these.

In this talk you'll learn what it took to create a writing benchmark, and which model performs best!


Large Language Models (LLMs) have demonstrated impressive capabilities in generating human-quality text, but how do we objectively measure their performance on complex writing and editing tasks? This talk explores the challenges of benchmarking LLMs for these tasks and presents a novel framework for evaluating their effectiveness.

The talk will provide practical guidance on how to evaluate and compare the performance of different LLMs. Basic familiarity with language models is required for this talk.

Outline:

  1. Introduction
  • Briefly introduce LLMs and their growing role in writing and editing.
  • Highlight the need for standardized benchmarks to compare and improve LLM performance. The majority of LLM usage is still on writing tasks!*

*Source: https://arxiv.org/pdf/2405.01470

  2. Challenges in Benchmarking LLMs for Writing and Editing:
  • Defining objective metrics for subjective tasks like writing quality and editing accuracy.
  • Addressing the issue of bias in training data and its impact on evaluation.
  • Accounting for the diverse range of writing and editing tasks.
  3. A Framework for Evaluating LLM Performance:
  • Proposing a set of key metrics that encompass fluency, coherence, accuracy, and style.
  • Introducing a methodology for constructing diverse and representative test datasets.
  4. Case Studies and Results:
  • Showcasing examples of how the proposed framework can be applied to evaluate different LLMs.
  • Presenting findings from recent benchmarking studies and discussing their implications.
  5. Future Directions:
  • Exploring the potential of LLMs to assist with increasingly complex writing and editing tasks.
  • Identifying areas for future research and development in LLM benchmarking.
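
As a rough illustration of the evaluation-framework item in the outline, the sketch below combines per-metric scores (fluency, coherence, accuracy, style) into one number per model and ranks the models. It is not the speaker's method: the equal weights, the 0-to-1 score scale, the function names `aggregate` and `rank_models`, the model names, and all scores are hypothetical placeholders, and the abstract does not say how the individual metric scores would be obtained.

```python
from statistics import mean

# Metric names come from the outline; the equal weights are an assumption.
METRICS = ("fluency", "coherence", "accuracy", "style")
WEIGHTS = {"fluency": 0.25, "coherence": 0.25, "accuracy": 0.25, "style": 0.25}


def aggregate(scores: dict[str, float]) -> float:
    """Weighted sum of per-metric scores, each assumed to lie in [0, 1]."""
    missing = [m for m in METRICS if m not in scores]
    if missing:
        raise ValueError(f"missing metrics: {missing}")
    return sum(WEIGHTS[m] * scores[m] for m in METRICS)


def rank_models(results: dict[str, list[dict[str, float]]]) -> list[tuple[str, float]]:
    """Average the aggregate score over each model's test items, best first."""
    averaged = {
        model: mean(aggregate(item) for item in items)
        for model, items in results.items()
    }
    return sorted(averaged.items(), key=lambda pair: pair[1], reverse=True)


# Made-up placeholder scores for two hypothetical models, not real results.
example = {
    "model_a": [
        {"fluency": 0.8, "coherence": 0.6, "accuracy": 1.0, "style": 0.6},
        {"fluency": 0.4, "coherence": 0.6, "accuracy": 0.8, "style": 0.2},
    ],
    "model_b": [
        {"fluency": 0.5, "coherence": 0.5, "accuracy": 0.5, "style": 0.5},
    ],
}

ranking = rank_models(example)
for model, score in ranking:
    print(f"{model}: {score:.3f}")
```

A weighted average is only one possible design choice. Because writing quality is subjective, a benchmark might instead report each metric separately or use pairwise comparisons between model outputs.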

Expected audience expertise: Domain:

Intermediate

Expected audience expertise: Python:

None

AI engineer at Typetone, where I'm taming LLMs to automate end-to-end marketing.

We help unburden SMEs and solopreneurs from doing their content marketing, and this task is still surprisingly hard for LLMs to solve!

In past lives, I personalized marketing at ING as a data scientist and ran a non-profit in Kyrgyzstan.