Mercari LLM Benchmark: Building a Practical LLM Benchmark for Your Business PyCon Hong Kong 2025

Mercari LLM Benchmark: Building a Practical LLM Benchmark for Your Business
2025-10-11 11:30–12:00, Track C (LT-16)
Language: English

Every new LLM comes with glowing performance on English-centric benchmarks. This makes it difficult to predict how that performance will translate to business use cases in other languages or specialized domains. At Mercari, Japan's largest C2C marketplace, we faced this exact problem with Japanese. Inspired by Kagi, Wolfram, and Aider benchmarks, we are building our own continuously updated internal benchmark to evaluate major LLMs on unpolluted, business-critical tasks that models have not seen in their training data. The talk will cover task design, an evaluation pipeline in Python, a comparison of the latest models on accuracy, cost, and latency, and practical lessons for creating your own benchmark tailored to your needs.
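As a rough illustration of what such an evaluation pipeline can look like, here is a minimal sketch using only the Python standard library. It is not the Mercari pipeline: the `Task` fields, the `results` table layout, the exact-match scoring rule, and the placeholder tasks are all assumptions made for this example, and `stub_model` is a hypothetical stand-in for a real LLM call.

```python
import sqlite3
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Task:
    """One benchmark item: a prompt and the exact answer expected (assumed schema)."""
    task_id: str
    prompt: str
    expected: str


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; it always answers 'yes'."""
    return "yes"


def run_benchmark(
    conn: sqlite3.Connection,
    model_name: str,
    ask: Callable[[str], str],
    tasks: list[Task],
) -> float:
    """Run every task, store one row per result, and return the accuracy."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "model TEXT, task_id TEXT, answer TEXT, correct INTEGER, latency_s REAL)"
    )
    n_correct = 0
    for task in tasks:
        start = time.perf_counter()
        answer = ask(task.prompt)
        latency = time.perf_counter() - start
        # Exact match after trimming and lowercasing (an assumed scoring rule).
        correct = answer.strip().lower() == task.expected.strip().lower()
        n_correct += correct
        conn.execute(
            "INSERT INTO results VALUES (?, ?, ?, ?, ?)",
            (model_name, task.task_id, answer, int(correct), latency),
        )
    conn.commit()
    return n_correct / len(tasks) if tasks else 0.0


# Placeholder tasks invented for this sketch, not taken from any real benchmark.
tasks = [
    Task("t1", "Reply with the single word: yes", "yes"),
    Task("t2", "Reply with the single word: no", "no"),
]
conn = sqlite3.connect(":memory:")
accuracy = run_benchmark(conn, "stub-model", stub_model, tasks)
print(f"accuracy={accuracy:.2f}")
```

Passing the model as a plain callable keeps the loop independent of any one provider, so a real client function can replace `stub_model` without touching the scoring or storage code. Storing one row per task and model, rather than only the aggregate, makes it possible to re-score or compare latency later.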

Outline
1. Motivation: Why Standard Benchmarks Fail in Non-English Contexts. Speaker introduction. (2 minutes)
2. Task Taxonomy and Dataset Curation: Creating unpolluted tasks that mirror real business problems. Examples of tasks. (3 minutes)
3. Python Evaluation Pipeline: A robust, automated pipeline with LiteLLM, Pydantic, SQLite, etc. (9 minutes)
4. Methodology and Metrics: Ensuring fair, unpolluted evaluation beyond simple accuracy. (3 minutes)
5. Results: A snapshot of how the latest major LLMs perform on our private, business-critical tasks. (3 minutes)
6. Surprising Trade-offs: Analyzing the Pareto frontier of accuracy, cost, and speed for production systems. (3 minutes)
7. Building Your Own Benchmark: Essential Python libraries, design patterns, and practical lessons. (5 minutes)
8. Conclusion and Q&A (2 minutes)
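
The Pareto frontier named in item 6 can be sketched as follows: a model is on the frontier if no other model is at least as good on accuracy, cost, and latency while being strictly better on at least one of them. The `ModelScore` fields, the model names, and all numbers below are made up for illustration and are not results from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelScore:
    name: str
    accuracy: float   # higher is better
    cost_usd: float   # lower is better
    latency_s: float  # lower is better


def dominates(a: ModelScore, b: ModelScore) -> bool:
    """True if a is no worse than b on every axis and strictly better on one."""
    no_worse = (
        a.accuracy >= b.accuracy
        and a.cost_usd <= b.cost_usd
        and a.latency_s <= b.latency_s
    )
    strictly_better = (
        a.accuracy > b.accuracy
        or a.cost_usd < b.cost_usd
        or a.latency_s < b.latency_s
    )
    return no_worse and strictly_better


def pareto_frontier(scores: list[ModelScore]) -> list[ModelScore]:
    """Keep only the models that no other model dominates."""
    return [s for s in scores if not any(dominates(o, s) for o in scores)]


# Illustrative numbers only; they are not measurements of any real model.
scores = [
    ModelScore("model-a", accuracy=0.90, cost_usd=10.0, latency_s=4.0),
    ModelScore("model-b", accuracy=0.85, cost_usd=2.0, latency_s=1.5),
    ModelScore("model-c", accuracy=0.80, cost_usd=3.0, latency_s=2.0),
]
frontier_names = [s.name for s in pareto_frontier(scores)]
print(frontier_names)  # ['model-a', 'model-b']
```

Here `model-c` drops out because `model-b` beats it on all three axes, while `model-a` and `model-b` both remain because each wins on at least one axis. The pairwise check is quadratic in the number of models, which is fine for the handful of models a benchmark like this typically compares.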

Audience
Python developers, ML/DS engineers, tech leads, and product managers using Large Language Models for real-world business applications, especially in non-English or specialized domains.

Prashant Anand

Prashant is a Staff ML Engineer at Mercari, Inc. in Tokyo, Japan, where he has spent over 5 years building scalable, high-performance production ML systems. He leads the exploration and application of machine learning, NLP, and LLMs to transform customer support experiences at one of Japan's largest e-commerce platforms.

With a B.Tech from IIT Delhi (2019), Prashant brings deep technical expertise in bridging cutting-edge ML research with real-world production systems. He is passionate about sharing knowledge with the Python community, having previously spoken at PyCon JP 2024 and PyCon APAC 2023.

Away from the keyboard, you’ll find him obsessing over specialty coffee and perfecting his hand-drip and French-press brews.

Kanta Suga

I'm a Machine Learning Engineer at Mercari, Inc.
Education: Master of Engineering, Bioinformatics.
