Mercari LLM Benchmark: Building a Practical LLM Benchmark for Your Business
Every new LLM arrives with glowing scores on English-centric benchmarks, which makes it difficult to predict how that performance will translate to business use cases in other languages or specialized domains. At Mercari, Japan's largest C2C marketplace, we faced this exact problem with Japanese. Inspired by the Kagi, Wolfram, and Aider benchmarks, we are building our own continuously updated internal benchmark to evaluate major LLMs on uncontaminated, business-critical tasks absent from their training data. The talk will cover task design, an evaluation pipeline in Python, a comparison of the latest models on accuracy, cost, and latency, and practical lessons for creating your own benchmark tailored to your needs; a minimal sketch of such a pipeline follows.
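To make the "evaluation pipeline in Python" concrete, here is a minimal sketch of what one can look like. Everything in it is an assumption for illustration, not Mercari's actual implementation: the `tasks.jsonl` file of `{"prompt": ..., "answer": ...}` records, the `ask_model` callable that wraps whatever API client you use, exact-match scoring, and the flat `usd_per_task` cost estimate are all hypothetical placeholders.

```python
"""Minimal sketch of an LLM benchmark pipeline (hypothetical task format,
scoring, and pricing; swap in your own model client and cost model)."""

import json
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Result:
    correct: int
    total: int
    total_seconds: float
    est_cost_usd: float


def evaluate(
    tasks_path: Path,
    ask_model: Callable[[str], str],   # wraps whatever LLM client you use
    usd_per_task: float = 0.001,       # assumed flat per-call price
) -> Result:
    """Run every task through the model and score exact-match accuracy."""
    correct = total = 0
    start = time.perf_counter()
    with tasks_path.open(encoding="utf-8") as f:
        for line in f:                          # one JSON object per line
            task = json.loads(line)             # {"prompt": ..., "answer": ...}
            prediction = ask_model(task["prompt"]).strip()
            correct += prediction == task["answer"]
            total += 1
    elapsed = time.perf_counter() - start
    return Result(correct, total, elapsed, total * usd_per_task)


if __name__ == "__main__":
    # Stub model for demonstration; replace with a real API call.
    result = evaluate(Path("tasks.jsonl"), ask_model=lambda prompt: "dummy")
    print(
        f"accuracy={result.correct / result.total:.1%} "
        f"latency/task={result.total_seconds / result.total:.2f}s "
        f"cost~=${result.est_cost_usd:.3f}"
    )
```

Keeping the model behind a plain callable means the same harness can compare models from different providers on accuracy, wall-clock latency, and estimated cost without changing the task loop.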