PyCon DE & PyData 2025

Generative-AI: Usecase-Specific Evaluation of LLM-powered Applications
2025-04-23, Platinum3

This talk addresses the critical need for use-case-specific evaluation of Large Language Model (LLM)-powered applications, highlighting the limitations of generic evaluation benchmarks in capturing domain-specific requirements. It proposes a workflow for designing evaluation pipelines that optimize LLM-based applications, consisting of three key activities: human-expert evaluation and benchmark dataset curation, creation of evaluation agents, and alignment of these agents with human evaluations using the curated datasets. The workflow produces two key outcomes: a curated benchmark dataset for testing LLM applications and an evaluation agent that scores their responses. The presentation further discusses the limitations of this approach and best practices that enhance the reliability of the evaluations, ensuring LLM applications are better tailored to their specific use cases.


Large Language Models (LLMs) are a transformative technology, enabling a wide array of applications, from content generation to interactive chatbots, and they form the core of LLM-powered applications. A wide variety of LLMs is available, and the LLM community evaluates their performance with independent, generic benchmarks. The requirements and domain specificity of the use cases behind LLM applications render this generic evaluation insufficient for revealing performance issues. Use-case-specific performance evaluation therefore becomes a necessary component in the design and continuous development of LLM applications.
In this talk, we address the need for use-case-specific evaluation of LLM applications by proposing a workflow for creating evaluation models that support the selection and design optimization of LLM applications. The workflow comprises three main activities:
1) Human-expert evaluation of LLM applications and benchmark dataset curation
2) Creation of evaluation agents
3) Alignment of the evaluation agents with human evaluation on the curated dataset
It leads to two concrete outcomes:
1) A curated benchmark dataset, against which the LLM applications are tested.
2) An evaluation agent: the scoring model that automatically evaluates the responses of the LLM applications.
The talk elaborates on the workflow, its limitations, and best practices that increase the reliability of the evaluations in light of those limitations.
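
As an illustration of the two outcomes, the sketch below shows a toy evaluation agent (an LLM-as-judge scorer) and an alignment check against human-expert scores on a small curated dataset. This is a minimal sketch under stated assumptions: the `call_llm` stub, the rubric prompt, and the dataset fields are illustrative and not the implementation presented in the talk.

```python
# Minimal sketch of a curated benchmark dataset plus an LLM-as-judge evaluation
# agent whose scores are checked for alignment with human-expert scores.
# `call_llm`, the rubric prompt, and the dataset fields are illustrative
# assumptions, not the talk's actual implementation.

from statistics import mean

JUDGE_PROMPT = (
    "You are a domain-expert evaluator. Rate the response to the question "
    "against the reference answer on a scale of 1 (poor) to 5 (excellent). "
    "Return only the number.\n"
    "Question: {question}\nReference: {reference}\nResponse: {response}"
)


def call_llm(prompt: str) -> str:
    """Hypothetical model call; replace with your provider's client."""
    return "4"  # stubbed so the sketch runs offline


def evaluation_agent(example: dict, response: str) -> int:
    """Scores one LLM-application response against a benchmark example."""
    raw = call_llm(JUDGE_PROMPT.format(response=response, **example))
    return int(raw.strip())


# Outcome 1: curated benchmark dataset with human-expert scores (activity 1).
benchmark = [
    {"question": "example question A", "reference": "expected answer A", "human_score": 5},
    {"question": "example question B", "reference": "expected answer B", "human_score": 3},
]
# Responses produced by the LLM application under test for the same questions.
app_responses = ["application response A", "application response B"]

# Outcome 2 and activity 3: score with the evaluation agent and measure how
# closely it tracks the human experts on the curated dataset.
agent_scores = [evaluation_agent(ex, r) for ex, r in zip(benchmark, app_responses)]
human_scores = [ex["human_score"] for ex in benchmark]
misalignment = mean(abs(a - h) for a, h in zip(agent_scores, human_scores))
print(f"mean |agent - human| score gap: {misalignment:.2f}")
```

In practice the stub would be replaced by a real model client, and the alignment metric (here a mean absolute score gap) would be chosen to suit the use case and the human-expert rating scale.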


Expected audience expertise: Domain:

None

Expected audience expertise: Python:

None

Lead AI/ML scientist with 10+ years of experience in algorithm design for information extraction from multimodal unstructured data (image, time series, geospatial data). Experienced in innovative algorithm development with statistical signal processing, shallow and deep machine learning, and pre-trained Large Language Models (LLMs) for radar satellite imagery and niche medical sensors. Recipient of innovation awards from the German Aerospace Center (DLR) as well as IEEE for designing algorithms and data products tailored to spaceborne data.