PyCon DE & PyData 2025

Generative-AI: Usecase-Specific Evaluation of LLM-powered Applications
2025-04-23, Platinum3

This talk addresses the critical need for use-case-specific evaluation of Large Language Model (LLM)-powered applications, highlighting the limitations of generic evaluation benchmarks in capturing domain-specific requirements. It proposes a workflow for designing evaluation pipelines that optimize LLM-based applications, consisting of three key activities: human-expert evaluation and benchmark dataset curation, creation of evaluation agents, and alignment of these agents with human evaluations using the curated datasets. The workflow produces two key outcomes: a curated benchmark dataset for testing LLM applications and an evaluation agent that scores their responses. The presentation further discusses the limitations of this approach and best practices that enhance the reliability of the evaluations, ensuring LLM applications are better tailored to their specific use cases.


Large Language Models (LLMs) are a transformative technology, enabling a wide array of applications, from content generation to interactive chatbots, and they form the core of LLM-powered applications. A wide variety of LLMs is available, and the LLM community evaluates their performance with independent, generic benchmarks. The requirements and domain specificity of the use cases behind LLM applications render this generic evaluation insufficient for revealing performance issues. Use-case-specific performance evaluation therefore becomes a necessary component in the design and continuous development of LLM applications.
In this talk, we address the need for use-case-specific evaluation of LLM applications by proposing a workflow for creating evaluation models that support the selection and design optimization of LLM applications. The workflow comprises three main activities:
1) Human-expert evaluation of LLM applications and benchmark dataset curation
2) Creation of evaluation agents
3) Alignment of the evaluation agents with human evaluation on the curated dataset
It leads to two concrete outcomes:
1) A curated benchmark dataset, against which the LLM applications are tested.
2) An evaluation agent: the scoring model that automatically evaluates the responses of the LLM applications.
The talk elaborates on the workflow, its limitations, and best practices that increase the reliability of the evaluations in light of those limitations.
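
As an illustration of the two outcomes, the sketch below shows a toy evaluation agent (an LLM-as-judge scorer) and an alignment check against human-expert scores on a small curated dataset. This is a minimal sketch under stated assumptions: the `call_llm` stub, the rubric prompt, and the dataset fields are illustrative and not the implementation presented in the talk.

```python
# Minimal sketch of a curated benchmark dataset plus an LLM-as-judge evaluation
# agent whose scores are checked for alignment with human-expert scores.
# `call_llm`, the rubric prompt, and the dataset fields are illustrative
# assumptions, not the talk's actual implementation.

from statistics import mean

JUDGE_PROMPT = (
    "You are a domain-expert evaluator. Rate the response to the question "
    "against the reference answer on a scale of 1 (poor) to 5 (excellent). "
    "Return only the number.\n"
    "Question: {question}\nReference: {reference}\nResponse: {response}"
)


def call_llm(prompt: str) -> str:
    """Hypothetical model call; replace with your provider's client."""
    return "4"  # stubbed so the sketch runs offline


def evaluation_agent(example: dict, response: str) -> int:
    """Scores one LLM-application response against a benchmark example."""
    raw = call_llm(JUDGE_PROMPT.format(response=response, **example))
    return int(raw.strip())


# Outcome 1: curated benchmark dataset with human-expert scores (activity 1).
benchmark = [
    {"question": "example question A", "reference": "expected answer A", "human_score": 5},
    {"question": "example question B", "reference": "expected answer B", "human_score": 3},
]
# Responses produced by the LLM application under test for the same questions.
app_responses = ["application response A", "application response B"]

# Outcome 2 and activity 3: score with the evaluation agent and measure how
# closely it tracks the human experts on the curated dataset.
agent_scores = [evaluation_agent(ex, r) for ex, r in zip(benchmark, app_responses)]
human_scores = [ex["human_score"] for ex in benchmark]
misalignment = mean(abs(a - h) for a, h in zip(agent_scores, human_scores))
print(f"mean |agent - human| score gap: {misalignment:.2f}")
```

In practice the stub would be replaced by a real model client, and the alignment metric (here a mean absolute score gap) would be chosen to suit the use case and the human-expert rating scale.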


Expected audience expertise: Domain:

None

Expected audience expertise: Python:

None

Lead AI/ML scientist with 10+ years of experience in algorithm design for information extraction from multimodal unstructured data (image, time series, geospatial data). Experienced in innovative algorithm development with statistical signal processing, shallow and deep machine learning, and pre-trained Large Language Models (LLMs) for radar satellite imagery and niche medical sensors. Recipient of innovation awards from the German Aerospace Center (DLR) as well as IEEE for designing algorithms and data products tailored to spaceborne data.