BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//pretalx//pretalx.com//pyconde-pydata-2025//talk//GGJDTW
BEGIN:VTIMEZONE
TZID:CET
BEGIN:STANDARD
DTSTART:20001029T040000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=10
TZNAME:CET
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
END:STANDARD
BEGIN:DAYLIGHT
DTSTART:20000326T030000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=3
TZNAME:CEST
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
UID:pretalx-pyconde-pydata-2025-GGJDTW@pretalx.com
DTSTART;TZID=CET:20250423T171000
DTEND;TZID=CET:20250423T174000
DESCRIPTION:This talk addresses the critical need for use-case-specific ev
 aluation of Large Language Model (LLM)-powered applications\, highlighting t
 he limitations of generic evaluation benchmarks in capturing domain-specif
 ic requirements. It proposes a workflow for designing more reliable evalu
 ations to optimize LLM-based applications\, consisting of three key activiti
 es: human-expert evaluation and benchmark dataset curation\, creation of e
 valuation agents\, and alignment of these agents with human evaluations us
 ing the curated datasets. The workflow produces two key outcomes: a curate
 d benchmark dataset for testing LLM applications and an evaluation agent t
 hat scores their responses. The presentation further addresses the limitat
 ions and best practices to enhance the reliability of evaluations\, ensu
 ring LLM applications are better tailored to specific use cases.
DTSTAMP:20260305T165241Z
LOCATION:Platinum3
SUMMARY:Generative-AI: Usecase-Specific Evaluation of LLM-powered Applicati
 ons - Dr. Homa Ansari
URL:https://pretalx.com/pyconde-pydata-2025/talk/GGJDTW/
END:VEVENT
END:VCALENDAR
