Dr. Homa Ansari
Lead AI/ML scientist with 10+ years of experience in algorithm design for information extraction from multimodal unstructured data (images, time series, geospatial data). Experienced in developing innovative algorithms with statistical signal processing, shallow and deep machine learning, and pre-trained Large Language Models (LLMs) for radar satellite imagery and niche medical sensors. Recipient of innovation awards from the German Aerospace Center (DLR) and IEEE for designing algorithms and data products tailored to spaceborne data.
Session
This talk addresses the critical need for use-case-specific evaluation of Large Language Model (LLM)-powered applications, highlighting the limitations of generic evaluation benchmarks in capturing domain-specific requirements. It proposes a workflow for designing evaluation pipelines to optimize LLM-based applications, consisting of three key activities: human-expert evaluation and benchmark dataset curation, creation of evaluation agents, and alignment of these agents with human evaluations using the curated datasets. The workflow produces two key outcomes: a curated benchmark dataset for testing LLM applications and an evaluation agent that scores their responses. The presentation further discusses the limitations of this approach and best practices to enhance the reliability of evaluations, ensuring LLM applications are better tailored to specific use cases.
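To make the workflow concrete, the sketch below shows how an evaluation agent might score application responses against a rubric and how its scores could be aligned with a human-curated benchmark. It is a minimal illustration, not the speaker's implementation: the names `call_llm`, `JUDGE_RUBRIC`, `BenchmarkItem`, and the 1–5 scoring scale are all hypothetical assumptions.

```python
# Minimal sketch (assumed, not from the talk): an LLM-based evaluation agent
# scored for agreement against a human-expert benchmark dataset.

from dataclasses import dataclass


@dataclass
class BenchmarkItem:
    """One curated example: an input, the application's response, and a human score."""
    prompt: str
    response: str
    human_score: int  # e.g., 1 (fails the use case) to 5 (fully meets it)


# Hypothetical rubric prompt encoding domain-specific criteria.
JUDGE_RUBRIC = (
    "You are a domain-expert evaluator. Score the response from 1 to 5 "
    "against the use-case criteria below. Reply with a single integer.\n"
    "Criteria: factual accuracy, domain terminology, completeness.\n\n"
    "Task: {prompt}\nResponse: {response}\nScore:"
)


def call_llm(prompt: str) -> str:
    """Placeholder for a call to any LLM provider; replace with a real client."""
    raise NotImplementedError("Wire up an actual LLM client here.")


def judge(item: BenchmarkItem) -> int:
    """Ask the evaluation agent to score one response on the 1-5 rubric."""
    raw = call_llm(JUDGE_RUBRIC.format(prompt=item.prompt, response=item.response))
    return int(raw.strip())


def alignment(benchmark: list[BenchmarkItem], tolerance: int = 0) -> float:
    """Fraction of items where the agent's score is within `tolerance` of the
    human score; a simple proxy for agent-human agreement."""
    agreed = sum(
        1 for item in benchmark if abs(judge(item) - item.human_score) <= tolerance
    )
    return agreed / len(benchmark)
```

In this sketch, a low `alignment` value would signal that the rubric or the agent needs revision before its scores can stand in for human-expert judgment on the target use case.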