PyLadiesCon 2024

Holistic Evaluation of Large Language Models: From References to Human Judgment
2024/12/07, Main Stream
Language: English

In the rapidly evolving field of natural language processing, the evaluation of large language models (LLMs) is crucial for understanding their performance and guiding their development. This talk delves into the two primary evaluation methodologies: reference-based and referenceless techniques.

Reference-based evaluation relies on predefined ground-truth references to assess the quality of generated text. N-gram metrics such as BLEU, ROUGE, and METEOR, along with embedding-based metrics like BERTScore, compare the generated output against these references, providing insight into the model’s accuracy and fluency. However, these metrics often fall short of capturing the nuances of human language and creativity.
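
To make this concrete, here is a minimal sketch of reference-based scoring in Python. It assumes the sacrebleu and rouge-score packages (common implementations of these metrics, not tools named in the talk):

```python
# Reference-based scoring sketch, assuming `pip install sacrebleu rouge-score`.
import sacrebleu
from rouge_score import rouge_scorer

candidate = "The cat sat on the mat."
references = ["A cat was sitting on the mat."]

# BLEU: n-gram precision against the reference(s), with a brevity penalty.
bleu = sacrebleu.sentence_bleu(candidate, references)
print(f"BLEU: {bleu.score:.2f}")

# ROUGE-L: longest-common-subsequence overlap with the reference.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score(references[0], candidate)
print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.3f}")
```

Both scores are driven entirely by overlap with the reference, which is exactly why a fluent paraphrase with little surface overlap can be penalized.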

On the other hand, referenceless evaluation techniques, such as perplexity and human judgment, offer a complementary perspective by assessing the coherence, relevance, and overall quality of the generated text without relying on reference texts. These methods can better capture the subtleties of language generation and provide a more holistic view of model performance.
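
As an illustration, here is a minimal sketch of perplexity as a referenceless signal. It assumes the Hugging Face transformers library and the public gpt2 checkpoint (illustrative choices, not the speaker's):

```python
# Referenceless evaluation sketch: perplexity under a language model,
# assuming `pip install torch transformers` and the public "gpt2" checkpoint.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model.eval()

text = "The cat sat on the mat."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels == input_ids, the model returns the mean cross-entropy
    # over next-token predictions; perplexity is its exponential.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"Perplexity: {torch.exp(loss).item():.2f}")
```

Note that no reference text appears anywhere: the score reflects only how plausible the text is under the scoring model, which is what makes it referenceless.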

This talk will explore the strengths and limitations of both evaluation approaches, highlighting recent advancements and practical applications. It is suitable for anyone with a basic understanding of NLP and LLMs who wants to learn more about evaluation strategies.

Riya is a Data and Applied Scientist at Microsoft specializing in NLP and machine learning. She holds a Master’s degree in Computer Science from the University of Massachusetts Amherst, completed in May 2022. Before joining Microsoft’s US team, she worked as a Data Engineer in India. She is passionate about building data- and AI-driven products and solutions that benefit people and society. In her spare time she enjoys hiking, dancing, and working out.