2026-07-15, Memorial Hall
Optical character recognition (OCR) is a long-standing method of extracting text from images. Traditional OCR models rely on pattern recognition and feature extraction using computer vision techniques and specialized Python libraries. Recently, large language models (LLMs) and general-purpose AI assistants have provided an alternative method of text extraction. This talk explores the efficacy of using LLMs and vision-language models (VLMs) for information extraction in production data pipelines, and presents a data-driven approach for evaluating them against traditional OCR methods in terms of accuracy, reliability, latency, and cost.
Structured data extraction is a classic problem with applications in many domains, such as document digitization, information extraction, and accessibility. There is great potential for LLMs to enhance automation for routine document processing tasks, but there are notable engineering risks associated with integrating these models into production data pipelines. LLMs can produce inconsistent outputs, hallucinate or confabulate content, and be manipulated through prompt injection. When evaluating the efficacy of OCR solutions, it's important to define metrics that capture not only accuracy but also latency, monetary cost, and energy consumption.
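As a rough illustration of the latency side of those metrics, the sketch below times any extraction callable over a set of inputs and summarizes the samples. The function name `profile_latency` and its parameters are hypothetical and not taken from the talk; it uses only the standard library and a nearest-rank 95th percentile.

```python
import math
import statistics
import time
from typing import Any, Callable, Iterable


def profile_latency(
    extract_fn: Callable[[Any], Any],
    inputs: Iterable[Any],
    repeats: int = 3,
) -> dict:
    """Time extract_fn on every input, `repeats` times each.

    Hypothetical helper for illustration: extract_fn can wrap an OCR call,
    an LLM request, or a hybrid pipeline.
    """
    samples = []
    for item in inputs:
        for _ in range(repeats):
            start = time.perf_counter()
            extract_fn(item)
            samples.append(time.perf_counter() - start)
    if not samples:
        raise ValueError("no inputs to profile")

    ordered = sorted(samples)
    # Nearest-rank percentile: smallest sample with at least 95% at or below it.
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "n": len(ordered),
        "median_s": statistics.median(ordered),
        "p95_s": ordered[p95_index],
        "max_s": ordered[-1],
    }


if __name__ == "__main__":
    # Stand-in workload; a real run would pass image paths and an extractor.
    print(profile_latency(lambda text: text.upper(), ["a", "b", "c"]))
```

Running the same harness against each approach on the same inputs gives comparable numbers; cost and energy would need separate instrumentation.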
This talk explores the benefits and challenges of applying LLMs to text extraction from scanned images by contrasting three approaches. First, I will explore object detection approaches using the open-source docling and RF-DETR Python libraries, which directly identify characters and words in images. I will also discuss the docTR library, which applies deep learning models to text recognition.
Next, I will explore how state-of-the-art LLMs and AI assistants such as Gemini, Claude, and Qwen can be applied to targeted text extraction tasks. This includes a data-driven evaluation strategy that utilizes both automated and human feedback to compare LLM-based approaches to traditional OCR.
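One common automated accuracy metric for this kind of comparison is the character error rate (CER): the edit distance between the extracted text and a human-verified reference, divided by the reference length. The sketch below is a minimal, dependency-free version; it is an assumption that the talk's evaluation uses a metric of this kind, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by reference length (0.0 is a perfect match)."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # The letter "O" misread for the digit "0": one substitution in 12 characters.
    print(character_error_rate("Total: 42.00", "Total: 42.O0"))
```

CER can exceed 1.0 when the extracted text is much longer than the reference, which is worth keeping in mind when an LLM adds text that is not in the image.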
Finally, I will discuss a hybrid approach that combines traditional OCR methods with LLMs. This is a two-stage process that uses an OCR model to extract text from the image, then passes the unstructured text data to an LLM to produce a structured output.
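The two-stage process can be sketched as follows. Everything here is illustrative: the `Invoice` schema, the prompt wording, and the `fake_ocr` / `fake_llm` stubs are hypothetical stand-ins for a real OCR model and LLM client. The outline names pydantic for structured outputs; to stay dependency-free, this sketch swaps in a standard-library dataclass with manual validation.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Invoice:
    """Hypothetical target schema; field names are illustrative only."""
    vendor: str
    total: float


def parse_invoice(raw: str) -> Invoice:
    """Validate the LLM reply so malformed output fails loudly."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"LLM reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("LLM reply must be a JSON object")
    vendor, total = data.get("vendor"), data.get("total")
    if not isinstance(vendor, str) or not vendor.strip():
        raise ValueError("vendor must be a non-empty string")
    if isinstance(total, bool) or not isinstance(total, (int, float)):
        raise ValueError("total must be a number")
    return Invoice(vendor=vendor.strip(), total=float(total))


def extract(
    image_path: str,
    ocr: Callable[[str], str],
    llm: Callable[[str], str],
) -> Invoice:
    """Stage 1: OCR the image. Stage 2: ask the LLM to structure the text."""
    text = ocr(image_path)
    prompt = (
        'Return only JSON with keys "vendor" (string) and "total" (number) '
        "for the document text below.\n---\n" + text
    )
    return parse_invoice(llm(prompt))


def fake_ocr(image_path: str) -> str:
    """Stub standing in for a real OCR model."""
    return "ACME Corp\nTotal due: 42.50"


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real LLM call."""
    return '{"vendor": "ACME Corp", "total": 42.5}'


if __name__ == "__main__":
    print(extract("scan.png", fake_ocr, fake_llm))
```

Passing the OCR and LLM stages in as callables keeps the pipeline testable without network access and makes it easy to swap models when comparing approaches.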
This talk is for data scientists and machine learning engineers who are interested in prototyping and evaluating text extraction solutions in Python. I will walk through several Python code examples for structured extraction using open-source libraries such as docling and docTR, and demonstrate how to experimentally validate those methods against modern machine learning approaches that use LLMs and VLMs.
Outline
- Traditional OCR techniques (5 minutes)
a. Object detection approaches with docling and RF-DETR
b. Deep learning with open source models and the docTR library
- Text extraction with LLMs (5 minutes)
a. Extracting structured outputs with pydantic
b. Prompt engineering
c. Self-hosted vs. managed service models
- Hybrid approach (5 minutes)
a. Combining traditional OCR with LLMs
b. Profiling performance metrics
- Evaluating text extraction approaches (10 minutes)
a. Automated vs. human evaluation
b. Cost metrics (latency, compute and API expenses, energy)
c. Creating an evaluation framework
Patrick Deziel is a machine learning engineer and Python and Go programmer. Patrick has extensive experience building machine learning powered applications and contributing to open source projects such as Yellowbrick, an ML visualization library written for Python. He currently works at Rotational Labs where he builds software to support prototyping and evaluation of AI/ML powered solutions.