PyData Boston 2025

Going multi-modal: How to leverage the latest multi-modal LLMs and deep learning models in real-world applications
2025-12-08, Thomas Paul

Multimodal deep learning models continue improving rapidly, but creating real-world applications that effectively leverage multiple data types remains challenging. This hands-on tutorial covers model selection, embedding storage, fine-tuning, and production deployment through two practical examples: a historical manuscript search system and flood forecasting with satellite imagery and time series data.


Many real-world applications benefit from leveraging multiple modalities of data. For instance, an app that helps users repair their appliances might take both images and textual descriptions of the problem; a clinical decision support system might combine patient vitals, CT scans, and the doctor's free-form notes; and a flash flood forecasting system might draw on both historical numerical data and satellite imagery. Recently, we have seen a proliferation of multi-modal LLMs such as ChatGPT, Command A Vision, and Llama.

This tutorial will cover background on multi-modal models, building multi-modal RAG applications in Python, fine-tuning models, and deploying production code with Docker. We will learn through two hands-on, real-world use cases (described in the talk schedule below).
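
As a flavor of the building blocks we will use, here is a minimal sketch (assuming a Hugging Face CLIP checkpoint and a placeholder image file, not the tutorial's exact code) of embedding an image and candidate text descriptions into a shared space and comparing them:

from transformers import CLIPModel, CLIPProcessor
from PIL import Image
import torch

# Assumption: the widely used ViT-B/32 CLIP checkpoint; the tutorial repo may pin a different model.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("page_scan.jpg")  # placeholder path to a scanned manuscript page
texts = ["a handwritten ledger page", "a printed newspaper article"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# image_embeds and text_embeds live in the same 512-d space,
# so cosine similarity gives a cross-modal relevance score.
sims = torch.nn.functional.cosine_similarity(outputs.image_embeds, outputs.text_embeds)
print(sims)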

Participants will leave with a broad understanding of both the challenges of working with multi-modal models and the potential these models offer, along with access to open-source code and pointers to other free resources for continuing their learning.

This tutorial assumes that participants have some knowledge of Python and basic familiarity with LLMs. Experience with Docker and setting up RAG pipelines is helpful but not required. No experience with multi-modal models themselves is required. Code examples and details on how to set up the environment will be distributed beforehand via a GitHub repository.

Talk Schedule:

0-5 minutes: Introduction/speaker background/motivation
5-35 minutes: Multi-modal deep learning theory/background: understanding CLIP, cross-attention mechanisms, fusing representations, etc.
35-65 minutes: Building a Multimodal Historic Document Understanding System:

  • Trade-offs between an OCR + text-only LLM pipeline and a multi-modal model
  • Creating and saving multi-modal document embeddings to Elasticsearch (a minimal sketch appears after the schedule)
  • Multi-modal prompting strategies for visual question answering
  • Serving large multi-modal models
  • Fine-tuning strategies

65-80 minutes: Analyzing Historic Flash Floods and Forecasting Future Floods with Time Series + Satellite Images:
  • Creating a unified representation and clustering analysis
  • Searching time series embeddings (aligning time series and text)
  • Multi-modal time series forecasting models (a minimal fusion sketch appears after the schedule)

80-90 minutes: Reserved for questions from the audience
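
The embedding storage bullet above refers to this minimal sketch (the index name, field names, local cluster URL, and embed_page placeholder are illustrative assumptions, not the tutorial's actual pipeline) of indexing dense multi-modal embeddings in Elasticsearch and retrieving them with kNN search:

from elasticsearch import Elasticsearch
import numpy as np

es = Elasticsearch("http://localhost:9200")  # assumption: a local dev cluster

# dense_vector mapping sized to the encoder output (512 for CLIP ViT-B/32)
es.indices.create(
    index="manuscripts",
    mappings={
        "properties": {
            "page_id": {"type": "keyword"},
            "embedding": {
                "type": "dense_vector", "dims": 512,
                "index": True, "similarity": "cosine",
            },
        }
    },
)

def embed_page(path: str) -> np.ndarray:
    # Placeholder: in practice this calls the multi-modal encoder
    # (e.g., the CLIP image tower from the earlier sketch).
    rng = np.random.default_rng(0)
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

# Index one scanned page
es.index(
    index="manuscripts",
    document={"page_id": "page_001", "embedding": embed_page("scans/page_001.jpg").tolist()},
)

# Query with an embedding from the same shared space (stand-in for a text-encoder output)
query_vec = embed_page("query_placeholder").tolist()
hits = es.search(
    index="manuscripts",
    knn={"field": "embedding", "query_vector": query_vec, "k": 5, "num_candidates": 50},
)
for hit in hits["hits"]["hits"]:
    print(hit["_source"]["page_id"], hit["_score"])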
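
Similarly, the multi-modal forecasting bullet refers to this late-fusion sketch (the architecture, dimensions, and random inputs are illustrative assumptions): a GRU encodes the numerical time series, its final hidden state is concatenated with a precomputed satellite-image embedding, and a small head predicts the next value.

import torch
import torch.nn as nn

class FusionForecaster(nn.Module):
    def __init__(self, n_series_features: int, img_dim: int = 512, hidden: int = 64):
        super().__init__()
        self.encoder = nn.GRU(n_series_features, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden + img_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, series: torch.Tensor, img_emb: torch.Tensor) -> torch.Tensor:
        # series: (batch, timesteps, features); img_emb: (batch, img_dim)
        _, h = self.encoder(series)
        fused = torch.cat([h[-1], img_emb], dim=-1)  # late fusion of the two modalities
        return self.head(fused)

# Shape check on random stand-in data
model = FusionForecaster(n_series_features=4)
series = torch.randn(8, 48, 4)    # 8 sites, 48 hourly steps, 4 gauge/rainfall variables
img_emb = torch.randn(8, 512)     # one satellite-image embedding per site
print(model(series, img_emb).shape)  # torch.Size([8, 1])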


Prior Knowledge Expected: Previous knowledge expected