EuroSciPy 2026

Same Recipe, Different Results: Fine-Tuning Models Across Modalities
2026-07-22 , Room 1.19 (Ground Floor, Shannon)

The intuitions you build fine-tuning text models are surprisingly bad guides for other modalities. Training configurations that work well for language will silently degrade an image model. Dataset sizes that feel tiny for text are more than enough for adapting a visual style. And audio, despite seeming like its own world, follows an image pipeline once you transform sound into spectrograms, making what counts as a "token" stranger and more interesting than most people expect. The modalities share a vocabulary (fine-tuning, adapters, checkpoints) but not a playbook, and the gaps between them are where the most useful lessons live.

This talk is a practical, comparative tour of fine-tuning across four modalities: text, images, audio, and video. Rather than focusing on one, we will look at what changes as you move between them: how you prepare the data, which training strategies transfer and which don't, where the gotchas hide, and what model merging can do for you once training is done. All examples use Python and the HuggingFace ecosystem with publicly available models and datasets. Whether you are a practitioner looking to branch out beyond NLP or someone curious about what multi-modal fine-tuning looks like in practice, you will leave with a mental map of the landscape and enough pointers to start exploring on your own.


Fine-tuning has become the default way to adapt foundation models to specific tasks, but most of the conversation focuses on text. If you have fine-tuned an LLM with LoRA or QLoRA, you might assume the jump to other modalities is straightforward, since the core idea is the same. In practice, each modality comes with its own assumptions, failure modes, and hard-won lessons that only become obvious once you start training.

This talk walks through fine-tuning across four modalities side by side, highlighting the patterns that hold and the ones that break.

For text (LLMs), we start with the standard recipe as a baseline (LoRA, dataset formatting, evaluation) and focus on identifying the implicit assumptions in the text workflow that do not carry over to other modalities.

For images (Diffusion Models), we walk through fine-tuning for specific visual styles, a workflow that looks similar to the text one on the surface but whose data preparation is fundamentally different. We will cover why image adaptation is far more sensitive to dataset size and composition than text, and the tradeoffs between different techniques.

For audio, we will look at fine-tuning a model to generate music in a specific genre using publicly available data, and how audio tagging models can be paired with embeddings to build applications that connect generation with semantic understanding of music.

Video is the least documented modality. Frame sampling strategies and temporal consistency become first-order concerns, and compute requirements escalate faster than you would expect. We will cover the current state of video model adaptation and where the tooling still has rough edges.
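
To make "frame sampling strategies" a little more concrete, here is a minimal sketch of two common ways to choose which frames of a clip a model sees. It is an illustration only, not necessarily what the talk uses: the function names `uniform_frame_indices` and `strided_frame_indices` are hypothetical, and the code only computes indices, leaving video decoding to whichever library you prefer.

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread evenly across a clip."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    if num_samples >= total_frames:
        # The clip is shorter than the request, so keep every frame.
        return list(range(total_frames))
    # Split the clip into equal segments and take the middle frame of each.
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


def strided_frame_indices(total_frames: int, stride: int) -> list[int]:
    """Keep every stride-th frame, starting from the first one."""
    if total_frames <= 0 or stride <= 0:
        raise ValueError("total_frames and stride must be positive")
    return list(range(0, total_frames, stride))


if __name__ == "__main__":
    # A 100-frame clip reduced to 4 evenly spaced frames.
    print(uniform_frame_indices(100, 4))
    # A 10-frame clip keeping every 4th frame.
    print(strided_frame_indices(10, 4))
```

Uniform sampling fixes the number of frames per clip regardless of its length, which keeps batch shapes constant. Strided sampling fixes the time gap between frames instead, so longer clips produce more frames.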

Once you have multiple fine-tuned models, merging offers a way to combine their capabilities without retraining. We will cover the main strategies and when merging is a shortcut worth taking versus when it will produce sub-optimal outputs.
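
As a rough illustration of the simplest merging idea, the sketch below takes a weighted average of parameters from several checkpoints. It rests on loud assumptions: all checkpoints share the same architecture and parameter names, plain Python lists stand in for tensors, and `merge_linear` is a hypothetical helper rather than a library function. Linear averaging is only one strategy among several, and the abstract does not say which ones the talk covers.

```python
def merge_linear(state_dicts, weights=None):
    """Weighted average of parameters from checkpoints with identical layouts."""
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    if weights is None:
        # Default to a plain average across all checkpoints.
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    if len(weights) != len(state_dicts):
        raise ValueError("need exactly one weight per checkpoint")

    names = state_dicts[0].keys()
    for sd in state_dicts[1:]:
        if sd.keys() != names:
            raise ValueError("checkpoints must share parameter names")

    merged = {}
    for name in names:
        params = [sd[name] for sd in state_dicts]
        if len({len(p) for p in params}) != 1:
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Combine the values position by position using the checkpoint weights.
        merged[name] = [
            sum(w * v for w, v in zip(weights, values))
            for values in zip(*params)
        ]
    return merged


if __name__ == "__main__":
    # Toy checkpoints: flat lists of floats standing in for weight tensors.
    style_a = {"layer.weight": [1.0, 2.0]}
    style_b = {"layer.weight": [3.0, 6.0]}
    print(merge_linear([style_a, style_b]))
    print(merge_linear([style_a, style_b], weights=[0.25, 0.75]))
```

Shifting the weights toward one checkpoint biases the merged model toward that checkpoint's behaviour. Whether the result is usable still has to be checked by evaluating the merged model, since averaging gives no guarantee of quality.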

Across all modalities, we will compare data preparation, training configuration, evaluation, and the current state of open-source tooling. All code examples use Python with HuggingFace Transformers, Diffusers, and related libraries, and every example uses publicly available models and datasets.

The goal is to give you the comparative mental model that makes moving between modalities far less intimidating, and to show that with the right tools and a bit of curiosity, the same recipe can produce very different and very satisfying results.


Expected audience expertise: Domain: none
Expected audience expertise: Python: some
Supporting material: Supporting material
Your relationship with the presented work/project: Developed the presented feature, Developed original workshop or study course

Hello! I'm Ramon, a systems engineer and educator living in Sydney. I currently work at Canva on the AI Ops team within the Content and Delivery division. Previously, I was a research engineer at Menlo Labs building tools to run AI models on robots and constrained devices, and before that a Senior Product Developer at Decoded, a technology education company based in the UK where I created custom data science tools, workshops, and training programs for clients in industries ranging from retail to finance. Prior to that, I held roles at the intersection of education, data science, and research in the areas of entrepreneurship and strategy. On the personal side, I enjoy giving talks and technical workshops and have had the privilege of participating in several conferences such as PyCon, SciPy, CppNow, PyData, and countless meetup events. In my spare time, I spend as much time as possible mountain biking and exploring many of the outdoor wonders Australia has to offer.