Ramon Perez
Hello! I'm Ramon, a systems engineer and educator living in Sydney. I currently work at Canva on the AI Ops team within the Content and Delivery division. Previously, I was a research engineer at Menlo Labs, building tools to run AI models on robots and constrained devices. Before that, I was a Senior Product Developer at Decoded, a technology education company based in the UK, where I created custom data science tools, workshops, and training programs for clients in industries ranging from retail to finance. Prior to that, I held roles at the intersection of education, data science, and research in the areas of entrepreneurship and strategy. On the personal side, I enjoy giving talks and technical workshops, and I have had the privilege of participating in several conferences, such as PyCon, SciPy, CppNow, and PyData, as well as countless meetup events. In my spare time, I spend as much time as possible mountain biking and exploring the many outdoor wonders Australia has to offer.
He/Him
Canva
Systems Engineer
@ramonprz0
Session
The intuitions you build fine-tuning text models are surprisingly bad guides for other modalities. Training configurations that work well for language will silently degrade an image model. Dataset sizes that feel tiny for text are more than enough for adapting a visual style. And audio, despite seeming like its own world, follows an image pipeline once you transform sound into spectrograms, making what counts as a "token" stranger and more interesting than most people expect. The modalities share a vocabulary (fine-tuning, adapters, checkpoints) but not a playbook, and the gaps between them are where the most useful lessons live.
This talk is a practical, comparative tour of fine-tuning across four modalities: text, images, audio, and video. Rather than focusing on one, we will look at what changes as you move between them: how you prepare the data for each, which training strategies transfer and which don't, where the gotchas hide, and what model merging can do for you once training is done. All examples use Python and the Hugging Face ecosystem with publicly available models and datasets. Whether you are a practitioner looking to branch out beyond NLP or someone curious about what multi-modal fine-tuning looks like in practice, you will leave with a mental map of the landscape and enough pointers to start exploring on your own.