PyCon DE & PyData 2025

The Foundation Model Revolution for Tabular Data
2025-04-25, Titanium3

What if we could make the same revolutionary leap for tables that ChatGPT made for text? While foundation models have transformed how we work with text and images, tabular/structured data (spreadsheets and databases) - the backbone of economic and scientific analysis - has been left behind. TabPFN changes this. It's a foundation model that achieves in 2.8 seconds what traditional methods need 4 hours of hyperparameter tuning for - while delivering better results. On datasets of up to 10,000 samples, it outperforms every existing Python library, from XGBoost to CatBoost to AutoGluon.

Beyond raw performance, TabPFN brings foundation model capabilities to tables: native handling of messy data without preprocessing, built-in uncertainty estimation, synthetic data generation, and transfer learning - all in a few lines of Python code. Whether you're building risk models, accelerating scientific research, or optimizing business decisions, TabPFN represents the next major transformation in how we analyze data. Join us to explore these new capabilities and learn how to leverage them in your own work.
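
As a rough illustration of the "few lines of Python" claim, here is a minimal sketch using the scikit-learn-style interface of the tabpfn package; the dataset, default arguments, and train/test split are illustrative assumptions rather than material from the talk.

    # Hedged sketch: TabPFNClassifier follows the scikit-learn estimator API,
    # so training and class-probability prediction fit in a few lines.
    # The breast-cancer dataset and default settings are illustrative choices.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from tabpfn import TabPFNClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = TabPFNClassifier()           # one pretrained model, no hyperparameter tuning
    clf.fit(X_train, y_train)          # "fitting" conditions the model on the training set
    proba = clf.predict_proba(X_test)  # per-class probabilities serve as uncertainty estimates
    print(proba[:3])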


TabPFN shows how foundation model concepts can advance tabular data analysis in Python. First published as research at ICLR 2023, it has found strong community adoption, with 1,200+ GitHub stars and 100,000+ downloads. Our upcoming January 2025 release introduces major improvements in speed, scale and capabilities that we're excited to preview at PyCon.

Detailed Outline:

  1. Context & Evolution (5 min)
    - The challenge of applying deep learning to tabular data
    - Learning from the foundation model revolution in text and vision
    - Key improvements from V0 to V1 based on community feedback
    - Real-world examples where TabPFN shines (and where it doesn't)

  2. Technical Insights (8 min)
    - How we adapted transformers for tabular data
    - Making in-context learning work for structured data
    - Performance characteristics and resource requirements
    - Understanding current limitations and constraints

  3. Live Coding & Integration (12 min)
    - Getting started with TabPFN in 3 lines of code
    - Handling real-world data challenges: missing values and mixed data types
    - Built-in uncertainty estimation
    - Working with similar tasks efficiently
    - Integration with pandas, scikit-learn and the Python ecosystem (a minimal sketch follows this outline)

  4. Practical Applications (5 min)
    - When to choose TabPFN vs traditional methods
    - Resource requirements and scalability limits
    - What's next for TabPFN
    - Q&A
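
The scikit-learn integration item in the live-coding section above is sketched below: because TabPFNClassifier exposes the standard estimator interface, it slots directly into existing tooling such as cross-validation or pipelines. The dataset and cross-validation setup here are illustrative assumptions, not the talk's actual demo.

    # Hedged sketch: TabPFNClassifier is scikit-learn compatible, so it can be
    # passed to cross_val_score (or used inside a Pipeline) like any estimator.
    # Iris and 5-fold CV are illustrative choices.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from tabpfn import TabPFNClassifier

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(TabPFNClassifier(), X, y, cv=5)
    print(f"mean CV accuracy: {scores.mean():.3f}")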

Key Takeaways:
- Practical understanding of TabPFN's capabilities and limitations
- Hands-on experience integrating with Python data science workflows
- Best practices for working with foundation models on tabular data
- Insight into emerging approaches for structured data analysis


Expected audience expertise (domain): Intermediate

Expected audience expertise (Python): Intermediate

Public link to supporting material, e.g. videos, Github, etc.:

https://arxiv.org/abs/2207.01848; https://github.com/automl/tabpfn

Frank is a Hector-Endowed Fellow and PI at the ELLIS Institute Tübingen and has been a full professor of Machine Learning at the University of Freiburg (Germany) since 2016. Previously, he was an Emmy Noether Research Group Lead at the University of Freiburg from 2013, and before that he did a PhD (2004-2009) and postdoc (2009-2013) at the University of British Columbia (UBC) in Canada. He received the 2010 CAIAC doctoral dissertation award for the best thesis in AI in Canada, as well as several best paper awards and prizes in international ML competitions. He is a Fellow of ELLIS and EurAI, Director of the ELLIS unit Freiburg, and the recipient of 3 ERC grants.

Frank is best known for his research on automated machine learning (AutoML), including neural architecture search, efficient hyperparameter optimization, and meta-learning. He co-authored the first book on AutoML and the prominent AutoML tools Auto-WEKA, Auto-sklearn and Auto-PyTorch, won the first two AutoML challenges with his team, co-teaches the first MOOC on AutoML, co-organized 15 AutoML-related workshops at ICML, NeurIPS and ICLR, and founded the AutoML conference as general chair in 2022. In recent years, his focus has been on the intersection of foundation models and AutoML, prominently including the first foundation model for tabular data, TabPFN.