PyCon DE & PyData 2025

1.5 PyCon DE & PyData 2025 pyconde-pydata-2025 2025-04-23 2025-04-25 3 00:05 https://pretalx.com https://pretalx.com/media/pyconde-pydata-2025/img/Logo_horizontal_JM1xTKg.svg Europe/Berlin Zeiss Plenary (Spectrum) Reasonable AI Keynote 2025-04-23T10:30:00+02:00 10:30 00:45 The relationship between humans and machines, especially in the context of Artificial Intelligence (AI), is shaped by hopes, concerns, and moral questions. On the one hand, advances in AI offer great promise: it can help us solve complex problems, improve healthcare, streamline workflows, and much more. Yet, at the same time, there are legitimate concerns about the control over this technology, its potential impact on jobs and society, and ethical issues related to discrimination and the loss of human autonomy. In the talk I shall will explore and illustrate the complex tension between innovation and moral responsibility in AI research. pyconde-pydata-2025-64178-reasonable-ai Keynote Kristian Kersting en The relationship between humans and machines, especially in the context of Artificial Intelligence (AI), is shaped by hopes, concerns, and moral questions. On the one hand, advances in AI offer great promise: it can help us solve complex problems, improve healthcare, streamline workflows, and much more. Yet, at the same time, there are legitimate concerns about the control over this technology, its potential impact on jobs and society, and ethical issues related to discrimination and the loss of human autonomy. In the talk I shall will explore and illustrate the complex tension between innovation and moral responsibility in AI research. false https://pretalx.com/pyconde-pydata-2025/talk/3MNGN8/ https://pretalx.com/pyconde-pydata-2025/talk/3MNGN8/feedback/ Zeiss Plenary (Spectrum) Python Performance Unleashed: Essential Optimization Techniques Beyond Libraries Talk 2025-04-23T11:45:00+02:00 11:45 00:30 Every Python developer faces performance challenges, from slow data processing to memory-intensive operations. While external libraries like Numba or Cython offer solutions, understanding core Python optimization techniques is crucial for writing efficient code. This talk explores practical optimization strategies using Python's built-in capabilities, demonstrating how to achieve significant performance improvements without external dependencies. Through real-world examples from machine learning pipelines and data processing applications, we'll examine common bottlenecks and their solutions. Whether you're building data pipelines, web applications, or ML systems, these techniques will help you write faster, more efficient Python code. pyconde-pydata-2025-61317-python-performance-unleashed-essential-optimization-techniques-beyond-libraries PyCon: Python Language & Ecosystem Thomas Berger en Performance optimization remains a critical challenge in Python development. While Python's simplicity and extensive ecosystem make it the language of choice for many applications, its interpreted nature can lead to significant performance bottlenecks. This is particularly evident in data-intensive applications, machine learning pipelines, and large-scale production systems where every millisecond counts. Many developers immediately reach for external libraries or complex solutions when facing performance issues. However, Python's standard library and built-in features offer powerful optimization opportunities that are often overlooked. Understanding these fundamental optimization techniques not only improves code performance but also helps developers write more efficient code from the start. This talk addresses the core performance challenges faced by Python developers daily. From memory management to algorithmic efficiency, we'll explore how seemingly simple code changes can lead to substantial performance improvements. Through practical examples drawn from real-world applications, we'll demonstrate how to identify, measure, and optimize performance bottlenecks effectively. false https://pretalx.com/pyconde-pydata-2025/talk/AJDYRL/ https://pretalx.com/pyconde-pydata-2025/talk/AJDYRL/feedback/ Zeiss Plenary (Spectrum) Open Table Formats in the Wild: From Parquet to Delta Lake and Back Talk (long) 2025-04-23T12:25:00+02:00 12:25 00:45 Open table formats have revolutionized analytical, columnar storage on cloud object stores with critical features like ACID compliance and enhanced metadata management, once exclusive to proprietary cloud data warehouses. Delta Lake, Iceberg, and Hudi have significantly advanced over traditional open file formats like Parquet and ORC. In an effort to modernize our data architecture, we aimed to replace our Parquet-based bronze layer with Delta Lake, anticipating better query performance, reduced maintenance, native support for incremental processing, and more. While our initial pilot showed promise, we encountered unexpected pitfalls that ultimately brought us back to where we began. Curious? Join me as we shed light on the current state of table formats. pyconde-pydata-2025-60702-open-table-formats-in-the-wild-from-parquet-to-delta-lake-and-back PyData: Data Handling & Engineering Franz Wöllert en # Description Open Table Formats (OTF) such as Hudi, Iceberg and Delta Lake have disruptively changed the data engineering landscape in recent years. While the Parquet file format has evolved as the de-facto standard for open, interoperable columnar storage for analyical workloads, it lacked first class support for critical features such as ACID compliance, incremental processing, flexible schema & partioning evolution and scalable meta data management. This led to increased development and maintenance efforts while building idempotent and failure tolerant data pipelines that often resulted in custom frameworks. OTFs solve all of these issues via providing a sophisticated meta data layer and improved maintenance capabilities on top of Parquet. Driven by the promises of OTFs, we intended to replace our own bronze-read-only Parquet-based storage layer with Delta Lake. In theory, this should have improved performance, reduced maintenanced and provided more flexibility. However, we've stumbled upon several issues: 1. drastic performance issues with Liquid Clustering during incremental processing 2. inmature interoperability in the python and cloud-based ecosystem (DuckDB, Pandas, Polars, Athena, Snowflake) 3. maintaining logical session-boundaries during incremental processing While the first two issues are solvable in foreseeable future, the last one is specific to our requirements and does not overlap with design decisions made for incremental processing in Delta Lake. Taken together, these points ultimately led us to go back to relying on Parquet again. ## Targeted Audience This talk is mainly intended for an intermediate data engineering audience but is well suited for interested beginners, too. The content of this talk is relevant for all architects and data engineers being responsible for storing and managing data for analytical workloads. # Key takeaways - What problems do OTFs solve? - How do OTFs contribute to an open, composable data stack? - Is there a predominant Open Table Format? - How does Delta Lake conceptionally work? - What are concrete real-world advantages of Delta Lake in contrast to "plain" Parquet? - What is the "small files" problem and how does Liquid Clustering help? - How is the current state of interoperability with Delta Lake? # Talk Outline - Introduction (5 min) - OTFs in comparison (5 min) - Delta Lake Internals (10 min) - Use Case Requirements (5 min) - Benchmarks & Results (10 min) - Conclusion and Outlook (5 min) - Questions (5 min) false https://pretalx.com/pyconde-pydata-2025/talk/MRHNCV/ https://pretalx.com/pyconde-pydata-2025/talk/MRHNCV/feedback/ Zeiss Plenary (Spectrum) From Trees to Transformers: Our Journey Towards Deep Learning for Ranking Talk 2025-04-23T14:30:00+02:00 14:30 00:30 GetYourGuide, a global marketplace for travel experiences, reached diminishing returns with its XGBoost-based ranking system. We switched to a Deep Learning pipeline in just nine months, maintaining high throughput and low latency. We iterated on over 50 offline models and conducted more than 10 live A/B tests, ultimately deploying a PyTorch transformer that yielded significant gains. In this talk, we will share our phased approach—from a simple baseline to a high-impact launch—and discuss the key operational and modeling challenges we faced. Learn how to transition from tree-based methods to neural networks and unlock new possibilities for real-time ranking. pyconde-pydata-2025-61250-from-trees-to-transformers-our-journey-towards-deep-learning-for-ranking PyData: Machine Learning & Deep Learning & Statistics Theodore MeynardMihail Douhaniaris en GetYourGuide is a global online marketplace that helps travelers discover and book the best experiences. One of our core challenges is ensuring users always see the most relevant activities first—a task historically powered by an XGBoost-based ranking system. However, as we continued refining our tree-based models, returns on incremental improvements began to plateau. To spark our next step change in performance, we decided to adopt Deep Learning. In this talk, we will share how, in just nine months, we migrated our ranking pipeline to a Deep Learning architecture while maintaining tight latency and high-throughput requirements. We will walk through our phased approach, starting with a minimal viable model to confirm our production setup and gradually increasing its complexity. Along the way, we tested over 50 iterations offline and ran more than 10 live A/B tests to validate the impact on our customers. Ultimately, we rolled out a PyTorch transformer-based model with significant business impact. We will also discuss the main challenges we faced on the operational and modeling sides, how we overcame them, and the lessons we learned. You will leave with practical strategies for transitioning from traditional tree-based models to neural networks in production. Join us to learn how to advance your machine-learning capabilities and unlock new dimensions of relevance and personalization for real-time ranking. false https://pretalx.com/pyconde-pydata-2025/talk/83QH37/ https://pretalx.com/pyconde-pydata-2025/talk/83QH37/feedback/ Zeiss Plenary (Spectrum) Beyond Agents: What AI Strategy Really Needs in 2025 Talk (long) 2025-04-23T15:10:00+02:00 15:10 00:45 Artificial intelligence is no longer confined to models and APIs—it now shapes systems, hardware, and real-world agents. In this talk, I reflect on strategic insights gained at NVIDIA’s GTC 2025, where AI’s convergence with simulation, synthetic data, and robotics signals a fundamental shift. Drawing from over 1,100 sessions and personal experiences at the heart of Silicon Valley, I explore emerging patterns that redefine what it means to build and deploy AI at scale. We’ll look beyond the hype of large language models to examine autonomous systems, interdisciplinary development, and the infrastructure shifts enabling AI everywhere—from cloud to desktop. This session is a call to technical leaders and practitioners to broaden their perspective, think beyond tools, and engage strategically. Whether you’re developing agents, managing data pipelines, or scaling AI across teams, this talk will challenge assumptions and highlight what truly matters in 2025 and beyond. pyconde-pydata-2025-60441-beyond-agents-what-ai-strategy-really-needs-in-2025 General: Others Alexander CS Hendorf en Artificial intelligence is expanding beyond the boundaries of models and APIs—into real-world agents, high-fidelity simulation, and strategic infrastructure. This talk offers a practical, forward-looking perspective on AI strategy, based on insights gathered at NVIDIA’s GTC 2025, one of the most influential events in the global AI ecosystem. We begin with a personal reflection: why attending GTC as an AI consultant helped reset my strategic thinking after experiencing the common challenges of fragmented data, isolated tools, and innovation fatigue. From there, we’ll explore key emerging trends—agentic AI, synthetic data generation, and real-time digital twins—and discuss their broader implications for how we design, train, and deploy intelligent systems. The second part of the talk focuses on convergence: how disciplines such as robotics, healthcare, simulation, and cloud infrastructure are blending, creating new demands for cross-functional collaboration. A brief clustering analysis of 500+ GTC sessions will illustrate this shift. We’ll conclude by examining strategic changes in AI infrastructure—especially the rise of powerful, local AI systems—and draw lessons from unexpected collaborations (such as Disney, DeepMind, and NVIDIA) that reveal how innovation often happens at the intersection of domains. This talk is intended for developers, data scientists, and technical leads who want to broaden their understanding of where AI is headed and how to align today’s decisions with tomorrow’s possibilities. Talk Outline: • Introduction: personal motivation and strategic perspective on GTC 2025 • Key trends: agentic AI, synthetic data, and real-time simulation • Interdisciplinary convergence: how domains like robotics, biology, and infrastructure intersect • Case study: the Disney–DeepMind–NVIDIA collaboration and its broader lessons • Strategic implications: shifts in AI infrastructure and a call for action-oriented, cross-domain thinking false https://pretalx.com/pyconde-pydata-2025/talk/JA9NFW/ https://pretalx.com/pyconde-pydata-2025/talk/JA9NFW/feedback/ Zeiss Plenary (Spectrum) Mastering Demand Forecasting: Lessons from Europe's Largest Retailer Talk 2025-04-23T16:10:00+02:00 16:10 00:30 Ever craved your favorite dish, only to find its key ingredient missing from the store? You're not alone - stock outs can have significant consequences for businesses, resulting in frustrated customers and lost sales. On the other hand, overstocking can lead to wasted storage costs and potential write-offs. The replenishment system is responsible for striking the right balance between these opposing risks. The key to successful replenishment is making accurate predictions about future demand. This presentation takes a deep dive into the intricate world of demand forecasting, at Europe's largest retailer. We will demonstrate how enhancing simple machine learning methods with domain knowledge allows to generate hundreds of millions of high-quality forecasts every day. pyconde-pydata-2025-61232-mastering-demand-forecasting-lessons-from-europe-s-largest-retailer PyData: Machine Learning & Deep Learning & Statistics Moreno SchlageterYovli Duvshani en This talk will provide an in-depth look at the forecasting engine, the heart of Lidl's replenishment system. Each day at Lidl, hundreds of millions of various products journey from suppliers to warehouses before reaching the shelves. Our so-called forecasting engine helps to automate the supply chain at every step along the way. Even with the vast amount of data at our disposal, the problem is still extraordinarily intricate. Each item, store or warehouse has unique demand patterns influenced heavily by a wide range of factors, such as holidays. While most of the effects are quantifiable, others remain unavailable and a certain degree of stochasticity is inherent to the process. The objective of our demand prediction may also vary based on their usages. Accuracy on the day level typically matters for short-term predictions, while it doesn't for long-term predictions. We'll present our pragmatic modeling methodology on a simplified version of the problem at hand: The warehouse forecasting of single items. We explain the rationale for training separate models for each item-warehouse combination and go into the reasons why we opted for using a LGBM model and why we believe it is best suited for our application. In addition to outlining our high-level modeling approach, we demonstrate how business and domain expertise are integrated into the modeling process through the use of sample and feature weighting and examine the impact of this integration on prediction quality. Following the base model, extensions are introduced that enable the incorporation of higher-level information at the finest level of granularity. This is achieved through decomposition and recomposition of the time-series at hand. In detail, we will present uplift decomposition for different use-cases, which include handling of promotions and holidays. To conclude, we will give an overview of how all the presented methods synergize in delivering reliable forecasts for happy customers, so that you will hopefully never find yourself in front of an empty shelf! false https://pretalx.com/pyconde-pydata-2025/talk/NNGWGC/ https://pretalx.com/pyconde-pydata-2025/talk/NNGWGC/feedback/ Zeiss Plenary (Spectrum) Conquering PDFs: document understanding beyond plain text Talk 2025-04-23T17:10:00+02:00 17:10 00:30 NLP and data science could be so easy if all of our data came as clean and plain text. But in practice, a lot of it is hidden away in PDFs, Word documents, scans and other formats that have been a nightmare to work with. In this talk, I'll present a new and modular approach for building robust document understanding systems, using state-of-the-art models and the awesome Python ecosystem. I'll show you how you can go from PDFs to structured data and even build fully custom information extraction pipelines for your specific use case. pyconde-pydata-2025-59318-conquering-pdfs-document-understanding-beyond-plain-text PyData: Natural Language Processing & Audio (incl. Generative AI NLP) Ines Montani en For the practical examples, I'll be using spaCy, and the new Docling library and layout analysis models. I'll also cover Optical Character Recognition (OCR) for image-based text, how to convert tabular data to pandas DataFrames, and strategies for creating training and evaluation data for information extraction tasks like text classification and entity recognition using PDFs and other documents as inputs. false https://pretalx.com/pyconde-pydata-2025/talk/FUX3FR/ https://pretalx.com/pyconde-pydata-2025/talk/FUX3FR/feedback/ Zeiss Plenary (Spectrum) Is Prompt Engineering Dead? How Auto-Optimization is Changing the Game Talk 2025-04-23T17:50:00+02:00 17:50 00:30 The rise of LLMs has elevated prompt engineering as a critical skill in the AI industry, but manual prompt tuning is often inefficient and model-specific. This talk explores various automatic prompt optimization approaches, ranging from simple ones like bootstrapped few-shot to more complex techniques such as MIPRO and TextGrad, and showcases their practical applications through frameworks like DSPy and AdalFlow. By exploring the benefits, challenges, and trade-offs of these approaches, the attendees will be able to answer the question: is prompt engineering dead, or has it just evolved? pyconde-pydata-2025-61192-is-prompt-engineering-dead-how-auto-optimization-is-changing-the-game PyData: Natural Language Processing & Audio (incl. Generative AI NLP) Iryna KondrashchenkoOleh Kostromin en With the rise of LLMs, prompt engineering has become a highly impactful skill in the AI industry. However, manual prompt tuning is challenging, time-consuming, and not always generalizable across different models. This raises a reasonable question: can prompts be automatically learned from data? The answer is yes, and in this talk, we will explore how. First, we will provide a high-level overview of various prompt optimization approaches, starting with a simple technique like bootstrapped few-shot, which automatically generates and selects an optimal set of demonstrations for each step in the LLM chain. Then, we will discuss more complex approaches, such as MIPRO and TextGrad, which directly optimize the instructions. Afterwards, we will move on to a more practical part by showcasing how these techniques can be used via popular frameworks such as DSPy and AdalFlow. Finally, we will discuss the benefits and trade-offs of these approaches and frameworks in terms of costs, complexity and performance, so the audience can decide whether prompt engineering is truly dead. **Outline:** * Introduction (2 min) * Discussion of problems with manual prompt engineering (2 min) * Overview of existing prompt optimization approaches (10 min): * Bootstrapped few-shot (3 min) * MIPRO (3 min) * TextGrad (4 min) * Showcasing the prompt optimization frameworks (8 min): * DSPy (4 min) * AdalFlow (4 min) * Comparison of methods and concluding remarks (3 min) * Q&A (5 min) false https://pretalx.com/pyconde-pydata-2025/talk/GURXPK/ https://pretalx.com/pyconde-pydata-2025/talk/GURXPK/feedback/ Zeiss Plenary (Spectrum) Lightning Talks (1/2) Lightning Talks 2025-04-23T18:30:00+02:00 18:30 01:00 Lightning Talks at PyCon DE & PyData are short, 5-minute presentations open to all attendees. They’re a fun and fast-paced way to share ideas, showcase projects, spark discussions, or raise awareness about topics you care about — whether technical, community-related, or just inspiring. No slides are required, and talks can be spontaneous or prepared. It’s a great chance to speak up and connect with the community! Please note: community conference and event announcements are limited to 1 minute only. All event announcements will be collected in a slide slide deck. pyconde-pydata-2025-68193-lightning-talks-1-2 General: Others Valerio Maggio en ### ⚡ Lightning Talk Rules * No promotion for products or companies. * No call for 'we are hiring' (but you may name your employer). * One LT per person per conference policy. #### Community Event Announcements * ⏱ You want to announce a community event? You have ONE minute. * All event announcements will be collected in a single slide slide deck, see instructions at the Lightning Talk desk in the Community Space in the Lounge on Level 1. #### All other LTs: * ⏱ You have exactly 5 minutes. The clock starts when you start — and ends when time’s up. That’s the thrill of Lightning Talks ⚡ * 🎯 Be sharp, clear, and fun. Introduce your idea, make your point, give the audience something to remember. No pressure. (Okay, maybe a little.) * 🎲 You must include at least **one entry from the [official Bingo Card list](/bingocard/)**. Every audience member will receive a Bingo card — and they’ll be watching 👀 Your job? Choose at least one Bingo item from the [official Bingo Card list—](/bingocard/)and drop it into your talk. Subtly or dramatically — your style. * 🐍 Keep it relevant to Python, PyData and the community. You can go broad — tools, workflows, stories, experiments — as long as there’s some connection to Python, PyData or the community. * 👏 Keep it respectful. Keep it awesome. Humor is welcome, but please be kind, inclusive, and professional. * 🎤 Be ready when your name is called. We’re running a tight session — speakers go on stage rapid-fire. Stay close and stay hyped. * 🏆 Bonus prizes may be awarded. Best talk, best Bingo moment, most unexpected Hogwarts reference... who knows what could happen? #### How to Submit The Lightning Talk desk is located in the Community Space in the Lounge on Level 1. false https://pretalx.com/pyconde-pydata-2025/talk/ESD7KF/ https://pretalx.com/pyconde-pydata-2025/talk/ESD7KF/feedback/ Titanium3 Why E.ON Loves Python Talk 2025-04-23T11:45:00+02:00 11:45 00:30 Join me as I share my 20-year journey with Python and its pivotal role at E.ON. Discover how we transitioned fully to Python, streamlined our development framework, and embraced MLOps principles. Learn about some of our AI projects, including image analysis and real-time inference, and our steps towards open-sourcing code to foster innovation in the energy sector. Explore why Python is our go-to language for data science and collaboration. pyconde-pydata-2025-60161-why-e-on-loves-python PyCon: MLOps & DevOps Christer Friberg en In this talk, I will share my journey with Python, spanning over 20 years, and how it has become an integral part of our work at E.ON. My experience with open source began over 30 years ago during my research as a Theoretical Particle Physicist, where sharing insights and code was a daily practice. Transitioning to a software developer role at a start-up, I initially used Perl for various tasks but soon realized the challenges of code readability and collaboration. Python, with its enforced indentation and readability, quickly became my language of choice. At E.ON, Python is our go-to language for Data Science tasks. In our team we recently migrated another programming language codebase to Python to streamline our development framework and attract top talent. Python's straightforward modularization into packages and modules simplifies maintenance and lineage, especially in cloud-based pipelines, and helps prevent vendor lock-in. The robust toolchain for code quality checks, testing, and building packages makes Python a no-brainer for development and supports our MLOps principles. I will discuss how Python facilitates collaboration globally at E.ON and share examples of our MLOps principles in action. Highlights include image analysis projects like object detection with batch inferencing and instance segmentation with real-time inference endpoints. Additionally, I will detail E.ON's steps towards open-sourcing some of our codebases, enabling other energy companies to build on our projects. Join me to explore why Python is not just a tool but a catalyst for innovation and collaboration at E.ON. false https://pretalx.com/pyconde-pydata-2025/talk/JM3G8S/ https://pretalx.com/pyconde-pydata-2025/talk/JM3G8S/feedback/ Titanium3 Why Exceptions Are Just Sophisticated Gotos - and How to Move Beyond Talk (long) 2025-04-23T12:25:00+02:00 12:25 00:45 "Why Exceptions Are Just Sophisticated Gotos - and How to Move Beyond" explores a common programming tool with a fresh perspective. While exceptions are a key feature in Python and other languages, they share surprising similarities with the notorious goto statement. This talk examines those parallels, the problems exceptions can create, and practical alternatives for better code. Attendees will gain a clear understanding of modern programming concepts and the evolution of programming. pyconde-pydata-2025-60094-why-exceptions-are-just-sophisticated-gotos-and-how-to-move-beyond PyCon: Programming & Software Engineering Florian Wilhelm en Exceptions have long been seen as an improvement over error-handling approaches like goto. However, they can introduce complexity and obscure control flow when used without care. This talk will critically examine exceptions, outline the similarities to goto, and explore better ways to handle errors in programming. ### Outline: 1. Introduction (5 minutes) - The historical role of goto in programming. - Spaghetti code and the rise of structured programming. - How exceptions emerged as an alternative. 2. Why and What Are Exceptions (10 minutes) - Why exceptions were introduced. - How they became mainstream in languages like Java and C++. - Common problems caused by exceptions: hidden control flow, debugging challenges, and performance impacts. 3. The Evolution Toward Result Types (10 minutes) - How result types address the shortcomings of exceptions. - Implementations in Haskell, Rust, and Golang. - Real-world benefits of using result types. 4. Using Result Types in Python (10 minutes) - Introducing the returns package. - Practical examples of result types in Python. - How this approach improves code clarity and reliability. 5. Conclusion (5 minutes) - Recap of the journey from goto to exceptions to result types. - Key takeaways: thoughtful error handling and modern best practices. - Encouragement to explore and adopt better patterns in Python. This session is ideal for intermediate and advanced Python developers seeking actionable techniques to improve error handling and write cleaner, more predictable code. false https://pretalx.com/pyconde-pydata-2025/talk/S8MUBF/ https://pretalx.com/pyconde-pydata-2025/talk/S8MUBF/feedback/ Titanium3 LLM Inference Arithmetics: the Theory behind Model Serving Talk 2025-04-23T14:30:00+02:00 14:30 00:30 Have you ever asked yourself how parameters for an LLM are counted, or wondered why Gemma 2B is actually closer to a 3B model? You have no clue about what a KV-Cache is? (And, before you ask: no, it's not a Redis fork.) Do you want to find out how much GPU VRAM you need to run your model smoothly? If your answer to any of these questions was "yes", or you have another doubt about inference with LLMs - such as batching, or time-to-first-token - this talk is for you. Well, except for the Redis part. pyconde-pydata-2025-61362-llm-inference-arithmetics-the-theory-behind-model-serving PyData: Generative AI Luca Baggi en The talk will cover the theory necessary to understand how to serve LLMs. The talk covers the math behind transformers inference in an accessible and light way. By the end of the talk, attendants will learn: 1. How to count the parameters in an LLM, especially the ones in the attention layers. 2. The difference between compute and memory in the context of LLM inference. 3. That LLM inference is made up of two parts: prefill and decoding. 4. What is an LLM server, and what features they implement to optimise GPU memory usage and reduce latency 4. How batching affects your inference metrics, like time-to-first-token. The talk will cover: **Did you pay attention?** (4 min). A short review of the attention mechanism and how to count parameters in a transformer-based model. **Get to know your params** (8 min). The math-y section of the talk, explaining how to translate parameter counts into memory and compute requirements. **Prefill and Decoding** (8 min) Explains that inference happens in two steps (prefill and decoding) and how KV-cache exploits this to make decoding faster. Common metrics to measure inference performance, like time-to-first-token and token-per-second. **Context and batch size** (5 min) Adds to the picture the sequence length, as well as the number of requests to process in parallel. Explains how LLM servers, like vLLM, use techniques like Paged Attention to optimise GPU usage **Conclusion** (5 min) Wrap up, Q&A. false https://pretalx.com/pyconde-pydata-2025/talk/G3AT7E/ https://pretalx.com/pyconde-pydata-2025/talk/G3AT7E/feedback/ Titanium3 Size matters: Inspecting Docker images for Efficiency and Security Talk (long) 2025-04-23T15:10:00+02:00 15:10 00:45 Inspecting Docker images is crucial for building secure and efficient containers. In this session, we will analyze the structure of a Python-based Docker image using various tools, focusing on best practices for minimizing image size and reducing layers with multi-stage builds. We’ll also address common security pitfalls, including proper handling of build and runtime secrets. While this talk offers valuable insights for anyone working with Docker, it is especially beneficial for Python developers seeking to master clean and secure containerization techniques. pyconde-pydata-2025-61850-size-matters-inspecting-docker-images-for-efficiency-and-security PyCon: MLOps & DevOps Irena Grgic en 1. **Introduction** - We start with an example Dockerfile for a Python-based image. - We will explore the role of OverlayFS, Docker’s file system for combining layers, to understand how layers stack and how data (or even secrets) can be retrieved from individual layers. 2. **Layer Analysis** - To gain better understanding of layering, we use simple command-line tools like `docker history` and `docker inspect` to examine image layers. - We introduce `dive`, a tool for exploring the contents of each layer. - We apply these insights to optimize the image by implementing multi-stage builds to create a smaller image with fewer layers, improving storage efficiency, build speed, and security. - We discuss the benefits of Docker’s caching mechanism in reducing build times. 3. **Security Enhancements** - Given our example image, we will use `trivy`, a comprehensive security scanner, to scan the example image for vulnerabilities and demonstrate how to address common issues. - Finally, we introduce `hadolint`, an open-source linter for Dockerfiles. To get the most out of this session, participants are encouraged to clone the session's [repository](https://github.com/pythonmonty/inspect-docker-images). false https://pretalx.com/pyconde-pydata-2025/talk/GJ9MVT/ https://pretalx.com/pyconde-pydata-2025/talk/GJ9MVT/feedback/ Titanium3 Guiding data minds: how mentoring transforms careers for both sides Talk 2025-04-23T17:10:00+02:00 17:10 00:30 Mentorship is a powerful way to shape careers while building meaningful connections in the data field. In this talk, I’ll share my journey as a professional mentor, what the role entails, and the impact it has on both mentees and mentors. Learn how mentorship drives growth, fosters innovation, and creates value for the data community—and why you should consider stepping into this rewarding role. pyconde-pydata-2025-59827-guiding-data-minds-how-mentoring-transforms-careers-for-both-sides General: Community & Diversity Anastasia Karavdina en Mentorship is a rewarding journey that allows experienced professionals to guide and empower the next generation of talent. As a mentor in the data field, I have had the privilege of helping individuals navigate their careers, refine their skills, and unlock their potential. In this talk, I will share my personal journey into becoming a professional mentor, how I approach mentorship in a structured and impactful way, and the unique value a mentorship brings to both mentees and mentors. I’ll provide insights into the day-to-day activities of mentoring, from offering career guidance to solving technical challenges, while also discussing the importance of tailoring advice to individual goals. Beyond technical skills, mentorship fosters confidence, networking, and long-term growth for mentees while offering mentors opportunities for personal development, deep satisfaction, and a broader industry perspective. With the rapid evolution of the data industry, mentorship has never been more critical. This talk will highlight how professionals at any stage of their career can engage in mentorship to create a ripple effect of positive change in the data community—and why taking the step to become a mentor, paid or otherwise, is an investment in the future of data science and yourself. false https://pretalx.com/pyconde-pydata-2025/talk/TYXMZC/ https://pretalx.com/pyconde-pydata-2025/talk/TYXMZC/feedback/ Titanium3 The earth is no longer flat - introducing support for spherical geometries in Spherely and GeoPandas Talk 2025-04-23T17:50:00+02:00 17:50 00:30 The geometries in GeoPandas, using the Shapely library, are assumed to be in projected coordinates on a flat plane. While this approximation is often just fine, for global data this runs into its limitations. This presentation introduces spherely, a Python library for working with vector geometries on the sphere, and its integration into GeoPandas. pyconde-pydata-2025-61899-the-earth-is-no-longer-flat-introducing-support-for-spherical-geometries-in-spherely-and-geopandas PyData: PyData & Scientific Libraries Stack Joris Van den Bossche en Not all geospatial data are best represented using a projected coordinate system. Unfortunately, the Python geospatial ecosystem is almost fully based on planar geometries using Shapely, and is still lacking a general purpose library for efficient manipulation of geometric objects on the sphere. We introduce Spherely: a new Python library that fills this gap, aiming to provide a similar API as Shapely, but then gor geometries on the sphere. Spherely provides Python/Numpy vectorized bindings to S2Geometry, a mature and performant C++ library for spherical geometry that is widely used for indexing and processing geographic data, notably in popular database systems. This is done via S2Geography, a C++ library that has emerged from the R-spatial ecosystem and that provides a GEOS-like compatibility layer on top of S2Geometry. Unlike S2Geometry’s SWIG wrappers or S2Sphere (pure-Python implementation), Spherely exposes its functionality via “universal” functions operating on n-dimensional Numpy arrays, therefore greatly reducing the overhead of the Python interpreter. Complementary to Shapely 2.0, Spherely may be used as a backend geometry engine for Python geospatial libraries like GeoPandas, hence extending their functionality to more robust and accurate manipulation of geographic data (i.e., using longitude and latitude coordinates). This presentation introduces spherely and its capabilities to work with vector geometries on the sphere, and its integration into GeoPandas. Code repository: https://github.com/benbovy/spherely false https://pretalx.com/pyconde-pydata-2025/talk/AWPYGE/ https://pretalx.com/pyconde-pydata-2025/talk/AWPYGE/feedback/ Helium3 Introducing the Synthetic Data SDK - Privacy Preserving Synthetic Data for AI/ML Sponsored Talk 2025-04-23T11:45:00+02:00 11:45 00:30 AI-generated synthetic data is gaining traction as a privacy-safe solution for data access and sharing. This data is created from original datasets, maintaining privacy without compromising utility. In this Session, we'll cover the fundamental concepts of AI-generated synthetic data and demonstrate how easy it is to generate synthetic data within your local compute environment using the open-source Synthetic Data SDK. pyconde-pydata-2025-66106-introducing-the-synthetic-data-sdk-privacy-preserving-synthetic-data-for-ai-ml PyData: Data Handling & Engineering Michael Platzer en Privacy regulations are tightening globally, making it increasingly challenging for organizations to access and share data while ensuring compliance. AI-generated synthetic data is gaining traction as a privacy-safe solution for data access and sharing. This data is created from original datasets, maintaining privacy without compromising utility. MOSTLY AI has recently released an efficient and flexible Synthetic Data SDK under a fully permissive Apache v2 license, empowering anyone to generate high-quality synthetic data with top-tier performance. Powered by the TabularARGN model architecture, the SDK achieves training times 10x to 100x faster than existing models, while acchieving a SOTA fidelity-privacy balance. In this Session, we'll cover the fundamental concepts of synthetic data and demonstrate how easy it is to generate synthetic data directly from a Jupyter Notebook using the Synthetic Data SDK. Specifically, we will go through - Installing the Synthetic Data SDK - Loading original data into the SDK and locally creating a Generator - Using a Generator to create different versions of synthetic data - Uploading a Generator to the MOSTLY AI Platform and sharing it with the world This will be a hands-on session - so come with your laptop and ideally a dataset that you'd like to synthesize! false https://pretalx.com/pyconde-pydata-2025/talk/MQG9HN/ https://pretalx.com/pyconde-pydata-2025/talk/MQG9HN/feedback/ Helium3 expectation: A modern take on statistical A/B testing with e-values and martingales Talk (long) 2025-04-23T12:25:00+02:00 12:25 00:45 This talk introduces a novel Python library for statistical testing using e-values, offering a refreshing alternative to traditional p-values. We'll explore how this approach enables real-time sequential testing, allowing data scientists to monitor experiments continuously without the statistical penalties of repeated testing. Through practical examples, we'll demonstrate how e-values provide more intuitive evidence measures and enable flexible stopping rules in A/B testing, clinical trials, and anomaly detection. The library implements cutting-edge methods from game-theoretic probability, making advanced sequential testing accessible to Python practitioners. Whether you're conducting A/B tests, monitoring production models, or running clinical trials, this talk will equip you with powerful new tools for sequential data analysis. pyconde-pydata-2025-60861-expectation-a-modern-take-on-statistical-a-b-testing-with-e-values-and-martingales PyData: Machine Learning & Deep Learning & Statistics Jako Rostami en Modern data science demands flexible statistical methods that can handle sequential data analysis and continuous monitoring. Traditional p-values, while widely used, have limitations when dealing with sequential testing scenarios. This talk introduces a Python library that implements e-values and e-processes, offering a more natural approach to measuring statistical evidence and enabling true sequential testing. Outline: 1. Statistical toolkit - Current tools - Purpose and fundamental concepts - Challenges in modern statistics - Type 1 error concerns - Optional stopping problems 2. Sequential testing - Origins - The concept of sequential testing - Peeking 3. e-values - What are e-values? - Definitions and concepts - Betting interpretation - Wealth process - Ville's inequality - Anytime valid inference - p-value vs. e-value differences 4. Python library - Architecture - Core components - Installation and basic setup 5. Demo 1: A/B testing 6. Beyond A/B testing - Broader applications - Conformal e-testing - Confidence sequences 7. Demo 2: It is a versatile library 8. Acknowledgments Q&A Session false https://pretalx.com/pyconde-pydata-2025/talk/ZKNTGN/ https://pretalx.com/pyconde-pydata-2025/talk/ZKNTGN/feedback/ Helium3 Benchmarking Time Series Foundation Models with sktime Talk 2025-04-23T14:30:00+02:00 14:30 00:30 Recent time series foundation models such as LagLlama, Chronos, Moirai, and TinyTimesMixer promise zero-shot forecasting for arbitrary time series. One central claim of foundation models is their ability to perform zero-shot forecasting, that is, to perform well with no training data. However, performance claims of foundation models are difficult to verify, as public benchmark datasets may have been a part of the training data, and only the already trained weights are available to the user. Therefore, performance in specific use cases must be verified based on the use case data itself to ensure a reliable assessment of forecasting performance. sktime allows users to easily produce a performance benchmark of any collection of forecasting models, foundation models, simple baselines, or custom methods on their internal use case data. pyconde-pydata-2025-61277-benchmarking-time-series-foundation-models-with-sktime PyData: Machine Learning & Deep Learning & Statistics Benedikt HeidrichFranz Kiraly en In the past years, time series foundation models emerged. They have the potential to change time series forecasting. For example, multiple time series models such as LagLlama, Chronos, Moirai, and TinyTimesMixer promise zero-shot forecasting for arbitrary time series. Furthermore, also sktime started to unify the interfaces of the various foundation models to make the usage of those models easy. However, whether these time series foundation models provide added value to various forecasting applications is still unclear. Thus, benchmarking is necessary. In sktime, we have implemented a benchmarking module enabling easy comparison of those time series foundation models on custom datasets and with arbitrary metrics. Our talk will outline how sktime’s benchmarking module works and how users can use it to evaluate time series foundation models. We will show how to combine the benchmarking module with the time series foundation models. We will show the results of a small benchmarking study using time series foundation models and statistical time series models. We will outline our roadmap for time series foundation models. sktime is developed by an open community with the aim of ecosystem integration in a commercially neutral, charitable space. We welcome contributions or donations and seek to provide opportunities for anyone worldwide. false https://pretalx.com/pyconde-pydata-2025/talk/GUKTNX/ https://pretalx.com/pyconde-pydata-2025/talk/GUKTNX/feedback/ Helium3 PyData Stack: Pure Python open source data platforms Talk (long) 2025-04-23T15:10:00+02:00 15:10 00:45 Modern open source Python data packages offer the opportunity to build and deploy pure Python, production-ready data platforms. Engineers can and do play a big role in helping companies become data-driven by centralising this data, cleaning and modelling it and presenting back to the business. Now more than ever it allows engineers and companies of any size the ability to build data products and insights for relatively low cost. In this talk we’ll walk through the key components of this stack, tooling options available and demo a deployable containerised Python data stack. pyconde-pydata-2025-61785-pydata-stack-pure-python-open-source-data-platforms PyData: Data Handling & Engineering Eric Thanenthiran en Modern data platforms can be built and deployed using completely open source, Python packages. In this talk, I’ll cover what constitutes a modern data stack and what open source Python packages can be used to build a stack suitable for the needs of most developers and companies. Rather than a one size fits-all approach, I’ll initially demonstrate the rich ecosystem of technologies available and the pros and cons of the technology choices. To be concrete, we will demo an instance of this type of self-contained, deployable platform that is composed of specific technology choices for the key components: data pipelines, transformation engine, data warehouse, presentation layer and orchestration. This implementation will only use Python with a sprinkling of SQL. Structure 1. What is a data stack? 2. Data Stores 3. Pipelines 4. Transformation 5. Orchestration 6. Visualisation Outcomes The aim of this talk is to equip attendees with an understanding of the available python libraries and the knowledge to build their own data platforms. This would specifically be useful for attendees who may be software or backend engineers who may also be called upon to own the data stack to support business and analyst use cases. It may also help engineers who may be looking to re-platform legacy, expensive data platforms to a more modern data stack. For research and personal projects, spinning up a modern platform could be useful for compute heavy analytics that have outgrown local development. false https://pretalx.com/pyconde-pydata-2025/talk/PRRPQ3/ https://pretalx.com/pyconde-pydata-2025/talk/PRRPQ3/feedback/ Helium3 How to use Data Science Superpowers in real life, a Bayesian perspective Talk 2025-04-23T16:10:00+02:00 16:10 00:30 In the data science field, we use all these powerful methods to solve important problems. Most of the time, we do this very well because our data science and machine-learning toolbox fits the problems we tackle quite precisely. Yet, what about our everyday choices or even our most important life decisions? Can we use for our private lives what we advocate for in our jobs or are these choices inherently different? Many of this real life decisions are a little different than textbook machine-learning problems. There is often less or hard-to-come-by data and the decisions are infrequent, but sometimes very consequential. This talk will dive into what makes everyday decisions difficult to handle with our data science toolbox. It will show how Bayesian thinking can help to reason in such cases, especially when there is not a lot of data to rely on. pyconde-pydata-2025-60453-how-to-use-data-science-superpowers-in-real-life-a-bayesian-perspective PyData: Machine Learning & Deep Learning & Statistics Tim Lenzen en In this talk, I want to have a look on decision making from a slightly different angle. In a world that produces an ever growing amount of data in every domain, data scientists can shine with their tools to make data-driven decisions. Often there is even too much data and the most tedious part of the work is to remove the noise from the signal with clever feature engineering. Though the world gets covered more and more by big data, this development is not distributed evenly. Lots of decisions we need to make in real life do not follow this pattern. In fact, there are often surprisingly few data points that help us here. Yet, are there fundamental differences between everyday decisions and the type of decisions we automate so well with machine learning in our jobs? In this part of the talk, I will attempt a characterisation of both types of decisions. We will have a closer look at what implicit assumptions we make to use our machine learning toolbox. After this we might get a first explanation why these tools might be unsuited to answer questions like ’how longe should I study for an exam’ or ‚’should I accept this new job or not’. Enter Bayesian statistics: This part of the talk will introduce Bayesian statistics for beginners using simple examples and images. It will highlight the benefits of the method when we are short of data but have some additional experience not encoded in the data. I will show how in these circumstances prior distributions come in really handy. After laying the groundwork on Bayesian methods we will circle back to the everyday decisions and see how well both things fit together. On a higher level, this will show what makes problems in decision making a great fit for Bayesian methods. I will introduce this using a practical example. The example will deal with the decision how long one should study for a test or exam. Taking a step-by-step approach, we explore how this decision can be informed with just a few data points. Set aside finding the key to successful exam preparation, the example is also helpful to see some of the basics for working with the pymc library. The talk will end with some more general thoughts. This will answer where to go from here and for which decisions a thorough investigation like the presented one is worthwhile.,Yet, once one is familiar with the basics of Bayesian thinking, there might be shortcuts. I will show that we can use the principles as a great tool to improve discussions about important decisions on a broader scale. false https://pretalx.com/pyconde-pydata-2025/talk/RLTZTC/ https://pretalx.com/pyconde-pydata-2025/talk/RLTZTC/feedback/ Helium3 Information Retrieval Without Feeling Lucky: The Art and Science of Search Talk 2025-04-23T17:10:00+02:00 17:10 00:30 Search is everywhere, yet effective Information Retrieval remains one of the most underestimated challenges in modern technology. While Retrieval-Augmented Generation has captured significant attention, the foundational element - Information Retrieval - often remains underexplored. In this talk, we put Information Retrieval center stage by asking: How do we know that user queries and data 'speak' the same language? How do we evaluate the relevance and completeness of search results? And how do we prioritize what gets displayed? Or do we even want to hide specific content? We try to answer these questions by introducing the audience to the art and science of Information Retrieval, exploring metrics such as precision, recall, and desirability. We’ll examine key challenges, including ambiguity, query relaxation, and the interplay between sparse and dense search techniques. Through a live demo using public content from Sendung mit der Maus, we show how hybrid search improves upon vector and keyword based search in isolation. pyconde-pydata-2025-61878-information-retrieval-without-feeling-lucky-the-art-and-science-of-search General: Others Anja Pilz en Information Retrieval goes beyond keyword matching - it’s about intent, context, and delivering relevant and accurate results. As RAG applications gain traction, understanding the retrieval process becomes more crucial for developers, data scientists, and search engineers. We start with the Why. People have different needs for search - lookup, research, and inspiration. Each of these needs can be influenced and affected by the key IR metrics of search engines: precision, recall, and desirability. Having introduced these fundamentals, we go into common retrieval challenges, such as ambiguity, mismatched vocabularies, and the impact of context. Aiming to solve these challenges, we then go into advanced search techniques, comparing sparse (keyword-based) and dense (vector-based) retrieval, highlighting their strengths and limitations. We’ll explore hybrid search as a powerful approach that blends these techniques. In a live demo, using crawled data from the Sendung mit der Maus, we’ll showcase a hybrid search setup leveraging tools like Mistral, Elasticsearch, and Streamlit. While the dataset language is German, the core concepts and search dynamics should hopefully be easily understandable also for non native speakers. The talk concludes with key takeaways on building effective search systems and a look ahead at future developments in contextualized search. Tentative Outline: 1. Introduction to Information Retrieval (~ 5 min) * Why do we search? Lookup, research, inspiration * Core metrics: precision, recall, desirability 2. Challenges in Search and Retrieval (~ 5 min) * Ambiguity * Discrepancy in query and content * The impact of context 3. Search Techniques (~ 5 min) * Sparse vs dense retrieval: comparing keyword and vector search (semantic search, embeddings, synsets, decompounders) * Hybrid search: Combining sparse and dense approaches 4. Hybrid Search in Action (~ 10 min) * Setting up a hybrid search with Mistral, Elasticsearch, and Streamlit * Live Demo: exploring search in Lach- & Sachgeschichten from Sendung mit der Maus 5. Takeaways & Outlook (< 5 min) * hybrid search systems combine semantics, precision and explainability * contextualized search The talk is directed at anyone interested in building or improving search systems. Attendees will gain a deeper understanding of the tools, methodologies, and metrics essential for building robust and explainable search systems. true https://pretalx.com/pyconde-pydata-2025/talk/ZHT9HW/ https://pretalx.com/pyconde-pydata-2025/talk/ZHT9HW/feedback/ Helium3 🦀 Rüstzeit: Asynchronous Concurrency in Python & Rust Talk 2025-04-23T17:50:00+02:00 17:50 00:30 Many Python developers are enhancing their Rust knowledge and want to take the next step in translating their understanding of advanced concepts like asynchronous programming. In this talk, I'll help you take that step by juxtaposing Python's asyncio with Rust's async ecosystems, tokio and async-std. Through real-world examples and insights from conversations with graingert, co-author of Python's Anyio, we'll explore how each language approaches asynchronous execution, highlighting similarities and differences in syntax, performance, and ecosystem support. This talk aims to persuade you that by leveraging Rust's powerful type system and compiler guarantees, we can build fast, reliable async code that's less prone to race conditions and concurrency bugs. Whether you're a Pythonista venturing into Rust or a Rustacean curious about Python's concurrency model, this session will provide practical insights to help you navigate async programming across both languages. Welcome to Rüstzeit: Prepare to navigate async programming across both ecosystems. pyconde-pydata-2025-61842-rustzeit-asynchronous-concurrency-in-python-rust General: Rust Jamie Coombes en Talk Timings (30 minutes): Introduction and Hybrid Programming in Python and Rust [5 mins] Asynchronous Programming in Python [5 mins] Asynchronous Programming in Rust [5 mins] Performance Comparison: Python vs. Rust [1 min] Leveraging Rust's Type System and Compiler Guarantees [5 mins] Case Study: "A Million Large Language Monkeys at a Million Typewriters" – Building Scalable Microservices with Tokio [7 mins] (Optional) Tom's Library: AnyIO and Unified Async in Python [3 mins] Conclusion and Takeaways [3 mins] --- Many Python developers are enhancing their Rust knowledge and want to take the next step in translating their understanding of advanced concepts like asynchronous programming. In this talk, I'll help you take that step by juxtaposing Python's asyncio with Rust's async ecosystems, tokio and async-std. Through real-world examples and insights from conversations with graingert, co-author of Python's Anyio, we'll explore how each language approaches asynchronous execution, highlighting similarities and differences in syntax, performance, and ecosystem support. This talk aims to persuade you that by leveraging Rust's powerful type system and compiler guarantees, we can build fast, reliable async code that's less prone to race conditions and concurrency bugs. Whether you're a Pythonista venturing into Rust or a Rustacean curious about Python's concurrency model, this session will provide practical insights to help you navigate async programming across both languages. Welcome to Rüstzeit: It's time to prepare for async programming in Python and Rust. Further Resources: https://rust-lang.github.io/async-book/ https://anyio.readthedocs.io/en/stable/ https://github.com/graingert false https://pretalx.com/pyconde-pydata-2025/talk/FGFFEE/ https://pretalx.com/pyconde-pydata-2025/talk/FGFFEE/feedback/ Platinum3 Interactive end-to-end root-cause analysis with explainable AI in a Python Shiny App Sponsored Talk 2025-04-23T11:45:00+02:00 11:45 00:30 We demonstrate a pure Python solution for exploring and understanding datasets using state-of-the-art machine learning and explainable AI techniques. Our application features a reactive dashboard built with Shiny, specifically designed for the daily work of data scientists. The tool provides insights into data rapidly and effortlessly through an interactive dashboard. It facilitates data preprocessing, interactive exploratory data analysis, on-demand model training, evaluation, and interpretation. It further renders dynamic, annotated, and interactive visualizations. This allows to pinpoint critical elements and relations as root causes in a haystack of features, compressing a full day's work into under an hour. Utilizing Plotly for dynamic visualizations, along with Scikit-learn, CatBoost, SHAP values, and MLflow for experiment tracking, married with shiny reactive dashboard, we facilitate quick and easy data preprocessing and exploration, model training and evaluation, together with explainable AI. pyconde-pydata-2025-66135-interactive-end-to-end-root-cause-analysis-with-explainable-ai-in-a-python-shiny-app PyData: Machine Learning & Deep Learning & Statistics Julius MöllerSimone Lederer en Problem Statement Data scientists' daily work is characterized by a repetitive and time-consuming cycle of exploratory data analysis, preprocessing, model training, and feature identification. This ultimately means missing key insights into the data. Time spent on repetitive tasks detracts from critical work. We enable data scientists to focus on what matters. Solution We streamline the data analysis process to facilitate efficient dataset exploration and uncovering critical insights without time spent on coding. We empower users to seamlessly conduct data preprocessing, interactive exploratory analysis, on-demand model training, evaluation, and interpretation, reducing the time to understand a dataset to under an hour. Demonstrator Our pure Python application features a reactive dashboard. It allows users to engage with data—uploading, manipulating, creating interactive visualizations, performing on-demand model training and interpretation, while tracking results in MLflow. We demonstrate how to quickly deliver insights and identify root causes. Architecture/Technical Implementation Our application is built entirely in Python, utilizing the Shiny framework for a reactive dashboard. The backend uses Plotly, Scikit-learn, CatBoost, SHAP values, and MLflow. We highlight the core functionalities and development choices, emphasizing data preprocessing, model training, evaluation, and explainable AI features. false https://pretalx.com/pyconde-pydata-2025/talk/AGLBMF/ https://pretalx.com/pyconde-pydata-2025/talk/AGLBMF/feedback/ Platinum3 Generative AI Monitoring with PydanticAI and Logfire Sponsored Talk (long) 2025-04-23T12:25:00+02:00 12:25 00:45 In this talk, we will explore how the integration of PydanticAI and Logfire creates a powerful foundation for generative AI applications. We'll demonstrate how these tools combine to form sophisticated AI workflows and give you comprehensive monitoring. The session illustrates how PydanticAI enables more reliable agent responses while Logfire provides real-time insights for efficient troubleshooting. Through practical examples, you'll learn implementation techniques that will help your team build AI systems with observability, transforming how you develop and maintain generative AI projects. 🚀 pyconde-pydata-2025-66967-generative-ai-monitoring-with-pydanticai-and-logfire PyData: Generative AI Marcelo Trylesinski en In this talk, we'll explore the essential techniques for developing generative AI applications that are not only powerful but also reliable and transparent. By leveraging the combined capabilities of PydanticAI and Logfire, developers can create systems that deliver consistent results while maintaining full visibility into their operations. We'll begin by examining how to create and configure PydanticAI agents, demonstrating how these structured components can form the backbone of sophisticated AI workflows. This foundation will be enhanced through a detailed exploration of Logfire monitoring implementation using MCP servers, providing a robust observability layer for your applications. The discussion will then shift to evaluation methodologies, offering practical approaches to assess and validate your AI applications' performance and accuracy. We'll delve into the advantages of structured outputs, showing how they enable more predictable and testable agent responses across various scenarios. Finally, we'll investigate how real-time insights can transform your troubleshooting process, allowing teams to quickly identify bottlenecks and resolve issues before they impact users. By the end of this session, you'll have a comprehensive understanding of how these tools and techniques can elevate your generative AI projects to new levels of reliability and observability. false https://pretalx.com/pyconde-pydata-2025/talk/GS9QWQ/ https://pretalx.com/pyconde-pydata-2025/talk/GS9QWQ/feedback/ Platinum3 AI coding agent - what it is, how it works and is it good for developers Sponsored Talk 2025-04-23T14:30:00+02:00 14:30 00:30 In this talk, we will have a deeper technical look at AI coding agents, their design, and how they can carry out coding tasks with the support of large language models. We will look at the journey from the user entering a prompt to how it converts to actions in completing the task. After that, we will look at the impact it could make in the industry, as a developer, whether or not you should use an AI coding agent, and what a user should be cautious of when using suchan agent. pyconde-pydata-2025-65732-ai-coding-agent-what-it-is-how-it-works-and-is-it-good-for-developers PyData: Generative AI Cheuk Ting Ho en ## Goal To educate developers, especially those who are using it about what an AI coding agent is. Explore potential benefits and also potential harm when using such tools. ## Target audience Anyone who is interested in AI agents, especially AI coding agents, wants to learn more about it and maybe try using them in their work or hobby coding projects. ## Outline What are AI agents - Examples of AI agents - What are AI coding agents How do AI coding agents work - Components in AI coding agents - How your prompts get processed - How to convert scripts in actions Pros and cons of using AI coding agent - benefit of using AI coding agents - what to be aware of when using AI coding agents Conclusions and Q&A false https://pretalx.com/pyconde-pydata-2025/talk/UWTH7C/ https://pretalx.com/pyconde-pydata-2025/talk/UWTH7C/feedback/ Platinum3 Inclusive Data for 1.3 Billion: Designing Accessible Visualizations Talk (long) 2025-04-23T15:10:00+02:00 15:10 00:45 According to the World Health Organization (WHO), an estimated 1.3 billion people (1 in 6 individuals) experience a disability, and nearly 2.2 billion people (1 in 5 individuals) have vision impairment. Improving the accessibility of visualizations will enable more people to participate in and engage with our data analyses. In this talk, we’ll discuss some principles and best practices for creating more accessible data visualizations. It will include tips for individuals who create visualizations, as well as guidelines for the developers of visualization software to help ensure your tools can help downstream designers and developers create more accessible visualizations. pyconde-pydata-2025-61303-inclusive-data-for-1-3-billion-designing-accessible-visualizations PyData: Visualisation & Jupyter Pavithra Eswaramoorthy en Specifically, we will cover: - What makes data visualizations inaccessible? We will cover accessibility fundamentals like color contrast, alternative text descriptions, keyboard navigation support, screen reader compatibility, and more, with specific examples and demonstrations. - Are Python data visualization tools accessible? We will teach how to analyze the visualization landscape and discuss how tool developers can begin and prioritize improvements. - How accessible is my visualization? We will demonstrate how to conduct accessibility audits for data visualization tools by performing and documenting two accessibility evaluation tests live. This talk will include specific examples from our ongoing work to improve the accessibility of Bokeh, a Python library for creating interactive data visualizations for web browsers. We hope this talk enables you to take the first few steps in making your next data visualization and your visualization tools, more accessible. false https://pretalx.com/pyconde-pydata-2025/talk/LNW3KE/ https://pretalx.com/pyconde-pydata-2025/talk/LNW3KE/feedback/ Platinum3 Jeannie: An Agentic Field Worker Assistant Sponsored Talk 2025-04-23T16:10:00+02:00 16:10 00:30 Jeannie is an LLM-based agentic workflow implemented in Python to automate task management for field workers in the energy sector. This system addresses inefficiencies and safety risks in tasks like PV panel installation and powerline repair. Using open-source tools (LangChain family, OpenStreetMap and OpenWeatherMap APIs), Jeannie retrieves tasks, fetches weather and directions, identifies past incidents via RAG, and emails tailored reports with safety warnings. This presentation offers a case study of Jeannie’s implementation for E.ON in Germany, demonstrating how daily task automation enhances worker safety and efficiency. Attendees will discover how to create agentic systems with Python, integrate APIs, and apply RAG for safety applications, with access to open-source code and data for replicating the workflow. pyconde-pydata-2025-66440-jeannie-an-agentic-field-worker-assistant PyData: Generative AI Andrei BeliankouJose Moreno Ortega en This talk showcases Jeannie, an Agentic LLM workflow which we designed and implemented to automate task management for field workers in the energy sector with a focus on E.ON’s daily routines in Germany. Field workers at E.ON are meant to manage many ongoing and urgent daily tasks, such as installing Photovoltaic panels, repairing powerlines, and revising smart meters, often under tight schedules and varying environmental conditions. Thorough preparation is key to efficient task accomplishment. Preparation steps may include weather assessments at the incident location, navigation guidelines, and knowledge of past incidents to ensure safety. However, manual coordination of these elements is time-consuming and error-prone, leading to inefficiencies and safety risks. Jeannie addresses this problem by automating the entire task management lifecycle. The talk will focus on the practical aspects of the system design and implementation using Python and state-of-the-art LLM and an open-source Agentic Workflow stack. The core system drives an agent fleet through the following steps: Agents in parallel • retrieve upcoming tasks from a storage facility, • gather critical information for the task location (weather, driving directions), • assess historical accidents at the given location and for similar tasks in the past, • generate tailored reports, • send the reports to workers assigned to the task, • follow up on task completion, • and log incidents. The workflow is orchestrated with LangGraph, leveraging libraries such as SQLAlchemy for database management, requests for API calls to fetch weather and directions (e.g., OpenWeatherMap and OpenStreetMap APIs with Reverse GeoCoding), smtplib for email automation, and an Azure OpenAI 4o endpoint as the LLM powering the Agents. The RAG component uses a vector store (built with the PGVector extension) to identify past incidents, ensuring workers are warned of potential risks specific to their task and locations. In the talk, we critically evaluate the system's current state and outline the directions for its further development. false https://pretalx.com/pyconde-pydata-2025/talk/7CAVX7/ https://pretalx.com/pyconde-pydata-2025/talk/7CAVX7/feedback/ Platinum3 Generative-AI: Usecase-Specific Evaluation of LLM-powered Applications Sponsored Talk 2025-04-23T17:10:00+02:00 17:10 00:30 This talk addresses the critical need for usecase-specific evaluation of Large Language Model (LLM)-powered applications, highlighting the limitations of generic evaluation benchmarks in capturing domain-specific requirements. It proposes a workflow for designing more reliable evaluatios to optimize LLM-based applications, consisting of three key activities: human-expert evaluation and benchmark dataset curation, creation of evaluation agents, and alignment of these agents with human evaluations using the curated datasets. The workflow produces two key outcomes: a curated benchmark dataset for testing LLM applications and an evaluation agent that scores their responses. The presentation further addresses the limitations, and best practices to enhance the reliability of evaluations, ensuring LLM applications are better tailored to specific use cases. pyconde-pydata-2025-66036-generative-ai-usecase-specific-evaluation-of-llm-powered-applications PyData: Natural Language Processing & Audio (incl. Generative AI NLP) Dr. Homa Ansari en Large Language Models (LLMs) are transformative technology, enabling a wide array of applications, from content generation to interactive chatbots. This technology is leveraged in creating LLM-powered applications. A wide variety of LLMs are offered, followed by independent and generic evaluation of their performance by the LLM community. The requirements and domain-specificity of the usecases behind the LLM-applications, renders this generic evaluation of the LLMs insufficient in revealing their performance issues. Furthermore, the usecase-specific performance evaluation of LLM-applications becomes a necessary component in the design and continuous development of the LLM-applications. In this talk, we address the need for usecase-specific evaluation of LLM-applications by proposing a workflow for creating evaluation models that support the selection and optimization of the design of LLM-applications. The workflow is comprised of three main activities: 1) Human-expert evaluation of LLM-applications & benchmark dataset curation 2) Creating evaluation agents 3) Aligning evaluation agents with human evaluation based on the curated dataset And it leads to two concrete outcomes: 1) Curated benchmark dataset: against which the LLM-applications will be tested. 2) Evaluation Agent: this is the scoring model which automatically evaluates the responses of the LLM-applications. The talk will elaborate on the workflow, the limitations, and best practices to increase the reliability of the evaluations considering the limitations. false https://pretalx.com/pyconde-pydata-2025/talk/GGJDTW/ https://pretalx.com/pyconde-pydata-2025/talk/GGJDTW/feedback/ Platinum3 From Idea to Integration: An Intro to the Model Context Protocol (MCP) Talk 2025-04-23T17:50:00+02:00 17:50 00:30 The Model Context Protocol (MCP) has emerged as a standard for connecting Large Language Models with diverse data sources and enabling interactions with other systems. In this talk, we’ll introduce the MCP standard and demonstrate how to build a MCP Server using real world examples. We’ll then explore its applications, showing how it empowers developers and makes data from complex systems accessible to non-technical users. Finally, we’ll dive into recent protocol updates, including improvements to Streamable HTTP transport and security enhancements, and share practical strategies for deploying MCP servers as well as clients. pyconde-pydata-2025-67583-from-idea-to-integration-an-intro-to-the-model-context-protocol-mcp PyData: Generative AI Julian Beck en ### From Idea to Integration: An Intro to the Model Context Protocol (MCP) The Model Context Protocol (MCP) has emerged as a standard for connecting Large Language Models with diverse data sources and enabling interactions with other systems. In this talk, we’ll introduce the MCP standard and demonstrate how to build a MCP Server using real world examples. We’ll then explore its applications, showing how it empowers developers and makes data from complex systems accessible to non-technical users. Finally, we’ll dive into recent protocol updates, including improvements to Streamable HTTP transport and security enhancements, and share practical strategies for deploying MCP servers as well as clients. **Talk Outline:** **Introduction to MCP** - What is the Model Context Protocol? - Core concepts: context exposure, streaming, and stateless interaction **MCP Architecture** - Overview of MCP Servers and Clients **Building an MCP Server** - Creating an MCP Server for Home Assistant - Connecting to a SQLite Database **Real-World Use Cases** - Demo: How MCP empowers developers with contextual tooling - Demo: How MCP enables non-technical users to access complex data **Recent Protocol Updates** - Streamable HTTP transport improvements - Security and authentication updates for MCP servers **Deployment Best Practices** - Deploying MCP servers and clients false https://pretalx.com/pyconde-pydata-2025/talk/J7YKEE/ https://pretalx.com/pyconde-pydata-2025/talk/J7YKEE/feedback/ Europium2 Building an Open Source RAG System for the United Nations Negotiations on Global Plastic Pollution Talk 2025-04-23T12:25:00+02:00 12:25 00:30 Plastic pollution is a significant global challenge. Every year, millions of tons of plastic enter the oceans, impacting marine ecosystems and human health. To address this issue, the United Nations is negotiating a legally binding treaty with representatives from 180 countries, aiming to reduce plastic pollution and promote sustainable practices. We have developed NegotiateAI, an open-source chat application that supports delegations during the UN negotiations on a legally binding agreement to combat plastic pollution. The tool demonstrates how generative AI and Retrieval Augmented Systems (RAG) can address complex global challenges. Built with Haystack 2.0, Qdrant, HuggingFace Spaces, and Streamlit, it showcases the potential of open-source technologies in tackling issues of global relevance. As a beginner or advanced developer, this talk will give you valuable insights into developing impactful AI applications with open source tools in the public sector. pyconde-pydata-2025-61124-building-an-open-source-rag-system-for-the-united-nations-negotiations-on-global-plastic-pollution PyData: Natural Language Processing & Audio (incl. Generative AI NLP) Rahkakavee BaskaranTeresa KroesenAnna-Lisa Wirth en Plastic pollution is a global crisis that requires urgent action. An estimated 4.8 to 12.7 million tons of plastic end up in the oceans every year. Forecasts show that global plastic waste will triple by 2060. In response, the United Nations is currently negotiating a legally binding agreement to end plastic pollution, involving representatives from 180 countries in a multi-year process. Tools that can streamline these complex negotiations can help support the negotiations. This talk introduces NegotiateAI, an open-source application developed to support delegations during the UN negotiations on a legally binding treaty to end plastic pollution. Developed with Haystack 2.0, Qdrant Vector Storage, HuggingFace Spaces, and Streamlit, NegotiateAI is a concrete example of how generative AI can be harnessed to address global challenges. While RAG is no longer a new concept, its variety continues to make it an essential approach for tackling real-world problems with LLMs, as demonstrated in this application. We will take you on a journey through the development of NegotiateAI from choosing the right tools, to overcoming technical challenges, to using the app in live UN negotiations. Along the way, we will explore the development of a robust RAG system. We’ll also discuss how we leveraged Streamlit to build a user-friendly interface, showcasing features such as multi-tab navigation and custom layouts that make the app intuitive and accessible to end users. During the session, we will highlight key challenges and present best practices for the coding structure. We will also show how we designed the app to be extensible and allow for the integration of additional data. Beginners will gain practical knowledge about building RAG systems and their real-world applications, while advanced developers will be inspired by the technical innovations, tool integration, and the potential of generative AI in the public sector. The talk will also provide insights into how organizations like the GIZ (German International Cooperation Society*)* are using AI to tackle pressing global issues and offer inspiration for anyone interested in the intersection of technology and sustainable development. false https://pretalx.com/pyconde-pydata-2025/talk/NF8UPF/ https://pretalx.com/pyconde-pydata-2025/talk/NF8UPF/feedback/ Europium2 Taking Control of LLM Outputs: An Introductory Journey into Logits Talk 2025-04-23T14:30:00+02:00 14:30 00:30 This talk explores logits - the raw confidence scores that language models generate before selecting each token. Understanding and manipulating these scores gives you practical control over how models generate text. In this introductory session, we'll explore the token-by-token generation process, examining how tokenizers work and why vocabulary matters. You'll learn about the relationship between logits, probabilities, and tokens. Then we will cover constrained decoding approaches and talk about structured generation. pyconde-pydata-2025-61119-taking-control-of-llm-outputs-an-introductory-journey-into-logits PyData: Natural Language Processing & Audio (incl. Generative AI NLP) Emek Gözlüklü en Logits are the raw numerical scores that language models compute for each token in their vocabulary before making a selection. These scores are converted to probabilities and used internally for token selection. Accessing and analyzing them directly opens up possibilities for controlling and understanding model behavior. We'll cover common sampling techniques like temperature adjustment, top-k, and top-p filtering, and beam search. Then we will see how logits can be used to evaluate model uncertainty, causing hallucinations. And we will talk about structured generation to use language models in deterministic projects. We will see how the logit values can be used to guide the generation process. Lastly we will explore the libraries like outlines and guidance by showcasing some example snippets about how to use them. If "token by token" is your only answer when someone asks how LLMs generate text, come join us and let's dig deeper together! false https://pretalx.com/pyconde-pydata-2025/talk/VDG9YG/ https://pretalx.com/pyconde-pydata-2025/talk/VDG9YG/feedback/ Europium2 Beyond Basic Prompting: Supercharging Open Source LLMs with LMQL's Structured Generation Talk (long) 2025-04-23T15:10:00+02:00 15:10 00:45 This intermediate-level talk demonstrates how to leverage Language Model Query Language (LMQL) for structured generation and tool usage with open-source models like Llama. You will learn how to build a RAG system that enforces output constraints, handles tool calls, and maintains response structure - all while using open-source components. The presentation includes hands-on examples where audience members can experiment with LMQL prompts, showcasing real-world applications of constrained generation in production environments. pyconde-pydata-2025-61309-beyond-basic-prompting-supercharging-open-source-llms-with-lmql-s-structured-generation PyData: Natural Language Processing & Audio (incl. Generative AI NLP) Christiaan Swart en 1. Introduction to structured generation with LMQL and open-source LLMs - Key differences between constrained and free-form generation - Why structure matters for production applications - Setting up LMQL with Llama 2. Building a RAG system with structured outputs - Implementing context retrieval with constraints - Enforcing response formats through LMQL decorators - Handling edge cases and error states 3. Tool usage and function calling - Implementing tool calls through LMQL - Managing tool execution flow - Error handling and fallbacks 4. Interactive segment - Audience members will write and test their own LMQL prompts through a live demo environment 5. Production considerations - Scaling structured generation - Monitoring and logging strategies Attendees will leave with practical knowledge of how to implement structured generation in their own projects using LMQL, understanding both the technical implementation and best practices for production deployment. false https://pretalx.com/pyconde-pydata-2025/talk/DQTMJB/ https://pretalx.com/pyconde-pydata-2025/talk/DQTMJB/feedback/ Europium2 Beyond FOMO — Keeping Up-to-Date in AI Talk 2025-04-23T16:10:00+02:00 16:10 00:30 The rapid evolution of AI technologies, particularly since the emergence of Large Language Models, has transformed the data science landscape from a field of steady progress to one of constant breakthroughs. This acceleration creates unique challenges for practitioners, from managing FOMO to battling imposter syndrome. Drawing from personal experience transitioning from mathematical modeling to modern AI development, this talk explores practical strategies for staying current while maintaining sanity. We'll discuss building effective learning structures, creating collaborative knowledge-sharing environments, and finding the right balance between innovation and implementation. Attendees will leave with actionable insights on navigating technological change while fostering sustainable growth in their teams and careers. pyconde-pydata-2025-61339-beyond-fomo-keeping-up-to-date-in-ai General: Education, Career & Life Carsten Frommhold en The landscape of data science and AI is evolving at an unprecedented rate. What started as a relatively stable field of mathematical modeling and time-series analysis has transformed into a whirlwind of weekly breakthroughs, especially since the emergence of Large Language Models. How do we stay current without succumbing to FOMO or imposter syndrome? In this talk, I'll share my personal journey from traditional mathematical modeling to modern AI development, exploring how the field's pace has shifted dramatically. Drawing from real-world experiences as a consultant and team lead, I'll discuss practical strategies for maintaining technical excellence while managing the psychological challenges of rapid technological change. We'll examine how to build effective learning structures within teams, the importance of creating safe spaces for knowledge sharing, and why sometimes it's okay to not be at the cutting edge of every new development. Through concrete examples and lessons learned, I'll offer insights on balancing client expectations, team growth, and personal development in an era where the technological landscape shifts weekly. Whether you're a seasoned data scientist or just entering the field, this talk will provide practical frameworks for navigating the exciting yet overwhelming world of modern AI development. false https://pretalx.com/pyconde-pydata-2025/talk/MSUCAS/ https://pretalx.com/pyconde-pydata-2025/talk/MSUCAS/feedback/ Europium2 Secure “Human in the Loop” Interactions for AI Agents Sponsored Talk 2025-04-23T17:10:00+02:00 17:10 00:30 Explore the power of Human-in-the-Loop (HITL) for GenAI agents! Learn how to build AI systems that augment your abilities, not replace your judgment, especially when high-stakes actions are involved. This session will focus on practical implementation using Python and Langchain to stay in control. pyconde-pydata-2025-66254-secure-human-in-the-loop-interactions-for-ai-agents PyData: Generative AI Juan Cruz Martinez en Imagine a world where AI agents handle complex tasks on your behalf – managing your finances, optimizing energy consumption in your home, or even coordinating logistics for a global supply chain. The potential benefits are enormous, but what happens when these agents need to perform critical actions? Most of us would probably prefer to have a say in those decisions. We want AI to augment our abilities, not replace our judgment, especially when high-stakes actions are involved. In this session we explore how to add a Human in the Loop (HITL) capabilities to your GenAI agents using Python and Langhain. false https://pretalx.com/pyconde-pydata-2025/talk/7RLYSQ/ https://pretalx.com/pyconde-pydata-2025/talk/7RLYSQ/feedback/ Europium2 Streamlining Python deployment with Pixi: A Perspective from production Talk 2025-04-23T17:50:00+02:00 17:50 00:30 In our quest to improve Python deployments, we explored Pixi, a tool designed to enhance dependency management within the Conda ecosystem. This talk recounts our experience integrating Pixi into a setup used in production. We leveraged Pixi to create lockfiles, ensuring consistent builds, and to automate deployments via CI/CD pipelines. This integration led to greater reliability and efficiency, minimizing deployment errors and allowing us to concentrate more on development. Join us as we share how Pixi transformed our deployment process and offer insights into optimizing your own workflows. pyconde-pydata-2025-60719-streamlining-python-deployment-with-pixi-a-perspective-from-production PyCon: MLOps & DevOps Dennis Weyland en In modern software development, managing dependencies effectively is crucial for ensuring that applications run smoothly across various environments. This talk explores our journey to optimize Python deployments by integrating Pixi into our workflow. As a tool that enhances the Conda ecosystem, Pixi offers a reliable and efficient solution to the common challenges in dependency management. While concepts such as consistent builds, reproducibility, and automated deployments are well-established, Pixi simplifies their implementation within a Conda-based environment, making these practices more accessible and manageable. The talk will cover - DevOps Concept Introducing concepts like lockfile, reproducible environments and CI/CD pipeline to set out a good baseline for deploying python code productively - Conda vs Pypi comparison Considering the tradeoffs between isolation and development comfort - Pixi introduction An introduction to the philosphy of pixi and how it compares to other conda tooling. This also covers how Pixi streamlines the implementation of DevOps concepts - Implementing DevOps concepts using pixi This talk is designed for professional software developers who prioritize a robust setup for deploying Python code as services into production. While familiarity with the Conda ecosystem is beneficial, it is not a prerequisite for this session. false https://pretalx.com/pyconde-pydata-2025/talk/BLKYGU/ https://pretalx.com/pyconde-pydata-2025/talk/BLKYGU/feedback/ Hassium Are LLMs the answer to all our problems? Talk 2025-04-23T11:45:00+02:00 11:45 00:30 Generative AI models have shaken up the German market. Since the release of ChatGPT, AI is available and usable for everyone. The number of ChatGPT-based agents is growing rapidly, but concerns about privacy, copyright and ethics remain. Regulation and ethical AI go hand in hand, but are often seen as barriers. The presentation will cover the different aspects of ethics and how they are addressed by regulation. It will give an overview of how to use large language models in a safe and practical way. This won't only address the various ethical issues, but also convince your next customer to invest in your AI-based product. pyconde-pydata-2025-60405-are-llms-the-answer-to-all-our-problems General: Ethics & Privacy Dr. Maria Börner en The talk will delve into the complexities of large language models (LLMs), exploring their capabilities and challenges. We'll look at bias in face recognition and word2vec, highlighting cases such as the COMPAS system and Amazon's recruitment tool, which have raised concerns about fairness and accuracy. The intersection of LLM and copyright will also be discussed, including the use of copyrighted material in training data and potential infringement issues. When talking about data, regulations such as the EU AI Act, the CLOUD Act and data privacy will be examined, raising important questions about data sovereignty and cross-border data transfers. The environmental impact of LLMs will be addressed, focusing on their significant carbon footprint and the need for sustainable solutions. An overview of the LLM landscape will be provided, including English models and European alternatives. By exploring these topics, participants will gain a deeper understanding of the opportunities and challenges presented by LLMs, as well as the regulatory frameworks and best practices that can help mitigate their risks. false https://pretalx.com/pyconde-pydata-2025/talk/EN3QPQ/ https://pretalx.com/pyconde-pydata-2025/talk/EN3QPQ/feedback/ Hassium The aesthetics of AI: from cyberpunk to fascism Talk 2025-04-23T12:25:00+02:00 12:25 00:30 Let’s explore the visual grammars, references and cultural norms at play in the field of AI; from Kismet to Spot®, from Clippy to Claude. As a sector we can be hyper-focused on technical process and function, to the extent that it blinkers our understanding of the cultural and political impacts of our work. Aesthetics infuse every aspect of technology. Aesthetic interpretations are manifold and mutable, constructed in-congress with the observer and not fully defined by the original designer. AI technologies add additional layers of subtext: character, consciousness, agency, intent. Despite this murkiness, or perhaps because of it, this talk makes an passionate argument for engaging with historical aesthetic movements, for building our shared professional knowledge of fads and fashions⎯not just from the past 40 years of internet culture⎯but also the past 140 years of ideology, technology, and thought. pyconde-pydata-2025-61350-the-aesthetics-of-ai-from-cyberpunk-to-fascism General: Others Laura Summers en **Talk Outline** - Define aesthetics - Differentiate aesthetics from visual design - Aesthetics of contemporary AI - History of aesthetics in technology - Link current technologies to historical aesthetic movements **Detailed description** The field of artificial intelligence has long been dominated by discussions of technical capabilities, algorithmic improvements, and functional benchmarks. Beneath this technical layer lies a rich but often unexamined tapestry of visual and cultural decisions that profoundly shape how we perceive, interact with, and ultimately integrate AI systems into our society. This talk moves beyond simple visual design to explore how aesthetics – study of the principles of beauty and artistic taste – shapes both the creation and interpretation of AI technologies. The aesthetic choices woven into today’s AI interfaces, both for end-users and industry practitioners, reveal our deep-seated assumptions about the world. Philosophers of art ask us to introspect: What is goodness? What is beauty? Which endeavours are most worthy of our attention? We can use these same questions to explore the ideas framing AI. “All watched over by machines of loving grace”, a poem by Richard Brautigan, imagines a utopian future where the natural and technological worlds achieve balance and harmony, and where humans are free to pursue creative, embodied pursuits, freed of menial labour. Is this the utopia imagined by OpenAI or DeepMind when they describe the imminent arrival of AGI? Does the world as described by the big brands of AI actually align with our own imaginings of progress, of utopia? A brief historical overview will trace how technological aesthetics have evolved, examining how different eras have visualized and presented technological innovations. This context sets the stage for drawing direct connections between current AI aesthetics and historical movements – revealing how contemporary design choices often unconsciously echo past ideological and artistic approaches. Through these connections, I’ll demonstrate why developing a broader aesthetic literacy is crucial for AI practitioners. Understanding these historical and cultural reference points can lead to more thoughtful and effective uses of AI. As our field continues to shape the future of human-machine interaction, this aesthetic awareness becomes not just an academic exercise, but a practical necessity: providing both the groundwork for nuanced critique, and the capacity to clearly define how we expect technologies to fit into and improve our lives. false https://pretalx.com/pyconde-pydata-2025/talk/933YXH/ https://pretalx.com/pyconde-pydata-2025/talk/933YXH/feedback/ Hassium Autonomous Browsing using Large Action Models Talk 2025-04-23T14:30:00+02:00 14:30 00:30 The browser serves as our gateway to the internet—the largest repository of knowledge in human history. Proficiency in its use is a core skill across nearly all professions and is becoming increasingly important for Artificial Intelligence. But can Large Action Models (LAMs) autonomously operate a browser? What exactly are LAMs that promise to translate human intentions into actions? We report on a project that fully automates the job application process using AI: from navigating unfamiliar website structures and filling out forms to handling document uploads and cookie banners. pyconde-pydata-2025-61092-autonomous-browsing-using-large-action-models PyData: Natural Language Processing & Audio (incl. Generative AI NLP) Arne GrobrüggeNico Kreiling en Large Action Models (LAMs) were first introduced by Rabbit with the launch of their R1 device, aiming to create end-to-end trained models that automatically translate human instructions into actions. Since then, the definition of LAMs has evolved to encompass Large Language Models (LLMs) utilized in multi-agent settings. Notable examples include Anthropic's "Computer Use" feature in their Claude model and Google's Project Mariner. These projects allow LLMs to operate a web browser or computer in a human-like manner by viewing the screen, moving the cursor, clicking buttons, and typing text, thereby fulfilling the original promise of LAMs by effectively translating human instructions into automated actions. We present an innovative application of LAMs that automates the job application process using AI. Our system autonomously navigates unfamiliar website structures, fills out forms, handles document uploads, and manages cookie banners without human intervention. This level of automation streamlines the application process for job seekers while ensuring accurate and timely submissions. To achieve this, we leveraged the LaVague framework, which employs a modular, agent-based approach: 1. **Coordinator Agent:** A central agent powered by a multimodal model coordinates the entire process. It has access to website visuals, user data (e.g., personal details, CV information), previous instructions, and the overall objective. Based on this information, it delegates tasks to specialized agents. 2. **Navigation Control Agent:** For simple website navigation, this agent utilizes a browser driver such as Selenium to directly interact (e.g., scroll) with the webpage. 3. **Knowledge Agent:** When additional information is required, this agent performs knowledge-intensive tasks using an LLM. Examples include researching specific details or restructuring CV data. 4. **Navigation Engine Agent:** For complex website interactions like inputting values or uploading files, this agent generates custom code for the browser driver. Using an LLM with access to the HTML code, it creates the necessary commands. These agents work iteratively, performing tasks step by step until either the objective is achieved, or a maximum number of steps is reached. By building a custom solution around the LaVague framework tailored specifically for the job application process, we successfully automated the entire workflow. In our presentation, we discuss our overall architecture, the challenges encountered during development and share valuable lessons learned for practical adoption. Large Action Models like these highlight the transformative potential of AI in automating intricate tasks, bridging the gap between understanding human intentions and executing them in dynamic, real-world scenarios. false https://pretalx.com/pyconde-pydata-2025/talk/HYE8EX/ https://pretalx.com/pyconde-pydata-2025/talk/HYE8EX/feedback/ Hassium PDFs - When a thousand words are worth more than a picture (or table). Talk 2025-04-23T15:10:00+02:00 15:10 00:30 PDF, a must-have in RAG systems, ensures visual fidelity across platforms and devices, at the expense of compromising what would be the core condition for computers to properly process and interpret text: semantics. That means any logical arrangement of text, upon rendering, explodes into dummy visual shards of data that literally portrait the bigger picture for the human eye to perceive, but no longer convey the information computers should grasp. Such a bottleneck already makes proper ingestion of text-only documents a big challenge, let alone when tables or figures come into play, the ultimate nightmare for PDF parsers, not to say developers. The rest you must have already foreseen: a RAG system barfing unreliable knowledge from bad chunks (based on regular PDF parsing), if those ever get to be retrieved from a vector database. In this talk you can gather some vision-driven insights on how to leverage the strengths of PDF and language models towards good chunks to be ingested. Or, in other words, how multimodal models can go beyond trivial reverse engineering by decomposing tables into its building blocks, in plain language, as how those would be explained to another human; or better yet, as how humans would ask questions about such pieces of knowledge. And from such a strategy, we transfer the same rationale to figures. Come along, gather some insights, and get inspired to break down tables and figures from your own PDFs, and to improve retrieval in your RAG systems. pyconde-pydata-2025-60158-pdfs-when-a-thousand-words-are-worth-more-than-a-picture-or-table PyData: Generative AI Caio Benatti Moretti en PDF, a must-have in RAG systems, ensures visual fidelity across platforms and devices, at the expense of compromising what would be the core condition for computers to properly process and interpret text: semantics. That means any logical arrangement of text, upon rendering, explodes into dummy visual shards of data that literally portrait the bigger picture for the human eye to perceive, but no longer convey the information computers should grasp. Such a bottleneck already makes proper ingestion of text-only documents a big challenge, let alone when tables or figures come into play, the ultimate nightmare for PDF parsers, not to say developers. The rest you must have already foreseen: a RAG system barfing unreliable knowledge from bad chunks (based on regular PDF parsing), if those ever get to be retrieved from a vector database. In this talk you can gather some vision-driven insights on how to leverage the strengths of PDFs and language models towards good chunks to be ingested in a vector database. Or, in other words, how multimodal models can go beyond trivial reverse engineering by decomposing tables into its building blocks, in plain language, as how those would be explained to another human; or better yet, as how humans would ask questions about such pieces of knowledge. Consequently, it brings robustness to retrieval, the backbone of RAG. And from such a strategy, we can transfer the same rationale to figures. Get ready to boost your retrieval skills, as we: - Analyze the semantical bottlenecks, from the anatomy of a PDF stream, to how parsers traverse it; - (Briefly) approach the never-ending debate on the ideal chunk format for ingestion in vector databases; - Build some chunks using multimodal models to decompose tables into its building blocks, preserving plain language; - Conduct an experiment on measuring quality of retrieval and compare the decomposition strategy against PDF parsers and reverse engineering techniques; - And last, but not least, transfer the same rationale to figures. By then, you'll have enough food for thought to get your hands dirty, clone the repo, and give tweaks to the experiment yourself. Come along, gather some insights, and get inspired to break down tables and figures from your own PDF files, and to improve retrieval in your RAG systems. false https://pretalx.com/pyconde-pydata-2025/talk/UVPALT/ https://pretalx.com/pyconde-pydata-2025/talk/UVPALT/feedback/ Hassium Driving Trust and Addressing Ethical Challenges in Transportation through Explainable AI Talk 2025-04-23T16:10:00+02:00 16:10 00:30 Machine Learning can transform transportation—improving safety, optimizing routes, and reducing delays—yet it also presents ethical concerns. In this talk,I will show how Explainable AI (XAI) can offer practical solutions these ethical dilemmas like lack of trust in AI solutions. Instead of focusing on the technical underpinnings, we will discuss how transparency can be enhanced in AI-supported transportation systems. Using a real-world example, I will demonstrate how XAI provides the groundwork for building ethical, trustworthy, and socially responsible AI solutions in public transportation systems. pyconde-pydata-2025-61882-driving-trust-and-addressing-ethical-challenges-in-transportation-through-explainable-ai General: Ethics & Privacy Natalie Beyer en AI systems in transportation make decisions that directly impact people's lives, such as route optimization, safety measures, and resource allocation. These decisions often rely on complex algorithms, which can be opaque to stakeholders, including operators, regulators, and passengers. **One possible solution: Explainable AI (XAI)** Explainable AI (XAI) refers to methods and tools that make AI systems more transparent by providing interpretable insights into their decision-making processes. By integrating XAI, stakeholders can understand, validate, and trust the outputs of AI systems. **KARL: A Case Study in XAI for Public Transportation** The *KARL* (KI in Arbeit und Lernen in der Region Karlsruhe) project is an exemplary initiative showcasing how XAI can address ethical challenges in AI-suported public transportation. **Technical Implementation** While the presentation will not delve deeply into technical specifics, it will touch upon key elements such as: * The use of open-source libraries like *SHAP* (SHapley Additive exPlanations) to provide interpretability. * Integration of XAI tools into the operational dashboard used by tram operators. * Collaboration with domain experts to ensure the explanations are meaningful and actionable. **Takeaways for the Audience** At the end of this talk, attendees will: 1. Understand the ethical challenges posed by AI in transportation and how they can undermine trust. 2. Learn how XAI tools can address these challenges by enhancing transparency. 3. Gain insights into the practical implementation of XAI in a real-world setting through the KARL project. 4. Be inspired to incorporate XAI principles into their own AI projects to build ethical and socially responsible solutions. false https://pretalx.com/pyconde-pydata-2025/talk/NPMNCE/ https://pretalx.com/pyconde-pydata-2025/talk/NPMNCE/feedback/ Hassium Enhancing Software Supply Chain Security with Open Source Python Tools Talk 2025-04-23T17:10:00+02:00 17:10 00:30 The Cyber Resilience Act (CRA) is focused on improving the security and resilience of digital products. But to comply with the CRA, businesses will need to start preparing the necessary evidence to ensure compliance if they want to continue to deliver digital products to the EU market once the CRA is in force. Key requirements within the CRA include implementing robust security measures throughout the product life-cycle, adopting secure development practices and implementing proactive vulnerability management processes. This session will show how a number of the requirements for the CRA can be achieved by use of a number of open source Python tools. pyconde-pydata-2025-61189-enhancing-software-supply-chain-security-with-open-source-python-tools PyCon: Security Anthony Harrison en The Cyber Resilience Act (CRA) is aimed at improving the security and resilience of the software components within a digital product. This session will provide a high level overview of the CRA and demonstrate how to enhance software supply chain transparency, manage risks effectively throughout the Software Development Lifecycle (SDLC), and achieve the necessary compliance by leveraging a suite of open-source Python tools. Key areas to be addressed will include: - Learn how to create comprehensive and high quality SBOMs to gain a clear understanding of all components within your software. - Discover how to identify and mitigate potential risks and threats within the software supply chain throughout the entire SDLC. - Explore effective strategies for identifying, assessing, prioritising and remediating software vulnerabilities. - Understand how to adopt best practices to ensure compliance with relevant regulations and industry standards. The Python tools/applications to be referenced will include sbom4python, lib4sbom, lib4vex, lib4package, distro2sbom, sbomdiff, sbomaudit and cve-bin-tool. false https://pretalx.com/pyconde-pydata-2025/talk/M98YBR/ https://pretalx.com/pyconde-pydata-2025/talk/M98YBR/feedback/ Hassium Modern NLP for Proactive Harmful Content Moderation Talk 2025-04-23T17:50:00+02:00 17:50 00:30 Despite an array of regulations implemented by governments and social media platforms worldwide (i.e. famous DSA), the problem of digital abusive speech persists. At the same time, rapid advances in NLP and large language models (LLMs) are opening up new possibilities—and responsibilities—for using this technology to make a positive social impact. Can LLMs streamline content moderation efforts? Are they effective at spotting and countering hate speech, and can they help produce more proactive solutions like text detoxification and counter-speech generation? In this talk, we will dive into the cutting-edge research and best practices of automatic textual content moderation today. From clarifying core definitions to detailing actionable methods for leveraging multilingual NLP models, we will provide a practical roadmap for researchers, developers, and policymakers aiming to tackle the challenges of harmful online content. Join us to discover how modern NLP can foster safer, more inclusive digital communities. pyconde-pydata-2025-61184-modern-nlp-for-proactive-harmful-content-moderation PyData: Natural Language Processing & Audio (incl. Generative AI NLP) Daryna Dementieva en The rise of large language models (LLMs) has revolutionized natural language processing (NLP), creating opportunities to address complex societal challenges, including the pervasive issue of harmful online content. Despite global regulations and platform-specific policies, abusive speech and toxic content continue to plague digital spaces, highlighting the need for smarter, scalable, and multilingual solutions. This talk explores how modern NLP technologies can play a transformative role in content moderation, moving beyond traditional detection methods to proactive measures that promote healthier online interactions. We will cover key topics, including: * Understanding the Landscape: Definitions and nuances of harmful content categories, including hate speech, misinformation, and harassment. We will bring practices not only from CS field, but from communication with social scientists and NGOs. * Hate Speech Detection: Can LLMs detect hate speech? How the models can be adapted to new languages? * Text Detoxification: Diving into nuances of toxicity of 9 languages (from our recent shared task) and sharing best practice on LLMs prompting for texts detoxification. * Counter-Speech Generation: Our recent research results on how make LLMs generate not a very general "Please, it is not ok to talk like this report" but indeed address the targeted group. * Ethical Considerations: Who, in the end, responsible for the content moderation? How the community can help to bring best practices? How the measure the "effectiveness" of LLMs for content moderation? false https://pretalx.com/pyconde-pydata-2025/talk/F9EFXA/ https://pretalx.com/pyconde-pydata-2025/talk/F9EFXA/feedback/ Palladium From Tensors to Clouds — A Practical Guide to Zarr V3 and Zarr-Python 3 Talk 2025-04-23T12:25:00+02:00 12:25 00:30 A key feature of the Python data ecosystem is the reliance on simple but efficient primitives that follow well-defined interfaces to make tools work seamlessly together (Cf. http://data-apis.org/). NumPy provides an in-memory representation for tensors. Dask provides parallelisation of tensor access. Xarray provides metadata linking tensor dimensions. **Zarr** provides a missing feature, namely the scalable, persistent storage for annotated hierarchies of tensors. Defined through a community process, the Zarr specification enables the storage of large out-of-memory datasets locally and in the cloud. Implementations exist in C++, C, Java, Javascript, Julia, and Python, enabling. This talk presents a systematic approach to understanding and implementing the newer version of [Zarr-Python](https://github.com/zarr-developers/zarr-python), i.e. Zarr-Python 3 by explaining the new API, deprecations, new storage backend, improved codec pipeline, etc. pyconde-pydata-2025-61385-from-tensors-to-clouds-a-practical-guide-to-zarr-v3-and-zarr-python-3 PyData: Data Handling & Engineering Sanket Verma en Zarr is a data format for storing chunked, compressed N-dimensional arrays and is sponsored by [NumFOCUS]((https://numfocus.org/project/zarr)) under their umbrella. It is based on open-source technical specification and has implementations in several languages, with [Zarr-Python](https://github.com/zarr-developers/zarr-python) being the most used. After the successful adoption of Specification V3, our team has worked tirelessly over the last year to ensure the Python library's compliance with the latest spec. ## Outline First, I’d be talking about: ### Understanding Zarr basics (5 mins.) - What is Zarr, and how it works? - The inner workings of Zarr using illustrated graphics - What is the Zarr Specification? - What's new in Zarr Spec V3? Then, I'll be talking about the new Zarr-Python 3 and its significant features: ### What's new in Zarr-Python 3? (15 mins.) - Major design updates - New storage backend - Creating Zarr arrays and groups asynchronously - New and improved codec pipeline - Native GPU support for creating and writing arrays - Changes and deprecations - Overview of the new API - Optimising performance for large arrays - Deprecation of several stores like LMDBStore, SQLStore, MongoDBStore, etc. - 3.0 Migration guide - Steps to migrate from Zarr-Python 2 to Zarr-Python 3 - Extensions - How can Zarr-Python 3 be extended to add new custom data types, stores, chunking strategies, etc.? Then, I’d be doing a hands-on session, which would cover the following: ### Hands-on (5 mins.) - Creating Zarr arrays and groups using Zarr-Python 3 - Plus walkthrough of the new features (mentioned above) - Looking under the hood - Use store and info functions to explain how your Zarr data is stored and display important information ### Conclusion (5 mins.) - Key takeaways - How can you get involved? - QnA This talk aims to address an audience that works with large amounts of data and is looking for a transparent, open-source, reliable, cloud-optimised, and environmentally friendly format. The tone of the talk is set to be informative, story-telling and fun. Intermediate knowledge of Python and NumPy arrays is required for the attendees to attend this talk. ### After this talk, you’d: - understand the basics of Zarr and what's new in V3, - leverage the new functionalities of Zarr-Python 3 with improved performance, - make an informed decision on what data format to use for your data false https://pretalx.com/pyconde-pydata-2025/talk/ABWHSD/ https://pretalx.com/pyconde-pydata-2025/talk/ABWHSD/feedback/ Palladium Reinforcement Learning Without a PhD: A Python Developer’s Journey Talk 2025-04-23T14:30:00+02:00 14:30 00:30 Reinforcement Learning (RL) has shown superhuman performance in games and is already delivering value in Big Tech. But despite its potential, RL remains largely inaccessible to most developers. Why? Because real-world RL is hard—it demands data, infrastructure, and tools that are often built for researchers, not practitioners. This talk shares the journey of applying RL to a real-world use case without having a PhD. It’s a story of figuring things out through hands-on experimentation, trial and error, and building what didn’t exist. We’ll explore what makes RL powerful, why it’s still rare in practice, and how you can get started. Along the way, you’ll learn about the key challenges of production RL, how to work around them, and how the open-source toolkit pi_optimal can help bridge the gap. Whether you're just RL-curious or ready to dive in, this talk offers practical insights and a demo to help you take your first steps. pyconde-pydata-2025-61178-reinforcement-learning-without-a-phd-a-python-developer-s-journey PyData: Machine Learning & Deep Learning & Statistics Jochen Luithardt en Reinforcement Learning (RL) has made headlines for beating humans at Go and StarCraft, and it’s already being used by companies like Google, Amazon, and Lyft to optimize real-world systems. But outside of big tech and research labs, RL is still rarely applied. Why? Because even though RL is powerful, it's also complex, resource-intensive, and hard to implement without the right tools. In this talk, we explore what it really takes to bring RL into production—without a PhD, a research team, or unlimited infrastructure. I’ll share the story of how we applied RL to a real-world business problem: optimizing digital campaign management in a fast-changing environment. We faced all the classic challenges—limited data, no simulator, and no out-of-the-box tools that actually worked for our use case. We’ll look at how we built a training environment from historical data, dealt with uncertainty using ensemble models, and iterated through a long cycle of trial, error, and learning. That experience eventually led us to create pi_optimal, an open-source toolkit designed to make RL more accessible to Python developers and data scientists. You’ll walk away with a clear understanding of: - Why RL is powerful, but rarely applied in practice - What makes real-world RL so challenging - How we got a working RL system off the ground without a PhD in RL - How pi_optimal helps lower the barrier to entry - How you can get started with RL, either through theory or hands-on practice Whether you're RL-curious or looking to apply it in your own projects, this talk offers practical insights and a live demo to help you take your first steps. false https://pretalx.com/pyconde-pydata-2025/talk/TQLGA8/ https://pretalx.com/pyconde-pydata-2025/talk/TQLGA8/feedback/ Palladium Building Reliable AI Agents for Publishing: A DSPy-Based Quality Assurance Framework Talk 2025-04-23T15:10:00+02:00 15:10 00:30 As publishers increasingly adopt AI agents for content generation and analysis, ensuring output quality and reliability becomes critical. This talk introduces a novel quality assurance framework built with DSPy that addresses the unique challenges of evaluating AI agents in publishing workflows. Using real-world examples from newsroom implementations, I will demonstrate how to design and implement systematic testing pipelines that verify factual accuracy, content consistency, and compliance with editorial standards. Attendees will learn practical techniques for building reliable agent evaluation systems that go beyond simple metrics to ensure AI-generated content meets professional publishing standards. pyconde-pydata-2025-61812-building-reliable-ai-agents-for-publishing-a-dspy-based-quality-assurance-framework PyData: Natural Language Processing & Audio (incl. Generative AI NLP) Simonas Černiauskas en This presentation addresses one of the most pressing challenges in professional publishing today: ensuring quality and reliability when deploying AI agents in editorial environments. We'll take a deep dive into how DSPy's programmatic approach to language model development can be leveraged to create robust testing and validation pipelines that meet the demanding standards of modern newsrooms. The discussion begins by exploring the current landscape of AI evaluation in publishing workflows, examining why traditional testing approaches fall short when dealing with language models, and identifying the specific quality requirements unique to journalistic and editorial content. We'll then move into a detailed technical exploration of solutions built with DSPy, demonstrating how to design modular evaluation pipelines, implement publishing-specific metrics, and create automated systems for fact-checking and consistency validation. Special attention will be given to the integration of knowledge graphs for reference-based evaluation and the incorporation of these systems into broader MLOps workflows. To ground these concepts in reality, we'll examine a detailed case study of implementing this framework in an actual newsroom environment. This will include practical discussions of handling various content types, along with strategies for managing test data and evaluation criteria. We'll share real-world performance monitoring approaches and concrete improvement strategies that have proven successful in production environments. The presentation concludes with hard-won insights and best practices, including practical strategies for finding the right balance between automated testing and human review, effective approaches to handling edge cases, and methods for scaling quality assurance processes across diverse content teams. Throughout the talk, we'll share code examples and practical implementations that attendees can adapt for their own projects. This session is specifically designed for technical leads and machine learning engineers, though the principles and approaches discussed will be valuable for anyone involved in AI quality assurance. Attendees will leave with a comprehensive understanding of how to design and implement QA processes for AI agents, practical knowledge of DSPy implementation for automated testing, and concrete strategies for maintaining high quality standards in AI-assisted workflows. false https://pretalx.com/pyconde-pydata-2025/talk/F7RDPT/ https://pretalx.com/pyconde-pydata-2025/talk/F7RDPT/feedback/ Palladium Deploying Synchronous and Asynchronous Django Applications for Hobby Projects Talk 2025-04-23T16:10:00+02:00 16:10 00:30 Simplify deploying hybrid Django applications with synchronous views and asynchronous apps. This session covers ASGI support, Docker containerization, and Kamal for seamless, zero-downtime deployments on single-server setups, ideal for hobbyists and small-scale projects. pyconde-pydata-2025-59426-deploying-synchronous-and-asynchronous-django-applications-for-hobby-projects PyCon: Django & Web melhin en Hobby projects often start small but can quickly grow in complexity, especially when incorporating Django’s support for asynchronous applications alongside traditional synchronous views. Deploying such hybrid projects on a single server—whether in the cloud or on-premise—can be daunting without the right tools and workflows. This talk focuses on simplifying the deployment process for hobbyists and developers who want to create and manage robust Django applications without requiring extensive infrastructure or expertise. We’ll cover: - Deploying Django projects that combine synchronous views and asynchronous apps using Django’s ASGI support. - Containerizing the application with Docker for consistent and manageable environments. - Utilizing Kamal, an open-source deployment tool, to enable zero-downtime deployments, rolling updates, and seamless app management. - Demonstrating the workflow on a single cloud server, with insights on adapting it to on-premise servers. Whether you're building a passion project or experimenting with modern Django features, this session will provide you with practical tools and approaches to deploy hybrid Django applications effortlessly, keeping the process accessible and scalable for hobby-level development. false https://pretalx.com/pyconde-pydata-2025/talk/CMTKZS/ https://pretalx.com/pyconde-pydata-2025/talk/CMTKZS/feedback/ Palladium Getting Started with Bayes in Engineering: Implementing Kalman Filters with RxInfer.jl Talk 2025-04-23T17:10:00+02:00 17:10 00:30 Bayesian methods are not commonly seen in Civil Engineering and Structural Dynamics. In this talk we explore how RxInfer.jl and the Julia Programming Language can simplify Bayesian modeling by implementing a Kalman filter for tracking the dynamics of a structural system. Perfect for engineers, researchers, and data scientists eager to apply probabilistic modelling and Bayesian methods to real-world engineering challenges. pyconde-pydata-2025-61204-getting-started-with-bayes-in-engineering-implementing-kalman-filters-with-rxinfer-jl PyData: Research Software Engineering Victor Flores Terrazas en Bayesian methods are renowned for their ability to incorporate domain knowledge and quantify uncertainty, making them valuable across various engineering and data science fields. However, finding practical examples of these methods in civil engineering, especially within structural dynamics, can be challenging. This talk aims to make Bayesian inference accessible to engineering practitioners by demonstrating how RxInfer.jl, a Julia package for probabilistic programming, can be used to implement a Kalman filter for tracking the dynamics of a structural system. The session covers: 1. Bayesian Modelling in Python and Julia: A brief comparison of probabilistic programming languages, highlighting Python and Julia 2. State Space Modelling of Structural Dynamical Systems: A brief introduction to state space models and their use in structural dynamics 3. Linking State Space Modelling to Finite Element Modelling: Making the connection between FEM and SSM 4. A Simplified Overview of Bayesian Filtering and Kalman Filters for Dynamical Systems 5. Bayesian Filtering Made Simple with RxInfer.jl: a step-by-step guide to setting up a user-friendly and readable Bayesian filter using Rxinfer.jl 6. Full Workflow Example 7. Interpreting the Results and Next Steps 8. Connections to Julia, Python and Open-Source Ecosystems: exploring integrations with tools like FreeCAD and other open-source platforms By the end of the talk, attendees will have a clear understanding of how to start using Bayesian methods in their engineering projects, supported by reproducible and open-source code. false https://pretalx.com/pyconde-pydata-2025/talk/U9KHNA/ https://pretalx.com/pyconde-pydata-2025/talk/U9KHNA/feedback/ Palladium Streamlining the Cosmos: Pythonic Workflow Management for Astronomical Analysis Talk 2025-04-23T17:50:00+02:00 17:50 00:30 Astronomical surveys are growing rapidly in complexity and scale, necessitating accurate, efficient, and reproducible reduction and analysis pipelines. In this talk we explore Pythonic workflow managers to streamline processing large datasets on distributed computing environments. Modern astronomy generates vast datasets across the electromagnetic spectrum. NASA's flagship James Webb Space Telescope (JWST) provides unprecedented observations that enable deep studies of distant galaxies, cosmic structures, and other astrophysical phenomena. However, these datasets are complex and require intricate calibration and analysis pipelines to transform raw data into meaningful scientific insights. We will discuss the development and deployment of Pythonic tools, including snakemake and pixi, to construct modular, parallelized workflows for data reduction and analysis. Attendees will learn how these tools automate complex processing steps, optimize performance in distributed computing environments, and ensure reproducibility. Using real-world examples, we will illustrate how these workflows simplify the journey from raw data to actionable scientific insights. pyconde-pydata-2025-60536-streamlining-the-cosmos-pythonic-workflow-management-for-astronomical-analysis PyData: PyData & Scientific Libraries Stack Raphael Hviding en As astronomical surveys continue to grow in size and sophistication, researchers face mounting challenges in building efficient, scalable, and reproducible data processing pipelines. Modern observatories, like NASA's James Webb Space Telescope (JWST), are delivering unprecedented volumes of complex and specialized data, requiring innovative approaches to transform raw observations into meaningful, scientifically valid results. This talk focuses on leveraging Pythonic workflow management tools to address the unique challenges of processing large-scale astronomical datasets efficiently and reproducibly. I will provide a brief overview of JWST including its capabilities and the groundbreaking science it has enabled. In particular we will focus on the Pure Parallel mode which collect serendipitous observations from regions of the sky adjacent to primary science targets. These opportunistic datasets are a powerful resource for blind extragalactic surveys, offering unique opportunities to uncover faint galaxies, cosmic structures, and rare astrophysical phenomena. However, their “unscheduled” and heterogeneous nature presents significant challenges: the data arrive in raw, uncalibrated formats and require intricate, multi-step workflows—such as artifact masking, background subtraction, and galaxy spectral analysis—before becoming scientifically usable. In this talk, I will demonstrate how tools like Snakemake and Pixi offer powerful, Pythonic solutions for these challenges. I’ll show how these tools allow scientists to design modular, scalable, and highly parallel workflows that automate the reduction and analysis process while efficiently distributing computation across high-performance computing (HPC) clusters and cloud environments. By breaking workflows into smaller, reusable components, we can improve computational performance and maintain flexibility to adapt pipelines to new datasets, instruments, or evolving scientific goals. Reproducibility remains a critical pillar of modern science, and I will highlight how combining workflow managers with environment management tools ensures version-controlled pipelines, transparent data lineage tracking, and reliable replication of results. This enables consistent analyses across diverse systems, fostering collaboration and long-term usability of scientific products. This talk is designed for (data) scientists and researchers working with large-scale or complex datasets with multi-step reduction/analysis pipelines. This talk will provide a GitHub repository containing resources for building modular workflows, including examples of existing infrastructures, to provide actionable takeaways. Attendees will leave with a clear understanding of how to apply modern workflow management techniques to streamline their processing pipelines, improve reproducibility, and scale their analyses to meet the demands of their datasets. false https://pretalx.com/pyconde-pydata-2025/talk/NBFH7G/ https://pretalx.com/pyconde-pydata-2025/talk/NBFH7G/feedback/ Ferrum Instrumenting Python Applications with OpenTelemetry Tutorial 2025-04-23T14:30:00+02:00 14:30 01:30 Observability is challenging and often requires vendor-specific instrumentation. Enter OpenTelemetry: a vendor-agnostic standard for logs, metrics, and traces. Learn how to instrument Python applications with OpenTelemetry and send telemetry to your preferred observability backends. pyconde-pydata-2025-61846-instrumenting-python-applications-with-opentelemetry PyCon: MLOps & DevOps Mika NaylorEmily Woods en Understanding the behaviour and performance characteristics of the software we deploy, especially distributed software, is quite tricky. While observability tooling helps, implementing vendor-specific instrumentation creates tight coupling and technical debt. Enter OpenTelemetry: A one-stop-shop for observability instrumentation, collection and routing. It aims to solve the above problem by providing SDKs, libraries and a unified semantic model for describing telemetry signals like logs, metrics and traces. These signals can be collected, transformed and then routed to many observability backends that support the OpenTelemetry protocol - avoiding vendor lock-in and platform specific observability code. In this workshop, we'll guide you through what OpenTelemetry is, how it works, how to instrument your Python applications to emit telemetry data, and how to ingest this data into observability backends - enabling you to make better decisions about your application's performance. ***Note*: We will be using docker & docker compose during this workshop, so please make sure it is installed! Familiarity with Flask is also a plus!** We'll be working from this repository: https://github.com/autophagy/pycon-2025-otel-workshop You're welcome to clone the repository in advance and pull the images we'll use for the workshop. You can pull these images by running following command from the root of the repo: `docker compose pull`. false https://pretalx.com/pyconde-pydata-2025/talk/UH7FXA/ https://pretalx.com/pyconde-pydata-2025/talk/UH7FXA/feedback/ Ferrum Building Serverless Python AI skills as WASM components Sponsored Talk 2025-04-23T16:10:00+02:00 16:10 00:30 Frameworks like llama-stack and langchain allow for quick prototyping of generative AI applications. However, companies often struggle to deploy these applications into production quickly. This talk explores the design of a Python SDK that enables the development of AI skills in Python and their compilation into WebAssembly (WASM) components, targeting a specific host runtime that offers interfaces for interacting with LLMs and associated tooling. pyconde-pydata-2025-67553-building-serverless-python-ai-skills-as-wasm-components PyData: Generative AI Moritz Althaus en Why do companies struggle so hard to get their AI skills into production quickly? This talk is about building an SDK that enables the development of production-ready AI skills in Python that can be run as serverless functions within a WASM runtime and interact with LLMs via a WIT (WASM Interface Type) world. On a less technical note, we will explore the design of an SDK that offers a streamlined development experience for AI skills. We will explore the implications for topics such as testability, traceability, and the evaluation of AI logic. How can software engineering best practices, such as separation of concerns and modularity, be applied to the design of AI applications? Fred Brooks' excellent essay, No Silver Bullet, distinguishes between accidental complexity and essential complexity. This talk will explore how an SDK for AI skills can reduce accidental complexity during development and deployment, providing developers with a focused environment for innovating prompts and retrieval strategies. How can a WIT that supports running AI applications be designed, and how can bindings to such a WIT world be generated and consumed in a Python module? We will examine abstractions that allow local testing and debugging without the compilation step by encapsulating the WIT host interface behind a Protocol. The talk also covers benefits of running AI skills as WASM components: When compiling a Python module to WebAssembly, the Python interpreter is part of the compiled component. Although this results in longer start-up times compared to components written in compiled languages like Rust, it provides a key advantage: The interpreter can securely execute Python code generated by an LLM within a highly restricted environment, ensuring no network or file system access. Key takeaways include developing a foundational understanding of WASM and WIT, and how they can interface with Python. You will gain insights into the challenges of deploying AI skills into production and discover how testing, tracing, and evaluation can simplify this process. false https://pretalx.com/pyconde-pydata-2025/talk/LYDDDC/ https://pretalx.com/pyconde-pydata-2025/talk/LYDDDC/feedback/ Ferrum Supercharge Your Testing with inline-snapshot Talk 2025-04-23T17:10:00+02:00 17:10 00:30 Snapshot tests are invaluable when you are working with large, complex, or frequently changing expected values in your tests. Introducing inline-snapshot, a Python library designed for snapshot testing that integrates seamlessly with pytest, allowing you to embed snapshot values directly within your source code. This approach not only simplifies test management but also boosts productivity by improving the maintenance of the tests. It is particularly useful for integration testing and can be used to write your own abstractions to test complex Apis. pyconde-pydata-2025-61762-supercharge-your-testing-with-inline-snapshot PyCon: Testing Frank Hoffmann en This Talk gives you an introduction into inline-snapshot and how it can transform your testing strategy: * Foundations of Snapshot Testing: Start with an introduction to what snapshot testing is and why it's a game-changer for Python developers. * Basic Usage: Learn the core functionality of the `snapshot()` function. Understand how it captures and manages snapshots inline with your tests. Advanced Techniques: * Dirty Equals: Explore how you can leverage dirty-equals within your snapshots for more flexible assertions, allowing for partial matching which is particularly useful for complex data structures. * Parametrized Tests: See how inline-snapshot can be applied to parametrized tests, ensuring each parameter set has its own snapshot. * Customizable: Learn to create your own test functions to test your specific problems. false https://pretalx.com/pyconde-pydata-2025/talk/CRNJWQ/ https://pretalx.com/pyconde-pydata-2025/talk/CRNJWQ/feedback/ Ferrum Zero Code Change Acceleration: familiar interfaces and high performance Talk 2025-04-23T17:50:00+02:00 17:50 00:30 The PyData ecosystem is home to some of the best and most popular tools for doing data-science. Every data-scientist alive today has used pandas and scikit-learn and even Large Language Models know how to use them! For many years there have also been alternative implementations with similar interfaces and libraries with completely new approaches that focus on achieving the ultimate in performance and hardware acceleration. This talk will look at the recent efforts to give users the best of both worlds: a familiar and widely used interface as well as high performance. pyconde-pydata-2025-61342-zero-code-change-acceleration-familiar-interfaces-and-high-performance PyData: PyData & Scientific Libraries Stack Tim Head en The interfaces defined by libraries like Numpy, pandas or scikit-learn are the defacto standard APIs in each library's domain. Data scientists use these libraries directly as well as indirectly through libraries that depend on them. This talk will look at the different approaches that recent efforts have taken to give users both a familiar interface and GPU acceleration. This means users do not have to rewrite their code, learn a new library and benefit from acceleration when using existing libraries. The cuml team built a scikit-learn accelerator by diving deep into the import system of Python. By hooking into the import system you can replace the result of `import sklearn` with a library that uses cuml where possible and falls back to scikit-learn where necessary. The scikit-learn team is adding experimental support to handle PyTorch and CuPy inputs by using the array API standard. Instead of using the Numpy API to perform array computations, scikit-learn is switching to using the array API. This is a subset of the Numpy API that is supported by several other array libraries. The API of Numpy and PyTorch is similar but not exactly the same, this makes writing code that works with both hard. The array API addresses this problem by providing a unified API. Users can accelerate their scikit-learn code by passing in a CuPy or PyTorch array instead of a Numpy array. false https://pretalx.com/pyconde-pydata-2025/talk/PNQB7C/ https://pretalx.com/pyconde-pydata-2025/talk/PNQB7C/feedback/ Dynamicum Power up your Polars code with Polars extention Tutorial 2025-04-23T11:45:00+02:00 11:45 01:30 While Polars is written in Rust and has the advantages of speed and multi-threaded functionalities., everything will slow down if a Python function needs to be applied to the DataFrame. To avoid that, a Polar extension can be used to solve the problem. In this workshop, we will look at how to do it. pyconde-pydata-2025-60450-power-up-your-polars-code-with-polars-extention PyData: Data Handling & Engineering Cheuk Ting Ho en We love Polars because it is written in Rust so we can use Rust's security and speed. However, it is not the most efficient if we still have to call in a Python function to perform specific aggregation. In this workshop, we will use the Polars plugin. You will be writing simple functions in Rust, and then you will use it together with Polars in your Python data pipeline. #### Target Audience Engineers and data scientists who use Polars and are confident to write a bit of Rust code. We expect you to have knowledge of Python and Polars and have a bit of Rust experience (or be able to pick it up relatively quickly). Not all concepts in Rust will be explained but we will link to material where you can find explanations. #### Goal To empower Polars users who want to do more and do better with Polars. For folks who don't mind learning a new programming language, it is also a good opportunity to learn and practice writing in Rust. --- ## Preflight check In this workshop, we expect you to have knowledge of Python and Polars and have a bit of Rust experience (or be able to pick it up relatively quickly). Not all concepts in Rust will be explained but we will link to material where you can find explanations. Here are the things that you should have installed when you started this workshop: - [Install/ Update Rust](https://www.rust-lang.org/tools/install)(we are using rustc version 1.86.0 here) - Make sure having Python 3.9 or above (assuming 3.13 in this workshop) - Make sure using virtual environment (recommend using uv >= 0.4.25) ## Windows checklist In this workshop we recommend using Unix OS (Mac or Linux). *If you use Windows, you may encounter problems with Rust and Maturin.* To minimise issues that you may encounter, please go through the extra checklist below: - Install the [c++ build tools](https://visualstudio.microsoft.com/downloads/) - [Check the `dll` files are linked correctly](https://pyo3.rs/v0.21.2/faq#im-trying-to-call-python-from-rust-but-i-get-status_dll_not_found-or-status_entrypoint_not_found) ## Learning resources for Rust and PyO3 To wirte a Polars plugin, you will have to develop in Rust. If you are not familiar with Rust, we highly recommend you first check out some of the Rust learning resources so you can be prepare for the workshop. Here are some of our recommendations: - [The Rust Book](https://doc.rust-lang.org/book/title-page.html) - [Rustlings (Exerciese in Rust)](https://github.com/rust-lang/rustlings) - [Rust by Example](https://doc.rust-lang.org/rust-by-example/) - [Teach-rs (GitHub repo)](https://github.com/tweedegolf/teach-rs) Another tool that we will be using will be PyO3 and Maturin. To learn more about them, please check out the following: - [The PyO3 user guide](https://pyo3.rs/) - [PyO3 101 - Writing Python modules in Rust](https://github.com/Cheukting/py03_101) ## Setting up 1. create a new working directory ``` mkdir polars-plugin-101 cd polars-plugin-101 ``` 2. Set up virtual environment and activate it ``` uv venv .venv source .venv/bin/activate python -m ensurepip --default-pip ``` *Note: the last command is needed as maturin develop cannot find pip otherwise* 3. Install **polars** and **maturin** ``` uv pip install polars maturin ``` These are the versions that we are using here: + maturin==1.8.3 + polars==1.27.1 --- Workshop materials: https://github.com/Cheukting/polars_plugin_101 --- #### Outline - Introduction (15 mins): 1. What is Polars plugin 2. How does it work (using Maturin to develop packages) 3. How to use it with Polars (exercises) - Simple numerical functions (35 mins): 1. Creating numerical functions with 1 input (exercise) 2. Creating numerical functions with multiple inputs in the same row (exercise) 3. Creating numerical functions that support multiple types (exercise) - Advance usage with Polars plugin (40 mins): 1. Creating functions with multiple inputs across different rows (exercise) 2. Functions with user-set parameters (exercise) 3. Working with strings and lists (exercise) false https://pretalx.com/pyconde-pydata-2025/talk/CP3TKB/ https://pretalx.com/pyconde-pydata-2025/talk/CP3TKB/feedback/ Dynamicum supplyseer: Computational Supply Chain with Python Tutorial 2025-04-23T14:30:00+02:00 14:30 01:30 This talk introduces supplyseer, an open-source Python library that brings advanced analytics to Supply Chain and Logistics. By combining time series embedding techniques, stochastic process modeling, and geopolitical risk analysis, supplyseer helps organizations make data-driven decisions in an increasingly complex global supply chain landscape. The library implements novel approaches like Takens embedding for demand forecasting, Hawkes processes for modeling supply chain events, and Bayesian methods for inventory optimization. Through practical examples and real-world use cases, we'll explore how these mathematical concepts translate into actionable insights for supply chain practitioners. pyconde-pydata-2025-60853-supplyseer-computational-supply-chain-with-python PyData: Machine Learning & Deep Learning & Statistics Jako Rostami en Supplyseer bridges the gap between theoretical supply chain analytics and practical implementation by providing a pythonic interface to advanced mathematical concepts. This talk will walk through the library's core components and demonstrate how they solve real-world supply chain challenges. Outline: 1. Introduction to Modern Supply Chain Analytics - The need for sophisticated analytics in today's complex supply chains - Why traditional methods fall short - The role of probabilistic modeling and topological analysis 2. Core Mathematical Foundations - Time series embedding techniques using Takens' theorem - Stochastic process modeling for demand forecasting - Bayesian approaches to Economic Order Quantity (EOQ) - Point process modeling with Hawkes processes - Network analysis for supply chain risk assessment 3. Library Architecture and Design Philosophy - Object-oriented design for supply chain analytics - Integration of multiple analytical approaches - Extensible architecture for custom analytics - Performance considerations and optimizations 4. Key Features Deep Dive a) Demand Forecasting Module - Stochastic demand process simulation - Time-delay embedding for pattern recognition - Mixture density networks for uncertainty quantification b) Risk Analysis Tools - Geopolitical risk assessment - Supply chain network visualization - Real-time monitoring and alerting - Trade restriction impact analysis c) Inventory Optimization - Bayesian EOQ implementation - Multi-echelon inventory optimization - Stockout probability calculation - Vector field analysis for inventory dynamics 5. Practical Applications - Route optimization with geopolitical risk consideration - You and your suppliers play cooperative games: game-theoretic Supply Chain - Supply Chain Digital Twins - Real-time risk monitoring and mitigation 6. Integration with Data Science Ecosystem - Compatibility with pandas and polars - Integration with scikit-learn pipeline - Visualization with matplotlib and seaborn - Performance optimization with numpy 7. Future Directions - Planned features and enhancements - Community contribution opportunities - Integration with other supply chain tools - Research directions in supply chain analytics 8. Interactive Demonstrations - Live coding examples - Real-world data analysis - Visualization of supply chain dynamics - Risk assessment workflows The talk will include code examples and practical demonstrations, showing how to: - Implement stochastic demand forecasting - Analyze supply chain risks using network analysis - Optimize inventory levels using Bayesian methods - Visualize supply chain dynamics using vector fields - Monitor and assess geopolitical risks Target Audience: This talk is aimed at data scientists, supply chain analysts, and Python developers interested in applying advanced analytics to supply chain problems. Attendees should have intermediate Python knowledge and basic familiarity with data science libraries like pandas and numpy. Prerequisites: - Python programming experience - Basic understanding of supply chain concepts - Familiarity with pandas and numpy - Basic knowledge of probability and statistics Takeaways: Attendees will learn: - How to implement advanced supply chain analytics in Python - Practical applications of mathematical concepts in supply chain - Best practices for supply chain data analysis - Techniques for visualizing and monitoring supply chain dynamics - Methods for quantifying and managing supply chain risks All code examples and demonstrations will be available in a GitHub repository, allowing attendees to experiment with the concepts presented and apply them to their own supply chain challenges. false https://pretalx.com/pyconde-pydata-2025/talk/N9CAUM/ https://pretalx.com/pyconde-pydata-2025/talk/N9CAUM/feedback/ Dynamicum Conformal Prediction: uncertainty quantification to humanise models Talk 2025-04-23T16:10:00+02:00 16:10 00:30 Quantifying model uncertainties is critical to improve model reliability and make sound decisions. Conformal Prediction is a framework for uncertainty quantification that provides mathematical guarantees of true outcome coverage, allowing more informed decisions to be made by stakeholders pyconde-pydata-2025-60141-conformal-prediction-uncertainty-quantification-to-humanise-models PyData: Machine Learning & Deep Learning & Statistics Vincenzo Ventriglia en Quantifying uncertainties of Machine Learning models is crucial to improve their reliability, accurately assess risks and make more robust decisions. By quantifying and understanding uncertainty, we can build more reliable and trustworthy systems. Imagine we have a model that predicts whether or not a CT scan contains a tumour: traditional approaches tend to provide binary predictions, while not providing information on the model’s confidence in each prediction. Conformal Prediction (CP) is a framework for uncertainty quantification that offers an estimate of the confidence in the model’s predictions: instead of providing just a point estimate, it provides a set of possible outcomes (prediction set), together with a measure of confidence in each outcome. These prediction sets come with a (mathematical!) guarantee of coverage of the true outcome, ensuring that they will detect at least a pre-fixed percentage of true values. CP is a model-agnostic paradigm, requiring no retraining of the model and making no major assumptions about the distribution of the data. We Humans, when faced with uncertainty, tend to express indecision and offer alternatives. We will see that CP can be a key tool to include a human in the decision-making loop, once the ‘humanised’ machine is able to express its uncertainty. CP therefore offers a robust framework that allows stakeholders to make more informed decisions, even more so in high-risk sectors such as healthcare, finance and autonomous systems. false https://pretalx.com/pyconde-pydata-2025/talk/FGEUJJ/ https://pretalx.com/pyconde-pydata-2025/talk/FGEUJJ/feedback/ Dynamicum Citation is Collaboration: Software Recognition in Research and Industry Talk 2025-04-23T17:10:00+02:00 17:10 00:30 The development of open source software is increasingly recognized as a critical contribution across many disciplines, yet the mechanisms for credit and citation vary significantly. This talk uses astronomy as a case study to explore shared challenges in attributing software contributions across research and industry. It will review the evolution of journal recommendations and policies over the past decade, alongside emerging publishing practices offering insights into their impact on the recognition of software contributions. An analysis of citation patterns for widely used libraries (numpy, scipy, astropy) highlights trends over time and their dependence on publication venues and policies. The talk will conclude with strategies for both developers and users for improving the recognition of software, fostering collaboration and sustainability in software ecosystems. All data and analysis code will be made available in a public repository, supporting transparency and further study. pyconde-pydata-2025-61390-citation-is-collaboration-software-recognition-in-research-and-industry PyData: Research Software Engineering Ivelina Momcheva en In many fields, including research and industry, software is essential for driving innovation and scientific discovery, yet mechanisms for crediting software developers remain inconsistent and underdeveloped. This lack of recognition, particularly for open-source contributions, can discourage participation in software development and limit career opportunities for developers. Astronomy, as a computationally intensive discipline with a rich history of open-source software contributions, offers a valuable case study to examine these challenges. Over the past decade, changes in journal policies and emerging publishing practices have sought to address the issue, but their impact on credit attribution remains unclear. This talk addresses the issue of software credit by analyzing publication and citation practices in astronomy. It evaluates how existing policies acknowledge software contributions and examines variations across journals and over time. Drawing on bibliometric data from the past decade, the analysis focuses on citation patterns for commonly used libraries, trends in citation rates, and the influence of journal policies. The study includes both foundational libraries, such as NumPy, and astronomy-specific libraries, such as Astropy. Based on these findings, the talk will offer recommendations to enhance the attribution of software contributions. The issue of recognizing research software is not unique to astronomy or research. Participants from industry, other computationally driven fields, open-source communities, and publishing will find the insights applicable to their own disciplines. Understanding how software is cited and credited is critical for shaping more equitable recognition systems, which in turn support sustainable software development and community growth. The audience will leave with a clear understanding of how astronomy’s experience can inform broader efforts to address similar challenges in their respective fields. The data and code for the analysis will be shared with participants giving participants access to a reproducible framework for analyzing software citation practices in other disciplines or software ecosystems. false https://pretalx.com/pyconde-pydata-2025/talk/VJR39N/ https://pretalx.com/pyconde-pydata-2025/talk/VJR39N/feedback/ Dynamicum Build a personalized Commute agent in Python with Hopsworks, LangGraph and LLM Function Calling Sponsored Talk 2025-04-23T17:50:00+02:00 17:50 00:30 The invention of the clock and the organization of time in zones have helped synchronize human activities across the globe. While timekeepers are better at planning and sticking to the plan, time optimists somehow believe that time is malleable and extends the closer the deadline. Nevertheless, whether you are an organized timekeeper or a creative timebender, external factors can affect your commute. In this talk, we will define the different components necessary to build a personalized commute virtual agent in Python. The agent will help you analyze your historical lateness records, estimate future delays, and suggest the best time to leave home based on these predictions. It will be powered by a LLM and will use a technique called Function Calling to recognize the user intent from the conversation history and provide informed answers. pyconde-pydata-2025-61908-build-a-personalized-commute-agent-in-python-with-hopsworks-langgraph-and-llm-function-calling PyData: Data Handling & Engineering Javier de la Rúa Martínez en The invention of the clock and the organization of time in zones have helped synchronize human activities across the globe. While timekeepers are better at planning and sticking to the plan, time optimists somehow believe that time is malleable and extends the closer the deadline. Nevertheless, whether you are an organized timekeeper or a creative timebender, external factors can affect your commute. In this talk, we will define the different components necessary to build a personalized commute virtual agent in Python. The agent will help you analyze your historical lateness records, estimate future delays, and suggest the best time to leave home based on these predictions. It will be powered by a LLM and will use a technique called Function Calling to recognize the user intent from the conversation history and provide informed answers. The ML system will be built in Python, following the best practices of the FTI (feature/training/inference) pipeline architecture, on top of the open-source Hopsworks AI lakehouse, which will provide the necessary ML infrastructure, such as the feature store, model serving, and a model registry. The agent will be designed with LangGraph and powered by a LLM running on the vLLM inference engine. false https://pretalx.com/pyconde-pydata-2025/talk/8S3RC3/ https://pretalx.com/pyconde-pydata-2025/talk/8S3RC3/feedback/ Zeiss Plenary (Spectrum) Chasing the Dark Universe with Euclid and Python: Unveiling the Secrets of the Cosmos Keynote 2025-04-24T09:05:00+02:00 09:05 00:45 The ESA Euclid mission, launched in July 2023, is on a quest to unravel the mysteries of dark energy and dark matter: the enigmatic components that make up 95% of the Universe. By mapping one-third of the sky with unprecedented precision, Euclid is building the largest 3D map of the cosmos. This talk explores how cosmologists bridge theory and and Euclid observation to reveal the hidden nature of dark energy and the dark matter. We will delve into the challenges of cosmological inference, where advanced statistical methods and Python-based pipelines compare theoretical models against Euclid's vast datasets, and we will explain how Bayesian inference, machine learning, and state-of-the-art simulations are revolutionizing our understanding of the cosmos. pyconde-pydata-2025-65262-chasing-the-dark-universe-with-euclid-and-python-unveiling-the-secrets-of-the-cosmos Keynote Guadalupe Canas Herrera en The Euclid mission, a European Space Agency-led mission launched in July 2023, is set to transform our understanding of the Universe by exploring its most elusive constituents: dark energy and dark matter. Together, they account for 95% of the cosmos, dictating its structure, evolution, and eventual fate. Euclid is currently surveying one-third of the sky to construct the most extensive 3D map of the Universe ever created. By using deep imaging and spectroscopic data, it traces the distribution of galaxies and the subtle distortions caused by gravitational lensing with unparalleled precision. By connecting theory with observations, Euclid aims to uncover the properties of dark energy driving cosmic acceleration and the distribution of dark matter shaping large-scale cosmic structures. At the heart of this endeavor lies the challenge of cosmological statistical inference: extracting robust conclusions about the nature of dark energy and dark matter from vast, complex datasets. This talk will explore how cutting-edge statistical techniques and powerful computational tools, including Python-based analysis pipelines, are being used to compare theoretical models against Euclid's observations. We will discuss the role of Bayesian inference, machine learning, and advanced simulations in constraining cosmological parameters and testing extensions to the standard model of cosmology. false https://pretalx.com/pyconde-pydata-2025/talk/EGNBHD/ https://pretalx.com/pyconde-pydata-2025/talk/EGNBHD/feedback/ Zeiss Plenary (Spectrum) Algorithmic Music Composition With Python Talk 2025-04-24T10:15:00+02:00 10:15 00:30 Computers have long been an integral part of creating music. Virtual instruments and digital audio workstations make creating music easy and accessible. But how do programming languages and especially Python fit into this? Python can serve as a tool for creating musical notation and MIDI files. Throughout the session, you’ll learn how to: - Use Python to create melodies, harmonies, and rhythms. - Generate music based on rules, randomness, and mathematical principles. - Visualize and export your compositions as MIDI and sheet music. By the end of the talk, you’ll have a clear understanding of how to turn simple algorithms into expressive musical works. pyconde-pydata-2025-61093-algorithmic-music-composition-with-python PyCon: Python Language & Ecosystem Hendrik Niemeyer en This talk provides a general introduction into creating music algorithmically using Python. Little prior knowledge about music is assumed. It is helpful to know how sheet music looks and what the MIDI format is beforehand. We will start by looking briefly into the basic building blocks of music (harmony, melody and rhythm) and what our goal is (creating sheet music and a playable MIDI file). Then we will discuss the history of algorithmic composition in music and from that we will develop ideas how we can create music from algorithms and randomness. For creating sheet music we will look into the packages Abjad und music21. In the end will we will create a playable MIDI file for our music using MIDIUtil. false https://pretalx.com/pyconde-pydata-2025/talk/TQN98D/ https://pretalx.com/pyconde-pydata-2025/talk/TQN98D/feedback/ Zeiss Plenary (Spectrum) AI in Reality Fireside Chat: Enterprise AI & Open‑Source Innovation Panel 2025-04-24T11:00:00+02:00 11:00 01:00 This fireside chat brings together leading voices from industry and open-source to explore how artificial intelligence is being meaningfully integrated into enterprise environments—beyond the buzzwords. Moderated by Alexander CS Hendorf, the conversation features Walid Mehanna (Chief Data Officer, Merck), Dr. Alexander Beck (CTO, Quoniam), and Ines Montani (co-founder explosion.ai, spaCy), who share their diverse perspectives from pharmaceuticals, finance, and AI tooling. Together, they’ll explore the cultural, technical, and ethical dimensions of AI adoption in large organizations, the growing influence of open-source ecosystems, and the long-term vision required to build sustainable, human-centered AI systems. This session is designed for those who want to move past the hype and better understand what real-world innovation at scale looks like—and what it demands from leadership, infrastructure, and community. pyconde-pydata-2025-68523-ai-in-reality-fireside-chat-enterprise-ai-open-source-innovation General: Others Alexander CS HendorfDr. Alexander BeckWalid MehannaInes Montani en While headlines are dominated by generative AI breakthroughs and ever-larger models, some of the most meaningful progress is happening quietly—in enterprises that are aligning AI with long-term strategy and in open-source communities driving technical excellence. This session brings together Walid Mehanna (Chief Data Officer, Merck), Dr. Alexander Beck (CTO, Quoniam), and Ines Montani (co-founder of explosion.ai/spaCy) in a live conversation moderated by Alexander CS Hendorf. Together, they’ll explore how open-source tools shape enterprise AI adoption, the cultural and organizational shifts needed to move beyond pilots and prototypes, and the responsibilities that come with deploying AI in production. From internal LLM platforms and research pipelines to industry collaboration and digital ethics, the panel will offer grounded, practical insights from vastly different domains. This isn’t another panel about AI buzzwords. It’s a discussion about building AI systems that matter—tools that integrate with people, processes, and purpose. The audience can expect a thoughtful, forward-looking exchange between builders, strategists, and leaders who are working at the edge of what’s possible, while keeping a strong eye on what’s meaningful. false https://pretalx.com/pyconde-pydata-2025/talk/TAXVSC/ https://pretalx.com/pyconde-pydata-2025/talk/TAXVSC/feedback/ Zeiss Plenary (Spectrum) Machine Learning Models in a Dynamic Environment Keynote 2025-04-24T13:25:00+02:00 13:25 00:45 "We've only tested the happy path - now users are finding all sorts of creative ways to break the app." What is already a cause for headaches in traditional software engineering turns into a large challenge when the application is based on machine learning models: Data distribution may change from training phase to deployment. Even worse, humans interacting with the model may adjust their behaviour to the model making the gap between original training environment and deployment even larger. When deployed in a public environment the model may be exposed to users trying to game the system. When re-trained it may be exposed to users trying to poison the pool of training data. We will take a tour of historic cases of models being gamed: What are the lessons we learnt a long time ago building e-mail spam filters? What happened when high search engine rankings started to be linked to monetary income? How can personalization and targeted advertising be exploited to influence public discourse? “… it should be clear that improvements in communication tend to divide mankind …” by Harold Innis in Changing Concepts of Time This keynote will turn interactive engaging the audience in sharing their stories on users playing interesting games with deployed models - including counter moves rolled out. If we are to learn from IT security experience, one important ingredient to address these issues is a combination of collaboration and transparency - across organisations. pyconde-pydata-2025-64179-machine-learning-models-in-a-dynamic-environment Keynote Isabel Drost-Fromm en "Collect data, choose an algorithm, train a model to match your target metric and deploy to production." ... sounds easy enough. But what if user behaviour changes after the model was deployed? What if the deployment of the model itself causes a change in user behaviour? This talk will look at examples for models changing user behaviour. In the interactive part the talk will collect stories from the audience. false https://pretalx.com/pyconde-pydata-2025/talk/3FUYVH/ https://pretalx.com/pyconde-pydata-2025/talk/3FUYVH/feedback/ Zeiss Plenary (Spectrum) Safeguard your precious API endpoints built on FastAPI using OAuth 2.0 Talk 2025-04-24T14:20:00+02:00 14:20 00:30 Is implementing authorization on your API endpoints an afterthought? Who should have access to your API endpoints? Is it secure? This talk covers using OAuth 2.0 to secure API endpoints built on FastAPI following industry-recognized best practices. Come on a journey with me from taking your API endpoints to being functional AND secure. When you follow secure identity standards, you’ll be equipped with a deeper understanding of the critical need for authorization. pyconde-pydata-2025-61261-safeguard-your-precious-api-endpoints-built-on-fastapi-using-oauth-2-0 PyCon: Security Semona Igama en Audience Level: Beginners, Pythonistas who build on FastAPI who are not necessarily security experts but still need to deploy secure APIs. History of OAuth 2.0? (3 mins) - Background/history on OAuth - Why do we need OAuth 2.0? Authorization Challenge (2 mins) - Why implement secure authorization now rather than later? - Data sensitivity OAuth 2.0 Overview (3 mins) - Core concepts - Key features: What are JWTs? - Benefits of using OAuth 2.0 Technical Implementation (4 mins) - Components of OAuth 2.0 - Different types of authorization flows and use cases - API setup on FastAPI Demo with FastAPI (12 mins) - Create an endpoint in FastAPI framework and secure it with OAuth 2.0 - What are the different identity providers that can provide authorization? - Troubleshooting common issues Best Practices (4 mins) - Industry-standard protocol - Token-based security - Should you build your authorization server? Next Steps (2 mins) - Ability to integrate/provide SSO with various IdPs - Share resources to learn more including blogs, GitHub repo, etc. - Got questions? Connect with me! false https://pretalx.com/pyconde-pydata-2025/talk/K9ACTV/ https://pretalx.com/pyconde-pydata-2025/talk/K9ACTV/feedback/ Zeiss Plenary (Spectrum) They are not unit tests: a survey of unit-testing anti-patterns Talk (long) 2025-04-24T15:00:00+02:00 15:00 00:45 The entire industry approves of unit testing but almost no one can fully agree on how to do it correctly, or even on what unit tests are. This results in unit tests often being associated with slower development cycle and an overall less enjoyable workflow. I'll show you how testing turns into hell in real enterprises with the most common anti-patterns and then I'll show you that most of them are avoidable with modern tooling like mutation testing, snapshot testing, dirty-equals, and many more. We'll discuss how to make tests speed up your development and make refactoring easy. pyconde-pydata-2025-61234-they-are-not-unit-tests-a-survey-of-unit-testing-anti-patterns PyCon: Testing Stanislav Zmiev en Similar to TDD, unit tests are one of the most misunderstood concepts in software engineering. In this session, I will cover the most important fallacies about unit testing and the most common anti-patterns. I will also show you how modern infrastructure (pytest-fixture-classes, inline-snapshot, dirty-equals, import-linter, mutmut, and pytest-xdist) makes it possible to avoid most of them. We will discuss that the real goal of tests is not always stability and how tests often make refactoring and restructuring your project easy, not hard. I will define my criteria for good tests and then for the rest of the session, we will be using it to analyze anti-patterns and explore modern solutions to them. You will see: 1. How people make their "units" too small and how you can prevent it using import-linter 2. How people make their "units" too big and what architectural patterns can you use to make them smaller 3. How the real value of tests is in the quality of their assertions and how mutation testing can measure it for you 4. How people end up with asserting too much, and how inline-snapshot and dirty-equals make this problem obsolete 5. How people try to cover the volatile parts of their software, and how coveragepy already has tooling to prevent it 6. How slow tests hurt you, and how to make your tests fast even if you tried it many times and failed 7. How to build an architecture that makes writing tests hard, and how to make it easy using inline-snapshot, pytest-fixture-classes, and a few clever tricks 8. How you can mock your way into making your tests useless, what you should actually mock and how testcontainers can help you with that After this session, your tests will become your friend instead of slowing you down. false https://pretalx.com/pyconde-pydata-2025/talk/VFE78U/ https://pretalx.com/pyconde-pydata-2025/talk/VFE78U/feedback/ Zeiss Plenary (Spectrum) PyLadies Panel: AI Skills & Careers Panel 2025-04-24T16:15:00+02:00 16:15 01:00 As generative AI and autonomous agents rapidly transform the workplace, the skills required to thrive are evolving just as quickly. This panel will explore the essential AI skills that are driving career growth. pyconde-pydata-2025-68399-pyladies-panel-ai-skills-careers General: Education, Career & Life Tereza IofciuAnastasia KaravdinaJesper DramschGuadalupe Canas Herrera en In this panel, we will have some of our PyLadies & Friends discuss career challenges in the age of "everything AI", and how to overcome them. As generative AI and autonomous agents rapidly transform the workplace, the skills required to thrive are evolving just as quickly. This panel will explore the needed AI skills that are driving career growth. Whether you are at the beginning of your career or a very experienced Pythonista, this panel is for you! false https://pretalx.com/pyconde-pydata-2025/talk/9TRFCK/ https://pretalx.com/pyconde-pydata-2025/talk/9TRFCK/feedback/ Zeiss Plenary (Spectrum) Lightning Talks (2/2) Lightning Talks 2025-04-24T17:45:00+02:00 17:45 01:30 Lightning Talks at PyCon DE & PyData are short, 5-minute presentations open to all attendees. They’re a fun and fast-paced way to share ideas, showcase projects, spark discussions, or raise awareness about topics you care about — whether technical, community-related, or just inspiring. No slides are required, and talks can be spontaneous or prepared. It’s a great chance to speak up and connect with the community! Please note: community conference and event announcements are limited to 1 minute only. All event announcements will be collected in a slide slide deck. pyconde-pydata-2025-68195-lightning-talks-2-2 General: Others Valerio Maggio en ### ⚡ Lightning Talk Rules * No promotion for products or companies. * No call for 'we are hiring' (but you may name your employer). * One LT per person per conference policy. #### Community Event Announcements * ⏱ You want to announce a community event? You have ONE minute. * All event announcements will be collected in a single slide slide deck, see instructions at the Lightning Talk desk in the Community Space in the Lounge on Level 1. #### All other LTs: * ⏱ You have exactly 5 minutes. The clock starts when you start — and ends when time’s up. That’s the thrill of Lightning Talks ⚡ * 🎯 Be sharp, clear, and fun. Introduce your idea, make your point, give the audience something to remember. No pressure. (Okay, maybe a little.) * 🎲 You must include at least **one entry from the [official Bingo Card list](/bingocard/)**. Every audience member will receive a Bingo card — and they’ll be watching 👀 Your job? Choose at least one Bingo item from the [official Bingo Card list—](/bingocard/)and drop it into your talk. Subtly or dramatically — your style. * 🐍 Keep it relevant to Python, PyData and the community. You can go broad — tools, workflows, stories, experiments — as long as there’s some connection to Python, PyData or the community. * 👏 Keep it respectful. Keep it awesome. Humor is welcome, but please be kind, inclusive, and professional. * 🎤 Be ready when your name is called. We’re running a tight session — speakers go on stage rapid-fire. Stay close and stay hyped. * 🏆 Bonus prizes may be awarded. Best talk, best Bingo moment, most unexpected Hogwarts reference... who knows what could happen? #### How to Submit The Lightning Talk desk is located in the Community Space in the Lounge on Level 1. false https://pretalx.com/pyconde-pydata-2025/talk/SUDMDV/ https://pretalx.com/pyconde-pydata-2025/talk/SUDMDV/feedback/ Titanium3 Design, Generate, Deploy: Contract-First with FastAPI Talk 2025-04-24T10:15:00+02:00 10:15 00:30 This talk explores a contract-first approach to API development using the OpenAPI generator, a powerful tool for automating API generation from a standardized specification. We will cover (1) what would you need to run to have a standard implementation of the FastAPI endpoints and data models; (2) how to customize the mustache templates that are used to generate the API stubs; (3) share some ideas how to customize the CLI and (4) how to maintain the contract and how to handle breaking changes to the contract. We will close the session with a discussion of the challenges of implementing the OpenAPI generator. pyconde-pydata-2025-60732-design-generate-deploy-contract-first-with-fastapi PyCon: MLOps & DevOps Dr. Evelyne GroenKateryna Budzyak en Let me share a story with you about two developers working at a Malt, Europe's leading freelance management system & marketplace. Dev-1: Hi there! We have an issue on production. It seems that a request was sent where “company id” is not given. Dev-2: Oops! But I thought we agreed on an anonymous mode? Dev-1: That’s actually a great idea. You mean that company id is not required? Dev-2: Exactly! Dev-1: Thanks! I will update the data model and push the changes! As the conversation above suggests, sending data between two applications can easily fail if the requirements are not defined up front. Even for simple requests a lot of decisions have to be made: are the fields optional or mandatory? What about the returned payloads and their data types? Do we need default values? If we are not clear what we will expect (from the request) and what we will return (in the response), in the worst case, the request will fail and we spend time debugging, like above. To overcome this issue, we decided to move to a contract-first approach, where we define the exact request and response and generate the endpoints and data models from there using the OpenAPI generator. The OpenAPI generator is a powerful tool that allows you to automatically generate API client libraries, server stubs, documentation, and configuration from an OpenAPI specification, or a “contract” between two applications. This contract forms the basis for generating the endpoint stubs for our python applications but also for the client models and code. Starting with the contract can significantly speed up the development process and improve the consistency of your API implementations. During this talk we will address the following topics: - The vanilla implementation that generates endpoints and data models: what would you need to run to have a first version of the FastAPI endpoints. If the setting allows for it, we would show a short demonstration. - How to use customisable templates: we customised the mustache templates that generated the endpoints and data models so we could generate our custom FastAPI app. Also we added examples to the generated data models as these were not available in the default implementation. - How to customise the CLI tool and ideas for setting up your CI pipeline: we will share some ideas how to customise the CLI and how we used it in our CI pipeline to prevent discrepancies between the contract and the generated stubs. - how to maintain the contract and how to handle breaking changes to the contract We will close the session with a discussion of the challenges and benefits of implementing the OpenAPI Generator. While it offers standardisation and best practices, it can introduce additional complexity, especially with the tool still in beta. We'll share our experiences navigating this trade-off. false https://pretalx.com/pyconde-pydata-2025/talk/ZACM3E/ https://pretalx.com/pyconde-pydata-2025/talk/ZACM3E/feedback/ Titanium3 Serverless Orchestration: Exploring the Future of Workflow Automation Talk 2025-04-24T10:55:00+02:00 10:55 00:30 Orchestration is a typical challenge in the data engineering world. Scheduling your data transformation jobs via CRON-jobs is cumbersome and error-prone. Furthermore, with an increasing number of jobs to manage it gets in-oversee able. Tools like Apache Airflow, Dagster, Luigi, and Prefect are known for addressing these challenges but often require additional resources or investment. With the advent of serverless orchestration tools, many of these disadvantages are mitigated, offering a more streamlined and cost-effective solution. This session provides a comprehensive overview of combining serverless architecture with orchestration. We will start by defining the core concepts of orchestration and serverless technologies and discuss the benefits of integrating them. The talk will then analyze solutions available in the cloud vendor space. Attendees will leave with a well-rounded understanding of the tools and strategies available in serverless orchestration. pyconde-pydata-2025-59751-serverless-orchestration-exploring-the-future-of-workflow-automation PyCon: Programming & Software Engineering Tim Bossenmaier en Orchestration is a typical challenge in the data engineering world. Scheduling your data transformation jobs via CRON-jobs is cumbersome and error-prone. Furthermore, with an increasing number of jobs to manage it gets in-oversee able. Tools like Apache Airflow, Dagster, Luigi, and Prefect are known for addressing these challenges but often require additional resources or investment. With the advent of serverless orchestration tools, many of these disadvantages are mitigated, offering a more streamlined and cost-effective solution. Beyond data engineering, serverless orchestration holds substantial potential for classical software engineering, especially as organizations explore serverless approaches for optimizing efficiency and reducing overhead. In this talk you will explore: * Basic Introduction to Serverless Orchestration: - What is orchestration about? - What is serverless about? - Why combining the two of them? * Offerings from Major Cloud Vendors: - Analyzing solutions from leading cloud providers in the realm of serverless orchestration * Patterns and Solutions for Serverless Orchestration in Software Engineering: - Exploring how serverless orchestration can be applied within classical software engineering contexts Participants will leave this session equipped with a comprehensive understanding of the serverless orchestration landscape and its applications across different engineering disciplines. false https://pretalx.com/pyconde-pydata-2025/talk/AGY8CT/ https://pretalx.com/pyconde-pydata-2025/talk/AGY8CT/feedback/ Titanium3 Reinventing Streamlit Talk (long) 2025-04-24T11:35:00+02:00 11:35 00:45 Dreaming of creating sleek, interactive web apps with just Python? Streamlit is great for dashboards, but what if your needs go beyond that? Discover how Reflex.dev, a cutting-edge full-stack Python framework, lets you level up from dashboards to full-fledged web apps! pyconde-pydata-2025-60317-reinventing-streamlit PyCon: Django & Web Malte Klemm en Have you ever wished you could build sleek, interactive web apps using just Python? Maybe you’ve tried Streamlit and loved its simplicity. But maybe you also had the feeling that your dashboard is no longer a dashboard and your needs have outgrown Streamlit's data model. In this talk, I’ll introduce Reflex.dev, a powerful Python framework that makes web development effortless. Reflex combines the ease of Python with the flexibility of React, enabling you to create full-stack, interactive apps quickly. We’ll cover the basics: what Reflex.dev is and how it stacks up against familiar frameworks. Then, we’ll dive into building a Streamlit-inspired app from scratch in Reflex.dev by creating an API compatibility wrapper. Along the way, I’ll show you how Reflex can: - Help you build dynamic, shareable web apps with only Python. - Smoothly transition your Streamlit app into a stateful Reflex app. - Make the whole react ecosystem accessible. - Help testing your application. No web development experience? No problem. This talk is for anyone who wants to create web apps without diving into JavaScript. We’ll stick to Python and start from the ground up. By the end, you’ll leave with a working Streamlit clone and a powerful new tool in your Python arsenal. Let’s make web development fun again! false https://pretalx.com/pyconde-pydata-2025/talk/7CXSPN/ https://pretalx.com/pyconde-pydata-2025/talk/7CXSPN/feedback/ Titanium3 Duplicate Code Dilemma: Unlocking Automation with Open Source! Talk 2025-04-24T14:20:00+02:00 14:20 00:30 "Don't Repeat Yourself" – a phrase that we have all heard many times. In this talk, we will have an overview how to deal with code duplication and how open-source template libraries such as Copier can assist us in managing similarly structured repositories. Furthermore, we will explore how code updates can be automated with the help of open-source libraries like Renovate Bot. By the end of this session, you will gain insights into these solutions while also questioning whether they truly eliminate repetition or merely contribute to another cycle of automation. pyconde-pydata-2025-60316-duplicate-code-dilemma-unlocking-automation-with-open-source PyCon: Programming & Software Engineering Raana Saheb-Nassagh en “Don’t Repeat Yourself” (DRY) is one of the first principles that every programmer encounters in the early stages of their coding journey. Some of us even had to learn it the hard way. We promised ourselves to avoid repetitive code to never again deal with the extensive refactoring required for every small change. This simple principle has found a fundamental place in every programmer's heart. It may also be the reason why, from time to time, every programmer doubts their code and begins to refactor it in the early stages of coding. This talk provides an overview of different solutions for preventing code repetition. We will start with the most common solutions, such as using git commands, and then explore more intermediate approaches for managing similarly structured repositories with the help of open-source template libraries such as Copier and Cookiecutter. Finally, we will address a more complex problem and examine how to automate updates using open-source tools like Renovate Bot. As a takeaway, participants will gain insights into various solutions and a glimpse into the usability of each open-source library. Participants are also encouraged to reconsider the entire process: Are these solutions truly preventing repetitive code, or are we merely caught in an endless cycle of automation? false https://pretalx.com/pyconde-pydata-2025/talk/KZKT9W/ https://pretalx.com/pyconde-pydata-2025/talk/KZKT9W/feedback/ Titanium3 Distributed file-systems made easy with Python's fsspec Talk (long) 2025-04-24T15:00:00+02:00 15:00 00:45 The cloud native revolution has impacted all aspects of engineering, and data engineering is not exempt. One of the ongoing challenges in the data engineering world remains the local and distributed cloud native storage. In this talk we’ll explore working with distributed file systems in Python, through an intro to fsspec: a popular python library that is well-positioned to address the growing challenge of interacting with storage systems of different kinds in a consistent way. In this talk we’ll show hands-on examples of working with fsspec with some of the most popular data tools in the Python community: Pandas, Tensorflow and PyArrow. We’ll demonstrate a real world implementation of fsspec and how it provides easy extensibility through open source tooling. You’ll come away from this session with a better understanding for how to implement and extend fsspec to work with different cloud native storage systems. pyconde-pydata-2025-60444-distributed-file-systems-made-easy-with-python-s-fsspec PyData: Data Handling & Engineering Einat OrrBarak Amar en ### **1. Setting the Stage: Local vs. Distributed Storage (5 minutes)** - **What’s the Big Deal with Storage?** - First, let’s talk about the shift from local storage (where we keep files on our own machines) to cloud-native storage (where data is spread across servers in the cloud). - This shift is awsome but comes with new challenges: distributed systems can be tricky to work with, especially when you need to access them in a consistant way. ### **2. Enter fsspec: A Game Changer for File Systems (10 minutes)** - **What is fsspec?** - fsspec is a Python library that makes working with any kind of file system—whether it's local, in the cloud, or on a distributed system—much easier. - It does this by giving us a unified way to interact with storage, no matter where the files actaully live. - **Why is fsspec Awesome?** - It simplifies file operations (like opening and reading files) across different storage systems, saving us time and mental enery. - Plus, it’s open-source, which means you can extend it and make it work for your own unique storage setup. ### **3. fsspec in Action: How It Works with Popular Python Tools (15 minutes)** #### **A. Using fsspec with Pandas** - **Pandas & fsspec:** - If you work with Pandas, you’re probably familiar with loading and saving data. fsspec helps make this process smoother by letting you pull data from cloud storage (like AWS S3) with no fuss. - We’ll see how this works in practise, making it easy to work with large datasets in the cloud. #### **B. Using fsspec with TensorFlow** - **TensorFlow & fsspec:** - If you’re building machine learning models, TensorFlow needs to access training data and models, sometimes stored in the cloud. - With fsspec, TensorFlow can seamlessly interact with cloud storage, making your ML pipelines more streamlined and less frustraiting. #### **C. Using fsspec with PyArrow** - **PyArrow & fsspec:** - PyArrow is great for high-performance data processing. When working with big data files like Parquet, fsspec makes it easy to load and save them from cloud storage without missing a beat. ### **4. Extending fsspec: Building Your Own Solutions (5 minutes)** - **What if I Need Something Custom?** - Sometimes, you need to work with storage systems that aren’t “out of the box.” The cool part about fsspec is that it’s highly extensible. - I’ll walk through how you can easily extend fsspec to work with your own custom storage systems, using a real-world example of how we did this. ### **5. Wrap-Up & Key Takeaways (5 minutes)** - **The Big Picture:** - fsspec is a simple yet powerful tool for making cloud-native storage work seamlessly with Python data tools like Pandas, TensorFlow, and PyArrow. - It’s the tool you didn’t know you needed to simplify your cloud storage tasks. - **Final Thought:** - With fsspec, working with distributed storage doesn’t have to be hard. It makes everything feel like you’re working with local files, even when they’re scattered across the cloud. ### **6. Q&A Session (5 minutes)** false https://pretalx.com/pyconde-pydata-2025/talk/DEHZHK/ https://pretalx.com/pyconde-pydata-2025/talk/DEHZHK/feedback/ Titanium3 Learnings from migrating a Flask app to FastAPI Talk 2025-04-24T16:15:00+02:00 16:15 00:30 FastAPI has been constantly growing in popularity during the last years. A lot of this growth is driven by its relative simplicity and ease-of-use. In this talk, we'll discuss some practical insights into building a FastAPI application, based on my experience of migrating an existing Flask prototype to FastAPI. We'll explore how FastAPI's core features like Pydantic integration and dependency injection can improve API development, while also talking about the drawbacks of FastAPI. pyconde-pydata-2025-60738-learnings-from-migrating-a-flask-app-to-fastapi PyCon: Django & Web Orell Garten en Building HTTP APIs has become a normal part of the work as a software or data engineer within the last 10 to 15 years. In the Python ecosystem Flask was the only option to build an HTTP API for many years. After its initial release in 2018 FastAPI quickly became a serious alternative to build such APIs with Python. In this talk I will share my experiences from migrating an existing HTTP API built with flask to a FastAPI-based API. We will discuss the following topics: - Why did we migrate at all? - Data modeling - Async is overrated - Problems you **will** encounter - Migration strategy The talk will show you the practical differences between developing APIs with FastAPI or Flask. Material: https://github.com/orgarten/pycon-de-2025/blob/main/2025-pycon-learnings-from-migrating-flask-app-to-fastapi.pdf false https://pretalx.com/pyconde-pydata-2025/talk/EDJ8N7/ https://pretalx.com/pyconde-pydata-2025/talk/EDJ8N7/feedback/ Titanium3 Lessons learned in bringing a RAG chatbot with access to 50k+ diverse documents to production Talk (long) 2025-04-24T16:55:00+02:00 16:55 00:45 Retrieval-Augmented Generation (RAG) chatbots are a key use case of GenAI in organizations, allowing users to conveniently access and query internal company data. A first RAG prototype can often be created in a matter of days. But why are the majority of prototypes still in the pilot stage? [\[1\]](https://www2.deloitte.com/content/dam/Deloitte/us/Documents/consulting/us-state-of-gen-ai-q3.pdf) In this talk we share our insights from developing a production-grade chatbot at Merck. Our RAG chatbot for R&D experts accesses over 50,000 documents across numerous SharePoint sites and other sources. We identified three technical key success factors: 1. Building a robust data pipeline that syncs documents from source systems and that handles enterprise features such as replicating user permissions. 2. Developing a chatbot workflow from user question to answer with retrieval components such as hybrid search and reranking 3. Establishing a comprehensive evaluation framework with a clear optimization metric. We think that many of these lessons are broadly applicable to RAG chatbots, making this talk valuable for practitioners aiming to implement GenAI solutions in business contexts. pyconde-pydata-2025-61144-lessons-learned-in-bringing-a-rag-chatbot-with-access-to-50k-diverse-documents-to-production PyData: Generative AI Bernhard SchäferNico Mohr en Building a prototype RAG chatbot with frameworks like LangChain can be straightforward. However, scaling it into a production-grade application introduces complex challenges. In this talk, we share our lessons learned from developing a RAG chatbot designed to assist research and development (R&D) experts. Our chatbot was developed to effectively handle and provide access to a large collection of unstructured knowledge, consisting of over 50,000 documents stored across more than 20 SharePoint sites and other sources. We faced significant hurdles in: - **Data Pipeline Engineering**: Crafting a modular and scalable pipeline capable of periodically syncing documents, handling dynamic user permissions, and efficiently processing large volumes of unstructured data. - **RAG Design and Prompting Strategies**: Addressing challenges in document chunking, citation integration, reranking retrieved results, and applying permission and PII filters to ensure compliance and accuracy in responses. - **Evaluation Framework Development**: Implementing an effective testing strategy without the availability of static ground truth data. We employed automated testing with frameworks like pytest, utilized LLM-as-a-judge, and integrated tracing to iteratively refine our dataset and maintain high answer quality. - **User Adoption**: Driving user adoption through onboarding training and ongoing engagement, such as regular office hours and feedback mechanisms. We emphasize the importance of applying data science principles to GenAI projects: - **Start Simple and Iterate**: Begin with a basic implementation as a baseline and iteratively enhance functionality based on testing and user feedback. - **Test-Driven Development**: Identify key test scenarios early and use them to drive development, ensuring that improvements are measurable and aligned with growing user needs. - **Focus on Key Metrics**: Establish clear metrics to optimize against, aiding in making informed decisions throughout the development process. **Main Takeaways for the Audience:** - Understand the critical role of robust, modular data pipelines in handling dynamic and unstructured data sources for LLM applications. - Learn strategies for developing effective evaluation frameworks in complex domains where traditional ground truth data may be lacking. - Gain insights into advanced RAG design techniques that enhance chatbot performance and reliability. - Recognize the substantial data engineering and software development efforts required to transition a prototype to a production-grade LLM solution. By sharing our experiences, attendees will gain practical insights into deploying robust RAG chatbots, transforming a functional prototype into a reliable, scalable application that fulfills enterprise requirements. false https://pretalx.com/pyconde-pydata-2025/talk/XLZQFA/ https://pretalx.com/pyconde-pydata-2025/talk/XLZQFA/feedback/ Helium3 Unforgettable, that's what you are: Evaluating Machine Unlearning and Forgetting Talk 2025-04-24T10:15:00+02:00 10:15 00:30 Can deep learning/AI models forget? In this talk, you'll explore the realm of machine unlearning, where researchers and practitioners aim to remove memorized examples from machine learning models. This is relevant for training increasingly overparameterized models and growing GDPR/Privacy concerns with large scale model development and use. pyconde-pydata-2025-61837-unforgettable-that-s-what-you-are-evaluating-machine-unlearning-and-forgetting PyData: Machine Learning & Deep Learning & Statistics Katharine Jarmul en Deep learning memorization is a known phenomena, where deep learning / AI models memorize parts of their training dataset. This happens often for repeated examples, novel examples and occurs more often in overparameterized models. This presents problems for guiding machine learning behavior, requiring much effort in guardrails and output monitoring, as well as questioning whether the models can be GDPR-compliant (i.e. the right to be forgotten). A growing area of research on machine unlearning or machine forgetting has emerged to investigate ways a model might unlearn or forget particular memorized examples. In this talk, you'll learn about the field of machine unlearning and related topics like data anonymization to evaluate exactly what's truly unforgettable. Jokes aside: you'll have some practical take-aways to apply to your work in data and machine learning development. false https://pretalx.com/pyconde-pydata-2025/talk/SZFRRA/ https://pretalx.com/pyconde-pydata-2025/talk/SZFRRA/feedback/ Helium3 Oh, no! Users love my GenAI-Prototype and want to use it more. Talk 2025-04-24T10:55:00+02:00 10:55 00:30 Demos and prototypes for generative AI (GenAI) projects can be quickly created with tools like Streamlit, offering impressive results for users within hours. However, scaling these solutions from prototypes to robust systems introduces significant challenges. As user demand grows, hacks and workarounds in tools like Streamlit lead to unreliability and debugging frustrations. This talk explores the journey of overcoming these obstacles, evolving to a stable tech stack with Qdrant, Postgres, Litellm, FastAPI, and Streamlit. Aimed at beginners in GenAI, it highlights key lessons. pyconde-pydata-2025-61752-oh-no-users-love-my-genai-prototype-and-want-to-use-it-more PyCon: MLOps & DevOps Thomas PrexlFrank Rust en Demos and prototypes for projects with generative AI can be quickly put together: an API key from the preferred model provider, some source code from the online tutorial and a few small adjustments suffice. Thanks to Streamlit and the like, even beginners can achieve impressive results that can be used by users within a few hours. But what happens when users actually like the solution? When demos and prototypes need to be expanded and connected to other systems? What if the number of users continues to rise? It is quite impressive how far you can bend Streamlit to achieve things it was probably never meant for. But at a certain point, you pay for the hacks and workarounds with unreliability and frustrating debugging. The speakers repeatedly reached this point in various projects and delayed the necessary architecture discussion for too long. So the path was longer and more painful than it should have been – but in the end, thanks to the wide range of open-source (Python) projects, a flexible and stable system was created. Our current tech stack includes Qdrant, Postgress, Litellm and FastAPI – as well as OpenWebUI, and of course Streamlit. Thanks to modularization, we now have a stable system that we can easily run locally but also deploy in an enterprise environment. Nevertheless, we have retained a great deal of flexibility. In our talk, we report on the trials and tribulations along the way. We report on the challenges that led to decisions for various components. We disclose which problems we were able to solve and which new problems arose. The talk is aimed primarily at those who are taking their first steps with generative AI or have already developed their first demonstrators or prototypes. Structure: (1) GenAI applications in Streamlit are cool (2) The challenges on the way from prototype to productive deployment (3) Ramming heads through walls (4) The path to a flexible but stable stack (5) What still plagues us false https://pretalx.com/pyconde-pydata-2025/talk/UXTCZC/ https://pretalx.com/pyconde-pydata-2025/talk/UXTCZC/feedback/ Helium3 Bridging the gap: unlocking SAP data for data lakes with Python and PySpark via SAP Datasphere Talk (long) 2025-04-24T11:35:00+02:00 11:35 00:45 SAP's data often remains locked away, hindering the creation of a complete data picture. This talk presents a hands-on proof of concept leveraging SAP Datasphere, Python and PySpark to bridge an Azure-based, data mesh-inspired open data lake with a centralized SAP BI environment. This presentation will delve into the architecture of SAP Datasphere and its integration interfaces with Python. It will explore network integration, authentication, authorization and resource management options, as well as data integration patterns. The presentation will summarize the evaluated features and limitations discovered during the PoC. pyconde-pydata-2025-61276-bridging-the-gap-unlocking-sap-data-for-data-lakes-with-python-and-pyspark-via-sap-datasphere PyData: Data Handling & Engineering Rostislaw Krassow en In many enterprises relying on SAP ERP systems, a wealth of valuable master data remains trapped within a closed ecosystem. This creates significant obstacles when striving for a comprehensive, 360° view, especially when integrating with modern, open data lakes built on platforms like Azure and designed around data mesh principles. This talk presents a practical PoC that tackles this challenge head-on, utilizing SAP Datasphere as the key integration point. Outline: 1. The challenge: navigating sap's data silos and the pursuit of a unified view * The section outlines the enterprise data landscape of RATIONAL where valuable master data resides within SAP’s traditionally closed ecosystem, hindering data democratization and the creation of a comprehensive, 360° operational view. This situation is frequently encountered, particularly among German manufacturers. * The inherent conflict between the open, distributed nature of data lakes (especially those built on data mesh principles) and the centralized, closed nature of traditional SAP BI environments is discussed. 2. Solution overview: leveraging sap datasphere as the integration layer * An introduction to sap datasphere and its capabilities is provided, with a focus on its ability to connect with non-SAP systems. * This part explains how datasphere was chosen as the central integration layer for the proof of concept and its role in enabling bi-directional data flow between SAP and the open data lake. 3. Architecture of SAP Datasphere * Introduction in architecture of SAP Datasphere and role of underlying SAP HANA database * Explanation of openSQL schema as key integration option 4. Security first: exploring network integration, authentication and authorization options * This section details the evaluation of network connectivity options between the Azure services like Azure Databricks, PostgresQL, ADLS and SAP Datasphere * The methods used to authenticate Python and Pypark to SAP datasphere are explained * The implementation and evaluation of data authorization mechanisms within SAP Datasphere are described 5. Python and PySpark integration * Available interfaces for python integration (ODBC/JDBC, OData), their features and limitations * Explanation of practical data integration patterns implemented within the poc for extracting data from sap and loading it into the data lake for full and delta load scenarios 6. Reflecting PoC: summary and key learnings * This section summarizes the core findings and lessons learned from the PoC, particularly regarding security and software quality best practices * A hint for the SAP open data alliance launched in 2023 Main takeaways: * An understanding of SAP Datasphere's architecture and its potential for integrating non-SAP, open-source technologies like Python and PySpark * Knowledge of current features and limitations of SAP Datasphere in the area of data integration with the open source world true https://pretalx.com/pyconde-pydata-2025/talk/7CL3KS/ https://pretalx.com/pyconde-pydata-2025/talk/7CL3KS/feedback/ Helium3 Analyze data easily with duckdb - and the implications on data architectures Talk 2025-04-24T14:20:00+02:00 14:20 00:30 duckdb is increasingly becoming a universal tool for accessing and analyzing data. In this talk I will show with slides and live demo what duckdb is capable of and will dive deeper in how it will influence modern data architectures. pyconde-pydata-2025-61134-analyze-data-easily-with-duckdb-and-the-implications-on-data-architectures PyData: Data Handling & Engineering Matthias Niehoff en duckdb - a lightweight database with a focus on data analysis and a fast query engine that can be used in a variety of ways: - Analyze data, stored on your own hard drive or somewhere on the Internet, in the browser with SQL? No problem - Quickly check all the JSON files in S3 using SQL? Nothing could be easier - A huge parquet file, bigger than my working memory. And now I have to analyze it locally. Easy! - Read csv from blob storage, process and save in a Postgres database. Just one command duckdb is developing more and more into a universal tool for accessing and analyzing data. In this talk I will show with slides and a live demo why it is so popular and why it belongs in the toolbox of every data scientist, ML engineer or data engineer. But I will not stop at the useful tooling. I will dive deeper into the implications for data and software architectures that arise from the rise of the embedded OLAP systems like duckdb. I will especially focus on both moving the data closer to the user for faster analytics but also on accessing data without the explicit need to move it. What you learn and see can be used immediately in your day-to-day work. false https://pretalx.com/pyconde-pydata-2025/talk/TXKLWR/ https://pretalx.com/pyconde-pydata-2025/talk/TXKLWR/feedback/ Helium3 Scraping LEGO for Fun: A Hacky Dive into Dynamic Data Extraction Talk (long) 2025-04-24T15:00:00+02:00 15:00 00:45 Unlock the full potential of modern web scraping by combining Python, Scrapy, and Playwright to extract data from dynamic, JavaScript-heavy sites—exemplified by LEGO product pages. This talk introduces Model Context Protocol (MCP) servers for orchestrating advanced data fetching, refining CSS selectors, and integrating Large Language Models for automated code suggestions. Learn how to scale ethically, handle concurrency, and respect site policies, while maintaining flexible, maintainable pipelines for diverse use cases from research to robotics. pyconde-pydata-2025-61429-scraping-lego-for-fun-a-hacky-dive-into-dynamic-data-extraction PyData: Data Handling & Engineering Peter Lodri en # Advanced Web Scraping: From LEGO to Production Today's web landscape is teeming with JavaScript-heavy content, complex layouts, and sometimes opaque data structures. But what if you could reliably scrape rich product information—images, specs, descriptions—from modern e-commerce sites without hitting constant roadblocks? This session tackles advanced scraping with Python, Scrapy, and Playwright, exemplified by data extraction from LEGO product pages. We'll explore a "grey hat" perspective—applying a slightly "hacky" mindset—while stressing practical ethics, performance considerations, and compliance with site policies. ## Outline ### 1. Introduction: The Hacky Spirit vs. Ethical Constraints - Why scrape LEGO? - Setting boundaries: terms of service, rate limiting, and disclaimers - When "scraping for fun" crosses into potential legal pitfalls ### 2. Scraping Tech Stack Overview - Scrapy for structured crawling and item pipelines - Playwright for rendering JavaScript and handling dynamic elements - Comparison to traditional HTML-only approaches - Project structure, environment setup, and practical tips ### 3. Spiders in Action - Product Spider: Extracting core product data (ID, name, specifications, multiple images) - Gallery Spider: Navigating hidden galleries, handling tricky JS-based carousels, and filtering unwanted images - Ensuring consistent output (JSON or database ingestion) ### 4. Model Context Protocol (MCP) Integration - Definition: Leveraging specialized helper servers for orchestrating data fetching, refining selectors, and automating debugging - Chaining Large Language Models: Code suggestions, auto-generation of selectors, and reactive error handling - Example workflow: "Broken selector? Ask the MCP server for an LLM-aided fix" ### 5. Performance & Scale - Polite but robust concurrency: balancing speed and TOS compliance - Handling large link lists, incremental updates, and site changes - Monitoring and logging for reliability, debugging, and optimization ### 6. Ethics & Privacy - Respecting site ownership, disclaimers, and usage limits - Storing scraped data securely and avoiding personal information - A discussion of "grey hat" territory: testing site vulnerabilities without exploiting them ### 7. Use Cases & Extensions - Research software engineering: building reproducible data sets - Robotics and embedded: offline or partial data ingestion for classification or motion planning - Future directions: advanced concurrency, containerization, and HPC ### 8. Demo & Q&A - Live snippet showing an MCP-powered spider reacting to a changed DOM structure - Q&A session on bridging the gap between hackery and best practices ## Key Takeaways - Techniques for scraping dynamic, JS-heavy sites using Python, Scrapy, and Playwright - Practical "hacky" methods balanced by responsible, 'ethical approaches' - Introduction to Model Context Protocol servers for automated code refinement - Scalable patterns for data handling, from small tests to large-scale deployments Whether you're a data engineer, hobbyist, or researcher, this talk provides a robust (and slightly subversive) recipe for capturing essential data from the wild world of modern websites—without crossing into unethical or unlawful territory. false https://pretalx.com/pyconde-pydata-2025/talk/HSFR7A/ https://pretalx.com/pyconde-pydata-2025/talk/HSFR7A/feedback/ Helium3 Optimizing in the Python Ecosystem – Powered by Gurobi Sponsored Talk 2025-04-24T16:15:00+02:00 16:15 00:30 Join us as we explore integrating Gurobi and prescriptive analytics into your Python ecosystem. In this session, you’ll discover model-building techniques that leverage NumPy and SciPy.sparse as well as the data structures of pandas. We’ll also show you how to seamlessly integrate trained regressors from scikit-learn as constraints in your optimization models. Elevate your workflows and unlock new decision-making capabilities with Gurobi in Python. pyconde-pydata-2025-66498-optimizing-in-the-python-ecosystem-powered-by-gurobi PyCon: Python Language & Ecosystem Silke Horn en Gurobi is a prescriptive analytics technology that enables you to make optimal decisions from data. You can use prescriptive analytics to generate optimized decision recommendations, based on real-world variables and constraints. Powered by mathematical models solved by mixed-integer optimization, it enables embedded decision intelligence in all kinds of applications in an industry-agnostic fashion and in any deployment scenario. Join us as we explore integrating Gurobi and prescriptive analytics into your Python ecosystem. In this session, you’ll discover model-building techniques that leverage NumPy and SciPy.sparse as well as the data structures of pandas. We’ll also show you how to seamlessly integrate trained regressors from scikit-learn as constraints in your optimization models. Elevate your workflows and unlock new decision-making capabilities with Gurobi in Python. false https://pretalx.com/pyconde-pydata-2025/talk/BAASYV/ https://pretalx.com/pyconde-pydata-2025/talk/BAASYV/feedback/ Helium3 Challenges and Lessons Learned While Building a Real-Time Lakehouse using Apache Iceberg and Kafka Talk (long) 2025-04-24T16:55:00+02:00 16:55 00:45 How do you build a large-scale data lakehouse architecture that makes data available for business analytics in real time, while being more cost-effective, more flexible and faster than the previous proprietary solution? With Python, Kafka and Iceberg, of course! We built a large-scale data lakehouse based on Apache Iceberg for the Schwarz Group, Europe's largest retailer. The system collects business data from thousands of stores, warehouses and offices across Europe. In this talk, we will present our architecture, the challenges we faced, and how Apache Iceberg is shaping up to be the data lakehouse format of the future. pyconde-pydata-2025-60723-challenges-and-lessons-learned-while-building-a-real-time-lakehouse-using-apache-iceberg-and-kafka PyData: Data Handling & Engineering Jonas BöerElena Ouro Paz en The Schwarz Group is present in thirty-two countries around the world with over ten thousand stores, hundreds of warehouses, an assortment of over two thousand different products and a single ERP system to manage them all. Every country maintains its own databases for operational purposes but all the data is also gathered in one central analytics platform for all countries. Not only does this platform need to be stable and reliable, the data also needs to be made available for the consumers in near real-time –within mere minutes. The existing analytics platform was based on proprietary solutions which were expensive and required niche knowledge, severely limiting the number of available developers. Therefore, we set out on a journey to completely redesign the analytics platform; a new solution, based as much as possible on Open Source technologies like Python, Kafka and Iceberg. Leveraging Python for its great ecosystem and ease of use. Kafka for fast and reliable message processing of over one thousand tables per country into one central hub. And Iceberg at the core, as our data lakehouse format for its fully transparent schema evolution and high performance through its rich metadata layer. Through our presentation we will showcase the different challenges we faced during the design of our new architecture, how our selected tech stack allowed us to tackle each of them and the lessons we learnt. We will focus on our challenges in four areas: scalability, performance, continuity of service and data quality. Scalability: Ingesting changes on over one thousand tables coming from servers across thirty-two countries supporting operations of Europe’s largest retailer is no easy task. Our architecture needs to support receiving tens of thousands of events per second. We will present how we set up Kafka to support our current load and potential for future growth, how we use Tabular’s Iceberg sink connector to ingest all our tables, and how we leverage avro serialization and snappy compression of messages to reduce network traffic. Performance: The large amounts of data we handle, paired with the influx of small files that can result from real-time data ingestion, made ensuring performance an extremely challenging aspect of our application. We will show how we designed our data lake house; using Iceberg's hidden partitioning to ensure performance while remaining flexible to evolve them over time and how we designed and implemented an effective maintenance job to reduce small files in our iceberg tables. Continuity of Service: The existing analytics platform contains the core of the business data which is used for many analytics and forecasting use cases across the organization. One of main requirements was to ensure a smooth transition to the new architecture with as little downtime as possible. This meant facilitating the access to existing users by allowing them to retrieve the data in the same way they were doing it in the past, with minimal changes. We will show how our architecture ensures flexibility by allowing access to the data from diverse query engines and show an example on how we integrated our architecture with Snowflake. Data quality: We faced some challenges when it came to the consolidation of all the data we receive from all 32 countries. For some tables, the schemas diverged across countries; having different sets of columns, different data types being used and even different primary keys. We will talk about how we handled data quality issues coming from the operational databases by using iceberg’s schema evolution capabilities, a schema registry and Kafka Connect single message transforms (SMTs). true https://pretalx.com/pyconde-pydata-2025/talk/JUAF3S/ https://pretalx.com/pyconde-pydata-2025/talk/JUAF3S/feedback/ Platinum3 Building versatile operating setups for real world use and testing with Python and the Raspberry Pi Sponsored Talk 2025-04-24T10:15:00+02:00 10:15 00:30 **Rosenxt** is the host of a number of ventures aiming to provide next level solutions for demanding problems in a variety of industries based on decades of engineering excellence. Some of them address challenges in water environments ranging from water pipelines to offshore applications. As differing as these areas may seem, regarding the solutions we build for them they have a lot in common. Whether its the necessary power supply, movement and steering concepts or sensing approaches. All of them benefit from generalized, smart solutions that we design as components that can later be orchestrated and configured in various setups to fulfill quite different purposes. This presentation explores the versatility of leveraging a Raspberry Pi based hardware platform combined with a Python based application stack to bridge development and deployment of various basic components, such as motors and motor controllers, lift foils, steering units and controls. By utilizing a unified platform, we demonstrate how the same system can seamlessly transition from test bench measurements during hardware component development to real-world applications for various industries. The talk highlights how this approach can create a robust framework to help streamlining workflows, enhance scalability and reduce costs. pyconde-pydata-2025-66640-building-versatile-operating-setups-for-real-world-use-and-testing-with-python-and-the-raspberry-pi PyData: Embedded Systems & Robotics Jens Nie en We will specifically showcase a setup, where a custom made Raspberry Pi based hardware platform and a Python application stack is used for operating a so called functional model, where a set of components is orchestrated to showcase a final usage scenario and the same setup is used in a test rig environment to specifically benchmark a single component of the functional model. Both use cases work pretty much the same way and generate the same sort of data in the same formats and structure, which eases evaluation and handling significantly. The solution presented showcases both 'standard' Python applications interacting with each other on a Raspberry Pi as well as **pyscript** based scripts running in a Web-Browser to visualise test data in realtime. false https://pretalx.com/pyconde-pydata-2025/talk/PW3VKG/ https://pretalx.com/pyconde-pydata-2025/talk/PW3VKG/feedback/ Platinum3 Composable AI: Building Next-Gen AI Agents with MCP Sponsored Talk 2025-04-24T10:55:00+02:00 10:55 00:30 At Blue Yonder, we're embarking on a journey toward building composable AI agents using Model Context Protocol (MCP). We're discovering firsthand the challenges of integrating diverse products and APIs into useful, context-aware agents. In this talk, I'll discuss our early experiences, the challenges we've faced, and why MCP is emerging as a potential game changer for developing scalable, flexible AI solutions. pyconde-pydata-2025-67105-composable-ai-building-next-gen-ai-agents-with-mcp PyData: Generative AI Martin Seeler en In this talk, I'll share our journey with MCP at Blue Yonder, explaining why this protocol is becoming crucial for anyone involved in building AI agents. We'll start by understanding what an agent really is - essentially a clever brain leveraging powerful tools - and why composability is the key to efficient development. You'll discover what MCP is, how it's already shaping popular tools like Cursor and Claude Desktop, and why developers everywhere are excited about it. I'll dive into practical insights, showing how agents like Manus, a highly regarded agent hailed as the next "DeepSeek" moment, achieved success simply by combining 29 MCP-compliant tools effectively. This demonstrates the power of composing existing capabilities rather than reinventing the wheel. We'll also explore how MCP empowers organizations. Using MCP SDKs and OpenAPI wrappers, even teams without extensive AI expertise can rapidly transform existing APIs into sophisticated, usable AI agents. But there's no silver bullet. I'll frankly discuss some organizational challenges, including the tendency to chase flashy "new" agents over contributing collaboratively to existing solutions. Finally, we'll look ahead to an exciting future, envisioning a world where entire product ecosystems are MCP-enabled. Imagine agents seamlessly orchestrating tasks across multiple products, unlocking entirely new possibilities in user interaction. Join me for an engaging session, learn from our experiences, and see how MCP can reshape your approach to building the next generation of composable AI agents. false https://pretalx.com/pyconde-pydata-2025/talk/7FLW7F/ https://pretalx.com/pyconde-pydata-2025/talk/7FLW7F/feedback/ Platinum3 Going Global: Taking code from research to operational open ecosystem for AI weather forecasting Talk (long) 2025-04-24T11:35:00+02:00 11:35 00:45 When I was hired as a Scientist for Machine Learning, experts said ML would never work in weather forecasting. Nowadays, I get to contribute to Anemoi, a full-featured ML weather forecasting framework used by international weather agencies to research, build, and scale AI weather forecasting models. The project started out as a curiosity by my colleagues and soon scaled as a result of its initial success. As machine learning stories go, this is a story of change, adaptation and making things work. In this talk, I'll share some practical lessons: how we evolved from a mono-package with four people working on it to multiple open-source packages with 40+ internal and external collaborators. Specifically, how we managed the explosion of over 300 config options without losing all of our sanity, building a separation of packages that works for both researchers and operations teams, as well as CI/CD and testing that constrains how many bugs we can introduce in a given day. You'll learn concrete patterns for growing Python packaging for ML systems, and balancing research flexibility with production stability. As a bonus, I'll sprinkle in anecdotes where LLMs like chatGPT and Copilot massively failed at facilitating this evolution. Join me for a deep dive into the real challenges of scaling ML systems - where the weather may be hard to predict, but our code doesn't have to be. pyconde-pydata-2025-61421-going-global-taking-code-from-research-to-operational-open-ecosystem-for-ai-weather-forecasting PyCon: MLOps & DevOps Jesper Dramsch en What does it take to go from "ML will never work in weather forecasting" to running AI models in production at weather agencies? This talk chronicles the journey of Anemoi, a framework that evolved from research code to an operational ML weather forecasting system - and the technical challenges we faced along the way. Starting as experimental code and notebooks by a small team of four, Anemoi grew into a robust ecosystem supporting 40+ developers across multiple international weather agencies. I'll share our experience of scaling both the team and codebase, including the interesting challenge of conducting weekly code tours for new team members while maintaining development velocity. The technical evolution of Anemoi mirrors many challenges in scaling ML systems. We'll explore how the codebase transformed from research artifacts and notebooks into a structured mono-package with proper separation of concerns. Then, how we split this into an ecosystem of specialized packages - only to later realize that some components were too tightly coupled and needed reunification. This journey offers valuable lessons about when to split packages and when to maintain unified codebases. Configuration management evolved alongside our architecture. I'll demonstrate how we leveraged Hydra to tame over 300 configuration options into a hierarchical system that enables component composition without sacrificing usability. This system now powers everything from dataset creation to model inference, with full traceability of configurations and artifacts throughout the ML lifecycle. A unique aspect of developing ML systems at ECMWF is integrating with decades of expertise in weather forecast validation. We'll look at how we connected modern ML tooling like MLFlow with traditional meteorological evaluation systems, creating a bridge between ML innovation and established meteorological practices. The talk will cover practical challenges that every growing ML system faces: - Making model components truly configurable and replaceable - Implementing model sharding for global weather predictions - Supporting flexible grids for regional weather services - Managing CI/CD across multiple packages - Streamlining release processes with modern tools - The eternal struggle with changelog management Throughout the presentation, I'll share real examples of what worked, what didn't, and why - including our experiments with AI coding assistants and where they fell short. You'll walk away with concrete patterns for scaling Python ML systems, strategies for managing growing complexity, and insights into balancing research flexibility with production requirements. Whether you're scaling an ML system, managing a growing Python codebase, or interested in how weather forecasting is being transformed by AI, this talk offers practical lessons from the frontier of operational ML systems. false https://pretalx.com/pyconde-pydata-2025/talk/WMBDJ8/ https://pretalx.com/pyconde-pydata-2025/talk/WMBDJ8/feedback/ Platinum3 Dataframely — A declarative, 🐻‍❄️-native data frame validation library Sponsored Talk 2025-04-24T14:20:00+02:00 14:20 00:30 Understanding the structure and content of data frames is crucial when working with tabular data — a core requirement for the robust pipelines we build at QuantCo. Libraries such as `pandera` or `patito` already exist to ease the process of defining data frame schemas and validating that data frames comply with these schemas. However, when building production-ready data pipelines, we encountered limitations of these libraries. Specifically, we were missing support for strict static type checking, validation of interdependent data frames, and graceful validation including introspection of failures. To remedy the shortcomings of these libraries, we started building `dataframely` at the beginning of last year. Dataframely is a declarative data frame validation library with first-class support for polars data frames. Over the last year, we have gained experience in using `dataframely` both for analytical and production code across several projects. The result was a drastic improvement of the legibility of our pipeline code and our confidence in its correctness. To enable the wider data engineering community to benefit from similar effects, we have recently open-sourced `dataframely` and are keen on introducing it in this talk. pyconde-pydata-2025-66511-dataframely-a-declarative-native-data-frame-validation-library PyData: Data Handling & Engineering Oliver BorchertDaniel Elsner en In this talk, we will talk about the motivation behind building `dataframely` in more detail and lead the audience through its key features. We will also touch upon our learnings in developing robust data pipelines that establish clear contracts for the design of data transformations. In our experience, this significantly improves communication among developers and comprehensibility of the entire pipeline. false https://pretalx.com/pyconde-pydata-2025/talk/3DCS8K/ https://pretalx.com/pyconde-pydata-2025/talk/3DCS8K/feedback/ Platinum3 Accuracy Is Not Enough: Building Trustworthy AI with Conformal Prediction Talk (long) 2025-04-24T15:00:00+02:00 15:00 00:45 Building a good scoring model is just the beginning. In the age of critical AI applications, understanding and quantifying uncertainty is as crucial as achieving high accuracy. This talk highlights conformal prediction as the definitive approach to both uncertainty quantification and probability calibration, two extremely important topics in Deep Learning and Machine Learning. We’ll explore its theoretical underpinnings, practical implementations using TorchCP, and transformative impact on safety-critical fields like healthcare, robotics, and NLP. Whether you're building predictive systems or deploying AI in high-stakes environments, this session will provide actionable insights to level up your modelling skills for robust decision-making. pyconde-pydata-2025-61818-accuracy-is-not-enough-building-trustworthy-ai-with-conformal-prediction PyData: Machine Learning & Deep Learning & Statistics Chris Aivazidis en When deploying machine learning models in the real world, especially in domains like healthcare, robotics, or natural language processing, the stakes are high. It’s not enough to train a model, evaluate its accuracy, and call it a day. Questions of how confident the model is, how reliable its predictions are, and how to act on these predictions are critical yet often overlooked. This talk takes you beyond conventional metrics and into the world of uncertainty quantification and probability calibration, with conformal prediction as the definitive tool for both. We’ll start of the presentation by exploring the fundamental need for uncertainty in AI systems—why it matters, how it’s quantified, and how it can be used to make informed decisions. From there, we’ll introduce conformal prediction, a mathematically rigorous yet practical framework that provides guarantees on prediction reliability while remaining model-agnostic. Core concepts such as probability calibration and uncertainty quantification will be highlighted as key parts in the modelling process, establishing their importance in the domain. The session will also feature real-world examples and use cases such as: - Healthcare: Predict irAE likelihood with quantifiable confidence, to inform life and death decisions - Robotics: Navigate dynamic environments safely using calibrated vision-language models. - Natural Language Processing: Improve outputs of large language models with uncertainty-aware predictions. Finally, we’ll showcase the TorchCP toolbox, a GPU-accelerated library for integrating conformal prediction into deep learning pipelines, an area of Data Science that has a lot of hype but often overlooks the importance of such tools. Through a live demonstration, you’ll see how to implement these methods step-by-step, empowering you to build trustworthy AI systems that go beyond accuracy. Attendees will leave with: - A solid understanding of uncertainty quantification, probability calibration and their importance. - Practical knowledge of conformal prediction and how to implement it. - A new perspective on AI reliability and decision-making in critical domains. Whether you're an ML researcher, data scientist, or practitioner deploying AI models in critical environments, this session will equip you with the right tools and philosophy to create AI systems that are not only accurate but also reliable and robust. false https://pretalx.com/pyconde-pydata-2025/talk/UDDTBS/ https://pretalx.com/pyconde-pydata-2025/talk/UDDTBS/feedback/ Platinum3 Cache me if you can: Boosted application performance with Redis and client-side caching Sponsored Talk 2025-04-24T16:15:00+02:00 16:15 00:30 Did you know Redis can notify your app about server-side data changes? This feature enables client-side tracking and caching in redis-py, helping to reduce network round-trips and optimize performance. In this talk, we explore how client-side caching works in redis-py and how you can use it to make your applications even faster. pyconde-pydata-2025-67452-cache-me-if-you-can-boosted-application-performance-with-redis-and-client-side-caching PyData: Data Handling & Engineering David Maier en Did you know Redis can notify your app about server-side data changes? This feature enables client-side tracking and caching in redis-py, helping to reduce network round-trips and optimize performance. In this talk, we explore how client-side caching works in redis-py and how you can use it to make your applications even faster. The following topics are covered: - Quick introduction to Redis - Redis as a cache - What is client-side caching? - What's new in redis-py false https://pretalx.com/pyconde-pydata-2025/talk/3FSWJU/ https://pretalx.com/pyconde-pydata-2025/talk/3FSWJU/feedback/ Platinum3 A11y Need Is Love (But Accessible Docs Help Too) Talk (long) 2025-04-24T16:55:00+02:00 16:55 00:45 Accessible documentation benefits everyone, from developers to end users. Using the [PyData Sphinx Theme](https://pydata-sphinx-theme.readthedocs.io/en/stable/) as a case study, this talk dives into common accessibility barriers in documentation websites like low contrast colors, missing focus states, etc. and practical ways to address them. Learn about accessibility improvements and take part in a live accessibility audit to see how small changes can make a big difference. pyconde-pydata-2025-60499-a11y-need-is-love-but-accessible-docs-help-too PyData: PyData & Scientific Libraries Stack Smera Goel en The Beatles told us that ‘all you need is love’ and while that is a lovely sentiment, love alone won’t fix low contrast colours, missing focus states or inaccessible navigation. These barriers impact countless users with disabilities, reducing the usefulness and reach of valuable documentation. So, while love is great, accessible docs are *essential*. In this talk, we will use the [PyData Sphinx Theme](https://pydata-sphinx-theme.readthedocs.io/en/stable/) as a case study to explore common accessibility problems in documentation websites and how to tackle them. We will discuss the accessibility changes we made to the theme, how those changes affected users, and what we learnt along the way. Additionally, we will also conduct a short accessibility audit on a website suggested by the audience. This demo will provide a practical understanding of how to improve accessibility. Whether you’re a documentation maintainer, a curious developer or simply someone who cares about accessibility, this beginner-friendly talk will help you learn more about accessibility in documentation and how to get started. Love might be a universal language, but your code appreciates accessible documentation. false https://pretalx.com/pyconde-pydata-2025/talk/WLZSEZ/ https://pretalx.com/pyconde-pydata-2025/talk/WLZSEZ/feedback/ Europium2 Blazing-Fast Python in Your Database: Unlocking Data Science at Scale with Exasol Sponsored Talk 2025-04-24T10:15:00+02:00 10:15 00:30 What if your Python models could run inside your database—at scale, with parallel execution, and zero data movement? Meet Exasol: a high-performance Analytics Engine with native Python support and a massively parallel processing (MPP) engine. In this session, you’ll learn how to run Python directly where your data lives using user-defined functions (UDFs) and customizable script language containers. Whether you're doing forecasting, categorization, or calling APIs in real time, Exasol enables fast, scalable Python execution—perfect for demanding data science workflows. We’ll share real-world use cases, including large-scale model inference across thousands of sensors. If you're tired of bottlenecks and batch jobs, this is your shortcut to blazing-fast, in-database Python. pyconde-pydata-2025-67913-blazing-fast-python-in-your-database-unlocking-data-science-at-scale-with-exasol PyData: Machine Learning & Deep Learning & Statistics Alexander Stigsen en What if your Python models could run inside your database—at scale, with parallel execution, and no data movement? Meet Exasol: the high-performance analytics database that speaks native Python, supercharged by a massively parallel processing (MPP) engine. In this talk, we’ll dive into how Exasol empowers Python developers and data scientists to run custom Python code—directly where the data lives—using user-defined functions (UDFs) and fully customizable script language containers. Whether you’re doing model training, forecasting, categorization, or even tapping into the power of large language models, Exasol brings Python to the party with native support and serious horsepower. You’ll learn how to: -Execute high-performance Python code inside your database using UDFs. -Bring any Python library into Exasol with containerized script languages. -Scale inference and forecasting across thousands of sensors or data points using Exasol’s MPP engine—no batch jobs, no bottlenecks. -Call APIs or run models in-database to enable real-time, insight-driven applications. We’ll showcase real-world examples, like how one company forecasts sensor traffic volume across entire regions to optimize planning—running thousands of model inferences simultaneously with high speed performance. If you’re tired of waiting for your models to run—or moving massive datasets just to do a quick prediction—this talk is for you. Python meets MPP, and the result is next-level analytics. false https://pretalx.com/pyconde-pydata-2025/talk/NQ3RHQ/ https://pretalx.com/pyconde-pydata-2025/talk/NQ3RHQ/feedback/ Europium2 Scalable Python and SQL Data Engineering without Migraines Sponsored Talk 2025-04-24T10:55:00+02:00 10:55 00:30 This session is for data and ML engineers with a basic understanding of data engineering and Python. It shows how to easily use Python code in Snowflake Notebooks to create data pipelines. By the end, you’ll know how to build and process data pipelines with Python. pyconde-pydata-2025-66940-scalable-python-and-sql-data-engineering-without-migraines PyData: Machine Learning & Deep Learning & Statistics Dirk Jung en Data loading processes are complex and require effort to organize, often different tools are used and seamless processing is not ensured. Learn how to create pipelines efficiently and easily with Python in Snowflake Notebooks. Create and monitor tasks to continuously load data. Use third-party data directly to extend the data model without copying it. Harness the power of Python to quickly calculate values and write efficient stored procedures. In this session you will see how to - Load Parquet data to Snowflake using schema inference - Setup access to Snowflake Marketplace data - Create a Python UDF to convert temperature - Create a data engineering pipeline with Python stored procedures to incrementally process data - Orchestrate the pipelines with tasks - Monitor the pipelines with Snowsight false https://pretalx.com/pyconde-pydata-2025/talk/GVUPQN/ https://pretalx.com/pyconde-pydata-2025/talk/GVUPQN/feedback/ Europium2 Bias Meets Bayes: A Bayesian Perspective on Improving Model Fairness Talk 2025-04-24T11:35:00+02:00 11:35 00:30 Bias in machine learning models remains a pressing issue, often disproportionately affecting the most vulnerable groups in society. This talk introduces a Bayesian perspective to effectively tackle these challenges, focusing on improving fairness by modeling and addressing bias directly. You will learn about the interplay between uncertainty, equity, and predictive accuracy, while gaining actionable insights to improve fairness in diverse applications. Using a practical example of a risk-scoring model trained on data with underrepresented minority groups, I will showcase how Bayesian methods compare to traditional techniques, demonstrating their unique potential to mitigate bias while maintaining performance. pyconde-pydata-2025-61120-bias-meets-bayes-a-bayesian-perspective-on-improving-model-fairness PyData: Machine Learning & Deep Learning & Statistics Vince Nelidov en Machine learning models often perpetuate biases that exacerbate societal inequities, particularly for vulnerable groups. As machine learning increasingly shapes critical decisions, addressing these biases is more important than ever. In this talk, I will explain how Bayesian methods offer a principled and effective approach to improving fairness by directly addressing bias and incorporating uncertainty into machine learning models.  The talk will cover: 1. Theoretical Foundations: I will start by exploring the connection between Bayesian statistics, fairness, and accuracy, with a focus on why uncertainty is a crucial factor in fairness interventions. 2. Practical Example: Using a risk-scoring model trained on a dataset with underrepresented minority groups, I will demonstrate how Bayesian methods compare to traditional fairness techniques. This example will illustrate their ability to not only mitigate bias but also adapt to complex, real-world data distributions while maintaining predictive accuracy. 3. Key Insights and Applications: Finally, I will provide actionable takeaways on incorporating Bayesian thinking into existing workflows, enabling more equitable and robust outcomes across diverse applications.  This talk is designed to be accessible to a broad audience. While minimal familiarity with machine learning concepts and fairness principles is recommended, no advanced knowledge of statistics is required. Attendees will leave with practical tools, code examples, and insights to address bias effectively in real-world scenarios, empowering them to promote fairness in their own projects and organizations. false https://pretalx.com/pyconde-pydata-2025/talk/ER3V7W/ https://pretalx.com/pyconde-pydata-2025/talk/ER3V7W/feedback/ Europium2 Oh my license! – Achieving order by automation in the license chaos of your dependencies Talk 2025-04-24T14:20:00+02:00 14:20 00:30 License issues can haunt you at night. You spend days, weeks, and months developing beautiful software. But then it happens. You realize that an essential dependency is GPL-3.0 licensed. All your code is now infected with this license. Now you are forced to either: 1. Rewrite all parts relying on the other library 2. Open-source your codebase under the GPL-3.0 license How could this have been avoided? Join the talk and find out! First, we’ll give you a brief introduction to different software licenses and their implications. Second, we’ll show you how to automate your license checking using open-source software. pyconde-pydata-2025-61230-oh-my-license-achieving-order-by-automation-in-the-license-chaos-of-your-dependencies PyCon: Programming & Software Engineering Paul Müller en Software licensing can feel like a daunting maze, but it doesn’t have to be. This talk will demystify the world of software licenses and equip you with the critical knowledge to navigate it with confidence. We’ll start by exploring key categories of licenses—like Strong Copyleft, Weak Copyleft, and Permissive—and break down the most common ones you’ll encounter (e.g., GPL, AGPL, BSD, and MIT). Through concrete examples, you’ll learn how these licenses affect your projects and how to handle them effectively. Next, we’ll dive into practical solutions for automating license compliance. You’ll be introduced to conda-deny (an open-source tool) and see how it can help ensure your projects remain compliant without adding manual overhead. Whether you’re building open-source software or proprietary tools, this talk will leave you with actionable strategies to future-proof your projects and avoid licensing pitfalls. false https://pretalx.com/pyconde-pydata-2025/talk/ME7XPJ/ https://pretalx.com/pyconde-pydata-2025/talk/ME7XPJ/feedback/ Europium2 Quiet on Set: Building an On-Air Sign with Open Source Technologies Talk (long) 2025-04-24T15:00:00+02:00 15:00 00:45 Learn how to build a custom On-Air sign using Apache Kafka®, Apache Flink®, and Apache Iceberg™! See how to capture events like Zoom meetings and camera usage with Python, process data with FlinkSQL, analyze trends using Iceberg, and bring it all together with a practical IoT project that easily scales out. pyconde-pydata-2025-59893-quiet-on-set-building-an-on-air-sign-with-open-source-technologies General: Infrastructure - Hardware & Cloud Danica Fine en While many of us have adapted to work from home life, one major problem remains: finding an easy way to keep folks in your home away from your workspace when you’re on an important call. Dust off your Raspberry Pi––let’s build a custom on-air sign with Apache Kafka®, Apache Flink®, and Apache Iceberg™! We’ll begin by writing Python scripts to capture key events––such as when a Zoom meeting is running and when a camera is being used––and produce it into Kafka. The live data are then consumed by a Raspberry Pi script to drive the operation of a custom designed on-air sign. From there, you’ll be introduced to the ins and outs of FlinkSQL for stream processing as we wrangle the data into a better format for downstream use. And, finally, we’ll see Iceberg in action and learn how to use query engines to analyze meeting and recording trends. By the end of the session, you’ll be well-acquainted with this powerful trio of open source technologies and know how you could use the same scaffolding and scale out a simple, at-home project to millions of users and simultaneous events. false https://pretalx.com/pyconde-pydata-2025/talk/P9GRZU/ https://pretalx.com/pyconde-pydata-2025/talk/P9GRZU/feedback/ Europium2 Building a Self-Hosted MLOps Platform with Kubernetes Talk 2025-04-24T16:15:00+02:00 16:15 00:30 Many managed MLOps platforms, while convenient, often fall short in providing flexibility, requiring complex integrations, and causing vendor lock-in. In this talk, we’ll share our experience transitioning from managed MLOps tools to a self-hosted solution built on Kubernetes. We’ll focus on how we leveraged open-source tools like Feast, MLflow, and Ray to build a more flexible, scalable, and customizable platform that is now in use at Rewe Digital. By migrating to this self-hosted architecture, we gained greater control over our ML pipelines, reduced our dependency on third-party services, and created a more adaptable infrastructure for our ML workloads. pyconde-pydata-2025-61325-building-a-self-hosted-mlops-platform-with-kubernetes PyCon: MLOps & DevOps Josef Nagelschmidt en Many managed MLOps platforms, while convenient, often fall short in providing flexibility, requiring complex integrations, and causing vendor lock-in. In this talk, we’ll share our experience transitioning from managed MLOps tools to a self-hosted solution built on Kubernetes. We’ll focus on how we leveraged open-source tools like Feast, MLflow, and Ray to build a more flexible, scalable, and customizable platform that is now in use at Rewe Digital. By migrating to this self-hosted architecture, we gained greater control over our ML pipelines, reduced our dependency on third-party services, and created a more adaptable infrastructure for our ML workloads. Talk Outline: 1. Introduction (5 minutes): - The challenges of using managed MLOps platforms: vendor lock-in, integration complexity, and lack of flexibility. - Why transitioning to a self-hosted solution on Kubernetes can be beneficial. 2. Proposed Solution (10 minutes): - Why Kubernetes for MLOps? - How open-source tools like Feast, MLflow, and Ray come together to form the core of a robust self-hosted MLOps stack. - Benefits of building a flexible, scalable platform that fits your needs. 3. Building the Platform (10 minutes): - Practical steps for setting up and configuring Feast, MLflow, and Ray on Kubernetes. - Integration strategies and how to manage pipelines, model tracking, and feature storage. 4. Lessons Learned and Q&A (5 minutes): - Challenges and takeaways during the migration process - Q&A false https://pretalx.com/pyconde-pydata-2025/talk/3CYZUH/ https://pretalx.com/pyconde-pydata-2025/talk/3CYZUH/feedback/ Europium2 From Algorithm to Action: Building a DIY Distributed Trading Platform with Open Source Talk (long) 2025-04-24T16:55:00+02:00 16:55 00:45 In this talk, we'll explore how you can implement your own distributed system for algorithmic trading leveraging the power of open source without being dependent on trading bot providers. We will discuss different challenges occurring in HFT inter alia processing massive amounts of data with low latency and reliable risk control and how to solve them. Furthermore we will touch on the topic of regulatory requirements in trading. These challenges will be addressed through a distributed system implemented in Python, utilizing Kafka for real-time data streaming and PostgreSQL for persistent storage. We will examine approaches to decouple the components to re-use and scale them across different markets. Cryptocurrency markets are used as a proving ground for the PoC due to easy availability for everyone. pyconde-pydata-2025-60204-from-algorithm-to-action-building-a-diy-distributed-trading-platform-with-open-source PyCon: Programming & Software Engineering Eugen Geist en ## Who is this talk for This talk is ideal for all software engineers interested in financial technology, quantitative developers looking to understand modern trading infrastructure, and technical architects exploring distributed systems in high-stakes environments. This talk will NOT discuss specific trading strategies or give any financial advice. ## Outline * Motivation * Fundamental trading concepts and market mechanics * Market data ingestion and processing * Order management and execution * Implementation of trading strategies * Data storage * Outlook ## Motivation The landscape of financial trading has undergone a dramatic transformation over the past decades. What was once the exclusive domain of institutional players on physical trading floors has evolved into a digitized, accessible marketplace where individual traders can participate from anywhere in the world. The emergence of commission-free trading apps and cryptocurrency exchanges has brought market participation to millions of new retail traders. This enables everyone to participate with their own trading system in global markets. In this talk, we'll explore how you can implement your own distributed system for exchange trading leveraging the power of open source without being dependent on trading bot providers. While we won't be able to cover every aspect in depth, we'll address the most essential elements. Cryptocurrency markets are used as a proving ground for the PoC due to easy availability for everyone. ## Fundamental Trading Concepts and Market Mechanics We'll begin by exploring essential trading concepts: * Order book dynamics * Orders, Trades and Positions * Different types of orders and their implications for system implementation * Regulatory requirements * Performance of strategies These lead to different considerations in system design and architecture: * De-coupling of exchange interfaces and trading strategies to use same strategy for different markets by using adapter pattern * Horizontal scaling to handle data load * Need of low latency components and their communication to properly react to market * Need of streaming data for real-time risk management * Need of persistent storage for regulatory data and post-trading-analysis * Need of order action recording and post-trading analysis for performance evaluation ## Market Data Ingestion and Processing The foundation of any trading system is its ability to efficiently process market data. This includes a Python component responsible for real-time normalization and standardization of multi-venue data: * Efficient market data representation and storage structures * Techniques for handling high-throughput data without compromising latency * Market data recording for post-trading analysis using Kafka ## Order Management and Execution Critical components for managing the trading lifecycle. This includes a Python component responsible for normalization and standardization of multi-venue order interfaces: * Order action handling (placing orders, modifying orders) and keeping track of orders * Global real-time position tracking and risk calculation using Kafka * State recovery and system restart procedures * Audit trail implementation and transaction logging using Postgres ## Implementation of Trading Strategies We'll explore the practical aspects of implementing trading strategies in Python using the previously discussed system components: * Usage of provided market data * Placing orders and keeping track of positions * Fast communication with market data and order components using Kafka with msgpack * Recording of strategy internals for post-trade analysis ## Data storage We will take a closer look to: * What kinds of data exist in a trading system (live vs. post-trade) * Approaches to storing the different data kinds ## Outlook At the end we will have a brief outlook what other challenges might occur e.g.: * Other market types (Finance/Equity/ETF and Energy) * Latency considerations * Taxes false https://pretalx.com/pyconde-pydata-2025/talk/BR3D83/ https://pretalx.com/pyconde-pydata-2025/talk/BR3D83/feedback/ Hassium Scaling Python: An End-to-End ML Pipeline for ISS Anomaly Detection with Kubeflow Talk 2025-04-24T10:15:00+02:00 10:15 00:30 Building and deploying scalable, reproducible machine learning pipelines can be challenging, especially when working with orchestration tools like Slurm or Kubernetes. In this talk, we demonstrate how to create an end-to-end ML pipeline for anomaly detection in International Space Station (ISS) telemetry data using only Python code. We show how Kubeflow Pipelines, MLFlow, and other open-source tools enable the seamless orchestration of critical steps: distributed preprocessing with Dask, hyperparameter optimization with Katib, distributed training with PyTorch Operator, experiment tracking and monitoring with MLFlow, and scalable model serving with KServe. All these steps are integrated into a holistic Kubeflow pipeline. By leveraging Kubeflow's Python SDK, we simplify the complexities of Kubernetes configurations while achieving scalable, maintainable, and reproducible pipelines. This session provides practical insights, real-world challenges, and best practices, demonstrating how Python-first workflows empower data scientists to focus on machine learning development rather than infrastructure. pyconde-pydata-2025-61163-scaling-python-an-end-to-end-ml-pipeline-for-iss-anomaly-detection-with-kubeflow PyCon: MLOps & DevOps Christian GeierHenrik Sebastian Steude en Among popular open-source MLOps tools, **Kubeflow** stands out as a Kubernetes-native platform designed to support the entire ML lifecycle, from data preprocessing to model training, deployment, and retraining. Its modular structure enables the integration of a wide range of tools, making it a highly versatile framework for building scalable and reproducible ML workflows. Despite this, most existing resources focus on individual components rather than demonstrating how these can be orchestrated into a seamless, end-to-end pipeline. In this talk, we present a practical case study that highlights the potential of Kubeflow in a real-world application. Specifically, we showcase how an automated ML pipeline for anomaly detection in International Space Station (ISS) telemetry data can be built and deployed using Kubeflow and other open-source MLOps tools. The dataset, originating from the Columbus module of the ISS, introduces unique challenges due to its complexity and high-dimensional nature, providing an excellent testbed for MLOps workflows. ### **What makes this approach unique?** Our workflow is built entirely in Python, leveraging Kubeflow’s Python SDK to orchestrate every stage of the pipeline. This eliminates the need for manual interaction with Kubernetes or container configurations, making the process accessible to ML engineers and data scientists without extensive DevOps expertise. ### **Key takeaways for attendees:** * **Tool integration:** Learn how to combine Dask for distributed preprocessing, Katib for hyperparameter optimization, PyTorch Operator for distributed training, MLFlow for experiment tracking and monitoring, and KServe for scalable model serving. These tools are orchestrated into a unified pipeline using Kubeflow Pipelines. * **Overcoming challenges:** Gain insights into the technical hurdles faced during the implementation of this pipeline and discover the strategies and best practices that made it possible. * **Real-world impact:** Understand how to apply MLOps principles to complex, real-world datasets and how these principles translate into scalable, maintainable, and reproducible workflows. To ensure reproducibility and accessibility, the entire pipeline, including configurations and code, is publicly available in our GitHub repository [here](https://github.com/hsteude/code-ml4cps-paper). Attendees will be able to replicate the workflow, adapt it to their own use cases, or extend it with additional features. ### **Who should attend?** This session is designed for data scientists, ML engineers, and Python enthusiasts who want to simplify the development of scalable ML pipelines. Whether you're new to Kubernetes or looking to streamline your MLOps workflows, this talk will provide actionable insights and tools to help you succeed. false https://pretalx.com/pyconde-pydata-2025/talk/TRUUVL/ https://pretalx.com/pyconde-pydata-2025/talk/TRUUVL/feedback/ Hassium Outgrowing your node? Zero stress scaling with cuPyNumeric. Talk 2025-04-24T10:55:00+02:00 10:55 00:30 Many data and simulation scientists use NumPy for its ease of use and good performance on CPU. This approach works well for single-node tasks, but scaling to handle larger datasets or more resource-intensive computations introduces significant challenges. Not to mention, using GPUs requires another level of complexity. We present the cuPyNumeric library, which gives developers the same familiar NumPy interface, but seamlessly distributes work across CPUs and GPUs. In this talk we showcase the productivity and performance of cuPyNumeric library on one of the user's examples covering some detail on its implementation. pyconde-pydata-2025-61182-outgrowing-your-node-zero-stress-scaling-with-cupynumeric PyCon: Programming & Software Engineering Bo Dong en Many data and simulation scientists use NumPy for its ease of use and good performance on CPU. This approach works well for single-node tasks, but scaling to handle larger datasets or more resource-intensive computations introduces significant challenges. Not to mention, using GPUs requires another level of complexity. We present the cuPyNumeric library. cuPyNumeric gives developers the same familiar NumPy interface, but seamlessly distributes work across CPUs and GPUs. A compelling example when scaling is necessary is when scientists at the Stanford Linear Accelerator Center(SLAC) need to process a large amount of data within a fixed time window, called beam time. The full dataset generated during experiments is too large to be processed on a single CPU. Additionally, the code often must be modified during the beam time to adapt to changing experimental needs. Being able to use NumPy syntax rather than lower level distributed computing libraries makes these changes quick and easy, allowing researchers to focus on conducting more experiments rather than debugging or optimizing code. cuPyNumeric is designed to be a drop-in replacement to NumPy. Built on top of task-based distributed runtime from Stanford University, it automatically parallelizes NumPy APIs across all available resources, taking care of data distribution, communication, asynchronous and accelerated execution of compute kernels on both GPUs or multi-core CPUs. In addition, cuPyNumeric can be integrated with other popular Python libraries like SciPy, matplotlib, Jax. With cuPyNumeric, SLAC scientists successfully ran their data processing code distributed across multiple nodes and GPUs, processing the full dataset with a 6x speed-up compared to the original single-node implementation. In this talk we showcase the productivity and performance of cuPyNumeric library covering some detail on its implementation. false https://pretalx.com/pyconde-pydata-2025/talk/HPGEKH/ https://pretalx.com/pyconde-pydata-2025/talk/HPGEKH/feedback/ Hassium Beyond Alembic and Django Migrations Talk 2025-04-24T11:35:00+02:00 11:35 00:30 ORMs like Django and SQLAlchemy have become indispensable in Python development, simplifying the interaction between applications and databases. Yet, their built-in schema migration tools often fall short in projects that require advanced database features or robust CI/CD integration. In this talk, we’ll explore how you can go beyond the limitations of your ORM’s migration tool. Using Atlas—a language-agnostic schema management tool—as a case study, we’ll demonstrate how Python developers can automate migration planning, leverage advanced database features, and seamlessly integrate database changes into modern CI/CD pipelines. pyconde-pydata-2025-61479-beyond-alembic-and-django-migrations PyCon: Django & Web Rotem Tamir en Talk Structure: "Beyond Your ORM's Migration Tool" 1. Introduction – Why ORMs Build Migration Tools - ORMs like SQLAlchemy and Django ORM simplify database interactions and include migration tools (e.g., Alembic, Django Migrations) for schema changes. - These tools are robust for ORM-defined schemas but lack advanced features and native CI/CD integrations. 2. Where Built-in Tools Fall Short - ORM migration tools focus on basic schema changes but don’t support advanced database objects like triggers, materialized views, or stored procedures. - Lack native integration with modern CI/CD tools, leaving teams to implement custom, often suboptimal solutions. 3. Presenting Atlas – Bridging the Gap - Atlas complements ORM tools by reading their schemas (e.g., Django models, SQLAlchemy models) and enabling advanced extensions. - Key features: - Support for triggers, materialized views, and other advanced objects. - Native CI/CD integration for automating and validating schema changes. 4. How Atlas Integrates with ORMs - Atlas reads ORM-defined schemas and enhances them with advanced features. - Combines ORM workflows with Atlas’s robust schema management capabilities, enabling automation and database-specific optimizations. 5. Demo – Atlas in Action - Example: A Django project adds a materialized view and a trigger using Atlas. - Steps: - Use Atlas to read the ORM schema and extend it with advanced features. - Automate migration validation and deployment through CI/CD pipelines. - Outcome: Simplified and automated schema management with modern tooling. 6. Conclusion and Q&A - Key Takeaways: - ORM migration tools like Alembic and Django Migrations are great for standard use cases but fall short for advanced workflows and CI/CD integration. - Atlas bridges this gap, enabling automation and advanced database features. - Call to Action: Try Atlas to enhance schema workflows. - Q&A: Open floor for questions. false https://pretalx.com/pyconde-pydata-2025/talk/KCV9RS/ https://pretalx.com/pyconde-pydata-2025/talk/KCV9RS/feedback/ Hassium Writing reliable software while depending on hazardous APIs Talk 2025-04-24T14:20:00+02:00 14:20 00:30 As we develop business critical software, we often need to rely on external APIs to get the job done. And all services are not born equal: although the ideal world would provide well operated APIs with over-met service levels, the real world is usually way worse than that. Timeouts, HTTP errors, cascading failures, unclear or changing contracts, approximate protocol implementations ... And even the oh-so-human bad faith while trying to pinpoint the root cause... Most of us have written hacks to handle commonly seen failures, from the quick and dirty implementation to well thought resilience patterns implementation, but this is usually hard to do correctly, and rarely a business priority to invest the correct amount of time and money on the topic. We'll present the options, both including direct dependencies (not framework dependant, although some families can emerge (async/sync ...)) and including a service/proxy based approach. pyconde-pydata-2025-61356-writing-reliable-software-while-depending-on-hazardous-apis PyCon: MLOps & DevOps Romain Dorgueil en The two most common causes for software failure are, in order, human errors then external services. Working extensively with external APIs, we often encounter tricky issues in maintaining the responsiveness of our end-user services (both in terms of speed, but also plain availability). Many teams are addressing those issues on a case-by-case basis, most often using a homemade patchwork of external libraries and failing cases, and we used to do the same. Over time, we have come to rethink our approach to this problem. We will present the usual suspects (and their consequences) we're usually facing: timeouts, HTTP errors, cascading failures, unclear or changing contracts, and the difficulty of forensic analysis after an incident occurs when the root cause stems from external data or calls. Then, we'll show various approaches we use or have seen be used by teams of different sizes. We'll finish by presenting an innovative approach delegating the issues to a forward proxy so that the development team can both avoid having to spend time on reinventing the resilience and reliability patterns, while providing them the tools to act quickly when things go wrong. false https://pretalx.com/pyconde-pydata-2025/talk/7PDARV/ https://pretalx.com/pyconde-pydata-2025/talk/7PDARV/feedback/ Hassium Decoding Topics: A Comparative Analysis of Python’s Leading Topic Modeling Libraries Using Climate C Talk (long) 2025-04-24T15:00:00+02:00 15:00 00:45 Topic modelling has come a long way, evolving from traditional statistical methods to leveraging advanced embeddings and neural networks. Python’s diverse library ecosystem includes tools like Latent Dirichlet Allocation (LDA) using gensim, Top2Vec, BERTopic, and Contextualized Topic Models (CTM). This talk evaluates these popular approaches using a dataset of UK climate change policies, considering use cases relevant to organisations like DEFRA (Department for Environment, Food & Rural Affairs). The analysis explores real-time integration, dynamic topic modelling over time, adding new documents, and retrieving similar ones. Attendees will learn the strengths, limitations, and practical applications of each library to make informed decisions for their projects. pyconde-pydata-2025-59307-decoding-topics-a-comparative-analysis-of-python-s-leading-topic-modeling-libraries-using-climate-c PyData: Natural Language Processing & Audio (incl. Generative AI NLP) Dr. Lisa Andreevna Chalaguine en Objectives: The session aims to: 1. Compare Python-based topic modelling libraries, highlighting their relevance to real-world scenarios like policy analysis. 2. Explore practical use cases, including real-time document integration, tracking topic evolution, and finding similar documents. 3. Evaluate the tools based on performance, interpretability, scalability, and flexibility, with a focus on climate change policy data presented by [1] focusing on adaptation and mitigation. 4. Provide actionable guidance on selecting the right library for different project needs and datasets. Outline: 1. Introduction to Topic Modeling: Overview of traditional and modern approaches, including their practical significance. 2. Algorithms & Libraries Overview: LDA (gensim) [2], CTM [3], Top2Vec [4], BERTopic [5] 3. Dataset and Use Cases: - Overview of the UK climate change policy dataset. - Use cases inspired by DEFRA and similar organisations, such as: - Real-time integration for continuously adding new documents. - Tracking topic development over time (dynamic topic modeling). - Retrieving similar documents for faster insights. (- Classification) 4. Evaluation Criteria: Analysis of libraries based on: - Ease of Use: How easy it is for no coding experts - Quality: Coherence and diversity of extracted topics. - Efficiency: Runtime performance and scalability. - Flexibility: Features like contextual embeddings and integration capabilities. - Interpretability: Ease of understanding topics and output. 5. Results: Detailed findings, including specific advantages and limitations of each library in supporting the outlined use cases. 6. Practical Recommendations: Guidance on choosing a library based on project goals, dataset characteristics, and organisational needs. 7. Conclusion and Future Directions: Summary of key insights and the evolving role of embedding-based methods in topic modelling. Outcomes: By attending this session, participants will: - Gain an in-depth understanding of Python’s top topic modeling libraries. - Learn how to apply these tools to real-world challenges in policy analysis and other fields. - Understand how to handle use cases like real-time document integration and topic evolution over time. - Develop the skills to evaluate and choose the best tool for specific datasets and objectives. Target Audience This talk is for: - Data scientists and NLP practitioners seeking to apply topic modelling to unstructured text data. - Policy analysts and researchers working with large textual datasets, such as government or environmental policies. - Professionals in organisations like DEFRA, where tracking changes, adding new documents, or finding similar records are critical tasks. - Python enthusiasts interested in cutting-edge NLP techniques for extracting meaningful insights. [1] R. Biesbroek, S. Badloe, and I. Athanasiadis. Machine learning for research on cli- mate change adaptation policy integration: an exploratory uk case study. Regional Environmental Change, 20, 07 2020. [2] https://pypi.org/project/gensim/ [3] https://github.com/MilaNLProc/contextualized-topic-models [4] https://github.com/ddangelov/Top2Vec [5] https://maartengr.github.io/BERTopic/index.html 5 https://github.com/MilaNLProc/contextualized-topic-models false https://pretalx.com/pyconde-pydata-2025/talk/BJKSGK/ https://pretalx.com/pyconde-pydata-2025/talk/BJKSGK/feedback/ Hassium Conquering the Queue: Lessons from processing one billion Celery tasks Talk 2025-04-24T16:15:00+02:00 16:15 00:30 At Userlike, Celery is the backbone of our application, orchestrating over a 100 million tasks per month. In this talk, I’ll share real-world insights into scaling Celery, optimizing performance, avoiding common pitfalls, handling failures, and building a resilient architecture. pyconde-pydata-2025-61288-conquering-the-queue-lessons-from-processing-one-billion-celery-tasks PyCon: Django & Web Daniel Hepper en At Userlike, Celery plays a critical role as the backbone of our Django-based SaaS application, orchestrating over 100 million tasks per month with speed, reliability, and precision. In this talk, I’ll share the lessons we’ve learned while scaling Celery to handle massive workloads and support the needs of a growing user base. From optimizing performance and avoiding common pitfalls to handling failures gracefully and ensuring a resilient architecture, this session will provide actionable insights for developers and architects working with distributed task queues. Whether you’re just starting with Celery or looking to scale an established system, you’ll walk away with practical tips, battle-tested strategies, and a deeper understanding of how to harness Celery’s full potential in real-world scenarios. Outline: • Introduction: Why Userlike needs a task queue, and why you need one too • Fundamental concepts: latency, throughput, failure modes • Optimizing Performance: Strategies for faster and more efficient task execution • Avoiding Pitfalls: Common mistakes and how to mitigate them • Handling Failures: Building fault-tolerant workflows and monitoring systems • Resilient Architecture: Designing for reliability and scalability • Key Takeaways: Practical tips for implementing and scaling Celery in your own projects This talk is designed to be technical, engaging, and packed with real-world experiences to help you conquer the queue in your own applications. false https://pretalx.com/pyconde-pydata-2025/talk/J8FLDN/ https://pretalx.com/pyconde-pydata-2025/talk/J8FLDN/feedback/ Hassium From LIKE to Love: Adding Proper Search to Your Django Apps Talk (long) 2025-04-24T16:55:00+02:00 16:55 00:45 Is your Django application still relying on SQL LIKE queries for search? In this talk, we'll explore why basic text matching falls short of modern user expectations and how to implement proper search functionality without complexity. We'll introduce django-semantic-search, a practical package that bridges the gap between Django's ORM and powerful semantic search capabilities. Through practical code examples and real-world use cases, you'll learn how to enhance your application's search experience from basic keyword matching to understanding user intent. Whether you're building a content platform, e-commerce site, or internal tool, you'll walk away with concrete steps to implement production-ready search that your users will actually enjoy using. pyconde-pydata-2025-61776-from-like-to-love-adding-proper-search-to-your-django-apps PyCon: Django & Web Kacper Łukawski en Introduction (5 minutes) 1. The state of search in Django applications today 2. Common patterns and their limitations 3. Real costs of poor search functionality 4. Why search is often an afterthought in Django apps The Search Landscape (10 minutes) 1. Review of Django's built-in search capabilities 2. Performance implications of basic text matching 3. Field lookups and their limitations 4. PostgreSQL-specific features 5. Popular search solutions in the Django ecosystem 6. Trade-offs between complexity and functionality Why Search Matters (10 minutes) 1. User expectations in 2025 2. Common search patterns and user behaviors 3. Impact on user engagement and business metrics 4. Natural language queries vs keyword matching 5. Handling imperfect input 6. Context and intent understanding 7. Real-world examples of search improvements Modern Search Approaches (5 minutes) 1. Key concepts of vector search 2. From keywords to meaning 3. Why embeddings work better than keywords 4. Understanding user intent 5. Relevance beyond exact matches Practical Implementation & Best Practices (15 minutes) 1. Introducing django-semantic-search 2. Core concepts and architecture 3. Integration with existing Django models 4. Real-world implementation strategies 5. Handling different content types 6. Performance optimization techniques 7. Common pitfalls and solutions 8. Resource management 9. Query optimization 10. Monitoring and maintaining search quality false https://pretalx.com/pyconde-pydata-2025/talk/UCG9AS/ https://pretalx.com/pyconde-pydata-2025/talk/UCG9AS/feedback/ Palladium Multi-tenant Conversational Analytics Talk 2025-04-24T10:15:00+02:00 10:15 00:30 Ever wondered how to use GenAI to enable self-service analytics through prompting? In this talk, I will share my experience of building a multi-tenant conversational analytics set-up that is built into a Software-as-a-Service (SaaS) platform. This talk is intended for AI engineers, data scientists, software engineers and anyone interested in using GenAI to power conversational analytics using open-source tools. I will discuss the challenges faced in designing and implementing, as well as the lessons learned along the way. We'll answer questions such as, why offer analytics through prompting? Why multi-tenancy and makes it so difficult? How to build it into an existing product? What makes open-source the preferred choice over proprietary solutions? What could the implications be for the analytics field? pyconde-pydata-2025-61793-multi-tenant-conversational-analytics PyData: Natural Language Processing & Audio (incl. Generative AI NLP) Rodel van Rooijen en This talk will start by answering the question: What is conversational analytics and how does it work? After which we'll dive into why this was built and how the implementation was done. * How analytics in SaaS can be fundamentally improved by conversational analytics (5 mins). * How the Text-to-SQL fundament was shaped using RAG with Embeddings in PGVector (5 mins). * Dealing with multi-tenancy in PostgreSQL and BigQuery to ensure data segregation & security (5 mins). * How to handle tenant specific pre-training and training examples (5 mins). * Building this into an existing application and supporting integrations (5 minutes). * Conclusion and thoughts on the implications for the field of analytics (5 mins). In the end you should have a good idea on why conversational analytics can be a game changer, what the pitfalls are and how to build it with open source technologies. false https://pretalx.com/pyconde-pydata-2025/talk/KCSSJ7/ https://pretalx.com/pyconde-pydata-2025/talk/KCSSJ7/feedback/ Palladium Navigating the Security Maze: An Interactive Adventure Talk 2025-04-24T10:55:00+02:00 10:55 00:30 How to integrate security into a software development project? Without jeopardizing timeline or budget? You decide! This interactive session covers crucial decisions for software security, and the audience decides how the story ends... pyconde-pydata-2025-61125-navigating-the-security-maze-an-interactive-adventure PyCon: Security Clemens Hübner en Although DevSecOps has been a trend topic for years, it is still far from being a solved problem. This interactive session brings the challenges of security in the development process to life: Participants are confronted with several scenarios from everyday project work and their decisions help shape the further course of the presentation. They have to reconcile security requirements with budget, development speed and user-friendliness and bring the project safely from the idea to live operation. The session covers the entire development process, but each run is different as the audience decides the course of the story via online-voting: How to proceed with the development project and think about security at the same time? false https://pretalx.com/pyconde-pydata-2025/talk/3WLDMQ/ https://pretalx.com/pyconde-pydata-2025/talk/3WLDMQ/feedback/ Palladium Securing Generative AI: Essential Threat Modeling Techniques Talk 2025-04-24T11:35:00+02:00 11:35 00:30 Generative AI development introduces unique security challenges that traditional methods often overlook. This talk explores practical threat modeling techniques tailored for AI practitioners, focusing on real-world scenarios encountered in daily development. Through relatable examples and demonstrations, attendees will learn to identify and mitigate common vulnerabilities in AI systems. The session covers user-friendly security tools and best practices specifically designed for AI development. By the end, participants will have practical strategies to enhance the security of their AI applications, regardless of their prior security expertise. pyconde-pydata-2025-61891-securing-generative-ai-essential-threat-modeling-techniques PyData: Generative AI Elizaveta Zinovyeva en 1. Introduction * Motivation * What can go wrong 2. Generative AI vs Traditional Applications * Key differences in security considerations * Unique challenges posed by generative AI 3. Threat Modeling Basics and AI-Specific Threats * Threat modeling frameworks * Focus on prompt injection and data poisoning * Example: Simple prompt injection attempt 4. Practical Threat Modeling Process * Simplified system decomposition example * Threat identification walkthrough 5. Example: Input Validation 6. Tools Showcase and Mitigation Strategies 7. Conclusion and Resources * Recap key takeaways * List of recommended tools and further reading true https://pretalx.com/pyconde-pydata-2025/talk/UGTB7A/ https://pretalx.com/pyconde-pydata-2025/talk/UGTB7A/feedback/ Palladium Machine Reasoning and System 2 Thinking Talk 2025-04-24T14:20:00+02:00 14:20 00:30 Raw large language models struggle with complex reasoning. New techniques have emerged that allow these models to spend more time thinking before giving an answer. Direct token sampling can be seen as system-1 thinking and explicit step-by-step reasoning as system-2. How can this reasoning ability be improved and what is the future? pyconde-pydata-2025-61377-machine-reasoning-and-system-2-thinking PyData: Generative AI Andy Kitchen en Basic large language models struggle with complex reasoning. New techniques, broadly referred to as "test time compute" have emerged that allow these models to spend more time processing before giving an answer. Direct token sampling can be seen as analogous to system-1 thinking and explicit step-by-step reasoning as system-2. Many top AI researchers and companes are now working on building system-2 into AI systems to improve general reasoning. We will review the newest open research on test time computation including promising techniques that have appeared in top entries for François Chollet's ARC-AGI challenge. While OpenAI has shamefully kept the research behind their o1, o3 and o-N models secret, other researchers have worked in public, demonstrating how to use test time compute to greatly boost model performance with the right fine-tuning and test time procedures. This talk will explore the latest developments in the rapidly developing area of system-2 AI reasoning, the engine behind the only significant gains in LLM performance recently. Giving LLMs system-2 like capabilities improves problem solving, code generation quality and reduces hallucinations, get up to speed on research behind these techniques. false https://pretalx.com/pyconde-pydata-2025/talk/AYN837/ https://pretalx.com/pyconde-pydata-2025/talk/AYN837/feedback/ Palladium Securing RAG Pipelines with Fine Grained Authorization Talk (long) 2025-04-24T15:00:00+02:00 15:00 00:45 Using LLMs and AI in your Enterprise? Make sure you build Fine Grained Authorization to ensure your LLMs access only the data they are authorized to. This talk will show how you can build Relationship Based Access Control (ReBAC) for fine-grained authorization for your RAG pipelines. The talk also includes a demo using Pinecone, Langchain, OpenAI, and SpiceDB. pyconde-pydata-2025-61181-securing-rag-pipelines-with-fine-grained-authorization PyData: Generative AI Sohan Maheshwar en Building enterprise-ready AI requires ensuring users can only augment prompts with data they're authorized to access. Relationship-based access control (ReBAC) is particularly well-suited for fine-grained authorization in Retrieval-Augmented Generation (RAG) because it makes decisions based on relationships between objects, offering more precise control compared to traditional models like RBAC and ABAC. This talk covers how ReBAC systems can safeguard sensitive data in RAG pipelines. We'll start with why Authorization is critical for RAG pipelines, and how Google Zanzibar achieves this with ReBAC. We'll then illustrate how pre-filtering vector database queries with a list of authorized object IDs can improve efficiency & security. The talk will also include a demo implementing fine-grained authorization for RAG using Pinecone, Langchain, OpenAI, and SpiceDB. false https://pretalx.com/pyconde-pydata-2025/talk/GRWYQB/ https://pretalx.com/pyconde-pydata-2025/talk/GRWYQB/feedback/ Palladium Streaming at 30,000 Feet: A Real-Time Journey from APIs to Stream Processing Talk 2025-04-24T16:15:00+02:00 16:15 00:30 Traditional API architectures face significant challenges in environments where repetitive and frequent requests are required to retrieve data updates. These request-response mechanisms introduce latency, as clients must continually query the server to check for changes, often receiving redundant or outdated information. This approach leads to increased network overhead, inefficient use of server resources and diminished scalability as the number of clients or requests grows. Additionally, frequent requests expand the attack surface, requiring security measures to mitigate risks such as (un-)authorised access, rate limiting and query sanitisation. Managing all of these inherent problem results in increasingly complex systems to maintain and improve while putting considerable implementation effort onto the customer. Join to find out how transitioning to a streaming architecture can address these issues by providing proactive, event-based data delivery, reducing latency, minimising redundant processing, enhancing scalability and simplifying security management. pyconde-pydata-2025-59280-streaming-at-30-000-feet-a-real-time-journey-from-apis-to-stream-processing PyCon: Programming & Software Engineering Felix Leon Buck en In this talk we will go over which benefits, drawbacks and lessons Airbus has encountered/learned in the switch from an API to a Python based Stream Architecture for continuous flight traffic prediction. The Goal is to highlight Stream based architectures as a architectural alternative and allow the attendees to decide if it could be a alternative to their current API based Setup worthwhile to look into. The talk addresses the inefficiencies and limitations of traditional API-based architectures for real-time data delivery. Specifically, it explores challenges such as high latency, network overhead, customer effort, and scalability issues when APIs rely on polling mechanisms. These issues became apparent at Airbus during a project as customer needs evolved, highlighting the shortcomings of APIs in handling real-time updates effectively. This story of the project will aid as an example on how to identify a limiting architectural decision and what pain points can potentially be avoided by taking a new route guiding us along the talk. For developers and architects building modern, data-driven applications, choosing the right architecture is critical. Many face similar challenges when scaling APIs for real-time use cases, such as IoT, financial data, or notifications. This problem is relevant because adopting an unsuitable architecture can lead to poor performance, higher costs, and frustrated users. The proposed solution is to transition from an API-based architecture to a streaming architecture for real-time data delivery. This involves leveraging stream processing systems that push updates proactively, handle high-throughput data efficiently, and offer features like backpressure, partitioning, and stateful processing. Recent developments in Python based tools such as **Bytewax**, **Faust** and **Quix** are highlighted for their scalability and fault-tolerance capabilities. ### Key Takeaways 1. **Challenges of APIs**: Polling APIs is inefficient for real-time updates, leading to delays, resource wastage, and customer dissatisfaction. 2. **Advantages of Streaming**: Streaming architectures offer real-time data delivery, lower latency, reduced customer effort, better scalability, and improved fault tolerance. 3. **Key Streaming Concepts**: Understanding backpressure, partitioning, and stateful processing is essential for understanding streaming specific limitations and solutions. 4. **Architectural Considerations**: Streaming is ideal for use cases where data changes frequently and needs to be delivered in real time, while APIs may still be suitable for low-frequency, static, or manual queries. 5. **Strategic Transition**: Adopting a streaming approach requires a paradigm shift in thinking about how data is delivered and processed, with significant changes to the system architecture which needs to be cautiously managed. true https://pretalx.com/pyconde-pydata-2025/talk/CTUEJX/ https://pretalx.com/pyconde-pydata-2025/talk/CTUEJX/feedback/ Palladium Transformers for Game Log Data Talk (long) 2025-04-24T16:55:00+02:00 16:55 00:45 The Transformer architecture, originally designed for machine translation, has revolutionized deep learning with applications in natural language processing, computer vision, and time series forecasting. Recently, its capabilities have extended to sequence-to-sequence tasks involving log data, such as telemetric event data from computer games. This talk demonstrates how to apply a Transformer-based model to game log data, showcasing its potential for sequence prediction and representation learning. Attendees will gain insights into implementing a simple Transformer in Python, optimizing it through hyperparameter tuning, architectural adjustments, and defining an appropriate vocabulary for game logs. Real-world applications, including clustering and user level predictions, will be explored using a dataset of over 175 million events from an MMORPG. The talk will conclude with a discussion of the model's performance, computational requirements, and future opportunities for this approach. pyconde-pydata-2025-61739-transformers-for-game-log-data PyData: Machine Learning & Deep Learning & Statistics Fabian Hadiji en The paper[1] introducing the Transformer architecture has been cited almost 150k times. By now, this deep learning architecture has been used for a large number of use cases. Obviously, language generation and large language models are among the most prominent use cases. However, the architecture has also been successfully employed to solve problems in computer vision and to forecast time series data to name only a few other examples. At its core, the Transformer architecture is a deep neural network designed for sequence-to-sequence prediction tasks. E.g., mapping a sequence of words in one language to a sequence of words in another language as it is done in machine translation tasks. This architecture has recently gained attention for another application well-suited to sequence-to-sequence mapping: the analysis of telemetric log data from games[2]. While log data from games is one specific area that has been explored lately, this approach generally works for log data in other domains arising from websites or mobile apps. In this talk, I will walk the audience through a simple Transformer architecture in Python that can be used to train a model on game log data. I will discuss the challenges of constructing a vocabulary and tokenizer based on log data. Unlike language data, game logs often contain structured events with properties, making vocabulary design non-trivial. I will highlight design choices in the model construction to balance the predictive power of the model and computational efficiency. This includes hyper-parameter selection for the model (e.g., embedding size, number of layers, etc.) and the training procedure (e.g., batch size, learning rate, etc.). I will also explain how to adapt the Transformer architecture to handle long sequences of log data efficiently, including architectural changes to the basic network. I will demonstrate how representations derived from the model can be applied to various use cases, such as clustering and prediction tasks arising in game data science. Typical prediction tasks in game data science are survival time prediction for regression or purchase prediction for classification. Insights from clustering or player level predictions can help to improve retention or optimize monetization models. To evaluate the effectiveness of this approach, I trained multiple models on a publicly available 100GB game log dataset containing over 175 million events from NCSOFT’s MMORPG Blade and Soul. In addition to presenting qualitative results, I will compare the computational resources and hardware requirements of this method to those of a simple baseline algorithm. By the end of the talk, attendees will gain actionable insights into building and training Transformers for log data, equipping them to tackle similar challenges in their own domains. Tentative agenda of the talk: 5 - Intro 5 - Review of the Transformer architecture and its usage in GPT 10 - Adjusting the architecture to game log data 10 - Training of different models 10 - Obtaining player representations from the models for clustering and prediction tasks 5 - Outlook & Conclusion [1] “Attention is all you need”, Vaswani et al., 2017 [2] “player2vec: A Language Modeling Approach to Understand Player Behavior in Games”, Wang et al., 2024 [3] “Game Data Mining Competition on Churn Prediction and Survival Analysis using Commercial Game Log Data”, Lee et al., 2018 false https://pretalx.com/pyconde-pydata-2025/talk/9NFHAS/ https://pretalx.com/pyconde-pydata-2025/talk/9NFHAS/feedback/ Ferrum BayBE: A Bayesian Back End for Experimental Planning in the Low-To-No-Data Regime Sponsored Talk (Keystone) 2025-04-24T10:15:00+02:00 10:15 01:30 From coffee machine settings to chemical reactions to website AB testing - iterative make-test-learn cycles are ubiquitous. The [Bayesian Back End](https://emdgroup.github.io/baybe/stable/) (BayBE) is an open-source experimental planner enabling users to smartly navigate such black-box optimization problems in iterative settings. This tutorial will i) introduce the core concepts enabled by combining Bayesian optimization and machine learning; ii) explain our software design choices, robust tests and open-source libraries this is built on; and iii) provide a short practical hands-on session. pyconde-pydata-2025-65811-baybe-a-bayesian-back-end-for-experimental-planning-in-the-low-to-no-data-regime PyData: PyData & Scientific Libraries Stack Martin FitznerAlexander HoppAdrian Šošić en In the evolving landscape of data science, advanced computational tools are crucial for driving innovation and efficiency. This tutorial introduces the [Bayesian Back End](https://emdgroup.github.io/baybe/stable/) (BayBE), an AI-assisted open-source experimental planner developed by [Merck KGaA](https://www.merckgroup.com/en), which utilizes Bayesian Optimization and machine learning to smartly streamline experimental workflows in the low-to-no-date regime. From chemical reactions to biological assays to coffee machine settings - with BayBE users can find optimal configurations in an iterative manner, which is anyway the main working mode of many experimentalists. We will start the first part with a brief introduction to Bayesian Optimization, highlighting its principles and advantages in experimental design. Following this, we will showcase BayBE's unique features, including elegant categorical encodings and advanced capabilities like active learning, transfer learning or Pareto optimization. In the second part, we explain some of our code and test design choices that went into the open-source Python package [`baybe`](https://github.com/emdgroup/baybe). This will include learnings about our built-in (de-)serialization engine, CI/CD, advanced hypothesis tests, autodocumentation and open-source tools BayBE is built on. The final part will comprise of a hands-on tutorial. We will look at representative problems and guide potential users from formalization of the problem to performing the iterative loop to analyzing the results including an assessment of parameter relevance. The tutorials can be accessed [here](https://github.com/emdgroup/baybe-resources). false https://pretalx.com/pyconde-pydata-2025/talk/TMBTYH/ https://pretalx.com/pyconde-pydata-2025/talk/TMBTYH/feedback/ Ferrum Unlocking the Predictive Power of Relational Data with Automated Feature Engineering Tutorial 2025-04-24T14:20:00+02:00 14:20 01:30 Relational data can be a goldmine for classical Machine Learning applications — yet extracting useful features from multiple tables, time windows, and primary-foreign key relationships is notoriously difficult. In this code tutorial, we’ll use the H&M Fashion dataset to demonstrate how getML FastProp automates feature engineering for both classification (churn prediction) and regression (sales prediction) with minimal manual effort, outperforming both Relational Deep Learning and a skilled human data scientist according to the RelBench leaderboard. This code tutorial is perfect for data scientists looking to leverage their relational and time-series data data effectively for any kind of predictive analytics applications. pyconde-pydata-2025-61900-unlocking-the-predictive-power-of-relational-data-with-automated-feature-engineering PyData: Machine Learning & Deep Learning & Statistics Alexander Uhlig en This tutorial tackles a common pain point in data science – extracting useful features from relational data spread across multiple interconnected tables. Manually crafting these features is often tedious, error-prone, and heavily reliant on domain expertise. Why is this important? Relational data powers industries from e-commerce and healthcare to finance. Yet, building predictive models on such datasets often involves laborious feature engineering. getML FastProp – the fastest open-source algorithm for automated feature engineering – streamlines this process, helping data scientists move faster and build better models. In this hands-on tutorial, we’ll work through two tasks from Stanford’s Relational Learning Benchmark (RelBench) using the H&M Fashion dataset: 1) Predict customer churn with a classification model, 2) Forecast item sales using regression model. We’ll walk through the code and concepts needed to solve these tasks with getML FastProp, achieving state-of-the-art performance and outperforming both Relational Deep Learning models and an experienced human data scientist. By the end of this tutorial, you'll learn how to: - Understand relational learning – Grasp the core challenges and concepts of working with multi-table datasets. - Reproduce results – Run the provided notebooks and code to reproduce the results at your own pace. - Automate feature engineering – Use getML’s FastProp to extract features directly from relational data. - Build and optimize getML pipelines – Develop pipelines for both classification and regression tasks. - Integrate into MLOps workflows – Leverage getML alongside LightGBM and Optuna. This tutorial provides a practical, reproducible framework for working with relational and time-series data, applicable across industries and domains. false https://pretalx.com/pyconde-pydata-2025/talk/C3RVM3/ https://pretalx.com/pyconde-pydata-2025/talk/C3RVM3/feedback/ Ferrum pytest - simple, rapid and fun testing with Python Tutorial 2025-04-24T16:15:00+02:00 16:15 01:30 The pytest tool offers a rapid and simple way to write tests for your Python code. This training gives an introduction with exercises to some distinguishing features, such as its assertions, marks and fixtures. Despite its simplicity, pytest is incredibly flexible and configurable. We'll look at various configuration options as well as the plugin ecosystem around pytest. pyconde-pydata-2025-59479-pytest-simple-rapid-and-fun-testing-with-python PyCon: Testing Florian Bruhin en # Preparation and Repository See [The-Compiler/pytest-basics](https://github.com/The-Compiler/pytest-basics) on GitHub for exercise code and preparation steps. Please make sure you have at least a virtualenv with `pytest` (or the full `requirements.txt` in the repo) set up and the code cloned before the training starts, so that we don't lose any time with the boring setup parts. See the README for detailed setup instructions. # Schedule - (25 minutes) **pytest feature walkthrough:** * Automatic test discovery * Assertions without boilerplate via the assert statement * Configuration and commandline options * Marking and skipping tests * Data-driven tests via parametrization * Exercises - (60 minutes) **pytest fixture mechanism:** * Setup and teardown via dependency injection * Declaring and using function/module/session scoped fixtures * Using fixtures from fixture functions * Parametrizing fixtures * Looking at useful built-in fixtures (managing temporary files, patching, output capturing) * Exercises - (5 minutes) **Where to go next:** * Useful CLI arguments to deal with failing tests * Overview of the plugin ecosystem around pytest false https://pretalx.com/pyconde-pydata-2025/talk/PDBAXQ/ https://pretalx.com/pyconde-pydata-2025/talk/PDBAXQ/feedback/ Dynamicum Career Path Experience Stories Tutorial 2025-04-24T10:15:00+02:00 10:15 01:30 As part of the PyConDE & PyData 2025 Conference, we would like to present an initiative aimed primarily at students and those just starting their careers in computer science. Our goal is to showcase the diverse career paths possible and break some myths about typical job skills and responsibilities relevant, so as to inspire and encourage their journey. pyconde-pydata-2025-67741-career-path-experience-stories General: Education, Career & Life Kristina Khvatova en Join us for an interactive session where professionals share their diverse tech career journeys. This workshop aims to broaden students' perspectives on the many paths available in tech and beyond. Speakers will share honest insights about their career decisions, skills that proved most valuable, and advice they wish they'd received as students. Why should you be there? - Hear honest stories about career twists, turns, and triumphs. - Discover roles you might not have even considered yet. - See how versatile your current skillset really is. - Ask YOUR questions to people who've been where you are. If you're curious about the diverse opportunities waiting for you and want to hear firsthand accounts of building unique careers in tech, this workshop is for you. Come prepared with questions and leave with a clearer vision of the possibilities ahead. false https://pretalx.com/pyconde-pydata-2025/talk/XB8VG7/ https://pretalx.com/pyconde-pydata-2025/talk/XB8VG7/feedback/ Dynamicum AI Agents of Change: Creating, Reflecting, and Monetizing Tutorial 2025-04-24T14:20:00+02:00 14:20 01:30 Create, reflect, and earn—with purpose. In this workshop, you’ll not only build your own AI agent but also confront the ethical questions it raises, from its impact on jobs to its potential for social good. Together, we’ll explore how to harness AI for empowerment while uncovering pathways to turn your skills into meaningful value. This workshop is designed to equip Python enthusiasts with the tools to create their own AI agent while fostering a deeper understanding of the societal implications of this technology. Through hands-on learning, collaborative discussions, and practical monetization strategies, you’ll leave with more than just code—you’ll gain a vision of how AI can be wielded responsibly and profitably. pyconde-pydata-2025-61867-ai-agents-of-change-creating-reflecting-and-monetizing PyData: Generative AI Paloma OliveiraTereza Iofciu en The session unfolds in three engaging parts: 1. Build Your AI Agent Start with the fundamentals of AI by designing and implementing a functional agent. Using Python, we’ll demystify the process and equip you with practical skills for creating an AI that responds to user needs and scenarios. 2. Reflect on Ethics and the Future of Work Once your agent comes to life, we’ll pause to examine the bigger picture: • How does the AI agent you have created may reshape the job market? • Can it democratize and decentralize opportunities, or does it risk amplifying inequalities? • What collective vision do we want for the future of work? This thought-provoking discussion will challenge you to think critically about the role of technology in fostering empowerment or exacerbating social challenges. 3. Earn by Sharing Value Finally, we’ll explore how your AI agent can create real-world value. You’ll learn how to leverage marketplaces like OpenServ to turn your innovation into income. Whether you aim to solve practical problems, inspire creativity, or contribute to ethical AI development, this segment will connect your skills with opportunities for meaningful impact. By the end of the workshop, you’ll have built an AI agent, grappled with its ethical dimensions, and uncovered how to use your coding prowess to create and share value—all while shaping a more inclusive, responsible AI ecosystem. false https://pretalx.com/pyconde-pydata-2025/talk/PKZD8L/ https://pretalx.com/pyconde-pydata-2025/talk/PKZD8L/feedback/ Dynamicum The future of AI training is federated Tutorial 2025-04-24T16:15:00+02:00 16:15 01:30 Since it’s introduction in 2016, Federated Learning (FL) has become a key paradigm to AI models in scenarios when training data cannot leave its source. This applies in many industrial settings where centralizing data is challenging due to a combination of reasons, including but not limited to privacy, legal, and logistics. The main focus of this tutorial is to introduce an alternative approach to training AI models that is straightforward and accessible. We’ll walk you through the basics of an FL system, how to iterate on your workflow and code in a research setting, and finally deploy your code to a production environment. You will learn all of these approaches using a real-world application based on open-sourced datasets, and the open-source federated AI framework, [Flower](https://github.com/adap/flower), which is written in Python and designed for Python users. Throughout the tutorial, you’ll have access to hands-on open-sourced code examples to follow along. pyconde-pydata-2025-61376-the-future-of-ai-training-is-federated PyData: Machine Learning & Deep Learning & Statistics Chong Shen Ng en Federated Learning has quickly become the preferred form of training of AI models when the training data cannot leave their point of origin due to privacy regulations (e.g. GDPR), legal constraints (e.g. in different jurisdictions), and logistical challenges (e.g. large volumes of data, sparse connectivity), among other reasons. Furthermore, contracts and regulations establish boundaries for data sharing, particularly in industries like healthcare and finance, where misuse prevention is crucial. One could also argue that we are [running out of publicly and ethically sourced datasets](https://www.theverge.com/2024/12/13/24320811/what-ilya-sutskever-sees-openai-model-data-training), for instance to [scale large foundational models](https://arxiv.org/abs/2410.08892), and federated learning offers one way to train models on protected data. The key point of this tutorial is to introduce an alternative approach to training AI models that is straightforward and accessible. This tutorial is sequenced in 3 parts. We’ll first introduce federated learning and its prototypical architecture. In part 2, we’ll dive into a series of live Python code demos that showcase how to convert a classical centralized machine learning workflow into a federated workflow involving multiple federated clients. We’ll demonstrate the similarities and differences of how the iteration of a federated research project is conducted. Finally, in part 3, we’ll demonstrate how you can take your research code and deploy it in a production setting using a mixture of physical edge devices and VMs. Throughout the tutorial, we’ll use [Flower](https://github.com/adap/flower), the fully open-sourced federated AI framework, which is written in Python and designed for Python users. With simplicity as one of it’s main goals, Flower provides multiple features and libraries to accelerate research, such as [Flower Baselines](https://flower.ai/docs/baselines/) (for reproducing federated learning benchmarks) and [Flower Datasets](https://flower.ai/docs/datasets/) (a standalone Python library for easily creating federated datasets). We’ll showcase how to use the Flower CLI in both research and production setting. This tutorial addresses people with fluency in Python, CLI, and basic knowledge of a machine learning project. It would help if you’ve also used Docker before. Any data practitioner is encouraged to attend the tutorial to learn and discuss how to federate and distribute the training of an ML model. You will learn: - What’s Federated Learning? - Basics and real-world examples - How to federate your existing ML training code, and more FL-specific steps such as how to: - Configure the behaviours of each federated client - Persist the state of each client across global rounds - Evaluate both aggregated and local models - Standardize your FL experiments - Track your experiments - How to deploy your research code in a production setting, such as how to: - Deploy Flower federated learning clients using Docker - Set-up secure connection and node authentication - Run, monitory, and manage the federated learning runs. Bring your own laptop if you’d like to follow along. Some code examples will be executed in GitHub Codespaces, others can be locally executed on your favourite IDE. **Update: 24th April 2025** The GitHub repo containing the code examples is available here 👉 [link](https://github.com/chongshenng/pyconde2025). The tutorial session is structured in the following way: - 0:00 Introduction, and getting to know the audience. - 0:05 What’s Federated Learning? Basics and real-world-examples. - 0:25 Overview of the Flower framework for federated learning - 0:30 Quickstart examples with PyTorch. Moving from a centralized training to federated. - 1:00 Deploying your research to production - 1:20 Feedback and Q&A false https://pretalx.com/pyconde-pydata-2025/talk/9Y9DM8/ https://pretalx.com/pyconde-pydata-2025/talk/9Y9DM8/feedback/ Carbonium Mini-Pythonistas: Coding, Experimenting, and Exploring with Zümi! Kids Workshop 2025-04-24T09:00:00+02:00 09:00 03:00 Please note, this is a children's workshop. Recommended age 10-16 years. Experienced use of keyboard and mouse, first words in English (for programming) are required. // Welcome, mini-Pythonistas! In this workshop, we’ll dive into the world of Zümi, a programmable car that’s much more than just wheels and motors. With built-in sensors, lights, and a camera, Zümi can learn to recognize colors, respond to gestures, and even identify faces — all with your help! pyconde-pydata-2025-64272-mini-pythonistas-coding-experimenting-and-exploring-with-zumi PyData: Embedded Systems & Robotics Dr. Marisa MohrAnna-Lena PopkesHannah HepkeDaniel Hieber en ## Summary Welcome, mini-Pythonistas! In this workshop, we’ll dive into the world of Zümi, a programmable car that’s much more than just wheels and motors. With built-in sensors, lights, and a camera, Zümi can learn to recognize colors, respond to gestures, and even identify faces — all with your help! Note, this is a kids workshop: All children and young people up to the age of 16 are welcome. ## More Details Whether you’re brand new to programming or a seasoned Python pro, there’s something here for everyone: * Blockly: Perfect for beginners! Learn the basics of programming by snapping together colorful blocks. * Jupyter Notebooks: Already know about variables and loops? Take the next step and explore more advanced coding concepts. * Python Scripting: For our experienced coders, write your own Python scripts and push Zümi to its limits. What can you teach Zümi? * Drive and park autonomously: With infrared sensors, Zümi can detect obstacles, stop, and adjust its course. * Recognize colors: Train a machine learning model to teach Zümi to stop or react when it sees a specific color. * Identify faces: Using its camera, Zümi can spot faces in photos and even recognize a smile! * and many more! Join us for a fun-filled adventure where coding meets creativity and discovery. Let’s see what you and Zümi can achieve together! 🚗💻✨ false https://pretalx.com/pyconde-pydata-2025/talk/VC3T39/ https://pretalx.com/pyconde-pydata-2025/talk/VC3T39/feedback/ OpenSpace Probably Fun: Board Games to teach Data Science Tutorial 2025-04-24T10:15:00+02:00 10:15 01:30 In this tutorial, you will speed-date with board and card games that can be used to teach Data Science. You will play one game for 15 minutes, reflect on the Data Science concepts it involves, and then rotate to the next table. As a result, you will experience multiple ideas that you can use to make complex ideas more understandable and enjoyable. We would like to demonstrate how gamification can not only used to produce short puzzles and quizzes, but also as a tool to reason complex problem-solving strategies. We will bring a set of carefully selected games that have been proven effective in teaching statistics, programming, machine learning and other Data Science skills. We also believe that it is probably fun to participate in this tutorial. pyconde-pydata-2025-61389-probably-fun-board-games-to-teach-data-science General: Education, Career & Life Dr. Kristian RotherPaula Gonzalez Avalos en Games encourage people to put their brains to work in a focused, constructive and peaceful way. This makes games a fantastic tool in the classroom. Many board games contain sophisticated algorithms and statistical models right under the surface. Therefore, Data Science education can be boosted by playing carefully selected games. We have applied popular board and card games such as Memory, Wizard, Machi Koro, Pandemic and Sky Team (the 2024 Game of the Year in Germany) to teach Data Science concepts in our courses. Learners would first play a game, discuss the mechanisms and only after that get exposed to the theory. Finally, they would move to practical applications using computers. This game-driven approach provides learners with an intrinsic motivation to solve a real practical problem (succeeding at the game). Analyzing a game makes it easier to grasp the core mechanism or algorithmic model and ask qualified questions about the details later. It also makes sure learners will want to come back for the next class. We have documented practical lessons and made them available under a CC license on https://www.academis.eu/probably_fun/ . In this tutorial, you will speed-date with several short games that can be used to teach Data Science concepts and skills. You will play one game for 15 minutes, reflect on the Data Science concepts it involves, and then rotate to the next table. This way, you will experience multiple ideas you can use to make complex methods and ideas more accessible. Also, the tutorial is probably fun to participate in. The tutorial will be executed according to the following pseudocode (or lesson plan): 1. The presenters give a short introduction on why games matter (5 min) 2. The presenters group participants into teams of up to 6 people. 3. Each team is assigned to a game table with a game and a cheat sheet with instructions. The presenters facilitate with understanding rules and to remove other obstacles. 4. The teams play the game for up to 15 minutes. 5. The teams discuss 1-3 prepared reflection questions to make the transfer from the game to the data science concepts. 6. Each team moves to the next table. 7. Repeat for 3-4 rounds. 8. Everybody gets together for a joint Q & A 9. A QR-Code links to material with games that help learning Data Science and lesson plans false https://pretalx.com/pyconde-pydata-2025/talk/WELCVS/ https://pretalx.com/pyconde-pydata-2025/talk/WELCVS/feedback/ Zeiss Plenary (Spectrum) The Future of AI: Building the Most Impactful Technology Together Keynote 2025-04-25T09:05:00+02:00 09:05 00:45 In this talk, Leandro will examine the significant benefits of combining open source principles with artificial intelligence. He will walk through the need for openness in language models to build trust, maintain control, mitigate biases, and achieve true alignment and show how open models are rapidly gaining momentum in the AI landscape, challenging proprietary systems through community-driven innovation. Finally, he will then talk about emerging trends and what the community needs to build for the next generation of models. pyconde-pydata-2025-65615-the-future-of-ai-building-the-most-impactful-technology-together Keynote Leandro von Werra en In this talk, Leandro will examine the significant benefits of combining open source principles with artificial intelligence. He will walk through the need for openness in language models to build trust, maintain control, mitigate biases, and achieve true alignment and show how open models are rapidly gaining momentum in the AI landscape, challenging proprietary systems through community-driven innovation. Finally, he will then talk about emerging trends and what the community needs to build for the next generation of models. false https://pretalx.com/pyconde-pydata-2025/talk/Z9ZTAH/ https://pretalx.com/pyconde-pydata-2025/talk/Z9ZTAH/feedback/ Zeiss Plenary (Spectrum) Data as (Python) Code Talk 2025-04-25T10:15:00+02:00 10:15 00:30 In contemporary data-driven environments, the seamless integration of data into automated workflows is paramount. The reliability of automation, however, is constantly threatened by breaking changes in the source data. The Data-as-Code (DaC) paradigm address this challenge by treating data as a first-class citizen within the software development lifecycle. pyconde-pydata-2025-60689-data-as-python-code PyCon: MLOps & DevOps Francesco Calcavecchia en Data-as-Code (DaC) is a paradigm that streamlines data distribution by encapsulating dataset retrieval within Python packages, along with a data contract. This approach makes it easy to enforce data quality, effortlessly leverage on semantic versioning to prevent errors in the data pipeline, and abstracts away from the Data Scientist all the boilerplate code to load the data needed by the ML models, improving efficiency and consistency. This presentation will delve into the implementation of DaC, demonstrate its practical applications, and discuss the benefits it offers in modern data workflows. This session will cover: 1. Introduction to Data-as-Code (DaC): - What problems do we want to solve with DaC - What it is out of scope 2. Implementing DaC: - Packaging data as Python packages - Defining data contracts 3. Advantages of DaC: - Application of semantic versioning to manage data changes effectively - Breaking changes in data are automatically detected as part of the data distribution - Abstraction of data loading mechanisms, allowing seamless transitions between data sources - Elimination of hard-coded data field names, enhancing code maintainability - Facilitation of unit testing through schema examples - Inclusion of comprehensive data descriptions and metadata - Centralized data distribution via the Python Package Index (PyPI) 4. DaC in the real world: - Step-by-step walkthrough of creating and distributing a DaC package - Guidelines for data engineers on preparing data for DaC - Instructions for data scientists on consuming DaC packages in their workflows - Discussion on the scalability and adaptability of DaC 6. Q&A Session: - Addressing audience questions and remarks false https://pretalx.com/pyconde-pydata-2025/talk/SXRVNU/ https://pretalx.com/pyconde-pydata-2025/talk/SXRVNU/feedback/ Zeiss Plenary (Spectrum) How Narwhals is silently bringing pandas, Polars, DuckDB, PyArrow, and more together Talk 2025-04-25T10:55:00+02:00 10:55 00:30 If you were writing a data science tool in 2015, you'd have ensured it supported pandas and then called it a day. But it's not 2015 anymore, we've fast-forwarded to 2025. If you write a tool which only supports pandas, users will demand support for Polars, PyArrow, DuckDB, and so many other libraries that you'll feel like giving up. Learn about how Narwhals allows you to write dataframe-agnostic tools which can support all of the above, with zero dependencies, low overhead, static typing, and strong backwards-compatibility promises! pyconde-pydata-2025-61175-how-narwhals-is-silently-bringing-pandas-polars-duckdb-pyarrow-and-more-together PyData: PyData & Scientific Libraries Stack Marco Gorelli en Suppose you want to write a data science tool to do feature engineering. Your experience may go like this: - Expectation: you can focus on state-of-the art techniques for feature engineering. - Reality: you keep having to make you codebase more complex because a new dataframe library has come out and users are demanding support for it. Or rather, it might have gone like that in the pre-Narwhals era. Because now, you can focus on solving the problems which your tool set out to do, and let Narwhals handle the subtle differences between different kinds of dataframe inputs! Narwhals is a lightweight and extensible compatibility layer between dataframe libraries. It is already used by several open source libraries including Altair, Marimo, Plotly, Scikit-lego, Vegafusion, and more. You will learn how to use Narwhals to build dataframe-agnostic tools. This is a technical talk aimed at tool-builders. You'll be expected to be familiar with Python and dataframes. We will cover: - 2-3 minutes: motivation. Why are there so many dataframe libraries? - 2-3: minutes: life before vs after Narwhals - real-world examples of how the data landscape is changing - 7-8 minutes: basics of Narwhals, wrapping native objects, expressions vs Series, lazy vs eager - 7-8 minutes: advanced Narwhals concepts: row order, non-elementary group-by aggregations, multi-indices, null values, backwards-compatibility promises - 2-3 minutes: what comes next? - 5 minutes: engaging Q&A / awkward silence Tool builders will benefit from the talk by learning how to build tools for modern dataframe libraries without sacrificing support for foundational classic libraries such as pandas. false https://pretalx.com/pyconde-pydata-2025/talk/CPCNRZ/ https://pretalx.com/pyconde-pydata-2025/talk/CPCNRZ/feedback/ Zeiss Plenary (Spectrum) Topological data analysis: How to quantify "holes" in your data and why? Talk (long) 2025-04-25T11:35:00+02:00 11:35 00:45 Do you need to compare sets of points in a plane? Identify a potential cyclic event in high-dimensional time series data? Find the second or the third highest peak of a noisily sampled function? Topological data analysis (TDA) is not a universal hammer, but it might just be the 16 mm wrench for your 16 mm hex head bolt. There is no shortage of Python libraries implementing TDA methods for various settings, but navigating the options can be challanging without prior familiarity with the topic. In my talk I will demonstrate the utility of the tool with several simple examples, list various libraries used by the TDA community, and dive a bit deeper into the methods to explain what the libraries implement and how to interpret and work with the outputs. pyconde-pydata-2025-61318-topological-data-analysis-how-to-quantify-holes-in-your-data-and-why PyData: PyData & Scientific Libraries Stack Ondrej Draganov en For specific tasks, topological data analysis can be a more rigid, straightforward and interpretable alternative to complicated machine learning pipelines. However, it is not so widely known and can be intimidating to get into when starting from zero. The goal of this talk is to introduce persistent homology, the main tool of topological data analysis, show concrete examples of how to apply it using available Python libraries, and reveal more details about what is going on "under the hood", which is important to correctly utilize the methods. I will start with several examples showcasing the possible uses of persistent homology and how to establish an analysis pipeline in Python. Then I will describe more about different variants within such a pipeline, like a choice of a filtered complex or vectorization, and their advantages and disadvantages. false https://pretalx.com/pyconde-pydata-2025/talk/HQWAYP/ https://pretalx.com/pyconde-pydata-2025/talk/HQWAYP/feedback/ Zeiss Plenary (Spectrum) From stockouts to happy customers: Proven solutions for time series forecasting in retail Talk 2025-04-25T13:20:00+02:00 13:20 00:30 Time series forecasting in the retail industry is uniquely challenging: Datasets often include stockouts that censor actual demand, promotional events cause irregular demand spikes, new product launches face cold-start issues, and diverse demand patterns within an imbalanced product portfolio create modeling challenges. In this talk, we’ll explore proven, real-world strategies and examples to address these problems. Learn how to successfully handle censored demand caused by stockouts, effectively incorporate promotional effects, and tackle the variability of diverse products using clustering and ensembling strategies. Whether you’re a seasoned data scientist or a Python developer exploring forecasting, the goal of this session is to introduce you to the key challenges in retail forecasting and equip you with actionable insights to successfully overcome them in real-life scenarios. pyconde-pydata-2025-61905-from-stockouts-to-happy-customers-proven-solutions-for-time-series-forecasting-in-retail PyData: Machine Learning & Deep Learning & Statistics Robert Haase en Retail time series forecasting is uniquely challenging: stockouts censor true demand, promotions cause irregular demand spikes, cold-start products lack historical data, and diverse product portfolios introduce modeling complexities. These challenges can lead to inefficiencies such as over- or understocking in the warehouses and therefore also to dissatisfied customers. This talk explores proven strategies to tackle these issues and deliver actionable insights. Learn how to handle constrained demand caused by stockouts both with adequate imputation as well as machine learning strategies, incorporate promotional effects with suitable feature engineering techniques that also help in cases of incomplete promotional data, predict demand for new products using transfer learning and also discover how ensembling strategies and clustering can simplify forecasting for diverse, imbalanced datasets. We’ll also highlight tools like statsforecast, neuralforecast, scikit-learn and our AutoML framework with a strong stacking ensembling mechanism in it's core. Whether you’re a seasoned data scientist or a Python developer exploring forecasting, the goal of this session is to introduce you to the key challenges in retail forecasting and equip you with actionable insights to successfully overcome them in real-life scenarios. false https://pretalx.com/pyconde-pydata-2025/talk/QN3BTA/ https://pretalx.com/pyconde-pydata-2025/talk/QN3BTA/feedback/ Zeiss Plenary (Spectrum) Forecast of Hourly Train Counts on Rail Routes Affected by Construction Work Talk 2025-04-25T14:00:00+02:00 14:00 00:30 Construction work in national railroad networks often disrupts train traffic, making it vital to estimate hourly train numbers for effective re-routing. Traditionally managed by humans, this process has been automated due to staff shortages and demographic changes. DB Systel GmbH, Deutsche Bahn's IT provider, leveraged machine learning and artificial intelligence to estimate train traffic during construction. Using Python and frameworks like Pandas, scikit-learn, NumPy, PyTorch and Polars, their solution demonstrated significant benefits in performance and efficiency. pyconde-pydata-2025-59643-forecast-of-hourly-train-counts-on-rail-routes-affected-by-construction-work PyData: Machine Learning & Deep Learning & Statistics Sebastian FolzDr Maren Westermann en Within a national railroad network, construction work for maintenance and modernization is unavoidable - as is train traffic on the affected sections under certain circumstances. Although there are fixed timetables for passenger rail transport that are planned well in advance and are set very early, there are still many freight transports and special trains that are registered at short notice and cause a dynamic traffic situation on the rail network. Therefore, the capacity utilisation of the rail routes is unknown until shortly before the journey takes place. It is therefore important to estimate the number of trains that will run over the affected tracks in order to establish a sensible re-routing strategy. Until now, this process has been in the hands of human decision-makers for decades or even more than a century. Demographic change and staff shortages are increasingly forcing companies to automate activities intelligently. This is where machine learning and artificial intelligence come into play. As Deutsche Bahn's IT service provider, DB Systel GmbH was able to successfully implement an example of intelligent automation of this process and estimate train numbers on sections of tracks affected by construction using modern ML and AI methods. Python as well as various established frameworks (Pandas, scikit-learn, NumPy, PyTorch) and new frameworks (Polars, Ruff) were used in this project. A success and performance measurement clearly demonstrated the benefits of ML automation. false https://pretalx.com/pyconde-pydata-2025/talk/RAHBEP/ https://pretalx.com/pyconde-pydata-2025/talk/RAHBEP/feedback/ Zeiss Plenary (Spectrum) Demystifying Design Patterns: A Practical Guide for Developers Talk 2025-04-25T14:40:00+02:00 14:40 00:30 Do you ever worry about your code becoming spaghetti-like and difficult to maintain? Master the art of crafting clean, maintainable, and adaptable software by harnessing the power of design patterns. This presentation will empower you with a clear, structured understanding of these reusable solutions to address common programming challenges. We'll delve into design patterns’ key categories: Behavioral, Structural, and Creational, as well as explore their functionality and how they can be applied in your daily development workflow. For each category, we'll also explore a practical design pattern in detail and showcase real-world applications of these patterns, along with small-scale code examples that illustrate their practical implementation. You'll gain valuable insight into how these patterns can translate into real-world development scenarios, such as facilitating communication between objects (Behavioral), separating interfaces from implementation for flexibility (Structural), and enabling dynamic algorithm selection at runtime (Creational). pyconde-pydata-2025-61311-demystifying-design-patterns-a-practical-guide-for-developers PyCon: Programming & Software Engineering Tanu en Do you ever worry about your code becoming spaghetti-like and difficult to maintain? Master the art of crafting clean, maintainable, and adaptable software by harnessing the power of design patterns. This presentation will empower you with a clear, structured understanding of these reusable solutions to address common programming challenges. We'll delve into design patterns’ key categories: Behavioral, Structural, and Creational, as well as explore their functionality and how they can be applied in your daily development workflow. For each category, we'll also explore a practical design pattern in detail and showcase real-world applications of these patterns, along with small-scale code examples that illustrate their practical implementation. You'll gain valuable insight into how these patterns can translate into real-world development scenarios, such as facilitating communication between objects (Behavioral), separating interfaces from implementation for flexibility (Structural), and enabling dynamic algorithm selection at runtime (Creational). false https://pretalx.com/pyconde-pydata-2025/talk/8PFFPS/ https://pretalx.com/pyconde-pydata-2025/talk/8PFFPS/feedback/ Titanium3 From Queries to Confidence: Ensuring SQL Reliability with Python Talk 2025-04-25T10:15:00+02:00 10:15 00:30 SQL remains a foundational component of data-driven applications, but ensuring the accuracy and reliability of SQL logic is often challenging. SQL testing can be cumbersome, time-consuming, and error-prone. However, these challenges can be addressed by leveraging the simplicity of Python's testing framework such as pytest, enabling clean, robust, and automated SQL testing. pyconde-pydata-2025-61104-from-queries-to-confidence-ensuring-sql-reliability-with-python PyCon: Testing Anna Varzina en SQL is an essential part of data-driven applications, powering everything from simple queries to complex data transformations. However, ensuring the accuracy and reliability of SQL code is often challenging, particularly when dealing with intricate logic or large-scale datasets. Also, deploying changes in SQL code to production is another complex task, as it requires careful validation to avoid breaking the query logic. Fortunately, integrating Python’s testing framework such as pytest into SQL workflows provides a streamlined solution for these challenges. Such approach enables creating clean, efficient, and automated testing processes for SQL code and database logic. Therefore, we can validate query results, enforce schema consistency, and simulate complex data scenarios, all while reducing manual effort and improving test coverage. This talk will address: - configuring lightweight database fixtures - verifying SQL query result and testing scripts seamlessly - data mocking - schema validation - testing non-deterministic queries - handling large datasets Attendees will gain insights into improving SQL code quality, identifying issues early in the development process, and ensuring the reliability of data-driven products. This presentation is particularly beneficial for Data Scientists, Engineers, and Analysts seeking to enhance the efficiency and precision of their testing practices. false https://pretalx.com/pyconde-pydata-2025/talk/FSK3PE/ https://pretalx.com/pyconde-pydata-2025/talk/FSK3PE/feedback/ Titanium3 Using Python to enter the world of Microcontrollers Talk 2025-04-25T10:55:00+02:00 10:55 00:30 So you've happily used the Raspberry Pi for your homelab projects, of course with Python based solutions as we all do. You've been down the rabbit hole with everything about temperature and humidity measurements, energy and solar tracking, video recording and time-lapse photography, object detection and security surveillance. You don't just buy these things of the shelve. You want to deeply understand what it takes to create such a thing, and you've been quite happy with your results so far, learned a lot. But for many simple applications ... the power draw! Yes, it's just 5 Watts you say for using a Raspberry Pi. Not a big deal in terms of cost. But you'll always need a power adapter and a free socket. You've heard of these guys using microcontrollers that run on batteries or even solar, for days, weeks, even months. That's exciting, but there's also a catch. These people write code in C-like languages, they build firmware to make their projects run. And it's all bare metal! That seems very different. That'll be a steep learning curve to take ... Or is it? Well, there's MicroPython to the rescue. Let me take you with me on a journey to make a simple microcontroller based application to read a Power Meter and send the readings over WiFi for more in depth processing somewhere else. pyconde-pydata-2025-61858-using-python-to-enter-the-world-of-microcontrollers PyData: Embedded Systems & Robotics Jens Nie en Over the past years Python became available on more and more platforms, both software and hardware ones. From MacOS and Linux to Windows. From Desktop Computers and SoC Platforms such as the Raspberry Pis to Data Centers. And even on the smallest side Python is available today. MicroPython implements our beloved language for direct use on embedded platforms built on top of popular microcontrollers, such as the original PyBoard using an STM32 microcontroller, the ESP32 platform and the Raspberry Pi Picos. In this talk we'll have a look at how MicroPython feels compared to the fully fledged Python implementations, by "porting" a simple application that initially was built to run on a Raspberry Pi to an ESP32 based Microcontroller. The application was used to retrieve Power Meter Readings via its internal Infrared LED using a small photo transistor based circuit connected to the Raspberry Pi and calculate current power draw from these readings to send them somewhere else for further processing. We'll see what it takes to make such an application work on a Microcontroller running just on batteries. false https://pretalx.com/pyconde-pydata-2025/talk/DSHASE/ https://pretalx.com/pyconde-pydata-2025/talk/DSHASE/feedback/ Titanium3 Rustifying Python: A Practical Guide to Achieving High Performance While Maintaining Observability Talk (long) 2025-04-25T11:35:00+02:00 11:35 00:45 In this session, I’ll share our journey of migrating key parts of a Python application to Rust, resulting in over 200% performance improvement. Rather than focusing on quick Rust-to-Python integration with PyO3, this talk dives into the complexities of implementing such a migration in an enterprise environment, where reliability, scalability, and observability are crucial. You’ll learn from our mistakes, how we identified suitable areas for Rust integration, and how we extended our observability tools to cover Rust components. This session offers practical insights for improving performance and reliability in Python applications using Rust. pyconde-pydata-2025-61270-rustifying-python-a-practical-guide-to-achieving-high-performance-while-maintaining-observability PyCon: Programming & Software Engineering Max Höhl en For performance-critical sections of code, especially those that are I/O-bound or CPU-heavy, Python’s Global Interpreter Lock (GIL) can create significant bottlenecks. To improve performance, our team explored integrating Rust, taking advantage of its speed and concurrency features while maintaining Python’s ease of use and flexibility. This session will focus on overcoming common hurdles when migrating to Rust and optimizing performance in a real-world, production environment which orchestrates workload across 2000 compute nodes in various data centers and cloud provider regions. This talk covers practical aspects such as observability, scalability, and deployment in a production setting. We’ll begin by discussing how to identify the parts of your Python code that would benefit most from a Rust migration, particularly those where the GIL is a limiting factor. We’ll also share insights into our migration process, including the challenges we faced and how we overcame them. You’ll learn how we refactored Python code and used PyO3 to integrate Rust, achieving over 200% performance improvements. A key challenge when adding Rust to a Python codebase is maintaining robust observability. We’ll explain how we extended our OpenTelemetry and Sentry observability stack to include Rust components, ensuring seamless monitoring, tracing, and debugging across the entire stack. Throughout the session, we’ll illustrate the process with a practical example: a simplified version of our own application, which includes both I/O-heavy and compute-heavy tasks. You’ll see how to break down business logic and decide which parts to migrate to Rust for maximum performance benefit. By the end of this session, you will be equipped with the knowledge to assess where Rust can improve your Python application’s performance, and how to integrate it in a reliable and observable way. This session is ideal for anyone looking to optimize Python performance with Rust, while keeping applications running. false https://pretalx.com/pyconde-pydata-2025/talk/QXSQKL/ https://pretalx.com/pyconde-pydata-2025/talk/QXSQKL/feedback/ Titanium3 Extending Python with Rust, Mojo, Cuda and C and building packages Talk 2025-04-25T13:20:00+02:00 13:20 00:30 We all love Python - but we especially love it for its unique ability as a glue language. In this talk we will show a number of ways of extending Python: using Rust, C and Cython, C++, CUDA and Mojo! We will use the pixi package manager and the open source conda-forge distribution to demonstrate how to easily build custom Python extensions with these languages. The main challenge with custom extensions is about distributing them. The new pixi build feature makes it easy to build a Python extension into a conda package as well as wheel file for PyPI. Pixi will manage not only Python, but also the compilers and other system-level dependencies. pyconde-pydata-2025-61215-extending-python-with-rust-mojo-cuda-and-c-and-building-packages PyData: PyData & Scientific Libraries Stack Wolf VollprechtRuben Arts en Extending Python with native code is a common way of speeding up the execution. There are a number of traditional ways of writing Python extensions (Fortran, C, C++) but lately some modern languages have also entered the game (Rust, Mojo, and let’s count CUDA as modern, too). All of these have slightly different ways of writing Python extensions, and they require the installation of a compiler and compilation tool chain, as well as possibly other system dependencies. Installing, updating and managing the system dependencies is usually a bit of a hassle, and it is where pixi comes in. Pixi is a new package manager that builds on top of the Conda ecosystem. The community distribution “conda-forge” already has tools like C and Rust compilers, and thus it’s easy to maintain the compiler + Python tool chain in a single project. The new “pixi build” feature makes it even easier to build complex multi-language workspaces that combine different Python versions, compilers, and languages. In the talk we will show lots of live demos going over simple numerical examples that highlight the different ways of extending Python (using pybind11, nanobind, PyO3, Mojo, …) glued together in a single workspace with pixi build compiling the extensions from source. We will also demonstrate how pixi can help not only depending on other packages from source, but also by building the packages into Conda and Wheel (PyPI) packages that can be shared. After the talk, the listeners will have seen a number of ways how to extend Python code (easily!) with native languages and will have an understanding of benefits and drawbacks of the different approaches. false https://pretalx.com/pyconde-pydata-2025/talk/P9VKRV/ https://pretalx.com/pyconde-pydata-2025/talk/P9VKRV/feedback/ Titanium3 Offline Disaster Relief Coordination with OpenStreetMap and FastAPI Talk 2025-04-25T14:00:00+02:00 14:00 00:30 In natural disaster scenarios, reliable communication is crucial. This talk presents a solution for disaster relief coordination using OpenStreetMap vector maps hosted on a local device in the emergency vehicle with FastAPI, ensuring functionality without an internet connection. By integrating a database of post codes and street names, and leveraging a LORAWAN gateway to receive positional data and water levels, this system ensures access to critical information even in blackout situations. pyconde-pydata-2025-59909-offline-disaster-relief-coordination-with-openstreetmap-and-fastapi General: Infrastructure - Hardware & Cloud Jannis Lübbe en In natural disaster scenarios, effective coordination of relief efforts is essential, especially when traditional communication networks are compromised or overloaded. This involves organizing various aspects of disaster response without internet connectivity, such as dike defense, sandbag logistics, power supply, distribution of relief goods, clearing and repairing roadways, searching for missing persons, securing structurally unstable buildings and salvaging property. Typically, emergency responders are mobilized from different regions and may not be familiar with the affected area. Since it is unlikely that responders will have pre-downloaded offline maps of the target region on their devices, a vehicle-hosted Wi-Fi hotspot providing nationwide maps would be invaluable. This presentation introduces a practical solution for offline disaster relief coordination using OpenStreetMap vector maps hosted on a local device in the emergency vehicle with FastAPI, allowing responders to use their existing devices to access critical geographical information and enhancing the efficiency and effectiveness of disaster relief operations. The solution includes: FastAPI Server Setup: Configuring and deploying a FastAPI server to host OpenStreetMap vector maps offline on a Raspberry Pi, which also offers a WiFi hotspot for existing end devices, ensuring accessibility without an internet connection. Database Integration: Integrating a comprehensive database of post codes and street names to facilitate quick and accurate location searches. LORAWAN Gateway Integration: Implementing a LORAWAN gateway to receive real-time positional data and water level measurements, providing up-to-date situational awareness for disaster relief workers. false https://pretalx.com/pyconde-pydata-2025/talk/GUEAHT/ https://pretalx.com/pyconde-pydata-2025/talk/GUEAHT/feedback/ Titanium3 3 Ways to Speed up Your Regression Modeling in Python Talk 2025-04-25T14:40:00+02:00 14:40 00:30 Linear Regression is the workhorse of statistics and data science. Some data scientists even go as far and argue that "linear regression is all you need". In this talk, we will introduce three ways to run regression models faster by using smarter algorithms, implemented in the scikit-learn & fastreg (sparse solvers), pyfixest (Frisch-Waugh-Lovell), and duckreg (regression compression via duckdb) libraries. pyconde-pydata-2025-61855-3-ways-to-speed-up-your-regression-modeling-in-python PyData: Machine Learning & Deep Learning & Statistics Alexander Fischer en We introduce three different ways to make regressions run faster. We first introduce sparse solvers and show how to run regressions on sparse matrices via scikit-learn and the fastreg libraries. We then lay out the Frisch-Waugh-Lovell theorem and the alternating projections algorithm and show how to speed it up on the CPU (via numba) and on the GPU (via JAX) as implemented in the pyfixest library. Finally, we demonstrate how to drastically speed up regression estimation by first preprocessing the data in duckdb and then fitting a regression via weighted least squares in memory. References: - fastreg: https://github.com/iamlemec/fastreg - scikit-learn: https://github.com/scikit-learn/scikit-learn - pyfixest: https://github.com/py-econometrics/pyfixest - duckreg: https://github.com/py-econometrics/duckreg false https://pretalx.com/pyconde-pydata-2025/talk/ECNDQM/ https://pretalx.com/pyconde-pydata-2025/talk/ECNDQM/feedback/ Helium3 FastHTML vs. Streamlit - The Dashboarding Face Off Talk 2025-04-25T10:15:00+02:00 10:15 00:30 In the right corner, we have the go-to dashboarding solution for showcasing ML models or visualizing data, **STREAMLIT** (\*crowd cheers\*). Simple yet powerful, it defends the throne of Python dashboarding, but have you ever tried to create complex interactions with it? Things like drill-downs or logins, can make your control flow become messy really quick (\*crowd nods knowlingly\*). And in the left corner, the new contender in the arena of Python web frameworks which, according to its docs, "*excels at building dashboards*", **FastHTML** (\*crowd whoops\*). We will see if this is true, in the **ultimate dashboarding face off** (\*crowd gasps\*). By building the same dashboard, step by step, in both frameworks, investigate their strengths and weaknesses, we will see which framework can claim the crown. pyconde-pydata-2025-59664-fasthtml-vs-streamlit-the-dashboarding-face-off PyCon: Django & Web Tilman Krokotsch en Streamlit is the go-to dashboarding solution for showcasing ML models or visualizing data. It has a vibrant community, multiple years of development under its belt, and tons of third-party integrations. On the other hand, everyone that tried to create complex interactions, like drill-downs or logins, knows that control flow can get messy really quick. Initially simple dashboards often evolve into something bigger and the simple-but-powerful Streamlit formula may not always be up to the tasks. FastHTML is a new contender in the arena of Python web frameworks and, according to its docs, "it excels at building dashboards." FastHTML stands on the shoulders of giants, giving you a smooth Python experience for authoring web pages, while allowing access to the foundations of the web, like CSS and JS, at any time. We will see if FastHTML can put code where its mouth is, by building the same dashboard, step by step, in both frameworks and investigate their strengths and weaknesses. This is a talk for data enthusiasts that dabble in web technologies for the sake of showcasing their work or building internal tooling. Do not expect a course on building customer-facing web apps. We will build a dashboard that features: - an interactive Plotly chart - a drill-down with detailed information shown in a second plot - a login - multiple pages and navigation We will examine how hard or easy it is to implement each of these features and how interacting with them in the browser feels. At the end we will see if the reigning champion can defend their crown or if the ambitious contender takes the win. false https://pretalx.com/pyconde-pydata-2025/talk/WCDPLP/ https://pretalx.com/pyconde-pydata-2025/talk/WCDPLP/feedback/ Helium3 Death by a Thousand API Versions Talk 2025-04-25T10:55:00+02:00 10:55 00:30 API versioning is tough, really tough. We tried multiple approaches to versioning in production and eventually ended up with a solution we love. During this talk you will look into the tradeoffs of the most popular ways to do API versioning, and I will recommend which ones are fit for which products and companies. I will also present my framework, Cadwyn, that allows you to support hundreds of API versions with ease -- based on FastAPI and inspired by Stripe's approach to API versioning. After this session, you will understand which approach to pick for your company to make your versioning cost effective and maintainable without investing too much into it. pyconde-pydata-2025-61236-death-by-a-thousand-api-versions PyCon: Django & Web Stanislav Zmiev en Web API Versioning is a way to allow your developers to move quickly and break things while your clients enjoy the stable API in long cycles. It is best practice for any API-first company to have API Versioning in one way or another. Otherwise, the company will either be unable to improve their API or their clients will have their integrations broken every few months. I'll cover all sorts of approaches you can pick to add incompatible features to your API: extremely stable and expensive, easy-looking but horrible in practice, and even completely version-less yet viable. I will provide you with the best practices of how you could find or implement a modern API versioning solution and will discuss the versioning at Stripe in great detail. When you leave, you'll have enough information to make your API Versioning user-friendly without overburdening your developers. false https://pretalx.com/pyconde-pydata-2025/talk/ZMKJAY/ https://pretalx.com/pyconde-pydata-2025/talk/ZMKJAY/feedback/ Helium3 Hands-On LLM Security: Attacks and Countermeasures You Need to Know! Talk (long) 2025-04-25T11:35:00+02:00 11:35 00:45 Dive into the vulnerabilities of LLMs and learn how to prevent them From prompt injection to data poisoning, we’ll demonstrate real-world attack scenarios and reveal essential countermeasures to safeguard your applications. pyconde-pydata-2025-61112-hands-on-llm-security-attacks-and-countermeasures-you-need-to-know PyCon: Security Clemens HübnerFlorian Teutsch en The rapid increase in usage of large language models (LLMs) in the last years makes it necessary to address the specific security risks of LLMs. In this presentation, we will examine typical vulnerabilities in LLMs from a practical perspective. Starting with a systematic overview, we will use a specific demo app to illustrate the various attack scenarios. Vulnerabilities like prompt injection, data poisoning and system prompt leakage will be explained and demonstrated as well as attacks on RAG and agent implementations. In addition to a basic introduction and a presentation of specific vulnerabilities, the talk also presents suitable countermeasures and general best practices for the use of LLMs in productive applications. What to expect? Attending this talk, you learn which vulnerabilities need to be considered when using and integrating LLMs. You will see how specific attacks work and what risks are associated with them. You will also learn which countermeasures are suitable and how these can be implemented technically. false https://pretalx.com/pyconde-pydata-2025/talk/3DSU8V/ https://pretalx.com/pyconde-pydata-2025/talk/3DSU8V/feedback/ Helium3 Electify - Retrieval-Augmented Generation for Voter Information in the 2024 European Election Talk 2025-04-25T13:20:00+02:00 13:20 00:30 In general elections, voters often face the challenge of navigating complex political landscapes and extensive party manifestos. To address this, we developed Electify, an interactive application that utilizes Retrieval-Augmented Generation (RAG) to provide concise summaries of political party positions based on individual user queries. During its first roll-out for the European Election 2024, Electify attracted more than 6,000 active users. This talk will explore its development and deployment. It will focus on its technical architecture, the integration of data from party manifestos and parliamentary speeches, and the challenges of ensuring political neutrality and providing accurate replies. Additionally, we will discuss user feedback and ethical considerations, focusing on how generative AI can enhance voter information systems. pyconde-pydata-2025-61790-electify-retrieval-augmented-generation-for-voter-information-in-the-2024-european-election PyData: Natural Language Processing & Audio (incl. Generative AI NLP) Christian Liedl en In general elections, voters often face the challenge of navigating complex political landscapes. These challenges include understanding the differences between nuanced policy positions, comparing extensive party manifestos, and reconciling conflicting information from various sources. The sheer volume of information and the high frequency of elections can lead to voter fatigue and disengagement [1]. Existing tools like Wahlomat are helpful for voters but don’t adapt well to individual preferences or specific questions. To address these issues, we developed Electify—an interactive application designed to empower voters by addressing these pain points. Using Retrieval-Augmented Generation (RAG), Electify simplifies the decision-making process by enabling users to access concise and relevant summaries of political party positions tailored to their individual queries. Our user interface provides the possibility to fact-check the generated responses by directly showing the original sources. Additionally, we included a blinding feature to combat confirmation bias: users can hide party names and read summaries of their positions before unblinding. This talk will explore the technical development and deployment of Electify, covering its architecture, integration of data from party manifestos and parliamentary speeches, and strategies to maintain political neutrality and accuracy in responses. In particular, we will discuss our efforts to use reranking to improve context relevancy and LLM-as-a-judge evaluation for parameter optimization. We identify a trade-off between factual accuracy and the frequency of denied responses, which we think is highly relevant for generative AI systems that operate within sensitive areas like voter information [2]. During its first roll-out for the European Election 2024, Electify received significant attention, attracting 6,000 active users who leveraged the platform to make more informed and confident voting decisions. We will address the lessons learned from user feedback and discuss the ethical considerations involved, emphasizing the potential of generative AI to enhance voter information systems and promote political engagement. Contributors: Christian Liedl, Anna Neifer, Joshua Nowak [Github Repository](https://github.com/electify-eu/europarl-ai) [1] Kostelka et al. "Election frequency and voter turnout." Comparative Political Studies 56.14 (2023) [2] Cao, Lang. "Learn to Refuse: Making Large Language Models More Controllable and Reliable through Knowledge Scope Limitation and Refusal Mechanism." arXiv:2311.01041 (2023). false https://pretalx.com/pyconde-pydata-2025/talk/DNVCEY/ https://pretalx.com/pyconde-pydata-2025/talk/DNVCEY/feedback/ Helium3 Practical Python/Rust: Building and Maintaining Dual-Language Libraries Talk 2025-04-25T14:00:00+02:00 14:00 00:30 Building performant Python often means reaching for C extensions. This talk explores an alternative: leveraging Rust to create blazing-fast Python modules that also benefit the Rust ecosystem. I will share practical strategies from building `semantic-text-splitter`, a library for fast and accurate text segmentation used in both Python and Rust, demonstrating how to bridge the gap between these two languages and unlock new possibilities for performance and cross-language collaboration. pyconde-pydata-2025-61860-practical-python-rust-building-and-maintaining-dual-language-libraries General: Rust Ben Brandt en Building performant Python often means reaching for C extensions. But what if you could achieve similar performance with Rust, while also creating a library usable directly within the Rust ecosystem? This talk explores how Rust can be a powerful ally, creating blazing-fast Python modules that benefit both communities. I will share the strategies I use while building and maintaining my package, `semantic-text-splitter`, used for fast and accurate text segmentation, which sees significant usage in both Python and Rust ecosystems. Some key challenges arise when integrating these two languages, such as bridging the gap between Rust's generics and Python's dynamic typing, managing data representation and memory across the Python/Rust boundary, and maintaining type hints and documentation across both languages. But with practical maintenance strategies, these challenges can be overcome. Moreover, you contribute to a growing ecosystem of high-performance Python tools powered by Rust. Join me to learn how to build and maintain dual-language Python/Rust libraries, and discover how this approach can unlock new possibilities for performance and cross-language collaboration. false https://pretalx.com/pyconde-pydata-2025/talk/CUJMCD/ https://pretalx.com/pyconde-pydata-2025/talk/CUJMCD/feedback/ Helium3 Switching from Data Scientist to Manager Talk 2025-04-25T14:40:00+02:00 14:40 00:30 In this presentation, I will discuss my transition from a Data Scientist to a management role, covering key managerial responsibilities, preparation tips, and the pros and cons of the switch. The talk is particularly relevant for engineers who have recently moved into management or are considering the change, as well as those interested in understanding the challenges managers face. The session will include brief presentations followed by interactive discussions with the audience. pyconde-pydata-2025-68948-switching-from-data-scientist-to-manager General: Education, Career & Life Theodore Meynard en In this presentation, I will discuss my transition from a Data Scientist to a management position. After a brief introduction, I will provide an overview of the key aspects of managerial responsibilities. Then I will share some advice on how to prepare for the transition, what you cannot prepare for, and how to start as a new manager. Ultimately, I will share my perspective on the transition and outline the pros and cons of the new role. This talk is particularly relevant to engineers who have recently transitioned to management or are considering a change in roles. It will also be of value for those who are keen to understand the challenges their manager is (probably) facing and how to help them. This session will consist of a few rounds of a brief presentation, followed by an interactive session with the audience. false https://pretalx.com/pyconde-pydata-2025/talk/KDGZ8K/ https://pretalx.com/pyconde-pydata-2025/talk/KDGZ8K/feedback/ Platinum3 Where have all the post offices gone? Discovering neighborhood facilities with Python and OSM Talk 2025-04-25T10:15:00+02:00 10:15 00:30 When it comes to open geographic data, OpenStreetMap is an awesome resource. Getting started and figuring out how to make the most out of the data available can be challenging. Using a personal example: frustration at the apparent lack of post offices in my neighborhood, we'll walk through examples of how to parse, filter, process, and visualize geospatial data with Python. At the end of this talk, you will know how to process geographic data from OpenStreetMap using Python and find out some surprising info that I learned while answering the question: Where have all the post offices gone? pyconde-pydata-2025-61064-where-have-all-the-post-offices-gone-discovering-neighborhood-facilities-with-python-and-osm PyData: Data Handling & Engineering Katie Richardson en **Problem statement** Needing an international postcard stamp, I headed to my nearest post office only to find out that it was permanently closed, the latest closure among others in recent memory. Was this just in my neighborhood or was this happening all over the state? To answer these questions, I turned to open data and Python. - What is OpenStreetMap? **How can we identify types of places, like post offices and districts, in OpenStreetMap?** - Types of data in OSM - Tags - Tools for diving into the data to get an idea of how it is structured and how to construct queries: Overpass API, overpass turbo **How can we access the raw OSM data and work with it in Python?** - How many post offices are there in each neighborhood? What about by area or population? - Working with PBF files: parsing and filtering with the PyOsmium library - Using GeoPandas to store the data in a GeoDataFrame and apply transformations **What are some tools for visualizing the data?** - How can we make an interactive plot of post offices in each neighborhood? What about other facilities and resources? - Plot directly from a GeoDataFrame - Interactive plotting While this talk is aimed at those beginning with geographic data, it would be helpful to have some background knowledge about Python and data handling. false https://pretalx.com/pyconde-pydata-2025/talk/ADSXCA/ https://pretalx.com/pyconde-pydata-2025/talk/ADSXCA/feedback/ Platinum3 The Foundation Model Revolution for Tabular Data Talk 2025-04-25T10:55:00+02:00 10:55 00:30 What if we could make the same revolutionary leap for tables that ChatGPT made for text? While foundation models have transformed how we work with text and images, tabular / structured data (spreadsheets and databases) - the backbone of economic and scientific analysis - has been left behind. TabPFN changes this. It's a foundation model that achieves in 2.8 seconds what traditional methods need 4 hours of hyperparameter tuning for - while delivering better results. On datasets up to 10,000 samples, it outperforms every existing Python library, from XGBoost to CatBoost to Autogluon. Beyond raw performance, TabPFN brings foundation model capabilities to tables: native handling of messy data without preprocessing, built-in uncertainty estimation, synthetic data generation, and transfer learning - all in a few lines of Python code. Whether you're building risk models, accelerating scientific research, or optimizing business decisions, TabPFN represents the next major transformation in how we analyze data. Join us to explore and learn how to leverage these new capabilities in your work. pyconde-pydata-2025-61121-the-foundation-model-revolution-for-tabular-data PyData: Machine Learning & Deep Learning & Statistics Noah HollmannFrank Hutter en TabPFN shows how foundation model concepts can advance tabular data analysis in Python. Published in Nature Magazine in January 2025, it found strong community adoption with >3,000+ GitHub stars and 1,000,000+ downloads. **Detailed Outline:** 1. **Motivation** - Why tabular data: examples of tabular prediction tasks and time series forecasting - Why foundation models for tabular data - Learning from the foundation model revolution in text and vision 2. **Technical Insights** - How we adapted transformers for tabular data - Making in-context learning work for structured data - Performance characteristics and resource requirements - How to apply TabPFN to time series 3. **Practical Applications** - When to choose TabPFN vs traditional methods - Resource requirements and scalability limits - What's next for TabPFN 4. **Colab Demo** - Q&A **Key Takeaways:** - Practical understanding of TabPFN's capabilities and limitations - Hands-on experience integrating with Python data science workflows - Best practices for working with foundation models on tabular data - Insight into emerging approaches for structured data analysis false https://pretalx.com/pyconde-pydata-2025/talk/XRHEYZ/ https://pretalx.com/pyconde-pydata-2025/talk/XRHEYZ/feedback/ Platinum3 Enhancing RAG with Fast GraphRAG and InstructLab: A Scalable, Interpretable, and Efficient Framework Talk 2025-04-25T11:35:00+02:00 11:35 00:30 Retrieval Augmented Generation (RAG) has become a cornerstone in enriching GenAI outputs with external data, yet traditional frameworks struggle with challenges like data noise, domain specialization, and scalability. In this talk, Tuhin will dive into open-source frameworks Fast GraphRAG and InstructLab, which addresses these limitations by combining knowledge graphs with the classical PageRank algorithm and Fine-tuning, delivering a precision-focused, scalable, and interpretable solution. By leveraging the structured context of knowledge graphs, Fast GraphRAG enhances data adaptability, handles dynamic datasets efficiently, and provides traceable, explainable outputs while InstructLab adds domain depth to the LLM through Fine-tuning. Designed for real-world applications, it bridges the gap between raw data and actionable insights, redefining intelligent retrieval for developers, researchers, and enterprises. This talk will showcase Fast GraphRAG’s transformative features coupled with domain specific Fine-tuning leveraging InstructLab and demonstrate its potential to elevate RAG’s capabilities in handling the evolving demands of large language models (LLMs) for developers, researchers, and businesses. pyconde-pydata-2025-61251-enhancing-rag-with-fast-graphrag-and-instructlab-a-scalable-interpretable-and-efficient-framework PyData: Generative AI Tuhin Sharma en Retrieval Augmented Generation (RAG) has changed the way AI systems incorporate external knowledge, but it often falls short when faced with real-world challenges like adapting to new data, managing complexity, or delivering reliable answers. Fast GraphRAG steps in to address these gaps with a refreshing approach that blends the structure of knowledge graphs with the proven efficiency of algorithms like PageRank. By focusing on interpretability, scalability, and adaptability, Fast GraphRAG creates a pathway for building AI systems that don’t just retrieve data but leverage it in a meaningful way. The agenda for the talk is as follows Challenges in Traditional RAG - Lack of interpretability leads to untrustworthy outputs. - High computational costs limit scalability. - Inflexibility makes adapting to evolving data cumbersome. Fast GraphRAG’s Core Innovations - Interpretability: Knowledge graphs provide clear, traceable reasoning. - Scalability: Efficient query resolution with minimal overhead. - Adaptability: Dynamic updates ensure relevance in changing domains. - Precision: PageRank sharpens focus on high-value information. - Robust Workflows: Typed and asynchronous handling for complex scenarios. How Fast GraphRAG Works - Architecture and algorithmic innovations. - Knowledge graphs for intelligent reasoning. - PageRank for multi-hop exploration and precise retrieval. - Entity extraction, incremental updates, and graph exploration. - Role of InstructLab and Fine-tuning. Demo and Practical Takeaways - Building a knowledge graph and resolving queries. - Open-source tools for scaling Fast GraphRAG. - Real-World applications Fast GraphRAG isn’t just another tool. It's a game-changer for anyone frustrated by the limitations of traditional RAG systems. By combining the structured clarity of knowledge graphs with the power of algorithms like PageRank and fine-tuning by InstructLab, it makes retrieval smarter, faster, and the LLM more adaptable. This session will leave you with a clear understanding of how to build/train AI systems that deliver meaningful results while being transparent and trustworthy. Whether you’re a developer, researcher, or just someone passionate about AI, Fast GraphRAG is a framework that sparks possibilities and redefines what intelligent retrieval can achieve. false https://pretalx.com/pyconde-pydata-2025/talk/JABVHK/ https://pretalx.com/pyconde-pydata-2025/talk/JABVHK/feedback/ Platinum3 Is your LLM any good at writing? Benchmarking on creative writing and editing tasks Talk 2025-04-25T13:20:00+02:00 13:20 00:30 Many LLM benchmarks focus on reasoning and coding tasks. These are exciting tasks! But the majority of LLM usage is still in writing and editing related tasks, and there's a surprising lack of benchmarks on these. In this talk you'll learn what it took to create a writing benchmark, and which model performs best! pyconde-pydata-2025-61840-is-your-llm-any-good-at-writing-benchmarking-on-creative-writing-and-editing-tasks PyData: Natural Language Processing & Audio (incl. Generative AI NLP) Azamat Omuraliev en Large Language Models (LLMs) have demonstrated impressive capabilities in generating human-quality text, but how do we objectively measure their performance on complex writing and editing tasks? This talk explores the challenges of benchmarking LLMs for these tasks and presents a novel framework for evaluating their effectiveness. The talk will provide practical guidance on how to evaluate and compare the performance of different LLMs. Basic familiarity with language models is required for this talk. **Outline:** Introduction - Briefly introduce LLMs and their growing role in writing and editing. - Highlight the need for standardized benchmarks to compare and improve LLM performance. Majority of LLM usage is still on writing tasks*! *Source: https://arxiv.org/pdf/2405.01470 Challenges in benchmarking LLMs for writing and editing: - Defining objective metrics for subjective tasks like writing quality and editing accuracy. - Addressing the issue of bias in training data and its impact on evaluation. - Accounting for the diverse range of writing and editing tasks. A framework for evaluating LLM performance: - Proposing a set of key metrics that encompass fluency, coherence, accuracy, and style. - Introducing a methodology for constructing diverse and representative test datasets. Results: - Showcasing examples of how the proposed framework can be applied to evaluate different LLMs. - Presenting findings from recent benchmarking studies and discussing their implications. Future directions: - Exploring the potential of LLMs to assist with increasingly complex writing and editing tasks. - Identifying areas for future research and development in LLM benchmarking. false https://pretalx.com/pyconde-pydata-2025/talk/P8GUWG/ https://pretalx.com/pyconde-pydata-2025/talk/P8GUWG/feedback/ Platinum3 Using Causal thinking to make Media Mix Modeling Talk 2025-04-25T14:00:00+02:00 14:00 00:30 In today's data-driven landscape, understanding causal relationships is essential for effective marketing strategies. This talk will explore the link between Bayesian causal thinking and media mix modeling, utilizing Directed Acyclic Graphs (DAGs), Structural Causal Models (SCMs), and the Data Generation Process (DGP). We will examine how DAGs represent causal assumptions, how SCMs define relationships in media mix models, and how to implement these models within a Bayesian framework. By using media mix models as causal inference tools, we can estimate counterfactuals and causal effects, offering insights into the effectiveness of media investments. pyconde-pydata-2025-59374-using-causal-thinking-to-make-media-mix-modeling PyData: PyData & Scientific Libraries Stack Carlos Trujillo en In the era of data-driven decision-making, understanding causal relationships is crucial for effective marketing strategies. This talk delves into the underexplored connection between Bayesian causal thinking and media mix modeling, linking Directed Acyclic Graphs (DAGs), Structural Causal Models (SCMs), and the Data Generation Process (DGP). By navigating through these key concepts, we will demonstrate how we can build models that not only predict outcomes but also represent causal mechanisms within the marketing ecosystem. Starting from foundational principles, we will explore how DAGs serve as a formal language for encoding causal assumptions, how Structural Causal Modeling define relationships in media mix models, and how we implement those in the Bayesian framework through the famous DGP. We will further illustrate how media mix models can be employed as causal inference tools to estimate counterfactuals and causal effects, providing actionable insights into the effectiveness of media investments. Finally, we’ll show how Bayesian inference enables us to update these causal beliefs in light of data. This synthesis of causal reasoning and probabilistic modeling is not only theoretically rich but practically powerful—offering a robust framework for constructing media mix models that more accurately reflect the complexities of real-world marketing dynamics. Attendees will leave with an understanding of how to apply Bayesian causal discovery (guided by an example in an IPython notebook) to develop causally valid models that can be applied to real-world marketing data. They will learn how to use Media Mix Models as causal inference tools to estimate counterfactual scenarios and causal effects, unlocking deeper insights into the effectiveness of media investments. This presentation aims to reveal a new pathway for marketers, data scientists, and researchers to harness the potential of these powerful methodologies together, empowering them to drive more informed, causally grounded decisions. false https://pretalx.com/pyconde-pydata-2025/talk/MNTFRG/ https://pretalx.com/pyconde-pydata-2025/talk/MNTFRG/feedback/ Platinum3 Building a HybridRAG Document Question-Answering System Talk 2025-04-25T14:40:00+02:00 14:40 00:30 Retrieval Augmented Generation (RAG) is a powerful technique for searching across unstructured documents, but it often falls short when the task demands an understanding of intricate relationships between entities. GraphRAG addresses this by leveraging knowledge graphs to capture these relationships, but it struggles with scalability and handling diverse unstructured formats. In this talk, we’ll explore how HybridRAG combines the strengths of both approaches - RAG for scalable unstructured data retrieval and GraphRAG for semantic richness- to deliver accurate and contextually relevant answers. We’ll dive into its application, challenges, and the significant improvements it offers for question-answering systems across various domains. pyconde-pydata-2025-61788-building-a-hybridrag-document-question-answering-system PyData: Natural Language Processing & Audio (incl. Generative AI NLP) Darya Petrashka en #### Outline: 1. Introduction - The challenge of extracting information from unstructured and domain-specific text (e.g., legal documents). - Overview of traditional RAG techniques and their limitations: - Scalability and unstructured data handling. - Lack of semantic depth to capture intricate relationships. - Why HybridRAG is a game-changer. 2. What is RAG? - Explanation of vector-based retrieval using embeddings and databases. - Advantages of RAG: - Scalable search across diverse unstructured formats. - Domain-agnostic retrieval capabilities. - Limitations: - Inability to capture relationships between entities. - Difficulty handling domain-specific or complex queries. 3. What is GraphRAG? - Explanation of GraphRAG: How knowledge graphs enhance retrieval by mapping relationships between entities. - Benefits of GraphRAG: - Semantic richness and contextual understanding. - Effective for domains requiring deep relational reasoning (e.g., finance, healthcare). - Challenges of GraphRAG: - Building high-quality knowledge graphs from unstructured data. - Scalability and integration with generative models. 4. Introducing HybridRAG: Combining RAG and GraphRAG - The HybridRAG architecture: - RAG for scalable retrieval of unstructured data. - GraphRAG for refining answers with relational and semantic context. - Benefits of HybridRAG: - Combining scalability with semantic depth. - Improved retrieval accuracy and contextual relevance. - Use case: Legal documents processing (e.g., extracting Q&A insights). - How RAG retrieves general context. - How GraphRAG captures relationships (e.g., between companies, documents, events). 5. Challenges in Building HybridRAG Systems - Creating high-quality knowledge graphs from diverse and unstructured data. - Balancing computational overhead from combining RAG and GraphRAG. - Addressing domain-specific terminology and ensuring generalizability to other domains. 6. Key Takeaways - HybridRAG effectively combines the strengths of RAG and GraphRAG. - It’s particularly powerful for domains requiring both scalability and semantic depth. - Practical advice for building HybridRAG systems in your projects. #### What You’ll Learn: - The strengths and limitations of RAG and GraphRAG techniques for question-answering systems. - How HybridRAG bridges the gap by combining scalable retrieval with semantic richness. - Practical challenges and solutions for building HybridRAG systems, including knowledge graph creation and integration. - Insights into real-world applications where HybridRAG delivers superior results. false https://pretalx.com/pyconde-pydata-2025/talk/9CRNU3/ https://pretalx.com/pyconde-pydata-2025/talk/9CRNU3/feedback/ Europium2 Building Bare-Bones Game Physics in Rust with Python Integration Talk 2025-04-25T10:15:00+02:00 10:15 00:30 Learn how to build a minimalist game physics engine in Rust and make it accessible to Python developers using PyO3. This talk explores fundamental concepts like collision detection and motion dynamics while focusing on Python integration for scripting and testing. Ideal for developers interested in combining Rust’s performance with Python’s ease of use to create lightweight and efficient tools for games or simulations. pyconde-pydata-2025-59310-building-bare-bones-game-physics-in-rust-with-python-integration General: Rust Sam Kaveh en Python’s simplicity makes it the go-to choice for scripting, while Rust excels in performance-critical tasks like game physics. This talk demonstrates how to build a minimalist physics engine in Rust, focusing on core concepts like collision detection, basic rigid body dynamics, and force application, while providing seamless Python integration using PyO3. We’ll explore how PyO3 allows developers to expose Rust functionality as native Python modules, enabling Python developers to easily script and interact with the physics engine. Through practical examples, attendees will see how Python can be used for rapid prototyping and gameplay scripting, while Rust handles the heavy lifting of physics calculations. By the end of this session, participants will not only understand the basics of implementing physics in Rust but also how to use PyO3 to bridge the gap between Rust’s performance and Python’s flexibility. This talk is perfect for Python enthusiasts curious about Rust or Rustaceans looking to make their libraries accessible to the Python ecosystem. false https://pretalx.com/pyconde-pydata-2025/talk/VKYDBD/ https://pretalx.com/pyconde-pydata-2025/talk/VKYDBD/feedback/ Europium2 High-performance dataframe-agnostic GLMs with glum Talk 2025-04-25T10:55:00+02:00 10:55 00:30 Generalized linear models (GLMs) are interpretable, relatively quick to train, and specifying them helps the modeler understand the main effects in the data. This makes them a popular choice today to complement other machine-learning approaches. `glum` was conceived with the aim of offering the community an efficient, feature-rich, and Python-first GLM library with a scikit-learn-style API. More recently, we are striving to keep up with PyData community's ongoing push for dataframe-agnosticism. While `glum` was originally heavily based on `pandas`, with the help of `narwhals`, we are close to being able to fit models on any dataset that the latter supports. This talk presents our experiences with achieving this goal. pyconde-pydata-2025-61810-high-performance-dataframe-agnostic-glms-with-glum PyData: PyData & Scientific Libraries Stack Martin Stancsics en Arguably, `glum`'s standout feature is its ability to efficiently handle datasets consisting of a mix of dense, sparse and categorical features. To facilitate this, it relies on our (similarly open-source) `tabmat` library, which provides classes and useful methods for mixed-sparsity data. `glum` fits models by first converting input data to `tabmat` matrices, and then using those matrices to do the necessary computations. Therefore, dataframe-agnostism in our case mostly boils down to handling the conversion of different dataframes to `tabmat` matrices (which themselves store data in `numpy` arrays and sparse `scipy` matrices) in an efficient manner. Most of it is rather smooth and straightforward due to `narwhals` providing a convenient compatibility layer for a wide range of dataframe functionality. However, we have encountered a couple of pain points that might be of interest to other package maintainers and the PyData community. In particular, - We heavily rely on manipulating the category order and encoding of categorical variables, for which there is somewhat limited support. This is due to various dataframe libraries handling cateegorical columns somewhat differently. - Most dataframe libraries do not support sparse columns, while for us, it is important to be able to accept sparse inputs. In this talk I demonstrate how we used `narwhals` to easily accept multiple types of dataframes. I will go into details about categorical and sparse columns, and present the challenges we encountered with those. I will also examine the benefits and challenges of supporting sparse columns in dataframe libraries and the Arrow stardard. These points are meant to facilitate discussion among the participants and in the PyData community. At the end of the talk I will also briefly mention potential future plans for `glum` and `tabmat`, including the possibility to do computations directly on Arrow objects without converting them to `numpy` and `scipy` arrays. ### Outline 1. A short intro to `glum`, it's backend library `tabmat`, and the main ideas that make them performant. 2. Making `glum` dataframe-agnostic. - Showcase how `narwhals` simplifies handling a wide variety of dataframes. - Discuss handling categorical (and enum/dictionary) columns. - Talk about representing sparse columns in dataframes. 3. Concluding remarks and potential future plans. ### Target audience - Basic understanding of the scientific Python ecosystem (with a focus on dataframe libraries) is recommended. - While some familiarity with linear models might be useful to get the most out of this talk, it is by no means required. ### Main takeaways - How `glum` efficiently handles mixed-sparsity data - How `narwhals` helps to achieve dataframe-agnosticism with little effort - Differences between categorical types in various packages and the Apache Arrow specification. - How support for sparse column could be incorporated into dataframe libraries and the Arrow Columnar Format false https://pretalx.com/pyconde-pydata-2025/talk/JUQ9JJ/ https://pretalx.com/pyconde-pydata-2025/talk/JUQ9JJ/feedback/ Europium2 GitMLOps – How we are managing 100+ ML pipelines in AWS SageMaker Talk 2025-04-25T11:35:00+02:00 11:35 00:30 Scaling machine learning pipelines is no small feat - especially when you’re managing over 100 of them on AWS SageMaker. In this talk, I’ll take you behind the scenes of how our team at idealo built a Git-based MLOps framework that powers millions of real-time recommendations every minute. I’ll share the challenges we faced, the solutions we implemented, and the lessons we learned while streamlining model versioning, deployment, and monitoring. This session is packed with actionable takeaways for ML engineers, data scientists, and DevOps professionals looking to simplify their MLOps workflows and operate efficiently at scale. Whether you’re running a handful of pipelines or preparing to scale up, this talk will equip you with the tools and strategies to tackle MLOps with confidence. pyconde-pydata-2025-60315-gitmlops-how-we-are-managing-100-ml-pipelines-in-aws-sagemaker PyCon: MLOps & DevOps Bogdan Girman en In 2022, idealo’s Machine Learning Engineering (MLE) team took on a bold mission: to transform and scale the recommendation systems powering the idealo website. Fast forward to today, we’re delivering over 1 million recommendations per minute across 20 key user touchpoints - driving seamless, personalized experiences at scale. But how do you manage over 100 machine learning pipelines without breaking a sweat? In this talk, I’ll reveal the three core principles that helped us build a sustainable and efficient MLOps workflow in AWS SageMaker: * Decoupling pipeline releases from deployments for ultimate flexibility * Testing pipelines to ensure seamless performance * Centrally managing infrastructure as code for full control and scalability If you’re ready to supercharge your MLOps game, this session will leave you with practical strategies and battle-tested solutions for running ML pipelines like a pro. false https://pretalx.com/pyconde-pydata-2025/talk/DPAPUA/ https://pretalx.com/pyconde-pydata-2025/talk/DPAPUA/feedback/ Europium2 Responsible AI with fmeval - an open source library to evaluate LLMs Talk 2025-04-25T13:20:00+02:00 13:20 00:30 The term "Responsible AI" has seen a threefold increase in search interest compared to 2020 across the globe. As developers, the questions like "How can we build large language model-enabled applications that are responsible and accountable to its users?" encountered in the conversation more often than before. And the discussion is further compounded by concerns surrounding uncertainty, bias, explainability, and other ethical considerations. In this session, the speaker will guide you through fmeval, an open-source library designed to evaluate Large Language Models (LLMs) across a range of tasks. The library provides notebooks that you can integrate into your daily development process, enabling you to identify, measure, and mitigate potential responsible AI issues throughout your system development lifecycle. pyconde-pydata-2025-61839-responsible-ai-with-fmeval-an-open-source-library-to-evaluate-llms PyData: PyData & Scientific Libraries Stack Mia Chang en The term "Responsible AI" has seen a threefold increase in search interest compared to 2020 across the globe. As developers, the questions like "How can we build large language model-enabled applications that are responsible and accountable to its users?" encountered in the conversation more often than before. And the discussion is further compounded by concerns surrounding uncertainty, bias, explainability, and other ethical considerations. In this session, the speaker will guide you through fmeval, an open-source library designed to evaluate Large Language Models (LLMs) across a range of tasks. The library provides notebooks that you can integrate into your daily development process, enabling you to identify, measure, and mitigate potential responsible AI issues throughout your system development lifecycle. Target Audience: Machine Learning Engineers/Data Scientists, AI/ML Researchers, Software Developers, AI/ML Project Managers, Solutions Architectures. false https://pretalx.com/pyconde-pydata-2025/talk/KTJY9V/ https://pretalx.com/pyconde-pydata-2025/talk/KTJY9V/feedback/ Europium2 You don’t think about your Streamlit app optimization until you try to deploy it to the cloud Talk 2025-04-25T14:00:00+02:00 14:00 00:30 Building Streamlit apps is easy for Data Scientists - but when it’s time to deploy them to the cloud, challenges like slow model loading, scalability, and security can become major hurdles. This talk bridges two perspectives: the Data Scientist who builds the app and the MLOps engineer who deploys it. We'll dive into optimizing model loading from Hugging Face Hub, implementing features like autoscaling and authentication, and securing your app against potential threats. By the end of this talk, you’ll be ready to design Streamlit apps that are functional and deployment-ready for the cloud. pyconde-pydata-2025-61786-you-don-t-think-about-your-streamlit-app-optimization-until-you-try-to-deploy-it-to-the-cloud PyCon: MLOps & DevOps Darya Petrashka en #### Talk Outline: 1. Introduction - The disconnect: challenges when transitioning a Streamlit app from development to deployment. - Why deployment considerations should influence app design. 2. Optimizing model loading from HuggingFace hub - Challenges: - Large model sizes slowing down app performance. - Inefficient loading processes increasing costs and user wait times. - Solutions: - Using Streamlit caching to reuse loaded models across sessions. - Preloading models during image build. - Deploying models and calling them as APIs - MLOps Perspective: How optimized model loading reduces deployment complexity and cloud costs. 3. AWS deployment considerations: autoscaling, authentication, and security - Autoscaling: - Challenges: Handling variable user traffic without incurring unnecessary costs. - Solutions: - Using Fargate with ECS for containerized apps with auto-scaling policies. - Setting thresholds to scale instances based on traffic and resource utilization. - Optimizing cost-performance balance with reserved vs. spot instances. - Authentication: - Challenges: Providing a secure and user-friendly authentication mechanism. - Solutions: - Integrating AWS Cognito for user management. - Adding role-based access control to limit app functionality based on user roles. - Security: - Challenges: Protecting the app from attacks and unauthorized access. - Solutions: - Using AWS Web Application Firewall (WAF) to block malicious traffic. - Configuring CloudFront to protect against DDoS attacks and improve performance. - Setting up HTTPS with Route 53 and TLS certificates for secure connections. - MLOps Perspective: Balancing simplicity and scalability in app deployment. 4. Secrets Storage - Challenges: Hardcoding sensitive credentials into the app. - Solutions: - Using AWS Secrets Manager or Parameter Store for secure secrets management. - Employing environment variables for flexible app configuration. - MLOps Perspective: How to ensure security without complicating deployment workflows. 5. Key Takeaways - Data Scientist’s Perspective: - Why it’s critical to consider performance, scalability, authentication, and security during app development. - MLOps Perspective: - How to simplify deployment while ensuring performance and security. - Encouraging collaboration between Data Scientists and MLOps engineers for smoother deployment processes. #### What you will learn: - How to efficiently load Hugging Face models in Streamlit apps to reduce costs and improve performance. - How to design apps with AWS autoscaling to handle variable traffic seamlessly. - Best practices for implementing user authentication with AWS Cognito. - How to secure your Streamlit app using cloud services. - Best practices for secure secrets management in Streamlit apps. - How to approach Streamlit app development with deployment in mind. false https://pretalx.com/pyconde-pydata-2025/talk/3VYSMS/ https://pretalx.com/pyconde-pydata-2025/talk/3VYSMS/feedback/ Europium2 What do a tree and the human brain have in common-a not so serious introduction to digital pathology Talk 2025-04-25T14:40:00+02:00 14:40 00:30 While trees and human brains don't share that many properties regarding their domain, the analysis of the height of a tree and cancer in human brains does. This talk provides a not-so-serious introduction to the domain of computer vision for pathological use cases. Besides a general introduction to (digital) pathology and the technical similarities between satellite images (GeoTIFs) and pathological images (Whole-Slide Images), we will take a look at computer vision for medical tasks using Python. Whether you have never done image processing in Python, are an expert (ready to share some tricks with me), or are just curious to see pictures of a human brain, this talk is for you. Warning: this talk contains quite abstract pink-ish pictures of human tissue (and trees^^). If you are unsure this is something you are comfortable with (have a friend), do a quick search for "HE-stained whole-slide image". pyconde-pydata-2025-61098-what-do-a-tree-and-the-human-brain-have-in-common-a-not-so-serious-introduction-to-digital-pathology PyData: Computer Vision (incl. Generative AI CV) Daniel Hieber en Inspired by last year's talk about the height of a tree [🌳 The taller the tree, the harder the fall. Determining tree height from space using Deep Learning and very high resolution satellite imagery 🛰️] and the strong similarities between optical high resolution satellite images and pathological images, this talk will give a not-so-serious introduction to a quite serious topic: Python for digital pathology. The main content is: - "Cancer detection" - An introduction to (digital) pathology (know your domain) - The similarities between a tree and your brain (technically speaking, there are a lot) - A shallow view of ML-based and conventional computer vision in Python with some practical use cases - Why we can steal (nearly) everything from radiology and get away with it - What potential pitfalls could be - How you can start doing medical computer vision on your own Warning: this talk contains quite abstract pink-ish pictures of human tissue (and trees^^). If you are unsure this is something you are comfortable with (have a friend), do a quick search for "HE-stained whole-slide image". false https://pretalx.com/pyconde-pydata-2025/talk/MJD7TG/ https://pretalx.com/pyconde-pydata-2025/talk/MJD7TG/feedback/ Hassium Beyond DALL-E: Advanced Image Generation Workflows with ComfyUI Talk 2025-04-25T10:15:00+02:00 10:15 00:30 Image generation using AI has made huge progress over the last years, and many people still think that DALL-E with a text prompt is the best way to generate images. There are well-known models like Stable Diffusion and Flux, which can be used with easy-to-use frontends like A1111 or Invoke AI, but if you want to do more complex or bleeding-edge workflows, you need something else. In this talk, I want to show you ComfyUI, an open-source node-based GUI written in Python where you can build complex pipelines that are otherwise only possible using plain code. pyconde-pydata-2025-61259-beyond-dall-e-advanced-image-generation-workflows-with-comfyui PyData: Computer Vision (incl. Generative AI CV) René Fa en Image generation using AI has made huge progress over the last years, and many people still think that DALL-E with a text prompt is the best way to generate images. But thanks to Stable Diffusion, Flux, and many supplementary models like ControlNet or an Image Prompt Model, we have much more control over the images we want to create. There are frontends for that, like A1111 or Invoke AI, but if you want to try bleeding-edge models or do something more complex, you will have a hard time implementing such a pipeline in code yourself, and it requires a steep learning curve. In this talk, I want to show you ComfyUI, an open-source node-based GUI written in Python where you can build workflows as a DAG. Thanks to many other contributors, there are a lot of plugins available which bring in new functionality. This talk shows the capabilities and power of this tool using practical examples and how you can combine many things together to create a complex workflow much faster than coding it yourself. I want to cover the following topics: - What are the limits of a simple text-to-image workflow? - What is ComfyUI? - What are the requirements to use ComfyUI? (Resources, OS, etc.) - What can you do with ComfyUI that you can't do with a simple text-to-image interface? - Pre- and post-processing of images in a single workflow - Advanced conditioning using images, bounding boxes, depth maps, etc., all together - The examples shown as a demonstration: - Integrating existing objects from a photo into a generated scenery - Creating optical illusions and surreal images false https://pretalx.com/pyconde-pydata-2025/talk/LRUKZQ/ https://pretalx.com/pyconde-pydata-2025/talk/LRUKZQ/feedback/ Hassium PosePIE: Replace Your Keyboard and Mouse With AI-Driven Gesture Control Talk 2025-04-25T10:55:00+02:00 10:55 00:30 In this talk, we show how to leverage publicly available tools to control any game or program using hand or body movements. To achieve this, we introduce PosePIE, an open-source programmable input emulator that generates input events on virtual gamepads, keyboards and mice based on gestures recognized by using AI-driven pose estimation. PosePIE is fully configurable by the user through Python scripts, making it easily adaptable to new applications. pyconde-pydata-2025-61171-posepie-replace-your-keyboard-and-mouse-with-ai-driven-gesture-control PyData: Computer Vision (incl. Generative AI CV) Daniel Stolpmann en Recent advancements in machine learning and AI hardware acceleration have enabled the use of complex models for solving computer vision problems in real-time applications. Pose estimation is one such problem, involving the detection of keypoints of the human body within an image. In this talk, we show how PosePIE uses pose estimation to control any game or program using hand or body movements. By using state-of-the-art models, PosePIE does not require expensive specialized sensors but works entirely on the monocular image from an off-the-shelf webcam. By leveraging readily available Graphics Processing Unit (GPU) hardware, it is able to do all processing at a high frame rate to support interactive applications. As PosePIE is fully configurable by the user through Python scripts, it can be easily adapted to new applications. This lowers the barrier to use pose estimation and gesture recognition in creative ways and for novel applications. The source code of PosePIE is available on GitHub under the GNU GPLv3+ license: https://github.com/tegtmeier-inkubator/PosePIE false https://pretalx.com/pyconde-pydata-2025/talk/AEUZGX/ https://pretalx.com/pyconde-pydata-2025/talk/AEUZGX/feedback/ Hassium Guardians of the Code: Safeguarding Machine Learning Models in a Climate Tech World Talk 2025-04-25T11:35:00+02:00 11:35 00:30 LLMs, Machine learning and AI are everywhere, yet their security is often overlooked, leaving your systems vulnerable to serious attacks. What happens when someone tampers with your model’s input, poisons your training data, or steals your model? In this talk, I’ll explore these risks through the lens of the OWASP Machine Learning Security Top 10 using relatable, real-world examples from the climate tech world. I’ll explain how these attacks happen, their impact, and why they matter to you as a Python developer, data scientist, or data engineer. You’ll learn practical ways to defend your models and pipelines, ensuring they’re robust against adversarial forces. Bridging theory and practice, you'll leave equipped with insights and strategies to secure your machine learning systems, whether you’re training models or deploying them in production. By the end, you’ll have a solid understanding of the risks, a toolkit of best practices, and maybe even a new perspective on how important security is everywhere. pyconde-pydata-2025-59620-guardians-of-the-code-safeguarding-machine-learning-models-in-a-climate-tech-world PyCon: MLOps & DevOps Doreen Sacker en Machine learning is applied to a variety of challenges in climate tech, from optimising renewable energy to forecasting energy demands or predicting solar production. We rely more on these models, but we often forget a critical piece: their security. What happens if someone tampers with your model’s inputs, poisons your training data, or sneaks malicious code into an open-source package you’re using? These attacks can throw off predictions and disrupt energy systems or even the grid itself. In this talk, I’ll walk you through the OWASP Machine Learning Security Top 10, using real-world examples from climate tech to show how these attacks can happen. I'll show you cases like manipulating energy consumption forecasts, poisoning datasets, or sneaking malware into open-source libraries used for climate modelling. It’s not just a hypothetical threat, these risks are real and the consequences can be serious. I’ll also share practical solutions you can use as a Python developer, data scientist, or data engineer to protect your models and systems. I’ll talk about securing your ML supply chain, validating data, and monitoring your pipelines for suspicious activity. You'll leave with strategies to defend your work so you can build systems that are not only smart but also safe and reliable. Why does this matter? Because in climate tech, the stakes are incredibly high. The predictions we make and the systems we build influence the grid, energy policies, resource allocation, and consumers trust. During the talk, we'll cover: - How attacks on machine learning models can disrupt climate tech applications. - Examples of adversarial attacks, poisoned datasets, and supply chain vulnerabilities in renewable energy systems. - Practical steps to protect your machine learning pipelines. - Why security should be at the core of any ML project, especially in mission-critical fields like climate tech. Outline of the Talk: 1. Why Security in Climate Tech Machine Learning Matters - How machine learning is powering renewable energy and climate solutions. - What can go wrong when systems are vulnerable. 2. Breaking Down the OWASP ML Security Top 10 - Input manipulation: How attackers trick models with tampered data. - Data poisoning: Real-life example of skewing optimization models with bad data. - Supply chain attacks: How a hacked library could disrupt energy demand predictions. 3. Real-World Impact of Attacks - Manipulated energy consumption forecasts causing grid instability. - Corrupted solar panel efficiency datasets leading to poor resource allocation. 4. How to Protect Your Models - How to spot tampered inputs. - Data validation, cleaning and checking datasets. - Best practices for safe use of open-source libraries. - Monitoring and auditing: Setting up checks for unusual activity in your pipelines. Key Takeaways - Recap of risks and defences. - Practical steps you can take today to secure your ML systems. - A call to prioritize security as a core part of building trustworthy ML. Climate tech is one of the most exciting and meaningful areas to work in. The systems we’re building have the potential to shape a more sustainable future. But if we don’t make security a priority, we risk undermining the customer's trust. This talk will give you the tools and confidence to keep your machine learning models safe and ensure they’re as reliable and impactful as they need to be. #ThereIsNoPlanetB false https://pretalx.com/pyconde-pydata-2025/talk/CVMPVG/ https://pretalx.com/pyconde-pydata-2025/talk/CVMPVG/feedback/ Hassium Vector Streaming: The Memory Efficient Indexing for Vector Databases Talk 2025-04-25T13:20:00+02:00 13:20 00:30 Vector databases are everywhere, powering LLMs. But indexing embeddings, especially multivector embeddings like ColPali and Colbert, at a bulk is memory intensive. Vector streaming solves this problem by parallelizing the tasks of parsing, chunking, and embedding generation and indexing it continuously chunk by chunk instead of bulk. This not only increase the speed but also makes the whole task more optimized and memory efficient. The library gives many vector database supports, like Pinecone, Weavaite, and Elastic. pyconde-pydata-2025-61257-vector-streaming-the-memory-efficient-indexing-for-vector-databases General: Rust Sonam PankajAkshay Ballal en Embedding creation is mostly done synchronously; a lot of time is wasted while the chunks are being created, as chunking is not a compute-heavy operation. As the chunks are being made, passing them to the embedding model would be efficient. This problem further intensifies with late interaction embeddings like CoLBert or ColPali. The solution is to create an asynchronous chunking and embedding task. We can effectively spawn threads to handle this task using Rust's concurrency patterns and thread safety. This is done using Rust's MPSC (Multi-producer Single Consumer) module, which passes messages between threads. Thus, this creates a stream of chunks passed into the embedding thread with a buffer. Once the buffer is complete, it embeds the chunks and sends the embeddings back to the main thread, where they are sent to the vector database. This ensures no time is wasted on a single operation and no bottlenecks. Moreover, only the chunks and embeddings in the buffer are stored in the system memory. They are erased from the memory once moved to the vector database. All this is then bound into Python using pyo3 and maturin, so it's easily accessible from Python, but the core is still asynchronous with rust. false https://pretalx.com/pyconde-pydata-2025/talk/SFDRTR/ https://pretalx.com/pyconde-pydata-2025/talk/SFDRTR/feedback/ Hassium Pipeline-level differentiable programming for the real world Talk 2025-04-25T14:00:00+02:00 14:00 00:30 Automatic Differentiation (AD) is not only the backbone of modern deep learning but also a transformative tool across various domains such as control systems, materials science, weather prediction, 3D rendering, data-driven scientific discovery, and so on. Thanks to a mature ML framework ecosystem, powered by libraries like PyTorch and JAX, AD performs remarkably well at a component level; however, integrating these components into differentiable pipelines still remains a significant challenge. In this talk, we will provide an accessible introduction to (pipeline-level) AD, demonstrate some cool applications you can build with it, and see how to build differentiable pipelines that hold up in the real world. pyconde-pydata-2025-61137-pipeline-level-differentiable-programming-for-the-real-world PyData: Research Software Engineering Alessandro Angioi en The tools enabling automatic differentiation (AD), like JAX and PyTorch, are increasingly being adopted beyond machine learning to tackle optimization problems in various scientific and engineering contexts. These tools have catalyzed the development of differentiable simulators, solvers, 3D renderers, and other powerful components, under the umbrella of differentiable programming (DP). However, building pipelines that propagate gradients effortlessly across components introduces unique challenges. Real-world pipelines often span diverse technologies, frameworks (e.g., JAX, TensorFlow, PyTorch, Julia), computing environments (local vs. distributed clusters; CPU vs. GPU), and teams with varying expertise. Additionally, legacy systems and non-differentiable components often need to coexist with modern AD-enabled frameworks. This talk will provide an overview of differentiable pipelines: why they matter and the types of optimization problems they address. We will revisit foundational concepts of automatic differentiation to set the stage for understanding the intricacies of orchestrating differentiable pipelines in Python. Then, using our open-source project, Tesseract, as a case study, we will share lessons learned and best practices for designing AD-friendly APIs with tools like Pydantic and FastAPI, achieving seamless integration with JAX, packaging scientific software, and enabling end-to-end systems-level optimization. Attendees will leave with practical insights on why they should care about differentiable programming, and how to overcome the challenges of building real-world differentiable pipelines. false https://pretalx.com/pyconde-pydata-2025/talk/JH97CL/ https://pretalx.com/pyconde-pydata-2025/talk/JH97CL/feedback/ Hassium From Rules to Reality: Python's Role in Shaping Roundnet Talk 2025-04-25T14:40:00+02:00 14:40 00:30 Roundnet is a dynamic and fast-growing sport that combines quick reaction, athleticism, and strong community. However, like many emerging sports, it faces challenges in balancing competition, optimizing rules, and increasing accessibility for both players and spectators. This is where Python and data analysis come into play. In this talk, I'll share insights from my role as Data Lead on the International Roundnet rule committee, where we use Python-powered data analysis to make informed decisions about the future of the sport. We'll explore how analyzing gameplay patterns and testing rule changes with simulation can lead to fairer, more exciting games and attract a broader audience. pyconde-pydata-2025-60403-from-rules-to-reality-python-s-role-in-shaping-roundnet PyData: Data Handling & Engineering Larissa Haas en Roundnet is a dynamic and fast-growing sport that combines quick reaction, athleticism, and strong community. But what's truly unique about Roundnet is the opportunity it offers: as a new and emerging sport, we have the rare chance to shape its global rule changes entirely through data analysis. This is a groundbreaking approach – a first for any sport in the modern era. In this talk, I’ll share insights from my role as Data Lead on the International Roundnet Rule Committee and take you through how we are leveraging Python and data analysis to guide these changes. Over the past year, we’ve collected rule proposals and have set up a series of experiments designed to test their effects. Using Python and statistical modeling, we’re planning to select key tournaments worldwide this year to observe how rule adjustments impact gameplay data. Our ultimate goal is to discover if specific combinations of rule changes can make Roundnet fairer, more exciting, and accessible for players and spectators alike. This journey is an exploration of how data-driven decision-making can transform a sport from the ground up, using real-world insights and experimentation. This talk will take you on the journey of how we set up our testing framework, what tools we’re using, and how we’ve employed Python-powered analysis to bring empirical evidence into the decision-making process. It will equip you with a new perspective on data's role in shaping real-world change – especially in grassroots movements, community building, and sports. false https://pretalx.com/pyconde-pydata-2025/talk/ZT3MGL/ https://pretalx.com/pyconde-pydata-2025/talk/ZT3MGL/feedback/ Palladium Towards Intelligent Monitoring: Detecting Degraded Flame Torch Nozzles Talk 2025-04-25T10:15:00+02:00 10:15 00:30 Flame cutting is a method where metals are efficiently cut using precise control of the oxygen jet and consistent mixing of fuel gas. The condition of the nozzle is changing over time: deposits formed during the cutting process can degrade the flame quality, reducing the precision of the cut. Traditionally, nozzles suspected of wear are sent back for manual inspection, where experts evaluated the flame visually and audibly to determine whether repair or replacement is needed. This project leverages machine learning to optimize this process by analyzing acoustic emission data. pyconde-pydata-2025-61203-towards-intelligent-monitoring-detecting-degraded-flame-torch-nozzles PyData: Machine Learning & Deep Learning & Statistics Dominik Falkner en Flame cutting is a technique that enables efficient metal cutting by precisely controlling the oxygen jet and maintaining a consistent mix of fuel gas. Over time, the nozzle’s condition deteriorates as deposits accumulate during the cutting process, leading to a decline in flame quality and cutting precision. Currently, nozzle testing is performed manually, with experts assessing the flame based on its appearance and sound. This approach is risky because worn nozzles can remain in use, increasing the danger of high-temperature material being ejected. Moreover, it is a costly process, particularly when damage to industrial equipment occurs. Laboratory Evaluation: This section outlines the preliminary experiments aimed at assessing whether this sensor is suitable for distinguishing different machine states. The experiments focus on identifying the optimal sensor placement and analyzing how various machine states impact sensor readings. The design process for the laboratory experiments and the subsequent systematic data collection is shown. The results suggest that while detecting every machine state may not be feasible, the sensor shows promise in identifying degraded nozzles. Data Preprocessing & Annotation: For a proof of concept, the raw acoustic emission data required manual labeling, as prior assessments depended on expert evaluations. Here, we utilized Label Studio, an annotation tool that streamlines the labeling process. Modelling & Feature Engineering: We extract features using statistical methods and transform the acoustic emission signals into the frequency domain through scipy, focusing on features in the frequency domain. Evaluation: We discuss the approach for splitting the data, considering that multiple observations from the same nozzle are present. In a computational study, we evaluate the feature sets developed in the previous step using two different classification models: Support Vector Classifier and Multilayer Perceptron. This section explains how the experiments are computed and parallelized including the time required for execution. Lastly, we discuss the dataset's limitations and the challenges faced during development. We also highlight steps taken to improve generalization and provide an outlook on future objectives, mostly aimed at a broader applicability of the models. false https://pretalx.com/pyconde-pydata-2025/talk/89BX8V/ https://pretalx.com/pyconde-pydata-2025/talk/89BX8V/feedback/ Palladium Filling in the Gaps: When Terraform Falls Short, Python and Typer Step In Talk 2025-04-25T10:55:00+02:00 10:55 00:30 Not all resources in today’s cloud environments have native Terraform providers. That’s where Python’s Typer library can step in, offering a flexible, production-ready command-line interface (CLI) framework to help fill in the gaps. In this session, we’ll explore how to integrate Typer with Terraform to manage resources that fall outside Terraform’s direct purview. We’ll share a real-life example of how Typer was used alongside Terraform to automate and streamline the management of an otherwise unsupported API. You’ll learn how Terraform can invoke Python scripts—passing arguments and parameters to control complex operations—while still benefiting from Terraform’s declarative model and lifecycle management. We’ll also discuss best practices for defining resource lifecycles to ensure easy maintainability and consistency across deployments. By the end, participants will see how combining Terraform’s robust infrastructure-as-code approach with Python’s versatility and Typer’s user-friendly CLI can create a powerful, cohesive strategy for managing even the trickiest resources in production environments. pyconde-pydata-2025-61281-filling-in-the-gaps-when-terraform-falls-short-python-and-typer-step-in General: Infrastructure - Hardware & Cloud Yuliia Barabash en In this session, we’ll address a common challenge in managing resources and APIs that lack native Terraform providers but still need to integrate seamlessly into your CI/CD pipeline. I’ll demonstrate how Python’s Typer library can help bridge this gap by offering a straightforward yet powerful command-line interface (CLI). I’ll explain how to create and configure Typer applications, pass parameters, and integrate these scripts with Terraform. 1. Problem Statement (Managing APIs or resources with incomplete Terraform provider support) - 5 mins 2. Typer (Key components, advantages, and how to use in production enviroment) - 10 mins 3. Terraform resources that can execute CLI and how to work with them - 10 mins 4. Conclusion - 2 mins false https://pretalx.com/pyconde-pydata-2025/talk/CZXBEP/ https://pretalx.com/pyconde-pydata-2025/talk/CZXBEP/feedback/ Palladium Code & Community: The Synergy of Community Building and Task Automation Talk 2025-04-25T11:35:00+02:00 11:35 00:30 The Python community is built on a culture of support, inclusion, and collaboration. Sustaining this welcoming environment requires intentional community-building efforts, which often involve repetitive or time-consuming tasks. These tasks, however, can be automated without compromising their value—freeing up time for meaningful human engagement. This talk showcases my project aimed at supporting underrepresented groups in tech, specifically through building Python communities on Mastodon and Bluesky. A key part of this initiative is the "Awesome PyLadies" repository, a curated collection of PyLadies blogs and YouTube channels that celebrates their work. To enhance visibility, I created a PyLadies bot for social media. This bot automates regular posts and reposts tagged content, significantly extending their reach and fostering an engaged community. In this session, I’ll cover: - The role of automation in community building - The technical architecture behind the bot - A hands-on demo on integrating Google’s Gemini into community tools - Upcoming features and opportunities for collaboration By combining Python, automation, and modern AI capabilities, we can create thriving, inclusive communities that scale impact while staying true to the human-centered ethos of open source. pyconde-pydata-2025-60381-code-community-the-synergy-of-community-building-and-task-automation PyData: Natural Language Processing & Audio (incl. Generative AI NLP) Cosima Meyer en My planned outline for the talk is as follows: - **Introduction**: A brief overview of the project and its goals, focusing on community building and inclusivity within the Python ecosystem (3 minutes) - **The Importance of Visibility**: Explain the background of the project and why visibility is important (3 minutes) - **Bot Architecture and Setup**: A technical walkthrough of the bot, its architecture, and how it operates to extend the reach of community content on platforms like Mastodon or Bluesky (5 minutes) - **Hands-On Demo: Task Automation with Google’s Gemini and GitHub Actions**: A step-by-step guide to integrating Google’s Gemini and GitHub Actions for creating low-barrier, automated workflows tailored for community-building tasks (12 minutes) - **Looking Ahead**: Provide a forward-looking perspective (upcoming features of the project and future developments) (2 minutes) - **Q&A and Buffer** (5 minutes) I hope that the talk will inspire more Pythonistas to automate their tasks, and also more PyLadies to share material publicly and make the public perception of experts in the field more diverse. false https://pretalx.com/pyconde-pydata-2025/talk/PLMJZ8/ https://pretalx.com/pyconde-pydata-2025/talk/PLMJZ8/feedback/ Palladium What we talk about when we talk about AI skills. Talk 2025-04-25T13:20:00+02:00 13:20 00:30 Defining what constitutes AI skills has always been ambiguous. As AI adoption accelerates across industries and the European AI Act mandates companies to ensure AI literacy among their staff, organizations face growing even more challenges in defining and developing AI competencies. In this talk, we'll present a comprehensive framework developed by the appliedAI Institute's experts that categorizes AI skills across technical, regulatory, strategic, and innovation domains. We'll also share initial data on current AI skills levels and upskilling needs and provide practical strategies for organizations to assess, develop, and acquire the AI capabilities required for their specific needs. pyconde-pydata-2025-61889-what-we-talk-about-when-we-talk-about-ai-skills General: Education, Career & Life Paula Gonzalez Avalos en What it means to "work in/with AI" and the corresponding roles, tasks, and required skills have been ambiguous since the emergence of AI professionals in industry. And while the demand for AI-skilled professionals continues to grow, both organizations and individuals seeking to work in the AI field often struggle with two challenges: 1) clearly defining the competencies and responsibilities that positions and projects require, and 2) identifying appropriate upskilling opportunities to match these needs. The urgency to upskill professionals in AI topics has not only become more nuanced since the emergence of generative AI but is also growing rapidly. This trend is further amplified by the upcoming European AI Act, which will soon require companies to "ensure, to their best extent, a sufficient level of AI literacy among their staff." This regulation has created an urgent need to define and understand what constitutes AI literacy and AI skills in practical terms. To help organizations and professionals navigate this landscape, we have developed a comprehensive framework categorizing AI skills into distinct domains spanning technical competencies, regulatory knowledge, AI strategy, and ecosystem understanding. Our framework, developed by the multidisciplinary team of AI experts in the appliedAI Institute for Europe, provides a structured approach to defining skill requirements, guiding career development, identifying training gaps, and helping educational providers align their offerings with market demands. In this presentation, we will introduce our framework and share initial data reflecting the current state of AI skills levels and upskilling needs across a sample of companies. We will also discuss practical strategies for implementing this AI Skills framework within organizations, enabling them to better assess, develop, and acquire the AI capabilities they need to fulfill their specific needs. false https://pretalx.com/pyconde-pydata-2025/talk/98FQDY/ https://pretalx.com/pyconde-pydata-2025/talk/98FQDY/feedback/ Palladium Optimizing Energy Tariffing System with Formal Concept Analysis and Dash Talk 2025-04-25T14:00:00+02:00 14:00 00:30 As a data scientist, I value the power of insightful visualizations to unlock unique interpretations of complex data. In my talk, I will introduce an elegant mathematical framework called Formal Concept Analysis (FCA), developed in the 1980s in Darmstadt. FCA transforms binary data into concepts that can be visualized as a hierarchical graph, offering a fresh perspective on multidimensional data analysis. Leveraging this theory and its open-source Python libraries, I am developing an interactive Dash-based tool featuring interactive tables and graphs to explore data insights. To illustrate its potential, I will showcase an optimization of the entire tariffing system of an energy provider company, highlighting how FCA can bring structure and clarity to even such tangled datasets. pyconde-pydata-2025-60771-optimizing-energy-tariffing-system-with-formal-concept-analysis-and-dash PyData: Visualisation & Jupyter Dr. Irina Smirnova-Pinchukova en My goal is to introduce Formal Concept Analysis (FCA) as a fascinating mathematical framework. I aim to inspire Python enthusiasts to explore its potential and uncover insights in their data analysis tasks. The talk is divided into three sections: 1 FCA Basics - What is a "concept"? *First, I am going to introduce the main terms used in FCA and define the central object of the theory - the formal concept.* - Illustrative example. *To show the power of FCA in action, I will provide a relatable example to explain the hierarchical structure of the graph visualization.* 2 Python Implementation - ``fcapy`` Python library. *Core functionality overview of the library and the data formats it can use.* - Introducing interactivity with Python Dash: *Enhancing exploration and user experience with interactive tables (AG Grid) and dynamic graph visualizations (Cytoscape).* 3 Applications and Practical Relevance - Use Case: Energy Tariffing System Optimization. *In this section, I am going to showcase the real data in its original complexity and the optimization process of identifying redundancies, overlaps, or inefficiencies.* - Examples of other applications and key takeaways false https://pretalx.com/pyconde-pydata-2025/talk/B8TUR9/ https://pretalx.com/pyconde-pydata-2025/talk/B8TUR9/feedback/ Palladium Langfuse, OpenLIT, and Phoenix: Observability for the GenAI Era Talk 2025-04-25T14:40:00+02:00 14:40 00:30 Large Language Models (LLMs) are transforming digital products, but their non-deterministic behaviour challenges predictability and testing, making observability essential for quality and scalability. This talk presents **observability for LLM-based applications**, spotlighting three tools: Langfuse, OpenLIT, and Phoenix. We'll share best practices about what and how to monitor LLM features and explore each tool's strengths and limitations. Langfuse excels in tracing and quality monitoring but lacks OpenTelemetry support and customization. OpenLIT, while less mature, integrates well with existing observability stacks using **OpenTelemetry**. Phoenix stands out in debugging and experimentation but struggles with real-time tracing. The comparison will be enhanced by **live coding examples**. Attendees will walk away with an improved understanding of observability for **GenAI applications** and will understand which tool to use for their use case. pyconde-pydata-2025-61375-langfuse-openlit-and-phoenix-observability-for-the-genai-era PyCon: Python Language & Ecosystem Emanuele Fabbiani en Large Language Models (LLMs) are becoming core components of modern digital products. However, their **non-deterministic nature** means that their behaviour cannot be fully predicted or tested before deployment. This makes **observability** an essential practice for building and maintaining applications with generative AI features. This session focuses on observability in LLM-based systems. We start by motivating why monitoring and understanding your application is key to ensuring quality, reliability, and scalability. We’ll analyze three leading tools for observability in this domain: **Langfuse**, **OpenLIT**, and **Phoenix**. Each has unique strengths and challenges that make them suitable for different use cases. Through examples and real-world scenarios, we’ll explore: - How **Langfuse** provides detailed tracing and quality monitoring through developer-friendly APIs. While it supports multi-step workflows effectively, it lacks support for the OpenTelemetry protocol and can be difficult to customize for non-standard use cases. - Why **OpenLIT**, built on OpenTelemetry, offers strong observability for distributed systems. Although it is the least mature of the three tools, it integrates well with established observability stacks and has promising potential for future growth. - Where **Phoenix** fits into the process by combining experimentation and debugging capabilities with evaluation pipelines. Its strength lies in development-focused observability, but it has limitations in handling real-time tracing once systems are in production. This talk will provide a clear, straightforward comparison of these tools, helping you understand which option best fits your LLM applications. You’ll leave with practical insights into how observability can enhance the reliability and performance of your generative AI systems. false https://pretalx.com/pyconde-pydata-2025/talk/HKYQDB/ https://pretalx.com/pyconde-pydata-2025/talk/HKYQDB/feedback/ Ferrum Agentic AI: Build a Multi-Agent Application with CrewAI Tutorial 2025-04-25T10:15:00+02:00 10:15 01:30 This hands-on tutorial will dive into the fundamentals of building multi-agent systems using the CrewAI Python library. Starting from the basics, we’ll cover key concepts, explore advanced features, and guide you step-by-step through building a complete application from scratch. We’ll discuss implementing guardrails, securing interactions, and preventing query injection vulnerabilities along the way. pyconde-pydata-2025-61420-agentic-ai-build-a-multi-agent-application-with-crewai PyData: Generative AI Alessandro Romano en ### **Short Abstract** **Agentic AI: Build a Multi-Agent Application with CrewAI** In this hands-on tutorial, we’ll dive into the fundamentals of building multi-agent systems using the CrewAI Python library. Starting from the basics, we’ll cover key concepts, explore advanced features, and guide you step-by-step through building a complete application from scratch. Along the way, we’ll discuss implementing guardrails, securing interactions, and preventing query injection vulnerabilities. --- ### **Detailed Description** This tutorial introduces **Agentic AI**—a design approach where multiple agents collaborate to solve complex tasks efficiently. Using the **CrewAI Python library**, we’ll start with the fundamentals and progressively move towards advanced concepts, focusing on practical implementation. #### **What We’ll Cover:** 1. **Understanding Agentic AI:** Core principles and why multi-agent systems are valuable. 2. **Getting Started with CrewAI:** Setting up the library and creating simple agents. 3. **Advanced Agent Interactions:** Defining workflows, collaboration patterns, and communication protocols. 4. **Building from Scratch:** Step-by-step guide to developing a complete multi-agent application. 5. **Implementing Guardrails:** Techniques to ensure agents operate within defined constraints. 6. **Preventing Query Injection:** Strategies for securing agent queries against malicious inputs. #### **Why Attend?** By the end of this session, you’ll have hands-on experience building an agent-based application, understand how to implement security measures, and be equipped with best practices for maintaining control over agent behavior. Whether you're new to agentic systems or looking to refine your skills, this tutorial will provide both the theory and the practical insights needed to start building with CrewAI. **Prerequisites:** - **OpenAI Key** or another LLM or Cloud provider. This is needed to implement the solutions. - **SerperDev Tool Key** from https://serper.dev/. (Free Trial is more than enough) - Familiarity with Python and basic AI concepts will help you get the most out of this session. LINK TO THE WORKSHOP WEBSITE: https://pigna90.github.io/crewai-workshop-pyconde-2025 false https://pretalx.com/pyconde-pydata-2025/talk/SVLRGG/ https://pretalx.com/pyconde-pydata-2025/talk/SVLRGG/feedback/ Ferrum Reinforcement Learning for Finance Tutorial 2025-04-25T13:05:00+02:00 13:05 01:30 Reinforcement Learning and related algorithms, such as Deep Q-Learning (DQL), have led to major breakthroughs in different fields. DQL, for example, is at the core of the AIs developed by DeepMind that achieved superhuman levels in such complex games as Chess, Shogi, and Go ("AlphaGo", "AlphaZero"). Reinforcement Learning can also be beneficially applied to typical problems in finance, such as algorithmic trading, dynamic hedging of options, or dynamic asset allocation. The workshop addresses the problem of limited data availability in finance and solutions to it, such as synthetic data generation through GANs. It also shows how to apply the DQL algorithm to typical financial problems. The workshop is based on my new O'Reilly book "Reinforcement Learning for Finance -- A Python-based Introduction". pyconde-pydata-2025-61808-reinforcement-learning-for-finance PyData: Machine Learning & Deep Learning & Statistics Dr. Yves J. Hilpisch en Reinforcement Learning and related algorithms, such as Deep Q-Learning (DQL), have led to major breakthroughs in different fields. DQL, for example, is at the core of the AIs developed by DeepMind that achieved superhuman levels in such complex games as Chess, Shogi, and Go ("AlphaGo", "AlphaZero"). Reinforcement Learning can also be beneficially applied to typical problems in finance, such as algorithmic trading, dynamic hedging of options, or dynamic asset allocation. The workshop addresses the problem of limited data availability in finance and solutions to it, such as synthetic data generation through GANs. It also shows how to apply the DQL algorithm to typical financial problems. The workshop covers the following topics: * Learning through interaction * Deep Q-Learning applied to Finance * Synthetic Data Generation * Dynamic Asset Allocation with DQL The workshop is based on my new O'Reilly book "Reinforcement Learning for Finance -- A Python-based Introduction". false https://pretalx.com/pyconde-pydata-2025/talk/VBW3EK/ https://pretalx.com/pyconde-pydata-2025/talk/VBW3EK/feedback/ Ferrum Intuitive A/B Test Evaluations for Coders Talk 2025-04-25T14:40:00+02:00 14:40 00:30 A/B testing is a critical tool for making data-driven decisions, yet its statistical underpinnings—p-values, confidence intervals, and hypothesis testing—are often challenging for those without a background in statistics. Coders frequently encounter these concepts but lack a straightforward way to compute and interpret them using their existing skill set. This talk presents a practical approach to A/B test evaluations tailored for coders. By utilizing Python’s random number generator and basic loops, it introduces bootstrapping as an accessible method for calculating p-values and confidence intervals directly from data. The goal is to simplify statistical concepts and provide coders with an intuitive understanding of how to evaluate test results without relying on complex formulas or statistical jargon. pyconde-pydata-2025-61237-intuitive-a-b-test-evaluations-for-coders PyData: Machine Learning & Deep Learning & Statistics Thomas Mayer en Making A/B Test Evaluations Intuitive for Coders: A Python-Based Approach A/B testing is an essential method for data-driven decision-making, but interpreting the results can be daunting. Complex jargon around p-values and confidence intervals often creates barriers to understanding. This talk simplifies A/B testing by introducing a practical, Python-powered approach using bootstrapping—a flexible and accessible method that aligns with how software engineers think and works without requiring statistical knowledge. Session Highlights: 1. Statistical Significance and Hypothesis Testing: * Why is statistical testing crucial for A/B tests? Simple comparisons overlook randomness. * Using Python, we’ll demonstrate how to simulate "what-if" scenarios by shuffling and resampling data, allowing participants to compute p-values and understand the likelihood of observed differences occurring by chance. 2. Confidence Intervals with Bootstrapping: * Confidence intervals clarify the range of plausible outcomes. * We’ll explore how to resample experiment data repeatedly to estimate variability and construct intuitive confidence intervals—all using basic tools like random number generators and loops, without requiring advanced math. * Key Takeaways: * Hands-on skills to compute p-values and confidence intervals using basic programming concepts. * Clear, step-by-step demonstrations of shuffling, resampling, and generating statistical insights. * Practical knowledge to move beyond black-box libraries and understand the "why" and "how" behind A/B test evaluations. By the end of the session, attendees will be equipped to demystify A/B testing with a coder-friendly workflow, empowering them to make confident, data-driven decisions in their projects. Talk Outline: 1. Setting the Stage (5 minutes) * What is A/B testing? * Why isn't it enough to just compare numbers? Why do we need statistics to interpret results? 2. Statistical Significance and P-Values (5 minutes) * Statistical tests (t-test, z-test, binomial test) are frequently used, but what is the intuition behind them? * Introducing the basic idea of bootstrapping. 3. Bootstrapping Explained (8 minutes) * Step-by-step illustration of the bootstrapping approach. * What is a p-value? An intuitive description using resampling. 4. Confidence Intervals Explained (7 minutes) * Importance of confidence intervals and how they help interpret results. * Intuitive computation of confidence intervals using bootstrapping. * Impact of sample size on confidence intervals and certainty. 5. Why These Statistics Matter (5 minutes) * Discussion on the practical necessity of statistical techniques. * How these methods ensure data-driven decision-making in A/B testing. false https://pretalx.com/pyconde-pydata-2025/talk/RQ8JBM/ https://pretalx.com/pyconde-pydata-2025/talk/RQ8JBM/feedback/ Dynamicum The Mighty Dot - Customize Attribute Access with Descriptors Tutorial 2025-04-25T10:15:00+02:00 10:15 01:30 Whenever you use a dot after an object in Python you access an attribute. While this seems a very simple operation, behind the scenes many things can happen. This tutorial looks into this mechanism that is regulated by descriptors. You will learn how a descriptor works and what kind of problems it can help to solve. Python properties are based on descriptors and solve one type of problems. Descriptors are more general, allow more use cases, and are more re-usable. Descriptors are an advanced topic. But once mastered, they provide a powerful tool to hide potentially complex behavior behind a simple dot. pyconde-pydata-2025-60503-the-mighty-dot-customize-attribute-access-with-descriptors PyCon: Python Language & Ecosystem Mike Müller en Whenever you use a dot in Python you access an attribute. While this seems a very simple operation, behind the scenes many things can happen. This tutorial looks into this mechanism that is regulated by descriptors. You will learn how a descriptor works and what kind of problems it can help to solve. Python properties are based on descriptors and solve one type of problems. Descriptors are more general, allow more use cases, and are more re-usable. Descriptors are an advanced topic. But once mastered, they provide a powerful tool to hide potentially complex behavior behind a simple dot. In this tutorial you will: * Learn how to use Python's descriptors to add new functionality to attribute access * Acquired solid background knowledge on how descriptors work * Work with practical examples for applying descriptors * Learn when to use a property or reach for a descriptor * Get to know how popular Python libraries apply descriptors for tasks such as data structure access, REST-APIs, ORMs, and serialization false https://pretalx.com/pyconde-pydata-2025/talk/WJPEQH/ https://pretalx.com/pyconde-pydata-2025/talk/WJPEQH/feedback/ Dynamicum What's inside the box? Building a deep learning framework from scratch. Tutorial 2025-04-25T13:05:00+02:00 13:05 01:30 Explore the inner workings of deep learning frameworks like TensorFlow and PyTorch by building your own in this workshop. We will start with the fundamental automatic differentiation mechanics and proceed to implementing more complex components like layers, modules and optimizers. This workshop is mainly designed for experienced data scientists, who want to expand their intuition about lower level framework internals. pyconde-pydata-2025-60835-what-s-inside-the-box-building-a-deep-learning-framework-from-scratch PyData: Machine Learning & Deep Learning & Statistics Oleh Kostromin en Data scientists typically concentrate on the mathematical foundations when designing and training neural networks, often treating the process by which deep learning frameworks link high-level code with lower-level mathematical operations as a black box. As a result, the internal workings of these frameworks are frequently overlooked. This workshop is aimed to open the black box by letting the participants construct a small deep learning framework from scratch. We will begin with creating a simple automatic differentiation engine, followed by more advanced elements such as modules, and optimizers. As a result, the participants will be able to construct and train a neural networks architecture using the framework they have built in just 1.5 hours. The detailed text guide and solutions for all of the exercises are going to be provided as a public GitHub repository. After constructing the framework from scratch, the participants will gain a comprehensive understanding of: - the inner workings of deep learning frameworks; - the mapping of high-level framework components to lower-level operations; - the operational principles of autograd engine and dynamic computational graphs; - higher-level abstractions such as modules, and their mechanisms of automatic parameters tracking; **Target audience** This workshop is primarily intended for those with some experience in building deep learning models using popular frameworks like PyTorch, TensorFlow, or JAX. However, prior experience is not absolutely mandatory, as essential fundamentals will be briefly covered. **Outline** Introduction, motivation and essential theory [15 min] Implementation [60 min] Tensors + autograd engine [25 min] Modules and layers [25 min] Optimizers [10 min] Using the framework to build and train the model [10 min] Concluding remarks + sharing bonus exercises [5 min] false https://pretalx.com/pyconde-pydata-2025/talk/LBKU3T/ https://pretalx.com/pyconde-pydata-2025/talk/LBKU3T/feedback/ Dynamicum The Forecast Whisperer: Secrets of Model Tuning Revealed Talk 2025-04-25T14:40:00+02:00 14:40 00:30 Forecasting can often feel like interpreting vague signals—unclear yet full of potential. In this talk, we’ll cover advanced techniques for tuning forecasting models in professional settings, moving beyond the basics to explore methods that enhance both accuracy and interpretability. You’ll learn: How to set clear business goals for ML model tuning and align technical work with business needs, including balancing forecast granularity and accuracy and selecting statistically correct metric. Practical data preparation methods, including business-driven data cleaning and detecting data problems with statistical and buiness driven approaches. Advanced feature selection techniques such as recursive feature elimination and SHAP values, alongside hyperparameter tuning strategies including Bayesian optimization and ensemble methods. How generative AI can support model tuning by automating feature generation, hyperparameter search, and enhancing model explainability through SHAP and LIME techniques. Real-world case studies, including how Blue Yonder’s data science team optimized demand forecasting models for retail and supply chain applications. We'll also discuss common mistakes like overfitting and data leakage, best practices for reliable validation, and the importance of domain knowledge in successful forecasting. Whether you're a seasoned data scientist or exploring time series forecasting, you'll gain advanced insights and techniques you can apply immediately. pyconde-pydata-2025-61781-the-forecast-whisperer-secrets-of-model-tuning-revealed PyData: Machine Learning & Deep Learning & Statistics Illia Babounikau en Forecasting can often feel like trying to make sense of unclear patterns—difficult to interpret but rich with potential. This talk clarifies the process, focusing on actionable steps for tuning forecasting models in professional environments where accuracy and performance drive business outcomes. 1. Defining Clear Business Objectives: Importance of aligning machine learning efforts with tangible business goals. Scoping forecasting problems and selecting appropriate success metrics. 2. Data Preparation Techniques: Cleaning data with a focus on business relevance and systematically enriching it In addition, we show how to tune the model by tuning the data nad the corresponding feature engineering. 3. Feature Selection and Hyperparameter Tuning: Advanced feature selection strategies and their impact on model performance. Techniques for identifying impactful features. Best practices for hyperparameter tuning and optimization strategies. 4. The Role of interpretability and Generative AI in Model Tuning: Automating feature generation. Hyperparameter optimization techniques using generative AI. model tuning through model interpretation 5. Real-World Applications and Case Studies: How Blue Yonder improved retail forecast accuracy. Lessons learned from industry case studies. 6. Common Pitfalls and Best Practices: Typical mistakes made during model tuning. Best practices for ensuring model reliability and relevance. The importance of domain knowledge in successful forecasting. Conclusion: Whether you are a seasoned data scientist or just starting your forecasting journey, this session will provide you with actionable insights to fine-tune your forecasting models effectively. Expect practical techniques, real-world examples, and expert tips that you can apply immediately. Join us and learn how better forecasts lead to better business decisions. false https://pretalx.com/pyconde-pydata-2025/talk/YLKDJK/ https://pretalx.com/pyconde-pydata-2025/talk/YLKDJK/feedback/