PyConDE & PyData Berlin 2024

0.17 PyConDE & PyData Berlin 2024 pyconde-pydata-2024 2024-04-22 2024-04-24 3 00:05 https://pretalx.com Europe/Berlin Kuppelsaal Keynote - A View From My Window - An Outside Perspective of Open Source Scientific Computing From the Inside Keynote 2024-04-22T10:15:00+02:00 10:15 00:45 Twelve years as the Executive Director of NumFOCUS has given me a unique perspective of the open source scientific ecosystem. Building an organization to support project communities has taken me down many roads. Navigating these paths has been rewarding and challenging. We will look at lessons learned as I share my experiences through observations and insights on projects, community leadership, education, and fundraising. pyconde-pydata-2024-44831-keynote-a-view-from-my-window-an-outside-perspective-of-open-source-scientific-computing-from-the-inside Plenary Leah Silen en NumFOCUS is a nonprofit organization that serves open source scientific computing projects and their communities. Our support programming includes fiscal sponsorship, affiliation services, development grants, educational and DEI initiatives, and collaborative opportunities in open source science. PyData is an educational program of NumFOCUS.
false https://pretalx.com/pyconde-pydata-2024/talk/TNHMGN/ https://pretalx.com/pyconde-pydata-2024/talk/TNHMGN/feedback/ Kuppelsaal PyCon Community Backstage: A Decade of Camaraderie, Growth, and Lessons Learned Talk (long) 2024-04-22T11:25:00+02:00 11:25 00:45 For the past decade, my journey as a dedicated community organizer has allowed me to immerse myself deeply in the Python community, experiencing its extraordinary growth firsthand. The transition of Python from being a top-10 contender to becoming the foremost programming language has been an exhilarating experience, propelled by a burgeoning community and its foray into fields such as data science and artificial intelligence. The inclusivity and camaraderie within the Python community have been pivotal, illustrating how collective effort and a nurturing culture are instrumental to its current standing. This presentation is crafted to disseminate the pivotal lessons and best practices that have emerged from my decade-long engagement. During this period, I have played a key role in organizing over twenty Python/PyData conferences, including notable events like PyCon.DE, PyData Berlin, EuroPython, EuroSciPy, and PyData Global. It is for anyone who wants to learn more about, contribute to and organize themselves in the Python and PyData community. 
This talk will address:

* How it works: community backstage
* Why it works: community organizations
* Lessons learned:
  * community leadership & team dynamics
  * balancing ideas and realities
  * personal & professional growth
* How to contribute as an individual, community or company
* How organizations like the [PySV](https://pysv.org), [NumFOCUS](https://numfocus.org) or PioneersHub serve the community

pyconde-pydata-2024-42949-pycon-community-backstage-a-decade-of-camaraderie-growth-and-lessons-learned General: Community, Diversity, Career, Life and everything else Alexander CS Hendorf en Through organizing numerous community conferences, both small and large, I've gained invaluable insights into what makes a team and a community function effectively, and equally important, what doesn't. Leadership has been a key learning area for me. Through understanding my strengths and weaknesses, I have grown not just as a community leader but also in my professional career, enhancing how I work and lead. In "PyCon Backstage All Access," I will cover:

1. Organizational Experiences: The nuances of organizing conferences of various scales.
2. Leadership Lessons: Insights into team dynamics - what works and what doesn't in building a great community team.
3. Balancing Ideas and Realities: The driving factors behind enjoyable community conferences. How to listen to others. When to embrace complexity, and when to say no.
4. Handling the Mundane: Strategies for dealing with administrative, tax, and legal aspects. How and where organisations can help.
5. Future Outlook: Strategies for sustaining the European Python Community amidst growing challenges. This includes my reasons for rejoining the EuroPython board to help shape its future beyond being just a conference organizer.
My community service "CV":

* 2013 local MongoDB meetup
* 2014 joined EuroPython
* 2015-2020 core EuroPython organizer, 2 years board member
* 2017-2018 PyCon DE organizer
* 2018-today EuroSciPy organizer
* 2018-today PyData Südwest meetup organizer
* 2019-2022 PyCon DE & PyData Berlin chair
* 2019-today PyData Frankfurt meetup organizer
* 2019-today Python Software Verband chair (German Python association)
* 2023 PyCon DE & PyData Berlin organizer
* 2023 EuroPython board member

false https://pretalx.com/pyconde-pydata-2024/talk/7EC3UY/ https://pretalx.com/pyconde-pydata-2024/talk/7EC3UY/feedback/ Kuppelsaal Streamlining Python Development: A Guide to a Modern Project Setup Talk 2024-04-22T12:15:00+02:00 12:15 00:30 Designed for beginners, this presentation demystifies Python project management using [Hatch](https://hatch.pypa.io/) and delves into `pyproject.toml` for efficient configuration. We'll guide you through organizing directories, implementing unit testing for code reliability, and using [mypy](https://mypy-lang.org/) for type checking to enhance code quality. The session concludes with insights into [ruff](https://github.com/astral-sh/ruff), a modern linter for maintaining Python standards, which is replacing black, isort, and flake8. This talk is a comprehensive toolkit for anyone eager to learn and apply the latest practices in Python development. pyconde-pydata-2024-40238-streamlining-python-development-a-guide-to-a-modern-project-setup PyCon: Programming & Software Engineering Florian Wilhelm en In the dynamic world of Python programming, an efficient project setup is key to success. 'Streamlining Python Development: A Guide to a Modern Project Setup' is a presentation tailored specifically for Python beginners, aiming to demystify the process of setting up a Python project with clarity and efficiency. In this session, we'll introduce Hatch, a cutting-edge tool that simplifies project management.
We'll delve into the functionalities and benefits of using `pyproject.toml`, a cornerstone in modern Python development for its streamlined approach to project configuration. The talk will also cover effective strategies for organizing your project's directory structure, ensuring a clean and manageable workspace. Understanding the importance of testing, we'll discuss unit testing techniques for enhancing code reliability. Additionally, the presentation will feature mypy for type checking, an essential practice for catching errors early and improving code quality. Finally, we'll explore the use of ruff, a modern linter, to keep your code clean and in line with Python standards. By the end of this presentation, Python beginners will have gained a comprehensive understanding of the tools and methodologies necessary for a modern Python project setup, empowering them to create well-structured, high-quality Python applications. false https://pretalx.com/pyconde-pydata-2024/talk/CBVTEG/ https://pretalx.com/pyconde-pydata-2024/talk/CBVTEG/feedback/ Kuppelsaal You shall not pass! 🧙 Strengthen your python code against attacks. Talk (long) 2024-04-22T13:45:00+02:00 13:45 00:45 Have you ever thought about IT Security when coding your Python application? If not, you are not alone – but also not safe. Just recently, a research study counted almost 4000 secrets published on PyPI. Most of the secrets such as AWS Keys, Google API Keys or database credentials were most likely leaked accidentally. Leaked credentials top the list of entry points for attackers into protected areas. In this talk you’ll gain insights into how malicious attacks on Python applications are performed – and most importantly, how to protect yourself against them. We’ll kick off with a basic review of how to crack a password not only with brute force and continue with the most important IT Security principles. 
After understanding the importance of adhering to common security precautions, we will dive into Python coding hygiene. Where do the most common vulnerabilities lie? How can we strengthen the security of our code? We’ll cover secure coding topics such as code analysis, input validation, and dependency vulnerabilities in theory and practice. Lastly, we will look at some case studies of common attacks on Python code and how to protect yourself against them. If you have never thought about security aspects in Python, this talk is for you! pyconde-pydata-2024-42952-you-shall-not-pass-strengthen-your-python-code-against-attacks- PyCon: Security Antonia Scherz, Roman Krafft en This talk will highlight the theoretical concepts of security. We’ll start with a general overview and dive into specifics for Python applications. We will address five main questions:

1. How can we retrieve a password with a Python function?
2. What are the most essential IT Security practices?
3. Where can we find information on current security vulnerabilities?
4. What should we keep in mind to write secure Python code?
5. What are some historical attacks on Python code? What can we learn from them?

Listeners will walk away with a general overview of how to approach security issues when building their Python application and how to make their future code more secure. false https://pretalx.com/pyconde-pydata-2024/talk/7LQEJ3/ https://pretalx.com/pyconde-pydata-2024/talk/7LQEJ3/feedback/ Kuppelsaal Better safe than sorry: Threat Modeling for Python Developers Talk 2024-04-22T14:35:00+02:00 14:35 00:30 Every developer wants to write good code. Good code also means code that is secure against attackers and their threats. But how secure is your code really? The talk explains how you can use Threat Modeling to systematically assess your application against the threats that are relevant to your use cases and their attack surface.
pyconde-pydata-2024-41572-better-safe-than-sorry-threat-modeling-for-python-developers PyCon: Security Clemens Hübner en In the ever-evolving landscape of cybersecurity, Python applications play a pivotal role in handling critical data and supporting essential business functions, making them prime targets for malicious actors. As the stakes continue to rise, developers want to prioritize the implementation of security measures to safeguard against potential threats. However, the definition of "secure" remains elusive and often subjective. This causes insecurity not only in the application, but especially among the people who develop it. This talk explains how to move from "best effort security" to a comprehensive and systematic approach to application security. It introduces the tried and tested method “Threat Modeling” and explains its value in a Python development project. Python developers will gain practical insights to identify, assess, and prioritize security risks systematically. Real-world examples illustrate the impact of effective threat modeling, empowering developers to proactively secure their applications against the threats that are really relevant to them. false https://pretalx.com/pyconde-pydata-2024/talk/PRH3QU/ https://pretalx.com/pyconde-pydata-2024/talk/PRH3QU/feedback/ Kuppelsaal How to embrace your Leadership role as a Data Nerd (or other creative types) Talk 2024-04-22T15:35:00+02:00 15:35 00:30 The transition from a hands-on creative job to a leadership role isn't always smooth. The tasks you excelled at are now handled by your team, and your new title brings added responsibilities and numerous meetings, leaving little room for deep work. So how do we, the data people, the coaches, the coders, thrive in management roles? In this talk, I'll share my journey into management and how I learned to embrace and find reward in my leadership role.
pyconde-pydata-2024-43027-how-to-embrace-your-leadership-role-as-a-data-nerd-or-other-creative-types- General: Community, Diversity, Career, Life and everything else Paula Gonzalez Avalos en You've been working as a Data person/coder/designer/coach for a while and enjoy the creative task at hand. Investing your time in something meaningful that you're very good at brings you a deep sense of satisfaction, making your job truly enjoyable. As your career advances, you climb the ranks to become a senior professional and at some point, you find yourself taking on a management role. Suddenly, creative time is scarce, pressure is high, your schedule is full of meetings, and you are responsible for projects and a team. A great team, that too often you envy for getting to do the actual hands-on job. Sounds familiar? Or is this step something to better avoid? In this talk, I'll discuss my not-so-smooth transition from a senior position to a leadership role. I'll share lessons learned in my last years as a Head and ultimately, I’ll share my tips on how to not only survive but actually like and thrive in a management role. false https://pretalx.com/pyconde-pydata-2024/talk/TU9EUQ/ https://pretalx.com/pyconde-pydata-2024/talk/TU9EUQ/feedback/ Kuppelsaal When and how to start coding with kids Talk (long) 2024-04-22T16:10:00+02:00 16:10 00:45 Our world is driven by technology and there are many reasons to teach our kids how to code. For example, coding allows them to develop logical reasoning skills and teaches attention to detail. Allowing children to discover how much fun coding can be supports them in their development and opens many doors for their future. But when and how should we start coding with kids? This talk will approach the question from a scientific perspective, looking into how children's brains develop, how children learn and how to best teach them coding abilities. It will answer important questions like "At what age can a child start coding?" 
or "What are the benefits of learning to code?". It will also present possible starting points, like learning platforms or tutorials. pyconde-pydata-2024-39507-when-and-how-to-start-coding-with-kids General: Community, Diversity, Career, Life and everything else Anna-Lena Popkes en Being able to code is becoming a more valuable skill every day. Besides the obvious advantages of being able to code (e.g. better career opportunities), coding teaches important skills like logical reasoning, attention to detail and creativity. But what is the best time to start coding? Are kids even able to learn how to code? And at what age? In this talk I would like to approach these questions from a scientific perspective, discussing the biological backgrounds and giving concrete advice on when and how to start coding with kids. false https://pretalx.com/pyconde-pydata-2024/talk/UBNVYW/ https://pretalx.com/pyconde-pydata-2024/talk/UBNVYW/feedback/ B09 Better search relevance using Learning to Rank at mobile.de Sponsored Talk (long) 2024-04-22T11:25:00+02:00 11:25 00:45 At mobile.de, we aim to provide a satisfactory search experience so users can quickly find the vehicles they are looking for. We make it happen using our machine learning systems, which run 24/7 in the backend, continuously learn changing user interests, and optimize the search experience. Based on techniques like learning to rank using XGBoost, this talk will discuss our current search relevance ranking framework and how it ranks millions of searches daily. pyconde-pydata-2024-44953-better-search-relevance-using-learning-to-rank-at-mobile-de Sponsor Manish Saraswat en At mobile.de, we continuously strive to provide our users with a better, faster, and unique search experience. Machine learning and Python play a key role in providing this experience. Every day, millions of people visit mobile.de to find their dream car.
The user journey typically starts with the user entering a search query and later refining it based on their requirements. If the user finds a relevant listing, they contact the seller to purchase the vehicle. Our search engine is responsible for matching users with the right sellers. In this talk, I will cover:

- Introduction
- Why search is important
- How learning to rank helps
- Current challenges with our ranking models
- Proposed solution
- How we deploy our ranking models (with a strict latency SLA below 30 ms)
- A/B test results
- Key learnings
- How we can improve further

false https://pretalx.com/pyconde-pydata-2024/talk/LMMM7D/ https://pretalx.com/pyconde-pydata-2024/talk/LMMM7D/feedback/ B09 Haystack 2.0: the story of a rewrite Sponsored Talk 2024-04-22T12:15:00+02:00 12:15 00:30 To rewrite or not to rewrite: it's a major question. Releasing new software versions with breaking changes can be disruptive to a community, but sometimes they are necessary in the long run to move forward. Haystack is a free open source Python LLM framework. It was launched in 2020, before LLMs were cool. In 2023 we decided to undergo a major re-architecture, culminating in the GA release of Haystack 2.0. It wasn't an easy decision. By involving the open source community and some big companies in our design process early on, we are confident we built a more usable, flexible foundation for years to come. In this talk I'll tell you the story of this rewrite, including the decisions we made to bring the project forward with the right level of flexibility / composability in the rapidly changing LLM landscape. I won't only show you the new features 2.0 provides, but also give you a peek into our future roadmap. You'll walk away with a better understanding of how modern LLM frameworks can help you solve problems for yourself and your users, as well as an enriched understanding of how to think for the long-term when building for an open source community.
You’ll see how Haystack's modularity and ease of use make it stand out from other libraries. Demos will make this much clearer and give you some great ideas on how to integrate Haystack into your projects. pyconde-pydata-2024-45527-haystack-2-0-the-story-of-a-rewrite Sponsor Silvano Cerza en
false https://pretalx.com/pyconde-pydata-2024/talk/GLXJPC/ https://pretalx.com/pyconde-pydata-2024/talk/GLXJPC/feedback/ B09 From idea to production in a day: Leveraging Azure ML and Streamlit to build and user test machine learning ideas quickly Talk 2024-04-22T13:45:00+02:00 13:45 00:30 Getting a machine learning solution in front of users usually takes some time. The data science tech stack is full of time traps and infrastructure issues might slow down deployment. The Azure Machine Learning platform, automated machine learning, and Streamlit are predestined tools for circumventing common development and deployment issues – if you know how to use them. Based on our learnings in corporate hackathons, we will use the stack to rapidly prototype a computer vision application users can interact with. You will walk away with Python code snippets and inspiration to build and user test your own machine learning ideas quickly. pyconde-pydata-2024-41326-from-idea-to-production-in-a-day-leveraging-azure-ml-and-streamlit-to-build-and-user-test-machine-learning-ideas-quickly PyData: Machine Learning & Deep Learning & Stats Florian Roscheck en Experimentation, bringing machine learning ideas in front of users, is essential to innovation. Yet, in our corporate hackathons, our data science team has struggled many times with how to build and deploy user-facing machine learning ideas in just a single day. Over the past 2+ years, we have developed a routine around using Azure Machine Learning, automated machine learning, and Streamlit to build and user test machine learning ideas quickly. The aim of this talk is to pass on practical, technical knowledge to fellow data scientists about how to leverage this stack to achieve high build and user test speeds. During the talk, we will walk through the process of building a computer vision system for identifying trash in images via an app using the open-source TACO dataset (http://tacodataset.org/). 
Working through a Jupyter notebook, we will load the data into Azure Machine Learning and trigger an automated machine learning run on the data. In this context, we will quickly get to know the training and testing metrics available in Azure ML to evaluate the model. We will then download the machine learning model as a file packaged in the open-source ONNX format (https://onnx.ai/). Using the open-source Python web application framework Streamlit (https://github.com/streamlit/streamlit), we will program an application in which users can upload images and embed the machine learning model in it to identify trash in these images. Using a to-be-published infrastructure-as-code pipeline on Azure DevOps, we will deploy the application to the public internet on the Azure platform. From here, users can test it. The stack and code presented in this talk will enable fellow data scientists to accelerate their data science development, leading to quicker experimentation and, therefore, to faster innovation of products with machine learning at their core. false https://pretalx.com/pyconde-pydata-2024/talk/GVTJW8/ https://pretalx.com/pyconde-pydata-2024/talk/GVTJW8/feedback/ B09 Going beyond Parquet's default settings – be surprised what you can get Talk 2024-04-22T14:35:00+02:00 14:35 00:30 Apache Parquet has become the de facto format for storing tabular (DataFrame) data on disk. This is done through universal compression and efficient knowledge of the stored data structure. As part of this talk, we would like to show the core structure of Parquet and the knobs that allow you to get even more of the capabilities of the file format. pyconde-pydata-2024-43007-going-beyond-parquet-s-default-settings-be-surprised-what-you-can-get PyData: Data Handling & Engineering Uwe L. Korn en In the last decade, Apache Parquet has become the standard format to store tabular data on disk regardless of the technology stack used. 
This is due to its read/write performance, efficient compression technology, interoperability, and especially its outstanding performance with the default settings. While these default settings and access patterns already provide decent performance, by understanding the format in more detail and using recent developments, one can get much better performance and smaller files, and use Parquet's newer partial reading features to read even smaller subsets of a file for a given query. This talk aims to provide insight into the Parquet format and its recent developments that are useful for end users' daily workflows. The only prior knowledge required is knowing what a DataFrame (tabular data) is. false https://pretalx.com/pyconde-pydata-2024/talk/SQUNWS/ https://pretalx.com/pyconde-pydata-2024/talk/SQUNWS/feedback/ B09 Bridging the Gap: From Analytical Models to Operational Success Sponsored Talk 2024-04-22T15:35:00+02:00 15:35 00:30 Deploying machine learning models in production carries its own unique set of challenges. Some challenges stem from different, and sometimes conflicting, objectives between analytics and production. Others arise from technological limitations, business requirements, and even regulatory needs. In this talk, we will focus on the part of the problem surrounding the handover of models from analytics to production. We expect data scientists, operation specialists, and product owners to benefit from our stories. pyconde-pydata-2024-44950-bridging-the-gap-from-analytical-models-to-operational-success Sponsor Ignacio Vergara, Nick Harmening en Deploying machine learning models in production carries its own unique set of challenges. Some challenges stem from different, and sometimes conflicting, objectives between analytics and production. Others arise from technological limitations, business requirements, and even regulatory needs. In this talk, we will focus on the part of the problem surrounding the handover of models from analytics to production.
This process has multiple facets, with tasks executed at different points in time and with different degrees of automation possible. To name a few: model packaging, inference reproducibility, establishing what needs to be deployed, and deployment-related actions. We'll share some of our experiences and strategies to tackle these challenges. For example, how we tackle the topic of contracts, interfaces, and responsibilities between modeling and production. Or how the role of automation in the pre-deployment process ensures a smooth and efficient model transition from an analytics model store to something ready for production once a model is approved. Whether you are a data scientist developing models, an operations specialist tasked with deploying them, or a product/project owner supervising the process, we aim to ignite engaging and fruitful discussions. For data scientists, to have a window into what happens after they are done with training a model. For operations specialists, to gain some strategies to improve their experience and success rate. And for a product owner, to get a framework on how to drive alignment. false https://pretalx.com/pyconde-pydata-2024/talk/XNY3HX/ https://pretalx.com/pyconde-pydata-2024/talk/XNY3HX/feedback/ B09 Documenting R&D Progress using jupyter-book - and feel safe for the next performance audit Sponsored Talk 2024-04-22T16:10:00+02:00 16:10 00:30 Rosenxt has only just been founded, and yet we are already very busy researching great things and making them usable. The ideas are bubbling, the motivation is high. The urge to try out the next idea quickly is high. But progress needs to be well documented, as the next performance audit is sure to come. 
pyconde-pydata-2024-48144-documenting-r-d-progress-using-jupyter-book-and-feel-safe-for-the-next-performance-audit Sponsor Jens Nie en Rosenxt has been founded to offer experience and excellence gathered in the last decades for the most challenging environments in the future, such as subsea, industrial, renewables, or the integrity of water and energy supply. Highly motivated, we can hardly wait to try out the next idea to make rapid progress. But we are also aware of the rules of business. At the end there is always the performance audit. This is where you have to prove that you can really deliver what you have promised. And to do this, you better have everything well documented. At our venture we have chosen a jupyter-book based workflow. Here come the Jupyter Notebook based steps for data analysis we're using anyways along with some simple markdown based documents embracing everything. Using a clever file system structure and a few tools, we create appealing documents that document the development progress very well. In this talk, I would like to present this workflow in more detail using the tests with a specific water pressure sensor that we are currently evaluating. false https://pretalx.com/pyconde-pydata-2024/talk/YYKJMP/ https://pretalx.com/pyconde-pydata-2024/talk/YYKJMP/feedback/ B07-B08 Select ML from Databases Talk (long) 2024-04-22T11:25:00+02:00 11:25 00:45 This talk introduces a new workflow for building your machine learning models using the capabilities of modern databases that support machine learning use cases natively. There is an overview of how machine learning models are being created today to how they could look in the near future by utilising the features provided by current databases. pyconde-pydata-2024-41601-select-ml-from-databases PyData: Machine Learning & Deep Learning & Stats Gregor Bauer en Developing machine learning models involves the use of data to identify patterns that would help solve business problems. 
Over the years, as the scale of data increased, data started to get stored in databases. The model-building workflows would typically fetch the data from the databases, perform some transformations to create features, and use them to train the models. In some cases, these features would get stored in databases known as feature stores for reuse. To infer the model output in real-time, typically, there would be a small service or an API endpoint that would be deployed to get the results to the consumers. As these use cases became more common, modern databases started incorporating features that aid in building machine learning models. This talk covers some of the features these databases provide, such as built-in support for common models and tasks like linear regression, image classification, and text processing, and support for functions with custom models. Apart from these features, many of them also make it easy to deploy the model without needing an external service for the inference. Instead, they provide native interfaces for inference, like querying in SQL-like languages. This talk includes an example of how to build your custom model in Python and then include it inside your Couchbase database, making inference a matter of using database queries. The example will help you understand some of the capabilities of modern databases in building machine learning models. false https://pretalx.com/pyconde-pydata-2024/talk/RBNJRK/ https://pretalx.com/pyconde-pydata-2024/talk/RBNJRK/feedback/ B07-B08 Data valuation for machine learning Talk 2024-04-22T12:15:00+02:00 12:15 00:30 Data valuation techniques compute the contribution of training points to the final performance of machine learning models. They are part of so-called data-centric ML, with immediate applications in data engineering like data pruning or improved collection processes, and in model debugging and development.
In this talk we demonstrate how the open source library [pyDVL](https://pydvl.org) can be used to detect mislabeled and out-of-distribution samples with little effort. We cover the core ideas behind the most successful algorithms and illustrate how they can be used to inspect your data and get the most out of it. pyconde-pydata-2024-41793-data-valuation-for-machine-learning PyData: Data Handling & Engineering Miguel de Benito Delgado, Kristof Schröder en The core idea of so-called data-centric machine learning is that any effort spent on improving the quality of the data used to train a model is probably better spent than on improving the model itself. This tested rule of thumb is particularly relevant for applications where data is scarce, expensive to acquire or difficult to annotate. Concepts of the usefulness of a datum or its influence on the outcome of a prediction have a long history in statistics and ML, in particular through the notion of the influence function. However, it has only been recently that rigorous and practical notions of value for data, and in particular data-sets, have appeared in the ML literature. The core idea is to look at data points known to be “useful” in some sense, for instance in that they substantially contribute to the final performance of a model, and focus acquisition or labelling efforts around similar ones, while eliminating or “cleaning” the less useful ones. In a nutshell, data valuation for machine learning is the task of assigning a scalar to each element of a training set which reflects its contribution to the final performance of some model trained on it. This can be used to repair or prune corrupt or superfluous data, or for data collection, like active learning strategies when labelling is expensive.
While many exact methods have exponential time complexity in the size of the training set, recent advances either provide good approximation strategies or introduce alternative approaches, which are starting to make this field relevant in practice. In this context, [pyDVL](https://pydvl.org) is an LGPL library aiming to provide robust, parallel implementations of every relevant method for simple usage in applications and research. In this talk we showcase how it can be used to detect issues in data pipelines and to improve final performance. pyDVL is still in early stages of development but already provides over a dozen algorithms, runs in parallel using Ray and supports sklearn-compatible interfaces and large PyTorch models with out-of-core computation thanks to Dask. false https://pretalx.com/pyconde-pydata-2024/talk/WNHAG8/ https://pretalx.com/pyconde-pydata-2024/talk/WNHAG8/feedback/ B07-B08 A conceptual and practical introduction to Hilbert Space Gaussian Process (HSGP) approximation methods Talk (long) 2024-04-22T13:45:00+02:00 13:45 00:45 In this talk, we explore a new method to approximate Gaussian processes using spectral analysis methods, known as the Hilbert Space Gaussian process (HSGP) approximation. This technique allows us to use and fit Gaussian processes at scale for concrete applications. We provide a basic introduction to the ideas behind the method and make them tangible by implementing them ourselves using Numpyro. We then present two concrete examples in practice using both Numpyro and PyMC, namely time-varying coefficient regression and time series forecasting. pyconde-pydata-2024-40780-a-conceptual-and-practical-introduction-to-hilbert-space-gaussian-process-hsgp-approximation-methods PyData: Machine Learning & Deep Learning & Stats Dr. Juan Orduz en In this talk, we explore a new method to approximate Gaussian processes using spectral analysis methods, known as the Hilbert Space Gaussian process (HSGP) approximation. 
This technique allows us to use and fit Gaussian processes at scale for concrete applications. We provide a basic introduction to the ideas behind the method and make them tangible by implementing them ourselves using Numpyro. We then present two concrete examples in practice using both Numpyro and PyMC, namely time-varying coefficient regression and time series forecasting. **Idea behind the approximation:** The core of this method relies on the Laplacian's spectral decomposition to approximate kernels' spectral measures as a function of basis functions. The key observation is that the basis functions in the reduced-rank approximation do not depend on the hyperparameters of the covariance function for the Gaussian process. This allows us to speed up the computations tremendously. **References** - Hilbert space methods for reduced-rank Gaussian process regression (https://link.springer.com/article/10.1007/s11222-019-09886-w) - Practical Hilbert space approximate Bayesian Gaussian processes for probabilistic programming (https://link.springer.com/article/10.1007/s11222-022-10167-2) - Example: Hilbert space approximation for Gaussian processes (https://num.pyro.ai/en/stable/examples/hsgp.html) - PyMCon Web Series - Introduction to Hilbert Space GPs in PyMC - Bill Engels (https://www.youtube.com/watch?v=ri5sJAdcYHk) false https://pretalx.com/pyconde-pydata-2024/talk/YWUZW9/ https://pretalx.com/pyconde-pydata-2024/talk/YWUZW9/feedback/ B07-B08 Next Stop: Insights! How Streamlit and Snowflake Power Up Data Stories Talk 2024-04-22T14:35:00+02:00 14:35 00:30 Data stories transform complex data insights into clear, actionable and context-rich narratives to drive business value. The presentation of data stories to different audiences in a visually compelling manner while keeping track of data changes is a challenging task. A possible solution is to implement appealing and interactive data applications, for which Streamlit is an established open-source solution. 
In combination with Snowflake, it enables an efficient and straightforward approach to build engaging data applications that utilize data directly from a data platform. In this talk, we will explore a proof-of-concept, tracing the conception of a data story to the implementation of a Streamlit app in Snowflake by using open source datasets from Deutsche Bahn. So, hold onto your seats – it is time to explore the world of data apps with Snowflake and Streamlit. pyconde-pydata-2024-41728-next-stop-insights-how-streamlit-and-snowflake-power-up-data-stories PyData: Data Handling & Engineering Marie-Kristin Wirsching en Streamlit is an open-source Python package designed to simplify the creation of data applications featuring interactive data dashboards. Since September 2023, Streamlit has been integrated into Snowflake offering several benefits, including the ability for developers to securely build, deploy, and share Streamlit apps within Snowflake's data cloud making use of the scale, performance and security of the Snowflake platform. This talk provides an introduction to Streamlit and showcases its integration into Snowflake. 
After this talk you will gain: - an introduction to how Streamlit can be used within Snowflake - practical insights into the creation of a data story based on a Deutsche Bahn open-source dataset on Wi-Fi connectivity in trains - a comprehensive understanding of implementing a Streamlit app in Snowflake, illustrated through the developed data story - main takeaways and key insights working with Streamlit in Snowflake This talk is addressed to data enthusiasts who are - often faced with the challenge of presenting profound data insights to diverse audiences - interested in a tool that effortlessly constructs appealing data applications - curious about a direct link between Streamlit and Snowflake false https://pretalx.com/pyconde-pydata-2024/talk/83ZGV3/ https://pretalx.com/pyconde-pydata-2024/talk/83ZGV3/feedback/ B07-B08 Machine Learning on microcontrollers using MicroPython and emlearn Talk 2024-04-22T15:35:00+02:00 15:35 00:30 This presentation will show you how to deploy machine learning models to affordable microcontroller-based systems - using the Python that you already know. Combined with sensors, such as microphone, accelerometer or camera, this makes it possible to create devices that can automatically analyze and react to physical phenomena. This enables a wide range of useful and fun applications, and is often referred to as "TinyML". The presentation will cover key concepts and explain the different steps of the process. We will train the machine learning models using standard scikit-learn and Keras, and then execute them on device using the emlearn library. To run Python code on the microcontroller, MicroPython will be used. We will demonstrate some practical use-cases using different sensors, such as Sound Event Detection (microphone), Image Classification (camera), and Human Activity Recognition (accelerometer). 
pyconde-pydata-2024-41644-machine-learning-on-microcontrollers-using-micropython-and-emlearn PyData: Machine Learning & Deep Learning & Stats Jon Nordby en Modern Machine Learning makes it possible to automatically extract valuable information from sensor data. While Machine Learning is often associated with costly, compute-intensive systems, it is becoming feasible to deploy ML systems to very small embedded devices and sensors. These devices typically use low-power microcontrollers that cost as little as 1 USD. This niche is often referred to as "TinyML", and is enabling a range of new applications in science, industry and consumer electronics. While microcontrollers are getting more powerful year by year, it is still important to fit within the limited RAM, program size and CPU time available. emlearn is an open-source Python library that allows converting scikit-learn and Keras models to efficient C code. This makes it easy to deploy models to any microcontroller with a C99 compiler, while keeping the Python-based workflow that is familiar to Machine Learning Engineers. Via emlearn-micropython it also supports MicroPython, a Python implementation designed for microcontrollers. MicroPython runs on practically all microcontrollers with 16kB+ RAM, and this makes it possible to write an entire application for microcontrollers using Python. The emlearn-micropython package is provided as a set of MicroPython modules that can be installed onto a device, without having to recompile any C code. This preserves the ease-of-use that Python developers are used to on a desktop system. Compared to pure-Python approaches, the emlearn-micropython models are typically 10-100x faster and smaller. The models in emlearn support the core Machine Learning task types: classification, regression and anomaly detection. Additionally there are also tools for data preprocessing, feature engineering and estimation of compute requirements. 
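To give a feel for what "converting a model to C code" can mean, here is a toy sketch that turns a small decision tree into a C function made of nested `if`/`else` statements. It is not emlearn's generator or API: the dictionary layout of the tree and the names `tree_to_c`, `generate_c_function` and `predict_python` are hypothetical and exist only for this illustration.

```python
from typing import Dict, List, Sequence, Union

# Hypothetical tree layout: a leaf is an int (the predicted class); an internal
# node is a dict with "feature", "threshold", "left" and "right" entries.
Tree = Union[int, Dict[str, object]]


def tree_to_c(tree: Tree, indent: int = 1) -> List[str]:
    """Emit the body of a C function as nested if/else statements."""
    pad = "    " * indent
    if isinstance(tree, int):
        return [f"{pad}return {tree};"]
    threshold = float(tree["threshold"])  # always written as a C float literal
    lines = [f"{pad}if (features[{tree['feature']}] < {threshold}f) {{"]
    lines += tree_to_c(tree["left"], indent + 1)
    lines.append(f"{pad}}} else {{")
    lines += tree_to_c(tree["right"], indent + 1)
    lines.append(f"{pad}}}")
    return lines


def generate_c_function(tree: Tree, name: str = "predict") -> str:
    """Wrap the generated body in a C function taking an array of floats."""
    header = f"int {name}(const float *features) {{"
    return "\n".join([header, *tree_to_c(tree), "}"])


def predict_python(tree: Tree, features: Sequence[float]) -> int:
    """Reference implementation in Python, useful to check the generated logic."""
    while not isinstance(tree, int):
        branch = "left" if features[tree["feature"]] < tree["threshold"] else "right"
        tree = tree[branch]
    return tree


example_tree = {
    "feature": 0,
    "threshold": 0.5,
    "left": 0,
    "right": {"feature": 1, "threshold": 2.0, "left": 1, "right": 2},
}

c_source = generate_c_function(example_tree)
print(c_source)
```

The generated function needs no heap allocation and no runtime library, which is the property that makes this style of code attractive on devices with a few kilobytes of RAM.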
Since the start in 2019, emlearn has been used in a wide range of applications, from detection of vehicles in acoustic sensor nodes, to hand gesture recognition based on sEMG data, to real-time malware detection in Android devices. While emlearn and MicroPython can target a very wide range of hardware, we will focus on the Espressif ESP32 family of devices. These are very powerful and affordable, with good WiFi+BLE connectivity support and good open-source toolchains; they are very popular among both hobbyists and companies, and have many good ready-to-use hardware development kits. The audience is expected to have a basic literacy in Python and proficiency in programming, and familiarity with core Machine Learning concepts such as supervised/unsupervised learning, classification/regression, etc. Familiarity with microcontrollers and embedded systems is of course an advantage, but the talk should be approachable to those who are new to this area. false https://pretalx.com/pyconde-pydata-2024/talk/NYHFSB/ https://pretalx.com/pyconde-pydata-2024/talk/NYHFSB/feedback/ B07-B08 Your Model _Probably_ Memorized the Training Data Talk (long) 2024-04-22T16:10:00+02:00 16:10 00:45 I know you probably don't want to hear about it, but your deep learning model probably memorized some of its training data. In this talk, we'll review active research on deep learning and memorization, particularly for large models such as large language and multi-modal models. We'll also explore potential ways to think through when this memorization is actually desired (and why) as well as threat vectors and legal risk of using models that have memorized training data. We'll also look at potential privacy protections which could address some of the issues and how to embrace memorization by thinking through different types of models and their use. 
pyconde-pydata-2024-41993-your-model-probably-memorized-the-training-data PyData: Machine Learning & Deep Learning & Stats Katharine Jarmul en In this talk, I will cover: - Proven mathematical research as to why deep learning models memorize information - A series of successful attacks against deep learning models and GPT-models to extract memorized information - The legal and social impact of memorization and using memorized data - Differential privacy as one potential solution (but also its pitfalls when used to train large models) - Federated and/or local- or community-trained models as an alternative - The need for distillation that also attempts to reduce memorization false https://pretalx.com/pyconde-pydata-2024/talk/BFF9VA/ https://pretalx.com/pyconde-pydata-2024/talk/BFF9VA/feedback/ B05-B06 RAG for a medical company: the technical and product challenges Talk (long) 2024-04-22T11:25:00+02:00 11:25 00:45 [RAG (Retrieval Augmented Generation)](https://www.pinecone.io/learn/retrieval-augmented-generation/) is the process of querying a (large) set of documents with natural language, leveraging vector search and llms. While it has recently become widely accessible to develop a Proof-Of-Concept RAG using OpenAI and one of the various open-source contributions (e.g. langchain), building a **performant** RAG that **brings value to users** is challenging. This talk will focus on learnings from building a RAG for a **medical company**, to allow doctors to query drug documentation with natural language, using tools like **[Chainlit](https://docs.chainlit.io/get-started/overview), [Qdrant](https://qdrant.tech/) and [Langsmith](https://www.langchain.com/langsmith)**. Naturally, a product question emerged: how to effectively leverage LLMs that **can never guarantee 100% accuracy** in the health sector? 
We will explain how we addressed this challenge, as well as the various **technical improvements** implemented to enhance both the retrieval (vector search) and generation (llm) metrics of our RAG. pyconde-pydata-2024-41787-rag-for-a-medical-company-the-technical-and-product-challenges PyData: Generative AI Noé Achache en RAG works as follows: - An **embedding model** is used to create representations of all documents. These representations are then stored in a **vector database**. - A user poses a question. The same **embedding model** is used to create a representation of this question, enabling the **retrieval** of the most similar documents through a **similarity search**. - These documents are incorporated into a **prompt** along with the question to **generate an answer based on the documents' content**. Many open-source tools, such as Langchain, enable the creation of such pipelines in just [few lines of code](https://python.langchain.com/docs/expression_language/cookbook/retrieval). However, without specific adjustments, such systems often do **not** perform well enough to gain **user adoption**. In this talk, we will cover the challenges and learnings encountered while building a **RAG for the drug documentation of a medical company**. More specifically, we will: - Cover the **basics** of RAGs. - Present the use case we faced and showcase the **resulting product**. - Show how we significantly improved our **retrieval and generation metrics** with techniques such as leveraging **LLMs** to add extra context to the user's question to enhance retrieval accuracy. - Discuss how we designed the product to effectively utilize LLMs while ensuring that doctors are not **misled** by potentially erroneous information, such as **hallucinations**. 
We achieved this mostly by displaying the sources: while many RAG pipelines cite their sources, we went a step further by **inserting HTMLs** of the sources directly **within** the generated answers, along with **highlighted citations**. - Highlight the tooling aspect of the project, e.g. **[Langsmith](https://www.langchain.com/langsmith) (a logging tool for LLMs)**, allowed us to easily augment our initial dataset and ensure that users were interacting correctly with the product. Furthermore, the ability to replay/alter a prompt on the interface allowed the **product owner** to iterate on prompt engineering and assist with technical iterations using their **field knowledge**. false https://pretalx.com/pyconde-pydata-2024/talk/XMKREA/ https://pretalx.com/pyconde-pydata-2024/talk/XMKREA/feedback/ B05-B06 Acknowledging Women’s Contributions in the Python Community Through Podcast Talk 2024-04-22T12:15:00+02:00 12:15 00:30 The Python community has been making efforts in improving the diversity and representation among its members. There are examples of success stories such as PyCon US Charlas, PyLadies, Djangonaut, and Django Girls. Yet in the Python podcast community, women are still underrepresented, making up only 17% of invited guests among the popular podcast series. Being a guest in a podcast is a privilege, and an opportunity to influence the Python community. There are many women and underrepresented group members who have made impactful contributions to the Python community globally, and they deserve the recognition and to be heard by the rest of us. Disheartened by the lack of representation by women on Python podcasts, and inspired by others who have shown us how diversity in the community can be improved through intentionality, we decided to start a podcast with a goal to highlight their voices so that they could receive the recognition they deserve. In this talk, learn about them, and about our podcast series. 
We’ll also share how you can further help our cause in improving representation and diversity in the Python community. pyconde-pydata-2024-41690-acknowledging-women-s-contributions-in-the-python-community-through-podcast General: Community, Diversity, Career, Life and everything else Cheuk Ting HoTereza Iofciu en The Python community has been making efforts in improving the diversity and representation among its members. There are examples of success stories such as PyCon US Charlas, PyLadies, Djangonaut, and Django Girls. Yet in the Python podcast community, women are still underrepresented, making up only 17% of invited guests among the popular podcast series. Being a guest in a podcast is a privilege, and an opportunity to influence the Python community. There are many women and underrepresented group members who have made impactful contributions to the Python community globally, and they deserve the recognition and to be heard by the rest of us. Disheartened by the lack of representation by women on Python podcasts, and inspired by others who have shown us how diversity in the community can be improved through intentionality, we decided to start a podcast with a goal to highlight their voices so that they could receive the recognition they deserve. In this talk, learn about them, and about our podcast series. We’ll also share how you can further help our cause in improving representation and diversity in the Python community. ## Goal To raise awareness of the underrepresentation of certain groups, especially women. To acknowledge the progress made by the Python community and what can be done further to continue the improvement. ## Target Audience Anyone who cares about the diversity and inclusion progression in the Python community. Community leaders who want to be allies. 
## Outline ### Diversity in Python community, examples (5 minutes) - PyCon US speakers: from 1% in 2011 to 40% in 2016 - Efforts in improving diversity in the Python community: Charlas, PyLadies, DjangoGirls, Djangonaut ### How are those efforts successful? (5 minutes) - Intentionality: starts with recognizing the issue and clear intention and goal in improving the situation - Outreach: targeted and direct outreach to underrepresented groups, with explicit invitations asking underrepresented group members to participate - Opportunity: providing opportunities and tools for women to succeed ### In Podcast (3 minutes) - Since there were no stats, we collected our own data by scraping the three most popular Python podcasts, using Python, Beautiful Soup, and Datasette - Our result shows that among the three podcasts that have been running for years, women made up only 17% of invited guests, whereas the same men appeared repeatedly on the same shows ### Why is this important (5 minutes) - Podcast guests are influential - Women and underrepresented group members deserve to be seen and heard - Representation creates inspiration. Lack of representation = lost opportunity to inspire women to further participate in the community ### 6 months of our podcasts (4 minutes) - Share public reactions and support from our launch - Karolina Ladino: in Colombia, women have to be accompanied by husbands or brothers to come to meetups, otherwise it's not safe for them to come alone. 
- Joanna Jablonski: making an impact in the Python community through documentation and developer education ### How you can help (3 minutes) - Listen to their stories - Actively promote and boost voices from women and underrepresented group members - Suggest people to interview false https://pretalx.com/pyconde-pydata-2024/talk/BYH8Y8/ https://pretalx.com/pyconde-pydata-2024/talk/BYH8Y8/feedback/ B05-B06 The pragmatic Pythonic data engineer Talk (long) 2024-04-22T13:45:00+02:00 13:45 00:45 Learn to make practical decisions in data engineering with Python's vast ecosystem. Avoid blindly following market guidelines and consider the reality of your situation for better performance and architecture. pyconde-pydata-2024-41056-the-pragmatic-pythonic-data-engineer PyData: Data Handling & Engineering Robson Junior en Often, we tend to look at the success of others and try to repeat their **decisions**, expecting the same result. We must deal with things sensibly and realistically based on practical rather than just theoretical considerations. **Python** offers a vast **ecosystem** to handle all phases of data engineering. Implementing a **data architecture** can be complex, and many adopt the strategy of following market **guidelines** without the **pragmatism** of understanding their own **reality**; in most cases, this strategy leads to big problems in **architecture** and **performance**. As a part of this talk, we will walk through the process of identifying **Pythonic** components of **data analysis**, **data cleaning**, **data ingestion**, **databases**, **file systems**, **serialization formats**, **workflows**, and **pipelines**. As we move through those steps, my main focus is teaching the audience **pragmatic thinking** on incorporating best practices into the **data architecture** process. I will also walk through **strategies** and explain high-level data engineering concepts we can use. 
false https://pretalx.com/pyconde-pydata-2024/talk/NYFVLM/ https://pretalx.com/pyconde-pydata-2024/talk/NYFVLM/feedback/ B05-B06 Whispered Secrets: Building An Open-Source Tool To Live Transcribe & Summarize Conversations Talk 2024-04-22T14:35:00+02:00 14:35 00:30 Are you secretly a spy and/or passionate about open-source? Maybe you don't trust a cloud-hosted service with your highly classified information, or perhaps you like to build things for yourself. In this light-hearted talk, you will learn how to make a real-time on-device GenAI-powered application that can live transcribe and summarize conversations without internet access, using open-source components. Our journey begins with an introduction to open-source LLMs and the latest trends in running GenAI tools on your own hardware. We will build up our application step-by-step, first creating a live streaming voice-to-text transcription pipeline, then an LLM-based conversation summarization layer, presented within a Streamlit frontend, with conversation summaries sent to a lightweight Django API backend for storage. This talk is tailored for Python enthusiasts and requires no ML expertise. By seeing a practical demo come together piece by piece, attendees will gain a deeper understanding of how to build their own complex Generative AI applications and be pushed to imagine what they could make for themselves using on-device computation in real-world scenarios. pyconde-pydata-2024-43040-whispered-secrets-building-an-open-source-tool-to-live-transcribe-summarize-conversations PyData: Generative AI John Sandall en This light-hearted talk will aim to introduce the audience to the latest trends and possibilities for building GenAI applications using open-source components. Here's why this matters: * Cloud-hosted SaaS tools cannot store highly **sensitive information**. * **Good open-source alternatives exist** for most GenAI tasks; the more people who use them, the more they will thrive. 
* Commercial tools will solve for common use cases, but developers can build personalized tools that are **highly specialized for their own bespoke needs**. During the course of this talk, we will build a real-time conversation pipeline including transcription, summarization and topic analysis layers. We will use open-source Python libraries, including a Streamlit frontend and a Django API backend. The primary focus is to demonstrate the simplicity of building complex LLM-based applications, specifically tailored for attendees with a basic understanding of Python but who may not have prior experience using LLMs. We'll explore a variety of tools*, the use of Whisper for accurate live transcription, delving into its capabilities and integration with Streamlit. Additionally, we'll discuss LangChain + llama.cpp + Llama-2 for efficient summarization and topic analysis, highlighting their performance on standard hardware like a MacBook Pro. For the web API, Django will be our framework of choice, providing a robust and scalable solution for storing and displaying our conversation transcripts and summaries. We will also demonstrate how additional tools can be easily integrated into our workflow, for example using the Chroma vector database to build a simple semantic search function. Expect plenty of Python code and some fun live demos, with GitHub code provided for attendees to try it at home. This demo only covers a small fraction of the immensely versatile capabilities available from the modern open-source AI landscape, but will leave attendees with a sense that building complex LLM-powered applications that solve real-world problems has never been this easy. _* The exact tools presented may be different from those mentioned here, due to the rapidly evolving nature of this landscape. 
The goal is to ensure that attendees are provided with state-of-the-art content that is fully up-to-date come April 2024._ false https://pretalx.com/pyconde-pydata-2024/talk/7898PU/ https://pretalx.com/pyconde-pydata-2024/talk/7898PU/feedback/ B05-B06 Everything you need to know about change-point detection Talk 2024-04-22T15:35:00+02:00 15:35 00:30 Change-point detection is a crucial processing step when dealing with long and non-stationary time series. It has been applied in many contexts, such as human activity recognition, speech/sound processing and industrial monitoring. This talk guides data scientists, engineers and researchers through the mathematical foundations of this subject, introduces the [ruptures](https://github.com/deepcharles/ruptures) Python package for change-point detection, and illustrates algorithms in a biomedical context. By the end, the audience will be able to integrate them into complex data pipelines. pyconde-pydata-2024-42012-everything-you-need-to-know-about-change-point-detection PyData: Machine Learning & Deep Learning & Stats Charles Truong en How do you detect an activity change (e.g. walking to running to biking) from smartwatch data? Or abrupt transitions in paleoclimate records? Or when a server failure occurs, using hardware telemetry sensor data (fan speed, acoustic noise, etc.) and software metrics (CPU, memory, I/O, etc.)? If you work with long time series, you will inevitably have to detect changes in the data-generating model. Change-point detection is a crucial task for such signals. It consists in estimating the timestamps when the underlying signal model changes. First introduced in the 50s to monitor quality changes in industrial processes, this subject has since been extended to numerous contexts, such as sound/speech processing, human activity recognition, DNA analysis, analysis of COVID-19 policies' effects, software and hardware monitoring, etc. 
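As a minimal illustration of "estimating the timestamps when the underlying signal model changes", the sketch below finds a single change in the mean of a signal by trying every split and keeping the one with the lowest total squared error. It uses only the standard library and is not the ruptures API; the function names, the `min_size` parameter and the toy signal are assumptions made for this example.

```python
from typing import Sequence


def squared_error_cost(segment: Sequence[float]) -> float:
    """Sum of squared deviations from the segment mean (0 for an empty segment)."""
    if not segment:
        return 0.0
    mean = sum(segment) / len(segment)
    return sum((x - mean) ** 2 for x in segment)


def detect_single_change_point(signal: Sequence[float], min_size: int = 2) -> int:
    """Return the index k that best splits the signal into signal[:k] and signal[k:].

    Assumes exactly one change in the mean; both segments must contain at
    least `min_size` samples.
    """
    if len(signal) < 2 * min_size:
        raise ValueError("signal is too short for the requested min_size")
    candidates = range(min_size, len(signal) - min_size + 1)
    return min(
        candidates,
        key=lambda k: squared_error_cost(signal[:k]) + squared_error_cost(signal[k:]),
    )


# Toy signal: the mean jumps from about 1 to about 5 at index 5.
signal = [1.0, 1.2, 0.9, 1.1, 1.0, 5.1, 4.9, 5.0, 5.2, 4.8]
change_index = detect_single_change_point(signal)
print("estimated change point at index", change_index)
```

This brute-force search recomputes each segment cost from scratch, so it is quadratic in the signal length. Real signals usually contain an unknown number of changes and other kinds of change than a mean shift, which is where dedicated search methods and cost functions come in.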
Over several decades, this subject has generated an important but heterogeneous body of work. This talk will help data scientists, engineers and researchers navigate this vast literature. We will start by describing the mathematical and algorithmic background behind change-point detection in a high-level and easy-to-understand fashion. Then, we will introduce [ruptures](https://github.com/deepcharles/ruptures), a Python package containing many change-point detection methods, as well as calibration and visualisation routines. Algorithms will be illustrated in a real-world biomedical application. At the end of the talk, the audience will be able to understand when to use change-point detection algorithms and how to calibrate and integrate them in a complex data pipeline. **Time breakdown:** - Introduction and motivations: 5 min - Background on change-point detection: 10 min - Python framework: 5 min - Illustration on a real-world biomedical data pipeline: 10 min - Q&A: 5 min false https://pretalx.com/pyconde-pydata-2024/talk/ZKYA9W/ https://pretalx.com/pyconde-pydata-2024/talk/ZKYA9W/feedback/ B05-B06 Using LLMs to Create Knowledge Graphs From a Large Corpus of Parliamentary Debates Talk (long) 2024-04-22T16:10:00+02:00 16:10 00:45 Large Language Models (LLMs) have proven to be incredibly powerful on a range of tasks. They do however, have certain limitations when the input context becomes significantly large. Solutions such as Retrieval Augmented Generation (RAG) do a great job in providing context from custom data without retraining any models but they too have limitations, especially when the context is spread out over many documents. Consider the question “Which projects has person X worked on?”. Information required to answer this question may be spread out over hundreds of documents, making it difficult for an LLM alone to answer. 
One way to overcome this issue is to use an LLM as an entity extraction tool, which can extract entities and relationships from documents and load that data into a structured format such as a knowledge graph. In this talk, I will demonstrate this process on a dataset of parliamentary debates, showing how downstream analytics becomes more intuitive and feasible. pyconde-pydata-2024-41739-using-llms-to-create-knowledge-graphs-from-a-large-corpus-of-parliamentary-debates PyData: Natural Language Processing & Computer Vision Usman en In this talk, I will demonstrate the process through which I implemented a solution to create knowledge graphs using LLMs and why this can be powerful. Agenda: - Limitations of LLMs and RAG for specific tasks - Knowledge graph (KG) basics - Creating KGs using LLMs - Dataset and use-case: official parliamentary debates - Practical experience in creating an LLM-based pipeline - Retrieving data using natural language i.e. Text2SQL - Future work false https://pretalx.com/pyconde-pydata-2024/talk/LWWQ9U/ https://pretalx.com/pyconde-pydata-2024/talk/LWWQ9U/feedback/ A1 Best of both worlds - How we built an AI-aided content creation tool for language learning Talk (long) 2024-04-22T11:25:00+02:00 11:25 00:45 Discover how Babbel bridged the gap between tailored language learning and scalability through an AI-aided content creation tool. Our approach amalgamates human expertise with Generative Artificial Intelligence, enabling personalized content creation on a large scale. Join us on our development journey and the different iterations we went through. We will demo the tool's current version and its AI features. Learn about the tech stack and what lies ahead in our development pipeline. 
pyconde-pydata-2024-41668-best-of-both-worlds-how-we-built-an-ai-aided-content-creation-tool-for-language-learning General: Industry & Academia Use-Cases Hector HernandezLea Petters en Babbel learners value the high quality content that follows an educational methodology and covers everything a learner needs to become conversational in a foreign language. However, language learning cannot be approached with a one-size-fits-all strategy. Learners have different motivations, interests, goals & learning needs that they want to see addressed throughout their learning path. Relying only on human learning experts to create thousands of tailored learning items to personalize our content is not a scalable solution. Luckily, recent developments in Generative Artificial Intelligence (GenAI) and its high-performing Large Language Models (LLMs) offer great opportunities to leverage artificial intelligence (AI) in the content creation process to enable large-scale personalization of content. Let us take you on our journey of developing an AI-aided content creation tool for language learning which combines the best of both worlds, namely using AI to automate and scale various steps within the content generation process and putting human intelligence (HI) in the loop to make sure that our content meets the expectations of our learners and fits the Babbel way of learning. We will give you an overview of our development process with the help of our cross-functional team and walk you through the different iterations - from initial workflow analysis to leveraging the power of connecting our tool to Babbel’s proprietary data. Additionally, we will demo the current version of the tool and give a quick tour of the different AI features that we already included. We will give an overview of the used tech stack and a quick outlook on what is next in the development pipeline. 
false https://pretalx.com/pyconde-pydata-2024/talk/Y3FLEH/ https://pretalx.com/pyconde-pydata-2024/talk/Y3FLEH/feedback/ A1 Power structures. The fair advantage Talk 2024-04-22T12:15:00+02:00 12:15 00:30 Humans are complex. As developers, we wanna ignore that ... but to do our job right, we cannot. Let's talk about power, motivation, techno-sociology, politics and why all of this is important for our job. pyconde-pydata-2024-39662-power-structures-the-fair-advantage General: Ethics & Privacy Anja Kunkel en Have you ever been in the following situation? You know for certain that you are technically right. Your project has to be done for the benefit of the company. But you cannot convince your boss for whatever reason. You are stuck. - This might be the glorious moment of informal structures and networking. You will need to know whom else to talk to. Whom can you trust and who has the power to convince your boss? The best answer will rarely be found in the formal organizational structure. As developers, we often think in models and charts. We are used to formalizing worded requests into code and structures to solve problems. And we are good at it. But what you cannot fully put into models are humans and human behavior. This is also true for the human interactions inside companies and networks. Organigrams never tell the truth about an organization. Power and influence are more complex than formal structures can describe. In this talk, I wanna dive into how human interactions inside companies are at the same time complex, powerful and worth exploring. Disclaimer: This is no talk about unfair techniques. I will not provide you dark magic. My goal is to provide you with the knowledge to play fairly in a complex world. 
false https://pretalx.com/pyconde-pydata-2024/talk/8LNYPD/ https://pretalx.com/pyconde-pydata-2024/talk/8LNYPD/feedback/ A1 Tailored and Trending: Key learnings from 3 years of news recommendations Talk (long) 2024-04-22T13:45:00+02:00 13:45 00:45 Every day, we engage with news, and more often, these are curated by recommendation engines. Building such an algorithm poses some unique challenges, different from movie or product recommendations: articles have a short lifetime because nothing is older than yesterday's news. The data is heavily biased by the different positioning of articles on the page, and journalistic principles and brand identity should be represented in the article selection. At Axel Springer National Media and Tech, we overcome these challenges by leveraging our domain knowledge combined with simple statistics instead of black-box machine learning models. This talk will share some of our learnings that can be applied to recommendation systems and data science projects in general. pyconde-pydata-2024-40429-tailored-and-trending-key-learnings-from-3-years-of-news-recommendations PyData: Machine Learning & Deep Learning & Stats Dr. Christian Leschinski en #### What is special about news recommendations? - We are used to recommendations from Netflix, Amazon, or TikTok. All of these apps have logged-in users that can be easily tracked. News websites, on the other hand, have a large share of unknown users that can only be tracked via first-party cookies. Therefore, there is much more cold start in the user dimension. In addition to that, movies, products, and funny videos have relatively long lifetimes, whereas news articles are often only relevant for a few hours. This means that recommendation systems have much less time to collect information about what is relevant for whom, and there is a lot of cold start in the item dimension. 
- Users are more critical with the selection of news articles that are presented to them compared to selections of products or movies. News recommendation is not only about finding the most relevant items; it is also about putting items in the right relationship to each other to reflect journalistic considerations and brand values. For example, often articles should be sorted according to the seriousness of the topic, or the topic's relevance for society. Similar articles should be placed next to each other, etc. - The front page plays an outsized role for news websites. Users come here to get an overview of what is happening in the world. Consequently, the data generated by these websites is heavily dominated by effects that originate in the structure and mechanics of the front page. Articles shown on top of this page with a large image will be clicked much more likely, compared to an article at the bottom of the page with just a small headline. #### How do news recommendations typically work? - Recommendation engines are often closely associated with collaborative filtering. However, collaborative filtering systems struggle with cold start, which is especially prevalent for news articles and users of media sites. At the same time, there are many simple ways to rank articles. Articles can be sorted according to their age, their popularity, or according to how often a user has read articles from the same category before. Based on our experience, most systems deployed in practice use a combination of these principles along with collaborative filtering. Especially for smaller widgets, multi-armed bandit approaches are also popular, where the algorithm just tries different articles and keeps showing those that tend to have the highest CTR. #### What is special about our approach to news recommendations? - One can think of recommendation as a simple click prediction problem. 
We have one user and many items and want to use features of the user and the items to predict how likely the user will click. The articles can then be ranked and selected based on these probabilities. Therefore, we are not tied to use collaborative filtering algorithms but can use any machine learning algorithm of our choice. - A major feature for our system is to identify articles that are trending. Most popular feeds and rankings are widely used, but as an absolute measure, they are heavily influenced by the position bias. The articles on top of the page are most likely to get the most clicks, therefore they will be put on top of the page again. This cycle continues until the story becomes so uninteresting that it starts to perform worse than other stories in worse positions. In contrast to that, we refer to relative performance as trendingness. If a story performs better than usual for its position, then it is trending. The beauty of this approach is that it makes the performance of articles at the top and at the bottom of the page comparable to each other. You can be 10 percent better or worse than expected in all positions of the page. The ugly part is that numbers at the bottom of the page start to become very small and therefore trendingness becomes very unstable. If an article is expected to get 1/100 of a click in a certain time interval, and there is an accidental click on this article, you suddenly have an incredible trending article. Unfortunately, most news pages contain many articles that are clicked with very low probabilities, therefore you have good chances to produce these outliers quite frequently. The art of constructing a good measure of trendingness is in finding a good way to regularize the trendingness to avoid these effects. - Position bias on news media sites is so strong that a classification model that predicts clicks solely using the position of an article as a feature will have an AUC of about 0.8. 
Consequently, a model trained on clicks will mostly just learn patterns that are correlated with the position. For example, if politics articles tend to be placed higher on the page than sports articles, the model will learn that politics articles generally click better than sports articles. We can avoid this by giving the model information about the position, but then the algorithm mostly picks up position-related patterns that cannot be exploited when choosing which article to put in one specific position. - When training our recommendation algorithm, we overcome the position bias problem by weighting clicks so that they are compared on neutral grounds. First, we determine the click probability of an article based on its position alone. Then we weight clicks and non-clicks according to their relative probability. - A click that was supposed to happen with a probability of 0.1 becomes 1/0.1 - 1 = 9, and a click with a probability of 0.01 becomes 1/0.01 - 1 = 99. A likely click gets a lower weight than an unlikely one. - We also derive information from non-clicks. A non-click with a probability of 0.9 becomes -1/0.9 + 1 = -0.1. If an article is presented in a prominent position, but it is not clicked by the user, this is an expression of disinterest and it can help to feed our algorithm. - By turning clicks into weighted clicks, we essentially turn the problem from a classification problem into a regression problem. On average, the weighted clicks are equal across all positions, so that the position bias is eliminated. - One of the features that surprised us the most with its good performance is our "article already seen" feature. For each user and every recommendable article, we keep a counter that measures how often the article was already shown in a prominent position but not clicked by the user. These scores are based on the position-based click probabilities that we also use for the weighted clicks. 
If an article gets shown in a position with an average CTR of 0.1, the score is 0.1 the next time the article could potentially be recommended to the user. If the article now gets shown in a lower position with a click probability of 0.01, the score increases to 0.11 next time. The model then learns that articles that were shown multiple times in prominent positions before but were not clicked are likely not going to be clicked next time they are shown, either. As a consequence, the page becomes fresher and A/B test results indicate a meaningful uplift compared to a model without this feature. #### What have we learned? - Websites usually track what users do, but not what they do themselves. Our algorithms rely heavily on the fact that we track who saw what and in which position. This gives us the ability to overcome the position bias and significantly improve our algorithms. - We do simple things for complicated reasons. The key advantage of simple statistical models over black-box algorithms is that they are easier to debug. Every time we replace a boosted tree or something similar with a linear model, we realize that it is not acting the way we expected. We can then make the necessary adjustments - for example, by adding well-crafted features that leverage our domain expertise. At the end of the process, the linear model becomes better than the black-box model was in the beginning. true https://pretalx.com/pyconde-pydata-2024/talk/XDQNCR/ https://pretalx.com/pyconde-pydata-2024/talk/XDQNCR/feedback/ A1 A Retrieval Augmented Generation system to query the scikit-learn documentation Talk 2024-04-22T14:35:00+02:00 14:35 00:30 The scikit-learn website currently employs an "exact" search engine based on the Sphinx Python package, but it has limitations: it cannot handle spelling mistakes and queries based on natural language. 
To address these constraints, we experimented with using large language models (LLMs) and opted for a retrieval augmented generation (RAG) system due to resource constraints. This talk introduces our experimental RAG system for querying scikit-learn documentation. We focus on an open-source software stack and open-weight models. The talk presents the different stages of the RAG pipeline. We provide documentation scraping strategies that we designed based on numpydoc and sphinx-gallery, which are used to build vector indices for the lexical and semantic searches. We compare our RAG approach with an LLM-only approach to demonstrate the advantage of providing context. The source code for this experiment is available on GitHub: https://github.com/glemaitre/sklearn-ragger-duck. Finally, we discuss the gains and challenges of integrating such a system into an open-source project, including hosting and cost considerations, comparing it with alternative approaches. pyconde-pydata-2024-42921-a-retrieval-augmented-generation-system-to-query-the-scikit-learn-documentation PyData: Generative AI Guillaume Lemaitre en Currently, the scikit-learn website provides an "exact" search engine based on the tools provided by the Sphinx Python package (i.e., https://www.sphinx-doc.org/). The current search engine is implemented in JavaScript and runs locally using an index built when generating the documentation. This solution has the advantage of being lightweight and does not require any server to handle the query. However, the complexity of the query treated is weak: since the search is "exact," it is not robust to spelling mistakes, and the search is intended for searches based on keywords. As large language models (LLMs) are becoming more popular, we have been interested in experimenting with this technology, knowing that they could address some of the previously stated limitations. 
As an open-source project, we have limited resources in terms of compute and limited available datasets; therefore, we discarded the option of fine-tuning an LLM and leaned towards retrieval augmented generation (RAG) systems. This talk presents an experimental RAG system developed to query the scikit-learn documentation. As constraints, we impose ourselves to use an open-source software stack and open-weight models to build our system. The talk is decomposed as follows: First, we provide some background on the RAG system and the pipeline to follow to implement such a system. Then, we go into details in the different stages of the RAG pipeline. We provide some insights regarding documentation scraping strategies that we developed by leveraging the `numpydoc` and `sphinx-gallery` parser. Then, we discuss the solution that we tested to perform lexical and semantic searches. Finally, we explain how the context found can be fed to the LLM to help generate an answer to the user query. We provide a small demo to compare queries performed on an LLM-only system and on the developed RAG system. All the code for the experiment is hosted at the following GitHub repository: https://github.com/glemaitre/sklearn-ragger-duck. Finally, we put into perspective the gains and pains of such an RAG system when it comes to integrating it into an open-source project. Notably, we question the hosting and cost of such systems and compare it with other approaches that could tackle some of the original issues. false https://pretalx.com/pyconde-pydata-2024/talk/ZCKQVG/ https://pretalx.com/pyconde-pydata-2024/talk/ZCKQVG/feedback/ A1 Moving from Offline to Online Machine Learning with River Talk 2024-04-22T15:35:00+02:00 15:35 00:30 The foundations of machine learning were built on offline batch processing techniques for model training and inference. 
As organisations become more dependent on real-time data, the technological trend for machine learning in production is moving towards adding an online stream processing approach. This has benefits such as lower computational requirements due to being able to incrementally learn from a stream of data points, which enables the continual upgrading of models by adapting to real-time changes in data. Learn how to get started on your online ML journey with River. pyconde-pydata-2024-41848-moving-from-offline-to-online-machine-learning-with-river PyData: Machine Learning & Deep Learning & Stats Tun Shwe en The foundations of machine learning were built on offline batch processing techniques for model training and inference. As organisations become more dependent on real-time data, the technological trend for machine learning in production is moving towards adding an online stream processing approach. This has benefits such as lower computational requirements due to being able to incrementally learn from a stream of data points, which enables the continual upgrading of models by adapting to real-time changes in data. This has wide applications in industries such as cyber security, banking, healthcare, IIoT and any industry that involves processing large volumes of high throughput data and adapting predictive capability with real-time data feeds. You’ll leave this talk with an understanding of the differences between offline and online machine learning, how to complement one with the other and enough streaming concepts and best practices needed to get started on your online ML journey with River, an open source Python ML library. 
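To make the offline/online contrast above concrete, here is a minimal sketch of incremental learning in plain Python. It deliberately does not use River: the class `OnlineLinearRegression`, the `stream` helper and the synthetic data are all hypothetical and written only for this illustration. The method names `learn_one` and `predict_one` follow River's naming convention, but nothing here is River's implementation.

```python
class OnlineLinearRegression:
    """Linear model updated one sample at a time with SGD (illustrative only)."""

    def __init__(self, lr=0.1):
        self.lr = lr
        self.weights = {}  # feature name -> weight, grows as features appear
        self.bias = 0.0

    def predict_one(self, x):
        return self.bias + sum(self.weights.get(k, 0.0) * v for k, v in x.items())

    def learn_one(self, x, y):
        # One gradient step on the squared error of a single sample.
        error = self.predict_one(x) - y
        self.bias -= self.lr * error
        for k, v in x.items():
            self.weights[k] = self.weights.get(k, 0.0) - self.lr * error * v


def stream(n):
    """Synthetic, made-up stream that follows y = 2 * x + 1."""
    for i in range(n):
        x = (i % 10) / 10
        yield {"x": x}, 2 * x + 1


model = OnlineLinearRegression()
for features, target in stream(5000):
    # Predict first, then learn: each sample is seen once and then discarded,
    # so memory use does not grow with the length of the stream.
    _ = model.predict_one(features)
    model.learn_one(features, target)
```

An offline model would need the full dataset in memory and a retraining run to absorb new data; here the model state is only a handful of numbers that are adjusted as each data point arrives.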
false https://pretalx.com/pyconde-pydata-2024/talk/G9S3MR/ https://pretalx.com/pyconde-pydata-2024/talk/G9S3MR/feedback/ A1 Put your RAG to the test: Component-per-component evaluation of our LLM-powered airplane manufacturing assistant Talk (long) 2024-04-22T16:10:00+02:00 16:10 00:45 Your RAG-powered LLM application might look pretty convincing at first glance, but how do you really know if it’s any good? And how do you justify the design choices you make? In this talk, you will learn about the RAG evaluation concept we produced at Airbus for evaluating the components of our digital engineering assistant, its implementation with open source tools paired with Google Vertex AI, and what we learnt in the process. pyconde-pydata-2024-43022-put-your-rag-to-the-test-component-per-component-evaluation-of-our-llm-powered-airplane-manufacturing-assistant PyData: Generative AI Nataliia Kees en Nowadays, Retrieval Augmented Generation (RAG) architecture has become quite the standard approach for building high-quality document search products or personal assistant applications. Prototyping a RAG application might yield quite convincing results from the very first stages of development, but how do you know if it’s really any good when you move your application from prototype into production? And how do you justify the design choices you make? For example, do you know if long-context models would perform better than short-context models with chunking for long-form documents you have at hand? Or, what difference does it make if you keep your different types of documents in one index or in separate ones? Or, is usage of few-shot learning really worth it for your use case, given that adding examples can increase the cost dramatically compared to zero-shot learning? And of course, how do you know there isn’t a better prompt out there for making the LLM do exactly what you expect it to? 
At Airbus, we went through this thought process during the development of a RAG-based assistant for creation of assembly manuals - documents which help our colleagues in Manufacturing navigate through the airplane parts construction procedures. For answering these and other questions, we produced an evaluation concept for our Generative AI applications, which relies on different methods and metrics for RAG evaluation end-to-end and testing each of its components separately. In this talk, we will present our evaluation concept, how we implemented it with tools like LangChain and Ragas, what metrics we use and how we conduct our experiments with the help of Google Vertex AI Pipelines. true https://pretalx.com/pyconde-pydata-2024/talk/WEVXJS/ https://pretalx.com/pyconde-pydata-2024/talk/WEVXJS/feedback/ A03-A04 The Secret Life of Metaclasses Tutorial 2024-04-22T11:25:00+02:00 11:25 01:30 Metaclasses. What are they? Where do they live? How do they reproduce? Did you know that you can make your classes receive keyword arguments, just like functions? And that they can be decorated as well? Do you want to understand how classes, metaclasses and decorators work and what are they good for? In this hands-on coding session we will inspect the inner workings of how Python creates classes, and how decorators, meta-classes and methods from superclasses can influence this process. We'll explore: * normal and special methods * how attribute lookup works between instances and classes * what are descriptors, and how they fit into attribute lookup process * what is the relationship between instances, classes and metaclasses * what are metaclasses for * and some other metaprogramming odds and ends All that is required for you to enjoy this session is that you have written a class in Python. If you've done the original [Python Tutorial](https://docs.python.org/3/tutorial/index.html), that should be more than enough. 
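As a taste of what the abstract describes, the following is a minimal sketch of a class that receives a keyword argument, a metaclass that intercepts class creation, and a class decorator. The names `Registry`, `add_repr`, `CsvExporter`, `Dynamic` and the `tag` keyword are hypothetical and not taken from the tutorial materials.

```python
class Registry(type):
    """Hypothetical metaclass: records each class it creates under a tag."""

    by_tag = {}

    def __new__(mcls, name, bases, namespace, tag=None):
        # `tag` is the keyword argument from the class statement; it is
        # consumed here and not forwarded to type.__new__.
        cls = super().__new__(mcls, name, bases, namespace)
        if tag is not None:
            mcls.by_tag[tag] = cls
        return cls


def add_repr(cls):
    """Class decorator: runs after the metaclass has built the class."""

    def __repr__(self):
        return f"<{type(self).__name__}>"

    cls.__repr__ = __repr__
    return cls


@add_repr
class CsvExporter(metaclass=Registry, tag="csv"):
    pass


# Calling a metaclass creates a class: `type` can build one without the
# `class` keyword.
Dynamic = type("Dynamic", (), {"answer": 42})
```

Here `type(CsvExporter)` is `Registry`, and `Registry.by_tag["csv"]` is `CsvExporter`. For a simple registration like this, `__init_subclass__` on a base class would usually be enough, so a custom metaclass is rarely required.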
pyconde-pydata-2024-41676-the-secret-life-of-metaclasses PyCon: Python Language & Ecosystem Leonardo Rochael Almeida, Luciano Ramalho en Class outline: * 10 min.: Intro and Setup * 15 min.: Every time is "runtime": * Functions, classes and methods are created at runtime * The dual responsibility of `class` * Attribute lookup and method resolution order * The role of `.__dict__` and `.__slots__` * Special methods, giving instances superpowers * 10 min.: Everything is an object: * Functions, methods and classes are also objects * Descriptors, properties and method binding * The two functionalities of `type` * And how to create a class without the `class` keyword * 10 min.: Metaclass is the class of the class: * Calling a class creates an instance, calling a metaclass creates a class * `type` & `object`: class relations * Creating and using metaclasses * 15 min.: What are metaclasses for? * Giving classes special methods * Intercepting class creation * Keyword arguments in class declarations * Preparing the class namespace * The role of the methods: `__call__`, `__new__` & `__init__` * What are metaclasses **not** for * 5 min.: complete debugging walkthrough * class creation * instance creation * instance use * 5 min.: You're unlikely to ever need to create a metaclass * `__init_subclass__` * Class decorators * `__class_getitem__` * Capturing descriptor names and ordering * 5 min.: Examples * 5 min.: conclusion and questions false https://pretalx.com/pyconde-pydata-2024/talk/BKBNRF/ https://pretalx.com/pyconde-pydata-2024/talk/BKBNRF/feedback/ A03-A04 Build TikTok's Personalized Real-Time Recommendation System in Python with Hopsworks Tutorial 2024-04-22T13:45:00+02:00 13:45 01:30 The real-time recommendations engine in TikTok, Monolith, is so good it has been described as "digital crack" (by Andrej Karpathy, former head of AI at Tesla). 
In this tutorial, we will build the core components of TikTok Monolith (a retrieval and ranking architecture): a stream processing feature pipeline, a two-tower embedding model to support personalized queries based on each user's history/context, and a simple user interface in Python (Streamlit). Our real-time machine learning system will consist of 3 Python programs - the feature pipeline, the training pipeline, and the online inference pipeline - and the ML infrastructure they require will be provided by the open-source Hopsworks platform, including a feature store, vector database, model serving, and model registry. pyconde-pydata-2024-41302-build-tiktok-s-personalized-real-time-recommendation-system-in-python-with-hopsworks PyData: Machine Learning & Deep Learning & Stats Jim Dowling en The real-time recommendations engine in TikTok is so good it has been described as "digital crack" (by Andrej Karpathy, former head of AI at Tesla). It is a retrieval and ranking architecture that uses significant ML infrastructure, including a real-time feature store, a vector database, a model registry, and model serving infrastructure. In this tutorial, we will build the core components of TikTok Monolith as 3 ML pipelines: a stream processing feature pipeline that takes user actions (clicks, swipes, searches) written to Kafka and computes features that are stored in the Hopsworks online store in less than 1 second. We will train a two-tower embedding model to support personalized queries using training data grounded on each user's history/context and the videos they clicked/didn't-click on. We will develop an online inference pipeline that takes a user query, encodes it as an embedding to retrieve candidate videos, then uses an online feature store to enrich the candidates before a ranking model personalizes the order of candidates for the client. We will even develop a simple user interface in Python (Streamlit) to show the whole system working visually. 
Our real-time machine learning system will consist of 3 Python programs - the feature pipeline, the training pipeline, and the online inference pipeline - and the ML infrastructure they require will be provided by the open-source Hopsworks platform, including a feature store, vector database, model serving, and model registry. false https://pretalx.com/pyconde-pydata-2024/talk/DPGRGW/ https://pretalx.com/pyconde-pydata-2024/talk/DPGRGW/feedback/ A03-A04 Refactoring Large Programs Tutorial 2024-04-22T15:35:00+02:00 15:35 01:30 One of the most challenging tasks in software engineering is cleaning up complex software with 10,000-100,000 lines of code. The problem gets worse if you are taking over legacy code. The fact that the Python language enforces neither strict typing nor encapsulation does not help either. What should you do if throwing away everything and rewriting the program from scratch is not an option? In this tutorial, we will exercise refactoring a larger program that is undocumented, unstructured and untested. We will take a messy example program and work through a list of procedures that may help you in your next big refactoring. pyconde-pydata-2024-41842-refactoring-large-programs PyCon: Programming & Software Engineering Dr. Kristian Rother en Refactoring Large Programs You can find the code and installation instructions for the tutorial at https://github.com/krother/space One of the most challenging tasks in software engineering is cleaning up complex software with 10,000-100,000 lines of code. The problem gets worse if you are taking over legacy code. The fact that the Python language enforces neither strict typing nor encapsulation does not help either. What should you do if throwing away everything and rewriting the program from scratch is not an option? In this tutorial, we will exercise refactoring a larger program that is undocumented, unstructured and untested. 
We will take a messy example program and work through a list of procedures that may help you in your next big refactoring. These include: * review the code * write a minimal test * add type annotations * extract core data structures * separate easily cleanable parts from very bad parts * remove excess dependencies * be very transparent about which features of the code you trust The main takeaway of the tutorial is that large-scale refactoring is possible. Although a large refactoring is difficult and costly, you should learn that it can be approached systematically. You will walk away with ideas on where to start refactoring. You will also develop your awareness of how difficult a complex refactoring is. Looking at a messy codebase realistically is not only important to manage the expectations of clients and stakeholders, it is also important to manage the stress that comes with it. This tutorial addresses people with fluency in basic Python. You should know how a class in Python works and what a unit test is. It helps if you have done simple refactoring (extract variable, extract function) before. I encourage junior developers to attend the tutorial to learn and discuss what a potentially overwhelming situation looks like. The tutorial session is structured in the following way: * 0:00 Interactive Warm-up with the audience: Who is here? * 0:05 Download and inspect code * 0:10 Quick code review * 0:20 Refactoring I: create a minimal test * 0:40 Refactoring II: extract data structures * 1:00 Refactoring III: isolate code * 1:20 buffer time and Q & A The messy code and refactoring recipes will be provided to participants through GitHub. false https://pretalx.com/pyconde-pydata-2024/talk/CMM8S3/ https://pretalx.com/pyconde-pydata-2024/talk/CMM8S3/feedback/ A05-A06 No More Raw SQL: SQLAlchemy, ORMs & asyncio Tutorial 2024-04-22T11:25:00+02:00 11:25 01:30 Managing a database and synchronizing service data representation with the database can be tricky. 
In this workshop, you’ll learn how to use SQLAlchemy, a powerful SQL toolkit, to simplify this task. We’ll cover how to leverage SQLAlchemy’s Object Relational Mapper (ORM) system, and how to use SQLAlchemy's asyncio extension in your async services. Participants will walk out of this tutorial having learned how to: - Use SQLAlchemy for database operations in Python, enhancing the readability and maintainability of the code - Build Python classes (ORMs) that represent the database tables - Experiment with different relationship-loading techniques to improve querying performance - Utilize SQLAlchemy’s asyncio extension to interact with databases asynchronously pyconde-pydata-2024-40843-no-more-raw-sql-sqlalchemy-orms-asyncio PyCon: Programming & Software Engineering Rhythm Patel, Aya Elsayed en OUTLINE - Introduction [15 min] - What is SQLAlchemy? - Why use SQLAlchemy, and what are its advantages? - Components overview, such as engine, dialect, connection pool, etc. - Initial setup for the hands-on workshop with GitHub Codespaces [5 min] - Run and explore an example service that has database queries with raw SQL - Adding SQLAlchemy to the example service - Set up SQLAlchemy [10 min] - Set up engine & dialect to connect with the DB - Use SQLAlchemy Core to query the DB - Add ORMs [20 min] - What are ORMs? - How to represent a basic table? 
- Modeling different relationships (e.g., 1-1 and 1-many) between the classes - Using ORMs to query the DB - Convert other queries using SQLAlchemy [5 min] - Improve performance by changing relationship loading techniques [10 min] - Consequences of certain models: Talk about N+1 problem and bidirectional relationships - Work with different loading techniques, such as lazy loading and eager loading - The SQLAlchemy.asyncio extension - Brief description of asyncio [10 min] - Understanding coroutines - Scheduling tasks on the asyncio event loop - A hands-on walkthrough of SQLAlchemy’s asyncio extension [15 min] - Setting up SQLAlchemy in async mode - Performing a query and inserting it into the database - Using ORMs in queries using asyncio FORMAT This is an interactive tutorial where we will guide participants through the use of SQLAlchemy and ORMs to interact with a database. Participants will gain an understanding of SQLAlchemy and be well-versed enough to use it in their next project. Participants will be working on a repository via GitHub Codespaces, and they will be building on that throughout the tutorial. The Codespaces dev environment will include all required modules and a Dockerized PostgreSQL database, enabling a seamless setup. The repository will have a branch corresponding to each section of the workshop, so participants who have trouble with a step or aren’t able to finish on time can check out the corresponding branch and follow the rest of the workshop from there. We’ll start with an introduction to SQLAlchemy and its advantages. The rest of the tutorial will be hands-on. For each section, we will start by explaining the concept, then allowing participants to complete the relevant steps on the example service on their own laptops, and ask questions. We expect this to last around 10 minutes per concept. We will then give participants time to complete the steps on their own laptops and ask questions. 
AUDIENCE This tutorial is for Python developers of any level who write applications that interact with databases and want to learn how to leverage a tool like SQLAlchemy to seamlessly interact with their database and manage their data in a Pythonic way. Having a basic understanding of databases and SQL (such as inserting or reading data from a table) is sufficient. Participants should also be familiar with git and have a GitHub account, as we would use GitHub Codespaces to enable easy set-up for Python and the database. However, they do not need any prior knowledge of SQLAlchemy or ORMs, since we will explain that first. For the last part of the tutorial, it would help if attendees have some familiarity with coroutines or asynchronous programming, but it is not required, since we will be explaining these fundamental concepts first. Participants will walk out of this tutorial having learned how to: - Use SQLAlchemy for database operations in Python, enhancing the readability and maintainability of the code - Build Python classes (ORMs) that represent the database tables - Experiment with different relationship-loading techniques to improve querying performance - Utilize SQLAlchemy’s asyncio extension to interact with databases asynchronously false https://pretalx.com/pyconde-pydata-2024/talk/EHJRVF/ https://pretalx.com/pyconde-pydata-2024/talk/EHJRVF/feedback/ A05-A06 Build an AI Document Inquiry Chat with Offline LLMs Tutorial 2024-04-22T13:45:00+02:00 13:45 01:30 As we descend from the peak of the hype cycle around Large Language Models (LLMs), chat-based document inquiry systems have emerged as a high-value practical use case. Retrieval-Augmented Generation (RAG) is a technique to share relevant context and external information (retrieved from vector storage) to LLMs, thus making them more powerful and accurate. In this hands-on tutorial, we’ll dive into RAG by creating a personal chat app that accurately answers questions about your selected documents. 
We’ll use a new [OSS project called Ragna](https://ragna.chat/en/latest/) that provides a friendly Python and REST API, designed for this particular case. We’ll test the effectiveness of different LLMs and vector databases, including an offline LLM (i.e., local LLM) running on GPUs on the cloud machines provided to you. Finally, we’ll conclude by demonstrating how to quickly build personal or company-level chat-based document interrogation systems. pyconde-pydata-2024-41740-build-an-ai-document-inquiry-chat-with-offline-llms PyData: Natural Language Processing & Computer Vision Pavithra Eswaramoorthy, Philip Meier en The ability to ask natural language questions and get relevant and accurate answers from a large corpus of documents can fundamentally transform organizations and make institutional knowledge accessible. Foundational LLMs like OpenAI’s GPT-4 provide powerful capabilities, but using them directly to answer questions about a collection of documents presents accuracy-related limitations. Retrieval-augmented generation (RAG) is the leading approach to enhancing the capabilities and usability of Large Language Models. In this tutorial, we will learn to use RAG to build document-inquiry chat systems using different commercial and locally running LLMs.
The topics we’ll cover include: * **Introduction to RAG**, how it works and interacts with LLMs, and Ragna - a framework for RAG orchestration * Creating a **basic chat function** that uses popular LLMs (like GPT) to answer questions about your documents, using a Python API in Jupyter Notebooks * Optimizing the chat through **experiments with different LLMs**, vector databases, context windows, and more * Running a **local LLM on GPUs** on the provided platform, and comparing its performance to commercial LLMs * Walkthrough of the **REST API for building web-apps** and user interfaces and exploration of the built-in (Panel-based) web application By the end of this tutorial, you will have an understanding of the fundamental components that form a RAG model, and practical knowledge of open source tools that can help you or your organization explore and build your own applications. This tutorial is designed to enable enthusiasts in our community to explore an interesting topic using some beginner-friendly Python libraries. false https://pretalx.com/pyconde-pydata-2024/talk/WPKRCT/ https://pretalx.com/pyconde-pydata-2024/talk/WPKRCT/feedback/ A05-A06 pytest tips and tricks for a better testsuite Tutorial 2024-04-22T15:35:00+02:00 15:35 01:30 pytest lets you write simple tests fast - but also scales to very complex scenarios: Beyond the basics of no-boilerplate test functions, this training will show various intermediate/advanced features, as well as gems and tricks. To attend this training, you should already be familiar with the pytest basics (e.g. writing test functions, parametrize, or what a fixture is) and want to learn how to take the next step to improve your test suites. If you're already familiar with things like fixture caching scopes, autouse, or using the built-in `tmp_path`/`monkeypatch`/... fixtures: There will probably be some slides about concepts you already know, but there are also various little hidden tricks and gems I'll be showing.
pyconde-pydata-2024-41706-pytest-tips-and-tricks-for-a-better-testsuite PyCon: Testing Florian Bruhin en We'll cover things like: - Recommended pytest settings for more strictness - What's xfail and why is it useful? - How to mark an entire test file or single parameters - Ways to deal with parametrize IDs and syntax - Useful built-in pytest fixtures - Caching for fixtures - Using fixtures implicitly - Advanced fixture and parametrization topics - How to customize fixtures behavior based on markers or custom CLI arguments - Patching, mocking, and alternatives - Various useful plugins, and how to write your own - Short intro to property-based testing with Hypothesis false https://pretalx.com/pyconde-pydata-2024/talk/DSFWRC/ https://pretalx.com/pyconde-pydata-2024/talk/DSFWRC/feedback/ Kuppelsaal Keynote - Safe Space or Trap? Creating Software like DuckDB in Academic Institutions Keynote 2024-04-23T09:15:00+02:00 09:15 00:45 DuckDB is an in-process analytical data management system. DuckDB is free and open source and rather popular. It is one of the fastest growing data systems to date, especially in the Python ecosystem. DuckDB was created at Centrum Wiskunde & Informatica (CWI) in Amsterdam, not entirely coincidentally the same place Python was created in. Later on, we founded a commercial company, DuckDB Labs, which now drives development. In my talk, I will discuss DuckDB, its origins, and the unique benefits and challenges of maintaining popular software in an academic setting. pyconde-pydata-2024-44830-keynote-safe-space-or-trap-creating-software-like-duckdb-in-academic-institutions Plenary Hannes Mühleisen en DuckDB is an in-process analytical data management system. DuckDB is free and open source and rather popular. It is one of the fastest growing data systems to date, especially in the Python ecosystem. DuckDB was created at Centrum Wiskunde & Informatica (CWI) in Amsterdam, not entirely coincidentally the same place Python was created in.
Later on, we founded a commercial company, DuckDB Labs, which now drives development. In my talk, I will discuss DuckDB, its origins, and the unique benefits and challenges of maintaining popular software in an academic setting. false https://pretalx.com/pyconde-pydata-2024/talk/HKFN8J/ https://pretalx.com/pyconde-pydata-2024/talk/HKFN8J/feedback/ Kuppelsaal 🌳 The taller the tree, the harder the fall. Determining tree height from space using Deep Learning and very high resolution satellite imagery 🛰️ Talk 2024-04-23T10:30:00+02:00 10:30 00:30 A case study of how we use Deep Learning based photogrammetry to calculate the height of trees from very high resolution satellite imagery. We show the substantial improvement achieved by switching from classical photogrammetric techniques to a deep learning based model (implemented in PyTorch), and the challenges we had to overcome to make this solution work. pyconde-pydata-2024-41750--the-taller-the-tree-the-harder-the-fall-determining-tree-height-from-space-using-deep-learning-and-very-high-resolution-satellite-imagery- PyData: Machine Learning & Deep Learning & Stats Ferdinand Schenck en The risk that a tree poses to line infrastructure (such as power lines) is determined by several factors, chief among them the height of the particular tree. The increasing availability of very high resolution satellite imagery makes it possible to use photogrammetric techniques to extract height information from a set of stereo satellite images. By using satellite imagery we can achieve a scale not possible by manual measurement. We found that classical techniques perform poorly on vegetation, and were handily outperformed by deep learning based techniques implemented in PyTorch. This improvement was not trivial to achieve, however, as creating labelled data in sufficient quantity was quite challenging. By increasing the quality of our height predictions we were able to more accurately calculate risk for our customers.
false https://pretalx.com/pyconde-pydata-2024/talk/ZFXZHG/ https://pretalx.com/pyconde-pydata-2024/talk/ZFXZHG/feedback/ Kuppelsaal Streamlining Python Development: A Practical Approach to CI/CD with GitHub Actions Talk 2024-04-23T11:05:00+02:00 11:05 00:30 Crafting code for minimal dependencies and maximum portability is an art. This talk focuses on how continuous integration and delivery ensure project resilience to Python updates and changes in the packaging ecosystem. Setting up automation around your project enhances peace of mind, improves code maintainability, and facilitates collaboration. pyconde-pydata-2024-42610-streamlining-python-development-a-practical-approach-to-ci-cd-with-github-actions PyCon: MLOps & DevOps Artem Kislovskiy en What I dislike most when dealing with code is encountering an error message indicating that well-crafted code, written a while ago in a language other than Bash, fails to run on the new system, new laptop, or some other operating system. It's an art to write code with minimal dependencies and maximum portability. The complexity increases in larger projects. This is where Continuous Integration and Continuous Delivery (CI/CD) pipelines prove useful. CI/CD can help you keep the project alive even without you being around. Dependencies could be automatically updated, and the code could be automatically tested and delivered to the end-user, be it you or someone else. This talk is about "YAML programming", which will help you write better Python code. The goal of the talk is to equip you with a set of building blocks to construct a CI/CD pipeline with GitHub Actions for your projects. Automating tasks as much as possible is highly beneficial. We'll cover best practices and helpful tools for writing and debugging CI/CD pipelines. Writing YAMLs is time-consuming and error-prone; my goal is to help you spend less time on it and benefit faster from automation.
false https://pretalx.com/pyconde-pydata-2024/talk/YHMUCL/ https://pretalx.com/pyconde-pydata-2024/talk/YHMUCL/feedback/ Kuppelsaal That’s it?! Dealing with unexpected data problems Talk 2024-04-23T11:40:00+02:00 11:40 00:30 Drawing on experience with multiple consulting projects, this talk shares experiences on how to deal with unexpected data problems. We discuss how far purely technical solutions as well as domain knowledge can be deployed to compensate for lacking data quality or quantity and when it might be better to scale down the original project scope. pyconde-pydata-2024-41468-that-s-it-dealing-with-unexpected-data-problems PyData: Machine Learning & Deep Learning & Stats Simon Pressler en And it was such a nice idea! Nearly everybody working with data has felt this sentiment at least once in their career. The promising idea for a cool new data tool meets the reality of lacking data quality or quantity. This talk aims to provide you with some options on what else you can do in these kinds of situations instead of giving up and filing the project away for the non-foreseeable future. Drawing on experience from multiple consulting projects, we discuss what is realistically possible and how to make the most out of the limited data you might find yourself confronted with. The talk covers a brief recap of the limitations arising from unexpectedly little and/or unclean data, before moving on to share lessons learned. We are going to discuss how far purely technical solutions might be able to provide fixes to some of the issues, before moving on to consider how domain knowledge can be deployed to compensate for lacking data quality or quantity. Next, this talk addresses under which circumstances it makes sense to keep pursuing your original goal and when it might be better to down-size expectations.
The talk concludes by arguing that despite all the problems arising from unexpected data scarcity, potential answers to important business problems can be found in small data settings if the right questions are asked. false https://pretalx.com/pyconde-pydata-2024/talk/JKWBBR/ https://pretalx.com/pyconde-pydata-2024/talk/JKWBBR/feedback/ Kuppelsaal Keynote - The art and science of tending open source orchards Keynote 2024-04-23T13:15:00+02:00 13:15 00:45 Over the history of free and open source software, we have gone through quite a few metaphors for open source projects: from homesteads in noosphere to puppies, roads & bridges, gardens, forests, and orchards. Regardless of the preferred comparison, we all can agree that behind every large open source project is a resilient contributor community. Is there a blueprint for it? How about a script for scaling a contributor community or a formula for contributor retention? In this talk, I will examine all these questions and share my insight on the art and science of fostering resilient open source communities. pyconde-pydata-2024-44832-keynote-the-art-and-science-of-tending-open-source-orchards Plenary Inessa Pawson en Inessa is building bridges between people, open science, and open source software, advocating for diversification of contribution pathways to open source and supporting its human infrastructure. She is an active contributor to the Python ecosystem (NumPy, Scientific Python, PyOpenSci, SciPy conference, PyCon US Maintainers Summit, PySWFL, PyLadies SoFlo) and broader open source (Contributor Experience Project, CHAOSS). In her role as Open Source Program Manager at OpenTeams, she leads initiatives focused on widening the contributor pipeline and bringing funding to more open source projects. Inessa is perpetually fascinated by incentive design, collaborative intelligence, and jazz.
false https://pretalx.com/pyconde-pydata-2024/talk/7TEYDQ/ https://pretalx.com/pyconde-pydata-2024/talk/7TEYDQ/feedback/ Kuppelsaal Robust Configuration Management with Pydantic's Data Validation Talk 2024-04-23T14:10:00+02:00 14:10 00:30 As applications grow, so does the number of configurable features. Managing consistent defaults, maintaining user and developer documentation, and ensuring uniform parsing among a growing number of client applications can become a challenge. Adding constraints like complex fallback hierarchies and backwards compatibility increases the probability of runtime errors. We show how [`Pydantic`'s](https://pydantic.dev/) strong data validation and integration into Python's type annotations can help build a strict specification for your configuration format, catch misconfiguration early, and mitigate the aforementioned problems with a non-formalized configuration management system. pyconde-pydata-2024-41743-robust-configuration-management-with-pydantic-s-data-validation PyCon: Programming & Software Engineering Philipp Stephan en We describe how we moved our configuration management system from a simple unstructured YAML format loaded into dictionaries into a fully formalized, typed, class-based system using [`Pydantic`'s][pydantic] data validation. While simple enough to begin with, we discuss the problems that emerged from the lack of tight specification of our early configuration system: Missing ahead-of-time validation and resulting runtime errors; out-of-sync code and browsable user documentation; incompatible defaults and subtle differences in various separate parsers scattered throughout many microservices; duplicated and brittle fallback logic. Using a strict specification can mitigate these issues by enabling static validation of configuration files, automatic documentation generation, centralized defaults, and flexible data transformation.
After discussing various available configuration management systems, we explain the motivation to hand-roll a simple system based on the data validation library [`Pydantic`][pydantic]. Popularized by its usage in [`FastAPI`][fastapi], it has become the de-facto standard for data validation in Python. Its deep integration into Python's type annotation system makes it a powerful tool for configuration management. After an introduction to [`Pydantic`][pydantic]'s capabilities and usage, specifically its features tailored to configuration management ([`pydantic.BaseSettings`][basesettings]), we share some tips and tricks encountered while speccing out our configuration file format. Additionally, we share some inspiration on our internal tooling to load and validate configuration, render up-to-date browsable user documentation, integration with CI systems, and lessons learned for an incremental transition from the loose `dict`-based system to the strictly typed class-based system powered by [`Pydantic`][pydantic]. [pydantic]: https://pydantic.dev/ [fastapi]: https://fastapi.tiangolo.com/ [basesettings]: https://docs.pydantic.dev/latest/api/pydantic_settings/ false https://pretalx.com/pyconde-pydata-2024/talk/RGWDCN/ https://pretalx.com/pyconde-pydata-2024/talk/RGWDCN/feedback/ Kuppelsaal Unlock the Power of Dev Containers: Build a Consistent Python Development Environment in Seconds! Talk (long) 2024-04-23T14:45:00+02:00 14:45 00:45 In this talk, we will explore the basic concepts of Dev Containers and demonstrate how they can support your everyday development as a Python programmer, data scientist, or machine learning engineer. With Dev Containers, you can build a consistent development environment in seconds, no matter where you are or what tools you use. And you know what? The Development Container Specification is even open source. Say goodbye to the hassle of setting up your development environment from scratch every time you start a new project!
We will start with a basic example and discuss how to set up a consistent Python development environment, including best practices for package management and GPU support. After this talk, you will be able to leverage the advantages of Dev Containers, allowing you to work from anywhere and be ready in seconds. If you're tired of wasting time setting up your development environment and want to unlock the power of Dev Containers, then this talk is a must-attend for you! pyconde-pydata-2024-42830-unlock-the-power-of-dev-containers-build-a-consistent-python-development-environment-in-seconds- PyCon: Programming & Software Engineering Thomas Fraunholz en In this talk, we will explore the basic concepts of Dev Containers and demonstrate how they can support your everyday development as a Python programmer, data scientist, or machine learning engineer. With Dev Containers, you can build a consistent development environment in seconds, no matter where you are or what tools you use. And you know what? The Development Container Specification is even open source. Say goodbye to the hassle of setting up your development environment from scratch every time you start a new project! We will start with a basic example and discuss how to set up a consistent Python development environment, including best practices for package management and GPU support. After this talk, you will be able to leverage the advantages of Dev Containers, allowing you to work from anywhere and be ready in seconds. If you're tired of wasting time setting up your development environment and want to unlock the power of Dev Containers, then this talk is a must-attend for you! false https://pretalx.com/pyconde-pydata-2024/talk/UG8THG/ https://pretalx.com/pyconde-pydata-2024/talk/UG8THG/feedback/ Kuppelsaal Community Conferences under the Hood. Perspectives and Best Practices in Volunteer Organization Panel 2024-04-23T16:00:00+02:00 16:00 01:00 PyCon DE & PyData Berlin is volunteer run. 
This session aims to underscore the significant role that volunteer organization plays in cultivating environments of authenticity, inclusion, and diversity within tech communities. pyconde-pydata-2024-47678-community-conferences-under-the-hood-perspectives-and-best-practices-in-volunteer-organization General: Community, Diversity, Career, Life and everything else Alexander CS Hendorf, Lais Carvalho, Valentina Scipione, Florian Wilhelm en Through a combination of individual presentations and interactive discussions, the panel will explore the challenges and triumphs of community organization. This session is designed not just for current and aspiring community leaders but for anyone passionate about fostering an inclusive, collaborative tech ecosystem. This panel brings together seasoned community organizers from diverse backgrounds to share their insights, experiences, and best practices in building and nurturing inclusive communities. Join us in this empowering session to discover how you can contribute to a more inclusive, diverse, and vibrant Python community through effective volunteer organization. Together, we can drive positive change and ensure that our communities remain strong, supportive, and forward-moving. false https://pretalx.com/pyconde-pydata-2024/talk/PLJKUH/ https://pretalx.com/pyconde-pydata-2024/talk/PLJKUH/feedback/ B09 Build a personalized Bitcoin (BTC) virtual assistant in Python with Hopsworks and LLM function calling Sponsored Talk 2024-04-23T10:30:00+02:00 10:30 00:30 The ambitious human desire to get rich without effort has been a major driving force behind the popularity of cryptocurrencies like Bitcoin and Ethereum. However, their high volatility makes them too unpredictable, and keeping track of our investment gains and losses over time can be tedious, if not boring. In this talk, we will define the different components necessary to build a personalized Bitcoin (BTC) virtual assistant in Python.
The assistant will help you analyze your transaction history, estimate future BTC prices, and calculate the future value of your holdings based on these predictions. It will be powered by LLMs and will make use of a recent technique called Function Calling to recognize the user intent from the conversation history. pyconde-pydata-2024-44946-build-a-personalized-bitcoin-btc-virtual-assistant-in-python-with-hopsworks-and-llm-function-calling Sponsor Javier de la Rúa Martínez en The ambitious human desire to get rich without effort has been a major driving force behind the popularity of cryptocurrencies like Bitcoin and Ethereum. However, their high volatility makes them too unpredictable, and keeping track of our investment gains and losses over time can be tedious, if not boring. In this talk, we will define the different components necessary to build a personalized Bitcoin (BTC) virtual assistant in Python. The assistant will help you analyze your transaction history, estimate future BTC prices, and calculate the future value of your holdings based on these predictions. It will be powered by LLMs and will make use of a recent technique called Function Calling to recognize the user intent from the conversation history. The ML system will be built in Python, following the best practices of the FTI (feature/training/inference) pipeline architecture, on top of the open-source Hopsworks platform which will provide the necessary ML infrastructure such as a feature store, model serving, and a model registry. false https://pretalx.com/pyconde-pydata-2024/talk/JRRET3/ https://pretalx.com/pyconde-pydata-2024/talk/JRRET3/feedback/ B09 Missing Data, Bayesian Imputation and People Analytics with PyMC Talk 2024-04-23T11:05:00+02:00 11:05 00:30 We demonstrate a range of different approaches to missing data imputation in employee engagement survey data.
Contrasting frequentist-style full-information maximum likelihood approaches with more direct Bayesian imputation and chained equation methods, we highlight how the different assumptions regarding the missing data license different inferences about the imputed values and ultimately the plausible causal narratives which can be expressed in PyMC. In particular, we use the hierarchical nature of employee engagement data to motivate a hierarchical approach to justifying the missing-at-random (MAR) assumption for imputation schemes in People Analytics. pyconde-pydata-2024-41053-missing-data-bayesian-imputation-and-people-analytics-with-pymc PyData: Machine Learning & Deep Learning & Stats Nathaniel Forde en There is no "agnostic statistics" when approaching the question of missing data. Theory quickly breaks against reality in the context of people analytics. All imputation schemes need to justify their assumptions of "strong-ignorability" or "missing-at-random" reasons for missing data. This is easier and cleaner in a Bayesian setting than in frequentist alternatives. This transparency is important when dealing with HR data. We will demonstrate both full information maximum likelihood (FIML) and Bayesian imputation by chained equation approaches to the imputation of missing data in the context of employee engagement survey data. We will use the probabilistic programming language PyMC to articulate the structures and conditional probabilities around missing data in hierarchical organisations. Non-response bias in engagement survey data often corrupts the overall picture of organisational health, and modelling the non-response bias helps uncover patterns or trends in missingness. These insights can be used diagnostically to locate the source of problems within the organisation, but we need to be willing to commit to the assumptions that license genuine causal inference.
In this way we present the problem of missing data as a gateway to an organisational focus on causal inference problems. Somewhat ironically, the lack of data can actually make the problems of causal inference more concrete for business stakeholders. false https://pretalx.com/pyconde-pydata-2024/talk/KXU7Q8/ https://pretalx.com/pyconde-pydata-2024/talk/KXU7Q8/feedback/ B09 Tackling the Cold Start Challenge in Demand Forecasting Talk 2024-04-23T11:40:00+02:00 11:40 00:30 In this talk, we address the Cold Start problem in Demand Forecasting, focusing on scenarios where historical data is scarce or nonexistent. This constitutes a common situation in practice, such as with the launch of new products in Retail. However, many Time Series and Machine Learning models encounter difficulties in handling this challenge, primarily due to their dependence on a substantial amount of historical data for effective training and prediction. We begin by providing an overview of established techniques used to address the Cold Start problem, including methods like padding, feature engineering, and leveraging item similarities. Additionally, we explore more recent advancements and emerging research, such as Transfer Learning for Time Series. While each technique presents its unique set of trade-offs, the challenge lies in determining the most suitable approach for a given dataset or use case. This aspect is often not widely understood, and our goal is to unravel this complexity by offering practical insights. Furthermore, we introduce a practical framework for systematically evaluating different forecasting strategies within the Cold Start setting, guiding you in selecting the most suitable approach for your datasets and use cases.
pyconde-pydata-2024-42907-tackling-the-cold-start-challenge-in-demand-forecasting PyData: Machine Learning & Deep Learning & Stats Alexander Meier, Daria Mokrytska en In this talk, we address the Cold Start problem in Demand Forecasting, focusing on scenarios where historical data is scarce or nonexistent. This constitutes a common situation in practice, such as with the launch of new products in Retail. However, many Time Series and Machine Learning models encounter difficulties in handling this challenge, primarily due to their dependence on a substantial amount of historical data for effective training and prediction. We begin by providing an overview of established techniques used to address the Cold Start problem, including methods like padding, feature engineering, and leveraging item similarities. Additionally, we explore more recent advancements and emerging research, such as Transfer Learning for Time Series. While each technique presents its unique set of trade-offs, the challenge lies in determining the most suitable approach for a given dataset or use case. This aspect is often not widely understood, and our goal is to unravel this complexity by offering practical insights. Furthermore, we introduce a practical framework for systematically evaluating different forecasting strategies within the Cold Start setting, guiding you in selecting the most suitable approach for your datasets and use cases. false https://pretalx.com/pyconde-pydata-2024/talk/H3X3AX/ https://pretalx.com/pyconde-pydata-2024/talk/H3X3AX/feedback/ B09 Content Recommendation with Graphs: From Basic Walks to Neural Networks Talk 2024-04-23T14:10:00+02:00 14:10 00:30 Discover how graph algorithms are transforming content recommendation in this insightful talk. We'll journey from the basics of graph-based models, exploring simple graph walks, to the cutting-edge realm of Graph Neural Networks.
Uncover the power of graph embeddings and learn when graph-based approaches excel in recommender systems. pyconde-pydata-2024-39508-content-recommendation-with-graphs-from-basic-walks-to-neural-networks PyData: Machine Learning & Deep Learning & Stats Dr. Mirza Klimenta en In this talk, we'll explore how the complex problem of content recommendation transforms when viewed through the innovative lens of graph algorithms. Imagine a world where content and users form a bi-partite graph, and the key to unlocking personalized recommendations lies in predicting links and weights within this graph. We'll embark on a journey starting from the foundational graph-based recommender models, where simple graph walks lay the groundwork. As we delve deeper, we'll uncover the potent capabilities of graph embeddings and the transformative impact of Graph Neural Networks. Finally, we'll wrap up with valuable insights on the scenarios where graph-based approaches shine the brightest in solving recommender problems. Whether you're a seasoned data scientist or new to the field of machine learning, this talk will equip you with a fresh perspective on leveraging graphs for sophisticated and effective content recommendation strategies. false https://pretalx.com/pyconde-pydata-2024/talk/RD9SU8/ https://pretalx.com/pyconde-pydata-2024/talk/RD9SU8/feedback/ B09 Personalizing Carousel Ranking on Wolt's Discovery Page: A Hierarchical Multi-Armed Bandit Approach Talk (long) 2024-04-23T14:45:00+02:00 14:45 00:45 Wolt's Discovery page serves as the primary gateway for millions of weekly users exploring diverse cuisines and products. With over 130,000 merchants in 25 countries, presenting relevant content poses a unique challenge. In this presentation, we address the complexities of personalizing the Discovery page using a hierarchical multi-armed bandit (MAB) approach built on the Python ecosystem. 
We outline the challenges specific to an expansive online delivery platform, introducing our MAB solution that incorporates hierarchical parameters at user, segment, city, and country levels. Leveraging Thompson Sampling for exploration and exploitation, our approach accommodates data sparsity challenges. Evaluation results, both offline and online, showcase the effectiveness of our solution. The talk concludes with insights into the resilient, scalable, and adaptive architecture underpinning our approach, featuring open-source libraries such as mlflow, Flyte, and Seldon Core. Our learnings and future steps toward a personalized, context-aware Discovery page cap off the presentation. Join us as we navigate the intricacies of recommendation challenges in the dynamic world of quick commerce. pyconde-pydata-2024-41608-personalizing-carousel-ranking-on-wolt-s-discovery-page-a-hierarchical-multi-armed-bandit-approach PyData: Machine Learning & Deep Learning & Stats Marcel Kurovski, Steffen Klempau en Wolt's Discovery page is the main entrance point for millions of weekly users seeking to explore new cuisines, order their favorite dish, or replenish their fridge's stock. The Discovery page is a vertical collection of multiple modules (carousels) which can stem from automatic and curated mechanisms. It features restaurants, retail venues, individual items and dishes along with a broad set of banners. Wolt consumers have distinct tastes and preferences - all of which can change over time and vary with context. However, they expect Wolt to show what's relevant to them and to be able to discover - coupled with a frictionless experience. We want to satisfy our users, keep them engaged and grow our customer base around the world. Wolt delivery covers over 130,000 merchants in more than 500 cities across 25 countries, which results in a substantial variety and size of content Wolt has to offer its customers.
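The abstract names Thompson Sampling but does not describe Wolt's implementation. As a rough illustration of the general technique only, here is a plain (non-hierarchical) Beta-Bernoulli Thompson Sampling sketch using Python's standard library. The carousel names, click-through rates, and function names are invented for the example, and the hierarchical user/segment/city/country parameterization from the talk is not modeled.

```python
import random


def thompson_select(stats, rng):
    """Pick the arm with the highest draw from its Beta posterior.

    stats maps arm name -> [clicks, non_clicks]; a uniform Beta(1, 1)
    prior is assumed for every arm.
    """
    samples = {
        arm: rng.betavariate(clicks + 1, misses + 1)
        for arm, (clicks, misses) in stats.items()
    }
    return max(samples, key=samples.get)


def simulate(true_ctr, rounds, seed=0):
    """Run the bandit against simulated click-through rates.

    true_ctr maps (hypothetical) carousel names to the probability of a
    click. Returns how often each carousel was ranked first.
    """
    rng = random.Random(seed)
    stats = {arm: [0, 0] for arm in true_ctr}
    pulls = {arm: 0 for arm in true_ctr}
    for _ in range(rounds):
        arm = thompson_select(stats, rng)
        clicked = rng.random() < true_ctr[arm]
        # Update the posterior counts: index 0 = clicks, index 1 = non-clicks.
        stats[arm][0 if clicked else 1] += 1
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    # Invented click-through rates for two hypothetical carousels.
    print(simulate({"new_restaurants": 0.05, "order_again": 0.6}, rounds=2000))
```

Because each arm is chosen in proportion to the posterior probability that it is best, arms with little data keep being explored while the apparently best arm receives most of the traffic.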
Ranking the most relevant carousels at the top is a key challenge to solve so that our users find what they want fast. This makes personalizing the Discovery page a key lever. Personalized carousel ranking presents a major recommendation challenge across many different domains like content streaming, ecommerce or quick commerce. In our talk, we present a hierarchical multi-armed bandit (MAB) solution for personalizing the ranking of carousels on Wolt’s Discovery page which is built on top of the Python ecosystem. To that end, we first illustrate the specific challenges of an (almost) everything online delivery platform and our goals for Wolt's Discovery page. Second, we present our MAB-approach which combines a novel hierarchical parameterization of bandits on user-, segment-, city- and country-level with classical Thompson Sampling for exploration and exploitation. This approach caters well to the challenge of data sparsity. We also share the offline and online evaluation results of our approach. Lastly, we illustrate the architecture to make this solution resilient, scalable and adaptive. Our architecture is built on top of well-known open source libraries. We’re leveraging mlflow for tracking and lineage, Flyte for ML workflows, Redis for serving features, and Seldon Core for serving user requests online fast and reliably. We will wrap up our talk with our learnings and an outlook for the next steps in our journey towards a personalized, context-aware, and controllable Discovery page. false https://pretalx.com/pyconde-pydata-2024/talk/7J7LEB/ https://pretalx.com/pyconde-pydata-2024/talk/7J7LEB/feedback/ B09 Time series anomaly detection with a human-in-the-loop Sponsored Talk 2024-04-23T16:00:00+02:00 16:00 00:30 In the cross-industry trend towards Industry 4.0 solutions, the amount of gathered sensor data is ever growing.
Through the sheer amount of data, manual or human-based monitoring of the collected time series data becomes cumbersome if not even impossible. Yet, careful inspection of the time series data and identification of possible anomalies therein is crucial to detect problems in the underlying processes. To resolve this demand, ZEISS is developing a fully automated time series processing tool that performs ML based time series anomaly detection with a human-in-the-loop. pyconde-pydata-2024-44670-time-series-anomaly-detection-with-a-human-in-the-loop General: Industry & Academia Use-Cases Philipp Millet en Starting from a completely unlabelled dataset, unsupervised anomaly detection is performed. Identified anomaly candidates are presented via a web app to domain experts, who can judge whether the identified time series segments are indeed abnormal or are expected behaviour, i.e., false positives generated by the anomaly detection. The domain-expert’s feedback is stored to create a partially labelled dataset. The intended benefits from storing the collected labels are: 1) Metrics can be generated that allow to evaluate the performance of the initially unsupervised anomaly detection run. 2) The number of false positives generated by the algorithm, i.e., time series segments that were incorrectly flagged as anomaly, can be reduced via pattern matching. 3) Based on a partially labelled dataset more domain problem specific methods might be applied such as semi-supervised anomaly detection or time series classification. The framework uses open source tools and all its components, i.e., data pipelines, anomaly detection, web app, are deployed to the cloud. false https://pretalx.com/pyconde-pydata-2024/talk/CMMJPN/ https://pretalx.com/pyconde-pydata-2024/talk/CMMJPN/feedback/ B09 Cloud? No Thanks! 
I’m Gonna Run GenAI on My AI PC Sponsored Talk 2024-04-23T16:35:00+02:00 16:35 00:30 In this talk, we want to introduce an AI PC, a single machine that consists of a CPU, GPU, and NPU (Neural Processing Unit) and can run GenAI in seconds, not hours. Besides the hardware, we will also show the OpenVINO Toolkit, a software solution that helps squeeze as much as possible out of that PC. Join our talk and see for yourself that the AI PC is good for both generative and conventional AI models. All presented demos are open source and available on our GitHub. pyconde-pydata-2024-47974-cloud-no-thanks-i-m-gonna-run-genai-on-my-ai-pc PyData: Generative AI Adrian Boguszewski, Dmitriy Pastushenkov en In a world dominated by cloud computing, there's a growing demand for harnessing the power of PCs and edge devices for AI needs. After all, all connected computers together have more power than any cloud. Hence, in this talk, we want to introduce an AI PC, a single machine that consists of a CPU, GPU, and NPU (Neural Processing Unit) and can run GenAI in seconds, not hours. Besides the hardware, we will also show the OpenVINO Toolkit, a software solution that helps squeeze as much as possible out of that PC. Join our talk and see for yourself that the AI PC is good for both generative and conventional AI models. The demos we will present are open source, so feel free to try them at home. Let's paint your dreams together! false https://pretalx.com/pyconde-pydata-2024/talk/BNFLZB/ https://pretalx.com/pyconde-pydata-2024/talk/BNFLZB/feedback/ B07-B08 Unleashing Confidence in SQL Development through Unit Testing Talk 2024-04-23T10:30:00+02:00 10:30 00:30 As the landscape of data-driven applications expands, the need for robust SQL development practices becomes increasingly critical. This conference talk addresses the challenges faced by data teams in maintaining and evolving complex SQL models for their Data Warehouses, and shows how unit testing can play a vital role in ensuring data quality.
We will delve into the significance of SQL unit testing, highlighting its ability to quickly validate modeling logic and making sure that modifications do not break existing behavior. With the ease of mind of an automatically verified SQL logic, changes to existing data models can be shipped with confidence, ultimately contributing to faster deployment cycles. Get detailed insights on the structure and functionality of Lotum’s SQL unit testing framework, built in Python using pytest and tailored for BigQuery. With Lotum processing millions of events from mobile games every day, explore how this robust framework allows for efficient testing, ensuring the accuracy of the SQL logic. Learn how test cases with small sets of static mock data can be defined effortlessly so that they help pinpoint potential code errors easily. pyconde-pydata-2024-40110-unleashing-confidence-in-sql-development-through-unit-testing PyCon: Programming & Software Engineering Tobias Lampert en The conventional approach to data model development frequently involves a repetitive cycle: crafting a query, executing it, examining a portion of the result, and iterating through the process with each subsequent query modification. This method becomes particularly challenging when dealing with the evolution of mature, extensively-used data models, where multiple developers collaborate without sufficient testing. In such scenarios, the iterative nature of this process poses significant risks, potentially leading to overlooked errors and compromised data quality. The talk showcases the tangible benefits of having a well-designed unit testing framework, providing ease of mind to developers working collaboratively on the same model, and enabling the early detection of hard-to-spot errors before deployment. 
During the development of new data models and during the integration of new data sources, the absence of large amounts of production data makes verification of the model outputs difficult - clearly defined tests for scenarios not yet observed in production play a crucial role in overcoming this hurdle. SQL unit testing becomes especially relevant when refactoring existing data models and can be very helpful to ensure the logic is unchanged, even for edge cases. I outline the requirements for an effective SQL unit testing framework, emphasizing the use of the database or query engine to verify SQL statement correctness without persisting any data in the database. The presented framework supports the definition of atomic test cases, where each test case consists of minimal input datasets and expected output datasets and it is verified if the output of the query when run on the defined inputs matches the expected output. The practical implementation of a SQL unit testing framework will be shared in detail, by giving insights into Lotum’s pytest-based SQL unit testing framework and demonstrating how a test case for a SQL statement with mock data can be built effortlessly with minimal code redundancy. Internal workings of the framework will be explained, including the mechanics to define and run a unit test: By injecting mock data into an existing SQL statement, replacing references to production tables by the injected mock data, and executing the resulting fully-static statements in the query engine, the framework evaluates the transformed data against expected outputs. This way, the correctness of the query can be verified on a case-by-case basis without manually modifying the query code itself. Attendees will leave the session with a deep understanding of the importance of SQL unit testing, equipped with insights into building an effective framework, defining test cases, and ensuring data model robustness. 
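The mock-injection mechanics described above can be sketched roughly as follows. This is not Lotum's framework: it is a minimal illustration that assumes SQLite (via the standard-library `sqlite3` module) in place of BigQuery, and it shadows the table name with a CTE of static rows instead of rewriting table references in the query text. The table name `events`, the query, and the helpers `render_mock_cte` and `run_with_mocks` are all hypothetical.

```python
import sqlite3

# Query under test. `events` stands in for a production table (hypothetical name).
QUERY = """
SELECT user_id, COUNT(*) AS n_events
FROM events
GROUP BY user_id
ORDER BY user_id
"""


def render_mock_cte(table, columns, rows):
    """Build a CTE definition that shadows `table` with static mock rows."""
    def literal(value):
        if value is None:
            return "NULL"
        if isinstance(value, str):
            # Escape single quotes the SQL way.
            return "'" + value.replace("'", "''") + "'"
        return str(value)

    header = ", ".join(columns)
    values = ", ".join(
        "(" + ", ".join(literal(v) for v in row) + ")" for row in rows
    )
    return f"{table}({header}) AS (VALUES {values})"


def run_with_mocks(query, mocks):
    """Prefix the query with mock CTEs and run it; nothing is persisted."""
    ctes = ", ".join(
        render_mock_cte(table, columns, rows)
        for table, (columns, rows) in mocks.items()
    )
    statement = f"WITH {ctes} {query}"
    with sqlite3.connect(":memory:") as conn:
        return conn.execute(statement).fetchall()


# One atomic test case: minimal input rows and the expected output rows.
mock_events = (["user_id", "event"], [(1, "open"), (1, "click"), (2, "open")])
assert run_with_mocks(QUERY, {"events": mock_events}) == [(1, 2), (2, 1)]
```

In a pytest suite, each such input/expected pair would typically become one parametrized test case, so the query code itself never has to be edited for testing.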
The talk provides a roadmap for data teams to embrace a test-driven development approach, enhancing code quality and fostering a culture of confident SQL development. false https://pretalx.com/pyconde-pydata-2024/talk/EMZ7L7/ https://pretalx.com/pyconde-pydata-2024/talk/EMZ7L7/feedback/ B07-B08 Green Software Engineering Talk 2024-04-23T11:05:00+02:00 11:05 00:30 Have you ever wondered how green software engineering can contribute to environmental sustainability? My talk will answer exactly this question. My passion for nature and love for technology drew me to this topic. The effects of global warming are one of the biggest concerns for many people around the world. The focus is to educate people about how they can play their part in protecting the environment simply by using their laptops and computers in the right way. One of the biggest questions is how to control gas emissions, but how can software engineering help with this? The complete software engineering life cycle should be designed and implemented in such a way that it incorporates environmental sustainability without affecting the economic benefits. It is a win-win situation. We need more environmentally sustainable mobile and web applications. pyconde-pydata-2024-42936-green-software-engineering PyCon: Programming & Software Engineering Farah en The rapid growth of the digital economy and of software production demands a more sustainable way to deal with global warming issues. The whole tech industry is contributing to a growing carbon footprint, and we need to handle it efficiently. I will focus on the software engineering life cycle and explain how teams can put green software engineering into practice across the whole cycle, from requirement engineering to the end product.
We will then dig deeper into the following topics: • Green Requirement Engineering • Green Architecture and Design • Green Coding • Optimization of Infrastructure • Green Usage of software products Software products should be developed in such a way that they reduce carbon emissions, increase efficiency and lower carbon intensity. The choice of programming language should be based on time, complexity and resource usage so that we can practice green coding. Participate in electronics recycling programs and move existing infrastructure to services such as the cloud to decrease resource usage. When it comes to green usage of software products, never leave your laptops and systems in sleep mode, as this also increases the carbon footprint. By the end of the talk, attendees will be able to practice some green computing concepts in their everyday life. false https://pretalx.com/pyconde-pydata-2024/talk/Z3FALV/ https://pretalx.com/pyconde-pydata-2024/talk/Z3FALV/feedback/ B07-B08 Building Professional Voice AI with Vocode Talk 2024-04-23T11:40:00+02:00 11:40 00:30 Dive into the world of AI voice agents with Vocode, the leading framework for creating interactive, voice-based AI assistants. In this talk, we'll explore how Vocode integrates speech-to-text, response generation, and speech synthesis APIs to create agents that not only speak but also understand and adapt to the nuances of human conversation. We'll discuss the challenges of teaching these agents the etiquette of real conversations, such as knowing when to pause, not interrupt, and conclude interactions. Plus, we'll showcase Vocode's LLM function-calling feature through a practical example: real-time appointment booking. Join us to uncover the secrets behind building AI voice agents that are as engaging and efficient as they are innovative.
pyconde-pydata-2024-42852-building-professional-voice-ai-with-vocode PyData: Natural Language Processing & Computer Vision Lev Konstantinovskiy en The AI open-source package Vocode (https://github.com/vocodedev/vocode-python) has emerged as a leader in creating AI voice agents since May 2023. These are the interactive voices on the other end of the phone, ready to assist with various tasks. My journey with Vocode began in August while developing a commercial platform that allows for no-code creation of voice agents utilizing Vocode's capabilities. This presentation delves into the intricacies of Vocode. It's not just about voice; it's about crafting an experience. The framework seamlessly integrates external APIs for speech-to-text conversion, Large Language Model (LLM) response generation, and speech synthesis. But the real challenge lies in the nuances of human conversation: teaching the bot to pause when interrupted, not to speak over others, and to recognize the natural end of a conversation. These subtleties are what make interactions with Vocode feel remarkably human. A significant part of this talk will focus on the LLM function-calling feature of Vocode, particularly in real-time tasks like booking appointments. Imagine a scenario where you're speaking to 'Jane', a virtual plumber, to schedule a visit. The interaction feels real, with the bot understanding and responding to changes in appointment preferences, such as switching from a suggested time of "tomorrow at 9 am" to a more suitable slot "next month". This talk aims to share insights and practical knowledge about building and refining AI voice agents, making them more than just voices on a call but rather engaging, interactive entities capable of performing complex tasks with ease and human-like finesse. 
false https://pretalx.com/pyconde-pydata-2024/talk/8W7RPP/ https://pretalx.com/pyconde-pydata-2024/talk/8W7RPP/feedback/ B07-B08 How to Do Monolingual, Multilingual, and Cross-lingual Text Classification in April, 2024 Talk 2024-04-23T14:10:00+02:00 14:10 00:30 In 2023, the field of NLP was shaken up once again -- the emergence of powerful closed- and open-source LLMs opened new possibilities for text processing. However, many questions about the usability of these models for typical NLP tasks are still open. One of them is quite simple -- if we want a classification model for some task, can we rely on LLMs, or is it still better to fine-tune our own model? It might be easier to obtain a classifier for English, but what if my target language is not so resource-rich? This presentation describes the main "recipes" for obtaining the best text classifier depending on the language and data availability. pyconde-pydata-2024-41472-how-to-do-monolingual-multilingual-and-cross-lingual-text-classification-in-april-2024 PyData: Natural Language Processing & Computer Vision Daryna Dementieva en We will answer three main questions: 1. If I want a text classifier for English texts, which is better -- fine-tuning a model or prompting an LLM? And which model should be fine-tuned? 2. If my data is not in English, i.e. it is in a language that is not resource-rich, what should I do? Can I utilize LLMs? Do I need to obtain the data somehow? Or can I transfer knowledge from existing English data? 3. If I want a multilingual model for several languages, what is the better choice, again -- LLMs or my own model? Which model, then? The findings and comparisons will be illustrated on three tasks -- toxic speech, formal speech, and fluent speech detection -- for two languages -- English (as a resource-rich language) and Ukrainian (as a low-resource language in terms of data availability).
We will test closed- and open-source models together with fine-tuned open-source models such as BERT and RoBERTa. false https://pretalx.com/pyconde-pydata-2024/talk/RLCLBB/ https://pretalx.com/pyconde-pydata-2024/talk/RLCLBB/feedback/ B07-B08 Leveraging the Art of Parallel Unit Testing in Django Talk (long) 2024-04-23T14:45:00+02:00 14:45 00:45 Unit testing is a fundamental practice in software development, ensuring the reliability and maintainability of code. However, in the context of monolithic repositories, executing unit tests efficiently becomes a formidable challenge. This talk aims to explore the intricacies of unit testing in Django within monolithic codebases and shed light on how major institutions address and overcome these challenges through the implementation of parallel testing strategies. pyconde-pydata-2024-40391-leveraging-the-art-of-parallel-unit-testing-in-django PyCon: Testing Syed Ansab Waqar Gillani, Azan Bin Zahid en Key Points to Address: - Understanding Monolith Challenges: - - Identification of challenges and bottlenecks in traditional unit testing approaches within Django monoliths. - - Analysis of the impact on development velocity and code quality. - Introduction to Parallel Testing: - - Explanation of parallel testing concepts and its application to Django unit testing. - - Benefits of parallelization in terms of speed, efficiency, and resource utilization. - Parallel Testing Tools and Techniques: - - Overview of tools and techniques available for parallelizing unit tests in Django. - - Practical insights into configuring and optimizing test suites for parallel execution. - Real-world Experiences from Major Institutions: - - Case studies from leading institutions sharing their challenges with unit testing in Django monoliths. - - Lessons learned and best practices in implementing parallel testing strategies.
- Implementation Guidelines for Django Projects: - - Guidance on implementing parallel unit testing in Django projects, including code examples and configurations. - - Tips for integrating parallel testing seamlessly into existing development workflows. Expected Outcomes: - Insight into challenges specific to Django unit testing within monolithic repositories. - Understanding the principles and benefits of parallel testing. - Practical knowledge of tools and techniques for parallelizing Django unit tests. - Real-world experiences and best practices shared by major institutions. - Actionable guidelines for implementing parallel unit testing in Django projects. Target Audience: This talk is tailored for Django developers, software engineers, and testing professionals seeking to optimize their unit testing practices, especially within the context of monolithic repositories. Conclusion: Join me in this 45-minute session as we navigate through the challenges of unit testing in Django monoliths and explore the art of parallelization. By the end, you'll be equipped with the knowledge and tools to transform your Django unit testing workflows, leveraging the lessons learned from major institutions in the industry. false https://pretalx.com/pyconde-pydata-2024/talk/ZKDEPW/ https://pretalx.com/pyconde-pydata-2024/talk/ZKDEPW/feedback/ B07-B08 Analyzing COVID-19 Protest Movements: A Multidimensional Approach Using Geo-Social Media Data Talk 2024-04-23T16:00:00+02:00 16:00 00:30 The COVID-19 pandemic and associated policy measures led to worldwide protest movements that were characterized by the spread of misinformation and conspiracy theories, predominantly on social media platforms. Publicly available social media data is therefore a powerful proxy for studying these protest movements.
The data, consisting of user locations, follower relationships, and content information, makes it possible to understand the geographical centers of activity, network structure, and key themes of conspiracy movements. This talk will present a multi-dimensional network analysis for the Austrian COVID-19 protest movement using Python libraries like geopandas, networkx and gensim. In particular, it will demonstrate how to identify geo-spatial hot spots using spatial statistics, densely connected clusters within the network by employing community detection techniques, as well as dominating content themes through topic modeling approaches. The presentation highlights how data-driven analysis enables further understanding of movements that may pose threats to democracy, alongside the importance of publicly available social media data for addressing societal challenges. pyconde-pydata-2024-41829-analyzing-covid-19-protest-movements-a-multidimensional-approach-using-geo-social-media-data General: Ethics & Privacy Nefta Kanilmaz en The talk will walk through the steps undertaken in the analysis of a protest network using Twitter data. It will explain the methods used and present the results as well as the code and libraries used, following (roughly) this outline: 1. Motivation: What was special about the COVID-19 protest movement and why a multi-dimensional view is crucial for understanding it. 2. The Data: The information retrieved using Twitter's API and the necessary pre-processing steps. 3. Spatial Analysis: The statistical means to understand the movement's spatial manifestation, including an explanation of the methods used and a presentation of the results. 4. Network Analysis: Mere social network analysis is not enough for understanding protest movements. Including the spatial information allows us to draw deeper insights by geo-spatially mapping network communities and centralities. 5.
Semantic Analysis: Understanding the dominating themes in the protest network with semantic analysis: generating the document embeddings, clustering topics and dealing with a large dataset of tweets. 6. Conclusion: Importance of multi-dimensional analysis and the availability of social media data for studying societally important phenomena. Python libraries that were used (among others): geopandas, networkx, BERTopic, lda and friends. false https://pretalx.com/pyconde-pydata-2024/talk/CY97LS/ https://pretalx.com/pyconde-pydata-2024/talk/CY97LS/feedback/ B07-B08 Would you rely on ChatGPT to dial 911? A talk on balancing determinism and probabilism in production machine learning systems Talk 2024-04-23T16:35:00+02:00 16:35 00:30 In the last year there hasn’t been a day that passed without us hearing about a new generative AI innovation that will enhance some aspect of our lives. On a number of tasks, large probabilistic systems are now outperforming humans, or at least they do so “on average”. “On average” means most of the time, but in many real life scenarios “average” performance is not enough: we need correctness ALL of the time, for example when you ask the system to dial 911. In this talk we will explore the synergy between deterministic and probabilistic models to enhance the robustness and controllability of machine learning systems. Tailored for ML engineers, data scientists, and researchers, the presentation delves into the necessity of using both deterministic algorithms and probabilistic model types across various ML systems, from straightforward classification to advanced Generative AI models. You will learn about the unique advantages each paradigm offers and gain insights into how to most effectively combine them for optimal performance in real-world applications.
I will walk you through my past and current experiences in working with simple and complex NLP models, and show you what kind of pitfalls, shortcuts, and tricks are possible to deliver models that are both competent and reliable. The session will be structured into a brief introduction to both model types, followed by case studies in classification and generative AI, concluding with a Q&A segment. pyconde-pydata-2024-41758-would-you-rely-on-chatgpt-to-dial-911-a-talk-on-balancing-determinism-and-probabilism-in-production-machine-learning-systems PyData: Natural Language Processing & Computer Vision Nicolas Guenon des Mesnards en Objective and Outline: This talk addresses the often-overlooked need for integrating deterministic and probabilistic models in machine learning, which is crucial in complex production environments. We begin by defining deterministic and probabilistic models, highlighting their distinct roles in ML systems. The talk then showcases practical examples where the synergy of these models enhances system performance, focusing on classification and Generative AI models. Target Audience and Expected Background Knowledge: Intended for ML engineers, data scientists, and academic researchers, this presentation assumes familiarity with basic machine learning concepts and models. It's particularly beneficial for those involved in designing, implementing, or managing ML systems in production environments. Key Takeaways: - Understanding the strengths and limitations of deterministic and probabilistic models in ML. - Strategies for effectively combining these models in various ML systems. - Real-world examples demonstrating the improved robustness and controllability achieved through this integration. - Insights into future trends and potential developments in model integration. 
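One hypothetical way to combine the two paradigms named in the takeaways above is to let hand-written rules run before a probabilistic model and to fall back to a safe default when the model is unsure. The sketch below is not taken from the talk: the rule, the toy keyword weights, the intent names, and the 0.7 threshold are all assumptions chosen for illustration, and the "model" is a stand-in for any trained classifier.

```python
import math
import re

# Deterministic layer: a rule that must always win (hypothetical pattern).
EMERGENCY_PATTERN = re.compile(r"\b(call|dial)\s+911\b", re.IGNORECASE)

# Probabilistic layer: toy keyword weights standing in for a trained classifier.
TOY_WEIGHTS = {
    "play": {"music": 2.0, "song": 1.5},
    "weather": {"rain": 2.0, "forecast": 1.5},
}


def probabilistic_intent(text):
    """Return (intent, confidence) from a softmax over toy keyword scores."""
    tokens = text.lower().split()
    scores = {
        intent: sum(weights.get(token, 0.0) for token in tokens)
        for intent, weights in TOY_WEIGHTS.items()
    }
    total = sum(math.exp(score) for score in scores.values())
    best = max(scores, key=scores.get)
    return best, math.exp(scores[best]) / total


def route(text, threshold=0.7):
    """Apply rules first, then the model, then a safe fallback."""
    if EMERGENCY_PATTERN.search(text):
        return "emergency_call"  # never left to the model
    intent, confidence = probabilistic_intent(text)
    return intent if confidence >= threshold else "ask_clarification"


print(route("Please dial 911 now"))
print(route("play some music"))
print(route("hello there"))
```

The design choice here is that the deterministic branch is checked unconditionally, so its behaviour does not depend on model confidence, while the threshold keeps low-confidence predictions from triggering actions.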
Time Breakdown: - Minutes 0-10: Introduction to deterministic and probabilistic models - Minutes 10-20: Synergies of approaches in real-world examples - Minutes 20-30: Applications for Generative AI models, including Q&A Additional Information: No prerequisites are required beyond a basic understanding of machine learning concepts. The presentation will be informative with a focus on practical applications, providing attendees with actionable knowledge and a deeper appreciation of model integration in ML systems. false https://pretalx.com/pyconde-pydata-2024/talk/YPKKQF/ https://pretalx.com/pyconde-pydata-2024/talk/YPKKQF/feedback/ B05-B06 Deploying your Python application to Android Talk 2024-04-23T10:30:00+02:00 10:30 00:30 For many years, Android has held the top position as the most used OS, with about 38% of the OS user share in 2023. Currently, three major languages – C++, Java and Kotlin – are used for application development on Android. Although Python is capable of targeting Android, it was never considered an adequate language for Android development. But with the introduction of “PEP 738: Adding Android as a supported platform” and the increasing popularity of frameworks like PySide6, Kivy, Flet etc., which enable GUI development with Python for Android devices, it is time for Python package developers to consider Android as a potential platform. This talk gives an introduction to each of the GUI development toolkits – Kivy, Flet and PySide6 – by demonstrating how to create a simple Contact List application. We later delve into the pros and cons of each of these frameworks, so that Python application developers can decide which framework suits their requirements better. pyconde-pydata-2024-42839-deploying-your-python-application-to-android PyCon: Programming & Software Engineering Shyamnath Premnadh en Python can be used to create native applications for Android.
However, although Python is the most popular programming language, it is not the first choice for creating an Android application. This talk gives an overview of developing Android applications with Python by comparing the 3 popular frameworks for GUI development with Python that support Android as a platform – PySide6, Kivy and Flet. This comparison is demonstrated with a simple Contact List application with the ability to add, edit and delete contacts. The overall structure of the talk will be roughly as follows: 1. Why is Android a relevant platform for Python application developers? (6 minutes) In this section, we establish why Android is the most popular OS being used currently. Although Python has been able to run applications natively on Android since as early as 2011, the development of Android applications with Python is not so popular. We will further highlight one of the major concerns of using Python for Android development and how PEP 738 can help simplify this. 2. Current status of Android app development with Python (2 minutes) In this section, we give a brief introduction to some of the Python-based toolkits that support Android as a platform – Kivy, Flet, PySide6, BeeWare etc. 3. Contact List application with Kivy (3 minutes) In this section, we look at how the application looks with Kivy and KivyMD, followed by the ease of development and some pros and cons of the framework. 4. Contact List application with PySide6 (5 minutes) The deployment of a PySide6 application to Android uses the same build tool as Kivy, called python-for-android. python-for-android now also supports a Qt backend alongside the SDL2 backend that Kivy uses, thus enabling the deployment of PySide6 applications. In this section, we look at how the application looks with PySide6, followed by the ease of development and some pros and cons of the framework. 5.
Contact List application with Flet (3 minutes) In this section, we look at how the application looks with Flet, followed by the ease of development and some pros and cons of the framework. 6. Python packages support (6 minutes) We see the various Python packages supported by each framework. 7. Conclusion and Questions (5 minutes) Questions from the audience. false https://pretalx.com/pyconde-pydata-2024/talk/W7YDRX/ https://pretalx.com/pyconde-pydata-2024/talk/W7YDRX/feedback/ B05-B06 Advanced Observability with OpenTelemetry and Python Talk 2024-04-23T11:05:00+02:00 11:05 00:30 As Python expands into serverless and cloud environments, popularizing distributed microservice architectures, we often face observability challenges that impact efficiency and complicate error tracing. This presentation introduces OpenTelemetry, an emerging industry standard that provides a framework for tracking the performance of not just our Python code, but also other system components like databases and message queues. Its API and SDK integrate seamlessly with Python, enabling a unified approach to gather, process, and export telemetry data from various sources within a distributed system. We will explore the setup and usage of OpenTelemetry's Python SDK through a practical scenario. The session will demonstrate how to convert an existing Flask microservice to use OpenTelemetry, using both automatic and manual instrumentation. Finally, we will examine how to utilize the exported data for enhanced system monitoring. pyconde-pydata-2024-41840-advanced-observability-with-opentelemetry-and-python PyCon: Programming & Software Engineering Anton Caceres en With the rise of serverless architectures and cloud technologies, Python has become increasingly popular for building microservices. Yet, as these systems expand, they face observability challenges leading to reduced efficiency and complexities in error tracing.
To address these challenges, this presentation introduces OpenTelemetry, an emerging industry standard providing a framework for tracking the performance of not only our Python code but also other system components such as databases or message queues. It integrates seamlessly into Python environments, offering a common way to gather, process, and export telemetry data from various sources of a distributed system. The session will begin by revisiting the concept of observability and its critical importance in distributed systems. We will then introduce OpenTelemetry and go over the fundamentals of its Python SDK. A practical use case will be presented, demonstrating the integration of OpenTelemetry into an existing Python microservice, using both automatic instrumentation mode and manual traces. Finally, we will discuss how to utilize the data collected by OpenTelemetry for system monitoring. false https://pretalx.com/pyconde-pydata-2024/talk/AQ8HUM/ https://pretalx.com/pyconde-pydata-2024/talk/AQ8HUM/feedback/ B05-B06 Boost your app to Flash speed by mastering performance tricks Talk 2024-04-23T11:40:00+02:00 11:40 00:30 In this talk, we discuss computational operations and memory utilization in Python and how they are connected. Additionally, we will provide you with visual aids to help build a mental picture of these concepts. Moreover, we will dive into how the Python interpreter works and how an understanding of bytecode instructions can help you write better code. In the end, we will demonstrate the advantages of best practices by comparing both performance metrics and bytecode instructions.
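As a rough illustration of what comparing performance metrics and bytecode instructions can look like (this example is not taken from the talk), the sketch below uses only the standard-library `dis` and `timeit` modules. The two functions and the helper names `opnames` and `time_call` are hypothetical.

```python
import dis
import timeit


def squares_loop(n):
    """Build a list with an explicit loop and repeated `append` calls."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_comprehension(n):
    """Build the same list with a list comprehension."""
    return [i * i for i in range(n)]


def opnames(func):
    """Return the names of the bytecode instructions of `func`."""
    return [instruction.opname for instruction in dis.get_instructions(func)]


def time_call(func, n=1_000, number=200):
    """Total seconds for `number` calls of func(n)."""
    return timeit.timeit(lambda: func(n), number=number)


if __name__ == "__main__":
    for func in (squares_loop, squares_comprehension):
        print(func.__name__, f"{time_call(func):.4f}s",
              len(opnames(func)), "instructions")
    # Full disassembly of the loop version for a closer look.
    dis.dis(squares_loop)
```

The exact instruction names and timings depend on the CPython version, so the output is best read as a relative comparison on one interpreter rather than as fixed numbers.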
pyconde-pydata-2024-41737-boost-your-app-to-flash-speed-by-mastering-performance-tricks PyCon: Python Language & Ecosystem Laysa Uchoa, Yuliia Barabash en Nowadays, more and more companies are looking for strategies to gain more users for their products, ranging from introducing unique features to optimizing application performance. Additionally, Python is one of the most widely used programming languages, and its community continuously introduces new libraries for enhancing performance and optimizing memory usage. However, can we also accelerate app performance not only by relying on libraries but also by understanding how Python works under the hood? In this talk, we discuss computational operations and memory utilization in Python and how they are connected. Additionally, we will provide you with visual aids to help build a mental picture of these concepts. Moreover, we will dive into how the Python interpreter works and how an understanding of bytecode instructions can help you write better code. In the end, we will demonstrate the advantages of best practices by comparing both performance metrics and bytecode instructions. If you're keen to move beyond basic optimizations and truly understand what happens under Python's hood during application execution, this session is for you. Join us to learn how Python works under the hood and to build a mental picture of what is going on in Python during application execution. false https://pretalx.com/pyconde-pydata-2024/talk/C9F9CC/ https://pretalx.com/pyconde-pydata-2024/talk/C9F9CC/feedback/ B05-B06 Pandas + Dask DataFrame 2.0 - Comparison to Spark, DuckDB and Polars Talk 2024-04-23T14:10:00+02:00 14:10 00:30 Dask is a library for distributed computing with Python that integrates tightly with pandas. Historically, Dask was the easiest choice to use (it’s just pandas) but struggled to achieve robust performance (there were many ways to accidentally perform poorly).
The re-implementation of the DataFrame API addresses all of the pain points that users ran into. We will look into how Dask is a lot faster now, how it performs on benchmarks that it struggled with in the past and how it compares to other tools like Spark, DuckDB and Polars. pyconde-pydata-2024-41505-pandas-dask-dataframe-2-0-comparison-to-spark-duckdb-and-polars PyData: Data Handling & Engineering Patrick Hoefler, Florian Jetter en Dask is a library for distributed computing with Python that integrates tightly with pandas and other libraries from the PyData stack. It offers a DataFrame API that wraps pandas and thus offers an easy transition into the big data space. Historically, Dask was the easiest choice to use (it’s just pandas) but struggled to achieve robust performance (there were many ways to accidentally perform poorly). It was great for experts, but bad for novices. Other tools (Spark, DuckDB, Polars) just did this better. Fortunately, these pain points have been fixed with the following features: - A new and vastly improved shuffle algorithm - A logical query planning layer to improve performance and usability - A reduced memory footprint through a more efficient data model due to pandas 2.0 We will look into how these changes work together across pandas, Arrow, and Dask to provide a better UX and a more robust and faster system overall. Additionally, we will look into a comparison of Dask against other tools in the big data space, including Spark, Polars and DuckDB. We will use the TPC-H benchmarks to compare these tools. We will look ahead into what the future will bring for pandas and Dask and how the logical query planning layer can be extended to fit other frameworks like Dask Array and XArray. 
false https://pretalx.com/pyconde-pydata-2024/talk/N9DEVW/ https://pretalx.com/pyconde-pydata-2024/talk/N9DEVW/feedback/ B05-B06 The key to reliability - Testing in the field of ML-Ops Talk (long) 2024-04-23T14:45:00+02:00 14:45 00:45 Testing is a de facto standard in modern software development. With the increasing awareness that comes with ML-Ops, testing becomes more important for the development and operation of machine learning-based components. In this talk we would like to share our view and solution for testing in the field of machine learning. We will present the testing strategy we applied and the lessons learned from the last four years of experience in operating idealo’s cataloging system. pyconde-pydata-2024-41576-the-key-to-reliability-testing-in-the-field-of-ml-ops PyCon: MLOps & DevOps Gunar Maiwald, Tobias Senst en idealo.de offers a price comparison service for millions of products from a wide variety of categories. It navigates a dynamic landscape of about 3.7 billion offers from 50,000+ shops; our central challenge is cataloging this huge offering automatically. Machine learning plays a crucial role for us in processing data. Machine learning components must be considered as a part of a more complex domain. In our domain those components are part of an event-driven asynchronous architecture. The need to continuously develop, deliver, and train, accompanied by the capability to work smoothly together with traditional software components, places high demands on stable software development and operations. Testing plays a crucial role and brings up many open questions in the field of machine learning. In this talk we want to share and present our holistic approach to testing in machine learning. 
The following aspects are taken into account: - Introduction into our machine learning lifecycle - Testing in context of traditional software development comprising unit tests, code coverage, contract tests, tests on infrastructure as code - Specific challenges of testing in the machine learning domain comprising end-to-end test of training pipelines, deployment testing of inference endpoints in operational modes - The role of logging and monitoring for safe operations The presented test strategy is based on our 4 years' experience in operating idealo's cataloging system. Examples will be aligned along our tech stack consisting of e.g., PyTest, CDK , Pactman, AWS Sagemaker, Github Actions, OpenSearch Kibana and Grafana. false https://pretalx.com/pyconde-pydata-2024/talk/9PZSBS/ https://pretalx.com/pyconde-pydata-2024/talk/9PZSBS/feedback/ B05-B06 The evolution of Feature Stores Talk 2024-04-23T16:00:00+02:00 16:00 00:30 Feature Stores have become an important component of the machine learning lifecycle. They have been particularly pivotal in bridging the gap between data engineering and machine learning workflows(experimentation, training and serving). This talk will explore Feature Stores with a focus on their evolution, what they look like now and what they could look like in the future with the advent of the AI ACT. pyconde-pydata-2024-41727-the-evolution-of-feature-stores PyData: Machine Learning & Deep Learning & Stats Olamilekan Wahab en In recent years, the role of feature stores has become increasingly pivotal in data engineering and machine learning. This talk will delve into the history of feature stores, exploring their evolution from Uber's Michelangelo to recent solutions like Feast, Hopsworks and Fennel. Lastly, we will discuss the potential impact of the AI Act on the future of feature stores, highlighting regulatory constraints that may affect what they look like in the future. The outline of this talk is detailed below. 
### Historical Perspective: - Tracing the origins of Feature Stores: How did the concept evolve over time? - Early use cases and challenges: Lessons learned from Michelangelo. - Pioneering Feature Stores: Case studies on organizations at the forefront of adoption. ### Current Landscape: - Architectural insights: What do modern Feature Stores look like? - Integration with popular ML frameworks and data storage solutions. - Real-world success stories: How Zalando built a central Feature Store for serving features across departments and business units with different technical requirements. ### AI ACT and the Future of Feature Stores: - Envisioning Feature Stores in an AI ACT environment. - Federated learning and distributed feature stores: Opportunities and challenges. false https://pretalx.com/pyconde-pydata-2024/talk/BNCJPV/ https://pretalx.com/pyconde-pydata-2024/talk/BNCJPV/feedback/ B05-B06 Polars and Time Series: what it can do, and how to overcome any limitation Talk 2024-04-23T16:35:00+02:00 16:35 00:30 Time series analysis is ubiquitous in applied data science because of the value it delivers. In order to do effective time series analysis, you need to know your tools well. Polars has excellent built-in time series support, and it's also possible to extend it where necessary. We will talk about: - Basic built-in time series operations with Polars (e.g. "what's the average number of sales per month?"). - numba/numpy/scipy interoperability for not-so-basic time series operations (e.g. non-linear interpolation, or cumulative operations). - Advanced, custom time series operations, and how you can implement them as Polars plugins (e.g. business day arithmetic). Basic interest and knowledge of Python and data will be assumed, but no prior Polars experience is required. Anyone working with time series and/or dataframes will likely benefit from the talk. 
pyconde-pydata-2024-41772-polars-and-time-series-what-it-can-do-and-how-to-overcome-any-limitation PyData: Data Handling & Engineering Marco Gorelli en This will be a technical talk, teaching people how to use Polars effectively for time series analysis. The format will be roughly: - 5 mins: motivation, super-fast Polars crash course. - 7 mins: what's built-in - making the most of Polars' built-in time series capabilities. - 7 mins: when Polars isn't enough: interoperability with numba/scipy/numpy. - 6 mins: when nothing is enough: writing your own Polars Plugin, and learning how to do that. - 5 mins: engaging Q&A / awkward silence. Attendees will leave knowing where to turn to for any time series analysis task they may encounter whilst using Polars. false https://pretalx.com/pyconde-pydata-2024/talk/LNFSDV/ https://pretalx.com/pyconde-pydata-2024/talk/LNFSDV/feedback/ A1 Encoding Charactersets - may the force be with you Talk 2024-04-23T10:30:00+02:00 10:30 00:30 Despite Unicode, understanding and repairing garbled text (Mojibake) remains an ongoing task in IT projects. Garbled text is the result of text being decoded using an unintended character encoding. Example: Die UTF-8 Selbsthilfegruppe trifft sich heute Abend im grÃ¼nen Saal This talk explains how to analyze and fix such encoding problems with Python. The topics of this talk include: - the difference between graphemes and codepoints - Unicode vs. UTF-8 - decoding and encoding files, database result sets, REST API calls - the unicodedata module - handling of ISO charsets in the Unicode world This talk shows short code examples for real-world problems and solutions. pyconde-pydata-2024-41604-encoding-charactersets-may-the-force-be-with-you PyCon: Python Language & Ecosystem Martin Hoermann en Despite Unicode, understanding and repairing garbled text (Mojibake) remains an ongoing task in IT projects. Garbled text is the result of text being decoded using an unintended character encoding. 
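As a minimal sketch of the garbling described above (an illustration, not material from the talk): the snippet assumes the text was encoded as UTF-8 and then wrongly decoded as Latin-1 (ISO 8859-1), which reproduces the "grÃ¼nen" example, and repairs it by reversing the two steps. All variable names are hypothetical.

```python
import unicodedata

original = "grünen"

# Encode as UTF-8, then decode with the wrong codec (Latin-1): Mojibake.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)  # grÃ¼nen

# Repair by reversing the mistake: back to the raw bytes, then decode correctly.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)  # True

# The unicodedata module names the two code points that replaced "ü".
names = [unicodedata.name(ch) for ch in garbled[2:4]]
print(names)

# One visible character can be one or two code points, depending on normalization.
composed = unicodedata.normalize("NFC", "ü")
decomposed = unicodedata.normalize("NFD", "ü")
print(len(composed), len(decomposed))  # 1 2
```

The repair only works when the wrong codec maps every byte to a character, as Latin-1 does; that is an assumption of this sketch, not a general guarantee for other encodings.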
This talk covers the following points, each with code examples: - Explore the nuances of text representation: Grapheme vs. Codepoints. Unravel the essence of characters in computing. - Delve into the realm of character encoding: Unicode vs. UTF-8. Decipher the key distinctions shaping text globalization. - Master the art of data interchange. Decode and encode files, database results, and REST APIs seamlessly for universal communication. - Unlock the power of the unicodedata module. Learn how it aids in character information retrieval and manipulation in Python. - Navigate the challenges of ISO charsets in the Unicode era. Gain insights into effective strategies for handling diverse character sets. false https://pretalx.com/pyconde-pydata-2024/talk/RRAZ99/ https://pretalx.com/pyconde-pydata-2024/talk/RRAZ99/feedback/ A1 (Un)leashed potential of AI in Government Talk 2024-04-23T11:05:00+02:00 11:05 00:30 As the world is being reshaped at an unprecedented speed through the rise of powerful (Generative) AI technologies that change the way we work and live, governments seek their place in the arena. This presentation will focus on how government institutions adapt to these changes by exploring three key areas of action: Adoption, Regulation, and Reskilling/Upskilling. Emphasis will be placed on Ethics and AI in government. pyconde-pydata-2024-42987--un-leashed-potential-of-ai-in-government General: Ethics & Privacy Rosa Marie Keller en As the world is being reshaped at an unprecedented speed through the rise of powerful (Generative) AI technologies that change the way we work and live, governments seek their place in the arena. This presentation will focus on how government institutions adapt to these changes by exploring three key areas of action: 1. Adoption: Generally, technology adoption has been slower in government than in the private sector. 
Yet governments have increasingly started to explore the potential of AI to deliver on their mission. The audience will learn about potentials, barriers, and concrete use cases/prototypes of AI-based services in German government bodies with a focus on responsible AI and Ethics. 2. Regulation: It is discussed how government bodies respond to the rise of AI through regulation. An introduction to the EU AI Act is given – the world’s first comprehensive AI law. 3. Reskilling & Upskilling: Insights are given on the role specialised data skills play in shaping the future of Digital Government in Germany. true https://pretalx.com/pyconde-pydata-2024/talk/Y3Y78W/ https://pretalx.com/pyconde-pydata-2024/talk/Y3Y78W/feedback/ A1 DDataflow: An open-source end-to-end testing framework for ML pipelines Talk 2024-04-23T11:40:00+02:00 11:40 00:30 In the realm of machine learning, the complexity of data pipelines often hinders rapid experimentation and iteration. This talk will introduce [DDataflow](https://github.com/getyourguide/DDataFlow), an innovative open-source tool, designed to facilitate end-to-end testing in ML pipelines by leveraging decentralized data sampling. Attendees will gain insights into the challenges of unit testing in large-scale data pipelines, the design philosophy behind DDataflow, and practical implementation strategies to enhance the reliability and efficiency of their ML pipelines. pyconde-pydata-2024-42932-ddataflow-an-open-source-end-to-end-testing-framework-for-ml-pipelines PyCon: MLOps & DevOps Theodore MeynardJean Machado en Machine Learning pipelines, especially those dealing with large datasets, are intricate and multifaceted. The ability to quickly iterate and experiment is crucial, yet the complexity and scale of these pipelines often lead to prolonged development loops and latent errors. 
Traditional unit-testing approaches have proven to be cumbersome and inefficient in addressing these challenges due to the extensive boilerplate code and limited coverage they offer. This talk will delve into the journey of developing [DDataflow](https://github.com/getyourguide/DDataFlow), a tool aimed at addressing the aforementioned challenges by enabling efficient end-to-end testing in ML pipelines. DDataflow employs decentralized data sampling to expedite testing processes, allowing for rapid and reliable iterations in ML pipelines. false https://pretalx.com/pyconde-pydata-2024/talk/8GQLLY/ https://pretalx.com/pyconde-pydata-2024/talk/8GQLLY/feedback/ A1 Exploring Zarr: From Fundamentals to Version 3.0 and Beyond Talk 2024-04-23T14:10:00+02:00 14:10 00:30 A key feature of the Python data ecosystem is the reliance on simple but efficient primitives that follow well-defined interfaces to make tools work seamlessly together (Cf. http://data-apis.org/). NumPy provides an in-memory representation for tensors. Dask provides parallelisation of tensor access. Xarray provides metadata linking tensor dimensions. **Zarr** provides a missing feature, namely the scalable, persistent storage for annotated hierarchies of tensors. Defined through a community process, the Zarr specification enables the storage of large out-of-memory datasets locally and in the cloud. Implementations exist in C++, C, Java, Javascript, Julia, and Python, enabling. This talk presents a systematic approach to understanding the newer [Zarr Specification Version 3](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html) by explaining the critical design updates, performance improvements, and the lessons learned via the broader specification adoption across the scientific ecosystem. 
I will also briefly discuss the evolution of the Zarr - the development of the [Zarr Enhancement Process (ZEP)](https://zarr.dev/zeps) and its use to define the next major version of the specification (V3); as well as uptake of the format across the research landscape. pyconde-pydata-2024-43030-exploring-zarr-from-fundamentals-to-version-3-0-and-beyond PyData: Data Handling & Engineering Sanket Verma en Zarr is a data format for storing chunked, compressed N-dimensional arrays and is sponsored by [NumFOCUS](https://numfocus.org/project/zarr) under their umbrella. It is based on open-source technical specification and has implementations in several languages, with [Zarr-Python](https://github.com/zarr-developers/zarr-python) being the most used. ## Outline First, I’d be talking about: ### Understanding Zarr basics (5 mins.) - What is Zarr, and how it works? - The inner workings of Zarr using illustrated graphics - What is the Zarr Specification? - How is Zarr different when compared to other storage formats? Then, I'll be talking about the new Zarr Specification V3 and its significant features: ### What's new in Zarr Spec V3? (15 mins.) - What is the motivation for the evolution of the specification? - High-latency storage → Better support for technologies, particularly systems with relatively high latency per operation, such as cloud object stores - Interoperability → Language-agnostic approach towards the new specification by slimming down the specification to achieve interoperability across major programming languages - Major design updates - Greater flexibility in how groups and arrays are created - Support for implicit groups that do not have a metadata document but whose existence is implied by descendant nodes - Restructuring of the `JSON` metadata document and storage path in both arrays and groups - Why is the Zarr V3 metadata consolidated compared to the Zarr V2 metadata? 
- Explicit support for extensions via defined extension points and mechanisms - How do extensions allow the community to add innovative and cutting-edge features to help their specific use cases? - Chunk encoding and supported codecs for V3 - How are chunks encoded into binary representation for storage in the store, using the chain of codecs specified by the codecs metadata field? - ZEP Process - Need and origin of a community feedback process for the evolution of Zarr specification - Transformation from steering council governed to community-owned specification - Learnings when migrating from [Spec V2](https://zarr.readthedocs.io/en/stable/spec/v2.html) → [Spec V3](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html) Then, I’d be doing a hands-on session, which would cover the following: ### Hands-on (5 mins.) - Creating Zarr arrays and groups using Zarr-Python V3.0 - Walk through of the new features (mentioned above) - Demo of [Sharding Codec](https://zarr.dev/zeps/accepted/ZEP0002.html) extension - Creating a sharded array and group and showing how a large number of chunks can be grouped together into a single shard - Looking under the hood - Use store functions to explain how your Zarr data is stored I'd be closing the talk by: ### Conclusion (5 mins.) - Key takeaways - How can you get involved? - QnA This talk aims to address an audience that works with large amounts of data and is looking for a transparent, open-source, reliable, cloud-optimised, and environmentally friendly format. Also, I’d like to invite anyone interested in the lessons I learned by maintaining the project throughout the years. The tone of the talk is set to be informative, story-telling and fun. Intermediate knowledge of Python and NumPy arrays is required for the attendees to attend this talk. 
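The outline above asks how chunks are encoded into a binary representation through a chain of codecs. The following is a rough, standard-library-only sketch of that idea, not the Zarr-Python API: the `codecs` list, its field names, and the `encode_chunk`/`decode_chunk` helpers are hypothetical stand-ins. It serializes a chunk of float64 values to bytes and then compresses them, and decodes by running the chain in reverse.

```python
import struct
import zlib

# Hypothetical stand-in for a codec chain in array metadata;
# the names and fields are illustrative, not the Zarr V3 schema.
codecs = [
    {"name": "bytes", "endian": "little"},  # array -> bytes
    {"name": "zlib", "level": 5},           # bytes -> bytes
]


def encode_chunk(values, codecs):
    """Run a chunk of float64 values through the codec chain, in order."""
    data = values
    for codec in codecs:
        if codec["name"] == "bytes":
            prefix = "<" if codec["endian"] == "little" else ">"
            data = struct.pack(f"{prefix}{len(data)}d", *data)
        elif codec["name"] == "zlib":
            data = zlib.compress(data, codec["level"])
        else:
            raise ValueError(f"unknown codec: {codec['name']}")
    return data


def decode_chunk(blob, codecs):
    """Undo the codec chain by applying the inverse steps in reverse order."""
    data = blob
    for codec in reversed(codecs):
        if codec["name"] == "zlib":
            data = zlib.decompress(data)
        elif codec["name"] == "bytes":
            prefix = "<" if codec["endian"] == "little" else ">"
            count = len(data) // 8  # 8 bytes per float64
            data = list(struct.unpack(f"{prefix}{count}d", data))
        else:
            raise ValueError(f"unknown codec: {codec['name']}")
    return data


chunk = [0.0] * 1000
encoded = encode_chunk(chunk, codecs)
decoded = decode_chunk(encoded, codecs)
print(len(encoded), "bytes stored for", 8 * len(chunk), "raw bytes")
```

Keeping the chain as data rather than code is the point of the sketch: a reader of the store only needs the metadata to know which inverse steps to apply, and in which order.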
### After this talk, you’d: - understand the basics of Zarr and what's new in V3, - be able to use Zarr V3 for local and cloud storage, - make an informed decision on what data format to use for your data and also you'd: - know why you should have a process for your project, - have essential takeaways about how an OSS project transitions from a young to a mature stage false https://pretalx.com/pyconde-pydata-2024/talk/93MHQ3/ https://pretalx.com/pyconde-pydata-2024/talk/93MHQ3/feedback/ A1 From LLM as oracle to LLM as translator - our journey from theory to everyday’s practice in a corporate setting with dmGPT (and python) Talk (long) 2024-04-23T14:45:00+02:00 14:45 00:45 Last year, dm-drogeriemarkt was among the first big German companies to launch a tool that lets coworkers unlock the power of LLMs in a secure setting. At the beginning, dmGPT was only a user interface pointing to a private instance of a foundation model. Listening to the needs of our colleagues, we quickly learned that this “naked” model – a super powerful NLP model that can help them process text - is not really what they needed: they needed a trustworthy, knowledge-rich assistant to help them accomplish their daily tasks. In our journey towards this goal, we used Python to shift the LLM’s role in dmGPT: from being the motor and only source of answers to being a translator between the user’s input in natural language and multiple software systems, the steering wheel that helps humans drive the flow. Today, dmGPT is no longer just a statistical parrot; it is an open platform powered by internal knowledge. In this talk we want to share with you the learnings and insights we gained while designing and implementing the new dmGPT. 
pyconde-pydata-2024-41566-from-llm-as-oracle-to-llm-as-translator-our-journey-from-theory-to-everyday-s-practice-in-a-corporate-setting-with-dmgpt-and-python- PyData: Generative AI Emma Haley, Niklas Lederer en One of the biggest challenges of working in a large organization like dm is finding the information you need to accomplish your tasks: distributed organization units, multiple knowledge sources, and different tools make it very challenging to find information whose location you don’t know. Most of the time, the best way to find something out is to ping a more experienced colleague and ask them. But what if you could ping your AI-powered copilot and find out? Not only that… What if it also helped you create content for your specific product without you telling it everything about the product? What if it was able to help you write code using internal tools? What if it could help you gain insight into your internal data? After its first steps in summer 2023, our vision for dmGPT quickly developed into it becoming a truly helpful assistant for every coworker of dm. Since then, we have contributed to the design and implementation of an LLM-powered platform that aims to achieve this goal. To come a step closer, we had to rethink the role of the LLM, picturing it as a translator between natural languages and software systems and back. Now, it helps us map an instruction in natural language to a set of tools needed to accomplish the given task and construct a coherent answer based on the provided data. In the design we had to face multiple challenging questions, such as: - How to connect multiple, heterogeneous data sources? - How to pick an LLM for a given task? - Which LLM do we support? - How do we build a user-friendly, dynamic and configurable user interface? - How to measure the system’s quality? 
In this talk we would like to provide a technical insight to our journey, discussing architectural decisions as well as implementation dilemmas, and engage in a discussion with the community about the steps to come. false https://pretalx.com/pyconde-pydata-2024/talk/P3GRLG/ https://pretalx.com/pyconde-pydata-2024/talk/P3GRLG/feedback/ A1 Safeguarding Privacy and Mitigating Vulnerabilities: Navigating Security Challenges in Generative AI Talk 2024-04-23T16:00:00+02:00 16:00 00:30 Generative AI (GenAI) has significantly improved our daily lives, prompting a focus on its integration into products and our routines. However, the growing importance of GenAI brings along significant concerns regarding privacy and vulnerability. This talk delves into the critical issues surrounding the protection of private data and the security of GenAI systems. We'll begin by understanding the fundamental differences between data privacy and data security. Drawing insights from real-life data breaches and compromised information in major companies, we'll explore the mistakes made and the steps taken to rectify them. Throughout the discussion, we'll analyze the challenges faced by GenAI in ensuring data privacy and security across various stages of an LLM project. Furthermore, the talk will shed light on how prominent companies building GenAI are working to reduce the impact of data privacy and security concerns within their models. Additionally, we'll explore strategies for individuals, like ourselves, using GenAI, to enhance data privacy and security when integrating it into our products or daily lives. Finally, the role and significance of government regulations in ensuring the safety and security of GenAI will be emphasized. 
pyconde-pydata-2024-42572-safeguarding-privacy-and-mitigating-vulnerabilities-navigating-security-challenges-in-generative-ai PyData: Generative AI John Robert en In the ever-evolving landscape of Generative AI (GenAI), privacy and security have emerged as paramount concerns, echoing the necessity for comprehensive frameworks and collaborative initiatives. The session kicks off with an interactive segment, aiming to gauge the audience's familiarity and involvement with GenAI, ensuring the discussion aligns with their varying levels of expertise and engagement. Fundamental concepts of Data Privacy and Data Security are meticulously delineated, elucidating the responsible handling and fortification of personal information. A visual aid in the form of a Venn diagram underscores the intricate interplay between these two crucial facets, facilitating a deeper understanding for the audience. Transitioning to the domain of GenAI, the discourse delves into the indispensable need for data privacy throughout the lifecycle of GenAI models. Instances of ethical and legal concerns arise during the training phase, where datasets often contain potentially sensitive personal information sourced from the internet. Real-world cases such as disputes between media entities like The New York Times and AI organizations like OpenAI exemplify these dilemmas. Moreover, the session critically scrutinizes data privacy concerns during GenAI production, focusing on the policies adopted by AI companies regarding prompt-related data retention. Here, certain AI entities retain prompt records for extended durations, which can pose potential privacy risks. In response, initiatives such as enterprise versions of GenAI models, like those offered by OpenAI, provide users with enhanced control over data usage, reinforcing a more privacy-centric approach. Simultaneously, the discussion navigates through the dimensions of data security risks inherent in GenAI models during operational phases. 
The potential extraction of sensitive personal data from these models poses substantial risks, given GenAI's proclivity to retain information from its training data. Academic research papers, like "Scalable Extraction of Training Data from (Production) Language Models," delve into these vulnerabilities, highlighting the complexity of data security challenges in GenAI. Further enriching the discourse, the session showcases the top ten vulnerabilities in GenAI, as identified by insights from OWASP. These vulnerabilities encompass a wide array of risks, from prompt injection and insecure output handling to training data poisoning and supply chain vulnerabilities. To culminate the discussion, actionable strategies to fortify data protection within GenAI are proposed. These encompass leveraging Open Source GenAI solutions like LLAMA, recognized for their transparency, although they may come with higher maintenance costs. Additionally, anonymizing data before prompt utilization emerges as a proactive measure, albeit posing certain operational challenges. Moreover, the session underscores the pivotal role of government regulations in safeguarding citizen data and establishing policies binding on GenAI companies. Recent regulations from governments like the US, UK, and other countries emphasize the need for AI systems to be 'secure by design,' promoting robust data protection measures. Collaborative efforts among companies also come to the forefront, exemplified by initiatives like the "AI Alliance" formed by IBM, Meta, and 50 other organizations. These alliances aim to advance open-source AI while fostering collective processes for data protection and security. In conclusion, this comprehensive session aims to empower attendees with a holistic understanding of privacy and security challenges in the GenAI domain. 
The discourse, enriched with real-world instances, legal dilemmas, academic insights, and industry perspectives, seeks to equip individuals and organizations with actionable insights. The objective is to navigate the complex terrain of GenAI, fostering a more privacy-aware and secure integration into our lives and technological ecosystems. false https://pretalx.com/pyconde-pydata-2024/talk/MTVWQM/ https://pretalx.com/pyconde-pydata-2024/talk/MTVWQM/feedback/ A1 Breaking AI Boundaries: Fairness Metrics in Unstructured Data Domains Talk 2024-04-23T16:35:00+02:00 16:35 00:30 This presentation addresses the rare use of machine learning fairness metrics in domains with indirect human impact, e.g., automotive engineering. We briefly map out the space of use cases to examine the necessity, potential benefits, and challenges of applying fairness-related techniques. The main focus then lies on proposing solutions for overcoming identified hurdles, especially regarding the application in unstructured data domains, such as image and audio recognition and large text document analysis. Our approach includes strategies for detecting key subgroups and providing clear explanations for model failures. We also highlight two open-source tools, Sliceguard and Spotlight, for practical implementation. pyconde-pydata-2024-40918-breaking-ai-boundaries-fairness-metrics-in-unstructured-data-domains PyData: Machine Learning & Deep Learning & Stats Daniel Klitzke en Fairness Metrics are already widely used to avoid unwanted bias in machine learning models. However, although fairness is a hot topic, it is primarily used in domains where the models' interface and influence on humans are obvious. In other domains with a less obvious connection between model decisions and their impact on human beings, they are rarely seen (e.g., automotive engineering applications, etc.). This poses three questions: 1. 
In those domains, is it really unnecessary to use fairness techniques, or is their absence endangering individuals in a less obvious way? (necessity) 2. Even if a use case does not need fairness techniques, wouldn't the use cases still benefit from a look through the "Fairness lens" and the connected methods and tools? (benefit) 3. Besides having less strong implications for using fairness metrics, what obstacles keep people from using them, and how can we mitigate them? (obstacles and solutions) To answer these questions, our presentation will first briefly compare five prototypical engineering use cases and categorize them according to the above criteria (necessity, benefit, obstacles). This first part mainly aims to map out the space of machine learning use cases in the engineering domain and suggest possible reasons why fairness-related techniques are not applied in those areas. We will then mainly focus on further analyzing those obstacles and providing solutions to omit them. Here, the main focus will be expanding the application of fairness-based model evaluation to unstructured data domains. Typical use cases in this category go from image and audio recognition to LLM applications with large text documents. We will provide a brief theoretical overview of strategies to make fairness metric application suitable and then go through a concrete example down to the implementation level. For that, we will touch on important subjects, such as detecting meaningful subgroups in unstructured data, extracting easy-to-grasp explanations for model failures, and interactive analysis of model predictions. This section will also feature two open-source tools to address these challenges: Sliceguard and Spotlight. false https://pretalx.com/pyconde-pydata-2024/talk/QLXUHY/ https://pretalx.com/pyconde-pydata-2024/talk/QLXUHY/feedback/ A03-A04 Using ML to find out the "Why"? 
A Tutorial in Causal Machine Learning Tutorial 2024-04-23T10:30:00+02:00 10:30 01:30 Machine learning is mostly used for predicting outcome variables. But in many cases, we are interested in causal questions: Why do customers churn? What is the effect of a price change on sales? How can we optimize personalized marketing campaigns or medical treatments? This tutorial introduces participants to the field of Causal Machine Learning (Causal ML). We will start with a basic motivation of causal analysis and share insights on how to recognize causal questions in data science. We will dive into the basics of Causal ML: Why can't we simply use off-the-shelf ML methods to answer causal questions? The tutorial will focus on the Double Machine Learning approach and demonstrate the use of Causal ML with the Python library DoubleML (Bach et al., 2022). The general introduction will be complemented by hands-on data examples and interactive discussion and Q&A sessions. The tutorial is a great starting point for participants to discover Causality/Causal ML and start their own causal data science projects. References Bach, P., Chernozhukov, V., Kurz, M. S., and Spindler, M. (2022), DoubleML - An Object-Oriented Implementation of Double Machine Learning in Python, Journal of Machine Learning Research, 23(53): 1-6, https://www.jmlr.org/papers/v23/21-0862.html pyconde-pydata-2024-41577-using-ml-to-find-out-the-why-a-tutorial-in-causal-machine-learning PyData: Machine Learning & Deep Learning & Stats Oliver Schacht, Jan Teichert-Kluge en The tutorial will be organized in three blocks. 1) Introduction and motivation We will point out why Causality matters in data science. Many problems managers and data scientists are facing are causal. When organizations and companies want to optimize their marketing campaigns, their financial planning, or their pricing scheme, they usually run into causal considerations: How much do my sales decrease if we increase the price by X%? 
How can I send out email newsletters to those who like them and avoid annoying other subscribers? Causal Inference and Causal ML offer powerful tools that help to formalize and model things that are usually discussed only on an intuitive basis: Are the people who opened my newsletters really comparable to those who haven't? Can I just compare the conversion rates of these groups when I want to evaluate the newsletter's effectiveness? 2) Introduction to Causal Machine Learning with DoubleML Causal Machine Learning offers tools to estimate causal relationships with SOTA ML algorithms. We will offer an introduction to the Double Machine Learning approach (Chernozhukov et al., 2018). This introduction will be aligned with several data examples and code demonstrations using the Python package DoubleML, https://docs.doubleml.org/stable/index.html . DoubleML is an open source package that offers various tools to estimate causal effects, for example for estimation of heterogeneous treatment effects (like in personalized marketing or personalized medicine). 3) Hands-on Session: Data Example The tutorial features a data project that participants can solve on their own. With the hands-on session participants already get started on their own Causality learning journey :) Participants are invited to apply DoubleML to their own data example and play around with the package features. The hands-on session will follow the structure of the DoubleML workflow, which guides analysts through the process of causal inference with DoubleML, https://docs.doubleml.org/stable/workflow/workflow.html. 4) Discussion and Q&A The tutorial concludes with a discussion and Q&A session. We are looking forward to participants' comments and ideas. 
We appreciate feedback from the Python community on the DoubleML package :) true https://pretalx.com/pyconde-pydata-2024/talk/RMZLKZ/ https://pretalx.com/pyconde-pydata-2024/talk/RMZLKZ/feedback/ A03-A04 Performant, scientific computation in Python and Rust Tutorial 2024-04-23T14:05:00+02:00 14:05 01:30 A tutorial session on how to build scientific packages for numerical calculus and algorithms in Python and Rust. It walks through the process of packaging with a modern tool stack, introduces the concept of vectorization for efficient computation in Python in the context of classical Machine Learning, and shows how the package can be optimized with extensions written in Rust. pyconde-pydata-2024-41605-performant-scientific-computation-in-python-and-rust PyCon: Programming & Software Engineering Stefan Ulbrich en The Rust programming language has gained a lot of attention over the last years and has slowly begun to enter the Python ecosystem, with an ever-increasing number of tools and libraries, such as Ruff and Polars, implemented in this language. Unlike Python, Rust is a systems language optimized for performance and memory safety, and some consider it the spiritual successor of C++. Despite its steep learning curve, it is the perfect candidate for extending Python and its ecosystem when performance matters, in a modern and memory-safe language. This session demonstrates the path of creating a scientific package in Python (following best practices and modern tools) and gradually migrating parts of it to Rust for additional performance gains. The use case is a naive implementation of the "Expectation maximization for Gaussian Mixture Models" algorithm from scratch, a relatively simple yet efficient machine learning method. 
The session addresses the following points: How to build a Python package with a modern tool set, how to translate a numerical algorithm into vectorized Python, and how to optimize the package with a performant Rust implementation of the critical parts. Prior knowledge of Rust or the algorithm is not required. Note that the goal is not to learn Rust in this single session (this requires at least three days) but rather to provide a superficial overview of what makes this language so great and well-suited for extending Python. Participants are advised to clone the repository below and follow the installation instructions to avoid longer download times during the session. https://github.com/StefanUlbrich/PyCon2024 false https://pretalx.com/pyconde-pydata-2024/talk/XBUHCK/ https://pretalx.com/pyconde-pydata-2024/talk/XBUHCK/feedback/ A03-A04 PyO3 101 - Writing Python modules in Rust Tutorial 2024-04-23T15:50:00+02:00 15:50 01:30 In this interactive workshop, we will cover the very basics of using PyO3. There will be hands-on exercises to go from how to set up the project environment to writing a "toy" Python library written in Rust using PyO3. We will cover a lot of specifications of the API provided by PyO3 to create Python functions, modules, handling errors and converting types. --- ## Preflight checklist - [Install/Update Rust](https://www.rust-lang.org/tools/install) - Make sure you have Python 3.8 or above (3.12 recommended) - Make sure you are using a virtual environment (pyenv + virtualenv recommended) *In this workshop we recommend using a Unix-like OS (macOS or Linux). If you have to use Windows, you may encounter problems with Rust and Maturin. 
You may want to install a VM like [VirtualBox](https://www.virtualbox.org/) for developing Python libraries with PyO3.* ## Setting up Set up a virtual environment and install **maturin**
```
pyenv virtualenv 3.12.2 pyo3
pyenv activate pyo3
pip install maturin
```
pyconde-pydata-2024-41691-pyo3-101-writing-python-modules-in-rust PyCon: Programming & Software Engineering Cheuk Ting Ho en In recent years, Rust has been getting more and more popular over other similar programming languages like C and C++ due to its robust compiler checking and ownership rules to make sure memory is safe. Hence there are more and more Python libraries that have been written in Rust natively with a Python API interface. One of the tools that have been driving this movement is PyO3, a toolset that provides Rust bindings for Python and tools for creating native Python extension modules. In this interactive workshop, we will cover the very basics of using PyO3. There will be hands-on exercises to go from how to set up the project environment to writing a "toy" Python library written in Rust using PyO3. We will cover a lot of specifications of the API provided by PyO3 to create Python functions, modules, handling errors and converting types. ## Goal To give developers who are not familiar with PyO3 an introduction to PyO3 so they can consider building their Python libraries with Rust to make use of Rust's memory-safe property and parallelism ability. ## Target audiences Any developers who are interested in developing Python libraries using Rust. It will be an advantage if the attendees are comfortable writing in Rust. However, attendees are not required to be familiar with Rust as all the Rust code will be provided. Basic knowledge of Python will be assumed from the attendees. 
## Outline Part 1 - introduction and getting started (40 mins) - What's the difference between Rust and Python (5 mins) - Why using PyO3 (5 mins) - Setting up the environment (exercises) (15 mins) - Starting a new project (exercises) (15 mins) Break (15 mins) Part 2 - Creating a simple Python library (50 mins) - Creating Python modules (exercises) (20 mins) - Generating documentation - Creating Python functions (exercises) (30 mins) - How to create function signatures - How to deal with errors false https://pretalx.com/pyconde-pydata-2024/talk/8C83EA/ https://pretalx.com/pyconde-pydata-2024/talk/8C83EA/feedback/ A05-A06 Bulletproof Python - Property-Based Testing with Hypothesis Tutorial 2024-04-23T10:30:00+02:00 10:30 01:30 Do you find yourself working through pages of copied and pasted tests to accommodate a simple code change? Does your software frequently break in unexpected ways despite your testing efforts? Don’t despair! Property-based testing could be your way out of that mess. Rather than working harder and writing more test code, property-based testing forces you to work smarter and test more code with fewer tests. pyconde-pydata-2024-41639-bulletproof-python-property-based-testing-with-hypothesis PyCon: Testing Michael Seifert en Traditional tests are example-based. They require the developer to come up with arbitrary inputs and check a system’s behaviour against explicit outputs. More often than not, developers only think of inputs that are handled correctly by their code, thus leaving bugs hidden. Property-based tests generate the inputs for you and in many cases they’re more likely to find invalid inputs than humans. The difficulty lies in formulating these test cases. After this workshop you’ll be comfortable with property-based testing using Hypothesis. You’ll have experience requesting appropriate test data from Hypothesis and in writing tests for common and more advanced properties. 
At work, your co-workers will be impressed by your unbreakable code ;) Participants are expected to have basic familiarity with unit testing and a testing framework. Provided code examples use pytest. Please set up the workshop material in advance. To do that, navigate to the Git repository linked in the supporting material section and follow the setup instructions in the README file. true https://pretalx.com/pyconde-pydata-2024/talk/KCC9EF/ https://pretalx.com/pyconde-pydata-2024/talk/KCC9EF/feedback/ A05-A06 Functional Python Tutorial 2024-04-23T14:05:00+02:00 14:05 01:30 Python supports multiple programming paradigms. In addition to the procedural and object-oriented approach, it also provides some features that are typical for functional programming. While these features are optional, they can be useful to create better Python programs. This tutorial introduces Python features that help to implement parts of Python programs in the functional style. The objective is not to write purely functional programs but to improve program design by using functional features where suitable. The tutorial points out advantages and disadvantages of functional programming in general and in Python in particular. Participants will learn alternative ways to solve problems. This will broaden their programming toolbox. pyconde-pydata-2024-41459-functional-python PyCon: Programming & Software Engineering Mike Müller en ## Audience Intermediate Python programmers who would like to learn more about functional programming and its application in Python. ## Format The tutorial will be hands-on. I will use JupyterLab and will start with an empty Jupyter Notebook. I will unroll the tutorial content by typing. In addition, I will distribute scripts before the tutorial to avoid overly lengthy typing. I will load these scripts one by one into a Notebook. Participants will have the opportunity to type along. I am a rather slow typist. In addition, I will stop typing often to explain. 
This gives most participants plenty of time to follow along. The PDF handout is very comprehensive and contains most of what I type. This allows students to catch up if they should fall behind. ## Outline * Functional programming basics (10 min) * Overview programming paradigms * Features of functional programming * Advantages of functional programming * Disadvantages of functional programming * Python's functional features - overview * Pure functions (5 min) * Callables and functions in Python (20 min) * Callables * Closures * "Currying" * Partial functions * Recursion * Lambda * Single Dispatch * No Loops - map, filter, and reduce (10 min) * Processing iterables with map * Select from iterables with filter * Reductions of iterables with reduce * Operators as Functions (10 min) * Arithmetic operators * Logical operators * Attribute access * Lookup * Comprehensions (15 min) * Simple * Nested * Dictionary comprehensions * Set comprehensions * Iterators (15 min) * Itertools * Infinite iterators * Iterators terminating on the shortest input sequence * Combinatoric iterators * External tools (5 min) * More itertools * Toolz false https://pretalx.com/pyconde-pydata-2024/talk/PKJHBA/ https://pretalx.com/pyconde-pydata-2024/talk/PKJHBA/feedback/ A05-A06 Boost your Data Science skills with the new Python in Excel Tutorial 2024-04-23T15:50:00+02:00 15:50 01:30 Python in Excel is the new integration created by Microsoft that brings Python programming directly into Excel workbooks, for advanced data analytics. With Python in Excel, it is now possible to embed Python code directly into workbook cells, very easily, and with zero setup required. In this tutorial, we will explore the many features and capabilities this new integration provides, to unlock unprecedented data science and machine learning use cases in Excel. 
pyconde-pydata-2024-41694-boost-your-data-science-skills-with-the-new-python-in-excel PyData: PyData & Scientific Libraries Stack Valerio Maggio en Python in Excel is the new integration created by Microsoft that brings Python programming directly into Excel workbooks, for advanced data analytics. With Python in Excel, it is now possible to embed Python code directly into workbook cells, very easily, and with zero setup required. In fact, all the Python code runs automatically in the Microsoft Cloud, and leverages the Python Anaconda Distribution to get immediate access to a vast selection of packages to unlock unprecedented use cases in data science, data visualization, and machine learning. The output of each execution is automatically integrated into the spreadsheet, creating interactive data reports to share with customers and other users. The new feature is currently available in _public preview_ to **all users** running the MS Excel Beta Channel on Windows. In this tutorial, we will explore the many features and capabilities this new integration provides, to unlock unprecedented data science and machine learning use cases in Excel. First, we will familiarize ourselves with the new environment, understanding its execution model, and the differences from standard Python programs. Afterwards, we will work on several examples to demonstrate the potential of using Python directly in the workbook to filter, validate, wrangle and visualize our data. We will conclude our tutorial by creating a full-fledged machine learning experiment directly in Excel. Familiarity with Excel and the Python language is the only requirement necessary to attend this tutorial. ## Setup Instructions **Python in Excel** is currently available (_for free_) to MS Excel users using the **Windows** operating system. 
### Non-Windows Users If you are not running on Windows, it is strongly recommended to install a version of Windows on a virtual machine (VM) using any solution that works on your operating system. For example, [Parallels](https://www.parallels.com/products/desktop) for macOS users, or [VirtualBox](https://www.virtualbox.org/) for Linux users. ### Setup Python in Excel for Windows To use the _new_ "Python in Excel" feature, it is required to join the [Microsoft 365 Insider Program](https://support.microsoft.com/en-gb/office/get-started-with-python-in-excel-a33fbcbe-065b-41d3-82cf-23d05397f53d#:~:text=Microsoft%20365%20Insider%20Program) and choose the Beta Channel Insider level. You can find more detailed instructions on [Get Started with Python in Excel](https://support.microsoft.com/en-gb/office/get-started-with-python-in-excel-a33fbcbe-065b-41d3-82cf-23d05397f53d). ### (Optional) Install Excel Labs plugin [Excel Labs](https://appsource.microsoft.com/en-us/product/office/wa200003696?tab=overview) is an add-in that includes experimental Excel features. Among these features, it provides the **Python editor**: A notebook-like interface designed for authoring Python in Excel. Excel Labs is **not** required, but strongly recommended to have a better working and development experience with Python in Excel. ### Data Download Once all the setup operations are completed, please download the [Financial Sample Excel Workbook](https://go.microsoft.com/fwlink/?LinkID=521962). We will use this data file as our gym playground to familiarise ourselves with the new feature. false https://pretalx.com/pyconde-pydata-2024/talk/UPSJEM/ https://pretalx.com/pyconde-pydata-2024/talk/UPSJEM/feedback/ Kuppelsaal Keynote - Ten Key Questions that a Company Should Ask to have Responsible AI Keynote 2024-04-24T09:15:00+02:00 09:15 00:45 Responsible AI covers mainly AI principles, governance & regulation, but most companies do not know how to implement all of these. 
Hence, in this presentation we cover the key questions for the whole process behind a new AI product, from the idea and design to the development and deployment. The questions are partly based on the new ACM Principles for Responsible Algorithmic Systems (2022) where he is one of the two lead authors as well as their extensions for Generative AI (2023). For each question we will discuss its relevance, challenges, and (partial) solutions, triggering an interactive discussion. pyconde-pydata-2024-44829-keynote-ten-key-questions-that-a-company-should-ask-to-have-responsible-ai Plenary Ricardo Baeza-Yates en Responsible AI covers mainly AI principles, governance & regulation, but most companies do not know how to implement all of these. Hence, in this presentation we cover the key questions for the whole process behind a new AI product, from the idea and design to the development and deployment. The questions are partly based on the new ACM Principles for Responsible Algorithmic Systems (2022) where he is one of the two lead authors as well as their extensions for Generative AI (2023). For each question we will discuss its relevance, challenges, and (partial) solutions, triggering an interactive discussion. false https://pretalx.com/pyconde-pydata-2024/talk/PLRERM/ https://pretalx.com/pyconde-pydata-2024/talk/PLRERM/feedback/ Kuppelsaal Which kind of software tests do I really need? Talk 2024-04-24T10:30:00+02:00 10:30 00:30 Explore a variety of software testing methodologies, from Manual and A/B Testing to Unit and Performance Tests. Learn how to make informed decisions for enhanced software delivery, matching the unique needs of your projects. pyconde-pydata-2024-42953-which-kind-of-software-tests-do-i-really-need- PyCon: Testing Pascal Puchtler en In the dynamic landscape of software development, choosing the right testing strategy is crucial for delivering high-quality software products. 
The myriad of available testing methodologies often leaves developers and QA professionals pondering over the question: "Which kind of software tests do I really need?" This presentation aims to demystify the world of software testing by exploring various testing approaches and methodologies. From unit testing to system testing, from functional to non-functional testing, each method serves a unique purpose in the software development life cycle. The talk will dive into the factors influencing the selection of appropriate testing methods. We will discuss the advantages and limitations of different testing types, helping participants understand the trade-offs involved in each approach. Practical examples will be presented to illustrate how choosing the right testing strategy can positively impact software quality, development speed, and overall project success. Participants will gain insights into evolving industry best practices and learn how to adapt their testing strategies to meet the demands of modern software development. By the end of the talk, attendees will have an overview of the diverse landscape of software testing and be equipped with the knowledge needed to make informed decisions about which types of tests are most relevant for their specific projects. This presentation aims to empower developers, QA professionals, and project managers to navigate the testing maze and optimize their testing efforts for efficient and effective software delivery. false https://pretalx.com/pyconde-pydata-2024/talk/PVLTD3/ https://pretalx.com/pyconde-pydata-2024/talk/PVLTD3/feedback/ Kuppelsaal I achieved peak performance in python, here's how ... Talk 2024-04-24T11:05:00+02:00 11:05 00:30 In the ever-evolving landscape of software development, crafting code that not only functions flawlessly but also operates at peak performance is a skill that sets exceptional developers apart. 
This talk delves into the art of optimizing Python code, exploring techniques and strategies to fine-tune your programs for maximum speed and minimal resource consumption, with a particular focus on memory efficiency. pyconde-pydata-2024-40984-i-achieved-peak-performance-in-python-here-s-how- PyCon: Programming & Software Engineering Dishant Sethi en In this session, we will embark on a journey and refine the phases of development in Python. 1. Functional Execution 2. Rigorous Testing and Accuracy 3. Performance Optimization We will discuss common bottlenecks in unoptimized code. 1. Inefficient Coding Practices that negatively impact performance 2. Memory Leaks 3. Suboptimal Data Structures and Algorithms 4. Lack of Vectorization 5. Overlooked Parallelization We'll further look into the benefits of profiling the code. 1. Profiling the Code with cProfile/sentry 2. Profiling the Code with timeit 3. Memory Profiler Finally, for data-driven applications, we'll look into strategies to achieve peak performance. 1. Efficient DataFrame Storage with Parquet Files 2. Handling Categorical Data Type 3. Looping Techniques and How to Choose Between Them 4. String concatenation (joins and cleanup) [Attendees takeaway] Whether you're a seasoned developer looking to enhance your optimization skills or a newcomer eager to understand the principles behind efficient Python code, this talk offers valuable insights and practical takeaways. 
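As a minimal sketch of the "Profiling the Code with timeit" and "String concatenation" items listed above (an illustration only, not material from the talk), the standard-library `timeit` module can compare two ways of building a string. The helper names `concat_plus` and `concat_join` are hypothetical, and the timings depend on the interpreter and machine, so none are quoted here.

```python
import timeit


def concat_plus(parts):
    """Build a string by repeated += concatenation (hypothetical helper)."""
    result = ""
    for part in parts:
        result += part
    return result


def concat_join(parts):
    """Build the same string with a single str.join call (hypothetical helper)."""
    return "".join(parts)


if __name__ == "__main__":
    parts = [str(i) for i in range(10_000)]
    # Both approaches must agree before their timings are worth comparing.
    assert concat_plus(parts) == concat_join(parts)
    for func in (concat_plus, concat_join):
        # timeit calls the function `number` times and returns total seconds.
        seconds = timeit.timeit(lambda: func(parts), number=100)
        print(f"{func.__name__}: {seconds:.4f} s for 100 runs")
```

Measuring both variants on the same input, after checking that they return identical results, keeps the comparison honest; the same pattern could be applied to the looping techniques mentioned in the list.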
[Pre-requisites] Basics of Python [who-am-i] Name: Dishant Sethi Email: dishantsethi14@gmail.com Phone no: +919582565371 Designation: Software Consultant and Founder @prodinit.com [Previous Talks] PyconDE and Pydata Berlin: https://youtu.be/osGGX3tcwkc Gophercon India 2023: https://youtu.be/zuzTN3ibrCM?si=GEo31lE_Q8h4hzTR PyDelhi: https://youtu.be/6h9I3iyqyu4 false https://pretalx.com/pyconde-pydata-2024/talk/RKDSK7/ https://pretalx.com/pyconde-pydata-2024/talk/RKDSK7/feedback/ Kuppelsaal Python 3.12's new monitoring and debugging API Talk 2024-04-24T11:40:00+02:00 11:40 00:30 Python 3.12 introduced a new low-impact monitoring API with [PEP 669](https://peps.python.org/pep-0669/), which can be used to implement far faster debuggers than ever before. This talk covers the main advantages of this API and how you can use it to develop small tools. pyconde-pydata-2024-41454-python-3-12-s-new-monitoring-and-debugging-api PyCon: Python Language & Ecosystem Johannes Bechberger en Python long lacked a good monitoring and profiling API. It had only the simplistic sys.settrace API, which had a high overhead and couldn't be configured appropriately. The new API, released in October 2023, will change this by offering a proper fine-grained and well-designed monitoring API while also making the commonly used operations fast. This talk will give you an introduction to the new API and its major design decisions and show you how you can use it to write a simple debugger from scratch. false https://pretalx.com/pyconde-pydata-2024/talk/P7AG9A/ https://pretalx.com/pyconde-pydata-2024/talk/P7AG9A/feedback/ Kuppelsaal (PyLadies Panel) Reflecting Within: Challenging Narratives in Tech Feminism Panel 2024-04-24T13:10:00+02:00 13:10 01:00 For the third year in a row, the PyLadies Panel at PyCon PyData engages with a broader audience on critical issues related to gender disparities, ethics, and the ongoing importance of women-focused tech groups. 
Adopting unconventional formats, the PyLadies Panel aims to foster meaningful discussions among PyLadies members and the Python community, encouraging open dialogue and community solidarity. pyconde-pydata-2024-47339--pyladies-panel-reflecting-within-challenging-narratives-in-tech-feminism Plenary Paloma Oliveira, Katharine Jarmul, Cheuk Ting Ho, Naa Ashiorkor Nortey en For the third year in a row, the PyLadies Panel at PyCon PyData engages with a broader audience on critical issues related to gender disparities, ethics, and the ongoing importance of women-focused tech groups. Adopting unconventional formats, the PyLadies Panel aims to foster meaningful discussions among PyLadies members and the Python community, encouraging open dialogue and community solidarity. This year, we propose a structured debate inspired by Lucy Delap’s “Feminisms: A Global History.” The book challenges ethnocentric and exclusive narratives within the feminist movement itself. It calls for a more inclusive and multifaceted understanding of feminism that respects and incorporates the diversity of its expressions and the different challenges faced by women around the world. Having the book as a reference point and inspiration, this panel is an opportunity to critically reflect on these themes and develop actionable strategies for a more equitable future in technology. Designed to dissect and challenge entrenched narratives about feminism in the tech industry, the debate encourages a deep dive into difficult conversations to dismantle binary thinking and uncover nuances in common discourse. Participants and audience members are invited to confront and critique the prevailing frameworks of feminism, particularly the predominance of perspectives that may not fully represent the movement’s global and diverse nature. By acknowledging and addressing these gaps, the debate will explore actionable steps toward inclusivity and equity. 
Through a debate-style format, panelists will engage in a candid, necessary discussion and exchange of ideas, allowing for both the celebration of feminist achievements and a critical evaluation of ongoing issues. It will provide a platform for voices that have been marginalized or silenced, enabling a constructive dialogue that moves beyond simple dichotomies to foster understanding and progress. Join us as we challenge the status quo, identify systemic flaws, and collaboratively outline the future directions of feminism in technology. This debate is not just about reflection; it’s about taking active steps to ensure that our community is inclusive and representative of all its members. Panel with Tania Allard, Katharine Jarmul, Naa Ashiorkor Nortey & Cheuk Ting Ho false https://pretalx.com/pyconde-pydata-2024/talk/BFYUUJ/ https://pretalx.com/pyconde-pydata-2024/talk/BFYUUJ/feedback/ Kuppelsaal Async Awaits: Mastering Asynchronous Python in FastAPI Talk 2024-04-24T14:45:00+02:00 14:45 00:30 In this talk, we delve into the transformative world of asynchronous programming in Python, tailored specifically for the FastAPI framework. This session will explore the fundamentals of async/await syntax, unveiling how it can optimize the performance and scalability of web applications. Attendees will gain practical insights into implementing asynchronous operations in FastAPI, from setting up to handling real-time data processing. This talk is perfect for Python developers eager to harness the power of asynchronous programming to build faster, more efficient web applications. Join us to unlock the full potential of Python's async capabilities within FastAPI's dynamic environment. 
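To make the async/await syntax mentioned above concrete, here is a minimal sketch that uses only the standard-library `asyncio` module. It is not taken from the talk and does not use FastAPI itself. The coroutine `fetch` and its simulated delay are hypothetical stand-ins for an I/O-bound call such as a database query or an HTTP request.

```python
import asyncio
import time


async def fetch(name: str, delay: float) -> str:
    """Pretend to wait on I/O; `fetch` is a hypothetical stand-in."""
    await asyncio.sleep(delay)  # hands control back to the event loop
    return f"{name} done"


async def main() -> list[str]:
    # Run three simulated I/O calls concurrently.
    # gather returns the results in the order the awaitables were passed.
    return await asyncio.gather(
        fetch("a", 0.1),
        fetch("b", 0.1),
        fetch("c", 0.1),
    )


if __name__ == "__main__":
    start = time.perf_counter()
    results = asyncio.run(main())
    elapsed = time.perf_counter() - start
    print(results, f"{elapsed:.2f} s")
```

Because the three waits overlap rather than run one after another, the total wall-clock time is expected to be closer to a single delay than to the sum of all three. The same idea underlies `async def` route handlers in an asynchronous web framework.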
pyconde-pydata-2024-41769-async-awaits-mastering-asynchronous-python-in-fastapi PyCon: Programming & Software Engineering Bojan Miletic en In this 30-minute session, we'll embark on a journey to master asynchronous programming in Python, specifically focusing on its application in the FastAPI framework. The talk is designed to provide a thorough understanding of async/await syntax and its practical use in building efficient, scalable web applications. ### Timetable: #### 1. Introduction to Asynchronous Programming (5 minutes) - Brief overview of asynchronous programming concepts. - The importance of async in modern web development. #### 2. Understanding Async/Await in Python (5 minutes) - Deep dive into Python's async/await syntax. - Key differences between synchronous and asynchronous code. #### 3. FastAPI and Asynchronous Python (10 minutes) - Introduction to FastAPI with a focus on its asynchronous features. - Demonstrating how FastAPI leverages Python’s async capabilities. #### 4. Building an Asynchronous Web App (7 minutes) - Step-by-step guide on setting up and coding an async web application in FastAPI. - Best practices for handling asynchronous operations. #### 5. Q&A and Wrap-Up (3 minutes) - Addressing questions from the audience. - Summarizing key takeaways and concluding the talk. Join us to unlock the power of asynchronous Python in the world of web development and learn how to effectively implement these techniques in your FastAPI projects. false https://pretalx.com/pyconde-pydata-2024/talk/DPVJ7K/ https://pretalx.com/pyconde-pydata-2024/talk/DPVJ7K/feedback/ Kuppelsaal Building accessible documentation sites Talk 2024-04-24T15:20:00+02:00 15:20 00:30 Your project's documentation site is one of the first places where new users will interact with your project; as such, it is essential that these are up-to-date, well-organised, and usable and that they cater to newcomers, experienced users, and contributors alike. 
It is estimated that about 25% of the global population has some sort of disability, and ensuring all folks can use and access your projects and their documentation is paramount and this, of course, includes thinking of and including disabled developers and end-users. In this talk, we will cover some of the basics of web content accessibility and explore some tools and approaches that you can use to ensure your tools and documentation sites are accessible. pyconde-pydata-2024-43025-building-accessible-documentation-sites General: Others Dr. Tania Allard en For a long time, there has been a prevailing notion that accessibility should only be considered within front-end web development - the discipline of creating what someone can see or do on a website or web app. However, accessibility is a holistic practice that covers every aspect of building digital experiences, meaning it is everyone’s concern - whether working on the backend, documentation, CLI, or API levels. As an open-source maintainer, your project’s documentation is one of the primary ways users interact with your tools. Ensuring your documentation is up-to-date is as important as ensuring it is accessible for disabled users to provide an inclusive user experience and bring in new contributors. For the last five years, I have worked on multiple aspects of open-source accessibility, from auditing to remediation and building more accessible tools for end-users, authors, and open-source maintainers. In this talk, I will share practical advice - including tools and workflows - to make your documentation and other user-facing resources, from markdown files to Sphinx documentation sites and Jupyter notebooks, more accessible to disabled users. After this talk, you will better understand how to make your documentation more accessible with minor changes to your workflows or practices, even if you do not have deep accessibility knowledge (yet). 
Outline - Context setting [5 mins] - Brief context setting - Intro to accessibility [7 mins] - 101 into accessibility - while this will not be a deep dive, we will cover some guidelines and principles applicable to documentation, notebooks, and user-facing resources. - Contextualising accessibility into documentation [8 mins] - discussing strategies for accessibility auditing, remediation, and implementation within open source documentation Practical strategies TL;DR [5 mins] - Summarise best practices and tools for OSS documentation accessibility - Q/A with the audience [5 mins] false https://pretalx.com/pyconde-pydata-2024/talk/7UYHYP/ https://pretalx.com/pyconde-pydata-2024/talk/7UYHYP/feedback/ B09 Prescriptive Analytics in the Python Ecosystem with Gurobi Sponsored Talk 2024-04-24T10:30:00+02:00 10:30 00:30 Join us as we guide you through integrating Gurobi and prescriptive analytics into your greater Python ecosystem. We’ll demonstrate model-building patterns based on NumPy and SciPy.sparse data structures and explore how to take advantage of indexed DataFrames and Series in pandas for mathematical model building. You’ll also discover how to use trained regressors from scikit-learn as constraints in optimization models. Join us as we delve into the world of optimization with Gurobi and elevate your workflows. pyconde-pydata-2024-44952-prescriptive-analytics-in-the-python-ecosystem-with-gurobi Sponsor Robert Luce en Gurobi is a prescriptive analytics technology that enables you to make optimal decisions from data. You can use prescriptive analytics to generate optimized decision recommendations, based on real-world variables and constraints. Powered by mathematical models solved by mixed-integer optimization, it enables embedded decision intelligence in all kinds of applications in an industry-agnostic fashion and in any deployment scenario. Join us as we guide you through integrating Gurobi and prescriptive analytics into your greater Python ecosystem. 
We’ll demonstrate model-building patterns based on NumPy and SciPy.sparse data structures and explore how to take advantage of indexed DataFrames and Series in pandas for mathematical model building. You’ll also discover how to use trained regressors from scikit-learn as constraints in optimization models. Join us as we delve into the world of optimization with Gurobi and elevate your workflows. false https://pretalx.com/pyconde-pydata-2024/talk/KCYDM9/ https://pretalx.com/pyconde-pydata-2024/talk/KCYDM9/feedback/ B09 Mojo 🔥 - Is it Python's faster cousin or just hype? Talk 2024-04-24T11:05:00+02:00 11:05 00:30 On 2023-05-02, the tech sphere buzzed with the release of Mojo 🔥, a new programming language developed by Chris Lattner, renowned for his work on Clang, LLVM, and Swift. Billed as "Python's faster cousin," and "The programming language for all AI developers", Mojo promised a 68,000x performance uplift and a familiar Pythonic syntax. As it reaches its first anniversary, we unpack Mojo's journey towards its ambitious promise. This talk delves into the practical experiences developing a Large Language Model Interpretation library as part of an AI Safety Camp project in that language. We cast a critical eye over its performance, evaluate its usability, and explore its potential as a Python superset. Against a backdrop where alternatives like Rust, PyPy and Julia dominate performant programming for AI, we question whether Mojo can carve out its niche or if it will languish as another "could-have-been" in the programming language pantheon. pyconde-pydata-2024-42873-mojo-is-it-python-s-faster-cousin-or-just-hype- PyCon: Python Language & Ecosystem Jamie Coombes en Background & Motivation The introduction of Mojo by Chris Lattner captured the attention of the Python community with the allure of dramatic performance enhancements and a syntax that would not alienate current Python developers. 
As Mojo progresses beyond its infancy, it's critical to assess its evolution and its capacity to disrupt the programming ecosystem, particularly within artificial intelligence and machine learning domains. Objective & Scope This presentation will share findings from an AI Safety Camp project which used Mojo to build a Large Language Model Mechanistic Interpretability and Activation Engineering library. Through our exploration, we aim to provide a candid narrative of Mojo's strengths and limitations, judge its performance claims, and probe its likelihood of adoption for AI development. Content Overview Introduction to Mojo: Brief overview of Mojo's conception, ethos, and intended use-cases. Performance Claims: A closer look at the purported 68,000x speed increase over Python, including benchmark comparisons and real-world application data. Language Design: An analysis of Mojo's syntax and semantics, drawing parallels and contrasts with Python, and the implications for developers transitioning to or adopting Mojo. Case Study: Detailed account of the process of writing a Large Language Model Interpretation library in Mojo, highlighting the challenges and breakthroughs experienced. Ecosystem Overview: Examination of the current state of Mojo's ecosystem, its community support, and the availability of tooling and libraries. Discussion: Engaging the audience in a discussion about Mojo's potential future, its fit within existing projects, and the propensity for it to become the primary language for AI development. Conclusion We'll wrap up with predictions for Mojo's trajectory based on our experiences and broader industry trends, potentially setting the stage for Mojo to capture the "Mojo" it needs to triumph or to become a footnote in the annals of programming language history.
false https://pretalx.com/pyconde-pydata-2024/talk/DG8G7Q/ https://pretalx.com/pyconde-pydata-2024/talk/DG8G7Q/feedback/ B09 Enhance your balcony power plant with Python Talk 2024-04-24T11:40:00+02:00 11:40 00:30 Plug-in solar systems, so-called balcony power plants, are getting more popular. This talk will cover the basics of such a system, how to figure out the energy consumption of a household and how to monitor and optimize the power output of a balcony power plant. pyconde-pydata-2024-40109-enhance-your-balcony-power-plant-with-python General: Infrastructure - Hardware & Cloud Jannis Lübbe en Plug-in solar systems, so-called balcony power plants, are getting more popular and more affordable as people want a simple way to participate in moving towards sustainable energy resources. They are easy to install without the need for an electrician. In this talk I will discuss how to figure out how much power a household consumes and how much can be covered by the balcony power plant. I will also exemplify different user profiles, like “working from home” or the “home in idle state”, and how they affect the efficiency of an additional battery system. The power consumption is measured by using devices, like WiFi plugs, from Shelly and myStrom, each offering a REST API. The power production is preferably recorded by using OpenDTU in combination with compatible microinverters but may be measured using WiFi plugs as well. These measured values are published to Redis and can be observed using WebSockets and FastAPI. Additionally, these values may be pushed to a public server running on FastAPI and Redis as well. A social login like Google or GitHub can be used to control the access to this server.
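To make the coverage question concrete, here is a minimal sketch of how measured consumption and production could be compared once they are available as hourly values. It is not taken from the talk: the function name `coverage`, the two profiles, and all numbers are hypothetical, and reading the values from Shelly, myStrom, or OpenDTU is left out entirely.

```python
from typing import Dict, Sequence


def coverage(consumption_wh: Sequence[float], production_wh: Sequence[float]) -> Dict[str, float]:
    """Compare hourly consumption with hourly production (both in Wh)."""
    if len(consumption_wh) != len(production_wh):
        raise ValueError("both series need one value per hour")

    # Energy used directly: in each hour, at most what was produced.
    covered = sum(min(c, p) for c, p in zip(consumption_wh, production_wh))
    # Energy produced but not used in the same hour (what a battery could store).
    surplus = sum(max(p - c, 0) for c, p in zip(consumption_wh, production_wh))
    total = sum(consumption_wh)

    return {
        "covered_wh": covered,
        "surplus_wh": surplus,
        "covered_share": covered / total if total else 0.0,
    }


# Hypothetical four-hour window; real data would come from the measuring devices.
production = [0, 100, 300, 100]
idle_home = [50, 50, 50, 50]          # "home in idle state" profile
working_from_home = [50, 200, 250, 200]  # "working from home" profile

idle = coverage(idle_home, production)
wfh = coverage(working_from_home, production)
print(idle)
print(wfh)
```

In this made-up window the idle profile leaves far more surplus than the working-from-home profile, which is the kind of difference that decides whether an additional battery is worthwhile.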
false https://pretalx.com/pyconde-pydata-2024/talk/TCSERC/ https://pretalx.com/pyconde-pydata-2024/talk/TCSERC/feedback/ B09 Connecting batteries with Python: Towards EV Charging with #zero emissions at #zero costs Sponsored Talk 2024-04-24T13:10:00+02:00 13:10 00:30 This talk dives into how Python helps us to bridge the gap between automotive and energy industries. Learn how Python helps in integrating EV batteries into the power grid, enabling further use and growth of renewable energies, stabilizing power grids and enhancing the accessibility of electric mobility. pyconde-pydata-2024-44849-connecting-batteries-with-python-towards-ev-charging-with-zero-emissions-at-zero-costs Sponsor Christopher Bock en The goal of The Mobility House is to create a zero-emission energy and mobility future. Our technology unites the automotive and energy industries. We integrate vehicle batteries into the power grid using intelligent charging and energy solutions. This way, we promote the development of renewable energies, stabilize the power grid, and make electric mobility more affordable. The goal of this talk is to give you an overview of how and where Python is used at The Mobility House. A hint upfront, we use it in many places. We use Python in all phases of development, it enables us to go quickly from a proof of concept to production. Python helps us in understanding our data better and using Python in production even changed our development culture and helped bridging the gap between data scientists and coders. However, Python does not solve all of our problems, so we will also talk about the roadblocks we hit and share the solutions which worked for us. 
false https://pretalx.com/pyconde-pydata-2024/talk/K8AL9P/ https://pretalx.com/pyconde-pydata-2024/talk/K8AL9P/feedback/ B09 Replacing Callbacks with Generators: A Case Study in Computer-Assisted Live Music Talk 2024-04-24T13:45:00+02:00 13:45 00:30 *Callbacks* have become a ubiquitous programming technique that we use every day without even thinking about it. They are definitely handy in many situations, but sometimes they feel more like a burden than a help. In developing an interactive realtime audio processing system for use on stage in live music, we encountered such a situation. This talk will present how a few dozen lines adding a thin abstraction layer allowed us to replace a complex callback mess with tremendously more readable *generators* (yes, you know, those functions which `yield` results instead of `return`ing them...). pyconde-pydata-2024-40404-replacing-callbacks-with-generators-a-case-study-in-computer-assisted-live-music PyCon: Python Language & Ecosystem Matthieu Amiguet en At [Les Chemins de Traverse](https://www.lescheminsdetraverse.net/) we explore ways of "augmenting" acoustical musical instruments with new sonic possibilities offered by computers (think "augmented reality" for live music). To do so, we are using Olivier Bélanger's great [pyo](http://ajaxsoundstudio.com/software/pyo/) module for realtime audio processing. To make the system interactive, this module allows registering callbacks on some events. While this works great in many situations, it can get very cumbersome when we design a stateful system, where the same event must trigger different callbacks depending on the system's inner state. This talk will present how we developed a thin abstraction layer that allows us to replace many callback functions together with many registering/unregistering of these functions by a nice, streamlined *generator* definition that's incomparably more readable than the many-callbacks version.
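One way such a layer could look, as a minimal sketch and not the speaker's actual implementation: a single callback is registered once and simply resumes a generator, so the system's state lives in the generator's position instead of in a set of registered callbacks. `EventSource`, `drive`, and `performance` are hypothetical names, and `EventSource` is a plain-Python stand-in for an event API; no pyo calls are used.

```python
from typing import Callable, Generator, List, Optional


class EventSource:
    """Hypothetical stand-in for an event API that accepts one callback."""

    def __init__(self) -> None:
        self._callback: Optional[Callable[[str], None]] = None

    def register(self, callback: Callable[[str], None]) -> None:
        self._callback = callback

    def fire(self, event: str) -> None:
        if self._callback is not None:
            self._callback(event)


def drive(source: EventSource, script: Generator[None, str, None]) -> None:
    """Register one callback that resumes the generator on every event."""
    next(script)  # advance to the first `yield`, where the script waits

    def on_event(event: str) -> None:
        try:
            script.send(event)  # resume the script with the event
        except StopIteration:
            source.register(lambda _event: None)  # script is over: ignore events

    source.register(on_event)


def performance(log: List[str]) -> Generator[None, str, None]:
    """The same event means something different at each stage of the piece."""
    event = yield
    log.append(f"intro started by {event}")
    event = yield
    log.append(f"loop recorded on {event}")
    event = yield
    log.append(f"finale on {event}")


log: List[str] = []
pedal = EventSource()
drive(pedal, performance(log))
for name in ["pedal-1", "pedal-2", "pedal-3", "pedal-4"]:
    pedal.fire(name)
print(log)
```

The callback version of the same piece would need three callbacks plus the code that swaps them in and out; here the order of the stages is just the order of the statements.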
This allows us to keep our mind focused on what's important, namely supporting the music we want to play, instead of tedious boilerplate code. While our use case is admittedly very specific, we believe that the ideas we present could be adapted in many other situations where callbacks are used for technical reasons, but lead to bulky and contrived code. false https://pretalx.com/pyconde-pydata-2024/talk/Y7R9GZ/ https://pretalx.com/pyconde-pydata-2024/talk/Y7R9GZ/feedback/ B09 Bridging the worlds: pixi reimplements pip and conda in Rust Talk 2024-04-24T14:45:00+02:00 14:45 00:30 Pixi is a modern package manager that bridges the worlds of conda and pip package management. A from-scratch implementation of a SAT solver that works for both pip and conda, native lockfiles and a cross-platform task system are compelling features of this new package manager. pyconde-pydata-2024-41565-bridging-the-worlds-pixi-reimplements-pip-and-conda-in-rust PyCon: Programming & Software Engineering Wolf Vollprecht, Ruben Arts en Pixi goes further than existing conda-based package managers in many ways: - Implemented from scratch in Rust and ships as a single binary - Integrates a new SAT solver called resolvo - Supports lockfiles like poetry / yarn / cargo - Cross-platform task system (simple bash-like syntax) A major requested feature was interoperability with PyPI packages. For this we have created a standalone library called rip. Rip contains all the code needed to download and extract wheels and SDist packages straight from PyPI, and also uses resolvo for resolution. We had to overcome some PyPI-specific hurdles that we want to discuss in the talk: - Lazy fetching of metadata, since on PyPI it is embedded in the wheel - Resolving Python packages for other platforms and locking them (since we want to resolve on Linux for Windows) We’re looking forward to taking a deep dive together into what conda and PyPI packages are and how we are seamlessly integrating the two worlds in pixi.
We’ll also look at some benchmarks and explain more about the conda ecosystem and why it might still have a reason to exist (even though wheels also solve a lot of the pain points). More information about Pixi: - https://pixi.sh - https://prefix.dev false https://pretalx.com/pyconde-pydata-2024/talk/HSJGHH/ https://pretalx.com/pyconde-pydata-2024/talk/HSJGHH/feedback/ B09 There is a Better Way to Automate and Manage Your (Fluid) Simulations Talk 2024-04-24T15:20:00+02:00 15:20 00:30 This is a story about applying Python and the “hacker mindset” to Computer Aided Engineering (CAE), an emerging domain within the Python ecosystem. Shell scripts have traditionally been the preferred tool for automating CAE pipelines, especially in the subfield of Computational Fluid Dynamics (CFD). However, this approach is brittle, severely limited and cumbersome to manage at scale. Data management is also a challenge, with tens to hundreds of GB per simulation needing to be stored and versioned in complex folder structures. One possible approach is to use Python as an automation and glue language and Data Version Control (DVC), which is a Python-based tool built on top of git to track pipelines and data. pyconde-pydata-2024-40960-there-is-a-better-way-to-automate-and-manage-your-fluid-simulations General: Industry & Academia Use-Cases Julian Wagenschütz en This is a story about applying Python and the “hacker mindset” to Computer Aided Engineering (CAE), an emerging domain within the Python ecosystem. Shell scripts have traditionally been the preferred tool for automating CAE pipelines, especially in the subfield of Computational Fluid Dynamics (CFD). However, this approach is brittle, severely limited and cumbersome to manage at scale. Data management is also a challenge, with tens to hundreds of GB per simulation needing to be stored and versioned in complex folder structures.
One possible approach is to use Python as an automation and glue language and Data Version Control (DVC) which is a Python based tool built on top of git to track pipelines and data. This talk will show you how to use Python to automate many tasks in CAE workflows, even when the tools don’t offer a native Python interface: - Exporting CFD simulation results from Starccm+ to a PowerPoint template with python-pptx and updating the final presentation with new simulation data - Preparing input data for an electrical thermal simulation to improve performance 80-fold Both examples will illustrate best practices and lessons learned in the automation of the CFD software that are applicable beyond the field. DVC was originally designed and is broadly used for machine learning pipelines, but its flexibility allows it to be adapted to other domains. The potential benefits for engineering applications are immense. This talk will show you how easy it is to convert an existing CAE pipeline to DVC and show the benefits: - Running hundreds of simulations, comparing them and choosing the optimal with DVC - Managing software versions declaratively and comparing results across versions - Creating in-depth meta studies and comparing many simulations with Jupyter notebooks Finally, this talk will give an outlook on the changing CAE ecosystem and propose new features for DVC to better leverage it for this use case. **Audience** Either simulation engineers seeking to enhance and scale their workflows or software engineers aiming to build powerful and flexible simulation tooling. **Relevant talks or blog posts** - Sending Rovers to Mars with Jupyter - Managing OpenFOAM Physical Simulations with DVC, CML, and Studio - How Python enables future computer chips false https://pretalx.com/pyconde-pydata-2024/talk/ML99UB/ https://pretalx.com/pyconde-pydata-2024/talk/ML99UB/feedback/ B07-B08 AsyncApp. 
My contribution to hype Pythons asyncio a bit more Talk 2024-04-24T10:30:00+02:00 10:30 00:30 Asyncio use is now everywhere in the Python world, ... or is it? Although it has been around since version 3.4, my impression is that it is still not the go-to solution when starting off new projects. It's not an obvious choice and traditional approaches still seem to be much preferred, especially by beginners. So let me take you with me on a journey to create simple, yet powerful building blocks to build asyncio-based applications using patterns that are easy to follow, lightweight and attractive. #asyncio #click #logging #psutil #redis #raspberrypi pyconde-pydata-2024-41756-asyncapp-my-contribution-to-hype-pythons-asyncio-a-bit-more PyCon: Programming & Software Engineering Jens Nie en Asyncio has been introduced as a possible solution mainly for I/O-related performance problems. The traditional way to handle I/O often ends up in code that blocks the execution of concurrent elements in an application, often resulting in bad performance. The usual suspects when dealing with these problems, such as multiprocessing and threading, are often considered to be complex and not straightforward in use, especially for beginners. I believe that proper threading and multiprocessing, with all their interprocess or shared memory communication, locks and race condition prevention, as well as efficient object handling, still require a deep understanding of the architecture and inner workings, and are still mainly a topic for experts. Asyncio comes to the rescue here, offering a layer of abstraction at a lower and much easier to understand level. While it is no solution to aid in distributing code execution to gain more performance, it will solve the blocking issues quite efficiently. To demonstrate the power and simplicity of asyncio I will show a few object-oriented building blocks that will allow us to create a simple environment monitoring app for the Raspberry Pi.
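The non-blocking behaviour described above can be illustrated with a minimal sketch. It is not one of the AsyncApp building blocks from the talk: the `periodic` coroutine, the task names, and the intervals are hypothetical, and `asyncio.sleep` stands in for real I/O such as reading a sensor.

```python
import asyncio
from typing import List


async def periodic(name: str, interval: float, repeats: int, log: List[str]) -> None:
    """Do a small piece of work every `interval` seconds, `repeats` times."""
    for _ in range(repeats):
        # While this task waits, the event loop is free to run the other task.
        await asyncio.sleep(interval)
        log.append(name)


async def main() -> List[str]:
    log: List[str] = []
    # Both tasks run concurrently in a single thread; neither blocks the other.
    await asyncio.gather(
        periodic("sensor", 0.01, 3, log),
        periodic("monitor", 0.03, 1, log),
    )
    return log


log = asyncio.run(main())
print(log)
```

With blocking `time.sleep` calls the two loops would have to run one after the other or in separate threads; here a single event loop interleaves them.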
This app will - periodically gather sensor readings - log them - store the readings to a data file - offer a monitoring system to log cpu and memory usage for itself - be able to be configured via environment variables, config files and command line arguments In its final iteration the app will be distributed into small parts just dealing with a single, very specific task to be performed, following the traditional UNIX philosophy for an app to do just one thing, but do this well. false https://pretalx.com/pyconde-pydata-2024/talk/BA7FZL/ https://pretalx.com/pyconde-pydata-2024/talk/BA7FZL/feedback/ B07-B08 High Performance Data Visualization for the Web Talk 2024-04-24T11:05:00+02:00 11:05 00:30 In this talk, we will put together a simple but full-featured website using [Perspective](https://perspective.finos.org). Perspective is an open source interactive analytics and data visualization component, which is especially well-suited for large and/or streaming datasets. It is written in C++ and Rust with bindings to both Python and WebAssembly, making it ideal for data-intensive applications. It comes with a variety of visualization plugins, including a datagrid and various charts. Additionally, it comes with a Jupyter widget, which allows developers to iterate quickly with a clear pathway to their production website. pyconde-pydata-2024-41827-high-performance-data-visualization-for-the-web PyData: Visualisation & Jupyter Tim Paine en The Python ecosystem has an ample supply of both web development frameworks and data visualization components. But despite the maturity of the ecosystem, few data visualization tools are capable of dealing with large amounts of streaming data. Even fewer are able to perform live aggregations, sorting, and filtering on top of this data. In this talk, we will put together a simple but full-featured website using [Perspective](https://perspective.finos.org).
Perspective is an open source interactive analytics and data visualization component, which is especially well-suited for large and/or streaming datasets. It is written in C++ and Rust with bindings to both Python and WebAssembly, making it ideal for data-intensive applications. It comes with a variety of visualization plugins, including a datagrid and various charts. Additionally, it comes with a Jupyter widget, which allows developers to iterate quickly with a clear pathway to their production website. We will start with a simple [FastAPI](https://fastapi.tiangolo.com)-based website and some static data. In a few lines of code, we will have the website up and running. Next, we will demonstrate some of the core features of Perspective - pivoting, sorting, filtering, the various visualization plugins, cross-filtering (using one table as a filter on other tables), and computed columns. After this, we will pull in some streaming data and show how the Perspective functionality demonstrated so far updates in real time alongside the data. Finally, we'll crank the speed of updates to the limit. By the end of this talk, the audience will know how to use Perspective and how to incorporate it into their own applications for both static and streaming data, either as a simple but high performance datagrid or as a full featured set of interconnected visualization components. false https://pretalx.com/pyconde-pydata-2024/talk/9JEZ8E/ https://pretalx.com/pyconde-pydata-2024/talk/9JEZ8E/feedback/ B07-B08 How to Improve the Python Development Experience for Millions of Ubuntu Users Talk 2024-04-24T11:40:00+02:00 11:40 00:30 Have you ever tried to install a different Python version on Ubuntu or tried to upgrade your current one? Lots of posts exist, many are outdated, and some even lead to a broken Ubuntu installation. This talk will introduce the most common options and their ups and downs in-depth.
We will also give an outlook on what Ubuntu could do to make it even easier for you and everybody. pyconde-pydata-2024-41823-how-to-improve-the-python-development-experience-for-millions-of-ubuntu-users PyCon: Python Language & Ecosystem Jürgen Gmach en Updating your current Python installation, or installing a different one on Ubuntu is not an easy task. There are many reasons why you want a different Python version on Ubuntu: - you want to use the latest version, but Ubuntu comes with an older one pre-installed - a Python app requires an older Python version - you want to test your Python library against multiple Python versions Unfortunately, `apt install python-<version>` won't work. After googling some time, you'd learn that you have many options: - pyenv - deadsnakes - mamba/conda - or even compiling Python yourself Why isn't there a single way, and which one fits your needs the best? And why doesn't `apt install python-<version>` just work? There are many blog posts and tutorials out there to install a new Python version, but they lack the depth to understand the core of the problem. And are they up-to-date? Do you trust them not to break your Ubuntu installation? This talk will not only introduce and compare all the most common options to update a Python version or to install a new one on Ubuntu but will also convey the knowledge to assess the existing and upcoming options yourself. We will also look into the future. What new tools are on the horizon? And especially, what could Ubuntu do itself to make it easier for you and everybody? false https://pretalx.com/pyconde-pydata-2024/talk/DKL7YQ/ https://pretalx.com/pyconde-pydata-2024/talk/DKL7YQ/feedback/ B07-B08 µDjango, an asynchronous microservices technique. Talk 2024-04-24T13:10:00+02:00 13:10 00:30 A standard Django project involves working with multiple files and folders from the start. Let's see how the work with a Django project changes when we have only one file. 
This solution automatically transforms Django into a microservice-oriented async framework with a "batteries included" philosophy. pyconde-pydata-2024-41677-django-an-asynchronous-microservices-technique- PyCon: Django & Web Maxim Danilov en The idea of a lightweight Django project isn't new. The single-py-file Django project paradigm first appeared in 2014 in the book Lightweight Django. I worked with a Django project consisting of only 2 files in 2015. At that time, the tiny Django project wasn't comparable to the capabilities of projects based on FastAPI or Flask. But a couple of years later, Django introduced ASGI, and in 2022, Django was ready for use in microservices. The concept of creating micro-projects on Django reappeared within the Django community in 2019 and again in the spring of 2023, and now we have a full-fledged technology for creating asynchronous microservices consisting of one or two files. It was named uDjango. In this talk, I will share my experience in creating high-performance microservices on Django and how I keep simplicity and minimalism in projects. During the talk, I'll discuss the advantages of Django microservices: * All-in-one package * Standard architecture and syntax * Extremely rapid development and deployment speed After years of work with the uDjango paradigm, I have identified the challenges in creating Django microservices: * The prevailing opinion that the 'Django framework isn't suitable for microservices' * Django settings.py - cause of many problems. * URL routing in Django that could be stricter * Initialization time of forms and model objects reduces performance The result of this talk for the audience will be knowledge about µDjango, a ready-to-use technology for building synchronous and asynchronous microservices. Talk based on ideas of: Julia Elman and Mark Lavin, Lightweight Django 2014. Will Vincent, django-microframework 2019. Kirill Klenov, python benchmark repository, 2019.
Carlton Gibson, LinkedIn post about one-app Django project, 2022. Paolo Melchiore, uDjango, 2023. false https://pretalx.com/pyconde-pydata-2024/talk/7NETLX/ https://pretalx.com/pyconde-pydata-2024/talk/7NETLX/feedback/ B07-B08 Beyond Deployment: Exploring Machine Learning Inference Architectures and Patterns Talk 2024-04-24T13:45:00+02:00 13:45 00:30 This talk is about setting up robust and scalable machine learning systems for high-throughput real-time predictions and large numbers of users. It is meant for ML engineers and people who work with data and want to learn more about MLOps focusing on cloud-based platforms. The focus of this talk will be on different ways to make predictions: real-time, asynchronous, and batch processing. It discusses the advantages and disadvantages of the different patterns and highlights the importance of choosing the right pattern for specific use cases, including generative large language models. We will use examples from StepStone's production systems to illustrate how to build systems that scale to thousands of simultaneous requests while delivering low-latency, robust predictions. I will cover some of the technical details, how to efficiently manage operations, and real-life examples in a way that is easy to understand and informative. You will learn about different setups for ML and how to make them work. This will help you make your ML inference faster, more cost-efficient, and reliable. pyconde-pydata-2024-42904-beyond-deployment-exploring-machine-learning-inference-architectures-and-patterns PyCon: MLOps & DevOps Tim Elfrink en This talk explains the major challenges of ML deployment and management, emphasizing inference patterns for robust, scalable applications. Using StepStone's infrastructure as an example, we'll discuss efficiently handling large workloads and complex models, including recent large language models, to ensure fast, cost-effective, and reliable results.
The session begins with an introduction, highlighting the significance of ML inference and outlining the objective of providing insights into effective MLOps strategies. We'll then overview various ML inference patterns, emphasizing their advantages, disadvantages, and the importance of selecting the right pattern for specific use cases. Moving on, we'll delve into StepStone's ML inference strategy, showcasing real-world applications and how scalability, performance, and cost are managed while maintaining agility for frequent model updates and monitoring in production systems. In summary, this talk provides a practical roadmap of ML inference patterns with a focus on real-world implementation at StepStone. false https://pretalx.com/pyconde-pydata-2024/talk/ZLDMGM/ https://pretalx.com/pyconde-pydata-2024/talk/ZLDMGM/feedback/ B07-B08 The AI Revolution Will Not Be Monopolized: How open-source beats economies of scale, even for LLMs Talk 2024-04-24T14:45:00+02:00 14:45 00:30 With the latest advancements in Natural Language Processing and Large Language Models (LLMs), and big companies like OpenAI dominating the space, many people wonder: Are we heading further into a black box era with larger and larger models, obscured behind APIs controlled by big tech monopolies? I don’t think so, and in this talk, I’ll show you why. I’ll dive deeper into the open-source model ecosystem, some common misconceptions about use cases for LLMs in industry, practical real-world examples and how basic principles of software development such as modularity, testability and flexibility still apply. LLMs are a great new tool in our toolkits, but the end goal remains to create a system that does what you want it to do. Explicit is still better than implicit, and composable building blocks still beat huge black boxes. 
pyconde-pydata-2024-41607-the-ai-revolution-will-not-be-monopolized-how-open-source-beats-economies-of-scale-even-for-llms PyData: Natural Language Processing & Computer Vision Ines Montani en As ideas develop, we’re seeing more and more ways to use compute efficiently, producing AI systems that are cheaper to run and easier to control. In this talk, I'll share some practical approaches that you can apply today. If you’re trying to build a system that does a particular thing, you don’t need to transform your request into arbitrary language and call into the largest model that understands arbitrary language the best. The people developing those models are telling that story, but the rest of us aren’t obliged to believe them. false https://pretalx.com/pyconde-pydata-2024/talk/QYPLJE/ https://pretalx.com/pyconde-pydata-2024/talk/QYPLJE/feedback/ B07-B08 Jupyter Notebooks for Print Media Talk 2024-04-24T15:20:00+02:00 15:20 00:30 In this talk, we will discuss leveraging Jupyter Notebooks to generate print media - books, magazine and newspaper articles, business reports, academic papers, etc. We will motivate the problem, introduce a library for accomplishing the task (nbprint), and walk through some end-to-end examples. pyconde-pydata-2024-41830-jupyter-notebooks-for-print-media PyData: Visualisation & Jupyter Tim Paine en Jupyter Notebooks are the tool of choice for researchers and data scientists, and a lot of work has been done to take Jupyter Notebooks and turn them into standalone websites. From [Voilà](https://voila.readthedocs.io/en/stable/index.html) to [Jupyter Book](https://jupyterbook.org/en/stable/intro.html), with widget and app libraries galore, it has never been easier to take a notebook and produce an interactive website. 
In contrast, despite the origins of notebooks in academic research, comparatively less work has been done in building tools to take notebooks and produce print media - newspaper articles, business reports, textbooks, academic publications, etc. In this talk, we will do four things. First, we will motivate print media as a good target for Jupyter Notebooks. We will do so through three worked examples: - a data-driven news publication such as those from The New York Times - a computer science textbook - a business intelligence report Second, we will highlight the correct set of technologies for producing notebook-derived print media. In particular, we will discuss NBPrint, a small [NBConvert](https://nbconvert.readthedocs.io/en/latest/)-based library that leverages [paged.js](https://pagedjs.org), a free and open source library which has [been used to produce real, printed books](https://pagedjs.org/made-with-paged.js.html). Third, we will give an end-to-end example from Jupyter Notebook to publication quality result for one of the above examples, showing a side-by-side comparison with the original media. Finally, we will discuss the power of the notebook oriented approach, and discuss which disciplines might be best suited for adopting notebooks as the source format for their print-oriented media. false https://pretalx.com/pyconde-pydata-2024/talk/DBGXJN/ https://pretalx.com/pyconde-pydata-2024/talk/DBGXJN/feedback/ B05-B06 Reinforcement Learning: Bridging The Gap Between Research and Applications Talk 2024-04-24T10:30:00+02:00 10:30 00:30 Reinforcement learning (RL) has great potential for industrial applications, but few mature software frameworks exist to facilitate its use. This talk discusses efforts to improve the software landscape for RL, making it easier for researchers to contribute algorithms and for engineers to apply RL in real-world settings.
Specifically, we highlight the open-source library Tianshou, which provides high-level interfaces for painless RL application development along with lower-level APIs that cater to the needs of researchers. By improving RL software, we aim to accelerate research progress and expand RL adoption in industry. pyconde-pydata-2024-41808-reinforcement-learning-bridging-the-gap-between-research-and-applications PyData: Machine Learning & Deep Learning & Stats Michael Panchenko en Despite the very general applicability of reinforcement learning (RL) to a variety of decision and control problems, there are comparatively few applications of it in current industries. Moreover, many important developments emerging in the highly active RL research community do not get added to existing frameworks or libraries. Code written for successful RL applications in industry is also rarely contributed to open source software (OSS). This is in stark contrast to other areas of machine learning (ML), where reported progress is often transferred to mature OSS within weeks, if not days. Part of the reason behind this lamentable state may be the intrinsically higher complexity of RL when compared to, say, supervised learning. However, we believe that the lower permeation of RL in mature software arises in large part because writing RL-based software is currently much harder than it has to be. Widely used OSS for RL is either too complex for researchers to contribute to (like ray/RLlib or Pearl), too buggy and unstable for industry to consider (also RLlib), too limited in scope (like stable-baselines3, which includes relatively few algorithms), lacking high-level interfaces (like torch-rl), or even completely gives up on modularity (like cleanRL). Another reason is the difference in focus between RL research and applications. 
In research, an important goal is to find an algorithm that works well in a variety of environments, whereas in applications, one is usually interested in solving a particular environment of interest, by any means. This leads to wildly differing evaluation scenarios and selection criteria. We believe that the current state of RL software is reminiscent of the pre-PyTorch/pre-Keras era for supervised deep learning, when the implementation of a task like training a convolutional network on a large image dataset was non-trivial. Today, it requires but a few lines of code. We thus infer that significant progress in the software landscape supporting RL is still to be made, and that this progress will have high impact both on researchers and ML engineers. With this goal in mind, the appliedAI Institute for Europe, together with the core developers of the open source RL library Tianshou, took on the task of extending the latter in order to democratize RL in applications and accelerate reliable and trustworthy research on it. In this talk, we will highlight Tianshou’s high-level interfaces, which allow painless applications of RL algorithms in industry applications, as well as the lower-level interfaces that researchers can base their work on. Research code that is compatible with Tianshou’s interfaces will not only get mature evaluation, reporting and hyper-parameter optimization “for free”, but will also be much easier to use in applications, thereby boosting its impact. We will also address the question of environment design, which is a highly important RL engineering topic that is largely ignored in RL research. false https://pretalx.com/pyconde-pydata-2024/talk/SSKV9R/ https://pretalx.com/pyconde-pydata-2024/talk/SSKV9R/feedback/ B05-B06 Climate Crisis in Numbers Sponsored Talk 2024-04-24T11:05:00+02:00 11:05 00:30 Climate change is one of the biggest and most daunting challenges that our and future generations are going to face. 
In order to mitigate climate change and its consequences, first one needs to understand the problem and get a rough idea about the magnitude of human made global warming. As a proper numbers nerd I understand problems best when looking at science, statistics, and measurements. So here’s my little guide to better grasp what climate change is all about through data. pyconde-pydata-2024-44945-climate-crisis-in-numbers Sponsor Robert Meyer en About 5 years ago my co-founder and I launched alcemy, a Machine Learning startup to help decarbonize the cement and concrete supply chain. My primary motivation to run the startup is to find ways to tackle and prevent climate change and human made global warming. In the course of building the company I not only wanted to understand how much we can contribute in our niche sector of cement and concrete, but get a better idea of the problem and its magnitude as a whole. So here’s my little guide to better grasp what climate change is all about through data. I am going to talk about a variety of things regarding climate change and the greenhouse effect: - CO2 Equivalence - Magnitude and origin of different emission sources - The consequences of global warming and our potentially grim future - A (very) brief outlook of what humankind needs to do to tame global warming PS: Absolutely no Python experience needed here ;-) false https://pretalx.com/pyconde-pydata-2024/talk/GNK3PV/ https://pretalx.com/pyconde-pydata-2024/talk/GNK3PV/feedback/ B05-B06 Lessons learned from deploying Machine Learning in an old-fashioned heavy industry Talk 2024-04-24T11:40:00+02:00 11:40 00:30 About 5 years ago my co-founder and I launched alcemy, a Machine Learning startup to help decarbonize the cement and concrete supply chain. I experienced first hand moving from a simple proof of concept, a ML model inside a Jupyter notebook, to a full-fledged pipeline running 24/7 and steering massive amounts of cement production in real plants. 
I can tell you the road was long and winding. I want to share some of the hard lessons we learned along the way with you. If you are an aspiring ML or Software Engineer, Data Scientist, Entrepreneur, or you are just wondering what Machine Learning applied in the wild looks like, this talk is for you. No prior knowledge is required except some familiarity with basic concepts and terminology of Machine Learning. pyconde-pydata-2024-41452-lessons-learned-from-deploying-machine-learning-in-an-old-fashioned-heavy-industry PyData: Machine Learning & Deep Learning & Stats Robert Meyer en Introduction ------------------ **Cement alone is responsible for about 8% of worldwide CO2 emissions**. Fortunately, we have quickly learned that low-carbon alternatives to "conventional" cement and concrete already exist. For instance, 60% of carbon emissions can be avoided if burnt limestone, the main ingredient for cement, is replaced partly by limestone powder (which isn't burnt, and therefore doesn't release carbon into the atmosphere). Yet, these low-carbon cement recipes have a substantial shortcoming: They react much more sensitively to changes, e.g. changes in weather conditions or in the chemical and mineralogical composition of ingredients. As a consequence, low-carbon cements and the resulting concrete (made by mixing cement with sand and water) can only be reliably produced under laboratory conditions. We are changing this. We use data intelligence and predictive Machine Learning control to optimize production processes such that low-carbon cement and concrete can be manufactured in real plants and at scale. I will quickly introduce our solution, which is already deployed in 5 cement plants. Moreover, we are currently prototyping to move into concrete production as well. Of course, we do this (mostly) in Python. Part 1: Machine Learning ------------------------------------- Machine Learning in production is vastly different from solving a Kaggle challenge. 
In fact, the particular choice of Machine Learning model is much less important than you think. I will cover the benefits of using rather simple models such as random forests or even linear regression in comparison to deep learning. If stuff goes wrong, and it will, interpretable and debuggable models are far superior to complex architectures. Also, proper model evaluation that reflects production requirements and good baselines for comparison are always crucial first steps that pay off in the long run. It was surprising how much less time we spent on the core Machine Learning algorithms in comparison to infrastructure, such as deployments on AWS Fargate or k8s, re-training processes, proper database layout, or home-brewed tooling to allow easier configurations of dozens of ML models. Part 2: Data ------------------ We quickly learned that data is way more important than models. Some might have heard the phrase *Garbage in, garbage out* coined by programmers in the 50s. This is even more important when it comes to today's widespread usage of Machine Learning. We run ML not on our own data, but on data provided by our customers. While the level of data maintenance and quality that our customers are used to allows for in-house bookkeeping and short analyses, it does not necessarily suffice for ML. I will discuss why and how we spend a good amount of time cleaning and really drilling into the data provided by our customers. Moreover, differences between training and real-time inference data can be a real challenge. For example, it is not guaranteed that the location where samples are drawn from cement mills, i.e. the live data used for inference, is as representative of the actual cement as silo samples that can be used for training. Fine particles might not be captured simply due to the physical properties of the sample site. To tackle problems like these as a Machine Learning engineer, you have to become an expert in the domain where your models are applied. 
You really need to understand the data in every detail and know how it is generated by your customers and understand the context and consequences of all of your customers' processes. Part 3: Customers and Business ----------------------------------------------- Our customers are, of course, not Machine Learning experts. Why should they be? If they were, they wouldn't need us anyway. However, oftentimes we as Machine Learning engineers forget the ramifications of this. I will talk about customer relations and their interactions with our Machine Learning models. For example, we had to deal with a rather skeptical customer not believing our models' predictions. They pretty much went against all recommendations made by the model. Although it is nice if in the end the model predictions turn out to be right, your customer does not necessarily feel the same way. On the contrary, the customer does not enjoy being wrong and may even feel mocked by a machine. Having a strong customer success team that knows both how ML works and, of course, how the customer operates and thinks, is often more valuable than "rockstar" Machine Learning engineers. Lastly, a tough lesson to learn was that Machine Learning as a service should not be mistaken for a software as a service business model. Our marginal costs are not zero. Besides a great deal of consulting that is needed for every customer, on-boarding a new customer is time-consuming and needs a lot of work. Integrating into existing infrastructure of cement plants (which are not top-notch IT companies) can be tough or downright frustrating at times. Therefore, scaling a Machine Learning startup can be hard, and we learned to better go hunting for elephants, i.e. a few high-paying customers, than for mice, many low-paying ones. 
false https://pretalx.com/pyconde-pydata-2024/talk/UGJJMP/ https://pretalx.com/pyconde-pydata-2024/talk/UGJJMP/feedback/ B05-B06 How Python helped us uncover secrets of protein motion Talk 2024-04-24T13:10:00+02:00 13:10 00:30 This presentation will give an overview of a scientific project that focuses on understanding how proteins move and function. Along the way, we used a very large collection of Python tools and built our own innovative approaches on top of them. To be able to understand everything about living beings, including our health and the origin of diseases in humans, we have to know how proteins do what they do. Hence it is of utmost importance to understand their structure and function. Thanks to the extraordinary technique called X-ray crystallography, we are able to see what proteins look like at atomic scale, but it is impossible to see how they move. Therefore the next best thing we can do is to simulate the motion of the protein by so-called molecular dynamics (MD) simulations. These simulations generate incredible amounts of data, generally hundreds of GB of data per 1 microsecond of protein movement! Extracting useful and meaningful information from it is a daunting task. We are going to show how we have used many Python tools to tackle this problem in the project. Using Django to place everything in an interactive web app (https://alokomp.irb.hr/), along with Pandas, Numpy, Scipy, Dask, Jupyter, NetworkX, Bokeh, Datashader and many more under the hood, we have created an innovative new way of seeing proteins move and communicate. pyconde-pydata-2024-41753-how-python-helped-us-uncover-secrets-of-protein-motion General: Industry & Academia Use-Cases Zoran Štefanić, Boris Gomaz en Proteins are one of the main building blocks of the living world. They are largely responsible for the amazing diversity that we witness in the nature around us. 
Although proteins are composed of sequences of just 20 amino acids, nature’s clever design has endowed them with an incredibly diverse set of functions. It is not an overstatement to say that this diversity and the myriad of ways proteins interact with each other are at the very heart of life. Therefore it is of utmost importance to understand their structure and function. Proteins are very large molecules, composed of thousands up to even millions of atoms connected in giant hairball-like structures. But still they are too tiny to be seen by any sort of microscope, even the most powerful ones. That is why, in order to “see” what they look like, we shine X-rays on crystals made entirely of a single protein species, in the fascinating method of X-ray crystallography. It gives us a picture of what the proteins look like in unprecedented atomic detail. To perform their function, proteins also move their parts, but unfortunately this motion is too quick to be seen by any device. X-ray crystallography alone, although mighty in giving us the details, gives us only one static image. It is a bit like trying to tell the story of a movie just by seeing its poster. Therefore we have to simulate the motion of the protein by so-called molecular dynamics (MD) simulations. Basically we give the computer the initial positions of all the atoms that we know from X-ray crystallography and then kick them and see how the protein moves in time, in very tiny steps. This results in so-called MD trajectories which contain all atom positions in millions of steps. Needless to say, this results in very heavy data, usually hundreds of GB, that needs to be processed somehow. In the project called “Allosteric communication pathways in oligomeric enzymes” (https://alokomp.irb.hr/) we have faced that very problem. How to extract information about protein movement from such enormous quantities of data? 
Of course, the answer was to use the marvelous suite of Python tools available. Python has established itself as a de facto standard programming language in data science, and with the plethora of options already available for X-ray crystallography and MD analysis, it was a logical choice (not to mention its awesomeness and being our favourite anyway). The whole project really displays how mature and diverse Python is to be able to tackle every single aspect of such a specialized problem. To begin with, we have centered the entire project around a web page built using Django. It serves both as a front-end with general information and as a web app for diving into the data. Behind it is a PostgreSQL relational database containing all the structural and derived data from a family of proteins, called PNPs, which serve as a sort of proof of concept (https://alokomp.irb.hr/pdbase/structures/). It also contains data derived from MD simulations and analysed with the MDAnalysis tool (https://www.mdanalysis.org/). It is hard to mention all the Python tools we have used for analysis of the data in the database. Of course, its backbone consists of the indispensable Pandas, Numpy, Scipy, Dask, Jupyter, NetworkX, Bokeh, and HoloViz, to name but a few. More specifically, we have developed a special approach (“avocado” plots, example https://alokomp.irb.hr/md/avocados/1458/A) to visualize the motion of a protein as a whole in time, as a series of snapshots each containing plots of millions of points, using the awesome Datashader library (https://datashader.org). We have also used the Ruptures library (https://github.com/deepcharles/ruptures) to detect changes in the positions of the protein and to detect correlations. Everything is wrapped up in the form of an interactive web app which can be used to visually browse vast amounts of data, giving a whole new perspective on highly complex multidimensional data. 
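Plotting millions of points per snapshot is feasible because the points are first aggregated onto a fixed pixel grid and only the grid is rendered. The following is a minimal pure-Python sketch of that aggregation idea; it is not Datashader's API and not the project's code, and the function name `rasterize_counts` and its parameters are hypothetical, chosen only for illustration.

```python
def rasterize_counts(points, width, height, x_range, y_range):
    """Count how many (x, y) points fall into each cell of a width x height grid.

    Hypothetical helper for illustration; points outside the ranges are ignored.
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # skip points outside the plotted area
        # Map the coordinate to a cell index; clamp the upper edge into the last cell.
        col = min(int((x - x_min) / (x_max - x_min) * width), width - 1)
        row = min(int((y - y_min) / (y_max - y_min) * height), height - 1)
        grid[row][col] += 1
    return grid


# Toy data: three points inside the unit square, one outside it.
points = [(0.0, 0.0), (0.1, 0.1), (1.0, 1.0), (2.0, 2.0)]
grid = rasterize_counts(points, width=2, height=2, x_range=(0.0, 1.0), y_range=(0.0, 1.0))
print(grid)
```

However many input points there are, the result always has `width * height` cells, so the cost of drawing depends on the image size rather than on the size of the trajectory.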
false https://pretalx.com/pyconde-pydata-2024/talk/TMF8V7/ https://pretalx.com/pyconde-pydata-2024/talk/TMF8V7/feedback/ B05-B06 525 days working full-time on FOSS: lessons learned Talk 2024-04-24T13:45:00+02:00 13:45 00:30 I've been working full-time on a Python FOSS project for 525 days, so what did I learn? Am I a better (Python) programmer? Am I a better teammate? Am I a better person? In this talk I will share some of the lessons I learned over the course of these 525 days: - how to get a tech job in this day & age - how to put your ego aside when working with others (who know more than you!) and how to deal with mistakes - how to interact with users & contributors online - how it feels to collaborate on a large codebase As for the first three reflective questions, you'll have to ask my colleagues! pyconde-pydata-2024-40942-525-days-working-full-time-on-foss-lessons-learned General: Community, Diversity, Career, Life and everything else Rodrigo Girão Serrão en ## Outline ### Introduction (~5min) Personal and professional context for the talk: - Who am I? - What FOSS project have I been working on for 525 days? - Who am I working with? ### Lesson learned 1 – how to get a tech job (~5min) In this segment of the talk I share the story of how I got this job. This will explain how my writing on my blog helped establish some reputation and how my (Python-focused) social media presence connected me with the person who would eventually become my employer. ### Lesson learned 2 – put your ego aside (~5min) In this segment of the talk I explain how I deal with PR reviews and how I've learned to embrace the criticism, taking into account that all of your work is scrutinised every time you make a PR. I'll also tell the story of how I made a couple of blunders in successive PRs, how my team dealt with those, and what I took away from those weeks when I underperformed. 
### Lesson learned 3 – interacting with users & contributors (~5/7min) This segment of the talk covers the other end of the interactions on a FOSS project, answering questions like: - How should you behave when interacting with users making feature requests? - What about users that report “bugs” that would be “solved” if they read the documentation carefully? - How do you review external PRs, leave feedback, and request changes? Depending on how the audience reacts to this segment, I might also tell an anecdote about how bad I felt when rejecting an external PR and how that feeling was amplified tenfold when I found out that the external PR came from a “Python personality”, which also contains another lesson because the person whose PR was rejected handled it in the most graceful way possible. ### Lesson learned 4 – working on a large project (~5min) I will dedicate this segment of the presentation to talk about the strategies I use to deal with the fact that the project I work on is too big for me to keep all of it in my head. This includes my note-taking system and my PR checklist. ### Wrap-up (~2min) To wrap up the talk, I'll summarise my learnings and share a bullet-point list of the ones that are more likely to be helpful to others. false https://pretalx.com/pyconde-pydata-2024/talk/ZMC9FU/ https://pretalx.com/pyconde-pydata-2024/talk/ZMC9FU/feedback/ B05-B06 Python Monorepos: The Polylith Developer Experience Talk 2024-04-24T14:45:00+02:00 14:45 00:30 What if writing software could be more like building with LEGO bricks? A more playful and productive developer experience. For me, that is all about writing code without the hassle. A productive setup should also let us make design decisions while learning what to actually build, and allow changes along the way. Polylith solves this in a nice and simple way. I am the developer of the Open Source Python-specific tooling for Polylith. 
I’ll walk through the simple Architecture & the Developer friendly tooling for a joyful Python Experience. pyconde-pydata-2024-40987-python-monorepos-the-polylith-developer-experience PyCon: Programming & Software Engineering David Vujic en If you haven’t heard about Polylith before: it has a really simple take on Software Architecture - with tooling support. Polylith is based on small building blocks, very much like LEGO bricks. In fact, the Polylith Architecture originates from the Clojure community and is well suited for functional programming. It is a fresh take on how to share & reuse code, by using monorepos in a very developer-friendly way. And we have that in Python! I am the developer of the Open Source Python-specific tooling for Polylith. I’ll walk through the simple architecture & developer-friendly tooling for a joyful Python Experience. false https://pretalx.com/pyconde-pydata-2024/talk/VEACZM/ https://pretalx.com/pyconde-pydata-2024/talk/VEACZM/feedback/ B05-B06 Marketing Media Mix Models with Python & PyMC: a Case Study Talk 2024-04-24T15:20:00+02:00 15:20 00:30 In today's digital landscape, traditional analytics struggle with understanding marketing ROI, especially with evolving privacy norms. But Python and its ecosystem come to the rescue. In this talk, we will discuss how we leveraged Python and PyMC to build a Bayesian Marketing Media Mix model for the fastest-growing Italian tour operator. We'll cover the challenges we faced, the valuable insights we gained, and the results achieved. This will offer you a clear and practical roadmap for developing a similar model for your business. pyconde-pydata-2024-41707-marketing-media-mix-models-with-python-pymc-a-case-study General: Industry & Academia Use-Cases Emanuele Fabbiani en Understanding the effectiveness of various marketing channels is crucial to maximise the return on investment (ROI). 
However, the limitation of third-party cookies and an ever-growing focus on privacy make it difficult to rely on basic analytics. This talk discusses a pioneering project where a Bayesian model was employed to assess the marketing media mix effectiveness of WeRoad, the fastest-growing Italian tour operator. The Bayesian approach allows for the incorporation of prior knowledge, seamlessly updating it with new data to provide robust, actionable insights. This project leveraged a Bayesian model to unravel the complex interactions between marketing channels such as online ads, social media, and promotions. We'll dive deep into how the Bayesian model was designed, discussing how we provided the AI system with expert knowledge, and presenting how delays and saturation were modelled. We will also tackle aspects of the technical implementation, discussing how Python, PyMC, and Streamlit provided us with all the tools we needed to develop an effective, efficient, and user-friendly system. Attendees will walk away with: - A simple understanding of the Bayesian approach and why it matters. - Concrete examples of the transformative impact on WeRoad's marketing strategy. - A blueprint to harness predictive models in their business strategies. false https://pretalx.com/pyconde-pydata-2024/talk/D7AEQY/ https://pretalx.com/pyconde-pydata-2024/talk/D7AEQY/feedback/ A1 FlixBus CitySnap: How we use GenAI and not only to collect captivating images for cities and confirm their locations Talk 2024-04-24T10:30:00+02:00 10:30 00:30 Have you ever wondered how travel e-commerce companies gather photos of cities? While I can't speak for everyone, I will demonstrate the innovative approach we are using at Flix. In recent years, text-to-text models like ChatGPT and text-to-image models such as DALL-E 3 have become increasingly integrated into various industries. The main aim of these initiatives is typically to generate text or images. 
In our presentation, we propose a slightly different approach to leveraging these models commercially. Our objective is to gather images for thousands of cities that inspire travel. We utilize ChatGPT to tailor prompts for our business requirements, enabling efficient image retrieval through API queries from free stock image services. Then we apply image-to-text models to confirm the images' locations. Finally, we need to adjust the resolution of images for display across various platforms, such as social media campaigns on Instagram, email marketing, and on our website. To achieve this, we have used an automated cropping service to get images in the required aspect ratios, followed by Lanczos sampling for downscaling the images. This integration of cutting-edge models has resulted in an automated, highly flexible process that aligns with varied business needs. Our approach is cost-efficient; processing several hundred cities amounts to only a few euros, and we have utilized commonly available services, making replication easy for everyone. pyconde-pydata-2024-42881-flixbus-citysnap-how-we-use-genai-and-not-only-to-collect-captivating-images-for-cities-and-confirm-their-locations PyData: Natural Language Processing & Computer Vision Andrei Chernov en Flix's buses serve over 5,000 cities, and to elevate our customers' experience, we aim to collect captivating photos for each city. The task of collecting city photos is not new, but previously, it was predominantly addressed with human resources. However, due to the extensive number of cities and the growing scale of our bus network, manually gathering photos for each city is unfeasible and not scalable. In this talk, we will demonstrate how we built a fully automated end-to-end pipeline to achieve this goal. Our pipeline comprises three main steps. The first step involves collecting city images from free image stock services like Pixabay and Pexels, via API. 
Simple queries by city names yielded poor results as not every image is enticing enough to inspire visits to the city. People often travel to see a city's landmarks, which is why we utilized ChatGPT to gather images of prominent landmarks for each city. The second and most complicated step is to verify that the images accurately represent the targeted cities. Initially, we relied on metadata from the image stock services, such as tags from photographers. However, this information is often not sufficient to validate an image's location. To improve accuracy, we investigated various services. Models like DALLE from OpenAI can predict image locations but currently lack an API for full automation. We found two services from the Google Cloud Platform with APIs suitable for location validation: the Gemini multimodal and the landmark detection service. The third and final step of our pipeline involves adjusting the images to various resolutions for display across different platforms, such as social media campaigns on Instagram, email marketing, and our website. This is achieved by cropping images to the desired aspect ratios using Google Cloud Vision API's smart cropping service, followed by Lanczos sampling for image downscaling, which is available in various open-source Python libraries. Our pipeline is a cost-efficient approach using widely available services, thereby facilitating easy replication. During this presentation, we will share our results across several countries, discuss the most challenging problems we encountered, and offer insights into how this pipeline could be improved with the release of upcoming cutting-edge models. We believe that our case shows how the industry can use Generative AI not only to create a new context, but also to find, analyze and filter publicly available information for different business needs. 
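The Lanczos downscaling mentioned in the third step can be illustrated with a minimal sketch. This is not the pipeline's actual code: it is a pure-Python, one-dimensional illustration of Lanczos resampling on a single row of pixel values, and the function names `lanczos_kernel` and `downscale_1d` as well as the window size `a=3` are assumptions made for this example. In practice one would more likely rely on an image library (Pillow, for instance, offers a Lanczos filter for `Image.resize`).

```python
import math


def lanczos_kernel(x: float, a: int = 3) -> float:
    """Lanczos window: sinc(x) * sinc(x / a) for |x| < a, and 0 outside."""
    if x == 0.0:
        return 1.0
    if abs(x) >= a:
        return 0.0
    px = math.pi * x
    return a * math.sin(px) * math.sin(px / a) / (px * px)


def downscale_1d(samples: list[float], new_len: int, a: int = 3) -> list[float]:
    """Resample one row of pixel values to new_len values (new_len <= len(samples))."""
    old_len = len(samples)
    scale = old_len / new_len      # > 1 when downscaling
    support = a * scale            # widen the kernel so it also acts as a low-pass filter
    out = []
    for i in range(new_len):
        center = (i + 0.5) * scale  # output pixel centre in input coordinates
        lo = max(0, math.floor(center - support))
        hi = min(old_len - 1, math.ceil(center + support))
        total = 0.0
        weight_sum = 0.0
        for j in range(lo, hi + 1):
            w = lanczos_kernel((j + 0.5 - center) / scale, a)
            total += w * samples[j]
            weight_sum += w
        out.append(total / weight_sum)  # normalise so flat regions keep their value
    return out


# Toy example: a uniform row stays uniform after downscaling from 12 to 4 pixels.
row = [5.0] * 12
small = downscale_1d(row, 4)
print(small)
```

A two-dimensional image would apply the same resampling first along rows and then along columns; the weights are normalised per output pixel so that image borders do not darken.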
false https://pretalx.com/pyconde-pydata-2024/talk/ECCJAG/ https://pretalx.com/pyconde-pydata-2024/talk/ECCJAG/feedback/ A1 Public Money, Public Experiment - open source processes in the public administration Talk 2024-04-24T11:05:00+02:00 11:05 00:30 Imagine a data lab in a federal ministry wants to publish python applications - how long could it possibly take? While open code is widely acknowledged as beneficial, the lack of thriving open code platforms from public institutions gets you wondering: a day, a week, months, or even years? When publishing code, a private person, a company or a public institution all face unique circumstances and take different considerations into account. While individuals or companies frequently publish their code and share their experiences, less is known about these processes in public institutions. In our talk we will cover how a data lab, located in a federal ministry would go about this topic. We will share insights into the publishing process, touching upon existing pioneers and the alignment of open source with administrative principles, as well as the hurdles, surprises, and regulatory considerations of our journey. Since we are a newly established unit with the word lab in our name, our talk delves into a unique real-world experiment: How much progress can our data lab make in publishing code within the three months leading up to PyCon DE & PyData Berlin 2024? pyconde-pydata-2024-43017-public-money-public-experiment-open-source-processes-in-the-public-administration General: Others Lisa Reiber en As one of many data labs in the public administration, sharing code and software increases the speed with which technical problems can be solved and reduces overall costs. In the previous months, we started collaborating with other public units to share a python prototype between labs. Now it's time for the next step: as we approach PyCon DE & PyData Berlin 2024, we aim to make code publicly available. 
The presentation will address the following questions: 1. What can the process of publishing code look like in a public administration and where can you get access to code already published? (Spoiler: Check out OpenCoDE) 2. How does open source align with public administration principles? 3. What legal, political, and security requirements shape the process and possibly the code base? Whether we succeed or encounter challenges, this talk serves as an attempt to transparently share our journey and contribute to the broader discourse on the intersection of public administration and open source initiatives. Join us at PyCon DE & PyData Berlin 2024 and stay tuned for a glimpse into the evolving landscape of our code publication. false https://pretalx.com/pyconde-pydata-2024/talk/DEKGYM/ https://pretalx.com/pyconde-pydata-2024/talk/DEKGYM/feedback/ A1 Improve LLM-based Applications with Fallback Mechanisms Talk 2024-04-24T11:40:00+02:00 11:40 00:30 While RAG addresses the common LLM pitfalls, challenges like handling out-of-domain queries still persist. Learn the significance of fallback mechanisms to tackle these issues gracefully, incorporating strategies like web searches and alternative data sources to improve the user experience of your system. In this session, we’ll discover various fallback techniques and practical implementation using Haystack, empowering you to develop resilient LLM-based systems for diverse scenarios without human intervention. pyconde-pydata-2024-41814-improve-llm-based-applications-with-fallback-mechanisms PyData: Generative AI Bilge Yücel en Large Language Model (LLM)-based systems have demonstrated remarkable advancements in various natural language processing (NLP) tasks, particularly through the Retrieval Augmented Generation (RAG) approach. This approach addresses some of the pitfalls associated with LLMs, such as hallucination or issues related to the recentness of its training data. 
However, RAG systems may encounter other challenges in real-world scenarios, including handling out-of-domain queries (e.g., requesting medical advice from a finance app), struggling to generate meaningful answers from retrieved data, or failing to provide any answer at all. To address these situations effectively, it is necessary to implement a fallback mechanism capable of gracefully handling such scenarios. 🧗 This fallback mechanism can incorporate alternative strategies, such as conducting a web search with the same query to retrieve more up-to-date information or utilizing alternative information sources (such as Slack, Notion, Google Drive, etc.) to gather more relevant data and generate a satisfactory or comprehensive response. However, the question arises: how can we determine if the response is inadequate? 🤔 During this session, we will explore various fallback mechanism techniques and ensure that our system can assess the adequacy of a response and improve it if necessary without human intervention. On the practical side, we will use the open source LLM framework Haystack to implement end-to-end RAG systems. By the end of this talk, you will have learned to select the appropriate fallback method for your use case, enabling you to develop more dependable and versatile LLM-based systems and implement them effectively using Haystack. 💪 false https://pretalx.com/pyconde-pydata-2024/talk/QCNXLW/ https://pretalx.com/pyconde-pydata-2024/talk/QCNXLW/feedback/ A1 Is GenAI All You Need to Classify Text? Some Learnings from the Trenches Talk 2024-04-24T13:10:00+02:00 13:10 00:30 In recent times, GenAI has sparked fervent excitement, sometimes touted as the panacea for all natural language processing (NLP) tasks. This presentation explores a practical text classification scenario at Malt, highlighting the practical hurdles encountered when employing GenAI (latency, environmental impact, and budgetary constraints). 
To overcome these obstacles, a smaller, dedicated model emerged as a viable solution. We'll delve into the construction and optimization (quantization, graph optimization) of this multilingual model. Finally, we’ll see how GenAI's unparalleled zero-shot capabilities enable its continuous adaptation. pyconde-pydata-2024-41817-is-genai-all-you-need-to-classify-text-some-learnings-from-the-trenches PyData: Natural Language Processing & Computer Vision Marc Palyart, Kateryna Budzyak en In recent times, GenAI has sparked fervent excitement, sometimes touted as the panacea for all natural language processing (NLP) tasks. This presentation explores a practical text classification scenario at Malt, highlighting first the practical hurdles encountered when employing GenAI (latency, environmental impact, and budgetary constraints). In a second part, we’ll cover how we overcame these obstacles by building a small dedicated model on top of a pre-trained SentenceBERT [1], a model trained on semantic similarity. We'll explain how training a classification network on top of it preserves the original language alignment [2], enabling multilingual generalization. Next, we'll unveil the secret to unlocking even more efficiency: quantization and graph optimization techniques thanks to the ONNX ecosystem [3]. These optimizations further reduce the latency and resource consumption of this dedicated model, so it can be deployed on just a CPU. Finally, we’ll see that GenAI still plays a relevant role in our text classification journey. Its unparalleled zero-shot capabilities allow us to continuously adapt our dedicated model, ensuring it remains relevant amidst an ever-changing product. [1] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. [2] Reimers, N., & Gurevych, I. (2020). Making monolingual sentence embeddings multilingual using knowledge distillation.
[3] https://onnx.ai/onnx/ false https://pretalx.com/pyconde-pydata-2024/talk/CWUQF3/ https://pretalx.com/pyconde-pydata-2024/talk/CWUQF3/feedback/ A1 Mostly Harmless Fixed Effects Regression in Python with PyFixest Talk 2024-04-24T13:45:00+02:00 13:45 00:30 This session introduces PyFixest, an open source Python library inspired by the "fixest" R package. PyFixest implements fast routines for the estimation of regression models with high-dimensional fixed effects, including OLS, IV, and Poisson regression. The library also provides tools for robust inference, including heteroscedasticity-robust and cluster robust standard errors, as well as the wild cluster bootstrap. Additionally, PyFixest implements several routines for difference-in-differences estimation with staggered treatment adoption. PyFixest aims to faithfully replicate the core design principles of "fixest", offering post-estimation inference adjustments, user-friendly syntax for multiple estimations, and efficient post-processing capabilities. By making efficient use of jit-compilation, it is also one of the fastest solutions for regressions with high-dimensional fixed effects. The presentation will cover PyFixest's functionality, design philosophy, and future development prospects. pyconde-pydata-2024-42752-mostly-harmless-fixed-effects-regression-in-python-with-pyfixest PyData: PyData & Scientific Libraries Stack Alexander Fischer en When regression models contain very high-dimensional categorical features, estimation can become cumbersome: inverting a matrix with more than a few hundred rows is no simple task! Fortunately, the problem of estimating models with high-dimensional fixed effects has been effectively solved since at least the 1930s. A range of software packages now implement what is known as the Frisch-Waugh-Lovell Theorem (FWL) for efficient estimation of regression models with high-dimensional fixed effects. 
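The Frisch-Waugh-Lovell idea mentioned above can be checked by hand on a toy problem. The sketch below is not PyFixest's implementation and does not use its API; the data and all function names are made up for illustration, and it handles only one regressor and one categorical fixed effect. It compares the slope from a regression with explicit group dummies against the slope obtained after demeaning both variables within each group.

```python
from collections import defaultdict


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols_with_dummies(x, y, groups):
    """Slope on x from OLS of y on x plus one dummy column per group."""
    levels = sorted(set(groups))
    rows = [[xi] + [1.0 if g == lv else 0.0 for lv in levels]
            for xi, g in zip(x, groups)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)[0]


def demean(values, groups):
    """Subtract each group's mean from its members (within transformation)."""
    total, count = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        total[g] += v
        count[g] += 1
    return [v - total[g] / count[g] for v, g in zip(values, groups)]


def ols_fwl(x, y, groups):
    """Slope on x after partialling the group dummies out of both x and y."""
    xd, yd = demean(x, groups), demean(y, groups)
    return sum(a * b for a, b in zip(xd, yd)) / sum(a * a for a in xd)


# Toy data (invented for this sketch): three groups with different levels.
groups = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
x = [1.0, 2.0, 4.0, 2.0, 3.0, 7.0, 0.0, 5.0, 6.0]
y = [3.1, 4.9, 9.2, 12.0, 14.1, 21.8, -4.0, 6.1, 7.9]

beta_dummies = ols_with_dummies(x, y, groups)
beta_fwl = ols_fwl(x, y, groups)
print(beta_dummies, beta_fwl)  # the two slopes agree up to rounding error
```

The dummy-variable route builds and solves a system whose size grows with the number of groups, while the demeaning route needs only group means and a single division, which is why the approach scales to high-dimensional fixed effects.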
These packages are available in various programming languages, including Stata, R, Julia, and Python. Among these, the R package fixest particularly stands out. It is not only blazing fast but also offers an innovative and user-friendly post-estimation functionality and syntax. When I started my journey with Python, fixest was the R package I missed the most. In fact, I missed it so much that I began working on PyFixest, a software package that aims to faithfully replicate all of fixest's innovations in Python. In this talk, I will introduce the audience to both fixest and PyFixest and the FWL theorem that underpins these packages. We will explore how PyFixest can be used for analyzing AB Tests and for conducting event studies with staggered rollouts. For more information: - PyFixest GitHub repository: https://github.com/s3alfisc/pyfixest - Introduction to PyFixest: https://aeturrell.github.io/coding-for-economists/econmt-regression.html#regression-basics - PyFixest Documentation: https://s3alfisc.github.io/pyfixest/ false https://pretalx.com/pyconde-pydata-2024/talk/UXQJTF/ https://pretalx.com/pyconde-pydata-2024/talk/UXQJTF/feedback/ A1 Can ChatGPT convince you to get a COVID19 vaccine? Comparing ChatGPT to an expert system - which one is more convincing? Talk 2024-04-24T14:45:00+02:00 14:45 00:30 This study explores the efficacy of chatbots as dialogical argumentation systems for behaviour change, focusing on vaccine hesitancy during the COVID-19 pandemic. A Python-based chatbot, developed in 2021, engaged in argumentative dialogues with users reluctant to get vaccinated, resulting in a 20% positive change in participants' stances. As natural language processing technologies, like ChatGPT, advance, it is crucial to compare them to traditional expert systems. Prior studies have shown ChatGPT's reliability in addressing vaccine hesitancy. This research compares our chatbot with ChatGPT, evaluating persuasiveness through crowdsourced participants. 
The findings inform resource allocation decisions, guiding the choice between domain-specific expert systems and enhancing versatile models like ChatGPT. Understanding comparative strengths aids in preventing the dissemination of misinformation in behaviour change contexts. pyconde-pydata-2024-42950-can-chatgpt-convince-you-to-get-a-covid19-vaccine-comparing-chatgpt-to-an-expert-system-which-one-is-more-convincing- PyData: Natural Language Processing & Computer Vision Dr. Lisa Andreevna Chalaguine en Chatbots have the potential of being used as dialogical argumentation systems for behaviour change applications. They thereby offer a cost-effective and scalable alternative to in-person consultations with health professionals that users could engage in from the comfort of their own home. During events like the global COVID-19 pandemic, it is even more important than usual that people are well informed and make conscious decisions that benefit themselves. Getting a COVID-19 vaccine is a prime example of a behaviour that benefits the individual, as well as society as a whole. In 2021, prior to the release of ChatGPT, we presented a chatbot (developed in Python using scikit learn and flask) that engaged in dialogues with users who did not want to get vaccinated, with the goal to persuade them to change their stance and get a vaccine. The chatbot was equipped with a small repository of arguments that it used to counter user arguments which were presented in free-text by the user on why they were reluctant to get a vaccine. We evaluated our chatbot in a study with participants and found that 20% of the participants had a positive change in stance (e.g. changing their stance from "unlikely to get a vaccine" to "neutral" or "likely to get a vaccine" after chatting with the chatbot). 
The rapid advancements in natural language processing and the release of technologies such as ChatGPT raise the need to compare them to traditional expert systems in order to (1) identify potential problems in the new technologies and (2) assess whether they can replace traditional expert systems. Several studies have already used ChatGPT to address vaccine hesitancy and to tackle vaccine myths and concluded that ChatGPT is indeed a reliable source of non-technical information for the public. We were therefore interested in comparing our system to ChatGPT by simulating the conversations participants had with our chatbot using ChatGPT and evaluating which conversations were considered more convincing by crowdsourced participants who are not domain experts. Research like this helps us understand whether we need to continue investing resources into domain-specific expert systems or rather invest them into improving ChatGPT and making it more reliable and credible to avoid spreading misinformation. false https://pretalx.com/pyconde-pydata-2024/talk/BJUQ9E/ https://pretalx.com/pyconde-pydata-2024/talk/BJUQ9E/feedback/ A1 The Struggles We Skipped: Data Engineering for the TikTok Generation Talk 2024-04-24T15:20:00+02:00 15:20 00:30 In a world increasingly embracing Python, plug-and-play solutions and AI-generated code, our generation growing up with these advancements may not fully grasp the challenges faced by our predecessors. Meanwhile, data engineering, traditionally known for its complexity, can now transition into the plug-and-play realm too, thanks to Python libraries such as dlt. Aimed to be both fun and insightful, this talk will educate the listener on the concepts of data engineering our generation finds most important and enable them to use high level abstractions to automate most of what used to be highly manual work.
Juniors will gain an appreciation for the difficulties in data pipeline engineering; seniors will get a straightforward solution to expedite the creation of robust pipelines. From the perspective of junior data engineers such as us, the talk will walk through the challenges associated with constructing a data pipeline and demonstrate how these can be effectively addressed using Python libraries such as dlt that simplify the intricacies of data extraction, transformation, and loading. pyconde-pydata-2024-42851-the-struggles-we-skipped-data-engineering-for-the-tiktok-generation PyData: Data Handling & Engineering Anuun, Hiba Jamal en A tale of two junior data engineers. Our generation of developers might have it “easy” due to there being a plethora of tools available to automate and plug and play everything. However, this abundance poses challenges in breaking into a field. This talk explores the perspectives of two junior data engineers, one entirely new to data and the other with a data science background, both navigating the complexities of data engineering. The first is a data scientist who had to navigate her tasks without the luxury of well-formatted data. This journey inadvertently led to a gradual familiarity with complex tools like Spark, and the necessity of understanding various connectors and writing detailed code for data extraction and normalization. With the introduction of dlt, a significant shift occurred. This technology automated many of the tedious processes, allowing analysts to focus more on analytics, and less on tedious data handling. The second one, never having had to deal with the chaos of unstructured data, was directly introduced to dlt. Spared from the typical struggles faced by traditional data engineers, she's set to find out what happens behind dlt’s automation throughout the talk.
After realizing that the two lines of Python code she wrote saved her from the manual tasks of data normalization, structuring, and loading, she will gain an appreciation for the tools at her disposal, especially dlt. dlt, or data load tool is an open-source python library for data teams of all sizes. It can extract a range of data formats from various sources, then normalizes that unstructured data into a relational structure and loads it into the destination of your choice. All of this is done within a few lines of Python code, as compared to the usage of different tools that were needed to get these tasks done. It is a valuable and cost effective addition to a company’s data stack. The talk will follow a step-by-step, linear narrative to outline the challenges of building a data pipeline and illustrate how dlt can resolve these issues, thereby automating the process. Beginning with schema inference and evolution, then progressing to dependency handling and data governance, each challenge will be portrayed as a quest on the journey to constructing a well-defined data pipeline. As junior data engineers, we would like to emphasize the paradigm shift in data engineering towards a greater level of abstraction. This shift, enabled by tools such as dlt's declarative incremental loading, empowers junior engineers to tackle tasks that traditionally would not be considered junior-level work. false https://pretalx.com/pyconde-pydata-2024/talk/DWGV7W/ https://pretalx.com/pyconde-pydata-2024/talk/DWGV7W/feedback/ A03-A04 Lose your fear of equations! Tutorial 2024-04-24T10:30:00+02:00 10:30 01:30 The skill of quickly judging what a formula does and how changing a parameter will affect the result is crucial when dealing with real-life data science - but it's a skill not easily acquired if you don't come from a STEM background. In this tutorial we'll work on guesstimating what complex mathematical expressions do so that you, too, can lose your fear of math! 
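The abstract above is about judging how changing a parameter affects a formula's result. As a small sketch of that exercise, and not taken from the tutorial material, the code below uses the logistic function as our own choice of example; the parameter names `k`, `x0` and `L` are the conventional ones for steepness, midpoint and maximum value.

```python
import math


def logistic(x: float, k: float = 1.0, x0: float = 0.0, L: float = 1.0) -> float:
    """General logistic curve L / (1 + exp(-k * (x - x0)))."""
    return L / (1.0 + math.exp(-k * (x - x0)))


def slope(x: float, k: float, x0: float = 0.0, L: float = 1.0, h: float = 1e-5) -> float:
    """Central-difference estimate of the curve's slope at x."""
    return (logistic(x + h, k, x0, L) - logistic(x - h, k, x0, L)) / (2 * h)


# Vary the steepness k and watch three things: the value at the midpoint,
# the value one unit to the right of it, and the slope at the midpoint.
for k in (0.5, 1.0, 2.0, 4.0):
    print(
        f"k={k:<4} f(x0)={logistic(0.0, k):.3f} "
        f"f(x0+1)={logistic(1.0, k):.3f} slope(x0)={slope(0.0, k):.3f}"
    )
```

Reading the printed rows side by side shows the pattern one would want to guess from the formula alone: at `x = x0` the exponent is zero, so the value is always `L / 2` whatever `k` is, while the slope at the midpoint grows in proportion to `k` (analytically it is `k * L / 4`).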
pyconde-pydata-2024-41746-lose-your-fear-of-equations- PyData: Data Handling & Engineering Darina Goldin en If you transitioned into data science from "soft" sciences, you've already had a steep learning curve. Coding, data engineering, statistics... There is a lot to catch up on. And while there are plenty of true black box models in machine learning, just as many can and should be described in mathematical terms. This tutorial is for everyone who is scared by formulae. We will learn how to quickly recognize which part of an equation matters and how changing individual parameters will affect it. We will make differential equations less scary and get a "feel" for the logistic function that goes beyond running logistic regression in scikit-learn. false https://pretalx.com/pyconde-pydata-2024/talk/SYJE7B/ https://pretalx.com/pyconde-pydata-2024/talk/SYJE7B/feedback/ A03-A04 A deep dive into the Arrow Columnar format with pyarrow and nanoarrow Tutorial 2024-04-24T13:00:00+02:00 13:00 01:30 Apache Arrow has become a de-facto standard for efficient in-memory columnar data representation. You might have heard about Arrow or may already be using it, but do you understand the format and why it’s so useful? This tutorial will dive deep into the details of the Arrow columnar format, the different types and buffer layouts, and explore those details interactively using the pyarrow and nanoarrow libraries. pyconde-pydata-2024-41838-a-deep-dive-into-the-arrow-columnar-format-with-pyarrow-and-nanoarrow PyData: PyData & Scientific Libraries Stack Joris Van den Bossche, Alenka Frim, Raúl Cumplido en **You can find the material and setup instructions at https://github.com/voltrondata-labs/2024-arrow-format-tutorial/** According to the website, Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing. Nowadays, the Arrow project encompasses many things, including serialization, messaging and database specifications and a variety of language implementations.
But at its core is the Columnar Format: a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. This format is being used (fully or partially) by many libraries that you might know, such as pandas, Polars, DataFusion, DuckDB, cuDF, InfluxDB, and many more. This tutorial will dive into the details of the Columnar format, explore the physical memory layout and the different data types. It will do so with interactive code examples using the pyarrow and nanoarrow libraries, learning how you can create and inspect Arrow data with those libraries. Along the way you will also learn a bit about those two libraries, but the insights about the columnar format itself apply to any project using such data under the hood. false https://pretalx.com/pyconde-pydata-2024/talk/LERYUY/ https://pretalx.com/pyconde-pydata-2024/talk/LERYUY/feedback/ A05-A06 Securing Python: Race Condition Vulnerabilities Tutorial 2024-04-24T10:30:00+02:00 10:30 01:30 This workshop addresses the critical and often underestimated topic of race conditions in Python, with a focus on their security implications. We begin with an overview of race conditions, explaining their nature and the security risks they pose. Participants will engage with small Python applications designed to demonstrate these vulnerabilities. Through hands-on analysis, we identify where and why these race conditions occur. The session progresses to simulate attacks exploiting these weaknesses, highlighting their potential for exploitation. Finally, we explore effective mitigation strategies, emphasizing thread synchronization and safe programming practices. The workshop aims to equip attendees with a deep understanding of race conditions in Python and practical skills to enhance the security and robustness of their code.
pyconde-pydata-2024-41462-securing-python-race-condition-vulnerabilities PyCon: Security Shahriyar Rzayev en We will begin by exploring the fundamentals of race conditions, and understanding how concurrent processes can lead to unpredictable and hazardous outcomes. This segment focuses on the theoretical underpinnings and real-world implications of these conditions in Python applications. Next, the workshop transitions into a more hands-on approach. Participants will be presented with small, intentionally vulnerable Python applications. These applications are designed to showcase various forms of race conditions, providing a practical context for understanding their impact. We will analyze the source code of these applications, identifying the critical sections where race conditions occur and discussing why these vulnerabilities are often overlooked during development. Following the analysis, the workshop shifts to the offensive aspect. We will simulate attacks exploiting these race conditions. This exercise aims to demonstrate the ease with which malicious entities can take advantage of these vulnerabilities, underscoring the importance of addressing them in the development phase. The final segment of the workshop is dedicated to resolution strategies. We will explore various techniques and best practices to mitigate race conditions in Python. This includes implementing thread synchronization mechanisms, such as locks, semaphores, and queues, and adopting safe programming practices that minimize the risk of concurrent execution issues. We'll also discuss how to incorporate these strategies into the software development lifecycle to enhance code quality and maintainability. Throughout the workshop, emphasis will be placed on clean, maintainable, and secure code architecture, aligning with contemporary best practices in Python development. 
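A minimal sketch of the kind of check-then-act race and lock-based fix described above. It is not taken from the workshop material: the `Account` class, the method names and the amounts are hypothetical, and the `time.sleep` call exists only to widen the race window so the problem shows up reliably on most machines.

```python
import threading
import time


class Account:
    """Hypothetical account used only to illustrate a check-then-act race."""

    def __init__(self, balance: int) -> None:
        self.balance = balance
        self._lock = threading.Lock()

    def withdraw_unsafe(self, amount: int) -> bool:
        # The check and the update are separate steps, so another thread
        # can run in between and pass the same check.
        if self.balance >= amount:
            time.sleep(0.01)  # artificially widen the race window
            self.balance -= amount
            return True
        return False

    def withdraw_safe(self, amount: int) -> bool:
        # Holding the lock turns check and update into one critical section.
        with self._lock:
            if self.balance >= amount:
                time.sleep(0.01)
                self.balance -= amount
                return True
            return False


def run_concurrently(method, amount: int, workers: int) -> int:
    """Call method(amount) from several threads; return the success count."""
    results = []
    threads = [
        threading.Thread(target=lambda: results.append(method(amount)))
        for _ in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)


unsafe = Account(100)
unsafe_ok = run_concurrently(unsafe.withdraw_unsafe, 100, workers=5)

safe = Account(100)
safe_ok = run_concurrently(safe.withdraw_safe, 100, workers=5)

print("unsafe:", unsafe_ok, "successful withdrawals, balance", unsafe.balance)
print("safe:  ", safe_ok, "successful withdrawals, balance", safe.balance)
```

With the lock, exactly one withdrawal succeeds and the balance ends at zero. Without it, several threads typically pass the balance check before any of them subtracts, so the account can be overdrawn; the exact count depends on thread scheduling.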
By the end of the session, participants will not only have a thorough understanding of race conditions and their security implications but also possess the knowledge and tools to identify, exploit, and mitigate these vulnerabilities in their Python projects. false https://pretalx.com/pyconde-pydata-2024/talk/A8HJHV/ https://pretalx.com/pyconde-pydata-2024/talk/A8HJHV/feedback/ A05-A06 Django loves strawberries Tutorial 2024-04-24T13:00:00+02:00 13:00 01:30 Explore the dynamic duo of GraphQL Strawberry and Django in an immersive workshop! Discover the seamless integration of Strawberry with Django, mastering type definitions, queries and mutations. Harness the power of Starlette for efficient API development, empowering your projects with this potent blend of cutting-edge technologies. pyconde-pydata-2024-41719-django-loves-strawberries PyCon: Django & Web Arthur Bayr en **Update: Please prepare the workshop as described at https://github.com/Speedy1991/strawberry-workshop** Delve into the world of GraphQL Strawberry and Django in this comprehensive workshop designed to unravel the intricacies of these technologies. Throughout the sessions, participants will navigate the synergy between Strawberry, a GraphQL library for Python, and Django, a robust web framework. The workshop kicks off with an exploration of type definitions, offering insights into creating robust schemas and defining custom types to suit project requirements. Moving beyond the fundamentals, attendees dive into the realm of queries and mutations, mastering the art of fetching data and manipulating it through GraphQL. With Django's ORM seamlessly integrated into Strawberry, participants discover how to effortlessly execute complex queries and mutations. Furthermore, the workshop explores the integration of Starlette, a lightweight ASGI framework, into the mix.
Uncover how Starlette complements Django and Strawberry, enhancing API development with its performance and flexibility. The hands-on approach of this workshop ensures participants grasp each concept thoroughly. Through guided exercises and practical examples, attendees gain confidence in implementing GraphQL APIs using Strawberry and Django, unlocking the potential to build robust and scalable applications. By the workshop's conclusion, participants will have a comprehensive understanding of: - Creating GraphQL schemas using Strawberry and Django - Executing queries and mutations seamlessly within Django applications - Leveraging Starlette for efficient API development alongside Django Whether you're a seasoned developer or new to these technologies, this workshop promises to equip you with the skills needed to harness the combined power of GraphQL Strawberry and Django for your projects' success. false https://pretalx.com/pyconde-pydata-2024/talk/AT9HCG/ https://pretalx.com/pyconde-pydata-2024/talk/AT9HCG/feedback/