2.0 -//Pentabarf//Schedule//EN

PUBLISH TNHMGN@@pretalx.com

-TNHMGN

Keynote - A View From My Window - An Outside Perspective of Open Source Scientific Computing From the Inside en

20240422T101500 20240422T110000 0.04500

Keynote - A View From My Window - An Outside Perspective of Open Source Scientific Computing From the Inside

Twelve years as the Executive Director of NumFOCUS has given me a unique perspective of the open source scientific ecosystem. Building an organization to support project communities has taken me down many roads. Navigating these paths has been rewarding and challenging. We will look at lessons learned as I share my experiences through observations and insights on projects, community leadership, education, and fundraising. NumFOCUS is a nonprofit organization that serves open source scientific computing projects and their communities. Our support programming includes fiscal sponsorship, affiliation services, development grants, educational and DEI initiatives, and collaborative opportunities in open source science. PyData is an educational program of NumFOCUS. PUBLIC CONFIRMED Keynote https://pretalx.com/pyconde-pydata-2024/talk/TNHMGN/ Kuppelsaal Leah Silen PUBLISH 7EC3UY@@pretalx.com

-7EC3UY

PyCon Community Backstage: A Decade of Camaraderie, Growth, and Lessons Learned en

20240422T112500 20240422T121000 0.04500

PyCon Community Backstage: A Decade of Camaraderie, Growth, and Lessons Learned

Through organizing numerous community conferences, both small and large, I've gained invaluable insights into what makes a team and a community function effectively, and equally important, what doesn't. Leadership has been a key learning area for me. Through understanding my strengths and weaknesses, I have grown not just as a community leader but also in my professional career, enhancing how I work and lead. In "PyCon Backstage All Access," I will cover: 1. Organizational Experiences: The nuances of organizing conferences of various scales. 2. Leadership Lessons: Insights into team dynamics - what works and what doesn't in building a great community team. 3. Balancing Ideas and Realities: The driving factors behind enjoyable community conferences. How to listen to others. When to embrace complexity, and when to say no. 4. Handling the Mundane: Strategies for dealing with administrative, tax, and legal aspects. How and where organisations can help. 5. Future Outlook: Strategies for sustaining the European Python Community amidst growing challenges. This includes my reasons for rejoining the EuroPython board to help shape its future beyond being just a conference organizer. My community service "CV": * 2013 local MongoDB meetup * 2014 joined EuroPython * 2015-2020 core EuroPython organizer, 2 years board member * 2017-2018 PyCon DE organizer * 2018-today EuroSciPy organizer * 2018-today PyData Südwest meetup organizer * 2019-2022 PyCon DE & PyData Berlin chair * 2019-today PyData Frankfurt meetup organizer * 2019-today Python Software Verband chair (German Python association) * 2023 PyCon DE & PyData Berlin organizer * 2023 EuroPython board member PUBLIC CONFIRMED Talk (long) https://pretalx.com/pyconde-pydata-2024/talk/7EC3UY/ Kuppelsaal Alexander CS Hendorf PUBLISH CBVTEG@@pretalx.com

-CBVTEG

Streamlining Python Development: A Guide to a Modern Project Setup en

20240422T121500 20240422T124500 0.03000

Streamlining Python Development: A Guide to a Modern Project Setup

In the dynamic world of Python programming, an efficient project setup is key to success. 'Streamlining Python Development: A Guide to a Modern Project Setup' is a presentation tailored specifically for Python beginners, aiming to demystify the process of setting up a Python project with clarity and efficiency. In this session, we'll introduce Hatch, a cutting-edge tool that simplifies project management. We'll delve into the functionalities and benefits of using `pyproject.toml`, a cornerstone in modern Python development for its streamlined approach to project configuration. The talk will also cover effective strategies for organizing your project's directory structure, ensuring a clean and manageable workspace. Understanding the importance of testing, we'll discuss unit testing techniques for enhancing code reliability. Additionally, the presentation will feature mypy for type checking, an essential practice for catching errors early and improving code quality. Finally, we'll explore the use of ruff, a modern linter, to keep your code clean and in line with Python standards. By the end of this presentation, Python beginners will have gained a comprehensive understanding of the tools and methodologies necessary for a modern Python project setup, empowering them to create well-structured, high-quality Python applications. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/CBVTEG/ Kuppelsaal Florian Wilhelm PUBLISH 7LQEJ3@@pretalx.com

-7LQEJ3

You shall not pass! 🧙 Strengthen your python code against attacks. en

20240422T134500 20240422T143000 0.04500

You shall not pass! 🧙 Strengthen your python code against attacks.

This talk will highlight the theoretical concepts on security. We’ll start with a general overview and dive into specifics for Python applications. We will address five main questions: 1. How can we retrieve a password with a Python function? 2. What are the most essential IT Security practices? 3. Where can we find information on current security vulnerabilities? 4. What should we keep in mind to write secure Python code? 5. What are some historical attacks on Python code? What can we learn from them? Listeners will walk away with a general overview of how to approach security issues when building their Python application and make their future code more secure. PUBLIC CONFIRMED Talk (long) https://pretalx.com/pyconde-pydata-2024/talk/7LQEJ3/ Kuppelsaal Antonia Scherz Roman Krafft PUBLISH PRH3QU@@pretalx.com

-PRH3QU

Better safe than sorry: Threat Modeling for Python Developers en

20240422T143500 20240422T150500 0.03000

Better safe than sorry: Threat Modeling for Python Developers

In the ever-evolving landscape of cybersecurity, Python applications play a pivotal role in handling critical data and supporting essential business functions, making them prime targets for malicious actors. As the stakes continue to rise, developers want to prioritize the implementation of security measures to safeguard against potential threats. However, the definition of "secure" remains elusive and often subjective. This does not only cause insecurity of the application, but especially among the people that develop it. This talk explains how to move from "best effort security" to a comprehensive and systematic approach to application security. It introduces the tried and tested method “Threat Modeling” and explains its value in a Python development project. Python developers will gain practical insights to identify, assess, and prioritize security risks systematically. Real-world examples illustrate the impact of effective threat modeling, empowering developers to proactively secure their applications against the threats that are really relevant for them. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/PRH3QU/ Kuppelsaal Clemens Hübner PUBLISH TU9EUQ@@pretalx.com

-TU9EUQ

How to embrace your Leadership role as a Data Nerd (or other creative types) en

20240422T153500 20240422T160500 0.03000

How to embrace your Leadership role as a Data Nerd (or other creative types)

You've been working as a Data person/coder/designer/coach for a while and enjoy the creative task at hand. Investing your time in something meaningful that you're very good at brings you a deep sense of satisfaction, making your job truly enjoyable. As your career advances, you climb the ranks to become a senior professional and at some point, you find yourself taking on a management role. Suddenly, creative time is scarce, pressure is high, your schedule is full of meetings, and you are responsible for projects and a team. A great team, that too often you envy for getting to do the actual hands-on job. Sounds familiar? Or is this step something to better avoid? In this talk, I'll discuss my not-so-smooth transition from a senior position to a leadership role. I'll share lessons learned in my last years as a Head and ultimately, I’ll share my tips on how to not only survive but actually like and thrive in a management role. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/TU9EUQ/ Kuppelsaal Paula Gonzalez Avalos PUBLISH UBNVYW@@pretalx.com

-UBNVYW

When and how to start coding with kids en

20240422T161000 20240422T165500 0.04500

When and how to start coding with kids

Being able to code is becoming a more valuable skill every day. Besides the obvious advantages of being able to code (e.g. better career opportunities), coding teaches important skills like logical reasoning, attention to detail and creativity. But what is the best time to start coding? Are kids even able to learn how to code? And at what age? In this talk I would like to approach these questions from a scientific perspective, discussing the biological backgrounds and giving concrete advice on when and how to start coding with kids. PUBLIC CONFIRMED Talk (long) https://pretalx.com/pyconde-pydata-2024/talk/UBNVYW/ Kuppelsaal Anna-Lena Popkes PUBLISH LMMM7D@@pretalx.com

-LMMM7D

Better search relevance using Learning to Rank at mobile.de en

20240422T112500 20240422T121000 0.04500

Better search relevance using Learning to Rank at mobile.de

At mobile.de, we continuously strive to provide our users with a better, faster and unique search experience. Machine learning and Python play a key role in providing this experience. Every day, millions of people visit mobile.de to find their dream car. The user journey typically starts by entering a search query and later refining it based on their requirements. If the user finds a relevant listing, they contact the seller to purchase the vehicle. Our search engine is responsible for matching users with the right sellers. In this talk, I will talk about: - Introduction - Why search is important - How learning to rank helps - Current challenges with our ranking models - Proposed solution - How we deploy our ranking models (under a strict latency SLA of <30ms) - AB Test results - Key Learnings - How we can improve further PUBLIC CONFIRMED Sponsored Talk (long) https://pretalx.com/pyconde-pydata-2024/talk/LMMM7D/ B09 Manish Saraswat PUBLISH GLXJPC@@pretalx.com

-GLXJPC

Haystack 2.0: the story of a rewrite en

20240422T121500 20240422T124500 0.03000

Haystack 2.0: the story of a rewrite

To rewrite or not to rewrite: it's a major question. Releasing new software versions with breaking changes can be disruptive to a community, but sometimes they are necessary in the long run to move forward. Haystack is a free open source Python LLM framework. It was launched in 2020, before LLMs were cool. In 2023 we decided to undergo a major re-architecture, culminating in the GA release of Haystack 2.0. It wasn't an easy decision. By involving the open source community and some big companies in our design process early on, we are confident we built a more usable, flexible foundation for years to come. In this talk I'll tell you the story of this rewrite and the decisions we made to bring the project forward with the right level of flexibility / composability in the rapidly changing LLM landscape. I won't only show you the new features 2.0 provides, but give you a peek into our future roadmap. You'll walk away with a better understanding of how modern LLM frameworks can help you solve problems for yourself and your users, as well as an enriched understanding of how to think for the long-term when building for an open source community. You’ll see how Haystack's modularity and ease of use make it stand out from other libraries. Demos will make it much clearer and give you some great ideas on how to integrate Haystack in your projects. PUBLIC CONFIRMED Sponsored Talk https://pretalx.com/pyconde-pydata-2024/talk/GLXJPC/ B09 Silvano Cerza PUBLISH GVTJW8@@pretalx.com

-GVTJW8

From idea to production in a day: Leveraging Azure ML and Streamlit to build and user test machine learning ideas quickly en

20240422T134500 20240422T141500 0.03000

From idea to production in a day: Leveraging Azure ML and Streamlit to build and user test machine learning ideas quickly

Experimentation, bringing machine learning ideas in front of users, is essential to innovation. Yet, in our corporate hackathons, our data science team has struggled many times with how to build and deploy user-facing machine learning ideas in just a single day. Over the past 2+ years, we have developed a routine around using Azure Machine Learning, automated machine learning, and Streamlit to build and user test machine learning ideas quickly. The aim of this talk is to pass on practical, technical knowledge to fellow data scientists about how to leverage this stack to achieve high build and user test speeds. During the talk, we will walk through the process of building a computer vision system for identifying trash in images via an app using the open-source TACO dataset (http://tacodataset.org/). Working through a Jupyter notebook, we will load the data into Azure Machine Learning and trigger an automated machine learning run on the data. In this context, we will quickly get to know the training and testing metrics available in Azure ML to evaluate the model. We will then download the machine learning model as a file packaged in the open-source ONNX format (https://onnx.ai/). Using the open-source Python web application framework Streamlit (https://github.com/streamlit/streamlit), we will program an application in which users can upload images and embed the machine learning model in it to identify trash in these images. Using a to-be-published infrastructure-as-code pipeline on Azure DevOps, we will deploy the application to the public internet on the Azure platform. From here, users can test it. The stack and code presented in this talk will enable fellow data scientists to accelerate their data science development, leading to quicker experimentation and, therefore, to faster innovation of products with machine learning at their core. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/GVTJW8/ B09 Florian Roscheck PUBLISH SQUNWS@@pretalx.com

-SQUNWS

Going beyond Parquet's default settings – be surprised what you can get en

20240422T143500 20240422T150500 0.03000

Going beyond Parquet's default settings – be surprised what you can get

In the last decade, Apache Parquet has become the standard format to store tabular data on disk regardless of the technology stack used. This is due to its read/write performance, efficient compression technology, interoperability and especially outstanding performance with the default settings. While these default settings and access patterns already provide decent performance, by understanding the format in more detail and using recent developments, one can get much better performance, smaller files, and utilise Parquet's newer partial reading features to read even smaller subsets of a file for a given query. This talk aims to provide insight into the Parquet format and its recent development that are useful for end users' daily workflows. One only needs prior knowledge to know what a DataFrame/tabular data is. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/SQUNWS/ B09 Uwe L. Korn PUBLISH XNY3HX@@pretalx.com

-XNY3HX

Bridging the Gap: From Analytical Models to Operational Success en

20240422T153500 20240422T160500 0.03000

Bridging the Gap: From Analytical Models to Operational Success

Deploying machine learning models in production carries its own unique set of challenges. Some challenges stem from different, and sometimes conflicting, objectives between analytics and production. Others arise from technological limitations, business requirements, and even regulatory needs. In this talk, we will focus on the part of the problem surrounding the handover of models from analytics to production. This process has multiple facets, with tasks executed at different points in time and with different degrees of automation possible. To name a few: model packaging, inference reproducibility, establishing what needs to be deployed, and deployment-related actions. We'll share some of our experiences and strategies to tackle these challenges. For example, how we tackle the topic of contracts, interfaces, and responsibilities between modeling and production. Or how the role of automation in the pre-deployment process ensures a smooth and efficient model transition from an analytics model store to something ready for production once a model is approved. Whether you are a data scientist developing models, an operations specialist tasked with deploying them, or a product/project owner supervising the process, we aim to ignite engaging and fruitful discussions. For data scientists, to have a window into what happens after they are done with training a model. For operations specialists, to gain some strategies to improve their experience and success rate. And for a product owner, to get a framework on how to drive alignment. PUBLIC CONFIRMED Sponsored Talk https://pretalx.com/pyconde-pydata-2024/talk/XNY3HX/ B09 Ignacio Vergara Nick Harmening PUBLISH YYKJMP@@pretalx.com

-YYKJMP

Documenting R&D Progress using jupyter-book - and feel safe for the next performance audit en

20240422T161000 20240422T164000 0.03000

Documenting R&D Progress using jupyter-book - and feel safe for the next performance audit

Rosenxt has been founded to offer experience and excellence gathered in the last decades for the most challenging environments in the future, such as subsea, industrial, renewables, or the integrity of water and energy supply. Highly motivated, we can hardly wait to try out the next idea to make rapid progress. But we are also aware of the rules of business. At the end there is always the performance audit. This is where you have to prove that you can really deliver what you have promised. And to do this, you better have everything well documented. At our venture we have chosen a jupyter-book based workflow. Here come the Jupyter Notebook based steps for data analysis we're using anyways along with some simple markdown based documents embracing everything. Using a clever file system structure and a few tools, we create appealing documents that document the development progress very well. In this talk, I would like to present this workflow in more detail using the tests with a specific water pressure sensor that we are currently evaluating. PUBLIC CONFIRMED Sponsored Talk https://pretalx.com/pyconde-pydata-2024/talk/YYKJMP/ B09 Jens Nie PUBLISH RBNJRK@@pretalx.com

-RBNJRK

Select ML from Databases en

20240422T112500 20240422T121000 0.04500

Select ML from Databases

Developing machine learning models involves the use of data to identify patterns that would help solve business problems. Over the years as the scale of data increased, data started to get stored in databases. The model-building workflows would typically fetch the data from the databases, perform some transformations to create features, and use them to train the models. In some cases, these features would get stored in databases known as feature stores for reuse. To infer the model output in real-time, typically, there would be a small service or an API endpoint that would be deployed to get the results to the consumers. As these use cases became more common, modern databases started incorporating features that aid in building machine learning models. This talk covers some of the features provided by some of these databases, including common models such as linear regression, image classification, and text processing, support for functions with custom models, etc. Apart from these features, many of them also make it easy to deploy the model without needing an external service for the inference. Instead, they provide native interfaces for inference, such as querying in SQL-like languages. This talk includes an example of how to build your custom model in Python and then include it inside your Couchbase database, making inference a matter of using database queries. The example would help to understand some of the capabilities of modern databases in building machine learning models. PUBLIC CONFIRMED Talk (long) https://pretalx.com/pyconde-pydata-2024/talk/RBNJRK/ B07-B08 Gregor Bauer PUBLISH WNHAG8@@pretalx.com

-WNHAG8

Data valuation for machine learning en

20240422T121500 20240422T124500 0.03000

Data valuation for machine learning

The core idea of so-called data-centric machine learning is that any effort spent on improving the quality of the data used to train a model is probably better spent than on improving the model itself. This tested rule of thumb is particularly relevant for applications where data is scarce, expensive to acquire or difficult to annotate. Concepts of the usefulness of a datum or its influence on the outcome of a prediction have a long history in statistics and ML, in particular through the notion of the influence function. However, it has only been recently that rigorous and practical notions of value for data, and in particular data-sets, have appeared in the ML literature. The core idea is to look at data points known to be “useful” in some sense — for instance in that they substantially contribute to the final performance of a model — and focus acquisition or labelling efforts around similar ones, while eliminating or “cleaning” the less useful ones. In a nutshell, data valuation for machine learning is the task of assigning a scalar to each element of a training set which reflects its contribution to the final performance of some model trained on it. This can be used to repair or prune corrupt or superfluous data, or for data collection, like active learning strategies when labelling is expensive. While many exact methods have exponential time complexity in the size of the training set, recent advances provide either good approximation strategies or introduce alternative approaches which are starting to make this field relevant in practice. In this context, [pyDVL](https://pydvl.org) is an LGPL library aiming to provide robust, parallel implementations of every relevant method for simple usage in applications and research. In this talk we showcase how it can be used to detect issues in data pipelines and to improve final performance. 
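As a rough illustration of the definition above (one scalar per training point reflecting its contribution to model performance), here is a minimal leave-one-out sketch. It is not pyDVL's API and not necessarily one of the methods shown in the talk: the 1-nearest-neighbour model, the `utility` and `leave_one_out_values` helpers, and the toy data are all assumptions chosen to keep the example dependency-free.

```python
def utility(train, valid):
    """Validation accuracy of a 1-nearest-neighbour classifier fit on `train`.

    Hypothetical utility function: any model and score could be used instead.
    """
    if not train:
        return 0.0
    correct = 0
    for x, y in valid:
        # Predict the label of the closest training point.
        nearest = min(train, key=lambda p: abs(p[0] - x))
        correct += nearest[1] == y
    return correct / len(valid)


def leave_one_out_values(train, valid):
    """Value of point i = utility with all points minus utility without point i."""
    full = utility(train, valid)
    return [full - utility(train[:i] + train[i + 1:], valid)
            for i in range(len(train))]


# Toy 1-D data as (feature, label); the point (1.2, 1) is deliberately mislabelled.
train = [(0.0, 0), (1.0, 0), (1.2, 1), (9.0, 1), (10.0, 1)]
valid = [(0.5, 0), (1.3, 0), (9.5, 1)]

values = leave_one_out_values(train, valid)
worst = values.index(min(values))
print(values, "-> lowest-valued point:", train[worst])
```

Points with strongly negative values are candidates for inspection or removal. Leave-one-out needs one retraining per point, and many other exact methods are far more expensive, which is why the approximation strategies mentioned above matter in practice.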
pyDVL is still in early stages of development but already provides over a dozen algorithms, runs in parallel using ray and supports sklearn-compatible interfaces and large pytorch models with out-of-core computation thanks to dask. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/WNHAG8/ B07-B08 Miguel de Benito Delgado Kristof Schröder PUBLISH YWUZW9@@pretalx.com

-YWUZW9

A conceptual and practical introduction to Hilbert Space Gaussian Process (HSGP) approximation methods en

20240422T134500 20240422T143000 0.04500

A conceptual and practical introduction to Hilbert Space Gaussian Process (HSGP) approximation methods

In this talk, we explore a new method to approximate Gaussian processes using spectral analysis methods, known as the Hilbert Space Gaussian process (HSGP) approximation. This technique allows us to use and fit Gaussian processes at scale for concrete applications. We provide a basic introduction to the ideas behind the method and make them tangible by implementing them ourselves using Numpyro. We then present two concrete examples in practice using both Numpyro and PyMC, namely time-varying coefficient regression and time series forecasting. **Idea behind the approximation:** The core of this method relies on the Laplacian's spectral decomposition to approximate kernels' spectral measures as a function of basis functions. The key observation is that the basis functions in the reduced-rank approximation do not depend on the hyperparameters of the covariance function for the Gaussian process. This allows us to speed up the computations tremendously. **References** - Hilbert space methods for reduced-rank Gaussian process regression (https://link.springer.com/article/10.1007/s11222-019-09886-w) - Practical Hilbert space approximate Bayesian Gaussian processes for probabilistic programming (https://link.springer.com/article/10.1007/s11222-022-10167-2) - Example: Hilbert space approximation for Gaussian processes (https://num.pyro.ai/en/stable/examples/hsgp.html) - PyMCon Web Series - Introduction to Hilbert Space GPs in PyMC - Bill Engels (https://www.youtube.com/watch?v=ri5sJAdcYHk) PUBLIC CONFIRMED Talk (long) https://pretalx.com/pyconde-pydata-2024/talk/YWUZW9/ B07-B08 Dr. Juan Orduz PUBLISH 83ZGV3@@pretalx.com

-83ZGV3

Next Stop: Insights! How Streamlit and Snowflake Power Up Data Stories en

20240422T143500 20240422T150500 0.03000

Next Stop: Insights! How Streamlit and Snowflake Power Up Data Stories

Streamlit is an open-source Python package designed to simplify the creation of data applications featuring interactive data dashboards. Since September 2023, Streamlit has been integrated into Snowflake, offering several benefits, including the ability for developers to securely build, deploy, and share Streamlit apps within Snowflake's data cloud, making use of the scale, performance and security of the Snowflake platform. This talk provides an introduction to Streamlit and showcases its integration into Snowflake. After this talk you will gain: - an introduction to how Streamlit can be used within Snowflake - practical insights into the creation of a data story based on a Deutsche Bahn open-source dataset on Wi-Fi connectivity in trains - comprehensive understanding of implementing a Streamlit app in Snowflake, illustrated through the developed data story - main takeaways and key insights working with Streamlit in Snowflake This talk is addressed to data enthusiasts who are - often faced with the challenge of presenting profound data insights to diverse audiences - interested in a tool that effortlessly constructs appealing data applications - curious about a direct link between Streamlit and Snowflake PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/83ZGV3/ B07-B08 Marie-Kristin Wirsching PUBLISH NYHFSB@@pretalx.com

-NYHFSB

Machine Learning on microcontrollers using MicroPython and emlearn en

20240422T153500 20240422T160500 0.03000

Machine Learning on microcontrollers using MicroPython and emlearn

Modern Machine Learning makes it possible to automatically extract valuable information from sensor data. While Machine Learning is often associated with costly, compute-intensive systems, it is becoming feasible to deploy ML systems to very small embedded devices and sensors. These devices typically use low-power microcontrollers that cost as little as 1 USD. This niche is often referred to as "TinyML", and is enabling a range of new applications in scientific applications, industry and consumer electronics. While microcontrollers are getting more powerful year by year, it is still important to fit within the limited RAM, program size and CPU time available. emlearn is an open-source Python library that allows converting scikit-learn and Keras models to efficient C code. This makes it easy to deploy models to any microcontroller with a C99 compiler, while keeping the Python-based workflow that is familiar to Machine Learning Engineers. Via emlearn-micropython it also supports MicroPython, a Python implementation designed for microcontrollers. MicroPython runs on practically all microcontrollers with 16kB+ RAM, and this makes it possible to write an entire application for microcontrollers using Python. The emlearn-micropython package is provided as a set of MicroPython modules that can be installed onto a device, without having to recompile any C code. This preserves the ease-of-use that Python developers are used to on a desktop system. Compared to pure-Python approaches, the emlearn-micropython models are typically 10-100x faster and smaller. The models in emlearn support the core Machine Learning task types: classification, regression and anomaly detection. Additionally there are also tools for data preprocessing, feature engineering and estimation of compute requirements. 
Since the start in 2019, emlearn has been used in a wide range of applications, from detection of vehicles in acoustic sensor nodes, to hand gesture recognition based on sEMG data, to real-time malware detection in Android devices. While emlearn and MicroPython can target a very wide range of hardware, we will focus on the Espressif ESP32 family of devices. These are very powerful and affordable, with good WiFi+BLE connectivity support and good open-source toolchains, are very popular among both hobbyists and companies, and have many good ready-to-use hardware development kits. The audience is expected to have a basic literacy in Python and proficiency in programming, and familiarity with core Machine Learning concepts such as supervised/unsupervised learning, classification/regression, etc. Familiarity with microcontrollers and embedded systems is of course an advantage, but the talk should be approachable to those who are new to this area. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/NYHFSB/ B07-B08 Jon Nordby PUBLISH BFF9VA@@pretalx.com

-BFF9VA

Your Model _Probably_ Memorized the Training Data en

20240422T161000 20240422T165500 0.04500

Your Model _Probably_ Memorized the Training Data

In this talk, I will cover: - Proven mathematical research as to why deep learning models memorize information - A series of successful attacks against deep learning models and GPT-models to extract memorized information - The legal and social impact of memorization and using memorized data - Differential privacy as one potential solution (but also its pitfalls when used to train large models) - Federated and/or local- or community-trained models as an alternative - The need for distillation that also attempts to reduce memorization PUBLIC CONFIRMED Talk (long) https://pretalx.com/pyconde-pydata-2024/talk/BFF9VA/ B07-B08 Katharine Jarmul PUBLISH XMKREA@@pretalx.com

-XMKREA

RAG for a medical company: the technical and product challenges en

20240422T112500 20240422T121000 0.04500

RAG for a medical company: the technical and product challenges

RAG works as follows: - An **embedding model** is used to create representations of all documents. These representations are then stored in a **vector database**. - A user poses a question. The same **embedding model** is used to create a representation of this question, enabling the **retrieval** of the most similar documents through a **similarity search**. - These documents are incorporated into a **prompt** along with the question to **generate an answer based on the documents' content**. Many open-source tools, such as Langchain, enable the creation of such pipelines in just [few lines of code](https://python.langchain.com/docs/expression_language/cookbook/retrieval). However, without specific adjustments, such systems often do **not** perform well enough to gain **user adoption**. In this talk, we will cover the challenges and learnings encountered while building a **RAG for the drug documentation of a medical company**. More specifically, we will: - Cover the **basics** of RAGs. - Present the use case we faced and showcase the **resulting product**. - Show how we significantly improved our **retrieval and generation metrics** with techniques such as leveraging **LLMs** to add extra context to the user's question to enhance retrieval accuracy. - Discuss how we designed the product to effectively utilize LLMs while ensuring that doctors are not **misled** by potentially erroneous information, such as **hallucinations**. We achieved this mostly by displaying the sources: while many RAG pipelines cite their sources, we went a step further by **inserting HTMLs** of the sources directly **within** the generated answers, along with **highlighted citations**. - Highlight the tooling aspect of the project, e.g. **[Langsmith](https://www.langchain.com/langsmith) (a logging tool for LLMs)**, allowed us to easily augment our initial dataset and ensure that users were interacting correctly with the product. 
Furthermore, the ability to replay/alter a prompt on the interface allowed the **product owner** to iterate on prompt engineering and assist with technical iterations using their **field knowledge**. PUBLIC CONFIRMED Talk (long) https://pretalx.com/pyconde-pydata-2024/talk/XMKREA/ B05-B06 Noé Achache PUBLISH BYH8Y8@@pretalx.com

-BYH8Y8

Acknowledging Women’s Contributions in the Python Community Through Podcast en

20240422T121500 20240422T124500 0.03000

Acknowledging Women’s Contributions in the Python Community Through Podcast

The Python community has been making efforts to improve diversity and representation among its members. There are success stories such as PyCon US Charlas, PyLadies, Djangonaut, and Django Girls. Yet in the Python podcast community, women are still underrepresented, making up only 17% of invited guests among the popular podcast series. Being a guest on a podcast is a privilege, and an opportunity to influence the Python community. There are many women and underrepresented group members who have made impactful contributions to the Python community globally, and they deserve recognition and deserve to be heard by the rest of us. Disheartened by the lack of representation of women on Python podcasts, and inspired by others who have shown us how diversity in the community can be improved through intentionality, we decided to start a podcast with the goal of highlighting their voices so that they could receive the recognition they deserve. In this talk, learn about them, and about our podcast series. We’ll also share how you can further help our cause in improving representation and diversity in the Python community. ## Goal To raise awareness of the underrepresentation of certain groups, especially women. To acknowledge the progress made by the Python community and what can be done further to continue the improvement. ## Target Audience Anyone who cares about progress on diversity and inclusion in the Python community. Community leaders who want to be allies. ## Outline ### Diversity in Python community, examples (5 minutes) - PyCon US speakers: from 1% in 2011 to 40% in 2016 - Efforts in improving diversity in the Python community: Charlas, PyLadies, DjangoGirls, Djangonaut ### How are those efforts successful? 
(5 minutes) - Intentionality: starts with recognizing the issue and having a clear intention and goal to improve the situation - Outreach: targeted and direct outreach to underrepresented groups, with explicit invitations asking their members to participate - Opportunity: providing opportunities and tools for women to succeed ### In Podcast (3 minutes) - Since there were no stats, we collected our own data by scraping the three most popular Python podcasts - Collected using Python, Beautiful Soup, and Datasette - Our result shows that among the three podcasts that have been running for years, women made up only 17% of invited guests, whereas the same men appeared repeatedly on the same shows ### Why is this important (5 minutes) - Podcast guests are influential - Women and underrepresented group members deserve to be seen and heard - Representation creates inspiration. Lack of representation = lost opportunity to inspire women to further participate in the community ### 6 months of our podcasts (4 minutes) - Share public reactions and support from our launch - Karolina Ladino: in Colombia, women have to be accompanied by husbands or brothers to come to meetups, otherwise it's not safe for them to come alone. - Joanna Jablonski: making an impact in the Python community through documentation and developer education ### How you can help (3 minutes) - Listen to their stories - Actively promote and boost voices from women and underrepresented group members - Suggest people to interview PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/BYH8Y8/ B05-B06 Cheuk Ting Ho Tereza Iofciu PUBLISH NYFVLM@@pretalx.com

-NYFVLM

The pragmatic Pythonic data engineer en

20240422T134500 20240422T143000 0.04500

The pragmatic Pythonic data engineer

Often, we tend to look at the success of others and try to repeat their **decisions**, expecting the same result. Instead, we must deal with things sensibly and realistically, based on practical rather than just theoretical considerations. **Python** offers a vast **ecosystem** to handle all phases of data engineering. Implementing a **data architecture** can be complex, and many adopt the strategy of following market **guidelines** without the **pragmatism** of understanding their own **reality**; in most cases, this strategy leads to major **architecture** and **performance** problems. As a part of this talk, we will walk through the process of identifying **Pythonic** components of **data analysis**, **data cleaning**, **data ingestion**, **databases**, **file systems**, **serialization formats**, **workflows**, and **pipelines**. As we move through those steps, my main focus is teaching the audience **pragmatic thinking** on incorporating best practices into the **data architecture** process. I will also walk through **strategies** and explain high-level data engineering concepts we can use. PUBLIC CONFIRMED Talk (long) https://pretalx.com/pyconde-pydata-2024/talk/NYFVLM/ B05-B06 Robson Junior PUBLISH 7898PU@@pretalx.com

-7898PU

Whispered Secrets: Building An Open-Source Tool To Live Transcribe & Summarize Conversations en

20240422T143500 20240422T150500 0.03000

Whispered Secrets: Building An Open-Source Tool To Live Transcribe & Summarize Conversations

This light-hearted talk will aim to introduce the audience to the latest trends and possibilities for building GenAI applications using open-source components. Here's why this matters: * Cloud-hosted SaaS tools cannot store highly **sensitive information**. * **Good open-source alternatives exist** for most GenAI tasks; the more people who use them, the more they will thrive. * Commercial tools will solve for common use cases, but developers can build personalized tools that are **highly specialized for their own bespoke needs**. During the course of this talk, we will build a real-time conversation pipeline including transcription, summarization and topic analysis layers. We will use open-source Python libraries, including a Streamlit frontend and a Django API backend. The primary focus is to demonstrate the simplicity of building complex LLM-based applications, specifically tailored for attendees with a basic understanding of Python but who may not have prior experience using LLMs. We'll explore a variety of tools*, the use of Whisper for accurate live transcription, delving into its capabilities and integration with Streamlit. Additionally, we'll discuss LangChain + llama.cpp + Llama-2 for efficient summarization and topic analysis, highlighting their performance on standard hardware like a MacBook Pro. For the web API, Django will be our framework of choice, providing a robust and scalable solution for storing and displaying our conversation transcripts and summaries. We will also demonstrate how additional tools can be easily integrated into our workflow, for example using the Chroma vector database to build a simple semantic search function. Expect plenty of Python code and some fun live demos, with GitHub code provided for attendees to try it at home. 
This demo only covers a small fraction of the immensely versatile capabilities available from the modern open-source AI landscape, but will leave attendees with a sense that building complex LLM-powered applications that solve real-world problems has never been this easy. _* The exact tools presented may be different from those mentioned here, due to the rapidly evolving nature of this landscape. The goal is to ensure that attendees are provided with state-of-the-art content that is fully up-to-date come April 2024._ PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/7898PU/ B05-B06 John Sandall PUBLISH ZKYA9W@@pretalx.com

-ZKYA9W

Everything you need to know about change-point detection en

20240422T153500 20240422T160500 0.03000

Everything you need to know about change-point detection

How do you detect an activity change (e.g. walking to running to biking) from smartwatch data? Or abrupt transitions in paleoclimate records? Or when a server failure occurs, using hardware telemetry sensor data (fan speed, acoustic noise, etc.) and software metrics (CPU, memory, I/O, etc.)? If you work with long time series, you will inevitably have to detect changes in the data-generating model. Change-point detection is a crucial task for such signals. It consists in estimating the timestamps when the underlying signal model changes. First introduced in the 50s to monitor quality changes in industrial processes, this subject has since been extended to numerous contexts, such as sound/speech processing, human activity recognition, DNA analysis, analysis of COVID-19 policies' effects, software and hardware monitoring, etc. Over several decades, this subject has generated an important but heterogeneous body of work. This talk will help data scientists, engineers and researchers navigate this vast literature. We will start by describing the mathematical and algorithmic background behind change-point detection in a high-level and easy-to-understand fashion. Then, we will introduce [ruptures](https://github.com/deepcharles/ruptures), a Python package containing many change-point detection methods, as well as calibration and visualisation routines. Algorithms will be illustrated in a real-world biomedical application. At the end of the talk, the audience will be able to understand when to use change-point detection algorithms and how to calibrate and integrate them in a complex data pipeline. **Time breakdown:** - Introduction and motivations: 5 min - Background on change-point detection: 10 min - Python framework: 5 min - Illustration on a real-world biomedical data pipeline: 10 min - Q&A: 5 min PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/ZKYA9W/ B05-B06 Charles Truong PUBLISH LWWQ9U@@pretalx.com

-LWWQ9U

Using LLMs to Create Knowledge Graphs From a Large Corpus of Parliamentary Debates en

20240422T161000 20240422T165500 0.04500

Using LLMs to Create Knowledge Graphs From a Large Corpus of Parliamentary Debates

In this talk, I will demonstrate the process through which I implemented a solution to create knowledge graphs using LLMs and why this can be powerful. Agenda: - Limitations of LLMs and RAG for specific tasks - Knowledge graph (KG) basics - Creating KGs using LLMs - Dataset and use-case: official parliamentary debates - Practical experience in creating an LLM-based pipeline - Retrieving data using natural language i.e. Text2SQL - Future work PUBLIC CONFIRMED Talk (long) https://pretalx.com/pyconde-pydata-2024/talk/LWWQ9U/ B05-B06 Usman PUBLISH Y3FLEH@@pretalx.com

-Y3FLEH

Best of both worlds - How we built an AI-aided content creation tool for language learning en

20240422T112500 20240422T121000 0.04500

Best of both worlds - How we built an AI-aided content creation tool for language learning

Babbel learners value high-quality content that follows an educational methodology and covers everything a learner needs to become conversational in a foreign language. However, language learning cannot be approached with a one-size-fits-all strategy. Learners have different motivations, interests, goals and learning needs that they want to see addressed throughout their learning path. Relying only on human learning experts to create thousands of tailored learning items to personalize our content is not a scalable solution. Luckily, recent developments in Generative Artificial Intelligence (GenAI) and its high-performing Large Language Models (LLMs) offer great opportunities to leverage artificial intelligence (AI) in the content creation process to enable large-scale personalization of content. Let us take you on our journey of developing an AI-aided content creation tool for language learning which combines the best of both worlds, namely using AI to automate and scale various steps within the content generation process and putting human intelligence (HI) in the loop to make sure that our content meets the expectations of our learners and fits the Babbel way of learning. We will give you an overview of our development process with the help of our cross-functional team and walk you through the different iterations - from initial workflow analysis to leveraging the power of connecting our tool to Babbel’s proprietary data. Additionally, we will demo the current version of the tool and give a quick tour of the different AI features that we already included. We will give an overview of the tech stack used and a quick outlook on what is next in the development pipeline. PUBLIC CONFIRMED Talk (long) https://pretalx.com/pyconde-pydata-2024/talk/Y3FLEH/ A1 Hector Hernandez Lea Petters PUBLISH 8LNYPD@@pretalx.com

-8LNYPD

Power structures. The fair advantage en

20240422T121500 20240422T124500 0.03000

Power structures. The fair advantage

Have you ever been in the following situation? You know for certain that you are technically right. Your project has to be done for the benefit of the company. But you cannot convince your boss, for whatever reason. You are stuck. - This might be the glorious moment of informal structures and networking. You will need to know whom else to talk to. Whom can you trust, and who has the power to convince your boss? The best answer will rarely be found in the formal organizational structure. As developers, we often think in models and charts. We are used to formalizing worded requests into code and structures to solve problems. And we are good at it. But what you cannot fully put into models are humans and human behavior. This is also true for the human interactions inside companies and networks. Organigrams never tell the truth about an organization. Power and influence are more complex than formal structures can describe. In this talk, I want to dive into how human interactions inside companies are at the same time complex, powerful and worth exploring. Disclaimer: This is not a talk about unfair techniques. I will not provide you with dark magic. My goal is to give you the knowledge to play fairly in a complex world. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/8LNYPD/ A1 Anja Kunkel PUBLISH XDQNCR@@pretalx.com

-XDQNCR

Tailored and Trending: Key learnings from 3 years of news recommendations en

20240422T134500 20240422T143000 0.04500

Tailored and Trending: Key learnings from 3 years of news recommendations

#### What is special about news recommendations? - We are used to recommendations from Netflix, Amazon, or TikTok. All of these apps have logged-in users that can be easily tracked. News websites, on the other hand, have a large share of unknown users that can only be tracked via first-party cookies. Therefore, there is much more cold start in the user dimension. In addition to that, movies, products, and funny videos have relatively long lifetimes, whereas news articles are often only relevant for a few hours. This means that recommendation systems have much less time to collect information about what is relevant for whom, and there is a lot of cold start in the item dimension. - Users are more critical of the selection of news articles that is presented to them than of selections of products or movies. News recommendation is not only about finding the most relevant items; it is also about putting items in the right relationship to each other to reflect journalistic considerations and brand values. For example, articles should often be sorted according to the seriousness of the topic, or the topic's relevance for society. Similar articles should be placed next to each other, etc. - The front page plays an outsized role for news websites. Users come here to get an overview of what is happening in the world. Consequently, the data generated by these websites is heavily dominated by effects that originate in the structure and mechanics of the front page. Articles shown at the top of this page with a large image are much more likely to be clicked than an article at the bottom of the page with just a small headline. #### How do news recommendations typically work? - Recommendation engines are often closely associated with collaborative filtering. However, collaborative filtering systems struggle with cold start, which is especially prevalent for news articles and users of media sites. At the same time, there are many simple ways to rank articles. 
Articles can be sorted according to their age, their popularity, or according to how often a user has read articles from the same category before. Based on our experience, most systems deployed in practice use a combination of these principles along with collaborative filtering. Especially for smaller widgets, multi-armed bandit approaches are also popular, where the algorithm just tries different articles and keeps showing those that tend to have the highest CTR. #### What is special about our approach to news recommendations? - One can think of recommendation as a simple click prediction problem. We have one user and many items and want to use features of the user and the items to predict how likely the user is to click. The articles can then be ranked and selected based on these probabilities. Therefore, we are not tied to collaborative filtering algorithms but can use any machine learning algorithm of our choice. - A major feature of our system is to identify articles that are trending. "Most popular" feeds and rankings are widely used, but as an absolute measure, they are heavily influenced by position bias. The articles at the top of the page are most likely to get the most clicks, therefore they will be put at the top of the page again. This cycle continues until the story becomes so uninteresting that it starts to perform worse than other stories in worse positions. In contrast to that, we refer to relative performance as trendingness. If a story performs better than usual for its position, then it is trending. The beauty of this approach is that it makes the performance of articles at the top and at the bottom of the page comparable to each other. You can be 10 percent better or worse than expected in all positions of the page. The ugly part is that numbers at the bottom of the page start to become very small and therefore trendingness becomes very unstable. 
If an article is expected to get 1/100 of a click in a certain time interval, and there is an accidental click on this article, you suddenly have an incredible trending article. Unfortunately, most news pages contain many articles that are clicked with very low probabilities, therefore you have good chances to produce these outliers quite frequently. The art of constructing a good measure of trendingness is in finding a good way to regularize the trendingness to avoid these effects. - Position bias on news media sites is so strong that a classification model that predicts clicks solely using the position of an article as a feature will have an AUC of about 0.8. Consequently, a model trained on clicks will mostly just learn patterns that are correlated with the position. For example, if politics articles tend to be placed higher on the page than sports articles, the model will learn that politics articles generally click better than sports articles. We can avoid this by giving the model information about the position, but then the algorithm mostly picks up position-related patterns that cannot be exploited when choosing which article to put in one specific position. - When training our recommendation algorithm, we overcome the position bias problem by weighting clicks so that they are compared on neutral grounds. First, we determine the click probability of an article based on its position alone. Then we weight clicks and non-clicks according to their relative probability. - A click that was supposed to happen with a probability of 0.1 becomes 1/0.1 - 1 = 9, and a click with a probability of 0.01 becomes 1/0.01 - 1 = 99. A likely click gets a lower weight than an unlikely one. - We also derive information from non-clicks. A non-click with a probability of 0.9 becomes -1/0.9 + 1 = -0.1. If an article is presented in a prominent position, but it is not clicked by the user, this is an expression of disinterest and it can help to feed our algorithm. 
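The click weighting described above can be written down as a small function. This is a minimal sketch based only on the two formulas quoted in the abstract; the function name `weighted_click` and its signature are assumptions made for illustration, not the speaker's implementation.

```python
def weighted_click(clicked: bool, p_click: float) -> float:
    """Weight one impression by how surprising its outcome was for its position.

    p_click is the click probability estimated from the position alone
    (hypothetical input; the abstract does not say how it is estimated).
    """
    if not 0.0 < p_click < 1.0:
        raise ValueError("p_click must be strictly between 0 and 1")
    if clicked:
        # A click expected with probability p is weighted as 1/p - 1.
        return 1.0 / p_click - 1.0
    # A non-click expected with probability q = 1 - p is weighted as -1/q + 1.
    p_no_click = 1.0 - p_click
    return -1.0 / p_no_click + 1.0


# The three cases mentioned in the abstract.
likely_click = weighted_click(True, 0.1)      # 1/0.1 - 1
unlikely_click = weighted_click(True, 0.01)   # 1/0.01 - 1
ignored = weighted_click(False, 0.1)          # -1/0.9 + 1, roughly -0.11
print(likely_click, unlikely_click, ignored)
```

An unlikely click gets a large positive weight, while a non-click in a prominent position gets a small negative one, which is the asymmetry the abstract describes.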
- By turning clicks into weighted clicks, we essentially turn the problem from a classification problem into a regression problem. On average, the weighted clicks are equal across all positions, so that the position bias is eliminated. - One of the features that surprised us the most with its good performance is our "article already seen" feature. For each user and every recommendable article, we keep a counter that measures how often the article was already shown in a prominent position but not clicked by the user. These scores are based on the position-based click probabilities that we also use for the weighted clicks. If an article gets shown in a position with an average CTR of 0.1, the score is 0.1 the next time the article could potentially be recommended to the user. If the article now gets shown in a lower position with a click probability of 0.01, the score increases to 0.11 next time. The model then learns that articles that were shown multiple times in prominent positions before but were not clicked are likely not going to be clicked next time they are shown, either. As a consequence, the page becomes fresher and A/B test results indicate a meaningful uplift compared to a model without this feature. #### What have we learned? - Websites usually track what users do, but not what they do themselves. Our algorithms rely heavily on the fact that we track who saw what and in which position. This gives us the ability to overcome the position bias and significantly improve our algorithms. - We do simple things for complicated reasons. The key advantage of simple statistical models over black-box algorithms is that they are easier to debug. Every time we replace a boosted tree or something similar with a linear model, we realize that it is not acting the way we expected. We can then make the necessary adjustments - for example, by adding well-crafted features that leverage our domain expertise. 
At the end of the process, the linear model becomes better than the black-box model was in the beginning. PUBLIC CONFIRMED Talk (long) https://pretalx.com/pyconde-pydata-2024/talk/XDQNCR/ A1 Dr. Christian Leschinski PUBLISH ZCKQVG@@pretalx.com

-ZCKQVG

A Retrieval Augmented Generation system to query the scikit-learn documentation en

20240422T143500 20240422T150500 0.03000

A Retrieval Augmented Generation system to query the scikit-learn documentation

Currently, the scikit-learn website provides an "exact" search engine based on the tools provided by the Sphinx Python package (i.e., https://www.sphinx-doc.org/). The current search engine is implemented in JavaScript and runs locally using an index built when generating the documentation. This solution has the advantage of being lightweight and does not require any server to handle the query. However, the queries it can handle are limited: since the search is "exact," it is not robust to spelling mistakes, and it is intended for keyword-based searches. As large language models (LLMs) are becoming more popular, we have been interested in experimenting with this technology, knowing that they could address some of the previously stated limitations. As an open-source project, we have limited resources in terms of compute and limited available datasets; therefore, we discarded the option of fine-tuning an LLM and leaned towards retrieval augmented generation (RAG) systems. This talk presents an experimental RAG system developed to query the scikit-learn documentation. As a constraint, we restrict ourselves to an open-source software stack and open-weight models to build our system. The talk is structured as follows: First, we provide some background on RAG systems and the pipeline to follow to implement such a system. Then, we go into detail on the different stages of the RAG pipeline. We provide some insights regarding documentation scraping strategies that we developed by leveraging the `numpydoc` and `sphinx-gallery` parsers. Then, we discuss the solutions that we tested to perform lexical and semantic searches. Finally, we explain how the retrieved context can be fed to the LLM to help generate an answer to the user query. We provide a small demo to compare queries performed on an LLM-only system and on the developed RAG system. 
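The retrieve-then-prompt step outlined above can be sketched roughly as follows. This is a toy illustration and not the code from the talk: a plain bag-of-words cosine score stands in for the lexical and semantic retrievers, and the function names (`retrieve`, `build_prompt`) and the three documentation snippets are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = Counter(tokenize(query))
    scored = sorted(chunks, key=lambda c: cosine(q, Counter(tokenize(c))), reverse=True)
    return scored[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the retrieved context and the question into one LLM prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {query}"
    )


# Hypothetical documentation chunks, paraphrased for illustration only.
CHUNKS = [
    "StandardScaler standardizes features by removing the mean and scaling to unit variance.",
    "RandomForestClassifier fits a number of decision tree classifiers on sub-samples of the dataset.",
    "train_test_split splits arrays into random train and test subsets.",
]

query = "How do I split arrays into train and test subsets?"
top = retrieve(query, CHUNKS, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

In a real system the prompt would then be sent to an open-weight LLM; that call is left out here because it depends on the chosen model and serving stack.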
All the code for the experiment is hosted at the following GitHub repository: https://github.com/glemaitre/sklearn-ragger-duck. Finally, we put into perspective the gains and pains of such an RAG system when it comes to integrating it into an open-source project. Notably, we question the hosting and cost of such systems and compare it with other approaches that could tackle some of the original issues. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/ZCKQVG/ A1 Guillaume Lemaitre PUBLISH G9S3MR@@pretalx.com

-G9S3MR

Moving from Offline to Online Machine Learning with River en

20240422T153500 20240422T160500 0.03000

Moving from Offline to Online Machine Learning with River

The foundations of machine learning were built on offline batch processing techniques for model training and inference. As organisations become more dependent on real-time data, the technological trend for machine learning in production is moving towards adding an online stream processing approach. This has benefits such as lower computational requirements due to being able to incrementally learn from a stream of data points, which enables the continual upgrading of models by adapting to real-time changes in data. This has wide applications in industries such as cyber security, banking, healthcare, IIoT and any industry that involves processing large volumes of high throughput data and adapting predictive capability with real-time data feeds. You’ll leave this talk with an understanding of the differences between offline and online machine learning, how to complement one with the other, and enough streaming concepts and best practices to get started on your online ML journey with River, an open source Python ML library. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/G9S3MR/ A1 Tun Shwe PUBLISH WEVXJS@@pretalx.com

-WEVXJS

Put your RAG to the test: Component-per-component evaluation of our LLM-powered airplane manufacturing assistant en

20240422T161000 20240422T165500 0.04500

Put your RAG to the test: Component-per-component evaluation of our LLM-powered airplane manufacturing assistant

Nowadays, Retrieval Augmented Generation (RAG) architecture has become quite the standard approach for building high-quality document search products or personal assistant applications. Prototyping a RAG application might yield quite convincing results from the very first stages of development, but how do you know if it’s really any good when you move your application from prototype into production? And how do you justify the design choices you make? For example, do you know if long-context models would perform better than short-context models with chunking for long-form documents you have at hand? Or, what difference does it make if you keep your different types of documents in one index or in separate ones? Or, is usage of few-shot learning really worth it for your use case, given that adding examples can increase the cost dramatically compared to zero-shot learning? And of course, how do you know there isn’t a better prompt out there for making the LLM do exactly what you expect it to? At Airbus, we went through this thought process during the development of a RAG-based assistant for creation of assembly manuals - documents which help our colleagues in Manufacturing navigate through the airplane parts construction procedures. For answering these and other questions, we produced an evaluation concept for our Generative AI applications, which relies on different methods and metrics for RAG evaluation end-to-end and testing each of its components separately. In this talk, we will present our evaluation concept, how we implemented it with tools like LangChain and Ragas, what metrics we use and how we conduct our experiments with the help of Google Vertex AI Pipelines. PUBLIC CONFIRMED Talk (long) https://pretalx.com/pyconde-pydata-2024/talk/WEVXJS/ A1 Nataliia Kees PUBLISH BKBNRF@@pretalx.com

-BKBNRF

The Secret Life of Metaclasses en

20240422T112500 20240422T125500 1.03000

The Secret Life of Metaclasses

Class outline: * 10 min.: Intro and Setup * 15 min.: Every time is "runtime": * Function, Classes and Methods are created at runtime * The dual responsibility of `class` * Attribute lookup and method resolution order * The role of `.__dict__` and `.__slots__` * Special methods, giving instances superpowers * 10 min.: Everything is an object: * Functions, methods and classes are also objects * Descriptors, properties and method binding * The two functionalities of `type` * And how to create a class without the `class` keyword * 10 min.: Metaclass is the class of the class: * Calling a class creates an instance, calling a metaclass creates a class * `type` & `object`: class relations * Creating and using metaclasses * 15 min.: What are metaclasses for? * Giving classes special methods * Intercepting class creation * Keyword arguments in class declarations * Preparing the class namespace * The role of the methods: `__call__`, `__new__` & `__init__` * What are metaclasses **not** for * 5 min.: complete debugging walkthrough * class creation * instance creation * instance use * 5 min.: You're unlikely to ever need to create a metaclass * `__init_subclass__` * Class decorators * `__class_getitem__` * Capturing descriptor names and ordering * 5 min.: Examples * 5 min.: conclusion and questions PUBLIC CONFIRMED Tutorial https://pretalx.com/pyconde-pydata-2024/talk/BKBNRF/ A03-A04 Leonardo Rochael Almeida Luciano Ramalho PUBLISH DPGRGW@@pretalx.com

-DPGRGW

Build TikTok's Personalized Real-Time Recommendation System in Python with Hopsworks en

20240422T134500 20240422T151500 1.03000

Build TikTok's Personalized Real-Time Recommendation System in Python with Hopsworks

The real-time recommendations engine in TikTok is so good it has been described as "digital crack" (by Andrej Karpathy, former head of AI at Tesla). It is a retrieval and ranking architecture that uses significant ML infrastructure, including a real-time feature store, a vector database, a model registry, and model serving infrastructure. In this tutorial, we will build the core components of TikTok's Monolith as 3 ML pipelines. First, a stream processing feature pipeline that takes user actions (clicks, swipes, searches) written to Kafka and computes features that are stored in the Hopsworks online store in less than 1 second. Second, we will train a two-tower embedding model to support personalized queries using training data grounded on each user's history/context and the videos they clicked/didn't-click on. Third, we will develop an online inference pipeline that takes a user query, encodes it as an embedding to retrieve candidate videos, then uses an online feature store to enrich the candidates before a ranking model personalizes the order of candidates for the client. We will even develop a simple user interface in Python (Streamlit) to show the whole system working visually. Our real-time machine learning system will consist of 3 Python programs - the feature pipeline, the training pipeline, and the online inference pipeline - and the ML infrastructure they require will be provided by the open-source Hopsworks platform, including a feature store, vector database, model serving, and model registry. PUBLIC CONFIRMED Tutorial https://pretalx.com/pyconde-pydata-2024/talk/DPGRGW/ A03-A04 Jim Dowling PUBLISH CMM8S3@@pretalx.com

-CMM8S3

Refactoring Large Programs en

20240422T153500 20240422T170500 1.03000

Refactoring Large Programs

You can find code and installation instructions for the tutorial at https://github.com/krother/space One of the most challenging tasks in software engineering is cleaning up complex software with 10,000-100,000 lines of code. The problem gets worse if you are taking over legacy code. The fact that the Python language enforces neither strict typing nor encapsulation does not help either. What should you do if throwing away everything and rewriting the program from scratch is not an option? In this tutorial, we will practice refactoring a larger program that is undocumented, unstructured and untested. We will take a messy example program and work through a list of procedures that may help you in your next big refactoring. These include: * review the code * write a minimal test * add type annotations * extract core data structures * separate easily cleanable parts from very bad parts * remove excess dependencies * be very transparent about which features of the code you trust The main takeaway of the tutorial is that large-scale refactoring is possible. Although a large refactoring is difficult and costly, you should learn that it can be approached systematically. You will walk away with ideas of where to start refactoring. You will also develop your awareness of how difficult a complex refactoring is. Looking at a messy codebase realistically is not only important to manage the expectations of clients and stakeholders, it is also important to manage the stress that comes with it. This tutorial addresses people with fluency in basic Python. You should know how a class in Python works and what a unit test is. It helps if you have done simple refactoring (extract variable, extract function) before. I encourage junior developers to attend the tutorial to learn and discuss what a potentially overwhelming situation looks like. The tutorial session is structured in the following way: * 0:00 Interactive Warm-up with the audience: Who is here? 
* 0:05 Download and inspect code * 0:10 Quick code review * 0:20 Refactoring I: create a minimal test * 0:40 Refactoring II: extract data structures * 1:00 Refactoring III: isolate code * 1:20 buffer time and Q & A The messy code and refactoring recipes will be provided to participants through GitHub. PUBLIC CONFIRMED Tutorial https://pretalx.com/pyconde-pydata-2024/talk/CMM8S3/ A03-A04 Dr. Kristian Rother PUBLISH EHJRVF@@pretalx.com

-EHJRVF

No More Raw SQL: SQLAlchemy, ORMs & asyncio en

20240422T112500 20240422T125500 1.03000

No More Raw SQL: SQLAlchemy, ORMs & asyncio

OUTLINE - Introduction [15 min] - What is SQLAlchemy? - Why use SQLAlchemy, and what are its advantages? - Components overview, such as engine, dialect, connection pool, etc. - Initial setup for the hands-on workshop with GitHub Codespaces [5 min] - Run and explore an example service that has database queries with raw SQL - Adding SQLAlchemy to the example service - Set up SQLAlchemy [10 min] - Set up engine & dialect to connect with the DB - Use SQLAlchemy Core to query the DB - Add ORMs [20 min] - What are ORMs? - How to represent a basic table? - Modeling different relationships (e.g., 1-1 and 1-many) between the classes - Using ORMs to query the DB - Convert other queries using SQLAlchemy [5 min] - Improve performance by changing relationship loading techniques [10 min] - Consequences of certain models: Talk about N+1 problem and bidirectional relationships - Work with different loading techniques, such as lazy loading and eager loading - The SQLAlchemy.asyncio extension - Brief description of asyncio [10 min] - Understanding coroutines - Scheduling tasks on the asyncio event loop - A hands-on walkthrough of SQLAlchemy’s asyncio extension [15 min] - Setting up SQLAlchemy in async mode - Performing queries and inserts against the database - Using ORMs in queries using asyncio FORMAT This is an interactive tutorial where we will guide participants through the use of SQLAlchemy and ORMs to interact with a database. Participants will gain an understanding of SQLAlchemy and be well-versed enough to use it in their next project. Participants will be working on a repository via GitHub Codespaces, and they will be building on that throughout the tutorial. The Codespaces dev environment will include all required modules and a Dockerized PostgreSQL database, enabling a seamless setup. 
The repository will have a branch corresponding to each section of the workshop, so participants who have trouble with a step or aren’t able to finish on time can check out the corresponding branch and follow the rest of the workshop from there. We’ll start with an introduction to SQLAlchemy and its advantages. The rest of the tutorial will be hands-on. For each section, we will start by explaining the concept and then give participants time to complete the relevant steps on the example service on their own laptops and ask questions. We expect this to last around 10 minutes per concept. AUDIENCE This tutorial is for Python developers of any level who write applications that interact with databases and want to learn how to leverage a tool like SQLAlchemy to seamlessly interact with their database and manage their data in a Pythonic way. Having a basic understanding of databases and SQL (such as inserting or reading data from a table) is sufficient. Participants should also be familiar with git and have a GitHub account, as we will use GitHub Codespaces to enable easy set-up for Python and the database. However, they do not need any prior knowledge of SQLAlchemy or ORMs, since we will explain that first. For the last part of the tutorial, it would help if attendees have some familiarity with coroutines or asynchronous programming, but it is not required, since we will be explaining these fundamental concepts first. 
Participants will walk out of this tutorial having learned how to: - Use SQLAlchemy for database operations in Python, enhancing the readability and maintainability of the code - Build Python classes (ORMs) that represent the database tables - Experiment with different relationship-loading techniques to improve querying performance - Utilize SQLAlchemy’s asyncio extension to interact with databases asynchronously PUBLIC CONFIRMED Tutorial https://pretalx.com/pyconde-pydata-2024/talk/EHJRVF/ A05-A06 Rhythm Patel Aya Elsayed PUBLISH WPKRCT@@pretalx.com

-WPKRCT

Build an AI Document Inquiry Chat with Offline LLMs en

20240422T134500 20240422T151500 1.03000

Build an AI Document Inquiry Chat with Offline LLMs

The ability to ask natural language questions and get relevant and accurate answers from a large corpus of documents can fundamentally transform organizations and make institutional knowledge accessible. Foundational LLMs like OpenAI’s GPT-4 provide powerful capabilities, but using them directly to answer questions about a collection of documents presents accuracy-related limitations. Retrieval-augmented generation (RAG) is the leading approach to enhancing the capabilities and usability of Large Language Models. In this tutorial, we will learn to use RAG to build document-inquiry chat systems using different commercial and locally running LLMs. The topics we’ll cover include: * **Introduction to RAG**, how it works and interacts with LLMs, and Ragna - a framework for RAG orchestration * Creating a **basic chat function** that uses popular LLMs (like GPT) to answer questions about your documents, using a Python API in Jupyter Notebooks * Optimizing the chat through **experiments with different LLMs**, vector databases, context windows, and more * Running a **local LLM on GPUs** on the provided platform, and comparing its performance to commercial LLMs * Walkthrough of the **REST API for building web-apps** and user interfaces and exploration of the built-in (Panel-based) web application By the end of this tutorial, you will have an understanding of the fundamental components that form a RAG model, and practical knowledge of open source tools that can help you or your organization explore and build your own applications. This tutorial is designed to enable enthusiasts in our community to explore an interesting topic using some beginner-friendly Python libraries. PUBLIC CONFIRMED Tutorial https://pretalx.com/pyconde-pydata-2024/talk/WPKRCT/ A05-A06 Pavithra Eswaramoorthy Philip Meier PUBLISH DSFWRC@@pretalx.com

-DSFWRC

pytest tips and tricks for a better testsuite en

20240422T153500 20240422T170500 1.03000

pytest tips and tricks for a better testsuite

We'll cover things like: - Recommended pytest settings for more strictness - What's xfail and why is it useful? - How to mark an entire test file or single parameters - Ways to deal with parametrize IDs and syntax - Useful built-in pytest fixtures - Caching for fixtures - Using fixtures implicitly - Advanced fixture and parametrization topics - How to customize fixture behavior based on markers or custom CLI arguments - Patching, mocking, and alternatives - Various useful plugins, and how to write your own - Short intro to property-based testing with Hypothesis PUBLIC CONFIRMED Tutorial https://pretalx.com/pyconde-pydata-2024/talk/DSFWRC/ A05-A06 Florian Bruhin PUBLISH HKFN8J@@pretalx.com

-HKFN8J

Keynote - Safe Space or Trap? Creating Software like DuckDB in Academic Institutions en

20240423T091500 20240423T100000 0.04500

Keynote - Safe Space or Trap? Creating Software like DuckDB in Academic Institutions

DuckDB is an in-process analytical data management system. DuckDB is free and open source and rather popular. It is one of the fastest growing data systems to date, especially in the Python ecosystem. DuckDB was created at Centrum Wiskunde & Informatica (CWI) in Amsterdam, not entirely coincidentally the same place Python was created in. Later on, we founded a commercial company, DuckDB Labs, which now drives development. In my talk, I will discuss DuckDB, its origins, and the unique benefits and challenges of maintaining popular software in an academic setting. PUBLIC CONFIRMED Keynote https://pretalx.com/pyconde-pydata-2024/talk/HKFN8J/ Kuppelsaal Hannes Mühleisen PUBLISH ZFXZHG@@pretalx.com

-ZFXZHG

🌳 The taller the tree, the harder the fall. Determining tree height from space using Deep Learning and very high resolution satellite imagery 🛰️ en

20240423T103000 20240423T110000 0.03000

🌳 The taller the tree, the harder the fall. Determining tree height from space using Deep Learning and very high resolution satellite imagery 🛰️

The risk that a tree poses to line infrastructure (such as power lines) is determined by several factors, chief among them the height of the particular tree. The increasing availability of very high resolution satellite imagery makes it possible to use photogrammetric techniques to extract height information from a set of stereo satellite images. By using satellite imagery we can achieve a scale not possible by manual measurement. We found that classical techniques perform poorly on vegetation, and were handily outperformed by deep learning based techniques implemented in PyTorch. This improvement was not trivial to achieve however, as creating labelled data in sufficient quantity was quite challenging. By increasing the quality of our height predictions we were able to more accurately calculate risk for our customers. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/ZFXZHG/ Kuppelsaal Ferdinand Schenck PUBLISH YHMUCL@@pretalx.com

-YHMUCL

Streamlining Python Development: A Practical Approach to CI/CD with GitHub Actions en

20240423T110500 20240423T113500 0.03000

Streamlining Python Development: A Practical Approach to CI/CD with GitHub Actions

The thing I dislike most when dealing with code is encountering an error message indicating that well-crafted code, written a while ago in a language other than Bash, fails to run on the new system, new laptop, or some other operating system. It's an art to write code with minimal dependencies and maximum portability. The complexity increases in larger projects. This is where Continuous Integration and Continuous Delivery (CI/CD) pipelines prove useful. CI/CD can help you keep the project alive even without you being around. Dependencies can be automatically updated, and the code can be automatically tested and delivered to the end user, be it you or someone else. This talk is about "YAML programming", which will help you write better Python code. The goal of the talk is to equip you with a set of building blocks to construct a CI/CD pipeline with GitHub Actions for your projects. Automating tasks as much as possible is highly beneficial. We'll cover best practices and helpful tools for writing and debugging CI/CD pipelines. Writing YAML is time-consuming and error-prone; my goal is to help you spend less time on it and benefit faster from automation. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/YHMUCL/ Kuppelsaal Artem Kislovskiy PUBLISH JKWBBR@@pretalx.com

-JKWBBR

That’s it?! Dealing with unexpected data problems en

20240423T114000 20240423T121000 0.03000

That’s it?! Dealing with unexpected data problems

And it was such a nice idea! Nearly everybody working with data has felt this sentiment at least once in their career. The promising idea for a cool new data tool meets the reality of lacking data quality or quantity. This talk aims to provide you with some options for what else you can do in this kind of situation instead of giving up and filing the project away for the non-foreseeable future. Drawing on experience from multiple consulting projects, we discuss what is realistically possible and how to make the most out of the limited data you might find yourself confronted with. The talk covers a brief recap of the limitations arising from unexpectedly little and/or unclean data, before moving on to share lessons learned. We are going to discuss how far purely technical solutions might be able to provide fixes to some of the issues, before moving on to consider how domain knowledge can be deployed to compensate for lacking data quality or quantity. Next, this talk addresses under which circumstances it makes sense to keep pursuing your original goal and when it might be better to down-size expectations. The talk concludes by arguing that despite all the problems arising from unexpected data scarcity, potential answers to important business problems can be found in small data settings if the right questions are asked. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/JKWBBR/ Kuppelsaal Simon Pressler PUBLISH 7TEYDQ@@pretalx.com

-7TEYDQ

Keynote - The art and science of tending open source orchards en

20240423T131500 20240423T140000 0.04500

Keynote - The art and science of tending open source orchards

Inessa is building bridges between people, open science, and open source software, advocating for diversification of contribution pathways to open source and supporting its human infrastructure. She is an active contributor to the Python ecosystem (NumPy, Scientific Python, PyOpenSci, SciPy conference, PyCon US Maintainers Summit, PySWFL, PyLadies SoFlo) and broader open source (Contributor Experience Project, CHAOSS). In her role as Open Source Program Manager at OpenTeams, she leads initiatives focused on widening the contributor pipeline and bringing funding to more open source projects. Inessa is perpetually fascinated by incentive design, collaborative intelligence, and jazz. PUBLIC CONFIRMED Keynote https://pretalx.com/pyconde-pydata-2024/talk/7TEYDQ/ Kuppelsaal Inessa Pawson PUBLISH RGWDCN@@pretalx.com

-RGWDCN

Robust Configuration Management with Pydantic's Data Validation en

20240423T141000 20240423T144000 0.03000

Robust Configuration Management with Pydantic's Data Validation

We describe how we moved our configuration management system from a simple unstructured YAML format loaded into dictionaries into a fully formalized, typed, class-based system using [`Pydantic`'s][pydantic] data validation. While simple enough to begin with, we discuss the problems that emerged from the lack of tight specification of our early configuration system: Missing ahead-of-time validation and resulting runtime errors; out-of-sync code and browsable user documentation; incompatible defaults and subtle differences in various separate parsers scattered throughout many microservices; duplicated and brittle fallback logic. Using a strict specification can mitigate these issues by enabling static validation of configuration files, automatic documentation generation, centralized defaults, and flexible data transformation. After discussing various available configuration management systems, we explain the motivation to hand-roll a simple system based on the data validation library [`Pydantic`][pydantic]. Popularized by its usage in [`FastAPI`][fastapi], it has become the de facto standard for data validation in Python. Its deep integration into Python's type annotation system makes it a powerful tool for configuration management. After an introduction to [`Pydantic`][pydantic]'s capabilities and usage, specifically its features tailored to configuration management ([`pydantic.BaseSettings`][basesettings]), we share some tips and tricks encountered while speccing out our configuration file format. Additionally, we share some inspiration on our internal tooling to load and validate configuration, render up-to-date browsable user documentation, and integrate with CI systems, as well as lessons learned for an incremental transition from the loose `dict`-based system to the strictly typed, class-based system powered by [`Pydantic`][pydantic]. 
[pydantic]: https://pydantic.dev/ [fastapi]: https://fastapi.tiangolo.com/ [basesettings]: https://docs.pydantic.dev/latest/api/pydantic_settings/ PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/RGWDCN/ Kuppelsaal Philipp Stephan PUBLISH UG8THG@@pretalx.com

-UG8THG

Unlock the Power of Dev Containers: Build a Consistent Python Development Environment in Seconds! en

20240423T144500 20240423T153000 0.04500

Unlock the Power of Dev Containers: Build a Consistent Python Development Environment in Seconds!

In this talk, we will explore the basic concepts of Dev Containers and demonstrate how they can support your everyday development as a Python programmer, data scientist, or machine learning engineer. With Dev Containers, you can build a consistent development environment in seconds, no matter where you are or what tools you use. And you know what? The Development Container Specification is even open source. Say goodbye to the hassle of setting up your development environment from scratch every time you start a new project! We will start with a basic example and discuss how to set up a consistent Python development environment, including best practices for package management and GPU support. After this talk, you will be able to leverage the advantages of Dev Containers, allowing you to work from anywhere and be ready in seconds. If you're tired of wasting time setting up your development environment and want to unlock the power of Dev Containers, then this talk is a must-attend for you! PUBLIC CONFIRMED Talk (long) https://pretalx.com/pyconde-pydata-2024/talk/UG8THG/ Kuppelsaal Thomas Fraunholz PUBLISH PLJKUH@@pretalx.com

-PLJKUH

Community Conferences under the Hood. Perspectives and Best Practices in Volunteer Organization en

20240423T160000 20240423T170000 1.00000

Community Conferences under the Hood. Perspectives and Best Practices in Volunteer Organization

Through a combination of individual presentations and interactive discussions, the panel will explore the challenges and triumphs of community organization. This session is designed not just for current and aspiring community leaders but for anyone passionate about fostering an inclusive, collaborative tech ecosystem. This panel brings together seasoned community organizers from diverse backgrounds to share their insights, experiences, and best practices in building and nurturing inclusive communities. Join us in this empowering session to discover how you can contribute to a more inclusive, diverse, and vibrant Python community through effective volunteer organization. Together, we can drive positive change and ensure that our communities remain strong, supportive, and forward-moving. PUBLIC CONFIRMED Panel https://pretalx.com/pyconde-pydata-2024/talk/PLJKUH/ Kuppelsaal Alexander CS Hendorf Lais Carvalho Valentina Scipione Florian Wilhelm PUBLISH JRRET3@@pretalx.com

-JRRET3

Build a personalized Bitcoin (BTC) virtual assistant in Python with Hopsworks and LLM function calling en

20240423T103000 20240423T110000 0.03000

Build a personalized Bitcoin (BTC) virtual assistant in Python with Hopsworks and LLM function calling

The ambitious human desire to get rich without effort has been a major driving force behind the popularity of cryptocurrencies like Bitcoin and Ethereum. However, their high volatility makes them too unpredictable, and keeping track of our investment gains and losses over time can be tedious, if not boring. In this talk, we will define the different components necessary to build a personalized Bitcoin (BTC) virtual assistant in Python. The assistant will help you analyze your transaction history, estimate future BTC prices, and calculate the future value of your holdings based on these predictions. It will be powered by LLMs and will make use of a recent technique called Function Calling to recognize the user intent from the conversation history. The ML system will be built in Python, following the best practices of the FTI (feature/training/inference) pipeline architecture, on top of the open-source Hopsworks platform which will provide the necessary ML infrastructure such as a feature store, model serving, and a model registry. PUBLIC CONFIRMED Sponsored Talk https://pretalx.com/pyconde-pydata-2024/talk/JRRET3/ B09 Javier de la Rúa Martínez PUBLISH KXU7Q8@@pretalx.com

-KXU7Q8

Missing Data, Bayesian Imputation and People Analytics with PyMC en

20240423T110500 20240423T113500 0.03000

Missing Data, Bayesian Imputation and People Analytics with PyMC

There is no "agnostic statistics" when approaching the question of missing data. Theory quickly breaks against reality in the context of people analytics. All imputation schemes need to justify their assumptions of "strong-ignorability" or "missing-at-random" reasons for missing data. This is easier and cleaner in a Bayesian setting than in frequentist alternatives. This transparency is important when dealing with HR data. We will demonstrate both full information maximum likelihood (FIML) and Bayesian imputation by chained equation approaches to the imputation of missing data in the context of employee engagement survey data. We will use the probabilistic programming language PyMC to articulate the structures and conditional probabilities around missing data in hierarchical organisations. Non-response bias in engagement survey data often corrupts the overall picture of organisational health, and modelling the non-response bias helps uncover trends in the patterns of missingness. These insights can be used diagnostically to locate the source of problems within the organisation, but we need to be willing to commit to the assumptions that license genuine causal inference. In this way we present the problem of missing data as a gateway to an organisational focus on causal inference problems. Somewhat ironically, the lack of data can actually make the problems of causal inference more concrete for business stakeholders. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/KXU7Q8/ B09 Nathaniel Forde PUBLISH H3X3AX@@pretalx.com

-H3X3AX

Tackling the Cold Start Challenge in Demand Forecasting en

20240423T114000 20240423T121000 0.03000

Tackling the Cold Start Challenge in Demand Forecasting

In this talk, we address the Cold Start problem in Demand Forecasting, focusing on scenarios where historical data is scarce or nonexistent. This constitutes a common situation in practice, such as with the launch of new products in Retail. However, many Time Series and Machine Learning models encounter difficulties in handling this challenge, primarily due to their dependence on a substantial amount of historical data for effective training and prediction. We begin by providing an overview of established techniques used to address the Cold Start problem, including methods like padding, feature engineering, and leveraging item similarities. Additionally, we explore more recent advancements and emerging research, such as Transfer Learning for Time Series. While each technique presents its unique set of trade-offs, the challenge lies in determining the most suitable approach for a given dataset or use case. This aspect is often not widely understood, and our goal is to unravel this complexity by offering practical insights. Furthermore, we introduce a practical framework for systematically evaluating different forecasting strategies within the Cold Start setting, guiding you in selecting the most suitable approach for your datasets and use cases. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/H3X3AX/ B09 Alexander Meier Daria Mokrytska PUBLISH RD9SU8@@pretalx.com

-RD9SU8

Content Recommendation with Graphs: From Basic Walks to Neural Networks en

20240423T141000 20240423T144000 0.03000

Content Recommendation with Graphs: From Basic Walks to Neural Networks

In this talk, we'll explore how the complex problem of content recommendation transforms when viewed through the innovative lens of graph algorithms. Imagine a world where content and users form a bi-partite graph, and the key to unlocking personalized recommendations lies in predicting links and weights within this graph. We'll embark on a journey starting from the foundational graph-based recommender models, where simple graph walks lay the groundwork. As we delve deeper, we'll uncover the potent capabilities of graph embeddings and the transformative impact of Graph Neural Networks. Finally, we'll wrap up with valuable insights on the scenarios where graph-based approaches shine the brightest in solving recommender problems. Whether you're a seasoned data scientist or new to the field of machine learning, this talk will equip you with a fresh perspective on leveraging graphs for sophisticated and effective content recommendation strategies. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/RD9SU8/ B09 Dr. Mirza Klimenta PUBLISH 7J7LEB@@pretalx.com

-7J7LEB

Personalizing Carousel Ranking on Wolt's Discovery Page: A Hierarchical Multi-Armed Bandit Approach en

20240423T144500 20240423T153000 0.04500

Personalizing Carousel Ranking on Wolt's Discovery Page: A Hierarchical Multi-Armed Bandit Approach

Wolt's Discovery page is the main entry point for millions of weekly users seeking to explore new cuisines, order their favorite dish, or replenish their fridge's stock. The Discovery page is a vertical collection of multiple modules (carousels) which can stem from automatic and curated mechanisms. It features restaurants, retail venues, individual items and dishes along with a broad set of banners. Wolt consumers have distinct tastes and preferences - all of which can change over time and vary with context. However, they expect Wolt to show what's relevant to them and to enable discovery, coupled with a frictionless experience. We want to satisfy our users, keep them engaged and grow our customer base around the world. Wolt delivery covers over 130,000 merchants in more than 500 cities across 25 countries, which results in a substantial variety and size of content Wolt has to offer its customers. Ranking the most relevant carousels at the top is a key challenge to solve so that our users find what they want fast. This makes personalizing the Discovery page a key lever. Personalized carousel ranking presents a major recommendation challenge across many different domains like content streaming, ecommerce or quick commerce. In our talk, we present a hierarchical multi-armed bandit (MAB) solution for personalizing the ranking of carousels on Wolt’s Discovery page which is built on top of the Python ecosystem. First, we illustrate the specific challenges of an (almost) everything online delivery platform and our goals for Wolt's Discovery page. Second, we present our MAB approach, which combines a novel hierarchical parameterization of bandits on user, segment, city and country level with classical Thompson Sampling for exploration and exploitation. This approach caters well to the challenge of data sparsity. We also share the offline and online evaluation results of our approach. 
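The combination of a hierarchy with Thompson Sampling described above can be pictured with a small sketch. This is not Wolt's implementation: it assumes a Beta-Bernoulli click model, only two levels (user and city), and a hypothetical `prior_strength` knob that controls how many pseudo-observations a user borrows from the city level. All class and parameter names are illustrative.

```python
import random
from collections import defaultdict


class HierarchicalBetaBandit:
    """Illustrative two-level (city -> user) Beta-Bernoulli Thompson sampler."""

    def __init__(self, arms, prior_strength=5.0, seed=0):
        self.arms = list(arms)
        # Hypothetical knob: pseudo-observations borrowed from the city level.
        self.prior_strength = prior_strength
        self.rng = random.Random(seed)
        # Maps (level, key, arm) -> [clicks, skips].
        self.counts = defaultdict(lambda: [0, 0])

    def posterior(self, user, city, arm):
        """Return (alpha, beta) of the user-level Beta posterior for one arm."""
        city_clicks, city_skips = self.counts[("city", city, arm)]
        # Smoothed city-level click rate, always strictly between 0 and 1.
        city_rate = (city_clicks + 1) / (city_clicks + city_skips + 2)
        user_clicks, user_skips = self.counts[("user", user, arm)]
        alpha = self.prior_strength * city_rate + user_clicks
        beta = self.prior_strength * (1 - city_rate) + user_skips
        return alpha, beta

    def rank(self, user, city):
        """Sample one click rate per arm and sort arms by the sampled value."""
        samples = {
            arm: self.rng.betavariate(*self.posterior(user, city, arm))
            for arm in self.arms
        }
        return sorted(self.arms, key=samples.get, reverse=True)

    def update(self, user, city, arm, clicked):
        """Record feedback on both levels of the hierarchy."""
        index = 0 if clicked else 1
        self.counts[("user", user, arm)][index] += 1
        self.counts[("city", city, arm)][index] += 1
```

A user without any history starts from the city-level behaviour instead of a flat prior, which is one simple way to soften data sparsity; as the user's own feedback accumulates, it outweighs the borrowed pseudo-counts.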
Lastly, we illustrate the architecture that makes this solution resilient, scalable and adaptive. Our architecture is built on top of well-known open source libraries. We’re leveraging MLflow for tracking and lineage, Flyte for ML workflows, Redis for serving features, and Seldon Core for serving user requests online fast and reliably. We will wrap up our talk with our learnings and an outlook for the next steps in our journey towards a personalized, context-aware, and controllable Discovery page. PUBLIC CONFIRMED Talk (long) https://pretalx.com/pyconde-pydata-2024/talk/7J7LEB/ B09 Marcel Kurovski Steffen Klempau PUBLISH CMMJPN@@pretalx.com

-CMMJPN

Time series anomaly detection with a human-in-the-loop en

20240423T160000 20240423T163000 0.03000

Time series anomaly detection with a human-in-the-loop

Starting from a completely unlabelled dataset, unsupervised anomaly detection is performed. Identified anomaly candidates are presented via a web app to domain experts, who can judge whether the identified time series segments are indeed abnormal or are expected behaviour, i.e., false positives generated by the anomaly detection. The domain experts’ feedback is stored to create a partially labelled dataset. The intended benefits of storing the collected labels are: 1) Metrics can be generated to evaluate the performance of the initially unsupervised anomaly detection run. 2) The number of false positives generated by the algorithm, i.e., time series segments that were incorrectly flagged as anomalies, can be reduced via pattern matching. 3) Based on a partially labelled dataset, more domain-specific methods might be applied, such as semi-supervised anomaly detection or time series classification. The framework uses open source tools and all its components, i.e., data pipelines, anomaly detection, web app, are deployed to the cloud. PUBLIC CONFIRMED Sponsored Talk https://pretalx.com/pyconde-pydata-2024/talk/CMMJPN/ B09 Philipp Millet PUBLISH BNFLZB@@pretalx.com

-BNFLZB

Cloud? No Thanks! I’m Gonna Run GenAI on My AI PC en

20240423T163500 20240423T170500 0.03000

Cloud? No Thanks! I’m Gonna Run GenAI on My AI PC

In a world dominated by cloud computing, there's a growing demand for harnessing the power of PCs and edge devices for AI needs. After all, all connected computers taken together have more power than any cloud. Hence, in this talk, we want to introduce the AI PC, a single machine that consists of a CPU, GPU, and NPU (Neural Processing Unit) and can run GenAI in seconds, not hours. Besides the hardware, we will also show the OpenVINO Toolkit, a software solution that helps squeeze as much as possible out of that PC. Join our talk and see for yourself that the AI PC is good for both generative and conventional AI models. The demos we will present are open source, so feel free to try them at home. Let's paint your dreams together! PUBLIC CONFIRMED Sponsored Talk https://pretalx.com/pyconde-pydata-2024/talk/BNFLZB/ B09 Adrian Boguszewski Dmitriy Pastushenkov PUBLISH EMZ7L7@@pretalx.com

-EMZ7L7

Unleashing Confidence in SQL Development through Unit Testing en

20240423T103000 20240423T110000 0.03000

Unleashing Confidence in SQL Development through Unit Testing

The conventional approach to data model development frequently involves a repetitive cycle: crafting a query, executing it, examining a portion of the result, and iterating through the process with each subsequent query modification. This method becomes particularly challenging when dealing with the evolution of mature, extensively used data models, where multiple developers collaborate without sufficient testing. In such scenarios, the iterative nature of this process poses significant risks, potentially leading to overlooked errors and compromised data quality. The talk showcases the tangible benefits of having a well-designed unit testing framework, providing peace of mind to developers working collaboratively on the same model, and enabling the early detection of hard-to-spot errors before deployment. During the development of new data models and during the integration of new data sources, the absence of large amounts of production data makes verification of the model outputs difficult - clearly defined tests for scenarios not yet observed in production play a crucial role in overcoming this hurdle. SQL unit testing becomes especially relevant when refactoring existing data models and can be very helpful to ensure the logic is unchanged, even for edge cases. I outline the requirements for an effective SQL unit testing framework, emphasizing the use of the database or query engine to verify SQL statement correctness without persisting any data in the database. The presented framework supports the definition of atomic test cases, where each test case consists of minimal input datasets and expected output datasets; the framework verifies that the output of the query, when run on the defined inputs, matches the expected output.
The practical implementation of a SQL unit testing framework will be shared in detail by giving insights into Lotum’s pytest-based SQL unit testing framework and demonstrating how a test case for a SQL statement with mock data can be built effortlessly with minimal code redundancy. Internal workings of the framework will be explained, including the mechanics to define and run a unit test: by injecting mock data into an existing SQL statement, replacing references to production tables with the injected mock data, and executing the resulting fully static statements in the query engine, the framework evaluates the transformed data against expected outputs. This way, the correctness of the query can be verified on a case-by-case basis without manually modifying the query code itself. Attendees will leave the session with a deep understanding of the importance of SQL unit testing, equipped with insights into building an effective framework, defining test cases, and ensuring data model robustness. The talk provides a roadmap for data teams to embrace a test-driven development approach, enhancing code quality, and fostering a culture of confident SQL development. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/EMZ7L7/ B07-B08 Tobias Lampert PUBLISH Z3FALV@@pretalx.com
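To make the mock-injection idea from the SQL unit testing abstract above concrete, here is a minimal sketch using only the Python standard library. It is not Lotum's framework: the helper names (`sql_literal`, `mock_cte`, `inject_mocks`, `run_on_mocks`), the `orders` table, and the use of SQLite as the query engine are all assumptions made for illustration. Mock rows are rendered as a static `VALUES` CTE, the reference to the production table is rewritten to point at that CTE, and the resulting statement runs against an empty in-memory database, so nothing is persisted.

```python
import re
import sqlite3
from contextlib import closing


def sql_literal(value):
    """Render a Python value as a SQL literal (only None, str, int, float)."""
    if value is None:
        return "NULL"
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return repr(value)


def mock_cte(name, columns, rows):
    """Render mock rows as a static CTE; this sketch needs at least one row."""
    values = ", ".join(
        "(" + ", ".join(sql_literal(v) for v in row) + ")" for row in rows
    )
    return f"{name}({', '.join(columns)}) AS (VALUES {values})"


def inject_mocks(query, mocks):
    """Point table references at mock CTEs and prepend those CTEs.

    Simplification: the query must not already start with a WITH clause,
    and table names are matched as plain whole words.
    """
    ctes = []
    for table, (columns, rows) in mocks.items():
        mock_name = f"mock_{table}"
        query = re.sub(rf"\b{re.escape(table)}\b", mock_name, query)
        ctes.append(mock_cte(mock_name, columns, rows))
    return "WITH " + ", ".join(ctes) + " " + query


def run_on_mocks(query, mocks):
    """Execute the fully static statement in an empty in-memory database."""
    with closing(sqlite3.connect(":memory:")) as conn:
        return conn.execute(inject_mocks(query, mocks)).fetchall()


# Hypothetical query under test; note that it is never edited by hand.
QUERY = """
SELECT customer, SUM(amount) AS total
FROM orders
WHERE status = 'paid'
GROUP BY customer
ORDER BY customer
"""

MOCK_ORDERS = {
    "orders": (
        ("customer", "amount", "status"),
        [("ann", 10, "paid"), ("ann", 5, "paid"), ("bob", 7, "refunded")],
    )
}


def test_paid_orders_are_summed_per_customer():
    # pytest-style test case: minimal input rows plus the expected output.
    assert run_on_mocks(QUERY, MOCK_ORDERS) == [("ann", 15)]
```

A real framework would parse the SQL instead of using a regular expression and would handle empty inputs, schemas, and engine-specific types. The sketch only shows why the query text itself never has to be edited by hand.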

-Z3FALV

Green Software Engineering en

20240423T110500 20240423T113500 0.03000

Green Software Engineering

The rapid growth of the digital economy and of software production demands a more sustainable approach in the face of global warming. The entire tech industry contributes to a growing carbon footprint, and we need to address it efficiently. I will focus on the software engineering life cycle and explain how green software engineering can be put into practice across the whole cycle, from requirements engineering to the end product. We will dig deeper into the following topics: • Green Requirement Engineering • Green Architecture and Design • Green Coding • Optimization of Infrastructure • Green Usage of software products Software products should be developed in a way that decreases carbon emissions, increases efficiency and lowers carbon intensity. The choice of programming language should take time, complexity and resource usage into account so that green coding can be incorporated. Participate in electronics recycling programs and move existing infrastructure to services such as the cloud to decrease resource usage. When it comes to green usage of software products, never leave your laptops and systems in sleep mode, as this also increases the carbon footprint. By the end of the talk, attendees will be able to practice some green computing concepts in their everyday life. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/Z3FALV/ B07-B08 Farah PUBLISH 8W7RPP@@pretalx.com

-8W7RPP

Building Professional Voice AI with Vocode en

20240423T114000 20240423T121000 0.03000

Building Professional Voice AI with Vocode

The AI open-source package Vocode (https://github.com/vocodedev/vocode-python) has emerged as a leader in creating AI voice agents since May 2023. These are the interactive voices on the other end of the phone, ready to assist with various tasks. My journey with Vocode began in August while developing a commercial platform that allows for no-code creation of voice agents utilizing Vocode's capabilities. This presentation delves into the intricacies of Vocode. It's not just about voice; it's about crafting an experience. The framework seamlessly integrates external APIs for speech-to-text conversion, Large Language Model (LLM) response generation, and speech synthesis. But the real challenge lies in the nuances of human conversation: teaching the bot to pause when interrupted, not to speak over others, and to recognize the natural end of a conversation. These subtleties are what make interactions with Vocode feel remarkably human. A significant part of this talk will focus on the LLM function-calling feature of Vocode, particularly in real-time tasks like booking appointments. Imagine a scenario where you're speaking to 'Jane', a virtual plumber, to schedule a visit. The interaction feels real, with the bot understanding and responding to changes in appointment preferences, such as switching from a suggested time of "tomorrow at 9 am" to a more suitable slot "next month". This talk aims to share insights and practical knowledge about building and refining AI voice agents, making them more than just voices on a call but rather engaging, interactive entities capable of performing complex tasks with ease and human-like finesse. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/8W7RPP/ B07-B08 Lev Konstantinovskiy PUBLISH RLCLBB@@pretalx.com

-RLCLBB

How to Do Monolingual, Multilingual, and Cross-lingual Text Classification in April, 2024 en

20240423T141000 20240423T144000 0.03000

How to Do Monolingual, Multilingual, and Cross-lingual Text Classification in April, 2024

We will answer three main questions: 1. If I want a text classifier for English texts, which is better: fine-tuning a model or prompting an LLM? And which model should be fine-tuned? 2. If my data is not in English, i.e. not in a resource-rich language, what should I do? Can I utilize LLMs? Do I need to obtain the data somehow? Or can I transfer knowledge from existing English data? 3. If I want a multilingual model for several languages, what is the better choice, again: LLMs or my own model? Which model then? The findings and comparisons will be illustrated on three tasks (toxic speech, formal speech, and fluent speech detection) for two languages: English (as a resource-rich language) and Ukrainian (as a low-resource language in terms of data availability). We will present tests of closed- and open-source models together with fine-tuned open-source models like BERT and RoBERTa. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/RLCLBB/ B07-B08 Daryna Dementieva PUBLISH ZKDEPW@@pretalx.com

-ZKDEPW

Leveraging the Art of Parallel Unit Testing in Django en

20240423T144500 20240423T153000 0.04500

Leveraging the Art of Parallel Unit Testing in Django

Key Points to Address: - Understanding Monolith Challenges: - - Identification of challenges and bottlenecks in traditional unit testing approaches within Django monoliths. - - Analysis of the impact on development velocity and code quality. - Introduction to Parallel Testing: - - Explanation of parallel testing concepts and its application to Django unit testing. - - Benefits of parallelization in terms of speed, efficiency, and resource utilization. - Parallel Testing Tools and Techniques: - - Overview of tools and techniques available for parallelizing unit tests in Django. - - Practical insights into configuring and optimizing test suites for parallel execution. - Real-world Experiences from Major Institutions: - - Case studies from leading institutions sharing their challenges with unit testing in Django monoliths. - - Lessons learned and best practices in implementing parallel testing strategies. - Implementation Guidelines for Django Projects: - - Guidance on implementing parallel unit testing in Django projects, including code examples and configurations. - - Tips for integrating parallel testing seamlessly into existing development workflows. Expected Outcomes: - Insight into challenges specific to Django unit testing within monolithic repositories. - Understanding the principles and benefits of parallel testing. - Practical knowledge of tools and techniques for parallelizing Django unit tests. - Real-world experiences and best practices shared by major institutions. - Actionable guidelines for implementing parallel unit testing in Django projects. Target Audience: This talk is tailored for Django developers, software engineers, and testing professionals seeking to optimize their unit testing practices, especially within the context of monolithic repositories. Conclusion: Join me in this 45-minute session as we navigate through the challenges of unit testing in Django monoliths and explore the art of parallelization. 
By the end, you'll be equipped with the knowledge and tools to transform your Django unit testing workflows, leveraging the lessons learned from major institutions in the industry. PUBLIC CONFIRMED Talk (long) https://pretalx.com/pyconde-pydata-2024/talk/ZKDEPW/ B07-B08 Syed Ansab Waqar Gillani Azan Bin Zahid PUBLISH CY97LS@@pretalx.com

-CY97LS

Analyzing COVID-19 Protest Movements: A Multidimensional Approach Using Geo-Social Media Data en

20240423T160000 20240423T163000 0.03000

Analyzing COVID-19 Protest Movements: A Multidimensional Approach Using Geo-Social Media Data

The talk will walk through the steps undertaken in the analysis of a protest network using Twitter data. It will explain the methods used and present the results as well as the code and libraries used, following (roughly) this outline: 1. Motivation: What was special about the COVID-19 protest movement and why a multi-dimensional view is crucial for understanding it. 2. The Data: The information retrieved using Twitter's API and the necessary pre-processing steps. 3. Spatial Analysis: The statistical means to understand the movement's spatial manifestation, including an explanation of the methods used and a presentation of results. 4. Network Analysis: Mere social network analysis is not enough for understanding protest movements. Including the spatial information allows deeper insights to be drawn by geo-spatially mapping network communities and centralities. 5. Semantic Analysis: Understanding the dominating themes in the protest network with semantic analysis: generating the document embeddings, clustering topics and dealing with a large dataset of tweets. 6. Conclusion: Importance of multi-dimensional analysis and the availability of social media data for studying societally important phenomena. Python libraries that were used (among others): geopandas, networkx, bertopic, lda and friends. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/CY97LS/ B07-B08 Nefta Kanilmaz PUBLISH YPKKQF@@pretalx.com

-YPKKQF

Would you rely on ChatGPT to dial 911? A talk on balancing determinism and probabilism in production machine learning systems en

20240423T163500 20240423T170500 0.03000

Would you rely on ChatGPT to dial 911? A talk on balancing determinism and probabilism in production machine learning systems

Objective and Outline: This talk addresses the often-overlooked need for integrating deterministic and probabilistic models in machine learning, which is crucial in complex production environments. We begin by defining deterministic and probabilistic models, highlighting their distinct roles in ML systems. The talk then showcases practical examples where the synergy of these models enhances system performance, focusing on classification and Generative AI models. Target Audience and Expected Background Knowledge: Intended for ML engineers, data scientists, and academic researchers, this presentation assumes familiarity with basic machine learning concepts and models. It's particularly beneficial for those involved in designing, implementing, or managing ML systems in production environments. Key Takeaways: - Understanding the strengths and limitations of deterministic and probabilistic models in ML. - Strategies for effectively combining these models in various ML systems. - Real-world examples demonstrating the improved robustness and controllability achieved through this integration. - Insights into future trends and potential developments in model integration. Time Breakdown: - Minutes 0-10: Introduction to deterministic and probabilistic models - Minutes 10-20: Synergies of approaches in real-world examples - Minutes 20-30: Applications for Generative AI models, including Q&A Additional Information: No prerequisites are required beyond a basic understanding of machine learning concepts. The presentation will be informative with a focus on practical applications, providing attendees with actionable knowledge and a deeper appreciation of model integration in ML systems. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/YPKKQF/ B07-B08 Nicolas Guenon des Mesnards PUBLISH W7YDRX@@pretalx.com

-W7YDRX

Deploying your Python application to Android en

20240423T103000 20240423T110000 0.03000

Deploying your Python application to Android

Python can be used to create native applications for Android. However, although Python is the most popular programming language, it is not the first choice to create an Android application. This talk gives an overview of developing Android applications with Python by comparing three popular frameworks for GUI development with Python that support Android as a platform – PySide6, Kivy and Flet. This comparison is demonstrated with a simple Contact List application with the ability to add, edit and delete contacts. The overall structure of the talk will be roughly as follows: 1. Why is Android a relevant platform for Python application developers? (6 minutes) In this section, we establish why Android is the most popular OS being used currently. Although Python has supported running applications natively on Android since as far back as 2011, the development of Android applications with Python is not so popular. We will further highlight one of the major concerns of using Python for Android development and how PEP 738 can help simplify this. 2. Current status of Android app development with Python (2 minutes) In this section, we give a brief introduction to some of the Python-based toolkits that support Android as a platform – Kivy, Flet, PySide6, Beeware etc. 3. Contact List application with Kivy (3 minutes) In this section, we look at how the application looks with Kivy and KivyMD, followed by the ease of development and some pros and cons of the framework. 4. Contact List application with PySide6 (5 minutes) The deployment of PySide6 applications to Android uses the same build tool as Kivy, called python-for-android. python-for-android now also supports a Qt backend along with the SDL2 backend that Kivy uses, thus enabling the deployment of PySide6 applications. In this section, we look at how the application looks with PySide6, followed by the ease of development and some pros and cons of the framework. 5. 
Contact List application with Flet (3 minutes) In this section, we look at how the application looks with Flet, followed by the ease of development and some pros and cons of the framework. 6. Python packages support (6 minutes) We see the various Python packages supported by each framework. 7. Conclusion and Questions (5 minutes) Questions from the audience. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/W7YDRX/ B05-B06 Shyamnath Premnadh PUBLISH AQ8HUM@@pretalx.com

-AQ8HUM

Advanced Observability with OpenTelemetry and Python en

20240423T110500 20240423T113500 0.03000

Advanced Observability with OpenTelemetry and Python

With the rise of serverless architectures and cloud technologies, Python has become increasingly popular for building microservices. Yet, as these systems expand, they face observability challenges leading to reduced efficiency and complexities in error tracing. To address these challenges, this presentation introduces OpenTelemetry, an emerging industry standard providing a framework for tracking the performance of not only our Python code but also other system components such as databases or message queues. It integrates seamlessly into Python environments, offering a common way to gather, process, and export telemetry data from various sources of a distributed system. The session will begin by revisiting the concept of observability and its critical importance in distributed systems. We will then introduce OpenTelemetry and go over the fundamentals of its Python SDK. A practical use case will be presented, demonstrating the integration of OpenTelemetry into an existing Python microservice, using both automatic instrumentation mode and manual traces. Finally, we will discuss how to utilize the data collected by OpenTelemetry for system monitoring. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/AQ8HUM/ B05-B06 Anton Caceres PUBLISH C9F9CC@@pretalx.com

-C9F9CC

Boost your app to Flash speed by mastering performance tricks en

20240423T114000 20240423T121000 0.03000

Boost your app to Flash speed by mastering performance tricks

Nowadays, more and more companies are looking for strategies to gain more users for their products, ranging from introducing unique features to optimizing application performance. Additionally, Python is one of the most widely used programming languages, and its community continuously introduces new libraries for enhancing performance and optimizing memory usage. However, can we also accelerate app performance not only by relying on libraries but also by understanding how Python works under the hood? In this talk, we discuss computational operations and memory utilization in Python and how they are connected. Additionally, we will provide visual aids to help you build a mental picture of these concepts. Moreover, we will dive into how the Python interpreter works and how an understanding of bytecode instructions can help you write better code. In the end, we will demonstrate the advantages of best practices by comparing both performance metrics and bytecode instructions. If you're keen to move beyond basic optimizations and truly understand what happens under Python's hood during application execution, this session is for you. Join us to learn how Python works under the hood and to build a mental model of what happens during application execution. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/C9F9CC/ B05-B06 Laysa Uchoa Yuliia Barabash PUBLISH N9DEVW@@pretalx.com
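As a small illustration of the comparison of bytecode instructions and performance metrics mentioned in the abstract above, the following sketch uses the standard-library `dis` and `timeit` modules. The two functions and the helper `opnames` are hypothetical examples, not material from the talk. Exact opcodes and timings differ between CPython versions, so no particular output is claimed.

```python
import dis
import timeit


def squares_loop(n):
    """Build a list with an explicit loop and repeated .append() calls."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_comprehension(n):
    """Build the same list with a list comprehension."""
    return [i * i for i in range(n)]


def opnames(func):
    """Return the opcode names of a function's top-level bytecode.

    On CPython versions before 3.12 a comprehension body lives in a nested
    code object, which dis.get_instructions(func) does not descend into.
    """
    return [ins.opname for ins in dis.get_instructions(func)]


if __name__ == "__main__":
    for func in (squares_loop, squares_comprehension):
        # Time 200 calls that each build a 1000-element list.
        seconds = timeit.timeit(lambda: func(1000), number=200)
        print(f"{func.__name__}: {len(opnames(func))} instructions, "
              f"{seconds:.4f}s for 200 runs")
        dis.dis(func)  # full human-readable disassembly
```

Reading the two disassemblies side by side shows where the loop version performs an attribute lookup and a call on every iteration. This is the kind of detail a bytecode-level view makes visible.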

-N9DEVW

Pandas + Dask DataFrame 2.0 - Comparison to Spark, DuckDB and Polars en

20240423T141000 20240423T144000 0.03000

Pandas + Dask DataFrame 2.0 - Comparison to Spark, DuckDB and Polars

Dask is a library for distributed computing with Python that integrates tightly with pandas and other libraries from the PyData stack. It offers a DataFrame API that wraps pandas and thus offers an easy transition into the big data space. Historically, Dask was the easiest choice to use (it’s just pandas) but struggled to achieve robust performance (there were many ways to accidentally perform poorly). It was great for experts, but bad for novices. Other tools (Spark, DuckDB, Polars) just did this better. Fortunately, these pain points have been fixed with the following features: - A new and vastly improved shuffle algorithm - A logical query planning layer to improve performance and usability - A reduced memory footprint through a more efficient data model due to pandas 2.0 We will look into how these changes work together across pandas, Arrow, and Dask to provide a better UX and a more robust and faster system overall. Additionally, we will look into a comparison of Dask against other tools in the big data space, including Spark, Polars and DuckDB. We will use the TPC-H benchmarks to compare these tools. We will look ahead into what the future will bring for pandas and Dask and how the logical query planning layer can be extended to fit other frameworks like Dask Array and XArray. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/N9DEVW/ B05-B06 Patrick Hoefler Florian Jetter PUBLISH 9PZSBS@@pretalx.com

-9PZSBS

The key to reliability - Testing in the field of ML-Ops en

20240423T144500 20240423T153000 0.04500

The key to reliability - Testing in the field of ML-Ops

idealo.de offers a price comparison service for millions of products from a wide variety of categories. Navigating the dynamic landscape of about 3.7 billion offerings from 50,000+ shops, our central challenge is cataloging this huge offer automatically. Machine learning plays a crucial role for us in processing data. Machine learning components must be considered as a part of a more complex domain. In our domain those components are part of an event-driven asynchronous architecture. The need to continuously develop, deliver, and train, accompanied by the capability to smoothly work together with traditional software components, places high demands on stable software development and operations. Testing plays a crucial role and brings up many open questions in the field of machine learning. In this talk we want to share and present our holistic approach to testing in machine learning. The following aspects are taken into account: - Introduction to our machine learning lifecycle - Testing in the context of traditional software development comprising unit tests, code coverage, contract tests, tests on infrastructure as code - Specific challenges of testing in the machine learning domain comprising end-to-end tests of training pipelines, deployment testing of inference endpoints in operational modes - The role of logging and monitoring for safe operations The presented test strategy is based on our four years of experience in operating idealo's cataloging system. Examples will be aligned along our tech stack consisting of e.g., PyTest, CDK, Pactman, AWS Sagemaker, GitHub Actions, OpenSearch Kibana and Grafana. PUBLIC CONFIRMED Talk (long) https://pretalx.com/pyconde-pydata-2024/talk/9PZSBS/ B05-B06 Gunar Maiwald Tobias Senst PUBLISH BNCJPV@@pretalx.com

-BNCJPV

The evolution of Feature Stores en

20240423T160000 20240423T163000 0.03000

The evolution of Feature Stores

In recent years, the role of feature stores has become increasingly pivotal in data engineering and machine learning. This talk will delve into the history of feature stores, exploring their evolution from Uber's Michelangelo to recent solutions like Feast, Hopsworks and Fennel. Lastly, we will discuss the potential impact of the AI Act on the future of feature stores, highlighting regulatory constraints that may affect what they look like in the future. The outline of this talk is detailed below. ### Historical Perspective: - Tracing the origins of Feature Stores: How did the concept evolve over time? - Early use cases and challenges: Lessons learned from Michelangelo. - Pioneering Feature Stores: Case studies on organizations at the forefront of adoption. ### Current Landscape: - Architectural insights: What do modern Feature Stores look like? - Integration with popular ML frameworks and data storage solutions. - Real-world success stories: How Zalando built a central Feature Store for serving features across departments and business units with different technical requirements. ### AI Act and the Future of Feature Stores: - Envisioning Feature Stores in an AI Act environment. - Federated learning and distributed feature stores: Opportunities and challenges. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/BNCJPV/ B05-B06 Olamilekan Wahab PUBLISH LNFSDV@@pretalx.com

-LNFSDV

Polars and Time Series: what it can do, and how to overcome any limitation en

20240423T163500 20240423T170500 0.03000

Polars and Time Series: what it can do, and how to overcome any limitation

This will be a technical talk, teaching people how to use Polars effectively for time series analysis. The format will be roughly: - 5 mins: motivation, super-fast Polars crash course. - 7 mins: what's built-in - making the most of Polars' built-in time series capabilities. - 7 mins: when Polars isn't enough: interoperability with numba/scipy/numpy. - 6 mins: when nothing is enough: writing your own Polars Plugin, and learning how to do that. - 5 mins: engaging Q&A / awkward silence. Attendees will leave knowing where to turn to for any time series analysis task they may encounter whilst using Polars. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/LNFSDV/ B05-B06 Marco Gorelli PUBLISH RRAZ99@@pretalx.com

-RRAZ99

Encoding Charactersets - may the force be with you en

20240423T103000 20240423T110000 0.03000

Encoding Charactersets - may the force be with you

Understanding and repairing garbled text (Mojibake) remains, despite Unicode, an ongoing task in IT projects. Garbled text is the result of text being decoded using an unintended character encoding. This talk covers the following points, each with code examples: - Explore the nuances of text representation: Graphemes vs. code points. Unravel the essence of characters in computing. - Delve into the realm of character encoding: Unicode vs. UTF-8. Decipher the key distinctions shaping text globalization. - Master the art of data interchange. Decode and encode files, database results, and REST APIs seamlessly for universal communication. - Unlock the power of the unicodedata module. Learn how it aids in character information retrieval and manipulation in Python. - Navigate the challenges of ISO charsets in the Unicode era. Gain insights into effective strategies for handling diverse character sets. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/RRAZ99/ A1 Martin Hoermann PUBLISH Y3Y78W@@pretalx.com
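The mechanism named in the abstract above, text decoded with an unintended encoding, can be reproduced in a few lines of standard-library Python. This is an illustrative sketch, not code from the talk. The helper names `garble` and `repair` are made up, and the repair only works when the wrong decoding step is known and lossless, as it is for this Windows-1252 example.

```python
import unicodedata


def garble(text, wrong_encoding="cp1252"):
    """Produce Mojibake: encode as UTF-8, then decode with the wrong codec."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair(text, wrong_encoding="cp1252"):
    """Undo garble() by reversing the two steps."""
    return text.encode(wrong_encoding).decode("utf-8")


original = "Grüße aus Köln"
broken = garble(original)      # 'GrÃ¼ÃŸe aus KÃ¶ln'
fixed = repair(broken)         # back to 'Grüße aus Köln'

# Unicode vs. UTF-8: one code point, a varying number of encoded bytes.
e_acute_bytes = "\u00e9".encode("utf-8")    # 2 bytes
euro_bytes = "\u20ac".encode("utf-8")       # 3 bytes

# Grapheme vs. code points: the same visible character "é" can be one
# precomposed code point or a base letter plus a combining accent.
composed = "\u00e9"            # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"         # 'e' + COMBINING ACUTE ACCENT
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

if __name__ == "__main__":
    print(broken, "->", fixed)
    print(len(composed), len(decomposed), same_after_nfc)
    print(unicodedata.name(composed))
```

The two strings `composed` and `decomposed` render identically but compare unequal until they are normalized, which is why normalization matters before comparing or deduplicating text.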

-Y3Y78W

(Un)leashed potential of AI in Government en

20240423T110500 20240423T113500 0.03000

(Un)leashed potential of AI in Government

As the world is being reshaped at an unprecedented speed through the rise of powerful (Generative) AI technologies that change the way we work and live, governments seek their place in the arena. This presentation will focus on how government institutions adapt to these changes by exploring three key areas of action: 1. Adoption: Generally, technology adoption has been slower in government than in the private sector. Yet governments have increasingly started to explore the potential of AI to deliver on their mission. The audience will learn about potentials, barriers, and concrete use cases/prototypes of AI-based services in German government bodies with a focus on responsible AI and Ethics. 2. Regulation: It is discussed how government bodies respond to the rise of AI through regulation. An introduction to the EU AI Act is given – the world’s first comprehensive AI law. 3. Reskilling & Upskilling: Insights are given on the role specialised data skills play in shaping the future of Digital Government in Germany. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/Y3Y78W/ A1 Rosa Marie Keller PUBLISH 8GQLLY@@pretalx.com

-8GQLLY

DDataflow: An open-source end-to-end testing framework for ML pipelines en

20240423T114000 20240423T121000 0.03000

DDataflow: An open-source end-to-end testing framework for ML pipelines

Machine Learning pipelines, especially those dealing with large datasets, are intricate and multifaceted. The ability to quickly iterate and experiment is crucial, yet the complexity and scale of these pipelines often lead to prolonged development loops and latent errors. Traditional unit-testing approaches have proven to be cumbersome and inefficient in addressing these challenges due to the extensive boilerplate code and limited coverage they offer. This talk will delve into the journey of developing [DDataflow](https://github.com/getyourguide/DDataFlow), a tool aimed at addressing the aforementioned challenges by enabling efficient end-to-end testing in ML pipelines. DDataflow employs decentralized data sampling to expedite testing processes, allowing for rapid and reliable iterations in ML pipelines. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/8GQLLY/ A1 Theodore Meynard Jean Machado PUBLISH 93MHQ3@@pretalx.com

-93MHQ3

Exploring Zarr: From Fundamentals to Version 3.0 and Beyond en

20240423T141000 20240423T144000 0.03000

Exploring Zarr: From Fundamentals to Version 3.0 and Beyond

Zarr is a data format for storing chunked, compressed N-dimensional arrays and is sponsored by [NumFOCUS](https://numfocus.org/project/zarr) under their umbrella. It is based on an open-source technical specification and has implementations in several languages, with [Zarr-Python](https://github.com/zarr-developers/zarr-python) being the most used. ## Outline First, I’ll talk about: ### Understanding Zarr basics (5 mins.) - What is Zarr, and how does it work? - The inner workings of Zarr using illustrated graphics - What is the Zarr Specification? - How is Zarr different when compared to other storage formats? Then, I'll talk about the new Zarr Specification V3 and its significant features: ### What's new in Zarr Spec V3? (15 mins.) - What is the motivation for the evolution of the specification? - High-latency storage → Better support for technologies, particularly systems with relatively high latency per operation, such as cloud object stores - Interoperability → Language-agnostic approach towards the new specification by slimming down the specification to achieve interoperability across major programming languages - Major design updates - Greater flexibility in how groups and arrays are created - Support for implicit groups that do not have a metadata document but whose existence is implied by descendant nodes - Restructuring of the `JSON` metadata document and storage path in both arrays and groups - Why is the Zarr V3 metadata consolidated compared to the Zarr V2 metadata? - Explicit support for extensions via defined extension points and mechanisms - How do extensions allow the community to add innovative and cutting-edge features to help their specific use cases? - Chunk encoding and supported codecs for V3 - How are chunks encoded into binary representation for storage in the store, using the chain of codecs specified by the codecs metadata field? 
- ZEP Process - Need and origin of a community feedback process for the evolution of the Zarr specification - Transformation from a steering-council-governed to a community-owned specification - Learnings when migrating from [Spec V2](https://zarr.readthedocs.io/en/stable/spec/v2.html) → [Spec V3](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html) Then, I’ll do a hands-on session covering the following: ### Hands-on (5 mins.) - Creating Zarr arrays and groups using Zarr-Python V3.0 - Walkthrough of the new features (mentioned above) - Demo of the [Sharding Codec](https://zarr.dev/zeps/accepted/ZEP0002.html) extension - Creating a sharded array and group and showing how a large number of chunks can be grouped together into a single shard - Looking under the hood - Use store functions to explain how your Zarr data is stored I'll close the talk with: ### Conclusion (5 mins.) - Key takeaways - How can you get involved? - QnA This talk aims to address an audience that works with large amounts of data and is looking for a transparent, open-source, reliable, cloud-optimised, and environmentally friendly format. Also, I’d like to invite anyone interested in the lessons I learned by maintaining the project throughout the years. The tone of the talk is set to be informative, story-telling and fun. Intermediate knowledge of Python and NumPy arrays is required. ### After this talk, you’ll: - understand the basics of Zarr and what's new in V3, - be able to use Zarr V3 for local and cloud storage, - be able to make an informed decision on what data format to use for your data and you’ll also: - know why you should have a process for your project, - have essential takeaways regarding an OSS project's transition from a young to a mature stage PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/93MHQ3/ A1 Sanket Verma PUBLISH P3GRLG@@pretalx.com
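The question raised in the Zarr abstract above, how a chunk becomes bytes in the store via a chain of codecs, can be illustrated with a toy model in plain Python. This sketch deliberately does not use Zarr-Python or follow the Zarr specification. The function names, the two-step codec chain (fixed-width little-endian integers, then zlib compression), and the dictionary standing in for a store are all assumptions made for illustration.

```python
import struct
import zlib


def split_into_chunks(values, chunk_size):
    """Cut a 1-D sequence into consecutive chunks of at most chunk_size."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def ints_to_bytes(chunk):
    """Array-to-bytes step: pack as little-endian 32-bit signed integers."""
    return struct.pack(f"<{len(chunk)}i", *chunk)


def bytes_to_ints(raw):
    """Inverse of ints_to_bytes."""
    return list(struct.unpack(f"<{len(raw) // 4}i", raw))


# Toy codec chain as (encode, decode) pairs, applied in order when writing.
CODECS = [
    (ints_to_bytes, bytes_to_ints),
    (zlib.compress, zlib.decompress),   # bytes-to-bytes compression step
]


def encode_chunk(chunk):
    """Run a chunk through every codec's encoder, first to last."""
    data = chunk
    for encode, _ in CODECS:
        data = encode(data)
    return data


def decode_chunk(blob):
    """Run stored bytes through every codec's decoder, last to first."""
    data = blob
    for _, decode in reversed(CODECS):
        data = decode(data)
    return data


# A dict plays the role of a key-value store; the key layout is made up.
array = list(range(10))
store = {
    f"chunk/{index}": encode_chunk(chunk)
    for index, chunk in enumerate(split_into_chunks(array, chunk_size=4))
}

# Reading back: decode each chunk independently and stitch them together.
restored = [value for key in sorted(store) for value in decode_chunk(store[key])]
```

Because every chunk is encoded on its own, a reader that needs only part of the array fetches and decodes just the matching keys. That is the property that suits chunked formats to high-latency object stores.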

-P3GRLG

From LLM as oracle to LLM as translator - our journey from theory to everyday’s practice in a corporate setting with dmGPT (and python) en

20240423T144500 20240423T153000 0.04500

From LLM as oracle to LLM as translator - our journey from theory to everyday’s practice in a corporate setting with dmGPT (and python)

One of the biggest challenges of working in an organization as large as dm is finding the information you need to accomplish your tasks: distributed organization units, multiple knowledge sources, and different tools make it very hard to find information whose location you don’t know. Most of the time, the best way to find something out is to ping a more experienced colleague and ask them. But what if you could ping your AI-powered copilot and find out? Not only that… What if it also helped you create content for your specific product without you telling it everything about the product? What if it was able to help you write code using internal tools? What if it could help you gain insight into your internal data? After its first steps in summer 2023, our vision for dmGPT quickly developed into it becoming a truly helpful assistant for every coworker at dm. Since then, we have contributed to the design and implementation of an LLM-powered platform that aims to achieve this goal. To come a step closer, we had to rethink the role of the LLM, picturing it as a translator between natural languages and software systems and back. Now, it helps us map an instruction in natural language to a set of tools needed to accomplish the given task and construct a coherent answer based on the provided data. In the design we had to face multiple challenging questions, such as: - How to connect multiple, heterogeneous data sources? - How to pick an LLM for a given task? - Which LLMs do we support? - How do we build a user-friendly, dynamic and configurable user interface? - How to measure the system’s quality? In this talk we would like to provide technical insight into our journey, discussing architectural decisions as well as implementation dilemmas, and engage in a discussion with the community about the steps to come. PUBLIC CONFIRMED Talk (long) https://pretalx.com/pyconde-pydata-2024/talk/P3GRLG/ A1 Emma Haley Niklas Lederer PUBLISH MTVWQM@@pretalx.com

-MTVWQM

Safeguarding Privacy and Mitigating Vulnerabilities: Navigating Security Challenges in Generative AI en

20240423T160000 20240423T163000 0.03000

Safeguarding Privacy and Mitigating Vulnerabilities: Navigating Security Challenges in Generative AI

In the ever-evolving landscape of Generative AI (GenAI), privacy and security have emerged as paramount concerns, echoing the necessity for comprehensive frameworks and collaborative initiatives. The session kicks off with an interactive segment, aiming to gauge the audience's familiarity and involvement with GenAI, ensuring the discussion aligns with their varying levels of expertise and engagement. Fundamental concepts of Data Privacy and Data Security are meticulously delineated, elucidating the responsible handling and fortification of personal information. A visual aid in the form of a Venn diagram underscores the intricate interplay between these two crucial facets, facilitating a deeper understanding for the audience. Transitioning to the domain of GenAI, the discourse delves into the indispensable need for data privacy throughout the lifecycle of GenAI models. Instances of ethical and legal concerns arise during the training phase, where datasets often contain potentially sensitive personal information sourced from the internet. Real-world cases such as disputes between media entities like The New York Times and AI organizations like OpenAI exemplify these dilemmas. Moreover, the session critically scrutinizes data privacy concerns during GenAI production, focusing on the policies adopted by AI companies regarding prompt-related data retention. Here, certain AI entities retain prompt records for extended durations, which can pose potential privacy risks. In response, initiatives such as enterprise versions of GenAI models, like those offered by OpenAI, provide users with enhanced control over data usage, reinforcing a more privacy-centric approach. Simultaneously, the discussion navigates through the dimensions of data security risks inherent in GenAI models during operational phases. The potential extraction of sensitive personal data from these models poses substantial risks, given GenAI's proclivity to retain information from its training data. 
Academic research papers, like "Scalable Extraction of Training Data from (Production) Language Models," delve into these vulnerabilities, highlighting the complexity of data security challenges in GenAI. Further enriching the discourse, the session showcases the top ten vulnerabilities in GenAI, as identified by insights from OWASP. These vulnerabilities encompass a wide array of risks, from prompt injection and insecure output handling to training data poisoning and supply chain vulnerabilities. To culminate the discussion, actionable strategies to fortify data protection within GenAI are proposed. These encompass leveraging Open Source GenAI solutions like LLAMA, recognized for their transparency, although they may come with higher maintenance costs. Additionally, anonymizing data before prompt utilization emerges as a proactive measure, albeit posing certain operational challenges. Moreover, the session underscores the pivotal role of government regulations in safeguarding citizen data and establishing policies binding on GenAI companies. Recent regulations from governments like the US, UK, and other countries emphasize the need for AI systems to be 'secure by design,' promoting robust data protection measures. Collaborative efforts among companies also come to the forefront, exemplified by initiatives like the "AI Alliance" formed by IBM, Meta, and 50 other organizations. These alliances aim to advance open-source AI while fostering collective processes for data protection and security. In conclusion, this comprehensive session aims to empower attendees with a holistic understanding of privacy and security challenges in the GenAI domain. The discourse, enriched with real-world instances, legal dilemmas, academic insights, and industry perspectives, seeks to equip individuals and organizations with actionable insights. 
The objective is to navigate the complex terrain of GenAI, fostering a more privacy-aware and secure integration into our lives and technological ecosystems. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/MTVWQM/ A1 John Robert PUBLISH QLXUHY@@pretalx.com
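The abstract above names anonymizing data before it reaches a prompt as one protective measure. A minimal sketch of that idea, assuming simple regular expressions are good enough for the example: the patterns, the placeholder tokens and the sample prompt are all illustrative, and real personal data (names, addresses, IDs) needs far more than two regexes.

```python
import re

# Deliberately simple patterns; they are assumptions for this sketch, not a complete PII detector.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s()-]{7,}\d")

def redact(text):
    """Replace e-mail addresses and phone-like digit runs with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

prompt = "Contact Jane at jane.doe@example.com or +49 170 1234567 about the invoice."
safe_prompt = redact(prompt)
print(safe_prompt)
```

Note that the name "Jane" survives the redaction, which hints at the operational challenges the abstract mentions.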

-QLXUHY

Breaking AI Boundaries: Fairness Metrics in Unstructured Data Domains en

20240423T163500 20240423T170500 0.03000

Breaking AI Boundaries: Fairness Metrics in Unstructured Data Domains

Fairness metrics are already widely used to avoid unwanted bias in machine learning models. However, although fairness is a hot topic, these metrics are primarily used in domains where the models' interface with and influence on humans are obvious. In other domains with a less obvious connection between model decisions and their impact on human beings, they are rarely seen (e.g., automotive engineering applications). This poses three questions: 1. In those domains, is it really unnecessary to use fairness techniques, or is their absence endangering individuals in a less obvious way? (necessity) 2. Even if a use case does not need fairness techniques, wouldn't it still benefit from a look through the "fairness lens" and the connected methods and tools? (benefit) 3. Apart from the weaker incentive to use fairness metrics, what obstacles keep people from using them, and how can we mitigate them? (obstacles and solutions) To answer these questions, our presentation will first briefly compare five prototypical engineering use cases and categorize them according to the above criteria (necessity, benefit, obstacles). This first part mainly aims to map out the space of machine learning use cases in the engineering domain and suggest possible reasons why fairness-related techniques are not applied in those areas. We will then focus on further analyzing those obstacles and providing solutions to overcome them. Here, the main focus will be expanding the application of fairness-based model evaluation to unstructured data domains. Typical use cases in this category range from image and audio recognition to LLM applications with large text documents. We will provide a brief theoretical overview of strategies to make fairness metrics applicable and then go through a concrete example down to the implementation level. 
For that, we will touch on important subjects, such as detecting meaningful subgroups in unstructured data, extracting easy-to-grasp explanations for model failures, and interactive analysis of model predictions. This section will also feature two open-source tools to address these challenges: Sliceguard and Spotlight. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/QLXUHY/ A1 Daniel Klitzke PUBLISH RMZLKZ@@pretalx.com
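The core of a fairness-style evaluation, comparing a metric across subgroups, can be sketched in a few lines. This is a toy illustration with invented records and a hypothetical helper `accuracy_by_group`. It does not use Sliceguard or Spotlight, and it assumes the subgroup label ("daylight" or "night") is already known, whereas for unstructured data finding such groups is the hard part the talk addresses.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Compute accuracy separately for every subgroup label."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, y_true, y_pred in records:
        totals[group] += 1
        hits[group] += int(y_true == y_pred)
    return {group: hits[group] / totals[group] for group in totals}

# (subgroup, true label, predicted label); invented data for illustration only
records = [
    ("daylight", 1, 1), ("daylight", 0, 0), ("daylight", 1, 1), ("daylight", 0, 0),
    ("night", 1, 0), ("night", 0, 0), ("night", 1, 1), ("night", 1, 0),
]

per_group = accuracy_by_group(records)
gap = max(per_group.values()) - min(per_group.values())
print(per_group, gap)
```

An overall accuracy of 75% would hide the fact that the model is right only half of the time on the night-time slice.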

-RMZLKZ

Using ML to find out the "Why"? A Tutorial in Causal Machine Learning en

20240423T103000 20240423T120000 1.03000

Using ML to find out the "Why"? A Tutorial in Causal Machine Learning

The tutorial will be organized in four blocks. 1) Introduction and motivation We will point out why causality matters in data science. Many problems managers and data scientists are facing are causal. When organizations and companies want to optimize their marketing campaigns, their financial planning, or their pricing schemes, they usually run into causal considerations: How much do my sales decrease if we increase the price by X%? How can I send out email newsletters to those who like them and avoid annoying other subscribers? Causal Inference and Causal ML offer powerful tools that help to formalize and model things that are usually discussed only on an intuitive basis: Are the people who opened my newsletters really comparable to those who haven't? Can I just compare the conversion rates of these groups when I want to evaluate the newsletter's effectiveness? 2) Introduction to Causal Machine Learning with DoubleML Causal Machine Learning offers tools to estimate causal relationships with state-of-the-art ML algorithms. We will offer an introduction to the Double Machine Learning approach (Chernozhukov et al., 2018). This introduction will be accompanied by several data examples and code demonstrations using the Python package DoubleML, https://docs.doubleml.org/stable/index.html . DoubleML is an open source package that offers various tools to estimate causal effects, for example for estimation of heterogeneous treatment effects (like in personalized marketing or personalized medicine). 3) Hands-on Session: Data Example The tutorial features a data project that participants can solve on their own. With the hands-on session participants already get started on their own causality learning journey :) Participants are invited to apply DoubleML to their own data example and play around with the package features. 
The hands-on session will follow the structure of the DoubleML workflow, which guides analysts through the process of causal inference with DoubleML, https://docs.doubleml.org/stable/workflow/workflow.html. 4) Discussion and Q&A The tutorial concludes with a discussion and Q&A session. We are looking forward to participants' comments and ideas. We appreciate feedback from the Python community on the DoubleML package :) PUBLIC CONFIRMED Tutorial https://pretalx.com/pyconde-pydata-2024/talk/RMZLKZ/ A03-A04 Oliver Schacht Jan Teichert-Kluge PUBLISH XBUHCK@@pretalx.com

-XBUHCK

Performant, scientific computation in Python and Rust en

20240423T140500 20240423T153500 1.03000

Performant, scientific computation in Python and Rust

The Rust programming language has gained a lot of attention in recent years and has slowly begun to infiltrate the Python ecosystem, where an ever-increasing number of tools and libraries, such as Ruff and Polars, are implemented in it. Unlike Python, Rust is a systems language optimized for performance and memory safety, and some consider it the spiritual successor of C++. Despite its steep learning curve, it is the perfect candidate for extending Python and its ecosystem when performance matters, in a modern and memory-safe language. This session demonstrates the path of creating a scientific package in Python (following best practices and modern tools) and gradually migrating parts of it to Rust for additional performance gains. The use case is a naive implementation of the "Expectation maximization for Gaussian Mixture Models" algorithm from scratch, a relatively simple yet efficient machine learning method. The session addresses the following points: how to build a Python package with a modern tool set, how to translate a numerical algorithm into vectorized Python, and how to optimize the package with a performant Rust implementation of the critical parts. Prior knowledge of Rust or the algorithm is not required. Note that the goal is not to learn Rust in this single session (this requires at least three days) but rather to provide a superficial overview of what makes this language so great and well-suited for extending Python. Participants are advised to clone the repository below and follow the installation instructions to avoid longer download times during the session. https://github.com/StefanUlbrich/PyCon2024 PUBLIC CONFIRMED Tutorial https://pretalx.com/pyconde-pydata-2024/talk/XBUHCK/ A03-A04 Stefan Ulbrich PUBLISH 8C83EA@@pretalx.com
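For readers unfamiliar with the use case, expectation maximization for a Gaussian mixture alternates between assigning soft memberships (E-step) and re-estimating the parameters (M-step). The following is a minimal, unvectorized sketch for two components in one dimension. It is not taken from the tutorial repository, and the function names, initialization and simulated data are assumptions.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    means = [min(data), max(data)]   # crude initialization at the data extremes
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means and variances from the responsibilities
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            variances[k] = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
    return weights, means, variances

random.seed(1)
data = [random.gauss(0, 1) for _ in range(300)] + [random.gauss(6, 1) for _ in range(300)]
weights, means, variances = em_gmm_1d(data)
print(weights, means, variances)
```

The nested Python loops are exactly the kind of hot spot that vectorization, and later a compiled implementation, would target.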

-8C83EA

PyO3 101 - Writing Python modules in Rust en

20240423T155000 20240423T172000 1.03000

PyO3 101 - Writing Python modules in Rust

In recent years, Rust has become more and more popular compared with similar programming languages like C and C++ due to its robust compiler checks and ownership rules that ensure memory safety. Hence there are more and more Python libraries that are written natively in Rust and expose a Python API. One of the tools that have been driving this movement is PyO3, a toolset that provides Rust bindings for Python and tools for creating native Python extension modules. In this interactive workshop, we will cover the very basics of using PyO3. There will be hands-on exercises that go from setting up the project environment to writing a "toy" Python library in Rust using PyO3. We will cover many parts of the API provided by PyO3 for creating Python functions and modules, handling errors, and converting types. ## Goal To give developers who are not familiar with PyO3 an introduction to PyO3 so they can consider building their Python libraries with Rust to make use of Rust's memory safety and parallelism. ## Target audiences Any developers who are interested in developing Python libraries using Rust. It will be an advantage if the attendees are comfortable writing Rust. However, attendees are not required to be familiar with Rust as all the Rust code will be provided. Basic knowledge of Python is assumed. 
## Outline Part 1 - introduction and getting started (40 mins) - What's the difference between Rust and Python (5 mins) - Why use PyO3 (5 mins) - Setting up the environment (exercises) (15 mins) - Starting a new project (exercises) (15 mins) Break (15 mins) Part 2 - Creating a simple Python library (50 mins) - Creating Python modules (exercises) (20 mins) - Generating documentation - Creating Python functions (exercises) (30 mins) - How to create function signatures - How to deal with errors PUBLIC CONFIRMED Tutorial https://pretalx.com/pyconde-pydata-2024/talk/8C83EA/ A03-A04 Cheuk Ting Ho PUBLISH KCC9EF@@pretalx.com

-KCC9EF

Bulletproof Python - Property-Based Testing with Hypothesis en

20240423T103000 20240423T120000 1.03000

Bulletproof Python - Property-Based Testing with Hypothesis

Traditional tests are example-based. They require the developer to come up with arbitrary inputs and check a system’s behaviour against explicit outputs. More often than not, developers only think of inputs that are handled correctly by their code, thus leaving bugs hidden. Property-based tests generate the inputs for you and in many cases they’re more likely to find invalid inputs than humans. The difficulty lies in formulating these test cases. After this workshop you’ll be comfortable with property-based testing using Hypothesis. You’ll have experience requesting appropriate test data from Hypothesis and in writing tests for common and more advanced properties. At work, your co-workers will be impressed by your unbreakable code ;) Participants are expected to have basic familiarity with unit testing and a testing framework. Provided code examples use pytest. Please set up the workshop material in advance. To do that, navigate to the Git repository linked in the supporting material section and follow the setup instructions in the README file. PUBLIC CONFIRMED Tutorial https://pretalx.com/pyconde-pydata-2024/talk/KCC9EF/ A05-A06 Michael Seifert PUBLISH PKJHBA@@pretalx.com
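The difference between example-based and property-based tests can be sketched without any library. The block below hand-rolls the idea with the standard library's `random` module. It is not Hypothesis code, and the run-length functions and the generator are invented for illustration. Hypothesis replaces the hand-written input generator with strategies and additionally shrinks failing inputs to minimal examples.

```python
import random

def run_length_encode(text):
    """Encode 'aab' as [['a', 2], ['b', 1]]."""
    out = []
    for ch in text:
        if out and out[-1][0] == ch:
            out[-1][1] += 1
        else:
            out.append([ch, 1])
    return out

def run_length_decode(pairs):
    """Inverse of run_length_encode."""
    return "".join(ch * n for ch, n in pairs)

def check_roundtrip(trials=500, seed=0):
    """Property: decoding an encoded string returns the original, for any input."""
    rng = random.Random(seed)
    for _ in range(trials):
        # A tiny alphabet makes repeated characters (the interesting case) likely.
        text = "".join(rng.choice("ab ") for _ in range(rng.randint(0, 20)))
        assert run_length_decode(run_length_encode(text)) == text, repr(text)
    return trials

checked = check_roundtrip()
print(f"round-trip property held for {checked} generated inputs")
```

The test states a property that must hold for every input rather than listing a handful of expected outputs, which is where formulating good properties becomes the main difficulty.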

-PKJHBA

Functional Python en

20240423T140500 20240423T153500 1.03000

Functional Python

## Audience Intermediate Python programmers who would like to learn more about functional programming and its application in Python. ## Format The tutorial will be hands-on. I will use JupyterLab and will start with an empty Jupyter Notebook. I will unroll the tutorial content by typing. In addition, I will distribute scripts before the tutorial to avoid overly lengthy typing. I will load these scripts one by one into a Notebook. Participants will have the opportunity to type along. I am a rather slow typist. In addition, I will stop typing often to explain. This gives most participants plenty of time to follow along. The PDF handout is very comprehensive and contains most of what I type. This allows students to catch up if they should fall behind. ## Outline * Functional programming basics (10 min) * Overview programming paradigms * Features of functional programming * Advantages of functional programming * Disadvantages of functional programming * Python's functional features - overview * Pure functions (5 min) * Callables and functions in Python (20 min) * Callables * Closures * "Currying" * Partial functions * Recursion * Lambda * Single Dispatch * No Loops - map, filter, and reduce (10 min) * Processing iterables with map * Select from iterables with filter * Reductions of iterables with reduce * Operators as Functions (10 min) * Arithmetic operators * Logical operators * Attribute access * Lookup * Comprehensions (15 min) * Simple * Nested * Dictionary comprehensions * Set comprehensions * Iterators (15 min) * Itertools * Infinite iterators * Iterators terminating on the shortest input sequence * Combinatoric iterators * External tools (5 min) * More itertools * Toolz PUBLIC CONFIRMED Tutorial https://pretalx.com/pyconde-pydata-2024/talk/PKJHBA/ A05-A06 Mike Müller PUBLISH UPSJEM@@pretalx.com
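Several items from the outline above fit into one short standard-library sketch: a closure, a partial function, map/filter/reduce, operators as functions, an infinite iterator, and single dispatch. The example values and names are made up and are not taken from the tutorial material.

```python
from functools import partial, reduce, singledispatch
from itertools import accumulate, count, islice
from operator import add, itemgetter, mul

def make_counter():
    """Closure: the inner function keeps access to `total` between calls."""
    total = 0
    def increment(step=1):
        nonlocal total
        total += step
        return total
    return increment

counter = make_counter()
counter()                     # total is now 1
after_two_calls = counter(5)  # total is now 6

double = partial(mul, 2)                                   # partial function
squares_of_even = list(map(lambda n: n * n, filter(lambda n: n % 2 == 0, range(10))))
factorial_5 = reduce(mul, range(1, 6), 1)                  # reduction with an operator
running_sum = list(accumulate([1, 2, 3, 4], add))
first_odds = list(islice((n for n in count(1) if n % 2), 5))  # slice an infinite iterator
by_age = sorted([("Ada", 36), ("Linus", 28)], key=itemgetter(1))  # lookup as a function

@singledispatch
def describe(value):
    """Fallback for types without a registered implementation."""
    return f"object {value!r}"

@describe.register
def _(value: int):
    return f"int {value}"

@describe.register
def _(value: list):
    return f"list of {len(value)}"

print(double(21), squares_of_even, factorial_5, running_sum, first_odds, by_age)
print(describe(3), describe([1, 2]), describe("x"))
```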

-UPSJEM

Boost your Data Science skills with the new Python in Excel en

20240423T155000 20240423T172000 1.03000

Boost your Data Science skills with the new Python in Excel

Python in Excel is the new integration created by Microsoft that brings Python programming directly into Excel workbooks, for advanced data analytics. With Python in Excel, it is now possible to embed Python code directly into workbook cells, very easily, and with zero setup required. In fact, all the Python code runs automatically in the Microsoft Cloud, and leverages the Python Anaconda Distribution to get immediate access to a vast selection of packages to unlock unprecedented use cases in data science, data visualization, and machine learning. The output of each execution is automatically integrated into the spreadsheet, creating interactive data reports to share with customers and other users. The new feature is currently available in _public preview_ to **all users** running the MS Excel Beta Channel on Windows. In this tutorial, we will explore the many features and capabilities this new integration provides, to unlock unprecedented data science and machine learning use cases in Excel. First, we will familiarize ourselves with the new environment, understanding its execution model, and the differences from standard Python programs. Afterwards, we will work on several examples to demonstrate the potential of using Python directly in the workbook to filter, validate, wrangle and visualize our data. We will conclude our tutorial by creating a full-fledged machine learning experiment directly in Excel. Familiarity with Excel and the Python language is the only requirement to attend this tutorial. ## Setup Instructions **Python in Excel** is currently available (_for free_) to MS Excel users on the **Windows** operating system. ### Non-Windows Users If you are not running on Windows, it is strongly recommended to install a version of Windows on a virtual machine (VM) using any solution that works on your operating system. 
For example, [Parallels](https://www.parallels.com/products/desktop) for macOS users, or [VirtualBox](https://www.virtualbox.org/) for Linux users. ### Setup Python in Excel for Windows To use the _new_ "Python in Excel" feature, it is required to join the [Microsoft 365 Insider Program](https://support.microsoft.com/en-gb/office/get-started-with-python-in-excel-a33fbcbe-065b-41d3-82cf-23d05397f53d#:~:text=Microsoft%20365%20Insider%20Program) and choose the Beta Channel Insider level. You can find more detailed instructions on [Get Started with Python in Excel](https://support.microsoft.com/en-gb/office/get-started-with-python-in-excel-a33fbcbe-065b-41d3-82cf-23d05397f53d). ### (Optional) Install Excel Labs plugin [Excel Labs](https://appsource.microsoft.com/en-us/product/office/wa200003696?tab=overview) is an add-in that includes experimental Excel features. Among these features, it provides the **Python editor**, a notebook-like interface designed for authoring Python in Excel. Excel Labs is **not** required, but it is strongly recommended for a better working and development experience with Python in Excel. ### Data Download Once all the setup operations are completed, please download the [Financial Sample Excel Workbook](https://go.microsoft.com/fwlink/?LinkID=521962). We will use this data file as our gym playground to familiarise ourselves with the new feature. PUBLIC CONFIRMED Tutorial https://pretalx.com/pyconde-pydata-2024/talk/UPSJEM/ A05-A06 Valerio Maggio PUBLISH PLRERM@@pretalx.com

-PLRERM

Keynote - Ten Key Questions that a Company Should Ask to have Responsible AI en

20240424T091500 20240424T100000 0.04500

Keynote - Ten Key Questions that a Company Should Ask to have Responsible AI

Responsible AI mainly covers AI principles, governance & regulation, but most companies do not know how to implement all of these. Hence, in this presentation we cover the key questions for the whole process behind a new AI product, from the idea and design to the development and deployment. The questions are partly based on the new ACM Principles for Responsible Algorithmic Systems (2022), of which the speaker is one of the two lead authors, as well as their extensions for Generative AI (2023). For each question we will discuss its relevance, challenges, and (partial) solutions, triggering an interactive discussion. PUBLIC CONFIRMED Keynote https://pretalx.com/pyconde-pydata-2024/talk/PLRERM/ Kuppelsaal Ricardo Baeza-Yates PUBLISH PVLTD3@@pretalx.com

-PVLTD3

Which kind of software tests do I really need? en

20240424T103000 20240424T110000 0.03000

Which kind of software tests do I really need?

In the dynamic landscape of software development, choosing the right testing strategy is crucial for delivering high-quality software products. The myriad of available testing methodologies often leaves developers and QA professionals pondering over the question: "Which kind of software tests do I really need?" This presentation aims to demystify the world of software testing by exploring various testing approaches and methodologies. From unit testing to system testing, from functional to non-functional testing, each method serves a unique purpose in the software development life cycle. The talk will dive into the factors influencing the selection of appropriate testing methods. We will discuss the advantages and limitations of different testing types, helping participants understand the trade-offs involved in each approach. Practical examples will be presented to illustrate how choosing the right testing strategy can positively impact software quality, development speed, and overall project success. Participants will gain insights into evolving industry best practices and learn how to adapt their testing strategies to meet the demands of modern software development. By the end of the talk, attendees will have an overview of the diverse landscape of software testing and be equipped with the knowledge needed to make informed decisions about which types of tests are most relevant for their specific projects. This presentation aims to empower developers, QA professionals, and project managers to navigate the testing maze and optimize their testing efforts for efficient and effective software delivery. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/PVLTD3/ Kuppelsaal Pascal Puchtler PUBLISH RKDSK7@@pretalx.com

-RKDSK7

I achieved peak performance in python, here's how ... en

20240424T110500 20240424T113500 0.03000

I achieved peak performance in python, here's how ...

In this session, we will walk through and refine the phases of development in Python: 1. Functional Execution 2. Rigorous Testing and Accuracy 3. Performance Optimization We will discuss common bottlenecks in unoptimized code: 1. Inefficient Coding Practices 2. Memory Leaks 3. Suboptimal Data Structures and Algorithms 4. Lack of Vectorization 5. Overlooked Parallelization We'll further look into the benefits of profiling the code: 1. Profiling the Code with cProfile/sentry 2. Profiling the Code with timeit 3. Memory Profiler Finally, for data-driven applications, we'll look into strategies to achieve peak performance: 1. Efficient DataFrame Storage with Parquet Files 2. Handling Categorical Data Types 3. Looping Techniques and How to Choose Between Them 4. String Concatenation (joins and cleanup) [Attendees takeaway] Whether you're a seasoned developer looking to enhance your optimization skills or a newcomer eager to understand the principles behind efficient Python code, this talk offers valuable insights and practical takeaways. [Pre-requisites] Basics of Python [who-am-i] Name: Dishant Sethi Email: dishantsethi14@gmail.com Phone no: +919582565371 Designation: Software Consultant and Founder @prodinit.com [Previous Talks] PyconDE and Pydata Berlin: https://youtu.be/osGGX3tcwkc Gophercon India 2023: https://youtu.be/zuzTN3ibrCM?si=GEo31lE_Q8h4hzTR PyDelhi: https://youtu.be/6h9I3iyqyu4 PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/RKDSK7/ Kuppelsaal Dishant Sethi PUBLISH P7AG9A@@pretalx.com
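Two of the profiling tools named above, `timeit` and `cProfile`, ship with Python and can be tried on the string-concatenation topic from the list. This is a small sketch with invented functions (`concat_loop`, `concat_join`) and an arbitrary input size. Absolute timings depend on the machine, so none are claimed here.

```python
import cProfile
import io
import pstats
import timeit

def concat_loop(words):
    """Build the string piece by piece inside a loop."""
    result = ""
    for word in words:
        result += word + " "
    return result.strip()

def concat_join(words):
    """Let str.join build the result in one pass."""
    return " ".join(words)

words = ["python"] * 10_000
same_output = concat_loop(words) == concat_join(words)

# timeit: run each variant several times and compare total wall-clock time.
loop_time = timeit.timeit(lambda: concat_loop(words), number=20)
join_time = timeit.timeit(lambda: concat_join(words), number=20)

# cProfile: collect per-function statistics for a single call.
profiler = cProfile.Profile()
profiler.enable()
concat_loop(words)
profiler.disable()

buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()

print(f"loop: {loop_time:.4f}s  join: {join_time:.4f}s")
print(report)
```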

-P7AG9A

Python 3.12's new monitoring and debugging API en

20240424T114000 20240424T121000 0.03000

Python 3.12's new monitoring and debugging API

Python long lacked a good monitoring and profiling API. It had only the simplistic sys.settrace API, which had a high overhead and couldn't be configured appropriately. The new API, released in October 2023, changes this by offering a proper, fine-grained and well-designed monitoring API while also making the commonly used operations fast. This talk will give you an introduction to the new API and its major design decisions and show you how you can use it to write a simple debugger from scratch. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/P7AG9A/ Kuppelsaal Johannes Bechberger PUBLISH BFYUUJ@@pretalx.com
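The new API lives in `sys.monitoring`. As a rough sketch of the building block a debugger needs, the code below records which lines of one function execute. The helper `trace_lines` and the `sample` function are invented for this example and are not from the talk. The fallback for interpreters older than 3.12 simply runs the function untraced.

```python
import sys

def trace_lines(func, *args):
    """Call func(*args); return (result, executed line offsets relative to the def line)."""
    if not hasattr(sys, "monitoring"):   # Python < 3.12: no monitoring API, run untraced
        return func(*args), []
    mon = sys.monitoring
    tool = mon.DEBUGGER_ID               # one of the predefined tool ids
    code = func.__code__
    offsets = []

    def on_line(code_obj, line_number):
        # Called just before a new source line of the monitored code object runs.
        offsets.append(line_number - code_obj.co_firstlineno)

    mon.use_tool_id(tool, "line-tracer-sketch")
    try:
        mon.register_callback(tool, mon.events.LINE, on_line)
        # Local events: only this code object is instrumented, everything else runs at full speed.
        mon.set_local_events(tool, code, mon.events.LINE)
        result = func(*args)
    finally:
        mon.set_local_events(tool, code, mon.events.NO_EVENTS)
        mon.register_callback(tool, mon.events.LINE, None)
        mon.free_tool_id(tool)
    return result, offsets

def sample(n):
    total = 0
    for i in range(n):
        total += i
    return total

result, offsets = trace_lines(sample, 3)
print(result, offsets)
```

Enabling events per code object, rather than globally as with `sys.settrace`, is one way the API keeps overhead low. A real debugger would decide inside the callback whether a breakpoint is hit.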

-BFYUUJ

(PyLadies Panel) Reflecting Within: Challenging Narratives in Tech Feminism en

20240424T131000 20240424T141000 1.00000

(PyLadies Panel) Reflecting Within: Challenging Narratives in Tech Feminism

For the third year in a row, the PyLadies Panel at PyCon PyData engages with a broader audience on critical issues related to gender disparities, ethics, and the ongoing importance of women-focused tech groups. Adopting unconventional formats, the PyLadies Panel aims to foster meaningful discussions among PyLadies members and the Python community, encouraging open dialogue and community solidarity. This year, we propose a structured debate inspired by Lucy Delap’s “Feminisms: A Global History.” The book challenges ethnocentric and exclusive narratives within the feminist movement itself. It calls for a more inclusive and multifaceted understanding of feminism that respects and incorporates the diversity of its expressions and the different challenges faced by women around the world. Having the book as a reference point and inspiration, this panel is an opportunity to critically reflect on these themes and develop actionable strategies for a more equitable future in technology. Designed to dissect and challenge entrenched narratives about feminism in the tech industry, the debate encourages a deep dive into difficult conversations to dismantle binary thinking and uncover nuances in common discourse. Participants and audience members are invited to confront and critique the prevailing frameworks of feminism, particularly the predominance of perspectives that may not fully represent the movement’s global and diverse nature. By acknowledging and addressing these gaps, the debate will explore actionable steps toward inclusivity and equity. Through a debate-style format, panelists will engage in a candid, necessary discussion and exchange of ideas, allowing for both the celebration of feminist achievements and a critical evaluation of ongoing issues. It will provide a platform for voices that have been marginalized or silenced, enabling a constructive dialogue that moves beyond simple dichotomies to foster understanding and progress. 
Join us as we challenge the status quo, identify systemic flaws, and collaboratively outline the future directions of feminism in technology. This debate is not just about reflection; it’s about taking active steps to ensure that our community is inclusive and representative of all its members. Panel with Tania Allard, Katharine Jarmul, Naa Ashiorkor Nortey & Cheuk Ting Ho PUBLIC CONFIRMED Panel https://pretalx.com/pyconde-pydata-2024/talk/BFYUUJ/ Kuppelsaal Paloma Oliveira Katharine Jarmul Cheuk Ting Ho Naa Ashiorkor Nortey PUBLISH DPVJ7K@@pretalx.com

-DPVJ7K

Async Awaits: Mastering Asynchronous Python in FastAPI en

20240424T144500 20240424T151500 0.03000

Async Awaits: Mastering Asynchronous Python in FastAPI

In this 30-minute session, we'll embark on a journey to master asynchronous programming in Python, specifically focusing on its application in the FastAPI framework. The talk is designed to provide a thorough understanding of async/await syntax and its practical use in building efficient, scalable web applications. ### Timetable: #### 1. Introduction to Asynchronous Programming (5 minutes) - Brief overview of asynchronous programming concepts. - The importance of async in modern web development. #### 2. Understanding Async/Await in Python (5 minutes) - Deep dive into Python's async/await syntax. - Key differences between synchronous and asynchronous code. #### 3. FastAPI and Asynchronous Python (10 minutes) - Introduction to FastAPI with a focus on its asynchronous features. - Demonstrating how FastAPI leverages Python’s async capabilities. #### 4. Building an Asynchronous Web App (7 minutes) - Step-by-step guide on setting up and coding an async web application in FastAPI. - Best practices for handling asynchronous operations. #### 5. Q&A and Wrap-Up (3 minutes) - Addressing questions from the audience. - Summarizing key takeaways and concluding the talk. Join us to unlock the power of asynchronous Python in the world of web development and learn how to effectively implement these techniques in your FastAPI projects. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/DPVJ7K/ Kuppelsaal Bojan Miletic PUBLISH 7UYHYP@@pretalx.com
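The key difference between synchronous-style and asynchronous code can be shown with the standard library alone. This sketch does not use FastAPI: `fetch` is a hypothetical stand-in for a database or HTTP call inside an `async def` endpoint, and the delays are arbitrary.

```python
import asyncio
import time

async def fetch(name, delay):
    """Pretend I/O call: awaiting the sleep hands control back to the event loop."""
    await asyncio.sleep(delay)
    return f"{name} done"

async def sequential():
    # Each await finishes before the next call starts: about 0.3 s in total.
    return [await fetch("a", 0.1), await fetch("b", 0.1), await fetch("c", 0.1)]

async def concurrent():
    # gather schedules all three coroutines at once: about 0.1 s in total.
    return await asyncio.gather(fetch("a", 0.1), fetch("b", 0.1), fetch("c", 0.1))

def timed(coro_func):
    """Run a coroutine function to completion and measure the wall-clock time."""
    start = time.perf_counter()
    result = asyncio.run(coro_func())
    return result, time.perf_counter() - start

seq_result, seq_time = timed(sequential)
con_result, con_time = timed(concurrent)
print(f"sequential: {seq_time:.2f}s  concurrent: {con_time:.2f}s")
```

Both variants return the same results in the same order. The gain comes purely from overlapping the waiting time, which is why async pays off for I/O-bound handlers and not for CPU-bound ones.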

-7UYHYP

Building accessible documentation sites en

20240424T152000 20240424T155000 0.03000

Building accessible documentation sites

For a long time, there has been a prevailing notion that accessibility should only be considered within front-end web development - the discipline of creating what someone can see or do on a website or web app. However, accessibility is a holistic practice that covers every aspect of building digital experiences, meaning it is everyone’s concern - whether working on the backend, documentation, CLI, or API levels. As an open-source maintainer, your project’s documentation is one of the primary ways users interact with your tools. Ensuring your documentation is up-to-date is as important as ensuring it is accessible for disabled users to provide an inclusive user experience and bring in new contributors. For the last five years, I have worked on multiple aspects of open-source accessibility, from auditing to remediation and building more accessible tools for end-users, authors, and open-source maintainers. In this talk, I will share practical advice - including tools and workflows - to make your documentation and other user-facing resources, from markdown files to Sphinx documentation sites and Jupyter notebooks, more accessible to disabled users. After this talk, you will better understand how to make your documentation more accessible with minor changes to your workflows or practices, even if you do not have deep accessibility knowledge (yet). Outline - Context setting [5 mins] - Brief context setting - Intro to accessibility [7 mins] - 101 into accessibility - while this will not be a deep dive, we will cover some guidelines and principles applicable to documentation, notebooks, and user-facing resources. 
- Contextualising accessibility into documentation [8 mins] - discussing strategies for accessibility auditing, remediation, and implementation within open source documentation - Practical strategies TL;DR [5 mins] - Summarise best practices and tools for OSS documentation accessibility - Q/A with the audience [5 mins] PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/7UYHYP/ Kuppelsaal Dr. Tania Allard PUBLISH KCYDM9@@pretalx.com

-KCYDM9

Prescriptive Analytics in the Python Ecosystem with Gurobi en

20240424T103000 20240424T110000 0.03000

Prescriptive Analytics in the Python Ecosystem with Gurobi

Gurobi is a prescriptive analytics technology that enables you to make optimal decisions from data. You can use prescriptive analytics to generate optimized decision recommendations, based on real-world variables and constraints. Powered by mathematical models solved by mixed-integer optimization, it enables embedded decision intelligence in all kinds of applications in an industry-agnostic fashion and in any deployment scenario. Join us as we guide you through integrating Gurobi and prescriptive analytics into your greater Python ecosystem. We’ll demonstrate model-building patterns based on NumPy and SciPy.sparse data structures and explore how to take advantage of indexed DataFrames and Series in pandas for mathematical model building. You’ll also discover how to use trained regressors from scikit-learn as constraints in optimization models. Join us as we delve into the world of optimization with Gurobi and elevate your workflows. PUBLIC CONFIRMED Sponsored Talk https://pretalx.com/pyconde-pydata-2024/talk/KCYDM9/ B09 Robert Luce PUBLISH DG8G7Q@@pretalx.com

-DG8G7Q

Mojo 🔥 - Is it Python's faster cousin or just hype? en

20240424T110500 20240424T113500 0.03000

Mojo 🔥 - Is it Python's faster cousin or just hype?

Background & Motivation The introduction of Mojo by Chris Lattner captured the attention of the Python community with the allure of dramatic performance enhancements and a syntax that would not alienate current Python developers. As Mojo progresses beyond its infancy, it's critical to assess its evolution and its capacity to disrupt the programming ecosystem, particularly within artificial intelligence and machine learning domains. Objective & Scope This presentation will share findings from an AI Safety Camp project which used Mojo to build a Large Language Model Mechanistic Interpretability and Activation Engineering library. Through our exploration, we aim to provide a candid narrative of Mojo's strengths and limitations, judge its performance claims, and probe its likelihood of adoption for AI development. Content Overview Introduction to Mojo: Brief overview of Mojo's conception, ethos, and intended use-cases. Performance Claims: A further look at the purported 68,000x speed increase over Python, including benchmark comparisons and real-world application data. Language Design: An analysis of Mojo's syntax and semantics, drawing parallels and contrasts with Python, and the implications for developers transitioning to or adopting Mojo. Case Study: Detailed account of the process of writing a Large Language Model Interpretation library in Mojo, highlighting the challenges and breakthroughs experienced. Ecosystem Overview: Examination of the current state of Mojo's ecosystem, its community support, and the availability of tooling and libraries. Discussion: Engaging the audience in a discussion about Mojo's potential future, its fit within existing projects, and the propensity for it to become the primary language for AI development. 
Conclusion We'll wrap up with predictions for Mojo's trajectory based on our experiences and broader industry trends, potentially setting the stage for Mojo to capture the "Mojo" it needs to triumph or to become a footnote in the annals of programming language history. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/DG8G7Q/ B09 Jamie Coombes PUBLISH TCSERC@@pretalx.com

-TCSERC

Enhance your balcony power plant with Python en

20240424T114000 20240424T121000 0.03000

Enhance your balcony power plant with Python

Plug-in solar systems, so-called balcony power plants, are getting more popular and more affordable as people want a simple way to participate in moving towards sustainable energy resources. They are easy to install without the need for an electrician. In this talk I will discuss how to figure out how much power a household consumes and how much can be covered by the balcony power plant. I will also exemplify different user profiles, like “working from home” or the “home in idle state”, and how they affect the efficiency of an additional battery system. The power consumption is measured by using devices, like WiFi plugs, from Shelly and myStrom, each offering a REST API. The power production is preferably recorded by using OpenDTU in combination with compatible microinverters but may be measured using WiFi plugs as well. These measured values are published to Redis and can be observed using WebSockets and FastAPI. Additionally, these values may be pushed to a public server running on FastAPI and Redis as well. A social login like Google or GitHub can be used to control the access to this server. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/TCSERC/ B09 Jannis Lübbe PUBLISH K8AL9P@@pretalx.com

-K8AL9P

Connecting batteries with Python: Towards EV Charging with #zero emissions at #zero costs en

20240424T131000 20240424T134000 0.03000

Connecting batteries with Python: Towards EV Charging with #zero emissions at #zero costs

The goal of The Mobility House is to create a zero-emission energy and mobility future. Our technology unites the automotive and energy industries. We integrate vehicle batteries into the power grid using intelligent charging and energy solutions. This way, we promote the development of renewable energies, stabilize the power grid, and make electric mobility more affordable. The goal of this talk is to give you an overview of how and where Python is used at The Mobility House. A hint upfront: we use it in many places. We use Python in all phases of development, and it enables us to go quickly from a proof of concept to production. Python helps us understand our data better, and using Python in production even changed our development culture and helped bridge the gap between data scientists and coders. However, Python does not solve all of our problems, so we will also talk about the roadblocks we hit and share the solutions which worked for us. PUBLIC CONFIRMED Sponsored Talk https://pretalx.com/pyconde-pydata-2024/talk/K8AL9P/ B09 Christopher Bock PUBLISH Y7R9GZ@@pretalx.com

-Y7R9GZ

Replacing Callbacks with Generators: A Case Study in Computer-Assisted Live Music en

20240424T134500 20240424T141500 0.03000

Replacing Callbacks with Generators: A Case Study in Computer-Assisted Live Music

At [Les Chemins de Traverse](https://www.lescheminsdetraverse.net/) we explore ways of "augmenting" acoustical musical instruments with new sonic possibilities offered by computers (think "augmented reality" for live music). To do so, we use Olivier Bélanger's great [pyo](http://ajaxsoundstudio.com/software/pyo/) module for realtime audio processing. To make the system interactive, this module allows registering callbacks on certain events. While this works great in many situations, it can get very cumbersome when we design a stateful system, where the same event must trigger different callbacks depending on the system's inner state. This talk will present how we developed a thin abstraction layer that allows us to replace many callback functions, along with the repeated registering and unregistering of these functions, with a single streamlined *generator* definition that's incomparably more readable than the many-callbacks version. This allows us to keep our mind focused on what's important, namely supporting the music we want to play, instead of tedious boilerplate code. While our use case is admittedly very specific, we believe that the ideas we present could be adapted to many other situations where callbacks are used for technical reasons, but lead to bulky and contrived code. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/Y7R9GZ/ B09 Matthieu Amiguet PUBLISH HSJGHH@@pretalx.com
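The general idea of replacing state-dependent callbacks with a generator can be sketched in plain Python. This is only an illustration under stated assumptions: it does not use pyo, it is not the speakers' abstraction layer, and every name in it (`GeneratorDispatcher`, `performance`, the pedal events) is hypothetical. A single callback forwards each event into a generator, and the generator's position between `yield` statements plays the role of the system state.

```python
from typing import Callable, Generator

log: list[str] = []


class GeneratorDispatcher:
    """Hypothetical helper: one fixed callback that feeds events to a generator."""

    def __init__(self, script: Callable[[], Generator[None, str, None]]) -> None:
        self._gen = script()
        next(self._gen)  # advance to the first `yield`, ready to receive events
        self.finished = False

    def on_event(self, event: str) -> None:
        # This is the only function that would ever be registered as a callback.
        if self.finished:
            return
        try:
            self._gen.send(event)
        except StopIteration:
            self.finished = True


def performance() -> Generator[None, str, None]:
    # Each `yield` waits for the next event; the code position is the state.
    event = yield
    log.append(f"start recording ({event})")
    event = yield
    log.append(f"stop recording, start loop ({event})")
    event = yield
    log.append(f"stop loop ({event})")


dispatcher = GeneratorDispatcher(performance)
for press in ["press 1", "press 2", "press 3", "press 4"]:
    dispatcher.on_event(press)
print(log)
```

The fourth event is ignored because the generator has already finished. With separate callbacks, the same behaviour would need three handler functions plus code to unregister one and register the next after every event.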

-HSJGHH

Bridging the worlds: pixi reimplements pip and conda in Rust en

20240424T144500 20240424T151500 0.03000

Bridging the worlds: pixi reimplements pip and conda in Rust

Pixi goes further than existing conda-based package managers in many ways: - Implemented from scratch in Rust and ships as a single binary - Integrates a new SAT solver called resolvo - Supports lockfiles like poetry / yarn / cargo - Cross-platform task system (simple bash-like syntax) A major requested feature was interoperability with PyPI packages. For this we have created a standalone library called rip. Rip contains all the code needed to download and extract wheels and SDist packages straight from PyPI, and also uses resolvo for resolution. We had to overcome some PyPI-specific hurdles that we want to discuss in the talk: - Lazy fetching of metadata, since on PyPI it is embedded in the wheel - Resolving Python packages for other platforms and locking them (since we want to resolve on Linux for Windows) We’re looking forward to taking a deep dive together into what conda and PyPI packages are and how we are seamlessly integrating the two worlds in pixi. We’ll also look at some benchmarks and explain more about the conda ecosystem and why it might still have a reason to exist (even though wheels also solve a lot of the pain points). More information about Pixi: - https://pixi.sh - https://prefix.dev PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/HSJGHH/ B09 Wolf Vollprecht Ruben Arts PUBLISH ML99UB@@pretalx.com

-ML99UB

There is a Better Way to Automate and Manage Your (Fluid) Simulations en

20240424T152000 20240424T155000 0.03000

There is a Better Way to Automate and Manage Your (Fluid) Simulations

This is a story about applying Python and the “hacker mindset” to Computer Aided Engineering (CAE), an emerging domain within the Python ecosystem. Shell scripts have traditionally been the preferred tool for automating CAE pipelines, especially in the subfield of Computational Fluid Dynamics (CFD). However, this approach is brittle, severely limited and cumbersome to manage at scale. Data management is also a challenge, with tens to hundreds of GB per simulation needing to be stored and versioned in complex folder structures. One possible approach is to use Python as an automation and glue language, together with Data Version Control (DVC), a Python-based tool built on top of git, to track pipelines and data. This talk will show you how to use Python to automate many tasks in CAE workflows, even when the tools don’t offer a native Python interface: - Exporting CFD simulation results from Starccm+ to a PowerPoint template with python-pptx and updating the final presentation with new simulation data - Preparing input data for an electrical thermal simulation to improve performance 80-fold Both examples will illustrate best practices and lessons learned in the automation of the CFD software that are applicable beyond the field. DVC was originally designed and is broadly used for machine learning pipelines, but its flexibility allows it to be adapted to other domains. The potential benefits for engineering applications are immense. This talk will show you how easy it is to convert an existing CAE pipeline to DVC and show the benefits: - Running hundreds of simulations, comparing them and choosing the optimal one with DVC - Managing software versions declaratively and comparing results across versions - Creating in-depth meta studies and comparing many simulations with Jupyter notebooks Finally, this talk will give an outlook on the changing CAE ecosystem and propose new features for DVC to better leverage it for this use case. 
**Audience** Either simulation engineers seeking to enhance and scale their workflows or software engineers aiming to build powerful and flexible simulation tooling. **Relevant talks or blog posts** - Sending Rovers to Mars with Jupyter - Managing OpenFOAM Physical Simulations with DVC, CML, and Studio - How Python enables future computer chips PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/ML99UB/ B09 Julian Wagenschütz PUBLISH BA7FZL@@pretalx.com

-BA7FZL

AsyncApp. My contribution to hype Pythons asyncio a bit more en

20240424T103000 20240424T110000 0.03000

AsyncApp. My contribution to hype Pythons asyncio a bit more

Asyncio has been introduced as a possible solution mainly for I/O-related performance problems. The traditional way to handle I/O often ends up in code that blocks the execution of concurrent elements in an application, often resulting in bad performance. The usual suspects when dealing with these problems, such as multiprocessing and threading, are often considered to be complex and not straightforward to use, especially for beginners. I believe that proper threading and multiprocessing, with all its interprocess or shared memory communication, locks and race condition prevention, as well as efficient object handling, still requires a deep understanding of the architecture and inner workings, and is still mainly a topic for experts. Asyncio comes to the rescue here, offering an abstraction at a lower and much easier to understand level. While it is no solution to aid in distributing code execution to gain more performance, it will solve the blocking issues quite efficiently. To demonstrate the power and simplicity of asyncio I will show a few object-oriented building blocks that will allow us to create a simple environment monitoring app for the Raspberry Pi. This app will - periodically gather sensor readings - log them - store the readings to a data file - offer a monitoring system to log CPU and memory usage for itself - be able to be configured via environment variables, config files and command line arguments In its final iteration the app will be distributed into small parts, each dealing with a single, very specific task, following the traditional UNIX philosophy for an app to do just one thing, but do it well. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/BA7FZL/ B07-B08 Jens Nie PUBLISH 9JEZ8E@@pretalx.com
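The periodic tasks listed in this abstract can be sketched with plain `asyncio`. This is a minimal sketch under stated assumptions and not the speaker's AsyncApp building blocks: the helper `periodic`, the simulated sensor, and the intervals are all hypothetical. A real app would read hardware and use a logger instead of appending to lists.

```python
import asyncio
import random
from typing import Callable

readings: list[float] = []
stats: list[int] = []


async def periodic(interval: float, count: int, job: Callable[[], None]) -> None:
    """Run `job` `count` times, yielding to the event loop between runs."""
    for _ in range(count):
        job()
        await asyncio.sleep(interval)  # other tasks run while this one waits


def read_sensor() -> None:
    # Simulated temperature reading; a real app would query a sensor here.
    readings.append(round(random.uniform(18.0, 25.0), 1))


def record_stats() -> None:
    # Stand-in for self-monitoring: record how many readings exist so far.
    stats.append(len(readings))


async def main() -> None:
    # Both periodic jobs share one thread and interleave without blocking.
    await asyncio.gather(
        periodic(0.01, 5, read_sensor),
        periodic(0.02, 3, record_stats),
    )


asyncio.run(main())
print(readings, stats)
```

Neither job blocks the other, because each one hands control back to the event loop while it sleeps.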

-9JEZ8E

High Performance Data Visualization for the Web en

20240424T110500 20240424T113500 0.03000

High Performance Data Visualization for the Web

The Python ecosystem has an ample supply of both web development frameworks and data visualization components. But despite the maturity of the ecosystem, few data visualization tools are capable of dealing with large amounts of streaming data. Even fewer are able to perform live aggregations, sorting, and filtering on top of this data. In this talk, we will put together a simple but full-featured website using [Perspective](https://perspective.finos.org). Perspective is an open source interactive analytics and data visualization component, which is especially well-suited for large and/or streaming datasets. It is written in C++ and Rust with bindings to both Python and WebAssembly, making it ideal for data-intensive applications. It comes with a variety of visualization plugins, including a datagrid and various charts. Additionally, it comes with a Jupyter widget, which allows developers to iterate quickly with a clear pathway to their production website. We will start with a simple [FastAPI](https://fastapi.tiangolo.com)-based website and some static data. In a few lines of code, we will have the website up and running. Next, we will demonstrate some of the core features of Perspective - pivoting, sorting, filtering, the various visualization plugins, cross-filtering (using one table as a filter on other tables), and computed columns. After this, we will pull in some streaming data and show how the Perspective functionality demonstrated earlier updates in real time alongside the data. Finally, we'll crank the speed of updates to the limit. By the end of this talk, the audience will know how to use Perspective and how to incorporate it into their own applications for both static and streaming data, either as a simple but high performance datagrid or as a full featured set of interconnected visualization components. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/9JEZ8E/ B07-B08 Tim Paine PUBLISH DKL7YQ@@pretalx.com

-DKL7YQ

How to Improve the Python Development Experience for Millions of Ubuntu Users en

20240424T114000 20240424T121000 0.03000

How to Improve the Python Development Experience for Millions of Ubuntu Users

Updating your current Python installation, or installing a different one, on Ubuntu is not an easy task. There are many reasons why you might want a different Python version on Ubuntu: - you want to use the latest version, but Ubuntu comes with an older one pre-installed - a Python app requires an older Python version - you want to test your Python library against multiple Python versions Unfortunately, `apt install python-<version>` won't work. After googling for a while, you'd learn that you have many options: - pyenv - deadsnakes - mamba/conda - or even compiling Python yourself Why isn't there a single way, and which one fits your needs the best? And why doesn't `apt install python-<version>` just work? There are many blog posts and tutorials out there to install a new Python version, but they lack the depth to understand the core of the problem. And are they up-to-date? Do you trust them not to break your Ubuntu installation? This talk will not only introduce and compare all the most common options to update a Python version or to install a new one on Ubuntu but will also convey the knowledge to assess the existing and upcoming options yourself. We will also look into the future. What new tools are on the horizon? And especially, what could Ubuntu do itself to make it easier for you and everybody? PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/DKL7YQ/ B07-B08 Jürgen Gmach PUBLISH 7NETLX@@pretalx.com

-7NETLX

µDjango, an asynchronous microservices technique. en

20240424T131000 20240424T134000 0.03000

µDjango, an asynchronous microservices technique.

The idea of a lightweight Django project isn't new. The single-py-file Django project paradigm first appeared in 2014 in the book Lightweight Django. I worked with a Django project consisting of only 2 files in 2015. At that time, the tiny Django project wasn't comparable to the capabilities of projects based on FastAPI or Flask. But a couple of years later, Django introduced ASGI, and in 2022, Django was ready for use in microservices. The concept of creating micro-projects on Django reappeared within the Django community in 2019 and again in the spring of 2023, and now we have a full-fledged technology for creating asynchronous microservices consisting of one or two files. It was named uDjango. In this talk, I will share my experience in creating high-performance microservices on Django and how I keep simplicity and minimalism in projects. During the talk, I'll discuss the advantages of Django microservices: * All-in-one package * Standard architecture and syntax * Extremely rapid development and deployment speed After years of work with the uDjango paradigm, I have identified the challenges in creating Django microservices: * The prevailing opinion that the 'Django framework isn't suitable for microservices' * Django settings.py - the cause of many problems. * URL routing in Django that could be stricter * Initialization time of forms and model objects reduces performance The result of this talk for the audience will be knowledge about µDjango, a ready-to-use technology for building synchronous and asynchronous microservices. Talk based on ideas of: Julia Elman and Mark Lavin, Lightweight Django, 2014. Will Vincent, django-microframework, 2019. Kirill Klenov, python benchmark repository, 2019. Carlton Gibson, LinkedIn post about one app Django project, 2022. Paolo Melchiorre, uDjango, 2023. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/7NETLX/ B07-B08 Maxim Danilov PUBLISH ZLDMGM@@pretalx.com

-ZLDMGM

Beyond Deployment: Exploring Machine Learning Inference Architectures and Patterns en

20240424T134500 20240424T141500 0.03000

Beyond Deployment: Exploring Machine Learning Inference Architectures and Patterns

This talk explains the major challenges of ML deployment and management, emphasizing inference patterns for robust, scalable applications. Using StepStone's infrastructure as an example, we'll discuss efficiently handling large workloads and complex models, including recent large language models, to ensure fast, cost-effective, and reliable results. The session begins with an introduction, highlighting the significance of ML inference and outlining the objective of providing insights into effective MLOps strategies. We'll then overview various ML inference patterns, emphasizing their advantages, disadvantages, and the importance of selecting the right pattern for specific use cases. Moving on, we'll delve into StepStone's ML inference strategy, showcasing real-world applications and how scalability, performance, and cost are managed while maintaining agility for frequent model updates and monitoring in production systems. In summary, this talk provides a practical roadmap of ML inference patterns with a focus on real-world implementation at StepStone. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/ZLDMGM/ B07-B08 Tim Elfrink PUBLISH QYPLJE@@pretalx.com

-QYPLJE

The AI Revolution Will Not Be Monopolized: How open-source beats economies of scale, even for LLMs en

20240424T144500 20240424T151500 0.03000

The AI Revolution Will Not Be Monopolized: How open-source beats economies of scale, even for LLMs

As ideas develop, we’re seeing more and more ways to use compute efficiently, producing AI systems that are cheaper to run and easier to control. In this talk, I'll share some practical approaches that you can apply today. If you’re trying to build a system that does a particular thing, you don’t need to transform your request into arbitrary language and call into the largest model that understands arbitrary language the best. The people developing those models are telling that story, but the rest of us aren’t obliged to believe them. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/QYPLJE/ B07-B08 Ines Montani PUBLISH DBGXJN@@pretalx.com

-DBGXJN

Jupyter Notebooks for Print Media en

20240424T152000 20240424T155000 0.03000

Jupyter Notebooks for Print Media

Jupyter Notebooks are the tool of choice for researchers and data scientists, and a lot of work has been done to take Jupyter Notebooks and turn them into standalone websites. From [Voilà](https://voila.readthedocs.io/en/stable/index.html) to [Jupyter Book](https://jupyterbook.org/en/stable/intro.html), with widget and app libraries galore, it has never been easier to take a notebook and produce an interactive website. In contrast, despite the origins of notebooks in academic research, comparatively less work has been done in building tools to take notebooks and produce print media - newspaper articles, business reports, textbooks, academic publications, etc. In this talk, we will do four things. First, we will motivate print media as a good target for Jupyter Notebooks. We will do so through three worked examples: - a data-driven news publication such as those from The New York Times - a computer science textbook - a business intelligence report Second, we will highlight the correct set of technologies for producing notebook-derived print media. In particular, we will discuss NBPrint, a small [NBConvert](https://nbconvert.readthedocs.io/en/latest/)-based library that leverages [paged.js](https://pagedjs.org), a free and open source library which has [been used to produce real, printed books](https://pagedjs.org/made-with-paged.js.html). Third, we will give an end-to-end example from Jupyter Notebook to publication quality result for one of the above examples, showing a side-by-side comparison with the original media. Finally, we will discuss the power of the notebook-oriented approach, and discuss which disciplines might be best suited for adopting notebooks as the source format for their print-oriented media. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/DBGXJN/ B07-B08 Tim Paine PUBLISH SSKV9R@@pretalx.com

-SSKV9R

Reinforcement Learning: Bridging The Gap Between Research and Applications en

20240424T103000 20240424T110000 0.03000

Reinforcement Learning: Bridging The Gap Between Research and Applications

Despite the very general applicability of reinforcement learning (RL) to a variety of decision and control problems, there are comparatively few applications of it in current industries. Moreover, many important developments emerging in the highly active RL research community do not get added to existing frameworks or libraries. Code written for successful RL applications in industry is also rarely contributed to open source software (OSS). This is in stark contrast to other areas of machine learning (ML), where reported progress is often transferred to mature OSS within weeks, if not days. Part of the reason behind this lamentable state may be the intrinsically higher complexity of RL when compared to, say, supervised learning. However, we believe that the lower permeation of RL in mature software arises in large part because writing RL-based software is currently much harder than it has to be. Widely used OSS for RL is either too complex for researchers to contribute to (like ray/RLlib or Pearl), too buggy and unstable for industry to consider (also RLlib), too limited in scope (like stable-baselines3, which includes relatively few algorithms), lacking high-level interfaces (like torch-rl), or even completely gives up on modularity (like cleanRL). Another reason is the difference in focus between RL research and applications. In research, an important goal is to find an algorithm that works well in a variety of environments, whereas in applications, one is usually interested in solving a particular environment of interest, by any means. This leads to wildly differing evaluation scenarios and selection criteria. We believe that the current state of RL software is reminiscent of the pre-PyTorch/pre-Keras era for supervised deep learning, when the implementation of a task like training a convolutional network on a large image dataset was non-trivial. Today, it requires but a few lines of code. 
We thus infer that significant progress in the software landscape supporting RL is still to be made, and that this progress will have high impact both on researchers and ML engineers. With this goal in mind, the appliedAI Institute for Europe, together with the core developers of the open source RL library Tianshou, took on the task of extending the latter in order to democratize RL in applications and accelerate reliable and trustworthy research on it. In this talk, we will highlight Tianshou’s high-level interfaces, which allow painless applications of RL algorithms in industry applications, as well as the lower-level interfaces that researchers can base their work on. Research code that is compatible with Tianshou’s interfaces will not only get mature evaluation, reporting and hyper-parameter optimization “for free”, but will also be much easier to use in applications, thereby boosting its impact. We will also address the question of environment design, which is a highly important RL engineering topic that is largely ignored in RL research. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/SSKV9R/ B05-B06 Michael Panchenko PUBLISH GNK3PV@@pretalx.com

-GNK3PV

Climate Crisis in Numbers en

20240424T110500 20240424T113500 0.03000

Climate Crisis in Numbers

About 5 years ago my co-founder and I launched alcemy, a Machine Learning startup to help decarbonize the cement and concrete supply chain. My primary motivation to run the startup is to find ways to tackle and prevent climate change and human-made global warming. In the course of building the company I not only wanted to understand how much we can contribute in our niche sector of cement and concrete, but also to get a better idea of the problem and its magnitude as a whole. So here’s my little guide to better grasp what climate change is all about through data. I am going to talk about a variety of things regarding climate change and the greenhouse effect: - CO2 Equivalence - Magnitude and origin of different emission sources - The consequences of global warming and our potentially grim future - A (very) brief outlook of what humankind needs to do to tame global warming PS: Absolutely no Python experience needed here ;-) PUBLIC CONFIRMED Sponsored Talk https://pretalx.com/pyconde-pydata-2024/talk/GNK3PV/ B05-B06 Robert Meyer PUBLISH UGJJMP@@pretalx.com

-UGJJMP

Lessons learned from deploying Machine Learning in an old-fashioned heavy industry en

20240424T114000 20240424T121000 0.03000

Lessons learned from deploying Machine Learning in an old-fashioned heavy industry

Introduction ------------------ **Cement alone is responsible for about 8% of worldwide CO2 emissions**. Fortunately, we have quickly learned that low-carbon alternatives to "conventional" cement and concrete already exist. For instance, 60% of carbon emissions can be avoided if burnt limestone, the main ingredient for cement, is replaced partly by limestone powder (which isn't burnt, and therefore doesn't release carbon into the atmosphere). Yet, these low-carbon cement recipes have a substantial shortcoming: They react much more sensitively to changes, e.g. changes in weather conditions or in the chemical and mineralogical composition of ingredients. As a consequence, low-carbon cements and the resulting concrete (made by mixing cement with sand and water) can only be reliably produced under laboratory conditions. We are changing this. We use data intelligence and predictive Machine Learning control to optimize production processes such that low-carbon cement and concrete can be manufactured in real plants and at scale. I will quickly introduce our solution that is already deployed in 5 cement plants. Moreover, we are currently prototyping to move into concrete production as well. Of course, we do this (mostly) in Python. Part 1: Machine Learning ------------------------------------- Machine Learning in production is vastly different from solving a Kaggle challenge. In fact, the particular choice of Machine Learning model is much less important than you think. I will cover the benefits of using rather simple models such as random forests or even linear regression in comparison to deep learning. If stuff goes wrong, and it will, interpretable and debuggable models are far superior to complex architectures. Also, having proper model evaluation that reflects production requirements, and good baselines for comparison, are always crucial first steps and pay off in the long run. 
It was surprising how much less time we spent on the core Machine Learning algorithms in comparison to infrastructure, such as deployments on AWS Fargate or k8s, re-training processes, proper database layout, or home-brewed tooling to allow easier configuration of dozens of ML models. Part 2: Data ------------------ We quickly learned that data is way more important than models. Some might have heard the phrase *Garbage in, garbage out* coined by programmers in the 50s. This is even more important when it comes to today's widespread usage of Machine Learning. We run ML not on our own data, but on data provided by our customers. While the level of data maintenance and quality that our customers are used to allows for in-house bookkeeping and short analyses, it does not necessarily suffice for ML. I will discuss why and how we spend a good amount of time cleaning and really drilling into the data provided by our customers. Moreover, differences between training and real-time inference data can be a real challenge. For example, it is not guaranteed that the location where samples are drawn from cement mills, i.e. the live data used for inference, is as representative of the actual cement as silo samples that can be used for training. Fine particles might not be captured simply due to the physical properties of the sample site. To tackle problems like these as a Machine Learning engineer, you have to become an expert in the domain in which your models are applied. You really need to understand the data in every detail, know how it is generated by your customers, and understand the context and consequences of all of your customers' processes. Part 3: Customers and Business ----------------------------------------------- Our customers are, of course, no Machine Learning experts. Why should they be? If they were, they wouldn't need us anyway. However, oftentimes we as Machine Learning engineers forget the ramifications of this.
I will talk about customer relations and their interactions with our Machine Learning models. For example, we had to deal with a rather skeptical customer not believing our models' predictions. They pretty much went against all recommendations made by the model. Although it is nice if in the end the model predictions turn out to be right, your customer does not necessarily feel the same way. On the contrary, the customer does not enjoy being wrong and may even feel mocked by a machine. Having a strong customer success team that knows both how ML works and, of course, how the customer operates and thinks is often more valuable than "rockstar" Machine Learning engineers. Lastly, a tough lesson to learn was that Machine Learning as a service should not be mistaken for a software as a service business model. Our marginal costs are not zero. Besides the great deal of consulting that is needed for every customer, on-boarding a new customer is time-consuming and needs a lot of work. Integrating into the existing infrastructure of cement plants (which are not top-notch IT companies) can be tough or downright frustrating at times. Therefore, scaling a Machine Learning startup can be hard, and we learned that it is better to hunt for elephants, i.e. a few high-paying customers, than for mice, i.e. many low-paying ones. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/UGJJMP/ B05-B06 Robert Meyer PUBLISH TMF8V7@@pretalx.com

-TMF8V7

How Python helped us uncover secrets of protein motion en

20240424T131000 20240424T134000 0.03000

How Python helped us uncover secrets of protein motion

Proteins are one of the main building blocks of the living world. They are largely responsible for the amazing diversity that we witness in nature around us. Although proteins are composed of sequences of just 20 amino acids, nature’s clever design has endowed them with an incredibly diverse set of functions. It is not an overstatement to say that this diversity and the myriad of ways proteins interact with each other is at the very heart of life. Therefore it is of utmost importance to understand their structure and function. Proteins are very large molecules, composed of thousands up to even millions of atoms connected in giant hairball-like structures. But still they are too tiny to be seen by any sort of microscope, even the most powerful ones. That is why, in order to “see” how they look, we use X-rays and shine them on crystals made entirely of a single protein species in the fascinating method of X-ray crystallography. It then gives us a picture of how the proteins look in unprecedented atomic detail. In order to perform their function, proteins also move their parts, but unfortunately this motion is too quick to be seen by any device. X-ray crystallography alone, although mighty in giving us the details, gives us only one static image. It is a bit like trying to tell the story of a movie just by seeing the movie poster. Therefore we have to simulate the motion of the protein by so-called molecular dynamics (MD) simulations. Basically we give the computer the initial positions of all the atoms that we know from X-ray crystallography and then kick them and see how the protein moves in time, in very tiny steps. This results in so-called MD trajectories which contain all atom positions in millions of steps. Needless to say, this results in very heavy data sets, usually hundreds of GB, that need to be processed somehow.
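To make the idea of "kicking the atoms and following them in very tiny steps" concrete, here is a minimal sketch of one common time-stepping scheme (velocity Verlet) applied to a single particle on a spring. This is a toy illustration only, not the project's simulation setup: real MD packages handle many interacting atoms and far more elaborate force fields, and the spring constant, mass, and step size below are assumptions chosen for the example.

```python
import math


def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate one particle in 1D; return the list of positions and the final velocity."""
    trajectory = [x]
    a = force(x) / mass
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt * dt  # move using current velocity and acceleration
        a_new = force(x) / mass             # recompute the force at the new position
        v = v + 0.5 * (a + a_new) * dt      # update velocity with the averaged acceleration
        a = a_new
        trajectory.append(x)                # one "frame" of the trajectory per step
    return trajectory, v


K = 1.0     # hypothetical spring constant
MASS = 1.0  # hypothetical particle mass


def spring_force(x):
    """Restoring force of a harmonic spring (stand-in for a real force field)."""
    return -K * x


def total_energy(x, v):
    """Kinetic plus potential energy of the toy system."""
    return 0.5 * MASS * v * v + 0.5 * K * x * x


trajectory, final_v = velocity_verlet(
    x=1.0, v=0.0, force=spring_force, mass=MASS, dt=0.01, n_steps=1000
)
```

Even in this toy case the trajectory holds one position per step, which hints at why trajectories for millions of atoms over millions of steps grow so large.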
In the project called “Allosteric communication pathways in oligomeric enzymes” (https://alokomp.irb.hr/) we have faced that very problem. How to extract information about protein movement from such enormous quantities of data? Of course the answer was to use the marvelous suite of Python tools available. Python has established itself as a de facto standard programming language in data science, and with the plethora of options already available for X-ray crystallography and MD analysis it was a logical choice (not to mention its awesomeness and being our favourite anyway). The whole project really displays how mature and diverse Python is to be able to tackle every single aspect of such a specialized problem. To begin with, we have centered the entire project around a web page built using Django. It serves both as a front-end with general information and as a web app for diving into the data. Behind it is a PostgreSQL relational database containing all the structural and derived data from a family of proteins, called PNPs, which serve as a sort of proof of concept (https://alokomp.irb.hr/pdbase/structures/). It also contains data derived from MD simulations and analysed with the MDAnalysis tool (https://www.mdanalysis.org/). It is hard to mention all the Python tools we have used for analysis of the data in the database. Of course, its backbone is the indispensable pandas, NumPy, SciPy, Dask, Jupyter, NetworkX, Bokeh and HoloViz, to name but a few. More specifically, we have developed a special approach (“avocado” plots, example https://alokomp.irb.hr/md/avocados/1458/A) to visualize the motion of the protein as a whole in time, as a series of snapshots each containing plots of millions of points, using the awesome Datashader library (https://datashader.org). We have also used the Ruptures (https://github.com/deepcharles/ruptures) library to detect changes in the positions of the protein and to detect correlations.
Everything is wrapped up in the form of an interactive web app which can be used to visually browse vast amounts of data, giving a whole new perspective on highly complex multidimensional data. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/TMF8V7/ B05-B06 Zoran Štefanić Boris Gomaz PUBLISH ZMC9FU@@pretalx.com

-ZMC9FU

525 days working full-time on FOSS: lessons learned en

20240424T134500 20240424T141500 0.03000

525 days working full-time on FOSS: lessons learned

## Outline ### Introduction (~5min) Personal and professional context for the talk: - Who am I? - What FOSS project have I been working on for 525 days? - Who am I working with? ### Lesson learned 1 – how to get a tech job (~5min) In this segment of the talk I share the story of how I got this job. This will explain how the writing on my blog helped establish some reputation and how my (Python-focused) social media presence connected me with the person who would eventually become my employer. ### Lesson learned 2 – put your ego aside (~5min) In this segment of the talk I explain how I deal with PR reviews and how I've learned to embrace the criticism, taking into account that all of your work is scrutinised every time you make a PR. I'll also tell the story of how I made a couple of blunders in successive PRs, how my team dealt with those, and what I took away from those weeks when I underperformed. ### Lesson learned 3 – interacting with users & contributors (~5/7min) This segment of the talk covers the other end of the interactions on a FOSS project, answering questions like: - How should you behave when interacting with users making feature requests? - What about users that report “bugs” that would be “solved” if they read the documentation carefully? - How do you review external PRs, leave feedback, and request changes? Depending on how the audience reacts to this segment, I might also tell an anecdote about how bad I felt when rejecting an external PR and how that feeling was amplified tenfold when I found out that the external PR came from a “Python personality”. The anecdote also contains another lesson, because the person whose PR was rejected handled it in the most graceful way possible. ### Lesson learned 4 – working on a large project (~5min) I will dedicate this segment of the presentation to the strategies I use to deal with the fact that the project I work on is too big for me to keep all of it in my head.
This includes my note-taking system and my PR checklist. ### Wrap-up (~2min) To wrap up the talk, I'll summarise my learnings and share a bullet-point list of the ones that are more likely to be helpful to others. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/ZMC9FU/ B05-B06 Rodrigo Girão Serrão PUBLISH VEACZM@@pretalx.com

-VEACZM

Python Monorepos: The Polylith Developer Experience en

20240424T144500 20240424T151500 0.03000

Python Monorepos: The Polylith Developer Experience

If you haven’t heard about Polylith before: it has a really simple take on Software Architecture - with tooling support. Polylith is based on small building blocks, very much like LEGO bricks. In fact, the Polylith Architecture originates from the Clojure community and is well suited for functional programming. It is a fresh take on how to share & reuse code, by using monorepos in a very developer-friendly way. And we have that in Python! I am the developer of the Open Source Python-specific tooling for Polylith. I’ll walk through the simple architecture & developer-friendly tooling for a joyful Python Experience. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/VEACZM/ B05-B06 David Vujic PUBLISH D7AEQY@@pretalx.com

-D7AEQY

Marketing Media Mix Models with Python & PyMC: a Case Study en

20240424T152000 20240424T155000 0.03000

Marketing Media Mix Models with Python & PyMC: a Case Study

Understanding the effectiveness of various marketing channels is crucial to maximise the return on investment (ROI). However, the limitation of third-party cookies and an ever-growing focus on privacy make it difficult to rely on basic analytics. This talk discusses a pioneering project where a Bayesian model was employed to assess the marketing media mix effectiveness of WeRoad, the fastest-growing Italian tour operator. The Bayesian approach allows for the incorporation of prior knowledge, seamlessly updating it with new data to provide robust, actionable insights. This project leveraged a Bayesian model to unravel the complex interactions between marketing channels such as online ads, social media, and promotions. We'll dive deep into how the Bayesian model was designed, discussing how we provided the AI system with expert knowledge, and presenting how delays and saturation were modelled. We will also tackle aspects of the technical implementation, discussing how Python, PyMC, and Streamlit provided us with all the tools we needed to develop an effective, efficient, and user-friendly system. Attendees will walk away with: - A simple understanding of the Bayesian approach and why it matters. - Concrete examples of the transformative impact on WeRoad's marketing strategy. - A blueprint to harness predictive models in their business strategies. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/D7AEQY/ B05-B06 Emanuele Fabbiani PUBLISH ECCJAG@@pretalx.com

-ECCJAG

FlixBus CitySnap: How we use GenAI and not only to collect captivating images for cities and confirm their locations en

20240424T103000 20240424T110000 0.03000

FlixBus CitySnap: How we use GenAI and not only to collect captivating images for cities and confirm their locations

Flix's buses serve over 5,000 cities, and to elevate our customers' experience, we aim to collect captivating photos for each city. The task of collecting city photos is not new, but previously it was predominantly addressed with human resources. However, due to the size and continued growth of our bus network, manually gathering photos for each city is infeasible and does not scale. In this talk, we will demonstrate how we built a fully automated end-to-end pipeline to achieve this goal. Our pipeline comprises three main steps. The first step involves collecting city images from free image stock services like Pixabay and Pexels, via API. Simple queries by city name yielded poor results, as not every image is enticing enough to inspire visits to the city. People often travel to see a city's landmarks, which is why we utilized ChatGPT to gather images of prominent landmarks for each city. The second and most complicated step is to verify that the images accurately represent the targeted cities. Initially, we relied on metadata from the image stock services, such as tags from photographers. However, this information is often not sufficient to validate an image's location. To improve accuracy, we investigated various services. Models like DALLE from OpenAI can predict image locations but currently lack an API for full automation. We found two services from the Google Cloud Platform with APIs suitable for location validation: the Gemini multimodal and the landmark detection service. The third and final step of our pipeline involves adjusting the images to various resolutions for display across different platforms, such as social media campaigns on Instagram, email marketing, and our website. This is achieved by cropping images to the desired aspect ratios using Google Cloud Vision API's smart cropping service, followed by Lanczos sampling for image downscaling, which is available in various open-source Python libraries.
Our pipeline is a cost-efficient approach using widely available services, thereby facilitating easy replication. During this presentation, we will share our results across several countries, discuss the most challenging problems we encountered, and offer insights into how this pipeline could be improved with the release of upcoming cutting-edge models. We believe that our case shows how the industry can use Generative AI not only to create a new context, but also to find, analyze and filter publicly available information for different business needs. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/ECCJAG/ A1 Andrei Chernov PUBLISH DEKGYM@@pretalx.com

-DEKGYM

Public Money, Public Experiment - open source processes in the public administration en

20240424T110500 20240424T113500 0.03000

Public Money, Public Experiment - open source processes in the public administration

For us, as one of many data labs in the public administration, sharing code and software increases the speed with which technical problems can be solved and reduces overall costs. In the previous months, we started collaborating with other public units to share a Python prototype between labs. Now it's time for the next step: as we approach PyCon DE & PyData Berlin 2024, we aim to make code publicly available. The presentation will address the following questions: 1. What can the process of publishing code look like in a public administration and where can you get access to code already published? (Spoiler: Check out OpenCoDE) 2. How does open source align with public administration principles? 3. What legal, political, and security requirements shape the process and possibly the code base? Whether we succeed or encounter challenges, this talk serves as an attempt to transparently share our journey and contribute to the broader discourse on the intersection of public administration and open source initiatives. Join us at PyCon DE & PyData Berlin 2024 and stay tuned for a glimpse into the evolving landscape of our code publication. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/DEKGYM/ A1 Lisa Reiber PUBLISH QCNXLW@@pretalx.com

-QCNXLW

Improve LLM-based Applications with Fallback Mechanisms en

20240424T114000 20240424T121000 0.03000

Improve LLM-based Applications with Fallback Mechanisms

Large Language Model (LLM)-based systems have demonstrated remarkable advancements in various natural language processing (NLP) tasks, particularly through the Retrieval Augmented Generation (RAG) approach. This approach addresses some of the pitfalls associated with LLMs, such as hallucination or issues related to the recentness of its training data. However, RAG systems may encounter other challenges in real-world scenarios, including handling out-of-domain queries (e.g., requesting medical advice from a finance app), struggling to generate meaningful answers from retrieved data, or failing to provide any answer at all. To address these situations effectively, it is necessary to implement a fallback mechanism capable of gracefully handling such scenarios. 🧗 This fallback mechanism can incorporate alternative strategies, such as conducting a web search with the same query to retrieve more up-to-date information or utilizing alternative information sources (such as Slack, Notion, Google Drive, etc.) to gather more relevant data and generate a satisfactory or comprehensive response. However, the question arises: how can we determine if the response is inadequate? 🤔 During this session, we will explore various fallback mechanism techniques and ensure that our system can assess the adequacy of a response and improve it if necessary without human intervention. On the practical side, we will use the open source LLM framework Haystack to implement end-to-end RAG systems. By the end of this talk, you will have learned to select the appropriate fallback method for your use case, enabling you to develop more dependable and versatile LLM-based systems and implement them effectively using Haystack. 💪 PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/QCNXLW/ A1 Bilge Yücel PUBLISH CWUQF3@@pretalx.com

-CWUQF3

Is GenAI All You Need to Classify Text? Some Learnings from the Trenches en

20240424T131000 20240424T134000 0.03000

Is GenAI All You Need to Classify Text? Some Learnings from the Trenches

In recent times, GenAI has sparked fervent excitement, sometimes touted as the panacea for all natural language processing (NLP) tasks. This presentation explores a practical text classification scenario at Malt, highlighting first the practical hurdles encountered when employing GenAI (latency, environmental impact, and budgetary constraints). In the second part, we’ll cover how we overcame these obstacles by building a small dedicated model on top of a pre-trained SentenceBERT [1], a model trained on semantic similarity. We'll explain how training a classification network on top of it preserves the original language alignment [2], enabling multilingual generalization. Next, we'll unveil the secret to unlocking even more efficiency: quantization and graph optimization techniques thanks to the ONNX ecosystem [3]. These optimizations further reduce the latency and resource consumption of this dedicated model and enable it to be deployed with just a CPU. Finally, we’ll see that GenAI still plays a relevant role in our text classification journey. Its unparalleled zero-shot capabilities allow us to continuously adapt our dedicated model, ensuring it remains relevant amidst an ever-changing product. [1] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. [2] Reimers, N., & Gurevych, I. (2020). Making monolingual sentence embeddings multilingual using knowledge distillation. [3] https://onnx.ai/onnx/ PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/CWUQF3/ A1 Marc Palyart Kateryna Budzyak PUBLISH UXQJTF@@pretalx.com

-UXQJTF

Mostly Harmless Fixed Effects Regression in Python with PyFixest en

20240424T134500 20240424T141500 0.03000

Mostly Harmless Fixed Effects Regression in Python with PyFixest

When regression models contain very high-dimensional categorical features, estimation can become cumbersome: inverting a matrix with more than a few hundred rows is no simple task! Fortunately, the problem of estimating models with high-dimensional fixed effects has been effectively solved since at least the 1930s. A range of software packages now implement what is known as the Frisch-Waugh-Lovell Theorem (FWL) for efficient estimation of regression models with high-dimensional fixed effects. These packages are available in various programming languages, including Stata, R, Julia, and Python. Among these, the R package fixest particularly stands out. It is not only blazing fast but also offers an innovative and user-friendly post-estimation functionality and syntax. When I started my journey with Python, fixest was the R package I missed the most. In fact, I missed it so much that I began working on PyFixest, a software package that aims to faithfully replicate all of fixest's innovations in Python. In this talk, I will introduce the audience to both fixest and PyFixest and the FWL theorem that underpins these packages. We will explore how PyFixest can be used for analyzing AB Tests and for conducting event studies with staggered rollouts. For more information: - PyFixest GitHub repository: https://github.com/s3alfisc/pyfixest - Introduction to PyFixest: https://aeturrell.github.io/coding-for-economists/econmt-regression.html#regression-basics - PyFixest Documentation: https://s3alfisc.github.io/pyfixest/ PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/UXQJTF/ A1 Alexander Fischer PUBLISH BJUQ9E@@pretalx.com

-BJUQ9E

Can ChatGPT convince you to get a COVID19 vaccine? Comparing ChatGPT to an expert system - which one is more convincing? en

20240424T144500 20240424T151500 0.03000

Can ChatGPT convince you to get a COVID19 vaccine? Comparing ChatGPT to an expert system - which one is more convincing?

Chatbots have the potential of being used as dialogical argumentation systems for behaviour change applications. They thereby offer a cost-effective and scalable alternative to in-person consultations with health professionals that users could engage in from the comfort of their own home. During events like the global COVID-19 pandemic, it is even more important than usual that people are well informed and make conscious decisions that benefit themselves. Getting a COVID-19 vaccine is a prime example of a behaviour that benefits the individual, as well as society as a whole. In 2021, prior to the release of ChatGPT, we presented a chatbot (developed in Python using scikit-learn and Flask) that engaged in dialogues with users who did not want to get vaccinated, with the goal of persuading them to change their stance and get a vaccine. The chatbot was equipped with a small repository of arguments that it used to counter the arguments users presented in free text for why they were reluctant to get a vaccine. We evaluated our chatbot in a study with participants and found that 20% of the participants had a positive change in stance (e.g. changing their stance from "unlikely to get a vaccine" to "neutral" or "likely to get a vaccine" after chatting with the chatbot). The rapid advancements in natural language processing and the release of technologies such as ChatGPT raise the need to compare them to traditional expert systems in order to (1) identify potential problems in the new technologies and (2) assess whether they can replace traditional expert systems. Several studies have already used ChatGPT to address vaccine hesitancy and to tackle vaccine myths and concluded that ChatGPT is indeed a reliable source of non-technical information to the public.
We were, therefore, interested to compare our system to ChatGPT and simulate the conversations participants had with our chatbot using ChatGPT and evaluate which conversations were considered more convincing by crowdsourced participants who are not domain experts. Research like this helps us understand whether we need to continue investing resources into domain specific expert systems or rather invest them into improving ChatGPT and make it more reliable and credible to avoid spreading misinformation. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/BJUQ9E/ A1 Dr. Lisa Andreevna Chalaguine PUBLISH DWGV7W@@pretalx.com

-DWGV7W

The Struggles We Skipped: Data Engineering for the TikTok Generation en

20240424T152000 20240424T155000 0.03000

The Struggles We Skipped: Data Engineering for the TikTok Generation

A tale of two junior data engineers. Our generation of developers might have it “easy” due to there being a plethora of tools available to automate and plug and play everything. However, this abundance poses challenges in breaking into a field. This talk explores the perspectives of two junior data engineers (one entirely new to data and the other with a data science background), both navigating the complexities of data engineering. The first one is a data scientist who had to navigate her tasks without the luxury of well-formatted data. This journey inadvertently led to a gradual familiarity with complex tools like Spark, and the necessity of understanding various connectors and writing detailed code for data extraction and normalization. With the introduction of dlt, a significant shift occurred. This technology automated many of the tedious processes, allowing analysts to focus more on analytics, and less on tedious data handling. The second one, never having had to deal with the chaos of unstructured data, was directly introduced to dlt. Spared the typical struggles faced by traditional data engineers, she's set to find out what happens behind dlt’s automation throughout the talk. After realizing that the two lines of Python code she wrote saved her from the manual tasks of data normalization, structuring, and loading, she will gain an appreciation for the tools at her disposal, especially dlt. dlt, or data load tool, is an open-source Python library for data teams of all sizes. It can extract a range of data formats from various sources, then normalizes that unstructured data into a relational structure and loads it into the destination of your choice. All of this is done within a few lines of Python code, as compared to the usage of different tools that were needed to get these tasks done. It is a valuable and cost-effective addition to a company’s data stack.
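As a rough illustration of what "normalizing unstructured data into a relational structure" means, the following standard-library sketch splits nested records into a flat parent table and a child table linked by a key. It is a hand-written toy, not dlt's API or implementation; the function name `normalize`, the `__` separator, and the `_id` / `_parent_id` columns are assumptions made for this example.

```python
def normalize(records, table):
    """Split nested records into flat rows, grouped by (hypothetical) table name."""
    tables = {}

    def add_row(table_name, record, parent_id=None):
        rows = tables.setdefault(table_name, [])
        row_id = f"{table_name}:{len(rows)}"  # illustrative surrogate key
        row = {"_id": row_id}
        if parent_id is not None:
            row["_parent_id"] = parent_id     # link back to the parent row
        rows.append(row)
        flatten(table_name, record, row, row_id, prefix="")

    def flatten(table_name, record, row, row_id, prefix):
        for key, value in record.items():
            column = f"{prefix}{key}"
            if isinstance(value, dict):
                # Nested object: becomes extra columns on the same row.
                flatten(table_name, value, row, row_id, prefix=f"{column}__")
            elif isinstance(value, list):
                # Nested list: becomes rows in a child table.
                for item in value:
                    child = item if isinstance(item, dict) else {"value": item}
                    add_row(f"{table_name}__{column}", child, parent_id=row_id)
            else:
                row[column] = value

    for record in records:
        add_row(table, record)
    return tables


users = [
    {"name": "Ada", "address": {"city": "Berlin"}, "orders": [{"sku": "A1"}, {"sku": "B2"}]},
    {"name": "Bob", "address": {"city": "Paris"}, "orders": []},
]
tables = normalize(users, "users")
```

Writing and maintaining this kind of code by hand for every source is the manual work the abstract refers to.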
The talk will follow a step-by-step, linear narrative to outline the challenges of building a data pipeline and illustrate how dlt can resolve these issues, thereby automating the process. Beginning with schema inference and evolution, then progressing to dependency handling and data governance, each challenge will be portrayed as a quest on the journey to constructing a well-defined data pipeline. As junior data engineers, we would like to emphasize the paradigm shift in data engineering towards a greater level of abstraction. This shift, enabled by tools such as dlt's declarative incremental loading, empowers junior engineers to tackle tasks that traditionally would not be considered junior-level work. PUBLIC CONFIRMED Talk https://pretalx.com/pyconde-pydata-2024/talk/DWGV7W/ A1 Anuun Hiba Jamal PUBLISH SYJE7B@@pretalx.com

-SYJE7B

Lose your fear of equations! en

20240424T103000 20240424T120000 1.03000

Lose your fear of equations!

If you transitioned into data science from "soft" sciences, you've already had a steep learning curve. Coding, data engineering, statistics... There is a lot to catch up on. And while there are plenty of true black box models in machine learning, just as many can and should be described in mathematical terms. This tutorial is for everyone who is scared by formulae. We will learn how to quickly recognize which part of an equation matters and how changing individual parameters will affect it. We will make differential equations less scary and get a "feel" for the logistic function that goes beyond running Logreg in sklearn. PUBLIC CONFIRMED Tutorial https://pretalx.com/pyconde-pydata-2024/talk/SYJE7B/ A03-A04 Darina Goldin PUBLISH LERYUY@@pretalx.com

-LERYUY

A deep dive into the Arrow Columnar format with pyarrow and nanoarrow en

20240424T130000 20240424T143000 1.03000

A deep dive into the Arrow Columnar format with pyarrow and nanoarrow

**You can find the material and setup instructions at https://github.com/voltrondata-labs/2024-arrow-format-tutorial/** According to the website, Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing. Nowadays, the Arrow project encompasses many things, including serialization, messaging and database specifications and a variety of language implementations. But at its core is the Columnar Format: a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. This format is being used (fully or partially) by many libraries that you might know, such as pandas, polars, datafusion, duckdb, cudf, influxdb, and many more. This tutorial will dive into the details of the Columnar format, explore the physical memory layout and the different data types. It will do so with interactive code examples using the pyarrow and nanoarrow libraries, showing how you can create and inspect Arrow data with those libraries. Along the way you will also learn a bit about those two libraries, but the insights about the columnar format itself are general and apply to any project using such data under the hood. PUBLIC CONFIRMED Tutorial https://pretalx.com/pyconde-pydata-2024/talk/LERYUY/ A03-A04 Joris Van den Bossche Alenka Frim Raúl Cumplido PUBLISH A8HJHV@@pretalx.com

-A8HJHV

Securing Python: Race Condition Vulnerabilities en

20240424T103000 20240424T120000 1.03000

Securing Python: Race Condition Vulnerabilities

We will begin by exploring the fundamentals of race conditions, and understanding how concurrent processes can lead to unpredictable and hazardous outcomes. This segment focuses on the theoretical underpinnings and real-world implications of these conditions in Python applications. Next, the workshop transitions into a more hands-on approach. Participants will be presented with small, intentionally vulnerable Python applications. These applications are designed to showcase various forms of race conditions, providing a practical context for understanding their impact. We will analyze the source code of these applications, identifying the critical sections where race conditions occur and discussing why these vulnerabilities are often overlooked during development. Following the analysis, the workshop shifts to the offensive aspect. We will simulate attacks exploiting these race conditions. This exercise aims to demonstrate the ease with which malicious entities can take advantage of these vulnerabilities, underscoring the importance of addressing them in the development phase. The final segment of the workshop is dedicated to resolution strategies. We will explore various techniques and best practices to mitigate race conditions in Python. This includes implementing thread synchronization mechanisms, such as locks, semaphores, and queues, and adopting safe programming practices that minimize the risk of concurrent execution issues. We'll also discuss how to incorporate these strategies into the software development lifecycle to enhance code quality and maintainability. Throughout the workshop, emphasis will be placed on clean, maintainable, and secure code architecture, aligning with contemporary best practices in Python development. 
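As a small illustration of the lock-based mitigation mentioned above, here is a minimal sketch of a check-then-act operation (a withdrawal) protected by `threading.Lock`. It is not taken from the workshop material; the `Account` class and the `run_withdrawals` helper are hypothetical names invented for this example. Without the lock, two threads could both pass the balance check before either subtracts, which is the kind of race the workshop sets out to exploit.

```python
import threading


class Account:
    """Hypothetical account whose withdraw() is a check-then-act operation."""

    def __init__(self, balance):
        self.balance = balance
        self._lock = threading.Lock()

    def withdraw(self, amount):
        # The lock turns "check the balance, then update it" into one critical
        # section, so no other thread can interleave between the two steps.
        with self._lock:
            if self.balance >= amount:
                self.balance -= amount
                return True
            return False


def run_withdrawals(account, amount, n_threads):
    """Start n_threads concurrent withdrawals and collect their outcomes."""
    results = []
    results_lock = threading.Lock()

    def worker():
        ok = account.withdraw(amount)
        with results_lock:
            results.append(ok)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


account = Account(balance=100)
results = run_withdrawals(account, amount=10, n_threads=50)
```

With the lock in place, exactly ten of the fifty withdrawals can succeed and the balance cannot go negative, regardless of how the threads are scheduled.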
By the end of the session, participants will not only have a thorough understanding of race conditions and their security implications but also possess the knowledge and tools to identify, exploit, and mitigate these vulnerabilities in their Python projects. PUBLIC CONFIRMED Tutorial https://pretalx.com/pyconde-pydata-2024/talk/A8HJHV/ A05-A06 Shahriyar Rzayev PUBLISH AT9HCG@@pretalx.com

-AT9HCG

Django loves strawberries en

20240424T130000 20240424T143000 1.03000

Django loves strawberries

**Update: Please prepare for the workshop as described [here](https://github.com/Speedy1991/strawberry-workshop).** Delve into the world of GraphQL Strawberry and Django in this comprehensive workshop designed to unravel the intricacies of these technologies. Throughout the sessions, participants will navigate the synergy between Strawberry, a GraphQL library for Python, and Django, a robust web framework. The workshop kicks off with an exploration of type definitions, offering insights into creating robust schemas and defining custom types to suit project requirements. Moving beyond the fundamentals, attendees dive into the realm of queries and mutations, mastering the art of fetching data and manipulating it through GraphQL. With Django's ORM seamlessly integrated into Strawberry, participants discover how to effortlessly execute complex queries and mutations. Furthermore, the workshop explores the integration of Starlette, a lightweight ASGI framework, into the mix. Uncover how Starlette complements Django and Strawberry, enhancing API development with its performance and flexibility. The hands-on approach of this workshop ensures participants grasp each concept thoroughly. Through guided exercises and practical examples, attendees gain confidence in implementing GraphQL APIs using Strawberry and Django, unlocking the potential to build robust and scalable applications. By the workshop's conclusion, participants will have a comprehensive understanding of: - Creating GraphQL schemas using Strawberry and Django - Executing queries and mutations seamlessly within Django applications - Leveraging Starlette for efficient API development alongside Django Whether you're a seasoned developer or new to these technologies, this workshop promises to equip you with the skills needed to harness the combined power of GraphQL Strawberry and Django for your projects' success.
PUBLIC CONFIRMED Tutorial https://pretalx.com/pyconde-pydata-2024/talk/AT9HCG/ A05-A06 Arthur Bayr