PyConDE & PyData Berlin 2024
Twelve years as the Executive Director of NumFOCUS has given me a unique perspective on the open source scientific ecosystem. Building an organization to support project communities has taken me down many roads. Navigating these paths has been rewarding and challenging. We will look at lessons learned as I share my experiences through observations and insights on projects, community leadership, education, and fundraising.
Discover how Babbel bridged the gap between tailored language learning and scalability through an AI-aided content creation tool. Our approach amalgamates human expertise with Generative Artificial Intelligence, enabling personalized content creation on a large scale. Join us on our development journey and the different iterations we went through. We will demo the tool's current version and its AI features. Learn about the tech stack and what lies ahead in our development pipeline.
At mobile.de, we aim to provide a satisfying search experience so users can quickly find the vehicles they are looking for. We make this happen with machine learning systems working 24x7 in the backend, which continuously learn changing user interests and optimize the search experience. Based on techniques like learning to rank with XGBoost, this talk will discuss our current search relevance ranking framework and how it ranks millions of searches daily.
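For a rough flavour of the learning-to-rank setup, here is a toy sketch (not mobile.de's actual pipeline; data and feature names are invented, and a recent XGBoost with qid support in the scikit-learn API is assumed), where items are grouped by the search they belong to:

    import numpy as np
    import xgboost as xgb

    # Toy data: each row is a vehicle listing, each qid marks which search it belongs to
    X = np.random.rand(8, 3)                    # item features (e.g. price, mileage, age)
    y = np.array([3, 2, 1, 0, 2, 1, 0, 0])      # graded relevance labels
    qid = np.array([1, 1, 1, 1, 2, 2, 2, 2])    # query/search ids, already sorted

    ranker = xgb.XGBRanker(objective="rank:ndcg", n_estimators=50)
    ranker.fit(X, y, qid=qid)
    scores = ranker.predict(X)   # higher score = ranked higher within its own search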
Managing a database and synchronizing service data representation with the database can be tricky. In this workshop, you’ll learn how to use SQLAlchemy, a powerful SQL toolkit, to simplify this task. We’ll cover how to leverage SQLAlchemy’s Object Relational Mapper (ORM) system, and how to use SQLAlchemy's asyncio extension in your async services.
Participants will walk out of this tutorial having learned how to:
- Use SQLAlchemy for database operations in Python, enhancing the readability and maintainability of the code
- Build Python classes (ORMs) that represent the database tables
- Experiment with different relationship-loading techniques to improve querying performance
- Utilize SQLAlchemy’s asyncio extension to interact with databases asynchronously
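As a minimal preview of these pieces together (assuming SQLAlchemy 2.0 with the aiosqlite driver; the table and model names are invented for illustration):

    import asyncio
    from sqlalchemy import select
    from sqlalchemy.ext.asyncio import async_sessionmaker, create_async_engine
    from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

    class Base(DeclarativeBase):
        pass

    class User(Base):                     # an ORM class representing a database table
        __tablename__ = "users"
        id: Mapped[int] = mapped_column(primary_key=True)
        name: Mapped[str]

    async def main():
        engine = create_async_engine("sqlite+aiosqlite:///:memory:")
        async with engine.begin() as conn:
            await conn.run_sync(Base.metadata.create_all)   # create tables
        Session = async_sessionmaker(engine)
        async with Session() as session:
            session.add(User(name="Ada"))
            await session.commit()
            users = (await session.execute(select(User))).scalars().all()
            print(users)

    asyncio.run(main())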
For the past decade, my journey as a dedicated community organizer has allowed me to immerse myself deeply in the Python community, experiencing its extraordinary growth firsthand. The transition of Python from being a top-10 contender to becoming the foremost programming language has been an exhilarating experience, propelled by a burgeoning community and its foray into fields such as data science and artificial intelligence. The inclusivity and camaraderie within the Python community have been pivotal, illustrating how collective effort and a nurturing culture are instrumental to its current standing.
This presentation is crafted to disseminate the pivotal lessons and best practices that have emerged from my decade-long engagement. During this period, I have played a key role in organizing over twenty Python/PyData conferences, including notable events like PyCon.DE, PyData Berlin, EuroPython, EuroSciPy, and PyData Global.
It is for anyone who wants to learn more about, contribute to, and organize themselves within the Python and PyData community.
This talk will address:
* How it works: community backstage
* Why it works: community organizations
* Lessons learned:
  * community leadership & team dynamics
  * balancing ideas and realities
  * personal & professional growth
* How to contribute as an individual, community or company
* How organizations like the PySV, NumFOCUS or PioneersHub serve the community
RAG (Retrieval-Augmented Generation) is the process of querying a (large) set of documents with natural language, leveraging vector search and LLMs. While it has recently become widely accessible to develop a proof-of-concept RAG using OpenAI and one of the various open-source contributions (e.g. langchain), building a performant RAG that brings value to users is challenging.
This talk will focus on learnings from building a RAG for a medical company, to allow doctors to query drug documentation with natural language, using tools like Chainlit, Qdrant and Langsmith.
Naturally, a product question emerged: how to effectively leverage LLMs that can never guarantee 100% accuracy in the health sector?
We will explain how we addressed this challenge, as well as the various technical improvements implemented to enhance both the retrieval (vector search) and generation (LLM) metrics of our RAG.
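To make the overall retrieve-then-generate flow concrete, here is a heavily simplified sketch (not the production system): the embed() function, collection contents, and payloads are stand-ins, and the final generation call is left to the LLM of your choice.

    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, PointStruct, VectorParams

    def embed(text):                      # stand-in for a real embedding model
        return [float(ord(c) % 7) for c in text[:8].ljust(8)]

    client = QdrantClient(":memory:")     # in-memory instance, just for the sketch
    client.create_collection("drug_docs",
                             vectors_config=VectorParams(size=8, distance=Distance.COSINE))
    client.upsert("drug_docs", points=[
        PointStruct(id=1, vector=embed("Max daily dose: 4 g."),
                    payload={"text": "Max daily dose: 4 g."}),
    ])

    hits = client.search(collection_name="drug_docs",
                         query_vector=embed("What is the maximum dose?"), limit=3)
    context = "\n".join(h.payload["text"] for h in hits)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: What is the maximum dose?"
    # pass `prompt` to the LLM for the generation step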
This talk introduces a new workflow for building your machine learning models using the capabilities of modern databases that support machine learning use cases natively. It gives an overview of how machine learning models are created today and how they could look in the near future by utilising the features provided by current databases.
Metaclasses. What are they? Where do they live? How do they reproduce?
Did you know that you can make your classes receive keyword arguments, just like functions? And that they can be decorated as well?
Do you want to understand how classes, metaclasses and decorators work and what they are good for?
In this hands-on coding session we will inspect the inner workings of how Python creates classes, and how decorators, metaclasses and methods from superclasses can influence this process.
We'll explore:
- normal and special methods
- how attribute lookup works between instances and classes
- what are descriptors, and how they fit into the attribute lookup process
- what is the relationship between instances, classes and metaclasses
- what are metaclasses for
- and some other metaprogramming odds and ends
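For a small taste of the topic (a minimal sketch, not the session's actual exercises): a metaclass that records every class it creates, plus a class-level keyword argument consumed by __init_subclass__.

    class Registry(type):
        classes = []

        def __new__(mcls, name, bases, namespace, **kwargs):
            cls = super().__new__(mcls, name, bases, namespace, **kwargs)
            Registry.classes.append(cls)          # the metaclass sees every new class
            return cls

    class Base(metaclass=Registry):
        def __init_subclass__(cls, label=None, **kwargs):
            super().__init_subclass__(**kwargs)
            cls.label = label or cls.__name__.lower()

    class Widget(Base, label="my-widget"):        # keyword argument, just like a function call
        pass

    print(Widget.label)        # my-widget
    print(Registry.classes)    # [<class 'Base'>, <class 'Widget'>]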
All that is required for you to enjoy this session is that you have written a class in Python. If you've done the original Python Tutorial, that should be more than enough.
The Python community has been making efforts in improving the diversity and representation among its members. There are examples of success stories such as PyCon US Charlas, PyLadies, Djangonaut, and Django Girls. Yet in the Python podcast community, women are still underrepresented, making up only 17% of invited guests among the popular podcast series. Being a guest on a podcast is a privilege, and an opportunity to influence the Python community. There are many women and underrepresented group members who have made impactful contributions to the Python community globally, and they deserve the recognition and to be heard by the rest of us. Disheartened by the lack of representation of women on Python podcasts, and inspired by others who have shown us how diversity in the community can be improved through intentionality, we decided to start a podcast with the goal of highlighting their voices so that they receive the recognition they deserve. In this talk, learn about them, and about our podcast series. We'll also share how you can further help our cause in improving representation and diversity in the Python community.
Data valuation techniques compute the contribution of training points to the final performance of machine learning models. They are part of so-called data-centric ML, with immediate applications in data engineering like data pruning or improved collection processes, and in model debugging and development. In this talk we demonstrate how the open source library pyDVL can be used to detect mislabeled and out-of-distribution samples with little effort. We cover the core ideas behind the most successful algorithms and illustrate how they can be used to inspect your data to extract the most out of it.
To rewrite or not to rewrite: it's a major question.
Releasing new software versions with breaking changes can be disruptive to a community, but sometimes they are necessary in the long run to move forward.
Haystack is a free open source Python LLM framework. It was launched in 2020, before LLMs were cool. In 2023 we decided to undergo a major re-architecture, culminating in the GA release of Haystack 2.0. It wasn't an easy decision. By involving the open source community and some big companies in our design process early on, we are confident we built a more usable, flexible foundation for years to come.
In this talk I'll tell you the story of this rewrite and the decisions we made to bring the project forward with the right level of flexibility and composability in the rapidly changing LLM landscape. I won't only show you the new features 2.0 provides, but also give you a peek into our future roadmap. You'll walk away with a better understanding of how modern LLM frameworks can help you solve problems for yourself and your users, as well as an enriched understanding of how to think long-term when building for an open source community.
You’ll see how the strength of Haystack's modularity and ease of use makes it stand out from other libraries. Demos will make this much clearer and give you some great ideas on how to integrate Haystack into your projects.
Humans are complex. As developers, we want to ignore that ... but to do our job right, we cannot. Let's talk about power, motivation, techno-sociology, politics and why all of this is important for our job.
Designed for beginners, this presentation demystifies Python project management using Hatch and delves into pyproject.toml for efficient configuration. We'll guide you through organizing directories, implementing unit testing for code reliability, and using mypy for type checking to enhance code quality. The session concludes with insights into ruff, a modern linter for maintaining Python standards, which is replacing black, isort, and flake8. This talk is a comprehensive toolkit for anyone eager to learn and apply the latest practices in Python development.
In this talk, we explore a new method to approximate Gaussian processes using spectral analysis methods, known as the Hilbert Space Gaussian process (HSGP) approximation. This technique allows us to use and fit Gaussian processes at scale for concrete applications. We provide a basic introduction to the ideas behind the method and make them tangible by implementing them ourselves using Numpyro. We then present two concrete examples in practice using both Numpyro and PyMC: time-varying coefficient regression and time series forecasting.
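For a flavour of what the PyMC side can look like, here is a minimal sketch based on PyMC's HSGP interface (argument names such as m and c are assumed from the PyMC documentation; check the current docs for the exact signature):

    import numpy as np
    import pymc as pm

    X = np.linspace(0, 10, 200)[:, None]
    y = np.sin(X).ravel() + 0.3 * np.random.default_rng(0).normal(size=200)

    with pm.Model():
        ell = pm.Gamma("ell", alpha=2, beta=1)          # lengthscale prior
        eta = pm.HalfNormal("eta", sigma=1)             # amplitude prior
        cov = eta**2 * pm.gp.cov.Matern52(input_dim=1, ls=ell)
        gp = pm.gp.HSGP(m=[25], c=1.5, cov_func=cov)    # m basis functions, boundary factor c
        f = gp.prior("f", X=X)                          # approximate latent GP
        sigma = pm.HalfNormal("sigma", sigma=1)
        pm.Normal("obs", mu=f, sigma=sigma, observed=y)
        idata = pm.sample()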
The real-time recommendation engine in TikTok, Monolith, is so good it has been described as "digital crack" (by Andrej Karpathy, former head of AI at Tesla). In this tutorial, we will build the core components of TikTok's Monolith (a retrieval and ranking architecture): a stream processing feature pipeline, a two-tower embedding model to support personalized queries based on each user's history/context, and a simple user interface in Python (Streamlit). Our real-time machine learning system will consist of 3 Python programs - the feature pipeline, the training pipeline, and the online inference pipeline - and the ML infrastructure they require will be provided by the open-source Hopsworks platform, including a feature store, vector database, model serving, and model registry.
As we descend from the peak of the hype cycle around Large Language Models (LLMs), chat-based document inquiry systems have emerged as a high-value practical use case. Retrieval-Augmented Generation (RAG) is a technique to share relevant context and external information (retrieved from vector storage) to LLMs, thus making them more powerful and accurate.
In this hands-on tutorial, we’ll dive into RAG by creating a personal chat app that accurately answers questions about your selected documents. We’ll use a new OSS project called Ragna that provides a friendly Python and REST API, designed for this particular case. We’ll test the effectiveness of different LLMs and vector databases, including an offline LLM (i.e., local LLM) running on GPUs on the cloud machines provided to you. And we’ll conclude by demonstrating how to quickly build personal or company-level chat-based document interrogation systems.
Getting a machine learning solution in front of users usually takes some time. The data science tech stack is full of time traps and infrastructure issues might slow down deployment. The Azure Machine Learning platform, automated machine learning, and Streamlit are predestined tools for circumventing common development and deployment issues – if you know how to use them. Based on our learnings in corporate hackathons, we will use the stack to rapidly prototype a computer vision application users can interact with. You will walk away with Python code snippets and inspiration to build and user test your own machine learning ideas quickly.
Every day, we engage with news, and more often, these are curated by recommendation engines. Building such an algorithm poses some unique challenges, different from movie or product recommendations: articles have a short lifetime because nothing is older than yesterday's news. The data is heavily biased by the different positioning of articles on the page, and journalistic principles and brand identity should be represented in the article selection. At Axel Springer National Media and Tech, we overcome these challenges by leveraging our domain knowledge combined with simple statistics instead of black-box machine learning models. This talk will share some of our learnings that can be applied to recommendation systems and data science projects in general.
Learn to make practical decisions in data engineering with Python's vast ecosystem. Avoid blindly following market guidelines and consider the reality of your situation for better performance and architecture.
Have you ever thought about IT Security when coding your Python application? If not, you are not alone – but also not safe.
Just recently, a research study counted almost 4000 secrets published on PyPI. Most of the secrets such as AWS Keys, Google API Keys or database credentials were most likely leaked accidentally. Leaked credentials top the list of entry points for attackers into protected areas. In this talk you’ll gain insights into how malicious attacks on Python applications are performed – and most importantly, how to protect yourself against them.
We’ll kick off with a basic review of how to crack a password (and not only by brute force) and continue with the most important IT security principles. After understanding the importance of adhering to common security precautions, we will dive into Python coding hygiene. Where do the most common vulnerabilities lie? How can we strengthen the security of our code?
We’ll cover secure coding practices such as code analysis, input validation and dependency vulnerabilities in theory and practice. Lastly, we will look at some case studies of common attacks on Python code and how to protect yourself against them.
If you have never thought about security aspects in Python, this talk is for you!
The scikit-learn website currently employs an "exact" search engine based on the Sphinx Python package, but it has limitations: it cannot handle spelling mistakes and queries based on natural language. To address these constraints, we experimented with using large language models (LLMs) and opted for a retrieval augmented generation (RAG) system due to resource constraints.
This talk introduces our experimental RAG system for querying scikit-learn documentation. We focus on an open-source software stack and open-weight models. The talk presents the different stages of the RAG pipeline. We provide documentation scraping strategies that we designed based on numpydoc and sphinx-gallery, which are used to build vector indices for the lexical and semantic searches. We compare our RAG approach with an LLM-only approach to demonstrate the advantage of providing context. The source code for this experiment is available on GitHub: https://github.com/glemaitre/sklearn-ragger-duck.
Finally, we discuss the gains and challenges of integrating such a system into an open-source project, including hosting and cost considerations, comparing it with alternative approaches.
Every developer wants to write good code. Good code also means security against attackers and their threats. But how secure is your code really?
The talk explains how you can use Threat Modeling to systematically assess your application against the threats that are relevant to your use cases and its attack surface.
Apache Parquet has become the de facto format for storing tabular (DataFrame) data on disk, thanks to universal compression and an efficient encoding of the stored data's structure. In this talk, we would like to show the core structure of Parquet and the knobs that let you get even more out of the file format's capabilities.
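For example, pyarrow exposes much of that structure directly (the file name below is a placeholder):

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("data.parquet")          # placeholder file
    print(pf.metadata)                           # row groups, column chunks, encodings
    print(pf.metadata.row_group(0).column(0).statistics)  # min/max stats used for pruning

    # Read only what you need: column projection plus row-group selection
    table = pf.read_row_group(0, columns=["price"])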
Data stories transform complex data insights into clear, actionable and context rich narratives to drive business value. The presentation of data stories to different audiences in a visually compelling manner while keeping track of data changes is a challenging task. A possible solution is to implement appealing and interactive data applications, for which Streamlit is an established open-source solution. In combination with Snowflake, it enables an efficient and straightforward approach to build engaging data applications that utilize data directly from a data platform.
In this talk, we will explore a proof-of-concept, tracing the conception of a data story to the implementation of a Streamlit app in Snowflake by using open source datasets from Deutsche Bahn. So, hold onto your seats – it is time to explore the world of data apps with Snowflake and Streamlit.
Are you secretly a spy and/or passionate about open-source? Maybe you don't trust a cloud-hosted service with your highly classified information, or perhaps you like to build things for yourself. In this light-hearted talk, you will learn how to make a real-time on-device GenAI-powered application that can live transcribe and summarize conversations without internet access, using open-source components.
Our journey begins with an introduction to open-source LLMs and the latest trends in running GenAI tools on your own hardware. We will build up our application step-by-step, first creating a live streaming voice-to-text transcription pipeline, then an LLM-based conversation summarization layer, presented within a Streamlit frontend, with conversation summaries sent to a lightweight Django API backend for storage.
This talk is tailored for Python enthusiasts and requires no ML expertise. By seeing a practical demo come together piece by piece, attendees will gain a deeper understanding of how to build their own complex Generative AI applications and be pushed to imagine what they could make for themselves using on-device computation in real-world scenarios.
Deploying machine learning models in production carries its own unique set of challenges. Some challenges stem from different, and sometimes conflicting, objectives between analytics and production. Others arise from technological limitations, business requirements, and even regulatory needs.
In this talk, we will focus on the part of the problem surrounding the handover of models from analytics to production. We expect data scientists, operation specialists, and product owners to benefit from our stories.
Change-point detection is a crucial processing step when dealing with long and non-stationary time series. It has been applied in many contexts, such as human activity recognition, speech/sound processing and industrial monitoring. This talk guides data scientists, engineers and researchers through the mathematical foundations of this subject, introduces the ruptures Python package for change-point detection, and illustrates algorithms in a biomedical context. By the end, the audience will be able to integrate them into complex data pipelines.
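As a minimal example of the ruptures API on simulated data:

    import ruptures as rpt

    # Simulate a piecewise-constant signal with 3 change points
    signal, true_bkps = rpt.pw_constant(n_samples=500, n_features=1,
                                        n_bkps=3, noise_std=1.0)

    # PELT with an RBF cost detects an unknown number of change points
    algo = rpt.Pelt(model="rbf").fit(signal)
    detected = algo.predict(pen=10)    # the penalty controls sensitivity
    print(detected)                    # change-point indices (the last one == len(signal))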
The transition from a hands-on creative job to a leadership role isn't always smooth. The tasks you excelled at are now handled by your team, and your new title brings added responsibilities and numerous meetings, leaving little room for deep work. So, how do we— the data people, the coaches, the coders—thrive in management roles? In this talk, I'll share my journey into management and how I learned to embrace and find reward in my leadership role.
This presentation will show you how to deploy machine learning models to affordable microcontroller-based systems - using the Python that you already know. Combined with sensors such as a microphone, accelerometer or camera, this makes it possible to create devices that can automatically analyze and react to physical phenomena. This enables a wide range of useful and fun applications, and is often referred to as "TinyML".
The presentation will cover key concepts and explain the different steps of the process. We will train the machine learning models using standard scikit-learn and Keras, and then execute them on-device using the emlearn library. To run Python code on the microcontroller, MicroPython will be used.
We will demonstrate some practical use cases with different sensors, such as Sound Event Detection (microphone), Image Classification (camera), and Human Activity Recognition (accelerometer).
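The conversion step looks roughly like this, as a sketch following emlearn's documented scikit-learn flow (the dataset, model, and file names are invented; the generated C header is then compiled into the firmware or MicroPython module):

    import emlearn
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    model = RandomForestClassifier(n_estimators=10, max_depth=5).fit(X, y)

    cmodel = emlearn.convert(model, method="inline")   # convert the trained model to C
    cmodel.save(file="gestures.h", name="gestures")    # header file to include on-device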
The foundations of machine learning were built on offline batch processing techniques for model training and inference. As organisations become more dependent on real-time data, the technological trend for machine learning in production is moving towards adding an online stream processing approach. This has benefits such as lower computational requirements, since models can learn incrementally from a stream of data points, which enables the continual upgrading of models by adapting to real-time changes in data. Learn how to get started on your online ML journey with River.
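A minimal sketch of the incremental workflow with River, using one of its bundled datasets:

    from river import compose, datasets, linear_model, metrics, preprocessing

    model = compose.Pipeline(
        preprocessing.StandardScaler(),
        linear_model.LogisticRegression(),
    )
    metric = metrics.ROCAUC()

    for x, y in datasets.Phishing():
        y_pred = model.predict_proba_one(x)   # predict before learning (prequential evaluation)
        model.learn_one(x, y)                 # update the model with this single sample
        metric.update(y, y_pred)

    print(metric)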
One of the most challenging tasks in software engineering is cleaning up a complex piece of software with 10,000-100,000 lines of code. The problem gets worse if you are taking over legacy code. The fact that the Python language enforces neither strict typing nor encapsulation does not help either. What should you do if throwing away everything and rewriting the program from scratch is not an option?
In this tutorial, we will exercise refactoring a larger program that is undocumented, unstructured and untested. We will take a messy example program and work through a list of procedures that may help you in your next big refactoring.
pytest lets you write simple tests fast - but also scales to very complex scenarios: Beyond the basics of no-boilerplate test functions, this training will show various intermediate/advanced features, as well as gems and tricks.
To attend this training, you should already be familiar with the pytest basics (e.g. writing test functions, parametrize, or what a fixture is) and want to learn how to take the next step to improve your test suites.
If you're already familiar with things like fixture caching scopes, autouse, or using the built-in tmp_path/monkeypatch/... fixtures: there will probably be some slides about concepts you already know, but there are also various little hidden tricks and gems I'll be showing.
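To give a flavour of the level (a small sketch, not the training material itself): fixture caching scopes and parametrized fixtures already go a long way.

    import pytest

    @pytest.fixture(scope="module")        # created once per module, then cached
    def connection():
        conn = {"open": True}              # stand-in for an expensive resource
        yield conn
        conn["open"] = False               # teardown runs after the last test in the module

    @pytest.fixture(params=["json", "yaml"])   # every test using it runs once per param
    def fmt(request):
        return request.param

    def test_export(connection, fmt):
        assert connection["open"]
        assert fmt in ("json", "yaml")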
Rosenxt has only just been founded, and yet we are already very busy researching great things and making them usable. The ideas are bubbling and the motivation is high, as is the urge to quickly try out the next idea. But progress needs to be well documented, as the next performance audit is sure to come.
Your RAG-powered LLM application might look pretty convincing at first glance, but how do you really know if it’s any good? And how do you justify the design choices you make? In this talk, you will learn about the RAG evaluation concept we produced at Airbus for evaluating the components of our digital engineering assistant, its implementation with open source tools paired with Google Vertex AI, and what we learnt in the process.
Large Language Models (LLMs) have proven to be incredibly powerful on a range of tasks. They do however, have certain limitations when the input context becomes significantly large. Solutions such as Retrieval Augmented Generation (RAG) do a great job in providing context from custom data without retraining any models but they too have limitations, especially when the context is spread out over many documents. Consider the question “Which projects has person X worked on?”. Information required to answer this question may be spread out over hundreds of documents, making it difficult for an LLM alone to answer. One way to overcome this issue is to use an LLM as an entity extraction tool, which can extract entities and relationships from documents and load that data into a structured format such as a knowledge graph. In this talk, I will demonstrate this process on a dataset of parliamentary debates, showing how downstream analytics becomes more intuitive and feasible.
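A stripped-down sketch of the idea (the llm() function is a stand-in for your model of choice, and the prompt and output format are purely illustrative):

    import json
    import networkx as nx

    def llm(prompt):                # stand-in: call your LLM of choice here
        return '[["Person X", "WORKED_ON", "Budget Review"]]'

    prompt = ("Extract (subject, relation, object) triples as JSON from: "
              "'Person X led the budget review project in 2021.'")
    triples = json.loads(llm(prompt))

    g = nx.DiGraph()                # load the triples into a knowledge graph
    for subj, rel, obj in triples:
        g.add_edge(subj, obj, relation=rel)

    # downstream analytics become simple graph queries
    print(list(g.successors("Person X")))    # projects Person X worked on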
Our world is driven by technology and there are many reasons to teach our kids how to code. For example, coding allows them to develop logical reasoning skills and teaches attention to detail. Allowing children to discover how much fun coding can be supports them in their development and opens many doors for their future.
But when and how should we start coding with kids? This talk will approach the question from a scientific perspective, looking into how children's brains develop, how children learn and how to best teach them coding abilities. It will answer important questions like "At what age can a child start coding?" or "What are the benefits of learning to code?". It will also present possible starting points, like learning platforms or tutorials.
I know you probably don't want to hear about it, but your deep learning model probably memorized some of its training data. In this talk, we'll review active research on deep learning and memorization, particularly for large models such as large language and multi-modal models.
We'll also explore potential ways to think through when this memorization is actually desired (and why), as well as threat vectors and the legal risk of using models that have memorized training data. We'll also look at potential privacy protections which could address some of the issues, and how to embrace memorization by thinking through different types of models and their use.
DuckDB is an in-process analytical data management system. DuckDB is free, open source, and rather popular: it is one of the fastest-growing data systems to date, especially in the Python ecosystem. DuckDB was created at Centrum Wiskunde & Informatica (CWI) in Amsterdam, not entirely coincidentally the same place Python was created. Later on, we founded a commercial company, DuckDB Labs, which now drives development. In my talk, I will discuss DuckDB, its origins, and the unique benefits and challenges of maintaining popular software in an academic setting.
The ambitious human desire to get rich without effort has been a major driving force behind the popularity of cryptocurrencies like Bitcoin and Ethereum. However, their high volatility makes them too unpredictable, and keeping track of our investment gains and losses over time can be tedious, if not boring.
In this talk, we will define the different components necessary to build a personalized Bitcoin (BTC) virtual assistant in Python. The assistant will help you analyze your transaction history, estimate future BTC prices, and calculate the future value of your holdings based on these predictions. It will be powered by LLMs and will make use of a recent technique called Function Calling to recognize the user intent from the conversation history.
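The intent-recognition step might look roughly like this (a sketch using the OpenAI Python client; the model name and tool schema are illustrative, not the talk's actual code):

    from openai import OpenAI

    client = OpenAI()
    tools = [{
        "type": "function",
        "function": {
            "name": "estimate_future_value",
            "description": "Estimate the future value of a BTC holding",
            "parameters": {
                "type": "object",
                "properties": {
                    "btc_amount": {"type": "number"},
                    "horizon_days": {"type": "integer"},
                },
                "required": ["btc_amount", "horizon_days"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="gpt-4o-mini",    # illustrative model name
        messages=[{"role": "user", "content": "What will my 0.5 BTC be worth in 30 days?"}],
        tools=tools,
    )
    call = resp.choices[0].message.tool_calls[0]        # the recognized user intent
    print(call.function.name, call.function.arguments)  # name + JSON arguments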
Do you find yourself working through pages of copied and pasted tests to accommodate a simple code change? Does your software frequently break in unexpected ways despite your testing efforts? Don’t despair! Property-based testing could be your way out of that mess. Rather than working harder and writing more test code, property-based testing forces you to work smarter and test more code with fewer tests.
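For instance, with the Hypothesis library, one property can replace pages of hand-picked examples:

    from hypothesis import given, strategies as st

    @given(st.lists(st.integers()))
    def test_sort_properties(xs):
        result = sorted(xs)
        assert len(result) == len(xs)                            # nothing lost or added
        assert all(a <= b for a, b in zip(result, result[1:]))   # actually ordered
        assert sorted(result) == result                          # idempotent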
For many years, Android has held the top position as the most-used OS, with about 38% of the OS user share in 2023. Currently, three major languages – C++, Java, and Kotlin – are used for application development on Android. Although Python has the capability of enabling Android deployment, it was never considered an adequate language for Android development. But with the introduction of “PEP 738: Adding Android as a supported platform” and the increasing popularity of frameworks like PySide6, Kivy, Flet etc., which enable GUI development with Python for Android devices, it is time for Python package developers to consider Android as a potential platform.
This talk gives an introduction to each of the GUI development toolkits – Kivy, Flet and PySide6 by demonstrating how to create a simple Contact List application. We later delve into the pros and cons of each of these frameworks, so that Python application developers can decide which framework suits their requirements better.
Understanding and repairing garbled text (mojibake) is, despite Unicode, a permanently ongoing task in IT projects.
Garbled text is the result of text being decoded using an unintended character encoding.
Example: Die UTF-8 Selbsthilfegruppe trifft sich heute Abend im grünen Saal
This talk explains how to analyze and fix such encoding problems with Python.
The topics of this talk include:
- difference between grapheme and codepoints
- Unicode vs. UTF-8
- decoding and encoding files, database result sets, REST-APIs calls
- the unicodedata module
- handling of ISO charsets in the Unicode world
This talk shows short code examples for real-world problems and solutions.
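One of the classic repairs, as a short example: UTF-8 bytes that were wrongly decoded as Latin-1 can be round-tripped back.

    garbled = "grÃ¼nen Saal"                         # UTF-8 bytes displayed as Latin-1
    repaired = garbled.encode("latin-1").decode("utf-8")
    print(repaired)                                  # grünen Saal

    import unicodedata
    print(unicodedata.name("ü"))                     # LATIN SMALL LETTER U WITH DIAERESIS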
As the landscape of data-driven applications expands, the need for robust SQL development practices becomes increasingly critical. This conference talk addresses the challenges faced by data teams in maintaining and evolving complex SQL models for their Data Warehouses, and shows how unit testing can play a vital role in ensuring data quality.
We will delve into the significance of SQL unit testing, highlighting its ability to quickly validate modeling logic and making sure that modifications do not break existing behavior. With the ease of mind of an automatically verified SQL logic, changes to existing data models can be shipped with confidence, ultimately contributing to faster deployment cycles.
Get detailed insights into the structure and functionality of Lotum's SQL unit testing framework, built in Python using pytest and tailored for BigQuery. With Lotum processing millions of events from mobile games every day, explore how this robust framework allows for efficient testing, ensuring the accuracy of the SQL logic. Learn how test cases with small sets of static mock data can be defined effortlessly so that they help pinpoint potential code errors easily.
Machine learning is mostly used for predicting outcome variables. But in many cases, we are interested in causal questions: Why do customers churn? What is the effect of a price change on sales? How can we optimize personalized marketing campaigns or medical treatments?
This tutorial introduces participants to the field of Causal Machine Learning (Causal ML). We will start with a basic motivation of causal analysis and share insights on how to recognize causal questions in data science. We will dive into the basics of Causal ML: Why can't we simply use off-the-shelf ML methods to answer causal questions? The tutorial will focus on the Double Machine Learning approach and demonstrate the use of Causal ML with the Python library DoubleML (Bach et al., 2022). The general introduction will be complemented by hands-on data examples, interactive discussion, and Q&A sessions. The tutorial is a great starting point for participants to discover Causality/Causal ML and start their own causal data science projects.
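As a first taste of the library on simulated data (a minimal sketch assuming DoubleML's current interface; the tutorial goes well beyond this):

    from doubleml import DoubleMLData, DoubleMLPLR
    from doubleml.datasets import make_plr_CCDDHNR2018
    from sklearn.ensemble import RandomForestRegressor

    # Partially linear regression: estimate the causal effect of treatment d on outcome y
    df = make_plr_CCDDHNR2018(n_obs=500, return_type="DataFrame")
    dml_data = DoubleMLData(df, y_col="y", d_cols="d")

    dml_plr = DoubleMLPLR(
        dml_data,
        ml_l=RandomForestRegressor(),   # nuisance model for E[y|X]
        ml_m=RandomForestRegressor(),   # nuisance model for E[d|X]
    )
    dml_plr.fit()
    print(dml_plr.summary)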
References
Bach, P., Chernozhukov, V., Kurz, M. S., and Spindler, M. (2022), DoubleML - An Object-Oriented Implementation of Double Machine Learning in Python, Journal of Machine Learning Research, 23(53): 1-6, https://www.jmlr.org/papers/v23/21-0862.html
A case study of how we use Deep Learning based photogrammetry to calculate the height of trees from very high resolution satellite imagery. We show the substantial improvement achieved by switching from classical photogrammetric techniques to a deep learning based model (implemented in PyTorch), and the challenges we had to overcome to make this solution work.
As the world is being reshaped at an unprecedented speed through the rise of powerful (Generative) AI technologies that change the way we work and live, governments seek their place in the arena. This presentation will focus on how government institutions adapt to these changes by exploring three key areas of action: Adoption, Regulation, and Reskilling/Upskilling. Emphasis will be placed on Ethics and AI in government.
As Python expands into serverless and cloud environments, popularizing distributed microservice architectures, we often face observability challenges that impact efficiency and complicate error tracing. This presentation introduces OpenTelemetry, an emerging industry standard that provides a framework for tracking the performance of not just our Python code, but also other system components like databases and message queues. Its API and SDK integrate seamlessly with Python, enabling a unified approach to gather, process, and export telemetry data from various sources within a distributed system.
We will explore the setup and usage of OpenTelemetry's Python SDK through a practical scenario. The session will demonstrate how to convert an existing Flask microservice to use OpenTelemetry, using both automatic and manual instrumentation. Finally, we will examine how to utilize the exported data for enhanced system monitoring.
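In outline, the combination of automatic and manual instrumentation looks like this (exporting spans to the console for simplicity):

    from flask import Flask
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
    from opentelemetry.instrumentation.flask import FlaskInstrumentor

    trace.set_tracer_provider(TracerProvider())
    trace.get_tracer_provider().add_span_processor(
        BatchSpanProcessor(ConsoleSpanExporter())
    )

    app = Flask(__name__)
    FlaskInstrumentor().instrument_app(app)   # automatic instrumentation of each request

    tracer = trace.get_tracer(__name__)

    @app.route("/")
    def index():
        with tracer.start_as_current_span("business-logic"):   # manual span
            return "ok"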
Has this question ever crossed your mind: how can green software engineering help environmental sustainability?
My talk will answer this exact question.
My passion for nature and love for technology pushed me into this topic.
The way global warming is affecting us is one of the biggest concerns of many people around the world. The focus is to educate people about how they can play their role in protecting the environment just by using their laptops or computers in the right way.
One of the biggest challenges is dealing with and controlling greenhouse gas emissions – but how can software engineering help with all of this?
The complete software engineering cycle should be designed and implemented in such a way that it incorporates environmental sustainability without sacrificing economic benefits. It is a win-win situation. We need more environmentally sustainable mobile and web applications.
We demonstrate a range of different approaches to missing-data imputation in employee engagement survey data. Contrasting frequentist-style full-information maximum likelihood approaches with more direct Bayesian imputation and chained equation methods, we highlight how different assumptions regarding the missing data license different inferences about the imputed values, and ultimately the plausible causal narratives which can be expressed in PyMC. In particular, we avail of the hierarchical nature of employee engagement data to motivate a hierarchical approach to justifying the missing-at-random (MAR) assumption for imputation schemes in People Analytics.
Crafting code for minimal dependencies and maximum portability is an art. This talk focuses on how continuous integration and delivery ensure project resilience to Python updates and changes in the packaging ecosystem. Setting up automation around your project enhances peace of mind, improves code maintainability, and facilitates collaboration.
In this talk, we discuss computational operations and memory utilization in Python and the connection between them. Additionally, we will provide you with visual aids to help build a mental picture of these concepts. Moreover, we will dive into how the Python interpreter works and how an understanding of bytecode instructions can help you write better code. Finally, we will demonstrate the advantages of best practices by comparing both performance metrics and bytecode instructions.
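The standard library's dis module is the simplest window into those bytecode instructions:

    import dis

    def add(a, b):
        return a + b

    dis.dis(add)   # shows LOAD_FAST / BINARY_OP / RETURN_VALUE on CPython 3.11+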
Dive into the world of AI voice agents with Vocode, the leading framework for creating interactive, voice-based AI assistants. In this talk, we'll explore how Vocode integrates speech-to-text, response generation, and speech synthesis APIs to create agents that not only speak but also understand and adapt to the nuances of human conversation. We'll discuss the challenges of teaching these agents the etiquette of real conversations, such as knowing when to pause, not interrupt, and conclude interactions. Plus, we'll showcase Vocode's LLM function-calling feature through a practical example: real-time appointment booking. Join us to uncover the secrets behind building AI voice agents that are as engaging and efficient as they are innovative.
In the realm of machine learning, the complexity of data pipelines often hinders rapid experimentation and iteration. This talk will introduce DDataflow, an innovative open-source tool, designed to facilitate end-to-end testing in ML pipelines by leveraging decentralized data sampling. Attendees will gain insights into the challenges of unit testing in large-scale data pipelines, the design philosophy behind DDataflow, and practical implementation strategies to enhance the reliability and efficiency of their ML pipelines.
In this talk, we address the Cold Start problem in Demand Forecasting, focusing on scenarios where historical data is scarce or nonexistent. This constitutes a common situation in practice, such as with the launch of new products in Retail. However, many Time Series and Machine Learning models encounter difficulties in handling this challenge, primarily due to their dependence on a substantial amount of historical data for effective training and prediction.
We begin by providing an overview of established techniques used to address the Cold Start problem, including methods like padding, feature engineering, and leveraging item similarities. Additionally, we explore more recent advancements and emerging research, such as Transfer Learning for Time Series.
While each technique presents its unique set of trade-offs, the challenge lies in determining the most suitable approach for a given dataset or use case. This aspect is often not widely understood, and our goal is to unravel this complexity by offering practical insights. Furthermore, we introduce a practical framework for systematically evaluating different forecasting strategies within the Cold Start setting, guiding you in selecting the most suitable approach for your datasets and use cases.
Drawing on experience with multiple consulting projects, this talk shares lessons on how to deal with unexpected data problems. We discuss how far purely technical solutions as well as domain knowledge can go in compensating for lacking data quality or quantity, and when it might be better to scale down the original project scope.
Over the history of free and open source software, we have gone through quite a few metaphors for open source projects: from homesteads in the noosphere to puppies, roads & bridges, gardens, forests, and orchards. Regardless of the preferred comparison, we can all agree that behind every large open source project is a resilient contributor community. Is there a blueprint for it? How about a script for scaling a contributor community, or a formula for contributor retention? In this talk, I will examine all these questions and share my insight on the art and science of fostering resilient open source communities.
Python supports multiple programming paradigms. In addition to the procedural and object-oriented approach, it also provides some features that are typical for functional programming.
While these features are optional, they can be useful for creating better Python programs. This tutorial introduces Python features that help to implement parts of Python programs in the functional style. The objective is not to write purely functional programs, but to improve program design by using functional features where suitable.
The tutorial points out advantages and disadvantages of functional programming in general and in Python in particular. Participants will learn alternative ways to solve problems. This will broaden their programming toolbox.
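A small example of the style in question, using only standard-library tools:

    from functools import partial, reduce

    def add_vat(rate, price):                 # a pure function: no side effects
        return price * (1 + rate)

    add_german_vat = partial(add_vat, 0.19)   # specialize via partial application

    prices = [12.5, 7.0, 3.25]
    gross = list(map(add_german_vat, prices))           # map instead of a loop
    total = reduce(lambda acc, p: acc + p, gross, 0.0)  # fold the list into one value
    print(gross, total)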
A tutorial session on how to build scientific packages for numerical calculus and algorithms in Python and Rust. It walks through the process of packaging with a modern tool stack, introduces the concept of vectorization for efficient computation in Python in the context of classical Machine Learning, and shows how the package can be optimized with extensions written in Rust.
Discover how graph algorithms are transforming content recommendation in this insightful talk. We'll journey from the basics of graph-based models, exploring simple graph walks, to the cutting-edge realm of Graph Neural Networks. Uncover the power of graph embeddings and learn when graph-based approaches excel in recommender systems.
A key feature of the Python data ecosystem is the reliance on simple but efficient primitives that follow well-defined interfaces to make tools work seamlessly together (cf. http://data-apis.org/). NumPy provides an in-memory representation for tensors. Dask provides parallelisation of tensor access. Xarray provides metadata linking tensor dimensions. Zarr provides a missing feature, namely the scalable, persistent storage for annotated hierarchies of tensors. Defined through a community process, the Zarr specification enables the storage of large out-of-memory datasets locally and in the cloud. Implementations exist in C++, C, Java, Javascript, Julia, and Python.
This talk presents a systematic approach to understanding the newer Zarr Specification Version 3 by explaining the critical design updates, performance improvements, and the lessons learned via the broader specification adoption across the scientific ecosystem.
I will also briefly discuss the evolution of Zarr – the development of the Zarr Enhancement Process (ZEP) and its use to define the next major version of the specification (V3) – as well as the uptake of the format across the research landscape.
In 2023, the field of NLP was again in a flurry: the appearance of powerful closed- and open-source LLMs opened new possibilities for text processing. However, many questions about these models' usability for typical NLP tasks are still open. One of them is quite simple: if we want a classification model for some task, can we rely on LLMs, or is it still better to fine-tune our own model? It might be easy to obtain some classifier for English, but what if my target language is not so resource-rich? In this presentation, the main "recipes" for obtaining the best text classifier, depending on the language and data availability, will be described.
Dask is a library for distributed computing with Python that integrates tightly with pandas. Historically, Dask was the easiest choice to use (it's just pandas) but struggled to achieve robust performance (there were many ways to accidentally perform poorly). The re-implementation of the DataFrame API addresses the pain points that users ran into. We will look into how Dask is a lot faster now, how it performs on benchmarks it struggled with in the past, and how it compares to other tools like Spark, DuckDB and Polars.
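The user-facing API stays pandas-like; what changed is what happens underneath (the path and column names below are placeholders):

    import dask.dataframe as dd

    df = dd.read_parquet("s3://bucket/data/")   # placeholder path
    result = (
        df[df.amount > 0]
        .groupby("customer")["amount"]
        .sum()
        .compute()   # the newer optimizer can prune columns and push down filters first
    )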
As applications grow, so does the number of configurable features. Managing consistent defaults, maintaining user and developer documentation, and ensuring uniform parsing among a growing number of client applications can become a challenge. Adding constraints like complex fallback hierarchies and backwards compatibility increases the probability of runtime errors. We show how Pydantic's strong data validation and integration into Python's type annotations can help build a strict specification for your configuration format, catch misconfiguration early, and mitigate the aforementioned problems of a non-formalized configuration management system.
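A minimal sketch of the approach (field names are illustrative):

    from pydantic import BaseModel, Field, ValidationError

    class ServerConfig(BaseModel):
        host: str = "localhost"                   # consistent defaults in one place
        port: int = Field(8080, ge=1, le=65535)   # constraints live with the field
        debug: bool = False

    try:
        cfg = ServerConfig(port="not-a-port")
    except ValidationError as err:
        print(err)   # misconfiguration is caught at startup, not deep at runtime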
Last year, dm-drogeriemarkt was among the first big German companies to launch a tool enabling coworkers to unlock the power of LLMs in a secure setting. At the beginning, dmGPT was only a user interface pointing to a private instance of a foundation model.
Listening to the needs of our colleagues, we quickly learned that this “naked” model – a super powerful NLP model that can help them process text – is not really what they needed: they needed a trustworthy, knowledge-rich assistant to help them accomplish their daily tasks.
In our journey towards this goal, we used Python to shift the LLM’s role in dmGPT: from being the motor and only source of answers to being a translator between the user’s input in natural language and multiple software systems – the steering wheel that helps humans drive the flow.
Today, dmGPT is no longer just a statistical parrot; it is an open platform powered by internal knowledge.
In this talk we want to share with you the learnings and insights we gained while designing and implementing the new dmGPT.
Unit testing is a fundamental practice in software development, ensuring the reliability and maintainability of code. However, in the context of monolithic repositories, executing unit tests efficiently becomes a formidable challenge. This talk aims to explore the intricacies of unit testing in Django within monolithic codebases and shed light on how major institutions address and overcome these challenges through the implementation of parallel testing strategies.
Wolt's Discovery page serves as the primary gateway for millions of weekly users exploring diverse cuisines and products. With over 130,000 merchants in 25 countries, presenting relevant content poses a unique challenge. In this presentation, we address the complexities of personalizing the Discovery page using a hierarchical multi-armed bandit (MAB) approach built on the Python ecosystem. We outline the challenges specific to an expansive online delivery platform, introducing our MAB solution that incorporates hierarchical parameters at user, segment, city, and country levels. Leveraging Thompson Sampling for exploration and exploitation, our approach accommodates data sparsity challenges. Evaluation results, both offline and online, showcase the effectiveness of our solution. The talk concludes with insights into the resilient, scalable, and adaptive architecture underpinning our approach, featuring open-source libraries such as mlflow, Flyte, and Seldon Core. Our learnings and future steps toward a personalized, context-aware Discovery page cap off the presentation. Join us as we navigate the intricacies of recommendation challenges in the dynamic world of quick commerce.
Testing is a de facto standard in modern software development. With the increasing awareness that comes with MLOps, testing becomes more important for the development and operation of machine learning-based components. In this talk, we would like to share our view on and solution for testing in the field of machine learning. We will present the testing strategy applied and the lessons learned from the last four years of experience in operating idealo's cataloging system.
In this talk, we will explore the basic concepts of Dev Containers and demonstrate how they can support your everyday development as a Python programmer, data scientist, or machine learning engineer. With Dev Containers, you can build a consistent development environment in seconds, no matter where you are or what tools you use. And you know what? The Development Container Specification is even open source. Say goodbye to the hassle of setting up your development environment from scratch every time you start a new project!
We will start with a basic example and discuss how to set up a consistent Python development environment, including best practices for package management and GPU support. After this talk, you will be able to leverage the advantages of Dev Containers, allowing you to work from anywhere and be ready in seconds.
If you're tired of wasting time setting up your development environment and want to unlock the power of Dev Containers, then this talk is a must-attend for you!
Python in Excel is the new integration created by Microsoft that brings Python programming directly into Excel workbooks, for advanced data analytics. With Python in Excel, it is now possible to embed Python code directly into workbook cells, very easily, and with zero setup required.
In this tutorial, we will explore the many features and capabilities this new integration provides, to unlock unprecedented data science and machine learning use cases in Excel.
In this interactive workshop, we will cover the very basics of using PyO3. There will be hands-on exercises going from setting up the project environment to writing a "toy" Python library in Rust using PyO3. We will cover many specifics of the API provided by PyO3 for creating Python functions and modules, handling errors, and converting types.
Preflight checklist
- Install/update Rust
- Make sure you have Python 3.8 or above (3.12 recommended)
- Make sure you use a virtual environment (pyenv + virtualenv recommended)
In this workshop, we recommend using a Unix OS (Mac or Linux). If you have to use Windows, you may encounter problems with Rust and Maturin. You may want to install a VM like VirtualBox for developing Python libraries with PyO3.
Setting up
Set up a virtual environment and install maturin:
pyenv virtualenv 3.12.2 pyo3
pyenv activate pyo3
pip install maturin
The COVID-19 pandemic and associated policy measures led to world-wide protest movements that were marked by the spread of misinformation and conspiracy theories, predominantly on social media platforms. Publicly available social media data is therefore a powerful proxy for studying these protest movements. The data, consisting of user locations, follower relationships, and content information, allows us to understand the geographical centers of activity, the network structure, and the key themes of conspiracy movements.
This talk will present a multi-dimensional network analysis of the Austrian COVID-19 protest movement using Python libraries like geopandas, networkx and gensim.
In particular, it will demonstrate how to identify geo-spatial hot spots using spatial statistics, densely connected clusters within the network by employing community detection techniques, as well as dominating content themes through topic modeling approaches.
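For example, the community-detection step can be as compact as this (using a built-in toy graph as a stand-in for the follower network):

    import networkx as nx
    from networkx.algorithms import community

    g = nx.karate_club_graph()    # stand-in for the follower network
    clusters = community.greedy_modularity_communities(g)
    print(len(clusters), "densely connected clusters")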
The presentation highlights how data-driven analysis enables further understanding of movements that may pose threats to democracy, alongside the importance of publicly available social media data for addressing societal challenges.
PyCon DE & PyData Berlin is volunteer run. This session aims to underscore the significant role that volunteer organization plays in cultivating environments of authenticity, inclusion, and diversity within tech communities.
Generative AI (GenAI) has significantly improved our daily lives, prompting a focus on its integration into products and our routines. However, the growing importance of GenAI brings along significant concerns regarding privacy and vulnerability.
This talk delves into the critical issues surrounding the protection of private data and the security of GenAI systems. We'll begin by understanding the fundamental differences between data privacy and data security. Drawing insights from real-life data breaches and compromised information in major companies, we'll explore the mistakes made and the steps taken to rectify them. Throughout the discussion, we'll analyze the challenges faced by GenAI in ensuring data privacy and security across various stages of an LLM project.
Furthermore, the talk will shed light on how prominent companies building GenAI are working to reduce the impact of data privacy and security concerns within their models. Additionally, we'll explore strategies for individuals, like ourselves, using GenAI, to enhance data privacy and security when integrating it into our products or daily lives. Finally, the role and significance of government regulations in ensuring the safety and security of GenAI will be emphasized.
Feature Stores have become an important component of the machine learning lifecycle. They have been particularly pivotal in bridging the gap between data engineering and machine learning workflows (experimentation, training and serving). This talk will explore Feature Stores with a focus on their evolution, what they look like now, and what they could look like in the future with the advent of the AI Act.
In the cross-industry trend towards Industry 4.0 solutions, the amount of gathered sensor data is ever growing. Given the sheer amount of data, manual or human-based monitoring of the collected time series data becomes cumbersome, if not impossible. Yet careful inspection of the time series data and identification of possible anomalies therein is crucial to detect problems in the underlying processes. To meet this demand, ZEISS is developing a fully automated time series processing tool that performs ML-based time series anomaly detection with a human-in-the-loop.
This presentation addresses the rare use of machine learning fairness metrics in domains with indirect human impact, e.g., automotive engineering. We briefly map out the space of use cases to examine the necessity, potential benefits, and challenges of applying fairness-related techniques. The main focus then lies on proposing solutions for overcoming identified hurdles, especially regarding the application in unstructured data domains, such as image and audio recognition and large text document analysis. Our approach includes strategies for detecting key subgroups and providing clear explanations for model failures. We also highlight two open-source tools, Sliceguard and Spotlight, for practical implementation.
In this talk, we want to introduce an AI PC: a single machine that consists of a CPU, GPU, and NPU (Neural Processing Unit) and can run GenAI in seconds, not hours. Besides the hardware, we will also show the OpenVINO Toolkit, a software solution that helps squeeze as much as possible out of that PC. Join our talk and see for yourself that the AI PC is good for both generative and conventional AI models. All presented demos are open source and available on our GitHub.
Time series analysis is ubiquitous in applied data science because of the value it delivers. In order to do effective time series analysis, you need to know your tools well. Polars has excellent built-in time series support, and it's also possible to extend it where necessary.
We will talk about:
- Basic built-in time series operations with Polars (e.g. "what's the average number of sales per month?").
- numba/numpy/scipy interoperability for not-so-basic time series operations (e.g. non-linear interpolation, or cumulative operations).
- Advanced, custom time series operations, and how you can implement them as Polars plugins (e.g. business day arithmetic).
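As a taste of the first bullet above, the monthly-average question is one dynamic group-by in Polars (the data here is invented):

    from datetime import date
    import polars as pl

    df = pl.DataFrame({
        "date": pl.date_range(date(2024, 1, 1), date(2024, 3, 31), "1d", eager=True),
        "sales": range(91),
    })

    monthly = (
        df.sort("date")
          .group_by_dynamic("date", every="1mo")            # one window per month
          .agg(pl.col("sales").mean().alias("avg_sales"))
    )
    print(monthly)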
Basic interest and knowledge of Python and data will be assumed, but no prior Polars experience is required.
Anyone working with time series and/or dataframes will likely benefit from the talk.
In the last year there hasn’t been a day that passed without us hearing about a new generative AI innovation that will enhance some aspect of our lives. On a number of tasks large probabilistic systems are now outperforming humans, or at least they do so “on average”. “On average” means most of the time, but in many real life scenarios “average” performance is not enough: we need correctness ALL of the time, for example when you ask the system to dial 911.
In this talk we will explore the synergy between deterministic and probabilistic models to enhance the robustness and controllability of machine learning systems. Tailored for ML engineers, data scientists, and researchers, the presentation delves into the necessity of using both deterministic algorithms and probabilistic model types across various ML systems, from straightforward classification to advanced Generative AI models.
You will learn about the unique advantages each paradigm offers and gain insights into how to most effectively combine them for optimal performance in real-world applications. I will walk you through my past and current experiences in working with simple and complex NLP models, and show you what kind of pitfalls, shortcuts, and tricks are possible to deliver models that are both competent and reliable.
The session will be structured into a brief introduction to both model types, followed by case studies in classification and generative AI, concluding with a Q&A segment.
Responsible AI mainly covers AI principles, governance, and regulation, but most companies do not know how to implement all of these. Hence, in this presentation we cover the key questions for the whole process behind a new AI product, from idea and design to development and deployment. The questions are partly based on the new ACM Principles for Responsible Algorithmic Systems (2022), of which the speaker is one of the two lead authors, as well as their extensions for Generative AI (2023). For each question we will discuss its relevance, challenges, and (partial) solutions, triggering an interactive discussion.
Asyncio use is now everywhere in the Python world, ...
.. or is it?
Although asyncio has been around since version 3.4, my impression is that it is still not the go-to solution when starting new projects.
It's not an obvious choice, and traditional approaches still seem to be much preferred, especially by beginners.
So let me take you with me on a journey to create simple, yet powerful building blocks to build asyncio based applications using patterns that are easy to follow, lightweight and attractive.
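To preview the flavour of such building blocks, here is one minimal example (names are illustrative, not from the talk):

```python
import asyncio
import contextlib

async def every(seconds, coro_func, *args):
    # Call coro_func repeatedly with a fixed pause until cancelled.
    while True:
        await coro_func(*args)
        await asyncio.sleep(seconds)

async def heartbeat():
    print("still alive")

async def main():
    task = asyncio.create_task(every(1.0, heartbeat))
    await asyncio.sleep(3.5)          # let it tick a few times
    task.cancel()                     # building blocks stay easy to tear down
    with contextlib.suppress(asyncio.CancelledError):
        await task

asyncio.run(main())
```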
#asyncio #click #logging #psutil #redis #raspberrypi
Have you ever wondered how travel e-commerce companies gather photos of cities? While I can't speak for everyone, I will demonstrate the innovative approach we are using at Flix.
In recent years, text-to-text models like ChatGPT and text-to-image models such as DALL-E 3 have become increasingly integrated into various industries. The main aim of these initiatives is typically to generate text or images. In our presentation, we propose a slightly different approach to leveraging these models commercially. Our objective is to gather images for thousands of cities that inspire travel. We utilize ChatGPT to tailor prompts for our business requirements, enabling efficient image retrieval through API queries from free stock image services. Then we apply image-to-text models to confirm the images' locations. Finally, we need to adjust the resolution of images for display across various platforms, such as social media campaigns on Instagram, email marketing, and on our website. To achieve this, we have used an automated cropping service to get images in the required aspect ratios, followed by Lanczos sampling for downscaling the images. This integration of cutting-edge models has resulted in an automated, highly flexible process that aligns with varied business needs. Our approach is cost-efficient; processing several hundred cities amounts to only a few euros, and we have utilized commonly available services, making replication easy for everyone.
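The final resizing step can be sketched with Pillow (an illustrative sketch of centre-cropping plus Lanczos downscaling; Flix's actual cropping service may well differ):

```python
from PIL import Image

def crop_and_downscale(path, out_path, target_w, target_h):
    img = Image.open(path)
    target_ratio = target_w / target_h
    w, h = img.size
    if w / h > target_ratio:                    # too wide: trim left/right
        new_w = round(h * target_ratio)
        left = (w - new_w) // 2
        img = img.crop((left, 0, left + new_w, h))
    else:                                       # too tall: trim top/bottom
        new_h = round(w / target_ratio)
        top = (h - new_h) // 2
        img = img.crop((0, top, w, top + new_h))
    img.resize((target_w, target_h), Image.Resampling.LANCZOS).save(out_path)

# e.g. a 1:1 crop for Instagram, per the aspect ratios mentioned above:
# crop_and_downscale("berlin.jpg", "berlin_1080.jpg", 1080, 1080)
```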
The skill of quickly judging what a formula does and how changing a parameter will affect the result is crucial when dealing with real-life data science - but it's a skill not easily acquired if you don't come from a STEM background. In this tutorial we'll work on guesstimating what complex mathematical expressions do so that you, too, can lose your fear of math!
Join us as we guide you through integrating Gurobi and prescriptive analytics into your greater Python ecosystem. We’ll demonstrate model-building patterns based on NumPy and SciPy.sparse data structures and explore how to take advantage of indexed DataFrames and Series in pandas for mathematical model building. You’ll also discover how to use trained regressors from scikit-learn as constraints in optimization models. Join us as we delve into the world of optimization with Gurobi and elevate your workflows.
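To give a flavour of the gurobipy modelling style, here is a toy LP with made-up numbers (not an example from the session; running it requires a Gurobi license):

```python
import gurobipy as gp
from gurobipy import GRB

m = gp.Model("diet")
buy = m.addVars(["bread", "milk"], lb=0, name="buy")   # decision variables
m.setObjective(2 * buy["bread"] + 3 * buy["milk"], GRB.MINIMIZE)
m.addConstr(30 * buy["bread"] + 50 * buy["milk"] >= 100, name="calories")
m.optimize()
for v in m.getVars():
    print(v.VarName, v.X)
```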
Reinforcement learning (RL) has great potential for industrial applications, but few mature software frameworks exist to facilitate its use. This talk discusses efforts to improve the software landscape for RL, making it easier for researchers to contribute algorithms and for engineers to apply RL in real-world settings. Specifically, we highlight the open-source library Tianshou, which provides high-level interfaces for painless RL application development along with lower-level APIs that cater to the needs of researchers. By improving RL software, we aim to accelerate research progress and expand RL adoption in industry.
This workshop addresses the critical and often underestimated topic of race conditions in Python, with a focus on their security implications. We begin with an overview of race conditions, explaining their nature and the security risks they pose. Participants will engage with small Python applications designed to demonstrate these vulnerabilities. Through hands-on analysis, we identify where and why these race conditions occur. The session progresses to simulate attacks exploiting these weaknesses, highlighting their potential for exploitation. Finally, we explore effective mitigation strategies, emphasizing thread synchronization and safe programming practices. The workshop aims to equip attendees with a deep understanding of race conditions in Python and practical skills to enhance the security and robustness of their code.
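A toy example of the class of bug in question (not from the workshop materials): a check-then-act race, followed by its lock-based mitigation.

```python
import threading

balance = 100
lock = threading.Lock()

def unsafe_withdraw(amount):
    global balance
    if balance >= amount:            # check ...
        balance = balance - amount   # ... then act: the two steps can interleave

def safe_withdraw(amount):
    global balance
    with lock:                       # mitigation: make check-and-act atomic
        if balance >= amount:
            balance -= amount

threads = [threading.Thread(target=unsafe_withdraw, args=(100,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(balance)  # usually 0, but -100 is possible if both threads pass the check
```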
Explore a variety of software testing methodologies, from Manual and A/B Testing to Unit and Performance Tests. Learn how to make informed decisions for enhanced software delivery, matching the unique needs of your projects.
Climate change is one of the biggest and most daunting challenges that our and future generations are going to face. In order to mitigate climate change and its consequences, one first needs to understand the problem and get a rough idea of the magnitude of human-made global warming. As a proper numbers nerd, I understand problems best when looking at science, statistics, and measurements. So here’s my little guide to better grasp what climate change is all about through data.
In this talk, we will put together a simple but full-featured website using Perspective. Perspective is an open source interactive analytics and data visualization component, which is especially well-suited for large and/or streaming datasets. It is written in C++ and Rust with bindings to both Python and WebAssembly, making it ideal for data-intensive applications. It comes with a variety of visualization plugins, including a datagrid and various charts. Additionally, it comes with a Jupyter widget, which allows developers to iterate quickly with a clear pathway to their production website.
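For a quick taste in a notebook, the Jupyter widget mentioned above can render a DataFrame directly (a minimal sketch; import paths and widget options differ between perspective-python versions, so treat the names as indicative):

```python
import pandas as pd
from perspective import PerspectiveWidget  # perspective-python

df = pd.DataFrame({"city": ["Berlin", "Hamburg"], "value": [3, 5]})
PerspectiveWidget(df, plugin="Datagrid")   # interactive datagrid in the notebook
```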
In the ever-evolving landscape of software development, crafting code that not only functions flawlessly but also operates at peak performance is a skill that sets exceptional developers apart. This talk delves into the art of optimizing Python code, exploring techniques and strategies to fine-tune your programs for maximum speed and minimal resource consumption, with a particular focus on memory efficiency.
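One example of the kind of memory-focused technique such a talk typically covers (a generic illustration, not taken from the speaker's material): generators produce items lazily, while lists materialize everything up front.

```python
import sys

squares_list = [i * i for i in range(1_000_000)]
squares_gen = (i * i for i in range(1_000_000))

# sys.getsizeof reports the container only, but the contrast is stark:
print(sys.getsizeof(squares_list))  # ~8 MB of pointers alone
print(sys.getsizeof(squares_gen))   # ~200 bytes, regardless of length
print(sum(squares_gen))             # still consumable, exactly once
```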
On 2023-05-02, the tech sphere buzzed with the release of Mojo 🔥, a new programming language developed by Chris Lattner, renowned for his work on Clang, LLVM, and Swift. Billed as "Python's faster cousin" and "the programming language for all AI developers", Mojo promised a 68,000x performance uplift and a familiar Pythonic syntax.
As it reaches its first anniversary, we unpack Mojo's journey towards its ambitious promise. This talk delves into the practical experiences developing a Large Language Model Interpretation library as part of an AI Safety Camp project in that language. We cast a critical eye over its performance, evaluate its usability, and explore its potential as a Python superset. Against a backdrop where alternatives like Rust, PyPy and Julia dominate performant programming for AI, we question whether Mojo can carve out its niche or if it will languish as another "could-have-been" in the programming language pantheon.
Imagine a data lab in a federal ministry wants to publish Python applications - how long could it possibly take? While open code is widely acknowledged as beneficial, the lack of thriving open-code platforms from public institutions gets you wondering: a day, a week, months, or even years?
When publishing code, a private person, a company, or a public institution each face unique circumstances and take different considerations into account. While individuals and companies frequently publish their code and share their experiences, less is known about these processes in public institutions. In our talk, we will cover how a data lab located in a federal ministry goes about this topic. We will share insights into the publishing process, touching upon existing pioneers and the alignment of open source with administrative principles, as well as the hurdles, surprises, and regulatory considerations of our journey.
Since we are a newly established unit with the word lab in our name, our talk delves into a unique real-world experiment: How much progress can our data lab make in publishing code within the three months leading up to PyCon DE & PyData Berlin 2024?
Plug-in solar systems, so-called balcony power plants, are getting more popular. This talk will cover the basics of such a system, how to figure out the energy consumption of a household and how to monitor and optimize the power output of a balcony power plant.
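As a flavour of the kind of arithmetic involved, here is a back-of-the-envelope comparison (all figures are illustrative assumptions, not numbers from the talk):

```python
# Does a 600 Wp balcony plant cover a household's base load?
base_load_w = 150                      # fridge, router, standby devices (assumed)
annual_base_kwh = base_load_w * 8760 / 1000               # ~1314 kWh/a

plant_wp = 600                         # typical plug-in system size (assumed)
specific_yield = 950                   # kWh per kWp per year, rough German figure (assumed)
plant_yield_kwh = plant_wp / 1000 * specific_yield * 0.9  # ~10% system losses (assumed)

print(f"base load:   {annual_base_kwh:.0f} kWh/a")
print(f"plant yield: {plant_yield_kwh:.0f} kWh/a")
```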
Have you ever tried to install a different Python version on Ubuntu or tried to upgrade your current one?
Lots of posts exist, many are outdated, and some even lead to a broken Ubuntu installation.
This talk will introduce the most common options and discuss their ups and downs in depth.
We will also give an outlook on what Ubuntu could do to make it even easier for you and everybody.
While RAG addresses common LLM pitfalls, challenges like handling out-of-domain queries persist. Learn the significance of fallback mechanisms to tackle these issues gracefully, incorporating strategies like web searches and alternative data sources to improve the user experience of your system. In this session, we’ll explore various fallback techniques and their practical implementation using Haystack, empowering you to develop resilient LLM-based systems for diverse scenarios without human intervention.
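The core fallback idea can be sketched in framework-agnostic form (an illustrative sketch, not Haystack's actual API): if retrieval yields nothing useful, consult an alternative source before generating.

```python
def answer(query, retrieve, web_search, generate, min_docs=1):
    docs = retrieve(query)
    if len(docs) < min_docs:      # likely out-of-domain for our index
        docs = web_search(query)  # fallback: alternative data source
    return generate(query, docs)

# Tiny usage example with stand-in callables:
print(answer(
    "Who won Euro 2024?",
    retrieve=lambda q: [],                      # nothing in our index
    web_search=lambda q: ["web snippet ..."],
    generate=lambda q, d: f"answer based on {len(d)} doc(s)",
))
```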
About 5 years ago, my co-founder and I launched alcemy, a Machine Learning startup helping to decarbonize the cement and concrete supply chain. I experienced first-hand the move from a simple proof of concept, an ML model inside a Jupyter notebook, to a full-fledged pipeline running 24/7 and steering massive amounts of cement production in real plants. I can tell you the road was long and winding. I want to share some of the hard lessons we learned along the way. If you are an aspiring ML or Software Engineer, Data Scientist, or Entrepreneur, or you are just wondering what Machine Learning applied in the wild looks like, this talk is for you. No prior knowledge is required except some familiarity with basic concepts and terminology of Machine Learning.
Python 3.12 introduced a new low-impact monitoring API with PEP 669, which can be used to implement far faster debuggers than ever before. This talk covers the main advantages of this API and how you can use it to develop small tools.
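A minimal taste of the API (requires Python 3.12+; names per the sys.monitoring documentation). Returning DISABLE from a callback switches that code location off, which is what makes the API low-impact:

```python
import sys

TOOL = sys.monitoring.DEBUGGER_ID
sys.monitoring.use_tool_id(TOOL, "demo-tracer")

def on_line(code, line_number):
    if code.co_name != "demo":
        return sys.monitoring.DISABLE   # stop monitoring this location: near-zero cost
    print(f"line {line_number} in demo()")

sys.monitoring.register_callback(TOOL, sys.monitoring.events.LINE, on_line)
sys.monitoring.set_events(TOOL, sys.monitoring.events.LINE)

def demo():
    x = 1
    return x + 1

demo()
sys.monitoring.set_events(TOOL, sys.monitoring.events.NO_EVENTS)
sys.monitoring.free_tool_id(TOOL)
```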
Apache Arrow has become a de-facto standard for efficient in-memory columnar data representation. You might have heard about Arrow or be using it already, but do you understand the format and why it’s so useful? This tutorial will dive deep into the details of the Arrow columnar format, the different types and buffer layouts, and explore those details interactively using the pyarrow and nanoarrow libraries.
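As a preview of the kind of interactive exploration involved (a minimal pyarrow sketch): a nullable int64 array is backed by a validity bitmap plus a fixed-width data buffer.

```python
import pyarrow as pa

arr = pa.array([1, None, 3], type=pa.int64())
validity, data = arr.buffers()
print(validity.size, "byte(s) of validity bitmap")  # 1 bit per value
print(data.size, "bytes of values")                 # 8 bytes per int64 slot
print(arr.null_count)                               # 1
```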
Explore the dynamic duo of GraphQL Strawberry and Django in an immersive workshop! Discover the seamless integration of Strawberry with Django, mastering type definitions, queries and mutations. Harness the power of Starlette for efficient API development, empowering your projects with this potent blend of cutting-edge technologies.
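As a taste of Strawberry's type-first style (a standalone sketch with made-up data; the workshop wires such types into Django models):

```python
import strawberry

@strawberry.type
class Book:
    title: str
    author: str

@strawberry.type
class Query:
    @strawberry.field
    def books(self) -> list[Book]:
        return [Book(title="The Trial", author="Franz Kafka")]

schema = strawberry.Schema(query=Query)
result = schema.execute_sync("{ books { title author } }")
print(result.data)
```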
For the third year in a row, the PyLadies Panel at PyCon DE & PyData Berlin engages a broader audience on critical issues related to gender disparities, ethics, and the ongoing importance of women-focused tech groups. Adopting unconventional formats, the PyLadies Panel aims to foster meaningful discussions among PyLadies members and the Python community, encouraging open dialogue and community solidarity.
This talk dives into how Python helps us to bridge the gap between automotive and energy industries. Learn how Python helps in integrating EV batteries into the power grid, enabling further use and growth of renewable energies, stabilizing power grids and enhancing the accessibility of electric mobility.
This presentation will give an overview of a scientific project that focuses on understanding how proteins move and function. Along the way, a very large collection of Python tools was used, and on top of them our own innovative approaches are built. To understand everything about living beings, including our health and the origin of diseases in humans, we have to know how proteins do what they do. Hence it is of utmost importance to understand their structure and function. Thanks to an extraordinary technique called X-ray crystallography, we are able to see what proteins look like at the atomic scale, but it is impossible to see how they move. Therefore, the next best thing we can do is simulate the motion of a protein with so-called molecular dynamics (MD) simulations. These simulations generate incredible amounts of data, generally hundreds of GB per microsecond of simulated protein movement! Extracting useful and meaningful information from it is a daunting task.
We are going to show how we have used many Python tools to tackle this problem. Using Django to place everything in an interactive web app (https://alokomp.irb.hr/), along with pandas, NumPy, SciPy, Dask, Jupyter, NetworkX, Bokeh, Datashader and many more under the hood, we have created an innovative new way of seeing proteins move and communicate.
In recent times, GenAI has sparked fervent excitement, sometimes touted as the panacea for all natural language processing (NLP) tasks. This presentation explores a real-world text classification scenario at Malt, highlighting the practical hurdles encountered when employing GenAI (latency, environmental impact, and budgetary constraints). To overcome these obstacles, a smaller, dedicated model emerged as a viable solution. We'll delve into the construction and optimization (quantization, graph optimization) of this multilingual model. Finally, we’ll see how GenAI's unparalleled zero-shot capabilities enable its continuous adaptation.
A standard Django project involves working with multiple files and folders from the start. Let's see how working with a Django project changes when we have only one file. This solution effectively turns Django into a microservice-oriented async framework with a "batteries included" philosophy.
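A minimal single-file Django app of the kind this talk builds on (a generic sketch, not the speaker's code; run with `python app.py runserver`):

```python
import sys
from django.conf import settings

# Configure settings before importing anything that needs them.
settings.configure(
    DEBUG=True,
    SECRET_KEY="not-so-secret",   # placeholder for local experiments only
    ROOT_URLCONF=__name__,        # this module doubles as the URLconf
    ALLOWED_HOSTS=["*"],
)

from django.http import JsonResponse
from django.urls import path

def ping(request):
    return JsonResponse({"status": "ok"})

urlpatterns = [path("ping/", ping)]

if __name__ == "__main__":
    from django.core.management import execute_from_command_line
    execute_from_command_line(sys.argv)
```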
I've been working full-time on a Python FOSS project for 525 days, so what did I learn?
Am I a better (Python) programmer?
Am I a better teammate?
Am I a better person?
In this talk I will share some of the lessons I learned over the course of these 525 days:
- how to get a tech job in this day & age
- how to put your ego aside when working with others (who know more than you!) and how to deal with mistakes
- how to interact with users & contributors online
- how it feels to contribute to a large codebase
As for the first three reflective questions, you'll have to ask my colleagues!
This talk is about setting up robust and scalable machine learning systems for high-throughput, real-time predictions and large numbers of users. It is meant for ML engineers and people who work with data and want to learn more about MLOps on cloud-based platforms. The focus will be on the different ways to serve predictions: in real time, asynchronously, and in batches. We discuss the advantages and disadvantages of each pattern and highlight the importance of choosing the right one for specific use cases, including generative large language models.
We will use examples from StepStone's production systems to illustrate how to build systems that scale to thousands of simultaneous requests while delivering low-latency, robust predictions.
I will cover some of the technical details, how to efficiently manage operations, and real-life examples in a way that is easy to understand and informative. You will learn about different setups for ML and how to make them work. This will help you make your ML inference faster, more cost-efficient, and reliable.
This session introduces PyFixest, an open source Python library inspired by the "fixest" R package. PyFixest implements fast routines for the estimation of regression models with high-dimensional fixed effects, including OLS, IV, and Poisson regression. The library also provides tools for robust inference, including heteroscedasticity-robust and cluster robust standard errors, as well as the wild cluster bootstrap. Additionally, PyFixest implements several routines for difference-in-differences estimation with staggered treatment adoption.
PyFixest aims to faithfully replicate the core design principles of "fixest", offering post-estimation inference adjustments, user-friendly syntax for multiple estimations, and efficient post-processing capabilities. By making efficient use of jit-compilation, it is also one of the fastest solutions for regressions with high-dimensional fixed effects.
The presentation will cover PyFixest's functionality, design philosophy, and future development prospects.
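A hedged sketch of the fixest-style formula API (toy data; consult the PyFixest documentation for exact signatures and options):

```python
import numpy as np
import pandas as pd
import pyfixest as pf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "y": rng.normal(size=1000),
    "x": rng.normal(size=1000),
    "firm": rng.integers(0, 50, size=1000),
    "year": rng.integers(2000, 2020, size=1000),
})

# OLS with firm and year fixed effects, standard errors clustered by firm.
fit = pf.feols("y ~ x | firm + year", data=df, vcov={"CRV1": "firm"})
fit.summary()
```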
Callbacks have become a ubiquitous programming technique that we use every day without even thinking about it. They are definitely handy in many situations, but sometimes they feel more like a burden than a help. In developing an interactive realtime audio processing system for use on stage in live music, we encountered such a situation. This talk will present how a few dozen lines adding a thin abstraction layer allowed us to replace a complex callback mess with tremendously more readable generators (yes, you know, those functions which yield results instead of returning them...).
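The shift can be illustrated in miniature (a generic sketch, not the talk's audio code): the callback version inverts control flow, while the generator version lets the consumer read top to bottom.

```python
# Callback style: logic scattered across a registered function.
def process_with_callback(samples, on_block, size=4):
    block = []
    for s in samples:
        block.append(s)
        if len(block) == size:
            on_block(block)
            block = []

# Generator style: the same chunking, but values are pulled, not pushed.
def blocks(samples, size=4):
    block = []
    for s in samples:
        block.append(s)
        if len(block) == size:
            yield block
            block = []

for b in blocks(range(12)):
    print(b)
```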
In this talk, we delve into the transformative world of asynchronous programming in Python, tailored specifically for the FastAPI framework. This session will explore the fundamentals of async/await syntax, unveiling how it can optimize the performance and scalability of web applications.
Attendees will gain practical insights into implementing asynchronous operations in FastAPI, from setting up to handling real-time data processing. This talk is perfect for Python developers eager to harness the power of asynchronous programming to build faster, more efficient web applications. Join us to unlock the full potential of Python's async capabilities within FastAPI's dynamic environment.
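A minimal example of the syntax in question (illustrative; save as main.py and serve with uvicorn):

```python
import asyncio
from fastapi import FastAPI

app = FastAPI()

@app.get("/slow")
async def slow():
    # The await hands the event loop to other requests while this one
    # waits on I/O (asyncio.sleep stands in for a database or HTTP call).
    await asyncio.sleep(1)
    return {"status": "done"}

# Run with: uvicorn main:app --reload
```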
Pixi is a modern package manager that bridges the worlds of conda and pip. A from-scratch implementation of a SAT solver that works for both pip and conda, native lockfiles, and a cross-platform task system are among its compelling features.
This study explores the efficacy of chatbots as dialogical argumentation systems for behaviour change, focusing on vaccine hesitancy during the COVID-19 pandemic. A Python-based chatbot, developed in 2021, engaged in argumentative dialogues with users reluctant to get vaccinated, resulting in a 20% positive change in participants' stances. As natural language processing technologies, like ChatGPT, advance, it is crucial to compare them to traditional expert systems. Prior studies have shown ChatGPT's reliability in addressing vaccine hesitancy. This research compares our chatbot with ChatGPT, evaluating persuasiveness through crowdsourced participants. The findings inform resource allocation decisions, guiding the choice between domain-specific expert systems and enhancing versatile models like ChatGPT. Understanding comparative strengths aids in preventing the dissemination of misinformation in behaviour change contexts.
What if writing software could be more like building with LEGO bricks? A more playful and productive developer experience. For me, that is all about writing code without the hassle. A productive setup should also let us make design decisions while learning what to actually build, and allow changes along the way. Polylith solves this in a nice and simple way. I am the developer of the open-source Python-specific tooling for Polylith. I’ll walk through the simple architecture and the developer-friendly tooling for a joyful Python experience.
With the latest advancements in Natural Language Processing and Large Language Models (LLMs), and big companies like OpenAI dominating the space, many people wonder: Are we heading further into a black box era with larger and larger models, obscured behind APIs controlled by big tech monopolies?
I don’t think so, and in this talk, I’ll show you why. I’ll dive deeper into the open-source model ecosystem, some common misconceptions about use cases for LLMs in industry, practical real-world examples and how basic principles of software development such as modularity, testability and flexibility still apply. LLMs are a great new tool in our toolkits, but the end goal remains to create a system that does what you want it to do. Explicit is still better than implicit, and composable building blocks still beat huge black boxes.
Your project's documentation site is one of the first places where new users will interact with your project; as such, it is essential that it is up-to-date, well-organised, and usable, and that it caters to newcomers, experienced users, and contributors alike.
It is estimated that about 25% of the global population has some form of disability. Ensuring all folks can use and access your projects and their documentation is paramount, and this, of course, includes thinking of and including disabled developers and end-users.
In this talk, we will cover some of the basics of web content accessibility and explore some tools and approaches that you can use to ensure your tools and documentation sites are accessible.
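For instance, one small, scriptable check (an illustrative sketch; dedicated accessibility tools go much further) is flagging images that lack alt text:

```python
from bs4 import BeautifulSoup

# "docs/index.html" is a hypothetical path to a built documentation page.
html = open("docs/index.html").read()
soup = BeautifulSoup(html, "html.parser")
for img in soup.find_all("img"):
    if not img.get("alt"):
        print(f"missing alt text: {img.get('src')}")
```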
In this talk, we will discuss leveraging Jupyter Notebooks to generate print media - books, magazine and newspaper articles, business reports, academic papers, etc. We will motivate the problem, introduce a library for accomplishing the task (nbprint), and walk through some end-to-end examples.
In today's digital landscape, traditional analytics struggle with understanding marketing ROI, especially with evolving privacy norms. But Python and its ecosystem come to the rescue.
In this talk, we will discuss how we leveraged Python and PyMC to build a Bayesian Marketing Media Mix model for the fastest-growing Italian tour operator. We'll cover the challenges we faced, the valuable insights we gained, and the results achieved. This will offer you a clear and practical roadmap for developing a similar model for your business.
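To make this concrete, here is a drastically simplified Bayesian regression in PyMC (toy data; a real media mix model adds adstock and saturation transforms on top of this skeleton):

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(1)
tv, search = rng.gamma(2, 1, 100), rng.gamma(2, 1, 100)
sales = 3 + 1.5 * tv + 0.8 * search + rng.normal(0, 0.5, 100)

with pm.Model() as mmm:
    beta_tv = pm.HalfNormal("beta_tv", 2)         # spend assumed not to hurt sales
    beta_search = pm.HalfNormal("beta_search", 2)
    intercept = pm.Normal("intercept", 0, 5)
    sigma = pm.HalfNormal("sigma", 1)
    mu = intercept + beta_tv * tv + beta_search * search
    pm.Normal("sales", mu, sigma, observed=sales)
    idata = pm.sample(1000, tune=1000)            # posterior over channel effects
```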
In a world increasingly embracing Python, plug-and-play solutions and AI-generated code, our generation growing up with these advancements may not fully grasp the challenges faced by our predecessors. Meanwhile, data engineering, traditionally known for its complexity, can now transition into the plug-and-play realm too, thanks to Python libraries such as dlt.
Aiming to be both fun and insightful, this talk will educate listeners on the data engineering concepts our generation finds most important and enable them to use high-level abstractions to automate most of what used to be highly manual work. Juniors will gain an appreciation for the difficulties of data pipeline engineering; seniors, a straightforward way to expedite the creation of robust pipelines.
From the perspective of junior data engineers such as us, the talk will walk through the challenges associated with constructing a data pipeline and demonstrate how these can be effectively addressed using Python libraries such as dlt that simplify the intricacies of data extraction, transformation, and loading.
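As a taste of that plug-and-play style, a minimal dlt pipeline might look like this (a sketch following dlt's documented basics; needs the duckdb extra, e.g. pip install "dlt[duckdb]"):

```python
import dlt

# Toy records standing in for an extracted source.
data = [
    {"id": 1, "city": "Berlin"},
    {"id": 2, "city": "Munich"},
]

pipeline = dlt.pipeline(
    pipeline_name="demo",
    destination="duckdb",
    dataset_name="cities",
)
info = pipeline.run(data, table_name="stops")  # schema inference and loading handled for you
print(info)
```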
This is a story about applying Python and the “hacker mindset” to Computer Aided Engineering (CAE), an emerging domain within the Python ecosystem. Shell scripts have traditionally been the preferred tool for automating CAE pipelines, especially in the subfield of Computational Fluid Dynamics (CFD). However, this approach is brittle, severely limited, and cumbersome to manage at scale. Data management is also a challenge, with tens to hundreds of GB per simulation needing to be stored and versioned in complex folder structures. One possible approach is to use Python as an automation and glue language together with Data Version Control (DVC), a Python-based tool built on top of Git to track pipelines and data.