Speeding up Python code has traditionally been achieved by writing C/C++ — an alien world for most Python users. Today, you can write high-performance code in Julia instead, which is much easier for Python users. This tutorial will give you hands-on experience writing a Python library that incorporates Julia for performance optimization.
The industrial environment offers a lot of interesting use cases for data enthusiasts. There is a myriad of challenges that can be solved by data scientists.
However, collecting industrial data in general, and industrial IoT (IIoT) data in particular, is cumbersome and not really appealing for anyone who just wants to work with data.
Apache StreamPipes addresses this pitfall and allows anyone to extract data from IIoT data sources without messing around with (old-fashioned) protocols. In addition, StreamPipes' newly developed Python client now gives Pythonistas the ability to programmatically access and work with this data in a Pythonic way.
This talk will provide a basic introduction to the functionality of Apache StreamPipes itself, followed by a deeper discussion of the Python client. Finally, a live demo will show how IIoT data can be easily retrieved in Python and used directly for visualization and ML model training.
What is an ML platform and do you even need one? When should you consider investing in your own ML platform? What challenges can you expect building and maintaining one? Tune in and discover (some) answers to these questions and more! I will share a first-hand account of our ongoing journey towards becoming an ML platform team within Delivery Hero's Logistics department, including how we got here, how we structure our work, and which challenges and tools we are focusing on next.
The nightmare before data science production: you found a working prototype for your problem using a Jupyter notebook, and now it's time to build a production-grade solution from that notebook. Unfortunately, your notebook looks anything but production-grade. The good news is, there's finally a cure!
The open-source Python package LineaPy aims to automate data science workflow generation and expedite the process of going from data science development to production. It truly transforms messy notebooks into pipelines for orchestration frameworks such as Apache Airflow, DVC, Argo, Kubeflow, and many more. And if you can't find your favorite orchestration framework, you are welcome to work with the creators of LineaPy to contribute a plugin for it!
In this talk, you will learn the basic concepts of LineaPy and how it supports your everyday tasks as a data practitioner. For this purpose, we will transform a notebook step by step together to create a DVC pipeline. Finally, we will discuss what place LineaPy will take in the MLOps universe. Will you only have to check in your notebook in the future?
When building PyTorch models for custom applications from scratch there's usually one problem: The model does not learn anything. In a complex project, it can be tricky to identify the cause: Is it the data? A bug in the model? Choosing the wrong loss function at 3 am after an 8-hour coding session?
In this talk, we will build a toolbox to find the culprits in a structured manner. We will focus on simple ways to ensure a training loop is correct, generate synthetic training data to determine whether we have a model bug or problematic real-world data, and leverage pytest to safely refactor PyTorch models.
After this talk, attendees will be well equipped to take the right steps when a model is not learning, quickly identify the underlying reasons, and prevent bugs in the future.
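As a taste of the toolbox (a minimal sketch, not the speakers' actual code; model, data and step count are made up), here is a pytest check that a few training steps on easy synthetic data actually reduce the loss, which quickly separates "bug in the training loop" from "problematic real-world data":

```python
import torch
from torch import nn


def make_synthetic_batch(n=64, dim=10):
    # Linearly separable toy data: if the model cannot fit this, the bug is in the code, not the data.
    x = torch.randn(n, dim)
    y = (x[:, 0] > 0).long()
    return x, y


def test_training_step_reduces_loss():
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()
    x, y = make_synthetic_batch()

    loss_before = loss_fn(model(x), y).item()
    for _ in range(20):  # a handful of steps is enough to detect "the model learns nothing"
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    loss_after = loss_fn(model(x), y).item()

    assert loss_after < loss_before, "training did not reduce the loss on trivial synthetic data"
```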
The materials presented during this tutorial are open source and can be used by coaches and tutors who want to teach their students how to use Python for text processing and text classification. A minimal understanding of programming (in any language) is required of the students.
Pandas reached its 2.0 milestone in 2023. But what does that mean? And what is coming after 2.0? This talk will give an overview of what happened in the latest releases of pandas and highlight some topics and major new features the pandas project is working on.
Python packaging is quickly evolving and new tools pop up on a regular basis. Lots of talks and posts on packaging exist but none of them give a structured, unbiased overview of the available tools.
This talk will shed light on the jungle of packaging and environment management tools, comparing them on a basis of predefined features.
AutoML, or automated machine learning, offers the promise of transforming raw data into accurate predictions with minimal human intervention, expertise, and manual experimentation. In this talk, we will introduce AutoGluon, a cutting-edge toolkit that enables AutoML for tabular, multimodal and time series data. AutoGluon emphasizes usability, enabling a wide variety of tasks from regression to time series forecasting and image classification through a unified and intuitive API. We will specifically focus on tabular and time series tasks, where AutoGluon is the current state of the art, and demonstrate how AutoGluon can be used to achieve competitive performance on tabular and time series competition data sets. We will also discuss the techniques used to automatically build and train these models, peeking under the hood of AutoGluon.
In recent years, Hyperparameter Optimization (HPO) has become a fundamental step in the training of Machine Learning (ML) models and in the creation of automatic ML pipelines. Unfortunately, while HPO improves the predictive performance of the final model, it comes with a significant cost both in terms of computational resources and waiting time. This leads many practitioners to try to lower the cost of HPO by employing unreliable heuristics.
In this talk we will provide simple and practical algorithms for users who want to train models with almost-optimal predictive performance while incurring a significantly lower cost and waiting time. The presented algorithms are agnostic to the application and the model being trained, so they can be useful in a wide range of scenarios.
We provide results from extensive experiments on public benchmarks, including comparisons with well-known techniques such as Bayesian Optimization (BO), ASHA, and Successive Halving.
We will describe in which scenarios the biggest gains are observed (up to 30x) and provide examples for how to use these algorithms in a real-world environment.
All the code used for this talk is available on [GitHub](https://github.com/awslabs/syne-tune).
In this talk, I'll show how large language models such as GPT-3 complement rather than replace existing machine learning workflows. Initial annotations are gathered from the OpenAI API via zero- or few-shot learning, and then corrected by a human decision maker using an annotation tool. The resulting annotations can then be used to train and evaluate models as normal. This process results in higher accuracy than can be achieved from the OpenAI API alone, with the added benefit that you own and control the model at runtime.
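A minimal sketch of the zero-shot annotation step described above (assuming the pre-1.0 `openai` Python client; the label set, prompt and model name are illustrative, not from the talk):

```python
import openai  # pre-1.0 client exposing openai.ChatCompletion

LABELS = ["positive", "negative", "neutral"]


def draft_annotation(text: str) -> str:
    """Ask the model for a zero-shot label; a human reviews and corrects it afterwards."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {"role": "system",
             "content": f"Classify the text into one of: {', '.join(LABELS)}. Answer with the label only."},
            {"role": "user", "content": text},
        ],
    )
    label = response["choices"][0]["message"]["content"].strip().lower()
    return label if label in LABELS else "unknown"
```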
Snowflake as a data platform is the core data repository of many large organizations.
With the introduction of Snowflake's Snowpark for Python, Python developers can now collaborate and build on one platform with a secure Python sandbox, providing developers with dynamic scalability & elasticity as well as security and compliance.
In this talk I'll explain the core concepts of Snowpark for Python and how they can be used for large scale feature engineering and data science.
In this keynote, I will share the lessons learned from using Python in 4 industries. Apart from machine learning applications that I build in my day-to-day work as a data scientist and machine learning engineer, I also use Python to develop games for my own gaming company, Quill Game Studios. There is a lot of versatility in Python, and it's been my pleasure to use it to solve many interesting problems. I hope that this talk can give you inspiration for various types of applications in your own industry as well.
Time-series data is all around us: from logistics to digital marketing, from pricing to stock
markets. It’s hard to imagine a modern business that has no time series data to forecast.
However, mastering such forecasting is not an easy task.
For this talk, together with other domain experts, I have collected a list of common time
series issues that data professionals run into. After this talk, you will learn to
identify, understand, and resolve such issues. This will include stabilising divergent time
series, organising delayed / irregular data, handling missing values without anomaly propagation,
and reducing the impact of noise and outliers on your forecasting models.
We have recently open-sourced a pure-Python implementation of Cyclic Boosting, a family of general-purpose, supervised machine learning algorithms. Its predictions are fully explainable on individual sample level, and yet Cyclic Boosting can deliver highly accurate and robust models. For this, it requires little hyperparameter tuning and minimal data pre-processing (including support for missing information and categorical variables of high cardinality), making it an ideal off-the-shelf method for structured, heterogeneous data sets. Furthermore, it is computationally inexpensive and fast, allowing for rapid improvement iterations. The modeling process, especially the infamous but unavoidable feature engineering, is facilitated by automatic creation of an extensive set of visualizations for data dependencies and training results. In this presentation, we will provide an overview of the inner workings of Cyclic Boosting, along with a few sample use cases, and demonstrate the usage of the new Python library.
You can find Cyclic Boosting on GitHub: https://github.com/Blue-Yonder-OSS/cyclic-boosting
In this talk, we will explore the build-measure-learn paradigm and the role of baselines in natural language processing (NLP). We will cover the common NLP tasks of classification, clustering, search, and named entity recognition, and describe the baseline approaches that can be used for each task. We will also discuss how to move beyond these baselines through weak learning and transfer learning. By the end of this talk, attendees will have a better understanding of how to establish and improve upon baselines in NLP.
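To make the idea of a baseline concrete, here is a hypothetical sketch (not the speaker's code) of the kind of cheap classification baseline that everything else gets measured against:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy data; in practice this would be your labelled corpus.
texts = ["great product", "terrible support", "works as expected", "never buying again"]
labels = [1, 0, 1, 0]

# TF-IDF features plus logistic regression: a strong, cheap baseline to beat.
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
print(cross_val_score(baseline, texts, labels, cv=2).mean())
```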
Learn how to build and analyze heterogeneous graphs using PyG, a graph machine learning library in Python. This workshop will provide a practical introduction to the concept of heterogeneous graphs and their applications, including their ability to capture the complexity and diversity of real-world systems. Participants will gain experience in creating a heterogeneous graph from multiple data tables, preparing a dataset, and implementing and training a model using PyG.
Pandas is the de-facto standard for data manipulation in Python, which I personally love for its flexible syntax and interoperability. But Pandas has well-known drawbacks such as memory inefficiency, inconsistent missing-data handling and lacking multicore support. Multiple open-source projects aim to solve those issues; the most interesting one is Polars.
Polars uses Rust and Apache Arrow to win all kinds of performance benchmarks and evolves fast. But is it already stable enough to migrate an existing Pandas codebase? And does it meet the high expectations of long-time Pandas lovers regarding query language flexibility?
In this talk, I will explain how Polars can be that fast and present my insights on where Polars shines and in which scenarios I stay with Pandas (at least for now!).
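To make the comparison concrete, here is a small sketch of the lazy query style that gives Polars much of its speed; the file and column names are made up, and recent Polars releases use `group_by` where older ones used `groupby`:

```python
import polars as pl

# Lazy execution: the whole query is optimised as one plan and only materialised at .collect().
lazy = (
    pl.scan_csv("sales.csv")  # hypothetical file
    .filter(pl.col("amount") > 0)
    .group_by("region")
    .agg(pl.col("amount").sum().alias("total"))
    .sort("total", descending=True)
)
df = lazy.collect()
print(df)
```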
Proper monitoring of machine learning models in production is essential to avoid performance issues. Setting up monitoring can be easy for a single model, but it often becomes challenging at scale or when you face alert fatigue based on many metrics and dashboards.
In this talk, I will introduce the concept of test-based ML monitoring. I will explore how to prioritize metrics based on risks and model use cases, integrate checks into the prediction pipeline, and standardize them across similar models and the model lifecycle. I will also take an in-depth look at batch model monitoring architecture and the use of open-source tools for setup and analysis.
In recent years we have seen an explosion of Python usage in the browser: Pyodide, CPython on WASM, PyScript, etc. All of this is possible thanks to the powerful functionality of the underlying platform, WebAssembly, which is essentially a virtual CPU inside the browser.
We evaluated time-series databases and complementary services to stream-process sensor data. In this talk, we will present our evaluation and show the final implementation, alongside the Python tools we've built and the lessons learned during the process.
When handling a large amount of data, memory profiling the data science workflow becomes more important. It gives you insight into which processes consume a lot of memory. In this talk, we will introduce Memray, a Python memory profiling tool, and its new Jupyter plugin.
At the boundary of model development and MLOps lies the balance between the speed of deploying new models and ensuring operational constraints. These include factors like low latency prediction, the absence of vulnerabilities in dependencies and the need for the model behavior to stay reproducible for years. The longer the list of constraints, the longer it usually takes to take a model from its development environment into production. In this talk, we present how we seemingly managed to square the circle and have both a rapid, highly dynamic model development and yet also a stable and high-performance deployment.
In this talk, we will introduce the audience to DoWhy, a library for causal machine-learning (ML). We will introduce typical problems where causal ML can be applied and will specifically do a deep dive on root cause analysis using DoWhy. To do this, we will lay out what typical problem spaces for causal ML look like, what kind of problems we're trying to solve, and then show how to use DoWhy's API to solve these problems. Expect to see a lot of code with a hands-on example. We will close this session by zooming out a bit and also talk about the PyWhy organization governing DoWhy.
In this talk, we will report on our experiences switching from Pandas to Polars in a real-world ML project. Polars is a new high-performance dataframe library for Python based on Apache Arrow and written in Rust. We will compare the performance of Polars with the popular Pandas library, and show how Polars can provide significant speed improvements for data manipulation and analysis tasks. We will also discuss the unique features of Polars, such as its ability to handle large datasets that do not fit into memory, and how it feels in practice to make the switch from Pandas. This talk is aimed at data scientists, analysts, and anyone interested in fast and efficient data processing in Python.
The name WALD-stack stems from the four technologies it is composed of: a cloud-computing Warehouse like Snowflake or Google BigQuery, the open-source data integration engine Airbyte, the open-source full-stack BI platform Lightdash, and the open-source data transformation tool dbt.
Using a Formula 1 Grand Prix dataset, I will give an overview of how these four tools complement each other perfectly for analytics tasks in an ELT approach. You will learn the specific uses of each tool as well as their particular features. My talk is based on a full tutorial, which you can find at waldstack.org.
The detection of outliers or anomalous data patterns is one of the most prominent machine learning use cases in industrial applications. I present a Bayesian histogram anomaly detector (BHAD), where the number of bins is treated as an additional unknown model parameter with an assigned prior distribution. BHAD scales linearly with the sample size and enables a straightforward explanation of individual scores, which makes it very suitable for industrial applications where model interpretability is crucial. I study the predictive performance of the proposed BHAD algorithm against various state-of-the-art anomaly detection approaches using simulated data as well as popular benchmark datasets for outlier detection. The reported results indicate that BHAD has very competitive predictive accuracy compared to other, more complex and computationally more expensive algorithms, while being explainable and fast.
Large Language Models (LLM), like ChatGPT, have shown miraculous performances on various tasks. But there are still unsolved issues with these models: they can be confidently wrong and their knowledge becomes outdated. GPT also does not have any of the information that you have stored in your own data. In this talk, you'll learn how to use Haystack, an open source framework, to chain LLMs with other models and components to overcome these issues. We will build a practical application using these techniques. And you will walk away with a deeper understanding of how to use LLMs to build NLP products that work.
In this talk, we will explore how to use the FastAPI web framework and Celery task queue to build reliable and scalable web applications in a test-driven manner. We will start by setting up a testing environment and writing unit tests for the core functionality of our application. Next, we will use FastAPI to create an API that performs some long-running task. Finally, we will see how Celery can help us offload long-running tasks and improve the performance of our application.
By the end of this talk, attendees will have a strong understanding of TDD and how to apply it to their FastAPI and Celery projects, and will be able to write tests that ensure the reliability and maintainability of their code.
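As a flavour of the approach (a hypothetical sketch, not the talk's code): the test is written first, and the endpoint only enqueues the long-running work, which Celery would execute outside the request cycle:

```python
from fastapi import FastAPI
from fastapi.testclient import TestClient

app = FastAPI()


@app.post("/process")
def process(payload: dict):
    # In the real application this would enqueue a Celery task, e.g. long_running_task.delay(payload).
    return {"status": "queued"}


client = TestClient(app)


def test_process_returns_queued():
    response = client.post("/process", json={"document_id": 42})
    assert response.status_code == 200
    assert response.json()["status"] == "queued"
```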
As machine learning becomes more prevalent across nearly every business and industry, making sure that these technologies are working and delivering quality is critical. In her talk, Alicia will discuss the importance of machine learning observability and why it should be a fundamental tool of modern machine learning architectures. Not only does it ensure models are accurate, but it helps teams iterate and improve models quicker. Alicia will dive into how Shopify has been prototyping building observability into different parts of its machine learning platform. This talk will provide insights on how to track model performance, how to catch any unexpected or erroneous behaviour, what types of behavior to look for in your data (e.g. drift, quality metrics) and in your model/predictions, and how observability could work with large language models and Chat AIs.
In this talk, we will explore the use of Python's typing.Protocol, Scala's Typeclasses, and Rust's Traits.
They all offer a powerful and elegant mechanism for abstracting over various concepts (such as Serialization) in a modular manner.
We will compare and contrast the syntax and implementation of these constructs in each language and discuss their strengths and weaknesses. We will also look at real-world examples of how these features are used in each language to specify behavior, and consider differences in terms of type system expressiveness and effectiveness. By the end of the talk, attendees will have a better understanding of the differences and similarities between these three language features, and will be able to make informed decisions about which one is best suited for their needs.
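For the Python side, a minimal sketch of `typing.Protocol` in action (names are illustrative): any object with a matching `serialize` method is accepted, no inheritance required, which is the structural-typing counterpart of Scala's typeclasses and Rust's traits:

```python
import json
from typing import Protocol


class Serializer(Protocol):
    def serialize(self, obj: object) -> bytes: ...


class JsonSerializer:  # note: no subclassing of Serializer needed
    def serialize(self, obj: object) -> bytes:
        return json.dumps(obj).encode()


def store(obj: object, serializer: Serializer) -> bytes:
    # mypy checks structurally that the argument provides .serialize()
    return serializer.serialize(obj)


store({"a": 1}, JsonSerializer())
```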
The title “Data Scientist” has been in use for 15 years now. We have been attending PyData conferences for over 10 years as well. The hype around data science and AI seems higher than ever before. But how are we managing?
The aspect-oriented programming paradigm can support the separation of
cross-cutting concerns such as logging, caching, or checking of permissions.
This can improve code modularity and maintainability.
Python offers decorators to implement reusable code for cross-cutting tasks.
This tutorial is an in-depth introduction to decorators.
It covers the usage of decorators and how to implement simple and more advanced
decorators.
Use cases demonstrate how to work with decorators.
In addition to showing how functions can use closures to create decorators,
the tutorial introduces callable class instances as an alternative.
Class decorators can solve problems that used to be tasks for metaclasses.
The tutorial provides use cases for class decorators.
While the focus is on best practices and practical applications, the tutorial
also provides deeper insight into how Python works behind the scenes.
After the tutorial participants will feel comfortable with functions that take
functions and return new functions.
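As a small preview of the material (illustrative examples, not the full tutorial), here is a closure-based decorator next to a callable-class-instance decorator:

```python
import functools
import time


def timed(func):
    """Closure-based decorator: wraps func and reports its runtime."""
    @functools.wraps(func)  # preserve name, docstring and other metadata
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.4f}s")
        return result
    return wrapper


class CountCalls:
    """The same idea implemented with a callable class instance instead of a closure."""
    def __init__(self, func):
        functools.update_wrapper(self, func)
        self.func = func
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        return self.func(*args, **kwargs)


@timed
@CountCalls
def work(n):
    return sum(range(n))


work(1_000_000)
```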
In this talk I will present two new open-source packages that make up a powerful and state-of-the-art marketing analytics toolbox. Specifically, PyMC-Marketing is a new library built on top of the popular Bayesian modeling library PyMC. PyMC-Marketing allows robust estimation of customer acquisition costs (via media mix modeling) as well as customer lifetime value.
In addition, I will show how we can estimate the effectiveness of marketing campaigns using a new Bayesian causal inference package called CausalPy. The talk will be hands-on, with a real-world case study and many code examples. Special emphasis will be placed on the interplay between these tools and how they can be combined to make optimal marketing budget decisions in complex scenarios.
In this tutorial, you will learn about the various Python modules for processing geospatial data, including GDAL, Rasterio, Pyproj, Shapely, Folium, Fiona, OSMnx, Libpysal, Geopandas, Pydeck, Whitebox, ESDA, and Leaflet. You will gain hands-on experience working with real-world geospatial data and learn how to perform tasks such as reading and writing spatial data, reprojecting data, performing spatial analyses, and creating interactive maps. This tutorial is suitable for beginners as well as intermediate Python users who want to expand their knowledge in the field of geospatial data processing.
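As a tiny foretaste of the hands-on part (a hedged sketch with a hypothetical file name, not the tutorial's data), reading, reprojecting and spatially querying a vector layer with GeoPandas looks like this:

```python
import geopandas as gpd

# Any vector format supported by Fiona/GDAL works here; the file name is a placeholder.
districts = gpd.read_file("districts.geojson")

# Reproject to a metric CRS so areas and distances are meaningful.
districts = districts.to_crs("EPSG:3857")
districts["area_km2"] = districts.geometry.area / 1e6

# A simple spatial query: districts intersecting a 1 km buffer around the first geometry.
buffer = districts.geometry.iloc[0].buffer(1000)
nearby = districts[districts.intersects(buffer)]
print(nearby[["area_km2"]].head())
```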
Large generative models rely upon massive data sets that are collected automatically. For example, GPT-3 was trained with data from “Common Crawl” and “Web Text”, among other sources. As the saying goes — bigger isn’t always better. While powerful, these data sets (and the models that they create) often come at a cost, bringing their “internet-scale biases” along with their “internet-trained models.” These models raise the question: is unsupervised learning the best future for machine learning?
ML researchers have developed new model-tuning techniques to address the known biases within existing models and improve their performance (as measured by response preference, truthfulness, toxicity, and result generalization), all at a fraction of the initial training cost. In this talk, we will explore these techniques, known as Reinforcement Learning from Human Feedback (RLHF), and how open-source machine learning tools like PyTorch and Label Studio can be used to tune off-the-shelf models using direct human feedback.
Even if every data science project is special, a lot can be learned from similar problems solved in the past. In this talk, I will share some specific software design concepts that data scientists can use to build better data products.
As the number of production machine learning use cases increases, we find ourselves facing new and bigger challenges where more is at stake. Because of this, it's critical to identify the key areas to focus our efforts on, so we can ensure our machine learning pipelines are reliable and scalable. In this talk we dive into the state of production machine learning in the Python ecosystem, and we will cover the concepts that make production machine learning so challenging, as well as some of the recommended tools available to tackle these challenges.
This talk will cover key principles, patterns and open-source frameworks powering single or multiple phases of the end-to-end ML lifecycle, including model training, deployment, and monitoring. We will give a high-level overview of the production ML ecosystem and dive into best practices that have been abstracted from production use cases of machine learning operations at scale, as well as how to leverage tools that will allow us to deploy, explain, secure, monitor and scale production machine learning systems.
Within this talk, I want to look at the topic of data ethics with a practical lens and facilitate the discussion about how we can establish ethical data practices in our day-to-day work. I will shed some light on the multiple sources of bias in data applications: where are the potential pitfalls, and how can we prevent, detect and mitigate them early so they never become a risk for our data product? I will walk you through the different stages of a data product lifecycle and dive deeper into the questions we as data professionals have to ask ourselves throughout the process. Furthermore, I will present methods, tools and libraries that can support our work. Being well aware that there is no universal solution, as tools and strategies need to be chosen to specifically address the requirements of the use case and models at hand, my talk will provide a good starting point for your own data ethics journey.
PyScript brings the full PyData stack to the browser, opening up unprecedented use cases for interactive data-intensive applications. In this scenario, the web browser becomes a ubiquitous computing platform, operating within a (nearly) zero-installation, serverless environment.
In this talk, we will explore how to create full-fledged interactive front-end machine learning applications using PyScript. We will dive into the main features of the PyScript platform (e.g. built-in JavaScript integration and local modules), discussing new data and design patterns (e.g. loading heterogeneous data in the browser) required to adapt to and overcome the limitations imposed by the new operating environment (i.e. the browser).
Bluetooth Low Energy (BLE) is a part of the Bluetooth standard aimed at bringing wireless technology to low-power devices, and it's getting into everything - lightbulbs, robots, personal health and fitness devices, and plenty more. One of the main advantages of BLE is that everybody can integrate those devices into their tools or projects.
However, BLE is not the most developer-friendly protocol, and these devices most of the time don't come with good documentation. In addition, there are not a lot of good open-source tools, examples, and tutorials on how to use Python with BLE, especially if one wants to build both sides of the communication.
In this talk, I will introduce the concepts and properties used in BLE interactions and look at how we can use the Linux Bluetooth Stack (Bluez) to communicate with other devices. We will look at a simple example and learn along the way about common pitfalls and debugging options while working with BLE and Python.
This talk is for everybody who has a basic understanding of Python and wants to gain a deeper understanding of how BLE works and how one could use it in a private project.
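The talk works with the Linux BlueZ stack directly; purely to illustrate the concepts (scanning, connecting, reading a characteristic), here is a hedged sketch using the cross-platform bleak library, with a placeholder device address:

```python
import asyncio

from bleak import BleakClient, BleakScanner

BATTERY_LEVEL_UUID = "00002a19-0000-1000-8000-00805f9b34fb"  # standard Battery Level characteristic


async def main():
    # Discover nearby advertising BLE devices.
    for device in await BleakScanner.discover(timeout=5.0):
        print(device.address, device.name)

    # Connect to one device (placeholder address) and read a characteristic by UUID.
    async with BleakClient("AA:BB:CC:DD:EE:FF") as client:
        value = await client.read_gatt_char(BATTERY_LEVEL_UUID)
        print("battery:", int(value[0]), "%")


asyncio.run(main())
```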
Chatbots are fun to use, ranging from simple chit-chat (“How are you today?”) to more sophisticated use cases like shopping assistants, or the diagnosis of technical or medical problems. Despite their mostly simple user interaction, chatbots must combine various complex NLP concepts to deliver convincing, intelligent, or even witty results.
With the advancing development of machine learning models and the availability of open source frameworks and libraries, chatbots are becoming more powerful every day and at the same time easier to implement. Yet, depending on the concrete use case, the implementation must be approached in specific ways. In the design process of chatbots it is crucial to define the language processing tasks thoroughly and to choose from a variety of techniques wisely.
In this talk, we will look together at common concepts and techniques in modern chatbot implementation as well as practical experiences from an E-mobility bot that was developed using the Rasa framework.
Python is a very expressive and powerful language, but it is not always the fastest option for performance-critical parts of an application. Rust, on the other hand, is known for its lightning-fast runtime and low-level control, making it an attractive option for speeding up performance-sensitive portions of Python programs.
In this talk, we will present a case study of using Rust to speed up a critical component of a Python application. We will cover the following topics:
- An overview of Rust and its benefits for Python developers
- Profiling and identifying performance bottlenecks in a Python application
- Implementing a solution in Rust and integrating it with the Python application using PyO3
- Measuring the performance improvements and comparing them to other optimization techniques
Attendees will learn about the potential for using Rust to boost the performance of their Python programs and how to go about doing so in their own projects.
Innovations such as sentence-transformers, neural search and vector databases have fueled a very fast development of question-answering systems recently. At scieneers, we wanted to test those components to satisfy our own information needs using a Slack bot that answers our questions by reading through our internal documents and Slack conversations. We therefore leveraged the Haystack QA framework in combination with a Weaviate vector database and many fine-tuned NLP models.
This talk will give you insights into both the technical challenges we faced and the organizational lessons we learned.
An exchange of views on FastAPI in practice.
FastAPI is great, it helps many developers create REST APIs based on the OpenAPI standard and run them asynchronously. It has a thriving community and educational documentation.
FastAPI does a great job of getting people started with APIs quickly.
This talk will point out some obstacles and dark corners that we wish we had known about before, and highlight solutions to them.
At the semiconductor division of Carl Zeiss it's our mission to continuously make computer chips faster and more energy efficient. To do so, we go to the very limits of what is possible, both physically and technologically. This is only possible through massive research and development efforts.
In this talk, we tell the story of how Python became a central tool for our R&D activities. This includes technical aspects as well as organization and culture. How do you make sure that hundreds of people work in consistent environments? How do you get everyone on board to work together with Python? You have lots of domain experts without much software background; how do you prevent them from creating a mess when projects get larger?
Keeping in mind the Pythonic principle that “simple is better than complex”, we'll see how to create a web map with the Python-based web framework Django using its GeoDjango module, storing geographic data in your local database and running geospatial queries against it.
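A condensed sketch of the pattern (model and field names are hypothetical, and the snippet belongs inside a configured Django app, e.g. its models.py plus a view or shell session):

```python
from django.contrib.gis.db import models
from django.contrib.gis.geos import Point
from django.contrib.gis.measure import D


class Shop(models.Model):
    name = models.CharField(max_length=100)
    location = models.PointField(geography=True)  # stored in the spatial database backend


def shops_nearby():
    # All shops within 2 km of a given point (longitude, latitude, WGS84).
    here = Point(13.405, 52.52, srid=4326)
    return Shop.objects.filter(location__distance_lte=(here, D(km=2)))
```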
Debugging is hard. Distributed debugging is hell.
Dask is a popular library for parallel and distributed computing in Python. Dask is commonly used in data science, actual science, data engineering, and machine learning to distribute workloads onto clusters of many hundreds of workers with ease.
However, when things go wrong life can become difficult due to all of the moving parts. These parts include your code, other PyData libraries like NumPy/pandas, the machines you’re running on, the network between them, storage, the cloud, and of course issues with Dask itself. It can be difficult to understand what is going on, especially when things seem slower than they should be or fail unexpectedly. Observability is the key to sanity and success.
In this talk, we describe the tools Dask offers to help you observe your distributed cluster, analyze performance, and monitor your cluster to react to unexpected changes quickly. We will dive into distributed logging, automated metrics, event-based monitoring, and root-causing problems with diagnostic tooling. Throughout the talk, we will leverage real-world use cases to show how these tools help to identify and solve problems for large-scale users in the wild.
This talk should be particularly insightful for Dask users, but the approaches to observing distributed systems should be relevant to anyone operating at scale in production.
“Got an NLP problem nowadays? Use transformers! Just download a pretrained model from the hub!” - every blog article ever
As if it’s that easy, because nearly all pretrained models have a very annoying limitation: they can only process short input sequences. Not every NLP practitioner happens to work on tweets, but instead many of us have to deal with longer input sequences. What started as a minor design choice for BERT, got cemented by the research community over the years and now turns out to be my biggest headache: the 512 tokens limit.
In this talk, we’ll ask a lot of dumb questions and get an equal number of unsatisfying answers:
- How much text actually fits into 512 tokens? Spoiler: not enough to solve my use case, and I bet a lot of your use cases, too.
- I can feed a sequence of any length into an RNN, why do transformers even have a limit? We’ll look into the architecture in more detail to understand that.
- Somebody smart must have thought about this sequence length issue before, or not? Prepare yourself for a rant about benchmarks in NLP research.
- So what can we do to handle longer input sequences? Enjoy my collection of mediocre workarounds.
Database Management Systems (DBMSs) are the backbone for managing large volumes of data efficiently and thus play a central role in business and science today. To provide high performance, many of the most complex DBMS components such as query optimizers or schedulers involve solving non-trivial problems. To tackle such problems, very recent work has outlined a new direction of so-called learned DBMSs, where core parts of DBMSs are being replaced by machine learning (ML) models, which has been shown to provide significant performance benefits. However, a major drawback of the current approaches to enabling learned DBMS components is that they not only cause very high overhead for training an ML model to replace a DBMS component, but that this overhead occurs repeatedly, which renders these approaches far from practical. Hence, in this talk, I present my vision of Learned DBMS Components 2.0 to tackle these issues. First, I will introduce data-driven learning, where the idea is to learn the data distribution over a complex relational schema. In contrast to workload-driven learning, no large workload has to be executed on the database to gather training data. While data-driven learning has many applications such as cardinality estimation or approximate query processing, many DBMS tasks such as physical cost estimation cannot be supported. I thus propose a second technique called zero-shot learning, which is a general paradigm for learned DBMS components. Here, the idea is to train model
Write code as an ensemble to solve a data validation problem with Pydantic. Working together is not just about code - learn how to listen to colleagues, make typos in front of everyone, become a supportive team member, defend your ideas and maybe even accept criticism.
PyData Berlin are excited to bring you this open source workshop dedicated to contributing to pandas. This tutorial is 3 hours. We will have a break and continue with the same group of people.
pandas is a data wrangling platform for Python widely adopted in the scientific computing community. In this session, you will be guided on how you can make your own contributions to the project, no prior experience contributing required! Not only will this teach you new skills and boost your CV, you'll also likely get a nice adrenaline rush when your contribution is accepted!
If you don’t finish your contribution during the event, we hope you will continue to work on it after the tutorial. pandas offers regular new contributor meetings and has a slack space to provide ongoing support for new contributors. For more details, see our contributor community page: http://pandas.pydata.org/docs/dev/development/community.html .
Historically, changes in the scheduling algorithm of Dask have often been based on theory, single use cases, or even gut feeling. Coiled has now moved to using hard, comprehensive performance metrics for all changes - and it's been a turning point!
DeepMind's JAX ecosystem provides deep learning practitioners with an appealing alternative to TensorFlow and PyTorch. Among its strengths are great functionalities such as native TPU support, as well as easy vectorization and parallelization. Nevertheless, making your first steps in JAX can feel complicated given some of its idiosyncrasies. This talk helps new users get started in this promising ecosystem by sharing practical tips and best practices.
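As a first taste of those strengths (a small sketch, not from the talk), jit, grad and vmap compose freely on plain NumPy-style code:

```python
import jax
import jax.numpy as jnp


def predict(w, x):
    return jnp.tanh(x @ w)


def loss(w, x, y):
    return jnp.mean((predict(w, x) - y) ** 2)


# jit compiles to XLA (CPU/GPU/TPU), grad differentiates, and vmap vectorises a
# per-sample function over a batch without manual reshaping.
grad_loss = jax.jit(jax.grad(loss))
per_sample_pred = jax.vmap(predict, in_axes=(None, 0))

k1, k2 = jax.random.split(jax.random.PRNGKey(0))
w = jax.random.normal(k1, (3,))
x = jax.random.normal(k2, (8, 3))
y = jnp.ones(8)
print(grad_loss(w, x, y), per_sample_pred(w, x).shape)
```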
Global access to the Internet has enabled the spread of information throughout the world and has offered many new possibilities. On the other hand, alongside the advantages, the exponential and uncontrolled growth of user-generated content on the Internet has also facilitated the spread of toxicity and hate speech. Much work has been done in the direction of offensive speech detection. However, there is another, more proactive way to fight toxic speech: suggesting a detoxified version of the message to the user. In this presentation, we will provide an overview of how the text detoxification task can be solved. The proposed approaches can be reused for any text style transfer task, for both monolingual and multilingual use cases.
Writing efficient data pipelines in Python can be tricky. The standard recommendation is to use vectorized functions implemented in NumPy, Pandas, or the like. However, what do you do when the processing task does not fit these libraries? Using plain Python for processing can result in poor performance, in particular when handling large data sets.
Rust is a modern, performance-oriented programming language that is already widely used by the Python community. Augmenting data processing steps with Rust can result in substantial speed-ups. In this talk I will present strategies for using Rust in a larger Python data processing pipeline, with a particular focus on pragmatism and minimizing integration efforts.
Sometimes the internet can be a bit overwhelming, so I thought I would make a tool to create a personalized summary of it! In this talk, I'll demonstrate a personal front-page project that allows me to filter info on the internet on a certain topic, built using spaCy, an open-source library for NLP, and Prodigy, a scriptable annotation tool. With this project, I learned about the power of working with tools that provide extensive customizability without sacrificing ease of use. Throughout the talk, I'll also discuss how design concepts of developer tools can improve the development experience when building complex and adaptable software.
Local Planning Authorities (LPAs) in the UK rely on written representations from the community to inform their Local Plans which outline development needs for their area. With an average of 2000 representations per consultation and 4 rounds of consultation per Local Plan, the volume of information can be overwhelming for both LPAs and the Planning Inspectorate tasked with examining the legality and soundness of plans. In this study, we investigate the potential for Large Language Models (LLMs) to streamline representation analysis.
We find that LLMs have the potential to significantly reduce the time and effort required to analyse representations, with simulations on historical Local Plans projecting a reduction in processing time by over 30%, and experiments showing classification accuracy of up to 90%.
In this presentation, we discuss our experimental process which used a distributed experimentation environment with Jupyter Lab and cloud resources to evaluate the performance of the BERT, RoBERTa, DistilBERT, and XLNet models. We also discuss the design and prototyping of web applications to support the aided processing of representations using Voilà, FastAPI, and React. Finally, we highlight successes and challenges encountered and suggest areas for future improvement.
Everybody knows our yellow vans, trucks and planes around the world. But do you know how data
drives our business and how we leverage algorithms and technology in our core operations? We will
share some “behind the scenes” insights on Deutsche Post DHL Group’s journey towards a Data-Driven
Company.
• Large-Scale Use Cases: challenging and high-impact use cases in all major areas of logistics, including Computer Vision and NLP
• Fancy Algorithms: Deep Neural Networks, TSP solvers and the standard toolkit of a Data Scientist
• Modern Tooling: cloud platforms, Kubernetes, Kubeflow, AutoML
• No rusty working mode: small, self-organized, agile project teams, combining state-of-the-art Machine Learning with MLOps best practices
• A young, motivated and international team – German skills are only “nice to have”
But we have more to offer than slides filled with buzzwords. We will demonstrate our passion for our work, deep dive into our largest use cases that impact your everyday life, and share our approach to a time series forecasting library, combining data science, software engineering and technology for efficient and easy-to-maintain machine learning projects.
Are you ready to take your Computer Vision projects to the next level? Then don't miss this talk!
Data visualization is a crucial ingredient for the success of any computer vision project.
It allows you to assess the quality of your data, grasp the intricacies of your project, and communicate effectively with stakeholders.
In this talk, we'll showcase the power of data visualization with compelling examples. You'll learn about the benefits of data visualization and discover practical methods and tools to elevate your projects.
Don't let this opportunity pass you by: join us and learn how to make data visualization a core feature of your Computer Vision projects.
Identification of causal relationships through running experiments is not always possible. In this talk, an alternative approach, quasi-experimental frameworks, is discussed. Additionally, I will present how to adjust well-known machine learning algorithms so they can be used to quantify causal relationships.
In modern software engineering, plugin systems are a ubiquitous way to extend and modify the behavior of applications and libraries. When software is written in a plugin-friendly way, it encourages a modular organization in which the contracts between the core software and the plugins have been well thought out. In this talk, we cover exactly how to define this contract and how you can start designing your software to be more plugin friendly.
Throughout the talk we will be creating our own plugin-friendly application using the pluggy library to show these design principles in action. At the end of the talk, I also cover a real-life case study of how the package manager conda is currently making its 10-year-old code more plugin friendly, to illustrate how to retrofit an existing project.
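To give a flavour of what such a contract looks like with pluggy (hook and plugin names are made up for illustration):

```python
import pluggy

hookspec = pluggy.HookspecMarker("myapp")
hookimpl = pluggy.HookimplMarker("myapp")


class MyAppSpec:
    """The contract between the core application and its plugins."""

    @hookspec
    def process_record(self, record):
        """Transform a record; every registered plugin gets a chance to act on it."""


class UppercasePlugin:
    @hookimpl
    def process_record(self, record):
        return record.upper()


pm = pluggy.PluginManager("myapp")
pm.add_hookspecs(MyAppSpec)
pm.register(UppercasePlugin())

# The core calls the hook without knowing which plugins are installed.
print(pm.hook.process_record(record="hello"))  # -> ['HELLO']
```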
Write code as an ensemble to solve a data validation problem using Pydantic. Working together is not just about code - learn how to listen to colleagues, make typos in front of everyone, become a supportive team member, defend your ideas and maybe even accept criticism.
PyData Berlin are excited to bring you this open source workshop dedicated to contributing to pandas. This tutorial is 3 hours. We will have a break and continue with the same group of people.
pandas is a data wrangling platform for Python widely adopted in the scientific computing community. In this session, you will be guided on how you can make your own contributions to the project, no prior experience contributing required! Not only will this teach you new skills and boost your CV, you'll also likely get a nice adrenaline rush when your contribution is accepted!
If you don’t finish your contribution during the event, we hope you will continue to work on it after the tutorial. pandas offers regular new contributor meetings and has a slack space to provide ongoing support for new contributors. For more details, see our contributor community page: http://pandas.pydata.org/docs/dev/development/community.html .
In this talk, we will explore the role of a machine learning enabler engineer in facilitating the development and deployment of machine learning models. We will discuss best practices for optimizing infrastructure and tools to streamline the machine learning workflow, reduce time to deployment, and enable data scientists to extract insights and value from data more efficiently. We will also examine case studies and examples of successful machine learning enabler engineering projects and share practical tips and insights for anyone interested in this field.
FastKafka is a Python library that makes it easy to connect to Apache Kafka queues and send and receive messages. In this talk, we will introduce the library and its features for working with Kafka queues in Python. We will discuss the motivations for creating the library, how it compares to other Kafka client libraries, and how to use its decorators to define functions for consuming and producing messages. We will also demonstrate how to use these functions to build a simple application that sends and receives messages from the queue. This talk will be of interest to Python developers looking for an easy-to-use solution for working with Kafka.
The documentation of the library can be found here: https://fastkafka.airt.ai/
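A rough sketch of the decorator style the talk describes, based on my reading of the FastKafka docs; the exact broker configuration and naming conventions are assumptions, so check the documentation linked above before relying on them:

```python
from pydantic import BaseModel

from fastkafka import FastKafka


class Greeting(BaseModel):
    msg: str


# Assumed broker configuration format; see the FastKafka docs for the exact schema.
kafka_app = FastKafka(kafka_brokers={"localhost": {"url": "localhost", "port": 9092}})


@kafka_app.consumes()  # consumes from a topic derived from the function name ("on_...")
async def on_greeting(msg: Greeting):
    print(f"received: {msg.msg}")


@kafka_app.produces()  # produces the returned message to a topic derived from the name ("to_...")
async def to_greeting(msg: str) -> Greeting:
    return Greeting(msg=msg)
```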
I will present the challenges we encountered while migrating an ML model from batch to real-time predictions and how we handled them. In particular, I will focus on the design decisions and open-source tools we built to test the code, data and models as part of the CI/CD pipeline and enable us to ship fast with confidence.
During this panel, we’ll discuss the significant role PyLadies chapters around the world have played in advocating for gender representation and leadership and combating biases and the gender pay gap.
We will delve into the importance of effective data visualisation in today's world. We will explore how it can help convey insights from data using Matplotlib and best practices for creating informative visualisations. We will also discuss the limitations of static visualisations and examine the role of continuous integration in streamlining the process and avoiding common pitfalls. By the end of this talk, you will have gained valuable insights and techniques for creating informative and accurate data visualisations, no matter what tools you're using.
Doing data science in international development often means finding the right-sized solution in resource-constrained settings.
This talk walks you through how my team helped answer thousands of questions from pregnant folks and new parents on a South African maternal and child health helpline, which model we ended up choosing and why (hint: resource constraints!), and how we've packaged everything into a service that anyone can start for themselves.
By the end of the talk, I hope you'll know how to start your own FAQ-answering service and learn about one example of doing data science in international development.
In this talk we walk through our experience using Neo4j and Python to model climate policy as a graph database. We discuss how we did it, some of the challenges we faced, and what we learnt along the way!
Over the past decade, developers, researchers, and the community have successfully built tens of thousands of data applications using Spark. Since then, use cases and requirements of data applications have evolved: today, every application, from web services that run in application servers, to interactive environments such as notebooks and IDEs, to phones and edge devices such as smart home devices, wants to leverage the power of data.
However, Spark's driver architecture is monolithic, running client applications on top of a scheduler, optimizer and analyzer. This architecture makes it hard to address these new requirements: there is no built-in capability to remotely connect to a Spark cluster from languages other than SQL.
Spark Connect introduces a decoupled client-server architecture for Apache Spark that allows remote connectivity to Spark clusters using the DataFrame API and unresolved logical plans as the protocol. The separation between client and server allows Spark and its open ecosystem to be leveraged from everywhere. It can be embedded in modern data applications, in IDEs, Notebooks and programming languages.
This talk highlights how simple it is to connect to Spark using Spark Connect from any data applications or IDEs. We will do a deep dive into the architecture of Spark Connect and give an outlook of how the community can participate in the extension of Spark Connect for new programming languages and frameworks - to bring the power of Spark everywhere.
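To illustrate how thin the client side becomes (host and data are placeholders; Spark Connect ships with Spark 3.4+ and speaks the sc:// protocol, port 15002 by default):

```python
from pyspark.sql import SparkSession

# A remote session: the heavy lifting happens on the cluster, the client only holds the plan.
spark = SparkSession.builder.remote("sc://spark-cluster.example.com:15002").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "label"])
df.groupBy("label").count().show()
```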
A workshop for PyLadies members with the Berlin Tech Workers Council discussing the legal frameworks on contracts and termination agreements, as well as how employees can defend themselves in situations where they are made redundant due to mass layoffs.
A life without joy is like software without meaningful test data - it's uncertain and unreliable. The search for the perfect test data is a challenge. Real data should not be too real. Random data should not be too random. This is a randomly real and a really random journey to discover the balance between these two, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Python is a beautiful language for quickly prototyping and sketching ideas. However, people often struggle to get their code into production for various reasons. Besides all the security and safety concerns that usually are not addressed from the very beginning when playing around with an algorithmic idea, performance concerns are quite frequently a reason for not taking the Python code to the next level.
We will look at the "missing performance" worries using a simple numerical problem and see how to speed the corresponding Python code up to top-notch performance.
Image retrieval is the process of searching for images in a large database that are similar to one or more query images. A classical approach is to transform the database images and the query images into embeddings via a feature extractor (e.g., a CNN or a ViT), so that they can be compared via a distance metric. Self-supervised learning (SSL) can be used to train a feature extractor without the need for expensive and time-consuming labeled training data. We will use DINO's SSL method to build a feature extractor and Milvus, an open-source vector database built for evolutionary similarity search, to index image representation vectors for efficient retrieval. We will compare the SSL approach with supervised and pre-trained feature extractors.
The importance of enterprise architecture patterns is well known, and they are applicable to many types of tasks. Thinking about the architecture from the beginning of the journey is crucial to having a maintainable, and therefore testable and flexible, code base. We are going to explore the Ports and Adapters (Hexagonal) pattern by showing a simple web app built with the Repository, Unit of Work, and Services (Use Cases) patterns, tied together with Dependency Injection. All these patterns are quite famous in other languages, but they are relatively new to the Python ecosystem, where they have been a crucial missing piece.
As the web framework we are going to use FastAPI, which could be replaced with any other framework in no time thanks to the abstractions we have added.
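A condensed, illustrative sketch of the idea (names are made up): the route depends only on an abstract repository port, and FastAPI wires in a concrete adapter via dependency injection:

```python
from typing import Protocol

from fastapi import Depends, FastAPI


class OrderRepository(Protocol):  # the "port"
    def get(self, order_id: int) -> dict: ...


class InMemoryOrderRepository:  # one "adapter"; a SQL adapter would plug in the same way
    _orders = {1: {"id": 1, "status": "shipped"}}

    def get(self, order_id: int) -> dict:
        return self._orders.get(order_id, {})


def get_order_repository() -> OrderRepository:
    return InMemoryOrderRepository()


app = FastAPI()


@app.get("/orders/{order_id}")
def read_order(order_id: int, repo: OrderRepository = Depends(get_order_repository)):
    return repo.get(order_id)  # the route knows nothing about the storage backend
```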
Jupyter notebooks are a popular tool for data science and scientific computing, allowing users to mix code, text, and multimedia in a single document. However, sharing Jupyter notebooks can be challenging, as they require installing a specific software environment to be viewed and executed.
JupyterLite is a Jupyter distribution that runs entirely in the web browser without any server components. A significant benefit of this approach is the ease of deployment: with JupyterLite, the only requirement to provide a live computing environment is a collection of static assets. In this talk, we will show how you can create such a static website and deploy it to your users.
Working with Python is fun.
Managing Python packaging, linters, tests, CI, etc. is not as much fun.
Every maintainer needs to worry about consistent styling, quality, speed of tests, etc. as the project grows.
Monorepos have been successful in other communities - how do they work in Python?
Get ready to level up your big data processing skills! Join us for an introductory talk on Apache
Spark, the distributed computing system used by tech giants like Netflix and Amazon. We'll
cover PySpark DataFrames and how to use them. Whether you're a Python developer new to
big data or looking to explore new technologies, this talk is for you. You'll gain foundational
knowledge about Apache Spark and its capabilities, and learn how to leverage DataFrames and
SQL APIs to efficiently process large amounts of data. Don't miss out on this opportunity to up
your big data game!
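If you have never touched PySpark before, the two APIs the talk covers look roughly like this (local session, toy data):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("intro").getOrCreate()

df = spark.createDataFrame(
    [("books", 12.0), ("books", 5.0), ("games", 30.0)],
    ["category", "price"],
)

# The same aggregation via the DataFrame API ...
df.groupBy("category").agg(F.sum("price").alias("revenue")).show()

# ... and via the SQL API.
df.createOrReplaceTempView("sales")
spark.sql("SELECT category, SUM(price) AS revenue FROM sales GROUP BY category").show()
```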
Forget specialized hardware. Get GPU-class performance on your commodity CPUs with compound sparsity and sparsity-aware inference execution.
This talk will demonstrate the power of compound sparsity for model compression and inference speedup for NLP and CV domains, with a special focus on the recently popular Large Language Models. The combination of structured + unstructured pruning (to 90%+ sparsity), quantization, and knowledge distillation can be used to create models that run an order of magnitude faster than their dense counterparts, without a noticeable drop in accuracy. The session participants will learn the theory behind compound sparsity, state-of-the-art techniques, and how to apply it in practice using the Neural Magic platform.
How can NLP and Haystack help answer sustainability questions and fight climate change? In this talk we walkthrough our experience using Haystack to build Question Answering Models for the climate change and sustainability domain. We discuss how we did it, some of the challenges we faced, and what we learnt along the way!
We present an open-source library to shrink pickled scikit-learn and LightGBM models. We will provide insights into how pickling ML models works and how to improve the on-disk representation. With this approach, we can reduce the deployment size of machine learning applications by up to 6x.
By taking neural networks back to the school bench and teaching them some elements of geometry and topology we can build algorithms that can reason about the shape of data. Surprisingly these methods can be useful not only for computer vision – to model input data such as images or point clouds through global, robust properties – but in a wide range of applications, such as evaluating and improving the learning of embeddings, or the distribution of samples originating from generative models. This is the promise of the emerging field of Topological Data Analysis (TDA) which we will introduce and review recent works at its intersection with machine learning. TDA can be seen as being part of the increasingly popular movement of Geometric Deep Learning which encourages us to go beyond seeing data only as vectors in Euclidean spaces and instead consider machine learning algorithms that encode other geometric priors. In the past couple of years TDA has started to take a step out of the academic bubble, to a large extent thanks to powerful Python libraries written as extensions to scikit-learn or PyTorch.
Is your model prejudicial? Is your model deviating from the predictions it ought to have made? Has your model misunderstood the concept? In the world of artificial intelligence and machine learning, the word "fairness" is particularly common. It is described as having the quality of being impartial or fair. Fairness in ML is essential for contemporary businesses. It helps build consumer confidence and demonstrates to customers that their issues are important. Additionally, it aids in ensuring adherence to guidelines established by authorities, thus guaranteeing that the idea of responsible AI is upheld. In this talk, let's explore how certain sensitive features influence the model and introduce bias into it. We'll also look at how we can make it better.
Many good project ideas fail before they even start because of the sensitive personal data they require. The good news: a synthetic version of this data does not need protection. Synthetic data copies the actual data's structure and statistical properties without recreating personally identifiable information. The bad news: it is difficult to create synthetic data for open-access use without producing an exact copy of the actual data.
This talk will give hands-on insights into synthetic data creation and the challenges along its lifecycle. We will learn how to create and evaluate synthetic data for any use case using the open-source package Synthetic Data Vault. We will find answers to why it takes so long to synthesize the huge amounts of data dormant in public administration. The talk addresses data owners who want to open up access to their private data as well as analysts looking to use synthetic data. After this session, listeners will know which steps to take to generate synthetic data for multi-purpose use and what its limitations are for real-world analyses.
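A minimal sketch with the Synthetic Data Vault (this follows the SDV 1.x single-table API; older releases used `sdv.tabular`, and the CSV file and its columns here are hypothetical):

```python
# Fit a synthesizer on a sensitive table and sample a synthetic replacement.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

real = pd.read_csv("patients.csv")          # hypothetical sensitive dataset

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real)        # infer column types from the data

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real)
synthetic = synthesizer.sample(num_rows=len(real))
synthetic.to_csv("patients_synthetic.csv", index=False)
```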
The Python data ecosystem has matured during the last decade, and there are fewer and fewer reasons to rely solely on large batch processes executed in a Spark cluster. But as with every large ecosystem, putting together the key pieces of technology takes some effort. There are now better storage technologies, streaming execution engines, query planners, and low-level compute libraries. And modern hardware is far more powerful than you'd probably expect. In this workshop we will explore some global-warming-reducing techniques to build more efficient data transformation pipelines in Python, and a little bit of Rust.
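As one illustrative (not prescriptive) example of this newer tooling, a lazy Polars query plans the whole pipeline before reading a single row and can execute it with a streaming engine on a Rust-backed core. The file and column names are made up, and the `group_by`/`streaming` spellings vary slightly between Polars versions.

```python
# A lazy, streaming Polars pipeline: only the needed columns and rows are read.
import polars as pl

result = (
    pl.scan_parquet("events.parquet")        # builds a query plan, reads nothing yet
    .filter(pl.col("country") == "DE")
    .group_by("user_id")                     # older Polars spells this groupby()
    .agg(pl.col("amount").sum().alias("total"))
    .collect(streaming=True)                 # execute with the streaming engine
)
```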
Data-driven products are becoming more and more ubiquitous. Humans build data-driven products. Humans are intrinsically biased. This bias goes into the data-driven products, confirming and amplifying the original bias. In this tutorial, you will learn how to identify your own, often unperceived, biases and reflect on and discuss the consequences of unchecked biases in data products.
Assessing the robustness of models is an essential step in developing machine-learning systems. To determine if a model is sound, it often helps to know which and how many input features its output hinges on. This talk introduces the fundamentals of “anchor” explanations that aim to provide that information.
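One common implementation of anchors is the `alibi` library; a minimal sketch on a toy tabular model (the talk may use different tooling or data) could look like this:

```python
# Anchor explanation for a single prediction: a rule that "anchors" the output.
from alibi.explainers import AnchorTabular
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(random_state=0).fit(X, y)

explainer = AnchorTabular(
    clf.predict, feature_names=["sepal_len", "sepal_wid", "petal_len", "petal_wid"]
)
explainer.fit(X)

explanation = explainer.explain(X[0], threshold=0.95)
print(explanation.anchor)       # the feature conditions that fix this prediction
print(explanation.precision, explanation.coverage)
```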
Typing is at the center of "modern Python", and tools (mypy, beartype) and libraries (FastAPI, SQLModel, Pydantic, DocArray) based on it are slowly eating the Python world.
This talk explores the benefits of Python type hints and shows how they are infiltrating the next big domain: Machine Learning.
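A tiny taste of what that looks like in practice, assuming Python 3.9+ and Pydantic as one of the libraries mentioned above:

```python
# Type hints that mypy can check statically, and that Pydantic enforces at runtime.
from typing import Optional
from pydantic import BaseModel

def normalize(scores: list[float], default: Optional[float] = None) -> list[float]:
    """Scale scores by their maximum; fall back to `default` for an empty list."""
    if not scores:
        return [] if default is None else [default]
    top = max(scores)
    return [s / top for s in scores]

class Prediction(BaseModel):
    label: str
    score: float

Prediction(label="cat", score="0.93")   # the string is validated and coerced to a float
```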
"A modern AI start-up is a front-end developer plus a prompt engineer" is a popular joke on Twitter.
This talk is about LangChain, an open-source Python tool for prompt engineering. You can use it with completely open-source language models or with ChatGPT. I will show you how to create a prompt and get an answer from an LLM. As an example application, I will show a demo of an intelligent agent that uses web search and generates Python code to answer questions about this conference.
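As a rough sketch of the prompt-and-answer part (LangChain's API has been changing quickly, so the exact imports may differ by version, and an OpenAI API key is assumed):

```python
# Build a prompt template, feed it to an LLM, and read back the completion.
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

prompt = PromptTemplate(
    input_variables=["topic"],
    template="Write a one-sentence summary of {topic}.",
)
llm = OpenAI(temperature=0)               # requires OPENAI_API_KEY in the environment
chain = LLMChain(llm=llm, prompt=prompt)

print(chain.run(topic="prompt engineering"))
```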
Jupyter Notebooks have been a widely popular tool for data science in recent years due to their ability to combine code, text, and visualizations in a single document.
Despite its popularity, the core functionality and user experience of the Classic Jupyter Notebook interface have remained largely unchanged over the past years.
Recently, the Jupyter Notebook project decided to base its next major version, 7, on JupyterLab components and extensions, which means many JupyterLab features are now also available to Jupyter Notebook users.
In this presentation, we will demo the new features coming in Jupyter Notebook version 7 and how they are relevant to existing users of the Classic Notebook.
Many developers avoid using generators. For example, many well-known Python libraries use lists instead of generators. A generator on its own iterates more slowly than a plain list loop, yet using generators can greatly increase the overall speed of an application. Let's discover why.
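A small illustration of the trade-off (the exact numbers depend on your machine):

```python
# A list materializes every element up front; a generator yields them lazily.
import sys

def squares_list(n):
    return [i * i for i in range(n)]

def squares_gen(n):
    for i in range(n):
        yield i * i

print(sys.getsizeof(squares_list(1_000_000)))   # several megabytes at once
print(sys.getsizeof(squares_gen(1_000_000)))    # a tiny generator object

# The application can start consuming results immediately and stop early,
# skipping work (and memory) for elements it never needs.
first = next(squares_gen(1_000_000))
```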
Does your production code look like it’s been copied from Untitled12.ipynb? Are your engineers complaining about the code but you can’t find the time to work on improving the code base? This talk will go through some of the basics of clean coding and how to best implement them in a data science team.
In the talk we give a brief overview of how we use Dynamic Pricing to tune the prices for rides based on demand, time of purchase, unexpected events such as strikes, and other criteria, in order to fulfil our business requirements.
Let’s say you are the ruler of a remote island. For it to succeed and thrive you can’t expect it to be isolated from the world. You need to establish trade routes, offer your products to other islands, and import items from them. Doing this will certainly make your economy grow! We’re not going to talk about land masses or commerce, however, you should think of your application as an island that needs to connect to other applications to succeed. Unfortunately, the sea is treacherous and is not always very consistent, similar to the networks you use to connect your application to the world.
We will explore some techniques and libraries in the Python ecosystem that make your life easier when dealing with external services. From asynchronicity, caching, and testing to building abstractions on top of the APIs you consume, you will definitely learn some strategies to build your connected application gracefully and avoid those pesky 2 AM errors that keep you awake.
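One defensive pattern in that spirit, sketched here with `requests` and `tenacity` (the talk may use different libraries, and the endpoint is hypothetical): timeouts, retries with exponential backoff, and a small cache for repeated lookups.

```python
# Never wait forever, retry transient failures, and cache repeated calls.
from functools import lru_cache

import requests
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def fetch_json(url: str) -> dict:
    response = requests.get(url, timeout=5)   # fail fast instead of hanging
    response.raise_for_status()
    return response.json()

@lru_cache(maxsize=256)
def get_exchange_rate(currency: str) -> dict:
    return fetch_json(f"https://api.example.com/rates/{currency}")  # hypothetical API
```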
This talk presents a novel approach to MLOps that combines the benefits of open-source technologies with the power and cost-effectiveness of cloud computing platforms. By using tools such as Terraform, MLflow, and Feast, we demonstrate how to build a scalable and maintainable ML system on the cloud that is accessible to ML Engineers and Data Scientists. Our approach leverages cloud managed services for the entire ML lifecycle, reducing the complexity and overhead of maintenance and eliminating the vendor lock-in and additional costs associated with managed MLOps SaaS services. This innovative approach to MLOps allows organizations to take full advantage of the potential of machine learning while minimizing cost and complexity.
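Of the tools named above, the experiment-tracking piece is the easiest to show in a few lines; a minimal MLflow sketch (the tracking URL and values are placeholders, and the Terraform and Feast parts are not shown) might be:

```python
# Log parameters, metrics, and (optionally) the model to a remote MLflow server.
import mlflow

mlflow.set_tracking_uri("http://mlflow.internal:5000")   # hypothetical managed endpoint
mlflow.set_experiment("churn-model")

with mlflow.start_run():
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("auc", 0.91)
    # mlflow.sklearn.log_model(model, artifact_path="model")  # register the trained model
```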
Streamlit, a pure-Python data app framework, has been ported to Wasm as "stlite".
See its power and convenience with many live examples and explore its internals from a technical perspective.
You will learn to quickly create interactive in-browser apps using only Python.
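The scripts stlite runs are ordinary Streamlit apps; nothing in a snippet like the following is stlite-specific.

```python
# A plain Streamlit app that stlite can serve entirely inside the browser.
import numpy as np
import pandas as pd
import streamlit as st

st.title("Hello from the browser")
n = st.slider("Number of points", 10, 1000, 100)
df = pd.DataFrame(np.random.randn(n, 2), columns=["x", "y"])
st.line_chart(df)
```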
Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing, and is becoming the de facto standard for tabular data. This talk will give an overview of recent developments, both in Apache Arrow itself and in how it is being adopted in the PyData ecosystem (and beyond), and show how it can improve your day-to-day data analytics workflows.
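A small taste of Arrow as an interchange layer (the file name and data are invented for illustration):

```python
# Build an Arrow table, persist it as Parquet, and move it to and from pandas.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"city": ["Berlin", "Paris"], "visits": [3, 5]})
pq.write_table(table, "visits.parquet")

df = pq.read_table("visits.parquet").to_pandas()   # hand the columns to pandas
back = pa.Table.from_pandas(df)                    # and back again
```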
tox is a widely-used tool for automating testing in Python. In this talk, we will go behind the scenes of the creation of tox 4, the latest version of the tool. We will discuss the motivations for the rewrite, the challenges and lessons learned during the development process. We will have a look at the new features and improvements introduced in tox 4. But most importantly, you will get to know the maintainers.
Models in Natural Language Processing are fun to train but can be difficult to deploy. Their size, together with the libraries and files they require, can be challenging, especially in a microservice environment. When services should be as lightweight and slim as possible, large (language) models can cause a lot of problems. Using a recent real-world use case as an example, one that has been running in production for over a year in 10 different languages, I will walk you through my experiences with deploying NLP models. What kind of pitfalls, shortcuts, and tricks are possible while bringing an NLP model to production?
In this talk, you will learn about different ways to deploy NLP services. I will speak briefly about the path from data to model to a running service (without going into much detail) before focusing on the MLOps part at the end. I will take you with me on my past journey of struggles and successes so that you don't need to take these detours yourself.
Running machine learning models in a production environment brings its own challenges. In this talk we would like to present our solution for the machine learning lifecycle of the text-based cataloging classification system at idealo.de. We will share lessons learned and talk about our experiences during the lifecycle's migration from a hosted cluster to a cloud solution over the last 3 years. In addition, we will outline how we embedded our ML components into the overall idealo.de processing architecture.
In an era of big data, where large volumes of information are assimilated and analyzed for insights into human behavior, data privacy has become a hot topic. Since a lot of private information can be misused once leaked, not all data can be released for research. This talk discusses Differential Privacy, a cutting-edge cybersecurity technique that claims to preserve an individual's privacy: how it is employed to minimize the risks of working with private data, its applications in various domains, and how Python eases the task of employing it in our models with PyDP.
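PyDP wraps Google's differential-privacy library; rather than guessing at its exact API here, the core idea can be sketched by hand: clip each contribution and add Laplace noise calibrated to how much one person can change the result.

```python
# Conceptual sketch of an epsilon-differentially-private mean (not the PyDP API).
import numpy as np

def dp_mean(values, lower, upper, epsilon):
    values = np.clip(values, lower, upper)        # bound each individual's contribution
    sensitivity = (upper - lower) / len(values)   # max change from altering one record
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return values.mean() + noise

ages = np.random.randint(18, 90, size=1_000)
print(dp_mean(ages, lower=18, upper=90, epsilon=1.0))
```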
Bricks is an open-source content library for natural language processing, which provides the building blocks to quickly and easily enrich, transform or analyze text data for machine learning projects. For many Pythonistas, contributing to an open-source project seems scary and intimidating. In this tutorial, we offer a hands-on experience in which programmers and data scientists learn how to code their own building blocks and share their creations with the community with ease.
With an average of 3.2 new papers published on arXiv every day in 2022, causal inference has exploded in popularity, attracting a large amount of talent and interest from top researchers and institutions, including industry giants like Amazon and Microsoft. Text data, with its high complexity, poses an exciting challenge for the causal inference community. In the workshop, we'll review the latest advances in the field of Causal NLP and implement a causal Transformer model to demonstrate how to translate these developments into a practical solution that can bring real business value. All in Python!
Discover how Infrastructure From Code (IfC) can revolutionize Cloud DevOps automation by generating cloud deployment templates directly from Python code. Learn how this technology empowers Python developers to easily deploy and operate cost-effective, secure, reliable, and sustainable cloud software. Join us to explore the strategic potential of IfC.
Do you struggle with PRs? Have you ever had to change code even though you disagreed with the change just to land the PR? Have you ever given feedback that would have improved the code only to get into a comment war? We'll discuss how to give and receive feedback to extract maximum value from it and avoid all the communication problems that come with PRs.
Asynchronous programming is a type of parallel programming in which a unit of work is allowed to run separately from the primary application thread. After execution, it notifies the main thread of the completion or failure of the worker thread. There are numerous benefits to using it, such as improved application performance, enhanced responsiveness, and effective use of the CPU.
Asynchronicity seems to be a big reason why Node.js is so popular for server-side programming. Most of the code we write, especially in heavily I/O-bound applications like websites, depends on external resources. This could be anything from a remote database query to a POST request against an external API. As soon as you ask for any of these resources, your code is left waiting for the operation to complete with nothing to do. With asynchronous programming, you allow your code to handle other tasks while waiting for these resources to respond.
In this session, we are going to talk about asynchronous programming in Python: its benefits and multiple ways to implement it.
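A minimal asyncio sketch of the idea described above: while one coroutine waits on (simulated) I/O, the event loop runs the others.

```python
# Three overlapping "requests" finish in roughly the time of the slowest one.
import asyncio

async def fetch(name: str, delay: float) -> str:
    await asyncio.sleep(delay)          # stand-in for a slow network call
    return f"{name} done after {delay}s"

async def main() -> None:
    results = await asyncio.gather(
        fetch("db", 2.0), fetch("api", 1.5), fetch("cache", 1.0)
    )
    print(results)                      # takes ~2s instead of ~4.5s sequentially

asyncio.run(main())
```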
In this talk, I'll be talking about Zarr, an open-source data format for storing chunked, compressed N-dimensional arrays. The talk presents a systematic approach to understanding and implementing Zarr by showing how it works, the need for using it, and a hands-on session at the end. Zarr is based on an open technical specification, making implementations across several languages possible. I'll mainly talk about Zarr's Python implementation and show how it beautifully interoperates with the existing libraries in the PyData stack.
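A minimal sketch of the Python implementation (the shapes and chunk sizes are arbitrary):

```python
# A chunked, compressed on-disk array with NumPy-style indexing.
import numpy as np
import zarr

z = zarr.open(
    "example.zarr", mode="w", shape=(10_000, 10_000), chunks=(1_000, 1_000), dtype="f4"
)
z[0, :] = np.arange(10_000, dtype="f4")   # only the touched chunks are written

z_read = zarr.open("example.zarr", mode="r")
print(z_read.shape, z_read.chunks, z_read[0, :5])
```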
Tired of having to handle asynchronous processes for neuroevolution? Do you want to leverage massive vectorization and high-throughput accelerators for evolution strategies (ES)? evosax allows you to leverage JAX, XLA compilation and auto-vectorization/parallelization to scale ES to your favorite accelerators. In this talk we will get to know the core API and how to solve distributed black-box optimization problems with evolution strategies.
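The core ask/tell loop looks roughly like this (following the evosax README; exact signatures may shift between releases, and the quadratic objective is just a stand-in):

```python
# Ask for a population, evaluate it, tell the strategy the fitness values.
import jax
import jax.numpy as jnp
from evosax import CMA_ES

rng = jax.random.PRNGKey(0)
strategy = CMA_ES(popsize=20, num_dims=2)
es_params = strategy.default_params
state = strategy.initialize(rng, es_params)

for _ in range(50):
    rng, rng_ask = jax.random.split(rng)
    x, state = strategy.ask(rng_ask, state, es_params)   # candidates, shape (popsize, num_dims)
    fitness = jnp.sum(x ** 2, axis=-1)                   # toy objective on each candidate
    state = strategy.tell(x, fitness, state, es_params)
```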
Developers often use code coverage as a target, which makes it a bad measure of test quality.
Mutation testing changes the game: create mutant versions of your code that break your tests, and you'll quickly start to write better tests!
Come and learn to use it as part of your CI/CD process. I promise, you'll never look at penguins the same way again!
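A hand-made mutant shows the idea (tools such as mutmut generate and run these automatically):

```python
# The weak test gives 100% line coverage yet lets the mutant survive;
# the boundary-checking test kills it.
def is_adult(age: int) -> bool:
    return age >= 18

def is_adult_mutant(age: int) -> bool:
    return age > 18                    # mutation: ">=" became ">"

def test_weak():
    assert is_adult(30)                # passes for both versions, so the mutant survives

def test_strong():
    assert is_adult(18)                # fails on the mutant, so the mutant is killed
    assert not is_adult(17)
```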
After a decade of writing code, I joined the application security team. During the transition, I discovered that there are many myths about security and how difficult it is. Devs often choose to ignore it because they think writing more secure code would take them ages. That is not true. Security doesn't have to be scary. In my talk, you will learn the most useful pieces of application security theory. It will be practical and not boring at all.
Today, state-of-the-art technology and scientific research depend strongly on open source libraries. The demographic of the contributors to these libraries is predominantly white and male [1][2][3][4]. This situation creates problems both for individual contributors outside of this demographic (loss of career opportunities) and for the open source projects themselves (less robust technologies) [1][7]. In recent years there have been a number of recommendations and initiatives to increase the participation in open source projects of groups who are underrepresented in this domain [1][3][5][6]. While these efforts are valuable and much needed, contributor diversity remains a challenge in open source communities [2][3][7]. This talk highlights the underlying problems and explores how we can overcome them.
The Modern Data Stack has brought a lot of new buzzwords into the data engineering lexicon: "data mesh", "data observability", "reverse ETL", "data lineage", "analytics engineering". In this light-hearted talk we will demystify the evolving revolution that will define the future of data analytics & engineering teams.
Our journey begins with the PyData Stack: pandas pipelines powering ETL workflows...clean code, tested code, data validation, perfect for in-memory workflows. As demand for self-serve analytics grows, new data sources bring more APIs to model, more code to maintain, DAG workflow orchestration tools, new nuances to capture ("the tax team defines revenue differently"), more dashboards, more not-quite-bugs ("but my number says this...").
This data maturity journey is a well-trodden path with common pitfalls & opportunities. After dashboards comes predictive modelling ("what will happen"), prescriptive modelling ("what should we do?"), perhaps eventually automated decision making. Getting there is much easier with the advent of the Python Powered Modern Data Stack.
In this talk, we will cover the shift from ETL to ELT, the open-source Modern Data Stack tools you should know, with a focus on how dbt's new Python integration is changing how data pipelines are built, run, tested & maintained. By understanding the latest trends & buzzwords, attendees will gain a deeper insight into Python's role at the core of the future of data engineering.
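As a flavour of that integration, a dbt Python model is just a function returning a DataFrame; the sketch below assumes the Snowpark adapter (where `to_pandas()` is available) and uses invented model and column names.

```python
# models/revenue_by_team.py: a sketch of a dbt Python model (dbt 1.3+).
def model(dbt, session):
    dbt.config(materialized="table")

    # dbt.ref() returns an adapter-specific DataFrame for the upstream model.
    orders = dbt.ref("stg_orders").to_pandas()

    # "the tax team defines revenue differently"
    orders["revenue"] = orders["amount"] - orders["tax"]

    return orders.groupby("team", as_index=False)["revenue"].sum()
```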
Did you know that the Python Software Foundation Code of Conduct is turning 10 years old in 2023? It was voted in because the community at the time was felt to be “unbalanced and not seeing the true spectrum of the greater community”.
Why is that a big thing? Come to my talk and find out!