Generating Data Frames for your tests - using Pandas strategies in Hypothesis
2023-08-15, Aula

Do you test your data pipeline? Do you use Hypothesis? In this workshop, we will use Hypothesis, a property-based testing framework, to generate Pandas DataFrames for your tests without involving any real data.


In this short 90-minute workshop, we will first go through the basics of Hypothesis and what property-based testing is. After that, we will introduce the strategies for Pandas objects, available via the extras in Hypothesis. We will take a look at what the strategies do to generate test objects, including Pandas Series and DataFrames. Finally, we will apply what we have learned to a real testing scenario: testing a data pipeline that involves DataFrames.
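As a taste of what the Pandas strategies look like, here is a minimal sketch that is not taken from the workshop material. It assumes `hypothesis` and `pandas` are installed; the column names (`quantity`, `unit_price`) and the `add_total` pipeline step are hypothetical and exist only for illustration.

```python
import pandas as pd
from hypothesis import given, settings, strategies as st
from hypothesis.extra.pandas import column, data_frames, range_indexes

# Hypothetical schema: the column names are illustrative, not from the workshop.
orders = data_frames(
    columns=[
        column("quantity", elements=st.integers(min_value=0, max_value=1000)),
        column(
            "unit_price",
            elements=st.floats(min_value=0, max_value=1000, allow_nan=False),
        ),
    ],
    # Generate between 1 and 20 rows with a RangeIndex.
    index=range_indexes(min_size=1, max_size=20),
)


def add_total(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical pipeline step: add a 'total' column without mutating the input."""
    out = df.copy()
    out["total"] = out["quantity"] * out["unit_price"]
    return out


@settings(max_examples=50, deadline=None)
@given(orders)
def test_add_total(df):
    result = add_total(df)
    # Properties that should hold for every generated DataFrame.
    assert len(result) == len(df)
    assert list(result.columns) == ["quantity", "unit_price", "total"]
    assert (result["total"] >= 0).all()


if __name__ == "__main__":
    # Calling the decorated function runs it against many generated DataFrames.
    test_add_total()
```

The test states properties (row count preserved, totals never negative) rather than fixed expected values, so no real data is needed to exercise the function.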

Preparation

No preparation is needed. However, if you want to avoid relying on the wifi at the venue for installation and downloads, you can clone the workshop repo and follow the setup instructions beforehand.

Outline

  • Introduction to property-based testing (15 mins)
  • Introduction to Hypothesis and basic usage, with exercises (30 mins)
  • Deep dive into Pandas strategies (20 mins)
  • Do it yourself - apply property-based testing to data pipelines (20 mins)
  • Conclusion (5 mins)

Prerequisites

No prior knowledge of property-based testing or Hypothesis is required. However, we assume attendees have experience using Pandas and a basic understanding of Pandas objects. Knowledge of NumPy arrays and typing would also help in understanding the Pandas strategies.

Goal

We hope attendees will learn about property-based testing and see how it can benefit their work involving data, especially work that uses Pandas. After the workshop, attendees should understand how the Pandas strategies in Hypothesis work and be able to use Hypothesis to test code that takes Pandas Series or DataFrames as input.


Category [Machine and Deep Learning]

Reproducible Machine Learning

Expected audience expertise: Domain

some

Expected audience expertise: Python

some

Project Homepage / Git

https://github.com/HypothesisWorks/hypothesis/

Abstract as a tweet

Do you test your data pipeline? Do you use Hypothesis? In this workshop, we will use Hypothesis, a property-based testing framework, to generate Pandas DataFrames for your tests without involving any real data.

Before working in Developer Relations, Cheuk was a Data Scientist at various companies, a role that demands strong numerical and programming skills, especially in Python. To follow her passion for the tech community, Cheuk is now a Developer Advocate at Anaconda. Cheuk also contributes to multiple Open Source libraries such as Hypothesis and Pandas.

Besides her work, Cheuk enjoys talking about Python on personal streaming platforms and podcasts. Cheuk has also been a speaker at universities and various conferences. Besides speaking at conferences, Cheuk also organizes events for developers. Conferences that Cheuk has organized include EuroPython (of which she is a board member), PyData Global and Pyjamas Conf. Believing in Tech Diversity and Inclusion, Cheuk regularly organizes workshops and mentored sprints for minority groups. In 2021, Cheuk became a Python Software Foundation fellow.
