Introduction to Data Analysis Using Pandas
2023-08-14, HS 120

Working with data can be challenging: it often doesn’t come in the best format for analysis, and understanding it well enough to extract insights requires both time and the skills to filter, aggregate, reshape, and visualize it. This session will equip you with the knowledge you need to effectively use pandas – a powerful library for data analysis in Python – to make this process easier.


Section 1: Getting Started With Pandas

We will begin by introducing the Series, DataFrame, and Index classes, which are the basic building blocks of the pandas library, and showing how to work with them. By the end of this section, you will be able to create DataFrames and perform operations on them to inspect and filter data.
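
As a rough preview of what this section covers, the sketch below builds a small DataFrame, looks at its Series and Index components, and filters rows with a Boolean mask. The data and column names (city, temp_c, rainy) are invented for illustration and are not taken from the workshop material.

```python
import pandas as pd

# Hypothetical sample data, made up for this illustration.
df = pd.DataFrame(
    {
        "city": ["Basel", "Zurich", "Geneva", "Bern"],
        "temp_c": [21.5, 19.0, 23.1, 18.4],
        "rainy": [False, True, False, True],
    }
)

# Selecting a single column returns a Series; the row and column
# labels of the DataFrame are stored in Index objects.
temps = df["temp_c"]
print(type(temps).__name__)
print(df.index)
print(df.columns)

# Inspect the data: first rows, column data types, summary statistics.
print(df.head(2))
print(df.dtypes)
print(df.describe())

# Filter rows with a Boolean mask: warm cities where it is not raining.
warm_dry = df[(df["temp_c"] > 20) & (~df["rainy"])]
print(warm_dry)
```

Note that each condition in the mask is wrapped in parentheses and combined with `&` rather than `and`, because the comparison is evaluated element-wise across the whole column.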

Section 2: Data Wrangling

To prepare our data for analysis, we need to perform data wrangling. We will learn how to clean and reformat data (e.g., renaming columns, fixing data type mismatches), restructure or reshape it, and enrich it (e.g., discretizing columns, calculating aggregations, combining data sources).
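
The wrangling steps named above can be sketched as follows. This is a minimal example under stated assumptions: the station readings, column names, and bin edges are hypothetical and chosen only to show one pandas call per step, not the workshop's actual datasets or exercises.

```python
import pandas as pd

# Hypothetical raw data: awkward column names and numbers stored as strings.
raw = pd.DataFrame(
    {
        "Station ID": ["A", "A", "B", "B"],
        "Date": ["2023-08-01", "2023-08-02", "2023-08-01", "2023-08-02"],
        "Temp (C)": ["21.5", "19.0", "23.1", "18.4"],
    }
)

# Clean: rename columns and fix data type mismatches.
clean = raw.rename(
    columns={"Station ID": "station", "Date": "date", "Temp (C)": "temp_c"}
).astype({"temp_c": "float64"})
clean["date"] = pd.to_datetime(clean["date"])

# Reshape: go from long format to wide format (one column per station).
wide = clean.pivot(index="date", columns="station", values="temp_c")

# Enrich: discretize temperatures into labeled bins (bin edges are arbitrary).
clean["temp_band"] = pd.cut(
    clean["temp_c"], bins=[0, 20, 40], labels=["cool", "warm"]
)

# Enrich: calculate an aggregation per station.
mean_by_station = clean.groupby("station")["temp_c"].mean()

# Enrich: combine with a second (hypothetical) data source.
stations = pd.DataFrame(
    {"station": ["A", "B"], "name": ["Airport", "Botanical Garden"]}
)
merged = clean.merge(stations, on="station", how="left")

print(wide)
print(mean_by_station)
print(merged)
```

Each step returns a new object rather than modifying `raw`, which makes it easy to compare the data before and after a transformation.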

Target Audience

This tutorial is for anyone with basic knowledge of Python and an interest in learning how to analyze data in Python. We will be working with Jupyter Notebooks, so attendees should familiarize themselves with the interface (i.e., know how to run/edit a cell) beforehand.

Prerequisites

Bring a laptop (preferably your personal one) with the virtual environment configured as indicated here. Come to the session with your environment set up so we can dive right into the material.


Expected audience expertise: Domain

none

Category [Data Science and Visualization]

Data Analysis and Data Engineering

Expected audience expertise: Python

some

Public link to supporting material

https://stefmolin.github.io/pandas-workshop/slides/html/workshop.slides.html#/

Project Homepage / Git

https://github.com/stefmolin/pandas-workshop

Abstract as a tweet

"Introduction to Data Analysis Using Pandas" will equip you with the knowledge you need to effectively use pandas – a powerful library for data analysis in Python.

Stefanie Molin is a software engineer and data scientist at Bloomberg in New York City, where she tackles tough problems in information security, particularly those revolving around data wrangling/visualization, building tools for gathering data, and knowledge sharing. She is also the author of Hands-On Data Analysis with Pandas, which is currently in its second edition. She holds a Bachelor of Science degree in operations research from Columbia University's Fu Foundation School of Engineering and Applied Science, as well as a master’s degree in computer science, with a specialization in machine learning, from Georgia Tech. In her free time, she enjoys traveling the world, inventing new recipes, and learning new languages, both those spoken by people and those used with computers.