Easier data analysis in Python with pandas (video series)

Summary: If you're working with data in Python, learning pandas will make your life easier! I love teaching pandas, and so I created a video series targeted at beginners. There are currently 30 videos in the series.

Why learn pandas?

pandas is a powerful, open-source Python library for data analysis, manipulation, and visualization. If you're working with data in Python and you're not using pandas, you're probably working too hard!

There are many things to like about pandas: It's well-documented, has a huge amount of community support, is under active development, and plays well with other Python libraries (such as matplotlib, scikit-learn, and seaborn).

There are also things you might not like: pandas has an overwhelming amount of functionality (so it's hard to know where to start), and it provides too many ways to accomplish the same task (so it's hard to figure out the best practices).

That's why I created this series. I've been using and teaching pandas for a long time, and so I know how to explain pandas in a way that is understandable to novices.

About the video series

You don't need to have any pandas experience to benefit from this series, but you do need to know the basics of Python.

In each video, I answer a question from one of my students using a real dataset. Since I've posted the data online, and pandas can read files directly from a URL, you can follow along with every video at home!

Every video in the series is embedded below. There are currently 30 videos in the series, but more may be added in the future. (Subscribe on YouTube for notifications.)

Embedded videos with descriptions

pandas is a full-featured Python library for data analysis, manipulation, and visualization. This video series is for anyone who wants to work with data in Python, regardless of whether you are brand new to pandas or have some experience. Each video will answer a student question about pandas using a real dataset, which is available online so you can follow along!

"Tabular data" is just data that has been formatted as a table, with rows and columns (like a spreadsheet). You can easily read a tabular data file into pandas, even directly from a URL! In this video, I'll walk you through how to do that, including how to modify some of the default arguments of the read_table function to solve common problems.

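As a minimal sketch of reading tabular data: the pipe-separated text and the column names below are made up for illustration, and an in-memory buffer stands in for a file path or URL.

```python
import io
import pandas as pd

# Hypothetical pipe-separated data with no header row (stands in for a file or URL).
raw = "1|24|M|technician\n2|53|F|other\n3|23|M|writer\n"

# sep overrides the default tab separator; header=None plus names supplies the column labels.
users = pd.read_table(
    io.StringIO(raw),
    sep="|",
    header=None,
    names=["user_id", "age", "gender", "occupation"],
)
print(users.head())
```

The same call should accept a URL string in place of the buffer.
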
DataFrames and Series are the two main object types in pandas for data storage: a DataFrame is like a table, and each column of the table is called a Series. You will often select a Series in order to analyze or manipulate it. In this video, I'll show you how to select a Series using "bracket notation" and "dot notation", and will discuss the limitations of dot notation. I'll also demonstrate how to create a new Series in a DataFrame.
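
A small sketch of both notations, using an invented two-row DataFrame (the column names are assumptions chosen to include one name with a space):

```python
import pandas as pd

# Hypothetical data; "Shape Reported" contains a space on purpose.
ufo = pd.DataFrame({
    "City": ["Ithaca", "Holyoke"],
    "State": ["NY", "CO"],
    "Shape Reported": ["TRIANGLE", "OVAL"],
})

city = ufo["City"]               # bracket notation always works
same_city = ufo.City             # dot notation works for simple names
shape = ufo["Shape Reported"]    # a name with a space requires brackets

# Creating a new column requires bracket notation on the left-hand side.
ufo["Location"] = ufo["City"] + ", " + ufo["State"]
```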

To access most of the functionality in pandas, you have to call the methods and attributes of DataFrame and Series objects. In this video, I'll discuss some common methods and attributes, and show you how to tell the difference between them. (Hint: It's all about the parentheses!)
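
To illustrate the parentheses hint, here is a short sketch on made-up data (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical data for illustration only.
movies = pd.DataFrame({"title": ["A", "B", "C"], "duration": [142, 175, 200]})

# Attributes describe the object and take no parentheses.
shape = movies.shape
dtypes = movies.dtypes

# Methods perform an action, need parentheses, and may accept arguments.
summary = movies.describe()
first_two = movies.head(2)
```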

You will often want to rename the columns of a DataFrame so that their names are descriptive, easy to type, and don't contain any spaces. In this video, I'll demonstrate three different strategies for renaming columns so that you can choose the best strategy to fit your particular situation.
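
Here is a sketch of three common renaming approaches. They may not be the exact three covered in the video, and the column names are invented:

```python
import pandas as pd

# Hypothetical one-row DataFrame with awkward column names.
ufo = pd.DataFrame({"Colors Reported": ["RED"], "Shape Reported": ["OVAL"], "City": ["Ithaca"]})

# 1. Rename selected columns with a mapping of old name to new name.
renamed = ufo.rename(columns={"Colors Reported": "colors_reported"})

# 2. Overwrite every column name at once.
overwritten = ufo.copy()
overwritten.columns = ["colors", "shape", "city"]

# 3. Apply a string method to all column names.
replaced = ufo.copy()
replaced.columns = replaced.columns.str.replace(" ", "_", regex=False)
```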

If you have DataFrame columns that you're never going to use, you may want to remove them entirely in order to focus on the columns that you do use. In this video, I'll show you how to remove columns (and rows), and will briefly explain the meaning of the "axis" and "inplace" parameters.
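
A minimal sketch of dropping columns and rows, assuming a made-up three-row DataFrame:

```python
import pandas as pd

# Hypothetical data for illustration only.
ufo = pd.DataFrame({
    "City": ["Ithaca", "Holyoke", "Abilene"],
    "Colors": [None, "RED", None],
    "State": ["NY", "CO", "KS"],
})

# axis=1 refers to columns: drop the "Colors" column.
without_colors = ufo.drop("Colors", axis=1)

# axis=0 refers to rows: drop the rows labeled 0 and 1.
without_first_two = ufo.drop([0, 1], axis=0)
```

Because inplace is left at its default, both calls return new DataFrames and ufo itself is unchanged.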

pandas allows you to sort a DataFrame by one of its columns (known as a "Series"), and also allows you to sort a Series alone. The sorting API changed in pandas version 0.17, so in this video, I'll demonstrate both the "old way" and the "new way" to sort. I'll also show you how to sort a DataFrame by multiple columns at once!
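
For the newer API, a sketch with sort_values might look like this (the data is invented):

```python
import pandas as pd

# Hypothetical data for illustration only.
movies = pd.DataFrame({
    "title": ["B", "A", "C"],
    "genre": ["Drama", "Crime", "Crime"],
    "duration": [120, 175, 142],
})

# Sort a single Series.
titles_sorted = movies["title"].sort_values()

# Sort the whole DataFrame by one column, in descending order.
by_duration_desc = movies.sort_values("duration", ascending=False)

# Sort by multiple columns: genre first, then duration within each genre.
by_genre_then_duration = movies.sort_values(["genre", "duration"])
```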

Let's say that you only want to display the rows of a DataFrame that have a certain column value. How would you do it? pandas makes it easy, but the notation can be confusing and thus difficult to remember. In this video, I'll work up to the solution step-by-step using regular Python code so that you can truly understand the logic behind pandas filtering notation.
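
One way to sketch that progression, assuming a made-up DataFrame and an arbitrary 200-minute threshold:

```python
import pandas as pd

# Hypothetical data for illustration only.
movies = pd.DataFrame({"title": ["A", "B", "C"], "duration": [142, 229, 96]})

# Plain Python: build a list with one boolean per row.
is_long_list = [length >= 200 for length in movies["duration"]]

# pandas: the comparison is applied to every element and returns a boolean Series.
is_long = movies["duration"] >= 200

# Passing the boolean Series in brackets keeps only the rows marked True.
long_movies = movies[is_long]
```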

Let's say that you want to filter the rows of a DataFrame by multiple conditions. In this video, I'll demonstrate how to do this using two different logical operators. I'll also explain the special rules in pandas for combining filter criteria, and end with a trick for simplifying chained conditions!
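
A sketch of combining conditions, again on invented data. Each condition needs its own parentheses, and pandas uses & and | rather than Python's "and" and "or":

```python
import pandas as pd

# Hypothetical data for illustration only.
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genre": ["Crime", "Drama", "Action", "Drama"],
    "duration": [142, 229, 96, 110],
})

# Both conditions must be True.
long_dramas = movies[(movies["duration"] >= 200) & (movies["genre"] == "Drama")]

# At least one condition must be True.
long_or_drama = movies[(movies["duration"] >= 200) | (movies["genre"] == "Drama")]

# isin replaces a chain of "==" conditions joined by |.
crime_or_action = movies[movies["genre"].isin(["Crime", "Action"])]
```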

When performing operations on a pandas DataFrame, such as dropping columns or calculating row means, it is often necessary to specify the "axis". But what exactly is an axis? In this video, I'll help you to build a mental model for understanding the axis parameter so that you will know when and how to use it.
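
As one possible mental model, axis=0 moves down the rows (giving one result per column) and axis=1 moves across the columns (giving one result per row). A sketch with made-up numbers:

```python
import pandas as pd

# Hypothetical data for illustration only.
drinks = pd.DataFrame({"beer": [0, 89, 25], "wine": [0, 54, 14]})

# axis=0: collapse the rows, producing one mean per column.
column_means = drinks.mean(axis=0)

# axis=1: collapse the columns, producing one mean per row.
row_means = drinks.mean(axis=1)

# The string aliases "index" and "columns" mean the same as 0 and 1.
row_means_alias = drinks.mean(axis="columns")
```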

pandas includes powerful string manipulation capabilities that you can easily apply to any Series of strings. In this video, I'll show you how to access string methods in pandas (along with a few examples), and then end with two bonus tips to help you maximize your efficiency.
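
A short sketch of the .str accessor, using invented item names:

```python
import pandas as pd

# Hypothetical data for illustration only.
orders = pd.DataFrame({"item_name": ["Chicken Bowl", "Steak Burrito", "Canned Soda"]})

# String methods are reached through the .str accessor.
upper = orders["item_name"].str.upper()

# Methods that return booleans can be used for filtering.
has_chicken = orders["item_name"].str.contains("Chicken")

# String methods can be chained; each step needs its own .str.
first_word = orders["item_name"].str.split(" ").str[0]
```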

Have you ever tried to do math with a pandas Series that you thought was numeric, but it turned out that your numbers were stored as strings? In this video, I'll demonstrate two different ways to change the data type of a Series so that you can fix incorrect data types. I'll also show you the easiest way to convert a boolean Series to integers, which is useful for creating dummy/indicator variables for machine learning.
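
A minimal sketch of fixing a data type with astype, assuming made-up prices stored as strings:

```python
import pandas as pd

# Hypothetical data: prices stored as strings with a dollar sign.
orders = pd.DataFrame({"item_price": ["$2.39", "$3.39", "$16.98"]})

# Strip the dollar sign, then convert the strings to floats.
prices = orders["item_price"].str.replace("$", "", regex=False).astype(float)

# A boolean Series converts to integers (True -> 1, False -> 0).
is_expensive = (prices > 3).astype(int)
```

For files, the dtype argument of the reader functions is another way to set types up front.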

The pandas "groupby" method allows you to split a DataFrame into groups, apply a function to each group independently, and then combine the results back together. This is called the "split-apply-combine" pattern, and is a powerful tool for analyzing data across different categories. In this video, I'll explain when you should use a groupby and then demonstrate its flexibility using four different examples.
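
A sketch of split-apply-combine on invented data:

```python
import pandas as pd

# Hypothetical data for illustration only.
drinks = pd.DataFrame({
    "continent": ["Asia", "Europe", "Europe", "Asia"],
    "beer": [0, 89, 245, 77],
})

# Split by continent, apply mean to each group's beer column, combine into one Series.
mean_by_continent = drinks.groupby("continent")["beer"].mean()

# agg applies several aggregation functions at once.
stats_by_continent = drinks.groupby("continent")["beer"].agg(["count", "min", "max", "mean"])
```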

When you start working with a new dataset, how should you go about exploring it? In this video, I'll demonstrate some of the basic tools in pandas for exploring both numeric and non-numeric data. I'll also show you how to create simple visualizations in a single line of code!
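
A few exploration calls, sketched on made-up data. Plotting is only mentioned in a comment because it requires matplotlib to be installed:

```python
import pandas as pd

# Hypothetical data for illustration only.
movies = pd.DataFrame({
    "genre": ["Drama", "Crime", "Drama", "Action"],
    "duration": [142, 175, 200, 96],
})

# Numeric column: count, mean, std, min, quartiles, max.
duration_summary = movies["duration"].describe()

# Non-numeric column: how often each value occurs, as counts and as proportions.
genre_counts = movies["genre"].value_counts()
genre_share = movies["genre"].value_counts(normalize=True)

# With matplotlib installed, a one-line chart would be: genre_counts.plot(kind="bar")
```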

Most datasets contain "missing values", meaning that the data is incomplete. Deciding how to handle missing values can be challenging! In this video, I'll cover all of the basics: how missing values are represented in pandas, how to locate them, and options for how to drop them or fill them in.
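
The basics can be sketched as follows. The data and the fill value "VARIOUS" are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data; missing values are represented as NaN.
ufo = pd.DataFrame({
    "City": ["Ithaca", np.nan, "Abilene"],
    "Shape": [np.nan, np.nan, "DISK"],
})

# Locate: count the missing values in each column.
missing_per_column = ufo.isnull().sum()

# Drop: remove rows where any value is missing, or only rows where all are missing.
dropped_any = ufo.dropna(how="any")
dropped_all = ufo.dropna(how="all")

# Fill: replace missing values with a chosen value.
shape_filled = ufo["Shape"].fillna("VARIOUS")
```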

The DataFrame index is core to the functionality of pandas, yet it's confusing to many users. In this video, I'll explain what the index is used for and why you might want to store your data in the index. I'll also demonstrate how to set and reset the index, and show how that affects the DataFrame's shape and contents.
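
A sketch of setting and resetting the index, assuming a made-up two-row DataFrame:

```python
import pandas as pd

# Hypothetical data for illustration only.
drinks = pd.DataFrame({"country": ["Brazil", "France"], "beer": [245, 127]})

# Move "country" into the index; it is no longer counted as a column.
by_country = drinks.set_index("country")

# Index labels can now be used for selection.
brazil_beer = by_country.loc["Brazil", "beer"]

# reset_index moves "country" back to a regular column.
restored = by_country.reset_index()
```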

In part two of our discussion of the index, we'll switch our focus from the DataFrame index to the Series index. After discussing index-based selection and sorting, I'll demonstrate how automatic index alignment during mathematical operations and concatenation enables us to easily work with incomplete data in pandas.
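
Index alignment can be sketched with two small Series (all values are invented):

```python
import pandas as pd

# Hypothetical Series whose labels are in different orders.
beer = pd.Series([245, 127], index=["Brazil", "France"])
wine = pd.Series([16, 370], index=["France", "Brazil"])

# Addition matches values by index label, not by position.
total = beer + wine

# A label missing from one Series produces NaN instead of an error.
partial = pd.Series([1], index=["Brazil"])
with_gap = beer + partial
```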

Have you ever been confused about the "right" way to select rows and columns from a DataFrame? pandas gives you an incredible number of options for doing so, but in this video, I'll outline the current best practices for row and column selection using the loc, iloc, and ix indexers.
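
A sketch of loc and iloc on made-up data. It leaves out ix, which was deprecated in pandas 0.20 and removed in pandas 1.0:

```python
import pandas as pd

# Hypothetical data for illustration only.
ufo = pd.DataFrame({
    "City": ["Ithaca", "Holyoke", "Abilene"],
    "State": ["NY", "CO", "KS"],
    "Shape": ["TRIANGLE", "OVAL", "DISK"],
})

# loc selects by label, and both ends of a slice are included.
by_label = ufo.loc[0:1, "City":"State"]

# iloc selects by integer position, and the end of a slice is excluded.
by_position = ufo.iloc[0:1, 0:2]

# loc also accepts a boolean condition for the rows.
ny_cities = ufo.loc[ufo["State"] == "NY", "City"]
```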

We've used the "inplace" parameter many times during this video series, but what exactly does it do, and when should you use it? In this video, I'll explain how "inplace" affects methods such as "drop" and "dropna", and why it is always False by default.
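
The effect of inplace can be sketched like this (the data is invented):

```python
import pandas as pd

# Hypothetical data for illustration only.
ufo = pd.DataFrame({"City": ["Ithaca", "Holyoke"], "Colors": [None, "RED"]})

# Default (inplace=False): a new DataFrame is returned and ufo is unchanged.
result = ufo.drop("Colors", axis=1)
columns_after_default = list(ufo.columns)

# inplace=True: ufo itself is modified and the method returns None.
returned = ufo.drop("Colors", axis=1, inplace=True)
columns_after_inplace = list(ufo.columns)
```

Leaving inplace at False makes it safe to experiment, since nothing is changed until the result is assigned.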

Are you working with a large dataset in pandas, and wondering if you can reduce its memory footprint or improve its efficiency? In this video, I'll show you how to do exactly that in one line of code using the "category" data type, introduced in pandas 0.15. I'll explain how it works, and how to know when you shouldn't use it.
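
A rough sketch of the memory comparison, assuming a made-up column with only two distinct values repeated many times:

```python
import pandas as pd

# Hypothetical data: 1,000 rows but only two distinct strings.
drinks = pd.DataFrame({"continent": ["Asia", "Europe", "Asia", "Europe"] * 250})

# deep=True includes the memory used by the strings themselves.
bytes_before = drinks["continent"].memory_usage(deep=True)

# One line converts the column to the category data type.
drinks["continent"] = drinks["continent"].astype("category")
bytes_after = drinks["continent"].memory_usage(deep=True)

# Each distinct value is stored once; the rows hold small integer codes.
categories = list(drinks["continent"].cat.categories)
```

The saving depends on having few distinct values relative to the number of rows, so a column of mostly unique strings is unlikely to benefit.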

Have you been using scikit-learn for machine learning, and wondering whether pandas could help you to prepare your data and export your predictions? In this video, I'll demonstrate the simplest way to integrate pandas into your machine learning workflow, and will create a submission for Kaggle's Titanic competition in just a few lines of code!

If you want to include a categorical feature in your machine learning model, one common solution is to create dummy variables. In this video, I'll demonstrate three different ways you can create dummy variables from your existing DataFrame columns. I'll also show you a trick for simplifying your code that was introduced in pandas 0.18.
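
Here is a sketch of three ways to build dummy variables. The rows are invented, the column name is_male is hypothetical, and these may not be the exact approaches shown in the video:

```python
import pandas as pd

# Hypothetical data for illustration only.
train = pd.DataFrame({"Sex": ["male", "female", "female"], "Embarked": ["S", "C", "Q"]})

# 1. Map a two-category column to 0/1 by hand.
train["is_male"] = train["Sex"].map({"female": 0, "male": 1})

# 2. get_dummies on one Series creates one column per category.
embarked_dummies = pd.get_dummies(train["Embarked"], prefix="Embarked")

# 3. get_dummies on the DataFrame encodes the listed columns and keeps the rest;
#    drop_first=True omits the first category of each encoded column.
encoded = pd.get_dummies(train, columns=["Sex", "Embarked"], drop_first=True)
```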

Let's say that you have dates and times in your DataFrame and you want to analyze your data by minute, month, or year. What should you do? In this video, I'll demonstrate how you can convert your data to "datetime" format, enabling you to access a ton of convenient attributes and perform datetime comparisons and mathematical operations.
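
A sketch of the conversion and what it enables, using invented timestamps and an assumed month/day/year format:

```python
import pandas as pd

# Hypothetical data: dates and times stored as strings.
ufo = pd.DataFrame({"Time": ["6/1/1930 22:00", "6/30/1930 20:00", "2/15/1931 14:00"]})

# Convert the strings to datetime format (the format string is an assumption about this data).
ufo["Time"] = pd.to_datetime(ufo["Time"], format="%m/%d/%Y %H:%M")

# Datetime attributes are reached through the .dt accessor.
hours = ufo["Time"].dt.hour
years = ufo["Time"].dt.year

# Comparisons and arithmetic work on datetimes.
from_1931 = ufo[ufo["Time"] >= pd.Timestamp("1931-01-01")]
span = ufo["Time"].max() - ufo["Time"].min()
```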

During the data cleaning process, you will often need to figure out whether you have duplicate data, and if so, how to deal with it. In this video, I'll demonstrate the two key methods for finding and removing duplicate rows, as well as how to modify their behavior to suit your specific needs.
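
The two methods can be sketched on made-up data like this:

```python
import pandas as pd

# Hypothetical data: rows 0, 1, and 3 are identical.
users = pd.DataFrame({
    "age": [24, 24, 30, 24],
    "zip_code": ["85711", "85711", "10003", "85711"],
})

# duplicated marks each row that repeats an earlier row (keep="first" is the default).
is_duplicate = users.duplicated()
duplicate_count = is_duplicate.sum()

# drop_duplicates removes the marked rows.
deduplicated = users.drop_duplicates(keep="first")

# keep=False treats every copy as a duplicate, so none of them survive.
unique_only = users.drop_duplicates(keep=False)

# subset limits the comparison to certain columns.
duplicate_ages = users.duplicated(subset=["age"])
```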

If you've been using pandas for a while, you've likely encountered a SettingWithCopyWarning. The proper response is to modify your code appropriately, not to turn off the warning! In this video, I'll show you two common scenarios in which this warning arises, explain why it's occurring, and then demonstrate how to address it.
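
Two typical fixes, sketched on invented data. These are assumptions about common scenarios rather than a transcript of the video:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
movies = pd.DataFrame({"title": ["A", "B"], "content_rating": ["NOT RATED", "PG"]})

# Scenario 1: chained indexing such as
#   movies[movies.content_rating == "NOT RATED"].content_rating = np.nan
# may warn and leave movies unchanged. Selecting rows and column in one loc call avoids this.
movies.loc[movies["content_rating"] == "NOT RATED", "content_rating"] = np.nan

# Scenario 2: modifying a filtered subset. An explicit copy makes the intent clear.
rated = movies[movies["content_rating"].notnull()].copy()
rated["duration"] = 120
```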

Have you ever wanted to change the way your DataFrame is displayed? Perhaps you needed to see more rows or columns, or modify the formatting of numbers? In this video, I'll demonstrate how to change the settings for five common display options in pandas.
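
A sketch of reading, changing, and resetting display options. The specific options and values are chosen only for illustration:

```python
import pandas as pd

# Read the current value of an option.
default_rows = pd.get_option("display.max_rows")

# Change it, then confirm the new value.
pd.set_option("display.max_rows", 20)
changed_rows = pd.get_option("display.max_rows")

# Restore the default.
pd.reset_option("display.max_rows")
restored_rows = pd.get_option("display.max_rows")

# option_context applies settings only inside the with block.
with pd.option_context("display.precision", 2, "display.max_columns", 10):
    precision_inside = pd.get_option("display.precision")
```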

Have you ever needed to create a DataFrame of "dummy" data, but without reading from a file? In this video, I'll demonstrate how to create a DataFrame from a dictionary, a list, and a NumPy array. I'll also show you how to create a new Series and attach it to the DataFrame.
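
A sketch of the three constructions plus attaching a Series, with all values made up:

```python
import numpy as np
import pandas as pd

# From a dictionary: keys become column names.
from_dict = pd.DataFrame({"id": [100, 101, 102], "color": ["red", "blue", "red"]})

# From a list of lists: each inner list is a row, so column names are passed separately.
from_lists = pd.DataFrame(
    [[100, "red"], [101, "blue"], [102, "red"]],
    columns=["id", "color"],
)

# From a NumPy array with 3 rows and 2 columns.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["one", "two"])

# A new Series attaches as a column; its name becomes the column name.
shape = pd.Series(["round", "square", "round"], name="shape")
combined = pd.concat([from_dict, shape], axis=1)
```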

Have you ever struggled to figure out the differences between apply, map, and applymap? In this video, I'll explain when you should use each of these methods and demonstrate a few common use cases. Watch the end of the video for three important announcements!
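
A sketch of the three methods on invented data. Note that applymap was deprecated in pandas 2.1 in favor of DataFrame.map, so the example picks whichever one the installed version provides:

```python
import pandas as pd

# Hypothetical data for illustration only.
train = pd.DataFrame({
    "Sex": ["male", "female"],
    "Name": ["Braund, Mr. Owen", "Cumings, Mrs. John"],
    "Fare": [7.25, 71.28],
})

# Series.map: translate each value using a dictionary.
train["Sex_num"] = train["Sex"].map({"female": 0, "male": 1})

# Series.apply: call a function on each element.
train["Name_length"] = train["Name"].apply(len)

# DataFrame.apply: call a function on each column (axis=0) or each row (axis=1).
column_max = train[["Sex_num", "Fare"]].apply(lambda col: col.max(), axis=0)

# Element-wise on a whole DataFrame: DataFrame.map in newer pandas, applymap in older versions.
fares = train[["Fare"]]
elementwise = fares.map if hasattr(pd.DataFrame, "map") else fares.applymap
fare_whole = elementwise(int)
```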

During this two-hour webcast, I answered 45 viewer questions about pandas, the leading Python library for data analysis, exploration, and manipulation. View the complete list of questions on Crowdcast.