A yearly top ten

- Next, we'd like to extract the ten most popular names in any given year. As we do that, we will learn how to sort a DataFrame, how to drop columns, how to join two frames matching their indexes, and how to count values in a Series. Let's go to the IPython Notebook and select the 07 04 topten begin exercise file. This notebook contains all the code that we have developed so far in this chapter. We're going to evaluate all cells by selecting Cell, Run All.

All done. From our DataFrame all years indexed, we can select all data for a year using a .loc selection object. For instance, 2008. Next, we want to sort this selection. Actually, let's sort it the right way, with larger numbers on top. So in the sort, ascending is False. Very well. The most popular name in 2008 for a boy was Jacob.
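The selection and sort described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the variable names, the index layout (sex, name, year), and the counts are all assumptions made for the example, and it uses sort_values() because sort() is no longer available in current Pandas.

```python
import pandas as pd

# Toy stand-in for the course's full table; the counts are made up.
records = [
    ("M", "Jacob", 2008, 300),
    ("M", "Michael", 2008, 200),
    ("M", "Ethan", 2008, 250),
    ("F", "Emma", 2008, 280),
    ("M", "Jacob", 2007, 310),
]
allyears = pd.DataFrame(records, columns=["sex", "name", "year", "number"])

# Assumed index layout: sex, name, year (sorted so that .loc slicing works).
allyears_indexed = allyears.set_index(["sex", "name", "year"]).sort_index()

# Select all boys' names for 2008 with .loc ...
boys_2008 = allyears_indexed.loc[("M", slice(None), 2008), :]

# ... then sort with the larger numbers on top.
top = boys_2008.sort_values("number", ascending=False)
print(top.head(10))
```

With real data, head(10) on the sorted selection would give the year's top ten.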

Let's simplify this a bit, and assign this table to a variable, by copying this code into the next cell and feeding it to pop2008. Then I will reset the index, and drop several columns that I do not care about. I'm dropping columns, so axis is one.
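As a rough sketch of the reset-and-drop step: the transcript does not say which columns are dropped, so the choice below (everything except the name) is an assumption, as are the toy counts.

```python
import pandas as pd

# Toy version of pop2008, already sorted by popularity; the counts are made up.
pop2008 = pd.DataFrame(
    {"number": [300, 250, 200]},
    index=pd.MultiIndex.from_tuples(
        [("M", "Jacob", 2008), ("M", "Ethan", 2008), ("M", "Michael", 2008)],
        names=["sex", "name", "year"],
    ),
)

# reset_index() turns the index levels back into ordinary columns
# and replaces the index with 0, 1, 2, ...
flat = pop2008.reset_index()

# axis=1 tells drop() to remove columns rather than rows.
# Which columns to drop is an assumption made for this example.
names_only = flat.drop(["sex", "year", "number"], axis=1)
print(names_only)
```

After the reset, the new integer index doubles as a zero-based rank for each name.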

Released: 11/12/2015

If you're going to work with big data, you'll probably be using R or Python. And if you're using Python, you'll definitely be using Pandas and NumPy, the third-party packages designed specifically for data analysis. This course provides an opportunity to learn about them. Michele Vallisneri shows how to set up your analysis environment and provides a refresher on the basics of working with data containers in Python. Then he jumps into the big stuff: the power of arrays, indexing, and DataFrames in NumPy and Pandas. He also walks through two sample big-data projects: one using NumPy to analyze weather patterns and the other using Pandas to analyze the popularity of baby names over the last century. Challenges issued along the way help you practice what you've learned.

Topics include:

Writing and running Python in IPython

Using Python lists and dictionaries

Creating NumPy arrays

Indexing and slicing in NumPy

Downloading and parsing data files into NumPy and Pandas

Using multilevel series in Pandas

Aggregating data in Pandas

Skill Level: Intermediate

Duration: 2h 16m

Views: 774,655

Q: The course shows how to download files from FTP and web servers using Python 3.X. How do I do the same thing with Python 2.7?

A: First import urllib, then use urllib.urlretrieve(URL, filename). For instance, to download the stations.txt file used in the chapter 5 video “Downloading and parsing data files,” you’d do urllib.urlretrieve('ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt', 'stations.txt').
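If the same notebook has to run under both Python 2.7 and Python 3.x, one possible approach is to import urlretrieve from whichever location exists. The helper name download below is hypothetical and only wraps the call described above.

```python
try:
    from urllib import urlretrieve          # Python 2.7
except ImportError:
    from urllib.request import urlretrieve  # Python 3.x


def download(url, filename):
    """Fetch url, save it as filename, and return the local path."""
    path, _headers = urlretrieve(url, filename)
    return path


# Example call (needs network access, so it is left commented out):
# download('ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt',
#          'stations.txt')
```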

Q: What are the issues with DataFrame.sort()?

A: The sort() method has been deprecated in favor of sort_values() since Pandas version 0.17, and it has since been removed. Unlike the old Series.sort(), the new method does not sort records in place unless it is given the option "inplace=True", so its result must be assigned back to a variable. The following lines of code in the video need changing:

[in addition to lines above, which are used to initialize the "name fads" computation]

totals_both = totals_both.sort_values(ascending=False)
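To illustrate the difference, here is a small sketch with a made-up Series standing in for totals_both (the names and counts are invented for the example):

```python
import pandas as pd

# Made-up totals standing in for the course's totals_both Series.
totals_both = pd.Series({"Leslie": 120, "Jamie": 300, "Kerry": 80})

# sort_values() returns a new, sorted Series and leaves the original untouched.
sorted_copy = totals_both.sort_values(ascending=False)
original_order = list(totals_both.index)

# Either assign the result back, as in the corrected line above...
totals_both = totals_both.sort_values(ascending=False)

# ...or ask for an in-place sort explicitly.
other = pd.Series({"Leslie": 120, "Jamie": 300, "Kerry": 80})
other.sort_values(ascending=False, inplace=True)
```

Assigning the result back is usually the safer habit, since it makes the rebinding visible in the code.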

Q: What are the issues with Pandas categorical data?

A: Since version 0.6, seaborn.load_dataset converts certain columns to Pandas categorical data (see http://pandas.pydata.org/pandas-docs/stable/categorical.html). This creates a problem in the handling of the "flights" DataFrame used in "Introduction to Pandas/Using multilevel indices". To avoid the problem, you may load the dataset directly with Pandas:
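A minimal sketch of that approach, assuming the flights data is available as a CSV file with year, month, and passengers columns. The in-memory CSV text and its rows below are placeholders for such a file, not the course's actual code:

```python
import io

import pandas as pd

# Placeholder CSV text; in practice this would be a path to a flights CSV file.
csv_text = (
    "year,month,passengers\n"
    "1949,January,112\n"
    "1949,February,118\n"
    "1949,March,132\n"
)

# read_csv() does not create categorical columns unless asked to,
# so "month" comes back as an ordinary string column.
flights = pd.read_csv(io.StringIO(csv_text))
is_categorical = isinstance(flights["month"].dtype, pd.CategoricalDtype)
print(flights.dtypes)
print("month is categorical:", is_categorical)
```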

Q: What are the issues with matplotlib.pyplot.stackplot?

A: In recent versions of matplotlib, the function matplotlib.pyplot.stackplot raises an error if given the keyword argument "label". This problem occurs in the "Baby names with Pandas/Name popularity" exercise file, and it can be ignored. In the video, matplotlib does not complain, but it shows no legend for the plot either. The tutorial moves on to show how to make a legend using matplotlib.pyplot.text.