Pandas is a powerful library for manipulating large and complex datasets. It is built around a new data structure, the data frame, which models a table of data.
Pandas helps to close the gap between Python and R for data analysis and statistical computing.

Pandas data frames address three deficiencies of NumPy arrays:

a data frame can hold heterogeneous data; each column can have its own numpy.dtype,

the axes of a data frame are labeled with column names and row indices,

and data frames account for missing values, which arrays do not directly support.
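
The three points above can be seen in a minimal sketch. The table and its column names (species, mass_g, count) are made up for illustration and are not part of AnAge.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only; they are not taken from AnAge.
df = pd.DataFrame(
    {
        "species": ["fruit fly", "honey bee", "house mouse"],  # text column
        "mass_g": [0.001, 0.1, np.nan],  # float column with a missing value
        "count": [10, 20, 30],           # integer column
    },
    index=["a", "b", "c"],               # row labels
)

print(df.dtypes)               # one dtype per column
print(df.loc["b", "mass_g"])   # label-based access by row and column name
print(df["mass_g"].isna())     # marks the missing entry
print(df["mass_g"].mean())     # missing values are skipped by default
```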

Data frames are extremely useful for data manipulation.
They provide a large range of operations such as filter, join, and group-by aggregation, as well as plotting.
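
As a rough sketch of these operations, the following uses two small made-up tables (animals and classes are hypothetical names, and the masses are illustrative, not AnAge values).

```python
import pandas as pd

# Toy tables for illustration only (not AnAge data).
animals = pd.DataFrame({
    "name": ["mouse", "rat", "sparrow", "eagle"],
    "cls": ["Mammalia", "Mammalia", "Aves", "Aves"],
    "mass_g": [20.0, 300.0, 30.0, 4000.0],
})
classes = pd.DataFrame({
    "cls": ["Mammalia", "Aves"],
    "common": ["mammals", "birds"],
})

heavy = animals[animals["mass_g"] > 100]               # filter rows
joined = animals.merge(classes, on="cls", how="left")  # join two tables on a key
mean_mass = animals.groupby("cls")["mass_g"].mean()    # group-by aggregation

print(heavy)
print(joined)
print(mean_mass)
```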

We will analyze animal life-history data from AnAge.
The data is available from the download page as a zip archive, so we first open the archive and then read the text file inside it with the pandas read_table function:

import zipfile
import pandas as pd

# fname is the downloaded zip archive
with zipfile.ZipFile(fname) as z:
    f = z.open('anage_data.txt')
    data = pd.read_table(f)  # lots of other pd.read_... functions
print(type(data))
print(data.shape)

<class 'pandas.core.frame.DataFrame'>
(4219, 31)

Pandas holds data in a DataFrame (similar to the data frame in R).
A DataFrame has a single row per observation (in contrast to the previous exercise, in which each table cell was one observation), and each column holds a single variable. Variables can be numbers or strings.

The head method gives us the first five rows of the data frame.

In [5]:

data.head()

Out[5]:

|  | HAGRID | Kingdom | Phylum | Class | Order | Family | Genus | Species | Common name | Female maturity (days) | ... | Source | Specimen origin | Sample size | Data quality | IMR (per yr) | MRDT (yrs) | Metabolic rate (W) | Body mass (g) | Temperature (K) | References |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3 | Animalia | Arthropoda | Branchiopoda | Diplostraca | Daphniidae | Daphnia | pulicaria | Daphnia | NaN | ... | NaN | unknown | medium | acceptable | NaN | NaN | NaN | NaN | NaN | 1294,1295,1296 |
| 1 | 5 | Animalia | Arthropoda | Insecta | Diptera | Drosophilidae | Drosophila | melanogaster | Fruit fly | 7.0 | ... | NaN | captivity | large | acceptable | 0.05 | 0.04 | NaN | NaN | NaN | 2,20,32,47,53,68,69,240,241,242,243,274,602,98... |
| 2 | 6 | Animalia | Arthropoda | Insecta | Hymenoptera | Apidae | Apis | mellifera | Honey bee | NaN | ... | 812 | unknown | medium | acceptable | NaN | NaN | NaN | NaN | NaN | 63,407,408,741,805,806,808,812,815,828,830,831... |
| 3 | 8 | Animalia | Arthropoda | Insecta | Hymenoptera | Formicidae | Cardiocondyla | obscurior | Cardiocondyla obscurior | NaN | ... | 1293 | captivity | medium | acceptable | NaN | NaN | NaN | NaN | NaN | 1293 |
| 4 | 9 | Animalia | Arthropoda | Insecta | Hymenoptera | Formicidae | Lasius | niger | Black garden ant | NaN | ... | 411 | unknown | medium | acceptable | NaN | NaN | NaN | NaN | NaN | 411,813,814 |

5 rows × 31 columns

A DataFrame has many of the features of numpy.ndarray: it also has a shape and various statistical methods (max, mean, etc.).
However, a DataFrame allows richer indexing.
For example, let's browse our data for species that have a body mass greater than 300 kg.
First we will create a new column (a Series object) that tells us whether each row is a large-animal row or not:
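
A minimal sketch of this step, assuming the 'Body mass (g)' column shown in the table above. The small table is a toy stand-in with illustrative values, and large_index is a hypothetical variable name.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the AnAge table; the values are illustrative only.
data = pd.DataFrame({
    "Common name": ["House mouse", "Moose", "Blue whale", "Unknown"],
    "Body mass (g)": [20.5, 400_000.0, 136_000_000.0, np.nan],
})

# Body mass is stored in grams, so 300 kg is 300 * 1000 g.
# Comparing a column with a number gives a boolean Series, one value per row;
# rows with a missing body mass compare as False.
large_index = data["Body mass (g)"] > 300 * 1000
print(large_index)
```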

Now we slice our data with this boolean index.
The iterrows method lets us iterate over the rows of the data.
For each row we get both the row number as an int and the row itself as a Series object (similar to a dict for our purposes); this is similar to using enumerate on lists and strings.
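
Slicing and iteration might look like the sketch below. The table is again a toy stand-in with illustrative values, and large_index and large_animals are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the AnAge table; the values are illustrative only.
data = pd.DataFrame({
    "Common name": ["House mouse", "Moose", "Blue whale"],
    "Body mass (g)": [20.5, 400_000.0, 136_000_000.0],
})
large_index = data["Body mass (g)"] > 300 * 1000

# Indexing with a boolean Series keeps only the rows marked True.
large_animals = data[large_index]

# iterrows yields (index label, row as a Series) pairs; with the default
# index, the label is the row number in the original table.
for i, row in large_animals.iterrows():
    print(i, row["Common name"], row["Body mass (g)"] / 1000, "kg")
```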

Let's continue with small and medium animals.
For starters, let's plot a scatter of body mass vs. metabolic rate.
Because we work with pandas, we can do that with the plot method of DataFrame, specifying the columns for x and y and a plotting style (without the style we would get a line plot, which makes no sense here).
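
A sketch of such a scatter plot, assuming the 'Body mass (g)' and 'Metabolic rate (W)' columns from the table above. The values are toy numbers, and small is a hypothetical name for the small and medium animals.

```python
import pandas as pd

# Toy values for illustration only (not AnAge data).
data = pd.DataFrame({
    "Body mass (g)": [20.0, 250.0, 3000.0, 45000.0, 500000.0],
    "Metabolic rate (W)": [0.2, 1.4, 8.0, 50.0, 300.0],
})

# Keep the small and medium animals by negating the large-animal mask.
large_index = data["Body mass (g)"] > 300 * 1000
small = data[~large_index]

# style="o" draws unconnected markers instead of the default line.
ax = small.plot(x="Body mass (g)", y="Metabolic rate (W)", style="o", legend=False)
ax.set_ylabel("Metabolic rate (W)")
```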

So we have lots of mammals and birds, and a few reptiles and amphibians. This is important, as amphibians and reptiles could have a different relationship between mass and metabolism because they are cold-blooded.

Let's do a simple linear regression plot, but let's do it separately for each Class. We can do this kind of thing with Matplotlib and SciPy, but a very good tool for statistical visualizations is Seaborn.

Seaborn adds on top of Pandas a set of sophisticated statistical visualizations, similar to ggplot2 for R.

hue means color, but it also causes seaborn to fit a different linear model to each of the Classes.

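ci controls the confidence intervals around each regression line. I turned them off; in seaborn's documented API, ci is either None (no interval) or a number between 0 and 100 that sets the size of the interval, for example 95.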
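
A minimal sketch of such a plot with seaborn's lmplot, assuming the 'Class', 'Body mass (g)' and 'Metabolic rate (W)' columns. The values are toy numbers, not AnAge data.

```python
import pandas as pd
import seaborn as sns

# Toy values for illustration only (not AnAge data).
data = pd.DataFrame({
    "Body mass (g)": [20.0, 300.0, 4000.0, 30.0, 500.0, 4500.0],
    "Metabolic rate (W)": [0.3, 1.5, 9.0, 0.5, 3.0, 14.0],
    "Class": ["Mammalia"] * 3 + ["Aves"] * 3,
})

grid = sns.lmplot(
    data=data,
    x="Body mass (g)",
    y="Metabolic rate (W)",
    hue="Class",  # one color and one fitted line per Class
    ci=None,      # no confidence band; a number such as 95 draws one
)
```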

We can see that mammals and birds show a clear correlation between size and metabolism, and that it extends over a wide range of mass. So let's stick to mammals; next up, we will see which orders of mammals we have.

Because there is a lot of data here, I made the lines thinner (this can be done by passing matplotlib keywords as a dictionary to the line_kws argument), and I made the markers bigger but with alpha (transparency) 0.5 using the scatter_kws argument.
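
The styling described above could be sketched as follows. The specific line width and marker size are assumptions chosen for illustration, and mammals is a toy table with made-up values.

```python
import pandas as pd
import seaborn as sns

# Toy values for illustration only (not AnAge data).
mammals = pd.DataFrame({
    "Body mass (g)": [20.0, 150.0, 300.0, 3000.0, 40000.0, 70000.0],
    "Metabolic rate (W)": [0.2, 0.9, 1.6, 5.0, 40.0, 65.0],
    "Order": ["Rodentia"] * 3 + ["Primates"] * 3,
})

grid = sns.lmplot(
    data=mammals,
    x="Body mass (g)",
    y="Metabolic rate (W)",
    hue="Order",
    ci=None,
    line_kws={"linewidth": 1},            # passed on to matplotlib's plot
    scatter_kws={"s": 80, "alpha": 0.5},  # passed on to matplotlib's scatter
)
```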

Still, there's too much data, and part of the problem is that some orders are large (e.g. primates) and some are small (e.g. rodents).

Let's plot a separate regression plot for each order.
We do this using the col and row arguments of lmplot, but in general this can be done for any plot using seaborn's FacetGrid class.

We used the sharex=False and sharey=False arguments so that each Order gets its own axis range and the data spreads out nicely.
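
A sketch of the faceted plot on toy data follows. It assumes a recent seaborn release, where sharex and sharey are passed through facet_kws; passing them directly to lmplot, as in the text above, is deprecated there. The col_wrap value is also an assumption.

```python
import pandas as pd
import seaborn as sns

# Toy values for illustration only (not AnAge data).
mammals = pd.DataFrame({
    "Body mass (g)": [20.0, 150.0, 300.0, 3000.0, 40000.0, 70000.0],
    "Metabolic rate (W)": [0.2, 0.9, 1.6, 5.0, 40.0, 65.0],
    "Order": ["Rodentia"] * 3 + ["Primates"] * 3,
})

grid = sns.lmplot(
    data=mammals,
    x="Body mass (g)",
    y="Metabolic rate (W)",
    col="Order",   # one panel per Order
    col_wrap=2,    # start a new row of panels after two columns
    ci=None,
    facet_kws={"sharex": False, "sharey": False},  # independent axis ranges
)
```
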
Last but not least, let's have a closer look at the correlation between mass and metabolism in primates.
We will do a joint plot, which will give us the Pearson correlation and the distribution of each parameter.

You can disregard the warning; it appears because seaborn uses a deprecated keyword argument of matplotlib.

/Users/yoavram/miniconda3/envs/DataSciPy/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
warnings.warn("The 'normed' kwarg is deprecated, and has been "