You'll need to import the numpy and pandas libraries, as well as the Series and DataFrame objects from the pandas library, like so.

In [26]:

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

We'll start by creating a single Series object with 8 values by calling np.arange and passing an index to the Series() method, then assigning the result to our series_obj variable. In this case, we're choosing to assign labels to our rows.
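As a minimal sketch of that step: the row labels used here ('row 1' through 'row 8') are assumed for illustration, since the labels themselves are not shown in this section.

```python
import numpy as np
from pandas import Series

# Row labels are placeholders chosen for this sketch.
labels = ['row 1', 'row 2', 'row 3', 'row 4',
          'row 5', 'row 6', 'row 7', 'row 8']

# np.arange(8) yields the eight values 0 through 7; index pairs each with a label.
series_obj = Series(np.arange(8), index=labels)
print(series_obj)
```

Each value can then be looked up by its label, for example series_obj['row 3'].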

Using ['label-index', 'label-index', 'label-index'] = scalar value, you can set the value of one or many objects at once to a scalar value by using label-indexes. One could use this to set approximate values or throw-away numbers for specific cases.
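The sketch below shows one way to do this. It rebuilds a small labelled Series (the label names are assumed for illustration) and uses .loc with a list of labels, which is an explicit form of the same label-index assignment.

```python
import numpy as np
from pandas import Series

# Placeholder labels for this sketch.
labels = ['row 1', 'row 2', 'row 3', 'row 4',
          'row 5', 'row 6', 'row 7', 'row 8']
series_obj = Series(np.arange(8), index=labels)

# Set three labelled rows to the same scalar value in a single assignment.
series_obj.loc[['row 1', 'row 5', 'row 8']] = 8
print(series_obj)
```

Rows that are not named in the list keep their original values.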

# Setting the seed so our numbers stay consistent for this demonstration.
np.random.seed(25)
DF = DataFrame(np.random.randn(36).reshape(6, 6))
# Setting rows three through five in column zero and rows one through four in
# column five to missing. .loc is used because .ix was removed in pandas 1.0;
# `missing` is assumed to have been defined earlier as np.nan.
DF.loc[3:5, 0] = missing
DF.loc[1:4, 5] = missing
DF

You can also pass a dictionary into the .fillna() method. The method will then fill in missing values from each column Series (as designated by the dictionary key) with its own unique value (as specified in the corresponding dictionary value).

Here we'll fill missing values in column zero with 0.1 and in column five we'll use 1.25. This allows you to get more granular instead of treating the entire dataset as one entity.

In [40]:

filled_DF = DF.fillna({0: 0.1, 5: 1.25})
filled_DF

Out[40]:

          0         1         2         3         4         5
0  0.228273  1.026890 -0.839585 -0.591182 -0.956888 -0.222326
1 -0.619915  1.837905 -2.053231  0.868583 -0.920734  1.250000
2  2.152957 -1.334661  0.076380 -1.246089  1.202272  1.250000
3  0.100000 -0.419678  2.294842 -2.594487  2.822756  1.250000
4  0.100000 -1.976254  0.533340 -0.290870 -0.513520  1.250000
5  0.100000 -1.839905  1.607671  0.388292  0.399732  0.405477

You can also pass in "method='ffill'" as an argument, and the .fillna() method will forward-fill any missing values with the last non-null value in the column Series. Note rows 3 to 5 in column 0 and rows 1 to 4 in column 5.

In [41]:

fill_DF = DF.fillna(method='ffill')
fill_DF

Out[41]:

          0         1         2         3         4         5
0  0.228273  1.026890 -0.839585 -0.591182 -0.956888 -0.222326
1 -0.619915  1.837905 -2.053231  0.868583 -0.920734 -0.222326
2  2.152957 -1.334661  0.076380 -1.246089  1.202272 -0.222326
3  2.152957 -0.419678  2.294842 -2.594487  2.822756 -0.222326
4  2.152957 -1.976254  0.533340 -0.290870 -0.513520 -0.222326
5  2.152957 -1.839905  1.607671  0.388292  0.399732  0.405477

In [42]:

# Here we're setting up another DataFrame object with missing values so we can
# continue with our example.
np.random.seed(25)
DF1 = DataFrame(np.random.randn(36).reshape(6, 6))
DF1.loc[3:5, 0] = missing
DF1.loc[1:4, 5] = missing
DF1

You can generate a True|False table which identifies the NaNs by calling the .isnull() method. Then you can add the .sum() method to count how many missing instances you have by column. Here you can see columns zero and five have missing values. I've divided the two steps up to illustrate what each one produces.
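The two steps might look like the sketch below. It builds its own frame (named DF_example here, a name chosen for this sketch) with the same seed and the same missing positions described above, and writes np.nan directly so that it runs on its own.

```python
import numpy as np
from pandas import DataFrame

# Same seed and shape as the example frames in this section.
np.random.seed(25)
DF_example = DataFrame(np.random.randn(36).reshape(6, 6))
DF_example.loc[3:5, 0] = np.nan   # rows 3-5 of column 0
DF_example.loc[1:4, 5] = np.nan   # rows 1-4 of column 5

# Step 1: a True/False table marking each missing cell.
null_mask = DF_example.isnull()
print(null_mask)

# Step 2: summing the booleans counts the missing cells in each column.
missing_per_column = null_mask.sum()
print(missing_per_column)
```

Because True counts as 1 when summed, the result is a per-column tally of missing values.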

If you wanted to drop columns that contain any missing values, you'd just pass in the axis=1 argument to select and search the DF by columns, instead of by row.

In [46]:

DF_no_NaN = DF1.dropna(axis=1)
DF_no_NaN

Out[46]:

          1         2         3         4
0  1.026890 -0.839585 -0.591182 -0.956888
1  1.837905 -2.053231  0.868583 -0.920734
2 -1.334661  0.076380 -1.246089  1.202272
3 -0.419678  2.294842 -2.594487  2.822756
4 -1.976254  0.533340 -0.290870 -0.513520
5 -1.839905  1.607671  0.388292  0.399732

In [47]:

# Here we're setting up another DataFrame object with missing values so we can
# continue with our example.
np.random.seed(25)
DF2 = DataFrame(np.random.randn(36).reshape(6, 6))
DF2.loc[3:5, 0] = missing
DF2.loc[3, 1] = missing
DF2.loc[3, 2] = missing
DF2.loc[3, 3] = missing
DF2.loc[3, 4] = missing
DF2.loc[1:4, 5] = missing
DF2

To drop the rows that have duplicates found in a column Series, just call the drop_duplicates() method and pass in the label-index of the column. This method will drop every row whose value in the column you specify repeats an earlier row; by default the first occurrence is kept. As you can see from the previous chart, it's not inspecting the other columns, as we still have a duplicate in column 2.
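A small sketch of this behaviour follows. The frame and its column names ('column 1' to 'column 3') are made up for illustration and are not the data used in the original example.

```python
from pandas import DataFrame

# Illustrative data only: 'column 3' repeats A, B and C.
DF_obj = DataFrame({
    'column 1': [1, 1, 2, 2, 3, 3, 3],
    'column 2': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
    'column 3': ['A', 'A', 'B', 'B', 'C', 'D', 'C'],
})

# Only 'column 3' is inspected; the first row for each value is kept.
deduped = DF_obj.drop_duplicates(['column 3'])
print(deduped)
```

The rows kept are those at index 0, 2, 4 and 5, so 'column 2' still contains 'c' twice: the other columns were never checked.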

The concat() method joins data from separate sources into one combined data table. If you want to join objects based on their row index values, just call the pd.concat() method on the objects you want joined, and then pass in the axis=1 argument. The axis=1 argument tells pandas to concatenate the DFs by adding columns (in other words, joining on the row index values).
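As a rough illustration, the sketch below joins two small frames side by side. The frames DF_left and DF_right are invented for this example, and they deliberately have different row counts to show how the row index is aligned.

```python
import numpy as np
import pandas as pd
from pandas import DataFrame

# Two illustrative frames: 6 rows x 6 columns and 5 rows x 3 columns.
DF_left = DataFrame(np.arange(36).reshape(6, 6))
DF_right = DataFrame(np.arange(15).reshape(5, 3))

# axis=1 adds the columns of DF_right next to those of DF_left,
# matching rows by their index values.
combined = pd.concat([DF_left, DF_right], axis=1)
print(combined)
```

Row 5 exists only in DF_left, so the three columns that came from DF_right hold NaN in that row.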

To sort rows in a DF, either in ascending or descending order, call the .sort_values() method off of the DF, and pass in the "by" parameter to specify the column index you want to use to sort your Data Frame.
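For instance, a descending sort might be sketched as follows. The frame DF_cars and its values are placeholders for this example rather than the dataset used elsewhere in the section.

```python
from pandas import DataFrame

# Placeholder data for this sketch.
DF_cars = DataFrame({
    'car_names': ['car A', 'car B', 'car C', 'car D'],
    'mpg': [21.0, 22.8, 18.7, 33.9],
})

# by names the column to sort on; ascending=False gives descending order.
sorted_desc = DF_cars.sort_values(by='mpg', ascending=False)
print(sorted_desc)
```

Leaving out ascending=False sorts in ascending order, which is the default.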

To group a DF by its values in a particular column, call the .groupby() method, and then pass in the column Series you want the DF to be grouped by. Here we want to group the listed cars by their number of cylinders.

In [96]:

cars_groups=cars.groupby(cars['cyl'])

Then you can call the mean() method to calculate the mean values of the cars in each cylinder category.
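A self-contained sketch of the group-then-average step is shown below. The frame cars_example and its numbers are invented for illustration; only the 'cyl', 'mpg' and 'wt' column names are taken from this section.

```python
from pandas import DataFrame

# Invented values; all columns are numeric so mean() applies to each of them.
cars_example = DataFrame({
    'cyl': [4, 4, 6, 6, 8],
    'mpg': [30.0, 26.0, 21.0, 19.0, 15.0],
    'wt': [2.0, 2.4, 2.8, 3.2, 3.8],
})

# Group the rows by cylinder count, then average each group.
example_groups = cars_example.groupby(cars_example['cyl'])
group_means = example_groups.mean()
print(group_means)
```

The result has one row per cylinder count, with the group key as the row index.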

You only need to import what you're adding to a notebook. If this was your first import, you'd also have to add:

In [67]:

# import numpy as np
# import pandas as pd
# from pandas import Series, DataFrame

# We're adding the following imports for the next part.
from numpy.random import randn
import matplotlib.pyplot as plt
from matplotlib import rcParams
import seaborn as sb

When you add "%matplotlib inline", it tells matplotlib to print the data visualization within the Python notebook instead of opening it in an external graphical user interface.

# Setting the range and step size for the x-axis.
x = range(1, 10, 1)
# Sets the points to be plotted on the y-axis in the order of plotting from left to right.
y = [1, 2, 3, 4, 0, 4, 3, 2, 1]
# Plots the points. The range and the number of plots must be the same.
plt.plot(x, y)

Out[69]:

[<matplotlib.lines.Line2D at 0x10ab20898>]

You can render the same data as a bar chart:

In [98]:

plt.bar(x,y)

Out[98]:

<Container object of 9 artists>

Here we select the mpg Series from the cars DF, assign it to the variable "mpg", and plot it.

In [99]:

mpg = cars['mpg']
mpg.plot()

Out[99]:

<matplotlib.axes._subplots.AxesSubplot at 0x10c60f550>

You can represent this same data in bar form by adding the kind attribute to the plot method.

In [100]:

mpg.plot(kind='bar')

Out[100]:

<matplotlib.axes._subplots.AxesSubplot at 0x10c6f0a20>

You can render your bar chart horizontally by changing the kind attribute from 'bar' to 'barh'.

In [101]:

mpg.plot(kind='barh')

Out[101]:

<matplotlib.axes._subplots.AxesSubplot at 0x10b3605c0>

You can plot several data series at once by selecting their column labels from the DF.

In [102]:

DF6 = cars[['cyl', 'wt', 'mpg']]
DF6.plot()

Out[102]:

<matplotlib.axes._subplots.AxesSubplot at 0x10bfedc18>

The pie chart represents the data as a percentage of the whole. For example, x=[1,1,1] will be represented the same way as x=[9,9,9], as they share equal proportions. You could also represent a 25%/75% chart as x=[25,75] or as x=[1,3]. This relationship is rendered below to demonstrate that point.
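The arithmetic behind that claim can be checked with a short sketch. The helper name shares is hypothetical; it mirrors the x/sum(x) rule a pie chart uses to size its wedges and does not draw anything.

```python
import numpy as np

def shares(values):
    """Return each value as a fraction of the total (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    return values / values.sum()

# Equal inputs give equal thirds, whatever their magnitude.
print(shares([1, 1, 1]))
print(shares([9, 9, 9]))

# Both of these describe the same 25%/75% split.
print(shares([25, 75]))
print(shares([1, 3]))
```

Only the ratios between the values matter, which is why the two pairs of inputs produce identical charts.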

Additionally, you can save your files to your working directory by using the savefig method.
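A minimal sketch of saving a chart follows. The filename pie_chart.png is an assumption, and the sketch writes to the system temporary directory and selects the non-interactive Agg backend so that it runs without a display; passing a bare filename to savefig would write to the working directory instead.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # non-interactive backend, chosen so the sketch runs headless
import matplotlib.pyplot as plt

# Values and labels for an illustrative pie chart.
z = [1, 2, 3, 4, 0.5]
veh_type = ['bicycle', 'motorbike', 'car', 'van', 'stroller']
plt.pie(z, labels=veh_type)

# Hypothetical output location for this sketch.
out_path = os.path.join(tempfile.gettempdir(), 'pie_chart.png')
plt.savefig(out_path)
plt.close()
print(out_path)
```

savefig infers the image format from the file extension, so changing the suffix changes the output type where the installed backend supports it.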

# Setting the axes limits and tick marks. Each time you create a plot you need
# to start with a blank figure and add axes again.
fig = plt.figure()
ax = fig.add_axes([.1, .1, 1, 1])
# Sets x and y axis limits
ax.set_xlim([1, 9])
ax.set_ylim([0, 5])
# Sets x and y axis tick marks
# You'll notice that 3 and 7 are removed from the chart below.
ax.set_xticks([0, 1, 2, 4, 5, 6, 8, 9, 10])
ax.set_yticks([0, 1, 2, 3, 4, 5])
ax.plot(x, y)

Out[77]:

[<matplotlib.lines.Line2D at 0x10b52dc18>]

In [78]:

# Creates blank figure object
fig = plt.figure()
# The figure will have two axes at once, ax1 and ax2
# Subplot 1 row with two columns
fig, (ax1, ax2) = plt.subplots(1, 2)
# Plots x in axis 1
ax1.plot(x)
# Plots x and y in axis 2
ax2.plot(x, y)

# Choose the values to represent in your pie chart.
z = [1, 2, 3, 4, 0.5]
# Assign labels to that data by passing them as a list in the same order.
veh_type = ['bicycle', 'motorbike', 'car', 'van', 'stroller']
# Plot values and labels as a pie chart.
plt.pie(z, labels=veh_type)
plt.show()

In [90]:

# Adding a legend
# You can also represent the labels with a legend and let matplotlib choose the
# "best" display location.
plt.pie(z)
plt.legend(veh_type, loc='best')
plt.show()

fig = plt.figure()
ax = fig.add_axes([.1, .1, 1, 1])
mpg.plot()
ax.set_xticks(range(32))
ax.set_xticklabels(cars.car_names, rotation=60, fontsize='medium')
ax.set_title('Miles per Gallon of Cars in mtcars')
ax.set_xlabel('car names')
ax.set_ylabel('miles/gal')
# Adding a legend.
ax.legend(loc='best')

Out[92]:

<matplotlib.legend.Legend at 0x10c257128>

In [93]:

fig = plt.figure()
ax = fig.add_axes([.1, .1, 1, 1])
mpg.plot()
ax.set_title('Miles per Gallon of Cars in mtcars')
ax.set_ylabel('miles/gal')
ax.set_ylim([0, 45])
# Adds an in-graph annotation. The value of the xy attribute sets the location
# of the tip of the arrow. The xytext value sets the location of the text. The
# arrow will adjust between the two declared points.
ax.annotate('Toyota Corolla', xy=(19, 33.9), xytext=(21, 35),
            arrowprops=dict(facecolor='black', shrink=0.05))