Data Analysis with Pandas & Python

What is Data Analysis?
Data analysis is the process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. In today’s business world, data analysis helps make decisions more scientific and helps businesses operate more effectively.
Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.
This article uses pandas to walk through the basics of data analysis.
pandas has two main data structures, Series and DataFrame; older releases also included a third, Panel.
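
A Series is a one-dimensional labeled array. The examples that follow use a Series named a whose definition is not shown in the text. The sketch below reconstructs one plausible definition from the outputs printed further down (100.0, 98.7, 98.4, and 97.7 under the labels test1 to test4), so treat the exact construction as an assumption rather than the original snippet.

```python
import pandas as pd

# Assumed definition of `a`: the values are inferred from the outputs printed
# later in this article, not taken from the original (missing) snippet.
a = pd.Series([100.0, 98.7, 98.4, 97.7],
              index=["test1", "test2", "test3", "test4"])

print(a.iloc[1])    # access by position
print(a["test4"])   # access by label
```

Using .iloc for positions and plain brackets for labels keeps the two kinds of access unambiguous.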

You can use array indexing or labels to access data in the series.
print(a.iloc[1])    # by position; plain a[1] is deprecated for label-indexed Series in recent pandas
print(a['test4'])   # by label

98.7
97.7

You can also apply mathematical operations on pandas series.
b = a * 2
c = a ** 1.5
print(b)
print(c)

test1    200.0
test2    197.4
test3    196.8
test4    195.4
dtype: float64
test1    1000.000000
test2     980.563513
test3     976.096258
test4     965.699142
dtype: float64

You can even create a series of heterogeneous data.
s = pd.Series(['test1', 1.2, 3, 'test2'], index=['test3', 'test4', 2, '4.3'])
print(s)

test3    test1
test4      1.2
2            3
4.3      test2
dtype: object

pandas DataFrame

A pandas DataFrame is a two-dimensional data structure that can hold heterogeneous data, i.e., data is aligned in a tabular fashion in rows and columns.
Structure
Let us assume that we are creating a data frame with data about the members of a sales team.

Name     Age   Gender   Rating
Steve    32    Male     3.45
Lia      28    Female   4.6
Vin      45    Male     3.9
Katie    38    Female   2

You can think of it as an SQL table or a spreadsheet data representation.
The table represents the data of a sales team of an organization with their overall performance rating. The data is represented in rows and columns. Each column represents an attribute and each row represents a person.
The data types of the four columns are as follows −

Column   Type
Name     String
Age      Integer
Gender   String
Rating   Float

Key Points
• Heterogeneous data
• Size Mutable
• Data Mutable

A pandas DataFrame can be created using the following constructor −
pandas.DataFrame(data, index, columns, dtype, copy)

• data
data takes various forms like ndarray, Series, map, lists, dict, constants and also another DataFrame.
• index
The row labels to use for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.
• columns
The column labels to use for the resulting frame. Optional; defaults to np.arange(n) if no column labels are passed.
• dtype
The data type to force on the columns. If omitted, it is inferred from the data.
• copy
Whether to copy data from the inputs. Optional; for ndarray or DataFrame input, the data is not copied by default.
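
As a minimal sketch of the constructor, the sales team table above can be built from a list of rows plus explicit column labels. The variable names here are illustrative, and Katie's rating is written as 2.0 so that the column is inferred as float.

```python
import pandas as pd

# Rows taken from the sales team table earlier in this section.
rows = [
    ["Steve", 32, "Male", 3.45],
    ["Lia", 28, "Female", 4.6],
    ["Vin", 45, "Male", 3.9],
    ["Katie", 38, "Female", 2.0],
]

# No index is passed, so pandas uses the default 0..n-1 row labels.
team = pd.DataFrame(data=rows, columns=["Name", "Age", "Gender", "Rating"])

print(team)
print(team.dtypes)   # one inferred type per column
```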

There are many methods to create DataFrames.
• Lists
• dict
• Series
• Numpy ndarrays
• Another DataFrame

Creating DataFrame from the dictionary of Series
The following method can be used to create DataFrames from a dictionary of pandas series.
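
The original snippet is not shown at this point, so here is a minimal sketch with made-up values. The column names one and two and the labels a to d are assumptions chosen for illustration; the second Series simply has one more label than the first.

```python
import pandas as pd

# Two Series with partly different labels (illustrative values).
d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}

# The resulting index is the union of the Series labels; cells with no
# matching value are filled with NaN.
df = pd.DataFrame(d)
print(df)
```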

You might have noticed that we got a DataFrame with NaN values in it. This is because we didn’t provide data for that particular row and column.

Creating DataFrame from Text/CSV files
pandas comes in handy when you want to load data from a CSV or a text file. It has built-in functions to do this for you.

df = pd.read_csv('happiness.csv')

Yes, we created a DataFrame from a CSV file. This dataset contains the outcome of the European quality of life survey. This dataset is available here. Now that we have stored the DataFrame in df, we want to see what’s inside. First, we will look at the size of the DataFrame.

print(df.shape)

(105, 4)

It has 105 rows and 4 columns. Instead of printing out all the data, we will look at the first 10 rows.
df.head(10)

  Country  Gender  Mean    N=
0      AT    Male   7.3   471
1     NaN  Female   7.3   570
2     NaN    Both   7.3  1041
3      BE    Male   7.8   468
4     NaN  Female   7.8   542
5     NaN    Both   7.8  1010
6      BG    Male   5.8   416
7     NaN  Female   5.8   555
8     NaN    Both   5.8   971
9      CY    Male   7.8   433
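
To try the loading steps without the original file, the sketch below reads a small in-memory CSV shaped like the first rows shown above. The text in csv_text is a hypothetical stand-in for happiness.csv, and the forward fill at the end assumes that a blank Country means "same country as the row above", which the file layout suggests but the article does not state.

```python
import io
import pandas as pd

# Hypothetical stand-in for happiness.csv, kept in memory so the
# example runs without the real file.
csv_text = """Country,Gender,Mean,N=
AT,Male,7.3,471
,Female,7.3,570
,Both,7.3,1041
BE,Male,7.8,468
,Female,7.8,542
,Both,7.8,1010
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)     # (rows, columns)
print(df.head(3))   # first three rows; blank Country cells load as NaN

# Assumption: a blank Country repeats the country of the row above.
df["Country"] = df["Country"].ffill()
print(df)
```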

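There are many more ways to create a DataFrame, but now we will look at basic operations on DataFrames.

We are not interested in the unnamed column, so let’s delete that first. Then we’ll see the summary statistics with one line of code, using describe(). The output below comes from a different dataset, one with the numeric columns dbh, wood, bark, root, rootsk, and branch.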

              dbh          wood         bark         root       rootsk      branch
count  153.000000    133.000000    17.000000    54.000000    53.000000   76.000000
mean    26.352941   1569.045113   513.235294   334.383333   113.802264   54.065789
std     28.273679   4071.380720   632.467542   654.641245   247.224118   65.606369
min      3.000000      3.000000     7.000000     0.300000     0.050000    4.000000
25%      8.000000     29.000000    59.000000    11.500000     2.000000   10.750000
50%     15.000000    162.000000   328.000000    41.000000    11.000000   35.000000
75%     36.000000   1000.000000   667.000000   235.000000    45.000000   77.750000
max    145.000000  25116.000000  1808.000000  3000.000000  1030.000000  371.000000
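
The code for this step is not shown in the text, so the following is a sketch on toy data rather than the original snippet. The column name "Unnamed: 0" is an assumption (pandas gives names of this form to CSV columns without a header), and the values are only illustrative.

```python
import pandas as pd

# Toy frame with an unwanted, assumed-name column.
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "dbh": [27, 26, 9, 12],
    "wood": [550.0, 414.0, 42.0, 85.0],
})

df = df.drop(columns=["Unnamed: 0"])   # remove the column we don't need
summary = df.describe()                # count, mean, std, min, quartiles, max
print(summary)
```

The describe() result is itself a DataFrame, so a single figure can be read back with something like summary.loc["mean", "dbh"].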

It’s as simple as that. We can see the count, mean, standard deviation, and other statistics. Next, we will compute individual metrics, some of which (such as pairwise correlation) are not available in the describe() summary.

Mean:
print(df.mean(numeric_only=True))   # numeric_only skips text columns such as species

dbh         26.352941
wood      1569.045113
bark       513.235294
root       334.383333
rootsk     113.802264
branch      54.065789
dtype: float64

Min and Max:
print(df.min())

dbh                      3
wood                     3
bark                     7
root                   0.3
rootsk                0.05
branch                   4
species    Acacia mabellae
dtype: object

print(df.max())

dbh          145
wood       25116
bark        1808
root        3000
rootsk      1030
branch       371
species    Other
dtype: object

Pairwise Correlation
df.corr(numeric_only=True)   # recent pandas needs numeric_only=True when text columns are present

             dbh      wood      bark      root    rootsk    branch
dbh     1.000000  0.905175  0.965413  0.899301  0.934982  0.861660
wood    0.905175  1.000000  0.971700  0.988752  0.967082  0.821731
bark    0.965413  0.971700  1.000000  0.961038  0.971341  0.943383
root    0.899301  0.988752  0.961038  1.000000  0.936935  0.679760
rootsk  0.934982  0.967082  0.971341  0.936935  1.000000  0.621550
branch  0.861660  0.821731  0.943383  0.679760  0.621550  1.000000
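
The individual metrics above can be reproduced on a small frame. This is a sketch with illustrative values, not the article's dataset; it passes numeric_only=True so that the text column is skipped, which recent pandas versions require for mean() and corr() when non-numeric columns are present.

```python
import pandas as pd

# Toy rows in the shape of the dataset above (values are illustrative).
df = pd.DataFrame({
    "dbh": [27, 26, 9, 12],
    "wood": [550.0, 414.0, 42.0, 85.0],
    "species": ["B.myrtifolia"] * 4,
})

means = df.mean(numeric_only=True)   # the text column is skipped
corr = df.corr(numeric_only=True)    # Pearson correlation by default

print(means)
print(df[["dbh", "wood"]].min())
print(df[["dbh", "wood"]].max())
print(corr)
```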

Data Cleaning
We need to clean our data. Our data might contain missing values, NaN values, outliers, etc. We may need to remove or replace that data. Otherwise, our analysis might not make any sense.
We can find null values using the following method.

print(df.isnull().any())

dbh        False
wood        True
bark        True
root        True
rootsk      True
branch      True
species    False
fac26       True
dtype: bool

We have to remove these null values. This can be done by the method shown below.

newdf = df.dropna()   # drop every row that contains at least one null value
print(newdf)

     dbh   wood   bark  root  rootsk  branch       species fac26
123   27  550.0  105.0  44.0     9.0    59.0  B.myrtifolia     z
124   26  414.0   78.0  38.0    13.0    44.0  B.myrtifolia     z
125    9   42.0    8.0   5.0     1.3     7.0  B.myrtifolia     z
126   12   85.0   13.0  17.0     2.2    16.0  B.myrtifolia     z

print(newdf.shape)

(4, 8)
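
Dropping every incomplete row can discard most of a dataset, as the shape above shows. The sketch below, on toy data with illustrative values, contrasts removing rows with replacing nulls. Filling with the column mean is just one possible replacement strategy and is an assumption here, not something the article prescribes.

```python
import numpy as np
import pandas as pd

# Toy frame with a couple of missing values.
df = pd.DataFrame({
    "dbh": [27, 26, 9, 12],
    "wood": [550.0, np.nan, 42.0, 85.0],
    "bark": [105.0, 78.0, np.nan, 13.0],
})

print(df.isnull().any())   # which columns contain at least one null
print(df.isnull().sum())   # how many nulls each column has

dropped = df.dropna()                            # keep only complete rows
filled = df.fillna(df.mean(numeric_only=True))   # or fill nulls with column means

print(dropped.shape)
print(filled)
```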

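pandas Panel
A panel is a 3D container of data. The term panel data is derived from econometrics and is partially responsible for the name pandas − pan(el)-da(ta)-s. Note that Panel was deprecated in pandas 0.20 and removed in pandas 0.25, so the examples in this section only run on older releases.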
The names for the 3 axes are intended to give some semantic meaning to describing operations involving panel data. They are −
• items − axis 0, each item corresponds to a DataFrame contained inside.
• major_axis − axis 1, it is the index (rows) of each of the DataFrames.
• minor_axis − axis 2, it is the columns of each of the DataFrames.

A Panel can be created using the following constructor −
pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)
The parameters of the constructor are as follows −
• data – takes various forms like ndarray, Series, map, lists, dict, constants and also another DataFrame
• items – axis=0
• major_axis – axis=1
• minor_axis – axis=2
• dtype – the data type of each column
• copy – whether to copy the data; default False

A Panel can be created in multiple ways, such as −
• From ndarrays
• From dict of DataFrames

From 3D ndarray
# creating a panel from a 3D ndarray
import pandas as pd
import numpy as np

data = np.random.rand(2, 4, 5)
p = pd.Panel(data)
print(p)

output:
Dimensions: 2 (items) x 4 (major_axis) x 5 (minor_axis)
Items axis: 0 to 1
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 4

Note − Observe the dimensions of the empty panel and the above panel, all the objects are different.

From dict of DataFrame Objects

# creating a panel from a dict of DataFrames
import pandas as pd
import numpy as np

data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
        'Item2': pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p)

output:
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2
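
Because pd.Panel is not available in current pandas releases, the same dict of DataFrames can be held in a single DataFrame with a two-level row index. The following is a sketch of that alternative, not the article's method; the level names items and major_axis are chosen here only to mirror the Panel terminology.

```python
import numpy as np
import pandas as pd

data = {
    "Item1": pd.DataFrame(np.random.randn(4, 3)),
    "Item2": pd.DataFrame(np.random.randn(4, 2)),
}

# Stack the frames: the dict keys become the outer index level ("items")
# and the original row labels become the inner level ("major_axis").
stacked = pd.concat(data, names=["items", "major_axis"])

print(stacked.shape)          # 8 rows (2 items x 4 rows), 3 columns
print(stacked.loc["Item1"])   # the 4 x 3 frame stored under Item1
```

Item2 has only two columns, so its third column is filled with NaN, just as the Panel pads the shorter frame to the common 4 x 3 shape.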

Selecting the Data from Panel
Select the data from the panel using −
• Items
• Major_axis
• Minor_axis

Using Items

# selecting one item from a panel
import pandas as pd
import numpy as np

data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
        'Item2': pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print(p['Item1'])

output:
          0         1         2
0 -0.006795 -1.156193 -0.524367
1  0.025610  1.533741  0.331956
2  1.067671  1.309666  1.304710
3  0.615196  1.348469 -0.410289

We have two items, and we retrieved Item1. The result is a DataFrame with 4 rows and 3 columns, which are the Major_axis and Minor_axis dimensions.

Using major_axis
Data can be accessed using the method panel.major_xs(index).

      Item1     Item2
0  0.027133 -1.078773
1  0.115686 -0.253315
2 -0.473201       NaN

Using minor_axis
Data can be accessed using the method panel.minor_xs(index).
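
On pandas releases without Panel, the three kinds of selection can be approximated on a DataFrame with a two-level row index. This is a sketch under that assumption: the stacked frame and the level names items and major_axis are illustrative choices, not part of the Panel API.

```python
import numpy as np
import pandas as pd

data = {
    "Item1": pd.DataFrame(np.random.randn(4, 3)),
    "Item2": pd.DataFrame(np.random.randn(4, 2)),
}
stacked = pd.concat(data, names=["items", "major_axis"])

# By item: one 4 x 3 frame, comparable to p['Item1'].
item1 = stacked.loc["Item1"]

# By major axis: row 1 of every item, transposed so that items are
# the columns, comparable to p.major_xs(1).
major_1 = stacked.xs(1, level="major_axis").T

# By minor axis: column 1 of every item, with items as the columns,
# comparable to p.minor_xs(1).
minor_1 = stacked[1].unstack("items")

print(item1, major_1, minor_1, sep="\n\n")
```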