Data Analysis Apart from Excel – Getting Started

tl;dr This post is about how to analyse data in a very efficient manner. Try out Python with pandas for data analysis and matplotlib for visualisation.

Introduction – How can I get insight into my data?

Our daily life brings us tons of data. Our smartphones, a collection of ten sensors or even more, follow us everywhere we go to. Another example is our credit or debit card account. Everytime we pay, we leave an entry in the record. Depending upon the business we work in, there are much more opportunities for data aggregation. Among others, these are:

Measurements

Calculations

Simulations

Daily business, such as

Access to website

Customer data (country, region, zip code, sales volume, branch, etc.)

Weather forecast

Utilisation of servers, vehicles, tools, etc.

Accounting, stock market, etc.

I would like to show an easy way to understand data, to analyse the past, and to find trends for the future. However, I do not like to employ spreadsheets for this task, since data wrangling with cells and clicks is tedious and prone to errors.

Data Analysis – How do I analyse my data efficiently?

There are several approaches for data analysis. Beside spreadsheets, there are several scripting languages which are famous for their statistics features, such as R and S. However, the language Python is more convenient as it is not restricted to statistics but an universal programming language allowing coding of almost all tasks from a simple calculator until a web server. The primary focus of their developers is easy readability, which is the reason why people sometimes say that Python is executable pseudo-code. Moreover, it is open-source and extendible with modules. Here, we would like to employ:

pandas

NumPy

matplotlib

Most of the features we will use here are taken from pandas. This module has been developed for the analysis of time series for stock market and provides data frames with tabular structure allowing access to rows, columns, and cells. The installation routine of Python and the modules depend upon the operating system. There is an excellent documentation tackling this task: https://pandas.pydata.org/pandas-docs/stable/install.html. For further reading, checkout the PyData website (https://pandas.pydata.org).

Our script-based approach features some important advantages:

Few commands only

Less error prone than manual spreadsheet work

Very efficient

Let’s get started. At the beginning we have to import the required modules.

Example: Black-box of a thermo-hydraulic system

Story

We have a thermo-hydraulic system without any knowledge of its internal structure. Therefore, we plug it into a test bench, connect supply and return of water, and the electrical power cable. Moreover, we measure a couple of quantities, such as:

Flow rate Q

Pressure difference between inlet and outlet Δp

Temperatures of inlet T1 and outlet T2

Electrical power P

Scheme for a black-box thermohydraulic system

In order to acquire operation data of the system, the following work flow is carried out:

Adjusting the flow rate to the desired value

Waiting until steady state

Measuring the physical quantities

Storing data into CSV file

Reading data from a CSV file

The data is stored in a CSV file that actually uses spaces as column separators. Loading a file and creating a data frame df is as easy as:

In [2]:

df=pd.read_csv('measurement_1.csv',sep=' ')

A data frame is an indexed matrix-like structure with entries that can be addressed with their row (index) and column (title).

What is stored in the data frame?

There are several options to list the entities stored in a data frame. These are:

df: showing all entries (for large data sets, the beginning and end only)

df.head(): showing the first 5 rows (any other value can be provided as argument)

df.tail(): showing the last 5 rows (any other value can be provided as argument)

In [3]:

df.head()

Out[3]:

Index

t/s

T1/C

T2/C

Tu/C

pu/Pa

fu/%

P/W

Q/(m^3/s)

dp/Pa

u/(m/s)

0

0

21.97

21.97

19.70

99979.55

60.14

999.70

0.000017

28520.99

0.844

1

1

21.40

22.43

20.08

99996.69

59.59

999.69

0.000017

29131.00

0.856

2

2

21.46

23.42

19.97

100001.17

60.54

1000.18

0.000017

29720.75

0.862

3

3

21.86

24.74

19.68

99964.14

58.92

999.44

0.000017

28927.89

0.849

4

4

22.30

25.89

19.60

99993.45

60.38

999.87

0.000017

29111.47

0.858

N.b.: The table head is determined automatically from the CSV file. It could have been also redefined in the pd.read_csv() command.

How do I get the full statistics?

The statistics of a data frame can be easily obtained by calling the method describe(). This provides the number of valid entries, i.e., non-NaN values, the mean value, the standard deviation, and various quantiles of the sample of each column. By the way, the method describe() can also be applied to non-numeric columns, where the number of entries is returned.

In [4]:

df.describe()

Out[4]:

t/s

T1/C

T2/C

Tu/C

pu/Pa

fu/%

P/W

Q/(m^3/s)

dp/Pa

u/(m/s)

count

151.000000

151.000000

151.000000

151.000000

151.000000

151.000000

151.000000

1.510000e+02

151.000000

151.000000

mean

75.000000

21.940662

34.973311

20.019073

99996.339007

60.044437

999.939404

1.668497e-05

28727.719603

0.849748

std

43.734045

0.209799

2.892432

0.212632

26.095390

0.636769

0.959258

1.841585e-07

605.813018

0.009424

min

0.000000

21.280000

21.970000

19.520000

99934.530000

58.510000

997.600000

1.616100e-05

27035.320000

0.823000

25%

37.500000

21.800000

35.345000

19.870000

99980.325000

59.565000

999.230000

1.656900e-05

28337.590000

0.844000

50%

75.000000

21.920000

36.060000

20.020000

99996.290000

60.050000

999.930000

1.669400e-05

28758.160000

0.850000

75%

112.500000

22.085000

36.330000

20.165000

100015.625000

60.540000

1000.585000

1.681100e-05

29152.945000

0.856000

max

150.000000

22.360000

37.100000

20.550000

100071.300000

61.880000

1002.320000

1.731500e-05

30911.540000

0.882000

How do I plot data of a particular column?

Reading data in a tabulated manner is very often not intuitive. For this purpose, the plot() method of a data frame can be employed. Here, we would like to see the evolution of the temperature T2 over time within the limits of [22,38] and save the resulting figure to a PDF file. The figure file type can be also any other common one, such as (E)PS, PNG, JPG, etc.

In [5]:

df['T2/C'].plot(grid=True,ylim=(22,38))plt.savefig('t2.pdf')

Evolution of the outlet temperature

How do I find the steady-state?

Calculating the running mean value

The next step in our analysis is to consider the data only when the system is in a steady-state, i.e., the thermodynamic quantities do not vary more than the noise of the signal. For this purpose, we evaluate the mean value of a particular time interval just before the time under consideration. This feature is implemented with the method rolling() which accepts several arguments. Here we define the length to be 20 entries and the minimum length to be 20, too. This is important for the behaviour at the beginning. Just try on your own, what happens if you set min_periods to other values. Afterwards, the method mean() is applied, which calculates the mean value of all quantities in this interval. The results are stored in a new data frame dff.

In [6]:

dff=df['T2/C'].rolling(20,min_periods=20).mean()

The following plot shows that the first 20 entries (0…19 s) are without data, which is not quite surprising as the default behaviour of the method rolling() is to store its value at the end of the interval. It can be observed that the running mean is more or less constant from times greater than 70 s in this particular example. However, we do not know this a priori.

In [7]:

dff.plot(grid=True,ylim=(22,38))

Evolution of the running mean of the outlet temperature

Calculating with columns

Now, we want to use the running mean data frame dff for determining the data when the system is in steady-state. Therefore, we calculate the difference of the present temperature and its running mean value.

We see that the difference is less than 1 K for most of the values from times greater than 40 s.

In [8]:

diff=(df['T2/C']-dff)diff.plot(grid=True)plt.savefig('t2_diff.pdf')

Evolution of the temperature difference between the outlet temperature and the coinciding running mean value

Creating a filter

We can now create a filter that stores the result of an inequality as a Boolean array. Here, we check if the temperature difference is less than 2.5% of the coinciding temperature. Again, there is no output, since the result is assigned to a variable.

In [9]:

fil=diff<0.025*df['T2/C']

The output of the filter variable fil is the Boolean array. It can be observed that the first and the last thirty entries are shown in order to keep the list comprehensible.

Using filtered values only

The filter fil can be applied to the original data frame df by employing the squared bracket. The method describe() reveals the statistics of the filtered data frame, i.e., only of those rows for which the filter is true. Compared to Out[5] we see that roughly 40 data points are not considered. Thus, the mean value of T2/C is now slightly higher.

In [11]:

df[fil].describe()

Out[11]:

t/s

T1/C

T2/C

Tu/C

pu/Pa

fu/%

P/W

Q/(m^3/s)

dp/Pa

u/(m/s)

count

110.000000

110.000000

110.000000

110.000000

110.000000

110.000000

110.000000

1.100000e+02

110.000000

110.000000

mean

95.445455

21.945727

36.206273

20.029636

99997.141545

60.086636

999.903545

1.668840e-05

28714.969091

0.849936

std

31.992140

0.198256

0.343078

0.200842

26.690450

0.632201

0.968197

1.875631e-07

635.548513

0.009619

min

39.000000

21.550000

35.240000

19.520000

99939.890000

58.510000

997.600000

1.626500e-05

27035.320000

0.828000

25%

68.250000

21.800000

36.000000

19.890000

99981.295000

59.572500

999.150000

1.657100e-05

28335.310000

0.844000

50%

95.500000

21.920000

36.240000

20.030000

99995.845000

60.080000

999.940000

1.671050e-05

28755.000000

0.851000

75%

122.750000

22.090000

36.430000

20.170000

100017.955000

60.540000

1000.577500

1.680975e-05

29087.127500

0.856000

max

150.000000

22.360000

37.100000

20.550000

100071.300000

61.880000

1002.230000

1.731500e-05

30911.540000

0.882000

If not the entire statistics is required, but some particular values only, they can be addressed with the methods min(), max(), mean(), std(), median() and quantile(x), where $x\in[0,1]$. Here, the mean values are calculated as follows: