Frequency Distribution Analysis using Python Data Stack – Part 1

During my years as a consultant data scientist, I have received many requests from clients to provide frequency distribution reports for their specific business data needs. These reports have helped company management make sound business decisions quickly. In this paper I would like to show how to design and develop a generic frequency distribution library that will reduce your development time and provide a good summary table and image report for your clients. One important topic covered in this paper is the conversion of top-down Python code into a generic, reusable superclass library for future Object-Oriented Programming (OOP) development applied to data analytics and visualization.

Matplotlib is a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms.

Frequency Statistical Definitions

The frequency of a particular data value is the number of times that value occurs. A frequency distribution is a tabular summary (frequency table) of data showing the number of observations (outcomes) in each of several non-overlapping categories, called classes. The objective is to provide a simple interpretation of the data that cannot be obtained quickly by looking only at the original raw data.
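The definition above can be illustrated with a minimal sketch that uses only the Python standard library. The function name frequency_table and the sample values are hypothetical and chosen only for illustration.

```python
from collections import Counter


def frequency_table(values):
    """Return (value, count, relative frequency) rows, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(value, count, count / total)
            for value, count in counts.most_common()]


# Hypothetical sample of categorical observations.
priorities = ["Info", "Error", "Info", "Warning", "Info", "Error"]

for value, count, share in frequency_table(priorities):
    print(f"{value:<8} {count:>3} {share:6.1%}")
```

Each row is one class, and the counts across all classes add up to the number of raw observations.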

Frequency distribution analysis can be used for both categorical (qualitative) and numerical (quantitative) data types. I have seen it used most often for categorical data, especially during the data cleansing process with the pandas library. In general, there are two types of frequency tables: univariate (used with a single variable) and bivariate (used with two variables). Univariate tables will be used in this paper. Bivariate frequency tables are presented as two-way contingency tables. These tables are used in chi-squared test analysis for the goodness-of-fit test and the test of independence. We'll be covering these topics in future papers.
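As a rough sketch of the difference between the two table types, the pandas calls below build a univariate table with Series.value_counts and a bivariate contingency table with pandas.crosstab. The DataFrame contents are made-up sample values, not data from this paper.

```python
import pandas as pd

# Hypothetical sample data with two categorical variables.
df = pd.DataFrame({
    "Priority": ["Info", "Error", "Info", "Warning", "Info", "Error"],
    "Category": ["Network", "Network", "Disk", "Disk", "Network", "Disk"],
})

# Univariate frequency table: counts for a single variable.
univariate = df["Priority"].value_counts()
print(univariate)

# Bivariate (two-way) contingency table: counts for each pair of values.
bivariate = pd.crosstab(df["Priority"], df["Category"])
print(bivariate)
```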

As you can see from Table 1, the log data file contains four columns: Time, Priority, Category and Message. In a real production environment this log file may have hundreds of thousands of rows.
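For a file of that size, the counts can be accumulated while streaming through the rows instead of loading everything first. The sketch below assumes the log is stored as CSV with a header row matching the four column names. The actual file format is not stated here, and the sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter


def count_column(lines, column):
    """Count the values of one column while streaming through CSV rows."""
    reader = csv.DictReader(lines)
    return Counter(row[column] for row in reader)


# Hypothetical stand-in for the log file; any open text file works the same way.
sample_log = io.StringIO(
    "Time,Priority,Category,Message\n"
    "2018-01-01 08:00:01,Info,Network,Link up\n"
    "2018-01-01 08:00:05,Error,Network,Connection timed out\n"
    "2018-01-01 08:01:12,Info,Disk,Scan complete\n"
    "2018-01-01 08:02:30,Warning,Disk,Low free space\n"
)

priority_counts = count_column(sample_log, "Priority")
print(priority_counts.most_common())
```

Because the reader yields one row at a time, memory use depends on the number of distinct values rather than on the number of rows.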

Network Server Activities Analysis

The server administration team has requested a statistical analysis and report of the network activities for maintenance and management review. In general, this frequency statistical report includes two components:

As you can see from Code Listing 1, most of the input data is hardcoded in the program, so the only way to reuse it is to copy and paste it into another module file and then change the input values. That is a lot of work and a very bad programming practice. The hardcoded inputs include the data file and image paths, the data column name, many plot parameters, etc.
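One possible way to lift such values out of the program body is to gather them in a small settings object that is passed to the analysis code. This is only a sketch: ReportSettings, count_values, and the file names are hypothetical and do not come from Code Listing 1.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportSettings:
    """Inputs that would otherwise be hardcoded (all names are hypothetical)."""
    data_file: str
    image_file: str
    column_name: str
    plot_title: str = "Frequency Distribution"


def count_values(rows, settings):
    """Count the values of the configured column in an iterable of dict rows."""
    return Counter(row[settings.column_name] for row in rows)


settings = ReportSettings("server_log.csv", "priority_frequency.png", "Priority")
rows = [{"Priority": "Info"}, {"Priority": "Error"}, {"Priority": "Info"}]
counts = count_values(rows, settings)
print(settings.plot_title, dict(counts))
```

Reusing the analysis for another column or file then means creating a different settings object rather than editing a copy of the script.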

I have seen many Python programmers implement data analytics this way, whether in a Jupyter Notebook or in any modern text editor. It is as if they do not know the importance of object-oriented design and implementation, continuous integration and deployment practices, unit and system tests, etc.

Frequency Distribution Main Library

We need to create a reusable and extensible library to considerably reduce data analytics development time and the amount of code required. I have developed a frequency_distribution_superclass.py module that contains the frequency distribution class library FrequencyDistributionLibrary(object), shown in Code Listing 2.
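To give a rough idea of the superclass approach, here is a minimal sketch. It is not the code from Code Listing 2: only the class name FrequencyDistributionLibrary comes from the text, while the constructor arguments, the frequency_table method, and the PriorityFrequency subclass are assumptions made for illustration. Plotting is left out to keep the sketch short.

```python
import pandas as pd


class FrequencyDistributionLibrary(object):
    """Hypothetical sketch of a reusable frequency distribution superclass."""

    def __init__(self, data, column_name):
        self.data = data                # pandas DataFrame with the raw data
        self.column_name = column_name  # column to summarize

    def frequency_table(self):
        """Return counts and percentages for the configured column."""
        counts = self.data[self.column_name].value_counts()
        return pd.DataFrame({
            "Frequency": counts,
            "Percent": 100 * counts / counts.sum(),
        })


class PriorityFrequency(FrequencyDistributionLibrary):
    """Hypothetical subclass that fixes the column to Priority."""

    def __init__(self, data):
        super().__init__(data, column_name="Priority")


# Made-up sample data for illustration.
log = pd.DataFrame({"Priority": ["Info", "Error", "Info", "Warning", "Info", "Error"]})
table = PriorityFrequency(log).frequency_table()
print(table)
```

A subclass only supplies what differs for its report, and the table logic stays in one place.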

The Author

Ernest Bonat, Ph.D.

Ernest Bonat, Ph.D., is a Senior Software Developer and Senior Data Scientist with over 25 years of experience in designing, developing and deploying database business applications. He earned a Ph.D. in Engineering in Computer Science from Kiev Polytechnic Institute in Ukraine, after which he held senior roles as a consultant software engineer, data scientist, mentor and teacher.