Author: Karandeep Kaur

In this tutorial, you will work through two functionally equivalent examples / demos – one written in Hive (v. 1.1) and the other written using PySpark API for the Spark SQL module (v. 1.6) – to see the differences between the command syntax of these popular Big Data processing systems. Both examples / demos have been prepared using CDH 5.13.0 and that’s what you are going to use in this exercise.

We will work against the same input file in both examples, running functionally equivalent query statements against that file.

Both Hive and PySpark shells support the Unix command-line short-cuts that allow you to quickly navigate along a single command line:

5. Enter the following commands to put the file on HDFS:

The file is now in the hive_demo directory on HDFS – that’s where we are going to load it from when working with both Hive and Spark.

Part 3 – The Hive Example / Demo

We will use the hive tool to start the interactive Hive shell (REPL) instead of the now recommended beeline tool, which is a bit more ceremonial. Here we are going to ignore the advantages of Beeline’s client-server architecture in favor of the simplicity of hive.

1. Enter the following command to start the Hive REPL:

hive

Note: You start the Beeline shell in embedded mode by running this command:

beeline -u jdbc:hive2://

To connect to a remote Hive Server (that is the primary feature of Beeline), use this command:

To use the sqlContext’s sql method, you need to first register the DataFrame as a temp table that is associated with the active sqlContext; this temp table’s lifetime is tied to that of your current Spark session.

6. Enter the following command:

xfiles.registerTempTable('xfiles_tmp')

7. Enter the following command to fetch the first ten rows of the table:

Part 5 – Creating a Spark DataFrame from Scratch

You can create a Spark DataFrame from raw data sitting in memory and have Spark infer the schema from the data itself. Schema inference maps Python values to a limited set of Spark SQL column types, but in most practical cases what PySpark supports is more than enough.

1. Enter the following commands to create a list of tuples (records) representing our data (we simulate data coming from different sources here):

1.1 Data Visualization

The common wisdom states that ‘Seeing is believing and a picture is worth a thousand words’. Data visualization techniques help users understand the data, underlying trends and patterns by displaying it in a variety of graphical forms (heat maps, scatter plots, charts, etc.). Data visualization is also a great vehicle for communicating analysis results to stakeholders. Data visualization is an indispensable activity in exploratory data analysis (EDA). Business intelligence software vendors usually bundle data visualization tools into their products. There are a number of free tools that may offer similar capabilities in certain areas of data visualization.

1.2 Data Visualization in Python

The three most popular data visualization libraries with Python developers are matplotlib, seaborn, and ggplot.

seaborn is built on top of matplotlib, so you still need to perform the required matplotlib imports when using it.

1.3 Matplotlib

Matplotlib [https://matplotlib.org/] is a Python graphics library for data visualization. The project dates back to 2002 and offers Python developers a MATLAB-like plotting interface. It depends on NumPy. You can generate plots, histograms, power spectra, bar charts, error charts, scatter plots, etc., with just a few lines of code. Matplotlib’s main focus is 2D plotting; 3D plotting is supported through the mplot3d toolkit. It supports different graphics platforms and toolkits, as well as the common vector and raster graphics formats (JPG, PNG, SVG, PDF, etc.). Matplotlib can be used in Python scripts, the IPython REPL, and Jupyter notebooks.

1.4 Getting Started with matplotlib

In your Python program, you start by importing the matplotlib.pyplot module and aliasing it like so:

import matplotlib.pyplot as plt

In Jupyter notebooks, you can instruct the graphics rendering engine to embed the generated graphs in the notebook page with this “magic” command:

%matplotlib inline

The generated graphics will be in-lined in your notebook and there will be no plotting window popping up as in stand-alone Python (including IPython). You can now use the functions of the matplotlib.pyplot module to draw your plots. When done, call plt.show() to render your plot. The show() function discards the figure when you close the plot window (you cannot run plt.show() again on the same figure). In a Jupyter notebook, you are not required to call show(). To suppress the diagnostic output that the notebook echoes for the last graph rendering command, simply add ‘;’ at the end of that command.
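As a minimal sketch of the workflow described above, the following script draws a single line plot with pyplot functions. The sample data, labels, and title are made up for illustration, and the Agg backend is selected only on the assumption that the script runs without a display; in a Jupyter notebook or a desktop session you would drop that line.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit in Jupyter or on a desktop
import matplotlib.pyplot as plt

# Hypothetical sample data, used only for illustration
x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

# plt.plot() draws on the current figure and returns the list of line objects it created
lines = plt.plot(x, y, marker="o", label="y = x^2")
plt.xlabel("x")
plt.ylabel("y")
plt.title("A first matplotlib plot")
plt.legend()

# With a GUI backend in stand-alone Python, a call to plt.show() at this point
# would open the plotting window.
```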

1.5 Figures

The matplotlib.pyplot.figure() function creates a new figure, which in stand-alone Python is rendered in its own plotting window. You can create multiple figures before the final call to show(), upon which all the images will be rendered in their respective plotting windows. You can optionally pass the function a number or a string as a parameter that identifies the figure, which helps you move back and forth between the figures. An important function parameter is figsize, which holds the figure width and height in inches, e.g. plt.figure(figsize=[12,8]). The default figsize values are 6.4 and 4.8 inches.

Examples of using the figure() function in stand-alone Python

plt.figure(1) # Subsequent graphics commands will be rendered in the first plotting window

plt.subplot(211) # You can set the figure’s grid layout

plt.plot( …

plt.subplot(212)

plt.plot( …

plt.figure(2) # Now all the subsequent graphics will be rendered in a second window

plt.plot( …

plt.figure(1) # You can go back to figure #1

…

plt.show() # Two stacked-up plotting windows will be generated

Note: You can drop the figure() parameters in case you do not plan to alternate between the figures.
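The outline above leaves the plot() arguments elided. A complete version might look like the sketch below; the data values, the figure label "second", and the figure title are illustrative assumptions, and the Agg backend is chosen only so the script can run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit on a desktop
import matplotlib.pyplot as plt

plt.close("all")  # start from a clean slate so figure numbering begins at 1

xs = [0, 1, 2, 3]  # hypothetical sample data

# Figure 1: two plots stacked in a 2-row, 1-column grid
fig1 = plt.figure(1, figsize=(8, 6))
plt.subplot(211)
plt.plot(xs, [1, 2, 3, 4])
plt.subplot(212)
plt.plot(xs, [4, 3, 2, 1])

# A second figure, identified by a string label instead of a number
fig2 = plt.figure("second")
plt.plot(xs, [0, 1, 0, 1])

# Switch back to figure 1; subsequent commands apply to it again
plt.figure(1)
plt.suptitle("Figure 1")

open_figures = plt.get_fignums()  # numbers of all figures created so far
print(open_figures)
```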

1.6 Saving Figures to a File

Use the matplotlib.pyplot.savefig() function to save the generated figure to a file. Matplotlib will try to figure out the file’s format using the file’s extension. Supported formats are eps, jpeg, jpg, pdf, pgf, png, ps, raw, rgba, svg, svgz, tif, tiff.
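A short sketch of saving one figure in two formats follows. The file names and the temporary output directory are assumptions made for illustration; the format is picked up from the file extension as described above.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit on a desktop
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(6.4, 4.8))
plt.plot([1, 2, 3], [2, 4, 8])  # hypothetical sample data

# Hypothetical output location: a fresh temporary directory
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "demo_plot.png")
svg_path = os.path.join(out_dir, "demo_plot.svg")

plt.savefig(png_path, dpi=150)  # raster output; dpi controls the resolution
plt.savefig(svg_path)           # vector output, chosen from the .svg extension

# The formats the active backend can write can be listed at run time
supported = sorted(fig.canvas.get_supported_filetypes())
print(supported)

plt.close(fig)
```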

1.7 Seaborn

seaborn is a popular data visualization and EDA library [https://seaborn.pydata.org/]. It is based on matplotlib and is closely integrated with pandas data structures. Its attractive features include a dataset-oriented API for examining relationships between multiple variables, convenient views of complex datasets, high-level abstractions for structuring multi-plot grids, and concise control over matplotlib figure styling with several built-in themes.

1.8 Getting Started with seaborn

The required imports are as follows:

%matplotlib inline

import matplotlib.pyplot as plt

import seaborn as sns

Optionally, you can start your data visualization session by resetting the rendering engine settings to seaborn’s default theme and color palette using this command:

sns.set()

1.9 Histograms and KDE

You can render histogram plots along with the fitted kernel density estimate (KDE) line with the distplot() function, e.g.

sns.distplot(pandas_df.column_name)

1.10 Plotting Bivariate Distributions

In addition to plotting univariate distributions (using the distplot() function), seaborn offers a way to plot bivariate distributions using the jointplot() function:

sns.jointplot(x="col_nameA", y="col_nameB", data=DF, kind="kde");

1.11 Scatter plots in seaborn

Scatter plots are rendered using the scatterplot() function, for example:

sns.scatterplot(x=x, y=y, hue=[list of color levels]);
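A runnable variant of the call above could look like this sketch. The DataFrame, its column names, and the group labels are hypothetical; the hue argument names the column whose values determine the point colors.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit in Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data: six points split into two groups
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2.0, 4.1, 5.9, 8.2, 9.8, 12.1],
    "group": ["a", "a", "a", "b", "b", "b"],
})

fig, ax = plt.subplots()
# Each distinct value in the "group" column gets its own color
sns.scatterplot(data=df, x="x", y="y", hue="group", ax=ax)

legend_labels = [t.get_text() for t in ax.get_legend().get_texts()]
print(legend_labels)
```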

1.12 Pair plots in seaborn

The pairplot() function automatically plots pairwise relationships between variables in a dataset. A sample output of the function is shown below.

Note: Trying to plot too many variables (stored as columns in your DataFrame) in one go may clutter the resulting pair plot.

1.13 Heatmaps

Heatmaps, popularized by Microsoft Excel, are supported in seaborn through its heatmap() function.

A sample output of the function is shown below.
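One common use of heatmap() is to display a correlation matrix. The sketch below assumes a small made-up DataFrame; the column names, the color map, and the annot option are illustrative choices, not requirements.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit in Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data: b rises with a, c falls as a rises
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 4, 3, 2, 1],
})

corr = df.corr()  # pairwise Pearson correlations, a 3 x 3 matrix

fig, ax = plt.subplots()
# annot=True writes each cell's value inside the cell
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
```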

1.14 Summary

In this tutorial, we reviewed two main data visualization packages in Python: matplotlib and seaborn.

1.1 Code Organization in Python

Several organizational terms are used when referring to Python code:
Module – A file with some Python code in it
Library – A collection of modules
Package – A directory that can include an individual module, a library of modules, an __init__.py file or sub-package(s).

1.2 Python Modules

A Module is a file containing Python code. Modules that are meant to be executed are sometimes called Scripts. Modules meant to be imported and used by other modules may be referred to as Libraries. Module filenames have the extension .py.
Python module code can include variables, functions, classes and runnable code (code not in a function).

1.3 Python Module Example

The following file is an example of a Python module:
# my_utils.py
def get_mod_name():
    return __name__

def halved(a):
    return a / 2

def doubled(a):
    return 2 * a

def squared(a):
    return a * a

1.4 Using Modules

Modules must be imported before they can be used:
import module_name

Functions are called with the following syntax:
module_name.function_name()
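To try this end to end without setting up a project, the following sketch writes a copy of the my_utils.py example into a temporary directory, makes that directory importable, and then calls the module's functions. The temporary directory and the sys.path manipulation are assumptions made to keep the example self-contained.

```python
import importlib
import os
import sys
import tempfile
import textwrap

# Source of the example module, written to disk below
module_source = textwrap.dedent('''
    def get_mod_name():
        return __name__

    def halved(a):
        return a / 2

    def doubled(a):
        return 2 * a

    def squared(a):
        return a * a
''')

# Hypothetical location: a fresh temporary directory
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "my_utils.py"), "w") as f:
    f.write(module_source)

# Make the directory visible to the import system
sys.path.append(module_dir)
importlib.invalidate_caches()

import my_utils

print(my_utils.doubled(21))      # 42
print(my_utils.halved(9))        # 4.5
print(my_utils.squared(3))       # 9
print(my_utils.get_mod_name())   # my_utils
```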

1.5 Import Statements

Import statements can be used to import a module or an individual function. Imports can also allow access via an alias.

Import                        Allows Use of
import mod1                   mod1.func_name()
import mod1 as m              m.func_name()
import dir1.mod1              dir1.mod1.func_name()
import dir1.mod1 as dm1       dm1.func_name()
from mod1 import func_name    func_name()
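The same import forms can be tried with standard library modules standing in for the placeholder names mod1 and dir1.mod1. In this sketch, math plays the role of mod1 and os.path plays the role of dir1.mod1; the file path is a made-up string.

```python
import math                # import mod1
import math as m           # import mod1 as m
import os.path             # import dir1.mod1
import os.path as osp      # import dir1.mod1 as dm1
from math import sqrt      # from mod1 import func_name

sample_path = "/tmp/data/report.csv"  # hypothetical path, never opened

print(math.sqrt(16))                  # 4.0
print(m.sqrt(16))                     # 4.0
print(os.path.basename(sample_path))  # report.csv
print(osp.basename(sample_path))      # report.csv
print(sqrt(16))                       # 4.0
```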

1.6 Using Modules in Multiple Projects

To use a Module in multiple projects:
Create the module in its own project.
Place a copy of the module in a location where Python can access it.
Import the module into the code where it is used.

1.7 How Does Python Find Modules?

Upon reading the import statement below, Python looks in various directories for my_util.py and raises an error (ModuleNotFoundError in Python 3.6 and later) if the module is not found.
import my_util

The directories Python looks in are defined in its sys.path variable. The sys.path variable can be accessed from the Python console like this:
>>> import sys
>>> print(sys.path)
['',
'/usr/lib/python36.zip',
'/usr/lib/python3.6',
'/usr/lib/python3.6/lib-dynload',
'/usr/local/lib/python3.6/dist-packages',
'/usr/lib/python3/dist-packages']

Placing your module in any of the listed directories will allow Python to find it when it processes the import statement.
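You can also ask the import system where it would load a module from without importing it. The sketch below uses importlib.util.find_spec() from the standard library; the missing module name is a deliberately made-up one, assumed not to exist on your system.

```python
import importlib.util

# Look up a standard library package; origin is the file it would be loaded from
spec = importlib.util.find_spec("json")
print(spec.origin)

# For a top-level name that cannot be found on sys.path, find_spec() returns None
missing = importlib.util.find_spec("my_util_that_does_not_exist")
print(missing)

# Importing such a module raises ModuleNotFoundError (Python 3.6 and later)
missing_name = None
try:
    import my_util_that_does_not_exist
except ModuleNotFoundError as err:
    missing_name = err.name
print(missing_name)
```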

1.8 Adding Directories to sys.path

If the directory where you are keeping your module is not one of the directories in sys.path, you have two options:
Copy your module file to a directory listed in sys.path.
Add the directory to the sys.path.

You can add directories to sys.path by:

Setting the PYTHONPATH environment variable before running your app (or before running the shell if you are working in there):
export PYTHONPATH="$PWD/the_module_dir"

Adding the following code in your script (before the module import statement):