Machine Learning with Spark: Determining Credibility of a Customer – Part 2

The DataFrame is a feature that has been exposed as an API since Spark 1.3.0.


A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. The DataFrame API is available in Scala, Java, Python, and R. It allows you to process any type of structured and semi-structured data, and it enables developers to impose a structure or schema onto a distributed collection of data.

Salient Features:

Optimized querying

Tabular representation

Support for a variety of data formats like Avro, JSON, Parquet etc…

Supports multiple data sources like HDFS, HIVE, RDBMS etc…

Support for SQL queries

Shuffling/Sorting without deserializing

DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs. DataFrames support reading data from the most popular formats, including JSON files, Parquet files, and Hive tables. They can read from local file systems, distributed file systems (HDFS), cloud storage (S3), and external relational database systems via JDBC. In addition, through Spark SQL's external data sources API, DataFrames can be extended to support any third-party data format or source. Existing third-party extensions already include Avro, CSV, ElasticSearch, and Cassandra.

Why DataFrame when we already have RDD?

DataFrames address the main limitation of RDDs: optimization. Unlike RDDs, DataFrames keep track of the schema and support various relational operations, which leads to more optimized execution. On top of that, a DataFrame retains all the advantages that an RDD provides.

Similar to RDDs, DataFrames are evaluated lazily. That is to say, computation only happens when an action (e.g. display result, save output) is required. All DataFrame operations are also automatically parallelized and distributed on clusters.

RDDs are more like low-level APIs, where you have to optimize your execution plan as per your need, whereas DataFrames are abstract APIs with much of the optimizations being done internally.

Also, the DataFrame API offers schema inference, a newer feature where Spark can deduce the schema by sampling a few rows of the data. This can save a lot of the effort that goes into defining a schema by hand.

When to use RDDs?

Consider these scenarios or common use cases for using RDDs:

When you want low-level transformation and actions, and control on your dataset.

When your data is unstructured, such as media streams or streams of text.

When you want to manipulate your data with functional programming constructs rather than domain-specific expressions.

When you don’t care about imposing a schema, such as columnar format, while processing or accessing data attributes by name or column.

When you can forgo some optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.

You can seamlessly switch between DataFrames, Datasets, and RDDs at will with simple API method calls, since DataFrames and Datasets are built on top of RDDs.

A Few Common Operations on DataFrames:

cache() : Works like the cache operation on an RDD.

collect() : Works like the collect operation on an RDD, returning all rows to the driver.

columns : Returns the list of column names (headers).

freqItems(cols) : Finds frequent items in the specified columns, returning a single-row DataFrame whose cells are arrays of the frequent values.

count() : Counts the number of records in the DataFrame.

corr(col1, col2, method=None) : Calculates the correlation of two columns of a DataFrame as a double value. Currently only the Pearson correlation coefficient is supported.

cov(col1, col2) : Calculates the sample covariance for the given columns, specified by their names, as a double value.

describe(*cols) : Computes descriptive statistics for the numerical columns of a DataFrame.

explain(extended=False) : Prints the (logical and physical) plans to the console for debugging purposes.


Abhay Kumar, lead Data Scientist – Computer Vision at a startup, is an experienced data scientist specializing in deep learning for computer vision. He has worked with a variety of programming languages such as Python, Java, Pig, Hive, R, Shell, and JavaScript, and with frameworks like TensorFlow, MXNet, Hadoop, Spark, MapReduce, NumPy, scikit-learn, and pandas.