16.
Adding Schema to RDDs
Spark + RDDs
Functional transformations on partitioned collections of opaque objects.
SQL + SchemaRDDs
Declarative transformations on partitioned collections of tuples.
[Figure: six opaque User objects (RDD) beside six rows with Name, Age, and Height columns (SchemaRDD)]
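The contrast on this slide can be sketched in plain Python. This is only an illustration, not Spark code: the standard-library `sqlite3` module stands in for a SQL engine, and the `User` record, the sample rows, and the age filter are assumptions made up for the example.

```python
import sqlite3
from collections import namedtuple

# Hypothetical record type mirroring the Name/Age/Height columns on the slide.
User = namedtuple("User", ["name", "age", "height"])

users = [
    User("Alice", 15, 160.0),
    User("Bob", 34, 180.5),
    User("Carol", 19, 171.2),
]

# Functional style: the runtime sees opaque objects and opaque lambdas,
# so it cannot tell that only the name and age fields are used.
functional_names = list(
    map(lambda u: u.name, filter(lambda u: 13 <= u.age <= 19, users))
)

# Declarative style: the data has a schema, and the query states what is
# wanted, leaving the engine free to decide how to evaluate it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER, height REAL)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", users)
rows = conn.execute(
    "SELECT name FROM users WHERE age >= 13 AND age <= 19 ORDER BY name"
).fetchall()
declarative_names = [row[0] for row in rows]
conn.close()
```

Both versions select the same names. The difference is what the engine can see: in the second version the column names and the filter predicate are visible to it, which is what makes query optimization possible.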

23.
Querying Using SQL
# SQL can be run over SchemaRDDs that have been registered
# as a table.
teenagers = sqlCtx.sql("""
  SELECT name FROM people WHERE age >= 13 AND age <= 19""")
# The results of SQL queries are RDDs and support all the normal
# RDD operations.
teenNames = teenagers.map(lambda p: "Name: " + p.name)

24.
Existing Tools, New Data Sources
Spark SQL includes a server that exposes its data using JDBC/ODBC
• Query data from HDFS/S3
• Including formats like Hive/Parquet/JSON*
• Support for caching data in-memory
* Coming in Spark 1.2

32.
Using Parquet
# SchemaRDDs can be saved as Parquet files, maintaining the
# schema information.
peopleTable.saveAsParquetFile("people.parquet")
# Read in the Parquet file created above. Parquet files are
# self-describing so the schema is preserved. The result of
# loading a parquet file is also a SchemaRDD.
parquetFile = sqlCtx.parquetFile("people.parquet")
# Parquet files can be registered as tables used in SQL.
parquetFile.registerAsTable("parquetFile")
teenagers = sqlCtx.sql("""
  SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19""")

33.
{JSON} Support
• Use jsonFile or jsonRDD to convert a collection of JSON objects into a SchemaRDD
• Infers and unions the schema of each record
• Maintains nested structures and arrays
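To make "infers and unions the schema of each record" concrete, here is a minimal pure-Python sketch of the idea. It is not Spark's implementation: the helper names `infer_type` and `merge`, the type labels, and the widening rules (integer plus double becomes double, any other conflict becomes string) are assumptions chosen for illustration.

```python
import json


def merge(left, right):
    """Union two schema descriptions, widening when the types conflict."""
    if left is None or left == "null":
        return right
    if right is None or right == "null":
        return left
    if left == right:
        return left
    if isinstance(left, dict) and isinstance(right, dict):
        # Union of fields; fields present in both records are merged recursively.
        merged = dict(left)
        for key, value in right.items():
            merged[key] = merge(merged.get(key), value)
        return merged
    if isinstance(left, list) and isinstance(right, list):
        return [merge(left[0], right[0])]
    numeric = ("integer", "double")
    if left in numeric and right in numeric:
        return "double"
    return "string"  # assumed fallback for incompatible types


def infer_type(value):
    """Describe one decoded JSON value, keeping nested objects and arrays."""
    if isinstance(value, dict):
        return {key: infer_type(item) for key, item in value.items()}
    if isinstance(value, list):
        element = None
        for item in value:
            element = merge(element, infer_type(item))
        return [element]
    if value is None:
        return "null"
    if isinstance(value, bool):  # checked before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    return "string"


# Hypothetical input: one JSON object per line, with differing fields.
lines = [
    '{"name": "Alice", "age": 15}',
    '{"name": "Bob", "age": 34.5, "address": {"city": "Oslo"}}',
    '{"name": "Carol", "tags": ["a", "b"]}',
]

schema = None
for line in lines:
    schema = merge(schema, infer_type(json.loads(line)))
```

After the loop, `schema` contains every field seen in any record: `name` as a string, `age` widened to double, the nested `address` object, and `tags` as an array of strings.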