SQL has been around for a while, and people like it. However, the engines that power SQL have changed over time to solve new problems and keep up with consumer demands.

Traditional engines such as Microsoft SQL Server once had scalability problems, which they have solved over time with cloud-based solutions. Other engines, on the other hand, have been built from the ground up to work in distributed environments, so they can put performance at the top of their priority list.

There is no single tool for all use cases. In fact, we believe that tools are built with specific use cases in mind, to solve a specific problem. They then evolve to a more mature stage where they can be used to solve many other problems.

In a traditional SQL environment, data is represented by tables and the relationships between them, but this representation is sometimes not enough, so new tools have been developed to fill the gap. Everywhere we look, we can find organizations that don’t use relational databases; instead, they prefer NoSQL alternatives.

Hadoop

In the Hadoop world, we have a variety of query engines; each has its own particularities, and each solves a wide variety of problems.

In any Hadoop distribution, we can find Apache Hive, a SQL-like tool that offers data warehouse infrastructure and capabilities for big data queries and analysis.

Depending on the Hadoop distribution, we can also find Apache Impala and Apache Drill. All of them offer more or less the same capabilities and share a common goal: letting us use SQL or SQL-like languages to query data stored in Hadoop. Each also has its own limitations and advantages that you should be aware of. Here is a link with more details about these technologies.

Apache Spark

Apache Spark is a lightning-fast cluster computing framework that can be deployed in a Hadoop cluster or in standalone mode. It can also be used as a SQL engine, like the others we mentioned. Spark, however, offers some advantages over them.

Spark exposes APIs for several languages, such as Scala, Java, Python, and R. This makes it accessible to many kinds of users, including developers, data scientists, and people with a statistics background.

Exploring Our Data Source

Our data set is a simple folder containing a few terabytes of CSV files, each about 40 MB in size. The size of the files does not affect performance, because they are stored in a MapR cluster; MapR takes care of the Hadoop small-file problem, as I explain in this post.

Because we are using MapR, copying files to the cluster is quite easy, since we have mounted a volume to our local file system.

At this point, we want to be able to serve our data without worrying about the schema of our files; we want a self-service BI environment, as I explained here. Using the headers from our data files, we can create the schema automatically, so we don’t have to worry about schema changes in the future. Once we have the schema, we create a DataFrame that we expose to be queried using SQL.
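To make the header-driven idea concrete, here is a minimal, standalone Python sketch (not the actual application code) of deriving a schema from a CSV header row plus one sample record. In the real application, Spark's CSV reader handles this for us via its `header` and `inferSchema` options, and real inference samples many rows rather than one; the data below is invented for illustration.

```python
import csv
import io

def infer_schema(csv_text):
    """Guess a (column_name, type_name) schema from a CSV header and first row.

    A simplified stand-in for what Spark does with header=True and
    inferSchema=True; real inference samples many rows, not just one.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)   # column names come straight from the file header
    sample = next(reader)   # first data row drives the type guesses

    def guess(value):
        # Try progressively wider types; fall back to string.
        for cast, name in ((int, "int"), (float, "double")):
            try:
                cast(value)
                return name
            except ValueError:
                pass
        return "string"

    return [(col, guess(val)) for col, val in zip(header, sample)]

data = "id,price,city\n1,9.99,Boston\n2,4.50,Austin\n"
print(infer_schema(data))  # → [('id', 'int'), ('price', 'double'), ('city', 'string')]
```

Because the schema is rebuilt from the header every time, adding or renaming a column in the source files requires no change to the application itself.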

Our application should start running on YARN as we indicated when submitting it.
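For reference, the submission itself typically looks something like the following; the main class and jar names here are illustrative placeholders, not the actual ones from this project.

```
# Submit the Spark SQL application to YARN.
# The class and jar names below are hypothetical placeholders.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --class com.example.SqlServerApp \
  our-sql-engine.jar
```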

Our SQL engine is ready to be queried, so let’s move forward and test it out.

SQL Clients

An easy way to test our SQL engine is to run beeline, a command line tool that works as an SQL client.

We can find beeline in the Spark bin folder. To start it, we type ./beeline.

Within beeline, we need to connect to the endpoint we defined in our application, so we run:

!connect jdbc:hive2://localhost:10001

We should be ready to run SQL statements, but let’s verify we can see the table we registered.

show tables;

Spark SQL will return a list of the registered tables, including the one we registered in our application (someTableName).

In the same way, we can connect using other clients such as MicroStrategy or Tableau. We have tried both, and each can build and execute queries on tables registered by Spark applications. We can also combine different sources (Spark SQL, MS SQL Server, Hive, Impala, etc.), which gives us the flexibility of combining relational sources with non-relational data.

Spark SQL performs quite well, often better than the other providers in Hadoop, but be aware that performance can degrade under certain conditions and use cases.

Why Apache Spark

Certainly, Spark SQL offers some of the same functionality as other tools within Hadoop. However, the ability to explore complex data sets is rather unique to Spark, since we can code custom serialization/deserialization processes in our application. Using Spark SQL, we can connect to any data source and present it as tables to be consumed by SQL clients. This is as easy as changing how we read the data in those sources by changing the serializer in our application.
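To illustrate what we mean by a custom deserializer, here is a small, hypothetical Python sketch: a function that turns raw records from some source into structured rows. In a Spark application, logic like this would typically run inside a transformation (or a custom data source) before the data is exposed as a table; the pipe-delimited format and field names here are invented for illustration.

```python
def deserialize(raw_record):
    """Turn one raw, pipe-delimited record into a structured row (a dict).

    In a Spark application, a function like this would run inside a map
    transformation so the results can be registered as a SQL table.
    The record format handled here is a made-up example.
    """
    user_id, name, score = raw_record.split("|")
    return {"user_id": int(user_id), "name": name, "score": float(score)}

# Pointing the SQL engine at a new source is then mostly a matter of
# swapping this deserializer for one that understands the new format.
raw = ["1|alice|9.5", "2|bob|7.25"]
rows = [deserialize(r) for r in raw]
print(rows)
```

Because the deserializer is ordinary application code, the rest of the pipeline (registering the table, serving SQL clients) stays the same regardless of what the source data looks like.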

Conclusion

There are very useful tools we can use within Hadoop to query data in an SQL fashion, and all of them have their advantages. The Spark SQL module of Apache Spark offers flexibility that the others lack, while keeping performance as one of its main priorities.

Spark is not the only tool you can use, but we strongly advise including it in big data solutions where SQL statements are to be executed. You might need a mix of different tools, but Spark should be an important part of the system you are building.