How Apache Spark makes your slow MySQL queries 10x faster (or more)

In this blog post, we’ll discuss how to improve the performance of slow MySQL queries using Apache Spark.

Introduction

In my previous blog post, I wrote about using Apache Spark with MySQL for data analysis and showed how to transform and analyze a large volume of data (text files) with Apache Spark. Vadim also performed a benchmark comparing the performance of MySQL and Spark with the Parquet columnar format (using Air traffic performance data). That works great, but what if we don't want to move our data from MySQL to another storage (i.e., a columnar format), and instead want to run "ad hoc" queries on top of an existing MySQL server? Apache Spark can help here as well.

TL;DR version:

Using Apache Spark on top of the existing MySQL server(s) (without the need to export or even stream data to Spark or Hadoop), we can increase query performance more than ten times. Using multiple MySQL servers (replication or Percona XtraDB Cluster) gives us an additional performance increase for some queries. You can also use Spark's cache function to cache the result of a whole MySQL query or table.

The idea is simple: Spark can read MySQL data via JDBC and can also execute SQL queries, so we can connect it directly to MySQL and run the queries. Why is this faster? For long-running (i.e., reporting or BI) queries, it can be much faster because Spark is a massively parallel system. MySQL can only use one CPU core per query, whereas Spark can use all cores on all cluster nodes. In my examples below, MySQL queries executed inside Spark run 5-10 times faster (on top of the same MySQL data).

In addition, Spark can add "cluster"-level parallelism. In the case of MySQL replication or Percona XtraDB Cluster, Spark can split the query into a set of smaller queries (for example, with a partitioned table it will run one query per partition) and run those in parallel across multiple slave servers or multiple Percona XtraDB Cluster nodes. Finally, it will use map/reduce-type processing to aggregate the results.

I’ve used the same “Airlines On-Time Performance” database as in previous posts. Vadim created some scripts to download data and upload it to MySQL. You can find the scripts here: https://github.com/Percona-Lab/ontime-airline-performance. I’ve also used Apache Spark 2.0, which was released July 26, 2016.

To connect to Spark we can use spark-shell (Scala), pyspark (Python) or spark-sql. Since spark-sql is similar to the MySQL CLI, using it would be the easiest option (even "show tables" works). I also wanted to work with Scala in interactive mode, so I've used spark-shell as well. In all the examples I'm using the same SQL query in MySQL and Spark, so working with Spark is not that different.

To work with a MySQL server in Spark we need Connector/J for MySQL. Download the package and copy mysql-connector-java-5.1.39-bin.jar to the Spark directory, then add the class path to conf/spark-defaults.conf:
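For example (a sketch; /usr/local/spark is my install location mentioned below, so adjust the path for your setup):

Shell

spark.driver.extraClassPath = /usr/local/spark/mysql-connector-java-5.1.39-bin.jar
spark.executor.extraClassPath = /usr/local/spark/mysql-connector-java-5.1.39-bin.jar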

Running MySQL queries via Apache Spark

For this test I used one physical server with 12 CPU cores (older Intel(R) Xeon(R) CPU L5639 @ 2.13GHz), 48GB of RAM and SSD disks. I installed MySQL and started the Spark master and a Spark slave on the same box.

Now we are ready to run MySQL queries inside Spark. First, start the shell (from the Spark directory, /usr/local/spark in my case):

Shell

$ ./bin/spark-shell --driver-memory 4G --master spark://server1:7077

Then we need to connect to MySQL from Spark and register the temporary view:
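A minimal sketch in Scala (the URL and credentials are placeholders; the partitioning bounds match the values reported in the Spark warnings shown further below):

Scala

// Register the MySQL table ontime.ontime_part as a Spark datasource.
// partitionColumn/lowerBound/upperBound/numPartitions tell Spark how to
// split the read into parallel per-partition queries.
val jdbcDF = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:mysql://localhost:3306/ontime?user=root&password=xxx",
      "dbtable" -> "ontime.ontime_part",
      "fetchSize" -> "10000",
      "partitionColumn" -> "yearD",
      "lowerBound" -> "1988",
      "upperBound" -> "2015",
      "numPartitions" -> "48"
  )).load()
jdbcDF.createOrReplaceTempView("ontime")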

So we have created a "datasource" for Spark (or in other words, a "link" from Spark to MySQL). The Spark table name is "ontime" (linked to the MySQL ontime.ontime_part table), and we can run SQL queries in Spark, which in turn parses them and translates them into MySQL queries.

“partitionColumn” is very important here. It tells Spark to run multiple queries in parallel, one query per each partition.

Now we can run the query:

Scala

val sqlDF = sql("select min(year), max(year) as max_year, Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') and (origin = 'RDU' or dest = 'RDU') GROUP by carrier HAVING cnt > 100000 and max_year > '1990' ORDER by rate DESC, cnt desc LIMIT 10")
sqlDF.show()

MySQL Query Example

Let’s go back to MySQL for a second and look at the query example. I’ve chosen the following query (from my older blog post):

MySQL

select min(year), max(year) as max_year, Carrier, count(*) as cnt,
    sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed,
    round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate
FROM ontime
WHERE
    DayOfWeek not in (6,7)
    and OriginState not in ('AK', 'HI', 'PR', 'VI')
    and DestState not in ('AK', 'HI', 'PR', 'VI')
GROUP by carrier HAVING cnt > 100000 and max_year > '1990'
ORDER by rate DESC, cnt desc
LIMIT 10

The query finds the total number of delayed flights per airline. In addition, it calculates a smarter "ontime" rating that takes the number of flights into consideration (we do not want to compare smaller air carriers with the large ones, and we want to exclude the older airlines that are no longer in business).

The main reason I've chosen this query is that it is hard to optimize in MySQL: all conditions in the "where" clause will only filter out ~70% of the rows (I've done a basic calculation to verify this). Running the query in Spark produced the following warnings and output:

16/08/02 23:24:12 WARN JDBCRelation: The number of partitions is reduced because the specified number of partitions is less than the difference between upper bound and lower bound. Updated number of partitions: 27; Input number of partitions: 48; Lower bound: 1988; Upper bound: 2015.

16/08/04 01:44:27 WARN JDBCRelation: The number of partitions is reduced because the specified number of partitions is less than the difference between upper bound and lower bound. Updated number of partitions: 26; Input number of partitions: 48; Lower bound: 1988; Upper bound: 2014.

16/08/04 01:45:13 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.

min(year)  max_year  Carrier  cnt       flights_delayed  rate
2003       2013      EV       2962008   464264           0.16
2003       2013      B6       1237400   187863           0.15
2006       2011      XE       1615266   230977           0.14
2003       2005      DH       501056    69833            0.14
2001       2013      MQ       4518106   605698           0.13
2003       2013      FL       1692887   212069           0.13
2004       2010      OH       1307404   175258           0.13
2006       2013      YV       1121025   143597           0.13
2003       2006      RU       1007248   126733           0.13
1988       2013      UA       10717383  1327196          0.12

Time taken: 139.628 seconds, Fetched 10 row(s)

So the response time of the same query is almost 10x faster (on the same server, just one box). But how was this query translated into MySQL queries, and why is it so much faster? Here is what is happening inside MySQL:
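The exact processlist output is not reproduced here, but each of the parallel sessions runs a per-partition query of roughly the following shape (the column list and year bounds below are illustrative, not the literal text Spark generates):

MySQL

SELECT year, Carrier, ArrDelayMinutes
FROM ontime_part
WHERE DayOfWeek NOT IN (6,7)
  AND OriginState NOT IN ('AK','HI','PR','VI')
  AND DestState NOT IN ('AK','HI','PR','VI')
  AND (yearD >= 2001 AND yearD < 2002)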

In this case, as the box has 12 CPU cores / 24 threads, it efficiently executes 26 queries in parallel, and the partitioned table helps to avoid contention issues (I wish MySQL could scan partitions in parallel, but it can't at the time of writing).

Another interesting thing is that Spark can "push down" some of the conditions to MySQL, but only those inside the "where" clause. All group by/order by/aggregations are done inside Spark: it retrieves the data needed for those operations from MySQL, and does not push group by/order by/etc. down to MySQL.

That also means that queries without "where" conditions (for example, "select count(*) as cnt, carrier from ontime group by carrier order by cnt desc limit 10") will have to retrieve all the data from MySQL and load it into Spark (whereas MySQL alone would do the whole group by internally). Running it in Spark might be slower or faster (depending on the amount of data and use of indexes), but it also requires more resources and potentially more memory dedicated to Spark. The above query is translated into 26 queries, each doing a "select carrier from ontime_part where (yearD >= N AND yearD < N)".

Pushing down the whole query into MySQL

If we want to avoid sending all the data from MySQL to Spark, we have the option of creating a temporary table on top of a query (similar to MySQL's CREATE TEMPORARY TABLE ... AS SELECT). In Scala:

Scala

val tableQuery = "(select yeard, count(*) from ontime group by yeard) tmp"
val jdbcDFtmp = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:mysql://localhost:3306/ontime?user=root&password=",
      "dbtable" -> tableQuery,
      "fetchSize" -> "10000"
  )).load()
jdbcDFtmp.createOrReplaceTempView("ontime_tmp")

In Spark SQL:

MySQL

CREATE TEMPORARY VIEW ontime_tmp
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:mysql://localhost:3306/ontime?user=root&password=mysql",
  dbtable "(select yeard, count(*) from ontime_part group by yeard) tmp",
  fetchSize "1000"
);
select * from ontime_tmp;

Please note:

We do not want to use "partitionColumn" here; otherwise we will see 26 queries like this in MySQL: "SELECT yeard, count(*) FROM (select yeard, count(*) from ontime_part group by yeard) tmp where (yearD >= N AND yearD < N)" (obviously not optimal).

This is not a good use of Spark; it is more of a "hack." The only good reason to do it is to be able to use the result of the query as the source of an additional query.

Query cache in Spark

Another option is to cache the result of the query (or even the whole table) and then use .filter in Scala for faster processing. This requires sufficient memory dedicated to Spark. The good news is we can add additional nodes to Spark and get more memory for the Spark cluster.
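A minimal sketch in Scala (assuming the "ontime" view registered earlier; cache() is lazy, so the first action is what actually pulls and materializes the data):

Scala

val ontimeDF = spark.table("ontime")
ontimeDF.cache()   // mark the table for in-memory caching (lazy)
ontimeDF.count()   // first action reads from MySQL and populates the cache
// subsequent queries run against the cached copy, without touching MySQL:
ontimeDF.filter("Carrier = 'UA' and ArrDelayMinutes > 30").count()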

Using Spark with Percona XtraDB Cluster

As Spark can be used in cluster mode and scales with more and more nodes, reading data from a single MySQL server becomes a bottleneck. We can use MySQL replication slave servers or Percona XtraDB Cluster (PXC) nodes as a Spark datasource. To test it out, I provisioned a Percona XtraDB Cluster with three nodes on AWS (I used m4.2xlarge Ubuntu instances) and also started Apache Spark on each node:
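The start-up commands are not shown above; with the standard scripts shipped with Spark it would look something like this (assuming pxc1 is the Spark master and Spark is installed in the same directory on all nodes):

Shell

# on the first node (pxc1): start the Spark master
$ ./sbin/start-master.sh
# on every node: start a worker pointing at the master
$ ./sbin/start-slave.sh spark://pxc1:7077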

Then I can start spark-sql (the Connector/J JAR file also needs to be copied to all nodes):

Shell

$ ./bin/spark-sql --driver-memory 4G --master spark://pxc1:7077

When creating a table, I still use localhost to connect to MySQL (url "jdbc:mysql://localhost:3306/ontime?user=root&password=xxx"). As the Spark worker nodes run on the same instances as the Percona XtraDB Cluster nodes, they will use the local connections. Running a Spark SQL query then evenly distributes all 26 MySQL queries among the three MySQL nodes.

Alternatively, we can run the Spark cluster on a separate host and connect it to HAProxy, which in turn will load balance selects across multiple Percona XtraDB Cluster nodes.

Query Performance Benchmark

Finally, here is the query response time test on the three AWS Percona XtraDB Cluster nodes:

Now, this looks really good, but it can be better. With three m4.2xlarge nodes we have 8*3 = 24 cores in total (although they are shared between Spark and MySQL). We can expect a 10x improvement, especially without a covered index.

However, on m4.2xlarge the amount of RAM did not allow me to serve the MySQL data entirely from memory, so all reads went to EBS (non-provisioned IOPS), which only gave me ~120MB/sec. I redid the test on a set of three dedicated servers:

28 cores E5-2683 v3 @ 2.00GHz

240GB of RAM

Samsung 850 PRO

The test ran completely from RAM:

Query 1 (from the above)

Query / Index type              MySQL Time       Spark Time (3 nodes)  Times Improvement
No covered index (partitioned)  3 min 13.94 sec  14.255 sec            13.61
Covered index (partitioned)     2 min 2.11 sec   9.035 sec             13.52

Query 2: select dayofweek, count(*) from ontime_part group by dayofweek;

Query / Index type              MySQL Time       Spark Time (3 nodes)  Times Improvement
No covered index (partitioned)  2 min 0.36 sec   7.055 sec             17.06
Covered index (partitioned)     1 min 6.85 sec   4.514 sec             14.81

With this number of cores, and running fully from RAM, we actually do not have enough concurrency, as the table only has 26 partitions. So I also tried the non-partitioned table with an ID primary key, using 128 Spark partitions.

Note about partitioning

I've used a partitioned table (partitioned by year) in my tests to help reduce MySQL-level contention. At the same time, the "partitionColumn" option in Spark does not require the MySQL table to be partitioned. For example, if a table has a primary key, we can use a CREATE VIEW like this in Spark:
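A sketch of such a view, following the same Spark SQL syntax as above (the upperBound value here is a placeholder and should be set close to max(id) of the table):

MySQL

CREATE TEMPORARY VIEW ontime
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:mysql://localhost:3306/ontime?user=root&password=xxx",
  dbtable "ontime.ontime",
  partitionColumn "id",
  lowerBound "1",
  upperBound "150000000",
  numPartitions "128"
);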

Assuming we have enough MySQL servers (i.e., nodes or slaves), we can increase the number of partitions and thereby improve the parallelism (as opposed to only 26 partitions when partitioning by year). Actually, the above test gives us an even better response time: 6.44 seconds for query 1.

Where Spark doesn’t work well

For faster queries (those that use indexes or can use an index efficiently), it does not make sense to use Spark. Retrieving data from MySQL and loading it into Spark is not free, and this overhead can be significant for faster queries. For example, a query like select count(*) from ontime_part where YearD = 2013 and DayOfWeek = 7 and OriginState = 'NC' and DestState = 'NC'; will only scan 1300 rows and returns instantly (0.00 seconds reported by MySQL).

An even better example is this: select max(id) from ontime_part. In MySQL, the query will use the index and all calculations will be done inside MySQL. Spark, on the other hand, will have to retrieve all IDs (select id from ontime_part) from MySQL and calculate the maximum itself. That took 24.267 seconds.
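If such a query is needed from Spark anyway, the "whole query pushdown" trick from above applies; a hedged sketch in Scala:

Scala

// Workaround sketch: push the aggregation itself down to MySQL by
// wrapping it in a derived table, instead of letting Spark pull all IDs.
val maxIdDF = spark.read.format("jdbc").options(
  Map("url" -> "jdbc:mysql://localhost:3306/ontime?user=root&password=xxx",
      "dbtable" -> "(select max(id) as max_id from ontime_part) tmp"
  )).load()
maxIdDF.show()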

Conclusion

Using Apache Spark as an additional engine level on top of MySQL can help to speed up slow reporting queries and add much-needed scalability for long-running select queries. In addition, Spark can help with query caching for frequent queries.

PS: Visual explain plan with Spark


Alexander joined Percona in 2013. He has worked with MySQL since 2000 as a DBA and application developer. Before joining Percona, he did MySQL consulting as a principal consultant for over 7 years (starting with MySQL AB in 2006, then Sun Microsystems and then Oracle). He has helped many customers design large, scalable and highly available MySQL systems and optimize MySQL performance, and has also helped customers design Big Data stores with Apache Hadoop and related technologies.

11 Comments

Or you could use Shard-Query and just talk over the wire to your MySQL server using MySQL Proxy or the stored procedure execution method. No need for an extra JDBC layer if, for example, you have a Go or PHP app.

Shard-Query also adds support for scale-out (sharding), window functions, and more.

I am happy about this article and using this technology, but I would like to make a comment as to how it can be used in the right context and the right use case:

1) If you had a report that only needed to run once a year, twice a year, quarterly, monthly or even weekly, then waiting 19 minutes is not so bad vs. setting up a new server with a new technology and maintaining it.
2) If you need the report more frequently, then updating an intermediate or summary table once a day, a couple of times a day or hourly would be more cost-, time- and resource-effective.
3) If the data is too big for that AND you have sharded your data across a few servers – let's say, a server per destination/continent – then this technology is absolutely amazing and exactly what you would need.

Hardware: 2011 MacBook Pro laptop running Oracle 12c on a VirtualBox VM – 4 cores, 10GB RAM, single SSD. I plan on loading this dataset onto an S7 server to see what kind of DAX offloading/benefits you get with Spark SQL – the advertised gain is 9x.

As you can see, Oracle 12.2 in-memory parallel query is much faster! A small laptop destroying many larger server configurations with Spark SQL.

On SPARC, 12.2, I'm getting 180-220B rows per second per core by leveraging the DAX database accelerators. The good news is the DAX API natively supports Spark, so it fully offloads to the offloading GPUs. So you could use the SPARC chips to run MySQL/Spark SQL 10x faster than on Intel.

Otherwise, just leverage Big Data SQL and create in-memory external tables of the files on HDFS, or perform SmartScan on HDFS or NoSQL tables. Use regular SQL/JDBC and a single security model for all your data across different polyglot systems.

When Spark loads data from a table, does the load happen on a single node in the cluster, or is the query work spread across several nodes depending on the size of the results? I have read references regarding MySQL saying that the data is loaded on a single node and then parallelized. I am interested from an Oracle side, but I think this is a general question. I get executor failures when trying to load a full large table that is approximately the size of the RAM of a single node in the cluster. Any help appreciated!

Unfortunately, Spark and MySQL do not love each other. By default, Spark 1.6 cannot process large MySQL tables that do not fit in memory. You have to set fetchsize to Integer.MIN_VALUE. It will force the fetch size to 1; slowly but steadily it can deal with large MySQL tables.
Now, Spark 2 decided to check for negative parameters, so it does not allow Integer.MIN_VALUE, and this approach just does not work.

Hello Alexander. Can you tell me roughly how much RAM is needed for a given amount of data? For example, I have a machine with 18GB of RAM and need to work with 5GB of data (3 crore 50 lakh, i.e., 35 million, rows). Will that be possible?

I updated the results to reflect Oracle 12.2 with the latest airline data pulled (1987-2017) on a new Macbook Pro with 9 other Pluggable Databases running. I created a Pluggable database called Kraken.