What a GPU-powered database can do for you

The parallel processing power of the GPU is being brought to analytics by some innovative startups, promising new levels of performance

The SQL database dates back to the 1970s and has been an ANSI standard since the 1980s, but that doesn’t mean the technology sits still. It is still evolving, and one of those changes is the GPU-accelerated database.

Relational databases have grown in size to data sets that measure in the petabytes and beyond. Even with the advent of 64-bit computing and terabytes of memory for increased processing, that’s still a lot of data to chew through—and CPUs can only manage so much. That’s where GPUs have come in.

GPUs have morphed from their original mission of accelerating gaming to accelerating almost everything. Nvidia has pivoted masterfully to become synonymous with artificial intelligence, which requires vast amounts of data to be processed in parallel, along with other tasks that parallelize well. AMD is playing catch-up, but Nvidia has a long lead.

When it comes to cores, it’s not even close. Xeon CPUs have a maximum of 22 cores. AMD Epyc has 32 cores. The Nvidia Volta architecture has 5,120 cores. Now imagine more than 5,000 cores running in parallel on data, and it’s clear why GPUs have become so popular for massive compute projects.

So a new class of databases has emerged, written from the ground up to support and embrace GPUs and their massive parallel processing capabilities. These databases enable new levels of data processing and real-time big data analytics, handling data sets that CPU-powered databases simply cannot.

The GPU database defined

The concept of a GPU database is simple enough: It uses the parallelism of GPUs to perform massive data-processing acceleration. The GPU is ideally suited to accelerate processing SQL queries because SQL performs the same operation—usually a search—on every row in the set.
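That uniform, row-at-a-time work is the key idea. Here is a minimal CPU-side sketch in Python of a data-parallel WHERE-clause scan, with a thread pool standing in for the thousands of GPU cores; the table, predicate, and function names are illustrative, not any vendor’s API.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy "table": one integer column, held the way a columnar
# GPU database would store it.
rows = list(range(100_000))

def scan_chunk(chunk):
    # The same predicate (a WHERE clause) is applied to every row --
    # uniform, data-parallel work of the kind a GPU excels at.
    return [r for r in chunk if r % 7 == 0]

def parallel_where(data, workers=4):
    # Split the column into chunks and scan them concurrently.
    # On a GPU this fan-out happens across thousands of cores.
    size = -(-len(data) // workers)  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(scan_chunk, chunks)
    return [r for part in parts for r in part]

matches = parallel_where(rows)
```

Because each chunk is scanned independently and the results are concatenated in order, the output matches a sequential scan exactly; the speedup comes purely from how many scanners run at once.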

However, you don’t simply put a bunch of Nvidia Tesla cards in the server hosting an Oracle database. GPU databases have been designed and written from the ground up to perform parallel processing, starting with SQL JOIN operations.

JOINs establish a relationship between columns from multiple tables in a database and are critical to performing meaningful analytics. Traditional JOIN implementations in legacy RDBMS systems were designed years ago for single-core CPUs and don’t lend themselves well even to multicore CPUs, much less GPUs.
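To see why JOINs can be redesigned for parallelism, consider a classic hash join, sketched below in Python with made-up tables. The probe phase performs the same lookup for every row, which is the kind of uniform per-row work that maps onto GPU threads; this is a conceptual sketch, not any GPU database’s actual implementation.

```python
# Two toy tables, represented as lists of dicts.
customers = [
    {"cust_id": 1, "name": "Acme"},
    {"cust_id": 2, "name": "Globex"},
]
orders = [
    {"order_id": 10, "cust_id": 1, "total": 250.0},
    {"order_id": 11, "cust_id": 1, "total": 75.0},
    {"order_id": 12, "cust_id": 2, "total": 40.0},
]

def hash_join(left, right, key):
    # Build phase: index the smaller table by the join key.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    # Probe phase: every left row performs the same lookup.
    # On a GPU, each probe can run on its own thread.
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

joined = hash_join(orders, customers, "cust_id")
```

The build phase is sequential here, but the probe phase is embarrassingly parallel, which is why JOINs rewritten for GPUs look very different from their single-core ancestors.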

Beyond the core engine, GPU databases typically offer:

- Connectors to popular open source frameworks, such as Hadoop, Kafka, HBase, Spark, and Storm
- ODBC and JDBC drivers for integration with existing visualization and BI tools such as Tableau, Power BI, and Spotfire
- APIs and bindings for popular programming languages such as C++, Java, Node.js, and Python, in addition to standard SQL
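Those standard drivers are the point: client code against a GPU database looks like client code against any SQL database. The sketch below uses Python’s built-in sqlite3 module purely as a stand-in for a GPU database’s SQL interface (the table and data are invented), to show the unchanged connect/query/fetch pattern that ODBC and JDBC tools rely on.

```python
import sqlite3

# sqlite3 stands in here for a GPU database's SQL interface; the
# client pattern -- connect, issue SQL, fetch results -- is the same
# one ODBC/JDBC drivers expose, which is why existing BI tools
# plug in unchanged.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pings (device_id INTEGER, lat REAL, lon REAL)")
conn.executemany(
    "INSERT INTO pings VALUES (?, ?, ?)",
    [(1, 42.36, -71.06), (1, 42.37, -71.05), (2, 40.71, -74.01)],
)
per_device = conn.execute(
    "SELECT device_id, COUNT(*) FROM pings "
    "GROUP BY device_id ORDER BY device_id"
).fetchall()
```

Swap the connection string for a GPU database’s driver and the SQL, cursors, and result sets stay the same; that compatibility is what lets shops keep Tableau or Power BI in place.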

Where to use a GPU database

GPU databases don’t really compete with Oracle, SQL Server, or DB2. They are oriented toward data analytics, for companies trying to make decisions in real time from vast amounts of data but unable to do so, because there is too much data or because visual analysis tools are too slow.

The GPU database vendors don’t see themselves as a replacement for an OLTP database like Oracle or a data warehouse like Teradata. Instead of targeting traditional RDBMS workloads, GPU databases aim at the OLAP and big data worlds, where the data sets are massive and the need is real-time. Instead of batch processes run over hours or overnight, GPU databases can present data in real time or at hourly intervals.

The GPU database should solve many of the problems that NoSQL is trying to solve, while letting you keep your existing structured query tools. Moving to NoSQL means rewriting or replacing your SQL tools; GPU databases work with the SQL tools you already have.

“What we think we will see is people realizing they can do multidimension systems and take data from multiple scenarios and combine it,” says Steve Worthington, emerging technologies solution architect for Datatrend Technologies, an IT consultancy that uses the GPU database SQream. “Medical companies want to take [data] from multiple systems and do analytics across databases because before, they couldn’t do cross references and didn’t have any way to join the databases.”

He also cites financial institutions doing fraud and risk analysis that might be doing only credit card checks now but want to run checks across multiple accounts. With the power of the GPU, they can cross-reference all those sources of information at once.

For Rich Sutton, vice president of geospatial data at Skyhook, a location services provider, the OmniSci GPU database lets him visualize far larger geographic data sets than he could with a CPU-based database. “I can load a billion rows into OmniSci with little to no latency, instead of having to look at a data set of 10,000 lines in a traditional CPU space,” he says. “It’s multiple orders of magnitude beneficial to me: [more] consumption of data with massively reduced latency.”

Todd Mostak, CEO of OmniSci, says one customer told him the speed of OmniSci “lowers the cost of curiosity. They ask questions they would previously hold back on.” One financial services customer told him a query that took 18 hours to process on a traditional database now completes in under a second, while a telco told him that queries that took hours to run now respond in under a second.

Another place for GPU databases is in real-time big data, where Hadoop has fallen short. Ami Gal, CEO of GPU database provider SQream, says much of the promise of big data, namely finding all the opportunities that reside in tens of petabytes of raw data, was never achieved on Hadoop because it was too slow.

“Spark is pretty good for data movement and transformation but once you need to crunch huge amounts of data and move them you start to deal with hundreds of thousands of [compute] nodes and that is seen as too much to crunch in large data sets. But if you can do it with ten or 15 nodes, that is much more efficient,” he says.

Worthington says GPU-based servers can do in one cabinet what requires many cabinets’ worth of CPU-powered massively parallel processing (MPP) nodes. “We can replace racks of MPP nodes with a half dozen nodes, each with two to four GPUs in them. With that we can replace a $10 million investment with one under $1 million,” he says.

The GPU is also important to Skyhook, which does visualization of big geographic datasets. “If you got a million devices in the field and pinging location a couple times a minute, you are talking 2 billion data rows a day. That’s impossible to consume in a traditional database. It’s just not possible. So [a] GPU [database] brings you up to where you can consume that data,” Sutton says.

Before adopting OmniSci, Skyhook would have to “pyramidize” data, taking only segments of it for visualization. Now, Sutton says, it can look at the whole data picture. “I’ve never seen another realistic way to get data into shape for my kind of use.”

GPU databases: What’s available

The available GPU databases vary in how they work. For example, OmniSci does its own visualization of data, while SQream relies on connectors to visualization tools like Tableau, so each needs to be evaluated individually to determine the best fit for your needs.

The big names in RDBMS have yet to get on board, except for IBM, which supports some GPU processing in DB2 BLU, a version of DB2 tuned for analytics workloads. Oracle and Teradata have both said they are working with Nvidia, but nothing has come of it yet. Microsoft does not support GPU acceleration in SQL Server. SQream’s Gal says he has heard that all of the RDBMS vendors are working to add some kind of GPU support to their products, but he had no further details.