Raison d’Graph

Graph database technology gives us a way to clarify the connections between and among data points, facts, analytics, and other synthetic objects. As more enterprises seek to relate data across disparate data models, interest in graph technology should rise.

It’s easy to forget that the analytics we consume daily are the original products of human analysis, imagination, interpretation, and synthesis. Analytics are basically synthetic objects -- facts, cubes, algorithms, functions, etc. -- derived from one or more data points. Human analysts use a mix of technologies and techniques to put data points together with other data points or other synthetic objects to form something new: a slice of data that communicates something about the business world.

Graph database technology has the potential to radically enlarge the size and detail of this slice. It gives us a way to elucidate the connections between and among data points, facts, analytics, and other synthetic objects. In traditional business intelligence (BI) analytics, the context that links these objects is comparatively impoverished.

The slice of the business world captured by BI analytics is extraordinarily narrow: sales of this product in this region -- or these stores -- over this period of time. Sometimes this information is enriched with demographic data, sometimes with geographic information system (GIS) data. In some cases, it even has limited predictive power.

Something important is missing, however. BI analytics tells us little about the human behavioral backdrop to the facts it discloses. It tells us next to nothing about the conditions or events of the world in which these facts are situated. At its core, this limitation is a function of scale: the more data we can collect about a problem, the richer and more detailed the analytical context we can create.

For a long time, we didn’t have the means to collect and manage enough data of enough different types. Then, at some point, we did, and we even had ready-made techniques -- in statistics and numerical analysis, for example -- to make sense of it.

Technological partitioning -- the use of different fit-for-purpose systems to address a range of data storage and data processing requirements -- permits us to address the scalability problem.

Technological partitioning results in analytics silos, however. For example, we put relational and file data in Web-scale NoSQL systems, which are capable of cost-effectively ingesting data of all types, storing it in massive quantities, and processing it at massive scale.

We use streaming technologies to capture, analyze, and persist event data from connected devices, sensors, and other signalers. We put this event data in a time-series database or in a wide-column key-value store such as HBase. We put strictly structured data in an OLTP database or (what amounts to the same thing) a data warehouse. Each of these siloed data structures -- tables, hierarchies, lists, file formats, etc. -- has its own data model, even if it isn’t explicitly described as such.

The problem is that it isn’t easy to programmatically establish relationships across different data models -- even if (as with NoSQL platforms such as Hadoop, Cassandra, or MongoDB) they’re all coexisting in the same system cluster. The data in HBase can’t easily be connected to the data in Hive (or Impala, or Presto) which can’t easily be connected to the files -- text documents, JSON objects, binary files, multimedia files, even nominally structured files (such as the Parquet or ORC columnar formats) -- stored in the Hadoop Distributed File System (HDFS).

The graph database gives us a way to do this. It derives semantic relationships by establishing meaningful connections between and among “nodes,” “properties,” and “edges.” In graph database-speak, a node is an entity, person, or thing. A property is an attribute or characteristic of a node. An edge describes a relationship between two nodes; edges can carry properties of their own.
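The node/property/edge model can be sketched in a few lines of plain Python. This is a toy illustration of the concepts, not the API of any actual graph database; the node IDs, property names, and edge labels are all invented:

```python
# Minimal property-graph sketch: nodes carry property maps,
# edges are labeled links between two nodes.
nodes = {
    "p1": {"type": "Person", "name": "Ada"},
    "c1": {"type": "Company", "name": "Acme"},
}

edges = [
    ("p1", "WORKS_FOR", "c1"),  # (source, edge label, target)
]

def neighbors(node_id, label):
    """Return the property maps of nodes reachable via a labeled edge."""
    return [nodes[dst] for src, lbl, dst in edges
            if src == node_id and lbl == label]

print(neighbors("p1", "WORKS_FOR"))  # [{'type': 'Company', 'name': 'Acme'}]
```

Real graph engines add indexing, traversal languages, and storage behind this model, but the core abstraction is just this: labeled nodes connected by labeled edges.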

If you come from a background in text analytics, this might sound conceptually familiar.

A graph database isn’t necessarily an RDF, or triples, database. It’s similar enough, however, that the two terms are often used interchangeably. The key takeaway is that a graph database discovers relationships that span disparate data models.

For example, it gives us a way to link transactional data stored in an operational system or data warehouse with a prediction of impending equipment failure, which could be derived from a Spark Streaming analysis of telemetry data from multiple signalers. The graph database derives the relationships that link together the faulty part, that part’s inventory status, the locations of both the faulty and the replacement parts, and the logistics of getting that replacement (and, if necessary, a technician) to the location.
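Once those cross-silo relationships exist as edges, answering the logistics question becomes a graph traversal. The sketch below is hypothetical -- the node identifiers and edge labels are invented for illustration, not drawn from any product:

```python
# Hypothetical edges linking a failure prediction (from stream analysis)
# to the part, its replacement, and a warehouse location.
edges = [
    ("prediction:42", "PREDICTS_FAILURE_OF", "part:pump-7"),
    ("part:pump-7", "REPLACEABLE_BY", "part:pump-7b"),
    ("part:pump-7b", "STOCKED_AT", "warehouse:east"),
]

def follow(start, *labels):
    """Walk a chain of edge labels from a start node; None if the chain breaks."""
    current = start
    for label in labels:
        matches = [dst for src, lbl, dst in edges
                   if src == current and lbl == label]
        if not matches:
            return None
        current = matches[0]
    return current

# Where is the replacement for the part predicted to fail?
location = follow("prediction:42", "PREDICTS_FAILURE_OF",
                  "REPLACEABLE_BY", "STOCKED_AT")
print(location)  # warehouse:east
```

In a production graph database this chain would be a one-line traversal query; the point is that the answer spans data that originated in three different silos.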

“We sometimes refer to semantics as a 'glue,' which means that if you have two different databases, even if they're relational systems, you can bring them ... together,” says Matt Allen, senior product marketing manager with enterprise NoSQL specialist MarkLogic. “Say one relational source describes 'customer' in this way, while another describes 'customer' in a slightly different way. You can ... relate those two concepts of 'customer' by saying this 'subject' in this data set is the same as this 'subject' in this other data set, this is the same 'predicate,' and so on.”
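Allen’s “glue” works by asserting that two identifiers denote the same subject -- in RDF terms, a sameAs-style triple. A minimal sketch, with invented identifiers and a hand-rolled lookup standing in for a real triple store:

```python
# Two sources describe "customer" differently; a sameAs-style triple
# asserts that the two identifiers denote the same subject.
triples = [
    ("crm:cust-100", "crm:name", "Jane Doe"),
    ("erp:client-77", "erp:fullName", "Jane Doe"),
    ("crm:cust-100", "owl:sameAs", "erp:client-77"),
]

def facts_about(subject):
    """Collect (predicate, object) pairs, following sameAs links both ways."""
    same = {subject} | {o for s, p, o in triples
                        if s == subject and p == "owl:sameAs"}
    same |= {s for s, p, o in triples
             if o in same and p == "owl:sameAs"}
    return [(p, o) for s, p, o in triples
            if s in same and p != "owl:sameAs"]

# Querying the CRM identifier also surfaces the ERP source's facts.
print(facts_about("crm:cust-100"))
# [('crm:name', 'Jane Doe'), ('erp:fullName', 'Jane Doe')]
```

A real triples database does this resolution with inference rules rather than ad hoc set logic, but the effect is the same: one query spans both sources’ notions of “customer.”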

Graph databases aren’t exactly hot, although they should be. There are signs that interest in graph technology is heating up, however. Earlier this year, DataStax Inc., which provides commercial support for the Cassandra distributed database, acquired graph database specialist Aurelius.

“This event telegraphs an important move in our product strategy here at DataStax, which is our intention to add multi-model capabilities into Cassandra and DataStax Enterprise,” wrote Robin Schumacher, vice president of products with DataStax, in announcing the acquisition on his blog.

“It’s not uncommon to see NoSQL databases characterized by their underlying data model,” Schumacher wrote. “However, the reality is that ... our customers are building modern systems where the underlying applications require more than one NoSQL data model format.”

What does or doesn’t make a database a “multi-model” platform is mostly a topic for another (lengthy) article. MarkLogic, for example, positions itself as a multi-model database. Teradata Corp. doesn’t, but nonetheless has a graph story -- thanks to its Aster analytics platform, which has a built-in graphing engine. A graphing engine is not a graph database, however, and because Teradata Database and Aster are separate platforms, there’s no question of calling the combination a single multi-model database.

On the other hand, Teradata’s ambition with its Unified Data Architecture is to create a fabric that knits together the decision support (Teradata Database), discovery (Aster), and big data (NoSQL/Hadoop) use cases. Graph technology will likely play a role in providing overarching context to this fabric. Will this fabric itself constitute a logical multi-model database? We’ll leave that to the dogmatists.

Amazon’s DynamoDB is a multi-model database, as is Sqrrl. Neither can be considered a powerhouse platform for relational data management or SQL query. However, many analytics database platforms that are powerhouses incorporate graph and text-analytics capabilities. These include Hewlett-Packard Enterprise’s Vertica, IBM’s Netezza, and SAP’s HANA, to cite just a few. So is multi-model the way forward?

Again, it’s a complicated subject. What isn’t complicated is the value proposition of the graph database itself. “A very large part of today’s Web and mobile world comprises systems of engagement and systems of inquiry that deal with highly connected data,” Schumacher wrote, noting that many critical applications (he cited fraud detection and buyer behavioral analysis, among others) “must manage a seemingly infinite series of connections between data....

“Enter the graph database: an engine that can model these types of engagement and inquiry systems in a way where connecting data is easy and where performance doesn’t suffer from the antiquated join methodology that slows down an RDBMS,” he concluded.

About the Author

Stephen Swoyer is a technology writer with 20 years of experience. His writing has focused on business intelligence, data warehousing, and analytics for almost 15 years. Swoyer has an abiding interest in tech, but he’s particularly intrigued by the thorny people and process problems technology vendors never, ever want to talk about. You can contact him at evets@alwaysbedisrupting.com.
