For Storing Web 3.0, HBase has the Edge

A comparison of cutting-edge cloud and relational database technologies

A storage system modeled after Google’s BigTable has the edge in data management for next generation Internet and cloud computing users, claim researchers at the University of Texas – Pan American (UTPA) in Edinburg. In tests designed to find the best storage technologies for Web 3.0 — also known as the Semantic Web — Apache’s Hadoop database, HBase, outperformed MySQL Cluster, the UTPA team discovered in a classic confrontation between relational and non-relational databases.

With their own algorithms to adapt the two database systems, the team found that HBase works faster with larger datasets, a major issue since the Semantic Web comprises vast amounts of tags and descriptions known as metadata.

“HBase is easier to use than MySQL,” he explains. “Most programmers know how to write code rather than program databases, and not as many people code in SQL.”

Relational vs. non-relational

An open-source, non-relational database written in Java that can scale to thousands of servers, HBase makes many features of Google’s proprietary, high-performance distributed storage system BigTable available to the programming community. It also features a fail-safe library that runs “on top of” a server cluster — a global architecture that detects and handles failures at the local level before they spread.

With similar open-source licensing, scalability and fail-safe features, MySQL Cluster is a relational database whose primary feature is its “shared nothing” architecture: interconnected nodes that share no resources and therefore cannot pass failures to one another. A shared-nothing system keeps running even if one or more nodes fail.

The UTPA study tackled the natural next question: Which candidate would win the next-gen storage challenge?

Making sense of metadata

“In this time of unprecedented information growth, the fastest growing data category is metadata,” says study co-author and UTPA computer science professor Pearl Brazier, Ph.D. “It would be difficult to imagine the success of the Semantic Web without efficient and scalable data management tools to support its large-scale metadata-enabled applications.”

In Krauthammer’s work, metadata known as knowledge provenance, such as the researchers and institutions behind new cancer research, helps assess information credibility and research reproducibility. In a simple genomics example, the data might be a gene sequence. Metadata might include who discovered it, where and how — what laboratory techniques, which funding grants and the like.

Though not as widely known as Web 2.0 counterparts, such as Hypertext Markup Language (HTML) and Extensible Markup Language (XML), the new languages are used in such high-profile projects as the U.S. census, the Best Buy catalog, Facebook pages, and the latest cancer bioinformatics research.

To first base with HBase

RDF presents data and metadata in so-called “triples” — statements composed of a subject, a predicate and an object, where the predicate defines the relationship between subject and object. An example the UTPA team used is a student Craig (subject) who is a member of (predicate) the technology society IEEE (object).
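
The triple structure can be sketched in a few lines. Here triples are plain Python tuples — an illustrative representation using the article's Craig/IEEE example, not the UTPA team's actual storage layout:

```python
# RDF triples as (subject, predicate, object) tuples. The names below follow
# the article's example; the tuple representation itself is just a sketch.
triples = [
    ("Craig", "memberOf", "IEEE"),
    ("Craig", "type", "Student"),
]

# Every triple asserts one relationship between its subject and object.
for s, p, o in triples:
    print(f"{s} --{p}--> {o}")
```

A collection of such triples forms a graph: subjects and objects are nodes, predicates are labeled edges.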

For the next step — retrieving stored RDF data — the UTPA team designed a new algorithm with three functions that allowed their database system — Hadoop 0.20.2 and HBase 0.90 — to evaluate queries in SPARQL, the standard RDF data query language. One UTPA function, matchBGP-DB, translates a SPARQL graph into an HBase table, for instance.
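
As rough intuition for what such a matching function does, the sketch below evaluates a single triple pattern against a dictionary standing in for an HBase table, with the row key as subject and columns as predicates. This is a toy analog, not the team's matchBGP-DB, and the table layout is an assumption for illustration:

```python
# A simplified HBase-style layout: row key = subject, column = predicate,
# cell values = objects. The data is invented for illustration.
table = {
    "Craig": {"memberOf": ["IEEE"], "takesCourse": ["CS101"]},
    "Dana":  {"memberOf": ["ACM"]},
}

def match_pattern(pattern, table):
    """Yield variable bindings for one (s, p, o) pattern; '?x' marks a variable."""
    s, p, o = pattern
    for subj, columns in table.items():
        if not s.startswith("?") and subj != s:
            continue
        for pred, objects in columns.items():
            if not p.startswith("?") and pred != p:
                continue
            for obj in objects:
                if not o.startswith("?") and obj != o:
                    continue
                binding = {}
                if s.startswith("?"): binding[s] = subj
                if p.startswith("?"): binding[p] = pred
                if o.startswith("?"): binding[o] = obj
                yield binding

# Who is a member of IEEE?
print(list(match_pattern(("?who", "memberOf", "IEEE"), table)))
```

A full BGP evaluator would match several patterns and join their bindings on shared variables; this sketch shows only the single-pattern step.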

“By enabling reuse of existing database technologies, our query translation algorithms are efficient, their performance overhead is negligible, and they speed up the development of useful Semantic Web data management tools,” Brazier explains.

The UTPA team’s HBase algorithm ordered datasets from the benchmarks using two criteria. First, it evaluated datasets that yield a smaller result first, decreasing iterations and memory usage. Second, it returned datasets that share variables before datasets with no shared variables, narrowing results more quickly.

In one example, the algorithm reordered a triple-pattern query that sought the professors who taught course Y and the students X who took it. The pre-algorithm query first returned all students across all universities nationwide — an enormous dataset — then sifted through courses and finally, professors, slowly matching them up.

With the UTPA algorithm, the query returned a reordered dataset, first professors who taught course Y — a much smaller group — then students who took course Y and, finally, students matched with their professors.
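
The two ordering heuristics can be sketched as follows. The size estimates are hypothetical stand-ins for real dataset statistics, and the exact way the two criteria combine is an assumption for illustration, not the UTPA implementation:

```python
# Sketch of the two ordering criteria described above: evaluate the pattern
# with the smallest estimated result first, and prefer patterns that share a
# variable with already-chosen patterns when estimates tie.

def order_patterns(patterns, estimated_sizes):
    ordered, remaining, bound_vars = [], list(patterns), set()

    def variables(p):
        return {term for term in p if term.startswith("?")}

    while remaining:
        # Criterion 1: smaller estimated result first.
        # Criterion 2: prefer patterns sharing a variable with earlier ones.
        best = min(remaining,
                   key=lambda p: (estimated_sizes[p],
                                  0 if variables(p) & bound_vars else 1))
        remaining.remove(best)
        ordered.append(best)
        bound_vars |= variables(best)
    return ordered

patterns = [
    ("?studentX", "takesCourse", "CourseY"),   # students of course Y
    ("?studentX", "advisor", "?prof"),         # all advisor relationships
    ("?prof", "teaches", "CourseY"),           # professors teaching course Y
]
sizes = {patterns[0]: 500, patterns[1]: 10_000, patterns[2]: 5}

print(order_patterns(patterns, sizes))
# Professors first, then the course's students, then the matching step.
```

With these invented estimates, the sketch reproduces the reordering described above: the small professor set first, then the students of course Y, and the expensive advisor-matching step last.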

Though the translation algorithm represents extra work, the result was faster queries that consumed less memory, all built on existing technologies.

“The extra work is inevitable if we want to store Semantic Web data with existing programs. But, solutions are delivered much faster than if we developed a Semantic Web database from scratch,” Brazier explains.

Distributed technologies, such as HBase, that are often used in cloud computing are being explored for distributed and scalable RDF data management.

The spark in SPARQL

A table that stores a data triple labeled (s, p, o) — subject, predicate, object — is the starting point for the team’s SPARQL-to-SQL translation algorithm, which uses tools on MySQL Cluster 7.1.9a with names such as BGPtoFlatSQL to manipulate the same datasets from the HBase test.
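
A minimal sketch of such a translation, assuming a single triples(s, p, o) table: each triple pattern becomes one alias of the table, constants become equality predicates, and repeated variables become join conditions. This is illustrative only; the article does not show BGPtoFlatSQL's actual output:

```python
def bgp_to_sql(patterns):
    """Translate (s, p, o) patterns into one flat SQL query over triples(s, p, o).

    Each pattern gets its own alias of the triples table; constants become
    equality predicates and repeated variables become join conditions.
    """
    conds = []
    var_first_use = {}   # variable -> first column reference that bound it
    selects = []
    for i, pattern in enumerate(patterns):
        alias = f"t{i}"
        for col, term in zip(("s", "p", "o"), pattern):
            ref = f"{alias}.{col}"
            if term.startswith("?"):
                if term in var_first_use:
                    conds.append(f"{ref} = {var_first_use[term]}")
                else:
                    var_first_use[term] = ref
                    selects.append(f"{ref} AS {term[1:]}")
            else:
                conds.append(f"{ref} = '{term}'")
    froms = ", ".join(f"triples t{i}" for i in range(len(patterns)))
    return f"SELECT {', '.join(selects)} FROM {froms} WHERE {' AND '.join(conds)}"

sql = bgp_to_sql([
    ("?prof", "teaches", "CourseY"),
    ("?student", "takesCourse", "CourseY"),
    ("?student", "advisor", "?prof"),
])
print(sql)
```

Each shared variable becomes a self-join on the triples table, which is why naive triple-table SQL can require many joins and why pattern ordering matters for performance.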

Specifically, HBase performed slightly better than MySQL Cluster on queries from the PC3 benchmark. On one LUBM query, MySQL Cluster performed roughly two to three times faster than HBase. But on another, HBase was three to 47 times faster.

On certain queries, both systems showed behavior ranging from rapid performance degradation to little or no growth in execution time as data size increased.

Initially, MySQL Cluster demonstrated a significant advantage over HBase. However, this advantage shrank as dataset size grew.

Winner by a nose

Though the results did not reveal a blow-out, hands-down winner, since faster queries over larger datasets were the object of the game, “HBase showed superior performance and scalability,” Brazier explains.

The results weren’t surprising, but also weren’t a given.

“As an open-source implementation of Google’s Bigtable, HBase has proven capable of managing very large volumes of data,” Brazier says. But that’s on Web 2.0, so “it was questionable if HBase was up to efficiently handling complex Semantic Web queries. Our study showed that it was.”

The team presented their findings this July at the 2011 IEEE International Cloud Computing Conference in Washington, DC.

“Keeping in mind that HBase is far less mature than MySQL Cluster,” Chebotko says, “our research team has high hopes for it.”