New features included in Hadoop’s latest releases go some way towards freeing an increasingly capable data platform from the constraints of its early dependence on one specific technical approach: MapReduce.

Facebook announced in a blog post on Thursday that it has upgraded the Apache HBase database with a new system called HydraBase. Facebook is an avid HBase shop, using it to store data for various services, including the company’s internal monitoring system, search indexing, streaming data analysis and data scraping. What makes HydraBase better than HBase is that it is supposedly more reliable, minimizing downtime when servers fail.

With HBase, data is sharded across many regions, with multiple regions hosted on a set of “region servers.” If a region server goes down, all the regions it hosts have to migrate to another region server. According to Facebook, although HBase has automatic failover, that failover can take a long time to complete.

HydraBase counters this lag by hosting each region on multiple region servers, so if a single server goes down, the others can act as backups, significantly improving recovery time compared with HBase. The company claims HydraBase could lead to Facebook having “no more than five minutes of downtime in a year.”

From the blog post:

The set of region servers serving each region form a quorum. Each quorum has a leader that services read and write requests from the client. HydraBase uses the RAFT consensus protocol to ensure consistency across the quorum. With a quorum of 2F+1, HydraBase can tolerate up to F failures. Each hosting region server synchronously writes to the WAL corresponding to the modified region, but only a majority of the region servers need to complete their writes to ensure consistency.
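To make the quorum arithmetic concrete, here is a minimal sketch of the 2F+1 rule the post describes; the function names and structure are illustrative assumptions, not HydraBase internals. With 2F+1 region servers per region, a write is consistent once a majority (F+1) have synced it, so up to F servers can fail without losing acknowledged writes.

```python
# Illustrative sketch of the 2F+1 quorum rule described above;
# names and structure are hypothetical, not HydraBase code.

def quorum_size(f: int) -> int:
    """A quorum that tolerates f failures needs 2f + 1 members."""
    return 2 * f + 1

def majority(quorum_members: int) -> int:
    """A write is durable once a majority of the quorum has acknowledged it."""
    return quorum_members // 2 + 1

def write_is_consistent(acks: int, f: int) -> bool:
    """True once enough region servers have synced the WAL entry."""
    return acks >= majority(quorum_size(f))

# With F = 1, the quorum has 3 region servers and 2 acks are enough;
# the third server can lag or fail without blocking the write.
assert quorum_size(1) == 3
assert write_is_consistent(2, f=1)
assert not write_is_consistent(1, f=1)
```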

Facebook is testing HydraBase and the company plans on deploying the system in phases across production clusters.

A typical HydraBase deployment. Source: Facebook

In addition to HydraBase, Facebook on Thursday also detailed HDFS RAID, a way of using erasure codes (a method of data protection) to cut down on the multiple copies of data one might have Hadoop create as backups in case one copy is lost.

Last year when the company used HDFS RAID in its data warehouse clusters, the blog post explains, “the cluster’s overall replication factor was reduced enough to represent tens of petabytes of capacity saving.”
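The savings follow from simple storage arithmetic. The sketch below compares three-way replication with an erasure-coded layout; the 10-data/4-parity split is an assumption chosen for illustration, since the post does not give Facebook’s actual parameters.

```python
# Hypothetical comparison of raw storage overheads; the 10 + 4
# Reed-Solomon layout is an assumption for illustration only.

def replication_overhead(copies: int = 3) -> float:
    """Raw bytes stored per logical byte under plain replication."""
    return float(copies)

def erasure_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Raw bytes stored per logical byte under erasure coding."""
    return (data_blocks + parity_blocks) / data_blocks

logical_pb = 10  # petabytes of logical data, purely illustrative
replicated = logical_pb * replication_overhead(3)      # 30 PB raw
erasure_coded = logical_pb * erasure_overhead(10, 4)   # 14 PB raw
print(f"3x replication: {replicated:.0f} PB raw, "
      f"erasure coded: {erasure_coded:.0f} PB raw, "
      f"savings: {replicated - erasure_coded:.0f} PB")
```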

Correction: The story’s title was changed to indicate that Facebook is not open sourcing HydraBase as of yet.

Cleversafe, a Chicago-based provider of object-storage systems for housing massive amounts of data, has raised a $55 million series D round led by New Enterprise Associates. Apart from traditional storage workloads, Cleversafe has also made a name for itself as a replacement for HDFS in Hadoop environments. According to Crunchbase, the company has now raised $91.4 million since 2007.

The new economics of data warehousing provide attractive alternatives in both costs and benefits. While big data gets most of the attention, evolved data warehousing will play an important role for the foreseeable future. In order to be relevant, data-warehouse design and operation need to be simplified, taking advantage of greatly improved hardware, software, and methods.

In the tsunami of experimentation, investment, and deployment of systems that analyze big data, vendors have seemingly been trying approaches at two extremes: either embracing the Hadoop ecosystem or building increasingly sophisticated query capabilities into database management system (DBMS) engines. For some use cases, there appears to be room for a third approach that lies between the extremes and borrows from the best of each.

Today’s most successful companies are the ones with the ability to capture and analyze all the data available to them. Enter SQL-on-Hadoop solutions, which increase the accessibility of Hadoop and allow organizations to reuse their existing investment in SQL.

The Quantcast File System is like the Six Million Dollar Man of distributed data stores for Hadoop. Starting from the Kosmix Distributed File System (aka CloudStore), an implementation that had largely been written off and forgotten, Quantcast has built QFS to be bigger, faster and stronger than the Hadoop Distributed File System most commonly associated with the popular big data platform. Now, QFS is open source and ready for use in the webscale world.

According to Quantcast VP of Research and Development Jim Kelly, the web-audience measurement specialist began working with Hadoop in 2006 and experienced problems almost from the start. However, while the early problems with HDFS might have been symptoms of its immaturity, the problems soon began centering around the two things Hadoop is supposed to be best at — size and speed. So, in 2008, Quantcast began experimenting with, and actually sponsoring, the Kosmix project.

It turns out that wasn’t a moment too soon. By 2010, after Quantcast began integrating with ad networks, its data flow really began picking up into the tens of terabytes a day range. It turned on QFS as its production Hadoop file system in 2011 and now receives about 40TB a day and processes a whopping 20 petabytes. Kelly said Quantcast has pushed 4 exabytes — or 4 billion gigabytes — through QFS since turning it on.

Faster, yes. Bigger, not so much.

At Quantcast’s scale, the problem with HDFS wasn’t so much its scalability as the sheer size of the cluster required to handle petabyte-scale data stores. HDFS stores three copies of each piece of data to ensure it’s always available, although it tries to make up for the size issue with data locality (i.e., putting data directly on the computing nodes so it doesn’t have to traverse the network in order to be processed). Kelly thinks those techniques are relics of a bygone era.

“When HDFS [was created], disk drives and networks were tied for being the slowest things in the cluster,” he said.

Enter Reed-Solomon error correction, QFS’s chosen method for assuring reliable access to data, which Kelly says actually ends up shrinking the size of Hadoop clusters while improving their performance. (It’s the same method used on CDs and DVDs.) Rather than storing three full copies of each file like HDFS does, which requires three times the raw storage, QFS needs only 1.5x the raw capacity because it stripes data across nine different disk drives. Quantcast believes the smaller cluster size, combined with today’s 10-gigabit networks and the ability to read and write data in parallel, makes QFS significantly faster than HDFS at large scale.
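The 1.5x figure is consistent with a stripe of six data chunks plus three Reed-Solomon parity chunks spread over nine drives, since (6+3)/6 = 1.5; that split is inferred from the numbers in the article rather than taken from QFS documentation. Here is a rough sketch of such a layout and its failure tolerance:

```python
# Hypothetical sketch of a 6-data + 3-parity stripe across nine drives;
# the exact split is inferred from the 1.5x / nine-drive figures above.

DATA_CHUNKS = 6
PARITY_CHUNKS = 3
STRIPE_WIDTH = DATA_CHUNKS + PARITY_CHUNKS  # nine drives per stripe

def raw_overhead() -> float:
    """Raw capacity per logical byte: 1.5x, vs. 3.0x for triple replication."""
    return STRIPE_WIDTH / DATA_CHUNKS

def stripe_placement(block_id: int, total_drives: int) -> list[int]:
    """Assign the nine chunks of one block to distinct drives, round-robin."""
    start = (block_id * STRIPE_WIDTH) % total_drives
    return [(start + i) % total_drives for i in range(STRIPE_WIDTH)]

def recoverable(failed_drives: set[int], placement: list[int]) -> bool:
    """A block survives as long as no more than PARITY_CHUNKS chunks are lost."""
    lost = sum(1 for drive in placement if drive in failed_drives)
    return lost <= PARITY_CHUNKS

placement = stripe_placement(block_id=42, total_drives=100)
print(raw_overhead())                                  # 1.5
print(recoverable(set(placement[:3]), placement))      # True: 3 chunks lost
print(recoverable(set(placement[:4]), placement))      # False: 4 chunks lost
```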

QFS also comes equipped with other features that Quantcast had to implement to make it production-ready. Among them: it is written in C++ and has fixed-footprint memory management; it has access control based on users and groups; and it intelligently distinguishes node failures from planned maintenance and invokes data recovery accordingly.

It’s not for everyone, though

Despite its claimed improvements over HDFS, though, Kelly is quick to point out that QFS is probably not the best choice for everyone. It’s really designed for Hadoop users operating at petabyte scale, who have the technical prowess to handle a migration away from HDFS, and for whom data-processing costs hit the six-to-seven-figure range monthly once things such as energy bills are accounted for.

“If your cluster only has 10 disk drives,” Kelly said, “[QFS] will save you $500, which is nice but …”

Likewise, if high availability is very important, the latest version of HDFS might be preferable. “There’s a standby [in QFS]; it’s not quite as hot as theirs,” Kelly said. But availability isn’t super important to Quantcast, he said, and it hasn’t had any real problems with QFS going down anyhow. When it does, it actually recovers pretty fast.

There’s a trend afoot in the big data space to turn data science from black magic into child’s play, and one of the newest companies trying to pull off this technological alchemy is 0xdata. The bootstrapped startup, pronounced “hexadata,” is the brainchild of former DataStax engineer and Platfora co-founder SriSatish Ambati, and it’s trying to blend Hadoop, R and Google BigQuery into the ultimate tool for statistical analysis. Scientists, data analysts or whoever ultimately uses the product only need to be experts in their domains, not in statistics.

Although BigQuery is a SQL service hosted by Google, 0xdata follows a similar theory on simplicity.

However they choose to leverage the product, Ambati said, the scale of the underlying data and the complexity of running advanced analysis are details that need to be hidden. It’s the same theory that underlies Platfora, the company Ambati co-founded last year with his former DataStax colleague Ben Werther, although their approaches appear to be different. Whereas Platfora is trying to disrupt the data warehouse market by building a next-generation user experience atop Hadoop, 0xdata is trying to change the way users interact with popular statistical software such as R.

But either way, Ambati says of new data-analysis products, “[There are] no bragging rights for making it simple. If you don’t do that, you won’t be able to go forward.”

0xdata is also putting a focus on speed, both in terms of how fast it processes data and how fast it lets users react. Google search changed our thinking around how many questions people can ask in succession, Ambati explained, and data analysts should have the same experience. That’s why H2O provides approximate results at every step in the analysis process. Rather than waiting for the entire job to run and the exact results to be computed, users can get a general idea of the results and, if those results are completely outside the expected range, kill the job and start over more quickly.
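As a rough illustration of that idea (this is not H2O’s implementation, and all names here are hypothetical), a job can emit a running estimate after each chunk of data so the analyst can abandon it early when the partial answer is already far from what they expect:

```python
# Hypothetical sketch of progressive estimates, not 0xdata/H2O code:
# report a running statistic after every chunk so a user can kill the
# job early if the partial answer is clearly off-track.

from typing import Iterable, Iterator, List, Tuple

def progressive_mean(chunks: Iterable[List[float]]) -> Iterator[Tuple[int, float]]:
    """Yield (rows_seen, running_mean) after each chunk of input."""
    total, count = 0.0, 0
    for chunk in chunks:
        total += sum(chunk)
        count += len(chunk)
        yield count, total / count

chunks = [[10.0, 12.0], [11.0, 9.0], [10.5, 10.5]]
for rows, estimate in progressive_mean(chunks):
    print(f"after {rows} rows the estimated mean is {estimate:.2f}")
    if not 5.0 <= estimate <= 20.0:  # abort early if clearly outside expectations
        break
```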

But it will be a while before the public gets a chance to see whether H2O lives up to its promises. Ambati said the product is just four months into development and won’t have its first set of algorithms available for another few months. His team of eight engineers has “built a lot of cool stuff,” but now it needs to round out the process and turn its code for H2O into an actual product.

Still, having decided to tackle data as a system, Ambati and his team are having a lot of fun. “We are live-and-die-with-infrastructure people,” he said, but for a bunch of folks who spent a lot of time learning math, it’s like going back to their days as computer science students.