BigData: when transactions, analytics, and search collide

This is a longer piece on BigData that came together over the last six months. Had to wait a bit for this stripped down version to appear in DBTA (thanks for publishing DBTA!)…

What happens when the information your business depends on is too big, slow, and brittle to keep up? Do you have strategies in place to deal with massive information sets and new types of complex and variably structured information? These are the types of questions many of the customers I’ve worked with are facing. I am going to provide a view into the problems I see customers running into and the technology trends I expect to see as more businesses grapple with these problems. The answers to these questions are bigger than any one solution, which is why it is an incredibly exciting time to be involved in information management technology. Something strange is afoot at the Circle K! There is one term in particular that attempts to capture a lot of this excitement: BigData.

The term BigData has been barnstorming the IT world as solutions from the biggest those in between and the shiny new things are getting their arms around BigData. However, talk to two people involved in information management and you will likely get two different definitions of BigData and the associated problems. It has so far defied any consistent definition. And while analysts are happily taking the cue, they are creating more rigorous definitions with different names: Extreme Data (Gartner), Total Data (451 Group), Total Information all center around the same issues of managing information . Ah, well, this is enterprise software after all, where, as an industry, many thrive on incredibly precise but inconsistent definitions and “standards”.

More recently there has been some convergence around the 3 V’s definition of BigData: Volume, Velocity, and Variety. I like this definition. Gartner has been and appears to still be keen on it. IBM very recently got right in front of this particular parade. Kudos to them for jumping in front. Someone needed to.

What these terms and definitions have in common is that they are attempting to encompass the issues businesses are running into as they manage increasingly large sets of information. To a certain extent issues are relative, what is a large data set to one business is an hours worth of activity to another. While the scale may be different between organizations, as pressure is applied to any given information management system the same set of problems tend to appear, sometimes just one, sometimes a whole bunch. Rather than attempt to define BigData or create a new term, which I happily leave to the pros, I’ll talk about the issues and trends that I have seen when working with customers and tracking different approaches to these problems. These customers routinely address information management problems in the terabyte range, are actively solving petabyte issues, and use the term exabyte in all seriousness when doing longer term planning.

First, a fun example that has all the hallmarks of a pressure filled information management environment. Cantor Gaming is in the business of making bets. More importantly they offer a specific type of sports wager called in-running which allows people to place bets on real time game outcomes like whether a baseball batter will strikeout during a given at bat or a football team will turn the ball over on the current drive. This is not a customer I have worked with but Wired did a fantastic write-up of this operation and described their custom built system for managing their information. Similar to the Wall Street systems it was modeled after, the Cantor system is managing information in a pressure cooker: The system churns through historical sports data, feeds in live information, sets betting lines, and manages bet transactions, all in real-time, and all of which is fed back into the system.

As the article notes, this is similar to what Wall Street has been doing with their custom transactional systems for years. Wall Street routinely works with huge rapidly changing data sets. They also have immense incentives to leverage this information as quickly as possible. They have, of course, been cutting their teeth on the problems these data sets create for years. Seemingly simple tasks, such as loading historic data sets or recovering lost data, are problematic once the data sets become too large.

This is important because the rest of the world is catching up in terms of their information needs. But the rest of the world cannot always dedicate, or even find, a team capable of developing and managing this type of system. As more businesses need to generate, manage, and analyze BigData they are increasingly looking to commercial vendors and open source projects to provide a stable foundation for their solution. Twitter and Facebook may be high profile examples but they are not alone. Just look at what Zynga, FourSquare, and Bit.ly are doing these days using commercial and open source software. The information is overwhelming yet extremely valuable. A critical requirement for these businesses is to be able to scale. And it just keeps growing, 10 fold in 5 years according to some estimates such as a BAML BigData report.

Managing this data creates a host of challenges, some new and some old. These are the key pressure points I see businesses grappling with:

Capacity (Volume): The capacity runway has been used up. Incremental improvements and the onward march of CPU/RAM/Storage capacity will help but information growth is fast outpacing those improvements. The impact can be seen across the infrastructure stack with transactional data stores, commodity search engines, and data warehouses unable to simply keep up with the total volume of information.

Mixed Data (Variety): Customers are working with many different types of information from raw text and precise tables to complex denormalized structures and huge video files. Traditional systems can usually store this information but can’t leverage it to create value.

Throughput (Velocity): Simply ingesting or routing new information is a challenge in and of itself. Consider the updates of 500 million Facebook users. Traditional transactional systems are not designed to ingest or update at these rates.

Real-Time: It is no longer acceptable to wait hours for database updates to reach a search index because the entire index needs to be rewritten. Navigational interfaces like facets need to reflect the information that is available right now. New information needs to immediately show up in searches, be included in analytics, and shared with other systems in real-time.

Load and Restore: Information sets are getting so large that customers cannot load them into other systems in at timely manner or worse, restore them if the primary system is corrupted. As systems are pushed to the brink simply ingesting data, some customers have actually abandoned the notion of restoring the information.

People: Often overlooked when talking about technology solutions, talented and well-trained people are required to run traditional systems. This pool is already limited. Systems that can manage massive information sets are often built from scratch which can naturally lead to a very small number of people that actually know how the system works. This is a very real and scary proposition for teams running their business on these systems. The developer meeting the proverbial bus is generally not replaceable for months or years, the time it takes to hire and train an expert.

Time to Rethink Some Basics

These issues are pushing us to rethink core principles of information management systems. Much of this work has started but it is far from complete. Following are the trends that I am seeing today and believe will be critical to successfully managing increasingly large information sets.

Scale and Performance: Terabytes and petabytes of information are becoming common. As we start to actually use this information and plot the growth curve, exabytes are a clear part of the near future. New architectures are required to support the basic storage and retrieval of these information sets. Massively Parallel Processing (MPP) architectures seem to be the common approach. But even the most advanced MPP clusters will max out before they can reasonably address the storage needs we see coming over the next several years. I doubt we will see wholesale architecture shifts but we will see significant modifications to existing MPP systems such as vertical scaling strategies and dedicated task specific sub-clusters, that, along with incremental improvements in CPU, RAM, and disk drive performance, will help us reach these scaling and performance needs.

Storing and accessing information is only the beginning. Complex queries, sorting, information analytics, and information transformation must also occur in sub-second time in order to support the systems being developed today. MPP architectures can do some of this through parallelization but this can only take us so far. I expect to see specialized implementations of common algorithms, such as sorting or facet generation, that use near real time strategies such as caching to bring performance and data freshness into acceptable real-time ranges. Other architectural strategies will include dependence on in-memory processing and integration of specialized processing systems. For example, MapReduce has been integrated into multiple database systems in order to extend the processing capabilities on large information sets. I believe MapReduce has a future as a standard feature of MPP information management systems as well as a standalone dedicated system.

Flexibility and Specialization: Massive information sets are often comprised of many types of information, typically denormalized. The information can range from pure text to binary videos and everything in between. Sometimes referred to as unstructured or semistrucutred information, these names can be confusing because they typically imply all information that is not relational.

What is important is that the system can support changing data types. Whether it is a new field or an entirely new type, managing these changes in the broad BigData context at large scale breaks down. These systems must be able to flexibly handle information of varying degrees of complexity. The onus will not be on the developer or administrator to preview every piece of information and adjust the system accordingly, it will be on the tools to work with the information as is and allow the developer to incrementally master the information, to discover how the information can be used over time rather than expecting to know a priori. What if you could assign relationships, or even discover relationships, within your information throughout the lifecycle of the information without doing invasive information surgery?

With such large data sets in place there is pressure to do more with the information. Creating multiple systems and moving the data around is a non-sarter in many cases. This means we expect to do more with the information within a single system. While I don’t believe specialized information systems such as OLTP will be disappearing, I do believe we will see a class of systems that claim their specialty is scale. These systems will need to provide top tier support for transactions, queries, search, analytics, and delivery. Real-time in situ processing of information at 100’s of Terabytes and Petabytes will become the norm.

Different types of information also demand different ways to access and manipulate them. Many of the NoSQL systems are implementing SQL. This is not because they want to create the next great RDBMS, they simply recognize that for some tasks within their environment SQL is the best choice of language. Following from the idea that modern systems will support multiple modes of information processing, I expect to see an increasing number of databases that are not tied to one specific language for manipulating the information but support many languages. These systems will provide many lenses through which to view the contents.

An example of a project that is driving this type of convergence is a customer that needs to search document contents alongside semantic relationships. This required creating a data store to hold hundreds of millions of document and billions of semantic triples. It needed the flexibility to store multiple data types, some unknown, and query and search across the contents. The system also streamed new documents in, extracting semantic relationships while ingesting. While this type of project could be pulled off with traditional systems the complexity, time, and cost of implementing it with a relational database was not feasible.

People and Ease of Use: The number of systems that are generating huge amounts of data is growing faster than the number of people with the skills to manage them. While many of these are still custom systems, the output of this custom development (e.g. voldemort) is making it into the public light along with commercial systems focused on the same issues. While I expect the number of systems to consolidate, and therefore consolidate areas of expertise, I believe we will have a shortage of skilled administrators and developers over the next several years.

Part of the reason these systems have become popular is that it is quite easy to get started with them. This doesn’t mean it is easy to develop a massive scale production application with them though. In order to ride this wave of folks experimenting and trying out these databases, I believe vendors will continue to focus on making it as easy as possible to get started with their technologies while requiring heavy lifting tasks of those that are pushing their technology to the limits.

What does it all mean?

Just imagine: A system that can scale to petabytes, supports almost any type of information, allows ad hoc queries and analytics, is transactional and searchable in real-time, and is easy to work with… and is of course low latency. Some might claim this system is here today. I believe it may be close in some cases but most likely will need to bring together specialized systems in new ways.

A lot of work to be done, and it is exciting. What is the next architectural step that will get us to petabytes and exabytes? What type of specialization or commoditization will we see with this technology? Who can we call on to develop and manage these systems?

We’ll start to see high profile answers to these questions in 2011 and 2012. I believe we will see not only significant investments in BigData from the name brands to drive these answers, but we will also see some breakthroughs in architectures. Whether integrating existing specialized systems in novel ways or the invention of new platforms from the less established brands, the battle for BigData dominance, or perhaps even participation, is just getting started.