Q&A: Addressing Big-Data Challenges

How can you get "big data" under control without going under? We recently posed that question to Ramon Chen, vice president of product management at RainStor, a company devoted to online information preservation that offers a specialized data repository. Our simple, single question sparked the conversation below.

How do you define "big data" and what do you mean by "big data" retention?

Big data is a problem that practically every business will experience at some point. I had an impromptu conversation with O’Reilly Media’s Roger Magoulas at SDForum last year and his view was that the size of big data varies depending on the capabilities of the organization managing the set. For some it may be hundreds of gigabytes; for larger organizations it may be terabytes or even petabytes. In other words, you have a big-data problem when your data volumes exceed your current IT infrastructure capacity, resulting in a significant increase in cost and a negative impact on production systems.

It is worth distinguishing up front between unstructured and structured big data. Enterprises need to retain both unstructured (e-mail messages, files, documents, images, video) and structured (transactional data, log records) data. Although there are many excellent solutions available for retaining unstructured big data, much of the focus on the more critical business lifeblood, structured data, has been on deep analytics and predictive models for competitive and business advantage.

The other side of the big-data problem is how organizations manage and retain structured business data long term while continuing to provide ongoing access for business insight and compliance. Achieving this in a cost-efficient, scalable manner to meet future growth is the big-data retention challenge.

What impact is big data having on the data center or storage management?

In today’s economic climate, cost cutting is on the mind of every IT organization. Unfortunately, pressure to add new systems and storage capacity continues, driven by requirements to retain critical data for longer time periods with stringent, on-demand access due to business needs or compliance regulations. Broadly speaking, a number of industry sectors are becoming much more heavily regulated, and for global organizations, meeting these regulations (which differ from country to country) makes the challenge even more complex.

Telecommunications is an example of a sector that is experiencing significant growth rates. Billions of call data records, SMS/MMS data, and IP data records must be kept for between one and three years to meet the requirements of government anti-terrorist entities around the world. This becomes a very large storage and management undertaking, and the ingestion rate of tens of thousands of records per second typically exceeds the capabilities of traditional RDBMSes. The growth of big data is forcing organizations to revisit and rethink their infrastructure and capacity plans or else drown under the weight of cost and compliance.
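To put those figures in perspective, here is a rough sizing sketch. The ingest rate of 20,000 records per second and the 200-byte average record size are illustrative assumptions, not figures from the interview; only the one-to-three-year retention window comes from the text above.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365


def records_retained(records_per_second: int, retention_years: int) -> int:
    """Records held at steady state for a constant ingest rate and retention period."""
    return records_per_second * SECONDS_PER_DAY * DAYS_PER_YEAR * retention_years


def raw_storage_terabytes(record_count: int, bytes_per_record: int) -> float:
    """Uncompressed size in decimal terabytes (10**12 bytes)."""
    return record_count * bytes_per_record / 10**12


if __name__ == "__main__":
    rate = 20_000       # assumed: "tens of thousands of records per second"
    record_size = 200   # assumed average bytes per record (hypothetical)
    for years in (1, 3):
        total = records_retained(rate, years)
        size = raw_storage_terabytes(total, record_size)
        print(f"{years} year(s): {total:,} records, about {size:,.1f} TB uncompressed")
```

Under these assumptions the daily volume alone is 20,000 x 86,400 = 1.728 billion records, which is consistent with the "billions of records" scale described above.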

Why is data retention more challenging now than it was as recently as a year ago?

Setting aside the massive data volume growth rates, entire industries (such as financial services, banking, health care and communications) are becoming much more heavily regulated. Aside from protecting individuals from fraudulent or illegal activity, data retention regulations in industries such as health care also benefit the health of the individual by ensuring more accurate and timely accessibility of historical patient records. Data retention rules and schedules vary by type of data, by industry sector, and even by country. These regulations change over time following new legislation, forcing organizations to stay compliant in order to stay in business. Organizations are also wary that as their data reaches the end of its retention period, it needs to be purged to avoid any further liability.
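A retention schedule of this kind can be pictured as a lookup keyed by jurisdiction and data type, plus a check for records that have reached the end of their retention period. The sketch below is illustrative only: the jurisdictions, data types, and periods in RETENTION_YEARS are hypothetical placeholders, and real values would come from the applicable regulations.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical retention periods in years, keyed by (jurisdiction, data type).
RETENTION_YEARS = {
    ("country_a", "call_detail_record"): 1,
    ("country_b", "call_detail_record"): 3,
    ("country_a", "trade_record"): 7,
}


@dataclass(frozen=True)
class Record:
    record_id: str
    jurisdiction: str
    data_type: str
    created: date


def purge_date(record: Record) -> date:
    """First day on which the record is past its retention period."""
    years = RETENTION_YEARS[(record.jurisdiction, record.data_type)]
    target_year = record.created.year + years
    try:
        return record.created.replace(year=target_year)
    except ValueError:
        # Created on Feb 29 and the target year is not a leap year.
        return date(target_year, 3, 1)


def due_for_purge(records, today: date):
    """Records whose retention period has ended as of `today`."""
    return [r for r in records if purge_date(r) <= today]
```

Keeping the schedule as data rather than logic means that a change in legislation becomes a change to one table entry, not to the purge code.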

OLTP and OLAP technologies have continued to evolve, but their focus has not been to support long-term data retention. Neither should they, as they are tuned to be best in class for transactional and analytical capabilities. IT's data retention challenge requires a new level of functionality for handling massive size, ever-changing compliance parameters, and interoperability with those traditional systems, all at the lowest possible price.

What kinds of organizations or industry sectors are most challenged with retaining and managing growing amounts of data?

Organizations in sectors that are more heavily regulated tend to face tough challenges retaining volumes of data online for longer time periods. In the U.S. financial system, there are stringent SEC and SOX rules. Health-care legislation passed in 2009 introduced a big-data retention time bomb that requires an always-accessible on-demand electronic health record for the lifetime of every U.S. patient.

Worldwide vigilance against terrorist attacks has translated into lawful intercept mandates for access to, and search of, communications records for global counter-terrorism. Many countries have rules for legal electronic surveillance of circuit and packet communications. At any time, law enforcement can obtain a judicial warrant to tap the landlines, cell phones, and e-mail accounts of suspects, as well as receive copies of their call, SMS/MMS, and other communication records. To stay compliant, telco companies have to retain this information securely long term while providing full accessibility on demand only to authorized government entities.

There are many other industry sectors experiencing big-data retention challenges. Interestingly, compliance requirements may ultimately benefit organizations: in satisfying these tough regulations, they end up improving risk management, standardizing on IT architectures, driving operational efficiencies, and gaining better business insights. Big-data problems aside, a proven, well-thought-out approach to data management and long-term data retention is an absolute must to stay compliant and competitive.

What approaches do organizations take to overcome these challenges?

Many organizations in more heavily regulated industries have already experienced pain managing large data sets. Many have invested in storage compression technologies, which provide cost savings and benefits by physically compressing data at the byte or file-block level. Such technologies provide the greatest benefit for unstructured big-data types such as documents, e-mail messages, images, and videos. Structured data repositories are seen as simply large blocks in this context and compressed accordingly, regardless of their contents.

At a more granular level, organizations continue to retain critical, structured, transactional data in production system environments far longer than is legally required. These primary systems quickly become bloated and require ongoing capacity planning to accommodate anticipated growth. IT operations stays on top of this problem by adding more processing power to production systems to meet end-user performance and query-response times. No amount of generic byte-level storage compression can help with this dilemma.

Similarly, many organizations have seen the value of leveraging critical data across multiple systems for trending and analytics, which is reflected in the burgeoning data warehouse and business intelligence market. Operational data is extracted and fed to BI systems for ongoing analysis and reports. Some organizations view data stored long term in a data warehouse as sufficient for compliance.

Finally, a will-not-go-away method of storing historical data is the use of tape that is distributed offsite to a warehouse (the brick-and-mortar kind). In this situation, the up-front cost is far less, but in the longer term it carries compliance risk because the data is neither online nor easily accessible or searchable.

Which of these approaches is most successful/effective?

To be honest, almost all of today’s traditional structured data retention methods have drawbacks. Continuing to retain growing data volumes in a traditional production RDBMS might seem like good practice because the data is always available and accessible, but from an economic perspective, it doesn’t make sense. In reality, the percentage of historical data in most production systems is above 60 percent; in some sectors, this has reached 95 percent. This puts an unnecessary cost and performance burden on expensive production hardware and storage. Additionally, certain “immediately historical” data types such as communications CDRs, SMS/MMS data, financial services trading data, or even simple logs will never change. Retaining such data in a transactional RDBMS is simply a waste, as 100 percent of the time the data will only be queried and never modified.

Retaining data in data warehouse or analytics environments is common practice because traditionally these have been viewed as a non-production, less-costly alternative to OLTP environments. As analytics has increased in importance within the operations of a business, major cost and investment have put it on par with OLTP systems. Continuing to add data to both these environments is extremely expensive and not a sustainable option.

As I mentioned, offsite tape back-up is comparatively inexpensive but not easily accessible and non-compliant. All of these factors are driving organizations to look for new technologies and solutions dedicated to long-term retention.

What kinds of technologies are being deployed to streamline these processes and reduce the costs of managing data long term?

As I previously outlined, byte-level compression or de-duplication is well suited to addressing the big-data retention problem for unstructured data such as e-mail messages, files, images, and video but has relatively low impact on overall database sizes. Other techniques, such as database sharding (which involves a partitioning scheme for large databases across a number of servers) and hardware tiering (whereby portions of an RDBMS are moved to lower-cost hardware), merely exacerbate an already complex, administration-heavy environment.
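As a minimal illustration of what a sharding partitioning scheme involves, the sketch below assigns record keys to servers with a stable hash and a simple modulo. This is one common scheme chosen for illustration, not a description of any particular product; the function names and the "subscriber-N" keys are hypothetical.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a record key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def moved_keys(keys, old_count: int, new_count: int) -> int:
    """Count keys whose shard changes when the number of servers changes."""
    return sum(1 for k in keys if shard_for(k, old_count) != shard_for(k, new_count))


if __name__ == "__main__":
    keys = [f"subscriber-{n}" for n in range(10_000)]
    moved = moved_keys(keys, 4, 5)
    print(f"{moved} of {len(keys)} keys change shard when growing from 4 to 5 servers")
```

With plain modulo placement, a key stays put only when its hash gives the same remainder for both server counts, so growing from four to five servers relocates roughly 80 percent of the keys. That rebalancing work is one example of the administrative overhead mentioned above.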

Structured big-data retention requires a new class of data management solution at a total cost of ownership (TCO) significantly lower than traditional RDBMSes or analytics repositories. It’s natural that organizations are looking to new open source offerings (such as NoSQL and Hadoop) and solutions that are as alike as possible. Although such technologies promise low costs for initial deployment, compatibility with existing systems can be challenging due to the potentially high retraining and integration costs. As with any new technology, mainstream adoption and acceptance can only be accelerated if a majority of the friction points, not just cost, can be significantly reduced.

Have these technologies or solutions kept up with the problem or are they still deficient in some way?

Solutions specifically designed for massive structured data retention and online retrieval continue to evolve with the needs of the market. The right solution naturally depends on the main use case involved.

In the “big-data diet” use case, static or historic data is moved out of either the OLTP or OLAP production environment, freeing the production repository of a large burden while retaining full on-demand accessibility to a second-tier repository at a much lower TCO. This benefits the performance of the production environment and has the secondary effect of reducing the overall size of the downstream test, development, and back-up environments. Costs and operational overhead, including time for migrations, are greatly reduced. If at any time data in the archive needs to be modified, it can be programmatically reinstated to the production system. An example of this scenario is in health care, when a patient record that was moved to the retention repository needs to be re-activated after several years of inactivity.

For “big-data ingestion” of immediately historical data, these new solutions are replacing traditional RDBMSes in order to keep up with massive data volumes, particularly in the telecommunications sector where growth rates are exceeding billions of records a day.

In the end, the best solution is specifically architected for big-data retention rather than forced to fit into a specific use case. With the continuing growth in data volumes and changing compliance regulations, any solution must also be scalable, configurable, and adaptable in order to help organizations keep this big-data retention problem under control.

How are partners such as Informatica and others tapping RainStor to help address this market need for data-intensive industries and their end-user customers?

Informatica OEMs RainStor for “big-data diets” as part of their Information Lifecycle Management solution that enables organizations to archive or retire legacy application data for long-term retention.

Adaptive Mobile and Group 2000 use RainStor as a “big-data ingestion” primary repository to handle billions of records a day on behalf of their telco clients.