In Search of a Data Platform

Contributed by

Search for a Data Platform

Over the past three decades, companies have produced numerous technologies to enable application developers to build a variety of data-intensive applications. The nature of the applications often drove the choice of the underlying data platform. For example, building transactional applications gave rise to Oracle RDBMS as the underlying data platform. TIBCO became the de facto data platform for building real-time applications. Building an analytic application meant choosing Teradata as the underlying data platform. During the late 90s, IT also started to deploy many popular packaged applications, such as ERP, CRM, and HR applications, which often included their own embedded database.

Rise of Data Silos

A big, unintended consequence was the rise of data silos. Corporate data was fragmented among different data platforms with different security models, different performance attributes, and different failure resiliences. Moreover, the same data duplicated among different data platforms was out of sync, preventing businesses from making timely decisions.

Data integration vendors, such as Informatica in the 90s, seized the data silo opportunity and built ETL (Extract, Transform, and Load) technology to address the data problem. The ETL technology provided simple and easy-to-use tools for IT developers to keep data synchronized among different systems. In other words, ETL technology enabled companies to succeed with their application-specific data platform strategy. With the rapid rise of machine- and user- generated data, data volume started to double every 18 months, and companies started to look for a new data platform to build next-generation applications.

Scale-Out Data Platform

The release of Apache™ Hadoop® in 2007 really popularized the scale-out architecture to handle rapidly growing data volume and very large numbers of jobs that can run concurrently. The data is typically stored in files with different formats, such as Parquet, Avro, and CSV. Large numbers of computers are connected over high-speed ethernet to form a cluster. As data volume and users grow, more computers are added to the cluster to allow Hadoop software to carry out the additional workload, using additional resources (CPU, memory, and disk). A number of open source NoSQL database technologies, built with scale-out architecture, also became available to handle structured data in forms of tables. The rise of the NoSQL database allowed developers to build a new class of operational applications, such as profile management, catalog management, or fraud detection. Event queue or message queue software, such as Apache Kafka and Apache Pulsar, also allowed developers to build a new class of applications, such as website activity tracking, log aggregation, or real-time alerts.

These innovative scale-out data platforms have been very successful and ushered in building and deploying a new class of applications, processing data at petabyte scale, and lowering the cost of the hardware infrastructure. However, many customers deploying these innovative technologies faced a problem they had faced before - more data silos, different security models among different data platforms, different performance attributes, and different failure resiliences.

The key question that many companies are beginning to ask is: is there a data platform that can handle ever-increasing data growth and users without creating more data silos? If so, what are the key architecture characteristics? In the rest of this blog, we will cover four important architecture considerations.

Four Pillars of a Data Platform

In this section, we will describe important architecture characteristics of a scale-out data platform that provides unlimited scale and supports a variety of applications with a unified security model.

1. Variety of Protocols and API Support

As we discussed earlier, Apache Hadoop uses files for storing and processing data at scale. Specifically, Hadoop supports HDFS APIs that allow applications to write and read data to/from files. Most Hadoop components, such as MapReduce or HIVE, talk over HDFS API, so these applications perform well against a file-based data platform. With an AI/ML application, such as Tensorflow, that writes/reads data using POSIX APIs, Apache Hadoop is no longer a viable data platform. To make a Tensorflow application run, the data from a Hadoop data platform first must be copied into a POSIX API-compliant data platform, leading to a new data silo. Similarly, an object store providing S3 API is great in building a new class of applications using the S3 APIs only, but data must be copied into an intermediate store to run Hadoop applications that support HDFS APIs. Such a mismatch of APIs and application needs create the data silos.

Obviously, a data platform that supports a variety of protocols and APIs to enable developers and architects to build applications is better at reducing/eliminating data silos.

MapReduce or HIVE applications read/write data from/to files that are stored in Hadoop file systems. Such applications may partially aggregate raw data from files, such as a company’s web traffic data from different regions and different time periods. The partially aggregated data may be further processed and consumed by different business units to gain further insights (e.g., how many unique visitors visited the landing page during last month in the European region). To serve additional application use cases, the aggregate data might be loaded into a NoSQL database for additional processing and querying. With the exception of HBase, no other NoSQL database persists the tables inside Hadoop file systems. For example, Cassandra, MongoDB, or CouchDB each has its own underlying data storage that’s different from HDFS. Copying semi-aggregate data into a NoSQL database means creating another data silo with different security and failure resilience, etc.

The reverse situation is also problematic. Building an operational application requires using one of the popular document databases, such as MongoDB or CouchDB. However, months later, if an analytic application needs to scan all the tables’ data for analytics use cases, the data may be copied into a Hadoop file system before running HIVE or Spark jobs, thereby creating data silos with different security and failure resilience.

A data platform that support more persistence models, such as objects, files, tables, and event queues with a single underlying data storage, is better suited in eliminating data silos, providing a single security model with consistent performance and failure resilience.

3. Distributed Metadata

The success behind scale-out technology is its ability to shard large amounts of data in many smaller data fragments and distribute the fragments across 100s of computers, where it can be stored and processed. The metadata keeps track of the location of different pieces of data across the large number of computers. When an application attempts to read or write a piece of data, the application first consults the metadata service to determine the specific computer (data node) where the data is located, and then connects to the data node to perform the read or write operations.

In the case of Apache Hadoop, the NameNode is the centralized metadata service. A single instance of the NameNode maintains the mapping between files and its location (i.e., data node). When handling billions of files, the metadata catalog becomes too big to cache entirely into memory. Additionally, when a large number of concurrent applications start accessing the NameNode, NameNode gets overwhelmed with the load and is not able to respond fast to the applications, impacting the applications’ performance.

In the case of Apache HBase, the data is organized in one or more Regions. One RegionServer serves one region. The mapping of Region to RegionServers is kept in a system table called META. By reading META, applications determine which RegionServer to contact to access the data. The META table is stored on a single server, just like any other table. When a large number of applications access META, the load on the META may become too high and impact timely responses to the applications, affecting the applications’ performance.

As seen in the above examples, under heavy load the centralized metadata architecture becomes the hotspot in a distributed scale-out architecture. When a lot of concurrent applications start contacting the metadata service, the high load overwhelms the centralized metadata service, preventing timely responses to applications’ requests. The distributed metadata is a noble approach to overcome the metadata bottleneck problem that’s common in a centralized metadata architecture. Distributed metadata architecture organizes the metadata in a hierarchical manner. Specifically, applications will contact the top-level metadata service to retrieve high-level metadata information and will contact a different metadata service for additional metadata information. This approach solves two fundamental problems:

Distributes the metadata among many servers, making it possible to cache the entire metadata into memory

Eliminates the metadata access bottleneck by distributing the workload among different metadata service

In summary, the distributed metadata architecture allows a cluster to scale to a much larger size, supporting a much higher concurrent workload, when compared to a cluster with centralized metadata architecture.

4. Security

Every data platform provides certain functionality to ensure only legitimate users access the data that they have permission to access. User authentication ensures that only legitimate users can login and run applications. Encryption of the data in motion or data at rest ensures that data is protected from prying eyes. Access control on the data ensures only authorized users are accessing the data.

Implementing security in a single data system is not easy. With too many data silos, it becomes a much harder job and exposes companies to security breaches. For example, customer information might be stored in multiple systems such as Apache Hadoop, Apache Kafka, and Cassandra database with incorrect access control models, allowing unauthorized users to gain access to sensitive data. Sensitive data might have been encrypted in one data system, but left unencrypted in a different data system. It is absolutely critical to have a uniform security policy across all these data platforms. But that is much easier said than done. Different data platforms often differ on functionality, making it impossible to achieve a uniform security policy across data silos. Moreover, if the data access policy is not implemented at the data storage level, it provides a backdoor entry for applications to access data that they are not supposed to access. For example, in the case of Apache Hadoop, policy-based access control is enforced in Apache Sentry or Apache Ranger. Sentry and Ranger sit above the HDFS layer. As long as the applications (e.g., HIVE applications) are well behaved (i.e., going through Sentry or Ranger), policy-based access control will be enforced. But when a custom application is run, built using HDFS APIs that do not go through Sentry or Ranger, policy-based access control enforcement is off.

So a secured data platform must implement policy-based access control at the data storage layer, ensuring uniform access control, irrespective of the API through which that data is accessed.

Summary

A data platform, for the enterprise that must store and process all data without creating more data silos, must provide: