05 May 2015

How to access data with different Hadoop versions and distributions simultaneously

Many companies are rolling out Hadoop for data analysis and processing, or at least thinking about it. The Hadoop Distributed File System (HDFS) is the underlying filesystem, and typically the local storage within the compute nodes provides the capacity. HDFS was designed to satisfy the workload characteristics of analytics and was born at a time when 1 Gigabit Ethernet was the standard networking technology in the datacenter. The idea was to bring the data to the compute nodes in order to minimize network utilization and bandwidth and to reduce latency through data locality. (I'll post another article to discuss this in more detail and show that this requirement matters less these days, now that we have 10 Gigabit almost everywhere in the datacenter.) For the moment we'll look at some other side effects of this strategy, which reminds me somehow of a lot of data silos. Business Intelligence dudes know what I am talking about.

Figure 1: The “Bring Data to Compute” strategy results in a lot of data silos and complex, time-consuming workflows.
One thing you may already know is the fact that HDFS is not compatible with POSIX protocols like NFS or SMB. That means you need special tools to copy your data into the filesystem, and that takes a long time. For example, if you need to copy 100 TB over 10 Gigabit Ethernet, you'd need more than 24 hours to do so even if the network is not occupied with other traffic.
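As a quick sanity check of that number, here is a back-of-the-envelope calculation. The ~1 GB/s effective throughput is an assumption for illustration; a 10 Gigabit link tops out at 1.25 GB/s before protocol overhead.

```python
# Back-of-the-envelope transfer time estimate (illustrative only).
# A 10 Gigabit Ethernet link carries at most 1.25 GB/s in theory;
# after protocol overhead, ~1 GB/s effective throughput is a
# reasonable working assumption.

def transfer_hours(data_tb, effective_gb_per_s=1.0):
    """Hours needed to move data_tb terabytes at the given throughput."""
    data_gb = data_tb * 1000                  # TB -> GB (decimal units)
    seconds = data_gb / effective_gb_per_s
    return seconds / 3600

print(f"{transfer_hours(100):.1f} hours")     # ~27.8 hours for 100 TB
```

At 1 GB/s, 100 TB takes roughly 28 hours, which is where the "more than 24 hours" figure comes from.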

But it gets even worse: you may end up with multiple HDFS distributions or versions, just as you have different RDBMS systems today.

Have you ever thought about how many different RDBMS systems you have in the company? Typically companies run several relational database systems like Oracle, MS SQL, MySQL, DB2, Sybase etc. But wouldn't it be easier to have only one system? The answer, of course, is: yes it would, but that's not the reality we are facing. In practice we have different RDBMS systems for various reasons:

Application dependencies

Different people or organizations within the company have different preferences

Mergers and acquisitions

Price and licensing models

Functionality

Performance

Historical reasons

Large IT organizations or service providers simply have to support what their customers want. They cannot dictate the Hadoop version or set up a new cluster including storage for every customer.

New innovative distributions appear on the market. Consider how many Linux distributions we have: it's not only RedHat, SuSe, Debian, Ubuntu and you name it. Just recently Intel, IBM and Pivotal/EMC announced that they'll maintain their own distributions, optimized for virtual and cloud environments. The same may happen with HDFS.

…and others

I guess we'll see the same development with HDFS, for much the same reasons. Furthermore, Hadoop is currently developing very fast, and we'll see HDFS vendors start to build individual strengths in different areas. Now think about how you would use different versions or distributions when you have compute and data tightly integrated. It will most probably end up in more HDFS clusters, more copies of data and big data movements. You may also be stuck on a specific version or distribution because you need your production data to be available for analysis and you cannot just migrate and copy it every day.

Here is the solution: the Scale-out Data Lake Isilon

Fortunately there is a solution to this issue: EMC's Isilon Scale-Out NAS system runs a very mature distributed filesystem, OneFS. It's also built on top of commodity hardware and uses internal disks to provide the space for the data. However, it's much more advanced than HDFS in many regards, and it has been built over more than 15 years to serve massive amounts of data with very high throughput and low latency. To serve Hadoop requests, HDFS has been implemented as a protocol rather than a filesystem. As a result, you can access your data over various protocols such as SMB, NFS, FTP, HTTP, OpenStack Swift and HDFS simultaneously, while consistency, protection, access control and global file locking are provided by OneFS.

HDFS as a protocol

Instead of storing the data on a new filesystem type, the Isilon team has integrated HDFS as a protocol. A multi-threaded daemon called isi_hdfs_d runs on every Isilon node. It serves both Name Node and Data Node protocols, translating HDFS RPCs into POSIX system calls. As HDFS is stateless, the underlying filesystem handles coherency. With this approach, new protocol versions can be integrated quickly, and data migrations or modifications are not required since the data resides on the POSIX scale-out filesystem.

Figure 3: The multi-threaded HDFS daemon runs on every Isilon node.

Figure 4: Data Node and Name Node requests are served in a highly available manner.
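To picture what such a daemon does, here is a deliberately simplified sketch (not Isilon's actual implementation, and the operation names are reduced to a handful of examples) of how HDFS RPC operations can be translated into ordinary POSIX calls against a directory tree:

```python
import os

# Illustrative sketch only -- this is NOT isi_hdfs_d's code.
# It shows the idea of serving HDFS RPCs by translating them into
# POSIX system calls against a regular directory tree. On Isilon the
# tree would live under /ifs; we use a stand-in path here.

IFS_ROOT = "/tmp/ifs-demo"  # stand-in for the OneFS root /ifs

def handle_rpc(op, path, data=b""):
    """Translate a (simplified) HDFS RPC into POSIX calls."""
    local = os.path.join(IFS_ROOT, path.lstrip("/"))
    if op == "mkdirs":                  # Name Node style RPC
        os.makedirs(local, exist_ok=True)
        return True
    if op == "getFileInfo":             # Name Node style RPC
        st = os.stat(local)
        return {"length": st.st_size, "isdir": os.path.isdir(local)}
    if op == "writeBlock":              # Data Node style operation
        with open(local, "ab") as f:
            f.write(data)
        return len(data)
    if op == "readBlock":               # Data Node style operation
        with open(local, "rb") as f:
            return f.read()
    raise NotImplementedError(op)

handle_rpc("mkdirs", "/user/demo")
handle_rpc("writeBlock", "/user/demo/part-0", b"hello")
print(handle_rpc("readBlock", "/user/demo/part-0"))  # b'hello'
```

Because the protocol handler is just a translation layer, the same files remain visible to NFS or SMB clients, which is exactly what makes the multi-protocol access described above possible.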

Access to the data with different Hadoop versions or distributions

With this de-coupling of compute and storage, and Isilon as your Data Lake, you can now access the very same data with multiple Hadoop distributions and even different HDFS versions. (At the time of writing, Isilon supports almost everything from HDFS 1.0 to HDFS 2.6, and the development team has a strong focus on having new versions ready right after the major distributions ship new HDFS versions.) If you think about this for a moment you'll agree that this is huge! By pooling your data on Isilon, you get complete freedom over which version of HDFS you want or need to use. You can test new versions, roll back to previous ones, or use another distribution to access the same data simultaneously. Think for a moment about the analogy with the different RDBMS systems in your company that I mentioned above. There is a high probability that you'll see the same with Hadoop: different Hadoop distributions and versions within the company. That's no problem with Isilon. Solved.
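To make this concrete: each Hadoop compute cluster decides which filesystem it talks to via the standard fs.defaultFS property in core-site.xml. The sketch below illustrates the idea that several clusters running different distributions can all point at the same backing store. The zone name isilon.example.com and the cluster labels are made-up examples; 8020 is the standard HDFS NameNode port.

```python
# Illustrative sketch: two compute clusters running different Hadoop
# distributions point their default filesystem at the same Isilon
# SmartConnect zone. The zone name and cluster labels are invented
# for this example; in a real deployment the value below lives in
# each cluster's core-site.xml as the fs.defaultFS property.

ISILON_ZONE = "hdfs://isilon.example.com:8020"  # assumed SmartConnect name

cluster_configs = {
    "distro-a-hdfs-2.6": {"fs.defaultFS": ISILON_ZONE},
    "distro-b-hdfs-2.3": {"fs.defaultFS": ISILON_ZONE},
}

# Every cluster resolves to the same backing data:
assert len({c["fs.defaultFS"] for c in cluster_configs.values()}) == 1
print("all clusters share one Data Lake:", ISILON_ZONE)
```

Switching a compute cluster to a new Hadoop version then means redeploying the compute side only; the data never moves.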

Other Advantages

But there are further advantages:

No single point of failure: Name Node requests are served by all Isilon nodes in an active/active manner.

Isilon protects data with erasure coding across nodes. That's much more efficient than simply creating multiple copies of each block. The protection level can be set flexibly and dynamically for every directory or pool. If you follow the guidelines, you'll get a protection overhead between 20% and 30%. That's much more efficient than native HDFS, where three copies of the data consume 300% of its size in raw DAS capacity. See [3] for more details.

You can scale compute and storage independently. Your compute nodes don’t require storage anymore (you might want to use internal disks for the shuffle IO though). If you need compute power, you add servers, if you need capacity, you add Isilon nodes.

Most workloads run faster on Isilon [1,4].

For some data you can eliminate the ingest step entirely, since the data is already present on Isilon.

If you need to ingest data, you can do it via POSIX protocols such as NFS, SMB or FTP. IDC found that writes over NFS are 4.2 times faster on Isilon, and reads 36 times faster [1,4].

Isilon balances the data equally across all nodes in the cluster. If you need more capacity, you just add a node. New capacity is available immediately, and rebalancing takes place in the background.

Use parallel synchronization over LAN or WAN as a disaster recovery strategy.

Manage a single large scale-out filesystem (today up to 50 PB) easily via a WebUI, CLI (OneFS is based on FreeBSD, so Unix/Linux dudes feel at home) or API.

Use data tiering: you can mix different Isilon node types in one cluster and use policy-based, transparent data tiering for optimal performance and cost efficiency. For details see [5].

Data-at-rest encryption on Isilon is done at the drive level. There is almost no performance impact, in contrast to software encryption solutions.

Use Isilon de-duplication. It runs as a background post-process and as such doesn't impact production performance.

Use snapshots. Currently more than 20,000 snapshots are supported.

Use SEC 17a-4 compliant WORM retention.

Use certified file system auditing

Use Isilon Access Zones to provide Hadoop as a Service securely to multiple tenants.

Use existing backup mechanisms.
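The protection-overhead claim in the list above is easy to verify with a little arithmetic. The 10+2 stripe layout below is an assumed example; actual Isilon protection levels depend on cluster size and policy:

```python
# Protection overhead comparison (illustrative numbers).
# HDFS default: 3 full copies of every block -> 3x raw capacity.
# Erasure coding: N data stripes + M parity stripes, e.g. 10 data + 2 parity.

def raw_per_usable_replication(copies=3):
    """Raw TB consumed per usable TB with full-copy replication."""
    return float(copies)                        # 3.0x raw, 200% overhead

def raw_per_usable_erasure(data_stripes=10, parity_stripes=2):
    """Raw TB consumed per usable TB with N+M erasure coding."""
    return (data_stripes + parity_stripes) / data_stripes

rep = raw_per_usable_replication()              # 3.0
ec = raw_per_usable_erasure()                   # 1.2
print(f"3x replication:      {rep:.1f}x raw, overhead {100 * (rep - 1):.0f}%")
print(f"10+2 erasure coding: {ec:.1f}x raw, overhead {100 * (ec - 1):.0f}%")
```

A 10+2 layout lands at 20% overhead, right at the low end of the 20-30% range quoted above, while triple replication burns three times the usable capacity in raw disk.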

Summary

OneFS is a very mature scale-out filesystem that serves data via multiple protocols, including HDFS, to hundreds or thousands of clients. The biggest advantage is that you separate compute and storage and can scale both independently. Most importantly, you can provide access to the data via multiple HDFS versions and distributions at the same time. No matter which version or distribution your users or customers prefer, they can all be served by Isilon, as long as it is a distribution based on the Apache code base, such as Hortonworks, Pivotal or Cloudera. Instead of building HDFS data silos, Isilon is a great foundation for your Data Lake, with enterprise-grade functionality that integrates well into your datacenter's infrastructure with respect to security, serviceability and performance.
