

Commodity hardware is cheap, right? Well, yes, but when it comes to petabytes of data, it becomes expensive.

Let’s work out how many servers you need to store 1 petabyte of data. It’s simple: you need 3 PB of raw storage, because HDFS, the native filesystem of the Big Data framework, creates three copies of your data and spreads them across the cluster randomly.

How many server nodes do you need? The biggest 3.5” NL-SAS HDD you can find nowadays is 12TB (actually 10.91 TiB), the biggest 2.5” SAS HDD is 2.4TB (2.18 TiB), and the biggest 2.5” SSD out there is 32TB (more like 30TiB), but not all servers support it, and the nearest supported drive is 1.6TB (1.46 TiB). So the 2.5” SSD has the most compact data footprint and the best performance, but it is also the most expensive option.

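To get 1PiB of storage with HDFS, we will need 3PiB of raw capacity, which is (sorted from highest to lowest number of drives):

2104 x 1.6TB SSD (2.5”)

1409 x 2.4TB SAS (2.5”)

563 x 6TB NL-SAS (3.5”)

338 x 10TB SSD (2.5”)

281 x 12TB NL-SAS (3.5”)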
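The drive arithmetic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `drives_needed` is hypothetical, drive sizes are taken as decimal terabytes, and the result is always rounded up. Because of that, its counts can differ by a few drives from the figures used in this article, which appear to be based on rounded TiB capacities.

```python
import math

PIB = 2**50   # bytes in one pebibyte
TB = 10**12   # bytes in one decimal terabyte, as printed on a drive label

def drives_needed(usable_pib, drive_tb, replication=3):
    """Drives required to hold `usable_pib` PiB of data stored `replication` times."""
    raw_bytes = usable_pib * PIB * replication
    # Round up: a fraction of a drive still means buying a whole drive.
    return math.ceil(raw_bytes / (drive_tb * TB))

if __name__ == "__main__":
    drive_types = [("1.6TB SSD", 1.6), ("2.4TB SAS", 2.4), ("6TB NL-SAS", 6),
                   ("10TB SSD", 10), ("12TB NL-SAS", 12)]
    for label, size_tb in drive_types:
        print(f"{label}: {drives_needed(1, size_tb)} drives")
```

Changing `replication` to 1 shows how many drives the same data set would need without the three HDFS copies.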

How many 2.5” drives can you put into a rack server? About ten drives into a 1U rack server, or up to 24-26 drives into a 2U one. With NL-SAS, you can put a maximum of 12 drives into a 2U rack server. Having 10-26 SSD drives per server is a good way to fully utilize the performance potential of SSD drives.

In this case, you’ll need either 2104/26=81 (2U) servers with 1.6TB SSD or roughly 1409/26=54 (2U) servers with 2.4TB SAS for SFF drives, and you might end up with too many servers & more computing power than you actually need in your Big Data server farm. Moreover, when it comes to more than 20 nodes, usually, you need more than a couple of

Alternatively, you might need about 281/12=23 (2U) servers with 12TB NL-SAS or 563/12=47 servers with 6TB NL-SAS for LFF drives, and that number of servers might not have enough computing power for your needs or, on the contrary, might be too much for you.

And let me remind you that, at the time this article was written, those disk drives are the best-case scenario, since the biggest drives usually do not have the best $/TB price. Therefore, in real Big Data clusters you will normally find smaller drives, so the number of drives is higher than what we use in this article, and more servers are needed.

Alternatively, there are HPE disk enclosures that can be connected to a server and hold 96 LFF or 200 SFF drives.

However, if you put SSDs in a server with 56 slots for SFF disk drives, theoretically you’ll need 3 servers in the case of 32TB SSD (two are needed, but the minimum is three), and that number might not have enough CPU & RAM to run your tasks; besides, the majority of servers still do not support 32TB drives. With 38 (2104/56) servers in the case of 1.6TB SSD, you might get a good ratio to utilize the full potential of SSD drives, but it might be too much computing power for your Big Data farm. Again, with only about 6 servers (338/56) with 10TB SSD drives, you’ll not be able to utilize the full potential of the drives themselves. And with all NL-SAS drives, there might not be enough CPU & RAM for your Big Data cluster if you have only 3 (281/96) servers, and you will have an extremely slow storage subsystem.

If you put SSDs in a server with 200 slots for SFF disk drives, theoretically you’ll need 3 servers in the case of 32TB SSD (only 1 is needed, but the minimum is 3), and that number might not have enough CPU & RAM to run your tasks & fully utilize SSD performance; but again, the majority of servers still do not support 32TB drives. Roughly 11 (2104/200) servers in the case of 1.6TB SSD are also not enough to utilize the full potential of the SSD drives themselves. Needless to say, the situation with SSD performance utilization gets even worse with only 3 servers (338/200= ~2, but 3 is the minimum) with 10TB SSD drives, and 3 servers might not provide enough computing power. With all NL-SAS drives, there might not be enough CPU & RAM for your Big Data cluster if you have only 3 (281/96) servers, and you will have an extremely slow storage subsystem.
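The server counts above follow one simple rule: divide the number of drives by the slots per server, round up, and never go below three nodes. Here is a minimal Python sketch of that rule, using the drive counts quoted in this article as inputs. The function name `servers_needed` and the constant `MIN_HDFS_NODES` are illustrative assumptions. Since the sketch always rounds up, it sometimes returns one server more than the figures in the text, which are rounded to the nearest whole number.

```python
import math

MIN_HDFS_NODES = 3  # assumed minimum cluster size used in this article

def servers_needed(drives, slots_per_server, min_nodes=MIN_HDFS_NODES):
    """Servers required to host `drives` drives, never fewer than `min_nodes`."""
    return max(min_nodes, math.ceil(drives / slots_per_server))

if __name__ == "__main__":
    # (description, drive count from the article, drive slots per server)
    scenarios = [
        ("1.6TB SSD, 2U with 26 SFF slots", 2104, 26),
        ("1.6TB SSD, 56 SFF slots", 2104, 56),
        ("1.6TB SSD, 200 SFF slots", 2104, 200),
        ("10TB SSD, 200 SFF slots", 338, 200),
        ("12TB NL-SAS, 96 LFF slots", 281, 96),
    ]
    for label, drives, slots in scenarios:
        print(f"{label}: {servers_needed(drives, slots)} servers")
```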

Do you see how the storage medium and its capacity determine the shape of your Big Data server farm?

The idea of separating computing from storage comes naturally as a way to make Big Data more flexible.

Additional HDFS overheads

When you choose a strategy to reduce costs as much as you can, you might choose slow NL-SAS & high-density servers, and obviously you’ll try to choose a server that supports a lot of CPU & memory. In this case, when it comes to expanding the cluster for storage, CPU, or memory, you’ll have to buy another big server with a lot of CPU, memory, and storage to keep the nodes in your HDFS farm more or less equal, whether you actually need those resources or not. In other words, high-density servers make the expansion steps of your server farm coarser and force you to buy resources you might not need.

Also, 12TB or 6TB NL-SAS drives might seem like a good choice for TB/$, but they also consume way more electricity and are extremely slow compared to SSD, so NL-SAS is not suitable for some workloads like Machine Learning & Deep Learning.

HDFS has in its architecture a Checkpoint Node, which copies metadata out of the NameNode Master’s RAM hourly or even daily. This means that if the Master (and the Backup Node, if you have one) collapses for any reason, you will lose all the data written after the last metadata backup, even though your data is still there.

Probabilities

There is another very annoying thing that comes from the HDFS architecture. When you have 23 or 81 servers and HDFS creates three copies of your data, it places them on nodes in the cluster randomly. What does it mean that the cluster stores data randomly? Let’s calculate the probability that you will find a single piece of information on a given server. In the best case the probability is 3/23, and in the worst case it is 3/81.

Of course, your cluster will try to run your tasks on nodes that have (almost) all the required data as part of its data locality strategy, but what is the probability that all the data your task needs sits on a single server? The more pieces of data a given task needs, the less likely it is that all of them are on a given server, which makes the probability even lower than 3/23 (or 3/81). OK, you might say that the situation is not as bad as I am painting it, because you are running more than one task on more than one server, which increases the chance of having your data locally. However, the problem is that files bigger than the HDFS block size (64MB by default in older Hadoop versions, 128MB in newer ones) are broken up into pieces (blocks) and stored separately across all the nodes in the cluster, so there might not even be a single node that stores all the pieces of a file, which brings us back to the base probability of local data access. Moreover, if the server that has the data needed for your task is currently running another task and is fully loaded, while other servers are idle but do not have the required piece of information, you get inefficient utilization of cluster resources. In other words, the HDFS architecture increases the probability of network traffic between nodes, and the more nodes you have in a cluster and the bigger your files are, the higher it gets.

The more nodes you have and the bigger your files are, the higher the probability of requesting data from other nodes, which increases cross-switch network traffic.
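To put numbers on this argument, here is a small Python sketch of the simplified model used in this section. It assumes that replicas are placed uniformly at random and that blocks are placed independently of each other; real HDFS placement is rack-aware, so treat this as a rough illustration rather than a description of the actual placement policy. The function names are hypothetical.

```python
def p_block_on_node(nodes, replication=3):
    """Chance that a given node holds a replica of one block (uniform random placement)."""
    return min(1.0, replication / nodes)

def p_all_blocks_on_node(nodes, blocks, replication=3):
    """Chance that a given node holds replicas of all `blocks` blocks.

    Assumes every block is placed independently of the others.
    """
    return p_block_on_node(nodes, replication) ** blocks

if __name__ == "__main__":
    for nodes in (23, 81):
        for blocks in (1, 2, 4):
            p = p_all_blocks_on_node(nodes, blocks)
            print(f"{nodes} nodes, {blocks} block(s): {p:.6f}")
```

Under these assumptions, the chance of full locality shrinks geometrically with the number of blocks a task needs, and it shrinks faster in the larger cluster.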

Three copies

Keeping three copies is not efficient because of:

Network Congestion

High levels of IO over server system bus

Poor disk space utilization

Data replication causes additional memory consumption on servers, and memory problems are a large part of support calls. Server degradation causes performance degradation through data rebalancing: the cluster performs rebalancing if one storage node is low on free space.

The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems. This is a feature that needs lots of tuning and experience.

NFS with Big Data

With NAS storage like NetApp FAS/AFF systems, by contrast, you will have only about 35% space overhead compared to 200% overhead in HDFS (replication factor three), and you will be able to scale storage space and computing power separately. This removes unneeded switches & server resources and allows customers to choose servers based on CPU & memory characteristics, taking storage out of consideration.
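The difference between the two overheads is easy to quantify. The sketch below assumes that "overhead" means extra raw capacity on top of the usable capacity, which is consistent with three copies being described as 200% overhead. The function name `raw_capacity` is illustrative.

```python
def raw_capacity(usable, overhead_pct):
    """Raw capacity needed for `usable` capacity at a given space overhead in percent."""
    return usable * (100 + overhead_pct) / 100

if __name__ == "__main__":
    usable_pib = 1
    hdfs_raw = raw_capacity(usable_pib, 200)  # replication factor three
    nas_raw = raw_capacity(usable_pib, 35)    # the NAS overhead quoted in this article
    print(f"HDFS: {hdfs_raw} PiB raw, NAS: {nas_raw} PiB raw")
    print(f"Raw capacity saved: {hdfs_raw - nas_raw:.2f} PiB")
```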

Moreover, yes, 30TiB drives are supported with AFF systems. With only 24x 32TB SSDs in a NetApp AFF system you can get ~1PiB of effective space, assuming 2:1 data reduction, which gives an extremely small physical footprint and power consumption in the data center.
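A quick sanity check of the ~1PiB figure can be sketched as follows. The sketch assumes that the 35% space overhead mentioned earlier applies on top of usable capacity and that the 2:1 data reduction applies to everything stored; both are simplifying assumptions, and the function name `effective_pib` is hypothetical.

```python
PIB = 2**50   # bytes in one pebibyte
TB = 10**12   # bytes in one decimal terabyte

def effective_pib(drives, drive_tb, overhead_pct=35, reduction_ratio=2.0):
    """Effective capacity in PiB after space overhead and data reduction."""
    raw_bytes = drives * drive_tb * TB
    usable_bytes = raw_bytes * 100 / (100 + overhead_pct)
    return usable_bytes * reduction_ratio / PIB

if __name__ == "__main__":
    print(f"24 x 32TB SSD: {effective_pib(24, 32):.2f} PiB effective")
```

With these assumptions the result comes out at roughly 1 PiB; a lower data reduction ratio would scale it down proportionally.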

A dedicated NAS storage system aggregates all the drives into a single pool and is capable of expanding flexible volumes on the fly without the need for cluster rebalancing.

Differentiators

NetApp ONTAP systems can replicate your data set for disaster recovery purposes and then replicate only new or changed blocks of information as deltas to a secondary site, which is essential for big data sets. NFS, unlike HDFS, allows modifying files if needed; on the other hand, if you need to make sure that your golden image of the data has not been modified, you can use thin clones (FlexClones) so that nothing happens to your original data.

A unique feature like FabricPool allows utilizing SSD drives as primary storage for frequently accessed data and transparently destaging cold data to cheap, cold S3-compatible storage (and back), which further reduces storage costs on the one hand while still using SSD drives for hot, frequently accessed data on the other. Data reduction capabilities like Deduplication can significantly reduce the data footprint without losing performance, even on the smallest systems.

When you have to choose two characteristics (Computing & Storage) in a single solution, it will always lead to inefficiency & compromise.

Summary

When it comes to really big data, HDFS simply kills the solution because of replication factor three and its underlying architecture. Storage must be separated from the Big Data cluster to make it more flexible and, surprisingly, even cheaper than commodity hardware, especially when it comes to petabytes of data.


As the title says, this will be a very quick article about a customer who has NetApp FAS systems.

In 2014 this customer bought their first two FAS3220 systems with NSE encryption, at that time running 7-Mode ONTAP.

Then in 2015 they bought one AFF8040 and one FAS8040 (HDD-based, made "Hybrid" by adding a few SSDs taken from the AFF8040 system), both also with NSE encryption and running ONTAP cDOT, formed into a single cluster.

Then they migrated all their VMware infrastructure to the new storage systems, upgraded the old systems to ONTAP cDOT, joined the old but upgraded systems to the cluster with the FAS8040 & AFF8040, and moved some of the slow workloads back onto the 3240 non-disruptively, this time with NetApp’s LUN move & Volume move, way faster than it would have been with VMware Storage vMotion.

And then in 2017 they bought an AFF A700 without encryption. All systems are happily working, monitored and managed as a single cluster, and data is non-disruptively migrated across all the nodes during its life cycle, while they get at least 2:1 data reduction on AFF systems (Cross-Volume Deduplication is not enabled yet) and 1.5:1 on hybrid & HDD-only systems.

Now in 2018, 4 years after they got their first FAS system, they are thinking of throwing the old FAS3220 controllers away, buying new low-end FAS2700 controllers (which will probably be as fast as or faster than the 3220) and connecting the old disk shelves to them simply by using a MiniSAS HD to QSFP cable adapter. Then, as always, they will connect all FAS & AFF systems into a single cluster again and be able to upgrade to 9.3 and further, and to utilize new ONTAP functionality like FabricPool or Inline Aggregate Deduplication.

And now, with the A700, a simple ONTAP upgrade will let them use FC-NVMe with the existing cluster when they are ready.

I have three rhetorical questions to take away:

Which storage system vendor would allow you to keep in a single cluster: nodes of different models (3240, 8040, A700 and 2700); Low-End, Mid-Range & High-End systems; different types of systems (All Flash, HDD & Hybrid); different generations (4 generations: from 3240 to 2700); some systems with encryption and some without?

Which storage system vendor would allow you to: upgrade the same hardware with a huge, major breakthrough software release (which the move from 7-Mode to cDOT was); reconnect your old disk shelves between Low-End, Mid-Range & High-End systems; reconnect old disk shelves to different generations & models?