EMR+OSS: Separated storage and computing for offline computing

Background

In traditional Hadoop usage, storage and computing are inseparable. Therefore, as your business grows, the cluster size often cannot meet your needs for business expansion. For example, if your data scale exceeds the cluster’s storage capacity, the new requirements arising from the data production cycle of your business may outpace computing capabilities. In this case, you must be ready at all times to deal with the challenges of insufficient cluster storage space or computing capabilities.

If you choose to deploy computing and storage in a hybrid manner, storage scaling can often lead to excess computing capabilities. This is a waste of resources. Likewise, an increase in computing capabilities causes a waste of storage resources.

Separating computing and storage for offline computing makes it easier to cope with insufficient computing or storage resources. In this solution, you can store all your data in OSS and then analyze it using stateless E-MapReduce. Therefore, E-MapReduce is only responsible for computation, and storage resource are not tied to computing resources in your business. This approach provides the highest flexibility.

Architecture

The architecture for offline computing with separated storage and computing is simple, as shown in the following figure. OSS acts as the default storage unit, and Hadoop or Spark acts as a computing engine that directly analyzes data stored in OSS.

Benefits

Factor

Integrated computing and storage

Separated computing and storage

Flexibility

Not flexible

After computing and storage are separated, cluster rules are simple and flexible. You hardly need to estimate your future business scale, besides using the resources as you need.

Cost

High

Ultra cloud disks are used in self-built ECS systems. After separating storage and computing, if the cluster configuration is one master node with an 8-core 32 GB CPU, six slave nodes with 8-core 32 GB CPUs, and 10 TB of data, the cost is roughly halved.

From the performance chart, we can compare the respective advantages of the EMR + OSS and self-built Hadoop with ECS systems:

The overall load is lower

Memory utilization is basically the same

CPU usage is lower, in which case, the usage level for iowait and sys is much lower. Because the datanode and disk operations of the self-built ECS system occupy resources, this adds to the CPU overhead.

In terms of network usage, because sortbenchmark performs two data read operations (the first for sampling and the second for actually reading the data), network usage starts out high, and then in the shuffle+ results output stage, drops to about half of the self-built Hadoop with ECS system. Therefore, from the network perspective, the overall usage is basically flat.

In short, with EMR + OSS, the cost is halved, but the drop in performance is negligible. Moreover, an increase in the concurrency of the EMR + OSS solution means better time advantage in comparison with the self-built Hadoop with ECS system.

Unsuitable scenarios

We recommend that you do not use EMR + OSS in the following scenarios: