BigDL: Distributed Deep Learning on Apache Spark*


BigDL is a distributed deep learning library for Apache Spark*. Using BigDL, you can write deep learning applications as Scala or Python* programs and take advantage of the power of scalable Spark clusters. This article introduces BigDL, shows you how to build the library on a variety of platforms, and provides examples of BigDL in action.

What Is Deep Learning?

Deep learning is a branch of machine learning that uses algorithms to model high-level abstractions in data. These methods are based on artificial neural network topologies and can scale with larger data sets.

Figure 1 illustrates the fundamental difference between traditional machine learning and deep learning approaches. Where traditional approaches tend to treat feature extraction as a single phase for training a machine learning model, deep learning introduces depth by creating a pipeline of feature extractors in the model. These multiple phases in a deep learning pipeline increase the overall prediction accuracy of the resulting model through hierarchical feature extraction.

Figure 1. Deep learning as a pipeline of feature extractors

What Is BigDL?

BigDL is a distributed deep learning library for Spark that can run directly on top of existing Spark or Apache Hadoop* clusters. You can write deep learning applications as Scala or Python programs.

Rich deep learning support. Modeled after Torch, BigDL provides comprehensive support for deep learning, including numeric computing (via Tensor) and high-level neural networks; in addition, you can load pretrained Caffe* or Torch models into the Spark framework and then use the BigDL library to run inference applications on your data.

Efficient scale out. BigDL can efficiently scale out to perform data analytics at “big data scale” by using Spark as well as efficient implementations of synchronous stochastic gradient descent (SGD) and all-reduce communications in Spark.

Extremely high performance. To achieve high performance, BigDL uses Intel® Math Kernel Library (Intel® MKL) and multithreaded programming in each Spark task. Designed and optimized for Intel® Xeon® processors, BigDL and Intel® MKL deliver extremely high performance.

What Is Apache Spark*?

Spark is a lightning-fast distributed data processing framework developed at the University of California, Berkeley's AMPLab. Spark can run in stand-alone mode, or it can run in cluster mode on YARN on top of Hadoop or on the Apache Mesos* cluster manager (Figure 2). Spark can process data from a variety of sources, including HDFS, Apache Cassandra*, and Apache Hive*. Its high performance comes from its ability to do in-memory processing via memory-persistent resilient distributed datasets (RDDs) or DataFrames instead of saving data to hard disks, as the traditional Hadoop MapReduce architecture does.
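For example, this minimal Scala sketch (the file path and filter strings are illustrative) caches an RDD in memory so that the second pass over the data avoids rereading from disk:

val logs = sc.textFile("hdfs:///data/logs")            // sc is an existing SparkContext
logs.cache()                                           // persist the RDD in memory
val errors = logs.filter(_.contains("ERROR")).count()  // first action reads from HDFS
val warns  = logs.filter(_.contains("WARN")).count()   // second action reads from memory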

Figure 2. BigDL in the Apache Spark* stack

Why Use BigDL?

You may want to use BigDL to write your deep learning programs if:

You want to analyze a large amount of data on the same big data Spark cluster on which the data reside (in, say, HDFS, Apache HBase*, or Hive);

You want to add deep learning functionality (either training or prediction) to your big data (Spark) programs or workflow; or

You want to use existing Hadoop/Spark clusters to run your deep learning applications, which you can then easily share with other workloads (e.g., extract-transform-load, data warehouse, feature engineering, classical machine learning, graph analytics). An undesirable alternative to using BigDL is to introduce yet another distributed framework alongside Spark just to implement deep learning algorithms.

Install the Toolchain

Note: This article provides instructions for a fresh installation, assuming that you have no Linux* tools installed. If you are not a novice Linux user, feel free to skip rudimentary steps like Git installation.

Install the Toolchain on Oracle* VM VirtualBox

To install the toolchain on Oracle* VM VirtualBox, perform the following steps:

Enable virtualization in your computer’s BIOS settings (Figure 3).

The BIOS settings are typically under the SECURITY menu; you will need to reboot your machine and enter the BIOS settings menu to change them.

Figure 3. Enable virtualization in your computer’s BIOS.

Determine whether you have a 64-bit or 32-bit machine.

Most modern computers are 64 bit, but just to be sure, look in the Control Panel's System and Security applet (Figure 4).

Figure 4. Determine whether your computer is 32 bits or 64 bits.

Note: You should have more than 4 GB of RAM. Your virtual machine (VM) will run with 2 GB, but most of the BigDL examples (except for LeNet) won’t work well (or at all). This article considers 8 GB of RAM the minimum.

Allocate 35–50 GB of hard disk space to your VM to be used for Ubuntu*, Spark, BigDL, and all deep learning models.

When allocating hard disk space, select “dynamic” allocation, which allows you to expand the partition later (even though it is nontrivial). If you choose “static,” and then run out of disk space, your only option will be to wipe out the entire installation and start over.

When the Ubuntu download is complete, point your VM to the downloaded file so that you can boot from it (for instructions, go to the Ubuntu FAQ). After installing Ubuntu, open a terminal and update the system:

sudo apt-get update
sudo apt-get upgrade

From this point forward, all installations take place within Ubuntu or your version of Linux; the instructions are identical for the VM or native Linux except where explicitly stated otherwise.

If you have trouble accessing the Internet, you may need to set up proxies. (Virtual private networks in general and Cisco AnyConnect in particular are notorious for modifying the VM’s proxy settings.) The proxy settings in Figure 6 are known to work for VirtualBox version 5.1.12 and 64-bit Ubuntu.

Building BigDL with the make-dist.sh script produces dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies-and-spark.jar. This JAR package contains all dependencies, including Spark.
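A typical clone-and-build sequence looks like this (the URL is the public BigDL repository on GitHub; adjust the path if you build from a mirror or a release archive):

$ git clone https://github.com/intel-analytics/BigDL.git
$ cd BigDL
$ bash make-dist.sh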

Alternative Approaches

If you are building for Spark 2.0, which uses Scala 2.11 by default, amend the bash call by passing -P spark_2.0 to the make-dist.sh script:

$ bash make-dist.sh -P spark_2.0

It is strongly recommended that you use Java 8 when running with Spark 2.0. Otherwise, you may observe poor performance.

By default, make-dist.sh uses Scala 2.10 for Spark version 1.5.x or 1.6.x and Scala 2.11 for Spark 2.0. To override the default behaviors, pass -P scala_2.10 or -P scala_2.11 to make-dist.sh, as appropriate.

Getting Started

Before you run a BigDL program, you must do a bit of setup. This section identifies the steps for getting started.

Set Environment Variables

To achieve high performance, BigDL uses Intel MKL and multithreaded programming. Therefore, you must first set the environment variables by sourcing the provided script, PATH_To_BigDL/scripts/bigdl.sh, as follows:

$ source PATH_To_BigDL/scripts/bigdl.sh

Alternatively, you can use PATH_To_BigDL/scripts/bigdl.sh to launch your BigDL program.
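For example, a sketch of launching a program through the script (the class name and the arguments after the double dash are placeholders for your own program, not a documented invocation):

$ PATH_To_BigDL/scripts/bigdl.sh -- java -cp dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies-and-spark.jar com.example.MyBigDLApp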

Working with the Interactive Scala Shell

You can experiment with BigDL code by using the interactive Scala shell, and you can run the examples (such as VGG training on the CIFAR-10 data set) as local Java programs. The example program takes the following parameters (see the sketch after this list):

--node. The number of nodes (machines) used in the training (As the example runs as a local Java program in this case, it is set to 1.)

--env. Either "local" or "spark" (In this case, it is set to "local".)

-f. The folder in which you put the CIFAR-10 data set

-b. The mini-batch size (The program expects that the mini-batch size is a multiple of node_number × core_number. In this example, node_number is 1, and you should set the mini-batch size to core_number × 4.)
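The following sketch shows both invocations (the JAR path, example class name, and data folder are assumptions based on the BigDL 0.1 layout; see the BigDL README.md for the exact commands). With one node and four cores, -b is set to 4 × 4 = 16:

$ scala -cp dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies-and-spark.jar

$ java -cp dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies-and-spark.jar \
  com.intel.analytics.bigdl.models.vgg.Train \
  --core 4 --node 1 --env local -f ~/data/cifar-10 -b 16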

Running a Spark Program

You can run a BigDL program, for example the VGG model training, as a standard Spark program (in either local mode or cluster mode). To do so, pass the following parameters (see the spark-submit sketch after this list):

--core. The number of physical cores used in each executor (or container) of the Spark cluster

--node. The executor (container) number of the Spark cluster (When running in Spark local mode, set the number to 1.)

--env. Either "local" or "spark" (In this case, it is set to "spark".)

-f. The folder in which you put the CIFAR-10 data set (Note that in this example, this is just a local file folder on the Spark driver. As the CIFAR-10 data is somewhat small [about 120 MB], you send it directly from the driver to executors in the example.)

-b. The mini-batch size (It is expected that the mini-batch size is a multiple of node_number × core_number_per_node. In this example, you should set the mini-batch size to node_number × core_number_per_node × 4.)
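A spark-submit sketch for Spark local mode follows (the class name, paths, and memory settings are assumptions based on the BigDL 0.1 layout; with one node and four cores per executor, -b is set to 1 × 4 × 4 = 16):

$ ./scripts/bigdl.sh -- spark-submit \
  --master local[4] \
  --driver-memory 2g --executor-memory 2g \
  --class com.intel.analytics.bigdl.models.vgg.Train \
  dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies-and-spark.jar \
  --core 4 --node 1 --env spark -f ~/data/cifar-10 -b 16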

Notes on Mac OS* Installation

In general, installation on Mac OS* is simpler than within a Windows* VM because Mac OS is UNIX based. However, the command syntax differs in places. Here are some notes:

Install Java

Go to java.com, download the Mac OS version of the JDK (Java Development Kit), and follow the installation prompts. Alternatively, you can open a terminal prompt and type:

$ java -version

If the JDK is not installed, this command prompts you to install it (click More Info rather than the OK button).

$ brew install git

Do a git pull as directed above.

Maven may already have been installed as part of one of the previous steps. To check, run:

$ mvn -v

If Maven is installed, this command prints the Maven and Java version information.

If Maven is not installed, install it via the following command:

$ brew install maven

Debugging Common Issues

Currently, BigDL uses synchronous mini-batch SGD in model training, and it will launch a single Spark task (running in multithreaded mode) on each executor (or container) when processing a mini-batch:

Set Engine.nodeNumber to the number of executors in the Spark cluster.

Ensure that each Spark executor has the same number of cores (Engine.coreNumber).

The mini-batch size must be a multiple of nodeNumber × coreNumber.
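As a quick sanity check (the cluster numbers here are illustrative):

// 4 executors with 8 cores each: the mini-batch size must be a multiple of 32
val nodeNumber = 4
val coreNumber = 8
val batchSize  = nodeNumber * coreNumber * 4  // 128 satisfies the requirement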

Note: You may observe poor performance when running BigDL for Spark 2.0 with Java 7, so use Java 8 when you build and run BigDL for Spark 2.0.

In Spark 2.0, use the default Java serializer instead of Kryo because of Kryo Issues 341: “Stackoverflow error due to parentScope of Generics being pushed as the same object” (see the Kryo Issues page for more information). The issue has been fixed in Kryo 4.0, but Spark 2.0 uses Kryo 3.0.3. Spark versions 1.5 and 1.6 do not have this problem.
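If your Spark configuration enables Kryo, you can switch back to the default Java serializer explicitly (the property name is standard Spark configuration; append your usual arguments in place of the ellipsis):

$ spark-submit --conf spark.serializer=org.apache.spark.serializer.JavaSerializer ...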

On CentOS* versions 6 and 7, increase the max user processes to a larger value, such as 514585. Otherwise, you may see errors like “unable to create new native thread.”
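One way to inspect and raise the limit for the current shell (raising the hard limit persistently may require editing /etc/security/limits.conf as root):

$ ulimit -u           # show the current max user processes
$ ulimit -u 514585    # raise it for the current shell, if the hard limit permits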

Currently, BigDL loads all the training and validation data into memory during training. You may encounter errors if BigDL runs out of memory.

BigDL Examples

Note: If you're using a VM, your performance won't be as good as it would be in a native operating system installation, mostly because of the overhead of the VM and the limited resources available to it. This is expected and is in no way reflective of genuine Spark performance.

Training LeNet on MNIST

A BigDL program starts with an import of com.intel.analytics.bigdl._. Then, it initializes the engine, including the number of executor nodes, the number of physical cores on each executor, and whether it runs on Spark or as a local Java program.

If the program runs on Spark, Engine.init() returns a SparkConf with proper configurations populated, which you can then use to create the SparkContext. Otherwise, the program runs as a local Java program, and Engine.init() returns None.
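A minimal sketch of this initialization, based on the description above (the variable names are illustrative, and the exact Engine.init signature may vary by BigDL version):

import com.intel.analytics.bigdl._
import com.intel.analytics.bigdl.utils.Engine
import org.apache.spark.SparkContext

// nodeNumber: executor count; coreNumber: physical cores per executor;
// the final argument indicates whether the program runs on Spark
val sc = Engine.init(nodeNumber, coreNumber, true)
  .map(conf => new SparkContext(conf.setAppName("Train LeNet on MNIST")))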

After the initialization, create the LeNet model by calling LeNet5(), which builds the LeNet-5 convolutional network model.
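A sketch of that definition in BigDL's Scala neural-network API (the layer sizes follow the classic LeNet-5 structure; the helper names are from the BigDL nn package and may differ slightly across versions):

import com.intel.analytics.bigdl.nn._
import com.intel.analytics.bigdl.numeric.NumericFloat  // use Float as the default numeric type

def lenet5(classNum: Int) =
  Sequential()
    .add(Reshape(Array(1, 28, 28)))        // a 28 x 28 grey MNIST image
    .add(SpatialConvolution(1, 6, 5, 5))   // 6 feature maps, 5 x 5 kernels
    .add(Tanh())
    .add(SpatialMaxPooling(2, 2, 2, 2))
    .add(SpatialConvolution(6, 12, 5, 5))
    .add(Tanh())
    .add(SpatialMaxPooling(2, 2, 2, 2))
    .add(Reshape(Array(12 * 4 * 4)))
    .add(Linear(12 * 4 * 4, 100))          // fully connected layers
    .add(Tanh())
    .add(Linear(100, classNum))
    .add(LogSoftMax())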

Next, load the data by creating the DataSet (either a distributed or a local one, depending on whether the program runs on Spark) and then applying a series of Transformers (for example, SampleToGreyImg, GreyImgNormalizer, and GreyImgToBatch).
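A sketch of the training-data pipeline, chained with BigDL's -> operator (load here stands for the MNIST-reading helper from the LeNet example, and the mean, standard deviation, and batch-size values are placeholders):

val trainSet = DataSet.array(load(trainImages, trainLabels), sc) ->
  SampleToGreyImg(28, 28) ->
  GreyImgNormalizer(trainMean, trainStd) ->
  GreyImgToBatch(batchSize)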

After that, you create the Optimizer (either a distributed or a local one, depending on whether the program runs on Spark) by specifying the data set, the model, and the criterion, which, given input and target, computes the gradient per the given loss function.
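A sketch of creating and running the Optimizer (the validation and stopping triggers are illustrative; ClassNLLCriterion is the negative log-likelihood loss that pairs with the LogSoftMax output of LeNet-5):

val optimizer = Optimizer(
  model = model,
  dataset = trainSet,
  criterion = ClassNLLCriterion[Float]())

optimizer
  .setValidation(Trigger.everyEpoch, validationSet, Array(new Top1Accuracy))
  .setEndWhen(Trigger.maxEpoch(10))
  .optimize()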

Image Classification

Note: For an example of running image inference on a pretrained model, see the BigDL README.md.

Download the validation images as ILSVRC2012_img_val.tar (the image .tar file for the pretrained model), put the file in ~/data/imagenet, for example, and untar it. Next, download the model file resnet-18.t7 and put it in ~/model/resnet-18.t7, for example.

Run the following command from a command line (this example is for a VM, so memory sizes are set to 1g):
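A sketch of such a command (the example class and its options are assumptions; consult the BigDL README.md for the exact invocation):

$ ./scripts/bigdl.sh -- spark-submit \
  --master local[4] \
  --driver-memory 1g --executor-memory 1g \
  --class com.intel.analytics.bigdl.example.imageclassification.ImagePredictor \
  dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies-and-spark.jar \
  --modelPath ~/model/resnet-18.t7 --folder ~/data/imagenet --modelType torch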

The output of the program is a table with two columns: the *.JPEG file name and a numeric label. To check the accuracy of prediction, refer to the imagenet1000_clsid.txt label file. (Note that the file labels are 0-based, while the output of the example is 1-based.)