Overview

On November 7, 2017, UC Berkeley, U-Texas, and UC Davis researchers published results training ResNet-50* to state-of-the-art accuracy in a then-record 31 minutes and AlexNet* in a then-record 11 minutes on CPUs. These results were obtained on Intel® Xeon® Scalable processors (formerly codenamed Skylake-SP). The main factors behind these speeds are:

The compute and memory capacity of Intel Xeon Scalable processors

Software optimizations in the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) and in the popular deep learning frameworks

This level of performance demonstrates that Intel Xeon processors are an excellent hardware platform for deep learning training. Data scientists can now use their existing general-purpose Intel Xeon processor clusters for deep learning training as well as continue using them for deep learning inference, classical machine learning, and big data workloads. They can get excellent deep learning training performance using 1 server node, and further reduce the time-to-train by using more server nodes, scaling near linearly to hundreds of nodes.

In this 4-part article, we explore the three main factors contributing to this record-setting speed, and provide examples of commercial use cases that run deep learning training on Intel Xeon processors. While the main focus of this article is on training, the first two factors also significantly improve inference performance.

Part 1: Compute and Memory Capacity of Intel Xeon Scalable Processors

Training deep learning models often requires significant compute. For example, training ResNet-50 requires a total of about one exaFLOP (10^18 single-precision operations). Hardware capable of high compute throughput can reduce the training time, but only if high utilization is achieved. High utilization requires high-bandwidth memory and clever memory management to keep the on-chip compute busy. The new generation of Intel Xeon processors provides these features: a large core count at high processor frequency, fast system memory, a large per-core mid-level cache (MLC, or L2 cache), and new SIMD instructions, making this generation an excellent platform for training deep learning models. In Part 1, we review the main hardware features of the Intel Xeon Scalable processors, including compute and memory, and compare their performance to previous generations of Intel Xeon processors on deep learning workloads.

In July 2017, Intel launched the Intel Xeon Scalable processor family built on 14 nm process technology. Intel Xeon Scalable processors support up to 28 physical cores (56 threads) per socket (up to 8 sockets) at a 2.50 GHz base frequency and 3.80 GHz max turbo frequency, and six memory channels with up to 1.5 TB of 2,666 MHz DDR4 memory. The top-bin Intel Xeon Platinum 8180 processor provides up to 199 GB/s of STREAM Triad performance on a 2-socket systema,b. For inter-socket data transfers, the Intel Xeon Scalable processors introduce the new Ultra Path Interconnect (UPI), a coherent interconnect that replaces QuickPath Interconnect (QPI), increasing the data rate to 10.4 GT/s per UPI port, with up to 3 UPI ports in a 2-socket configuration.

Additional improvements include a 38.5 MB shared non-inclusive last-level cache (LLC, or L3 cache), meaning memory reads fill directly into the L2 and not into both the L2 and L3, and a 1 MB private L2 cache per core. The Intel Xeon Scalable processor core now includes 512-bit wide Fused Multiply Add (FMA) instructions as part of the larger 512-bit wide vector engine, with up to two 512-bit FMA units computing in parallel per core (previously introduced in the Intel Xeon Phi™ processor product line)1. This provides a significant performance boost over the 256-bit wide AVX2 instructions in the previous Intel Xeon processor v3 and v4 generations (formerly codenamed Haswell and Broadwell, respectively) for both training and inference workloads.

The Intel Xeon Platinum 8180 processor provides up to 3.57 TFLOPS (FP32) and up to 5.18 TOPS (INT8) per socket2. The 512-bit wide FMAs essentially double the FLOPS that the Intel Xeon Scalable processors can deliver and significantly speed up single-precision matrix arithmetic. Comparing SGEMM and IGEMM performance, we observe 2.3x and 3.4x improvements, respectively, over the previous Intel Xeon processor v4 generationc,e. Comparing performance on a full deep learning model, we observed with the ResNet-18 model and the Intel® neon™ framework a 2.2x training and 2.4x inference throughput improvement at FP32 over the previous Intel Xeon processor v4 generationd,f.

Part 2: Software Optimizations in Intel MKL-DNN and the Main Frameworks

Software optimization is essential to high compute utilization and improved performance. Intel Optimized Caffe* (sometimes referred to as Intel Caffe), TensorFlow*, Apache* MXNet*, and Intel neon are optimized for training and inference. Optimizations with other frameworks such as Caffe2*, CNTK*, PyTorch*, and PaddlePaddle* are also a work in progress. In Part 2, we compare the performance of Intel optimized vs non-Intel optimized models; we explain how the Intel MKL-DNN library enables high compute utilization; we discuss the difference between Intel MKL and Intel MKL-DNN; and we explain additional optimizations at the framework level that further improve performance.

Two years ago, deep learning performance was sub-optimal on Intel® processors as software optimizations were limited and compute utilization was low. Deep learning scientists incorrectly assumed that CPUs were not good for deep learning workloads. Over the past two years, Intel has diligently optimized deep learning functions achieving high utilization and enabling deep learning scientists to use their existing general-purpose Intel processors for deep learning training. By simply setting a configuration flag when building the popular deep learning frameworks (the framework will automatically download and build Intel MKL-DNN by default), data scientists can take advantage of Intel CPU optimizations.

Using Intel Xeon processors with the Intel MKL-DNN library can provide over a 100x performance increase. For example, inference across all available CPU cores on AlexNet*, GoogleNet* v1, ResNet-50*, and GoogleNet v3 with Apache MXNet on the Intel Xeon processor E5-2666 v3 (c4.8xlarge AWS* EC2* instance) delivers 111x, 109x, 66x, and 92x higher throughput, respectively. Inference across all CPU cores on AlexNet with Caffe2 on the Intel Xeon processor E5-2699 v4 delivers 39x higher throughput. Training AlexNet, GoogleNet, and VGG* with TensorFlow on the Intel Xeon processor E5-2699 v4 delivers 17x, 6.7x, and 40x higher throughput, respectively. Training AlexNet across all CPU cores with Intel Optimized Caffe and Intel MKL-DNN on the Intel Xeon Platinum 8180 processor delivers 113x higher throughput than BVLC*-Caffe without Intel MKL-DNN on the Intel Xeon processor E5-2699 v3d,g. Figures 1 and 2 compare the training and inference throughput, respectively, of the Intel Xeon processor E5-2699 v4 and the Intel Xeon Platinum 8180 processor with the Intel MKL-DNN library for TensorFlow and Intel Optimized Caffe. The performance of these and other frameworks is expected to improve with further optimizations. All of this data was computed at FP32 precision. Data at lower numerical precision, which can improve performance, will be added at a later time.

At the heart of these optimizations is the Intel® Math Kernel Library (Intel® MKL) and the Intel MKL-DNN library. There are a variety of deep learning models, and they may seem very different. However, most models are built from a limited set of building blocks known as primitives that operate on tensors. Some of these primitives are inner products, convolutions, rectified linear units or ReLU, batch normalization, etc., along with functions necessary to manipulate tensors. These building blocks or low-level deep learning functions have been optimized for the Intel Xeon product family inside the Intel MKL library. Intel MKL is a library that contains many mathematical functions and only some of them are used for deep learning. In order to have a more targeted deep learning library and to collaborate with deep learning developers, Intel MKL-DNN was released open-source under an Apache 2 license with all the key building blocks necessary to build complex models. Intel MKL-DNN allows industry and academic deep learning developers to distribute the library and contribute new or improved functions. Intel MKL-DNN is expected to lead in performance as all new optimizations will first be introduced in Intel MKL-DNN.

Deep learning primitives are optimized in the Intel MKL-DNN library by incorporating prefetching, data layout, cache-blocking, data-reuse, vectorization, and register-blocking strategies. High utilization requires that data be available when the execution units (EUs) need it. This requires prefetching the data and reusing it in cache instead of fetching the same data multiple times from main memory. For cache-blocking, the goal is to maximize the computation on a given block of data that fits in cache, typically in the MLC. The data layout is arranged consecutively in memory so that access in the innermost loops is as contiguous as possible, avoiding unnecessary gather/scatter operations. This results in better utilization of cache lines (and hence bandwidth) and improves prefetcher performance. As we loop through a block, we constrain the outer dimension of the block to be a multiple of the SIMD width, with the innermost dimension looping over groups of SIMD width to enable efficient vectorization. Register blocking may be needed to hide the latency of the FMA instructions3.
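As a minimal illustration of the cache-blocking idea (not Intel MKL-DNN's actual implementation), the sketch below computes a matrix product tile by tile so that the three active tiles can stay resident in cache; the block size is a hypothetical placeholder that a real library would derive from the L2 cache size:

```python
def matmul_blocked(A, B, block=32):
    """Cache-blocked matrix multiply on nested lists: C = A x B computed
    tile by tile so each triple of tiles fits in cache. `block` is a
    tunable placeholder; real libraries choose it from the cache size."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, block):
        for k0 in range(0, K, block):
            for j0 in range(0, N, block):
                # multiply one tile pair and accumulate into the output tile
                for i in range(i0, min(i0 + block, M)):
                    Ai, Ci = A[i], C[i]
                    for k in range(k0, min(k0 + block, K)):
                        a, Bk = Ai[k], B[k]
                        for j in range(j0, min(j0 + block, N)):
                            Ci[j] += a * Bk[j]
    return C
```

The same loop structure, with the innermost loop vectorized over SIMD lanes and the accumulators held in registers, is what the register-blocking discussion above refers to.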

Additional parallelism across cores is important for high CPU utilization, such as parallelizing across a mini-batch using OpenMP*. This requires improving the load balance so that each core is doing an equal amount of work and reducing synchronization events across cores. Efficiently using all cores in a balanced way requires additional parallelization within a given layer.

These sets of optimizations ensure that all the key deep learning primitives, such as convolution, matrix multiplication, and batch normalization, are efficiently vectorized for the latest SIMD instructions and parallelized across the cores. Intel MKL-DNN primitives are implemented in C, with C and C++ API bindings for the most widely used deep learning functions.

There are multiple deep learning frameworks, such as Caffe, TensorFlow, MXNet, and PyTorch. Modifications (code refactoring) at the framework level are required to efficiently take advantage of the Intel MKL-DNN primitives. The frameworks carefully replace calls to existing deep learning functions with the appropriate Intel MKL-DNN APIs, preventing the framework and the Intel MKL-DNN library from competing for the same threads. During setup, the framework manages layout conversions from the framework to Intel MKL-DNN and allocates temporary arrays if the output and input data layouts do not match. To improve performance, graph optimizations may be required to keep conversions between different data layouts to a minimum, as shown in Figure 3. During the execution step, the data is fed to the network in a plain layout like BCWH (batch, channel, width, height) and is converted to a SIMD-friendly layout. As data propagates between layers, the data layout is preserved, and conversions are made only when an operation not supported by Intel MKL-DNN must be performed.
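To make the layout conversion concrete, here is a hedged sketch (not the Intel MKL-DNN API) that reorders a flat NCHW buffer into a blocked layout in which groups of `vec` channels are contiguous in memory, similar in spirit to MKL-DNN's SIMD-friendly blocked formats; `vec=8` is an illustrative vector width:

```python
def to_blocked(x, nchw_shape, vec=8):
    """Convert a flat NCHW buffer into an 'nChw8c'-style blocked layout:
    channels are grouped in runs of `vec` so that the innermost dimension
    is a contiguous SIMD-width group of channels."""
    n, c, h, w = nchw_shape
    assert c % vec == 0 and len(x) == n * c * h * w
    out = []
    for b in range(n):
        for cb in range(c // vec):          # channel blocks
            for y in range(h):
                for col in range(w):
                    for v in range(vec):    # innermost: contiguous channels
                        ch = cb * vec + v
                        out.append(x[((b * c + ch) * h + y) * w + col])
    return out
```

A spatial loop over this layout can then load `vec` channels with one SIMD load, which is why frameworks try to keep tensors in the blocked layout across consecutive primitives.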

Figure 3: Operation graph flow. MKL-DNN primitives are shown in blue. Framework optimizations attempt to reduce layout conversions so that the data stays in the MKL-DNN layout across consecutive primitive operations. Image source.

Furthermore, core utilization may improve for some frameworks and workloads by partitioning the sockets and cores into separate computing devices and running multiple instances in each socket, one instance per computing device. Saletore et al. describe the steps to implement this methodology without changing a single line of framework code. These methods may improve training by up to 2x and inference by up to 2.7x on top of the current software optimizations.

Part 3: Advancements in Distributed Training Algorithms For Deep Learning

Training a large deep learning model often takes days or even weeks. Distributing the computational work among multiple server nodes can reduce the time to train. However, regardless of the hardware used, there are algorithmic challenges to this, and recent advancements in distributed algorithms mitigate some of them. In Part 3, we review the gradient descent and stochastic gradient descent (SGD) algorithms and explain the limitations of training with very large mini-batches; we discuss model and data parallelism; we review synchronous SGD (SSGD), asynchronous SGD (ASGD), and allreduce/broadcast algorithms; finally, we present recent advances that enable larger mini-batch SSGD training and present state-of-the-art results.

In supervised deep learning, input data is passed through the model and the output is compared to the ground truth or expected output. A penalty or loss is then computed. Training the model involves adjusting the model parameters to reduce this loss. There are various optimization algorithms that can be used to minimize the loss function such as gradient descent, or variants such as stochastic gradient descent, Adagrad, Adadelta, RMSprop, Adam, etc.

In gradient descent (GD), also known as steepest descent, the loss function for a particular model defined by the set of weights is computed over the entire dataset. The weights are updated by moving in the direction opposite to the gradient; that is, moving towards the local minimum: updated-weights = current-weights – learning-rate * gradient.

In stochastic gradient descent (SGD), or more correctly called mini-batch gradient descent, the dataset is broken into several mini-batches. The loss is computed with respect to a mini-batch and the weights are updated using the same update rule as gradient descent. There are other variants that speed up the training process by accumulating velocity (known as momentum) in the direction of the opposite of gradients, or that reduce the data scientist’s burden of choosing a good learning rate by automatically modifying the learning rate depending on the norm of the gradients. An in-depth discussion of these variants can be found elsewhere.
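As a self-contained sketch of the mini-batch SGD update rule with momentum described above (the hyperparameters and the one-parameter linear model are illustrative, not from any cited experiment):

```python
import random

def sgd_momentum(data, w, lr=0.05, mu=0.9, batch=4, epochs=200):
    """Mini-batch SGD with momentum on 1-D least squares: fit y = w*x.
    The loss per mini-batch is 0.5 * mean((w*x - y)^2); velocity
    accumulates steps in the direction opposite the gradient."""
    v = 0.0
    rng = random.Random(0)
    for _ in range(epochs):
        rng.shuffle(data)                       # new mini-batches each epoch
        for i in range(0, len(data), batch):
            mb = data[i:i + batch]
            # gradient of the mini-batch loss with respect to w
            g = sum((w * x - y) * x for x, y in mb) / len(mb)
            v = mu * v - lr * g                 # accumulate velocity
            w = w + v                           # update rule
    return w

# Data generated from y = 3x; SGD should recover w close to 3.
pts = [(x / 10.0, 3.0 * x / 10.0) for x in range(-20, 21)]
w_fit = sgd_momentum(pts, w=0.0)
```

Setting `mu=0` recovers plain mini-batch SGD, and setting `batch=len(data)` recovers gradient descent, matching the relationship described in the text.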

The behavior of SGD approaches that of GD as the mini-batch size increases, and the two become identical when the mini-batch size equals the entire dataset. GD has three main challenges (which SGD shares when the mini-batch size is very large). First, each step or iteration is computationally expensive, as it requires computing the loss over the entire dataset. Second, learning slows near saddle points, areas where the gradient is close to zero and the curvature changes sign across directions. Third, according to Intel and Northwestern researchers, the optimization space appears to have many sharp minima. Gradient descent does not explore the optimization space but rather moves toward the local minimum directly underneath its starting position, which is often a sharp minimum. Sharp minima do not generalize well: while the overall loss function with respect to the test dataset is similar in shape to that of the training dataset, the actual cost at a sharp minimum may be very different. A cartoonish way to visualize this is shown in Figure 4, where the loss function with respect to the test dataset is slightly shifted from the loss function with respect to the training dataset. This shift means that models converging to a sharp minimum have a high cost with respect to the test dataset and do not generalize well to data outside the training set, while models converging to a flat minimum have a low cost with respect to the test dataset and generalize well.

Figure 4: In this cartoonish figure, the loss function with respect to the test dataset is slightly shifted from the loss function with respect to the training dataset. The sharp minimum has a high cost with respect to the test loss function. Image source.

Small mini-batch SGD (SMB-SGD) resolves these issues. First, SMB-SGD is computationally inexpensive per iteration, so each iteration is fast, and there are many weight updates per pass over the dataset instead of just one as in GD. SMB-SGD usually requires fewer passes over the entire dataset and therefore trains faster. Second, SMB-SGD is extremely unlikely to get stuck at a saddle point, since the gradients with respect to some of the mini-batches in the training set are likely nonzero even if the gradient with respect to the entire training set is zero. Third, SMB-SGD is more likely to find a flat minimum, since it better explores the solution space instead of moving toward the local minimum directly underneath its starting position. On the other hand, very small mini-batches are also not ideal, because it is difficult to maintain high CPU (or GPU) utilization with them. This becomes more problematic when distributing the computational workload of a small mini-batch across several worker nodes. Therefore, it is important to find a mini-batch size large enough to maintain high CPU utilization but not so large that it suffers from the issues of GD. This becomes more important for the synchronous data-parallel SGD discussed below.

Efficiently distributing the workload across several worker nodes can reduce the overall time-to-train. The two main techniques used are model parallelism and data parallelism. In model parallelism, the model is split among the worker nodes, with each node working on the same mini-batch. Model parallelism is used in practice when the memory requirements exceed a worker's memory. Data parallelism is the more common approach and works best for models with fewer weights. In data parallelism, the mini-batch is split among the worker nodes, with each node having the full model and processing a piece of the mini-batch, known as the node-batch. Each worker node computes the gradient with respect to its node-batch. These gradients are then aggregated using a reduce algorithm to compute the gradient with respect to the overall mini-batch. The model weights are then updated, and the updated weights are broadcast to each worker node. This is known as the reduce/broadcast, or simply allreduce, scheme (a list of allreduce options is discussed below). At the end of each iteration or cycle through a mini-batch, all the worker nodes have the same updated model; that is, the nodes are synchronized. This is known as synchronous SGD (SSGD).
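A minimal sketch of one synchronous data-parallel step (assuming equal-sized node-batches, with the worker loop written sequentially for clarity; `grad_fn` stands in for a per-node backward pass):

```python
def data_parallel_step(weights, mini_batch, n_workers, lr, grad_fn):
    """One SSGD step: split the mini-batch into node-batches, let each
    worker compute a gradient on its share, average the node gradients
    (the allreduce), and apply one weight update so that every worker
    ends the step with the same model."""
    share = (len(mini_batch) + n_workers - 1) // n_workers
    node_batches = [mini_batch[i:i + share]
                    for i in range(0, len(mini_batch), share)]
    # in practice these run in parallel, one per worker node
    node_grads = [grad_fn(weights, nb) for nb in node_batches]
    g = sum(node_grads) / len(node_grads)       # allreduce: average
    return weights - lr * g                     # broadcast of new weights

def lsq_grad(w, samples):
    """Illustrative per-node gradient: least squares for y = w*x."""
    return sum((w * x - y) * x for x, y in samples) / len(samples)
```

Note that averaging node gradients equals the full mini-batch gradient only when the node-batches are the same size, which is why schedulers keep the split balanced.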

Asynchronous SGD (ASGD) alleviates the synchronization overhead. However, ASGD has additional challenges: it requires more tuning of hyperparameters such as momentum and requires more iterations to train. Furthermore, its results do not match single-node runs, which makes it more difficult to debug. In practice, ASGD has not been shown to scale and retain accuracy on large models. Stanford, LBNL, and Intel researchers have shown that a hybrid ASGD/SSGD approach can work when the nodes are clustered into up to 8 groups: updates within a group are synchronous, and updates between groups are asynchronous. Going beyond 8 groups reduces performance due to the ASGD challenges.

One strategy for communicating gradients is to appoint one node as the parameter server, which computes the sum of the node gradients, updates the model, and sends the updated weights back to each worker. However, there is a bottleneck in sending and receiving all of the gradients using one parameter server. Unless ASGD is used, a parameter server strategy is not recommended.

Allreduce and broadcast algorithms are used to communicate and sum the node gradients and then broadcast the updated weights. There are various allreduce algorithms, including tree, butterfly, and ring. Butterfly is optimal for latency, scaling at O(log(P)) iterations, where P is the number of worker nodes, and combines reduce-scatter and broadcast. Ring is optimal for bandwidth; for large data transfers, its per-node communication cost scales at O(1) with the number of nodes. In bandwidth-constrained clusters, ring allreduce is usually the better algorithm. A detailed explanation of the ring allreduce algorithm can be found elsewhere. Figure 5 showcases various communication strategies.
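The ring allreduce can be sketched as a simulation (message passing replaced by in-memory copies; the chunk schedule follows the standard reduce-scatter plus allgather formulation):

```python
def ring_allreduce(node_vectors):
    """Simulate ring allreduce over P nodes, each holding a vector split
    into P chunks. Phase 1 (reduce-scatter): in P-1 steps each node passes
    one chunk to its right neighbour, so every node ends up owning the full
    sum of one chunk. Phase 2 (allgather): P-1 more steps circulate the
    reduced chunks. Per-node traffic is 2*(P-1)/P of the vector size,
    roughly constant in P, which is why ring is bandwidth-optimal."""
    P, n = len(node_vectors), len(node_vectors[0])
    chunk = [(i * n // P, (i + 1) * n // P) for i in range(P)]
    buf = [list(v) for v in node_vectors]
    # reduce-scatter: at step s, node r sends chunk (r - s) mod P rightward
    for step in range(P - 1):
        sends = [(r, (r - step) % P) for r in range(P)]
        msgs = [(r, c, buf[r][chunk[c][0]:chunk[c][1]]) for r, c in sends]
        for r, c, data in msgs:
            lo, _ = chunk[c]
            for k, val in enumerate(data):
                buf[(r + 1) % P][lo + k] += val
    # allgather: at step s, node r forwards its reduced chunk (r + 1 - s) mod P
    for step in range(P - 1):
        sends = [(r, (r + 1 - step) % P) for r in range(P)]
        msgs = [(r, c, buf[r][chunk[c][0]:chunk[c][1]]) for r, c in sends]
        for r, c, data in msgs:
            lo, hi = chunk[c]
            buf[(r + 1) % P][lo:hi] = data
    return buf
```

After both phases, every node holds the element-wise sum of all the input vectors, which in SSGD is the aggregated gradient.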

In November 2014, Jeff Dean spoke of Google's research goal to reduce training time from six weeks to a day. Three years later, CPUs were used to train AlexNet in 11 minutes. This was accomplished by using larger mini-batch sizes, which allow distributing the computational workload to 1000+ nodes. To scale efficiently, the communication of the gradients and updated weights must be hidden under the computation of those gradients.

Increasing the overall mini-batch size is possible with these techniques: 1) proportionally increasing the mini-batch size and learning rate; 2) slowly increasing the learning rate during the initial part of training (known as warm-up learning rates); and 3) having a different learning rate for each layer in the model using the layer-wise adaptive rate scaling (LARS) algorithm. Let’s go through each technique in more detail.

The larger the mini-batch size, the more confidence one has in the gradient, and therefore a larger learning rate can be used. As a rule of thumb, the learning rate is increased proportionally to the increase in mini-batch size4. This technique allowed UC Berkeley researchers to increase the mini-batch size from 256 to 1024 with the GoogleNet model and scale to 128 K20-GPU nodes, reducing the time-to-train from 21 days to 10.5 hours, and Intel researchers to increase the mini-batch size from 128 to 512 with the VGG-A model and scale to 128 Intel Xeon processor E5-2698 v3 nodes.

A large learning rate can lead to divergence (the loss increases with each iteration rather than decreases), particularly during the initial training phase, because the norm of the gradients is then much greater than the norm of the weights. This is mitigated by gradually increasing the learning rate during the initial training phase, for example during the first 5 epochs, until the target learning rate is reached. This technique allowed Facebook* researchers to increase the mini-batch size from 256 to 8192 with ResNet-50 and scale to 256 Nvidia* Tesla* P100 GPUs, reducing the time-to-train from 29 hours (using 8 P100 GPUs) to 1 hour. This technique also allowed SURFsara and Intel researchers to scale to 512 2-socket Intel Xeon Platinum processors, reducing the ResNet-50 time-to-train to 44 minutes.
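The linear scaling rule and gradual warm-up can be combined into a simple schedule; the base values below are illustrative, not the exact schedules used in the cited work:

```python
def scaled_lr(base_lr, base_batch, batch, epoch, warmup_epochs=5):
    """Learning-rate schedule combining the linear scaling rule with
    gradual warm-up: the target LR grows proportionally to the mini-batch
    size, and during the first `warmup_epochs` epochs the LR ramps
    linearly from the small-batch value up to that target."""
    target = base_lr * batch / base_batch        # linear scaling rule
    if epoch >= warmup_epochs:
        return target
    # linear ramp from base_lr to target during warm-up
    return base_lr + (target - base_lr) * epoch / warmup_epochs
```

For example, growing the mini-batch from 256 to 8192 (32x) scales a base rate of 0.1 up to 3.2, reached only after the warm-up epochs have passed.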

Researchers at Nvidia observed that the ratio of the norm of the gradients to the norm of the weights varies greatly between layers within a model. They proposed giving each layer its own learning rate, inversely proportional to this ratio. This technique (combined with the ones above) allowed them to increase the mini-batch size to 32K.
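A sketch of the LARS local learning rate for one layer follows; the trust coefficient value is illustrative:

```python
def lars_lr(global_lr, weights, grads, trust=0.001, eps=1e-9):
    """Layer-wise adaptive rate scaling (LARS) for a single layer: the
    local rate is proportional to ||w|| / ||g||, so layers whose gradients
    are large relative to their weights take correspondingly smaller
    steps. `trust` is the LARS trust coefficient."""
    w_norm = sum(w * w for w in weights) ** 0.5
    g_norm = sum(g * g for g in grads) ** 0.5
    local = trust * w_norm / (g_norm + eps)      # layer-wise ratio
    return global_lr * local
```

Each layer's update then uses `lars_lr(...)` in place of the single global rate, neutralizing the per-layer variation in the gradient-to-weight ratio.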

UC Berkeley, U-Texas, and UC Davis researchers used these techniques to achieve record training times (as of their Nov. 7, 2017 publication): AlexNet in 11 minutes and ResNet-50 in 31 minutes on Intel CPUs to state-of-the-art accuracy. They used 1024 and 1600, respectively, 2-socket Intel Xeon Platinum 8160 processor servers with the Intel MKL-DNN library and the Intel Optimized Caffe framework5.

Many data centers do not have high-bandwidth interconnects or thousands of nodes available. Nevertheless, even over 10 Gbps Ethernet, perfect linear scaling to 32 nodes and 99.8% scaling efficiency to 64 2-socket Intel Xeon Gold 6148 processor servers can be achieved on ResNet-50i. The total training time for 90 epochs with these 64 servers, reaching the expected 75.9% top-1 validation accuracy, is 7.3 hours.

Intel’s assembly and test factory benefited from Intel Optimized Caffe on Intel Xeon processors, improving silicon manufacturing package fault detection. This project aimed to reduce the human review rate for package cosmetic damage at the final inspection point while keeping the false negative ratio at the human level. The input was a set of package photos, and the goal was binary classification of each photo, indicating whether the package was rejected or passed. The GoogleNet model was modified for this task. Using 8 Intel Xeon Platinum 8180 processors connected via 10 Gb Ethernet, training was completed within 1 hour. The false negative rate consistently met the expected human-level accuracy. Automating this process saved 70% of the inspectors’ time.

Facebook utilizes Intel processors for inference, and both CPUs and GPUs for training, in the machine learning algorithms used in its services. Training is done much less frequently than inference; the inference phase may be run tens of trillions of times per day and generally needs to be performed in real time. Some of the major services leveraging machine learning include News Feed, Ads, Search, Sigma (anomaly detection), Lumos (feature extraction), Facer (face recognition), Language Translation, and Speech Recognition. The News Feed and Sigma services train on CPUs, while the Facer and Search algorithms train on both CPUs and GPUs.

deepsense.ai employs Intel processors for reinforcement learning (RL). They trained agents to play a wide range of Atari 2600 games on 64 12-core Intel CPUs, sometimes with perfect linear scaling, in as little as 20 minutes per game (the article does not specify the exact Intel processors or interconnects used).

Kyoto University researchers used deep learning on Intel processors to predict compound-based interactions, an important step for drug design. For their deep learning workloads, they claimed that "Intel Haswell-EP (Xeon E5-2699v3×2 with 128 GB of memory) with optimized Theano* outperformed Nvidia Tesla (K40 hosted by Ivy Bridge) in terms of both speed and max supported data size." Note that the Intel Xeon E5-2699 v3 processor and the Nvidia Tesla K40 are an older-generation CPU and GPU, respectively.

Intel and one of its partners successfully used Faster-RCNN* with Intel Optimized Caffe for the task of solar panel defect detection. Training used 300 original solar panel images augmented with 36-degree rotations. Training on an Intel Xeon Platinum 8180 processor took 6 hours and achieved 96.3% detection accuracy under some adverse environmental conditions. Inference performance is 188 images per second. This general solution can be applied to various inspection services, including oil and gas inspection, pipeline seepage and leakage detection, utilities inspection, transmission lines and substations, and emergency crisis response.

Amazon Web Services (AWS) GM Matt Wood reported that AWS and Intel optimized deep learning with the latest version of Intel MKL on the Intel Xeon Scalable processors available in AWS EC2 C5 instances, increasing inference performance by 100x, and developed tools to easily build distributed deep learning applications on AWS. "For example," Wood explains, "Novartis used 10,600 EC2 instances and approximately 87,000 compute cores to conduct 39 years of computational chemistry in just 9 hours, screening 10 million compounds against a cancer target. This magnitude of improvement is changing what is possible in industries such as health care, life sciences, financial services, scientific research, aerospace, automotive, manufacturing, and energy."

Clemson University researchers applied 1.1 million AWS EC2 vCPUs on Intel processors to study topic modeling, a component of natural language processing. They studied how computers process human language through nearly half a million topic modeling experiments. Topic models can be used to discover the themes present across a collection of documents.

GE Healthcare applied Intel Xeon E5-2650 v4 processors to GE's deep learning CT imaging classification inference workloads. Using Intel’s Deep Learning Deployment Toolkit (Intel DLDT) and Intel MKL-DNN, they improved throughput by an average of 14x over a baseline version of the solution and exceeded GE’s throughput goal by almost 6x on just four cores. The Intel DLDT facilitates deployment of deep learning solutions by delivering a unified API to integrate inference with application logic. These findings pave the way to a new era of smarter medical imaging.

Aier Eye Hospital Group and MedImaging Integrated Solutions (MiiS) employed Intel Xeon Scalable processors and Intel Optimized Caffe to develop a deep-learning solution to improve screening for diabetic retinopathy and age-related macular degeneration. China’s national government plans to use the solution to enable high-quality, eye-health screening in clinics and smaller hospitals across the country. Stefan Zheng, CEO of MiiS, said, “The combination of Intel’s expertise, the new Intel Xeon Scalable processors, and Caffe optimized for Intel architecture provided such impressive results that we are moving forward with Intel Xeon Bronze processor and Intel’s optimized Caffe framework.” Xu Ming, general manager of Aier Eye Health Group, said, “By enabling our clinics to offer fast, accurate screening, Aier can help address the shortage of ophthalmologists and bring high-quality care to people in rural areas, where the only care available is through a rural clinic. Even in the biggest cities, we can help save time for eye-care professionals and enable them to devote themselves to patients with the most serious problems. Through early detection, we provide opportunities for earlier diagnosis and treatment, to help preserve vision.”

OpenAI uses CPUs to train evolution strategies (ES), achieving strong performance on RL benchmarks (Atari/MuJoCo). Using a cluster of 80 machines and 1,440 CPU cores, OpenAI engineers were able to train a 3D MuJoCo humanoid walker in only 10 minutes.

China UnionPay implemented a neural-network risk-control system utilizing Intel Xeon processors with BigDL and Apache Spark. The network takes advantage of 10 TB of training data with 10 billion training samples. The model was trained on 32 Intel Xeon processor nodes and improved the accuracy of risk detection by up to 60 percent over traditional rule-based risk-control systems.

JD.com* and Intel teams have worked together to build a large-scale image feature extraction pipeline using SSD and DeepBit models on BigDL and Apache Spark. In this use case, BigDL provides support for various deep learning models, for example, object detection and classification. In addition, it allows the reuse of pre-trained Caffe, Torch*, and TensorFlow models. The entire application pipeline was fully optimized to deliver significantly accelerated performance on an Intel Xeon processor-based Hadoop cluster, with a ~3.83x speedup compared to the same solution running on 20 K40 GPU cards.

MLSListings* and the Intel team also worked together to build an image-similarity-based house recommendation system in BigDL on Intel Xeon processors on Microsoft Azure. They used the Places365* dataset to fine-tune GoogleNet v1 and VGG-16 to produce the image embeddings used to compute image similarities.

Gigaspaces* has integrated BigDL into its InsightEdge* platform. Gigaspaces built a system that automatically routes user requests to appropriate call center specialists using a natural language processing (NLP) model trained in BigDL on Intel Xeon processors.

Open Telefónica* Cloud offers a Map Reduce Service (MRS) with BigDL running on top. The steps for using their platform with BigDL for deep learning workloads are outlined here.

Conclusion

Intel’s newest Xeon Scalable processors along with optimized deep learning functions in the Intel MKL-DNN library provide sufficient compute for deep learning training workloads (in addition to inference, classical machine learning, and other AI algorithms). Popular deep learning frameworks are now incorporating these optimizations, increasing the effective performance delivered by a single server by over 100x in some cases. Recent advances in distributed algorithms have also enabled the use of hundreds of servers to reduce the time to train from weeks to minutes. Data scientists can now use their existing general-purpose Intel Xeon processor clusters for deep learning training as well as continue using them for deep learning inference, classical machine learning, and big data workloads. They can get excellent deep learning training performance using a single Intel CPU, and further reduce the time-to-train by using multiple CPU nodes, scaling near-linearly to hundreds of nodes.

Footnotes

Two 512-bit FMA units computing in parallel per core are available in Intel Xeon Platinum processors, Intel Xeon Gold 6000-series processors, and the Intel Xeon Gold 5122 processor. Other Intel Xeon Scalable processor SKUs have one FMA unit per core.

The raw compute can be calculated as AVX-512-frequency * number-of-cores * number-of-FMAs-per-core * 2-operations-per-FMA * SIMD-vector-length / number-of-bits-in-numerical-format. The Intel Xeon Platinum 8180 delivers AVX-512-frequency * 28 * 2 * 2 * 512/32 = 1792 * AVX-512-frequency peak FP32 FLOPS. The AVX-512 frequencies for various SKUs can be found elsewhere. The frequencies shown correspond to FP64 operations; the frequencies for FP32 may be slightly higher. The AVX-512 max turbo frequency may not be fully sustained when running very high-FLOPS workloads.
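Plugging the Platinum 8180's numbers into this formula gives its peak FP32 throughput. The 2.0 GHz AVX-512 all-core frequency used below is an illustrative assumption for the calculation, not a published specification.

```python
def peak_flops(avx512_freq_ghz, cores, fmas_per_core=2, simd_bits=512, dtype_bits=32):
    """Peak FLOPS per the footnote's formula."""
    ops_per_fma = 2  # a fused multiply-add counts as two floating-point operations
    lanes = simd_bits // dtype_bits
    return avx512_freq_ghz * 1e9 * cores * fmas_per_core * ops_per_fma * lanes

# Intel Xeon Platinum 8180: 28 cores, 2 FMA units per core.
# 1792 FLOPS/cycle x 2.0 GHz (assumed) = 3.584e12, i.e. ~3.58 FP32 TFLOPS peak.
print(peak_flops(2.0, 28) / 1e12)
```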

When executing in the 512-bit register port scheme on processors with two FMA units, the Port 0 FMA has a latency of 4 cycles and the Port 5 FMA has a latency of 6 cycles. Bypass can add a -2 (fast bypass) to +1 cycle delay. The FMA unit supports fast bypass when all instruction sources come from the FMA unit. The instructions used for deep learning workloads at fp32 support bypass and have a latency of 4 cycles on both ports 0 and 5; see Section 15.17. An Intel Xeon Scalable processor with 2 FMAs requires at least 8 registers to hide these latencies for the data and the weights.
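The 8-register requirement follows from multiplying latency by throughput: with two FMAs issued per cycle and a 4-cycle latency, 8 independent accumulator chains are needed to keep both pipes full. A quick back-of-the-envelope check:

```python
def accumulators_needed(fma_latency_cycles, fmas_per_cycle):
    # In-flight operations = latency x issue rate (Little's law), and each
    # in-flight FMA needs its own independent accumulator register to avoid
    # a dependency stall.
    return fma_latency_cycles * fmas_per_cycle

print(accumulators_needed(4, 2))  # 8, matching the footnote
```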

This linear scaling does not hold as mini-batch sizes approach 8K and beyond. Above 8K, the learning rate should instead increase in proportion to the square root of the increase in mini-batch size. Details elsewhere.
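The combined rule can be sketched as linear scaling of the learning rate up to an 8K mini-batch and square-root scaling beyond it. The base learning rate and batch sizes below are illustrative values, not the ones used in the cited results.

```python
def scaled_lr(base_lr, base_batch, batch, pivot=8192):
    """Linear learning-rate scaling up to `pivot`, square-root scaling beyond it."""
    if batch <= pivot:
        return base_lr * batch / base_batch          # linear scaling regime
    lr_at_pivot = base_lr * pivot / base_batch
    return lr_at_pivot * (batch / pivot) ** 0.5      # square-root scaling regime

# e.g. base lr 0.1 at batch 256: batch 8192 -> 3.2, batch 32768 -> 6.4
```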

The researchers added some modifications that will be committed to the main Intel Optimized Caffe branch.

About the authors

Andres Rodriguez, PhD, is a Sr. Principal Engineer with Intel’s Artificial Intelligence Products Group (AIPG), where he designs AI solutions for Intel’s customers and provides technical leadership across Intel for AI products. He has 13 years of experience working in AI. Andres received his PhD from Carnegie Mellon University for his research in machine learning. He has over 20 peer-reviewed publications in journals and conferences, and a book chapter on machine learning.

Wei Li, PhD, is vice president in the Software and Services Group and general manager of Machine Learning and Translation at Intel, responsible for the development of software systems including machine learning, binary translation, and hardware emulation. An experienced leader in his field, he has led numerous teams accelerating the cloud, data center, artificial intelligence, mobile and Internet-of-things. He holds a Ph.D. in computer science from Cornell University, and completed the Executive Accelerator Program at the Stanford Graduate School of Business.

Jason Dai is a Sr. Principal Engineer and Big Data Chief Architect in Intel’s Software and Services Group (SSG), responsible for leading the development of advanced big data analytics (including distributed machine/deep learning), as well as collaborations with leading research labs (e.g., UC Berkeley AMPLab). He is an internationally recognized expert on big data, cloud, and distributed machine learning; he is the program co-chair of the Strata Data Conference in Beijing, a committer and PMC member of the Apache Spark project, and the creator of the BigDL project, a distributed deep learning framework on Apache Spark.

Frank Zhang is the Intel Optimized Caffe product manager in Intel’s Software and Services Group, where he is responsible for product management of the Intel Optimized Caffe deep learning framework’s development, releases, and customer support. He has more than 20 years of industry experience in software development at companies including NEC, TI, and Marvell. Frank graduated from the University of Texas at Dallas with a master’s degree in Electrical Engineering.

Jiong Gong is a senior software engineer with Intel’s Software and Services Group, where he is responsible for the architectural design of Intel Optimized Caffe, making optimizations that demonstrate its performance advantage on both single-node and multi-node IA platforms. Jiong has more than 10 years of industry experience in system software and AI. He graduated from Shanghai Jiao Tong University with a master’s degree in computer science, and holds 4 US patents in AI and machine learning.

Chong Yu is a software engineer in Intel’s Software and Services Group, currently working on Intel Optimized Caffe framework development and optimization on IA platforms. Chong won the Intel Fellowship and joined Intel 5 years ago. He obtained a master’s degree in information science and technology from Fudan University. Chong has published 20 journal papers and holds 2 Chinese patents. His research areas include artificial intelligence, robotics, 3D reconstruction, remote sensing, and steganography.