Accelerate Recommender Systems with GPUs

Wei Tan, Research Staff Member at IBM T. J. Watson Research Center, shares how IBM is using NVIDIA GPUs to accelerate recommender systems, which use ratings or user behavior to recommend new products, items or content to users. Recommender systems are important in applications such as recommending products on retail sites, recommending movies or music on streaming media services, and recommending news items or posts on social media and networking services. Wei Tan's team developed cuMF, a highly optimized matrix factorization system that uses CUDA to accelerate recommendations in applications like these and more.

Brad: Can you talk a bit about your current research?

Wei: Matrix factorization (MF) is at the core of many popular algorithms, such as collaborative-filtering-based recommendation, word embedding, and topic modeling. Matrix factorization factors a sparse m-by-n ratings matrix R into an m-by-f matrix X and an f-by-n matrix Θ^T, as Figure 1 shows.

Suppose we obtained m users' ratings on n items (say, movies). If user u rated item v, we use r_{uv} as the non-zero element of R at position (u, v). We want to minimize the following cost function J, which uses the weighted-λ-regularization proposed in [1] to avoid overfitting, where n_{x_u} and n_{θ_v} denote the total number of ratings on user u and item v, respectively:

J = Σ_{r_{uv}≠0} (r_{uv} − x_u^T θ_v)^2 + λ (Σ_u n_{x_u} ||x_u||^2 + Σ_v n_{θ_v} ||θ_v||^2)

Many optimization methods, including Alternating Least Squares (ALS) and Stochastic Gradient Descent (SGD) [1], have been applied to minimize J. We adopt the ALS approach, which first optimizes X while fixing Θ, and then optimizes Θ while fixing X. That is, in one iteration we need to solve these two sets of equations alternately:

x_u = (Σ_{v: r_{uv}≠0} θ_v θ_v^T + λ n_{x_u} I)^{-1} Σ_{v: r_{uv}≠0} r_{uv} θ_v    (Equation 1)

θ_v = (Σ_{u: r_{uv}≠0} x_u x_u^T + λ n_{θ_v} I)^{-1} Σ_{u: r_{uv}≠0} r_{uv} x_u    (Equation 2)
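To make the alternation concrete, here is a minimal NumPy sketch of one half of an ALS sweep (solving all x_u with Θ fixed, per Equation 1). cuMF's real solver is a batched CUDA implementation; the per-user loop and function name below are only illustrative:

```python
import numpy as np

def als_update_x(R, Theta, lam):
    """One half of an ALS sweep: solve every x_u with Theta fixed.

    R     : (m, n) ratings matrix; zeros mark unobserved entries.
    Theta : (f, n) item-factor matrix (columns are the theta_v).
    lam   : regularization weight lambda.
    Returns X as an (m, f) matrix of user factors.
    """
    m, _ = R.shape
    f = Theta.shape[0]
    X = np.zeros((m, f))
    for u in range(m):
        rated = np.nonzero(R[u])[0]        # items v with r_uv != 0
        Tv = Theta[:, rated]               # (f, n_xu) columns theta_v
        # Normal equations with weighted-lambda regularization:
        A = Tv @ Tv.T + lam * len(rated) * np.eye(f)
        b = Tv @ R[u, rated]
        X[u] = np.linalg.solve(A, b)
    return X
```

The other half is symmetric: `als_update_x(R.T, X.T, lam)` returns the θ_v as rows, and the two updates alternate until the cost converges.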

B: What are some of the biggest challenges in this project?

W: Recently, the GPU has emerged as an accelerator for parallel algorithms. It has massive compute power (typically 10x the floating-point operations per second (FLOP/s) of a CPU) and memory bandwidth (typically 5x that of a CPU), but limited control logic and memory capacity. In particular, the GPU's success in deep learning inspired us to try GPUs for MF. In deep learning, the computation is mainly dense matrix multiplication, which is compute bound. As a result, a GPU can train deep neural networks 10x as fast as a CPU by saturating its FLOP/s. Unlike deep learning, however, an MF problem involves sparse matrix manipulation, which is usually memory bound. Given this, we wanted to explore an MF algorithm and a system that can still leverage the GPU's compute and memory capabilities.

Given the widespread use of MF, a scalable and speedy implementation is very important. There are two challenges:

On a single GPU, MF deals with sparse matrices, which makes it difficult to utilize the GPU’s compute power.

We need to scale to multiple GPUs to deal with large, industry-scale problems (hundreds of millions of non-zero ratings).

B: What NVIDIA technologies are you using to overcome the challenges?

W: We developed cuMF, a CUDA-based matrix factorization library that optimizes the alternating least squares (ALS) method to solve very large-scale MF. cuMF uses CUDA 7.0+, cuBLAS, and cuSPARSE, and has been tested on both Kepler (Tesla K40/K80) and Maxwell (Titan X) GPUs.

cuMF achieves excellent scalability and performance by innovatively applying the following techniques on GPUs:

(1) On a single GPU, we optimize memory access in ALS by various techniques including reducing discontiguous memory access, retaining hotspot variables in faster memory, and aggressively using registers. These techniques allow cuMF to get closer to the roofline performance of a single GPU. See Figure 2.

Figure 2. Memory optimization in cuMF. To solve X using Θ, cuMF needs to obtain the sum from Equation 1 for different x_us. cuMF reads θ_v through texture memory, so that θ_vs used by different users can be cached and reused. θ_v is loaded into shared memory, and the partial summation result of Equation 1 is tiled and stored in registers until the summation over all rated items is done.

(2) In distributed machine learning, model parallelism and data parallelism are two common schemes. Model parallelism partitions the model parameters among multiple learners, with each learner updating a subset of the parameters. Data parallelism partitions the training data among multiple learners, so that each learner updates all parameters from its partial observation of the data. These two schemes can be combined when both the number of model parameters and the amount of training data are large.

Model parallelism. As Equation 1 shows, the solution of each x_u is independent of the others, so model parallelism is straightforward.

Data parallelism. To tackle large-scale problems, on top of the existing model parallelism we designed a data-parallel approach. When Θ is big and cannot fit on one GPU, we rewrite the summation term in Equation 1 in its data-parallel form:

Σ_{v: r_{uv}≠0} θ_v θ_v^T = Σ_{i=1}^{p} (Σ_{v in partition i, r_{uv}≠0} θ_v θ_v^T)

That is, instead of transferring all θ_vs to one GPU to calculate the sum, each GPU calculates a local sum using only its local θ_vs, and the local sums are reduced (aggregated) later (see Figure 3).

Figure 3. Parallelism on multiple GPUs. Θ^T is partitioned evenly and vertically, and stored on p GPUs (p=4 in the figure). X is partitioned evenly and horizontally, and solved in batches, achieving model parallelism. Each X batch is solved in parallel on the p GPUs, each holding its partition of Θ^T, achieving data parallelism.
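As a small (hypothetical) NumPy illustration of this data-parallel rewrite: splitting Θ's columns across p partitions, computing a local Gram sum per partition, and then reducing the locals reproduces the global sum exactly.

```python
import numpy as np

def partitioned_gram(Theta, p):
    """Compute sum_v theta_v theta_v^T as p local sums plus a reduce.

    Theta is (f, n); its columns are split evenly across p partitions,
    mimicking Theta^T stored vertically across p GPUs.
    """
    parts = np.array_split(Theta, p, axis=1)   # one column block per 'GPU'
    local_sums = [Ti @ Ti.T for Ti in parts]   # local sum on each device
    return sum(local_sums)                     # the reduce (aggregate) step
```

Each local sum is a small f-by-f matrix, so only f^2 values per GPU need to cross the interconnect, rather than the θ_vs themselves.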

(3) We also developed an innovative topology-aware, parallel reduction method to fully leverage the bandwidth between GPUs. This way, cuMF ensures that multiple GPUs are efficiently utilized simultaneously.
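cuMF's actual reduction is written against the real GPU interconnect topology; as a plain sketch of why a parallel (tree-shaped) reduction helps, pairing partial results cuts the dependent steps from p−1 in a serial chain to about log2(p) rounds, and combines within a round can use independent links concurrently:

```python
def tree_reduce(blocks, combine=lambda a, b: a + b):
    """Pairwise (tree) reduction of p partial results.

    A serial chain needs p-1 dependent steps; pairing blocks each
    round needs only ceil(log2 p) rounds, and within a round the
    combines could run over independent GPU-to-GPU links in parallel.
    """
    blocks = list(blocks)
    while len(blocks) > 1:
        nxt = [combine(blocks[i], blocks[i + 1])
               for i in range(0, len(blocks) - 1, 2)]
        if len(blocks) % 2:          # odd block carries over to next round
            nxt.append(blocks[-1])
        blocks = nxt
    return blocks[0]
```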

The experimental results demonstrate that with up to four Titan X GPUs on one machine, cuMF is (1) competitive compared with multi-core methods, on medium-sized problems; (2) much faster than vanilla GPU implementations without memory optimization; (3) 6 to 10 times as fast, and 33 to 100 times as cost-efficient as distributed CPU systems, on large-scale problems; (4) more significantly, able to solve the largest matrix factorization problem ever reported.

Table 1. Speed and cost of cuMF on one machine with four GPUs, compared with three different CPU systems in the cloud. Note: experiment details are in our paper [2].

Figure 4. cuMF as a Spark accelerator.

CuMF can be used standalone, or to accelerate the ALS implementation in Spark MLlib. We modified Spark's ml/recommendation/als.scala (code) to detect GPUs and offload the forming and solving of the ALS equations to GPUs, while retaining the shuffling on Spark RDDs.

This approach has several advantages. First, existing Spark applications relying on mllib/ALS need no change. Second, we leverage the best of Spark (scaling out to multiple nodes) and of GPUs (scaling up within one node).

B: What is the impact of your research to the larger data science community?

W: First, by using cuMF for matrix factorization, a large category of workloads that derive latent features from observations can be accelerated. These workloads include recommendation, topic modeling and word embedding.

Second, cuMF sheds light on accelerating machine learning algorithms involving sparse and graph data. We are glad to see that NVIDIA is also moving toward this direction by releasing nvGRAPH in the forthcoming CUDA 8.

B: When and why did you start looking at using NVIDIA GPUs?

W: We started looking at using NVIDIA GPUs to solve this problem in late 2014. We thought NVIDIA GPUs' massive floating-point throughput and large memory bandwidth would help accelerate MF.

B: What is your GPU computing experience, and how have GPUs impacted your research?

W: GPUs have helped make cuMF the state-of-the-art MF solution in terms of speed. Our research results will be published at HPDC [2], a premier high-performance computing conference.

B: What’s next for cuMF?

W: We have open-sourced cuMF. In future work we plan to make cuMF more user-friendly by providing a Python interface. We also want to further enhance it by using the latest hardware and software features such as NVLink, FP16 and NCCL.

B: In terms of technology advances, what are you looking forward to in the next five years?

W: In my personal opinion, in the next five years HPC technology is going to impact many aspects of machine learning. HPC has enabled and will keep enabling much more sophisticated machine learning, dealing with even bigger data, in data centers and on devices.

About Brad Nemire
Brad Nemire is on the Accelerated Computing Developer Marketing team and loves reading about all of the fascinating research being done by developers using NVIDIA GPUs. Reach out to Brad on Twitter @BradNemire and let him know how you're using GPUs to accelerate your research. Brad graduated from San Diego State University and currently resides in San Jose, CA.

I looked at the code and am wondering how to choose the optimal number for iterations of conjugate gradient. I see that the default is 6. Is there any experimentation or guidance for how many is best?

It would seem to me that a CG iteration that gives a relative reduction in the residual that is much less than the first reduction for the next ‘side’ is likely to get wiped out. Therefore it might make sense to stop CG iterations when the fractional reduction in residual norm is less than that of the last one done for the other side.

On the other hand, I’m guessing that it is more computationally-efficient to run multiple iterations on the same matrix than switching between them.

Also, minor edit, I think: “where n_{x_u} and n_{theta_v} denote the number of total ratings on user u and item v, respectively.”

Thanks!

Wei Tan

Thanks for the comment, John, and apologies for the late reply.
As for CG iterations, we tested on three data sets (Netflix, YahooMusic and Hugewiki) and found that <=10 iterations is good enough.
I am not sure what you mean by "a CG iteration that gives a relative reduction in the residual that is much less than the first reduction for the next 'side' is likely to get wiped out." Could you explain more?

"I'm guessing that it is more computationally-efficient to run multiple iterations on the same matrix than switching between them." — solving one matrix at a time cannot saturate the GPU, so we run batch solvers. At the end of the day, the CG kernel is memory-bandwidth bound and its computation is very light (vector inner products and vector-times-scalar operations).