3 Objectives Programmer productivity Encourage generic programmingRapidly develop complex applicationsLeverage parallel primitivesEncourage generic programmingDon’t reinvent the wheelE.g. one reduction to rule them allHigh performance With minimal programmer effortInteroperabilityIntegrates with CUDA C/C++ codeWe should encourage programmers to think in terms of parallel primitives instead of ad hoc kernels.Many operations can be efficiently implemented with combinations of reduce(), scan(), sort(), etc.Thrust is built on top of CUDA C/C++ and is interoperable with it. Existing CUDA programs can make use of Thrust without significant effort. Similarly, Thrust programs can utilize CUDA C/C++ where necessary (e.g. specialized algorithms).

4 What is Thrust? C++ template library for CUDA Containers AlgorithmsMimics Standard Template Library (STL)Containersthrust::host_vector<T>thrust::device_vector<T>Algorithmsthrust::sort()thrust::reduce()thrust::inclusive_scan()Etc.Thrust mimics the STL and implements many of the same algorithms. When we can’t offer the same exact semantics, we choose a different name. For example, std::accumulate() is similar to thrust::reduce(), but the former guarantees sequential operation while the latter does not.

6 Containers Compatible with STL containers Eases integrationvector, list, map, ...// list container on hoststd::list<int> h_list;h_list.push_back(13);h_list.push_back(27);// copy list to device vectorthrust::device_vector<int> d_vec(h_list.size());thrust::copy(h_list.begin(), h_list.end(), d_vec.begin());// alternative methodthrust::device_vector<int> d_vec(h_list.begin(), h_list.end());Thrust containers are compatible with STL containers such as vector, list, map, etc.One can first allocate a Thrust vector and then copy() the data, or use the vector’s constructor to initialize the vector in one step.Note that Thrust handles these operations in an optimized manner.For example, copying a host list<T> to a device_vector<T> is implementing by first copying the list<T> to a temporary host_vector<T> and then copying the host_vector<T> to a device_vector<T>.The only exception to this is where a STL container is initialized with a device_vector since we can’t intercept this usage. This operation does work, but it will result in one cudaMemcpy() for each element of the sequence, as opposed to a single cudaMemcpy() for the whole vector.

10 Iterators Track memory space (host/device) Guides algorithm dispatch// initialize random values on hostthrust::host_vector<int> h_vec(1000);thrust::generate(h_vec.begin(), h_vec.end(), rand);// copy values to devicethrust::device_vector<int> d_vec = h_vec;// compute sum on hostint h_sum = thrust::reduce(h_vec.begin(), h_vec.end());// compute sum on deviceint d_sum = thrust::reduce(d_vec.begin(), d_vec.end());We borrow the STL concept of iterators.A vector iterator can be thought of as a normal C pointer with some meta-information.Thrust iterators are just like STL vector iterators except they also have a notion of memory space (i.e. whether they live on the host or device).When a Thrust algorithm is invoked it looks at the memory space to decide whether to dispatch the host or device implementation of the algorithm.

11 Iterators Convertible to raw pointers // allocate device vectorthrust::device_vector<int> d_vec(4);// obtain raw pointer to device vector’s memoryint * ptr = thrust::raw_pointer_cast(&d_vec[0]);// use ptr in a CUDA C kernelmy_kernel<<<N/256, 256>>>(N, ptr);// Note: ptr cannot be dereferenced on the host!This is similar to how one would obtain a raw pointer from a STL vectorNote that raw pointers don’t know where they live, so it does not make sense to dereference a device pointer on the host (or vice-versa).

18 Algorithms Thrust provides many standard algorithmsTransformationsReductionsPrefix SumsSortingGeneric definitionsGeneral TypesBuilt-in types (int, float, …)User-defined structuresGeneral Operatorsreduce with plus operatorscan with maximum operatorThrust provides many standard data parallel primitives.Other functions, like thrust::unique(), which implements stream compaction, are constructed with one or more such primitives.Thrust algorithms are generic, meaning that they support general data types and operations.For example, reduce() makes sense as long as the reduction operation (op(a,b) -> c) is associative and commutative. The plus and maximum operations are two common examples.

19 Algorithms General types and operators#include <thrust/reduce.h>// declare storagedevice_vector<int> i_vec = ...device_vector<float> f_vec = ...// sum of integers (equivalent calls)reduce(i_vec.begin(), i_vec.end());reduce(i_vec.begin(), i_vec.end(), 0, plus<int>());// sum of floats (equivalent calls)reduce(f_vec.begin(), f_vec.end());reduce(f_vec.begin(), f_vec.end(), 0.0f, plus<float>());// maximum of integersreduce(i_vec.begin(), i_vec.end(), 0, maximum<int>());Here we have two device_vectors, one filled with integers and another with floating point values.We can use the same generic reduce() function to sum ints or floats, or compute a maximum() reduction over the sequence.When reduce() is called with just two arguments it assumes that the user wanted compute the sum of the sequence.This follows the behavior of the accumulate() function in the STL.The third parameter is the initial value of the sum, which often is simply the value zero.The fourth parameter is the binary operation used to “reduce” the sequence.

23 Recap Algorithms Generic Statically dispatched based on iterator typeSupport general types and operatorsStatically dispatched based on iterator typeMemory space is known at compile timeHave default argumentsreduce(begin, end)reduce(begin, end, init, binary_op)By now the audience shouldBe comfortable with the concepts of containers & iteratorsUnderstand how algorithms use iterators to dispatch the appropriate implementationUnderstand what it means for an algorithm to be generic

24 Fancy Iterators Behave like “normal” iterators ExamplesAlgorithms don't know the differenceExamplesconstant_iteratorcounting_iteratortransform_iteratorpermutation_iteratorzip_iterator“Fancy” iterators are things like look and feel just like normal iterators (or pointers), but do something more interesting.We can use fancy iterators in our usual algorithms to do things more easily and efficiently.Note: these iterators will appear in Thrust v1.1

25 Fancy Iterators constant_iterator A A A A AMimics an infinite array filled with a constant value// create iteratorsconstant_iterator<int> begin(10);constant_iterator<int> end = begin + 3;begin[0] // returns 10begin[1] // returns 10begin[100] // returns 10// sum of [begin, end)reduce(begin, end); // returns 30 (i.e. 3 * 10)constant_iterator is like a virtual infinite array filled with a constant value.When we dereference a constant_iterator we receive the same value, no matter what.Note that *there is no array* anywhere in memory!A more practical example of constant_iterator would be something likeconstant_iterator<int> const_iter(1);transform(vec.begin(), vec.end(), const_iter, vec.begin(), plus<T>())which adds the value 1 to every element of the vector named vec.AAAAA

26 Fancy Iterators counting_iterator 1 2 3Mimics an infinite array with sequential values// create iteratorscounting_iterator<int> begin(10);counting_iterator<int> end = begin + 3;begin[0] // returns 10begin[1] // returns 11begin[100] // returns 110// sum of [begin, end)reduce(begin, end); // returns 33 (i.e )counting_iterator is similar to constant_iterator in that it behaves like an infinite array.When dereferenced, counting_iterator returns an element from predefined sequence of values. Here we’ve started the sequence at the value 10 so first[0] returns 10 while first[n] returns 10 + n.counting_iterator is especially useful when one wants to generate tuples of the form (vec[i], i). Such tuples arise when we want to compute the *index* of the minimum or maximum element in a sequence.123

27 F( x ) F( ) F( ) F( ) Fancy Iterators transform_iterator X Y Z X Y ZYields a transformed sequenceFacilitates kernel fusionF( x )A transform_iterator has two parts, a transformation (denoted f(x)) and an underlying sequence (denoted [x,y,z]).When we access a tranform_iterator we get f(vec[i]) rather than vec[i].transform_iterator facilitates kernel fusion. For example, if we want to compute the sum of squares (x*x) of the values in a sequence, then would be very wasteful to first compute the squares, write them to a separate range, and then read the squared values back again inside reduce().F( )F( )F( )XYZXYZ

29 Fancy Iterators zip_iterator A B C A X B Y C Z X Y ZLooks like an array of structs (AoS)Stored in structure of arrays (SoA)ABCzip_iterator “zips” two or more sequences together.When the zip_iterator is dereferenced, it returns a tuple() containing (two or more) elements at the same position in their respective sequences.Here the seuqences [A,B,C] and [X,Y,Z] are zipped into a virtual sequence [tuple(A,X), tuple(B,Y), tuple(C,Z)].This is important because storing the data in separate sequences (so-called “structure of arrays”) is generally better for performance than the alternative (so-called “array of structs”).If we were to store these tuples in an actual array, then when we loaded them from memory we’d suffer a performance penalty due to uncoalesed memory accesses.zip_iterator let’s us imagine that we’re processing a AoS when we’ve really stored things in a SoA.AXBYCZXYZ

34 Fusion Optimized implementation (3.8x faster)// define transformation f(x) -> x^2struct square{__host__ __device__float operator()(float x)return x * x;}};float snrm2_fast(device_vector<float>& x)// with fusionreturn sqrt( transform_reduce(x.begin(), x.end(),square(), 0.0f, plus<float>());This is an example of kernel fusion, where the square() transformation is fused with the reduction operation.Although we could have implemented this with transform_iterator, we use the transform_reduce function for convenience.Note that transform_reduce is implemented with transform_iterator and reduce().Note that this implementation provides excellent performance without much effort.