
Design Doc: Supporting new Device/Library

Background

Deep learning has a high demand for computing resources. New high-performance devices and computing libraries appear frequently, and deep learning frameworks have to integrate them in a flexible and efficient manner.

On one hand, hardware and computing libraries usually do not have a one-to-one correspondence. For example, Intel CPUs support both the Eigen and MKL computing libraries, while Nvidia GPUs support both the Eigen and cuDNN computing libraries. We therefore have to implement operator-specific kernels for each computing library.
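
As a concrete illustration, a single conv2d operator ends up with one kernel per (device, library) pair rather than one kernel per device. The enum below is a sketch only; its name and values mirror Fluid's naming style but are assumptions here:

enum class LibraryType { kPlain /* Eigen */, kMKLDNN, kCUDNN };

// Hypothetical kernel pairing for conv2d:
//   conv2d + CPUPlace  + kPlain   -> Eigen CPU kernel
//   conv2d + CPUPlace  + kMKLDNN  -> MKL-DNN kernel
//   conv2d + CUDAPlace + kPlain   -> Eigen GPU kernel
//   conv2d + CUDAPlace + kCUDNN   -> cuDNN kernel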

On the other hand, users usually do not want to care about the low-level hardware and computing libraries when writing a neural network configuration. In Fluid, Layer is exposed in Python, and Operator is exposed in C++. Both Layer and Operator are hardware independent.

Place and DeviceContext

Please note that devices and computing libraries do not correspond one-to-one: a device can support multiple computing libraries, and a computing library can also support several devices.

Place

Fluid uses the class Place to represent the device memory where data is located. If we add another device, we have to add the corresponding DevicePlace.

        | CPUPlace
Place --| CUDAPlace
        | FPGAPlace

And Place is defined as follows:

typedef boost::variant<CUDAPlace, CPUPlace, FPGAPlace> Place;
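
Because Place is a boost::variant, code that must branch on the concrete place can use a static visitor. Below is a minimal, self-contained sketch; the place structs are simplified stand-ins (for instance, the real CUDAPlace also stores a device id):

#include <iostream>
#include <string>
#include <boost/variant.hpp>

struct CPUPlace {};
struct CUDAPlace { int device = 0; };  // simplified stand-in
struct FPGAPlace {};

typedef boost::variant<CUDAPlace, CPUPlace, FPGAPlace> Place;

/* The visitor dispatches on whichever alternative the variant holds. */
struct PlaceName : public boost::static_visitor<std::string> {
  std::string operator()(const CPUPlace&) const { return "CPUPlace"; }
  std::string operator()(const CUDAPlace& p) const {
    return "CUDAPlace(" + std::to_string(p.device) + ")";
  }
  std::string operator()(const FPGAPlace&) const { return "FPGAPlace"; }
};

int main() {
  Place place = CUDAPlace{1};
  std::cout << boost::apply_visitor(PlaceName(), place) << std::endl;  // CUDAPlace(1)
  return 0;
}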

DeviceContext

Fluid uses the class DeviceContext to manage the resources of different libraries, such as the CUDA stream in CUDADeviceContext. There are also inheritance relationships between the different kinds of DeviceContext.
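
A minimal sketch of that hierarchy, assuming the standard CUDA runtime API (the real Fluid classes hold more state, such as cuBLAS/cuDNN handles and an Eigen device):

#include <cuda_runtime.h>

class DeviceContext {
 public:
  virtual ~DeviceContext() {}
  /* Block until all work queued on this device has finished. */
  virtual void Wait() const {}
};

class CPUDeviceContext : public DeviceContext {
  // CPU execution is synchronous, so Wait() has nothing to do.
};

class CUDADeviceContext : public DeviceContext {
 public:
  CUDADeviceContext() { cudaStreamCreate(&stream_); }
  ~CUDADeviceContext() override { cudaStreamDestroy(stream_); }
  void Wait() const override { cudaStreamSynchronize(stream_); }
  cudaStream_t stream() const { return stream_; }

 private:
  cudaStream_t stream_;  // every kernel of this context launches on this stream
};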

Tensor

class Tensor {
 public:
  /*! Return a pointer to mutable memory block. */
  template <typename T>
  inline T* data();

  /**
   * @brief   Return a pointer to mutable memory block.
   * @note    If not exist, then allocation.
   */
  template <typename T>
  inline T* mutable_data(platform::Place place);

  /**
   * @brief     Return a pointer to mutable memory block.
   *
   * @param[in] dims    The dimensions of the memory block.
   * @param[in] place   The place of the memory block.
   *
   * @note      If not exist, then allocation.
   */
  template <typename T>
  inline T* mutable_data(DDim dims, platform::Place place);

  /*! Resize the dimensions of the memory block. */
  inline Tensor& Resize(const DDim& dims);

  /*! Return the dimensions of the memory block. */
  inline const DDim& dims() const;

 private:
  /*! holds the memory block if allocated. */
  std::shared_ptr<Placeholder> holder_;

  /*! points to dimensions of memory block. */
  DDim dim_;
};

Placeholder is used to delay memory allocation; that is, we can first define a tensor, use Resize to configure its shape, and then call mutable_data to allocate the actual memory.
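
The resulting usage pattern looks roughly like this (make_ddim is Fluid's helper for constructing a DDim; treat the exact namespaces and calls as assumptions):

Tensor t;
t.Resize(make_ddim({2, 3}));                  // record the shape; no memory is touched yet
float* buf = t.mutable_data<float>(platform::CPUPlace());  // allocation happens here
const DDim& shape = t.dims();                 // {2, 3}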

Advanced topics: How to switch between different Devices/Libraries

Generally, we will implement an OpKernel for every Device/Library combination of an Operator, so we can easily train a convolutional neural network on a GPU. However, some OpKernels are not suitable for a particular device. For example, the crf operator can only run on the CPU, whereas most other operators can run on the GPU. To achieve high performance in such circumstances, we have to switch between different Devices/Libraries.
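
A hypothetical sketch of such a fallback: if no kernel is registered for the preferred place, the framework selects the CPU kernel instead (and must then copy the inputs to host memory). Fluid's real registry keys kernels by a richer OpKernelType; every name below is illustrative only:

#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

enum class PlaceKind { kCPU, kCUDA };

/* Illustrative kernel signature; the real one takes an ExecutionContext. */
using OpKernelFn = std::function<void()>;

/* Registry: (operator type, place) -> kernel. */
std::map<std::pair<std::string, PlaceKind>, OpKernelFn> g_kernels;

OpKernelFn SelectKernel(const std::string& op, PlaceKind preferred) {
  auto it = g_kernels.find({op, preferred});
  if (it != g_kernels.end()) return it->second;
  // No kernel for the preferred place (e.g. crf on CUDA): fall back to CPU.
  it = g_kernels.find({op, PlaceKind::kCPU});
  if (it == g_kernels.end()) throw std::runtime_error("no kernel for " + op);
  return it->second;
}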